Currently, I'm reviewing a single document which is 4,377 pages long. I'm guessing that, if duplicates were removed, it would be 1,000 pages or fewer.
How can I remove duplicates?
The files are usually .pdf, but I can convert them to .txt or Word.
I've seen people on the forums talking about scripts that can remove duplicate lines in a file (i.e., keep the first instance of a line and delete all subsequent instances of that identical line), but that wouldn't work here.
For example, if I had the following two emails in a document:

Dear Joe,
Yes, I will meet you Friday.
Sincerely,
Bob

and

Dear Joe,
I will never do business with you.
Regards,
Bob

A line-duplication script would delete "Dear Joe" and "Bob" in both emails (since those lines are identical), even though the bodies of the emails ("meet you Friday" versus "never do business") are different.
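To make the limitation concrete, here is a rough sketch of the line-by-line approach described above (in Python rather than AHK, just to illustrate the logic; the function name `dedupe_lines` is mine). It keeps only the first occurrence of each exact line, which is exactly why it would wrongly strip "Dear Joe," and "Bob" out of the second email:

```python
def dedupe_lines(text: str) -> str:
    """Keep the first occurrence of each exact line; drop later repeats."""
    seen = set()
    out = []
    for line in text.splitlines():
        if line not in seen:
            seen.add(line)
            out.append(line)
    return "\n".join(out)
```

Feeding both example emails through this would merge them into one mangled email, since the salutation and signature lines only survive once.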
So the question is whether it's possible for a script to compare larger portions of the text. For example, a script could loop through a file comparing 5 lines of text at a time, which would be more likely to catch an identical email that appears multiple times. If an email with 10 lines of text in it is duplicated multiple times in a single file, the script would delete the first 5 lines of the duplicates on one pass, and then delete the second 5 lines of the duplicates on a later pass.
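A simpler variant of the multi-line comparison idea is to compare whole blocks rather than fixed 5-line windows: if the emails are separated by blank lines (that separator is my assumption, not something stated in the question), the file can be split into blank-line-delimited blocks and each whole block de-duplicated. A sketch in Python (the name `dedupe_blocks` is mine):

```python
import re

def dedupe_blocks(text: str) -> str:
    """Remove duplicate blank-line-separated blocks (e.g. whole emails),
    keeping the first occurrence of each block."""
    # Assumes each email/block is separated by one or more blank lines.
    blocks = re.split(r"\n\s*\n", text.strip())
    seen = set()
    out = []
    for block in blocks:
        key = block.strip()
        if key not in seen:
            seen.add(key)
            out.append(block)
    return "\n\n".join(out)
```

Because each block is compared as a unit, the two example emails above would both survive, while an exact repeat of either one would be dropped. The same logic could be ported to an AHK script if the blocks can be reliably delimited.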
I don't know whether this is possible with AHK.