How to Remove Duplicate Lines from a File

Get help with using AutoHotkey (v1.1 and older) and its commands and hotkeys
garry
Posts: 3740
Joined: 22 Dec 2013, 12:50

Re: How to Remove Duplicate Lines from a File

Post by garry » 18 May 2022, 03:02

Thank you for the examples.
It was strange: I did not see lines in the output if they contain two commas ( , ), e.g. "Date: Fri, Jan 3, 2020 at 11:59 AM" and "CC: ... , , ".
EDIT: my example was wrong, so I removed it.

;- script modified in case1, because these lines were not found when reading from a file:
;Date: Fri, Jan 3, 2020 at 11:00 AM
;Date: Fri, Jan 3, 2020 at 11:58 AM
;Cc: kelly_DeSilvaGroup.com <kelly_ajogroup.com>, , Bill

EDIT: I tried this:

Code: Select all

;-------- saved at 2022-05-19  08:07 UTC --------------
;- How to Remove Duplicate Lines from a File - Page 2 
;- https://www.autohotkey.com/boards/viewtopic.php?f=76&t=104148&start=20
;- example to remove double lines 
;-
;- script modified in case1, because these lines were not found when reading from a file:
;Date: Fri, Jan 3, 2020 at 11:00 AM
;Date: Fri, Jan 3, 2020 at 11:58 AM
;Cc: [email protected] <[email protected]>, , Bill
;-
#MaxMem 4095
transform,s,chr,32
transform,v,chr,127
;-
;-----------
goto,case1     ;-- case1=readfile   /  case2=test
;-----------
;-
case1:
F1:=a_scriptdir "\sampletext2.txt"                     ;- this source-text and this ahk-script is saved in notepad as 'UTF-8 with BOM' 
F2:=a_scriptdir "\" . a_now . "_sampletext2_new.txt"
Out:="" 
Obj := FileOpen(F1, "r", "UTF-8")   ;- encoding must be a quoted string; unquoted UTF-8 is read as the expression UTF minus 8
var := Obj.Read()
stringreplace,var,var,`,,%s%%v%,all
Obj.Close()
;------
Loop,parse,var,`n,`r
 {
 x:= A_LoopField
 x=%x%
 if x=
   continue
 If out not contains %x%
   out .= x . "`r`n"
 }
stringreplace,out,out,%s%%v%,`,,all 
;------
ifnotexist,%f2%
 {
 FileOpen(F2, "w", "UTF-8").Write(out)
 try
 run,%f2%
 }
out=
x= 
exitapp
;---------------------------------------------
esc::exitapp
;---------------------------------------------
;-
case2:
;- example to remove double lines 
;-
out:="" 
gosub,testtext
Loop,parse,var,`n,`r
 {
 x:= A_LoopField
 x=%x%
 if x=
   continue
 If out not contains %x%
   out .= x . "`r`n"
 }
msgbox,%out%
out=
x= 
exitapp
;================
;----------------
testtext:
var=
(
          From: Joe Smith <[email protected]> 
          Date: Fri, Jan 3, 2020 at 11:59 AM  
          Subject: Re: PJ160605 237 23rd Ave - site visit  
          To: Bob <[email protected]>  
          Cc: [email protected] <[email protected]>, , Bill 
		  
		  
Dear Joe,     
Yes, I will meet you Friday.
Sincerely me,
Sincerely me,
Bob1 line 
Dear Joe,
I will never do business with you.
Regards you,
Regards you,
Bob2 line
)
return
;============================================================================
EDIT : added convert date and time ... 'Date: Fri, Jan 3, 2020 at 11:59 PM' > 20200103235900

Code: Select all

;- modified = 20220520
;- example try to remove doublelines and convertdatetime
;-
#MaxMem 4095
transform,s,chr,32
transform,v,chr,127
;-
F2:=a_desktop . "\" . a_now . "_result_text.txt"
out:="" 
gosub,testtext
stringreplace,var,var,`,,%s%%v%,all
Loop,parse,var,`n,`r
 {
 x:= A_LoopField
 x=%x%                     ;- trim first, so whitespace-only lines are skipped too
 if x=
   continue
 If out not contains %x%
   {
   stringmid,x1,x,1,5
   if (x1="Date:")                 ;- convert 'Date: Fri, Jan 3, 2020 at 11:59 PM' > 20200103235900
     {
	 stringmid,x2,x,6,40
	 stringreplace,x2,x2,%s%%v%,`,,all
	 timex:=DateParse(x2)
	 timex2:=timex . "00"
	 FormatTime,TS,%timex2% L0x0804, dddd MMMM yyyy-MM-dd  HH:mm  ;
	 x:="Date:=" . TS
	 x1:=""
	 }
   out .= x . "`r`n"
   }
 }
stringreplace,out,out,%s%%v%,`,,all 
ifnotexist,%f2%
 {
 FileOpen(F2, "w", "UTF-8").Write(out)
 try
   run,%f2%
 }
out=
x= 
exitapp
;================
;----------------
testtext:
var:="
(Ltrim join`r`n
          From: Joe Smith <[email protected]> 
		  ---------------
          Date: Fri, Jan 3, 2020 at 11:59 AM  
		  ---
          Subject: Re: PJ160605 237 23rd Ave - site visit  
          To: Bob <[email protected]>  
          Cc: [email protected] <[email protected]>, , Bill 
----		  
Date: Sat, Jan 4, 2020 at 11:59 PM  
-----

Dear Joe1,     
Yes, I will meet %you Friday.
Dear Joe1,     
Sincerely me,
Sincerely me,
Bob1 line 
Dear Joe2,
I will never do business with you.
Dear Joe2,     
Regards you,
Regards you,
Bob2 line
)"
return
;============================================================================

;time:=DateParse("1/2/2020 9:45 PM")   ;- user 'polyethene'
;msgbox,%time%
;return
;- https://www.autohotkey.com/board/topic/18760-date-parser-convert-any-date-format-to-yyyymmddhh24miss/
;-----------------------------------------------------------------------------------------------
DateParse(str, americanOrder=0) {
	static monthNames := "(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\w*"
		, dayAndMonth := "(?:(\d{1,2}|" . monthNames . ")[\s\.\-\/,]+)?(\d{1,2}|" . monthNames . ")"
	If RegExMatch(str, "i)^\s*(?:(\d{4})([\s\-:\/])(\d{1,2})\2(\d{1,2}))?"
		. "(?:\s*[T\s](\d{1,2})([\s\-:\/])(\d{1,2})(?:\6(\d{1,2})\s*(?:(Z)|(\+|\-)?"
		. "(\d{1,2})\6(\d{1,2})(?:\6(\d{1,2}))?)?)?)?\s*$", i) ;ISO 8601 timestamps
		year := i1, month := i3, day := i4, t1 := i5, t2 := i7, t3 := i8
	Else If !RegExMatch(str, "^\W*(\d{1,2}+)(\d{2})\W*$", t){
		RegExMatch(str, "i)(\d{1,2})"					;hours
				. "\s*:\s*(\d{1,2})"				;minutes
				. "(?:\s*:\s*(\d{1,2}))?"			;seconds
				. "(?:\s*([ap]m))?", t)				;am/pm
		StringReplace, str, str, %t%
		If Regexmatch(str, "i)(\d{4})[\s\.\-\/,]+" . dayAndMonth, d) ;2004/22/03
			year := d1, month := d3, day := d2
		Else If Regexmatch(str, "i)" . dayAndMonth . "[\s\.\-\/,]+(\d{2,4})", d)  ;22/03/2004 or 22/03/04
			year := d3, month := d2, day := d1
		If (RegExMatch(day, monthNames) or americanOrder and !RegExMatch(month, monthNames)) ;try to infer day/month order
			tmp := month, month := day, day := tmp
	}
	f = %A_FormatFloat%
	SetFormat, Float, 02.0
	d := (StrLen(year) == 2 ? "20" . year : (year ? year : A_YYYY))
		. ((month := month + 0 ? month : InStr(monthNames, SubStr(month, 1, 3)) // 4 ) > 0 ? month + 0.0 : A_MM)
		. ((day += 0.0) ? day : A_DD) 
		. t1 + (t1 == 12 ? t4 = "am" ? -12.0 : 0.0 : t4 = "pm" ? 12.0 : 0.0)
		. t2 + 0.0 . t3 + 0.0
	SetFormat, Float, %f%
	return, d
}
;-------------------------------------------------------------------------------------------
/*
	Function: DateParse
		Converts almost any date format to a YYYYMMDDHH24MISS value.
	Parameters:
		str - a date/time stamp as a string
	Returns:
		A valid YYYYMMDDHH24MISS value which can be used by FormatTime, EnvAdd and other time commands.
	Example:
     time := DateParse("2:35 PM, 27 November, 2007")
	License:
		- Version 1.05 <http://www.autohotkey.net/~polyethene/#dateparse>
		- Dedicated to the public domain (CC0 1.0) <http://creativecommons.org/publicdomain/zero/1.0/>
*/

;============================================================================================
Last edited by garry on 20 May 2022, 06:25, edited 5 times in total.

AlFlo
Posts: 339
Joined: 29 Nov 2021, 21:46

Post by AlFlo » 18 May 2022, 13:07

Chunjee wrote:
17 May 2022, 22:40
Howdy; I wrote a library called string-similarity.ahk for scoring how similar strings are to each other. I wrapped your issue into a function maybe you can experiment with it.

Code: Select all

; requires https://www.npmjs.com/package/string-similarity.ahk
array := ["Peter, I need those TPS Reports!"
	, "Peter, I need those TPS rports"
	, "Peter, I need those TPS Reports"
	, "Milton, Are you working Tuesday?"
	, "Milton, Are you working Tuesday?"]
filteredArray := fn_removeSimilar(array, 0.90)
; => ["Milton, Are you working Tuesday?", "Peter, I need those TPS Reports"]


fn_removeSimilar(arr, threshold)
{
	outputArr := []
	loop, % arr.count() {
		value := arr.pop()
		scoredStrings := stringsimilarity.findBestMatch(value, arr)
		for key2, value2 in scoredStrings.ratings {
			if (value2.rating > threshold) {
				continue 2
			}
		}
		outputArr.push(value)
	}
	return outputArr
}
I'm not familiar with how big your database is but this will probably break if you throw too much data at it; that is more of a RAM/ahk constraint I would think. I usually deal with that by chunking the data into smaller workable bites (1000 e-mails at a time for example)
Chunjee, thank you for this fascinating approach. May I ask how I can modify the script so that it gets the arrays from my file ... i.e. specifying the filepath of the specific file and picking out arrays from that file?
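
The chunking idea from the quoted reply (working through, for example, 1000 e-mails at a time) can be sketched in plain AutoHotkey v1. The helper name chunk() and the sample data are made up for illustration and are not part of string-similarity.ahk. Note that duplicates that end up in different chunks would not be compared with each other, so this is a trade-off rather than a complete answer.

```autohotkey
; Sketch only: split an array into pieces of at most `size` items.
; chunk() is a hypothetical helper, not a function from any library in this thread.
mails := ["mail 1", "mail 2", "mail 3", "mail 4", "mail 5"]
parts := chunk(mails, 2)
MsgBox % "chunks: " parts.Count() "`nitems in last chunk: " parts[parts.Count()].Count()
ExitApp

chunk(arr, size) {
    chunks := []
    current := []
    for index, value in arr {
        current.Push(value)
        if (current.Count() = size) {   ; this piece is full, start a new one
            chunks.Push(current)
            current := []
        }
    }
    if (current.Count() > 0)            ; keep the last, partly filled piece
        chunks.Push(current)
    return chunks
}
```

With the five sample items and a size of 2, the message box should report 3 chunks with 1 item in the last one. Each piece could then be passed to fn_removeSimilar separately. Count() needs AutoHotkey v1.1.29 or later.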

Post by AlFlo » 18 May 2022, 16:34

I modified Garry's first script to add the following line of code:

Code: Select all

#MaxMem 4095
This boosts allowable memory allocation for each variable from 64 MB to 4,095 MB.

I'm currently running the script on my monster 4,377-page file (that's how it was given to me). It has been running for almost 24 hours. Hasn't crashed ... and when I check Task Manager, it has consistently shown that the script is using between 10% and 20% of my CPU. I think this is a good sign, as I assume it means it's chugging along.

(It's not really using much memory: averaging 30 MB at any given time)

Again, removing duplicates would be so valuable to me that I don't mind running a script for a long time ... if it ends up working.

I'll post an update when the script actually finishes ... or when it instead times out or crashes.

UPDATE: I tried Garry's first script (adding

Code: Select all

#MaxMem 4095
) with a file around half the length ... 2,300 pages. It completed in one minute! I haven't checked the result yet for accuracy. But this showed that the problem is not with the speed or efficiency of Garry's script, but with my monster file being too damn big (presumably the same with Mikeyww's script).

So now the game is to get accurate results.
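
One possible reason the run time grows so quickly with file size is that `If out not contains %x%` searches the whole, ever-growing output for every input line. A minimal sketch of an alternative, assuming that exact whole-line matching is what is wanted: remember each kept line as a key of an associative array and look it up there. The variable names and sample text are invented; object keys in AutoHotkey v1 compare case-insensitively, which matches the default behaviour of `contains`.

```autohotkey
; Sketch only: exact-line duplicate removal with an associative array (AutoHotkey v1.1+).
; The sample text is invented; in the scripts above it would be the file contents.
var := "Sincerely me,`r`nSincerely me,`r`nBob1 line`r`n   Sincerely me,`r`n`r`nBob2 line"

seen := {}                         ; keys are the lines already kept
out := ""
Loop, Parse, var, `n, `r
{
    line := Trim(A_LoopField)      ; drop leading and trailing spaces and tabs
    if (line = "")
        continue
    key := "k" . line              ; prefix so numeric-looking lines stay string keys
    if (seen.HasKey(key))
        continue                   ; an identical line was already kept
    seen[key] := true
    out .= line . "`r`n"
}
MsgBox % out
```

For the sample text this should leave three lines: "Sincerely me,", "Bob1 line" and "Bob2 line". Unlike `contains`, a line is only dropped when the complete line was seen before, and commas need no special handling.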
Last edited by AlFlo on 18 May 2022, 23:32, edited 2 times in total.

Post by AlFlo » 18 May 2022, 23:09

garry wrote:
18 May 2022, 03:02
Thank you for the examples.
It was strange: I did not see lines in the output if they contain two commas ( , ), e.g. "Date: Fri, Jan 3, 2020 at 11:59 AM" and "CC: ... , , ".
EDIT: my example was wrong, so I removed it. The date line now shows up, but duplicate lines are still not removed.
EDIT: I tried this:

Code: Select all

;- example to remove double lines 
out:="" 
gosub,testtext
Loop,parse,var,`n,`r
 {
 x:= A_LoopField
 x=%x%
 if x=
   continue
 If out not contains %x%
   out .= x . "`r`n"
 }
msgbox,%out%
out=
x= 
exitapp
;================
;----------------
testtext:
var=
(
          From: Joe Smith <[email protected]> 
          Date: Fri, Jan 3, 2020 at 11:59 AM  
          Subject: Re: PJ160605 237 23rd Ave - site visit  
          To: Bob <[email protected]>  
          Cc: [email protected] <[email protected]>, , Bill 
		  
		  
Dear Joe,     
Yes, I will meet you Friday.
Sincerely me,
Sincerely me,
Bob1 line 
Dear Joe,
I will never do business with you.
Regards you,
Regards you,
Bob2 line
)
return
;============================================================================

Garry, so is your revised script as follows:

Code: Select all

#MaxMem 4095

;- example to remove double lines 
;-
F1:=a_scriptdir "\pathname.txt"     ;- this source-text and this ahk-script is saved in notepad as 'UTF-8 with BOM' 
F2:=a_scriptdir "\pathname_new2.txt"
out:="" 
Obj := FileOpen(F1, "r", "UTF-8")    ;- the encoding must be a quoted string
var := Obj.Read()
Obj.Close()
Loop,parse,var,`n,`r
 {
 x:= A_LoopField
 x=%x%
 if x=
   continue
 If out not contains %x%
   out .= x . "`r`n"
 }
ifnotexist,%f2%
 {
 FileOpen(F2, "w", "UTF-8").Write(out)    ;- write the same variable that the loop filled
 try
 run,%f2%
 }
out=
x= 
exitapp
;=====================
?? Or did I get confused trying to integrate your revisions into your previous version?

Post by garry » 19 May 2022, 04:15

@AlFlo Sorry, I was confused; see my latest change on this page (page 2).
The script did not find the "Date:" and "Cc:" lines when they contain two commas ... and I have not tested everything.
I have 4 GB of RAM; your not-so-big text files worked without #MaxMem.

Chunjee
Posts: 1400
Joined: 18 Apr 2014, 19:05

Post by Chunjee » 19 May 2022, 16:27

AlFlo wrote:
18 May 2022, 13:07
Chunjee, thank you for this fascinating approach. May I ask how I can modify the script so that it gets the arrays from my file ... i.e. specifying the filepath of the specific file and picking out arrays from that file?
I thought you guys figured that out on page 1. You need a good separator for each e-mail or reply if they're all in one file.

With this regexp separator \s{10,100}\w{3}, \w{3}\s\d+,\s\d{4}\sat\s\d*\:\d*\s\w{2}\s it says there are 142 e-mails in your sample txt.
With a threshold of 0.98 (i.e. 98% similar, so close that it must be a duplicate) I brought it down to 111 e-mails. With a more aggressive threshold like 0.85 it reduces to 86 e-mails. I've attached both if you would like to review.

Code: Select all

; requires https://www.npmjs.com/package/string-similarity.ahk
A := new biga() ; requires https://www.npmjs.com/package/biga.ahk

FileRead, OutputVar, % A_ScriptDir "\Sample for Deleting Duplicate Lines.txt"
array := A.split(OutputVar, "/\s{10,100}\w{3}, \w{3}\s\d+,\s\d{4}\sat\s\d*\:\d*\s\w{2}\s/")
filteredArray := fn_removeSimilar(array, 0.98)
output := A.join(filteredArray, "`n")
FileAppend, %output%, % A_ScriptDir "\output.txt"
msgbox, % "done: " filteredArray.count() " vs " array.count()
return

fn_removeSimilar(inputArr, threshold)
{
	arr := inputArr.clone()
	outputArr := []

	loop, % arr.count() {
		value := arr.removeAt(1)
		scoredStrings := stringsimilarity.findBestMatch(value, arr)
		for key2, value2 in scoredStrings.ratings {
			if (value2.rating > threshold) {
				continue 2
			}
		}
		outputArr.push(value)
	}
	return outputArr
}
Attachments
output_85.txt
output_98.txt
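
The separator pattern from this post can also be tried on its own, before the libraries are involved. The following is a minimal sketch that counts how often the pattern matches, using only the built-in RegExMatch. The sample text is invented, and the number of matches is not necessarily the same as the e-mail count that A.split reports.

```autohotkey
; Sketch only: count matches of the separator pattern in a made-up sample.
sep := "\s{10,100}\w{3}, \w{3}\s\d+,\s\d{4}\sat\s\d*\:\d*\s\w{2}\s"

gap := ""
Loop, 12                      ; the pattern expects at least 10 whitespace characters
    gap .= " "

text := "Hello Bob," . gap . "Fri, Jan 3, 2020 at 11:59 AM Joe wrote: see you"
      . gap . "Sat, Jan 4, 2020 at 11:59 PM Bob wrote: fine"

count := 0
pos := 1
while (pos := RegExMatch(text, sep, match, pos)) {
    count++
    pos += StrLen(match)      ; continue searching after this match
}
MsgBox % "separator matches: " count
```

With the two invented date headers the count should be 2. If the count is 0 for a real file, the date lines in that file probably do not have the long run of whitespace in front of the weekday that the pattern requires.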

Post by AlFlo » 19 May 2022, 18:29

Chunjee wrote:
19 May 2022, 16:27
I thought you guys figured that out on page 1. You need a good separator for each e-mail or reply if they're all in one file.

With this regexp separator \s{10,100}\w{3}, \w{3}\s\d+,\s\d{4}\sat\s\d*\:\d*\s\w{2}\s it says there are 142 e-mails in your sample txt.
With a threshold of 0.98 (i.e. 98% similar, so close that it must be a duplicate) I brought it down to 111 e-mails. With a more aggressive threshold like 0.85 it reduces to 86 e-mails. I've attached both if you would like to review.

Thank you, Chunjee! This looks amazing. I downloaded and extracted string-similarity and biga. But your script pops up a message box saying "0 vs" and produces an output.txt with 0 bytes.

And when I tried to run these from my command prompt, I got a message saying incorrect command or no such files:

npm install biga.ahk ,

bashnpm install string-similarity.ahk

```bash npm install string-similarity.ahk ```

npm install string-similarity.ahk

install string-similarity.ahk

etc.

Sorry, I don't understand bash or other basic programming concepts, so I'm doubtless misunderstanding a very basic thing I should be doing.

Post by Chunjee » 19 May 2022, 18:34

Did you #Include the packages? I'm not sure what step you are stuck on. The problem could just as well be the path A_ScriptDir "\Sample for Deleting Duplicate Lines.txt", if your sample file isn't named and located the same way.

On my computer the top of the script looks like this:

Code: Select all

SetBatchLines, -1
#SingleInstance, force
#NoTrayIcon

#Include %A_ScriptDir%\node_modules
#Include biga.ahk\export.ahk
#Include string-similarity.ahk\export.ahk


But you may include them however you prefer.
https://github.com/Chunjee/string-similarity.ahk
https://github.com/biga-ahk/biga.ahk
Last edited by Chunjee on 19 May 2022, 19:05, edited 3 times in total.

Post by AlFlo » 19 May 2022, 18:40

Chunjee wrote:
19 May 2022, 18:34
Did you #Include the packages? I'm not sure what step you are stuck on. The problem could just as well be the path A_ScriptDir "\Sample for Deleting Duplicate Lines.txt", if your sample file isn't named and located the same way.

Thanks ... I'll take a look. I don't have node_modules installed. I'll have to check out what that is.

Post by AlFlo » 19 May 2022, 20:38

I think I got it working! I found the export.ahk scripts on GitHub, created the appropriate subfolders, put the scripts in them, and then commented out

Code: Select all

; #Include %A_ScriptDir%\node_modules
since I don't have any node-module subfolder, don't have node.js, and can't figure out how to use NPM.

Testing now!

Post by AlFlo » 22 May 2022, 16:17

garry wrote:
18 May 2022, 03:02
Thank you for the examples.
It was strange: I did not see lines in the output if they contain two commas ( , ), e.g. "Date: Fri, Jan 3, 2020 at 11:59 AM" and "CC: ... , , ".
EDIT: my example was wrong, so I removed it.

Garry, thank you for these modified scripts. The first one doesn't work as well as my manual approach: I do a global find-and-replace of every comma-space (", ") in my sampletext2.txt with three x's ("xxx"), run the script, and then do a second global replace of "xxx" back to ", ".

In other words, using Notepad's Replace box, "January 3, 2020" becomes "January 3xxx2020", which avoids the problem you raised earlier of the script not catching lines with a comma-space. After I run the script, I manually replace "xxx" with a comma-space, so that it once again becomes "January 3, 2020".

Of course, it would be ideal if your script could do this automatically. Is part of the issue that the comma is Chr(44) and we should be replacing THAT character?
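
On the comma question: in AutoHotkey v1 the `contains` operator reads its right-hand side as a comma-separated MatchList. A line such as "Date: Fri, Jan 3, 2020 at 11:59 AM" is therefore split at its commas, and the test succeeds as soon as any one of the pieces occurs somewhere in the output, which would explain why such lines were treated as duplicates. The sketch below uses invented sample strings to contrast that with InStr, which searches for the whole string.

```autohotkey
; Sketch only: how "contains" treats commas (AutoHotkey v1, default case-insensitive matching).
out := "Meeting on Jan 3 was moved`r`n"           ; invented output collected so far
x   := "Date: Fri, Jan 3, 2020 at 11:59 AM"       ; new line to test

; The MatchList built from x has three items:
;   "Date: Fri"   " Jan 3"   " 2020 at 11:59 AM"
; " Jan 3" occurs in out, so "contains" is true although the Date line is not in out.
viaContains := "no match"
if out contains %x%
    viaContains := "match"

; InStr looks for the complete line, commas included.
viaInStr := InStr(out, x) ? "match" : "no match"
MsgBox % "contains: " viaContains "`nInStr: " viaInStr
```

The message box should show "match" for contains and "no match" for InStr. Replacing the commas before the loop (with "xxx", "$$" or any other placeholder that does not occur in the text) turns each line back into a single MatchList item, which is why the manual workaround helps. Note that both `contains` and InStr test for substrings, so a short line can still be dropped when it is part of a longer line that was kept earlier.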

Post by garry » 23 May 2022, 06:26

I tried this; it worked with sampletext.txt, the 34.44 kB file from page 1 (this script downloads that text file to a_scriptdir "\sampletext.txt").

Code: Select all

;- modified = 20220523
;- example try to remove doublelines and convertdatetime
;-
;-
;- script modified because these lines were not found; slightly changed here for testing (if found, they are converted):
;Date: Fri, Jan 3, 2020 at 11:58 AM  > Date:=星期五 一月 2020-01-03  11:58
;Date: Fri, Jan 3, 2020 at 11:58 PM  > Date:=星期五 一月 2020-01-03  23:58
;Cc: [email protected] <[email protected]>, , Bill
;-
#MaxMem 4095
transform,k,chr,44   ;- ' , '
;-
;- https://www.autohotkey.com/boards/viewtopic.php?p=462736#p462736  /  sampletext.txt"  - file at page-1  34.44-kB
url:="https://www.autohotkey.com/boards/download/file.php?id=17830"
F1:=a_scriptdir "\sampletext.txt"
ifnotexist,%f1%
 urldownloadtofile,%url%,%f1%
F2:=a_scriptdir "\" . a_now . "_sampletext_new.txt"
;-
Out:="" 
Obj := FileOpen(F1, "r", "UTF-8")   ;- encoding must be a quoted string; unquoted UTF-8 is read as the expression UTF minus 8
var := Obj.Read()
stringreplace,var,var,%k%,$$,all   ;- or  stringreplace,var,var,`,,$$,all
Obj.Close()
Loop,parse,var,`n,`r
 {
 x:= A_LoopField
 x=%x%                     ;- remove leading and trailing spaces and tabs
 if x=                     ;- skip blank and whitespace-only lines
   continue
 If out not contains %x%   ;- can this work with big files ( ? )
   {
   stringmid,x1,x,1,5
   if (x1="Date:")         ;- convert 'Date: Fri, Jan 3, 2020 at 11:59 PM' > Date:=星期五 一月 2020-01-03  23:59
     {
	 stringmid,x2,x,6,40
	 stringreplace,x2,x2,$$,%k%,all
	 timex:=DateParse(x2)
	 timex2:=timex . "00"                                         ;- 20200103235900
	 FormatTime,TS,%timex2% L0x0804, dddd MMMM yyyy-MM-dd  HH:mm  ;- example china / change country and format
	 x:="Date:=" . TS
	 x1:=""
	 }
   out .= x . "`r`n"
   }
 }
stringreplace,out,out,$$,%k%,all 
ifnotexist,%f2%
 {
 FileOpen(F2, "w", "UTF-8").Write(out)
 try
   run,%f2%
 }
out=
x= 
exitapp
;================

;- https://www.autohotkey.com/board/topic/18760-date-parser-convert-any-date-format-to-yyyymmddhh24miss/
;-----------------------------------------------------------------------------------------------
DateParse(str, americanOrder=0) {
	static monthNames := "(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\w*"
		, dayAndMonth := "(?:(\d{1,2}|" . monthNames . ")[\s\.\-\/,]+)?(\d{1,2}|" . monthNames . ")"
	If RegExMatch(str, "i)^\s*(?:(\d{4})([\s\-:\/])(\d{1,2})\2(\d{1,2}))?"
		. "(?:\s*[T\s](\d{1,2})([\s\-:\/])(\d{1,2})(?:\6(\d{1,2})\s*(?:(Z)|(\+|\-)?"
		. "(\d{1,2})\6(\d{1,2})(?:\6(\d{1,2}))?)?)?)?\s*$", i) ;ISO 8601 timestamps
		year := i1, month := i3, day := i4, t1 := i5, t2 := i7, t3 := i8
	Else If !RegExMatch(str, "^\W*(\d{1,2}+)(\d{2})\W*$", t){
		RegExMatch(str, "i)(\d{1,2})"					;hours
				. "\s*:\s*(\d{1,2})"				;minutes
				. "(?:\s*:\s*(\d{1,2}))?"			;seconds
				. "(?:\s*([ap]m))?", t)				;am/pm
		StringReplace, str, str, %t%
		If Regexmatch(str, "i)(\d{4})[\s\.\-\/,]+" . dayAndMonth, d) ;2004/22/03
			year := d1, month := d3, day := d2
		Else If Regexmatch(str, "i)" . dayAndMonth . "[\s\.\-\/,]+(\d{2,4})", d)  ;22/03/2004 or 22/03/04
			year := d3, month := d2, day := d1
		If (RegExMatch(day, monthNames) or americanOrder and !RegExMatch(month, monthNames)) ;try to infer day/month order
			tmp := month, month := day, day := tmp
	}
	f = %A_FormatFloat%
	SetFormat, Float, 02.0
	d := (StrLen(year) == 2 ? "20" . year : (year ? year : A_YYYY))
		. ((month := month + 0 ? month : InStr(monthNames, SubStr(month, 1, 3)) // 4 ) > 0 ? month + 0.0 : A_MM)
		. ((day += 0.0) ? day : A_DD) 
		. t1 + (t1 == 12 ? t4 = "am" ? -12.0 : 0.0 : t4 = "pm" ? 12.0 : 0.0)
		. t2 + 0.0 . t3 + 0.0
	SetFormat, Float, %f%
	return, d
}
;===========================================================================
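
The FormatTime line in the script above uses L0x0804, which is why the converted dates carry Chinese day and month names. A minimal sketch of the same call with the US English locale identifier 0x0409, assuming a fixed sample timestamp instead of the DateParse result:

```autohotkey
; Sketch only: format a YYYYMMDDHH24MISS stamp with English day and month names.
stamp := "20200103235900"                       ; sample value, as in the script comment
FormatTime, ts, %stamp% L0x0409, dddd d MMMM yyyy HH:mm
MsgBox % "Date:=" ts
```

This should display "Date:=Friday 3 January 2020 23:59". Leaving out the L option makes FormatTime use the current user's locale instead.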

Post by AlFlo » 23 May 2022, 12:13

Thank you, Garry. Interestingly, your revised script still misses a couple of things which your original script catches (if I replace comma-space with xxx before running).

Here are screenshots showing a comparison between the results from your original script (after replacing comma-space) and the new script:
Garry Script Results Comparison 1.png
Garry Script Results Comparison 2.png
Garry Script Results Comparison 3.png

Post by AlFlo » 23 May 2022, 12:15

More comparisons (looks like I can only attach 3 images to a single post):
Garry Script Results Comparison 4.png
Garry Script Results Comparison 5.png
Garry Script Results Comparison 6.png

Post by AlFlo » 23 May 2022, 12:16

Last of the comparisons:
Garry Script Results Comparison 7.png

Post by garry » 23 May 2022, 14:31

Thank you for testing.
I'm not good at programming; I just tried to use the recommended script.
It was confusing that commas have to be replaced before some lines, such as "Date: ...", are seen.
We must ask the professional AHK users again ...

Here are 2 more examples (not tested yet ...):

Code: Select all

;- Example-1
#MaxMem 4095
;- https://www.autohotkey.com/boards/viewtopic.php?p=462736#p462736  /  sampletext.txt"  - file at page-1  34.44-kB
url:="https://www.autohotkey.com/boards/download/file.php?id=17830"
F1:=a_scriptdir "\sampletext.txt"
ifnotexist,%f1%
 urldownloadtofile,%url%,%f1%
F2:=a_scriptdir "\" . a_now . "_sampletext_new.txt"
fileread,aa,%f1%
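Loop Parse, aa, `n, `r
 {
 x:= A_LoopField
 x=%x%                        ;- remove leading and trailing spaces and tabs
 if x=                        ;- skip blank and whitespace-only lines
   continue
 TestStr := x
 Loop Parse, Result, `n, `r   ;- compare against every line kept so far
    if (TestStr = A_LoopField)
       continue 2             ;- duplicate: go on with the next input line
 Result .= Result ? "`n" : ""
 Result .= x
 }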
ifnotexist,%f2%
 {
 FileOpen(F2, "w", "UTF-8").Write(result)
 try
   run,%f2%
 }
exitapp

;----------------------------------------------------------------------------------
;- example 2: leading spaces are not removed, so lines with the same text but different leading spaces are all kept
/*
#MaxMem 4095
;- https://www.autohotkey.com/boards/viewtopic.php?p=462736#p462736  /  sampletext.txt"  - file at page-1  34.44-kB
url:="https://www.autohotkey.com/boards/download/file.php?id=17830"
F1:=a_scriptdir "\sampletext.txt"
ifnotexist,%f1%
 urldownloadtofile,%url%,%f1%
F2:=a_scriptdir "\" . a_now . "_sampletext_new.txt"
fileread,aa,%f1%
Loop Parse, aa, `n, `r
{
   TestStr := A_LoopField
   Loop Parse, Result, `n, `r
      if (TestStr = A_LoopField)
         continue 2
   Result .= Result ? "`n" : ""
   Result .= A_LoopField
}
ifnotexist,%f2%
 {
 FileOpen(F2, "w", "UTF-8").Write(result)
 try
   run,%f2%
 }
exitapp
*/
