Remove both duplicates from a text file

Get help with using AutoHotkey (v1.1 and older) and its commands and hotkeys
alesyt0h
Posts: 214
Joined: 28 Jan 2015, 20:37

Remove both duplicates from a text file

01 Jul 2018, 15:07

Hello!

I want a script that removes duplicate lines together with the original, i.e. it should remove every copy of a duplicated line and keep only the lines that appear exactly once.

But I have no idea where to start :(

My files are usually big, around 2 to 3 million lines.


Is this even possible to do in AutoHotkey?

Thanks
alesyt0h
Posts: 214
Joined: 28 Jan 2015, 20:37

Re: Remove both duplicates from a text file

01 Jul 2018, 18:08

I only understand 10% of the links >:(

Also, wouldn't a loop be very slow if it has to check every line?
jeeswg
Posts: 6902
Joined: 19 Dec 2016, 01:58
Location: UK

Re: Remove both duplicates from a text file

01 Jul 2018, 19:32

- What length are the lines (in characters)? How big are the files (in megabytes)?
- Other points are: is it OK to sort the lines alphabetically (or must the order be maintained), and would you ever get two lines that are identical when compared case insensitively, but different when compared case sensitively?
- A simple approach would be to sort the list, and compare consecutive lines.

- Something like this might work; however, I would recommend testing it on a small dataset first:

Code:

;[based on FREQUENCY COUNT (CASE SENSITIVE)]
;jeeswg's objects tutorial - AutoHotkey Community
;https://autohotkey.com/boards/viewtopic.php?f=7&t=29232

;list items that appear once (maintain order, case sensitive)
vText := "q,a,b,c,d,e,f,A,B,C,a,b,c"
vText := StrReplace(vText, ",", "`r`n")

;if we see a line once, its value is set to 1
;if we see a line more than once, its value is set to 2
oArray := ComObjCreate("Scripting.Dictionary")
Loop, Parse, vText, `n, `r
	if !oArray.Exists("" A_LoopField)
		oArray.Item("" A_LoopField) := 1
	else if (oArray.Item("" A_LoopField) = 1)
		oArray.Item("" A_LoopField) := 2

;list items that appear once
vOutput := ""
VarSetCapacity(vOutput, StrLen(vText)*2*2)
for vKey in oArray
	if (oArray.Item(vKey) = 1)
		vOutput .= oArray.Item(vKey) "`t" vKey "`r`n"
MsgBox, % vOutput

oArray := ""
return
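
The sort-and-compare idea mentioned above could look roughly like the sketch below. It is not the script from this post, only an illustration, and it assumes that losing the original line order is acceptable. The variable names `vPrev` and `vCount` are made up for this example. The `"_"` prefix in the comparison is a precaution so that numeric-looking lines such as `01` and `1` are compared as text rather than as numbers.

```autohotkey
; Sketch only: keep lines that occur exactly once by sorting the text
; and comparing each line with the previous one.
; Assumption: the original line order does not need to be preserved.
vText := "q,a,b,c,d,e,f,A,B,C,a,b,c"
vText := StrReplace(vText, ",", "`n")

; C = case-sensitive sort, so identical lines end up next to each other
Sort, vText, C

vOutput := ""
vPrev := ""
vCount := 0 ; how many times vPrev has been seen in the current run
Loop, Parse, vText, `n, `r
{
	; same line as the previous one (case sensitive, compared as text)
	if (vCount > 0 && ("_" A_LoopField) == ("_" vPrev))
	{
		vCount++
		continue
	}
	; a new line starts: emit the previous one if it occurred only once
	if (vCount = 1)
		vOutput .= vPrev "`r`n"
	vPrev := A_LoopField
	vCount := 1
}
; handle the last run of lines
if (vCount = 1)
	vOutput .= vPrev "`r`n"
MsgBox, % vOutput
```

With the sample text, the lines that remain should be q, d, e, f, A, B and C, listed in sorted order rather than in their original order.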
homepage | tutorials | wish list | fun threads | donate
WARNING: copy your posts/messages before hitting Submit as you may lose them due to CAPTCHA
alesyt0h
Posts: 214
Joined: 28 Jan 2015, 20:37

Re: Remove both duplicates from a text file

02 Jul 2018, 17:18

Thanks! It worked, more or less!

I know I have 44k duplicates, but I got 86k with that code. And yes, it has to be case sensitive. The lines are not long, about 30-40 characters each.

How can it be done?
jeeswg
Posts: 6902
Joined: 19 Dec 2016, 01:58
Location: UK

Re: Remove both duplicates from a text file

02 Jul 2018, 19:29

- AFAIK the script does what you want, it keeps any line (case sensitive) that only appears once.
- If you can clarify your requirements and/or give a small example of input text where my script gives the incorrect output, then I can look at editing the script.
alesyt0h
Posts: 214
Joined: 28 Jan 2015, 20:37

Re: Remove both duplicates from a text file

03 Jul 2018, 01:47

You're right! I just checked and the number of duplicates is correct. I don't know why I thought there were only 44k in my files :crazy:

Thank you very much for this useful script!!!!!!!
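
For readers who want to run the dictionary approach from this thread on a file rather than on a test string, a minimal sketch might look like the following. The file names `input.txt` and `unique.txt` are hypothetical placeholders, and the whole file is assumed to fit in memory. This has not been tested on files of 2 to 3 million lines, so try it on a small copy first.

```autohotkey
; Sketch only: write the lines that occur exactly once to a new file.
; Assumptions: hypothetical file names, file small enough to load into memory.
vPathIn := A_ScriptDir "\input.txt"
vPathOut := A_ScriptDir "\unique.txt"

FileRead, vText, % vPathIn
if ErrorLevel
{
	MsgBox, % "Could not read " vPathIn
	ExitApp
}
; drop trailing line breaks so the end of the file is not counted as an empty line
vText := RTrim(vText, "`r`n")

; value 1 = line seen once, value 2 = line seen more than once
; (Scripting.Dictionary compares keys case sensitively by default)
oDict := ComObjCreate("Scripting.Dictionary")
Loop, Parse, vText, `n, `r
{
	; the "" prefix forces a string key, as in the script earlier in the thread
	if !oDict.Exists("" A_LoopField)
		oDict.Item("" A_LoopField) := 1
	else
		oDict.Item("" A_LoopField) := 2
}

; collect the lines that were seen once, in their original order
vOutput := ""
VarSetCapacity(vOutput, StrLen(vText) * (A_IsUnicode ? 2 : 1)) ; rough pre-allocation
for vKey in oDict
	if (oDict.Item("" vKey) = 1)
		vOutput .= vKey "`r`n"

FileDelete, % vPathOut ; remove an older result, if any
FileAppend, % vOutput, % vPathOut
oDict := ""
MsgBox, % "Done"
```

Unlike the test script, this version writes only the line text and leaves out the count column, since every line that is kept has a count of 1.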
