Hello!
I want a script that removes duplicate lines along with the original, meaning it removes every copy of a duplicated line and keeps only the unique lines.
But I have no idea where to start.
My files are usually big, around 2 to 3 million lines.
Is this even possible to do in AutoHotkey?
Thanks
Remove both duplicates from a text file
Re: Remove both duplicates from a text file
Hello, it is very possible.
Try these:
https://autohotkey.com/docs/misc/Arrays.htm
https://autohotkey.com/docs/commands/IfExpression.htm
https://autohotkey.com/docs/commands/FileAppend.htm
https://autohotkey.com/docs/commands/Loop.htm
https://autohotkey.com/docs/commands/LoopReadFile.htm
https://autohotkey.com/docs/Objects.htm
Re: Remove both duplicates from a text file
I only understand 10% of the links >:(
Also, wouldn't a loop be very slow if it has to check every line?
Re: Remove both duplicates from a text file
- What length are the lines (in characters)? How big are the files (in megabytes)?
- Other points: is it OK to sort the lines alphabetically (or must the order be maintained)? And would you ever get two lines that are identical when compared case insensitively, but different when compared case sensitively?
- A simple approach would be to sort the list, and compare consecutive lines.
- Something like this might work, however, I would recommend testing it on a small dataset first:
Code:
;[based on FREQUENCY COUNT (CASE SENSITIVE)]
;jeeswg's objects tutorial - AutoHotkey Community
;https://autohotkey.com/boards/viewtopic.php?f=7&t=29232
;list items that appear once (maintain order, case sensitive)
vText := "q,a,b,c,d,e,f,A,B,C,a,b,c"
vText := StrReplace(vText, ",", "`r`n")
;if we see a line once, its value is set to 1
;if we see a line more than once, its value is set to 2
oArray := ComObjCreate("Scripting.Dictionary")
Loop, Parse, vText, `n, `r
{
	if !oArray.Exists("" A_LoopField)
		oArray.Item("" A_LoopField) := 1
	else if (oArray.Item("" A_LoopField) = 1)
		oArray.Item("" A_LoopField) := 2
}
;list items that appear once (each output line is: count, tab, line)
vOutput := ""
VarSetCapacity(vOutput, StrLen(vText)*2*2)
for vKey in oArray
	if (oArray.Item(vKey) = 1)
		vOutput .= oArray.Item(vKey) "`t" vKey "`r`n"
MsgBox, % vOutput
oArray := ""
return
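- For comparison, the sort-and-compare approach mentioned above could look roughly like the sketch below (AHK v1). It is untested on large files and the output comes out in sorted order, not the original order. The variable names vPrev and vCount are only illustrative.

```autohotkey
;sketch: keep lines that appear exactly once (sort, then compare neighbours)
vText := "q,a,b,c,d,e,f,A,B,C,a,b,c"
vText := StrReplace(vText, ",", "`n")
Sort, vText, C ;C = case sensitive sort, so identical lines end up adjacent
vOutput := ""
vPrev := ""
vCount := 0 ;number of lines in the current run of identical lines
Loop, Parse, vText, `n, `r
{
	;the "x" prefix forces a string comparison, so that numeric-looking
	;lines such as 1 and 01 are not treated as equal
	if (vCount > 0 && (("x" A_LoopField) == ("x" vPrev)))
	{
		vCount++
		continue
	}
	if (vCount = 1) ;the previous run contained exactly one line
		vOutput .= vPrev "`r`n"
	vPrev := A_LoopField
	vCount := 1
}
if (vCount = 1) ;flush the final run
	vOutput .= vPrev "`r`n"
MsgBox, % vOutput
return
```

- The frequency count keeps the original order, so prefer it if order matters. The sorted version only needs to remember the previous line.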
homepage | tutorials | wish list | fun threads | donate
WARNING: copy your posts/messages before hitting Submit as you may lose them due to CAPTCHA
Re: Remove both duplicates from a text file
Thanks! It worked, more or less!
I know I have 44k duplicates, but I got 86k with that code. And yes, it has to be case sensitive, and the length is not much, about 30-40 characters per line.
How can it be done?
Re: Remove both duplicates from a text file
- AFAIK the script does what you want, it keeps any line (case sensitive) that only appears once.
- If you can clarify your requirements and/or give a small example of input text where my script gives the incorrect output, then I can look at editing the script.
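- In case it helps, here is a rough sketch of the same frequency count applied to a file instead of an inline string (AHK v1). The paths in vPathIn and vPathOut are hypothetical placeholders, and I am assuming the whole file fits in memory and is in the script's default encoding. Unlike the demo above, it writes only the line text, without the count column.

```autohotkey
;sketch: write the lines that appear exactly once to a new file
;vPathIn and vPathOut are hypothetical paths, change them to suit
vPathIn := A_Desktop "\input.txt"
vPathOut := A_Desktop "\output.txt"
FileRead, vText, % vPathIn
if ErrorLevel
{
	MsgBox, % "could not read: " vPathIn
	return
}
;count each line: 1 = seen once, 2 = seen more than once
oArray := ComObjCreate("Scripting.Dictionary")
Loop, Parse, vText, `n, `r
{
	if !oArray.Exists("" A_LoopField)
		oArray.Item("" A_LoopField) := 1
	else
		oArray.Item("" A_LoopField) := 2
}
;collect the lines that were seen once (original order is kept)
vOutput := ""
VarSetCapacity(vOutput, StrLen(vText)*2*2)
for vKey in oArray
	if (oArray.Item(vKey) = 1)
		vOutput .= vKey "`r`n"
vText := ""
oArray := ""
;write the result in one go, overwriting any existing output file
oFile := FileOpen(vPathOut, "w")
if !IsObject(oFile)
{
	MsgBox, % "could not write: " vPathOut
	return
}
oFile.Write(vOutput)
oFile.Close()
return
```

- As before, I would recommend testing it on a small file first. Note that blank lines are counted like any other line, so a single blank line in the input may show up in the output.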
Re: Remove both duplicates from a text file
You're right! I just checked and the number of duplicates is correct, idk why I thought there were only 44k in my files.
Thank you very much for this useful script!!!!!!!