Deleting rows from csv file where different types of duplicates are occuring Topic is solved

Get help with using AutoHotkey (v1.1 and older) and its commands and hotkeys
bazkeys
Posts: 98
Joined: 20 Jan 2021, 21:58

Deleting rows from csv file where different types of duplicates are occuring

Post by bazkeys » 26 Jul 2021, 20:08

I'm looking to delete rows from a 3 column csv file, just has 3 columns and no header. First I want to remove a row if entry in column B and C on the same row are the same, secondly, I want to check if there is a duplicate entry in column A, as column A is an ID and should have unique values, if it does encounter one or more duplicate entries, I want to keep the last one it finds.
A very simple example (actual real file could be big)

Input from csv file
599,R,Yyyyyyy
600,R,R
601,hhfgdfsghgf,Yyyyyy
601,Yhgshdgf,Dhgfhhg
601,ggg,iii
188,eeeeeeeeeeee,Yyyyyyy
187,Yhtrthytre,Rdsads

Output:
599,R,Yyyyyyy
601,ggg,iii
188,eeeeeeeeeeee,Yyyyyyy
187,Yhtrthytre,Rdsads

So, output removes 2nd row as B2 = C2, and removes 3rd and 4th rows as the ID occurs again on the 5th row, as 5th row is last entry with this ID, it is kept.

As usual, i'm looking through searches to try to find what I'm looking for, but falling down rabbit holes and struggling to do what is probably an easy enough task
Thanks for any help. I'll try to see if I can make progress myself meanwhile.
Last edited by bazkeys on 26 Jul 2021, 21:31, edited 1 time in total.
User avatar
mikeyww
Posts: 26437
Joined: 09 Sep 2014, 18:38

Re: Deleting rows from csv file where different types of duplicates are occuring

Post by mikeyww » 26 Jul 2021, 20:53

You haven't said what to do when the last multiple has the same second and third column.
bazkeys
Posts: 98
Joined: 20 Jan 2021, 21:58

Re: Deleting rows from csv file where different types of duplicates are occuring

Post by bazkeys » 26 Jul 2021, 21:01

mikeyww wrote:
26 Jul 2021, 20:53
You haven't said what to do when the last multiple has the same second and third column.
If the second and third column are the same then I absolutely want to delete that row, that's why I am applying that step first. then it can check for duplicate IDs.
To explicitly state it in the case where had ID 601 occuring 3 times like
601,a,b
601,a,c
601,a,a

then I would want to keep the middle one "601 a,c" as it's the last occurring entry of 601 where columns B and C are not the same.

I've managed to code Part A of this on the back of searching this forum

After much stumbling, think I got the first part done, it probably could be improved I imagine.

Code: Select all

outputFile  := "E:\test\output.csv"

delimiter := ","
Loop, Read, E:\test\example.csv
   {
   Cell := StrSplit(A_LoopReadLine,delimiter)
   if (Cell[2] <>  Cell[3])
   {
	    If (A_INDEX <> 1)  ;New line not before first line or at end of file
	    fileappend,`n,%outputFile% 
	    fileappend, %A_LoopReadLine%,%outputFile% 
   }
     }
  ExitApp
bazkeys
Posts: 98
Joined: 20 Jan 2021, 21:58

Re: Deleting rows from csv file where different types of duplicates are occuring

Post by bazkeys » 26 Jul 2021, 21:44

I might have got the second part done too, again from extracting from past help on this forum

Code: Select all

FileDelete E:\test\myNewFile.txt

Loop, Read,  E:\test\example.csv
   NumLines++

Loop, %NumLines%
{
FileReadLine, Line, ,  E:\test\example.csv, %NumLines% ; read file from bottom to top
StringSplit, Field, Line,`,
If not InStr(SetFields, Field1) ; skip 'known' records
{
   SetFields .= Field1 ";" ; keep ID of 'unknown' records

   FileAppend, %Line%`n, E:\test\myNewFile.txt
   }
NumLines--
}
It seems to work, but again guessing it can be improved upon.
User avatar
mikeyww
Posts: 26437
Joined: 09 Sep 2014, 18:38

Re: Deleting rows from csv file where different types of duplicates are occuring

Post by mikeyww » 26 Jul 2021, 22:36

It looks like you are on the right track!
bazkeys
Posts: 98
Joined: 20 Jan 2021, 21:58

Re: Deleting rows from csv file where different types of duplicates are occuring

Post by bazkeys » 26 Jul 2021, 23:12

mikeyww wrote:
26 Jul 2021, 22:36
It looks like you are on the right track!
Thanks, though putting it all together I'm running into problems, as I'm looking to work on the data before sending it out to the csv, otherwise I'm constantly opening a csv file, and there's more than those 2 operations that I want to do on the data.
sofista
Posts: 644
Joined: 24 Feb 2020, 13:59
Location: Buenos Aires

Re: Deleting rows from csv file where different types of duplicates are occuring

Post by sofista » 26 Jul 2021, 23:55

Hope it is not too late, here is my take

Code: Select all

SampleText =
(
599,R,Yyyyyyy
600,R,R
601,hhfgdfsghgf,Yyyyyy
601,Yhgshdgf,Dhgfhhg
601,ggg,iii
188,eeeeeeeeeeee,Yyyyyyy
187,Yhtrthytre,Rdsads
)

Output := ""

while pos := RegExMatch(SampleText, "`am)(.+),(.+),\2(*SKIP)(*F)|(.+)", m, A_Index=1?1:pos+StrLen(m)) {
	Output .= m3 "`n"
}

while pos := RegExMatch(Output, "`am)((.+),.*\R)(\2.*\R)+", m, A_Index=1?1:pos+StrLen(m)) {
	Output := RegExReplace(Output, m, m3)
}

MsgBox, % Output
return

/* Output:

599,R,Yyyyyyy
601,ggg,iii
188,eeeeeeeeeeee,Yyyyyyy
187,Yhtrthytre,Rdsads
 */
User avatar
mikeyww
Posts: 26437
Joined: 09 Sep 2014, 18:38

Re: Deleting rows from csv file where different types of duplicates are occuring

Post by mikeyww » 27 Jul 2021, 00:25

Long-loop longing....

Code: Select all

file := StrReplace(A_ScriptFullPath, ".ahk", ".txt")
If !FileExist(file) {
 MsgBox, 48, Error, File not found. Aborting.`n`n%file%
 Return
} Else line := [], has := {}, n := 0, new := ""
Loop, Read, %file%
{ part := []
  Loop, Parse, A_LoopReadLine, CSV
   part.Push(A_LoopField)
  If (part.2 != part.3)
   line.Push(A_LoopReadLine), has[part.1] := has[part.1].Count() ? has[part.1] : []
   , has[part.1].Push(++n)
}
For k1, v1 in has
 For k2, v2 in v1
  If (A_Index < v1.Count())
   line.Delete(v2)
For k, v in line
 new .= v > "" ? "`n" v : ""
MsgBox, 64, Result, % SubStr(new, 2)
User avatar
Xtra
Posts: 2744
Joined: 02 Oct 2015, 12:15

Re: Deleting rows from csv file where different types of duplicates are occuring  Topic is solved

Post by Xtra » 27 Jul 2021, 01:24

Another example:

Code: Select all

#NoEnv
SetBatchLines, -1

; testing csv:
csvfile =
(
599,R,Yyyyyyy
600,R,R
601,hhfgdfsghgf,Yyyyyy
601,Yhgshdgf,Dhgfhhg
601,ggg,iii
188,eeeeeeeeeeee,Yyyyyyy
187,Yhtrthytre,Rdsads
)

;~ FileRead, csvfile, myfilenamehere.csv

checkString := "|"    ; This var is to keep lines in sequential order
out := {}             ; This array will hold our rows we keep

Loop, Parse, csvfile, `n, `r
{
    ROW := c1 := c2 := c3 := ""       ; Clear temp vars
    
    if ((ROW := A_LoopField) = "")    ; Skip empty rows
        continue
    
    Loop, Parse, ROW, CSV             ; A true CSV you want to use Parsing loop not StrSplit()
    {
        switch A_Index
        {
            case 1:c1 := A_LoopField
            case 2:c2 := A_LoopField
            case 3:c3 := A_LoopField
        }
    }
    
    if (c2 = c3)      ; Remove row when c2 = c3
        continue
    
    out[c1] := ROW    ; Only save last c1 cell to keep rows unique
    
    if InStr(checkString, "|" c1 "|")
        checkString := StrReplace(checkString, "|" c1 "|", "|")    ; Remove last dupe
    checkString .= c1 "|"                                          ; Append c1 to end of string
}

checkString := Trim(checkString, "|")
Loop, Parse, checkString, "|"
    outputCSV .= out[A_LoopField] . "`n"

;~ FileAppend, % RTrim(outputCSV, "`n"), E:\test\myNewFile.txt    ; Save

MsgBox % RTrim(outputCSV, "`n")
ExitApp
Comments added for clarity.
HTH
bazkeys
Posts: 98
Joined: 20 Jan 2021, 21:58

Re: Deleting rows from csv file where different types of duplicates are occuring

Post by bazkeys » 27 Jul 2021, 01:53

@sofista

Thanks for the reply, never too late, and really appreciate the help. Looks like very neat short code. The first part worked perfectly on a large test sample, second part something is going wrong when working on that big sample. I don't understand what the code is doing at one glance, so I can't say what the problem is, but it's not outputting anything.

@mikeyww

That looks like it works, I need to test it now with the actual files in a directory. I'll have to add the code and might have to tweak some things, hopefully won't take too long and I can get back shortly with a positive result ;)

Edit: Tried to plug this into my code, but I have the same issues that I had before. I'm not working directly with files by the time I get to doing this duplicate checking. What I have is a variable called totalcontents that has merged all the csv files from a directory and sorted my csv by ID field. This totalcontents variable is the file that would need to be read at the beginning of your code snippet, as in

Code: Select all

Loop, Read, %totalcontents%
except totalcontents isn't a file, so it can't be done that way.
bazkeys
Posts: 98
Joined: 20 Jan 2021, 21:58

Re: Deleting rows from csv file where different types of duplicates are occuring

Post by bazkeys » 27 Jul 2021, 02:31

@Xtra

My mind is kinda fried now, so haven't had a chance to actually look through the code but I managed to plug it in to the code I have and it works.
Thanks very much :D

And thanks to all who helped. All looks like super good code, that would work with a tweak here or there if I knew what I was doing.
AHKStudent
Posts: 1472
Joined: 05 May 2018, 12:23

Re: Deleting rows from csv file where different types of duplicates are occuring

Post by AHKStudent » 27 Jul 2021, 03:26

my try at it

Code: Select all

listtext =
(
599,R,Yyyyyyy
600,R,R
601,hhfgdfsghgf,Yyyyyy
601,Yhgshdgf,Dhgfhhg
601,ggg,iii
188,eeeeeeeeeeee,Yyyyyyy
187,Yhtrthytre,Rdsads
)
array := {}
loop, parse, listtext, `n
{
	a := strsplit(a_loopfield, ",")
	if (a.2 = a.3)
		continue
	array[a.1] := a.2 "," a.3
}
for k, v in array
{
	csv .= k "," v "`n"
}
MsgBox, % csv
ExitApp
User avatar
mikeyww
Posts: 26437
Joined: 09 Sep 2014, 18:38

Re: Deleting rows from csv file where different types of duplicates are occuring

Post by mikeyww » 27 Jul 2021, 08:08

This last one does not match the desired output unless the line order does not matter. I love the approach anyway!
Last edited by mikeyww on 27 Jul 2021, 09:20, edited 2 times in total.
bazkeys
Posts: 98
Joined: 20 Jan 2021, 21:58

Re: Deleting rows from csv file where different types of duplicates are occuring

Post by bazkeys » 27 Jul 2021, 09:13

mikeyww wrote:
27 Jul 2021, 08:08
This last one does not match the desired output unless the line order does not matter. I love the approach anyway!
AFK now, so haven't had a chance to test. Line order actually might not really matter, but unlike my example (a mistake) I have already presorted the order before it get to this stage, so it looks like this code will work :) I'll check it out tomorrow.
User avatar
mikeyww
Posts: 26437
Joined: 09 Sep 2014, 18:38

Re: Deleting rows from csv file where different types of duplicates are occuring

Post by mikeyww » 27 Jul 2021, 09:20

If order is needed:

Code: Select all

listtext =
(
599,R,Yyyyyyy
600,R,R
601,hhfgdfsghgf,Yyyyyy
601,Yhgshdgf,Dhgfhhg
601,ggg,iii
188,eeeeeeeeeeee,Yyyyyyy
187,Yhtrthytre,Rdsads
)
array := {}, order := ""
Loop, Parse, listtext, `n
{ a := StrSplit(A_LoopField, ",")
  If (a.2 != a.3)
   array[a.1] := a.2 "," a.3, order .= a.1 ","
}
While order
 RegExMatch(order, "\d+,", m), csv .= "`n" m array[StrReplace(m, ",")], order := StrReplace(order, m)
MsgBox, 64, Result, % csv := SubStr(csv, 2)
AHKStudent
Posts: 1472
Joined: 05 May 2018, 12:23

Re: Deleting rows from csv file where different types of duplicates are occuring

Post by AHKStudent » 27 Jul 2021, 09:46

mikeyww wrote:
27 Jul 2021, 08:08
This last one does not match the desired output unless the line order does not matter. I love the approach anyway!
glad you liked it because a lot of what I know about ahk is thanks to your posts :thumbup:
sofista
Posts: 644
Joined: 24 Feb 2020, 13:59
Location: Buenos Aires

Re: Deleting rows from csv file where different types of duplicates are occuring

Post by sofista » 27 Jul 2021, 11:31

bazkeys wrote:
27 Jul 2021, 01:53
@sofista

Thanks for the reply, never too late, and really appreciate the help. Looks like very neat short code. The first part worked perfectly on a large test sample, second part something is going wrong when working on that big sample. I don't understand what the code is doing at one glance, so I can't say what the problem is, but it's not outputting anything.
I simplified my code a little, maybe now it is more understandable. The first while cycle discards rows when columns B and C are the same - \2 is a reference to group 2. It also includes an enhancement, as it prevents partial matches. For example, if the second line were "600,R,RR", the replaced regular expression would discard the line, while the new version does not.

As for the second while cycle, it discards two or more consecutive lines, except the last one, when it is the case that the column A is the same. I inferred this condition from the example data you provided, hoping this data is representative. If this condition is not met, the regex may delete unwanted lines.

Code: Select all

SampleText =
(
599,R,Yyyyyyy
600,R,R
601,hhfgdfsghgf,Yyyyyy
601,Yhgshdgf,Dhgfhhg
601,ggg,iii
188,eeeeeeeeeeee,Yyyyyyy
187,Yhtrthytre,Rdsads
)

Output := ""

while pos := RegExMatch(SampleText, "`am)(.+),(.+),\2\R|(.+(\R|\Z))", m, A_Index=1?1:pos+StrLen(m)) {
	Output .= m3
}

while pos := RegExMatch(Output, "`am)^((.+),.*\R)(\2.*\R)+", m, A_Index=1?1:pos+StrLen(m)) {
	Output := RegExReplace(Output, m, m3)
}

MsgBox, % Output
return

/* Output:

599,R,Yyyyyyy
601,ggg,iii
188,eeeeeeeeeeee,Yyyyyyy
187,Yhtrthytre,Rdsads
 */
Edited: Removed commented out lines
bazkeys
Posts: 98
Joined: 20 Jan 2021, 21:58

Re: Deleting rows from csv file where different types of duplicates are occuring

Post by bazkeys » 27 Jul 2021, 18:22

@AHKStudent
Thanks for the help. it successfully removed the duplicate ids, but the first part didn't seem to work with the actual data I had, it didn't remove where B = C in row.

@sofista

Thanks, that almost worked, but it appears to have completely deleted duplicates, i.e. when there are 3 ids the same for example all having id = 601, it needs to keep one (the last one after any b=c matches have been removed).

Thanks for all your help guys/girls. @Xtra solution worked and I have embedded into my code nicely, so I'm going to mark it as solved now.
Post Reply

Return to “Ask for Help (v1)”