Deleting rows from csv file where different types of duplicates are occurring Topic is solved
I'm looking to delete rows from a CSV file that has just 3 columns and no header. First, I want to remove a row if the entries in columns B and C of that row are the same. Second, I want to check for duplicate entries in column A, as column A is an ID and should have unique values; if it does encounter one or more duplicate entries, I want to keep the last one it finds.
A very simple example (actual real file could be big)
Input from csv file
599,R,Yyyyyyy
600,R,R
601,hhfgdfsghgf,Yyyyyy
601,Yhgshdgf,Dhgfhhg
601,ggg,iii
188,eeeeeeeeeeee,Yyyyyyy
187,Yhtrthytre,Rdsads
Output:
599,R,Yyyyyyy
601,ggg,iii
188,eeeeeeeeeeee,Yyyyyyy
187,Yhtrthytre,Rdsads
So, the output removes the 2nd row because B2 = C2, and removes the 3rd and 4th rows because their ID occurs again on the 5th row; as the 5th row is the last entry with this ID, it is kept.
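The two rules above can be sketched outside AutoHotkey as well; here is a minimal illustration in Python (the sample data is the example from this post), using a dict so that later rows with the same ID overwrite earlier ones:

```python
import csv
import io

# The example input from the post, inline for illustration
data = """599,R,Yyyyyyy
600,R,R
601,hhfgdfsghgf,Yyyyyy
601,Yhgshdgf,Dhgfhhg
601,ggg,iii
188,eeeeeeeeeeee,Yyyyyyy
187,Yhtrthytre,Rdsads"""

kept = {}  # ID -> row; assigning again overwrites, so the last row per ID wins
for row in csv.reader(io.StringIO(data)):
    if row[1] == row[2]:   # rule 1: drop the row when column B equals column C
        continue
    kept[row[0]] = row     # rule 2: keep only the last surviving row per ID

for row in kept.values():
    print(",".join(row))
```

One caveat: a dict keeps each ID at its first-seen position rather than its last; for this sample that happens to match the desired output order.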
As usual, I'm looking through searches to try to find what I need, but falling down rabbit holes and struggling with what is probably an easy enough task.
Thanks for any help. I'll try to see if I can make progress myself meanwhile.
Last edited by bazkeys on 26 Jul 2021, 21:31, edited 1 time in total.
Re: Deleting rows from csv file where different types of duplicates are occurring
You haven't said what to do when the last multiple has the same second and third column.
Re: Deleting rows from csv file where different types of duplicates are occurring
If the second and third columns are the same then I absolutely want to delete that row; that's why I apply that step first. Then it can check for duplicate IDs.
To state it explicitly: in a case where ID 601 occurred 3 times, like
601,a,b
601,a,c
601,a,a
then I would want to keep the middle one, "601,a,c", as it's the last occurring entry of 601 where columns B and C are not the same.
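As a quick sanity check of that precedence (drop B=C rows first, then keep the last row per ID), a three-row sketch in Python:

```python
rows = [("601", "a", "b"), ("601", "a", "c"), ("601", "a", "a")]

kept = {}
for rid, b, c in rows:
    if b == c:               # "601,a,a" is dropped by the B=C rule first
        continue
    kept[rid] = (rid, b, c)  # later survivors overwrite earlier ones

print(kept["601"])  # ('601', 'a', 'c') -- the last 601 row where B != C
```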
I've managed to code Part A of this on the back of searching this forum. After much stumbling, I think I got the first part done; it could probably be improved, I imagine.
Code: Select all
outputFile := "E:\test\output.csv"
delimiter := ","
FileDelete, %outputFile% ; start fresh so repeated runs don't append to old output
kept := 0
Loop, Read, E:\test\example.csv
{
    Cell := StrSplit(A_LoopReadLine, delimiter)
    if (Cell[2] <> Cell[3]) ; keep the row only when columns B and C differ
    {
        If (kept++) ; newline between kept lines only, never before the first
            FileAppend, `n, %outputFile%
        FileAppend, %A_LoopReadLine%, %outputFile%
    }
}
ExitApp
Re: Deleting rows from csv file where different types of duplicates are occurring
I might have got the second part done too, again extracted from past help on this forum. It seems to work, but again I'm guessing it can be improved upon.
Code: Select all
FileDelete, E:\test\myNewFile.txt
Loop, Read, E:\test\example.csv ; count the lines first
    NumLines++
Loop, %NumLines%
{
    FileReadLine, Line, E:\test\example.csv, %NumLines% ; read the file from bottom to top
    StringSplit, Field, Line, `,
    If !InStr(";" SetFields, ";" Field1 ";") ; skip known IDs (delimiters prevent partial matches)
    {
        SetFields .= Field1 ";" ; remember the ID of each record kept
        FileAppend, %Line%`n, E:\test\myNewFile.txt
    }
    NumLines--
} ; note: the output file is written in reverse (bottom-to-top) order
Re: Deleting rows from csv file where different types of duplicates are occurring
It looks like you are on the right track!
Re: Deleting rows from csv file where different types of duplicates are occurring
Thanks, though putting it all together I'm running into problems, as I'm looking to work on the data before sending it out to the CSV; otherwise I'm constantly opening a CSV file, and there are more than those 2 operations that I want to do on the data.
Re: Deleting rows from csv file where different types of duplicates are occurring
Hope it is not too late; here is my take:
Code: Select all
SampleText =
(
599,R,Yyyyyyy
600,R,R
601,hhfgdfsghgf,Yyyyyy
601,Yhgshdgf,Dhgfhhg
601,ggg,iii
188,eeeeeeeeeeee,Yyyyyyy
187,Yhtrthytre,Rdsads
)
Output := ""
while pos := RegExMatch(SampleText, "`am)(.+),(.+),\2(*SKIP)(*F)|(.+)", m, A_Index=1?1:pos+StrLen(m)) {
Output .= m3 "`n"
}
while pos := RegExMatch(Output, "`am)((.+),.*\R)(\2.*\R)+", m, A_Index=1?1:pos+StrLen(m)) {
Output := RegExReplace(Output, m, m3)
}
MsgBox, % Output
return
/* Output:
599,R,Yyyyyyy
601,ggg,iii
188,eeeeeeeeeeee,Yyyyyyy
187,Yhtrthytre,Rdsads
*/
Re: Deleting rows from csv file where different types of duplicates are occurring
Long-loop longing....
Code: Select all
file := StrReplace(A_ScriptFullPath, ".ahk", ".txt")
If !FileExist(file) {
MsgBox, 48, Error, File not found. Aborting.`n`n%file%
Return
} Else line := [], has := {}, n := 0, new := ""
Loop, Read, %file%
{ part := []
Loop, Parse, A_LoopReadLine, CSV
part.Push(A_LoopField)
If (part.2 != part.3)
line.Push(A_LoopReadLine), has[part.1] := has[part.1].Count() ? has[part.1] : []
, has[part.1].Push(++n)
}
For k1, v1 in has
For k2, v2 in v1
If (A_Index < v1.Count())
line.Delete(v2)
For k, v in line
new .= v > "" ? "`n" v : ""
MsgBox, 64, Result, % SubStr(new, 2)
Re: Deleting rows from csv file where different types of duplicates are occurring Topic is solved
Another example:
Comments added for clarity.
HTH
Code: Select all
#NoEnv
SetBatchLines, -1
; testing csv:
csvfile =
(
599,R,Yyyyyyy
600,R,R
601,hhfgdfsghgf,Yyyyyy
601,Yhgshdgf,Dhgfhhg
601,ggg,iii
188,eeeeeeeeeeee,Yyyyyyy
187,Yhtrthytre,Rdsads
)
;~ FileRead, csvfile, myfilenamehere.csv
checkString := "|" ; This var is to keep lines in sequential order
out := {} ; This array will hold our rows we keep
Loop, Parse, csvfile, `n, `r
{
ROW := c1 := c2 := c3 := "" ; Clear temp vars
if ((ROW := A_LoopField) = "") ; Skip empty rows
continue
Loop, Parse, ROW, CSV ; for a true CSV, use a parsing loop rather than StrSplit()
{
switch A_Index
{
case 1:c1 := A_LoopField
case 2:c2 := A_LoopField
case 3:c3 := A_LoopField
}
}
if (c2 = c3) ; Remove row when c2 = c3
continue
out[c1] := ROW ; Only save last c1 cell to keep rows unique
if InStr(checkString, "|" c1 "|")
checkString := StrReplace(checkString, "|" c1 "|", "|") ; Remove last dupe
checkString .= c1 "|" ; Append c1 to end of string
}
checkString := Trim(checkString, "|")
Loop, Parse, checkString, | ; pipe is the delimiter (unquoted in command syntax)
outputCSV .= out[A_LoopField] . "`n"
;~ FileAppend, % RTrim(outputCSV, "`n"), E:\test\myNewFile.txt ; Save
MsgBox % RTrim(outputCSV, "`n")
ExitApp
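For anyone following the logic of the snippet above: the checkString manipulation (delete the ID's earlier position, then append it again) is what moves a re-seen ID to the position of its last occurrence. The same idea, sketched in Python with a list standing in for checkString:

```python
lines = ["599,R,Yyyyyyy", "600,R,R", "601,hhfgdfsghgf,Yyyyyy",
         "601,Yhgshdgf,Dhgfhhg", "601,ggg,iii",
         "188,eeeeeeeeeeee,Yyyyyyy", "187,Yhtrthytre,Rdsads"]

out = {}     # ID -> last surviving row (mirrors the out array above)
order = []   # IDs in order of last occurrence (mirrors checkString)
for line in lines:
    c1, c2, c3 = line.split(",")
    if c2 == c3:
        continue             # drop rows where B equals C
    out[c1] = line           # only the last row per ID is kept
    if c1 in order:
        order.remove(c1)     # remove the ID's earlier position ...
    order.append(c1)         # ... and re-append it at the end

print([out[i] for i in order])
```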
Re: Deleting rows from csv file where different types of duplicates are occurring
@sofista
Thanks for the reply, never too late, and I really appreciate the help. It looks like very neat, short code. The first part worked perfectly on a large test sample, but with the second part something is going wrong on that big sample. I don't understand what the code is doing at a glance, so I can't say what the problem is, but it's not outputting anything.
@mikeyww
That looks like it works. I need to test it now with the actual files in a directory. I'll have to add the code and might have to tweak some things; hopefully it won't take too long and I can get back shortly with a positive result.
Edit: Tried to plug this into my code, but I have the same issues that I had before. I'm not working directly with files by the time I get to doing this duplicate checking. What I have is a variable called totalcontents that has merged all the CSV files from a directory and sorted the rows by the ID field. This totalcontents variable is what would need to be read at the beginning of your code snippet, as in
Code: Select all
Loop, Read, %totalcontents%
except totalcontents isn't a file, so it can't be done that way.
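Since totalcontents is an in-memory string rather than a file (the variable name comes from the post above), the per-line loop has to iterate over the variable itself instead of reading a file. Sketched in Python:

```python
# Hypothetical merged contents; in the post this comes from combining CSV files
totalcontents = "601,hhfgdfsghgf,Yyyyyy\n601,ggg,iii\n188,eeeeeeeeeeee,Yyyyyyy"

ids = []
for line in totalcontents.splitlines():  # loop over the string's lines directly
    ids.append(line.split(",")[0])

print(ids)  # ['601', '601', '188']
```

In AutoHotkey v1 the analogous construct would be a parsing loop over the variable (Loop, Parse) rather than Loop, Read on a filename.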
Re: Deleting rows from csv file where different types of duplicates are occurring
@Xtra
My mind is kind of fried now, so I haven't had a chance to actually look through the code, but I managed to plug it into the code I have and it works.
Thanks very much
And thanks to all who helped. It all looks like super good code that would work with a tweak here or there, if I knew what I was doing.
Re: Deleting rows from csv file where different types of duplicates are occurring
My try at it:
Code: Select all
listtext =
(
599,R,Yyyyyyy
600,R,R
601,hhfgdfsghgf,Yyyyyy
601,Yhgshdgf,Dhgfhhg
601,ggg,iii
188,eeeeeeeeeeee,Yyyyyyy
187,Yhtrthytre,Rdsads
)
array := {}
loop, parse, listtext, `n
{
a := strsplit(a_loopfield, ",")
if (a.2 = a.3)
continue
array[a.1] := a.2 "," a.3
}
for k, v in array
{
csv .= k "," v "`n"
}
MsgBox, % csv
ExitApp
Re: Deleting rows from csv file where different types of duplicates are occurring
This last one does not match the desired output unless the line order does not matter. I love the approach anyway!
Last edited by mikeyww on 27 Jul 2021, 09:20, edited 2 times in total.
Re: Deleting rows from csv file where different types of duplicates are occurring
AFK now, so I haven't had a chance to test. Line order actually might not really matter, but unlike my example (a mistake), I have already pre-sorted the order before it gets to this stage, so it looks like this code will work. I'll check it out tomorrow.
Re: Deleting rows from csv file where different types of duplicates are occurring
If order is needed:
Code: Select all
listtext =
(
599,R,Yyyyyyy
600,R,R
601,hhfgdfsghgf,Yyyyyy
601,Yhgshdgf,Dhgfhhg
601,ggg,iii
188,eeeeeeeeeeee,Yyyyyyy
187,Yhtrthytre,Rdsads
)
array := {}, order := ""
Loop, Parse, listtext, `n
{ a := StrSplit(A_LoopField, ",")
If (a.2 != a.3)
array[a.1] := a.2 "," a.3, order .= a.1 ","
}
While order
RegExMatch(order, "\d+,", m), csv .= "`n" m array[StrReplace(m, ",")], order := StrReplace(order, m)
MsgBox, 64, Result, % csv := SubStr(csv, 2)
Re: Deleting rows from csv file where different types of duplicates are occurring
I simplified my code a little; maybe now it is more understandable. The first while cycle discards rows where columns B and C are the same (\2 is a backreference to group 2). It also includes an enhancement, as it prevents partial matches. For example, if the second line were "600,R,RR", the replaced regular expression would discard the line, while the new version does not.
bazkeys wrote: ↑27 Jul 2021, 01:53
@sofista
Thanks for the reply, never too late, and really appreciate the help. Looks like very neat short code. The first part worked perfectly on a large test sample, second part something is going wrong when working on that big sample. I don't understand what the code is doing at one glance, so I can't say what the problem is, but it's not outputting anything.
As for the second while cycle, it discards two or more consecutive lines, except the last one, when column A is the same. I inferred this condition from the example data you provided, hoping that data is representative. If this condition is not met, the regex may delete unwanted lines.
Code: Select all
SampleText =
(
599,R,Yyyyyyy
600,R,R
601,hhfgdfsghgf,Yyyyyy
601,Yhgshdgf,Dhgfhhg
601,ggg,iii
188,eeeeeeeeeeee,Yyyyyyy
187,Yhtrthytre,Rdsads
)
Output := ""
while pos := RegExMatch(SampleText, "`am)(.+),(.+),\2\R|(.+(\R|\Z))", m, A_Index=1?1:pos+StrLen(m)) {
Output .= m3
}
while pos := RegExMatch(Output, "`am)^((.+),.*\R)(\2.*\R)+", m, A_Index=1?1:pos+StrLen(m)) {
Output := RegExReplace(Output, m, m3)
}
MsgBox, % Output
return
/* Output:
599,R,Yyyyyyy
601,ggg,iii
188,eeeeeeeeeeee,Yyyyyyy
187,Yhtrthytre,Rdsads
*/
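The "consecutive duplicates only" condition described above can be illustrated outside the regex. Grouping merges only adjacent runs, so a sketch with Python's itertools.groupby (hypothetical sample rows) keeps the last line of each consecutive run of identical IDs:

```python
from itertools import groupby

lines = ["599,R,Yyyyyyy", "601,a,b", "601,ggg,iii", "188,x,y"]

# keep only the last line of each run of consecutive identical IDs
result = [list(grp)[-1] for _, grp in groupby(lines, key=lambda s: s.split(",")[0])]

print(result)  # ['599,R,Yyyyyyy', '601,ggg,iii', '188,x,y']
```

If duplicate IDs were not adjacent (e.g. two 601 rows separated by a 188 row), each run would be treated separately, which is exactly the caveat stated above.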
Re: Deleting rows from csv file where different types of duplicates are occurring
@AHKStudent
Thanks for the help. It successfully removed the duplicate IDs, but the first part didn't seem to work with the actual data I had; it didn't remove rows where B = C.
@sofista
Thanks, that almost worked, but it appears to have completely deleted duplicates; i.e., when there are 3 IDs the same, for example all having ID = 601, it needs to keep one (the last one after any B=C matches have been removed).
Thanks for all your help, guys/girls. @Xtra's solution worked and I have embedded it into my code nicely, so I'm going to mark this as solved now.