Analyze field dividers in CSV-file

Get help with using AutoHotkey and its commands and hotkeys
Albireo
Posts: 793
Joined: 16 Oct 2013, 13:53

Analyze field dividers in CSV-file

20 Jun 2019, 19:09

eg. Have understood that Loop Parse, StrVar, CSV is a powerful instruction for CSV-files.

example on different field separators .:

Code: Select all

StrVar = 302;"Emelie";"174,50"`n303;"Jenny";"19,90"	; ";" as delimiter
StrVar = 302,"Emelie","174,50"`n303,"Jenny","19,90"	; "," as delimiter
302;"Emelie";"174,50"
303;"Jenny";"19,90"


1)
Sometimes I have CSV files with ";" as field separator. (or TAB or "," or ..) as a field separator.)
Is it possible to analyze which field separator is used in the CSV-example above?

2)
Is the only way to replace all ";" (semi-colon) field separators to "," (comma) in that CSV-file? (to use Loop Parse, StrVar, CSV)
(the CSV-files I think about. have about 2,5 miljon field dividers would then be replaced)
Albireo
Posts: 793
Joined: 16 Oct 2013, 13:53

Re: Analyze field dividers in CSV-file

21 Jun 2019, 04:50

1) Have started to test / think about whether this can be a possible way?
Check if the field delimiter or quotation mark comes first in the string.
Can be difficult to handle quotes in a field - or..

Code: Select all

#SingleInstance Force

; Test strings!
StrVar = 302;"Emelie";"174,50"`n303;"Jenny";"19,90"	; ";" as delimiter
StrVar = 302,"Emelie","174,50"`n303,"Jenny","19,90"	; "," as delimiter

; StrVar = "Emelie";"174,50",542`n"Jenny";"19,90";303	; ";" as delimiter
; StrVar = "Emelie","174,50",542`n"Jenny","19,90";303	; "," as delimiter

; StrVar = "19,90";Emelie;302;"Row 1"`n"18";Erica;201;"Row 2"	; ";" as delimiter
; StrVar = "19,90",Emelie,302,"Row 1"`n"18",Erica,201,"Row 2"	; "," as delimiter

; -cannot handle quotes in the field

MsgBox ,, Row %A_LineNumber% -> %A_ScriptName%, % StrVar

; Suppose the line contains relevant information
; 1) which character comes first
; FoundPos := InStr(Haystack, Needle , CaseSensitive := false, StartingPos := 1, Occurrence := 1)
; QuoteMark := InStr(StrVar, """", CaseSensitive := false, StartingPos := 1, Occurrence := 1)

; CommaDelim := InStr(StrVar, ",", CaseSensitive := false, StartingPos := 1, Occurrence := 1)	; "," as delimiter
; SemiColDelim := InStr(StrVar, ";", CaseSensitive := false, StartingPos := 1, Occurrence := 1)	; ";" as delimiter
; TabDelim := InStr(StrVar, ";", CaseSensitive := false, StartingPos := 1, Occurrence := 1)	; "TAB" as delimiter
; MsgBox ,, Row %A_LineNumber% -> %A_ScriptName%, % QuoteMark

; StringLength := StrLen(StrVar)

ChrCount = 0
Loop % StrLen(StrVar)
{	ChrCount += 1
	TestStr := SubStr(StrVar, ChrCount, 1)
	
	If ( TestStr = """" )
	{	MsgBox ,, Row %A_LineNumber% -> %A_ScriptName%, % """ is found! (First position)`nPos .: " ChrCount
		Loop % StrLen(StrVar) - A_Index
		{	If ( A_Index = 1 )
				Continue
			
			ChrCount += 1
			TestStr := SubStr(StrVar, A_Index, 1)
			MsgBox ,, Row %A_LineNumber% -> %A_ScriptName%, % "TestStr (pos " ChrCount ") .: " TestStr
			
			If ( TestStr = """" )
			{	MsgBox ,, Row %A_LineNumber% -> %A_ScriptName%, % """ is found! (Second position)`nPos .: " ChrCount
				Break
			}
		}
	}
	
	If ( TestStr = "`;" )
	{	MsgBox ,, Row %A_LineNumber% -> %A_ScriptName%, % """"";"""" is found! `nPos .: " ChrCount
		Break
	}

	If ( TestStr = "," )
	{	MsgBox ,, Row %A_LineNumber% -> %A_ScriptName%, % """"","""" is found! `nPos .: " ChrCount
		Break
	}
		
	MsgBox ,, Row %A_LineNumber% -> %A_ScriptName%, % "TestStr (pos " ChrCount ") .: " TestStr
}

MsgBox ,, Row %A_LineNumber% -> %A_ScriptName%, % "The delimiter must be " TestStr
ExitApp
Albireo
Posts: 793
Joined: 16 Oct 2013, 13:53

Re: Analyze field dividers in CSV-file

21 Jun 2019, 18:57

For simpler test strings, it seems to work! (will try to do some more complicated tests...)
Albireo
Posts: 793
Joined: 16 Oct 2013, 13:53

Re: Analyze field dividers in CSV-file

22 Jun 2019, 08:36

I have found an idea how to solve this with PowerShell Autodetecting CSV Delimiters

From the description .:
To automatically detect the delimiter used by a CSV file, Get-CSVDelimiter uses a heuristic approach which proves to be very fast and dependable. The assumption is that a delimiter must be a character that occurs in each line of the file with the very same frequency. To speed up this process, the function also uses an exclude list with characters like plain letters or numbers that are generally not used as delimiter.
Typically, Get-CSVDelimiter can identify the delimiter after parsing only three to five lines of data. As a side effect, Get-CSVDelimiter now also knows how often the delimiter occured in a line, so it can also report the number of columns.

(The solution with PowerShell)
(but, I don't know how to translate PowerShell to AHK....)



Another solution with C#-code .: Auto-detect CSV Separator
Probably not as safe/good as the last example with PowerShell, but better than mine.
(The solution with #C)


A simple solution with PHP .: Detecting CSV delimiter with PHP
(The solution with PHP)
Maybe not as good as the other methods - Just analyze one line?




Python have a "sniffer" to identfy the delimiter csv – Comma-separated value files (they call it automatically-detecting-dialects" (maybe more complicated to translate into AHK?)



Some discussed strategies .: (from How to programmatically guess whether a CSV file is comma or semicolon delimited
  • Count the number of delimiters in the file. It's a pretty sure bet that whichever delimiter is used the most is the actual delimiter
  • One possible additional check is to see if splitting on a separator produces an equal amount of segments per row. (can probably save time by limiting the comparison to the first 50 or so rows)
  • If every row should have the same number of columns, which I believe is the case with Excel, then, using both commas and semicolons, figure out the number of columns for lines N and N+1. Whichever method (commas or semicolons) produces a different answer is wrong (not the format of the file). You can start at the beginning and you only have to go until one of them is proven incorrect. You don't need header lines or anything. You don't have to read more of the file than is necessary, and it can't ever give you a wrong answer for the format of the file, it just might reach the end and not yet have come to a conclusion. All you need is for the every row has the same number of columns property to hold.
    Count the number of delimiters in the file. It's a pretty sure bet that whichever delimiter is used the most is the actual delimiter
Another link .: Scala: Auto detection of delimiter/separator in CSV file

(A little surprised that this feature is not available in AHK)

Return to “Ask For Help”

Who is online

Users browsing this forum: ilhom, JoeWinograd and 177 guests