Jump to content

Sky Slate Blueberry Blackcurrant Watermelon Strawberry Orange Banana Apple Emerald Chocolate
Photo

How Can I Parse a CSV File with Multi-Line Values


  • Please log in to reply
22 replies to this topic
Rhys
  • Members
  • 761 posts
  • Last active: Aug 09 2013 04:53 PM
  • Joined: 17 Apr 2007
My Format4CSV() raised a question in my mind.

How can we correctly read and parse a CSV file that includes multi-line values?

Reference:
<!-- m -->http://www.creativys... ... m#EmbedBRs<!-- m -->
<!-- m -->http://en.wikipedia....eparated_values<!-- m -->

Example:
To,Subject,Body,Status
Rhys,Hello,"Hey,
How the heck do I ""parse"" this file?
-Me",Sent
Rhys,Dunno,I dunno man....,Sent

The above code when saved as a CSV should have only 3 rows, even though the raw file has 5 lines:
Posted Image
Sourcefile: <!-- m -->https://ahknet.autoh...k/multiline.csv<!-- m -->

derRaphael
  • Members
  • 872 posts
  • Last active: Mar 19 2013 04:42 PM
  • Joined: 23 Nov 2007
using this data: multiline.csv

fileselectfile, csvfile
fileread,csv,%csvfile%
loop,parse,csv,`r,`n
	Loop,Parse,A_LoopField,CSV
		MsgBox % A_LoopField

this works :)

greets
derRaphael

All scripts, unless otherwise noted, are hereby released under CC-BY

Krogdor
  • Members
  • 1391 posts
  • Last active: Jun 08 2011 05:31 AM
  • Joined: 18 Apr 2008
DerRaphael, that is not what he means. That will parse a normal CSV fie just fine, but Rhys is looking for something that will parse it even if one of the CSV fields contains one or more newlines... Which sounds tricky. And I have no idea how to do.

Rhys
  • Members
  • 761 posts
  • Last active: Aug 09 2013 04:53 PM
  • Joined: 17 Apr 2007
@Krogdor, it actually works on the example file that I provided.. That CSV was generated by Excel, though, so it may not work on all CSV files if they don't follow the same linebreak rules that Excel does...

Krogdor
  • Members
  • 1391 posts
  • Last active: Jun 08 2011 05:31 AM
  • Joined: 18 Apr 2008
Oh, really? That's interesting... But since the first parsing loop parses at linebreaks, shouldn't it... separate any CSV sections that have linebreaks in them?

derRaphael
  • Members
  • 872 posts
  • Last active: Mar 19 2013 04:42 PM
  • Joined: 23 Nov 2007
@krogdor:

the script just works fine, cuz the linebreak from the csv lines is `n`r
whereas the linebreak from field is `n

the outer loop parses `r and ommits the `n the inner parses the CSV field and dont care about line breaks

All scripts, unless otherwise noted, are hereby released under CC-BY

Krogdor
  • Members
  • 1391 posts
  • Last active: Jun 08 2011 05:31 AM
  • Joined: 18 Apr 2008
Ah, different linebreak types. That would explain it...

derRaphael
  • Members
  • 872 posts
  • Last active: Mar 19 2013 04:42 PM
  • Joined: 23 Nov 2007
Parsing non comma seperated and quotationmarks encapsulated Delimiter Seperated Values

; Proof of Concept to extract DSV (Delimiter Seperator Values)
;      - adapted for AHK by derRaphael / 21st July 2008 -
;                           [email protected]
; Following rules apply:
;   You have to set a delimiter char and an encapsulation char.
;   1) If you're using the delimeter char within your value, the value has 
;      to be surrounded by your encapsulation char. One at beginning and one
;      at its end. 
;   2) If you're using your encapsulation char within your value you have to 
;      double it each time it occurs and surround your value as in rule 1.
; Remarks:
;   The whole concept will break, when using same EOL (End Of Line) as LineBreaks 
;   in a value as in the entire file. Either you will have to escape these chars
;   somehow or use a single linefeed (`n) in values and carriage return linefeed 
;   (`r`n) as EOL in your DSV file.
;   Encapsulation and delimiter chars have to be single Chars. Strings containing
;   more than one char are not supported by concept.

CurrentDSVLine=a,b,c,"d,e","f"","",g",,i

Loop, % ReturnDSVArray(CurrentDSVLine)
   MsgBox % A_Index ": " DSVfield%A_Index%

ReturnDSVArray(CurrentDSVLine, ReturnArray="DSVfield", Delimiter=",", Encapsulator="""")
{
	global
	if ((StrLen(Delimiter)!=1)||(StrLen(Encapsulator)!=1)) {
		return -1                            ; return -1 indicating an error ... 
	}
	SetFormat,integer,H                      ; needed for escaping the RegExNeedle properly
	local d := SubStr(ASC(delimiter)+0,2)    ; used as hex notation in the RegExNeedle
	local e := SubStr(ASC(encapsulator)+0,2) ; used as hex notation in the RegExNeedle
	SetFormat,integer,D                      ; no need for Hex values anymore

	local p0 := 1                            ; Start of search at char p0 in DSV Line
	local fieldCount := 0                    ; start off with empty fields.
	CurrentDSVLine .= delimiter              ; Add delimiter, otherwise last field 
	;                                          won't get recognized
	Loop
	{
		Local RegExNeedle := "\" d "(?=(?:[^\" e "]*\" e "[^\" e "]*\" e ")*(?![^\" e "]*\" e "))"
		Local p1 := RegExMatch(CurrentDSVLine,RegExNeedle,tmp,p0)
		; p1 contains now the position of our current delimitor in a 1-based index 
		fieldCount++                         ; add count
		local field := SubStr(CurrentDSVLine,p0,p1-p0)
		; This is the Line you'll have to change if you want different treatment
		; otherwise your resulting fields from the DSV data Line will be stored in AHK array
		if (SubStr(field,1,1)=encapsulator) {
			; This is the exception handling for removing any doubled encapsulators and 
			; leading/trailing encapsulator chars
			field := RegExReplace(field,"^\" e "|\" e "$")
			StringReplace,field,field,% encapsulator encapsulator,%encapsulator%, All
		}
		Local _field := ReturnArray A_Index  ; construct a reference for our ReturnArray name
		%_field% := field                    ; dereference _field and assign our value to it
		if (p1=0) {                          ; p1 is 0 when no more delimitor chars have been found
			fieldCount--                     ; so correct fieldCount due to last appended delimitor
			Break                            ; and exit loop
		} Else
			p0 := p1 + 1                     ; set the start of our RegEx Search to last result 
	}                                        ; added by one
	return fieldCount
}
greets
derRaphael

All scripts, unless otherwise noted, are hereby released under CC-BY

Krogdor
  • Members
  • 1391 posts
  • Last active: Jun 08 2011 05:31 AM
  • Joined: 18 Apr 2008
DerRaphael,
Quite an impressive script, especially that RegEx...! One question though: What exactly is the use of:
local d := SubStr(ASC(delimiter)+0,2)    ; used as hex notation in the RegExNeedle
local e := SubStr(ASC(encapsulator)+0,2)

I'm afraid I don't quite understand ><

derRaphael
  • Members
  • 872 posts
  • Last active: Mar 19 2013 04:42 PM
  • Joined: 23 Nov 2007
i used setformat,integer,h in that script, so singlelettered returnvalues from asc-function look like this 0xNN

the +0 is needed (at least i remember reading about that existing values need that to be represented in actual format) to display hex notation correctly

since every hex starts with 0x, i use a substr(myhex,2) to start off at 2nd char position

u need the encapsulator string, when handling with field, which content contains either a seperator or a encapsulator char
so the field enclosed by those encpasulationchars

traditional MSexcel CSV makes extensive use of this.

whenever u have either a comma or quitationmarks in your field, excel transforms it by enclosing your field with quotationmarks and doubles the used inner quotations if there are any

so a field looking like this:
This is my content having "," inside
becomes this
"This is my content having "","" inside"

greets
derRaphael

All scripts, unless otherwise noted, are hereby released under CC-BY

n-l-i-d
  • Guests
  • Last active:
  • Joined: --
Nice one! 8)

berban_
  • Members
  • 202 posts
  • Last active: Aug 05 2014 11:52 PM
  • Joined: 16 Mar 2011
I have a function for this task which is still a "work in progress" in that it probably uses a lot of extra memory beyond the optimal amount and is slower but it is pretty capable and I use it all the time in real applications. So I thought I'd link it. Here it is! <!-- l --><a class="postlink-local" href="http://www.autohotkey.com/community/viewtopic.php?t=78105">viewtopic.php?t=78105</a><!-- l -->

JnLLnd
  • Members
  • 193 posts
  • Last active: Jul 23 2015 02:15 AM
  • Joined: 30 Dec 2007

Parsing non comma seperated and quotationmarks encapsulated Delimiter Seperated Values
 

 

 

 

Thank you for this function. Nice RexEx work. However, I'm confused with the array returned by this function. Usually, I can retrieve data from AHK arrays this way:

 

 

obj := Array("one","two","three")
MsgBox, % obj[2]

 

 

But the array DSVfield does not behave the same way:

 

str := "one,two,three"
Loop, % ReturnDSVArray(str)
  MsgBox % A_Index ": " DSVfield%A_Index% ; this works
MsgBox, % DSVfield[2] ; should return "two" but does NOT

 

 

 

Could you help me understand why DSVfield[2] returns a blank?

 

Thanks.



tank
  • Administrators
  • 4345 posts
  • AutoHotkey Foundation
  • Last active: May 02 2019 09:16 PM
  • Joined: 21 Dec 2007

this was done with AHK's old not actual arrays

you willneed to use the first method

DSVfield2 for instance (no brackets)


Never lose.
WIN or LEARN.

JnLLnd
  • Members
  • 193 posts
  • Last active: Jul 23 2015 02:15 AM
  • Joined: 30 Dec 2007

Thank you for reply, Tank. Unfortunately, things are going a little weird here...

 
When the last line of MyFunction is  "MsgBox, % DSVfield2" this field (#2) returns blank in this line and ALSO in the loop just before.
 
If I change the last line of MyFunction to  "MsgBox, % DSVfield3" it is now field (#3) that returns blank in this line and ALSO in the loop just before! How is it possible?
 
(note: The actual MyFunction here does nothing useful since I removed everything that was not relevant to my question)
 
file=
(
one,two,three
1,2,3
)
FileDelete, test.csv
FileAppend, %file%, test.csv
MyFunction("test.csv")
return

MyFunction(strFilePath)
{
	FileRead, strData, %strFilePath%
	Loop, Parse, strData, `n
	{
		MsgBox, % A_Loopfield
		loop % ReturnDSVArray(A_LoopField)
		{
			MsgBox, % A_Index ": " DSVfield%A_Index%
		}
		MsgBox, % "?: " DSVfield2
	}
}

ReturnDSVArray(CurrentDSVLine, ReturnArray="DSVfield", Delimiter=",", Encapsulator="""")
; Delimiter Seperated Values by DerRaphael
; http://www.autohotkey.com/forum/post-203280.html#203280
;
; Proof of Concept to extract DSV (Delimiter Seperator Values)
;      - adapted for AHK by derRaphael / 21st July 2008 -
;                           [email protected]
; Following rules apply:
;   You have to set a delimiter char and an encapsulation char.
;   1) If you're using the delimeter char within your value, the value has
;      to be surrounded by your encapsulation char. One at beginning and one
;      at its end.
;   2) If you're using your encapsulation char within your value you have to
;      double it each time it occurs and surround your value as in rule 1.
; Remarks:
;   The whole concept will break, when using same EOL (End Of Line) as LineBreaks
;   in a value as in the entire file. Either you will have to escape these chars
;   somehow or use a single linefeed (`n) in values and carriage return linefeed
;   (`r`n) as EOL in your DSV file.
;   Encapsulation and delimiter chars have to be single Chars. Strings containing
;   more than one char are not supported by concept.
;CurrentDSVLine=a,b,c,"d,e","f"","",g",,i
;
;Loop, % ReturnDSVArray(CurrentDSVLine)
;   MsgBox % A_Index ": " DSVfield%A_Index%
{
	global
	if ((StrLen(Delimiter)!=1)||(StrLen(Encapsulator)!=1)) {
		return -1                            ; return -1 indicating an error ... 
	}
	SetFormat,integer,H                      ; needed for escaping the RegExNeedle properly
	local d := SubStr(ASC(delimiter)+0,2)    ; used as hex notation in the RegExNeedle
	local e := SubStr(ASC(encapsulator)+0,2) ; used as hex notation in the RegExNeedle
	SetFormat,integer,D                      ; no need for Hex values anymore

	local p0 := 1                            ; Start of search at char p0 in DSV Line
	local fieldCount := 0                    ; start off with empty fields.
	CurrentDSVLine .= delimiter              ; Add delimiter, otherwise last field 
	;                                          won't get recognized
	Loop
	{
		Local RegExNeedle := "\" d "(?=(?:[^\" e "]*\" e "[^\" e "]*\" e ")*(?![^\" e "]*\" e "))"
		Local p1 := RegExMatch(CurrentDSVLine,RegExNeedle,tmp,p0)
		; p1 contains now the position of our current delimitor in a 1-based index 
		fieldCount++                         ; add count
		local field := SubStr(CurrentDSVLine,p0,p1-p0)
		; This is the Line you'll have to change if you want different treatment
		; otherwise your resulting fields from the DSV data Line will be stored in AHK array
		if (SubStr(field,1,1)=encapsulator) {
			; This is the exception handling for removing any doubled encapsulators and 
			; leading/trailing encapsulator chars
			field := RegExReplace(field,"^\" e "|\" e "$")
			StringReplace,field,field,% encapsulator encapsulator,%encapsulator%, All
		}
		Local _field := ReturnArray A_Index  ; construct a reference for our ReturnArray name
		%_field% := field                    ; dereference _field and assign our value to it
		if (p1=0) {                          ; p1 is 0 when no more delimitor chars have been found
			fieldCount--                     ; so correct fieldCount due to last appended delimitor
			Break                            ; and exit loop
		} Else
			p0 := p1 + 1                     ; set the start of our RegEx Search to last result 
	}                                        ; added by one
	return fieldCount
}