 |
AutoHotkey Community Let's help each other out
|
| View previous topic :: View next topic |
| Author |
Message |
Rhys
Joined: 17 Apr 2007 Posts: 761 Location: Florida
|
Posted: Fri Jun 20, 2008 8:17 pm Post subject: How Can I Parse a CSV File with Multi-Line Values |
|
|
My Format4CSV() raised a question in my mind.
How can we correctly read and parse a CSV file that includes multi-line values?
Reference:
http://www.creativyst.com/Doc/Articles/CSV/CSV01.htm#EmbedBRs
http://en.wikipedia.org/wiki/Comma-separated_values
Example: | Code: | To,Subject,Body,Status
Rhys,Hello,"Hey,
How the heck do I ""parse"" this file?
-Me",Sent
Rhys,Dunno,I dunno man....,Sent |
The above code when saved as a CSV should have only 3 rows, even though the raw file has 5 lines:
Sourcefile: http://www.autohotkey.net/~Rhys/Junk/multiline.csv _________________ [Join IRC!]
 |
|
| Back to top |
|
 |
derRaphael
Joined: 23 Nov 2007 Posts: 841 Location: ~/.
|
Posted: Fri Jun 20, 2008 9:00 pm Post subject: |
|
|
using this data: multiline.csv
| Code: | fileselectfile, csvfile
fileread,csv,%csvfile%
loop,parse,csv,`r,`n
Loop,Parse,A_LoopField,CSV
MsgBox % A_LoopField |
this works
greets
derRaphael _________________
All scripts, unless otherwise noted, are hereby released under CC-BY |
|
| Back to top |
|
 |
Krogdor
Joined: 18 Apr 2008 Posts: 1390 Location: The Interwebs
|
Posted: Fri Jun 20, 2008 9:07 pm Post subject: |
|
|
| DerRaphael, that is not what he means. That will parse a normal CSV fie just fine, but Rhys is looking for something that will parse it even if one of the CSV fields contains one or more newlines... Which sounds tricky. And I have no idea how to do. |
|
| Back to top |
|
 |
Rhys
Joined: 17 Apr 2007 Posts: 761 Location: Florida
|
Posted: Fri Jun 20, 2008 10:03 pm Post subject: |
|
|
@Krogdor, it actually works on the example file that I provided.. That CSV was generated by Excel, though, so it may not work on all CSV files if they don't follow the same linebreak rules that Excel does... _________________ [Join IRC!]
 |
|
| Back to top |
|
 |
Krogdor
Joined: 18 Apr 2008 Posts: 1390 Location: The Interwebs
|
Posted: Fri Jun 20, 2008 10:08 pm Post subject: |
|
|
| Oh, really? That's interesting... But since the first parsing loop parses at linebreaks, shouldn't it... separate any CSV sections that have linebreaks in them? |
|
| Back to top |
|
 |
derRaphael
Joined: 23 Nov 2007 Posts: 841 Location: ~/.
|
Posted: Fri Jun 20, 2008 11:48 pm Post subject: |
|
|
@krogdor:
the script just works fine, cuz the linebreak from the csv lines is `n`r
whereas the linebreak from field is `n
the outer loop parses `r and ommits the `n the inner parses the CSV field and dont care about line breaks _________________
All scripts, unless otherwise noted, are hereby released under CC-BY |
|
| Back to top |
|
 |
Krogdor
Joined: 18 Apr 2008 Posts: 1390 Location: The Interwebs
|
Posted: Sat Jun 21, 2008 12:00 am Post subject: |
|
|
| Ah, different linebreak types. That would explain it... |
|
| Back to top |
|
 |
derRaphael
Joined: 23 Nov 2007 Posts: 841 Location: ~/.
|
Posted: Sat Jun 21, 2008 4:25 pm Post subject: |
|
|
Parsing non comma seperated and quotationmarks encapsulated Delimiter Seperated Values
| Code: | ; Proof of Concept to extract DSV (Delimiter Seperator Values)
; - adapted for AHK by derRaphael / 21st July 2008 -
; derRaphael@oleco.net
; Following rules apply:
; You have to set a delimiter char and an encapsulation char.
; 1) If you're using the delimeter char within your value, the value has
; to be surrounded by your encapsulation char. One at beginning and one
; at its end.
; 2) If you're using your encapsulation char within your value you have to
; double it each time it occurs and surround your value as in rule 1.
; Remarks:
; The whole concept will break, when using same EOL (End Of Line) as LineBreaks
; in a value as in the entire file. Either you will have to escape these chars
; somehow or use a single linefeed (`n) in values and carriage return linefeed
; (`r`n) as EOL in your DSV file.
; Encapsulation and delimiter chars have to be single Chars. Strings containing
; more than one char are not supported by concept.
CurrentDSVLine=a,b,c,"d,e","f"","",g",,i
Loop, % ReturnDSVArray(CurrentDSVLine)
MsgBox % A_Index ": " DSVfield%A_Index%
ReturnDSVArray(CurrentDSVLine, ReturnArray="DSVfield", Delimiter=",", Encapsulator="""")
{
global
if ((StrLen(Delimiter)!=1)||(StrLen(Encapsulator)!=1)) {
return -1 ; return -1 indicating an error ...
}
SetFormat,integer,H ; needed for escaping the RegExNeedle properly
local d := SubStr(ASC(delimiter)+0,2) ; used as hex notation in the RegExNeedle
local e := SubStr(ASC(encapsulator)+0,2) ; used as hex notation in the RegExNeedle
SetFormat,integer,D ; no need for Hex values anymore
local p0 := 1 ; Start of search at char p0 in DSV Line
local fieldCount := 0 ; start off with empty fields.
CurrentDSVLine .= delimiter ; Add delimiter, otherwise last field
; won't get recognized
Loop
{
Local RegExNeedle := "\" d "(?=(?:[^\" e "]*\" e "[^\" e "]*\" e ")*(?![^\" e "]*\" e "))"
Local p1 := RegExMatch(CurrentDSVLine,RegExNeedle,tmp,p0)
; p1 contains now the position of our current delimitor in a 1-based index
fieldCount++ ; add count
local field := SubStr(CurrentDSVLine,p0,p1-p0)
; This is the Line you'll have to change if you want different treatment
; otherwise your resulting fields from the DSV data Line will be stored in AHK array
if (SubStr(field,1,1)=encapsulator) {
; This is the exception handling for removing any doubled encapsulators and
; leading/trailing encapsulator chars
field := RegExReplace(field,"^\" e "|\" e "$")
StringReplace,field,field,% encapsulator encapsulator,%encapsulator%, All
}
Local _field := ReturnArray A_Index ; construct a reference for our ReturnArray name
%_field% := field ; dereference _field and assign our value to it
if (p1=0) { ; p1 is 0 when no more delimitor chars have been found
fieldCount-- ; so correct fieldCount due to last appended delimitor
Break ; and exit loop
} Else
p0 := p1 + 1 ; set the start of our RegEx Search to last result
} ; added by one
return fieldCount
} |
greets
derRaphael _________________
All scripts, unless otherwise noted, are hereby released under CC-BY |
|
| Back to top |
|
 |
Krogdor
Joined: 18 Apr 2008 Posts: 1390 Location: The Interwebs
|
Posted: Sat Jun 21, 2008 6:07 pm Post subject: |
|
|
DerRaphael,
Quite an impressive script, especially that RegEx...! One question though: What exactly is the use of:
| Code: | local d := SubStr(ASC(delimiter)+0,2) ; used as hex notation in the RegExNeedle
local e := SubStr(ASC(encapsulator)+0,2) |
I'm afraid I don't quite understand >< |
|
| Back to top |
|
 |
derRaphael
Joined: 23 Nov 2007 Posts: 841 Location: ~/.
|
Posted: Mon Jun 23, 2008 12:58 pm Post subject: |
|
|
i used setformat,integer,h in that script, so singlelettered returnvalues from asc-function look like this 0xNN
the +0 is needed (at least i remember reading about that existing values need that to be represented in actual format) to display hex notation correctly
since every hex starts with 0x, i use a substr(myhex,2) to start off at 2nd char position
u need the encapsulator string, when handling with field, which content contains either a seperator or a encapsulator char
so the field enclosed by those encpasulationchars
traditional MSexcel CSV makes extensive use of this.
whenever u have either a comma or quitationmarks in your field, excel transforms it by enclosing your field with quotationmarks and doubles the used inner quotations if there are any
so a field looking like this:
| Code: | | This is my content having "," inside |
becomes this
| Code: | | "This is my content having "","" inside" |
greets
derRaphael _________________
All scripts, unless otherwise noted, are hereby released under CC-BY |
|
| Back to top |
|
 |
n-l-i-d Guest
|
Posted: Mon Jun 23, 2008 1:20 pm Post subject: |
|
|
Nice one!  |
|
| Back to top |
|
 |
|
|
You can post new topics in this forum You can reply to topics in this forum
|
Powered by phpBB © 2001, 2005 phpBB Group
|