AutoHotkey Homepage AutoHotkey Community
Let's help each other out
 
 FAQFAQ   SearchSearch   MemberlistMemberlist   RegisterRegister 
 ProfileProfile   Log in to check your private messagesLog in to check your private messages   Log inLog in 

How Can I Parse a CSV File with Multi-Line Values

 
Reply to topic    AutoHotkey Community Forum Index -> Ask for Help
View previous topic :: View next topic  
Author Message
Rhys



Joined: 17 Apr 2007
Posts: 761
Location: Florida

PostPosted: Fri Jun 20, 2008 8:17 pm    Post subject: How Can I Parse a CSV File with Multi-Line Values Reply with quote

My Format4CSV() raised a question in my mind.

How can we correctly read and parse a CSV file that includes multi-line values?

Reference:
http://www.creativyst.com/Doc/Articles/CSV/CSV01.htm#EmbedBRs
http://en.wikipedia.org/wiki/Comma-separated_values

Example:
Code:
To,Subject,Body,Status
Rhys,Hello,"Hey,
How the heck do I ""parse"" this file?
-Me",Sent
Rhys,Dunno,I dunno man....,Sent


The above code when saved as a CSV should have only 3 rows, even though the raw file has 5 lines:

Sourcefile: http://www.autohotkey.net/~Rhys/Junk/multiline.csv
_________________
[Join IRC!]
Back to top
View user's profile Send private message
derRaphael



Joined: 23 Nov 2007
Posts: 841
Location: ~/.

PostPosted: Fri Jun 20, 2008 9:00 pm    Post subject: Reply with quote

using this data: multiline.csv

Code:
fileselectfile, csvfile
fileread,csv,%csvfile%
loop,parse,csv,`r,`n
   Loop,Parse,A_LoopField,CSV
      MsgBox % A_LoopField


this works Smile

greets
derRaphael
_________________

    All scripts, unless otherwise noted, are hereby released under CC-BY
Back to top
View user's profile Send private message
Krogdor



Joined: 18 Apr 2008
Posts: 1390
Location: The Interwebs

PostPosted: Fri Jun 20, 2008 9:07 pm    Post subject: Reply with quote

DerRaphael, that is not what he means. That will parse a normal CSV fie just fine, but Rhys is looking for something that will parse it even if one of the CSV fields contains one or more newlines... Which sounds tricky. And I have no idea how to do.
Back to top
View user's profile Send private message AIM Address
Rhys



Joined: 17 Apr 2007
Posts: 761
Location: Florida

PostPosted: Fri Jun 20, 2008 10:03 pm    Post subject: Reply with quote

@Krogdor, it actually works on the example file that I provided.. That CSV was generated by Excel, though, so it may not work on all CSV files if they don't follow the same linebreak rules that Excel does...
_________________
[Join IRC!]
Back to top
View user's profile Send private message
Krogdor



Joined: 18 Apr 2008
Posts: 1390
Location: The Interwebs

PostPosted: Fri Jun 20, 2008 10:08 pm    Post subject: Reply with quote

Oh, really? That's interesting... But since the first parsing loop parses at linebreaks, shouldn't it... separate any CSV sections that have linebreaks in them?
Back to top
View user's profile Send private message AIM Address
derRaphael



Joined: 23 Nov 2007
Posts: 841
Location: ~/.

PostPosted: Fri Jun 20, 2008 11:48 pm    Post subject: Reply with quote

@krogdor:

the script just works fine, cuz the linebreak from the csv lines is `n`r
whereas the linebreak from field is `n

the outer loop parses `r and ommits the `n the inner parses the CSV field and dont care about line breaks
_________________

    All scripts, unless otherwise noted, are hereby released under CC-BY
Back to top
View user's profile Send private message
Krogdor



Joined: 18 Apr 2008
Posts: 1390
Location: The Interwebs

PostPosted: Sat Jun 21, 2008 12:00 am    Post subject: Reply with quote

Ah, different linebreak types. That would explain it...
Back to top
View user's profile Send private message AIM Address
derRaphael



Joined: 23 Nov 2007
Posts: 841
Location: ~/.

PostPosted: Sat Jun 21, 2008 4:25 pm    Post subject: Reply with quote

Parsing non comma seperated and quotationmarks encapsulated Delimiter Seperated Values

Code:
; Proof of Concept to extract DSV (Delimiter Seperator Values)
;      - adapted for AHK by derRaphael / 21st July 2008 -
;                           derRaphael@oleco.net
; Following rules apply:
;   You have to set a delimiter char and an encapsulation char.
;   1) If you're using the delimeter char within your value, the value has
;      to be surrounded by your encapsulation char. One at beginning and one
;      at its end.
;   2) If you're using your encapsulation char within your value you have to
;      double it each time it occurs and surround your value as in rule 1.
; Remarks:
;   The whole concept will break, when using same EOL (End Of Line) as LineBreaks
;   in a value as in the entire file. Either you will have to escape these chars
;   somehow or use a single linefeed (`n) in values and carriage return linefeed
;   (`r`n) as EOL in your DSV file.
;   Encapsulation and delimiter chars have to be single Chars. Strings containing
;   more than one char are not supported by concept.

CurrentDSVLine=a,b,c,"d,e","f"","",g",,i

Loop, % ReturnDSVArray(CurrentDSVLine)
   MsgBox % A_Index ": " DSVfield%A_Index%

ReturnDSVArray(CurrentDSVLine, ReturnArray="DSVfield", Delimiter=",", Encapsulator="""")
{
   global
   if ((StrLen(Delimiter)!=1)||(StrLen(Encapsulator)!=1)) {
      return -1                            ; return -1 indicating an error ...
   }
   SetFormat,integer,H                      ; needed for escaping the RegExNeedle properly
   local d := SubStr(ASC(delimiter)+0,2)    ; used as hex notation in the RegExNeedle
   local e := SubStr(ASC(encapsulator)+0,2) ; used as hex notation in the RegExNeedle
   SetFormat,integer,D                      ; no need for Hex values anymore

   local p0 := 1                            ; Start of search at char p0 in DSV Line
   local fieldCount := 0                    ; start off with empty fields.
   CurrentDSVLine .= delimiter              ; Add delimiter, otherwise last field
   ;                                          won't get recognized
   Loop
   {
      Local RegExNeedle := "\" d "(?=(?:[^\" e "]*\" e "[^\" e "]*\" e ")*(?![^\" e "]*\" e "))"
      Local p1 := RegExMatch(CurrentDSVLine,RegExNeedle,tmp,p0)
      ; p1 contains now the position of our current delimitor in a 1-based index
      fieldCount++                         ; add count
      local field := SubStr(CurrentDSVLine,p0,p1-p0)
      ; This is the Line you'll have to change if you want different treatment
      ; otherwise your resulting fields from the DSV data Line will be stored in AHK array
      if (SubStr(field,1,1)=encapsulator) {
         ; This is the exception handling for removing any doubled encapsulators and
         ; leading/trailing encapsulator chars
         field := RegExReplace(field,"^\" e "|\" e "$")
         StringReplace,field,field,% encapsulator encapsulator,%encapsulator%, All
      }
      Local _field := ReturnArray A_Index  ; construct a reference for our ReturnArray name
      %_field% := field                    ; dereference _field and assign our value to it
      if (p1=0) {                          ; p1 is 0 when no more delimitor chars have been found
         fieldCount--                     ; so correct fieldCount due to last appended delimitor
         Break                            ; and exit loop
      } Else
         p0 := p1 + 1                     ; set the start of our RegEx Search to last result
   }                                        ; added by one
   return fieldCount
}

greets
derRaphael
_________________

    All scripts, unless otherwise noted, are hereby released under CC-BY
Back to top
View user's profile Send private message
Krogdor



Joined: 18 Apr 2008
Posts: 1390
Location: The Interwebs

PostPosted: Sat Jun 21, 2008 6:07 pm    Post subject: Reply with quote

DerRaphael,
Quite an impressive script, especially that RegEx...! One question though: What exactly is the use of:
Code:
local d := SubStr(ASC(delimiter)+0,2)    ; used as hex notation in the RegExNeedle
local e := SubStr(ASC(encapsulator)+0,2) 


I'm afraid I don't quite understand ><
Back to top
View user's profile Send private message AIM Address
derRaphael



Joined: 23 Nov 2007
Posts: 841
Location: ~/.

PostPosted: Mon Jun 23, 2008 12:58 pm    Post subject: Reply with quote

i used setformat,integer,h in that script, so singlelettered returnvalues from asc-function look like this 0xNN

the +0 is needed (at least i remember reading about that existing values need that to be represented in actual format) to display hex notation correctly

since every hex starts with 0x, i use a substr(myhex,2) to start off at 2nd char position

u need the encapsulator string, when handling with field, which content contains either a seperator or a encapsulator char
so the field enclosed by those encpasulationchars

traditional MSexcel CSV makes extensive use of this.

whenever u have either a comma or quitationmarks in your field, excel transforms it by enclosing your field with quotationmarks and doubles the used inner quotations if there are any

so a field looking like this:
Code:
This is my content having "," inside

becomes this
Code:
"This is my content having "","" inside"


greets
derRaphael
_________________

    All scripts, unless otherwise noted, are hereby released under CC-BY
Back to top
View user's profile Send private message
n-l-i-d
Guest





PostPosted: Mon Jun 23, 2008 1:20 pm    Post subject: Reply with quote

Nice one! Cool
Back to top
Display posts from previous:   
Reply to topic    AutoHotkey Community Forum Index -> Ask for Help All times are GMT
Page 1 of 1

 
Jump to:  
You can post new topics in this forum
You can reply to topics in this forum


Powered by phpBB © 2001, 2005 phpBB Group