AutoHotkey Community

It is currently May 26th, 2012, 8:22 am

All times are UTC [ DST ]




Post new topic Reply to topic  [ 11 posts ] 
Author Message
PostPosted: June 20th, 2008, 9:17 pm 
Offline

Joined: April 17th, 2007, 1:37 pm
Posts: 761
Location: Florida
My Format4CSV() raised a question in my mind.

How can we correctly read and parse a CSV file that includes multi-line values?

Reference:
http://www.creativyst.com/Doc/Articles/ ... m#EmbedBRs
http://en.wikipedia.org/wiki/Comma-separated_values

Example:
Code:
To,Subject,Body,Status
Rhys,Hello,"Hey,
How the heck do I ""parse"" this file?
-Me",Sent
Rhys,Dunno,I dunno man....,Sent


The above code when saved as a CSV should have only 3 rows, even though the raw file has 5 lines:
Image
Sourcefile: http://www.autohotkey.net/~Rhys/Junk/multiline.csv

_________________
[Join IRC!]
Image


Report this post
Top
 Profile  
Reply with quote  
 Post subject:
PostPosted: June 20th, 2008, 10:00 pm 
Offline

Joined: November 23rd, 2007, 10:23 am
Posts: 841
Location: ~/.
using this data: multiline.csv

Code:
fileselectfile, csvfile
fileread,csv,%csvfile%
loop,parse,csv,`r,`n
   Loop,Parse,A_LoopField,CSV
      MsgBox % A_LoopField


this works :)

greets
derRaphael

_________________
Image
    All scripts, unless otherwise noted, are hereby released under CC-BY


Report this post
Top
 Profile  
Reply with quote  
 Post subject:
PostPosted: June 20th, 2008, 10:07 pm 
Offline

Joined: April 18th, 2008, 7:57 am
Posts: 1390
Location: The Interwebs
DerRaphael, that is not what he means. That will parse a normal CSV fie just fine, but Rhys is looking for something that will parse it even if one of the CSV fields contains one or more newlines... Which sounds tricky. And I have no idea how to do.


Report this post
Top
 Profile  
Reply with quote  
 Post subject:
PostPosted: June 20th, 2008, 11:03 pm 
Offline

Joined: April 17th, 2007, 1:37 pm
Posts: 761
Location: Florida
@Krogdor, it actually works on the example file that I provided.. That CSV was generated by Excel, though, so it may not work on all CSV files if they don't follow the same linebreak rules that Excel does...

_________________
[Join IRC!]
Image


Report this post
Top
 Profile  
Reply with quote  
 Post subject:
PostPosted: June 20th, 2008, 11:08 pm 
Offline

Joined: April 18th, 2008, 7:57 am
Posts: 1390
Location: The Interwebs
Oh, really? That's interesting... But since the first parsing loop parses at linebreaks, shouldn't it... separate any CSV sections that have linebreaks in them?


Report this post
Top
 Profile  
Reply with quote  
 Post subject:
PostPosted: June 21st, 2008, 12:48 am 
Offline

Joined: November 23rd, 2007, 10:23 am
Posts: 841
Location: ~/.
@krogdor:

the script just works fine, cuz the linebreak from the csv lines is `n`r
whereas the linebreak from field is `n

the outer loop parses `r and ommits the `n the inner parses the CSV field and dont care about line breaks

_________________
Image
    All scripts, unless otherwise noted, are hereby released under CC-BY


Report this post
Top
 Profile  
Reply with quote  
 Post subject:
PostPosted: June 21st, 2008, 1:00 am 
Offline

Joined: April 18th, 2008, 7:57 am
Posts: 1390
Location: The Interwebs
Ah, different linebreak types. That would explain it...


Report this post
Top
 Profile  
Reply with quote  
 Post subject:
PostPosted: June 21st, 2008, 5:25 pm 
Offline

Joined: November 23rd, 2007, 10:23 am
Posts: 841
Location: ~/.
Parsing non comma seperated and quotationmarks encapsulated Delimiter Seperated Values

Code:
; Proof of Concept to extract DSV (Delimiter Seperator Values)
;      - adapted for AHK by derRaphael / 21st July 2008 -
;                           derRaphael@oleco.net
; Following rules apply:
;   You have to set a delimiter char and an encapsulation char.
;   1) If you're using the delimeter char within your value, the value has
;      to be surrounded by your encapsulation char. One at beginning and one
;      at its end.
;   2) If you're using your encapsulation char within your value you have to
;      double it each time it occurs and surround your value as in rule 1.
; Remarks:
;   The whole concept will break, when using same EOL (End Of Line) as LineBreaks
;   in a value as in the entire file. Either you will have to escape these chars
;   somehow or use a single linefeed (`n) in values and carriage return linefeed
;   (`r`n) as EOL in your DSV file.
;   Encapsulation and delimiter chars have to be single Chars. Strings containing
;   more than one char are not supported by concept.

CurrentDSVLine=a,b,c,"d,e","f"","",g",,i

Loop, % ReturnDSVArray(CurrentDSVLine)
   MsgBox % A_Index ": " DSVfield%A_Index%

ReturnDSVArray(CurrentDSVLine, ReturnArray="DSVfield", Delimiter=",", Encapsulator="""")
{
   global
   if ((StrLen(Delimiter)!=1)||(StrLen(Encapsulator)!=1)) {
      return -1                            ; return -1 indicating an error ...
   }
   SetFormat,integer,H                      ; needed for escaping the RegExNeedle properly
   local d := SubStr(ASC(delimiter)+0,2)    ; used as hex notation in the RegExNeedle
   local e := SubStr(ASC(encapsulator)+0,2) ; used as hex notation in the RegExNeedle
   SetFormat,integer,D                      ; no need for Hex values anymore

   local p0 := 1                            ; Start of search at char p0 in DSV Line
   local fieldCount := 0                    ; start off with empty fields.
   CurrentDSVLine .= delimiter              ; Add delimiter, otherwise last field
   ;                                          won't get recognized
   Loop
   {
      Local RegExNeedle := "\" d "(?=(?:[^\" e "]*\" e "[^\" e "]*\" e ")*(?![^\" e "]*\" e "))"
      Local p1 := RegExMatch(CurrentDSVLine,RegExNeedle,tmp,p0)
      ; p1 contains now the position of our current delimitor in a 1-based index
      fieldCount++                         ; add count
      local field := SubStr(CurrentDSVLine,p0,p1-p0)
      ; This is the Line you'll have to change if you want different treatment
      ; otherwise your resulting fields from the DSV data Line will be stored in AHK array
      if (SubStr(field,1,1)=encapsulator) {
         ; This is the exception handling for removing any doubled encapsulators and
         ; leading/trailing encapsulator chars
         field := RegExReplace(field,"^\" e "|\" e "$")
         StringReplace,field,field,% encapsulator encapsulator,%encapsulator%, All
      }
      Local _field := ReturnArray A_Index  ; construct a reference for our ReturnArray name
      %_field% := field                    ; dereference _field and assign our value to it
      if (p1=0) {                          ; p1 is 0 when no more delimitor chars have been found
         fieldCount--                     ; so correct fieldCount due to last appended delimitor
         Break                            ; and exit loop
      } Else
         p0 := p1 + 1                     ; set the start of our RegEx Search to last result
   }                                        ; added by one
   return fieldCount
}

greets
derRaphael

_________________
Image
    All scripts, unless otherwise noted, are hereby released under CC-BY


Report this post
Top
 Profile  
Reply with quote  
 Post subject:
PostPosted: June 21st, 2008, 7:07 pm 
Offline

Joined: April 18th, 2008, 7:57 am
Posts: 1390
Location: The Interwebs
DerRaphael,
Quite an impressive script, especially that RegEx...! One question though: What exactly is the use of:
Code:
local d := SubStr(ASC(delimiter)+0,2)    ; used as hex notation in the RegExNeedle
local e := SubStr(ASC(encapsulator)+0,2) 


I'm afraid I don't quite understand ><


Report this post
Top
 Profile  
Reply with quote  
 Post subject:
PostPosted: June 23rd, 2008, 1:58 pm 
Offline

Joined: November 23rd, 2007, 10:23 am
Posts: 841
Location: ~/.
i used setformat,integer,h in that script, so singlelettered returnvalues from asc-function look like this 0xNN

the +0 is needed (at least i remember reading about that existing values need that to be represented in actual format) to display hex notation correctly

since every hex starts with 0x, i use a substr(myhex,2) to start off at 2nd char position

u need the encapsulator string, when handling with field, which content contains either a seperator or a encapsulator char
so the field enclosed by those encpasulationchars

traditional MSexcel CSV makes extensive use of this.

whenever u have either a comma or quitationmarks in your field, excel transforms it by enclosing your field with quotationmarks and doubles the used inner quotations if there are any

so a field looking like this:
Code:
This is my content having "," inside

becomes this
Code:
"This is my content having "","" inside"


greets
derRaphael

_________________
Image
    All scripts, unless otherwise noted, are hereby released under CC-BY


Report this post
Top
 Profile  
Reply with quote  
 Post subject:
PostPosted: June 23rd, 2008, 2:20 pm 
Nice one! 8)


Report this post
Top
  
Reply with quote  
Display posts from previous:  Sort by  
Post new topic Reply to topic  [ 11 posts ] 

All times are UTC [ DST ]


Who is online

Users browsing this forum: BrandonHotkey, poserpro, RaptorX, sjc1000 and 61 guests


You can post new topics in this forum
You can reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot post attachments in this forum

Search for:
Powered by phpBB® Forum Software © phpBB Group