Parsing CSV files with multi-line fields

JnLLnd

Members
193 posts

Last active: Jul 23 2015 02:15 AM
Joined: 30 Dec 2007

This tutorial will show you how to load and save CSV files with multi-line fields. It is divided in three sections:

Reading and parsing a CSV file with multi-line fields (this post)
Control fields order with the function ObjCSV_CSV2Collection
Converting to a single-line CSV file

In most comma-separated-values (CSV) files, each record stands on its own line, with each field separated by a comma. For example:

Title,Album,Track
She Loves You [Mono],"Past Masters, Vol. 1",4
Eight Days A Week,Beatles For Sale,8
Ticket To Ride,Help!,7
Strawberry Fields Forever,Magical Mystery Tour,8
Get Back,Let It Be,12

But if we add a fourth field with the lyrics of these songs, then we have a multi-line field:

Title,Album,Track,Lyrics
She Loves You [Mono],"Past Masters, Vol. 1",4,"She loves you, yeah, yeah, yeah
She loves you, yeah, yeah, yeah
She loves you, yeah, yeah, yeah, yeah"
Eight Days A Week,Beatles For Sale,8,"Eight days a week
I love you.
Eight days a week
Is not enough to show I care."
Ticket To Ride,Help!,7,"She's got a ticket to ri-hide,
She's got a ticket to ri-hi-hide,
She's got a ticket to ride,
But she don't care."
Strawberry Fields Forever,Magical Mystery Tour,8,"Let me take you down, 'cos I'm going to Strawberry Fields.
Nothing is real and nothing to get hungabout.
Strawberry Fields forever."
Get Back,Let It Be,12,"Get back, get back.
Get back to where you once belonged
Get back, get back.
Get back to where you once belonged."

This is a valid CSV file: fields including line breaks are enclosed in double-quotes. And this is not unusual. For example, you will find multi-line fields in exports from Outlook or Google contacts where the "Notes" field includes line breaks.

But if we try to import these files in MS Excel, we have a problem because each end-of-line is interpreted at the end of a record. In that case, the line "She loves you, yeah, yeah, yeah" in the middle of the lyrics is interpreted as a record with four fields separated by the commas.

How can AutoHotKey help us? Let’s try with the "Loop, Parse, CSV" command. We would have something like this:

#NoEnv
#SingleInstance force
Loop, Read, %A_ScriptDir%\TheBeatles-Lyrics.txt
      if (A_Index = 1)
            MsgBox, Header: %A_LoopReadLine%
      else
            Loop, Parse, A_LoopReadLine, CSV
                  MsgBox, Field #%A_Index%:`n`n%A_LoopField%
return

Unfortunately, this will fall into the same trap as Excel. Line breaks between double-quotes are not considered as part of the fields as they should be. They are again considered as end of records.

There is a solution: replace all line breaks found between double-quotes with a temporary character and replace back this character with line breaks after a record has been parsed. This is easily said but this not a trivial task.

This is one of the things the command "ObjCSV_CSV2Collection" from the ObjCSV library will do for you! Let’s try this code:

NOTE: Before running this script...

Make sure the ObjCSV library is copied in one of these directories:
%A_ScriptDir%\Lib\ (folder Lib under the current script’s folder)
%A_MyDocuments%\AutoHotkey\Lib\ (global AHK’s Lib folder)
[path to the currently running AutoHotkey_L.exe]\Lib\
and copy the file TheBeatles-Lyrics.txt in the same directory as the script.

#NoEnv
#SingleInstance force
objCollection := ObjCSV_CSV2Collection(A_ScriptDir . "\TheBeatles-Lyrics.txt", strFileHeader, 1, 1)
loop, % objCollection.MaxIndex()
      for strFieldName, strFieldValue in objCollection[A_Index]
            MsgBox, %strFieldName%:`n`n%strFieldValue%
return

- Posted 07 September 2013 - 09:46 PM

JnLLnd

Members
193 posts

Last active: Jul 23 2015 02:15 AM
Joined: 30 Dec 2007

The "ObjCSV_CSV2Collection" function reads the CSV file and returns it to an object variable objCollection that contains an array of objects, one for each record (you can learn more on this in my basic ObjCSV tutorial). Let’s take a look at the syntax for this function:

ObjCSV_CSV2Collection(strFilePath, ByRef strFieldNames [, blnHeader = 1, blnMultiline = 1, blnProgress = 0, strFieldDelimiter = ",", strEncapsulator = """", strRecordDelimiter = "`n", strOmitChars = "`r"])

In our example, strFilePath is the name of the file to read. And we have two True (1) Boolean parameters: blnHeader to indicate that he first line of the file contains field names and blnMultiline to indicate that we have a file with multi-line fields.

Remaining parameters are left with the function default values. But what about this "ByRef strFieldNames" parameter? The content of this ByRef parameter is changed by the function to return the name of fields in the objCollection records. This string is in CSV format and fields appear in the order they were found in the CSV file.

Note: the strFieldNames parameter can also be used to tell the function which names to give to fields in a CSV file that has no header. More on this in my third and last tutorial on ObjCSV.

In the above example, you can see that fields are parsed in alphabetical order of field names (Album, Lyrics, Title and Track). This is an internal behavior of the "for strFieldName, strFieldValue in objCollection[A_Index]" command that cannot be changed.

If we want to control this in order to present fields as they appear in the CSV file, we can use the strFileHeader variable returned by the function in "Loop, Parse, CSV" command as shown in this variation of our script:

#NoEnv
#SingleInstance force
objCollection := ObjCSV_CSV2Collection(A_ScriptDir . "\TheBeatles-Lyrics.txt", strFileHeader, 1, 1)
Loop, % objCollection.MaxIndex()
{
      i := A_Index
      Loop, Parse, strFileHeader, `,
            MsgBox, % A_LoopField . ":`n`n" . objCollection[i][A_LoopField]
}
return

Now, each field is displayed in the order we had in the file: Title, Album, Track and Lyrics.

- Posted 07 September 2013 - 09:48 PM

JnLLnd

Members
193 posts

Last active: Jul 23 2015 02:15 AM
Joined: 30 Dec 2007

Finally, let’s say you need to process this file in Excel. The function "ObjCSV_Collection2CSV" of the ObjCSV library will prepare the file for an easy importation in XL: it will replace all line breaks inside fields with a replacement character of your choice.

#NoEnv
#SingleInstance force
objCollection := ObjCSV_CSV2Collection(A_ScriptDir . "\TheBeatles-Lyrics.txt", strFileHeader, 1, 1)
ObjCSV_Collection2CSV(objCollection, A_ScriptDir . "\TheBeatles-Lyrics-SingleLine.txt", 1, strFileHeader, , 1, , , "¶")
run, %A_ScriptDir%\TheBeatles-Lyrics-SingleLine.txt
return

Take a look at the last parameter of "ObjCSV_CSV2Collection". The character that will replace end-of-lines is "¶" (ASCII code 182). You can change this as you wish.

ObjCSV_Collection2CSV(objCollection, strFilePath [, blnHeader = 0, strFieldOrder = "", blnProgress = 0, blnOverwrite = 0, strFieldDelimiter = ",", strEncapsulator = """", strEolReplacement = ""])

Other parameters of this function are the object to save (objCollection), the path and file name, the Boolean value True (1) to include a header line, strHeader to indicate the order of fields, True again (1) to overwrite the destination file and, finally, our replacement character "¶". Other parameters are left with the default values.

In my third and last demo of ObjCSV, you will see how these other options will let you change the fields delimiter (convert a semi-colon or tab delimited file to a comma-delimited file or vice-versa) or change the encapsulation characters from "..." to the character of your choice, etc.

- Posted 07 September 2013 - 09:49 PM

Sign In