Optimize reading huge CSV file Topic is solved

Get help with using AutoHotkey (v1.1 and older) and its commands and hotkeys
Keito
Posts: 5
Joined: 17 Oct 2019, 08:24

Optimize reading huge CSV file

Post by Keito » 01 Dec 2022, 05:10

Hey everyone.

I'm trying to read a CSV database that's about 2.5 GB in size and contains more than 3.5 million lines and 17 columns.
I've tried using the CSV library, but the file is evidently too big: all requests (CSV_MatchCellColumn, CSV_ReadCell) return empty.

What I need to do is list the files in a local directory and search the CSV for rows that contain their names. So I'm now building multiple pseudo-arrays with file names (one array is not enough because it limits me to 16384 characters for all file names, which is around 500 files in my case) and looping each of them against the CSV contents, building a smaller CSV that works fine with the library.

Code: Select all

Loop, Read, posts.csv, posts_relevant.csv
{
    ; 0RelevantList is just a bunch of "," delimited file names.
    if A_LoopReadLine contains %0RelevantList%
        FileAppend, %A_LoopReadLine%`n
}
I haven't checked, but I assume that reading the file once against a bunch of names concatenated in a single variable is faster than looping through the file against each file name separately.

The problem is that each pass takes around 10 minutes on top-end hardware, reading the file from an SSD. It's definitely bearable, and since I'm using AHK, which I understand decently, I can live with that.
I just wonder if there are any scripts or ideas that would allow me to get more speed, maybe by utilizing more RAM (I'm using ~4 MB and one CPU core now). E.g. loading the file into RAM and reading from there. Or maybe even using other software for each search...

One other idea I've had is to pass all the %0RelevantList% pseudo-arrays at once; according to the doc page I can do that:

Code: Select all

if A_LoopReadLine contains %0RelevantList%1,%0RelevantList%2,%0RelevantList%3...
...but I have no clue how to pass multiple variables when their names (or rather, their trailing numbers) are not predetermined and are built dynamically:

Code: Select all

Loop % FFileList.Length()
{
    LoopReadLine := FFileList[A_Index]
    Length := StrLen(0RelevantList%NUM%)
    if (Length > 16000)
    {
        NUM += 1
    }
    StringTrimRight, 0RelevantAdd, LoopReadLine, 4
    0RelevantList%NUM% := 0RelevantList%NUM% 0RelevantAdd ","
}
Thanks!

boiler
Posts: 16926
Joined: 21 Dec 2014, 02:44

Re: Optimize reading huge CSV file  Topic is solved

Post by boiler » 01 Dec 2022, 06:50

Keito wrote: E.g. loading the file into RAM and reading from there.
Did you try this using FileRead? Use a parsing loop where you currently are reading a line from the file. See Example #2 (not #4 because that’s what we’re trying to avoid). This would be better than using StrSplit() to create an array of rows as is often done because of the size of the file.
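
In rough terms, that approach would look something like this (an untested sketch that reuses the 0RelevantList variable from the first post; FileRead has its own limits, so check ErrorLevel):

Code: Select all

#NoEnv
SetBatchLines, -1  ; run at maximum speed
#MaxMem 4095       ; allow variables larger than the 64 MB default

FileRead, Contents, %A_ScriptDir%\posts.csv
if ErrorLevel      ; FileRead leaves Contents blank if it cannot load the file
    MsgBox, Could not read posts.csv into memory.
Loop, Parse, Contents, `n, `r  ; parse the already-loaded text line by line
{
    if A_LoopField contains %0RelevantList%
        FileAppend, %A_LoopField%`n, posts_relevant.csv
}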

just me
Posts: 9453
Joined: 02 Oct 2013, 08:51
Location: Germany

Re: Optimize reading huge CSV file

Post by just me » 01 Dec 2022, 07:05

Would you please provide examples for the CSV and the contents of the FFileList array?

Keito
Posts: 5
Joined: 17 Oct 2019, 08:24

Re: Optimize reading huge CSV file

Post by Keito » 01 Dec 2022, 08:17

just me wrote:
01 Dec 2022, 07:05
Would you please provide examples for the CSV and the contents of the FFileList array?
Sure! Here's a CSV example:
"id","created_at","md5","rating","image_width","image_height","tag_string","locked_tags","fav_count","file_ext","parent_id","change_seq","file_size","updated_at","is_deleted","is_pending","is_flagged","score"
"13","2007-02-10 06:20:51.823829","c884231ca9bb7bd294dff4d6463e2f54","e","994","768","tag_1 tag_2 tag_3","avoid_posting","124","png","","39659170","255495","2022-11-14 05:16:30.621592","t","f","f","64"
"14","2007-02-10 07:17:30.687994","3e47080200fbde2d7d2ccf419343ab0a","e","581","793","tag_1 tag_2 tag_3","","842","jpg","","43145703","96998","2022-11-29 00:14:49.302365","f","f","f","694"
"15","2007-02-10 18:50:40.09919","326065d926185463f00b684e3c763cee","e","592","811","tag_1 tag_2 tag_3","","341","jpg","","42509223","93675","2022-11-12 10:26:15.709321","f","f","f","147"
"16","2007-02-10 18:57:18.54022","3070e4234dd36148149508bd24ccb59d","q","831","648","tag_1 tag_2 tag_3","","24","jpg","","42069832","89139","2022-08-30 04:14:02.133068","t","f","f","2"
"17","2007-02-10 19:57:56.809237","9fceaae92efc38ef8d975b6401bd44f5","e","750","759","tag_1 tag_2 tag_3","avoid_posting","4","jpg","2505022","39747046","88937","2022-07-17 16:30:07.857124","t","f","f","11"
"18","2007-02-10 20:02:56.870021","1ae4065221e6d06f9f0634d24ecb5f66","e","574","841","tag_1 tag_2 tag_3","conditional_dnp","60","gif","","42540225","38461","2022-09-22 13:37:32.594548","t","f","f","-1"
"19","2007-02-10 20:06:43.212235","7b1345443b456e211e054d880b0e353e","q","300","300","tag_1 tag_2 tag_3","","150","jpg","","43006912","23914","2022-11-28 05:49:20.321799","f","f","f","99"
"20","2007-02-10 20:09:42.667396","8e1328c41cc7a9c787df6619ee733e7b","q","400","640","tag_1 tag_2 tag_3","","378","jpg","","43507392","240528","2022-11-28 14:53:11.854251","f","f","f","178"
"21","2007-02-10 20:11:28.21012","290bee1cdf8c9e86a8867d97d2c92efa","e","360","600","tag_1 tag_2 tag_3","avoid_posting","2","jpg","2193","42034876","108452","2022-08-27 18:01:24.670584","t","f","f","0"
FFileList is just a file list built like this:

Code: Select all

Loop, Files, %WhichFolder%\*.*
{
    if InStr(A_LoopFileName, ".jpg") || InStr(A_LoopFileName, ".jpeg") || InStr(A_LoopFileName, ".png")
    {
        FFileList.Push(A_LoopFileName)
        ImagesCount += 1
    }
}
Here's a pastebin with 16000 characters of delimited file names; they were produced by the last piece of code in the OP. (I'm not sure how to export the array itself.) :)


boiler wrote:
01 Dec 2022, 06:50
Did you try this using FileRead? Use a parsing loop where you currently are reading a line from the file. See Example #2 (not #4 because that’s what we’re trying to avoid). This would be better than using StrSplit() to create an array of rows as is often done because of the size of the file.
I actually haven't. I'll add this to the top of my list of things to try, thanks :)

just me
Posts: 9453
Joined: 02 Oct 2013, 08:51
Location: Germany

Re: Optimize reading huge CSV file

Post by just me » 01 Dec 2022, 10:09

Thanks!
Keito wrote:... (one array is not enough because it limits me to 16384 characters for all file names, ...
Why do you think so? The contents of a variable are not limited this way. The following should create a complete matchlist:

Code: Select all

#NoEnv
SetBatchLines, -1 ; <<<<< run with maximum speed

FileList := ""
Loop, Files, %WhichFolder%\*.*,
{
   If A_LoopFileExt In jpg,jpeg,png
   {
      FileList .= SubStr(A_LoopFileName, 1, InStr(A_LoopFileName, ".") - 1) . ","
      ImagesCount += 1
   }
}
FileList := RTrim(FileList, ",")
Possibly you need to set #MaxMem.
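
The complete FileList can then be used directly as the match list in the reading loop, roughly like this (untested):

Code: Select all

Loop, Read, posts.csv, posts_relevant.csv
{
   If A_LoopReadLine contains %FileList%
      FileAppend, %A_LoopReadLine%`n
}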

As an option, you could create an associative array of file names and check the contents of the third field of each CSV line against the array; I'm not sure if it will perform better.

FileRead might not be an option for you because:
A file greater than 1 GB in size will cause ErrorLevel to be set to 1 and OutputVar to be made blank

Keito
Posts: 5
Joined: 17 Oct 2019, 08:24

Re: Optimize reading huge CSV file

Post by Keito » 01 Dec 2022, 12:12

just me wrote:
01 Dec 2022, 10:09
Thanks!
Keito wrote:... (one array is not enough because it limits me to 16384 characters for all file names, ...
Why do you think so? The contents of a variable are not limited this way.
The documentation says so:
Any single item in the list that is longer than 16384 characters will have those extra characters treated as a new list item. Thus, it is usually best to avoid including such items.
...and it splits the string in the wrong place.

Thanks for the code, I'll probably use it. Still, it improves another part of my code that is not directly related to the problem (of reading the huge CSV), and I'll still have to brute-force my way through the database (if I understand everything correctly).
Possibly you need to set #MaxMem.
Yeah, I've tried that; it doesn't change anything. Better safe than sorry anyway, I guess.
A file greater than 1 GB in size will cause ErrorLevel to be set to 1 and OutputVar to be made blank
:cry:

Keito
Posts: 5
Joined: 17 Oct 2019, 08:24

Re: Optimize reading huge CSV file

Post by Keito » 01 Dec 2022, 12:19

I can probably use FileOpen() instead of FileRead?
At first glance, it seems like a pain to use, though... Well, at least there are no documented limits on the size of file I can open.
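
Something like this, maybe (just a rough sketch; I haven't tested it on the real file):

Code: Select all

File := FileOpen(A_ScriptDir . "\posts.csv", "r")
while !File.AtEOF
{
    Line := File.ReadLine()  ; returns one line, including the trailing newline
    ; ... check Line against the relevant file names here ...
}
File.Close()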

just me
Posts: 9453
Joined: 02 Oct 2013, 08:51
Location: Germany

Re: Optimize reading huge CSV file

Post by just me » 01 Dec 2022, 12:47

Keito wrote:Any single item in the list that is longer than 16384 characters ...
A single item in a match list is a single file name in this case.

Of course you can use a file object. I don't know if FileObj.Read() has the same limits as FileRead. Just try it.

AFAICS, the file name in the CSV is always the third field. So you should be able to get the name with

Code: Select all

MD5 := StrSplit(A_LoopField, ",", """", 4)[3]
If you store the file names provided by the File-Loop as keys of an associative array, you can directly check this array using FileArr.HasKey(MD5).
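
Put together, it might look something like this (just a sketch, untested against the real data; WhichFolder and the file names are as in the posts above, and the files are assumed to be named after the MD5 value):

Code: Select all

#NoEnv
SetBatchLines, -1

; Store the file names (without extension) as keys of an associative array.
FileArr := {}
Loop, Files, %WhichFolder%\*.*
{
   If A_LoopFileExt In jpg,jpeg,png
      FileArr[SubStr(A_LoopFileName, 1, InStr(A_LoopFileName, ".") - 1)] := True
}

; Read the CSV line by line and keep the rows whose third field (md5) matches a file name.
Loop, Read, posts.csv, posts_relevant.csv
{
   MD5 := StrSplit(A_LoopReadLine, ",", """", 4)[3]
   If FileArr.HasKey(MD5)
      FileAppend, %A_LoopReadLine%`n
}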

Keito
Posts: 5
Joined: 17 Oct 2019, 08:24

Re: Optimize reading huge CSV file

Post by Keito » 09 Dec 2022, 08:28

boiler wrote:
01 Dec 2022, 06:50
Did you try this using FileRead? Use a parsing loop where you currently are reading a line from the file. See Example #2 (not #4 because that’s what we’re trying to avoid). This would be better than using StrSplit() to create an array of rows as is often done because of the size of the file.
That worked just fine for my 2.5 GB CSV file:

Code: Select all

DatabaseArray := []                   ; initialize the array before pushing to it
dbPath = %A_ScriptDir%\posts.csv
Loop, Read, % dbPath
{
   last_line++                        ; count the lines
   DatabaseArray.Push(A_LoopReadLine)
}
