I'm trying to read a CSV database that's about 2.5 GB in size and contains more than 3.5M lines and 17 columns.
I've tried using the CSV library, but the file is evidently too big: every call (CSV_MatchCellColumn, CSV_ReadCell) returns an empty result.
What I need to do is list the files in a local directory and search the CSV for rows that contain their names. So I'm now building multiple pseudo-arrays of file names (a single variable is not enough, because it limits me to 16384 characters for all the names combined, which is around 500 files in my case) and looping each of them against the CSV contents, building a smaller CSV that works fine with the library.
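For context, the FFileList used in the last snippet below is just an array of those file names. A sketch of how it can be gathered, assuming the files sit in the script's working directory:
Code:
FFileList := []                  ; plain array of file names
Loop, Files, %A_WorkingDir%\*.*  ; list the local directory, non-recursive
    FFileList.Push(A_LoopFileName)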
Code:
; 0RelevantList is just a bunch of ","-delimited file names.
Loop, Read, posts.csv, posts_relevant.csv
{
    if A_LoopReadLine contains %0RelevantList%
        FileAppend, %A_LoopReadLine%`n  ; appends to posts_relevant.csv
}
The problem is that each pass takes around 10 minutes on top-end hardware, reading the file from an SSD. That's bearable, and since I'm using AHK, which I understand decently, I can live with it.
I just wonder if there are any scripts or ideas that would let me get more speed, maybe by using more RAM (the script uses ~4 MB and one CPU core right now), e.g. loading the file into RAM and reading it from there. Or maybe even using other software for each search...
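Something like this is what I have in mind for the RAM idea (untested sketch; the #MaxMem directive lifts the per-variable size cap on older v1 builds, and a 64-bit AutoHotkey would be needed to hold ~2.5 GB in one variable):
Code:
#MaxMem 4095  ; older v1 builds cap each variable at 64 MB by default
FileRead, CSVData, posts.csv  ; pull the whole file into memory once
if ErrorLevel
    MsgBox, FileRead failed - not enough memory, or 32-bit AHK?
Loop, Parse, CSVData, `n, `r  ; iterate the lines from RAM instead of disk
{
    if A_LoopField contains %0RelevantList%
        FileAppend, %A_LoopField%`n, posts_relevant.csv
}
CSVData := ""  ; release the memory when done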
One other idea I've had is to put all the %0RelevantList% pseudo-arrays into a single check at once; according to the doc page, I can do that:
Code:
if A_LoopReadLine contains %0RelevantList1%,%0RelevantList2%,%0RelevantList3%...
This is how I build the pseudo-arrays:
Code:
NUM := 1  ; index of the pseudo-array element currently being filled
Loop % FFileList.Length()
{
    FileName := FFileList[A_Index]
    StringTrimRight, FileName, FileName, 4  ; drop the 4-character ".ext" suffix
    ; Start a new list element before the current one outgrows the limit.
    if (StrLen(0RelevantList%NUM%) > 16000)
        NUM += 1
    ; Add a separating comma only when the element already has content,
    ; so no list ends with a trailing comma (an empty match-list item).
    0RelevantList%NUM% := 0RelevantList%NUM% = "" ? FileName : 0RelevantList%NUM% "," FileName
}
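If the combined match list doesn't work out, the other variant I'm considering is a single read pass with an inner loop over the lists, roughly like this (sketch, reusing NUM and the 0RelevantList elements built above):
Code:
Loop, Read, posts.csv, posts_relevant.csv
{
    Loop % NUM  ; test the line against every pseudo-array element
    {
        CurrentList := 0RelevantList%A_Index%
        if A_LoopReadLine contains %CurrentList%
        {
            FileAppend, %A_LoopReadLine%`n  ; goes to posts_relevant.csv
            break  ; matched, no need to test the remaining lists
        }
    }
}
That would replace one ten-minute pass per list with a single pass over the file.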