Finding the total lines in a colossal file

SputnikDX · 18 Nov 2015, 14:58

Hey all. Running a script that so far works brilliantly. It takes a gigantic .dat file, cuts out the bits that I don't need from each line by making strings around the useless chunks, and appending those strings to a new file. All that works great.

However, I wanted to add a nifty progress bar, since the process takes a bit of time. I know the process is simple enough: take the current line I'm on, divide it by the total lines in the document I'm reading from, and that decimal (somewhere between 0 and 1) is my percentage.

To get the total lines in my .dat, I used this.

Code: Select all

StartTime := A_TickCount
Loop, read, Week45.dat
{
	totalLines := A_Index
}
ElapsedTime := A_TickCount - StartTime
ElapsedTime /= 1000
MsgBox % totalLines . " , " . ElapsedTime . " seconds."

Which gave me 1,838,377 lines. The whole process takes 30 seconds, and that's too long. I get a new file every week with a different number of lines, so I can't just use the number I just got. I know about FileRead, but it has an arbitrary limit of 1GB, and my .dat is ~1.5GB.

So is there a way to speed this up? There's no point getting a progress bar if making the progress bar itself takes 30 seconds, especially when the whole script only takes about 2 minutes. Does AHK have a nifty way to read how many lines are in a file quickly, no matter how huge the file is?

Also, I don't need help figuring out the progress bar. I want to figure that bit out on my own, but the tools to work with a file of this size seem to be out of AHK's league.

MilesAhead · 18 Nov 2015, 15:17

The trouble is that you pretty much have to scan the file and count the end-of-line markers, which on DOS-derived systems like Windows are usually CR/LF pairs. For large files, that scan can take a significant fraction of the time the whole operation would take.

For this reason there is also an indeterminate style of progress bar. It signals to the user that the program is not hung, but it doesn't indicate when the job will be done. I hate to use those too, but sometimes there is no good alternative.

MJs · 18 Nov 2015, 17:06

Just some ideas, since that's a big file with lots of lines to parse one by one.
Can I suggest splitting the file somehow, for example into three parts, and operating on each part separately? That way you avoid the 1GB limit.
Reading up on the File object and using it would help you in this case, though.
Can you somehow estimate the number of lines from the size of the file, like whether it's around the millions?

OR: I WOULD GO WITH THIS ONE
use File object and read lines using ReadLine() while moving the progress based on the position of the pointer moving when reading each line, that way two birds one stone, and it's up to you to calculate the progress.
take care.
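
A minimal sketch of the "estimate from the file size" idea, assuming AutoHotkey v1.1 and assuming the first megabyte or so of the file is representative of the rest. It counts the line feeds in a sample and scales that up by the file size. The variable names and the sample size are illustrative, not something from the thread.

```autohotkey
; Rough line-count estimate from a sample of the file (illustrative sketch).
FilePath := "Week45.dat"            ; file name taken from the first post
SampleChars := 1024 * 1024          ; assumption: the first ~1 Mi characters are representative

MyFile := FileOpen(FilePath, "r")
if !IsObject(MyFile)
{
    MsgBox, Could not open %FilePath%
    ExitApp
}
FileSize := MyFile.Length           ; total size in bytes
Sample := MyFile.Read(SampleChars)  ; Read() takes a character count
SampleBytes := MyFile.Pos           ; bytes consumed by the sample
MyFile.Close()

StrReplace(Sample, "`n", "", SampleLines)   ; count the line feeds in the sample
if (SampleLines = 0)
    EstimatedLines := 1             ; empty file or one very long line
else
    EstimatedLines := Round(FileSize / SampleBytes * SampleLines)
MsgBox, Roughly %EstimatedLines% lines in %FilePath%
```

The estimate is only as good as the assumption that line lengths are similar throughout the file, so a progress bar driven by it may finish slightly under or over 100%.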

Shadowpheonix · 18 Nov 2015, 17:33

I would personally do something like this...

Code: Select all

MyFile := FileOpen("C:\Files\test.txt", "r")	; Adjust as needed.
FileLength := MyFile.Length	; Length is documented as a property: the file size in bytes
Progress, b r0-%FileLength%, , File Progress...
While Not MyFile.AtEOF
{
	CurrentLine := MyFile.ReadLine()
	Progress, % MyFile.Pos
	Sleep 1000	; Replace this with whatever parsing you need to do for the line
}
Progress, Off
MyFile.Close()

Exaskryz · 18 Nov 2015, 17:59

I don't have a 1.5 GB file handy, but I don't see any documentation that 1GB is the limit set by AHK. I'd imagine if AHK has enough memory to work with, files up to almost 4 GB could be opened in 32-bit AHK. Much larger limit in 64-bit, AFAIK.

One idea though, if the FileOpen() is failing, is to do this:

Edit: Don't do this; I misunderstood how FileReadLine worked under the hood. Lexikos's post below explains why it is a bad idea.

Code: Select all

Loop, 250 ; could theoretically be infinite if you have no upper bound on the line count, but I didn't want this untested example to glitch out and never end
{
   FileReadLine, var, C:\Files\text.txt, % A_Index*10000
   If ErrorLevel ; the requested line does not exist, so we have passed the end of the file
   {
      linesestimation := A_Index*10000
      Break
   }
}
Progress, b r0-%linesestimation%, , File Progress... ; borrowed from Shadow's code
; do your thing
; include a "Progress, % A_Index" or what have you to update the line count

The progress bar may not reach 100%, but it'll come close enough. In a 325000 line file, it would want 330000 to be 100%. So at the end, you'd be at 325000/330000 = 98.5%. The larger your line counts, the closer to 100% you'd be.

wolf_II · 18 Nov 2015, 19:33

Try this:

Code: Select all

FileName := A_ScriptDir "\Big_Test_File.dat"



;-------------------------------------------------------------------------------
; Create File
;-------------------------------------------------------------------------------
Start_Time := A_TickCount

    ;-----------------------------------
    f := FileOpen(FileName, "w")
    Loop, % 2 * 1000 * 1000 ; 2 million lines
        f.Write(A_Index "`r`n")                                                 ; corrected -> see lexikos' post below
    f.Close()
    ;-----------------------------------

Elapsed_Time := A_TickCount - Start_Time
MsgBox,, Create File, % Round(Elapsed_Time / 1000, 3) " sec"



;-------------------------------------------------------------------------------
; Count Lines
;-------------------------------------------------------------------------------
Start_Time := A_TickCount

    ;-----------------------------------
    FileRead, BigVar, %FileName%
    ignore := StrReplace(BigVar, "`n", "", Count)                               ; improved -> see lexikos' post below
    ;-----------------------------------

Elapsed_Time := A_TickCount - Start_Time
MsgBox,, Count Lines, % prettify_Number(Count) " lines`n`n"
                      . Round(Elapsed_Time / 1000, 3) " sec"



;-------------------------------------------------------------------------------
prettify_Number(n) { ; insert thousands separators into a numeric string
;-------------------------------------------------------------------------------
    IfLess, n, 0, Return, "-" prettify_Number(SubStr(n, 2))
    Return, RegExReplace(n, "\G\d+?(?=(\d{3})+(?:\D|$))", "$0,")
}

Output:

Code: Select all

---------------------------
Count Lines
---------------------------
2,000,000 lines

0.125 sec
---------------------------
OK   
---------------------------

Edit: applied lexikos' correction, applied lexikos' improvement
See lexikos' post below, which also mentions the limitations of this approach.

lexikos · 18 Nov 2015, 22:04

SputnikDX wrote: I know about FileRead, but it has an arbitrary limit of 1GB,

No, it does not.

FileRead wrote: *m1024: If this option is omitted, the entire file is loaded unless there is insufficient memory, in which case an error message is shown and the thread exits (but Try can be used to avoid this). Otherwise, replace 1024 with a decimal or hexadecimal number of bytes. If the file is larger than this, only its leading part is loaded. Note: This might result in the last line ending in a naked carriage return (`r) rather than `r`n.

To load a 1.5GB file entirely into memory, you will need at least 1.5GB of available virtual memory (across RAM and page file) and a contiguous block of 1.5GB unused virtual address space. AutoHotkey 32-bit is not "large address aware" and therefore only has 2GB of address space at its disposal. The largest contiguous block of free address space depends on where any other allocations were made within that 2GB space. There might not be a 1.5GB contiguous block even if the process is using <10 MB.

If you are using AutoHotkey Unicode 32-bit and the file is not encoded with UTF-16, it is very likely you will need roughly twice the amount of memory, because each character is twice the size (16-bit in memory vs 8-bit in the file).
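
As a rough illustration of that arithmetic, here is a small sketch (AutoHotkey v1.1) that estimates how much memory FileRead would need for a given file before you try it. It assumes the file is stored with one byte per character (ANSI, or UTF-8 that is mostly ASCII), and the variable names are illustrative.

```autohotkey
; Estimate the memory FileRead would need for a file (illustrative sketch).
FilePath := "Week45.dat"            ; file name taken from the first post
FileGetSize, SizeBytes, %FilePath%
if ErrorLevel
{
    MsgBox, Could not get the size of %FilePath%
    ExitApp
}
; Assumption: one byte per character on disk. A Unicode build stores each
; character as 2 bytes in memory, an ANSI build as 1 byte.
BytesPerChar := A_IsUnicode ? 2 : 1
NeededMB := Round(SizeBytes * BytesPerChar / (1024 * 1024))
AddressSpace := (A_PtrSize = 4) ? "2 GB (32-bit build)" : "much more than 2 GB (64-bit build)"
MsgBox, % "Loading it with FileRead needs roughly " NeededMB " MB in one contiguous block.`n"
        . "Address space of this build: " AddressSpace
```

For a ~1.5GB file on a Unicode 32-bit build this comes out at around 3GB, which cannot fit in a 2GB address space no matter how much RAM is installed.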

Exaskryz wrote: FileReadLine, var, C:\Files\text.txt, % A_Index*10000

Very bad idea. In order to read and return line 10000, FileReadLine has to read and discard the first 9999 lines. So the loop would read 10000 lines, then 20000 lines (including the first 10000 again) and so on.
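
To put a number on that, here is a small sketch that adds up how many lines the probing loop would make FileReadLine scan for the 1,838,377-line file from the first post. It assumes the final, failing call reads to the end of the file before giving up; the variable names are illustrative.

```autohotkey
; Total lines scanned by probing every 10000th line with FileReadLine (sketch).
TotalLines := 1838377   ; line count reported in the first post
Step := 10000
Scanned := 0
Loop
{
    Target := A_Index * Step
    if (Target > TotalLines)
    {
        Scanned += TotalLines   ; assumption: the failing call still reads the whole file
        break
    }
    Scanned += Target           ; every call starts again from line 1
}
MsgBox, % Scanned " lines scanned to probe a " TotalLines "-line file"
```

Under that assumption the total is about 170 million line reads, more than 90 passes' worth of the file, just to get a rounded estimate.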

wolf_II wrote: f.Write(A_Index "`n`r")

That will write an empty line after every line, using inconsistent line endings. I'm pretty sure you meant to write "`r`n"...

ignore := StrReplace(BigVar, "`n`r", "", Count)

If you use "`n", that would work for files with either of the line endings common on Windows: "`n" or "`r`n". Putting aside the transposed characters, this method relies on the entire file fitting in memory.

Shadowpheonix's solution seems like it should be the most efficient one.

However, depending on what you're doing, it may be more efficient to read and process a larger chunk of the file at a time, rather than a line at a time.
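
A minimal sketch of what chunked processing could look like in AutoHotkey v1.1, for the case where you need whole lines rather than just a count. The catch is that a chunk usually ends in the middle of a line, so the unfinished tail has to be carried over to the next chunk. The chunk size and all variable names are assumptions for illustration, and the per-line work is left as a placeholder.

```autohotkey
; Sketch: process a big file in chunks, carrying the unfinished last line
; of each chunk over to the next one. All names here are illustrative.
FilePath := "Week45.dat"            ; file name taken from the first post
ChunkChars := 4 * 1024 * 1024       ; assumption: 4 Mi characters per read
Carry := ""                         ; partial line left over from the previous chunk
Lines := 0

MyFile := FileOpen(FilePath, "r")
if !IsObject(MyFile)
{
    MsgBox, Could not open %FilePath%
    ExitApp
}
FileSize := MyFile.Length
Progress, b, , Processing...
While !MyFile.AtEOF
{
    Chunk := Carry . MyFile.Read(ChunkChars)
    LastLF := InStr(Chunk, "`n", false, 0)    ; v1: StartingPos 0 searches from the end
    if (LastLF = 0)                           ; no complete line yet, keep reading
    {
        Carry := Chunk
        continue
    }
    Carry := SubStr(Chunk, LastLF + 1)        ; unfinished tail, may be empty
    Chunk := SubStr(Chunk, 1, LastLF)         ; complete lines only, ends with a line feed
    StrReplace(Chunk, "`n", "", Count)        ; number of complete lines in this chunk
    Loop, Parse, Chunk, `n, `r
    {
        if (A_Index > Count)                  ; skip the empty field after the final line feed
            break
        ; ... process A_LoopField (one line, without its line ending) here ...
    }
    Lines += Count
    Progress, % Round(MyFile.Pos / FileSize * 100)
}
if (Carry != "")                              ; the last line had no trailing line ending
{
    ; ... process Carry here ...
    Lines += 1
}
Progress, Off
MyFile.Close()
MsgBox, Processed %Lines% lines
```

Because the progress bar is driven by the file position, no separate counting pass is needed. Whether this beats ReadLine() for a given job depends on how much work is done per line, so it is worth timing both.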

Exaskryz · 18 Nov 2015, 22:06

Thanks for pointing out the flaw in my FileReadLine idea. I had the misconception that FileReadLine had some special technique underlying it to quickly identify and read out a line.

lexikos · 18 Nov 2015, 22:09

If you need exactly one line and know exactly which line that is, FileReadLine is more efficient than a file loop which skips every leading line, but only because the looping is performed by compiled C++ code rather than your script.

wolf_II · 18 Nov 2015, 22:40

lexikos wrote:
wolf_II wrote: f.Write(A_Index "`n`r")
That will write an empty line after every line, using inconsistent line endings. I'm pretty sure you meant to write "`r`n"...
ignore := StrReplace(BigVar, "`n`r", "", Count)
If you use "`n", that would work for files with either of the line endings common on Windows: "`n" or "`r`n". Putting aside the transposed characters, this method relies on the entire file fitting in memory.

@lexikos: Many thanks for your correction and your improvement. I appreciate the time you spend on educating us forum users.

I also edited my previous post accordingly.

YouTube · 19 Nov 2015, 03:53

Here's a way using the FileSystemObject

Seems to be about 3x slower than the StrReplace version though...

Code: Select all

SetBatchLines -1
FileName := A_ScriptDir "\Big_Test_File.dat"
 
 
FileDelete, %fileName%
;-------------------------------------------------------------------------------
; Create File
;-------------------------------------------------------------------------------
Start_Time := A_TickCount
 
    ;-----------------------------------
    f := FileOpen(FileName, "w")
    Loop, % 2 * 1000 * 1000 ; 2 million lines
        f.Write(A_Index "`r`n")
    f.Close()
    ;-----------------------------------
 
Elapsed_Time := A_TickCount - Start_Time
MsgBox,, Create File, % Round(Elapsed_Time / 1000, 3) " sec"


;-------------------------------------------------------------------------------
; Count Lines (FileSystemObject)
;-------------------------------------------------------------------------------
Start_Time := A_TickCount
 
    ;-----------------------------------
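    fso := ComObjCreate("Scripting.FileSystemObject")
    objTextFile := fso.OpenTextFile(FileName, 8) ; 8 = ForAppending: the stream opens positioned at the end of the file
    Count1 := objTextFile.Line - 1               ; .Line is the current line number, so subtract 1
    objTextFile.Close()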
    ;-----------------------------------
 
Elapsed_Time := A_TickCount - Start_Time
MsgBox,, Count Lines, % prettify_Number(Count1) " lines`n`n"
                      . Round(Elapsed_Time / 1000, 3) " sec"

 
 
;-------------------------------------------------------------------------------
prettify_Number(n) { ; insert thousands separators into a numeric string
;-------------------------------------------------------------------------------
    IfLess, n, 0, Return, "-" prettify_Number(SubStr(n, 2))
    Return, RegExReplace(n, "\G\d+?(?=(\d{3})+(?:\D|$))", "$0,")
}

SputnikDX · 19 Nov 2015, 09:37

Thank you for all the replies. I managed to scrounge together a ticker that displays the number of lines appended, as well as an infinite scrolling progress bar, but I have enough time to tear this apart to get the loading bar I've always wanted. Gives me work to do.

just me · 19 Nov 2015, 11:24

You might also try to read the file in parts, which might be faster:

Code: Select all

#NoEnv
FilePath := "Week45.dat"
ReadLength := 1024 * 1024 * 256 ; 256 Mi characters per read (Read() counts characters, not bytes) - may be adjusted to your needs
Lines := 0
If (MyFile := FileOpen(FilePath, "r")) {
   While Not MyFile.AtEOF {
      StrReplace(MyFile.Read(ReadLength), "`n", " ", LineCount)
      Lines += LineCount
   }
   MyFile.Close()
   MsgBox, 0, Success, Found %Lines% lines in %FilePath%!
}
Else
   MsgBox, 0, Error, Could not open file %FilePath%!

*untested*

And, if you want to count the lines added to a growing file, there are better methods.

MilesAhead · 19 Nov 2015, 15:39

For something as specialized as scanning a file for CR/LF pairs, you may be able to find an optimized free utility that uses memory-mapped files and assembler for the fastest scanning. It could output the number of lines and be launched by your AHK program, assuming it is a small exe. I am thinking there may be something even more specialized than a Windows port of the Linux wc command. There may be a dedicated assembler exe module somewhere like Code Project.

When everything was ANSI it was easier: just use the assembler instructions that scan a memory range for a CR byte, then test whether the byte following it is LF. If so, increment the count.

On the other hand if you are having fun figuring out how to do it strictly using AHK, never mind.
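
As a sketch of the "launch an external tool" idea that needs no extra download, the find command that ships with Windows can count lines: find /c /v "" counts every line that does not contain the empty string, which is all of them. The example below (AutoHotkey v1.1) assumes that plain "find" resolves to the Windows find.exe rather than a ported Unix find on the PATH, and the temp file name is hypothetical. How it compares to the 30-second file loop on a 1.5GB file has not been measured here.

```autohotkey
; Count lines by launching the Windows "find" command (illustrative sketch).
FilePath := "Week45.dat"                  ; file name taken from the first post
TempOut := A_Temp "\linecount.txt"        ; hypothetical scratch file for the result

; Redirect the file into find and the count into the scratch file.
RunWait, %ComSpec% /c find /c /v "" < "%FilePath%" > "%TempOut%", , Hide
FileRead, Result, %TempOut%
FileDelete, %TempOut%

Lines := Trim(Result, " `t`r`n")          ; strip the trailing line break from the output
MsgBox, %Lines% lines in %FilePath%
```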
