What is the best way to determine if a file is binary or ascii?
Preferably a fast and simple technique that is not dependent on file extension.
I need to process many files.
Thanks
How to detect file is binary or ascii?
Re: How to detect file is binary or ascii?
Every file, ASCII text included, is represented in binary.
I believe you mean executable vs. text files?
Windows 10 x64 Professional, Intel i5-8500, NVIDIA GTX 1060 6GB, 2x16GB Kingston FURY Beast - DDR4 3200 MHz | [About Me] | [About the AHK Foundation] | [Courses on AutoHotkey]
[ASPDM - StdLib Distribution] | [Qonsole - Quake-like console emulator] | [LibCon - Autohotkey Console Library]
Re: How to detect file is binary or ascii?
Here is what I have currently. It appears to work well, but may not be efficient. However, in my case it is plenty fast enough.
Code: Select all
folder = c:\temp
recurse = 1 ; 0 = no recursion, 1 = recursion
Loop, %folder%\*,, %recurse%
{
    FileRead, recs, %A_LoopFileLongPath%
    numeric = 1,2,3,4,5,6,7,8,9,.
    if recs not contains %numeric%
    {
        ;outputdebug binary - %a_loopfilename%
    } else {
        outputdebug ascii - %a_loopfilename%
    }
}
- MilesAhead
- Posts: 232
- Joined: 03 Oct 2013, 09:44
Re: How to detect file is binary or ascii?
I would look for the source of the Linux "file" command. It's pretty good at catching text files and some printer format file types. I suspect that for executables it uses file attribute info that's not built into NTFS but is in Linux file systems.
"My plan is to ghostwrite my biography. Then hire another writer to put his
name on it and take the blame."
- MilesAhead
Re: How to detect file is binary or ascii?
I have something better, but I'm not at home right now, so I don't have my "setup".
Re: How to detect file is binary or ascii?
I have made a function: isBinFile
It reads the first few bytes (default: 5) and checks whether each byte is within the ASCII printable range (9-13, 32-126).
Seems to work well...
Cheers!
Code: Select all
folder = c:\tools
recurse = 1 ; 0 = no recursion, 1 = recursion
Loop, %folder%\*,, %recurse%
    MsgBox % A_LoopFileLongPath "`n==>" (isBinFile(A_LoopFileLongPath) ? "Binary" : "ASCII")

isBinFile(Filename,tolerance=5) {
    file:=FileOpen(Filename,"r")
    loop, %tolerance% {
        if (!file.RawRead(a,1)) ; nothing left to read: stop at end of file
            break
        byte:=NumGet(&a,0,"UChar") ; unsigned, so bytes above 127 compare as 128-255
        if (byte<9) or (byte>126) or ((byte<32) and (byte>13)) {
            file.Close()
            return 1 ; byte outside 9-13 and 32-126: binary
        }
    }
    file.Close()
    return 0 ; all sampled bytes were printable ASCII
}
Re: How to detect file is binary or ascii?
Tested and works great. I'll be sure to find some applications for this in my scripts!
Re: How to detect file is binary or ascii?
Thanks! If at some point it does not work, increase the "tolerance". If it still doesn't work for a certain file, report it here and I'll fix it.
Re: How to detect file is binary or ascii?
If the text file is not encoded as ANSI, e.g. UCS-2 Big Endian, then isBinFile will report Binary because of the 0 bytes.
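To see which bytes trip the check, it can help to dump the start of a file as hex. This is only a minimal sketch, assuming AutoHotkey v1.1.17+ (for `Format()`); `HexHead` is a hypothetical helper name, not something from this thread.

```autohotkey
; Hypothetical helper: returns the first `count` bytes of a file as hex text.
; Returns an empty string if the file cannot be opened or is empty.
HexHead(Filename, count := 8) {
    file := FileOpen(Filename, "r")
    if (!IsObject(file))
        return ""
    file.Position := 0                  ; FileOpen may have skipped a BOM; rewind
    nbytes := file.RawRead(buf, count)  ; buf is expanded automatically
    file.Close()
    out := ""
    Loop, %nbytes%
        out .= Format("{:02X} ", NumGet(&buf, A_Index - 1, "UChar"))
    return RTrim(out)
}
```

For a UCS-2 Big Endian file that begins with a BOM followed by "Hi", the dump would start with `FE FF 00 48 00 69`. The `00` bytes fall below 9, which is why the printable-range test reports Binary.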
Re: How to detect file is binary or ascii?
panofish wrote: If the ascii file is not encoded as ANSI, such as UCS-2 Big Endian... then it isBinFile will show Binary because of the 0 bytes.
I knew about that... but what you're asking about is actually Unicode.
ASCII and Unicode are 2 different character sets.
The original question was precisely about ASCII.
I will update it so that it also works with Unicode.
Will post it soon!
Cheers!
Re: How to detect file is binary or ascii?
Sorry about that joedf. You are correct. What you created works great for what I need. I just thought I'd point that out for anyone else that may need this. THANKS!
Re: How to detect file is binary or ascii?
Probably IsBOM() will help?
Re: How to detect file is binary or ascii?
HotKeyIt wrote: Probably IsBOM() will help?
Yes, thank you, it has helped as an example.
I have done some research on Unicode at Wikipedia, the official website, Unicode tables, etc.
Here is what I have. It seems to work well:
Code: Select all
/* Version 2 relies on the BOM to identify Unicode files
;BOM ("Byte Order Mark")
;Table from : http://www.unicode.org/faq/utf_bom.html#bom4
 ---------------------------------------   ;Woohoo! ASCII Art.. get it? lol..
| Bytes       | Encoding Form         |    ;if you don't, well, we're trying
|-------------|-----------------------|    ;to detect ASCII here... :P
| 00 00 FE FF | UTF-32, big-endian    |    ;... and Unicode, of course! ;)
| FF FE 00 00 | UTF-32, little-endian |
| FE FF       | UTF-16, big-endian    |
| FF FE       | UTF-16, little-endian |
| EF BB BF    | UTF-8                 |    ;I know we can not rely on this...
 ---------------------------------------
*/
isBinFile(Filename,tolerance=5,asumetext=4,detectunicode=1) {
    file:=FileOpen(Filename,"r")
    file.Position:=0 ;force position to 0 (zero)
    nbytes:=file.RawRead(a,tolerance)
    file.Close() ;everything we need is now in the buffer
    if (nbytes < asumetext) ;recommended 4 minimum for unicode detection
        return 0 ;assume text file, if too short
    if (detectunicode) {
        ;read first 4 bytes
        byteA:=NumGet(&a,0,"UChar")
        byteB:=NumGet(&a,1,"UChar")
        byteC:=NumGet(&a,2,"UChar")
        byteD:=NumGet(&a,3,"UChar")
        ;determine BOM if possible/existent
        if (byteA=0xFE && byteB=0xFF)
        or (byteA=0xFF && byteB=0xFE)
            return 0 ;text Utf-16 BE/LE file
        if (byteA=0xEF && byteB=0xBB && byteC=0xBF)
            return 0 ;text Utf-8 file
        if (byteA=0x00 && byteB=0x00
        && byteC=0xFE && byteD=0xFF)
        or (byteA=0xFF && byteB=0xFE
        && byteC=0x00 && byteD=0x00)
            return 0 ;text Utf-32 BE/LE file
    }
    ;otherwise continue with the traditional method : detect ASCII (printable ranges)
    loop, %nbytes% {
        byte:=NumGet(&a,A_Index-1,"UChar") ;start loop at 0 (zero)
        if (byte<9) or (byte>126) or (byte=11) or (byte=12) or ((byte<32) and (byte>13)) {
            return 1
        }
    }
    return 0
}
When this flaw is fixed, I will add it to the functions topic.
Don't worry, I know how to fix it; I just need to sleep first.
Cheers!
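Since the UTF-8 BOM is optional, one possible complement to the BOM table is to check whether the sampled bytes are structurally valid UTF-8. The sketch below is an assumption on my part, not the fix planned in this thread; `LooksLikeUtf8` and `sampleSize` are hypothetical names. It only checks lead and continuation byte patterns and does not reject overlong encodings or surrogates.

```autohotkey
; Hypothetical helper: structural UTF-8 check on the first sampleSize bytes.
; Returns 1 if the sample is well-formed, 0 if not, -1 if the file cannot be opened.
LooksLikeUtf8(Filename, sampleSize := 512) {
    file := FileOpen(Filename, "r")
    if (!IsObject(file))
        return -1
    file.Position := 0                  ; rewind in case FileOpen skipped a BOM
    nbytes := file.RawRead(buf, sampleSize)
    file.Close()
    i := 0
    while (i < nbytes) {
        b := NumGet(&buf, i, "UChar")
        if (b < 0x80)
            n := 0                      ; plain ASCII byte, no continuation bytes
        else if ((b & 0xE0) = 0xC0)
            n := 1                      ; lead byte of a 2-byte sequence
        else if ((b & 0xF0) = 0xE0)
            n := 2                      ; lead byte of a 3-byte sequence
        else if ((b & 0xF8) = 0xF0)
            n := 3                      ; lead byte of a 4-byte sequence
        else
            return 0                    ; stray continuation or invalid lead byte
        if (i + n >= nbytes)
            break                       ; sequence cut off by the sample boundary
        Loop, %n%
        {
            ; every continuation byte must look like 10xxxxxx
            if ((NumGet(&buf, i + A_Index, "UChar") & 0xC0) != 0x80)
                return 0
        }
        i += n + 1
    }
    return 1
}
```

Pure ASCII files and empty files also pass this check, and so does a NUL byte, so it is probably best combined with the printable-range or NUL-byte test rather than used alone.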
Re: How to detect file is binary or ascii?
Hi joedf,
you might consider that extended ASCII codes like "Ü" (154) are valid in some European languages.
A file without a BOM might be considered to be binary if you find a NULL byte within the first nnn bytes, though it's still a guess.
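The NULL-byte idea could be sketched roughly like this. It is only a guess-based heuristic, as noted above; `HasNulByte` and `sampleSize` are hypothetical names, and the default of 512 bytes is an arbitrary choice for "nnn".

```autohotkey
; Hypothetical helper: 1 if a NUL byte is found in the first sampleSize bytes,
; 0 if none is found, -1 if the file cannot be opened.
HasNulByte(Filename, sampleSize := 512) {
    file := FileOpen(Filename, "r")
    if (!IsObject(file))
        return -1
    file.Position := 0                  ; rewind in case FileOpen skipped a BOM
    nbytes := file.RawRead(buf, sampleSize)
    file.Close()
    Loop, %nbytes%
    {
        if (NumGet(&buf, A_Index - 1, "UChar") = 0)
            return 1                    ; NUL byte: probably binary
    }
    return 0                            ; no NUL byte in the sample
}
```

Because only byte 0 counts against the file, extended characters in the 128-255 range (such as "Ü") no longer cause a text file to be reported as binary. UTF-16 and UTF-32 text without a BOM would still be flagged, since it normally contains 0 bytes.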
Re: How to detect file is binary or ascii?
I know about that, but I didn't think they would be needed...
Oh well! I'll add support for that too! Thanks for your feedback.
Re: How to detect file is binary or ascii?
Hmmm, I'm curious how "file" does it. But I can't look at tar.gz files on this library Windows PC. If anyone is curious, here's the link to the source archive: ftp://ftp.astron.com/pub/file/
I believe it's a bash shell script.
Re: How to detect file is binary or ascii?
The file command actually only checks for the "signature"... like for exe it's "MZ", for bmp it's something like "BM".
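A signature check along those lines might look like the sketch below. It covers only the two signatures mentioned here ("MZ" for exe, "BM" for bmp) and makes no claim about how "file" itself is implemented; `GuessBySignature` is a hypothetical name.

```autohotkey
; Hypothetical helper: maps a couple of well-known leading signatures to a label.
; Returns "" if the signature is unknown or the file cannot be read.
GuessBySignature(Filename) {
    file := FileOpen(Filename, "r")
    if (!IsObject(file))
        return ""
    file.Position := 0                  ; rewind in case FileOpen skipped a BOM
    nbytes := file.RawRead(buf, 2)
    file.Close()
    if (nbytes < 2)
        return ""
    ; build a 2-character string from the first two bytes
    sig := Chr(NumGet(&buf, 0, "UChar")) . Chr(NumGet(&buf, 1, "UChar"))
    if (sig == "MZ")                    ; == compares case-sensitively
        return "exe"
    if (sig == "BM")
        return "bmp"
    return ""
}
```

An empty result only means "no known signature", so a caller would still need one of the text heuristics from earlier in the thread as a fallback.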
Re: How to detect file is binary or ascii?
joedf wrote: the file command actually only checks for the "signature"... like for exe it's "MZ", bmp it's something like "BM"
I assumed it did so on stuff like image files and printer format files like PDF, PostScript, etc., but I thought for text it might be able to detect ASCII/Unicode types. But you're saying "text" is the fall-through if nothing else is found?
Dang! I wish I could just look at the script. Hopefully soon I'll have a machine instead of using a library loaner.
Re: How to detect file is binary or ascii?
Well I'm working on it when I arrive home, don't worry I can't detect utf-8 with BOM
Just need to get home
Re: How to detect file is binary or ascii?
It's no biggie for me. Just a matter of curiosity. I know I looked through the 'file' script. But it was years ago. I was probably running Mandrake 9.1 then. But I bet it does fall through to text as a last resort. The frustration is just generally dealing with these super restricted library computers. Not anything to do with this thread.