How to detect file is binary or ascii?

Get help with using AutoHotkey (v1.1 and older) and its commands and hotkeys
panofish
Posts: 11
Joined: 04 Oct 2013, 09:20

How to detect file is binary or ascii?

04 Oct 2013, 09:27

What is the best way to determine if a file is binary or ascii?
Preferably a fast and simple technique that is not dependent on file extension.
I need to process many files.

Thanks
joedf
Posts: 8940
Joined: 29 Sep 2013, 17:08
Location: Canada

Re: How to detect file is binary or ascii?

04 Oct 2013, 12:35

ASCII, like everything else, can be represented in binary.
I believe you mean executable vs. text files?
Windows 10 x64 Professional, Intel i5-8500, NVIDIA GTX 1060 6GB, 2x16GB Kingston FURY Beast - DDR4 3200 MHz | [About Me] | [About the AHK Foundation] | [Courses on AutoHotkey]
[ASPDM - StdLib Distribution] | [Qonsole - Quake-like console emulator] | [LibCon - Autohotkey Console Library]
panofish

04 Oct 2013, 13:21

Here is what I have currently. It appears to work well, but may not be efficient. However, in my case it is plenty fast enough.

Code: Select all

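folder = c:\temp
recurse = 1  ; 0 = no recursion, 1 = recursion
numeric = 1,2,3,4,5,6,7,8,9,.  ; match list: the digits 1-9 and the period

Loop, %folder%\*,, %recurse%
{
    ; FileRead loads the file as text. Most commands only see the text up to the
    ; first NUL byte, which is likely why binary files rarely match the list below.
    ; Note that a text file without any digit or period is also reported as binary.
    FileRead, recs, %A_LoopFileLongPath%

    if recs not contains %numeric%
    {
        ;outputdebug binary - %a_loopfilename%
    } else {
        outputdebug ascii - %a_loopfilename%
    }
}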
MilesAhead
Posts: 232
Joined: 03 Oct 2013, 09:44

04 Oct 2013, 13:21

I would look for the source of the Linux "file" command. It's pretty good at catching text files and some printer format file types. I suspect that for executables it uses file attribute information that is built into Linux file systems but not into NTFS.
"My plan is to ghostwrite my biography. Then hire another writer to put his
name on it and take the blame."

- MilesAhead
joedf

04 Oct 2013, 14:27

I have something better, but I'm not at home right now, so I don't have my "setup".
joedf

04 Oct 2013, 18:18

I have made a function: isBinFile
It reads the first few bytes (default: 5) and checks whether each byte is within the printable ASCII range (9-13, 32-126).
It seems to work well...

Code: Select all

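folder = c:\tools
recurse = 1  ; 0 = no recursion, 1 = recursion

Loop, %folder%\*,, %recurse%
	MsgBox % A_LoopFileLongPath "`n==>" (isBinFile(A_LoopFileLongPath) ? "Binary" : "ASCII")

; Returns 1 if any of the first <tolerance> bytes is outside 9-13 and 32-126, else 0.
isBinFile(Filename,tolerance=5) {
	file:=FileOpen(Filename,"r")
	if !IsObject(file)
		return 0 ; could not open the file; treated as text here
	loop, %tolerance% {
		if !file.RawRead(a,1)
			break ; end of file reached before <tolerance> bytes were read
		byte:=NumGet(&a,0,"UChar") ; unsigned, so bytes above 127 read as 128-255
		if (byte<9) or (byte>126) or ((byte<32) and (byte>13)) {
			file.Close()
			return 1
		}
	}
	file.Close()
	return 0
}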
cheers!
Guest10
Posts: 578
Joined: 01 Oct 2013, 02:50

06 Oct 2013, 04:11

Tested and works great. I'll be sure to find some applications for this in my scripts! :lol:
joedf

06 Oct 2013, 09:01

Thanks :) If at some point it does not work, increase the "tolerance". If it still doesn't work for a certain file,
report it here and I'll fix it. ;)
panofish

07 Oct 2013, 16:07

If the text file is not encoded as ANSI (for example, UCS-2 Big Endian), then isBinFile will report Binary because of the 0 bytes.
joedf

07 Oct 2013, 16:17

panofish wrote: If the ascii file is not encoded as ANSI, such as UCS-2 Big Endian... then it isBinFile will show Binary because of the 0 bytes.
I knew about that... but what you're describing is actually Unicode.
ASCII and Unicode are two different character sets.
The original question was specifically about ASCII.

I will update the function so that it also works with Unicode.
Will post it soon!

Cheers! ;)
panofish

07 Oct 2013, 16:49

Sorry about that joedf. You are correct. What you created works great for what I need. I just thought I'd point that out for anyone else that may need this. THANKS!
joedf

08 Oct 2013, 22:01

HotKeyIt wrote:Probably IsBOM() will help?
Yes, thank you, it has helped as an example :)
I have done some research on Unicode on Wikipedia, the official website, Unicode tables, etc.
Here is what I have. It seems to work well ;)

Code: Select all

/* Version 2 relies on the BOM to identify Unicode files
;BOM ("Byte Order Mark")
;Table from : http://www.unicode.org/faq/utf_bom.html#bom4
-----------------------------------------    ;Woohoo! ASCII Art.. get it? lol..
|    Bytes    |      Encoding Form      |    ;if you don't, well, we're trying
|---------------------------------------|    ;to detect ASCII here... :P
|00 00 FE FF  |  UTF-32, big-endian     |    ;... and Unicode, of course! ;)
|FF FE 00 00  |  UTF-32, little-endian  |
|FE FF        |  UTF-16, big-endian     |
|FF FE        |  UTF-16, little-endian  |
|EF BB BF     |  UTF-8                  |    ;I know we cannot rely on this...
-----------------------------------------
*/

isBinFile(Filename,tolerance=5,asumetext=4,detectunicode=1) {
	file:=FileOpen(Filename,"r")
	if !IsObject(file)
		return 0 ;could not open the file; treated as text here
	file.Position:=0 ;force position to 0 (zero), in case FileOpen skipped a BOM
	nbytes:=file.RawRead(a,tolerance)
	file.Close() ;the bytes are in the buffer now, so close once for every return path
	if (nbytes < asumetext) ;recommended 4 minimum for unicode detection
		return 0 ;assume text file, if too short
	
	if (detectunicode) {
		;read first 4 bytes
		byteA:=NumGet(&a,0,"UChar")
		byteB:=NumGet(&a,1,"UChar")
		byteC:=NumGet(&a,2,"UChar")
		byteD:=NumGet(&a,3,"UChar")
		
		;determine BOM if possible/existent
		;note: FF FE 00 00 (UTF-32 LE) already matches the UTF-16 LE test below
		if (byteA=0xFE && byteB=0xFF)
			or (byteA=0xFF && byteB=0xFE)
			return 0 ;text UTF-16 BE/LE file
		if (byteA=0xEF && byteB=0xBB && byteC=0xBF)
			return 0 ;text UTF-8 file
		if (byteA=0x00 && byteB=0x00
			&& byteC=0xFE && byteD=0xFF)
			or (byteA=0xFF && byteB=0xFE
			&& byteC=0x00 && byteD=0x00)
			return 0 ;text UTF-32 BE/LE file
	}
	;otherwise continue with the traditional method: detect ASCII (printable ranges)
	loop, %nbytes% {
		byte:=NumGet(&a,A_Index-1,"UChar") ;offsets start at 0 (zero)
		if (byte<9) or (byte>126) or (byte=11) or (byte=12) or ((byte<32) and (byte>13))
			return 1
	}
	return 0
}
UTF-8 without a BOM is a known flaw... working on it :P
When this flaw is fixed, I will add the function to the functions topic.
Don't worry, I know how to fix it, I just need to sleep first :lol:
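
For readers who need BOM-less UTF-8 in the meantime, here is a minimal sketch of one common approach, written for AutoHotkey v1.1. It is not the fix planned in this post. It checks whether the sampled bytes form structurally valid UTF-8 containing at least one multi-byte sequence. The function name LooksLikeUtf8 and the 4096-byte sample size are hypothetical choices for illustration.

```autohotkey
; Hypothetical helper: returns 1 if the first maxBytes bytes are valid UTF-8
; with at least one multi-byte sequence, else 0.
LooksLikeUtf8(Filename, maxBytes=4096) {
	file := FileOpen(Filename, "r")
	if !IsObject(file)
		return 0             ; could not open the file
	file.Position := 0       ; rewind in case FileOpen skipped a BOM
	nbytes := file.RawRead(buf, maxBytes)
	file.Close()
	i := 0, multi := 0
	while (i < nbytes) {
		b := NumGet(&buf, i, "UChar")
		if (b < 0x80)
			n := 0           ; plain ASCII byte
		else if (b >= 0xC2 && b <= 0xDF)
			n := 1           ; lead byte of a 2-byte sequence
		else if (b >= 0xE0 && b <= 0xEF)
			n := 2           ; lead byte of a 3-byte sequence
		else if (b >= 0xF0 && b <= 0xF4)
			n := 3           ; lead byte of a 4-byte sequence
		else
			return 0         ; stray continuation byte or invalid lead byte
		if (i + n >= nbytes)
			break            ; sequence cut off at the end of the sample; stop checking
		Loop, %n% {
			c := NumGet(&buf, i + A_Index, "UChar")
			if (c < 0x80 || c > 0xBF)
				return 0     ; continuation bytes must have the form 10xxxxxx
		}
		if (n > 0)
			multi := 1
		i += n + 1
	}
	return multi
}
```

One possible use is to call it only after the BOM checks fail and to treat a result of 1 as text before falling back to the printable-ASCII loop. The check is simplified: it does not reject overlong three- and four-byte forms or encoded surrogates, and pure ASCII files return 0 because they contain no multi-byte sequence.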

cheers! ;)
just me
Posts: 9426
Joined: 02 Oct 2013, 08:51
Location: Germany

09 Oct 2013, 00:49

Hi joedf,

You might consider that extended ASCII codes like "Ü" (154 in OEM code pages 437 and 850, 220 in Windows-1252) are valid in some European languages.

A file without a BOM might be considered to be binary if you find a NULL byte within the first nnn bytes, though it's still a guess.
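
A minimal sketch of that NULL-byte heuristic in AutoHotkey v1.1 follows. The function name HasNullByte and the default of 512 bytes are hypothetical choices for illustration, and the result remains a guess, since UTF-16 and UTF-32 text also contains NULL bytes.

```autohotkey
; Hypothetical helper: returns 1 if a NUL byte occurs in the first maxBytes bytes, else 0.
HasNullByte(Filename, maxBytes=512) {
	file := FileOpen(Filename, "r")
	if !IsObject(file)
		return 0 ; could not open the file; treated as "no NUL found" here
	file.Position := 0 ; rewind in case FileOpen skipped a BOM
	nbytes := file.RawRead(buf, maxBytes)
	file.Close()
	Loop, %nbytes% {
		if (NumGet(&buf, A_Index-1, "UChar") = 0)
			return 1 ; NUL byte found: probably binary
	}
	return 0
}

; Example call on the running AutoHotkey executable itself
MsgBox % HasNullByte(A_AhkPath) ? "Probably binary" : "Possibly text"
```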
joedf

09 Oct 2013, 09:20

I know about that, but I didn't think those characters would be needed...
Oh well! I'll add support for them too! Thanks for your feedback ;)
MilesAhead

09 Oct 2013, 13:07

Hmmm, I'm curious how "file" does it. But I can't look at tar.gz files on this library Windows PC. If anyone is curious, here's the link to the source archive: ftp://ftp.astron.com/pub/file/

I believe it's a bash shell script.
joedf

10 Oct 2013, 13:44

The file command actually only checks for the "signature"... for example, for exe it's "MZ", and for bmp it's something like "BM".
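
As a rough illustration of such a signature check, this AutoHotkey v1.1 sketch compares the first two bytes of a file against "MZ" and "BM". The function name GetSignature and its return values are hypothetical, and only the two signatures mentioned here are covered.

```autohotkey
; Hypothetical helper: returns "exe", "bmp", or "" based on the first two bytes.
GetSignature(Filename) {
	file := FileOpen(Filename, "r")
	if !IsObject(file)
		return ""
	file.Position := 0 ; make sure reading starts at the first byte
	nbytes := file.RawRead(buf, 2)
	file.Close()
	if (nbytes < 2)
		return "" ; file too short to carry a two-byte signature
	sig := Chr(NumGet(&buf, 0, "UChar")) . Chr(NumGet(&buf, 1, "UChar"))
	if (sig == "MZ") ; == compares case-sensitively
		return "exe"
	if (sig == "BM")
		return "bmp"
	return ""
}
```

A signature match says what a file claims to be, so it could run before the byte-range test in isBinFile; an empty result means only that neither of these two signatures was found.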
MilesAhead

10 Oct 2013, 13:58

joedf wrote: the file command actually only checks for the "signature"... like for exe it's "MZ", bmp it's something like "BM"
I assumed it did so for things like image files and printer format files such as PDF and PostScript, but I thought that for text it might be able to detect ASCII/Unicode types. But you're saying "text" is the fall-through if nothing else is found?

Dang! I wish I could just look at the script. Hopefully soon I'll have a machine instead of using a library loaner. :)
joedf

10 Oct 2013, 14:02

Well, I'll work on it when I get home. Don't worry, I can detect UTF-8 without a BOM.
I just need to get home :P
MilesAhead

10 Oct 2013, 15:11

It's no biggie for me. Just a matter of curiosity. I know I looked through the 'file' script. But it was years ago. I was probably running Mandrake 9.1 then. But I bet it does fall through to text as last resort. The frustration is just generally dealing with these super restricted library computers. Not anything to do with this thread. :)