AutoHotkey Community

All times are UTC [ DST ]




Posted: December 5th, 2011, 6:10 am
Joined: February 29th, 2008, 12:11 pm
Posts: 943
I have a "Word" file that contains only one phrase in it:
Quote:
Jack is having a cup of coffee.


The name of that "Word" file is "tree" ("tree.doc")

When I try to save it as a .txt file:

Code:
FileCopy, tree.doc, fog.txt


I get strange content inside "fog.txt":

Quote:
俵遄? >  ?   *  ,  ?
) 
鴠 q`  皽 
   bjbjqPqP  2 : :
   t t t t
t t t  ? ? ? ? ? ? ? ?
? ? ? ? ? ? ? ? b  d d d
d d d $ ? h E J ?  t ?
? ? ? ? ? t t ?

? ?  . . . ? " t ? t ? b
. ? b .

. t t .
?

? €J
G
最 ? ?
. b ? 0 ? .
? ?  ? . ? t . 4 ? ? . ? ?


? ? ? ? ?  "
? ? ? ? ? ? ? ? ?
? ? $ ? ? ? ? ? ? ? ? t t t
 t t t   





Jack is having a cup of coffee.

俵遄? >  ?   *  ,  ?
) 


However, if I just highlight and copy the content of that "Word" document and then paste it into the .txt file, everything seems to be fine.

So, how can I get the text out of a "Word" file into a .txt file automatically (I mean with an AHK script) without all this gibberish in the .txt file?


Posted: December 5th, 2011, 6:39 am
Joined: April 8th, 2009, 7:49 pm
Posts: 6074
Location: San Diego, California
A text file has only text, plus tab, linefeed, carriage return, and formfeed.
There may be a few other control characters, but I believe they are ignored.

Word documents have >tremendous< amounts of formatting embedded in the document.

As you have seen, FileCopy does >not< magically convert between the two formats.
AFAIK, only Word can do the conversion.
If you want, you could control Word to make it save a copy of the file as a *.txt document.
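
A minimal sketch of that approach for AutoHotkey_L, assuming Word is installed and that tree.doc sits next to the script (both paths are illustrative). It starts Word through COM, opens the document, and saves a copy with FileFormat 2, which is Word's wdFormatText constant. The try/catch form needs a reasonably recent AutoHotkey_L build.

```autohotkey
; Sketch only: assumes Microsoft Word is installed and tree.doc is in the script folder.
docPath := A_ScriptDir "\tree.doc"   ; full path, so Word does not have to guess the folder
txtPath := A_ScriptDir "\fog.txt"

try {
    word := ComObjCreate("Word.Application")   ; hidden Word instance
    doc := word.Documents.Open(docPath)
    doc.SaveAs(txtPath, 2)                     ; 2 = wdFormatText (plain text)
    doc.Close(0)                               ; 0 = wdDoNotSaveChanges
} catch e {
    MsgBox % "Word could not convert the file: " e.Message
}

if IsObject(word)
    word.Quit(0)                               ; always shut the hidden instance down
```

Passing a different FileFormat value would give other output types; for example, 6 is wdFormatRTF.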


Posted: December 5th, 2011, 7:09 am
Joined: May 24th, 2009, 5:35 am
Posts: 2099
Location: Iowa, USA
.doc content (text) to .txt file

Posted: December 5th, 2011, 7:18 am
Joined: February 29th, 2008, 12:11 pm
Posts: 943
WOW!!! jethrow, thank you for this link. I am researching it at the moment.

Leef_me, thank you for that info!


Posted: December 5th, 2011, 7:18 am
Joined: August 21st, 2006, 7:07 pm
Posts: 2925
Location: The Shell
The magic byte offset is 0xa00
for the start of 'real text' in Word 2002
Code:
docFile := a_scriptDir "\file.doc"
fileRead, oSet, % docFile
pos := &oSet+0xa00

while asc( *pos ) != 48   ; *pos is the byte value; asc() of it is 48 ("0") only when that byte is 0
  txt.=chr( *pos++ )

fileAppend, % txt, % a_scriptDir "\file.txt"

Image
Obviously the COM method is the most compatible.

Posted: December 5th, 2011, 7:44 am
Joined: February 29th, 2008, 12:11 pm
Posts: 943
Hello TLM,

It doesn't seem to work for me:


Code:
docFile := a_scriptDir "\tree.doc"
fileRead, oSet, % docFile
pos := &oSet+0xa00

while asc( *pos ) != 48
  txt.=chr( *pos++ )

fileAppend, % txt, % a_scriptDir "\fog.txt"


I just get an empty fog.txt file.

(I am using "MS WORD 2003" and I have AHK_L installed)


Posted: December 5th, 2011, 7:55 am
Joined: February 29th, 2008, 12:11 pm
Posts: 943
jethrow wrote:


I have looked through those two scripts in that thread and tested them on my computer.

With the first script:

Code:
myinputfile := "tree.doc"
myoutputfile := "fog.txt"
FileAppend, % ComObjGet(myinputfile).Range.Text, %myoutputfile%"


I get this message:

Quote:
---------------------------
checky_1.ahk
---------------------------
Error: 0x80030002 - %1 could not be found.


Line#
001: myinputfile := "tree.doc"
002: myoutputfile := "fog.txt"
---> 003: FileAppend,ComObjGet(myinputfile).Range.Text,%myoutputfile%"
004: Exit
005: Exit
005: Exit

Continue running the script?
---------------------------
Yes(Y) No(N)
---------------------------


With the second script:

Code:
myinputfile := "tree.doc"
myoutputfile := "fog.txt"

p := [myoutputfile, 2]
Loop, 13
   p.Insert(A_Index=10? 1252:A_Index=13? 0:ComObj())

doc := ComObjGet(myinputfile)
doc.SaveAs(p*)
doc.Close


I get this error message:

Quote:
---------------------------
checky_2.ahk
---------------------------
Error: 0x80030002 - %1 could not be found.


Line#
001: myinputfile := "tree.doc"
002: myoutputfile := "fog.txt"
004: p := [myoutputfile, 2]
005: Loop,13
006: p.Insert(A_Index=10? 1252:A_Index=13? 0:ComObj())
---> 008: doc := ComObjGet(myinputfile)
009: doc.SaveAs(p*)
010: doc.Close
011: Exit
012: Exit
012: Exit

Continue running the script?
---------------------------
Yes(Y) No(N)
---------------------------


What am I doing wrong? ( I am using "AHK_L" and "MSWord 2003")
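
Error 0x80030002 is the COM storage code for a file that could not be found, so a likely cause is that ComObjGet cannot resolve the relative name tree.doc. Below is a hedged sketch of the first script with absolute paths built from A_ScriptDir (an assumption: both files live next to the script). It also drops the stray quotation mark after %myoutputfile%, which would otherwise become part of the output file name.

```autohotkey
; Sketch only: assumes Word is installed and tree.doc is in the script folder.
myinputfile  := A_ScriptDir "\tree.doc"   ; absolute path for ComObjGet
myoutputfile := A_ScriptDir "\fog.txt"

if !FileExist(myinputfile) {
    MsgBox % "Input file not found: " myinputfile
    ExitApp
}

doc := ComObjGet(myinputfile)             ; binds to the document through Word
text := doc.Range.Text                    ; whole document body as one string
doc.Close(0)                              ; 0 = wdDoNotSaveChanges
StringReplace, text, text, `r, `r`n, All  ; Word ends paragraphs with a bare CR
FileAppend, %text%, %myoutputfile%
```

The same change to myinputfile and myoutputfile should apply to the second script, since it fails on the identical ComObjGet line.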


Posted: December 5th, 2011, 8:07 am
Joined: August 21st, 2006, 7:07 pm
Posts: 2925
Location: The Shell
Benny-D wrote:
..it doesn't seem to work for me
Can you post a link to a WORD 2003 file plz?
Or 2 or more documents with different txt in each zipped would be even better if possible.

Posted: December 5th, 2011, 8:32 am
Joined: February 29th, 2008, 12:11 pm
Posts: 943
TLM wrote:
Benny-D wrote:
..it doesn't seem to work for me
Can you post a link to a WORD 2003 file plz?
Or 2 or more documents with different txt in each zipped would be even better if possible.


Here is my "tree.doc" with "Jack is having a cup of coffee.":

http://roundcan.narod.ru/tree.zip

Here is "chemistry.doc" containing "Chemistry is the science of matter, especially its chemical reactions, but also its composition, structure and properties.":

http://roundcan.narod.ru/chemistry.zip

Here is "paradox.doc" containing "What happens when Pinocchio says, 'My nose will grow now'?":

http://roundcan.narod.ru/paradox.zip


Posted: December 5th, 2011, 8:48 am
Joined: August 21st, 2006, 7:07 pm
Posts: 2925
Location: The Shell
I forgot to say, don't tell me what the document content is ;) It's OK though.

Weird, I'm still getting the same byte offset of 0xa00.
Image
I will investigate, but I still say there must be a much better COM solution.

Posted: December 5th, 2011, 12:49 pm
Joined: February 29th, 2008, 12:11 pm
Posts: 943
How did you make this image with the phrase about Pinocchio? Is it a .gif file? What software did you use?


Posted: December 5th, 2011, 1:37 pm
Joined: August 21st, 2006, 7:07 pm
Posts: 2925
Location: The Shell
Adobe Imageready.

I think the problem has something to do with the way different systems handle Unicode (specifically language settings).
When I do a straight copy to text I get: paradox.txt
I notice many of the nulls seem different.

Obviously this is different from your format, so it wouldn't matter if we filtered by character number.
Not sure how to fix this other than using COM, which is not my strong suit.
Here are some COM refs:
http://www.autohotkey.com/forum/viewtopic.php?t=22923
http://www.autohotkey.com/forum/topic67931-15.html
http://www.autohotkey.com/forum/topic61509.html

My suggestion would be to post this in one of the COM_L threads.

Posted: December 5th, 2011, 2:01 pm
Joined: February 29th, 2008, 12:11 pm
Posts: 943
TLM wrote:
Adobe Imageready

- Is it freeware, or did you pay for it?

TLM wrote:

- Gosh! If it's not your strong suit, to me it's a deep forest! Just studying those links would take me half a year. I don't even know what COM stands for. Where should I start to study COM? Do you have the set of COM commands that would solve my problem? I could start studying COM from that. That's how I started out with AHK: I had a really simple problem to solve, somebody gave me the solution in AHK, and then I studied the code of that solution. That was the most efficient and at the same time the most practical way.


Posted: December 5th, 2011, 2:24 pm
Joined: August 21st, 2006, 7:07 pm
Posts: 2925
Location: The Shell
ImageReady comes bundled with Adobe Photoshop and/or Creative Suite, which is payware (or expensiveWare ;) ).
If you can find a cheap old version of PS7+, it comes with IR.

COM = Component Object Model

If you know how to use a hex editor and view variables in memory,
you could easily find the byte length between the start of the document
and the start of the 'real text'.
Then you just have to replace this:
Code:
pos := &oSet+0xa00
With the length you find.
Just make sure you get it right AND break @ the right offset or AHK will crash.
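
One way to make that safer is sketched below, under stated assumptions: the function name ExtractTextAt is hypothetical, the 0xA00 offset is only the value reported in this thread, and the text is assumed to be stored one byte per character and followed by a zero byte. It reads the file with the *c option so the bytes are loaded without text conversion (on Unicode builds of AHK_L, a plain FileRead probably converts the data, which would shift the offsets), and it stops at the end of the file instead of reading past the buffer.

```autohotkey
; Sketch only: offset 0xA00 comes from this thread and is not guaranteed for every .doc file.
docFile := A_ScriptDir "\tree.doc"
txt := ExtractTextAt(docFile, 0xA00)
if (txt != "")
    FileAppend, %txt%, %A_ScriptDir%\fog.txt

; Returns the run of non-zero bytes that starts at the given file offset.
ExtractTextAt(file, offset) {
    FileGetSize, size, %file%               ; total bytes, used as the upper bound
    FileRead, data, *c %file%               ; *c = load raw bytes, no text conversion
    txt := ""
    while (offset < size) {
        byte := NumGet(data, offset, "UChar")   ; one byte at the current offset
        if (byte = 0)                           ; first zero byte ends the text run
            break
        txt .= Chr(byte)                        ; naive mapping; fine for plain ASCII text
        offset++
    }
    return txt
}
```

Bytes above 0x7F are mapped one-to-one to character codes here, so accented or non-Latin text would need real code page handling, and documents that store their text as 16-bit characters would need a different loop.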

The only other cheat I can think of would be:
when you're saving your Word document, put a unique character at the very start of it.
Then all you would have to do is search for the offset of that character.
This would mean you'd HAVE TO access the doc via Word first, though.

Posted: December 5th, 2011, 2:56 pm
Joined: February 29th, 2008, 12:11 pm
Posts: 943
TLM wrote:
If you know how to use a hex editor and view variables in memory, you could easily find the byte length between the start of the document

- I don't know how to use a hex editor.
- I don't know how to find out the byte length. I don't even know what it means.

Is there really no easier way? Maybe there is a way of saving the word file as an .RTF file and then working with its code somehow?

