AutoHotkey Community Forum: Ask for Help

Extracting information from html documents.

 
Buckie



Joined: 13 Feb 2008
Posts: 15
Location: Denmark

Posted: Sun May 18, 2008 11:26 am    Post subject: Extracting information from html documents.

Hi guys, I'm interested in extracting 11075 and e9.bogeyman from the line below.

Code:
<td><a href="?section=profile&amp;show=sig&amp;id=11075">e9.bogeyman</a> <img src="gfx/laenderflaggen/Singapore.gif" alt="[SG]" title="Singapore"> <img src="gfx/laenderflaggen/Denmark.gif" alt="[DK]" title="Denmark"></td>'


The lengths of the name and the number may differ. The line number may also differ from file to file (in this case it is 220).

My own best suggestion would be to download the file and use a loop to search each line for the string:

href="?section=profile&amp;show=sig&amp;id

However, this method might be slow, and the script needs to parse around 100,000 HTML documents. Also, I don't know how to pick out the number and the name (as these may vary in length).

Any suggestions?

Thanks in advance.
n-l-i-d
Guest





Posted: Sun May 18, 2008 11:31 am

Loop, Read + something like InStr() or FileRead + RegExMatch or xpath.
Buckie



Joined: 13 Feb 2008
Posts: 15
Location: Denmark

Posted: Sun May 18, 2008 11:34 am

The name and number will differ each time, so what should I search for?

I could use RegExMatch, but wouldn't I need clumsy loops too?

Say I search for "http://thisiswhatcomesbeforethenumber?" and get a match at position 7.

From position 7 I can move to the end of that search string, but I still don't know how long the number or the name is. The length could be anywhere between 1 and 10.

So I would have to write a clumsy loop to look for a < sign?


Last edited by Buckie on Sun May 18, 2008 12:12 pm; edited 1 time in total
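
The search for the closing delimiter does not need a hand-written character loop, because InStr() accepts a starting position. The following is a minimal, untested sketch of the Loop, Read + InStr() approach. It assumes the link markup looks exactly like the sample line in the first post, and `page.html` is a hypothetical file name.

```autohotkey
; Assumption: page.html is a hypothetical file name; substitute a real path.
; A literal double quote inside an expression string is written as "".
marker := "href=""?section=profile&amp;show=sig&amp;id="

Loop, Read, page.html
{
    pos := InStr(A_LoopReadLine, marker)
    if (!pos)
        continue   ; marker not on this line

    ; The number starts right after the marker and ends at the next double quote.
    idStart := pos + StrLen(marker)
    idEnd   := InStr(A_LoopReadLine, """", false, idStart)
    id      := SubStr(A_LoopReadLine, idStart, idEnd - idStart)

    ; The name starts after the "> that closes the tag and ends at the next "<".
    nameStart := idEnd + 2
    nameEnd   := InStr(A_LoopReadLine, "<", false, nameStart)
    name      := SubStr(A_LoopReadLine, nameStart, nameEnd - nameStart)

    MsgBox % id . " " . name
    break   ; stop at the first matching line
}
```

Because the end positions are found by searching rather than counted, the number and the name can be any length.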
n-l-i-d
Guest





Posted: Sun May 18, 2008 12:11 pm

See the documentation for RegExMatch, Regular Expressions (RegEx) - Quick Reference, and #EscapeChar (and explanation of escape sequences).

Not tested (and there are RegEx experts around who could probably shorten this considerably):

Code:

; the question-mark is escaped for regex
before := "<td><a href="\?section=profile&amp;show=sig&amp;id="
; the double-quote is escaped for ahk
between := """<"
after := "<"

; "(.*)" is regex for: return anything in match variable as array
pattern := before . "(.*)" . between . "(.*)" . after

FileRead, aFile, pathToAFile
RegExMatch(aFile, pattern, match)
MsgBox % match1 . " " . match2


HTH
Buckie



Joined: 13 Feb 2008
Posts: 15
Location: Denmark

Posted: Sun May 18, 2008 12:12 pm

Brilliant, thanks.
n-l-i-d
Guest





Posted: Sun May 18, 2008 12:55 pm

There is a mistake in the script ;)

To make it easier:

regex wrote:
the characters \.*?+[{|()^$ must be preceded by a backslash to be seen as literal


ahk wrote:
the characters ,%`'" might get special treatment, or have to be escaped (depends). Within an expression, two consecutive quotes enclosed inside a literal string resolve to a single literal quote.
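
One reading of these two hints, applied to the script posted earlier: the double quote after `href=` has to be doubled inside the AHK string, and the text between the number and the name is `">` rather than `"<`. The following untested sketch makes those two changes. The narrower capture groups `(\d+)` and `([^<]*)` in place of `(.*)`, the shorter `before` and longer `after` strings, and the hypothetical file name `page.html` are assumptions added here, not part of the original post.

```autohotkey
; Assumption: page.html is a hypothetical file name; substitute a real path.
FileRead, html, page.html

; "?" is a regex metacharacter, so it is escaped with a backslash.
; A literal double quote inside an AHK expression string is written as "".
before  := "<a href=""\?section=profile&amp;show=sig&amp;id="
between := """>"   ; closing quote of the href, then the end of the <a> tag
after   := "</a>"

; (\d+) captures the number; ([^<]*) captures the name up to the next "<".
pattern := before . "(\d+)" . between . "([^<]*)" . after

if (RegExMatch(html, pattern, match))
    MsgBox % match1 . " " . match2
else
    MsgBox No match found.
```

For the sample line in the first post, this should leave 11075 in match1 and e9.bogeyman in match2. Since the pattern is matched against the whole file contents, the line number of the link does not matter.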