Buckie
Joined: 13 Feb 2008 Posts: 15 Location: Denmark
Posted: Sun May 18, 2008 11:26 am Post subject: Extracting information from HTML documents.
Hi guys, I'm interested in extracting 11075 and e9.bogeyman from the line below.
Code:
<td><a href="?section=profile&show=sig&id=11075">e9.bogeyman</a> <img src="gfx/laenderflaggen/Singapore.gif" alt="[SG]" title="Singapore"> <img src="gfx/laenderflaggen/Denmark.gif" alt="[DK]" title="Denmark"></td>
The lengths of the name and the number may differ. The line number may also differ between documents (in this case it is 220).
My own best suggestion would be to download the file and use a loop to search each line for the string:
href="?section=profile&show=sig&id
However, this method might be slow, and the script needs to parse around 100,000 HTML documents. I also don't know how to pick out the number and the name, since both vary in length.
Any suggestions?
Thanks in advance.
n-l-i-d Guest
Posted: Sun May 18, 2008 11:31 am
Use Loop, Read with something like InStr(), or FileRead with RegExMatch, or XPath.
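The first option can be sketched roughly as follows. This is only an illustration in AutoHotkey v1 syntax, not tested against the real pages: `page.html` is a hypothetical local copy of one document, and the marker string is taken from the sample line in the first post. InStr() finds where the id starts, and two further InStr() calls find the closing quote and the next `<`, so no character-by-character loop is needed.

```autohotkey
; Assumption: page.html is a hypothetical local copy of one downloaded document.
marker := "show=sig&id="
Loop, Read, page.html
{
    pos := InStr(A_LoopReadLine, marker)
    if (!pos)
        continue                                            ; marker not on this line
    idStart := pos + StrLen(marker)                         ; first digit of the id
    idEnd := InStr(A_LoopReadLine, """", false, idStart)    ; closing quote of the href
    nameEnd := InStr(A_LoopReadLine, "<", false, idEnd)     ; the "<" that opens </a>
    id := SubStr(A_LoopReadLine, idStart, idEnd - idStart)
    ; Skip the two characters "> that sit between the id and the name
    name := SubStr(A_LoopReadLine, idEnd + 2, nameEnd - idEnd - 2)
    MsgBox % id . " " . name
    break                                                   ; stop after the first hit
}
```

Because the end positions are searched for rather than assumed, the id and the name can be any length.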
Buckie
Joined: 13 Feb 2008 Posts: 15 Location: Denmark
Posted: Sun May 18, 2008 11:34 am
The name and number will differ each time, so what should I search for?
I could use RegExMatch, but wouldn't I need clumsy loops too?
If I search for "http://thisiswhatcomesbeforethenumber?" and, let's say, get a match at position 7, I can move to the end of that string, but I still don't know how long the number or the name is. The length could be anywhere between 1 and 10 characters.
So would I have to write a clumsy loop that looks for a < sign?
Last edited by Buckie on Sun May 18, 2008 12:12 pm; edited 1 time in total
n-l-i-d Guest
Posted: Sun May 18, 2008 12:11 pm
See RegExMatch, Regular Expressions (RegEx) - Quick Reference, and #EscapeChar (and explanation of escape sequences) in the AHK documentation.
Not tested (and there are RegEx experts around who could probably shorten this considerably):
Code:
; The question mark is escaped for regex; the quote after href= is doubled for AHK
before := "<td><a href=""\?section=profile&show=sig&id="
; A doubled quote again: this is the literal "> between the id and the name
between := """>"
after := "<"
; (\d+) captures the id digits, ([^<]*) captures the name up to the next "<"
pattern := before . "(\d+)" . between . "([^<]*)" . after
FileRead, aFile, pathToAFile
RegExMatch(aFile, pattern, match)
MsgBox % match1 . " " . match2
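Since around 100,000 documents have to be parsed, the same FileRead + RegExMatch idea can be wrapped in a file loop. The following is a rough sketch in AutoHotkey v1 syntax under stated assumptions: the pages are already saved in a hypothetical folder `C:\pages`, and the results go to a hypothetical file `results.txt`. It uses `(\d+)` and `([^<]*)` instead of `(.*)` so that a capture cannot run past the id or the name.

```autohotkey
; Assumptions: C:\pages and results.txt are hypothetical names for illustration.
pattern := "<a href=""\?section=profile&show=sig&id=(\d+)"">([^<]*)<"
Loop, C:\pages\*.html
{
    FileRead, html, %A_LoopFileFullPath%
    if (ErrorLevel)                        ; file could not be read, skip it
        continue
    ; Only the first match in each file is taken; match1 = id, match2 = name
    if (RegExMatch(html, pattern, match))
        FileAppend, %match1%`t%match2%`n, results.txt
}
```

Each file is read once and searched with a single regex call, which avoids scanning line by line in script code.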
HTH
Buckie
Joined: 13 Feb 2008 Posts: 15 Location: Denmark
Posted: Sun May 18, 2008 12:12 pm
Brilliant, thanks.
n-l-i-d Guest
Posted: Sun May 18, 2008 12:55 pm
There was a mistake in the script as first posted: the double quotes inside the strings were not escaped correctly for AHK.
To make it easier:
regex wrote: the characters \.*?+[{|()^$ must be preceded by a backslash to be seen as literal
ahk wrote: the characters ,%`'" might get special treatment, or have to be escaped (depends). Within an expression, two consecutive quotes enclosed inside a literal string resolve to a single literal quote.
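To see how the two escaping layers combine, here is a small self-contained sketch in AutoHotkey v1 syntax. The sample line is shortened from the first post, and the variable names `line`, `pattern` and `m` are made up for illustration.

```autohotkey
; Sample line from the first post; every literal quote is doubled for AHK.
line := "<td><a href=""?section=profile&show=sig&id=11075"">e9.bogeyman</a>"
; Regex layer: the literal "?" is written as "\?".
; AHK layer: each literal double quote inside the string is written as "".
pattern := "href=""\?section=profile&show=sig&id=(\d+)"">([^<]*)<"
if (RegExMatch(line, pattern, m))
    MsgBox % "id: " . m1 . "`nname: " . m2    ; shows id: 11075 and name: e9.bogeyman
```

The backslash belongs to the regex layer and the doubled quotes belong to the AHK layer, so a pattern that contains both `?` and `"` needs both kinds of escaping.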