AutoHotkey Community Let's help each other out
SKAN
Joined: 26 Dec 2005 Posts: 8688
Posted: Fri Nov 20, 2009 12:22 am Post subject: StrX() :: Auto-Parser for XML / HTML
StrX() is a wrapper that extends SubStr()'s functionality. It accepts two delimiter strings ( begin & end ) and extracts the text between them. It is similar to RegExMatch( Str, "BeginStr(.*)EndStr", SubPat ), but the major difference is that StrX() allows flexibility in the final length of the resultant string: it can trim or expand characters at either or both ends of the result.
| Quote: |
Announcement: The current version 1.0 can auto-parse when used with a While loop. Please check out the updated examples.
|
StrX( H, BS,BO,BT, ES,EO,ET, NextOffset )
Parameters
- 1 ) H = HayStack. The "Source Text".
- 2 ) BS = BeginStr. The string that marks the left extreme of the Resultant String.
- 3 ) BO = BeginOffset.
The position in "Source Text" at which the search for BeginStr starts ( it is passed to InStr() as the starting position, so 1 searches from the first character ).
- Pass a 0 to search in reverse ( from right-to-left ) in "Source Text".
- If you intend to call StrX() from a loop, pass the same variable used as the 8th parameter, which simplifies the parsing process.
- 4 ) BT = BeginTrim.
Number of characters to trim from the left extreme of the Resultant String.
- Pass the string length of BeginStr if you want to omit it from the Resultant String.
- Pass a negative value if you want to expand the left extreme of the Resultant String.
- 5 ) ES = EndStr. The string that marks the right extreme of the Resultant String.
- 6 ) EO = EndOffset.
Can only be True or False.
If False, EndStr is searched for from the end of "Source Text" ( right-to-left ).
If True, the search is conducted forward from the start of the Resultant String ( that is, from the BeginStr search result ), or from the start of "Source Text" when no BeginStr is given.
- 7 ) ET = EndTrim.
Number of characters to trim from the right extreme of the Resultant String.
- Pass the string length of EndStr if you want to omit it from the Resultant String.
- Pass a negative value if you want to expand the right extreme of the Resultant String.
- 8 ) NextOffset : The name of a ByRef variable that StrX() updates with the offset just past the end of the Resultant String. You may pass the same variable as parameter 3 to simplify data parsing in a loop.
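To make the trim and offset parameters concrete, here is a minimal sketch on a made-up string. The sample text and the variable names Str, A, B and C are only illustrative and are not part of the original post; StrX() itself is copied unchanged from this post so that the snippet runs on its own.

```autohotkey
Str := "<b>Hello</b> <b>World</b>"        ; hypothetical sample text

A := StrX( Str, "<b>",1,0, "</b>",1,0 )    ; BT=0, ET=0 : both delimiters are kept
B := StrX( Str, "<b>",1,3, "</b>",1,4 )    ; BT=3, ET=4 : both delimiters are trimmed off
C := StrX( Str, "<b>",1,3, "</b>",0,4 )    ; EO=0 : EndStr is searched from the right
MsgBox, %A%`n%B%`n%C%
; A is <b>Hello</b>
; B is Hello
; C is Hello</b> <b>World

StrX( H, BS="",BO=0,BT=1, ES="",EO=0,ET=1, ByRef N="" ) { ; | by Skan | 19-Nov-2009
Return SubStr(H,P:=(((Z:=StrLen(ES))+(X:=StrLen(H))+StrLen(BS)-Z-X)?((T:=InStr(H,BS,0,((BO
<0)?(1):(BO))))?(T+BT):(X+1)):(1)),(N:=P+((Z)?((T:=InStr(H,ES,0,((EO)?(P+1):(0))))?(T-P+Z
+(0-ET)):(X+P)):(X)))-P) ; v1.0-196c 21-Nov-2009 www.autohotkey.com/forum/topic51354.html
}
```

Passing the length of a delimiter as its trim value ( 3 for "<b>", 4 for "</b>" ) is what removes it from the result; with EO set to False the last "</b>" in the text is used instead of the first one after the match.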
Here are real-world examples that demonstrate StrX()'s functionality:
Example 1 : A script to retrieve real-time details of the last 15 posts made in our forum.
| Code: |
UrlDownloadToFile, http://www.autohotkey.com/forum/rss.php, ahkrss.xml ; 01
FileRead, xml, ahkrss.xml ; 02
While Item := StrX( xml , "<item>" ,N,0, "</item>" ,1,0, N ) ; 03
  Title := StrX( Item, "<title>",1,7, "</title>",1,8 ) ; 04
  , Link := StrX( Item, "<link>" ,1,6, "</link>" ,1,7 ) ; 05
  , List .= "`n`n" A_Index ")`t" Title "`n`t" Link ; 06
MsgBox, 64, Latest Posts on AHK Forum, %List% ; 07
|
| Quote: |
Note: The result of the above script may contain HTML entities, like below:
15) Ask for Help :: &quot;Jump to&quot; video frame (i.e. &quot;seek&quot;
You may use UnHTM() on Title to convert it to proper text.
|
Example 2 : Download and extract links from a Google Search Result
| Code: |
UrlDownloadToFile, % "http://www.google.com/search?hl=en&lr=&safe=active&rlz=1C1GGLS_enIN"
. "307IN307&num=10&q=site:autohotkey.com&aq=f&oq=&aqi=", Google.htm
FileRead, html, Google.htm
While Item := StrX( html, "<h3 class=""r""><a href=",N,0, "<li class=g>",1,12, N )
  Sub1 := StrX( Item, "<a href=",1,9, """" ,1,1, T )
  , Sub2 := StrX( Item, ">", T,1, "</a>",1,4 )
  , Text .= UnHTM( Sub2 ) "`n" Sub1 "`n`n"
MsgBox, %Text% ; Dependency :: Get UnHTM() www.autohotkey.com/forum/viewtopic.php?t=51342
|
Example 3 : Movie-DB Creator 66L for IMDb.com
Example 4 : ListView for http://www.google.com/movies
Example 5 : Yahoo! Weather in TrayTip
... and finally here is StrX()
| Code: |
StrX( H, BS="",BO=0,BT=1, ES="",EO=0,ET=1, ByRef N="" ) { ; | by Skan | 19-Nov-2009
Return SubStr(H,P:=(((Z:=StrLen(ES))+(X:=StrLen(H))+StrLen(BS)-Z-X)?((T:=InStr(H,BS,0,((BO
<0)?(1):(BO))))?(T+BT):(X+1)):(1)),(N:=P+((Z)?((T:=InStr(H,ES,0,((EO)?(P+1):(0))))?(T-P+Z
+(0-ET)):(X+P)):(X)))-P) ; v1.0-196c 21-Nov-2009 www.autohotkey.com/forum/topic51354.html
}
|
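Since the function is packed into a single expression, a step-by-step reading may help. The following is only a sketch of what the expression appears to do, unrolled into separate statements; the name StrX_Expanded and the local name L are hypothetical and were chosen for illustration, and the posted one-liner remains the reference version.

```autohotkey
Part := StrX_Expanded( "<b>Hello</b>", "<b>",1,3, "</b>",1,4, Next )   ; made-up sample text
MsgBox, %Part% / %Next%            ; Part is Hello, Next is 9 ( where "</b>" starts, because ET=4 )

; Hypothetical, readability-only equivalent of StrX() v1.0
StrX_Expanded( H, BS="", BO=0, BT=1, ES="", EO=0, ET=1, ByRef N="" ) {
    X := StrLen( H ), Z := StrLen( ES )
    ; 1) Start position P of the result
    If ( StrLen( BS ) ) {
        T := InStr( H, BS, 0, ( BO < 0 ) ? 1 : BO )   ; a negative BeginOffset is replaced by 1
        P := T ? T + BT : X + 1                       ; BeginStr not found: point past the end of H
    } Else
        P := 1                                        ; no BeginStr: start at the first character
    ; 2) Length L of the result
    If ( Z ) {
        T := InStr( H, ES, 0, EO ? P + 1 : 0 )        ; EO true: search after P, else from the right
        L := T ? T - P + Z - ET : X + P               ; EndStr not found: run to the end of H
    } Else
        L := X                                        ; no EndStr: take everything from P onwards
    N := P + L                                        ; offset just past the result, returned ByRef
    Return SubStr( H, P, L )
}
```

Read this way, BeginTrim shifts the start position forward, EndTrim shortens the length, and N always ends up just past the returned text, which is why feeding N back in as BeginOffset walks through the source in a loop.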
Last edited by SKAN on Fri May 14, 2010 7:46 am; edited 13 times in total
The Naked General
Joined: 22 Feb 2009 Posts: 21 Location: Dallas TX
Posted: Fri Nov 20, 2009 4:18 am Post subject:
This is great, Skan!
I'll have to go back and clean up some old parsing scripts with it. Thanks a bunch.
_________________
"lol, i made this thing, but it didn't work... so I read the forums and now it does!"
linpinger
Joined: 20 Oct 2007 Posts: 14 Location: china,hubei
Posted: Fri Nov 20, 2009 9:50 am Post subject:
If I comment out line No. 03,
it goes into an endless loop.
Why does this happen?
My English is poor, ^_^
SKAN
Joined: 26 Dec 2005 Posts: 8688
Posted: Fri Nov 20, 2009 11:43 am Post subject:
"Title Post" updated with Example 2: Download and extract links from a Google Search Result ( the code is in the first post ).
On a related note, here is Lexikos' COM version for the same:
http://www.autohotkey.com/forum/viewtopic.php?p=182714#182714
Last edited by SKAN on Sat May 08, 2010 3:25 am; edited 2 times in total
linpinger
Joined: 20 Oct 2007 Posts: 14 Location: china,hubei
Posted: Fri Nov 20, 2009 1:58 pm Post subject:
Thanks for the reply, SKAN!
I still don't understand:
when the search reaches the end of the string, why doesn't it stop and break?
I had to add some extra checking code.
Adding these three lines inside the While loop makes it break:
| Code: |
if ( N < old )
    break
old := N
|
SKAN
Joined: 26 Dec 2005 Posts: 8688
Posted: Fri Nov 20, 2009 8:17 pm Post subject:
| linpinger wrote: |
I still don't understand:
when the search reaches the end of the string, why doesn't it stop and break?
I had to add some extra checking code.
Adding these three lines inside the While loop makes it break:
| Code: |
if ( N < old )
    break
old := N
|
|
My code was at fault. I have rewritten the function, and the new version is posted at the top.
You do not have to add code anymore. When used with a While loop, StrX() will
automatically parse the data and exit the loop gracefully.
Please test the updated examples and let me know how it goes.
| linpinger wrote: |
Thanks for the reply, SKAN!
|
Er.. you might find my reply missing, as I have deleted it:
it does not fit the current version of StrX() and may cause confusion.
Thank you.
linpinger
Joined: 20 Oct 2007 Posts: 14 Location: china,hubei
Posted: Sat Nov 21, 2009 2:13 am Post subject:
I have got the latest StrX().
It's completely great!
I noticed that the new Example 1 doesn't have
N := 1
That means N is blank. Does it matter?
(The result is right; there is no problem.)
SKAN
Joined: 26 Dec 2005 Posts: 8688
Posted: Sat Nov 21, 2009 8:04 am Post subject:
| linpinger wrote: |
I have got the latest StrX().
It's completely great!
|
Thanks for testing it.
| linpinger wrote: |
I noticed that the new Example 1 doesn't have
N := 1
That means N is blank. Does it matter?
(The result is right; there is no problem.)
|
It is a side effect. The code tests the value of BeginOffset to make sure a negative value is not passed to InStr(), and a blank N ends up in the same branch as a negative value, so the search starts at offset 1.
| Code: |
BO < 0 ? 1 : BO ; If BO is less than 0 use 1 - otherwise use BO itself
|
If you want to run both the posted examples from the same script,
then you have to put an N := 1 between them to reset N,
or you can name the variables differently, like N1 and N2.
Maybe StrX() should reset N to 1 when it is about to return an empty string?
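The need for the reset can be sketched with two loops that share N. The sample strings and the names A, B, Out and AfterFirst are made up for this illustration; StrX() is copied unchanged from the first post so that the snippet runs on its own.

```autohotkey
A := "<i>1</i><i>2</i>",  B := "<i>3</i><i>4</i>"     ; hypothetical sample data

While Item := StrX( A, "<i>",N,3, "</i>",1,4, N )     ; N is blank on the first pass
    Out .= Item " "
AfterFirst := N               ; now greater than StrLen( A ): nothing can be found from here

N := 1                        ; without this reset the second loop would end immediately
While Item := StrX( B, "<i>",N,3, "</i>",1,4, N )
    Out .= Item " "
MsgBox, %Out%`nN after the first loop: %AfterFirst%

StrX( H, BS="",BO=0,BT=1, ES="",EO=0,ET=1, ByRef N="" ) { ; | by Skan | 19-Nov-2009
Return SubStr(H,P:=(((Z:=StrLen(ES))+(X:=StrLen(H))+StrLen(BS)-Z-X)?((T:=InStr(H,BS,0,((BO
<0)?(1):(BO))))?(T+BT):(X+1)):(1)),(N:=P+((Z)?((T:=InStr(H,ES,0,((EO)?(P+1):(0))))?(T-P+Z
+(0-ET)):(X+P)):(X)))-P) ; v1.0-196c 21-Nov-2009 www.autohotkey.com/forum/topic51354.html
}
```

Using separate variables ( N1 and N2, as suggested above ) avoids the reset altogether, because each loop then starts from a blank offset.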
linpinger
Joined: 20 Oct 2007 Posts: 14 Location: china,hubei
Posted: Sat Nov 21, 2009 12:08 pm Post subject:
| SKAN wrote: |
Maybe StrX() should reset N to 1 when it is about to return an empty string?
|
I think resetting N is a good idea.
When we use N as the last parameter,
it always ends up with N > StrLen(xml),
so N does not seem very useful at that point; resetting it is a good idea.
SKAN
Joined: 26 Dec 2005 Posts: 8688
Posted: Mon Nov 23, 2009 5:18 am Post subject:
| linpinger wrote: |
I think resetting N is a good idea.
When we use N as the last parameter,
it always ends up with N > StrLen(xml),
so N does not seem very useful at that point; resetting it is a good idea.
|
| Code: |
While Item := StrX( html, "<h3 class=r><a href=",N,0, "<li class=g>",1,12, N )
  Sub1 := StrX( Item, "<a href=",1,9, """" ,1,1, T )
  , Sub2 := StrX( Item, ">", T,1, "</a>",1,4 )
|
In the above, if the Sub1 result is empty, Sub2 will definitely be empty as well ( T is left beyond the end of Item, so the second search finds nothing ), which is the best behaviour to expect.
SKAN
Joined: 26 Dec 2005 Posts: 8688
Posted: Thu Mar 25, 2010 8:21 am Post subject:
A real-world example of using StrX() & UnHTM() to parse text out of HTML
Row Structure for Text DB
1. Year
2. Movie title
3. MPAA Rating
4. Runtime ( in Minutes )
5. IMDb hash - should be prefixed with www.imdb.com/title/tt to form a proper URL
6. User Rating ( 1.0 to 10.0 )
7. User Votes
8. Genre ( Pipe Delimited Values )
9. Director
10. Stars ( Comma Separated Values )
11. Movie Outline
| Quote: |
One may use the "IMDb hash" to connect with other providers:
The IMDb hash for "The Shawshank Redemption" is 0111161
1) Connect to www.themoviedb.org for extended Movie Info : http://api.themoviedb.org/2.1/Movie.imdbLookup/en/xml/APIKEY/tt0111161
2) Retrieve Images (Posters/Cover) from www.themoviedb.org : http://api.themoviedb.org/2.1/Movie.getImages/en/xml/APIKEY/tt0111161
3) Connect to www.opensubtitles.org for subtitles : http://www.opensubtitles.org/en/search/imdbid-0111161/sublanguageid-eng/rss_2_00
Again, the above methods return data in XML format, which you may parse out with StrX()
|
Movie-DB Creator 66L for IMDb.com
| Code: |
; Movie-DB Creator 66L for IMDb.com ; By Skan / Last Modified: 24-Mar-2010
; Forum Post : www.autohotkey.com/forum/viewtopic.php?p=342196#342196
; Sample Output: www.autohotkey.net/~Skan/Scripts/StrX/IMDb/IMDb.txt (3.94 MiB)
; !!! Caution : Downloads around 1000 webpages from IMDb. Time/Bandwidth consuming operation.
#SingleInstance, Force
SetBatchLines, -1
IMDB_SR := A_Temp "\IMDB_Search_Results.htm" , IMDB_WF := A_Temp "\IMDB_Work_File.txt"
IMDB_URL := "http://www.imdb.com/search/title?has=asin-dvd-us&languages=en&num_votes=6,&"
. "sort=num_votes,desc&start=1&title_type=feature"
FileDelete, %IMDB_WF%
URLDownloadToFile, %IMDB_URL%, %IMDB_SR%
FileRead, HTM, %IMDB_SR%
TotalE := StrX( HTM, "<div id=""left"">",1,24, "titles",1,7 )
SysGet, m, MonitorWorkArea, 1
Y := (mBottom-46-2), X := (mRight-200-2), TotalE := RegExReplace( TotalE,"," )
Progress, CWE6E3E4 CT000020 CBF73D00 x%x% y%y% w200 h46 B1 FS8 WM700 WS400 FM8 ZH8 ZY3
, Downloading from IMDb, % "Page 1/" TP:=Round(TotalE/20), , Arial
Z := A_Tab, StartE := 1
Loop {
List := "", N := 1
While( TR := StrX( HTM, "<tr class=",N,0, "</tr>",1,5, N ) )
URat := StrX( TR, "Users rated this ",1,17, "/",1,1, O1 )
, Vote := StrX( TR, "(",O1,1, " votes",1,6, O2 )
, Vote := RegExReplace( Vote,"," )
, IMDB := StrX( TR, "href=""/title/tt",0,15, """",1,2, O2 ) ; Reverse Search
, Title := UnHTM( StrX( TR, ">",O2,1, "<",1,1, O3 ))
, Year := StrX( TR, "year_type"">(",O3,12,")",1,1 )
, OutL := UnHTM( StrX( TR, "outline"">",1,9, "<",1,1 ))
, Dir := UnHTM( StrX( TR, "Dir: <",1,5, "</a>",1,4 ))
, Star := UnHTM( StrX( TR, "With: <",1,6, "</span>",1,8 ))
, Gen := UnHTM( StrX( TR, "class=""genre",1,-6, "</span>",1,0, O4 ))
, Gen := RegExReplace( Gen,A_Space )
, CE := StrX( TR, "title=",O4,6, A_Space,1,1 )
, RunT := UnHTM( StrX( TR, "class=""runtime",1,-6, " mins",1,5 ))
, List .= Year Z Title Z CE Z RunT Z IMDB Z URat Z Vote Z Gen Z Dir Z Star Z OutL "`n"
FileAppend, %List%, %IMDB_WF%
StringReplace, IMDB_URL, IMDB_URL, start=%StartE%, % "start=" ( StartE := StartE+20 )
IfGreater,StartE,%TotalE%, Break
Progress, % (StartE/TotalE)*100, % "Page " Round(StartE/20) "/" TP, Downloading from IMDb
URLDownloadToFile, %IMDB_URL%, %IMDB_SR%
FileRead, HTM, %IMDB_SR%
}
FileCopy, %IMDB_WF%, %A_ScriptDir%\IMDb.txt, 1
Return ; // end of auto-execute section //
StrX(H, BS="",BO=0,BT=1, ES="",EO=0,ET=1, ByRef N="" ) { ; | by Skan | 19-Nov-2009
Return SubStr(H,P:=(((Z:=StrLen(ES))+(X:=StrLen(H))+StrLen(BS)-Z-X)?((T:=InStr(H,BS,0,((BO
<0)?(1):(BO))))?(T+BT):(X+1)):(1)),(N:=P+((Z)?((T:=InStr(H,ES,0,((EO)?(P+1):(0))))?(T-P+Z
+(0-ET)):(X+P)):(X)))-P) ; v1.0-196c 21-Nov-2009 www.autohotkey.com/forum/topic51354.html
}
UnHTM( HTM ) { ; Remove HTML formatting / Convert to ordinary text by SKAN 19-Nov-2009
Static HT ; Forum Topic: www.autohotkey.com/forum/topic51342.html
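IfEqual,HT,, SetEnv,HT, % "&aacuteá&acircâ&acute´&aeligæ&agraveà&amp&&aringå&atildeã&au"
. "mlä&bdquo„&brvbar¦&bull•&ccedilç&cedil¸&cent¢&circˆ&copy©&curren¤&dagger†&dagger‡&deg"
. "°&divide÷&eacuteé&ecircê&egraveè&ethð&eumlë&euro€&fnofƒ&frac12½&frac14¼&frac34¾&gt>&h"
. "ellip…&iacuteí&icircî&iexcl¡&igraveì&iquest¿&iumlï&laquo«&ldquo“&lsaquo‹&lsquo‘&lt<&m"
. "acr¯&mdash—&microµ&middot·&nbsp &ndash–&not¬&ntildeñ&oacuteó&ocircô&oeligœ&ograveò&or"
. "dfª&ordmº&oslashø&otildeõ&oumlö&para¶&permil‰&plusmn±&pound£&quot""&raquo»&rdquo”&reg"
. "®&rsaquo›&rsquo’&sbquo‚&scaronš&sect§&shy­&sup1¹&sup2²&sup3³&szligß&thornþ&tilde˜&tim"
. "es×&trade™&uacuteú&ucircû&ugraveù&uml¨&uumlü&yacuteý&yen¥&yumlÿ"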
TXT := RegExReplace( HTM,"<[^>]+>" ) ; Remove all tags between "<" and ">"
Loop, Parse, TXT, &`; ; Create a list of special characters
L := "&" A_LoopField ";", R .= (!(A_Index&1)) ? ( (!InStr(R,L,1)) ? L:"" ) : ""
StringTrimRight, R, R, 1
Loop, Parse, R , `; ; Parse Special Characters
If F := InStr( HT, A_LoopField ) ; Lookup HT Data
StringReplace, TXT,TXT, %A_LoopField%`;, % SubStr( HT,F+StrLen(A_LoopField), 1 ), All
Else If ( SubStr( A_LoopField,2,1)="#" )
StringReplace, TXT, TXT, %A_LoopField%`;, % Chr(SubStr(A_LoopField,3)), All
Return RegExReplace( TXT, "(^\s*|\s*$)") ; Remove leading/trailing white spaces
}
|
EDITS:
13-Sep-2010 : IMDb had updated its HTML layout, breaking the script. The script has now been altered and is working.
Last edited by SKAN on Mon Sep 13, 2010 2:04 pm; edited 4 times in total
noise
Joined: 14 May 2009 Posts: 57 Location: UK
Posted: Thu Mar 25, 2010 8:35 am Post subject:
Excellent, thanks SKAN.
_________________
PrimalNoise.com
It's a rock. Can't wait to tell my friends. They don't have a rock this big.
SKAN
Joined: 26 Dec 2005 Posts: 8688
Posted: Thu Mar 25, 2010 12:32 pm Post subject:
| noise wrote: |
Excellent, thanks SKAN.
|
Thanks, and you are welcome.
BTW, I have updated the script to include one more field, MPAA Rating,
and have also provided additional info on IMDb hash usage.