Real-world example of using StrX() & UnHTM() to parse text out of HTML
Row structure for the text DB (one movie per line, fields separated by tabs)
1. Year
2. Movie title
3. MPAA Rating
4. Runtime ( in Minutes )
5. IMDb hash - should be prefixed with www.imdb.com/title/tt to form a proper URL
6. User Rating ( 1.0 to 10.0 )
7. User Votes
8. Genre ( Pipe Delimited Values )
9. Director
10. Stars ( Comma Separated Values )
11. Movie Outline
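To make the row structure concrete, here is a minimal sketch that reads the finished file back and shows a few fields per movie. It assumes IMDb.txt sits next to the script (that is where the creator script copies it) and that every row carries all 11 tab-separated fields; the variable names are only illustrative.

```autohotkey
; Sketch: browse IMDb.txt row by row. Assumes the file sits next to this script.
DBFile := A_ScriptDir "\IMDb.txt"
Loop, Read, %DBFile%
{
    StringSplit, F, A_LoopReadLine, %A_Tab%      ; F1..F11 follow the row structure above
    If ( F0 < 11 )                               ; skip blank or incomplete rows
        Continue
    URL := "http://www.imdb.com/title/tt" F5     ; field 5 is the IMDb hash
    MsgBox, 4, % F2 " (" F1 ")", % "Rating: " F6 " from " F7 " votes`n" URL "`n`nShow the next row?"
    IfMsgBox, No
        Break
}
```

StringSplit is used here because it is available in every AutoHotkey v1 build; the genre and stars fields (F8 and F10) can be split a second time on "|" and "," in the same way.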
Quote:
One may use the "IMDb hash" to connect with other providers:
The IMDb hash for "The Shawshank Redemption" is 0111161
1) Connect to www.themoviedb.org for extended Movie Info : http://api.themoviedb.org/2.1/Movie.imdbLookup/en/xml/APIKEY/tt0111161
2) Retrieve Images (Posters/Cover) from www.themoviedb.org : http://api.themoviedb.org/2.1/Movie.getImages/en/xml/APIKEY/tt0111161
3) Connect to www.opensubtitles.org for subtitles : http://www.opensubtitles.org/en/search/imdbid-0111161/sublanguageid-eng/rss_2_00
Again, the above methods return data in XML format, which you may parse with StrX().
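As a rough illustration of the quote above, the sketch below assembles the three provider URLs from one IMDb hash and then pulls a value out of an XML string with StrX(). The XML fragment and its element names are hypothetical (real responses use their own element names), APIKEY is the same placeholder used in the URLs above, and StrX() is copied unchanged from the script in this post so the snippet runs on its own.

```autohotkey
; Sketch: build the provider URLs from an IMDb hash, then extract one value from XML.
Hash   := "0111161"          ; "The Shawshank Redemption"
APIKey := "APIKEY"           ; placeholder, as in the URLs above

InfoURL := "http://api.themoviedb.org/2.1/Movie.imdbLookup/en/xml/" APIKey "/tt" Hash
ImgURL  := "http://api.themoviedb.org/2.1/Movie.getImages/en/xml/" APIKey "/tt" Hash
SubURL  := "http://www.opensubtitles.org/en/search/imdbid-" Hash "/sublanguageid-eng/rss_2_00"

; Hypothetical XML fragment, used only to show the StrX() call.
XML  := "<movie><name>The Shawshank Redemption</name><year>1994</year></movie>"
; Begin 6 characters after the start of "<name>", end at "</name>" and trim its 7 characters.
Name := StrX( XML, "<name>",1,6, "</name>",1,7 )

MsgBox, % Name "`n" InfoURL "`n" ImgURL "`n" SubURL
Return

StrX(H, BS="",BO=0,BT=1, ES="",EO=0,ET=1, ByRef N="" ) { ; | by Skan | 19-Nov-2009
Return SubStr(H,P:=(((Z:=StrLen(ES))+(X:=StrLen(H))+StrLen(BS)-Z-X)?((T:=InStr(H,BS,0,((BO
<0)?(1):(BO))))?(T+BT):(X+1)):(1)),(N:=P+((Z)?((T:=InStr(H,ES,0,((EO)?(P+1):(0))))?(T-P+Z
+(0-ET)):(X+P)):(X)))-P) ; v1.0-196c 21-Nov-2009 www.autohotkey.com/forum/topic51354.html
}
```

To work with a live response, download it with URLDownloadToFile and FileRead, as the script below does for IMDb, and adjust the begin/end strings to the elements actually returned.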
Movie-DB Creator 66L for IMDb.com
Code:
; Movie-DB Creator 66L for IMDb.com ; By Skan / Last Modified: 24-Mar-2010
; Forum Post : www.autohotkey.com/forum/viewtopic.php?p=342196#342196
; Sample Output: www.autohotkey.net/~Skan/Scripts/StrX/IMDb/IMDb.txt (3.94 MiB)
; !!! Caution : Downloads around 1000 webpages from IMDb. Time/Bandwidth consuming operation.
#SingleInstance, Force
SetBatchLines, -1
IMDB_SR := A_Temp "\IMDB_Search_Results.htm" , IMDB_WF := A_Temp "\IMDB_Work_File.txt"
IMDB_URL := "http://www.imdb.com/search/title?has=asin-dvd-us&languages=en&num_votes=6,&"
. "sort=num_votes,desc&start=1&title_type=feature"
FileDelete, %IMDB_WF%
URLDownloadToFile, %IMDB_URL%, %IMDB_SR%
FileRead, HTM, %IMDB_SR%
TotalE := StrX( HTM, "<div id=""left"">",1,24, "titles",1,7 )
SysGet, m, MonitorWorkArea, 1
Y := (mBottom-46-2), X := (mRight-200-2), TotalE := RegExReplace( TotalE,"," )
Progress, CWE6E3E4 CT000020 CBF73D00 x%x% y%y% w200 h46 B1 FS8 WM700 WS400 FM8 ZH8 ZY3
, Downloading from IMDb, % "Page 1/" TP:=Round(TotalE/20), , Arial
Z := A_Tab, StartE := 1
Loop {
  List := "", N := 1
  ; Each search result is one <tr class=...> row; N carries the offset for the next search
  While( TR := StrX( HTM, "<tr class=",N,0, "</tr>",1,5, N ) )
    URat := StrX( TR, "Users rated this ",1,17, "/",1,1, O1 )
    , Vote := StrX( TR, "(",O1,1, " votes",1,6, O2 )
    , Vote := RegExReplace( Vote,"," )
    , IMDB := StrX( TR, "href=""/title/tt",0,15, """",1,2, O2 ) ; Reverse Search
    , Title := UnHTM( StrX( TR, ">",O2,1, "<",1,1, O3 ))
    , Year := StrX( TR, "year_type"">(",O3,12,")",1,1 )
    , OutL := UnHTM( StrX( TR, "outline"">",1,9, "<",1,1 ))
    , Dir := UnHTM( StrX( TR, "Dir: <",1,5, "</a>",1,4 ))
    , Star := UnHTM( StrX( TR, "With: <",1,6, "</span>",1,8 ))
    , Gen := UnHTM( StrX( TR, "class=""genre",1,-6, "</span>",1,0, O4 ))
    , Gen := RegExReplace( Gen,A_Space )
    , CE := StrX( TR, "title=",O4,6, A_Space,1,1 )
    , RunT := UnHTM( StrX( TR, "class=""runtime",1,-6, " mins",1,5 ))
    , List .= Year Z Title Z CE Z RunT Z IMDB Z URat Z Vote Z Gen Z Dir Z Star Z OutL "`n"
  FileAppend, %List%, %IMDB_WF%
  ; Advance start= by 20 (one page of results) and stop once past the last title
  StringReplace, IMDB_URL, IMDB_URL, start=%StartE%, % "start=" ( StartE := StartE+20 )
  IfGreater,StartE,%TotalE%, Break
  Progress, % (StartE/TotalE)*100, % "Page " Round(StartE/20) "/" TP, Downloading from IMDb
  URLDownloadToFile, %IMDB_URL%, %IMDB_SR%
  FileRead, HTM, %IMDB_SR%
}
FileCopy, %IMDB_WF%, %A_ScriptDir%\IMDb.txt, 1
Return ; // end of auto-execute section //
StrX(H, BS="",BO=0,BT=1, ES="",EO=0,ET=1, ByRef N="" ) { ; | by Skan | 19-Nov-2009
Return SubStr(H,P:=(((Z:=StrLen(ES))+(X:=StrLen(H))+StrLen(BS)-Z-X)?((T:=InStr(H,BS,0,((BO
<0)?(1):(BO))))?(T+BT):(X+1)):(1)),(N:=P+((Z)?((T:=InStr(H,ES,0,((EO)?(P+1):(0))))?(T-P+Z
+(0-ET)):(X+P)):(X)))-P) ; v1.0-196c 21-Nov-2009 www.autohotkey.com/forum/topic51354.html
}
UnHTM( HTM ) { ; Remove HTML formatting / Convert to ordinary text by SKAN 19-Nov-2009
 Static HT ; Forum Topic: www.autohotkey.com/forum/topic51342.html
 IfEqual,HT,, SetEnv,HT, % "&aacuteá&acircâ&acute´&aeligæ&agraveà&amp&&aringå&atildeã&au"
 . "mlä&bdquo„&brvbar¦&bull•&ccedilç&cedil¸&cent¢&circˆ&copy©&curren¤&dagger†&dagger‡&deg"
 . "°&divide÷&eacuteé&ecircê&egraveè&ethð&eumlë&euro€&fnofƒ&frac12½&frac14¼&frac34¾&gt>&h"
 . "ellip…&iacuteí&icircî&iexcl¡&igraveì&iquest¿&iumlï&laquo«&ldquo“&lsaquo‹&lsquo‘&lt<&m"
 . "acr¯&mdash—&microµ&middot·&nbsp &ndash–&not¬&ntildeñ&oacuteó&ocircô&oeligœ&ograveò&or"
 . "dfª&ordmº&oslashø&otildeõ&oumlö&para¶&permil‰&plusmn±&pound£&quot""&raquo»&rdquo”&reg"
 . "®&rsaquo›&rsquo’&sbquo‚&scaronš&sect§&shy­&sup1¹&sup2²&sup3³&szligß&thornþ&tilde˜&tim"
 . "es×&trade™&uacuteú&ucircû&ugraveù&uml¨&uumlü&yacuteý&yen¥&yumlÿ"
 TXT := RegExReplace( HTM,"<[^>]+>" ) ; Remove all tags between "<" and ">"
 Loop, Parse, TXT, &`; ; Create a list of special characters
   L := "&" A_LoopField ";", R .= (!(A_Index&1)) ? ( (!InStr(R,L,1)) ? L:"" ) : ""
 StringTrimRight, R, R, 1
 Loop, Parse, R , `; ; Parse Special Characters
   If F := InStr( HT, A_LoopField ) ; Lookup HT Data
     StringReplace, TXT,TXT, %A_LoopField%`;, % SubStr( HT,F+StrLen(A_LoopField), 1 ), All
   Else If ( SubStr( A_LoopField,2,1)="#" )
     StringReplace, TXT, TXT, %A_LoopField%`;, % Chr(SubStr(A_LoopField,3)), All
 Return RegExReplace( TXT, "(^\s*|\s*$)") ; Remove leading/trailing white spaces
}
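The offset parameters are easier to follow on a single row. The sketch below runs the same kind of calls the main loop uses against one hand-written row; the markup is a simplified, hypothetical stand-in for IMDb's real layout. It expects StrX() and UnHTM() from the script above to be pasted below it (or kept in a function library folder).

```autohotkey
; Sketch: StrX() offsets on one hand-written row (hypothetical markup, not IMDb's real layout).
; Needs StrX() and UnHTM() from the script above, pasted below this snippet.
TR := "<td class=""title""><a href=""/title/tt0111161/"">The Shawshank Redemption</a>"
    . " <span class=""year_type"">(1994)</span></td>"

; Begin 15 characters after the start of href="/title/tt, end at the next quote and
; trim 2 characters (the slash and the quote). N receives the offset where the match ended.
IMDB  := StrX( TR, "href=""/title/tt",1,15, """",1,2, N )
; Continue from N: take the text between the next ">" and the following "<".
Title := UnHTM( StrX( TR, ">",N,1, "<",1,1 ) )
; Begin 12 characters after the start of year_type">( and stop before ")".
Year  := StrX( TR, "year_type"">(",1,12, ")",1,1 )
Plain := UnHTM( TR )   ; the whole row with every tag stripped

MsgBox, % IMDB "`n" Title "`n" Year "`n" Plain
Return
```

With this sample row the message box should show 0111161, The Shawshank Redemption, 1994 and The Shawshank Redemption (1994). Passing the last parameter (N) from one call as the begin offset of the next is what lets the main loop walk through a row, and through the whole page, without re-scanning text it has already parsed.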
EDITS:
13-Sep-2010 : IMDb updated its HTML layout, which broke the script. The script has been altered and is working again.