AutoHotkey Homepage AutoHotkey Community
Let's help each other out
 

StrX() :: Auto-Parser for XML / HTML
SKAN



Joined: 26 Dec 2005
Posts: 8688

Posted: Fri Nov 20, 2009 12:22 am    Post subject: StrX() :: Auto-Parser for XML / HTML


    StrX() is a wrapper that extends SubStr()'s functionality. It accepts two delimiter strings ( begin & end ) and extracts the text between them. It is similar to
    RegExMatch( Str, "BeginStr(.*)EndStr", SubPat ), but the major difference is that StrX() lets you control the final length of the resulting string: it can trim or expand characters at either or both ends.

    Quote:

      Announcement: The current version 1.0 can auto-parse when used with a While loop. Please check out the updated examples.



    StrX( H, BS,BO,BT, ES,EO,ET, NextOffset )

      Parameters

    • 1 ) H = HayStack. The "Source Text"


    • 2 ) BS = BeginStr. The string that marks the left extreme of the Resultant String
    • 3 ) BO = BeginOffset.
        Position in "Source Text" from which the search for BeginStr starts ( 1 = first character )
      • Pass a 0 to search in reverse ( from right-to-left ) in "Source Text"
      • If you intend to call StrX() from a Loop, pass the same variable used as the 8th Parameter, which will simplify the parsing process.
    • 4 ) BT = BeginTrim.
        Number of characters to trim on the left extreme of Resultant String
      • Pass the String length of BeginStr if you want to omit it from Resultant String
      • Pass a Negative value if you want to expand the left extreme of Resultant String


    • 5 ) ES = EndStr. The string that marks the right extreme of the Resultant String
    • 6 ) EO = EndOffset.
        Can be only True or False.
        If False, EndStr will be searched in reverse, from the end of Source Text.
        If True, the search starts just after the position where the Resultant String begins ( derived from the BeginStr match ), or from the start of Source Text if BeginStr is empty.
    • 7 ) ET = EndTrim.
        Number of characters to trim on the right extreme of Resultant String
      • Pass the String length of EndStr if you want to omit it from Resultant String
      • Pass a Negative value if you want to expand the right extreme of Resultant String


    • 8 ) NextOffset : The name of a ByRef variable that StrX() updates with the current offset. You may pass the same variable as Parameter 3 to simplify data parsing in a loop
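
    To make the trim parameters concrete, here is a minimal sketch that runs StrX() on literal strings instead of a downloaded file. The function body is the v1.0 code from this thread (AutoHotkey v1 syntax); the sample strings and the variable names Html, Path, Inner, Outer, Wider and Name are made up for illustration.

```autohotkey
Html := "<p>Hello <b>World</b>!</p>"
Path := "C:\Dir\Sub\File.txt"

; BT = StrLen("<b>") and ET = StrLen("</b>") drop both markers from the result
Inner := StrX( Html, "<b>",1,3,  "</b>",1,4 )    ; World
; BT = 0 and ET = 0 keep both markers
Outer := StrX( Html, "<b>",1,0,  "</b>",1,0 )    ; <b>World</b>
; Negative trims expand the result: 6 characters to the left, 1 to the right
Wider := StrX( Html, "<b>",1,-6, "</b>",1,-1 )   ; Hello <b>World</b>!
; BO = 0 searches BeginStr from the right; an empty EndStr returns the rest of the text
Name  := StrX( Path, "\",0,1 )                   ; File.txt

MsgBox, % Inner "`n" Outer "`n" Wider "`n" Name
Return                                           ; end of auto-execute section

; StrX() v1.0 by SKAN, as posted in this thread
StrX( H, BS="",BO=0,BT=1,   ES="",EO=0,ET=1,  ByRef N="" ) {
Return SubStr(H,P:=(((Z:=StrLen(ES))+(X:=StrLen(H))+StrLen(BS)-Z-X)?((T:=InStr(H,BS,0,((BO
 <0)?(1):(BO))))?(T+BT):(X+1)):(1)),(N:=P+((Z)?((T:=InStr(H,ES,0,((EO)?(P+1):(0))))?(T-P+Z
 +(0-ET)):(X+P)):(X)))-P)
}
```

    Passing the string length of a marker as its trim value is the common case; the other calls only vary BT and ET around that.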


    Here follow real-world examples that demonstrate StrX()'s functionality:


    Example 1 : A Script to retrieve real-time details of last 15 posts made in our forum.

    Code:
    UrlDownloadToFile, http://www.autohotkey.com/forum/rss.php, ahkrss.xml   ; 01
    FileRead, xml, ahkrss.xml                                                ; 02

    While Item  := StrX( xml ,  "<item>" ,N,0,  "</item>" ,1,0,  N )         ; 03
          Title := StrX( Item,  "<title>",1,7,  "</title>",1,8     )         ; 04
        , Link  := StrX( Item,  "<link>" ,1,6,  "</link>" ,1,7     )         ; 05
        , List  .= "`n`n" A_Index ")`t" Title "`n`t" Link                    ; 06

    MsgBox, 64, Latest Posts on AHK Forum, %List%                            ; 07


    Quote:
    Note: The result of the above script may contain HTML entities, as below:

    15) Ask for Help :: &amp;quot;Jump to&amp;quot; video frame (i.e. &amp;quot;seek&amp;quot;

    You may use UnHTM() on Title to convert it to proper text.


    Example 2 : Download and extract links from a Google Search Result

    Code:
    UrlDownloadToFile, % "http://www.google.com/search?hl=en&lr=&safe=active&rlz=1C1GGLS_enIN"
                       . "307IN307&num=10&q=site:autohotkey.com&aq=f&oq=&aqi=", Google.htm
    FileRead, html, Google.htm

    While Item := StrX( html,  "<h3 class=""r""><a href=",N,0, "<li class=g>",1,12, N )
          Sub1 := StrX( Item, "<a href=",1,9,  """"  ,1,1,  T )
        , Sub2 := StrX( Item, ">",       T,1,  "</a>",1,4     )
        , Text .= UnHTM( Sub2 ) "`n" Sub1 "`n`n"

    MsgBox, %Text% ; Dependency :: Get UnHTM() www.autohotkey.com/forum/viewtopic.php?t=51342


    Example 3 : Movie-DB Creator 66L for IMDb.com

    Example 4 : ListView for http://www.google.com/movies

    Example 5 : Yahoo! Weather in TrayTip

        ... and finally here is StrX()


      Code:
      StrX( H, BS="",BO=0,BT=1,   ES="",EO=0,ET=1,  ByRef N="" ) { ;    | by Skan | 19-Nov-2009
      Return SubStr(H,P:=(((Z:=StrLen(ES))+(X:=StrLen(H))+StrLen(BS)-Z-X)?((T:=InStr(H,BS,0,((BO
       <0)?(1):(BO))))?(T+BT):(X+1)):(1)),(N:=P+((Z)?((T:=InStr(H,ES,0,((EO)?(P+1):(0))))?(T-P+Z
       +(0-ET)):(X+P)):(X)))-P) ; v1.0-196c 21-Nov-2009 www.autohotkey.com/forum/topic51354.html
      }




    Last edited by SKAN on Fri May 14, 2010 7:46 am; edited 13 times in total
    The Naked General



    Joined: 22 Feb 2009
    Posts: 21
    Location: Dallas TX

    Posted: Fri Nov 20, 2009 4:18 am

    This is great Skan!

    I'll have to go back and clean up some old parsing scripts with it. Thanks a bunch :D
    _________________
    "lol, i made this thing, but it didn't work... so I read the forums and now it does!"
    SoLong&Thx4AllTheFish



    Joined: 27 May 2007
    Posts: 4999

    Posted: Fri Nov 20, 2009 8:26 am

    No Sir, I'm definitely not disappointed :D
    _________________
    AHK Wiki FAQ
    TF : Text files & strings lib, TF Forum
    linpinger



    Joined: 20 Oct 2007
    Posts: 14
    Location: china,hubei

    Posted: Fri Nov 20, 2009 9:50 am

    If I comment out line 03,

    it goes into an endless loop.

    Why does this happen?

    My English is poor, ^_^
    SKAN



    Joined: 26 Dec 2005
    Posts: 8688

    Posted: Fri Nov 20, 2009 11:43 am

    "Title Post" updated with Example 2:

    Download and extract links from a Google Search Result (see the first post)



    On a related note here is Lexikos' COM version for the same:
    http://www.autohotkey.com/forum/viewtopic.php?p=182714#182714


    Last edited by SKAN on Sat May 08, 2010 3:25 am; edited 2 times in total
    linpinger



    Joined: 20 Oct 2007
    Posts: 14
    Location: china,hubei

    Posted: Fri Nov 20, 2009 1:58 pm

    Thanks for the reply, SKAN!

    I still don't understand:
    when the search reaches the end of the string, why doesn't it stop and break?

    I had to add some extra check code.

    Adding these three lines to the While loop makes it break:

    Code:

    if ( N < old )
       break
    old := N
    SKAN



    Joined: 26 Dec 2005
    Posts: 8688

    Posted: Fri Nov 20, 2009 8:17 pm

    linpinger wrote:
    I still don't understand:
    when the search reaches the end of the string, why doesn't it stop and break?

    I had to add some extra check code.

    Adding these three lines to the While loop makes it break:

    Code:

    if ( N < old )
       break
    old := N


    My code was at fault. I have rewritten the function and posted it at the top.
    You do not have to add any code anymore. When used with a "While loop", StrX() will
    automatically parse the data and exit the loop gracefully.
    Please test the updated examples and let me know the status.
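
    As a rough illustration of that graceful exit, the sketch below parses a small in-memory string and records N after each pass. The sample data and the Trace variable are invented for the example; the function is the v1.0 code posted at the top of the thread.

```autohotkey
xml := "<item><title>One</title></item><item><title>Two</title></item>"

; N starts blank; each successful call leaves N just past the matched </item>
While Item := StrX( xml, "<item>",N,0, "</item>",1,0, N )
    Trace .= A_Index ") " StrX( Item, "<title>",1,7, "</title>",1,8 ) "   N = " N "`n"

; Once no further <item> is found, StrX() returns an empty string, the While
; condition fails, and N is left pointing beyond the end of xml
MsgBox, % Trace "`nAfter the loop: N = " N ", StrLen( xml ) = " StrLen( xml )
Return                                           ; end of auto-execute section

; StrX() v1.0 by SKAN, as posted in this thread
StrX( H, BS="",BO=0,BT=1,   ES="",EO=0,ET=1,  ByRef N="" ) {
Return SubStr(H,P:=(((Z:=StrLen(ES))+(X:=StrLen(H))+StrLen(BS)-Z-X)?((T:=InStr(H,BS,0,((BO
 <0)?(1):(BO))))?(T+BT):(X+1)):(1)),(N:=P+((Z)?((T:=InStr(H,ES,0,((EO)?(P+1):(0))))?(T-P+Z
 +(0-ET)):(X+P)):(X)))-P)
}
```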

    linpinger wrote:
    Thanks for the reply, SKAN!

    Er.. you might find my reply missing, as I have deleted it:
    it does not fit the current version of StrX() and may cause confusion.

    Thank You.
    linpinger



    Joined: 20 Oct 2007
    Posts: 14
    Location: china,hubei

    Posted: Sat Nov 21, 2009 2:13 am

    I have got the latest StrX()

    It's completely great!

    I noticed that the new Example 1 doesn't have
    N := 1

    That means N is blank. Does it matter?
    (The result is right, there is no problem.)
    SKAN



    Joined: 26 Dec 2005
    Posts: 8688

    Posted: Sat Nov 21, 2009 8:04 am

    linpinger wrote:
    I have got the latest StrX()
    It's completely great!


    Thanks for testing it. :)

    linpinger wrote:
    I noticed that the new Example 1 doesn't have
    N := 1

    That means N is blank. Does it matter?
    (The result is right, there is no problem.)


    It is a side effect. The code tests the value of BeginOffset to make sure a negative value is not passed to InStr(). A blank N evidently takes the same "use 1" branch, so the search starts from offset 1.

    Code:
    BO < 0 ? 1 : BO  ; If BO is less than 0 use 1 - otherwise use BO itself


    If you want to run both the posted examples from the same script,
    then you have to use N := 1 in between them to reset N
    .. or you can name the variables differently, like N1 and N2
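
    A minimal sketch of that reset, using two made-up in-memory strings ( xml and html ) in place of the downloaded files; Items and Entries are illustrative names. The function is the v1.0 code from the first post.

```autohotkey
xml  := "<item>A</item><item>B</item>"
html := "<li>X</li><li>Y</li><li>Z</li>"

; First parse: N starts blank and ends up beyond StrLen( xml )
While Item := StrX( xml, "<item>",N,6, "</item>",1,7, N )
    Items .= Item " "

N := 1   ; reset; otherwise the stale offset makes the next loop find nothing

; Second parse: reuses N on a different string
While Item := StrX( html, "<li>",N,4, "</li>",1,5, N )
    Entries .= Item " "

MsgBox, % "Items: " Items "`nEntries: " Entries
Return                                           ; end of auto-execute section

; StrX() v1.0 by SKAN, as posted in this thread
StrX( H, BS="",BO=0,BT=1,   ES="",EO=0,ET=1,  ByRef N="" ) {
Return SubStr(H,P:=(((Z:=StrLen(ES))+(X:=StrLen(H))+StrLen(BS)-Z-X)?((T:=InStr(H,BS,0,((BO
 <0)?(1):(BO))))?(T+BT):(X+1)):(1)),(N:=P+((Z)?((T:=InStr(H,ES,0,((EO)?(P+1):(0))))?(T-P+Z
 +(0-ET)):(X+P)):(X)))-P)
}
```

    Using separate variables ( N1 for the first loop, N2 for the second ) avoids the reset line altogether.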

    Maybe StrX() should reset N with 1 when it is about to return an empty string?
    linpinger



    Joined: 20 Oct 2007
    Posts: 14
    Location: china,hubei

    Posted: Sat Nov 21, 2009 12:08 pm

    SKAN wrote:

    Maybe StrX() should reset N with 1 when it is about to return an empty string?


    I think resetting N is a good idea.

    Because when we use N as the last parameter,

    it always ends up with N > StrLen(xml),

    so N does not seem very useful at that point; resetting it is a good idea.
    daonlyfreez



    Joined: 16 Mar 2005
    Posts: 949
    Location: Berlin

    Posted: Sat Nov 21, 2009 12:39 pm

    Very nice!

    8)
    SKAN



    Joined: 26 Dec 2005
    Posts: 8688

    Posted: Mon Nov 23, 2009 5:18 am

    linpinger wrote:
    I think resetting N is a good idea.

    Because when we use N as the last parameter,

    it always ends up with N > StrLen(xml),

    so N does not seem very useful at that point; resetting it is a good idea.


    Code:
    While Item := StrX( html,  "<h3 class=r><a href=",N,0, "<li class=g>",1,12, N )
          Sub1 := StrX( Item, "<a href=",1,9,  """"  ,1,1,  T )
        , Sub2 := StrX( Item, ">",       T,1,  "</a>",1,4     )


    In the above, if Sub1 comes back empty because its BeginStr was not found, T is left beyond the end of Item, so Sub2 will also be empty, which is the best behaviour to expect.
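
    The chaining through T can be tried on two literal items, one with a link and one without. This is only a sketch: the sample strings and the helper name Describe() are hypothetical, and the function is the v1.0 code from the first post.

```autohotkey
Good := "<h3 class=r><a href=""http://www.autohotkey.com/"">AutoHotkey</a></h3>"
Bad  := "<h3 class=r>No link in this one</h3>"

MsgBox, % Describe( Good ) Describe( Bad )
Return                                           ; end of auto-execute section

; Hypothetical helper: runs the Sub1 / Sub2 pair on one item and reports the results
Describe( Item ) {
    Sub1 := StrX( Item, "<a href=",1,9,  """"  ,1,1,  T )   ; URL; T = offset of the closing quote
    Sub2 := StrX( Item, ">",       T,1,  "</a>",1,4     )   ; link text, searched from T onwards
    Return "Sub1 = [" Sub1 "]   Sub2 = [" Sub2 "]   T = " T "   StrLen = " StrLen( Item ) "`n"
}

; StrX() v1.0 by SKAN, as posted in this thread
StrX( H, BS="",BO=0,BT=1,   ES="",EO=0,ET=1,  ByRef N="" ) {
Return SubStr(H,P:=(((Z:=StrLen(ES))+(X:=StrLen(H))+StrLen(BS)-Z-X)?((T:=InStr(H,BS,0,((BO
 <0)?(1):(BO))))?(T+BT):(X+1)):(1)),(N:=P+((Z)?((T:=InStr(H,ES,0,((EO)?(P+1):(0))))?(T-P+Z
 +(0-ET)):(X+P)):(X)))-P)
}
```

    For the item without a link, both brackets should come out empty and T should be larger than the item's length.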
    SKAN



    Joined: 26 Dec 2005
    Posts: 8688

    Posted: Thu Mar 25, 2010 8:21 am

      Real World example for using StrX() & UnHTM() to parse out text from HTML


        Row Structure for Text DB
        1. Year
        2. Movie title
        3. MPAA Rating
        4. Runtime ( in Minutes)
        5. IMDb hash - should be prefixed with www.imdb.com/title/tt to form a proper URL
        6. User Rating ( 1.0 to 10.0 )
        7. User Votes
        8. Genre ( Pipe Delimited Values )
        9. Director
        10. Stars ( Comma Separated Values )
        11. Movie Outline
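
      A short sketch of reading one such row back, assuming the fields are tab-delimited as written by the script below. The row is hand-built for illustration: the IMDb hash is the one quoted in this post, while the rating, vote count and outline are placeholders, not real data.

```autohotkey
Z := A_Tab

; Hand-built sample row in the 11-field layout listed above (placeholder values)
Row := "1994" Z "The Shawshank Redemption" Z "R" Z "142" Z "0111161" Z "0.0" Z "0"
     . Z "Crime|Drama" Z "Frank Darabont" Z "Tim Robbins, Morgan Freeman" Z "Outline text"

StringSplit, F, Row, %A_Tab%                 ; F0 = field count, F1 .. F11 = the fields
Url := "http://www.imdb.com/title/tt" F5     ; field 5 is the IMDb hash

MsgBox, % F2 " (" F1 "), " F4 " mins, " F0 " fields`n" Url
```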

      Quote:
      One may use the "IMDb hash" to connect with other providers:

      The IMDb hash for "The Shawshank Redemption" is 0111161

      1) Connect to www.themoviedb.org for extended Movie Info : http://api.themoviedb.org/2.1/Movie.imdbLookup/en/xml/APIKEY/tt0111161
      2) Retrieve Images (Posters/Cover) from www.themoviedb.org : http://api.themoviedb.org/2.1/Movie.getImages/en/xml/APIKEY/tt0111161
      3) Connect to www.opensubtitles.org for subtitles : http://www.opensubtitles.org/en/search/imdbid-0111161/sublanguageid-eng/rss_2_00

      Again, the above methods return data in XML format which you may parse out with StrX()


      Movie-DB Creator 66L for IMDb.com
    Code:
    ; Movie-DB Creator 66L for IMDb.com ; By Skan / Last Modified: 24-Mar-2010
    ; Forum Post : www.autohotkey.com/forum/viewtopic.php?p=342196#342196
    ; Sample Output: www.autohotkey.net/~Skan/Scripts/StrX/IMDb/IMDb.txt (3.94 MiB)
    ; !!! Caution : Downloads around 1000 webpages from IMDb. Time/Bandwidth consuming operation.

    #SingleInstance, Force
    SetBatchLines, -1

    IMDB_SR := A_Temp "\IMDB_Search_Results.htm" ,     IMDB_WF := A_Temp "\IMDB_Work_File.txt"
    IMDB_URL := "http://www.imdb.com/search/title?has=asin-dvd-us&languages=en&num_votes=6,&"
              . "sort=num_votes,desc&start=1&title_type=feature"
    FileDelete, %IMDB_WF%
    URLDownloadToFile, %IMDB_URL%, %IMDB_SR%
    FileRead, HTM, %IMDB_SR%
    TotalE := StrX( HTM, "<div id=""left"">",1,24, "titles",1,7 )
    SysGet, m, MonitorWorkArea, 1
    Y := (mBottom-46-2),  X := (mRight-200-2), TotalE := RegExReplace( TotalE,"," )
    Progress, CWE6E3E4 CT000020 CBF73D00  x%x% y%y% w200 h46 B1 FS8 WM700 WS400 FM8 ZH8 ZY3
            , Downloading from IMDb, % "Page 1/" TP:=Round(TotalE/20), , Arial

    Z := A_Tab, StartE := 1
    Loop {
     List := "", N := 1
     While(  TR := StrX( HTM, "<tr class=",N,0, "</tr>",1,5, N ) )
        URat  := StrX( TR, "Users rated this ",1,17, "/",1,1, O1 )
      , Vote  := StrX( TR, "(",O1,1, " votes",1,6, O2 )
                , Vote := RegExReplace( Vote,"," )
      , IMDB := StrX( TR, "href=""/title/tt",0,15, """",1,2, O2 )            ; Reverse Search
      , Title := UnHTM( StrX( TR, ">",O2,1, "<",1,1, O3 ))
      , Year  := StrX( TR, "year_type"">(",O3,12,")",1,1 )
      , OutL  := UnHTM( StrX( TR, "outline"">",1,9, "<",1,1 ))
      , Dir   := UnHTM( StrX( TR, "Dir: <",1,5, "</a>",1,4 ))
      , Star  := UnHTM( StrX( TR, "With: <",1,6, "</span>",1,8 ))
      , Gen   := UnHTM( StrX( TR, "class=""genre",1,-6, "</span>",1,0, O4 ))
              ,  Gen := RegExReplace( Gen,A_Space )
      , CE    := StrX( TR, "title=",O4,6, A_Space,1,1 )
      , RunT  := UnHTM( StrX( TR, "class=""runtime",1,-6, " mins",1,5 ))
      , List  .= Year Z Title Z CE Z RunT Z IMDB Z URat Z Vote Z Gen Z Dir Z Star Z OutL "`n"
     FileAppend, %List%, %IMDB_WF%
     StringReplace, IMDB_URL, IMDB_URL, start=%StartE%, % "start=" ( StartE := StartE+20 )
     IfGreater,StartE,%TotalE%, Break
     Progress, % (StartE/TotalE)*100, % "Page " Round(StartE/20) "/" TP, Downloading from IMDb
     URLDownloadToFile, %IMDB_URL%, %IMDB_SR%
     FileRead, HTM, %IMDB_SR%
    }
    FileCopy, %IMDB_WF%, %A_ScriptDir%\IMDb.txt, 1
    Return                                                 ; // end of auto-execute section //

    StrX( H, BS="",BO=0,BT=1,   ES="",EO=0,ET=1,  ByRef N="" ) { ;   | by Skan | 19-Nov-2009
    Return SubStr(H,P:=(((Z:=StrLen(ES))+(X:=StrLen(H))+StrLen(BS)-Z-X)?((T:=InStr(H,BS,0,((BO
     <0)?(1):(BO))))?(T+BT):(X+1)):(1)),(N:=P+((Z)?((T:=InStr(H,ES,0,((EO)?(P+1):(0))))?(T-P+Z
     +(0-ET)):(X+P)):(X)))-P) ; v1.0-196c 21-Nov-2009 www.autohotkey.com/forum/topic51354.html
    }

    UnHTM( HTM ) { ; Remove HTML formatting / Convert to ordinary text     by SKAN 19-Nov-2009
     Static HT     ; Forum Topic: www.autohotkey.com/forum/topic51342.html
     IfEqual,HT,,   SetEnv,HT, % "&aacuteá&acircâ&acute´&aeligæ&agraveà&amp&aringå&atildeã&au"
     . "mlä&bdquo„&brvbar¦&bull•&ccedilç&cedil¸&cent¢&circˆ&copy©&curren¤&dagger†&dagger‡&deg"
     . "°&divide÷&eacuteé&ecircê&egraveè&ethð&eumlë&euro€&fnofƒ&frac12½&frac14¼&frac34¾&gt>&h"
     . "ellip…&iacuteí&icircî&iexcl¡&igraveì&iquest¿&iumlï&laquo«&ldquo“&lsaquo‹&lsquo‘&lt<&m"
     . "acr¯&mdash—&microµ&middot·&nbsp &ndash–&not¬&ntildeñ&oacuteó&ocircô&oeligœ&ograveò&or"
     . "dfª&ordmº&oslashø&otildeõ&oumlö&para¶&permil‰&plusmn±&pound£&quot""&raquo»&rdquo”&reg"
     . "®&rsaquo›&rsquo’&sbquo‚&scaronš&sect§&shy­&sup1¹&sup2²&sup3³&szligß&thornþ&tilde˜&tim"
     . "es×&trade™&uacuteú&ucircû&ugraveù&uml¨&uumlü&yacuteý&yen¥&yumlÿ"
     TXT := RegExReplace( HTM,"<[^>]+>" )               ; Remove all tags between  "<" and ">"
     Loop, Parse, TXT, &`;                              ; Create a list of special characters
       L := "&" A_LoopField ";", R .= (!(A_Index&1)) ? ( (!InStr(R,L,1)) ? L:"" ) : ""
     StringTrimRight, R, R, 1
     Loop, Parse, R , `;                                ; Parse Special Characters
      If F := InStr( HT, A_LoopField )                  ; Lookup HT Data
        StringReplace, TXT,TXT, %A_LoopField%`;, % SubStr( HT,F+StrLen(A_LoopField), 1 ), All
      Else If ( SubStr( A_LoopField,2,1)="#" )
        StringReplace, TXT, TXT, %A_LoopField%`;, % Chr(SubStr(A_LoopField,3)), All
    Return RegExReplace( TXT, "(^\s*|\s*$)")            ; Remove leading/trailing white spaces
    }


    EDITS:

    13-Sep-2010 : IMDb updated its HTML layout, breaking the script. The script has now been altered and works again.


    Last edited by SKAN on Mon Sep 13, 2010 2:04 pm; edited 4 times in total
    noise



    Joined: 14 May 2009
    Posts: 57
    Location: UK

    Posted: Thu Mar 25, 2010 8:35 am

    Excellent, thanks SKAN.
    _________________
    PrimalNoise.com
    It's a rock. Can't wait to tell my friends. They don't have a rock this big.
    SKAN



    Joined: 26 Dec 2005
    Posts: 8688

    Posted: Thu Mar 25, 2010 12:32 pm

    noise wrote:
    Excellent, thanks SKAN.


    Thanks, welcome. :)

    BTW, I have updated the script to include one more field: MPAA Rating
    and have also provided additional info on IMDb hash usage.

    :)