Need help with a simple Google Results parser...

 
AutoHotkey Community Forum Index -> Ask for Help
areebb



Joined: 13 Feb 2008
Posts: 7

Posted: Wed Mar 05, 2008 9:39 am    Post subject: Need help with a simple Google Results parser...

Hi,

I'm trying to write a simple script that parses Google results. For example, I give it the HTML associated with the results for "test keyword" and it spits back every URL that Google returned.

Is there something like this already?

Anyway, if there isn't, I'm stuck at the RegExMatch step. I've worked out that each URL in Google's HTML code starts with the text

<!--m--><h2 class=r><a href="

and ends with

" class=l

Now how do I use a RegEx statement to extract everything between those two strings?

Thanks a lot for your help!
_________________
Areeb Bajwa
Lexikos



Joined: 17 Oct 2006
Posts: 2737
Location: Australia, Qld

Posted: Wed Mar 05, 2008 11:07 am

This seems to work, though it doesn't strip the HTML (bold tags, escape sequences, etc.) out of the link text.
Code:
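; Load the saved Google results page into a variable.
FileRead, search_results, search_results.htm

Pos := 1
Loop {
    ; Find the next result heading: everything between <h2 class=r> and </h2>.
    Pos := RegExMatch(search_results, "(?<=<h2 class=r>).*?(?=</h2>)", m, Pos)
    if !Pos
        break
    ; Split the heading into the link URL (aUrl) and the link text (aTextHtml).
    RegExMatch(m, "<a href=""(?<Url>.*?)"" .*?>(?<TextHtml>.*)</a>", a)
    MsgBox URL: %aUrl%`nText: %aTextHtml%
    ; Continue searching after this match.
    Pos += StrLen(m)
}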
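
For the question exactly as asked (everything between the two strings quoted in the first post), a more literal variant might look like the sketch below. It assumes the saved page still uses precisely that markup, and the names Needle, Match and Urls are only illustrative. The start and end text are matched literally, with a lazy capture group for the URL in between.

```autohotkey
FileRead, search_results, search_results.htm

; Literal start text, a lazy capture of the URL, then the literal end text.
; Quotes inside an expression string are doubled.
Needle := "<!--m--><h2 class=r><a href=""(.*?)"" class=l"

Urls := ""
Pos := 1
Loop {
    Pos := RegExMatch(search_results, Needle, Match, Pos)
    if !Pos
        break
    Urls .= Match1 "`n"      ; Match1 holds the first captured group (the URL).
    Pos += StrLen(Match)     ; Resume searching after the whole match.
}
MsgBox % Urls
```

This collects every URL into one newline-separated list instead of showing a message box per result. If Google changes the surrounding markup even slightly, the literal pattern stops matching.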


(unedit)


Last edited by Lexikos on Wed Mar 05, 2008 11:16 am; edited 2 times in total
areebb



Joined: 13 Feb 2008
Posts: 7

Posted: Wed Mar 05, 2008 11:15 am

Dude, I love you! That's exactly what I needed! Thanks so much.

The RegEx statements seem damn complicated though. Oh well, long as it gets the job done!
_________________
Areeb Bajwa
Lexikos



Joined: 17 Oct 2006
Posts: 2737
Location: Australia, Qld

Posted: Wed Mar 05, 2008 11:18 am

Blast, I edited too slowly. :P

Here's a way to avoid parsing altogether. Requires: COM Standard Library.
Code:
FileRead, search_results, search_results.htm

COM_Init()

; Create a HTML document object.
doc := COM_CreateObject("htmlfile")

; Write HTML into it.
COM_Invoke(doc, "write", search_results)

; Get a collection of all links.
links := COM_Invoke(doc, "links")

; For each link
Loop % COM_Invoke(links, "length")
{
    link := COM_Invoke(links, "item", A_Index-1)

    ; Google search result links have class=l.
    if COM_Invoke(link, "className") = "l"
    {
        ; Show the link href and text.
        MsgBox % "URL: " COM_Invoke(link, "href") "`n`n"
               . COM_Invoke(link, "innerText")
    }

    COM_Release(link)
}

; Clean up.
COM_Release(links)
COM_Release(doc)
COM_Term()

This avoids the need to replace HTML tags and escape sequences in the link text. :)

It would also be possible to adapt this to get information directly from an Internet Explorer window showing Google results.
SKAN



Joined: 26 Dec 2005
Posts: 6264

Posted: Wed Mar 05, 2008 11:24 am

That will be very useful for me. Many thanks, Lexikos. :)
areebb



Joined: 13 Feb 2008
Posts: 7

Posted: Sun Mar 09, 2008 1:27 am

Sorry, I forgot to respond to your last post, Lexikos. The COM version works like a charm! And thanks to you I've actually managed to figure out this whole pattern matching thing. :)
_________________
Areeb Bajwa
bixtool



Joined: 16 Aug 2008
Posts: 27

Posted: Sat Aug 16, 2008 6:11 am

Quote:
Lexikos Posted: Wed Mar 05, 2008 12:18 pm Post subject:

--------------------------------------------------------------------------------

Blast, I edited too slowly.

Here's a way to avoid parsing altogether. Requires: COM Standard Library.
Code:
FileRead, search_results, search_results.htm

COM_Init()

; Create a HTML document object.
doc := COM_CreateObject("htmlfile")

; Write HTML into it.
COM_Invoke(doc, "write", search_results)

; Get a collection of all links.
links := COM_Invoke(doc, "links")

; For each link
Loop % COM_Invoke(links, "length")
{
    link := COM_Invoke(links, "item", A_Index-1)

    ; Google search result links have class=l.
    if COM_Invoke(link, "className") = "l"
    {
        ; Show the link href and text.
        MsgBox % "URL: " COM_Invoke(link, "href") "`n`n"
               . COM_Invoke(link, "innerText")
    }

    COM_Release(link)
}

; Clean up.
COM_Release(links)
COM_Release(doc)
COM_Term()


How would I modify this code for a search page from Wiki?
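
One possible starting point, as a rough sketch only: the loop stays the same and only the test that picks out result links changes. The example below assumes the page was saved from a MediaWiki site such as Wikipedia, where article links usually contain "/wiki/" in their href. The file name wiki_results.htm and that "/wiki/" test are assumptions, not something confirmed in this thread. It still requires the COM Standard Library.

```autohotkey
FileRead, search_results, wiki_results.htm   ; hypothetical file name

COM_Init()

; Build an HTML document object from the saved page, as in the quoted code.
doc := COM_CreateObject("htmlfile")
COM_Invoke(doc, "write", search_results)
links := COM_Invoke(doc, "links")

Loop % COM_Invoke(links, "length")
{
    link := COM_Invoke(links, "item", A_Index-1)
    href := COM_Invoke(link, "href")

    ; Assumption: article links contain "/wiki/". This also matches
    ; navigation links, so a stricter test may be needed.
    if InStr(href, "/wiki/")
        MsgBox % "URL: " href "`n`n" . COM_Invoke(link, "innerText")

    COM_Release(link)
}

; Clean up.
COM_Release(links)
COM_Release(doc)
COM_Term()
```

Viewing the page source of a real results page should show whether the result links carry a more specific class name or sit inside a container that gives a tighter filter than the href test.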
All times are GMT