AutoHotkey Community Let's help each other out
areebb
Joined: 13 Feb 2008 Posts: 7
Posted: Wed Mar 05, 2008 9:39 am Post subject: Need help with a simple Google Results parser...
Hi,
I'm trying to write a simple script that parses Google results. For example, I give it the HTML associated with the results for "test keyword" and it spits back every URL that Google returned.
Is there something like this already?
Anyway, if there isn't, I'm stuck on the RegExMatch. I've worked out that each URL in Google's HTML starts with the text
<!--m--><h2 class=r><a href="
and ends with
" class=l
How do I use a RegEx to extract everything between those two strings?
Thanks a lot for your help!
Areeb Bajwa
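For the literal question (grab whatever sits between two fixed strings), one possible approach is a capture group with both delimiters quoted by \Q...\E so the regex engine reads them as plain text. A rough sketch, assuming AutoHotkey v1 syntax; the sample html string and the variable names needle, url and urls are made up for illustration:

```autohotkey
; Hypothetical sample input; in practice this would come from FileRead.
html := "<!--m--><h2 class=r><a href=""http://example.com/a"" class=l>A</a></h2>"
      . "<!--m--><h2 class=r><a href=""http://example.com/b"" class=l>B</a></h2>"

; \Q...\E makes the regex engine treat the two delimiters as literal text.
; (.*?) lazily captures whatever lies between them.
needle := "\Q<!--m--><h2 class=r><a href=""\E(.*?)\Q"" class=l\E"

urls := ""
Pos := 1
Loop
{
    ; url receives the whole match, url1 the part captured by (.*?).
    Pos := RegExMatch(html, needle, url, Pos)
    if !Pos
        break
    urls .= url1 "`n"
    ; Continue searching after the end of this match.
    Pos += StrLen(url)
}
MsgBox % urls
```

The lazy quantifier matters here: a greedy (.*) would run from the first opening delimiter to the last closing one and swallow everything in between.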
Lexikos
Joined: 17 Oct 2006 Posts: 2737 Location: Australia, Qld
Posted: Wed Mar 05, 2008 11:07 am
This seems to work, though it doesn't strip the HTML (bold tags, escape sequences, etc.) out of the link text.
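| Code: |
FileRead, search_results, search_results.htm
Pos := 1
Loop {
    ; Find the next result heading: everything between <h2 class=r> and </h2>.
    Pos := RegExMatch(search_results, "(?<=<h2 class=r>).*?(?=</h2>)", m, Pos)
    if !Pos
        break
    ; Pull the URL and the link text (still containing HTML) out of the heading.
    if RegExMatch(m, "<a href=""(?<Url>.*?)"" .*?>(?<TextHtml>.*)</a>", a)
        MsgBox URL: %aUrl%`nText: %aTextHtml%
    ; Resume after this match; the +1 keeps an empty match from looping forever.
    Pos += StrLen(m) + 1
}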
(unedit)
Last edited by Lexikos on Wed Mar 05, 2008 11:16 am; edited 2 times in total |
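Since the regex version leaves HTML in the link text, here is one rough way to clean it up afterwards. This is only a sketch under assumptions: the sample text_html value is made up, the tag pattern is naive (it simply drops anything between < and >), and only a handful of entities are decoded.

```autohotkey
; Hypothetical link text, as the regex approach might capture it.
text_html := "<b>Test</b> keyword &amp; <em>more</em>"

; Drop anything that looks like an HTML tag.
text := RegExReplace(text_html, "<[^>]*>")

; Decode a few common entities (not a complete list). &amp; is handled last
; so that an escaped entity such as "&amp;lt;" ends up as "&lt;", not "<".
StringReplace, text, text, &lt;, <, All
StringReplace, text, text, &gt;, >, All
StringReplace, text, text, &quot;, ", All
StringReplace, text, text, &amp;, &, All

MsgBox % text
```

For anything beyond simple bold tags and basic entities, letting a real HTML parser produce the text is likely the safer route.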
areebb
Joined: 13 Feb 2008 Posts: 7
Posted: Wed Mar 05, 2008 11:15 am
Dude, I love you! That's exactly what I needed! Thanks so much.
The RegEx statements seem damn complicated though. Oh well, as long as it gets the job done!
Areeb Bajwa
Lexikos
Joined: 17 Oct 2006 Posts: 2737 Location: Australia, Qld
Posted: Wed Mar 05, 2008 11:18 am
Blast, I edited too slowly.
Here's a way to avoid parsing altogether. Requires: COM Standard Library.
| Code: |
FileRead, search_results, search_results.htm
COM_Init()
; Create a HTML document object.
doc := COM_CreateObject("htmlfile")
; Write HTML into it.
COM_Invoke(doc, "write", search_results)
; Get a collection of all links.
links := COM_Invoke(doc, "links")
; For each link
Loop % COM_Invoke(links, "length")
{
    link := COM_Invoke(links, "item", A_Index-1)
    ; Google search result links have class=l.
    if (COM_Invoke(link, "className") = "l")
    {
        ; Show the link href and text.
        MsgBox % "URL: " COM_Invoke(link, "href") "`n`n"
            . COM_Invoke(link, "innerText")
    }
    COM_Release(link)
}
; Clean up.
COM_Release(links)
COM_Release(doc)
COM_Term()

This avoids the need to replace HTML tags and escape sequences in the link text.
It would also be possible to adapt this to get information directly from an Internet Explorer window showing Google results.
SKAN
Joined: 26 Dec 2005 Posts: 6264
Posted: Wed Mar 05, 2008 11:24 am
That will be very useful for me. Many thanks, lexiKos.
areebb
Joined: 13 Feb 2008 Posts: 7
Posted: Sun Mar 09, 2008 1:27 am
Sorry, I forgot to respond to your last post, lexiKos. The COM version works like a charm! And thanks to you I've actually managed to figure out this whole pattern matching thing.
Areeb Bajwa
bixtool
Joined: 16 Aug 2008 Posts: 27
Posted: Sat Aug 16, 2008 6:11 am
| Quote: | Lexikos wrote: Here's a way to avoid parsing altogether. Requires: COM Standard Library. |
How would I modify this code (the COM version posted earlier in the thread) for a search page from Wiki?