Scrape Elements on Site and Save to Text File

uknowwhoibe
Posts: 5
Joined: 06 Dec 2013, 12:08

Scrape Elements on Site and Save to Text File

06 Dec 2013, 12:29

As the title says, I'd like a basic example that shows how to write a script along the lines described below. The data here obviously isn't what I actually want to pull, and I don't need you to write the specific script for me, as I won't learn anything from that. A basic example is all I need so that I can get the ball rolling and learn it on my own (I'm more of a VBA guy).

Example link:
Example Link

Needs:
-shortcut key that activates the script (e.g. CTRL-ALT-Q)
-assumes the current and active browser instance and current and active tab
-url of the site must match or begin with "https://www.google.com/maps/"
-if the above is true, then scrape the following elements
-Business name - "Pitfire Artisan Pizza"
Code:

<span jstcache="141">Pitfire Artisan Pizza</span>
-Address - "108 W 2nd St Los Angeles, CA 90012"
Code:

<div class="cards-entity-address cards-strong" jstcache="142"><div class="cards-text-truncate-and-wrap" jstcache="143" jsinstance="0"><span jstcache="144">108 W 2nd St</span></div><div class="cards-text-truncate-and-wrap" jstcache="143" jsinstance="*1"><span jstcache="144">Los Angeles, CA 90012</span></div></div>
-Hours - "Open today 11:00 am – 10:00 pm"
Code:

<div class="cards-hours" jstcache="150"><span class="" jstcache="151">Open today</span><a class="cards-open-hours" jsaction="entity.openHoursClick" jstcache="0" tabindex="155" href="javascript:void(0)"><div jstcache="152" jsinstance="*0">11:00 am – 10:00 pm</div><div jstcache="153" style="display: none;"><span jstcache="0">Hours</span></div></a></div>
-and so forth (three examples is plenty for me to figure out the rest)
-write and overwrite the stored information to a local text file (delete the other information and replace with this)
-assume C:\Tools\test.txt
-write the information on a new line per data capture point (don't ask - it just needs to be this way for now)
-save the text file

So again, I learn by working with examples and I can't find a good starting place so maybe I can get the help I need here.

Thanks in advance!
Blackholyman
Posts: 1291
Joined: 29 Sep 2013, 22:57
Facebook: socialjsz
Google: +Jszapp
Location: Denmark

Re: Scrape Elements on Site and Save to Text File

07 Dec 2013, 15:13

Here is an example using COM with IE:

Code:

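; Ctrl+Alt+Q runs the scrape
^!q::

wb := WBGet()   ; attach to the open Internet Explorer window
; stop unless the current URL contains the Google Maps address
if !instr(wb.LocationURL, "https://www.google.com/maps")
{
   wb := ""
   return
}
doc := wb.document
table    := doc.getElementById("biwtable")
rows     := table.rows   ; collection of the table's <tr> elements
spans    := rows[0].getElementsByTagName("span")


BusinessName     := doc.getElementById("place-title").innertext
Address          := spans[1].innertext " " spans[3].innertext



; FileAppend adds to Somefile.txt (creating it if needed); it does not overwrite
FileAppend, %BusinessName%`n%Address%`n`n, Somefile.txt
Run Somefile.txt
return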




WBGet(WinTitle="ahk_class IEFrame", Svr#=1) {               ;// based on ComObjQuery docs
   static msg := DllCall("RegisterWindowMessage", "str", "WM_HTML_GETOBJECT")
        , IID := "{0002DF05-0000-0000-C000-000000000046}"   ;// IID_IWebBrowserApp
;//     , IID := "{332C4427-26CB-11D0-B483-00C04FD90119}"   ;// IID_IHTMLWindow2
   SendMessage msg, 0, 0, Internet Explorer_Server%Svr#%, %WinTitle%
   if (ErrorLevel != "FAIL") {
      lResult:=ErrorLevel, VarSetCapacity(GUID,16,0)
      if DllCall("ole32\CLSIDFromString", "wstr","{332C4425-26CB-11D0-B483-00C04FD90119}", "ptr",&GUID) >= 0 {
         DllCall("oleacc\ObjectFromLresult", "ptr",lResult, "ptr",&GUID, "ptr",0, "ptr*",pdoc)
         return ComObj(9,ComObjQuery(pdoc,IID,IID),1), ObjRelease(pdoc)
      }
   }
}
You will need to have the business's place page open and active in IE.

Hope it helps
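
The example above appends to Somefile.txt, while the first post asked for the file to be overwritten with one value per line. A minimal sketch of that behaviour, assuming AutoHotkey v1.1 or later; WriteLines is a hypothetical helper name, the path is the C:\Tools\test.txt from the request, and the two sample values stand in for the scraped variables:

```autohotkey
; Sample values standing in for the scraped BusinessName and Address
BusinessName := "Pitfire Artisan Pizza"
Address := "108 W 2nd St Los Angeles, CA 90012"

; Replace the file's content with one value per line
WriteLines("C:\Tools\test.txt", BusinessName, Address)
return

; Hypothetical helper: overwrite path with each value on its own line.
; Returns false if the file could not be opened.
WriteLines(path, values*) {
    text := ""
    for index, value in values
        text .= value "`r`n"
    ; "w" opens for writing and discards any existing content
    file := FileOpen(path, "w")
    if !IsObject(file)
        return false
    file.Write(text)
    file.Close()
    return true
}
```

Opening the file in "w" mode avoids the separate FileDelete step, since the old content is discarded when the file is opened.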
Last edited by Blackholyman on 08 Dec 2013, 00:39, edited 1 time in total.
Joe Glines
Posts: 688
Joined: 30 Sep 2013, 20:49
Facebook: https://www.facebook.com/theAutomatorGuru/
Google: https://plus.google.com/105328929654286634910
GitHub: joetazz
Location: Dallas

Re: Scrape Elements on Site and Save to Text File

07 Dec 2013, 19:02

Google Maps kept stripping the trailing slash from the URL and, since your InStr check was looking for it, the code would not work. This takes care of it:

Code:

if !instr(wb.LocationURL, "https://www.google.com/maps")
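
Note that InStr returns the position of the match, so the check above passes when the address appears anywhere in the URL. If the URL has to begin with the address, as the first post asked, a stricter test could look like this sketch (assuming AutoHotkey v1.1; UrlStartsWith is a hypothetical helper name):

```autohotkey
; Hypothetical helper: true only when url begins with prefix.
; InStr is case-insensitive by default and returns 1 for a match
; that starts at the first character.
UrlStartsWith(url, prefix) {
    return InStr(url, prefix) = 1
}

; Possible use inside the hotkey, in place of the !instr(...) test:
; if !UrlStartsWith(wb.LocationURL, "https://www.google.com/maps")
;     return
```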
Joe Glines

Re: Scrape Elements on Site and Save to Text File

08 Dec 2013, 07:59

Blackholyman,
I'm studying your code to try to improve my web-scraping skills, and I'm puzzled about where the .rows part of your code came from:

Code:

table    := doc.getElementById("biwtable")
rows     := table.rows
spans    := rows[0].getElementsByTagName("span")
When I use iWB2 Learner as well as IE's built-in developer tools, I see no reference to "rows" under the "biwtable" table. I do see the SPAN tags.

Did you add .rows from your knowledge of HTML / the DOM?
Thanks
Joe
uknowwhoibe

Re: Scrape Elements on Site and Save to Text File

08 Dec 2013, 11:22

Thanks for the information so far - I've successfully done this on the example page.

As for the actual page I'm going to be using this on, I'm stumped. Here's the source for the part that I'm trying to pull:

Code:

<section class="module-2">
                    <h2 class="disableUserSelect">Prospect Information</h2>
                    
                        <a class="group-link yjs-edit-prospect-info-trigger disableUserSelect" data-test-id="prospect.edit" href="#">Edit</a>
                    

                    <dl>
                        
                            <dd class="disableUserSelect">
								<a target="_blank" href="http://www.google.com/search?hl=en&q=Test+Company+Name%2C+LLC Houston TX">Look Up Prospect</a>
							</dd>
                        
                        
                            <dt class="disableUserSelect">Website</dt>
                            <dd>
								<a target="_blank" href="http://www.testsite.com" data-test-id="prospect.websiteUrl">http://www.testsite.com</a>
							</dd>
                        
                        
                        
                            <dt class="disableUserSelect">Address</dt>
                            <dd>
                                <address data-test-id="prospect.address">
                                    
                                        123 Main Street
										
											<br>
										
                                    
                                        Houston, TX 77036
										
                                    
                                </address>
                                <a class="disableUserSelect" target="_blank" href="https://maps.google.com/?q=123+Main+Street%2C+Houston%2C+TX+77036">Map</a>
                            </dd>
                        
                        
                    </dl>

                </section>
What do I need to alter in the code to call the elements for Address (123 Main Street Houston, TX 77036) and Website (www.testsite.com)?

Thanks again for the help!
Blackholyman

Re: Scrape Elements on Site and Save to Text File

08 Dec 2013, 15:14

@ Joetazz
Did you add .rows from your knowledge of HTML / the DOM?
Yes, I did. It gives a collection of the <tr> tags in a table: http://www.w3schools.com/jsref/coll_table_rows.asp
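
As an illustration of the rows collection, here is a small self-contained sketch, assuming AutoHotkey v1.1 on Windows. It builds a throwaway document with the HTMLFile COM object (the table id "demo" and its contents are made up for the example) and walks every row and cell:

```autohotkey
; Build a small in-memory document to experiment with
doc := ComObjCreate("HTMLFile")
doc.write("<table id='demo'><tr><td>a</td><td>b</td></tr><tr><td>c</td><td>d</td></tr></table>")

table := doc.getElementById("demo")
text := ""
Loop % table.rows.length                 ; one pass per <tr>
{
    row := table.rows[A_Index - 1]       ; DOM collections are zero-based
    Loop % row.cells.length              ; one pass per cell in this row
        text .= row.cells[A_Index - 1].innerText "`t"
    text .= "`n"
}
MsgBox % text
```

The same rows and cells properties are available on a table taken from a live IE document, such as the one returned by WBGet().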

@ uknowwhoibe (u know who i be)

Try something like this:

Code:

;// vvv just for testing
html = 
(`%
<html>
 <body>
<section class="module-2">
                    <h2 class="disableUserSelect">Prospect Information</h2>
                    
                        <a class="group-link yjs-edit-prospect-info-trigger disableUserSelect" data-test-id="prospect.edit" href="#">Edit</a>
                    

                    <dl>
                        
                            <dd class="disableUserSelect">
                                <a target="_blank" href="http://www.google.com/search?hl=en&q=Test+Company+Name%2C+LLC Houston TX">Look Up Prospect</a>
                            </dd>
                        
                        
                            <dt class="disableUserSelect">Website</dt>
                            <dd>
                                <a target="_blank" href="http://www.testsite.com" data-test-id="prospect.websiteUrl">http://www.testsite.com</a>
                            </dd>
                        
                        
                        
                            <dt class="disableUserSelect">Address</dt>
                            <dd>
                                <address data-test-id="prospect.address">
                                    
                                        123 Main Street
                                        
                                            <br>
                                        
                                    
                                        Houston, TX 77036
                                        
                                    
                                </address>
                                <a class="disableUserSelect" target="_blank" href="https://maps.google.com/?q=123+Main+Street%2C+Houston%2C+TX+77036">Map</a>
                            </dd>
                        
                        
                    </dl>

                </section>
 </body>
 </html> 
) ;// ^^^ just for testing

^!q::

;~ wb := WBGet()
{ ;// ^^^ just for testing
wb := ComObjCreate("InternetExplorer.Application")
wb.Visible := true
wb.Navigate("http://www.bing.com/")
while wb.busy
   Sleep 100
wb.document.write(html)
sleep 200
} ;// ^^^ just for testing

;// this is the code you need
loop % (elements := wb.document.getElementsByTagName("address")).length 
      if (elements[A_index-1].getAttribute("data-test-id") = "prospect.address")
         Address := elements[A_index-1].innertext

loop % (elements := wb.document.getElementsByTagName("A")).length 
      if (elements[A_index-1].getAttribute("data-test-id") = "prospect.websiteUrl")
         Website := elements[A_index-1].href
      
msgbox % "Addr:`n" Address "`n`nSite:`n" Website
return
Hope it helps
Joe Glines

Re: Scrape Elements on Site and Save to Text File

09 Dec 2013, 21:06

Very cool! I've learned all my OOP / DOM from reading posts in the forum, so I'm not aware of things that many people who actually build pages know.

Somehow through all my readings I'd missed this little bit about Tables / rows, etc. I reviewed it as well as other table properties and can definitely see how they will be useful. Thank you for providing the link!
Regards,
Joe
tank
Posts: 2825
Joined: 28 Sep 2013, 22:15
Facebook: charlie.simmons.7334
Google: ttnnkkrr
GitHub: ttnnkkrr
Location: Irving TX

Re: Scrape Elements on Site and Save to Text File

10 Dec 2013, 01:00

You learn about table rows and the like from learning HTML and JavaScript, and by proxy DHTML. DHTML techniques can be applied in any language with access to the DOM.
We are troubled on every side, yet not distressed; we are perplexed,
but not in despair; Persecuted, but not forsaken; cast down, but not destroyed;
https://www.facebook.com/ahkscript.org
If you have forum suggestions please submit a pull request
Check Out WebWriter
Thanks Tank :thumbup:
uknowwhoibe

Re: Scrape Elements on Site and Save to Text File

16 Dec 2013, 15:54

First of all, thanks to everyone here (specifically Blackholyman) as I was able to take the pieces from this and set the script to my needs.

I have two final questions which may be better served in a new thread - mods, let me know and I can move it.

I'm pulling the page source from FF, which opens in a new browser window, and I then read the information I want from that. How can I open that specific Firefox window hidden (the equivalent of Visible = False)? Currently I'm taking the source of the FF page, selecting all, copying it, opening an instance of IE, and calling document.write(clipboard) with the HTML of the FF source (a little hacky but functional). I just need to hide the FF source window. I can't load the page in IE: it would be a PITA with logins and security credentials, and we don't support IE at all, so unfortunately that's not an option.

Secondly, I need a way to set up a listener or a poller script: when something occurs on a FF page, I would like to trigger my scrape script. I thought about a polling script that loops every second (seems resource-heavy) and looks for a specific element that pops up; if it is there, run the script; if not, keep looping. Is there any way to set up an actual listener so that the script activates when that element pops up? I was thinking it would be similar to how a script might run itself when a captcha box pops up (it's not a captcha box). Thoughts?
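
A minimal polling sketch of the idea described above, assuming AutoHotkey v1.1. The check used here (a window whose title contains "Target Site") and the RunScrape label are placeholders; a real check would depend on how the element can be detected. A timer that does one cheap check per second is usually light on resources, and the wasPresent flag makes the scrape fire once per appearance rather than every second:

```autohotkey
#Persistent
SetTitleMatchMode, 2                   ; match the text anywhere in a window title
wasPresent := false
SetTimer, CheckForTrigger, 1000        ; run the check about once per second
return

CheckForTrigger:
    present := TriggerIsPresent()
    if (present && !wasPresent)        ; react only when the trigger first appears
        Gosub, RunScrape
    wasPresent := present
return

; Placeholder check: is there a window whose title contains "Target Site"?
TriggerIsPresent() {
    return WinExist("Target Site") ? true : false
}

RunScrape:
    ; Placeholder for the real scrape routine
    MsgBox, Trigger detected. The scrape routine would run here.
return
```

This is still polling rather than a true event listener; it only moves the loop into a timer so the script stays responsive between checks.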

Thanks again! Here's the modified finished project - any superfluous code that can be trimmed down?

Code:

;page scraper to log to file to run powerpoint to pass variables into ppt userform
;2013 uknowwhoibe with help from ahkscript.org <3

#SingleInstance, Force
SetTitleMatchMode 2
AutoTrim, off
^!q::

Clipboard = ; empty clipboard 

	Sleep, 200
	IfWinExist, Target Site 
		WinActivate, Target Site
	
	MsgBox, 3, , Launch PPT Builder From This Record?`nAuto-launching in 10 seconds, 10
		IfMsgBox No
			{
				Return
			}
		If ErrorLevel
			{
				MsgBox, Navigate to the record you want to setup a demo for then re-run
			}
			
	
	Send, ^u	; FF source code
	Sleep, 200	
	
	WinActivate, Source of: http://targetsite.com/record/
	Sleep, 200
	Send, ^a
	Sleep, 200
	Send, ^c
	Sleep, 200
	Send, ^w
	Sleep, 200

	html := Clipboard
	
wb := ComObjCreate("InternetExplorer.Application")
wb.Visible := false
wb.Navigate("about:blank")
while wb.busy
   Sleep 200
wb.document.write(html)
	
loop % (elements := wb.document.getElementsByTagName("h1")).length 
      if (elements[A_index-1].getAttribute("data-test-id") = "page.title")
         BusinessName := elements[A_index-1].innertext

loop % (elements := wb.document.getElementsByTagName("h4")).length 
      if (elements[A_index-1].getAttribute("data-test-id") = "page.subtitle")
         Segment := elements[A_index-1].innertext

loop % (elements := wb.document.getElementsByTagName("address")).length 
      if (elements[A_index-1].getAttribute("data-test-id") = "prospect.address")
         Address := elements[A_index-1].innertext

loop % (elements := wb.document.getElementsByTagName("A")).length 
      if (elements[A_index-1].getAttribute("data-test-id") = "prospect.websiteUrl")
         Website := elements[A_index-1].innertext

loop % (elements := wb.document.getElementsByTagName("dt")).length 
      if (elements[A_index-1].getAttribute("data-test-id") = "contact.fullName")
         FullName := elements[A_index-1].innertext

loop % (elements := wb.document.getElementsByTagName("dd")).length 
      if (elements[A_index-1].getAttribute("data-test-id") = "contact.email")
         Email := elements[A_index-1].innertext		 

loop % (elements := wb.document.getElementsByTagName("dd")).length 
      if (elements[A_index-1].getAttribute("data-test-id") = "prospect.statusDescription")
         ProspectStatus := elements[A_index-1].innertext		 

loop % (elements := wb.document.getElementsByTagName("dd")).length 
      if (elements[A_index-1].getAttribute("data-test-id") = "prospect.prospectId")
         ProspectID := elements[A_index-1].innertext

FormatTime, Time,, MM/dd/yy h:mm tt

StringTrimLeft, Email, Email, 6 ; remove "Email" and just show the email address
StringTrimLeft, ProspectStatus, ProspectStatus, 9 ; remove "Owned by" and just show record owner
		 
;MsgBox %Time% `n %Bname% `n %BusinessName% `n %Segment% `n %Address% `n %Website% `n %FullName% `n %Email% `n %ProspectStatus% `n %ProspectID%
		 
wb.quit		 

OutFile := "C:\Tools\demobuilder.txt" ; delete and write the same file so old data is replaced
FileDelete, %OutFile%
FileAppend, %FullName% `r`n %BusinessName% `r`n %Website% `r`n %Address% `r`n %Segment% `r`n `r`n %Time% `r`n %Email% `r`n %ProspectStatus% `r`n %ProspectID%, %OutFile%

Sleep,25

ppa:= "C:\path\POWERPNT.EXE" ; open PPT

Run, %ppa%

Sleep, 25

	IfWinExist Microsoft PowerPoint
		WinActivate Microsoft PowerPoint
		
	Sleep 1000
		
	; Find ribbon button color and click - sadly, best work around I found
	PixelSearch, Px, Py, 0, 0, 2400, 2400, 0xB2852E, 0, Fast
	if ErrorLevel
		MsgBox, PowerPoint was not found open - please manually click on PPT Builder
	else
		Click %Px%, %Py%
		 
Return
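
On the question of trimming: the eight lookup loops above differ only in tag name, data-test-id value, and target variable, so they could be folded into one function. A sketch assuming AutoHotkey v1.1; GetByTestId is a hypothetical helper, and unlike the loops above it returns the first match rather than the last:

```autohotkey
; Hypothetical helper: return the first element with the given tag name
; whose data-test-id attribute equals testId, or "" if nothing matches.
GetByTestId(doc, tagName, testId) {
    elements := doc.getElementsByTagName(tagName)
    Loop % elements.length
    {
        element := elements[A_Index - 1]   ; DOM collections are zero-based
        if (element.getAttribute("data-test-id") = testId)
            return element
    }
    return ""
}

; Possible use in the script above, once wb.document has been written:
; BusinessName := GetByTestId(wb.document, "h1", "page.title").innerText
; Address      := GetByTestId(wb.document, "address", "prospect.address").innerText
; Email        := GetByTestId(wb.document, "dd", "contact.email").innerText
```

Each field then takes one line, and adding another field means adding one more call instead of another loop.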
uknowwhoibe

Re: Scrape Elements on Site and Save to Text File

17 Dec 2013, 17:02

Guest wrote: FF add-ons to get the source and set up a hotkey http://www.autohotkey.com/board/topic/9 ... clipboard/
Unfortunately, I'm trying to make this as scalable as possible, so being forced to install a FF extension/add-in isn't really ideal. If that's the only solution, then perhaps I can tell the source window to open at a specified X,Y position and put it off screen (hacky but meh)?
tank

Re: Scrape Elements on Site and Save to Text File

17 Dec 2013, 17:32

I personally feel sorry for you, having such a horrid site that can't work correctly in other browsers. You're having to deal with quite possibly the most awful browser to automate. Even Chrome isn't as bad.
uknowwhoibe

Re: Scrape Elements on Site and Save to Text File

17 Dec 2013, 18:05

tank wrote: I personally feel sorry for you, having such a horrid site that can't work correctly in other browsers. You're having to deal with quite possibly the most awful browser to automate. Even Chrome isn't as bad.
Thank you for the sympathy - it can be quite annoying at times but it is what it is. The reason we don't use IE is because we don't support IE at all (which is a good thing) and Chrome, while my preference, wasn't as easy to greasemonkey when the systems were first set up (around 2006 or so). That's not an excuse; just a fact.
