Scrape Elements on Site and Save to Text File

Get help with using AutoHotkey (v1.1 and older) and its commands and hotkeys
uknowwhoibe
Posts: 5
Joined: 06 Dec 2013, 12:08

06 Dec 2013, 12:29

As the title says, I'd like a basic example that helps me figure out how to write a script along the lines described below. Obviously the data below isn't what I actually want to pull, and I don't need you to write the specific script for me, as I won't learn anything from that. A basic example is all I need to get the ball rolling and learn it on my own (I'm more of a VBA guy).

Example link:
Example Link

Needs:
-shortcut key that activates the script (e.g. CTRL-ALT-Q)
-assumes the current and active browser instance and current and active tab
-url of the site must match or begin with "https://www.google.com/maps/"
-if the above is true, then scrape the following elements
-Business name - "Pitfire Artisan Pizza"

Code: Select all

<span jstcache="141">Pitfire Artisan Pizza</span>
-Address - "108 W 2nd St Los Angeles, CA 90012"

Code: Select all

<div class="cards-entity-address cards-strong" jstcache="142"><div class="cards-text-truncate-and-wrap" jstcache="143" jsinstance="0"><span jstcache="144">108 W 2nd St</span></div><div class="cards-text-truncate-and-wrap" jstcache="143" jsinstance="*1"><span jstcache="144">Los Angeles, CA 90012</span></div></div>
-Hours - "Open today 11:00 am – 10:00 pm"

Code: Select all

<div class="cards-hours" jstcache="150"><span class="" jstcache="151">Open today</span><a class="cards-open-hours" jsaction="entity.openHoursClick" jstcache="0" tabindex="155" href="javascript:void(0)"><div jstcache="152" jsinstance="*0">11:00 am – 10:00 pm</div><div jstcache="153" style="display: none;"><span jstcache="0">Hours</span></div></a></div>
-and so forth (three examples is plenty for me to figure out the rest)
-write and overwrite the stored information to a local text file (delete the other information and replace with this)
-assume C:\Tools\test.txt
-write the information on a new line per data capture point (don't ask - it just needs to be this way for now)
-save the text file
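
The file-handling items in the list above might look roughly like the following AutoHotkey v1 sketch. The two values are hard-coded placeholders standing in for whatever the scrape returns; only the path C:\Tools\test.txt comes from the list, and the folder is assumed to exist already.

```autohotkey
; Placeholder values (assumption): in a real script these come from the scrape.
BusinessName := "Pitfire Artisan Pizza"
Address      := "108 W 2nd St Los Angeles, CA 90012"
OutFile      := "C:\Tools\test.txt"

; FileAppend never overwrites, so remove any previous capture first.
; FileDelete on a missing file only sets ErrorLevel; it does not stop the script.
FileDelete, %OutFile%

; One data point per line. In the default text mode each `n is written as CR+LF.
FileAppend, %BusinessName%`n%Address%`n, %OutFile%
if ErrorLevel
    MsgBox, Could not write to %OutFile%
```

FileAppend closes the file after each call, so there is no separate save step.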

So again, I learn by working with examples and I can't find a good starting place so maybe I can get the help I need here.

Thanks in advance!
Blackholyman
Posts: 1293
Joined: 29 Sep 2013, 22:57
Location: Denmark

07 Dec 2013, 15:13

Here is an example using COM with IE:

Code: Select all

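^!q::   ; Ctrl+Alt+Q

wb := WBGet()   ; attach to an open Internet Explorer window
if !instr(wb.LocationURL, "https://www.google.com/maps")
{
   wb := ""   ; release the COM reference and stop on any other site
   return
}
doc := wb.document
table    := doc.getElementById("biwtable")
rows     := table.rows   ; collection of the table's <tr> elements
spans    := rows[0].getElementsByTagName("span")


BusinessName     := doc.getElementById("place-title").innertext
Address          := spans[1].innertext " " spans[3].innertext


; FileAppend adds to the end of the file; delete the file first if it should be overwritten
FileAppend, %BusinessName%`n%Address%`n`n, Somefile.txt
Run Somefile.txt
return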




WBGet(WinTitle="ahk_class IEFrame", Svr#=1) {               ;// based on ComObjQuery docs
   static msg := DllCall("RegisterWindowMessage", "str", "WM_HTML_GETOBJECT")
        , IID := "{0002DF05-0000-0000-C000-000000000046}"   ;// IID_IWebBrowserApp
;//     , IID := "{332C4427-26CB-11D0-B483-00C04FD90119}"   ;// IID_IHTMLWindow2
   SendMessage msg, 0, 0, Internet Explorer_Server%Svr#%, %WinTitle%
   if (ErrorLevel != "FAIL") {
      lResult:=ErrorLevel, VarSetCapacity(GUID,16,0)
      if DllCall("ole32\CLSIDFromString", "wstr","{332C4425-26CB-11D0-B483-00C04FD90119}", "ptr",&GUID) >= 0 {
         DllCall("oleacc\ObjectFromLresult", "ptr",lResult, "ptr",&GUID, "ptr",0, "ptr*",pdoc)
         return ComObj(9,ComObjQuery(pdoc,IID,IID),1), ObjRelease(pdoc)
      }
   }
}
You will need to have the business's place page open and active in IE.

Hope it helps
Last edited by Blackholyman on 08 Dec 2013, 00:39, edited 1 time in total.
Joe Glines
Posts: 770
Joined: 30 Sep 2013, 20:49
Location: Dallas

07 Dec 2013, 19:02

Google Maps kept stripping the trailing slash from the URL and, since your InStr check was looking for it, the code would not work. This takes care of it:

Code: Select all

if !instr(wb.LocationURL, "https://www.google.com/maps")
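
InStr reports a match anywhere in the string, so the test above also passes when the prefix appears later in the URL. If the URL has to begin with the prefix, as the original request put it, a sketch like the following compares the match position instead. UrlStartsWith is a hypothetical helper name and the sample URLs are made up for illustration.

```autohotkey
; Hypothetical helper: true only when url starts with prefix
; (case-insensitive, which is InStr's default).
UrlStartsWith(url, prefix) {
    return InStr(url, prefix) = 1
}

prefix := "https://www.google.com/maps"
MsgBox % UrlStartsWith("https://www.google.com/maps/place/x", prefix)                ; shows 1
MsgBox % UrlStartsWith("http://example.com/?u=https://www.google.com/maps", prefix)  ; shows 0
```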
Joe Glines
Posts: 770
Joined: 30 Sep 2013, 20:49
Location: Dallas

08 Dec 2013, 07:59

Blackholyman,
I'm studying your code to try to improve my web-scraping capabilities, and I'm puzzled about where the .rows part of your code came from:

Code: Select all

table    := doc.getElementById("biwtable")
rows     := table.rows
spans    := rows[0].getElementsByTagName("span")
When I use iWB2 Learner as well as IE's built-in developer tools, I see no reference to "rows" under the "biwtable" table. I do see the SPAN tags.

Did you add the .rows part from your knowledge of HTML / the DOM?
Thanks
Joe
uknowwhoibe
Posts: 5
Joined: 06 Dec 2013, 12:08

08 Dec 2013, 11:22

Thanks for the information so far - I've successfully done this on the example page.

On the actual page I'm going to be using this on, though, I'm stumped. Here's the source for the part that I'm trying to pull:

Code: Select all

<section class="module-2">
                    <h2 class="disableUserSelect">Prospect Information</h2>
                    
                        <a class="group-link yjs-edit-prospect-info-trigger disableUserSelect" data-test-id="prospect.edit" href="#">Edit</a>
                    

                    <dl>
                        
                            <dd class="disableUserSelect">
								<a target="_blank" href="http://www.google.com/search?hl=en&q=Test+Company+Name%2C+LLC Houston TX">Look Up Prospect</a>
							</dd>
                        
                        
                            <dt class="disableUserSelect">Website</dt>
                            <dd>
								<a target="_blank" href="http://www.testsite.com" data-test-id="prospect.websiteUrl">http://www.testsite.com</a>
							</dd>
                        
                        
                        
                            <dt class="disableUserSelect">Address</dt>
                            <dd>
                                <address data-test-id="prospect.address">
                                    
                                        123 Main Street
										
											<br>
										
                                    
                                        Houston, TX 77036
										
                                    
                                </address>
                                <a class="disableUserSelect" target="_blank" href="https://maps.google.com/?q=123+Main+Street%2C+Houston%2C+TX+77036">Map</a>
                            </dd>
                        
                        
                    </dl>

                </section>
What do I need to alter in the code to call the elements for Address (123 Main Street Houston, TX 77036) and Website (www.testsite.com)?

Thanks again for the help!
Blackholyman
Posts: 1293
Joined: 29 Sep 2013, 22:57
Location: Denmark

08 Dec 2013, 15:14

@ Joetazz
Did you add the .rows part from your knowledge of HTML / the DOM?
Yes, I did. It gives a collection of the <tr> elements in a table: http://www.w3schools.com/jsref/coll_table_rows.asp
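
As a rough illustration of that collection, the sketch below builds a small made-up table in an in-memory document and walks its rows. The table id "demo" and its contents are placeholders, not anything from the Google Maps page, and the htmlfile COM object is used only so the example runs without a browser window.

```autohotkey
; Made-up two-row table (assumption), loaded into an in-memory HTML document.
html := "<table id='demo'><tr><td><span>First</span></td></tr>"
      . "<tr><td><span>Second</span></td></tr></table>"
doc := ComObjCreate("htmlfile")
doc.write(html)
doc.close()

table := doc.getElementById("demo")
rows  := table.rows                  ; one entry per <tr>, indexed from 0

Loop % rows.length
{
    spans := rows[A_Index-1].getElementsByTagName("span")
    MsgBox % "Row " A_Index ": " spans[0].innerText
}
```

The same rows property is available on a table element reached through wb.document in IE.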

@ uknowwhoibe (u know who i be)

Try something like this:

Code: Select all

;// ^^^ just for testing
html = 
(`%
<html>
 <body>
<section class="module-2">
                    <h2 class="disableUserSelect">Prospect Information</h2>
                    
                        <a class="group-link yjs-edit-prospect-info-trigger disableUserSelect" data-test-id="prospect.edit" href="#">Edit</a>
                    

                    <dl>
                        
                            <dd class="disableUserSelect">
                                <a target="_blank" href="http://www.google.com/search?hl=en&q=Test+Company+Name%2C+LLC Houston TX">Look Up Prospect</a>
                            </dd>
                        
                        
                            <dt class="disableUserSelect">Website</dt>
                            <dd>
                                <a target="_blank" href="http://www.testsite.com" data-test-id="prospect.websiteUrl">http://www.testsite.com</a>
                            </dd>
                        
                        
                        
                            <dt class="disableUserSelect">Address</dt>
                            <dd>
                                <address data-test-id="prospect.address">
                                    
                                        123 Main Street
                                        
                                            <br>
                                        
                                    
                                        Houston, TX 77036
                                        
                                    
                                </address>
                                <a class="disableUserSelect" target="_blank" href="https://maps.google.com/?q=123+Main+Street%2C+Houston%2C+TX+77036">Map</a>
                            </dd>
                        
                        
                    </dl>

                </section>
 </body>
 </html> 
) ;// ^^^ just for testing

^!q::

;~ wb := WBGet()
{ ;// ^^^ just for testing
wb := ComObjCreate("InternetExplorer.Application")
wb.Visible := true
wb.Navigate("http://www.bing.com/")
while wb.busy
   Sleep 100
wb.document.write(html)
sleep 200
} ;// ^^^ just for testing

;// this is the code you need
loop % (elements := wb.document.getElementsByTagName("address")).length 
      if (elements[A_index-1].getAttribute("data-test-id") = "prospect.address")
         Address := elements[A_index-1].innertext

loop % (elements := wb.document.getElementsByTagName("A")).length 
      if (elements[A_index-1].getAttribute("data-test-id") = "prospect.websiteUrl")
         Website := elements[A_index-1].href
      
msgbox % "Addr:`n" Address "`n`nSite:`n" Website
return
Hope it helps
Joe Glines
Posts: 770
Joined: 30 Sep 2013, 20:49
Location: Dallas

09 Dec 2013, 21:06

Very cool! I've learned all my OOP / DOM from reading posts in the forum, so I'm not aware of many things that people who actually build pages know.

Somehow through all my readings I'd missed this little bit about Tables / rows, etc. I reviewed it as well as other table properties and can definitely see how they will be useful. Thank you for providing the link!
Regards,
Joe
tank
Posts: 3122
Joined: 28 Sep 2013, 22:15
Location: CarrolltonTX

10 Dec 2013, 01:00

You learn about table rows and the like by learning HTML and JavaScript and, by proxy, DHTML. DHTML techniques can be applied in any language with access to the DOM.
uknowwhoibe
Posts: 5
Joined: 06 Dec 2013, 12:08

16 Dec 2013, 15:54

First of all, thanks to everyone here (specifically Blackholyman) as I was able to take the pieces from this and set the script to my needs.

I have two final questions which may be better served in a new thread - mods, let me know and I can move it.

As I'm pulling the page source in Firefox (FF), it opens in a new browser window, and I then read the information I want from that. How can I open that specific Firefox window hidden, i.e. the equivalent of Visible = False? Currently I'm taking the source of the FF page, selecting all, copying it, opening an instance of IE, and calling document.write(clipboard) to load the FF source into it (a little hacky but functional). I just need to hide the FF source window. I can't load the page in IE directly, as it would be a PITA with logins and security credentials, and we don't support IE at all, so unfortunately that's not an option.

Secondly, I need a way to set up a listener or a polling script: when something occurs on a FF page, I would like to trigger my scrape script. I thought about a polling script that loops every second (which seems resource-heavy) and looks for a specific element that pops up; if it's there, run the script; if not, keep looping. Is there any way to set up an actual listener so that the script activates when that element pops up? I was thinking it would be similar to something that runs itself when a captcha box pops up (it's not a captcha box). Thoughts?
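
The polling idea could be sketched with SetTimer, which runs a check at an interval without a busy loop. This is only a sketch under a loud assumption: that the event can be detected from AutoHotkey as a window whose title contains known text. Detecting an element inside a Firefox page would need a different check in its place. The window title and the RunScrape label are placeholders.

```autohotkey
#Persistent
SetTitleMatchMode 2                       ; match the text anywhere in the title
SetTimer, CheckForTrigger, 1000           ; run the check once per second
return

CheckForTrigger:
if WinExist("Hypothetical Trigger Title")     ; placeholder condition (assumption)
{
    SetTimer, CheckForTrigger, Off        ; stop polling while the scrape runs
    Gosub, RunScrape
    WinWaitClose, Hypothetical Trigger Title  ; avoid firing again for the same window
    SetTimer, CheckForTrigger, On         ; resume polling with the same interval
}
return

RunScrape:
MsgBox, Trigger detected - the scrape would run here.
return
```

A one-second timer that only calls WinExist is cheap, so the resource concern mostly applies to heavier checks such as re-reading the page source on every tick.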

Thanks again! Here's the modified finished project - any superfluous code that can be trimmed down?

Code: Select all

;page scraper to log to file to run powerpoint to pass variables into ppt userform
;2013 uknowwhoibe with help from ahkscript.org <3

#SingleInstance, Force
SetTitleMatchMode 2
AutoTrim, off
^!q::

Clipboard = ; empty clipboard 

	Sleep, 200
	IfWinExist, Target Site 
		WinActivate, Target Site
	
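	; Options 3 = Yes/No/Cancel; the last parameter is a 10-second timeout
	MsgBox, 3, , Launch PPT Builder From This Record?`nAuto-launching in 10 seconds, 10
		IfMsgBox No
			{
				Return
			}
		; MsgBox does not set ErrorLevel, so test the Cancel button instead
		IfMsgBox Cancel
			{
				MsgBox, Navigate to the record you want to set up a demo for and re-run
				Return
			}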
			
	
	Send, ^u	; FF source code
	Sleep, 200	
	
	WinActivate, Source of: http://targetsite.com/record/
	Sleep, 200
	Send, ^a
	Sleep, 200
	Send, ^c
	Sleep, 200
	Send, ^w
	Sleep, 200

	html := Clipboard
	
wb := ComObjCreate("InternetExplorer.Application")
wb.Visible := false
wb.Navigate("about:blank")
while wb.busy
   Sleep 200
wb.document.write(html)
	
loop % (elements := wb.document.getElementsByTagName("h1")).length 
      if (elements[A_index-1].getAttribute("data-test-id") = "page.title")
         BusinessName := elements[A_index-1].innertext

loop % (elements := wb.document.getElementsByTagName("h4")).length 
      if (elements[A_index-1].getAttribute("data-test-id") = "page.subtitle")
         Segment := elements[A_index-1].innertext

loop % (elements := wb.document.getElementsByTagName("address")).length 
      if (elements[A_index-1].getAttribute("data-test-id") = "prospect.address")
         Address := elements[A_index-1].innertext

loop % (elements := wb.document.getElementsByTagName("A")).length 
      if (elements[A_index-1].getAttribute("data-test-id") = "prospect.websiteUrl")
         Website := elements[A_index-1].innertext

loop % (elements := wb.document.getElementsByTagName("dt")).length 
      if (elements[A_index-1].getAttribute("data-test-id") = "contact.fullName")
         FullName := elements[A_index-1].innertext

loop % (elements := wb.document.getElementsByTagName("dd")).length 
      if (elements[A_index-1].getAttribute("data-test-id") = "contact.email")
         Email := elements[A_index-1].innertext		 

loop % (elements := wb.document.getElementsByTagName("dd")).length 
      if (elements[A_index-1].getAttribute("data-test-id") = "prospect.statusDescription")
         ProspectStatus := elements[A_index-1].innertext		 

loop % (elements := wb.document.getElementsByTagName("dd")).length 
      if (elements[A_index-1].getAttribute("data-test-id") = "prospect.prospectId")
         ProspectID := elements[A_index-1].innertext

FormatTime, Time,, MM/dd/yy h:mm tt

StringTrimLeft, Email, Email, 6 ; remove "Email" and just show the email address
StringTrimLeft, ProspectStatus, ProspectStatus, 9 ; remove "Owned by" and just show record owner
		 
;MsgBox %Time% `n %Bname% `n %BusinessName% `n %Segment% `n %Address% `n %Website% `n %FullName% `n %Email% `n %ProspectStatus% `n %ProspectID%
		 
wb.quit		 

; Delete the same file that FileAppend writes to, so each run overwrites the previous capture
FileDelete, C:\Tools\demobuilder.txt
FileAppend, %FullName% `r`n %BusinessName% `r`n %Website% `r`n %Address% `r`n %Segment% `r`n `r`n %Time% `r`n %Email% `r`n %ProspectStatus% `r`n %ProspectID%, C:\Tools\demobuilder.txt

Sleep,25

ppa:= "C:\path\POWERPNT.EXE" ; open PPT

Run, %ppa%

Sleep, 25

	IfWinExist Microsoft PowerPoint
		WinActivate Microsoft PowerPoint
		
	Sleep 1000
		
                ; Find ribbon button color and click - sadly, best work around I found

		PixelSearch, Px, Py, 0, 0, 2400, 2400, 0xB2852E, 0, Fast
			if ErrorLevel
			MsgBox, PowerPoint was not found open - please manually click on PPT Builder
		else
		click %Px%, %Py%
		 
Return
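
On trimming the script: the eight lookup loops differ only in tag name, data-test-id value, and target variable, so they could be folded into one helper. The sketch below is one possible shape, not something posted in the thread. GetByTestId is a hypothetical name, and the demo document is built with an htmlfile COM object and made-up values so the block runs without IE. Unlike the loops above, which keep the last match, it returns the first.

```autohotkey
; Hypothetical helper: innerText of the first <tag> element whose data-test-id
; attribute equals testId, or an empty string when nothing matches.
GetByTestId(doc, tag, testId) {
    elements := doc.getElementsByTagName(tag)
    Loop % elements.length
        if (elements[A_Index-1].getAttribute("data-test-id") = testId)
            return elements[A_Index-1].innerText
    return ""
}

; Stand-in document for the demo; in the script above, pass wb.document instead.
doc := ComObjCreate("htmlfile")
doc.write("<h1 data-test-id='page.title'>Test Company Name, LLC</h1>"
    . "<dl><dd data-test-id='contact.email'>Email someone@example.com</dd></dl>")
doc.close()

BusinessName := GetByTestId(doc, "h1", "page.title")
Email        := GetByTestId(doc, "dd", "contact.email")
MsgBox % BusinessName "`n" Email
```

With a helper like this, each field in the script becomes a single assignment, and the StringTrimLeft cleanup can stay as it is.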
uknowwhoibe
Posts: 5
Joined: 06 Dec 2013, 12:08

17 Dec 2013, 17:02

Guest wrote: FF add-ons to get the source and set up a hotkey: http://www.autohotkey.com/board/topic/9 ... clipboard/
Unfortunately, I'm trying to make this as scalable as possible, so being forced to install a FF extension/add-in isn't really ideal. If that's the only solution, then perhaps I can tell the source window to open at a specific X,Y position and put it off screen (hacky but meh)?
tank
Posts: 3122
Joined: 28 Sep 2013, 22:15
Location: CarrolltonTX

17 Dec 2013, 17:32

I personally feel sorry for you having such a horrid site that can't work correctly in other browsers. You're having to deal with quite possibly the most awful browser to automate. Even Chrome isn't as bad.
uknowwhoibe
Posts: 5
Joined: 06 Dec 2013, 12:08

17 Dec 2013, 18:05

tank wrote: I personally feel sorry for you having such a horrid site that can't work correctly in other browsers. You're having to deal with quite possibly the most awful browser to automate. Even Chrome isn't as bad.
Thank you for the sympathy. It can be quite annoying at times, but it is what it is. The reason we don't use IE is that we don't support IE at all (which is a good thing), and Chrome, while my preference, wasn't as easy to use with Greasemonkey when the systems were first set up (around 2006 or so). That's not an excuse; just a fact.
