Help With Webscraping Script

jim7181812
Posts: 16
Joined: 10 Jun 2016, 10:46

12 Aug 2016, 11:34

I'm still learning AHK and learned most of how to do this through Joe Glines' YouTube videos, so if the script looks familiar, that's why. What I'm trying to do here is grab only the text and links from a website and place them into a GUI. For now I'm just looking for help with the text portion. This script works, but in addition to the text, .innerText is grabbing all the JavaScript (I think it's JavaScript) from the website. Is there any way to present just the text without all that? Do I have to do this with regex, or is there an easier way? Is this even possible?

Code:

#NoEnv  ; Recommended for performance and compatibility with future AutoHotkey releases.
; #Warn  ; Enable warnings to assist with detecting common errors.
SendMode Input  ; Recommended for new scripts due to its superior speed and reliability.
SetWorkingDir %A_ScriptDir%  ; Ensures a consistent starting directory.
#SingleInstance Force


pwb := ComObjCreate("InternetExplorer.Application") ;create IE Object
pwb.visible:=true  ; Set the IE object to visible
pwb.Navigate("https://www.myurl.com") ;Navigate to URL
while pwb.busy or pwb.ReadyState != 4 ;Wait for page to load
	Sleep, 100


var := pwb.document.documentElement.innerText ;Get All text on page
gui, font, s12 cblue q5, book antiqua
gui, add, edit, w900 h600 -Wrap, %var%
gui, show

I'd like to get this portion of the script working first, then I'm likely to ask for help with the links.
TLM
Posts: 1608
Joined: 01 Oct 2013, 07:52

Re: Help With Webscraping Script

12 Aug 2016, 12:20

You could parse the HTML and use if statements to filter out the JS if it's in script tags.
Other than that, it's hard to say blindly because JS can be placed inline as an attribute, in <code> tags and/or in <script> tags, etc.

Please paste an example of the HTML you're trying to parse so we can take a closer look.
jim7181812
Posts: 16
Joined: 10 Jun 2016, 10:46

Re: Help With Webscraping Script

12 Aug 2016, 12:39

TLM wrote: You could parse the HTML and use if statements to filter out the JS if it's in script tags.
Other than that, it's hard to say blindly because JS can be placed inline as an attribute, in <code> tags and/or in <script> tags, etc.

Please paste an example of the HTML you're trying to parse so we can take a closer look.

Thanks for your help. How do I parse the HTML?

I wasn't working from a webpage-specific example. I was trying to make a template script that could go through and grab the text off any random webpage.
TLM
Posts: 1608
Joined: 01 Oct 2013, 07:52

Re: Help With Webscraping Script

12 Aug 2016, 14:49

jim7181812 wrote: I wasn't working from a webpage-specific example. I was trying to make a template script that could go through and grab the text off any random webpage.

Web documents are usually structured slightly differently depending on how they were written and presented.
In some cases you may not need filters, while in others you will.

For instance, this example pulls text from the AutoHotkey landing page without conditional statements or traversing the page elements.
NOTE: I'm using WinHttpRequest + HTMLfile objects instead of an Internet Explorer object.

Code:

url = https://autohotkey.com/

reqObj := ComObjCreate( "WinHttp.WinHttpRequest.5.1" )
reqObj.Open( "GET", url, false ), reqObj.Send()

htmObj := ComObjCreate( "HTMLfile" ), htmObj.Write( reqObj.ResponseText )

Gui, Add, Edit, w500 r8, % htmObj.getElementsByTagName( "HTML" )[ 0 ].innerText
Gui, Show
return

GuiClose: ; Close GUI to exit script
ExitApp

jim7181812 wrote: How do I parse the HTML?

Text is often wrapped in p and h (heading) tags, but again this is not guaranteed in all instances.
This example walks every element in the document and keeps only the text of the p and h tags.

Code:

url = https://autohotkey.com/

reqObj := ComObjCreate( "WinHttp.WinHttpRequest.5.1" )
reqObj.Open( "GET", url, false ), reqObj.Send()

htmObj := ComObjCreate( "HTMLfile" ), htmObj.Write( reqObj.ResponseText )

Try While ( element := htmObj.getElementsByTagName( "*" )[ a_index-1 ] ) ; Parse the document's tags
{
	if ( element.tagName ~= "i)^(h\d|p)$" ) 	; if the tag name is exactly H(number) or P (the $ keeps PRE, PARAM etc. out),
		docText .= element.innerText "`n"		; append text to variable docText
}

Gui, Add, Edit, w500 r8, % docText
Gui, Show
Return

GuiClose:
ExitApp

The key is to create a flexible enough conditional statement, e.g. if ( element.tagName ~= "i)^(h\d|p)$" ),
that will give you the best results while taking into consideration the myriad of ways documents can be presented on the web.
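
If the only goal is to keep script code out of the text, another option is to remove the offending elements from the parsed document before reading innerText. The following is a rough sketch of that idea rather than a tested, general solution. It assumes the unwanted code sits in script, style or noscript tags; that tag list is my own guess and will likely need adjusting per site.

```autohotkey
url := "https://autohotkey.com/"

reqObj := ComObjCreate( "WinHttp.WinHttpRequest.5.1" )
reqObj.Open( "GET", url, false ), reqObj.Send()

htmObj := ComObjCreate( "HTMLfile" ), htmObj.Write( reqObj.ResponseText )

; Tags whose contents should not end up in the output (assumed list, adjust as needed)
for index, tagName in [ "script", "style", "noscript" ]
{
	nodes := htmObj.getElementsByTagName( tagName )
	count := nodes.length
	Loop % count
	{
		node := nodes[ count - A_Index ]	; walk backwards so removals don't shift the remaining indexes
		node.parentNode.removeChild( node )	; detach the element from the document
	}
}

Gui, Add, Edit, w500 r8, % htmObj.documentElement.innerText
Gui, Show
return

GuiClose: ; Close GUI to exit script
ExitApp
```

This keeps all of the remaining page text instead of only the p and h tags, so it may be a better fit for a template script, at the cost of also keeping menus, footers and other text you might not want.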
