jim7181812 wrote: I wasn't trying with a webpage specific example. I was trying to make a template script that could go through and grab the text off any random webpage.
Web documents are usually structured slightly differently depending on how they were written and presented.
In some cases you won't need filters, while in others you will.
For instance, this example pulls text from the AutoHotkey landing page without conditional statements or traversing the page elements:
NOTE: I'm using WinHttpRequest + HTML objects over an Internet Explorer object.
Code: Select all
url = https://autohotkey.com/
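; Request the page source synchronously (third parameter false = not async)
reqObj := ComObjCreate( "WinHttp.WinHttpRequest.5.1" )
reqObj.Open( "GET", url, false ), reqObj.Send()
; Write the source into an in-memory HTML document so it can be queried
htmObj := ComObjCreate( "HTMLfile" ), htmObj.Write( reqObj.ResponseText )
; Show the text of the root HTML element in an Edit control
Gui, Add, Edit, w500 r8, % htmObj.getElementsByTagName( "HTML" )[ 0 ].innerText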
Gui, Show
return
GuiClose: ; Close GUI to exit script
ExitApp
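If you want something closer to a reusable template, one option is to wrap the same steps in functions. This is only a sketch: GetPageText and HtmlToText are names I made up for illustration, and treating anything other than HTTP status 200 as a failure is my assumption, not a rule.

```autohotkey
MsgBox % GetPageText( "https://autohotkey.com/" )
return

; Hypothetical helper: fetch a URL and return the text of the page
GetPageText( url )
{
    reqObj := ComObjCreate( "WinHttp.WinHttpRequest.5.1" )
    reqObj.Open( "GET", url, false ), reqObj.Send()
    if ( reqObj.Status != 200 ) ; assumption: treat any non-200 status as failure
        return ""
    return HtmlToText( reqObj.ResponseText )
}

; Hypothetical helper: load an HTML string and return the root element's text
HtmlToText( html )
{
    htmObj := ComObjCreate( "HTMLfile" ), htmObj.Write( html )
    return htmObj.getElementsByTagName( "HTML" )[ 0 ].innerText
}
```

Splitting the fetch from the parsing means you can feed HtmlToText a saved page or a string while you experiment, without hitting the site each time.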
jim7181812 wrote: How do I parse the HTML?
Text is often wrapped in p and h (heading) tags, but this is not guaranteed in every document.
This example collects the text of only those elements whose tags are p or h tags.
Code: Select all
url = https://autohotkey.com/
reqObj := ComObjCreate( "WinHttp.WinHttpRequest.5.1" )
reqObj.Open( "GET", url, false ), reqObj.Send()
htmObj := ComObjCreate( "HTMLfile" ), htmObj.Write( reqObj.ResponseText )
Try While ( element := htmObj.getElementsByTagName( "*" )[ a_index-1 ] ) ; Parse the document's tags
{
    if ( element.tagName ~= "i)^(h\d|p)" ) ; if the tag name starts with an H(number) or P,
        docText .= element.innerText "`n" ; append text to variable docText
}
Gui, Add, Edit, w500 r8, % docText
Gui, Show
Return
GuiClose:
ExitApp
The key is to write a conditional statement that is flexible enough, e.g.:
if ( element.tagName ~= "i)^(h\d|p)" )
so that it gives you the best results across the myriad of ways documents are presented on the web.
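One thing to watch with that condition: the pattern ^(h\d|p) matches any tag name that merely starts with P, so PRE and PARAM pass as well. If that is not what you want, a stricter variant could look like the sketch below. IsTextTag is a hypothetical helper name, and limiting the match to exactly P and H1 through H6 is my assumption about what should be kept.

```autohotkey
url = https://autohotkey.com/
reqObj := ComObjCreate( "WinHttp.WinHttpRequest.5.1" )
reqObj.Open( "GET", url, false ), reqObj.Send()
htmObj := ComObjCreate( "HTMLfile" ), htmObj.Write( reqObj.ResponseText )
elements := htmObj.getElementsByTagName( "*" )
Loop % elements.length ; visit every element in the document
{
    element := elements[ A_Index-1 ] ; the collection is zero-based
    if ( IsTextTag( element.tagName ) )
        docText .= element.innerText "`n"
}
Gui, Add, Edit, w500 r8, % docText
Gui, Show
return

GuiClose:
ExitApp

; Hypothetical helper: true only when the tag name is exactly P or H1-H6
IsTextTag( tagName )
{
    return ( tagName ~= "i)^(h[1-6]|p)$" ) > 0 ; ^ and $ anchor the whole name
}
```

Keeping the test in its own function makes it easy to adjust the pattern per site, for example adding li or td, without touching the loop.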