How to parse HTML?

menteith
Posts: 51
Joined: 04 Feb 2016, 12:22

Post by menteith » 07 Apr 2016, 07:06

Hi all,

I have problems with parsing HTML files. The thing is that all the examples I have seen so far are too complicated for a beginner, and I don't understand them at all. I have read the GetNestedTag manual, but it seems there is a better way: using the HTML DOM. Let's consider the following code:


Code: Select all

; from https://autohotkey.com/board/topic/102562-parse-html-form-data-regex/
htmlCode =
(
<Form method="POST" action='http://google.com'>
    <input name='test'   value="abc">
    <input name="test2" id="testid" value='aaa' >
    <input value="345" name="w12">
    <div value="xtest" name="ytest"></div><input name='bla'    value='bla' class="testclass">
    <input value="123">
<input type="submit" name="method1" value="white">
    <input type="submit" name="method2" value="green">
</form>
)

document := ComObjCreate("HTMLFile")
document.write(htmlCode)

f := document.forms(0)
e := f.elements

t := "method:`t" f.method "`naction:`t" f.action "`n`n"
Loop % e.length
   i := A_Index-1
 , t .= (e[i].name!=""? e[i].name "`t" e[i].value "`n":"")
MsgBox %t%

I have many tags (not sure if this is the correct word) like:
  • quite simple: </span> 3272</p>
  • or more complicated: <h1 class="articleTitle">Title of text</h1>
How do I get the text "3272" in the first case and "Title of text" in the second case?

Any help would be much appreciated :)

Exaskryz
Posts: 2882
Joined: 17 Oct 2015, 20:28

Re: How to parse HTML?

Post by Exaskryz » 07 Apr 2016, 09:08

See https://autohotkey.com/boards/viewtopic.php?f=5&t=15272 - kon may have provided a better answer, though I haven't looked into his links.

You need to be familiar with the page's source code and know that its structure is consistent. Using your two examples, where I assume there would be more text around those tags that wasn't given in the sample:

Code: Select all

string1:="</span> 3272</p>"
string2:="<h1 class=""articleTitle"">Title of text</h1>"
pos1:=RegExMatch(string1,"</span>.*?(?P<digits>\d*?)</p>",output1)
pos2:=RegExMatch(string2,"<h1 class=""articleTitle"">(?P<title>.*?)</h1>",output2)
MsgBox % output1digits "`n" output2title
return
You'll want to read up on RegEx.
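
Since the real page will contain more text around those tags, and possibly several matching tags, a minimal sketch of collecting every match might look like the following. It assumes each title appears exactly as an h1 with class="articleTitle" on a single line; the sample string and the variable names (html, m, titles) are made up for illustration.

```autohotkey
; Hypothetical sample input; a real page would be read from a file or a download.
html := "<p>intro</p><h1 class=""articleTitle"">First title</h1><p>text</p><h1 class=""articleTitle"">Second title</h1>"

titles := ""
pos := 1
; RegExMatch returns the position of the match, or 0 when nothing is found.
; Its fourth parameter is the position to start searching from.
while pos := RegExMatch(html, "<h1 class=""articleTitle"">(?P<title>.*?)</h1>", m, pos)
{
    titles .= mtitle "`n"   ; the named subpattern "title" is stored in mtitle
    pos += StrLen(m)        ; continue after the end of the current match
}
MsgBox % titles
```

If a title can span several lines, putting the s option in front of the pattern ("s)<h1 ...") lets the dot match newlines as well.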

timelizards
Posts: 20
Joined: 11 Sep 2015, 21:00

Re: How to parse HTML?

Post by timelizards » 07 Apr 2016, 10:32

Maybe this will get you started. I don't see the span tags in your example, but I was just passing through. I'll check back in a few. I might have changed one of your variable names.

Code: Select all

htmlCode =
(
<Form method="POST" action='http://google.com'>
    <input name='test'   value="abc">
    <input name="test2" id="testid" value='aaa' >
    <input value="345" name="w12">
    <div value="xtest" name="ytest"></div><input name='bla'    value='bla' class="testclass">
    <input value="123">
<input type="submit" name="method1" value="white">
    <input type="submit" name="method2" value="green">
</form>
)
 
html := ComObjCreate("HTMLFile")
html.write(htmlCode)

MsgBox % html.getElementByID("testid").value

timelizards
Posts: 20
Joined: 11 Sep 2015, 21:00

Re: How to parse HTML?

Post by timelizards » 07 Apr 2016, 10:41

Here is another quick example. I added an h1 tag to the HTML code:

Code: Select all

htmlCode =
(
<Form method="POST" action='http://google.com'>
    <input name='test'   value="abc">
    <input name="test2" id="testid" value='aaa' >
    <input value="345" name="w12">
    <div value="xtest" name="ytest"></div><input name='bla'    value='bla' class="testclass">
    <input value="123">
<input type="submit" name="method1" value="white">
    <input type="submit" name="method2" value="green">
	<h1>some text in an h1</h1>
</form>
)
 
html := ComObjCreate("HTMLFile")
html.write(htmlCode)
 
MsgBox % html.getElementByID("testid").value

INPUTS := html.getElementsByTagName("INPUT") ;get a collection of INPUT elements
H1 := html.getElementsByTagName("H1") ;get a collection of H1 elements

Loop % INPUTS.length
{
	_input := INPUTS[A_Index-1] ; zero based collection
	
	MsgBox % "Name: " . _input.name . "`nValue: " . _input.value
}


Loop % H1.length
{
	_h1 := H1[A_Index-1] ; zero based collection
	
	MsgBox % "InnerText of H1: " . _h1.innerText
}
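
Applied to the two tags from the original question, the same DOM approach could be sketched roughly as follows. The p and span markup is only a guess at what surrounds "</span> 3272</p>" on the real page, and the variable names are made up. className, nextSibling and nodeValue are standard DOM properties, but the HTMLFile object uses an old Internet Explorer engine, so this is worth testing against the actual page.

```autohotkey
; Hypothetical markup; only "</span> 3272</p>" and the h1 come from the question.
htmlCode =
(
<h1 class="articleTitle">Title of text</h1>
<p><span>Views:</span> 3272</p>
)

html := ComObjCreate("HTMLFile")
html.write(htmlCode)

; Second case: keep only the h1 elements whose class is "articleTitle".
headings := html.getElementsByTagName("H1")
Loop % headings.length
{
    h := headings[A_Index-1] ; zero based collection
    if (h.className = "articleTitle")
        MsgBox % "Title: " . h.innerText
}

; First case: the number is the text node that directly follows the span.
; Assumes the sample contains at least one span with text after it.
spans := html.getElementsByTagName("SPAN")
span := spans[0]
MsgBox % "Number: " . Trim(span.nextSibling.nodeValue)
```

Filtering on className is used here because getElementsByClassName may not be available in the document mode that HTMLFile runs in.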

garry
Posts: 3795
Joined: 22 Dec 2013, 12:50

Re: How to parse HTML?

Post by garry » 07 Apr 2016, 12:06

A basic example:

Code: Select all

#NoEnv
#Warn
SetBatchLines -1

e4x:=""
e4x=
(Join`r`n
<Form method="POST" action='http://google.com'>
    <input name='test'   value="abc">
    <input name="test2" id="testid" value='aaa' >
<h1 class="articleTitle">Title of text1</h1>
    <input value="345" name="w12">
    <div value="xtest" name="ytest"></div><input name='bla'    value='bla' class="testclass">
    <input value="123">
<input type="submit" name="method1" value="white">
    <input type="submit" name="method2" value="green">
</form>
<h1 class="articleTitle">Title of text2</h1>
)

e:=""
y:=""
x:=""
Loop,parse,e4x,`n,`r
  {
  x:= A_LoopField
  if x contains articletitle
    {
    y := RegExReplace(x, "[<].*?>")
    e .= y . "`n"
    }
  else
    continue
  }
msgbox,%e%
e:=""
return

menteith
Posts: 51
Joined: 04 Feb 2016, 12:22

Re: How to parse HTML?

Post by menteith » 08 Apr 2016, 06:03

Thank you guys! It will help me a lot. As soon as I'm done with coding, I will ask more questions if I get stuck on something :)
