GetNestedTag() - HTML parser to retrieve (nested)tags [func]


7 replies to this topic

#1 SoLong&Thx4AllTheFish

SoLong&Thx4AllTheFish
  • Members
  • 4999 posts

Posted 14 October 2011 - 06:02 PM

As far as I know it is pretty complicated to retrieve the contents of a nested html-tag, say a DIV, using a regular expression. Therefore I wrote this small parser.

This is useful if you have an HTML file or variable and want to extract a block of data, for example the CONTENT or MAIN div, or a UL that is used for a menu. By default it returns the first occurrence of a tag, so to retrieve more than one tag, for example all <p> tags in the HTML data, you have to call it in a loop (see example 5).

The examples below will hopefully illustrate its purpose. Of course it only works with properly formatted HTML, where every opening tag has a matching closing tag.

Documentation
http://www.autohotke... ... edTag.html

Function
; GetNestedTag() v1
; AHK Forum Topic : http://www.autohotkey.com/forum/viewtopic.php?t=77653
; Documentation   : http://www.autohotkey.net/~hugov/functions/GetNestedTag.html

GetNestedTag(data,tag,occurrence="1")
	{
	 Start:=InStr(data,tag,false,1,occurrence)
	 RegExMatch(tag,"i)<([a-z]*)",basetag) ; get yer basetag1 here
	 Loop
		{
		 Until:=InStr(data, "</" basetag1 ">", false, Start, A_Index) + StrLen(basetag1) + 3
 		 Strng:=SubStr(data, Start, Until - Start) 

		 StringReplace, strng, strng, <%basetag1%, <%basetag1%, UseErrorLevel ; start counting to make match
		 OpenCount:=ErrorLevel
		 StringReplace, strng, strng, </%basetag1%, </%basetag1%, UseErrorLevel
		 CloseCount:=ErrorLevel
		 If (OpenCount = CloseCount)
		 	Break

		 If (A_Index > 250) ; for safety so it won't get stuck in an endless loop, 
		 	{                 ; it is unlikely to have over 250 nested tags
		 	 strng=
		 	 Break
		 	}	
		}
	 If (StrLen(strng) < StrLen(tag)) ; something went wrong/can't find it
	 	strng=
	 Return strng
	}

Examples - (includes the function so it runs out of the box if you simply want to try it)

    Gosub, GetHtml

    ; Example 1 - get DIV header
    ; Pretty simple as it is not a nested DIV
    get=<div id="header">
    MsgBox,,Example 1 - get DIV header, % GetNestedTag(html,get)

    Result:
     <div id="header"><!-- start header -->
      <h1>GetNestedTag(data,tag)</h1>
     <!-- /end header --></div>

    ; Example 2 - get content
    ; More complex as it is a nested DIV
    get=<div id="content">
    MsgBox,,Example 2 - get DIV content, % GetNestedTag(html,get)

    Result:
     <div id="content"><!-- start content -->
     .... all lines in between ...
     <!-- /end content --></div>     

    ; Example 3a - get table data
    ; Nested TABLE
    get=<table id="data">
    MsgBox,,Example 3a - get table data, % GetNestedTag(html,get)

    ; Example 3b - get table subdata
    get=<table id="subdata">
    MsgBox,,Example 3b - get table subdata, % GetNestedTag(html,get)

    ; Example 4a - get UL menu
    ; Nested UL, 3 levels
    get=<ul id="menu">
    MsgBox,,Example 4a - get UL Menu, % GetNestedTag(html,get)

    ; Example 4b - get the first UL
    get=<ul>
    MsgBox,,Example 4b - get UL, % GetNestedTag(html,get)

    ; Example 5 - get all paragraphs
    Loop
    	{
    	 tag:=GetNestedTag(html,"<p",A_Index)
    	 If (tag = "")
    	 	Break
    	 MsgBox,,Example 5 - get P (%A_Index%/5), % tag
    	}

    ExitApp

		GetHtml:
		html=
		(
		<!DOCTYPE html>
		<html lang="en-us">
		<head>
		<title>Example HTML</title>
		</head>
		<body>
		<div id="wrapper"><!-- start wrapper -->
		 <div id="header"><!-- start header -->
		  <h1>GetNestedTag(data,tag)</h1>
		 <!-- /end header --></div>
		 <div id="navigation"><!-- start navigation -->
		  <ul id="menu">
		   <li>Menu option 1</li>
		   <li>Menu option 2
		    <ul class="submenu">
		     <li>Submenu 2.1</li>
		     <li>Submenu 2.2
		      <ul><!-- ul1 -->
		       <li>Submenu 2.2.1</li>
		       <li>Submenu 2.2.2</li>
		       <li>Submenu 2.2.3</li>
		       <li>Submenu 2.2.4</li>
		      </ul>
		     </li>
		     <li>Submenu 2.3</li>
		     <li>Submenu 2.4</li>
		    </ul>
		   </li>
		   <li>Menu option 3</li>
		   <li>Menu option 4</li>
		   <li>Menu option 5    
		    <ul class="submenu">
		     <li>Submenu 5.1</li>
		     <li>Submenu 5.2</li>
		    </ul>
		   </li>
		   <li>Menu option 6</li>
		  </ul>
		 </div><!-- /end navigation -->
		
		   <div id="leftcolumn"><!-- start leftcol -->
		    <ol>
		     <li>Scripting</li>
		     <li>Hotkeys</li>
		     <li>Automation</li>
		    </ol>
		   <!-- /end leftcol --></div>
		   
		   <div id="content"><!-- start content -->
		   
		   <div class="intro"><p>AutoHotkey is a free, open-source utility for Windows. With it, you can:</p></div>
		
		   <div style="clear:both;"></div>
		   
		   <ul><!-- ul2 -->
		   <li>Automate almost anything by sending keystrokes and mouse clicks.</li>
		   <li>Create hotkeys for keyboard, joystick, and mouse.</li>
		   <li>Expand abbreviations as you type them.</li>
		   <li>Create custom data-entry forms, user interfaces, and menu bars.</li>
		   <li>Remap keys and buttons on your keyboard, joystick, and mouse.</li>
		   <li>Respond to signals from hand-held remote controls via the WinLIRC client script.</li>
		   <li>Run existing AutoIt v2 scripts and enhance them with new capabilities.</li>
		   </ul>
		 
		   <p>Getting started might be easier than you think. Check out the quick-start tutorial.</p>
		
		   <div style="clear:both;"></div>
		
		   <p>Here is a nice table with some data:</p>
		
		   <table id="data">
		     <tr>
		       <th>1</th>
		       <th>2</th>
		     </tr>
		     <tr>
		       <td>
		       <table id="subdata">
		        <tr>
		         <td>3a</td>
		         <td>3b</td>
		        </tr>
		       </table>
		       </td>
		       <td>4</td>
		     </tr>
		     <tr>
		       <td>5</td>
		       <td>6</td>
		     </tr>
		   </table>
		   
		   <p>Nothing more to report in content.</p>
		
		   <!-- /end content --></div>
		   <div id="rightcolumn"><!-- start rightcol -->
		    <ol>
		     <li>Automation</li>
		     <li>Hotkeys</li>
		     <li>Scripting</li>
		    </ol>
		   <!-- /end rightcol --></div>
		
		  <div id="footer">
		   <p><a href='http://www.autohotkey.com/'>http://www.autohotkey.com/</a></p>
		  </div>
		<!-- /end wrapper --></div>
		</body>
		</html>
		)
		Return
		
; GetNestedTag() v1
; AHK Forum Topic : http://www.autohotkey.com/forum/viewtopic.php?t=77653
; Documentation   : http://www.autohotkey.net/~hugov/functions/GetNestedTag.html

GetNestedTag(data,tag,occurrence="1")
	{
	 Start:=InStr(data,tag,false,1,occurrence)
	 RegExMatch(tag,"i)<([a-z]*)",basetag) ; get yer basetag1 here
	 Loop
		{
		 Until:=InStr(data, "</" basetag1 ">", false, Start, A_Index) + StrLen(basetag1) + 3
 		 Strng:=SubStr(data, Start, Until - Start) 

		 StringReplace, strng, strng, <%basetag1%, <%basetag1%, UseErrorLevel ; start counting to make match
		 OpenCount:=ErrorLevel
		 StringReplace, strng, strng, </%basetag1%, </%basetag1%, UseErrorLevel
		 CloseCount:=ErrorLevel
		 If (OpenCount = CloseCount)
		 	Break

		 If (A_Index > 250) ; for safety so it won't get stuck in an endless loop, 
		 	{                 ; it is unlikely to have over 250 nested tags
		 	 strng=
		 	 Break
		 	}	
		}
	 If (StrLen(strng) < StrLen(tag)) ; something went wrong/can't find it
	 	strng=
	 Return strng
	}		

(I don't even know if it is possible with COM, it very well might be, but I needed it so I wrote it)

#2 A_Samurai

A_Samurai
  • Members
  • 30 posts

Posted 14 October 2011 - 10:25 PM

My take on this:

GetHTMLbyID(HTMLSource, ID, Format=0)

GetHTMLbyID(HTMLSource, ID, Format=0) {
	;Format 0:Text 1:HTML 2:DOM
	ComError := ComObjError(false), `(oHTML := ComObjCreate("HtmlFile")).write(HTMLSource)	
	if (Format = 2) {
		if (innerHTML := oHTML.getElementById(ID)["innerHTML"]) {
			`(oDOM := ComObjCreate("HtmlFile")).write(innerHTML)
			Return oDOM, ComObjError(ComError)
		} else 
			Return "", ComObjError(ComError)
	} else
	Return (result := oHTML.getElementById(ID)[(Format ? "innerHTML" : "innerText")]) ? result : "", ComObjError(ComError)
}

GetHTMLbyTag(HTMLSource, Tag, Occurrence=1, Format=0)

GetHTMLbyTag(HTMLSource, Tag, Occurrence=1, Format=0) {
	;Format 0:Text 1:HTML 2:DOM
	ComError := ComObjError(false), `(oHTML := ComObjCreate("HtmlFile")).write(HTMLSource)	
	if (Format = 2) {
		if (innerHTML := oHTML.getElementsByTagName(Tag)[Occurrence-1]["innerHTML"]) {
			`(oDOM := ComObjCreate("HtmlFile")).write(innerHTML)
			Return oDOM, ComObjError(ComError)
		} else 
			Return "", ComObjError(ComError)
	}
	return (result := oHTML.getElementsByTagName(Tag)[Occurrence-1][(Format ? "innerHTML" : "innerText")]) ? result : "", ComObjError(ComError)
}
The above functions require AutoHotkey_L. The descriptions and examples are found in the links.
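
For anyone who cannot reach the linked examples, here is a minimal sketch of the underlying idea, assuming AutoHotkey_L on Windows: write a string into an HtmlFile COM object and query it by ID. The sample HTML and the names sample, doc and elem are made up for illustration. This is not the code above, only the same technique stripped down.

```autohotkey
; Sketch only: sample, doc and elem are illustrative names.
sample := "<div id='content'><div class='intro'><p>Hello</p></div><p>World</p></div>"

doc := ComObjCreate("HtmlFile")      ; in-memory HTML document
doc.write(sample)                    ; let the browser engine parse the string
doc.close()

elem := doc.getElementById("content")

; innerHTML keeps the nested markup, innerText returns only the visible text
MsgBox, % "innerHTML:`n" . elem.innerHTML . "`n`ninnerText:`n" . elem.innerText
```

Because the browser engine does the parsing, nested tags need no special handling here.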

#3 Guests

  • Guests

Posted 15 October 2011 - 08:40 AM

Thanks for the COM version, A_Samurai. I had already finished the small project for which I wrote GetNestedTag, but your solution is now available for others to use as well.

#4 Richard

Richard
  • Members
  • 19 posts

Posted 24 July 2012 - 12:12 AM

Thank you for posting the examples, I found them very helpful. As a result, I was able to successfully write a test script to retrieve a price from html and compare it to a set maximum price. I have enclosed the code below.

I note that there is an option to use FileRead rather than including the html within the script. I wonder whether someone would show me how I could have the script run on the html of an opened webpage? In other words, I currently have a second script that cycles through open Amazon webpages and clicks on the buy button. I would like to insert the code below into this script so that the script only purchases items below my maximum price.

; Test script to pull a price from html and compare it to a given maximum price.
; AutoHotKey_L
; GetNestedTag() v1
; Example based on:
;     AHK Forum Topic : http://www.autohotkey.com/forum/viewtopic.php?t=77653
;     Documentation   : http://www.autohotkey.net/~hugov/functions/GetNestedTag.html

GetNestedTag(data,tag,occurrence="1")
   {
    Start:=InStr(data,tag,false,1,occurrence)
    RegExMatch(tag,"i)<([a-z]*)",basetag) ; get yer basetag1 here
    Loop
      {
       Until:=InStr(data, "</" basetag1 ">", false, Start, A_Index) + StrLen(basetag1) + 3
        Strng:=SubStr(data, Start, Until - Start) 

       StringReplace, strng, strng, <%basetag1%, <%basetag1%, UseErrorLevel ; start counting to make match
       OpenCount:=ErrorLevel
       StringReplace, strng, strng, </%basetag1%, </%basetag1%, UseErrorLevel
       CloseCount:=ErrorLevel
       If (OpenCount = CloseCount)
          Break

       If (A_Index > 250) ; for safety so it won't get stuck in an endless loop, 
          {                 ; it is unlikely to have over 250 nested tags
           strng=
           Break
          }   
      }
    If (StrLen(strng) < StrLen(tag)) ; something went wrong/can't find it
       strng=
    Return strng
   }
   
Gosub, GetHtml

MaxPrice := 1.00 ; Set the maximum price to purchase an item for.

Loop
   {
    tag:=GetNestedTag(html,"<B class=priceLarge>",A_Index)
    If (tag = "")
       Break
    StringTrimLeft, RemoveClass, tag, 21 ; Trim the opening tag and the $ sign (all characters to the left of the first digit)
    StringTrimRight, PriceStr, RemoveClass, 5 ; Trim all characters to the right of the last digit
    RegExMatch(PriceStr,"(\d{1,3}(\,\d{3})*|(\d+))(\.\d{2})?$",Price) ; Extract the numeric part of the price

    MsgBox,,The cost of the eBook, % Price

    If (Price < MaxPrice)
       {
        ; pwb.Document.getElementById("buyButton").click() ; Purchase the item
        MsgBox, Yeah! Your eBook has been purchased for $%Price%, which is less than your maximum price of $%MaxPrice%.
       }
    Else
       {
        MsgBox, Sorry, your maximum price of $%MaxPrice% is less than the current price of $%Price%.
        ; Return
       }
    Sleep, 3000
   }

ExitApp

GetHtml: ; *** QUESTION: How can I have the script search the html on an open Amazon page rather than html enclosed below?
html=
(
	<!DOCTYPE html>
    <html lang="en-us">
    <head>
    <title>Example HTML</title>
    </head>
    <body>
   <DIV id=priceBlock class=buying><TABLE class=product>
   <TBODY>
   <TR>
   <TD class=productBlockLabel>Digital List Price: </TD>
   <TD class=digitalListPrice><SPAN class=listprice>$1.99</SPAN> </TD></TR>
   <TR>
   <TD class=productBlockLabel>Kindle Price: </TD>
   <TD><B class=priceLarge>$0.00 </B><SPAN class=price>(Save 100`% or $1.99) </SPAN></TD></TR>
   <TR>
   <TD class=productBlockLabel></TD>
   <TD><SPAN class=price></SPAN></TD></TR></TBODY></TABLE></DIV>
)
Return

SPACE::ExitApp


#5 Wicked

Wicked
  • Members
  • 480 posts

Posted 24 July 2012 - 01:46 AM

These are great! Thanks!

#6 Guests

  • Guests

Posted 24 July 2012 - 06:38 AM

*** QUESTION: How can I have the script search the html on an open Amazon page rather than html enclosed below?


1) URLDownloadToFile + FileRead. You can copy the current location via either ControlGet or simple Send commands (Send ^l^a^c: focus the address bar, select all, copy to the clipboard; depending on the browser, ^l might be !d).

or

2) If you use IE you could use COM. I can't link to the FAQ at the moment as it is offline, but search for COM + browser and you'll find a few tutorials.
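
A minimal sketch of option 1, assuming the address is already known (for instance copied from the address bar as described). The url value and the temporary file name are placeholders. Note that this downloads whatever the server sends to the script, which may differ from the page as shown in a logged-in browser.

```autohotkey
; Sketch only: url and tempFile are placeholder values.
url := "http://www.autohotkey.com/"
tempFile := A_Temp . "\GetNestedTag_page.html"

URLDownloadToFile, %url%, %tempFile%
If ErrorLevel
{
	MsgBox, The download of %url% failed.
	ExitApp
}

FileRead, html, %tempFile%
FileDelete, %tempFile%

; html now holds the page source and could be passed to GetNestedTag(html, ...)
MsgBox, % "Read " . StrLen(html) . " characters of HTML."
```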

#7 fragman

fragman
  • Members
  • 1591 posts

Posted 24 July 2012 - 10:38 AM

3) HttpRequest(). Can be found on the forums and avoids files while providing more options, such as cookies, POST etc.
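
As a rough illustration of fetching a page without a temporary file, here is a sketch that uses the WinHttpRequest COM object directly. It is not the HttpRequest() function from the forums, only a plain substitute that needs AutoHotkey_L; the url value is a placeholder.

```autohotkey
; Sketch only: url is a placeholder. This is not HttpRequest(), just a plain COM request.
url := "http://www.autohotkey.com/"

whr := ComObjCreate("WinHttp.WinHttpRequest.5.1")
whr.Open("GET", url, false)   ; false = synchronous, Send() returns once the response is complete
whr.Send()
html := whr.ResponseText      ; page source as a string, no file involved

MsgBox, % "Status " . whr.Status . ", " . StrLen(html) . " characters received."
```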

#8 Richard

Richard
  • Members
  • 19 posts

Posted 24 July 2012 - 08:51 PM

Thanks for the suggestions.

1) As I will be moving through a number of tabs, I would like to avoid downloading the html for each page.

2) This was the area that I was looking into, but I am currently struggling to find a solution using COM.

3) I have posted some questions on the HTTPRequest thread (http://www.autohotkey.com/community/viewtopic.php?f=13&t=73040&p=553591#p553591) to see if that will work.

The good news is that with the help of sinkfaze (http://www.autohotkey.com/community/viewtopic.php?f=1&t=13545&p=553611#p553611), I was able to pull the price using the following code:

RegExMatch(pwb.document.getElementbyID("priceBlock").outerHTML,"<B class=priceLarge>\K\S+",StrAmount) ; Pull price including $
      
StringTrimLeft, Price, StrAmount, 1 ; Remove $ from price thus leaving only the numbers and decimal

I would still like to see if I can pull html from a webpage as I would like to use it to obtain href links that I can then use to open in new tabs.
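
One possible way to collect href links, sketched under assumptions: the HTML is parsed with an HtmlFile COM object (as in post #2) and the anchors are read from the resulting collection. The sample string and the names sample, doc, links and list are made up for illustration. With a live IE object such as the pwb used above, pwb.document.links should give a similar collection, but that part is untested here.

```autohotkey
; Sketch only: sample, doc, links and list are illustrative names.
sample := "<p><a href='http://www.autohotkey.com/'>Home</a> <a href='http://www.autohotkey.com/forum/'>Forum</a></p>"

doc := ComObjCreate("HtmlFile")
doc.write(sample)
doc.close()

links := doc.getElementsByTagName("a")   ; with a live IE object: pwb.document.links
list := ""
Loop % links.length
	list .= links.item(A_Index - 1).href . "`n"   ; the collection is zero-based

MsgBox, % list
```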