Need help with XML parse

Get help with using AutoHotkey (v1.1 and older) and its commands and hotkeys
smarq8
Posts: 69
Joined: 16 Jan 2016, 00:33

13 Jan 2018, 13:11

I want to read an hOCR file (XML format), but I cannot get any result.
Where is the mistake in this code?

Code: Select all

gosub tst
;msgbox,% xmldata
doc := ComObjCreate("MSXML2.DOMDocument.6.0")
doc.async := false
doc.loadXML(xmldata)
;al(doc)
;DocNode := doc.selectSingleNode("//html/body/div[0]").getAttribute("title") ;not working
;DocNode := doc.selectSingleNode("//html/body/div[0]/div/p/span/span/strong") ;not working
DocNode := doc.selectSingleNode("//html/body/div[0]/div/p/span/span").getAttribute("bbox") ;not working
DocText := DocNode.text
MsgBox,% DocText
return

f4::reload

tst:
xmldata = 
(join`r`n
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
 <head>
  <title></title>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8" />
  <meta name='ocr-system' content='tesseract 4.00.00alpha' />
  <meta name='ocr-capabilities' content='ocr_page ocr_carea ocr_par ocr_line ocrx_word'/>
</head>
<body>
  <div class='ocr_page' id='page_1' title='image "D:\VideoSubFinder_2.10_x32_64\down\RGBImages\x\2\ClearedText\0_07_47_760__0_07_49_039.png"; bbox 0 0 1280 170; ppageno 0'>
   <div class='ocr_carea' id='block_1_1' title="bbox 467 87 812 128">
    <p class='ocr_par' id='par_1_1' lang='pol' title="bbox 467 87 812 128">
     <span class='ocr_line' id='line_1_1' title="bbox 467 87 812 128; baseline 0 -9; x_size 41; x_descenders 9; x_ascenders 10"><span class='ocrx_word' id='word_1_1' title='bbox 467 87 571 119; x_wconf 94'><strong>Jesteś</strong></span> <span class='ocrx_word' id='word_1_2' title='bbox 582 87 653 128; x_wconf 95'><strong>zbyt</strong></span> <span class='ocrx_word' id='word_1_3' title='bbox 663 87 812 119; x_wconf 95'><strong>ambitna!</strong></span> 
     </span>
    </p>
   </div>
  </div>
 </body>
</html>

)

return
noname
Posts: 515
Joined: 19 Nov 2013, 09:15

Re: Need help with XML parse

15 Jan 2018, 05:58

There must be a better way, but the problem is the DTD and the namespace declaration. I can get past the DTD, but not the namespace.
If you just remove the namespace declaration it works, but I hope a better-informed forum user will give a "real" solution :)

Replace <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en"> with <html> (the root element):

Code: Select all

xml = 
(join`r`n
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
 <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">


  <head>
    <title></title>
    <meta http-equiv="Content-Type" content="text/html;charset=utf-8" />
    <meta name='ocr-system' content='tesseract 4.00.00alpha' />
    <meta name='ocr-capabilities' content='ocr_page ocr_carea ocr_par ocr_line ocrx_word'/>
  </head>
  <body>
    <div class='ocr_page' id='page_1' title='image "D:\VideoSubFinder_2.10_x32_64\down\RGBImages\x\2\ClearedText\0_07_47_760__0_07_49_039.png"; bbox 0 0 1280 170; ppageno 0'>
     <div class='ocr_carea' id='block_1_1' title="bbox 467 87 812 128">
      <p class='ocr_par' id='par_1_1' lang='pol' title="bbox 467 87 812 128">
       <span class='ocr_line' id='line_1_1' title="bbox 467 87 812 128; baseline 0 -9; x_size 41; x_descenders 9; x_ascenders 10"><span class='ocrx_word' id='word_1_1' title='bbox 467 87 571 119; x_wconf 94'><strong>Jesteś</strong></span> <span class='ocrx_word' id='word_1_2' title='bbox 582 87 653 128; x_wconf 95'><strong>zbyt</strong></span> <span class='ocrx_word' id='word_1_3' title='bbox 663 87 812 119; x_wconf 95'><strong>ambitna!</strong></span> 
       </span>
      </p>
     </div>
    </div>
  </body>
</html>
) 

xmldata:=regexreplace(xml,"<html xmlns.+?[>]","<html>")

doc := ComObjCreate("MSXML2.DOMDocument.6.0")
doc.async := false
doc.setProperty("ProhibitDTD", false)    ; allow the DOCTYPE declaration (MSXML 6 prohibits DTDs by default)
doc.resolveExternals := false
doc.validateOnParse := false


doc.loadXML(xmldata)

l .= doc.selectSingleNode("//body/div").getAttribute("title") "`n" "`n"
l .= doc.selectSingleNode("//body/div/div").getAttribute("title") "`n" "`n"
l .= doc.selectSingleNode("//body/div/div/p").getAttribute("title")


msgbox %l%
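
If you would rather not edit the markup at all, one option might be to match elements by their local name, which ignores the default XHTML namespace. This is only a minimal sketch for AutoHotkey v1.1: the shortened sample string and the variable names are made up for illustration and there is no DOCTYPE in it, so the DTD settings are not needed here. Note also that XPath positions start at 1, so div[0] never matches anything.

```autohotkey
; Minimal sketch (AutoHotkey v1.1). The sample markup below is a shortened,
; illustrative stand-in for the real hOCR file.
sample := "<html xmlns='http://www.w3.org/1999/xhtml'><body><div title='bbox 0 0 1280 170'/></body></html>"

doc := ComObjCreate("MSXML2.DOMDocument.6.0")
doc.async := false
if !doc.loadXML(sample)
{
    ; loadXML returns false on a parse error; parseError.reason describes it
    MsgBox, % "Parse error: " doc.parseError.reason
    ExitApp
}

; local-name() compares the element name without its namespace.
; [1] selects the first matching div (XPath indexes are 1-based).
node := doc.selectSingleNode("//*[local-name()='body']/*[local-name()='div'][1]")
MsgBox, % node.getAttribute("title")
```

The expressions get long quickly, so for deeper paths registering a prefix through the SelectionNamespaces property is probably more readable.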
Guest

Re: Need help with XML parse

15 Jan 2018, 10:26

Code: Select all

gosub tst
doc := ComObjCreate("MSXML2.DOMDocument.6.0")

doc.setProperty("ProhibitDTD", false) 
doc.setProperty("SelectionNamespaces"
				," xmlns:z='http://www.w3.org/1999/xhtml' ")
doc.validateOnParse := false

doc.async := false
doc.loadXML(xmldata)

DocNode := doc.selectSingleNode("/z:html/z:body/z:div/@title") 
strong := doc.selectSingleNode("/z:html/z:body/z:div/z:div/z:p/z:span/z:span/z:strong")
span := doc.selectSingleNode("//z:span/z:span/@title[contains(.,'bbox')]")
span2 := doc.selectSingleNode("/z:html/z:body/z:div/z:div/z:p/z:span/z:span[contains(@title,'663')]")
MsgBox,% DocNode.text "`r`r" strong.text "`r`r" span.Text "`r`r" span2.Text
return  ; end the auto-execute section here so it does not fall through into tst:

tst:
xmldata = 
(
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
 <head>
  <title>title</title>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8" />
  <meta name='ocr-system' content='tesseract 4.00.00alpha' />
  <meta name='ocr-capabilities' content='ocr_page ocr_carea ocr_par ocr_line ocrx_word'/>
</head>
<body>
  <div class='ocr_page' id='page_1' title='image "D:\VideoSubFinder_2.10_x32_64\down\RGBImages\x\2\ClearedText\0_07_47_760__0_07_49_039.png"; bbox 0 0 1280 170; ppageno 0'>
   <div class='ocr_carea' id='block_1_1' title="bbox 467 87 812 128">
    <p class='ocr_par' id='par_1_1' lang='pol' title="bbox 467 87 812 128">
     <span class='ocr_line' id='line_1_1' title="bbox 467 87 812 128; baseline 0 -9; x_size 41; x_descenders 9; x_ascenders 10">
		<span class='ocrx_word' id='word_1_1' title='bbox 467 87 571 119; x_wconf 94'><strong>Jestes</strong>
		</span> 
		<span class='ocrx_word' id='word_1_2' title='bbox 582 87 653 128; x_wconf 95'><strong>zbyt</strong>
		</span> 
		<span class='ocrx_word' id='word_1_3' title='bbox 663 87 812 119; x_wconf 95'><strong>ambitna!</strong>
		</span> 
     </span>
    </p>
   </div>
  </div>
 </body>
</html>
)
return
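
Note that hOCR has no bbox attribute: the box is stored inside the title attribute, so getAttribute("bbox") returns nothing. To collect every word together with its box, something along these lines might work. It is a rough sketch for AutoHotkey v1.1 that assumes the ocrx_word class and the "bbox x0 y0 x1 y1" title layout shown in the file above; the shortened sample and the names hocr, out and b are illustrative only.

```autohotkey
; Minimal sketch (AutoHotkey v1.1). Shortened, illustrative hOCR sample.
hocr =
(
<html xmlns="http://www.w3.org/1999/xhtml"><body>
<span class='ocr_line' title="bbox 467 87 812 128">
<span class='ocrx_word' title='bbox 582 87 653 128; x_wconf 95'><strong>zbyt</strong></span>
<span class='ocrx_word' title='bbox 663 87 812 119; x_wconf 95'><strong>ambitna!</strong></span>
</span>
</body></html>
)

doc := ComObjCreate("MSXML2.DOMDocument.6.0")
doc.async := false
; Bind the prefix z to the XHTML namespace for use in XPath queries
doc.setProperty("SelectionNamespaces", "xmlns:z='http://www.w3.org/1999/xhtml'")
if !doc.loadXML(hocr)
{
    MsgBox, % "Parse error: " doc.parseError.reason
    ExitApp
}

out := ""
; selectNodes returns a node list that a for-loop can enumerate
for word in doc.selectNodes("//z:span[@class='ocrx_word']")
{
    ; Pull the four bbox numbers out of the title attribute into b1..b4
    if RegExMatch(word.getAttribute("title"), "bbox (\d+) (\d+) (\d+) (\d+)", b)
        out .= word.text ": " b1 "," b2 " - " b3 "," b4 "`n"
}
MsgBox, % out
```

For the real file, which starts with a DOCTYPE, the ProhibitDTD and validateOnParse settings from the script above would still be needed before loadXML.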
