AutoHotkey Community

Posted: November 25th, 2008, 7:44 pm
Hi,

The following StdLib-compliant script decodes HTML entities where it can (for example &amp;, &nbsp; and &#32;) and removes HTML tags (for example, "<p>blah</p>" becomes "blah").

For easy access to the whole package, a direct unHTML(html) wrapper is included.

greets
dR

unhtml.ahk

Code:
; unhtml.ahk - stdlib desired name
;
; StdLib compliant collection to convert HTML to Text
;
; Known Bugs: unHTML_StripEntities won't work for
; Unicode / UTF-8 entities - these will stay intact
;
; v 1.3 / (a & w) Nov 2008 by derRaphael(at)oleco.net
;

; Syntax sugar for combined call of StripEntities & StripTags
unHTML(html){         
   Return unHTML_StripTags(unHTML_StripEntities(html))
}

; Removes HTML tags from given text
unHTML_StripTags(txt){ ; v1 (w) by derRaphael Oct 2008
   Return RegExReplace(txt,"<[^>]+>","")
}

; Strips HTML entities out of given Text / No Unicode / UTF8 Entity support yet
unHTML_StripEntities(html){ ; v1.3.1 (w) by derRaphael Nov 2008
   Loop,
      if (RegExMatch(html,"&[a-zA-Z0-9#]+;",entity)) {
         n:=RegExReplace(unHTML_ConvertEntity2Number(entity),"[^\d]")
         if ((n>0) && (n<256)) {
            html := RegExReplace(html,"\Q" entity "\E",chr(n))
         }
      } else
         break
   return html
}

; Thx Jamey: http://www.autohotkey.com/forum/viewtopic.php?t=22522
; Rewrite by derRaphael v 1.3 / Nov 2008
unHTML_ConvertEntity2Number(sEntityName) {
   nNumber := -1   ;Returned unchanged when the input fails the format checks below.
   nStr := StrLen(sEntityName)

   ;Require the input format:  "& ... ;"
   if ((nStr < 3) || (SubStr(sEntityName, 1, 1) != "&") || (SubStr(sEntityName, 0) != ";"))
      return %nNumber%

   ;If the entity was given as its entity-number format, then return the number part.
   sEntityIdentifier := SubStr(sEntityName, 2, nStr-2)

   if sEntityIdentifier is integer
   {
      sEntityIdentifier := Round(sEntityIdentifier)   ;Not just can be interpreted as integer, but is!
      if (sEntityIdentifier >= 0)
         return %sEntityIdentifier%
      else
         return %nNumber%
   }

   ;If sEntityName really is a name-format entity, then find its number from the table below.
      entityList := "quot|34,apos|39,amp|38,lt|60,gt|62,nbsp|160,iexcl|161,cent|162,pound|163,curren|164,"
            . "yen|165,brvbar|166,sect|167,uml|168,copy|169,ordf|170,laquo|171,not|172,shy|173,reg|"
            . "174,macr|175,deg|176,plusmn|177,sup2|178,sup3|179,acute|180,micro|181,para|182,middo"
            . "t|183,cedil|184,sup1|185,ordm|186,raquo|187,frac14|188,frac12|189,frac34|190,iquest|"
            . "191,Agrave|192,Aacute|193,Acirc|194,Atilde|195,Auml|196,Aring|197,AElig|198,Ccedil|1"
            . "99,Egrave|200,Eacute|201,Ecirc|202,Euml|203,Igrave|204,Iacute|205,Icirc|206,Iuml|207"
            . ",ETH|208,Ntilde|209,Ograve|210,Oacute|211,Ocirc|212,Otilde|213,Ouml|214,times|215,Os"
            . "lash|216,Ugrave|217,Uacute|218,Ucirc|219,Uuml|220,Yacute|221,THORN|222,szlig|223,agr"
            . "ave|224,aacute|225,acirc|226,atilde|227,auml|228,aring|229,aelig|230,ccedil|231,egra"
            . "ve|232,eacute|233,ecirc|234,euml|235,igrave|236,iacute|237,icirc|238,iuml|239,eth|24"
            . "0,ntilde|241,ograve|242,oacute|243,ocirc|244,otilde|245,ouml|246,divide|247,oslash|2"
            . "48,ugrave|249,uacute|250,ucirc|251,uuml|252,yacute|253,thorn|254,yuml|255,OElig|338,"
            . "oelig|339,Scaron|352,scaron|353,Yuml|376,circ|710,tilde|732,ensp|8194,emsp|8195,thin"
            . "sp|8201,zwnj|8204,zwj|8205,lrm|8206,rlm|8207,ndash|8211,mdash|8212,lsquo|8216,rsquo|"
            . "8217,sbquo|8218,ldquo|8220,rdquo|8221,bdquo|8222,dagger|8224,Dagger|8225,hellip|8230"
            . ",permil|8240,lsaquo|8249,rsaquo|8250,euro|8364,trade|8482"

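   ; Entity names are case-sensitive (Agrave is 192, agrave is 224), so match without the i) option.
   Loop,Parse,entityList,`,
      if (RegExMatch(A_LoopField,"^" sEntityIdentifier "\|(?P<Number>\d+)", n))
         break

   ; nNumber is blank when no table entry matched; fall back to the raw identifier (e.g. "#8217").
   return (nNumber!="") ? nNumber : sEntityIdentifier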
}

    All scripts, unless otherwise noted, are hereby released under CC-BY


Posted: November 25th, 2008, 8:03 pm
Nice collection of regex, thanks.

Posted: November 25th, 2008, 10:08 pm
Thank you, titan.

Posted: August 21st, 2009, 6:11 pm
derRaphael,

I am playing with your code now, since I am trying to decode HTML entities as accurately as possible, and it seems there is a potential endless loop here.

In your code:
Code:
if ((n>0) && (n<256)) {
  html := RegExReplace(html,"\Q" entity "\E",chr(n))
}

in unHTML_StripEntities(), you should probably have an Else branch.
Otherwise the string still contains the entities that were not replaced, the RegEx keeps matching them, and the loop never ends.

I bumped into it with entities like &#8217;

Also, further down the script, where you check "is integer" (in unHTML_ConvertEntity2Number()), I think the value will never be an integer.
By the time you reach this point, numbered entities still have their # prefix.
The code still works despite this last issue, but I think not as you intended.
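
One way to avoid the endless loop described above is to remember the search position and step past any entity that cannot be decoded, instead of matching it again. The following is only a minimal sketch in AutoHotkey v1 syntax, not the author's fix: the function names DecodeEntities_Sketch and EntityToCode_Sketch are hypothetical, and the lookup table is deliberately shortened for illustration. It also handles the # prefix of numbered entities explicitly, which addresses the second point.

```autohotkey
; Sketch only: both function names are hypothetical and not part of unhtml.ahk.
DecodeEntities_Sketch(html) {
   pos := 1
   ; Always search from pos, so an entity that is left intact is never matched twice.
   while (found := RegExMatch(html, "&(#\d+|[a-zA-Z][a-zA-Z0-9]*);", m, pos)) {
      n := EntityToCode_Sketch(m1)
      if (n > 0 && n < 256) {
         ch := Chr(n)
         ; Splice the decoded character in place of the entity.
         html := SubStr(html, 1, found - 1) . ch . SubStr(html, found + StrLen(m))
         pos := found + StrLen(ch)
      } else {
         ; The "Else" branch: not decodable here, so skip it rather than retry forever.
         pos := found + StrLen(m)
      }
   }
   return html
}

EntityToCode_Sketch(id) {
   ; Abbreviated table, for illustration only (assumption: the full list would go here).
   static names := "quot|34,amp|38,lt|60,gt|62,nbsp|160"
   if (SubStr(id, 1, 1) = "#")          ; numbered entity: drop the # prefix
      return SubStr(id, 2) + 0
   Loop, Parse, names, `,
   {
      sep := InStr(A_LoopField, "|")
      if (SubStr(A_LoopField, 1, sep - 1) == id)   ; == compares case-sensitively
         return SubStr(A_LoopField, sep + 1)
   }
   return -1                             ; unknown name
}
```

With this sketch, DecodeEntities_Sketch("a &lt; b &#8217; c") leaves &#8217; untouched and returns, because the search resumes after the skipped entity. Resuming after each replacement also means a decoded & is not combined with the following text into a new entity.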

Posted: December 22nd, 2009, 1:35 pm
DerRaphael: Thanks! Using unhtml on HTML containing <br> tags gently mutilated my output. Perhaps special attention should be given to the <br> tags to convert them to a newline instead of simply removing them?

Something like this:
Code:
; RegExReplace returns the new string, so assign the result back to txt
txt := RegExReplace(txt, "<br>", "`n")
txt := RegExReplace(txt, "<br/>", "`n")
txt := RegExReplace(txt, "<br />", "`n")


Posted: December 23rd, 2009, 9:38 am
Hey Murp|e,

you can change the function as follows to get the result you want:

Code:
unHTML_StripTags(txt){ ; v1 (w) by derRaphael Oct 2008
   txt := RegExReplace(txt,"<br(( )?\/)?>","`n")
   Return RegExReplace(txt,"<[^>]+>","")
}


The regex is not tested, but it should work.

greets
dR
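
The pattern in the function above matches <br>, <br/> and <br />: the inner group is an optional space and the outer group is an optional space-plus-slash. A slightly more tolerant variant could be sketched as below. This is only an illustration in AutoHotkey v1 syntax: BrToNewline_Sketch is a hypothetical name, and making the match case-insensitive is an assumption about what one might want, not something the original function does.

```autohotkey
; Sketch only: BrToNewline_Sketch is a hypothetical helper, not part of unhtml.ahk.
BrToNewline_Sketch(txt) {
   ; i)   case-insensitive option, so <BR> matches as well
   ; \s*  any amount of whitespace before the slash
   ; /?   an optional self-closing slash
   return RegExReplace(txt, "i)<br\s*/?>", "`n")
}
```

Calling it before the generic tag-stripping step keeps the line breaks, because the remaining tags are removed afterwards and the inserted newlines are plain text by then.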

Posted: December 23rd, 2009, 10:28 am
DerRaphael: Thank you, I'll have to spend some time with the manual to decipher that. Don't you think this is worth adding to the original post? Do you think there are cases where people would not want to replace <br> with `n?

