AutoHotkey Community

Posted: **25 May 2022, 09:55**

The alt description of a link

i.e.

alt="this is an image of a dog"

Posted: **25 May 2022, 11:22**

headers := {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36"
                        . " (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36 Edg/89.0.774.63"}
html := GetHtml("https://play.google.com/store/apps/details?id=com.mydiabetes", headers)
delim := "`n"
MsgBox, % altAttributes := RegExReplace(html, "s).*?((?<=alt="")[^""]+).*?(?=(?1)|$)", "$1" . delim)

GetHtml(url, HeadersArray := "") {
   Whr := ComObjCreate("WinHttp.WinHttpRequest.5.1")
   Whr.Open("GET", url, true)
   for name, value in HeadersArray
      Whr.SetRequestHeader(name, value)
   Whr.Send()
   Whr.WaitForResponse()
   status := Whr.status
   if (status != 200)
      throw "HttpRequest error, status: " . status
   Arr := Whr.responseBody
   pData := NumGet(ComObjValue(arr) + 8 + A_PtrSize)
   length := arr.MaxIndex() + 1
   Return html := StrGet(pData, length, "UTF-8")
}

Posted: **26 May 2022, 02:10**

I was hoping to be able to extract the info from the img tag alt attribute.

Too difficult? In terms of regex, wouldn't this be better?

Code: Select all

(?<=<img[^>])*(?=alt="([^"]+)"[^>]*>)

or 

(?<=<img[^>])*alt="\K([^"]+) 

or 

(?<=<img[^>])*(?:alt.+=")\K([^"]+)

Sometimes the last one leads to catastrophic backtracking, though. Any idea how I can fix it, since it seems the best?
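One possible way around the backtracking, as a sketch only: the `.+` in the last pattern is free to run across the rest of the page, whereas negated character classes such as `[^>]` and `[^""]` cannot leave the current tag or attribute value. The pattern below is an assumption and has not been tested against real pages; the sample `html` string and the variable names `alts`, `m` and `result` are made up for illustration.

```autohotkey
; Hypothetical sample; in practice html would come from GetHtml() or FileRead.
html := "<p><img src=""a.png"" alt=""a dog""><img alt=""a cat"" src=""b.png""></p>"

alts := []
pos := 1
; [^>]*? cannot cross the end of the tag and [^""]* cannot cross the closing quote.
; \K discards everything matched so far, so the reported match is only the alt value.
while (pos := RegExMatch(html, "i)<img\b[^>]*?\salt\s*=\s*""\K[^""]*", m, pos)) {
   alts.Push(m)
   pos += StrLen(m) + 1   ; resume the search after the closing quote
}

; Join the collected values into one newline-separated string.
result := ""
for index, value in alts
   result .= value . "`n"
MsgBox, % result
```

Because the alt value is found through a normal forward match rather than a lookbehind, this should work whether `alt` comes before or after `src`. It still assumes double-quoted attribute values.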

Posted: **26 May 2022, 05:18**

Rikk03 wrote: ↑I was hoping to be able to extract the info from the img tag alt attribute.

My code above does exactly that. Doesn't it work for you?

Posted: **27 May 2022, 02:40**

Hi

Sometimes the HTML is malformed, and I was hoping to take that into account in this particular case. Hence my version.

i.e. <img alt" content="whatever"

Modifying yours to ((?<=alt=""|alt""\scontent="")[^""]+) did the job.

Question: how can I remove duplicates from your example?

Posted: **27 May 2022, 03:26**

My approach is better anyway, trust me.

Posted: **27 May 2022, 04:04**

Modifying yours to ((?<=alt=""|alt""\scontent="")[^""]+) did the job.

Question: how can I remove duplicates from your example?

Posted: **27 May 2022, 04:18**

Never mind ... answered my own question.

Sort, altAttributes, U

Posted: **27 May 2022, 07:29**

It doesn't want to save to a variable for some reason. I'm not looking for a MsgBox result.
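For reference, `Sort` modifies the named variable in place and does not return a value, so the deduplicated list is already stored in `altAttributes` once the command has run. A minimal sketch, assuming a made-up sample list in place of the real RegExReplace result (the names `removed` and `saved` are illustrative only):

```autohotkey
; Hypothetical sample standing in for the RegExReplace output.
altAttributes := "dog`ncat`ndog"

; Sort rewrites altAttributes in place; U removes duplicate lines
; and sets ErrorLevel to the number of items removed.
Sort, altAttributes, U
removed := ErrorLevel

; The variable can now be reused without going through MsgBox.
saved := altAttributes
MsgBox, % "Removed " . removed . " duplicate(s):`n" . saved
```

Note that `Sort` also reorders the lines alphabetically, so the original page order is lost.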

Posted: **27 May 2022, 10:52**

Miracles!

Posted: **27 May 2022, 13:41**

Perhaps I have missed something important here, but why wouldn't this work?

Code: Select all

SetWorkingDir %A_ScriptDir%

UrlDownloadToFile, https://play.google.com/store/apps/details?id=com.mydiabetes, myfile.txt
FileRead, content, myfile.txt
altTexts := [], pos := 1
; Captures the src and alt attributes of each image; it fails if alt comes before src.
while (pos := RegexMatch(content, "<img [^>]*src=""\K([^""]+)[^>]*(?<=alt="")([^""]+)", altText, pos+StrLen(altText))) {
	MsgBox, Image with src="%altText1%" has alt="%altText2%"
	altTexts.push(altText2)
}
; Perhaps do something with altTexts?
; FileDelete, myfile.txt ; Optionally delete the downloaded html file

Posted: **27 May 2022, 18:25**

For me this works, why do you ask? Do you mean, why I don't use UrlDownloadToFile, or what?

Posted: **28 May 2022, 00:29**

teadrinker wrote: ↑
27 May 2022, 18:25
For me this works, why do you ask? Do you mean, why I don't use UrlDownloadToFile, or what?

Yes, since using UrlDownloadToFile means avoiding the GetHtml function and messing around with headers, and gives simpler code for beginners to work with. It is probably a bit slower, though, since it requires writing to the hard drive, and regex isn't beginner-friendly anyway.

I haven't thoroughly tested whether the two approaches are interchangeable, though, since it appears that defining the user agent was necessary in this case.

Posted: **28 May 2022, 04:29**

Descolada wrote: ↑using UrlDownloadToFile means avoiding the use of GetHtml function, messing around with headers

Of course, my only reason to use GetHtml is to avoid having to use UrlDownloadToFile and create the file on disk.

Posted: **13 Jun 2022, 04:34**

@teadrinker When using a YouTube video URL, this fails and returns ALL the HTML. How can we add some error handling?

teadrinker wrote: ↑

25 May 2022, 11:22

Code: Select all

headers := {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36"
                        . " (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36 Edg/89.0.774.63"}
html := GetHtml("https://play.google.com/store/apps/details?id=com.mydiabetes", headers)
delim := "`n"
MsgBox, % altAttributes := RegExReplace(html, "s).*?((?<=alt="")[^""]+).*?(?=(?1)|$)", "$1" . delim)

GetHtml(url, HeadersArray := "") {
   Whr := ComObjCreate("WinHttp.WinHttpRequest.5.1")
   Whr.Open("GET", url, true)
   for name, value in HeadersArray
      Whr.SetRequestHeader(name, value)
   Whr.Send()
   Whr.WaitForResponse()
   status := Whr.status
   if (status != 200)
      throw "HttpRequest error, status: " . status
   Arr := Whr.responseBody
   pData := NumGet(ComObjValue(arr) + 8 + A_PtrSize)
   length := arr.MaxIndex() + 1
   Return html := StrGet(pData, length, "UTF-8")
}

Posted: **13 Jun 2022, 22:01**

Like this:

Code: Select all

altAttributes := RegExReplace(html, "s).*?((?<=alt="")[^""]+).*?(?=(?1)|$)", "$1" . delim, count)
MsgBox, % "altAttributes:" . (count ? "`n`n" . altAttributes : " not found")

Posted: **24 Jun 2022, 02:01**

teadrinker wrote: ↑

13 Jun 2022, 22:01

Like this:

Code: Select all

altAttributes := RegExReplace(html, "s).*?((?<=alt="")[^""]+).*?(?=(?1)|$)", "$1" . delim, count)
MsgBox, % "altAttributes:" . (count ? "`n`n" . altAttributes : " not found")

Hi, OK, but I'm removing duplicates like so:

Code: Select all

Loop, Parse, altAttributes, `n
	altAttributes := (A_Index=1 ? A_LoopField : altAttributes . (InStr("`n" altAttributes
		. "`n", "`n" A_LoopField "`n") ? "" : "`n" A_LoopField ))

How can I combine your error handling with the deduplication most efficiently?

Posted: **24 Jun 2022, 02:53**

Perhaps like this:

Code: Select all

delim := "`n"
altAttributes := RegExReplace(html, "s).*?((?<=alt="")[^""]+).*?(?=(?1)|$)", "$1" . delim, count)
MsgBox, % "altAttributes:" . (count ? "`n`n" . RegExReplace(altAttributes, "`asm)(^\V*\R)(?=.*\1)") : " not found")

Posted: **24 Jun 2022, 04:18**

Yes, that works!

Thx
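An alternative sketch that combines the `count` check with deduplication while keeping the original order: it tracks seen values in an object instead of using a second regex. One possible reason to prefer it is that the back-reference in `(?=.*\1)` is not anchored to the start of a line, so it appears a line such as `cat` could also be dropped when a later line merely ends with it (for example `big cat`). The sample `html` string and the names `seen` and `unique` are assumptions for illustration.

```autohotkey
; Hypothetical sample standing in for the downloaded page.
html := "<img alt=""dog""><img alt=""big dog""><img alt=""dog"">"
delim := "`n"

altAttributes := RegExReplace(html, "s).*?((?<=alt="")[^""]+).*?(?=(?1)|$)", "$1" . delim, count)
if (!count) {
   ; No replacement happened, so altAttributes still holds the whole page.
   MsgBox, altAttributes: not found
   ExitApp
}

seen := {}      ; note: string keys of a v1 object are compared case-insensitively
unique := ""
Loop, Parse, altAttributes, `n
{
   ; Skip the empty field after the trailing delimiter and anything already seen.
   if (A_LoopField = "" || seen.HasKey(A_LoopField))
      continue
   seen[A_LoopField] := true
   unique .= A_LoopField . delim
}
MsgBox, % "altAttributes:`n`n" . unique
```

This keeps the first occurrence of each value, whereas the regex version keeps the last one.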

How to extract url from a webpage?
