How to extract url from a webpage?

Rikk03 · Post by **Rikk03** » 25 May 2022, 09:55

I want to extract the alt description of a link, e.g.:

alt="this is an image of a dog"

teadrinker · Post by **teadrinker** » 25 May 2022, 11:22

headers := {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36"
                        . " (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36 Edg/89.0.774.63"}
html := GetHtml("https://play.google.com/store/apps/details?id=com.mydiabetes", headers)
delim := "`n"
MsgBox, % altAttributes := RegExReplace(html, "s).*?((?<=alt="")[^""]+).*?(?=(?1)|$)", "$1" . delim)

GetHtml(url, HeadersArray := "") {
   Whr := ComObjCreate("WinHttp.WinHttpRequest.5.1")
   Whr.Open("GET", url, true)                          ; true = asynchronous request
   for name, value in HeadersArray
      Whr.SetRequestHeader(name, value)
   Whr.Send()
   Whr.WaitForResponse()                               ; block until the response arrives
   status := Whr.status
   if (status != 200)
      throw "HttpRequest error, status: " . status
   arr := Whr.responseBody                             ; SAFEARRAY of bytes
   pData := NumGet(ComObjValue(arr) + 8 + A_PtrSize)   ; pvData field: pointer to the raw bytes
   length := arr.MaxIndex() + 1                        ; byte count (the array is zero-based)
   Return html := StrGet(pData, length, "UTF-8")       ; decode the bytes as UTF-8
}

Rikk03 · Post by **Rikk03** » 26 May 2022, 02:10

I was hoping to be able to extract the info from the img tag alt attribute.

Too difficult? In terms of regex, wouldn't this be better?

(?<=<img[^>])*(?=alt="([^"]+)"[^>]*>)

or 

(?<=<img[^>])*alt="\K([^"]+) 

or 

(?<=<img[^>])*(?:alt.+=")\K([^"]+)

Sometimes the last one leads to catastrophic backtracking, though. Any idea how I can fix it, since it seems the best?
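
One way to sidestep the backtracking is to drop the lookbehind and the unbounded `.+` (a likely cause of the slowdown) and match the whole `<img ...>` tag with negated character classes, capturing the alt value in a group. Note that a quantifier on a lookbehind, as in `(?<=<img[^>])*`, lets the assertion match zero times, so it probably does not restrict the match to img tags at all. The following is a minimal sketch, assuming attribute values never contain `>`; the sample `html` string and the names `alts` and `m` are made up for illustration.

```autohotkey
; Sample HTML (an assumption for illustration, not the real page).
html := "<a alt=""skip""></a><img src=""a.png"" alt=""a dog""><img alt=""a cat"" src=""b.png"">"

alts := [], pos := 1
; <img\b     only match inside img tags
; [^>]*?     skip other attributes without leaving the tag
; ([^""]*)   capture the alt value up to the closing quote
while (pos := RegExMatch(html, "i)<img\b[^>]*?\balt\s*=\s*""([^""]*)""", m, pos)) {
   alts.Push(m1)          ; m1 holds the captured alt text
   pos += StrLen(m)       ; continue searching after this match
}

for index, alt in alts
   list .= alt . "`n"
MsgBox, % list
```

Because every repeated piece is bounded by `>` or `"`, the engine has little room to backtrack, and the alt attribute may appear before or after src.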

teadrinker · Post by **teadrinker** » 26 May 2022, 05:18

Rikk03 wrote: ↑I was hoping to be able to extract the info from the img tag alt attribute.

My code above exactly does that. Doesn't it work for you?

Rikk03 · Post by **Rikk03** » 27 May 2022, 02:40

Hi

Sometimes the HTML is malformed, and I was hoping to take that into account in this particular case, hence my version.

e.g. <img alt" content="whatever"

Modifying yours to ((?<=alt=""|alt""\scontent="")[^""]+) did the job.

Question: how can I remove duplicates from your example?

teadrinker · Post by **teadrinker** » 27 May 2022, 03:26

My approach is better anyway, trust me.

Rikk03 · Post by **Rikk03** » 27 May 2022, 04:04

Modifying yours to ((?<=alt=""|alt""\scontent="")[^""]+) did the job.

Question: how can I remove duplicates from your example?

Rikk03 · Post by **Rikk03** » 27 May 2022, 04:18

Never mind ... answered my own question.

Sort, altAttributes, U

Rikk03 · Post by **Rikk03** » 27 May 2022, 07:29

It doesn't want to save to a variable for some reason. I'm not looking for a MsgBox return.

teadrinker · Post by **teadrinker** » 27 May 2022, 10:52

Miracles!
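
The original snippet already stores its result, because `MsgBox, % altAttributes := RegExReplace(...)` assigns to `altAttributes` before displaying it. A minimal sketch of the same idea without the MsgBox, combined with the `Sort ... U` deduplication mentioned earlier in the thread; the sample `html` string is an assumption standing in for the downloaded page.

```autohotkey
; Sample HTML (an assumption for illustration, not the real page).
html := "<img src=""a.png"" alt=""dog""><img src=""b.png"" alt=""cat""><img src=""c.png"" alt=""dog"">"
delim := "`n"

; Plain assignment: the result lands in the variable, nothing is displayed.
altAttributes := RegExReplace(html, "s).*?((?<=alt="")[^""]+).*?(?=(?1)|$)", "$1" . delim)

; Drop the trailing delimiter so Sort does not see an empty last item.
altAttributes := RTrim(altAttributes, delim)

; Sort alphabetically and remove duplicate lines (U = unique).
Sort, altAttributes, U
```

Note that `Sort` reorders the list alphabetically, so the original page order is lost.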

Descolada · Post by **Descolada** » 27 May 2022, 13:41

Perhaps I have missed something important here, but why wouldn't this work?

SetWorkingDir %A_ScriptDir%

UrlDownloadToFile, https://play.google.com/store/apps/details?id=com.mydiabetes, myfile.txt
FileRead, content, myfile.txt
altTexts := [], pos := 1
while (pos := RegexMatch(content, "<img [^>]*src=""\K([^""]+)[^>]*(?<=alt="")([^""]+)", altText, pos+StrLen(altText))) { ; Captures the src and alt attributes of each image, but fails if alt comes before src
	MsgBox, Image with src="%altText1%" has alt="%altText2%"
	altTexts.push(altText2)
}
; Perhaps do something with altTexts?
; FileDelete, myfile.txt ; Optionally delete the downloaded html file

teadrinker · Post by **teadrinker** » 27 May 2022, 18:25

For me this works, why do you ask? Do you mean, why I don't use UrlDownloadToFile, or what?

Descolada · Post by **Descolada** » 28 May 2022, 00:29

teadrinker wrote: ↑
27 May 2022, 18:25
For me this works, why do you ask? Do you mean, why I don't use UrlDownloadToFile, or what?

Yes, since using UrlDownloadToFile means avoiding the GetHtml function and messing around with headers, and gives beginners simpler code to work with. It is probably a bit slower, though, since it requires writing to the hard drive, and regex isn't beginner-friendly anyway.

Haven't thoroughly tested if the two approaches are interchangeable though, since it appears that defining the user agent was necessary in this case.

teadrinker · Post by **teadrinker** » 28 May 2022, 04:29

Descolada wrote: ↑using UrlDownloadToFile means avoiding the use of GetHtml function, messing around with headers

Of course, my only reason to use GetHtml is to avoid having to use UrlDownloadToFile and create the file on disk.

Rikk03 · Post by **Rikk03** » 13 Jun 2022, 04:34

@teadrinker When using a YouTube video URL, this fails, returning ALL the HTML. How can we add some error handling?

teadrinker wrote: ↑

25 May 2022, 11:22

headers := {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36"
                        . " (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36 Edg/89.0.774.63"}
html := GetHtml("https://play.google.com/store/apps/details?id=com.mydiabetes", headers)
delim := "`n"
MsgBox, % altAttributes := RegExReplace(html, "s).*?((?<=alt="")[^""]+).*?(?=(?1)|$)", "$1" . delim)

GetHtml(url, HeadersArray := "") {
   Whr := ComObjCreate("WinHttp.WinHttpRequest.5.1")
   Whr.Open("GET", url, true)
   for name, value in HeadersArray
      Whr.SetRequestHeader(name, value)
   Whr.Send()
   Whr.WaitForResponse()
   status := Whr.status
   if (status != 200)
      throw "HttpRequest error, status: " . status
   Arr := Whr.responseBody
   pData := NumGet(ComObjValue(arr) + 8 + A_PtrSize)
   length := arr.MaxIndex() + 1
   Return html := StrGet(pData, length, "UTF-8")
}

teadrinker · Post by **teadrinker** » 13 Jun 2022, 22:01

Like this:

altAttributes := RegExReplace(html, "s).*?((?<=alt="")[^""]+).*?(?=(?1)|$)", "$1" . delim, count)
MsgBox, % "altAttributes:" . (count ? "`n`n" . altAttributes : " not found")
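
The whole page came back because RegExReplace returns the haystack unchanged when the pattern matches nothing; the `count` output variable is what distinguishes that case. As a sketch, the check can be wrapped in a small helper. `ExtractAlts` is a hypothetical name, not part of the thread's code, and the sample strings are made up for illustration.

```autohotkey
; Hypothetical helper: returns the alt values joined by delim,
; or an empty string when the page contains no alt attributes.
ExtractAlts(html, delim := "`n") {
   result := RegExReplace(html, "s).*?((?<=alt="")[^""]+).*?(?=(?1)|$)", "$1" . delim, count)
   ; count = number of replacements; 0 means result is just the untouched html
   return count ? RTrim(result, delim) : ""
}

found := ExtractAlts("<img src=""a.png"" alt=""dog"">")
none  := ExtractAlts("<p>no images here</p>")
MsgBox, % "found: " . found . "`nnone: " . (none = "" ? "(not found)" : none)
```

Since GetHtml throws on a non-200 status, the call to it can additionally be placed inside a `try`/`catch` block if network failures should be reported rather than stop the script.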

Rikk03 · Post by **Rikk03** » 24 Jun 2022, 02:01

teadrinker wrote: ↑

13 Jun 2022, 22:01

Like this:

altAttributes := RegExReplace(html, "s).*?((?<=alt="")[^""]+).*?(?=(?1)|$)", "$1" . delim, count)
MsgBox, % "altAttributes:" . (count ? "`n`n" . altAttributes : " not found")

Hi, OK, but I'm removing duplicates like so:

Loop, Parse, altAttributes, `n
altAttributes := (A_Index=1 ? A_LoopField : altAttributes . (InStr("`n" altAttributes
	. "`n", "`n" A_LoopField "`n") ? "" : "`n" A_LoopField ))

How can I add your error handling to the deduplication most efficiently?

teadrinker · Post by **teadrinker** » 24 Jun 2022, 02:53

Perhaps like this:

delim := "`n"
altAttributes := RegExReplace(html, "s).*?((?<=alt="")[^""]+).*?(?=(?1)|$)", "$1" . delim, count)
MsgBox, % "altAttributes:" . (count ? "`n`n" . RegExReplace(altAttributes, "`asm)(^\V*\R)(?=.*\1)") : " not found")
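
The regex above removes a line whenever its text appears again later, so it keeps the last occurrence of each duplicate, and because `\1` is not anchored to a line start, a line such as `dog` would also be dropped if a later line merely ends with it (for example `bulldog`). If first-occurrence order and whole-line comparison matter, a loop-based alternative can be sketched as follows. `Dedupe` is a hypothetical name, and the sketch assumes a single-character delimiter.

```autohotkey
; Hypothetical helper: keeps the first occurrence of each item, in order.
; Assumes delim is a single character (Loop, Parse treats each character
; of its delimiter parameter as a separate delimiter).
Dedupe(list, delim := "`n") {
   seen := delim                       ; delimiter-wrapped list of kept items
   Loop, Parse, list, %delim%
   {
      if (A_LoopField = "")            ; skip empty items, e.g. after a trailing delimiter
         continue
      ; Wrapping the item in delimiters makes this a whole-item, case-sensitive test.
      if !InStr(seen, delim . A_LoopField . delim, true)
         seen .= A_LoopField . delim
   }
   return Trim(seen, delim)
}

unique := Dedupe("dog`ncat`nbulldog`ndog`n")
MsgBox, % unique
```

This can be applied to `altAttributes` after the `count` check, so the "not found" case is still handled before deduplication.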

Rikk03 · Post by **Rikk03** » 24 Jun 2022, 04:18

Yes, that works!

Thx

AutoHotkey Community
