How to extract url from a webpage?

Get help with using AutoHotkey (v1.1 and older) and its commands and hotkeys
Rikk03
Posts: 192
Joined: 12 Oct 2020, 02:44

Re: How to extract url from a webpage?

Post by Rikk03 » 25 May 2022, 09:55

The alt description of a link, i.e.:

alt="this is an image of a dog"

teadrinker
Posts: 4309
Joined: 29 Mar 2015, 09:41

Re: How to extract url from a webpage?

Post by teadrinker » 25 May 2022, 11:22

Code: Select all

headers := {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36"
                        . " (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36 Edg/89.0.774.63"}
html := GetHtml("https://play.google.com/store/apps/details?id=com.mydiabetes", headers)
delim := "`n"
MsgBox, % altAttributes := RegExReplace(html, "s).*?((?<=alt="")[^""]+).*?(?=(?1)|$)", "$1" . delim)

GetHtml(url, HeadersArray := "") {
   ; Send an asynchronous GET request, then block until the response arrives.
   Whr := ComObjCreate("WinHttp.WinHttpRequest.5.1")
   Whr.Open("GET", url, true)
   for name, value in HeadersArray
      Whr.SetRequestHeader(name, value)
   Whr.Send()
   Whr.WaitForResponse()
   status := Whr.status
   if (status != 200)
      throw "HttpRequest error, status: " . status
   ; responseBody is a SAFEARRAY of bytes: read its data pointer and decode the bytes as UTF-8.
   arr := Whr.responseBody
   pData := NumGet(ComObjValue(arr) + 8 + A_PtrSize)
   length := arr.MaxIndex() + 1
   Return html := StrGet(pData, length, "UTF-8")
}

Post by Rikk03 » 26 May 2022, 02:10

I was hoping to be able to extract the info from the img tag alt attribute.

Too difficult? In terms of regex, wouldn't this be better?

Code: Select all

(?<=<img[^>])*(?=alt="([^"]+)"[^>]*>)

or 

(?<=<img[^>])*alt="\K([^"]+) 

or 

(?<=<img[^>])*(?:alt.+=")\K([^"]+)
Sometimes the last one leads to catastrophic backtracking, though. Any idea how I can fix it, since it seems the best?
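
One way to limit the match to img tags without a lookbehind is to match the tag prefix normally and capture only the alt value. The following is a minimal sketch, assuming the alt value is double-quoted and that no attribute value contains a ">" character. The sample HTML and the helper name GetImgAlts are made up for illustration. Because [^>]* cannot run past the end of the current tag, the regex engine should have far less room to backtrack than with an unbounded .+ pattern.

```autohotkey
; Sketch (AutoHotkey v1): collect alt values from <img> tags only.
; The sample HTML and the helper name GetImgAlts are hypothetical.
html := "<img src=""a.png"" alt=""a dog""><a alt=""not an image""></a><img alt=""a cat"" src=""b.png"">"

list := ""
for index, altText in GetImgAlts(html)
    list .= altText . "`n"
MsgBox, % list

GetImgAlts(html) {
    alts := [], pos := 1
    ; <img, then any attributes inside the same tag, then whitespace and alt="...".
    ; The alt value is captured in subpattern 1 (m1); it may come before or after src.
    while (pos := RegExMatch(html, "i)<img\b[^>]*?\salt=""([^""]*)""", m, pos)) {
        alts.Push(m1)
        pos += StrLen(m)   ; resume the search right after the current match
    }
    return alts
}
```

The alt attribute on the a element in the sample is skipped because the pattern has to start at an img tag.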

Post by teadrinker » 26 May 2022, 05:18

Rikk03 wrote: I was hoping to be able to extract the info from the img tag alt attribute.
My code above does exactly that. Doesn't it work for you?

Post by Rikk03 » 27 May 2022, 02:40

Hi

Sometimes the HTML is malformed, and I was hoping to take that into account in this particular case. Hence my version.

e.g. <img alt" content="whatever"

Modifying yours to ((?<=alt=""|alt""\scontent="")[^""]+) did the job.

Question: how can I remove duplicates from your example?
Last edited by Rikk03 on 27 May 2022, 04:04, edited 1 time in total.

Post by teadrinker » 27 May 2022, 03:26

My approach is better anyway, trust me. 8-)

Post by Rikk03 » 27 May 2022, 04:18

Never mind ... answered my own question.

Sort, altAttributes, U

Post by Rikk03 » 27 May 2022, 07:29

It doesn't want to save to a variable for some reason. I'm not looking for a MsgBox return.
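
For reference, RegExReplace returns its result, so the value can be assigned to a variable without any MsgBox, and Sort then modifies that variable in place. A minimal sketch, using a made-up sample list in place of the real RegExReplace output:

```autohotkey
; Sketch (AutoHotkey v1); the sample list stands in for the RegExReplace result.
altAttributes := "dog`ncat`ndog`nbird`ncat"

; Sort works in place on the named variable (no percent signs around the name).
; U drops duplicate items; the default delimiter is already `n.
Sort, altAttributes, U

; altAttributes now holds the unique lines, sorted alphabetically.
MsgBox, % altAttributes
```

Note that Sort also reorders the list, so the original page order of the alt texts is lost.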


Descolada
Posts: 1099
Joined: 23 Dec 2021, 02:30

Re: How to extract url from a webpage?

Post by Descolada » 27 May 2022, 13:41

Perhaps I have missed something important here, but why wouldn't this work?

Code: Select all

SetWorkingDir %A_ScriptDir%

UrlDownloadToFile, https://play.google.com/store/apps/details?id=com.mydiabetes, myfile.txt
FileRead, content, myfile.txt
altTexts := [], pos := 1
while (pos := RegexMatch(content, "<img [^>]*src=""\K([^""]+)[^>]*(?<=alt="")([^""]+)", altText, pos+StrLen(altText))) { ; Captures the src and alt attributes of each image, but fails if alt comes before src
	MsgBox, Image with src="%altText1%" has alt="%altText2%"
	altTexts.push(altText2)
}
; Perhaps do something with altTexts?
; FileDelete, myfile.txt ; Optionally delete the downloaded html file

Post by teadrinker » 27 May 2022, 18:25

For me this works, why do you ask? Do you mean, why I don't use UrlDownloadToFile, or what?

Post by Descolada » 28 May 2022, 00:29

teadrinker wrote:
27 May 2022, 18:25
For me this works, why do you ask? Do you mean, why I don't use UrlDownloadToFile, or what?
Yes, since using UrlDownloadToFile means avoiding the use of GetHtml function, messing around with headers, and having simpler code for beginners to work with. It is probably a bit slower, though, since it requires writing to the hard drive, and regex isn't beginner-friendly anyway. :) I haven't thoroughly tested whether the two approaches are interchangeable, since it appears that defining the user agent was necessary in this case.

Post by teadrinker » 28 May 2022, 04:29

Descolada wrote: using UrlDownloadToFile means avoiding the use of GetHtml function, messing around with headers
Of course, my only reason to use GetHtml is to avoid having to use UrlDownloadToFile and create the file on disk. :)

Post by Rikk03 » 13 Jun 2022, 04:34

@teadrinker When using a YouTube video URL, your code from 25 May fails, returning ALL of the HTML. How can we add some error handling?

Post by teadrinker » 13 Jun 2022, 22:01

Like this:

Code: Select all

altAttributes := RegExReplace(html, "s).*?((?<=alt="")[^""]+).*?(?=(?1)|$)", "$1" . delim, count)
MsgBox, % "altAttributes:" . (count ? "`n`n" . altAttributes : " not found")

Post by Rikk03 » 24 Jun 2022, 02:01

teadrinker wrote:
13 Jun 2022, 22:01
Like this:

Code: Select all

altAttributes := RegExReplace(html, "s).*?((?<=alt="")[^""]+).*?(?=(?1)|$)", "$1" . delim, count)
MsgBox, % "altAttributes:" . (count ? "`n`n" . altAttributes : " not found")
Hi, OK, but I'm removing duplicates like so:

Code: Select all

Loop, Parse, altAttributes, `n
altAttributes := (A_Index=1 ? A_LoopField : altAttributes . (InStr("`n" altAttributes
	. "`n", "`n" A_LoopField "`n") ? "" : "`n" A_LoopField ))
How can I combine your error handling with the deduplication most efficiently?

Post by teadrinker » 24 Jun 2022, 02:53

Perhaps like this:

Code: Select all

delim := "`n"
altAttributes := RegExReplace(html, "s).*?((?<=alt="")[^""]+).*?(?=(?1)|$)", "$1" . delim, count)
MsgBox, % "altAttributes:" . (count ? "`n`n" . RegExReplace(altAttributes, "`asm)(^\V*\R)(?=.*\1)") : " not found")
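
The regex above removes a line when the same text shows up again later, so the last occurrence of each line is the one that survives. Its lookahead is not anchored to the start of a line, so it may also drop a line that is only the tail of a longer later line (for example "cat" followed later by "bobcat"). Where that matters, a loop with an associative array is an alternative. This is a rough sketch, not the code used in this thread: UniqueLines is a hypothetical helper name, and it keeps the first occurrence of each line instead of the last.

```autohotkey
; Sketch (AutoHotkey v1): keep the first occurrence of each line, preserving order.
; UniqueLines is a hypothetical helper name; the sample list is made up.
list := "dog`ncat`ndog`nbobcat`ncat`n"
MsgBox, % UniqueLines(list)

UniqueLines(text, delim := "`n") {
    seen := {}, result := ""
    Loop, Parse, text, %delim%
    {
        ; Skip empty fields and lines that were already emitted.
        if (A_LoopField = "" || seen.HasKey(A_LoopField))
            continue
        seen[A_LoopField] := true
        result .= A_LoopField . delim
    }
    return result
}
```

String keys of AutoHotkey v1 objects are case-insensitive, so "Dog" and "dog" count as duplicates in this sketch. An empty return value can serve the same purpose as the count check above.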

Post by Rikk03 » 24 Jun 2022, 04:18

Yes, that works!

Thx
