
Re: How to extract url from a webpage?

Posted: 25 May 2022, 09:55
by Rikk03
The alt description of a link, i.e.:

alt="this is an image of a dog"

Re: How to extract url from a webpage?

Posted: 25 May 2022, 11:22
by teadrinker

Code:

headers := {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36"
                        . " (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36 Edg/89.0.774.63"}
html := GetHtml("https://play.google.com/store/apps/details?id=com.mydiabetes", headers)
delim := "`n"
MsgBox, % altAttributes := RegExReplace(html, "s).*?((?<=alt="")[^""]+).*?(?=(?1)|$)", "$1" . delim)

GetHtml(url, HeadersArray := "") {
   Whr := ComObjCreate("WinHttp.WinHttpRequest.5.1")
   Whr.Open("GET", url, true)
   for name, value in HeadersArray
      Whr.SetRequestHeader(name, value)
   Whr.Send()
   Whr.WaitForResponse()
   status := Whr.status
   if (status != 200)
      throw "HttpRequest error, status: " . status
   Arr := Whr.responseBody
   pData := NumGet(ComObjValue(arr) + 8 + A_PtrSize)
   length := arr.MaxIndex() + 1
   Return html := StrGet(pData, length, "UTF-8")
}

Re: How to extract url from a webpage?

Posted: 26 May 2022, 02:10
by Rikk03
I was hoping to be able to extract the info from the img tag alt attribute.

Too difficult? In terms of regex, would not this be better?

Code:

(?<=<img[^>])*(?=alt="([^"]+)"[^>]*>)

or 

(?<=<img[^>])*alt="\K([^"]+) 

or 

(?<=<img[^>])*(?:alt.+=")\K([^"]+)
Sometimes the last one leads to catastrophic backtracking, though. Any idea how I can fix it, since it seems the best?
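
For what it's worth, here is a minimal sketch of one way to restrict the match to img tags without a quantified lookbehind. A lookbehind is zero-width, so the `*` after it does not make it scan through the tag, and the greedy `.+` in the third pattern can run across large parts of the page, which is probably where the backtracking comes from. `GetImgAlts` is a hypothetical helper name, not something from the thread, and the sketch assumes double-quoted attribute values with no `>` inside the tag.

```autohotkey
; Hypothetical helper (not from the thread): collect alt text from <img> tags only.
; Assumes double-quoted attribute values and no ">" inside the tag.
GetImgAlts(html) {
   alts := []
   pos := 1
   ; i)      case-insensitive, so <IMG ALT="..."> also matches
   ; [^>]*?  cannot leave the current tag, which keeps backtracking bounded
   ; \s      requires whitespace before alt, so data-alt="..." is skipped
   ; \K      drops everything matched so far, leaving only the alt value
   needle := "i)<img\b[^>]*?\salt\s*=\s*""\K[^""]*"
   while (pos := RegExMatch(html, needle, alt, pos)) {
      if (alt != "")
         alts.Push(alt)
      pos += StrLen(alt) + 1   ; resume just past the closing quote
   }
   return alts
}
```

Something like `for i, text in GetImgAlts(html)` could then walk the results; empty alt attributes are skipped on purpose.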

Re: How to extract url from a webpage?

Posted: 26 May 2022, 05:18
by teadrinker
Rikk03 wrote: I was hoping to be able to extract the info from the img tag alt attribute.
My code above exactly does that. Doesn't it work for you?

Re: How to extract url from a webpage?

Posted: 27 May 2022, 02:40
by Rikk03
Hi

Sometimes the HTML is malformed, and I was hoping to take that into account in this particular case, hence my version.

i.e. <img alt" content="whatever"

Modifying yours to ((?<=alt=""|alt""\scontent="")[^""]+) did the job.

Question: how can I remove duplicates from your example?

Re: How to extract url from a webpage?

Posted: 27 May 2022, 03:26
by teadrinker
My approach is better anyway, trust me. 8-)

Re: How to extract url from a webpage?

Posted: 27 May 2022, 04:04
by Rikk03
Modifying yours to ((?<=alt=""|alt""\scontent="")[^""]+) did the job.

Question: how can I remove duplicates from your example?

Re: How to extract url from a webpage?

Posted: 27 May 2022, 04:18
by Rikk03
Never mind ... answered my own question.

Sort, altAttributes, U
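
A minimal sketch of that Sort approach wrapped in a function, assuming the list is newline-delimited as in teadrinker's snippet. `UniqueSorted` is a hypothetical name, not something from the thread. Two things to keep in mind: Sort reorders the items alphabetically, and by default the comparison ignores case, so items that differ only in case count as duplicates.

```autohotkey
; Hypothetical helper (not from the thread): sort a newline-delimited list
; and drop duplicate items.
UniqueSorted(list) {
   list := RTrim(list, "`n")   ; drop the trailing delimiter left by the RegExReplace
   Sort, list, U               ; U = unique; the result is in alphabetical order
   return list
}
```

Because Sort works on a variable that already holds the list, the assignment to altAttributes has to happen on an earlier line, for example `altAttributes := UniqueSorted(altAttributes)` after the RegExReplace call.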

Re: How to extract url from a webpage?

Posted: 27 May 2022, 07:29
by Rikk03
It doesn't want to save to a variable for some reason. I'm not looking for a MsgBox return.

Re: How to extract url from a webpage?

Posted: 27 May 2022, 10:52
by teadrinker
Miracles! :shock:

Re: How to extract url from a webpage?

Posted: 27 May 2022, 13:41
by Descolada
Perhaps I have missed something important here, but why wouldn't this work?

Code:

SetWorkingDir %A_ScriptDir%

UrlDownloadToFile, https://play.google.com/store/apps/details?id=com.mydiabetes, myfile.txt
FileRead, content, myfile.txt
altTexts := [], pos := 1
while (pos := RegexMatch(content, "<img [^>]*src=""\K([^""]+)[^>]*(?<=alt="")([^""]+)", altText, pos+StrLen(altText))) { ; Captures the src and alt attributes of each image, but fails if alt comes before src
	MsgBox, Image with src="%altText1%" has alt="%altText2%"
	altTexts.push(altText2)
}
; Perhaps do something with altTexts?
; FileDelete, myfile.txt ; Optionally delete the downloaded html file
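
One possible way around the attribute-order limitation noted in the comment above is to match each img tag first and then pull src and alt out of the tag separately. This is only a sketch: `GetImgSrcAlt` is a hypothetical name, and it assumes double-quoted attribute values and no `>` inside the tag.

```autohotkey
; Hypothetical helper (not from the thread): returns an array of {src, alt}
; objects, one per <img> tag, regardless of attribute order.
GetImgSrcAlt(html) {
   images := []
   pos := 1
   ; Step 1: grab one whole <img ...> tag at a time.
   while (pos := RegExMatch(html, "i)<img\b[^>]*>", tag, pos)) {
      src := "", alt := ""
      ; Step 2: look for each attribute inside that tag only.
      if RegExMatch(tag, "i)\ssrc\s*=\s*""([^""]*)""", m)
         src := m1
      if RegExMatch(tag, "i)\salt\s*=\s*""([^""]*)""", m)
         alt := m1
      images.Push({src: src, alt: alt})
      pos += StrLen(tag)   ; continue after this tag
   }
   return images
}
```

A missing attribute simply comes back as an empty string, so the caller can decide whether to skip such images.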

Re: How to extract url from a webpage?

Posted: 27 May 2022, 18:25
by teadrinker
For me this works, why do you ask? Do you mean, why I don't use UrlDownloadToFile, or what?

Re: How to extract url from a webpage?

Posted: 28 May 2022, 00:29
by Descolada
teadrinker wrote:
27 May 2022, 18:25
For me this works, why do you ask? Do you mean, why I don't use UrlDownloadToFile, or what?
Yes, since using UrlDownloadToFile means avoiding the use of GetHtml function, messing around with headers, and it gives beginners simpler code to work with. It is probably a bit slower, though, since it requires writing to the hard drive, and regex isn't beginner-friendly anyway. :) I haven't thoroughly tested whether the two approaches are interchangeable, since it appears that defining the user agent was necessary in this case.

Re: How to extract url from a webpage?

Posted: 28 May 2022, 04:29
by teadrinker
Descolada wrote: using UrlDownloadToFile means avoiding the use of GetHtml function, messing around with headers
Of course, my only reason to use GetHtml is to avoid having to use UrlDownloadToFile and create the file on disk. :)

Re: How to extract url from a webpage?

Posted: 13 Jun 2022, 04:34
by Rikk03
@teadrinker When using a YouTube video URL, this fails and returns ALL of the HTML. How can we add some error handling?
teadrinker wrote:
25 May 2022, 11:22

Code:

headers := {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36"
                        . " (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36 Edg/89.0.774.63"}
html := GetHtml("https://play.google.com/store/apps/details?id=com.mydiabetes", headers)
delim := "`n"
MsgBox, % altAttributes := RegExReplace(html, "s).*?((?<=alt="")[^""]+).*?(?=(?1)|$)", "$1" . delim)

GetHtml(url, HeadersArray := "") {
   Whr := ComObjCreate("WinHttp.WinHttpRequest.5.1")
   Whr.Open("GET", url, true)
   for name, value in HeadersArray
      Whr.SetRequestHeader(name, value)
   Whr.Send()
   Whr.WaitForResponse()
   status := Whr.status
   if (status != 200)
      throw "HttpRequest error, status: " . status
   Arr := Whr.responseBody
   pData := NumGet(ComObjValue(arr) + 8 + A_PtrSize)
   length := arr.MaxIndex() + 1
   Return html := StrGet(pData, length, "UTF-8")
}

Re: How to extract url from a webpage?

Posted: 13 Jun 2022, 22:01
by teadrinker
Like this:

Code:

altAttributes := RegExReplace(html, "s).*?((?<=alt="")[^""]+).*?(?=(?1)|$)", "$1" . delim, count)
MsgBox, % "altAttributes:" . (count ? "`n`n" . altAttributes : " not found")
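
The count check covers the case where nothing matches (RegExReplace returns the input unchanged when there is no match, which is why the whole page came back). A separate failure mode is the request itself: GetHtml throws when the status is not 200. Below is a minimal sketch of catching that. `TryGetHtml` is a hypothetical wrapper name, and it assumes the GetHtml function from the 25 May post is defined in the same script.

```autohotkey
; Hypothetical wrapper (not from the thread). Assumes GetHtml() from the
; earlier post exists in the same script.
; Returns the page text, or "" with the reason stored in errMsg.
TryGetHtml(url, headers, ByRef errMsg) {
   errMsg := ""
   try {
      return GetHtml(url, headers)
   } catch e {
      ; GetHtml throws a plain string; COM failures arrive as exception objects.
      errMsg := IsObject(e) ? e.Message : e
      return ""
   }
}
```

Usage might look like `html := TryGetHtml(url, headers, err)`, followed by a check of `err` before running the RegExReplace.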

Re: How to extract url from a webpage?

Posted: 24 Jun 2022, 02:01
by Rikk03
teadrinker wrote:
13 Jun 2022, 22:01
Like this:

Code:

altAttributes := RegExReplace(html, "s).*?((?<=alt="")[^""]+).*?(?=(?1)|$)", "$1" . delim, count)
MsgBox, % "altAttributes:" . (count ? "`n`n" . altAttributes : " not found")
Hi, OK, but I'm removing duplicates like so:

Code:

Loop, Parse, altAttributes, `n
altAttributes := (A_Index=1 ? A_LoopField : altAttributes . (InStr("`n" altAttributes
	. "`n", "`n" A_LoopField "`n") ? "" : "`n" A_LoopField ))
How can I most efficiently combine your error handling with the deduplication?

Re: How to extract url from a webpage?

Posted: 24 Jun 2022, 02:53
by teadrinker
Perhaps like this:

Code:

delim := "`n"
altAttributes := RegExReplace(html, "s).*?((?<=alt="")[^""]+).*?(?=(?1)|$)", "$1" . delim, count)
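; The ^ before \1 makes the lookahead compare whole lines, so "dog" is not dropped just because "big dog" follows
MsgBox, % "altAttributes:" . (count ? "`n`n" . RegExReplace(altAttributes, "`asm)(^\V*\R)(?=.*^\1)") : " not found")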
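
The regex deduplication removes a line when the same line appears again later, so it keeps the last occurrence of each duplicate. If keeping the first occurrence matters, a loop with a lookup object is one alternative. This is a sketch only: `UniqueLines` is a hypothetical name, and it relies on AutoHotkey v1 object keys, which are case-insensitive and treat integer-like strings as numbers (so "01" and "1" would count as the same item).

```autohotkey
; Hypothetical helper (not from the thread): remove duplicate items from a
; delimited list while keeping the first occurrence of each, in original order.
UniqueLines(list, delim := "`n") {
   seen := {}
   out := ""
   Loop, Parse, list, %delim%
   {
      ; Skip empty items (such as the one after a trailing delimiter) and repeats.
      if (A_LoopField = "" || seen.HasKey(A_LoopField))
         continue
      seen[A_LoopField] := true
      out .= A_LoopField . delim
   }
   return out
}
```

Combined with the count check, this could be used as `MsgBox, % count ? UniqueLines(altAttributes) : "not found"`.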

Re: How to extract url from a webpage?

Posted: 24 Jun 2022, 04:18
by Rikk03
Yes, that works!

Thx