How to extract url from a webpage?
Re: How to extract url from a webpage?
The alt description of a link, e.g.
alt="this is an image of a dog"
-
- Posts: 4326
- Joined: 29 Mar 2015, 09:41
- Contact:
Re: How to extract url from a webpage?
Code: Select all
headers := {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36"
          . " (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36 Edg/89.0.774.63"}
html := GetHtml("https://play.google.com/store/apps/details?id=com.mydiabetes", headers)
delim := "`n"
MsgBox, % altAttributes := RegExReplace(html, "s).*?((?<=alt="")[^""]+).*?(?=(?1)|$)", "$1" . delim)

GetHtml(url, HeadersArray := "") {
   Whr := ComObjCreate("WinHttp.WinHttpRequest.5.1")
   Whr.Open("GET", url, true)
   for name, value in HeadersArray
      Whr.SetRequestHeader(name, value)
   Whr.Send()
   Whr.WaitForResponse()
   status := Whr.status
   if (status != 200)
      throw "HttpRequest error, status: " . status
   ; responseBody is a SAFEARRAY of bytes; read its data pointer and decode as UTF-8
   Arr := Whr.responseBody
   pData := NumGet(ComObjValue(arr) + 8 + A_PtrSize)
   length := arr.MaxIndex() + 1
   Return html := StrGet(pData, length, "UTF-8")
}
Re: How to extract url from a webpage?
I was hoping to be able to extract the info from the img tag's alt attribute.
Too difficult? In terms of regex, wouldn't one of these be better?
The last one sometimes leads to catastrophic backtracking, though. Any idea how I can fix it, since it seems the best?
Code: Select all
(?<=<img[^>])*(?=alt="([^"]+)"[^>]*>)
or
(?<=<img[^>])*alt="\K([^"]+)
or
(?<=<img[^>])*(?:alt.+=")\K([^"]+)
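A note on the patterns above: as far as I can tell, PCRE treats a quantified lookbehind such as (?<=<img[^>])* as optional, so it does not really tie the match to an img tag, and the .+ in the last pattern is the likely source of the heavy backtracking. A minimal sketch of a different approach, assuming AutoHotkey v1.1 and double-quoted attribute values: walk the HTML with RegExMatch and keep the search inside one tag with [^>]*?. GetImgAlts is a hypothetical helper name, not something from this thread.

```autohotkey
; Hypothetical helper (not from the thread): returns an array with the alt text
; of every <img> tag in html. Only double-quoted alt values are handled.
GetImgAlts(html) {
    alts := [], pos := 1
    ; [^>]*? cannot leave the current tag, so a failed attempt is abandoned quickly
    while (pos := RegExMatch(html, "i)<img\b[^>]*?\salt\s*=\s*""([^""]*)""", m, pos)) {
        alts.Push(m1)      ; m1 holds the captured alt text
        pos += StrLen(m)   ; resume the search right after this match
    }
    return alts
}
```

Calling GetImgAlts(html) on the downloaded page and looping over the result with for index, alt in ... gives one alt text per image. Malformed markup such as <img alt" content="whatever" is not covered by this sketch.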
Re: How to extract url from a webpage?
Hi
Sometimes the HTML is malformed, and I was hoping to take that into account in this particular case; hence my version.
e.g. <img alt" content="whatever"
Modifying yours to ((?<=alt=""|alt""\scontent="")[^""]+) did the job.
Question: how can I remove duplicates from your example?
Last edited by Rikk03 on 27 May 2022, 04:04, edited 1 time in total.
Re: How to extract url from a webpage?
My approach is better anyway, trust me.
Re: How to extract url from a webpage?
Modifying yours to ((?<=alt=""|alt""\scontent="")[^""]+) did the job.
Question: how can I remove duplicates from your example?
Re: How to extract url from a webpage?
Never mind ... answered my own question.
Sort, altAttributes, U
Re: How to extract url from a webpage?
It doesn't want to save to a variable for some reason. I'm not looking for a MsgBox return.
Re: How to extract url from a webpage?
Miracles!
Re: How to extract url from a webpage?
Perhaps I have missed something important here, but why wouldn't this work?
Code: Select all
SetWorkingDir %A_ScriptDir%
UrlDownloadToFile, https://play.google.com/store/apps/details?id=com.mydiabetes, myfile.txt
FileRead, content, myfile.txt
altTexts := [], pos := 1
; Captures the src and alt attributes of each image, although it fails if alt comes before src
while (pos := RegexMatch(content, "<img [^>]*src=""\K([^""]+)[^>]*(?<=alt="")([^""]+)", altText, pos+StrLen(altText))) {
   MsgBox, Image with src="%altText1%" has alt="%altText2%"
   altTexts.push(altText2)
}
; Perhaps do something with altTexts?
; FileDelete, myfile.txt ; Optionally delete the downloaded html file
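The comment in the loop above notes that the pattern fails when alt comes before src. One possible way around that, as a sketch only (assuming AutoHotkey v1.1 and double-quoted attribute values): first match each whole <img ...> tag, then pull src and alt out of that tag separately, so their order no longer matters. GetImages is a hypothetical helper name.

```autohotkey
; Hypothetical helper (not from the thread): returns an array of {src, alt}
; objects, one per <img> tag, regardless of the attribute order.
GetImages(html) {
    images := [], pos := 1
    ; Step 1: grab one complete <img ...> tag at a time
    while (pos := RegExMatch(html, "i)<img\b[^>]*>", tag, pos)) {
        src := "", alt := ""
        ; Step 2: look for each attribute inside this tag only.
        ; \s before the name keeps data-src from being taken for src.
        if (RegExMatch(tag, "i)\ssrc\s*=\s*""([^""]*)""", s))
            src := s1
        if (RegExMatch(tag, "i)\salt\s*=\s*""([^""]*)""", a))
            alt := a1
        images.Push({src: src, alt: alt})
        pos += StrLen(tag)   ; continue after the current tag
    }
    return images
}
```

With content read from the downloaded file, GetImages(content) can replace the while loop; an image without an alt attribute simply gets an empty alt.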
Re: How to extract url from a webpage?
For me this works, why do you ask? Do you mean, why I don't use UrlDownloadToFile, or what?
Re: How to extract url from a webpage?
teadrinker wrote: ↑27 May 2022, 18:25
For me this works, why do you ask? Do you mean, why I don't use UrlDownloadToFile, or what?
Yes, since using UrlDownloadToFile means avoiding the GetHtml function and messing around with headers, and gives simpler code for beginners to work with. It is probably a bit slower, though, since it requires writing to the hard drive, and regex isn't beginner-friendly anyway. I haven't thoroughly tested whether the two approaches are interchangeable, since it appears that defining the user agent was necessary in this case.
Re: How to extract url from a webpage?
@teadrinker When using a YouTube video URL, this fails, returning all of the HTML. How can we add some error handling?
teadrinker wrote: ↑25 May 2022, 11:22
Code: Select all
headers := {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36"
          . " (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36 Edg/89.0.774.63"}
html := GetHtml("https://play.google.com/store/apps/details?id=com.mydiabetes", headers)
delim := "`n"
MsgBox, % altAttributes := RegExReplace(html, "s).*?((?<=alt="")[^""]+).*?(?=(?1)|$)", "$1" . delim)

GetHtml(url, HeadersArray := "") {
   Whr := ComObjCreate("WinHttp.WinHttpRequest.5.1")
   Whr.Open("GET", url, true)
   for name, value in HeadersArray
      Whr.SetRequestHeader(name, value)
   Whr.Send()
   Whr.WaitForResponse()
   status := Whr.status
   if (status != 200)
      throw "HttpRequest error, status: " . status
   Arr := Whr.responseBody
   pData := NumGet(ComObjValue(arr) + 8 + A_PtrSize)
   length := arr.MaxIndex() + 1
   Return html := StrGet(pData, length, "UTF-8")
}
Re: How to extract url from a webpage?
Like this:
Code: Select all
altAttributes := RegExReplace(html, "s).*?((?<=alt="")[^""]+).*?(?=(?1)|$)", "$1" . delim, count)
MsgBox, % "altAttributes:" . (count ? "`n`n" . altAttributes : " not found")
Re: How to extract url from a webpage?
teadrinker wrote: ↑13 Jun 2022, 22:01
Like this:
Code: Select all
altAttributes := RegExReplace(html, "s).*?((?<=alt="")[^""]+).*?(?=(?1)|$)", "$1" . delim, count)
MsgBox, % "altAttributes:" . (count ? "`n`n" . altAttributes : " not found")
Hi, OK, but I'm removing duplicates like so:
Code: Select all
Loop, Parse, altAttributes, `n
   altAttributes := (A_Index=1 ? A_LoopField : altAttributes . (InStr("`n" altAttributes
                  . "`n", "`n" A_LoopField "`n") ? "" : "`n" A_LoopField ))
Re: How to extract url from a webpage?
Perhaps like this:
Code: Select all
delim := "`n"
altAttributes := RegExReplace(html, "s).*?((?<=alt="")[^""]+).*?(?=(?1)|$)", "$1" . delim, count)
MsgBox, % "altAttributes:" . (count ? "`n`n" . RegExReplace(altAttributes, "`asm)(^\V*\R)(?=.*\1)") : " not found")
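One thing to be aware of with the regex above, as far as I can tell: the lookahead (?=.*\1) is not anchored to the start of a line, so a line such as "dog" would also be dropped when a later line merely ends with "dog" (for example "hotdog"), and it is the last copy of each duplicate that survives. A minimal alternative sketch, assuming AutoHotkey v1.1: remember the lines already seen in an object and keep only the first occurrence of each. RemoveDuplicateLines is a hypothetical helper name; like Sort with the U option by default, the comparison is case-insensitive, because v1 object keys are.

```autohotkey
; Hypothetical helper (not from the thread): removes duplicate and empty lines
; from a delimited list, keeping the first occurrence of each line.
RemoveDuplicateLines(list, delim := "`n") {
    seen := {}, out := ""
    Loop, Parse, list, %delim%
    {
        ; The "k" prefix keeps keys from colliding with names such as "base"
        ; or "HasKey" and stops numeric-looking lines from becoming integer keys.
        key := "k" . A_LoopField
        if (A_LoopField = "" || seen.HasKey(key))
            continue
        seen[key] := true
        out .= (out = "" ? "" : delim) . A_LoopField
    }
    return out
}
```

Usage would be altAttributes := RemoveDuplicateLines(altAttributes, delim) right after the RegExReplace call, before showing or storing the result.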
Re: How to extract url from a webpage?
Yes, that works!
Thx