Re: How to extract url from a webpage?
Posted: 25 May 2022, 09:55
The alt description of a link, i.e.
alt="this is an image of a dog"
Let's help each other out
https://www.autohotkey.com/boards/
https://www.autohotkey.com/boards/viewtopic.php?f=76&t=92137
Code: Select all
headers := {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36"
. " (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36 Edg/89.0.774.63"}
html := GetHtml("https://play.google.com/store/apps/details?id=com.mydiabetes", headers)
delim := "`n"
MsgBox, % altAttributes := RegExReplace(html, "s).*?((?<=alt="")[^""]+).*?(?=(?1)|$)", "$1" . delim)
GetHtml(url, HeadersArray := "") {
    Whr := ComObjCreate("WinHttp.WinHttpRequest.5.1")
    Whr.Open("GET", url, true)
    for name, value in HeadersArray
        Whr.SetRequestHeader(name, value)
    Whr.Send()
    Whr.WaitForResponse()
    status := Whr.status
    if (status != 200)
        throw "HttpRequest error, status: " . status
    arr := Whr.responseBody ; SAFEARRAY of bytes
    pData := NumGet(ComObjValue(arr) + 8 + A_PtrSize) ; pvData: pointer to the raw bytes
    length := arr.MaxIndex() + 1 ; byte count (the array is zero-based)
    Return html := StrGet(pData, length, "UTF-8")
}
Code: Select all
(?<=<img[^>])*(?=alt="([^"]+)"[^>]*>)
or
(?<=<img[^>])*alt="\K([^"]+)
or
(?<=<img[^>])*(?:alt.+=")\K([^"]+)
Code: Select all
SetWorkingDir %A_ScriptDir%
UrlDownloadToFile, https://play.google.com/store/apps/details?id=com.mydiabetes, myfile.txt
FileRead, content, myfile.txt
altTexts := [], pos := 1
; Captures the src and alt attributes of each image, although it fails if alt comes before src
while (pos := RegexMatch(content, "<img [^>]*src=""\K([^""]+)[^>]*(?<=alt="")([^""]+)", altText, pos+StrLen(altText))) {
    MsgBox, Image with src="%altText1%" has alt="%altText2%"
    altTexts.push(altText2)
}
; Perhaps do something with altTexts?
; FileDelete, myfile.txt ; Optionally delete the downloaded html file
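The script above notes that it fails when alt comes before src. One way around that, sketched here for AutoHotkey v1 and not thoroughly tested, is to match each whole <img> tag first and then read the two attributes from it separately. The variable names (images, tag, m, list) are only illustrative, and content is assumed to hold the downloaded HTML as in the script above. It also assumes double-quoted attribute values that contain no ">" character.

```autohotkey
; Sketch (AutoHotkey v1): read src and alt from each <img> tag, in either order.
; Assumes "content" already holds the page HTML.
images := [], pos := 1
while (pos := RegExMatch(content, "i)<img\b[^>]*>", tag, pos)) {
    pos += StrLen(tag) ; continue searching after this tag
    src := RegExMatch(tag, "i)\ssrc=""([^""]*)""", m) ? m1 : ""
    alt := RegExMatch(tag, "i)\salt=""([^""]*)""", m) ? m1 : ""
    images.Push({src: src, alt: alt})
}
list := ""
for index, img in images
    list .= index ": src=" img.src " | alt=" img.alt "`n"
MsgBox, % images.Length() ? list : "No <img> tags found"
```

An image that lacks one of the two attributes still ends up in the list, with an empty string for the missing value.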
teadrinker wrote: ↑27 May 2022, 18:25
For me this works, why do you ask? Do you mean, why I don't use UrlDownloadToFile, or what?
Yes, since using UrlDownloadToFile avoids the GetHtml function and messing around with headers, and gives beginners simpler code to work with. It is probably a bit slower, though, since it writes to the hard drive, and regex isn't beginner-friendly anyway. I haven't thoroughly tested whether the two approaches are interchangeable, since it appears that defining the user agent was necessary in this case.
Code: Select all
altAttributes := RegExReplace(html, "s).*?((?<=alt="")[^""]+).*?(?=(?1)|$)", "$1" . delim, count)
MsgBox, % "altAttributes:" . (count ? "`n`n" . altAttributes : " not found")
teadrinker wrote: ↑13 Jun 2022, 22:01
Like this: (the code above)
Hi, OK, but I'm removing duplicates like so:
Code: Select all
Loop, Parse, altAttributes, `n
    altAttributes := (A_Index=1 ? A_LoopField : altAttributes . (InStr("`n" altAttributes
        . "`n", "`n" A_LoopField "`n") ? "" : "`n" A_LoopField ))
Code: Select all
delim := "`n"
altAttributes := RegExReplace(html, "s).*?((?<=alt="")[^""]+).*?(?=(?1)|$)", "$1" . delim, count)
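; The ^ before \1 makes sure only whole lines count as duplicates (otherwise "dog" would be dropped because of a later "hotdog")
MsgBox, % "altAttributes:" . (count ? "`n`n" . RegExReplace(altAttributes, "`asm)(^\V*\R)(?=.*^\1)") : " not found")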
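For anyone who would rather avoid regex for the duplicate removal, here is a rough AutoHotkey v1 sketch that keeps the first occurrence of each line, in the original order, using an object as a "seen" set. The names seen and unique are made up for the example. It assumes altAttributes is the newline-delimited list built above. Note that v1 object keys are case-insensitive, so "Dog" and "dog" are treated as the same entry.

```autohotkey
; Sketch (AutoHotkey v1): order-preserving duplicate removal without regex.
; Assumes altAttributes holds one alt text per line, separated by `n.
seen := {}, unique := ""
Loop, Parse, altAttributes, `n, `r
{
    if (A_LoopField = "" || seen.HasKey(A_LoopField))
        continue ; skip blank lines and entries already collected
    seen[A_LoopField] := true
    unique .= A_LoopField "`n"
}
MsgBox, % "altAttributes:" . (unique != "" ? "`n`n" . unique : " not found")
```

The parsing loop works on a copy of its input, so it would also be safe to write the result back into altAttributes afterwards.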