How to extract url from a webpage?

Rikk03 · Post by **Rikk03** » 03 May 2022, 10:14

On any Google Play app detail page, it directly matches the developer URL.

How is it not right?

teadrinker · Post by **teadrinker** » 03 May 2022, 10:20

Why not link to your example on regex101?

Rikk03 · Post by **Rikk03** » 03 May 2022, 10:21

https://regex101.com/r/R7T03l/1

teadrinker · Post by **teadrinker** » 03 May 2022, 10:26

AHK doesn't use the ECMAScript variant of RegEx; it uses PCRE.

Rikk03 · Post by **Rikk03** » 03 May 2022, 10:36

Then why does the regex work when I load the webpage, press Ctrl+U, and copy the source to the clipboard?

teadrinker · Post by **teadrinker** » 03 May 2022, 10:37

The page source may differ in this case.

Rikk03 · Post by **Rikk03** » 03 May 2022, 10:39

Well, an alternative PCRE version would be appreciated.

teadrinker · Post by **teadrinker** » 03 May 2022, 10:43

teadrinker wrote: ↑ Before creating a RegEx, you need to formulate exactly what it should do.

Until you do that, I can't help.

Rikk03 · Post by **Rikk03** » 03 May 2022, 22:37

Anything wrong with this?

<a href=["]\Khttps?([^""]+)(?="\sclass="hrTbp ">Visit website)

Rikk03 · Post by **Rikk03** » 03 May 2022, 22:42

(?<=<a href=")https?([^""]+)(?="\sclass="hrTbp ">Visit website)
(?<=<a href="")https?([^""]+)(?=""\sclass=""hrTbp "">Visit website)

Hah, I got it working finally!!!!

Worth it though, way faster.
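
The two patterns above differ only in how the double quotes are written: inside an AHK v1 expression string, a literal `"` has to be doubled to `""`. A minimal sketch of how the second form might be used with `RegExMatch` follows. The `html` sample and the `example.com` URL are hypothetical stand-ins for the real page source, which may use different markup.

```autohotkey
; Hypothetical sample of the page source; the real markup may differ.
html := "<a href=""https://example.com/dev"" class=""hrTbp "">Visit website</a>"

; Inside an AHK v1 expression string, every literal " is written as "".
needle := "(?<=<a href="")https?[^""]+(?=""\sclass=""hrTbp "">Visit website)"

; With the default output mode, the whole match is stored in "url".
if RegExMatch(html, needle, url)
   MsgBox, % url
else
   MsgBox, No match found.
```

The lookbehind and lookahead keep the surrounding markup out of the match, so `url` holds only the address itself.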

Rikk03 · Post by **Rikk03** » 09 May 2022, 01:21

One question: does this work in the background, as SendInput would?

It's weird: if I change window focus while the script is running and then return, the window hasn't updated, but then it suddenly seems to fill in and update.

The same thing happens with screen scrolling. The script seems to continue as it should, but if you scroll down while the window is still blank, it suddenly seems to update.

Rikk03 · Post by **Rikk03** » 12 May 2022, 02:49

Hi again,

Question for @teadrinker.

Would html.getElementsByTagName("H2") work? I'd like to place the result in a cell as well. I've tried adding it in various ways without much success.

What would be the best way to add it, considering the script in this thread?

teadrinker · Post by **teadrinker** » 12 May 2022, 12:49

The html code can contain several h2 tags. This example shows how you can get all of them:

```autohotkey
headers := {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36"
                        . " (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36 Edg/89.0.774.63"}
html := GetHtml("https://play.google.com/store/apps/details?id=com.mydiabetes", headers)

; Parse the downloaded markup with the MSHTML document object
Doc := ComObjCreate("HTMLFILE")
Doc.write("<meta http-equiv=""X-UA-Compatible"" content=""IE=edge"">")
Doc.write(html)

; The collection is zero-based, while A_Index starts at 1
coll := Doc.getElementsByTagName("h2")
Loop % coll.length {
   MsgBox, % "outer html: " . coll[A_Index - 1].outerHTML . "`n"
           . "inner text: " . coll[A_Index - 1].innerText
}

GetHtml(url, HeadersArray := "") {
   Whr := ComObjCreate("WinHttp.WinHttpRequest.5.1")
   Whr.Open("GET", url, true)
   for name, value in HeadersArray
      Whr.SetRequestHeader(name, value)
   Whr.Send()
   Whr.WaitForResponse()
   status := Whr.status
   if (status != 200)
      throw "HttpRequest error, status: " . status
   ; responseBody is a SAFEARRAY of bytes; read its data pointer and decode as UTF-8
   arr := Whr.responseBody
   pData := NumGet(ComObjValue(arr) + 8 + A_PtrSize)
   length := arr.MaxIndex() + 1
   Return html := StrGet(pData, length, "UTF-8")
}
```

Rikk03 · Post by **Rikk03** » 13 May 2022, 02:27

I get this as a response:

(Attached screenshot: image.png)

Was this the intent? I'd like to get just the title and not the class info. In fact, what I'd like to do is collect the inner text of all the H2s into a single variable and then put it into a single cell (separated by double spaces).

So how can I get more than just one H2?

MH2 := MetH2[A_Index - 1].innerText

teadrinker · Post by **teadrinker** » 13 May 2022, 05:57

Rikk03 wrote: ↑So how can I get more than just one H2

No idea; for me the previous code gives 6 messages.

Rikk03 · Post by **Rikk03** » 13 May 2022, 06:21

For me too: 6 messages.

The image above was just the first one. My question is how to output the result into a variable rather than a MsgBox.

teadrinker · Post by **teadrinker** » 13 May 2022, 06:36

```autohotkey
headers := {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36"
                        . " (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36 Edg/89.0.774.63"}
html := GetHtml("https://play.google.com/store/apps/details?id=com.mydiabetes", headers)

Doc := ComObjCreate("HTMLFILE")
Doc.write("<meta http-equiv=""X-UA-Compatible"" content=""IE=edge"">")
Doc.write(html)

coll := Doc.getElementsByTagName("h2")
text := ""
Loop % coll.length {
   caption := coll[A_Index - 1].innerText
   ; Skip empty headings and join the rest with a double space
   if (caption != "")
      text .= (text = "" ? "" : "  ") . caption
}
MsgBox, % text

GetHtml(url, HeadersArray := "") {
   Whr := ComObjCreate("WinHttp.WinHttpRequest.5.1")
   Whr.Open("GET", url, true)
   for name, value in HeadersArray
      Whr.SetRequestHeader(name, value)
   Whr.Send()
   Whr.WaitForResponse()
   status := Whr.status
   if (status != 200)
      throw "HttpRequest error, status: " . status
   ; responseBody is a SAFEARRAY of bytes; read its data pointer and decode as UTF-8
   arr := Whr.responseBody
   pData := NumGet(ComObjValue(arr) + 8 + A_PtrSize)
   length := arr.MaxIndex() + 1
   Return html := StrGet(pData, length, "UTF-8")
}
```

Rikk03 · Post by **Rikk03** » 13 May 2022, 06:54

It works.

But man it's really slow!
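
Since the regex approach earlier in the thread was reported as much faster, one possible alternative is to pull the h2 captions straight out of the HTML string with `RegExMatch` instead of building an HTMLFILE document. This is only a sketch under assumptions: it was not suggested in the thread, the `html` sample is hypothetical, and if most of the time is spent downloading the page it will not help much. It also does not decode HTML entities the way `innerText` does.

```autohotkey
; Hypothetical sample; in the real script this would be the string returned by GetHtml().
html := "<h2 class=""a"">About this app</h2><p>x</p><h2><span>Ratings and reviews</span></h2>"

text := ""
pos := 1
; i = case-insensitive, s = dot also matches newlines; m receives the whole match, m1 the first group
while pos := RegExMatch(html, "is)<h2\b[^>]*>(.*?)</h2>", m, pos) {
   ; Strip any nested tags and surrounding whitespace from the heading content
   caption := Trim(RegExReplace(m1, "<[^>]+>"))
   if (caption != "")
      text .= (text = "" ? "" : "  ") . caption
   pos += StrLen(m)   ; continue searching after this match
}
MsgBox, % text
```

The joined `text` value can then be assigned to a cell in the same way as in the DOM-based version.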

Rikk03 · Post by **Rikk03** » 25 May 2022, 00:55

Hi again

I'm struggling to get the Alt tag.

```autohotkey
MetAlt := Doc.getAttribute("ALT")
text := ""
Loop % MetAlt.length {
   caption := MetAlt[A_Index - 1].innerText
   if (caption != "")
      text .= (text = "" ? "" : "  ") . caption
   continue
}
XlSht.Cells(row, col + 15).value := metalt
```

Your help would be appreciated @teadrinker

teadrinker · Post by **teadrinker** » 25 May 2022, 08:52

There is no such tag as alt, and the document object has no alt attribute. Specify exactly what you want to get.
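
If the goal is the alt text of the images on the page (an assumption, since the question does not say so), one way to sketch it is to collect the `img` elements and read each element's `alt` property. The inline `html` sample below is hypothetical and only keeps the example free of network requests.

```autohotkey
; Hypothetical inline HTML; in the real script this would come from GetHtml().
html := "<img src=""a.png"" alt=""App icon""><img src=""b.png""><img src=""c.png"" alt=""Screenshot"">"

Doc := ComObjCreate("HTMLFILE")
Doc.write("<meta http-equiv=""X-UA-Compatible"" content=""IE=edge"">")
Doc.write(html)

; alt is an attribute of individual elements such as img, not of the document itself
imgs := Doc.getElementsByTagName("img")
text := ""
Loop % imgs.length {
   alt := imgs[A_Index - 1].alt
   ; Images without alt text yield an empty string and are skipped
   if (alt != "")
      text .= (text = "" ? "" : "  ") . alt
}
MsgBox, % text
```

To write the result to the worksheet, the string `text` (not the element collection) would be assigned to the cell, for example `XlSht.Cells(row, col + 15).value := text`.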
