How to extract url from a webpage?

Get help with using AutoHotkey (v1.1 and older) and its commands and hotkeys
Rikk03
Posts: 192
Joined: 12 Oct 2020, 02:44

Re: How to extract url from a webpage?

Post by Rikk03 » 03 May 2022, 10:14

On any Google Play app detail page, it directly matches the developer URL.

How is it not right?

teadrinker
Posts: 4309
Joined: 29 Mar 2015, 09:41

Re: How to extract url from a webpage?

Post by teadrinker » 03 May 2022, 10:20

Why not link to your example on regex101?


teadrinker
Posts: 4309
Joined: 29 Mar 2015, 09:41

Re: How to extract url from a webpage?

Post by teadrinker » 03 May 2022, 10:26

AHK doesn't use the ECMAScript flavor of RegEx; it uses PCRE.

Rikk03
Posts: 192
Joined: 12 Oct 2020, 02:44

Re: How to extract url from a webpage?

Post by Rikk03 » 03 May 2022, 10:36

Then why does the regex work when I load the webpage, press Ctrl+U, and copy the source to the clipboard?

teadrinker
Posts: 4309
Joined: 29 Mar 2015, 09:41

Re: How to extract url from a webpage?

Post by teadrinker » 03 May 2022, 10:37

The code may differ in this case.

Rikk03
Posts: 192
Joined: 12 Oct 2020, 02:44

Re: How to extract url from a webpage?

Post by Rikk03 » 03 May 2022, 10:39

Well, an alternative PCRE version would be appreciated. :D

teadrinker
Posts: 4309
Joined: 29 Mar 2015, 09:41

Re: How to extract url from a webpage?

Post by teadrinker » 03 May 2022, 10:43

teadrinker wrote: Before creating a RegEx, you need to formulate exactly what it should do.
Until you do that, I can't help.

Rikk03
Posts: 192
Joined: 12 Oct 2020, 02:44

Re: How to extract url from a webpage?

Post by Rikk03 » 03 May 2022, 22:37

Anything wrong with this?

<a href=["]\Khttps?([^""]+)(?="\sclass="hrTbp ">Visit website)

Rikk03
Posts: 192
Joined: 12 Oct 2020, 02:44

Re: How to extract url from a webpage?

Post by Rikk03 » 03 May 2022, 22:42

(?<=<a href=")https?([^""]+)(?="\sclass="hrTbp ">Visit website)
(?<=<a href="")https?([^""]+)(?=""\sclass=""hrTbp "">Visit website)

Hah, I got it working finally!!!!

Worth it though, way faster.
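
The difference between the two patterns above is the quoting: inside a quoted string in an AHK v1 expression, each literal quotation mark has to be written as two. A minimal sketch of how the working pattern could be passed to RegExMatch follows. The sample markup and the example.com URL are made up for illustration, and the real page source may look different:

```autohotkey
; Hypothetical sample markup standing in for the downloaded page source.
html := "<a href=""https://example.com/dev"" class=""hrTbp "">Visit website</a>"

; In a quoted AHK v1 expression string, every literal " is written as "".
needle := "(?<=<a href="")https?[^""]+(?=""\sclass=""hrTbp "">Visit website)"

; RegExMatch stores the whole match in the output variable (url here).
if (RegExMatch(html, needle, url))
   MsgBox, % "Developer URL: " . url
else
   MsgBox, % "No match found."
```

The lookbehind and lookahead keep the surrounding markup out of the match, so url holds only the address itself.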

Rikk03
Posts: 192
Joined: 12 Oct 2020, 02:44

Re: How to extract url from a webpage?

Post by Rikk03 » 09 May 2022, 01:21

One question: does this work in the background, as SendInput would?

It's weird. If I change window focus while the script is running and then return, the sheet has not updated, but then it suddenly seems to fill in.

The same thing happens with screen scrolling. The script seems to continue as it should, but if I scroll down while the cells are still blank, they suddenly update.

Rikk03
Posts: 192
Joined: 12 Oct 2020, 02:44

Re: How to extract url from a webpage?

Post by Rikk03 » 12 May 2022, 02:49

Hi again,

Question for @teadrinker.

Would html.getElementsByTagName("H2") work? I'd like to place the result in a cell as well. I've tried adding it in various ways without much success.

What would be the best way to add it, considering the script in this thread?

teadrinker
Posts: 4309
Joined: 29 Mar 2015, 09:41

Re: How to extract url from a webpage?

Post by teadrinker » 12 May 2022, 12:49

The html code can contain several h2 tags. This example shows how you can get all of them:

Code: Select all

; Request headers: present the script as a desktop browser
headers := {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36"
                        . " (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36 Edg/89.0.774.63"}
html := GetHtml("https://play.google.com/store/apps/details?id=com.mydiabetes", headers)

; Parse the downloaded markup with an in-memory HTML document
Doc := ComObjCreate("HTMLFILE")
Doc.write("<meta http-equiv=""X-UA-Compatible"" content=""IE=edge"">")
Doc.write(html)

; Show the markup and the text of every h2 element (the collection is zero-based)
coll := Doc.getElementsByTagName("h2")
Loop % coll.length {
   MsgBox, % "outer html: " . coll[A_Index - 1].outerHTML . "`n"
           . "inner text: " . coll[A_Index - 1].innerText
}

; Downloads url and returns the response body decoded as UTF-8
GetHtml(url, HeadersArray := "") {
   Whr := ComObjCreate("WinHttp.WinHttpRequest.5.1")
   Whr.Open("GET", url, true)   ; true = asynchronous request
   for name, value in HeadersArray
      Whr.SetRequestHeader(name, value)
   Whr.Send()
   Whr.WaitForResponse()
   status := Whr.status
   if (status != 200)
      throw "HttpRequest error, status: " . status
   ; responseBody is a SAFEARRAY of bytes: read its data pointer and its length
   arr := Whr.responseBody
   pData := NumGet(ComObjValue(arr) + 8 + A_PtrSize)
   length := arr.MaxIndex() + 1
   Return html := StrGet(pData, length, "UTF-8")
}

Rikk03
Posts: 192
Joined: 12 Oct 2020, 02:44

Re: How to extract url from a webpage?

Post by Rikk03 » 13 May 2022, 02:27

I get this as a response (screenshot: image.png):

Was this the intent? I'd like to get just the title and not the class info. In fact, what I'd like to do is get the inner text of all the H2s into a single variable and then put it into a single cell (separated by double spaces).

So how can I get more than just one H2

MH2 := MetH2[A_Index - 1].innerText

teadrinker
Posts: 4309
Joined: 29 Mar 2015, 09:41

Re: How to extract url from a webpage?

Post by teadrinker » 13 May 2022, 05:57

Rikk03 wrote: So how can I get more than just one H2
No idea; for me the previous code gives 6 messages.

Rikk03
Posts: 192
Joined: 12 Oct 2020, 02:44

Re: How to extract url from a webpage?

Post by Rikk03 » 13 May 2022, 06:21

For me too, 6 messages.

The image above was just the first one. My question is how to output it into a variable rather than a MsgBox.

teadrinker
Posts: 4309
Joined: 29 Mar 2015, 09:41

Re: How to extract url from a webpage?

Post by teadrinker » 13 May 2022, 06:36

Code: Select all

headers := {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36"
                        . " (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36 Edg/89.0.774.63"}
html := GetHtml("https://play.google.com/store/apps/details?id=com.mydiabetes", headers)

Doc := ComObjCreate("HTMLFILE")
Doc.write("<meta http-equiv=""X-UA-Compatible"" content=""IE=edge"">")
Doc.write(html)

; Join the non-empty h2 captions into one string, separated by two spaces
coll := Doc.getElementsByTagName("h2")
text := ""
Loop % coll.length {
   caption := coll[A_Index - 1].innerText
   if (caption != "")
      text .= (text = "" ? "" : "  ") . caption
}
MsgBox, % text

; Downloads url and returns the response body decoded as UTF-8
GetHtml(url, HeadersArray := "") {
   Whr := ComObjCreate("WinHttp.WinHttpRequest.5.1")
   Whr.Open("GET", url, true)
   for name, value in HeadersArray
      Whr.SetRequestHeader(name, value)
   Whr.Send()
   Whr.WaitForResponse()
   status := Whr.status
   if (status != 200)
      throw "HttpRequest error, status: " . status
   arr := Whr.responseBody
   pData := NumGet(ComObjValue(arr) + 8 + A_PtrSize)
   length := arr.MaxIndex() + 1
   Return html := StrGet(pData, length, "UTF-8")
}

Rikk03
Posts: 192
Joined: 12 Oct 2020, 02:44

Re: How to extract url from a webpage?

Post by Rikk03 » 13 May 2022, 06:54

It works, but man, it's really slow!

Rikk03
Posts: 192
Joined: 12 Oct 2020, 02:44

Re: How to extract url from a webpage?

Post by Rikk03 » 25 May 2022, 00:55

Hi again

I'm struggling to get the Alt tag.

Code: Select all

MetAlt := Doc.getAttribute("ALT")
text := ""
Loop % MetAlt.length {
   caption := MetAlt[A_Index - 1].innerText
   if (caption != "")
      text .= (text = "" ? "" : "  ") . caption
   continue
}
XlSht.Cells(row, col + 15).value := metalt

Your help would be appreciated @teadrinker

teadrinker
Posts: 4309
Joined: 29 Mar 2015, 09:41
Contact:

Re: How to extract url from a webpage?

Post by teadrinker » 25 May 2022, 08:52

There is no such tag as alt. Also, the document object has no alt attribute. Specify exactly what you want to get.
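
For illustration only: alt is an attribute of individual elements, most commonly img, so it would be read per element rather than from the document. The sketch below assumes that the goal is the alt text of the page's images, which the thread does not confirm. The sample markup, file names, and captions are hypothetical:

```autohotkey
; Hypothetical sample markup; a real page would come from GetHtml() instead.
html := "<html><body><img src=""a.png"" alt=""First screenshot"">"
      . "<img src=""b.png""><img src=""c.png"" alt=""Second screenshot""></body></html>"

Doc := ComObjCreate("HTMLFILE")
Doc.write("<meta http-equiv=""X-UA-Compatible"" content=""IE=edge"">")
Doc.write(html)

; alt belongs to each img element, so collect the images first
imgs := Doc.getElementsByTagName("img")
text := ""
Loop % imgs.length {
   alt := imgs[A_Index - 1].alt   ; empty when the image has no alt text
   if (alt != "")
      text .= (text = "" ? "" : "  ") . alt
}
MsgBox, % text
```

The joined string in text, not the element collection, is what would then be written to a spreadsheet cell.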
