How to extract url from a webpage?

Get help with using AutoHotkey (v1.1 and older) and its commands and hotkeys
teadrinker
Posts: 4326
Joined: 29 Mar 2015, 09:41

Re: How to extract url from a webpage?

Post by teadrinker » 15 Jul 2021, 13:14

You're really doing something weird.
This should work:

Code: Select all

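F3::
   ; Attach to the running Excel instance and bring its window to the front
   Xl := ComObjActive("Excel.Application")
   WinActivate, % "ahk_id " . Xl.hwnd
   XlSht := Xl.ActiveSheet
   ; Start one row above the active cell, because the loop pre-increments row
   row := Xl.ActiveCell.Row - 1
   col := Xl.ActiveCell.Column
   Loop {
      url := XlSht.Cells(++row, col).value
      ; Stop at the first cell that does not hold an http(s) URL (e.g. an empty cell)
      if !(url ~= "^https?://") {
         MsgBox, Incorrect url
         Return
      }
      ; Skip rows whose page cannot be downloaded
      try html := GetHtml(url)
      catch
         continue
      ; \K discards everything matched so far, so only the URL itself is returned
      if !RegExMatch(html, "id=""websiteurlfecth""[^>]+?value=""\Khttp[^""]+", extractedUrl) {
         MsgBox, 4, % " ", Can't extract url. Continue?
         IfMsgBox, No
            Return
         continue
      }
      ; Write the extracted URL into the cell to the right of the source URL
      XlSht.Cells(row, col + 1).value := extractedUrl
   }
   Return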

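GetHtml(url) {
   Whr := ComObjCreate("WinHttp.WinHttpRequest.5.1")
   ; Open asynchronously (third parameter true), then block until the response arrives
   Whr.Open("GET", url, true)
   Whr.Send()
   Whr.WaitForResponse()
   status := Whr.status
   if (status != 200)
      throw "HttpRequest error, status: " . status
   ; responseBody is a SAFEARRAY of bytes; read its data pointer (pvData)
   ; and decode the raw bytes as UTF-8
   Arr := Whr.responseBody
   pData := NumGet(ComObjValue(Arr) + 8 + A_PtrSize)
   length := Arr.MaxIndex() + 1
   Return html := StrGet(pData, length, "UTF-8")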
}
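
To see what the RegEx in the script above captures, it can be tried on its own against a small string. This is only a sketch: the HTML fragment below is made up for illustration (the real page markup may differ), and example.com is a placeholder.

```autohotkey
#NoEnv
; Hypothetical sample input, not taken from any real page.
html := "<input type=""hidden"" id=""websiteurlfecth"" name=""website"" value=""https://example.com/page"">"

; Same pattern as in the F3 script:
;   id="websiteurlfecth"   the element to look for
;   [^>]+?                 any other attributes inside the same tag
;   value="\K              \K drops everything matched up to this point
;   http[^"]+              the URL, up to the closing quote
if RegExMatch(html, "id=""websiteurlfecth""[^>]+?value=""\Khttp[^""]+", extractedUrl)
   MsgBox, % extractedUrl   ; shows https://example.com/page
else
   MsgBox, No match
```

Because the pattern is tied to one specific element id, it only fits pages that contain that element.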

Rikk03
Posts: 192
Joined: 12 Oct 2020, 02:44

Re: How to extract url from a webpage?

Post by Rikk03 » 02 May 2022, 23:49

@teadrinker

I can't get this working (extracting links) on the Android Play Store. Can a website block this kind of request?

I can extract other text data. For example, I have been able to fill additional cells per row, such as the address.

An example page:

https://play.google.com/store/apps/details?id=com.mydiabetes
Last edited by Rikk03 on 03 May 2022, 05:06, edited 2 times in total.

Rikk03
Posts: 192
Joined: 12 Oct 2020, 02:44

Re: How to extract url from a webpage?

Post by Rikk03 » 03 May 2022, 05:00

I find that if I load the pages in a browser and then copy the source code to the clipboard via Ctrl+U (Google Chrome), my RegEx can extract the link I require. The problem is that I want it to work in the background, and my method takes twice as long because it has to load the browser and go through a bunch of steps.

It seems that whether your script can extract the links really depends on how the pages load. It does not seem to work in my case, which makes me doubt whether your method is reliable.

teadrinker
Posts: 4326
Joined: 29 Mar 2015, 09:41

Re: How to extract url from a webpage?

Post by teadrinker » 03 May 2022, 05:32

Check if this code returns something for you:

Code: Select all

MsgBox, % Clipboard := GetHtml("https://play.google.com/store/apps/details?id=com.mydiabetes")

GetHtml(url) {
   Whr := ComObjCreate("WinHttp.WinHttpRequest.5.1")
   Whr.Open("GET", url, true)
   Whr.Send()
   Whr.WaitForResponse()
   status := Whr.status
   if (status != 200)
      throw "HttpRequest error, status: " . status
   Arr := Whr.responseBody
   pData := NumGet(ComObjValue(arr) + 8 + A_PtrSize)
   length := arr.MaxIndex() + 1
   Return html := StrGet(pData, length, "UTF-8")
}

Rikk03
Posts: 192
Joined: 12 Oct 2020, 02:44

Re: How to extract url from a webpage?

Post by Rikk03 » 03 May 2022, 06:17

Yes, it does, but perhaps the URLs aren't in it. As mentioned, I am able to scrape the address using your code with a different RegEx.

What could be the problem?

Lazy loading, URLs stored within CSS, etc.?

Or perhaps this is a character length issue? I know 5000 characters is a magic number for AHK (unlikely to be the cause, since I can scrape the address).

I'm after the developer link at the bottom of the details page.

Does your method provide everything you would see in Google Chrome with Ctrl+U?
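
One way to answer that question directly is to save what the script receives and compare it with the browser's Ctrl+U source in a diff tool or text editor. The sketch below assumes the GetHtml function posted earlier in the thread; the output file name is a hypothetical choice. Both views show the HTML as sent by the server, without anything added later by JavaScript, so any difference would come from the request itself (headers, cookies, and so on).

```autohotkey
#NoEnv
url := "https://play.google.com/store/apps/details?id=com.mydiabetes"
outFile := A_Temp . "\playstore_winhttp.html"   ; hypothetical file name

try
   html := GetHtml(url)
catch e {
   ; GetHtml throws a string; COM failures throw an exception object
   MsgBox, % "Request failed: " . (IsObject(e) ? e.Message : e)
   ExitApp
}

FileDelete, % outFile                     ; remove the result of a previous run, if any
FileAppend, % html, % outFile, UTF-8      ; write the complete response to disk
MsgBox, % "Saved " . StrLen(html) . " characters to`n" . outFile

; Same function as posted earlier in the thread.
GetHtml(url) {
   Whr := ComObjCreate("WinHttp.WinHttpRequest.5.1")
   Whr.Open("GET", url, true)
   Whr.Send()
   Whr.WaitForResponse()
   status := Whr.status
   if (status != 200)
      throw "HttpRequest error, status: " . status
   Arr := Whr.responseBody
   pData := NumGet(ComObjValue(Arr) + 8 + A_PtrSize)
   length := Arr.MaxIndex() + 1
   Return StrGet(pData, length, "UTF-8")
}
```

Writing to a file avoids judging the result from a MsgBox, which is not a reliable way to inspect a very long string.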

teadrinker
Posts: 4326
Joined: 29 Mar 2015, 09:41

Re: How to extract url from a webpage?

Post by teadrinker » 03 May 2022, 06:56

Rikk03 wrote: I'm after the developer link at the bottom of the details page
At least this link I see:
 
[Image]

Rikk03
Posts: 192
Joined: 12 Oct 2020, 02:44

Re: How to extract url from a webpage?

Post by Rikk03 » 03 May 2022, 07:03

Yeah, that's not the one. The link I need is located after "Developer".
[Attachment: image.png (13.48 KiB)]
I compared the two, and it's definitely missing from the output generated with your script.

Rikk03
Posts: 192
Joined: 12 Oct 2020, 02:44

Re: How to extract url from a webpage?

Post by Rikk03 » 03 May 2022, 07:11

[Attachment: image.png (64.05 KiB)]

teadrinker
Posts: 4326
Joined: 29 Mar 2015, 09:41

Re: How to extract url from a webpage?

Post by teadrinker » 03 May 2022, 07:31

Try adding the User-Agent header:

Code: Select all

headers := {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36"
                        . " (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36 Edg/89.0.774.63"}
MsgBox, % Clipboard := WebRequest("https://play.google.com/store/apps/details?id=com.mydiabetes",, headers,, error := "")
if error
   MsgBox, % error

WebRequest(url, method := "GET", HeadersArray := "", body := "", ByRef error := "") {
   Whr := ComObjCreate("WinHttp.WinHttpRequest.5.1")
   Whr.Open(method, url, true)
   for name, value in HeadersArray
      Whr.SetRequestHeader(name, value)
   Whr.Send(body)
   Whr.WaitForResponse()
   status := Whr.status
   if (status != 200)
      error := "HttpRequest error, status: " . status
   Arr := Whr.responseBody
   pData := NumGet(ComObjValue(arr) + 8 + A_PtrSize)
   length := Arr.MaxIndex() + 1
   Return StrGet(pData, length, "UTF-8")
}
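
To check whether the User-Agent header actually changes what the server sends, the same URL can be requested twice and the two responses compared. This is a rough sketch, not part of the posted solution: FetchHtml is a hypothetical helper modelled on the functions in this thread, and comparing lengths only hints that the markup differs without showing where.

```autohotkey
#NoEnv
url := "https://play.google.com/store/apps/details?id=com.mydiabetes"
; Same User-Agent string as in the post above
headers := {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36"
                        . " (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36 Edg/89.0.774.63"}

try {
   plain  := FetchHtml(url)            ; no extra headers
   withUa := FetchHtml(url, headers)   ; with the browser-like User-Agent
} catch e {
   MsgBox, % "Request failed: " . (IsObject(e) ? e.Message : e)
   ExitApp
}
MsgBox, % "Without User-Agent: " . StrLen(plain) . " characters"
        . "`nWith User-Agent: " . StrLen(withUa) . " characters"

; Hypothetical helper: GET a URL with optional request headers, return the body as UTF-8 text.
FetchHtml(url, HeadersArray := "") {
   Whr := ComObjCreate("WinHttp.WinHttpRequest.5.1")
   Whr.Open("GET", url, true)
   for name, value in HeadersArray     ; does nothing when no headers object is passed
      Whr.SetRequestHeader(name, value)
   Whr.Send()
   Whr.WaitForResponse()
   if (Whr.status != 200)
      throw "HttpRequest error, status: " . Whr.status
   Arr := Whr.responseBody
   pData := NumGet(ComObjValue(Arr) + 8 + A_PtrSize)
   Return StrGet(pData, Arr.MaxIndex() + 1, "UTF-8")
}
```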

Rikk03
Posts: 192
Joined: 12 Oct 2020, 02:44

Re: How to extract url from a webpage?

Post by Rikk03 » 03 May 2022, 07:43

Yes that worked.

Can you alter your original script to include the headers array?

I mean, rework the F3 script to use this new WebRequest function. I based my script on it, so it would really help me.

teadrinker
Posts: 4326
Joined: 29 Mar 2015, 09:41

Re: How to extract url from a webpage?

Post by teadrinker » 03 May 2022, 07:51

Code: Select all

headers := {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36"
                        . " (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36 Edg/89.0.774.63"}
MsgBox, % Clipboard := GetHtml("https://play.google.com/store/apps/details?id=com.mydiabetes", headers)

GetHtml(url, HeadersArray := "") {
   Whr := ComObjCreate("WinHttp.WinHttpRequest.5.1")
   Whr.Open("GET", url, true)
   for name, value in HeadersArray
      Whr.SetRequestHeader(name, value)
   Whr.Send()
   Whr.WaitForResponse()
   status := Whr.status
   if (status != 200)
      throw "HttpRequest error, status: " . status
   Arr := Whr.responseBody
   pData := NumGet(ComObjValue(arr) + 8 + A_PtrSize)
   length := arr.MaxIndex() + 1
   Return html := StrGet(pData, length, "UTF-8")
}

Rikk03
Posts: 192
Joined: 12 Oct 2020, 02:44

Re: How to extract url from a webpage?

Post by Rikk03 » 03 May 2022, 08:17

"can't extract url"

Rikk03
Posts: 192
Joined: 12 Oct 2020, 02:44

Re: How to extract url from a webpage?

Post by Rikk03 » 03 May 2022, 08:19

Code: Select all

F3::
   Xl := ComObjActive("Excel.Application")
   WinActivate, % "ahk_id " . Xl.hwnd
   XlSht := Xl.ActiveSheet
   row := Xl.ActiveCell.Row - 1
   col := Xl.ActiveCell.Column
   Loop {
      url := XlSht.Cells(++row, col).value
      if !(url ~= "^https?://") {
         MsgBox, Incorrect url
         Return
      }
      headers := {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36"
                              . " (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36 Edg/89.0.774.63"}
      try html := GetHtml(url)
      catch
         continue
      if !RegExMatch(html, "id=""websiteurlfecth""[^>]+?value=""\Khttp[^""]+", extractedUrl) {
         MsgBox, 4, % " ", Can't extract url. Continue?
         IfMsgBox, No
            Return
         continue
      }
      XlSht.Cells(row, col + 1).value := extractedUrl
   }
   Return

GetHtml(url, HeadersArray := "") {
   Whr := ComObjCreate("WinHttp.WinHttpRequest.5.1")
   Whr.Open("GET", url, true)
   for name, value in HeadersArray
      Whr.SetRequestHeader(name, value)
   Whr.Send()
   Whr.WaitForResponse()
   status := Whr.status
   if (status != 200)
      throw "HttpRequest error, status: " . status
   Arr := Whr.responseBody
   pData := NumGet(ComObjValue(arr) + 8 + A_PtrSize)
   length := arr.MaxIndex() + 1
   Return html := StrGet(pData, length, "UTF-8")
}
What am I missing?

I also tried it with try html := GetHtml(url, headers).

No joy.
Last edited by Rikk03 on 03 May 2022, 08:32, edited 1 time in total.

teadrinker
Posts: 4326
Joined: 29 Mar 2015, 09:41

Re: How to extract url from a webpage?

Post by teadrinker » 03 May 2022, 08:30

Code: Select all

RegExMatch(html, "id=""websiteurlfecth""[^>]+?value=""\Khttp[^""]+", extractedUrl)
This RegEx is not universal. You have to create a new one depending on which URLs you need to extract.

Rikk03
Posts: 192
Joined: 12 Oct 2020, 02:44

Re: How to extract url from a webpage?

Post by Rikk03 » 03 May 2022, 08:33

Yeah, I tried my RegEx. No joy. Anything else?

Code: Select all

Developer.*\Khttps?:\/\/(?!(.+google|.+sandbox|gstatic|.+ggpht|ssl|.+gstatic|.+broofa))[^""]+(?=.+Visit)

Rikk03
Posts: 192
Joined: 12 Oct 2020, 02:44

Re: How to extract url from a webpage?

Post by Rikk03 » 03 May 2022, 09:02

Code: Select all

F3::
   Xl := ComObjActive("Excel.Application")
   WinActivate, % "ahk_id " . Xl.hwnd
   XlSht := Xl.ActiveSheet
   row := Xl.ActiveCell.Row - 1
   col := Xl.ActiveCell.Column
   Loop {
      url := XlSht.Cells(++row, col).value
      if !(url ~= "^https?://") {
         MsgBox, Incorrect url
         Return
      }
      try html := GetHtml(url)
      catch
         continue
      if !RegExMatch(html, "Developer.*\Khttps?://(?!(.+google|.+sandbox|gstatic|.+ggpht|ssl|.+gstatic|.+broofa))[^""]+(?=.+Visit)", extractedUrl) {
         MsgBox, 4, % " ", Can't extract url. Continue?
         IfMsgBox, No
            Return
         continue
      }
      XlSht.Cells(row, col + 1).value := extractedUrl
   }
   Return

GetHtml(url, HeadersArray := "") {
   Whr := ComObjCreate("WinHttp.WinHttpRequest.5.1")
   Whr.Open("GET", url, true)
   for name, value in HeadersArray
      Whr.SetRequestHeader(name, value)
   Whr.Send()
   Whr.WaitForResponse()
   status := Whr.status
   if (status != 200)
      throw "HttpRequest error, status: " . status
   Arr := Whr.responseBody
   pData := NumGet(ComObjValue(arr) + 8 + A_PtrSize)
   length := arr.MaxIndex() + 1
   Return html := StrGet(pData, length, "UTF-8")
}

teadrinker
Posts: 4326
Joined: 29 Mar 2015, 09:41

Re: How to extract url from a webpage?

Post by teadrinker » 03 May 2022, 09:16

Before creating a RegEx, you need to formulate exactly what it should do.

Rikk03
Posts: 192
Joined: 12 Oct 2020, 02:44

Re: How to extract url from a webpage?

Post by Rikk03 » 03 May 2022, 09:23

Yeah, it does exactly what I want in regex101.

It returns exactly one result every time.

Code: Select all

RegExMatch(html, "Developer.*\Khttps?://(?!(.+google|.+sandbox|gstatic|.+ggpht|ssl|.+gstatic|.+broofa))[^""]+(?=.+Visit)", extractedUrl)
It works perfectly! Just not in your script. :headwall:
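
One thing worth ruling out, offered as a guess rather than a confirmed diagnosis: regex101 was tested against source copied from the browser, while the F3 script posted at 09:02 calls GetHtml(url) without the headers argument, so the two tests may be running against different HTML. The sketch below applies the same RegEx, unchanged, to HTML fetched with the User-Agent header. It also reports ErrorLevel, which RegExMatch sets to a non-zero value when the pattern fails to compile or when PCRE hits an execution limit.

```autohotkey
#NoEnv
url := "https://play.google.com/store/apps/details?id=com.mydiabetes"
; Same User-Agent string as posted earlier in the thread
headers := {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36"
                        . " (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36 Edg/89.0.774.63"}
; Pattern copied unchanged from the post above
needle := "Developer.*\Khttps?://(?!(.+google|.+sandbox|gstatic|.+ggpht|ssl|.+gstatic|.+broofa))[^""]+(?=.+Visit)"

try
   html := GetHtml(url, headers)   ; note the second argument
catch e {
   MsgBox, % "Request failed: " . (IsObject(e) ? e.Message : e)
   ExitApp
}

pos := RegExMatch(html, needle, extractedUrl)
regexError := ErrorLevel           ; 0 means the pattern ran without error
MsgBox, % "HTML length: " . StrLen(html)
        . "`nMatch position: " . pos
        . "`nErrorLevel: " . regexError
        . "`nMatch: " . extractedUrl

; Same function as posted earlier in the thread.
GetHtml(url, HeadersArray := "") {
   Whr := ComObjCreate("WinHttp.WinHttpRequest.5.1")
   Whr.Open("GET", url, true)
   for name, value in HeadersArray
      Whr.SetRequestHeader(name, value)
   Whr.Send()
   Whr.WaitForResponse()
   status := Whr.status
   if (status != 200)
      throw "HttpRequest error, status: " . status
   Arr := Whr.responseBody
   pData := NumGet(ComObjValue(Arr) + 8 + A_PtrSize)
   length := Arr.MaxIndex() + 1
   Return StrGet(pData, length, "UTF-8")
}
```

A match position of 0 together with an ErrorLevel of 0 would mean the pattern simply did not match this HTML. A non-zero ErrorLevel would point at the pattern itself.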

teadrinker
Posts: 4326
Joined: 29 Mar 2015, 09:41

Re: How to extract url from a webpage?

Post by teadrinker » 03 May 2022, 10:06

teadrinker wrote: Before creating a RegEx, you need to formulate exactly what it should do.
Your RegEx looks weird. It doesn't look like it can work properly.

Rikk03
Posts: 192
Joined: 12 Oct 2020, 02:44

Re: How to extract url from a webpage?

Post by Rikk03 » 03 May 2022, 10:11

Well, it does.

Using regex101:

Code: Select all

https?:\/\/(?!(.+google|.+sandbox|gstatic|.+ggpht|ssl|.+gstatic|.+broofa))[^""]+(?=.+Visit)
Works perfectly.

Weird in what way?
