Remove Duplicates

Get help with using AutoHotkey (v1.1 and older) and its commands and hotkeys
Rikk03
Posts: 192
Joined: 12 Oct 2020, 02:44

Remove Duplicates

06 Jun 2021, 02:08

Hi, I'm trying the following code, which uses a regex to find links in page source. The regex itself works.

Code: Select all

dups := []
Pos=1
try While Pos:= RegExMatch(bata, pattern,m,Pos+StrLen(m))
{
    cou:= a_index `t
    for i, v in dups {
        if (v = m)
            continue, 2
    }
    extlinks .= m "`n"
    dups.push(m)
    eztlinks:= Rtrim(extlinks)
    eztlinks:= StrReplace(eztlinks, ",")
    FileAppend, %cou%`,%eztlinks%,%A_Desktop%\gsresults2021.csv
}
However, while some duplicates are removed, others are not. How can I make the duplicate removal more thorough? To explain a little further, this is for extracting Google search results. I want the variable "cou" to reflect the actual rank as the loop works through the page source. I find that each link appears 4 times in the source for each result listing. Perhaps all I need is to return the first?

Thoughts? How can I return just the first link for the search result at each position (instead of 4 links for each)?
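
One thing that stands out in the loop above is that FileAppend runs on every pass and writes the whole accumulated list, so earlier links land in the CSV again each time a new one is found. A minimal sketch of an alternative, assuming bata and pattern are set earlier exactly as in the question and that pattern is used without the O) option (so m is a plain string); seen, rank and row are names made up for this sketch:

```autohotkey
; Sketch only: bata and pattern are the asker's variables, assumed to be set earlier.
seen := {}     ; links already written (object keys compare case-insensitively in v1)
rank := 0      ; advances only when a new link is found
pos := 1
while pos := RegExMatch(bata, pattern, m, pos)
{
    ; Step past the current match; move at least one character so an empty match cannot stall the loop.
    pos := pos + (StrLen(m) ? StrLen(m) : 1)
    if seen.HasKey(m)
        continue
    seen[m] := true
    rank++
    ; Write one CSV row per new link instead of the whole list each time.
    row := rank . "," . StrReplace(m, ",") . "`n"
    FileAppend, %row%, %A_Desktop%\gsresults2021.csv
}
```

Because the counter is tied to new links rather than to A_Index, repeated matches should not inflate it.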
Last edited by Rikk03 on 06 Jun 2021, 08:46, edited 1 time in total.
braunbaer
Posts: 478
Joined: 22 Feb 2016, 10:49

Re: Remove Duplicates

06 Jun 2021, 03:34

Without more information, it is difficult to tell where the problem lies.
At first glance, I would expect duplicates to be removed. Where are the links listed 4 times?

Maybe you want to post a complete example to reproduce:
What is bata, what is pattern, what is the expected result and what is your actual result?
Rikk03
Posts: 192
Joined: 12 Oct 2020, 02:44

Re: Remove Duplicates

06 Jun 2021, 08:46

updated to show output

For example, the output provides both

https://www.artefact.com/digital-marketing-3/e-retail-e-commerce/

AND

https://www.artefact.com/

The latter is not needed. So if a link on a domain has already been captured, the bare domain should not be listed as well.

They are not considered duplicates, I guess. So how can I solve it?
teadrinker
Posts: 4344
Joined: 29 Mar 2015, 09:41

Re: Remove Duplicates

06 Jun 2021, 10:40

An example:

Code: Select all

url := "https://www.autohotkey.com/boards/viewtopic.php?f=76&t=91362"
; Download the page synchronously
Whr := ComObjCreate("WinHttp.WinHttpRequest.5.1")
Whr.Open("GET", url, false)
Whr.Send()
status := Whr.status
if (status != 200)
   throw "HttpRequest error, status: " . status

; responseBody is a SAFEARRAY of bytes; read its data pointer and decode as UTF-8
Arr := Whr.responseBody
pData := NumGet(ComObjValue(arr) + 8 + A_PtrSize)
length := arr.MaxIndex() + 1
html := StrGet(pData, length, "UTF-8")

; Keep only the first link seen for each scheme + host
links := m := "", Domains := {}
while RegExMatch(html, "isO)""\Khttps?://[^""]+", m, m ? m.Pos + m.Len : 1) {
   link := m[0]
   RegExMatch(link, "https?://[^/:]+", domain)   ; e.g. https://www.autohotkey.com
   if !Domains.HasKey(domain) {
      Domains[domain] := ""
      links .= link . "`n"
   }
}
MsgBox, % RTrim(links, "`n")
Rikk03
Posts: 192
Joined: 12 Oct 2020, 02:44

Re: Remove Duplicates

06 Jun 2021, 14:49

Wow,

For the most part it worked! I'm using 2 very different regexes and it still worked.

Question: how would you add a count to your example? Let's assume you are looping through an HTML source file extracting links. The count would be for approximate rank.
teadrinker
Posts: 4344
Joined: 29 Mar 2015, 09:41

Re: Remove Duplicates

06 Jun 2021, 15:29

I'd add i := A_Index into the while loop.
Rikk03
Posts: 192
Joined: 12 Oct 2020, 02:44

Re: Remove Duplicates

06 Jun 2021, 23:28

Yes, it does add a count, but it is a little too approximate because it counts the duplicates too.
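
A_Index advances on every pass of the while loop, including passes that hit an already-seen domain. A sketch of one way around that, keeping a separate counter that is only incremented inside the "new domain" branch. The sample html string and the name rank are made up for illustration; the rest follows teadrinker's example:

```autohotkey
; Made-up sample input for illustration.
html := "<a href=""https://a.example/page1""></a>"
html .= "<a href=""https://a.example/""></a>"
html .= "<a href=""https://b.example/x""></a>"

links := m := "", Domains := {}, rank := 0
while RegExMatch(html, "isO)""\Khttps?://[^""]+", m, m ? m.Pos + m.Len : 1) {
   link := m[0]
   RegExMatch(link, "https?://[^/:]+", domain)
   if !Domains.HasKey(domain) {
      Domains[domain] := ""
      rank++                                  ; counts kept links only
      links .= rank . "`t" . link . "`n"
   }
}
MsgBox, % RTrim(links, "`n")
```

With this sample, the second a.example link is skipped, so the two kept links should be numbered 1 and 2.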

Another question: if I wanted to exclude certain domains in your regex, where would I put the (?!example|example2)?

A negative lookahead doesn't seem to work with your first regex.
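
One possible placement is directly after the "://" part. Written as (?!example|example2) it would only reject hosts that begin with those words; to reject them anywhere in the URL, the lookahead needs to scan ahead, and bounding that scan with [^"]* keeps it inside the current quoted URL. A hedged sketch with made-up sample input (needle is a name chosen for this sketch):

```autohotkey
; Made-up sample input; "example" and "example2" are the placeholders from the question.
html := "<a href=""https://example.com/a""></a>"
html .= "<a href=""https://keep.test/b""></a>"
html .= "<a href=""https://cdn.example2.org/c""></a>"

; The lookahead sits right after "://". [^"]* keeps the scan inside the current
; quoted URL; with the s option, .* would run on to the end of the document.
needle := "isO)""\Khttps?://(?![^""]*(?:example2|example))[^""]+"

links := m := ""
while RegExMatch(html, needle, m, m ? m.Pos + m.Len : 1)
   links .= m[0] . "`n"
MsgBox, % RTrim(links, "`n")
```

With this sample, only the keep.test link should pass the filter.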
Last edited by Rikk03 on 07 Jun 2021, 02:22, edited 1 time in total.
Chunjee
Posts: 1429
Joined: 18 Apr 2014, 19:05

Re: Remove Duplicates

06 Jun 2021, 23:42

https://biga-ahk.github.io/biga.ahk/#/?id=uniq returns a new array with all duplicates removed. It does not accept pseudo-arrays as input, though.

Code: Select all

A := new biga() ; requires https://www.npmjs.com/package/biga.ahk

A.uniq([2, 1, 2, 2, 2, 2, 1])
; => [2, 1]
Rikk03
Posts: 192
Joined: 12 Oct 2020, 02:44

Re: Remove Duplicates

07 Jun 2021, 03:41

Thank you, teadrinker,

I figured it out, it works perfectly now.
Rikk03
Posts: 192
Joined: 12 Oct 2020, 02:44

Re: Remove Duplicates

27 Sep 2021, 08:13

teadrinker wrote:
06 Jun 2021, 10:40
An example:

Hi again. If the domain were preceded by some other text, how could I prevent the de-duplication process?

I'm looking for the best way... perhaps using InStr or something simple.
teadrinker
Posts: 4344
Joined: 29 Mar 2015, 09:41

Re: Remove Duplicates

27 Sep 2021, 08:18

Can you provide an example?
Rikk03
Posts: 192
Joined: 12 Oct 2020, 02:44

Re: Remove Duplicates

22 Dec 2021, 16:51

Regarding the regular expression used:

I need it to be this: ((?:https:\/\/www\.|https:\/\/)(?!" domain1 ")([^:\s\"]+))

How do I use the Group 2 result from this?
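
With the O) option used earlier in the thread, RegExMatch stores a match object, and numbered groups can be read from it as m[1], m[2] and so on (m[0] is the whole match). A small sketch with an illustrative sample string; the pattern keeps the grouping from the question but leaves out the domain1 lookahead:

```autohotkey
; Sample text is illustrative only.
text := "see https://www.example.org/path for details"

; Group 1 is the outer pair of parentheses, (?:...) does not capture,
; so the part after the scheme and optional "www." is group 2.
if RegExMatch(text, "O)((?:https://www\.|https://)([^:\s""]+))", m)
    MsgBox, % "whole: " . m[0] . "`ngroup 1: " . m[1] . "`ngroup 2: " . m[2]
```

Without O), the same groups would instead be available as the variables m1 and m2.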
teadrinker
Posts: 4344
Joined: 29 Mar 2015, 09:41
Contact:

Re: Remove Duplicates

22 Dec 2021, 18:14

Please post the source string and what you need to get. I'm not sure if your expression is correct.
Rikk03
Posts: 192
Joined: 12 Oct 2020, 02:44

Re: Remove Duplicates

25 Dec 2021, 14:02

Hi teadrinker,

Forget my previous post, I've updated my regex.

I'd like to use this regex. I tested it with regex101, so it works.

(?:https:\/\/)(?!.*(cdn|ssl|cache))([^:\s""<>]+)

However, I am having trouble substituting it for your suggested regex while keeping

, m, m ? m.Pos + m.Len : 1) {

on the end. I don't really understand why it doesn't work, because there is only one capture group.
teadrinker
Posts: 4344
Joined: 29 Mar 2015, 09:41

Re: Remove Duplicates

26 Dec 2021, 13:24

teadrinker wrote: Please post the source string and what you need to get.
Rikk03
Posts: 192
Joined: 12 Oct 2020, 02:44

Re: Remove Duplicates

03 Jan 2022, 05:02

It could literally be any webpage's HTML source.

For example, say the scrape finds this URL:

https://whatever.com/greatwaysto-walk

But if

https://whatever.com or https://www.whatever.com OR https://subdomain.whatever.com OR https://subdomain.another.whatever.com OR https://subdomain.com/whatever/ is found again, it would be excluded. A URL with "static" or "images" in it would also be excluded.

To be exact, I'm trying to use

while RegExMatch(bata, "isO)""\Khttps?://(?!(.+)?(" variable "|static|images))[^""\s]+\b", m, m ? m.Pos + m.Len : 1) {

linb := m[0]
and

RegExMatch(linb, "https?://([^/:""\s\/]+)", compvar)

However, it doesn't seem to be working.
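
One thing to check in the pattern above: with the s option, (.+)? inside the lookahead can run to the end of the document, so a later "static" or "images" anywhere in the page may reject a link. A different approach, sketched here under stated assumptions, is to match every link with the simple pattern and filter afterwards. BaseDomain and IsUnwanted are hypothetical helpers written for this sketch, the sample html is made up, and cutting the host to its last two labels is a naive rule (it mishandles hosts such as example.co.uk and does not cover the https://subdomain.com/whatever/ case):

```autohotkey
; Made-up sample input for illustration.
html := "<a href=""https://whatever.com/greatwaysto-walk""></a>"
html .= "<a href=""https://www.whatever.com/""></a>"
html .= "<img src=""https://static.other.test/logo.png"">"
html .= "<a href=""https://sub.another.test/page""></a>"

seen := {}, out := "", m := ""
while RegExMatch(html, "isO)""\Khttps?://[^""\s]+", m, m ? m.Pos + m.Len : 1) {
    link := m[0]
    if IsUnwanted(link, ["static", "images"])
        continue
    key := BaseDomain(link)
    if (key = "" || seen.HasKey(key))
        continue                      ; this base domain was already listed
    seen[key] := true
    out .= link . "`n"
}
MsgBox, % RTrim(out, "`n")

; Hypothetical helper: reduce a URL to the last two labels of its host.
BaseDomain(url) {
    if !RegExMatch(url, "i)^https?://([^/:""\s]+)", host)
        return ""
    if RegExMatch(host1, "[^.]+\.[^.]+$", tail)   ; host1 = capture group 1 (the host)
        return tail
    return host1
}

; Hypothetical helper: true if the URL contains any of the given words (case-insensitive).
IsUnwanted(url, words) {
    for i, word in words {
        if InStr(url, word)
            return true
    }
    return false
}
```

With this sample, the www.whatever.com link collapses onto whatever.com and the static link is filtered out, so two links should remain.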
Last edited by Rikk03 on 13 Jan 2022, 13:54, edited 1 time in total.
Rikk03
Posts: 192
Joined: 12 Oct 2020, 02:44

Re: Remove Duplicates

03 Jan 2022, 10:07

Never mind,

Got it working! Finally!!
