
Remove Duplicates

Posted: 06 Jun 2021, 02:08
by Rikk03
Hi, I'm using the following code with a regex to find links in page source, which works.

Code: Select all

; Match each link in the page source (bata) with `pattern`,
; skip links already collected in `dups`, and append the rest to a CSV.
dups := []
Pos := 1
try While Pos := RegExMatch(bata, pattern, m, Pos + StrLen(m))
{
    cou := A_Index . "`t"          ; running count of matches
    for i, v in dups
    {
        if (v = m)
            continue, 2            ; exact duplicate, skip to the next match
    }
    extlinks .= m "`n"
    dups.Push(m)
    eztlinks := RTrim(extlinks)
    eztlinks := StrReplace(eztlinks, ",")
    FileAppend, %cou%`,%eztlinks%, %A_Desktop%\gsresults2021.csv
}
However, while some duplicates are cleared out, others are not. How can I be more aggressive about duplicate removal? To explain a little further, this is for Google search results extraction, and I want the variable "cou" to reflect the actual rank as it loops through the source code. I find that each link is listed 4 times in the source for every result, so perhaps all I need is to return the first?

Thoughts? How can I return just the first link for each result position (instead of 4 links being returned for each)?

Re: Remove Duplicates

Posted: 06 Jun 2021, 03:34
by braunbaer
Without more information, it is difficult to tell where the problem lies.
At first glance, I would expect duplicates to be removed. Where are the links listed 4 times?

Maybe you want to post a complete example to reproduce:
What is bata, what is pattern, what is the expected result and what is your actual result?

Re: Remove Duplicates

Posted: 06 Jun 2021, 08:46
by Rikk03
Updated to show the output.

For example, the output provides both

https://www.artefact.com/digital-marketing-3/e-retail-e-commerce/

AND

https://www.artefact.com/

The latter would not be needed. So if a link on that domain already exists, the bare domain should not be returned by itself.

They are not considered duplicates, I guess. So how can I solve this?

Re: Remove Duplicates

Posted: 06 Jun 2021, 10:40
by teadrinker
An example:

Code: Select all

url := "https://www.autohotkey.com/boards/viewtopic.php?f=76&t=91362"
Whr := ComObjCreate("WinHttp.WinHttpRequest.5.1")
Whr.Open("GET", url, false)
Whr.Send()
status := Whr.status
if (status != 200)
   throw "HttpRequest error, status: " . status

; responseBody is a SAFEARRAY of bytes: read its data pointer and decode it as UTF-8
Arr := Whr.responseBody
pData := NumGet(ComObjValue(Arr) + 8 + A_PtrSize)
length := Arr.MaxIndex() + 1
html := StrGet(pData, length, "UTF-8")

; collect every quoted http(s) link, keeping only the first link found per domain
links := m := "", Domains := {}
while RegExMatch(html, "isO)""\Khttps?://[^""]+", m, m ? m.Pos + m.Len : 1) {
   link := m[0]
   RegExMatch(link, "https?://[^/:]+", domain)   ; domain = scheme + host
   if !Domains.HasKey(domain) {
      Domains[domain] := ""
      links .= link . "`n"
   }
}
MsgBox, % RTrim(links, "`n")

Re: Remove Duplicates

Posted: 06 Jun 2021, 14:49
by Rikk03
Wow,

For the most part it worked! I'm using two very different regexes and it still worked.

Question: how would you add a count to your example? Let's assume you are looping through an HTML source file extracting links; the count would give an approximate rank.

Re: Remove Duplicates

Posted: 06 Jun 2021, 15:29
by teadrinker
I'd add i := A_Index into the while loop.
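A minimal sketch of where that could go, based on the loop from the example above (prefixing each kept link with the counter is just one assumed way of showing the count):

Code: Select all

links := m := "", Domains := {}
while RegExMatch(html, "isO)""\Khttps?://[^""]+", m, m ? m.Pos + m.Len : 1) {
   i := A_Index                                  ; running match number
   link := m[0]
   RegExMatch(link, "https?://[^/:]+", domain)
   if !Domains.HasKey(domain) {
      Domains[domain] := ""
      links .= i ". " link . "`n"                ; prefix each kept link with its count
   }
}
MsgBox, % RTrim(links, "`n")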

Re: Remove Duplicates

Posted: 06 Jun 2021, 23:28
by Rikk03
Yes, it does add a count, but it is a little too approximate because it counts the duplicates too.

Another question: if I wanted to exclude certain domains in your regex, where would I put the (?!example|example2)?

A negative lookahead doesn't seem to work with your first regex.
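For what it's worth, the lookahead usually sits right after the scheme, and it needs to be bounded to the quoted URL (for example with [^""]* rather than .*); otherwise it scans ahead into the rest of the HTML and rejects far more links than intended. A sketch against the pattern from the example above, with example.com and example2.net as placeholder domains:

Code: Select all

; example\.com and example2\.net are placeholders for the domains to exclude
pattern := "isO)""\Khttps?://(?![^""]*(?:example\.com|example2\.net))[^""]+"
while RegExMatch(html, pattern, m, m ? m.Pos + m.Len : 1) {
   link := m[0]
   ; ... same per-domain de-duplication as in the example above
}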

Re: Remove Duplicates

Posted: 06 Jun 2021, 23:42
by Chunjee
https://biga-ahk.github.io/biga.ahk/#/?id=uniq returns a new array with all duplicates removed. It does not accept pseudo-arrays as input, though.

Code: Select all

A := new biga() ; requires https://www.npmjs.com/package/biga.ahk

A.uniq([2, 1, 2, 2, 2, 2, 1])
; => [2, 1]

Re: Remove Duplicates

Posted: 07 Jun 2021, 03:41
by Rikk03
Thank you, teadrinker,

I figured it out, it works perfectly now.

Re: Remove Duplicates

Posted: 07 Jun 2021, 03:48
by teadrinker
Ok :)

Re: Remove Duplicates

Posted: 27 Sep 2021, 08:13
by Rikk03
teadrinker wrote: 06 Jun 2021, 10:40 (the example quoted above)

Hi again, say the domain was preceded by some other text, how could I prevent the de-duplication process?

I'm looking for the best way, perhaps using InStr or something simple.

Re: Remove Duplicates

Posted: 27 Sep 2021, 08:18
by teadrinker
Can you provide an example?

Re: Remove Duplicates

Posted: 22 Dec 2021, 16:51
by Rikk03
Regarding the regular expression used:

I need it to be this: ((?:https:\/\/www\.|https:\/\/)(?!" domain1 ")([^:\s\"]+))

How do I use the Group 2 result from this?
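For reference, with the O) option prepended (as in the earlier example), the capture groups come back on the match object, so the Group 2 result is read as m[2] (m[1] is the outer group and m[0] the whole match). A sketch, assuming domain1 holds a domain to exclude and leaving the pattern essentially as posted (quotes doubled for the AHK string):

Code: Select all

pattern := "isO)((?:https://www\.|https://)(?!" domain1 ")([^:\s""]+))"
while RegExMatch(html, pattern, m, m ? m.Pos + m.Len : 1) {
   whole := m[1]   ; group 1: the full link
   tail  := m[2]   ; group 2: the part after https:// or https://www.
}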

Re: Remove Duplicates

Posted: 22 Dec 2021, 18:14
by teadrinker
Please post the source string and what you need to get. I'm not sure if your expression is correct.

Re: Remove Duplicates

Posted: 25 Dec 2021, 14:02
by Rikk03
Hi Teadrinker,

Forget my previous post, I've updated my regex.

I'd like to use this regex; I tested it with regex101, so it works there:

(?:https:\/\/)(?!.*(cdn|ssl|cache))([^:\s""<>]+)

However, I am having trouble making it work in place of your suggested regex, with

, m, m ? m.Pos + m.Len : 1) {

on the end. I don't really understand why it doesn't work, because there is only one capture group.
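For reference, the m ? m.Pos + m.Len : 1 idiom only works when m is a match object, and that requires the O) option at the start of the pattern (the "isO)" prefix in the earlier example); without it, m is a plain output variable and m.Pos / m[0] fail. Also, the group inside the lookahead still counts, so ([^:\s""<>]+) is actually group 2, not group 1. A sketch with the options prepended and the lookahead bounded as noted earlier:

Code: Select all

; lookahead bounded with [^:\s""<>]* so it only scans the current URL
pattern := "isO)(?:https://)(?![^:\s""<>]*(cdn|ssl|cache))([^:\s""<>]+)"
while RegExMatch(html, pattern, m, m ? m.Pos + m.Len : 1) {
   url  := m[0]    ; the whole matched link
   tail := m[2]    ; group 2: the part after https://
}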

Re: Remove Duplicates

Posted: 26 Dec 2021, 13:24
by teadrinker
teadrinker wrote: Please post the source string and what you need to get.

Re: Remove Duplicates

Posted: 03 Jan 2022, 05:02
by Rikk03
It could literally be any web page's source HTML.

For example, the scrape of a page's source finds this URL:

https://whatever.com/greatwaysto-walk

but if

https://whatever.com OR https://www.whatever.com OR https://subdomain.whatever.com OR https://subdomain.another.whatever.com OR https://subdomain.com/whatever/ is found again, it would be excluded; or if it had "static" or "images" in the URL, again excluded.

To be exact, I'm trying to use

while RegExMatch(bata, "isO)""\Khttps?://(?!(.+)?(" variable "|static|images))[^""\s]+\b", m, m ? m.Pos + m.Len : 1) {
linb := m[0]

and

RegExMatch(linb, "https?://([^/:""\s\/]+)", compvar)

However, it doesn't seem to be working.
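For reference, the most likely problem is the (.+)? inside the lookahead: with the s option it can scan far past the current URL into the rest of the page, so almost every link gets rejected whenever "static", "images" or the excluded domain appears anywhere later in the HTML. Bounding the lookahead to the URL's own characters keeps the exclusion local. A sketch, with `variable` assumed to hold the domain to exclude as in the call above:

Code: Select all

; [^""\s]* limits the lookahead to the current quoted URL
while RegExMatch(bata, "isO)""\Khttps?://(?![^""\s]*(?:" variable "|static|images))[^""\s]+\b", m, m ? m.Pos + m.Len : 1)
{
    linb := m[0]
    RegExMatch(linb, "https?://[^/:""\s]+", compvar)   ; compvar: scheme + domain, for de-duplication
    ; ... keep linb only if compvar has not been seen before
}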

Re: Remove Duplicates

Posted: 03 Jan 2022, 10:07
by Rikk03
Never mind,

Got it working! Finally!!