Remove Duplicates
Posted: 06 Jun 2021, 02:08
by Rikk03
Hi, I'm trying to use the following code with a regex to find links in source code, which works.
Code: Select all
dups := []
Pos := 1
While Pos := RegExMatch(bata, pattern, m, Pos + StrLen(m))
{
    cou := A_Index . "`t"
    for i, v in dups {
        if (v = m)
            continue, 2  ; already seen, skip to the next match of the outer loop
    }
    extlinks .= m "`n"
    dups.Push(m)
    eztlinks := RTrim(extlinks)
    eztlinks := StrReplace(eztlinks, ",")
    FileAppend, %cou%`,%eztlinks%, %A_Desktop%\gsresults2021.csv
}
However, while some duplicates are cleared out, others are not. How can I be more thorough about duplicate removal? To explain a little further: this is for Google search results extraction. I want the variable "cou" to reflect the actual rank as the loop works through the source code. I find the link is listed 4 times in the source for each result, so perhaps all I need is to return the first?
Thoughts? How can I return just the first link for each result position, instead of 4 links being returned for each?
Re: Remove Duplicates
Posted: 06 Jun 2021, 03:34
by braunbaer
Without more information, it is difficult to tell where the problem lies.
At first glance, I would expect duplicates to be removed. Where are the links listed 4 times?
Maybe you want to post a complete example to reproduce:
What is bata, what is pattern, what is the expected result and what is your actual result?
Re: Remove Duplicates
Posted: 06 Jun 2021, 08:46
by Rikk03
Updated to show the output.
For example, the output provides both
https://www.artefact.com/digital-marketing-3/e-retail-e-commerce/
AND
https://www.artefact.com/
The latter would not be needed. So if a link containing the domain exists, the bare domain by itself should be dropped.
They are not considered duplicates, I guess. So how can I solve it?
Re: Remove Duplicates
Posted: 06 Jun 2021, 10:40
by teadrinker
An example:
Code: Select all
url := "https://www.autohotkey.com/boards/viewtopic.php?f=76&t=91362"
Whr := ComObjCreate("WinHttp.WinHttpRequest.5.1")
Whr.Open("GET", url, false)
Whr.Send()
status := Whr.Status
if (status != 200)
    throw "HttpRequest error, status: " . status
Arr := Whr.ResponseBody                            ; SAFEARRAY of raw bytes
pData := NumGet(ComObjValue(Arr) + 8 + A_PtrSize)  ; pvData pointer inside the SAFEARRAY
length := Arr.MaxIndex() + 1
html := StrGet(pData, length, "UTF-8")
links := m := "", Domains := {}
; match every quoted http(s) URL; keep only the first link seen per domain
while RegExMatch(html, "isO)""\Khttps?://[^""]+", m, m ? m.Pos + m.Len : 1) {
    link := m[0]
    RegExMatch(link, "https?://[^/:]+", domain)    ; extract the domain part
    if !Domains.HasKey(domain) {
        Domains[domain] := ""
        links .= link . "`n"
    }
}
MsgBox, % RTrim(links, "`n")
Re: Remove Duplicates
Posted: 06 Jun 2021, 14:49
by Rikk03
Wow,
For the most part it worked! I'm using two very different regexes and it still worked.
Question: how would you add a count to your example? Let's assume you are looping through an HTML source file extracting links; the count would serve as an approximate rank.
Re: Remove Duplicates
Posted: 06 Jun 2021, 15:29
by teadrinker
I'd add i := A_Index into the while loop.
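As a hedged, untested sketch of that idea against the loop from the earlier example (same html and Domains variables assumed): i := A_Index numbers every match, while a separate counter incremented inside the uniqueness check numbers only the unique links.

```autohotkey
; Sketch: A_Index counts every regex match, duplicates included.
; A separate counter inside the uniqueness branch ranks only new domains.
links := m := "", Domains := {}, rank := 0
while RegExMatch(html, "isO)""\Khttps?://[^""]+", m, m ? m.Pos + m.Len : 1) {
    link := m[0]
    RegExMatch(link, "https?://[^/:]+", domain)
    if !Domains.HasKey(domain) {
        Domains[domain] := ""
        rank++                              ; advances only for new domains
        links .= rank . ". " . link . "`n"
    }
}
```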
Re: Remove Duplicates
Posted: 06 Jun 2021, 23:28
by Rikk03
Yes, it does add a count, but it is a little too approximate because it counts the duplicates too.
Another question: if I wanted to exclude certain domains in your regex, where would I put the (?!example|example2) ?
A negative lookahead doesn't seem to work with your first regex.
Re: Remove Duplicates
Posted: 06 Jun 2021, 23:42
by Chunjee
https://biga-ahk.github.io/biga.ahk/#/?id=uniq returns a new array with all duplicates removed. It does not accept pseudo-arrays as input, though.
Code: Select all
A := new biga() ; requires https://www.npmjs.com/package/biga.ahk
A.uniq([2, 1, 2, 2, 2, 2, 1])
; => [2, 1]
Re: Remove Duplicates
Posted: 07 Jun 2021, 03:41
by Rikk03
Thank you, teadrinker.
I figured it out, it works perfectly now.
Re: Remove Duplicates
Posted: 07 Jun 2021, 03:48
by teadrinker
Ok
Re: Remove Duplicates
Posted: 27 Sep 2021, 08:13
by Rikk03
teadrinker wrote: ↑06 Jun 2021, 10:40
An example:
Hi again. Say the domain was preceded by some other text; how could I prevent the de-duplication process?
I'm looking for the best way ... perhaps using InStr or something simple.
Re: Remove Duplicates
Posted: 27 Sep 2021, 08:18
by teadrinker
Can you provide an example?
Re: Remove Duplicates
Posted: 22 Dec 2021, 16:51
by Rikk03
Regarding the regular expression used:
I need it to be this: ((?:https:\/\/www\.|https:\/\/)(?!" domain1 ")([^:\s\"]+))
How do I use the Group 2 result from this?
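With AutoHotkey's O) (match object) option, subpatterns are read from the match object by index, so Group 2 would be m[2]. A minimal, hedged sketch (the haystack is illustrative, and example.com stands in for domain1):

```autohotkey
; m[0] is the whole match, m[1] is Group 1, m[2] is Group 2.
haystack := "<a href=""https://www.foo.com/page1"">link</a>"
pattern := "O)((?:https://www\.|https://)(?!example\.com)([^:\s""]+))"
if RegExMatch(haystack, pattern, m)
    MsgBox, % m[2]   ; the Group 2 capture: foo.com/page1
```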
Re: Remove Duplicates
Posted: 22 Dec 2021, 18:14
by teadrinker
Please post the source string and what you need to get. I'm not sure if your expression is correct.
Re: Remove Duplicates
Posted: 25 Dec 2021, 14:02
by Rikk03
Hi Teadrinker,
Forget my previous post, I've updated my regex.
I'd like to use this regex, tested with regex101, so it works:
(?:https:\/\/)(?!.*(cdn|ssl|cache))([^:\s""<>]+)
However, I am having trouble substituting it into your suggested call, with
, m, m ? m.Pos + m.Len : 1) {
on the end. I don't really understand why it doesn't work, because there is only one capture group I care about.
Re: Remove Duplicates
Posted: 26 Dec 2021, 13:24
by teadrinker
teadrinker wrote: ↑Please post the source string and what you need to get.
Re: Remove Duplicates
Posted: 03 Jan 2022, 05:02
by Rikk03
It could literally be any webpage's source HTML.
For example, the source scrape contains this URL:
https://whatever.com/greatwaysto-walk
but if
https://whatever.com or
https://www.whatever.com or
https://subdomain.whatever.com or
https://subdomain.another.whatever.com or
https://subdomain.com/whatever/
is found again, it would be excluded; it would also be excluded if it had "static" or "images" in the URL.
To be exact, I'm trying to use
while RegExMatch(bata, "isO)""\Khttps?://(?!(.+)?(" variable "|static|images))[^""\s]+\b", m, m ? m.Pos + m.Len : 1) {
    linb := m[0]
and
RegExMatch(linb, "https?://([^/:""\s\/]+)", compvar)
but it doesn't seem to be working.
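As a hedged, untested sketch of one way this could be put together (bata assumed to hold the page HTML, and variable assumed to hold a regex-escaped domain to exclude): anchor the excluded words against the rest of the quoted URL, and key the de-duplication on the captured domain.

```autohotkey
; Untested sketch: skip URLs containing the excluded domain, "static",
; or "images", then keep only the first link seen per domain.
variable := "whatever\.com"          ; assumed: regex-escaped domain to exclude
links := m := "", Domains := {}
needle := "isO)""\Khttps?://(?![^""]*(?:" variable "|static|images))[^""\s]+"
while RegExMatch(bata, needle, m, m ? m.Pos + m.Len : 1) {
    linb := m[0]
    RegExMatch(linb, "https?://([^/:""\s]+)", compvar)  ; compvar1 = domain
    if !Domains.HasKey(compvar1) {
        Domains[compvar1] := ""
        links .= linb . "`n"
    }
}
```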
Re: Remove Duplicates
Posted: 03 Jan 2022, 10:07
by Rikk03
Never mind,
Got it working! Finally!!