Need guidance how to pattern match within a large amount of text.

Get help with using AutoHotkey (v1.1 and older) and its commands and hotkeys
omar
Posts: 540
Joined: 22 Oct 2015, 17:56

24 Jan 2016, 12:48

I have a large amount of text.
I want to find 3 bits of information.
I know exactly what comes before and after the text I want.

I need help on how best to pattern match.
I assume regular expressions are what I want?
Just not sure where to start.

That's the first step.
The second step is that there will be several matches.
So I need to create an array and store them one by one until the end of the text is reached.

Any help to start off would be great.

Thanks.


Omar
boiler
Posts: 16915
Joined: 21 Dec 2014, 02:44

24 Jan 2016, 12:56

You may think this is too general to be helpful, but given how general your question is, I think it's best for you to read this and try some code, then post specific questions if you get stuck.

RegExMatch
kon
Posts: 1756
Joined: 29 Sep 2013, 17:11

24 Jan 2016, 13:05

If you want specific advice, you're going to need to provide specific examples (both of the haystack you want to search and the text matches you want to find). Otherwise, have a look at the RegEx Quick Reference, RegExMatch, and RegExReplace in the AHK docs.

Code:

Test = 
(Join`r`n
abc
123
def
456
ghi
789
)

p := 1, m := "", Found := []
while p := RegExMatch(Test, "\d+", m, p + StrLen(m))
    Found.Push(m)

for k, v in Found
    MsgBox, % "k:`t" k "`nv:`t" v
This will find all the groups of numbers in the text.
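
For the case in the original question, where the text before and after each value is known, the same kind of loop can keep just the part in between. A minimal sketch, assuming hypothetical `<b>` and `</b>` markers and made-up sample text (neither comes from this thread):

```autohotkey
; Hypothetical sample: each wanted value sits between "<b>" and "</b>".
Haystack := "<b>one</b> filler <b>two</b> filler <b>three</b>"

Found := []
Pos := 1
; m receives the whole match; m1 receives the first (...) group (default v1 output mode).
while Pos := RegExMatch(Haystack, "<b>(.*?)</b>", m, Pos)
{
    Found.Push(m1)       ; keep only the text between the markers
    Pos += StrLen(m)     ; continue just past the end of this match
}
; Found now holds three items: one, two, three
```

By default the dot does not match newlines, so if a value can span several lines the pattern would probably need the `s)` option in front, e.g. `"s)<b>(.*?)</b>"`.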
omar
Posts: 540
Joined: 22 Oct 2015, 17:56

28 Jan 2016, 20:24

thanks for the replies guys.
since posting, i've gone away and have been learning regular expression.
having mastered it, i am back and am completely lost.
lol
(ok, so maybe not a master just yet)

what i'm trying to do is pull data from the source of a webpage.
i want a title, name and telephone.

the 3 variables will have some css that i can find

the problem with the title is that the data comes on the next line...
from what i have read about regular expressions so far, they work line by line...!

a bit of the text:

Code:

<td align="left" valign="top" class="bdr-rt Arl c777 fs11 pdgTop navigation bdrBtm">
										 <a href="http://cgi.ebay.co.uk/ws/eBayISAPI.dll?ViewItem&item=38000000000">The title I want to get from the code</a></td>
<td align="center" width="75" valign="top" class="bdr-rt Arl c777 fs11 pdgTop navigation bdrBtm">£9.15</td>
so, i want the line following <td align="left" valign="top" class="bdr-rt Arl c777 fs11 pdgTop navigation bdrBtm">
i then only want the text in between the a tag, so i want everything between <a href="http://cgi.ebay.co.uk/ws/eBayISAPI.dll? ... 8000000000"> and </a>

i'm so lost! :)
i spent ages reading up and watching videos. it seemed so easy!

@kon, the code u gave is awesome and helps me learn.
i've tried reading up about the function u use: https://autohotkey.com/docs/commands/RegExMatch.htm
but am really confused. what does p + StrLen(m) do? i can't quite figure out.

let me know

thanks
AlphaBravo
Posts: 586
Joined: 29 Sep 2013, 22:59

28 Jan 2016, 21:32

here is one way :

Code:

h =
(
<td align="left" valign="top" class="bdr-rt Arl c777 fs11 pdgTop navigation bdrBtm">
										 <a href="http://cgi.ebay.co.uk/ws/eBayISAPI.dll?ViewItem&item=38000000000">The title I want to get from the code</a></td>
<td align="center" width="75" valign="top" class="bdr-rt Arl c777 fs11 pdgTop navigation bdrBtm">£9.15</td>
 )

start_line = 
(
<td align="left" valign="top" class="bdr-rt Arl c777 fs11 pdgTop navigation bdrBtm">
)

RegExMatch(H, start_line "\s*<a href[^>]+>\K[^<]+", m)
MsgBox % m
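
On the earlier question about `p + StrLen(m)` in kon's loop: the fourth parameter of RegExMatch is the position to start searching from. Here is a sketch that records each step, using a shortened copy of kon's sample text; the `Trace` variable is only added for illustration:

```autohotkey
Test := "abc`r`n123`r`ndef`r`n456"   ; shortened version of kon's sample text

p := 1, m := "", Trace := ""
; 4th parameter = where to start searching. p is where the last match began and
; m is its text, so p + StrLen(m) is the first character after that match.
while p := RegExMatch(Test, "\d+", m, p + StrLen(m))
    Trace .= "found " . m . " at " . p . ", next search starts at " . (p + StrLen(m)) . "`n"

; Trace now reads:
;   found 123 at 6, next search starts at 9
;   found 456 at 16, next search starts at 19
```

On the first pass m is empty, so the search starts at position 1. Without the advance, every pass would find the same first match again and the loop would never end. Add `MsgBox % Trace` at the end to see the result.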
omar
Posts: 540
Joined: 22 Oct 2015, 17:56

07 Feb 2016, 21:00

@AlphaBravo thanks for the awesome reply. ur code works!!
but i'm confused. i thought i had figured out RE - guess i have a lot to learn.

you have: RegExMatch(H, start_line "\s*<a href[^>]+>\K[^<]+", m)
(not happy with the documentation for RegExMatch - I think it's confusing and not enough examples)

\s* - OK, I get this - zero or more whitespace chars - this solves the problem i wasn't sure about: how you cope with carriage returns.

Then you have: <a href
OK so far.

Next: [^>]+ - I'm confused. This means starts with > char? What do the square brackets do. I thought I understood square brackets - matches any one of the chars inside? Or combinations of inside - even repeating?

The + - means one or more of the preceding? So you're saying we could have >>>>>> ?

You've made a mistake! You need to have a * after 'href' - indicating that anything can come in between this and the next '>' char.
Except... you haven't. Your code works perfectly.

We then have: > - assuming you are saying we have this char next (and that this char doesn't have a special meaning)

\K - hmmm? I googled this. Am confused.

[^<]+ confused about this. Assume answer to the above will answer.

Thanks!!
Peabianjay
Posts: 52
Joined: 07 Nov 2015, 22:50

07 Feb 2016, 21:37

Personally,

I'd use https://autohotkey.com/docs/commands/InStr.htm .

It would require a couple of calls, but easier to follow.

Code:

Haystack = 
(
<td align="left" valign="top" class="bdr-rt Arl c777 fs11 pdgTop navigation bdrBtm">
										 <a href="http://cgi.ebay.co.uk/ws/eBayISAPI.dll?ViewItem&item=38000000000">The title I want to get from the code</a></td>
<td align="center" width="75" valign="top" class="bdr-rt Arl c777 fs11 pdgTop navigation bdrBtm">£9.15</td>
 )
Needle = 
(
<td align="left" valign="top" class="bdr-rt Arl c777 fs11 pdgTop navigation bdrBtm">
)
StartingPos := 1

FoundPos := InStr(Haystack, Needle, false, StartingPos)      ; Find your startline
FoundPos := InStr(Haystack, "a href", false, FoundPos)       ; Find the next "a href" (3rd parameter is CaseSensitive; StartingPos is the 4th)
BeginTitle := InStr(Haystack, ">", false, FoundPos) + 1      ; Find the beginning of your title
EndTitle := InStr(Haystack, "<", false, BeginTitle) - 1      ; Find the end of your title

Title := SubStr(Haystack, BeginTitle, EndTitle - BeginTitle + 1)   ; Note that I subtracted one above so EndTitle would point at the last character of the title, but that means I have to add one, here, to get the length of the title.
This could be written better, but I tried to keep it simple just for the explanation. :-)
If you wanted to continue searching the rest of the Haystack for other occurrences, you'd just pick up where you left off (at EndTitle + 1).
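
Picking up where you left off could look roughly like this. It is only a sketch: the two-link sample string and the `Titles` array are made up for illustration, and it assumes every "a href" is followed by a ">" and then a "<".

```autohotkey
; Hypothetical sample with two links
Haystack := "<a href=""x1"">First</a> text <a href=""x2"">Second</a>"

Titles := []
Pos := 1
Loop
{
    Pos := InStr(Haystack, "a href", false, Pos)           ; next link at or after Pos
    if !Pos
        break                                              ; no more links
    BeginTitle := InStr(Haystack, ">", false, Pos) + 1     ; first character of the title
    EndTitle := InStr(Haystack, "<", false, BeginTitle) - 1  ; last character of the title
    Titles.Push(SubStr(Haystack, BeginTitle, EndTitle - BeginTitle + 1))
    Pos := EndTitle + 1                                    ; resume after this title
}
; Titles now holds: First, Second
```

The fourth parameter of InStr is StartingPos, so the third one (CaseSensitive) is spelled out as false to keep the positions in the right slot.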
Peabianjay
Posts: 52
Joined: 07 Nov 2015, 22:50

07 Feb 2016, 22:04

After reviewing RegExMatch myself, I figured this out....but still think "InStr" is better in this case. (The documentation also says it's faster for simple matches.) But, I'll share what I figured out....

Code:

<a href[^>]+>          ; this grabs (or rather ignores) the whole HTML call from "<" to ">"
<a href                ; literally looks for this string
[   ]                   ; any one of the characters inside. Typically used for more than one character. Used here (I think) because of the ^, which would have a different meaning otherwise
^                       ; NOT
so that means
[^>]                   ; any character that is NOT ">"
+                       ; one or more of the above ( as long as it's not ">" )

("<a href>" would NOT be found by this search, since there are ZERO characters between "href" and ">" )

\K                    ; I couldn't find this, either...presumably, it's pretty important, 'cuz it's your actual title. (Not sure why AlphaBravo used it. I would've used * )
[^<]+                 ; as above, "one or more characters which are NOT "<"
sinkfaze
Posts: 616
Joined: 01 Oct 2013, 08:01

08 Feb 2016, 09:57

omar wrote:i thought i had figured out RE - guess i have a lot to learn.
The famous last words of most people who think they have regex figured out. ;) I haven't got it totally figured out and I've been using it for nearly a decade.
omar wrote:Next: [^>]+ - I'm confused. This means starts with > char? What do the square brackets do. I thought I understood square brackets - matches any one of the chars inside? Or combinations of inside - even repeating?

The + - means one or more of the preceding? So you're saying we could have >>>>>> ?
If you look at the Regular Expressions Quick Reference you'll see that [...] represents a class of characters and that [^...] represents its negation, i.e. any single character that is not listed inside the brackets is considered a match. So when AlphaBravo uses [^>] he is asking for any character that is not a closed chevron (>) to match. Add the plus sign, and he is asking to match one or more characters that are not a closed chevron. Add the closed chevron after that and he is asking to match one or more characters that are not a closed chevron, followed by a closed chevron.
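
A small sketch of that, piece by piece. The link is a made-up one on example.com, shortened from the sample earlier in the thread, and the variable names are only for illustration:

```autohotkey
; Hypothetical link, shortened from the thread's sample
tag := "<a href=""http://example.com/item"">The title</a>"

RegExMatch(tag, "<a href[^>]+>", opening)        ; opening = <a href="http://example.com/item">
RegExMatch(tag, "<a href[^>]+>\K[^<]+", title)   ; title   = The title
bare := RegExMatch("<a href>", "<a href[^>]+>")  ; bare    = 0, because + needs at least one character
```

So [^>]+ never means a run of > characters. It swallows everything up to, but not including, the next >.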
omar wrote:\K - hmmm? I googled this. Am confused.
Check the same page linked above and look for the Look-ahead and look-behind assertions heading at the bottom of the page. The \K, in essence, signifies that everything to its left must be matched as part of the whole pattern, but that part to the left will not be returned as a part of the match. It's handy when you need to match a piece of data you don't really need to find a piece of data that you do need:

Code:

word=foobar
RegExMatch(word,"foo\Kbar",m)   ; "foo" must be present, but only "bar" is returned
MsgBox % m                      ; shows: bar
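
For comparison, a capture group gets the same piece of text another way. A minimal sketch; the variable names are made up:

```autohotkey
word := "foobar"
RegExMatch(word, "foo\Kbar", viaK)       ; viaK = bar
RegExMatch(word, "foo(bar)", viaGroup)   ; viaGroup = foobar, viaGroup1 = bar
same := (viaK = viaGroup1)               ; same = 1 (true)
```

With \K the wanted text is the whole match, so there is no numbered sub-variable to remember.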
AlphaBravo
Posts: 586
Joined: 29 Sep 2013, 22:59

08 Feb 2016, 13:20

omar wrote:(not happy with the documentation for RegExMatch - I think it's confusing and not enough examples)
Regex is a pattern matching tool that was not made by AutoHotkey; AutoHotkey only implements it (through the PCRE library).
I find the info offered by this website to be detailed with clear examples : http://www.rexegg.com/
omar
Posts: 540
Joined: 22 Oct 2015, 17:56

11 Feb 2016, 23:21

guys, thanks for the awesome replies. it's taken me a few days to get 20 min to sit down and read and learn from what u all have said!!

@Peabianjay - awesome code.
i'll definitely keep InStr up my sleeve - the way you did it seems so much simpler. the code u gave didn't work actually. lol.
but i think i know enough to understand how to use.

but.... i'm on a mission to learn regex. i need for other things i'll be coding in so am forcing/torturing myself to read and re-read trying to learn.

@Peabianjay ur explanations afterwards really helped. u got me 80% of the way there...

@sinkfaze OMG... u are awesome. i read what u said and now understand the original code!!!
feel like fist punching the air. lol.
THANKS!

@AlphaBravo awesome website. thanks. i'll definitely read from this place.

thank u all!!
