Regex help...

Posted: 20 Oct 2016, 12:09
by drawback
Hi,

I have the following string:

Code: Select all

EED6-5BA8:\Us;e;rs;D:\Test;Storage:\A;B;C\
I know how to loop over it via RegExMatch but I need a bit of help to define the actual regex pattern...

The string consists of three paths:
1.) EED6-5BA8:\Us;e;rs -> Drive serial number + folder with semicolons in it
2.) D:\Test
3.) Storage:\A;B;C\ -> Volume label + folder with semicolons in it

All paths are separated by a semicolon as well (I can't do anything about that!)

The regex should be able to capture each of these 3 paths...

I've tried:

Code: Select all

(^|;)(.*?:.*?(?=;.*?:|$))
But this doesn't work as it should :/ The lazy .*? stops at the first semicolon that has a colon anywhere after it, so the matches are cut in the wrong places:

Code: Select all

EED6-5BA8:\Us
;e;rs;D:\Test
;Storage:\A;B;C\
Each path could contain Unicode letters (in the volume label and in the folder names)!

Re: Regex help...

Posted: 20 Oct 2016, 12:24
by evilC
Could be quite difficult due to the different ways in which semicolon is used (Both as a field separator and a folder separator - how can it tell the difference?)

Does EED6-5BA8:\Us;e;rs vary? I.e., are there always the same number of semicolons?

Do you not have any control whatsoever over the format of the input string?

Re: Regex help...

Posted: 20 Oct 2016, 12:35
by drawback
evilC wrote: Could be quite difficult due to the different ways in which semicolon is used (Both as a field separator and a folder separator - how can it tell the difference?)
The difference should be solvable with a positive lookahead (at least that was my plan...)

evilC wrote: Does EED6-5BA8:\Us;e;rs vary?
Of course. It could be anything: a drive letter, serial number or volume label, followed by :\ and a folder with any number of semicolons.

evilC wrote: Do you not have any control whatsoever over the format of the input string?
Unfortunately no, sorry!
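
The lookahead idea can be sketched as follows. This is only an illustration, not a pattern posted in the thread, and it rests on one assumption: the part before :\ (drive letter, serial number or volume label) never contains a semicolon, colon or backslash. Under that assumption, a semicolon is a field separator exactly when it is followed by such a part and then :\. The variable names pattern, m and pos are made up for the example.

```autohotkey
; Sketch only: assumes no drive part (letter, serial, label) contains ; : or \
str := "EED6-5BA8:\Us;e;rs;D:\Test;Storage:\A;B;C\"

; One entry: a drive part, then :\, then as little as possible up to a
; semicolon that starts the next drive part, or up to the end of the string
pattern := "[^;:\\]+:\\.*?(?=;[^;:\\]+:\\|$)"

pos := 1
while pos := RegExMatch(str, pattern, m, pos)
{
    MsgBox % m
    pos += StrLen(m)  ; continue searching right after this match
}
```

For the sample string this shows EED6-5BA8:\Us;e;rs, then D:\Test, then Storage:\A;B;C\. The pattern uses only negated character classes and the dot, so it does not depend on which letters count as word characters.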

Re: Regex help...

Posted: 20 Oct 2016, 13:28
by evilC

Code: Select all

str := "EED6-5BA8:\Us;e;rs;D:\Test;Storage:\A;B;C\"

RegexMatch(str, "^(.+:\\.+);(.+:\\.+);(.+:\\.+)$", out)
msgbox % out1 "`n" out2 "`n" out3

Output:
EED6-5BA8:\Us;e;rs
D:\Test
Storage:\A;B;C\

Basically I used :\ to anchor each capture pattern

Re: Regex help...

Posted: 20 Oct 2016, 13:44
by ahcahc
try

Code: Select all

text = EED6-5BA8:\Us;e;rs;D:\Test;Storage:\A;B;C\
while pos := regexmatch(text,"m)(?:[~!@#$%^&()_+`\-=\[\]{}'\.,\w]+):\\(?:[~!@#$%^&()_+`;\-=\[\]{}'\., \w]+(?:(?=;)|\\|$))*",m,a_index=1?1:pos+strlen(m))
   MsgBox % m

Re: Regex help...

Posted: 20 Oct 2016, 13:59
by drawback
@evilC: Thank you, but I guess I didn't describe it correctly :(

This string can consist of any combination (and any number!) of serial number / volume label / drive letter, each followed by :\ [<path with ; in it>]
So these would be all "valid" strings that could occur:
EED6-5BA8:\Us;e;rs;D:\Test;Storage:\A;B;C\
Storage:\
C:\a ; (semicolon) inside me\subfolder with 漢 in it;Windows:\Users\
etc.
Sorry if my description was unclear!


@ahcahc
Thank you! This is very close. It splits all entries correctly unless a non-English character appears, e.g. a Chinese character in a folder name.
Like:
EED6-5BA8:\Us;e;rs;D:\@Chinese-漢字-chars;acter UTF-8 BOM\Test;Storage:\A;B;C\
Where the second entry comes out as 'D:\' only, every other char for that entry was truncated...
The .ahk script is using UTF-8 BOM, but I tried it with UTF-16 BOM / No BOM as well.
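
A likely cause, offered as an assumption rather than something confirmed in the thread: according to the AutoHotkey RegEx quick reference, \w recognizes only ASCII characters by default, even on Unicode builds, unless the pattern begins with (*UCP). The pattern above relies on \w for folder names, which would explain why the match stops at the first Chinese character. A minimal sketch to check this on your own system (the names folder, plain and ucp are made up for the example):

```autohotkey
; Sketch only: requires a Unicode build of AutoHotkey v1.1 and a script
; file saved as UTF-8 with BOM so that the literal below is read correctly
folder := "漢字"

; Without (*UCP), \w is documented as limited to ASCII word characters
plain := RegExMatch(folder, "^\w+$")

; With (*UCP) at the start of the pattern, \w uses Unicode properties
ucp := RegExMatch(folder, "(*UCP)^\w+$")

MsgBox % "plain: " plain "`nwith (*UCP): " ucp
```

If \w is indeed ASCII-only by default, the first call finds no match and the second one does. Avoiding \w altogether, for example with a negated character class, sidesteps the question.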

Re: Regex help...

Posted: 20 Oct 2016, 14:33
by ahcahc
Try:
^[^:\\]+?:\\(?:[^:\\]+?(?:(?=;)|\\|$))*|(?<=;)[^:\\]+?:\\(?:[^:\\]+?(?:(?=;)|\\|$))*
It may need more testing.

Re: Regex help...

Posted: 20 Oct 2016, 14:36
by Helgef
drawback wrote:@evilC: Thank you!, but I guess I didn't describe it correctly :(
This string can consist of any combination (and numbers!) of serial number / volume label / drive letter :\ [<path with ; in it]
This creates an evilC-style regex of the "correct length":

Code: Select all

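; Count the colons in str (n receives the count); assumes one colon per path
RegExReplace(str,":",,n)
regex:="^"
; Append one capture group per path
Loop, % n
	regex.="(.+:\\.+);"
; Drop the trailing separator and anchor the end of the string
regex:=RTrim(regex,";") "$"
RegexMatch(str, regex, out)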
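
To read the results back, the captured groups can be looped over. The sketch below repeats the pattern building so that it runs on its own, with the sample string from the first post as an assumed input. It relies on RegExMatch's default output mode, which stores group 1 in out1, group 2 in out2 and so on.

```autohotkey
; Sketch only: sample input taken from the first post
str := "EED6-5BA8:\Us;e;rs;D:\Test;Storage:\A;B;C\"

; Build the pattern as above: one capture group per colon found
RegExReplace(str, ":", "", n)
regex := "^"
Loop, % n
    regex .= "(.+:\\.+);"
regex := RTrim(regex, ";") "$"

; In the default mode, RegExMatch stores group i in the variable out<i>
if RegExMatch(str, regex, out)
{
    Loop, % n
        MsgBox % out%A_Index%
}
```

Note that each group requires at least one character after :\, so an entry such as Storage:\ with nothing behind it would make the whole match fail.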

Re: Regex help...

Posted: 20 Oct 2016, 15:07
by drawback
@ahcahc
Thanks a lot, works!

@Helgef
Very... evil! :)