Help with regex

Discuss other programming languages besides AutoHotkey
brandonhotkey
Posts: 89
Joined: 03 Nov 2013, 05:46

26 Jan 2014, 07:46

Hello. I am working on a PHP project where I use a regular expression search to find (bad) words that should be masked with stars. Only the characters that match should be masked. I am now looking for a regex for this particular situation:

I have a filter for bad words in my native language. My language is more complicated than English, so it is hard to give a good example in English. I cannot find an English word or verb that makes a good example, but I will try.

I split the pattern into three parts: prefix, middle, and suffix. The middle never changes. The suffix can change. The prefix is either one specific letter or absent.


Let's take an example now:

Let's have the middle part: mil.

I want to find words like smile, smiles, smiling or mile, miles, miling,

but not words like mill, mills, milling, or words with a prefix other than s. I cannot find a realistic example here, so imagine words like emile, emiles, emiling, amile, amiles, amiling, and so on. There could also be prefixes like pre-, un-, and so on. These words must not be included in the result.
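For illustration, the prefix / middle / suffix idea for the English stand-in can be sketched as follows. This is only a sketch in Python rather than PHP (the pattern syntax is the same in PCRE), and it assumes each word is tested on its own and that the only allowed suffixes are e, es and ing, which is a guess based on the example words. The names WORD and is_target are made up for the sketch.

```python
import re

# Hypothetical pattern for the English stand-in example:
# optional prefix "s", fixed middle "mil", then one of the assumed suffixes.
WORD = re.compile(r"s?mil(?:es?|ing)")

def is_target(word: str) -> bool:
    """True only if the whole word is prefix + middle + suffix."""
    return WORD.fullmatch(word) is not None

wanted = ["smile", "smiles", "smiling", "mile", "miles", "miling"]
unwanted = ["mill", "mills", "milling", "emile", "emiles", "emiling", "amile"]

print([w for w in wanted if is_target(w)])    # all six words
print([w for w in unwanted if is_target(w)])  # []
```

Because fullmatch must consume the whole word, any other prefix (emile) or a different continuation after mil (mill) fails.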



In my native language I tried something similar:

\bv?ojet.?\b

where ojet is the middle part and v is the only possible prefix. But this does not find anything.
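One likely reason for the empty result, assuming the searched text really is one unbroken run of a-z as the note at the end of this post says: \b only matches between a word character and a non-word character (or at the string edge), so inside such a text there are no boundaries at all. A small Python check (the filler letters abc are invented for the test):

```python
import re

first_try = re.compile(r"\bv?ojet.?\b")

# A standalone word has a boundary at each end, so the pattern matches.
standalone = first_try.search("vojet")
print(standalone.group() if standalone else None)  # vojet

# The same word inside an unbroken run of letters: the only boundaries
# are at the very start and very end of the string, so nothing matches.
embedded = first_try.search("abcvojetabc")
print(embedded)  # None
```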

My second try was

[v^d]?ojed[ueo][usm]

this time with a different middle part: ojed. The suffix changes here according to declension. But this does not work either.
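Two details of the second try may explain the trouble. Inside a character class, ^ negates only when it is the first character, so [v^d] means "v, a literal caret, or d" and therefore accepts the d prefix instead of excluding it. Also, the final [usm] is mandatory, so a form that ends right after [ueo] cannot match. A quick Python check of both points (dojedes is just a made-up test string):

```python
import re

second_try = re.compile(r"[v^d]?ojed[ueo][usm]")

# "^" is not first in the class, so it is an ordinary character here.
caret_is_literal = re.fullmatch(r"[v^d]", "^") is not None
print(caret_is_literal)  # True

def found(text: str):
    """Return the matched substring, or None if nothing matches."""
    m = second_try.search(text)
    return m.group() if m else None

print(found("vojedes"))  # vojedes
print(found("dojedes"))  # dojedes (the d prefix is accepted, not rejected)
print(found("vojedu"))   # None (nothing left for the required [usm])
```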

In my language the word would be either ojet, in words like vojet, ojet, vojel, ojel, ojeli, or, for the second pattern, vojedu, vojedes, vojedem, and so on. These should be found and filtered. But there are other, similar words that should not be in the result and should not be filtered: dojel, dojeli, kojeni, tojetvoje, tvoje. So if the prefix is anything other than "v", the word should not be found. The last example is especially problematic. Notice that it does not contain t or d in the prefix. So if "oje" from the string "tvoje" is selected, that is wrong.


The common problem in my tests was that some characters of an incorrect word's middle part were captured. I need to capture only the characters of the correct word; in the English example that is mil, or smile, or smil... etc. So how can I specify a pattern that captures as many characters as possible from the correct word, but does not capture characters that are not present?



Note:

I work only with a-z (lowercase characters); there are no whitespace characters in the text.
sinkfaze
Posts: 613
Joined: 01 Oct 2013, 08:01

Re: Help with regex

27 Jan 2014, 09:36

Sounds like you need to deploy word boundaries:

; Words to test, comma-separated
list=vojet,ojet,vojel,ojel,ojeli,dojel,dojeli,tojetvoje,tvoje

; Report each word that matches between word boundaries (\b)
Loop, parse, list, `,
	if	RegExMatch(A_LoopField,"\bv?oje(?:t|li?)\b")
		MsgBox, %A_LoopField% is naughty naughty!
return
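For readers without AutoHotkey, here is a rough Python port of the same check, using the same word list and the same pattern. It is only a sketch and tests each word separately, as the loop above does; the names boundary and flagged are invented here.

```python
import re

words = "vojet,ojet,vojel,ojel,ojeli,dojel,dojeli,tojetvoje,tvoje".split(",")

# Same pattern as in the AutoHotkey snippet: optional "v", middle "oje",
# then "t", "l" or "li", all between word boundaries.
boundary = re.compile(r"\bv?oje(?:t|li?)\b")

flagged = [w for w in words if boundary.search(w)]
print(flagged)  # ['vojet', 'ojet', 'vojel', 'ojel', 'ojeli']
```

Since every list item is a separate string, the boundaries fall at the start and end of each word, which is why dojel, dojeli, tojetvoje and tvoje are left alone.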
brandonhotkey
Posts: 89
Joined: 03 Nov 2013, 05:46

Re: Help with regex

27 Jan 2014, 11:14

I have solved the problem. I created new patterns to fix my text:

(?:[^cdkprmt])oje[tl].?
(?:[^cdkprmt])ojed[ueo][usm]?
To see more, please visit:
http://forums.phpfreaks.com/topic/28568 ... -badwords/
There, the sentences to be filtered are shown in colors (red marks a result that does not work).

I think boundaries would not help, because the text I search in has no spaces.
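A possible variation, not the patterns posted above: the excluded letters can go into a negative lookbehind instead of a consuming class. A lookbehind checks the preceding character without making it part of the match, so that character is not masked, and a bad word at the very start of the text can still match. The sketch below is in Python, but PCRE in PHP accepts the same lookbehind syntax. The combined pattern, the names BAD and mask, and the run-on test string are all assumptions made for illustration.

```python
import re

# Same excluded letters as in the patterns above, moved into a lookbehind.
# First alternative: v?oje + t / l / li; second: v?ojed + [ueo] + optional [usm].
BAD = re.compile(r"(?<![cdkprmt])(?:v?oje(?:t|li?)|v?ojed[ueo][usm]?)")

def mask(text: str) -> str:
    """Replace every matched character with a star, keeping the length."""
    return BAD.sub(lambda m: "*" * len(m.group()), text)

print(mask("vojet"))           # *****
print(mask("vojedu"))          # ******
print(mask("tvoje"))           # tvoje
print(mask("tojetvoje"))       # tojetvoje
print(mask("tojevojetdojel"))  # toje*****dojel
```

In unbroken text this still depends on the list of excluded letters being complete for the language, so it is a starting point rather than a finished filter.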
