AutoHotkey Community

Posted: August 6th, 2010, 7:06 am

Joined: November 11th, 2008, 11:24 pm
Posts: 55
Location: Ashland
I saw this thread...

http://www.autohotkey.com/forum/topic49488.html

And it got me thinking about something else I'd like.

I'd like to find repeated sentences, or even paragraphs.

The first step I suppose would be this.

Would it be possible to search a body of text and display the longest repeated set of characters including spaces and punctuation?

The next step would be to cut out that phrase and append it to a new line in a list.

Repeating this, you'd end up with a wedge-shaped list.
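
The two steps above could be sketched roughly as follows. This assumes AutoHotkey v1.1 or later; LongestRepeat(), Wedge() and the Chr(1) barrier character are names and choices made up for this illustration, not part of any library. It is a brute-force search, so it is probably only practical for short texts, and matching is case-insensitive.

```autohotkey
; Sketch only. LongestRepeat() and Wedge() are hypothetical helper names.

text := "of course I use the phrase of course, of course you might think this is excessive, but I don't think this is excessive."
MsgBox % Wedge(text, 8)

; Returns the longest run of characters that occurs at least twice in text
; without overlapping itself, or "" if nothing of minLen characters repeats.
LongestRepeat(text, minLen=2, barrier="")
{
    If (minLen < 1)
        minLen := 1
    len := StrLen(text)
    n := len // 2                          ; a non-overlapping repeat cannot exceed half the text
    While (n >= minLen)
    {
        Loop % len - 2*n + 1               ; every start position that leaves room for a second copy
        {
            candidate := SubStr(text, A_Index, n)
            If (barrier != "" && InStr(candidate, barrier))
                Continue                   ; never match across a place where text was cut out
            If (InStr(text, candidate, false, A_Index + n))
                Return candidate           ; a second copy starts after the first one ends
        }
        n -= 1
    }
    Return ""
}

; Repeatedly cuts the longest repeat out of the text and appends it to a list,
; one phrase per line, longest first.
Wedge(text, minLen=2)
{
    barrier := Chr(1)                      ; assumption: this character never occurs in the text
    list := ""
    Loop
    {
        phrase := LongestRepeat(text, minLen, barrier)
        If (phrase = "")
            Break
        list .= phrase "`n"
        ; replace every copy with the barrier so the neighbouring text does not fuse together
        StringReplace, text, text, %phrase%, %barrier%, All
    }
    Return list
}
```

Because the comparison is per character, a result can start or end in the middle of a word. Splitting on words first would avoid that, at the cost of ignoring punctuation and spacing.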

Any suggestions would be appreciated.

Addition:

What I'm after is the same as this question asked of a different community.

http://stackoverflow.com/questions/1928 ... dy-of-text


Last edited by Innomen on August 6th, 2010, 8:54 am, edited 2 times in total.

Posted: August 6th, 2010, 8:08 am

Joined: May 27th, 2007, 9:41 am
Posts: 4999
Just to be sure I understand, you want to:

- remove duplicate lines
- sort the file from the longest lines to the shortest

Before:
Quote:
hello this is a test
bye
hi there who are
you are using AutoHotkey
hello this is a test
yes that is correct


After:
Quote:
you are using AutoHotkey
hello this is a test
yes that is correct
hi there who are
bye

_________________
AHK FAQ
TF : Text files & strings lib, TF Forum


 Post subject: Maybe.
Posted: August 6th, 2010, 8:26 am

Joined: November 11th, 2008, 11:24 pm
Posts: 55
Location: Ashland
I just want to data mine a body of text for repeated phrases of any length, in whatever way is easiest for everyone.

Yes, yours is one example of how to present the answer.


Posted: August 6th, 2010, 9:36 am

Joined: May 27th, 2007, 9:41 am
Posts: 4999
This is part of some code currently in development for my TF lib; give it a spin:
Code:
VarToSort=
(
hello this is a test
bye
hi there who are
you are using AutoHotkey
you are using AutoHotkey
you are using AutoHotkey
you are using AutoHotkey
hello this is a test
yes that is correct
)

Sort, VarToSort, F _AscendingLinesL U  ; F = custom comparison function, U = remove duplicate lines
MsgBox % VarToSort

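_AscendingLinesL(a1, a2)
{
    ; Longest line first. Ties are broken alphabetically so that identical lines
    ; always end up next to each other, which U relies on with a custom function.
    If (StrLen(a1) != StrLen(a2))
        Return StrLen(a2) - StrLen(a1)
    Return a1 < a2 ? -1 : (a1 > a2 ? 1 : 0)
}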


 Post subject: No.
Posted: August 6th, 2010, 5:55 pm

Joined: November 11th, 2008, 11:24 pm
Posts: 55
Location: Ashland
I get an error message, missing bracket at line 018 or something? I'll tinker with it here in a little bit.


Posted: August 6th, 2010, 6:27 pm

Joined: May 27th, 2007, 9:41 am
Posts: 4999
That probably means you didn't copy the code correctly, as it runs as intended and produces the result shown a few posts above.

 Post subject: It doesn't matter
Posted: November 5th, 2010, 7:32 pm

Joined: November 11th, 2008, 11:24 pm
Posts: 55
Location: Ashland
If I'm understanding that code correctly, all it does is sort lines.

I need to be able to pull things out of lines as well.

For example, if it encountered the sentence "of course I use the phrase of course, of course you might think this is excessive, but I don't think this is excessive."

It would produce:

think this is excessive
think this is excessive
of course
of course
of course

Is that helpful?

I'm wanting to linguistically study large batches of text, and part of what I need is to data mine for repeated phrases (or whole sentences/paragraphs, such as commonly used quotes or the like).


Posted: November 6th, 2010, 12:43 am

Joined: May 27th, 2007, 9:41 am
Posts: 4999
Sorry, can't help. Would be a difficult task unless you can find an algorithm to find repeating texts. (Unless you already know what you want to search for)

http://www.autohotkey.com/forum/viewtopic.php?t=7269
http://www.autohotkey.com/forum/viewtopic.php?t=59407

http://stackoverflow.com/questions/2340 ... -fuzzyness

Posted: November 6th, 2010, 12:52 am

Joined: November 1st, 2010, 3:13 pm
Posts: 62
This looks like something where you need a real programming language.
It would be pretty simple with Assembler (well, not simple), or COBOL or REXX,
and I'm sure there are dozens of other languages that would do the trick easily (but these are the ones I know).

The other problem, and this would crop up quickly, is how long the repeated text must be. If a single character counts, you'd get zillions of hits with large files.
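
To get a feel for how the minimum length affects the number of hits, something like the following could be tried. It is only a sketch, assuming AutoHotkey v1.1 or later (it uses objects); CountRepeats() is a made-up name and the lengths 1, 2, 4, 8 and 16 are arbitrary. It counts how many distinct strings of exactly n characters occur more than once, overlapping occurrences included.

```autohotkey
; Sketch only. CountRepeats() is a hypothetical helper name.

text := "of course I use the phrase of course, of course you might think this is excessive, but I don't think this is excessive."
report := ""
For each, n in [1, 2, 4, 8, 16]
    report .= n " characters: " CountRepeats(text, n) " distinct repeated strings`n"
MsgBox % report

; Counts the distinct substrings of exactly n characters that occur more than once.
; Note: object keys are case-insensitive in AutoHotkey v1, so "Of" and "of" count as the same string.
CountRepeats(text, n)
{
    If (n < 1 || n > StrLen(text))
        Return 0
    seen := {}
    repeats := 0
    Loop % StrLen(text) - n + 1
    {
        ; the "|" prefix keeps substrings from colliding with special keys such as "base"
        ; and stops numeric-looking substrings from being turned into integer keys
        key := "|" . SubStr(text, A_Index, n)
        count := seen.HasKey(key) ? seen[key] + 1 : 1
        seen[key] := count
        If (count = 2)                     ; count each repeated substring once, on its second sighting
            repeats += 1
    }
    Return repeats
}
```

Comparing the counts for small and large n on a real file should show roughly where a sensible cut-off lies for that text.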


Posted: November 6th, 2010, 5:16 pm
Code:
#NoEnv
SetBatchLines, -1

text =
(
of course I use the phrase of course, of course you might think this is excessive,
but I don't think this is excessive.
)

; going from long to short can avoid sub-phrases if you so wish
MsgBox % DupPhrase(text, 4)

MsgBox % DupPhrase(text, 3)

MsgBox % DupPhrase(text)


DupPhrase(text, wordsInPhrase=2, disregard=""){
   Report := "words in phrase: " wordsInPhrase "`n`n"
   trunc := text
   trunc := RegExReplace(trunc, "s)[,;:\.]*")                ; ignore punctuation
   trunc := RegExReplace(trunc, "\s+", " ") . " "            ; line breaks count as word breaks; trailing space allows last phrase
   Loop
   {
      StringReplace, dummy, trunc, %A_Space%, , UseErrorLevel
      If ( ErrorLevel < wordsInPhrase )                  ; if there are too few spaces left to form a phrase
         Break
      
      StringGetPos, splicePos, trunc, %A_Space%, L%wordsInPhrase%
      
      phrase := SubStr( trunc, 1, splicePos )               ; get a new phrase from start
      
      trunc := SubStr( trunc, InStr(trunc, A_Space) + 1 )     ; remove one further word from text (before any Continue, or the loop never advances)

      If ( InStr(disregard, phrase) )                     ; skip if to be ignored
         Continue
      
      If phrase Not in %old_Phrases%                     ; if it's an unfamiliar phrase
      {
         old_Phrases .= phrase ","
         StringReplace, dummy, text, %phrase%, , UseErrorLevel
         If ErrorLevel > 1                           ; if more than one occurrence found
            Report .= "phrase /" phrase "/ was found " ErrorLevel " times.`n"
      }
   }
   Return Report
}

