AutoHotkey Community

All times are UTC [ DST ]




Posted: November 19th, 2010, 5:01 pm
Hi,

I've got two txt files which contain about 4000 words, and I want to list all the words that do not appear in both lists:

example:

listA contains:
about adv., prep.
above prep., adv.
abroad adv.
absence n.
absent adj.


listB contains:

about prep S1, W1
about adv S1, W1
above adv, prep S2, W1
above adj W3
abroad adv S2, W3
absence n S3, W2
absolute adj S2, W3



In this case I want to list the words "absent" and "absolute", because they are NOT in both lists.

How can I do this? Thanks in advance.


Posted: November 19th, 2010, 6:01 pm
This is a bit sloppy but should work for what you need:

Code:
Loop, Read, ListA.txt
{
   CurrentLine := A_LoopReadLine
   Loop, Read, ListB.txt, NewList.txt
   {
      If (InStr(A_LoopReadLine, CurrentLine))
         FileAppend, %A_LoopReadLine%`n
   }
}

Posted: November 19th, 2010, 6:07 pm
This won't work, MacroMan! Try it: it compares whole lines and appends the lines that match instead of the words that are missing.
Here is an example with AHK_L and objects:
Code:
list1=
(
about adv., prep.
above prep., adv.
abroad adv.
absence n.
absent adj.
)
list2=
(
about prep S1, W1
about adv S1, W1
above adv, prep S2, W1
above adj W3
abroad adv S2, W3
absence n S3, W2
absolute adj S2, W3
)
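; one object per list, keyed by the first word of each line
obj1:=Object()
obj2:=Object()
; the While runs once for list1 and once for list2; idx keeps the outer
; A_Index, because the inner Loop has its own A_Index
While A_Index<3    ,    idx:=A_Index
   Loop,Parse,List%A_Index%,`n,`r
   {
      obj%idx%[(in:=InStr(A_LoopField," "))?SubStr(A_LoopField,1,in-1):A_LoopField]:=A_LoopField
   }
; report every key that is missing from the other object
for k in obj1
   If !obj2[k]
      out.="List 1: " k "`n"
for k in obj2
   If !obj1[k]
      out.="List 2: " k "`n"
MsgBox % out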

Posted: November 19th, 2010, 8:03 pm
Here's my take on it. The idea is to "tidy up" the lists first and convert them to comma-separated lists, so that we can use AHK's "if var in ..." command.

Code:
#NoEnv
#SingleInstance force
SetBatchLines -1

listA =
(
about adv., prep.
above prep., adv.
abroad adv.
absence n.
absent adj.
)

listB =
(
about prep S1, W1
about adv S1, W1
above adv, prep S2, W1
above adj W3
abroad adv S2, W3
absence n S3, W2
absolute adj S2, W3
)

listA := TidyList(listA)
listB := TidyList(listB)

; add items from listA which are not in listB
Loop, Parse, listA, `,
{
  if A_LoopField not in %listB%
    result .= A_LoopField ","
}
; add items from listB which are not in listA
Loop, Parse, listB, `,
{
  if A_LoopField not in %listA%
    result .= A_LoopField ","
}
StringTrimRight, result, result, 1   ; remove trailing comma

Msgbox % result

ExitApp

; ===============================

; get a comma-separated list of just the words
TidyList(list) {
  ; Note: Could use VarSetCapacity() here to boost performance?
  Loop, Parse, list, `n, `r
  {
    ; get rid of the clutter - we just want the word
    StringLeft, word, A_LoopField, InStr(A_LoopField, A_Space)-1
    tidyList .= word ","
  }
  StringTrimRight, tidyList, tidyList, 1   ; remove trailing comma
  ; remove duplicates - this should help performance I would think
  Sort, tidyList, U D,
  return tidyList
}
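
The examples here keep both lists inline. To run the same script against the real files, the two continuation sections could be replaced with FileRead, roughly as sketched below. The file names ListA.txt and ListB.txt are an assumption taken from the first reply; use whatever your files are called.

Code:
#NoEnv

; Assumed file names, looked up in the script's working directory.
FileRead, listA, ListA.txt
if ErrorLevel
{
  MsgBox Could not read ListA.txt
  ExitApp
}
FileRead, listB, ListB.txt
if ErrorLevel
{
  MsgBox Could not read ListB.txt
  ExitApp
}

; From here on, listA and listB can go through TidyList() as before.
MsgBox % "Read " StrLen(listA) " and " StrLen(listB) " characters."

TidyList() splits on `n and drops `r, so Windows line endings in the files should not matter.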


Posted: November 19th, 2010, 9:05 pm
Hey, thank you very much, guys!

I also tried to code it on my own, using regex, and it's working:
Code:
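Loop, Read, FileA.txt
{
   unique = 1
   ; first word of the current FileA line
   RegExMatch(A_LoopReadLine, "^\w+", FileAWord)

   Loop, Read, FileB.txt
   {
      ; first word of the current FileB line
      RegExMatch(A_LoopReadLine, "^\w+", FileBWord)

      If (FileBWord == FileAWord)
      {
         unique = 0
         break   ; match found, no need to read the rest of FileB
      }
   }

   ; back in the outer loop, A_LoopReadLine is the FileA line again
   If (unique == 1)
   {
      FileAppend, %A_LoopReadLine%`n, Output.txt
   }
}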


But I still have one issue: there are some entries that contain more than just one word.

for example in listA:
all right adj., adv., exclamation
businessman, businesswoman n.


or in listB:
good morning interjection S2
good night interjection S3


The scripts only compare the first word, so they won't consider "right" or "businesswoman".
They will also compare "good morning" and "good night" as just "good", which is bad too.

Any ideas how to fix this?
It doesn't matter whether you use my script, roland's script, or autokeyit's script. I just want to get it done :)

thanks


Posted: November 19th, 2010, 10:00 pm
This is fairly tricky. The easiest way would be to get hold of the data in a better format. Failing that, you are going to have to compile a list of everything that can follow the actual word(s), e.g. "adv", "prep", "n", etc. Then you can try, for each line, to figure out where the actual word ends and the noise starts.

Below is my attempt, based on the example data you've given.
It treats entries like "businessman, businesswoman" as two separate words for simplicity's sake. It probably doesn't matter, though.


Code:
#NoEnv
#SingleInstance force
SetBatchLines -1

listA =
(
about adv., prep.
above prep., adv.
abroad adv.
absence n.
absent adj.
all right adj., adv., exclamation
businessman, businesswoman n.
)

listB =
(
about prep S1, W1
about adv S1, W1
above adv, prep S2, W1
above adj W3
abroad adv S2, W3
absence n S3, W2
absolute adj S2, W3
good morning interjection S2
good night interjection
)

listA := TidyList(listA)
listB := TidyList(listB)

; add items from listA which are not in listB
Loop, Parse, listA, `,
{
  if A_LoopField not in %listB%
    result .= A_LoopField ","
}
; add items from listB which are not in listA
Loop, Parse, listB, `,
{
  if A_LoopField not in %listA%
    result .= A_LoopField ","
}
StringTrimRight, result, result, 1   ; remove trailing comma

Msgbox % result

ExitApp

; ===============================

; get a comma-separated list of just the words
TidyList(list) {
  static noise := "adv|prep|n|adj|exclamation|interjection"
  ; Note: Could use VarSetCapacity() here to boost performance?
  Loop, Parse, list, `n, `r
  {
    ; get rid of the clutter - we just want the word
    ; regex: keep matching chars a-z, spaces and commas, until we find
    ; a "noise" sequence followed by a comma, period or space
    haystack := A_LoopField . " "  ; indicate end of line
    if !RegExMatch(haystack, "iU)^([a-z ,]+) (" noise ")[,. ]", m)
      msgbox ERROR - No match for: %A_LoopField%
    words := m1
    ; e.g. "businessman, businesswoman" becomes 2 separate entries
    Loop, Parse, words, `,
    {
      ; plain assignment trims the space left after the comma (AutoTrim)
      word = %A_LoopField%
      tidyList .= word ","
    }
  }
  StringTrimRight, tidyList, tidyList, 1   ; remove trailing comma
  ; remove duplicates - this should help performance I would think
  Sort, tidyList, U D,
  return tidyList
}


Posted: November 19th, 2010, 11:12 pm
Wow, thank you! Nice work, nice skills!

I've added all the other noise words like "adv", "n", "v" and so on.
However, there are actually some entries that have no noise words at all.

for example:
access n.
accident n.
by accident
accidental adj.

Then it tells me: "ERROR - No match for: by accident".
I know that instead of showing that message I simply have to add the entry to the list, but I don't know how.
And another question: how do I change it to output the original lines as the result? (I need the output to be as it was: "accidental adj.", "above adv, prep S2, W1", and so on.)

thank you


Posted: November 21st, 2010, 3:15 pm
Anonymous wrote:
And another question: how do I change it to output the original lines as the result? (I need the output to be as it was: "accidental adj.", "above adv, prep S2, W1", and so on.)


OK - but what happens to entries that are listed more than once? Should they all be included in the result set?

Perhaps you can explain what you're actually trying to do...
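
For what it's worth, here is a rough sketch of one way the two open requests could be handled. None of it was confirmed by the posters, so treat every choice as an assumption: a line with no noise word is taken as a complete entry, the original lines are kept so they can be output unchanged, a line is reported as soon as any one of its words is missing from the other list, and repeated entries are all reported, since the question about duplicates was left open. The function names HeadWords, WordSet and MissingLines are made up for this example; the noise pattern is the one from the script above and the sample lines come from the posts. It needs AHK_L for the objects.

Code:
#NoEnv

listA =
(
access n.
accident n.
by accident
accidental adj.
businessman, businesswoman n.
)

listB =
(
access n S2, W1
accident n S2, W2
absolute adj S2, W3
)

setA := WordSet(listA)
setB := WordSet(listB)
MsgBox % "Only in list A:`n" MissingLines(listA, setB) "`nOnly in list B:`n" MissingLines(listB, setA)
ExitApp

; Return the word part of a line, or the whole line if no noise word follows.
HeadWords(line) {
  static noise := "adv|prep|n|adj|exclamation|interjection"
  ; same pattern as above; the appended space marks the end of the line
  if RegExMatch(line . " ", "iU)^([a-z ,]+) (" noise ")[,. ]", m)
    return m1
  return line
}

; Build an object whose keys are all the words found in a list.
WordSet(list) {
  set := Object()
  Loop, Parse, list, `n, `r
  {
    words := HeadWords(A_LoopField)
    Loop, Parse, words, `,
    {
      ; plain assignment trims surrounding spaces (AutoTrim)
      word = %A_LoopField%
      if (word != "")
        set[word] := true
    }
  }
  return set
}

; Return the original lines that have at least one word missing from otherSet.
MissingLines(list, otherSet) {
  Loop, Parse, list, `n, `r
  {
    line := A_LoopField
    words := HeadWords(line)
    Loop, Parse, words, `,
    {
      word = %A_LoopField%
      if (word != "" && !otherSet.HasKey(word))
      {
        out .= line "`n"
        break   ; one missing word is enough, list the line only once
      }
    }
  }
  return out
}

The remaining noise words ("v" and so on) would still have to be added to the noise list in HeadWords. Object keys are compared without regard to case, which seems to fit word lists like these.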

