This is fairly tricky - the easiest way would be to get hold of the data in a cleaner format. Failing that, you will have to compile a list of everything that can follow the actual word(s), e.g. "adv", "prep", "n", etc. Then, for each line, you can try to work out where the actual word ends and the noise starts.
Below is my attempt, based on the example data you've given.
It treats entries like "businessman, businesswoman" as two separate words for simplicity's sake - it probably doesn't matter though.
Code:
#NoEnv
#SingleInstance force
SetBatchLines -1
listA =
(
about adv., prep.
above prep., adv.
abroad adv.
absence n.
absent adj.
all right adj., adv., exclamation
businessman, businesswoman n.
)
listB =
(
about prep S1, W1
about adv S1, W1
above adv, prep S2, W1
above adj W3
abroad adv S2, W3
absence n S3, W2
absolute adj S2, W3
good morning interjection S2
good night interjection
)
listA := TidyList(listA)
listB := TidyList(listB)
; add items from listA which are not in listB
Loop, Parse, listA, `,
{
    if A_LoopField not in %listB%
        result .= A_LoopField ","
}
; add items from listB which are not in listA
Loop, Parse, listB, `,
{
    if A_LoopField not in %listA%
        result .= A_LoopField ","
}
StringTrimRight, result, result, 1 ; remove trailing comma
Msgbox % result
ExitApp
; ===============================
; get a comma-separated list of just the words
TidyList(list) {
    static noise := "adv|prep|n|adj|exclamation|interjection"
    ; Note: Could use VarSetCapacity() here to boost performance?
    Loop, Parse, list, `n, `r
    {
        ; get rid of the clutter - we just want the word
        ; regex: keep matching chars a-z, spaces and commas, until we find
        ; a "noise" sequence followed by a comma, period or space
        haystack := A_LoopField . " " ; indicate end of line
        if !RegExMatch(haystack, "iU)^([a-z ,]+) (" noise ")[,. ]", m)
        {
            MsgBox ERROR - No match for: %A_LoopField%
            continue ; nothing usable on this line, so skip it
        }
        words := m1
        ; e.g. "businessman, businesswoman" becomes 2 separate entries
        Loop, Parse, words, `,
        {
            ; legacy assignment trims the surrounding whitespace
            word = %A_LoopField%
            tidyList .= word ","
        }
    }
    StringTrimRight, tidyList, tidyList, 1 ; remove trailing comma
    ; remove duplicates - this should help performance I would think
    Sort, tidyList, U D,
    return tidyList
}
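If you want to see which part-of-speech tags are still missing from the noise list before running the full comparison, a small stand-alone check along these lines might help. It is only a sketch: it assumes AutoHotkey v1.1 or later (for the array literal and the for-loop), it reuses the same regex as TidyList(), and the line "can modal v." is made up purely to show a tag that the list does not cover yet.

```autohotkey
#NoEnv
; same alternation as the static "noise" variable in TidyList()
noise := "adv|prep|n|adj|exclamation|interjection"

; "can modal v." is a hypothetical line, not taken from the example data
samples := ["all right adj., adv., exclamation", "good night interjection", "can modal v."]

report := ""
for index, line in samples
{
    ; append a space so a tag at the very end of the line still matches
    if RegExMatch(line . " ", "iU)^([a-z ,]+) (" noise ")[,. ]", m)
        report .= line "  ->  " m1 "`n"
    else
        report .= line "  ->  NO MATCH (extend the noise list)`n"
}
MsgBox % report
```

With the noise list as it stands, the first two lines should come out as "all right" and "good night", while the made-up third line should be reported as unmatched until "modal" and "v" are added to the list.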