Currently, I'm reviewing a single document which is 4,377 pages long. I'm guessing that, if duplicates were removed, it would be 1,000 pages or fewer.
How can I remove duplicates?
The files are usually .pdf, but I can convert them to .txt or Word.
I've seen people on the forums talking about scripts that can remove duplicate lines in a file (i.e., keep the first instance of a line and delete all subsequent instances of that identical line), but that wouldn't work here.
For example, if I had the following two emails in a document:

Dear Joe,
Yes, I will meet you Friday.
Sincerely,
Bob

and

Dear Joe,
I will never do business with you.
Regards,
Bob

A line-duplication script would delete "Dear Joe" and "Bob" in both emails (since those lines are identical), even though the bodies of the emails ("meet you Friday" versus "never do business") are different.
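To make the limitation concrete, here is a rough sketch of the line-by-line approach described above (in Python rather than AHK, just to illustrate the logic; the function name `dedupe_lines` is mine). It keeps only the first occurrence of each exact line, which is exactly why it would wrongly strip "Dear Joe," and "Bob" out of the second email:

```python
def dedupe_lines(text: str) -> str:
    """Keep the first occurrence of each exact line; drop later repeats."""
    seen = set()
    out = []
    for line in text.splitlines():
        if line not in seen:
            seen.add(line)
            out.append(line)
    return "\n".join(out)
```

Feeding both example emails through this would merge them into one mangled email, since the salutation and signature lines only survive once.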
So the question is whether it's possible for a script to compare larger portions of the text. For example, a script could loop through a file comparing 5 lines of text at a time, which would be more likely to catch an identical email that appears multiple times. If an email with 10 lines of text in it is duplicated multiple times in a single file, the script would delete the first 5 lines of the duplicates on one pass, and then delete the second 5 lines of the duplicates on a later pass.
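A simpler variant of the multi-line comparison idea is to compare whole blocks rather than fixed 5-line windows: if the emails are separated by blank lines (that separator is my assumption, not something stated in the question), the file can be split into blank-line-delimited blocks and each whole block de-duplicated. A sketch in Python (the name `dedupe_blocks` is mine):

```python
import re

def dedupe_blocks(text: str) -> str:
    """Remove duplicate blank-line-separated blocks (e.g. whole emails),
    keeping the first occurrence of each block."""
    # Assumes each email/block is separated by one or more blank lines.
    blocks = re.split(r"\n\s*\n", text.strip())
    seen = set()
    out = []
    for block in blocks:
        key = block.strip()
        if key not in seen:
            seen.add(key)
            out.append(block)
    return "\n\n".join(out)
```

Because each block is compared as a unit, the two example emails above would both survive, while an exact repeat of either one would be dropped. The same logic could be ported to an AHK script if the blocks can be reliably delimited.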
I don't know whether this is possible with AHK.