search indexing functions - stdlib
I am interested in creating an open search index (GPL2).
Features:
1. single complete word queries
2. inverted index on disk
3. search ranking based on term frequency and inverse document frequency (tf-idf)
Latest benchmark: 6000 Wikipedia articles (150 MB) indexed in 90 seconds; index (30 MB) written to disk in 30 seconds. RAM usage: 200 MB.
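The core of features 1 and 2 — an inverted index answering single whole-word queries — can be sketched as follows. This is a Python illustration of the data structure, not the AHK script itself; the document names and texts are made up.

```python
# Minimal inverted index: map each whole word to the set of
# documents containing it, then single-word lookup is a dict access.
from collections import defaultdict
import re

def build_inverted_index(docs):
    """docs: {doc_name: text} -> {word: set of doc_names}"""
    index = defaultdict(set)
    for name, text in docs.items():
        # lowercase and split on non-alphanumerics (no punctuation kept,
        # matching the script's current limitation)
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(name)
    return index

docs = {
    "a.txt": "the quick brown fox",
    "b.txt": "the lazy dog",
}
index = build_inverted_index(docs)
print(sorted(index["the"]))  # "the" occurs in both documents
```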
Download: ahksearch.git
Demo: search RosettaCode demosite
Translations: rosettacode
Todo:
1. Allow some punctuation and special characters in the index: '-_$@...'.
2. more complex searches: boolean, phrases, regular expressions
3. interface with git based index database on disk
http://www.google.co... ... afe=images
Limitations:
No punctuation allowed in search words.
Many others...
Would your indexing algorithm work for simply indexing file names and other property info? For example, if I wanted to delete all Evernote files from my computer, a search for files by the "Evernote" company would reveal the three files shown below, whereas I'd never think to search by "RiteShape" etc. In particular, some antivirus software may leave trash files in your system folders with very random names you'd never think to use in a name search, but often the company (Symantec, Kaspersky, ...) is listed in the Company name field.
Since I don't know of any search engine (and I've looked) that lets you search in the company, description, author, keywords, etc. fields, I'm wondering if I could fill that gap myself with AHK... Your script should be able to handle that, right, if I can just figure out how to 'read' file properties into the "Contents" field rather than the actual contents? FWIW, StatusBarGetText can grab them in an open Explorer window if those 'details' have been checked for display...
Hardware: fast laptop with SSD
Software: Win 7 Home Premium 64-bit, android for phone and tablet
"Is the script ultimately meant to implement conventional indexing (write an index file)?"
Yeah, the idea was to have a text-based index that would be easy to merge with other people's indices, possibly with something like git.
Currently, both the forward index and the inverted index are just a bunch of variables in memory. Search is implemented with the built-in AHK mechanisms for looking up normal variables.
I have not decided how I am going to store them on disk yet.
The next algorithms to implement are
1. term frequency
2. inverse document frequency
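Those two numbers combine into the tf-idf ranking mentioned in the feature list: score(term, doc) = tf(term, doc) * log(N / df(term)), so a term counts more when it is frequent in a document but rare across the collection. A Python sketch with made-up documents (not the AHK implementation):

```python
# tf-idf scoring: tf = raw count of the term in the document,
# idf = log(total docs / docs containing the term).
import math
from collections import Counter

def tf_idf_scores(term, docs):
    """docs: {name: list of words} -> {name: tf-idf score for term}"""
    n = len(docs)
    df = sum(1 for words in docs.values() if term in words)
    if df == 0:
        return {}
    idf = math.log(n / df)
    return {name: Counter(words)[term] * idf
            for name, words in docs.items() if term in words}

docs = {
    "a": "cat cat dog".split(),
    "b": "dog dog dog".split(),
    "c": "bird".split(),
}
scores = tf_idf_scores("cat", docs)  # only doc "a" contains "cat"
```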
"Your script should be able to handle that, right, if I can just figure out how to 'read' file properties into the 'Contents' field rather than the actual contents?"
Yes, that should work.
In the function wordsinfile(file), just change
FileRead, contents, %file%
to a call like
contents := GetProperties(file)
where GetProperties(file) { getProperties code here } contains your property-reading code.
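The substitution described above amounts to making the content source pluggable: the tokenizer doesn't care whether the text came from the file's contents or its metadata. A Python sketch of the pattern; the function names mirror the AHK ones but are otherwise hypothetical, and the property reader is a stub rather than real OS metadata access.

```python
# Pluggable content source: the indexer tokenizes whatever the
# supplied reader returns for a given file.
def words_in_file(path, read_content):
    """Tokenize the text produced by read_content(path)."""
    return read_content(path).lower().split()

def read_file_contents(path):
    # Analogue of AHK's "FileRead, contents, %file%".
    with open(path, encoding="utf-8", errors="ignore") as f:
        return f.read()

def read_file_properties(path):
    # Analogue of the hypothetical GetProperties(file). A real version
    # would query OS metadata (company, author, description, ...);
    # here it is faked to keep the sketch self-contained.
    return "Evernote Corporation"

# Index by properties instead of contents by swapping the reader:
words = words_in_file("dummy.exe", read_file_properties)
```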
Thank you. I'll add that to my pending projects.
It's strange how I initially learned AHK to save time (macros), but the opposite happened because I got hooked on writing scripts with it for everything. Of course, it's all supposed to pay off in the end, but I dunno... That said, I kind of like it.
Still no attempted optimization.
I ran ahksearch on a small Wikipedia collection from here.
Results: 408,442 words in 6,043 files indexed in 22 minutes.
Space: 540 MB (probably about 2 million variables).
Lookup: instantaneous.
Comparison:
about 1/5 as fast as Python
about 1/20 as fast as Xapian
Todo:
1. try binary arrays.
2. phrase search, boolean search etc...
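On the boolean-search todo item: with an inverted index of word → set of documents, AND is just set intersection and OR is union. A minimal sketch with hypothetical postings (not the script's actual data layout):

```python
# Boolean search over posting sets: AND = intersection, OR = union.
index = {
    "quick": {"a.txt", "c.txt"},
    "fox":   {"a.txt"},
    "dog":   {"b.txt", "c.txt"},
}

def search_and(index, *terms):
    """Documents containing every term."""
    sets = [index.get(t, set()) for t in terms]
    return set.intersection(*sets) if sets else set()

def search_or(index, *terms):
    """Documents containing at least one term."""
    result = set()
    for t in terms:
        result |= index.get(t, set())
    return result
```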
Even if I had read the tf-idf thing you linked, I wouldn't have known how full-text search works.
An Autohotkey script gives me a full picture.
And about the use of arrays, well, I am not sure it will make it faster.
In my tests in other scripts a few months ago, the native AutoHotkey string "array" was much faster than any of the array scripts presented in the forum. They are convenient to use, but in this case speed matters, I think.
How about using Cheetah or SQLite from the forum?
Normal query and retrieval speed is a bit slower than SQLite, but if you use index search it's as fast as SQLite (also with an indexed db). For what it's worth, in my case Cheetah has been very reliable and very intuitive to use, thanks to SKAN.
But I'm not sure if the writing speed is better than SQLite's.
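For the SQLite route, one plausible shape for the inverted index is a postings table with one (word, doc) row per occurrence and an index on the word column, so single-word lookups stay fast. A sketch using Python's stdlib sqlite3 module; the schema and data are illustrative, not what Cheetah or the AHK script actually use.

```python
# Inverted index stored in SQLite: postings table plus an index
# on the word column for fast term lookups.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE postings (word TEXT, doc TEXT)")
conn.execute("CREATE INDEX idx_word ON postings (word)")
conn.executemany("INSERT INTO postings VALUES (?, ?)", [
    ("fox", "a.txt"),
    ("dog", "b.txt"),
    ("dog", "c.txt"),
])

# Single-word query = indexed lookup on the word column.
docs = [row[0] for row in conn.execute(
    "SELECT doc FROM postings WHERE word = ? ORDER BY doc", ("dog",))]
```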
But it's a few years old, so I'm still contemplating whether to use it or to try something else. Also, I need to find out whether that list of file types covers anything I might need to find by company, description, author, etc.
But now it does 6000 of the same Wikipedia articles as above in 90 seconds.
Enabled writing the inverted index to disk.
Faster search.
As you said, phrase and boolean support would be really great.
I don't know if I'm going too far, but when you say 'phrase search', does it mean the words are in the same file, or that they are located nearby, say in the same line?
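For what it's worth, 'phrase search' usually means something stricter than same-file or same-line: the index stores word positions, and a phrase matches only where the words occur at consecutive positions. A minimal positional-index sketch with made-up documents (not the AHK script's approach):

```python
# Positional inverted index: word -> {doc: [positions]}, so a phrase
# matches where each word appears one position after the previous.
from collections import defaultdict

def build_positional_index(docs):
    """docs: {name: text} -> {word: {name: [positions]}}"""
    index = defaultdict(lambda: defaultdict(list))
    for name, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            index[word][name].append(pos)
    return index

def phrase_search(index, phrase):
    """Documents where the phrase's words occur consecutively."""
    words = phrase.lower().split()
    if not words:
        return set()
    result = set()
    for doc, positions in index.get(words[0], {}).items():
        for p in positions:
            # every following word must sit at the next position
            if all(p + i in index.get(w, {}).get(doc, [])
                   for i, w in enumerate(words[1:], 1)):
                result.add(doc)
                break
    return result

docs = {"a": "the quick brown fox", "b": "brown the quick"}
idx = build_positional_index(docs)
matches = phrase_search(idx, "quick brown")  # only doc "a" matches
```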
I replaced
"FileRead, contents, %file%"
with
"contents := FileGetFullVer(file, 1|2|4|8|0xFFFF, "`t")"
and of course included wOxxOm's FileGetFullVer at the bottom of the script.
Seems to work, though the indexing takes a while.