AutoHotkey Community

All times are UTC [ DST ]




[ 22 posts ]  Page 1 of 2

Do you use ahksearch?
yes
no
PostPosted: February 28th, 2009, 7:45 am 
Joined: August 3rd, 2007, 8:01 am
Posts: 555
Location: Houston, TX
AhkSearch is a search engine in AutoHotkey.
I am interested in creating an open search index (GPL2).

Features:
1. single complete word queries
2. inverted index on disk
3. search ranking based on term frequency and inverse document frequency (tf-idf)
4. Latest benchmark: 6000 Wikipedia articles (150 megabytes) indexed in 90 seconds; index (30 megabytes) written to disk in 30 seconds. RAM: 200 megabytes. Comparison.

Download: ahksearch.git

Demo: search RosettaCode demosite
Translations: rosettacode

Todo:
1. Allow some punctuation and special characters in the index: '-_$@...'.
2. more complex searches: boolean, phrases, regular expressions
3. interface with git based index database on disk


Last edited by tinku99 on April 21st, 2010, 12:19 am, edited 17 times in total.

PostPosted: February 28th, 2009, 4:51 pm 
Joined: November 7th, 2006, 9:47 pm
Posts: 1934
Location: Germany
You can experiment with writing such a program in AHK; I find it interesting too. But there are other file indexing programs that also run as services and work fast and very well.


PostPosted: March 3rd, 2009, 9:21 pm 
Joined: September 8th, 2008, 8:38 pm
Posts: 33
Could Google's advanced search help? Like:

http://www.google.com/search?hl=en&as_q ... afe=images


PostPosted: May 7th, 2009, 4:57 am 
Joined: August 3rd, 2007, 8:01 am
Posts: 555
Location: Houston, TX
solved.
See first post in thread.


PostPosted: May 8th, 2009, 11:32 pm 
Joined: August 3rd, 2007, 8:01 am
Posts: 555
Location: Houston, TX
Edit:

Limitations:
No punctuation allowed in the search word.
Many others...


Last edited by tinku99 on May 20th, 2009, 9:52 pm, edited 2 times in total.

PostPosted: May 16th, 2009, 4:38 am 
Joined: February 7th, 2009, 11:28 pm
Posts: 384
I'm interested in your project tinku, but a bit confused by the terminology. As far as I can tell it doesn't save or update an index file anywhere in contrast to other search utilities with indexing. Is the script ultimately meant to implement conventional indexing (write an index file), or are you using the term in some other sense?

Would your indexing algorithm work for simply indexing file names and other property info? For example, if I wanted to delete all Evernote files from my computer, a search for files by the "Evernote" company would reveal the three files shown below, whereas I'd never think to search for "RiteShape" etc. In particular, some antivirus programs may leave trash files in your system folders with very random names you'd never think to use in a name search, but often the company (Symantec, Kaspersky, ...) is listed in the Company name field.

Image

Since I don't know of any search engine (and I've looked) that lets you search the company, description, author, keywords, etc. fields, I'm wondering if I could fill that gap myself with AHK... Your script should be able to handle that right if I can just figure out how to 'Read' file properties into the "Contents" field rather than the actual contents? FWIW, StatusBarGetText can grab them in an open Explorer window if those 'details' have been checked for display...

_________________
Hardware: 1.8 GHz laptop with 4 GB ram, Windows XP/SP3
Software: Prevx, Privatefirewall, KeyScrambler.


 Post subject: indexing
PostPosted: May 16th, 2009, 5:30 am 
Joined: August 3rd, 2007, 8:01 am
Posts: 555
Location: Houston, TX
Quote:
Is the script ultimately meant to implement conventional indexing (write an index file)

Yeah, the idea was to have a text based index that would be easy to merge with other people's indices, possibly with something like git.
Currently, both the forward index and inverted index are just a bunch of variables in memory. Search is implemented by the builtin ahk procedures for looking up normal variables.
I have not decided how I am going to store them on disk yet.
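
As a rough illustration of the "bunch of variables" approach described above (a sketch only, not the actual ahksearch code), an inverted index can be kept as one dynamic variable per word. The idx_ prefix and the AddToIndex/Lookup names are made up for this example.

```autohotkey
; Sketch only: one global variable per word, holding a newline-separated
; list of the files that contain it. All names here are hypothetical.
AddToIndex(word, file)
{
    global   ; assume-global mode, so the dynamic variable is created as a global
    idx_%word% := idx_%word% . file . "`n"
}

Lookup(word)
{
    global
    return idx_%word%   ; empty string if the word was never indexed
}

AddToIndex("apple", "a.txt")
AddToIndex("apple", "b.txt")
matches := Lookup("apple")   ; the list of files for "apple"
```

Because each word becomes part of a variable name, this sketch only works for words made of characters that are legal in AutoHotkey variable names.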

The next algorithms to implement are:
1. term-frequency
2. inverse document frequency
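
For readers unfamiliar with the two terms, here is a minimal sketch of one common tf-idf weighting: (occurrences of the term in a document / words in that document) * ln(number of documents / number of documents containing the term). It is not the ahksearch implementation; it assumes AutoHotkey v1.1 or later (for objects), and the TfIdf function and its input shape are hypothetical.

```autohotkey
; Sketch only. docs is an object mapping a document name to its text.
; Returns an object mapping each document name to its score for "term".
TfIdf(docs, term)
{
    scores := {}, tf := {}, df := 0, n := 0
    for name, text in docs
    {
        n += 1
        count := 0, total := 0
        ; Split on spaces, tabs and line breaks.
        Loop, Parse, text, %A_Space%`t`r`n
        {
            if (A_LoopField = "")
                continue
            total += 1
            if (A_LoopField = term)   ; "=" compares case-insensitively
                count += 1
        }
        if (count > 0)
            df += 1                   ; number of documents containing the term
        tf[name] := total ? count / total : 0
    }
    for name, freq in tf
        scores[name] := df ? freq * Ln(n / df) : 0
    return scores
}

docs := {"a": "cat dog", "b": "dog dog", "c": "bird"}
scores := TfIdf(docs, "cat")
```

A term that appears in every document gets a score of 0 under this weighting, since Ln(1) is 0; other tf-idf variants smooth this differently.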

Quote:
Your script should be able to handle that right if I can just figure out how to 'Read' file properties into the "Contents" field rather than the actual contents?

Yes, that should work.
In function: wordsinfile(file){}
Just change
"FileRead, contents, %file%" within
to
GetProperties(file) { getProperties code here }
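
Spelled out, the change could look roughly like this. It is only a sketch: the rest of wordsinfile() is not shown in this thread, so the body below is a stand-in, and GetProperties() is a placeholder that you would fill with your own property-reading code.

```autohotkey
; Hypothetical shape of the change; the real wordsinfile() body is not shown here.
wordsinfile(file)
{
    ; was: FileRead, contents, %file%
    contents := GetProperties(file)
    ; ... the rest of the function would process "contents" as before ...
    return contents
}

; Placeholder: return whatever property text should be indexed for the file.
; FileGetVersion is a built-in command; other fields (company, description,
; author) need extra code that is not shown here.
GetProperties(file)
{
    FileGetVersion, version, %file%
    return file . " " . version
}
```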


 Post subject: Re: indexing
PostPosted: May 16th, 2009, 2:47 pm 
Joined: February 7th, 2009, 11:28 pm
Posts: 384
tinku99 wrote:
Yes, that should work.
In function: wordsinfile(file){}
Just change
"FileRead, contents, %file%" within
to
GetProperties(file) { getProperties code here }


Thank you. I'll add that to my pending projects.

It's strange how I initially learned AHK to save time (macros), but the opposite happened because I got hooked on writing scripts with it for everything. Of course, it's all supposed to pay off in the end, but I dunno... That said, I kind of like it.

PostPosted: May 19th, 2009, 10:57 pm 
Joined: August 3rd, 2007, 8:01 am
Posts: 555
Location: Houston, TX
update: implemented tf-idf search algorithm.
still no attempted optimization

I ran the ahksearch on a small wikipedia collection from here.
Results: 408442 words in 6043 files indexed in 22 minutes.
space: 540 megabytes, (probably about 2 million variables)
lookup: instantaneous
Comparison
about 1/5 as fast as python
about 1/20 as fast as xapian

Todo:
1. try binary arrays.
2. phrase search, boolean search etc...


PostPosted: May 20th, 2009, 11:51 am 
Wow, I have been looking for this kind of thing. Thank you, tinku!
Even if I had read the tf-idf thing you linked, I wouldn't have known how full-text search works.
An AutoHotkey script gives me the full picture.

And about the use of arrays, well, I am not sure it will make it faster.
In my tests with other scripts a few months ago, the string array (native AutoHotkey) was much faster than any of the other array scripts presented in the forum. Although they are convenient to use, in this case speed matters, I think.

How about using Cheetah or SQLite, both available in the forum?
Cheetah's normal query and retrieval speed is a bit slower than SQLite's, but with an indexed search it is as fast as SQLite (also with an indexed db). For what it's worth, in my case Cheetah has been very reliable and very intuitive to use, thanks to SKAN.
But I'm not sure if the writing speed is better than SQLite's.


PostPosted: May 20th, 2009, 11:58 am 
Joined: February 7th, 2009, 11:28 pm
Posts: 384
By the way, I found a function called FileGetFullVer by wOxxOm that can pick up all the properties for executable files (exe, dll, drv, fon, ttf, vxd, sys, cpl, ocx): http://www.autohotkey.com/forum/viewtopic.php?t=8618

But it's a few years old, so I'm still contemplating whether to use it or to try something else. Also, I need to find out whether that list of file types covers everything I might need to find by company, description, author, etc.

 Post subject: 20x improvement in speed
PostPosted: May 20th, 2009, 11:03 pm 
Joined: August 3rd, 2007, 8:01 am
Posts: 555
Location: Houston, TX
Still no use of arrays, hashes, or binary trees.
But now it indexes the same 6000 Wikipedia articles as above in 90 seconds.


 Post subject: changes
PostPosted: May 22nd, 2009, 3:21 am 
Joined: August 3rd, 2007, 8:01 am
Posts: 555
Location: Houston, TX
removed need for holding on to forward index.
enabled writing inverted index to disk.

faster search


 Post subject: really fast
PostPosted: May 22nd, 2009, 3:04 pm 
Joined: May 22nd, 2009, 2:48 pm
Posts: 12
Thanks, I will definitely use your search engine.
As you said, phrase and boolean support would be really great.

I don't know if I'm going too far, but when you say 'phrase search', does it mean the words are in the same file, or that they are located near each other, say 'on the same line'?


 Post subject: Re: indexing
PostPosted: May 22nd, 2009, 4:30 pm 
Joined: February 7th, 2009, 11:28 pm
Posts: 384
tinku99 wrote:
Yes, that should work.
In function: wordsinfile(file){}
Just change
"FileRead, contents, %file%" within
to
GetProperties(file) { getProperties code here }


I replaced

"FileRead, contents, %file%"

with

"contents:= FileGetFullVer(file,1|2|4|8|0xFFFF,"`t")"

and of course included wOxxOm's FileGetFullVer at the bottom of the script.

Seems to work, though the indexing takes a while.
