If you ever need to index a few GB of text or HTML content, give Zettair a try.

Advantages

1) Unlike Solr/Lucene, you don't end up spending time on configuration.

2) Up and running in 8 commands (assuming everything goes right :p):

$ wget http://www.seg.rmit.edu.au/zettair/download/zettair-0.9.3.tar.gz

$ tar -xvf  zettair-0.9.3.tar.gz

$ cd zettair-0.9.3

$ ./configure && make && make install

$ zet -i <files to index>

$ zet

>> <Enter your query>

3) Command-line interface for building and searching the index.

4) It's written in C, so it beats the shit out of Lucene/Solr (Java) on indexing time, by a factor of 3 in my tests.

Disadvantages

1) It crashed on one file (the bug seems to be in the HTML-to-text converter).

2) Can index only text and HTML.

3) The index cannot be built in parallel across multiple machines.

I am currently using it to index the past 12 hours of news content from 356 sites, rebuilding every half hour, with a 2-depth crawl, i.e. front-page stories. It takes six minutes to index approx. 500 MB.
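For what it's worth, here is a rough sketch of how such a half-hourly rebuild could be wired up. It is not my exact setup: the function names, the directory layout and the INDEXER override are all made up for illustration, and it assumes a find that supports -mmin (GNU and BSD find do) and file names without whitespace. The only Zettair call it relies on is the `zet -i` build command from above.

```shell
# Sketch only: recent_files, reindex and INDEXER are hypothetical names.

recent_files() {
    # List regular files under directory $1 modified in the last $2 minutes.
    find "$1" -type f -mmin "-$2"
}

reindex() {
    crawl_dir=$1
    window_min=${2:-720}    # default window: 12 hours

    files=$(recent_files "$crawl_dir" "$window_min")
    if [ -z "$files" ]; then
        echo "nothing to index in $crawl_dir" >&2
        return 1
    fi

    # INDEXER defaults to "zet -i"; set INDEXER=echo for a dry run.
    # $files is deliberately unquoted so each path becomes one argument
    # (this is where the no-whitespace assumption comes in).
    ${INDEXER:-zet -i} $files
}
```

Saved in a script that ends with a call such as `reindex /path/to/crawl 720`, it could be scheduled with a crontab entry like `*/30 * * * * /path/to/script.sh`. Both paths are placeholders. How `zet` behaves when an index from the previous run already exists is not covered here, so check that before relying on it.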

