If you ever need to index a few GB of txt/HTML content, give Zettair a try.
Advantages
1) Unlike Solr/Lucene, you don’t end up spending time on configuration,
2) Up and running in 8 commands (assuming everything goes right :p ):
$ wget http://www.seg.rmit.edu.au/zettair/download/zettair-0.9.3.tar.gz
$ tar -xvf zettair-0.9.3.tar.gz
$ cd zettair-0.9.3
$ ./configure && make && make install
$ zet -i
$ zet
>> <Enter your query>
3) Command-line interface for building and searching the index,
4) It’s in C, so it comfortably beats Lucene/Solr (Java) on indexing time: about 3x faster in my tests.
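If you want to check the indexing-time claim on your own data, a rough sketch like this may help. It only reports wall-clock seconds for whatever command you hand it. The time_cmd helper and the docs/ directory are made up for illustration, and passing the files straight after zet -i is an assumption about how you would build the index.

```shell
#!/bin/sh
# time_cmd: hypothetical helper, not part of Zettair.
# Runs the given command, sends its output to stderr, and prints
# the elapsed wall-clock time in whole seconds on stdout.
time_cmd() {
    start=$(date +%s)
    status=0
    "$@" >&2 || status=$?
    end=$(date +%s)
    echo "$((end - start))"
    return "$status"
}

# Example (assumption: docs/ holds the txt/html files to index):
#   time_cmd zet -i docs/*
```

Run the same corpus through the other indexer with the same wrapper and compare the two numbers. One-second resolution is coarse, so use a corpus large enough that indexing takes at least a minute.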
Disadvantages
1) It crashed on one file (the bug seems to be in the HTML -> text converter),
2) Can index only text and HTML,
3) The index cannot be built in parallel across multiple machines.
I am currently using it to index the past 12 hours of news content every half hour. The content comes from 356 sites in total with a depth-2 crawl, i.e. front-page stories. It takes six minutes to index approx 500 MB (roughly 1.4 MB/s).
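A setup like this could be sketched as follows. This is not my actual script, just a minimal illustration: the recent_files helper, the crawl/ directory and the script path in the cron line are all hypothetical. Note that find's -mmin is supported by GNU and BSD find but is not in POSIX.

```shell
#!/bin/sh
# recent_files: hypothetical helper, not part of Zettair.
# Lists regular files under DIR that were modified within
# the last MINUTES minutes (12 hours = 720 minutes).
recent_files() {
    dir=$1
    minutes=$2
    find "$dir" -type f -mmin "-$minutes"
}

# Example (assumptions: crawl/ is where the crawler drops pages,
# and zet -i accepts the files to index as arguments):
#   recent_files crawl 720 | xargs zet -i
#
# A crontab entry to run such a script every half hour
# (the path is a placeholder):
#   */30 * * * * /path/to/reindex.sh
```

The plain xargs pipe breaks on file names containing spaces, and with a very long file list xargs will split the work into several indexer runs, so treat it as a starting point only.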