Revision as of 20:32, 9 February 2012

Sphinx

This page is helpful; I used their config and modified it for our needs.
The Sphinx indexer simply runs on a cron, so that part is simple.
As for the front end, we are going to look at packaging the MW extension linked above.
- The extension depends on sphinxapi.php, which is in the libsphinxclient package, at /usr/share/doc/libsphinxclient-0.9.9/sphinxapi.php.
- The extension does not seem to work with MW 1.16, but we want to upgrade eventually anyway.
Sphinx does not crawl; it only indexes databases, which kind of defeats the purpose for us.

Xapian doesn't have a crawler built in.
Most of the work is done via Omega; Xapian just backs it.
Hacky way to crawl sites: crawl with htdig, then convert the output into a format Omega understands and can index.
htdig is unsupported and *OLD*.
htdig seems to segfault on https sites in my testing.
Omega's default UI is *ugly* but that is changeable.
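
The "convert into a format Omega understands" step could look roughly like the sketch below. It assumes the target is the plain-text dump format read by Omega's scriptindex tool (one field=value line per field, a blank line between records, continuation lines starting with "="). The field names url, title and text are hypothetical and would have to match whatever index script is used; pulling pages out of htdig's own output is not covered here.

```python
def to_scriptindex_record(fields):
    """Render one crawled page as a record of field=value lines."""
    lines = []
    for name, value in fields.items():
        parts = str(value).splitlines() or [""]
        lines.append(f"{name}={parts[0]}")
        # Further lines of a multi-line value are continuation lines starting with "=".
        lines.extend(f"={part}" for part in parts[1:])
    return "\n".join(lines)


def to_scriptindex_dump(documents):
    """Join records with a blank line, which separates records in the dump."""
    return "\n\n".join(to_scriptindex_record(doc) for doc in documents) + "\n"


# Hypothetical crawl results; in practice these would come from the crawler.
pages = [
    {"url": "http://example.org/a", "title": "Page A", "text": "first line\nsecond line"},
    {"url": "http://example.org/b", "title": "Page B", "text": "only line"},
]
dump = to_scriptindex_dump(pages)
print(dump)
```

The resulting text would be written to a file and fed to scriptindex together with an index script that says how each field is indexed.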

Apache Lucene (with Apache Nutch to crawl).
- Relies heavily on Java, so probably out of the question (Lucene is Java, Nutch is a Tomcat servlet. Nuff said.)
Datapark Search
- Fork of Mnogosearch?
- Written in C.
