Revision as of 20:31, 9 February 2012

Sphinx

This page is helpful; I used their config and modified it for our needs.
The Sphinx indexer simply runs from cron, so that part is simple.
As for the front end, we are going to look at packaging the MW extension linked above.
- The extension depends on sphinxapi.php, which is in the libsphinxclient package, at */usr/share/doc/libsphinxclient-0.9.9/sphinxapi.php*.
- The extension does not seem to work with MW 1.16, but we want to upgrade eventually anyway.
*Sphinx does not crawl, it only indexes databases, which kind of defeats the purpose for us.*

Xapian

Doesn't have a crawler built in.
Most stuff is done via Omega, Xapian just backs it.
Hacky way to crawl sites: Crawl with htdig, convert into a format omega understands and can index.
htdig is unsupported and *OLD*.
htdig seems to segfault on https sites in my testing.
Omega's default UI is *ugly* but that is changeable.
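
The "convert into a format omega understands" step above could look roughly like the sketch below. It assumes the crawled pages have already been reduced to plain url/title/text values (the crawl itself is not shown), and it writes them in the input format read by Omega's scriptindex tool: one `field=value` line per field, continuation lines prefixed with `=`, and a blank line between records. The field names `url`, `title`, and `text` and the helper names are illustrative assumptions; a matching index script would have to declare the same fields.

```python
def format_record(fields):
    """Render one document as a scriptindex-style record.

    Each field becomes "name=value". Extra lines of a multi-line value
    are emitted as continuation lines starting with "=".
    """
    lines = []
    for name, value in fields.items():
        # An empty value still needs one "name=" line.
        first, *rest = str(value).splitlines() or [""]
        lines.append(f"{name}={first}")
        lines.extend(f"={cont}" for cont in rest)
    return "\n".join(lines)


def format_dump(records):
    """Join records with a blank line between them, as scriptindex expects."""
    return "\n\n".join(format_record(r) for r in records) + "\n"


if __name__ == "__main__":
    # Hypothetical crawl output, for illustration only.
    pages = [
        {"url": "https://example.org/a", "title": "Page A",
         "text": "first line\nsecond line"},
        {"url": "https://example.org/b", "title": "Page B",
         "text": "only line"},
    ]
    print(format_dump(pages), end="")
```

The resulting file would then be passed to scriptindex together with the database path and an index script that maps these fields to index actions.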

Others to try

Apache Lucene (with Apache Nutch to crawl).
- Relies heavily on Java, so probably out of the question (Lucene is Java, Nutch is a Tomcat servlet. Nuff said.)
Datapark Search
- Fork of Mnogosearch?
- Written in C.

@@ Line 6: / Line 6: @@
 ** The extension depends on sphinxapi.php, which is in the libsphinxclient package, at */usr/share/doc/libsphinxclient-0.9.9/sphinxapi.php*.
 ** The extension does not seem to work with MW 1.16, but we want to upgrade eventually anyway.
+* *Sphinx does not crawl, it only indexes databases, which kind of defeats the purpose for us.*
+= Xapian =
+* Doesn't have a crawler built in.
+* Most stuff is done via Omega, Xapian just backs it.
+* Hacky way to crawl sites: Crawl with htdig, convert into a format omega understands and can index.
+* htdig is unsupported and *OLD*.
+* htdig seems to segfault on https sites in my testing.
+* Omega's default UI is *ugly* but that is changeable.
+= Mnogosearch =
+* [http://www.mnogosearch.org/ Link]
+* Looks nice. Has a somewhat nice UI, and is customizable.
+* Built-in crawler, with a default 1000-line (with comments) config file.
+* CGI barfs when there are results: [http://mnogosearch.org/bugs/index.php?id=19129 bug 19129] and [http://mnogosearch.org/bugs/index.php?id=19141 bug 19141] upstream.
+= Others to try =
+* Apache Lucene (with Apache Nutch to crawl).
+** Relies heavily on Java, so probably out of the question (Lucene is Java, Nutch is a Tomcat servlet. Nuff said.)
+* [http://www.dataparksearch.org/ Datapark Search]
+** Fork of Mnogosearch?
+** Written in C.
+* ASPseek
+** C++
+** Last copyright year on [http://www.aspseek.org/ their site] is 2003. Is it unmaintained?