From Fedora Project Wiki

Revision as of 20:51, 23 December 2009


Points of Contact

Project Sponsor

Name: Mike McGrath
Fedora Account Name: mmcgrath
Group: Infrastructure
Infrastructure Sponsor: mmcgrath

Secondary Contact info

Name: Allen Kistler
Fedora Account Name: akistler
Group: Infrastructure

Name: Huzaifa Sidhpurwala
Fedora Account Name: huzaifas
Group: Infrastructure

Project Info

Project Name: Search Engine Enhancement
Target Audience: All users of Fedora web sites
Expiration/Delivery Date (required): F13

Description/Summary

Fedora needs a search engine for its web sites. [1]

Requirements

  • Crawl the web sites (wiki and non-wiki)
  • Search the web sites (wiki and non-wiki)
  • Java, if any, must be the GCJ/OpenJDK versions in RHEL5; Sun/IBM/BEA Java is not acceptable

Preferences

  • Python-based
  • Programmable keywords, to control which pages are displayed for certain keywords
  • XML or library interface so other applications can use it
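
The "programmable keywords" preference can be pictured with a small sketch. It is only an illustration, not a design decision: the keyword table, the example.org URLs, and the apply_pins helper are all hypothetical, and a real deployment would use whatever mechanism the selected engine provides.

```python
# Hypothetical keyword table: keyword -> pages that should be listed first.
PINNED = {
    "install": ["https://example.org/install-guide"],
    "download": ["https://example.org/get-fedora"],
}


def apply_pins(query, engine_results, pinned=PINNED):
    """Return results with pinned pages for matching keywords placed first."""
    ordered = []
    # Pages pinned to any keyword in the query come first, in query order.
    for word in query.lower().split():
        for url in pinned.get(word, []):
            if url not in ordered:
                ordered.append(url)
    # The engine's own results follow, skipping anything already pinned.
    for url in engine_results:
        if url not in ordered:
            ordered.append(url)
    return ordered
```

Calling apply_pins("Install Fedora", results) would move the pinned install page to the top of whatever list the engine returned, and leave queries without a pinned keyword untouched.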

Project Plan

  1. Investigate and evaluate existing open source search engines
  2. Select candidate software
  3. Create public test instances of candidate software
  4. Test for functionality, performance, and impact (re-evaluate, if necessary)
  5. Create capacity and deployment plans
  6. Deploy

Resources Needed

  • Public Test for testing candidate software
  • Permanent home(s) for deployment
    • Web server(s)
    • Database server(s) (maybe)

Software Investigation and Evaluation

Comparison by Requirements

| Engine Name | Source Language | Integrated Web Crawler | Integrated Web Front-End | Programmable Categories | Application Interface |
|---|---|---|---|---|---|
| CLucene | C++ | | | | |
| DataparkSearch | C | | | | |
| Egothor | Java | | | | |
| Ferret | Ruby | | | | |
| Indri | C/C++ | | | | |
| Isearch | C++ | | | | |
| KinoSearch | Perl/C | No (sample file crawler included) | No (sample included) | Yes | Yes (BDB/JSON) |
| Namazu | Perl | | | | |
| Nutch | Java (OpenJDK & Tomcat) | Yes | Yes | No | No |
| Swish-e | C/Perl | Yes (Perl) | No (sample included, but has problems) | No (but can search on META tags) | Yes (Perl and C APIs) |
| Terrier | Java | | | | |
| Xapian | C++ | No (can combine Omega with something) | Yes (rudimentary Omega CGI) | Yes | Yes (C++, Perl, Python, Ruby) |
| Zettair | C | | | | |
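
For shortlisting, the rows that have been evaluated so far could be kept as data and filtered mechanically. The sketch below assumes a simple yes/no reading of each cell (the parenthetical caveats in the table are dropped), and the names COMPARISON and engines_with are hypothetical.

```python
# Yes/No values transcribed from the evaluated rows of the comparison table.
# Engines with empty cells are left out because they have no values yet.
COMPARISON = {
    "KinoSearch": {"crawler": False, "front_end": False, "categories": True, "api": True},
    "Nutch": {"crawler": True, "front_end": True, "categories": False, "api": False},
    "Swish-e": {"crawler": True, "front_end": False, "categories": False, "api": True},
    "Xapian": {"crawler": False, "front_end": True, "categories": True, "api": True},
}


def engines_with(*features):
    """Return the names of engines that have every requested feature."""
    return sorted(
        name
        for name, row in COMPARISON.items()
        if all(row[feature] for feature in features)
    )
```

For example, engines_with("categories", "api") selects the engines that offer both programmable categories and an application interface.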

In Progress

  • CLucene [2]
    • C++ port of Lucene
    • in Fedora already
    • described as beta by the developers
  • DataparkSearch [3]
    • written in C
  • Egothor [4]
    • written in Java
  • Ferret [5]
    • Ruby port of Lucene
    • KinoSearch and Ferret intend to merge as Lucy [6]
  • Indri [7]
    • written in C/C++
  • Isearch [8]
    • written in C++
  • KinoSearch [9] - akistler examined
    • Description
      • Perl/C port of Lucene
      • in Fedora already
      • the maintainer, Rectangular Research, appears to be a single developer, who considers KinoSearch to be alpha software
      • KinoSearch and Ferret intend to merge as Lucy [6]
    • Evaluation
      • A search engine library with a sample indexer and search page, rather than a fully functional application. It stores indices in Berkeley DB files with JSON interfaces. It allows custom-designed indices, including categories (exact match), to fulfill the "programmable keywords" requirement. Each document index on each document source is a single write-once file collection (BDB and JSON) in a unique directory. Rerunning the indexer creates a new directory, which obsoletes the old directory if all the old documents are included; the old directory then needs to be cleaned up. Postings can, however, be deleted from an index. It is also possible to index only the new documents, but that is not efficient.
    • Requirements
      • buildrequires
        • gcc
        • (EPEL) perl-Module-Build
      • requires
        • (EPEL) perl-JSON-XS
          • Problem: Desires 1.53, but EPEL has 1.43
          • Note: http://web.archive.org/web/20071122035408/search.cpan.org/src/MLEHMANN/JSON-XS-1.53/
          • Note: works with 1.43, anyway
        • (EPEL) perl-Lingua-Stem-Snowball
        • (EPEL) perl-Lingua-StopWords
        • (EPEL) perl-Parse-RecDescent
      • sample indexer reads files from the file system and requires
        • (EPEL) perl-HTML-Tree
      • sample CGI search script requires
        • (CPAN) Data::Pageset (which requires Data::Page)
        • (EPEL) perl-Test-Exception
        • (EPEL) perl-Class-Accessor-Chained
  • Namazu [10] - Huzaifa investigating
    • written in Perl
    • in Fedora already
  • Nutch [11] - Allen investigating
    • written in Java
    • based on Lucene
  • Swish-e [12] - akistler examined
    • Description
      • written in C
      • Note: Swish++ is a rewrite in C++ (not evaluated here)
    • Evaluation
      • A search engine with a built-in web crawler, a built-in file system crawler, and an interface for an external crawler. The distribution includes sample search pages, which use the Perl API. There is also a C API. The index is not customizable, but metawords (exact match) and the path can be included in the index for each document. The documentation acknowledges that the software supports only ASCII, although some multibyte character sets (MBCS) may also work.
    • Requirements
      • buildrequires
        • gcc
        • make
        • libxml2-devel
        • zlib-devel
      • requires
        • libxml2
        • zlib
        • perl-libwww-perl (for the built-in spider)
        • others as desired to index documents (pdf, etc.)
  • Terrier (TERabyte RetrIEveR) [13]
    • written in Java
  • Xapian [14] - akistler examined
    • Description
      • written in C++
      • bindings to Python, Ruby, and Perl XS
      • Omega provides a Xapian front-end for indexing (via script) and searching (command line or CGI)
      • xapian-core, xapian-bindings, and perl-Search-Xapian in Fedora already; xapian-omega is not
      • additional bindings to PHP, Java, and more (?)
      • Omega provides glue scripts for ht://Dig, mbox files, and perl DBI
      • Flax [15] is another search engine built on top of Xapian and CherryPy
    • Evaluation
      • Xapian is a search engine library; Omega adds functionality on top of it. The Xapian database is very flexible, supporting an entirely user-designed schema. Usage through Omega loses very little, if any, of that flexibility; however, the supplied Omega CGI is extremely rudimentary. The supplied Omega CGI also requires the database to be named "default," although that can be changed. Database columns are of type field or index. Fields are stored verbatim (e.g., URL, date, MIME type, keywords). Indices are input as blocks of text or other content to be indexed, but not stored (e.g., the body of a file or web page). The Omega scriptindex utility can be combined with an external web crawler for HTML. Making Omega work with Apache requires either relabeling /var/lib/omega with the SELinux type httpd_sys_content_t, or moving /var/lib/omega to /var/www/omega and using the default context there.
    • Requirements
      • xapian-core buildrequires
        • gcc gcc-c++
        • make
        • zlib-devel
      • xapian-bindings buildrequires (not including gcc gcc-c++ make)
        • python python-devel
        • ruby ruby-devel
        • xapian-core-devel
      • perl-Search-Xapian buildrequires (not including gcc gcc-c++ make)
        • perl
        • xapian-core-devel
      • xapian-omega buildrequires (not including gcc gcc-c++ make)
        • libtool
        • xapian-core-devel
      • xapian-core requires
        • xapian-core-libs
      • xapian-bindings requires
        • coreutils
        • python
        • xapian-core-libs
      • perl-Search-Xapian requires
        • perl
        • xapian-core-libs
      • xapian-omega requires
        • httpd
        • perl
        • perl-DBI
        • xapian-core-libs
  • Zettair [16]
    • written in C
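
The distinction drawn in the Xapian evaluation between field columns (stored verbatim) and index columns (searchable, but not stored) can be illustrated with a toy model. This is plain Python written for illustration only; it is not the Xapian or Omega API, and the ToyDatabase class and the example.org URL are hypothetical.

```python
import re
from collections import defaultdict


class ToyDatabase:
    """Toy model of 'field' versus 'index' columns; not the Xapian API."""

    def __init__(self):
        self.fields = {}                   # doc id -> stored field values
        self.postings = defaultdict(set)   # term -> ids of matching documents

    def add(self, doc_id, fields, index_text):
        # 'field' columns are stored verbatim and returned with results.
        self.fields[doc_id] = dict(fields)
        # 'index' columns are tokenized for searching; the text is not kept.
        for term in re.findall(r"\w+", index_text.lower()):
            self.postings[term].add(doc_id)

    def search(self, term):
        """Return the stored fields of every document containing the term."""
        ids = sorted(self.postings.get(term.lower(), set()))
        return [self.fields[doc_id] for doc_id in ids]


db = ToyDatabase()
db.add(1, {"url": "https://example.org/wiki/Infrastructure"},
       "Fedora infrastructure search engine")
```

A search for "engine" returns the stored URL field of document 1, while the indexed sentence itself cannot be recovered from the database, which mirrors the stored-versus-indexed behavior described above.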

Eliminated from Consideration

  • Gonzui [17]
    • written in Ruby
    • specializes in source code search
    • not actively maintained
  • Grub [18]
    • written in C#
  • Heritrix [19]
    • written in Java
    • archives content rather than indexing it
  • ht://Dig [20]
    • written in C++
    • not actively maintained
  • HtdigSearch Extension [21]
    • It's just a MediaWiki plugin, not suitable for searching non-wiki sites
  • Lucene [22] - akistler examined
    • Description
      • written in Java, but ported to other languages [23]
      • requires/uses GCJ
      • in Fedora already
      • PyLucene [24] is a Python wrapper around Java Lucene
    • Evaluation
      • Search engine library meant to be integrated into applications
    • Requirements
      • buildrequires (based on 1.4.3-f7)
        • ant
        • ant-junit
        • java-1.4.2-gcj-compat-devel
        • javacc
        • jpackage-utils
        • junit
        • make
      • requires (based on 1.4.3-f7)
        • java-1.4.2-gcj-compat
  • mnoGoSearch [25] - akistler examined
    • Description
      • written in C
      • UNIX/Linux source code is GPL; Windows binaries are commercial, likely based on the GPL UNIX/Linux code, and lag a few versions behind
      • indices are stored in a database; supported databases include MySQL, PostgreSQL, and SQLite (among others)
      • HTTP, FTP, and NNTP crawling
      • C, PHP, and Perl APIs
      • SBCS and most MBCS supported, including most eastern Asian languages
    • Evaluation
      • The supplied install.pl script generates a configure command, but does not support SQLite. Adding --with-sqlite3 to the generated command adds SQLite support. An empty database must be created manually. A URI in the indexer.conf file specifies the location of the database. According to the documentation, sqlite:/path/to/db/file should work, but doesn't. According to the message boards on mnoGoSearch.org, sqlite://localhost/path/to/db/file should work, but doesn't. No other databases were tested for evaluation.
    • Requirements
      • buildrequires
        • gcc make
        • sqlite-devel (for SQLite support)
        • zlib-devel
      • requires
        • sqlite (for SQLite support)
        • zlib
  • MWSearch Extension [26]
    • Requires EzMwLucene (Java) to be running on the servers to be searched
    • EzMwLucene is wiki-only, therefore MWSearch is wiki-only
  • OpenFTS [27]
    • written in Perl or TCL on top of PostgreSQL
    • Python interface available
    • not actively maintained
  • Plucene [28]
    • Perl port of Lucene
    • not actively maintained
  • RigorousSearch Extension [29]
    • Crawls the MediaWiki database, not the web site
    • Doesn't work for non-MediaWiki web sites, including any non-wiki web site
  • Sphinx [30]
    • written in C++
    • designed to index SQL tables, not web pages
  • Solr [31] - akistler examined
    • Description
      • written in Java
      • based on Lucene
    • Evaluation
      • The documentation describes installing Sun Java to run Solr, but OpenJDK 1.5 or later is fine. Solr needs a Java servlet container in which to run. It comes with Jetty, but other containers should work as well (e.g., Tomcat). It currently supports only UTF-8 characters.
      • Solr essentially provides an HTTP admin GUI for a search engine that uses a superset of the Lucene query syntax. The schema is very flexible. Set-up is done almost entirely through XML files. Applications can query the servlet port and get XML or JSON responses.
      • It has no crawling/spidering facility. It has no user query interface. There are no samples.
    • Requirements
      • buildrequires
        • ant (note that ant currently pulls in java-gcj-compat, too, but it appears not to be a problem)
        • ant-junit
        • java-1.6.0-openjdk-devel
        • junit
      • requires
        • java-1.6.0-openjdk
  • SphinxSearch Extension [32]
    • written in C++
    • MediaWiki plug-in, so it's wiki-only
  • Whoosh [33]
    • written in Python
    • inspired by Lucene, but closer to a Python port of parts of KinoSearch combined with some features of Terrier
    • toolkit only, not even sample crawlers and user interfaces
  • YaCy [34] - huzaifa examined
    • written in Java, but requires Sun Java
    • well maintained
    • support for peer search engine database exchanges
    • customized search parameters
    • fast indexing and web interface for querying the back-end db
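
The Solr evaluation above notes that applications can query the servlet port and get XML or JSON responses. The sketch below shows roughly what that looks like from a client, assuming Solr's standard select handler with the wt=json parameter. The host, port, and sample response body are assumptions made for illustration (localhost:8983 is only the default of the bundled Jetty example), and the helper names are hypothetical.

```python
import json
from urllib.parse import urlencode


def solr_select_url(base_url, query, rows=10):
    """Build a URL for Solr's select handler, asking for a JSON response."""
    params = urlencode({"q": query, "rows": rows, "wt": "json"})
    return f"{base_url}/select?{params}"


def parse_solr_json(body):
    """Return (total hit count, list of documents) from a JSON response body."""
    response = json.loads(body)["response"]
    return response["numFound"], response["docs"]


# Hand-written body in the shape Solr returns; not output from a real server.
SAMPLE_BODY = (
    '{"responseHeader": {"status": 0},'
    ' "response": {"numFound": 1, "start": 0,'
    ' "docs": [{"id": "wiki/Infrastructure"}]}}'
)

url = solr_select_url("http://localhost:8983/solr", "fedora search")
hits, docs = parse_solr_json(SAMPLE_BODY)
```

Fetching the URL (for example with urllib.request.urlopen) is left out so that the sketch runs without a Solr instance; any crawler and user-facing search page would still have to be supplied separately, as the evaluation points out.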

Public Testing

<tbd>

Deployment Plan

<tbd>

References

  1. "Fedora Search Engine". Infrastructure Trac. https://fedorahosted.org/fedora-infrastructure/ticket/1055. 
  2. "CLucene". CLucene Project. http://sourceforge.net/projects/clucene/. 
  3. "DataparkSearch". DataparkSearch. http://www.dataparksearch.org/. 
  4. "Egothor". Egothor. http://www.egothor.org/. 
  5. "Ferret". David Balmain. http://ferret.davebalmain.com/. 
  6. "Lucy". Apache Software Foundation. http://lucene.apache.org/lucy/. 
  7. "Indri". The Lemur Project. http://www.lemurproject.org/indri/. 
  8. "Isearch". Isite. http://isite.awcubed.com/. 
  9. "KinoSearch". Rectangular Research. http://www.rectangular.com/kinosearch/. 
  10. "Namazu". Namazu Project. http://www.namazu.org/. 
  11. "Nutch". Apache Software Foundation. http://lucene.apache.org/nutch/. 
  12. "Swish-e". Swish-e. http://swish-e.org/. 
  13. "Terrier". Terrier Project. http://ir.dcs.gla.ac.uk/terrier/. 
  14. "Xapian". Xapian Project. http://xapian.org/. 
  15. "Flax". Flax. http://www.flax.co.uk/products.shtml. 
  16. "Zettair". Search Engine Group, Royal Melbourne Institute of Technology. http://www.seg.rmit.edu.au/zettair/. 
  17. "Gonzui". SourceForge. http://gonzui.sourceforge.net/. 
  18. "Grub". Wikia, Inc.. http://grub.org/. 
  19. "Heritrix". Internet Archive. http://crawler.archive.org/. 
  20. "ht://Dig". The ht://Dig Group. http://www.htdig.org/. 
  21. "HtdigSearch Extension". MediaWiki. https://secure.wikimedia.org/wikipedia/mediawiki/wiki/Extension:HtdigSearch. 
  22. "Lucene". Apache Software Foundation. http://lucene.apache.org/. 
  23. "Lucene Implementations". Apache Software Foundation. http://wiki.apache.org/lucene-java/LuceneImplementations. 
  24. "PyLucene". Apache Software Foundation. http://lucene.apache.org/pylucene/. 
  25. "mnoGoSearch". LavTech. http://www.mnogosearch.org/. 
  26. "MWSearch Extension". MediaWiki. https://secure.wikimedia.org/wikipedia/mediawiki/wiki/Extension:MWSearch. 
  27. "OpenFTS". XWare. http://www.astronet.ru/xware/#fts. 
  28. "Plucene". CPAN. http://search.cpan.org/~tmtm/Plucene-1.25. 
  29. "RigorousSearch Extension". MediaWiki. https://secure.wikimedia.org/wikipedia/mediawiki/wiki/Extension:RigorousSearch. 
  30. "Sphinx". Sphinx Technologies. http://sphinxsearch.com/. 
  31. "Solr". Apache Software Foundation. http://lucene.apache.org/solr/. 
  32. "SphinxSearch Extension". MediaWiki. https://secure.wikimedia.org/wikipedia/mediawiki/wiki/Extension:SphinxSearch. 
  33. "Whoosh". Matt Chaput. http://whoosh.ca/. 
  34. "YaCy". Karlsruhe Institute of Technology. http://yacy.net/.