From Fedora Project Wiki

Revision as of 22:39, 12 October 2009 by Akistler (talk | contribs) (→‎Software Investigation and Evaluation: Added references for additional engines; Sorted Perl & Ruby to the top)


Project Sponsor

Name: Mike McGrath
Fedora Account Name: mmcgrath
Group: Infrastructure
Infrastructure Sponsor: mmcgrath

Secondary Contact info

Name: Huzaifa Sidhpurwala
Fedora Account Name: huzaifas
Group: Infrastructure

Name: Allen Kistler
Fedora Account Name: akistler
Group: Infrastructure

Name: Keiran Smith
Fedora Account Name: affix
Group: Infrastructure

Project Info

Project Name: Search Engine Enhancement
Target Audience: All users of Fedora web sites
Expiration/Delivery Date (required): F13

Description/Summary: Fedora needs a search engine. [1]

Requirements:

  • Crawl the web sites (wiki and non-wiki)
  • Search the web sites (wiki and non-wiki)
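As a rough illustration of the two requirements, the following is a minimal sketch of a crawler that follows links and builds an inverted index, plus a search function over that index. It is not taken from any of the candidate engines. The `PAGES` dictionary, the `example.org` URLs, and the names `PageParser`, `crawl`, and `search` are all assumptions made for the example; a real crawler would fetch pages over HTTP and respect robots.txt.

```python
from collections import defaultdict, deque
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical in-memory "site" standing in for real HTTP fetches.
PAGES = {
    "http://example.org/": '<a href="/wiki/Install">Install</a> '
                           '<a href="docs.html">Docs</a> Welcome to Fedora',
    "http://example.org/wiki/Install": "<p>Install Fedora from live media</p>",
    "http://example.org/docs.html": '<p>Release notes and install guide</p> '
                                    '<a href="/">Home</a>',
}


class PageParser(HTMLParser):
    """Collects link targets and lowercased words from one HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.words = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_data(self, data):
        self.words.extend(w.lower() for w in data.split() if w.isalnum())


def crawl(start, fetch=PAGES.get):
    """Breadth-first crawl from `start`; returns a word -> set of URLs index."""
    index = defaultdict(set)
    seen = {start}
    queue = deque([start])
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:  # page missing or fetch failed
            continue
        parser = PageParser()
        parser.feed(html)
        parser.close()
        for word in parser.words:
            index[word].add(url)
        for href in parser.links:
            target = urljoin(url, href)  # resolve relative links
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return index


def search(index, query):
    """Return the sorted URLs that contain every term in `query`."""
    terms = query.lower().split()
    if not terms:
        return []
    hits = set.intersection(*(index.get(t, set()) for t in terms))
    return sorted(hits)
```

Because the crawler works from rendered HTML and links rather than from a wiki database, the same code path would cover wiki and non-wiki pages alike.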

Preferences:

  • Python-based (no Java)
  • Programmable keywords, to control which pages are displayed for particular keywords
  • XML or library interface so other applications can use it
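The keyword and XML preferences could look something like the sketch below. It assumes a hand-maintained table that pins pages to the top of the results for given keywords, and it serializes the final ranking as XML so that other applications can consume it. The `PINNED` table, the URLs, the function names, and the `<results>`/`<hit>` element names are all hypothetical; none of them come from a candidate engine.

```python
import xml.etree.ElementTree as ET

# Hypothetical editor-maintained table: keyword -> pages pinned to the top.
PINNED = {
    "download": ["http://example.org/get-fedora"],
}


def apply_pins(query, organic_hits, pinned=PINNED):
    """Return hits with pages pinned to any query keyword listed first."""
    ordered = []
    for term in query.lower().split():
        for url in pinned.get(term, []):
            if url not in ordered:
                ordered.append(url)
    # Append the engine's own results, skipping pages already pinned.
    for url in organic_hits:
        if url not in ordered:
            ordered.append(url)
    return ordered


def to_xml(query, hits):
    """Serialize ranked hits as a small XML document (assumed schema)."""
    root = ET.Element("results", {"query": query})
    for rank, url in enumerate(hits, start=1):
        hit = ET.SubElement(root, "hit", {"rank": str(rank)})
        hit.text = url
    return ET.tostring(root, encoding="unicode")
```

Keeping the pinning step separate from the engine means the keyword table could be applied on top of whichever candidate is eventually selected.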

Project plan (Detailed):

  1. Investigate and evaluate existing open source search engines
  2. Select candidate software
  3. Create public test instance of candidate software
  4. Test for functionality, performance, and impact (re-evaluate, if necessary)
  5. Create capacity and deployment plans
  6. Deploy

Specific resources needed

  • Public Test for testing candidate software
  • Permanent home(s) for deployment
    • Web server(s)
    • Database server(s)

Software Investigation and Evaluation

  • HtdigSearch [2]
    • Huzaifa (in progress)
  • SphinxSearch [3]
    • Huzaifa (in progress)
  • Ferret
    • Ruby port of Lucene
  • Gonzui [4] (specializes in source code search)
    • written in Ruby
    • not actively maintained
  • KinoSearch
    • Perl port of Lucene
  • Namazu [5]
    • written in Perl
  • OpenFTS [6]
    • Not suitable
    • written in Perl or TCL on top of PostgreSQL
    • Python interface available
    • not actively maintained
  • Plucene
    • Perl port of Lucene
  • DataparkSearch [7]
    • Not suitable
    • written in C
  • Egothor [8]
    • Not suitable
    • written in Java
  • Grub [9]
    • Not suitable
    • written in C#
  • ht://Dig [10]
    • Not suitable
    • written in C++
    • not actively maintained
  • Indri [11]
    • Not suitable
    • written in C/C++
  • Isearch [12]
    • Not suitable
    • written in C++
  • Lucene [13]
    • Not suitable
    • originally in Java, ported to other languages
    • Perl ports are Plucene and KinoSearch; Ruby port is Ferret
    • see Lucene Implementations [14]
  • mnoGoSearch [15]
    • Not suitable
    • written in C
  • MWSearch [16]
    • Not suitable
    • requires EzMwLucene (Java, not desirable) to be running on the servers to be searched
    • EzMwLucene is wiki-only, therefore MWSearch is wiki-only
  • Nutch [17]
    • Not suitable
    • written in Java
    • based on Lucene
  • RigorousSearch [18]
    • Not suitable
    • crawls the MediaWiki database, not the web site, so it cannot index any non-wiki web site
  • Sphinx [19]
    • Not suitable
    • written in C++
  • Swish-e [20]
    • Not suitable
    • written in C
    • Swish++ is a rewrite in C++
  • Terrier (TERabyte RetrIEveR) [21]
    • Not suitable
    • written in Java
  • Xapian [22]
    • Not suitable
    • written in C++
  • YaCy [23]
    • Not suitable
    • written in C
  • Zettair [24]
    • Not suitable
    • written in C

Public Testing

<tbd>

Deployment Plan

<tbd>

References

  1. "Fedora Search Engine". Infrastructure/Tickets. https://fedorahosted.org/fedora-infrastructure/ticket/1055
  2. "HtdigSearch Extension". MediaWiki. https://secure.wikimedia.org/wikipedia/mediawiki/wiki/Extension:HtdigSearch
  3. "SphinxSearch Extension". MediaWiki. https://secure.wikimedia.org/wikipedia/mediawiki/wiki/Extension:SphinxSearch
  4. "Gonzui". SourceForge. http://gonzui.sourceforge.net/
  5. "Namazu". Namazu Project. http://www.namazu.org/
  6. "OpenFTS". SourceForge. http://openfts.sourceforge.net/
  7. "DataparkSearch". DataparkSearch. http://www.dataparksearch.org/
  8. "Egothor". Egothor. http://www.egothor.org/
  9. "Grub". Wikia, Inc. http://grub.org/
  10. "ht://Dig". The ht://Dig Group. http://www.htdig.org/
  11. "Indri". The Lemur Project. http://www.lemurproject.org/indri/
  12. "Isearch". Isite. http://isite.awcubed.com/
  13. "Lucene". Apache Software Foundation. http://lucene.apache.org/
  14. "Lucene Implementations". Apache Software Foundation. http://wiki.apache.org/lucene-java/LuceneImplementations
  15. "mnoGoSearch". LavTech. http://www.mnogosearch.org/
  16. "MWSearch Extension". MediaWiki. https://secure.wikimedia.org/wikipedia/mediawiki/wiki/Extension:MWSearch
  17. "Nutch". Apache Software Foundation. http://lucene.apache.org/nutch/
  18. "RigorousSearch Extension". MediaWiki. https://secure.wikimedia.org/wikipedia/mediawiki/wiki/Extension:RigorousSearch
  19. "Sphinx". Sphinx Technologies. http://sphinxsearch.com/
  20. "Swish-e". Swish-e. http://swish-e.org/
  21. "Terrier". Terrier Project. http://ir.dcs.gla.ac.uk/terrier/
  22. "Xapian". Xapian Project. http://xapian.org/
  23. "YaCy". Karlsruhe Institute of Technology. http://yacy.net/
  24. "Zettair". Search Engine Group, Royal Melbourne Institute of Technology. http://www.seg.rmit.edu.au/zettair/