Infrastructure/Search

From FedoraProject

Line changed in this revision, directly above the Not Suitable heading:

  • SphinxSearch (SphinxSearch Extension, MediaWiki, https://secure.wikimedia.org/wikipedia/mediawiki/wiki/Extension:SphinxSearch): Huzaifa investigating

Revision as of 09:17, 13 October 2009

Points of Contact

Project Sponsor

Name: Mike McGrath
Fedora Account Name: mmcgrath
Group: Infrastructure
Infrastructure Sponsor: mmcgrath

Secondary Contact info

Name: Huzaifa Sidhpurwala
Fedora Account Name: huzaifas
Group: Infrastructure

Name: Allen Kistler
Fedora Account Name: akistler
Group: Infrastructure

Name: Keiran Smith
Fedora Account Name: affix
Group: Infrastructure

Project Info

Project Name: Search Engine Enhancement
Target Audience: All users of Fedora web sites
Expiration/Delivery Date (required): F13

Description/Summary

Fedora needs a search engine for its web sites. [1]

Requirements

  • Crawl the web sites (wiki and non-wiki)
  • Search the web sites (wiki and non-wiki)
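
As a rough illustration of these two requirements, the sketch below crawls a small set of pages by following links and then answers queries from an inverted index. It is not one of the candidate engines. The `PAGES` mapping is a hypothetical stand-in for real HTTP fetches, and the names `PageParser`, `crawl`, and `search` are invented for this example.

```python
from collections import defaultdict
from html.parser import HTMLParser
import re


class PageParser(HTMLParser):
    """Collects the words and outgoing links of one HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.words = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_data(self, data):
        self.words.extend(re.findall(r"[a-z0-9]+", data.lower()))


def crawl(pages, start):
    """Breadth-first crawl over `pages` (hypothetical url -> HTML mapping);
    returns an inverted index mapping each word to the URLs containing it."""
    index = defaultdict(set)
    queue, seen = [start], {start}
    while queue:
        url = queue.pop(0)
        parser = PageParser()
        parser.feed(pages[url])
        for word in parser.words:
            index[word].add(url)
        for link in parser.links:
            if link in pages and link not in seen:
                seen.add(link)
                queue.append(link)
    return index


def search(index, query):
    """Return the sorted URLs that contain every word of the query."""
    terms = re.findall(r"[a-z0-9]+", query.lower())
    if not terms:
        return []
    hits = set.intersection(*(index.get(t, set()) for t in terms))
    return sorted(hits)


# Hypothetical sample site: one wiki page linking to one non-wiki page.
PAGES = {
    "/wiki/Infrastructure": '<h1>Infrastructure</h1> <a href="/docs/install.html">Install guide</a>',
    "/docs/install.html": "<p>Install Fedora from the live image</p>",
}
INDEX = crawl(PAGES, "/wiki/Infrastructure")
```

Calling `search(INDEX, "install fedora")` returns only the non-wiki page, because the crawler reached it through the link on the wiki page. A real deployment would fetch over HTTP, respect robots.txt, and rank results, none of which is shown here.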

Preferences

  • Python-based (no Java)
  • Programmable keywords, to control which pages are displayed for certain keywords
  • XML or library interface so other applications can use it
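
The last two preferences could look roughly like the sketch below: a keyword table pins chosen pages ahead of the ordinary results, and the merged list is serialized as XML for other applications. The `PINNED` table, the function names, and the XML element names are all assumptions made for illustration. None of them comes from a candidate engine.

```python
import xml.etree.ElementTree as ET

# Hypothetical keyword table: query keyword -> pages to show first.
PINNED = {
    "download": ["/get-fedora"],
}


def apply_keywords(query, organic, pinned=PINNED):
    """Put pinned pages for any keyword in the query ahead of the
    organic results, without listing a page twice."""
    results = []
    for word in query.lower().split():
        for url in pinned.get(word, []):
            if url not in results:
                results.append(url)
    for url in organic:
        if url not in results:
            results.append(url)
    return results


def to_xml(query, urls):
    """Serialize ranked results as a small XML document (assumed schema)."""
    root = ET.Element("results", query=query)
    for rank, url in enumerate(urls, start=1):
        ET.SubElement(root, "result", rank=str(rank)).text = url
    return ET.tostring(root, encoding="unicode")
```

For example, `apply_keywords("Download Fedora", ["/wiki/Download", "/get-fedora"])` moves `/get-fedora` to the front, and `to_xml` turns the result into a document that another application can parse with any XML library.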

Project Plan

  1. Investigate and evaluate existing open source search engines
  2. Select candidate software
  3. Create public test instance of candidate software
  4. Test for functionality, performance, and impact (re-evaluate, if necessary)
  5. Create capacity and deployment plans
  6. Deploy

Resources Needed

  • Public Test for testing candidate software
  • Permanent home(s) for deployment
    • Web server(s)
    • Database server(s)

Software Investigation and Evaluation

In Progress

  • KinoSearch [2]
    • Perl port of Lucene
  • Namazu [3]
    • written in Perl
  • OpenFTS [4]
    • written in Perl or TCL on top of PostgreSQL
    • Python interface available
    • not actively maintained
  • Plucene [5]
    • Perl port of Lucene
    • not actively maintained

Not Suitable

  • DataparkSearch [6]
    • written in C
  • Egothor [7]
    • written in Java
  • Grub [8]
    • written in C#
  • Heritrix [9]
    • written in Java
    • archives content rather than simply indexing it
  • ht://Dig [10]
    • written in C++
    • not actively maintained
  • Indri [11]
    • written in C/C++
  • Isearch [12]
    • written in C++
  • Lucene [13]
    • written in Java, but ported to other languages [14]
  • mnoGoSearch [15]
    • written in C
  • MWSearch [16]
    • requires EzMwLucene (Java, not desirable) to be running on the servers to be searched
    • EzMwLucene is wiki-only, therefore MWSearch is wiki-only
  • Nutch [17]
    • written in Java
    • based on Lucene
  • RigorousSearch [18]
    • crawls the MediaWiki database, not the web site
    • doesn't work for non-MediaWiki web sites, including any non-wiki web site
  • Sphinx [19]
    • written in C++
  • Swish-e [20]
    • written in C
    • Swish++ is a rewrite in C++
  • Terrier (TERabyte RetrIEveR) [21]
    • written in Java
  • Xapian [22]
    • written in C++
  • YaCy [23]
    • written in Java
  • Zettair [24]
    • written in C
  • HtdigSearch [25]
    • it's just a MediaWiki plugin, not suitable for searching non-wiki sites

Public Testing

<tbd>

Deployment Plan

<tbd>

References

  1. "Fedora Search Engine". Infrastructure/Tickets. https://fedorahosted.org/fedora-infrastructure/ticket/1055. 
  2. "KinoSearch". Rectangular Research. http://www.rectangular.com/kinosearch/. 
  3. "Namazu". Namazu Project. http://www.namazu.org/. 
  4. "OpenFTS". SourceForge. http://openfts.sourceforge.net/. 
  5. "Plucene". CPAN. http://search.cpan.org/~tmtm/Plucene-1.25. 
  6. "DataparkSearch". DataparkSearch. http://www.dataparksearch.org/. 
  7. "Egothor". Egothor. http://www.egothor.org/. 
  8. "Grub". Wikia, Inc. http://grub.org/.
  9. "Heritrix". Internet Archive. http://crawler.archive.org/. 
  10. "ht://Dig". The ht://Dig Group. http://www.htdig.org/. 
  11. "Indri". The Lemur Project. http://www.lemurproject.org/indri/. 
  12. "Isearch". Isite. http://isite.awcubed.com/. 
  13. "Lucene". Apache Software Foundation. http://lucene.apache.org/. 
  14. "Lucene Implementations". Apache Software Foundation. http://wiki.apache.org/lucene-java/LuceneImplementations. 
  15. "mnoGoSearch". LavTech. http://www.mnogosearch.org/. 
  16. "MWSearch Extension". MediaWiki. https://secure.wikimedia.org/wikipedia/mediawiki/wiki/Extension:MWSearch. 
  17. "Nutch". Apache Software Foundation. http://lucene.apache.org/nutch/. 
  18. "RigorousSearch Extension". MediaWiki. https://secure.wikimedia.org/wikipedia/mediawiki/wiki/Extension:RigorousSearch. 
  19. "Sphinx". Sphinx Technologies. http://sphinxsearch.com/. 
  20. "Swish-e". Swish-e. http://swish-e.org/. 
  21. "Terrier". Terrier Project. http://ir.dcs.gla.ac.uk/terrier/. 
  22. "Xapian". Xapian Project. http://xapian.org/. 
  23. "YaCy". Karlsruhe Institute of Technology. http://yacy.net/. 
  24. "Zettair". Search Engine Group, Royal Melbourne Institute of Technology. http://www.seg.rmit.edu.au/zettair/. 
  25. "HtdigSearch Extension". MediaWiki. https://secure.wikimedia.org/wikipedia/mediawiki/wiki/Extension:HtdigSearch.