From Fedora Project Wiki


Revision as of 00:24, 12 October 2009 by Akistler (Software Investigation and Evaluation: Added quick search results for additional candidates)


Project Sponsor

Name: Mike McGrath
Fedora Account Name: mmcgrath
Group: Infrastructure
Infrastructure Sponsor: mmcgrath

Secondary Contact info

Name: Huzaifa Sidhpurwala
Fedora Account Name: huzaifas
Group: Infrastructure

Name: Allen Kistler
Fedora Account Name: akistler
Group: Infrastructure

Name: Keiran Smith
Fedora Account Name: affix
Group: Infrastructure

Project Info

Project Name: Search Engine Enhancement
Target Audience: All users of Fedora web sites
Expiration/Delivery Date (required): F13

Description/Summary: Fedora needs a search engine [1]

Requirements:

  • Crawl the web sites (wiki and non-wiki)
  • Search the web sites (wiki and non-wiki)

Preferences:

  • Python-based (no Java)
  • Programmable keywords, to control which pages are displayed for certain search terms
  • XML or library interface so other applications can use it
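As a rough illustration of the last two preferences, the sketch below shows one way programmable keywords and an XML interface could fit together. It is not taken from any of the candidate engines: the PINNED table, the apply_pins and to_xml functions, the example.org URLs, and the XML element names are all hypothetical, chosen only to make the idea concrete.

```python
from xml.etree import ElementTree as ET

# Hypothetical keyword table: search term -> pages that should be shown first.
# An administrator would maintain this by hand.
PINNED = {
    "download": ["https://example.org/get-fedora"],
}


def apply_pins(query, results, pinned=PINNED):
    """Return results with the pinned pages for any query term moved to the top."""
    promoted = []
    for word in query.lower().split():
        for url in pinned.get(word, []):
            if url not in promoted:
                promoted.append(url)
    # Keep the engine's own ordering for everything that was not pinned.
    return promoted + [url for url in results if url not in promoted]


def to_xml(query, urls):
    """Serialize a ranked list of URLs as a small XML document (assumed schema)."""
    root = ET.Element("results", query=query)
    for rank, url in enumerate(urls, start=1):
        ET.SubElement(root, "result", rank=str(rank)).text = url
    return ET.tostring(root, encoding="unicode")
```

Another application could then call apply_pins on whatever list the search engine returns and hand the result to to_xml, or import the two functions directly as a library.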

Project plan (Detailed):

  1. Investigate and evaluate existing open source search engines
  2. Select candidate software
  3. Create public test instance of candidate software
  4. Test for functionality, performance, and impact (re-evaluate, if necessary)
  5. Create capacity and deployment plans
  6. Deploy

Specific resources needed

  • Public Test for testing candidate software
  • Permanent home(s) for deployment
    • Web server(s)
    • Database server(s)

Software Investigation and Evaluation

  • HtdigSearch [2]
    • Huzaifa (in progress)
  • SphinxSearch [3]
    • Huzaifa (in progress)
  • MWSearch
    • Not suitable
    • MWSearch requires EzMwLucene to be running on the servers to be searched. EzMwLucene is written in Java and is therefore not preferred.
    • MWSearch is a client to EzMwLucene, which is wiki-only, so MWSearch is wiki-only as well.
  • RigorousSearch [5]
    • Not suitable
    • RigorousSearch crawls the MediaWiki database, not the web site, so it does not work for non-MediaWiki web sites, including any non-wiki web site.
  • DataparkSearch
    • written in C
  • Egothor
    • written in Java
  • Gonzui (specializes in source code search)
    • ? written in Ruby
    • ? not actively maintained
  • Grub
    • ? written in C#
  • Ht://dig
    • written in C++
    • not actively maintained
  • Isearch
    • written in C++
  • Lucene
  • Lemur Toolkit & Indri Search Engine
    • written in C/C++
    • not really a search engine, more like a toolkit
  • mnoGoSearch
    • written in C
  • Namazu
    • written in Perl
  • Nutch
    • written in Java
    • based on Lucene
  • OpenFTS
    • written in Perl or TCL on top of PostgreSQL
    • Python interface available
    • not actively maintained
  • Sciencenet (for scientific knowledge, based on YaCy technology)
    • written in Java
  • Sphinx
    • written in C++
  • SWISH-E
    • written in C
  • Terrier Search Engine
    • written in Java
  • Wikia Search
    • shut down
  • Xapian
    • written in C++
  • YaCy
    • written in Java
  • Zettair
    • written in C

Public Testing

<tbd>

Deployment Plan

<tbd>

References