From Fedora Project Wiki
Revision as of 00:24, 12 October 2009


Project Sponsor

Name: Mike McGrath
Fedora Account Name: mmcgrath
Group: Infrastructure
Infrastructure Sponsor: mmcgrath

Secondary Contact info

Name: Huzaifa Sidhpurwala
Fedora Account Name: huzaifas
Group: Infrastructure

Name: Allen Kistler
Fedora Account Name: akistler
Group: Infrastructure

Name: Keiran Smith
Fedora Account Name: affix
Group: Infrastructure

Project Info

Project Name: Search Engine Enhancement
Target Audience: All users of Fedora web sites
Expiration/Delivery Date (required): F13

Description/Summary: Fedora needs a search engine[1]

Requirements:

  • Crawl the web sites (wiki and non-wiki)
  • Search the web sites (wiki and non-wiki)
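
As an illustration of the crawl requirement, the sketch below shows the link-discovery step that any crawler covering both wiki and non-wiki pages would need. It is not taken from any of the candidate packages. The names LinkExtractor, extract_site_links, and allowed_hosts are hypothetical, and the URLs in the usage note are only examples.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag, urlparse


class LinkExtractor(HTMLParser):
    """Collect absolute URLs from the href attributes of <a> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links and drop any #fragment.
                absolute, _fragment = urldefrag(urljoin(self.base_url, value))
                self.links.append(absolute)


def extract_site_links(html, base_url, allowed_hosts):
    """Return the unique links in html that point at one of allowed_hosts.

    allowed_hosts is a hypothetical whitelist of the hosts to be crawled.
    """
    parser = LinkExtractor(base_url)
    parser.feed(html)
    unique = []
    for link in parser.links:
        parts = urlparse(link)
        if (parts.scheme in ("http", "https")
                and parts.hostname in allowed_hosts
                and link not in unique):
            unique.append(link)
    return unique
```

Called with a page fetched from http://fedoraproject.org/wiki/Main_Page and allowed_hosts set to {"fedoraproject.org"}, the function returns only the links that stay on that host, whether they point at wiki pages or at static pages. A real crawler would also need fetching, politeness delays, and robots.txt handling, none of which are shown here.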

Preferences:

  • Python-based (no Java)
  • Programmable keywords, to control which pages are displayed for particular search terms
  • XML or library interface so other applications can use it
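
The last two preferences can be sketched as follows, under stated assumptions. PINNED is a hypothetical keyword table, apply_pinned places the pages configured for a keyword ahead of whatever the engine returns, and results_to_xml shows one possible XML shape that other applications could consume. The element names, the attribute names, and the example URL are invented for illustration and do not come from any candidate package.

```python
import xml.etree.ElementTree as ET

# Hypothetical keyword table: search term -> pages to show first.
PINNED = {
    "download": ["http://fedoraproject.org/get-fedora"],
}


def apply_pinned(query, organic_results, pinned=PINNED):
    """Put pinned pages for any query word ahead of the engine's own results."""
    ordered = []
    for word in query.lower().split():
        for url in pinned.get(word, []):
            if url not in ordered:
                ordered.append(url)
    # Append the engine's results, skipping anything already pinned.
    for url in organic_results:
        if url not in ordered:
            ordered.append(url)
    return ordered


def results_to_xml(query, urls):
    """Serialize a result list as a small XML document (made-up element names)."""
    root = ET.Element("results", {"query": query})
    for rank, url in enumerate(urls, start=1):
        hit = ET.SubElement(root, "hit", {"rank": str(rank)})
        hit.text = url
    return ET.tostring(root, encoding="unicode")
```

Keeping the keyword table separate from the engine means the overrides could be edited without re-indexing, and the same ordered list could be offered through a library call or the XML form.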

Project plan (Detailed):

  1. Investigate and evaluate existing open source search engines
  2. Select candidate software
  3. Create public test instance of candidate software
  4. Test for functionality, performance, and impact (re-evaluate, if necessary)
  5. Create capacity and deployment plans
  6. Deploy

Specific resources needed

  • Public Test for testing candidate software
  • Permanent home(s) for deployment
    • Web server(s)
    • Database server(s)

Software Investigation and Evaluation

  • HtdigSearch [2]
    • Huzaifa (in progress)
  • SphinxSearch [3]
    • Huzaifa (in progress)
  • MWSearch [4]
    • Not suitable
    • MWSearch requires EzMwLucene to be running on the servers to be searched. EzMwLucene is written in Java, so it is not preferred.
    • MWSearch is a client to EzMwLucene, which is wiki-only, so MWSearch is wiki-only too.
  • RigorousSearch [5]
    • Not suitable
    • RigorousSearch crawls the MediaWiki database, not the web site. It does not work for non-MediaWiki web sites, which includes every non-wiki web site.
  • DataparkSearch
    • written in C
  • Egothor
    • written in Java
  • Gonzui (specializes in source code search)
    • ? written in Ruby
    • ? not actively maintained
  • Grub
    • ? written in C#
  • Ht://dig
    • written in C++
    • not actively maintained
  • Isearch
    • written in C++
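  • Lucene
    • originally in Java, ported to other languages
    • Perl ports are Plucene and KinoSearch
    • Ruby port is Ferret
    • see http://wiki.apache.org/lucene-java/LuceneImplementations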
  • Lemur Toolkit & Indri Search Engine
    • written in C/C++
    • not really a search engine, more like a toolkit
  • mnoGoSearch
    • written in C
  • Namazu
    • written in Perl
  • Nutch
    • written in Java
    • based on Lucene
  • OpenFTS
    • written in Perl or TCL on top of PostgreSQL
    • Python interface available
    • not actively maintained
  • Sciencenet (for scientific knowledge, based on YaCy technology)
    • written in Java
  • Sphinx
    • written in C++
  • SWISH-E
    • written in C
  • Terrier Search Engine
    • written in Java
  • Wikia Search
    • shut down
  • Xapian
    • written in C++
  • YaCy
    • written in Java
  • Zettair
    • written in C

Public Testing

<tbd>

Deployment Plan

<tbd>

References