Revision as of 20:38, 29 October 2009


Points of Contact

Project Sponsor

Name: Mike McGrath
Fedora Account Name: mmcgrath
Group: Infrastructure
Infrastructure Sponsor: mmcgrath

Secondary Contact info

Name: Huzaifa Sidhpurwala
Fedora Account Name: huzaifas
Group: Infrastructure

Name: Allen Kistler
Fedora Account Name: akistler
Group: Infrastructure

Project Info

Project Name: Search Engine Enhancement
Target Audience: All users of Fedora web sites
Expiration/Delivery Date (required): F13

Description/Summary

Fedora needs a search engine that covers all of its web sites, wiki and non-wiki. ^[1]

Requirements

Crawl the web sites (wiki and non-wiki)
Search the web sites (wiki and non-wiki)

Preferences

Python-based

Note: Other languages are permitted, but Java must be the GCJ/OpenJDK versions in RHEL5. Sun/IBM/BEA Java is not acceptable.

Programmable keywords, to control which pages are displayed for certain keywords
XML or library interface so other applications can use it
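
The two preferences above could fit together roughly as follows. This is only a sketch, not the behavior of any candidate engine: the `PINNED` table, the `apply_pins` and `results_to_xml` helpers, the example URLs, and the XML layout are all hypothetical. Pages pinned to a keyword are listed ahead of the engine's ordinary results, and the merged list is serialized as XML so that other applications can consume it.

```python
import xml.etree.ElementTree as ET

# Hypothetical keyword table: administrators pin pages to keywords.
PINNED = {
    "install": ["https://docs.example.org/install-guide"],
}


def apply_pins(query, engine_results, pinned=PINNED):
    """Return results with pages pinned to any query keyword listed first."""
    ordered = []
    for word in query.lower().split():
        for url in pinned.get(word, []):
            if url not in ordered:
                ordered.append(url)
    # Append the engine's own results, skipping pages already pinned.
    for url in engine_results:
        if url not in ordered:
            ordered.append(url)
    return ordered


def results_to_xml(query, urls):
    """Serialize a result list as a small XML document (layout is an assumption)."""
    root = ET.Element("results", {"query": query})
    for rank, url in enumerate(urls, start=1):
        hit = ET.SubElement(root, "hit", {"rank": str(rank)})
        hit.text = url
    return ET.tostring(root, encoding="unicode")
```

Keeping the keyword table separate from the index means that pins can be changed without re-crawling, whichever engine is selected.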

Project Plan

Investigate and evaluate existing open source search engines
Select candidate software
Create public test instance of candidate software
Test for functionality, performance, and impact (re-evaluate, if necessary)
Create capacity and deployment plans
Deploy

Resources Needed

Public Test for testing candidate software
Permanent home(s) for deployment
- Web server(s)
- Database server(s)

Software Investigation and Evaluation

In Progress

DataparkSearch ^[2]

written in C

Indri ^[3]

written in C/C++

Isearch ^[4]

written in C++

KinoSearch ^[5] - Allen investigating

Description

Perl/C port of Lucene

The maintainer, Rectangular Research, appears to be a single developer, who considers KinoSearch to be alpha software

KinoSearch and Ferret intend to merge as Lucy ^[6]

Evaluation

KinoSearch is a search engine library with a sample indexer and search page rather than a fully functional application. It stores indices in Berkeley DB files with JSON interfaces. It allows custom-designed indices, including categories (exact match), which fulfills the "programmable keywords" requirement. Each document index on each document source is a single write-once file collection (BDB and JSON) in a unique directory. Rerunning the indexer creates a new directory, which obsoletes the old directory if all the old documents are included; the old directory then needs to be cleaned up. Postings can, however, be deleted from an index. It is also possible to index only the new documents, but that is not efficient.
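
The cleanup step described above might be scripted along these lines. This is a sketch under stated assumptions, not part of KinoSearch: the `prune_old_indexes` helper is hypothetical, and it assumes that each indexer run writes to its own subdirectory of one base directory, that the subdirectory names sort chronologically (for example, a date stamp), and that the newest run re-indexed every document.

```python
import shutil
from pathlib import Path


def prune_old_indexes(base_dir, keep=1):
    """Delete all but the newest `keep` index directories under base_dir.

    Assumes each indexer run wrote to its own subdirectory whose name
    sorts chronologically, and that the newest run covers every document.
    Returns the names of the directories that were removed.
    """
    if keep < 1:
        raise ValueError("keep must be at least 1")
    # Only directories count as index runs; stray files are left alone.
    runs = sorted(p for p in Path(base_dir).iterdir() if p.is_dir())
    stale = runs[:-keep]
    for path in stale:
        shutil.rmtree(path)
    return [p.name for p in stale]
```

For example, with run directories named `20091027`, `20091028`, and `20091029`, calling `prune_old_indexes(base_dir)` would remove the first two and keep the newest.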

Requirements

buildrequires

gcc
(EPEL) perl-Module-Build

requires

(EPEL) perl-JSON-XS

Problem: KinoSearch asks for version 1.53, but EPEL has 1.43

Note: http://web.archive.org/web/20071122035408/search.cpan.org/src/MLEHMANN/JSON-XS-1.53/

Note: works with 1.43, anyway

(EPEL) perl-Lingua-Stem-Snowball
(EPEL) perl-Lingua-StopWords
(EPEL) perl-Parse-RecDescent

sample indexer reads files from the file system and requires

(EPEL) perl-HTML-Tree

sample cgi search script requires

(CPAN) Data::Pageset (which requires Data::Page)
(EPEL) perl-Test-Exception
(EPEL) perl-Class-Accessor-Chained

mnoGoSearch ^[7]

written in C

Namazu ^[8] - Huzaifa investigating

written in Perl

Swish-e ^[9]

written in C

Swish++ is a rewrite in C++

Xapian ^[10]

written in C++

Bindings to Python, Perl, PHP, Ruby, Java, and more

Omega (builtin) provides a more complete search engine experience on top of core Xapian

Built-in web crawler is a script that interfaces with ht://Dig

Flax ^[11] is another search engine built on top of Xapian and CherryPy

Zettair ^[12]

written in C

Not Suitable

Egothor ^[13]

written in Java

Ferret ^[14]

Ruby port of Lucene

KinoSearch and Ferret intend to merge as Lucy ^[6]

Grub ^[15]

written in C#

Heritrix ^[16]

written in Java
archives content rather than indexing it

ht://Dig ^[17]

written in C++
not actively maintained

HtdigSearch ^[18]

It's just a MediaWiki plugin, not suitable for searching non-wiki sites

Lucene ^[19]

written in Java, but ported to others ^[20]

MWSearch ^[21]

Requires EzMwLucene (Java, not desirable) to be running on the servers to be searched
EzMwLucene is wiki-only, therefore MWSearch is wiki-only

Nutch ^[22]

written in Java
based on Lucene

OpenFTS ^[23]

written in Perl or Tcl on top of PostgreSQL
Python interface available
not actively maintained

Plucene ^[24]

Perl port of Lucene
not actively maintained

RigorousSearch ^[25]

Crawls the MediaWiki database, not the web site

Doesn't work for non-MediaWiki web sites, including any non-wiki web site

Sphinx ^[26]

written in C++

designed to index SQL tables, not web pages.

SphinxSearch ^[27]

written in C++

Wiki-only (?)

Terrier (TERabyte RetrIEveR) ^[28]

written in Java

YaCy ^[29] - Examined by Huzaifa

written in Java, but requires Sun Java
Well maintained
Support for peer search engine database exchanges
Customized search parameters
Fast indexing and a web interface for querying the back-end database

Public Testing

<tbd>

Deployment Plan

<tbd>

References

1. "Fedora Search Engine". Infrastructure/Tickets. https://fedorahosted.org/fedora-infrastructure/ticket/1055.
2. "DataparkSearch". DataparkSearch. http://www.dataparksearch.org/.
3. "Indri". The Lemur Project. http://www.lemurproject.org/indri/.
4. "Isearch". Isite. http://isite.awcubed.com/.
5. "KinoSearch". Rectangular Research. http://www.rectangular.com/kinosearch/.
6. "Lucy". Apache Software Foundation. http://lucene.apache.org/lucy/.
7. "mnoGoSearch". LavTech. http://www.mnogosearch.org/.
8. "Namazu". Namazu Project. http://www.namazu.org/.
9. "Swish-e". Swish-e. http://swish-e.org/.
10. "Xapian". Xapian Project. http://xapian.org/.
11. "Flax". Flax. http://www.flax.co.uk/products.shtml.
12. "Zettair". Search Engine Group, Royal Melbourne Institute of Technology. http://www.seg.rmit.edu.au/zettair/.
13. "Egothor". Egothor. http://www.egothor.org/.
14. "Ferret". David Balmain. http://ferret.davebalmain.com/.
15. "Grub". Wikia, Inc. http://grub.org/.
16. "Heritrix". Internet Archive. http://crawler.archive.org/.
17. "ht://Dig". The ht://Dig Group. http://www.htdig.org/.
18. "HtdigSearch Extension". MediaWiki. https://secure.wikimedia.org/wikipedia/mediawiki/wiki/Extension:HtdigSearch.
19. "Lucene". Apache Software Foundation. http://lucene.apache.org/.
20. "Lucene Implementations". Apache Software Foundation. http://wiki.apache.org/lucene-java/LuceneImplementations.
21. "MWSearch Extension". MediaWiki. https://secure.wikimedia.org/wikipedia/mediawiki/wiki/Extension:MWSearch.
22. "Nutch". Apache Software Foundation. http://lucene.apache.org/nutch/.
23. "OpenFTS". SourceForge. http://openfts.sourceforge.net/.
24. "Plucene". CPAN. http://search.cpan.org/~tmtm/Plucene-1.25.
25. "RigorousSearch Extension". MediaWiki. https://secure.wikimedia.org/wikipedia/mediawiki/wiki/Extension:RigorousSearch.
26. "Sphinx". Sphinx Technologies. http://sphinxsearch.com/.
27. "SphinxSearch Extension". MediaWiki. https://secure.wikimedia.org/wikipedia/mediawiki/wiki/Extension:SphinxSearch.
28. "Terrier". Terrier Project. http://ir.dcs.gla.ac.uk/terrier/.
29. "YaCy". Karlsruhe Institute of Technology. http://yacy.net/.



Infrastructure/Search: Difference between revisions