Infrastructure/Search

From FedoraProject

Points of Contact

Project Sponsor

Name: Mike McGrath
Fedora Account Name: mmcgrath
Group: Infrastructure
Infrastructure Sponsor: mmcgrath

Secondary Contact info

Name: Allen Kistler
Fedora Account Name: akistler
Group: Infrastructure

Name: Huzaifa Sidhpurwala
Fedora Account Name: huzaifas
Group: Infrastructure

Project Info

Project Name: Search Engine Enhancement
Target Audience: All users of Fedora web sites
Expiration/Delivery Date (required): F13

Description/Summary

Fedora needs a search engine.[1]

Requirements

Preferences

Project Plan

  1. Investigate and evaluate existing open source search engines
  2. Select candidate software
  3. Create public test instances of candidate software
  4. Test for functionality, performance, and impact (re-evaluate, if necessary)
  5. Create capacity and deployment plans
  6. Deploy

Resources Needed

Software Investigation and Evaluation

Comparison by Requirements

Engine Name | Source Language | Integrated Crawler | Integrated Search Tool | Programmable Categories | Application Interface
DataparkSearch[2] | C | Yes | Yes | Yes (Tags) | Yes (Native C API)
Egothor[3] | Java | | | |
Ferret[4] | Ruby | | | |
Indri[6] | C/C++ | | | |
KinoSearch[7] | Perl/C | No (sample file crawler included) | No (sample included) | Yes | Yes (BDB/JSON)
mnoGoSearch[8] | C | Yes | Yes | Yes (Tags and Hierarchical categories) | Yes (Native C API)
Nutch[9] | Java | Yes (OpenJDK command line) | Yes (Tomcat servlet) | No | Yes (Java)
Swish-e[10] | C/Perl | Yes (Perl) | Sort Of (sample included, but has problems) | No (but can search on META tags) | Yes (Perl and C APIs)
Xapian[11] | C++ | Sort Of (combined Omega with custom Perl) | Yes (rudimentary Omega CGI) | Yes | Yes (C++, Perl, Python, Ruby)

In Progress

Eliminated from Consideration

CLucene[13]
  • Reason for elimination
It has no crawling/spidering facility. It is a library toolkit (API) only.
  • Description
C++ port of Lucene
in Fedora already
described as beta by the developers
Lucene[15]
  • Reason for elimination
It has no crawling/spidering facility. It has no user query interface. There are no samples.
  • Description
written in Java, but ported to other languages[16]
Requires/uses GCJ
in Fedora already
PyLucene [17] is a Python wrapper around Java Lucene
  • Evaluation
Search engine library meant to be integrated into applications
  • Requirements
buildrequires (based on 1.4.3-f7)
  • ant
  • ant-junit
  • java-1.4.2-gcj-compat-devel
  • javacc
  • jpackage-utils
  • junit
  • make
requires (based on 1.4.3-f7)
  • java-1.4.2-gcj-compat
Namazu[18]
  • Reason for elimination
It has no crawling/spidering facility. It indexes local documents only.
  • Description
written in Perl
in Fedora already
Solr[19]
  • Reason for elimination
It has no crawling/spidering facility. It has no user query interface. There are no samples.
  • Description
written in Java
based on Lucene
  • Evaluation
The documentation describes installing Sun Java to run Solr, but OpenJDK 1.5 or later is fine. Solr needs a Java servlet container in which to run. It comes with Jetty, but other containers should work, as well (e.g., Tomcat). Currently only supports UTF-8 characters.
Basically Solr provides an HTTP admin GUI for a search engine that uses a superset of the Lucene query syntax. The schema is very flexible. Set-up is essentially entirely through XML files. Applications can query the servlet port and get XML or JSON responses.
  • Requirements
buildrequires
  • ant (note that ant currently pulls in java-gcj-compat, too, but it appears not to be a problem)
  • ant-junit
  • java-1.6.0-openjdk-devel
  • junit
requires
  • java-1.6.0-openjdk
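The evaluation note that applications can query the servlet port and get XML or JSON responses can be sketched roughly as follows. This is only an illustration, not part of the evaluation: the base URL (the stock Jetty example port), the helper names, and the sample response are assumptions, and no request is actually sent.

```python
import json
from urllib.parse import urlencode


def build_solr_query_url(base_url, query, rows=10):
    """Build a Solr select URL that asks for a JSON response.

    base_url is an assumption, e.g. "http://localhost:8983/solr".
    """
    params = {"q": query, "wt": "json", "rows": rows}
    return f"{base_url}/select?{urlencode(params)}"


def extract_hits(response_text):
    """Return (number of matches, list of documents) from a JSON response body."""
    payload = json.loads(response_text)
    response = payload["response"]
    return response["numFound"], response["docs"]


if __name__ == "__main__":
    # Hypothetical query against an assumed local test instance.
    print(build_solr_query_url("http://localhost:8983/solr", "fedora search"))

    # Hand-written sample in the shape Solr uses for JSON results.
    sample = (
        '{"responseHeader": {"status": 0},'
        ' "response": {"numFound": 1, "start": 0, "docs": [{"id": "doc1"}]}}'
    )
    print(extract_hits(sample))
```

The URL could be fetched with any HTTP client; a crawler and a user-facing search page would still have to be supplied separately, which is the gap noted in the reason for elimination.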
Terrier[20]
  • Reason for elimination
It has no crawler or user search tool. It does not run as a service (as provided), only interactively.
  • Description
written in Java
runs from the command line (i.e., not a Tomcat servlet)
YaCy[21]
  • Reason for elimination
Requires Sun Java
  • Description
written in Java
well maintained
support for peer search engine database exchanges
customized search parameters
fast indexing and web interface for querying the back-end db
Zettair[22]
  • Reason for elimination
No crawling capability, only indexes local documents
User search/retrieval tool is command-line only, no web interface
  • Description
written in C

Never Considered

Gonzui[23]
  • written in Ruby
  • specializes in source code search
  • not actively maintained
Grub[24]
  • written in C#
Heritrix[25]
  • written in Java
  • archives content rather than indexing it
ht://Dig[26]
  • written in C++
  • not actively maintained
HtdigSearch Extension[27]
  • It's just a MediaWiki plug-in, not suitable for searching non-wiki sites
MWSearch Extension[28]
  • Requires EzMwLucene (Java) to be running on the servers to be searched
  • EzMwLucene is wiki-only, therefore MWSearch is wiki-only
OpenFTS[29]
  • written in Perl or TCL on top of PostgreSQL
  • Python interface available
  • not actively maintained
Plucene[30]
  • Perl port of Lucene
  • not actively maintained
RigorousSearch Extension[31]
  • Crawls the MediaWiki database, not the web site
  • Doesn't work for non-MediaWiki web sites, including any non-wiki web site
Sphinx[32]
  • written in C++
  • designed to index SQL tables, not web pages
SphinxSearch Extension[33]
  • written in C++
  • MediaWiki plug-in, so it's wiki-only
Whoosh[34]
  • written in Python
  • inspired by Lucene, but closer to a Python port of parts of KinoSearch combined with some features of Terrier
  • toolkit only, with no sample crawlers or user interfaces

Public Testing

Public Testing is taking place on publictest3.

Search Engines

DataparkSearch
  • PostgreSQL installed (postgresql, postgresql-devel, postgresql-libs, postgresql-server)
  • SELinux (not present on publictest3, but needed eventually) needs "setsebool -P httpd_can_network_connect=1" to connect to PostgreSQL
  • Crawling trials (with database cleared each time, i.e., not incremental)
      • Memory and CPU utilization are modest, less than 10% each. Most CPU time is spent in I/O wait for the database.
      • Depth=4, 2.5 hrs crawling, 2k documents, db = 320M (700M with clone detection off)
      • Depth=5, 16.5 hrs crawling, 12k documents, db = 1.1G
mnoGoSearch
  • PostgreSQL installed (postgresql, postgresql-devel, postgresql-libs, postgresql-server)
  • SELinux (not present on publictest3, but needed eventually) needs "setsebool -P httpd_can_network_connect=1" to connect to PostgreSQL
  • Crawling trials (with database cleared each time, i.e., not incremental)
      • Memory and CPU utilization are quite modest, about 1% each. Most CPU time is spent in I/O wait for the database.
      • Depth=4, 2 hrs crawling, 1.5 min indexing, 11k documents
      • Depth=5, 4.5 hrs crawling, 12 min indexing, 25k documents
      • Depth=6, 6.5 hrs crawling, 16 min indexing, 34k documents
      • Depth=7, 12 hrs crawling, 23 min indexing, 40k documents, db = 2.6G
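To make the trial figures immediately above easier to compare, the crawl rate at each depth can be worked out as documents per crawl hour. The snippet below is a small sketch of that arithmetic; the function names are made up for illustration, and the document counts are the rounded values from the list (11k, 25k, and so on), so the rates are approximate.

```python
# (depth, crawl hours, approximate document count) copied from the trials above.
TRIALS = [
    (4, 2.0, 11_000),
    (5, 4.5, 25_000),
    (6, 6.5, 34_000),
    (7, 12.0, 40_000),
]


def documents_per_hour(hours, documents):
    """Average crawl rate for one trial."""
    if hours <= 0:
        raise ValueError("crawl time must be positive")
    return documents / hours


def crawl_rates(trials):
    """Map each depth to its rounded documents-per-hour rate."""
    return {depth: round(documents_per_hour(hours, docs))
            for depth, hours, docs in trials}


if __name__ == "__main__":
    for depth, rate in crawl_rates(TRIALS).items():
        print(f"Depth={depth}: about {rate} documents per crawl hour")
```

On these rounded figures the rate stays between roughly 5,200 and 5,600 documents per hour through depth 6 and falls to about 3,300 at depth 7.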
Nutch
  • The Nutch tarball was unpacked in /opt/nutch-1.0, just as in preliminary local testing
  • Tomcat is reverse proxied through Apache (see notes below)
  • Nutch's notion of depth appears to be unusual. The crawler must be directed to spider much more deeply than should be necessary.
  • Crawls are executed as (e.g.) "/opt/nutch-1.0/bin/nutch crawl /opt/nutch-1.0/urls -dir /opt/nutch-1.0/crawl -depth 5 -threads 2"
  • Crawling trials
      • The java process uses about 18% of 6G of memory (4G RAM, 2G swap), regardless of depth
      • Depth=4, 2 threads, 1.5k documents
      • Depth=5, 2 threads, 3 hrs, 8k documents
      • Depth=6, 1 thread, 8.5 hrs, 23k documents
      • Depth=7, 1 thread, 14.5 hrs, 37k documents, db = 400M
      • Depth=8, 1 thread, 16.5 hrs, 44k documents, db = 440M
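Since the Nutch trials repeat the same crawl command with different depth and thread values, the invocation can be parameterized. The following is a minimal sketch, assuming the /opt/nutch-1.0 layout quoted above; the helper name is hypothetical, and the code only builds and prints the command rather than running it.

```python
import shlex

# Install location quoted in the notes above.
NUTCH_HOME = "/opt/nutch-1.0"


def nutch_crawl_command(depth, threads, home=NUTCH_HOME):
    """Return the argument list for one 'nutch crawl' run."""
    if depth < 1 or threads < 1:
        raise ValueError("depth and threads must be at least 1")
    return [
        f"{home}/bin/nutch", "crawl", f"{home}/urls",
        "-dir", f"{home}/crawl",
        "-depth", str(depth),
        "-threads", str(threads),
    ]


if __name__ == "__main__":
    # Print the command lines for two of the trial settings listed above.
    for depth, threads in [(5, 2), (6, 1)]:
        print(shlex.join(nutch_crawl_command(depth, threads)))
```

The returned list can be handed to subprocess.run(cmd, check=True) on a host where Nutch is actually installed.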
Xapian
  • At this time, only xapian-core-libs, xapian-core, and xapian-omega are installed (i.e., no xapian-bindings or perl-Search-xapian)
  • Enabled cgi-bin in /etc/httpd/conf.d/cgi-bin.conf (see notes below)
  • Omega fails on http://fedoraproject.org/wiki/Overview (and possibly would on others later) with "unhtml index"
      • Resolution: Don't use "unhtml"
  • Omega fails on long URIs (longer than 244 chars)
      • Example: http://fedoraproject.org/wiki/Special:WhatLinksHere/Ru_RU/%D0%9F%D0%BB%D0%B0%D0%BD_%D1%80%D0%B0%D0%B1%D0%BE%D1%82%D1%8B_%D0%BF%D0%BE_%D0%BF%D0%B5%D1%80%D0%B5%D0%B2%D0%BE%D0%B4%D1%83_web-%D1%81%D0%B0%D0%B9%D1%82%D0%B0_%D0%BF%D1%80%D0%BE%D0%B5%D0%BA%D1%82%D0%B0_Fedora
      • Resolution: Enhanced the custom crawler to filter URIs better (fedoraproject.org/w/ from the wiki); added the capability to discard URIs that are too long (mostly hex-encoded URIs translated from other DBCS)
  • The custom Perl crawler prints warnings for (and refuses to translate) URIs with Unicode characters outside the Latin 1 range
      • Resolution: None. This issue is known for URI.pm.[35]
  • Crawling trials
      • Depth=5 failed: scriptindex used 70% of 4G of memory (2G RAM, 2G swap, 0% free); the run was terminated because the system was sluggish with swap I/O
      • Depth=4, 4 hrs, 15218 documents, index is about 500M on disk, scriptindex used 40% of 4G of memory (2G RAM, 2G swap)
      • Depth=5, 8 hrs, 41171 documents, index is about 1G on disk, scriptindex used 20% of 6G of memory (4G RAM, 2G swap)
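The two crawler-side fixes described above (skipping fedoraproject.org/w/ and discarding URIs longer than 244 characters) amount to a simple pre-index filter. The actual crawler is a custom Perl script that is not reproduced here; the Python below is only a sketch of the same idea, and the function and constant names are assumptions.

```python
from urllib.parse import urlsplit

# Length limit reported above: Omega fails on URIs longer than 244 characters.
MAX_URI_LENGTH = 244

# Path prefixes to skip on the wiki host, per the resolution noted above.
EXCLUDED_HOST = "fedoraproject.org"
EXCLUDED_PATH_PREFIXES = ("/w/",)


def should_index(uri):
    """Return True if the URI passes the length and path filters."""
    if len(uri) > MAX_URI_LENGTH:
        return False
    parts = urlsplit(uri)
    if (parts.hostname == EXCLUDED_HOST
            and parts.path.startswith(EXCLUDED_PATH_PREFIXES)):
        return False
    return True


if __name__ == "__main__":
    candidates = [
        "http://fedoraproject.org/wiki/Overview",
        "http://fedoraproject.org/w/index.php?title=Overview",
        "http://fedoraproject.org/wiki/" + "%D0%9F" * 50,  # synthetic long URI
    ]
    for uri in candidates:
        print(should_index(uri), uri[:60])
```

Filtering before the URI reaches the indexer keeps a single bad link from aborting a multi-hour crawl.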

Apache Configuration Notes

In /etc/httpd/conf.d/cgi-bin.conf, just use the default configuration normally commented out in httpd.conf.
 # ScriptAlias: This controls which directories contain server scripts.
 # ScriptAliases are essentially the same as Aliases, except that
 # documents in the realname directory are treated as applications and
 # run by the server when requested rather than as documents sent to the client.
 # The same rules about trailing "/" apply to ScriptAlias directives as to
 # Alias.
 #
 ScriptAlias /cgi-bin/ "/var/www/cgi-bin/"
 #
 # "/var/www/cgi-bin" should be changed to whatever your ScriptAliased
 # CGI directory exists, if you have that configured.
 #
 <Directory "/var/www/cgi-bin">
     AllowOverride None
     Options None
     Order allow,deny
     Allow from all
 </Directory>
In /etc/httpd/conf.d/tomcat5.conf:
 <Location /admin>
   Order Allow,Deny
   Allow from ...
 </Location>
 ProxyPass        /admin   http://localhost:8082/admin
 ProxyPassReverse /admin   http://localhost:8082/admin
 #
 <Location /manager>
   Order Allow,Deny
   Allow from ...
 </Location>
 ProxyPass        /manager http://localhost:8082/manager
 ProxyPassReverse /manager http://localhost:8082/manager
 #
 ProxyPass        /nutch   http://localhost:8082/nutch
 ProxyPassReverse /nutch   http://localhost:8082/nutch

Tomcat Configuration Notes

  • Comment out the AJP connector on port 8009
  • Comment out the HTTP connector on port 8080
  • Uncomment the proxied HTTP connector on port 8082
  • Add proxyName to the HTTP connector on port 8082
  • Alternatively, define proxyName and proxyPort and undefine redirectPort in the HTTP connector on port 8080
  • For port 8082, SELinux needs "setsebool -P httpd_can_network_connect=1"
  • Alternatively, for port 8080, SELinux needs "setsebool -P httpd_can_network_relay=1"

Deployment Plan

<tbd>

References

  1. "Fedora Search Engine". Infrastructure Trac. https://fedorahosted.org/fedora-infrastructure/ticket/1055. 
  2. "DataparkSearch". DataparkSearch. http://www.dataparksearch.org/. 
  3. "Egothor". Egothor. http://www.egothor.org/. 
  4. "Ferret". David Balmain. http://ferret.davebalmain.com/. 
  5. "Lucy". Apache Software Foundation. http://lucene.apache.org/lucy/. 
  6. "Indri". The Lemur Project. http://www.lemurproject.org/indri/. 
  7. "KinoSearch". Rectangular Research. http://www.rectangular.com/kinosearch/. 
  8. "mnoGoSearch". LavTech. http://www.mnogosearch.org/. 
  9. "Nutch". Apache Software Foundation. http://lucene.apache.org/nutch/. 
  10. "Swish-e". Swish-e. http://swish-e.org/. 
  11. "Xapian". Xapian Project. http://xapian.org/. 
  12. "Flax". Flax. http://www.flax.co.uk/products.shtml. 
  13. "CLucene". CLucene Project. http://sourceforge.net/projects/clucene/. 
  14. "Isearch". Isite. http://isite.awcubed.com/. 
  15. "Lucene". Apache Software Foundation. http://lucene.apache.org/. 
  16. "Lucene Implementations". Apache Software Foundation. http://wiki.apache.org/lucene-java/LuceneImplementations. 
  17. "PyLucene". Apache Software Foundation. http://lucene.apache.org/pylucene/. 
  18. "Namazu". Namazu Project. http://www.namazu.org/. 
  19. "Solr". Apache Software Foundation. http://lucene.apache.org/solr/. 
  20. "Terrier". Terrier Project. http://ir.dcs.gla.ac.uk/terrier/. 
  21. "YaCy". Karlsruhe Institute of Technology. http://yacy.net/. 
  22. "Zettair". Search Engine Group, Royal Melbourne Institute of Technology. http://www.seg.rmit.edu.au/zettair/. 
  23. "Gonzui". SourceForge. http://gonzui.sourceforge.net/. 
  24. "Grub". Wikia, Inc. http://grub.org/. 
  25. "Heritrix". Internet Archive. http://crawler.archive.org/. 
  26. "ht://Dig". The ht://Dig Group. http://www.htdig.org/. 
  27. "HtdigSearch Extension". MediaWiki. https://secure.wikimedia.org/wikipedia/mediawiki/wiki/Extension:HtdigSearch. 
  28. "MWSearch Extension". MediaWiki. https://secure.wikimedia.org/wikipedia/mediawiki/wiki/Extension:MWSearch. 
  29. "OpenFTS". XWare. http://www.astronet.ru/xware/#fts. 
  30. "Plucene". CPAN. http://search.cpan.org/~tmtm/Plucene-1.25. 
  31. "RigorousSearch Extension". MediaWiki. https://secure.wikimedia.org/wikipedia/mediawiki/wiki/Extension:RigorousSearch. 
  32. "Sphinx". Sphinx Technologies. http://sphinxsearch.com/. 
  33. "SphinxSearch Extension". MediaWiki. https://secure.wikimedia.org/wikipedia/mediawiki/wiki/Extension:SphinxSearch. 
  34. "Whoosh". Matt Chaput. http://whoosh.ca/. 
  35. "URI.pm error". Usenet. http://www.nntp.perl.org/group/perl.libwww/2006/01/msg6540.html.