Features/Hadoop

From FedoraProject

Revision as of 03:19, 9 May 2013

Apache Hadoop 2.0

Summary

Bring Apache Hadoop, the hottest open source big data platform, to Fedora, the hottest open source distribution. Fedora should be the best distribution for using Apache Hadoop.

This and other big data activities can be found going on in the Big Data SIG.

Owner

People involved

Name | IRC | Focus | Additional
Matthew Farrellee | mattf | keeping track, integration testing | UTC-5
Peter MacKinnon | pmackinn | packaging | UTC-5
Rob Rati | rsquared | packaging | UTC-5
Timothy St. Clair | tstclair | setup and configuration | UTC-6
Sam Kottler | skottler | packaging | UTC-5
Gil Cattaneo | gil | packaging | UTC+1
Christopher Meng | cicku | packaging, testing | UTC+8

Current status

  • Targeted release: Fedora 20
  • Last updated: 07 May 2013
  • Percentage of completion
    • Dependencies available in Fedora (missing since project initiation): 31%
    • Adaptation of Hadoop 2.0.2a source via patches: 100%
    • Hadoop spec completion: 60%

Detailed Description

Apache Hadoop is a widely used, increasingly complete big data platform, with a strong open source community and growing ecosystem. The goal is to package and integrate the core of the Hadoop ecosystem for Fedora, allowing for immediate use and creating a base for the rest of the ecosystem.


Benefit to Fedora

The Apache Hadoop software will be packaged and integrated with Fedora. The core of the Hadoop ecosystem will be available with Fedora and provide a base for additional packages.


Scope

  • Package the Apache Hadoop 2.0.2 software
  • Package all dependencies needed for Apache Hadoop 2.0.2
  • Skip package dependencies required only for unit testing; record them in a dependency backlog for later cleanup

Approach

We are taking an iterative, depth-first approach to packaging. We do not have all the dependencies mapped out ahead of time. Dependencies are being tabulated into two groups:

  1. missing - the dependency requested by a hadoop-common pom has not yet been packaged, reviewed or built into the Fedora repos
  2. broken - the dependency requested is out of date with current Fedora versions, and patches must be developed for inclusion in a hadoop rpm build that address any build, API or source code deltas

Note that a dependency may show up in both of these tables.

Anyone who wants to help should find an Available dependency below and edit the table, changing its state to Active and the packager to themselves.

While packaging a dependency, test dependencies can be skipped. Testing will be done via integration testing periodically during packaging and then after packaging completes. Test dependencies that are skipped must be added to the Skipped dependencies table below.

If you are lucky enough to pick a dependency that itself has unpackaged dependencies, identify the sub-dependencies, add them to the bottom of the Missing Dependencies table below, change your current dependency to Blocked, and repeat.

If your dependency is already packaged but the version is incompatible, contact the package owner and resolve the incompatibility in a mutually satisfactory way. For instance:

  • If the version available in Fedora is older, explore updating the package. If that is not possible, explore creating a package that includes a version in its name, e.g. pkgnameXY. Ultimately, the most recent version in Fedora should have the name pkgname while older versions have pkgnameXY. It may take a full Fedora release to rationalize package names. Make a note in the Dependencies table.
  • If the version you need is older than the packaged version, consider creating a patch to use the newer version. If a patch is not viable, proceed by packaging the dependency with a version in its name, e.g. pkgnameXY. Make a note in the Dependencies table.

Missing dependency legend

State | Notes
Available | free for someone to take
Active | dependency is actively being packaged if missing, or patch is being developed or tested for inclusion in hadoop-common build
Blocked | pending packages for dependencies
Review | under review, include link to review BZ
Complete | woohoo!

Missing Dependencies

Project | State | Review BZ | Packager | Notes
hadoop | Active | | rrati, pmackinn |
bookkeeper | Review | RHBZ #948589 | gil | Version 4.0 requested. packaged 4.2.1. Patch: BOOKKEEPER-598
glassfish-gmbal | Complete | RHBZ #859112 | gil | F18 build
glassfish-management-api | Complete | RHBZ #859110 | gil | F18 build
grizzly | Complete | RHBZ #859114 | gil | Only for F20 for now. Cause: missing glassfish-servlet-api on F18 and F19.
groovy | Review | RHBZ #858127 | gil | 1.5 requested but 1.8 packaged in fedora. Possible moving forward 1.8 series will be known as groovy18 and groovy will be 2.x.
hsqldb | Available | | | 1.8 in fedora, 2.0 requested. 2.2.8 packaged by gil, but seemingly no review request. Needs followup.
jersey | Complete | RHBZ #825347 | gil | F18 build. Should be rebuilt with grizzly2 support enabled.
jets3t | Review | RHBZ #847109 | gil |
jspc-compiler | Review | RHBZ #960720 | pmackinn | Passes preliminary overall hadoop-common compilation/testing.
kfs | Review | RHBZ #960728 | pmackinn | kfs has become Quantcast qfs.
maven-native | Review | RHBZ #864084 | gil | Needs patch to build with java7. NOTE: javac target/source is already set by mojo.java.target option
zookeeper | Review | RHBZ #823122 | gil | requires jtoaster

Broken Dependencies

Project | Packager | Notes
ant | | Version 1.6 requested, 1.8 currently packaged in Fedora. Needs to be inspected for API/functional incompatibilities(?)
apache-commons-collections | pmackinn | Java import compilation error with existing package. Patches for hadoop-common being tracked at https://github.com/fedora-bigdata/hadoop-common/tree/fedora-patch-collections
apache-commons-math | pmackinn | Current apache-commons-math uses math3 in pom instead of math, and API changes in code. Patches for hadoop-common being tracked at https://github.com/fedora-bigdata/hadoop-common/tree/fedora-patch-math
ecj | rrati | Need ecj version ecj-4.2.1-6 or later to resolve a dependency lookup issue
gmaven | gil | Version 1.0 requested, available 1.4 (but has broken deps) RHBZ #914056
hadoop-hdfs | pmackinn | glibc link error in hdfs native build. Patch for hadoop-common being tracked at https://github.com/fedora-bigdata/hadoop-common/tree/fedora-patch-cmake-hdfs
jersey | pmackinn | Needs jersey-servlet and version. Tracked at https://github.com/fedora-bigdata/hadoop-common/tree/fedora-patch-jersey
jets3t | pmackinn | Requires 0.6.1. With 0.9.x: hadoop-common Jets3tNativeFileSystemStore.java error: incompatible types S3ObjectsChunk chunk = s3Service.listObjectsChunked(bucket.getName(). Patches for hadoop-common being tracked at https://github.com/fedora-bigdata/hadoop-common/tree/fedora-patch-jets3t
jetty | rrati | jetty8 packaged in Fedora, but 6.x requested. 6 and 8 are incompatible. Patches tracked at https://github.com/fedora-bigdata/hadoop-common/tree/fedora-patch-jetty
slf4j | pmackinn | Package in fedora fails to match in dependency resolution. jcl104-over-slf4j dep in hadoop-common moved to jcl-over-slf4j as part of jspc/tomcat dep. Patch being tracked at https://github.com/fedora-bigdata/hadoop-common/tree/fedora-patch-jasper
tomcat-jasper | pmackinn | Version 5.5.x requested. Adaptations made for incumbent Tomcat 7 via patches at https://github.com/fedora-bigdata/hadoop-common/tree/fedora-patch-jasper. Reviewing fit as part of overall hadoop-common compilation/testing.

Unit Test Log

Module | Name | Baseline | Fedora | Tester | Notes
hadoop-common | TestSSLHttpServer | ? | Fail | pmackinn | Does internal keystore setup but then seems to get tripped up later looking for default keystore file. Possible config issue.
hadoop-yarn-applications | TestUnmanagedAMLauncher | Fail | Fail | pmackinn | Seems designed to execute once by successfully contacting an RM but repeatedly retries with: yarnAppState=FAILED, distributedFinalState=FAILED

Packager Resources

Packager tips

  • the mvn-rpmbuild utility resolves dependencies ONLY from the system repo
  • mvn-local resolves from the system repo first, then falls back to Maven for anything unresolved
    • this can be used to find the delta between the packages available in the system repo and the missing dependencies, which end up in the local .m2 Maven repo (find ~/.m2/repository -name '*.jar')
  • -Dmaven.local.debug=true
    • reveals how JPP lookups are executed per dependency; useful for finding groupId/artifactId mismatches
  • -Dmaven.test.skip=true
    • tells maven to skip test runs AND compilation
    • useful for unblocking end-to-end build
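
The tips above can be combined into a small shell sketch. The helper name list_fetched_jars is hypothetical, the install goal is an assumption borrowed from the Testing Changes section, and the listing is only meaningful if ~/.m2/repository was empty before the build; the flags themselves are the ones listed above.

```shell
# Hypothetical helper (not part of any Fedora tool): list the jars Maven
# downloaded into a local repository, i.e. dependencies that the system
# repo could not provide. Defaults to the standard ~/.m2/repository path.
list_fetched_jars() {
    repo_dir="${1:-$HOME/.m2/repository}"
    # An absent directory just means nothing was downloaded yet.
    [ -d "$repo_dir" ] || return 0
    find "$repo_dir" -name '*.jar' | sort
}

# Typical use (assumed invocation; run from the top of the source tree):
#   mvn-local -Dmaven.local.debug=true -Dmaven.test.skip=true install
#   list_fetched_jars
```

Each path printed corresponds to an artifact that still has to be packaged or mapped before mvn-rpmbuild can complete the same build.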

An alternative to gmaven:

  • apply a patch with the following content where required
  • test support is not guaranteed and may not work

     <plugin>
       <groupId>org.apache.maven.plugins</groupId>
       <artifactId>maven-antrun-plugin</artifactId>
       <version>1.7</version>
       <dependencies>
         <dependency>
           <groupId>org.codehaus.groovy</groupId>
           <artifactId>groovy</artifactId>
           <version>any</version>
         </dependency>
         <dependency>
           <groupId>antlr</groupId>
           <artifactId>antlr</artifactId>
           <version>any</version>
         </dependency>
         <dependency>
           <groupId>commons-cli</groupId>
           <artifactId>commons-cli</artifactId>
           <version>any</version>
         </dependency>
         <dependency>
           <groupId>asm</groupId>
           <artifactId>asm-all</artifactId>
           <version>any</version>
         </dependency>
         <dependency>
           <groupId>org.slf4j</groupId>
           <artifactId>slf4j-nop</artifactId>
           <version>any</version>
         </dependency>
       </dependencies>
       <executions>
         <execution>
           <id>compile</id>
           <phase>process-sources</phase>
           <configuration>
             <target>
               <mkdir dir="${basedir}/target/classes"/>
               <taskdef name="groovyc" classname="org.codehaus.groovy.ant.Groovyc">
                 <classpath refid="maven.plugin.classpath"/>
               </taskdef>
               <groovyc destdir="${project.build.outputDirectory}" srcdir="${basedir}/src/main" classpathref="maven.compile.classpath">
                 <javac source="1.5" target="1.5" debug="on"/>
               </groovyc>
             </target>
           </configuration>
           <goals>
             <goal>run</goal>
           </goals>
         </execution>
       </executions>
     </plugin>

Repositories

An RPM repository of dependencies already packaged and in, or heading towards, review state can be found here:

http://repos.fedorapeople.org/repos/rrati/hadoop/

Currently, only Fedora 18 x86_64 packages are available.


Source repositories:

https://github.com/fedora-bigdata/hadoop-common (fork of Apache Hadoop for changes required to support compilation on Fedora)

https://github.com/fedora-bigdata/hadoop-rpm (spec and supporting files for generating an RPM for Fedora)


Workflow

The Apache Hadoop project uses a number of old or obsolete dependencies in its build and test environment, and this presents a challenge for including Apache Hadoop in Fedora. Any change to the Apache Hadoop source or build files that is required in order to use a newer version of a dependency is a candidate for a patch to send upstream. Any change that is required to conform to Fedora's packaging guidelines or to deal with a package naming issue should be confined to the hadoop spec file.

The intention of this process is to isolate changes to a single dependency so that patches can be created and consumed upstream. It is important that each set of source changes be isolated to one dependency and be self-contained. A dependency is not necessarily a single jar file. Changes for a dependency should include everything needed to use the jar files from a later release of that dependency.


Dependency Branches

All code/build changes to Apache Hadoop should be made on a branch in the hadoop-common repo that is based off the

branch-2.0.2-alpha

branch and follows this naming convention:

fedora-patch-<dependency>

Where <dependency> is the name of the dependency being worked on. Changes on this branch should ONLY relate to that dependency. Do not include the dependency version in the branch name. These branches will be updated as needed to follow Fedora or Hadoop updates until they are accepted upstream by Apache Hadoop. Leaving the version out of the name allows the branch to move from version 1->2->3 without confusion if that becomes necessary before it is accepted upstream.
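
As a minimal sketch of the convention, the helpers below build the branch name and create the branch with git. Both function names are hypothetical, and the sketch assumes it runs inside a hadoop-common clone where branch-2.0.2-alpha exists as a local branch.

```shell
# Hypothetical helper: build the branch name for one dependency.
# Pass the bare dependency name (e.g. jetty), without a version.
patch_branch_name() {
    if [ -z "$1" ]; then
        echo "usage: patch_branch_name <dependency>" >&2
        return 1
    fi
    echo "fedora-patch-$1"
}

# Hypothetical helper: create the branch from the 2.0.2-alpha base.
# Assumes the current directory is a hadoop-common clone in which
# branch-2.0.2-alpha exists as a local branch.
start_patch_branch() {
    name=$(patch_branch_name "$1") || return 1
    git checkout -b "$name" branch-2.0.2-alpha
}
```

For example, start_patch_branch jetty would create fedora-patch-jetty, which matches the branch already referenced in the Broken Dependencies table.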

Integration Branch

An integration branch should be created in the hadoop-common repository that corresponds with the release version being packaged using the following naming convention:

fedora-<version>-integration

where <version> is the hadoop version being packaged. All branches containing changes that have not yet been accepted upstream should be merged to the integration branch, and the result should pass the build and all tests. Once this is complete, a patch should be generated and pushed to the hadoop-rpm repository.
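
One possible way to script the merge and patch generation is sketched below. The function names and the patch file name are hypothetical, and using 2.0.2-alpha as the version string is an assumption; the sketch also assumes a hadoop-common clone with branch-2.0.2-alpha and the fedora-patch-* branches available locally.

```shell
# Hypothetical helper: name of the integration branch for a Hadoop version.
integration_branch_name() {
    echo "fedora-$1-integration"
}

# Hypothetical helper: merge every local fedora-patch-* branch into a new
# integration branch and write the combined changes to a patch file.
# Assumes a hadoop-common clone with branch-2.0.2-alpha available locally;
# the patch file name is an arbitrary choice.
build_integration_patch() {
    version="$1"
    base="branch-2.0.2-alpha"
    git checkout -b "$(integration_branch_name "$version")" "$base" || return 1
    for branch in $(git for-each-ref --format='%(refname:short)' 'refs/heads/fedora-patch-*'); do
        # Stop at the first conflict so it can be resolved by hand.
        git merge --no-edit "$branch" || return 1
    done
    # Everything on the integration branch that is not on the base branch.
    git diff "$base"...HEAD > "hadoop-fedora-$version-integration.patch"
}
```

The patch written by the last step should only be pushed to hadoop-rpm after the merged tree has passed the build and tests, as described above.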

Testing Changes

In order for a set of changes to be considered complete, it must be able to compile and pass all tests in 2 separate ways:

  1. On Fedora using Fedora packages (mvn-rpmbuild)
  2. On Fedora using maven retrieved packages (mvn)

The changes should compile and the build process should run through all tests without error. To verify a set of changes, use the following options:

<mvn-build> -Pnative install

Where <mvn-build> is either mvn-rpmbuild or mvn.
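
The two verification runs can be wrapped in a small loop. This is only a sketch: verify_changes is a hypothetical helper, and it assumes it is run from the top of the hadoop-common source tree with both mvn-rpmbuild and mvn on the PATH.

```shell
# Hypothetical helper: run "<mvn-build> -Pnative install" for each build
# tool in turn and stop at the first failure.
verify_changes() {
    # Default to the two build paths named above when no tools are given.
    [ "$#" -gt 0 ] || set -- mvn-rpmbuild mvn
    for build in "$@"; do
        echo "== $build -Pnative install =="
        "$build" -Pnative install || { echo "FAILED: $build" >&2; return 1; }
    done
    echo "all builds passed"
}
```

Calling verify_changes with no arguments runs both builds; passing a single tool name, for example verify_changes mvn-rpmbuild, reruns only that path.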

NOTE: This pirates' code is more what you'd call guidelines than actual rules. There are places where incompatible changes exist (at least for now): for example, the zookeeper test jar.

How To Test

  1. TODO: NEEDS MORE DEFINITION
  2. yum install X Y Z across one or more nodes
  3. Setup a simple cluster by following TBD
  4. Run http://hadoop.apache.org/docs/stable/gridmix.html


User Experience

Users who are interested in running Apache Hadoop on Fedora will find it available from the Fedora Project yum repositories.

TODO: SPECIFICALLY PACKAGES X Y Z


Dependencies

No other packages currently depend on Apache Hadoop.

Completion of this feature will involve packaging numerous dependencies; see the Dependencies table. Some of the dependencies are already being packaged by others in the Fedora community. Where dependency overlap is found, a negotiation must occur to ensure a satisfactory version and package is available to all parties.

TODO: Is https://fedoraproject.org/wiki/Hypertable ?


Contingency Plan

With no packages depending on Apache Hadoop, none is necessary. The biggest risk is not completing packages for all dependencies. In that case, the feature can be removed from the release notes. The packaged dependencies should remain in the distribution. The feature can be pushed to the next Fedora release.


Documentation


Release Notes

  • TODO


Comments and Discussion