Revision as of 08:25, 2 August 2015

If you're wondering what Big Data things are in Fedora, or are interested in working on packaging or reviews to help out the Big Data SIG, this is the page to look at!

If you know of a big-data-related package that is already in Fedora, or have one that you'd like to get into Fedora, be sure to list it here, link to the page describing what needs to be done, or link to the Bugzilla bug that needs help.

Packages available in Fedora

Package	Description	Packaged Version	Upstream Version	Sources	Notes
Apache Hadoop	Batch processing system and core of the Hadoop ecosystem	2.4.1	2.7.0	hadoop.git	Hadoop packaging
Apache HBase	The Apache Hadoop NoSQL Database	0.98.3	1.0.1	hbase.git	HBase packaging
Apache Hive	SQL-on-Hadoop query framework, a data warehouse for Hadoop	0.12.2	1.1.0	hive.git
Apache Pig	Language for expressing data analysis programs run on MapReduce	0.13.10	0.14.0	pig.git	Pig packaging
Apache ZooKeeper	A service for highly reliable distributed coordination	3.4.6	3.4.6	zookeeper.git
Apache Oozie		4.0.1	4.1.0	oozie.git
Apache Ambari	Hadoop cluster manager	1.5.1	2.0.0	ambari.git
Apache Accumulo	A software platform for processing vast amounts of data	1.6.1	1.6.2	accumulo.git
Apache Mesos	Cluster manager for sharing distributed application frameworks	0.22.1	0.22.1	mesos.git	Mesos packaging
Apache Solr	Ultra-fast Lucene-based Search Server	4.10.4	5.1.0	solr.git
Apache Spark	Lightning-fast cluster computing	0.9.1	1.3.1	spark.git	Spark packaging and Scala packaging
AMPLab Tachyon	A memory resident, fault tolerant distributed file system	0.99	0.6.4	tachyon.git	Tachyon packaging

Ongoing packages
Package	Description	Packaged Version	Upstream Version	Sources	Notes
Apache Flume		1.5.0	1.5.0	flume-rpm.git	Partially supported
Cloudera Kite SDK	Kite SDK to simplify the development of data-related systems	1.0.0	1.0.0	kite.spec
Apache Crunch		0.11.0	0.11.0	crunch-rpm.git
Apache Tez		0.5.3	0.6.0	tez-rpm.git
Apache Kafka		0.8.0	0.8.2.1	kafka-rpm.git
Apache Tajo		0.10.0	0.10.0	tajo.spec
Apache Jena		2.13.0	2.13.0	jena.spec
Cascading		2.6.3	2.6.3	cascading.spec

Packages in review

See the bigdata-review-tracker.

Project	Review BZ	Who	Notes
Apache Oozie	RHBZ #1071456	rsquared	Oozie packaging
Apache Sqoop	RHBZ #1089675	pmackinn

Packages we're working on

Project	Who	Status
Apache Mahout	besser82
Apache Flume	gil jromanes	Flume package status
Apache Crunch	gil jromanes	Crunch package status
Apache Tez	gil jromanes	Tez package status
Apache Kafka	gil jromanes	Kafka package status
Apache Tajo	gil	Tajo package status
Apache Jena	donpellegrino	package status
Cascading	gil	Cascading package status

Packages we'd like to include

Shark
Aurora
Sparrow
Storm
Tez
Presto
Cascading
Summingbird
RHadoop
Sentry
Ooyala Job Server
unicage
GridGain
Crunch
Elephant Bird
Hadoop-lzo
Tajo
CKAN - "The open source data portal software"
Samza
Flink
Geode
New stuff here!

Becoming a packager

Not yet a packager? Check out the Package Maintainers page or the Join the package collection maintainers page for more information. You can also ask on the Big Data SIG mailing list for assistance and see if you can find a willing helper or sponsor. Before packaging Java software, read the Java packaging guidelines.

Typical workflow (relies on GitHub)

Clone the original repo if modifications are required.
Patch where necessary. (Use GitHub tickets where possible if working as a group.)
- Try to organize your patch set into meaningful units, and create tickets to push the patches upstream where possible.
- Patches that must be carried should be applied to the raw sources where possible.
Create a package-rpm repo with specs and system integration files (systemd, custom-conf, etc).
Use rpmbuild, or hack fedpkg, to enable prototype package building:
- spectool -g package.spec (will download sources)
- md5sum package-sources.tar.gz > sources
- fedpkg local
Once you feel you have a package ready for review, run the following before submitting:
- Set up Fedora Review
- rpmlint package.spec
- mock --clean --init -r fedora-rawhide-x86_64 && fedora-review -m fedora-rawhide-x86_64 -n package.srpm
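
The prototype build steps above can be wrapped in a small script. This is a minimal sketch, not an official SIG tool: the function names write_sources and prototype_build are made up for illustration, "package" is a placeholder, and it assumes spectool and fedpkg are installed and that the spec file sits in the current directory.

```shell
#!/bin/sh
# Sketch of the prototype build loop described above.
# Function names are hypothetical; "package" is a placeholder.

# Write the "sources" file with one md5sum line per tarball,
# exactly as the "md5sum ... > sources" step does by hand.
write_sources() {
    # $@: tarballs already present in the current directory
    md5sum "$@" > sources
}

# Run the three steps in order and stop at the first failure.
# $1: spec file, remaining arguments: the source tarballs it downloads.
prototype_build() {
    spec="$1"
    shift
    spectool -g "$spec" &&   # download the sources listed in the spec
    write_sources "$@" &&    # record their checksums
    fedpkg local             # build locally in this directory
}
```

A run would then look like: prototype_build package.spec package-sources.tar.gz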

Packaging Notes

Fedora Java RPMs cannot bundle dependent jars. Every jar file not created by the build must come from an RPM in the Fedora repository.
All jars must be built from source.
Fedora build tools: xmvn-resolve. The mvn-local, mvn-rpmbuild, and mvn-build tools are no longer available in rawhide and are considered private implementation.
Fedora rpm macros: %pom_*, %mvn_build, %mvn_install, %mvn_file
Use xmvn-subst to replace dependency jars with symlinks to the system jars when packaging.
Fedora Java Packaging guidelines: https://fedoraproject.org/wiki/Packaging:Java
- JNI handling: System.load replaces System.loadLibrary, and the jar file goes in %{_jnidir}.
- Other jar files go in %{_javadir}.
Fedora build systems have no internet access; avoid DNS lookups where possible.
Breaking apart or subsuming subelements
- Depending on the popularity of a sub-element as a stand-alone package it sometimes makes more sense to break it out as a sub-package which can stand alone, but doesn't have to live in a separate repository. This is a choice which will have to be made by the upstream group and will depend heavily on their ideal workflow, but from a maintenance perspective it's far easier to maintain as a sub-package. E.g. one project produces multiple libs/jars.
Fedora ships OpenJDK 7 or higher. You cannot mix and match the Fedora versions of Maven and Ant with Java 6, since they are themselves compiled with source="1.7".
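
To check a source tree against the no-bundling rule above, it helps to list any pre-built jars that upstream ships. The following is a rough sketch under stated assumptions: the function names list_bundled_jars and remove_bundled_jars are hypothetical, the directory argument is wherever the upstream sources were unpacked, and matching on the .jar and .class suffixes is only a heuristic.

```shell
#!/bin/sh
# Sketch: find pre-built Java artifacts in an unpacked source tree.
# Function names are hypothetical, not part of any Fedora tool.

# Print every .jar or .class file under the directory given as $1.
list_bundled_jars() {
    find "$1" -type f \( -name '*.jar' -o -name '*.class' \) | sort
}

# Delete those files so the build can only pick up jars that
# come from installed RPMs. Uses the GNU find -delete action.
remove_bundled_jars() {
    find "$1" -type f \( -name '*.jar' -o -name '*.class' \) -delete
}
```

Running list_bundled_jars on the unpacked sources before writing the spec shows which dependencies still need a Fedora package. The removal step is typically done in the %prep section of the spec.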

@@ Line 4: / Line 4: @@
 = Packages available in Fedora =
-{|
- ! Project !! Since !! Description !! Notes
+{| class="wikitable" style="color:black; background-color:#CCFFFF;" cellpadding="10"
- |-
+! Package !! Description !! Packaged <br> Version !! Upstream <br> Version !! Sources !! Notes
- | [http://research.cs.wisc.edu/htcondor/ HTCondor]
+|-
- | F8
+| '''Apache Hadoop'''
- | A scalable batch scheduling system
+| Batch processing system and core of the Hadoop ecosystem
- |
+| 2.4.1
- |-
+| 2.7.0
- | [http://zookeeper.apache.org/ Apache ZooKeeper]
+| [http://pkgs.fedoraproject.org/cgit/hadoop.git/ hadoop.git]
- | F18
+| [[Changes/Hadoop | Hadoop packaging]]
- | A service for highly reliable distributed coordination
+|-
- |
+| '''Apache HBase'''
- |-
+| The Apache Hadoop NoSQL Database
- | [https://launchpad.net/savanna Savanna]
+| 0.98.3
- | F20
+| 1.0.1
- | An OpenStack project for managing Hadoop clusters and workflow
+| [http://pkgs.fedoraproject.org/cgit/hbase.git/ hbase.git]
- |
+| [[SIGs/bigdata/packaging/Hbase| HBase packaging]]
- |-
+|-
- | [https://forge.gluster.org/hadoop GlusterFS Hadoop]
+| '''Apache Hive'''
- | F20
+| SQL-on-Hadoop query framework, a data warehouse for Hadoop
- | An [http://wiki.apache.org/hadoop/HCFS HCFS] plugin for [http://gluster.org/ Gluster]
+| 0.12.2
- |
+| 1.1.0
- |-
+| [http://pkgs.fedoraproject.org/cgit/hive.git/ hive.git]
- | [https://fedoraproject.org/wiki/Features/Hadoop Apache Hadoop]
+|
- | F20
+|-
- | Batch processing system and core of the Hadoop ecosystem
+| '''Apache Pig'''
- | [[Changes/Hadoop | Hadoop F20 Change]]
+| Language for expression data analysis programs run on MapReduce
- |-
+| 0.13.10
- | [https://github.com/amplab/tachyon/wiki Tachyon]
+| 0.14.0
- | F20
+| [http://pkgs.fedoraproject.org/cgit/pig.git/ pig.git]
- | A memory resident, fault tolerant distributed file system
+| [[SIGs/bigdata/packaging/Pig | Pig packaging]]
- | [[SIGs/bigdata/packaging/Tachyon | Tachyon packaging]]
+|-
- |-
+| '''Apache Zookeeper'''
- | [http://mesos.apache.org/ Apache Mesos]
+| A service for highly reliable distributed coordination
- | F21
+| 3.4.6
- | Cluster manager for sharing distributed application frameworks
+| 3.4.6
- | [[SIGs/bigdata/packaging/Mesos | Mesos packaging]]
+| [http://pkgs.fedoraproject.org/cgit/zookeeper.git/ zookeeper.git]
- |-
+|
- | [http://hbase.apache.org/ Apache HBase]
+|-
- | F21
+| '''Apache Oozie'''
- | The Apache Hadoop Database
+|
- | [[SIGs/bigdata/packaging/Hbase| HBase packaging]]
+| 4.0.1
- |-
+| 4.1.0
- | [http://pig.apache.org/ Apache Pig]
+| [http://pkgs.fedoraproject.org/cgit/oozie.git/ oozie.git]
- | F21
+|
- | Language for expression data analysis programs run on MapReduce
+|-
- | [[SIGs/bigdata/packaging/Pig | Pig packaging]]
+| '''Apache Ambari'''
- |-
+| Hadoop cluster manager
- | [http://lucene.apache.org/solr/ Apache Solr]
+| 1.5.1
- | [http://pkgs.fedoraproject.org/cgit/solr.git/ F21]
+| 2.0.0
- | Ultra-fast Lucene-based Search Server
+| [http://pkgs.fedoraproject.org/cgit/ambari.git/ ambari.git]
- | <s>{{bz|1025904}}</s>
+|
- |-
+|-
- | [http://spark.apache.org/ Apache Spark]
+| '''Apache Accumulo'''
- | [http://pkgs.fedoraproject.org/cgit/spark.git/ F21]
+| A software platform for processing vast amounts of data
- | Lightning-fast cluster computing
+| 1.6.1
- | [[SIGs/bigdata/packaging/Spark|Spark packaging]] and [[SIGs/bigdata/packaging/Scala|Scala packaging]]
+| 1.6.2
- |-
+| [http://pkgs.fedoraproject.org/cgit/accumulo.git/ accumulo.git]
- | [http://hive.apache.org/ Apache Hive]
+|
- | [http://pkgs.fedoraproject.org/cgit/hive.git/ F21]
+|-
- | Hadoop data warehouse
+| '''Apache Mesos'''
- | [[SIGs/bigdata/packaging/Hive | Hive packaging]]
+| Cluster manager for sharing distributed application frameworks
- |-
+| 0.22.1
- | [http://ambari.apache.org/ Apache Ambari]
+| 0.22.1
- | [http://pkgs.fedoraproject.org/cgit/ambari.git/ F20,F21]
+| [http://pkgs.fedoraproject.org/cgit/mesos.git/ mesos.git]
- | Hadoop cluster manager
+| [[SIGs/bigdata/packaging/Mesos | Mesos packaging]]
- | [[SIGs/bigdata/packaging/Ambari | Ambari packaging]]
+|-
- |-
+| '''Apache Solr'''
- | [http://accumulo.apache.org/ Apache Accumulo]
+| Ultra-fast Lucene-based Search Server
- | [http://pkgs.fedoraproject.org/cgit/accumulo.git/ F21]
+| 4.10.4
- | A software platform for processing vast amounts of data
+| 5.1.0
- | [[Changes/ApacheAccumulo | Accumulo F21 Change]]
+| [http://pkgs.fedoraproject.org/cgit/solr.git/ solr.git]
- |-
+|
- | [http://kitesdk.org/docs/current/ Kite SDK]
+|-
- |
+| '''Apache Spark'''
- | Kite SDK to simplify the development of data-related systems
+| Lightning-fast cluster computing
- | {{bz|1025904}}
+| 0.9.1
- |}
+| 1.3.1
+| [http://pkgs.fedoraproject.org/cgit/spark.git/ spark.git]
+| [[SIGs/bigdata/packaging/Spark|Spark packaging]] and [[SIGs/bigdata/packaging/Scala|Scala packaging]]
+|-
+| '''AMPLab Tachyon'''
+| A memory resident, fault tolerant distributed file system
+| 0.99
+| 0.6.4
+| [http://pkgs.fedoraproject.org/cgit/tachyon.git tachyon.git]
+| [[SIGs/bigdata/packaging/Tachyon | Tachyon packaging]]
+|-
+|
+|
+|
+|
+|}
+{| class="wikitable" style="color:black; background-color:#CCFFFF;" cellpadding="10"
+|+ On going packages
+! Package !! Description !! Packaged <br> Version !! Upstream <br> Version !! Sources !! Notes
+|-
+| '''Apache Flume'''
+|
+| 1.5.0
+| 1.5.0
+| [https://github.com/fedora-bigdata-rpms/flume-rpm flume-rpm.git]
+| Partially supported
+|-
+| '''Cloudera Kite SDK'''
+| Kite SDK to simplify the development of data-related systems
+| 1.0.0
+| 1.0.0
+| [https://gil.fedorapeople.org/kite.spec kite.spec]
+|
+|-
+| '''Apache Crunch'''
+|
+| 0.11.0
+| 0.11.0
+| [https://github.com/fedora-bigdata-rpms/crunch-rpm crunch-rpm.git]
+|
+|-
+| '''Apache Tez'''
+|
+| 0.5.3
+| 0.6.0
+| [https://github.com/fedora-bigdata-rpms/tez-rpm tez-rpm.git]
+|
+|-
+| '''Apache Kafka'''
+|
+| 0.8.0
+| 0.8.2.1
+| [https://github.com/fedora-bigdata-rpms/kafka-rpm kafka-rpm.git]
+|
+|-
+| '''Apache Tajo'''
+|
+| 0.10.0
+| 0.10.0
+| [https://gil.fedorapeople.org/tajo.spec tajo.spec]
+|
+|-
+|'''Apache Jena'''
+|
+| 2.13.0
+| 2.13.0
+| [https://gil.fedorapeople.org/jena.spec jena.spec]
+|
+|-
+| '''Cascading'''
+|
+| 2.6.3
+| 2.6.3
+| [https://gil.fedorapeople.org/cascading.spec cascading.spec]
+|
+|-
+|
+|
+|
+|
+|
+|}
 = Packages in review =


SIGs/bigdata/packaging: Difference between revisions