From Fedora Project Wiki

Javi Roman: Twitter Linkedin Photography

Fedora Big Data Package Ecosystem

Fedora Hosted Packages
Package           Packaged Version  Upstream Version  Sources
Apache Hadoop     2.4.1             2.7.0             http://pkgs.fedoraproject.org/cgit/hadoop.git/
Apache HBase      0.98.3            1.0.1             http://pkgs.fedoraproject.org/cgit/hbase.git/
Apache Hive       0.12.2            1.1.0             http://pkgs.fedoraproject.org/cgit/hive.git/
Apache Pig        0.13.10           0.14.0            http://pkgs.fedoraproject.org/cgit/pig.git/
Apache Zookeeper  3.4.6             3.4.6             http://pkgs.fedoraproject.org/cgit/zookeeper.git/
Apache Oozie      4.0.1             4.1.0             http://pkgs.fedoraproject.org/cgit/oozie.git/
Apache Ambari     1.5.1             2.0.0             http://pkgs.fedoraproject.org/cgit/ambari.git/
Apache Accumulo   1.6.1             1.6.2             http://pkgs.fedoraproject.org/cgit/accumulo.git/
Apache Mesos      0.22.1            0.22.1            http://pkgs.fedoraproject.org/cgit/mesos.git/
Apache Solr       4.10.4            5.1.0             http://pkgs.fedoraproject.org/cgit/solr.git/
Apache Spark      0.9.1             1.3.1             http://pkgs.fedoraproject.org/cgit/spark.git/
AMPLab Tachyon    0.99              0.6.4             http://pkgs.fedoraproject.org/cgit/tachyon.git
Ongoing packages
Package            Packaged Version  Upstream Version  Status               Sources
Apache Flume       1.5.0             1.5.0             Partially supported  https://github.com/fedora-bigdata-rpms/flume-rpm
Cloudera Kite SDK  1.0.0             1.0.0                                  https://gil.fedorapeople.org/kite.spec
Apache Crunch      0.11.0            0.11.0                                 https://github.com/fedora-bigdata-rpms/crunch-rpm
Apache Tez         0.5.3             0.6.0                                  https://github.com/fedora-bigdata-rpms/tez-rpm
Apache Kafka       0.8.0             0.8.2.1                                https://github.com/fedora-bigdata-rpms/kafka-rpm
Apache Tajo        0.10.0            0.10.0                                 https://gil.fedorapeople.org/tajo.spec
Apache Jena        2.13.0            2.13.0                                 https://gil.fedorapeople.org/jena.spec
Cascading          2.6.3             2.6.3                                  https://gil.fedorapeople.org/cascading.spec

Apache Flume package status

Package status

The package builds under the following assumptions (we are working on these issues):

  • The code is not ready for Thrift v0.9.1, the version available in Fedora 21; however, Flume can be built using the legacy built-in Thrift code shipped in the upstream Flume TGZ.
  • Disable ElasticSearch Sink
  • Disable Morphline Solr Sink
  • Disable Twitter Source
  • Disable Kite Dataset Sink

Testing the package

git clone https://github.com/fedora-bigdata-rpms/flume-rpm.git
cd flume-rpm
spectool -g flume.spec
rpmbuild -bs --nodeps --define "_sourcedir ." --define "_srcrpmdir ." flume.spec
sudo mock flume-*.fc21.src.rpm

Dependency packages

  • To build Flume with all features enabled, the following dependency packages are required; their status is:
Package          Bugzilla                     Status
irclib           RHBZ #976049                 Package is available in Rawhide and in Fedora 21 as an update
mapdb            RHBZ #1178861                Package is available in Rawhide and in Fedora 21 as an update
asynchbase       RHBZ #1244657
async            RHBZ #1244655                asynchbase dependency.
kite             RHBZ #1179355                Patched in order to support Fedora Guava version (partial support).
parquet          RHBZ #1073017                kite package dependency. Package is available in Rawhide and was submitted to Fedora 22 and 21 as an update
parquet-format   RHBZ #1073014                parquet package dependency. Package is available in Rawhide and was submitted to Fedora 22 and 21 as an update
maxmind-db-java  RHBZ #1179309                kite package dependency. Package is available in Rawhide and in Fedora 21 as an update
ua-parser-java   RHBZ #1179342                kite package dependency. Package is available in Rawhide and in Fedora 21 as an update
elasticsearch    RHBZ #902086, RHBZ #1181564  Package is available in Rawhide and in Fedora 22 as an update
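
The names in the Package column can be checked against the local RPM database before attempting a full-featured build. The following is a minimal sketch, assuming the RPM package names match the table entries exactly. The `missing_deps` function and the `CHECK_CMD` override are hypothetical helpers introduced here for illustration. Note that `rpm -q` only reports packages installed on the host, not what is available in the repositories or inside a mock chroot.

```shell
# Package names copied from the dependency table above.
# Assumption: the RPM names are identical to the table entries.
FLUME_DEPS="irclib mapdb asynchbase async kite parquet parquet-format maxmind-db-java ua-parser-java elasticsearch"

# Print every package name for which the query command fails.
# CHECK_CMD is a hypothetical hook; it defaults to "rpm -q", which
# exits non-zero when the named package is not installed.
missing_deps() {
    for dep in "$@"; do
        if ! ${CHECK_CMD:-rpm -q} "$dep" >/dev/null 2>&1; then
            echo "$dep"
        fi
    done
}
```

Calling `missing_deps $FLUME_DEPS` lists the dependencies that are not installed locally; an empty result means all of them were found.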

Apache Storm package status

sources

Apache Kafka package status

sources

Apache Kafka is a distributed, persistence-oriented publish-subscribe messaging system whose O(1) disk structures provide constant-time performance even with many TB of stored messages.

Apache Kafka is based on the Scala language. Scala projects are built with sbt (Simple Build Tool), the de facto build tool of the Scala community. sbt is similar to Apache Ant and uses Apache Ivy (a sub-project of the Apache Ant project) to resolve project dependencies.

There are two methods for building RPMs of Scala-based projects:

  • Building packages with sbt and the climbing-nemesis script (a tool to make a temporary Ivy repository from installed Fedora packages)
 SIGs/bigdata/packaging/Sbt
 sbt is in Fedora 20 
 Example of climbing-nemesis usage
  • Building packages with sbt and xmvn’s Ivy resolution support
 Making Fedora a better place for Scala
 improved Fedora support for Ivy
 SIGs/bigdata/packaging/Scala
 Changes/ImprovedScalaEcosystem
 Changes/ImprovedIvyPackaging

Package status

The package does not build, mainly because sbt-based Scala projects are broken in Fedora 23 (Rawhide). The pending bugs are:

Removing depmap support in Fedora 23

sbt: FTBFS in rawhide

sbt: broken hawtjni-runtime-1.8.jar symlink

Testing the package

git clone https://github.com/fedora-bigdata-rpms/kafka-rpm.git
cd kafka-rpm
spectool -g kafka.spec
rpmbuild -bs --nodeps --define "_sourcedir ." --define "_srcrpmdir ." kafka.spec
sudo mock kafka-0.8.0-1.fc23.src.rpm
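
The Flume and Kafka test procedures on this page follow the same steps, so they can be wrapped in one function. This is only a sketch: `build_bigdata_srpm`, `run`, and the `DRY_RUN` switch are hypothetical names, and it assumes every repository is named `<package>-rpm` and contains `<package>.spec`, as the two examples here do. The source RPM is matched with a glob so that the version does not have to be hard-coded.

```shell
# Run a command, or only print it when DRY_RUN=1 (hypothetical switch).
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Clone the packaging repository, fetch the sources, build the source RPM
# and rebuild it in mock. The body runs in a subshell so that the "cd"
# does not change the caller's working directory.
build_bigdata_srpm() (
    pkg="${1:-}"
    if [ -z "$pkg" ]; then
        echo "usage: build_bigdata_srpm <package>" >&2
        exit 1
    fi
    # Assumption: the repository follows the <package>-rpm naming scheme.
    run git clone "https://github.com/fedora-bigdata-rpms/${pkg}-rpm.git" || exit 1
    run cd "${pkg}-rpm" || exit 1
    run spectool -g "${pkg}.spec" || exit 1
    run rpmbuild -bs --nodeps --define "_sourcedir ." --define "_srcrpmdir ." "${pkg}.spec" || exit 1
    # The glob picks up whatever version and dist tag rpmbuild produced.
    run sudo mock "${pkg}"-*.src.rpm
)
```

For example, `DRY_RUN=1 build_bigdata_srpm kafka` prints the commands without executing them, and `build_bigdata_srpm kafka` runs them.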

Apache Tez package status

sources

Apache Crunch package status

sources