From Fedora Project Wiki


(79 intermediate revisions by 2 users not shown)
= Hive =
== Overview ==
From the [http://hive.apache.org/ project site]: "Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop compatible file systems. Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL. At the same time this language also allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL."

The Fedora Big Data SIG is investigating the requirements to adapt the latest version of Hive as a package in Fedora, now that [https://koji.fedoraproject.org/koji/packageinfo?packageID=16841 Hadoop 2.x has been packaged]. Although Hive obviously has a significant dependency on Hadoop, the Java project is not Maven-based and instead is built using Ant and Ivy. The [[:Packaging:Java|xmvn tooling]] support in Fedora does not directly apply to the Hive build. In many ways this can be viewed as a simplification instead of a challenge since one can configure a local file-system Ivy resolver relatively easily.

Using static build-derived analysis (Ant doesn't really provide something like the Maven dependency plugin), there is a group of dependencies that are currently missing from Fedora and block the build of Hive using Fedora-only installed versions. There are also many dependencies available which are not necessarily version-compatible. However, like the [[:Changes/Hadoop|hadoop outline]], those can hopefully be mitigated in the Hive source where possible.

== Build ==
Version 0.12.0 is the latest release and was built from source (using the Fedora Hadoop target of 2.2.0) using:

<pre>ant very-clean package -Dhadoop.version=2.2.0 -Dhadoop-0.23.version=2.2.0 -Dhadoop.mr.rev=23 -DenhanceModel.notRequired=true -Dmvn.hadoop.profile=hadoop23 -Dshims.include=0.23 -Dbuild.profile=core -Dthrift.home=/usr</pre>

Note that to do a local build using the [[#SCM|SIG branch]] you must make a directory to store any [[#Dependencies|(currently)]] unpackaged jars:

<pre>mkdir -p ~/hive/lib/missing</pre>

== Dependencies ==
The full Hive dependency list is captured [http://pmackinn.fedorapeople.org/tattletale/hive/hive-0.11.0/dependson/index.html here] but the following table outlines the missing dependencies. The ones in '''bold''' are deemed hard dependencies and must be packaged.

{| class="wikitable"
|+ <div id="deps">Missing/Questionable Dependencies</div>
! Project !! State !! Review BZ !! Packager !! Notes
|-
| avro-ipc, '''avro-mapred'''
| '''<span style="color:green">Complete</span>'''
| <strike>{{bz|1009170}}</strike>
| [[User:ricardo|ricardo]]
| Although avro 1.6.2 is packaged, it does not include the ipc and mapred jars. IPC appears to only apply to 0.20 shim. MapRed is used by an Avro reader/input/output feature in QL and is based on the legacy mapred API (i.e., org.apache.hadoop.mapred).
|-
| '''datanucleus-core'''
| '''<span style="color:green">Complete</span>'''
| <strike>{{bz|1011705}}</strike>
| [[User:pmackinn|pmackinn]],[[User:gil|gil]]
| Forms backbone of metastore layer for different data sinks. Upstream project at http://www.datanucleus.org/
|-
| '''datanucleus-api-jdo'''
| '''<span style="color:green">Complete</span>'''
| <strike>{{bz|1011962}}</strike>
| [[User:pmackinn|pmackinn]],[[User:gil|gil]]
| JDO implementation for datanucleus
|-
| '''datanucleus-rdbms'''
| '''<span style="color:green">Complete</span>'''
| <strike>{{bz|1011960}}</strike>
| [[User:pmackinn|pmackinn]],[[User:gil|gil]]
| RDBMS plugin adapter for datanucleus
|-
| hbase
| '''<span style="color:green">Complete</span>'''
| <strike>{{bz|1045556}}</strike>
| [[User:rrati|rrati]]
| hbase-handler can be compiled out but seems like a significant omission
|-
| high-scale-lib
| '''<span style="color:green">Complete</span>'''
| <strike>{{bz|865893}}</strike>
| [[User:gil|gil]]
|
|-
| '''javolution'''
| '''<span style="color:green">Complete</span>'''
| <strike>{{bz|1009153}}</strike>
| [[User:pmackinn|pmackinn]]
| Used by the QL classes: a '''hard''' dependency
|-
| '''jdo-api'''
| '''<span style="color:green">Complete</span>'''
| <strike>{{bz|1011696}}</strike>
| [[User:pmackinn|pmackinn]],[[User:gil|gil]]
| Dependency for datanucleus-api-jdo. CANNOT substitute existing jdo2-api.
|-
| '''libthrift, libfb303'''
| '''<span style="color:green">Complete</span>'''
| <strike>{{bz|982285}}, {{bz|1000563}}</strike>
| [[User:willb|willb]]
| Will Benton has some RPM artifacts at http://freevariable.com/thrift/
|-
| metrics-core
| '''<span style="color:green">Complete</span>'''
| <strike>{{bz|861502}}</strike>
| [[User:gil|gil]]
|
|-
| pig
| '''<span style="color:orange">Review</span>'''
| {{bz|1060277}}
| [[User:pmackinn|pmackinn]]
| Test and source imports of Pig classes, however they appear to be in the adapter space so may be able to defer.
|-
| tempus-fugit
| '''<span style="color:orange">Review</span>'''
| {{bz|1009654}}
| [[User:gil|gil]]
| Concurrency library. May only be a test dep. Upstream at http://tempusfugitlibrary.org/
|-
|}
NB: This list is distilled from the overall set of missing dependencies but many of the ones that aren't listed are not required for the latest Fedora version of Hadoop (2.0.5a), assuming the appropriate [[#Build|build properties noted]] are specified.

== SCM ==

The BigData SIG is tracking a set of commits [https://github.com/fedora-bigdata/hive/tree/fedora-0.11 here] to build according to FPG. These will eventually be converted into a patch set for a spec file once all the outstanding missing dependencies are in place. These commits include a set of custom Ivy resolvers that only inspect the local filesystem in typical Fedora Java jar locations. An [https://bugzilla.redhat.com/show_bug.cgi?id=1012612 RFE] was created to make Fedora Ivy map dependencies into the local filesystem implicitly, thus doing away with custom resolvers (as much as feasible).

    ant build so rough cut but should be complete
    hive*, hbase* excluded but some jars listed are derived from build
$ find -name '*.jar' -type f -printf "%f\n" | grep -v hive | grep -v hbase | sort | uniq
activation-1.1.jar
activeio-core-3.1.2.jar
activemq-core-5.5.0.jar
activemq-protobuf-1.1.jar
ant-1.6.5.jar
ant-contrib-1.0b3.jar
antlr-2.7.7.jar
antlr-3.4.jar
antlr-runtime-3.4.jar
aopalliance-1.0.jar
asm-3.1.jar
asm-3.2.jar
asm-commons-3.1.jar
asm-tree-3.1.jar
avalon-framework-4.1.3.jar
avro-1.3.2.jar
avro-1.5.3.jar
avro-1.7.1.jar
avro-ipc-1.5.3.jar
avro-ipc-1.7.1.jar
avro-mapred-1.7.1.jar
cglib-2.2.1-v20090111.jar
checkstyle-5.5.jar
commons-beanutils-1.7.0.jar
commons-beanutils-core-1.8.0.jar
commons-beanutils-core-1.8.3.jar
commons-cli-1.2.jar
commons-codec-1.3.jar
commons-codec-1.4.jar
commons-collections-3.2.1.jar
commons-compress-1.4.1.jar
commons-configuration-1.6.jar
commons-dbcp-1.4.jar
commons-digester-1.8.jar
commons-el-1.0.jar
commons-exec-1.1.jar
commons-httpclient-3.0.1.jar (Jakarta???)
commons-httpclient-3.1.jar
commons-io-2.1.jar
commons-io-2.4.jar
commons-lang-2.4.jar
commons-lang-2.5.jar
commons-logging-1.0.4.jar
commons-logging-1.1.1.jar
commons-logging-1.1.jar
commons-logging-api-1.0.4.jar
commons-math-2.1.jar
commons-net-1.4.1.jar
commons-net-2.0.jar
commons-net-3.1.jar
commons-pool-1.5.4.jar
core-3.1.1.jar (hive?)
datanucleus-connectionpool-2.0.3.jar
datanucleus-core-2.0.3.jar
datanucleus-enhancer-2.0.3.jar
datanucleus-rdbms-2.0.3.jar
derby-10.4.2.0.jar
ftplet-api-1.0.0.jar
ftpserver-core-1.0.0.jar
ftpserver-deprecated-1.0.0-M2.jar
geronimo-annotation_1.0_spec-1.1.1.jar
geronimo-j2ee-management_1.1_spec-1.0.1.jar
geronimo-jaspic_1.0_spec-1.0.jar
geronimo-jms_1.1_spec-1.1.1.jar
geronimo-jta_1.1_spec-1.1.1.jar
guava-11.0.2.jar
guice-3.0.jar
guice-servlet-3.0.jar
hadoop-annotations-2.0.5-alpha.jar
hadoop-auth-2.0.5-alpha.jar
hadoop-common-2.0.5-alpha.jar
hadoop-common-2.0.5-alpha-tests.jar
hadoop-core-0.20.2.jar
hadoop-core-1.0.3.jar
hadoop-core-1.1.2.jar
hadoop-hdfs-2.0.5-alpha.jar
hadoop-hdfs-2.0.5-alpha-tests.jar
hadoop-mapreduce-client-app-2.0.5-alpha.jar
hadoop-mapreduce-client-common-2.0.5-alpha.jar
hadoop-mapreduce-client-core-2.0.5-alpha.jar
hadoop-mapreduce-client-hs-2.0.5-alpha.jar
hadoop-mapreduce-client-jobclient-2.0.5-alpha.jar
hadoop-mapreduce-client-jobclient-2.0.5-alpha-tests.jar
hadoop-mapreduce-client-shuffle-2.0.5-alpha.jar
hadoop-test-0.20.2.jar
hadoop-test-1.0.3.jar
hadoop-test-1.1.2.jar
hadoop-tools-0.20.2.jar
hadoop-tools-1.0.3.jar
hadoop-tools-1.1.2.jar
hadoop-yarn-api-2.0.5-alpha.jar
hadoop-yarn-client-2.0.5-alpha.jar
hadoop-yarn-common-2.0.5-alpha.jar
hadoop-yarn-server-common-2.0.5-alpha.jar
hadoop-yarn-server-nodemanager-2.0.5-alpha.jar
hadoop-yarn-server-resourcemanager-2.0.5-alpha.jar
hadoop-yarn-server-tests-2.0.5-alpha-tests.jar
hadoop-yarn-server-web-proxy-2.0.5-alpha.jar
hamcrest-core-1.1.jar
hcatalog-core-0.11.0.jar (should be built within hive?)
hcatalog-pig-adapter-0.11.0.jar
hcatalog-server-extensions-0.11.0.jar
high-scale-lib-1.1.1.jar
hsqldb-1.8.0.10.jar
httpclient-4.1.2.jar (Jakarta???)
httpclient-4.1.3.jar
httpcore-4.1.3.jar
ivy-2.1.0.jar
jackson-core-asl-1.8.8.jar
jackson-core-asl-1.9.2.jar
jackson-jaxrs-1.7.1.jar
jackson-jaxrs-1.8.8.jar
jackson-jaxrs-1.9.2.jar
jackson-mapper-asl-1.8.8.jar
jackson-mapper-asl-1.9.2.jar
jackson-xc-1.7.1.jar
jackson-xc-1.8.8.jar
jackson-xc-1.9.2.jar
jamon-runtime-2.3.1.jar
jasper-compiler-5.5.12.jar
jasper-compiler-5.5.23.jar
jasper-runtime-5.5.12.jar
jasper-runtime-5.5.23.jar
jasypt-1.7.jar
JavaEWAH-0.3.2.jar
javax.inject-1.jar
javolution-5.5.1.jar
jaxb-api-2.1.jar
jaxb-api-2.2.2.jar
jaxb-impl-2.2.3-1.jar
jdo2-api-2.3-ec.jar
jersey-core-1.14.jar
jersey-core-1.8.jar
jersey-guice-1.8.jar
jersey-json-1.14.jar
jersey-json-1.8.jar
jersey-server-1.14.jar
jersey-server-1.8.jar
jersey-servlet-1.14.jar
jersey-test-framework-grizzly2-1.8.jar
jets3t-0.6.1.jar
jets3t-0.7.1.jar
jettison-1.1.jar
jetty-6.1.14.jar
jetty-6.1.26.jar
jetty-all-server-7.6.0.v20120127.jar
jetty-util-6.1.14.jar
jetty-util-6.1.26.jar
jline-0.9.94.jar
jms-1.1.jar
jmxri-1.2.1.jar
jmxtools-1.2.1.jar
jruby-complete-1.6.5.jar
jsch-0.1.42.jar
json-20090211.jar
jsp-2.1-6.1.14.jar
jsp-api-2.1-6.1.14.jar
jsp-api-2.1.jar
jsr305-1.3.9.jar
jul-to-slf4j-1.6.1.jar
junit-3.8.1.jar
junit-4.10.jar
junit-4.5.jar
kahadb-5.5.0.jar
kfs-0.3.jar
libfb303-0.9.0.jar*
libthrift-0.8.0.jar*
libthrift-0.9.0.jar*
log4j-1.2.15.jar
log4j-1.2.16.jar
log4j-1.2.17.jar
logkit-1.0.1.jar
mail-1.4.1.jar
mail-1.4.jar
maven-ant-tasks-2.1.3.jar
metrics-core-2.1.2.jar
mina-core-2.0.0-M5.jar
mockito-all-1.8.2.jar
netty-3.2.2.Final.jar
netty-3.4.0.Final.jar
netty-3.5.11.Final.jar
org.osgi.core-4.1.0.jar
oro-2.0.8.jar
paranamer-2.2.jar
paranamer-2.3.jar
paranamer-ant-2.2.jar
paranamer-generator-2.2.jar
pig-0.10.1.jar (dep on pig?)
protobuf-java-2.4.0a.jar
protobuf-java-2.4.1.jar
qdox-1.10.1.jar
servlet-api-2.3.jar
servlet-api-2.5-20081211.jar
servlet-api-2.5-6.1.14.jar
servlet-api-2.5.jar
slf4j-api-1.5.2.jar
slf4j-api-1.6.1.jar
slf4j-log4j12-1.6.1.jar
snappy-0.2.jar
snappy-java-1.0.3.2.jar
snappy-java-1.0.4.1.jar
ST4-4.0.4.jar
stax-api-1.0.1.jar (not sure about the api level here)
stax-api-1.0-2.jar
stringtemplate-3.2.1.jar
tempus-fugit-1.1.jar
TestSerDe.jar
velocity-1.5.jar
velocity-1.7.jar
wadl-resourcedoc-doclet-1.4.jar
webhcat-0.11.0.jar (from hive build)
webhcat-java-client-0.11.0.jar (from hive build)
xercesImpl-2.6.1.jar
xmlenc-0.52.jar
xz-1.0.jar
zookeeper-3.4.2.jar
zookeeper-3.4.3.jar
zookeeper-3.4.3-tests.jar
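
Several artifacts in the list above appear in more than one version (asm, avro, jackson and others), which is where version-compatibility questions tend to start. As a rough sketch, a small filter can surface those names. The function name `multi_version_artifacts` is hypothetical, and the version-stripping rule (cut at the first `-<digit>`) is an assumption that also folds classifier jars such as `-tests` into their base artifact.

```shell
# Hypothetical helper: read jar file names on stdin (one per line) and print
# the artifact names that occur more than once after the version is stripped.
multi_version_artifacts() {
    # Assumption: the version starts at the first "-<digit>"; anything after
    # it (including trailing notes such as "(Jakarta???)") is discarded.
    sed -E 's/-[0-9].*$//; s/\.jar.*$//' | sort | uniq -d
}

# Example usage against a build tree (GNU find, as in the command above):
#   find . -name '*.jar' -type f -printf '%f\n' | multi_version_artifacts
```

The output is only a starting point for manual review, since two versions in the tree may belong to different shims rather than to a real conflict.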

Latest revision as of 15:02, 19 February 2014

Hive

Overview

From the project site: "Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop compatible file systems. Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL. At the same time this language also allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL."

The Fedora Big Data SIG is investigating the requirements to adapt the latest version of Hive as a package in Fedora, now that Hadoop 2.x has been packaged. Although Hive obviously has a significant dependency on Hadoop, the Java project is not Maven-based and instead is built using Ant and Ivy. The xmvn tooling support in Fedora does not directly apply to the Hive build. In many ways this can be viewed as a simplification instead of a challenge since one can configure a local file-system Ivy resolver relatively easily.

Using static build-derived analysis (Ant doesn't really provide something like the Maven dependency plugin), there is a group of dependencies that are currently missing from Fedora and block the build of Hive using Fedora-only installed versions. There are also many dependencies available which are not necessarily version-compatible. However, like the hadoop outline, those can hopefully be mitigated in the Hive source where possible.
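
One way to approximate this kind of static analysis is to compare the jar names bundled in a build tree with the jars installed on the system. The following is a minimal sketch, not the SIG's actual method: the function name `missing_from_system` is hypothetical, `/usr/share/java` is assumed as the default system jar directory, and matching is by file name only, so a Fedora package that installs its jar under a different name will be reported as missing. It relies on GNU find (`-printf`, `-quit`).

```shell
# Hypothetical helper: list artifact names found under a build tree that have
# no same-named, unversioned jar under a system jar directory.
missing_from_system() {
    build_dir=$1
    system_dir=${2:-/usr/share/java}   # assumed default location
    find "$build_dir" -name '*.jar' -type f -printf '%f\n' |
        sed -E 's/-[0-9].*$//; s/\.jar$//' | sort -u |
        while read -r name; do
            # Name-only heuristic: look for <name>.jar anywhere below system_dir.
            if [ -z "$(find "$system_dir" -name "$name.jar" -print -quit 2>/dev/null)" ]; then
                printf '%s\n' "$name"
            fi
        done
}

# Example usage:
#   missing_from_system ~/hive/build /usr/share/java
```

Version compatibility is not checked at all here; the result only says whether something with a matching name exists.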

Build

Version 0.12.0 is the latest release and was built from source (using the Fedora Hadoop target of 2.2.0) using:

ant very-clean package -Dhadoop.version=2.2.0 -Dhadoop-0.23.version=2.2.0 -Dhadoop.mr.rev=23 -DenhanceModel.notRequired=true -Dmvn.hadoop.profile=hadoop23 -Dshims.include=0.23 -Dbuild.profile=core -Dthrift.home=/usr
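
Because the Hadoop version appears twice in that command, it may be convenient to derive the arguments from a single value. This is only a sketch: the function name `hive_ant_args` is hypothetical, and the properties are copied from the command above without any claim about which of them other Hadoop versions would need.

```shell
# Hypothetical helper: print the Ant targets and properties, one per line,
# with the Hadoop version supplied once (default taken from the text: 2.2.0).
hive_ant_args() {
    hadoop_version=${1:-2.2.0}
    printf '%s\n' very-clean package \
        "-Dhadoop.version=$hadoop_version" \
        "-Dhadoop-0.23.version=$hadoop_version" \
        -Dhadoop.mr.rev=23 \
        -DenhanceModel.notRequired=true \
        -Dmvn.hadoop.profile=hadoop23 \
        -Dshims.include=0.23 \
        -Dbuild.profile=core \
        -Dthrift.home=/usr
}

# Example usage (unquoted on purpose so each line becomes one argument):
#   ant $(hive_ant_args 2.2.0)
```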

Note that to do a local build using the SIG branch you must make a directory to store any (currently) unpackaged jars:

mkdir -p ~/hive/lib/missing

Dependencies

The full Hive dependency list is captured here but the following table outlines the missing dependencies. The ones in bold are deemed hard dependencies and must be packaged.

{| class="wikitable"
|+ Missing/Questionable Dependencies
! Project !! State !! Review BZ !! Packager !! Notes
|-
| avro-ipc, '''avro-mapred''' || Complete || RHBZ #1009170 || ricardo || Although avro 1.6.2 is packaged, it does not include the ipc and mapred jars. IPC appears to only apply to 0.20 shim. MapRed is used by an Avro reader/input/output feature in QL and is based on the legacy mapred API (i.e., org.apache.hadoop.mapred).
|-
| '''datanucleus-core''' || Complete || RHBZ #1011705 || pmackinn, gil || Forms backbone of metastore layer for different data sinks. Upstream project at http://www.datanucleus.org/
|-
| '''datanucleus-api-jdo''' || Complete || RHBZ #1011962 || pmackinn, gil || JDO implementation for datanucleus
|-
| '''datanucleus-rdbms''' || Complete || RHBZ #1011960 || pmackinn, gil || RDBMS plugin adapter for datanucleus
|-
| hbase || Complete || RHBZ #1045556 || rrati || hbase-handler can be compiled out but seems like a significant omission
|-
| high-scale-lib || Complete || RHBZ #865893 || gil ||
|-
| '''javolution''' || Complete || RHBZ #1009153 || pmackinn || Used by the QL classes: a hard dependency
|-
| '''jdo-api''' || Complete || RHBZ #1011696 || pmackinn, gil || Dependency for datanucleus-api-jdo. CANNOT substitute existing jdo2-api.
|-
| '''libthrift, libfb303''' || Complete || RHBZ #982285, RHBZ #1000563 || willb || Will Benton has some RPM artifacts at http://freevariable.com/thrift/
|-
| metrics-core || Complete || RHBZ #861502 || gil ||
|-
| pig || Review || RHBZ #1060277 || pmackinn || Test and source imports of Pig classes, however they appear to be in the adapter space so may be able to defer.
|-
| tempus-fugit || Review || RHBZ #1009654 || gil || Concurrency library. May only be a test dep. Upstream at http://tempusfugitlibrary.org/
|}

NB: This list is distilled from the overall set of missing dependencies but many of the ones that aren't listed are not required for the latest Fedora version of Hadoop (2.0.5a), assuming the appropriate build properties noted are specified.

SCM

The BigData SIG is tracking a set of commits here to build according to the Fedora Packaging Guidelines (FPG). These will eventually be converted into a patch set for a spec file once all the outstanding missing dependencies are in place. These commits include a set of custom Ivy resolvers that only inspect the local filesystem in typical Fedora Java jar locations. An RFE was created to make Fedora Ivy map dependencies into the local filesystem implicitly, thus doing away with custom resolvers (as much as feasible).
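
To illustrate what a local file-system Ivy resolver can look like, here is a minimal sketch that writes an `ivysettings.xml` using Ivy's `filesystem` resolver. It is not the configuration from the SIG branch: the function name `write_local_ivysettings`, the resolver name `local-jars`, the two artifact patterns, and the `/usr/share/java` default are all assumptions chosen for illustration.

```shell
# Hypothetical helper: write an ivysettings.xml whose only resolver looks for
# unversioned jars in a local directory (assumed default: /usr/share/java).
write_local_ivysettings() {
    out_file=$1
    jar_dir=${2:-/usr/share/java}
    cat > "$out_file" <<EOF
<ivysettings>
  <settings defaultResolver="local-jars"/>
  <resolvers>
    <filesystem name="local-jars">
      <!-- Assumed layouts: jar directly in the directory, or in a per-module subdirectory -->
      <artifact pattern="$jar_dir/[artifact].[ext]"/>
      <artifact pattern="$jar_dir/[module]/[artifact].[ext]"/>
    </filesystem>
  </resolvers>
</ivysettings>
EOF
}

# Example usage:
#   write_local_ivysettings ./ivysettings.xml
```

Because the patterns omit `[revision]`, whatever version is installed satisfies any requested revision, which matches the idea of building against Fedora-only installed versions but shifts compatibility checking to the packager.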