Revision as of 23:17, 20 March 2014

Apache Spark

Summary

Apache Spark is a fast and general engine for large-scale data processing. This change brings Spark to Fedora, allowing easy deployment and development of Spark applications on Fedora.

Owner

Name: William Benton
Email: willb@redhat.com
Release notes owner:

Current status

Targeted release: Fedora 21
Last updated: 20 March 2014
Tracker bug: <will be assigned by the Wrangler>

Detailed Description

Apache Spark is a fast and general engine for large-scale data processing. It supports developing custom analytic processing applications over large data sets or streaming data. Because it can cache intermediate results in cluster memory and schedule DAGs of computations, Spark programs can run up to 100x faster than equivalent Hadoop MapReduce jobs. Spark applications are easy to develop, parallel, fast, and resilient to failure, and they can operate on data from in-memory collections, local files, a Hadoop-compatible filesystem, or a variety of streaming sources. Spark also includes libraries for distributed machine learning and graph algorithms.
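
The effect of caching an intermediate result that several downstream computations share can be illustrated with a small, single-process sketch. This is a toy model in plain Python and does not use Spark or its API: the LazyDataset class, its methods, and the evaluations counter are all hypothetical names invented for this illustration. Each dataset holds a recipe rather than data, so chaining operations builds a small graph of computations that only runs when collect() is called.

```python
from typing import Callable, Iterable, List, Optional


class LazyDataset:
    """Toy lazily evaluated dataset (hypothetical; not Spark's API)."""

    def __init__(self, compute: Callable[[], List]):
        self._compute = compute            # recipe that produces this node's data
        self._cached: Optional[List] = None
        self._cache_enabled = False
        self.evaluations = 0               # how many times the recipe actually ran

    @classmethod
    def from_collection(cls, items: Iterable) -> "LazyDataset":
        data = list(items)
        return cls(lambda: data)

    def map(self, fn: Callable) -> "LazyDataset":
        # Nothing runs here; we only record how to derive the new dataset.
        return LazyDataset(lambda: [fn(x) for x in self.collect()])

    def filter(self, pred: Callable) -> "LazyDataset":
        return LazyDataset(lambda: [x for x in self.collect() if pred(x)])

    def cache(self) -> "LazyDataset":
        # Ask this node to keep its result in memory after the first evaluation.
        self._cache_enabled = True
        return self

    def collect(self) -> List:
        if self._cached is not None:
            return self._cached
        self.evaluations += 1
        result = self._compute()
        if self._cache_enabled:
            self._cached = result
        return result


numbers = LazyDataset.from_collection(range(10))
squares = numbers.map(lambda x: x * x).cache()    # shared intermediate result
evens = squares.filter(lambda x: x % 2 == 0)
odds = squares.filter(lambda x: x % 2 == 1)

even_squares = evens.collect()   # evaluates squares and caches the result
odd_squares = odds.collect()     # reuses the cached squares
```

With the cache() call in place, the squares recipe runs once even though two downstream datasets depend on it; without it, each collect() would recompute the squares. A real cluster engine adds partitioning, scheduling across machines, and recovery from failure, none of which this sketch attempts.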

Benefit to Fedora

Apache Spark is a tremendously exciting project, and having it in Fedora makes Fedora a better platform for big data, machine learning, and analytics development, as well as for deploying and distributing these kinds of applications.

Scope

Proposal owners: Our Spark package has been accepted into Fedora. It provides nearly all of the functionality of the upstream release. (Several features -- specifically, Python bindings, the Spark REPL, Kryo-based serialization, primitives for approximate cardinalities of very large sets, and Mesos integration -- were left out of the initial packages due to unavailable dependencies and bundling issues; we're working to close the gap with upstream as quickly as possible.) This work depended upon Fedora 21's improved support for the Scala ecosystem.

Other developers: N/A (not a System Wide Change)

Release engineering: N/A (not a System Wide Change)

Policies and guidelines: N/A (not a System Wide Change)

Upgrade/compatibility impact

N/A

How To Test

It should be possible to install Spark from Fedora repositories and develop and run applications against it. I can prepare a simple Fedora-specific example if necessary.

User Experience

Users will be able to develop and deploy applications based on Apache Spark in Fedora without relying on third-party software distributions.

Dependencies

This work partially motivated, and depended upon, Fedora 21's improved support for the Scala ecosystem; the packages listed there are all complete and available in F21.

Contingency Plan

Contingency mechanism: N/A (not a System Wide Change)
Contingency deadline: N/A (not a System Wide Change)
Blocks release? N/A (not a System Wide Change)
Blocks product? product

Documentation

N/A (not a System Wide Change)

Release Notes

Fedora 21 includes Apache Spark, a fast and general engine for large-scale data processing on clusters.

@@ Line 4: / Line 4: @@
 == Summary ==
-<!-- A sentence or two summarizing what this change is and what it will do. This information is used for the overall changeset summary page for each release. -->
 Apache Spark is a fast and general engine for large-scale data processing.  This change brings Spark to Fedora, allowing easy deployment and development of Spark applications on Fedora.
 == Owner ==
-<!--
-For change proposals to quality as self-contained, owners of all affected packages need to be included here. Alternatively, a SIG can be listed as an owner if it owns all affected packages.
-This should link to your home wiki page so we know who you are.
--->
 * Name: [[User:Willb| William Benton]]
-<!-- Include you email address that you can be reached should people want to contact you about helping with your change, status is requested, or technical issues need to be resolved. If the change proposal is owned by a SIG, please also add a primary contact person. -->
 * Email:  <code>willb@redhat.com</code>
 * Release notes owner: <!--- To be assigned by docs team [[User:FASAccountName| Release notes owner name]] <email address> -->
 <!--- UNCOMMENT only for Changes with assigned Shepherd (by FESCo)
 * FESCo shepherd: [[User:FASAccountName| Shehperd name]] <email address>
--->
-<!--- UNCOMMENT only if this Change aims specific product, working group (Cloud, Workstation, Server, Base, Env & Stacks)
-* Product:
-* Responsible WG:
 -->
@@ Line 39: / Line 30: @@
 == Detailed Description ==
-<!-- Expand on the summary, if appropriate.  A couple sentences suffices to explain the goal, but the more details you can provide the better. -->
 Apache Spark is a fast and general engine for large-scale data processing.  It supports developing custom analytic processing applications over large data sets or streaming data.  Because it has the capability to cache intermediate results in cluster memory and schedule DAGs of computations, Spark programs can run up to 100x faster than equivalent Hadoop MapReduce jobs.  Spark applications are easy to develop, parallel, fast, and resilient to failure, and they can operate on data from in-memory collections, local files, a Hadoop-compatible filesystem, or from a variety of streaming sources.  Spark also includes libraries for distributed machine learning and graph algorithms.
 == Benefit to Fedora ==
-<!-- What is the benefit to the platform?  If this is a major capability update, what has changed?  If this is a new functionality, what capabilities does it bring? Why will Fedora become a better distribution or project because of this proposal?-->
 Apache Spark is a tremendously exciting project and having it in Fedora makes Fedora a better platform for big data, machine learning, and analytics development, as well as for deploying and distributing these kinds of applications.
 == Scope ==
-<!-- What work do the developers have to accomplish to complete the change in time for release?  Is it a large change affecting many parts of the distribution or is it a very isolated change? What are those changes?-->
 * Proposal owners:  Currently our [http://pkgs.fedoraproject.org/cgit/spark.git Spark package has been accepted into Fedora].  It features nearly all of the functionality available from the upstream release.  (The missing features -- specifically, Python bindings, the Spark REPL, Kryo-based serialization, primitives for approximate cardinalities of very large sets, and Mesos integration -- were missing from the initial packages due to unavailable dependencies and bundling issues; we're working to close the gap with upstream as quickly as possible.)  This work depended upon [[Changes/ImprovedScalaEcosystem|Fedora 21's improved support for the Scala ecosystem]].
-<!-- What work do the feature owners have to accomplish to complete the feature in time for release?  Is it a large change affecting many parts of the distribution or is it a very isolated change? What are those changes?-->
 * Other developers: N/A (not a System Wide Change) <!-- REQUIRED FOR SYSTEM WIDE CHANGES -->
-<!-- What work do other developers have to accomplish to complete the feature in time for release?  Is it a large change affecting many parts of the distribution or is it a very isolated change? What are those changes?-->
 * Release engineering: N/A (not a System Wide Change)  <!-- REQUIRED FOR SYSTEM WIDE CHANGES -->
-<!-- Does this feature require coordination with release engineering (e.g. changes to installer image generation or update package delivery)?  Is a mass rebuid required?  If a rel-eng ticket exists, add a link here.  -->
 * Policies and guidelines: N/A (not a System Wide Change) <!-- REQUIRED FOR SYSTEM WIDE CHANGES -->
-<!-- Do the packaging guidelines or other documents need to be updated for this feature?  If so, does it need to happen before or after the implementation is done?  If a FPC ticket exists, add a link here. -->
 == Upgrade/compatibility impact ==
-<!-- What happens to systems that have had a previous versions of Fedora installed and are updated to the version containing this change? Will anything require manual configuration or data migration? Will any existing functionality be no longer supported? -->
-<!-- REQUIRED FOR SYSTEM WIDE CHANGES -->
+N/A
-N/A (not a System Wide Change)
 == How To Test ==
-<!-- This does not need to be a full-fledged document. Describe the dimensions of tests that this change implementation is expected to pass when it is done.  If it needs to be tested with different hardware or software configurations, indicate them.  The more specific you can be, the better the community testing can be.
-Remember that you are writing this how to for interested testers to use to check out your change implementation - documenting what you do for testing is OK, but it's much better to document what *I* can do to test your change.
-A good "how to test" should answer these four questions:
-. What special hardware / data / etc. is needed (if any)?
-. How do I prepare my system to test this change? What packages
-need to be installed, config files edited, etc.?
-. What specific actions do I perform to check that the change is
-working like it's supposed to?
-. What are the expected results of those actions?
--->
 It should be possible to install Spark from Fedora repositories and develop and run applications against it.  I can prepare a simple Fedora-specific example if necessary.
 == User Experience ==
-<!-- If this change proposal is noticeable by its target audience, how will their experiences change as a result?  Describe what they will see or notice. -->
-<!-- REQUIRED FOR SYSTEM WIDE CHANGES -->
 Users will be able to develop and deploy applications based on Apache Spark in Fedora without relying on third-party software distributions.
 == Dependencies ==
-<!-- What other packages (RPMs) depend on this package?  Are there changes outside the developers' control on which completion of this change depends?  In other words, completion of another change owned by someone else and might cause you to not be able to finish on time or that you would need to coordinate?  Other upstream projects like the kernel (if this is not a kernel change)? -->
-<!-- REQUIRED FOR SYSTEM WIDE CHANGES -->
 This work partially motivated and was dependent upon [[Changes/ImprovedScalaEcosystem|Fedora 21's improved support for the Scala ecosystem]], but the packages listed there are all complete and available in F21.
 == Contingency Plan ==
-<!-- If you cannot complete your feature by the final development freeze, what is the backup plan?  This might be as simple as "Revert the shipped configuration".  Or it might not (e.g. rebuilding a number of dependent packages).  If you feature is not completed in time we want to assure others that other parts of Fedora will not be in jeopardy.  -->
 * Contingency mechanism: (What to do?  Who will do it?) N/A (not a System Wide Change)  <!-- REQUIRED FOR SYSTEM WIDE CHANGES -->
-<!-- When is the last time the contingency mechanism can be put in place?  This will typically be the beta freeze. -->
 * Contingency deadline: N/A (not a System Wide Change)  <!-- REQUIRED FOR SYSTEM WIDE CHANGES -->
-<!-- Does finishing this feature block the release, or can we ship with the feature in incomplete state? -->
 * Blocks release? N/A (not a System Wide Change), Yes/No <!-- REQUIRED FOR SYSTEM WIDE CHANGES -->
 * Blocks product? product <-- Applicable for Changes that blocks specific product release/Fedora.next -->
 == Documentation ==
-<!-- Is there upstream documentation on this change, or notes you have written yourself?  Link to that material here so other interested developers can get involved. -->
-<!-- REQUIRED FOR SYSTEM WIDE CHANGES -->
 N/A (not a System Wide Change)
 == Release Notes ==
-<!-- The Fedora Release Notes inform end-users about what is new in the release.  Examples of past release notes are here: http://docs.fedoraproject.org/release-notes/ -->
-<!-- The release notes also help users know how to deal with platform changes such as ABIs/APIs, configuration or data file formats, or upgrade concerns.  If there are any such changes involved in this change, indicate them here.  A link to upstream documentation will often satisfy this need.  This information forms the basis of the release notes edited by the documentation team and shipped with the release.
-Release Notes are not required for initial draft of the Change Proposal but has to be completed by the Change Freeze.
--->
 Fedora 21 includes Apache Spark, a fast and general engine for large-scale data processing on clusters.
 [[Category:ChangeReadyForWrangler]]
-<!-- When your change proposal page is completed and ready for review and announcement -->
-<!-- remove Category:ChangePageIncomplete and change it to Category:ChangeReadyForWrangler -->
-<!-- The Wrangler announces the Change to the devel-announce list and changes the category to Category:ChangeAnnounced (no action required) -->
-<!-- After review, the Wrangler will move your page to Category:ChangeReadyForFesco... if it still needs more work it will move back to Category:ChangePageIncomplete-->
-<!-- Select proper category, default is Self Contained Change -->
 [[Category:SelfContainedChange]]
-<!-- [[Category:SystemWideChange]] -->


Changes/ApacheSpark: Difference between revisions