GSOC 2014/Student Application discretestates/Gluster

Your name: RJ Nowling
FAS Account: discretestates
Fedora userpage: User:discretestates

Contact Information

Email Address: rnowling@gmail.com
Blog URL: [1]
Freenode IRC Nick: discretestates

NOTE: We require all students to blog about the progress of their project. You are strongly encouraged to register on the Freenode network and participate in our IRC channels. For more information and other instructions contact Org Admins.

Please answer the following questions

Why do you want to work with the Fedora Project?

I'd like to gain experience with open source projects and communities. I've developed software, but I don't have much experience working as part of a large group.

I would also like more experience with distributed systems and file systems.

I'm interested in a software engineering position related to distributed systems and open source after my Ph.D. Working with the Fedora community for GSoC would give me an opportunity to see what this would entail day-to-day. Since Fedora / Gluster are large communities, I feel it would be very good real-world experience.

Do you have any past involvement with the Fedora project or any other open source project as a contributor?

I've contributed to several scientific software packages as part of my research. The source for the software is open, but the development model is very different from the common model seen in open source communities.

Did you participate in past GSoC programs? If so, in which years and with which organizations?

No.

Will you continue contributing to or supporting the Fedora project after the GSoC 2014 program? If yes, which team(s) are you interested in?

Yes, I would be interested in continuing to work with the Gluster community, either on work related to my project or in other areas.

Why should we choose you over other applicants?

I have a lot of passion and interest. I want to make a difference. My experience and prior work suggest that I have a high chance of success. I think that GSoC is a great opportunity for me to learn how to work with open source communities and to develop relationships with other developers that would enable me to make contributions in the long term.

Project Details

Overview of proposal

WebHDFS is a RESTful API and server implemented on top of HDFS. Built on standard web protocols, WebHDFS decouples the HDFS client and server and is language- and operating system-agnostic. WebHDFS makes it easy to integrate other systems, regardless of programming language, with HDFS.

A RESTful API and server for Gluster could offer benefits over current approaches, including automatic and transparent compatibility with WebHDFS clients and easier implementation of Gluster clients in a variety of programming languages. The goal of this proposal is to evaluate the feasibility of such an approach. This will be done by designing or adopting an API, implementing a server, validating its correctness, and evaluating its performance under various use cases.

Need it fulfills

Following on the popularity of Hadoop, a number of "big data" processing systems (e.g., Berkeley Data Analytics Stack, Storm, Stratosphere, Disco) are being created and adopted. These systems are written in a wide range of languages such as Java, Scala, Python, and Erlang.

These systems are rarely used in isolation. Maintaining separate storage systems is laborious, costly, and wasteful. Migrating data between separate storage systems is difficult and error prone, and it limits easy access to data when it is needed. As a result, there is great interest in integration, as exemplified by projects such as the Gluster plugin for Hadoop.

Gluster's existing clients (FUSE, libgfapi) are limited to specific operating systems (Linux) and/or require bindings for each programming language of interest. RESTful/JSON APIs and servers such as WebHDFS offer a more general solution that is independent of the client's operating system and programming language. A RESTful/JSON interface and server for GlusterFS could offer similar benefits.

Further, WebHDFS has proven to be popular and is being used by systems such as Spring, Fluentd, and Disco to support HDFS. If a GlusterFS RESTful server were to implement the WebHDFS API, any WebHDFS client could automatically and transparently use GlusterFS.
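
To make the drop-in idea concrete, here is a minimal sketch of how a WebHDFS-style request URL is formed. The `/webhdfs/v1/<path>?op=<OP>` layout follows the WebHDFS REST convention; the helper name `webhdfs_url` and the host names are hypothetical and only serve as illustration.

```python
from urllib.parse import quote, urlencode


def webhdfs_url(host, port, path, op, **params):
    """Build a WebHDFS-style URL: http://host:port/webhdfs/v1/<path>?op=<OP>&..."""
    # The operation name goes first in the query string, followed by any
    # operation-specific parameters (e.g. offset, length, overwrite).
    query = urlencode({"op": op.upper(), **params})
    # quote() keeps "/" intact, so the file system path stays readable.
    return "http://{}:{}/webhdfs/v1/{}?{}".format(
        host, port, quote(path.lstrip("/")), query
    )


# Same request, two different servers (both host names are made up).
hdfs = webhdfs_url("namenode.example.com", 50070, "/user/rj/data.csv", "OPEN")
gluster = webhdfs_url("gluster-rest.example.com", 8080, "/user/rj/data.csv", "OPEN")
```

Under this assumption, a client would only need a different host and port to talk to a Gluster-backed server; the path and operation parts of the request stay the same.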

Any relevant experience you have

I am a Ph.D. student in Computer Science & Engineering at the University of Notre Dame, where my research is focused on novel algorithms for all-atom, physics-based simulations of molecules. Prior to my Ph.D., I was involved in undergraduate research at the University of Connecticut Health Center and Eckerd College, where I earned a B.S. in Computer Science and Mathematics. In total, I've been developing software in research environments for the last 10 years. I am fluent in Python and Java, have experience using C++ on a day-to-day basis, and have studied languages such as Erlang and Scheme on my own time. I am familiar with clusters, distributed systems, high-performance computing, algorithms (numerical, analytics, mathematical optimization), software engineering, databases, and scientific applications including computational chemistry / physics and bioinformatics.

As a teaching assistant for three years, I gained experience with client-server systems, RESTful/JSON APIs, and CherryPy, a Python web service framework, by creating assignments and helping students complete them. I have familiarized myself with WebHDFS and the Hadoop Gluster plugin. I am also familiar with the work in Disco to add HDFS support through WebHDFS.

How you intend to implement your proposal

Aim 1: Design a RESTful/JSON API

The core of the project will be a RESTful/JSON API used for communication between the server and any clients. The WebHDFS API could be adopted, thus enabling the Gluster RESTful server to be a drop-in replacement for the WebHDFS server. A new API that supports Gluster semantics could also be developed. The API may extend Gluster semantics to support functionality such as reporting data locality information to be used by clients for scheduling workers and tasks. Both options could be pursued to provide compatibility as well as advanced functionality for clients that want it.
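
As an illustration of what adopting the WebHDFS API would involve, the sketch below maps an ordinary POSIX `stat` result onto a WebHDFS-style `FileStatus` JSON object. The field names follow the WebHDFS `FileStatus` layout as I understand it; the function names are hypothetical, and several choices are assumptions made for the sketch: numeric uid/gid stand in for owner and group names, and `blockSize` and `replication` are placeholder zeros because it is not yet decided how those HDFS concepts would map onto Gluster.

```python
import json
import os
import stat


def file_status(path):
    """Map os.stat() output onto a WebHDFS-style FileStatus dictionary."""
    st = os.stat(path)
    is_dir = stat.S_ISDIR(st.st_mode)
    return {
        "FileStatus": {
            "accessTime": int(st.st_atime * 1000),        # milliseconds
            "blockSize": 0,                               # placeholder (assumption)
            "group": str(st.st_gid),                      # numeric gid (assumption)
            "length": 0 if is_dir else st.st_size,
            "modificationTime": int(st.st_mtime * 1000),  # milliseconds
            "owner": str(st.st_uid),                      # numeric uid (assumption)
            "pathSuffix": "",
            "permission": format(stat.S_IMODE(st.st_mode), "o"),  # octal string
            "replication": 0,                             # placeholder (assumption)
            "type": "DIRECTORY" if is_dir else "FILE",
        }
    }


def file_status_json(path):
    """Serialize the status as the JSON body a GETFILESTATUS reply would carry."""
    return json.dumps(file_status(path), indent=2)
```

A Gluster-specific extension, such as data locality information, could be added as extra keys alongside these fields so that existing WebHDFS clients can ignore them.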

Aim 2: Implement a RESTful/JSON server

A RESTful/JSON proof-of-concept server will be implemented to validate the overall approach, including the RESTful API and compatibility with WebHDFS clients. The server will support multiple backends. Ideally, the server will be written in a higher-level language such as Python or Java using bindings to the libgfapi C library. If issues with the bindings develop and are not resolved in a reasonable amount of time, I will fall back to using the FUSE client. A dummy backend will be developed for testing purposes and to enable parallel development of the API and server while any issues with the libgfapi bindings are resolved. Where possible, I will use the same libraries as related efforts in the community (e.g., the Gluster Swift backend).

To aid with validation, regression testing, and documentation, unit tests will be developed in parallel with the server and API.

Aim 3: Identification of Several Common Use Cases and Benchmarks

As good performance will be important for adoption, several common use cases will be documented. Benchmarks recreating those use cases will be designed and run to evaluate the performance of the RESTful/JSON server compared with the standard FUSE client. Although the server developed in Aim 2 will not be optimized, the benchmarks may still be useful for measuring overhead due to the use of the RESTful API and additional intermediate layers.
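
One possible shape for such a benchmark is sketched below: time the same operation through two access paths and report the relative overhead. The function names and the choice of median wall-clock time as the summary statistic are assumptions for illustration, not a settled methodology.

```python
import statistics
import time


def benchmark(operation, payload_bytes, repeats=5):
    """Run `operation()` several times and summarize latency and throughput."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        operation()
        samples.append(time.perf_counter() - start)
    median = statistics.median(samples)
    return {
        "repeats": repeats,
        "median_s": median,
        "min_s": min(samples),
        # Throughput in MB/s; guarded against a zero-duration measurement.
        "mb_per_s": (payload_bytes / 1e6) / median if median > 0 else float("inf"),
    }


def overhead(rest_result, fuse_result):
    """Relative slowdown of the REST path versus the FUSE baseline (0.5 = 50%)."""
    return rest_result["median_s"] / fuse_result["median_s"] - 1.0
```

In use, `operation` would be a callable that, for example, reads a fixed file through the FUSE mount in one run and through the RESTful server in the other, with `payload_bytes` set to the file size.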

Aim 4: (Optional and time permitting) Test proof-of-concept integration with a big data system

Interoperability and performance will be evaluated for Hadoop, Disco, or another big data system utilizing their WebHDFS clients.

Rough timeline

April 21 – May 18
- Aim 1: Review Gluster semantics and API
- Aim 2: Review Python / Java libgfapi bindings, gluster-swift, RESTful and unit testing frameworks
- Other: Set up and learn Gluster; become familiar with resources (clusters)
May 18 – 24; May 25 – 31; June 1 – 7
- Aim 1: Design and finalize API
- Aim 2: Develop RESTful server implementation and tests; begin implementing API
June 8 – 14; June 15 – 21; June 22 – 28
- Aim 1: Document API
- Aim 2: Finish RESTful server and tests; finish API implementation
- Aim 3: Choose use cases for testing; design benchmarks
- Aim 4: Identify projects using WebHDFS; choose one
- Other: Midterm Evaluations (June 27)
June 29 – July 5; July 6 – 12; July 13 – 19
- Aim 2: Document server and tests
- Aim 3: Implement benchmarks
- Aim 4: Familiarize myself with software and environments
July 20 – 26; July 27 – Aug 2; Aug 3 – 9
- Aim 2: Fix bugs found in compatibility tests; clean up code; packaging
- Aim 3: Document benchmarks
- Aim 4: Test compatibility with WebHDFS clients
Aug 10 – 16; Aug 17 – 23
- Aim 4: Document compatibility tests
- Other: Submit work; Final evaluations (Aug 22)

Have you communicated with a potential mentor? If so, who?

Jay Vyas, a friend and mentor of 10 years, has offered to serve as my primary adviser. He has significant experience with Java, Hadoop, and the Hadoop Gluster plugin. We will seek out further advisers for areas such as the language bindings to libgfapi and Gluster semantics.
