GSOC 2014/Student Application Brickgao/Shumgrepper

From FedoraProject




Please describe your proposal in detail. Include:

An overview of your proposal

Shumgrepper is a project to expose the md5sum, sha1sum and sha512sum of every file in every package in Fedora. I plan to fork the datagrepper project and use the summershum project to obtain the md5sum, sha1sum and sha512sum information.

The need you believe it fulfills

When this project is finished, any user will be able to easily get the md5sum, sha1sum and sha512sum of every file in every package. The data is formatted as JSON, so I think it will be easy to use in other projects, such as finding relationships among open-source projects.

Any relevant experience you have

The first one is Baiduyun_Simple, a web service that decodes share links from Baidu Yun. You can find it on Baiduyun_simple. It is written in Node.js and returns data in JSON. From that project, I realized that JSON is a good format for a RESTful API. It was my first personal web project. Because Baidu Yun has since been revised, it no longer works, but it made me think more deeply about web projects.

The second one is inchobot, a bot that helps you hand in your homework. You can find it on We worked on the project to make handing in homework easier for students and summarizing homework easier for teachers and admins. I did some front-end work in this project. It was my first contact with Bootstrap, and I think using Bootstrap makes building a web project easier. We used flask, flask-sqlalchemy and flask-bootstrap to build it. While reading the flask-bootstrap documentation, I fixed a small mistake in its docs. I also reviewed some code in its core, and I learned flask and sqlalchemy from this.

The last one is RankIt, a web service that provides a plugin for the WeChat platform. You can find it on I wrote it during the winter vacation. I wrote its core on flask and designed the database models. I think it was good practice with flask and sqlalchemy for me. What impressed me most in that project was solving concurrency conflicts. Concurrent connections kept leaving the database dirty. I solved it using single commands, locks and the inherent properties of the tables.

There are some other projects; you can view them on my GitHub account.

How you intend to implement your proposal

Shumgrepper is a new project that is similar to datagrepper, so it could be forked from datagrepper. Shumgrepper will be based on Flask, SQLAlchemy and summershum, while datagrepper is based on Flask, SQLAlchemy and fedmsg.

I think the views of datagrepper and shumgrepper are similar, so modifying the views should be easy: mostly a matter of changing the data that is returned.

Summershum and shumgrepper can be connected through the database and through Python data. For checksums that have already been generated, it is enough to return the result of the database query. For old data that is not yet in the database, shumgrepper should use the summershum modules and datagrepper to download the packages, then calculate the md5sum, sha1sum and sha512sum of each file, return them to the user as JSON and store them in the database.
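The lookup-or-compute flow described above can be sketched with the Python standard library alone. This is only an illustration under assumptions: the function names `file_checksums` and `lookup_or_compute`, the JSON keys, and the plain dict standing in for the database are all hypothetical and are not taken from summershum or datagrepper.

```python
import hashlib
import json


def file_checksums(path, chunk_size=65536):
    """Return the md5, sha1 and sha512 hex digests of one file."""
    hashers = {
        "md5sum": hashlib.md5(),
        "sha1sum": hashlib.sha1(),
        "sha512sum": hashlib.sha512(),
    }
    with open(path, "rb") as handle:
        # Read in chunks so large files are never loaded into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            for hasher in hashers.values():
                hasher.update(chunk)
    return {name: hasher.hexdigest() for name, hasher in hashers.items()}


def lookup_or_compute(path, cache):
    """Return checksums as a JSON string, computing them only when missing.

    `cache` is a plain dict standing in for the database table (an
    assumption made to keep the sketch self-contained).
    """
    if path not in cache:
        cache[path] = file_checksums(path)
    return json.dumps({"filename": path, **cache[path]}, sort_keys=True)
```

Each file is read once and fed to all three hash objects, so adding another algorithm would not require another pass over the data.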

I looked carefully at the summershum code. It already contains a database model. I could use this model in shumgrepper, which will make the connection between summershum and shumgrepper smooth.

A rough timeline for your progress

  • 21 April - 1 May: Read the code of summershum and datagrepper. (Maybe fix some small bugs in passing.)
  • 1 May - 10 May: Make the blueprint of the project (such as how to connect shumgrepper and summershum, the names of the methods to use …).
  • 1 May - 15 June: Code and finish the basic function of the project. (Users can get JSON information for md5sum, sha1sum and sha512sum.)
  • 16 June - 25 June: Review the code, find bugs and write the mid-term evaluation.
  • 25 June - 30 July: Improve the functions of the project, fix bugs and enhance performance.
  • 30 July - 18 August: Write tests and finish the documentation of the project.

Any other details you feel we should consider

I think we should consider the occasion when there is a large number of requests, possibly in parallel.

First, we should deal with concurrency conflicts. For example, suppose two requests query for a hash that is not in the database, and the core decides that both of them need to generate new data. If the two operations run in parallel, they may create two identical records in the database. So we should ensure that the operation is atomic.
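One possible way to make the insert atomic is to let the database enforce uniqueness rather than checking first and inserting afterwards. The sketch below uses SQLite from the Python standard library purely for illustration; the table name `checksums`, its columns, and the helpers `open_store` and `store_once` are assumptions, not the summershum model.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open a SQLite database with a uniqueness constraint on checksum rows."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS checksums ("
        " filename TEXT NOT NULL,"
        " sha1sum TEXT NOT NULL,"
        " UNIQUE (filename, sha1sum))"
    )
    return conn


def store_once(conn, filename, sha1sum):
    """Insert the row unless an identical one exists; return True if inserted."""
    with conn:  # commits on success, rolls back if an exception is raised
        cursor = conn.execute(
            "INSERT OR IGNORE INTO checksums (filename, sha1sum) VALUES (?, ?)",
            (filename, sha1sum),
        )
    # rowcount is 0 when the UNIQUE constraint caused the insert to be ignored.
    return cursor.rowcount == 1
```

Because the check and the insert happen in a single statement, two requests that race to store the same row cannot both succeed; the second one is simply ignored.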

The second is performance: we should review the code and make it run fast in order to handle a large number of requests.

Have you communicated with a potential mentor? If so, who?