Summer Coding 2010 proposal - CHASM-wangfang

From FedoraProject

The proposal deadline has passed. This page is locked.
Do not change any details on this page. If you must change something, talk with the project mentor first.

For information on how to complete this form, refer to Summer Coding 2010 step-by-step for students.

Random list of application requirements

  1. Must include a schedule that was worked out with your mentor
  2. Keep an eye on the Talk: page that is associated with the proposal page you create. Click on the discussion link on the top of your proposal page. The Talk: page is where mentors comment on your proposal.
  3. Make sure you have clicked on the watch link on the top of your proposal page(s) and Talk: page(s). Use the link to my preferences at the top of the page to set your Watchlist preferences to email you when changes are made.

About me

  1. What is your name?
    • Wang Fang
  2. What is your email address?
    • wangfangcs@gmail.com
  3. What is your wiki username?
    • wangfang
  4. What is your IRC nickname?
    • wangfang
  5. What is your primary language?
    • Chinese; English is my second language.
  6. Where are you located, and what hours do you tend to work?
    • Hubei, China (UTC+08:00). I tend to work from 10:00 to 23:00 local time.
  7. Have you participated in an open-source project before? If so, please send us URLs to your profile pages for those projects, or some other demonstration of the work that you have done in open-source. If not, why do you want to work on an open-source project this summer?

About the project

  1. Project information
    • The name of my project is CHASM. The idea page is https://fedoraproject.org/wiki/Summer_Coding_2010_ideas_-_CHASM.
    • CHASM stands for the Cryptographic-Hash-Algorithm-Secured Mirroring solution, and provides the following improvements to the current alternatives:
      • Uses SHA2 to uniquely identify files: Each file is stored by SHA2 and hard-linked into place on the filesystem.
      • Cache-aware transfer protocol: When an upstream and a downstream node synchronize, they take care not to invalidate the filesystem cache of the upstream in order to minimize disk writes.
      • Reduced latency: A trivial update will allow a node to compute what is out of date without any additional network traffic.
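The "stored by SHA2 and hard-linked into place" idea above can be sketched roughly as follows. This is only an illustration under stated assumptions: the flat pool layout (one file per hex digest) and the names pool_path, link_into_place and same_file are hypothetical and are not taken from the CHASM design. Computing the digest itself is left out because the standard C library has no SHA-2 routine.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical layout: each file lives in the pool as <pool_dir>/<hex digest>. */
int pool_path(char *out, size_t outlen, const char *pool_dir, const char *hex_digest)
{
	int n;

	assert(out != NULL && pool_dir != NULL && hex_digest != NULL);
	n = snprintf(out, outlen, "%s/%s", pool_dir, hex_digest);
	return (n < 0 || (size_t)n >= outlen) ? -1 : 0;
}

/* Expose a pooled file at its public path with a hard link, so the data
   is stored only once. Returns 0 on success, -1 on error. */
int link_into_place(const char *pool_dir, const char *hex_digest, const char *dest)
{
	char src[4096];

	if (pool_path(src, sizeof(src), pool_dir, hex_digest) != 0)
		return -1;
	if (link(src, dest) == 0)
		return 0;
	if (errno != EEXIST)
		return -1;
	/* An older entry occupies dest: remove it and link again. */
	if (unlink(dest) != 0)
		return -1;
	return link(src, dest);
}

/* Returns 1 if both paths name the same inode, 0 if not, -1 on error. */
int same_file(const char *a, const char *b)
{
	struct stat sa, sb;

	if (stat(a, &sa) != 0 || stat(b, &sb) != 0)
		return -1;
	return sa.st_dev == sb.st_dev && sa.st_ino == sb.st_ino;
}
```

Because hard links cannot cross filesystems, a layout like this would need the pool and the public tree to live on the same filesystem.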
  2. Project overview
    • CHASM works quite differently from rsync. When files are updated on the master, the master creates a new manifest that describes the state of the file tree and includes the hash value of each file. The manifest is published to the tracker, which plays a role similar to that of a BitTorrent tracker and is responsible for distributing the new manifest. Non-master nodes periodically query the tracker for new manifests (and for more peers for incomplete manifests) to check whether their local mirrors are up to date.
    • The only structure enforced on the network is that of a directed graph in which no edges flow into the master node. An arbitrary node x can be the upstream to y while at the same time fetching updates from y (and thus it is downstream as well).
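To illustrate how a node might decide what is out of date once it holds a new manifest, here is a minimal sketch. It assumes a manifest is simply a list of (path, hex digest) pairs; the struct and the function names manifest_find and manifest_stale are hypothetical, since the proposal does not specify the manifest format.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical manifest entry: a path and the hex digest of its content. */
struct manifest_entry
{
	const char *path;
	const char *digest;
};

/* Look up a path in a manifest. Linear search, for brevity only. */
const struct manifest_entry *manifest_find(const struct manifest_entry *m,
		size_t n, const char *path)
{
	size_t i;

	for (i = 0; i < n; i++)
	{
		if (strcmp(m[i].path, path) == 0)
			return &m[i];
	}
	return NULL;
}

/* Collect the entries of the new manifest that are missing locally or
   whose digest differs. Stores at most max pointers in stale and returns
   the total number of stale entries found. */
size_t manifest_stale(const struct manifest_entry *local, size_t nlocal,
		const struct manifest_entry *remote, size_t nremote,
		const struct manifest_entry **stale, size_t max)
{
	size_t i, count = 0;

	assert(stale != NULL || max == 0);
	for (i = 0; i < nremote; i++)
	{
		const struct manifest_entry *have =
			manifest_find(local, nlocal, remote[i].path);

		if (have != NULL && strcmp(have->digest, remote[i].digest) == 0)
			continue; /* already up to date */
		if (count < max)
			stale[count] = &remote[i];
		count++;
	}
	return count;
}
```

The comparison runs entirely on local data, so once the manifest has arrived, no further network traffic is needed to find the stale files.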
  3. What will be done
    • Implementation of the peer-to-peer network protocol. This protocol resembles a "stateful HTTP", as one of our members put it. It takes into account the state of the upstream's pool and cache to allow clients to maximize throughput. It includes the following parts:
      • Peer negotiation and discovery protocol.
      • Manifest transfer protocol.
      • File transfer protocol.
    • Implementation of a generic message-passing framework for Unix domain sockets including the following functionalities:
      • The ability to communicate between daemons effectively.
      • The ability to transfer file descriptors easily.
    • Partial design and implementation of a peer-tracker protocol. Once the peer-to-peer protocol is complete we will be turning our attention to the peer-tracker protocol. We have not put nearly as much attention into this as the peer-to-peer protocol as it is less critical.
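For the file descriptor transfer mentioned in the message-passing framework above, the standard mechanism on Unix domain sockets is an SCM_RIGHTS control message sent with sendmsg. The sketch below shows that mechanism only; the helper names send_fd and recv_fd are hypothetical and this is not the framework's actual interface.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Send one open descriptor over a connected Unix domain socket.
   Returns 0 on success, -1 on error. */
int send_fd(int sock, int fd)
{
	char byte = 0; /* one byte of ordinary data travels with the descriptor */
	struct iovec iov = { &byte, 1 };
	union
	{
		char buf[CMSG_SPACE(sizeof(int))];
		struct cmsghdr align; /* forces correct alignment of buf */
	} ctl;
	struct msghdr msg;
	struct cmsghdr *cmsg;

	assert(fd >= 0);
	memset(&msg, 0, sizeof(msg));
	memset(&ctl, 0, sizeof(ctl));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = ctl.buf;
	msg.msg_controllen = sizeof(ctl.buf);

	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

	return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one descriptor sent by send_fd. Returns the new descriptor,
   or -1 on error. */
int recv_fd(int sock)
{
	char byte;
	struct iovec iov = { &byte, 1 };
	union
	{
		char buf[CMSG_SPACE(sizeof(int))];
		struct cmsghdr align;
	} ctl;
	struct msghdr msg;
	struct cmsghdr *cmsg;
	int fd;

	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = ctl.buf;
	msg.msg_controllen = sizeof(ctl.buf);

	if (recvmsg(sock, &msg, 0) != 1)
		return -1;
	cmsg = CMSG_FIRSTHDR(&msg);
	if (cmsg == NULL || cmsg->cmsg_level != SOL_SOCKET
			|| cmsg->cmsg_type != SCM_RIGHTS
			|| cmsg->cmsg_len < CMSG_LEN(sizeof(int)))
		return -1;
	memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
	return fd;
}
```

The receiver gets a new descriptor number that refers to the same open file, so one daemon could, for example, open a file and hand it to another daemon that does the network transfer.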
  4. Timeline
    • Weeks 1~4, May 24 ~ June 20: finish the peer-to-peer network protocol and its test code. In the middle of this period, about May 29 ~ June 7, I need to finish the dissertation for my bachelor's degree.
    • Weeks 5~6, June 21 ~ July 4: finish the generic message-passing framework for Unix domain sockets and its test code. Write some utilities to make the protocol demonstrable if possible.
    • Week 7, July 5 ~ July 12: midterm evaluation.
    • Weeks 8~11, July 13 ~ end: try to finish the peer-tracker protocol. If there is still time left, write some daemons.
    • After the final evaluations, if I don't have any other arrangements, I think I can continue to work on the project.
  5. Experience
    • I am good at Linux network programming. I have participated in several network-related projects, including a web proxy, NAS (Network Attached Storage), a distributed file system (http://code.google.com/p/dpfs/), and an implementation of TCP UTO (RFC 5482) in the FreeBSD kernel (a GSoC 2009 project). I am familiar with network programming paradigms including multi-threaded, multi-process, and event-based designs.

CHASM Evaluation

  1. System Coding
    • 1. With blocking I/O, a call waits until data is available to read or resources are available to write. With non-blocking I/O, the call returns immediately and reports the condition through an error code (on Linux, errno is set to EAGAIN). Almost all network programs use non-blocking I/O, especially event-based ones. Some multi-threaded or multi-process programs may use blocking I/O.
    • 2. Benefits: threads provide a more natural abstraction for high concurrency and make connection and resource management convenient. Drawbacks: a separate stack for each client costs memory, synchronization needs tricky and expensive locking, and context-switch overhead is high.
    • 3. For the corrected code, see the last part of this section.
    • 4. Concurrency is a property of a system in which several jobs, of the same or different kinds, are in progress at the same time; it provides the ability to serve many clients at once. Parallelism is a form of computation in which different parts of a large problem execute simultaneously; it provides the ability to speed up the solution of a single problem.
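The non-blocking behaviour described in answer 1 can be shown with a short sketch. The helper names set_nonblocking and try_read, and the -2 return convention for "no data yet", are assumptions made for this example.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Switch a descriptor to non-blocking mode, keeping its other flags. */
int set_nonblocking(int fd)
{
	int flags = fcntl(fd, F_GETFL, 0);

	if (flags < 0)
		return -1;
	return fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0 ? -1 : 0;
}

/* Try to read without waiting. Returns the number of bytes read (0 at
   end of file), -1 on a real error, or -2 when no data is available yet. */
ssize_t try_read(int fd, void *buf, size_t len)
{
	ssize_t n;

	assert(buf != NULL);
	do
	{
		n = read(fd, buf, len);
	} while (n < 0 && errno == EINTR); /* retry if interrupted by a signal */

	if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
		return -2;
	return n;
}
```

An event-based program would typically call something like try_read only after select, poll, or epoll reports the descriptor as readable, and treat the -2 case as "come back later" rather than as a failure.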
  2. Project Management
    • 1. A forwards-compatible network protocol can work with a newer version, ignoring the data that the newer version introduced. A backwards-compatible network protocol can work with an older version.
    • 2. I will pick (a), (b), (e), (f), and (g). (c) is unnecessary because storage capacity is quite large nowadays. As for (d), I am not sure what it would be used for.
    • 3. Drawbacks: extra space and performance overhead, and an additional abstraction layer. Benefits: a dynamic library can be updated without recompiling the executable.
    • 4. I think the primary method I will use is email, because it is one of the most popular, low-cost, and easy-to-use methods, and writing an email makes you think more, which makes the communication more efficient. However, email is not timely, so I will use instant messaging such as gtalk and IRC as a complement.
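The forwards-compatibility answer in item 1 of this list can be illustrated with a receiver that skips record types it does not know. The record layout (one type byte, one length byte, then the value) and the type numbers are hypothetical choices for this sketch and are not CHASM's wire format.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record types understood by this (older) receiver. */
enum { REC_PEER = 1, REC_MANIFEST = 2 };

/* Walk a buffer of [type:1][length:1][value:length] records. Counts the
   records this version understands and skips all others, so data added
   by a newer sender is ignored instead of rejected. Returns the number
   of known records, or -1 if a record is truncated. */
int count_known_records(const uint8_t *buf, size_t len, size_t *skipped)
{
	size_t pos = 0;
	int known = 0;

	assert(buf != NULL || len == 0);
	if (skipped != NULL)
		*skipped = 0;
	while (pos < len)
	{
		uint8_t type, vlen;

		if (len - pos < 2)
			return -1; /* header cut short */
		type = buf[pos];
		vlen = buf[pos + 1];
		pos += 2;
		if (len - pos < vlen)
			return -1; /* value cut short */
		if (type == REC_PEER || type == REC_MANIFEST)
			known++;
		else if (skipped != NULL)
			(*skipped)++;
		pos += vlen; /* the length field lets us step over unknown types */
	}
	return known;
}
```

The explicit length field is what makes skipping possible: the receiver does not need to understand a record to know where the next one starts.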
  3. Personal Background
    • 1. Network programming: Computer Telecommunications & Network. Systems programming: Operating System, Assembly Language Programming, Database System, Compiler Principles, etc. Cryptography: no particular courses. Computer algorithms: Data Structure.
    • 2. It depends, but about 30 ~ 45 hours per week. I will try my best to work no fewer than 30 hours per week.
    • 3. Currently, no.
    • 4. Working on open source projects can help me improve my technical skills without paying too much attention to the strict workflow, documentation, and many other boring things of business projects. Besides, I use open source software every day and I want to give back.
    • 5. http://code.google.com/p/dpfs. This project was my course design, and it was written completely from scratch.
    • 6. Yes. My timezone is UTC+8:00. It seems that my ISP has blocked port 6667, so I have to use the webchat (http://webchat.freenode.net).


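#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>

/* Assumed values: the listing leaves port and host undefined. */
static const uint16_t port = 7777;
static const uint32_t host = INADDR_ANY;

int main(int argc, char **argv)
{
	int sockfd;
	struct sockaddr_in addr;
	socklen_t addr_len;
	char buf[512], *p;
	ssize_t len, len2;

	(void)argc;
	(void)argv;
	sockfd = socket(PF_INET, SOCK_DGRAM, 0);
	if (sockfd < 0)
	{
		/* The number of file descriptors a process can open is limited; many
		systems default to 1024, and setrlimit can change the limit. */
		exit(EXIT_FAILURE);
	}
	memset(&addr, 0, sizeof(addr));
	addr.sin_family = AF_INET;
	/* Network byte order is big-endian; x86 and x64 are little-endian,
	so we need to convert the byte order. */
	addr.sin_port = htons(port);
	addr.sin_addr.s_addr = htonl(host);
	if (bind(sockfd, (struct sockaddr *)&addr, sizeof(addr)) < 0)
	{
		/* If another process has already bound to this port, bind will fail. */
		exit(EXIT_FAILURE);
	}
	/* UDP is not a connection-oriented protocol and does not need listen. */
	//listen(sockfd, 5);
	while (1)
	{
		/* recvfrom and sendto return the number of bytes received and sent,
		or -1 on error. addr_len must be reset before each recvfrom call. */
		addr_len = sizeof(addr);
		len = recvfrom(sockfd, buf, sizeof(buf), 0,
				(struct sockaddr *)&addr, &addr_len);
		for (p = buf; len > 0;)
		{
			len2 = sendto(sockfd, p, (size_t)len, 0,
					(struct sockaddr *)&addr, addr_len);
			if (len2 < 0)
			{
				break;
			}
			else
			{
				len -= len2;
				p += len2;
			}
		}
	}
}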

Miscellaneous

  1. We want to make sure that you are prepared before the project starts
    • Can you set up an appropriate development environment?
      • Yes.
    • Have you met your proposed mentor and members of the associated community?
      • Yes.
  2. What is your t-shirt size?
    • L