User:Gholms/EC2 Mirror Proposal

Problem Space
The user experience on Fedora VMs running on Amazon EC2 would benefit from yum mirrors hosted within Amazon's cloud network. In particular...
 * Such mirrors will be considerably faster.
 * Data transfer charges will be reduced.
 * Intra-region S3-to-EC2 traffic is free.
 * Intra-zone data transfer between EC2 instances is free.
 * Users with hundreds of EC2 instances will not place additional load on existing public mirrors.

Solution Overview
The most economical solution is to place yum mirrors in region-specific S3 buckets and direct clients toward the mirror inside their own region. Fedora will need to create these buckets and keep them up to date with a script, since S3 does not reliably support direct filesystem access. This is the solution Amazon uses for its newly released Amazon Linux repositories.

AWS Credentials
Fedora needs an AWS account to use for managing these buckets. For a script to be able to push content to an S3 bucket, it needs a set of REST keys that give it access. People most commonly use the keys for the account that pays for and manages the S3 bucket. To minimize damage in case of compromise, however, each region will use its own set of task-specific keys created with Amazon's IAM service.

S3 Buckets
The S3 buckets themselves will contain mirrors of Fedora's i686 and x86_64 repositories for every release we publish on EC2. Clients inside EC2 can then access yum repositories via region-specific URIs such as http://fedora-mirror-us-west-1.s3.amazonaws.com/fedora/linux/releases/13/Everything/x86_64/os/.
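
As a rough illustration of the URI layout above, the following sketch builds the base URL for one repository in one region. The bucket naming pattern is taken from the example URI; the function name and its parameters are hypothetical.

```python
def mirror_baseurl(region, release, arch, repo="Everything"):
    """Return the S3 URL of one repository in one region's bucket.

    Assumes buckets are named "fedora-mirror-<region>", as in the
    example URI above; the real bucket names may differ.
    """
    bucket = "fedora-mirror-%s" % region
    return "http://%s.s3.amazonaws.com/fedora/linux/releases/%s/%s/%s/os/" % (
        bucket, release, repo, arch)


if __name__ == "__main__":
    print(mirror_baseurl("us-west-1", 13, "x86_64"))
```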

Since Amazon charges for data transfer from S3 buckets to the rest of the Internet, these buckets will be accessible only by clients inside EC2. S3's REST API allows one to create ACLs based on host IP addresses, so we will prevent outside access to these mirrors by allowing access only to EC2-internal IP addresses (the 10.x.x.x range).
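
One possible way to express such a restriction is an S3 bucket policy with a source-IP condition. The sketch below only generates the policy document; the bucket name and statement ID are hypothetical, and the 10.0.0.0/8 block simply follows the text above. Whether S3 actually sees EC2-internal addresses on incoming requests would need to be verified before relying on this.

```python
import json


def ec2_only_policy(bucket, allowed_cidr="10.0.0.0/8"):
    """Build a bucket policy that allows anonymous reads from one CIDR block.

    The bucket name and CIDR block are assumptions made for illustration.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "AllowReadFromEC2Only",  # hypothetical statement ID
            "Effect": "Allow",
            "Principal": "*",
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::%s/*" % bucket,
            "Condition": {"IpAddress": {"aws:SourceIp": allowed_cidr}},
        }],
    }


if __name__ == "__main__":
    print(json.dumps(ec2_only_policy("fedora-mirror-us-west-1"), indent=2))
```

Objects in a bucket are private unless a policy or ACL grants access, so a single Allow statement like this one would leave requests from other addresses denied.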

Client Access
Yum needs to know which region a given client resides in so it can use the correct region's mirror. We cannot do this via MirrorManager's normal IP block-based mechanism because EC2 instances' IP addresses are too volatile.

While the VM images Fedora will provide are restricted to specific regions, encoding regions directly into these images presents two main difficulties:
 * We have to spin images once for each region instead of using the same image globally.
 * Users can re-bundle their own versions of Fedora's stock images and start them in different regions, not only negating the benefits of this system for users, but also causing those who fund the mirrors to have to pay for data transfer.

A running instance can query EC2's internal API to determine which region it is located in. We can use this information either at boot time or whenever yum is called to ensure yum has up-to-date information about where it resides.

Possible solutions to this problem were discussed at several meetings and in rel-eng ticket 4149. The accepted solution follows:

Recent versions of yum replace variables like $varname in their configuration files with the contents of /etc/yum/vars/varname, as long as such a file exists. At boot time an init script will read the contents of http://169.254.169.254/latest/meta-data/placement/availability-zone (nonexistent outside EC2) and write an appropriate value to /etc/yum/vars/location.
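
The boot-time step could look roughly like the sketch below. It assumes the "appropriate value" is the region name obtained by stripping the trailing zone letter from the availability zone (for example, us-west-1b becomes us-west-1); the proposal does not specify this, and the function names are hypothetical.

```python
import re
import urllib.request

AZ_URL = "http://169.254.169.254/latest/meta-data/placement/availability-zone"


def region_from_zone(zone):
    """Strip the trailing zone letter, e.g. 'us-west-1b' -> 'us-west-1'.

    Returns None if the input does not look like an availability zone.
    """
    match = re.match(r"^([a-z]+(?:-[a-z]+)+-\d+)[a-z]$", zone.strip())
    return match.group(1) if match else None


def detect_region(timeout=2.0):
    """Ask the metadata service for the zone; return None outside EC2."""
    try:
        with urllib.request.urlopen(AZ_URL, timeout=timeout) as resp:
            return region_from_zone(resp.read().decode("ascii"))
    except (OSError, ValueError):
        # Unreachable, timed out, or returned something unexpected.
        return None


def write_location(path="/etc/yum/vars/location"):
    """Write the detected region to the yum variable file, if any."""
    region = detect_region()
    if region is None:
        return False
    with open(path, "w") as f:
        f.write(region + "\n")
    return True
```

A short timeout matters here because the metadata address does not exist outside EC2, and the init script should not stall the boot of a machine running elsewhere.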

Yum will then pass this value to MirrorManager via an additional flag, which is referenced by appending a parameter containing $location to the end of the metalink URIs in Fedora's stock repository files. MirrorManager will then look up the value and prepend the relevant mirror(s), if any, to the mirror list it returns. Bare metal machines will lack this file and pass the unexpanded variable, verbatim, to MirrorManager, which will fail to find results for that value and return a standard mirror list.

The server-side code to accomplish this is present in MirrorManager's 1.4 branch. MirrorManager ignores parameters it does not recognize, so sending such a URI to a server that does not support this parameter still results in a useful mirror list.
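
The client-side substitution described above can be illustrated with a small simulation. This is not yum's own code, only a sketch of the behavior: a variable is replaced when a matching file exists and left untouched otherwise. The parameter name "location" and the example URI are assumptions made for illustration.

```python
import os
import re


def expand_yum_vars(text, vars_dir="/etc/yum/vars"):
    """Replace $name with the first line of vars_dir/name, if that file exists."""
    def replace(match):
        path = os.path.join(vars_dir, match.group(1))
        try:
            with open(path) as f:
                return f.readline().strip()
        except OSError:
            # No such variable file: leave the text as written.
            return match.group(0)
    return re.sub(r"\$(\w+)", replace, text)


if __name__ == "__main__":
    # Hypothetical metalink URI with an assumed "location" parameter.
    uri = "https://mirrors.example.org/metalink?repo=fedora-13&arch=x86_64&location=$location"
    print(expand_yum_vars(uri))
```

On an EC2 instance whose init script wrote us-west-1 to the variable file, the URI would end in location=us-west-1; on a bare metal machine it would still end in the literal $location.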

Updating S3 Mirrors
S3 buckets are accessible via a REST API, which makes normal filesystem access difficult and very slow at best. Instead we will use a script that fetches updated packages and metadata files and pushes them to each region's S3 bucket. This script will either run on Fedora's regular infrastructure or on one EC2 instance per region, each of which uses separate credentials.
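
The core of such a script is deciding what to push. A minimal sketch follows, assuming the script can obtain a mapping of object keys to MD5 checksums for the bucket (S3 reports an MD5-based ETag for objects uploaded in a single request). The function names are hypothetical, and the actual upload and delete calls are left out.

```python
import hashlib
import os


def local_checksums(root):
    """Map each file under root to its MD5 hex digest, keyed by relative path."""
    sums = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            key = os.path.relpath(path, root).replace(os.sep, "/")
            md5 = hashlib.md5()
            with open(path, "rb") as f:
                # Read in 1 MiB chunks so large packages are not loaded whole.
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    md5.update(chunk)
            sums[key] = md5.hexdigest()
    return sums


def plan_sync(local, remote):
    """Return (keys to upload, keys to delete) given two key -> checksum maps."""
    uploads = sorted(k for k, v in local.items() if remote.get(k) != v)
    deletes = sorted(k for k in remote if k not in local)
    return uploads, deletes
```

Uploading packages before repository metadata, and deleting stale objects last, would help keep the mirror consistent for clients that fetch during a sync.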

Finalize this Proposal

 * Decide whether to use IAM or AWS sub-accounts.
 * Decide who will manage "official" Fedora AWS credentials.
 * Decide whether to run the S3 bucket population script on Fedora servers or EC2 instances.
 * Decide how these scripts and possibly EC2 instances will be managed. (Involve Infrastructure in the discussion.)

Implement and Document

 * Ask Amazon officials what support/subsidies they can provide for our finalized proposal.
 * Reserve appropriately-named S3 buckets for Fedora's yum mirrors in each AWS region.
 * Add appropriate ACLs to these S3 buckets.
 * Add the location parameter to stock yum repo files. (See bugs 643185 and 643186.)
 * Add AWS region flag support to MirrorManager.
 * Document and script repository population and updating.
 * Document when and how to retire S3-based yum mirrors of old releases.