End goal of EC2 mirrors
To assess a solution to a problem, it is important to define the problem first. The goal we are trying to reach is a repo hosted inside AWS which, for Fedora users within AWS EC2, would:
- be cheaper, because of how network bandwidth is charged (S3 to EC2 intra-region is free, EC2 to EC2 intra-zone is free)
- be considerably faster, due to local bandwidth and native access
- provide "official" sources for AWS users (especially for official AMIs when published)
- provide a sense of community among Fedora users at Amazon
- provide a starting point for additional services to be provided from AWS
Currently, to push content into an S3 bucket, one needs the REST keys for the account that owns and pays for the S3 service. The usual way to mitigate this has been to create sub-accounts that use consolidated billing. There is now a second option: AWS IAM.
One of these two options needs to be picked to limit risk exposure while still providing the access required.
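To make the IAM option more concrete, here is a minimal sketch of a policy document that would confine a sync user to a single bucket. The bucket name `fedora-mirror-us-east-1` and the function name `sync_policy` are hypothetical, and the set of actions is an assumption about what a sync job would need, not a decided design.

```python
import json


def sync_policy(bucket):
    """Build an IAM policy document limited to one S3 bucket.

    The action list is an assumption: list the bucket, and read,
    write, and delete objects inside it. Nothing else is granted.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing applies to the bucket itself.
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": "arn:aws:s3:::%s" % bucket,
            },
            {
                # Object operations apply to keys inside the bucket.
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
                "Resource": "arn:aws:s3:::%s/*" % bucket,
            },
        ],
    }


if __name__ == "__main__":
    # Hypothetical bucket name, for illustration only.
    print(json.dumps(sync_policy("fedora-mirror-us-east-1"), indent=2))
```

With a policy along these lines, the keys handed to the sync job could touch only the mirror bucket, rather than everything the paying account owns.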
Restricting Repo Access intra-region
Tools for doing this are not widely available. The issue is that most tools offer only the "canned" access controls (<a href="http://docs.amazonwebservices.com/AmazonS3/index.html?RESTAPI.html">generally limited by the REST API</a>), which do not appear to allow restricting access to a single region. We are working with Amazon to solve this.
The easiest solution seems to be to fetch the metadata and construct a preferred mirror that is inserted at the top of the mirror list. Most alternatives seem to require extensive changes to a number of tools, whereas this needs one simple change to a single text file (the repo config file).
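The preferred-mirror idea can be sketched roughly as follows. This assumes that "the metadata" means the EC2 instance metadata service, that a plain GET to it is permitted on the instance, and that the region can be derived by dropping the trailing zone letter. The bucket URL template and all function names are hypothetical; writing the result into the repo config file is left out.

```python
import urllib.request

# EC2 instance metadata endpoint for the availability zone.
METADATA_URL = (
    "http://169.254.169.254/latest/meta-data/placement/availability-zone"
)

# Hypothetical naming scheme for per-region mirror buckets.
MIRROR_TEMPLATE = "http://fedora-mirror-{region}.s3.amazonaws.com/"


def current_zone(timeout=2):
    """Ask the instance metadata service which zone we are running in."""
    with urllib.request.urlopen(METADATA_URL, timeout=timeout) as resp:
        return resp.read().decode("ascii").strip()


def region_from_zone(zone):
    """Derive the region, assuming the zone is the region plus one letter."""
    return zone[:-1]


def prefer_local(mirrors, region, template=MIRROR_TEMPLATE):
    """Return the mirror list with the in-region mirror placed first."""
    local = template.format(region=region)
    return [local] + [m for m in mirrors if m != local]


if __name__ == "__main__":
    # Only works on an EC2 instance; elsewhere the request times out.
    region = region_from_zone(current_zone())
    for url in prefer_local(["http://mirror.example.org/fedora/"], region):
        print(url)
```

The existing mirrors stay in the list as fallbacks, so a client outside EC2, or one whose regional bucket is unavailable, would still have somewhere to go.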
Syncing an S3 Mirror
The simplest options for this are as follows:
- an EC2 instance per region that is started for a short period each night and checks the metadata in the latest master mirror. [user:gholms] noted that "primary.xml.gz stores file modification times. And build times!", which would allow the instance to work out what has changed since the previous run and upload to S3 only those files that have changed.
- an EC2 instance per region that is started for a short period each night and mounts a persistent EBS volume set holding a local copy of the repo, without ISOs and for i686 and x86_64 only. This should require less than 200G of persistent storage per region.
- an EC2 instance per availability zone, serving repos directly from EBS volumes, with tools that try to send people to the "right" instance. IPs are very mobile, and this is the most expensive and the most difficult option; it is included because it is also the most "classic" option.
Each option has technical issues with different costs and benefits; these must be weighed to pick the most appropriate solution.
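For the first option, the check against primary.xml.gz could look something like the sketch below. It assumes the usual repodata layout, where each package entry carries a `time` element with `file` and `build` attributes and a `location` element with an `href`. The function names, and how the previous run's state is stored, are hypothetical.

```python
import gzip
import xml.etree.ElementTree as ET

# XML namespace used by the elements in primary.xml.
NS = "{http://linux.duke.edu/metadata/common}"


def package_times(path):
    """Map each package's location href to its (file, build) timestamps."""
    times = {}
    with gzip.open(path, "rb") as fh:
        # iterparse keeps memory use low on large repos.
        for _event, elem in ET.iterparse(fh):
            if elem.tag != NS + "package":
                continue
            loc = elem.find(NS + "location")
            stamp = elem.find(NS + "time")
            if loc is not None and stamp is not None:
                times[loc.get("href")] = (
                    int(stamp.get("file")),
                    int(stamp.get("build")),
                )
            elem.clear()  # drop the children we no longer need
    return times


def changed_files(previous, current):
    """Files that are new, or whose timestamps differ from the last run."""
    return sorted(h for h, t in current.items() if previous.get(h) != t)


def removed_files(previous, current):
    """Files present in the last run but gone from the master mirror."""
    return sorted(set(previous) - set(current))
```

Only the paths returned by `changed_files` would need to be uploaded to S3, and those from `removed_files` deleted, which keeps the nightly instance's run time short.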