Initial thoughts by Matt Domsch
- Use Reduced Redundancy Storage. All the content can easily be replicated again if an object is lost.
- Use s3cmd sync to keep content in buckets in sync
- exclude ISOs
- exclude debuginfo? I think so.
- Use bucket policies to limit access to each region
- Need list of IP addresses for each region to populate MM. Would be nice if we could get that programmatically.
- Per FI meeting 20120216, suggest using secondary01 or a releng* box for the copying. secondary01 does not have /pub/epel or /pub/fedora currently mounted.
- Bucket names s3-mirror-<region>.fedoraproject.org allow for a CNAME from s3-mirror-<region>.fedoraproject.org to the region's amazonaws.com endpoint in our DNS
|Region||Region Server||Bucket Name||CNAME|
|US Standard||s3-website-us-east-1.amazonaws.com||s3-mirror-us-east-1.fedoraproject.org||s3-mirror-us-east-1.fedoraproject.org CNAME s3-mirror-us-east-1.fedoraproject.org.s3-website-us-east-1.amazonaws.com|
|US West (Oregon) Region||s3-website-us-west-2.amazonaws.com|
|US West (Northern California) Region||s3-website-us-west-1.amazonaws.com|
|EU (Ireland) Region||s3-website-eu-west-1.amazonaws.com|
|Asia Pacific (Singapore) Region||s3-website-ap-southeast-1.amazonaws.com|
|Asia Pacific (Tokyo) Region||s3-website-ap-northeast-1.amazonaws.com|
|South America (Sao Paulo) Region||s3-website-sa-east-1.amazonaws.com|
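The bucket and CNAME pattern shown for US Standard can be extended to the other regions mechanically. A minimal sketch, assuming every region follows the same s3-mirror-<region>.fedoraproject.org naming as the first row (the source only spells out us-east-1); the function names are hypothetical:

```python
# Region website endpoints, copied from the table above.
REGION_ENDPOINTS = {
    "us-east-1": "s3-website-us-east-1.amazonaws.com",
    "us-west-2": "s3-website-us-west-2.amazonaws.com",
    "us-west-1": "s3-website-us-west-1.amazonaws.com",
    "eu-west-1": "s3-website-eu-west-1.amazonaws.com",
    "ap-southeast-1": "s3-website-ap-southeast-1.amazonaws.com",
    "ap-northeast-1": "s3-website-ap-northeast-1.amazonaws.com",
    "sa-east-1": "s3-website-sa-east-1.amazonaws.com",
}


def bucket_name(region):
    """Bucket name for a region, following the us-east-1 row (assumed pattern)."""
    return f"s3-mirror-{region}.fedoraproject.org"


def cname_record(region):
    """DNS line pointing the bucket's hostname at the region's website endpoint."""
    bucket = bucket_name(region)
    return f"{bucket} CNAME {bucket}.{REGION_ENDPOINTS[region]}"


if __name__ == "__main__":
    for region in REGION_ENDPOINTS:
        print(cname_record(region))
```

The bucket name has to equal the hostname being CNAMEd, which is why both are derived from the same function.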
- if we upload ISOs, we get .torrent links "for free".
- no tracker stats :-(
- Can't group multiple files together into a single torrent
- we're paying for outbound bandwidth
- bucket policies keeping traffic in a single region means we need separate buckets for torrent content
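The per-region bucket policies mentioned above could look roughly like the following. This is a sketch only: it builds a policy that allows anonymous GETs from a list of source networks, using the standard `aws:SourceIp` condition key. The CIDR blocks are documentation placeholders, not real region address ranges (that list still has to be obtained), and the helper name is hypothetical.

```python
import json


def region_bucket_policy(bucket, cidrs):
    """Build a bucket policy allowing object reads only from the given CIDR blocks."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowReadsFromRegion",
                "Effect": "Allow",
                "Principal": "*",
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket}/*",
                "Condition": {"IpAddress": {"aws:SourceIp": list(cidrs)}},
            }
        ],
    }


if __name__ == "__main__":
    # Placeholder networks (RFC 5737 documentation ranges), NOT actual region addresses.
    placeholder_cidrs = ["192.0.2.0/24", "198.51.100.0/24"]
    policy = region_bucket_policy("s3-mirror-us-east-1.fedoraproject.org", placeholder_cidrs)
    print(json.dumps(policy, indent=2))
```

The same address list would also be what gets loaded into MirrorManager for the region, so keeping it in one place and generating both from it seems sensible.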
- no charge for uploads
- no charge for intra-region requests
- $0.093/GB/month for data, 200GB = $30-40/month/region. 7 regions.
- no way to guess the number of GET requests. $40/month assumes 10M requests, while $30/month assumes 1M requests.
Total: ~$280/month, or $3360/yr
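The arithmetic behind the total, as a small sketch using only the figures above. Storage alone is below the $30-40 estimate; the remainder is presumably request charges, which the notes give only as the two scenario totals, so no per-request price is assumed here.

```python
from decimal import Decimal

STORAGE_PRICE_PER_GB_MONTH = Decimal("0.093")  # dollars, from the notes
CONTENT_GB = 200
REGIONS = 7
PER_REGION_LOW = 30   # dollars/month, assumes 1M GET requests
PER_REGION_HIGH = 40  # dollars/month, assumes 10M GET requests

# Storage-only cost per region, before any request charges.
storage_per_region = STORAGE_PRICE_PER_GB_MONTH * CONTENT_GB

monthly_low = PER_REGION_LOW * REGIONS
monthly_high = PER_REGION_HIGH * REGIONS
yearly_high = monthly_high * 12

if __name__ == "__main__":
    print(f"storage only: ${storage_per_region:.2f}/month/region")
    print(f"all regions:  ${monthly_low}-{monthly_high}/month")
    print(f"upper bound:  ${yearly_high}/yr")
```

The ~$280/month total therefore corresponds to the upper (10M requests) estimate in every region.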
- do we sync to one region, then COPY to others? If so, what tool? That'll cost $ for bandwidth.
source/ SRPMS/ debug/ beta/ ppc/ ppc64/ repoview/ Fedora/ Live/ isolinux/ images/ EFI/ drpms/ core/ extras/ LiveOS/ updates/8 updates/9 updates/10 updates/11 updates/12 updates/13 updates/14 updates/testing/8 updates/testing/9 updates/testing/10 updates/testing/11 updates/testing/12 updates/testing/13 updates/testing/14 releases/test/
- s3cmd sync processes excludes after walking the whole local directory tree with os.walk(). This means it recurses over .snapshot/ and all the directories we want to exclude, increasing processing time by 20x (>700k files vs ~35k files we'll actually upload). Matt has a patch to s3cmd to fix this, but it's ugly and needs love.
- On subsequent syncs, got this error from the /pub/epel tree:
ERROR: no element found: line 1, column 0
ERROR: Parameter problem: Bucket contains invalid filenames. Please run: s3cmd fixbucket s3://your-bucket/
- The MD5 checks don't happen at all for files uploaded via multipart, which seems to affect larger files. This defeats the purpose of MD5 checking. But, we can't disable MD5 checking for all files, because repomd.xml often changes content but doesn't change file size. So, we need MD5 checking only for some files.
- s3cmd does store mtime/ctime values in the metadata. Need to add code to check those.
- upload of initial bucket for EPEL took real 651m57.042s, /pub/fedora took real 892m52.286s.
- subsequent syncs failed because of the "no element found" error above, but took 12m and 21m respectively (w/o transferring any changes due to the error)
- MirrorManager's report_mirror program needs to be run after the sync, because this will be a private mirror. But, it also blindly does os.walk(), without a concept of excludes. Solutions are to either make a private copy of the whole content (ugh!), or add --exclude-from=<file> handling to report_mirror. Matt did the latter.
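Both the s3cmd and report_mirror problems come down to applying excludes after `os.walk()` has already descended everywhere. The sketch below shows the general technique of pruning `dirnames` in place so excluded trees are never visited. It is not Matt's actual patch to either tool; the function names, the pattern semantics (shell-style globs, trailing `/` for directories), and the one-pattern-per-line exclude file format are all assumptions for illustration. It assumes POSIX paths.

```python
import fnmatch
import os


def load_excludes(path):
    """Read one glob pattern per line; blank lines and '#' comments are skipped (assumed format)."""
    with open(path) as fh:
        lines = (line.strip() for line in fh)
        return [line for line in lines if line and not line.startswith("#")]


def is_excluded(rel_path, name, patterns):
    """True if either the path relative to the top or the bare name matches a pattern."""
    return any(
        fnmatch.fnmatchcase(rel_path, p) or fnmatch.fnmatchcase(name, p)
        for p in patterns
    )


def walk_with_excludes(top, patterns):
    """Yield file paths relative to top, never descending into excluded directories."""
    for dirpath, dirnames, filenames in os.walk(top):
        rel_dir = os.path.relpath(dirpath, top)
        prefix = "" if rel_dir == "." else rel_dir + "/"
        # Assigning to dirnames[:] modifies the list os.walk() will recurse into,
        # so excluded trees such as .snapshot/ are skipped entirely.
        dirnames[:] = [
            d for d in dirnames
            if not is_excluded(prefix + d + "/", d + "/", patterns)
        ]
        for f in filenames:
            if not is_excluded(prefix + f, f, patterns):
                yield prefix + f


if __name__ == "__main__":
    example_patterns = [".snapshot/", "SRPMS/", "debug/", "*.iso", "updates/8/"]
    for path in walk_with_excludes(".", example_patterns):
        print(path)
```

Because pruning happens before descent, the cost of the walk scales with the files that will actually be uploaded rather than with the whole tree.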