From Fedora Project Wiki

Revision as of 01:22, 21 January 2009 by Mdomsch

Date: Tue, 13 Jan 2009 19:48:52 -0600
From: Matt Domsch <Matt_Domsch@dell.com>
Subject: Re: ERROR: chroot failed for fedora-web

On Tue, Jan 13, 2009 at 11:01:51PM -0200, Carlos Carvalho wrote:
> On Tue, 13 Jan 2009 14:10:18 -0800 Jesse Keating <jkeating@redhat.com> wrote:
> >fullfilelist is something we're working on that came from a FUDCon
> >brainstorming session.
>
> What's FUDCon?

Fedora Users and Developers Conference, which was held this past weekend in Cambridge, MA.

> This is a mirror-only issue, so I think this list is the correct
> forum; why hasn't it been discussed here?

It has been, in the push mirroring thread. There I gathered ideas, and discussed those ideas with people, including folks like Chuck Anderson, Kevin Fenzi, and other mirror admins who were at FUDCon, as well as Seth Vidal, Jesse, and James Antill who are experts in yum and file transfers.

Jesse's right - I do need to do a writeup of those conversations. I just haven't had time since getting home late Sunday night... Here it is.

There are 2 basic problems we have.

1) when a bitflip happens, it can take a whole day before most mirrors
   have picked up the bitflip, even if they have all the content.

2) a "null rsync", i.e. resyncing when you're already in sync, takes
   15-20 minutes.  This is mostly due to the directory walk + stat()s
   happening on the "upstream" mirrors, for each client connection.

I'd like to solve both.

Lots of ideas were thrown around, both on this list, and at FUDCon. They boil down to:

Triggering has both "Push" and "Polling" as methods to know "hey, now would be a good time to run rsync". I suspect we'll wind up implementing several.

Push:

 outbound ssh
 outbound email
 send a message on an AMQP queue, have listeners
 send a message on an IRC channel via a bot
 (insert your favorite here)

Poll:

 traditional rsync
 download and check a timestamp file
 (insert your favorite poll mechanism here)


Once you've figured out that "now is a good time to run rsync", what more can we do to speed things up?

a) various kernel tunables to keep more NFS inodes and directory trees
   in cache on the server.

b) hack rsyncd to do the directory tree walk + stat()s, and cache it,
   and then use the cache for each client rsync connect.  Refresh the
   cache on occasion.  This avoids the full tree walk on each client connect.

c) have a list of "files changed since
   $(insert-some-time-interval-here)", and use rsync --files-from to
   sync only those files that have changed.
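
Option b) above could be sketched roughly as follows. This is not rsyncd code, just a minimal Python illustration of the idea: do the tree walk + stat()s once, keep the result, and hand the cached result to every client until it is older than some limit. The class name `TreeCache`, the `max_age_seconds` parameter, and the choice to record (size, mtime) per file are all assumptions made for the example.

```python
import os
import time


class TreeCache:
    """Cache one directory walk + stat() pass and reuse it until it expires."""

    def __init__(self, root, max_age_seconds=300):
        self.root = root
        self.max_age_seconds = max_age_seconds  # hypothetical refresh interval
        self._entries = None
        self._built_at = 0.0
        self.walks = 0  # number of real tree walks performed so far

    def _walk(self):
        # The expensive part: visit every file and stat() it.
        entries = {}
        for dirpath, _dirnames, filenames in os.walk(self.root):
            for name in filenames:
                full = os.path.join(dirpath, name)
                st = os.lstat(full)
                rel = os.path.relpath(full, self.root)
                entries[rel] = (st.st_size, int(st.st_mtime))
        self.walks += 1
        return entries

    def listing(self, now=None):
        # Serve the cached listing unless it is missing or too old.
        now = time.time() if now is None else now
        if self._entries is None or now - self._built_at > self.max_age_seconds:
            self._entries = self._walk()
            self._built_at = now
        return self._entries
```

Each client connection would call `listing()`, so many connections inside one refresh interval cost a single walk. The trade-off is that clients can see a listing that is up to `max_age_seconds` stale.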

Jesse alluded to the "fullfilelist" file (part of c) above) he's working on, as that is really simple to implement. It's not a full solution, but it's a start. He needs scripts on his side to update those files whenever content changes on the master servers. We also want to distribute useful example scripts that mirror admins can run on their side to fetch that file, compare it against the copy they downloaded last time to see whether anything changed, and if so, rsync (either the full tree or a subset).
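
A mirror-side script of that kind might look something like the sketch below. The real fullfilelist format is not described here, so the example assumes a hypothetical one: one line per file, with the relative path, a tab, and then some metadata string (for instance size and mtime). The function names and the rsync source URL in the usage note are likewise made up for illustration; the only rsync option relied on is `--files-from`.

```python
def parse_listing(text):
    """Map path -> metadata string for each non-empty line (assumed format)."""
    entries = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        path, _, meta = line.partition("\t")
        entries[path] = meta
    return entries


def changed_paths(old_text, new_text):
    """Paths that are new, or whose metadata differs, in the newer listing."""
    old = parse_listing(old_text)
    new = parse_listing(new_text)
    return sorted(p for p, meta in new.items() if old.get(p) != meta)


def rsync_command(files_from_path, source, dest):
    """Build an rsync argv that transfers only the paths listed in a file."""
    return ["rsync", "-av", f"--files-from={files_from_path}", source, dest]
```

In use, the script would write the output of `changed_paths()` to a file, one path per line, and run something like `rsync_command("/tmp/changed.txt", "rsync://example.invalid/fedora/", "/srv/mirror/")`. This sketch only picks up added and modified files; files deleted upstream would still need an occasional full rsync with `--delete`.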

If done right, the fullfilelist can be used to know that nothing has changed, and using rsync to get that single file means it can be done very fast (thus more frequently), and we can avoid most of the "null rsyncs" completely.
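
The "nothing has changed" check could be as small as the following sketch, which assumes the mirror has already fetched the current fullfilelist (for example with a single-file rsync) and keeps a digest of the last listing it synced against in a local state file. The function names, the state-file layout, and the use of SHA-256 are assumptions for the example, not part of any agreed design.

```python
import hashlib
from pathlib import Path


def digest(data: bytes) -> str:
    """Hex SHA-256 of the listing contents."""
    return hashlib.sha256(data).hexdigest()


def needs_sync(new_listing: bytes, state_file: Path) -> bool:
    """True if the listing differs from the one recorded at the last sync."""
    new_digest = digest(new_listing)
    old_digest = state_file.read_text().strip() if state_file.exists() else None
    return new_digest != old_digest


def record_sync(new_listing: bytes, state_file: Path) -> None:
    """Remember which listing the mirror is now in sync with."""
    state_file.write_text(digest(new_listing) + "\n")
```

A cron job would call `needs_sync()` after each fetch, run the real rsync only when it returns True, and call `record_sync()` only after that rsync succeeds, so a failed transfer is retried on the next poll.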

The "handle the bitflip" problem can also be solved using the rsync --files-from mechanism, except that the file being checked would list only the directory where the bitflip happens. This could then be scheduled to run frequently on release day.

If done well, then standard rsync polling will be just fine again. If that doesn't prove viable, then we'll still wind up implementing some of the trigger methods.

That's the braindump.

-Matt

-- 
Matt Domsch
Linux Technology Strategist, Dell Office of the CTO
linux.dell.com & www.dell.com/linux