Mass Upgrade Infrastructure SOP: Difference between revisions

Revision as of 21:54, 19 May 2011


Shortcut: ISOP:UPGRADES

Every once in a while, we need to apply mass upgrades to our servers for security fixes and other updates.

Contact Information

Owner: Fedora Infrastructure Team

Contact: #fedora-admin, sysadmin-main, fedora-infrastructure-list@redhat.com, #fedora-noc

Location: All over the world.

Servers: all

Purpose: Apply kernel/other upgrades to all of our servers

Preparation

Follow the Outage Infrastructure SOP and send advance notification to fedora-infrastructure-list and fedora-devel-announce. Try to schedule the update at a time when many admins are around to help/watch for problems.
Plan an order for rebooting the machines considering two factors:
- Location of systems on the kvm or xen hosts. [You will normally reboot all systems on a host together]
- Impact of systems going down on other services, operations and users. Because the database servers and NFS servers are the backbone of many other systems, they and the systems on the same xen boxes should be rebooted before other boxes.
To aid in organizing a mass upgrade/reboot with many people helping, it may help to create a checklist of machines in a gobby document.
Switch DNS to point to PHX only in advance. This allows the external proxy servers to be rebooted without causing downtime.
Schedule downtime in nagios
Make doubly sure that various app owners are aware of the reboots
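
The ordering step above can be sketched in a few lines of Python. This is only an illustration under assumptions: the virt host names and the guest layout are invented, and `order_virt_hosts` is a hypothetical helper, not an existing Fedora Infrastructure tool. It puts every virt host that carries a backbone guest (database or NFS server) ahead of the others.

```python
from typing import Dict, List, Set


def order_virt_hosts(guests_by_host: Dict[str, List[str]],
                     backbone: Set[str]) -> List[str]:
    """Return virt hosts, those carrying backbone guests first.

    Ties are broken alphabetically so the plan is stable between runs.
    """
    def sort_key(host: str):
        has_backbone = any(g in backbone for g in guests_by_host[host])
        # False sorts before True, so backbone hosts (not True == False) come first.
        return (not has_backbone, host)

    return sorted(guests_by_host, key=sort_key)


# Hypothetical layout; the virt host names are invented for illustration.
layout = {
    "virthost-a": ["app1", "proxy1"],
    "virthost-b": ["db1", "app2"],
    "virthost-c": ["nfs1", "fas2"],
}
plan = order_virt_hosts(layout, backbone={"db1", "db2", "db3", "nfs1"})

# Print a simple checklist that could be pasted into a shared document.
for position, host in enumerate(plan, start=1):
    print(f"[ ] {position}. {host}: {', '.join(layout[host])}")
```

The real guest-to-host mapping would come from the current infrastructure, since it can change from update to update.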

Staging

Any updates that can be tested in staging or a pre-production environment should be tested there first. This includes new kernels, updates to core database applications and libraries, web applications, other libraries, etc.

Special Considerations

While this may not be a complete list, here are some special things that must be taken into account before rebooting certain systems:

Before the following machines are rebooted, all koji builders should be disabled and all running jobs allowed to complete:

db3
nfs1
kojipkgs1

The following machines need services to be shut down manually before they are rebooted:

noc1 (tell zodbot to quit first)

The following machines require post-boot actions (mostly entering passphrases). Make sure admins that have the passphrases are on hand for the reboot:

app1 (Transifex SSH passphrase post-boot, see the Translations Infrastructure SOP)
backup2 (LUKS passphrase on boot)
sign-vault1 (NSS passphrase for sigul service)
sign-bridge1 (NSS passphrase for sigul bridge service)
noc1 (start zodbot, see the Zodbot Infrastructure SOP)
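
As a rough aid, the special cases above can be kept in a small lookup table so that whoever reboots a batch of machines sees the manual steps up front. The sketch below is hypothetical: the table only restates the lists in this section, and `actions_for` is an invented helper name.

```python
from typing import Dict, Iterable, List, Tuple

# Manual steps required BEFORE rebooting, taken from the lists above.
BEFORE: Dict[str, str] = {
    "db3": "disable all koji builders and let running jobs complete",
    "nfs1": "disable all koji builders and let running jobs complete",
    "kojipkgs1": "disable all koji builders and let running jobs complete",
    "noc1": "tell zodbot to quit",
}

# Manual steps required AFTER the machine is back up.
AFTER: Dict[str, str] = {
    "app1": "enter Transifex SSH passphrase",
    "backup2": "enter LUKS passphrase on boot",
    "sign-vault1": "enter NSS passphrase for sigul service",
    "sign-bridge1": "enter NSS passphrase for sigul bridge service",
    "noc1": "start zodbot",
}


def actions_for(batch: Iterable[str]) -> List[Tuple[str, str, str]]:
    """List (host, phase, action) for every special case in the batch."""
    found = []
    for host in batch:
        if host in BEFORE:
            found.append((host, "before", BEFORE[host]))
        if host in AFTER:
            found.append((host, "after", AFTER[host]))
    return found


for host, phase, action in actions_for(["noc1", "app2", "backup2"]):
    print(f"{host}: {phase} reboot, {action}")
```

As the section itself notes, the list may not be complete, so an empty result does not prove a machine is safe to reboot without checking with its owners.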

Minimizing Downtime

To minimize downtime as much as possible, the following main servers (and thus their respective xen hosts) should probably be rebooted first. Note that the xen servers may change from update to update.

db1
db2
db3
nfs1
cvs1
proxy2 (the proxy server for all PHX machines)
kojipkgs1
secondary1
fas1 (minor, only absolutely needed for certificate generation)
torrent1
hosted1
people1

When rebooting servers, try to avoid having all of the machines in any of these groups down at the same time.

proxy1, proxy2
app1, app2, app3, app4
fas1, fas2
memcached1, memcached2
bastion1, bastion2 (these use heartbeat, but they will probably cause VPN blips on rebooting)
koji1, koji2 (also on heartbeat)
ns1, ns2
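
A planned batch of reboots can be checked against these groups before anyone runs a shutdown. The following is a minimal sketch, assuming the groups are exactly the ones listed above; `fully_down_groups` is a hypothetical helper, not part of any existing tooling.

```python
from typing import Iterable, List, Set

# Redundancy groups as listed above.
GROUPS: List[Set[str]] = [
    {"proxy1", "proxy2"},
    {"app1", "app2", "app3", "app4"},
    {"fas1", "fas2"},
    {"memcached1", "memcached2"},
    {"bastion1", "bastion2"},
    {"koji1", "koji2"},
    {"ns1", "ns2"},
]


def fully_down_groups(batch: Iterable[str]) -> List[List[str]]:
    """Return every group whose members would ALL be down at once."""
    down = set(batch)
    # A group is a problem only if it is a subset of the machines going down.
    return [sorted(group) for group in GROUPS if group <= down]


# One member of each group: nothing is fully down.
print(fully_down_groups(["proxy1", "app1", "fas1"]))

# Both name servers in the same batch: this should be flagged.
print(fully_down_groups(["ns1", "ns2", "app1"]))
```

If the function returns a non-empty list, the batch should be split so that at least one member of each flagged group stays up.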

External xen hosts can generally be done at any time during this, with the exception of the main machines listed above.

Doing the upgrade

If possible, system upgrades should be done in advance of the reboot (with relevant testing of new packages on staging). To do the upgrades, make sure that the Infrastructure RHEL repo is updated as necessary to pull in the new packages (see the Infrastructure Yum Repo SOP).

On puppet1, as root run:

func-yum [--host=hostname] update

--host can be specified multiple times and takes wildcards.
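
For illustration, here is one way to assemble such a command line when targeting several hosts. It uses only the `--host` option and the `update` argument shown above; the host names and the `app*` pattern are examples, the exact wildcard syntax accepted by func-yum is an assumption, and `func_yum_update_argv` is a hypothetical helper.

```python
import shlex
from typing import List, Sequence


def func_yum_update_argv(hosts: Sequence[str] = ()) -> List[str]:
    """Build the argument list for a func-yum update run.

    With no hosts given, --host is omitted entirely, matching the
    optional [--host=hostname] form shown above.
    """
    argv = ["func-yum"]
    for host in hosts:
        # --host may be repeated, once per host name or wildcard.
        argv.append(f"--host={host}")
    argv.append("update")
    return argv


# shlex.join quotes the wildcard so the local shell does not expand it.
print(shlex.join(func_yum_update_argv(["db1", "app*"])))
```

Building the command as a list rather than a single string keeps wildcards from being expanded by the shell on puppet1 before func-yum sees them.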

Ping people as necessary if you are unsure about any packages.

Doing the reboot

In the order determined above, reboots will usually be grouped by the virtualization hosts that the servers are on. You can see the guests per virt host on puppet1 in /var/log/virthost-lists.out

For each host you will want to:

connect and verify that no one is logged in and using it. If someone is, contact them and ask them to log off.
grep default /etc/grub.conf # make sure that the kernel you upgraded to will be the one rebooted.
shutdown -h now
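
The `grep default /etc/grub.conf` check only shows the index of the default entry; you still have to match that index to a kernel by eye. The sketch below resolves the index to the entry title. It assumes a GRUB legacy style file with a numeric `default=` line and `title` lines counted from 0; the sample text and the kernel versions in it are made up for the example.

```python
from typing import Optional


def default_grub_title(grub_conf: str) -> Optional[str]:
    """Return the title of the entry GRUB legacy would boot by default."""
    default_index = 0  # GRUB legacy boots entry 0 when no default= line exists
    titles = []
    for raw in grub_conf.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith("default="):
            value = line.split("=", 1)[1].strip()
            if not value.isdigit():
                return None  # e.g. default=saved, which this sketch cannot resolve
            default_index = int(value)
            continue
        parts = line.split(None, 1)
        if parts[0] == "title":
            titles.append(parts[1] if len(parts) > 1 else "")
    if default_index < len(titles):
        return titles[default_index]
    return None


# Made-up sample in the GRUB legacy format.
SAMPLE = """\
default=0
timeout=5
title Red Hat Enterprise Linux (2.6.18-238.9.1.el5)
    root (hd0,0)
    kernel /vmlinuz-2.6.18-238.9.1.el5 ro root=/dev/VolGroup00/root
title Red Hat Enterprise Linux (2.6.18-238.el5)
    root (hd0,0)
    kernel /vmlinuz-2.6.18-238.el5 ro root=/dev/VolGroup00/root
"""

print(default_grub_title(SAMPLE))
```

On a real machine the text would be read from /etc/grub.conf, and the returned title compared with the kernel version that was just installed.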

This is also a good time to double-check that each guest is set to start automatically when the virt host reboots.

Aftermath

Make sure that everything's running fine
Re-enable nagios notifications as needed
Make sure to perform any manual post-boot setup (such as loading SSH keys for transifex or entering passphrases for encrypted volumes)

@@ Line 7: / Line 7: @@
 Owner: Fedora Infrastructure Team
-Contact: #fedora-admin, sysadmin-main, fedora-infrastructure-list@redhat.com
+Contact: #fedora-admin, sysadmin-main, fedora-infrastructure-list@redhat.com, #fedora-noc
-Location: Phoenix
+Location: All over the world.
 Servers: all
@@ Line 19: / Line 19: @@
 # Follow the [[Outage Infrastructure SOP]] and send advance notification to fedora-infrastructure-list and fedora-devel-announce.  Try to schedule the update at a time when many admins are around to help/watch for problems.
 # Plan an order for rebooting the machines considering two factors:
-#* Location of systems on the xen clusters. [You will normally reboot all systems on a cluster together.
+#* Location of systems on the kvm or xen hosts. [You will normally reboot all systems on a host together]
 #* Impact of systems going down on other services, operations and users.  Thus since the database servers and nfs servers are the backbone of many other systems, they and systems that are on the same xen boxes would be rebooted before other boxes.
 # To aid in organizing a mass upgrade/reboot with many people helping, it may help to create a checklist of machines in a gobby document.
@@ Line 83: / Line 83: @@
 If possible, system upgrades should be done in advance of the reboot (with relevant testing of new packages on staging).  To do the upgrades, make sure that the Infrastructure RHEL repo is updated as necessary to pull in the new packages ([[Infrastructure Yum Repo SOP]])
-SSH to each machine in the list and perform the upgrades, pinging people as necessary if you are unsure about any packages:
+On puppet1, as root run:
 <pre>
-yum clean metadata
+func-yum [--host=hostname] update
-yum update # make sure to review the list of updates and ask if you think it might break something
 </pre>
+--host can be specified multiple times and takes wildcards.
+pinging people as necessary if you are unsure about any packages:
 == Doing the reboot ==
-In the order determined above, reboots will usually be grouped by the xen hosts that the servers are on.  For each xen host, login to each guest and shut it down:
+In the order determined above, reboots will usually be grouped by the virtualization hosts that the servers are on.
+You can see the guests per virt host on puppet1 in /var/log/virthost-lists.out
-<pre>
+For each host you will want to:
-xm console guestname # and login
+* connect and verify no one is logged in and using it. If they are contact them to log off, etc
-w # ping any logged on people if they're around so they don't get kicked off unexpectedly
+* grep default /etc/grub.conf # make sure that the kernel you upgraded to will be the one rebooted.
-grep default /etc/grub.conf # make sure that the kernel you upgraded to will be the one rebooted.
+* shutdown -h now
-shutdown -h now
-</pre>
-This is also a good time to double check that each xen guest has a proper symlink in /etc/xen/auto if it should be started automatically.  When the guests are done, double check that no guests are running, then reboot the xen host.
+This is also a good time to double check that each guest you are starting up is set to be restarted on reboot of the virt host.
 == Aftermath ==
