Mass Upgrade Infrastructure SOP

From FedoraProject

{{shortcut|ISOP:UPGRADES}}
 
  
Every once in a while, we need to apply mass upgrades to our servers for security fixes and other updates.
 
  
This SOP has moved to the Fedora Infrastructure SOP git repo. Please see the current document at: http://infrastructure.fedoraproject.org/infra/docs/massupgrade.txt

For changes, questions or comments, please contact anyone in the Fedora Infrastructure team.

== Contact Information ==

Owner: Fedora Infrastructure Team

Contact: #fedora-admin, sysadmin-main, infrastructure@lists.fedoraproject.org, #fedora-noc

Location: All over the world.
 
 
Servers: all
 
 
Purpose: Apply kernel/other upgrades to all of our servers
 
 
== Preparation ==
 
 
# Determine which host group you are going to be doing updates/reboots on.
 
## Group "A" are servers that end users will see or note being down and anything that depends on them.
 
## Group "B" are servers that contributors will see or note being down and anything that depends on them.
 
## Group "C" are servers that infrastructure will notice are down, or are redundant enough to reboot some with others taking the load.
 
# Appoint an 'Update Leader' for the updates.
 
# Follow the [[Outage Infrastructure SOP]] and send advance notification to the appropriate lists.  Try to schedule the update at a time when many admins are around to help and watch for problems, and when the impact on the affected group is lowest. If possible, do NOT do multiple groups on the same day.
 
# Plan an order for rebooting the machines considering two factors:
 
#* Location of systems on the kvm or xen hosts. [You will normally reboot all systems on a host together]
 
#* Impact of systems going down on other services, operations and users.  Since the database servers and nfs servers are the backbone of many other systems, they and the systems on the same xen boxes would be rebooted before other boxes.
 
# To aid in organizing a mass upgrade/reboot with many people helping, it may help to create a checklist of machines in a gobby document.
 
# Schedule downtime in nagios.
 
# Make doubly sure that the affected app owners are aware of the reboots.
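The "Schedule downtime in nagios" step can be scripted through Nagios's external command interface. The following is a minimal sketch, not the team's actual procedure: the function name <code>downtime_cmd</code>, the author and comment values, and the command file path mentioned afterwards are all assumptions made for illustration.

```shell
# downtime_cmd HOST START END AUTHOR COMMENT
# START and END are Unix timestamps. Prints one SCHEDULE_HOST_DOWNTIME line
# in the Nagios external command format. (Hypothetical helper.)
downtime_cmd() {
    host="$1"; start="$2"; end="$3"; author="$4"; comment="$5"
    # fixed=1 and trigger_id=0; the duration field is still required
    # even though Nagios only uses it for flexible downtime.
    printf '[%s] SCHEDULE_HOST_DOWNTIME;%s;%s;%s;1;0;%s;%s;%s\n' \
        "$(date +%s)" "$host" "$start" "$end" "$((end - start))" \
        "$author" "$comment"
}

# Example: one hour of downtime for db04, starting at timestamp 1000.
downtime_cmd db04 1000 4600 admin 'mass upgrade'
```

To apply it, the output would be appended to the Nagios external command file on the monitoring host. That file's location varies by installation; check the <code>command_file</code> setting in nagios.cfg rather than assuming a path.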
 
 
== Staging ==
 
 
Any updates that can be tested in staging or a pre-production environment should be tested there first. This includes new kernels, updates to core database applications and libraries, web applications, other libraries, etc.
 
 
== Special Considerations ==
 
 
While this may not be a complete list, here are some special things that must be taken into account before rebooting certain systems:
 
 
=== Disable builders ===
 
 
Before the following machines are rebooted, all koji builders should be disabled and all running jobs allowed to complete:
 
 
* db04
 
* nfs01
 
* kojipkgs01
 
 
Builders can be removed from koji, updated and re-added. Use:
 
 
<pre>
 
koji disable-host NAME
 
</pre>
 
 
and
 
 
<pre>
 
koji enable-host NAME
 
</pre>
 
 
(note: you must be a koji admin).
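When many builders are involved, the two commands above can be wrapped in a loop. This is only a sketch: the function name <code>toggle_builders</code> and the <code>KOJI</code> variable are hypothetical, and it assumes the builders are registered in koji under the host names shown. By default it only prints the commands it would run.

```shell
# KOJI defaults to "echo koji", so nothing is changed unless you set
# KOJI=koji in the environment (koji admin rights are still required).
KOJI="${KOJI:-echo koji}"

# toggle_builders ACTION HOST...
# ACTION is "disable" or "enable"; each HOST is a builder name.
# (Hypothetical helper around koji disable-host / enable-host.)
toggle_builders() {
    action="$1"
    shift
    case "$action" in
        disable|enable) ;;
        *) echo "usage: toggle_builders disable|enable HOST..." >&2
           return 2 ;;
    esac
    for host in "$@"; do
        # Stop at the first failure so a typo does not go unnoticed.
        $KOJI "${action}-host" "$host" || return 1
    done
}

# Example (dry run): print the commands for two builders.
toggle_builders disable x86-01.phx2.fedoraproject.org ppc05.phx2.fedoraproject.org
```

Note that disabling a builder does not stop jobs already running on it; you still need to wait for those to complete before rebooting.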
 
 
=== Post reboot action ===
 
 
The following machines require post-boot actions (mostly entering passphrases).  Make sure admins that have the passphrases are on hand for the reboot:
 
 
* backup-2 (LUKS passphrase on boot)
 
* sign-vault01 (NSS passphrase for sigul service)
 
* sign-bridge01 (NSS passphrase for sigul bridge service)
 
 
=== Schedule autoqa01 reboot ===
 
 
There is currently an autoqa01.c host on cnode01. Check with QA folks before rebooting this guest/host.
 
 
=== Bastion01 and Bastion02 and openvpn server ===
 
 
We need one of the bastion machines to be up to provide openvpn for all machines. Before rebooting bastion02, modify the manifests/nodes/bastion0*.phx2.fedoraproject.org.pp files to start the openvpn server on bastion01 and wait for all clients to reconnect. Then reboot bastion02 and revert back to it as the openvpn hub.
 
 
=== Special yum directives ===
 
 
Sometimes we will wish to exclude packages or otherwise modify the yum.conf on a machine. For this purpose, all machines have an include, making them read http://infrastructure.fedoraproject.org/infra/hosts/FQHN/yum.conf.include from the infrastructure repo. If you need to make such changes, add them to the infrastructure repo before doing updates.
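As an illustration of what such an include file might hold, the sketch below prints a fragment that holds packages back with yum's <code>exclude=</code> option. The function name <code>make_yum_include</code> and the package globs are hypothetical, and it assumes the include line sits in the <code>[main]</code> section of yum.conf so that the fragment applies globally.

```shell
# make_yum_include GLOB...
# Prints a yum.conf.include fragment that excludes the given package globs.
# (Hypothetical helper; the output is meant to be committed to the
# infrastructure repo, not written to the host directly.)
make_yum_include() {
    if [ "$#" -eq 0 ]; then
        echo "usage: make_yum_include GLOB..." >&2
        return 2
    fi
    printf '# Hold back these packages during the mass update\n'
    # yum expects a space-separated list of names or globs.
    printf 'exclude=%s\n' "$*"
}

# Example: keep kernel and postgresql packages at their current versions.
make_yum_include 'kernel*' 'postgresql*'
```

Remember to remove the exclusion from the repo once the reason for holding the packages back has passed.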
 
 
== Update Leader ==
 
 
Each update should have a Leader appointed. This person will be in charge of doing any read-write operations and of delegating tasks to others. If you aren't specifically asked by the Leader to reboot or change something, please don't. The Leader will assign out machine groups to reboot, or ask specific people to look at machines that didn't come back up from reboot or aren't working right after reboot. It's important to avoid multiple people operating on a single machine in a read-write manner and interfering with each other's changes.
 
 
== Group A reboots ==
 
 
Group A machines are end user critical ones. Outages here should be planned at least a week in advance and announced to the announce list.
 
 
List of machines currently in A group (note: this is going to be automated):
 
 
<pre>
 
retrace01.fedoraproject.org
 
telia1.fedoraproject.org
 
 
torrent01.fedoraproject.org
 
ibiblio01.fedoraproject.org
 
 
people02.fedoraproject.org
 
internetx01.fedoraproject.org
 
 
collab1.fedoraproject.org
 
serverbeach2.fedoraproject.org
 
 
hosted1.fedoraproject.org
 
serverbeach4.fedoraproject.org
 
 
insight01.phx2.fedoraproject.org
 
virthost02.phx2.fedoraproject.org
 
 
db05.phx2.fedoraproject.org
 
virthost03.phx2.fedoraproject.org
 
 
db02.phx2.fedoraproject.org
 
xen15.phx2.fedoraproject.org
 
 
(due to being on the same virt host as above)
 
 
ns1.fedoraproject.org
 
app05.fedoraproject.org
 
app6.fedoraproject.org
 
backup02.fedoraproject.org
 
bastion01.phx2.fedoraproject.org
 
fas01.phx2.fedoraproject.org
 
fas02.phx2.fedoraproject.org
 
log02.phx2.fedoraproject.org
 
memcached03.phx2.fedoraproject.org
 
noc01.phx2.fedoraproject.org
 
noc02.fedoraproject.org
 
ns02.fedoraproject.org
 
ns04.phx2.fedoraproject.org
 
ns05.fedoraproject.org
 
proxy02.fedoraproject.org
 
proxy04.fedoraproject.org
 
proxy5.fedoraproject.org
 
smtp-mm01.fedoraproject.org
 
smtp-mm03.fedoraproject.org
 
lockbox02.phx2.fedoraproject.org
 
</pre>
 
 
== Group B reboots ==
 
 
This Group contains machines that contributors use. Announcements of outages here should be at least a week in advance and sent to the devel-announce list.
 
 
<pre>
 
db04.phx2.fedoraproject.org
 
bvirthost01.phx2.fedoraproject.org
 
 
nfs01.phx2.fedoraproject.org
 
bvirthost02.phx2.fedoraproject.org
 
 
pkgs01.phx2.fedoraproject.org
 
bvirthost03.phx2.fedoraproject.org
 
 
kojipkgs01.phx2.fedoraproject.org
 
bxen03.phx2.fedoraproject.org
 
 
(due to being on the same virt host as one of above)
 
 
koji01.phx2.fedoraproject.org
 
releng02.phx2.fedoraproject.org
 
</pre>
 
 
== Group C reboots ==
 
 
Group C are machines that infrastructure uses, or can be rebooted in such a way as to continue to provide services to others via multiple machines.
 
Outages here should be announced on the infrastructure list.
 
 
Group C hosts that have proxy servers on them:
 
<pre>
 
publictest01.fedoraproject.org
 
publictest02.fedoraproject.org
 
publictest04.fedoraproject.org
 
insight01.dev.fedoraproject.org
 
fakefas01.fedoraproject.org
 
proxy6.fedoraproject.org
 
ask01.dev.fedoraproject.org
 
paste01.dev.fedoraproject.org
 
osuosl1.fedoraproject.org
 
 
proxy07.fedoraproject.org
 
bodhost01.fedoraproject.org
 
 
bastion02.phx2.fedoraproject.org NOTE: will take down the entire VPN!
 
proxy01.phx2.fedoraproject.org
 
value01.phx2.fedoraproject.org
 
xen05.phx2.fedoraproject.org
 
 
proxy3.fedoraproject.org
 
smtp-mm02.fedoraproject.org
 
tummy1.fedoraproject.org
 
</pre>
 
 
Other Group C hosts:
 
 
<pre>
 
app01.stg.phx2.fedoraproject.org
 
app01.dev.fedoraproject.org
 
app02.stg.phx2.fedoraproject.org
 
koji01.stg.phx2.fedoraproject.org
 
noc01.stg.phx2.fedoraproject.org
 
proxy01.stg.phx2.fedoraproject.org
 
releng01.stg.phx2.fedoraproject.org
 
value01.stg.phx2.fedoraproject.org
 
virthost13.phx2.fedoraproject.org
 
 
bnfs01.phx2.fedoraproject.org
 
 
autoqa01.c.fedoraproject.org (check with QA before rebooting this host/guest)
 
dhcp02.c.fedoraproject.org
 
cnode01.fedoraproject.org
 
 
autoqa01.qa.fedoraproject.org
 
autoqa-stg01.qa.fedoraproject.org
 
bastion-comm01.qa.fedoraproject.org
 
virthost-comm01.qa.fedoraproject.org
 
 
compose-x86-01.phx2.fedoraproject.org
 
 
download01.phx2.fedoraproject.org
 
download02.phx2.fedoraproject.org
 
download03.phx2.fedoraproject.org
 
download04.phx2.fedoraproject.org
 
download05.phx2.fedoraproject.org
 
 
app07.phx2.fedoraproject.org
 
fas03.phx2.fedoraproject.org
 
insight01.stg.phx2.fedoraproject.org
 
secondary01.phx2.fedoraproject.org
 
smolt01
 
memcached04.phx2.fedoraproject.org
 
virthost01.phx2.fedoraproject.org
 
 
serverbeach1.fedoraproject.org
 
 
hosted2.fedoraproject.org
 
serverbeach5.fedoraproject.org
 
 
collab2.fedoraproject.org
 
serverbeach3.fedoraproject.org
 
 
app02.phx2.fedoraproject.org
 
db01.stg.phx2.fedoraproject.org
 
xen09.phx2.fedoraproject.org
 
 
bapp01.phx2.fedoraproject.org
 
ns03.phx2.fedoraproject.org
 
app01.phx2.fedoraproject.org
 
app03.phx2.fedoraproject.org
 
value02.phx2.fedoraproject.org
 
xen04.phx2.fedoraproject.org
 
 
dhcp01.phx2.fedoraproject.org
 
koji02.phx2.fedoraproject.org
 
releng01.phx2.fedoraproject.org
 
sign-bridge01.phx2.fedoraproject.org
 
relepel01.phx2.fedoraproject.org
 
bxen04.phx2.fedoraproject.org
 
 
app04.phx2.fedoraproject.org
 
fas01.stg.phx2.fedoraproject.org
 
pkgs01.stg.phx2.fedoraproject.org
 
xen03.phx2.fedoraproject.org
 
 
(disable each builder in turn, update and reenable).
 
ppc05.phx2.fedoraproject.org
 
ppc06.phx2.fedoraproject.org
 
ppc07.phx2.fedoraproject.org
 
ppc08.phx2.fedoraproject.org
 
ppc09.phx2.fedoraproject.org
 
ppc10.phx2.fedoraproject.org
 
ppc12.phx2.fedoraproject.org
 
x86-01.phx2.fedoraproject.org
 
x86-02.phx2.fedoraproject.org
 
x86-03.phx2.fedoraproject.org
 
x86-04.phx2.fedoraproject.org
 
x86-05.phx2.fedoraproject.org
 
x86-06.phx2.fedoraproject.org
 
x86-07.phx2.fedoraproject.org
 
x86-09.phx2.fedoraproject.org
 
x86-10.phx2.fedoraproject.org
 
x86-11.phx2.fedoraproject.org
 
x86-12.phx2.fedoraproject.org
 
x86-13.phx2.fedoraproject.org
 
x86-14.phx2.fedoraproject.org
 
x86-15.phx2.fedoraproject.org
 
x86-16.phx2.fedoraproject.org
 
x86-17.phx2.fedoraproject.org
 
x86-18.phx2.fedoraproject.org
 
 
backup01
 
 
backup03
 
 
sign-vault01
 
</pre>
 
 
== Doing the upgrade ==
 
 
If possible, system upgrades should be done in advance of the reboot (with relevant testing of new packages on staging).  To do the upgrades, make sure that the Infrastructure RHEL repo is updated as necessary to pull in the new packages ([[Infrastructure Yum Repo SOP]]).
 
 
On lockbox01, as root run:
 
 
<pre>
 
func-yum [--host=hostname] update
 
</pre>
 
 
--host can be specified multiple times and takes wildcards.
 
 
Ping people as necessary if you are unsure about any packages.
 
 
Additionally, you can see which machines still need to be rebooted with:
 
 
<pre>
 
sudo func-command --timeout=10 --oneline /usr/local/bin/needs-reboot.py | grep yes
 
</pre>
 
 
You can also see which machines would need a reboot if updates were all applied with:
 
 
<pre>
 
sudo func-command --timeout=10 --oneline /usr/local/bin/needs-reboot.py after-updates | grep yes
 
</pre>
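To turn the output of either command into a plain host list for a checklist, a small filter can help. This sketch assumes each output line starts with the host name (optionally followed by a colon) and ends with the word "yes" or "no"; that shape is an assumption about the <code>--oneline</code> output, so check the real output before relying on it. The function name <code>hosts_needing_reboot</code> is hypothetical.

```shell
# hosts_needing_reboot
# Reads lines of the ASSUMED form "<host>: yes" or "<host>: no" on stdin
# and prints the sorted, de-duplicated hosts whose last field is "yes".
hosts_needing_reboot() {
    awk '$NF == "yes" { sub(/:$/, "", $1); print $1 }' | sort -u
}

# Example with made-up sample input in the assumed format.
printf 'db04.example.org: yes\napp01.example.org: no\n' | hosts_needing_reboot
```

Under that assumption, you would pipe the func-command output into it in place of <code>grep yes</code>.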
 
 
== Doing the reboot ==
 
 
Reboot in the order determined above. Reboots will usually be grouped by the virtualization hosts that the servers are on.
 
You can see the guests per virt host on lockbox01 in /var/log/virthost-lists.out
 
 
For each host you will want to:
 
* connect and verify no one is logged in and using it. If someone is, contact them and ask them to log off.
 
* grep default /etc/grub.conf # make sure that the kernel you upgraded to will be the one rebooted.
 
* shutdown -h now
 
 
This is also a good time to double check that each guest you are starting up is set to be restarted on reboot of the virt host.
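The grep check in the list above only shows the <code>default=</code> index, which you then have to match against the <code>title</code> entries by hand. The sketch below resolves the index to the title it points at. It assumes a GRUB legacy style grub.conf with a numeric <code>default=</code> line (entries counted from 0); the function name <code>default_kernel_title</code> is hypothetical.

```shell
# default_kernel_title [FILE]
# Prints the title of the boot entry selected by "default=N" in a
# GRUB legacy config (default FILE: /etc/grub.conf). Entries are
# numbered from 0 in the order their "title" lines appear.
# A missing or non-numeric default= is treated as 0.
default_kernel_title() {
    awk '
        /^default=/          { split($0, kv, "="); want = kv[2] + 0 }
        /^[ \t]*title[ \t]/  { titles[n++] = $0 }
        END {
            if (want in titles) {
                sub(/^[ \t]*title[ \t]+/, "", titles[want])
                print titles[want]
            } else {
                exit 1
            }
        }
    ' "${1:-/etc/grub.conf}"
}
```

Running <code>default_kernel_title</code> on a host before shutting it down prints the entry that will boot, so you can confirm that it names the kernel version you just installed. It exits non-zero if the index does not match any entry.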
 
 
== Aftermath ==
 
 
# Make sure that everything's running fine
 
# Reenable nagios notification as needed
 
# Make sure to perform any manual post-boot setup (such as entering passphrases for encrypted volumes)
 
# Close outage ticket.
 
  
 
[[Category:Infrastructure SOPs]]
 

Latest revision as of 18:28, 19 December 2011
