From Fedora Project Wiki

Revision as of 18:33, 3 June 2011

Shortcut:
ISOP:UPGRADES

Every once in a while, we need to apply mass upgrades to our servers for various security and other upgrades.

Contact Information

Owner: Fedora Infrastructure Team

Contact: #fedora-admin, sysadmin-main, infrastructure@lists.fedoraproject.org, #fedora-noc

Location: All over the world.

Servers: all

Purpose: Apply kernel/other upgrades to all of our servers

Preparation

  1. Determine which host group you are going to be doing updates/reboots on.
    1. Group "A" are servers that end users will see or note being down and anything that depends on them.
    2. Group "B" are servers that contributors will see or note being down and anything that depends on them.
    3. Group "C" are servers that infrastructure will notice are down, or that are redundant enough that some can be rebooted while others take the load.
  2. Appoint an 'Update Leader' for the updates.
  3. Follow the Outage Infrastructure SOP and send advance notification to the appropriate lists. Try to schedule the update at a time when many admins are around to help and watch for problems, and when the impact on the affected group is lowest. If possible, do NOT do multiple groups on the same day.
  4. Plan an order for rebooting the machines considering two factors:
    • Location of systems on the kvm or xen hosts. [You will normally reboot all systems on a host together]
    • Impact of systems going down on other services, operations and users. Because the database servers and NFS servers are the backbone of many other systems, they and the systems on the same xen boxes should be rebooted before other boxes.
  5. To aid in organizing a mass upgrade/reboot with many people helping, it may help to create a checklist of machines in a gobby document.
  6. Schedule downtime in nagios.
  7. Make doubly sure that the various app owners are aware of the reboots.
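
One way to seed the checklist mentioned in step 5 is to turn a plain list of host names into check-box lines that can be pasted into the shared document. This is only a minimal sketch: the function name make_checklist and the "[ ]" line format are assumptions made for illustration, not part of this SOP.

```shell
#!/bin/sh
# Hypothetical helper: read host names on stdin, one per line, and print
# a "[ ] hostname" line for each. Blank lines and parenthesised notes
# such as "(due to being on the same virt host as above)" are skipped.
make_checklist() {
    awk 'NF && $1 !~ /^\(/ { print "[ ] " $1 }'
}

# Example (the file name is a placeholder):
#   make_checklist < group-b-hosts.txt
```

Each admin can then mark a line as done once the machine is back up and verified.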

Staging

Any updates that can be tested in staging or a pre-production environment should be tested there first. This includes new kernels, updates to core database applications and libraries, web applications, and other libraries.

Special Considerations

While this may not be a complete list, here are some special things that must be taken into account before rebooting certain systems:

Before the following machines are rebooted, all koji builders should be disabled and all running jobs allowed to complete:

  • db04
  • nfs01
  • kojipkgs01

Builders can be removed from koji, updated and re-added. Use:

koji disable-host NAME

and

koji enable-host NAME

(note: you must be a koji admin).
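
The disable, update and re-enable cycle can be sketched as a small loop. This is an illustration under stated assumptions rather than the team's actual procedure: the function name update_builders and the KOJI and FUNC_YUM override variables are hypothetical, and the update step simply reuses the func-yum command described under "Doing the upgrade".

```shell
#!/bin/sh
# Sketch: take each builder out of koji, update it, then put it back.
# KOJI and FUNC_YUM can be overridden (e.g. with "echo koji") for a dry run.
KOJI=${KOJI:-koji}
FUNC_YUM=${FUNC_YUM:-func-yum}

update_builders() {
    for host in "$@"; do
        $KOJI disable-host "$host" || return 1
        # This sketch does not wait for running jobs; check that the
        # builder is idle before letting the update proceed.
        $FUNC_YUM --host="$host" update || return 1
        $KOJI enable-host "$host" || return 1
    done
}

# Dry run:
#   KOJI="echo koji"; FUNC_YUM="echo func-yum"
#   update_builders x86-01.phx2.fedoraproject.org
```

Stopping at the first failure leaves the affected builder disabled, which is the safer state while someone investigates.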

The following machines require post-boot actions (mostly entering passphrases). Make sure admins that have the passphrases are on hand for the reboot:

  • backup-2 (LUKS passphrase on boot)
  • sign-vault01 (NSS passphrase for sigul service)
  • sign-bridge01 (NSS passphrase for sigul bridge service)

Update Leader

Each update should have a Leader appointed. This person will be in charge of doing any read-write operations and of delegating tasks to others. If you aren't specifically asked by the Leader to reboot or change something, please don't. The Leader will assign machine groups to reboot, or ask specific people to look at machines that didn't come back up from reboot or aren't working right after reboot. It's important to avoid multiple people operating on a single machine in a read-write manner and interfering with each other's changes.

Group A reboots

Group A machines are critical to end users. Outages here should be planned at least a week in advance and announced to the announce list.

List of machines currently in A group (note: this is going to be automated):

retrace01.fedoraproject.org

torrent01.fedoraproject.org
ibiblio01.fedoraproject.org

people02.fedoraproject.org
internetx01.fedoraproject.org

collab1.fedoraproject.org
serverbeach2.fedoraproject.org

hosted1.fedoraproject.org
serverbeach4.fedoraproject.org
telia1.fedoraproject.org

insight01.phx2.fedoraproject.org
virthost02.phx2.fedoraproject.org

db05.phx2.fedoraproject.org
virthost03.phx2.fedoraproject.org

db02.phx2.fedoraproject.org
xen15.phx2.fedoraproject.org

(due to being on the same virt host as above)

ns1.fedoraproject.org
app05.fedoraproject.org
app6.fedoraproject.org
backup02.fedoraproject.org
bastion01.phx2.fedoraproject.org
fas01.phx2.fedoraproject.org
fas02.phx2.fedoraproject.org
log02.phx2.fedoraproject.org
memcached03.phx2.fedoraproject.org
noc02.fedoraproject.org
ns02.fedoraproject.org
ns04.phx2.fedoraproject.org
ns05.fedoraproject.org
proxy02.fedoraproject.org
proxy04.fedoraproject.org
proxy5.fedoraproject.org
smtp-mm01.fedoraproject.org
smtp-mm03.fedoraproject.org

Group B reboots

This Group contains machines that contributors use. Announcements of outages here should be at least a week in advance and sent to the devel-announce list.

db04.phx2.fedoraproject.org
bvirthost01.phx2.fedoraproject.org

nfs01.phx2.fedoraproject.org
bvirthost02.phx2.fedoraproject.org

pkgs01.phx2.fedoraproject.org
bvirthost03.phx2.fedoraproject.org

kojipkgs01.phx2.fedoraproject.org
bxen03.phx2.fedoraproject.org

(due to being on the same virt host as one of above)

koji01.phx2.fedoraproject.org
releng02.phx2.fedoraproject.org

Group C reboots

Group C are machines that infrastructure uses, or can be rebooted in such a way as to continue to provide services to others via multiple machines. Outages here should be announced on the infrastructure list.

app01.stg.phx2.fedoraproject.org
app01.dev.fedoraproject.org
app02.stg.phx2.fedoraproject.org
koji01.stg.phx2.fedoraproject.org
noc01.stg.phx2.fedoraproject.org
proxy01.stg.phx2.fedoraproject.org
releng01.stg.phx2.fedoraproject.org
value01.stg.phx2.fedoraproject.org
virthost13.phx2.fedoraproject.org

publictest01.fedoraproject.org
publictest02.fedoraproject.org
publictest04.fedoraproject.org
publictest05.fedoraproject.org
insight01.dev.fedoraproject.org
fakefas01.fedoraproject.org
proxy6.fedoraproject.org
osuosl1.fedoraproject.org

proxy07.fedoraproject.org
bodhost01.fedoraproject.org

bnfs01.phx2.fedoraproject.org

bxen01.phx2.fedoraproject.org

dhcp02.c.fedoraproject.org
cnode01.fedoraproject.org

compose-x86-01.phx2.fedoraproject.org

download01.phx2.fedoraproject.org
download02.phx2.fedoraproject.org
download03.phx2.fedoraproject.org
download04.phx2.fedoraproject.org
download05.phx2.fedoraproject.org

app07.phx2.fedoraproject.org
fas03.phx2.fedoraproject.org
insight01.stg.phx2.fedoraproject.org
secondary02.phx2.fedoraproject.org
smolt01
noc01.phx2.fedoraproject.org
virthost01.phx2.fedoraproject.org

serverbeach1.fedoraproject.org

hosted2.fedoraproject.org
serverbeach5.fedoraproject.org

collab2.fedoraproject.org
serverbeach3.fedoraproject.org

app02.phx2.fedoraproject.org
db01.stg.phx2.fedoraproject.org
memcached01.phx2.fedoraproject.org
xen09.phx2.fedoraproject.org

bapp01.phx2.fedoraproject.org
ns03.phx2.fedoraproject.org
xen04.phx2.fedoraproject.org

dhcp01.phx2.fedoraproject.org
koji02.phx2.fedoraproject.org
releng01.phx2.fedoraproject.org
bxen04.phx2.fedoraproject.org

bastion02.phx2.fedoraproject.org
proxy01.phx2.fedoraproject.org
value01.phx2.fedoraproject.org
xen05.phx2.fedoraproject.org

proxy3.fedoraproject.org
smtp-mm02.fedoraproject.org
tummy1.fedoraproject.org

app03.phx2.fedoraproject.org
value02.phx2.fedoraproject.org
xen07.phx2.fedoraproject.org

app04.phx2.fedoraproject.org
fas01.stg.phx2.fedoraproject.org
pkgs01.stg.phx2.fedoraproject.org
xen03.phx2.fedoraproject.org

secondary01.phx2.fedoraproject.org
xen11.phx2.fedoraproject.org

puppet01.phx2.fedoraproject.org
app01.phx2.fedoraproject.org
xen14.phx2.fedoraproject.org

(disable each builder in turn, update and reenable). 
ppc05.phx2.fedoraproject.org
ppc06.phx2.fedoraproject.org
ppc07.phx2.fedoraproject.org
ppc08.phx2.fedoraproject.org
ppc09.phx2.fedoraproject.org
ppc10.phx2.fedoraproject.org
ppc12.phx2.fedoraproject.org
x86-01.phx2.fedoraproject.org
x86-02.phx2.fedoraproject.org
x86-03.phx2.fedoraproject.org
x86-04.phx2.fedoraproject.org
x86-05.phx2.fedoraproject.org
x86-06.phx2.fedoraproject.org
x86-07.phx2.fedoraproject.org
x86-09.phx2.fedoraproject.org
x86-10.phx2.fedoraproject.org
x86-11.phx2.fedoraproject.org
x86-12.phx2.fedoraproject.org
x86-13.phx2.fedoraproject.org
x86-14.phx2.fedoraproject.org
x86-15.phx2.fedoraproject.org
x86-16.phx2.fedoraproject.org
x86-17.phx2.fedoraproject.org
x86-18.phx2.fedoraproject.org

backup01

sign-vault01

Doing the upgrade

If possible, system upgrades should be done in advance of the reboot (with relevant testing of new packages on staging). To do the upgrades, make sure that the Infrastructure RHEL repo is updated as necessary to pull in the new packages (see the Infrastructure Yum Repo SOP).

On puppet01, as root, run:

func-yum [--host=hostname] update

--host can be specified multiple times and takes wildcards.

Ping people as necessary if you are unsure about any packages.
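
Because --host can be repeated and accepts wildcards, it can help to assemble the command from a list of hosts and review it before running it. The helper below is a hypothetical sketch (func_yum_cmd is not an existing tool); it only prints the command and quotes each pattern so that the local shell does not expand the wildcards.

```shell
#!/bin/sh
# Hypothetical helper: print a func-yum update command for the given
# host names or wildcard patterns, one --host option per argument.
func_yum_cmd() {
    cmd="func-yum"
    for h in "$@"; do
        cmd="$cmd --host='$h'"
    done
    printf '%s update\n' "$cmd"
}

# Example:
#   func_yum_cmd 'proxy*' db05.phx2.fedoraproject.org
```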

Additionally, you can see which machines still need to be rebooted with:

sudo func-command --timeout=10 --oneline /usr/local/bin/needs-reboot.py | grep yes

Doing the reboot

In the order determined above, reboots will usually be grouped by the virtualization hosts that the servers are on. You can see the guests per virt host on puppet01 in /var/log/virthost-lists.out

For each host you will want to:

  • Connect and verify that no one is logged in and using it. If someone is, contact them and ask them to log off.
  • grep default /etc/grub.conf # make sure that the kernel you upgraded to is the one that will be booted.
  • shutdown -h now
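
The grep above shows only the default entry number, not the kernel it refers to. As a minimal sketch, assuming a legacy GRUB grub.conf with a "default=N" line and zero-based "title" entries, the hypothetical helper below prints the title of the entry that will boot, so it can be compared with the kernel that was just installed.

```shell
#!/bin/sh
# Hypothetical helper: print the title of the default boot entry in a
# legacy GRUB config file. Exits non-zero if no matching entry is found
# (for example when the default is not a plain number).
default_kernel_title() {
    awk '
        /^default=/   { idx = substr($0, 9) }
        /^title[ \t]/ { titles[n++] = substr($0, 7) }
        END { if (idx in titles) print titles[idx]; else exit 1 }
    ' "$1"
}

# Example:
#   default_kernel_title /etc/grub.conf
```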

This is also a good time to double check that each guest you are starting up is set to be restarted on reboot of the virt host.
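
For guests managed through libvirt, the autostart setting appears in the output of "virsh dominfo". The sketch below assumes virsh is available on the virt host and does not cover xen guests that are not managed by libvirt; the helper name autostart_enabled is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read "virsh dominfo GUEST" output on stdin and
# succeed only if the Autostart field is "enable".
autostart_enabled() {
    awk '$1 == "Autostart:" { found = ($2 == "enable") }
         END { exit (found ? 0 : 1) }'
}

# Usage on a virt host (GUEST is a placeholder):
#   virsh dominfo GUEST | autostart_enabled || echo "GUEST will not autostart"
```

If a guest is reported as not autostarting, "virsh autostart GUEST" enables it.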

Aftermath

  1. Make sure that everything is running fine.
  2. Re-enable nagios notifications as needed.
  3. Make sure to perform any manual post-boot setup (such as entering passphrases for encrypted volumes).
  4. Close outage ticket.