Nagios Infrastructure SOP

From FedoraProject

{{shortcut|ISOP:NAGIOS}}
 
  
This SOP has moved to the Fedora Infrastructure SOP git repo. Please see the current document at: http://infrastructure.fedoraproject.org/infra/docs/nagios.txt

For changes, questions or comments, please contact anyone in the Fedora Infrastructure team.

This SOP describes the Nagios configuration.

== Contact Information ==

Owner: Fedora Infrastructure Team

Contact: #fedora-admin, sysadmin-main & sysadmin-noc groups

Location: Anywhere

Servers: noc01, noc02, noc01.stg, puppet1

Purpose: To describe the Nagios configuration
 
 
== Initial Configuration ==
 
=== CGI Access ===
 
To view information in Nagios (any page with cgi-bin in its path), you must first grant yourself access. After checking out the Puppet CVS tree as described in the [[Infrastructure/SOP/Puppet |Puppet SOP]], edit configs/system/nagios/cgi.cfg and append your FAS username to 'authorized_for_system_commands'.
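For illustration only, the Python sketch below performs the same edit on the text of a cgi.cfg file. It assumes 'authorized_for_system_commands' sits on a single, uncommented <code>key=value</code> line holding a comma-separated user list, as in the stock Nagios cgi.cfg; the helper name <code>add_cgi_user</code> is hypothetical, and editing the file by hand works just as well.

```python
def add_cgi_user(cfg_text: str, username: str,
                 key: str = "authorized_for_system_commands") -> str:
    """Return cfg_text with username appended to the comma-separated key= list.

    Hypothetical helper: assumes the directive is a single 'key=value' line.
    Commented-out lines start with '#', so they never match the key.
    """
    out = []
    found = False
    for line in cfg_text.splitlines():
        name, sep, value = line.partition("=")
        if sep and name.strip() == key:
            found = True
            users = [u.strip() for u in value.split(",") if u.strip()]
            if username not in users:  # keep the edit idempotent
                users.append(username)
            line = f"{key}={','.join(users)}"
        out.append(line)
    if not found:
        raise ValueError(f"no uncommented {key}= line found")
    return "\n".join(out) + "\n"
```

Running it twice with the same username leaves the file unchanged, so it is safe to re-apply.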
 
=== Contact Information ===
 
{{Admon/caution | You must configure a contacts file to be able to acknowledge [[Infrastructure/SOP/Outage |outages]]}}
 
 
Create a new file named 'fasname.cfg' (replacing fasname with your FAS username) in configs/system/nagios/contacts/ with the following contents:
 
<pre>
define contact{
        contact_name                    fasname
        alias                           Real Name
        service_notification_period     24x7
        host_notification_period        24x7
        service_notification_options    w,u,c,r
        host_notification_options       d,u,r
        service_notification_commands   notify-by-email
        host_notification_commands      host-notify-by-email
        email                           Email address (any)
        }
</pre>
 
{{Admon/warning | If you are a member of sysadmin-main, the 24x7 notification period may cause duplicate messages; in that case you can specify 'never' instead.}}
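If you prefer to generate the contact file rather than copy the template, the Python sketch below renders the same definition. The function <code>render_contact</code> and its parameters are hypothetical; the directive names and values mirror the template above, and <code>period</code> lets a sysadmin-main member pass 'never' as the warning suggests (assuming a timeperiod of that name is defined in our configuration).

```python
def render_contact(fasname: str, real_name: str, email: str,
                   period: str = "24x7") -> str:
    """Render a Nagios contact definition matching the SOP template.

    Hypothetical helper; pass period="never" if you are in sysadmin-main
    (assumes a 'never' timeperiod exists in the configuration).
    """
    directives = [
        ("contact_name", fasname),
        ("alias", real_name),
        ("service_notification_period", period),
        ("host_notification_period", period),
        ("service_notification_options", "w,u,c,r"),
        ("host_notification_options", "d,u,r"),
        ("service_notification_commands", "notify-by-email"),
        ("host_notification_commands", "host-notify-by-email"),
        ("email", email),
    ]
    # Left-align directive names in a 32-character column, Nagios style.
    body = "\n".join(f"        {key:<32}{value}" for key, value in directives)
    return f"define contact{{\n{body}\n        }}\n"
```

The result can then be written to configs/system/nagios/contacts/fasname.cfg in your checkout.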
 
 
Next, append your FAS username to the 'members' directive in configs/system/nagios/contactgroups/fedora-sysadmin-email.cfg.
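A minimal Python sketch of this edit, assuming the contactgroup definition keeps its comma-separated <code>members</code> list on a single line. The helper <code>add_group_member</code> is hypothetical and rewrites every <code>members</code> line it finds, which is fine for a file holding one contactgroup.

```python
import re

# Indentation + 'members' keyword + whitespace (group 1), then the value (group 2).
MEMBERS_RE = re.compile(r"^(\s*members\s+)(\S.*?)\s*$")


def add_group_member(cfg_text: str, username: str) -> str:
    """Append username to the members line of a contactgroup definition.

    Hypothetical helper; lines starting with '#' never match the pattern.
    """
    out, found = [], False
    for line in cfg_text.splitlines():
        m = MEMBERS_RE.match(line)
        if m:
            found = True
            members = [x.strip() for x in m.group(2).split(",") if x.strip()]
            if username not in members:  # avoid duplicate entries
                members.append(username)
            line = m.group(1) + ",".join(members)
        out.append(line)
    if not found:
        raise ValueError("no members directive found")
    return "\n".join(out) + "\n"
```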
 
 
=== nagios-external ===
 
The same changes need to be applied to the nagios-external configuration (configs/system/nagios-external/).
 
 
=== Commit Changes ===
 
{{Admon/caution | Remember to "cvs add" the contacts/fasname.cfg files}}
 
 
Commit the changes by running <code>cvs commit -m "Adding fasname to Nagios"</code>, then mark them for distribution by running <code>make install</code>.
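The full sequence for both the nagios and nagios-external trees can be sketched as a dry run in Python. This assumes the commands are run from the top of the Puppet CVS checkout and that each tree has its own contacts/ directory; check the [[Infrastructure/SOP/Puppet |Puppet SOP]] for the correct working directory. <code>commit_steps</code> and <code>run_steps</code> are hypothetical helpers, and nothing is executed unless <code>run=True</code> is passed.

```python
import shlex
import subprocess

# Both trees need the new contact file (an assumption based on the
# nagios-external note); paths are relative to the top of the Puppet checkout.
TREES = ("configs/system/nagios", "configs/system/nagios-external")


def commit_steps(fasname: str) -> list[list[str]]:
    """Return the commands that add and commit a new Nagios contact (hypothetical helper)."""
    new_files = [f"{tree}/contacts/{fasname}.cfg" for tree in TREES]
    return [
        ["cvs", "add", *new_files],  # new files must be added before committing
        ["cvs", "commit", "-m", f"Adding {fasname} to Nagios"],
        ["make", "install"],  # marks the changes for distribution
    ]


def run_steps(steps: list[list[str]], run: bool = False) -> None:
    """Print each command; execute it only when run=True, stopping at the first failure."""
    for argv in steps:
        print(shlex.join(argv))
        if run:
            subprocess.run(argv, check=True)
```

Calling <code>run_steps(commit_steps("fasname"))</code> prints the three commands so you can review them before running anything.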
 
 
== Configuration ==
 
=== Instances ===
 
The Fedora Project runs two Nagios instances: [https://admin.fedoraproject.org/nagios nagios] (noc01) and [http://admin.fedoraproject.org/nagios-external nagios-external] (noc02). You must be in the 'sysadmin' group to access them.
 
 
=== Staging Instances ===

In addition to the two production instances, we currently run a staging instance for testing purposes, reachable through SSH at noc01.stg.
 
 
=== nagios (noc01) ===
 
The nagios configuration on noc01 should only monitor general host statistics - puppet status, uptime, apache status (up/down), SSH etc.
 
 
The configurations are found at <code>configs/system/nagios/</code> in the puppet tree.
 
 
=== nagios-external (noc02) ===
 
noc02 is located outside our main datacenter, and its Nagios configuration should monitor our user-facing websites and applications (fedoraproject.org, FAS, PackageDB, Bodhi/Updates).
 
 
The configurations are found at <code>configs/system/nagios-external/</code> in the puppet tree.
 
 
=== Production and staging instances through SSH ===

'''Note:''' Make sure you are in the 'sysadmin' and 'sysadmin-noc' FAS groups before trying to access these hosts.
 
 
# SSH into bastion using your FAS username (e.g. <code>ssh UID@bastion.fedoraproject.org</code>). '''Note:''' no password is required for this login, so if you are prompted for one at this point, you probably do not have access to this machine.
# From bastion, SSH into the host you need to work on (e.g. <code>ssh UID@noc01</code> for the production system or <code>ssh UID@noc01.stg</code> for the staging one). '''Note:''' you will be prompted for a password, which is the one registered for your FAS account.
# Finally, do your checks.
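As an alternative to the two separate logins above (not the documented procedure), OpenSSH 7.3 and newer can make both hops in one command with the <code>-J</code> (ProxyJump) option. This assumes bastion permits TCP forwarding; if it does not, use the two-step login. The Python sketch below only builds the command line; <code>ssh_argv</code> and <code>ssh_command</code> are hypothetical helpers.

```python
import shlex

# Bastion host from the steps above; UID is your FAS username.
BASTION = "bastion.fedoraproject.org"


def ssh_argv(uid: str, host: str) -> list[str]:
    """Build an ssh command that reaches host via bastion in one step.

    Hypothetical helper using OpenSSH's -J (ProxyJump) option, which needs
    OpenSSH 7.3 or newer and assumes bastion allows TCP forwarding.
    """
    return ["ssh", "-J", f"{uid}@{BASTION}", f"{uid}@{host}"]


def ssh_command(uid: str, host: str) -> str:
    """Return the command as a shell-quoted string, ready to paste."""
    return shlex.join(ssh_argv(uid, host))
```

For example, <code>ssh_command("UID", "noc01.stg")</code> returns <code>ssh -J UID@bastion.fedoraproject.org UID@noc01.stg</code>. The target name is resolved by bastion, as in the two-step login, and the target host still prompts for your FAS password.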
 
 
=== NRPE ===
 
We currently use NRPE to execute Nagios plugins remotely on hosts across our network.

The NRPE documentation, which explains its usage and includes diagrams of its structure, is available as a [http://nagios.sourceforge.net/docs/nrpe/NRPE.pdf PDF].
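Whether a plugin runs locally or is invoked remotely through NRPE's <code>check_nrpe</code> client, Nagios takes its result from the exit status (0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN) and uses the first line of output as the status text. The Python sketch below runs a check and maps the status the same way; <code>run_check</code> is a hypothetical helper, and the plugin path and NRPE command name in the final comment are assumptions rather than values from our configuration.

```python
import subprocess

# Standard Nagios plugin exit codes and the states they represent.
STATES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}


def run_check(argv: list[str], timeout: float = 30.0) -> tuple[str, str]:
    """Run a Nagios plugin and return (state, first line of output).

    Hypothetical helper: a missing binary, a timeout or an out-of-range
    exit code are all reported as UNKNOWN here.
    """
    try:
        proc = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
    except (OSError, subprocess.TimeoutExpired) as exc:
        return "UNKNOWN", str(exc)
    lines = proc.stdout.splitlines()
    first = lines[0] if lines else ""
    return STATES.get(proc.returncode, "UNKNOWN"), first


# Example against a remote host through NRPE (plugin path and command name
# are assumptions, not taken from our configuration):
# run_check(["/usr/lib64/nagios/plugins/check_nrpe", "-H", "noc01", "-c", "check_disk"])
```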
 
 
== Understanding the Messages ==
 
=== General ===
 
Nagios notifications are generally easy to read, and follow this consistent format:
 
<pre>
** PROBLEM/ACKNOWLEDGEMENT/RECOVERY alert - hostname/Check is WARNING/CRITICAL/OK **
** HOST DOWN/UP alert - hostname **
</pre>
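When filtering these notifications (for example from a mail-handling script), the subject line can be split into fields with regular expressions. The Python sketch below assumes subjects follow exactly the formats shown above; <code>parse_subject</code> is a hypothetical helper, and the host and check names in the tests are made up. UNKNOWN, a standard Nagios service state not listed in the format, is accepted as well.

```python
import re
from typing import Optional

# Service notifications, e.g. "** PROBLEM alert - host/Check is CRITICAL **".
# UNKNOWN is a standard Nagios service state; it is accepted here as well.
SERVICE_RE = re.compile(
    r"^\*\* (?P<type>PROBLEM|ACKNOWLEDGEMENT|RECOVERY) alert - "
    r"(?P<host>[^/\s]+)/(?P<check>.+) is "
    r"(?P<state>OK|WARNING|CRITICAL|UNKNOWN) \*\*$"
)
# Host notifications, e.g. "** HOST DOWN alert - host **".
HOST_RE = re.compile(r"^\*\* HOST (?P<state>DOWN|UP) alert - (?P<host>\S+) \*\*$")


def parse_subject(subject: str) -> Optional[dict]:
    """Split a notification subject into its fields (hypothetical helper).

    Returns None when the subject matches neither format.
    """
    subject = subject.strip()
    for kind, pattern in (("service", SERVICE_RE), ("host", HOST_RE)):
        match = pattern.match(subject)
        if match:
            return {"kind": kind, **match.groupdict()}
    return None
```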
 
Reading the message will provide extra information on what is wrong.
 
 
=== Disk Space Warning/Critical ===
 
Disk space warnings normally include the following information:
 
<pre>
DISK WARNING/CRITICAL/OK - free space: mountpoint freespace(MB) (freespace(%) inode=freeinodes(%)):
</pre>
 
 
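Both percentages report free space, so a message stating "(1% inode=99%)" means that only 1% of the disk space is free while 99% of the inodes are still free: the disk space is critical, '''not''' the inode usage, and more disk space is required.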
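This reading can be sketched in Python: both percentages in the message are free amounts, so the smaller one identifies the resource that is running out. <code>parse_disk</code> is a hypothetical helper that assumes the message format shown above; it also handles several mountpoints in one message and tolerates 'MiB' units and decimal percentages, in case the installed check_disk version prints them.

```python
import re

# One "mountpoint free (free% inode=free%)" group from a DISK message.
# 'MiB' units and decimal percentages are tolerated in case the installed
# check_disk version prints them (an assumption about plugin versions).
DISK_RE = re.compile(
    r"(?P<mount>/\S*) (?P<free_mb>\d+) Mi?B "
    r"\((?P<free_pct>\d+(?:\.\d+)?)% inode=(?P<inode_pct>\d+(?:\.\d+)?)%\)"
)


def parse_disk(message: str) -> list[dict]:
    """Return free space, free inodes and the scarcer resource per mountpoint.

    Hypothetical helper; both percentages are *free* amounts, so the smaller
    one identifies what is running out.
    """
    results = []
    for m in DISK_RE.finditer(message):
        free_pct = float(m["free_pct"])
        inode_pct = float(m["inode_pct"])
        results.append({
            "mount": m["mount"],
            "free_mb": int(m["free_mb"]),
            "free_pct": free_pct,
            "inode_free_pct": inode_pct,
            "low": "disk" if free_pct <= inode_pct else "inodes",
        })
    return results
```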
 
 
== Further Reading ==
 
* [[Infrastructure/SOP/Puppet |Puppet SOP]]
 
* [[Infrastructure/SOP/Outage |Outages SOP]]
 
  
 
[[Category:Infrastructure SOPs]]
 

Latest revision as of 18:38, 19 December 2011
