06 October 2012

Server Tip: How to monitor Servers using Nagios


Nagios Overview

Nagios is a powerful monitoring system that enables organizations to identify and resolve IT infrastructure problems before they affect critical business processes. This guide starts with installation and ends with monitoring routers and switches.

By using Nagios, you can:
  • Plan for infrastructure upgrades before outdated systems cause failures
  • Respond to issues at the first sign of a problem
  • Automatically fix problems when they are detected
  • Coordinate technical team responses
  • Ensure your organization's SLAs are being met
  • Ensure IT infrastructure outages have a minimal effect on your organization's bottom line
  • Monitor your entire infrastructure and business processes

How It Works

Monitoring

IT staff configure Nagios to monitor critical IT infrastructure components, including system metrics, network protocols, applications, services, servers, and network infrastructure.

Alerting

Nagios sends alerts when critical infrastructure components fail and recover, providing administrators with notice of important events. Alerts can be delivered via email, SMS, or custom script.

Response

IT staff can acknowledge alerts and begin resolving outages and investigating security alerts immediately. Alerts can be escalated to different groups if alerts are not acknowledged in a timely manner.

Reporting

Reports provide a historical record of outages, events, notifications, and alert response for later review. Availability reports help ensure your SLAs are being met.

Maintenance

Scheduled downtime prevents alerts during scheduled maintenance and upgrade windows.

Planning

Trending and capacity planning graphs and reports allow you to identify necessary infrastructure upgrades before failures occur.

Installation Guide

Introduction

This guide provides simple instructions for installing Nagios from source code on Fedora and having it monitor your local machine within 20 minutes. No advanced installation options are discussed here - just the basics that will work for 95% of users who want to get started.
These instructions were written based on a standard Red Hat Linux distribution.

What You'll End Up With

If you follow these instructions, here's what you'll end up with:
  • Nagios and the plugins will be installed underneath /usr/local/nagios
  • Nagios will be configured to monitor a few aspects of your local system (CPU load, disk usage, etc.)
  • The Nagios web interface will be accessible at http://localhost/nagios/

Prerequisites

During portions of the installation you'll need to have root access to your machine.
Make sure you've installed the following packages on your Fedora installation before continuing.
  • Apache
  • PHP
  • GCC compiler
  • GD development libraries
You can use yum to install these packages by running the following commands (as root):

root@sajan-desktop:/home/sajan# yum install httpd php

root@sajan-desktop:/home/sajan# yum install gcc glibc glibc-common

root@sajan-desktop:/home/sajan# yum install gd gd-devel
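Once the packages are installed, a quick check can confirm that the main binaries are on your PATH. This is only a rough sketch: missing_tools is a hypothetical helper name, and it assumes the packages above provide commands named httpd, php and gcc. The GD libraries have no command of their own, so they are not covered here.

```shell
# List which of the given commands are not on the PATH.
# (missing_tools is an illustrative helper, not part of Nagios.)
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Assumption: the packages above install binaries named httpd, php and gcc.
missing=$(missing_tools httpd php gcc)
if [ -n "$missing" ]; then
    echo "Not found on PATH:" $missing
else
    echo "All prerequisite commands found."
fi
```

Run it as root, since httpd usually lives in a directory that is only on root's PATH.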

1) Create Account Information

Become the root user.

sajan@sajan-desktop:~$ su -l

Create a new nagios user account and give it a password.

root@sajan-desktop:/home/sajan# /usr/sbin/useradd -m nagios

root@sajan-desktop:/home/sajan# passwd nagios

Create a new nagcmd group for allowing external commands to be submitted through the web interface. Add both the nagios user and the apache user to the group.

root@sajan-desktop:/home/sajan# /usr/sbin/groupadd nagcmd

root@sajan-desktop:/home/sajan# /usr/sbin/usermod -a -G nagcmd nagios

root@sajan-desktop:/home/sajan# /usr/sbin/usermod -a -G nagcmd apache
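To confirm that both accounts really ended up in the nagcmd group, something like the following can be used. It is a minimal sketch; in_group is a hypothetical helper name, and it relies only on the standard id command.

```shell
# Succeeds if USER is a member of GROUP according to the system's databases.
# Usage: in_group USER GROUP   (illustrative helper, not part of Nagios)
in_group() {
    id -nG "$1" 2>/dev/null | tr ' ' '\n' | grep -qxF "$2"
}

# Report membership for the two accounts added to nagcmd above.
for user in nagios apache; do
    if in_group "$user" nagcmd; then
        echo "$user is in nagcmd"
    else
        echo "$user is NOT in nagcmd"
    fi
done
```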

2) Download Nagios and the Plugins

Create a directory for storing the downloads.

root@sajan-desktop:/home/sajan# mkdir ~/downloads

root@sajan-desktop:/home/sajan# cd ~/downloads

Download the source code tarballs of both Nagios and the Nagios plugins (visit http://www.nagios.org/download/ for links to the latest versions). The commands below use Nagios 3.2.3 and Nagios Plugins 1.4.11.

root@sajan-desktop:/home/sajan# wget http://prdownloads.sourceforge.net/sourceforge/nagios/nagios-3.2.3.tar.gz

root@sajan-desktop:/home/sajan# wget http://prdownloads.sourceforge.net/sourceforge/nagiosplug/nagios-plugins-1.4.11.tar.gz

3) Compile and Install Nagios

Extract the Nagios source code tarball.

root@sajan-desktop:/home/sajan# cd ~/downloads

root@sajan-desktop:/home/sajan# tar xzf nagios-3.2.3.tar.gz

root@sajan-desktop:/home/sajan# cd nagios-3.2.3

Run the Nagios configure script, passing the name of the group you created earlier like so:

root@sajan-desktop:/home/sajan# ./configure --with-command-group=nagcmd

Compile the Nagios source code.

root@sajan-desktop:/home/sajan# make all

Install binaries, init script, sample config files and set permissions on the external command directory.

root@sajan-desktop:/home/sajan# make install

root@sajan-desktop:/home/sajan# make install-init

root@sajan-desktop:/home/sajan# make install-config

root@sajan-desktop:/home/sajan# make install-commandmode

Don't start Nagios yet - there's still more that needs to be done...

4) Customize Configuration

Sample configuration files have now been installed in the /usr/local/nagios/etc directory. These sample files should work fine for getting started with Nagios. You'll need to make just one change before you proceed...
Edit the /usr/local/nagios/etc/objects/contacts.cfg config file with your favorite editor and change the email address associated with the nagiosadmin contact definition to the address you'd like to use for receiving alerts.

root@sajan-desktop:/home/sajan# vi /usr/local/nagios/etc/objects/contacts.cfg
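If you would rather script the change than open an editor, the sketch below shows one way to do it. It assumes the contact definition contains a directive line whose first word is email, as the sample contacts.cfg is expected to; set_contact_email is a hypothetical helper name and admin@example.com is a placeholder address.

```shell
# Replace the value of the first "email" directive in a contacts file.
# Any trailing comment on that line is dropped.
# Usage: set_contact_email FILE ADDRESS   (illustrative helper)
set_contact_email() {
    file=$1
    addr=$2
    tmp=$(mktemp) || return 1
    awk -v addr="$addr" '
        !changed && $1 == "email" { print "        email   " addr; changed = 1; next }
        { print }
    ' "$file" > "$tmp" && cat "$tmp" > "$file"   # cat keeps the file's owner and mode
    status=$?
    rm -f "$tmp"
    return $status
}
```

For example, `set_contact_email /usr/local/nagios/etc/objects/contacts.cfg admin@example.com` would update the sample file. Keeping a backup copy first is a sensible precaution.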

5) Configure the Web Interface

Install the Nagios web config file in the Apache conf.d directory.

root@sajan-desktop:/home/sajan# make install-webconf

Create a nagiosadmin account for logging into the Nagios web interface. Remember the password you assign to this account - you'll need it later.

root@sajan-desktop:/home/sajan# htpasswd -c /usr/local/nagios/etc/htpasswd.users nagiosadmin

Restart Apache to make the new settings take effect.

root@sajan-desktop:/home/sajan# service httpd restart

Note: Consider implementing the enhanced CGI security measures described in the Nagios documentation to ensure that your web authentication credentials are not compromised.

6) Compile and Install the Nagios Plugins

Extract the Nagios plugins source code tarball.

root@sajan-desktop:/home/sajan# cd ~/downloads

root@sajan-desktop:/home/sajan# tar xzf nagios-plugins-1.4.11.tar.gz

root@sajan-desktop:/home/sajan# cd nagios-plugins-1.4.11

Compile and install the plugins.

root@sajan-desktop:/home/sajan# ./configure --with-nagios-user=nagios --with-nagios-group=nagios

root@sajan-desktop:/home/sajan# make

root@sajan-desktop:/home/sajan# make install

7) Start Nagios

Add Nagios to the list of system services and have it automatically start when the system boots.

root@sajan-desktop:/home/sajan# chkconfig --add nagios

root@sajan-desktop:/home/sajan# chkconfig nagios on

Verify the sample Nagios configuration files.

root@sajan-desktop:/home/sajan# /usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg

If there are no errors, start Nagios.

root@sajan-desktop:/home/sajan# service nagios start

8) Modify SELinux Settings

Fedora ships with SELinux (Security Enhanced Linux) installed and in Enforcing mode by default. This can result in "Internal Server Error" messages when you attempt to access the Nagios CGIs.
See if SELinux is in Enforcing mode.

root@sajan-desktop:/home/sajan# getenforce

Put SELinux into Permissive mode.

root@sajan-desktop:/home/sajan# setenforce 0

To make this change permanent, you'll have to modify the settings in /etc/selinux/config and reboot.
Instead of disabling SELinux or setting it to permissive mode, you can use the following command to run the CGIs under SELinux enforcing/targeted mode:

root@sajan-desktop:/home/sajan# chcon -R -t httpd_sys_content_t /usr/local/nagios/sbin/

root@sajan-desktop:/home/sajan# chcon -R -t httpd_sys_content_t /usr/local/nagios/share/

For information on running the Nagios CGIs under Enforcing mode with a targeted policy, visit the Nagios Support Portal or Nagios Community Wiki.

9) Login to the Web Interface

You should now be able to access the Nagios web interface at the URL below. You'll be prompted for the username (nagiosadmin) and password you specified earlier.

http://localhost/nagios/

Click on the "Service Detail" navbar link to see details of what's being monitored on your local machine. It will take a few minutes for Nagios to check all the services associated with your machine, as the checks are spread out over time.
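If the page does not load, the HTTP status code usually hints at which step to revisit. The sketch below is only a rough troubleshooting aid: explain_status and nagios_status are hypothetical helper names, the mapping from codes to causes is a guess based on the steps in this guide, and PASSWORD stands for whatever you set with htpasswd.

```shell
# Map an HTTP status code to a likely cause, based on the steps in this guide.
# Usage: explain_status CODE   (illustrative helper)
explain_status() {
    case $1 in
        200) echo "OK: the web interface answered" ;;
        401) echo "Unauthorized: check the nagiosadmin password set with htpasswd" ;;
        404) echo "Not found: was 'make install-webconf' run and Apache restarted?" ;;
        500) echo "Internal Server Error: see the SELinux step" ;;
        000) echo "No response: is Apache running?" ;;
        *)   echo "Unexpected status $1" ;;
    esac
}

# Fetch only the status code of a URL using HTTP basic authentication.
# curl reports 000 when it cannot connect at all.
# Usage: nagios_status URL USER PASSWORD
nagios_status() {
    curl -s -o /dev/null -w '%{http_code}' -u "$2:$3" "$1"
}
```

A typical call would be `explain_status "$(nagios_status http://localhost/nagios/ nagiosadmin PASSWORD)"`.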

10) Other Modifications

Make sure your machine's firewall rules are configured to allow access to the web server if you want to access the Nagios interface remotely.
Configuring email notifications is outside the scope of this guide. While Nagios is currently configured to send you email notifications, your system may not yet have a mail program properly installed or configured. Refer to your system documentation, search the web, or look to the Nagios Support Portal or Nagios Community Wiki for specific instructions on configuring your system to send email messages to external addresses. More information on notifications can be found in the Nagios documentation.

11) You're Done

Congratulations! You successfully installed Nagios. Your journey into monitoring is just beginning. 

Monitoring Linux Server

Introduction

This document describes how you can monitor "private" services and attributes of Linux/UNIX servers, such as:
  • CPU load
  • Memory usage
  • Disk usage
  • Logged in users
  • Running processes
  • etc.
Publicly available services that are provided by Linux servers (HTTP, FTP, SSH, SMTP, etc.) can be monitored easily by following the documentation on monitoring publicly available services.
Note: These instructions assume that you've installed Nagios according to the quickstart guide. The sample configuration entries below reference objects that are defined in the sample config files (commands.cfg, templates.cfg, etc.) that are installed if you follow the quickstart.

Overview

[Note: This document has not been completed. I would recommend you read the documentation on the NRPE addon for instructions on how to monitor a remote Linux/Unix server.]

There are several different ways to monitor attributes of remote Linux/Unix servers. One is to use shared SSH keys and the check_by_ssh plugin to execute plugins on remote servers. This method is not covered here. Be aware that it can put a high load on your monitoring server if you are monitoring hundreds or thousands of services, because of the overhead of setting up and tearing down SSH connections.
NRPE
Another common method of monitoring remote Linux/Unix hosts is to use the NRPE addon. NRPE allows you to execute plugins on remote Linux/Unix hosts. This is useful if you need to monitor local resources/attributes like disk usage, CPU load, memory usage, etc. on a remote host.

Monitoring Router / Switch

Introduction

This document describes how you can monitor the status of network switches and routers. Some cheaper "unmanaged" switches and hubs don't have IP addresses and are essentially invisible on your network, so there's not any way to monitor them. More expensive switches and routers have addresses assigned to them and can be monitored by pinging them or using SNMP to query status information.
I'll describe how you can monitor the following things on managed switches, hubs, and routers:
  • Packet loss, round trip average
  • SNMP status information
  • Bandwidth / traffic rate
Note: These instructions assume that you've installed Nagios according to the quickstart guide. The sample configuration entries below reference objects that are defined in the sample config files (commands.cfg, templates.cfg, etc.) that are installed when you follow the quickstart.

Overview

Monitoring a Router or Switch
Monitoring switches and routers can either be easy or more involved - depending on what equipment you have and what you want to monitor. As they are critical infrastructure components, you'll no doubt want to monitor them in at least some basic manner.
Switches and routers can be monitored easily by "pinging" them to determine packet loss, RTA, etc. If your switch supports SNMP, you can monitor port status, etc. with the check_snmp plugin and bandwidth (if you're using MRTG) with the check_mrtgtraf plugin.
The check_snmp plugin will only get compiled and installed if you have the net-snmp and net-snmp-utils packages installed on your system. Make sure the plugin exists in /usr/local/nagios/libexec before you continue. If it doesn't, install net-snmp and net-snmp-utils and recompile/reinstall the Nagios plugins.

Steps

There are several steps you'll need to follow in order to monitor a new router or switch. They are:
  1. Perform first-time prerequisites
  2. Create new host and service definitions for monitoring the device
  3. Restart the Nagios daemon

What's Already Done For You

To make your life a bit easier, a few configuration tasks have already been done for you:
  • Two command definitions (check_snmp and check_local_mrtgtraf) have been added to the commands.cfg file. These allow you to use the check_snmp and check_mrtgtraf plugins to monitor network routers.
  • A switch host template (called generic-switch) has already been created in the templates.cfg file. This allows you to add new router/switch host definitions in a simple manner.
The above-mentioned config files can be found in the /usr/local/nagios/etc/objects/ directory. You can modify the definitions in these and other config files to suit your needs better if you'd like. However, I'd recommend waiting until you're more familiar with configuring Nagios before doing so. For the time being, just follow the directions outlined below and you'll be monitoring your network routers/switches in no time.

Prerequisites

The first time you configure Nagios to monitor a network switch, you'll need to do a bit of extra work. Remember, you only need to do this for the *first* switch you monitor.
Edit the main Nagios config file.

root@sajan-desktop:/home/sajan# vi /usr/local/nagios/etc/nagios.cfg

Remove the leading pound (#) sign from the following line in the main configuration file:
#cfg_file=/usr/local/nagios/etc/objects/switch.cfg

Save the file and exit.

What did you just do? You told Nagios to look to the /usr/local/nagios/etc/objects/switch.cfg to find additional object definitions. That's where you'll be adding host and service definitions for routers and switches. That configuration file already contains some sample host, hostgroup, and service definitions. For the *first* router/switch you monitor, you can simply modify the sample host and service definitions in that file, rather than creating new ones.
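The same edit can be made without opening an editor. The following is a minimal sketch, assuming the commented-out line looks exactly like the one shown above; enable_switch_cfg is a hypothetical helper name.

```shell
# Uncomment the switch.cfg line in a Nagios main config file.
# Other commented cfg_file lines are left untouched.
# Usage: enable_switch_cfg FILE   (illustrative helper)
enable_switch_cfg() {
    file=$1
    tmp=$(mktemp) || return 1
    sed 's|^#\(cfg_file=.*/switch\.cfg\)[[:space:]]*$|\1|' "$file" > "$tmp" \
        && cat "$tmp" > "$file"   # cat keeps the file's owner and mode
    status=$?
    rm -f "$tmp"
    return $status
}
```

You would call it as `enable_switch_cfg /usr/local/nagios/etc/nagios.cfg`, ideally after making a backup copy of the file.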

Configuring Nagios

You'll need to create some object definitions in order to monitor a new router/switch.
Open the switch.cfg file for editing.

root@sajan-desktop:/home/sajan# vi /usr/local/nagios/etc/objects/switch.cfg

Add a new host definition for the switch that you're going to monitor. If this is the *first* switch you're monitoring, you can simply modify the sample host definition in switch.cfg. Change the host_name, alias, and address fields to appropriate values for the switch.
define host{
  use         generic-switch          ; Inherit default values from a template
  host_name   linksys-srw224p         ; The name we're giving to this switch
  alias       Linksys SRW224P Switch  ; A longer name associated with the switch
  address     192.168.1.253           ; IP address of the switch
  hostgroups  allhosts,switches       ; Host groups this switch is associated with
}

Monitoring Services

Now you can add some service definitions (to the same configuration file) to monitor different aspects of the switch. If this is the *first* switch you're monitoring, you can simply modify the sample service definition in switch.cfg.
Note: Replace "linksys-srw224p" in the example definitions below with the name you specified in the host_name directive of the host definition you just added.

Monitoring Packet Loss and RTA

Add the following service definition in order to monitor packet loss and round trip average between the Nagios host and the switch every 5 minutes under normal conditions.
define service{
   use                    generic-service                 ; Inherit values from a template
   host_name              linksys-srw224p                 ; The name of the host the service is associated with
   service_description    PING                            ; The service description
   check_command          check_ping!200.0,20%!600.0,60%  ; The command used to monitor the service
   normal_check_interval  5                               ; Check the service every 5 minutes under normal conditions
   retry_check_interval   1                               ; Re-check the service every minute until its final/hard state is determined
}

This service will be:
  • CRITICAL if the round trip average (RTA) is greater than 600 milliseconds or the packet loss is 60% or more
  • WARNING if the RTA is greater than 200 ms or the packet loss is 20% or more
  • OK if the RTA is less than 200 ms and the packet loss is less than 20%
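
The three states listed above can be mirrored in a few lines of shell, which may help when choosing thresholds. This is only a sketch of the decision logic as described in the list, not the check_ping plugin itself; ping_state is a hypothetical helper name, and the plugin may treat values that sit exactly on a threshold differently.

```shell
# Classify a ping result using the thresholds from the PING service above
# (warning 200.0 ms / 20%, critical 600.0 ms / 60%).
# Usage: ping_state RTA_MS LOSS_PERCENT   (illustrative helper)
ping_state() {
    awk -v rta="$1" -v loss="$2" 'BEGIN {
        if (rta > 600 || loss >= 60)      print "CRITICAL"
        else if (rta > 200 || loss >= 20) print "WARNING"
        else                              print "OK"
    }'
}
```

For instance, `ping_state 250 0` reports WARNING because the round trip average is above 200 ms even though no packets were lost.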
Monitoring SNMP Status Information

If your switch or router supports SNMP, you can monitor a lot of information by using the check_snmp plugin. If it doesn't, skip this section.
Add the following service definition to monitor the uptime of the switch.
define service{
 use                  generic-service ; Inherit values from a template
 host_name            linksys-srw224p
 service_description  Uptime
 check_command        check_snmp!-C public -o sysUpTime.0
}

In the check_command directive of the service definition above, the "-C public" tells the plugin that the SNMP community name to be used is "public" and the "-o sysUpTime.0" indicates which OID should be checked.
If you want to ensure that a specific port/interface on the switch is in an up state, you could add a service definition like this:
define service{
 use                  generic-service ; Inherit values from a template
 host_name            linksys-srw224p
 service_description  Port 1 Link Status
 check_command        check_snmp!-C public -o ifOperStatus.1 -r 1 -m RFC1213-MIB
}

In the example above, the "-o ifOperStatus.1" refers to the OID for the operational status of port 1 on the switch. The "-r 1" option tells the check_snmp plugin to return an OK state if "1" is found in the SNMP result (1 indicates an "up" state on the port) and CRITICAL if it isn't found. The "-m RFC1213-MIB" is optional and tells the check_snmp plugin to only load the "RFC1213-MIB" instead of every single MIB that's installed on your system, which can help speed things up.
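
Before relying on a new SNMP service, it can help to run the plugin by hand with the same arguments. The sketch below prints the command line that the service above would roughly correspond to. It assumes the sample check_snmp command definition passes the host address with -H followed by the service's argument string, and that the plugins live in /usr/local/nagios/libexec; snmp_cmdline is a hypothetical helper name.

```shell
# Plugin directory used by this guide's install layout.
PLUGIN_DIR=${PLUGIN_DIR:-/usr/local/nagios/libexec}

# Print the command line that a "check_snmp!ARGS" service would run for HOST.
# Assumption: the command definition is "check_snmp -H <address> <ARGS>".
# Usage: snmp_cmdline HOST 'ARGS'   (illustrative helper)
snmp_cmdline() {
    printf '%s/check_snmp -H %s %s\n' "$PLUGIN_DIR" "$1" "$2"
}

snmp_cmdline 192.168.1.253 '-C public -o ifOperStatus.1 -r 1 -m RFC1213-MIB'
```

Copy the printed line into a root shell to see what the plugin returns for your own switch.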

That's it for the SNMP monitoring example. There are a million things that can be monitored via SNMP, so it's up to you to decide what you need and want to monitor. Good luck!
Tip: You can usually find the OIDs that can be monitored on a switch by running the following command (replace 192.168.1.253 with the IP address of the switch):

root@sajan-desktop:/home/sajan# snmpwalk -v1 -c public 192.168.1.253 -m ALL .1

Monitoring Bandwidth / Traffic Rate

If you're monitoring bandwidth usage on your switches or routers using MRTG, you can have Nagios alert you when traffic rates exceed thresholds you specify. The check_mrtgtraf plugin (which is included in the Nagios plugins distribution) allows you to do this.

You'll need to let the check_mrtgtraf plugin know what log file the MRTG data is being stored in, along with thresholds, etc. In my example, I'm monitoring one of the ports on a Linksys switch. The MRTG log file is stored in /var/lib/mrtg/192.168.1.253_1.log. Here's the service definition I use to monitor the bandwidth data that's stored in the log file...

define service{
 use                  generic-service ; Inherit values from a template
 host_name            linksys-srw224p
 service_description  Port 1 Bandwidth Usage
 check_command        check_local_mrtgtraf!/var/lib/mrtg/192.168.1.253_1.log!AVG!1000000,2000000!5000000,5000000!10
}

In the example above, the "/var/lib/mrtg/192.168.1.253_1.log" option that gets passed to the check_local_mrtgtraf command tells the plugin which MRTG log file to read from. The "AVG" option tells it to use average bandwidth statistics. The "1000000,2000000" option gives the warning thresholds (in bytes) for incoming and outgoing traffic rates, respectively. The "5000000,5000000" option gives the critical thresholds (in bytes) for incoming and outgoing traffic rates. The "10" option causes the plugin to return a CRITICAL state if the MRTG log file is older than 10 minutes (it should be updated every 5 minutes).
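
To see how the "!"-separated values in that check_command line up with the plugin's options, the string can be split apart as sketched below. This assumes the sample check_local_mrtgtraf command definition passes its five arguments to check_mrtgtraf as -F, -a, -w, -c and -e, in that order; mrtgtraf_args is a hypothetical helper name.

```shell
# Split a check_local_mrtgtraf check_command into the plugin options it maps to.
# Assumption: ARG1..ARG5 are passed as -F, -a, -w, -c and -e respectively.
# Usage: mrtgtraf_args 'check_local_mrtgtraf!LOG!AVG!WARN!CRIT!EXPIRE'
mrtgtraf_args() {
    old_ifs=$IFS
    IFS='!'
    set -- $1          # split on "!": $1 is the command name, $2..$6 the args
    IFS=$old_ifs
    printf -- '-F %s -a %s -w %s -c %s -e %s\n' "$2" "$3" "$4" "$5" "$6"
}

mrtgtraf_args 'check_local_mrtgtraf!/var/lib/mrtg/192.168.1.253_1.log!AVG!1000000,2000000!5000000,5000000!10'
```

Each threshold option takes a pair of values, one for incoming and one for outgoing traffic.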
Save the file.

Restarting Nagios

Once you've added the new host and service definitions to the switch.cfg file, you're ready to start monitoring the router/switch. To do this, you'll need to verify your configuration and restart Nagios.

If the verification process produces any error messages, fix your configuration file before continuing. Make sure that you don't (re)start Nagios until the verification process completes without any errors!
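
One way to make sure a broken configuration never gets loaded is to tie the two steps together, as in the sketch below. It reuses the verification command from the installation section and assumes the init script installed earlier accepts "restart"; safe_restart is a hypothetical helper name.

```shell
# Paths follow the install layout used earlier in this guide.
NAGIOS_BIN=${NAGIOS_BIN:-/usr/local/nagios/bin/nagios}
NAGIOS_CFG=${NAGIOS_CFG:-/usr/local/nagios/etc/nagios.cfg}

# Restart Nagios only if the configuration check passes.
# Usage: safe_restart   (illustrative helper, run as root)
safe_restart() {
    if "$NAGIOS_BIN" -v "$NAGIOS_CFG"; then
        service nagios restart
    else
        echo "Configuration errors found; Nagios was not restarted." >&2
        return 1
    fi
}
```

Calling `safe_restart` as root prints the verification output first, so any errors are visible before anything is restarted.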

That's it! Cheers.
