Monday, December 10, 2012

Nagios Eventhandler


In my last post I wrote about how to register a remote process in a Nagios monitoring environment. This time I'd like to write about registering an action. When a process is stopped or killed unexpectedly, we want to trigger an action, such as restarting the killed process. In Nagios, such an action is called an 'eventhandler'.

As I mentioned previously, my VMs kept shutting down, so I created services to monitor whether they were dead or alive. Now it is time to create an eventhandler for the actions. There are some important points to remember when configuring the eventhandler.

   - The 'max_check_attempts' property in the Nagios configuration
   - StateType: SOFT STATE vs HARD STATE

Nagios uses these two points to decide when to call the eventhandler and when to notify the administrator. When the state of a monitored service changes to 'Critical', Nagios calls the eventhandler on each failed check but sends no notification until the number of check attempts reaches max_check_attempts (the service is in a SOFT state). Once the number of check attempts reaches max_check_attempts, Nagios notifies the contacts of the service failure (the service is now in a HARD state).

The following link describes state types well: http://nagios.sourceforge.net/docs/3_0/statetypes.html
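
The SOFT to HARD transition described above can be sketched in a few lines of shell. This is only an illustration, not Nagios code: state_type is a hypothetical helper, and the value 3 for max_check_attempts is the one used later in this post.

```sh
#!/bin/sh
# Sketch: how a failing service moves from SOFT to HARD.
# state_type is a hypothetical helper, not part of Nagios.

# state_type ATTEMPT MAX_CHECK_ATTEMPTS
# Prints SOFT while retries remain, HARD once the limit is reached.
state_type() {
    if [ "$1" -lt "$2" ]; then
        echo SOFT
    else
        echo HARD
    fi
}

max_check_attempts=3
attempt=1
while [ "$attempt" -le "$max_check_attempts" ]; do
    echo "CRITICAL attempt $attempt/$max_check_attempts -> $(state_type "$attempt" "$max_check_attempts")"
    attempt=$((attempt + 1))
done
```

With max_check_attempts set to 3, the first two failed checks are SOFT and the third one is HARD, which is when contacts get notified.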

Registering the eventhandler

1) Writing a script
$ vi /usr/lib64/nagios/plugins/eventhandlers/restart-vm

#!/bin/sh
#
# Event handler script for restarting a Xen VM on the local machine
#
# Note: This script will only restart the VM if the service check has
#       failed 2 times (in a "soft" state) or if the service somehow
#       manages to fall into a "hard" error state.
#

date=`date`

case "$1" in
OK)
 # The service just came back up, so don't do anything...
 ;;
WARNING)
 # We don't really care about warning states, since the service is probably still running...
 ;;
UNKNOWN)
 # We don't know what might be causing an unknown error, so don't do anything...
 ;;
CRITICAL)
 # Aha!  The VM service appears to have a problem - perhaps we should restart the VM...

 # Is this a "soft" or a "hard" state?
 case "$2" in

 # We're in a "soft" state, meaning that Nagios is in the middle of retrying the
 # check before it turns into a "hard" state and contacts get notified...
 SOFT)

  # What check attempt are we on?  We don't want to restart the VM on the first
  # check, because it may just be a fluke!
  case "$3" in

  # Wait until the check has been tried 2 times before restarting the VM.
  # If the check fails on the 3rd time (after we restart the VM), the state
  # type will turn to "hard" and contacts will be notified of the problem.
  # Hopefully this will restart the VM successfully, so the 3rd check will
  # result in a "soft" recovery.  If that happens no one gets notified because we
  # fixed the problem!
  2)
   echo -n "Restarting the VM service (2nd soft critical state)..."
   # Start the VM with xm; $4 is the service description, used as the domain name
   /usr/bin/sudo /usr/sbin/xm start "$4"
   echo "$date - restart $4 - SOFT"  >> /tmp/eventhandlers
   ;;
  esac
  ;;

 # The VM service somehow managed to turn into a hard error without getting fixed.
 # It should have been restarted by the code above, but for some reason it wasn't.
 # Let's give it one last try, shall we?
 # Note: Contacts have already been notified of a problem with the service at this
 # point (unless you disabled notifications for this service)
 HARD)
  case "$3" in

  # With max_check_attempts set to 3, the hard state is reached on the 3rd check.
  3)
   echo -n "Restarting the VM service..."
   # Start the VM with xm; $4 is the service description, used as the domain name
   echo "$date - restart $4 - HARD"  >> /tmp/eventhandlers
   /usr/bin/sudo /usr/sbin/xm start "$4"
   ;;
  esac
  ;;
 esac
 ;;
esac
exit 0

I wrote the script so that Nagios calls it while my VM is in the 'critical' state. I set max_check_attempts to 3 in the config file and defined the script to start the VM on the second and the third consecutive failure.
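
To see when the handler would act without touching a VM, the decision can be pulled out into a small dry-run function. This is a minimal sketch: handler_action is a hypothetical helper, and it assumes max_check_attempts is 3 with restarts on the second and third consecutive failure, as described above.

```sh
#!/bin/sh
# Dry-run sketch of the handler's decision. handler_action is a
# hypothetical helper; it only prints what would happen.
# Assumes max_check_attempts is 3.

# handler_action STATE STATETYPE ATTEMPT
handler_action() {
    case "$1:$2:$3" in
        CRITICAL:SOFT:2) echo restart ;;  # second consecutive failure
        CRITICAL:HARD:3) echo restart ;;  # third failure, state is now HARD
        *)               echo skip ;;     # OK, WARNING, UNKNOWN, first failure
    esac
}

for args in "CRITICAL SOFT 1" "CRITICAL SOFT 2" "CRITICAL HARD 3" "OK SOFT 3"; do
    # Intentional word splitting: each item holds three arguments.
    # shellcheck disable=SC2086
    echo "$args -> $(handler_action $args)"
done
```

The real script can be exercised the same way by running it by hand with the four arguments Nagios would pass, but note that it then really calls xm.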

2) Next, give the script execution permission and add a command and a service to Nagios:
# give execution permission to the script
$ chmod a+x /usr/lib64/nagios/plugins/eventhandlers/restart-vm

# define the eventhandler as a new command
$ vi /etc/nagios/objects/commands.cfg
define command{
command_name    restart-vm
command_line    /usr/lib64/nagios/plugins/eventhandlers/restart-vm $SERVICESTATE$ $SERVICESTATETYPE$ $SERVICEATTEMPT$ $SERVICEDESC$
}

# set the command to a specific service
$ vi /etc/nagios/objects/services.cfg
define service{
use generic-service
host_name localhost
hostgroup_name
service_description ad
active_checks_enabled 1
passive_checks_enabled 1
check_command check_local_procs!1:1!qemu-dm!'domain-name ad'
event_handler restart-vm
}

# Nagios always needs its config validated and a restart after config files are modified.
$ nagios -v /etc/nagios/nagios.cfg
$ service nagios restart

When I looked at the nagios.debug file (/var/log/nagios/nagios.debug), it showed entries like:
[1354857538.204205] [256.1] [pid=6862] Running command '/usr/lib64/nagios/plugins/eventhandlers/restart_vm CRITICAL SOFT 1 ad'...
[1354857538.251733] [256.1] [pid=6862] Execution time=0.047 sec, early timeout=0, result=0, output=(null)

[1354857548.035316] [256.1] [pid=6862] Running command '/usr/lib64/nagios/plugins/eventhandlers/restart_vm CRITICAL SOFT 2 ad'...

[1354857550.770515] [256.1] [pid=6862] Execution time=2.734 sec, early timeout=0, result=0, output=Restarting the VM service (2nd soft critical state)...Fri Dec  7 14:19:08 KST 2012 - restart ad - SOFT
[1354857558.040135] [256.1] [pid=6862] Running command '/usr/lib64/nagios/plugins/eventhandlers/restart_vm OK SOFT 3 ad'...
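
To pull just the handler invocations out of a busy debug log, something like the following may help. It is a sketch that assumes the log lines look like the ones shown above; list_handler_runs is a hypothetical helper name.

```sh
#!/bin/sh
# Sketch: list event handler runs from the Nagios debug log.
# Assumes log lines in the format shown above; list_handler_runs
# is a hypothetical helper, not a Nagios tool.

# list_handler_runs LOGFILE
# Prints "epoch state statetype attempt service" for each handler run.
list_handler_runs() {
    sed -n "s/^\[\([0-9]*\)\.[0-9]*\].*Running command '[^ ]*restart[-_]vm \(.*\)'\.\.\.\$/\1 \2/p" "$1"
}

# Default to the debug log path used in this post.
log=${1:-/var/log/nagios/nagios.debug}
if [ -r "$log" ]; then
    list_handler_runs "$log"
fi
```

Each printed line starts with the epoch timestamp of the run, followed by the arguments Nagios passed to the handler.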

Troubleshooting
If the script doesn't work, check whether execution permission is given to the script file.
$ ls -al /usr/lib64/nagios/plugins/eventhandlers/restart-vm
-rwxr-xr-x 1 root root 3032 Dec  7 14:05 /usr/lib64/nagios/plugins/eventhandlers/restart-vm
Next, check /etc/sudoers. In my case, the user "nagios" is the default user of the Nagios monitoring server, so I gave it permission to run sudo without a password and added Defaults:nagios !requiretty.
$ vi /etc/sudoers
Defaults:nagios !requiretty
nagios          ALL=(ALL)       NOPASSWD: ALL

References:
1. http://www.techadre.com/content/nagios-event-handler-restarting-local-service
2. http://forums.meulie.net/viewtopic.php?f=59&t=5918
