Getting rid of “false” notifications from Nagios

July 28, 2013

As you probably know, Nagios has excellent support for managing scheduled downtimes for hosts and services. That said, the normal way of managing downtimes is through the web UI, and while this might be a good step in a change management process, there are lots of cases where it isn’t the easiest way to go.

There might be different teams interacting with hosts and applications, and forcing everyone to use the Nagios UI to schedule downtimes can slow down most activities. This is especially true when dealing with non-production environments, where configuration changes and stop/start activities are the norm.

The thing here is to distinguish an interactive action (host reboot, service stop, etc.) from an unplanned event (service crash, kernel panic, etc.). I came across a cool script, nagalert, that solves this “problem”.

In short, it’s a bash script that interacts with Nagios’ downtime CGI, scheduling a downtime if invoked with “stop” and getting rid of the previously scheduled downtime if invoked with “start”.
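To give an idea of what talking to the downtime CGI involves, here is a minimal sketch of the “stop” half. It is not the script itself: the URL, the credentials and the function names are placeholders I made up, and the timestamp format assumes Nagios is configured with date_format=us. The form fields (cmd_typ=55 for a host downtime) are the ones cmd.cgi uses for a fixed host downtime.

```shell
#!/bin/sh
# Placeholder values, not taken from the original script.
NAGIOS_CMD_URL="${NAGIOS_CMD_URL:-https://nagios.example.com/nagios/cgi-bin/cmd.cgi}"
NAGIOS_USER="${NAGIOS_USER:-downtime-bot}"
NAGIOS_PASS="${NAGIOS_PASS:-changeme}"

# Print the epoch second at which a downtime starting at $1
# and lasting $2 minutes ends.
downtime_end_epoch() {
    echo $(( $1 + $2 * 60 ))
}

# Format an epoch second as MM-DD-YYYY HH:MM:SS (date_format=us).
# Uses GNU date syntax, which is what RHEL ships.
nagios_time() {
    date -d "@$1" '+%m-%d-%Y %H:%M:%S'
}

# Schedule a fixed downtime for host $1 lasting $2 minutes (default 10).
schedule_host_downtime() {
    host=$1
    minutes=${2:-10}
    start=$(date +%s)
    end=$(downtime_end_epoch "$start" "$minutes")
    curl --silent --show-error --fail \
        --user "${NAGIOS_USER}:${NAGIOS_PASS}" \
        --data 'cmd_typ=55' \
        --data 'cmd_mod=2' \
        --data 'trigger=0' \
        --data 'fixed=1' \
        --data-urlencode "host=${host}" \
        --data-urlencode "com_data=Planned stop triggered by $(id -un)" \
        --data-urlencode "start_time=$(nagios_time "$start")" \
        --data-urlencode "end_time=$(nagios_time "$end")" \
        --data 'btnSubmit=Commit' \
        "$NAGIOS_CMD_URL" >/dev/null
}
```

Calling schedule_host_downtime "$(hostname)" 10 would then request a 10-minute downtime, provided the local hostname matches the host_name Nagios knows. Cancelling is a bit more work, since the delete command needs the id of the downtime to remove, so I left it out of the sketch.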

The first thing I did was write a simple Puppet module to push the init script to my Linux servers and make the service start at boot and stop on reboot/shutdown/halt. Since the script isn’t prepackaged (rpm), I needed to add a “chkconfig --add” resource to the Puppet module: the standard “service” resource in Puppet only triggers “chkconfig on” and “chkconfig off”, which leaves the system without the correct symlinks in /etc/rc.d/rc*.d.
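In shell terms, the extra resource boils down to something like the sketch below: register the service only if chkconfig doesn’t know it yet, so the step stays idempotent across Puppet runs. The function name and the CHKCONFIG variable are mine, and I’m assuming “chkconfig --list name” exits non-zero for a service that hasn’t been added, which is what I see on RHEL.

```shell
#!/bin/sh
# Overridable so the logic can be tried without touching the real tool.
CHKCONFIG="${CHKCONFIG:-chkconfig}"

# Register init script $1 with chkconfig unless it is already registered.
ensure_registered() {
    service=$1
    if "$CHKCONFIG" --list "$service" >/dev/null 2>&1; then
        echo "already registered"
    else
        "$CHKCONFIG" --add "$service" && echo "registered"
    fi
}
```

Running ensure_registered nagalert once after the init script lands in /etc/init.d is enough; “chkconfig on” and “chkconfig off” can then be left to the service resource.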

Then I had to make minor changes to the script itself (aside from setting up a dedicated user/password on the Nagios side): the chkconfig runlevels and the priorities (99 for start, 1 for stop, so that it starts after all system services and stops early). Also, RHEL systems require the init script to touch the /var/lock/subsys/{servicename} file on start, or the service shutdown won’t be triggered on rc6 or rc0.
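Put together, the changes look roughly like this skeleton. It is a sketch, not the modified script: the runlevel list (2345) is an assumption, the two downtime functions are placeholders for the real CGI calls, and LOCKFILE is only made overridable so the thing can be tried outside /var/lock.

```shell
#!/bin/sh
# chkconfig: 2345 99 01
# description: Schedules or cancels a Nagios downtime around planned stops.
# 99/01 are the start/stop priorities; the runlevel list is an assumption.

LOCKFILE="${LOCKFILE:-/var/lock/subsys/nagalert}"

# Placeholders for the real calls to the Nagios downtime CGI.
cancel_downtime()   { echo "cancelling downtime"; }
schedule_downtime() { echo "scheduling downtime"; }

dispatch() {
    case "$1" in
        start)
            cancel_downtime
            # Without this file RHEL's rc script skips our stop action
            # when switching to runlevel 0 or 6.
            touch "$LOCKFILE"
            ;;
        stop)
            schedule_downtime
            rm -f "$LOCKFILE"
            ;;
        *)
            echo "Usage: $0 {start|stop}" >&2
            return 1
            ;;
    esac
}

# Only dispatch when called with an argument, as init does.
if [ $# -gt 0 ]; then
    dispatch "$@"
fi
```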

Voilà. Now, here’s what happens on interactive HOST actions:

– on shutdown/reboot, the nagalert service “stop” action schedules a 10-minute downtime on Nagios, preventing it from sending notifications.

– on boot, the nagalert service “start” action cancels the previously scheduled downtime, putting things back to normal.

The same thing can be applied to normal services, just by triggering the “stop” and “start” actions from the services’ stop and start scripts respectively. One common use case in my environment involves developers accessing dev/test servers and needing to shut down or restart application server instances. This way, I can track who triggered the stop/start action and when, without getting annoyed by false alarms, while still retaining the actual state of the service. REAL incidents (kernel panics, application crashes, etc.) are the ones that will still make Nagios fire its notifications, since in those cases the stop/start scripts won’t be invoked. BTW, the nagalert script already accepts one extra parameter (duration) that changes the default 10-minute behaviour.
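For an application server, the wrapper can be as small as the sketch below. The paths and function names are hypothetical, and I’m assuming the duration is passed in minutes as the argument right after “stop”; check the script’s usage before relying on that.

```shell
#!/bin/sh
# Hypothetical locations of the downtime script and the application's
# own control script.
NAGALERT="${NAGALERT:-/etc/init.d/nagalert}"
APP_CTL="${APP_CTL:-/opt/myapp/bin/appctl}"

# Schedule the downtime first, then stop the application.
# $1 is the downtime duration in minutes (default 10).
stop_app() {
    "$NAGALERT" stop "${1:-10}" && "$APP_CTL" stop
}

# Start the application first, then cancel the downtime.
start_app() {
    "$APP_CTL" start && "$NAGALERT" start
}
```

The order matters: the downtime has to be in place before the service goes down, and it should only be cancelled once the service is back up.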

As soon as I have some spare hours to spend on this, I’d like to write a PowerShell version of the script, to solve the same issue on Windows-based hosts.



