As you probably know, Nagios has excellent support for managing scheduled downtimes for hosts and services. That said, the normal way of managing downtimes is through the web UI, and while this might be a good step in a change management process, there are plenty of cases where it is not the easiest way to go.

There might be different teams interacting with hosts and applications, and forcing each and every one of them to use the Nagios UI to schedule downtimes can slow down most activities. This is especially true when dealing with non-production environments, where configuration changes and stop/start activities are the norm.

The key here is to distinguish an interactive action (host reboot, service stop, etc.) from an unplanned event (service crash, kernel panic, etc.). I came across this cool script:

Which solves this “problem”. In short, it’s a bash script that interacts with Nagios’ downtime CGI, scheduling a downtime if invoked with “stop” and removing the previously scheduled downtime if invoked with “start”.
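For anyone who hasn't looked at the downtime CGI before, here is a minimal sketch of the kind of request such a script could send. This is not the script mentioned above: the function name downtime_fields, the host web01, and the user, password and URL in the usage comment are all made up. It assumes GNU date, Nagios Core's cmd.cgi with command type 55 (schedule host downtime), and the default US date format on the Nagios side.

```shell
#!/bin/bash
# Sketch only: print the cmd.cgi form fields for a fixed host downtime.
# Assumptions: GNU date, cmd_typ=55 (SCHEDULE_HOST_DOWNTIME), and the
# default "us" date_format (MM-DD-YYYY HH:MM:SS) on the Nagios server.
downtime_fields() {
    local host="$1" start_epoch="$2" minutes="${3:-10}"
    local end_epoch=$(( start_epoch + minutes * 60 ))
    local fmt='+%m-%d-%Y %H:%M:%S'

    echo "cmd_typ=55"
    echo "cmd_mod=2"
    echo "host=${host}"
    echo "com_data=planned stop"
    echo "trigger=0"
    echo "fixed=1"
    echo "start_time=$(date -d "@${start_epoch}" "$fmt")"
    echo "end_time=$(date -d "@${end_epoch}" "$fmt")"
    echo "btnSubmit=Commit"
}

# Hypothetical usage: send every field, URL-encoded, to the CGI.
#   args=()
#   while IFS= read -r field; do args+=(--data-urlencode "$field"); done \
#       < <(downtime_fields web01 "$(date +%s)" 10)
#   curl -s -u nagalert:secret "${args[@]}" \
#       http://nagios.example.com/nagios/cgi-bin/cmd.cgi
```

Printing one field per line keeps the date arithmetic separate from the HTTP call, so the fields can be checked by eye before anything is sent to Nagios.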

The first thing I did was write a simple Puppet module to push the init script to my Linux servers, and make the service fire up at boot and stop on reboot/shutdown/halt. Since this isn’t prepackaged (rpm), I needed to add a “chkconfig --add” resource to the Puppet module: the standard “service” resource in Puppet only triggers “chkconfig on” and “chkconfig off”, which leaves the system without the correct symlinks in /etc/rc.d/rc*.d .
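To keep that “chkconfig --add” step from running on every Puppet run, it needs some kind of guard. A minimal sketch of one, assuming the usual RHEL layout of S??/K?? links under /etc/rc.d (the function name has_rc_links is made up, and the second argument is only there so it can be tried against a scratch directory):

```shell
#!/bin/bash
# Sketch only: succeed when at least one S??/K?? link for the service exists.
# rc_root defaults to /etc/rc.d; pass another directory to try it out safely.
has_rc_links() {
    local service="$1" rc_root="${2:-/etc/rc.d}"
    ls "${rc_root}"/rc[0-6].d/[SK][0-9][0-9]"${service}" >/dev/null 2>&1
}

# Hypothetical usage, by hand or as the guard of an exec-style resource:
#   has_rc_links nagalert || chkconfig --add nagalert
```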

Then I had to make minor changes to the script itself (aside from setting up a dedicated user/password on the Nagios side): the chkconfig runlevels and the priority (99 for start, 1 for stop… so that it fires up after all system services and shuts down early). Also, RHEL systems require that you touch the /var/lock/subsys/{servicename} file, or the service shutdown won’t be triggered on rc6 or rc0.
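Put together, the changes described above boil down to something like the following skeleton. It's a sketch, not the actual script: runlevels 2345 are an assumption (only the 99/01 priorities come from the text), notify_nagios is a placeholder for the real CGI call, and LOCKFILE can be overridden so the thing can be tried without root.

```shell
#!/bin/bash
# chkconfig: 2345 99 01
# description: Schedules/cancels a Nagios downtime on stop/start.
#
# Sketch only. Runlevels 2345 are an assumption; 99/01 are the start/stop
# priorities. notify_nagios is a hypothetical placeholder.

LOCKFILE="${LOCKFILE:-/var/lock/subsys/nagalert}"

notify_nagios() {
    # Placeholder: the real script would talk to the downtime CGI here.
    echo "nagalert: $1"
}

dispatch() {
    case "$1" in
        start)
            notify_nagios "cancel downtime"
            touch "$LOCKFILE"    # without this, RHEL skips "stop" on rc0/rc6
            ;;
        stop)
            notify_nagios "schedule downtime"
            rm -f "$LOCKFILE"
            ;;
        *)
            echo "Usage: $0 {start|stop}" >&2
            return 1
            ;;
    esac
}

# A real init script would end with:  dispatch "$1"
```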

Voilà. Now, here’s what happens on interactive HOST actions:

– on shutdown/reboot, the nagalert service “stop” action schedules a 10-minute downtime on Nagios, preventing it from sending out notifications.

– on boot, the nagalert service “start” action cancels the previously scheduled downtime, putting things back to normal.

The same thing can be applied to normal services, just by triggering the “start” and “stop” actions in the start and stop scripts respectively. One common use case in my environment is tied to developers accessing dev/test servers and needing to shut down or restart application server instances. This way, I can track who triggered the stop/start action and when, without getting annoyed by false alarms, while still retaining the actual state of the service. REAL incidents (kernel panics, application crashes, etc.) are the ones that will keep Nagios firing its notifications, since in those cases the stop/start scripts won’t be invoked. BTW, the nagalert script already accepts one extra parameter (duration) that changes the default 10-minute behaviour.

As soon as I have some spare hours to spend on this, I’d like to write a PowerShell version of the script, to solve the same issue on Windows-based hosts.
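As a rough sketch of what such a wrapper could look like for an application server: NAGALERT and APP_CTL are hypothetical paths, and passing the duration as “stop <minutes>” is my assumption about the argument order, so check it against the actual script.

```shell
#!/bin/bash
# Sketch only: wrap an application stop/start with downtime handling.
# NAGALERT and APP_CTL are hypothetical paths; the "stop <minutes>" argument
# order is an assumption about how the duration parameter is passed.
NAGALERT="${NAGALERT:-/etc/init.d/nagalert}"
APP_CTL="${APP_CTL:-/opt/app/bin/appctl}"

app_stop() {
    local minutes="${1:-10}"
    echo "stop requested by ${SUDO_USER:-$(id -un)}"
    "$NAGALERT" stop "$minutes"    # schedule the downtime first
    "$APP_CTL" stop                # then stop the application
}

app_start() {
    echo "start requested by ${SUDO_USER:-$(id -un)}"
    "$APP_CTL" start               # bring the application back
    "$NAGALERT" start              # then cancel the downtime
}
```

Scheduling the downtime before the stop, and cancelling it only after the start, keeps the window in which Nagios could notice the service being down inside the downtime.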




I’ve been writing a module that takes care of managing IBM WebSphere MQ servers and clients (I know, I know… 😉 ). It is quite amazing how much writing a Puppet module pays back: you get a lot of insight into how a package should be installed and configured.

The simple fact of describing how to install, configure, run, and update a piece of software forces you to document each process, to make it repeatable (that’s THE requirement). This is just to say that there is WAY MORE to it than pure automation (which alone is a LOT). Describing an environment through recipes and modules, tracking changes through the Puppet dashboard, versioning your changes, etc. is not just cool devops stuff. It’s a smarter way of doing things. No downsides, period.

Just a couple of notes about the actual modules: it’s kind of funny that IBM did a good job at packaging (rpms) its software, but at the same time forces you to go through an interactive (#fail) shell script that makes you accept the license agreement. The upside is that I didn’t have to take care of executing that script in my modules: I eventually found that it writes out a dummy file in /tmp/blablabla: Puppet file resource to the rescue. Done.

Another upside of puppetizing the whole thing (e.g. managing queues, listeners, channels, and the like) is that besides getting rid of the Java (cough cough) GUI, it allows me to forget the insane naming convention that Big Blue decided to use for the dozens of binaries you get with the MQ install. Let Puppet take care of that (well, at least after I’ve done my homework).

Wanna talk automation? I don’t think there’s a better way to do it.

Long time since the last post, I know... I know 😉

Just a quick followup on the most recent (for the moment 🙂 ) releases of puppet and facter.

I decided to give the new packages a try, and after some testing, decided that the packages were stable enough for a minor upgrade in my datacenter (running puppet 2.7.3 and facter 1.6.0, until today :)).

Well, sort of…

Facter is in fact (man… I’m redundant) a piece of cake: rebuilding rpms from scratch for el4, el5, and el6 systems is a no-brainer (the specfile is clean and “el-aware”). Also, prebuilt packages from work just fine. Chop chop, I rhnpushed the packages to my satellite/spacewalk custom repos, released the errata, and … wait, I don’t even need to schedule the update, since puppet takes care (in my recipes) of facter updates on its own! Aaah the magic of puppet 🙂

Puppet has a somewhat different story here. I wasn’t in the mood for rebuilding the packages from source, since the last time I did it (2.7.3) the specfile needed some minor edits (it was not quite up to date), so I decided to give the binaries from a run. Mmmmh… the result is not perfect:

  • puppetlabs rpms are built w/ SELinux bindings, which require additional packages/dependencies for the binaries to install (namely, the ruby SELinux bindings). Plus, they require augeas, but I’m fine with this one. I know this is the most general-purpose configuration (w/ SELinux bindings available), but in my case it’s not a requirement, and adding the extra packages to my nodes means adding an additional repo/channel to the satellite/spacewalk infrastructure.
  • even after installing the “official” rpms, I noticed a couple of warnings when running the puppet agent (puppet agent --test):

/usr/lib/ruby/site_ruby/1.8/puppet/type/file.rb:48: warning: parenthesize argument(s) for future version
/usr/lib/ruby/site_ruby/1.8/puppet/provider/package/msi.rb:50: warning: parenthesize argument(s) for future version

Seriously, it’s only cosmetic stuff… but also a minor edit to the sources… So:

Time to rebuild!

Like with 2.7.3, rebuilding with the stock specfile provided in the 2.7.6 tarball simply doesn’t work… There has probably been a change in the paths of the current codebase. The build halts complaining about a missing mongrel.rb, which is supposed to be found in the “network/http.server/” path. Actually, the path is “network/http” nowadays, so I patched the specfile accordingly. Other than changing this and rebuilding with “–disable selinux” (see above), the specfile was good enough for the package to build w/o warnings.

Regarding the runtime warnings mentioned above, the file.rb and msi.rb files only needed a () at the mentioned lines, that now look like this:

(line 48 of lib/puppet/type/file.rb): path, name = ::File.split(::File.expand_path(value))
(line 50 of lib/puppet/provider/package/msi.rb): f.puts(YAML.dump(metadata))

Bang bang. Packages rebuilt, tested on lab node, pushed to satellite/spacewalk, and distributed to all the nodes. Aside from being up to date (yup) I can’t wait to play w/ puppet’s new features for windows nodes (2.7.6 adds some serious coolness).

Aside from these minor (really piece of cake) issues, this is another great release of puppet, one of the most clever pieces of software I have encountered in my sysadmin life (btw, big kudos to the guys at puppetlabs. You rule). 🙂