As you probably know, Nagios has excellent support for managing scheduled downtimes for hosts and services. That said, the normal way of managing downtimes is to use the web UI, and while this might be a good step in a change management process, there are lots of cases where this might not be the easiest way to go.

There might be different teams interacting with hosts and applications, and forcing each and everyone to use the nagios UI to schedule downtimes can slow down most activities. This is especially true when dealing with non-production environments, where configuration changes and stop/start activities are the norm.

The thing here is to distinguish what’s an interactive action (host reboot, service stop, etc) from an unplanned event (service crash, kernel panic, etc.). I came across this cool script:

Which solves this “problem”. In short, it’s a bash script that interacts with nagios’ downtime cgi, scheduling a downtime if invoked with “stop” and getting rid of the previously scheduled downtime if invoked with “start”.

The first thing I did was write a simple puppet module to push the init script to my linux servers, and make the service fire up at boot and stop on reboot/shutdown/halt. Since this isn’t prepackaged (rpm), I needed to add a “chkconfig --add” resource to the puppet module: the standard “service” resource in puppet only triggers “chkconfig on” and “chkconfig off”, which leaves the system without the correct symlinks in /etc/rc.d/rc*.d.

Then I had to make minor changes to the script itself, (aside from setting up a dedicated user/password on the nagios side): the chkconfig runlevels, the priority (99 for start, 1 for stop… so that it fires up after all system services and shuts down early). Also, rhel systems require that you touch the /var/lock/subsys/{servicename} file, or it won’t trigger the service shutdown on rc6 or rc0.
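To make the chkconfig part concrete, here is a minimal Python sketch of what the header and the resulting symlinks look like. The runlevels "2345" and the description text are my assumptions (the post only gives the 99/1 priorities), and the helper names are made up for illustration.

```python
ALL_RUNLEVELS = "0123456"


def chkconfig_header(runlevels="2345", start=99, stop=1,
                     description="schedule/cancel nagios downtime"):
    """Render the comment header that chkconfig reads from an init script."""
    return "# chkconfig: %s %02d %02d\n# description: %s" % (
        runlevels, start, stop, description)


def expected_symlinks(name, runlevels="2345", start=99, stop=1):
    """List the links 'chkconfig --add' should leave in /etc/rc.d/rc*.d.

    Runlevels listed in the header get an S (start) link, every other
    runlevel gets a K (kill) link.
    """
    links = []
    for level in ALL_RUNLEVELS:
        if level in runlevels:
            prefix = "S%02d" % start
        else:
            prefix = "K%02d" % stop
        links.append("/etc/rc.d/rc%s.d/%s%s" % (level, prefix, name))
    return links


def lock_file(name):
    """Lock file the init script must touch on start (and remove on stop)."""
    return "/var/lock/subsys/%s" % name


if __name__ == "__main__":
    print(chkconfig_header())
    for link in expected_symlinks("nagalert"):
        print(link)
    print(lock_file("nagalert"))
```

With these values the stop link in rc0.d and rc6.d is K01nagalert, so the downtime gets scheduled before the other services go down.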

Voilà. Now, here’s what happens on interactive HOST actions:

– on shutdown/reboot, the nagalert service “stop” action schedules a 10min downtime on nagios, preventing it from firing out notifications.

– on boot, the nagalert service “start” action cancels the previously scheduled downtime, putting things back to normal.
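The two actions above can be sketched in Python. Note that this is not the nagalert script: that one talks to the downtime cgi, while this sketch builds the equivalent Nagios external command lines (SCHEDULE_HOST_DOWNTIME and DEL_HOST_DOWNTIME). The host name, author, comment and downtime id are placeholders I made up.

```python
def schedule_host_downtime(host, now, minutes=10, author="nagalert",
                           comment="planned stop"):
    """Build a SCHEDULE_HOST_DOWNTIME external command line.

    Field order: host;start;end;fixed;trigger_id;duration;author;comment
    """
    start = int(now)
    duration = minutes * 60
    end = start + duration
    fixed = 1        # fixed window: runs exactly from start to end
    trigger_id = 0   # not triggered by another downtime
    fields = [host, start, end, fixed, trigger_id, duration, author, comment]
    return "[%d] SCHEDULE_HOST_DOWNTIME;%s" % (
        start, ";".join(str(f) for f in fields))


def delete_host_downtime(downtime_id, now):
    """Build the command that cancels a previously scheduled downtime."""
    return "[%d] DEL_HOST_DOWNTIME;%d" % (int(now), downtime_id)


if __name__ == "__main__":
    import time
    now = time.time()
    print(schedule_host_downtime("web01", now))      # what "stop" would send
    print(delete_host_downtime(42, now))             # what "start" would send
```

The `minutes` argument plays the role of the optional duration parameter: leave it out and you get the default 10 minute window.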

The same thing can be applied to normal services, just by triggering the “start” and “stop” actions on start and stop scripts respectively. One common use case in my environment is tied to developers accessing dev/test servers and needing to shutdown/restart application server instances. This way, I can track who triggered the stop/start action and when, not getting annoyed by false alarms, but still retaining the actual state of the service. REAL incidents (kernel panics, application crashes, etc), are the ones that will keep nagios firing its notifications, since in those cases the stop / start scripts won’t be invoked. BTW, the nagalert script already accepts one extra parameter (duration) that can change the default 10 min behaviour.

As soon as I have some spare hours to spend on this, I’d like to write a powershell version of the script, to solve the same issue on windows-based hosts.




I’ve been writing a module that takes care of managing IBM Websphere MQ servers and clients (I know, I know… 😉 ). It is quite amazing how much it pays back to write a puppet module: you get a lot of insight about how a package should be installed and configured.

The simple fact of describing how to install, configure, run, update a software forces you to document each process, to make it repeatable (that’s THE requirement). This is just to say that there is WAY MORE than pure automation (which alone is a LOT). Describing an environment through recipes and modules, tracking changes through the puppet dashboard, versioning your changes, etc. it’s not just cool devops stuff. It’s a smarter way of doing things. No downsides, period.

Just a couple of notes about the actual modules: it’s kind of funny that IBM did a good job at packaging (rpms) its software, but at the same time forces you to go through an interactive (#fail) shell script that makes you accept the license agreement. The upside is that I didn’t have to take care of executing that script in my modules: I eventually found it writes out a dummy file in /tmp/blablabla: puppet file resource to the rescue. Done.

Another upside of puppetizing the whole thing (e.g. managing queues, listeners, channels, and the like) is that besides getting rid of the java (cough cough) gui, it allows me to forget the insane naming convention that Big Blue decided to use for the dozens of binaries you get with the MQ install. Let puppet take care of that (well, at least after I’ve done my homework).

Wanna talk automation? I don’t think there’s a better way to do it.

Long time since last post, I know.. I know 😉

Just a quick followup on the most recent (for the moment 🙂 ) releases of puppet and facter.

I decided to give the new packages a try, and after some testing, decided that the packages were stable enough for a minor upgrade in my datacenter (running puppet 2.7.3 and facter 1.6.0, until today :)).

Well, sort of…

Facter is in fact (man… I’m redundant) a piece of cake: rebuilding rpms from scratch, for el4, el5, and el6 systems is a no-brainer (specfile is clean and “el-aware”). Also, prebuilt packages from work just fine. Chap chap, I rhnpushed the packages to my satellite/spacewalk custom repos, released the errata, and … wait, I don’t even need to schedule the update, since puppet takes care (in my recipes) of facter updates on its own! Aaah the magic of puppet 🙂

Puppet has a somewhat different story here. I wasn’t in the mood of rebuilding the packages from source, since the last time (2.7.3) I did it, there were some minor edits needed in the specfile (not quite up to date), so I decided to give the binaries from a run. Mmmmh… result is not perfect:

  • puppetlabs rpms are built w/ SElinux bindings, which require additional packages/dependencies for the binaries to install (namely, ruby-SElinux bindings). Plus, it requires augeas, but I’m fine with this one. I know this is the most general-purpose configuration (w/ SElinux bindings available), but in my case it’s not a requirement, and adding the extra packages to my nodes means adding an additional repo/channel to the satellite/spacewalk infrastructure.
  • even after installing the “official” rpms, I noticed a couple of warnings when running the puppet agent (puppet agent --test):

/usr/lib/ruby/site_ruby/1.8/puppet/type/file.rb:48: warning: parenthesize argument(s) for future version
/usr/lib/ruby/site_ruby/1.8/puppet/provider/package/msi.rb:50: warning: parenthesize argument(s) for future version

Seriously, it’s only cosmetic stuff… but also a minor edit to the sources… So:

Time to rebuild!

Like with 2.7.3, rebuilding with the stock specfile provided in the 2.7.6 tarball simply doesn’t work… There has probably been a change in the paths of the current codebase. The build halts complaining about a missing mongrel.rb which is supposed to be found in the “network/http.server/” path. Actually, the path is “network/http” nowadays, so I patched the specfile accordingly. Other than changing this and rebuilding with “--disable selinux” (see above), the specfile was good enough for the package to build w/o warnings.

Regarding the runtime warnings mentioned above, the file.rb and msi.rb files only needed a () at the mentioned lines, that now look like this:

(line 48 of lib/puppet/type/file.rb): path, name = ::File.split(::File.expand_path(value))
(line 50 of lib/puppet/provider/package/msi.rb): f.puts(YAML.dump(metadata))

Bang bang. Packages rebuilt, tested on lab node, pushed to satellite/spacewalk, and distributed to all the nodes. Aside from being up to date (yup) I can’t wait to play w/ puppet’s new features for windows nodes (2.7.6 adds some serious coolness).

Aside from these minor (really piece of cake) issues, another great release of puppet, one of the most clever pieces of software I encountered in my sysadmin life (btw, big kudos to the guys at puppetlabs. You rule). 🙂

Nagios, or Perish

October 4, 2009

A huge overhaul in my monitoring architecture is in progress, and as it goes on I find myself more and more confident that going with nagios is THE way to go. Period. Over time we grew a number of monitoring tools (either FOSS or not) that do (when things go as expected) what they’re meant to do… But nothing more.

Here’s what’s cool about nagios: it does what you expect it to do, and a whole lot more. It’s no secret I’ve always been a fan of Ethan Galstad’s baby, but over time I also found myself adopting something different for special purposes… and it didn’t go as expected: highly specialized monitoring tools, with 0 flexibility and 0 integration w/ the rest of the world. Moreover, those systems tend to make you become a ‘slave’ of the system itself, not only for setting them up, but also for running them… And that’s simply unacceptable.

People often say that Nagios has a steep learning curve, but I found that it only requires accurate planning (which is not optional with other monitoring tools either!).

What’s wrong with other monitoring solutions? In my opinion, most of them fail due to the ‘everything in a box’ approach, while Nagios itself is extremely modular by its very nature.

Fact is, unless your architecture is pretty straight (read dull), you’ll end up needing something more than the out-of-the-box bells and whistles that many tools provide. My own architecture is everything but straight (or dull 😉 ): kilos of systems, tons of apps, complex networks, and so on: this is where nagios fits, as it allows for an incremental approach that makes you start with the basics and add bells, whistles, and whatever as you go.

The incremental approach is the key: you should really monitor only what’s critical, what’s providing insight into your systems/apps, what’s valuable when you’re dealing with outages and faults. This is why I don’t like the ‘agent does everything’ approach: it gives you tons of data, which seems cool at first sight, but ends up being useless (or, even worse, confusing) in real world scenarios.

Nagios is also often criticized for its lack of graphical configuration frontends. Actually, there are a few good frontends, but after a LONG evaluation I ended up choosing the good old conf by hand. Nagios’ template based and inheritance based conf allows for some elegant configuration schemes, which (if carefully planned) result in a highly maintainable system.
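To show why template inheritance keeps things maintainable, here is a toy Python model of how a `use` chain resolves. It is a simplified sketch of the idea, not Nagios’ own code: it ignores multiple inheritance and additive (`+`) values, and all the object names and values are invented.

```python
# Directives that belong to the template itself and are not passed down.
NOT_INHERITED = ("name", "register")


def resolve(key, objects, _seen=()):
    """Return the effective directives of objects[key], following 'use'."""
    if key in _seen:
        raise ValueError("inheritance loop at %r" % key)
    obj = objects[key]
    merged = {}
    parent = obj.get("use")
    if parent:
        inherited = resolve(parent, objects, _seen + (key,))
        for directive in NOT_INHERITED:
            inherited.pop(directive, None)
        merged.update(inherited)
    merged.update(obj)          # local values override inherited ones
    merged.pop("use", None)
    return merged


objects = {
    "generic-host": {"name": "generic-host", "register": "0",
                     "max_check_attempts": "3",
                     "notification_interval": "60"},
    # the exception lives in ONE place: dev boxes never re-notify
    "dev-host": {"name": "dev-host", "use": "generic-host", "register": "0",
                 "notification_interval": "0"},
    "devbox01": {"use": "dev-host", "host_name": "devbox01",
                 "address": "192.0.2.10"},
}

if __name__ == "__main__":
    for directive, value in sorted(resolve("devbox01", objects).items()):
        print("%-24s %s" % (directive, value))
```

The point is the middle template: an exception for a whole class of hosts is one line in one file, and every host using it picks it up.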

The result? A clean conf structure (structured dir tree for conf files), easily expandable conf items (templates, etc), manageable exceptions (which is soooo fundamental), integration w/ other tools (read trouble ticketing, etc), network awareness (parent/child relationship), dependencies awareness (when it’s needed!), and bells/whistles (nagvis, pnp4nagios, etc).

Nagios rocks!

Lately I’ve been struggling to find some elegant way to put apple software updates into some sort of patch management process (by definition centrally managed), while keeping the process as simple/cost-effective as possible. Actually, what I was trying to achieve is a way to centrally manage and approve the update packages, and make them automagically applied to the clients, considering that the end users are standard users (non admin), which prevents the chance of having the end user invoke the software update at all (admin privileges required).

Here is the starting environment:
– 10.4.x and 10.5.x clients
– end users are non-admin
– clients are joined to a win2k8 a.d. Domain (not very important in this case, while useful).

And this is what I managed to create so far (with a 2day investment 🙂 ) :

Since I had a spare leopard osx server not doing much, I configured and started the software update service. This allowed me to centrally download and cache all updates from apple, and also solves the need to manually enable/allow only selected updates for the involved clients.

Then I followed some tips found on the web and managed to trick the clients into thinking that (official apple update site(s)) is actually my leopard server (using some dumb dns redirect or something like that…). Actually there is a ‘more official way’ of instructing the clients into using a custom update server (modifying ‘defaults’ on each client), but I preferred the tricky way for a couple of reasons:

1 – easier to implement and change over time
2 – allows for fallback to the official apple servers whenever I need this feature (roaming/mobile users, etc).

The tricky solution actually needed some more work, since there are a couple of undocumented web redirects that are needed in order for the trick to work. Let me say that apple’s docs about the software update service are poor to say the least. Also, the service itself is poor: it downloads packages, it allows you to choose which to enable, but there are no useful logs (and you sit there wondering what the hell it is doing, just to discover that some uber-gig package is being silently downloaded). Moreover, there is NO WAY of fine graining the selection of updates: a package is enabled for everyone or for none… No groups or the equivalent of MS SCCM ‘collections’. Very much ‘SOHO oriented’, if you ask me.

Anyway, if you’re willing (and able) to adapt your patch management process to this ‘all or nothing’ constraint, the software update service does its job.

Now on the VERY tricky part: how do you manage to push/force/publish/you name it the packages to the clients so that the user (not admin, remember!) gets his updates without much ado?

I seriously googled for some official or semi-official way to do this, but found nothing really enlightening.

So I ended up building a client-side shell script that:
– invokes the ‘softwareupdate -l’ command and parses the results to see if there are updates available, and if one or more of them require a reboot.
– downloads the packages and installs them
– if needed, it reboots the client.
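The parsing step can be sketched like this (in Python rather than the shell of my actual script). The sample listing is an assumption about what ‘softwareupdate -l’ prints on these clients, with made-up package names and sizes, so check it against real output before trusting the patterns.

```python
import re

# Assumed shape of 'softwareupdate -l' output; labels and sizes are invented.
SAMPLE = "\n".join([
    "Software Update Tool",
    "Copyright 2002-2007 Apple",
    "",
    "Software Update found the following new or updated software:",
    "   * iTunesX-8.2",
    "\tiTunes (8.2), 77000K [recommended]",
    "   * MacOSXUpd10.5.8-10.5.8",
    "\tMac OS X Update (10.5.8), 274000K [recommended] [restart]",
    "",
])


def parse_updates(listing):
    """Return one dict per update: its label and whether it needs a restart."""
    updates = []
    current = None
    for line in listing.splitlines():
        match = re.match(r"^\s*\*\s+(\S.*)$", line)
        if match:
            current = {"label": match.group(1).strip(), "restart": False}
            updates.append(current)
        elif current is not None and "[restart]" in line:
            current["restart"] = True
    return updates


def plan(listing):
    """Decide what to install and whether a reboot is needed afterwards."""
    updates = parse_updates(listing)
    return {"install": [u["label"] for u in updates],
            "reboot": any(u["restart"] for u in updates)}


if __name__ == "__main__":
    print(plan(SAMPLE))
```

In the real hook the listing would come from running the command itself, and the install step would be the ‘softwareupdate -i -a’ call, followed by a reboot only when the plan says so.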

Doing the updates while the user is working (open files/apps) may be VERY dangerous, and hell for the IT support, so I tried to find some ‘safe zone’ that ensures the updates are actually triggered, without risking scrambling the poor user’s machine.

Shame on Apple nr.2: I ended up thinking of a pre-shutdown update-trigger script (I’ve built tons on my *nix servers for ages), but apple decided that rc.* was too simple, and drove us into using the superperfect launchd, which happens imho to be a real mess, and more important doesn’t offer ONE SINGLE way of intercepting a shutdown/reboot event. No comment 😦

At least, they allowed for the so-called ‘loginwindow hooks’, which provide (simply put) some sort of flexibility in triggering actions on user login and logout. Luckily, the logout hook came to the rescue, since it’s triggered on logout (doh, who could have guessed? 😉 ), and shutdown/reboot.

This is cool as it provides the ‘safe zone’ I needed to trigger the updates w/o worrying that the user has open files/apps and even worse that he could get stuck because my updates on his mac require a reboot.

So I hooked my logout hook 😉 to loginwindow, and automagically it started to behave as expected.

Now I found the need to provide some feedback to the user during the update process, since the background script left him with a dumb wallpaper and no further info.
And I found the excellent tool iHook (really, that’s its name! 🙂 ), which allowed me to show a nice cocoa dialog that I can populate with warnings, progress bar (how cool is that) and even a background image (!!! 🙂 ), just by adding some echos to my bash script.

The result (still a work in progress) is a user friendly updates-trigger process that automagically applies updates on my behalf, without requiring user intervention or other major issues.

The thing needs some cleanup and tweaks, but the whole process is running pretty good!
Things to fix asap are:
– provide an installer script that sets up the necessary packages and config files on the client side (nearly done), and chooses among the iHook for leopard vs tiger (sgrunt 😦 ).
– provide an autoescape in the hook so that if a mobile user accesses the internet avoiding my tricky diversion, it simply quits without accessing the official apple update server.
– some more informative displays in the iHook dialog, so that the user is not driven into thinking that the mac is hung (the softwareupdate -i -a process could last looooong).

So far, so good… Considering the starting point.. I’m pretty satisfied with this setup, as it lowers the burden on IT staff and gives all users something automated, user friendly, and sleek :-).

I’ll let you know as soon as I roll it out in production 🙂

I just decided (early this morning) to give my radware appliances a firmware roll/update. I had a few incidents with the running firmware… and I decided to go to the non-ga releases, since after reading the extensive release notes and maintenance release notes, it turned out that the new versions carried a LOT of bugfixes, and huge performance improvements (even if I didn’t ever notice a slowdown). After a few hours of careful upgrading, the machines (a couple of wsd (AS1) and a couple of ct100) are now online with the latest firmware… or nearly.

I discovered in fact that the latest firmware changed name… the new products are named AppDirector (for wsd) and AppXcel (for the ct100). Radware also suggests upgrading to those new releases… but despite my bleeding-edge orientation, I decided not to go that way… for a couple of reasons:

– I don’t see major improvements, for the moment.

– Configuration will surely be different and since I put up a really complex scenario… it’s not time to mess it up so lightly.

– My experience tells me that radware’s stable releases are stable after a few months of GA… and maybe more… and since the 2 new firmwares have been on the shelves for a couple of weeks… well… I won’t test these for them…

Anyway… I’ll keep an eye on the new App* stuff… We’ll see if cool things show up in the future.

In my last post, I stated that openvpn could be considered the perfect vpn solution nearly always. Well, “nearly” was there since another great opensource project appeared to me a few months ago: sslexplorer. While openvpn and sslexplorer share ssl as a security layer, their approach to vpns is totally different. Openvpn is a client-server based solution which uses ssl as a secure way to encapsulate ip traffic over a secure udp/tcp connection, while sslexplorer is a browser based vpn solution, which relies on https for communication security.

sslexplorer is a java based project that greatly simplifies the burden of distributing and configuring clients that other vpn solutions impose (while openvpn is way easier, though, than all the ugly ipsec stuff in general). Simply put, the client doesn’t exist… or at least the relevant part needed for secure communication is activated as a signed java applet after the user accesses the sslexplorer portal via a standard web browser. The java applet is responsible for the secure communication (ssl based) from the client to the server and back, and sslexplorer itself acts in general as a proxy to the corporate resources. Among other things, it can allow you to reverse proxy corporate intranet sites, redirect tcp ports for e.g. the corporate mailserver, or maybe give you access to a java applet that acts as an ssh client to your *nix servers. All this lives inside the browser session, so you can easily be at your favorite internet cafe @whatever place and without any software requirements other than a browser with a decent java plugin, you can get full access to your corporate resources in a snap.

But they went even further! If you do have your preferred email application (thunderbird, of course) at hand, why would you rely on that uncomfortable intranet webmail app? Just fire up the “bird” and configure it so that it points to localhost:xxxx where xxxx is the port number your friendly sslexplorer java applet is proxying versus your intranet imap/smtp server, for example.

Many other things not covered here make sslexplorer another great great opensource project (like, e.g., its powerful web based management interface).

Obviously, while sslexplorer is a great solution for roadwarrior vpn setup, it isn’t the right solution for site2site architectures. But for this, guys, there’s openvpn 🙂

Jump in the openvpn & sslexplorer club… we’re having a hell of a party 😉