As you probably know, Nagios has excellent support for managing scheduled downtimes for hosts and services. That said, the normal way of managing downtimes is through the web UI, and while that might be a good step in a change management process, there are lots of cases where it isn’t the easiest way to go.

There might be different teams interacting with hosts and applications, and forcing each and every one of them to use the Nagios UI to schedule downtimes can slow down most activities. This is especially true in non-production environments, where configuration changes and stop/start activities are the norm.

The key here is to distinguish an interactive action (host reboot, service stop, etc.) from an unplanned event (service crash, kernel panic, etc.). I came across this cool script:

http://exchange.nagios.org/directory/Addons/Scheduled-Downtime/nagalert/details

which solves this “problem”. In short, it’s a bash script that interacts with Nagios’ downtime CGI, scheduling a downtime when invoked with “stop” and removing the previously scheduled downtime when invoked with “start”.
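
Just to give an idea of what it does under the hood, here’s a minimal sketch of the kind of calls involved. This is NOT the actual nagalert code: the URL, credentials, field names, and the cmd_typ values (55 for SCHEDULE_HOST_DOWNTIME and 78 for DEL_HOST_DOWNTIME on Nagios 3.x) are assumptions you should double-check against your own cmd.cgi and date_format settings.

#!/bin/bash
# Sketch only: schedule (stop) or cancel (start) a host downtime via Nagios' cmd.cgi.
NAGIOS="http://nagios.example.com/nagios/cgi-bin"   # placeholder URL
AUTH="nagalert:secret"                              # placeholder credentials
HOST=$(hostname -s)
MINUTES=${2:-10}                                    # optional duration, defaults to 10 minutes

case "$1" in
  stop)
    # Schedule a fixed host downtime starting now (date format depends on Nagios' date_format).
    START=$(date '+%m-%d-%Y %H:%M:%S')
    END=$(date -d "+${MINUTES} minutes" '+%m-%d-%Y %H:%M:%S')
    curl -s -u "$AUTH" "$NAGIOS/cmd.cgi" \
      --data-urlencode "cmd_typ=55" --data-urlencode "cmd_mod=2" \
      --data-urlencode "host=$HOST" --data-urlencode "trigger=0" --data-urlencode "fixed=1" \
      --data-urlencode "com_author=nagalert" --data-urlencode "com_data=automated downtime" \
      --data-urlencode "start_time=$START" --data-urlencode "end_time=$END" \
      --data-urlencode "btnSubmit=Commit" > /dev/null
    ;;
  start)
    # Cancel the downtime scheduled above; the real script looks the id up from downtime.cgi.
    DOWN_ID=""                                      # placeholder
    curl -s -u "$AUTH" "$NAGIOS/cmd.cgi" \
      --data-urlencode "cmd_typ=78" --data-urlencode "cmd_mod=2" \
      --data-urlencode "down_id=$DOWN_ID" \
      --data-urlencode "btnSubmit=Commit" > /dev/null
    ;;
esac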

The first thing I did was write a simple Puppet module to push the init script to my Linux servers, make the service fire up at boot, and stop it on reboot/shutdown/halt. Since this isn’t prepackaged (as an RPM), I needed to add a “chkconfig --add” exec resource to the Puppet module, since the standard “service” resource in Puppet only triggers “chkconfig on” and “chkconfig off”, preventing the system from getting the correct symlinks in /etc/rc.d/rc*.d.

Then I had to make minor changes to the script itself (aside from setting up a dedicated user/password on the Nagios side): the chkconfig runlevels and priorities (99 for start, 1 for stop, so that it fires up after all system services and shuts down early). Also, RHEL systems require that you touch the /var/lock/subsys/{servicename} file, or the service shutdown won’t be triggered on runlevel 0 or 6.
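
Here’s a rough skeleton of that init script, just to show how the pieces fit together. The service name, the /usr/local/bin/nagalert path, and the runlevels in the chkconfig header are my placeholders, not the actual script.

#!/bin/bash
# chkconfig: 2345 99 01
# description: schedules a Nagios downtime on stop, cancels it on start
# Note: "chkconfig --add nagalert" is what creates the S99/K01 symlinks in /etc/rc.d/rc*.d.

LOCKFILE=/var/lock/subsys/nagalert

case "$1" in
  start)
    /usr/local/bin/nagalert start       # cancel the downtime scheduled at shutdown
    touch "$LOCKFILE"                   # required on RHEL, or the K script never runs at halt/reboot
    ;;
  stop)
    /usr/local/bin/nagalert stop 10     # schedule a 10-minute downtime before going down
    rm -f "$LOCKFILE"
    ;;
  *)
    echo "Usage: $0 {start|stop}" >&2
    exit 2
    ;;
esac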

Voilà. Now, here’s what happens on interactive HOST actions:

- on shutdown/reboot, the nagalert service “stop” action schedules a 10-minute downtime in Nagios, preventing it from firing off notifications.

- on boot, the nagalert service “start” action cancels the previously scheduled downtime, putting things back to normal.

The same thing can be applied to normal services, simply by triggering the “start” and “stop” actions in their start and stop scripts respectively. One common use case in my environment involves developers accessing dev/test servers and needing to shut down/restart application server instances. This way I can track who triggered the stop/start action and when, without getting annoyed by false alarms, while still retaining the actual state of the service. REAL incidents (kernel panics, application crashes, etc.) are the ones that will keep Nagios firing its notifications, since in those cases the stop/start scripts won’t be invoked. BTW, the nagalert script already accepts an extra parameter (duration) that can change the default 10-minute behaviour.

As soon as I have some spare hours to spend on this, I’d like to write a PowerShell version of the script to solve the same issue on Windows-based hosts.

Ciao!

PJ

A few minutes ago, a discussion on Twitter about the importance of users in VMUG meetings led to a statement from @dutch_vmaffia that I disagreed with, or more precisely, that I don’t accept. The statement was something along the lines of “VMUG (meetings) are ruled by vendors”.

Actually, @dutch_vmaffia raised an interesting point. The same issue was discussed during the VMUG lunch in Copenhagen (at the last VMworld Europe): many leaders admitted that meeting sponsors happen to have a “strong voice” about meeting content. Even in the VMUG IT steering committee we have discussed this a few times. After all, without the vendors/sponsors it would be VERY difficult to organize meetings (locations, etc. ain’t cheap!).

So, as Scott Lowe pointed out, this is an important discussion, as it heavily steers the quality and direction of the meetings, which are a fundamental part of VMUG activities. I’m convinced this is SO important that I’ve stopped on my way home to write this post on the road (in a parking lot, actually ;-p).

Good News! There’s a solution! :-)

Luckily, in VMUG IT we’ve been debating this since the very start, and together with my board members (note: my fellow board members kick ass!) we decided to define a strict policy about how sponsors can and should contribute to the meetings.

Simply put, we require that vendors provide an end user/customer who speaks at the meeting about their own implementation of the specific product/technology. Pre-sales, marketing, and sales guys are not allowed on stage. Period. We allow people from the vendor/sponsor to be in the meeting room, but only to answer specific technical questions that may come from the members attending the meeting. Other than that, we allow the vendor/sponsor to have booths, etc. OUTSIDE the meeting room, so it’s up to the users to get in contact with them, only if THEY want to.

Sounds difficult? Well, it is. But it can be done, and we’ve been doing it since the first meeting.

The most difficult part is convincing the sponsor that this actually is THE RIGHT THING TO DO! Let’s face it: we’ve all attended tons of presos by vendors, and we already know how they’re structured and how they often end up as mere showcases without much value. I can’t blame the vendors for this: it’s hard to be funny and interesting when your primary job is to sell or help sell a product.

Instead, letting an end user do the talking brings a lot of advantages:

- The case study/preso is a real-world implementation of the product. No fireworky preso that promises wonders. Just the facts.

- Users hear from a user, and can ask the USER the questions they care about most.

- Sponsors/Vendors get a BIGGER return since there’s no conflict of interest in what the user is saying. “A customer is worth a thousand presales”, so to speak ;-p

- Sponsors get REAL feedback from the users (e.g. if your product is perfect but the price isn’t… don’t worry, we’ll let you know buahah!).

- Users/members don’t get the “usual supper”: there are plenty of meetings where the vendor has the chance to showcase its product, but this is a USER-led group. A whole different story, and a chance to communicate your story in a whole different way!

 

CFP First!

It’s not only sponsor content, anyway! Beginning with the last meeting, we launched a Call For Papers, inviting users to submit their presos and letting users choose what they want to hear. As a Steering Committee, our role in this case is just to keep things going and give general guidance about each meeting’s main topics (for example, the next meeting’s topic will be “virtual security”). This is user-submitted content, and it’s the most valuable for us.

 

Exceptions? Yes, there are.

Not every product/technology is mature enough to have an established user base, so in some cases it’s necessary to let the vendor speak. Be aware, though: that’s a preso about a promise of what a product should do, and hearing what a user actually DID with a product is a hugely different story.

Other exceptions involve people who work for vendors but don’t do presos about their vendor’s product (or not strictly). The point is to avoid shameless plugs as much as possible. As an example, Massimo Re Ferre’ did a BRILLIANT preso during the last VMUG IT meeting about cloud computing concepts and common misconceptions. The preso was so “clean”, smart, and interesting that no one even cared or knew who Massimo worked for (he could have been an Oracle emp… mmm no, wrong example ;-p). Fact is, with clever and smart people you don’t even need a policy.

Common sense to the rescue! There are some other edge cases that may force you to work around the “rule” above, but if we keep the focus on the most important aspect of the VMUG (aka the USER), it all flows smoothly.

Believe me, this is the proverbial “win-win”. Honestly, as a VMUG leader (and I know my fellow board members agree), we want the users to WIN (i.e. to get interesting content and information), but we’ve also had a lot of great feedback from the vendors that sponsored past meetings and accepted our policy. They weren’t used to this kind of approach, and although many were skeptical at first, they found out that users are a LOT more interested in the content and ask many more questions after these presentations than during the “traditional” one-way ones.

That’s it for now. Our members have given us a lot of positive feedback about this policy, but I’d like to get YOUR feedback too!

PS: Oh, and did I mention that Steve Herrod will be our special guest at the next meeting? Yes, I did ;-p

Bring back the U in VMUG!!!

 

Ciao,

PJ – @drakpz

Today I attended Juku’s “unplugged” (www.juku.it) event in Bologna, organized by my friends Enrico Signoretti (@esignoretti) and Fabio Rapposelli (@fabiorapposelli).

Juku’s slogan is “think global, act local”, and imho this is the perfect statement to concisely describe the “unplugged” event. Enrico and Fabio are well-known experts in the enterprise IT market, with a strong focus on storage, cloud computing, and virtualization.

What they managed to set up is an informal and very interesting series of meetings, with lots of great presentations about “hot themes”: ranging from cloud paradigms to chargeback processes, from storage for virtualized environments to stretched LANs, Fabio and Enrico shared great information while keeping their feet on the ground (a much-needed approach in a market that too often tries to address everything but real-world scenarios), explaining even complex stuff with language and examples closely aligned with our day-to-day headaches.

Their knowledge comes from an explicit passion for IT and goes far beyond their daily duties: it comes from technical deep dives into the new technologies coming from major vendors, mixed and blended with real-world scenarios and the feedback they’re getting from REAL users.

That’s it: think global, act local. Kudos to Juku for this effective approach. I really enjoyed the event.

I’ve been writing a module that takes care of managing IBM WebSphere MQ servers and clients (I know, I know… ;-) ). It’s quite amazing how much it pays back to write a Puppet module: you get a lot of insight into how a package should be installed and configured.

The simple fact of describing how to install, configure, run, and update a piece of software forces you to document each process and make it repeatable (that’s THE requirement). This is just to say that there is WAY MORE to it than pure automation (which alone is a LOT). Describing an environment through recipes and modules, tracking changes through the Puppet Dashboard, versioning your changes, etc. isn’t just cool devops stuff. It’s a smarter way of doing things. No downsides, period.

Just a couple of notes about the actual module: it’s kind of funny that IBM did a good job of packaging (RPMs) its software, but at the same time forces you to go through an interactive (#fail) shell script that makes you accept the license agreement. The upside is that I didn’t have to take care of executing that script in my module: I eventually found out that it writes a dummy file in /tmp/blablabla. Puppet file resource to the rescue. Done.
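
A minimal sketch of the idea, assuming a made-up marker path and package name (check what the license script really writes on your MQ version; in my module the same thing is done with a Puppet file resource):

# Sketch only: pre-create the license-acceptance marker so the install stays
# non-interactive. Path and package name are hypothetical placeholders.
touch /tmp/mq_license_accepted
yum -y install MQSeriesServer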

Another upside of puppetizing the whole thing (e.g. managing queues, listeners, channels, and the like) is that, besides getting rid of the Java (cough cough) GUI, it allows me to forget the insane naming convention that Big Blue decided to use for the dozens of binaries you get with the MQ install. Let Puppet take care of that (well, at least after I’ve done my homework).
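
To give an idea of what ends up wrapped in Puppet resources, these are the kinds of MQ commands involved (the queue manager, queue, and listener names below are just examples):

# Create and start a queue manager, then define a queue and a TCP listener via MQSC.
crtmqm QM1
strmqm QM1
runmqsc QM1 <<'EOF'
DEFINE QLOCAL(APP.REQUEST.QUEUE)
DEFINE LISTENER(L1) TRPTYPE(TCP) PORT(1414)
START LISTENER(L1)
EOF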

Wanna talk automation? I don’t think there’s a better way to do it.

Long time since the last post, I know… I know ;)

Just a quick follow-up on the most recent (for the moment :) ) releases of Puppet and Facter.

I decided to give the new packages a try, and after some testing I decided they were stable enough for a minor upgrade in my datacenter (which was running Puppet 2.7.3 and Facter 1.6.0 until today :)).

Well, sort of…

Facter is in fact (man… I’m redundant) a piece of cake: rebuilding RPMs from scratch for el4, el5, and el6 systems is a no-brainer (the specfile is clean and “el-aware”). Also, the prebuilt packages from yum.puppetlabs.com work just fine. Chop chop: I rhnpushed the packages to my Satellite/Spacewalk custom repos, released the errata, and… wait, I don’t even need to schedule the update, since Puppet takes care (in my recipes) of Facter updates on its own! Aaah, the magic of Puppet :-)
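
For reference, the whole routine boils down to something like this (the Satellite URL and channel label are placeholders, and the tarball name obviously depends on the release you’re rebuilding):

rpmbuild -ta facter-1.6.x.tar.gz            # the tarball ships its own specfile
rhnpush -v --server http://satellite.example.com/APP \
        --channel my-custom-tools-el6 \
        ~/rpmbuild/RPMS/noarch/facter-1.6.x-*.noarch.rpm   # path depends on your %_topdir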

Puppet is a somewhat different story. I wasn’t in the mood to rebuild the packages from source, since the last time I did it (for 2.7.3) the specfile needed some minor edits (it wasn’t quite up to date), so I decided to give the binaries from yum.puppetlabs.com a run. Mmmmh… the result is not perfect:

  • the puppetlabs RPMs are built with SELinux bindings, which require additional packages/dependencies for the binaries to install (namely, the Ruby SELinux bindings). Plus, they require augeas, but I’m fine with that one. I know this is the most general-purpose configuration (with SELinux bindings available), but in my case it’s not a requirement, and adding the extra packages to my nodes means adding an additional repo/channel to the Satellite/Spacewalk infrastructure.
  • even after installing the “official” RPMs, I noticed a couple of warnings when running the puppet agent (puppet agent --test):

/usr/lib/ruby/site_ruby/1.8/puppet/type/file.rb:48: warning: parenthesize argument(s) for future version
/usr/lib/ruby/site_ruby/1.8/puppet/provider/package/msi.rb:50: warning: parenthesize argument(s) for future version

Seriously, it’s only cosmetic stuff… but it’s also only a minor edit to the sources… So:

Time to rebuild!

As with 2.7.3, rebuilding with the stock specfile provided in the 2.7.6 tarball simply doesn’t work… There has probably been a change in the paths of the current codebase. The build halts complaining about a missing mongrel.rb, which is supposed to be found under the “network/http.server/” path. The path is actually “network/http” nowadays, so I patched the specfile accordingly. Other than changing this and rebuilding with “--disable selinux” (see above), the specfile was good enough for the package to build without warnings.
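
The path fix itself is a one-liner along these lines (hypothetical rendition: the exact %files entry in the stock spec may differ, and the SELinux bits were handled by editing the spec as mentioned above):

# Point the spec at the new location of mongrel.rb, then rebuild.
sed -i 's|network/http.server/mongrel.rb|network/http/mongrel.rb|' puppet.spec
rpmbuild -ba puppet.spec     # with the SELinux requirement dropped from the spec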

Regarding the runtime warnings mentioned above, the file.rb and msi.rb files only needed a pair of parentheses at the mentioned lines, which now look like this:

(line 48 of lib/puppet/type/file.rb): path, name = ::File.split(::File.expand_path(value))
(line 50 of lib/puppet/provider/package/msi.rb): f.puts(YAML.dump(metadata))

Bang bang. Packages rebuilt, tested on a lab node, pushed to Satellite/Spacewalk, and distributed to all the nodes. Aside from being up to date (yup), I can’t wait to play with Puppet’s new features for Windows nodes (2.7.6 adds some serious coolness).

Aside from these minor (really, piece-of-cake) issues, another great release of Puppet, one of the most clever pieces of software I’ve encountered in my sysadmin life (btw, big kudos to the guys at Puppet Labs. You rule). :-)

In the last few months, I’ve repeatedly stumbled upon mentions of and articles about the infamous “partition alignment problem”, which can cause serious I/O performance hits, particularly in virtualized environments and with RAID-based disk subsystems.

I won’t try to explain the problem from scratch, since many gurus have already done an amazing job of describing the issue. My favorite post on the subject is Duncan’s: http://www.yellow-bricks.com/2010/04/08/aligning-your-vms-virtual-harddisks/ , where he clearly explains the issue behind partition alignment and (particularly for VMware environments) what you should and could do to prevent/fix it.

The easy part in this scenario is aligning VMFS partitions: as long as you use vCenter Server to create the partition, it takes care of proper alignment, and you can forget about that side of the problem. Now on to the “guest side” of the issue, where each and every OS takes a different approach to partition alignment. Basically, partitioning has always been approached using the CHS (Cylinder, Head, Sector) scheme, which doesn’t take into account the blocks and tracks where the actual data is written. It’s good for its simplicity, but definitely not good performance-wise.

For example, common Linux distros and even MS Windows Server 2003 misalign partitions, since they rely on cylinder boundaries. Windows Server 2008, by contrast, aligns partitions to 1MB, which is generally safe regardless of the storage array that sits behind your virtualized environment (different storage arrays have different chunk sizes, so your “alignment needs” may vary; refer to your storage vendor’s docs to find out your chunk size).
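
As a quick sanity check on an existing Linux guest (my own sketch, assuming 512-byte sectors): a partition whose start sector is a multiple of 2048 sits on a 1MB boundary, while the classic CHS-style start at sector 63 does not.

# Print each partition's start sector and whether it sits on a 1MB boundary.
for part in /sys/block/sda/sda[0-9]*; do
    start=$(cat "$part/start")
    if [ $((start % 2048)) -eq 0 ]; then
        echo "$(basename "$part"): start sector $start (1MB aligned)"
    else
        echo "$(basename "$part"): start sector $start (NOT 1MB aligned)"
    fi
done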

Ok, so we’re talking about Linux here, and in my case, RHEL environments. What should I do to:

a) install properly-aligned systems

b) fix already installed systems w/ misaligned partitions.

Before I go on, let me underline that while this may seem like a “tweak”, and maybe not an easy/cheap one, it can make a huge difference, particularly when you consider how many guests waste I/O resources by unnecessarily stressing your consolidated storage array. Adopting partition alignment as a best practice can improve your performance and save you bucks!

In this post I’m gonna show you how we solved the first scenario, by preparing a customized kickstart file that takes care of the partition alignment process. At the very least, this allows us to avoid creating new VMs that will need to be fixed sooner or later :-).

In a future post I will describe the procedure we’re taking in order to fix already installed and misaligned systems (we’re still evaluating the different possibilities).

The kickstart file

Our approach to Linux installs is to use a PXE server to boot the newly created VM and install the Linux guest over the network. Setting up a PXE boot server is easy and covered in great detail all over the web, so I won’t bother you with that part of the process. The good thing here is that RHEL systems provide a way to customize the installation process beforehand, so that nearly every aspect of it can be automated (partitioning, GRUB installation, package selection, post-installation activities, etc.). On one hand this speeds up your installations A LOT, and on the other it gives you the chance to configure an automated/repeatable setup (and maybe align partitions properly ;-) ).

Regarding the “partitioning issue”, we discovered that the default partitioning methods available in the RHEL installer simply can’t be used, since they rely on the CHS mechanism and thus align partitions to cylinder boundaries, which is not what we need. We need to align to sector boundaries and to choose the exact sector where each partition starts.

The “%pre” section of the kickstart file lets you execute operations before the installation process starts (I’ll let you guess what the %post section does ;-) ). So we decided to dig into this and use GNU parted to define the partitions the way we wanted.

So, here’s a sample %pre section that does all the job:

%pre

# section 1
TOTAL_RAM=`expr $(free | grep ^Mem: | awk '{print $2}') / 1024`
if [ $TOTAL_RAM -lt 2048 ]; then
    SWAP_SIZE=2048
else
    SWAP_SIZE=4096
fi

# section 2
dd if=/dev/zero of=/dev/sda bs=512 count=1
parted -s /dev/sda mklabel msdos

# section 3
TOTAL=`parted -s /dev/sda unit s print free | grep Free | awk '{print $3}' | cut -d "s" -f1`
SWAP_START=`expr $TOTAL - $SWAP_SIZE \* 1024 \* 2`
SWAP_START=`expr $SWAP_START / 64 \* 64`
ROOT_STOP=`expr $SWAP_START - 1`
parted -s /dev/sda unit s mkpart primary ext3 128 $ROOT_STOP
parted -s /dev/sda unit s mkpart primary linux-swap $SWAP_START 100%

What are we doing here?

In this simple scenario, we want to detect the amount of RAM available on the system in order to size the swap partition. Then we want to split the disk into two partitions, one holding / and the other holding the swap.

The first section calculates the amount of available RAM and picks between two possible swap sizes: 2GB for systems with less than 2GB of RAM, 4GB for larger systems.

Then we move on to the interesting part:

The second section starts by erasing the partition table (by using dd) and creating an empty one.

The third section calculates the number of sectors available on the disk, then the start sector of the swap partition, which sits one sector past the end of the root partition.

If you look at the math being done here (okok, I know… it’s a quickie ;-) ), we’re subtracting N GB (in sectors) from the end of the disk, dividing by 64 to get an integer, and multiplying back by 64 to get an aligned sector number. This also gives us an aligned swap partition! Subtracting one sector from this figure gives us the end sector of the root partition. The remaining stuff is easy: we define the two partitions, the first starting at sector 128 and ending at SWAP_START - 1, the second (the swap) starting at SWAP_START and extending to the end of the disk.
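
To make the arithmetic concrete, here’s a worked example assuming a hypothetical 20GB virtual disk (41,943,040 sectors of 512 bytes) and a 4GB swap:

#   TOTAL      = 41943040
#   SWAP_START = 41943040 - 4096 * 1024 * 2 = 33554432
#   SWAP_START = 33554432 / 64 * 64         = 33554432   (already 64-sector aligned)
#   ROOT_STOP  = 33554432 - 1               = 33554431
# Resulting layout:
#   /dev/sda1  ext3        sectors 128 .. 33554431   (root, starting on a 64KB boundary)
#   /dev/sda2  linux-swap  sectors 33554432 .. 100%  (swap, 64-sector aligned)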

Mike Laverick’s ‘stupid IT’ series ( http://bit.ly/bY5vxS ), which I liked a lot, inspired me to write this post. This one isn’t directly related to IT stupidity, though. Instead, it’s more about the great flexibility that comes with modern IT technologies, and the chaos, panic, and disasters that ‘old-style IT guys’ create by not grasping the potential and not understanding the inner dynamics of these new multi-dimensional environments.

‘Multiple dimensions’ is one key factor here, the other being flexibility along those dimensions. Many IT environments have multiple dimensions, and those dimensions are there for a reason: they address specific requirements and have unique features that require planning, managing, monitoring, etc. in a very specific way. Each dimension, of course, is part of a whole, and ‘control’ is what you get when you understand and manage ‘the whole thing’ (this is ‘governance’, in the parlance of our times ;-) ).

A quick example: storage systems are multi-dimensional by their very nature, even if some single-neuron/single-threaded/mono-dimensional IT guys think that storage = a bunch of rotating rust that is supposed to hold data, eventually. The dimensions here include protection level (RAID), throughput (IOPS), bandwidth (Gbps), scalability, and tiering capabilities (automated or not), just to name a few. Add storage-related functions such as SANs or replication, and a whole lot of new dimensions come into play, with varying degrees of impact on the overall architecture.

Some may argue that this multi-dimensional nature inevitably leads to complexity. I believe it mostly leads to flexibility (and coolness, but that’s the geek in me speaking). Flexibility is imho what makes an architecture really successful, as it allows for change and growth when your business requirements vary: that’s why enterprise architectures exist, after all.

Flexibility comes at a cost, indeed: enterprise architectures force you to think in many ways at the same time, and require that you understand the need for careful planning, operation, and monitoring of the whole thing, to keep it current and aligned with business goals/requirements. That’s where the ‘rigid flexibility’ comes from: the multi-dimensional nature is there whether you want it or not. It’s up to you to get the whole picture and turn flexibility into business advantage (or take everything for granted and prepare to suffer massive pain ;-) ).

I had the luck of living through the growth of such an architecture, from the very basic needs (a few standalone servers) up to a mature environment (consolidated storage, replication, DR/BC, virtualization, and so on). That allowed me to deal with one dimension at a time (well, sometimes more than one) and to fit each new scenario into the whole design.

I like to think that one of my tasks is to keep up to date with technology and with business requirements, so that the two keep converging. If I look back on what we’ve done with our architecture, it has really been an evolving journey towards flexibility. Accordingly, the ‘rigid flexibility’ I mentioned is a good thing imho, since it forced us to broaden our thinking instead of letting us choose a shorter (and maybe simpler) path.

Obviously, pushing towards the cutting/bleeding edge isn’t always well accepted by coworkers and users (I don’t really understand the need to sit and watch as most do, but I take it as a component of the whole environment), but by taking the risk of breaking some eggs, I’ve seen a good number of pretty omelettes get made.

Actually, the way to know which eggs you should break is, again, knowledge. The more you know your architecture and the products and solutions the world has to offer (and where they’re going, too), the more successful you can be in planning and designing your own evolution. That’s why plan & design is soooo crucial nowadays (and that’s why a VCDX is a killer figure :-) ). Sure, flexibility allows for remediation, but that’s something you’d like to avoid if possible, right?

Ironically, what I’m seeing lately is that while some great vendors clearly show their faith in and endorsement of innovation, others are actually pulling the handbrake and slowing things down, unacceptably. They do this by applying ridiculous/anachronistic licensing or support policies (yes, ORCL, I’m talking to you), or by throwing tons of FUD at mainstream/leading technologies such as virtualization.

While this pisses me off (A LOT), I take a deep breath and try to remind myself that this is evolution. Evolution will allow those who embrace flexibility and innovation to succeed, and at the same time will leave in the dust those sitting on their golden support contracts and once-shining technology.

So, dear vendor, since I have no intention of standing still, you’d better do the same… or I’ll look elsewhere
>:-D
