Cacti: Monitor protocol statistics for NetApp volumes

Update 2011-07-10: Due to a template export error with Cacti, the import was failing for a lot of people. I apologize for taking so long to fix the templates; they should be fixed now. Thank you to everyone who pointed out the errors and the fix in the comments.


I have made no secret that I use two applications daily to monitor my infrastructure: Nagios and Cacti. I have created a fair number of scripts (and hope to publish more soon) to help Nagios monitor the different parts of the infrastructure, but I haven't previously published many of my Cacti scripts.

One of the most useful is the config that I use to monitor the different protocol stats for volumes. I created an indexed query so that the single script, and its accompanying XML file, are capable of monitoring all the volumes, and I can select which graphs to create for each volume. The polling script is loosely based on the multi-protocol realtime volume statistics script that I created some time ago.
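If you are curious how an indexed data query hangs together, here is a rough sketch of the interface Cacti expects from a script query. The volume list and counter lookups below are stubbed placeholders (the real collection work is done by the script in the download); treat it purely as an illustration of the index/query/get calling convention:

#!/usr/bin/perl
# Rough sketch of a Cacti "script query"; not the actual polling script.
# Cacti calls the script with "index", "query <field>", or "get <field> <index>".
use strict;
use warnings;

die "usage: $0 index | query <field> | get <field> <index>\n" unless @ARGV;
my ($cmd, $field, $index) = @ARGV;

# Placeholder helpers: the real script pulls these values from the filer.
sub get_volumes { return ('vol0', 'vol1', 'vol2') }
sub get_stat    { my ($vol, $stat) = @_; return 0 }

if ($cmd eq 'index') {
    # One index (volume name) per line.
    print "$_\n" for get_volumes();
}
elsif ($cmd eq 'query') {
    # "index:value" pairs for the requested field, e.g. nfs_ops or cifs_latency.
    print $_, ':', get_stat($_, $field), "\n" for get_volumes();
}
elsif ($cmd eq 'get') {
    # A single value for one field on one volume.
    print get_stat($index, $field), "\n";
}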

Download the updated template and script(s) here.

Some examples of the graphs produced for each volume: total operations and latency, CIFS operations and latency, NFS operations and latency, and iSCSI operations and latency.

Nagios: Checking for abnormally large NetApp snapshots

My philosophy with Nagios checks, especially with the NetApp, is that unless there are extenuating circumstances I want all volumes (or whatever is being checked) to be checked equally and at the same time. This means I don't want to have to constantly add and remove checks from Nagios as volumes are added, deleted and modified. I would much rather have one check that covers all of the volumes and reports on them en masse. That way I don't have to think about the check itself, only what it's checking.

One of the many things that I regularly monitor on our multitude of NetApp systems is snapshots. We have had issues, especially with LUNs, where the snapshots have gotten out of control.

In order to prevent this, or at least hope that someone is watching the screen…, I wrote a quick script that checks whether the total size of the snapshots on a volume exceeds the snap reserve. Since not all of our volumes have a snap reserve, I also put in the ability to check the size of the snaps against the percentage of free space left in the volume.

This last measure is a little strange, but I think it works fairly well. Take, for example, a 100GB volume. If it is 50% full (50GB), there is no snap reserve and the alert percentage is left at the default of 40% of free space, then the alert will happen when snapshots exceed about 15GB. "But that's not 40% of the free space", I hear you saying. Ahhh, but it is… you see, as the snapshots grow there is less free space, so the same amount of snapshot data represents a larger percentage of the shrinking free space. So at 15GB of snapshots, there would be 35GB of free space, and 40% of 35GB is 14GB.
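Boiled down to a few lines (just the idea, not the actual plugin), the math from that example looks like this:

# Sketch of the free-space check from the example above (values in GB).
my $vol_size  = 100;
my $used      = 50;   # data in the active filesystem
my $snap_used = 15;   # space consumed by snapshots
my $alert_pct = 40;   # percent of the *remaining* free space

my $free      = $vol_size - $used - $snap_used;   # 35GB free
my $threshold = $free * $alert_pct / 100;         # 14GB
print "ALERT: snapshots exceed threshold\n" if $snap_used > $threshold;   # 15 > 14, so alert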

This causes the alerts to happen earlier than you may expect at first. You can adjust this number to be a percentage of the total space in the volume if you like…however, why not just set a snap reserve at that point? I chose to make the script this way in order to attempt to keep a little more free space in the volume, while not making a snap reserve mandatory.

One last word… please keep in mind this script does not check for the volume itself filling up; you should have other checks for that. This merely checks whether snapshots have exceeded a threshold of space in the volume, to keep them from taking up too much of it.

Bring on the Perl…


VMware vSphere Hypervisor (ESXi) 4.1 kickstart – A.K.A. official “touchfree” ESXi installs

VMware is a facilitator. I know, you’re thinking “yeah, they facilitate my power/space/cooling savings, they facilitate infrastructure consolidation, IT agility, high availability, etc.”, but really, they facilitate me being lazy (which for sysadmins is a good thing…a lazy admin will only want to do a task once, then automate the sh*t out of it).

I’ve already documented how I hacked the ESXi 4.0 installer to have it do the installation without interaction. However, VMware has one-upped me and integrated kickstart into their installer. This makes things VASTLY easier, requires no tomfoolery with the ISO, and is significantly more capable.

This blog post will be just a short one to demonstrate how easy it is now to have the install be “touch-free”. I am working on some more complex examples in the coming days.

So without further blithering from me, on with the install! Put the CD in the drive (or mount the ISO remotely) and boot the server. When it reaches the boot options menu:

ESXi 4.1 Boot options screen

Press Tab to append options to the boot line. Append the following after vmkboot.gz, but before the --- that follows it.

ks=file://etc/vmware/weasel/ks.cfg

It is VERY IMPORTANT that you place the kickstart file location after vmkboot.gz, but before the next boot module. It should not be at the end.

mboot.c32 vmkboot.gz ks=file://etc/vmware/weasel/ks.cfg --- vmkernel.gz --- sys.vgz --- cim.vgz --- ienviron.vgz --- install.vgz

Here is an example:

ESXi 4.1 Boot Options with KS

When you're done, press Enter. It will begin to load the data off the CD, and once the different install modules have loaded it should simply begin to install ESXi, just like the version I had hacked together previously…

ESXi 4.1 KS Install

The only thing left will be to press “Enter” when it’s done (why?!).

A word of caution… the kickstart that VMware has provided will automatically select and format the first disk that it finds, regardless of whether it is local or "remote" (i.e. a SAN LUN). I would assume that the vast majority of the time it's going to find the local disks first, however…
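For reference, the default kickstart on the media (the ks.cfg referenced above) looks roughly like the following. I'm reproducing it from memory and trimming the sample %post section, so check your own build, but the autopart --firstdisk --overwrite line is the one responsible for grabbing that first disk:

# Accept the VMware End User License Agreement
vmaccepteula

# Set the root password for the DCUI and Tech Support Mode
rootpw mypassword

# Choose the first discovered disk to install onto
autopart --firstdisk --overwrite

# The install media is in the CD-ROM drive
install cdrom

# Use DHCP on the first network adapter
network --bootproto=dhcp --device=vmnic0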

Hopefully in the next few days I’ll have some more time to play with the new kickstart features and post some more examples. VMware has really done some great things with this process and it is now possible to have the entire process be automated…
1) use DHCP to provide the “permanent” IP
2) use a network PXE boot for the media and to provide a KS file
3) use the %post section of the kickstart file to have the server reach out and touch a vCLI or PowerCLI configuration host, which then applies the permanent configuration (a rough sketch of this follows below).

The reason step one should hand out the permanent IP is that it gives your configuration host (vCLI or PowerCLI) an easy way to know which IP (and potentially hostname) to assign to the host.
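For step 3, something as small as this in the kickstart file is enough to let the configuration host know that a freshly installed server is up and which IP it came from. The URL and endpoint are made up for illustration, and you should check which %post interpreters are available on your build; point it at whatever your vCLI/PowerCLI box actually listens on:

%post --unsupported --interpreter=busybox
# Hypothetical example: ping the configuration host so it knows this
# host has finished installing and can be picked up for configuration.
wget -q http://config-host.example.com/esxi-installed -O /dev/null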

Good luck, and thanks to VMware for (finally) integrating kickstart with ESXi!

vSphere: Console… we don’t need no stink’in console

I won't attempt to provide a feature rundown or tell you why vSphere 4.1 is the greatest thing since sliced bread. It appears to be a solid release, but I'll leave that analysis to the experts… Instead I want to talk about the vSphere hypervisor (previously ESXi).

Why the name change? Simple: what was previously mis-branded as a separate technology is really the hypervisor's core. Back in ESX 3.5, ESXi was a separate technology, but as of vSphere 4 the two share a unified core. In fact, the product we like to think of as vSphere 4.0/4.1 is really just the vSphere hypervisor with a special management VM! This is important: the only difference is the console, which is nothing more than a VM!

So why the distinction, and why now? VMware is playing its hand this round because that special VM is going bye-bye. The next release of vSphere will not have a service console… PANIC… RUN IN CIRCLES, THE ZOMBIES ARE COMING!!!

Don't panic. Personally, I applaud the move. Over the past year and a half I've heard every argument against the console-less hypervisor, but honestly I chalk it all up to people fearing change. There are a couple thousand admins who have invested a lot of time mastering vSphere, and VMware is about to change the whole game on them. These guys/gals bring up several arguments against the console-less hypervisor, and I'll attempt to offer my counterargument to each point.

Q. No 3rd-party agents.

A. It has been public knowledge that the console was going away, and as of vSphere 4.0 VMware shipped a new management appliance, the vMA. One of the intended uses of this appliance was to host 3rd-party agents. So you see, we still have 3rd-party agents; they just need to be rewritten. In most cases this will result in a better product, since unfortunately the vast majority of today's 3rd-party software could better be described as a really complex Perl script running over SSH!

Q. Hardware monitors/plug-ins.

A. Part of the original ESXi 3.5 release was the introduction of a rudimentary CIM provider. This provider has since been fully expanded and made extensible. While it is a change from the traditional agent-based monitors, CIM does fill this gap.

Q. Automating common tasks.

A. As of vSphere 4.1, Tech Support Mode supports SSH, but you should really be using either PowerCLI or the vCLI! While it is true that there are still a couple of things that can only be done via the console, I'm confident VMware will fix those gaps before putting the console out to pasture.

Q. Security

A. So this is the big one, and my personal pet peeve. I've heard security experts bash the vSphere hypervisor, claiming it is insecure. I just don't understand this stance. Admittedly I'm no security expert; I only work with the federal government in some of the most secure data centers in the world, but what do I know…

Let's break this down, shall we… The only difference is a VM. Admittedly this VM has special connections into the vmkernel, but it's still just a VM. How exactly does the inclusion of a VM make the hypervisor more secure? In my opinion, the exclusion of this VM instantly increases the security posture of most organizations. The reason for this is simple: it was hard to properly harden the console, and it was all too easy to open a critical security hole and expose one's infrastructure through it.

Yes, you still have to do several things to really lock down the console-less hypervisor, but it's not nearly the feat the console once was. In fact, it's simple:

1. Modify proxy.xml (turn off all unneeded web services and make everything use HTTPS; a sample fragment follows this list).
2. Enable Lockdown mode.
3. Physical security.
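To give an idea of what step 1 involves, here is the sort of change you make to an endpoint entry. This fragment is reconstructed from memory, and the exact entries, element names, ports and layout of /etc/vmware/hostd/proxy.xml vary between builds, so treat it purely as an illustration of the idea:

<!-- One endpoint entry inside proxy.xml: setting accessMode to "reject"
     disables the service entirely, while "httpsWithRedirect" forces HTTPS. -->
<e id="5">
  <serverNamespace>/mob</serverNamespace>
  <accessMode>reject</accessMode>
  <port>8308</port>
</e>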

That's it folks, that's all it takes to secure the hypervisor. There are a couple hundred other little things necessary to design a secure infrastructure, but as you can see the hypervisor is easy! In fact, I'm so confident in this that I'm willing to hold a Bobby Flay style throwdown. If you have the means to provide a pair of internet-facing vSphere hosts, I'll secure the console-less hypervisor, we'll get TexiWill to harden the legacy console-based hypervisor, and then we'll release the IPs to the world. Have at it, folks; I bet the console-less hypervisor holds up at least as long as the legacy hypervisor!

Why so brash? Well, it will take an exploit to get into the console-less hypervisor, and any such exploit will also be present in the legacy hypervisor. With the console-less vSphere hypervisor, without access to the physical host or vCenter there is simply no other way in. Remember, this isn't Linux or BSD or UNIX… it's vSphere, it's practically firmware, and the whole point was to remove all the crap that weakened the security and stability to begin with!

I really want to put this to bed! Let's develop the to-do list for VMware: the 10-20 things they need to fix before they can finally kill the console. Then let's collectively shut up about it. It's going to happen, and complaining with arbitrary little gripes… or demanding NDA meetings with engineers isn't going to stop any of it. The task at hand is simple: weed out the crap and focus on what needs to be fixed in vSphere v.Next.

If we missed something let us know in the comments.
~Glenn

PowerShell: DataOnTap Realtime Multiprotocol Volume Latency

I had some free time yesterday morning as I couldn't sleep after the long weekend. I used the time to dig into the DataOnTap PowerShell Toolkit. I started with an easy port of one of Andrew's performance monitoring scripts. I won't go into it as it's very straightforward, but I will say that so far I am very pleased with the DataOnTap toolkit.

~Glenn
