Nagios: Checking for abnormally large NetApp snapshots

My philosophy with Nagios checks, especially with the NetApp, is that unless there are extenuating circumstances, I want all volumes (or whatever is being checked) to be checked equally and at the same time. This means I don’t want to constantly add and remove checks from Nagios as volumes are added, deleted and modified. I would much rather have one check that covers all of the volumes and reports on them en masse. That way I don’t have to think about the check itself, only what it’s checking.

One of the many things that I regularly monitor on our multitude of NetApp systems is snapshots. We have had issues, especially with LUNs, where the snapshots have gotten out of control.

In order to prevent this, or at least hope that someone is watching the screen…, I wrote a quick script that checks whether the total size of the snapshots on a volume exceeds the snap reserve. Since not all of our volumes have a snap reserve, I also added the ability to check the size of the snapshots against the percentage of free space left in the volume.

This last measure is a little strange, but I think it works fairly well. Take, for example, a 100GB volume. If it is 50% full (50GB), there is no snap reserve, and the alert percentage is left at the default of 40% of free space, then the alert will happen when snapshots exceed about 15GB. “But that’s not 40% of the free space”, I hear you saying. Ahhh, but it is…you see, as the snapshots grow, free space shrinks, so the same snapshots take up an ever larger percentage of it. At 15GB of snapshots, there would be 35GB of free space, and 40% of 35GB is 14GB.
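The arithmetic works out to a simple closed form: with free space F and alert percentage p, snapshots of size s leave F − s free, so the alert fires once s ≥ p(F − s), i.e. s ≥ pF/(1 + p). Here is that calculation as a small Python sketch (the actual check is Perl; the function name is mine, just for illustration):

```python
def snapshot_alert_threshold_gb(volume_size_gb, used_gb, alert_pct):
    """Snapshot size at which the alert fires, when the alert triggers once
    snapshots consume alert_pct of the *remaining* free space.

    Snapshots of size s shrink free space to (free - s), so the alert
    condition s >= alert_pct * (free - s) solves to:
        s >= alert_pct * free / (1 + alert_pct)
    """
    free_gb = volume_size_gb - used_gb
    return alert_pct * free_gb / (1 + alert_pct)

# 100GB volume, 50% full, default 40% threshold: alert near 14.3GB of snapshots
print(round(snapshot_alert_threshold_gb(100, 50, 0.40), 1))
```

This matches the worked example above: a touch under 15GB, noticeably earlier than a naive "40% of the original 50GB free" would suggest.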

This causes the alerts to happen earlier than you may expect at first. You can adjust this number to be a percentage of the total space in the volume if you like…however, why not just set a snap reserve at that point? I chose to make the script this way in order to attempt to keep a little more free space in the volume, while not making a snap reserve mandatory.

One last word…please keep in mind this script does not check for a volume filling up; you should have other checks for that. This merely checks whether snapshots have exceeded a threshold of space in the volume, to keep them from taking up too much.

Bring on the Perl…

Read more

PowerShell: DataOnTap Realtime Multiprotocol Volume Latency

I had some free time yesterday morning as I couldn’t sleep after the long weekend. I used the time to dig into the DataOnTap PowerShell Toolkit. I started with an easy port of one of Andrew’s performance monitoring scripts. I won’t go into detail as it’s very straightforward, but I will say that so far I am very pleased with the DataOnTap Toolkit.


Monitoring for orphaned snapshots left by SMVI

NetApp’s SnapManager for Virtual Infrastructure (SMVI) is a great product, but it’s messy. If it encounters any error, it seemingly forgets to delete the virtual machine snapshots from the Virtual Infrastructure before dying.

To keep orphans (I’ve seen as many as 20 on a single virtual machine) from accumulating, I created a quick Nagios check that simply alerts when it sees them.

This script is very elementary. It uses a regex to check for any snapshots that match the default SMVI naming convention, incrementing a counter for each one it finds. If any are found, the script returns an error to Nagios, which causes an alert to be sent.
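The logic is simple enough to sketch. Here is the idea in Python (the actual check is Perl), with an assumed “smvi” prefix standing in for the default naming convention — verify the pattern against your own snapshot names before trusting it:

```python
import re

# Assumed default SMVI snapshot prefix -- an illustration, not gospel.
# Adjust the pattern to whatever your SMVI jobs actually produce.
SMVI_PATTERN = re.compile(r"^smvi[_-]", re.IGNORECASE)

def count_smvi_orphans(snapshot_names):
    """Count snapshots that look like leftovers from an SMVI backup run."""
    return sum(1 for name in snapshot_names if SMVI_PATTERN.match(name))

def nagios_exit(orphans):
    """Nagios plugin convention: exit 0 = OK, 1 = WARNING, 2 = CRITICAL."""
    if orphans == 0:
        print("OK: no SMVI snapshots found")
        return 0
    print(f"WARNING: {orphans} SMVI snapshot(s) present")
    return 1
```

Wire `nagios_exit(count_smvi_orphans(...))` to whatever enumerates your VM snapshots and Nagios does the rest.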

I’ve set the check to execute once an hour in my environment, as I don’t feel that granularity finer than that is needed…an hour’s worth of change is ok for an SMVI snapshot for me.

PowerShell: Import NetApp AutoSupport

The first step to any problem is getting said problem into PowerShell! We all know the usual players here: WMI, ADSI, .NET, COM, etc…but what about good old text? I get the impression text has been ignored as legacy. When text does get a little attention, it is almost always treated like the unix world: Get-Content | Select-String used in place of cat | grep…I offer a different approach. In my opinion, you haven’t really ingested something until it’s in the form of an object. Enter Import-ASUP, a PowerShell script to ingest a NetApp autosupport. Give it a try and look around at the object it spits out. You may be surprised just how much information we all beam back to the mother ship every week!
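The text-to-object idea generalizes beyond PowerShell. Here is a toy Python sketch of the same principle — parse key/value header lines into a single object you can query by property instead of grepping (the field names and input are invented for illustration; the real ASUP format is far richer):

```python
from types import SimpleNamespace

def parse_header(text):
    """Turn 'Key: value' lines into one object with attributes.

    A toy stand-in for what Import-ASUP does at much larger scale:
    once the text is an object, you query properties instead of grepping.
    """
    fields = {}
    for line in text.splitlines():
        if ":" in line:
            key, _, value = line.partition(":")
            fields[key.strip().lower().replace(" ", "_")] = value.strip()
    return SimpleNamespace(**fields)

asup = parse_header("System Serial Number: 123456\nOnTAP Version: 7.3.1")
print(asup.ontap_version)  # 7.3.1
```

The payoff is downstream: filtering, sorting, and comparing objects beats re-parsing text at every step.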

Read more

Homemade remote CLI for NetApp

Security is one of those things that everyone knows they need to do, but it rarely gets done to the level that it should be.  This, at least in my experience, is primarily because security makes general, day-to-day tasks more difficult.  Take, for instance, rsh.  Rsh by itself is a great time saver…admit it…it’s great to just be able to execute commands from your admin host and have the results returned.  You can parse them however you like using standard operating system tools like grep, awk, and sed, and best of all (or perhaps worst…) you don’t have to type the password repeatedly.

However, all of the benefits of rsh can be realized using ssh; it just takes a little more setup.  But I’m not going to get into that today.  What if you just want a way to securely execute commands against your NetApp without consuming the sole ssh connection to your filer (you have telnet and rsh disabled, right?)?  What if you don’t want to enable ssh, telnet, or rsh at all, but still want a pseudo command line?  Assuming you have SSL (https) access enabled, you can use the Perl SDK to access, and execute commands against, your filer almost as though you were telnet/ssh’d into it.

The magic comes from the undocumented system-cli SDK command.  It allows you to execute almost any command just as though you were sitting at the console.
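Under the hood, system-cli is just another ZAPI call made over HTTPS, with each word of the CLI command passed as an argument. Here is a rough sketch (in Python rather than Perl) of the XML request the SDK builds for you — treat the exact envelope details (version number, namespace) as assumptions from my environment:

```python
import xml.etree.ElementTree as ET

def build_system_cli_request(command):
    """Build a ZAPI-style XML body for the undocumented system-cli call.

    The SDK posts a document like this to the filer over HTTPS; each
    word of the CLI command becomes its own <arg> element.  The version
    and namespace attributes here are assumptions -- the SDK normally
    constructs the envelope for you.
    """
    root = ET.Element("netapp", version="1.7",
                      xmlns="http://www.netapp.com/filer/admin")
    cli = ET.SubElement(root, "system-cli")
    args = ET.SubElement(cli, "args")
    for word in command.split():
        ET.SubElement(args, "arg").text = word
    return ET.tostring(root, encoding="unicode")

req = build_system_cli_request("vol status -v")
print(req)
```

The filer runs the command as though it were typed at the console and hands the captured output back in the reply.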

The great part is that with this, you can accomplish probably 99% or more of all tasks having only one access method enabled to your NetApp: the https/ssl option.  SSH, RSH, telnet and HTTP can all be disabled.

I say almost because there are two types of commands that do not work using the below Perl script.  The first type is non-terminating commands.  These, at least off the top of my head, are primarily the stats show commands with the -i option specified.  With the -i option, the stats command repeats at the interval (in seconds) specified.  Now, the caveat to this is that you can also specify a -c option that limits the number of iterations.  The downside is that if you issue a command like stats show -i 5 -c 5 volume:*:read_ops, then the command will take 25 seconds, at which point the results, as a whole, will be returned.

The second type is man commands.  Man output does not go (at least with the simulator) to STDOUT, so system-cli doesn’t capture it.

So, without any more pontificating by me, here is some sample output and the script.  If you would like to see additional examples, let me know in the comments.

Read more

Really NetApp?!? You didn’t use your own SDK?

So, this post irked me. Not because of the poster or his post (honest Andy, if you ever read this, I have nothing against you or your post! I’m happy to see another VMware/NetApp blogger!), but because of the script he referenced and the problem encountered. He has a good solution, but the problem shouldn’t exist.

You see, I hate RSH. I don’t know why (well, it is quite insecure, and it can require some configuration), but I hate it. SSH is only marginally better in this case…sure, it’s secure, but you have to auth each time, and if you avoid that with ssh keys, well, it’s only a little better than RSH (comms are encrypted, but compromise of a single account can lead to bad things on many hosts). The script that is referenced, one that NetApp recommends admins use to verify that their aggregates have enough free space to hold the metadata for the volumes in OnTAP 7.3 (the metadata moves from the volumes to the aggregate in 7.3), uses RSH to execute commands that are then parsed in a somewhat rudimentary way to get information.

Sure, it’s effective, but it’s far from graceful…especially when you have a perfectly good and effective SDK at your disposal.

I was kind of bored, so I decided to rewrite the script using the SDK. This is the end result. It reports the same data, but uses the SDK to gather all of the necessary information to make a determination for the user. The new script is significantly shorter (10KB vs 25KB, 380 lines vs 980), and it requires only one login.

Thanks to NetApp for providing their SDK, and I hope that no one over there minds me refactoring…

Read more

Perl OnTAP SDK: Realtime Multiprotocol Volume Latency

Update 2009-07-21: With some help from Steffen, a bug was found where the script wasn’t returning any values in the result hash when the toaster didn’t return values for certain queries. This caused Perl to print errors when it was trying to do math on non-existent values. Starting at line 273, the script has been updated so that the hash returned by the subroutine that does the ZAPI query has default values of zero, which should eliminate the errors seen by Steffen. Please let me know of any other problems encountered! (and thanks to Steffen for finding this bug!)
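The fix in that update — giving the result hash default values of zero so math on missing counters never blows up — is a pattern worth showing on its own. A minimal Python equivalent (the real script is Perl, and the counter names here are illustrative):

```python
from collections import defaultdict

def query_counters(raw_results):
    """Return counter values with a default of zero, so later arithmetic
    never hits a missing key when the toaster omits a counter from its
    reply (e.g. a protocol that is disabled or idle)."""
    counters = defaultdict(float)   # any unknown counter reads as 0.0
    counters.update(raw_results)
    return counters

c = query_counters({"nfs_read_ops": 120})
print(c["nfs_read_ops"] - c["cifs_read_ops"])  # missing counter -> 0.0, no error
```

Without the default, a filer that never served CIFS would make the subtraction die; with it, the math quietly treats the absent protocol as zero activity.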

The script in my previous post only prints NFS latency for the NetApp as a whole; it doesn’t give any information about a specific volume. Some of my ESX hosts use iSCSI for their datastores, and because the NetApp has many iSCSI clients, looking at iSCSI stats for the filer as a whole didn’t help me very much.

The solution was this script. It is a significantly modified version of the previous script that is capable of showing the realtime latency for all protocols: NFS, CIFS, SAN (which I believe is all block level ops summarized), FCP and iSCSI. It also displays the three different types of operations for each protocol: read, write, and other.

The script, if invoked with nothing more than the connection information, will display the read, write, and “other” latency and operations for the total of all protocols. There is a fourth column as well, which shows the average latency and total operations across all operation types (r/w/o).
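For anyone curious how “realtime latency” falls out of the counters: the perf counters are cumulative, so the script samples twice and divides the deltas. A hedged Python sketch of that core calculation (counter names are illustrative, not the exact ZAPI counter names):

```python
def average_latency_ms(prev, curr):
    """Average per-op latency between two counter samples.

    ZAPI-style perf counters are cumulative: latency is total microseconds
    spent servicing ops since boot, so per-op latency over an interval is
    delta(latency) / delta(ops).  Counter names here are illustrative.
    """
    ops = curr["ops"] - prev["ops"]
    if ops <= 0:
        return 0.0  # idle interval: report zero rather than divide by zero
    usec = curr["latency_usec"] - prev["latency_usec"]
    return usec / ops / 1000.0  # microseconds -> milliseconds

prev = {"ops": 1000, "latency_usec": 5_000_000}
curr = {"ops": 1500, "latency_usec": 7_500_000}
print(average_latency_ms(prev, curr))  # 5.0 ms average over the interval
```

The same delta arithmetic works per protocol and per operation type (read/write/other), which is all the full script really does, many times over.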

This script has proven quite beneficial for me. By monitoring CIFS latency during peak hours on the volume that contains shares for profiles, I have proven that the reason logins can take a significant amount of time is the use of high capacity, but very slow, SATA disks, and not the network or the desktops themselves. I’ve also been able to prove that one of our iSCSI volumes was “slow” due to bandwidth, and not spindle count (interestingly, the problem with this volume is the I/O request size…the app makes large requests, which chokes bandwidth before available IOPS run out).

The OnTAP SDK is quite powerful; Glenn and I are quickly discovering that anything possible in FilerView and/or DFM/OpsMgr is doable through the SDK.

Read more

Perl OnTAP SDK: Realtime NFS Latency

Since most of my Virtual Infrastructure runs on NFS datastores, I like to keep a very close eye on what’s going on inside the NetApp. I generally use Cacti for long-term monitoring of the status of the datastores and the NetApp as a whole.

However, when I want to see what’s going on in less-than-five-minute increments, Cacti is pretty much useless. I wrote this script a while ago so that if I feel latency is becoming a problem, I can check it right away and watch it at frequent intervals.

Most often I use this script when Nagios starts to chirp at me. I use a slightly modified version of this script with Nagios and have it alert when latency gets out of hand. I then use this script to get a good look at what’s going on.

The OnTAP SDK is almost as entertaining to work with as VMware’s…

Read more

Cacti templates for NetApp’s CIFS, NFS, and iSCSI

Update 2010-07-21: If you are interested in graphing operations and latency for NFS, CIFS, iSCSI, FCP and/or SAN protocols on a per volume basis, you may want to see this post.

Update 2009-07-21: As requested by Dave, I have posted an FCP template and script. I have no way of testing them (no FCP…), but I think they should work.

I’m one of those people who has to know everything that is going on inside my infrastructure at all times. ESX, vCenter, MySQL, SQL Server, NetApp…I keep close tabs on all of them using both Cacti and Nagios.

Some people might find it strange that I have both running, but the two applications have very different functions. Cacti is superb at trend analysis and at detecting abnormalities after the fact (only occasionally during the event). I use its data to determine, for example, when is a good time for Exchange to take an outage, based on the number of users connected and the number of RPC requests occurring. Nagios, on the other hand, is extremely well suited to real-time monitoring and alerting. It checks different data points at intervals, and if it finds one out of the accepted range, it tells me.

Anyway, back to the point. I created these graphs to give me detailed information about CIFS, NFS and iSCSI on my NetApp filers. I have used them against FAS270s, FAS2020s and FAS6030s running various OnTAP releases, including 7.3.1, with success against them all.

In addition to the below templates, I use some additional graphs to track other metrics. Those graphs are available here:

As I get time (which is rare) I plan on adding more graphs; when I do, I will post them here. I would like to get and graph ASIS information, WAFL stats, and space information (raw, formatted, usable, allocated, and overhead for the filer as a whole). If anyone has, or knows where to find, these graphs, please let me know!

Read more