Really NetApp?!? You didn’t use your own SDK?

So, this post irked me. Not because of the poster or his post (honest Andy, if you ever read this, I have nothing against you or your post! I’m happy to see another VMware/NetApp blogger!), but because of the script he referenced and the problem encountered. He has a good solution, but the problem shouldn’t exist.

You see, I hate RSH. I don’t know why (well, it is quite insecure, and it can require some configuration), but I hate it. SSH is only marginally better in this case…sure it’s secure, but you have to auth each time, and if you don’t (ssh keys), well, it’s only a little better than RSH (comms are encrypted, but compromise of a single account can lead to bad things on many hosts). The script that is referenced, one that NetApp recommends that admins use to verify that their aggregates have enough free space to hold the metadata for the volumes in OnTAP 7.3 (the metadata gets moved from the volumes to the aggregate in 7.3), uses RSH to execute commands that are then parsed in a somewhat rudimentary way to get information.

Sure, it’s effective, but it’s far from graceful…especially when you have a perfectly good and effective SDK at your disposal.

I was kind of bored, so I decided to rewrite the script using the SDK. This is the end result. It reports the same data, but uses the SDK to gather all of the necessary information to make a determination for the user. The new script is significantly shorter (10KB vs 25KB, 380 lines vs 980), and it requires only one login.

Thanks to NetApp for providing their SDK, and I hope that no one over there minds me refactoring…

Read more

Perl OnTAP SDK: Realtime Multiprotocol Volume Latency

Update 2009-07-21: With some help from Steffen, a bug was found where the script wasn’t returning any values in the result hash when the toaster didn’t return values for certain queries. This caused Perl to print errors when it was trying to do math on non-existent values. Starting at line 273, the script has been updated so that the hash returned by the subroutine that does the ZAPI query has default values of zero, which should eliminate the errors seen by Steffen. Please let me know of any other problems encountered! (and thanks to Steffen for finding this bug!)

My previous post only prints NFS latency for the NetApp as a whole, it doesn’t give any information about a specific volume. Some of my ESX hosts use iSCSI for their datastores, and because the NetApp has many iSCSI clients, looking at iSCSI stats for the filer as a whole didn’t help me very much.

The solution was this script. It is a significantly modified version of the previous script that is capable of showing the realtime latency for all protocols: NFS, CIFS, SAN (which I believe is all block level ops summarized), FCP and iSCSI. It also displays the three different types of operations for each protocol: read, write, and other.

The script, if invoked with nothing more than the connection information, will display the read, write, and “other” latency and operations for the total of all protocols. There is a fourth column as well, which shows the average latency and total operations across all operation types (r/w/o).

This script has proven quite beneficial for me. By monitoring CIFS latency during peak hours on the volume that contains shares for profiles, I have proven that the reason logins can take a significant amount of time is due to the use of high capacity, but very slow, SATA disks and not the network or desktops themselves. I’ve also been able to prove that one of our iSCSI volumes was “slow” due to bandwidth, and not spindle count (interestingly, the problem with this volume is the I/O request size…the app makes large requests which chokes bandwidth before available IOP/s runs out).

The OnTAP SDK is quite powerful, Glenn and I are quickly discovering that anything possible in FilerView and/or DFM/OpsMgr is doable through the SDK.

Read more

Perl OnTAP SDK: Realtime NFS Latency

Since most of my Virtual Infrastructure runs on NFS datastores I like to keep a very close eye on what’s going on inside the NetApp. I generally use Cacti for long term monitoring of the status of the datastores and the NetApp as a whole.

However, when I want to see what’s going on in less than five minute increments, Cacti is pretty much useless. I wrote this script a while ago so that if I feel that latency is becoming a problem, I can check it right away and see it in frequent intervals.

Most often I use this script when Nagios starts to chirp at me. I use a slightly modified version of this script with Nagios and have it alert when latency gets out of hand. I then use this script to get a good look at what’s going on.

The OnTAP SDK is almost as entertaining to work with as VMware’s…

Read more