Perl Toolkit: Check ESX(i) host time

I had an issue recently where a single ESXi host’s clock was incorrect. The administrator had never set the clock initially, so NTP never kept it in sync cause it was too far off to begin.

Since I’ve got a large number of hosts and the idea of clicking to each one through VI Client and checking the configuration tab, I immediately turned to PowerCLI. Naturally, one of Luc‘s scripts was the top search result.

That solved my immediate need to check the hosts, but I also wanted to setup some general monitoring. Since my monitoring infrastructure is compromised, primarily, of a linux Nagios host, that means PowerCLI couldn’t help. So, I did the next best thing and ported Luc’s script to perl.

Below is the result of that porting. It can also be run from vMA for reporting via email or another mechanism.

Read more

Nagios: Checking for abnormally large NetApp snapshots

My philosophy with Nagios checks, especially with the NetApp, is that unless there are extenuating circumstances then I want all volumes (or whatever may be being checked) to be checked equally and at the same time. This means I don’t want to have to constantly add and remove checks from Nagios as volumes are added, deleted and modified. I would much rather have one check that checks all of the volumes and reports on them en masse. This means I don’t have to think about the check itself, but rather, only what it’s checking.

One of the many things that I regularly monitor on our multitude of NetApp systems is snapshots. We have had issues, especially with LUNs, where the snapshots have gotten out of control.

In order to prevent this, or at least hope that someone is watching the screen…, I wrote a quick script that checks to see if the total size of snapshots on a volume exceed the snap reserve. Since not all of our volumes have a snap reserve, I also put in the ability to check the size of the snaps against the percentage of free space left in the volume.

This last measure is a little strange, but I think it works fairly well. Take, for example, a 100GB volume. If it is 50% full (50GB), there is no snap reserve and the alert percentage is left at the default of 40% free space, then the alert will happen when snapshots exceed about 15GB. “But that’s not 40% of the free space”, I hear you saying. Ahhh, but it is…you see as the snapshot(s) grow, there is less free space, which means that it takes a larger percentage as the free space shrinks. So at 15GB of snapshots, there would be 35GB of free space, and 40% of 35GB is 14GB.

This causes the alerts to happen earlier than you may expect at first. You can adjust this number to be a percentage of the total space in the volume if you like…however, why not just set a snap reserve at that point? I chose to make the script this way in order to attempt to keep a little more free space in the volume, while not making a snap reserve mandatory.

One last word…please keep in mind this script does not check for a volume being filled, you should have other checks for that. This merely checks to see if snapshots have exceeded a threshold of space in the volume to prevent them from taking up too much space.

Bring on the Perl…

Read more

Homemade remote CLI for NetApp

Security is one of those things that everyone knows they need to do, but it rarely gets done to the level that it should be.  This, at least in my experience, is primarily because security makes general, day-to-day tasks more difficult.  Take, for instance, rsh.  Rsh by itself is a great time saver…admit it…it’s great to just be able to execute commands from your admin host and have the results returned back.  You can parse them however you like using standard operating system tools like grep, awk, and sed, and best of all (or perhaps worst…) you don’t have to type the password repeatedly.

However, all of the benefits of rsh can be realized using ssh, it just takes a little more setup.  But, I’m not going to get into that today.  What if you just want a way to securely execute commands against your NetApp without consuming the sole connection to your your filer via ssh (you have telnet and rsh disabled, right?).  What if you don’t want to enable ssh, telnet, or rsh but still want to have a pseudo command line?  Assuming you have SSL (https) access enabled, you can use the Perl SDK to access, and execute commands against, your filer almost like you were telnet/ssh’d into it.

The magic comes from the undocumented system-cli SDK command.  It allows you to execute almost any command just as though you were sitting at the console.

The great part is that with this, you can accomplish probably 99% or more of all tasks having only one access method enabled to your NetApp: the https/ssl option.  SSH, RSH, telnet and HTTP can all be disabled.

I say almost because there are two types of commands that do not work using the below Perl script.  The first type is non-terminating commands.  These, at least off the top of my head, are primarily the stats show commands with the –i option specified.  With the –i option, the stats command repeats every number of seconds specified.  Now, the caveat to this is that you can also specify a –c option that limits the number of occurrences to the number specified.  The downside to this is that if you issue a command like stats show –i 5 –c 5 volume:*:read_ops then the command will take 25 seconds, at which point the results, as a whole, will be returned.

This also applies to issuing man commands.  Man will not return (at least with the simulator) to STDOUT, so system-cli doesn’t capture the output.

So, without any more pontificating by me, here is some sample output and the script.  If you would like to see additional examples, let me know in the comments.

Read more

Really NetApp?!? You didn’t use your own SDK?

So, this post irked me. Not because of the poster or his post (honest Andy, if you ever read this, I have nothing against you or your post! I’m happy to see another VMware/NetApp blogger!), but because of the script he referenced and the problem encountered. He has a good solution, but the problem shouldn’t exist.

You see, I hate RSH. I don’t know why (well, it is quite insecure, and it can require some configuration), but I hate it. SSH is only marginally better in this case…sure it’s secure, but you have to auth each time, and if you don’t (ssh keys), well, it’s only a little better than RSH (comms are encrypted, but compromise of a single account can lead to bad things on many hosts). The script that is referenced, one that NetApp recommends that admins use to verify that their aggregates have enough free space to hold the metadata for the volumes in OnTAP 7.3 (the metadata gets moved from the volumes to the aggregate in 7.3), uses RSH to execute commands that are then parsed in a somewhat rudimentary way to get information.

Sure, it’s effective, but it’s far from graceful…especially when you have a perfectly good and effective SDK at your disposal.

I was kind of bored, so I decided to rewrite the script using the SDK. This is the end result. It reports the same data, but uses the SDK to gather all of the necessary information to make a determination for the user. The new script is significantly shorter (10KB vs 25KB, 380 lines vs 980), and it requires only one login.

Thanks to NetApp for providing their SDK, and I hope that no one over there minds me refactoring…

Read more

Perl OnTAP SDK: Realtime Multiprotocol Volume Latency

Update 2009-07-21: With some help from Steffen, a bug was found where the script wasn’t returning any values in the result hash when the toaster didn’t return values for certain queries. This caused Perl to print errors when it was trying to do math on non-existent values. Starting at line 273, the script has been updated so that the hash returned by the subroutine that does the ZAPI query has default values of zero, which should eliminate the errors seen by Steffen. Please let me know of any other problems encountered! (and thanks to Steffen for finding this bug!)

My previous post only prints NFS latency for the NetApp as a whole, it doesn’t give any information about a specific volume. Some of my ESX hosts use iSCSI for their datastores, and because the NetApp has many iSCSI clients, looking at iSCSI stats for the filer as a whole didn’t help me very much.

The solution was this script. It is a significantly modified version of the previous script that is capable of showing the realtime latency for all protocols: NFS, CIFS, SAN (which I believe is all block level ops summarized), FCP and iSCSI. It also displays the three different types of operations for each protocol: read, write, and other.

The script, if invoked with nothing more than the connection information, will display the read, write, and “other” latency and operations for the total of all protocols. There is a fourth column as well, which shows the average latency and total operations across all operation types (r/w/o).

This script has proven quite beneficial for me. By monitoring CIFS latency during peak hours on the volume that contains shares for profiles, I have proven that the reason logins can take a significant amount of time is due to the use of high capacity, but very slow, SATA disks and not the network or desktops themselves. I’ve also been able to prove that one of our iSCSI volumes was “slow” due to bandwidth, and not spindle count (interestingly, the problem with this volume is the I/O request size…the app makes large requests which chokes bandwidth before available IOP/s runs out).

The OnTAP SDK is quite powerful, Glenn and I are quickly discovering that anything possible in FilerView and/or DFM/OpsMgr is doable through the SDK.

Read more

Perl OnTAP SDK: Realtime NFS Latency

Since most of my Virtual Infrastructure runs on NFS datastores I like to keep a very close eye on what’s going on inside the NetApp. I generally use Cacti for long term monitoring of the status of the datastores and the NetApp as a whole.

However, when I want to see what’s going on in less than five minute increments, Cacti is pretty much useless. I wrote this script a while ago so that if I feel that latency is becoming a problem, I can check it right away and see it in frequent intervals.

Most often I use this script when Nagios starts to chirp at me. I use a slightly modified version of this script with Nagios and have it alert when latency gets out of hand. I then use this script to get a good look at what’s going on.

The OnTAP SDK is almost as entertaining to work with as VMware’s…

Read more

Perl Toolkit: pNIC to vSwitch information

Another itch to scratch: which vSwitch is a pNIC connected to? To solve this simple problem I created a quick perl script…

This script also lets me see the driver in use, connection speed and duplex setting, and the MAC address of the pNIC.

Read more

Perl Toolkit: Portgroup type information

I wanted get a list of port groups and their type (kernel, console, virtual machine) from a series of hosts, however the only thing I could find that was even close was a POSH script in the VMTN forums that was posted by LucD.

Using that script for inspiration, I essentially duplicated the functionality, but using the perl toolkit. This script gives me an easy to read (and parse…) list of portgroups, the vSwitch they belong to, and the type.

Read more

Perl Toolkit: NFS snapshot fix via rCLI

I dislike having to SSH into each host I am responsible for, and I detest having to enable SSH on ESXi (there should be NO reason for me to have to enable it). Because it’s difficult to script applying the NFS snapshot fix to a lot of hosts using the SSH method (and impossible if you don’t enable it on ESXi), I fooled around with the command that is provided with the rCLI.

I discovered that I can pull certain configuration files for the host using the command, modify them, then replace the configuration file…all without having to SSH to the host! has an excellent list of files available using this method.

All of the commands I use in the below script are available when the rCLI is installed (the rCLI also installs the perl toolkit, so all those “sample” scripts are available to us).

My windows scripting skills are non-existent, so I don’t know how to write a wrapper around the rCLI commands like I can with bash, but these same commands will work if you are using rCLI installed on Windows.

Read more

Perl Toolkit: Adjust Active/Standby NICs for Virtual Switches and Port Groups

I had the need to change the configuration of my ESX hosts so that the virtual switches had a single active and single standby adapter assigned to them. The reason for the need is rather irritating (the IBM I/O modules that we have in these particular blade centers are not really designed to handle a high amount of traffic), and it was causing some issues during vMotions.

This script allows me set the vmnics assigned to a vswitch to the desired active/standby configuration, and additionally allows me to set the port group’s vmnic active/standby policy. In my setup, I use two vSwitches, one for primary COS, vMotion and IP storage, and a second vSwitch for the virtual machines and secondary COS, each vSwitch has two NICs assigned (remember, they’re blade centers…limited network connectivity). In order to avoid vMotion taking all the bandwidth for storage I wanted to separate their traffic onto different NICs, but still provide redundancy.

The way that I accomplish this is by making the default for the vSwitch have, for example, vmnic0 active and vmnic2 standby. I then adjust the vMotion port group so that it has the opposite (vmnic2 active and vmnic0 standby). Redundancy is still present in the event of a NIC failure, but under normal circumstances, the traffic is separate.

Read more