My philosophy with Nagios checks, especially for the NetApp, is that unless there are extenuating circumstances, I want all volumes (or whatever is being checked) to be checked equally and at the same time. This means I don’t want to constantly add and remove checks from Nagios as volumes are added, deleted, and modified. I would much rather have one check that covers all of the volumes and reports on them en masse. That way I don’t have to think about the check itself, only about what it’s checking.
One of the many things that I regularly monitor on our multitude of NetApp systems is snapshots. We have had issues, especially with LUNs, where the snapshots have gotten out of control.
To prevent this (or at least to give whoever is watching the screen a chance to notice), I wrote a quick script that checks whether the total size of snapshots on a volume exceeds the snap reserve. Since not all of our volumes have a snap reserve, I also added the ability to check the size of the snapshots against a percentage of the free space left in the volume.
This last measure is a little strange, but I think it works fairly well. Take, for example, a 100GB volume. If it is 50% full (50GB), there is no snap reserve, and the alert percentage is left at the default of 40% of free space, then the alert will fire when snapshots exceed a little over 14GB. “But that’s not 40% of the free space”, I hear you saying. Ahhh, but it is. As the snapshots grow there is less free space, so the same snapshots take up a larger percentage of what remains. At 15GB of snapshots there would be 35GB of free space, and 40% of 35GB is 14GB, so the threshold has already been crossed.
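The moving threshold described above can be worked out directly. Here is a minimal sketch, assuming every GB of snapshot data comes straight out of the volume's free space: if F is the free space with no snapshots and p is the threshold fraction, the check `snapsize > free * p` first trips at p·F / (1 + p). The helper name `alert_point` is hypothetical and is not part of the script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of check_na_snaps.pl): given the free space a
# volume would have with no snapshots, and the threshold as a fraction, return
# the snapshot size at which "snapsize > free * threshold" first holds.
# Free space is F - S once S of snapshots exist, so the check trips when
# S > threshold * (F - S), i.e. S > threshold * F / (1 + threshold).
sub alert_point {
    my ($free_no_snaps, $threshold) = @_;
    return $threshold * $free_no_snaps / (1 + $threshold);
}

# The example from the text: 100GB volume, 50GB of data, 40% threshold.
printf "Alert fires once snapshots exceed %.2fGB\n", alert_point(50, 0.40);

# Show the limit moving as snapshots grow.
foreach my $snap (5, 10, 14, 15, 20) {
    my $free  = 50 - $snap;      # GB left after data and snapshots
    my $limit = $free * 0.40;    # what the check compares against
    printf "snaps=%2dGB free=%2dGB limit=%5.2fGB %s\n",
        $snap, $free, $limit, ($snap > $limit ? 'WARNING' : 'OK');
}
```

Running this with a different threshold (for example 0.20) shows how quickly the trip point drops as the percentage is lowered.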
This causes the alerts to fire earlier than you might expect at first. You can change the check to use a percentage of the total space in the volume if you like, but at that point why not just set a snap reserve? I wrote the script this way to try to keep a little more free space in the volume without making a snap reserve mandatory.
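To see how the two approaches differ, here is a minimal sketch that applies both rules to the same numbers. The sub names are hypothetical, and the total-size rule is only an illustration of the alternative mentioned above; the script itself implements the free-space rule only. Wiring the alternative into the script would also need the volume's total size from the filer, which is an assumption to verify against the SDK documentation.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Rule used by the script: compare snapshots to a share of current free space.
sub exceeds_free_share {
    my ($snapsize, $free, $fraction) = @_;
    return $snapsize > $free * $fraction ? 1 : 0;
}

# Hypothetical alternative: compare snapshots to a share of the total volume size.
sub exceeds_total_share {
    my ($snapsize, $total, $fraction) = @_;
    return $snapsize > $total * $fraction ? 1 : 0;
}

# Same example volume as in the text, all sizes in GB.
my ($total, $data, $snaps) = (100, 50, 15);
my $free = $total - $data - $snaps;

printf "free-space rule (40%% of %dGB free):  %s\n",
    $free, exceeds_free_share($snaps, $free, 0.40) ? 'WARNING' : 'OK';
printf "total-size rule (40%% of %dGB total): %s\n",
    $total, exceeds_total_share($snaps, $total, 0.40) ? 'WARNING' : 'OK';
```

With these numbers the free-space rule has already tripped while the total-size rule would stay quiet until snapshots passed 40GB, which is the "earlier than you might expect" behaviour.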
One last word: please keep in mind that this script does not check whether a volume is filling up; you should have other checks for that. It only checks whether snapshots have exceeded a threshold of space in the volume, to keep them from taking up too much room.
Bring on the Perl…
    #!/usr/bin/perl -w
    #
    # check_na_snaps.pl - written by Andrew Sullivan, 2010-07-18
    #
    # Please report bugs and request improvements at http://get-admin.com/blog/?p=1061
    #
    # This is a Nagios script that checks a NetApp for volumes that have total snapshot
    # space consumed that exceeds the snap reserve, or exceeds the user specified
    # percentage of free space.
    #
    # Options are:
    #   --hostname|-H  = (mandatory) hostname or IP of NetApp to connect to
    #   --username|-u  = (mandatory) username to connect with
    #   --password|-p  = (optional) password for user, will be prompted if not supplied
    #   --protocol|-P  = (optional) Currently only HTTP and HTTPS are available, HTTP
    #                    is the default
    #   --freespace|-f = (optional) Integer. the amount of free space to use as a measure to
    #                    determine if a snapshot is too large for vols that have
    #                    no snap reserve set. The default is 40%
    #
    # Examples:
    #   Basic check, no frills, 40% threshold
    #     check_na_snaps.pl --hostname toaster1 --username notRoot --password '$ecr3t'
    #
    #   Check using HTTPS, 20% threshold
    #     check_na_snaps.pl -H toaster1 -u notRoot -p '$ecr3t' -P 'HTTPS' -f 20
    #
    # The OnTAP SDK is available here: http://communities.netapp.com/docs/DOC-1365

    use strict;
    use Getopt::Long qw(:config no_ignore_case);

    use NaServer;
    use NaElement;

    # make sure you change this path to match your environment
    use lib "/usr/lib64/nagios/plugins";
    use utils qw(%ERRORS &print_revision &support &usage);

    my $PROGNAME = "check_na_snaps.pl";

    main( );

    sub main {
        my $opts = parse_options();

        my $server = getFiler(
            $opts->{ 'hostname' },
            $opts->{ 'username' },
            $opts->{ 'password' },
            $opts->{ 'protocol' }
        );

        my $volumes = {};
        my $errors = [];

        my $request = NaElement->new('volume-list-info');
        my $result = $server->invoke_elem($request);

        if ($result->results_errno != 0) {
            print STDERR 'volume-list-info failed! Reason: ' . $result->results_reason() . "\n";
            exit $ERRORS{'UNKNOWN'};
        }

        # let's begin by looping through the volumes....
        foreach my $volume ( $result->child_get('volumes')->children_get() ) {
            my $name = $volume->child_get_string('name');

            # we only want volumes that are online
            if ($volume->child_get_string('state') eq "online") {

                # we need some info, namely the amount of free space and the amount of space
                # reserved for snapshots. The latter is returned in 1024 byte blocks, so let's
                # normalize on bytes...
                $volumes->{ $name }->{ 'free' } = $volume->child_get_string('size-available');
                $volumes->{ $name }->{ 'snapreserve' } =
                    $volume->child_get_string('snapshot-blocks-reserved') * 1024;

                # now we need to get info on all the snapshots. Unfortunately there is no way
                # that I'm aware of to return simply the total size of all snaps, so we have
                # to do it the long way
                $request = NaElement->new('snapshot-list-info');
                $request->child_add_string('terse', 'false');
                $request->child_add_string('volume', $name);

                my $snapshots = $server->invoke_elem($request);

                if ($snapshots->results_errno != 0) {
                    print STDERR 'snapshot-list-info failed! Reason: ' . $snapshots->results_reason() . "\n";
                    exit $ERRORS{'UNKNOWN'};
                }

                my $snapsize = 0;

                # ...the long way being loop through each one, look for the oldest and get its
                # cumulative total data. This may take a while, so depending on the timeout
                # for your checks, this may not always finish in time
                my $snaplist = $snapshots->child_get('snapshots');

                # guard against a missing <snapshots> element before counting its children
                if ( defined($snaplist) && scalar($snaplist->children_get()) > 0 ) {
                    # make the "first" time impossible to achieve so that it always gets
                    # pushed out on the first iteration
                    my $snaptime = 999999999999;

                    foreach my $snap ( $snaplist->children_get() ) {
                        if ($snap->child_get_int('access-time') < $snaptime) {
                            $snaptime = $snap->child_get_string('access-time');
                            $snapsize = $snap->child_get_string('cumulative-total');
                        }
                    }

                    # snapsize is also returned in 1024 byte blocks, so let's normalize again
                    $snapsize *= 1024;
                }

                $volumes->{ $name }->{ 'snapsize' } = $snapsize;
            }
        }

        # now that we have the information for all volumes, check to see if we have any violators
        foreach my $vol_name ( keys( %{ $volumes } ) ) {
            my $volume = $volumes->{ $vol_name };

            # is snap reserve turned on?
            if ( $volume->{ 'snapreserve' } > 0 ) {
                # yes, check to see if the snapshots are greater than snapreserve
                if ($volume->{ 'snapsize' } > $volume->{ 'snapreserve' }) {
                    push(@$errors, "$vol_name reserve (" . printableSize($volume->{ 'snapreserve' }) .
                        ") < consumed (" . printableSize($volume->{'snapsize'}) . "); ");
                }
            } else {
                # no reserve, check the snap to free space ratio
                if ( $volume->{ 'snapsize' } > ($volume->{ 'free' } * $opts->{ 'freespace' }) ) {
                    push(@$errors, "$vol_name snapsize (" . printableSize($volume->{ 'snapsize' }) .
                        ") > " . ($opts->{ 'freespace' } * 100) . "% free (" .
                        printableSize($volume->{ 'free' } * $opts->{ 'freespace' }) . "); ");
                }
            }
        }

        # if any volumes have snaps out of line, alert the user and exit with a warning
        if (scalar(@$errors) > 0) {
            print "WARNING - ";
            print $_ for @$errors;
            print "\n";
            exit $ERRORS{'WARNING'};
        } else {
            print "OK - No volumes have overly large snapshots\n";
            exit $ERRORS{'OK'};
        }
    }

    sub getFiler {
        my ($hostname, $username, $password, $protocol) = @_;

        my $s = NaServer->new($hostname, 1, 3);

        $s->set_style('LOGIN');
        $s->set_admin_user($username, $password);
        $s->set_transport_type($protocol);

        return $s;
    }

    sub parse_options {
        my %options = (
            'hostname'  => '',
            'username'  => '',
            'password'  => '',
            'protocol'  => 'HTTP',
            'freespace' => '40',
            'help'      => 0
        );

        # GetOptions needs a reference to the hash in order to fill it in
        GetOptions( \%options,
            'hostname|H=s',
            'username|u=s',
            'password|p:s',
            'protocol|P:s',
            'freespace|f:i',
            'help|h'
        );

        # convert the integer percentage into a fraction
        $options{ 'freespace' } = $options{ 'freespace' } / 100;

        if (! $options{ 'hostname' } || ! $options{ 'username' } || ! $options{ 'password' } || $options{ 'help' }) {
            print "Missing or invalid options!\n\n";
            print_usage();
            exit $ERRORS{'UNKNOWN'};
        }

        # main() dereferences the result, so hand back a reference
        return \%options;
    }

    # print_usage() is called above but no definition appears in the listing;
    # this minimal version is an assumption based on the options in the header.
    sub print_usage {
        print "Usage: $PROGNAME -H <hostname> -u <username> -p <password> [-P HTTP|HTTPS] [-f <percent>]\n";
    }

    sub printableSize {
        my $size = shift;
        my $size_string;

        # Bytes
        if ($size / 1024 < 1 ) {
            $size_string = sprintf("%d bytes", $size);

        # Kilo bytes
        } elsif ($size / (1024**2) < 1 ) {
            $size_string = sprintf("%4.2fKB", $size / 1024);

        # Mega bytes
        } elsif ($size / (1024**3) < 1 ) {
            $size_string = sprintf("%4.2fMB", $size / (1024**2));

        # Giga bytes
        } elsif ($size / (1024**4) < 1 ) {
            $size_string = sprintf("%4.2fGB", $size / (1024**3));

        # Tera bytes
        } elsif ($size / (1024**5) < 1 ) {
            $size_string = sprintf("%4.2fTB", $size / (1024**4));

        # Peta bytes
        } elsif ($size / (1024**6) < 1 ) {
            $size_string = sprintf("%4.2fPB", $size / (1024**5));

        } else {
            $size_string = sprintf("%d bytes", $size);
        }

        return $size_string;
    }
About time you posted this one!
Seriously folks, we were in a bad place. Our predecessor’s sloppy habits caught up with us all at once. This little script saved our bacon more than once.
hi
I get a
String found where operator expected at ./check_na_snaps.pl line 163, near "'help'"
(Missing semicolon on previous line?)
syntax error at ./check_na_snaps.pl line 163, near "'help'"
Execution of ./check_na_snaps.pl aborted due to compilation errors.
can somebody help?
branjo
I get the same error also, any help?
String found where operator expected at ./check_na_snaps.pl line 163, near "'help'"
(Missing semicolon on previous line?)
syntax error at ./check_na_snaps.pl line 163, near "'help'"
Execution of ./check_na_snaps.pl aborted due to compilation errors.
wp,
There is a missing comma at the end of the previous line. I have corrected the script in the post.
Thanks for reading!
Andrew
Hi,
thanks for posting this!
Please add a licence so we know whether this code is GPL (in which case I would put it on GitHub) or something else.
I have a patch that adds multiline output for OK states…
    --- check_na_snaps.pl 2012-05-09 14:38:30.000000000 +0200
    +++ check_na_snaps.pl.orig 2012-05-10 19:23:01.000000000 +0200
    @@ -48,7 +48,6 @@
         my $volumes = {};
         my $errors = [];
    -    my $ok = [];
         my $request = NaElement->new('volume-list-info');
         my $result = $server->invoke_elem($request);
    @@ -119,9 +118,6 @@
                 if ($volume->{ 'snapsize' } > $volume->{ 'snapreserve' }) {
                     push(@$errors, "$vol_name reserve (" . printableSize($volume->{ 'snapreserve' }) .
                         ") < consumed (" . printableSize($volume->{'snapsize'}) . "); ");
    -            } else {
    -                push(@$ok, "$vol_name reserve (" . printableSize($volume->{ 'snapreserve' }) .
    -                    ") >= consumed (" . printableSize($volume->{'snapsize'}) . "); ");
                 }
             } else {
                 # no reserve, check the snap to free space ratio
    @@ -129,10 +125,6 @@
                     push(@$errors, "$vol_name snapsize (" . printableSize($volume->{ 'snapsize' }) .
                         ") > " . ($opts->{ 'freespace' } * 100) . "% free (" .
                         printableSize($volume->{ 'free' } * $opts->{ 'freespace' }) . "); ");
    -            } else {
    -                push(@$ok, "$vol_name snapsize (" . printableSize($volume->{ 'snapsize' }) .
    -                    ") <= " . ($opts->{ 'freespace' } * 100) . "% free (" .
    -                    printableSize($volume->{ 'free' } * $opts->{ 'freespace' }) . "); ");
                 }
             }
         }
    @@ -142,11 +134,9 @@
             print "WARNING - ";
             print $_ for @$errors;
             print "\n";
    -        print "OK - $_\n" for @$ok;
             exit $ERRORS{'WARNING'};
         } else {
             print "OK - No volumes have overly large snapshots\n";
    -        print "OK - $_\n" for @$ok;
             exit $ERRORS{'OK'};
         }
     }