Nagios: Checking for abnormally large NetApp snapshots

My philosophy with Nagios checks, especially for the NetApp, is that unless there are extenuating circumstances, I want all volumes (or whatever is being checked) to be checked equally and at the same time. I don’t want to constantly add and remove checks from Nagios as volumes are added, deleted and modified. I would much rather have one check that covers all of the volumes and reports on them en masse. That way I don’t have to think about the check itself, only about what it’s checking.

One of the many things that I regularly monitor on our multitude of NetApp systems is snapshots. We have had issues, especially with LUNs, where the snapshots have gotten out of control.

To prevent this (or at least to raise an alert in the hope that someone is watching the screen), I wrote a quick script that checks whether the total size of the snapshots on a volume exceeds the snap reserve. Since not all of our volumes have a snap reserve, the script can also compare the size of the snapshots against a percentage of the free space left in the volume.
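To make the two cases concrete, here is a minimal sketch of the decision logic, not the script itself. The volume names and sizes are made-up sample data, and `snapshot_problem` and `check_volumes` are hypothetical helper names; only the hash keys (`snapsize`, `snapreserve`, `free`) and the message wording follow the real script.

```perl
use strict;
use warnings;

# Hypothetical sample data, sizes in GB for readability.
# The real script would fill a similar hash from the filer.
my $volumes = {
    vol_luns => { snapreserve => 20, snapsize => 25, free => 40 },
    vol_home => { snapreserve => 0,  snapsize => 15, free => 35 },
    vol_logs => { snapreserve => 0,  snapsize => 5,  free => 60 },
};

# Return a problem description for one volume, or nothing if it looks fine.
sub snapshot_problem {
    my ($name, $vol, $pct) = @_;
    if ($vol->{snapreserve} > 0) {
        # A reserve exists: snapshots should fit inside it.
        return "$name reserve ($vol->{snapreserve}GB) < consumed ($vol->{snapsize}GB)"
            if $vol->{snapsize} > $vol->{snapreserve};
    } else {
        # No reserve: compare snapshots to a fraction of the free space.
        my $limit = $vol->{free} * $pct;
        return "$name snapsize ($vol->{snapsize}GB) > " . ($pct * 100) . "% free (${limit}GB)"
            if $vol->{snapsize} > $limit;
    }
    return;
}

# One pass over every volume; returns a Nagios status code and a message.
sub check_volumes {
    my ($vols, $pct) = @_;
    my @errors = map { snapshot_problem($_, $vols->{$_}, $pct) } sort keys %$vols;
    return (1, 'WARNING - ' . join('; ', @errors)) if @errors;
    return (0, 'OK - No volumes have overly large snapshots');
}

my ($status, $message) = check_volumes($volumes, 0.40);
print "$message\n";
# A real plugin would finish with: exit $status;
```

With this sample data, `vol_luns` trips the reserve test and `vol_home` trips the free-space test, while `vol_logs` stays quiet. Because the loop walks whatever is in the hash, adding or removing a volume needs no change to the check.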

This last measure is a little strange, but I think it works fairly well. Take, for example, a 100GB volume that holds 50GB of data, has no snap reserve, and uses the default alert threshold of 40% of free space. The alert will fire once snapshots exceed a little over 14GB. “But that’s not 40% of the free space,” I hear you saying. Ahhh, but it is. As the snapshots grow, there is less free space, so the same snapshots make up a larger percentage of what remains. At 15GB of snapshots there would be 35GB of free space, and 40% of 35GB is 14GB, so the check has already tripped.
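If you want the exact tipping point rather than a ballpark, it can be worked out directly. This is only a sketch of the arithmetic; `snapshot_alert_point` is a hypothetical helper, not part of the script, and it assumes a volume with no snap reserve where free space is simply size minus data minus snapshots.

```perl
use strict;
use warnings;

# Snapshot size at which snapshots equal $pct of the remaining free space.
# Solving  snap = pct * (size - data - snap)  for snap gives:
#   snap = pct * (size - data) / (1 + pct)
sub snapshot_alert_point {
    my ($size, $data, $pct) = @_;
    return $pct * ($size - $data) / (1 + $pct);
}

# The 100GB volume from the example: 50GB of data, 40% threshold.
my $point = snapshot_alert_point(100, 50, 0.40);
printf "alert once snapshots pass %.1fGB\n", $point;            # 14.3GB
printf "free space at that point: %.1fGB\n", 100 - 50 - $point; # 35.7GB
```

So on that volume the alert trips at roughly 14.3GB of snapshots, well short of the 20GB you would get from 40% of the original 50GB of free space.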

This causes the alerts to happen earlier than you may expect at first. You can adjust this number to be a percentage of the total space in the volume if you like…however, why not just set a snap reserve at that point? I chose to make the script this way in order to attempt to keep a little more free space in the volume, while not making a snap reserve mandatory.

One last word: please keep in mind that this script does not check whether a volume is filling up; you should have other checks for that. It only checks whether snapshots have exceeded a threshold, to keep them from taking up too much space in the volume.

Bring on the Perl…

5 thoughts on “Nagios: Checking for abnormally large NetApp snapshots”

  1. About time you posted this one!

    Seriously, folks, we were in a bad place. Our predecessor’s sloppy habits caught up with us all at once. This little script saved our bacon more than once.

  2. hi
    I get a

    String found where operator expected at ./check_na_snaps.pl line 163, near "'help'"
    (Missing semicolon on previous line?)
    syntax error at ./check_na_snaps.pl line 163, near "'help'"
    Execution of ./check_na_snaps.pl aborted due to compilation errors.

    can somebody help?
    branjo

  3. I get the same error too. Any help?

    String found where operator expected at ./check_na_snaps.pl line 163, near "'help'"
    (Missing semicolon on previous line?)
    syntax error at ./check_na_snaps.pl line 163, near "'help'"
    Execution of ./check_na_snaps.pl aborted due to compilation errors.

  4. Hi,

    thanks for posting this!
    Please add a licence, so we know whether this code is GPL (in which case I would put it on GitHub) or something else.

    I have a patch that adds multiline output for OK states…

    --- check_na_snaps.pl 2012-05-09 14:38:30.000000000 +0200
    +++ check_na_snaps.pl.orig 2012-05-10 19:23:01.000000000 +0200
    @@ -48,7 +48,6 @@

      my $volumes = {};
      my $errors = [];
    - my $ok = [];

      my $request = NaElement->new('volume-list-info');
      my $result = $server->invoke_elem($request);
    @@ -119,9 +118,6 @@
      if ($volume->{ 'snapsize' } > $volume->{ 'snapreserve' }) {
      push(@$errors, "$vol_name reserve (" . printableSize($volume->{ 'snapreserve' }) .
      ") < consumed (" . printableSize($volume->{'snapsize'}) . "); ");
    - } else {
    - push(@$ok, "$vol_name reserve (" . printableSize($volume->{ 'snapreserve' }) .
    - ") >= consumed (" . printableSize($volume->{'snapsize'}) . "); ");
      }
      } else {
      # no reserve, check the snap to free space ratio
    @@ -129,10 +125,6 @@
      push(@$errors, "$vol_name snapsize (" . printableSize($volume->{ 'snapsize' }) .
      ") > " . ($opts->{ 'freespace' } * 100) . "% free (" .
      printableSize($volume->{ 'free' } * $opts->{ 'freespace' }) . "); ");
    - } else {
    - push(@$ok, "$vol_name snapsize (" . printableSize($volume->{ 'snapsize' }) .
    - ") <= " . ($opts->{ 'freespace' } * 100) . "% free (" .
    - printableSize($volume->{ 'free' } * $opts->{ 'freespace' }) . "); ");
      }
      }
      }
    @@ -142,11 +134,9 @@
      print "WARNING - ";
      print $_ for @$errors;
      print "\n";
    - print "OK - $_\n" for @$ok;
      exit $ERRORS{'WARNING'};
      } else {
      print "OK - No volumes have overly large snapshots\n";
    - print "OK - $_\n" for @$ok;
      exit $ERRORS{'OK'};
      }
      }
