Update 2009-07-21: With some help from Steffen, a bug was found where the script wasn’t returning any values in the result hash when the toaster didn’t return values for certain queries. This caused Perl to print errors when it was trying to do math on non-existent values. Starting at line 273, the script has been updated so that the hash returned by the subroutine that does the ZAPI query has default values of zero, which should eliminate the errors seen by Steffen. Please let me know of any other problems encountered! (and thanks to Steffen for finding this bug!)
My previous post only prints NFS latency for the NetApp as a whole; it doesn’t give any information about a specific volume. Some of my ESX hosts use iSCSI for their datastores, and because the NetApp has many iSCSI clients, looking at iSCSI stats for the filer as a whole didn’t help me very much.
The solution was this script. It is a significantly modified version of the previous script that is capable of showing the realtime latency for all protocols: NFS, CIFS, SAN (which I believe is all block level ops summarized), FCP and iSCSI. It also displays the three different types of operations for each protocol: read, write, and other.
The script, if invoked with nothing more than the connection information, will display the read, write, and “other” latency and operations for the total of all protocols. There is a fourth column as well, which shows the average latency and total operations across all operation types (r/w/o).
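The per-interval figures come from differencing two samples of cumulative counters: the change in the latency counter divided by the change in the operation counter. A minimal sketch of that arithmetic follows; the helper name `interval_latency` and the sample values are made up for illustration, and the helper only covers the read/write/other counters (the average column uses `avg_latency` and `total_ops` in the same way).

```perl
use strict;
use warnings;

# Average latency per operation between two samples of cumulative counters.
# Returns (latency, ops); both are 0 when no operations occurred in the interval.
sub interval_latency {
    my ($first, $second, $type) = @_;
    my $ops = $second->{"${type}_ops"} - $first->{"${type}_ops"};
    return (0, 0) if $ops <= 0;
    my $latency = ($second->{"${type}_latency"} - $first->{"${type}_latency"}) / $ops;
    return ($latency, $ops);
}

# Hypothetical sample values, for illustration only.
my $first  = { read_latency => 1000, read_ops => 100 };
my $second = { read_latency => 1600, read_ops => 300 };

my ($latency, $ops) = interval_latency($first, $second, 'read');
printf "%.2f ms over %d ops\n", $latency, $ops;   # prints "3.00 ms over 200 ops"
```

Returning zeros when no operations occurred avoids a division by zero on idle volumes, which is the same reason the script compares the operation counters before doing the division.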
This script has proven quite beneficial for me. By monitoring CIFS latency during peak hours on the volume that contains shares for profiles, I have proven that the reason logins can take a significant amount of time is the use of high-capacity, but very slow, SATA disks and not the network or the desktops themselves. I’ve also been able to prove that one of our iSCSI volumes was “slow” due to bandwidth, and not spindle count (interestingly, the problem with this volume is the I/O request size: the app makes large requests, which choke bandwidth before the available IOPS run out).
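To illustrate the request-size point, here is a rough sketch of how few large requests it takes to fill a link. The link speed and request sizes are assumptions chosen for illustration, not measurements from my environment, and protocol overhead is ignored.

```perl
use strict;
use warnings;

# Requests per second a link can carry at a given request size.
sub max_iops_for_link {
    my ($link_bytes_per_sec, $request_bytes) = @_;
    return $link_bytes_per_sec / $request_bytes;
}

my $link = 125_000_000;   # assumed: 1 Gbit/s expressed in bytes per second
for my $kb (4, 64, 256) {
    printf "%3d KB requests: about %.0f requests/s fill the link\n",
        $kb, max_iops_for_link($link, $kb * 1024);
}
```

The larger the request, the fewer operations per second are needed to saturate the link, so a volume can run out of bandwidth while the disks still have operations to spare.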
The OnTAP SDK is quite powerful; Glenn and I are quickly discovering that anything possible in FilerView and/or DFM/OpsMgr is doable through the SDK.
#!/usr/bin/perl -w
#
# na-realtime-vol-latency.pl - written by Andrew Sullivan, 2009-06-23
#
# Please report bugs and request improvements at http://practical-admin.com/blog/?p=763
#
# This script will query a NetApp appliance using the OnTAP SDK to get different
# protocol latencies for the specified volume.
#
#   na-realtime-vol-latency.pl --hostname|-H
#                              --username|-u
#                              --password|-p
#                              --volume|-v
#                              [ --protocol|-P all|nfs|cifs|san|fcp|iscsi ]
#                              [ --interval|-i ]
#                              [ --noread, --nowrite, --noother, --noaverage, --noops ]
#
# Password is optional, if it is not supplied at the command line the script
# will prompt for it.
#
# If "average" is requested (it's turned on by default) it is the average
# latency (and total operations) for all protocols.
#
# By default read, write, other, and average latencies are displayed for
# the specified protocol.  You can turn them off by passing the corresponding
# "--no" option.  If no protocol is specified, "all" is the default (which is
# the sum of all the protocols' respective read/write/other operations).
#
# There is a delay of one interval before the first set of results is
# displayed.  This is perfectly normal.
#
# Examples:
#   na-realtime-vol-latency.pl --hostname toaster --username not_root --interval 10
#
#   Same as above, but with the abbreviated options, and not displaying the average
#   na-realtime-vol-latency.pl -H toaster -u not_root -i 10 --noaverage
#
#   Display only read and write cifs operations/latency
#   na-realtime-vol-latency.pl -H toaster -u not_root -i 10 --noaverage --noother -P cifs
#
# TODO:
#   Find a good way to print multiple volumes.  The initial part is here (the script will
#   accept a comma separated list of volumes, but only uses the first supplied), I
#   just don't know of a good, readable, way to print the data.
#

# I placed the NetApp perl modules into my perl lib directory, however,
# if you haven't done this, you will probably need to specify where they
# are using a lib declaration
#use lib "./NetApp";
use NaServer;
use NaElement;
use Getopt::Long qw(:config no_ignore_case);

main( );

sub main {
    my $opts   = parse_options();
    my $server = get_filer( $opts->{ 'hostname' }, $opts->{ 'username' }, $opts->{ 'password' } );
    my $first  = get_latency( $server, $opts );

    # check to make sure the user provided a valid vol name
    if ( scalar( keys %$first ) == 0 ) {
        print "Volume " . $opts->{'volume'}->[0] . " not found!\n";
        exit(1);
    }

    # retrieve the column headers and the printf formatter for the values
    my ($header, $printf) = build_header( $opts );
    print $header;

    while (1) {
        my $read_latency  = 0;
        my $write_latency = 0;
        my $other_latency = 0;
        my $avg_latency   = 0;
        my $read_ops      = 0;
        my $write_ops     = 0;
        my $other_ops     = 0;
        my $total_ops     = 0;
        my @values;

        sleep $opts->{ 'interval' };
        my $second = get_latency( $server, $opts );

        # check to see if this was a selected value, if so, make sure that we have > 0 ops that
        # occurred and do the simple math to get our values
        if ($opts->{'average'} && $second->{'total_ops'} > $first->{'total_ops'}) {
            $total_ops   = $second->{'total_ops'} - $first->{'total_ops'};
            $avg_latency = ($second->{'avg_latency'} - $first->{'avg_latency'}) / $total_ops;
        }

        # if this was in the above statement, and there were no operations, then the "0" value
        # would not get inserted into the values array
        push(@values, $avg_latency) if $opts->{'average'};
        push(@values, $total_ops)   if $opts->{'ops'} && $opts->{'average'};

        if ($opts->{'read'} && $second->{'read_ops'} > $first->{'read_ops'}) {
            $read_ops     = $second->{'read_ops'} - $first->{'read_ops'};
            $read_latency = ($second->{'read_latency'} - $first->{'read_latency'}) / $read_ops;
        }

        push(@values, $read_latency) if $opts->{'read'};
        push(@values, $read_ops)     if $opts->{'ops'} && $opts->{'read'};

        if ($opts->{'write'} && $second->{'write_ops'} > $first->{'write_ops'}) {
            $write_ops     = $second->{'write_ops'} - $first->{'write_ops'};
            $write_latency = ($second->{'write_latency'} - $first->{'write_latency'}) / $write_ops;
        }

        push(@values, $write_latency) if $opts->{'write'};
        push(@values, $write_ops)     if $opts->{'ops'} && $opts->{'write'};

        if ($opts->{'other'} && $second->{'other_ops'} > $first->{'other_ops'}) {
            $other_ops     = $second->{'other_ops'} - $first->{'other_ops'};
            $other_latency = ($second->{'other_latency'} - $first->{'other_latency'}) / $other_ops;
        }

        push(@values, $other_latency) if $opts->{'other'};
        push(@values, $other_ops)     if $opts->{'ops'} && $opts->{'other'};

        printf $printf, @values;
        print "\n";

        $first = $second;
    }
}

### Supporting subroutines ###

#
# Prints the usage information
#
sub print_usage {
    print <<EOU;
Missing or incorrect arguments!

na-realtime-vol-latency.pl --hostname|-H
                           --username|-u
                           --password|-p
                           --volume|-v
                           [ --protocol|-P all|nfs|cifs|san|fcp|iscsi ]
                           [ --interval|-i ]
                           [ --noread, --nowrite, --noother, --noaverage, --noops ]

na-realtime-vol-latency.pl --help|-h
EOU
}

#
# Defines and parses the options from the command line, also
# checks to make sure that options are valid.  Will prompt for
# a password if one is not provided.
#
sub parse_options {
    my %options = (
        'hostname' => '',
        'username' => '',
        'password' => '',
        'interval' => 30,
        'volume'   => [],
        'protocol' => 'all',
        'average'  => 1,
        'read'     => 1,
        'write'    => 1,
        'other'    => 1,
        'ops'      => 1,
        'help'     => 0
    );

    GetOptions( \%options,
        'hostname|H=s',
        'username|u=s',
        'password|p:s',
        'interval|i:i',
        'volume|v=s',
        'protocol|P:s',
        'average!',
        'read!',
        'write!',
        'other!',
        'ops!',
        'help|h'
    );

    # if the user supplied a comma separated list of volumes to get data for, we split them
    # into an array here...
    @{$options{ 'volume' }} = split( /,/, join( ',', @{$options{ 'volume' }} ) );

    # NOTE: part of this subroutine was lost from the published text (from the
    # volume count check up to the Term::ReadKey import).  The rest of the
    # condition, the 'requested' list and the password prompt below are a
    # reconstruction based on how the rest of the script uses the options.

    # make sure we have a hostname, username, at least 1 volume and protocol is a valid
    # value.  if any of those are not true, or the user requested help, print the help.
    if ( ! $options{ 'hostname' } || ! $options{ 'username' } ||
         scalar( @{$options{ 'volume' }} ) < 1 ||
         $options{ 'protocol' } !~ /^(?:all|nfs|cifs|san|fcp|iscsi)$/ ||
         $options{ 'help' } ) {
        print_usage();
        exit(1);
    }

    # the displays that were requested, in the order that main() prints them
    foreach my $r ( qw(average read write other) ) {
        push( @{$options{ 'requested' }}, $r ) if $options{ $r };
    }

    # prompt for the password if it was not supplied, without echoing it
    if ( ! $options{ 'password' } ) {
        print "Enter password: ";

        if ( eval { require Term::ReadKey; 1 } ) {
            Term::ReadKey->import( qw(ReadMode) );
            Term::ReadKey->import( qw(ReadLine) );

            ReadMode('noecho');
            chomp( $options{ 'password' } = ReadLine(0) );
            ReadMode('normal');
        } else {
            system("stty -echo") and die "ERROR: stty failed\n";
            chomp( $options{ 'password' } = <STDIN> );
            system("stty echo") and die "ERROR: stty failed\n";
        }

        print "\n";
    }

    return \%options;
}

#
# Creates the NaServer object
#
sub get_filer {
    my ($hostname, $username, $password) = @_;

    my $s = NaServer->new($hostname, 1, 7);
    $s->set_style(LOGIN_PASSWORD);
    $s->set_admin_user($username, $password);
    $s->set_transport_type(NA_SERVER_TRANSPORT_HTTP);

    return $s;
}

#
# Queries the NetApp and returns a hash of responses
#
sub get_latency {
    my ($server, $opts) = @_;

    # Initialize the return hash so that it contains values.
    # This prevents a bug found by Steffen Nagel where perl will
    # display errors when trying to do math on values that don't exist
    # because they were not returned by the toaster.
    # See http://practical-admin.com/blog/?p=763#comment-715
    my $return = {
        'avg_latency'   => 0,
        'total_ops'     => 0,
        'read_latency'  => 0,
        'read_ops'      => 0,
        'write_latency' => 0,
        'write_ops'     => 0,
        'other_latency' => 0,
        'other_ops'     => 0
    };

    my $request = build_request( $opts );
    my $result  = $server->invoke_elem( $request );

    if ( scalar( $result->child_get('instances')->children_get() ) > 0 ) {
        foreach ($result->child_get('instances')->child_get('instance-data')->child_get('counters')->children_get()) {
            my $name = $_->child_get_string('name');

            # strip off the protocol name so that they are always the same, this makes
            # it easy to retrieve them during display back because we don't have to
            # account for the different names
            $name =~ s/(?:nfs|cifs|san|fcp|iscsi)_//;

            $return->{ $name } = $_->child_get_string('value');
        }
    }

    return $return;
}

#
# creates the NaElement tree for querying latency
#
sub build_request {
    my $opts = shift;

    # each of the different protocols on the volume has its own read, write
    # and other counters.  Average is the odd-man-out, as it is a summary of
    # all of the protocols, so we compensate with a check in the loop below
    my $counter_names = {
        'average' => [ 'avg_latency', 'total_ops' ],
        'read' => {
            'all'   => [ 'read_latency',       'read_ops' ],
            'nfs'   => [ 'nfs_read_latency',   'nfs_read_ops' ],
            'cifs'  => [ 'cifs_read_latency',  'cifs_read_ops' ],
            'san'   => [ 'san_read_latency',   'san_read_ops' ],
            'fcp'   => [ 'fcp_read_latency',   'fcp_read_ops' ],
            'iscsi' => [ 'iscsi_read_latency', 'iscsi_read_ops' ]
        },
        'write' => {
            'all'   => [ 'write_latency',       'write_ops' ],
            'nfs'   => [ 'nfs_write_latency',   'nfs_write_ops' ],
            'cifs'  => [ 'cifs_write_latency',  'cifs_write_ops' ],
            'san'   => [ 'san_write_latency',   'san_write_ops' ],
            'fcp'   => [ 'fcp_write_latency',   'fcp_write_ops' ],
            'iscsi' => [ 'iscsi_write_latency', 'iscsi_write_ops' ]
        },
        'other' => {
            'all'   => [ 'other_latency',       'other_ops' ],
            'nfs'   => [ 'nfs_other_latency',   'nfs_other_ops' ],
            'cifs'  => [ 'cifs_other_latency',  'cifs_other_ops' ],
            'san'   => [ 'san_other_latency',   'san_other_ops' ],
            'fcp'   => [ 'fcp_other_latency',   'fcp_other_ops' ],
            'iscsi' => [ 'iscsi_other_latency', 'iscsi_other_ops' ]
        }
    };

    my $request = NaElement->new('perf-object-get-instances');
    $request->child_add_string('objectname', 'volume');

    my $counters = NaElement->new('counters');

    foreach my $r ( @{$opts->{'requested'}} ) {
        if ($r eq "average") {
            foreach ( @{ $counter_names->{'average'} } ) {
                $counters->child_add_string('counter', $_);
            }
        } else {
            foreach ( @{$counter_names->{ $r }->{ $opts->{'protocol'} }} ) {
                $counters->child_add_string('counter', $_);
            }
        }
    }

    $request->child_add( $counters );

    my $instances = NaElement->new('instances');
    $instances->child_add_string('instance', $opts->{'volume'}->[0]);

    $request->child_add( $instances );

    return $request;
}

#
# Builds the "header" line for display back and returns the printf string that
# we will use to print the values returned from our SDK queries in a nicely
# formatted manner
#
sub build_header {
    my $opts = shift;

    # the base formatters for the header, note that here is where the pipe
    # character that separates the columns when the ops are displayed is
    # added in
    my $head_lat_printf = '%14s';
    my $head_ops_printf = '%10s |';
    my $head_lat_sep    = "-" x 14;
    my $head_ops_sep    = "-" x 10;

    my @header_printf;
    my @header_values;
    my @separator;

    # the column header title values
    my $header_titles = {
        #              Value Title    Ops Title
        'average' => [ 'Avg Lat',     'Total Ops' ],
        'read'    => [ 'Read Lat',    'Read Ops'  ],
        'write'   => [ 'Write Lat',   'Write Ops' ],
        'other'   => [ 'Other Lat',   'Other Ops' ]
    };

    # the base formatters for the values
    my $lat_printf = '%11.2f ms';
    my $ops_printf = '%10i |';

    my @values_printf;

    # loop through the requested displays, adding the data for each
    foreach my $request ( @{$opts->{'requested'}} ) {
        # add the header's printf
        push(@header_printf, $head_lat_printf);

        # we get the title text from the array reference in the hash above
        push(@header_values, $header_titles->{ $request }->[0]);

        # and an unsatisfactory hack to get a separator line...
        push(@separator, $head_lat_sep);

        # add the value line's printf
        push(@values_printf, $lat_printf);

        if ( $opts->{ 'ops' } ) {
            # header
            push(@header_printf, $head_ops_printf);
            push(@header_values, $header_titles->{ $request }->[1]);
            push(@separator, $head_ops_sep);

            # values
            push(@values_printf, $ops_printf);
        }
    }

    # build the header text and the values formatter
    my $t      = join(" ", @header_printf);
    my $header = sprintf( $t, @header_values ) . "\n" . sprintf( $t, @separator ) . "\n";
    my $printf = join(" ", @values_printf);

    # return the header and the values printf formatter
    return ($header, $printf);
}
Hi Andrew,
First of all: this is great work you do, I appreciate it very much. But I’ve still got a problem with your script: everything works fine except for the protocol option ‘iscsi’. I’m getting:
       Avg Lat  Total Ops |       Read Lat   Read Ops |      Write Lat  Write Ops |      Other Lat  Other Ops |
-------------- ---------- | -------------- ---------- | -------------- ---------- | -------------- ---------- |
Use of uninitialized value in numeric gt (>) at na-realtime-vol-latency.pl line 105, line 20.
Use of uninitialized value in numeric gt (>) at na-realtime-vol-latency.pl line 105, line 20.
Use of uninitialized value in numeric gt (>) at na-realtime-vol-latency.pl line 113, line 20.
Use of uninitialized value in numeric gt (>) at na-realtime-vol-latency.pl line 113, line 20.
Use of uninitialized value in numeric gt (>) at na-realtime-vol-latency.pl line 122, line 20.
Use of uninitialized value in numeric gt (>) at na-realtime-vol-latency.pl line 122, line 20.
Unfortunately I’ve only got basic coding skills and I can’t get along with your hashes of arrays etc. What I can say is that the results of line 281 are only “avg_latency” and “total_ops”; the protocol-specific values do not appear with the option ‘iscsi’. All other options work fine.
Any help would be very much appreciated.
Keep up the good work!
Regards,
Steffen
Hi Steffen,
Thanks for finding the bug and letting me know about it!
I’ve updated the script above, which should fix the errors you are seeing. The changes occur at line 273: the “return” hash is now initialized with values (of zero) so that there is always something for the display routine to compare, even if no value is returned by ZAPI.
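The idea behind the fix can be sketched in a few lines: start from a hash of zero defaults and lay whatever counters were actually returned on top of it. This is an illustration rather than the code in the script; the `with_defaults` helper and the sample response are hypothetical.

```perl
use strict;
use warnings;

# Counter names the display loop expects, each defaulting to zero.
my %defaults = map { $_ => 0 } qw(
    avg_latency total_ops read_latency read_ops
    write_latency write_ops other_latency other_ops
);

# Merge whatever counters were returned over the zero defaults.
sub with_defaults {
    my ($returned) = @_;
    return { %defaults, %$returned };
}

# Hypothetical response that only contains the summary counters,
# as Steffen saw with the iscsi protocol option.
my $result = with_defaults({ avg_latency => 5120, total_ops => 64 });
print "$_ = $result->{$_}\n" for sort keys %$result;
```

Because every key is always present, the numeric comparisons in the display loop never see an undefined value.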
Please let me know if I missed the mark on the fix!
Andrew
Hi Andrew,
Wow, that was fast! I’ve tested the script and your change catches the error. Still, I’m a little bit confused: I’m getting no data for the iscsi protocol. I’ve looked into the objects/instance/counters stuff a little bit (using the perf_operation.pl script from NetApp) and found that there are no fcp or iscsi counters associated with the volume object. There’s only nfs, cifs, san and total. AFAIU, the iscsi and fcp counters are related to LUNs rather than volumes; in the SDK the relevant counters are associated with the iscsi / fcp object. Am I missing something here?
Looking forward to your reply.
Cheers,
Steffen
Steffen,
I believe that you don’t get any results for a protocol when you have nothing in the volume that uses that protocol. For example, I have no FCP in my environment, so for all volumes I get all zeros for FCP. For a volume that consists of only a file share (CIFS or NFS) I don’t see iSCSI stats.
When I look at iSCSI operations for my root volume (which has nothing but the array’s root in it), for example, I see total operations and latency (cause there are operations of some type going on), but I get all zeros for the read/write/other. This is because the “Avg Lat” and “Total Ops”, regardless of which protocol is specified, always report their values as the sum of all protocols.
Make sure you have at least one FCP or iSCSI LUN in the volume to see the values for that protocol.
Here are two examples from my environment:
The first is a volume with an iSCSI LUN, the second has only a CIFS share. Note that for the second there are still operations occurring, but they are not iSCSI operations.
(There is a small difference between this result and the published script above: this appliance has been upgraded to 7.3.1.1, which reports latency in microseconds instead of milliseconds. This makes the values three orders of magnitude larger; you can get the millisecond values by dividing them by 1000. I had to adjust my local script to take this into account, and I’m working on a fix so that it will work with 7.2.x and 7.3.x and report both in milliseconds.)
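A minimal sketch of that adjustment, assuming the unit is known up front and passed in by the caller. The `latency_in_ms` helper and the unit names are hypothetical; how the script should detect which unit a given OnTAP version reports is not covered here.

```perl
use strict;
use warnings;

# Divisors that convert a raw latency value to milliseconds.
my %to_ms = ( millisec => 1, microsec => 1000 );

sub latency_in_ms {
    my ($value, $unit) = @_;
    die "unknown latency unit '$unit'\n" unless exists $to_ms{$unit};
    return $value / $to_ms{$unit};
}

printf "%.2f ms\n", latency_in_ms(12500, 'microsec');   # prints "12.50 ms"
printf "%.2f ms\n", latency_in_ms(12.5,  'millisec');   # prints "12.50 ms"
```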
There are per-volume protocol statistics reported by the SDK. They are returned by the perf-object-get-instances API call, using “volume” as the object name and providing the name of a volume for the “instance” value. NetApp has published a document describing all of the counters (you’ll need a NOW account to access it).
Andrew,
Thanks again for your reply, and sorry for bothering you again, but I’m stuck here. I do have two flexvols in the aggregate; each of them holds an iSCSI LUN (VMWARE) which contains the vmdk files of our virtual machines. So there has to be iSCSI traffic, and of course there is. I can see the counters increasing when querying the iscsi object with the perf_operation.pl script, but these seem to be overall counters for the complete filer, not for a specific volume. Example:
[user@host]> perl perf_operation.pl filer user pass get-instance-counter-value iscsi iscsi iscsi_write_latency
Counter value for object:iscsi instance:iscsi counter:iscsi_write_latency is 1138273304
But I still get nothing using your script. I don’t see any fcp / iscsi counters when I query the “volume” object with the perf_operation.pl script (using the perf-object-get-instances method), which is definitely a deviation from the documentation you mentioned in your previous post.
I seem to lack a lot of counters in several objects. Even tested with the root account to make sure that there is no privilege issue… Could there be an issue with the ONTAP or SDK version? I have:
FAS2020 with ONTAP 7.2.5
SDK 3.5 (tested 3.5P1 also)
Any more ideas? I’m close to giving up on this…
Steffen
Steffen,
I’m honestly not sure…I don’t know of any options or other items of that nature that would need to be enabled for it to work. If you tested with a user in the administrators group, then I’m really out of ideas.
Perhaps try this: SSH/telnet to your appliance and log in as the user you would use to query via the SDK. Go into diag mode:
priv set diag
Then check the volume iSCSI counters for values. The iSCSI counters are only available if you have access to the “diag” level of permissions. You can check the counters by issuing the following command:
stats show -r -i 5 volume:[volume_name]:iscsi_read_latency
where you would substitute the name of the volume for “[volume_name]”. There are additional counters available that you can check; to see all available counters, issue the command:
stats list counters volume
This will show you all of the counters that you can query using the stats command at your current permissions level. By going to
priv set advanced
you can see a larger set of the counters than when you are at
priv set admin
(the default).
If the user you have logged in as doesn’t have the ability to access diag mode, then that may be where the problem lies.
Hope this helps,
Andrew
Andrew,
Thank you very much once again. I think this is a dead end. I did what you recommended; here’s the result:
filer> priv set diag
filer*> stats show -r -i 5 volume:ESX_FC01:iscsi_read_latency
No counter ‘iscsi_read_latency’ was found for volume
filer*>
Same thing with “stats list counters volume”, no fcp or iscsi counters. I will contact NetApp support to get a clue on this issue.
Thanks again for your time and patience, I’ll be dropping by every now and then.
Steffen
Hi,
I’m attempting to execute the script above, but have encountered an error:
DK\manage-ontap-sdk-3.5.1\manage-ontap-sdk-3.5.1\bin\nt>perl na-realtime-nfs-latency.pl -H filer1 -u user -v vol4 -P nfs -i 10
Enter password:
Can't call method "children_get" on an undefined value at na-realtime-nfs-latency.pl line 293.
The only modification I made to the script was to add the path to LIB. Any ideas what I’m doing wrong? Thank you in advance.
Hi, loving the script. I can get it to run, but I get a strange error: readline() on unopened filehandle S at c:/program Files/VMware/VMware vSphere CLI/Perl/lib/NaServer.pm line 467.
It still works, but in between every sample the above error appears. Without this I would be very happy indeed! P.S. Does anyone know the steps for creating a locked-down user on the filer that I can run this with?
@Jim,
Firstly, sorry for the delayed response.
Can you please post the code you have surrounding line 293? It doesn’t appear to match what’s above, as line 293 doesn’t contain a children_get method call.
That being said, there are a few things that could be going wrong. One is that you are not actually connecting to the filer: a bad username and/or password would cause it to not return valid objects. The second possibility is that the user you are connecting with does not have permission to query the counters on the relevant object. The third possibility is that the volume you are trying to query is offline, restricted, or nonexistent.
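All three cases end the same way: the response has no “instances” element, so calling a method on it fails with the “undefined value” error. One way to guard against that is sketched below. The `child_path` helper and the `MockElement` class are hypothetical stand-ins so the example runs without the SDK; the only method assumed to exist on a real element is `child_get`, which the script already uses.

```perl
use strict;
use warnings;

# Hypothetical helper: walk a chain of child element names, returning undef as
# soon as one is missing rather than calling a method on an undefined value.
sub child_path {
    my ($element, @names) = @_;
    for my $name (@names) {
        return undef unless defined $element;
        $element = $element->child_get($name);
    }
    return $element;
}

# Minimal stand-in for an SDK element so the sketch runs without the SDK.
package MockElement;
sub new       { my ($class, %children) = @_; return bless { %children }, $class }
sub child_get { my ($self, $name) = @_; return $self->{$name} }

package main;

my $ok  = MockElement->new( instances => MockElement->new( 'instance-data' => MockElement->new() ) );
my $bad = MockElement->new();   # e.g. a failed login: no "instances" child

for my $result ($ok, $bad) {
    if ( defined child_path($result, 'instances', 'instance-data') ) {
        print "instance data found\n";
    } else {
        print "no instance data returned; check the login, permissions and volume name\n";
    }
}
```

With a guard like this the script could print a readable message and exit instead of dying inside the SDK call chain.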
I’ll work on updating the script with better error reporting soon. Thanks for reading and your feedback!
Andrew
@Tom,
Sorry for the delayed response. Can you tell me what version of the SDK you are using? There seem to be some changes between the versions that are causing some inconsistencies.
Thanks for reading!
Andrew
When I try to run the script, I get the following error:
$ ./na-realtime-vol-latency.pl --hostname XXXXXX --username YYYYYY --password ZZZZZ --volume vol3_AAAA
Use of uninitialized value $na_can_use_ipv6 in numeric eq (==) at /usr/lib/netapp-manageability-sdk/lib/perl/NetApp/NaServer.pm line 2169.
Use of uninitialized value $server in gethostbyname at /usr/lib/netapp-manageability-sdk/lib/perl/NetApp/NaServer.pm line 2183.
Use of uninitialized value $addr in pack at /usr/lib/netapp-manageability-sdk/lib/perl/NetApp/NaServer.pm line 2184.
Can't call method "children_get" on an undefined value at ./check_realtime_latency.pl line 306.