Perl OnTAP SDK: Realtime Multiprotocol Volume Latency

Update 2009-07-21: With some help from Steffen, a bug was found where the script wasn’t returning any values in the result hash when the toaster didn’t return values for certain queries. This caused Perl to print errors when it was trying to do math on non-existent values. Starting at line 273, the script has been updated so that the hash returned by the subroutine that does the ZAPI query has default values of zero, which should eliminate the errors seen by Steffen. Please let me know of any other problems encountered! (and thanks to Steffen for finding this bug!)


My previous post only prints NFS latency for the NetApp as a whole, it doesn’t give any information about a specific volume. Some of my ESX hosts use iSCSI for their datastores, and because the NetApp has many iSCSI clients, looking at iSCSI stats for the filer as a whole didn’t help me very much.

The solution was this script. It is a significantly modified version of the previous script that is capable of showing the realtime latency for all protocols: NFS, CIFS, SAN (which I believe is all block level ops summarized), FCP and iSCSI. It also displays the three different types of operations for each protocol: read, write, and other.

The script, if invoked with nothing more than the connection information, will display the read, write, and “other” latency and operations for the total of all protocols. There is a fourth column as well, which shows the average latency and total operations across all operation types (r/w/o).

This script has proven quite beneficial for me. By monitoring CIFS latency during peak hours on the volume that contains shares for profiles, I have proven that the reason logins can take a significant amount of time is due to the use of high capacity, but very slow, SATA disks and not the network or desktops themselves. I’ve also been able to prove that one of our iSCSI volumes was “slow” due to bandwidth, and not spindle count (interestingly, the problem with this volume is the I/O request size…the app makes large requests which chokes bandwidth before available IOP/s runs out).

The OnTAP SDK is quite powerful, Glenn and I are quickly discovering that anything possible in FilerView and/or DFM/OpsMgr is doable through the SDK.

11 thoughts on “Perl OnTAP SDK: Realtime Multiprotocol Volume Latency”

  1. Hi Andrew,

    first of all: This is great work you do, i appreciate that very much. But still, i’ve got a problem with your script: Everything works fine except for the protocol option ‘iscsi’. I’m getting:

    Avg Lat Total Ops | Read Lat Read Ops | Write Lat Write Ops | Other Lat Other Ops |
    ————– ———- | ————– ———- | ————– ———- | ————– ———- |
    Use of uninitialized value in numeric gt (>) at na-realtime-vol-latency.pl line 105, line 20.
    Use of uninitialized value in numeric gt (>) at na-realtime-vol-latency.pl line 105, line 20.
    Use of uninitialized value in numeric gt (>) at na-realtime-vol-latency.pl line 113, line 20.
    Use of uninitialized value in numeric gt (>) at na-realtime-vol-latency.pl line 113, line 20.
    Use of uninitialized value in numeric gt (>) at na-realtime-vol-latency.pl line 122, line 20.
    Use of uninitialized value in numeric gt (>) at na-realtime-vol-latency.pl line 122, line 20.

    Unfortunately i’ve only got basic coding skills and i can’t get along with your hashes of arrays etc.. What i can say is that the results of line 281 are only “avg_latency” and “total_ops”; the protocol specific values do not appear with the option ‘iscsi’, all other options work fine.

    Any help would be very much appreciated.
    Keep up the good work!

    Regards,
    Steffen

  2. Hi Steffen,

    Thanks for finding the bug and letting me know about it!

    I’ve updated the script above so that it should hopefully fix the errors you are seeing. The changes occur at line 273, the “return” hash is now being initialized with values (of zero) so that there is always something for the display routine to have for comparison, even if no value is returned by ZAPI.

    Please let me know if I missed the mark on the fix!

    Andrew

  3. Hi Andrew,

    wow, that was fast! I’ve tested the script and your change catches the error… Still, i’m a little bit confused: I’m getting no data for the iscsi protocol. I’ve looked into the objects/instance/counters stuff a little bit (using the perf_operation.pl script from NetApp) and found that there are no fcp or iscsi counters associated with the volume object… There’s only nfs, cifs, san and total. AFAIU, the iscsi and fcp counters are related to LUNs rather than volumes; in the SDK the relevant counters are associated with the iscsi / fcp object. Am i missing something here?

    Looking forward to your reply.

    Cheers,
    Steffen

  4. Steffen,

    I believe that you don’t get any results for a protocol when you have nothing in the volume that uses that protocol. For example, I have no FCP in my environment, so for all volumes I get all zeros for FCP. For a volume that consists of only a file share (CIFS or NFS) I don’t see iSCSI stats.

    When I look at iSCSI operations for my root volume (which has nothing but the array’s root in it), for example, I see total operations and latency (cause there are operations of some type going on), but I get all zeros for the read/write/other. This is because the “Avg Lat” and “Total Ops”, regardless of which protocol is specified, always report their values as the sum of all protocols.

    Make sure you have at least one FCP or iSCSI LUN in the volume to see the values for that protocol.

    Here is two examples from my environment:

    user> na-realtime-vol-latency.pl -H my_toaster -u not_root -v exchange_sg_one -P iscsi -i 10
    Enter password:
           Avg Lat  Total Ops |       Read Lat   Read Ops |      Write Lat  Write Ops |      Other Lat  Other Ops |
    -------------- ---------- | -------------- ---------- | -------------- ---------- | -------------- ---------- |
        3973.06 us       9175 |     4053.17 us       8992 |       37.20 us        178 |        0.00 us          0 |
        3200.32 us      11190 |     3246.74 us      11028 |       40.06 us        161 |        0.00 us          0 |
        3524.05 us      10734 |     3632.47 us      10409 |       51.51 us        325 |        0.00 us          0 |
        3077.88 us      10932 |     3150.41 us      10677 |       41.14 us        254 |        0.00 us          0 |
        2839.87 us      12333 |     2945.87 us      11879 |       66.89 us        451 |        0.00 us          0 |
        2239.45 us      14836 |     2278.04 us      14580 |       41.94 us        256 |        0.00 us          0 |
        2202.69 us      15596 |     2253.64 us      15237 |       40.34 us        358 |        0.00 us          0 |
        3135.53 us      13884 |     3207.06 us      13519 |      487.25 us        364 |        0.00 us          0 |
        3231.71 us      11310 |     3298.37 us      11078 |       48.61 us        232 |        0.00 us          0 |
        2597.15 us      11561 |     2629.60 us      11416 |       42.65 us        144 |        0.00 us          0 |
    
    user> na-realtime-vol-latency.pl -H my_toaster -u not_root -v root_vol -P iscsi -i 10
    Enter password:
           Avg Lat  Total Ops |       Read Lat   Read Ops |      Write Lat  Write Ops |      Other Lat  Other Ops |
    -------------- ---------- | -------------- ---------- | -------------- ---------- | -------------- ---------- |
          23.69 us         54 |        0.00 us          0 |        0.00 us          0 |        0.00 us          0 |
          14.51 us        278 |        0.00 us          0 |        0.00 us          0 |        0.00 us          0 |
          22.82 us         38 |        0.00 us          0 |        0.00 us          0 |        0.00 us          0 |
          83.17 us        931 |        0.00 us          0 |        0.00 us          0 |        0.00 us          0 |
          23.78 us         58 |        0.00 us          0 |        0.00 us          0 |        0.00 us          0 |
          22.40 us         57 |        0.00 us          0 |        0.00 us          0 |        0.00 us          0 |
          24.84 us         38 |        0.00 us          0 |        0.00 us          0 |        0.00 us          0 |
           4.38 us        655 |        0.00 us          0 |        0.00 us          0 |        0.00 us          0 |
           9.59 us        202 |        0.00 us          0 |        0.00 us          0 |        0.00 us          0 |
          24.19 us         54 |        0.00 us          0 |        0.00 us          0 |        0.00 us          0 |

    The first is a volume with an iSCSI LUN, the second has only a CIFS share. Note that for the second there is still operations occurring, but they are not iSCSI operations.

    (There is a small difference in this result than in the published script above…this appliance has been upgraded to 7.3.1.1, which reports latency in microseconds vice milliseconds. This makes the values an order of magnitude larger…you can get the millisecond values by dividing these by 1000. I had to adjust my local script to take this into account…I’m working on a fix for this so that it will work with 7.2.x and 7.3.x and report both in milliseconds.)

    There are per volume protocol statistics reported by the SDK. They are reported by the perf-object-get-instances object using the “volume” object name and providing the name of a volume for the “instance” value. NetApp has published a document that documents all of the counters here (you’ll need a NOW account).

  5. Andrew,

    thanks again for your reply and sorry for bothering you again, but i’m stuck here. I do have two flexvols in the aggregate, each of them holds an iSCSI LUN (VMWARE) which contains the vmdk-files of our virtual machines. So there has to be iSCSI traffic, and of course there is. I can see the counters increasing when questioning the iscsi object with the perf_operation.pl script, but these seem to be overall counters for the complete filer, not for a specific volume. Example:

    [user@host]> perl perf_operation.pl filer user pass get-instance-counter-value iscsi iscsi iscsi_write_latency

    Counter value for object:iscsi instance:iscsi counter:iscsi_write_latency is 1138273304

    But i still get nothing using your script… I don’t see any fcp / iscsi counters when i question the “volume” object with the perf_operation.pl script (using the perf-object-get-instances method), which is definitely a deviation from the documentation you mentioned in your previous post.

    I seem to lack a lot of counters in several objects. Even tested with the root account to make sure that there is no privilege issue… Could there be an issue with the ONTAP or SDK version? I have:

    FAS2020 with ONTAP 7.2.5
    SDK 3.5 (tested 3.5P1 also)

    Any more ideas? I’m close to giving up on this…

    Steffen

  6. Steffen,

    I’m honestly not sure…I don’t know of any options or other items of that nature that would need to be enabled for it to work. If you tested with a user in the administrators group, then I’m really out of ideas.

    Perhaps try this: SSH/telnet to your appliance, login as the user you would use to query via the SDK.

    Go into diag mode: priv set diag, then check the volume iSCSI counters for values. The iSCSI counters are only available if you have access to the “diag” level of permissions.

    You can check the counters by issuing the following command: stats show -r -i 5 volume:[volume_name]:iscsi_read_latency, where you would substitute the name of the volume for “[volume_name]”. There are additional counters available that you can check, to see all available counters issue the command stats list counters volume. This will show you all of the counters that you can query using the stats command at your current permissions level. By going to priv set advanced you can see a larger set of the counters than when you are at priv set admin (the default).

    If the user you have logged in as doesn’t have the ability to access diag mode, then that may be where the problem lies.

    Hope this helps,

    Andrew

  7. Andrew,

    thank you very much once again. I think this is a dead end, i did what you recommended, here’s the result:

    filer> priv set diag
    filer*> stats show -r -i 5 volume:ESX_FC01:iscsi_read_latency
    No counter ‘iscsi_read_latency’ was found for volume
    filer*>

    Same thing with “stats list counters volume”, no fcp or iscsi counters. I will contact NetApp support to get a clue on this issue.

    Thanks again for your time and patience, i’ll be dropping by every now and then.

    Steffen

  8. Hi,

    I’m attempting to execute the script above, but have encountered an error:

    DKmanage-ontap-sdk-3.5.1manage-ontap-sdk-3.5.1binnt>perl na-realtime-nfs-lat
    ency.pl -H filer1 -u user -v vol4 -P nfs -i 10
    Enter password:

    Can’t call method “children_get” on an undefined value at na-realtime-nfs-latenc
    y.pl line 293.

    The only modification I made to the script was to add the path to LIB. Any ideas what I’m doing wrong? Thank you in advance.

  9. Hi Loving the script i can get it to run but i get a strange error. readline() on unopened filehandle S at c:/program Files/VMware/VMware vSphere CLI/Perl/lib/NaServer.pm line 467.
    It still works but inbetween every sample the above error appears. Without this I would be very happy indeed!!. P.s does anyone know the steps from creating a user on the filer that is locked down that I can run this with.

  10. @Jim,

    Firstly, sorry for the delayed response.

    Can you please post the code you have surrounding line 293? It doesn’t appear to match what’s above, as line 293 doesn’t contain a children_get method call.

    That being said, there are a couple of things that could be going wrong. One is that you are not actually connecting to the filer…a bad username and/or password would cause it to not return valid objects. The possibility being that the user you are connecting with may not have permissions to the correct object to query the counters. The third possibility is that the volume you are trying to query is offline/restricted/nonexistant.

    I’ll work on updating the script with better error reporting soon. Thanks for reading and your feedback!

    Andrew

  11. @Tom,

    Sorry for the delayed response, can you tell me what version of the SDK you are using? There seems to be some changes between the versions that are causing some inconsistencies.

    Thanks for reading!

    Andrew

Leave a Reply