Performance monitoring is a complex topic, but it is vital to the successful implementation and maintenance of any system. In the past I’ve written several posts about using Perl to gather performance statistics from a 7-mode system (using ONTAP 7.3.x, which is quite old at this point), so I thought it might be a good time for an update.
I originally documented some of this information in a response on the NetApp Community site. This post expands on that a bit and documents it externally.
The NetApp PowerShell Toolkit has three cmdlets which we can use to determine what objects, counters, and instances are available, and a fourth cmdlet to actually collect the data.
Finding the Right Performance Object
Performance reporting in the clustered Data ONTAP API is broken out by two things: Object and Counter. In order to monitor something, for example aggregate performance, we need to find the object which pertains to that “something”. We do this using the Get-NcPerfObject cmdlet.
Throughout the rest of this post I will be using the example of aggregate monitoring, specifically how many reads and writes are being done against an aggregate.
    PS C:\> Get-NcPerfObject

    Name                 PrivilegeLevel
    ----                 --------------
    affinity             diag
    affiperclass         diag
    affiperqid           diag
    affitotal            diag
    aggregate            admin
    ...
    ...
    ...
For my cDOT 8.3 cluster this returned 358 items, which is a lot of different categories of monitoring! We can narrow the list by filtering on PrivilegeLevel. The most commonly monitored things are at either the admin or advanced privilege level, whereas diag is used for very detailed, infrequently needed counters. To view non-diag objects, we change the command slightly.
    Get-NcPerfObject | ?{ $_.PrivilegeLevel -ne "diag" }

    PS C:\Users\Andrew> Get-NcPerfObject | ?{ $_.PrivilegeLevel -ne "diag" }

    Name                 PrivilegeLevel
    ----                 --------------
    aggregate            admin
    audit_ng             admin
    audit_ng:vserver     admin
    cifs                 admin
    cifs:node            admin
    cifs:vserver         admin
    client               admin
    client:vserver       admin
    cluster_peer         admin
    cpx                  admin
    cpx_op               advanced
    disk                 admin
    disk:constituent     admin
    disk:raid_group      admin
    ext_cache            admin
    ext_cache_obj        admin
This results in just 113 objects returned, a much shorter list to consider. This privilege level also indicates how much permission on the cluster the user collecting the information will need. A user with diag privileges is going to have considerably more permission on the cluster than one with only admin or advanced.
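To see at a glance how the objects split across privilege levels, the list can be grouped rather than filtered. The following is only a sketch: the rows in $sampleObjects are a hypothetical, hand-typed subset shaped like the Get-NcPerfObject output above (Name and PrivilegeLevel), so that it runs without a cluster connection.

```powershell
# Hypothetical sample rows shaped like Get-NcPerfObject output (Name, PrivilegeLevel).
$sampleObjects = @(
    [pscustomobject]@{ Name = 'affinity';  PrivilegeLevel = 'diag' }
    [pscustomobject]@{ Name = 'aggregate'; PrivilegeLevel = 'admin' }
    [pscustomobject]@{ Name = 'cifs';      PrivilegeLevel = 'admin' }
    [pscustomobject]@{ Name = 'cpx_op';    PrivilegeLevel = 'advanced' }
)

# Count the objects at each privilege level, sorted by level name.
$summary = $sampleObjects |
    Group-Object -Property PrivilegeLevel |
    Sort-Object -Property Name |
    Select-Object -Property Name, Count

$summary
```

Against a live cluster, piping Get-NcPerfObject into the same Group-Object pipeline in place of $sampleObjects should give the real counts per level.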
Finding the Counters
The objects give us a categorical view of what can be monitored. To find out which counters are collected for each object, we use the Get-NcPerfCounter cmdlet. Using the aggregate object as an example, we see the following:
    PS C:\Users\Andrew> Get-NcPerfCounter -Name aggregate | ?{ $_.PrivilegeLevel -ne "diag" } | Select-Object Name,PrivilegeLevel,Unit,Properties,Desc | Format-Table

    Name                  PrivilegeLevel Unit    Properties        Desc
    ----                  -------------- ----    ----------        ----
    cp_read_blocks        admin          per_sec rate              Number of blocks read per second during a CP on the aggregate
    cp_read_blocks_hdd    admin          per_sec rate              Number of blocks read per second during a CP on the aggregate HDD disks
    cp_read_blocks_ssd    admin          per_sec rate              Number of blocks read per second during a CP on the aggregate SSD disks
    cp_reads              admin          per_sec rate              Number of reads per second done during a CP to the aggregate
    cp_reads_hdd          admin          per_sec rate              Number of reads per second done during a CP to the aggregate HDD disks
    cp_reads_ssd          admin          per_sec rate              Number of reads per second done during a CP to the aggregate SSD disks
    instance_name         admin          none    string            Name of the aggreagte instance
    instance_uuid         admin          none    string            UUID for aggregate instance
    node_name             admin          none    string            Node Name
    node_uuid             admin          none    string,no-display System node id
    total_transfers       admin          per_sec rate              Total number of transfers per second serviced by the aggregate
    total_transfers_hdd   admin          per_sec rate              Total number of transfers per second serviced by the aggregate HDD disks
    total_transfers_ssd   admin          per_sec rate              Total number of transfers per second serviced by the aggregate SSD disks
    user_read_blocks      admin          per_sec rate              Number of blocks read per second on the aggregate
    user_read_blocks_hdd  admin          per_sec rate              Number of blocks read per second on the aggregate HDD disks
    user_read_blocks_ssd  admin          per_sec rate              Number of blocks read per second on the aggregate SSD disks
    user_reads            admin          per_sec rate              Number of user reads per second to the aggregate
    user_reads_hdd        admin          per_sec rate              Number of user reads per second to the aggregate HDD disks
    user_reads_ssd        admin          per_sec rate              Number of user reads per second to the aggregate SSD disks
    user_write_blocks     admin          per_sec rate              Number of blocks written per second to the aggregate
    user_write_blocks_hdd admin          per_sec rate              Number of blocks written per second to the aggregate HDD disks
    user_write_blocks_ssd admin          per_sec rate              Number of blocks written per second to the aggregate SSD disks
    user_writes           admin          per_sec rate              Number of user writes per second to the aggregate
    user_writes_hdd       admin          per_sec rate              Number of user writes per second to the aggregate HDD disks
    user_writes_ssd       admin          per_sec rate              Number of user writes per second to the aggregate SSD disks
Notice that, once again, I removed the counters which are at the diag level. You may want to look at them, but for the most part they are things that only infrequently need to be monitored because they are very low level details.
I included the properties field because it’s very important…it tells us how to read the counter. From the API documentation:
- raw: single counter value is used
- delta: change in counter value between two samples is used
- rate: delta divided by the time in seconds between samples is used
- average: delta divided by the delta of a base counter is used
- percent: 100*average is used
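To make the five property types concrete, here is a minimal sketch of how two samples of a counter could be turned into a reported value. Convert-CounterSample and its parameter names are hypothetical helpers written for this illustration, not part of the NetApp PowerShell Toolkit, and returning 0 when the base counter has not moved is my own assumption rather than documented behavior.

```powershell
# Hypothetical helper: applies the property rules listed above to two samples.
function Convert-CounterSample {
    param(
        [Parameter(Mandatory)]
        [ValidateSet('raw', 'delta', 'rate', 'average', 'percent')]
        [string] $Property,

        [Parameter(Mandatory)] [double] $First,    # counter value at the first sample
        [Parameter(Mandatory)] [double] $Second,   # counter value at the second sample
        [double] $ElapsedSeconds = 0,              # needed for 'rate'
        [double] $BaseFirst = 0,                   # base counter samples, needed for
        [double] $BaseSecond = 0                   # 'average' and 'percent'
    )

    $delta     = $Second - $First
    $baseDelta = $BaseSecond - $BaseFirst

    switch ($Property) {
        'raw'     { return $Second }
        'delta'   { return $delta }
        'rate'    {
            if ($ElapsedSeconds -le 0) { throw 'rate needs a positive ElapsedSeconds' }
            return $delta / $ElapsedSeconds
        }
        # Assumption: report 0 when the base counter did not change between samples.
        'average' { if ($baseDelta -eq 0) { return 0 }; return $delta / $baseDelta }
        'percent' { if ($baseDelta -eq 0) { return 0 }; return 100 * $delta / $baseDelta }
    }
}

# A rate counter that grew by 237 over 5 seconds: 47.4 per second.
Convert-CounterSample -Property rate -First 1000 -Second 1237 -ElapsedSeconds 5

# An average counter that grew by 9000 while its base counter grew by 30: 300.
Convert-CounterSample -Property average -First 50000 -Second 59000 -BaseFirst 200 -BaseSecond 230
```

The sample numbers are made up; only the arithmetic follows the definitions quoted from the API documentation.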
Looking at the descriptions, it appears that we want to look at the user_reads, user_writes, and total_transfers counters to determine how much activity is happening on our aggregate. Each of these is a rate counter, which means we need to sample it once, wait a known amount of time (e.g. 5 seconds), sample it again, and divide the difference between the two values by the number of seconds.
Instances of the Object
Now that we know the objects and counters, and we’ve determined what we want to monitor, we need to find the instances. To do that we use the Get-NcPerfInstance cmdlet.
    PS C:\Users\Andrew> Get-NcPerfInstance -Name aggregate | Where-Object { $_.Name -notlike "*root" }

    Name                 Uuid
    ----                 ----
    VICE01_aggr1_sas     96f8b6c9-4444-11b2-be67-123478563412
    VICE02_aggr1_sas     49f45938-45a8-11b2-9ea8-123478563412
    VICE03_aggr1_sas     0b916a30-45a8-11b2-9a6d-123478563412
    VICE04_aggr1_sas     6ee009b9-45a8-11b2-8bac-123478563412
    VICE05_aggr1_sata    8dffa99a-45a8-11b2-839d-123478563412
    VICE06_aggr1_sata    15c61be8-b5a6-4db1-b61a-8566bd967c32
I excluded root aggregates from this listing using the Where-Object snippet because I’m not interested in those at this time.
Reporting Performance
We now have everything needed to monitor performance: the object, the counters, and the instance. We use the Get-NcPerfData cmdlet to query for information.
    Get-NcPerfData -Name aggregate -Instance VICE01_aggr1_sas -Counter user_reads,user_writes,total_transfers
Here is what it looks like in action:
    PS C:\> (Get-NcPerfData -Name aggregate -Instance VICE01_aggr1_sas -Counter user_reads,user_writes,total_transfers).counters | Select-Object Name,Value

    Name                 Value
    ----                 -----
    total_transfers      10477200561
    user_reads           10168492251
    user_writes          157344312
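Remember that these are rate counters, so the values shown above are cumulative totals rather than per-second figures. To determine the rate, we sample twice, subtract the first value from the second, and divide by the number of seconds between the samples…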
    # collect the first values
    $one = (Get-NcPerfData -Name aggregate -Instance VICE01_aggr1_sas -Counter user_reads,user_writes,total_transfers).counters

    # wait a few seconds
    Start-Sleep -Seconds 5

    # collect the second values
    $two = (Get-NcPerfData -Name aggregate -Instance VICE01_aggr1_sas -Counter user_reads,user_writes,total_transfers).counters

    # an object to print results in
    $result = "" | Select-Object "user_reads","user_writes","total_transfers"

    # do the math for each counter...(value_at_t2 - value_at_t1) / time
    $result.user_reads = (($two | ?{ $_.Name -eq "user_reads" }).value - ($one | ?{ $_.Name -eq "user_reads" }).value ) / 5
    $result.user_writes = (($two | ?{ $_.Name -eq "user_writes" }).value - ($one | ?{ $_.Name -eq "user_writes" }).value ) / 5
    $result.total_transfers = (($two | ?{ $_.Name -eq "total_transfers" }).value - ($one | ?{ $_.Name -eq "total_transfers" }).value ) / 5

    # print the result
    $result
And the output. Remember that this is a per-second average over the time between polls (5 seconds in this instance):
    user_reads user_writes total_transfers
    ---------- ----------- ---------------
          47.4        18.6            81.6
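The subtract-and-divide step can also be wrapped in a small function so that it works for any list of counters. This is a sketch under stated assumptions: Get-CounterRate is a hypothetical helper, the samples are assumed to be objects with Name and Value properties like the .counters output shown earlier, and the second set of values below is made up so that the result matches the rates above.

```powershell
# Hypothetical helper: per-second rate for every counter present in both samples.
function Get-CounterRate {
    param(
        [Parameter(Mandatory)] [object[]] $First,
        [Parameter(Mandatory)] [object[]] $Second,
        [Parameter(Mandatory)] [double]   $ElapsedSeconds
    )

    if ($ElapsedSeconds -le 0) { throw 'ElapsedSeconds must be positive' }

    # Index the first sample by counter name to avoid repeated Where-Object scans.
    $before = @{}
    foreach ($counter in $First) { $before[$counter.Name] = [double]$counter.Value }

    # (value_at_t2 - value_at_t1) / time, in the order of the second sample.
    $rates = [ordered]@{}
    foreach ($counter in $Second) {
        if ($before.ContainsKey($counter.Name)) {
            $rates[$counter.Name] = ([double]$counter.Value - $before[$counter.Name]) / $ElapsedSeconds
        }
    }

    [pscustomobject]$rates
}

# First sample: the values shown earlier. Second sample: made-up values 5 seconds later.
$one = @(
    [pscustomobject]@{ Name = 'user_reads';      Value = '10168492251' }
    [pscustomobject]@{ Name = 'user_writes';     Value = '157344312' }
    [pscustomobject]@{ Name = 'total_transfers'; Value = '10477200561' }
)
$two = @(
    [pscustomobject]@{ Name = 'user_reads';      Value = '10168492488' }
    [pscustomobject]@{ Name = 'user_writes';     Value = '157344405' }
    [pscustomobject]@{ Name = 'total_transfers'; Value = '10477200969' }
)

$perSecond = Get-CounterRate -First $one -Second $two -ElapsedSeconds 5
$perSecond
```

Against a live cluster, timing the gap with [System.Diagnostics.Stopwatch]::StartNew() and passing its Elapsed.TotalSeconds may be more accurate than assuming the sleep lasted exactly 5 seconds, since each API call also takes time.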
We can modify this slightly to get a per-second report for an aggregate:
    $aggregate = "VICE01_aggr1_sas"
    $waitSeconds = 1

    Write-Host "user_reads user_writes total_transfers"
    Write-Host "---------- ----------- ---------------"

    # collect the first values
    $one = (Get-NcPerfData -Name aggregate -Instance $aggregate -Counter user_reads,user_writes,total_transfers).counters

    while ($true) {
        # wait a bit
        Start-Sleep -Seconds $waitSeconds

        # collect the second values
        $two = (Get-NcPerfData -Name aggregate -Instance $aggregate -Counter user_reads,user_writes,total_transfers).counters

        # an object to print results in
        $result = "" | Select-Object "user_reads","user_writes","total_transfers"

        # do the math for each counter...(value_at_t2 - value_at_t1) / time...and print
        $result.user_reads = (($two | ?{ $_.Name -eq "user_reads" }).value - ($one | ?{ $_.Name -eq "user_reads" }).value ) / $waitSeconds
        $result.user_writes = (($two | ?{ $_.Name -eq "user_writes" }).value - ($one | ?{ $_.Name -eq "user_writes" }).value ) / $waitSeconds
        $result.total_transfers = (($two | ?{ $_.Name -eq "total_transfers" }).value - ($one | ?{ $_.Name -eq "total_transfers" }).value ) / $waitSeconds

        # format the output and display it
        "{0,10} {1,11} {2,15}" -f $result.user_reads,$result.user_writes,$result.total_transfers

        # set the starting values for the next iteration
        $one = $two
    }
Giving us an easy to read, per second, output of the number of reads, writes, and total transfers for our aggregate…
    user_reads user_writes total_transfers
    ---------- ----------- ---------------
           102           0             102
             0           0               0
             1           0               1
             0           0               0
             7          26              89
             1          40              58
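If the numbers are meant for later analysis rather than live viewing, each interval can be kept as a timestamped record and written out as CSV. The sketch below is one possible shape, not a prescribed format: New-PerfRecord is a hypothetical helper, the column names are my own choice, and the two rows reuse values from the output above with made-up timestamps.

```powershell
# Hypothetical helper: one record per polling interval, ready for CSV export.
function New-PerfRecord {
    param(
        [Parameter(Mandatory)] [string]   $Aggregate,
        [Parameter(Mandatory)] [datetime] $Timestamp,
        [Parameter(Mandatory)] [double]   $UserReads,
        [Parameter(Mandatory)] [double]   $UserWrites,
        [Parameter(Mandatory)] [double]   $TotalTransfers
    )

    [pscustomobject]@{
        Timestamp       = $Timestamp.ToString('o')   # ISO 8601 round-trip format
        Aggregate       = $Aggregate
        user_reads      = $UserReads
        user_writes     = $UserWrites
        total_transfers = $TotalTransfers
    }
}

# Two intervals from the output above, with made-up timestamps one second apart.
$start = [datetime]'2015-06-01T12:00:00'
$records = @(
    New-PerfRecord -Aggregate 'VICE01_aggr1_sas' -Timestamp $start -UserReads 102 -UserWrites 0 -TotalTransfers 102
    New-PerfRecord -Aggregate 'VICE01_aggr1_sas' -Timestamp $start.AddSeconds(1) -UserReads 7 -UserWrites 26 -TotalTransfers 89
)

# ConvertTo-Csv returns the lines as strings; Export-Csv would write them to a file instead.
$csv = $records | ConvertTo-Csv -NoTypeInformation
$csv
```

Inside the monitoring loop, the same record could be piped to Export-Csv with -Append on each iteration so that the file grows by one line per poll.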
Performance Monitoring is Fun!
This has been just a short introduction to performance monitoring of a cDOT system using the PowerShell Toolkit. There is a huge number of things that can be monitored, and you can choose to display the information however you like…maybe a real-time report of performance for troubleshooting, intermittent collection to go into a summary report, or collection at regular intervals to feed into a trend analysis tool.
Please reach out to me using the comments below or the NetApp Community site with any questions about how to collect performance information from your systems.
Hey Andrew, great writeup! I’m “Magyk” on the NetApp support forum, the guy who posed the original question. I was talking to the guys at the local NetApp office and they said they knew you and that you were a good guy. I had a question:
For the -Name parameter you’re using “aggregate” as an example. What other options are available for that parameter, or better yet, is there a way to get a list of options?
Thanks.
Hi James,
Thanks for reading! I hope that this response, and the one in the communities, has been helpful.
The “-Name” parameter comes from the performance object. Use “Get-NcPerfObject” to view a list…there are 358 returned from my cDOT 8.3 system, so it’s quite a few to sort through. To make it a bit easier, show the description property:
You can also view them from the ClusterShell:
Remember that the user you are connected to the cluster with must have permissions to the object, and just like ClusterShell there are three privilege levels: admin, advanced, and diag.
Andrew
Of course, my bad! You did mention that right at the beginning. *smacks forehead*
Thanks for the script. I have a few questions; can you clarify?
Is {read,write,total}_data given in bytes? To get the actual latency, is it divided by the number of ops? Is latency given in microseconds?
Thanks
Hello Kannan,
Yes, the data values are given in bytes and latency is in microseconds. Occasionally capacity counters will be in blocks; you can see the units using the Get-NcPerfCounter cmdlet. Latency will have a base counter, identified using the Get-NcPerfCounter cmdlet as well.
Hope that helps!
Andrew