An Exploration of Aggregates that Underlie VMware Datastores

NetApp storage, much like ogres and onions, is made up of several layers. Whether you are using Data ONTAP 7-Mode or clustered Data ONTAP (cDOT), there are always aggregates, which contain volumes, which in turn contain NFS/CIFS shares and/or LUNs. Aggregates are the physical grouping of disks into RAID groups on which all data is stored when using Data ONTAP; they are the foundation on which everything else rides.

Storage Layers

I am going to start examining those configurables which may, or may not, be important when hosting virtual machines. This will be broken into several parts, one for each of the layers: aggregates, volumes, and the NFS/CIFS shares and/or LUNs within them.

All of these components are configured similarly in both 7-Mode and cDOT. cDOT adds another layer of abstraction, known as the Storage Virtual Machine, which enhances data mobility and manageability on the storage array, but it does not affect the settings on the actual data container constructs.

Each of these entities has configuration options and settings that can be tweaked, tuned, and adjusted for various scenarios. The defaults for these settings are conservative and capable of meeting a broad range of requirements, but they can also be changed to meet a variety of more specific needs for capacity, performance, ease of management, etc. Remember, just because a setting can be adjusted doesn’t mean that it needs to be. All environments are different, and there is rarely only a single “correct” way to configure your storage.

Before we begin, I want to note that TR-3749 and TR-4068 should always be the primary reference and guide when deploying VMware using NetApp storage.

Creating Aggregates

Aggregates are composed of one or more RAID groups and leverage WAFL to write data across all of the disks. This means that adding more disks, even in additional RAID groups, equals more performance. When creating aggregates you want to take into account more than just “what disks do I have available?”.

General Recommendations
A small number of aggregates with more disks is, generally, better than more aggregates with fewer disks. Put simply, more disks in an aggregate equals better performance. There is little reason not to create aggregates as large as possible, regardless of whether you are using 32-bit or 64-bit aggregates. Clustered Data ONTAP 8.2 introduced QoS, which mitigates concerns about noisy neighbors sharing aggregates with other customers.

RAID-DP is the NetApp implementation of RAID-6. It uses two parity disks to provide protection against multiple disk failures. RAID-DP is the recommended configuration…I am, by no stretch of the imagination, an expert on it, so I will instead provide references: TR-3298 – NetApp Implementation of Double-Parity RAID for Data Protection and this awesome article from Tech ONTAP.

RAID group size
RAID groups range from 12 to 28 disks in size. More disks in a RAID group means more usable capacity with less RAID overhead (10+2 = 16.7% penalty, 26+2 = 7.1% penalty); however, that comes at the price of increased risk of data loss…it’s easier to lose three disks in a group of 28 than in a group of 12.
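Those overhead percentages are simply the two RAID-DP parity disks divided by the total disks in the group; a quick sketch of the arithmetic:

```shell
# RAID-DP parity overhead = 2 parity disks / total disks in the RAID group
awk 'BEGIN {
  printf "10+2 group: %.1f%% parity overhead\n", 2 / 12 * 100
  printf "26+2 group: %.1f%% parity overhead\n", 2 / 28 * 100
}'
# 10+2 group: 16.7% parity overhead
# 26+2 group: 7.1% parity overhead
```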

I tend to stick with the defaults here: 16 (14+2) for SAS/FC, 14 (12+2) for SATA. I will occasionally go as high as 18+2 if I have a controller with a single shelf of available disks (3 for the root aggr, 1 spare, 20 for the data aggr). Another important thing to note here is that you aren’t just sizing the RAID groups for whole-disk failures, but also for bad blocks. RAID scrubs find many of the bad blocks, but not all. It is particularly terrifying when you have a disk fail and then begin to see bad data blocks in the aggregate during the recovery (there are preventative measures for this…make sure your RAID scrubs are running for an appropriate amount of time!). Without the additional protection of RAID-DP / RAID-6, these become silent data corruptions. This is also the rationale for using fewer disks in SATA RAID groups…there’s a lot more data (and rebuild times are much higher), which increases the chances of finding bad blocks or having another drive fail.

RAID group size can be adjusted after the aggregate has been created if you forget during creation. This is done with an aggregate option:
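A sketch of the commands involved (the aggregate name aggr1 is illustrative; verify the syntax against the documentation for your Data ONTAP release). Note that a new raidsize applies to RAID groups created afterwards; existing groups are not restructured:

```
# 7-Mode: change the RAID group size for an existing aggregate
aggr options aggr1 raidsize 18

# clustered Data ONTAP
storage aggregate modify -aggregate aggr1 -raidsize 18
```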

Aggregate Snapshots
Snapshots on aggregates seem wasteful until you need them. If something goes wrong, they can greatly expedite time to recovery…for example, if the wrong volume is deleted. In Data ONTAP 8.2, aggregate snapshots are enabled by default, but snap reserve is set to 0.

To enable/disable aggregate snapshots:
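Something like the following should work (a hedged sketch; aggr1 and node1 are illustrative names). In 7-Mode this is the nosnap aggregate option; in cDOT 8.x the same option is typically reached through the nodeshell:

```
# 7-Mode: disable (on) or enable (off) aggregate snapshots
aggr options aggr1 nosnap on
aggr options aggr1 nosnap off

# clustered Data ONTAP: the same option via the nodeshell
system node run -node node1 -command aggr options aggr1 nosnap on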

To set the snap reserve for an aggregate:
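Again a sketch, with illustrative names; the -A flag targets an aggregate rather than a volume:

```
# 7-Mode: set the aggregate snapshot reserve to 5%
snap reserve -A aggr1 5

# clustered Data ONTAP: via the nodeshell
system node run -node node1 -command snap reserve -A aggr1 5
```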

Aggregate Maintenance

Aggregate Free Space
There has been much debate about how much free space should be left in an aggregate. I have seen some people say 20%, and others say less than 2%. My personal practice is to try to keep 10% free space for “emergency allocations”, with 5% free space being “the aggregate is full”. I do not have any official documentation or other communication saying this is right or wrong, so I would say that it is entirely up to you and what amount of free space makes you feel comfortable.

My understanding is that you will need to keep at least 3% of the aggregate free for aggregate metadata (very bad things can happen if the aggregate is more than 97% full) and 1% for FlexVol metadata. Deduplication can add to this as well (up to 4% of the volume size), as its metadata is stored at the aggregate level.
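Those reservations add up quickly. A worked example for a hypothetical 10 TiB aggregate, assuming the worst case where deduplicated volumes span the whole aggregate (the 4% figure is really per volume):

```shell
# 3% aggregate metadata + 1% FlexVol metadata + up to 4% dedupe metadata
awk 'BEGIN {
  aggr_gib = 10240                            # hypothetical 10 TiB aggregate
  reserve  = aggr_gib * (0.03 + 0.01 + 0.04)  # worst-case metadata headroom
  printf "metadata headroom: %.0f GiB\n", reserve
}'
# metadata headroom: 819 GiB
```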

All that said, there is some logic wherein a full aggregate may perform worse than an empty aggregate. WAFL always writes to free space. As the aggregate fills up, there is less and less contiguous free space available for it to quickly write data to. This means that more disk seeks may have to occur for writes, thus slowing down the write process. This segues nicely into my next topic…

Aggregate Reallocation
Reallocation on an aggregate is used to regain contiguous free space. It is similar to a defragmenter. As your aggregates get closer to being full, or have existed for some time (commonly referred to as aging), improvements in write performance can be had by scheduling aggregate reallocation. However, do keep in mind that this is a disk-intensive operation…that means it has the potential to impact other I/O operations on the aggregate!

ONTAP 8.1.1 introduced a new aggregate option, free_space_realloc, that acts as a continuous optimizer for free space. Similar to the volume option read_realloc (which takes randomly written but sequentially read data and re-writes it sequentially), this aggregate option constantly optimizes free space to ensure more consistent write performance over time. Because it leads to better performance as the aggregate ages, I definitely recommend turning it on. There is a caveat, however: enabling free_space_realloc will increase CPU utilization a small amount, so if your controller is already CPU bound this could be a concern. The System Administration Guide for Cluster Administrators, page 267, has significantly more information if you are interested.
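Enabling it looks roughly like this (a sketch; aggr1 is illustrative, and the accepted values may vary by release, so check the man pages for your version):

```
# 7-Mode
aggr options aggr1 free_space_realloc on

# clustered Data ONTAP
storage aggregate modify -aggregate aggr1 -free-space-realloc on
```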

It is well known that WAFL always writes where the most contiguous free space is located. This means that if you add disks to an aggregate, they will become write hot spots and potentially impact performance. Whenever possible, you should add disks to an aggregate in increments of an entire RAID group, and regardless of how many disks get added, you will want to rebalance the data across the entire aggregate as much as possible. The best way to do that is by running reallocate -f against all of the volumes hosted by the aggregate.
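In 7-Mode that looks roughly like the following, run once per volume on the expanded aggregate (volume names are illustrative; in cDOT the equivalent commands live under volume reallocation). Be aware that a full physical reallocation can temporarily inflate snapshot space usage:

```
# 7-Mode: force a full physical reallocation of each volume
reallocate start -f /vol/vol1
reallocate start -f /vol/vol2

# check progress
reallocate status
```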

Additional information about reallocation can be found in the Administrator’s Guide and TR-3929.


Aggregates are the construct on which all other logical storage containers on the NetApp reside. Ensuring correct and appropriate aggregate sizing and configuration helps to keep your environment operating at peak efficiency.

This post has, by no stretch of the imagination, been an exhaustive discussion on aggregates. There are many, many more things that can affect how the storage is configured and managed. If there are any questions, please feel free to post them in the comments!
