An Exploration of the FlexVols that Underlie VMware Datastores

This post is a continuation of the series that I started with aggregates. FlexVols are created inside of an aggregate and are the logical assignment of the aggregate’s capacity to sub-containers. Think of a FlexVol as a folder on a file system with a quota applied to it…while that isn’t technically accurate, it gets the gist across.

FlexVols are the data containers from which CIFS/NFS data (including virtual machines) is served and within which LUNs are hosted. They are the functional level at which many features, such as deduplication, are applied, and they provide logical separation for data sets. From a security point of view, no data in one volume is available from another, and even though the disks are shared, there are no shared blocks between volumes (even with deduplication).

Clustered Data ONTAP introduced the ability to move volumes between nodes in the cluster. I won’t preach about the benefits of cDOT, but there are many and they far outweigh the added complexity. This series is meant to stay focused on the data container settings, which are the same between 7-Mode and clustered Data ONTAP.

Before we begin, I want to note that TR-3749 and TR-4068 should always be the primary reference and guide when deploying VMware using NetApp storage.

Creating Volumes

General Recommendations
Volumes can be created at nearly any size…from 20MB up to 100TB (the current maximum FlexVol size on the largest platforms). With rare exception, I have almost always run out of IOPS available in the aggregate before running out of GB available. How you provision volumes is entirely up to your organization…some provision them by business unit, some by capacity, some by function, etc. There is no right way and there is no wrong way!
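As a rough sketch of what that looks like (the volume, aggregate, and SVM names here are made up, and the exact syntax varies by Data ONTAP release), creating a 1TB volume for a datastore:

  7-Mode:
    vol create ds_prod01 aggr1 1t
  Clustered Data ONTAP:
    volume create -vserver vs1 -volume ds_prod01 -aggregate aggr1 -size 1TB -junction-path /ds_prod01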

Thin Provisioning
Thin provisioning is well known by now. I’m not going to redefine it…here is the Wikipedia article…they do a much better job than I would anyway. Thin provisioning is part of NetApp’s Storage Efficiency feature set.

Using thin provisioning, along with the other Storage Efficiency features, is meant to let you truly use the capacity of your disks. Having a large number of volumes that are “thick” provisioned, but not filled with data, means that those unfilled GB go to waste. By using thin provisioning, you can provision aggregate capacity by IOPS and manage the GB as needed. If a volume approaches full, snap autodelete and volume autogrow will manage the space and keep things operating smoothly. If the aggregate is approaching full and you are using clustered Data ONTAP, you can move the volume to a different aggregate on any node in the cluster; if you are using Data ONTAP 7-Mode, use the vol move command to move the volume to another aggregate on the same controller. With either mode, you can also create a new datastore, storage vMotion the virtual servers, and destroy the original.
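For illustration, the two operations mentioned above look roughly like this (hypothetical names, and check the syntax on your release): setting the space guarantee to none is what thin provisions the volume, and the move command relocates it to another aggregate.

  7-Mode:
    vol options ds_prod01 guarantee none
    vol move start ds_prod01 aggr2
  Clustered Data ONTAP:
    volume modify -vserver vs1 -volume ds_prod01 -space-guarantee none
    volume move start -vserver vs1 -volume ds_prod01 -destination-aggregate n2_aggr1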

Snap autodelete does exactly what it says…when a threshold is met, it deletes snapshots according to a policy. This is good for a couple of reasons…for example, deleting a large number of VMs will cause the snapshots to grow (the deleted blocks are still locked in the snapshots), potentially taking up additional space in the volume. When the volume gets too full, autodelete will remove the snaps, thus truly freeing the space.

This may be undesirable, as you may want to retain the snapshots for recovery purposes, and so the volume autogrow option is available. Again, just like the name implies, the volume grows in predetermined increments to a predetermined maximum size according to policy. The preference for which is executed first is set with the volume option “try_first”.

Let’s use an example to explain this… I have an aggregate, it’s 10TB in size. Inside that aggregate I create 20 volumes, all 1TB in size and all thin provisioned. Snap autodelete is enabled (and set for oldest first), volume autogrow is also enabled with a maximum size of 1.5TB and an increment of 100GB. try_first is configured for autogrow.
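In 7-Mode syntax, configuring that policy on one of those volumes would look roughly like the following (the volume name is an example, and clustered Data ONTAP exposes the same settings through the volume snapshot autodelete, volume autosize, and volume modify commands):

  snap autodelete vol1 on
  snap autodelete vol1 delete_order oldest_first
  vol autosize vol1 -m 1500g -i 100g on
  vol options vol1 try_first volume_grow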

Some time goes by and volume1 is approaching full, so the administrators begin deleting things off the volume. They delete 600GB worth of data, which far exceeds the snap reserve and causes the snapshots to bleed into the active volume space…the volume reaches 97% full and now snap autodelete kicks off. It will begin deleting snapshots until the free space threshold is reached (20%), or it runs out of snapshots to delete. Let’s assume that the latter is the case…all the snaps are gone, but we are still running out of space. At that point, volume autogrow kicks in and extends the size of the volume, preventing a condition where the volume runs out of space.

Where you must be careful with thin provisioning is when you have a highly overcommitted aggregate. If the aggregate runs out of space, then all of those customers are impacted (one of the main reasons I love clustered Data ONTAP…you can move that volume to another aggregate nondisruptively!).

For more information, I highly recommend this Tech OnTAP article, TR-3563, and TR-3965.

Deduplication
NetApp uses the WAFL file system to store the data in volumes. WAFL works similarly to (but not exactly like) a traditional Unix file system, with inodes that point to where the data lies on disk. This is a ridiculously gross oversimplification…if you are interested in more detail, look at TR-3002. The same mechanisms that make snapshots so easy with NetApp make deduplication possible…by simply changing multiple inodes to point at the same physical blocks of data, we accomplish deduplication.

My opinion is that there are very few reasons not to use deduplication on your production VMware volumes. You get space savings and a performance boost (all layers of cache are also deduplication aware). If you are using NFS for your VMware datastores, deduplication is extra awesome – you can immediately see the capacity benefits in the reduced amount of used space in the datastore. LUNs still benefit from deduplication, however the space savings are only seen as reduced utilization from the NetApp volume perspective, not from the VMware LUN datastore.
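If you want to see those savings for yourself, 7-Mode reports them per volume with df -s (the volume name below is made up; clustered Data ONTAP surfaces the same numbers through the volume efficiency and volume show commands):

  df -s /vol/ds_prod01

The output shows the space used by the volume alongside the space saved by deduplication.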

A handful of things to be aware of…

Firstly, deduplication is a post-processing action. Data is first written to disk in its original format, then at a later time (either on a schedule or after a threshold of change) the data is processed. This is something to be aware of if you have a very high rate of change or new data and you are running your volumes close to the line. It also means that when deduplication runs, it competes for system resources and has the potential to affect the production workload that is using them.

Secondly, deduplication is a CPU-intensive task. There’s a lot going on in the background (reading data and metadata, comparing blocks to ensure they are identical, writing data and metadata), and this requires CPU time. I highly recommend monitoring your controllers for CPU utilization, and the volumes/aggregates in question for activity and latency, to determine the best time to schedule deduplication.

Third, be conscious of the amount of time that deduplication takes to complete. A very large volume with a large amount of written and/or changed data can take a very long time to process.
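To tie those three points together, here is a minimal 7-Mode sketch of enabling deduplication, scheduling it for an off-peak window, and scanning the data already in the volume (the volume name and schedule are examples; clustered Data ONTAP does the same work with the volume efficiency command family):

  sis on /vol/ds_prod01
  sis config -s sun-sat@2 /vol/ds_prod01
  sis start -s /vol/ds_prod01
  sis status /vol/ds_prod01

The -s flag on sis start processes the existing data rather than only new writes, and sis status lets you watch the progress (and duration) of the operation.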

A note about a special case…deduplication of non-persistent VDI volumes. My VMware View and Citrix XenDesktop (with PVS) non-persistent desktop storage does not have deduplication turned on. To be clear, this is only the linked clone datastores (in the case of View) or “D” drives (PVS), so the only thing that is stored is delta data from the desktops. All of the user data is redirected, and the gold images are stored elsewhere. I purposely chose to do this because the data is all short lived, so any deduplication would be a waste of effort. Please remember that all environments are different…testing to verify any configuration is crucial to ensuring the best performance and optimal space utilization.

Additional information about deduplication is available in TR-3505, TR-3958, and TR-3966. Also, NetApp’s very own “Dr. Dedupe” posted an article here that is an excellent read.

Compression
A relatively new feature for NetApp is compression. It can run either inline or post-process (like deduplication), and can result in significant space savings, but at the cost of CPU.

I am largely unfamiliar with the feature in practice, so I will defer to the experts. Tech OnTap has a very good article here. Also be sure to read TR-3958 and TR-3966 which cover deduplication also.
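For reference, enabling it looks something like this (deduplication needs to be enabled on the volume first, the names are hypothetical, and the syntax depends on your release):

  7-Mode:
    sis config -C true -I true /vol/ds_prod01
  Clustered Data ONTAP:
    volume efficiency modify -vserver vs1 -volume ds_prod01 -compression true -inline-compression true

The -C/-compression setting controls post-process compression and -I/-inline-compression controls inline compression.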

Snapshot Reserve
Snap reserve is the amount of space in a volume that is specifically set aside for snapshot data. The capacity is not made available to the volume for consumption. The size of a snapshot is determined by how much existing data has changed since the snapshot was taken. Newly written data to the volume is not reflected in the size of the snapshot.

Snapshot size is cumulative, and deleting a newer snapshot could cause the size of an older snapshot to increase. To expand a little on that…the size being cumulative means that the oldest snapshot is effectively the size of all the other snapshots combined. The reason that snap autodelete starts with the oldest snapshots first is that deleting newer snapshots generally doesn’t free much space.

Let’s say I have three snapshots…one, two, and three…taken in that order. The size of “One” is a reflection of how much data changed between “One” and “Two”, the size of “Two” is the amount of data changed between “Two” and “Three”. If I delete “Two”, there will be three types of data:

  1. data that was modified in “One”, but not in “Two” (this data only increases the size of “One”)
  2. data that was modified in both “One” and “Two” (this data increases the size of both “One” and “Two”)
  3. data that was modified in “Two”, but not “One” (this data only increases the size of “Two”)

When the snapshot is deleted, data of type 1 will remain, data of type 2 in snapshot “One” will be overwritten with data from snapshot “Two”, and type 3 will transfer to snapshot “One”. If there is no data of type 3, then the older snapshot will not increase in size.
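If you want to watch this behavior on a real volume, the snapshot sizes and the change between them can be inspected directly (the volume name is an example; clustered Data ONTAP has the equivalent volume snapshot show command):

  snap list ds_prod01
  snap delta ds_prod01

snap list shows the space each snapshot consumes, and snap delta shows how much data changed between consecutive snapshots.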

The size of your snap reserve depends on three things…how frequently you take snapshots (a small amount of data that changes constantly and is snapped frequently will consume additional snapshot space), how long you want to retain the snaps, and how much of the existing data is modified.

There is one other thing to consider…whether you want to reserve any space at all. If you choose not to have a snap reserve then the size of the snapshots is reflected in the consumed space of the volume. This can be confusing if you are not expecting it (“why does this 100GB volume show as being 80% full when there is only 25GB of data in it?”), so just be aware.
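Whichever way you go, the reserve is a simple per-volume setting (hypothetical names, version-dependent syntax):

  7-Mode:
    snap reserve ds_prod01 20
  Clustered Data ONTAP:
    volume modify -vserver vs1 -volume ds_prod01 -percent-snapshot-space 20

Setting the value to 0 removes the reserve entirely, with the accounting caveat described above.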

There is significant additional information available in the Logical Storage Management Guide, starting at page 22; snap reserve is specifically covered on page 51.

Fractional Reserve
Fractional reserve is a special kind of reserve specifically for LUNs. The setting is not relevant unless your volume contains LUNs and you intend to take snapshots. Fractional reserve is the amount of space (in percent) that will be reserved for writes to LUNs once a snapshot is taken. So…what does that mean?

I have a volume that is 100GB in size, and fractional reserve is set to 100%. Let’s say I create a 10GB LUN in that volume. The space consumed in the volume is now 10GB. A snapshot is created (scheduled, manual, a snap* product, doesn’t matter)…the volume will now reserve 100% of the size of the LUN (10GB in this case) for writes in case the volume fills up. So, the consumed space in the volume is now: 10GB for the LUN, 10GB for the fractional reserve, and whatever space the snapshot data itself consumes.

All of that is fine and well, but say I forget to delete snapshots for a VERY long time and my volume fills up. Under normal circumstances we couldn’t do any writes to the volume…it’s full, right?! This is where fractional reserve comes into play…that space is reserved so that writes to the LUN can continue to occur even when the volume is full. Snapshot creation automatically disables when the volume is more than 95% full, so all writes go into the fractional reserve space and there are no additional snaps to cause multiple copies of the same block to be retained (so a block gets written over itself multiple times…ok, not literally over itself…we are using WAFL, but you know what I mean). This keeps the LUN operating (and the consumer unaffected), despite the volume being out of space.

What should you set fractional reserve to? Well, it depends. Are you using snapshots? If so, how long do you keep them? Do you have snap reserve configured? What is the rate of change for the data in your LUN? Do you frequently forget to check your storage for old snaps? Do you have issues running out of space in volumes? When I use LUNs for VMware, I typically set the fractional reserve to 25% or less, but, again, it depends!

As of Data ONTAP 8.2, there are only two settings for fractional reserve: 100 or 0. In other words, it is either on or off. Please be aware of this in your capacity planning!

For additional information, there is a very good article on the NetApp Communities by Chris Kranz. Also, the Logical Storage Management Guide has great information on page 35 about fractional reserve.

Here is how you can modify fractional reserve:
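These are sketches rather than copy/paste commands (the volume and SVM names are invented, and the syntax varies slightly by release):

  7-Mode:
    vol options ds_lun01 fractional_reserve 0
  Clustered Data ONTAP:
    volume modify -vserver vs1 -volume ds_lun01 -fractional-reserve 0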

Access Time Updates
Access time updates are pretty simple…they are metadata about a file that records the last time the file was accessed. They are very useful for a user data share, but have limited usefulness for VMware VMDKs. They use up an extra IO or two when accessing the files for no real gain, so it is safe to turn them off.
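Turning them off is a single volume setting (example names, version-dependent syntax). Note that the 7-Mode option is a double negative: setting no_atime_update to on is what disables the updates.

  7-Mode:
    vol options ds_prod01 no_atime_update on
  Clustered Data ONTAP:
    volume modify -vserver vs1 -volume ds_prod01 -atime-update false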

Read Reallocation
The volume option read_realloc is used to address the issue of random writes followed by sequential reads. When enabled, the volume will actively move data that was randomly written to disk, but sequentially read, and rewrite it sequentially. This is very beneficial in certain situations, however, if you are already running regular reallocations, then this process is a part of that.

There are two things to be aware of with this option: 1) there is some CPU and disk overhead, and 2) unless the space_optimized option is used, additional snapshot space may be consumed. If your data follows the random write/sequential read pattern, it could be useful to enable this option…I recommend testing it beforehand.
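If your testing does show a benefit, enabling it is a one-line volume option in 7-Mode (the volume name is an example; clustered Data ONTAP exposes the equivalent through volume modify):

  vol options ds_prod01 read_realloc space_optimized

Use on instead of space_optimized if the extra snapshot space consumption mentioned above is acceptable in your environment.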

Volume Maintenance

Volume Free Space
Much like aggregates, there has been a great deal of debate about the amount of free space to maintain in a volume, and there are no published maximums. I have always strived for 10% free space, knowing that some functionality is lost above 97% utilization (for example, snapshots will be disabled). Volume performance is also affected significantly more as a volume approaches full than aggregate performance is. If you choose to run your volumes near the “full” line, I highly encourage you to closely monitor performance to ensure that users are not impacted.

Reallocation
Just like aggregates, volumes can be reallocated as well. Aggregate reallocation is primarily focused on creating contiguous free space, whereas volume reallocation is more focused on optimizing the data layout in the volume.

There are a couple of additional options for volume reallocate. The -f option forces a full reallocate, which will reallocate all data unless it is predicted that there will be a performance decrease after the movement. This is different from the standard behavior of not reallocating data that is predicted to have no change in performance.

There is also the -p option. This option specifically tells the reallocation to take place on the physical blocks. Normally a reallocate would cause the size of a snapshot to grow…this is because the data is being moved and is recorded as change. With the physical reallocate, the file system (remember a volume is a logical entity on top of the disks) is unaware that the data is moving beneath it because the logical block allocations are preserved. This can have some benefits…for example, snapmirror transfers only the changed data, so less data “changed” equals less data to transfer. It can also have some disadvantages…if a large number of blocks are moved, reading from snapshots before the reallocate may be slower.
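As a sketch in 7-Mode syntax (the volume name is made up; clustered Data ONTAP provides the same operations under the volume reallocation commands and the nodeshell):

  reallocate measure /vol/ds_prod01
  reallocate start -f -p /vol/ds_prod01
  reallocate status -v /vol/ds_prod01

reallocate measure checks how optimized the current layout is, start -f -p runs a one-time full, physical reallocate, and status shows the progress of running jobs.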

Additional information about reallocation can be found in the Administrator’s Guide and TR-3929.

Summary

Volumes are the logical containers which store the data associated with your virtual machines. They play a critical role in the overall environment and provide many of the features that are important to your storage plan, including deduplication, snapshots (and the associated replication using snapmirror), and compression.

As always, this is nowhere near an exhaustive discussion of volumes. There have (literally) been books written about them, their many nuances and behaviors, and what to expect. Please remember that there is never one correct way of configuring storage; the answer is always “it depends”. Please post questions in the comments!
