Disaster mitigation and recovery is a subject that is near and dear to my heart. As Glenn can attest, I am extremely conservative when it comes to datacenter design…I am a firm believer in redundant redundancy.
My virtualization infrastructure is, for better or worse, almost 100% blade centers. Let's assume I have four brand new blade chassis, each with 14 blades (care to guess who they're from?), for a total of 56 blades to assimilate. Rather than assign the blades to clusters sequentially, I wanted to ensure that the loss of a blade chassis was mitigated as much as possible. To do this, as hardware is assigned to the virtual clusters and datacenters, it is taken in stripes across the blade chassis.
For this to make a little more sense, think of it this way: blade center one has blades 1-14, blade center two has 15-28, three has 29-42 and four has 43-56. Since there are four chassis, this makes for a nice round eight blades per cluster, or two from each chassis (I went with eight because VMware’s HA seems to get a little flaky with 12).
There are two vCenter datacenters that I maintain, "development" and "production". Development is significantly smaller than production, so it only has two clusters (for a total of 16 blades); production gets the rest. Using this setup, I can then determine which datacenter and which cluster to put a blade in by its ID. I arbitrarily assigned the first four slots of each blade chassis (16 blades in all) to development, while the remainder go to production.
To make this a little easier to understand, I’ve attempted to create a chart…
Slot   Chassis 1   Chassis 2   Chassis 3   Chassis 4
1      Dev C1      Dev C1      Dev C1      Dev C1
2      Dev C1      Dev C1      Dev C1      Dev C1
3      Dev C2      Dev C2      Dev C2      Dev C2
4      Dev C2      Dev C2      Dev C2      Dev C2
5      Prod C1     Prod C1     Prod C1     Prod C1
6      Prod C1     Prod C1     Prod C1     Prod C1
7      Prod C2     Prod C2     Prod C2     Prod C2
8      Prod C2     Prod C2     Prod C2     Prod C2
9      Prod C3     Prod C3     Prod C3     Prod C3
10     Prod C3     Prod C3     Prod C3     Prod C3
11     Prod C4     Prod C4     Prod C4     Prod C4
12     Prod C4     Prod C4     Prod C4     Prod C4
13     Prod C5     Prod C5     Prod C5     Prod C5
14     Prod C5     Prod C5     Prod C5     Prod C5
To determine a blade's ID, multiply the chassis number minus one (0-3) by the number of slots per chassis (14), then add its slot position (1-14). For example, chassis 3, slot 7 would have the ID (3 − 1) × 14 + 7 = 35. Using the above chart, you can see it would land in the second production cluster.
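A quick sketch of that arithmetic (the function names are my own; the constants come from the numbers above):

```python
# Map a (chassis, slot) pair to a global blade ID and a cluster assignment,
# following the striping scheme described above.

SLOTS_PER_CHASSIS = 14   # blades per chassis
DEV_SLOTS = 4            # slots 1-4 of each chassis go to development

def blade_id(chassis, slot):
    """Global blade ID from chassis number (1-4) and slot position (1-14)."""
    return (chassis - 1) * SLOTS_PER_CHASSIS + slot

def assignment(chassis, slot):
    """Datacenter and cluster number for a blade, per the chart above."""
    if slot <= DEV_SLOTS:
        return "Dev", (slot - 1) // 2 + 1        # slots 1-2 -> C1, 3-4 -> C2
    prod_slot = slot - DEV_SLOTS
    return "Prod", (prod_slot - 1) // 2 + 1      # slots 5-6 -> C1, ..., 13-14 -> C5

print(blade_id(3, 7))    # chassis 3, slot 7 -> blade 35
print(assignment(3, 7))  # -> ('Prod', 2), the second production cluster
```

Note that each cluster number repeats across all four chassis, which is exactly the striping: every cluster draws two slots from every chassis.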
So, you see, an equal amount of hardware from each chassis is assigned to each cluster. This setup means that capacity is lost equally across all clusters in the event of a failure. If one chassis fails (25% of total capacity), then every cluster loses 25%, rather than losing 1.75 clusters' worth of hardware (all 8 blades of one cluster and 6 of another), which would essentially mean two entire clusters out of commission (HA can only handle, at most, four host failures).
I am exceptionally paranoid about any amount of failure, but especially large-scale failure…a large part of our business operation is virtualized, so when physical servers go down it means that potentially lots of VMs go down, which means money lost, and worse (ok, maybe not worse…but still not good), it makes virtualization look bad. Rather than having all four of the above blade chassis in the same rack, they would, ideally, be spread across the datacenter…different RPPs (power), different CRACs nearby (cooling), different distribution switches (network), etc. If your datacenter is divided into multiple rooms, then utilize them…put hardware in each. Unless something truly catastrophic happens (100% power loss to the entire building, 100% network outage, etc.), my infrastructure is resilient enough to withstand the failure without my having to worry about an offsite/colocation DR solution.
This concept can be applied even without blades; just replace the words "blade chassis" with "rack of servers". It also scales up and down well, which makes determining where to assign new hardware easy.
Remember, 9s are expensive. A five-9s infrastructure (99.999% uptime, or just over 5 minutes of downtime per year) can cost exponentially more than a three-9s (about 526 minutes, or roughly 8.8 hours, of downtime per year) or even four-9s (about 52.6 minutes per year) infrastructure. One level of redundancy costs 100% more than the base cost of equipment (you have to buy two of everything :), so if you are involved in purchasing, even if you just make recommendations about what to get, make sure you take into account how much failure you (and your boss) are willing to accept…and if the budget guys get involved, remind them how much it would cost if all the VMs were individual physical servers 🙂
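The downtime numbers fall straight out of the uptime percentages; here's the arithmetic as a small sketch (the function name is mine):

```python
# Allowed downtime per year for a given number of nines of uptime.
# Uses an average year of 365.25 days.

MINUTES_PER_YEAR = 365.25 * 24 * 60  # 525,960 minutes

def downtime_minutes(nines):
    """Downtime budget per year for e.g. 99.9% uptime (nines=3)."""
    unavailability = 10 ** -nines    # 3 nines -> 0.1% downtime
    return MINUTES_PER_YEAR * unavailability

for n in (3, 4, 5):
    print(f"{n} nines: {downtime_minutes(n):8.2f} minutes/year")
```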