Metro Cluster – d8taDude

What is a VMware Metro Cluster?

A VMware Metro Cluster (vMSC) is also sometimes called a stretched cluster, which gives more of a clue to its function, since it allows a single cluster to operate across geographically separate data centres. This ability to operate two locations as a single cluster gives significant availability benefits for both planned and unplanned outages.

How does a Metro Cluster Work?

A Metro Cluster allows VMs spread across data centres to act as though they are in a single local cluster. To allow this, the VMs need access to the same storage at both sites. This is achieved with products like NetApp’s MetroCluster and HPE’s Peer Persistence, which present a single view of the storage even though it is located in a multi-site configuration, as depicted in the diagram below. Let’s dig into how this works.

Each LUN is replicated between both storage systems using synchronous replication. However, only one copy of the LUN can be written to at a time, whilst the other remains in read-only mode. The writable LUN is presented to the hosts via active paths; the read-only LUN is effectively hidden because the paths to it are marked as standby. This is based on ALUA (Asymmetric Logical Unit Access), which was used in traditional storage systems like the EMC CLARiiON. There, ALUA marked the preferred, optimized paths to the controller owning a LUN, while non-optimized paths indicated indirect paths to the LUN. The non-optimized standby paths would only become live if the primary path failed.
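To make the active/standby idea concrete, here is a minimal Python sketch of the path states a host might see. It is only a toy model: the Path class and the switchover and writable_site helpers are names I have made up for illustration, not part of any VMware or storage vendor API.

```python
from dataclasses import dataclass


@dataclass
class Path:
    site: str   # site of the array this path leads to (hypothetical label)
    state: str  # "active" or "standby"


def switchover(paths):
    """Return a new path list with the active and standby states reversed."""
    flipped = {"active": "standby", "standby": "active"}
    return [Path(p.site, flipped[p.state]) for p in paths]


def writable_site(paths):
    """The site whose paths are active is the one serving writes."""
    sites = {p.site for p in paths if p.state == "active"}
    if len(sites) != 1:
        raise ValueError("expected exactly one site with active paths")
    return sites.pop()


# Two paths to the array at each site; site A currently owns the LUN.
paths = [
    Path("A", "active"), Path("A", "active"),
    Path("B", "standby"), Path("B", "standby"),
]

print(writable_site(paths))              # A
print(writable_site(switchover(paths)))  # B
```

The point of the model is that the host always sees all four paths; a switchover changes only which of them are marked active.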

The example below shows the paths on an ESXi host connected to a Metro Cluster. The live paths are shown as active, but this can be switched over using the storage array management software so that the active and standby paths reverse.

What are the requirements for a stretched cluster?

In order to set up a VMware Metro Cluster the following is required:

VMware metro storage cluster licensing – There is no minimum licence edition of vSphere for the creation of a metro cluster. However, if automated workload balancing with DRS is required, the minimum licence is Enterprise Plus edition
Supported storage connectivity. Fibre Channel, iSCSI, NFS, and FCoE are supported
From vSphere 6, the maximum supported latency for vMotion is 150ms RTT
Stretched storage network across sites
Max supported latency for storage replication is 10ms; this may be lower depending on vendor
Suitable software options selected for storage e.g. 3PAR Peer Persistence option
Maximum network latency RTT between sites for the VMware ESXi management networks is 10ms
The vSphere vMotion network has a redundant network link with a minimum of 250Mbps
A third site is required for deployment of a witness which will act as an arbitrator
Storage IO control is not supported on a Metro Cluster enabled datastore
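The numeric limits in the list above lend themselves to a quick pre-flight check. The sketch below is a hedged example in Python: the thresholds come from the list, but the dictionary keys, the check_site_links function and the sample measurements are all made up for illustration. Always confirm the real limits with your storage vendor.

```python
# Upper limits in milliseconds, taken from the requirements list above.
LIMITS_MS = {
    "vmotion_rtt_ms": 150,
    "storage_replication_ms": 10,
    "management_rtt_ms": 10,
}
# Minimum vMotion bandwidth from the requirements list above.
MIN_VMOTION_MBPS = 250


def check_site_links(measured):
    """Return a list of human-readable problems; an empty list means all checks pass."""
    problems = []
    for name, limit in LIMITS_MS.items():
        if measured[name] > limit:
            problems.append(f"{name}: {measured[name]}ms exceeds {limit}ms")
    if measured["vmotion_mbps"] < MIN_VMOTION_MBPS:
        problems.append(
            f"vmotion_mbps: {measured['vmotion_mbps']} is below {MIN_VMOTION_MBPS}"
        )
    return problems


# Hypothetical measurements between the two sites.
sample = {
    "vmotion_rtt_ms": 4,
    "storage_replication_ms": 12,
    "management_rtt_ms": 4,
    "vmotion_mbps": 1000,
}

for problem in check_site_links(sample):
    print(problem)  # storage_replication_ms: 12ms exceeds 10ms
```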

Benefits

Mobility – Since storage and network config is shared across the sites, vMotion requirements are met and VMs can be either manually migrated or dynamically balanced across the cluster and locations using DRS
Reduce physical boundaries – DRS can be used to automatically balance workloads across locations
Reduce downtime – A metro cluster allows the online movement of VMs and storage for planned events without downtime. These can be performed together or independently. For example, if maintenance was planned on the storage system, the active paths could be switched over to the other site; if the entire site was expected to be offline, the storage and VMs could be migrated to the opposite site
High availability – vMSC protects against both storage system and site failures. A storage system failure will be detected by a witness VM and the active paths switched over to the other system. For a site failure, VMware HA will restart the VMs at the surviving site
Reduced RTO – Automated recovery reduces RTO for storage or site failure

Disadvantages

Complexity – Although setting up a vMSC is not too strenuous, it is certainly more complex than a single site cluster
Testing – Although vMotion between sites and switchover of storage between sites can be tested, there is no simple way to test a full failover scenario, for example with a run book

Considerations for VMware metro cluster design

HA admission control – The first consideration around HA is a logical one: you should use admission control and set it to a reservation level of 50% for CPU and memory. This guarantees that the resources are available should a failover between sites be required
HA datastore heartbeating – This is used to validate the state of a host. It is important that datastores used for heartbeating are configured at both locations so that false results are not received if a site is lost. VMware recommends that 2 datastores are set for heartbeating at each site
HA APD – The response for an All Paths Down (APD) event needs customising. You will find the setting in the HA settings: after selecting Protect against Storage Connectivity Loss, select Power off and restart VMs.
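The reasoning behind the 50% admission control figure is simple arithmetic, sketched below in Python. The site names, the capacity figures in GHz and both helper functions are made up for illustration; the idea is that the surviving site alone must be able to carry the whole workload.

```python
def usable_capacity(total, reserved_fraction=0.5):
    """Capacity left for running VMs once the failover reservation is set aside."""
    return total * (1 - reserved_fraction)


def fits_on_one_site(site_capacity, demand):
    """True if the smaller site alone could carry the whole demand."""
    return demand <= min(site_capacity.values())


# Hypothetical CPU capacity per site, in GHz.
capacity_ghz = {"site_a": 120.0, "site_b": 120.0}
total = sum(capacity_ghz.values())

print(usable_capacity(total))                 # 120.0
print(fits_on_one_site(capacity_ghz, 100.0))  # True
print(fits_on_one_site(capacity_ghz, 150.0))  # False
```

With two equally sized sites, the 50% reservation and the "fits on one site" test come to the same thing. If the sites are not the same size, it is the smaller site that sets the limit.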

ESX host names – Create a logical naming convention which will allow you to quickly identify which site a host is in. You could include the site in the host name, or choose a numbering system that reflects the location, for example odd-numbered hosts are in site one. This will make designing your system and running it day to day easier
Data locality and host affinity rules – Ideally hosts should access data from their local storage array to improve response time. To ensure this is the case, use VMware affinity rules to define the preferred site for VMs so that they run from a local LUN. Do not use must rules; if you do, the VMs will not move even in the event of a site failure, as it would violate the rule
Logically name the LUNs with their home sites – This is not a must, and some may argue they want the flexibility to move LUNs between data centres, but it will make it easier for BAU staff to track which are local datastores
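As an illustration of why a numbering convention pays off, the Python sketch below derives the site from the host name using the odd/even example above. The host names, the site labels and both functions are hypothetical; the resulting groups are the sort of thing you could then feed into the host groups for your should rules.

```python
import re


def site_for_host(hostname):
    """Map a host to a site from its trailing number: odd is site1, even is site2."""
    short_name = hostname.split(".")[0]
    match = re.search(r"(\d+)$", short_name)
    if match is None:
        raise ValueError(f"no trailing number in host name: {hostname}")
    return "site1" if int(match.group(1)) % 2 == 1 else "site2"


def group_by_site(hostnames):
    """Build a site-to-hosts mapping, e.g. as input for DRS host groups."""
    groups = {"site1": [], "site2": []}
    for name in hostnames:
        groups[site_for_host(name)].append(name)
    return groups


# Hypothetical host names following the odd/even convention.
hosts = ["esx01.example.local", "esx02.example.local",
         "esx03.example.local", "esx04.example.local"]

print(group_by_site(hosts))
```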

What causes a failover?

For an automated failover of the storage to occur, a number of failure conditions must be met. The conditions for 3PAR are summarised in the following table from HPE.

Essentially contact needs to be lost with the storage array and replication needs to be stopped.

Testing

There is no automated testing method for a Metro Cluster. However, with a bit of thought it is possible to run some tests, although some are more invasive and risky than others. We will run through the tests starting with the least risky and move towards the more invasive and risky.

1 vMotion – This is the simplest test: move a VM between sites. Although it is a simple test, vMotion has more requirements than HA and so will start to build confidence as we move through the tests

2 Storage switch over – Switching which site the storage is actively running on can again be completed online with little risk

3 Simulated storage failure – This test incurs risk since it is possible IO could be lost when the storage system is taken offline. Verify the specifics of a failover scenario with your storage vendor, but for example with a 3PAR you will need to take the management network and Remote Copy network offline simultaneously. Before you do this, disable auto failover of the LUNs you do not wish to be part of the test

4 Simulated site failover – For this test you simulate a site failure by simulating a storage failure as above plus a host failure to get HA to kick in. Choose a VM to test and move it to a host by itself, power off the other VMs in the environment and put the hosts that are out of scope into maintenance mode. Perform the HA simulated failover as per https://kb.vmware.com/s/article/2056634. Again there is risk in this test, so be selective about which VMs you choose to test

Remember tests 3 and 4 do incur risk. Perform them at your own risk and only if the project requires it.

Peer Persistence is a 3PAR feature that enables your chosen flavour of hypervisor, either Hyper-V or vSphere, to act as a metro cluster. A metro cluster is a geographically dispersed cluster that enables VMs to be migrated from one location to the next with zero downtime. This transparent movement of VMs across data centres allows load balancing and planned maintenance, and can form part of an organisation’s high availability strategy.

Peer Persistence can also be used with Windows failover clustering to enable a metro cluster for services such as SQL server on physical servers.

What are the building blocks for Peer Persistence?

The first thing you are going to need is two 3PAR systems with Remote Copy. Peer Persistence is effectively an add-on to Remote Copy and cannot exist without it. Remote Copy must be in synchronous mode, so there are some requirements around latency. The max round-trip latency between the systems must be 2.6ms or less; this rises to 5ms or less with 3PAR OS 3.2.2.
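The latency rule above can be written down as a small check. This is a sketch only: the function names are made up, the version is passed as a plain (major, minor, patch) tuple, and I am assuming that releases newer than 3.2.2 keep the 5ms limit, which you should verify against the HPE documentation for your release.

```python
def max_rtt_ms(os_version):
    """Max Remote Copy round-trip latency for a (major, minor, patch) 3PAR OS version.

    Assumption: versions after 3.2.2 keep the 5ms limit.
    """
    return 5.0 if os_version >= (3, 2, 2) else 2.6


def latency_ok(os_version, measured_rtt_ms):
    """True if the measured round-trip latency is within the limit for this version."""
    return measured_rtt_ms <= max_rtt_ms(os_version)


print(latency_ok((3, 2, 1), 3.0))  # False, limit is 2.6ms
print(latency_ok((3, 2, 2), 3.0))  # True, limit is 5ms
```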

As this is effectively a cluster setup, a quorum is required, which HPE provide in the form of a witness VM deployed from OVF. This witness VM acts as the arbitrator for the cluster, verifying which systems are available and whether automatic failover to the second site should be initiated.

The other requirements are:

The 3PAR OS must be 3.2.1 or newer for Hyper-V. I would recommend at least 3.2.1 MU3, since this included a fix which removed the need to rescan disks on hosts after a failover. For VMware, 3.1.2 MU2 or newer is required
The replicated volumes must have the same WWN on both 3PAR systems. If you create a new volume and add it to Remote Copy this will happen automatically.
You will need a stretched fabric that will allow hosts access to both systems
Hosts need to be zoned to the 3PAR systems at both sites
When you create the remote copy groups you must enable both the auto_failover and path_management policies to allow for automated failover
FC, iSCSI, or FCoE protocols are supported for host connectivity. RCFC is recommended for the remote copy link.

Further requirements specific to hypervisor are:

Hyper-V

Windows hosts must be set to a persona of 15
For non-disruptive failover Hyper-V hosts must be Windows Server 2008 R2 or 2012 R2

VMware

ESXi hosts must be set to a persona of 11
For non-disruptive failover ESXi hosts must be ESXi 5.x or newer
No storage DRS in automatic mode
Recommended to configure datastore heart beating to a total of 4 to allow 2 at each site
Set the HA admission policy to allow all the required workloads from the other site to run in the event of a fail over

The picture bit

I have robbed the picture above from HPE Peer Persistence documentation, it has more lines on it than the London underground, but let me try and explain. There are two geographically dispersed data centres site A and B. Both sites contain 3 hypervisor hosts shown at the top of the picture and a 3PAR shown at the bottom. The data centres are then linked by a stretched fabric so the zoning information is shared across the sites, synchronous Remote Copy will also occur across the link. Each host is zoned to 3PAR systems at both sites.

At the top of the picture is a blue cylinder at site A and a grey one at site B. This represents that each volume is presented twice, once at each site. The volume has the same WWN at both sites and, by using ALUA, one of the volumes will be marked as writeable (blue cylinder), whilst the other is visible to the host but marked non-writeable (grey cylinder). In the event of a switchover, the volume at site A has its paths marked as standby, whilst the volume at site B has its paths marked as active.

The quorum witness, shown at the bottom of the picture as QW, is a VM which must sit at a third site, not site A or B, and must not rely on the storage it is monitoring. It is deployed using an OVF template and is available in Hyper-V and VMware versions; I will cover its deployment in another post. The job of the quorum witness is to act as an independent arbitrator and decide if an automatic failover from one site to another should occur. The witness VM essentially checks two things: the status of the remote copy link and the availability of the 3PAR systems. When it detects a change in one of these conditions, the action taken is displayed in the following table borrowed from the HPE Peer Persistence documentation. The key thing to take away from the table is that an automatic failover across sites will only ever occur if the witness detects that Remote Copy has stopped and one of the 3PAR systems cannot be contacted.
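That key takeaway boils down to a logical AND, which the Python sketch below spells out. It is a simplification of the rule stated above and not a reproduction of the HPE table; the function and parameter names are made up for illustration.

```python
from itertools import product


def should_auto_failover(remote_copy_stopped, array_unreachable):
    """Automatic failover needs both conditions at once, never just one."""
    return remote_copy_stopped and array_unreachable


# Walk through all four combinations of the two checks the witness makes.
for rc_stopped, unreachable in product([False, True], repeat=2):
    decision = should_auto_failover(rc_stopped, unreachable)
    print(f"remote copy stopped={rc_stopped}, "
          f"array unreachable={unreachable} -> failover={decision}")
```

Only the last combination, where both conditions are true, results in a failover. A broken replication link on its own, or an array that has dropped off the management network on its own, does not.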

Enough chat, let’s implement this thing

To summarise everything we have talked about so far, I am going to list the high level steps to create a 3PAR Peer Persistence setup. I am going to use a Hyper-V setup as an example, but the steps for another type of failover cluster and for VMware are very similar.

Infrastructure Steps

I will assume you have synchronous replication up and running and meet the latency requirements as described above
Verify you have the Peer Persistence licence
Set up a stretched fabric across both sites
If necessary upgrade your 3PAR OS to the 3PAR OS version listed in the requirements section
Deploy the witness VM. Check out my full guide to deploying the 3PAR quorum witness VM for assistance with this

Host Steps

Configure your zoning so all hosts are zoned to both 3PAR systems
Check and if necessary set the correct host persona
On the source system create the remote copy group which contains all the volumes requiring replication. From SSMC Main menu, Remote Copy Group, create.
When creating the remote copy group the WWN of the source and target volume need to be identical. To ensure this is the case, make sure that Remote copy volumes create automatically is selected when you create the remote copy group

Also when creating the remote copy group ensure the two tick boxes in the Peer Persistence section are checked for path management and auto failover
For each 3PAR, export the volumes to the hosts in the cluster, i.e. the source and destination volumes should both be exported to the cluster. Before doing the export, ensure the hosts are already added to a cluster to avoid corruption
You may need to rescan the disks in Disk Manager once they are exported

Management

To change which 3PAR system is actively serving the data and which is standby, select Remote copy groups, highlight the group you wish to change, and choose Switchover

