
VxRail in a nutshell

Dell EMC’s VxRail is selling like hotcakes, and I was lucky enough recently to attend a one-day overview session on the product. The training was really good, and I wanted to share what I had learnt by presenting a high-level overview of what VxRail is about.

What is VxRail?

VxRail is Dell EMC’s hyper-converged offering. Hyper-converged appliances allow a datacenter-in-a-box approach: rather than buying servers, storage and hypervisor separately, hyper-converged appliances bundle the components into one package. The storage, compute and hypervisor components used by VxRail are as follows:

  • Storage – VMware vSAN 6.6
  • Compute – 14th-generation Dell EMC PowerEdge servers
  • Hypervisor – vSphere 6.5

Together, the above form VxRail 4.5.

What else do you get with VxRail?

You also get some other software bundled in:

  • vCenter Server
  • vRealize Log Insight
  • RecoverPoint for VMs
  • vSphere Replication
  • vSphere Data Protection

It is worth noting that the ESXi licences are not included.

What is VxRack?

You may also hear VxRack mentioned; this is the bigger brother to VxRail, with the software-defined storage provided by ScaleIO. Networking using VMware NSX is also an option.

How many nodes do you need?

The minimum number of nodes required is three, although four is recommended to allow overhead for failures and maintenance. Four is also the minimum number of nodes required to use erasure coding rather than mirroring for data protection.
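The node minimums can be sketched as a small lookup. This is a hypothetical helper of my own, with host counts taken from the vSAN 6.6 protection rules (RAID-1 mirroring needs 2n+1 hosts to tolerate n failures; RAID-5 and RAID-6 erasure coding need 4 and 6 hosts respectively):

```python
# Minimum vSAN host counts per protection method (illustrative sketch).
def min_hosts(method: str, failures_to_tolerate: int = 1) -> int:
    if method == "RAID-1":   # mirroring: n+1 copies plus witness components
        return 2 * failures_to_tolerate + 1
    if method == "RAID-5":   # erasure coding, 3 data + 1 parity stripe
        return 4
    if method == "RAID-6":   # erasure coding, 4 data + 2 parity stripe
        return 6
    raise ValueError(f"unknown method: {method}")

print(min_hosts("RAID-1"))  # 3 – matches the three-node minimum
print(min_hosts("RAID-5"))  # 4 – why a fourth node unlocks erasure coding
```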

How do you manage VxRail?

The system is managed from two places. You will be spending most of your time in the vSphere web client, since all vSphere management is still performed from there; and since the storage is provided by vSAN, it is also managed within vSphere. The second tool you will need to become familiar with is VxRail Manager.

vSphere Integration

There is the option to create a new vCenter housed within the hyper-converged infrastructure itself, or to use an external vCenter. The choice between an internal and external vCenter is made during the initial deployment wizard.

What is VxRail Manager?

VxRail Manager allows you to manage the hardware, i.e. the servers, in a VxRail deployment, plus perform a number of other tasks, including:

  • Deployment – Initial deployment and addition of nodes
  • Update – As this is a hyper-converged system, upgrades of all components can be completed from VxRail Manager
  • Monitor – Check for events, and monitor resource usage and component status
  • Maintain – Dial home support

The following screenshots are taken from within VxRail Manager.

Physical node view

vxrail physical health

Logical node view showing resource usage

vxrail logical health

ESXi component status

esxi health status displayed in VxRail manager

What are the models?

You can see detailed specifications of all the models on the Dell EMC site; this section just covers the major differences between the series:

  • S Series – Hybrid only
  • G Series – Space-saving format; a 2U chassis can contain four nodes
  • E Series – 1U, hence supports less capacity than the other, 2U, nodes
  • P Series – 2U; supports twice the capacity of the E Series and is therefore suited to more demanding workloads
  • V Series – Same spec as the P Series plus GPUs; optimised for VDI environments

How do you add a node?

You add a node using VxRail Manager, as shown in the screenshot below; this is a non-disruptive process. Hybrid and all-flash models cannot be mixed, and the first three nodes in any cluster must be identical; after this you can mix and match models. That said, there is something to be said for maintaining consistency across the cluster so that it stays balanced, which will probably make it easier to manage in the future. The cluster can be scaled to a maximum of 64 nodes.

Adding node to VxRail via VxRail Manager
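The expansion rules above can be sketched as a small validation function. This is purely illustrative, not VxRail’s actual logic, and the model names are made up:

```python
# Hypothetical sketch of the node-expansion rules: no mixing of hybrid
# and all-flash nodes, first three nodes identical, 64-node ceiling.
def can_add_node(cluster: list, new: dict, max_nodes: int = 64) -> bool:
    if len(cluster) >= max_nodes:
        return False                                  # cluster is full
    if cluster and new["type"] != cluster[0]["type"]:
        return False                                  # hybrid vs all-flash mix
    if len(cluster) < 3 and any(n["model"] != new["model"] for n in cluster):
        return False                                  # first three must match
    return True

cluster = [{"model": "P570", "type": "all-flash"}] * 3
print(can_add_node(cluster, {"model": "V570", "type": "all-flash"}))  # True
print(can_add_node(cluster, {"model": "S570", "type": "hybrid"}))     # False
```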

How do I take a node out for maintenance?

Since the data is stored inside each of the nodes, there are some additional considerations when putting a node into maintenance versus using a traditional SAN. When you put a host into maintenance mode, the default option, Ensure accessibility, makes sure all data remains available during maintenance, although redundancy may be reduced.

vSAN summary

vSAN is VMware’s software-defined storage solution; no specialist storage hardware is required, and storage is provided by aggregating the disks within each of the ESXi servers. Key concepts to understand:

  • Management – vSAN is managed from within vSphere and is enabled at the cluster level
  • Disk groups – Each disk group consists of a caching disk, which must be an SSD, and 1–7 capacity drives. The capacity drives can be flash drives or, in a hybrid setup, spinning disks. All writes are made via the caching disk; in a hybrid setup, 70% of its capacity is reserved to act as a read cache and the remaining 30% as a write buffer
  • vSAN datastore – Disk groups are combined to create a single usable pool of storage called a vSAN datastore
  • Policy driven – Different performance and availability characteristics can be applied per virtual disk on individual VMs
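The cache sizing maths is worth making concrete. In VMware’s hybrid vSAN design the cache device is split 70% read cache / 30% write buffer; the 400GB figure below is just an example:

```python
# Sketch of how a hybrid disk group's cache SSD is divided
# (70% read cache / 30% write buffer in the hybrid vSAN design).
def hybrid_cache_split(cache_gb: float) -> dict:
    return {"read_cache_gb": cache_gb * 0.7,
            "write_buffer_gb": cache_gb * 0.3}

split = hybrid_cache_split(400)      # e.g. a 400GB caching SSD
print(split)                          # 280GB read cache, 120GB write buffer
```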

vSAN availability and data reduction

  • Dedupe and compression – Enabled at the cluster level; not suitable for all workloads. If you have a mix of workloads that do and do not require dedupe/compression, you would need multiple clusters
  • Availability –
    • Availability and performance levels are set by creating policies; you can have multiple policies on a single vSAN datastore
    • Availability is defined in the policy setting Fault Tolerance Method; the available choices are RAID-1 (Mirroring) and RAID-5/6 (Erasure Coding)
    • RAID-1 is more suited to high-performance workloads and ensures there are two copies of the data across nodes
    • RAID-5/6 – Stripes the data and parity across nodes; more space-efficient, but with reduced performance
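The space efficiency trade-off can be shown with a little arithmetic. This is an illustrative sketch assuming the standard vSAN stripe layouts (RAID-5 as 3+1, RAID-6 as 4+2):

```python
# Raw capacity needed to store a given usable amount under each
# fault tolerance method (FTT=1 for RAID-1/5, FTT=2 for RAID-6).
def raw_needed(usable_gb: float, method: str) -> float:
    factor = {"RAID-1": 2.0,    # mirror: 100% overhead
              "RAID-5": 4 / 3,  # 3 data + 1 parity: ~33% overhead
              "RAID-6": 1.5}    # 4 data + 2 parity: 50% overhead
    return usable_gb * factor[method]

print(raw_needed(100, "RAID-1"))  # 200.0 GB raw for 100 GB usable
print(raw_needed(100, "RAID-5"))  # ~133.3 GB raw for 100 GB usable
```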




VMware Metro Cluster Guide

What is a VMware Metro Cluster?

A VMware Metro Cluster (vMSC) is also sometimes called a stretched cluster, which gives more of a clue to its function, since it allows a single cluster to operate across geographically separate data centres. This ability to operate two locations as a single cluster gives significant availability benefits, for both planned and unplanned outages.

How does a Metro Cluster Work?

A Metro Cluster allows VMs spread across data centres to act as if they were in a single local cluster. To allow this, the VMs need access to the same storage at both sites. This is achieved with products like NetApp’s MetroCluster and HPE’s Peer Persistence, which present a single view of the storage even though it is located across multiple sites, as depicted in the diagram below. Let’s dig into how this works.

VMware metro cluster

Each LUN is replicated between the two storage systems using synchronous replication; however, only one copy of each LUN is writable at a time, whilst the other remains read-only. The writable LUN is presented to the hosts via active paths; the read-only LUN is effectively hidden, with the paths to it marked as standby. This is based on ALUA (Asymmetric Logical Unit Access), which was used in traditional storage systems like the EMC CLARiiON. ALUA marks the preferred, optimised paths to the controller owning a LUN, while non-optimised paths indicate indirect routes to the LUN. The non-optimised standby paths only become live if the primary paths fail.

Below is an example of the paths on an ESXi host connected to a Metro Cluster. The live paths are shown as active, but this can be switched over using the storage array management software so that the active and standby paths reverse.

vmware metro cluster paths
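The path behaviour described above can be modelled as a toy simulation. This is purely illustrative; the path names and states are invented, and in reality path state comes from the host’s multipathing stack:

```python
# Toy model of ALUA path selection: IO uses the active (optimised)
# paths; standby paths are used only when no active path remains.
def usable_paths(paths: list) -> list:
    active = [p["name"] for p in paths if p["state"] == "active"]
    return active or [p["name"] for p in paths if p["state"] == "standby"]

paths = [{"name": "siteA-c0", "state": "active"},
         {"name": "siteA-c1", "state": "active"},
         {"name": "siteB-c0", "state": "standby"},
         {"name": "siteB-c1", "state": "standby"}]
print(usable_paths(paths))       # site A carries the IO

for p in paths[:2]:              # simulate site A's paths failing
    p["state"] = "dead"
print(usable_paths(paths))       # site B's standby paths take over
```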

What are the requirements for a stretched cluster?

In order to setup a VMware Metro Cluster the following is required:

  • VMware Metro Storage Cluster licensing – There is no minimum vSphere edition for the creation of a metro cluster; however, if automated workload balancing with DRS is required, the minimum licence is the Enterprise Plus edition
  • Supported storage connectivity. Fibre Channel, iSCSI, NFS, and FCoE are supported
  • Max latency for vMotion from vSphere 6 is 150ms
  • Stretched storage network across sites
  • Maximum supported storage replication latency is 10ms; this may be lower depending on the vendor
  • Suitable software options selected for storage e.g. 3PAR Peer Persistence option
  • Maximum network latency RTT between sites for the VMware ESXi management networks is 10ms
  • The vSphere vMotion network requires a redundant network link, with a minimum of 250Mbps bandwidth
  • A third site is required for deployment of a witness which will act as an arbitrator
  • Storage IO control is not supported on a Metro Cluster enabled datastore
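The latency and bandwidth limits above lend themselves to a quick pre-flight check. This is a hypothetical sketch; the field names and measured values are invented:

```python
# Sanity-check measured inter-site figures against the vMSC limits
# listed above (150ms vMotion, 10ms management, 10ms replication,
# 250Mbps minimum vMotion bandwidth).
LIMITS = {"vmotion_rtt_ms": 150, "mgmt_rtt_ms": 10,
          "replication_rtt_ms": 10, "vmotion_bw_mbps_min": 250}

def check_site_link(measured: dict) -> list:
    problems = []
    if measured["vmotion_rtt_ms"] > LIMITS["vmotion_rtt_ms"]:
        problems.append("vMotion latency exceeds 150ms")
    if measured["mgmt_rtt_ms"] > LIMITS["mgmt_rtt_ms"]:
        problems.append("management network latency exceeds 10ms")
    if measured["replication_rtt_ms"] > LIMITS["replication_rtt_ms"]:
        problems.append("replication latency exceeds 10ms")
    if measured["vmotion_bw_mbps"] < LIMITS["vmotion_bw_mbps_min"]:
        problems.append("vMotion bandwidth below 250Mbps")
    return problems

print(check_site_link({"vmotion_rtt_ms": 5, "mgmt_rtt_ms": 4,
                       "replication_rtt_ms": 4, "vmotion_bw_mbps": 1000}))  # []
```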


What are the benefits of a Metro Cluster?

  • Mobility – Since storage and network configuration is shared across the sites, vMotion requirements are met and VMs can be manually migrated or dynamically balanced across the cluster and locations using DRS
  • Reduce physical boundaries – DRS can be used to automatically balance workloads across locations
  • Reduce downtime – A Metro Cluster allows the online movement of VMs and storage for planned events without downtime. These can be performed together or independently. For example, if maintenance was planned on the storage system, the active paths could be switched over to the other site; if the entire site was expected to be offline, the storage and VMs could be migrated to the opposite site
  • High availability – vMSC protects against both storage system and site failures. In the event of a storage system failure, this is detected by a witness VM and the active paths are switched over to the other system; for a site failure, VMware HA restarts the VMs at the surviving site
  • Reduced RTO – Automated recovery reduces RTO for storage or site failures


What are the drawbacks?

  • Complexity – Although setting up a vMSC is not too strenuous, it is certainly more complex than a single-site cluster
  • Testing – Although vMotion between sites and switch-over of storage between sites can be tested, there is no simple way to test a full failover scenario, for example with a run book

Considerations for VMware metro cluster design

  • HA admission control – The first consideration around HA is a logical one: use admission control and set the reservation level to 50% for both CPU and memory. This ensures that, should a failover between sites be required, the resources are guaranteed to be available
  • HA datastore heartbeating – Heartbeat datastores are used to validate the state of a host. It is important that datastores used for heartbeating are configured at both locations, so that false results are not received if a site is lost. VMware recommends configuring two heartbeat datastores at each site
  • HA APD – The response to an All Paths Down event needs customising. You will find the setting in the HA settings: after selecting Protect against Storage Connectivity Loss, select Power off and restart VMs

vsphere metro cluster HA settings
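The 50% admission-control recommendation is easy to reason about numerically. The following is a sketch of the usable-capacity maths, with example totals that are purely illustrative:

```python
# With admission control reserving 50% of CPU and memory, only half of
# the cluster's capacity is usable for running VMs; the reserved half
# guarantees room to restart everything if one site fails.
def usable_capacity(total_ghz: float, total_gb: float, reserved_pct: int = 50):
    frac = 1 - reserved_pct / 100
    return total_ghz * frac, total_gb * frac

cpu, mem = usable_capacity(total_ghz=400, total_gb=4096)
print(cpu, mem)  # 200.0 2048.0 – half the cluster held back for failover
```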

  • ESXi host names – Create a logical naming convention which allows you to quickly identify which site a host is in. This could be including the site in the naming convention, or a numbering system that reflects the location, for example odd-numbered hosts in site one. This will make designing your system and running it day to day easier
  • Data locality and host affinity rules – Ideally, hosts should access data from their local storage array to improve response times. To ensure this is the case, use VMware affinity rules to define the preferred site for VMs to run against a local LUN. Do not use “must” rules; if you do, then even in the event of a site failure the VMs will not move, as doing so would violate the rule
  • Logically name the LUNs with their home sites – This is not a must, and some may argue they want the flexibility to move LUNs between datacentres, but it will make it easier for BAU staff to track which datastores are local

What causes a failover?

For an automated failover of the storage to occur, a number of failure conditions must be met; the conditions for 3PAR are summarised in the following table from HPE.

3par peer persistence error handling

Essentially contact needs to be lost with the storage array and replication needs to be stopped.


How do you test a Metro Cluster?

There is no automated testing method for a Metro Cluster; however, with a bit of thought it is possible to run some tests, although some are more invasive and risky than others. We will run through the tests, starting with the least risky and moving towards the more invasive.

1 vMotion – This is the simplest test: move a VM between sites. Although simple, vMotion has more requirements than HA, and so this will start to build confidence as we move through the tests

2 Storage switch over – Switching which site the storage is actively running on can again be completed online with little risk

3 Simulated storage failure – This test incurs risk, since it is possible IO could be lost when the storage system is taken offline. Verify the specifics of a failover scenario with your storage vendor; for example, with a 3PAR you will need to take the management network and Remote Copy network offline simultaneously. Before you do this, disable automatic failover for the LUNs you do not wish to be part of the test

4 Simulated site failover – For this test you simulate a site failure: a storage failure as above, plus a host failure so that HA kicks in. Choose a VM to test and move it to a host by itself, power off the other VMs in the environment, and put the hosts out of scope into maintenance mode. Again, there is risk in this test; be selective about which VMs you choose to test

Remember, tests 3 and 4 do incur risk; perform them at your own risk and only if the project requires it.

Further Reading

VMware vSphere Metro Storage Cluster Recommended Practices

The dark side of Stretched clusters

NetApp Metro Cluster tutorial video


Thoughts on VMware Cloud on AWS


Last month VMware announced the availability of VMware Cloud on AWS. The size and scale of VMware essentially mean that any large-scale product launch like this is significant; large players like this can create the trend as well as follow it. Whilst VMware has not yet been massively successful in the cloud space, their on-premises foothold is huge, and therefore so is the market potential.

Technical Specs

Components of VMware cloud on AWS

This service leverages vSphere, NSX and vSAN to allow you to run your VMware VMs in the AWS cloud. This is not a nested solution like Ravello; it runs on dedicated hosts housed in an AWS data center. Today the service is only available in one region, AWS US West, and with a minimum of four dedicated hosts. The ESXi hosts are beefy: each has dual E5-2686 v4 CPUs @ 2.3GHz with 18 cores, that’s 36 cores total, or 72 threads with hyper-threading. Memory is 512GB of RAM, and storage is 10TB raw per node.
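The minimum configuration’s aggregate resources work out as follows. This is simple arithmetic from the per-host figures above:

```python
# Aggregate resources for the minimum 4-host configuration.
hosts = 4
cores_per_host = 36           # dual 18-core E5-2686 v4
ram_gb_per_host = 512
raw_storage_tb_per_host = 10

print(hosts * cores_per_host)           # 144 physical cores
print(hosts * ram_gb_per_host)          # 2048 GB of RAM
print(hosts * raw_storage_tb_per_host)  # 40 TB raw storage
```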


Pricing

Cost has been one of the most eagerly anticipated aspects of this announcement, and initially there is one option: an on-demand rate billed per hour. This is $8.37 per host per hour; given the four-host minimum, this works out at approximately $24,000 per month. Off the bat this sounds expensive, but Keith Townsend has done some analysis showing it is comparable to running the equivalent VMs in AWS EC2. In time, 1-year and 3-year pricing deals will be available, offering roughly 30% and 50% reductions respectively compared to the on-demand pricing.
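The quoted monthly figure can be reproduced with some quick arithmetic. This sketch assumes a 730-hour average month; the actual bill varies with the length of the billing month:

```python
# Reproducing the on-demand cost figure for the 4-host minimum.
rate_per_host_hour = 8.37
hosts = 4
hours_per_month = 730  # average hours in a month

monthly = rate_per_host_hour * hosts * hours_per_month
print(round(monthly))        # ~24440, i.e. roughly $24,000/month

# Indicative reserved-pricing reductions mentioned in the text
print(round(monthly * 0.7))  # 1-year commitment (~30% off)
print(round(monthly * 0.5))  # 3-year commitment (~50% off)
```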

Adoption and Use cases

In terms of technical innovation, VMware Cloud on AWS does not currently offer significant additional benefits versus hosting on-premises. Further integration with AWS services is expected in the future. However, it still offers a number of cloud-type benefits, such as on-demand pricing and scaling. VMware is responsible for all the patching and hardware maintenance of the hosts, so this becomes like a SaaS offering of VMware, with only the management of the VMs remaining a concern.

The four-host minimum may be prohibitive for many SMEs. If VMware were able to deliver a non-dedicated hardware model, this would improve the adoption rate by lowering the barrier to entry. It will be interesting to see if they look to a Ravello-style nested system, or if the performance hit of that approach is viewed as too great.

The on-demand pricing facilitates cloud bursting: imagine a travel company whose demand doubles in the summer season; it could request and deploy additional capacity in a familiar tool. This would be powerful.

VMware is extremely familiar to most organisations; it is a known and trusted technology. Some organisations may choose to do a lift and shift of their current VMware infrastructure to the cloud. Choosing to move to another cloud technology, for example native AWS, would require a significant re-skilling process and a potentially costly redesign exercise. VMware Cloud on AWS would enable a far simpler transition and be compatible with current processes and skill sets.

From a more pessimistic point of view, VMware Cloud on AWS also offers an easy on-ramp for CIOs under pressure to introduce a move to the cloud into their organisation.

This new lift-and-shift model offered by VMware and Ravello gives organisations a simpler path to the cloud. Whilst re-architecting applications may be optimal to leverage the architectural differences of the cloud, that is a significant undertaking. This is a 1.0 release; additional integration with AWS seems likely to come over time, and the product brings choice to the market with another method to move to the cloud.

What are your thoughts? Let me know in the comments

Don’t miss any more news or tips by following via e-mail or Twitter.