vSAN Health Checks Explained – Part 1

Hello everyone. In this blog post we will walk through the health check options available with vSAN 6.7. vSAN includes more than 50 automated health checks (as of now) that monitor your HCI infrastructure and report how it is doing. You can also run an online health test, which downloads new health check definitions and categories for vSAN, so you don’t need to update the vSphere infrastructure to get them. The health checks described here are based on vSAN 6.7U1, which I am currently working on.

Let’s go through the categories one by one, starting with “Cluster”.

v1

ESXi vSAN Health service installation

This test checks whether all the ESXi hosts have the health service plugin installed. This should happen automatically during vSAN setup, as the plugin is deployed automatically for version 6.0U2 and above.

vSAN Health Service up-to-date

This test will check the version of the health check plugin on all the ESXi hosts in the vSAN cluster and the vCenter server. The health check plugin (VIB) is initially installed onto the vCenter server and the vCenter pushes the VIBs to the ESXi hosts in the vSAN cluster.

Advanced vSAN configuration in sync

It’s possible that advanced configuration options on some ESXi hosts in the cluster were tuned for a special requirement and then left that way. These options are set per host (for example, enabling TRIM/UNMAP to take advantage of space reclamation), so they can drift apart over time. This test checks that the advanced configuration options of all the ESXi hosts in the vSAN cluster are consistent. Note that it checks for consistency only: if you have set an incorrect value on all the hosts, this test will neither flag nor fix it. Using host profiles is a better way to keep the ESXi configurations consistent.
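To make the "consistency only" point concrete, here is a minimal Python sketch of such a comparison. It is not how the health service is implemented; the function name, the host names and the option names are placeholders chosen for illustration and may not match the exact option names on your hosts.

```python
def find_inconsistent_options(host_options):
    """Return {option: {value: [hosts]}} for options whose value differs between hosts.

    host_options maps a host name to a dict of advanced option -> value.
    An option that is missing on a host is treated as the value None.
    """
    all_names = set()
    for options in host_options.values():
        all_names.update(options)

    mismatches = {}
    for name in sorted(all_names):
        hosts_by_value = {}
        for host, options in host_options.items():
            value = options.get(name)  # None: option not set on this host
            hosts_by_value.setdefault(value, []).append(host)
        if len(hosts_by_value) > 1:  # more than one distinct value: inconsistent
            mismatches[name] = hosts_by_value
    return mismatches


if __name__ == "__main__":
    # Hypothetical sample data, not taken from a real cluster.
    cluster = {
        "esx01": {"VSAN.GuestUnmap": 1, "VSAN.ClomRepairDelay": 60},
        "esx02": {"VSAN.GuestUnmap": 0, "VSAN.ClomRepairDelay": 60},
        "esx03": {"VSAN.GuestUnmap": 1, "VSAN.ClomRepairDelay": 60},
    }
    for option, hosts_by_value in find_inconsistent_options(cluster).items():
        print(option, hosts_by_value)
```

If every host carries the same wrong value, the function returns an empty result, which is exactly the limitation described above.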

vSAN CLOMD liveness

CLOMD stands for Cluster Level Object Manager Daemon. Every ESXi host in the vSAN cluster runs a service called clomd, which is responsible for operations such as new object creation, repair and rebalance actions, and data evacuation (for disk or host failures, maintenance mode, etc.). This test checks whether the service is running on all the ESXi hosts in the vSAN cluster. To check the status manually, try the below in an SSH session on an ESXi host.

v2

vSAN Disk Balance

vSAN is an object store. Data is written as objects and, based on the FTT setting in the storage policy, multiple copies made up of components are written to different fault domains. With the default policy, a component can be up to 255 GB; a larger object is split into multiple components. It’s possible to see a variance in usage across the disks of the ESXi hosts. This test monitors the average and maximum disk usage and the variance, and triggers an alert if the variance is above 30%. It automatically kickstarts a rebalance activity. You can also trigger a manual rebalance if necessary. Rebalancing is a slow process, so it is good to schedule it outside business hours.

v3
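The arithmetic behind the disk balance check can be sketched in a few lines of Python. This is only an illustration: it assumes that "variance" means the gap between the most used and the least used disk (in percentage points), and the function name and disk names are hypothetical.

```python
def disk_balance(usage_percent, threshold=30.0):
    """Summarize disk usage and flag an imbalance.

    usage_percent maps a disk name to its used capacity in percent.
    Assumption: variance = highest usage minus lowest usage.
    """
    usages = list(usage_percent.values())
    average = sum(usages) / len(usages)
    maximum = max(usages)
    variance = maximum - min(usages)
    return {
        "average": average,
        "maximum": maximum,
        "variance": variance,
        "alert": variance > threshold,  # alert only when strictly above the threshold
    }


if __name__ == "__main__":
    # Hypothetical usage figures for three capacity disks.
    print(disk_balance({"disk-a": 70.0, "disk-b": 35.0, "disk-c": 45.0}))
```

With the sample figures the gap is 35 points, so the sketch would raise the alert; a cluster with disks at 50% and 40% would not.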

Resync operations throttling

By default, vSAN maintains a good balance between VM I/O and resync I/O, so in most cases you don’t need to set a resync throttle. This test checks whether a throttling limit has been applied on any of the hosts. You can view the current resync activity and used bandwidth here:

v4

v5

vCenter state is authoritative

All ESXi hosts in the vSAN cluster use vCenter Server as the single source of truth. vCenter maintains the vSAN cluster membership and the unicast tables, and pushes this information down to all the ESXi hosts in the vSAN cluster. The unicast table looks like the below:

v6

This table has to be consistent across all the ESXi hosts in the vSAN cluster and vCenter.

In some scenarios, such as when a vCenter Server is recovered from a backup, it might have stale or no information. This test checks whether inconsistencies exist between the vCenter state and the ESXi hosts in the vSAN cluster. If it turns out to be a host issue, you can update the ESXi configuration manually as seen below:

v7

You may see this alert triggered while new ESXi hosts are being added to the cluster, but it clears by itself.

vSAN cluster configuration consistency

While moving hosts across vSAN clusters or merging two clusters together, it is possible that some hosts end up with an inconsistent configuration (for example deduplication/compression, encryption, or performance service settings). This test checks if any inconsistencies exist across the hosts in the vSAN cluster. If there is an inconsistency, you can remediate it as shown below:

v8

Time is synchronized across hosts and VC

This test checks whether time is synchronized across all the hosts in the vSAN cluster and vCenter. It’s recommended to set the time source to an NTP server (internal or external). If there is a time difference greater than 60 seconds, this test triggers an alert.
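As a rough illustration of the 60 second rule, the Python sketch below compares each host clock against the vCenter clock. Using vCenter as the reference point is an assumption made for this example, and the function and host names are hypothetical.

```python
from datetime import datetime, timedelta


def find_time_drift(vcenter_time, host_times, max_drift=timedelta(seconds=60)):
    """Return {host: drift} for hosts whose clock differs from vCenter by more than max_drift."""
    drifted = {}
    for host, host_time in host_times.items():
        drift = abs(host_time - vcenter_time)  # absolute difference, ahead or behind
        if drift > max_drift:
            drifted[host] = drift
    return drifted


if __name__ == "__main__":
    # Hypothetical clock readings collected at the same moment.
    vcenter = datetime(2019, 1, 15, 10, 0, 0)
    hosts = {
        "esx01": vcenter + timedelta(seconds=2),
        "esx02": vcenter - timedelta(seconds=95),
    }
    for host, drift in find_time_drift(vcenter, hosts).items():
        print(host, "is off by", drift)
```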

vSphere cluster members match vSAN cluster members

This test checks that vSphere cluster members match vSAN cluster members. vCenter and ESXi hosts communicate via the management port group, while vSAN communication happens via the vSAN port group. Consider a scenario where the vSAN VMkernel adapter for a host is not set up: this host is a member of the vSphere cluster (via the management port group) but not a member of the vSAN cluster. For a healthy cluster, the test result is shown below:

v9
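Conceptually, this check is a comparison of two membership lists. A minimal Python sketch, with hypothetical host names and a hypothetical function name, could look like this:

```python
def compare_membership(vsphere_members, vsan_members):
    """Report hosts that appear in only one of the two membership lists."""
    vsphere = set(vsphere_members)
    vsan = set(vsan_members)
    return {
        # In the vSphere cluster but not in the vSAN cluster,
        # e.g. a host without a vSAN VMkernel adapter.
        "missing_from_vsan": sorted(vsphere - vsan),
        # Known to vSAN but not part of the vSphere cluster.
        "missing_from_vsphere": sorted(vsan - vsphere),
    }


if __name__ == "__main__":
    # Hypothetical example: esx04 has no vSAN VMkernel adapter configured.
    print(compare_membership(
        ["esx01", "esx02", "esx03", "esx04"],
        ["esx01", "esx02", "esx03"],
    ))
```

Both lists are empty for a healthy cluster.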

Software version compatibility

This test checks the vSAN on-disk format version against the installed ESXi version across all the hosts in the vSAN cluster. If the cluster already uses the latest on-disk format version and a new host with an older ESXi version is introduced, this test reports an incompatibility. The host might then be moved into a separate network partition in the cluster. Interoperability between different on-disk format versions is quite challenging. For example, version 2 supports a 1K block size and versions 3 to 7 support a 4K block size. In a different scenario, after you upgrade to a new ESXi version, the on-disk format might still be at the old version, which triggers an alert and requires an on-disk format upgrade.

Disk format version

This test checks the on-disk format version of all the disks in the cluster and ensures that all vSAN disks use the highest disk format version that the host supports. To get full support for all the vSAN features, upgrading the on-disk format to the highest supported version is recommended. The on-disk format versions for the different vSAN versions are summarized below:

v10

You can upgrade the on-disk format version as shown below:

v11
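The rule "every disk should use the highest format version its host supports" can be sketched in Python as follows. The version numbers, host names and disk names are made up for the example, and the function is a hypothetical helper, not part of any vSAN tooling.

```python
def disks_needing_upgrade(disk_versions, host_max_version):
    """List (host, disk, current, target) for disks below their host's highest supported version.

    disk_versions maps host -> {disk: on-disk format version}.
    host_max_version maps host -> highest on-disk format version that host supports.
    """
    outdated = []
    for host, disks in disk_versions.items():
        target = host_max_version[host]
        for disk, version in disks.items():
            if version < target:
                outdated.append((host, disk, version, target))
    return sorted(outdated)


if __name__ == "__main__":
    # Hypothetical inventory: one disk on esx01 is still on an older format.
    versions = {
        "esx01": {"disk-1": 7, "disk-2": 5},
        "esx02": {"disk-3": 7},
    }
    supported = {"esx01": 7, "esx02": 7}
    for host, disk, current, target in disks_needing_upgrade(versions, supported):
        print(f"{host}/{disk}: version {current}, upgrade to {target}")
```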

vSAN extended configuration in sync

This test checks whether the vSAN cluster extended configuration options, such as object repair timer, site read locality, and large scale support, are consistent across all the ESXi hosts in the cluster. If you see a mismatch, you can remediate it as shown below:

v12

That covers the checks available in the “Cluster” category.

Continue reading? Here are the next parts:

Part 2 -> https://vxplanet.com/2019/02/01/vsan-health-checks-explained-part-2/

Part 3 -> https://vxplanet.com/2019/03/11/vsan-health-checks-explained-part-3/

Part 4 -> https://vxplanet.com/2019/03/22/vsan-health-checks-explained-part-4/

Part 5 -> https://vxplanet.com/2019/03/29/vsan-health-checks-explained-part-5/

 
