vSAN Health Checks Explained – Part 5

Hello everyone. This is the fifth and final part of the blog series “vSAN Health Checks Explained”. Before we get to the content, let me thank the readers for their valuable feedback. I am glad to see that the series has helped many newcomers to vSAN understand the concepts properly, and this is what the #vCommunity is all about: sharing knowledge and helping others grow.

I am also glad to see that this blog series got its place in the #AwesomevSAN list from the VMware community. This is the link to the #AwesomevSAN list.

https://github.com/valdecircarvalho/awesome-vsan#vsan-blogs-series

This is a one-stop list where you can find everything you need on vSAN, and it keeps growing. For more information on the list, or if you have recommendations, please contact Val (Twitter: @homelaber).

Now let’s move on to the next category “Data”. This category mainly deals with the vSAN object Health.

vSAN Object Health

This health test gives a summary of the health status of all the objects in the vSAN cluster and groups them under different sub-categories based on that status. From the sub-category an object falls under, administrators get a clear idea of the reason for its health status as well as the actions they need to take. Each sub-category carries a risk factor, so timely corrective action is recommended.

The above screenshot shows that the cluster objects are all healthy and no rebalance activities are going on. Let's go through the sub-categories one by one.

Reduced availability with no rebuild : If you see any objects here, it means that those objects have suffered a failure that vSAN was able to tolerate, but no rebuild has been initiated. This can happen for several reasons; one is that the other hosts do not have enough resources to recreate the copy of the object. Even though virtual machines using the object can still access the data, it is highly recommended to find the root cause and resolve it, because a subsequent failure of a host or a host component might bring the object down. You can also check the “Limits” category to see the status under “One additional host failure”. If needed, you can scale up or scale out the cluster to add more resources.

Reduced availability with no rebuild – Delay Timer : Similar to the above sub-category, objects here have suffered a failure that vSAN was able to tolerate. vSAN waits for a period of 60 minutes before trying to re-protect the object, because the failure can be either temporary (a host reboot, a drive replacement etc.) or permanent (a host failure etc.). You can trigger a manual rebuild operation if the wait is too long for you, and you can modify the wait time under the vSAN settings.
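To make the delay timer a little more concrete, here is a minimal sketch that works out how much of the wait is left before vSAN would try to re-protect an object. It assumes the 60-minute wait mentioned above; the function name and the timestamps are hypothetical and only for illustration, and this is not how vSAN itself exposes the timer.

```python
from datetime import datetime, timedelta

# Wait period described above; it can be changed under the vSAN settings.
DEFAULT_REPAIR_DELAY = timedelta(minutes=60)


def remaining_delay(failed_at, now, delay=DEFAULT_REPAIR_DELAY):
    """Return how long is left before the rebuild would start (never negative)."""
    elapsed = now - failed_at
    return max(delay - elapsed, timedelta(0))


# Hypothetical example: the failure happened at 10:00 and it is now 10:45.
failed_at = datetime(2019, 4, 1, 10, 0)
now = datetime(2019, 4, 1, 10, 45)
print(remaining_delay(failed_at, now))
```

Once the result reaches zero, the object would move out of this sub-category and vSAN would start rebuilding it, provided the failed component has not come back in the meantime.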

Non-availability related reconfig : If you see any objects here, it doesn’t mean that the objects are at risk of availability. They appear here because of an SPBM policy change that is not related to availability, that is, a change that leaves the FTT as it is but alters something else, such as the stripe width of an object. Administrators need not worry about objects in this sub-category; they can simply monitor them to confirm that the policy change has succeeded.

Reduced availability : If you see objects here, it means that those objects have suffered a failure and vSAN is attempting to re-protect them. The objects are still available for any virtual machines using them and because the SPBM policy has an FTT value for resilience, vSAN is rebuilding the missing components. No manual action is needed but administrators can monitor this situation and confirm that the rebuild has succeeded.

Data move : If you see objects in this sub-category, it means that data rebalancing is happening in the background, either because of a host evacuation or manual maintenance activities. The objects in this sub-category are all healthy and compliant with the SPBM policy, so administrators need not worry about them. If required, you can apply throttling limits to the resync operations. This was explained in part 1 of the series.

Inaccessible : If you see any objects in this sub-category, it means that those objects have suffered a failure beyond the configured tolerance level and vSAN is unable to recover them. Such objects are not available to any virtual machines that need them. This is a high-risk category, and administrators need to identify the root cause and take remediation actions.

Healthy : The object is available and is compliant with all the configured Storage policies. No rebalancing actions are taking place in this sub-category.

Non-availability related incompliance : Any object in this sub-category doesn’t have issues with availability but is non-compliant with its configured policies. This is a kind of catch-all category for objects that none of the above sub-categories applies to.
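The sub-categories above can be summarised as a small lookup table. The sketch below is only an illustration: the keys are shorthand labels I made up for this example, and the priority numbers and action text are my own summary of the descriptions above, not values returned by vSAN.

```python
# Shorthand key -> (priority, suggested action). Lower priority = more urgent.
# Keys, priorities and wording are hypothetical, made up for this sketch.
GUIDE = {
    "inaccessible": (0, "Identify the root cause and remediate; VMs cannot use the object."),
    "reduced_no_rebuild": (1, "Find out why no rebuild started; consider adding resources."),
    "reduced_delay_timer": (1, "Wait for the timer to expire or trigger a manual repair."),
    "reduced_rebuilding": (2, "Monitor until the rebuild completes."),
    "non_compliant": (2, "Check why the object does not match its policy."),
    "policy_change": (3, "Monitor until the policy change completes."),
    "data_move": (3, "No action; optionally throttle resync traffic."),
    "healthy": (4, "No action needed."),
}


def triage(object_counts):
    """List non-empty, non-healthy sub-categories, most urgent first."""
    rows = [
        (GUIDE[name][0], name, count, GUIDE[name][1])
        for name, count in object_counts.items()
        if count > 0 and name != "healthy"
    ]
    return [(name, count, action) for _, name, count, action in sorted(rows)]


# Hypothetical object counts, as one might read them off the health test.
counts = {"healthy": 120, "data_move": 4, "inaccessible": 1, "reduced_rebuilding": 0}
for name, count, action in triage(counts):
    print(f"{name}: {count} object(s) -> {action}")
```

With these sample counts the inaccessible object is listed first, followed by the objects in data move, which matches the order in which an administrator would want to look at them.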

Manually repairing the objects

To initiate a manual repair operation on the vSAN datastore, you can use the option available under the health test itself.

Purging inaccessible VM Swap Objects

Following cluster issues, it is possible to end up with inaccessible swap objects. They become orphaned when the virtual machines are powered on again, because new swap files are created at that point. These inaccessible swap files are safe to delete. vSAN 6.7 U1 provides a console option to purge inaccessible swap files and reclaim the space.

Now let’s move on to the next category ‘vSAN Build recommendation’.

vSAN Build Recommendation Engine Health

vSAN gives a build recommendation based on the installed hardware and the Hardware Compatibility Guide (HCG). There are some requirements to be met for this, such as internet access, VMware Update Manager, and access to the HCG and the release metadata. This particular health test ensures that these services are running and available.

vSAN build recommendation

This test recommends the best ESXi build based on the installed hardware, HCG and the release metadata.

vSAN release catalog up-to-date

This test checks whether the release catalog used for the build recommendation is up-to-date or not.

This is all for this category. Now let’s move on to the final category – ‘Online Health Checks’

VMware collects non-sensitive information from customers’ vSAN implementations as part of its continuous improvement service. This is a kind of telemetry data that VMware uses for proactive analysis and customer guidance. For this, the customer should enable the ‘Customer Experience Improvement Program (CEIP)’. The online health checks are available only when you have enabled CEIP.

When you run an online health test, vSAN fetches the most recent test recommendations and definitions, and they appear under the ‘Online Tests’ category. This is independent of VMware Update Manager, that is, you don’t need to update the platform to get the latest definitions.

Let's discuss one use case related to this. Suppose VMware has identified a bug in vSAN that can be addressed with a new patch. They release the patch and customers get it via Update Manager. VMware will then add a health definition under ‘Online Tests’, and customers who fail that online test get notified automatically. VMware GSS can also get the list of vulnerable customers from the telemetry data and proactively contact them.

Now let's go through some of the tests under the ‘Online Tests’ category.

Customer Experience Improvement Program (CEIP)

This test checks whether CEIP is enabled on the platform. This is necessary for online health checks to function.

Online Health Connectivity

This test checks and ensures that there is internet connectivity. Online Health Tests require internet connectivity to work.

Physical Network Adapter Link Speed Consistency

This test checks whether the physical NICs used by the vSAN hosts have a consistent link speed. Inconsistent link speeds might affect vSAN performance.
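As a rough idea of what such a consistency check boils down to, here is a minimal sketch that groups NICs by link speed and flags the cluster when more than one speed is present. The host and NIC names and the speeds are hypothetical sample data; the real health test gathers this information from the hosts itself.

```python
def group_by_speed(nic_speeds_mbps):
    """Group NIC names by their reported link speed in Mbps."""
    groups = {}
    for nic, speed in nic_speeds_mbps.items():
        groups.setdefault(speed, []).append(nic)
    return groups


def is_consistent(nic_speeds_mbps):
    """True when every NIC reports the same link speed."""
    return len(set(nic_speeds_mbps.values())) <= 1


# Hypothetical sample: one host has negotiated 1 Gbps instead of 10 Gbps.
nics = {
    "esx01:vmnic2": 10000,
    "esx02:vmnic2": 10000,
    "esx03:vmnic2": 1000,
}
print(is_consistent(nics))
print(group_by_speed(nics))
```

Grouping by speed makes it easy to spot the odd one out, which in practice often points to a cabling or switch port negotiation problem on a single host.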

vSAN Critical Alert – Patch available for critical vSAN issue

This is a health test related to the use case explained above. This test checks whether a critical patch is available for the vSAN hosts. If it is available, it is recommended to update the hosts as soon as possible.

vCenter Server up to date

This test checks whether vCenter Server is up to date and is compatible with the hosts as per the Product Interoperability Guide.

Thick Provisioned VMs on vSAN

This test checks for any VMs that are thick provisioned on the vSAN datastore even though the storage policy applied to them specifies thin provisioning. These are special cases, and the disks need to be converted to thin for space reclamation. More details are available in this KB: https://kb.vmware.com/s/article/66758
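The mismatch this test looks for can be sketched in a few lines. The example below assumes that a thin provisioning policy means an object space reservation of 0%, and it works on a hand-written list of disks; the field names and disk names are hypothetical and not taken from any vSAN API.

```python
def thick_disks_with_thin_policy(disks):
    """Return names of disks that are thick on disk although their policy reserves 0% space."""
    return [
        d["name"]
        for d in disks
        if d["thick_provisioned"] and d["policy_space_reservation_pct"] == 0
    ]


# Hypothetical inventory: only vm01.vmdk is thick while its policy is thin.
disks = [
    {"name": "vm01.vmdk", "thick_provisioned": True, "policy_space_reservation_pct": 0},
    {"name": "vm02.vmdk", "thick_provisioned": False, "policy_space_reservation_pct": 0},
    {"name": "vm03.vmdk", "thick_provisioned": True, "policy_space_reservation_pct": 100},
]
print(thick_disks_with_thin_policy(disks))
```

Disks reported this way are the candidates for the conversion to thin described in the KB article.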

We have now covered all the health tests available on my vSAN infrastructure (6.7 U1) as of today, and this concludes the five-part series. I hope it was useful. Thanks for reading.

See the other parts below:

Part 1 -> https://vxplanet.com/2019/01/30/vsan-health-checks-explained-part-1/

Part 2 -> https://vxplanet.com/2019/02/01/vsan-health-checks-explained-part-2/

Part 3 -> https://vxplanet.com/2019/03/11/vsan-health-checks-explained-part-3/

Part 4 -> https://vxplanet.com/2019/03/22/vsan-health-checks-explained-part-4/