Hello everyone, I hope Parts 1 and 2 were useful. In this third part, we will go through the remaining health checks, starting with the “Physical Disks” category.
Operation health

This test checks the operational status of the physical disks on all hosts in the cluster: hard and soft errors, issues reading metadata, congestion, accessibility errors, capacity and so on. Poorly performing disks or disk groups cause higher I/O latencies and congestion. The Dying Disk Handling (DDH) feature of vSAN detects such faulty conditions, marks the affected disk or disk group as unhealthy, and triggers data evacuation depending on the storage policies and the compliance state of the objects. If the objects on the failed disk/disk group have FTT=1 and can tolerate the failure of this disk/disk group, vSAN marks the disk/disk group as “absent”, and the replica copy is rebuilt lazily after a wait period of 60 minutes. But if a component cannot tolerate the failure, or is needed for quorum, it is evacuated immediately. Logs are written to /var/run/log on the vSAN hosts.
Note that if the caching disk has failed, then the entire disk group is marked as failed.
It is always recommended to maintain around 30% slack space on the vSAN cluster to handle data evacuation scenarios like this.
More details on vSAN DDH are available in the following KB: https://kb.vmware.com/s/article/2148358
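The DDH behavior described above can be summarized as a simple decision rule. The sketch below is illustrative only (not vSAN source code); the component names are invented, and the 60-minute value is the default object repair timer mentioned above:

```python
# Illustrative sketch of the DDH decision described above: components that
# can tolerate the disk failure are marked "absent" and rebuilt lazily after
# the repair delay; components that cannot tolerate it are evacuated at once.

REPAIR_DELAY_MINUTES = 60  # default object repair timer

def handle_failed_disk(components):
    """components: list of dicts with 'name' and 'can_tolerate_failure'."""
    actions = {}
    for comp in components:
        if comp["can_tolerate_failure"]:
            # A replica still satisfies the policy: mark absent, rebuild lazily.
            actions[comp["name"]] = f"absent, rebuild after {REPAIR_DELAY_MINUTES} min"
        else:
            # No tolerance left (e.g. quorum at risk): evacuate immediately.
            actions[comp["name"]] = "evacuate immediately"
    return actions

print(handle_failed_disk([
    {"name": "vm1-replica", "can_tolerate_failure": True},
    {"name": "vm2-witness", "can_tolerate_failure": False},
]))
```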
Disk capacity

This test applies only to the capacity drives, not the cache drives. It checks the utilization of the capacity drives and triggers an alert when free space drops below 20%. At these higher utilization levels, vSAN automatically attempts to rebalance the cluster in compliance with the storage policies. You can either scale up or scale out the cluster to increase capacity as required.
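The 20% free-space rule is easy to express in code. A minimal sketch, with made-up drive names and sizes:

```python
# Flag any capacity drive whose free space drops below the 20% threshold
# described above. Drive identifiers and sizes are invented for illustration.

FREE_THRESHOLD = 0.20  # alert when free space falls below 20%

def drives_needing_attention(drives):
    """drives: list of (name, capacity_gb, used_gb) tuples.
    Returns (name, free_percent) for each drive below the threshold."""
    alerts = []
    for name, capacity, used in drives:
        free_ratio = (capacity - used) / capacity
        if free_ratio < FREE_THRESHOLD:
            alerts.append((name, round(free_ratio * 100, 1)))
    return alerts

print(drives_needing_attention([
    ("naa.5000c5001", 1863, 1600),  # ~14% free -> alert
    ("naa.5000c5002", 1863, 1200),  # ~36% free -> OK
]))
```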
Congestion

Congestion can be the result of a number of factors; one common cause is a low data destaging rate from the cache tier to the capacity tier. This can stem from a faulty cache or capacity drive, an undersized cache tier, storage controller issues, driver or firmware issues, and so on. Common scenarios include high read cache miss rates, and host failures or rebalance activities in a hybrid vSAN configuration.
Component Limit Health
vSAN has a component limit of 9000 at the host level and 47661 at the disk level. Since the disk-level limit is higher than the host-level limit, this issue typically surfaces only at the witness node of a stretched cluster. If this component limit is reached, provisioning of new VMs on a stretched cluster might fail.
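The limit check itself is straightforward. A hedged sketch using the two limits quoted above (the host names and counts are invented):

```python
# Check whether a provisioning request would exceed the component limit
# described above: 9000 per regular host, 47661 at the witness/disk level.

HOST_COMPONENT_LIMIT = 9000
WITNESS_COMPONENT_LIMIT = 47661

def can_provision(current_components, new_components, is_witness=False):
    """Return True if adding new_components stays within the applicable limit."""
    limit = WITNESS_COMPONENT_LIMIT if is_witness else HOST_COMPONENT_LIMIT
    return current_components + new_components <= limit

print(can_provision(8950, 100))                    # regular host: over the limit
print(can_provision(47000, 100, is_witness=True))  # witness: still within limit
```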
Component metadata health
This health check verifies the integrity of the component metadata on a disk. vSAN is an object store, and it maintains metadata for each component written to the vSAN datastore. Metadata corruption can be caused by drive or controller issues, driver or firmware issues, or, in rare cases, can even originate intermittently from the vSAN software itself.
Memory Pools (Heaps and Slabs)
This test checks the virtual memory pools and triggers an alert when they run low. The memory pools are used by the cache and capacity tiers for destaging data to the capacity tier, and are utilized in the form of LLOGs and PLOGs. Whenever data is written to the caching tier, vSAN maintains an LLOG (Logical Log) that holds the component mapping information. During destaging, this is written to the PLOG (Physical Log). These logs are also referenced during log replay at vSAN startup.
It is unlikely that you will see errors triggered for the memory pools, as vSAN uses congestion to throttle the incoming I/O rate and reduce pressure on the memory pools. If you do see this error, you will most likely see a congestion error as well.
That is all for the “Physical Disks” category. Let’s move on to the “Limits” category.
Current cluster situation
This test gives a quick view of current cluster utilization: component utilization, disk space utilization and read cache reservations.
After 1 additional host failure
This test simulates the failure of the vSAN host with the most resources and shows the resource consumption afterwards. Note that if there is already a failure in the cluster, this test reports on one additional failure. If the check reports that more than 100% of resources would be used after a host failure, re-protection will fail for some objects because there are not enough resources available.
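The simulation described above amounts to removing the largest host and recomputing utilization. A minimal sketch, with invented host capacities:

```python
# Simplified version of the "after 1 additional host failure" check described
# above: remove the host with the most capacity and see whether the remaining
# hosts can absorb the used space. A ratio above 1.0 (100%) means some
# objects could not be re-protected.

def usage_after_worst_host_failure(hosts):
    """hosts: list of (capacity_gb, used_gb). Returns post-failure usage ratio."""
    total_used = sum(used for _, used in hosts)
    biggest = max(hosts, key=lambda h: h[0])
    remaining_capacity = sum(cap for cap, _ in hosts) - biggest[0]
    return total_used / remaining_capacity

hosts = [(4000, 2000), (4000, 2200), (4000, 1800)]  # illustrative numbers
print(f"{usage_after_worst_host_failure(hosts):.0%}")  # prints 75%
```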
Host component limit
Each vSAN host has a component limit of 9000. This test checks component utilization across all hosts. When the component limit is reached, or components cannot be rebalanced, you might need to scale out the cluster.
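A per-host view of this check can be sketched as follows; the host names and component counts are made up for illustration:

```python
# Report each host's component count as a percentage of the 9000-per-host
# limit described above, so nearly-full hosts stand out.

HOST_COMPONENT_LIMIT = 9000

def component_utilization(hosts):
    """hosts: dict of host name -> component count. Returns name -> percent used."""
    return {name: round(count / HOST_COMPONENT_LIMIT * 100, 1)
            for name, count in hosts.items()}

print(component_utilization({"esx01": 4500, "esx02": 8100, "esx03": 900}))
```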
That’s all for now.
Continue reading? Here are the other parts: