vSAN Health Checks Explained – Part 4


Hello everyone, this is the fourth part of the series “vSAN Health Checks Explained”. I hope the previous posts were useful, and thanks for the feedback. Let's now move on to the next category, “Performance Service”.

The checks in this category apply only when the Performance Service is enabled for the vSAN cluster, which is the recommendation. When you enable the Performance Service, vSAN creates a stats DB object on the vSAN datastore that stores all the statistics collected by the service. This object uses the vSAN Default Storage Policy (FTT=1, stripe width=1). The screenshot below shows where to enable the Performance Service.

[Screenshot: enabling the Performance Service]

Stats DB Object

This test checks the health of the stats DB object, including its availability and free space. If you notice that the stats DB object isn't available, simply turn the Performance Service off and then turn it back on.

Historical data in the stats DB object is kept for approximately three months (93 days); older data is purged.
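
The retention rule behaves like a sliding window. Here is a minimal Python sketch, assuming each sample is a hypothetical (timestamp, value) pair; it only illustrates the 93-day cutoff and does not reflect how vSAN actually stores or purges data in the stats DB object.

```python
from datetime import datetime, timedelta

# Retention window stated above: roughly three months (93 days).
RETENTION = timedelta(days=93)


def purge_old_samples(samples, now):
    """Return only the samples that fall inside the retention window.

    `samples` is a hypothetical list of (timestamp, value) tuples;
    anything older than `now - RETENTION` is dropped.
    """
    cutoff = now - RETENTION
    return [(ts, value) for ts, value in samples if ts >= cutoff]
```

A sample exactly 93 days old is still kept; anything older is discarded on the next purge.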

Stats master election

For the stats DB object, vSAN elects one host in the vSAN cluster as the stats master. This host collects statistics from all the other hosts in the cluster and updates the stats DB object, which makes it the only host authorized to write to that object. There should be exactly one stats master in the vSAN cluster.
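
The "exactly one master" rule can be expressed as a simple check. This is a hedged Python sketch: `check_stats_master`, the input mapping and the "passed"/"failed" labels are hypothetical and are not vSAN's actual API or severity levels.

```python
def check_stats_master(host_roles):
    """Evaluate the stats master election for a cluster.

    `host_roles` is a hypothetical mapping of host name -> True if that host
    reports itself as the stats master. Exactly one master is healthy.
    """
    masters = sorted(host for host, is_master in host_roles.items() if is_master)
    # Zero masters: nobody is writing to the stats DB object.
    # More than one: several hosts believe they may write to it.
    status = "passed" if len(masters) == 1 else "failed"
    return status, masters
```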

Performance data collection

This test checks whether vSAN performance statistics collection is working normally: that statistics have been collected recently and that they are being written to the stats DB object in a timely manner.
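
Both conditions (recent collection and timely writes) can be sketched as a freshness test. This is a minimal Python illustration; the 15-minute `MAX_STALENESS` threshold is a hypothetical value, since the actual interval vSAN uses is not stated here.

```python
from datetime import datetime, timedelta

# Hypothetical freshness threshold; not taken from vSAN.
MAX_STALENESS = timedelta(minutes=15)


def collection_is_healthy(last_collected, last_written, now, max_staleness=MAX_STALENESS):
    """True if stats were collected recently AND persisted to the stats DB in time."""
    collected_recently = now - last_collected <= max_staleness
    written_recently = now - last_written <= max_staleness
    return collected_recently and written_recently
```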

All hosts contributing stats

This test checks whether every host in the vSAN cluster is able to contribute statistics to the collection. Hosts that are not contributing stats, whether because of a network partition or another reason, show up here.
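
Conceptually, the hosts flagged by this test are the set difference between cluster members and hosts that actually sent statistics. A minimal Python sketch, with a hypothetical function name and plain host-name inputs:

```python
def non_contributing_hosts(cluster_hosts, contributing_hosts):
    """Return cluster members that sent no statistics.

    Both arguments are hypothetical iterables of host names.
    """
    return sorted(set(cluster_hosts) - set(contributing_hosts))
```

For example, a three-host cluster where only esx01 and esx03 report stats yields `["esx02"]`.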

Stats DB object conflicts

This test checks whether there are multiple stats DB objects in the vSAN cluster. A common way this happens is when two clusters that both have the Performance Service enabled are merged. Merging stats DB objects is not supported, so vSAN renames one of the stats DB directories and keeps the other as the active one. You can use the datastore browser to back up or delete the additional renamed directories.
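
The conflict test boils down to "more than one stats DB directory exists". This Python sketch assumes you already have a list of stats DB directory names and know which one is active; the directory names and the "warning" label are hypothetical.

```python
def stats_db_conflict_check(stats_db_dirs, active_dir):
    """Flag stats DB directories other than the active one.

    `stats_db_dirs` is a hypothetical list of stats DB directory names found on
    the datastore; `active_dir` is the one kept as the active object.
    The extra directories are the ones to back up or delete.
    """
    if active_dir not in stats_db_dirs:
        raise ValueError(f"active directory {active_dir!r} not found")
    extras = sorted(d for d in stats_db_dirs if d != active_dir)
    status = "passed" if not extras else "warning"
    return status, extras
```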

That is all for the “Performance Service” category. Now let's move on to the next category, “Hardware Compatibility”.

vSAN HCL DB up-to-date

This test checks whether the local copy of the HCL database used by vSAN is up to date. All compatibility tests run against this local copy, not against the HCL on the VMware website, so it is important to keep the local copy current. If this is an isolated cluster, you can update the HCL database manually.
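
An up-to-date check is essentially an age comparison on the local database. A minimal Python sketch; the 90-day `MAX_HCL_AGE` is a hypothetical threshold, not the value vSAN uses.

```python
from datetime import datetime, timedelta

# Hypothetical maximum age for the local HCL database copy.
MAX_HCL_AGE = timedelta(days=90)


def hcl_db_is_current(db_timestamp, now, max_age=MAX_HCL_AGE):
    """True if the local HCL database copy is no older than `max_age`."""
    return now - db_timestamp <= max_age
```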

vSAN HCL DB Auto Update

This test checks whether the vSAN HCL DB auto updater is running and is able to access the VMware HCL release website. If this is an isolated cluster, the test is skipped when the auto updater detects no internet connectivity. In that case you need to update the HCL database manually; otherwise, the previous check, “vSAN HCL DB up-to-date”, will fail.
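
The possible outcomes described above (running or not, online or isolated) can be summarised as decision logic. This Python sketch is hypothetical: the function, its boolean inputs and the result strings are illustrative and are not how the vSAN health service is implemented.

```python
def auto_update_check(updater_running, has_internet, download_ok):
    """Hypothetical outcome logic for the HCL DB auto-update check."""
    if not updater_running:
        return "failed"
    if not has_internet:
        # Isolated cluster: the check is skipped, so the HCL database must be
        # updated manually or "vSAN HCL DB up-to-date" will eventually fail.
        return "skipped"
    # Internet is reachable, so the updater must actually reach the HCL site.
    return "passed" if download_ok else "failed"
```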

SCSI controller is VMware certified

This test checks whether the SCSI controller is supported in the VMware Compatibility Guide (VCG). Each controller has a PCI ID, and this PCI ID is looked up against the VCG data in the local HCL database. If you see a warning here, update the HCL database and retest the health checks: the controller may have been certified very recently and be missing from the local copy, in which case an update fixes it. You can also look up the PCI ID manually online.
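
The lookup can be pictured as a dictionary keyed by PCI ID. In this Python sketch, `LOCAL_HCL` is a tiny hypothetical stand-in for the local HCL database; its single entry reuses the controller and ESXi 6.7 U1 certification discussed in this post and is not real VCG data.

```python
# Illustrative stand-in for the local HCL database. Keys are PCI ID tuples
# (VID, DID, SVID, SSID); values are ESXi releases the controller is
# certified for. Always confirm against the live VCG.
LOCAL_HCL = {
    ("1000", "005d", "1028", "1fd1"): {"6.7 U1"},
}


def controller_certified(pci_id, esxi_release, hcl=LOCAL_HCL):
    """Look up a controller's PCI ID in the local HCL copy."""
    key = tuple(part.lower() for part in pci_id)
    releases = hcl.get(key)
    if releases is None:
        # Not found locally: the controller may be uncertified, or the local
        # HCL copy may predate its certification, so update and retest.
        return "not found"
    return "certified" if esxi_release in releases else "not certified for this release"
```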

The PCI ID consists of the Vendor ID, Device ID, SubVendor ID and Sub Device ID. For example, the PCI ID 1000,005d,1028,1fd1 breaks down as follows:

Vendor ID (VID) – 1000

Device ID (DID) – 005d

SubVendor ID (SVID) – 1028

Sub Device ID (SSID) – 1fd1
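
Splitting a PCI ID string into these four fields is straightforward. This Python sketch (the `PciId` and `parse_pci_id` names are hypothetical) assumes the comma-separated format shown above, with each field being a 4-digit hexadecimal value.

```python
from typing import NamedTuple

HEX_DIGITS = set("0123456789abcdef")


class PciId(NamedTuple):
    vid: str   # Vendor ID
    did: str   # Device ID
    svid: str  # SubVendor ID
    ssid: str  # Sub Device ID


def parse_pci_id(text):
    """Split a 'VID,DID,SVID,SSID' string into its four 4-digit hex fields."""
    parts = [p.strip().lower() for p in text.split(",")]
    if len(parts) != 4:
        raise ValueError(f"expected 4 comma-separated fields, got {len(parts)}")
    for part in parts:
        if len(part) != 4 or not set(part) <= HEX_DIGITS:
            raise ValueError(f"not a 4-digit hex field: {part!r}")
    return PciId(*parts)
```

For the example above, `parse_pci_id("1000,005d,1028,1fd1").svid` returns `"1028"`.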

[Screenshot: VCG entry for this controller]

As you can see from the VCG entry above, this specific controller is certified for ESXi 6.7 U1, along with the certified firmware and driver versions. It is highly recommended to adhere to the VCG.

Controller driver is VMware certified

This test checks whether the controller driver is certified and listed in the VCG. Vendors may release new driver versions, and VMware may certify the new drivers and revoke certification for an older version. In that case, as part of lifecycle management, it is recommended to migrate to the most recent driver version listed in the VCG. It is very important to run the health checks both before and after the driver upgrade.
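
Deciding whether the installed driver is certified, and which certified version is newest, comes down to a version comparison. This Python sketch assumes purely numeric dotted version strings and a hypothetical list of VCG-listed versions; real vendor driver version strings often carry extra suffixes that would need their own parsing.

```python
def parse_version(version):
    """Turn a dotted numeric version such as '7.705.10.00' into a comparable tuple."""
    return tuple(int(part) for part in version.split("."))


def driver_recommendation(installed, certified_versions):
    """Return (is_certified, latest_certified) for an installed driver version.

    `certified_versions` is a hypothetical list of VCG-listed versions for this
    controller and ESXi release. Fields compare numerically, so '7.10' sorts
    after '7.9'.
    """
    if not certified_versions:
        raise ValueError("no certified versions listed for this controller")
    certified = {parse_version(v) for v in certified_versions}
    latest = max(certified_versions, key=parse_version)
    return parse_version(installed) in certified, latest
```

Comparing parsed tuples rather than raw strings avoids lexical-ordering mistakes. The same comparison applies to controller firmware versions.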

Controller firmware is VMware certified

This test checks whether the controller firmware is certified and listed in the VCG. As with controller drivers, vendors can release new firmware versions, and VMware may certify the new version and revoke certification for the old one. In that case, as part of lifecycle management, it is recommended to migrate to the most recent firmware version listed in the VCG.

If vSAN is unable to retrieve the controller firmware status, you might need to install the vendor tool via the “Update Software” section; vSAN queries the controller information through this vendor tool.

Controller disk group mode is VMware certified

This test checks whether the disk group mode configured for the vSAN cluster is supported for the current controller and ESXi version. Some controllers support only hybrid mode, and running an all-flash configuration on such a controller might affect vSAN stability.
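
The check is a lookup of the configured mode against what is certified for the controller and ESXi release pair. In this Python sketch, `SUPPORTED_MODES` and the controller names are hypothetical; real data comes from the VCG.

```python
# Hypothetical support matrix: (controller model, ESXi release) -> supported
# disk group modes. Not real VCG data.
SUPPORTED_MODES = {
    ("controller-a", "6.7 U1"): {"hybrid"},
    ("controller-b", "6.7 U1"): {"hybrid", "all-flash"},
}


def disk_group_mode_supported(controller, esxi_release, mode, matrix=SUPPORTED_MODES):
    """True if `mode` ('hybrid' or 'all-flash') is certified for this pair."""
    return mode in matrix.get((controller, esxi_release), set())
```

An unknown controller or release returns False, which errs on the side of flagging the configuration.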

vSAN firmware provider health

This is the health check for the vSAN firmware version recommendation engine. It verifies that firmware recommendations can be made correctly by confirming that:

  • vSphere Update Manager is installed, enabled and available.
  • vSAN release metadata is up to date.
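
The two conditions above combine with a logical AND. This Python sketch uses a hypothetical `UpdateManagerState` record to represent vSphere Update Manager's status; it is illustrative only and does not query a real vCenter.

```python
from dataclasses import dataclass


@dataclass
class UpdateManagerState:
    installed: bool
    enabled: bool
    available: bool


def firmware_provider_healthy(vum, release_metadata_current):
    """Both conditions from the list above must hold for the check to pass."""
    vum_ok = vum.installed and vum.enabled and vum.available
    return vum_ok and release_metadata_current
```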

That’s all for now.

Continue reading? Here are the other parts:

Part 1 -> https://vxplanet.com/2019/01/30/vsan-health-checks-explained-part-1/

Part 2 -> https://vxplanet.com/2019/02/01/vsan-health-checks-explained-part-2/

Part 3 -> https://vxplanet.com/2019/03/11/vsan-health-checks-explained-part-3/

Part 5 -> https://vxplanet.com/2019/03/29/vsan-health-checks-explained-part-5/
