Fix CVE-2024-37085 in VMware ESXi and Troubleshoot Storage & Path Issues
Introduction
As infrastructure engineers managing large VMware environments (vSphere, ESXi, vCenter), staying ahead of both security vulnerabilities and storage/path performance issues is critical. In this article we’ll cover two major problem areas:
- The highly publicized CVE-2024-37085 vulnerability affecting domain-joined ESXi hosts.
- Common storage, All Paths Down (APD), and Permanent Device Loss (PDL) issues on ESXi.
Part 1: Understanding & Mitigating CVE-2024-37085 on ESXi
The vulnerability CVE-2024-37085 affects ESXi hosts joined to an Active Directory domain: by default, ESXi treats members of a domain group named “ESX Admins” (or “ESXi Admins” in some reporting) as full administrators, without properly validating whether that group existed when the host was joined. Attackers have exploited this in the wild in ransomware campaigns.
What happens?
- If an attacker has domain privileges to create or rename groups, they can create or rename a group to “ESX Admins”, add themselves to it, and gain full ESXi host admin access.
- ESXi hosts joined to AD implicitly trust the “ESX Admins” group without properly validating its existence, giving the attacker full privileges.
- This leads to compromise of the ESXi hypervisor, potentially exposing VMs and datastores and enabling ransomware attacks.
Immediate Mitigation Steps
While you plan patching/upgrading, apply these mitigations:
- Remove any group named “ESX Admins” or “ESXi Admins” from the AD domain if it is not required (the command sketch after this list shows the related host-side advanced settings).
- Ensure ESXi hosts are **not** exposed to untrusted networks and restrict management access to trusted networks only.
- Limit AD user privileges: ensure only required groups have join/management rights.
- Apply the latest ESXi patches or upgrade to a fixed version; for ESXi 8.0, the fix is included in Update 3 (build 24022510).
- Audit logs for creation of the group “ESX Admins” or additions to it, and look for suspicious group rename operations.
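The host-side behaviour behind this vulnerability is governed by two hostd advanced settings, `Config.HostAgent.plugins.hostsvc.esxAdminsGroup` and `Config.HostAgent.plugins.hostsvc.esxAdminsGroupAutoAdd`. The following is a minimal sketch, run from an ESXi shell, of how you might inspect them and point the host at a non-default, tightly controlled group; the group name "vSphere-Host-Admins" is only a placeholder, and it is worth confirming the exact option paths on your build with the `list` command first. The boolean `esxAdminsGroupAutoAdd` flag is typically set to false from the vSphere Client’s Advanced System Settings.

```
# Inspect the advanced options that drive AD-based admin access on the host
esxcli system settings advanced list -o /Config/HostAgent/plugins/hostsvc/esxAdminsGroup
esxcli system settings advanced list -o /Config/HostAgent/plugins/hostsvc/esxAdminsGroupAutoAdd

# Point the host at a non-default AD group instead of "ESX Admins"
# ("vSphere-Host-Admins" is a placeholder -- use a group you actually control)
esxcli system settings advanced set -o /Config/HostAgent/plugins/hostsvc/esxAdminsGroup -s "vSphere-Host-Admins"
```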
Long-Term Remediation & Best Practice
Once immediate mitigation is done, proceed with a full remediation plan:
- Patch or upgrade ESXi hosts to the version that addresses CVE-2024-37085.
- Review host AD integration: consider using local accounts instead of domain join if not required.
- Enforce least privilege on vCenter and ESXi roles; avoid granting excessive permissions to host/cluster admins (a quick host-side audit sketch follows this list).
- Enable multi-factor authentication (MFA) for domain accounts that can administer ESXi/vCenter.
- On system side, isolate management networks, use jump hosts or bastion systems, implement monitoring/alerting for group change events.
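As a quick host-side sanity check during the remediation pass, a sketch like the one below (run from an ESXi shell) confirms the build you are on and lists the local accounts and permissions defined directly on the host, which feeds naturally into a least-privilege review.

```
# Confirm the running version/build against the fixed-release notes for CVE-2024-37085
esxcli system version get

# Review local accounts and the permissions granted directly on the host
esxcli system account list
esxcli system permission list
```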
By proactively handling this vulnerability, you not only secure your ESXi hosts, but also strengthen defense around your entire vSphere infrastructure.
Part 2: Troubleshooting Storage & Path-Down Issues on ESXi
ESXi hosts can suffer storage performance degradation and enter “All Paths Down” (APD) or “Permanent Device Loss” (PDL) states. These conditions can stem from fabric issues, driver/firmware mismatches, or misconfigured path-selection policies.
Common Symptoms
- Datastore becomes inaccessible in vSphere client.
- Host shows “Not Responding” for VMs, or host disconnects from vCenter.
- Log output shows entries like “state in doubt; requested fast path state update” or “NMP Device … is blocked” (see the quick log-check sketch after this list).
- High storage latency reported by `esxtop` (GAVG/cmd), repeated QFULL/BUSY errors, etc.
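A quick way to confirm whether these symptoms line up with APD/PDL events is to grep the vmkernel log and summarise current path states. The sketch below assumes an ESXi shell session; the grep patterns are illustrative rather than exhaustive.

```
# Pull recent APD/PDL and path-state messages from the vmkernel log
grep -iE "apd|permanent device loss|state in doubt|is blocked" /var/log/vmkernel.log | tail -n 50

# Summarise the state of all storage paths on the host
esxcli storage core path list | grep -i "State:" | sort | uniq -c
```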
Root Causes
Some of the most frequent root-causes include:
- Outdated or unsupported HBA/driver/firmware combinations (a quick inventory sketch follows this list).
- Storage fabric misconfigurations: bad cables, broken ISL links, high-latency SAN paths.
- Incorrect VMware path-selection policy (PSP) or queue-depth configuration (e.g., the Round Robin default IOPS limit of 1,000 left in place where the array vendor recommends 1).
- Underlying storage misconfiguration: too many VMs on a single LUN, alignment issues, non-optimal stripe sizes, etc.
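To run down the first root cause in practice, a small inventory like the sketch below (ESXi shell; the driver names in the grep are examples only) maps HBAs to their drivers and lists the installed driver VIB versions, which you can then compare against the VMware Compatibility Guide and your array vendor's support matrix.

```
# Map each vmhba to the driver it is using
esxcli storage core adapter list

# List installed driver VIBs so versions can be checked against the HCL
# (the driver names below are examples -- adjust to your hardware)
esxcli software vib list | grep -iE "lpfc|qlnativefc|nfnic|i40en"

# ESXi version/build needed for the compatibility lookup
esxcli system version get
```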
Step-by-Step Troubleshooting Walk-through
Follow this structured approach:
- Check Host and vCenter Alerts – Look for datastore accessibility warnings, path errors, host disconnections.
- SSH into the ESXi Host – Use commands like `esxcli storage core adapter list`, `esxcli storage core path list`, and `esxcli storage core device list` to identify affected devices and paths.
- Inspect Logs – Review `/var/log/vmkernel.log` for APD/PDL events or blocked paths. Example:
> “WARNING: NMP: nmp_DeviceStartLoop:721: NMP Device … is blocked”
- Check Path-Selection Policy (PSP) – Ensure the Round Robin policy is configured per your array vendor’s guidance (some arrays recommend IOPS=1); the command sketch after this walkthrough shows how to verify and adjust it.
- Examine Storage Fabric & Firmware – Verify HBA driver versions, firmware, cable and switch health. Replace faulty ISL or links if needed.
- Check Latency Metrics – In `esxtop`, confirm that GAVG/cmd stays well under your Storage I/O Control congestion threshold (default 30 ms) for healthy performance.
- Corrective Action – If you identify a faulty path or HBA, mark it dead or remove it; adjust queue depths if you see QFULL/BUSY errors; re-migrate or redeploy VMs if needed; then rescan and present devices back to the hosts cleanly (the sketch after this walkthrough includes a rescan and path re-check).
- Monitor & Validate – After the remedial steps, monitor latency, path state, and host behavior over 24-48 hours and confirm that no new APD/PDL warnings appear.
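The PSP and corrective-action steps above can be run against a specific device. A hedged sketch follows, assuming an ESXi shell and using `naa.xxxxxxxxxxxxxxxx` as a placeholder device ID; only set IOPS=1 if your array vendor actually recommends it.

```
# Show the path selection policy currently applied to the device (placeholder ID)
esxcli storage nmp device list -d naa.xxxxxxxxxxxxxxxx

# Check the Round Robin IOPS setting and, if the array vendor recommends it, set it to 1
esxcli storage nmp psp roundrobin deviceconfig get -d naa.xxxxxxxxxxxxxxxx
esxcli storage nmp psp roundrobin deviceconfig set -d naa.xxxxxxxxxxxxxxxx --type iops --iops 1

# After fixing a faulty path or HBA, rescan and re-check path states
esxcli storage core adapter rescan --all
esxcli storage core path list -d naa.xxxxxxxxxxxxxxxx
```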
Best Practices to Prevent Recurrence
- Keep HBA driver/firmware combinations certified for your ESXi version.
- Follow your storage array vendor’s host connectivity and VMware best practices guides.
- Use appropriate path-selection policies; e.g., for typical FC SANs, Round Robin with IOPS=1 often gives the best results (check your array vendor's guidance).
- Segment storage traffic: use dedicated networks for iSCSI/NFS/vMotion rather than oversubscribed shared links.
- Monitor storage latency proactively via `esxtop` or vRealize Operations, and set alerts for thresholds (e.g., GAVG/cmd > 20-30 ms); a small capture sketch follows this list.
- Perform periodic storage maintenance: check for misaligned LUNs, oversubscription, path failures, and keep firmware/driver updates in your patch schedule.
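For the proactive-monitoring bullet, a simple baseline capture plus a check of the host's APD handling settings goes a long way. The sketch below assumes an ESXi shell; the output path is a placeholder, and the captured CSV is typically analysed offline (e.g., in perfmon or a spreadsheet).

```
# Capture 10 esxtop samples at 5-second intervals in batch mode for offline latency analysis
esxtop -b -d 5 -n 10 > /tmp/esxtop-storage-baseline.csv

# Review the host's APD handling settings (defaults: handling enabled, 140-second timeout)
esxcli system settings advanced list -o /Misc/APDHandlingEnable
esxcli system settings advanced list -o /Misc/APDTimeout
```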
By applying these best practices and structured troubleshooting steps, you will reduce the risk of nasty storage issues that can bring down VMs, hosts, or entire clusters.
Conclusion
For virtualization professionals managing large VMware environments (vSphere/ESXi/vCenter), staying ahead means both securing the environment (as with CVE-2024-37085) and keeping the storage layer healthy (preventing APD/PDL and latency issues).


