Below is a collection of general VMware notes. Some of it references functions from the VMware Module Library.
Table of Contents
- Fix – Identifying Mismatched HBA zoning
- Fix – vMotion “Operation Timed Out” error (ESX 4.1 Classic)
- Fix – Datastore mismatches (showing “[]” as the VMDK path in the VM properties) – ESX 4.1
- Fix – How to kill an unresponsive VM
- Fix – Unlock user account in ESX 4.1
- Fix – SCSI Lock on VMs with Orphaned Snapshots
- Fix – Remove From Inventory
- Fix – VM with snapshot that had its base disk extended
- Fix – WinPE Disk Alignment (2003 & earlier)
- Fix – HP SIM agents to restart for issues
- Performance – CPU Ready Time Troubleshooting
- Performance – monitor storage performance per HBA
- Performance – Expand specific row in ESXTOP
- Performance – Capture ESXTOP data for an interval
- Performance – View World for specific VM IO load
- Performance – esxtop array latency
- Gather Info – VM & Guest OS Type Mismatches
- Gather Info – Generate log bundle for specific VM & Host it’s on
- Gather Info – Check NIC driver/firmware
- Gather Info – Find log files modified in the last 2 days and copy to /tmp
- Gather Info – Optimized preferred paths
- Gather Info – View storage drivers
- Gather Info – Display Array Multipathing
- Gather Info – Validate Jumbo Frames
- Gather Info – Check SCSI sense codes for storage connectivity issues
- Gather Info – Check pathing and failover mode
- Gather Info – View svMotion log (“vmware.log” within VM folder)
- Gather Info – View Hypervisor Driver Queue Depth
- Gather Info – Firewall – what ports are open
- Gather Info – To determine recommended driver for the card
- Gather Info – List VML -> NAA mapping
- Gather Info – view host memory info from CLI
- Misc – OpenManage Daemon
- Misc – VI Client Remembered Entries
- Misc – Video Streaming VM Custom Settings
- Config – RPM install
- Config – Log HBAs back into fabric
- Config – FTP from CMD Prompt
- Config – Hypervisor Driver Queue Depth
Fix – Identifying Mismatched HBA zoning
- Run “Get-VMHostHBAHealthMultithread”, and identify which VMHosts have improper zoning
- Open VI Client > Storage Adapters > Click on each vmhba and check the “targets/devices/paths” section. Compare these values between each HBA
- SSH into host
- esxcfg-mpath -l
- Copy the results to Excel > Text to Columns
- Compare the Target and LUN numbers between both vmhbas by sorting the results
- esxcfg-mpath -l | grep -B 1 -i "target: 10 lun: 0"
- The target and LUN values in the grep pattern (10 and 0 in this example) are the ones to check for. From the mpath command above, compare all of the targets that are missing/mismatched between HBAs. Record the Device Display Names for all targets that need to be corrected
- Run Get-DSName -VMHost <vmhostname> -NAA <NAA> to look up which datastores are affected
- Right click on the datastore > manage paths. Examine the pathing and see what is mismatched
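The sort-and-compare step can also be scripted instead of done in Excel. The following is a minimal Python sketch, assuming the paths have already been collected as runtime names of the form vmhbaN:C#:T#:L# (the first column of the compact esxcfg-mpath -L output). The function names and the sample data are illustrative only and are not part of the VMware Module Library.

```python
import re
from collections import defaultdict

# Runtime name layout assumed here: vmhba<adapter>:C<channel>:T<target>:L<lun>
RUNTIME_NAME = re.compile(r"^(vmhba\d+):C(\d+):T(\d+):L(\d+)$")


def targets_by_hba(runtime_names):
    """Group (target, lun) pairs by HBA name."""
    seen = defaultdict(set)
    for name in runtime_names:
        match = RUNTIME_NAME.match(name.strip())
        if match is None:
            continue  # skip anything that is not a runtime name
        hba, _channel, target, lun = match.groups()
        seen[hba].add((int(target), int(lun)))
    return dict(seen)


def missing_per_hba(runtime_names):
    """For each HBA, list the (target, lun) pairs that other HBAs see but it does not."""
    per_hba = targets_by_hba(runtime_names)
    everything = set().union(*per_hba.values())
    return {hba: sorted(everything - pairs) for hba, pairs in per_hba.items()}


# Hypothetical sample: vmhba2 does not see target 1 / LUN 0
sample = [
    "vmhba1:C0:T0:L0",
    "vmhba1:C0:T1:L0",
    "vmhba2:C0:T0:L0",
]
print(missing_per_hba(sample))
```

An empty list for every HBA means the target/LUN sets match; anything else is a candidate for the datastore lookup in the next step.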
Fix – vMotion “Operation Timed Out” error (ESX 4.1 Classic)
Prerequisites
- Local administrator password for the guest OS
- Screenshot the VM summary tab
- Screenshot Edit Settings screen, and write down the VMDK size for every Hard Disk
- Screenshot ipconfig /all
- Validate that space is available for svMotion for all VMDKs (clean up step after recreating the shell VM – see Config subsection below)
- Schedule downtime for the server:
- When this server can be brought down for approximately 60 minutes
- Who to notify before/during/after the maintenance window
- Any special startup/shutdown requirements
Config
- Get the local admin password for the guest OS
- Create all necessary tickets
- Notify all relevant parties about the maintenance
- Log into the server and take note of the info from the Prerequisites section above
- Power down VM
- Remove VM from inventory
- (KB 1002294) Create new virtual machine and use the existing VMDK
- Modify the new VM to match the settings that were recorded
- Power on VM
- Reconfigure storage and network to match what you took note of in the Prerequisites section
- Validate VM is on the network with the correct volumes mounted and accessible
- Restart VM
- Validate ability to log in to the guest OS
- vMotion VM to another host to validate
- In the guest OS, open a command prompt and run set devmgr_show_nonpresent_devices=1, then start devmgmt.msc from the same prompt. Choose View > Show hidden devices and uninstall the old vmxnet3 adapter on the VM
- Perform svMotion of the VMDK files off-hours to the new datastore that has the newly created VMX file
- Delete the original folder that has the old VMX file on the original datastore
Fix – Datastore mismatches (showing “[]” as the VMDK path in the VM properties) – ESX 4.1
service mgmt-vmware restart
service vmware-vpxa restart
Fix – How to kill an unresponsive VM
- Make sure that the VM is inaccessible to everyone and that it really is down.
- Browse the datastore where the VM is located (best to do this via the CLI on the service console with “ls -lh”) and check the time stamps of the files to see how long the snapshots, if any, have been sitting there for.
- In VirtualCenter (vCenter), the VM will probably still be showing as powered on. Check which of your ESX hosts it is running on.
- Log onto the service console of the ESX host that is running the VM. Elevate your privileges to root.
- Because the VM has an active task, you won’t be able to send it any other commands, and you won’t be able to use vmware-cmd to change its state either. Until the stuck task completes, the ESX host cannot send any power commands to the VM. The only way to release the VM and get rid of the “Active task” is to kill the VM’s running process from the service console. To do so, find the PID of the “running” VM:
- ps -auxwww |grep <VM-NAME>
Example:
Suppose you have a VM called WKSTNL01. The command will be:
ps -auxwww |grep WKSTNL01
This should return something like this:
root 12322 0.0 0.4 3140 1320 ? S<s 13:32 0:03 /usr/lib/vmware/bin/vmkload_app –sched.group=host/user/pool1 /usr/lib/vmware/bin/vmware-vmx -ssched.group=host/user/pool1 -# name=VMware ESX;version=4.0.0;buildnumber=164009;licensename=VMware ESX Server;licenseversion=4.0 build-164009; -@ pipe=/tmp/vmhsdaemon-0/vmx673aca8b7403868b; /vmfs/volumes/489a1228-2bfd25b5-6a2c-000e0cc41e52/WKSTNL01/WKSTNL01.vmx
The PID in this instance is 12322. This is what we need to kill.
- Kill the process ID with kill -9:
kill -9 12322
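Picking the PID out of the ps output by eye is easy to get wrong, because the grep process itself also matches the VM name. The following is a minimal Python sketch of the selection logic, assuming the ps output has been saved as text; the function name is illustrative, the first sample line is abridged from the output above, and the second sample line is made up.

```python
def find_vmx_pids(ps_output, vm_name):
    """Return PIDs of vmware-vmx processes whose .vmx path names the VM."""
    pids = []
    for line in ps_output.splitlines():
        fields = line.split()
        # ps aux layout: USER PID %CPU %MEM ... COMMAND
        if len(fields) < 2 or not fields[1].isdigit():
            continue
        is_vmx = "vmware-vmx" in line
        names_vm = line.rstrip().endswith(f"/{vm_name}.vmx")
        if is_vmx and names_vm:
            pids.append(int(fields[1]))
    return pids


# First line abridged from the sample output above; second line is hypothetical
sample = (
    "root 12322 0.0 0.4 3140 1320 ? S<s 13:32 0:03 "
    "/usr/lib/vmware/bin/vmkload_app /usr/lib/vmware/bin/vmware-vmx "
    "/vmfs/volumes/489a1228-2bfd25b5-6a2c-000e0cc41e52/WKSTNL01/WKSTNL01.vmx\n"
    "root 12400 0.0 0.0 1784 520 pts/0 S+ 13:40 0:00 grep WKSTNL01\n"
)
print(find_vmx_pids(sample, "WKSTNL01"))
```

Only the vmware-vmx line is returned; the grep line is ignored because it does not end in the VM's .vmx path.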
Fix – Unlock user account in ESX 4.1
[root@ ~]# pam_tally --user username --reset
User username (500) had 10
The output shows that the account had 10 failed login attempts before the counter was reset.
1. Log in as root
2. Type: “passwd username -u” (-u is unlock)
Fix – SCSI Lock on VMs with Orphaned Snapshots
- Try a standard VM delete
- Locate the datastore the VM resides on
- vmkfstools -L release <path to VMDK>
- If the release fails, check /var/log/vmkernel.log, search for “Lock”
- Check the “owner” section – if the UUID is valid, follow this guide: (link)
Fix – Remove From Inventory
http://www.yellow-bricks.com/2011/11/16/esxi-commandline-work/
vim-cmd /vmsvc/getallvms (the first column will be the VMID)
vim-cmd /vmsvc/unregister <VMID>
Fix – VM with snapshot that had its base disk extended
Fix – WinPE Disk Alignment (2003 & earlier)
http://thefoglite.com/2012/12/10/using-winpe-to-align-boot-disk-for-windows-2003/
Fix – HP SIM agents to restart for issues
service hp-health restart
service hp-snmp-agents restart
service snmpd restart
service hpsmhd restart
Performance – CPU Ready Time Troubleshooting
- Check for CPU limits under VM settings
- Check resource pool that VM might be in
- Performance tab of VM > Filter based on past day > Convert CPU summation value into CPU Ready % (KB 2002181)
- Take the converted CPU RDY % value and divide it by the number of vCPUs on the VM
- If the CPU Ready % value per vCPU is above 5-10%, this may indicate an issue
- CPU Ready % includes more than the time the VMM world spends waiting for CPU time. If CPU Ready % is high while the host CPU oversubscription ratio and %CSTP values are low, storage latency could be the root cause (thread)
- Host oversubscription: current rule of thumb (2014) is 4:1 max (link)
- %CSTP value: anything 3 or higher is a problem
- SSH into host > esxtop
- Press c for CPU
- Lowercase “L” to filter based on VM GID column – select the VM reporting high CPU Ready %
- Press “e” to expand the VM worlds > input the GID for the VM
- Press “s” for seconds > input 2
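- Press “L” (uppercase) to change the length of the name field
- Subtract %IDLE from %WAIT (%WAIT - %IDLE); the remainder is time the world spent blocked on something other than idling, such as I/O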
- Waiting on response from (thread)
- Press “m” for memory > check NUMA node balance. If this is out of balance it could cause high CPU RDY % values
- Run “Get-VMHostCPURatio”
- If the host is oversubscribed, migrate VM to a new host
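The conversion from the CPU Ready summation value to a percentage, and the division by vCPU count, can be sketched as follows. This is a minimal Python example, assuming the formula in KB 2002181 is (summation in ms) / (chart interval in ms) x 100 and that the chart intervals are 20 seconds for realtime and 300 seconds for past day; verify both against the KB before relying on the numbers. The function names are illustrative.

```python
# Chart sample intervals in seconds (assumed standard vSphere chart intervals)
INTERVAL_SECONDS = {
    "realtime": 20,
    "past_day": 300,
    "past_week": 1800,
    "past_month": 7200,
    "past_year": 86400,
}


def cpu_ready_percent(summation_ms, interval="past_day"):
    """Convert a CPU Ready summation value (ms) into CPU Ready %."""
    interval_ms = INTERVAL_SECONDS[interval] * 1000
    return summation_ms * 100 / interval_ms


def cpu_ready_percent_per_vcpu(summation_ms, vcpus, interval="past_day"):
    """Divide the VM-level CPU Ready % by the number of vCPUs."""
    return cpu_ready_percent(summation_ms, interval) / vcpus


# Example: 30000 ms of ready time on a past-day chart for a 4 vCPU VM
print(cpu_ready_percent(30000))              # VM-level CPU Ready %
print(cpu_ready_percent_per_vcpu(30000, 4))  # per-vCPU CPU Ready %
```

The per-vCPU figure is the one to compare against the 5-10% guideline above.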
Performance – monitor storage performance per HBA
- Start esxtop by typing esxtop at the command line.
- Press d to switch to disk view (HBA mode).
- Press f to modify the fields that are displayed.
- To view the entire Device name, press SHIFT + L and enter 36 in Change the name field size.
- Press b, c, d, e, h, and j to toggle the fields and press Enter.
- Press s, then 2 to alter the update time to every 2 seconds and press Enter.
See Analyzing esxtop columns for a description of relevant columns.
Performance – Expand specific row in ESXTOP
Press 2 to highlight a row (repeat to move the highlight down). Once the row you want is highlighted, press 6 to expand it
Performance – Capture ESXTOP data for an interval
(2 second intervals over 20 seconds):
vm-support -S -i 2 -d 20
tar -zxf esx-2012-12-02--12.31.23720.tgz
cd vm-support-bs-tse-i142-2012-12-02--12.31.23720/snapshots
./untar.sh
cd ..
esxtop -R .
Performance – View World for specific VM IO load
I was watching INF-VSP1423 – esxtop for Advanced Users today by Krishna Raj Raja. This is a VMworld 2012 San Francisco session, if you attended SF but did not attend this session look it up and watch it… If you are going to VMworld Barcelona, schedule it. It is an excellent session, deep technical with some great insights presented by a very smart VMware engineer. There was a tip in there which I found very useful.
Krishna showed an example where he noticed a lot of I/O being generated on a particular LUN. How do you figure out who / what is causing this? Well it is not as difficult as you think it would be…
- Open up esxtop (more details on my esxtop page)
- Go to the “Device” view (press lowercase “u”)
- Find the device which is causing a lot of I/O
- Press “e” and enter the “Device ID”. In my case that is an NAA identifier, so “copy+paste” is easiest here
- Now look up the World ID under the “path/world/partition” column
- Go back to CPU and sort on %USED (press “U”)
- Expand (press “e”) the world that is consuming a lot of CPU, as CPU is needed to drive I/O
This should enable you to figure out which world is driving the high amount of I/Os. Now you can kill it, contact the user / admin causing it… nice right.
Performance – esxtop array latency
Log in to the host with PuTTY (SSH)
Type esxtop
Press d (this switches to the disk adapter view, which shows disk information such as I/O and commands)
DAVG/cmd – The latency seen between the HBA and the disks
KAVG/cmd – Latency created by the vmkernel, should be close to 0.00 ms
GAVG/cmd – Latency as seen by the Guest = (DAVG + KAVG)
These are general numbers for DAVG: below 10ms you are in good shape; 10-20ms is still OK; above 20ms you might start to see some performance degradation, but things will still be working; above 30-40ms you will start to see applications slow down.
With that said, if you are working with really large block sizes you might be getting 100MB/s with 30ms latency, not bad. With small block sizes you might be seeing 1MB/s with 2-3ms latency, again, not bad.
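The relationship between the three counters and the rule-of-thumb DAVG bands can be written down as a small helper. This is a minimal Python sketch; the function names and band labels are illustrative, and the 30ms cutoff is one reading of the “30-40” range given above, not a hard limit.

```python
def guest_latency_ms(davg_ms, kavg_ms):
    """GAVG is the latency the guest sees: device latency plus kernel latency."""
    return davg_ms + kavg_ms


def classify_davg(davg_ms):
    """Map a DAVG/cmd reading (ms) onto the rule-of-thumb bands from these notes."""
    if davg_ms < 10:
        return "good"
    if davg_ms <= 20:
        return "ok"
    if davg_ms <= 30:
        return "possible degradation"
    return "applications may slow down"


# Example reading: DAVG 12.5 ms, KAVG 0.5 ms
print(guest_latency_ms(12.5, 0.5), classify_davg(12.5))
```

As noted above, block size matters: a reading in a higher band is not necessarily a problem if throughput is high with large blocks.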
Gather Info – VM & Guest OS Type Mismatches
If this shows up on a vCheck report, check these settings:
$vm = Get-VM -Name <vmname>
$vm.Guest.GuestFullName (may not be shown, doesn’t return anything for ESX 4.1)
$vm.Summary.Config.GuestFullName (may not be shown, doesn’t return anything for ESX 4.1)
$vm.guest.extensiondata.guestfullname
$vm.extensiondata.config.guestfullname
Gather Info – Generate log bundle for specific VM & Host it’s on
Gather Info – Check NIC driver/firmware
for i in 0 1 2 3 4 5 6 7 8 9 10 11 12 13; do echo ""; echo vmnic$i; ethtool -i vmnic$i; done
esxcfg-nics -l
Gather Info – Find log files modified in the last 2 days and copy to /tmp
find /var/log -maxdepth 1 -ctime -2 -iname "vmk*" -exec cp "{}" /tmp \;
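For hosts or jump boxes where the same selection needs to be scripted, the logic of the find command can be sketched in Python. This is a rough equivalent under stated assumptions, not a drop-in replacement: it filters on modification time (st_mtime) rather than the inode change time that -ctime tests, it skips directories, and the function name is illustrative.

```python
import fnmatch
import os
import shutil
import time


def copy_recent_logs(src_dir, dest_dir, pattern="vmk*", max_age_days=2):
    """Copy files in src_dir matching pattern (case-insensitive) that were
    modified within the last max_age_days into dest_dir."""
    cutoff = time.time() - max_age_days * 86400
    copied = []
    for entry in os.scandir(src_dir):  # top level only, like -maxdepth 1
        if not entry.is_file():
            continue
        if not fnmatch.fnmatch(entry.name.lower(), pattern.lower()):
            continue
        if entry.stat().st_mtime >= cutoff:
            shutil.copy2(entry.path, dest_dir)
            copied.append(entry.name)
    return sorted(copied)


# Example call (not executed here):
# copy_recent_logs("/var/log", "/tmp")
```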
Gather Info – Optimized preferred paths
esxcli nmp path list | grep "Device: naa.6006016059921f00e220a9ffa2b4e111" -B 2 -A 4
esxcli nmp path list | grep "{current: yes; preferred: yes}" -B 6 | grep "TPG_state=ANO" -B 5 -A 1 >> /vmfs/volumes/<datastorename>/Preferred_ANO.txt
Gather Info – View storage drivers
Commands to view all the HBA drivers in more detail:
less /proc/scsi/qla2xxx/*
less /proc/scsi/lpfc820/*
Gather Info – Display Array Multipathing
esxcfg-mpath -L | grep -i naa.6006016033201c00a43a4ab9be9cde11
vmhba1:C0:T1:L1 state:standby naa.6006016033201c00a43a4ab9be9cde11 vmhba1 0 1 1 NMP standby san fc.20000000c9739842:10000000c9739842 fc.50060160c1e0b7ec:5006016941e0b7ec
vmhba1:C0:T0:L1 state:active naa.6006016033201c00a43a4ab9be9cde11 vmhba1 0 0 1 NMP active san fc.20000000c9739842:10000000c9739842 fc.50060160c1e0b7ec:5006016141e0b7ec
vmhba0:C0:T1:L1 state:standby naa.6006016033201c00a43a4ab9be9cde11 vmhba1 0 1 1 NMP standby san fc.20000000c9739842:10000000c9739842 fc.50060160c1e0b7ec:5006016941e0b7ec
vmhba0:C0:T0:L1 state:active naa.6006016033201c00a43a4ab9be9cde11 vmhba1 0 0 1 NMP active san fc.20000000c9739842:10000000c9739842 fc.50060160c1e0b7ec:5006016141e0b7ec
In this example, the initial portion of each output line, such as vmhba1:C0:T1:L1 and vmhba1:C0:T0:L1, breaks down to HBA 1 (or 0), channel 0, target (SP) 1 or 0, and LUN 1.
Gather Info – Validate Jumbo Frames
- vmkping -I vmk1 -s 8972 -d 192.168.100.213
- Above: use the vmkernel interface associated with iSCSI, use the correct ICMP payload size (8972 = 9000 - 28, because the IP header is 20 bytes and the ICMP header is 8 bytes), and use “-d” to disallow IP fragmentation. Without all of these settings, jumbo frames will not be properly validated
- Set physical switches to 9198 or 9216 if the MTU setting on the ESXi hosts & storage is set to 9000
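The payload arithmetic generalizes to any MTU. The following is a minimal Python sketch that derives the -s value and builds the vmkping command line; the function names are illustrative, and the interface and IP address are the ones used in the example above.

```python
IP_HEADER_BYTES = 20    # IPv4 header without options
ICMP_HEADER_BYTES = 8   # ICMP echo header


def vmkping_payload_size(mtu):
    """Largest ICMP payload that fits in one unfragmented packet at this MTU."""
    return mtu - IP_HEADER_BYTES - ICMP_HEADER_BYTES


def vmkping_command(interface, target_ip, mtu=9000):
    """Build the vmkping command line with fragmentation disallowed (-d)."""
    size = vmkping_payload_size(mtu)
    return f"vmkping -I {interface} -s {size} -d {target_ip}"


print(vmkping_command("vmk1", "192.168.100.213"))
```

For a standard 1500-byte MTU the same calculation gives a payload of 1472 bytes.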
Gather Info – Check SCSI sense codes for storage connectivity issues
Check for all SCSI sense codes which are not “OK” (0x0) (link):
grep -i -r "h:0x1\|h:0x2\|h:0x3\|h:0x4\|h:0x5\|h:0x6\|h:0x7\|h:0x8\|h:0x9\|h:0xb\|h:0xc\|h:0xd" messages* | more
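The same filter can be applied to a log file that has been copied off the host. This is a minimal Python sketch, assuming log lines carry the status as an “H:0x..” field; the sample lines are made up for illustration and are not real log output, and the function name is hypothetical. Unlike the grep pattern, it flags every non-zero value rather than an enumerated list of codes.

```python
import re

# Host status field, e.g. "H:0x8"; the value is hexadecimal
HOST_STATUS = re.compile(r"\bH:0x([0-9a-f]+)", re.IGNORECASE)


def lines_with_host_errors(log_lines):
    """Return (status_code, line) for every line whose H: status is not 0x0."""
    hits = []
    for line in log_lines:
        match = HOST_STATUS.search(line)
        if match is None:
            continue
        code = int(match.group(1), 16)
        if code != 0:
            hits.append((code, line.rstrip("\n")))
    return hits


# Illustrative lines only
sample = [
    "cmd failed H:0x8 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0",
    "cmd failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x6 0x29 0x0",
    "unrelated line without a status field",
]
for code, line in lines_with_host_errors(sample):
    print(hex(code), line)
```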
Gather Info – Check pathing and failover mode
rpowermt display dev=all host=<hostname>
Gather Info – View svMotion log (“vmware.log” within VM folder)
vMotion the VM to another host to recreate the vmware.log file, then the original vmware.log file will be unlocked
Gather Info – View Hypervisor Driver Queue Depth
vmkload_mod -l | grep -i "qla"
esxcfg-module -q "qla2xxx"
esxcfg-module -s "ql2xmaxqdepth=255 ql2xloginretrycount=60 qlport_down_retry=60" qla2xxx
Gather Info – Firewall – what ports are open
esxcfg-firewall -q
netstat -pan
lsof -i -P -n
Gather Info – To determine recommended driver for the card
vmkchdev -l |grep vmnic0
002:01.0 8086:100f 15ad:0750 vmkernel vmnic0
In this example, the values are:
- VID = 8086
- DID = 100f
- SVID = 15ad
- SDID = 0750
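When checking many NICs, the four IDs can be split out of the vmkchdev output programmatically. This is a minimal Python sketch, assuming each line has the five whitespace-separated fields shown in the example above; the function name and dictionary keys are illustrative.

```python
def parse_vmkchdev_line(line):
    """Split one line of vmkchdev -l output into its PCI identifiers."""
    fields = line.split()
    if len(fields) < 5:
        raise ValueError(f"unexpected vmkchdev line: {line!r}")
    vid, did = fields[1].split(":")
    svid, sdid = fields[2].split(":")
    return {
        "pci_address": fields[0],
        "vid": vid,    # vendor ID
        "did": did,    # device ID
        "svid": svid,  # sub-vendor ID
        "sdid": sdid,  # sub-device ID
        "owner": fields[3],
        "device": fields[4],
    }


info = parse_vmkchdev_line("002:01.0 8086:100f 15ad:0750 vmkernel vmnic0")
print(info["vid"], info["did"], info["svid"], info["sdid"])
```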
Gather Info – List VML -> NAA mapping
ls -latrh /vmfs/devices/disks
Gather Info – view host memory info from CLI
- Connect to the host with PuTTY (SSH)
- cat /proc/meminfo
Misc – OpenManage Daemon
/usr/lib/ext/dell/srvadmin/bin/dataeng restart
Misc – VI Client Remembered Entries
HKEY_CURRENT_USER\Software\VMware\VMware Infrastructure Client\Preferences
Misc – Video Streaming VM Custom Settings
Config – RPM install
rpm -Uvh xxxxxxxx.rpm
Config – Log HBAs back into fabric
These commands attempt to log the HBAs back into the fabric. The command differs by HBA driver: QLogic (qla2xxx) or Emulex (lpfc820). The trailing 6 is the adapter’s entry under /proc/scsi/<driver>/ and will differ per host.
echo "scsi-qlascan" > /proc/scsi/qla2xxx/6
echo "scsi-lpfc820scan" > /proc/scsi/lpfc820/6
Config – FTP from CMD Prompt
- ftp ftpsite.vmware.com
- cd 14426172901
- mkdir 14426172901
- cd 14426172901
- quote pasv
- lcd C:\Users\
Config – Hypervisor Driver Queue Depth
esxcfg-module -s "ql2xmaxqdepth=255 ql2xloginretrycount=60 qlport_down_retry=60" qla2xxx