VMware General Notes

Below is a collection of general VMware notes. Some of it references functions from the VMware Module Library.

Table of Contents

  • Fix – Identifying Mismatched HBA zoning
  • Fix – vMotion “Operation Timed Out” error (ESX 4.1 Classic)
  • Fix – Datastore mismatches (showing “[]” as the VMDK path in the VM properties) – ESX 4.1
  • Fix – How to kill an unresponsive VM
  • Fix – Unlock user account in ESX 4.1
  • Fix – SCSI Lock on VMs with Orphaned Snapshots
  • Fix – Remove From Inventory
  • Fix – VM with snapshot that had its base disk extended
  • Fix – WinPE Disk Alignment (2003 & earlier)
  • Fix – HP SIM agents to restart for issues
  • Performance – CPU Ready Time Troubleshooting
  • Performance – monitor storage performance per HBA
  • Performance – Expand specific row in ESXTOP
  • Performance – Capture ESXTOP data for an interval
  • Performance – View World for specific VM IO load
  • Performance – esxtop array latency
  • Gather Info – VM & Guest OS Type Mismatches
  • Gather Info – Generate log bundle for specific VM & Host it's on
  • Gather Info – Check NIC driver/firmware
  • Gather Info – Find log files modified in the last 2 days and copy to /tmp
  • Gather Info – Optimized preferred paths
  • Gather Info – View storage drivers
  • Gather Info – Display Array Multipathing
  • Gather Info – Validate Jumbo Frames
  • Gather Info – Check SCSI sense codes for storage connectivity issues
  • Gather Info – Check pathing and failover mode
  • Gather Info – View svMotion log (“vmware.log” within VM folder)
  • Gather Info – View Hypervisor Driver Queue Depth
  • Gather Info – Firewall – what ports are open
  • Gather Info – To determine recommended driver for the card
  • Gather Info – List VML -> NAA mapping
  • Gather Info – view host memory info from CLI
  • Misc – OpenManage Daemon
  • Misc – VI Client Remembered Entries
  • Misc – Video Streaming VM Custom Settings
  • Config – RPM install
  • Config – Log HBAs back into fabric
  • Config – FTP from CMD Prompt
  • Config – Hypervisor Driver Queue Depth

 

Fix – Identifying Mismatched HBA zoning

  • Run “Get-VMHostHBAHealthMultithread” and identify which VMHosts have improper zoning
  • Open VI Client > Storage Adapters > click each vmhba and check the “Targets/Devices/Paths” section. Compare these values between the HBAs
  • SSH into the host
  • esxcfg-mpath -l
    • Copy the results to Excel > Text to Columns
    • Compare the Target and LUN numbers between both vmhbas by sorting the results
    • esxcfg-mpath -l | grep -B 1 -i "target: 10 lun: 0"
      • Replace the target/LUN values (target 10, LUN 0 in this example) with the ones to check for. From the mpath output above, compare all of the targets that are missing or mismatched between HBAs, and record the Device Display Name for every target that needs to be corrected
      • Run Get-DSName -VMHost <vmhostname> -NAA <NAA> to look up which datastores are affected
      • Right-click the datastore > Manage Paths. Examine the pathing and see what is mismatched
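The Excel comparison above can also be sketched directly in the service console shell. This is a minimal sketch, assuming the piped output contains runtime names in the vmhbaN:C#:T#:L# form (as in the esxcfg-mpath -L sample further down) and that mktemp and comm are available on the host; compare_hba_paths is a hypothetical helper, not a VMware command.

```shell
# compare_hba_paths HBA_A HBA_B  (hypothetical helper)
# Reads esxcfg-mpath output on stdin and prints every Target:LUN pair
# that is visible through one HBA but not through the other.
compare_hba_paths() {
    hba_a="$1"
    hba_b="$2"
    tmp=$(mktemp -d) || return 1

    # Pull out runtime names such as vmhba1:C0:T10:L0, wherever they appear in a line
    grep -oE 'vmhba[0-9]+:C[0-9]+:T[0-9]+:L[0-9]+' | sort -u > "$tmp/all"

    # Keep only the Target:LUN part for each HBA (sorted, as comm requires)
    grep "^$hba_a:" "$tmp/all" | cut -d: -f3,4 | sort -u > "$tmp/a"
    grep "^$hba_b:" "$tmp/all" | cut -d: -f3,4 | sort -u > "$tmp/b"

    # comm -23: lines only in the first file; comm -13: lines only in the second
    comm -23 "$tmp/a" "$tmp/b" | sed "s/^/only on $hba_a: /"
    comm -13 "$tmp/a" "$tmp/b" | sed "s/^/only on $hba_b: /"

    rm -rf "$tmp"
}
```

Usage: esxcfg-mpath -L | compare_hba_paths vmhba1 vmhba2. No output suggests both HBAs see the same Target:LUN pairs; it only compares which paths exist, so the zoning itself still has to be checked on the fabric.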

Fix – vMotion “Operation Timed Out” error (ESX 4.1 Classic)

Prerequisites

  • Local administrator password for the guest OS
  • Screenshot the VM summary tab
  • Screenshot Edit Settings screen, and write down the VMDK size for every Hard Disk
  • Screenshot ipconfig /all
  • Validate that space is available for svMotion for all VMDKs (needed for the cleanup step after recreating the shell VM; see the Config subsection below)
  • Schedule downtime for the server:
  1. When this server can be brought down for approximately 60 minutes
  2. Who to notify before/during/after the maintenance window
  3. Any special startup/shutdown requirements

Config

  1. Get the local admin password for the guest OS
  2. Create all necessary tickets
  3. Notify all relevant parties about the maintenance
  4. Log into the server and take note of the info from the Prerequisites section above
  5. Power down VM
  6. Remove VM from inventory
  7. (KB 1002294) Create new virtual machine and use the existing VMDK
  8. Configure the new VM with the same settings that were recorded
  9. Power on VM
  10. Reconfigure storage and network to match what you took note of in the Prerequisites section
  11. Validate VM is on the network with the correct volumes mounted and accessible
  12. Restart VM
  13. Validate ability to log in to the guest OS
  14. vMotion VM to another host to validate
  15. In a command prompt, run set devmgr_show_nonpresent_devices=1 and then devmgmt.msc from the same prompt; choose View > Show hidden devices and uninstall the old vmxnet3 adapter on the VM
  16. Perform svMotion of the VMDK files off-hours to the new datastore that has the newly created VMX file
  17. Delete the original folder that has the old VMX file on the original datastore

Fix – Datastore mismatches (showing “[]” as the VMDK path in the VM properties) – ESX 4.1

service mgmt-vmware restart

service vmware-vpxa restart

Fix – How to kill an unresponsive VM

  1. Make sure that the VM really is down and inaccessible to everyone.
  2. Browse the datastore where the VM is located (this is best done from the service console CLI with “ls -lh”) and check the file timestamps to see how long any snapshots have been sitting there.
  3. In VirtualCenter (vCenter), the VM will probably still show as powered on. Check which of your ESX hosts it is running on.
  4. Log on to the service console of the ESX host that is running the VM and elevate your privileges to root.
  5. Because the VM has an active task stuck in progress, the host will not accept any other commands for it, including power operations through vmware-cmd, until that task completes. The only way to release the VM and clear the “Active task” is to kill the VM’s running process from the service console. To do so, you need the PID of the “running” VM:
    • ps -auxwww | grep <VM-NAME>

Example:

Suppose you have a VM called WKSTNL01. The command will be:

ps -auxwww |grep WKSTNL01

This should return something like this:

root     12322  0.0  0.4   3140  1320 ?        S<s  13:32   0:03 /usr/lib/vmware/bin/vmkload_app --sched.group=host/user/pool1 /usr/lib/vmware/bin/vmware-vmx -ssched.group=host/user/pool1 -# name=VMware ESX;version=4.0.0;buildnumber=164009;licensename=VMware ESX Server;licenseversion=4.0 build-164009; -@ pipe=/tmp/vmhsdaemon-0/vmx673aca8b7403868b; /vmfs/volumes/489a1228-2bfd25b5-6a2c-000e0cc41e52/WKSTNL01/WKSTNL01.vmx

The PID in this instance is 12322. This is what we need to kill.

6. Kill the process ID with kill -9:

kill -9 12322
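Picking the PID out of the ps output can be scripted so the grep line itself is never picked up by mistake. A minimal sketch, assuming the VM's process line contains vmware-vmx and the full path to <VM-NAME>.vmx, as in the example output above; vm_pid_from_ps is a hypothetical helper.

```shell
# vm_pid_from_ps VMNAME  (hypothetical helper)
# Reads "ps -auxwww" output on stdin and prints the PID (column 2) of the
# vmware-vmx process whose command line ends in .../VMNAME.vmx
vm_pid_from_ps() {
    # The leading slash stops WKSTNL01 from also matching e.g. XWKSTNL01.vmx;
    # requiring "vmware-vmx" skips the "grep WKSTNL01" line itself.
    awk -v vmx="/$1.vmx" '/vmware-vmx/ && index($0, vmx) > 0 { print $2 }'
}
```

Usage: ps -auxwww | vm_pid_from_ps WKSTNL01, then double-check the PID before running kill -9 on it.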

Fix – Unlock user account in ESX 4.1

[root@ ~]# pam_tally --user username --reset

User username    (500)   had 10

The output shows the account had 10 failed login attempts before the counter was reset.

To unlock the account itself:

1. Log in as root

2. Type: “passwd -u username”      (-u is unlock)

Fix – SCSI Lock on VMs with Orphaned Snapshots

  • Try a standard VM delete first
  • Locate the datastore the VM resides on
  • vmkfstools -L release <path to VMDK>
  • If the release fails, check /var/log/vmkernel.log and search for “Lock”
    • Check the “owner” section: if the UUID is valid, follow this guide: (link)

Fix – Remove From Inventory

http://www.yellow-bricks.com/2011/11/16/esxi-commandline-work/

vim-cmd /vmsvc/getallvms  (the first column will be the VMID)

vim-cmd /vmsvc/unregister <VMID>
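Looking up the VMID by VM name can be sketched with awk. A minimal sketch, assuming the getallvms output starts with a header row, the Vmid is in column 1 and the VM name contains no spaces (names with spaces will not match); vmid_for_name is a hypothetical helper.

```shell
# vmid_for_name VMNAME  (hypothetical helper)
# Reads "vim-cmd /vmsvc/getallvms" output on stdin and prints the matching Vmid
vmid_for_name() {
    # Skip the header row; column 1 is Vmid, column 2 is Name
    awk -v name="$1" 'NR > 1 && $2 == name { print $1 }'
}
```

Usage: vim-cmd /vmsvc/getallvms | vmid_for_name WKSTNL01, then pass the returned ID to vim-cmd /vmsvc/unregister once you have confirmed it is the right VM.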

Fix – VM with snapshot that had its base disk extended

http://kb.vmware.com/selfservice/microsites/search.do?cmd=displayKC&docType=kc&docTypeID=DT_KB_1_1&externalId=1646892

Fix – WinPE Disk Alignment (2003 & earlier)

http://thefoglite.com/2012/12/10/using-winpe-to-align-boot-disk-for-windows-2003/

Fix – HP SIM agents to restart for issues

service hp-health restart

service hp-snmp-agents restart

service snmpd restart

service hpsmhd restart

Performance – CPU Ready Time Troubleshooting

  • Check for CPU limits under the VM settings
  • Check the resource pool that the VM might be in
  • VM Performance tab > filter on Past Day > convert the CPU Ready summation value into CPU Ready % (KB 2002181)
  • Divide the converted CPU Ready % value by the number of vCPUs on the VM
  • If the CPU Ready % per vCPU is above 5-10%, this may indicate an issue
  • CPU Ready % does not only reflect time the VMM world spends waiting for CPU: storage latency issues could be the root cause of a high CPU Ready % when the host CPU oversubscription ratio and %CSTP values are low (thread)
    • Host oversubscription: the current rule of thumb (2014) is 4:1 max (link)
    • %CSTP value: anything 3 or higher is a problem
    • SSH into the host > esxtop
      • Press c for the CPU view
      • Press lowercase “l” to filter on the GID column and select the VM reporting high CPU Ready %
      • Press “e” to expand the VM worlds > input the GID for the VM
      • Press “s” to set the update delay in seconds > input 2
      • Press “L” to change the length of the NAME field
      • Subtract %IDLE from %WAIT
        • The result is the time the VM spends waiting on a response (thread)
  • Press “m” for memory > check NUMA node balance. If the nodes are out of balance, this could cause high CPU Ready % values
  • Run “Get-VMHostCPURatio”
    • If the host is oversubscribed, migrate the VM to a new host
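The summation-to-percentage conversion and the per-vCPU step can be sketched as below. This assumes the KB 2002181 formula, CPU Ready % = summation (ms) / (chart interval in seconds * 1000) * 100, where the interval is 20 s for the Real-time chart and 300 s for Past Day; cpu_ready_pct is a hypothetical helper.

```shell
# cpu_ready_pct SUMMATION_MS INTERVAL_S VCPUS  (hypothetical helper)
# Prints two values: CPU Ready % for the whole VM, then CPU Ready % per vCPU
cpu_ready_pct() {
    awk -v ms="$1" -v interval="$2" -v vcpus="$3" 'BEGIN {
        total = ms / (interval * 1000) * 100    # ready time as % of the sample interval
        printf "%.2f %.2f\n", total, total / vcpus
    }'
}
```

For example, a Past Day summation of 30000 ms on a 2-vCPU VM works out to 10% total, or 5% per vCPU, which sits at the low end of the 5-10% range mentioned above.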

Performance – monitor storage performance per HBA

  1. Start esxtop by typing esxtop at the command line.
  2. Press d to switch to disk view (HBA mode).
  3. Press f to modify the fields that are displayed.
  4. To view the entire Device name, press SHIFT + L and enter 36 at the “Change the name field size” prompt.
  5. Press b, c, d, e, h, and j to toggle the fields and press Enter.
  6. Press s, then 2 to alter the update time to every 2 seconds and press Enter.

See Analyzing esxtop columns for a description of relevant columns

Performance – Expand specific row in ESXTOP

Press 2 to highlight a row, then press 6 to expand the highlighted row

Performance – Capture ESXTOP data for an interval

Capture data at 2-second intervals over 20 seconds:

vm-support -S -i 2 -d 20

Extract the bundle and replay it in esxtop:

tar -zxf esx-2012-12-02--12.31.23720.tgz

cd vm-support-bs-tse-i142-2012-12-02--12.31.23720/snapshots

./untar.sh

cd ..

esxtop -R .

Performance – View World for specific VM IO load

I was watching INF-VSP1423 – esxtop for Advanced Users by Krishna Raj Raja today. This is a VMworld 2012 San Francisco session; if you attended SF but did not attend this session, look it up and watch it… If you are going to VMworld Barcelona, schedule it. It is an excellent, deeply technical session with some great insights, presented by a very smart VMware engineer. There was one tip in it that I found very useful.

Krishna showed an example where he noticed a lot of I/O being generated on a particular LUN. How do you figure out who or what is causing this? Well, it is not as difficult as you might think…

  • Open up esxtop (more details on my esxtop page)
  • Go to the “Device” view (u)
  • Find the device which is causing a lot of I/O
  • Press “e” and enter the “Device ID”; in my case that is an NAA identifier, so “copy+paste” is easiest here
  • Now look up the World ID under the “path/world/partition” column
  • Go back to the CPU view (c) and sort on %USED (press “U”)
  • Expand (press “e”) the world that is consuming a lot of CPU, as CPU is needed to drive I/O

This should enable you to figure out which world is driving the high amount of I/O. Now you can kill it or contact the user/admin causing it… nice, right?

Performance – esxtop array latency

Log in to the host with PuTTY

Type: esxtop

Press d to switch to the disk adapter view, which shows disk information such as I/O and command statistics

DAVG/cmd – Latency seen between the HBA and the disks

KAVG/cmd – Latency added by the VMkernel; this should be close to 0.00 ms

GAVG/cmd – Latency as seen by the guest = DAVG + KAVG

These are general guidelines for DAVG: you are in good shape below 10 ms, and 10-20 ms is still OK. Above 20 ms you might start to see some performance degradation, but things will still be working. Above 30-40 ms, applications will start to slow down.

That said, if you are working with very large block sizes you might be getting 100 MB/s at 30 ms latency, which is not bad. With small block sizes you might see 1 MB/s at 2-3 ms latency, which again is not bad.
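The GAVG relationship and the rough DAVG thresholds can be turned into a quick check. A minimal sketch; the cut-offs (10, 20 and 30 ms) come from the guideline paragraph and are rules of thumb rather than hard limits, and davg_rating and gavg_ms are hypothetical helpers.

```shell
# davg_rating DAVG_MS  (hypothetical helper)
# Rough rating of device latency using the rule-of-thumb thresholds
davg_rating() {
    awk -v d="$1" 'BEGIN {
        if (d < 10)       print "good"
        else if (d <= 20) print "ok"
        else if (d < 30)  print "degraded"
        else              print "slow"
    }'
}

# gavg_ms KAVG_MS DAVG_MS  (hypothetical helper)
# Guest latency is kernel latency plus device latency
gavg_ms() {
    awk -v k="$1" -v d="$2" 'BEGIN { printf "%.2f\n", k + d }'
}
```

As the paragraph notes, block size matters: a "slow" rating on a large-block sequential workload may still be perfectly acceptable throughput.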

Gather Info – VM & Guest OS Type Mismatches

If this shows up on a vCheck report, check these settings:

$vm = Get-VM -Name <vmname>

$vm.Guest.GuestFullName (may not be shown; doesn’t return anything for ESX 4.1)

$vm.Summary.Config.GuestFullName (may not be shown; doesn’t return anything for ESX 4.1)

$vm.guest.extensiondata.guestfullname

$vm.extensiondata.config.guestfullname

Gather Info – Generate log bundle for specific VM & Host it's on

KB 2005715

Gather Info – Check NIC driver/firmware

for i in 0 1 2 3 4 5 6 7 8 9 10 11 12 13; do echo ""; echo vmnic$i; ethtool -i vmnic$i; done

esxcfg-nics -l
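The loop above hard-codes vmnic0 through vmnic13. A sketch that only walks the NICs the host actually reports, assuming esxcfg-nics -l prints a header row followed by one NIC per line with the name in column 1; list_vmnics and nic_report are hypothetical helpers.

```shell
# list_vmnics  (hypothetical helper)
# Reads "esxcfg-nics -l" output on stdin and prints the vmnic names
list_vmnics() {
    awk 'NR > 1 && $1 ~ /^vmnic/ { print $1 }'
}

# nic_report  (hypothetical helper)
# Driver/firmware details for every NIC the host reports
nic_report() {
    for nic in $(esxcfg-nics -l | list_vmnics); do
        echo ""
        echo "$nic"
        ethtool -i "$nic"
    done
}
```

Usage: run nic_report on the host instead of guessing the NIC count.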

Gather Info – Find log files modified in the last 2 days and copy to /tmp

find /var/log -maxdepth 1 -ctime -2 -iname "vmk*" -exec cp "{}" /tmp \;

Gather Info – Optimized preferred paths

esxcli nmp path list | grep "Device: naa.6006016059921f00e220a9ffa2b4e111" -B 2 -A 4

esxcli nmp path list | grep "{current: yes; preferred: yes}" -B 6 | grep "TPG_state=ANO" -B 5 -A 1 >> /vmfs/volumes/<datastorename>/Preferred_ANO.txt

Gather Info – View storage drivers

View detailed information for the HBA drivers (QLogic and Emulex, respectively):

less /proc/scsi/qla2xxx/*

less /proc/scsi/lpfc820/*

Gather Info – Display Array Multipathing

esxcfg-mpath -L | grep -i naa.6006016033201c00a43a4ab9be9cde11

 

vmhba1:C0:T1:L1 state:standby naa.6006016033201c00a43a4ab9be9cde11 vmhba1 0 1 1 NMP standby san fc.20000000c9739842:10000000c9739842 fc.50060160c1e0b7ec:5006016941e0b7ec
vmhba1:C0:T0:L1 state:active naa.6006016033201c00a43a4ab9be9cde11 vmhba1 0 0 1 NMP active san fc.20000000c9739842:10000000c9739842 fc.50060160c1e0b7ec:5006016141e0b7ec
vmhba0:C0:T1:L1 state:standby naa.6006016033201c00a43a4ab9be9cde11 vmhba1 0 1 1 NMP standby san fc.20000000c9739842:10000000c9739842 fc.50060160c1e0b7ec:5006016941e0b7ec
vmhba0:C0:T0:L1 state:active naa.6006016033201c00a43a4ab9be9cde11 vmhba1 0 0 1 NMP active san fc.20000000c9739842:10000000c9739842 fc.50060160c1e0b7ec:5006016141e0b7ec

In this example, the first field of each line (such as vmhba1:C0:T1:L1 or vmhba1:C0:T0:L1) is the path's runtime name. It breaks down to HBA (vmhba1 or vmhba0), Controller 0, Target (SP) 1 or 0, and LUN 1.
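Splitting the runtime name into its parts can be sketched with awk. A minimal sketch, assuming each line has the runtime name in the first field and state:<value> in the second, as in the sample output above; split_runtime_name is a hypothetical helper.

```shell
# split_runtime_name  (hypothetical helper)
# Reads "esxcfg-mpath -L" lines on stdin and prints HBA, controller,
# target, LUN and path state for each path
split_runtime_name() {
    awk '{
        split($1, p, ":")                # vmhba1:C0:T1:L1 -> vmhba1, C0, T1, L1
        state = $2
        sub(/^state:/, "", state)        # state:standby -> standby
        printf "%s controller=%s target=%s lun=%s state=%s\n",
            p[1], substr(p[2], 2), substr(p[3], 2), substr(p[4], 2), state
    }'
}
```

Usage: esxcfg-mpath -L | grep -i <naa id> | split_runtime_name.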

Gather Info – Validate Jumbo Frames

  • vmkping -I vmk1 -s 8972 -d 192.168.100.213
  • Above: use the VMkernel interface associated with iSCSI, use the correct ICMP payload size (9000 - 28 = 8972, because the ICMP header is 8 bytes and the IP header is 20 bytes), and use “-d” to disallow IP fragmentation. Without all of these settings, jumbo frames will not be properly validated
  • Set the physical switches to an MTU of 9198 or 9216 if the MTU on the ESXi hosts and storage is set to 9000
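The payload arithmetic is simply MTU minus 28 bytes (20-byte IP header plus 8-byte ICMP header). A small sketch that builds the vmkping command for any MTU; jumbo_ping_cmd is a hypothetical helper, and it assumes your vmkping build supports -I for choosing the VMkernel interface.

```shell
# jumbo_ping_cmd VMK MTU TARGET  (hypothetical helper)
# Prints a vmkping command that tests the given MTU end to end without fragmentation
jumbo_ping_cmd() {
    vmk="$1"
    mtu="$2"
    target="$3"
    payload=$((mtu - 28))    # MTU - 20 (IP header) - 8 (ICMP header)
    echo "vmkping -I $vmk -s $payload -d $target"
}
```

For a 9000 MTU this produces the same 8972-byte command shown above; for a standard 1500 MTU the payload would be 1472.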

Gather Info – Check SCSI sense codes for storage connectivity issues

Check for all SCSI host status codes (H:0x..) that are not “OK” (0x0) (link):

grep -i -r "h:0x1\|h:0x2\|h:0x3\|h:0x4\|h:0x5\|h:0x6\|h:0x7\|h:0x8\|h:0x9\|h:0xb\|h:0xc\|h:0xd" messages* | more
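To see which non-zero host status values show up and how often, the grep can be turned into a tally. A minimal sketch, assuming the log lines contain the host status in the H:0x.. form; host_status_counts is a hypothetical helper, and unlike the grep above it counts every non-zero value, including 0xa.

```shell
# host_status_counts  (hypothetical helper)
# Reads log lines on stdin and prints "<status> <count>" for each non-zero host status
host_status_counts() {
    grep -oiE 'h:0x[0-9a-f]+' |     # extract just the H:0x.. token(s)
        tr 'A-Z' 'a-z' |            # normalise H:0x5 / h:0x5
        grep -v '^h:0x0$' |         # drop the "OK" status
        sort | uniq -c |
        awk '{ print $2, $1 }'
}
```

Usage: cat messages* | host_status_counts.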

Gather Info – Check pathing and failover mode

rpowermt display dev=all host=<hostname>

Gather Info – View svMotion log (“vmware.log” within VM folder)

vMotion the VM to another host. This recreates the vmware.log file, and the original vmware.log file is unlocked so it can be viewed

Gather Info – View Hypervisor Driver Queue Depth

vmkload_mod -l | grep -i "qla"

esxcfg-module -q "qla2xxx"

esxcfg-module -s "ql2xmaxqdepth=255 ql2xloginretrycount=60 qlport_down_retry=60" qla2xxx

Gather Info – Firewall – what ports are open

esxcfg-firewall -q

netstat -pan

lsof -i -P -n

Gather Info – To determine recommended driver for the card

vmkchdev -l |grep vmnic0

002:01.0 8086:100f 15ad:0750 vmkernel vmnic0

In this example, the values are:

  • VID = 8086
  • DID = 100f
  • SVID = 15ad
  • SDID = 0750
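Pulling the four IDs out of the vmkchdev line can be sketched as below, so they can be copied straight into a compatibility guide search. A minimal sketch, assuming the vendor:device pair is in column 2, the subvendor:subdevice pair is in column 3 and the NIC name is the last column, as in the example; pci_ids is a hypothetical helper.

```shell
# pci_ids NIC  (hypothetical helper)
# Reads "vmkchdev -l" output on stdin and prints VID/DID/SVID/SDID for NIC
pci_ids() {
    awk -v nic="$1" '$NF == nic {
        split($2, v, ":")    # vendor:device
        split($3, s, ":")    # subvendor:subdevice
        printf "VID=%s DID=%s SVID=%s SDID=%s\n", v[1], v[2], s[1], s[2]
    }'
}
```

Usage: vmkchdev -l | pci_ids vmnic0.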

Gather Info – List VML -> NAA mapping

ls -latrh /vmfs/devices/disks
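The listing can be trimmed to just the vml-to-naa symlink pairs. A minimal sketch, assuming the standard ls -l symlink format (name -> target) at the end of each line; vml_to_naa is a hypothetical helper.

```shell
# vml_to_naa  (hypothetical helper)
# Reads "ls -l /vmfs/devices/disks" output on stdin and prints each vml.* link
# with the device it points to
vml_to_naa() {
    awk '$(NF-1) == "->" && $(NF-2) ~ /^vml\./ { print $(NF-2), "->", $NF }'
}
```

Usage: ls -latrh /vmfs/devices/disks | vml_to_naa.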

Gather Info – view host memory info from CLI

  1. Connect to the host with PuTTY
  2. cat /proc/meminfo

Misc – OpenManage Daemon

/usr/lib/ext/dell/srvadmin/bin/dataeng restart

Misc – VI Client Remembered Entries

HKEY_CURRENT_USER\Software\VMware\VMware Infrastructure Client\Preferences

Misc – Video Streaming VM Custom Settings

http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2040065

Config – RPM install

rpm -Uvh xxxxxxxx.rpm

Config – Log HBAs back into fabric

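These commands attempt to log the HBAs back into the fabric. The command differs by HBA vendor: QLogic (qla2xxx) and Emulex (lpfc820). The trailing number (6 here) is the SCSI host instance under /proc/scsi/<driver>/, so adjust it for your adapter.

echo "scsi-qlascan" > /proc/scsi/qla2xxx/6

echo "scsi-lpfc820scan" > /proc/scsi/lpfc820/6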
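When a host has several adapter instances, the same scan keyword can be written to each one in a loop. A minimal sketch, assuming the instance files under /proc/scsi/<driver>/ have purely numeric names and the driver accepts the scan keyword shown above; rescan_hbas is a hypothetical helper.

```shell
# rescan_hbas DIR KEYWORD  (hypothetical helper)
# Writes KEYWORD into every numerically named instance file under DIR
rescan_hbas() {
    dir="$1"
    keyword="$2"
    for f in "$dir"/*; do
        # Only touch files whose names are all digits (SCSI host instances)
        case "${f##*/}" in
            ''|*[!0-9]*) continue ;;
        esac
        [ -f "$f" ] || continue
        echo "$keyword" > "$f" && echo "sent $keyword to $f"
    done
}
```

Usage: rescan_hbas /proc/scsi/qla2xxx scsi-qlascan (or /proc/scsi/lpfc820 with scsi-lpfc820scan).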

Config – FTP from CMD Prompt

  1. ftp ftpsite.vmware.com
  2. mkdir 14426172901
  3. cd 14426172901
  4. quote pasv
  5. lcd C:\Users\

Config – Hypervisor Driver Queue Depth

esxcfg-module -s "ql2xmaxqdepth=255 ql2xloginretrycount=60 qlport_down_retry=60" qla2xxx
