I had the misfortune of stumbling across this in a vSphere 5.1 environment with RHEL 6.3.  This was just the icing on a hellish cake of fail that I inherited.

vmware-recordrouteinfo

OK, it wasn’t that bad, and it was kind of an interesting problem, but it’s definitely the last thing you want to see on a VM booting your central NFS server after vmotion fails due to someone else’s host networking misconfiguration.  Fortunately it’s still in QA and not production, despite the over-reaching ambitions of technically illiterate management.

It turns out this had nothing to do with any IP routing table whatsoever, as I discovered when I kept experimenting with forcing static routes and triple-checked every network device and NIC that could possibly be involved.  I thought maybe it was a kernel driver issue with vmnic, that some kind of nasty was introduced in either Red Hat’s vmware tools package or the ones vSphere uses.  After lots of tedious exploring with different combinations of kernel modules and versions of vmware tools, I found nothing.

Well, thanks to the work of Chris Colotti and others, it turns out the problem is related to the default behavior of vmware tools, which time syncs with the host regardless of whether NTP is configured to do this in the OS.  The workaround is simply to rename these files to something else:

/usr/lib/vmware-tools/plugins32/vmsvc/libtimeSync.so
/usr/lib/vmware-tools/plugins64/vmsvc/libtimeSync.so

It’s not clear yet exactly why it hangs, but the solution for now is to move the shared object files that vmware tools uses out of the way so it skips its timesync and doesn’t hang the entire boot process.  This incidentally highlights, once again, the need for a parallelized init process to replace ye olde serialized init.
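
If you have more than one VM to fix, the rename can be scripted.  What follows is only a rough sketch of the workaround described above, not anything official from VMware or Red Hat.  The two paths are the ones listed in this post; the function name, the ".disabled" suffix and the root parameter (there so you can try it against a scratch directory first) are my own assumptions.  It needs to run as root on the guest to touch the real files.

```python
import os

# The two plugin paths listed in this post.
TIMESYNC_PLUGINS = [
    "/usr/lib/vmware-tools/plugins32/vmsvc/libtimeSync.so",
    "/usr/lib/vmware-tools/plugins64/vmsvc/libtimeSync.so",
]


def disable_timesync(paths=TIMESYNC_PLUGINS, suffix=".disabled", root="/"):
    """Rename each timeSync plugin that exists by appending a suffix.

    The suffix and the root parameter are assumptions for illustration.
    Returns a list of (old_path, new_path) pairs that were renamed.
    """
    renamed = []
    for path in paths:
        old = os.path.join(root, path.lstrip("/"))
        new = old + suffix
        # Skip plugins that are not installed or were already moved aside.
        if os.path.isfile(old) and not os.path.exists(new):
            os.rename(old, new)
            renamed.append((old, new))
    return renamed
```

Calling disable_timesync() with no arguments acts on the real paths, and renaming the files back undoes it.  Keep in mind that a later vmware tools upgrade or reinstall may put the original files back.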
