ESXi woes...

Mpegger (Member, joined Nov 28, 2001)

Looks like something went bonkers in my ESXi box and I'm going to re-install everything fresh with a slightly different config to better suit my future networking setup. But hey, maybe someone has experienced this before and might have some idea what I should try to adjust before I go and redo the whole machine.

Symptoms: the ESXi management IP is pingable, but none of the VMs are. I cannot connect to anything on the ESXi host either, including the host's own management interface.

This problem will randomly pop up and, hours later, randomly go away. When it first started, only the VMs had trouble accessing the network and each other; within days it escalated to the current point of a seemingly locked-up ESXi box.

Cause: No idea, but it did start to give trouble after I had to redo my ZFS setup because the pool was set to ashift=9 instead of 12, and the replacement drive could not be used until I redid the pool with ashift=12.
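For anyone who wants to check their own pool before buying a replacement drive, the ashift each vdev was created with can be read out of the cached pool config with zdb; a minimal check, using "tank" as a placeholder pool name, would be something like:

Code:
# dump the cached config for the pool and pull out the ashift lines
zdb -C tank | grep ashift

An ashift of 9 means 512-byte alignment, 12 means 4K.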

Because of redoing the ZFS pool, I had to move everything stored on it off, which included the VMs that were installed on it. I did a simple copy from the ZFS pool to the ESXi boot drive/storage, redid my ZFS setup, then moved a couple of the VMs back to the ZFS pool. Everything seemed OK at first, except for the random network drops in the VMs. I tried all kinds of fixes in the various VMs to cure the problem (removing the NIC and reinstalling it in the VM, redoing the IP addresses, redoing the vSwitch and vNICs in ESXi, etc.), and the problem only seemed to get worse. I was doing a re-install on one of the VMs to see if that would solve it (I was thinking something in the VM OS got corrupted) when the box locked up again with the same issue: ESXi management IP pingable, but nothing else worked.
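For what it's worth, when the box is in that state the basic vSwitch and uplink wiring can still be inspected from the ESXi shell (DCUI or SSH), assuming shell access is enabled; on ESXi 5.x the checks would look roughly like this:

Code:
esxcli network nic list                  # physical uplinks and their link state
esxcli network vswitch standard list     # vSwitch -> uplink and portgroup wiring
esxcli network ip interface ipv4 get     # management vmkernel IP settings

If an uplink shows link down, or a VM portgroup has lost its uplink, that points at the vSwitch side rather than the guests.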

The only causes I can think of are:

a) The ESXi boot/storage drive is going belly up and possibly corrupted some files or VMs.

b) While transferring the VMs back and forth, something got corrupted that is affecting the whole system, even ESXi itself.

c) Something on the network side of ESXi is really, really fubared (for whatever reason), which is causing all the issues.

Anyone ever have this issue or some possible reason/resolution?
 
This is related to ZFS and not ESXi, but you know you could have inserted your replacement drive without redoing the pool.

I just did this myself last night. My old drive was ashift=9, so when I inserted the new drive it went like this:

Code:
zpool replace -o ashift=9 TB1.5 ata-ST31500341AS_9VS2DTPA ata-ST3000VN000-1H4167_Z300QN7R -f

and voila! In she went.

Sorry I can't offer more help with ESX; I use OpenVZ myself, with VirtualBox on the side if a container won't do.
 
Quote:
This is related to ZFS and not ESXi, but you know you could have inserted your replacement drive without redoing the pool.

I just did this myself last night. My old drive was ashift=9, so when I inserted the new drive it went like this:

Code:
zpool replace -o ashift=9 TB1.5 ata-ST31500341AS_9VS2DTPA ata-ST3000VN000-1H4167_Z300QN7R -f

and voila! In she went.

Sorry I can't offer more help with ESX; I use OpenVZ myself, with VirtualBox on the side if a container won't do.

I couldn't find any specific instructions about changing ashift for OpenIndiana, other than "you need to add in the entry for the drive in xxxx.xxx file", with no clue as to what that line was supposed to be. Being *nix clueless, this is a big detriment for me.

Besides, at this point any future drives purchased will be 4K native. Seeing as I had enough spare drives lying around to move everything off the pool now, I decided to just go ahead with the redo, so in the future there won't be a problem at all. Just a simple drop-in and go. :thup:
 
I might have found my problem.

I booted up the ESXi box today to get everything I could off the VMs before doing a full reinstall. Everything actually seemed to be working OK, though I was only running 1-2 VMs at a time. Then I took the ESXi boot drive out and plugged it into my other PC to test it.

The ESXi main drive is coming up with 3 weak sectors in SMART. I'm running a full read/write/read test on the drive twice to see if those sectors will be remapped or if any others pop up. I'll decide afterwards whether to put it back into the system or swap in the spare drive. Most likely one of the VMs currently housed on that drive had data on those sectors, and that's what was locking up the ESXi box.
 
According to my drive test, 21 weak sectors in total now. More than what SMART initially detected, and more than likely the root of the problem.

The real problem with weak sectors is that they have yet to be marked as bad, so the drive will continue to try to use them until it finally decides to mark them bad, or even mark them good if they read normally. But marking them good just means there is a possibly-bad sector still in use that can suddenly go bad at any time in the future.

In other words, unless I can force the sectors to be marked bad with continuous read/write cycles, this drive is pretty much useless. :bang head
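For reference, the pending/remapped counts can be tracked between test passes with smartctl; a minimal check, assuming the drive shows up as /dev/sdb on the test PC (adjust the device name), would be:

Code:
# print the SMART attribute table and pull out the sector-health counters
smartctl -A /dev/sdb | egrep "Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable"

If Current_Pending_Sector drops while Reallocated_Sector_Ct rises after a write pass, the drive has remapped the weak spots.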
 
You can try using the program "badblocks". It's a Linux-based tool, but I am sure it's available in BSD as well.
 
Quote:
You can try using the program "badblocks". It's a Linux-based tool, but I am sure it's available in BSD as well.

<-- Linux/BSD inept. Unless there are thorough instructions on how to use it (with examples) on the CLI (which I find most of those power apps in *nix clones require), I more than likely will just pass on it. :p

Besides, I have enough drive testing apps to do the reallocation via continuous read/writes to the specific area. It'll reallocate eventually whether it likes it or not.
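For completeness, if anyone does want to try it, a typical non-destructive read-write pass with badblocks on a Linux box looks something like this (assuming the suspect drive shows up as /dev/sdb; double-check the device name before running anything):

Code:
# -n = non-destructive read-write test, -s = show progress, -v = verbose
badblocks -nsv /dev/sdb

Writing each block back is usually what finally convinces the drive to remap a weak sector.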

In the meantime, I hope I can use the drive temporarily to move some data around the ZFS file system before I run it through the gauntlet. I'm also gonna update the ESXi box when I re-install ESXi and all the VMs from scratch. Purchased two 64GB SSDs, one for the ZIL, and one for the ESXi boot and 2 critical VMs. :D
 
I believe the recommendation is to mirror the ZIL, or else you could be screwed if the drive gets pooched.
 
If you can, clone/mirror that drive to "save" your current config/pools/etc. and then you can beat on it without fear of destroying your config.
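One way to do that clone, with placeholder device names (failing drive as /dev/sdb, a same-size-or-larger spare as /dev/sdc; triple-check these before running), is a plain dd that keeps going past read errors:

Code:
# noerror = don't stop on read errors, sync = pad unreadable blocks with zeros
dd if=/dev/sdb of=/dev/sdc bs=64k conv=noerror,sync

GNU ddrescue is the friendlier option for a drive that is actively throwing errors, since it retries and logs the bad spots.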

Re: BSD and bad sectors, enjoy my link dump!
http://www.freebsddiary.org/smart-fixing-bad-sector.php
http://forums.freebsd.org/showthread.php?t=27507
http://forums.freebsd.org/showthread.php?t=1823
http://forums.freebsd.org/showthread.php?t=20292
http://forums.freebsd.org/showthread.php?t=18634
These few are a start.


I had something similar on my pfSense box; the drive was going bad, and there was no point in trying to nudge it along again and again. I just got it working as best as possible, long enough to offload all the configs I needed.
 
Quote:
I believe the recommendation is to mirror the ZIL, or else you could be screwed if the drive gets pooched.

The only information I can find on this is that if you are using a ZFS pool version prior to 19, a failed ZIL drive could cause the whole pool to be lost. Anything higher should not experience this; ZFS should act as if the drive was simply removed and go back to using the pool itself as the ZIL device. OpenIndiana uses ZFS pool version 28. If it's really an issue even with v28, I can just disable write caching on the pool. I have a UPS and a working shutdown method if the power should go out.
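If mirroring the log device ever does seem worth it, it's a one-liner at add time; "tank" and the device names below are placeholders for the real pool and the two SSDs:

Code:
# attach a mirrored SLOG to the pool
zpool add tank log mirror c2t0d0 c2t1d0

And on pool version 19 or newer a log vdev can be removed again with zpool remove if it ever needs to come out.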

Quote:
If you can, clone/mirror that drive to "save" your current config/pools/etc. and then you can beat on it without fear of destroying your config.

Re: BSD and bad sectors, enjoy my link dump!
http://www.freebsddiary.org/smart-fixing-bad-sector.php
http://forums.freebsd.org/showthread.php?t=27507
http://forums.freebsd.org/showthread.php?t=1823
http://forums.freebsd.org/showthread.php?t=20292
http://forums.freebsd.org/showthread.php?t=18634
These few are a start.

I had something similar on my pfSense box; the drive was going bad, and there was no point in trying to nudge it along again and again. I just got it working as best as possible, long enough to offload all the configs I needed.
The drive giving me problems was the ESXi boot drive. From what I understand, ESXi has some Linux in it, so I don't know if the BSD tools would help in this case.

Rest assured, everything from the ZFS pool is just fine and dandy. I wish I knew how to check, but I think the only VM that was really affected by the bad blocks was the Astaro firewall. Running the other 2 VMs that were on the same drive didn't give any problems, and the important data (the firewall configuration) was saved.

I also plan on going the USB boot disk route so in the future a problem like this won't be such a big deal.
 