Monday, December 7, 2015

VMware Upgrade 5.5 Dell R730xd Coredump SD Card Issue

I recently ran into an issue with the ESXi upgrade process. The VMware 5.5 image that came preinstalled on the two SD cards from Dell was build 2068190. All well and good, but by the time I installed the VSAN servers some months later, I had learned of a new release from Dell: 5.5 Update 3. This image is specifically for Dell ESXi servers and is available on the Dell Support & Download pages. We are running R730s, and the download is listed on the Dell download page under "Enterprise Solutions". I created a new baseline for it in VMware Update Manager, and the upgrade failed. I tried several more times, and each time this was the message displayed on the console:
Error (see log for more info):
vmkfstools failed with message: create fs deviceName: '/vmfs/devices/disks/mpx.vmhba32:C0:T0:L0:2', fsShortName: 'vfat', fsName:'(null)'
deviceFullPath:/dev/disks/mpx.vmhba32:C0:T0:L0:2 deviceFile:mpx.vmhba32:C0:T0:L0:2
Checking if remote hosts are using this device as a valid file system. This may take a few seconds...
Create vfat file system on "mpx.vmhba32:C0:T0:L0:2"
with blockSize 1048576 and volume label "none".
/vmfs/devices/disks/mpx.vmhba32:C0:T0:L0:2: Permission denied. (Have you set the partition type to 0xfb?)
Error: Permission denied.
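
When the console shows a message like the one above, the useful part is which disk and which partition number vmkfstools was trying to format. A minimal sketch for pulling that out of the text follows. The helper name and the regular expression are made up for illustration, and it assumes the deviceName field keeps the /vmfs/devices/disks/<disk>:<partition> shape seen in this error.

```python
import re

# Hypothetical helper (not part of any VMware tool): read the disk name and
# partition number out of the "deviceName: '...'" field of the error text.
DEVICE_RE = re.compile(
    r"deviceName: '/vmfs/devices/disks/(?P<disk>[^']+):(?P<part>\d+)'"
)

def parse_failed_device(message):
    """Return (disk, partition_number), or None if the field is not present."""
    match = DEVICE_RE.search(message)
    if match is None:
        return None
    return match.group("disk"), int(match.group("part"))

# The first line of the error, copied from the console output above.
error_line = (
    "vmkfstools failed with message: create fs deviceName: "
    "'/vmfs/devices/disks/mpx.vmhba32:C0:T0:L0:2', "
    "fsShortName: 'vfat', fsName:'(null)'"
)

print(parse_failed_device(error_line))  # ('mpx.vmhba32:C0:T0:L0', 2)
```

Here the failing target is partition 2 on mpx.vmhba32:C0:T0:L0.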



The device in the error, mpx.vmhba32:C0:T0:L0, is my SD card. For some reason, having the coredump configured for a host locks up the card so that it appears to be read-only, but this is not entirely true: pointing the coredump at the SD card puts a lock on that partition. So I tried to disable the coredump with
esxcli system coredump partition set --enable false

However, it seems that this is reset when I reboot.

So, the next thing I tried was changing the partition type for the affected partitions on the SD card.
Note: You have to run
esxcli system coredump partition set --enable false
before you can delete the partition.
First, I ran fdisk on the SD card.

fdisk /vmfs/devices/disks/mpx.vmhba32\:C0\:T0\:L0

The partition in question is
.../mpx.vmhba32:C0:T0:L0:L0p2   901   3460   2621440   fc   VMKcore

Then I pressed t and entered f0 to change the partition type to something other than VMKcore. In this case, I switched to f0, Linux/PA-RISC boot.

Don't forget to press w to write the changes before you exit fdisk.
I tried booting up again. The thing that sucks is that even though you see this screen



...with the new build number, the upgrade script doesn't actually run until after all of this has loaded. So the turnaround time for troubleshooting this problem has turned into a substantial investment. Needless to say, I hope it works.

After all that fooling around, it didn't work. Same error message. This partition is killing me. Maybe I should just delete it. After recovering the old image with Shift+R on the boot screen, I went back to check my coredump partition. Oh, and that was after waiting another 10 minutes for bootup... DOH!

It doesn't have anything for that partition when I run the coredump command. I'm deleting this partition. If anything, I can re-create it after the upgrade, right? Well, instead of removing it completely, I'll remove it, re-add a new partition in the same space, and make it type VMFS.

Here goes nothing.
  1. Re-run the coredump enable false command (shown above)
  2. fdisk /vmfs/devices/disks/mpx.vmhba32\:C0\:T0\:L0
    1. d delete
    2. 2 partition 2
    3. w write changes
  3. fdisk /vmfs/devices/disks/mpx.vmhba32\:C0\:T0\:L0
    1. p print partition table, verify it is removed
  4. fdisk /vmfs/devices/disks/mpx.vmhba32\:C0\:T0\:L0
    1. n create new partition
    2. p primary partition (if prompted, enter 2 for partition number)
    3. 901 starting cylinder
    4. 3460 ending cylinder
    5. t change partition type
    6. 2 partition 2
    7. fb type VMFS
    8. w write changes
  5. fdisk /vmfs/devices/disks/mpx.vmhba32\:C0\:T0\:L0
    1. p print partition table (verify partition 2 is type VMFS)
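
Re-entering cylinders 901 and 3460 puts the new partition in exactly the space the old one occupied. As a quick sanity check on those numbers, here is a small sketch of the arithmetic. It assumes the 2621440 figure in the fdisk row is the usual fdisk count of 1 KiB blocks, and the helper name is made up for illustration.

```python
# Values copied from the fdisk row for partition 2 shown earlier.
START_CYL = 901
END_CYL = 3460
BLOCKS = 2621440  # assumption: 1 KiB blocks, as fdisk normally reports them

def cylinder_count(start, end):
    """fdisk cylinder ranges are inclusive on both ends."""
    return end - start + 1

cylinders = cylinder_count(START_CYL, END_CYL)
blocks_per_cylinder = BLOCKS / cylinders
size_gib = BLOCKS * 1024 / 2**30

print(cylinders)            # 2560
print(blocks_per_cylinder)  # 1024.0
print(size_gib)             # 2.5
```

Under that assumption each cylinder works out to 1 MiB on this card and the partition is 2.5 GiB, so typing the same start and end cylinders reproduces the original size.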


Another 10 minutes later....

Holy shnickeys it worked. So, the upgrade is installed and I'm waiting for the first boot. Now, I want to go in and re-create partition #2, set it up as coredump, and reboot again.



OK. Rebooted, and here it is. I upgraded. Amazing.



As expected, when the host reconnected I had the familiar error message.



OK, going back into fdisk.

fdisk /vmfs/devices/disks/mpx.vmhba32\:C0\:T0\:L0

Just as an aside, isn't this the same as doing /dev/disks/mpx....?

Here is my output.


I changed partition 7 earlier, and deleted and re-created partition 2 as well. Before we change anything, let's see what system coredump get/list looks like:
esxcli system coredump partition get



So, it still thinks 7 is valid, even though I changed it. But since it was never "active", it must not be locking the file. After making the changes to partitions 2 and 7, here is another print of the partition table.



After adding those, I ran coredump list again... and VMware seems to be smart enough to have everything figured out already, almost.
esxcli system coredump partition list



Run a few commands to set up the coredump:
esxcli system coredump partition set --partition="mpx.vmhba32:C0:T0:L0:2"
esxcli system coredump partition set --enable true
esxcli system coredump partition list
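
If you have several hosts to fix, the three commands above can be generated per partition name. This is only a sketch: the function name is hypothetical, the esxcli arguments are copied from the commands above, and the loop prints the commands rather than running them.

```python
def coredump_setup_commands(partition):
    """Build the esxcli argument lists used above for one partition name."""
    base = ["esxcli", "system", "coredump", "partition"]
    return [
        base + ["set", "--partition=" + partition],  # point coredump at it
        base + ["set", "--enable", "true"],          # turn coredump back on
        base + ["list"],                             # show the result
    ]

for argv in coredump_setup_commands("mpx.vmhba32:C0:T0:L0:2"):
    # Dry run: print each command instead of executing it.
    print(" ".join(argv))
```

To execute them for real, each argument list could be passed to subprocess.run(argv, check=True) on a machine where esxcli is available. Substitute the partition name your own host reports.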



I think that is all. Going to reboot, enable HA, rejoin the cluster, and vMotion some VMs.

Thanks for reading

5 comments:

  1. Thanks for this, I also just found this article that led to the same conclusion but seemed much simpler.. http://myitoverview.blogspot.com/2015/09/esxi-55-upgrade-to-6-invalid-argument.html

    1. I wish I would've seen that before I started, would have made my life a whole lot easier!

  2. Thanks for this article, the first one I found after attempting to upgrade a Dell R630 with dual SD cards from ESX5.5 to ESX6.0

    Wedding Favors is accurate when they say http://myitoverview.blogspot.com/2015/09/esxi-55-upgrade-to-6-invalid-argument.html appears to be a simpler explanation.

    I still found that article to be overly complicated in my situation.

    This article: http://en.community.dell.com/techcenter/b/techcenter/archive/2016/02/05/esxi-upgrade-fails-with-an-error-permission-denied was even simpler for my specific situation, and worked flawlessly. There doesnt seem to be a need to go back and recreate the deleted partition 2, or reconfigure the coredump partition.

    the comments in this article mention this: http://www.itxperience.net/en/esxi-upgrade-operation-failed-error-permission-denied/

    The only other thing i would add is my original error was referencing mpx.vmhba32, but when i looked my coredump was actually on mpx.vmhba40 (i had no mpx.vmhba32), so i just worked with mpx.vmhba40, then re-did the upgrade and it worked fine.


  3. I guess I could clean this up a bit. Mostly, I wrote this blog while I was in the process of troubleshooting, as a reference for myself in the future and for helping anyone that might have the same problem. Let's face it, this post isn't exactly going to show up as #1 or #2 on anybody's Google Search! However, if it helped anyone even remotely, I'm satisfied. Thanks for the comment.
