
Hypervisor 8 upgrade server not exiting maintenance mode, VMs not rebooting.


Iacovos Zacharides

Question

Hello,

 

We had a pool of 3 Dell R710 hosts on Xen7.6 Standard Edition.

We decided to upgrade to Hypervisor 8 and add a 4th server, as well as a new SAN (Dell MD3620i)

I did a fresh install on host1, created a new pool, and migrated 1/3 of the VMs from the old pool to the new pool.

I repeated for host2 and finally host3. Everything seemed to be working fine until I added a 4th host (Dell R720).

 

Our pool has AD enabled, so I added it using XenCenter v8.0.0.7086. After that, I modified the network adapter settings to match those of the other hosts: NIC0+NIC1 = Bond 0+1, NIC2 = iSCSI1, NIC3 = iSCSI2. After enabling multipath, I saw the iSCSI sessions on all hosts start to drop. There are 3 SRs and normally 4 iSCSI sessions to each host; sessions were dropping to 2 of 4, or even 0 of 4, but the VMs residing on the SRs were not affected at all.
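For reference, this is roughly how I was checking the session and path state from each host's console while this was happening (the SR UUID below is a placeholder):

# List the active iSCSI sessions on this host (normally 4 of 4 in our setup)
iscsiadm -m session

# Show the multipath topology and how many paths each LUN currently has
multipath -ll

# Check whether the SR's PBD is still plugged on each host
xe pbd-list sr-uuid=<sr-uuid> params=host-uuid,currently-attached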

 

Host4 took a long time to finish booting (i.e. to be able to exit maintenance mode without getting the "server still booting" message).

After confirming that our iSCSI host mappings and switch/VLAN configurations were correct, I tried detaching one SR (which was empty), forgetting it, and then reattaching it. I got an "SM has thrown a generic python exception" message.
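For completeness, the reattach attempt was the usual xe sequence, sketched below with placeholder UUIDs and device-config values for our target; the failure surfaces at the pbd-plug step, and the full python traceback lands in /var/log/SMlog on the host:

# Re-introduce the forgotten SR under its old UUID
xe sr-introduce uuid=<sr-uuid> type=lvmoiscsi shared=true content-type=user name-label="iSCSI SR"

# Create and plug a PBD for each host in the pool
xe pbd-create host-uuid=<host-uuid> sr-uuid=<sr-uuid> device-config:target=<san-ip> device-config:targetIQN=<iqn> device-config:SCSIid=<scsi-id>
xe pbd-plug uuid=<pbd-uuid>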

 

I removed host4 from the pool, and sessions between the 2 remaining SRs and the pool went back to 4 of 4, but I still cannot reattach the removed SR; I get the same error.

I restarted the toolstack on all hosts, to no avail, and then attempted to put one host in maintenance mode in order to reboot it, but the VM migration hung. I had to force a shutdown and restart the toolstack on that host to get the VM to power off. I then cancelled maintenance mode and tried to power the VMs back on, but each took multiple attempts to boot (the error was that the VDI could not be found, so I detached and reattached the VDI to the VM), and in both cases the VMs took about 10 minutes to boot. I tested rebooting a 3rd VM from within the OS, and that reboot also took more than 10 minutes.
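The force-off itself was along these lines (the VM UUID is a placeholder; vm-reset-powerstate is a last resort and should only be used once the VM is definitely not running anywhere):

# Force the hung VM off
xe vm-shutdown uuid=<vm-uuid> force=true

# If xapi still believes the VM is running, reset its recorded power state
xe vm-reset-powerstate uuid=<vm-uuid> force=true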

 

I then tried changing the pool master, but the host has since remained in the "Disabled" state (its VMs are running) and cannot exit maintenance mode due to the "still booting" error.
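For reference, the master change and the re-enable attempt were roughly these (host UUIDs are placeholders):

# Hand the pool master role to another host
xe pool-designate-new-master host-uuid=<new-master-uuid>

# Try to bring the disabled host back (this keeps failing with "still booting")
xe host-enable uuid=<host-uuid>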

I ran xe task-list and got these results:

uuid ( RO)                : 17747bac-f7cc-9f88-7762-929784f56df9
          name-label ( RO): server_init
    name-description ( RO):
              status ( RO): pending
            progress ( RO): 0.000


uuid ( RO)                : 0ced6d00-cbd1-e8b9-5d4d-433782f48ea9
          name-label ( RO): Connection to VM console
    name-description ( RO):
              status ( RO): pending
            progress ( RO): 0.000

 

After a while, 1, 2, and then a 3rd "sr.scan" task appeared, all stuck in the pending status as well.

I tried xe task-cancel uuid=... for the first 2 tasks, but nothing changed.

xe task-param-list uuid=17747bac-f7cc-9f88-7762-929784f56df9 for the server_init task returns this:


uuid ( RO)                  : 17747bac-f7cc-9f88-7762-929784f56df9
            name-label ( RO): server_init
      name-description ( RO):
            subtask_of ( RO): <not in database>
              subtasks ( RO):
           resident-on ( RO): 88c23e0e-916f-4457-bbfc-1a56700eb4c1
                status ( RO): pending
              progress ( RO): 0.000
                  type ( RO): <none/>
                result ( RO):
               created ( RO): 20191002T14:35:01Z
              finished ( RO): 19700101T00:00:00Z
            error_info ( RO):
    allowed_operations ( RO):
          other-config (MRW): startup_operation: considering executing on-master-start script
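For what it's worth, my working assumption (not confirmed anywhere official) is that pending tasks that ignore xe task-cancel only get cleared when the toolstack restarts on the host named in their resident-on field, though in my case even that did not clear them. This is how I was mapping tasks to hosts:

# Show each pending task and the host its record lives on
xe task-list params=uuid,name-label,status,resident-on

# Then, on that host's console, restart the toolstack
xe-toolstack-restart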

 

Note that the plan was to perform a rolling pool upgrade after the migration was complete, so XS80E001/2/3/4 are not installed.

Also note that when adding host4 to the pool I got a notice saying that the CPU feature set of the existing hosts is older than that of host4 (it's a newer model) and that host4's functionality would be limited by the feature set of the older hosts in the pool. I assume that Xen would have updated the CPU feature mask on its own.
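(If it helps, the feature sets can be compared by hand; cpu_info is a standard host field, and the UUID below is a placeholder:)

# Dump a host's CPU feature set; compare the "features" entries across hosts
xe host-param-get uuid=<host-uuid> param-name=cpu_info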

 

Any help would be appreciated.

Thank you.


5 answers to this question


Hello,

 

Prior to commencing the upgrade of the host servers, did you check the latest Hardware Compatibility List (HCL) for CH 8.0 to confirm compatibility with the Dell R710? http://hcl.xenserver.org/servers/?serversupport__version=15&vendor=3

 

Referring to the CH 8.0 HCL, the R710 is not on the list. We advise contacting your Dell representative and asking whether Dell elected to test the R710 server with CH 8.0.

 

Thank you,

Andy

CH/XS PM

 


Hello Andy, thanks for your response.

We'll get in touch with Dell to confirm hardware support.

 

However, I'd like to point out that the first host had been upgraded over 12 days ago and had been running fine until today, with multiple VM startups, shutdowns, migrations, etc.

The second host was upgraded over 5 days ago and the 3rd 2 days ago, and both had also been running fine since.

The issues described above came up right after joining the 4th host to the pool.


I'm a little confused where you say you configured the network adapter settings to match the other hosts. A server should have a single NIC (usually eth0) configured, and when you join the pool your networks will be configured automatically. The only thing you would need to provide is perhaps other management/storage interfaces that are unique. Also, since these are different models, are the number of NICs and their order the same as on the older three? You may need to do some NIC reordering to get the NICs aligned properly with the original pool.
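As a quick sketch, something like this will show whether the device order and MACs line up across hosts:

# List physical NICs per host with device name and MAC to compare ordering
xe pif-list physical=true params=host-name-label,device,MAC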

 

--Alan--

 


When you add a new host to a pool, make sure the only network configuration in place is a single, primary management interface (not a bond!), as most of the networking will then be inherited from the pool settings (if you have an iSCSI device like the MD3620i, you'll have to add those networks manually to the host).
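As a rough sketch (the address and credentials are placeholders), the join itself from the new host is just:

# Run on the new host, with only the single management interface configured
xe pool-join master-address=<pool-master-ip> master-username=root master-password=<password>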

 

I have installed CH 8.0 successfully on a pair of Dell R200 units, which predate the R710, and hence I would be surprised if the R710 would not work. Do make sure it has the same BIOS and hotfix levels as the rest of the pool before you add that server to the pool, or it may think it is not compatible!
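Something like the following should show whether the levels match (I'm assuming the CH 8.0 update mechanism here, where applied hotfixes appear via update-list; the UUID is a placeholder):

# Product version/build on each host
xe host-param-get uuid=<host-uuid> param-name=software-version

# Updates applied across the pool
xe update-list params=name-label,hosts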

 

-=Tobias


It turns out that we had assigned the same LUN number to some virtual disks, and on the Dell MD3620i side there is a limit on the number of connections it will accept per LUN. That's why the issue came up as soon as we added a 4th host.
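For anyone who runs into the same thing: one way we might have spotted the duplicate LUN numbers earlier (the target details below are placeholders) is to probe the target and check that every reported LUN has a unique LUNid/SCSIid:

# Probing without an SCSIid makes the backend report the LUNs it can see
xe sr-probe type=lvmoiscsi device-config:target=<san-ip> device-config:targetIQN=<iqn>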

 

We're in the process of a rolling pool upgrade, and then we'll try to add the 4th host again.

I'll update if needed with more information.
