
Freshly installed 6.5 server fails to join pool


Alberto Sierra

Question

Hey folks,

 

I've been trying to find a solution to my problem in this forum for about a week, and I've finally decided to ask.

A couple of weeks ago one of the servers in our pool had a network card failure. It was the master, and I went through the painful process of emergency transition, SR recovery, and VM migration. Currently I'm running a pool with a single server.

The failed server's NIC was replaced, but after having problems with the NIC ordering I re-installed it fresh, configured eth0 as the management interface, and joined the pool. The next thing that happens is that the joined server loses all interfaces and connectivity, and I need to remove it from the pool (host-forget, remove the db on the slave, restart networking).

I've repeated this process several times with the same result. The new server loses all network connectivity and needs to be reset for the NICs to be visible again.

 

There are a few things to notice:

  • The master has a bonded management interface.
  • Right after the server joins the pool, there is an "unknown" interface configured with the IP address originally assigned to the management interface (eth0), and the management interface itself is unconfigured and not attached to any network.
  • Right after the server joins the pool, all NICs show as disconnected.
  • Finally, the weirdest thing is that the new server's root password is reconfigured (by the master?) to the original root password from the failed server.

 

Any ideas how to get this host back into the pool?


Recommended Posts

You always need to revert to just a single primary management interface (PMI) and have no bonds active when joining a host to a pool, or you can get a failed connection because the wrong path is active. This isn't well documented, alas, AFAIK.
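For what it's worth, a rough way to check that from the CLI before attempting the join (the UUIDs below are placeholders, and moving the management interface is disruptive, so treat this as a sketch rather than a recipe):

# xe bond-list
# xe pif-list management=true params=uuid,device,host-name-label,IP

If a bond is still carrying management on a host, move management back onto a plain PIF first and then tear the bond down:

# xe host-management-reconfigure pif-uuid=<uuid-of-plain-eth0-PIF>
# xe bond-destroy uuid=<bond-uuid>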

 

As for the HCL, it is often way out of date with respect to what is supported vs. what actually works. The best we can suggest is to eject one member of your pool, if possible, and do a stand-alone install of whatever version of XS you plan to upgrade to and use that as a test bed. If all is well, you know you're safe to upgrade your pool; you can do that and then join that one stand-alone host back into it and thereby not waste any time. Worst case, you revert to the previous version, join the stand-alone host back to the original pool, and start rethinking what to do about moving to a newer version of XS.
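The eject/re-join cycle itself is roughly this from the CLI (host UUID, master address, and password are placeholders; note that ejecting a host wipes its local storage):

# xe pool-eject host-uuid=<uuid-of-test-host>

and, once the stand-alone testing is done and the host is reinstalled with the version you want to keep:

# xe pool-join master-address=<pool-master-IP> master-username=root master-password=<password>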

 

-=Tobias


Last question is a tough one. If your hardware is from a major vendor and you are confident it will run okay, I would upgrade. If you have the overhead, you can always pull one server out of the pool, clean install it, and run some test VMs to make sure it runs okay. Hardware compatibility is up to the vendors to qualify their products for, and over time it seems there is less and less interest in keeping it updated.

 

--Alan--

 


Hey guys, thank you both Alan and Tobias for your help and guidance.

I was able to recover the pool by creating a new pool and migrating the SR with the VM metadata backup.

While configuring the new pool I figured out that the secondary interface for the management bond had been given a different interface name by the installer, and I guess that is why it failed to auto-configure the bond correctly when joining the pool. Once I migrated to the new pool, I reinstalled the other server and added it without issues. That one also had the interface misconfiguration, but I was able to detect and fix it before trying to join the pool.
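For anyone hitting the same thing, a quick way to spot that kind of mismatch is to compare the PIF device/MAC mapping on each host, something like:

# xe pif-list params=host-name-label,device,MAC,management

and check that each device name (eth0, eth1, ...) maps to the intended NIC/MAC on every host before creating the bond or joining the pool.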

 

My last question would be whether you think it's better to keep this pool on XS 6.5, on compatible hardware but with no security fixes, or to move to XS 7.6 with no official hardware compatibility.

 


Does this look correct? (A rough CLI sketch follows the list.)

  1. take a metadata backup
  2. shut the VMs down
  3. detach SR from the old pool
  4. attach SR to the new pool
  5. restore VM metadata
  6. turn VMs back on the new pool
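
A rough sketch of steps 2-3 with xe (UUIDs are placeholders; step 1 would be the metadata backup from xsconsole, and sr-forget only drops the SR record, it does not touch the data):

# xe vm-shutdown uuid=<vm-uuid>                 (repeat for each VM on the SR)
# xe pbd-list sr-uuid=<sr-uuid> params=uuid     (one PBD per host)
# xe pbd-unplug uuid=<pbd-uuid>
# xe sr-forget uuid=<sr-uuid>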

 

Is it possible to shut all the VMs down, detach the SR, attach it to the new pool, and start the VMs from the new pool? That would also mean only a short downtime, if it is possible.


Basically, you just use xsconsole to back up the VM metadata onto the SR, detach the SR (all VMs must have no activity on it), re-attach it to the new pool and introduce it as an existing SR (after configuring whatever networks, etc., just as you would for a new SR), and finally restore the VM metadata, again using xsconsole.
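If you prefer the CLI for the re-attach side, it looks roughly like this for an iSCSI/LVM SR (all UUIDs and device-config values are placeholders; for NFS the type and device-config keys differ), with the metadata restore itself still done from xsconsole:

# xe sr-introduce uuid=<original-sr-uuid> type=lvmoiscsi name-label="Shared SR" shared=true content-type=user
# xe pbd-create sr-uuid=<original-sr-uuid> host-uuid=<host-uuid> device-config:target=<iscsi-target-IP> device-config:targetIQN=<target-iqn> device-config:SCSIid=<scsi-id>
# xe pbd-plug uuid=<pbd-uuid>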

 

-=Tobias

On 3/16/2019 at 4:53 PM, Tobias Kreidl said:

You could also detach the SR from the one pool after first creating a VM metadata backup onto it, then attach it to the new pool and use the restore function to get the VM metadata back. You can re-attach the storage while preserving the contents. That way, no VM migrations would be needed, but you would experience a bit of down time.

 

Otherwise, yes, you can create  a separate NFS export and use it on a totally different pool. Each NFS export can be handled as a different SR from one or more pools, totally independently from each other. We have a number of such configurations.

 

-=Tobias

This is probably the best alternative I have (detaching the SR and attaching it to a new pool). Do you have any documentation at hand for this procedure?

Thanks for your advice on the hardware compatibility. This pool runs on a pair of Dell R510s, which are not compatible with XenServer 7.6 according to the HCL. But I'm now facing a dilemma: keep running on 6.5 without security fixes and support, or run on 7.6 without official hardware compatibility...


No, I have not tried that, though it should be possible since these are in essence just files.  I would certainly suggest backing up everything first to some sort of external storage.

If you can afford a little bit (maybe 10-15 minutes?) of downtime, detaching and re-attaching the SR works really well as long as you have a good copy of your VM metadata on that SR.

 

-=Tobias


You could also detach the SR from the one pool after first creating a VM metadata backup onto it, then attach it to the new pool and use the restore function to get the VM metadata back. You can re-attach the storage while preserving the contents. That way, no VM migrations would be needed, but you would experience a bit of down time.

 

Otherwise, yes, you can create  a separate NFS export and use it on a totally different pool. Each NFS export can be handled as a different SR from one or more pools, totally independently from each other. We have a number of such configurations.
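As a point of reference, creating a fresh NFS SR against such an export from the CLI is roughly (server and path are placeholders):

# xe sr-create type=nfs name-label="NFS SR" shared=true content-type=user device-config:server=<nfs-server-IP> device-config:serverpath=<export-path>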

 

-=Tobias


You can use the same shared storage, but those two pools must be isolated from each other. In other words, two different pools can't access the same data. With iSCSI it's done with different LUNs. With NFS, Tobias would be the expert as I don't use it for SRs very much, but I'm assuming that since the SR UUID is stored on the NFS share, you could point multiple SRs at the same NFS share, as each SR would be stored in its own UUID folder.
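One way to see what's already on a given NFS export (it should report the UUIDs of any SRs stored there, each in its own UUID-named directory) is something like:

# xe sr-probe type=nfs device-config:server=<nfs-server-IP> device-config:serverpath=<export-path>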

 

 

--Alan--

 


I can do that, but I need to prepare a plan for how to recover the VMs in case that master fails. In fact, I would love to upgrade the pool to 7.6, but I'm not sure if I can use the same shared storage in both the current (6.5) and the new (7.6) pool to avoid downtime and migrate the VMs between the pools. Once the VMs are in the 7.6 pool, I would then fresh-install the master of the 6.5 pool and join it to the new one.

 

Is that possible?


Sounds like they were uploaded to XenServer and never applied. That explains why a bare install was able to join your pool. This could also explain some of your issues. I know it would be painful, but shutting down the master and getting it to SP1 at least would be a good move. This is the recommended hotfix install order:

 

https://support.citrix.com/article/CTX138115
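The CLI side of applying a hotfix, once you have the file, is roughly this (file name and UUID are placeholders; patch-upload prints the UUID you then feed to patch-pool-apply):

# xe patch-upload file-name=<hotfix>.xsupdate
# xe patch-pool-apply uuid=<patch-uuid>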

 

Also, XenServer 6.5 is no longer supported, so if you have issues you are more on your own.

 

 

 

--Alan--

 

 


You can fake it by editing the contents of /var/update/applied if it comes to an emergency like this, where there is a mismatch and, in some cases, a particular hotfix is no longer available because it was deprecated. You just need to know the UUID of the hotfix. You should be able to copy the contents over from a server where it was applied.
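A rough sketch of that (assuming the /var/update/applied path above; back everything up first and treat this strictly as a last resort):

# ls /var/update/applied                               (on a host where the hotfix was applied)
# scp /var/update/applied/<hotfix-uuid> root@<other-host>:/var/update/applied/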

 

Note that I do not recommend this practice and it could very well void warranties and/or support on any system under maintenance contracts.

 

-=Tobias


The pool master shows these updates in XenCenter:

Base Pack (version 1.9.0)
XenServer Pack (version 6.5.0)
XenServer Transfer VM (version 6.5.0)

 

but the CLI output is different:

# xe patch-list |grep name-label
              name-label ( RO): XS65E003
              name-label ( RO): XS65E006
              name-label ( RO): XS65E002
              name-label ( RO): XS65E005
              name-label ( RO): XS65E001
              name-label ( RO): XS65E008
              name-label ( RO): XS65ESP1
              name-label ( RO): XS65E007
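
To check whether those are actually applied anywhere (rather than just uploaded), I believe the hosts field of each patch record shows it:

# xe patch-list params=name-label,hosts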

 

1 minute ago, Alan Lantz said:

After installing the SP1 did you restart? XenCenter should show both servers on the same hotfix level. As far as the pool database, if you install the host clean again it will have different UUIDs and should appear as a different device as far as XAPI is concerned.

 

--Alan--

 

 

Yes, I restarted the new server after installing SP1. I can't restart the current pool master; I have all my VMs running off it right now.


After installing the SP1 did you restart? XenCenter should show both servers on the same hotfix level. As far as the pool database, if you install the host clean again it will have different UUIDs and should appear as a different device as far as XAPI is concerned.

 

--Alan--

 

 


You removed the failed/ejected host from the current XenServer pool, right? Have you seen this?

https://support.citrix.com/article/CTX126382

Use "xe host-forget uuid=<host uuid> --force" if there is still a trace of it on the pool.

 

Worst case, you can force the host to be added to the pool, but you should make sure all the hotfixes match the pool, or there may be issues sooner or later.

You should be able to manually fix all the network settings, worst case.
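
For reference, the forced join is roughly this, run from the host you're adding (force=true bypasses some of the homogeneity checks, so use it with care):

# xe pool-join master-address=<master-IP> master-username=root master-password=<password> force=true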

 

-=Tobias


Archived

This topic is now archived and is closed to further replies.
