
All non-local SRs disappear after host reboot


Joshua Holmes

Question

We have two pools (one AMD and one Intel). In our AMD pool we had to reboot a host today. Its VMs had been evacuated, and when the host came back up I went to start them on it again, but it wouldn't let me. So I ran:

xe diagnostic-vm-status

The message I got was: 

Cannot start here [VM requires access to SR: <SR UUID REMOVED FOR SECURITY> (ZFS1)]

Also, below that it says: 

VM is not agile because: VM requires access to non-shared SR: <SR UUID> (ZFS1). SR must both be marked as shared and a properly configured PBD must be plugged-in on every host

 

This SR was in fact attached to this host prior to the host reboot.  

 

Was rebooting with the Linux "reboot" command an inappropriate reboot method?

 

All hosts are running XenServer 7.2


20 answers to this question


Okay, final update on this ticket.

 

Got a replacement fan for the Broadcom NIC handling all non-management, non-IPMI traffic for this host. The card is working fine now, and the host is back in production. I can also confirm its storage is mounted as it was before, and all SRs are present, etc. I am migrating some non-critical VMs to it to make sure it's running stable and the NIC is behaving as I expect.

 

Thanks for the help. Turns out this issue had nothing to do with Xen; it was a NIC failure. Let it also be a lesson to those managing infrastructure: don't buy NIC cards that have fans, because fans burn out. If a card can't stay cool with the internal case fans alone, like 99% of the cards on the market can, buy a different card :)

Have a great week!


Not sure why that happened; I would hope just fixing the unplugged SRs would be it. I typically manually evacuate all of the VMs from a XenServer and then reboot the host with either XenCenter or xsconsole. I like to do it that way so I can make sure absolutely no VMs are running on a host before I restart. Prevents me from restarting the wrong server, which I have done in the past, but it's been a while.
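
Roughly, from the CLI (just a rough sketch, not gospel; grab the host UUID from xe host-list first):

xe host-disable uuid=<host UUID>     # stop new VMs landing on the host
xe host-evacuate uuid=<host UUID>    # live-migrate any remaining VMs off it
xe host-reboot uuid=<host UUID>      # clean reboot through the toolstack

xsconsole does essentially the same thing from its menus.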

 

--Alan--

 

2 minutes ago, Alan Lantz said:

Yea, never reboot with the reboot command! You want to do a graceful, proper shutdown with either XenCenter or the command-line XenServer commands.

 

--Alan--

 

 

Specifically, should I have run this instead (I should mention I am not using XenCenter):

#xe host-reboot uuid=<host UUID>

 

Lastly, other than knowing I have SRs no longer plugged in, how else can I ascertain what else might be gone after a hard reboot like that?
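
(For context, this is how I've been checking the SR/PBD state so far, in case it matters; just the obvious xe listings:)

xe pbd-list host-uuid=<host UUID> currently-attached=false params=uuid,sr-uuid,currently-attached
xe pif-list host-uuid=<host UUID> params=device,currently-attached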

21 hours ago, Alan Lantz said:

Not sure why that happened; I would hope just fixing the unplugged SRs would be it. I typically manually evacuate all of the VMs from a XenServer and then reboot the host with either XenCenter or xsconsole. I like to do it that way so I can make sure absolutely no VMs are running on a host before I restart. Prevents me from restarting the wrong server, which I have done in the past, but it's been a while.

 

--Alan--

 

 

Okay, so I should have known it wouldn't be this simple. This morning I planned on adding the SRs back to this host and seeing if I could move some VMs back over to it.

 

When I go into xsconsole (on the host's local console) and navigate to the screen for managing the SRs, I can see the SRs in the list, but they all say "This storage repository is unplugged and not usable by this host".

10 minutes ago, Alan Lantz said:

You should be able to do a repair or replug the SR as long as connectivity is good and the SR is not corrupted.

 

--Alan--

 

 

The SR has four different PBDs associated with it, for use with the other hosts in the pool. So I assume the SR is healthy and operating as it should; otherwise the other hosts would be losing their minds.

I did, however, try to replug it from a terminal on the host.
 

xe sr-list (and got the UUID of the PBD; it showed it associated with this host, and unattached).


I then tried to plug it back in.

xe pbd-plug uuid=10bc2f12-9401-9153-5b74-7066c4995f17

When I did this, I got an error message:

[root@server-r2u38 ~]# xe pbd-plug uuid=10bc2f12-9401-9153-5b74-7066c4995f17
Error code: SR_BACKEND_FAILURE_108
Error parameters: , Unable to detect an NFS service on this target.,

 

I can confirm, though, that the target is in fact NFS; here is the info about the SR:

 

uuid ( RO)                : 206b25e9-c70c-7814-49c1-a12b8a2f6b0b
          name-label ( RW): ZFS1
    name-description ( RW): NFS SR [<IP removed for security>:/vol1/xen]
                host ( RO): <shared>
                type ( RO): nfs
        content-type ( RO):

 

I checked with the guy who manages our storage clusters; he said there's no firewall on the cluster controllers blocking the request.
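
Next step, I'm going to run some basic NFS reachability checks from the host itself, something like this (assuming showmount and rpcinfo are present in dom0; IP removed as above):

ping -c 3 <storage IP>
rpcinfo -p <storage IP>        # should list the nfs and mountd services
showmount -e <storage IP>      # should list /vol1/xen among the exports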

 


Taking a further look after trying to plug that PBD, I looked at the output of df -h, and it appears the mounts for the SRs I want to use are no longer there... Here's what's on it:

 

Filesystem                                                                                                 Size  Used Avail Use% Mounted on
devtmpfs                                                                                                   1.9G     0  1.9G   0% /dev
tmpfs                                                                                                      1.9G   60K  1.9G   1% /dev/shm
tmpfs                                                                                                      1.9G  912K  1.9G   1% /run
tmpfs                                                                                                      1.9G     0  1.9G   0% /sys/fs/cgroup
/dev/sda1                                                                                                   18G  1.8G   15G  11% /
xenstore                                                                                                   1.9G     0  1.9G   0% /var/lib/xenstored
/dev/loop0                                                                                                  56M   56M     0 100% /var/xen/xc-install
/dev/sda5                                                                                                  3.9G  678M  3.0G  19% /var/log
/dev/mapper/XSLocalEXT--2eed34c5--b759--8542--3f74--dd5e98b6afde-2eed34c5--b759--8542--3f74--dd5e98b6afde  875G  133M  831G   1% /run/sr-mount/2eed34c5-b759-8542-3f74-dd5e98b6afde
tmpfs                                                                                                      387M     0  387M   0% /run/user/0

And when I take a look at another host that's operating just fine, I see this:

 

Filesystem                                                              Size  Used Avail Use% Mounted on
devtmpfs                                                                1.9G  132K  1.9G   1% /dev
tmpfs                                                                   1.9G  1.1M  1.9G   1% /dev/shm
tmpfs                                                                   1.9G   13M  1.9G   1% /run
tmpfs                                                                   1.9G     0  1.9G   0% /sys/fs/cgroup
/dev/sda1                                                                18G  1.8G   15G  11% /
xenstore                                                                1.9G     0  1.9G   0% /var/lib/xenstored
/dev/loop0                                                               56M   56M     0 100% /var/xen/xc-install
/dev/sda5                                                               3.9G  3.9G     0 100% /var/log
<ip hidden>:/vol1/xen/206b25e9-c70c-7814-49c1-a12b8a2f6b0b                39T  2.6T   36T   7% /run/sr-mount/206b25e9-c70c-7814-49c1-a12b8a2f6b0b
<ip hidden>:/vol1/nfs/test/8df170e2-9aa7-3907-fc62-e52ef8fb3992          22T  951G   21T   5% /run/sr-mount/8df170e2-9aa7-3907-fc62-e52ef8fb3992
<ip hidden>:/mnt/r2s1/cloud/xen72/53d88ec2-b47b-3c52-5b07-019e1b87babf  5.5T  2.1T  3.5T  38% /run/sr-mount/53d88ec2-b47b-3c52-5b07-019e1b87babf
<ip hidden>:/mnt/r2s1/cloud/isos/                                       5.5T  2.1T  3.5T  38% /run/sr-mount/3e74a012-a443-8d14-8c5f-7d0307c41801
tmpfs                                                                   387M     0  387M   0% /run/user/0

So correct me if I am wrong, but I need to mount these NFS and iSCSI locations before I can try to plug the PBD in, yes?


Yes, and part of plugging that PBD should be creating that mount and mounting it. At least you can see /vol1/xen. That's a good start. I would probably try to forget the SR in XenCenter and go through the process of attaching to the SR again and let it recreate the connection. Just be sure you don't destroy it. I don't think destroy is even available to you on this host, but detach/forget and then adding the SR back to this one bad host should work.
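
If you're doing it from the CLI rather than XenCenter, the per-host equivalent is roughly this (just a sketch; UUIDs are placeholders and the device-config values should match what your working hosts use):

xe pbd-unplug uuid=<old PBD UUID>     # will just error if it's already unplugged
xe pbd-destroy uuid=<old PBD UUID>
xe pbd-create host-uuid=<host UUID> sr-uuid=<SR UUID> device-config:server=<storage IP> device-config:serverpath=/vol1/xen
xe pbd-plug uuid=<new PBD UUID>       # UUID returned by pbd-create

That only touches this host's connection; the other hosts' PBDs are left alone.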

 

--Alan--

 

2 minutes ago, Alan Lantz said:

Yes, and part of plugging that PBD should be creating that mount and mounting it. At least you can see /vol1/xen. That's a good start. I would probably try to forget the SR in XenCenter and go through the process of attaching to the SR again and let it recreate the connection. Just be sure you don't destroy it. I don't think destroy is even available to you on this host, but detach/forget and then adding the SR back to this one bad host should work.

 

--Alan--

 

 

Thanks Alan. I had to leave for a customer site to fix some items but will try this later. I ran a history command on that host, and the admin who set it up (no longer here) doesn't appear to have even used the mount command, so I am completely flummoxed as to how he mounted it before. I also don't see an entry in fstab, which means it's not permanent either; in fact, on ALL the hosts in any of our pools I see that fstab wasn't used. So had there been a physical reboot like I did here, it would have lost the mount... I am super curious why he did this.

 

I will correct it and report back to this thread, but right now it appears that when the reboot happened, the mounted endpoints were lost; that's why the PBDs were unplugged and also why I can't plug them back in.

1 minute ago, Alan Lantz said:

Reconnecting the SR should work all of that out.

 

--Alan--

 

 


So unlike pbd-plug, which is reconnecting a pointer to a VDI, I need to disassociate the SR entirely from the host and then re-add it?

 

Is there any reason why I wouldn't be okay doing this over SSH vs. xsconsole on the host itself? I assume both use XAPI anyway.

On 1/1/2020 at 10:44 AM, Alan Lantz said:

Since it's not plugging, starting over and re-adding seems like the best step. /var/log/SMlog will be where you will find storage issue information. But it can be cryptic at times.

 

--Alan--

 

 

Okay, well, I have some concrete, useful data here. (Adding context for those who find this thread later and have a question.)

 

I ordered a bunch of new RAM because I had planned to upgrade the RAM in the hosts anyway. While this one was out of service, I figured I might as well upgrade it. I went to unplug the server and get it ready to remove, and I noticed the SFP+ NIC LEDs weren't showing a link at all. (Each host here has dual 10Gb SFP+ and 4 x 1GbE RJ45 ports.)

 

I took the server out and ran it on my test bench, and sure enough the Broadcom NIC card is shot: not showing up in the PCI list, not doing much of anything. A while back in 2019 we had to reboot this host and the card started working again; IMO it's probably been on its way out for a while.
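
(For reference, the sort of checks I mean on the bench, from dom0 or a live USB; nothing fancy:)

lspci | grep -i ethernet      # the dual-port 10Gb card should show up here
ip link show                  # and its interfaces here
ethtool <interface>           # link status, if the interface exists at all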

 

These servers, rather than having dual NIC cards with a single 10Gb port on each, instead have a single card with dual ports, so sadly it's a single point of failure.
 

I have a new card in it now, an Intel X520-DA2, but my understanding was that swapping out a NIC on a server that's part of a pool requires that the pool master be made aware of the card. According to the Xen docs, any networking change requires all host servers to reboot.

It seems odd to me that it would require all hosts to reboot simply to switch out a NIC, so I am guessing I am missing something, because that doesn't seem right.

The previous card had an LACP bond set up, but obviously that bond is going to be different now with the new card and different hardware for each port. Any suggestions on what I am missing in my understanding would be great.
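
If I end up rebuilding the bond from the CLI, my rough plan (just my reading of the xe docs; UUIDs are placeholders) is:

xe pif-scan host-uuid=<host UUID>                        # pick up the new card's PIFs
xe pif-list host-uuid=<host UUID> params=uuid,device,MAC
xe bond-create network-uuid=<bond network UUID> pif-uuids=<PIF1 UUID>,<PIF2 UUID> mode=lacp

...but if networking changes really do require rebooting every host, I'd rather know before I try it.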

 
