
Connection to HA NFS SR was lost on 2 member servers


Toni Tschampke

Question

Dear experts,

 

Today we noticed downtime on two member servers in a three-server HA pool (XenServer 7.1 LTSR). The SR is on a Synology RS3617xs+ and is accessed via the NFS protocol.

After digging through the log files we noticed that both affected servers were throwing exceptions saying the NFS SR was no longer available.

This resulted in an HA failover, as the failover count decreased from 2 to 0. All VMs were stopped and restarted on the remaining server (which had been elected master earlier).

 

The strange thing is that both affected servers rebooted by themselves, but we have no clue why, as we cannot find anything relevant in the logs (maybe we have just searched the wrong ones so far).

Is this expected behaviour? Does Xen forcibly kill the hosts when this failover is triggered? The log files show a gap of roughly 5 minutes.
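
For reference, the pool's HA settings can be inspected from dom0 with the xe CLI, roughly like this (just a sketch; <pool-uuid> is a placeholder and the parameter selection is only an example):

# Show whether HA is enabled and how many host failures the pool plans to tolerate
xe pool-list params=uuid,ha-enabled,ha-host-failures-to-tolerate,ha-plan-exists-for
# The HA statefile(s) live on the heartbeat SR; <pool-uuid> is a placeholder
xe pool-param-get uuid=<pool-uuid> param-name=ha-statefiles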

 

I remember that when we were using pacemaker/corosync/DRBD, we occasionally saw the STONITH mechanism doing something similar, which was indeed intended in those cases.

First we checked for a power outage on one phase, which could have this kind of effect, but our rack is fine; everything else was still running with high uptime.

The affected servers came back after a few minutes, showing no crash log in XenCenter afterwards, but their uptime indicated the recent reboot.

 

Luckily the affected servers were only hosting our service VMs and our staging platform; the production VMs were running on the remaining one.

 

But now I'm a bit lost as to where to dig deeper. The next time this happens it could affect our production VMs, which I am trying hard to prevent, for good reasons.

 

The obvious errors were:

Raising exception [108, Unable to detect an NFS service on this target.]
***** NFSFileVDI.attach_from_config: EXCEPTION <class 'SR.SROSError'>, Unable to detect an NFS service on this target.
  
Raising exception [47, The SR is not available
***** generic exception: vdi_attach_from_config: EXCEPTION <class 'SR.SROSError'>, The SR is not available [opterr=Unable to attach from config]

***** NFS VHD: EXCEPTION <class 'SR.SROSError'>, The SR is not available [opterr=Unable to attach from config]

But as server 1 was using this SR the whole time (its VMs were not affected; they are all on the same RAID6 NFS volume), the storage itself seems an unlikely cause.
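
In case it helps anyone reading along: whether the NAS still answers NFS can be probed from each host's console, roughly like this (a sketch, assuming showmount and rpcinfo are present in dom0; <nas-ip> is a placeholder):

# Ask the NAS which NFS exports it is currently offering
showmount -e <nas-ip>
# Check that an nfs service is registered with the portmapper on the NAS
rpcinfo -p <nas-ip> | grep -w nfs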

 

I would really appreciate any tips or hints on how to diagnose this further.

 

Thank you

-t


Thanks Tobias,

 

as the boxes rebooted by themselves, we had no chance to check this in time; after rebooting, the boxes were back up without any signs of what happened.

The third member server was up and running the whole time and was using the same SR.

 

I guess the SR config is shared across all member hosts in an HA pool?
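
If I understand the xe object model correctly, this can be verified from dom0 roughly like this (a sketch; <sr-uuid> is a placeholder):

# A pool-wide SR reports shared=true and has one PBD (plug record) per member host
xe sr-param-get uuid=<sr-uuid> param-name=shared
xe pbd-list sr-uuid=<sr-uuid> params=host-uuid,currently-attached,device-config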

 

Any chance of getting some clues after the fact? Is the automatic reboot of members intended in an HA failure situation?


You still need to set up individual IP addresses for the connections on each individual host, and the NFS share has to give permission to each host's NFS IP address to be allowed to connect. As long as everything is now connected, you should be good. As for the issues, check /var/log/SMlog and possibly the older rotated files that correspond to the time when you saw the issues; they will have entries for storage-related problems and might give you some clues.
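
For example, something along these lines should pull the relevant entries out of the current and rotated logs (a sketch; the search strings come from the errors quoted above, and the xha.log path assumes the HA daemon logs there):

# Storage manager log: look for the NFS probe failures around the incident
zgrep -i "Unable to detect an NFS service" /var/log/SMlog*
zgrep -i "SR is not available" /var/log/SMlog*
# HA daemon log: may show when the host decided to self-fence
zgrep -i "fence" /var/log/xha.log* 2>/dev/null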

 

-=Tobias


The cluster had been running without problems for the last ~2 years; NFS was fine the whole time, including all the required IP allow rules.

I think we will have to watch closely whether this happens again. At least I found confirmation about fencing in

https://docs.citrix.com/en-us/xencenter/7-1/pools-ha-about.html

so the forced reboots were an intended part of this HA failover situation.

 

Thanks


Thanks Tobias,

 

that makes a lot of sense and it looks like this could have been the cause.

We had a drift of ~2 seconds on one of our servers, as ntpd wasn't able to reach our dedicated time server.

Now they are synced; time to add metrics for this to our monitoring.
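
As a starting point for that metric, a minimal check could look something like this (a sketch assuming ntpd/ntpq on the hosts; the 500 ms threshold is an arbitrary example):

#!/bin/bash
# Read the system clock offset (in milliseconds) that ntpd reports via ntpq
offset_ms=$(ntpq -c rv 0 2>/dev/null | tr ',' '\n' | sed -n 's/^ *offset=//p')
echo "ntp offset: ${offset_ms} ms"
# Exit non-zero (alert) if the absolute offset exceeds the example threshold of 500 ms
awk -v o="${offset_ms:-0}" 'BEGIN { if (o < 0) o = -o; exit (o > 500) }'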

 

For now I think we can see this as solved, thank you.

Toni


Hope so! Keep us posted.

