
Automatic restart of servers in Pool


Shingo Anezaki

Question

I am using XenServer 6.0.2.

 

The other day, all 6 servers in the pool restarted, so I am investigating.

 

The xensource.log on the pool master contains the following entry:

 

unix-RPC|host.call_plugin R:3ce71ef5ea9a|stunnel] get_reusable_stunnel: fresh stunnel connection failed with exception: Sys_error("/tmp/stunnel8bf49a.log: No space left on device"):
delaying 10.00 seconds before reconnecting to xxx.xxx.xxx.xxx:443 (attempt 1 / 10)

 

This entry appears from around 7 am, and it seems that the inodes had been exhausted. About 3 hours later, the servers in the pool restarted one after another.
The daemon log on a server in the pool contained the following entries:

 

ha_monitor|HA monitor D:e608809f61ed|xapi_ha] HA_HEARTBEAT_APPROACHING_TIMEOUT
ha_monitor|HA monitor D:e608809f61ed|xapi_ha] HA_STATEFILE_APPROACHING_TIMEOUT

 

So I have a question. When the pool master cannot communicate with a member server, does the member server restart itself to attempt recovery?
If servers restart themselves, how long after the communication failure does that happen?

 

If this behavior is set in a configuration file, I would like to know which file contains the setting.

 

Best regards.


With older versions of XenServer, you may need to increase the heartbeat timeout parameter.   See: https://support.citrix.com/article/CTX139166

As to exact timing, it depends. Also, HA assumes you have 2 failures: the heartbeat device and the ability to ping the other hosts, and both need to fail for a new master to be elected. You can follow the process in the log file when it happens and HA starts looking for a quorum to elect a new master. You could simulate a failure and check the logs to get an idea of how long it actually takes.
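As a rough aid for following the process in the logs, here is a minimal sketch that counts the two HA warnings quoted in the question. The function name `count_ha_warnings` is made up for illustration, and the log path in the usage comment is an assumption; point it at whichever log on your host carries the xapi HA messages.

```shell
#!/bin/sh
# Count the HA timeout warnings in log text read from stdin.
# The two message names come from the log excerpt in the question.
count_ha_warnings() {
    awk '
        /HA_HEARTBEAT_APPROACHING_TIMEOUT/ { hb++ }
        /HA_STATEFILE_APPROACHING_TIMEOUT/ { sf++ }
        END { printf "heartbeat=%d statefile=%d\n", hb, sf }
    '
}

# Example (the path is an assumption, adjust as needed):
# count_ha_warnings < /var/log/xensource.log
```

To see the timing rather than just the counts, grepping for the same two strings and reading the timestamps on the matching lines should be enough.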

 

That aside, I'd also verify your NTP settings are good -- check the offsets on each host with "ntpstat -s" to ensure they are all within a couple hundred milliseconds or less.
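One possible way to automate that check is sketched below. Note that it reads the peer table from `ntpq -pn` instead of the ntpstat command mentioned above, assuming ntpq is installed on the host. The function name `ntp_offsets_over` and the 200 ms default are illustrative choices, not anything XenServer prescribes.

```shell
#!/bin/sh
# Read `ntpq -pn` output on stdin and print every peer whose absolute
# offset (9th column, in milliseconds) exceeds the given limit.
ntp_offsets_over() {
    limit_ms=${1:-200}
    awk -v limit="$limit_ms" '
        NR > 2 && NF >= 10 {        # skip the two header lines
            off = $9 + 0
            if (off < 0) off = -off
            if (off > limit) print $1, $9
        }
    '
}

# Example, run on each host:
# ntpq -pn | ntp_offsets_over 200
```

No output means every listed peer is within the limit.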

 

-=Tobias


Running out of space on /dev/sda1 was an issue prior to 7.x. The 7.x versions implement a new file layout that is much more forgiving. Something happening on one server, even the master, shouldn't cause other servers to restart.

I don't think even HA would reboot a slave system for the sake of recovery and becoming a new master.

 

--Alan--

 

 

 


I saw the "heartbeat approaching timeout" message many times on XS 6.X servers. While I was still running XS 6.X, I increased the timeout to, I believe, 120 seconds or so, and that definitely helped. The HA code was partly rewritten in 7.X and is far more stable; I have not seen such issues since.

 

Also, make sure your dom0 has adequate resources (memory and VCPUs) so that it isn't overloaded. Check with "top" to make sure the CPUs are not close to saturation. You should also be seeing little or no swapping; if there is swapping, the dom0 memory needs to be increased.
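For the swap part, a minimal sketch that reads the standard Linux /proc/meminfo fields and reports how much swap dom0 currently has in use. The function name `swap_used_kb` is made up for illustration. A non-zero value only shows that dom0 has swapped at some point; watching the si/so columns of "vmstat 5" is one way to see whether swapping is still going on.

```shell
#!/bin/sh
# Read /proc/meminfo-style text on stdin and print swap in use, in kB.
swap_used_kb() {
    awk '
        $1 == "SwapTotal:" { total = $2 }
        $1 == "SwapFree:"  { free = $2 }
        END { print total - free }
    '
}

# Example, run in dom0:
# swap_used_kb < /proc/meminfo
```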

 

-=Tobias


Thank you, everyone!


Because we use XenServer for our service, we cannot update immediately.

We recognize that updating would be best, but we will stay on the current version for a while.

 

I would like to confirm one more point.
The following command shows the directory that is consuming the most inodes.

 

# find / -xdev -printf '%h\n' | sort | uniq -c | sort -k 1 -n -r | more

----------------------------------------------------------------------------------------
 3829 /var/lock/sm/lvm-3e352980-65ec-1cd2-8f1c-f81f2d982648

           ・

           ・

           ・

----------------------------------------------------------------------------------------
 

Is this data from a VM running on XenServer?
If a particular VM is the cause, I would like to stop that VM.
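To keep an eye on inode exhaustion before it causes trouble again, something like the following could be run periodically. This is only a sketch: it parses the output of `df -iP` (GNU coreutils), and the function name `inodes_over` and the 90% threshold are illustrative assumptions.

```shell
#!/bin/sh
# Read `df -iP` output on stdin and print mount points whose inode
# usage (IUse% column) is at or above the given percentage.
inodes_over() {
    limit_pct=${1:-90}
    awk -v limit="$limit_pct" '
        NR > 1 {                    # skip the header line
            pct = $5
            sub(/%/, "", pct)
            # some filesystems report "-" here; ignore those
            if (pct ~ /^[0-9]+$/ && pct + 0 >= limit) print $6, $5
        }
    '
}

# Example:
# df -iP | inodes_over 90
```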

 

Thank you,


You should be able to match that up with the listing from sr-list. I don't think that's a particular VM; it is probably just the SR. You would have to dig further to find out which VM. As for upgrading, you really only have two choices: 7.1 LTSR with a support agreement, or 7.6, where you commit to upgrading as new releases come out. Coming from 6.x, you also need to verify that your hardware is compatible.
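A minimal sketch of that matching step, assuming the suffix of the lock directory name is the SR UUID. The function names are made up for illustration, and `lookup_sr` only works on a host where the xe CLI is available.

```shell
#!/bin/sh
# Extract a UUID from an SM lock directory path such as
# /var/lock/sm/lvm-<uuid>. Prints nothing if no UUID is found.
sr_uuid_from_lock_path() {
    printf '%s\n' "$1" |
        sed -n 's/.*\([0-9a-f]\{8\}-[0-9a-f]\{4\}-[0-9a-f]\{4\}-[0-9a-f]\{4\}-[0-9a-f]\{12\}\).*/\1/p'
}

# Look the UUID up with xe (assumes it really is an SR UUID).
lookup_sr() {
    uuid=$(sr_uuid_from_lock_path "$1")
    [ -n "$uuid" ] || { echo "no UUID in: $1" >&2; return 1; }
    xe sr-list uuid="$uuid" params=uuid,name-label,type
}

# Example, run on a pool host:
# lookup_sr /var/lock/sm/lvm-3e352980-65ec-1cd2-8f1c-f81f2d982648
```

If sr-list returns nothing for that UUID, the assumption does not hold and the directory name needs to be matched some other way.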

 

--Alan--

 

 


Archived

This topic is now archived and is closed to further replies.
