Server Host rebooting after Watchdog Timeout reports x CPU is Stuck


Nigel Mossman

Question

Hi

 

I have built a "new" XenServer for my home lab using 2x E5-2650L CPUs and 64 GB RAM (16x Sun 4GB PC3-10600R 2Rx4), all taken from a known good HP-360DL G8 server and installed onto a used ASRock EP2C602-4L/D16.  I have a new RM850x PSU.  Having built the system and installed Xen 7.6, I found that the system would restart at random intervals, generally two or three times in a 24 hour period.  Examining the xen.log file under /var/crash showed a watchdog timeout reporting that CPU "x" was stuck, which triggered an NMI.  The CPUs are running in powersave mode.  Board and CPU temperatures are all good, never going above 50 °C.
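For anyone wanting to tally which CPU each crash blamed, here is a minimal Python sketch. It assumes the watchdog lines contain the phrase "CPU<number> is stuck" and that each crash leaves its own xen.log in a subdirectory of /var/crash. Both the pattern and the directory layout are assumptions, and the function names are made up for illustration, so adjust them to match the real logs.

```python
import glob
import re
from collections import Counter

# Assumed message shape: any line containing "CPU<number> is stuck".
# The exact wording in xen.log may differ; adjust the pattern if so.
STUCK_RE = re.compile(r"CPU\s*(\d+)\s+is\s+stuck", re.IGNORECASE)


def stuck_cpus(log_text):
    """Return the CPU numbers named in 'is stuck' lines, in order of appearance."""
    return [int(m.group(1)) for m in STUCK_RE.finditer(log_text)]


def summarize(log_texts):
    """Count how often each CPU is blamed across several crash logs."""
    counts = Counter()
    for text in log_texts:
        counts.update(stuck_cpus(text))
    return counts


if __name__ == "__main__":
    # Hypothetical layout: one xen.log per crash directory under /var/crash.
    texts = []
    for path in glob.glob("/var/crash/*/xen.log"):
        with open(path, errors="replace") as fh:
            texts.append(fh.read())
    for cpu, n in sorted(summarize(texts).items()):
        print(f"CPU {cpu}: {n} report(s)")
```

A tally like this makes it easier to see whether the reports cluster on one socket or on particular cores.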

 

CPU x is interesting in that the CPU reported as causing the failure has changed each time.  It's always an odd numbered CPU (e.g. 23, 25 etc.) and always above 19.  I've tried the following:

* Upgraded to 8.0

* Tried disabling the watchdog in Grub (whole server appeared to freeze, no crashdump)

* Disabled Hyperthreading (crashed reporting CPU 17, which would imply the second CPU)

 

The problem appears to be related to the second CPU.  The few online posts I have found regarding this issue, both for Xen and Linux in general, suggest a hardware problem, which could be anything from a broken mobo, broken CPU, or failing PSU to a CPU voltage adjustment. None of this is really helpful in diagnosing the problem or deciding on a fix.
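The reasoning that these CPU numbers point at the second socket can be written out as a small Python sketch. It assumes logical CPUs are numbered contiguously per socket (all of socket 0 first, then socket 1). That numbering is an assumption and should be checked against the real topology of the host. The function name `socket_of` is made up for illustration, and the count of 20 logical CPUs with Hyperthreading off is inferred as half of the 40 seen with it on.

```python
def socket_of(cpu, total_cpus, sockets=2):
    """Map a logical CPU number to a socket index (0-based).

    Assumption: CPUs are enumerated contiguously per socket, so the
    first total_cpus // sockets CPUs belong to socket 0, and so on.
    """
    if total_cpus % sockets != 0:
        raise ValueError("total_cpus must divide evenly across sockets")
    if not 0 <= cpu < total_cpus:
        raise ValueError("cpu number out of range")
    return cpu // (total_cpus // sockets)


if __name__ == "__main__":
    # Hyperthreading on: 40 logical CPUs, as in the post.
    for cpu in (23, 25):
        print(f"HT on,  CPU {cpu} -> socket {socket_of(cpu, 40)}")
    # Hyperthreading off: 20 logical CPUs (inferred as half of 40).
    print(f"HT off, CPU 17 -> socket {socket_of(17, 20)}")
```

Under that assumption, every CPU number reported so far lands on socket index 1, the second CPU.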

 

My latest test (running as I write this) is to disconnect all of the drives (having turned Hyperthreading back on), install another single drive with Ubuntu Linux 18.04 and run stress-ng against all of the CPUs (40) to see if this triggers a crash.  I have this on a 1 hour cycle; if that passes I will try a 24 hour cycle.
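A test like this can be scripted. The Python sketch below wraps stress-ng using its `--cpu`, `--timeout` and `--metrics-brief` options. The helper names are made up for illustration, and the choice of options is an assumption, not the exact command used in this post.

```python
import shutil
import subprocess


def stress_ng_command(timeout="1h", workers=0):
    """Build a stress-ng command line that loads the CPUs.

    workers=0 asks stress-ng to start one CPU stressor per online CPU.
    timeout takes a stress-ng duration string such as "1h" or "24h".
    """
    return [
        "stress-ng",
        "--cpu", str(workers),
        "--timeout", timeout,
        "--metrics-brief",
    ]


def run_stress(timeout="1h"):
    """Run the stress test and return the stress-ng exit code."""
    if shutil.which("stress-ng") is None:
        raise RuntimeError("stress-ng is not installed or not on PATH")
    return subprocess.run(stress_ng_command(timeout), check=False).returncode
```

Calling `run_stress("1h")` would cover the one hour cycle and `run_stress("24h")` the longer one. A non-zero return code, or the machine resetting mid-run, would be the thing to look for.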

 

I am a bit stuck and would really value some views on how to diagnose the problem!

 

 


7 answers to this question


Thank you for your replies.  I ran 3 hours of stress tests under Linux and had no crashes.  I did try to install Win10 to run the Intel diagnostics but couldn't get Win10 to install.  I removed the second CPU to check for bent pins. I didn't find any, but I did find a bit of thermal grease on the bottom of the CPU, so I cleaned that off with some alcohol.  I also removed a PCI card, adjusted the motherboard slightly and put everything back together.  The system has been running for 24 hours with no crashes (two VMs live), so I am wondering if I have fixed the problem.


Hi,

To update on this, the system stayed up for a week, and then crashed as I imported a larger VM. This time it was odd, as the Dom0 stayed up but all of the VMs and Dom0 lost contact with all of the storage.  After shutting down it all came back up OK, and then it happened again when importing this final VM.  At that point I gave up and bought a new m/board, processors and RAM, onto which I am loading VMs as I write this.  I am going to see if I can use the old board and CPUs for something, or at least work out whether they are faulty so I can sell or scrap them.

 

Nigel
