
Citrix Hypervisor Crash LPFC?


fhanig

Question

Hi.

 

We're running three Citrix Hypervisor 8.2 CU1 servers with HA and a Fibre Channel SAN.

The SAN unit is an "Infotrend DS 3016RUE".

The Fibre Channel cards in the hypervisors are Emulex cards.

The Fibre Channel connection is redundant, and multipathing is enabled on every server.
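
To double-check that redundancy, the multipath topology can be listed on each host; a minimal sketch using the standard device-mapper-multipath CLI (exact output depends on the array):

# Every LUN should show both FC paths, each reported as "active ready"
multipath -ll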

 

We're dealing with some crashes now.

 

Kern.log:

Jun 19 11:50:56 xen1 kernel: [2402318.265133] lpfc 0000:af:00.0: 0:(0):0713 SCSI layer issued Device Reset (0, 0) return x2002
Jun 19 11:50:56 xen1 kernel: [2402318.288899] sd 15:0:0:0: Power-on or device reset occurred
Jun 19 11:50:56 xen1 kernel: [2402318.312788] sd 15:0:0:0: alua: port group 02 state N non-preferred supports ToluSNA
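
For reference, when such a reset shows up, the state of the HBA ports at that moment can be checked in dom0 via the generic fc_host sysfs attributes (not Emulex-specific); a small sketch:

# Link state and speed of each FC HBA port on this host
for h in /sys/class/fc_host/host*; do
    echo "$h: $(cat $h/port_state), $(cat $h/speed)"
done

# Recent lpfc / SCSI reset messages from the kernel ring buffer
dmesg | grep -iE 'lpfc|device reset|alua' | tail -n 50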

 

XHA.log:


Jun 19 11:51:30 CEST 2022 [info] SC: (script_service_do_query_liveset) "State-file approaching timeout" turned FALSE
Jun 19 11:52:28 CEST 2022 [warn] Host (0) and the local host do not agre on the view to the pool membership.
Jun 19 11:52:28 CEST 2022 [warn] local HB domain = (111)
Jun 19 11:52:28 CEST 2022 [warn] local SF domain = (000)
Jun 19 11:52:28 CEST 2022 [warn] remote HB domain = (111)
Jun 19 11:52:28 CEST 2022 [warn] remote SF domain = (110)
Jun 19 11:52:28 CEST 2022 [warn] remote HB domain on SF = (111)
Jun 19 11:52:28 CEST 2022 [warn] remote SF domain on HB = (110)
Jun 19 11:52:28 CEST 2022 [warn] other HB domain = (111)
Jun 19 11:52:28 CEST 2022 [warn] other SF domain = (110)
Jun 19 11:52:28 CEST 2022 [warn] other HB domain on SF = (111)
Jun 19 11:52:28 CEST 2022 [warn] other SF domain on HB = (110)
Jun 19 11:52:28 CEST 2022 [warn] other HB domain = (111)
Jun 19 11:52:28 CEST 2022 [warn] other SF domain = (110)
Jun 19 11:52:28 CEST 2022 [warn] other HB domain on SF = (111)
Jun 19 11:52:28 CEST 2022 [warn] other SF domain on HB = (110)
Jun 19 11:52:28 CEST 2022 [warn] other HB domain = (111)
Jun 19 11:52:28 CEST 2022 [warn] other SF domain = (000)
Jun 19 11:52:28 CEST 2022 [warn] other HB domain on SF = (111)
Jun 19 11:52:28 CEST 2022 [warn] other SF domain on HB = (000)
Jun 19 11:52:28 CEST 2022 [warn] weight[0] = 1
Jun 19 11:52:28 CEST 2022 [warn] weight[1] = 1
Jun 19 11:52:28 CEST 2022 [warn] weight[2] = 1
Jun 19 11:52:28 CEST 2022 [warn] after merger:
Jun 19 11:52:28 CEST 2022 [warn] other HB domain = (111)
Jun 19 11:52:28 CEST 2022 [warn] other SF domain = (110)
Jun 19 11:52:28 CEST 2022 [warn] other HB domain on SF = (000)
Jun 19 11:52:28 CEST 2022 [warn] other SF domain on HB = (110)
Jun 19 11:52:28 CEST 2022 [warn] other HB domain = (111)
Jun 19 11:52:28 CEST 2022 [warn] other SF domain = (110)
Jun 19 11:52:28 CEST 2022 [warn] other HB domain on SF = (000)
Jun 19 11:52:28 CEST 2022 [warn] other SF domain on HB = (110)
Jun 19 11:52:28 CEST 2022 [warn] other HB domain = (111)
Jun 19 11:52:28 CEST 2022 [warn] other SF domain = (000)
Jun 19 11:52:28 CEST 2022 [warn] other HB domain on SF = (000)
Jun 19 11:52:28 CEST 2022 [warn] other SF domain on HB = (000)
Jun 19 11:52:28 CEST 2022 [warn] weight[0] = 1
Jun 19 11:52:28 CEST 2022 [warn] weight[1] = 1
Jun 19 11:52:28 CEST 2022 [warn] weight[2] = 1
Jun 19 11:52:28 CEST 2022 [warn] merged HB domain = (000)
Jun 19 11:52:28 CEST 2022 [warn] merged SF domain = (000)
Jun 19 11:52:28 CEST 2022 [debug] FH: All hosts now have consistent view to the pool membership.
Jun 19 11:52:35 CEST 2022 [warn] SM: partition_size[0] = 0 0
Jun 19 11:52:35 CEST 2022 [warn] SM: partition_size[1] = 0 0
Jun 19 11:52:35 CEST 2022 [warn] SM: partition_size[2] = 0 0
Jun 19 11:52:35 CEST 2022 [warn] SM: winner_index = -1
Jun 19 11:52:35 CEST 2022 [debug] FH: I have lost.
Jun 19 11:52:35 CEST 2022 [err] Survival rule failed (1905) FH: Survival Rule is not met for the local host.  - Self-Fence.
Jun 19 11:52:35 CEST 2022 [info] watchdog_selffence.
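
For context, the state-file timeout and survival rule that XHA refers to come from the pool's HA configuration, which can be inspected from dom0 on the pool master; a rough sketch using the generic xe pool parameters:

# Is HA enabled, and how many host failures does the current plan tolerate?
pool_uuid=$(xe pool-list --minimal)
xe pool-param-get uuid=$pool_uuid param-name=ha-enabled
xe pool-param-get uuid=$pool_uuid param-name=ha-host-failures-to-tolerate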

 

This only happened on one server.
The other 2 servers are running without problems.

 

We've had the same problem in the past, also on the other servers.

Everything was fine for a long time before the problem reappeared.

 

 

Any ideas what the cause could be?

 

 


22 answers to this question


Red Hat had something similar and they blamed it on hardware. Is everything on the Citrix HCL?

 

--Alan

 

 

 

Resolution

In most cases, the above errors represent a problem on the storage array or fabric. As such, diagnostics should be run and error logs checked on the hardware side to determine whether anything is wrong that would cause storage devices to stop responding.

Root Cause

When an FC device becomes unreachable and the FC network doesn't send an RSCN to notify the HBA that the port is gone, the kernel starts error recovery once a command times out. However, since the device is unreachable, each error-recovery routine in the lpfc driver can take up to a minute to time out and give up trying to reach the unreachable port. Because no I/O is sent to devices on this host while error recovery is running, this can cause long delays, stalling I/O for minutes at a time.

The message "Device offlined - not ready after error recovery" is printed when every attempt to recover the failing device has been unsuccessful.

The error handler (EH) performs recovery actions with increasing severity. Some actions are performed by issuing SCSI commands, and others by invoking one of the driver's fine-grained host EH callbacks. Callbacks may be omitted, and omitted callbacks are considered to always fail.
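
Related to this, the time the FC transport waits before declaring a remote port lost (dev_loss_tmo) can be read per port on an affected host; a quick sketch using the generic FC sysfs attributes plus the lpfc module parameter (values shown will simply be the driver defaults):

# Per remote port: how long the SCSI layer waits before giving up on a lost port
grep -H . /sys/class/fc_remote_ports/rport-*/dev_loss_tmo

# lpfc driver default for the same timeout (module parameter)
cat /sys/module/lpfc/parameters/lpfc_devloss_tmo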


Hi Alan,

 

thank you.

The FC card is an "EMULEX LPE16002B-M6".

The SAN Unit is "Infotrend DS 3016RUE".

 

Both the SAN and the FC card are on the HCL, am I correct?

 

In the SAN unit's logs you can only see the disconnect from when the server restarted, nothing more.

 

Since all 3 servers have had this problem, it can hardly be a hardware problem on the server side, right?

4 minutes ago, Tobias Kreidl said:

Could be a hardware issue on the storage device or fiber interface itself. Are all hosts properly time-synched to NTP?

The SAN has 2 host boards and is fully redundant.

No errors in the SAN logs…

 

How do I check NTP sync?

I think NTP is all good…

"ntpstat -s" doesn't work anymore in this version to check NTP sync…

4 hours ago, Tobias Kreidl said:

From the CLI: ntpstat -s

on each host.

 

I thought the same.
But it does not work.

 

[root@xen-1 ~]# ntpstat -s
-bash: ntpstat: command not found
[root@xen-1 ~]#

 

 

Edit:

"timedatectl status" says on all Servers NTP enabled: yes + NTP synchronized: yes...

 

[root@xen1 ~]# timedatectl status
      Local time: Mon 2022-06-20 08:21:32 CEST
  Universal time: Mon 2022-06-20 06:21:32 UTC
        RTC time: Mon 2022-06-20 06:21:32
       Time zone: Europe/Berlin (CEST, +0200)
     NTP enabled: yes
NTP synchronized: yes
 RTC in local TZ: no
      DST active: yes
 Last DST change: DST began at
                  Sun 2022-03-27 01:59:59 CET
                  Sun 2022-03-27 03:00:00 CEST
 Next DST change: DST ends (the clock jumps one hour backwards) at
                  Sun 2022-10-30 02:59:59 CEST
                  Sun 2022-10-30 02:00:00 CET


 

 

Edit2:

I think it's "chronyc tracking" now...

 


[root@xen1 ~]# chronyc tracking
Reference ID    : 0A010105 (dc06.xxx.lan)
Stratum         : 3
Ref time (UTC)  : Mon Jun 20 06:30:43 2022
System time     : 0.000018951 seconds fast of NTP time
Last offset     : +0.000038639 seconds
RMS offset      : 0.000232955 seconds
Frequency       : 21.712 ppm slow
Residual freq   : +0.026 ppm
Skew            : 1.544 ppm
Root delay      : 0.010992894 seconds
Root dispersion : 0.105259106 seconds
Update interval : 64.9 seconds
Leap status     : Normal
[root@xen1 ~]#

[root@xen2 ~]# chronyc tracking
Reference ID    : 0A010105 (dc06.xxx.lan)
Stratum         : 3
Ref time (UTC)  : Mon Jun 20 06:31:12 2022
System time     : 0.000090672 seconds slow of NTP time
Last offset     : -0.000067649 seconds
RMS offset      : 0.000303751 seconds
Frequency       : 3.536 ppm slow
Residual freq   : -0.041 ppm
Skew            : 2.284 ppm
Root delay      : 0.011164381 seconds
Root dispersion : 0.105785571 seconds
Update interval : 65.2 seconds
Leap status     : Normal
[root@xen2 ~]#

[root@xen3 ~]# chronyc tracking
Reference ID    : 0A010105 (dc06.xxx.lan)
Stratum         : 3
Ref time (UTC)  : Mon Jun 20 06:30:58 2022
System time     : 0.000063725 seconds slow of NTP time
Last offset     : -0.000055952 seconds
RMS offset      : 0.000200258 seconds
Frequency       : 23.793 ppm slow
Residual freq   : -0.020 ppm
Skew            : 0.962 ppm
Root delay      : 0.010716650 seconds
Root dispersion : 0.103734046 seconds
Update interval : 64.8 seconds
Leap status     : Normal
[root@xen3 ~]#
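
To additionally confirm that all three hosts follow the same (reachable) time source, something like this could be run on each host as well:

# List the configured time sources, their reachability and current offsets
chronyc sources -v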

 


Okay, but the output of "chronyc tracking" looks like everything is OK?

 

What else can I check to find the problem here?

 

One more piece of information:
After the server crashed, all VMs that were on this server were partially corrupted, and I had to restore them from backup...

14 hours ago, Alan Lantz said:

I would try disabling HA first and see if the issue disappears. If not, I would then disable multipathing to see if that is the root issue.

 

--Alan--
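
For reference, should that test ever be run, HA can be toggled from the pool master's CLI; a rough sketch (the heartbeat SR UUID is a placeholder, and ha-config:timeout is, as far as I know, the knob for raising the state-file timeout when HA is re-armed):

# Temporarily disable HA for the whole pool
xe pool-ha-disable

# Re-enable HA later against the existing heartbeat SR, optionally with a larger timeout
xe pool-ha-enable heartbeat-sr-uuids=<heartbeat-sr-uuid> ha-config:timeout=120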

 

 

 

I really don't want to do trial and error because I can't actively provoke the problem.

I'd rather analyze where the problem is coming from and figure something out from the existing logs.

 

The problem has not occurred for several months now and is very sporadic.

The risk is too high for me to run production systems on this farm.

 

The problem must be identified and eliminated first.

Any ideas?


It just looks like you lost connectivity; possibly ALUA switched over to the non-preferred port. I take it nothing crashed completely (or did the storage go down?) or got interrupted, so maybe a minor power glitch or hardware hiccup. It could also be an issue with the port on the Fibre Channel controller.

Hard or impossible to easily replicate, hence probably not much that can be done at this point.

 

-=Tobias

On 6/22/2022 at 4:46 PM, Tobias Kreidl said:

It just looks like you lost connectivity; possibly ALUA switched over to the non-preferred port. I take it nothing crashed completely (or did the storage go down?) or got interrupted, so maybe a minor power glitch or hardware hiccup. It could also be an issue with the port on the Fibre Channel controller.

Hard or impossible to easily replicate, hence probably not much that can be done at this point.

 

-=Tobias

 

The servers and the SAN are backed by a UPS.

 

The storage continued to run normally the whole time; it only showed the disconnect on one channel.

 

 

A hardware problem on the server side can practically be ruled out; it is very unlikely that all 3 servers have the same hardware defect.

What is also noticeable: only 1 server crashed and lost its FC connection, yet the VMs running on the other two servers were also corrupted by the crash (corrupt data).

 

