We have two XenServer 7 hosts in a pool with a single CentOS 6.10 iSCSI storage repository. Neither the XenServer hosts nor the storage server has been updated in quite some time.
Yesterday XenServer host 1 crashed. I uploaded the crash dump to Citrix Insight Services and the analysis engine was ‘unable to generate any reports’.
Ever since the crash, once I have about 30 VMs running from the iSCSI shared storage, I am unable to start any more and begin running into issues. (This is nowhere near the number of VMs we had running before the crash.) It does not appear to be tied to any specific VM, specific XenServer host, or specific iSCSI LUN. Logs on both the XenServer hosts and the storage server seem to indicate iSCSI connection issues. My research indicates that these types of issues tend to fall into three categories:
1. Networking issues (switch/NICs/cabling)
2. Bugs (with iSCSI daemon)
3. Load issues
However, I have done all the troubleshooting to rule out (1) (the example checks below are roughly what I ran); I don't see why (2) would suddenly become a problem, and I am fully patched; and (3) does not fit, because this starts happening well before I reach the number of VMs that had been running for months before the crash.
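For (1), the checks I ran from the dom0 console looked roughly like this; the NIC name and target IP are examples from my environment, not literal values:

# Verify the path to the SAN at full MTU (use -s 8972 instead if jumbo frames are enabled)
ping -M do -s 1472 -c 5 192.168.10.20
# Watch for error/drop/CRC counters incrementing on the storage NIC
ethtool -S eth1 | grep -Ei 'err|drop|crc'
# Dump the state of every iSCSI session, its connection, and attached devices
iscsiadm -m session -P 3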
For the moment I am running as many of the VMs as I can from local disks, and only a small number from the iSCSI SR.
I am kind of out of ideas; any help is very much appreciated.
XenServer 7.1.0 Hosts:
2017-06-19 / 137272c
Version: 7.1
XS71E001 (version 1.0)
XS71E002 (version 1.0)
XS71E003 (version 1.0)
XS71E004 (version 1.0)
XS71E005 (version 1.0)
XS71E006 (version 1.0)
XS71E007 (version 1.0)
XS71E008 (version 1.0)
XS71E009 (version 1.0)
XS71E010 (version 1.0)
XS71E011 (version 1.0)
XS71E012 (version 1.0)
XS71E013 (version 1.0)
XS71E014 (version 1.0)
XS71E015 (version 1.0)
XS71E016 (version 1.0)
XS71E017 (version 1.0)
XS71E018 (version 1.0)
XS71E019 (version 1.0)
CentOS 6.10 Storage Repository:
Name : scsi-target-utils
Arch : x86_64
Version : 1.0.24
Release : 18.el6
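For reference, this is how I have been checking target and connection state on the storage server; the first command lists the real target IDs, and the tid in the second is just an example:

# List all targets, their LUNs, and currently connected initiators
tgtadm --lld iscsi --mode target --op show
# Show the open connections for a single target (tid 1 is an example)
tgtadm --lld iscsi --mode conn --op show --tid 1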
XenServer dmesg:
[24713.444531] connection28:0: detected conn error (1021)
[24713.444659] connection28:0: detected conn error (1021)
[24716.484355] connection14:0: detected conn error (1021)
[24716.484519] connection14:0: detected conn error (1021)
<Later tests>
[53334.062050] connection11:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4308214536, last ping 4308215788, now 4308217040
[53334.062102] connection11:0: detected conn error (1022)
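The conn error (1022) paired with the "ping timeout of 5 secs expired" line means the iSCSI NOP-Out keepalive went unanswered for 5 seconds. One experiment I am considering is raising those timeouts in /etc/iscsi/iscsid.conf in dom0 to see whether the errors are timeout-driven rather than a dead link; these are the standard open-iscsi keys, the values are only a test, and I realize editing dom0 config this way may not be supported:

# /etc/iscsi/iscsid.conf (dom0) - NOP keepalive and failover timeouts
node.conn[0].timeo.noop_out_interval = 15     # default 5; seconds between NOP-Out pings
node.conn[0].timeo.noop_out_timeout = 15      # default 5; seconds to wait for the NOP-In reply
node.session.timeo.replacement_timeout = 120  # seconds before I/O is failed back to the SCSI layer
# Existing sessions only pick these up after logging out and back in.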
CentOS 6.10 messages:
Apr 3 06:09:52 SAN01 tgtd: abort_task_set(1149) found 3e 0
Apr 3 06:09:52 SAN01 tgtd: abort_cmd(1125) found 3e 6
Apr 3 06:10:07 SAN01 tgtd: conn_close(101) connection closed, 0x2527508 5
Apr 3 06:10:07 SAN01 tgtd: conn_close(107) sesson 0x25561f0 1
Apr 3 06:10:09 SAN01 tgtd: conn_close(90) already closed 0x2527508 4
Apr 3 06:10:19 SAN01 tgtd: abort_task_set(1149) found 72 0
Apr 3 06:10:19 SAN01 tgtd: abort_cmd(1125) found 72 6
Apr 3 06:10:34 SAN01 tgtd: conn_close(101) connection closed, 0x2f94058 3
Apr 3 06:10:34 SAN01 tgtd: conn_close(107) sesson 0x2517c50 1
Apr 3 06:10:37 SAN01 tgtd: conn_close(90) already closed 0x2f94058 2
Apr 3 06:10:55 SAN01 tgtd: bs_thread_open(331) 16
<Later tests>
Apr 3 14:14:20 SAN01 tgtd: conn_close(101) connection closed, 0x252c048 2
Apr 3 14:14:20 SAN01 tgtd: conn_close(107) sesson 0x2522cc0 1
Apr 3 14:41:05 SAN01 tgtd: abort_task_set(1149) found a0000054 0
Apr 3 14:41:05 SAN01 tgtd: abort_cmd(1125) found a0000054 6
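As I read it, the abort_task_set/abort_cmd lines are tgtd receiving task-abort requests from the initiators, which lines up with the connection errors on the XenServer side, and the conn_close lines are the sessions being torn down afterwards. If it would help, I can reproduce the problem with tgtd running in the foreground with debug output (this drops all initiator connections, so only during a maintenance window):

service tgtd stop
tgtd -f -d 1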
Troubleshooting steps already performed:
Crash Log dom0.log:
[14756336.125583] INFO: vif vif-147-0 vif147.0: Guest Rx ready
[14823591.813184] INFO: device vif9.0 left promiscuous mode
[16374872.612351] INFO: connection22:0: detected conn error (1021)
[16374872.612498] INFO: connection22:0: detected conn error (1021)
[16376610.372331] INFO: connection30:0: detected conn error (1021)
[16376610.372542] INFO: connection30:0: detected conn error (1021)
<REPEATED 5-6 TIMES...>
[16410812.494831] INFO: connection30:0: detected conn error (1021)
[16410918.536797] ALERT: BUG: unable to handle kernel NULL pointer dereference at 0000000000000050
[16410918.536817] ALERT: IP: [<ffffffffa046d950>] iscsi_eh_abort+0x1a0/0x570 [libiscsi]
[16410918.536834] WARN: PGD 1d2a9067 PUD 4f7f6067 PMD 0
[16410918.536843] WARN: Oops: 0000 [#1] SMP
[16410918.536850] WARN: Modules linked in: tun nfsv3 nfs fscache iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi openvswitch nf_defrag_ipv6 libcrc32c 8021q garp mrp stp llc ipt_REJECT nf_reject_ipv4 nf_conntrack_ipv4 nf_defrag_ipv4 xt_tcpudp xt_multiport xt_conntrack nf_conntrack iptable_filter dm_multipath ipmi_devintf x86_pkg_temp_thermal coretemp crc32_pclmul aesni_intel ipmi_si aes_x86_64 sb_edac ablk_helper cryptd lrw dm_mod psmouse gf128mul ipmi_msghandler sg glue_helper edac_core hpilo shpchp lpc_ich wmi mfd_core tpm_tis tpm nls_utf8 isofs nfsd auth_rpcgss oid_registry nfs_acl lockd grace sunrpc ip_tables x_tables sd_mod ata_generic pata_acpi uhci_hcd serio_raw ata_piix libata ehci_pci ehci_hcd tg3 hpsa(O) ptp scsi_transport_sas pps_core scsi_dh_rdac scsi_dh_hp_sw scsi_dh_emc scsi_dh_alua scsi_mod
[16410918.536950] WARN: ipv6 autofs4
[16410918.536957] WARN: CPU: 10 PID: 9709 Comm: kworker/u32:4 Tainted: G W O 4.4.0+2 #1
[16410918.536966] WARN: Hardware name: HP ProLiant DL380p Gen8, BIOS P70 08/20/2012
[16410918.536982] WARN: Workqueue: scsi_tmf_32 scmd_eh_abort_handler [scsi_mod]
[16410918.536988] WARN: task: ffff88008ab69c00 ti: ffff8800859f0000 task.ti: ffff8800859f0000
[16410918.536996] WARN: RIP: e030:[<ffffffffa046d950>] [<ffffffffa046d950>] iscsi_eh_abort+0x1a0/0x570 [libiscsi]
[16410918.537008] WARN: RSP: e02b:ffff8800859f3db8 EFLAGS: 00010246
[16410918.537013] WARN: RAX: 0000000000000000 RBX: ffff8800afb25c10 RCX: 00000000ffffffff
[16410918.537021] WARN: RDX: 0000000000000001 RSI: ffff880149091818 RDI: ffff8800afb25d28
[16410918.537028] WARN: RBP: ffff8800859f3df0 R08: ffff8800b0142bb0 R09: ffff8800b0143da8
[16410918.537036] WARN: R10: 0000000000000000 R11: 000000000000000c R12: ffff8800b0142b80
[16410918.537044] WARN: R13: ffff8800afb25d28 R14: ffff8800afb25c18 R15: ffff8800afb26b20
[16410918.537058] WARN: FS: 00007f82bc393840(0000) GS:ffff880149b40000(0000) knlGS:0000000000000000
[16410918.537067] WARN: CS: e033 DS: 0000 ES: 0000 CR0: 0000000080050033
[16410918.537072] WARN: CR2: 0000000000000050 CR3: 000000005afbe000 CR4: 0000000000042660
[16410918.537080] WARN: Stack:
[16410918.537084] WARN: ffff8800b0143aa8 ffff8800b0143a80 ffff8800b0142ba8 ffff8800b0142b80
[16410918.537094] WARN: ffff8800b2cda800 0000000000000400 ffff8800b0142ba8 ffff8800859f3e18
[16410918.537104] WARN: ffffffffa006c051 ffff8801091a9680 ffff880149091800 ffff88013cd3e500
[16410918.537113] WARN: Call Trace:
[16410918.537126] WARN: [<ffffffffa006c051>] scmd_eh_abort_handler+0xb1/0x2a0 [scsi_mod]
[16410918.537138] WARN: [<ffffffff8108862c>] process_one_work+0x1fc/0x3b0
[16410918.537145] WARN: [<ffffffff81088f95>] worker_thread+0x2a5/0x470
[16410918.537153] WARN: [<ffffffff8159cab8>] ? __schedule+0x648/0x870
[16410918.537159] WARN: [<ffffffff81088cf0>] ? rescuer_thread+0x300/0x300
[16410918.537166] WARN: [<ffffffff8108ddf5>] kthread+0xd5/0xe0
[16410918.537172] WARN: [<ffffffff8108dd20>] ? kthread_stop+0x110/0x110
[16410918.537179] WARN: [<ffffffff815a0b8f>] ret_from_fork+0x3f/0x70
[16410918.537185] WARN: [<ffffffff8108dd20>] ? kthread_stop+0x110/0x110
[16410918.537190] WARN: Code: c7 c2 26 12 47 a0 48 c7 c7 d5 10 47 a0 44 89 5d c8 4c 89 55 d0 48 8d b0 48 01 00 00 31 c0 e8 d8 be fb e0 44 8b 5d c8 4c 8b 55 d0 <49> 83 7a 50 00 75 41 8b 05 23 5a 00 00 85 c0 0f 84 ea 02 00 00
[16410918.537239] ALERT: RIP [<ffffffffa046d950>] iscsi_eh_abort+0x1a0/0x570 [libiscsi]
[16410918.537256] WARN: RSP <ffff8800859f3db8>
[16410918.537263] WARN: CR2: 0000000000000050
[16410918.559378] WARN: ---[ end trace afa1626d2de164fd ]---
[16410918.559395] EMERG: Kernel panic - not syncing: Fatal exception in interrupt
(I am not really sure what other crash logs are relevant.)
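Since the panic is a NULL pointer dereference inside iscsi_eh_abort (libiscsi), it looks to me like the SCSI error-handler abort path, the same path the conn errors keep exercising, raced with a session teardown. In case it is useful, these standard commands document the exact dom0 kernel and module build:

uname -r                  # running dom0 kernel
rpm -q kernel             # installed kernel package
modinfo libiscsi          # path and metadata for the crashing module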