Load-based iSCSI issue after XenServer 7.1 crash


Question

We have two XenServer 7 hosts in a pool with a single CentOS 6.10 iSCSI storage repository. Neither the XenServer hosts nor the storage server has been updated for quite some time.

 

Yesterday XenServer host 1 crashed. I uploaded the crash dump to Citrix Insight Services and the analysis engine was ‘unable to generate any reports’.

 

Ever since the crash, once I have about 30 VMs running from iSCSI shared storage, I am unable to start any more and begin to run into issues. (This is nowhere near the number of VMs we had running before the crash.) The problem does not appear to be tied to any specific VM, XenServer host, or iSCSI LUN. Logs on both the XenServer hosts and the storage server seem to indicate iSCSI connection issues. My research indicates that these types of issues fall into three categories:

  1. Networking issues (switch/NICs/cabling)
  2. Bugs (with iSCSI daemon)
  3. Load issues

 

However, I have done all the troubleshooting I can to rule out (1); I don’t see why (2) would suddenly become a problem, and I am fully patched; and as for (3), this starts happening well before I reach the number of VMs that had been running for months before the crash.

 

For the moment I am running as many of the VMs as I can from local disks and a small number from the shared storage.

 

I am kind of out of ideas; any help is very much appreciated.


 

XenServer 7.1.0 Hosts:

2017-06-19 / 137272c

Version: 7.1

XS71E001 (version 1.0)

XS71E002 (version 1.0)

XS71E003 (version 1.0)

XS71E004 (version 1.0)

XS71E005 (version 1.0)

XS71E006 (version 1.0)

XS71E007 (version 1.0)

XS71E008 (version 1.0)

XS71E009 (version 1.0)

XS71E010 (version 1.0)

XS71E011 (version 1.0)

XS71E012 (version 1.0)

XS71E013 (version 1.0)

XS71E014 (version 1.0)

XS71E015 (version 1.0)

XS71E016 (version 1.0)

XS71E017 (version 1.0)

XS71E018 (version 1.0)

XS71E019 (version 1.0)

 

CentOS 6.10 Storage Repository:

Name        : scsi-target-utils

Arch        : x86_64

Version     : 1.0.24

Release     : 18.el6


 

XenServer dmesg:

[24713.444531]  connection28:0: detected conn error (1021)

[24713.444659]  connection28:0: detected conn error (1021)

[24716.484355]  connection14:0: detected conn error (1021)

[24716.484519]  connection14:0: detected conn error (1021)

<Later tests>

[53334.062050]  connection11:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4308214536, last ping 4308215788, now 4308217040

[53334.062102]  connection11:0: detected conn error (1022)
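
The ping-timeout line above can be read as simple arithmetic on the three jiffies counters it prints. This is a minimal sketch, assuming a kernel timer frequency (HZ) of 250; the log does not state HZ, and 250 is chosen only because it makes the gaps come out close to the 5-second ping and recv timeouts that the message itself reports.

```python
# Assumption: HZ = 250 ticks per second (not stated anywhere in the log).
HZ = 250

def jiffies_to_seconds(delta, hz=HZ):
    """Convert a difference between two jiffies counters to seconds."""
    return delta / hz

# Values copied from the "ping timeout of 5 secs expired" dmesg line.
last_rx = 4308214536
last_ping = 4308215788
now = 4308217040

idle_before_ping = jiffies_to_seconds(last_ping - last_rx)  # quiet time before the ping was sent
wait_after_ping = jiffies_to_seconds(now - last_ping)       # time spent waiting for a reply
silence_total = jiffies_to_seconds(now - last_rx)           # total time with nothing received

print(f"idle before ping: {idle_before_ping:.3f} s")
print(f"wait after ping:  {wait_after_ping:.3f} s")
print(f"total silence:    {silence_total:.3f} s")
```

Under that assumption, the connection received nothing for roughly ten seconds in total: about five seconds of silence before the initiator sent its ping, then about five more with no answer.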



 

CentOS 6.10 messages:

Apr  3 06:09:52 SAN01 tgtd: abort_task_set(1149) found 3e 0

Apr  3 06:09:52 SAN01 tgtd: abort_cmd(1125) found 3e 6

Apr  3 06:10:07 SAN01 tgtd: conn_close(101) connection closed, 0x2527508 5

Apr  3 06:10:07 SAN01 tgtd: conn_close(107) sesson 0x25561f0 1

Apr  3 06:10:09 SAN01 tgtd: conn_close(90) already closed 0x2527508 4

Apr  3 06:10:19 SAN01 tgtd: abort_task_set(1149) found 72 0

Apr  3 06:10:19 SAN01 tgtd: abort_cmd(1125) found 72 6

Apr  3 06:10:34 SAN01 tgtd: conn_close(101) connection closed, 0x2f94058 3

Apr  3 06:10:34 SAN01 tgtd: conn_close(107) sesson 0x2517c50 1

Apr  3 06:10:37 SAN01 tgtd: conn_close(90) already closed 0x2f94058 2

Apr  3 06:10:55 SAN01 tgtd: bs_thread_open(331) 16

<Later tests>

Apr  3 14:14:20 SAN01 tgtd: conn_close(101) connection closed, 0x252c048 2

Apr  3 14:14:20 SAN01 tgtd: conn_close(107) sesson 0x2522cc0 1

Apr  3 14:41:05 SAN01 tgtd: abort_task_set(1149) found a0000054 0

Apr  3 14:41:05 SAN01 tgtd: abort_cmd(1125) found a0000054 6



 

Troubleshooting steps already performed:

  • Rebooted everything.
  • Confirmed scsi-target-utils is already the latest version for CentOS 6.10.
  • Checked for Rx/Tx errors on the switch ports and NIC ports: 0
  • Used different Ethernet ports on both XenServer hosts and the iSCSI host.
  • Replaced the network switch.
  • Used different Ethernet cables.
  • Checked RAID condition on the XenServer hosts and iSCSI host: Clean
  • Checked IPMI logs on the XenServer hosts and iSCSI host: OK

 

Crash Log dom0.log:

[14756336.125583]   INFO: vif vif-147-0 vif147.0: Guest Rx ready

[14823591.813184]   INFO: device vif9.0 left promiscuous mode

[16374872.612351]   INFO: connection22:0: detected conn error (1021)

[16374872.612498]   INFO: connection22:0: detected conn error (1021)

[16376610.372331]   INFO: connection30:0: detected conn error (1021)

[16376610.372542]   INFO: connection30:0: detected conn error (1021)

<REPEATED 5-6 TIMES...>

[16410812.494831]   INFO: connection30:0: detected conn error (1021)

[16410918.536797]  ALERT: BUG: unable to handle kernel NULL pointer dereference at 0000000000000050

[16410918.536817]  ALERT: IP: [<ffffffffa046d950>] iscsi_eh_abort+0x1a0/0x570 [libiscsi]

[16410918.536834]   WARN: PGD 1d2a9067 PUD 4f7f6067 PMD 0

[16410918.536843]   WARN: Oops: 0000 [#1] SMP

[16410918.536850]   WARN: Modules linked in: tun nfsv3 nfs fscache iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi openvswitch nf_defrag_ipv6 libcrc32c 8021q garp mrp stp llc ipt_REJECT nf_reject_ipv4 nf_conntrack_ipv4 nf_defrag_ipv4 xt_tcpudp xt_multiport xt_conntrack nf_conntrack iptable_filter dm_multipath ipmi_devintf x86_pkg_temp_thermal coretemp crc32_pclmul aesni_intel ipmi_si aes_x86_64 sb_edac ablk_helper cryptd lrw dm_mod psmouse gf128mul ipmi_msghandler sg glue_helper edac_core hpilo shpchp lpc_ich wmi mfd_core tpm_tis tpm nls_utf8 isofs nfsd auth_rpcgss oid_registry nfs_acl lockd grace sunrpc ip_tables x_tables sd_mod ata_generic pata_acpi uhci_hcd serio_raw ata_piix libata ehci_pci ehci_hcd tg3 hpsa(O) ptp scsi_transport_sas pps_core scsi_dh_rdac scsi_dh_hp_sw scsi_dh_emc scsi_dh_alua scsi_mod

[16410918.536950]   WARN: ipv6 autofs4

[16410918.536957]   WARN: CPU: 10 PID: 9709 Comm: kworker/u32:4 Tainted: G        W O 4.4.0+2 #1

[16410918.536966]   WARN: Hardware name: HP ProLiant DL380p Gen8, BIOS P70 08/20/2012

[16410918.536982]   WARN: Workqueue: scsi_tmf_32 scmd_eh_abort_handler [scsi_mod]

[16410918.536988]   WARN: task: ffff88008ab69c00 ti: ffff8800859f0000 task.ti: ffff8800859f0000

[16410918.536996]   WARN: RIP: e030:[<ffffffffa046d950>]  [<ffffffffa046d950>] iscsi_eh_abort+0x1a0/0x570 [libiscsi]

[16410918.537008]   WARN: RSP: e02b:ffff8800859f3db8  EFLAGS: 00010246

[16410918.537013]   WARN: RAX: 0000000000000000 RBX: ffff8800afb25c10 RCX: 00000000ffffffff

[16410918.537021]   WARN: RDX: 0000000000000001 RSI: ffff880149091818 RDI: ffff8800afb25d28

[16410918.537028]   WARN: RBP: ffff8800859f3df0 R08: ffff8800b0142bb0 R09: ffff8800b0143da8

[16410918.537036]   WARN: R10: 0000000000000000 R11: 000000000000000c R12: ffff8800b0142b80

[16410918.537044]   WARN: R13: ffff8800afb25d28 R14: ffff8800afb25c18 R15: ffff8800afb26b20

[16410918.537058]   WARN: FS: 00007f82bc393840(0000) GS:ffff880149b40000(0000) knlGS:0000000000000000

[16410918.537067]   WARN: CS: e033 DS: 0000 ES: 0000 CR0: 0000000080050033

[16410918.537072]   WARN: CR2: 0000000000000050 CR3: 000000005afbe000 CR4: 0000000000042660

[16410918.537080]   WARN: Stack:

[16410918.537084]   WARN: ffff8800b0143aa8 ffff8800b0143a80 ffff8800b0142ba8 ffff8800b0142b80

[16410918.537094]   WARN: ffff8800b2cda800 0000000000000400 ffff8800b0142ba8 ffff8800859f3e18

[16410918.537104]   WARN: ffffffffa006c051 ffff8801091a9680 ffff880149091800 ffff88013cd3e500

[16410918.537113]   WARN: Call Trace:

[16410918.537126]   WARN: [<ffffffffa006c051>] scmd_eh_abort_handler+0xb1/0x2a0 [scsi_mod]

[16410918.537138]   WARN: [<ffffffff8108862c>] process_one_work+0x1fc/0x3b0

[16410918.537145]   WARN: [<ffffffff81088f95>] worker_thread+0x2a5/0x470

[16410918.537153]   WARN: [<ffffffff8159cab8>] ? __schedule+0x648/0x870

[16410918.537159]   WARN: [<ffffffff81088cf0>] ? rescuer_thread+0x300/0x300

[16410918.537166]   WARN: [<ffffffff8108ddf5>] kthread+0xd5/0xe0

[16410918.537172]   WARN: [<ffffffff8108dd20>] ? kthread_stop+0x110/0x110

[16410918.537179]   WARN: [<ffffffff815a0b8f>] ret_from_fork+0x3f/0x70

[16410918.537185]   WARN: [<ffffffff8108dd20>] ? kthread_stop+0x110/0x110

[16410918.537190]   WARN: Code: c7 c2 26 12 47 a0 48 c7 c7 d5 10 47 a0 44 89 5d c8 4c 89 55 d0 48 8d b0 48 01 00 00 31 c0 e8 d8 be fb e0 44 8b 5d c8 4c 8b 55 d0 <49> 83 7a 50 00 75 41 8b 05 23 5a 00 00 85 c0 0f 84 ea 02 00 00

[16410918.537239]  ALERT: RIP [<ffffffffa046d950>] iscsi_eh_abort+0x1a0/0x570 [libiscsi]

[16410918.537256]   WARN: RSP <ffff8800859f3db8>

[16410918.537263]   WARN: CR2: 0000000000000050

[16410918.559378]   WARN: ---[ end trace afa1626d2de164fd ]---

[16410918.559395]  EMERG: Kernel panic - not syncing: Fatal exception in interrupt

 

(I am not really sure what other crash logs are relevant.)

Link to comment

23 answers to this question


  • 0
14 minutes ago, Mark Syms said:

Out of support means you don't get any support.

 

The iSCSI 1021 connection errors are reporting that the target is rejecting the connection.

I am not trying to be difficult, but I understand I won't get any official Citrix support because I don't have a support contract, regardless of whether the version is in support. However, in the past I have found users on this forum willing to help me regardless.

 

Thank you. It would be nice to know whether this crash is fixed in 7.1 CU2; however, at this time it is not clear, at least to me, whether the crash caused this problem or was the result of another issue that is still ongoing.

 

I could not find what 1021 means, only 1020, which pointed me in the direction of network issues. Do you think that the 1021 'target is rejecting the connection' is purely a storage repository issue, or could the XenServer hosts also be causing the SR to react that way?
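
For anyone else hunting for these numbers: the "detected conn error" codes appear to come from the Linux kernel's `enum iscsi_err` (in `include/scsi/iscsi_if.h`), which counts up from a base of 1000. The sketch below tallies the codes per connection from dmesg text. The three names in the table are recalled from that header rather than taken from this thread, so treat them as an assumption and check them against the source of the kernel you are running. If the recollection is right, 1021 is a session reset triggered by SCSI error handling on the initiator, and 1022 is a NOP (ping) timeout, which would fit the "ping timeout" line logged just before the 1022 error.

```python
import re
from collections import Counter

# Assumption: names recalled from the kernel's `enum iscsi_err`
# (base 1000); verify against include/scsi/iscsi_if.h for your kernel.
ISCSI_ERR_NAMES = {
    1020: "ISCSI_ERR_TCP_CONN_CLOSE",
    1021: "ISCSI_ERR_SCSI_EH_SESSION_RST",
    1022: "ISCSI_ERR_NOP_TIMEDOUT",
}

# Matches lines such as: "connection28:0: detected conn error (1021)"
CONN_ERR = re.compile(r"connection(\d+):\d+: detected conn error \((\d+)\)")

def tally_conn_errors(dmesg_text):
    """Count errors per (connection id, error name) pair."""
    counts = Counter()
    for match in CONN_ERR.finditer(dmesg_text):
        conn_id = int(match.group(1))
        code = int(match.group(2))
        name = ISCSI_ERR_NAMES.get(code, f"unknown ({code})")
        counts[(conn_id, name)] += 1
    return counts

# Lines copied from the XenServer dmesg output in the question.
sample = """\
[24713.444531]  connection28:0: detected conn error (1021)
[24713.444659]  connection28:0: detected conn error (1021)
[24716.484355]  connection14:0: detected conn error (1021)
[24716.484519]  connection14:0: detected conn error (1021)
[53334.062102]  connection11:0: detected conn error (1022)
"""

counts = tally_conn_errors(sample)
for (conn_id, name), n in sorted(counts.items()):
    print(f"connection{conn_id}: {name} x{n}")
```

To run it against a live host, pass in the captured output of `dmesg` instead of `sample`.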

 

 

Link to comment
  • 0
8 minutes ago, Mark Syms said:

And it's quite likely that the iSCSI target is rejecting the connection due to receiving invalid iSCSI requests due to bugs in the 7.1.0 kernel several of which are fixed in 7.1 CU2 and 7.6.

Why might this only be an issue after a crash? I would assume that the bug would only appear periodically, such that I could resume normal operation for a while until the next crash.

Link to comment
  • 0
Just now, Mark Syms said:

I would guess that it occurred due to some condition relating to the storage and that condition holds true still.

LOL, fair. From my standpoint, this occurred overnight, so it was probably not user-induced. Thank you for your help. Maybe just one more thing.

 

I think I may have also misunderstood the new licensing model, which is why I was thrown off by your support comment. When I upgraded to 7, I installed the 7.1 LTSR version, which at the time was "free". Since then there have been changes. Are any of the LTSR versions free, or do I have to upgrade to the latest 7.x version (7.6)?

Link to comment
  • 0
1 minute ago, Alan Lantz said:

Do you do Jumbo frames or multipath or even bonding for storage? You might want to eliminate any of those types of technologies as well. It does sound like you are having issues with XenServer reaching storage.

 

--Alan--

 

Good suggestions. No jumbo frames, no multipath. We are using dt-lacp across a pair of switches, but I have already tested with a new single switch without bonding and the issue still occurs. (Keep in mind no changes had been made in months and this had been working quite well.)

 

I think a XenServer 7.1 bug seems more likely at this point.

 

Link to comment
  • 0
Just now, Alan Lantz said:

Could very well be. From the SAN perspective, if performance doesn't seem to be an issue, then looking at the host version is sound and reasonable.

 

--Alan--

 

 

Before the crash I had been running ~75 VMs from the shared storage, but now I run into issues at around ~30 VMs. (There has been no real change in the workloads (IOPS/throughput) of those VMs either.)

Link to comment
  • 0

LTSR 7.1 is the latest such release and probably will be for some time still. 8.0 will be a CR release.

 

Also, from a load perspective, 75 VMs is quite a lot. You may need 8 or more GB of dom0 RAM to handle that load well. Run top and xentop to monitor your load and memory/swap usage and adjust accordingly.

 

-=Tobias
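
As a complement to watching top and xentop by hand, swap usage in dom0 can also be read from `/proc/meminfo`. This is a minimal sketch: the sample figures are hypothetical and not taken from the hosts in this thread, and the function names are illustrative only.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer kB values."""
    values = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values

def memory_pressure(info):
    """Summarise swap in use and the fraction of RAM still available."""
    # MemAvailable is absent on older kernels; fall back to MemFree there.
    available = info.get("MemAvailable", info["MemFree"])
    return {
        "swap_used_kb": info["SwapTotal"] - info["SwapFree"],
        "available_fraction": available / info["MemTotal"],
    }

# Hypothetical sample values, for illustration only.
sample = """\
MemTotal:        4194304 kB
MemFree:          204800 kB
MemAvailable:     524288 kB
SwapTotal:       1048576 kB
SwapFree:         786432 kB
"""

report = memory_pressure(parse_meminfo(sample))
print(report)
```

On a real host, replace `sample` with the contents of `/proc/meminfo`, for example `open("/proc/meminfo").read()`. Swap usage that keeps growing, or an available fraction that stays very low, would support raising the dom0 memory allocation.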

Link to comment
  • 0

When you try to start a VM and have issues, what exactly is the issue? Doesn't start? Starts slowly? I would look at SMlog to see if it could shed some light with regard to storage. Do both hosts exhibit the VM issue? And yes, 75 VMs is a pretty good load. How often do you restart the servers? It could very well just be memory leaks and such catching up with you if you don't regularly reboot for hotfixes.

 

--Alan--

 

Link to comment
  • 0
11 minutes ago, Tobias Kreidl said:

LTSR 7.1 is the latest such release and probably will be for some time still. 8.0 will be a CR release.

 

Also, from a load perspective, 75 VMs is quite a lot. You may need 8 or more GB of dom0 RAM to handle that load well. Run top and xentop to monitor your load and memory/swap usage and adjust accordingly.

 

-=Tobias

Hey Tobias, you are always here to help with my XS issues :)

 

We had been running more than that earlier in the year without issue and have actually reduced the number recently. Many of them are super lightweight single-purpose VMs (DHCP, DNS, TFTP, etc.). I will keep that tip in mind, though, as any performance increase is great. (Unless you think it could be causing my current issue.)

Link to comment
  • 0

I have upgraded to 7.6 CR and am experiencing the same issue:

 

Host2:

[ 1690.895805] ------------[ cut here ]------------
[ 1690.895819] WARNING: CPU: 0 PID: 0 at kernel/softirq.c:150 __local_bh_enable_ip+0x39/0x90()
[ 1690.895822] Modules linked in: tun nfsv3 nfs fscache iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi 8021q garp mrp stp llc openvswitch nf_defrag_ipv6 libcrc32c ipt_REJECT nf_reject_ipv4 nf_conntrack_ipv4 nf_defrag_ipv4 xt_tcpudp xt_multiport xt_conntrack nf_conntrack iptable_filter dm_multipath dm_mod ipmi_devintf x86_pkg_temp_thermal coretemp crc32_pclmul aesni_intel aes_x86_64 ablk_helper cryptd lrw gf128mul glue_helper sg psmouse ipmi_si sb_edac hpilo edac_core lpc_ich mfd_core ipmi_msghandler wmi tpm_tis tpm shpchp nls_utf8 isofs xen_wdt nfsd auth_rpcgss oid_registry nfs_acl lockd grace sunrpc ip_tables x_tables autofs4 ata_generic pata_acpi hid_generic usbhid hid sd_mod uhci_hcd serio_raw ata_piix libata ehci_pci ehci_hcd hpsa(O) scsi_transport_sas tg3 ptp pps_core scsi_dh_rdac scsi_dh_hp_sw
[ 1690.895901]  scsi_dh_emc scsi_dh_alua scsi_mod ipv6
[ 1690.895909] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G           O    4.4.0+10 #1
[ 1690.895911] Hardware name: HP ProLiant DL380p Gen8, BIOS P70 09/18/2013
[ 1690.895914]  0000000000000000 ffff88024da03cd8 ffffffff8131c8a3 0000000000000000
[ 1690.895918]  ffffffff8180a255 ffff88024da03d10 ffffffff810722ae 0000000000000200
[ 1690.895923]  ffff88024ab7b558 ffff88024c827000 ffff88024ab78410 ffff88022ee398a8
[ 1690.895927] Call Trace:
[ 1690.895929]  <IRQ>  [<ffffffff8131c8a3>] dump_stack+0x63/0x90
[ 1690.895941]  [<ffffffff810722ae>] warn_slowpath_common+0x9e/0xc0
[ 1690.895945]  [<ffffffff8107238a>] warn_slowpath_null+0x1a/0x20
[ 1690.895949]  [<ffffffff81076969>] __local_bh_enable_ip+0x39/0x90
[ 1690.895955]  [<ffffffff815a52ee>] _raw_spin_unlock_bh+0x1e/0x20
[ 1690.895964]  [<ffffffffa047e8fb>] __iscsi_conn_send_pdu+0x2bb/0x360 [libiscsi]
[ 1690.895970]  [<ffffffffa047eaef>] iscsi_send_nopout+0xdf/0x100 [libiscsi]
[ 1690.895975]  [<ffffffffa047ed25>] iscsi_eh_cmd_timed_out+0x215/0x2a0 [libiscsi]
[ 1690.895991]  [<ffffffffa005bd52>] scsi_times_out+0x92/0x250 [scsi_mod]
[ 1690.895995]  [<ffffffff812fa95f>] blk_rq_timed_out+0x1f/0x60
[ 1690.895999]  [<ffffffff812faa3a>] blk_rq_timed_out_timer+0x9a/0x120
[ 1690.896003]  [<ffffffff812fa9a0>] ? blk_rq_timed_out+0x60/0x60
[ 1690.896008]  [<ffffffff810cf0ef>] call_timer_fn+0x5f/0x140
[ 1690.896011]  [<ffffffff812fa9a0>] ? blk_rq_timed_out+0x60/0x60
[ 1690.896015]  [<ffffffff810d0850>] run_timer_softirq+0x220/0x2a0
[ 1690.896019]  [<ffffffff81076729>] __do_softirq+0x129/0x290
[ 1690.896024]  [<ffffffff81076a62>] irq_exit+0x42/0x90
[ 1690.896028]  [<ffffffff813c8be5>] xen_evtchn_do_upcall+0x35/0x50
[ 1690.896032]  [<ffffffff815a752e>] xen_do_hypervisor_callback+0x1e/0x40
[ 1690.896034]  <EOI>  [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20
[ 1690.896041]  [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20
[ 1690.896045]  [<ffffffff8100c570>] ? xen_safe_halt+0x10/0x20
[ 1690.896049]  [<ffffffff81020d67>] ? default_idle+0x57/0xf0
[ 1690.896052]  [<ffffffff8102149f>] ? arch_cpu_idle+0xf/0x20
[ 1690.896057]  [<ffffffff810ab352>] ? default_idle_call+0x32/0x40
[ 1690.896060]  [<ffffffff810ab5ac>] ? cpu_startup_entry+0x1ec/0x330
[ 1690.896064]  [<ffffffff8159a9a7>] ? rest_init+0x77/0x80
[ 1690.896069]  [<ffffffff81b060c2>] ? start_kernel+0x45b/0x468
[ 1690.896072]  [<ffffffff81b05996>] ? set_init_arg+0x57/0x57
[ 1690.896075]  [<ffffffff81b055ee>] ? x86_64_start_reservations+0x2a/0x2c
[ 1690.896079]  [<ffffffff81b090ee>] ? xen_start_kernel+0x4cf/0x4db
[ 1690.896083] ---[ end trace e38b011bffda82ee ]---

 

Host1:
[ 1804.844511] ------------[ cut here ]------------
[ 1804.844528] WARNING: CPU: 9 PID: 0 at kernel/softirq.c:150 __local_bh_enable_ip+0x39/0x90()
[ 1804.844530] Modules linked in: tun hid_generic usbhid hid nfsv3 nfs fscache iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi 8021q garp mrp stp llc openvswitch nf_defrag_ipv6 libcrc32c ipt_REJECT nf_reject_ipv4 dm_multipath nf_conntrack_ipv4 nf_defrag_ipv4 xt_tcpudp xt_multiport xt_conntrack nf_conntrack iptable_filter x86_pkg_temp_thermal coretemp crc32_pclmul ipmi_devintf aesni_intel aes_x86_64 ablk_helper dm_mod cryptd lrw gf128mul glue_helper hpilo sg shpchp ipmi_si psmouse lpc_ich sb_edac edac_core mfd_core wmi ipmi_msghandler tpm_tis tpm nls_utf8 isofs nfsd auth_rpcgss xen_wdt oid_registry nfs_acl lockd grace sunrpc ip_tables x_tables autofs4 sd_mod ata_generic pata_acpi uhci_hcd serio_raw ata_piix libata ehci_pci ehci_hcd hpsa(O) tg3 scsi_transport_sas ptp pps_core scsi_dh_rdac scsi_dh_hp_sw
[ 1804.844598]  scsi_dh_emc scsi_dh_alua scsi_mod ipv6
[ 1804.844604] CPU: 9 PID: 0 Comm: swapper/9 Tainted: G           O    4.4.0+10 #1
[ 1804.844606] Hardware name: HP ProLiant DL380p Gen8, BIOS P70 08/20/2012
[ 1804.844608]  0000000000000000 ffff88024db23cd8 ffffffff8131c8a3 0000000000000000
[ 1804.844611]  ffffffff8180a255 ffff88024db23d10 ffffffff810722ae 0000000000000200
[ 1804.844613]  ffff880249d2cd58 ffff88024935c000 ffff880249d2e410 ffff880246310fc0
[ 1804.844616] Call Trace:
[ 1804.844618]  <IRQ>  [<ffffffff8131c8a3>] dump_stack+0x63/0x90
[ 1804.844629]  [<ffffffff810722ae>] warn_slowpath_common+0x9e/0xc0
[ 1804.844632]  [<ffffffff8107238a>] warn_slowpath_null+0x1a/0x20
[ 1804.844634]  [<ffffffff81076969>] __local_bh_enable_ip+0x39/0x90
[ 1804.844639]  [<ffffffff815a52ee>] _raw_spin_unlock_bh+0x1e/0x20
[ 1804.844647]  [<ffffffffa044d8fb>] __iscsi_conn_send_pdu+0x2bb/0x360 [libiscsi]
[ 1804.844650]  [<ffffffffa044daef>] iscsi_send_nopout+0xdf/0x100 [libiscsi]
[ 1804.844654]  [<ffffffffa044dd25>] iscsi_eh_cmd_timed_out+0x215/0x2a0 [libiscsi]
[ 1804.844666]  [<ffffffffa005bd52>] scsi_times_out+0x92/0x250 [scsi_mod]
[ 1804.844671]  [<ffffffff812fa95f>] blk_rq_timed_out+0x1f/0x60
[ 1804.844673]  [<ffffffff812faa3a>] blk_rq_timed_out_timer+0x9a/0x120
[ 1804.844676]  [<ffffffff812fa9a0>] ? blk_rq_timed_out+0x60/0x60
[ 1804.844682]  [<ffffffff810cf0ef>] call_timer_fn+0x5f/0x140
[ 1804.844684]  [<ffffffff812fa9a0>] ? blk_rq_timed_out+0x60/0x60
[ 1804.844687]  [<ffffffff810d0850>] run_timer_softirq+0x220/0x2a0
[ 1804.844690]  [<ffffffff81076729>] __do_softirq+0x129/0x290
[ 1804.844693]  [<ffffffff81076a62>] irq_exit+0x42/0x90
[ 1804.844697]  [<ffffffff813c8be5>] xen_evtchn_do_upcall+0x35/0x50
[ 1804.844700]  [<ffffffff815a752e>] xen_do_hypervisor_callback+0x1e/0x40
[ 1804.844701]  <EOI>  [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20
[ 1804.844706]  [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20
[ 1804.844709]  [<ffffffff8100c570>] ? xen_safe_halt+0x10/0x20
[ 1804.844712]  [<ffffffff81020d67>] ? default_idle+0x57/0xf0
[ 1804.844714]  [<ffffffff8102149f>] ? arch_cpu_idle+0xf/0x20
[ 1804.844718]  [<ffffffff810ab352>] ? default_idle_call+0x32/0x40
[ 1804.844720]  [<ffffffff810ab5ac>] ? cpu_startup_entry+0x1ec/0x330
[ 1804.844724]  [<ffffffff81013dd8>] ? cpu_bringup_and_idle+0x18/0x20
[ 1804.844726] ---[ end trace 30a15e71a5031c9c ]---
[ 1850.892620]  connection2:0: detected conn error (1021)
[ 1850.892903]  connection2:0: detected conn error (1021)
[ 1850.924571]  connection5:0: detected conn error (1021)
[ 1850.924997]  connection5:0: detected conn error (1021)
[ 1983.883040] block tdaf: sector-size: 512/512 capacity: 83886080
[ 1984.824407] device vif29.0 entered promiscuous mode
[ 1985.347691] device tap29.0 entered promiscuous mode
[ 2001.571005] device tap29.0 left promiscuous mode
[ 2008.694640] vif vif-29-0 vif29.0: Guest Rx ready
[ 2056.072012] block tdag: sector-size: 512/512 capacity: 104857600
[ 2191.917067]  connection7:0: detected conn error (1021)
[ 2207.321267] INFO: task vhd-util:12214 blocked for more than 120 seconds.
[ 2207.321290]       Tainted: G        W  O    4.4.0+10 #1
[ 2207.321301] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2207.321320] vhd-util        D ffff88021af47d28     0 12214  12138 0x00000000
[ 2207.321328]  ffff88021af47d28 ffff88024ca3b800 ffff88009894d400 ffff88021af48000
[ 2207.321333]  ffff88021af47e48 ffff88021af47e40 ffff88009894d400 0000000000000228
[ 2207.321337]  ffff88021af47d40 ffffffff815a1cd4 7fffffffffffffff ffff88021af47dd8
[ 2207.321342] Call Trace:
[ 2207.321355]  [<ffffffff815a1cd4>] schedule+0x64/0x80
[ 2207.321361]  [<ffffffff815a44a2>] schedule_timeout+0x32/0x280
[ 2207.321367]  [<ffffffff812f49cb>] ? blk_queue_bio+0x2fb/0x360
[ 2207.321373]  [<ffffffff810b1cb1>] ? __raw_callee_save___pv_queued_spin_unlock+0x11/0x20
[ 2207.321378]  [<ffffffff815a2e7d>] wait_for_completion+0x9d/0x110
[ 2207.321384]  [<ffffffff81097f50>] ? wake_up_q+0x80/0x80
[ 2207.321389]  [<ffffffff812e9cfe>] submit_bio_wait+0x5e/0x70
[ 2207.321394]  [<ffffffff812f670a>] blkdev_issue_flush+0x5a/0x90
[ 2207.321400]  [<ffffffff811ebfc5>] blkdev_fsync+0x35/0x50
[ 2207.321404]  [<ffffffff811e5ad1>] vfs_fsync_range+0x91/0xb0
[ 2207.321411]  [<ffffffff81060142>] ? __do_page_fault+0x302/0x420
[ 2207.321414]  [<ffffffff811e5b4d>] do_fsync+0x3d/0x70
[ 2207.321417]  [<ffffffff811e5d90>] SyS_fsync+0x10/0x20
[ 2207.321420]  [<ffffffff815a57ee>] entry_SYSCALL_64_fastpath+0x12/0x71
[ 2207.321429] INFO: task tapdisk:14069 blocked for more than 120 seconds.
[ 2207.321444]       Tainted: G        W  O    4.4.0+10 #1
[ 2207.321454] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2207.321468] tapdisk         D ffff880213e5bd28     0 14069      1 0x00000000
[ 2207.321473]  ffff880213e5bd28 ffff88024ca55400 ffff880096a09c00 ffff880213e5c000
[ 2207.321477]  ffff880213e5be48 ffff880213e5be40 ffff880096a09c00 00000000024f2d30
[ 2207.321480]  ffff880213e5bd40 ffffffff815a1cd4 7fffffffffffffff ffff880213e5bdd8
[ 2207.321485] Call Trace:
[ 2207.321490]  [<ffffffff815a1cd4>] schedule+0x64/0x80
[ 2207.321494]  [<ffffffff815a44a2>] schedule_timeout+0x32/0x280
[ 2207.321497]  [<ffffffff812f49cb>] ? blk_queue_bio+0x2fb/0x360
[ 2207.321501]  [<ffffffff810b1cb1>] ? __raw_callee_save___pv_queued_spin_unlock+0x11/0x20
[ 2207.321504]  [<ffffffff815a2e7d>] wait_for_completion+0x9d/0x110
[ 2207.321508]  [<ffffffff81097f50>] ? wake_up_q+0x80/0x80
[ 2207.321511]  [<ffffffff812e9cfe>] submit_bio_wait+0x5e/0x70
[ 2207.321515]  [<ffffffff812f670a>] blkdev_issue_flush+0x5a/0x90
[ 2207.321519]  [<ffffffff811ebfc5>] blkdev_fsync+0x35/0x50
[ 2207.321522]  [<ffffffff811e5ad1>] vfs_fsync_range+0x91/0xb0
[ 2207.321525]  [<ffffffff811e5b4d>] do_fsync+0x3d/0x70
[ 2207.321529]  [<ffffffff811e5d90>] SyS_fsync+0x10/0x20
[ 2207.321532]  [<ffffffff815a57ee>] entry_SYSCALL_64_fastpath+0x12/0x71

Link to comment
  • 0

The iSCSI timeout is disconcerting. Did you try to do a full iscsiadm discovery process from each host? Also, your iSCSI is on a totally separate physical network from all other networks, correct? And NTP is properly set up and working on all hosts and all hosts are close to each other in the time offset (check with "ntpstat -s")?

 

-=Tobias

Link to comment
  • 0

That's very good to know. Was the array management software not giving you any pre-failure alerts? Doesn't seem like anything that the hypervisor would have any visibility of other than the performance being sucky (I guess the memory failures were causing the firmware on the RAID controller to do recovery operations, or even crash and restart).

Link to comment
