
Server stopped working: SM has thrown a generic python exception/SR_BACKEND_FAILURE_1200


Peter VARGA

Question

Hello,

 

My XS 7.6 host with all updates applied [the latest is XS76E007] suddenly stopped working. I noticed it because of a blue screen in the console of the SBS VM.

Restarting the toolstack didn't help. When I tried to restart the SBS VM, I got these error messages:

  • SR_BACKEND_FAILURE_1200: shown directly in the xsconsole running on the host, which I was accessing remotely
  • SM has thrown a generic python exception: shown in XenCenter

 

Every operation on the remaining/running VMs [for example restart or shutdown] ended with the above messages.

 

In this situation there was only one solution: restarting the XenServer host. That helped; all VMs are now running normally.

 

What I can see in a log file: it stopped working while NAUbackup was running.

To be explicit: I am NOT blaming NAUbackup for anything! I am only trying to provide as much information as is available to me, which may help to track down the problem.

I am convinced it was a different problem that coincidentally happened while NAUbackup was running.

 

I have run my XS for 6 years, starting with XS 6.2, and applied all updates up to 7.6 myself. What happened this morning scared me to death, and I can't really find any answer on the internet.

 

Which log files do you need? I read in another thread that /var/log/SMlog would help. Should I post it here? I see a lot of storage-related error messages there.

 

Thank you


12 answers to this question


Is all or part of a snapshot of the VM still present? How about space on the SR -- is there still enough of it? A generic python error could be almost anything, alas. Does the backup work for other VMs? Did you run a syntax check (using the "preview" option) with the backup script you're using?
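
For example, something like this (a minimal sketch using the xe CLI; filter the output for your SR and VM as needed):

  # Check SR capacity vs. usage (both values are in bytes):
  xe sr-list params=uuid,name-label,physical-size,physical-utilisation
  # Look for leftover or stale snapshots:
  xe snapshot-list params=uuid,name-label,snapshot-time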

 

-=Tobias

2 hours ago, Tobias Kreidl said:

Is all or part of a snapshot of the VM still present? How about space on the SR -- is there still enough of it? A generic python error could be almost anything, alas. Does the backup work for other VMs? Did you run a syntax check (using the "preview" option) with the backup script you're using?

 

-=Tobias

 

All VMs are in the XS snapshot schedule. 20% of the storage is available, in my case 500 GB, and this value has been more or less the same for a year. I don't suspect the backup script, Tobias -- it is yours, after all. As I wrote, I think it is just a coincidence.

3 hours ago, Alan Lantz said:

A portion of SMlog showing communication/disk errors may be helpful. But since a server restart cleared it out, it was likely either some sort of soft error or trouble communicating with storage. Do you use shared storage?

 

--Alan--

 

 

I don't know exactly what shared storage is supposed to be here, but the Local Storage is used only by the one XS host on which it resides.
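
If it helps, the SR type can presumably be checked from dom0 like this (a sketch; the "shared" field answers Alan's question):

  # Local LVM SRs show type "lvm" and shared "false":
  xe sr-list params=name-label,type,shared,host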

1 hour ago, Tobias Kreidl said:

Do look in /var/log/SMlog for error reports -- it might be bad sectors that formed on the disk or any number of other things, but if it's a local disk, at least networking is not involved. If an fsck is needed, that generally happens automatically when the server is rebooted.

 

-=Tobias

 

I don't know how to look for bad sectors. I searched the 6 MB log file for "bad " and "sector" and found not a single entry, but I did find this:

"INTERRUPTED COALESCE-LEAF OP DETECTED"

 

There are threads in this forum regarding this entry, but I have no clue what it means or what I should do. They are apparently not about bad sectors, and they are 7 and 5 years old!

What is going on? My RAID information on the Fujitsu host shows all SSDs with a green LED and no errors.

 

Should I attach a compressed version of the SMlog file here? Compressed, it is 418 KB.

 

I also found this on GitHub: https://github.com/xapi-project/sm/blob/master/drivers/LVHDSR.py

The function throwing the error is at lines 911-916.

 

This is the comment in the repository at the place where "INTERRUPTED COALESCE-LEAF OP DETECTED" is thrown:

Quote

 

CA-288222: deactivate LV on slaves before changing the LVM layout

When creating a snapshot for an LVHD attached on a slave host, we first change the LV on the master and then refresh the LV on the slave. As part of the process, we'll run "lvchange -an" on the slave and expect it to tear down the existing DM (so that we can create a new one based on the new configuration). At the point when we do this, the LV's configuration has already been changed (by the master) and hence differs from the existing DM at runtime. It's unclear whether, in this case, the lvchange will still tear down the (now diverted) DM instance. It shouldn't be a surprise if lvchange refuses to do so for safety considerations, or simply fails to do so because of some assertion in place. Anyway, it looks like undefined behaviour which is implementation-dependent. To avoid the uncertainty, this patch shifts the deactivateNoRefcount step forward, to before any LV changes take place on the master, where the LV configuration and DM setup still match each other.

 

 

As it clearly comes from creating the snapshots defined in my snapshot schedule, I have disabled the schedule for the moment, until we maybe find a solution.
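
For the record, I disabled it in XenCenter, but assuming the standard VMSS CLI it can presumably also be done from dom0 (a sketch; <schedule-uuid> is a placeholder):

  # List snapshot schedules, then disable the relevant one:
  xe vmss-list params=uuid,name-label
  xe vmss-param-set uuid=<schedule-uuid> enabled=false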

 

In XenCenter I found this entry, and it is from exactly the time when the problem started:

Quote

"Local storage on AAEB91: failed to unpause tapdisk",
"Failed to unpause tapdisk for VDI 01d20aba-84da-4615-b338-2efacd15157a,
VMs using this tapdisk have lost access to the corresponding disk(s)","AAEB91","Oct 5, 2019 2:54 AM",""

And the VM using this disk is the SBS VM that crashed.
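
From dom0, the tapdisk and the affected VDI can presumably be inspected like this (a sketch; the UUID is the one from the alert above):

  # List active tapdisk instances (pid, minor, state, image):
  tap-ctl list
  # Show which VDI the alert refers to:
  xe vdi-list uuid=01d20aba-84da-4615-b338-2efacd15157a params=name-label,sr-uuid,managed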

 

This thread is about the same problem - it is 6 years old!

https://discussions.citrix.com/topic/329187-tapdisk-failure-and-the-vm-cannot-boot/

 

Any idea?

Edited by Peter VARGA

A tapdisk failure would hint at a disk error or timeout issue. I'd honestly try a reboot to see if things clean themselves up, if you want to try that. Of course, things could potentially get worse. Odd error messages, or no error messages combined with weird behavior, would in my book hint at storage errors. Do pvdisplay, lvdisplay, and vgdisplay all show the expected results?
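
For example (a minimal sketch; run in dom0, and the exact output format varies with the LVM version):

  # Quick LVM sanity checks:
  pvdisplay          # physical volumes: size, free extents, UUID
  vgdisplay          # volume groups: one VG_XenStorage-<uuid> per LVM SR
  lvdisplay | less   # logical volumes: one per VDI/snapshot on the SR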

 

-=Tobias

4 hours ago, Tobias Kreidl said:

A tapdisk failure would hint at a disk error or timeout issue. I'd honestly try a reboot to see if things clean themselves up, if you want to try that. Of course, things could potentially get worse. Odd error messages, or no error messages combined with weird behavior, would in my book hint at storage errors. Do pvdisplay, lvdisplay, and vgdisplay all show the expected results?

 

-=Tobias

 

As I wrote, I restarted the server. This fixed it.

@pvdisplay:

 

Quote

  --- Physical volume ---
  PV Name               /dev/sda3
  VG Name               VG_XenStorage-f008dabf-481b-7a25-6ebf-56c8f657cf5a
  PV Size               2.69 TiB / not usable 14.98 MiB
  Allocatable           yes
  PE Size               4.00 MiB
  Total PE              703996
  Free PE               171083
  Allocated PE          532913
  PV UUID               EYzBNG-facF-QmYu-ZNWy-SjNC-tiHS-Gq6RZL

 

@lvdisplay: It is a very long list. What should I look for that would count as "not an expected result"?
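
One way to condense it (a sketch using standard LVM tools; as far as I understand, "NOT available" can be normal for LVs whose VDIs are not currently attached):

  # Count the LV statuses instead of reading the whole listing:
  lvdisplay | grep "LV Status" | sort | uniq -c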

@vgdisplay:

  --- Volume group ---
  VG Name               VG_XenStorage-f008dabf-481b-7a25-6ebf-56c8f657cf5a
  System ID
  Format                lvm2
  Metadata Areas        1
  Metadata Sequence No  104311
  VG Access             read/write
  VG Status             resizable
  MAX LV                0
  Cur LV                363
  Open LV               181
  Max PV                0
  Cur PV                1
  Act PV                1
  VG Size               2.69 TiB
  PE Size               4.00 MiB
  Total PE              703996
  Alloc PE / Size       532913 / 2.03 TiB
  Free  PE / Size       171083 / 668.29 GiB
  VG UUID               P3XK0W-GDlZ-9730-Ladx-JbQo-DwLR-rFEfWu


 

This looks OK to me:

smartctl 6.2 2013-07-26 r3841 [x86_64-linux-4.4.0+10] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               QNAP
Product:              iSCSI Storage
Revision:             4.0
User Capacity:        2,834,678,415,360 bytes [2.83 TB]
Logical block size:   512 bytes
Logical block provisioning type unreported, LBPME=-1, LBPRZ=0
Logical Unit id:      0x6e843b6736aea74d36efd4949db272d0
Serial number:        736aea74-36ef-4949-b272-0122db0a854f
Device type:          disk
Transport protocol:   iSCSI
Local Time is:        Sun Oct  6 09:26:50 2019 CEST
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Disabled or Not Supported

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Error Counter logging not supported


[GLTSD (Global Logging Target Save Disable) set. Enable Save with '-S on']
Device does not support Self Test logging

 

Edited by Peter VARGA