
XS71CU2: vbd-unplug on control domain hangs, causing every task started after it to hang


Toni Tschampke

Question

Hello Forum,

 

I know similar questions have been asked many times, but unfortunately none of the suggested solutions works for us.

 

We are using a modified BackupPC approach to do file-based backups of our VMs. It basically creates a VDI snapshot, creates a VBD, and attaches it to the control domain.

The snapshot is then mounted in dom0. For two VMs the vbd-unplug always fails, leaving every subsequent task pending; the only "fix" is issuing "xe-toolstack-restart".
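
Roughly, the per-VM sequence looks like this (a simplified sketch with placeholder UUIDs and mount points, not our exact script):

    # snapshot the VM's disk
    SNAP_UUID=$(xe vdi-snapshot uuid=<vdi-uuid>)

    # create a VBD linking the snapshot to the control domain
    # (note: on a pool this lists one dom0 per host; pick the local one)
    DOM0_UUID=$(xe vm-list is-control-domain=true --minimal)
    VBD_UUID=$(xe vbd-create vm-uuid=$DOM0_UUID vdi-uuid=$SNAP_UUID device=autodetect)

    # plug it, mount it, run the backup, then tear down
    xe vbd-plug uuid=$VBD_UUID
    mount /dev/<xvd-device> /mnt/backup
    # ... file-based backup runs here ...
    umount /mnt/backup
    xe vbd-unplug uuid=$VBD_UUID    # <-- this is the step that hangs
    xe vbd-destroy uuid=$VBD_UUID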

 

The only way to fully clear this state is to reboot both XenServer hosts in the pool, which means late-night maintenance, as the VMs must not have any downtime during the day.

 

From what I have read so far, the most likely cause is an unfinished storage I/O operation.

 

We have already exported the VM to a flat file (XVA) and re-imported it, as Tobias suggested in other threads, which should result in a "clean" VM without any snapshots.

We also tried exporting it to another storage repository and, after "flattening" it out, re-importing it to our default SR.

 

All the SRs are thin-provisioned NFSv3 storage backed by BSD/FreeNAS.

 

The next time the backup runs, we hit the same issue again. For now we have had to disable the additional file-based backups for these two VMs entirely.

 

How can we debug this further? I have been carrying this issue around for months now and have not been able to find a solution.

 

A few questions from our side:

 

1. Is it possible to identify a hanging I/O operation? I would guess that some tapdisk process could be involved in this.

2. If the "vbd-unplug" hangs, is it safe to issue "xe-toolstack-restart" followed by "vbd-destroy force=true" on the control domain?

    The snapshot VDI is easy to identify, as we use special names for it, and could be removed manually afterwards.

3. What could cause this on these two specific VMs? (Some are LVM-based, some have just a single-partition rootfs.)

4. Sometimes the VBD is already gone and we cannot find a UUID for it, but it is still attached to the control domain and shows up as a pending task at 0%.

EDIT: Issuing "xe task-cancel force=true" with the corresponding task UUID does absolutely nothing.

I am really out of ideas and would be happy to provide additional information if needed.

 

Thank you very much in advance for any help on this issue.

 

-Toni


3 answers to this question


Boby John

 

Quote

1. Is it possible to identify a hanging I/O operation? I would guess that some tapdisk process could be involved in this.

 

Try xe task-list.

Or, if you know the disk UUID, you can try tap-ctl list | grep -i <vdi-uuid> to find the associated tapdisk process. You can then kill it using kill -9 <pid>.
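
For example (a rough sketch; replace the placeholders with your real UUIDs):

    # list pending XAPI tasks
    xe task-list

    # find the tapdisk process backing the snapshot VDI; on file-based
    # SRs such as NFS the VDI UUID shows up in the vhd path that
    # tap-ctl prints for each process
    tap-ctl list | grep -i <vdi-uuid>

    # kill the tapdisk process by the pid shown in that line
    kill -9 <pid>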

 

Quote

2. If the "vbd-unplug" hangs, is it safe to issue "xe-toolstack-restart" followed by "vbd-destroy force=true" on the control domain?

 

Yes.
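
Something along these lines should recover the toolstack without a host reboot (a sketch; the stale VBD UUID comes from the vbd-list output):

    xe-toolstack-restart

    # once xapi is back, find and remove the stale dom0 VBD
    # (on a pool, vm-list returns one dom0 per host; pick the local one)
    DOM0_UUID=$(xe vm-list is-control-domain=true --minimal)
    xe vbd-list vm-uuid=$DOM0_UUID
    xe vbd-destroy uuid=<stale-vbd-uuid> force=true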

 

Quote


3. What could cause this on these two specific VMs? (Some are LVM-based, some have just a single-partition rootfs.)

 

 

Take a look at /var/log/SMlog and /var/log/daemon.log; they might give some clues.
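
For example (the VDI UUID is a placeholder):

    # follow the storage manager log while reproducing the failing unplug
    tail -f /var/log/SMlog

    # or search both logs afterwards for the snapshot VDI
    grep -i <vdi-uuid> /var/log/SMlog /var/log/daemon.log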

 

Quote

4. Sometimes the VBD is already gone and we cannot find a UUID for it, but it is still attached to the control domain and shows up as a pending task at 0%.

 

If listing the VBDs of dom0 still shows the VBD attached, we cannot really confirm that it is gone.
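
A quick way to check (sketch):

    # list every VBD attached to the local control domain
    DOM0_UUID=$(xe vm-list is-control-domain=true --minimal)
    xe vbd-list vm-uuid=$DOM0_UUID params=uuid,vdi-uuid,currently-attached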

Toni Tschampke

Thanks, Boby John, for the suggestion to use tap-ctl; I will look into it.

 

@Tobias One of the affected VMs did indeed misbehave some time ago: it required a power-state reset, as the VM wasn't shutting down (we tried soft, then hard) and the shutdown task was hanging at 20%.

 

The forced power reset changed the visible state in XenCenter/Orchestra to offline on both hosts. We tried to start it but only got a strange "VDI not managed" error.

Looking at xe vm-list afterwards showed it was still running on one host. Luckily the strange state was cleared after rebooting the whole pool, and the VM was usable again; but we have had this VBD issue ever since.

 

Maybe I will have to set these two VMs up cleanly from scratch.
