
Final Leaf Coalescing on LVM


Mark Flattery

Question

Hi All,

 

I know there used to be an issue with final leaf coalescing on file-based storage, but we are using LVM and are experiencing issues with base copies being left behind after a snapshot is deleted.

 

We are currently trying to use the offline coalescing tool, and it appeared to complete successfully (after only 10 minutes), so I had a look through the SMlog and saw the lines below. It would appear XenServer attempts an online coalesce first, finds there is not enough room on the SR, and then falls back to the offline coalesce (please correct me if I am wrong).
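
For reference, this is roughly what I have been running to check the SR and drive the tool. The UUIDs are placeholders, and the vhd-util and host-call-plugin lines are as I understand them from the Citrix documentation, so please correct me if I have those wrong:

# free space on the SR, in bytes
xe sr-list uuid=<sr-uuid> params=name-label,physical-size,physical-utilisation

# VHD chain on the LVM SR (flags as per the Citrix docs - worth checking on your version)
vhd-util scan -f -m "VHD-*" -l VG_XenStorage-<sr-uuid> -p

# watch garbage-collector / coalesce activity
tail -f /var/log/SMlog | grep SMGC

# offline leaf-coalesce tool (host-uuid should be the pool master, as I read the docs)
xe host-call-plugin host-uuid=<host-uuid> plugin=coalesce-leaf fn=leaf-coalesce args:vm_uuid=<vm-uuid>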

 

Any help would be much appreciated. This is XenServer 7.1 and the storage type is LVM over iSCSI.

 

Regards

 

Blair

 

May  7 18:04:41 DELLR1C1T4XS SMGC: [20797] Set vhd-blocks = eJztyjENACAQBLALxt46UpDB9lhgIGFp53bPJNUAAAAAAAAA8Neu27ny3Dhz48r2 for *84c414aa[VHD](125.000G//123.871G|n)
May  7 18:04:41 DELLR1C1T4XS SMGC: [20797] Num combined blocks = 63295
May  7 18:04:41 DELLR1C1T4XS SMGC: [20797] Coalesced size = 123.871G
May  7 18:04:41 DELLR1C1T4XS SMGC: [20797] Got other-config for f96e4add[VHD](125.000G/78.338G/125.250G|n): {'leaf-coalesce': 'offline', 'content_id': 'daab8dcc-f06f-6f3a-6cc2-c7c25d40eed5'}
May  7 18:04:41 DELLR1C1T4XS SMGC: [20797] No space to leaf-coalesce f96e4add[VHD](125.000G/78.338G/125.250G|n) (free space: 64718110720)
May  7 18:04:41 DELLR1C1T4XS SMGC: [20797] ...but enough space if skip snap-coalesce
May  7 18:04:41 DELLR1C1T4XS SMGC: [20797] Set leaf-coalesce = offline for f96e4add[VHD](125.000G/78.338G/125.250G|n)
May  7 18:04:42 DELLR1C1T4XS SMGC: [20797] Got sm-config for *84c414aa[VHD](125.000G//123.871G|n): {'read-caching-reason-44c8f564-d73d-4824-8fab-23d2a141c9b9': 'SR_NOT_SUPPORTED', 'read-caching-enabled-on-44c8f564-d73d-4824-8fab-23d2a141c9b9': 'false', 'vdi_type': 'vhd', 'vhd-blocks': 'eJztyjENACAQBLALxt46UpDB9lhgIGFp53bPJNUAAAAAAAAA8Neu27ny3Dhz48r2'}
May  7 18:04:42 DELLR1C1T4XS SMGC: [20797] No work, exiting
May  7 18:04:42 DELLR1C1T4XS SMGC: [20797] GC process exiting, no work left


13 answers to this question


  • 1

Blair,

It's not always easy to estimate, and it probably depends in part on the size of some of the VDIs themselves.

The error message:

May  7 18:04:41 DELLR1C1T4XS SMGC: [20797] No space to leaf-coalesce f96e4add[VHD](125.000G/78.338G/125.250G|n) (free space: 64718110720)

 

would indicate that there is _not_ enough space.
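
(For reference, 64718110720 bytes works out to roughly 60 GiB free on the SR, against a VDI with a 125 GB virtual size, so the shortfall is considerable.)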

 

Note that you can partly address this by deleting the unneeded snapshots and then doing a storage migration from one SR to another, which will take care of collapsing the storage of that VM. You can always move it back to the original SR eventually.
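
From the CLI, that migration is along these lines (if memory serves, so verify against the 7.1 xe documentation first; the UUIDs are placeholders):

# list the VDIs attached to the VM
xe vbd-list vm-uuid=<vm-uuid> params=device,vdi-uuid

# live-migrate one VDI to another SR in the same pool (Storage XenMotion)
xe vdi-pool-migrate uuid=<vdi-uuid> sr-uuid=<destination-sr-uuid>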

 

I'd suggest making some extra space available on that SR until that error message goes away.

 

-=Tobias

  • 1

So, when running the offline coalesce tool you should see messages like
 

Quote

VDI to consider: <UUID>

in the SMlog file.

 

If you see any of the following, then it hasn't done anything:

  • Unable to check VDIs on SR ....
  • The VM has no VDIs that could be leaf-coalesced
  • Skipping non-LVM VDI:
  • Error: VM <UUID> not found
  • Failed to suspend the VM

And, if the VM was running, you should see

 

Quote

Suspending VM  <uuid>
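
A quick way to check is to grep the standard SMlog location for those strings, e.g.:

# look for evidence of what the offline coalesce tool actually did
grep -E "VDI to consider|Suspending VM|Skipping non-LVM VDI|could be leaf-coalesced" /var/log/SMlog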

 

  • 1

A qualified "possibly".

 

Obviously every deployment is different, but coalescing snapshot trees after snapshot deletion (and SxM uses snapshots internally) on LVM SRs does need a significant amount of storage. The nodes being collapsed together need to be inflated up to a size suitable to accommodate all the data of the combined pair, and only once the data has been copied up can the lower-level VHD image be deleted and the space finally freed. So in a pathological scenario where the parent node contains very little data but the leaf contains a very large amount (potentially every block has been rewritten, so it is at 100% capacity), we need to inflate the parent node to 100% of capacity. This means that temporarily the SR is using 200% of the virtual capacity of the VDI.
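
To put rough numbers on that using the VDI from your log (125 GB virtual size, about 125.25 GB on the SR once fully inflated):

parent, inflated to full size:   ~125.25 GB
leaf, already at full size:      ~125.25 GB
transient worst case:            ~250.5 GB of SR space for one 125 GB virtual disk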

 

Mark.

  • 1

If I'm not mistaken, the timeout for such processes is 24 hours, hence this must be a different, internal timeout of sorts. I'm not sure.

 

That report is, I believe, for a single VHD's size.

 

I'm not seeing any "Timed out" messages in my logs. Maybe you still need a bit more space, or try again after a number of the processes have succeeded.

  • 1

Good to know.

 

I think the timed-out error you got is the result of a (what looked to be long-standing) bug. The leaf-coalesce process tries several times to snapshot the current leaf so it can be coalesced online before the final pause, but here it decided it wasn't making progress, i.e. the VM was writing faster than the coalesce could run and so the leaf kept growing. The bug then caused the code to drop through into attempting a paused leaf coalesce instead of giving up and leaving the VM to be coalesced offline. The timeout for a paused leaf coalesce is by necessity very short, because while it is happening the VM is unable to perform I/O to the disk, so the code was just stupid: if the snapshotting coalesce couldn't manage it, there was no way the paused coalesce could complete within the timeout. This is now fixed in Citrix Hypervisor 8.0, but the hotfix for 7.1 CU2 hasn't been released yet.

  • 0

Hello,

 

No problem, I will allocate some extra space on that SR, try the offline coalesce again and see if it succeeds.

 

Do you know what the numbers in the brackets mean (125.000G/78.338G/125.250G|n)? Are they the sizes of the three disks? My understanding is that when a snapshot is taken it creates an additional two VDIs.

 

We tried to move SRs but we are mighty tight for space at the moment. Just in the process of clearing up some space.

 

Many thanks for the advice.

 

Blair

  • 0

The three numbers for the VDI all describe that single VDI: they are the virtual size of the VDI (125G), the amount of space actually used inside the VDI (78.338G), and the space used on the SR to hold the VDI (125.25G). The space used on the SR is the full virtual size of the VDI (i.e. what you can write to it from a VM) plus the space used by the headers and maps that allow the data to be accessed in the VHD image format.
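
If you want to cross-check those numbers yourself, something like the following should do it (the UUIDs are placeholders, and the volume group name follows the usual VG_XenStorage-<sr-uuid> convention for LVM SRs):

# virtual size and space used inside the VDI, reported in bytes
xe vdi-list uuid=<vdi-uuid> params=virtual-size,physical-utilisation

# size of the logical volume backing the VDI on the SR
lvs VG_XenStorage-<sr-uuid>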

  • 0

Hi Mark,

 

Thanks for your reply. I do not see any of the messages you have described above. 

 

We can afford to increase this SR a little, and I wonder if I should add enough to attempt an online coalesce.

 

I inherited this setup when I started my current position, and I am a bit dubious as to whether the SR setup is ideal.

 

We have 31 SRs, mostly 600 GB in size, and most of them do not have sufficient space to properly utilise snapshots and Storage Motion. I know every scenario is different, but I wonder if fewer SRs with more capacity would help us avoid this kind of issue in the future.

 

Thanks again for your advice

 

Blair

  • 0

Hi Guys,

 

Thanks for your input so far. I have added some space to the SR and re-scanned it, which kicked off the online coalesce, and it appeared to be working away no problem, but after checking the SMlog a wee while later I saw this message.

 

Quote

*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*
May  8 12:24:09 DELLR1C1T4XS SMGC: [16706]          ***********************
May  8 12:24:09 DELLR1C1T4XS SMGC: [16706]          *  E X C E P T I O N  *
May  8 12:24:09 DELLR1C1T4XS SMGC: [16706]          ***********************
May  8 12:24:09 DELLR1C1T4XS SMGC: [16706] leaf-coalesce: EXCEPTION <class 'util.SMException'>, Timed out
May  8 12:24:09 DELLR1C1T4XS SMGC: [16706]   File "/opt/xensource/sm/cleanup.py", line 1563, in coalesceLeaf
May  8 12:24:09 DELLR1C1T4XS SMGC: [16706]     self._coalesceLeaf(vdi)
May  8 12:24:09 DELLR1C1T4XS SMGC: [16706]   File "/opt/xensource/sm/cleanup.py", line 1766, in _coalesceLeaf
May  8 12:24:09 DELLR1C1T4XS SMGC: [16706]     return self._liveLeafCoalesce(vdi)
May  8 12:24:09 DELLR1C1T4XS SMGC: [16706]   File "/opt/xensource/sm/cleanup.py", line 2298, in _liveLeafCoalesce
May  8 12:24:09 DELLR1C1T4XS SMGC: [16706]     return SR._liveLeafCoalesce(self, vdi)
May  8 12:24:09 DELLR1C1T4XS SMGC: [16706]   File "/opt/xensource/sm/cleanup.py", line 1813, in _liveLeafCoalesce
May  8 12:24:09 DELLR1C1T4XS SMGC: [16706]     self._doCoalesceLeaf(vdi)
May  8 12:24:09 DELLR1C1T4XS SMGC: [16706]   File "/opt/xensource/sm/cleanup.py", line 1847, in _doCoalesceLeaf
May  8 12:24:09 DELLR1C1T4XS SMGC: [16706]     vdi._coalesceVHD(timeout)
May  8 12:24:09 DELLR1C1T4XS SMGC: [16706]   File "/opt/xensource/sm/cleanup.py", line 789, in _coalesceVHD
May  8 12:24:09 DELLR1C1T4XS SMGC: [16706]     self.sr.uuid, abortTest, VDI.POLL_INTERVAL, timeOut)
May  8 12:24:09 DELLR1C1T4XS SMGC: [16706]   File "/opt/xensource/sm/cleanup.py", line 165, in runAbortable
May  8 12:24:09 DELLR1C1T4XS SMGC: [16706]     raise util.SMException("Timed out")
May  8 12:24:09 DELLR1C1T4XS SMGC: [16706]
May  8 12:24:09 DELLR1C1T4XS SMGC: [16706] *~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*

 

Now I believe I have a couple of options: try the offline coalesce again (potentially quicker?), or lengthen the timeout value (though I am not sure how to do that).

 

Now, an additional question: I also saw this line -

 

Quote

Coalesced size = 123.871G

 

Is this the size that has to be read back in, or is that too good to be true?

 

Cheers

 

Blair

  • 0

Hi Tobias,

 

This morning the coalescing completed, so thanks for your advice; a little patience was all that was required.

 

As you say, the timeout errors must have been from some kind of internal process.

 

Thanks again to yourself and Mark for your advice.

 

Cheers

 

Blair

