
Snapshot chain too long - how to get rid of hidden snapshots without export/import


Chirag Katbamna

Question

Hello, 

I have been reading through a lot of posts related to the "snapshot chain too long" issue. I have determined that the chain has grown to 30 and that is the root cause. All of these snapshots are hidden=1. We did not realize that the snapshots had been failing for over two months (yeah, I know - that's a big discussion as well). Cloudstack is set up to do daily snapshots and a full backup every 5 days. It seems to be working for all other VMs except this one - this VM has a 1 TB disk attached, which is the only difference compared to the other VMs (whose disks are at most 250 GB).

 

I am unable to go the export/import route to resolve this because we have Cloudstack running on top of it and I am afraid the metadata change may break it. (Or, if there is a way to export/import without breaking the vhd filename/metadata, please let me know.) This is in production, so I want to be sure before I make changes -- from what I understand, export/import will make changes to the vhd and things may not work as expected afterwards in Cloudstack.

 

However, besides export/import, is there any way to delete the hidden=1 snapshots? I read that I can do a coalesce. The issue here is that we have used one template for multiple VMs created by Cloudstack, and I am afraid of running the wrong operation (a leaf-coalesce) and breaking other VMs. cleanup.py seems to do a leaf-coalesce, so I have not run it. I have not been able to locate a script to run the normal non-leaf coalesce. Perhaps that's what I need.

 

Below is the output from vhd-util scan. As you can see, only the last entry has hidden=0.

Please advise on how to correct this. 

 

 /usr/bin/vhd-util scan -f -a -p -c -m d2ff04e8-16b9-434a-b89e-c0e9b3efd830.vhd
vhd=7e3e5ad7-b172-4561-b7fe-a1aab2859029.vhd capacity=1073741824000 size=846981562880 hidden=1 parent=none
   vhd=9e61813d-5464-4726-a2b7-c624a6f7fea1.vhd capacity=1073741824000 size=105802052096 hidden=1 parent=7e3e5ad7-b172-4561-b7fe-a1aab2859029.vhd
      vhd=0c3a9f51-43ad-4ae2-a920-02b1b4629cd7.vhd capacity=1073741824000 size=11424498176 hidden=1 parent=9e61813d-5464-4726-a2b7-c624a6f7fea1.vhd
         vhd=8aadff7d-6a3f-454a-833c-54712ea9da48.vhd capacity=1073741824000 size=10745795072 hidden=1 parent=0c3a9f51-43ad-4ae2-a920-02b1b4629cd7.vhd
            vhd=9f49acb7-8a45-4843-aa13-9f642bf9beaa.vhd capacity=1073741824000 size=10357064192 hidden=1 parent=8aadff7d-6a3f-454a-833c-54712ea9da48.vhd
               vhd=1b82cd32-5561-4455-923f-455398101345.vhd capacity=1073741824000 size=10056585728 hidden=1 parent=9f49acb7-8a45-4843-aa13-9f642bf9beaa.vhd
                  vhd=4c4ebc56-4346-4a88-ae0b-3357406916c9.vhd capacity=1073741824000 size=11214373376 hidden=1 parent=1b82cd32-5561-4455-923f-455398101345.vhd
                     vhd=1e2ce19e-fc3d-4f65-b341-00aa2f8ccf91.vhd capacity=1073741824000 size=11529560576 hidden=1 parent=4c4ebc56-4346-4a88-ae0b-3357406916c9.vhd
                        vhd=44efde05-15d9-46ce-848c-f64528a98665.vhd capacity=1073741824000 size=13315621376 hidden=1 parent=1e2ce19e-fc3d-4f65-b341-00aa2f8ccf91.vhd
                           vhd=b8e33443-76ca-4904-a61e-7cf15d6d07a4.vhd capacity=1073741824000 size=10598707712 hidden=1 parent=44efde05-15d9-46ce-848c-f64528a98665.vhd
                              vhd=cebe4cb5-a7b9-45bc-838a-f5be99f2ecf0.vhd capacity=1073741824000 size=9871675904 hidden=1 parent=b8e33443-76ca-4904-a61e-7cf15d6d07a4.vhd
                                 vhd=ebd42538-4639-4ee0-a785-cc0b86a6a6c5.vhd capacity=1073741824000 size=10273014272 hidden=1 parent=cebe4cb5-a7b9-45bc-838a-f5be99f2ecf0.vhd
                                    vhd=1a642477-ccde-48ba-b75a-9fb9a3188019.vhd capacity=1073741824000 size=13725364736 hidden=1 parent=ebd42538-4639-4ee0-a785-cc0b86a6a6c5.vhd
                                       vhd=78a3a972-f03c-4169-9370-e6de14670a4b.vhd capacity=1073741824000 size=11134525952 hidden=1 parent=1a642477-ccde-48ba-b75a-9fb9a3188019.vhd
                                          vhd=69c5cd33-705f-4722-a403-ccb0df7ed69a.vhd capacity=1073741824000 size=11161842176 hidden=1 parent=78a3a972-f03c-4169-9370-e6de14670a4b.vhd
                                             vhd=675efca2-8754-49de-825d-3e9b9590bdd1.vhd capacity=1073741824000 size=12311224832 hidden=1 parent=69c5cd33-705f-4722-a403-ccb0df7ed69a.vhd
                                                vhd=e4fd405d-8573-4069-9dab-eed68efa8a50.vhd capacity=1073741824000 size=10518860288 hidden=1 parent=675efca2-8754-49de-825d-3e9b9590bdd1.vhd
                                                   vhd=0b8363a5-89e4-4195-9e73-9af4c72aadd1.vhd capacity=1073741824000 size=10296128000 hidden=1 parent=e4fd405d-8573-4069-9dab-eed68efa8a50.vhd
                                                      vhd=93543c0e-db35-4018-9260-def054ec6568.vhd capacity=1073741824000 size=10483139072 hidden=1 parent=0b8363a5-89e4-4195-9e73-9af4c72aadd1.vhd
                                                         vhd=25d28b9f-f021-45d9-8e8f-b1bb41e71768.vhd capacity=1073741824000 size=14920974848 hidden=1 parent=93543c0e-db35-4018-9260-def054ec6568.vhd
                                                            vhd=879abd75-bb84-4bf3-8a48-6d82c458fbca.vhd capacity=1073741824000 size=12309123584 hidden=1 parent=25d28b9f-f021-45d9-8e8f-b1bb41e71768.vhd
                                                               vhd=e0a06835-9d51-4bc5-972d-aece90052d96.vhd capacity=1073741824000 size=12983624192 hidden=1 parent=879abd75-bb84-4bf3-8a48-6d82c458fbca.vhd
                                                                  vhd=e60b034e-9a44-45df-9f93-209707251c51.vhd capacity=1073741824000 size=14792798720 hidden=1 parent=e0a06835-9d51-4bc5-972d-aece90052d96.vhd
                                                                     vhd=f3838c28-7eaf-482e-90d2-4b967c2871ca.vhd capacity=1073741824000 size=11695559168 hidden=1 parent=e60b034e-9a44-45df-9f93-209707251c51.vhd
                                                                        vhd=e415a776-0cd6-4723-a1c6-4d39677c6a14.vhd capacity=1073741824000 size=10022965760 hidden=1 parent=f3838c28-7eaf-482e-90d2-4b967c2871ca.vhd
                                                                           vhd=71b6ebb0-ad9b-4ebd-a711-2b2b16187d56.vhd capacity=1073741824000 size=9848562176 hidden=1 parent=e415a776-0cd6-4723-a1c6-4d39677c6a14.vhd
                                                                              vhd=3db65d5f-a641-41f1-a2d3-3b742a8153c8.vhd capacity=1073741824000 size=13309317632 hidden=1 parent=71b6ebb0-ad9b-4ebd-a711-2b2b16187d56.vhd
                                                                                 vhd=92da8af8-551e-4c43-b75b-819f99ac5b15.vhd capacity=1073741824000 size=11247993344 hidden=1 parent=3db65d5f-a641-41f1-a2d3-3b742a8153c8.vhd
                                                                                    vhd=01b636bf-405d-4bcf-a9a6-c616058c86d2.vhd capacity=1073741824000 size=10680656384 hidden=1 parent=92da8af8-551e-4c43-b75b-819f99ac5b15.vhd
                                                                                       vhd=d2ff04e8-16b9-434a-b89e-c0e9b3efd830.vhd capacity=1073741824000 size=144895144960 hidden=0 parent=01b636bf-405d-4bcf-a9a6-c616058c86d2.vhd

 


21 answers to this question


There should still be a coalesce-leaf option, though I believe it may have been removed in later XS releases ... see if this article helps: https://support.citrix.com/article/CTX201296

If the SR is too full, it won't work regardless, which means either moving some storage elsewhere, or moving the storage associated with that VM to another SR and, if need be, back again. That should preserve only the most recent copy plus, I think, a maximum of around three snapshots, if present.
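For reference, a rough sketch of what moving that VM's disk to another SR can look like from the CLI, assuming XenServer 6.1 or later where Storage XenMotion (vdi-pool-migrate) is available; all UUIDs below are placeholders:

xe vbd-list vm-uuid=<vm-uuid> params=vdi-uuid,device
xe sr-list name-label="Destination SR" --minimal
xe vdi-pool-migrate uuid=<vdi-uuid> sr-uuid=<destination-sr-uuid>

XenCenter's move-disk option does the equivalent from the GUI, I believe.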

1 hour ago, Tobias Kreidl said:

There should still be a coalesce-leaf option, though I believe it may have been removed in later XS releases ... see if this article helps: https://support.citrix.com/article/CTX201296

If the SR is too full, it won't work regardless, which means either moving some storage elsewhere, or moving the storage associated with that VM to another SR and, if need be, back again. That should preserve only the most recent copy plus, I think, a maximum of around three snapshots, if present.


I went through the article earlier. Somehow I don’t have access to the link in the Problem 3 section for CTX133470.

 

Are the contents of that link something you can copy/paste here?
 


CTX133470 

 

Symptoms or Error

Taking a snapshot fails because XenServer reports "The snapshot chain is too long".

 

Solution

 

Complete the following steps to fix the issue:

 

Check the /var/log/SMlog log file on the pool master for any errors.

If no errors appear on the pool master, check /var/log/cloud/management/management-server.log for the following log message:

05:46:12,182 DEBUG [agent.transport.Request] (DirectAgent-28:null) Seq 2-1213457649: Processing: { Ans: , MgmtId: 72754994106119, via: 2, Ver: v1, Flags: 10, [{"ManageSnapshotAnswer":{"_snapshotPath":"8301ebea-46f7-4967-afac-e6d538bee484","result":true,"wait":0}}] }

Find the corresponding host in the hosts table (based on the host id) and check /var/log/SMlog on that host.

The VHD might be corrupted, which is preventing the GC. To fix this issue, the VHD must be repaired.
a) Repair the VHD using the following command:
vhd-util repair -n /path/to/VHD
OR
b) Remove the VHD using any one of the following commands:
lvremove /path/to/VHD or xe snapshot-uninstall

Note: Do not use vhd-util coalesce unless there is no other choice. It is recommended to use vhd-util scan -capfm to print the VHD tree and find the snapshot index disks. Set their name labels and delete the unwanted snapshots from the SR's Storage tab. Use the xe sr-scan command or Rescan in XenCenter to trigger the GC/coalescing process.
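As a rough CLI illustration of that note (the UUIDs here are placeholders): set a name label on the snapshot VDI you have identified, then rescan the SR to kick off the GC:

xe vdi-param-set uuid=<snapshot-vdi-uuid> name-label="snapshot-to-clean-up"
xe sr-scan uuid=<sr-uuid>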

 

Additional Steps

If you are unable to locate any issue in SMlog, run the following commands to fix the issue:

xe vdi-list is-a-snapshot=true | grep -B1 i-#-#-VM

The result lists the snapshots.

a) Combine the snapshots:

vhd-util coalesce -n /dev/VG-storage-uuid/VHD-uuid_of_vdi

OR

b) Destroy the snapshot:

xe vdi-destroy uuid=uuid_of_snapshot

Take another snapshot of your virtual machine.

 

Problem Cause

XenServer supports a limited number of child VHDs from the parent. Typically XenServer's GC should clean up any VHD which has been marked for removal, but sometimes there is an issue which prevents XenServer from cleaning up.


I followed the steps from CTX133470 

 

The only errors I see in SMlog are related to the snapshot chain being too long:

Dec  3 06:02:32 nexusvm7 SM: [32469] ***** generic exception: vdi_snapshot: EXCEPTION <class 'SR.SROSError'>, The snapshot chain is too long
Dec  3 06:02:32 nexusvm7 SM: [32469]     raise xs_errors.XenError('SnapshotChainTooLong')
Dec  3 06:02:32 nexusvm7 SM: [32469] ***** NFS VHD: EXCEPTION <class 'SR.SROSError'>, The snapshot chain is too long
Dec  3 06:02:32 nexusvm7 SM: [32469]     raise xs_errors.XenError('SnapshotChainTooLong')

I ran

vhd-util repair -n /path/to/VHD

It did not change anything. 
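For what it's worth, vhd-util also has a check subcommand that can validate a VHD's metadata before and after a repair (the path below is a placeholder):

/usr/bin/vhd-util check -n /path/to/VHD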

 

I then ran 

vhd-util coalesce -n /dev/VG-storage-uuid/VHD-uuid_of_vdi

That process took over 4 hours. It has now inflated my base file to 1.5 TB (the base vhd was 850 GB before). However, there are still 30 snapshots present and the chain is still there. This is bad because the virtual size is only 1 TB, but the actual disk usage is now 1.5 TB plus a lot of other snapshots, the sum of which is over 2 TB.

 

 

I kicked off an sr-scan. I don't know how to check progress on that, but if I start a new sr-scan it exits, logging that another GC is in progress. I am hoping that the GC process will delete the snapshots, but it has been running for over 3 hours now and nothing has happened yet.
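One way to watch for activity, assuming the default log location, seems to be to follow SMlog for the GC/coalesce entries (the garbage collector tags its lines with SMGC):

tail -f /var/log/SMlog | grep -iE "smgc|coalesc"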

 

So what do I do now? Is it safe to delete the snapshots in the chain using xe vdi-destroy uuid=uuid_of_snapshot?

 

Will it break anything if I delete a snapshot from the middle of the chain?

 

 


I just tried vdi-destroy for a snapshot - it did not delete anything...


 

 xe vdi-destroy uuid=0c3a9f51-43ad-4ae2-a920-02b1b4629cd7
This operation cannot be performed because the system does not manage this VDI
vdi: 0c3a9f51-43ad-4ae2-a920-02b1b4629cd7 (base copy)

 

Now I really don't know what to do...

 

Any advice?

 


At some point, when there is insufficient remaining storage on the SR to do the cleanup, the easiest option is to try to move the storage for the VM to a different SR. If that's not possible for whatever reason (including possibly too long a snapshot chain), you can do a full export/delete/import of the VM (which unfortunately means downtime). Another option might be to bring down the original VM, perform a full clone of the VM, detach the original VM's storage as well as that of the clone, attach the new storage copy to the original VM, and bring up the VM. You can then destroy the original VM's old storage as well as the VM clone once you are satisfied all is working -- this makes for a shorter amount of downtime compared to a full export/import procedure, especially if your VM uses a lot of storage.
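Very roughly, that second option might look like this from the CLI; a full copy (not a fast clone) is what collapses the chain, and every UUID/name below is a placeholder:

xe vm-shutdown uuid=<vm1-uuid>
xe vm-copy vm=<vm1-uuid> new-name-label="vm1-fullcopy" sr-uuid=<target-sr-uuid>
xe vm-disk-list vm=vm1-fullcopy
xe vbd-list vm-uuid=<vm1-uuid> params=uuid,vdi-uuid,userdevice
xe vbd-destroy uuid=<vm1-old-vbd-uuid>
xe vbd-create vm-uuid=<vm1-uuid> vdi-uuid=<copied-vdi-uuid> device=0 bootable=true type=Disk
xe vm-start uuid=<vm1-uuid>

The vm-copy writes the new single-VHD disk, the vbd-destroy/vbd-create pair swaps it in place of the old chained disk, and the old VDIs and the copy VM can be destroyed once everything checks out.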

 

-=Tobias


For shutting down and doing an export/import, I am facing two issues.

This is the production database server, so I cannot bring it down without significant impact.

But I guess there is no way out at this point. I can start Friday night and hope that export/import or cloning fixes the issue.

 

I will update once these steps are done.

 


Would it be safer to do a clone in that case (the second option Tobias suggested)?

So something like

  • Shut down VM1
  • Full clone VM1 to VM2
  • Detach the disks from both VMs
  • Delete VM1's disks
  • Attach the new disk from VM2 to VM1
  • Delete VM2?

Secondly, can I simply rename the VHD file before attaching it to VM1? The reason is that the VMs were created via Cloudstack, so the VHD file names are stored in the Cloudstack database. I do not want to manually change those (although I have done that once before and it worked fine).

 

Thanks,

Chirag

 

Just now, Alan Lantz said:

That's what I would try. Not sure how fast that type of full copy is going to take. Just be very careful!

 

--Alan--

 

 

Let me take a moment here - I just want to say thanks to both of you, Alan and Tobias, and to others who have contributed to this thread. Your support of this forum is legendary!

 

Back to the issue - I will be going across storage pools (space issue on the current storage), but the network is 10Gb, so I hope this will be done within 3-4 hours (hoping!!). I will be making a full copy of the actual vhd disks before attempting anything, so that will likely add a few more hours...

 

I will post more info and update once I actually get to doing this over the weekend.

 

1 hour ago, Alan Lantz said:

Tobias is the Legend.  I just hang out here because I have a cushy government job.  :5_smiley:

 

--Alan--

 

Alan, don't be so modest! Your contributions are epic and I continue to learn stuff from you. As an old German expression goes, "One hand washes the other."

 

"Legend" is a term used when someone is so old that nobody can actually remember if and when they existed. :6_smile:

 

-=Tobias


I finally completed this maintenance, so I thought I would post the outcome here for completeness... I had a slightly unique situation: we have Cloudstack on top of XenServer, so I had to make sure that Cloudstack continues to work normally. Additionally, because of Cloudstack, if I shut down a VM from the Cloudstack portal, it unregisters the VM from XenServer so it does not show up at all in XenCenter - which means I cannot do a full VM clone, as XenServer no longer knows that GUID.

 

So I had to improvise a few things and make do...

 

Steps taken:

1. Shut down the Cloudstack management server -- I had no way around it, as it syncs back to Xen.

2. From XenCenter, shut down VM1. This allows VM1 to stay visible in XenCenter after shutdown.

3. Full clone VM1 to VM2 -- this process took nearly 6-7 hours (1 TB disk; apparently a VM clone does not run at network/disk speed, same as an export).

4. Start the Cloudstack management server. Within a few minutes I got an email showing that the server status had synced. I was scared it might try to start the VM due to HA, but since the clone was in progress it did not, I assume. I wish I had a better way to handle this.

5. Copy the VM2 disk to the VM1 storage location (a simple cp command - this took under 15 minutes as it ran at full network + disk speed). A rough shell sketch of steps 5-7 follows this list.

6. Archive all of VM1's existing disks (mv command), including the entire snapshot chain (30 vhd files), to another folder.

7. Rename the VM2 vhd to the VM1 vhd file name. Verified that the chain length is now 0.

8. Start the server from Cloudstack.

9. Delete the cloned VM2 and the VM2 disk from the other storage (remember, I copied this disk across storage pools but did not otherwise touch the cloned disk or VM2).
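Roughly, steps 5-7 amounted to the following on the NFS SR mount points (the SR and VHD UUIDs below are placeholders, not my real ones; the mv of the old chain was repeated for all 30 vhd files):

cp /var/run/sr-mount/<other-sr-uuid>/<vm2-vdi-uuid>.vhd /var/run/sr-mount/<vm1-sr-uuid>/
mkdir /var/run/sr-mount/<vm1-sr-uuid>/old-chain
mv /var/run/sr-mount/<vm1-sr-uuid>/<old-chain-vhd-uuid>.vhd /var/run/sr-mount/<vm1-sr-uuid>/old-chain/
mv /var/run/sr-mount/<vm1-sr-uuid>/<vm2-vdi-uuid>.vhd /var/run/sr-mount/<vm1-sr-uuid>/<vm1-original-vdi-uuid>.vhd
/usr/bin/vhd-util scan -f -a -p -c -m /var/run/sr-mount/<vm1-sr-uuid>/<vm1-original-vdi-uuid>.vhd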

 

After this was all done, I initiated a snapshot from Cloudstack (scheduled), and it took over 6-7 hours, just like the clone process.

 

At this time, all is back to normal - except for the item below.

 

ONE THOUGHT/QUESTION - the clone process made the cloned VHD 100% of the allocated size rather than the used size (out of 1 TB, only 600 GB was used, but the cloned vhd file is 1 TB in size).
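For reference, the virtual size versus the actual on-disk utilization of a VHD can be checked with vhd-util query, I believe (the path below is a placeholder):

/usr/bin/vhd-util query -n /var/run/sr-mount/<sr-uuid>/<vdi-uuid>.vhd -v -s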

Is there a way to shrink it?

 

