Elastic layers disconnect overnight with XenApp


Chris Blocker

Question

Starting a new thread because our issue isn't quite the same (we're using XenApp session hosts), but here's a link to a very similar thread:
https://discussions.citrix.com/topic/394883-layering-service-unresponsive-after-a-login/

- Multiple users logged in: elastic layers attach fine.
- Users log out: after some time, the elastic layer and its app might become unavailable.
- Users log back in after some time: this sometimes causes the server to crash and reboot.
- No antivirus other than Windows Defender, as SEP was causing multiple issues.

 

As far as I can track it down, the disconnects may be related to SMB: when a user logs out, the SMB connection remains open with that user's credentials for up to two hours. Once that connection closes, all hell breaks loose.
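For reference, this is how I've been watching the lingering connection on the file server (standard in-box SMB cmdlets, nothing App Layering specific):

    # On the file server: sessions still open under a logged-off user's credentials
    Get-SmbSession |
        Select-Object ClientComputerName, ClientUserName, NumOpens, SecondsIdle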

 

We've been dealing with this for months and Citrix support is unable to find anything. Management is about to abandon layers completely because of this.

Recommended Posts

What technology are you using for your elastic layer share? I was working with a customer using user layers on Acropolis File System, and they found there was a Kerberos ticket timeout after 10 hours. Yours seems much shorter. I have also seen customers whose PVS cache fills up very quickly, which looks pretty similar to your symptoms as well.

 

You should not be seeing the VHD connections go away on the file server until the machine reboots. Maybe a good test would be to publish an elastic layer with no users assigned. Then, from one of your other servers, just mount that disk and try to leave the mount open for a day and see if it gets disconnected. That takes the PVS image out of the troubleshooting.
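Something like this, as a rough sketch (the share path is a placeholder):

    # Mount the layer VHD straight from the share and leave it attached
    Mount-DiskImage -ImagePath '\\fileserver\Layers\TestLayer.vhd' -Access ReadOnly

    # A day later, check whether it is still attached
    Get-DiskImage -ImagePath '\\fileserver\Layers\TestLayer.vhd' | Select-Object Attached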


Are you using Citrix Federated Authentication Service in your environment? I saw one instance where FAS was attributed as the cause of elastic layer disconnects. In that case they did not need the feature, so it was disabled and no other troubleshooting was done. I opened a ticket for this feature to be QA'd with elastic layers to see if there is a real problem or if this was a one-off configuration issue.

 

Otherwise I agree with the tests Rob suggested above. Another possibly useful test would be to take a server with elastic layering disabled, log in with one user, and manually attach one of the elastic layer VHDs to the VM using Windows Disk Management (Action -> Attach VHD). This simulates what our service is doing but removes App Layering from the equation, which could help uncover any SMB-specific issues.
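To see whether such a manually attached VHD keeps its handle open after the test user logs off, you can check on the file server (a sketch; the file name is a placeholder):

    # Who holds the layer VHD open over SMB?
    Get-SmbOpenFile | Where-Object Path -like '*TestLayer.vhd' |
        Select-Object ClientUserName, ClientComputerName, Path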

1 hour ago, Rob Zylowski said:

What technology are you using for your elastic layer share? [...]


I've tried the shares on 4 different technologies:
- EMC CIFS share: no bueno, the layers don't even attach at all.
- SOFS: this is where we first started noticing the issue.
- 2012 R2 share: same issue.
- 2016 share: same issue.

 

I've tried manually attaching the VHD, and the same disconnects did not occur.


Just wanted to update this and see if there's any more advice someone could offer.

 

I had a test server running on which I had manually attached one of the app layer VHDs (from the same location, just to keep things consistent).

I'd forgotten about that until yesterday, when I logged in and found that the manually attached VHD had been connected for over a month. I confirmed that elastic app layers still disconnect and freak out when there are multiple users on the server.

 

Thoughts?


Hi,

Did you ever get a resolution to this? We have a similar issue (with a case open with support). If we mount the layer manually and leave it, it stays connected. The same layers don't disconnect when users are on Server VDI machines, but they do disconnect in RDSH environments. This seems to happen around the times users start to log off, so I'm not sure if that's connected. It does, however, cause considerable disruption, as users' applications either disappear (best case) or the whole machine restarts.

thanks

Phil 


We still haven't had any success with this; support doesn't seem to be able to help. We have this exact same issue: about 2 hours after someone logs out, the SMB connection is closed.

 

We have tried rebuilding the file servers and migrating the layers share, we have removed AV from both the client and the file server, and we have also moved the file servers into an OU with all policies blocked.

 

This does not happen on single-session delivery groups, but it does on multi-session OS machines.

 

I'm starting to feel that this tech is not enterprise-ready and simply does not work. We are looking to remove it, and I would strongly advise people against using it.

 

If anyone has any ideas we can try, that would be great!

 

thanks


Currently experiencing the same issue. We have a case open with support, but I am not expecting a resolution based on the above. We are hosting our elastic layers and FSLogix profiles on the same 2016 SOFS clusters. AV has all the recommended exclusions on the filers and workloads (Windows Defender).

I'm seeing the issue across multiple workloads (2016, 2019 and Windows 10 multi-user). As above, this does not impact single-user VDAs (Win 10).

 

All of our workloads are MCS-provisioned in Azure. We do utilize FAS, and we require it (we cannot disable it).

 

When the issue occurs, all of the elastic layer disks drop. It appears, however, that the FSLogix disks are not impacted, although I have been told FSLogix has a recovery mechanism, so it may just not be showing the issue. The event logs indicate disk warnings for the virtual disks at the time the issue occurs; there are also additional warnings indicating that the disks have been retired.
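For anyone trying to correlate this, roughly how I pull those disk warnings from an affected VDA (the one-day time window is just an example):

    # Disk provider warnings/errors from the System log over the last day
    Get-WinEvent -FilterHashtable @{
        LogName      = 'System'
        ProviderName = 'disk'
        Level        = 2, 3        # 2 = Error, 3 = Warning
        StartTime    = (Get-Date).AddDays(-1)
    } | Format-Table TimeCreated, Id, Message -AutoSize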


Do you see any patterns with the disconnects? It sounds like it's not an issue at the file server end, because your single-session VDAs are not experiencing the issue. These are the questions I thought of:

  1. Are all your users affected, or only some?
  2. Are all the VDAs affected, or only some?
  3. Are all the disconnects at the same time or different times? On the same VDAs and different VDAs?
  4. If it is all at one time, are there any scheduled tasks running at that time?
  5. Are you doing anything that uses VSS on the VDAs, which would normally only be used for backups?

Rob

 


Hey Rob

 

The issue impacts all users on the same VDA (i.e. if a server experiences the issue, all logged-in users are impacted, and any new sessions to that server are impacted).

All of our layers are currently assigned to the servers via machine association (Pattern matching).

Will double-check the scheduled tasks, but it's unlikely (Citrix Optimizer is run on the OS layer with the Citrix OS template using default settings, plus your App Layering supplement and profile templates).

When the issue hits, it tends to be random servers; other servers in the delivery group are not impacted at the same time.

Nothing else using VSS on the VDAs.

I've also confirmed that manually mounting VHDs from an impacted server is fine (using the same UNC path specified in GPO for the layers).

Also worth noting: although we are using MCSIO for the 2019 and Win10 multi-user catalogs, the 2016 catalog is not configured for MCSIO, and we see the issue there also.

 

Michael


Hi,

 

We have finally made some progress on this. We removed App Layering as such from the equation and ran this test:

 

1. Log on to a server (not using the VDA or App Layering).
2. Connect a .vhd from the file share.
3. Log off the server.
4. Check the file server to see that the .vhd is still open as that user, even though there is no session on the server.
5. Confirm the idle timer is reset every 10 minutes (see the monitoring sketch below).
6. Check back after more than 10 hours and the .vhd will be dismounted.
7. This will coincide with an anonymous logon event on the file server, as the ticket has expired.
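For steps 4 and 5, this is roughly what we watched on the file server, as a sketch using the in-box SMB cmdlets (the .vhd filter is a placeholder):

    # For reference, the SMB idle-disconnect timer (default 15 minutes):
    Get-SmbServerConfiguration | Select-Object AutoDisconnectTimeout

    # Poll every 5 minutes: is the .vhd still open, and how idle is the session?
    while ($true) {
        Get-SmbOpenFile | Where-Object Path -like '*.vhd' |
            Select-Object ClientUserName, Path
        Get-SmbSession |
            Select-Object ClientUserName, NumOpens, SecondsIdle, SecondsExists
        Start-Sleep -Seconds 300
    }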

 

Support kept pushing us to look at SMB timers etc., which we kept explaining were set at the defaults of 15 minutes; if those were taking effect, this would occur much sooner.

We did notice that the .vhd stayed connected and that the idle timer of the connection on the file server never reached 15 minutes; in fact it was reset every 10 minutes. Our assumption is that a service must be impersonating the user, as the user is physically not logged on to that server. Once 10 hours is reached and the "service" tries to keep the SMB connection alive, this fails, the connection is disconnected, and you will see an "anonymous logon" event on the file server. This suggested to us that it was related to "Maximum lifetime for service ticket". We increased this value to 24 hours (as we reboot our machines overnight), and this has now removed the problem.
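This is roughly how we confirmed the ticket angle, as a sketch (0x3e7 is the SYSTEM logon session; the session actually holding the mount may differ, so list the sessions first):

    # List logon sessions on the VDA, then dump the Kerberos tickets for one;
    # check the End Time on the cifs/<your file server> ticket.
    klist sessions
    klist -li 0x3e7

    # The policy we raised (value in minutes; default 600 = 10 hours):
    # Computer Configuration > Policies > Windows Settings > Security Settings >
    #   Account Policies > Kerberos Policy > "Maximum lifetime for service ticket"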

 

It appears that when the first user logs in, the software layers are attached as that user, and even after they log off this remains the access method for all users on the server; when the service ticket expires, this disconnects all users from the .vhd.

Support have told me they can't replicate this; however, I have managed to reproduce it every time. It's also worth noting that the default for "Maximum lifetime for service ticket" is 10 hours (600 minutes), and it is set in the default domain policy on every single Windows domain, so I'm not sure how it either can't be replicated or I am the only person to have ever seen this issue.

 

Whilst the above does work, it does decrease security on the domain, and some people might have an issue with that; in our case, increasing this from 10 to 24 hours is not that big a problem. If you were only doing weekly reboots, then having a service ticket lifetime of a week would be a real problem.

I could be wrong with the above, as support can't really give me a straight answer, but it all seems to fit, and since we made the change it has resolved the issues for us completely. I also appreciate that this is not "App Layering" as such, but it relies on underlying Windows tech, and we could not find any mention of this as an issue in any support documentation. If this is the case, it seems a weak point in the product that it uses the first user's access token to connect the .vhd rather than the actual logged-in user, which might prevent this happening.

 

If anyone has any other ideas, or if anything above is not correct, we are happy for suggestions!

Thanks

Phil


Hi Rob,

 

We are using Server 2016; however, we have tested 2019 and the same behavior is exhibited. This only happens when the OS is used as multi-session; when we use the OS as "server VDI" we don't see this behavior.

 

This happens with both machine-assigned and user-assigned layers, as the mount process seems to be the same from a Windows point of view.

 

thanks

Phil 


I left the session connected with the second user overnight. This morning I checked the file server, and the connection for the second user was still there. I connected as that user, and my app was still running and the layer files were still accessible. I am not sure what is different between our environments. I will also test with machine account assignments in case that behavior is different; I imagine it might be.

 

[Screenshot: open SMB connections on the file server]

 


Same results with the machine account. The file share connection for the initial user who had logged out is gone, but the still-logged-in user's connection is still active, the application is still open, and the disk is properly mounted.

 

I don't have that many GPOs applying, but maybe it's worthwhile setting up a test with a minimal set of GPOs, just the ones needed to set up a terminal server, in case that is the difference.


Hi,

 

I'm not sure you are running the same test as me. The issue seems to arise because the disk is attached as the first user logged in to the server, and this mount path is then used for all subsequent users. Once the 10-hour limit is reached, the disk disconnects, disconnecting all users. If you have a connected session at that time, the application dies.

 

Our test initially was not using App Layering:

 

1. Log in to a server (without App Layering).
2. Mount the .vhd from the file server.
3. Log off the server. You will then see the .vhd is mounted as that user, even though they are logged off, so this must be maintained by a service/process.
4. The server resets the idle timer (as seen in the open files in Computer Management) after 10 minutes, so this never reaches the SMB timeout of 15 minutes.
5. After 10 hours, the process tries to keep the SMB connection open, but this fails as the ticket has expired, and an anonymous logon is observed on the file server.

 

As the first user to log in appears to "own" the .vhd, once the above happens the disks drop. They can be re-established by logging out and back in, as that remounts the .vhd, but if a user is logged in and active, the application dies.

 

thanks

Phil 


I have carried out testing in our environment and can confirm that, with actual user connections and machine-allocated elastic layers, all elastic layers show as sessions opened by the initial user's session. Subsequent user sessions do not establish open file sessions, as witnessed on the SOFS cluster via the MMC.

When the initial user logs off and the token expires, the elastic layers disconnect. (In our environment this is very frequent due to a 60-minute max ticket lifetime.)

 

Can I suggest that this is a product limitation due to the way the elastic layers are attached? Would it not be better to have the machine elastic layers attached via either the computer account or at least a service account?
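As a rough illustration of what I mean (not how the product currently works): if the disk is attached from the SYSTEM context, the file server sees the computer account rather than a user's Kerberos ticket. PsExec is from Sysinternals, and the names and share path below are placeholders:

    # Mount the layer VHD as SYSTEM; the file server then shows the
    # machine account (e.g. DOMAIN\VDA01$) on the SMB session, not a user.
    psexec -s powershell -Command "Mount-DiskImage -ImagePath '\\fileserver\Layers\TestLayer.vhd' -Access ReadOnly"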


Just tested with user-assigned elastic layers; I can confirm that we see the initial user establish the connections to both the machine-assigned and user-assigned elastic layers.

A subsequent user only establishes connections to elastic layers not already attached by the initial user, whether machine- or user-assigned.

 

I expect that in many environments with uniform shifts (i.e. 9-5), the default 10-hour token life is fine. In environments with frequent login/logout the issue may not be noticed often, as a login triggers a re-attach of the disconnected layers. And if users are in the habit of letting sessions idle to disconnect, then as long as the initial user still has a session (connected or disconnected), the token will renew.

 

If you enforce idle-to-disconnect and disconnect-to-logoff session limits, you are more likely to be impacted.


Attached is a screenshot of sessions on the SOFS cluster for two users. The initial user attaches all of the machine-assigned elastic layers, plus their user-assigned layers (for the test I removed Edge, Chrome and Firefox from machine assignment and assigned the initial user Chrome and Edge).

The subsequent user only establishes a session for the additional user-assigned elastic layer for Firefox, but can access the already-attached Edge layer and any machine layers, though not Chrome (as expected).

 

[Screenshot: SOFS cluster sessions for the two test users]

 

The only other thing I am now wondering is whether the difference in behavior between what we are seeing and what Rob is seeing is to do with the presentation of the share. I currently have the SOFS cluster presented via a DFS namespace.

5 hours ago, Michael Shirtcliff said:

I have carried out testing in our environment and can confirm that, with actual user connections and machine-allocated elastic layers, all elastic layers show as sessions opened by the initial user's session. [...]

 

Hi, 

 

I'm glad someone else is seeing this. My thoughts are exactly that: it is a limitation of the product and really needs to be looked at. I had a case open with support that they were hell-bent on closing (you would think they would be concerned about a potential issue with the product!).

 

Currently I have set the token lifetime to 24 hours, and we reboot the servers daily; however, as one of the environments is 24x7, this is not ideal to have to do for every server every day.

 

I'm not sure Citrix believes me that this is a problem, but I see this hitting every multi-session OS deployment using App Layering.

 

thanks 

Phil 

On 4/8/2020 at 3:01 PM, Phil Taylor said:

I'm not sure you are running the same test as me. [...]

 

Hi Rob,

 

Have you tried the above without using App Layering? This problem happens when Windows (without App Layering) mounts the .vhd. As you seem to be using the initial user's access to mount the disk, this appears to be the problem we run into.

 

I'm happy to try a variety of suggestions and supply further details, albeit this happens on a domain with no amendments to the default domain policy, using a file server and client machine with no policy or AV.

 

thanks 

Phil 


Phil and everyone,

 

I have not stopped working on this. After meeting with our engineers, I realized that what I found in my testing was not possible, so I went back and tested again. I need to apologize, because I must have somehow been testing across two VDAs when I thought I was restricting it to one.

 

In my latest test I get the same symptoms you are having.  I am actively working with engineering on it.

 

Rob
