Recently, I worked with a healthcare customer experiencing excessive logon times (around 50 seconds) that far exceeded the industry norm of under 30 seconds. This article details the troubleshooting process we used to identify and resolve the issue. Although this specific issue will not be common, the troubleshooting methodology can be applied to other logon time issues.
Starting with Director
In a prior article (How to Speed Up Logon Times) I mentioned using Director as an invaluable tool for understanding the logon time landscape and narrowing the focus for troubleshooting. In this case, Director's dashboard showed a site-wide logon performance of 55 seconds.
Drilling down into the “Trends > Logon Performance” section allowed us to quantify times for the logon sub-phases by hovering over the bar graph:
Here, we can see an average of the sub-phases, which are described in detail in this article:
- Brokering: 0.09s
- HDX Connection: 5.32s
- Authentication: 0.04s
- GPOs: 5.57 seconds
- Logon Scripts: 1.47s
- Profile Load: 4.78s
- Interactive Session: 13.17s
Often, we can easily target one of these top areas to improve logon times, such as long profile load or GPOs. If so, we focus on leading practices in those respective areas. However, in this particular scenario, each segment is pretty fast. Also, note that Interactive Session consists of multiple phases which overlap with GPOs so we can kind of exclude that 5.57 seconds since it’s double counted. So adding up the individual times doesn’t make much sense because the total (24.87s) is much lower than the 52 or 55-second average logon duration. What’s happening?
Focusing on the Middle Rather Than the Worst
Under Trends > Logon Performance, it is tempting to look at the default sort by longest logon to shortest logon and start trying to address the worst offenders.
In my opinion, this isn’t the right approach. We actually want to address the “typical” user and bring the average time down. Then, we can start looking at outliers. To better illustrate this, I ran a Custom Report in Director, exported the logon time duration for a given period, and graphed it in Excel. I prefer to focus on those 80-90% of the sample set and bring their logon time down rather than chase what’s happening on the right. So instead, I would like to filter on a 2-hour sample set in Director and sort by Logon Time and start looking at the data that way.
Utilizing the New Director Timeline View
DaaS and CVAD 2311+ have a new logon timeline view under User Details view > Session Logon, which helps visualize what is happening. By selecting some sample users, we determined that there is a 25-second gap between HDX Connection and Profile Load and that Authentication is only a blip in that. We also spot-checked several other users and saw the same behavior. On the other hand, a couple of users did not have the same gap. Now we have something valuable: comparing a “working” vs “non-working” scenario.
Testing Some Launches
With a test user account, we tried to see what was happening. As we launched the app, we could see Workspace App would sit on this ‘Welcome’ message for 20 or so seconds, which gave us some clues.
We also tried performing an RDP to the server, and interestingly enough, we saw the server sitting there for around 20 seconds as well.
At this time, we think this isn’t an endpoint issue but rather a server-side problem. We can try out a few things:
- Collect logs and engage Citrix and Microsoft support
- Build a new clean image and test, and add applications until the delay occurs again
Just as importantly, we ruled out some paths:
- Don’t need to mess around with profiles because this delay is pre-profile
- Don’t need to adjust GPOs because those also kick in after
- Don’t need to change the Workspace app version because this seems to happen with both with Citrix and RDP
Detailed Logs Needed
We decided to collect logs for the “working” and “non-working” scenarios to get more granular:
- Event logs from the VDA
- CDF Trace on the VDA - https://support.citrix.com/s/article/CTX111961-citrix-cdfcontrol-tool?language=en_US
- Windows Performance Recorder on the VDA - https://learn.microsoft.com/en-us/windows-hardware/test/wpt/windows-performance-recorder
- Procmon on the VDA
The best course of action is to work with vendor support to analyze the logs through support cases and get closer to the resolution. Here, we decided to do some DIY investigation and see if we could come up with anything.
Combing through Logs
Since this pointed at a VDA-side issue, we explored Windows Performance Recorder to look at the Windows-level process flow. I’m a beginner using this tool, but after setting up WPA and looking at the process timeline, we can zoom into the area between the start and when the published app starts loading. In this case, I can find a common .exe that runs briefly each time (mpnotify.exe) and refine the timeline:
“Non-working” Session:
“Working” Session:
Okay, so for some reason, processes like csrss.exe, winlogon.exe, LogonUI.exe, dwm.exe, and CtxGfx.exe are 30 seconds or 40 seconds longer. It doesn’t really point to an issue, but by comparing through WPA, I could also narrow the focus a little bit more.
Next, I pulled up procmon to try and find anything relevant. Procmon is also a haystack of data, so trial and error on various filters is the only method I know of:
- Filter out “Buffer Overflow”
- Exclude processes like procmon.exe and others that don’t seem like offenders.
- Filter to only file system or registry
- Narrow the timeline if possible, like only events more mpnotify.exe
- Try Including only specific processes (maybe winlogon.exe or LogonUI.exe in this case)
- Search identifying information in the Path or Details
Resolution
As luck would have it, I tried out a filter to search the Details field to contain the username and something caught my eye. There was a process that retrieved a registry value with the user’s email. I then broadened that filter to include that entire registry path. We can even use procmon’s Activity Summary to look at the timeline, which shows that lsass.exe is spending 22 seconds longer doing something between non-working and working.
By now, we’ve really narrowed our attention to what’s happening, which is an issue related to Entra ID authentication. It turns out this customer had hybrid joined their VDA to Entra ID, although this wasn’t necessary for their use case. By unjoining their servers from Entra ID, the logon time delay went away, and all users accessing the application shaved 25 seconds from their logon time!
Conclusion
Will this particular issue resolution help you? Likely not. The reasons for long logon times are plentiful; sometimes, they can be addressed by big hitters like Profile load, GPO, logon scripts, or resource constraints. However, an obscure issue will often require a deep dive into the particular environment. I hope walking through this particular case study can be instructive for others as they seek to improve logon times.
- 1
Recommended Comments
Create an account or sign in to comment
You need to be a member in order to leave a comment
Create an account
Sign up for a new account in our community. It's easy!
Register a new accountSign in
Already have an account? Sign in here.
Sign In Now