VDA Reboot Cycle | Server Stuck at Reboot with StreamService timeout


Question

Hi all,

Since December we have had a recurring issue at the Citrix farm of one of our customers.

We run an automated nightly reboot of the VDAs using a Broker reboot schedule (BrokerRebootScheduleV2).

Issues:

From time to time, one or two servers fail to boot and are offline in the morning.

On the DDC that issues the power commands for the schedule, the following events are logged:

Event 3015: Citrix Broker Service

The Citrix Broker Service marked a power action as failed because virtual machine 'DOMAIN\HOSTNAME' failed to register with the site after being started, restarted, or resumed.

Event 3013: Citrix Broker Service

The Citrix Broker Service successfully performed power action 'Shutdown' (origin: Policy) on virtual machine 'DOMAIN\HOSTNAME'.

The shutdown happens one hour after the reboot because MaxRegistrationDelayMin is set to 1 hour.

So the VDA apparently needs more than one hour to load the vDisk (or something similar) and is shut down because it has not registered within that hour.

In the morning we can power the server on manually and it boots normally.

This happened almost daily before I changed these registry values on the DDC:
- MaxShutDownDelayMin and MaxRegistrationDelayMin
Now it sometimes does not occur for a month, and then in another week it occurs three days in a row.

There is no visible pattern in the day it happens or the server it happens on (the customer has 3 VDA servers).

 

__________________________________________________

System:
Citrix VDA:
OS: Server 2019 1809

Hypervisor: HP Moonshot 1500 Chassis

Cartridges: ProLiant m710x Server Cartridge

Citrix Version: 1912 LTSR CU7

Windows Updates: Up to date

 

PVS Server:

OS: Server 2019 1809

vCenter Vsphere ESXi

Citrix Version: 1912 LTSR CU7

 

Delivery Controller:

OS: Server 2019 1809

vCenter Vsphere ESXi

Citrix Version: 1912 LTSR CU7

 

We have other customers with the same ESXi / Moonshot setup who are not facing this issue.

For antivirus we use Symantec Endpoint Protection, with rules created according to the Citrix best-practice article:

https://knowledge.broadcom.com/external/article/177425/citrix-and-terminal-server-best-practice.html

 

_________________________________________________

 

Reboot Schedule

Frequency: Daily

RebootDuration: 60

StartTime: 02:00:00

Send notification to users: Do not send a notification
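
To make the timing concrete, here is a minimal Python sketch of when a VDA that never registers would be shut down again. It assumes (not confirmed in this thread) that the Broker may issue each restart anywhere inside the RebootDuration window and that the registration delay is counted from the moment the restart is issued. The function name and the calendar date are made up for illustration.

```python
from datetime import datetime, timedelta


def forced_shutdown_window(start_time, reboot_duration_min, max_registration_delay_min):
    """Return (earliest, latest) time at which an unregistered VDA is shut down.

    Assumption: restarts are spread over the reboot duration and the
    registration timer starts when the restart is issued.
    """
    delay = timedelta(minutes=max_registration_delay_min)
    first_restart = start_time
    last_restart = start_time + timedelta(minutes=reboot_duration_min)
    return first_restart + delay, last_restart + delay


# Values from the schedule above: StartTime 02:00:00, RebootDuration 60,
# MaxRegistrationDelayMin 60. The date itself is arbitrary.
start = datetime(2023, 5, 4, 2, 0, 0)
earliest, latest = forced_shutdown_window(start, 60, 60)
print(earliest.strftime("%H:%M"), latest.strftime("%H:%M"))
```

Under these assumptions, the Event 3015 / 3013 pair for a stuck machine should appear between 03:00 and 04:00, which can be compared with the timestamps in the DDC event log.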

 

 

Troubleshooting:

Steps tried so far:
- Deleted the BrokerRebootScheduleV2 with PowerShell and recreated it

- Removed the notification from the schedule completely

- Set MaxShutDownDelayMin and MaxRegistrationDelayMin to 60 min, which is the maximum

- Updated the Citrix farm and VDAs to 1912 LTSR CU7 (previously on CU6)

- Changed the restart duration from "All machines at the same time" to "Within 30min" and then to "Within 1 hour"

 

Today I did another deep dive and was able to reproduce the problem on a production machine that had been shut down in the morning.

I booted it and the machine was stuck for over 1 hour on the screen shown in the attached picture:

Server stuck at boot

 

I also did some research on PVS logging and found that the logs can be analyzed with CDFControl.

I exported the trace to Excel, but I'm not sure how to share the logs here.

I can see several kinds of timeouts in these logs:

___________________________________________________________

DiskVersionInfoGet succeeded for deviceIP at 10.100.x.x:6901, for base disk id = 5, version number = 9

CManagedVdiskImpl constructed for vdisk path <D:\XenApp\AIR_CVAD_v2_0_M365.9.vhdx>

name = <AIR_CVAD_v2_0_M365.9.vhdx>, folder = <D:\XenApp>

AcquireLock succeeded for device (airctx02), MAC=944XXXXXXXXX, DiskId=589829 (base disk id = 5, version number = 9)

constructed fallback CacheFullFilename_ <D:\XenApp\WriteCache\airctx02_AIR_CVAD_v2_0_M365.9.vhdx> for device <airctx02>

Full name of cache file is <D:\XenApp\WriteCache\airctx02_AIR_CVAD_v2_0_M365.9.vhdx> for device <airctx02>

Attached vdisk <AIR_CVAD_v2_0_M365.9.vhdx> and its cache for device <airctx02>. CacheMode_: 9

Send: LS_UEFI_SERVICE_TLV_REPLY to device at 10.100.x.x:6901: Name=airctx02, ClassName=<NULL>, ID=4

Service granted for UEFI device airctx02, IP:10.100.x.x:6901, Reconnect: 0, ContextReuse: 0

Recv: LS_TLV_OPTIONS_REQUEST from device at 10.100.x.x:6901.

DeviceStatusGetByGuid succeeded for device: Name=airctx02, MAC=944XXXXXXXXX, ID=4

DeviceStatusAdd succeeded for device: Name=airctx02 MAC=944XXXXXXXXX, ID=4

  makLicenseActivated = 00

Send: LS_TLV_OPTIONS_REQUEST to device at 10.100.x.x:6901

Timeout on device airctx02, delete context

Detached vdisk <AIR_CVAD_v2_0_M365.9.vhdx> and its cache from device <airctx02>.

ReleaseLicense failed for Device: Name=airctx02 MAC=944XXXXXXXXX Status=LICMGR_NOT_STARTED

DeviceStatusGetByGuid succeeded for device: Name=airctx02, MAC=944XXXXXXXXX, ID=4

DeviceStatusDelete succeeded for device: Name=airctx02 MAC=944XXXXXXXXX, ID=4

RemoveContext 

____________________________________________________________
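
Since the export is too large to share, a per-device count of these timeouts may be easier to post. The following is a minimal sketch, assuming the CDFControl export has been saved as plain text with one message per line. The function name is hypothetical and the pattern is based only on the "Timeout on device" line quoted above.

```python
import re
from collections import Counter

# Matches the message format seen in the excerpt above.
TIMEOUT_RE = re.compile(r"Timeout on device ([^,\s]+), delete context")


def count_timeouts(lines):
    """Count 'Timeout on device' messages per device name."""
    counts = Counter()
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Three lines copied from the excerpt above.
sample = [
    "Send: LS_TLV_OPTIONS_REQUEST to device at 10.100.x.x:6901",
    "Timeout on device airctx02, delete context",
    "Detached vdisk <AIR_CVAD_v2_0_M365.9.vhdx> and its cache from device <airctx02>.",
]
timeouts = count_timeouts(sample)
print(dict(timeouts))
```

Run over a full night of logs, this would show whether the timeouts cluster on particular devices or are spread evenly.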

The same log also contains these messages, which look a bit strange:

Send: LS_GET_PERSONALITY_REPLY for device at 10.100.XXX.XXX:6905: Name=airctx03 MAC=54802XXXXXX guid=a72b5ce3-8510-48d9-9f27-ff6c89bff298

[airctx03: [wcHD] Searching for local HDs qualified for write cache usage]

[airctx03: [wcHD] WcLocateCacheDrive: WcHDQueryVolume for (\Device\Harddisk0\Partition1\) reported  Size = 244195 MB, Free space= 239709 MB]

[airctx03: [wcHD] WcLocateCacheDrive: disk 0 (\Device\Harddisk0\Partition0), partition 2 has no PartitionNumber, skipping...]

[airctx03: [wcHD] WcLocateCacheDrive: disk 0 (\Device\Harddisk0\Partition0), partition 3 has no PartitionNumber, skipping...]

[airctx03: [wcHD] WcLocateCacheDrive: disk 0 (\Device\Harddisk0\Partition0), partition 4 has no PartitionNumber, skipping...]

[airctx03: [wcHD] Redirect dump file also set Path: D]

[airctx03: Disk 0/partition 1 is qualified (239709 MB free space, smaller than vdisk size 244199 MB) and has been setup for local write cache]

[airctx03: [MIO:3] MIO_READ_REPLY - invalid request id (reply: 27, expecting: 29)]

[airctx03: [MIO:2] MIO_READ_REPLY - invalid request id (reply: 32, expecting: 34)]

[airctx03: [MIO:2] MIO_READ_REPLY - invalid request id (reply: 32, expecting: 34)]

[airctx03: [MIO:3] MIO_READ_REPLY - invalid request id (reply: 40, expecting: 42)]

[airctx03: [MIO:3] MIO_READ_REPLY - invalid request id (reply: 40, expecting: 42)]

[airctx03: [MIO:2] MIO_READ_REPLY - invalid request id (reply: 41, expecting: 43)]

[airctx03: [MIO:2] MIO_READ_REPLY - invalid request id (reply: 55, expecting: 57)]

[airctx03: [MIO:2] MIO_READ_REPLY - invalid request id (reply: 66, expecting: 68)]

[airctx03: [MIO:3] MIO_READ_REPLY - invalid request id (reply: 90, expecting: 91)]

[airctx03: [MIO:3] MIO_READ_REPLY - invalid request id (reply: 90, expecting: 91)]

[airctx03: [MIO:3] MIO_READ_REPLY - invalid request id (reply: 90, expecting: 91)]

[airctx03: [MIO:2] MIO_READ_REPLY - invalid request id (reply: 89, expecting: 92)] 
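
To see how far the replies lag behind, the messages above can be grouped by device and MIO tag. This is a minimal sketch, assuming one message per line in the exact format shown. What the number after "MIO:" means is not documented in this thread, so it is treated only as an opaque tag, and the function name is hypothetical.

```python
import re
from collections import defaultdict

MIO_RE = re.compile(
    r"\[(?P<device>[^:\]]+): \[MIO:(?P<tag>\d+)\] MIO_READ_REPLY - invalid request id "
    r"\(reply: (?P<reply>\d+), expecting: (?P<expecting>\d+)\)\]"
)


def summarize_mio_mismatches(lines):
    """Per (device, MIO tag): number of mismatches and largest expecting-reply gap."""
    summary = defaultdict(lambda: {"count": 0, "max_gap": 0})
    for line in lines:
        match = MIO_RE.search(line)
        if match is None:
            continue
        key = (match["device"], int(match["tag"]))
        gap = int(match["expecting"]) - int(match["reply"])
        summary[key]["count"] += 1
        summary[key]["max_gap"] = max(summary[key]["max_gap"], gap)
    return dict(summary)


# Four of the lines quoted above.
sample = [
    "[airctx03: [MIO:3] MIO_READ_REPLY - invalid request id (reply: 27, expecting: 29)]",
    "[airctx03: [MIO:2] MIO_READ_REPLY - invalid request id (reply: 32, expecting: 34)]",
    "[airctx03: [MIO:3] MIO_READ_REPLY - invalid request id (reply: 90, expecting: 91)]",
    "[airctx03: [MIO:2] MIO_READ_REPLY - invalid request id (reply: 89, expecting: 92)]",
]
summary = summarize_mio_mismatches(sample)
for key, stats in sorted(summary.items()):
    print(key, stats)
```

In the excerpt the reply id is always one to three behind the expected id. That could fit replies arriving late, but this is only a guess from the numbers and not a confirmed diagnosis.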

 

 

Does anybody have experience with this kind of timeout?

If you need more logs or something is missing, please ask.

Thanks and kind regards

Pascal

 

 

6 answers to this question

On 6/29/2023 at 2:29 PM, Pascal Heinz1709163266 said:

Yeah, same. I was away from work for 3 weeks, but sadly there was no progress from Citrix in that time.

OK, I have an update now.
Citrix support told me to increase the StuckOnBoot fault timeout by creating the following registry value on my DDC server:

HKEY_LOCAL_MACHINE\Software\Citrix\DesktopServer\MaxTimeBeforeStuckOnBootFaultSecs

DWORD Value: 30000
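
For a sense of scale, here is a small Python sketch that converts this value to hours and minutes. The post does not say whether 30000 was entered as decimal or hexadecimal in the registry editor, so both readings are shown. The helper name is made up for illustration.

```python
def describe_timeout(seconds):
    """Format a number of seconds as hours, minutes and seconds."""
    hours, remainder = divmod(seconds, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours} h {minutes} min {secs} s"


as_decimal = 30000           # value read as decimal
as_hex = int("30000", 16)    # same digits read as hexadecimal

print("decimal:", describe_timeout(as_decimal))
print("hex:    ", describe_timeout(as_hex))
```

Read as decimal, the value is 8 h 20 min, well above the 60-minute registration delay mentioned in the question.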

 

After this I had to restart the Citrix Broker Service and the Citrix Host Service.
This was almost two weeks ago, and the issue has not shown up again since.

Fingers crossed!

7 hours ago, Pascal Heinz1709163266 said:

No, not yet. I have a ticket open with Citrix and we are doing a remote session this week.

Do you have a similar infrastructure as mentioned in the case here?

I am seeing similar behavior; my servers are on AWS. I am working with Citrix support too, but no luck yet.


Any update? I worked with Citrix support and they are "reviewing" hypervisor logs...

On 5/4/2023 at 10:02 AM, Pascal Heinz1709163266 said:

