NFS randomly timing out

Usuario Sistemas · August 24, 2023

Hello everyone,

We are facing a problem with a new pool. The pool consists entirely of Dell R720 servers, all fully up to date, with identical hardware and firmware on every host. The hosts run the latest stable version of XCP-ng, and the VMs run the latest stable version of Debian 12.

Our method is the following: we deploy a virtual machine from a custom template that we standardized beforehand (basic Debian 12 + cloud-init cloud-initramfs-growroot). Once the machine is deployed, we install and configure some additional software that we provide, and we configure an IP address. All of this is done through a custom ISO that contains the user-data and meta-data files for cloud-init. The ISO is served by an NFS server, which is connected to the pool through an SR configured as an NFS ISO Library. As soon as the cloud-init process finishes, the same script ejects the ISO ("eject" command) and reboots the VM ("/sbin/reboot" command). This is the point at which our error appears.

We have found that the error described below affects all VMs on one random host in the pool.

The moment the VM's OS issues the reboot, XCP-ng shows a blank console (white screen), which prevents any further operation through the console. The only actions available (through the buttons above the console) are "Force Reboot", "Force Shutdown", and "Pause", but none of these buttons has any effect.

VMs on the other hosts, on the other hand, work just fine (the process explained above finishes correctly and the VM is operational).

Note that we use this same method on other pools, which work correctly, so we are only experiencing this error in this pool.

While investigating the issue, we noticed that timeouts appear repeatedly in the affected host's logs, and only when the ISO needs to be ejected:

/var/log/kern.log
Aug 21 12:40:01 (host) kernel: [13417.136313] nfs: server (IP) not responding, timed out.

Once this message has been logged, running "df -h" to check the mount point hangs and never finishes. Running "strace df -h" shows that it is stuck trying to reach the mount point itself:

[12:41 (host) ~]# strace df -h
[...]
stat("/run/sr-mount/e4fcedb2-33f6-7f0f-8263-75416dd4412a",
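
Since "df -h" blocks indefinitely on the stale mount, a bounded probe may be easier to work with while debugging. The following is only a minimal sketch: check_mount is a hypothetical helper name, and it assumes the GNU coreutils "timeout" and "stat" commands are available on the host. It sends KILL rather than TERM because a process blocked on NFS may not react to gentler signals.

```shell
# check_mount PATH [SECONDS]  (hypothetical helper, not part of XCP-ng)
# Runs stat on PATH and gives up after SECONDS (default 5) instead of hanging.
check_mount() {
    # $1 = path to probe, $2 = deadline in seconds
    if timeout -s KILL "${2:-5}" stat -t "$1" >/dev/null 2>&1; then
        echo "responsive: $1"
    else
        # Either the stat timed out or the path does not exist.
        echo "unresponsive or missing: $1"
        return 1
    fi
}
```

For the SR in question, this could be called as: check_mount /run/sr-mount/e4fcedb2-33f6-7f0f-8263-75416dd4412a 5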

In Xen Orchestra, the SR shows as correctly attached, with no errors whatsoever.

We also checked connectivity between the host and the NFS server, and each can reach the other correctly.

The thing is, before we start installing VMs, we can see that the SR is mounted:

[12:27 (host) ~]# df -h
Filesystem                              Size  Used Avail Use% Mounted on
devtmpfs                                3.9G   20K  3.9G   1% /dev
tmpfs                                   3.9G  272K  3.9G   1% /dev/shm
tmpfs                                   3.9G   11M  3.9G   1% /run
tmpfs                                   3.9G     0  3.9G   0% /sys/fs/cgroup
/dev/sda1                                18G  3.4G   14G  21% /
xenstore                                3.9G     0  3.9G   0% /var/lib/xenstored
/dev/sda5                               3.9G  481M  3.2G  13% /var/log
(IP):/var/ISO                            33G   23G  9.1G  71% /run/sr-mount/e4fcedb2-33f6-7f0f-8263-75416dd4412a <-- Mount point
tmpfs                                   792M     0  792M   0% /run/user/0

Thanks in advance,

Regards.
