
high loss of performance on xenapi actions


Jimmy Koerting

Question

Hi everyone!

 

We are running 4 XenServer pools, each with >10 dual-Xeon hosts, plenty of RAM, SSDs for dom0 and SSD-only NFS storage, 10 Gbit/s networks, etc. So you could say hardware is not a bottleneck. The number of VMs and VDIs is also roughly equal across all pools. One pool runs XS 6.5, three run XS 7.2. They all run without serious problems.

Since last year, one of the 7.2 pools has been developing more and more performance problems on every action that involves dom0/xenapi: sr-scan, vm-reboot, live migration and so on.

The VMs themselves are still very fast, and no performance problems show up inside the domUs at all.

 

What I did:

  • restarting toolstacks
  • restarting ALL hosts in the pool
  • checking ntpd/date/time on each host

What I checked:

  • xapi on the master sits at about 120% CPU utilisation, which matches the other pool masters that have no problems
  • overall dom0 load is between 1.0 and 1.8, which also matches the other pool masters

 

What I found:

 

One of the most obvious things is that an sr-scan takes a really long time on all SRs in the affected pool. I tracked this down and found in /var/log/SMlog that the 'Updating VDI with location=xxxxxx' entries for each VM on the SRs in that pool take much longer (a few seconds each) than the ones logged during SR scans in the other pools.

Digging deeper, it looks like the _db_update() function in /opt/xensource/sm/VDI.py takes most of the time - especially these two lines:

 

        self.sr.session.xenapi.VDI.set_virtual_size(vdi, str(self.size))
        self.sr.session.xenapi.VDI.set_physical_utilisation(vdi, str(self.utilisation))

 

To me it looks like something is wrong with writing into the xenapi database(?) and that this takes too much time - unless retrieving the .size/.utilisation values is not just reading a value but calling another function that takes longer than on the other pools...
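As a rough cross-check of that theory, one could time a pure xapi database write from dom0 and compare it with the same command on one of the healthy pools (just a sketch; the UUID is a placeholder for some test VDI):

    # Time a pure xapi database write, bypassing the storage backend entirely
    VDI_UUID=<uuid-of-a-test-vdi>     # placeholder
    time xe vdi-param-set uuid=$VDI_UUID name-description="db-write-test"
    # On a healthy pool this should return almost instantly.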

 

Does anyone have an idea what is going wrong here?

As I have two other 7.2 pools with more or less equal (and even older) hardware and similar usage, I'm sure it is not a general bug or the like. At some point an error must have occurred that corrupted the database or something else xenapi depends on. Could that be it?

 

Is there a way to raise the logging level? Any general things to try in such situations?

 

Many thanks in advance to everyone who has an idea on this!

 

Jimmy


19 answers to this question


I'd first check whether dom0 has adequate CPU and memory resources. Run top and xentop to monitor performance. iostat will show you if there are any storage bottlenecks in terms of overly long wait states, excessive read/write times, and the like; running these tools during times such as a VM migration vs. a more static time may help spot differences. Also check your logs to ensure no obvious errors are showing up in them. Make sure NTP is set up properly and synchronized with small time differences (under around 200 msec).
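Roughly, something along these lines run in dom0 during a migration vs. a quiet period (intervals and the log path are just examples):

    xentop -b -i 2                   # batch snapshot of dom0/VM CPU usage
    iostat -x 10 6                   # extended device stats, six 10-second samples
    ntpq -pn                         # peer offsets; should stay well under ~200 msec
    grep -iE "error|fail" /var/log/xensource.log | tail -n 50   # obvious xapi errors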

 

-=Tobias


Tobias,

 

thank you very much for looking into this! Do you have historical data/examples of how such bottlenecks (e.g. overly long wait states) show up? I'm not sure I can interpret the numbers correctly when I see them :) Which parameters of iostat, xentop etc. should I use? When I run them, I see no clear differences between the slow pool and the other pools at first sight :(

NTP is set up properly on all nodes and the differences are clearly below 200 msec - ntpd/time synchronisation was my first approach :)

 

Jimmy


Jimmy,

You can post screenshots of the various utilities and I'd be happy to take a look. Make sure they are taken during times of increased activity.

Just in general, also look for saturation in I/O; reads/writes should preferably be under 20 msec. Depending on the RAID configuration, writes can sometimes be quite a bit slower, which in turn leads to bottlenecks. For VMs, it's also good to check how many VMs you put on any single SR, as the run queue size can otherwise get too large. The rule of thumb is to limit this to around 30 or so.
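For the per-SR count, a quick sketch run from dom0 that counts VDIs per NFS SR (close enough to the number of VMs for a rough check):

    # Count VDIs on each NFS SR; the rule of thumb above is roughly 30 per SR
    for sr in $(xe sr-list type=nfs --minimal | tr ',' ' '); do
        n=$(xe vdi-list sr-uuid="$sr" --minimal | tr ',' '\n' | grep -c .)
        echo "$sr : $n VDIs"
    done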

 

-=Tobias


Tobias,

 

OK, I've collected a bunch of outputs while I ran an sr-scan (taking minutes!). I found nothing strange, except the long delay between the individual 'Updating VDI' entries in SMlog that I already mentioned and - something I found today - the tapdisk log with tons of timeouts (these were not from the time of this particular sr-scan; I can't say what action, if any, was happening when that tapdisk log was written earlier today). But I think it's not OK to have tons of timeouts on the same VHD file at nearly the same time (see timestamps). Do you know anything about that?

 

I just noticed that I have an upload limit. I will upload the other images in further comments, the tapdisk thing at the end :)

 

[screenshot attachment: 2021-01-19_18-50-59.png]

 

[screenshot attachment: 2021-01-19_18-55-55.png]

 

 


Hi, Jimmy:
Those tapdisk errors do not look normal. Can you describe in detail your SSD storage (what model number of disks, sizes, etc.) and how it's connected (iSCSI, fiber channel, etc.)?

Also, I should have asked you to run "iostat -x 10" to get extended stats; ignore the first output, which reports averages since boot and won't give you the current readings. This will give you a summary every 10 seconds until you ^C out of it. See: https://linux.die.net/man/1/iostat

 

-=Tobias


Hi Tobias,

 

OK, I did it again with iostat -x 10, but the output does not really change at all while the sr-scan is running compared to when no special dom0 tasks are active:

 

[screenshot attachment: 2021-01-19_21-33-43.png]

 

The sr-scan is just an example where it can clearly be seen that something is wrong. As you can see in one of the pictures, this pool has seven NFSv4 (10 Gbit/s) storages attached, of three different models (most with 12 Samsung 860 Pro or 12 Intel D3-S4510 (Linux), one with 16 Intel D3-S4510 (FreeNAS) and one with 12 WD Gold 12TB (Linux)). It makes no difference to the problem which storage I run the sr-scan (or other dom0 actions) against, and the exact same NFS storage types are used in the other pools, where we have absolutely no problems like this.

 

Can you tell me something about the tapdisk thing? I will check for those logs in the other pools, and if there is a difference I will look up all the VHDs the tapdisk errors refer to; if it is just a limited group of VHDs (maybe only from one SR?), that could be a hint at the source of the problem. Couldn't it?

 

I have read about other people having problems with tapdisk processes working/looping without any VHD file active. Do you know a way to check whether I have such processes? And how can I check whether I can kill such tapdisk processes without losing any data?
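The only approach I could think of so far would be something like this in dom0, but I'm not sure it is the right way to spot orphaned instances:

    tap-ctl list                     # active tapdisk instances and the VHDs they serve
    # compare against the VDIs of the currently plugged VBDs:
    xe vbd-list currently-attached=true params=vdi-uuid --minimal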

 

Many thanks again!

Jimmy

 


A small update regarding the tapdisk thing: I found such logs on all pools - also the ones which are very fast and show no problems. One finding: they happen much more often on the pool with the problem, and the timeout values are far higher than on the 'good' pools. E.g. I can find lines like this:

 

tapdisk_vbd_check_progress: vhd:/var/run/sr-mount/e866a50d-84d5-4f32-0d47-34c12145b755/7837617e-cc37-4a9d-b099-617253f94d27.vhd: watchdog timeout: pending requests idle for 5189 seconds

 

while on the good pools you never find entries with more than ~40 seconds, mostly 10-20 seconds.
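For anyone who wants to compare this on their own pools, something like this should pull out the timeout values (a sketch; on our hosts the tapdisk messages land in /var/log/daemon.log, the path may differ):

    grep -h "watchdog timeout" /var/log/daemon.log \
        | sed -n 's/.*idle for \([0-9]*\) seconds.*/\1/p' \
        | sort -n | tail -5          # the five largest timeout values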

 

No idea what can be deduced from this; maybe it is just another symptom and not the source.

 


Anything you think is of interest here? Uptime is just 4 days.

 

What about '11332 packets rejects in established connections because of timestamp'? Should I disable 'net.ipv4.tcp_timestamps' for testing?

 

 

Ip:
    1535540987 total packets received
    81899 with invalid addresses
    0 forwarded
    0 incoming packets discarded
    1507475103 incoming packets delivered
    1401137575 requests sent out
    24 dropped because of missing route
Icmp:
    3828 ICMP messages received
    3779 input ICMP message failed.
    ICMP input histogram:
        destination unreachable: 66
        echo requests: 5
    115 ICMP messages sent
    0 ICMP messages failed
    ICMP output histogram:
        destination unreachable: 113
        echo replies: 2
IcmpMsg:
        InType3: 66
        InType8: 5
        InType9: 3757
        OutType0: 2
        OutType3: 113
Tcp:
    613280 active connections openings
    1221516 passive connection openings
    15 failed connection attempts
    3817 connection resets received
    223 connections established
    1507459604 segments received
    4139246400 segments send out
    3351787 segments retransmited
    1 bad segments received.
    6862 resets sent
Udp:
    11669 packets received
    2 packets to unknown port received.
    0 packet receive errors
    12026 packets sent
    0 receive buffer errors
    0 send buffer errors
UdpLite:
TcpExt:
    8 ICMP packets dropped because they were out-of-window
    613143 TCP sockets finished time wait in fast timer
    11332 packets rejects in established connections because of timestamp
    13136314 delayed acks sent
    23670 delayed acks further delayed because of locked socket
    Quick ack mode was activated 209130 times
    91114589 packets directly queued to recvmsg prequeue.
    21859331886 bytes directly in process context from backlog
    41582128506 bytes directly received in process context from prequeue
    693302589 packet headers predicted
    1625442 packets header predicted and directly queued to user
    238369833 acknowledgments not containing data payload received
    501430543 predicted acknowledgments
    497121 times recovered from packet loss by selective acknowledgements
    Detected reordering 226 times using FACK
    Detected reordering 46221 times using SACK
    Detected reordering 1262 times using time stamp
    12321 congestion windows fully recovered without slow start
    6397 congestion windows partially recovered using Hoe heuristic
    46779 congestion windows recovered without slow start by DSACK
    41 congestion windows recovered without slow start after partial ack
    TCPLostRetransmit: 9218
    22582 timeouts after SACK recovery
    37 timeouts in loss state
    2122263 fast retransmits
    655922 forward retransmits
    159901 retransmits in slow start
    107 other TCP timeouts
    TCPLossProbes: 465587
    TCPLossProbeRecovery: 9473
    2336 SACK retransmits failed
    1 times receiver scheduled too late for direct processing
    209365 DSACKs sent for old packets
    155 DSACKs sent for out of order packets
    692469 DSACKs received
    727 DSACKs for out of order packets received
    2493 connections reset due to unexpected data
    1561 connections reset due to early user close
    25 connections aborted due to timeout
    TCPDSACKIgnoredOld: 115
    TCPDSACKIgnoredNoUndo: 326382
    TCPSpuriousRTOs: 23
    TCPSackShifted: 645306
    TCPSackMerged: 1090472
    TCPSackShiftFallback: 4975662
    TCPBacklogDrop: 1
    TCPRetransFail: 2585
    TCPRcvCoalesce: 75494158
    TCPOFOQueue: 10110314
    TCPOFOMerge: 156
    TCPChallengeACK: 18
    TCPSYNChallenge: 1
    TCPSpuriousRtxHostQueues: 5825
    TCPAutoCorking: 127221598
    TCPFromZeroWindowAdv: 251
    TCPToZeroWindowAdv: 254
    TCPWantZeroWindowAdv: 29870
    TCPSynRetrans: 68
    TCPOrigDataSent: 3780435780
    TCPHystartTrainDetect: 18916
    TCPHystartTrainCwnd: 529240
    TCPACKSkippedPAWS: 10519
    TCPACKSkippedSeq: 23344382
    TCPWinProbe: 232
    TCPKeepAlive: 10970
IpExt:
    InMcastPkts: 14937
    InBcastPkts: 484193
    InOctets: 2016686677260
    OutOctets: 5080109183302
    InMcastOctets: 493012
    InBcastOctets: 210457654
    InNoECTPkts: 2367782669
    InECT0Pkts: 15
    InCEPkts: 7


Looks like a lot of retransmissions. Do you see the same on all hosts or just this one? If just the one host, I'd suspect bad Ethernet cabling running to that host. If on all, it may be a cable or two from the switch itself going elsewhere (e.g., to a router). As to net.ipv4.tcp_timestamps, feel free to disable it if you think it might unmuddy the waters. Can't hurt.

The art of diagnosis, if the issue isn't obvious, is to start eliminating things to narrow down the possibilities.
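For a quick per-host comparison, and a reversible test of the timestamp setting, something like this (the NIC name is a placeholder; the sysctl change is not persisted and reverts on reboot):

    netstat -s | grep -i retrans            # run on each host and compare counters
    ethtool -S eth0 | grep -iE "err|drop"   # NIC-level error/drop counters; eth0 is a placeholder
    sysctl -w net.ipv4.tcp_timestamps=0     # disable timestamps for testing
    sysctl -w net.ipv4.tcp_timestamps=1     # ...and back again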

 

-=Tobias


I still haven't found anything obvious in the network, but I did find another point that is related more to the xapi database than to storage/network: changing the name-label of a device or a VM. On this pool that also takes much longer (up to several minutes) instead of less than a second, and in my understanding it doesn't involve much work on the storage itself.
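For reference, the kind of command I mean - a pure xapi database operation with no storage I/O involved (the UUID is a placeholder):

    time xe vm-param-set uuid=<vm-uuid> name-label="rename-test"
    # up to several minutes on the affected pool, well under a second on the others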

 

Any more info on raising the logging of xen/xapi? Any ideas at this point?

 

Still strange... :(

 

UPDATE: another point speaking against network problems as the main source: if that were the case, disk speed inside the VMs would have to be bad too. But they are lightning fast, with no obvious difference to the other pools. I compared that with bonnie++.

 

Jimmy


You're using separate, physical Ethernet connections to your storage, correct? What, if any, difference is there between the hosts/pool on which you see issues and the hosts/pools where you do not? It seems the diagnosis is going to be helped by identifying differences. The older pool has older NIC drivers, for example. What about any TCP settings that may differ between older and newer XS/CH instances?


So, we experimented and monitored some more things. But it is difficult to run all these tests and measurements in a production environment, as you all know :(

 

The network seems to be a part of the problem. We replaced one of the oldest (4 years) 10G switches with a brand new one, and it seems to help a bit in some circumstances.

Maybe we have a general problem in our setup:

 

How important is it to have the management traffic separated from the storage traffic? We have enough NICs on the machines, but when we started the whole thing it did not seem necessary, as 10G seemed to be enough. While the setup was growing, we never changed that approach. Big mistake?


We are now thinking about separating that, so a few questions:

 

  • Do you think that could be the root cause of such effects?
  • Is a dedicated 1G switch network sufficient as a management-only network?
  • Is it possible to separate management and storage traffic in a running pool? In particular: when we connect the new NIC, it looks like I can only 'add' a new 'storage' network. But I want to keep the existing network as the 'storage' network and add a new management network. Is that possible at the CLI? (see the sketch below)

Or do we need to set up new pools with the new approach and then move all VMs and data to these new pools? That would be a huge effort :/
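From what I have read so far, moving the management interface to a new PIF at the CLI would be roughly like this, while the existing network stays untouched as the storage network (an untested sketch; UUIDs and addresses are placeholders, and it should probably be done from the physical console since the management connection drops during the switch):

    xe pif-list params=uuid,device,IP,management       # find the PIF of the new NIC
    xe pif-reconfigure-ip uuid=<new-pif-uuid> mode=static \
        IP=192.0.2.10 netmask=255.255.255.0 gateway=192.0.2.1
    xe host-management-reconfigure pif-uuid=<new-pif-uuid>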

 

 

Again, many thanks for your input in advance!!!
 


With shared storage I would consider management/storage isolation very important. You don't want management operations impacting your storage performance. I also keep my management MTU at 1500 while running storage jumbo frames. And the obvious one: keep the storage network on the same subnet as the NFS storage so there is no routing of storage traffic.
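A quick way to check both points on each host (a sketch; the NFS server address is a placeholder):

    xe network-list params=uuid,name-label,MTU
    xe pif-list params=device,IP,netmask,MTU,management
    ip route get <nfs-server-ip>        # should not show a gateway hop if unrouted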

 

--Alan--

 


Alan,

 

many thanks for your input! Do you expect more impact on storage performance from the management traffic, or problems/delays in management because of the storage traffic (which was my assumption)? How much bandwidth should one expect the management traffic to use? I thought just a few Mbit/s but many packets, so that it is not the bandwidth but an undisturbed line that matters for keeping management fast. Am I wrong?
And yes - the NFS storages are on the same subnet as the dedicated PIFs of the host servers.


I would expect management to impact storage. It's not so much about bandwidth as it is latency. Storage needs as low a latency as you can manage. I think bandwidth/latency on management would be less critical: something like a VM migration to another host - so what if it takes a couple more minutes? Just a couple of seconds on the storage side is an eternity. On my old Fujitsu I have run 1Gb management/VM traffic bonded with 10Gb for storage iSCSI multipath and never had anything I considered a performance issue.
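A crude way to get a feel for the latency from dom0 (a sketch; the NFS server address is a placeholder):

    ping -c 20 <nfs-server-ip> | tail -1    # round-trip min/avg/max/mdev summary
    iostat -x 10 3                          # watch the await column on the VM disks under load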

 

--Alan--

 
