
Packet drops on NetScaler VPX


Jes M

Recommended Posts

Hello,

 

I believe we have a network (throughput) resource crunch on our NetScaler (HA setup), as we are hitting connection timeout issues when there is high traffic.

# nsconmsg -K /var/nslog/newnslog -g nic_err -d statswt0
Displaying current counter value excluding counters with 0 value information
NetScaler V20 Performance Data
NetScaler NS11.1: Build 57.126.nc, Date: Apr  4 2018, 17:42:03  

reltime:mili second between two records Fri Jul 26 09:20:40 2019
Index reltime     counter-value symbol-name&device-no
    1       0               129 nic_err_perx_drops interface(1/1) 
    3       0               268 nic_err_perx_drops interface(1/2) 
    5       0          18633565 nic_err_tx_carrier interface(1/1) 
    7       0           1903929 nic_err_tx_carrier interface(1/2) 
    9       0             64804 nic_err_peprocess_drops interface(1/1) 
   11       0          11511276 nic_err_peprocess_drops interface(1/2) 
   13       0                11 nic_err_perx_l2mode_drops interface(1/1) 
   15       0                 1 nic_err_perx_bridge_drops interface(1/1) 
   17       0               223 nic_err_perx_bridge_drops interface(1/2) 
   19       0             64800 nic_err_peprocess_l2mode_drops interface(1/1) 
   21       0          11511276 nic_err_peprocess_l2mode_drops interface(1/2) 
   23       0          93675045 nic_err_tx_overflow interface(1/1) 
   25       0            508485 nic_err_tx_overflow interface(1/2) 
   27       0             64933 nic_err_dropped_pkts interface(1/1) 
   29       0          11511544 nic_err_dropped_pkts interface(1/2) 
   31       0            874538 nic_err_congested_pkts_dropped interface(1/1) 
   33       0              9916 nic_err_congested_pkts_dropped interface(1/2) 

I see a lot of packet drops on interface 1/2:

> show interface

1)    Interface 1/1 (NetScaler Virtual Interface) #0  
    flags=0xe060 <ENABLED, UP, UP, HAMON, HEARTBEAT, 802.1q>
    MTU=1500, native vlan=1, MAC=06:ee:7e:44:79:43, uptime 2378h39m52s
    LLDP Mode: NONE,         LR Priority: 1024

    RX: Pkts(78663728615) Bytes(24270565722711) Errs(0) Drops(64984) Stalls(0)
    TX: Pkts(60862981741) Bytes(11534614260520) Errs(0) Drops(0) Stalls(0)
    NIC: InDisc(0) OutDisc(0) Fctls(0) Stalls(0) Hangs(0) Muted(0)
    Bandwidth thresholds are not set.

2)    Interface 1/2 (NetScaler Virtual Interface) #1  
    flags=0xe060 <ENABLED, UP, UP, HAMON, HEARTBEAT, 802.1q>
    MTU=1500, native vlan=1927, MAC=06:82:0a:3b:72:8f, uptime 2378h39m52s
    LLDP Mode: NONE,         LR Priority: 1024

    RX: Pkts(42033882497) Bytes(9301698904702) Errs(0) Drops(11537503) Stalls(0)
    TX: Pkts(33436413989) Bytes(20667774902435) Errs(0) Drops(0) Stalls(0)
    NIC: InDisc(0) OutDisc(0) Fctls(0) Stalls(0) Hangs(0) Muted(0)
    Bandwidth thresholds are not set.

Is it possible to identify the packet drops that are causing the connection timeout issue?
Citrix says it is expected to see packet drops on RX: https://support.citrix.com/article/CTX125544
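
In case it helps anyone reading along, the same drop counters can also be watched live (from the in-memory stats rather than the log file) with something like the following; the -g pattern is just a grep over counter names, so adjust it as needed:

# shell: watch NIC drop counters live while the timeouts are happening
nsconmsg -d current -g nic_err_dropped_pkts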


In addition to Diego's article, check what nslog says, as it may give you an immediate error message you can research. Also check your license for your bandwidth level, to see whether you are unexpectedly under-licensed.

# nslog

shell

cd /var/nslog

nsconmsg -K newnslog -d event

nsconmsg -K newnslog -d consmsg

 

See if any overt networking-related issues are being reported.

 

# view license/bandwidth allowed:

# regular cli:

show license

Check the "model" field, as this usually indicates your license model, which includes the bandwidth level.

Also confirm whether you are looking at an MPX or a VPX, and whether you are doing any link aggregation or have other physical networking considerations that might play a part, in case the other article doesn't help you find the problem.
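
As a rough sketch of the license check, you can put the current throughput next to the licensed model (the -g pattern is just a grep over counter names, so it may match slightly different counters depending on build):

# regular cli: licensed model / bandwidth tier
show license

# shell: watch live throughput counters to compare against the licensed limit
nsconmsg -d current -g mbits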


  • 4 years later...

I realize this is an old post, but I wanted to add some information in case anyone else is searching for issues with dropped packets.

 

We run a high-traffic environment (about 65K requests per second across four VPX NetScalers in one of our datacenters). We were seeing some concerning CPU stats on our VPXs, so we decided to upgrade from 8 to 10 virtual CPUs in each NetScaler to give them a bit more headroom.

 

And this is when things started going wrong (though it took a while to figure out when the issue started). Application teams started reporting "slowness." After a lot of troubleshooting, we figured out that we were losing packets to our VPXs. We realized that TCP retransmits from our servers had started increasing around the same time we did the CPU upgrade. Even just pinging our Production SNIPs showed substantial packet loss.
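
For anyone wanting to run the same checks, this is roughly what we were looking at from a Linux server behind the VPX (the SNIP address here is just a placeholder):

# sustained ping against a Production SNIP to measure loss (10.0.0.10 is a placeholder)
ping -c 500 -i 0.2 10.0.0.10

# cumulative TCP retransmit counters on the server
netstat -s | grep -i retrans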

 

One of our infrastructure teams found that in 'esxtop' they were seeing a lot of "DRPRX" for certain interfaces on our VPXs (not the management interface, but the Production SNIPs). They pointed to this document, which states that this could happen if "the virtual machine’s network driver runs out of receive (RX) buffer memory."
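
If you want to check this yourself on the ESXi host, it is roughly:

# on the ESXi host
esxtop
# press 'n' for the network view, then watch the %DRPRX / %DRPTX columns
# for the vNICs belonging to the VPX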

 

This eventually led me to this document, specifically this line:

Quote

By default, the VMXNET3 device supports only 8 Rx and Tx queues. When the number of vCPUs on the VPX goes beyond 8, the number of Rx and Tx queues configured for a VMXNET3 interface switches to 1 by default.

 

We ended up increasing ethernetX.maxTxQueues and ethernetX.maxRxQueues, as indicated in the document, although the document is not very intuitive.  We had to first upgrade the VM Hardware version of our VPXs, then have our ESX team apply the parameter changes using "config parameters" in vCenter.
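
For reference, the advanced configuration parameters end up looking something like this in the VM's settings. The "ethernet1" index and the value of "8" are just placeholders (one pair is needed per data interface, and the document explains which value to use for your vCPU count):

ethernet1.maxTxQueues = "8"
ethernet1.maxRxQueues = "8"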

 

You can verify that the update worked after rebooting by dropping to a shell and running:
 

# dmesg  | grep -i vmx
<SNIP>
vmx3: device version read: 0x7f
vmx3: device supports revision 6
vmx3: device supports max #TX queues: 10, #RX queues: 10
vmx3: configured max #TX queues: 9, #RX queues: 9
vmx3: RX ring size = 512
<SNIP>

 

After increasing ethernetX.maxTxQueues and ethernetX.maxRxQueues, TCP retransmits basically went away, and there was no more packet loss when pinging the SNIPs.

 

One more thing: the 'nic_err_tx_overflow' counter above was the counter that showed the issue. I read somewhere else that "Drops" in the output from 'show interface' can be due to VLAN tagging, where the interface "sees" traffic destined for other VLANs, so it doesn't necessarily indicate a problem.
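
If you want to pull out just that counter, the same nsconmsg filter from the top of the thread works:

# shell: pull just the tx_overflow counter from the performance log
nsconmsg -K /var/nslog/newnslog -g nic_err_tx_overflow -d statswt0

# or watch it live
nsconmsg -d current -g nic_err_tx_overflow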

 

Hopefully this will help others, because it took forever to figure out!

