
Strange L2 behavior on MPX8905: PING starts to work after ARP cache clear



Hello all,

 

I am configuring an MPX8905 HA set that has a trunked 10G interface.

One of the VLANs on the trunk is showing strange behavior, which I discovered through a route monitor check.

As the configuration of the failing VLAN is exactly the same as for the other, correctly functioning VLANs, I do not expect the Netscaler to be the root cause. But the network team is pointing to the Netscaler, so we are in a deadlock situation now...

 

When I can't PING the gateway on the failing VLAN, I can clear the ARP entry of the gateway using the command "rm arp <IP_ADDRESS_OF_VLAN_X_GATEWAY>".
PING immediately starts to work.
If I keep the PING running, it stops after a short time. Most RTTs are below 1 ms, but I also see RTTs of 55 ms in the responses while it works...
When the ping fails, it starts to work again if I clear the ARP cache.

What I noticed is that VRRP is used to provide high availability on the gateway too, and I can see the same VRRP MAC address on several VLANs.
In the example output below, <VL2_GW> is having issues, and I have also seen it happen on the <VL2_R1> and <VL2_R2> IP addresses, which have a "normal" MAC address associated.

 

netscaler> show arp
IP          MAC                Iface  VLAN   Origin   TTL  Traffic Domain
----------  -----------------  -----  -----  -------  ---  --------------
<VL1_GW>    00:00:5e:00:01:01  10/1   <VL1>  DYNAMIC  620  0
<VL2_R1>    d8:18:d3:10:98:44  10/1   <VL2>  DYNAMIC  240  0
<VL2_R2>    64:c3:d6:b6:ba:c8  10/1   <VL2>  DYNAMIC  194  0
<VL2_GW>    00:00:5e:00:01:01  10/1   <VL2>  DYNAMIC  264  0
<VL3_R1>    64:c3:d6:b6:ba:c8  10/1   <VL3>  DYNAMIC  558  0
<VL3_GW>    00:00:5e:00:01:01  10/1   <VL3>  DYNAMIC  397  0
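
For a longer ARP table, spotting MAC addresses that show up on more than one VLAN by eye gets tedious. Below is a minimal Python sketch, assuming the `show arp` output has been pasted into a string; the helper names `vlans_by_mac` and `vrrp_vrid` are made up for this illustration. It relies on the fact that IPv4 VRRP virtual MAC addresses have the form 00:00:5e:00:01:XX, where XX is the VRID, so 00:00:5e:00:01:01 on several VLANs would simply mean that each VLAN's VRRP group uses VRID 1.

```python
from collections import defaultdict

# Sample rows copied from the `show arp` output (placeholders kept as-is).
SHOW_ARP = """\
<VL1_GW>    00:00:5e:00:01:01  10/1   <VL1>  DYNAMIC  620  0
<VL2_R1>    d8:18:d3:10:98:44  10/1   <VL2>  DYNAMIC  240  0
<VL2_R2>    64:c3:d6:b6:ba:c8  10/1   <VL2>  DYNAMIC  194  0
<VL2_GW>    00:00:5e:00:01:01  10/1   <VL2>  DYNAMIC  264  0
<VL3_R1>    64:c3:d6:b6:ba:c8  10/1   <VL3>  DYNAMIC  558  0
<VL3_GW>    00:00:5e:00:01:01  10/1   <VL3>  DYNAMIC  397  0
"""

# IPv4 VRRP virtual MAC prefix; the last octet carries the VRID.
VRRP_PREFIX = "00:00:5e:00:01:"


def vlans_by_mac(text):
    """Map each MAC address to the set of VLANs it was learned on."""
    result = defaultdict(set)
    for line in text.splitlines():
        fields = line.split()
        # Skip headers, separator rows and blank lines.
        if len(fields) < 4 or fields[1].count(":") != 5:
            continue
        mac, vlan = fields[1].lower(), fields[3]
        result[mac].add(vlan)
    return dict(result)


def vrrp_vrid(mac):
    """Return the VRID if mac is an IPv4 VRRP virtual MAC, else None."""
    mac = mac.lower()
    if mac.startswith(VRRP_PREFIX):
        return int(mac.rsplit(":", 1)[1], 16)
    return None


# MAC addresses seen on more than one VLAN.
shared = {mac: vlans for mac, vlans in vlans_by_mac(SHOW_ARP).items()
          if len(vlans) > 1}
for mac, vlans in sorted(shared.items()):
    print(mac, sorted(vlans), "VRID:", vrrp_vrid(mac))
```

On the sample above this flags both the VRRP MAC and the router MAC 64:c3:d6:b6:ba:c8, which is learned on <VL2> and <VL3>.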

 

Has anyone seen such behavior before?

 

I suspect this is a layer 2 problem, but I have been told that no spanning tree is active and no L2 loops are present.

 

Thanks in advance!

 

Marco


Hi!

 

I haven't seen this behavior, but probably because I haven't used VRRP in the network.

But one thing came to mind. Do you use MAC-based forwarding (MBF)? (Check with "show ns mode".)

If you do, it might not work with some VRRP routers. If you have it enabled and you don't really need it, try disabling it.

But be careful and make sure you have all the routes you need. After you disable MBF, the Netscaler will use the routing that is in place.

https://support.citrix.com/article/CTX132952

 

(MBF is used on the uplinks where the client requests come in. I am not sure whether you use VRRP on the uplinks.)

 



Hi, we are not using MBF; see the output of show ns mode below:

> show ns mode

        Mode                           Acronym              Status
        -------                        -------              ------
 1)     Fast Ramp                      FR                   ON
 2)     Layer 2 mode                   L2                   OFF
 3)     Use Source IP                  USIP                 OFF
 4)     Client Keep-alive              CKA                  OFF
 5)     TCP Buffering                  TCPB                 OFF
 6)     MAC-based forwarding           MBF                  OFF
 7)     Edge configuration             Edge                 ON
 8)     Use Subnet IP                  USNIP                ON
 9)     Layer 3 mode (ip forwarding)   L3                   OFF
 10)    Path MTU Discovery             PMTUD                ON
 11)    Media Classification           Mediaclassification  OFF
 12)    Static Route Advertisement     SRADV                OFF
 13)    Direct Route Advertisement     DRADV                OFF
 14)    Intranet Route Advertisement   IRADV                OFF
 15)    Ipv6 Static Route Advertisement SRADV6               OFF
 16)    Ipv6 Direct Route Advertisement DRADV6               OFF
 17)    Bridge BPDUs                   BridgeBPDUs          OFF
 18)    RISE APBR Advertisement        RISE_APBR            OFF
 19)    RISE RHI Advertisement         RISE_RHI             OFF
 20)    Unified Logging Format         ULFD                 OFF
 Done

The strange thing is that this Netscaler is connected with a single 10G physical interface configured as a trunk that carries all production VLANs. Interface 0/1 is dedicated to management only.

Several VLANs are currently configured on the 10G interface, and all of them have a VRRP-based gateway. Only a single VLAN shows this behavior: it stops working about 40 seconds after I clear the ARP cache entry for that gateway.

 

We checked the switch and router configuration, and compared the config of a working VLAN against the failing VLAN. We found no differences in the config that explain the behavior.

In the meantime I also raised a support ticket with Citrix; hopefully we will be able to find the root cause...
 


3 hours ago, Srikanth Challa1709160785 said:

I see you mention route monitors as the cause:

 

One of the VLANs on the trunk is showing strange behavior which I discovered due to route monitor check..

 

What is the use case of route monitors in your setup?

 

-Srik

Hi,

 

We want to bind route monitors to the HA config so that a failover is initiated if one of the mandatory gateways cannot be reached. The 10G interface carries multiple production VLANs, and each has its own gateway. Static routes are defined that point to those gateways, and they are required to support all possible use cases. So if one of the gateways cannot be reached, a failover should be initiated ("bind HA node -routemonitor <destination_subnet> <destination_mask>").


On 3/19/2019 at 1:39 PM, Mihai Cziraki1709160741 said:

Can you check which errors you have on that interface?

Go to the shell, cd /var/nslog/, then run: nsconmsg -K newnslog -d stats | grep 10/1 | grep err

Instead of 10/1, use your own interface.

It will display the error counters for your interface.

Run it a few times and see which counters are incrementing.
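
Comparing two runs by eye is error-prone, so here is a minimal Python sketch of the comparison. It assumes the counters from two runs have already been collected into dictionaries; the counter names and values below are hypothetical and not actual nsconmsg output, and `incrementing` is a helper name made up for this example.

```python
def incrementing(before, after):
    """Return {counter: delta} for counters that grew between two snapshots."""
    return {
        name: after[name] - before[name]
        for name in after
        if name in before and after[name] > before[name]
    }


# Hypothetical snapshots taken some time apart (names and values are invented).
first_run = {"rx_crc_errors": 10, "tx_errors": 0, "rx_drops": 4}
second_run = {"rx_crc_errors": 13, "tx_errors": 0, "rx_drops": 4}

for name, delta in sorted(incrementing(first_run, second_run).items()):
    print(f"{name} increased by {delta}")
```

Counters that stay flat, or that appear in only one of the two snapshots, are ignored.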

 

 

 

I tried this and could not see a counter incrementing.

But at the same time the issue also miraculously disappeared, so at this moment everything works as expected and I have no clue what caused the issue.

 

Thanks everyone for trying to help me out on this. I know it is not very satisfactory to have a working situation without understanding what changed to make it work.
But we have had almost a week of stable operation now...

 

I have added route monitors towards the gateways that have to carry production traffic and bound them to the HA config, so if the issue reappears a failover will be initiated.
I limited the number of failovers to 2 to prevent being stuck in a situation where a failover is initiated every 3 minutes ;)

