Marco Warnier1709158729 Posted March 18, 2019

Hello all,

I am configuring an MPX 8905 HA pair that has a trunked 10G interface. One of the VLANs on the trunk is showing strange behavior, which I discovered through a route monitor check. Since the configuration of the failing VLAN is identical to that of the other, correctly functioning VLANs, I do not expect the NetScaler to be the root cause. But the network team is pointing to the NetScaler, so we are in a deadlock now...

When I can't ping the gateway on the failing VLAN, I can clear the ARP entry of the gateway with "rm arp <IP_ADDRESS_OF_VLAN_X_GATEWAY>", and ping immediately starts working. If I keep the ping running, it stops again after a short time. Most RTTs are <1 ms, but I also see RTTs of 55 ms in the responses while it works... When the ping fails, it starts working again as soon as I clear the ARP entry.

What I noticed is that VRRP is used for high availability on the gateways too, and I can see the same VRRP MAC address on several VLANs. In the example output below, <VL2_GW> is having issues, and I have also seen it happen on the <VL2_R1> and <VL2_R2> IP addresses, which have a "normal" MAC address associated.

netscaler> show arp
   IP          MAC                 Iface   VLAN    Origin    TTL   Traffic Domain
   ----------  -----------------   -----   -----   -------   ---   --------------
   <VL1_GW>    00:00:5e:00:01:01   10/1    <VL1>   DYNAMIC   620   0
   <VL2_R1>    d8:18:d3:10:98:44   10/1    <VL2>   DYNAMIC   240   0
   <VL2_R2>    64:c3:d6:b6:ba:c8   10/1    <VL2>   DYNAMIC   194   0
   <VL2_GW>    00:00:5e:00:01:01   10/1    <VL2>   DYNAMIC   264   0
   <VL3_R1>    64:c3:d6:b6:ba:c8   10/1    <VL3>   DYNAMIC   558   0
   <VL3_GW>    00:00:5e:00:01:01   10/1    <VL3>   DYNAMIC   397   0

Has anyone seen such behavior before? I suspect a layer 2 problem, but I have been told no spanning tree is active and no L2 loops are present.

Thanks in advance!
Marco
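For reference, the workaround sequence looks like this (a sketch with the same placeholder addresses as above; the parenthetical notes describe the observed behavior rather than literal command output):

netscaler> ping <VL2_GW>
   (requests time out)
netscaler> rm arp <VL2_GW>
netscaler> ping <VL2_GW>
   (replies resume immediately, mostly <1 ms, then stop again after a short while)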
Mihai Cziraki1709160741 Posted March 19, 2019

Hi! I haven't seen this behavior, but probably because I haven't used VRRP in the network. One thing did come to mind, though: do you use MAC-based forwarding (MBF)? (Check with "show ns mode".) If you do, it might not work with some VRRP routers. If you have it enabled and don't really need it, try disabling it. But be careful: make sure you have all the routes you need, because after you disable MBF the NetScaler will use the routing table instead. https://support.citrix.com/article/CTX132952

(MBF acts on the uplinks where the client requests come in; I am not sure whether you use VRRP on the uplinks.)
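If MBF does turn out to be enabled and is not needed, the change itself is small (a sketch of the commands Mihai refers to; double-check your routing table before disabling, as he warns):

> show ns mode              (confirm whether MBF is ON)
> disable ns mode MBF
> save ns config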
Srikanth Challa1709160785 Posted March 19, 2019

I see you mention route monitors as the trigger: "One of the VLANs on the trunk is showing strange behavior, which I discovered through a route monitor check." What is the use case for route monitors in your setup?

-Srik
Marco Warnier1709158729 Posted March 19, 2019 (Author)

3 hours ago, Mihai Cziraki1709160741 said:
Do you use MAC-based forwarding (MBF)? (Check with "show ns mode".) If you do, it might not work with some VRRP routers. ...

Hi, we are not using MBF; see the output of show ns mode below:

> show ns mode
        Mode                               Acronym              Status
        ----                               -------              ------
 1)     Fast Ramp                          FR                   ON
 2)     Layer 2 mode                       L2                   OFF
 3)     Use Source IP                      USIP                 OFF
 4)     Client Keep-alive                  CKA                  OFF
 5)     TCP Buffering                      TCPB                 OFF
 6)     MAC-based forwarding               MBF                  OFF
 7)     Edge configuration                 Edge                 ON
 8)     Use Subnet IP                      USNIP                ON
 9)     Layer 3 mode (ip forwarding)       L3                   OFF
 10)    Path MTU Discovery                 PMTUD                ON
 11)    Media Classification               Mediaclassification  OFF
 12)    Static Route Advertisement         SRADV                OFF
 13)    Direct Route Advertisement         DRADV                OFF
 14)    Intranet Route Advertisement       IRADV                OFF
 15)    Ipv6 Static Route Advertisement    SRADV6               OFF
 16)    Ipv6 Direct Route Advertisement    DRADV6               OFF
 17)    Bridge BPDUs                       BridgeBPDUs          OFF
 18)    RISE APBR Advertisement            RISE_APBR            OFF
 19)    RISE RHI Advertisement             RISE_RHI             OFF
 20)    Unified Logging Format             ULFD                 OFF
 Done

The strange thing is that this NetScaler is connected with a single 10G physical interface configured as a trunk that carries all production VLANs; interface 0/1 is dedicated to management only. Several VLANs are currently configured on the 10G interface, all with a VRRP-based gateway, and only a single VLAN shows this behavior of stopping ~40 seconds after I clear the ARP cache entry for its gateway.

We checked the switch and router configuration and compared the config of a working VLAN against the failing VLAN; no differences were found that explain the behavior. In the meantime I have also raised a support ticket with Citrix; hopefully we will be able to find the root cause...
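For anyone comparing the NetScaler side of a working VLAN against the failing one, the relevant views can be pulled like this (a sketch; <VL2> stands in for the affected VLAN ID):

> show vlan <VL2>           (interface binding, tagging, and bound subnet/IPs)
> show interface 10/1       (link state and per-interface counters)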
Mihai Cziraki1709160741 Posted March 19, 2019

Can you check what errors you have on that interface? Go to the shell, cd /var/nslog/, then run:

nsconmsg -K newnslog -d stats | grep 10/1 | grep err

Substitute your own interface for 10/1. This displays the error counters for the interface; watch which ones are incrementing.
Marco Warnier1709158729 Posted March 19, 2019 (Author)

3 hours ago, Srikanth Challa1709160785 said:
What is the use case for route monitors in your setup?

Hi, we want to bind route monitors to the HA configuration so that a failover is initiated if one of the mandatory gateways cannot be reached. The 10G interface carries multiple production VLANs, each with its own gateway. Static routes are defined that point to those gateways, and they are required to support all possible use cases. So if one of the gateways cannot be reached, a failover should be initiated ("bind HA node -routeMonitor <destination_subnet> <destination_mask>").
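A minimal sketch of that setup, assuming a hypothetical destination subnet of 10.20.30.0/24 reached via the VLAN gateway:

> add route 10.20.30.0 255.255.255.0 <VL2_GW>
> bind HA node -routeMonitor 10.20.30.0 255.255.255.0
> show HA node              (the bound route monitors are listed in the output)

With this in place, the node declares itself unhealthy and fails over when the monitored route's gateway becomes unreachable.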
Marco Warnier1709158729 Posted March 26, 2019 (Author)

On 3/19/2019 at 1:39 PM, Mihai Cziraki1709160741 said:
Can you check what errors you have on that interface? ... nsconmsg -K newnslog -d stats | grep 10/1 | grep err ...

I tried this and could not see any counter incrementing. But at the same time the issue also miraculously disappeared, so at the moment everything works as expected and I have no clue what caused it. Thanks everyone for trying to help me out on this; I know it is not very satisfying to have a working situation without understanding what changed to make it work. But it has been almost a week of stable operation now...

I have added route monitors towards the gateways that have to carry production traffic and bound them to the HA config, so if the issue reappears a failover will be initiated. I limited the number of failovers to 2 to avoid getting stuck in a situation where a failover is initiated every 3 minutes ;)
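For reference, the failover limit mentioned above maps to the HA node's maxFlips/maxFlipTime settings (a sketch; the 180-second window is my reading of the "every 3 minutes" remark, not a value confirmed in the thread):

> set HA node -maxFlips 2 -maxFlipTime 180
> save ns config

After two failovers within the 180-second window, the pair stops flipping instead of oscillating indefinitely.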