HA failover and split brain


Michael B

Hi All.  Looking for an HA config Expert to help out.

Running Physical MPX appliances with 12.1.57.18

I am trying to make sense of our HA configuration after a recent event.  We have had successful failovers before, caused by a power issue or an ADC freezing up, but in those cases one node was completely unavailable.  This time a switch failure took down connectivity on the management interface and on one of the two interfaces configured in a port channel (LA/1).  Failover was initiated and the secondary took the role of Primary, but the VIPs went into a DOWN state.  They seemed to be flapping up/down, probably between the two ADCs, but I couldn't connect to Node A to check because its mgmt interface was unreachable.  Long story short, Node A was powered down, things settled down, and the VIPs returned to an UP state.  I'll paste some config below, but my initial thinking is that it has something to do with how the native VLAN and the Tagall parameter are configured.

 

"Interfaces on which heartbeats are not seen : LA/1" makes me wonder: if the device cannot see heartbeats on the channel, is it only seeing them on the mgmt interface 0/1?

Looking at the channel config on both nodes, one (Node A) has the Tag All option set, but the other (Node B) does not.  If I'm reading support.citrix.com/article/CTX214033 correctly, that could be a problem if the heartbeat is not on the native VLAN.  Am I reading that correctly?

 

Thanks for any thoughts you may have.

MB

 

Output:

 

NodeB> sh ha node
1)      Node ID:      0 
        IP:      172.35.5.73 (NodeB) 
        Node State: STAYPRIMARY
        Master State: Primary
        Fail-Safe Mode: OFF
        INC State: DISABLED
        Sync State: ENABLED
        Propagation: ENABLED
        Enabled Interfaces : 0/1 LA/1
        Disabled Interfaces : 1/6 1/5 1/4 1/3 0/2
        HA MON ON Interfaces :  0/1 LA/1
        HA HEARTBEAT OFF Interfaces : None
        Interfaces on which heartbeats are not seen : LA/1
        Interfaces causing Partial Failure: None
        SSL Card Status: UP
        Hello Interval: 200 msecs
        Dead Interval: 3 secs
        Node in this Master State for: 2:22:46:12 (days:hrs:min:sec)
2)      Node ID:      1 
        IP:      172.35.5.72 
        Node State: UP
        Master State: Secondary
        Fail-Safe Mode: ON
        INC State: DISABLED
        Sync State: SUCCESS
        Propagation: ENABLED
        Enabled Interfaces : 0/1 LA/1
        Disabled Interfaces : 1/6 1/5 1/4 1/3 0/2
        HA MON ON Interfaces : 0/1 LA/1
        HA HEARTBEAT OFF Interfaces : None
        Interfaces on which heartbeats are not seen : LA/1
        Interfaces causing Partial Failure: None
        SSL Card Status: UP

Local node information:
        Critical Interfaces: 0/1 LA/1
 Done

---------------------------------------------

NodeB> sh channel

1)      Interface LA/1 (802.3ad Link Aggregate) #9  
        flags=0x4100c020 <ENABLED, UP, AGGREGATE, UP, HAMON, HEARTBEAT, 802.1q>
        MTU=1500, native vlan=1, MAC=02:e0:ed:52:7e:a0, uptime 1814h29m00s
        Requested: media NONE, speed AUTO, duplex NONE, fctl NONE,
                 throughput 0
        Actual: throughput 2000
        LLDP Mode: NONE
        RX: Pkts(1115745861) Bytes(1154932327075) Errs(0) Drops(36757617) Stalls(0)
        TX: Pkts(1082575256) Bytes(1122799145130) Errs(0) Drops(0) Stalls(0)
        NIC: InDisc(0) OutDisc(0) Fctls(0) Stalls(0) Hangs(0) Muted(0)
        Bandwidth thresholds are not set.
        LA mode: AUTO


                1/1: UTP-1000-FULL-OFF          UP  62h28m30s
                PortID=(32768,6), Mux=DISTRIBUTING, Rx=CURRENT, SELECTED
                <Active, Long timeout, Agg, Sync, Collecting, Distributing>
                Partner: SysID=(1000,00:23:04:ee:be:14), Key=32784, PortID=(32768, 16649)
                        <Active, Long timeout, Agg, Sync, Collecting, Distributing>

                1/2: UTP-1000-FULL-OFF          UP  1814h29m03s
                PortID=(32768,5), Mux=DISTRIBUTING, Rx=CURRENT, SELECTED
                <Active, Long timeout, Agg, Sync, Collecting, Distributing>
                Partner: SysID=(1000,00:23:04:ee:be:14), Key=32784, PortID=(32768, 265)
                        <Active, Long timeout, Agg, Sync, Collecting, Distributing>

 Done

===================================================================================
NodeA> sh channel

1)      Interface LA/1 (802.3ad Link Aggregate) #9  
        flags=0x4100c020 <ENABLED, UP, AGGREGATE, UP, HAMON, HEARTBEAT, 802.1q, tagall>
        MTU=1500, native vlan=1, MAC=02:e0:ed:32:9b:82, uptime 62h23m16s
        Requested: media NONE, speed AUTO, duplex NONE, fctl NONE,
                 throughput 0
        Actual: throughput 2000
        LLDP Mode: NONE
        RX: Pkts(1645625) Bytes(260540289) Errs(0) Drops(1437880) Stalls(0)
        TX: Pkts(444245) Bytes(155088375) Errs(0) Drops(0) Stalls(0)
        NIC: InDisc(0) OutDisc(0) Fctls(0) Stalls(0) Hangs(0) Muted(0)
        Bandwidth thresholds are not set.
        LA mode: AUTO


                1/1: UTP-1000-FULL-OFF          UP  62h23m19s
                PortID=(32768,6), Mux=DISTRIBUTING, Rx=CURRENT, SELECTED
                <Active, Long timeout, Agg, Sync, Collecting, Distributing>
                Partner: SysID=(1000,00:23:04:ee:be:14), Key=32783, PortID=(32768, 16648)
                        <Active, Long timeout, Agg, Sync, Collecting, Distributing>

                1/2: UTP-1000-FULL-OFF          UP  62h23m20s
                PortID=(32768,5), Mux=DISTRIBUTING, Rx=CURRENT, SELECTED
                <Active, Long timeout, Agg, Sync, Collecting, Distributing>
                Partner: SysID=(1000,00:23:04:ee:be:14), Key=32783, PortID=(32768, 264)
                        <Active, Long timeout, Agg, Sync, Collecting, Distributing>

 Done
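
The tagall difference between the two `sh channel` outputs above is easy to miss by eye. As a minimal sketch, the Python below extracts the `flags=` list for each channel and reports flags present on only one node. The function names and the trimmed sample strings are illustrative assumptions, not part of any Citrix tooling; the flag lines themselves are copied from the output above.

```python
import re

def channel_flags(show_channel_output):
    """Map each channel name (e.g. 'LA/1') to the set of flags on its 'flags=' line."""
    flags_by_channel = {}
    current = None
    for line in show_channel_output.splitlines():
        header = re.search(r"Interface\s+(LA/\d+)", line)
        if header:
            current = header.group(1)
            continue
        flags = re.search(r"flags=\S+\s*<([^>]*)>", line)
        if flags and current is not None:
            flags_by_channel[current] = {f.strip() for f in flags.group(1).split(",")}
    return flags_by_channel

def flag_mismatches(node_a_output, node_b_output):
    """Return, per channel, the flags that appear on exactly one of the two nodes."""
    a = channel_flags(node_a_output)
    b = channel_flags(node_b_output)
    return {ch: a[ch] ^ b[ch] for ch in a.keys() & b.keys() if a[ch] != b[ch]}

# Trimmed copies of the two outputs pasted above (only the relevant lines).
NODE_B = """\
1)      Interface LA/1 (802.3ad Link Aggregate) #9
        flags=0x4100c020 <ENABLED, UP, AGGREGATE, UP, HAMON, HEARTBEAT, 802.1q>
"""
NODE_A = """\
1)      Interface LA/1 (802.3ad Link Aggregate) #9
        flags=0x4100c020 <ENABLED, UP, AGGREGATE, UP, HAMON, HEARTBEAT, 802.1q, tagall>
"""

print(flag_mismatches(NODE_A, NODE_B))  # {'LA/1': {'tagall'}}
```

Running the same comparison against full `sh channel` captures from both nodes should work as long as each channel block starts with an `Interface LA/<n>` line, which is an assumption based on the output shown here.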


If your channel does not allow untagged packets, then you need to do some trickery to get HA heartbeats to work. Choose one VLAN on that channel and uncheck its tagged box. Then enable tagall on the channel on both HA nodes. That will cause the nodes to tag the heartbeat packets with the ID of the VLAN whose tagged box you unchecked.
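
To make the tagging rule above concrete, here is a minimal Python sketch of how the tagall setting changes the 802.1q tag on frames leaving a channel. It assumes, as this reply describes, that heartbeats go out on the channel's native VLAN, and that tagall simply extends tagging to the native VLAN. `egress_tag` is a hypothetical helper for illustration, not a NetScaler API.

```python
from typing import Optional

def egress_tag(vlan_id: int, native_vlan: int, tagall: bool) -> Optional[int]:
    """Return the 802.1q VLAN ID placed on an outgoing frame, or None if it is sent untagged.

    Assumed model: frames on the native VLAN are untagged unless tagall is ON;
    frames on any other VLAN bound to the channel are always tagged.
    """
    if vlan_id == native_vlan and not tagall:
        return None
    return vlan_id

native = 1  # native vlan=1 on LA/1 in the output posted above

# Node A has tagall set, Node B does not, so their heartbeats leave differently.
print(egress_tag(native, native, tagall=True))   # 1 (tagged with VLAN 1)
print(egress_tag(native, native, tagall=False))  # None (untagged)
```

Under this model, a tagall mismatch means one node sends tagged heartbeats and the other sends untagged ones, so whether each node receives its peer's heartbeats depends on how the switch ports treat the native VLAN.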


18 hours ago, Michael B said:

Quick question about something that isn't clear in the documentation.  Does the ADC use the NSIP as the source on all interfaces?  It sounds that way, but the docs don't say so explicitly.

 

Thanks.

Hey,


I would say the interface where your NSVLAN (default: VLAN 1) is bound is the one used for MGMT / NSIP traffic. More information here: https://docs.citrix.com/en-us/netscaler/11-1/networking/interfaces/configuring-nsvlan.html

 

Cheers,

Daniel


  • 2 weeks later...

Update:

Removing the TAGALL parameter resolved the issue with heartbeats not being seen on the LA/1 interface.  So there was no need to do anything fancy on our end, such as changing the native VLAN on the channel.  My Cisco guy didn't think the native VLAN would matter, even with the native VLAN on the switch side being different, since these are connecting to a vPC and no routing is involved for the heartbeats.

I removed HA monitoring from the mgmt 0/1 port, since it is not part of the vPC, leaving monitoring on only for the data interface, the LA/1 channel.  This should avoid the issue we ran into a few weeks ago.
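
For anyone wanting to check a change like this, one option is to look at the "Interfaces on which heartbeats are not seen" line for each node in `sh ha node`. The Python below is a rough sketch that pulls that line out per node ID. The function name is hypothetical, the sample is trimmed from the output in the first post, and treating `None` as "all heartbeats seen" is an assumption based on how the other interface lines in that output are printed.

```python
import re

def heartbeats_not_seen(show_ha_node_output):
    """Map each HA node ID to the interfaces listed as not seeing heartbeats."""
    result = {}
    node_id = None
    for line in show_ha_node_output.splitlines():
        node = re.search(r"Node ID:\s*(\d+)", line)
        if node:
            node_id = int(node.group(1))
            continue
        missing = re.search(
            r"Interfaces on which heartbeats are not seen\s*:\s*(.*)", line
        )
        if missing and node_id is not None:
            names = missing.group(1).split()
            # Assumption: the CLI prints "None" when no interface is affected.
            result[node_id] = [] if names == ["None"] else names
    return result

# Trimmed from the 'sh ha node' output in the original post (before the fix).
SAMPLE = """\
1)      Node ID:      0
        Interfaces on which heartbeats are not seen : LA/1
2)      Node ID:      1
        Interfaces on which heartbeats are not seen : LA/1
"""

print(heartbeats_not_seen(SAMPLE))  # {0: ['LA/1'], 1: ['LA/1']}
```

After a fix, the expectation would be an empty list for both node IDs.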

 

 

sh node_output.txt
