
Probing ownership is with some other node in the cluster



Hi everyone,

 

I'm setting up a cluster instance with two CITRIX ADC appliances of the same platform (MPX).

I created the cluster instance on the first node and added this appliance to the cluster as Node 1 with backplane interface 1/1/1. After creating the node, its "Operational state" remains "INACTIVE".

I then added and joined the second node to the cluster instance, with Node ID 2 and backplane interface 2/1/1. The "Operational state" of this second node remained "INACTIVE" as well.

 

Connected via PuTTY to the CLIP, I ran the

show cluster node

command to see the reason for this INACTIVE state, and the reason given is "Some enabled and HAMON interfaces of the node are down" (see screenshots 1 & 2 attached).

I checked each node, and the interfaces used as backplane interfaces have their HA Monitoring value set to "ON" (see screenshot 3 attached).

 

I have configured the load balancing feature on the CLIP, and all the vservers and services are DOWN, so there is no way to reach the backend servers.

I checked the monitors attached to the services: they display an UNKNOWN state, with the last response being "Probing ownership is with some other node in the cluster".

 

Please, where could the problem lie?

 

PS1: The two appliances are in the same subnet.

PS2: The two backplane interfaces belong to the same VLAN.

PS3: The SNIP set for the cluster instance is in the same subnet.

PS4: QUORUM type for the cluster instance is set to NONE.

 

Thanks in advance for your help.

screenshot1.jpg

screenshot2.jpg

screenshot3.jpg


 

For your service monitoring, NetScaler-to-backend communication relies on a SNIP, not the NSIP.

There are a lot of ways cluster networking can be misconfigured, but the quick things to check would be (a rough way to verify each is sketched after this list):

1) Is your SNIP spotted (assigned to a single node) or striped (owned by all nodes)? If striped, are you in a situation where each node needs a unique SNIP in the subnet to pass different traffic simultaneously? If you need multiple nodes to pass traffic to the same destination network at the same time, you would need a set of spotted SNIPs so each node has a unique, dedicated SNIP in that network. See https://docs.citrix.com/en-us/citrix-adc/current-release/clustering/cluster-overview/ip-addressing.html

2) Is USNIP mode enabled so the SNIPs can actually be used?

3) Do you need any additional routes for the SNIP to reach the destination network?

4) Finally, service monitoring in a cluster is discussed here: https://docs.citrix.com/en-us/citrix-adc/current-release/clustering/cluster-usage-scenarios/path-monitoring-in-cluster.html

You might just need to define the service path.
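
A rough way to check items 1-3 from the CLIP, plus the path-monitoring knob from item 4 (the service name below is a placeholder, not something from your config):

show ns ip          (spotted SNIPs list an owner node; striped ones do not)
show ns mode        (USNIP should appear among the enabled modes)
show route          (confirm a route exists toward each backend network)
set service svc_example -pathMonitor YES    (per the path monitoring doc above; svc_example is a made-up name)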

 

 

For the "some interfaces enabled but hamon on are down" message. Basically its saying that you have interfaces marked as critical (HAMon enabled) but for which they interface is showing as down.  This could mean the interface is DOWN because it is not physically connected or some issue with the switch configuration it is dependent on.

1) Are any of these interfaces in a bond? Is the bond properly configured on the ADC and on the physical switch? (A quick way to check is sketched at the end of this reply.)

2) Check nslog for additional low-level networking issues that might tell you what is wrong; for example, if a switch is muting a port.

shell

cd /var/nslog

nsconmsg -K newnslog -d event

nsconmsg -K newnslog -d consmsg

 

3) The rest of it probably depends on the cluster routing/backplane configuration (which I will not be much help with).

What network config are you relying on, and are you using ECMP or other routing requirements?

Review network requirements here: https://docs.citrix.com/en-us/citrix-adc/current-release/clustering/cluster-traffic-distribution.html

Additional backplane network requirements here: https://docs.citrix.com/en-us/citrix-adc/current-release/clustering/cluster-setup/cluster-setup-backplane.html
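
As a sketch of the interface side, from the CLIP you can confirm what the cluster thinks of each backplane interface and of any channels (the interface IDs below are the ones you mentioned; adjust if yours differ):

show interface 1/1/1    (link state and HAMON flag for node 1's backplane port)
show interface 2/1/1    (same for node 2)
show channel            (any LA channels/bonds and the state of their member interfaces)
show cluster node       (the per-node operational state and the reason it reports)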


Post back and maybe someone else can help untangle what else is wrong.

 

 


Hi Rhonda

Thanks for your reply

 

1) Is your SNIP spotted (assigned to a single node) or striped? => My SNIPs are spotted; each node has its own SNIP assigned for communication with the backend servers.

2) Is USNIP mode enabled, so SNIPs can be used. => Yes, it is

3) Do you need any additional routes for the SNIP to reach the destination network? => I do, and I added some static routes.

 

Concerning the interfaces, this is where the issue was. As you said, some of the interfaces were indeed DOWN (because they were not connected to a switch port). I had to disable them (for later configuration), after which the node's health went UP and its operational state became ACTIVE. The load balancing vservers and services are now UP.
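
For anyone who runs into the same message, what I did was roughly the following from the CLIP (the interface ID here is just a placeholder; use the ones reported as DOWN on your nodes):

show interface              (look for enabled interfaces that are DOWN but have HAMON ON)
disable interface 1/1/2     (example: take the unplugged interface out of service for now)

I believe setting HA monitoring to OFF on such a port (set interface 1/1/2 -haMonitor OFF) would be an alternative if you prefer to leave the interface enabled until it is cabled.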

 

As for the network config, both nodes are in the same subnet. The backplane interfaces are the same on each node, are connected to the same switch, and belong to the same VLAN.

Thanks greatly for your help in sorting this out.

 

 

Moreover, I have another issue now that the cluster is operational.

The cluster configuration is meant to replace a NetScaler MPX 7500 that is currently in production.

To make the migration from the NetScaler MPX 7500 to the ADC MPX 5900s smooth and transparent, we decided to configure the MPX 5900s in cluster mode with the same network configuration as production (same VIP addresses, same backend servers and services, same routes, etc.). The only thing that differs between the two environments is the NSIPs of the appliances.

The thorny issue now is that whenever we bring the cluster up, users complain that they can no longer access the services on the backend servers. Only once we take the cluster down can they access them again.

This raises a question: can the two setups (the one in production and the test cluster) not function at the same time? If not, why?

Does this mean we have to plan a complete switchover from the old setup to the new cluster configuration to avoid these conflicts and serve users effectively?


4 hours ago, Eunice Mapong said:

This raises a question: can the two setups (the one in production and the test cluster) not function at the same time? If not, why?

 

Glad the first issue is resolved.

I'm confused by the block above and may have misunderstood it.

You are trying to run the OLD environment and the NEW cluster at the same time, but wondering why there are issues?

 

Your 5900s and 7500 probably can't be in a cluster together since they are different models, so I'm assuming you have separate clusters.

If you are migrating one entity at a time between them, you would have to ensure that Cluster A and Cluster B operate independently:

Meaning both clusters have to have unique CLIPs/NSIPs/SNIPs, and a given VIP can only exist in one cluster at a time during the migration, or you will have IP conflicts.

 

Can you clarify how you are doing your migration? Duplicate IPs in separate clusters running simultaneously will conflict.


Hi Rhonda,

Sorry for not being clearer in my previous reply.

 

The situation is this:

Currently, the NetScaler MPX 7500 is in production, serving clients.

 

We want to migrate from this NetScaler to the newer CITRIX ADC MPX 5900 models to resolve some performance issues. Since our principal constraint is that the migration must be totally transparent to users, we decided to keep the same configuration as the current production environment (same VIPs, backend servers, services, etc.).

We configured the two new appliances in cluster mode.

 

With both environments ON (the OLD NetScaler MPX 7500 and the new CITRIX ADC MPX 5900 cluster), users complain about not having access to services anymore. Only once we turn the cluster environment OFF can they continue working.

 

So if I understand your thoughts, the two environments CANNOT BE UP AT THE SAME TIME USING THE SAME VIPs and remain operational, right?

Do we have to completely cut over to the new environment to have it function?

If that is the case, are there best practices to ensure the cutover goes smoothly, apart from keeping the OLD environment around during cutover so that we can roll back quickly if there are disruptions?

 

PS: In the two environments (OLD and NEW), the elements that are identical are the VIP addresses, the services and backend servers, the routes we created, and some DNS configurations.

The elements that differ are the NSIPs (though in the same subnet as production, 10.241.110.x) and the SNIPs (also in the same subnet as production).


If you have two independent systems with the same IPs, you will have MAC address conflicts, with both systems GARPing out the same IP addresses.

So, your SNIPs would have to be unique between OLD and NEW to avoid backend conflicts.

 

If the NEW cluster had different SNIPs and other dependent IPs than the active system, and you migrated VIPs (configs) one entity at a time by disabling the old vserver when you enable the new vserver, the two could run simultaneously; but everything else in the OLD and NEW environments would have to be different (separate NSIPs/SNIPs) to avoid conflicts.

You would have to ensure that, for each individual entity, the VIP is only active on the old OR the new system at any one time, never on both (a minimal sketch is below).
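
A minimal sketch of moving a single entity, assuming a vserver with the same name and VIP exists in both configs (vip_app1 is a made-up name):

disable lb vserver vip_app1     (run this on the OLD system first)
enable lb vserver vip_app1      (then run this on the NEW cluster's CLIP)
save ns config                  (on both, once you are happy with the result)

The point is simply that the shared VIP is only ever enabled on one side at any moment.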

 

Otherwise, a full cutover would be needed. Some other experts may have a more detailed recommendation for you.


Hi Rhonda,

 

Thanks for your reply

 

As for the two environments, the OLD and the NEW, they have different NSIPs and SNIPs.

 

As for the VIPs, I will try disabling them on the OLD environment and enabling them on the NEW one, one at a time, to test. If that does not succeed, we will plan a complete cutover.

Thanks for your help.


Hello Rhonda,

 

As discussed earlier, we did a first test by turning OFF the OLD environment and turning ON the new one, to see if we could access the VIPs (backend apps) without any problem.

In the new environment, among the load balancing vservers, we added the ones that are presently in production, as mentioned earlier, and we created two more vservers as TEST vservers.

 

At the end of yesterday's test, ONE of the TEST vservers was accessible and the other was not.

None of the vservers we copied from the OLD environment were accessible, even though they displayed an UP state.

 

We have been reflecting and searching for the cause, with no satisfying explanation or solution so far.

 

Could you please share any insight on this?

 

PS1: The USNIP option is enabled, and the SNIPs configured on each of the two cluster nodes are in the same subnet as the NSIPs and some of the backend servers.

PS2: For each of the vservers we're testing, both the services and the vserver appear UP.

 

Thanks in advance


As a first-round test, you could try creating a new vserver/new VIP on the NEW environment against one of the existing backends (services/service groups), so there is no frontend conflict with the VIPs, and make sure it passes traffic for a completely new entity. If it doesn't, you probably have some issue on the cluster data plane/network side to resolve first (and I do mean the real Clustering feature, not an HA pair). It sounds like you did a version of this but had mixed results, which suggests a cluster network configuration issue in the new cluster. A minimal sketch of such a test is below.
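
A minimal sketch of such a test entity, with every name, IP and port made up; point the service at one of your real backend hosts and pick a VIP that is not used anywhere else:

add server srv_test 192.168.10.21                    (an existing backend host)
add service svc_test srv_test HTTP 80
add lb vserver lb_vs_test HTTP 10.241.110.250 80     (a brand-new, unused VIP)
bind lb vserver lb_vs_test svc_test
show lb vserver lb_vs_test                           (confirm it is UP, then browse to it from a client)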

 

So, if it works for a completely NEW VIP against the services/service groups without conflict, and it ONLY fails when the OLD VIP is moved from the OLD cluster to the NEW, then there might be an issue with the frontend network recognizing the change in GARPs/MAC address ownership, or something else.

 

But this may give you a way to determine if the new cluster is set up properly before migrating VIPs.

 

Regarding your SNIPs:

1) Your SNIP(s) need to be able to reach the backend networks where your resources exist. This may require additional routes as well. Because you are in a cluster, the networking is more complex, and you might have an issue with your spotted or striped IP config.

 

2) Check nslog to see if any low-level network issues are being reported. Run traces.

Run traceroutes from the "clients" doing the test to see if there are any issues reaching the new cluster's VIPs (a sketch of these checks is below).
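
A sketch of those checks, with placeholder addresses (the show/start/stop commands run on the CLIP; the traceroute runs from a test client):

show route                  (confirm the SNIPs have a route to each backend network)
start nstrace -size 0       (full-packet capture; reproduce the failure while it runs)
stop nstrace                (the capture files land under /var/nstrace/ for review in Wireshark)

From the client, traceroute 10.241.110.250 (or tracert on Windows) toward the failing VIP to see where the path breaks.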


Hi Rhonda

Thanks for your reply

 

Well, yes, we did a first test with a completely new VIP against the backend servers, and the load balancing feature worked just fine.

 

Also, we added some STATIC routes in case of routing issues. So there are now spotted SNIPs for each node and routes to ensure correct packet routing from the appliances to the backend servers.
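
For reference, the spotted SNIP and static route configuration we used looks roughly like this (the addresses below are placeholders, not our real ones):

add ns ip 10.241.110.21 255.255.255.0 -type SNIP -ownerNode 1    (spotted SNIP owned by node 1)
add ns ip 10.241.110.22 255.255.255.0 -type SNIP -ownerNode 2    (spotted SNIP owned by node 2)
add route 172.16.20.0 255.255.255.0 10.241.110.1                 (static route toward a backend network)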

 

As for the previously mentioned issue with the TEST vservers, we finally resolved it by deleting them and completely recreating them. Strangely, that worked, and the VIPs we copied from the OLD environment are also working fine now.

 

We are having one more issue at this stage. Users can access the VIPs and backend servers, but they complain (more than once) that there are moments when the apps are unavailable and they cannot access them.

While analysing this, we checked the events log in the cluster instance GUI and noticed that the services are very frequently alternating between DOWN and UP states. When a service goes DOWN, the failure reason given is "...TCP SYN sent, reset received" (please see the screenshot attached).

We concluded that this behaviour explains why users can access the apps at one moment and not the next.

 

My question is: is this behaviour linked to the cluster configuration, or is it a limitation of the clustering feature?

What optimizations or corrections could we make to solve this, please?

 

PS: I know you've been helping a lot. Thanks greatly for your support.

service-DOWN.jpg


8 hours ago, Eunice Mapong said:

My question is: is this behaviour linked to the cluster configuration, or is it a limitation of the clustering feature?

 

I don't know.

You have two complications at once: 1) DSR for the SMTP traffic (I think), and 2) networking in a cluster config.

 

If no other application is having an issue, then it might be specific to the SMTP DSR or the monitoring details. If any other apps are going up/down, then I would investigate the cluster config.

At this point, I don't know what a good set of things to isolate would be, but double-check that all the correct cluster network settings are implemented and whether ECMP or other dynamic routing is properly configured (a quick checklist is sketched below).
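
As a sketch of what I would double-check from the CLIP before digging deeper (the service name is a placeholder):

show cluster instance           (overall cluster instance state)
show cluster node               (each node should be ACTIVE/UP, including its backplane state)
show service svc_flapping       (which monitor is bound and what its last response was)
stat service svc_flapping       (current state and traffic counters for the service)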

 

You should still make sure that the nslog doesn't have some overt networking issues that might indicate what is happening:

shell

cd /var/nslog

nsconmsg -K newnslog -d consmsg

nsconmsg -K newnslog -d event

UPDATE: Check nslog in clusters: https://docs.citrix.com/en-us/citrix-adc/current-release/clustering/cluster-troubleshooting.html

 

 

I'm not sure where else to look, but I think it's still a routing/network issue of some sort; I'm just not sure how to diagnose it. Hopefully someone else can think of something.

 

This might help: https://support.citrix.com/article/CTX200852

