SMTP Load balancer stopped working


Chris Gundry


Hi all

 

We have had an SMTP load balancer set up for some time now. Recently we needed to fail over our NetScaler cluster for some maintenance. This went fine, and most things tested fine after the failback. However, we noticed that our SMTP load balancer was no longer working. By not working, I mean the service group members still show as 'UP' and the VS shows as 'UP', but when we try to send mail via the LB VS it fails to connect. Connecting to the individual Exchange servers directly works fine. It is also worth noting that we have a 'backup server' configured, but the automatic failover to that is not working either, where it previously worked fine. It is as if any connection to the LB VS main IP is not working...

 

Considering we had not changed anything in relation to the LB VS, this was surprising. I have spent some time checking all the settings, and everything seems correct and nothing has changed. I have also downloaded previous configs and checked them and the settings related to the LB VS are unchanged. So I am really struggling to see any reason why this has stopped working.

 

NS version 12.1 57.18 (update is planned, but postponed due to other works)

 

Things tested/checked:

Restarted both NetScalers in cluster

Checked previous configs, no changes, configuration seems fine

Disabled each service group member separately, to check if connections to either member would work, they don't

Changed the IP of the LB VS to something else, problem remains

Completely removed the LB VS and re-created on a different IP (just to be sure), problem remains

 

I have raised this with Citrix support, but they are not being particularly helpful at this point...

 

Any help or suggestions would be appreciated!

 

Thanks!

 

Chris

 


The only thing that springs to mind is if Source IP somehow became enabled or disabled on your Service Group, causing either SMTP or the firewall on your back-end server to see the traffic as coming from a different source and refusing it.  But that's a long shot.

 

Do you even get a prompt back when doing telnet port-25 testing?


15 hours ago, Robert Blissitt said:

The only thing that springs to mind is if Source IP somehow became enabled or disabled on your Service Group, causing either SMTP or the firewall on your back-end server to see the traffic as coming from a different source and refusing it.  But that's a long shot.

 

Do you even get a prompt back when doing telnet port-25 testing?

Thank you for the suggestion. We have already confirmed that has not changed and is correctly configured/not being blocked by the back end servers.

 

No, no prompt back, no connection at all, it just times out and fails to connect. It is as if the NetScaler is not listening on that IP/port. We can of course ping the NetScaler on the LB VS IP.


Was your failover test to the same member of the HA pair (so no IPs changed), or was it to a different set of NetScalers, such as in a DR location?
(Sorry, read you have a cluster instead; more potential fun with networking.)

Are you using USIP mode or not?

Is it possible that during the failover test the default gateway of the CAS servers was set to the SNIP of the failover pair and not returned to the original pair?

 

You might try checking syslog for any events that might explain the SMTP issue.

Also run a trace from both the ADC and the Exchange servers to see whether traffic is not returning to the load balancer as expected.

 

About your ping test:

One thing to remember: a NetScaler always responds to pings for any VIP assigned to any vserver, regardless of vserver up/down state. It's not a port-based check, and since one VIP can be used on multiple vservers, the default is to always respond. While the behavior can be modified to depend on at least one vserver being up, a ping test usually does nothing more than confirm the ADC is alive and doesn't tell us much about the vserver using the VIP.

 

A port-based check (like telnet), or even a monitor from another system, is what shows whether the vserver is in fact handling traffic.
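For anyone who would rather script the port-based check than use telnet, here is a minimal Python sketch. It is only an illustration, not something from the posts above: the function name `check_smtp` and the status labels are made up for this example. It separates "could not connect at all" from "connected but received no SMTP greeting", which a ping cannot tell apart.

```python
import socket


def check_smtp(host, port=25, timeout=5.0):
    """Try a TCP connect and read the greeting, returning (status, detail)."""
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError as exc:  # refused, unreachable, or timed out
        return ("connect-failed", str(exc))
    with sock:
        try:
            data = sock.recv(512)  # an SMTP server normally greets first
        except OSError:  # connected, but nothing arrived before the timeout
            return ("connected-no-greeting", "")
    if not data:  # peer closed the connection without sending anything
        return ("connected-no-greeting", "")
    return ("greeting", data.decode("ascii", errors="replace").strip())
```

Calling `check_smtp("<VIP address>")` and then the same function against each back end server gives a like-for-like comparison of the load-balanced path and the direct path.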


Hi Rhonda

 

Firstly, thanks for replying with such a detailed response!

 

I should have prefaced this by saying I am not a NetScaler expert, at all, but I know what I know...

 

17 hours ago, Rhonda Rowland1709152125 said:

Was your failover test to the same member of the HA pair (so no IPs changed), or was it to a different set of NetScalers, such as in a DR location?
(Sorry, read you have a cluster instead; more potential fun with networking.)

 

Yes, clustered pair of NetScalers...

 

When I said failover, I meant that we have the LB VS set up with a 'backup' server. When I set this up last year, I tested the 'failover' to the backup server by disabling the primary VS, and it failed over correctly. I also disconnected the back end servers from the network, and the primary VS identified that it was unable to monitor them anymore and failed over to the backup server. Currently, if I disable the primary LB VS, it does not fail over to the backup VS like it used to... I thought this was worth sharing.

 

18 hours ago, Rhonda Rowland1709152125 said:

Are you using USIP mode or not?

 

Not globally, but on the LB service group I have 'Use client IP' checked. This is unchanged from original config (as is everything at the moment... :/)

 

18 hours ago, Rhonda Rowland1709152125 said:

Is it possible that during the failover test the default gateway of the CAS servers was set to the SNIP of the failover pair and not returned to the original pair?

 

 

 

The DGW of all servers in the network is set to the network core switch, not the NetScaler. The DGW of the servers has not been changed at any point.

 

18 hours ago, Rhonda Rowland1709152125 said:

You might try checking syslog for any events that might explain the SMTP issue.

Also run a trace from both the ADC and the Exchange servers to see whether traffic is not returning to the load balancer as expected.

 

Yes, syslog is on my list of things to check out more thoroughly.

 

Unfortunately I can't install Wireshark on the Exchange servers. I did a trace on the primary NetScaler, and to me it looks like the traffic is hitting the NetScaler but not being forwarded on to the Exchange server... It seems to just be dying at the NetScaler. I have given this trace to Citrix support but have not had anything back yet...

18 hours ago, Rhonda Rowland1709152125 said:

About your ping test:

One thing to remember: a NetScaler always responds to pings for any VIP assigned to any vserver, regardless of vserver up/down state. It's not a port-based check, and since one VIP can be used on multiple vservers, the default is to always respond. While the behavior can be modified to depend on at least one vserver being up, a ping test usually does nothing more than confirm the ADC is alive and doesn't tell us much about the vserver using the VIP.

 

A port-based check (like telnet), or even a monitor from another system, is what shows whether the vserver is in fact handling traffic.

 

A good point. I guess I was just saying that I can ping the IP, so the IP is 'alive' on the NetScaler, albeit as you say, not necessarily on the VS itself... I did the telnet 25 testing and that won't connect, so not sure what other testing I could do at this point...

 

Thanks again!


To clarify: so your failover test wasn't an NSA-to-NSB failover test, but a switch of the lb vserver from primary to backup vserver? (Did I understand that right?)

 

Do you have a setting like "Disable primary when down" enabled on the second (backup) vserver?

To fall back, you have to manually re-enable the primary vserver.

 

Verify the primary lb vserver is enabled and UP and still pointing to the correct active service as expected?

Verify the secondary (backup) lb vserver is enabled and up and still pointing to its correct destination.

Verify the secondary is still designated as the primary's backup; the secondary should not have a backup listed.

Fully disable/reenable the primary to make sure it is "ready".

 

Do you need to flush a DNS cache somewhere (on ADC or between client and ADC)?

 

At this point I'm not sure. But again, check syslog for any indication that something else is going on.

 

Syslog:
shell

cd /var/log

more ns.log

more ns.log | grep smtp -i

# for new events of all types...

tail -f ns.log | grep -v CMD_EXECUTED 
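If ns.log has been copied off the appliance, the same filtering can be sketched in Python. This is only an illustration: `filter_log` is a hypothetical helper name, the include match is case-insensitive (like `grep -i`) and the exclude match is case-sensitive (like `grep -v`).

```python
def filter_log(lines, include=None, exclude=None):
    """Yield lines that contain `include` (any case) and do not contain `exclude`."""
    needle = include.lower() if include else None
    for line in lines:
        if needle is not None and needle not in line.lower():
            continue  # does not mention the keyword we are looking for
        if exclude and exclude in line:
            continue  # drop noise such as CMD_EXECUTED entries
        yield line.rstrip("\n")


# Example use on a local copy of the log (the file path is an assumption):
# with open("ns.log", errors="replace") as fh:
#     for entry in filter_log(fh, include="smtp", exclude="CMD_EXECUTED"):
#         print(entry)
```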

 

 

 


5 minutes ago, Rhonda Rowland1709152125 said:

To clarify: so your failover test wasn't an NSA-to-NSB failover test, but a switch of the lb vserver from primary to backup vserver? (Did I understand that right?)

 

Correct, sorry if I was unclear.

 

5 minutes ago, Rhonda Rowland1709152125 said:

Do you have a setting like "Disable primary when down" enabled on the second (backup) vserver?

To fall back, you have to manually re-enable the primary vserver.

 

No, we do not have that setting configured.

 

5 minutes ago, Rhonda Rowland1709152125 said:

Verify the primary lb vserver is enabled and UP and still pointing to the correct active service as expected?

Verify the secondary (backup) lb vserver is enabled and up and still pointing to its correct destination.

Verify the secondary is still designated as the primary's backup; the secondary should not have a backup listed.

Fully disable/reenable the primary to make sure it is "ready".

 

I disabled and re-enabled both the primary and secondary/backup server, both are 'UP' and green.

The back end servers are both showing as 'UP' in the monitor as well.

 

6 minutes ago, Rhonda Rowland1709152125 said:

Do you need to flush a DNS cache somewhere (on ADC or between client and ADC)?

 

No IPs have changed, so a DNS flush won't really do anything, but I did it anyway. Both ADCs have also been restarted. Lastly, I tried pointing the ADC at the back end SMTP servers via IP instead of name, just to be sure; no change, still not working.

 

7 minutes ago, Rhonda Rowland1709152125 said:

Syslog:
shell

cd /var/log

more ns.log

more ns.log | grep smtp -i

# for new events of all types...

tail -f ns.log | grep -v CMD_EXECUTED 

 

Thank you for the info.

 

I was already looking at that log, but I can see nothing in that log related to the LB VS for SMTP. I don't see anything in there at all. If I tail the log and I wait for a quiet moment, then attempt a connection, I don't see anything either...

I also don't see anything in there related to an HTTP LB we have set up either. I have tried searching based on the client IP, back end server IP, and VS IP. When searching on the client IP, I can see the commands I have been running, but no SMTP connection attempts etc.

With grep smtp I get no results at all...

 

I DO see LOTS of log entries related to our external connections for NetScaler Citrix Gateway, storefront, group extraction etc.

 

I feel like I am missing something....? Is it possible that the traffic for the LB VS is not logged in some way, does it need to be enabled somewhere?


Syslog is the audit log, so it records command changes, monitors/status going up and down, and lots of gateway events.

Data not being there isn't always a problem; what is and isn't logged just depends on the event.

 

Now if the log itself is empty, then your local logging settings could be omitting things: the default is info and higher, but you might have local logging set to warning and higher.

Or local logging may have been replaced with remote-only logging.

 

Logging level and log output destinations:

In the GUI, go to System > Auditing. Then in the right pane, confirm that the global syslog parameters show the local logging destination as the localhost IP (if not, you're only logging externally) and check the logging level.

Under System > Auditing > Syslog > policies and actions, you can see if one or more syslog policies are created and if so, where they are bound. The actions would indicate additional logging locations and log level. Usually, you keep audit parameters logging locally and then use policies (bound to global system or to specific vservers) to log to ADDITIONAL external locations. For something like load balancing, additional custom logging probably isn't needed.  

 

For this in particular, I wasn't expecting messages, but if something was wrong and it was logged, we'd have new information to look at.

 

Maybe you should share your smtp load balancing config. This is usually done with content switching.

Check the lb vserver(s) for any additional policies that may be affecting the load balancing flow.

 

For the nstrace you ran from the ADC, did you also enable "Trace filtered connection's peer traffic" to get backend traffic in the trace? That way we can see whether the traffic doesn't leave the ADC, or whether it does and the backend is the problem.

 

Hopefully, someone else has some ideas as I can't think of any. 

 


Thanks for replying again!

 

14 hours ago, Rhonda Rowland1709152125 said:

Syslog is the audit log, so it records command changes, monitors/status going up and down, and lots of gateway events.

Data not being there isn't always a problem; what is and isn't logged just depends on the event.

 

Now if the log itself is empty, then your local logging settings could be omitting things: the default is info and higher, but you might have local logging set to warning and higher.

Or local logging may have been replaced with remote-only logging.

 

Logging level and log output destinations:

In the GUI, go to System > Auditing. Then in the right pane, confirm that the global syslog parameters show the local logging destination as the localhost IP (if not, you're only logging externally) and check the logging level.

Under System > Auditing > Syslog > policies and actions, you can see if one or more syslog policies are created and if so, where they are bound. The actions would indicate additional logging locations and log level. Usually, you keep audit parameters logging locally and then use policies (bound to global system or to specific vservers) to log to ADDITIONAL external locations. For something like load balancing, additional custom logging probably isn't needed.  

 

For this in particular, I wasn't expecting messages, but if something was wrong and it was logged, we'd have new information to look at.

 

OK that's fine then. I can see the LB VS go up/down in that log when I enable/disable, but nothing else really related to it.

 

14 hours ago, Rhonda Rowland1709152125 said:

Maybe you should share your smtp load balancing config. This is usually done with content switching.

Check the lb vserver(s) for any additional policies that may be affecting the load balancing flow.

 

We are not using content switching. I can't remember where I saw how to configure it like this, but it made sense and seemed to work great until now....

 

Not sure of the best way to share the config. The commands used to set it up were (IPs and names changed):

add server PRODSVR1 10.1.0.1
add server PRODSVR2 10.1.0.2
add server DRSVR1 10.2.0.1

 

add serviceGroup "SMTP Relay Service Group" ANY -maxClient 0 -maxReq 0 -cip DISABLED -usip YES -useproxyport NO -cltTimeout 120 -svrTimeout 120 -CKA NO -TCPB NO -CMP NO -appflowLog DISABLED


add serviceGroup "SMTP Relay Service Group DR" ANY -maxClient 0 -maxReq 0 -cip DISABLED -usip YES -useproxyport NO -cltTimeout 120 -svrTimeout 120 -CKA NO -TCPB NO -CMP NO -appflowLog DISABLED

 

add lb vserver "SMTP Relay" ANY 172.25.0.70 25 -persistenceType NONE -lbMethod SOURCEIPHASH -m MAC -sessionless ENABLED -cltTimeout 120 -backupVServer "SMTP Relay DR"

add lb vserver "SMTP Relay DR" ANY 0.0.0.0 0 -persistenceType NONE -lbMethod SOURCEIPHASH -m MAC -sessionless ENABLED -cltTimeout 120

add lb monitor "SMTP Monitor" SMTP -scriptName nssmtp.pl -dispatcherIP 127.0.0.1 -dispatcherPort 3013 -LRTM DISABLED -destPort 25

bind lb vserver "SMTP Relay" "SMTP Relay Service Group"
bind lb vserver "SMTP Relay DR" "SMTP Relay Service Group DR"

bind serviceGroup "SMTP Relay Service Group" PRODSVR1 25
bind serviceGroup "SMTP Relay Service Group" PRODSVR2 25
bind serviceGroup "SMTP Relay Service Group" -monitorName "SMTP Monitor"

bind serviceGroup "SMTP Relay Service Group DR" DRSVR1 25

bind serviceGroup "SMTP Relay Service Group DR" -monitorName "SMTP Monitor"

 

14 hours ago, Rhonda Rowland1709152125 said:

For the nstrace you ran from the ADC, did you also enable "Trace filtered connection's peer traffic" to get backend traffic in the trace? That way we can see whether the traffic doesn't leave the ADC, or whether it does and the backend is the problem.

 

I don't know if that option was enabled originally or not as Citrix support took the original trace. I just re-ran it with that option enabled and I don't see anything different, I still don't seem to see any traffic from NS to back end servers for SMTP/TCP port 25... Well, I see one connection attempt, which appears to be fine.

 

I CAN see traffic to the same back end servers for HTTPS/TCP 443, which is handled by a content switching policy for OWA & Active Sync. So that proves that the NetScaler can talk to the Exchange server and vice versa, for HTTPS/443 traffic, which is working fine.

 

When looking at the TCP 25 traffic, the trace is mostly just showing 'TCP Retransmission'. There are 8.3 million entries in the trace! The trace was only running for the time it took me to start the telnet and for it to timeout, which was less than 30 seconds. That is way high... Given that 99% of the entries are 'TCP Retransmission', it makes me think that the NetScaler is not acknowledging/accepting the traffic from the clients correctly...?
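With millions of packets in a short trace, a per-flow packet count can quickly show whether a single connection attempt accounts for nearly all of them. The Python sketch below assumes the trace has already been exported to simple (source IP, destination IP, destination port) tuples; that export step and the name `top_flows` are assumptions for illustration, not an nstrace or Wireshark feature.

```python
from collections import Counter


def top_flows(packets, n=5):
    """Return the n most frequent (src, dst, dst_port) flows with their packet counts."""
    counts = Counter((src, dst, dport) for src, dst, dport in packets)
    return counts.most_common(n)
```

If one flow to port 25 dominates the count after a single telnet attempt, that points at the same packet being repeated many times rather than at many independent clients.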

 

I also confirmed that if I log in to the NetScaler itself via SSH, I can telnet from the NetScaler to all of the back end Exchange servers fine.

 

I am going to investigate CS VS for SMTP LB now and see if that changes anything while I wait for Citrix support to come back to me.

 

Thanks!


So were there any other changes made when you did your failover test?  Any config changes on the ADC such as firmware change or other changes made?

Any PBRs (policy-based routes) implemented?

 

 

Were you also using DSR (direct server return) mode for your SMTP load balancing with USIP mode? That is possibly why your backend SMTP servers still keep their default gateway instead of the ADC SNIP. (Since you seem to also have MB on the service group enabled.)

DSR may require your VIP to have ARP disabled and a similar config on the switch behind it.  It may mean that when you changed your IPs back, you need to clear or update the arp tables again.  If you weren't using DSR mode then this may not apply.

 

Some references on this config:  https://www.mycugc.org/blogs/braxton-koppelman1/2020/07/07/securing-exchange-smtp-relay-hosts-with-citrix-adc

and https://www.mycugc.org/blogs/cugc-blogs/2016/12/27/load-balancing-exchange-smtp-relay-and-iis-smtp-re

This second article mentions the ARP monitor

But it mentions that the server (the backend itself) has to be configured with the SMTP virtual server VIP. It's possible that after your failover you are missing your primary VIP and just have your backend VIP in the config.

 

I'll be honest, SMTP with USIP and DSR mode is not something I'm familiar with outside these articles. But if this sounds like what you were doing instead of USIP mode (where the traffic returns to the ADC via the SNIP), that second article may help you figure out what changed. Or someone else with more experience may be able to weigh in here. There may be other ways to get the backend server to not do the ARP lookup than what that second article shows. So be cautious before making changes, but it may show you where to look if this scenario even applies. There may be better ways to do this.


16 minutes ago, Rhonda Rowland1709152125 said:

So were there any other changes made when you did your failover test?  Any config changes on the ADC such as firmware change or other changes made?

 

Nope, nothing else changed

 

16 minutes ago, Rhonda Rowland said:

Any PBRs (policy-based routes) implemented?

 

Nope

 

16 minutes ago, Rhonda Rowland said:

 

Were you also using DSR (direct server return) mode for your SMTP load balancing with USIP mode? That is possibly why your backend SMTP servers still keep their default gateway instead of the ADC SNIP. (Since you seem to also have MB on the service group enabled.)

DSR may require your VIP to have ARP disabled and a similar config on the switch behind it.  It may mean that when you changed your IPs back, you need to clear or update the arp tables again.  If you weren't using DSR mode then this may not apply.

 

Some references on this config:  https://www.mycugc.org/blogs/braxton-koppelman1/2020/07/07/securing-exchange-smtp-relay-hosts-with-citrix-adc

and https://www.mycugc.org/blogs/cugc-blogs/2016/12/27/load-balancing-exchange-smtp-relay-and-iis-smtp-re

This second article mentions the ARP monitor

But it mentions that the server (the backend itself) has to be configured with the SMTP virtual server VIP. It's possible that after your failover you are missing your primary VIP and just have your backend VIP in the config.

 

I'll be honest, SMTP with USIP and DSR mode is not something I'm familiar with outside these articles. But if this sounds like what you were doing instead of USIP mode (where the traffic returns to the ADC via the SNIP), that second article may help you figure out what changed. Or someone else with more experience may be able to weigh in here. There may be other ways to get the backend server to not do the ARP lookup than what that second article shows. So be cautious before making changes, but it may show you where to look if this scenario even applies. There may be better ways to do this.

 

I remember reading about DSR, but I can't remember what we might have done there. I know that a 'loopback' adapter was created on the Exchange servers, but that is all I can remember at this stage. I will have to look back at the setup and notes.


Thanks Rhonda. 

 

I configured the overall setup as per https://www.mycugc.org/blogs/cugc-blogs/2016/12/27/load-balancing-exchange-smtp-relay-and-iis-smtp-re

 

We do not use the IIS SMTP relay, so that bit is not configured, but the rest is the same, including the loopback adapter on the Exchange servers. I have double checked everything against the article and it appears correct, so not sure where else to go with this at the moment, very frustrating.

 

Unfortunately Citrix support are being next to useless as well at the moment, they don't reply for days and when they do it's with something small like a basic question they should have asked in the beginning. Again, very frustrating!

 

I am trying to get Wireshark installed on the Exchange server, hoping that will shed some light on what is happening.


OK, so I have some strange progress...

 

I managed to get Wireshark installed on one of the Exchange servers, and I was seeing traffic from the NS to the Exchange server, but not a single stream; there were several streams. During the trace I only attempted to connect once, so I should have only seen one stream... I was thinking that the duplicate streams might have been the 'SMTP monitor' testing that the Exchange server was up. To ensure my trace was clean and only showing the actual telnet test traffic, I unbound the 'SMTP monitor' and then tested again, and the telnet connected!

 

I then re-bound the 'SMTP monitor' and the telnet failed again... I unbound it again and tested and it worked...

 

I was obviously confused as to why this was causing a problem, but at least getting somewhere... I then re-bound the monitor again to test some other things and it is still working!? This makes zero sense to me...

 

So everything is currently unchanged and as it has always been, and it is now working again as far as the primary VS is concerned.

 

I have an issue still with the secondary/backup VS. I unbound the monitor for that early in my testing and never re-linked it. I have re-bound the monitor and it is failing, reporting the server as down. I can see the traffic going to the Exchange server in site B and it is permitted at the FW level, so should be fine. Again, nothing has changed on the Exchange server or NetScaler in relation to that, so no idea why it has failed now! :(

 

The fact that the primary is now working is good, but I have no idea why it stopped, or why unbinding/rebinding the SMTP monitor on the service group made it work!
