28 Jul 2008 04:18 PM EDT
[ Tags: high availability,  netscaler,  xenapp ]

Does anyone care about having high availability for their XenApp farms?  I would imagine many of you would say yes.  But what does HA for XenApp really mean?  On the server hosting side, you essentially have HA because you have load balancing at the application level. So if you lose a XenApp server, it is not too much of a concern, as those users can simply restart their application and get load balanced to another server (of course they lose their previous session information, which can be annoying).  But what other areas are critical to providing a more available XenApp environment?

I've been thinking about this a lot lately, which is probably because my manager has had a lot of meetings and I tend to space out and watch episodes of The Simpsons on my laptop.  Since my DVD player broke, I started to think about HA for XenApp during these meetings (at least I'm now doing work). I was able to come up with the following thoughts:

  1. Smart Monitors:  First, I want to know that something has failed or has gone flaky.  I don't want a bunch of messages telling me everything is ok, I just want to know when something is about to go horribly wrong. For example, the XML Black Hole.  I've seen the black hole cause too many issues, so how do we detect it?  You create a smart monitor that does more than ping. It tries to make requests to the XML service. If the expected data comes back, we are good to go. If the request is never answered or the response is junk, then Homer, we have a problem.
  2. HA for the Critical Components: Now if we can detect a failure, DO SOMETHING ABOUT IT.  As we continue looking at the XML Black Hole, if we see there is an issue, then stop making requests to it. But this requires another XML Broker to take over the responsibility of the failed one without requiring changes to the environment's configuration.  Sounds a lot like load balancing to me.
  3. Business Continuity:  Essentially what I'm saying is that if my XenApp environment at one site fails, I had better have another site already waiting for connections without requiring me to make changes.  Many people have 2 data centers: a primary and a backup. Others have 2+ data centers that are all active.  For those organizations with 2 data centers (primary and backup), how do you fail users over to the backup in the event of a failure?  For those organizations that have 2+ active data centers, how do you tell your users which data center is their preferred site?  That is really a trick question (Did I get anyone?).  You shouldn't have to tell your users anything about going to a primary, backup, or tertiary site. It should happen automatically.  Users want their applications delivered as fast as possible, which could mean that one day it is from data center 1 and on another day it could be data center 2.
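
The monitor-then-fail-over idea from items 1 and 2 can be sketched in a few lines of Python. This is only an illustration, not how NetScaler implements its monitors: the probe callable, the broker names, and the expected marker string are all hypothetical stand-ins for a real request to the XML service.

```python
from typing import Callable, Iterable, List, Optional

# Hypothetical probe type: takes a broker name and returns the response body,
# or None if the request was never answered (the "black hole" case).
Probe = Callable[[str], Optional[str]]


def is_healthy(body: Optional[str], expected_marker: str) -> bool:
    """A reply counts as healthy only if it contains the data we expect."""
    return body is not None and expected_marker in body


def healthy_brokers(brokers: Iterable[str], probe: Probe,
                    expected_marker: str) -> List[str]:
    """Return the brokers whose probe response looks right, in order."""
    return [b for b in brokers if is_healthy(probe(b), expected_marker)]


def pick_broker(brokers: Iterable[str], probe: Probe,
                expected_marker: str) -> Optional[str]:
    """Send traffic to the first healthy broker; None if all have failed."""
    alive = healthy_brokers(brokers, probe, expected_marker)
    return alive[0] if alive else None


# Canned responses standing in for real XML service replies (assumed data):
# xml1 never answers, xml2 answers with junk, xml3 answers correctly.
canned = {"xml1": None, "xml2": "<junk>", "xml3": "<AppData>ok</AppData>"}
chosen = pick_broker(["xml1", "xml2", "xml3"], canned.get, "<AppData>")
print(chosen)
```

Nothing in the broker list changes when one of them fails; the bad ones are simply skipped, which is the "no configuration changes" property item 2 asks for.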

These three items are all part of NetScaler, and it is easy to set up.  Those of you who know me will notice that I've worked with the integration of NetScaler and XenApp for some time.  Well, the NetScaler product group is actually making my job easier because they are making this solution a lot easier.  I created and maintained a 40+ page document that showed you how to set all of these goodies up. Now that document is about 14 pages (with pictures for each step) because of the new NetScaler for XenApp wizards.  I'm just glad I don't get paid by the word.  Take a look at what I'm talking about. In about 5 minutes you will see me configure and integrate NetScaler with XenApp:

Watch this Video:


Also, take a look at the recently released articles that go into more detail on this integrated solution: http://support.citrix.com/product/nsad/v8.1/consulting/

  • Taking XenApp to the Next Level of Availability - Reference Architecture
  • Taking XenApp to the Next Level of Availability - Implementation Guide

I'm curious what other areas concern you when you are focused on HA for XenApp?   Let me know. Yes, my manager finally ended the meeting, I am outta here.

Daniel

(Homer Quote of the Blog "Kids, you've tried your best and failed miserably.  The lesson is, never try")

Comments (3)

Firstly, kudos for publishing your thoughts on HA designs; it's a topic that doesn't receive sufficient attention. 

Having said that, what you have described so far is still an incomplete solution.  

As you indicated in your reference architecture doc, the XML Broker performs 3 critical tasks - User Authentication, Application Enumeration and Application Launch.  Each of these three functions is invoked as a separate XML transaction from Web Interface to XML Broker, and each transaction is independent; there is no requirement that the XML Broker that performed the initial User Authentication be the XML Broker that fulfills that user's Application Enumeration or Application Launch request. Furthermore each transaction interacts with separate services which may be hosted locally on the XML Broker itself or, more likely, on some other server, and any one of these services could fail without impacting the others.  Thus for an Application Delivery Controller (ADC) to be fully informed of the health of the services it is managing each of these three functions must be testable independently. 
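
As a rough sketch of what "testable independently" could look like, the Python below runs one check per function and reports each result separately. The check callables are hypothetical placeholders; a real monitor would issue the corresponding XML transaction for each one.

```python
from typing import Callable, Dict

# The three XML Broker functions named above.
FUNCTIONS = ("authentication", "enumeration", "launch")


def check_broker(checks: Dict[str, Callable[[], bool]]) -> Dict[str, bool]:
    """Run each function's check on its own; any exception counts as a failure."""
    results = {}
    for name in FUNCTIONS:
        try:
            results[name] = bool(checks[name]())
        except Exception:
            results[name] = False
    return results


def broker_is_up(results: Dict[str, bool]) -> bool:
    """Treat the broker as available only if all three functions passed."""
    return all(results[name] for name in FUNCTIONS)


def failing_launch() -> bool:
    # Hypothetical failure: the launch request is never answered.
    raise TimeoutError("no reply")


status = check_broker({
    "authentication": lambda: True,
    "enumeration": lambda: True,
    "launch": failing_launch,
})
print(status, broker_is_up(status))
```

Keeping the per-function results, rather than a single up/down flag, is what lets the ADC see that launch is broken while authentication and enumeration still work.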

Fortunately, it is relatively easy to craft custom health checks to augment the base configuration and provide as much monitoring as anyone might consider necessary, but it would be preferable to have these additional health checks provided out-of-the-box.

Now you could argue that with XenApp 4.5 we have Health Monitoring and Recovery agents that can perform functional validation of XML Broker operation so running in-depth XML Broker health checks on the ADC isn't necessary.  However, experience has made me cautious (there's still the risk that the HMR service might fail or a specific HMR test might not detect some esoteric error condition), so I'm not going to hand over my service availability to a single technology if I can reduce risk by having multiple independent health monitoring systems keeping an eye on this critical service. 

Anyway, to get back to the question you asked: "what other areas concern you when you are focused on HA for XenApp?"
The big issue for me goes beyond the questions asked in your reference architecture doc: how do you manage partial failures?

What happens in those less-than-clear-cut situations when a farm is operating at less than 100% capacity?  How do you detect and react to application capacity constraints within a farm that prevent it from meeting demand, but where the access infrastructure (Web Interface, XML Broker, etc.) is fully operational?  And let's not forget the really hard question: how do you make informed load management decisions when your farms are fully operational but the application is not working?
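
One way to frame those questions, purely as a sketch: weight each farm by its spare capacity, and give it no traffic at all if either the access infrastructure or the application itself fails its test. The FarmStatus fields below, including the capacity_used metric and the application test, are assumptions made for illustration and not something any product reports in this form.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class FarmStatus:
    name: str
    infrastructure_ok: bool  # Web Interface, XML Broker, etc. respond
    application_ok: bool     # hypothetical end-to-end application test passed
    capacity_used: float     # assumed metric: 0.0 (idle) to 1.0 (full)


def traffic_weights(farms: List[FarmStatus]) -> Dict[str, float]:
    """Share traffic in proportion to spare capacity on usable farms."""
    spare = {}
    for farm in farms:
        usable = farm.infrastructure_ok and farm.application_ok
        spare[farm.name] = max(0.0, 1.0 - farm.capacity_used) if usable else 0.0
    total = sum(spare.values())
    if total == 0:
        # Nothing usable anywhere: no farm gets a share.
        return {name: 0.0 for name in spare}
    return {name: value / total for name, value in spare.items()}


farms = [
    FarmStatus("A", True, True, 0.25),
    FarmStatus("B", True, True, 0.75),   # partial failure: little room left
    FarmStatus("C", True, False, 0.10),  # farm is up, application is broken
]
weights = traffic_weights(farms)
print(weights)
```

Farm C is the "really hard" case: it looks idle and healthy from the infrastructure side, so without the application test it would attract the most traffic.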
 

Regards

Simon

Posted by Simon Bramfitt at Jul 29, 2008 11:58 | Updated by Simon Bramfitt

After deploying this solution at many of my XenApp customers, I offer the following suggestions.

1. Don't load balance the STAs, as Web Interface and AGEE both offer STA failover and load balancing.

2. You didn't focus on persistence in this demo, definitely a big concern when it comes to WI and AGEE traffic.

3. I wonder if we can use custom load monitors (SNMP) to monitor the XenApp backend servers, bind these monitors to GSLB, and make DR decisions based on them.  That would be cool!

4. Don't forget, we can use DNS views to send users to the WI site and XenApp site that is closest to them (great for internal LB).

5. Don't forget about setting up XenApp rules for multi-site redundancy.
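
Suggestions 3 and 4 combine naturally: prefer the site closest to the user, but move on when that site's backend load reading says it can't take more. A minimal Python sketch, assuming a made-up load metric between 0 and 1 and an arbitrary 0.9 threshold (neither comes from NetScaler or from an actual SNMP monitor):

```python
from typing import Dict, List, Optional


def choose_site(preference: List[str],
                load: Dict[str, Optional[float]],
                max_load: float = 0.9) -> Optional[str]:
    """Walk the user's sites nearest-first and return the first usable one.

    A site is skipped when it has no load reading (monitor failed) or when
    its load is at or above max_load. Returns None if no site qualifies.
    """
    for site in preference:
        reading = load.get(site)
        if reading is not None and reading < max_load:
            return site
    return None


# Hypothetical readings: the user's closest site is overloaded.
site = choose_site(["east", "west"], {"east": 0.95, "west": 0.40})
print(site)
```

The user is never told which site is primary or backup; the ordered preference list and the load readings make that decision for each request.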

I've been configuring this at many locations, and it ROCKS!!!  Total XenApp redundancy! Total solution!

Persistence. It is in there, but I didn't cover it.  Those wizards actually do quite a lot of configuration for you.  When you use the wizards, persistence to the Web Interface is set automatically.  Persistence to the XML Broker is not set, as it is not needed.

3. Very interesting idea.  I think that plays into what Simon was looking for in his post: what do you do when the XenApp farm runs at full capacity but the application does not?

5.  I'm assuming you are talking about Zone Preference and Failover.  A really cool feature.  You can get into some really detailed design when you bring in ZP&F as well as GSLB. Based on your example, I would guess users connect to an Access Gateway.  Then, from AG, XenApp will have users launch apps from an XA server at a different site.  Now you've got the traffic going from the second site to the primary site across your internal WAN, and then out to the user from site 1's AG.  A different option would be to use NetScaler GSLB, which sends you to the correct site and then your apps launch from that same site.  I brought these concepts up in the design considerations document located here: http://support.citrix.com/article/CTX116051

Great ideas.  And we need a blog whiteboard.