I joined Citrix in 2002 as a developer when we were still a single-product company (MetaFrame) of about 1,800 employees and revenues of approximately $500m. Since then, we've had magnificent growth in nearly every possible metric: revenues, employees, customers, market share and strategic importance to IT. Working behind the scenes to make it all happen are XenApp's 18 different Technology Component teams, working around the clock in 5 different countries to develop 450 product binaries and rigorously test 320 build layouts throughout the project cycle. With this enormity and explosive growth, the question of "How do we deliver products that bring the most value to IT at the lowest cost in the shortest time?" is one that our engineering teams deal with in every product release cycle.
In the last 10 months, being on the Release Team for XenApp, I learned 10 valuable lessons in software engineering management (in addition to losing weight, sleep and some hair). Although these observations are based on Citrix-specific examples, I believe they contain universal truths that apply not just to software development at Citrix but to any large IT or software project. I also want to give you some insight into what our engineering processes are like.
Lesson 1: Tackle cross-team dependencies first
XenApp Platinum Edition has many components with numerous cross-team dependencies. If you are a team that owns an SDK (even a couple of APIs or modules for another team to consume), focus on those first before worrying about your internal milestones. Why? There are two reasons. First, the team that consumes your interfaces might in turn have other important milestones that depend on this SDK being delivered on time. You can't starve them. Second, most problems arise at these touch points, for example RPC calls or 32-bit consumers of a 64-bit SDK. The logistics of releasing an SDK are not trivial either: build and install issues are often unaccounted for or overlooked. That is why it is important to at least provide a skeletal interface (that returns dummy data) and fill in the meat later. Remember: it is better to be somewhat correct than precisely wrong.
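To make the "skeletal interface" idea concrete, here is a minimal sketch in Python. Everything in it is hypothetical and only for illustration: `LoadQueryApi`, `get_server_load`, the dummy value and the consumer function are not part of any Citrix SDK. The point is that the consuming team can start coding and testing against the agreed signature long before the real implementation exists.

```python
from abc import ABC, abstractmethod


class LoadQueryApi(ABC):
    """Hypothetical SDK contract agreed between the owning and consuming teams."""

    @abstractmethod
    def get_server_load(self, server_name: str) -> int:
        """Return a load figure for the given server (higher means busier)."""


class StubLoadQueryApi(LoadQueryApi):
    """Skeletal implementation shipped early: real signature, dummy data."""

    def get_server_load(self, server_name: str) -> int:
        if not server_name:
            raise ValueError("server_name must not be empty")
        return 5000  # dummy value; the real computation is filled in later


def pick_least_loaded(api: LoadQueryApi, servers: list[str]) -> str:
    """Consumer-side code that can already be written against the stub."""
    return min(servers, key=api.get_server_load)
```

Because the consumer depends only on the abstract contract, swapping the stub for the real implementation later should not require changes on the consuming side, and build and install issues surface early.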
Lesson 2: Refactor gradually, not all at once
Our natural tendency as engineers is to try and build the best performing, ideally architected, and most logically modular components. Realistically, it is impossible to score on all these fronts unless you have infinite time. Software engineers are builders and architects. They are extremely passionate about their buildings (code). We often come up with grandiose ideas of re-writing entire components with the latest available technology to try and make it all better and new, this time. But it just doesn't work like that. Here are the pitfalls in "grand" refactoring -
• Grand refactoring comes at a huge upfront cost that we tend to ignore (most of this is in reverse-engineering legacy code). These costs are hard to justify to managers in a tough economy like this, especially when everything is working as is.
• There is no guarantee that refactored code will perform better. The original engineer who wrote that code made certain design decisions consciously (might have been subtle compromises). Statistically speaking, the new engineers who are doing the refactoring are no smarter than the original authors of the code. So don't touch something if it is not completely broken.
• By the time you are half-way into refactoring, it is quite likely that a new set of requirements may come in that contradict your refactoring plan.
I am not against refactoring. Here's what I think is the best approach to refactoring:
• Identify the top problematic areas (in key measures such as maintainability, performance and security) and start by going after those first.
• Learn from smaller refactoring undertakings before you take a big step (think big but start small).
• Advertise refactoring improvements. If a modest investment got you a big gain, blog about it, share your experiences so others can learn from it.
• Refactoring must be continuous (in my opinion, every release should lay out 5-7% of its budget for refactoring improvements) but don't overdo it.
Lesson 3: Don't overestimate demos
Demos are a great way to reveal earned value. They help to showcase engineering innovation, secure (or maintain) funding and, most importantly, give confidence in your design. But demos that show PoCs (proofs of concept) should not be mistaken for end products. We take several shortcuts when doing demos. There is a long way between demo/prototype quality and release quality, and you need to account for it in your project plan.
Lesson 4: Don't underestimate integrations
When a novice engineer says "Oh, it's easy, it will take 10 minutes to do", they are almost always wrong. This is especially true of system integration. The convenience of having a VBL (Virtual Build Lab, a sort of private tree for building code) for isolated and disruptive development does not come for free (unless you are isolated by binary-based releases, and even there you may have a cost arising from dependency alignment). Integrations don't end when your code compiles. It all needs to work (at Citrix, we use a product-wide test automation framework that runs on every build to ascertain quality metrics on a continuous basis). Assign a generous amount of time to do integrations.
Lesson 5: Automation is not a silver bullet
There is a sign in the first floor's break room in our main XenApp engineering building here in Ft. Lauderdale that says "Automation is not a silver bullet". Keep a copy of it in your office. Certain code lends itself well to automation and certain classes of code do not. Like refactoring, here are some thoughts I have on automation:
• Full-fledged automation comes at a huge fixed cost. If your payback period is more than 3 years, re-think. We are in a fast-paced industry. The scenario or code that you automate now, may be far less important (or not even applicable) 2-3 years from now.
• There is code that can be tested very effectively using automation (eg: session management, capacity and load management), and some that just can't be (multimedia). 100% automation is impractical (trying to achieve 100% of anything is somewhat impractical for that matter).
• Automation also needs continuous maintenance. For example, if you author automated unit tests, you need to make sure a. they keep running release after release and b. they keep passing release after release. As code changes, the way it is tested may also need to change. So while calculating payback, be sure to attach a maintenance cost for every test you author.
• Like refactoring, I believe automation effort is one of those things that you have to include for each release, but not overdo it. Start with automating the biggest bang for the buck.
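The payback reasoning in the bullets above can be sketched as a small calculation. This is only an illustration under simple assumptions of my own: costs are in the same unit (say, person-days), the manual testing cost and the maintenance cost are constant per year, and the function and parameter names are made up for this example.

```python
def payback_period_years(fixed_cost, yearly_manual_cost, yearly_maintenance_cost):
    """Years until the automation has paid for itself, or None if it never does.

    fixed_cost: one-time cost of building the automation
    yearly_manual_cost: cost of running the same tests by hand each year
    yearly_maintenance_cost: cost of keeping the automated tests running and passing
    """
    yearly_saving = yearly_manual_cost - yearly_maintenance_cost
    if yearly_saving <= 0:
        return None  # maintenance eats the whole saving
    return fixed_cost / yearly_saving


def worth_automating(fixed_cost, yearly_manual_cost, yearly_maintenance_cost,
                     max_years=3.0):
    """Apply the rule of thumb: re-think if payback takes more than max_years."""
    years = payback_period_years(fixed_cost, yearly_manual_cost,
                                 yearly_maintenance_cost)
    return years is not None and years <= max_years
```

For example, an automation that costs 90 person-days to build, replaces 50 person-days of manual testing a year and needs 20 person-days of maintenance a year pays back in exactly 3 years, which is right at the limit.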
Lesson 6: Keep processes lightweight, yet efficient
As part of the Release Team, I have an obligation to make sure that release processes facilitate and do not burden Technology Component (TC) teams. Citrix's processes have evolved greatly since we started. When Citrix had fewer than 10 engineers in the late 80s, they used to keep track of bugs on a single common whiteboard! Simple yet effective. 20 years later, we have sophisticated bug tracking and requirements management systems. Yet we still strive for the same simplicity and effectiveness that we had 20 years ago. To that effect, we introduced a number of process improvements:
• Formal requirements authoring processes take backstage, instead lightweight feature specs with visuals, screenshots and video demos take center-stage. We need to document things that need to be communicated, not every possible thing one can ask.
• Formalized Graphical Test Plans only where it makes sense (for complete end-end features, not for individual components or modules).
• For feature submissions that are surgical changes, you don't need to produce a full code coverage report (the benefit doesn't always justify the cost). But make sure you methodically and carefully step through the code changes with various inputs.
• Test automation only where ROI can be justified.
• No more separate low level and high level design documents. They can be combined into one that is readable by both technical and non-technical audiences.
Lesson 7: Cross-team hand-off is more than just being code-complete
With a process methodology that is component-centric, we "deliver" frameworks, "release" SDKs or "hand off" API sets. I get really upset when a team makes tall claims to have done one of these things without having a single consumer try it out first. How do you know that your hand-off meets the requirements of the SDK consumer? Does it even work? Did you factor in the time that it takes to write an installer and release a build, tasks that go hand in hand with releasing something? I really think hand-offs must be signed off not by the team that releases them, but by the team that actually consumes them.
Lesson 8: Component complexity is multiplicative, not additive
The number of cross-team dependencies that we have in our product is amazing. This is no different from any other mature product of our size. Don't underestimate this when taking on a new project. If 20 TC teams each take on a project with a complexity of n, the overall complexity of the release becomes n^20, not 20n. This is due to the high cost of integration and integration testing. Keep this in mind before you take on a new feature. Often, less is more.
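To get a feel for the difference between 20n and n^20, here is a tiny sketch. The numbers are purely illustrative, and "complexity" is just an abstract score here, not something we actually measure this way.

```python
def additive_complexity(n, teams):
    """Overall complexity if team efforts simply added up: teams * n."""
    return teams * n


def multiplicative_complexity(n, teams):
    """Overall complexity if every team's work interacts with the others: n ** teams."""
    return n ** teams


if __name__ == "__main__":
    n, teams = 2, 20
    print(additive_complexity(n, teams))        # 40
    print(multiplicative_complexity(n, teams))  # 1048576
```

Even with a per-team complexity of only 2, the multiplicative view gives over a million against 40 for the additive view, which is why trimming a single feature can pay off far more than it seems.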
Lesson 9: What quality means...
The definition of quality has changed in the last 2 decades. The 80s and 90s were about trying to achieve high quality control. If you had no Six Sigma or SEI certification, you were considered an outcast. Things have changed since then. In the age of well-designed products like the iPhone, Windows 7, Google Maps, Amazon and Salesforce, strong visual appeal, good design and usability are now taking center stage. This has been true for Citrix as well. In the last few years, our focus has been shifting from a blind, knee-jerk, single-dimension view on "bug counts" to a more practical approach of solving real customer pain points (for example: XA5 for Win2003 IMA resiliency, VM Hosted Apps, etc.) and making some smart investments in design and intuitive visual appeal (Dazzle, Receiver, etc.). Also, we've been constantly keeping maintenance costs low and passing on the cost benefits to our customers by shedding expensive baggage (deprecating high-cost, niche legacy features and removing code that is no longer executed). Keep looking for these opportunities at the code level.
Lesson 10: Balance idealism and practicality
This is really the biggest lesson that I've learned. You can't achieve your ideal vision in one release. Be practical and be patient. Go after hard and known technical problems one at a time. Build on small successes to springboard you to the next level. Enterprise products and quality are built like concentric circles, inside - out. Create and follow a roadmap to your ultimate vision.
In this interview, Willie Wright, one of the original developers of XenApp's CPU Management Technology, talks to Prasanna Padmanabhan about the history of MalooCPU, Delaware improvements as part of Preferential Load Balancing and some longer term research in the area of general resource management.
Some of you may have listened to this one, but our podcasts don't support comments yet. So I thought I'd put it in here as a blog post, so that we now have a way to hear back from you.
There's a lot of excitement around project Delaware, the first "XenApp" release for the new Windows Server 2008 platform. In this video, I talk about Preferential Load Balancing or PLB, a new feature in Delaware, that brings improvements to CPU Management and Load Balancing.
Hi, I am Prasanna Padmanabhan. I am a software developer in Citrix Engineering. This is my first blog entry in the Citrix Community blogs, so I am a newbie.
I am working with a small engineering team looking into some ideas for enhancing Load Balancing (LB) in Citrix Presentation Server (CPS), something we are investigating as part of the Constellation set of technologies here at Citrix. The reason I am writing this blog is not only to get your thoughts on these new ideas, but also to understand from you how you typically configure LB in CPS.
Before I start to describe the new ideas, let me tell you this. I or another one of our team members will be able to write a separate blog post describing in detail how LB works in CPS today. If you would like me to do that, do let me know, since LB is a somewhat complex subject for most of us.
There are a couple of ideas that we are exploring.
- End User Experience (EUE) based load balancing
- User-Application preferential load balancing
The first has to do with load balancing based on the end user experience, that is, load balancing to the server that at that instant gives the best user experience. It is a separate topic in itself and it deserves its own discussion.
In this post, I talk about User-Application preferential load balancing, which is about preferring certain applications and users to others. Often, administrators want to provide a certain level of service to certain users based on their job functions, their position within the company, etc. In the same way, published applications may need to be assigned a level of importance based on how critical they are to the business. This could also change during different times of the year. Accounting and finance applications become all the more mission critical at the end of each quarter!
Today, administrators can do this in several ways, but we think they are not too straightforward.
- 1. Manual isolation or siloing - manually assigning mission critical applications to their own servers and grouping many normal applications together on other servers.
This is acceptable but could sometimes be an administrative hassle: if you by mistake publish an important application along with other important applications on the same server, users will begin to notice performance issues.
- 2. CPU priority levels - You can assign CPU priority levels for published applications (via the Access Management Console). This uses the operating system's CPU scheduler, which could potentially starve a lower priority application if a higher priority application does not relinquish the CPU for a long time (this does not happen very often, but when it does, users might start to notice slowness). Another problem is that you might publish the same application twice, with a different importance level setting, and end up with Outlook_2 and Outlook_3.
- 3. Using the CPS policy mechanism to set limits on virtual channels such as bandwidth. But unfortunately, these are hard limits. Even if someone is not using that much bandwidth, a needy user cannot get it.
I am looking at this from a designer/developer point of view. As our customers, you use the admin tools much more than I do. Do you
- a) Agree that these are indeed problems? If so how do you currently work around it or
- b) Think that it is not a huge problem, but that a better solution would make your life easier, or
- c) Feel that this is not an issue at all?
We have heard a lot of (a) and (b), but not many (c). But if you feel that this is not a problem, I would like to hear why you think so.
User-Application preferential load balancing is an idea that we believe will solve a good portion of these issues, and that is what I am going to talk about.
Central to the idea of User-Application preferential load balancing is the idea of resource shares. Resource shares are simply numbers or quantifiers that denote how important a user or application is. The more resource shares you have, the more CPU you get. The more resource shares an application gets, the more CPU it gets. For example, assume that there are two ICA sessions, S1 and S2, currently running on a server, and they have 4 shares and 8 shares respectively. Then, if they were at any point competing for CPU at the same time, they would get roughly 33% and 67% of the CPU respectively.
The clause "if they were at any point competing for CPU at the same time" is important to understand. These are soft caps. So if S2 wasn't doing anything (i.e., it was in an idle state), then S1 could get more than its share of 1/3rd of the CPU if it needed it. But the moment S2 started to do something CPU intensive that would make it grab much of the CPU (e.g., doing a search operation), the CPU share enforcer would kick in, take away the extra CPU cycles that S1 was temporarily enjoying, and hand those off to S2, the more important session.
Typically, applications never consume large portions of the CPU for very long periods of time. They usually do so in short bursts (e.g., a macro is being run in an application like Word, or a search is being performed on a document). Without the CPU rebalancing feature, users might suddenly see longer response times or general slowness when other users perform these CPU intensive tasks. Thus, the CPU rebalancer effectively shields important users and applications from these kinds of situations.
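Here is a minimal sketch of the share arithmetic with soft caps. It is a simplified model of my own for illustration, not the actual CPU share enforcer: it assumes every competing session wants as much CPU as it can get, and the function name `cpu_split` is made up.

```python
def cpu_split(shares, competing):
    """Split the CPU among the sessions that are competing for it right now.

    shares: dict mapping session name -> resource shares
    competing: set of session names that currently want CPU
    Returns a dict mapping session name -> fraction of the CPU (0.0 if idle).
    """
    active_total = sum(shares[s] for s in competing)
    return {
        s: (shares[s] / active_total if s in competing else 0.0)
        for s in shares
    }


shares = {"S1": 4, "S2": 8}

# Both sessions compete: S1 gets 4/12 and S2 gets 8/12 of the CPU.
both_busy = cpu_split(shares, {"S1", "S2"})

# S2 is idle: the caps are soft, so S1 may use the whole CPU.
s2_idle = cpu_split(shares, {"S1"})
```

Only the shares of sessions that are actually competing enter the denominator, which is what makes the caps soft: an idle session's entitlement is lent out and reclaimed as soon as that session becomes busy again.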
Some readers might be able to relate this feature to the fair-share CPU scheduling feature and the Server User load rule in CPS 4.5 Enterprise.
| | CPS 4.5 Enterprise | User-Application preferential load balancing |
| CPU scheduling feature | This allowed administrators to ensure fair-share (equal share) CPU usage among user sessions on a server. | Instead of each user session getting equal importance, sessions are given a numerical importance level based on who the user is and what applications they run within that session. Sessions with more shares get more importance within a server. This denotes inter-session competition within a CPS server. |
| Server User load rule | This load rule tries to load balance sessions such that each server has approximately the same number of sessions running at any given point. | Uses the notion of shares described above (session importance levels) to load balance between servers, so that important sessions occupy more of a server, thus keeping other important sessions from running on the same server. This denotes inter-server competition within a CPS farm. |
How many shares a session gets is a function of the user and the applications that he or she is running. It is the product of your importance and the max of the importance of all the apps running in your session. Here is an example with a matrix to explain this better.
In a hospital, doctors and nurses may need to be assigned more shares for mission critical applications such as medical imaging applications (X-rays, CAT scans, etc.) or patient records, since they deal with people's lives. Moreover, doctors in a hospital have several patients to visit during ward rounds, so patient record information must be quickly available to them when they are by the patient's side, without any delays or bottlenecks in launching or using the application.
Compare this with a clinical lab technician or an administrative assistant at the same hospital. He or she might also have to pull up the same patient records once in a while, but small delays might be acceptable in this case. So they could simply be considered normal users, which is of course the default setting. In the same way, a normal application could be something like a standard home-grown application, which gets the default number of shares (2).
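The rule "user importance times the max of the application importances" can be sketched as follows. Apart from the default of 2 shares for a normal application, all the numbers and names below are hypothetical values I picked for the hospital example, not product settings.

```python
NORMAL_APP_SHARES = 2  # default for a normal application, as stated above


def session_shares(user_importance, app_importances):
    """Shares for one session: user importance x the most important running app.

    app_importances must contain at least one value (a session runs at least
    one application); max() raises ValueError on an empty list.
    """
    return user_importance * max(app_importances)


# Hypothetical importance levels for illustration only.
DOCTOR, NORMAL_USER = 4, 2
MEDICAL_IMAGING = 8

# A doctor running medical imaging next to a normal home-grown app.
doctor_session = session_shares(DOCTOR, [MEDICAL_IMAGING, NORMAL_APP_SHARES])

# A lab technician (normal user) running only the home-grown app.
technician_session = session_shares(NORMAL_USER, [NORMAL_APP_SHARES])
```

With these made-up levels the doctor's session ends up with 32 shares against 4 for the technician's, so the doctor's session would win most of the CPU whenever the two compete on the same server.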
Thus, the User-Application preferential load balancing gives users a more predictable and consistent user experience. This workload management that I described comes from Aurema, a company that Citrix recently acquired.
That's about all I have to say in this article. I shall talk about EUE based load balancing in a separate post. In the meantime, I am looking for actionable feedback on this idea. Also, don't forget to tell me whether you want to see another post on how load balancing, in general, works in Citrix Presentation Server.