Thursday, June 13, 2013

Exchange Load Balancing and Mysterious MAPI Behavior, Part 1

MAPI application to Update Outlook Contacts

Microsoft Outlook and Exchange Server Load Balancing
Load balancing Outlook clients on Microsoft Exchange Server is possibly the last piece of the enterprise messaging strategy on the way to nirvana. It's compelling because you get extreme reliability and scalability at the same time. The load balancer also makes maintenance easier because you can take down individual Client Access Servers without impacting any user.

The list of the load balancer’s benefits makes it a worthy endeavor. However, can we spoil it with too much of a good thing? Does everything that uses Exchange Server need to traverse the load balancer? We like consistency, so shouldn't we also insist that third-party applications using Exchange also be routed via the load balancer?
If that is part of the plan, I am writing to suggest that there are potholes in this road and you may want to drive carefully. After a few painful experiences, itrezzo technical support has found that Exchange load balancers do break a few axles from time to time. What’s good for Outlook isn’t always good for MAPI applications. Two such examples are the BlackBerry Enterprise Server and the itrezzo Unified Contact Manager.

Outlook Connectivity
First, let's review some of the things that make Outlook an ideal application for load balancers.
  • RPC packets are encapsulated in HTTP. Load balancers are already good at dealing with HTTP.
  • Transactions are usually small. For example, an Outlook user will read, delete, or send a message. A pinned network session is not required.
  • Highly resilient with unreliable networks. Outlook has been developed and refined for 16 years. It knows how to thrive in poor network conditions. If a load balancer needs to close a connection, the client will reliably re-establish itself.
  • Users are able to work offline and synchronization happens in the background. If a session is interrupted for several seconds, it is easy to recover and it is unlikely that a user will even notice.
  • Thousands of users can be easily spread across multiple Client Access Servers. Load balancers have their choice of numerous paths to get back to the Exchange Mailbox server.
  • Ubiquity. There are probably tens of millions of Outlook users already working through load balancers. The market is mature and vendors can afford to extensively test all versions of Outlook in a wide variety of scenarios.

MAPI Connectivity

So why would load balancers have a problem with Exchange / MAPI applications, and not with Outlook? In many cases, an application that makes a transient connection to Exchange Server probably won’t have any issues. However, an application that has a persistent connection to Exchange runs a higher risk of instability.
Let’s start with a very simple MAPI application. The application might involve multiple steps such as creating a profile, logging on and authenticating with the domain, connecting to the default information store for a specific user, connecting to a public folder store, accessing a specific folder, and then accessing a few messages, appointments, or Outlook contacts. The application might then make a few changes and log off.
If this entire sequence takes 10 seconds and is infrequently performed, it is unlikely that a network failure in the middle of this sequence would even occur. For example, if the load balancer just established a MAPI session on the least active CAS, it probably isn't going to change in the following 10 seconds.
If a MAPI application is persistent and stays connected to one or more mailboxes over a long period of time, there may be a very different scenario.
For example, what happens when the application has to update 1,500 Outlook contacts with pictures and 200 phone numbers while finding exact contact matches in a folder that has as many as 2,500 items? What happens when it has to update 2,000 more mailboxes after it completes that first mailbox?

This is the case with itrezzo Unified Contact Management software. We recently had two customers who decided to roll out employee pictures to all mobile devices. Here is a high-level sequence of the contact synchronization that the software performs:
  1. Log on to the end user mailbox
  2. Load 2500 items from the contact folder
  3. Determine obsolete Outlook contacts for inactive employees and delete them
  4. Add Outlook contacts for new employees
  5. Compare every field to detect discrete contact updates
  6. Perform field-level updates exactly where needed (merge new data into the Outlook contact so no wireless smartphone activity is generated unless required). For example, if a phone number changes, the synchronization should only be a 10-byte transaction instead of a deletion and recreation of the entire contact.
  7. Load and compare the picture, and update it as required
  8. Back up all contact changes as an attachment to capture end user changes that could potentially get overwritten
  9. Log out of the user mailbox
  10. Repeat for 2000 mailboxes
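Steps 3 through 6 amount to a diff between what is in the mailbox and what the directory says should be there. The following is only a minimal sketch of that idea in plain Python, not the itrezzo implementation: the function name, the dictionary layout, and the contact IDs are all hypothetical, and no MAPI calls are made.

```python
from typing import Dict, List, Tuple

# Hypothetical shape: field name -> value, e.g. {"name": "Ann", "mobile": "555-0100"}
Contact = Dict[str, str]


def plan_contact_sync(
    mailbox: Dict[str, Contact],
    directory: Dict[str, Contact],
) -> Tuple[List[str], List[str], Dict[str, Contact]]:
    """Return (to_delete, to_add, field_updates), keyed by an assumed contact ID."""
    # Step 3: contacts in the mailbox that no longer exist in the directory.
    to_delete = sorted(cid for cid in mailbox if cid not in directory)
    # Step 4: directory entries that the mailbox does not have yet.
    to_add = sorted(cid for cid in directory if cid not in mailbox)
    # Steps 5 and 6: for contacts on both sides, keep only the fields that differ.
    field_updates: Dict[str, Contact] = {}
    for cid in mailbox.keys() & directory.keys():
        current, wanted = mailbox[cid], directory[cid]
        changed = {f: v for f, v in wanted.items() if current.get(f) != v}
        if changed:
            field_updates[cid] = changed
    return to_delete, to_add, field_updates


mailbox = {
    "e1": {"name": "Ann", "mobile": "555-0100"},
    "e2": {"name": "Bob", "mobile": "555-0101"},
}
directory = {
    "e1": {"name": "Ann", "mobile": "555-0199"},
    "e3": {"name": "Cy", "mobile": "555-0102"},
}
to_delete, to_add, field_updates = plan_contact_sync(mailbox, directory)
```

In this example only the changed mobile number for "e1" ends up in field_updates, which is the point of step 6: the unchanged name field is never rewritten, so it generates no traffic to the handheld.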
Normally, when our software pushes a Mandatory Contact List, it would just have a small incremental update to 2500 Outlook contacts. That would usually take just a few minutes. However, with 1500 new pictures, we first noticed mailbox updates were taking between 10 and 25 minutes per user. Our customers were not too happy about this.
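To put those per-mailbox times in perspective, here is a back-of-the-envelope calculation for a full pass over 2000 mailboxes. It assumes the mailboxes are processed one after another and, when more than one worker is used, that the work splits evenly; the function and its workers parameter are illustrative only.

```python
def total_sync_hours(mailboxes: int, minutes_per_mailbox: float, workers: int = 1) -> float:
    """Estimated wall-clock hours for one full pass, assuming an even split across workers."""
    return mailboxes * minutes_per_mailbox / 60 / workers


low = total_sync_hours(2000, 10)   # best case from the figures above
high = total_sync_hours(2000, 25)  # worst case from the figures above
print(f"{low:.0f} to {high:.0f} hours on a single worker")
print(f"{low / 24:.1f} to {high / 24:.1f} days on a single worker")
```

On a single worker that comes to roughly 333 to 833 hours, or about two to five weeks, for one pass.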
With such a large update, our application has to stay logged on for the duration of the synchronization process. The contact management task creates a stateful session in which it expects a nailed-up MAPI session with the Exchange Server. It may perform dozens of transactions per minute. This operation goes on for hours, even days, on multiple servers. It’s like painting an aircraft carrier: you start painting at one end of the ship, and by the time you get to the other end, it is time to start over. By the time the software has walked 2000 mailboxes, new employee contacts have been added, other employees have been terminated, and address, job title, and of course contact phone number changes have occurred in SharePoint or Active Directory, so the process needs to run again.
We can say with some certainty that over this span of time, a load balancer will take a perfectly productive MAPI session and attempt to migrate it to another node. Even though the requests are instantly passed to a new CAS, the MAPI session is no longer valid. Our application is not usually very happy about this.
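One generic way for a long-running client to cope with an invalidated session is to log on again and redo the interrupted unit of work. The sketch below illustrates that pattern in plain Python under stated assumptions; it is not how itrezzo or any MAPI library handles the problem. SessionLostError, open_session, and work are hypothetical stand-ins, and the pattern is only safe if the work can be repeated without side effects.

```python
import time
from typing import Callable, Optional, TypeVar

S = TypeVar("S")
T = TypeVar("T")


class SessionLostError(Exception):
    """Hypothetical error: the server no longer recognizes the session."""


def run_with_relogon(
    open_session: Callable[[], S],
    work: Callable[[S], T],
    max_attempts: int = 3,
    delay_seconds: float = 0.0,
) -> T:
    """Run work(session); on a lost session, log on again and retry the whole unit."""
    last_error: Optional[Exception] = None
    for attempt in range(1, max_attempts + 1):
        session = open_session()  # fresh logon for every attempt
        try:
            return work(session)
        except SessionLostError as exc:
            last_error = exc
            time.sleep(delay_seconds * attempt)  # simple linear back-off
    raise RuntimeError(f"gave up after {max_attempts} attempts") from last_error


# Simulated usage: the first session is invalidated mid-work, the second succeeds.
logons = {"count": 0}


def fake_logon() -> int:
    logons["count"] += 1
    return logons["count"]


def fake_mailbox_update(session: int) -> str:
    if session < 2:
        raise SessionLostError("session moved to another node")
    return f"updated on session {session}"


result = run_with_relogon(fake_logon, fake_mailbox_update)
```

The cost of this approach is that every retry repeats the logon and any work already done for that mailbox, which is expensive when a single mailbox takes many minutes.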

This story continues in Part 2 of this blog post, where I cover developer challenges and latency issues. With some basic Windows tools, you can diagnose whether there is a problem and use a very old-fashioned way to work around these issues.