Thursday, September 12, 2013

Exchange Load Balancing and MAPI Mud, Part 2


Exchange Server Load Balancing and Mysterious MAPI Behavior, Part 2

MAPI Applications and Port Connectivity

In Part 1 of this blog post, I hopefully made the case that Outlook is an ideal application to operate through a load balancer. However, MAPI applications don't directly inherit the same load balancing qualities.
Most application developers are probably not even aware of load balancers. If they are, it is even less likely that they are aware of the potential impact on their application. It is safe to say that the last thing on their burn-down list is some obscure error handler to deal with a load balancer that slams their MAPI connection shut. In fact, the app may work perfectly during debug and QA, but randomly fail in production. For example, the app may experience rare, sporadic issues that you have no rational explanation for. Unless you know what to look for, your app may just get stuck in the mud.

Custom MAPI Development

If you are writing custom code for one specific environment and you know that there will be no load balancer involved, read no further. However, if there is ever a chance your application will need to run with a load balancer in the path, you may want to familiarize yourself with ways to diagnose and resolve the issues that can arise.

Even with a load balancer, the application can still work reliably. For example, if the sequence of MAPI calls is simple enough, it may be fairly easy to retry operations with a new session.

a) change app code
b) hosts file
c) Note that this probably does not apply to applications built on top of Outlook.
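
For a simple sequence of calls, the retry idea mentioned above might look something like the sketch below. This is only an illustration: SessionLost, open_session and operation are hypothetical names standing in for whatever your MAPI wrapper actually exposes, and the back-off values are arbitrary.

```python
import time


class SessionLost(Exception):
    """Hypothetical error meaning the MAPI session was dropped."""


def run_with_retry(open_session, operation, attempts=3, delay=1.0):
    """Run operation(session); if the session is lost, retry on a fresh one.

    open_session and operation are caller-supplied callables (assumptions,
    not part of any real MAPI library).
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        session = open_session()  # never reuse a session that may be dead
        try:
            return operation(session)
        except SessionLost as err:
            last_error = err
            time.sleep(delay * (attempt + 1))  # simple linear back-off
        finally:
            # Close the session if the wrapper object offers a close() method.
            close = getattr(session, "close", None)
            if close is not None:
                close()
    raise last_error
```

The important design choice is that every attempt starts from a brand new session, so nothing depends on the state of a connection the load balancer may already have dropped.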
For more complex, multi-hour updates, a retry is not as easy to code. First you need to be able to detect the failure and re-initialize the session to the user mailbox, then start over. The failure condition can have ambiguous error codes, and if you have simultaneous sessions to multiple Exchange servers, your code doesn't even know where to start. Then there are also the synchronous calls made to the MAPI subsystem that never return at all. Yes, you can code defensively. The question is, should you drive your car directly through the minefield, or should you drive around it?
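
To give a feel for what "coding defensively" against a call that never returns can involve, here is a minimal watchdog sketch. It assumes the blocking call can safely be run on a worker thread, which may not hold for every MAPI wrapper; call_with_timeout is a hypothetical helper, not part of any MAPI library.

```python
import queue
import threading


def call_with_timeout(func, timeout, *args):
    """Run a blocking call on a daemon thread and give up after `timeout` seconds.

    Raises TimeoutError if the call has not returned in time. The worker
    thread is abandoned, not cancelled, so a truly hung call stays hung.
    """
    result = queue.Queue(maxsize=1)

    def worker():
        try:
            result.put(("ok", func(*args)))
        except Exception as err:  # hand the failure back to the caller
            result.put(("error", err))

    threading.Thread(target=worker, daemon=True).start()
    try:
        status, value = result.get(timeout=timeout)
    except queue.Empty:
        raise TimeoutError(f"call did not return within {timeout} s") from None
    if status == "error":
        raise value
    return value
```

Note that this only detects the hang. The stuck thread and whatever session it holds are still there, which is part of why driving around the minefield is attractive.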
To further complicate things, we have another enemy: latency. When the CAS role is on the mailbox server, the connection is a simple client/server communication. Latency is in the single digits (of milliseconds). Now add a load balancer which talks to a CAS, which talks to a mailbox server. Even if the load balancer is really fast (it probably isn't when it has 10,000 users connected), what was single-digit latency for the application has now become 50 or 60 ms. While the application could previously update 15 contacts in a one-second interval, it can now only make five contact updates. Of course, Outlook users are completely oblivious to any of this latency. When you double-click to open a message, it doesn't matter if an extra 50 ms was injected into the equation.
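
A rough back-of-the-envelope model shows why a few tens of milliseconds matter so much to a batch application. The figures of three round trips per contact update and 50 ms of fixed local work are assumptions picked for illustration, not measurements; they happen to land near the 15 and 5 updates per second mentioned above.

```python
def updates_per_second(latency_ms, round_trips=3, local_work_ms=50.0):
    """Estimate sequential updates per second for a given network latency.

    round_trips and local_work_ms are assumed values for illustration only.
    """
    per_update_ms = round_trips * latency_ms + local_work_ms
    return 1000.0 / per_update_ms


direct = updates_per_second(5)     # CAS role on the mailbox server
balanced = updates_per_second(50)  # load balancer -> CAS -> mailbox server
print(f"{direct:.1f} vs {balanced:.1f} updates per second")
# prints "15.4 vs 5.0 updates per second"
```

Because each update waits on several sequential round trips, the added latency is multiplied rather than simply added, which is why throughput drops to about a third.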

Looking for Evidence

When we see lost MAPI connections and poor performance, the first step is to see if there is a load balancer in the picture. Using the Windows netstat command, we scan for any connections to the DNS name of the load balancer. If we see the load balancer in the list of active connections, we are ready to remediate. Remediation starts with convincing the customer that, in this case, the load balancer is part of the problem, not part of the solution.
To provide some context on our application, there is a primary mailbox for the itrezzoAgent which is always kept open. When a target user needs to be updated, a secondary session is created. Because we support servers in geographically diverse locations, the first step is to determine which Exchange database each user resides on. After that, we can find the active server for that database.
Each itrezzoAgent server is configured with a server wildcard that describes all of the local mailbox servers and, in turn, all of the elements of the local CAS array.
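
The netstat check can be scripted. The sketch below is one possible approach, not the tool we actually use: it resolves the load balancer's DNS name to IP addresses and looks for them in the "Foreign Address" column of Windows `netstat -n` output. The function names are made up for this example.

```python
import socket
import subprocess


def resolve_ips(hostname):
    """Return the set of IP addresses the name currently resolves to."""
    return {info[4][0] for info in socket.getaddrinfo(hostname, None)}


def connections_to(netstat_output, ips):
    """Return netstat lines whose foreign address is one of the given IPs.

    Assumes the Windows column layout:
    Proto, Local Address, Foreign Address, State.
    """
    hits = []
    for line in netstat_output.splitlines():
        fields = line.split()
        if len(fields) < 3 or fields[0].upper() not in ("TCP", "UDP"):
            continue  # skip headers and blank lines
        remote_ip = fields[2].rsplit(":", 1)[0].strip("[]")
        if remote_ip in ips:
            hits.append(line.strip())
    return hits


def scan(load_balancer_name):
    """Run netstat and list active connections to the load balancer."""
    output = subprocess.run(
        ["netstat", "-n"], capture_output=True, text=True, check=True
    ).stdout
    return connections_to(output, resolve_ips(load_balancer_name))
```

Calling scan() with the load balancer's DNS name on the application server returns the matching connection lines; an empty list suggests the application is not talking to the load balancer at that moment.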
When customers started deploying Exchange 2010 with an array of client access servers, we adapted our

Explicit list of client access servers
Each app server uses different CAS

  • Dynamic profile
  • Profiles may get redirected to the load balancer
  • Host file for redirect load balancer
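
As one reading of the notes above, each application server could be pinned to its own CAS by overriding the load balancer's DNS name in that server's hosts file. The sketch below assumes that interpretation; the server names, IP addresses and function names are all invented for illustration.

```python
def assign_cas(app_servers, cas_ips):
    """Give each app server a CAS, wrapping around if there are more apps than CAS."""
    return {
        app: cas_ips[i % len(cas_ips)]
        for i, app in enumerate(sorted(app_servers))
    }


def hosts_line(cas_ip, load_balancer_name):
    """Format a hosts-file entry that resolves the load balancer name to one CAS."""
    return f"{cas_ip}\t{load_balancer_name}"


# Hypothetical environment: three agent servers, two client access servers.
plan = assign_cas(["agent1", "agent2", "agent3"], ["10.0.0.21", "10.0.0.22"])
for app, cas_ip in plan.items():
    # Each line would go into that server's hosts file
    # (C:\Windows\System32\drivers\etc\hosts on Windows).
    print(app, "->", hosts_line(cas_ip, "outlook.example.com"))
```

With this kind of override, a profile that gets redirected to the load balancer name still ends up connecting directly to a specific CAS.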


Results

  • Much faster performance
  • Higher reliability, no hung MAPI sessions
  • Spreading the load is still achieved
  • Possible trade-off for the application, as a specific CAS may go offline