Deceptively Simple, but Capacity Management is Difficult

By now we’ve all heard enough about the failed launch of Healthcare.gov. The bickering over this issue by the left, the right, and everyone in between has sounded more like children arguing over the front seat in mom’s car than the sort of high-minded discourse that we deserve from our elected officials. Arguably the only thing worse than the endless banter itself is the complete lack of technical understanding that accompanies it.

It is in that spirit that I’m kicking off this 3-part series suggesting some practical lessons we can all take away from this debacle. I figure it’s the least I can do. The first lesson sounds deceptively simple, but it’s a snag that countless web applications suffer from.

Capacity management is hard.

Recently, the Democratic representative of California’s 18th district proved my point. Speaking at a hearing in which House members were grilling contractors involved with the project, Rep. Anna Eshoo stated, “Amazon and eBay don’t crash the week before Christmas, ProFlowers doesn’t crash on Valentine’s Day.”

What’s remarkable about that statement is that less than five minutes spent on Google reveals that ProFlowers has in fact crashed on Valentine’s Day, and while Amazon didn’t crash the week before Christmas, it sure did last Christmas Eve, taking Netflix down in the process.

Is the moral of the story that Amazon and ProFlowers are bad platforms built by bad people hell-bent on their customers’ demise? Of course not. The moral is that capacity management can be tricky and often takes time to get right.

Fortunately there are some guidelines that apply, if not to Healthcare.gov and HHS, to most organizations in the early stages of building a new application.

Plan to scale from the start. Often the rush to launch a new application can lead to corner-cutting with respect to future growth. This is a mistake. Think about the different service tiers in your application, think about how each will scale over time, and build each with growth in mind. Believe me – you don’t want to be going back and reworking these things after launch. Good examples include:

Adopt a load balancing strategy from the start – Load balancing a web app introduces variables that you wouldn’t have to worry about otherwise. None of these are new concepts, but they often trip up teams that develop and test on standalone platforms before releasing to a load balanced production environment. Session handling is probably the most significant concept to take into account here. Does your application retain user session state in memory on a given server, or is session state shared between members of the pool? While the latter is definitely my preference, session affinity can be used in the case of the former to pin user sessions to a given server rather than allowing them to float between members in the pool. Bear in mind, though, that if your users are passing data over SSL and affinity depends on something inside the HTTP request, such as a session cookie, your load balancing platform will need SSL offload capabilities so that it can decrypt the HTTP payload and make the right forwarding decision.
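To make the two session-handling options concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the pool member names, the `pick_server` hashing scheme, and the in-memory `SharedSessionStore`, which merely stands in for an external store that every pool member can reach. It is not how any particular load balancer implements affinity.

```python
import hashlib

# Hypothetical pool members; the names are illustrative only.
POOL = ["web-1", "web-2", "web-3"]


def pick_server(session_id: str, pool: list[str]) -> str:
    """Session affinity: hash the session ID so the same session always
    lands on the same pool member (as long as the pool does not change)."""
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    return pool[int.from_bytes(digest[:8], "big") % len(pool)]


class SharedSessionStore:
    """Stand-in for an external session store (a cache or database) that
    all pool members share, so any server can handle any request."""

    def __init__(self) -> None:
        self._data: dict[str, dict] = {}

    def load(self, session_id: str) -> dict:
        # Create the session on first use, then return the same dict.
        return self._data.setdefault(session_id, {})


def handle_request(server: str, store: SharedSessionStore, session_id: str) -> int:
    """Count hits for a session; the state lives in the shared store,
    not on the server that happens to receive the request."""
    session = store.load(session_id)
    session["hits"] = session.get("hits", 0) + 1
    return session["hits"]
```

With the shared store, a request for the same session can arrive at "web-1" and then "web-3" and still see a consistent hit count. With affinity alone, losing the pinned server loses whatever state it held in memory.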

Understand your database workload – Understanding your database workload is critical to building a growth strategy for your data tier. Beyond that, understanding how failures of components in your database platform affect user experience can help minimize the overall disruption users experience when one component or another fails.  For example, if your workload is extremely read-heavy, simple replication coupled with load balancing can prove to be a highly effective combination without a lot of the added complexity introduced by different clustering technologies. From there, establishing a highly available writer and a highly scalable reader can be a very effective strategy. One tip: make this decision early. If you need to programmatically break up reads and writes against different data sources, you want to do this from the start – you don’t want to be retrofitting your application to support this a year into production.
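As a rough illustration of breaking up reads and writes against different data sources, the sketch below routes statements to a single writer or to a rotating set of readers. The `ReadWriteRouter` class and the connection names are hypothetical, and the "starts with SELECT" test is a deliberate simplification rather than a complete SQL classifier.

```python
import itertools


class ReadWriteRouter:
    """Send SELECT statements to read replicas (round robin) and
    everything else to the single writer."""

    def __init__(self, writer, readers):
        self.writer = writer
        # Rotate through the readers indefinitely.
        self._readers = itertools.cycle(readers)

    def for_statement(self, sql: str):
        parts = sql.split()
        verb = parts[0].upper() if parts else ""
        if verb == "SELECT":
            return next(self._readers)
        # Writes, DDL, and anything unrecognized go to the writer.
        return self.writer


# Strings stand in for real connection objects here.
router = ReadWriteRouter("primary", ["replica-1", "replica-2"])
target = router.for_statement("SELECT name FROM users WHERE id = 1")
```

Replicas can lag behind the writer, so a read that must see a write the same user just made is usually safer going to the writer. That is exactly the kind of decision that is easier to build in early than to retrofit.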

Avoid monolithic design – Failures will happen. Accept that simple truth from the start and build accordingly. The perfect platform isn’t the one that never fails – it’s the one that fails gracefully with as little disruption to the user experience as possible. If you’re unable to render a given component of the UI because of one outage or another, make sure that everything else around it can be isolated from that failure. For this reason, building a modular application with as little interdependency as possible is a choice you should adopt from the start. Ben Christensen of Netflix has written extensively on this subject including this great article on application resilience.
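The idea of isolating one component's failure from the rest of the page can be sketched in a few lines. The function names and the placeholder text below are my own assumptions for illustration; this is a bare-bones fallback wrapper, not the approach described in the Netflix article.

```python
from typing import Callable


def render_component(name: str, render: Callable[[], str], fallback: str = "") -> str:
    """Render one UI component, returning a fallback if it fails so the
    failure does not propagate to the rest of the page."""
    try:
        return render()
    except Exception:
        # A real application would log the error and emit a metric here.
        return fallback


def render_page(components: dict[str, Callable[[], str]]) -> str:
    """Render every component independently and join the results."""
    return "\n".join(
        render_component(name, render, fallback=f"[{name} is temporarily unavailable]")
        for name, render in components.items()
    )
```

If a hypothetical recommendations service is down, the page still renders its header and content, with a placeholder where the recommendations would have been.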

Understand your capabilities. When building a new application, it’s important not just to understand the capacity constraints of each tier, but also to understand those constraints in the context of real world scenarios. Understanding that your web tier can support “X” requests per minute and that your database can support “Y” requests per second is only so helpful if you can’t also attribute those metrics to a concurrent user count. Understanding as granularly as possible the resource footprint of a single user is a great way to forecast how each tier will have to grow as user concurrency increases. Similarly, just as load testing for each individual component is important, end to end performance testing both before and after launch is critical to monitoring user experience and ensuring that no potential bottlenecks have been overlooked.
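Here is a small worked example of turning per-tier limits into a concurrent user count. The `UserFootprint` fields and all of the numbers are made up for illustration; the point is only the arithmetic of dividing each tier's capacity by the load one user places on it.

```python
from dataclasses import dataclass


@dataclass
class UserFootprint:
    """Hypothetical load generated by one concurrent user."""
    web_requests_per_min: float
    db_queries_per_web_request: float


def max_concurrent_users(web_capacity_rpm: float,
                         db_capacity_qps: float,
                         fp: UserFootprint) -> tuple[str, int]:
    """Return the tier that saturates first and the user count it allows."""
    web_limit = web_capacity_rpm / fp.web_requests_per_min
    # Convert one user's web requests per minute into DB queries per second.
    db_qps_per_user = fp.web_requests_per_min * fp.db_queries_per_web_request / 60
    db_limit = db_capacity_qps / db_qps_per_user
    tier = "web" if web_limit <= db_limit else "database"
    return tier, int(min(web_limit, db_limit))


# Assumed figures: 6 requests per minute per user, 5 queries per request,
# a web tier good for 60,000 requests/min and a database good for 2,000 queries/s.
bottleneck, users = max_concurrent_users(60_000, 2_000, UserFootprint(6, 5))
```

With these assumed figures the web tier could carry 10,000 users but the database only 4,000, so the database is the tier to grow first.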

Sessions are expensive. Next to servers (be they physical or virtual), firewalls and load balancers are two of the most ubiquitous devices found in web application stacks today. Firewalls traditionally fill a role of managing perimeter access control and perhaps some level of intrusion detection or prevention, while load balancers distribute requests across aggregated pools of resources. As helpful as each of these functions sounds, they’re equally dangerous. Why? In order to behave correctly, these platforms need to track the state of every session they’re handling in memory. As a result, the number of new sessions per second, along with the total number of concurrent sessions that a given platform can support, are metrics that should be understood when factoring upgrades into your growth strategy.
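The two metrics are related: at steady state, the number of concurrent sessions is roughly the new-session rate multiplied by how long each session stays in the state table (Little's law). The sketch below applies that approximation; the function names and the device limits used in the example are assumptions, not figures for any real firewall or load balancer.

```python
def steady_state_sessions(new_per_sec: float, avg_lifetime_sec: float) -> float:
    """Approximate concurrent sessions held in a state table:
    arrival rate multiplied by average time spent in the table."""
    return new_per_sec * avg_lifetime_sec


def fits_device(new_per_sec: float, avg_lifetime_sec: float,
                max_new_per_sec: float, max_concurrent: float) -> bool:
    """Check both limits; a device can run out of either one first."""
    concurrent = steady_state_sessions(new_per_sec, avg_lifetime_sec)
    return new_per_sec <= max_new_per_sec and concurrent <= max_concurrent


# Assumed workload: 500 new sessions per second, each tracked for 120 seconds.
expected = steady_state_sessions(500, 120)
```

Under those assumptions the device tracks about 60,000 sessions at once, so a platform rated for 50,000 concurrent sessions would fall short even if its new-session rate limit were never approached.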

Just as understanding the tolerances of these platforms is important, making use of them wisely is equally so. To start, you should avoid design approaches that cause an undue number of round trips through firewall policy for traffic you knew you were going to permit to begin with. For example, if 100% of the traffic between your app and database tiers is MySQL over TCP 3306, and the only firewall policy in place between those platforms permits this traffic, then consolidating them into the same subnet can save you firewall overhead.

Similarly, if your front end web servers accept unsolicited traffic, an approach that avoids use of a perimeter firewall can be considered. Access control in this scenario can still be achieved effectively and securely via OS-level or hypervisor-level firewall alternatives. I should mention that many of the regulatory frameworks such as the PCI DSS may disagree with me on this point. While I’m confident that they’ll all catch up eventually, it is still highly recommended that you consult your auditor before making any changes to an application that is regulated by one of these frameworks.

Stay tuned…

The 2nd in this series will cover the dangers of equating the age of a technology with its viability. After all, much of the internet is supported by technology that’s old.

The 3rd in this series will cover the dangers of building a product without customer input.
