High-Traffic Web Sites and Scalability Strategies

I was writing on a Startupping thread about scalability and figured I’d reproduce some of what I wrote here. The topic was planning for and dealing with scaling web sites as traffic increases. I pasted in some of my responses & added some headings to give a frame of reference to what I’m babbling about. Again, the below is personal experience and opinion, and there’s a lot of information on the web to supplement (and I’m sure in some cases contradict) what I refer to below.

Planning to Scale

In my previous life at a big dot-com, we actually started thinking about multiple servers right away, in order to provide uptime/redundancy & zero-downtime code deployment. We learned along the way, did a lot of research, and/or hired good people with experience. It’s always good to have good people. We also hired a few specialty consultants for a few hours of suggestions, and worked with hosting providers who had other big clients, so that we could benefit from their experience.

We started out using software load balancing, moved to hardware load balancers, & eventually to round robin DNS between two large server farms (which had their own hardware load balancers, redundant everything, etc).
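
(For flavor, here’s a toy sketch of the round-robin idea in Python. This isn’t what we ran; the hostnames are made up, and a real balancer also does health checks, failover, connection draining, etc.)

    import itertools

    # Hypothetical pool of web servers behind the balancer.
    BACKENDS = ["web1.example.com", "web2.example.com", "web3.example.com"]

    # cycle() walks the list forever, one backend per request.
    _rotation = itertools.cycle(BACKENDS)

    def pick_backend():
        """Return the next backend in round-robin order."""
        return next(_rotation)

    for _ in range(6):
        print(pick_backend())   # web1, web2, web3, web1, web2, web3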

We had scaling issues whenever we hit a specific wall, e.g. bandwidth & server-count limits on routers, firewalls, & data centers, or the limits of a particular technology. We also had issues when we got slashdotted, but that happens. One year we were unexpectedly linked to by a very large, prominent ad on the home page of a top-5 site, so our traffic went through the roof, we all got called at 4am, and we stayed pretty busy ensuring things kept running.

A few things we did

  • We used a managed hosting provider who could prep & deliver new servers quickly, to help us with scaling quickly. Managed hosting also meant the provider could help out with monitoring, troubleshooting, setup, and solving problems. So e.g. if you lose a PSU at 2am, you don’t have to drive down to a colo and swap it out.
  • We did regular load & performance testing so we knew how many servers we’d need for certain traffic levels (a bare-bones sketch of that kind of test appears after this list). We also did profiling to pinpoint bottlenecks, optimize certain parts of the site, etc.
  • We worked closely with our marketing team, so that we knew about upcoming media blitzes. We kept alerts on blogs & news sites to watch for links, & we kept an eye on our traffic.
  • We had monitors on our servers, so that if traffic/bandwidth/cpu started climbing unexpectedly high, we were alerted & could react before it was too late.
  • We looked at our traffic history so that we knew what times, days, & seasons were popular for our site.
  • We designed our software for easy deployment & replication, & streamlined our server setup process so that once a new server was ready, we could add it into the farm quickly.
  • We consciously built out our site to handle a certain load level, e.g. 3x last year’s peak or whatever (3x is a made up example). You obviously need to handle spikes, but being ready & able to handle an “infinite” level of traffic isn’t cost-effective, especially when you’re young & trying to spend wisely.
  • We planned for & expected scalability & multiple boxes right away.
  • We had remote access to servers, reports, & monitoring tools, so that if a crisis occurred in the off hours, we could contact staff & they could respond without having to get into the office first.
  • We had names, functions, & phone numbers of key staff printed up on a laminated card (business card size) & given out to a number of staff, so that people could quickly contact others in case of an emergency.
  • We bought staff donuts or pizza or food. It was a cheap way to say thank you & made them feel appreciated & more willing to not complain (too much) when getting calls in the wee hours.
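
On the load-testing bullet above: our tools were Windows-era commercial products, but the basic shape of a load test is easy to sketch. Here’s a minimal, hypothetical Python version; the URL, concurrency, and request counts are placeholders, and real tools replay realistic user sessions rather than hammering one page.

    import time
    import urllib.request
    from concurrent.futures import ThreadPoolExecutor

    URL = "http://www.example.com/"   # placeholder: page under test
    CONCURRENCY = 10                  # simulated simultaneous users
    REQUESTS = 100                    # total requests to send

    def timed_fetch(_):
        """Fetch URL once & return the elapsed time in seconds."""
        start = time.perf_counter()
        with urllib.request.urlopen(URL, timeout=10) as resp:
            resp.read()
        return time.perf_counter() - start

    with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
        timings = sorted(pool.map(timed_fetch, range(REQUESTS)))

    avg_ms = 1000 * sum(timings) / len(timings)
    p95_ms = 1000 * timings[int(len(timings) * 0.95) - 1]
    print(f"avg {avg_ms:.0f} ms, 95th pct {p95_ms:.0f} ms, max {1000*timings[-1]:.0f} ms")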

In hindsight, maybe we could have gotten away with spending less time thinking about scalability (or maybe not), but that would have increased our risk of disaster/embarrassment/etc, and potentially made for more painful & frequent firefighting. So there’s the tradeoff. Just like other things in business, you can spend extra time and money up front to be more prepared for business growth, or you can wait until later and possibly pay more and feel more pain.

How do you know what load your site should handle? Should it be 3x last year’s peak?

There’s no magic number. We typically used 2x previous peak as a starting point only, and worked from there based on various factors. We put a good amount of thought into measuring our server performance and determining the traffic we expected and/or wanted to be able to handle, because it translated directly into how many servers we rented and thus how much money we spent on hardware. The load/server number really depends on your budget, your expected growth (including what your marketing team has up their sleeve), your risk/downtime tolerance (including any SLAs or client commitments you have to maintain), and how fast you can scale. I mean, if you started out very small last year, and your marketing department is going to buy a Super Bowl ad this year, your traffic is going to be 100x last year’s mark, not 3x.

So since capacity costs money (and unused capacity gets noticed), you have to plan it out a bit. If you just follow the rule of thumb and buy/rent enough equipment for twice last year’s peak (or whatever), then that means most of the time your web servers are running at 10-15% capacity. And unless you have a lot of money, that means you might have management or investors asking you “hey, how come we spend so much money on servers when they’re only running at 10%?” We mitigated that issue somewhat by keeping some servers busy with “non-essential” functions (internal reports, data crunching, load testing, QA, etc) that could be curtailed during a crisis when we needed every drop of CPU.
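
To make that capacity arithmetic concrete, here’s roughly how the server-count estimate works; every number below is invented for illustration, so plug in your own measurements.

    import math

    # Every figure below is a made-up example, not a real number.
    peak_rps_last_year = 400     # measured peak requests/sec last year
    growth_factor = 2.0          # planning target: 2x last year's peak
    per_server_rps = 150         # one server's measured max requests/sec
    target_utilization = 0.60    # run at ~60% normally so spikes have headroom

    target_rps = peak_rps_last_year * growth_factor
    servers_needed = math.ceil(target_rps / (per_server_rps * target_utilization))
    print(f"plan for {target_rps:.0f} req/s -> {servers_needed} servers")
    # plan for 800 req/s -> 9 servers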

The whole “running at 10% cpu 90% of the time” issue is one reason why expandable services like CDNs (content delivery networks), S3, & EC2 are intriguing, b/c in theory they allow you to scale some aspects of your service on demand without having a lot of extra horsepower & bandwidth sitting around unused. It’s also why VPSes are kinda interesting, since you can potentially do some really quick deployments.

Another thing we planned early on: we ran our images & static files (css, js, etc) on a separate domain, which allowed us to serve up static files using CDNs or other sources of cheap, fast bandwidth. That also helped us avoid maxing out router/datacenter bandwidth (at least for a while), since images/css/etc are often a large portion of your outbound bandwidth.
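
As a sketch of the separate-domain idea (the hostname & paths here are hypothetical), the app-side change can be as small as a helper that builds asset URLs:

    # Hypothetical static hostname; the app servers never see these requests,
    # & the domain can be pointed at a CDN without touching application code.
    STATIC_HOST = "http://static.example.com"

    def static_url(path):
        """Build a URL for an image/css/js file on the static domain."""
        return STATIC_HOST + "/" + path.lstrip("/")

    print(static_url("/css/site.css"))   # http://static.example.com/css/site.css
    print(static_url("img/logo.gif"))    # http://static.example.com/img/logo.gif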

Any must-have tools for performance analysis and dealing with traffic?

We were on a Windows platform, so it was tools like the Web Application Stress Tool, LoadRunner, and Visual Studio Enterprise Architect’s stress/load tool, among others. And of course you’ll also want some sort of web analytics package (AWStats, Google Analytics, WebTrends, etc) to tell you what typical user behavior is, what your most popular pages are, etc. Because your tests should mirror typical user behavior if possible. You’ll also need performance monitoring tools like PerfMon, etc, so you can see how your servers perform under various loads. From all that you figure out page activity levels for a given number of users, max requests per second, max/avg response time, cpu load, and how high cpu/ram/disk can go before your performance & user experience start to drop. There may be better/other tools today, and I don’t know what platform you’re targeting, but those can get you started. I think it’s ok to start with simple tools. We used Excel a good amount, too, to map, graph, and view performance test data.
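
As an example of making tests mirror real behavior, here’s a quick, hypothetical Python sketch that tallies the most-requested pages from a standard (common/combined format) web server access log; the log filename is a placeholder.

    from collections import Counter

    def top_pages(log_path, n=10):
        """Tally the most-requested paths in a common/combined-format access log."""
        hits = Counter()
        with open(log_path) as log:
            for line in log:
                try:
                    request = line.split('"')[1]   # e.g. GET /path HTTP/1.1
                    path = request.split()[1]
                except IndexError:
                    continue                       # skip malformed lines
                hits[path] += 1
        return hits.most_common(n)

    # for path, count in top_pages("access.log"):   # "access.log" is a placeholder
    #     print(count, path)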

For monitoring, we used the tools that our hosting provider had (we had managed hosting), but we also had a simple script that ran every X minutes, tried downloading a page, tried executing a database query, checked cpu/ram/etc, & blasted out an email if things seemed wrong. We also wrote our web site to log errors & send out emails if, for example, database queries were timing out, or if code started getting bad errors like “out of memory”. There are also a number of third party apps & sites that can monitor your site or servers every X minutes & send you alerts; some will check your servers remotely (e.g. Gomez, Webmetrics), while others install on your servers. I can’t think of any installed apps offhand to recommend, but I’m sure you can find something workable w/ some research.
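
That old script is long gone, but the idea is simple enough to sketch. A minimal, hypothetical Python version (the URL, addresses, & mail relay are placeholders, and it skips the database & cpu/ram checks ours also did):

    import smtplib
    import urllib.request
    from email.message import EmailMessage

    CHECK_URL = "http://www.example.com/"   # placeholder: page to fetch
    ALERT_TO = "oncall@example.com"         # placeholder address

    def check_site():
        """Return None if the page looks healthy, else a short problem description."""
        try:
            with urllib.request.urlopen(CHECK_URL, timeout=10) as resp:
                if resp.status != 200:
                    return f"bad status {resp.status}"
        except Exception as exc:
            return f"fetch failed: {exc}"
        return None

    def send_alert(problem):
        msg = EmailMessage()
        msg["Subject"] = "site check failed"
        msg["From"] = "monitor@example.com"    # placeholder
        msg["To"] = ALERT_TO
        msg.set_content(problem)
        with smtplib.SMTP("localhost") as smtp:   # assumes a local mail relay
            smtp.send_message(msg)

    if __name__ == "__main__":
        # Run from cron/Task Scheduler every X minutes.
        problem = check_site()
        if problem:
            send_alert(problem)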

When you start having scaling issues, does your staff spend all their time on scaling instead of development?

If you don’t hire any additional staff & don’t outsource any work, then yes, as your site gets bigger and your company grows, there’s going to be more work to do, and your initial team will probably be spending more time on support/scalability/firefighting/misc tasks.

For us, as the site got more popular, more time needed to be spent on performance/scalability/uptime issues. But we still had features to develop. So we hired more people to keep up with increasing workload.

A supplementary route would be to go with a higher-level hosting provider (e.g. Rackspace) & pay them to help you scale out, and/or hire short term scalability consultants who can give you suggestions, etc. If your app can already scale horizontally, that helps, b/c then you have the luxury of a choice: either spend time making your app faster, or spend money buying more servers. Or both.

Another suggestion: if you’re starting out very small & cheap, you could start with a VPS on a single server, since you can quickly move that VPS to a bigger/faster machine before you need to scale to multiple machines. Once you go multi-server, you may even want to stick w/ VPSes, since the added ease of backups/migrations/deployment might make up for the performance hit. Virtualization technology is getting better all the time, so the performance impact of running VPSes is smaller than you’d think (e.g. Xen is supposedly only about a 5% hit).

How quickly you can scale makes a big difference in terms of the time you spend on capacity planning and actually scaling your site. Akamai also has some neat stuff (although it can be expensive) that lets them serve up dynamic pages from their CDN, which in turn reduces the load on your main servers & delays having to set up extra servers. Think of it like a front-end cache/reverse proxy type thing. Other CDNs might have similar offerings. There are also accelerators (both hardware and software, like Squid) that do similar stuff, although setting up your own front-end cache w/ something like Squid means you lose one of the big benefits of CDNs, namely that they do the scaling instead of you.
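
To show what a front-end cache/reverse proxy is doing conceptually, here’s a toy Python sketch; the backend address & TTL are made up, it ignores headers/content types & origin errors, and you’d use Squid or a CDN rather than anything like this in production.

    import time
    import urllib.request
    from http.server import BaseHTTPRequestHandler, HTTPServer

    BACKEND = "http://127.0.0.1:8080"   # hypothetical origin app server
    TTL = 60                            # seconds a cached copy stays fresh
    _cache = {}                         # path -> (fetched_at, body)

    class CachingProxy(BaseHTTPRequestHandler):
        def do_GET(self):
            entry = _cache.get(self.path)
            if entry is None or time.time() - entry[0] > TTL:
                # Miss or stale: fetch from the origin & remember the body.
                with urllib.request.urlopen(BACKEND + self.path, timeout=10) as resp:
                    entry = (time.time(), resp.read())
                _cache[self.path] = entry
            self.send_response(200)
            self.send_header("Content-Length", str(len(entry[1])))
            self.end_headers()
            self.wfile.write(entry[1])

    if __name__ == "__main__":
        # Listen on port 8000 & answer from the cache whenever the copy is fresh.
        HTTPServer(("", 8000), CachingProxy).serve_forever()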

 
