Scaling Hell

Saturday, November 25th, 2006

I’ve been pretty quiet (again) lately because I’ve been busy trying to scale the msgpad platform sitting beneath ScribbleHere.

A couple of weeks ago I split the code so that a cluster of nodes could process HTTP requests, instead of just a single computer. In theory it was a nice idea, but I ended up spending a few days resolving a deadlock that came about from closing result sets out of order.

Anyway, with that out of the way I thought I was home free. Of course, I wasn’t. I started to find that I was getting horrible performance, several times worse than when I only had a single box processing the requests. To cut a very long and frustrating story short, it turned out the link between the database and the application server was becoming saturated.

The problem was that the proxy was on the same machine as the database. The data flow was like this: user – proxy – application server – db – application server – proxy – user. I had effectively created a high-tech echo chamber: with the proxy and database on the same box, a single request crossed the same link between that box and the application server four times. Even with this architecture I thought the network would not be the bottleneck, but it was. Anyway, for testing purposes I threw in another application server node to split the bandwidth consumption between the db/proxy node and the application server nodes, and the difference was quite amazing.
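To make the "four times" figure concrete, here is a minimal sketch that counts how often one request crosses the link between two machines. The host names (`box1`, `box2`, `box3`) and the `link_crossings` helper are made up for illustration, and the second placement (proxy on its own machine) is a hypothetical comparison, not the setup I actually tested.

```python
def link_crossings(path, placement, link):
    """Count hops in `path` whose two endpoints sit on the two hosts in `link`."""
    wanted = set(link)
    count = 0
    for src, dst in zip(path, path[1:]):
        if {placement[src], placement[dst]} == wanted:
            count += 1
    return count

# The request path described above.
PATH = ["user", "proxy", "app", "db", "app", "proxy", "user"]

# Proxy and database share a machine (hypothetical host names).
shared = {"user": "client", "proxy": "box1", "db": "box1", "app": "box2"}

# Hypothetical alternative: the proxy gets its own machine.
separate = {"user": "client", "proxy": "box3", "db": "box1", "app": "box2"}

print(link_crossings(PATH, shared, ("box1", "box2")))    # 4
print(link_crossings(PATH, separate, ("box1", "box2")))  # 2
```

With the proxy and database on one box, both the proxied HTTP traffic and the database traffic share the same link, which is why it filled up long before I expected it to.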

Here are a couple of graphs to illustrate the difference. The black line is the response time, and the grey line is the error rate. The horizontal axis is the number of concurrent requests per second.

The benchmark with only a single application server:

graph

Here’s the benchmark with two application servers:

graph

I will be modifying the architecture of the system after seeing these results. I am still a bit undecided, but I am leaning towards performing a basic HTTP temporary redirect instead of having the reverse proxy. The only downside is that the system is polling-based and every poll will result in a redirect, which is a bit wasteful. Another option would be round robin DNS, but I am using EC2 instances as application nodes, and the ease of setting up and tearing down EC2 nodes clashes with the delays in DNS propagation, which leaves a sour taste.
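For what it's worth, the redirect idea could look something like the sketch below. This is only an illustration in Python using the standard library, not the msgpad code: the node URLs, `redirect_target`, `RedirectHandler` and `make_server` are all hypothetical names, and I am assuming a plain 302 with round robin selection.

```python
import itertools
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical application node addresses; real EC2 hostnames would go here.
APP_NODES = ["http://app1.example.com", "http://app2.example.com"]
_next_node = itertools.cycle(APP_NODES)

def redirect_target(path):
    """Pick the next node in round robin order and build the redirect URL."""
    return next(_next_node) + path

class RedirectHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # 302 is a temporary redirect: the response body never passes
        # through this box, only the small redirect itself does.
        self.send_response(302)
        self.send_header("Location", redirect_target(self.path))
        self.end_headers()

def make_server(port=8080):
    """Build (but do not start) the redirector; call serve_forever() to run it."""
    return HTTPServer(("", port), RedirectHandler)
```

Calling `make_server().serve_forever()` would start it. The trade-off is the one mentioned above: each poll costs an extra round trip for the redirect, but the payload goes straight from the application node to the user instead of back through the proxy.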

P.S. I mentioned in a previous post that I didn’t like EC2 and S3 for hosting purposes, but EC2 as an application server cluster is beautiful. I can scale up and down quickly, and it was very easy to create custom images for EC2.