Tuesday, November 1, 2011

Notes on a failed launch with node.js and mongodb

Intro

What follows is a play-by-play of i.TV's launch of viWEer with Entertainment Weekly. We were using all of the latest technologies: node.js, MongoDB, nginx, Varnish. And yet our initial launch was a complete disaster. If you want, you can skip straight to the lessons learned.

Day 1

After a hurried but exciting four weeks, we were finally ready to launch our app. We had spent plenty of time tweaking the CSS3 box-shadows, working through all the CORS (cross-origin AJAX) bugs in Safari and IE, and were ready to launch our awesome production-ready app. We passed our partner's load test requirement like pros, and the security audit was likewise a success. Primetime, here we come.

Launch party ensues. It was a pretty mild get-together. Brad (our CEO) hands out gifts to everyone's wives for our hard work (See's Candy, delicious). We all get to feel good about ourselves, and then we go live in 3..2..1..........crap. NGINX errors or Varnish errors. jQuery AJAX errors. Yeah, we're screwed.

So while the family and friends chatted away, we all got into a room and tried to figure out why our servers were dying. I mean, Mixpanel is telling us we have maybe 400 people hitting the servers. I'm pretty sure my 256MB LAMP box could bench press that many people without breaking a sweat. But we've got VARNISH and MONGODB and NODEJS, NOSQL / JAVASCRIPT to the ends of the earth. Why do we suck so much? I mean, we even have an API for our iOS app built on the same infrastructure that gets tons of hits without any of these problems.

Okay, turns out we needed to learn a little bit more about Varnish. Varnish hates cookies: by default it won't cache any request that comes in with a Cookie header. We originally built the API powering our app with sessions, because it was going to serve itself up. Instead, we decided to go with the CORS-based single-page-web-app approach. Our API and our front-end would be totally separate. Just throw our HTML/JS up anywhere and we're good to go. So scalable and awesome and un-coupled. We're amazing! So we have a backend that's supposed to be de-coupled, but ends up relying on cookies and sessions. Not so amazing.

Day 2

So the problem is we can't handle the load, and our load testing, it turns out, wasn't very good either. Varnish won't cache requests that carry cookies. So we quickly write up something that tells express not to use the cookie middleware on certain routes (see the sketch below). We also figured out how to look at the Age header from Varnish to make sure that things are really caching. Our tests show things are caching. We're pretty confident things are going to be awesome today. So we chill out.
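Here's a minimal sketch of the idea, using the 2011-era Express API we were on (the /api/public prefix and the session secret are placeholders, not our real values): only run the cookie and session middleware on routes that actually need it, so the responses Varnish sees for everything else are cookie-free and cacheable.

    var express = require('express');
    var app = express.createServer();

    var cookies  = express.cookieParser();
    var sessions = express.session({ secret: 'not-our-real-secret' }); // placeholder secret

    // Skip the cookie/session middleware on routes we want Varnish to cache.
    // '/api/public/' is a placeholder prefix for the cacheable routes.
    app.use(function (req, res, next) {
      if (req.url.indexOf('/api/public/') === 0) return next();
      cookies(req, res, function (err) {
        if (err) return next(err);
        sessions(req, res, next);
      });
    });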

Nighttime comes (the app gets used from roughly 6-9 PM) and yeah, it's basically like yesterday, which is complete crap. Not quite as much crap, but it still sucks. A lot of errors in different browsers, and things generally go down from time to time, but overall they work throughout the night. We're live troubleshooting with the partner we're launching this with. They have a few problems, but after refreshing their browser or retrying, things basically work.

We're still not experts at Varnish. We're not sure things are really caching properly. It looks like things are caching a little bit, but not very much. What the heck. varnishstat, be my friend, please. Our database is being murdered. mongostat is showing hundreds of concurrent connections and everything is really slow. We also use that database for a lot of heavy queries for our iPhone app. db.currentOp() on mongo shows tons of operations being blocked. We go to our servers and find out that nginx, which sits in front of all of our node servers (to handle SSL, forwarding, and some other stuff), is running out of connections.
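A quick way to answer the "is Varnish actually caching this?" question is to hit the same URL twice and compare the Age header; varnishstat's hit/miss counters tell the same story in aggregate. This is just a sketch with a placeholder host and path, not our actual tooling:

    // Request the same URL twice through Varnish and print the Age header.
    // Age stuck at 0 means Varnish is passing every request to the backend.
    var http = require('http');

    function checkAge(cb) {
      // placeholder host and path
      http.get({ host: 'staging.example.com', path: '/api/public/shows' }, function (res) {
        cb(res.headers.age || '0');
        res.resume(); // we only care about the headers, discard the body
      });
    }

    checkAge(function (first) {
      checkAge(function (second) {
        console.log('Age on first request:', first, '| Age on second:', second);
        // A non-zero Age on the second request means it came from cache.
      });
    });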

We quickly hypothesize that our database is too small to handle both a barrage of connections AND the really intense map/reduce jobs that bog it down every 20 minutes. Maybe the (fairly small) database is blocking, so NGINX queues up the connections, waiting for them to return, and because they never return it eventually just has thousands of open connections. So sometimes things work, but mostly it sucks.

Day 3

We make the database HUGE, first thing. Rackspace was like, screw you, you can't even make a database that big in your "huddle", but we said, no way, we want it. 16GB of RAM. In your face, wait times and blocked servers and NGINX too-many-connections / timeout errors caused by waiting on the database. After a few hours of working with Rackspace we got our new production-ready database in order. Copy the image of the old db. Start up the new one. Manually move over any critical updates that happened during the process and we're good to go. Hallelujah. Finally off to NOSQL heaven!

Come showtime, no one can post or do anything, except us. Our partners who we're launching this experience with can't do anything. We can't reproduce it. This sucks. We get our whole team on Skype, and finally some people start having the problem. The sessions are just basically constantly restarting. It appeared to work, because you could log in and do one thing that required a session. After that your session would restart and you couldn't do it anymore. We had two things going on that were possibly giving us grief:

1) We're still doing our auth over CORS, which is generally weird and sends about four times more requests than necessary (see the sketch after this list).
2) We migrated our database to a new server and didn't clean out our old sessions collection.
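For context on point 1: with CORS, any request that isn't a "simple" one (custom headers like an auth token, or methods beyond GET/HEAD/POST) makes the browser send a preflight OPTIONS request first, so the server answers two requests for every real one. A rough Express-style sketch of the kind of handling involved; the origin and header names are generic examples, not our exact config:

    var express = require('express');
    var app = express.createServer();

    // Set CORS headers and answer preflight (OPTIONS) requests so the browser
    // will send the real cross-origin request that follows.
    // Origin and custom header names below are examples only.
    app.use(function (req, res, next) {
      res.header('Access-Control-Allow-Origin', 'http://viewer.example.com');
      res.header('Access-Control-Allow-Methods', 'GET,POST,PUT,DELETE');
      res.header('Access-Control-Allow-Headers', 'Content-Type, X-Auth-Token');
      res.header('Access-Control-Allow-Credentials', 'true');
      if (req.method === 'OPTIONS') return res.send(204); // end the preflight here
      next();
    });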

We struggled, we stayed up really late. Our partner on this launch calls our CEO. They're about to drop our partnership. Another conference call with the whole team. We basically have a day to show them that we're not complete idiots, or the entire thing falls through.

Our CTO mounts our totally de-coupled API server on top of our web server using some express middleware craziness: app.use('/viewer', require('api').createServer()). No more CORS.
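Spelled out, the trick looks roughly like this (the 'api' module name comes straight from the line above; the rest is a sketch of the wiring). Mounting the API app under /viewer puts the front-end and the API on one origin, so the browser never has to do CORS at all:

    var express = require('express');

    var web = express.createServer();          // the front-end app that serves the HTML/JS
    var api = require('api').createServer();   // the API that used to live on its own domain

    web.use('/viewer', api);   // anything under /viewer/* is handled in-process by the API app
    web.listen(3000);          // port is arbitrary for the sketch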

Next up we clear the sessions collection on our new database.
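That part is a one-liner in the mongo shell ('sessions' being the collection name implied by the post):

    // Drop the stale session documents carried over from the old database.
    db.sessions.drop()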

Wow, everything seems to work now. Too bad we've had 3 days of crap. So we decide that, to prove we're awesome, we need to get a bunch of users to test this NOW. We split the team in two. Half the team stays up all night and tries to get people from the Internets to test this / put in place some legit load testing. The other half shows up at work at 5:30 AM and carries on making sure that everything is solid.

I'm pretty sure at this point that I'm going to lose my job.

Day 4

First off: don't try Mechanical Turk at 4 in the morning to get a bunch of users to test your site. It doesn't work.

Secondly, the middle-of-the-night team set up a PhantomJS rig that's able to properly emulate hundreds of concurrent users actually using the app. It's pretty awesome, and it really was the first thing we were able to use to make sure our scaling was working properly.
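The real setup was more involved than this, but the core idea is roughly the following (the URL and page count are placeholders): have PhantomJS open the app in lots of headless pages so each one polls the API exactly like a real browser would.

    // load-sketch.js -- run with `phantomjs load-sketch.js`, several processes in parallel.
    var webpage = require('webpage');
    var url = 'http://staging.example.com/viewer';   // placeholder URL
    var PAGES = 25;                                  // arbitrary count per process

    for (var i = 0; i < PAGES; i++) {
      (function (n) {
        var page = webpage.create();
        page.open(url, function (status) {
          console.log('page ' + n + ' loaded: ' + status);
          // leave the page open so its JavaScript keeps polling the API
        });
      })(i);
    }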

I'm part of the early-in-the-morning team. We get there too early. We can't open the door, so we hang out at the Einstein's Bagels nearby, talking about what we're going to do. We weren't able to get online (other than our phones) for like 30 minutes, so we talked and talked and talked.

Here are my ideas:
- Get more app servers
- Get NGINX out of the way of node?
- Figure out Varnish, varnishstat, etc.
- Lots and lots and lots and lots of load testing
- Get a less hacky way to get Varnish to ignore sessions

So we did all of those. I learned how to read varnishstat. I did some basic apache-bench testing. I figured out that things still weren't being cached. We found the limits of our server setup. I added 5 app servers. I changed the Varnish config to round-robin across all of our servers. We figured out how to get Varnish, rather than express, to handle routing /viewer to the API server. We made a plan to completely get rid of sessions, because they were the only thing left in the stack that we didn't build ourselves. Things are looking pretty good.

We're still nervous about how things are going to go with our partner. We're expecting a call from them to hear that they're canceling the partnership. The call never comes. The night comes. The app seems to work okay.

By this point though, we've lost most of our users :-/

Day 4 and beyond


I did a lot of testing and found that NGINX was no more performant than hitting node.js directly on our small Rackspace servers. In fact, things were a lot better with NGINX out of the picture.

After this, our partner's amazing load testing team helps us really kill our servers. We tweak a few things in our client code, add more caching, and are able to consistently handle thousands of concurrent connections. Finally, we're getting somewhere.

For the next few days we dig into everything we can, constantly working on improving our load capacity and learning about our stack. We find bottlenecks all over the place. Under the highest loads, Varnish, on a cloud server with 4GB of RAM, eventually starts to choke, but we're still pretty happy with, say, 2,000 concurrent connections (which roughly translates to maybe 10,000 users at a time).

Finally, the ad campaigns run, we get back to features instead of bug fixes, and things slowly start to settle down. For the next few weeks we're paranoid and really worried that we're going to break things. After a while, things just keep working. We're happy and we can sleep at night.

Lessons Learned


1. It's really important to understand your stack.

At the beginning we didn't know if Varnish was caching anything. We couldn't tell you why we should or shouldn't have NGINX in front of Node. We weren't sure why the sessions collection was growing endlessly. It would have been really helpful to understand that stuff before we launched. It was really hard to learn how to debug and load test these components on the fly. You need to understand the tech before you run into problems with it.

2. Load testing is really freaking important.

Without knowing the limits of our stack it was really hard for us to know how to scale when the time came. Day after day we made assumptions about how to improve things (grow the database, put static files on a CDN). They were useful, but they may not have been necessary had we figured out, before we tried to launch, where things were likely to break. We know now that 5 app servers is complete overkill for us. That's a couple grand we probably didn't need to spend on hosting. Also, a lot of problems show up under heavy load that don't show up otherwise. PhantomJS, httperf, and ApacheBench were all helpful tools for figuring out our server stack.


3. Real-Time is hard. 

Part of our problem was that we were polling for updates every single second. With 500 users each polling once a second, that meant our database was getting hit 30,000 times per minute from this one app. Later we changed the polling interval to 5 seconds and got Varnish to cache those requests properly. Next time, we might just look at using Socket.io.
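The client-side change was tiny, which is partly why the one-second interval slipped through in the first place. A rough before-and-after sketch, assuming jQuery on the page (which the app used) and a made-up endpoint name:

    function render(updates) {
      // ...update the page with the latest data...
    }

    // Before: every browser polled once per second, so 500 users meant
    // roughly 30,000 requests a minute hitting the API.
    // setInterval(function () { $.getJSON('/viewer/api/updates', render); }, 1000);

    // After: poll every 5 seconds; with a short Varnish TTL on the response,
    // most of these requests never reach node or mongo at all.
    // ('/viewer/api/updates' is a placeholder endpoint.)
    setInterval(function () {
      $.getJSON('/viewer/api/updates', render);
    }, 5000);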

4 comments:

  1. Great post. Appreciate the candor. I can't seem to find the "viewer" express middleware you mentioned. We too have an awesome, totally decoupled front-end and API that may need its help :-) Got a link?

  2. I'll try and get it posted to github one of these days!

  3. I'm curious what conclusions were made regarding CORS. Did this actually end up being the bottleneck or was it just scrapped in the heat of the moment? I'm no expert on the subject, but as far as I know, the "preflight" request is only made when the request method isn't a case-sensitive match on GET, HEAD or POST. And even then, it's just that one extra request (not that one extra request is cool or anything. But it's not 4 times the requests.)

    I haven't used CORS in a heavy load production environment yet, and I'm curious to find out more about where it may choke so anything else you can tell me with your experience, I'm happy to find out!

    Replies
    1. @Josh, great question.

      It's been a few months since this all went down and I don't work there anymore, but let me see what we ended up with.

      1) Instead of CORS or mounting one app inside of the other node.js app, we setup Varnish to handle routing properly under a single domain. It was a sane place to handle the problem. No CORS.
      2) I think the pre-flight stuff was due to the fact that we had logged-in users and there were cookies, and I think Safari or some other target browser had to do funky stuff to handle auth. Even after we switched to custom headers instead of cookies, I think it was still a problem in some browsers. Again, when you're CPU bound, cutting extra requests helps a lot! Even if they're not really doing anything.
      3) The server configuration for CORS was also kind of weird. Even though we were using node.js and it was just middleware, we had to do a lot of tricks to get it to work under localhost/dev/staging/prod environments. It was mostly annoying.
      4) The pre-flight calls were doubling the requests, and when they were going every second, we couldn't afford it. By now they've probably tuned the polling down to something like 5 seconds, so it wouldn't kill the servers as badly.
      5) JSONP would have been fine; the non-CORS solution of fixing it at Varnish was also fine.
      6) The new stuff i.TV is building uses PubNub to handle real-time updates. Let someone else handle the scaling :)

      Hope that answers your questions. In general I left with a bad taste in my mouth about CORS, and I'd probably avoid it in the future, though for single-page web apps without tons of hits it's probably still an awesome way to go.
