A week ago BazQux Reader experienced two major outages. What happened?

I decided to move my servers to a new datacenter. The move was scheduled for 8:00, Moscow time.

At about 7:55 I switched the main page to a maintenance message and started shutting down the feed crawler and database servers. And Whoops! I lost connection to my servers at 7:57. The punctual German technicians started moving the servers a little bit early (the servers were powered off at 7:59:30, but it seems they pulled the network cables out before that).

I can't blame the technicians. It's my fault. It's a bad idea to do everything in the last five minutes.

Then I thought, "Um, that's probably bad for the database," and started waiting for the servers to arrive at the new datacenter (and found that the main page hadn't actually been redirected to the maintenance one, Whoops #2).

The servers arrived. I configured the new network addresses and turned the DB on. Waited a bit for DB recovery. Everything looked OK. Did some testing. And here came the real WHOOPS! The DB servers started crashing when trying to read something from the database. I tried to restart them a few times (which later turned out to be a bad idea). Looked into the logs... Ahem, it seemed I needed to repair about 1 TB of randomly corrupted data on this nice sunny spring morning. And my customers wouldn't be very happy about it (that was the thing that bothered me most).

Some technical details. I keep all my data in Riak. It's a distributed database with a focus on availability and fault tolerance. It's really nice. For example, after Google's announcement I added more memory and SSDs to my servers with only 10 minutes of downtime (my fault again). Riak has several storage backends: Bitcask, Memory, LevelDB and InnoDB. About a year ago I decided to use InnoDB (LevelDB was not stable at the time, and Bitcask requires too much memory for my use case).
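If you've never used Riak: to an application it looks like a plain key-value store behind a small client API, while replication and partitioning happen inside the cluster. Here's a minimal example with the official Python client (just an illustration with made-up names, not the reader's actual code):

```python
import riak

# Connect to one node of the cluster over protocol buffers.
client = riak.RiakClient(protocol='pbc', host='127.0.0.1', pb_port=8087)

feeds = client.bucket('feeds')      # buckets just namespace the keys
feeds.new('http://example.com/rss', data={'title': 'Example feed'}).store()

obj = feeds.get('http://example.com/rss')
print(obj.data)                     # {'title': 'Example feed'}
```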

It's the InnoDB storage backend that caused all the trouble. It doesn't like power-offs. Although it has a recovery procedure, it doesn't always work (especially under my write-heavy load). In fact, InnoDB is deprecated in newer versions of Riak.

It turned out that InnoDB failed to recover on ALL my servers. A long, long dump and load was necessary to recover: more than 1 TB of data was slowly read byte by byte, and then all the correctly read data was written back again. It took about 11 hours.
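For the curious, the general shape of that kind of key-level dump and load is roughly like the sketch below. It's only an illustration (made-up hosts and bucket name, the Python Riak client, and an assumed pre-saved list of keys), not the actual recovery procedure:

```python
import riak

# Made-up hosts and bucket name, purely for illustration.
source = riak.RiakClient(protocol='pbc', host='old-node', pb_port=8087)
target = riak.RiakClient(protocol='pbc', host='new-node', pb_port=8087)
src_bucket = source.bucket('feeds')
dst_bucket = target.bucket('feeds')

def dump_and_load(keys):
    """Copy every still-readable value into a freshly rebuilt store."""
    recovered, lost = 0, 0
    for key in keys:
        try:
            obj = src_bucket.get(key)        # may blow up on a corrupted entry
        except Exception:
            lost += 1
            continue
        if obj.data is not None:
            dst_bucket.new(key, data=obj.data).store()
            recovered += 1
    return recovered, lost
```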

I wasn't in a good mood during those long hours. I strive to provide a premium-quality service, and BazQux Reader was very stable over the last year (less than an hour of downtime in total). It survived the Google Reader effect without any major problems (feeds were crawled more slowly, but the reader itself was as fast as always). And boom! Such a big outage.

But let's keep going. DB restored, reader and crawler launched. Time to sleep... Waking up, checking mail: "My feeds aren't updating." Oh, shi!

What happened this time?

Riak splits data into partitions so it can move them between servers. When I restarted the corrupted Riak nodes in the morning, Riak created new partitions in place of the failed ones, and then InnoDB crashed the nodes (keeping the partition files on disk but forgetting that they had been created). After the DB was restored, Riak continuously tried to recreate those partitions and failed ("file already exists"). This led to timeouts when accessing some data and jammed the feed crawler.
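In other words, partition creation wasn't idempotent after a crash: the files were still on disk, but the node had lost the record that it had created them, so every retry hit the existing files and gave up. A toy Python illustration of that failure class (nothing to do with Riak's real code):

```python
import os

def create_partition_naive(path):
    # A half-finished previous attempt leaves the directory behind,
    # and every retry dies with FileExistsError.
    os.mkdir(path)

def create_partition_idempotent(path):
    # Treats "already there" as success, so a crashed attempt
    # can simply be retried.
    os.makedirs(path, exist_ok=True)

create_partition_idempotent('/tmp/partition-42')
create_partition_idempotent('/tmp/partition-42')   # retry is harmless
```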

I removed the superfluous files... and some necessary ones. A node crashed. Voila! Second outage! I restored the files and checked that everything was OK. Phew, the "two-hour" morning maintenance was finished.

A day later I got another little timeout during Riak rebalancing. Removal of processed feeds from the queue stopped, and that led to a long queue cleanup on the next crawler restart. Feeds weren't updated for a few hours. OK. Maintenance finished. For real this time.

Conclusions? I think that what happened is actually a good thing.

I learned the hard way that it's better not to leave things to the last few minutes.

All Riak nodes are now migrated to the LevelDB storage backend (which never writes data in place and is hence much friendlier to power-offs than InnoDB). I did it over the last few days without any downtime. That's why I love Riak.

I fixed some corner cases in the feed crawler. Now it works correctly even in the presence of timeouts from the database layer.
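The fixes themselves are specific to the crawler, but the underlying pattern is just a bounded retry with backoff around database calls, so a transient timeout delays one feed instead of jamming the whole queue. A rough sketch with hypothetical names (the crawler's real code looks nothing like this):

```python
import time

class DBTimeout(Exception):
    """Raised by the (hypothetical) storage layer on a timeout."""

def with_retries(action, attempts=5, delay=1.0):
    """Run `action`, retrying timeouts with exponential backoff."""
    for attempt in range(attempts):
        try:
            return action()
        except DBTimeout:
            if attempt == attempts - 1:
                raise                           # give up after the last attempt
            time.sleep(delay * 2 ** attempt)

def crawl_one(feed, fetch, store):
    """Crawl a single feed; a persistent timeout skips it, not the queue."""
    data = fetch(feed)
    try:
        with_retries(lambda: store(feed, data))
    except DBTimeout:
        pass                                    # leave it for the next crawl cycle
```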

It's good that this problem appeared and was solved before July 1st.

And most importantly: I was afraid that customers would blame me for such a failure, but the reaction I got was much more positive than I expected. Almost no one blamed me for the downtime, and some people were surprisingly supportive.

I'm so happy I have such awesome customers and will work hard to justify your trust.