15 January 2005

LiveJournal down 24 hours

As mentioned in my profile, my regular blog is on LiveJournal. Due to a power outage in the data centre where its servers are hosted, the site has been down for about 24 hours at the time of writing. At the moment, this is what those behind the service have to say about it:

Our data center (Internap, the same one we've been at for many years) lost all its power, including redundant backup power, for some unknown reason. (unknown to us, at least) We're currently dealing with verifying the correct operation of our 100+ servers. Not fun. We're not happy about this. Sorry... :-/ More details later.

Update #1, 7:35 pm PST: we have power again, and we're working to assess the state of the databases. The worst thing we could do right now is rush the site up in an unreliable state. We're checking all the hardware and data, making sure everything's consistent. Where it's not, we'll be restoring from recent backups and replaying all the changes since that time, to get to the current point in time, but in good shape. We'll be providing more technical details later, for those curious, on the power failure (when we learn more), the database details, and the recovery process. For now, please be patient. We'll be working all weekend on this if we have to.

Update #2, 10:11 pm: So far so good. Things are checking out, but we're being paranoid. A few annoying issues, but nothing that's not fixable. We're going to be buying a bunch of rack-mount UPS units on Monday so this doesn't happen again. In the past we've always trusted Internap's insanely redundant power and UPS systems, but now that this has happened to us twice, we realize the first time wasn't a total freak coincidence. C'est la vie.

Update #3: 2:42 am: We're starting to get tired, but all the hard stuff is done at least. Unfortunately a couple machines had lying hardware that didn't commit to disk when asked, so InnoDB's durability wasn't so durable (though no fault of InnoDB). We restored those machines from a recent backup and are replaying the binlogs (database changes) from the point of backup to present. That will take a couple hours to run. We'll also be replacing that hardware very shortly, or at least seeing if we can find/fix the reason it misbehaved. The four of us have been at this almost 12 hours, so we're going to take a bit of a break while the binlogs replay... Again, our apologies for the downtime. This has definitely been an experience.

Update #4: 9:12 am: We're back at it. We'll have the site up soon in some sort of crippled state while the clusters with the oldest backups continue to catch up.

Update #5: 1:58 pm: approaching 24 hours of downtime... *sigh* We're still at it. We'll be doing a full write-up when we're done, including what we'll be changing to make sure verify/restore operations don't take so long if this is ever necessary again. The good news is the databases already migrated to InnoDB did fine. The bad news (obviously) is that our verify/restore plan isn't fast enough. And also that some of our machine's storage subsystems lie. Anyway, we're still at it... it's long because we're making sure to back up even the partially out of sync databases that we're restoring, just in case we encounter any problems down the road with the restored copy, we'll be able to merge them. And unfortunately backups and networks are too slow.
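
For anyone curious about what "restoring from a backup and replaying the binlogs" amounts to, here is a toy sketch in Python. It is only an illustration of the general idea: the function name, the change-log format and the sample data are all made up for this example and have nothing to do with LiveJournal's actual tooling or with MySQL's real binlog format.

```python
# Toy model of "restore from backup, then replay the change log".
# Every name and data structure here is hypothetical, for illustration only.

def restore_and_replay(backup, backup_position, change_log):
    """Rebuild the current state from a backup plus the changes logged after it.

    backup          -- dict snapshot of the data at backup time
    backup_position -- index in change_log of the first change NOT in the backup
    change_log      -- ordered list of (op, key, value) tuples
    """
    state = dict(backup)  # work on a copy so the backup itself stays untouched
    for op, key, value in change_log[backup_position:]:
        if op == "set":
            state[key] = value
        elif op == "delete":
            state.pop(key, None)
        else:
            raise ValueError(f"unknown operation: {op!r}")
    return state


change_log = [
    ("set", "post:1", "hello"),
    ("set", "post:2", "draft"),
    ("set", "post:2", "final"),
    ("delete", "post:1", None),
]
# Assume the backup was taken after the first two changes were applied.
backup = {"post:1": "hello", "post:2": "draft"}

recovered = restore_and_replay(backup, 2, change_log)
print(recovered)
```

The older the backup, the more of the log has to be replayed, which fits the remark in update #4 about the clusters with the oldest backups taking the longest to catch up.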


LiveJournal was recently bought by the company behind Movable Type and TypePad. One can only marvel at the coincidence. I have been using LiveJournal for over 4 years now, and this is the first outage longer than a couple of hours that I've seen in all that time. A long outage is bad enough for a volunteer-run service (as LiveJournal was when I first got a paid account); for a service on a commercial footing, it's very bad.
