Apologies for the delayed announcement on this recent downtime. We’re still gathering information from last night, and because it was a long night for everyone involved, we tried to let people catch some sleep once everything seemed up and secure again. Here’s the information that we know so far:
As most of you know, we had a planned maintenance period last Thursday to restart some of our servers to complete the datacenter migration we did recently. One of the changes we made was to the configuration on our file server setup. Basically, in addition to the RAID setup, we wanted to supplement the hard drive redundancy with a replicating file server to sync up those files to another server.
At about 4:30 AM EST this morning, Wufoo’s file server went down due to a RAID failure on the system that caused both hard drives to actually fail. There are three reasons why we couldn’t get the site back up quickly:
1) Bad timing. Because of the 4th of July holiday weekend, we were light on staff and couldn’t react to the situation as quickly as we normally would.
2) Bad setup. Wufoo also stores all caches associated with our templating system on the file server. Because of this dependency, the rest of the service did not fail gracefully. So while our web servers and database servers were fully operational, they couldn’t serve pages because they needed resources on the file server.
3) Bad backup. Unfortunately, in addition to both hard drives failing at the same time (something we hadn’t expected), we were also in a curious place with the backup server. Even though we had put the hardware in place for exactly this type of situation a few days ago, the backup server had not finished syncing up the data with the live file server, so we couldn’t switch over right away without new data being created in the wrong place.
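To make the third point concrete, here is a rough sketch of the kind of check that tells you whether a replica is caught up enough to switch over. It is not our actual replication tooling. It assumes both file trees can be read from one machine (for example, as a periodic check while both servers are healthy), and the names `manifest`, `unsynced_files` and `safe_to_fail_over` are made up for illustration. It compares only file paths and sizes, not contents.

```python
import os


def manifest(root):
    """Map each file's path (relative to root) to its size in bytes."""
    files = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            files[os.path.relpath(full, root)] = os.path.getsize(full)
    return files


def unsynced_files(primary_root, replica_root):
    """Return sorted relative paths that are missing or differ in size on the replica."""
    primary = manifest(primary_root)
    replica = manifest(replica_root)
    return sorted(path for path, size in primary.items() if replica.get(path) != size)


def safe_to_fail_over(primary_root, replica_root):
    """Hypothetical rule: only switch over when nothing is left to sync."""
    return not unsynced_files(primary_root, replica_root)
```

If `unsynced_files` comes back non-empty, switching over means new data lands next to an incomplete copy, which is the situation we were in.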
Anyway, our first priority was to get the site up and showing some message quickly on people’s forms and reports, so we changed the load balancers to throw up a maintenance page while we assessed and worked on the situation.
Our second step in the recovery was to get the rest of the site back up independent of the file server. After some configuration changes, we made the web servers temporarily store their cache files locally. This allowed us to bring the service up properly with forms accepting submissions. Of course, because the file server was still unavailable, Wufoo still couldn’t accept new file uploads on forms with file upload capability, create new accounts or create new forms.
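For the curious, the idea behind that temporary change can be sketched in a few lines. This is only an illustration, not our production code: the function name `pick_cache_dir` and both directory arguments are hypothetical, and it assumes an unavailable file server shows up as an ordinary I/O error (a hung network mount could block instead and would need a timeout on top of this).

```python
import tempfile
from pathlib import Path


def pick_cache_dir(shared_dir, local_dir):
    """Return shared_dir if it is writable, otherwise fall back to local_dir."""
    shared = Path(shared_dir)
    try:
        # Probe the shared location by creating and removing a small temp file.
        shared.mkdir(parents=True, exist_ok=True)
        with tempfile.NamedTemporaryFile(dir=shared):
            pass
        return shared
    except OSError:
        # Shared storage is unusable: keep caches on this web server instead.
        local = Path(local_dir)
        local.mkdir(parents=True, exist_ok=True)
        return local
```

With something like this in place, the template cache degrades to per-server storage when the file server is unreachable, while features that truly need shared files (uploads, new accounts, new forms) stay off until it returns.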
After all that, Bitpusher diagnosed the problem with the file server and fixed one of the failed hard drives. They also supplemented that drive with a manual backup onto one of the backup hard drives, and this allowed us to bring the system back up to full capability without any data loss.
Anyway, we’ll be working on changing the dependency structures in our code so the service fails more gracefully in a repeat situation, and diving further into why the file server failed the way that it did. Again, our sincerest apologies to all our users for the inconvenience this weekend and we hope the rest of your Fourth goes off with less of a bang.