Wufoo Status

Moving to Rackspace

This Saturday, October 10th, at 11:00 PM EST, we will be moving Wufoo to a new home with Rackspace. During this move we are expecting all services to be unavailable for 30 minutes.

October 10th, 2009 at 11:00 PM EST
[ View All Timezones ]

If you’re up or interested, you can follow our progress during the downtime on our Tumblr Status Blog and/or Twitter feed. We also have more details about the reasoning behind the move on our blog.

Server Reboot

We are experiencing a problem with replication on one of our main database servers, so we have to take the site down for a short period of time. We’re estimating about 15 minutes. Downtime started at:

2:20pm EST

View all timezones

*Update - Back up at 2:28 EST. Thanks for your understanding.

Server Maintenance : Thursday, July 23, 2009 at 9:00pm PDT

This Thursday at 9:00 PM PDT, we’ll be doing some software and security upgrades to the web and file servers in addition to running a failover test on the database server. During the upgrades and tests, Wufoo’s services will be unavailable for at least 30 minutes and possibly up to an hour. The standard maintenance message will show on your forms and reports during this period.

July 23rd, 2009 at 9:00 PM PDT
[ View All Timezones ]

If you’re up or interested, you can follow our progress during the downtime on our Tumblr Status Blog and/or Twitter feed.

Explanation of this Morning's Downtime

Apologies for the delayed announcement on this recent downtime. We’re still gathering information from last night, and because it was a long night for everyone involved, we tried to let people catch some sleep once everything seemed up and secure again. Here’s the information that we know so far:

As most of you know, we had a planned maintenance period last Thursday to restart some of our servers to complete the datacenter migration we did recently. One of the changes we made was to the configuration on our file server setup. Basically, in addition to the RAID setup, we wanted to supplement the hard drive redundancy with a replicating file server to sync up those files to another server.

At about 4:30 AM EST this morning, Wufoo’s file server went down due to a RAID failure on the system that caused both hard drives to fail. There are three reasons why we couldn’t get the site back up quickly:

1) Bad timing. Because of the 4th of July holiday weekend, we were light on staff in a position to react to the situation as quickly as we normally would.

2) Bad setup. Wufoo also stores all caches associated with our templating system on the file server. Because of this dependency, the rest of the service did not fail gracefully. So while our web servers and database servers were fully operational, they couldn’t serve the site because they needed resources on the file server.

3) Bad backup. Unfortunately, in addition to both hard drives failing at the same time (something we hadn’t expected), we were also in a curious place with the backup server. Even though we had put the hardware in place for just this type of situation a few days ago, the backup server had not finished syncing up the data with the live file server, so we couldn’t switch over right away without new data being created in the wrong place.

Anyway, our first priority was to get the site up and showing some message quickly on people’s forms and reports, so we changed the load balancers to throw up a maintenance page while we worked on and assessed the situation.

Our second step in the recovery was to get the rest of the site back up independent of the file server. After some configuration changes, we made the web servers temporarily store their cache files locally. This allowed us to bring the site up properly with forms accepting submissions. Of course, because the file server still wasn’t available, Wufoo couldn’t accept new file uploads on forms with file upload capability, create new accounts or create new forms.
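For readers curious what this kind of fallback can look like in code, here is a minimal Python sketch. It is an illustration only, not Wufoo’s actual implementation: the function names, the feature names and the directory paths are all hypothetical. It shows two ideas from the step above: falling back to a local cache directory when the shared one is unavailable, and turning off only the features that need the file server.

```python
import os

# Hypothetical feature names, based on the capabilities mentioned in this post.
# These are assumptions for illustration, not identifiers from Wufoo's code.
FILE_SERVER_FEATURES = frozenset({"file_upload", "create_account", "create_form"})
ALL_FEATURES = FILE_SERVER_FEATURES | {"form_submission"}


def pick_cache_dir(shared_dir, local_dir):
    """Return shared_dir if it is a writable directory, otherwise local_dir.

    local_dir is created if it does not exist yet.
    """
    if os.path.isdir(shared_dir) and os.access(shared_dir, os.W_OK):
        return shared_dir
    os.makedirs(local_dir, exist_ok=True)
    return local_dir


def available_features(file_server_up):
    """Return the set of features that can be offered right now."""
    if file_server_up:
        return set(ALL_FEATURES)
    # Degrade gracefully: keep everything that does not need the file server.
    return set(ALL_FEATURES - FILE_SERVER_FEATURES)
```

A web server could call pick_cache_dir("/mnt/fileserver/cache", "/var/tmp/cache") at startup (both paths are made up) and use available_features(False) to decide which pages to disable. One caveat: a simple directory check like this can hang if a network mount is unresponsive rather than missing, so a real setup would likely need a timeout or a separate health check.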

After all that, Bitpusher diagnosed the problem with the file server and fixed one of the failed hard drives. They also supplemented that drive with a manual backup onto one of the backup hard drives, and this allowed us to bring the system back up to full capability without any data loss.

Anyway, we’ll be working on changing our dependency structures in our code to create a more graceful failure in a repeat situation and diving further into why the file server failed the way that it did. Again, our sincerest apologies to all our users for the inconvenience this weekend and we hope the rest of your fourth goes off with less of a bang.

Touch Ups from Server Move

Tonight, Thursday July 2nd, 2009, at midnight eastern, we would like to clean up a few loose ends from our server move. We believe all fixes can be accomplished with a simple reboot. And unfortunately, our replication servers are affected as well, so we can’t do a rolling downtime. Plan for approximately 10 minutes of downtime. The standard error messages will show (http://www.flickr.com/photos/wufoo/3665372897/).

Server Move

As planned, our server move has begun. We’ll be working on it over the next 4 hours, and we’ll update this post if there are any advances/delays in the time frame.

UPDATE

Three and a half hours in, all services have been restored. Everything went as planned, and all accounts should be functioning as expected. Thanks for your patience as we went through this move.

Power Problems

We just received word that there are power problems at the data center. All customers at the center are affected. We will know more soon.

*Edit on 5/19 3:30 EDT*

It appears the unthinkable is happening again. As we were evaluating the downtime yesterday, power went out again. The root of the problem bypasses our redundant circuits, and we are completely dependent on a third party.

*Edit on 5/19 5:00 EDT*

Service has been restored. We will be following up on the Wufoo blog with more details about what has happened over the past two days. A brief overview is below:

On Monday night, and again on Tuesday afternoon, we had a power outage at our data center. Please note that we have redundant circuits that are supposed to be on two independent power systems, but all circuits went down. Without a power source, all services became unavailable.

The downside to a power loss on this scale is that all core level services are affected. This significantly increases the time to get all servers online. Networking on critical level servers must be brought up first, and then all application level servers go through a crash recovery process.

We are currently looking for answers to why redundancy with the power did not work, and where there is room for improvement on the recovery process. We will follow up on our blog with a much more detailed post that will address all of the questions we have been receiving.

New Database Server - 5/15/2009

At midnight we made the switch to a new primary database server (our master lookup database). We tried to do this with no downtime, but it didn’t work as expected. Wufoo was unavailable for 37 minutes during this cutover.

Outage Earlier Today

Monday, May 11th, 2009, we were down between 5:00pm and 6:15pm EST. The trouble began when one of our DB servers had a process list build up that began affecting the web servers. For about 10 minutes, starting at 5:10pm, the DB server recovered and the site was responsive. We thought the problem was fixed. Shortly after, we found out the server could not fully recover, so we began the reboot and crash recovery process. This, combined with the rebuilding of cache, took the remaining time.

We have a few leads on potential causes of this downtime. We’ll be investigating and hopefully coming up with a concrete answer. We apologize to everyone who has been affected by this outage.

Unexpected Reboots

Tonight, April 29th 2009 at 12:30, we had to reboot one of our primary servers. The service was unavailable for a few minutes while this took place. We apologize for any inconvenience caused.