Failover across geographically distributed datacenters is a challenge that doesn't get talked about all that much.

As a small company you probably can't easily get your own IP block allocated (as far as I know), so BGP [0] isn't really an option and the best you can do is probably DNS switching. Use a good DNS provider and set your TTLs low, around 30 seconds to 1 minute. When you have an outage, change the DNS entry to point to a secondary datacenter, which would serve a static error page or a reduced-functionality site. There's some debate about whether low DNS TTLs increase users' request times, but we haven't seen it.
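To make the DNS switching idea concrete, here is a rough Python sketch of the decision logic only. Everything in it is an assumption for illustration: the IPs come from the reserved documentation ranges, `ARecord` and `switch_if_needed` are made-up names, and `update_record` is a stand-in for whatever API your DNS provider actually exposes.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical addresses from the reserved documentation ranges (RFC 5737).
PRIMARY_IP = "192.0.2.10"
SECONDARY_IP = "198.51.100.10"
TTL_SECONDS = 30  # low TTL so resolvers pick up a switch quickly


@dataclass
class ARecord:
    name: str
    ip: str
    ttl: int


def desired_record(name: str, primary_healthy: bool) -> ARecord:
    """Return the A record we want published given the primary's health."""
    ip = PRIMARY_IP if primary_healthy else SECONDARY_IP
    return ARecord(name=name, ip=ip, ttl=TTL_SECONDS)


def switch_if_needed(current: ARecord, primary_healthy: bool,
                     update_record: Callable[[ARecord], None]) -> ARecord:
    """Call update_record only when the published IP has to change."""
    wanted = desired_record(current.name, primary_healthy)
    if wanted.ip != current.ip:
        update_record(wanted)  # stand-in for your DNS provider's API call
        return wanted
    return current
```

Clients that already cached the old record keep using it until the TTL expires, which is why the low TTL matters more than the switching code itself.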

Some companies will handle the monitoring and switchover for you (Dyn comes to mind), but we prefer to switch over manually for the time being. We have a Big Red Button Sinatra app that reports the status of the site and lets you fail over to the secondary and recover when the primary returns; I'm planning on open sourcing it once it gets some documentation.
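For a sense of what a manual button like that has to track, here is a minimal Python sketch. It is not our app (that one is Sinatra and isn't released yet); the class, its method names, and the two callables `check_primary` and `point_dns_at` are all hypothetical. Refusing to recover while the primary still looks down is a design choice of the sketch, not something you have to do.

```python
from typing import Callable


class BigRedButton:
    """Tracks which datacenter is live and flips it only on request."""

    def __init__(self, check_primary: Callable[[], bool],
                 point_dns_at: Callable[[str], None]) -> None:
        self.check_primary = check_primary  # hypothetical health probe
        self.point_dns_at = point_dns_at    # hypothetical DNS update hook
        self.active = "primary"

    def status(self) -> dict:
        """Report the live target and the primary's current health."""
        return {"active": self.active,
                "primary_healthy": self.check_primary()}

    def fail_over(self) -> None:
        """Point DNS at the secondary; a no-op if already failed over."""
        if self.active != "secondary":
            self.point_dns_at("secondary")
            self.active = "secondary"

    def recover(self) -> bool:
        """Switch back to the primary, but only if it looks healthy."""
        if self.active == "primary":
            return True
        if not self.check_primary():
            return False
        self.point_dns_at("primary")
        self.active = "primary"
        return True
```

A web front end would just expose `status`, `fail_over`, and `recover` as three routes behind some authentication.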

I think the reason failover doesn't get talked about as much in the startup world is simply that it's hard to do, and the costs are disproportionately high for a small company unless availability is really critical to you. For most people, just using multiple availability zones on EC2 is probably sufficient.

[0] http://ajohnstone.com/achives/high-availability-across-multi...