Synopsis
I have seen many articles, good and bad, on the (probably biggest) AWS outage. I guess companies like Reddit did not know what datacenter redundancy means, and therefore they went down together with the single availability zone where their service was located. That is neither a desirable nor an optimal setup.
Regions and Availability Zones
Let me shed some light on regions and availability zones and how they map to each other. The infrastructure is divided into regions. A region is divided further into availability zones, usually shortened to AZ; you can think of an AZ as a classical datacenter (DC).
./bin/ec2-describe-regions
REGION eu-west-1 ec2.eu-west-1.amazonaws.com
REGION us-east-1 ec2.us-east-1.amazonaws.com
REGION ap-northeast-1 ec2.ap-northeast-1.amazonaws.com
REGION us-west-1 ec2.us-west-1.amazonaws.com
REGION ap-southeast-1 ec2.ap-southeast-1.amazonaws.com
Now check the availability zones (the output below is for us-east-1).
./bin/ec2-describe-availability-zones
AVAILABILITYZONE us-east-1a available us-east-1
AVAILABILITYZONE us-east-1b available us-east-1
AVAILABILITYZONE us-east-1c available us-east-1
AVAILABILITYZONE us-east-1d available us-east-1
Imagine the worst-case scenario: you lose an AZ. Why am I so sure that this is the worst case? Big enterprises usually roll out changes incrementally to different locations, never to all datacenters (AZs) at the same time. External threats like lightning strikes, floods and fires usually damage only one DC as well. I am not saying it is impossible to lose all the DCs in a region, but it is highly unlikely.
Service Availability
We can do some maths on this, just to prove the point. According to the EC2 SLA doc, the aim is to provide 99.95% availability for any given region. The region is unavailable if more than 1 AZ is not reachable.
A=1-(1-Az)^N
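Here A is the service availability, Az is the availability of a single AZ (DC) and N is the number of AZs used, assuming the AZs fail independently of each other. Based on this equation, the service availability looks like the following: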
DC availability : 0.9 No. of DCs: 1 Service availability: 0.9
DC availability : 0.95 No. of DCs: 1 Service availability: 0.95
DC availability : 0.99 No. of DCs: 1 Service availability: 0.99
DC availability : 0.9 No. of DCs: 2 Service availability: 0.99
DC availability : 0.95 No. of DCs: 2 Service availability: 0.9975
DC availability : 0.99 No. of DCs: 2 Service availability: 0.9999
DC availability : 0.9 No. of DCs: 3 Service availability: 0.999
DC availability : 0.95 No. of DCs: 3 Service availability: 0.999875
DC availability : 0.99 No. of DCs: 3 Service availability: 0.999999
DC availability : 0.9 No. of DCs: 4 Service availability: 0.9999
DC availability : 0.95 No. of DCs: 4 Service availability: 0.99999375
DC availability : 0.99 No. of DCs: 4 Service availability: 0.99999999
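The numbers above can be reproduced with a few lines of Python. This is only a minimal sketch: the function names are made up for illustration, and it assumes that the DCs fail independently of each other, which is what the equation A=1-(1-Az)^N implies.

```python
def service_availability(az_availability: float, n_zones: int) -> float:
    """Return A = 1 - (1 - Az)^N, assuming zones fail independently."""
    if not 0.0 <= az_availability <= 1.0:
        raise ValueError("az_availability must be between 0 and 1")
    if n_zones < 1:
        raise ValueError("n_zones must be at least 1")
    # The service is down only when every zone is down at the same time.
    return 1.0 - (1.0 - az_availability) ** n_zones


def availability_table(az_values=(0.9, 0.95, 0.99), max_zones=4):
    """Build (Az, N, A) rows in the same order as the listing above."""
    rows = []
    for n in range(1, max_zones + 1):
        for az in az_values:
            rows.append((az, n, service_availability(az, n)))
    return rows


if __name__ == "__main__":
    for az, n, a in availability_table():
        print(f"DC availability: {az}  No. of DCs: {n}  Service availability: {a:.8f}")
```

The printed values are rounded to eight decimal places, so they match the listing only up to floating-point rounding.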
There is a really nice article on Wikipedia about what these percentages mean in downtime per day, week and year.
http://en.wikipedia.org/wiki/High_availability#Percentage_calculation
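To get a feeling for these percentages without the lookup table, the conversion can be sketched as below. The function name is hypothetical and a 365-day year is assumed.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: non-leap year, 8760 hours


def downtime_hours_per_year(availability: float) -> float:
    """Convert an availability fraction (e.g. 0.9995) to expected downtime in hours per year."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * HOURS_PER_YEAR


if __name__ == "__main__":
    for a in (0.9, 0.99, 0.9995, 0.9999):
        print(f"availability {a}: about {downtime_hours_per_year(a):.2f} hours of downtime per year")
```

For example, 99.95% availability leaves room for roughly 4.38 hours of downtime per year.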
Conclusion
Having 3 elastic IPs in the same region but in different availability zones gives most companies suitable uptime for their service. SQL replication is available on all of the platforms (including MySQL). From now on, the fate of your website is in the architect's hands; there is no point in blaming any cloud provider if you put all your servers into a single AZ/DC and it goes down.
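As a rough sanity check of the "three AZs" rule of thumb, the sketch below searches for the smallest number of AZs that reaches a target availability using A=1-(1-Az)^N. The function name is hypothetical, the per-AZ availability values are illustrative inputs rather than figures published by AWS, and independent AZ failures are assumed.

```python
def zones_needed(az_availability: float, target: float, max_zones: int = 16) -> int:
    """Smallest N with 1 - (1 - Az)^N >= target, assuming independent zone failures."""
    for n in range(1, max_zones + 1):
        if 1.0 - (1.0 - az_availability) ** n >= target:
            return n
    raise ValueError("target not reachable within max_zones")


if __name__ == "__main__":
    target = 0.9995  # the 99.95% figure quoted from the EC2 SLA
    for az in (0.9, 0.95, 0.99):
        print(f"per-AZ availability {az}: {zones_needed(az, target)} AZ(s) needed for {target}")
```

With an illustrative per-AZ availability of 0.95, three AZs are enough to pass 99.95%; a less reliable AZ needs more, a more reliable one fewer.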