AT&T suffered a major service outage on May 25 that affected 1.5 million customers using its U-Verse VoIP (Voice over Internet Protocol) service. The 4 1/2-hour outage was blamed on a “server crash.”
Excuse me? A server crash? One lone, humble server causes an outage that takes down customers in 22 states? Really?
While we can't address the massive AT&T infrastructure, we can use this incident as a teachable moment.
Even with the best equipment available, server outages happen. This is why, especially for mission-critical services, server clustering and the elimination of single points of failure are almost a requirement.
Clustering allows for stateful failover between hardware if a member node in the cluster fails for some reason, meaning that if one of the servers in the cluster fails, the other node(s) take over without missing a beat. While clustering is a requirement for mission-critical delivery of services, it is not the only consideration.
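To make the idea of stateful failover a little more concrete, here is a minimal sketch in Python. It is a toy model, not how AT&T or any particular clustering product works: the Node and Cluster classes, the "call-1" session ID, and the in-memory replication are all assumptions made up for illustration. The point it shows is that session state is copied to every healthy node, so a surviving node already holds the state when the active one fails.

```python
class Node:
    """A hypothetical cluster member that keeps a copy of session state."""

    def __init__(self, name):
        self.name = name
        self.healthy = True
        self.sessions = {}  # session_id -> list of events seen so far


class Cluster:
    """A toy cluster: the first healthy node in the list is the active one."""

    def __init__(self, nodes):
        self.nodes = list(nodes)

    def active(self):
        # Pick the first healthy node; fail loudly if none is left.
        for node in self.nodes:
            if node.healthy:
                return node
        raise RuntimeError("all nodes in the cluster are down")

    def handle(self, session_id, event):
        node = self.active()
        # Replicate the event to every healthy node so that a standby
        # already holds the session state if the active node fails.
        for peer in self.nodes:
            if peer.healthy:
                peer.sessions.setdefault(session_id, []).append(event)
        return node.name


cluster = Cluster([Node("node-a"), Node("node-b")])
first = cluster.handle("call-1", "setup")    # served by node-a
cluster.nodes[0].healthy = False             # simulate a server crash
second = cluster.handle("call-1", "audio")   # node-b takes over
surviving_state = cluster.active().sessions["call-1"]
```

After the simulated crash, the second request is served by node-b, which still has the "setup" event from before the failure. A real cluster would also need health checks, a replication protocol, and protection against split-brain, all of which this sketch leaves out.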
Part of effective network architecture is reducing or eliminating single points of failure in the infrastructure. The problem is that when some network designers look for single points of failure, they don’t extend their view all the way to the edge of their delivery network. For instance, having a server farm is great, but when it is held together by one switch, you still have a single point of failure that can take down the entire delivery system.
There are many ways to design redundancy into the environment, but it all starts with a comprehensive analysis of your infrastructure to discover those single points of failure.
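One simple way to approach that analysis is to model the network as a graph and ask, for each device, "if this one dies, can traffic still get from the edge to any server?" The sketch below does exactly that with a brute-force check. The device names (edge, switch-1, server-a, and so on) and the two topologies are hypothetical examples, and a real audit would also have to cover power, links, DNS, and other dependencies that this toy model ignores.

```python
from collections import deque


def build_links(pairs):
    """Build an undirected adjacency map from (device, device) cable pairs."""
    links = {}
    for a, b in pairs:
        links.setdefault(a, set()).add(b)
        links.setdefault(b, set()).add(a)
    return links


def reachable(links, start, removed):
    """Return the devices reachable from start when `removed` is down."""
    seen = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for neighbor in links.get(current, ()):
            if neighbor != removed and neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen


def single_points_of_failure(links, source, targets):
    """List intermediate devices whose loss cuts source off from every target."""
    spofs = []
    for device in sorted(links):
        if device == source or device in targets:
            continue  # only test the devices in between
        reached = reachable(links, source, device)
        if not any(target in reached for target in targets):
            spofs.append(device)
    return spofs


servers = {"server-a", "server-b"}

# A server farm held together by one switch.
single_switch = build_links([
    ("edge", "switch-1"),
    ("switch-1", "server-a"),
    ("switch-1", "server-b"),
])

# The same farm with a second switch cabled to both servers.
dual_switch = build_links([
    ("edge", "switch-1"), ("edge", "switch-2"),
    ("switch-1", "server-a"), ("switch-1", "server-b"),
    ("switch-2", "server-a"), ("switch-2", "server-b"),
])

spofs_single = single_points_of_failure(single_switch, "edge", servers)
spofs_dual = single_points_of_failure(dual_switch, "edge", servers)
```

In the first topology the check flags switch-1, which is the "one switch" problem described above; in the second it finds nothing between the edge and the servers. Note that the edge device itself is treated as the starting point and is not tested here, so it would need its own redundancy review.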
Once those failure points are found and eliminated, the network is well on its way to becoming a dependable platform on which to run your business.