Murphy’s Law and why CSPs must expect the unexpected

Chris Stanhope, Cerillion’s Principal Infrastructure Consultant, looks at the challenges faced by Communications Service Providers (CSPs) in providing high-availability systems.

Most of you will be aware of Murphy’s Law, which states that “anything that can go wrong will go wrong”, but how does this relate to real-world examples in the telecoms industry?
Five 9s (99.999%) availability is constantly touted as a requirement for systems in the telecoms market. This allows roughly 5.26 minutes of downtime per year, 25.9 seconds per month or a whopping 6 seconds per week. Achieving this requires a huge amount of careful planning and design, corresponding levels of investment and, of course, a certain element of luck.
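
For anyone who wants to check the arithmetic, the downtime budget can be worked out in a few lines of Python. This is only a rough sketch: the function name is made up for illustration, and it assumes a 365.25-day year and a 30-day month, which are the assumptions that reproduce the figures quoted above.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def allowed_downtime_seconds(availability_percent, period_days):
    """Return the downtime budget, in seconds, for a period of the given length."""
    unavailable_fraction = 1 - availability_percent / 100
    return unavailable_fraction * period_days * SECONDS_PER_DAY


# Assumed period lengths: 365.25-day year, 30-day month, 7-day week.
per_year = allowed_downtime_seconds(99.999, 365.25)
per_month = allowed_downtime_seconds(99.999, 30)
per_week = allowed_downtime_seconds(99.999, 7)

print(f"Per year:  {per_year / 60:.2f} minutes")
print(f"Per month: {per_month:.1f} seconds")
print(f"Per week:  {per_week:.1f} seconds")
```

Changing the first argument to 99.99 or 99.9 shows how quickly the budget grows with each 9 that is dropped.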

Many hardware vendors will quote five 9s availability for their platforms when implemented with the various premium options and redundant hardware configurations. This figure, however, does not cover the network infrastructure, middleware, operating system, databases or the actual applications running on the system.
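
To see why the hardware figure alone is not enough, consider a simplified model in which the service is only up when every layer is up, and the layers fail independently of one another. The numbers below are hypothetical: they generously assume that each layer achieves five 9s in its own right, which is an illustration and not a measurement from any real deployment.

```python
from math import prod

# Hypothetical figures: assume every layer independently achieves five 9s.
layer_availability = {
    "hardware": 0.99999,
    "network infrastructure": 0.99999,
    "middleware": 0.99999,
    "operating system": 0.99999,
    "database": 0.99999,
    "application": 0.99999,
}

# In a serial chain the service is up only when every layer is up,
# so the individual availabilities multiply.
end_to_end = prod(layer_availability.values())

MINUTES_PER_YEAR = 365.25 * 24 * 60  # assumes a 365.25-day year
downtime_minutes = (1 - end_to_end) * MINUTES_PER_YEAR

print(f"End-to-end availability: {end_to_end * 100:.4f}%")
print(f"Expected downtime: {downtime_minutes:.1f} minutes per year")
```

Under these assumptions six five 9s layers add up to roughly half an hour of downtime a year, about six times the five 9s budget, and real layers are rarely all that good or fully independent.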

So what does downtime cost? This depends on which channels customers or internal staff have for accessing their services or data. For example, the forward-facing systems could remain available while the engine room is taken offline momentarily for an upgrade or patch.

Uptime was traditionally less of an issue for offline / batch systems. However, the shift to convergent services and a 24x7 online culture means high availability is increasingly becoming the norm for all business and operational support systems.

Of course, telecoms networks have been built for resilience for many years. However, earlier this year two of the major mobile operators in the UK both suffered outages caused by something that was not hardware or software related. Vodafone and O2 services were both disrupted by break-ins, with Vodafone stating that "some specialist network equipment and IT hardware was stolen," whilst O2 cited "theft and vandalism at one of our operations sites in East London."

I’m sure that Vodafone and O2 have invested a considerable amount of CAPEX budget in making sure that their networks and systems comply with five 9s availability and avoid any single point of failure. However, both apparently had a weakness in their setup that allowed common theft to bring their networks to a standstill in parts of the UK.
The last year has also seen a high-profile failure of part of Amazon’s EC2 cloud services, and there have been regular interruptions to internet services in some parts of the world due to breakages in subsea cables.

It is clear that even the most high-profile operations can fall victim to apparently unforeseeable problems. However, CSPs should take every step possible to mitigate this risk and prepare for the worst-case scenario. Areas that must be considered in any design include theft, natural disasters, vandalism and the increasing threat of external security breaches. How well your data centre is protected against power outages, and how many hours the generators can support the site, should also be major areas for consideration.

Going back to our old friend Murphy: systems will fail; it’s inevitable. Accepting this fact and planning around it means that the fallout and consequences can be minimised.