Honey, I shrunk the internet
Posted:
6 / 9 / 2021
Picture this: Amazon, Reddit, and Hulu are all down. The White House website shows nothing but a blank white screen and the UK government’s website is all but invisible. The world does not end, but there is significant chaos: #CyberAttack is trending on Twitter, and the media spins into a flurry of excitement.
Well, it did happen. On Tuesday 8th June, the internet appeared to temporarily ‘vanish’ for consumers. Yes, indeed, the unthinkable became a reality for all of 2.5 hours.
The cause? The cloud computing services provider Fastly suffered an outage caused by a software bug, leaving many internet users unable to access e-commerce sites, official government websites and popular online news outlets.
Whilst the bug was fixed relatively quickly, it shows just how digitally dependent we are, and how heavily we rely on online activity to go about our day-to-day lives. The internet is the lifeblood of society, meaning safeguarding against outages on this scale is critical.
But how? How did this happen? Surely there are systems and processes in place to keep this kind of thing from happening… You’d think, right?
Whilst the internet itself is decentralized, cloud computing providers are very centralized, which means a single failure on their side can take a great many sites down at once. The web is increasingly dependent on these providers, so large-scale downtime isn’t such a rarity: this marks the third time this year that a major cloud computing provider has gone dark, leaving apps and websites alike completely stranded.
As for what went wrong, Fastly issued the following, somewhat vague statement: “We identified a service configuration that triggered disruptions across our POPs globally and have disabled that configuration”. (POPs are points of presence, the locations where a CDN’s servers sit.) The issue took a little over two hours to resolve.
To dig a little deeper into this, Fastly is a Content Delivery Network (CDN): it maintains a network of servers that delivers content from websites to the end user. CDNs work by caching a version of a site’s content in multiple geographical locations, so that each visitor is served from a nearby server and pages load faster. The catch is that if the CDN itself suffers a catastrophic failure, the websites that rely on it can become unreachable altogether.
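To make the caching idea concrete, here is a minimal sketch of a single edge location in Python. It is a toy model, not how Fastly actually works: the `EdgeCache` class, the `origin` function and the 60-second lifetime are all made up for illustration.

```python
import time


class EdgeCache:
    """Toy model of one CDN edge location (hypothetical, for illustration only)."""

    def __init__(self, fetch_from_origin, ttl_seconds=60.0, clock=time.monotonic):
        self._fetch = fetch_from_origin   # callable that asks the website itself
        self._ttl = ttl_seconds           # how long a cached copy stays fresh
        self._clock = clock
        self._store = {}                  # path -> (expires_at, body)

    def get(self, path):
        now = self._clock()
        entry = self._store.get(path)
        if entry is not None and entry[0] > now:
            return entry[1], "HIT"        # served from the nearby copy
        body = self._fetch(path)          # cache miss: go back to the origin
        self._store[path] = (now + self._ttl, body)
        return body, "MISS"


origin_requests = []


def origin(path):
    """Stand-in for the website's own server; records every request it receives."""
    origin_requests.append(path)
    return f"<html>content for {path}</html>"


london = EdgeCache(origin)
print(london.get("/news"))    # first request is a MISS and reaches the origin
print(london.get("/news"))    # second request is a HIT served from the edge
print(len(origin_requests))   # the origin was only contacted once
```

Notice that every request goes through `get`. That is the speed-up, and it is also the weak spot: if the edge layer breaks, visitors never reach the origin at all.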
The nature of a CDN means that, whilst for the most part a ‘fast and reliable’ service is available, if a problem arises many customers will be simultaneously affected on a global scale – with internet observability company Kentik reporting that Tuesday’s outage caused a 75% drop in traffic from Fastly’s servers.
Was it a cyberattack?
Although there have been rumours that the outage was the result of a cyberattack (with Twitter the first to whip up a frenzy, as #CyberAttack began trending), there’s little to suggest that this event was the work of a sophisticated gang of cybercriminals. The fault has instead been attributed to a software bug set off by a configuration change, with a Fastly spokesperson reportedly saying: “We experienced a global outage due to an undiscovered software bug triggered by a valid customer configuration change, which caused 85% of our network to return errors.”
What can we learn from this?
Firstly, we can learn to think outside the box. During the outage, several companies kept delivering content in some rather imaginative ways. The Guardian temporarily shifted its communications to Twitter to run a live blog, whilst The Verge (owned by Vox Media) created a series of live Google Docs that were then shared with users. It was a genius plan until one editor accidentally left a doc unlocked, causing readers to swarm the document and begin editing freely themselves.
The BBC was one of the few organisations left unaffected. How? The BBC has a backup system in place, meaning that if one CDN goes down, content can still get to the end user, just along a different route (smart, eh?).
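The backup idea can be sketched in a few lines of Python. This is not the BBC’s actual setup, which isn’t described here; `fetch_with_failover`, `primary_cdn` and `backup_cdn` are hypothetical names, and in practice the switch often happens at the DNS or traffic-management layer rather than in application code.

```python
class CDNUnavailable(Exception):
    """Raised by a provider function when that CDN cannot serve the request."""


def fetch_with_failover(path, providers):
    """Try each (name, fetch) pair in order and return the first success."""
    errors = []
    for name, fetch in providers:
        try:
            return name, fetch(path)
        except CDNUnavailable as exc:
            errors.append(f"{name}: {exc}")   # remember why this route failed
    raise CDNUnavailable("all providers failed: " + "; ".join(errors))


def primary_cdn(path):
    # Simulates the primary provider returning errors during an outage.
    raise CDNUnavailable("503 Service Unavailable")


def backup_cdn(path):
    # Simulates a second, independent provider that is still healthy.
    return f"content for {path}"


providers = [("primary", primary_cdn), ("backup", backup_cdn)]
result = fetch_with_failover("/news", providers)
print(result)   # the content arrives via the backup route
```

The design choice worth noticing is that the two providers must be genuinely independent. A backup that shares the same underlying infrastructure would fail at the same moment as the primary.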
And then there’s Netflix, weathering the storm with its Simian Army, a suite of independently developed tools that randomly disable parts of the production environment to make sure the streaming platform can survive sudden system failures without disrupting the service to consumers.
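The principle behind this kind of tooling is simple enough to sketch: switch something off at random, then check the service is still standing. The Python below is a toy illustration only, not Netflix’s code; the instance names, `chaos_round` and the `healthy` check are all invented for the example.

```python
import random


def healthy(instances):
    """The service counts as up if at least one instance is still running."""
    return any(instances.values())


def chaos_round(instances, rng):
    """Switch off one randomly chosen running instance and return its name."""
    running = [name for name, up in instances.items() if up]
    victim = rng.choice(running)
    instances[victim] = False
    return victim


rng = random.Random(42)   # seeded so the experiment is repeatable
instances = {"web-1": True, "web-2": True, "web-3": True}

victim = chaos_round(instances, rng)
print(f"disabled {victim}; service healthy: {healthy(instances)}")
```

With three instances and one disabled, the health check still passes. The value of running this regularly is that a system which cannot tolerate the loss of one instance is found out on an ordinary day, rather than during a real outage.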
But, overall, a key lesson here is that by allowing a few large companies to dominate the CDN market, we have created an increasingly centralized internet infrastructure, with single points of failure that can cause sweeping outages.
As we hunger for more speed online, the internet’s infrastructure has been concentrated into the hands of a relatively small number of companies, with a smattering of CDNs (like Fastly and Cloudflare) and Cloud Hosts (like AWS, Microsoft Azure, and Google Cloud Platform) dominating the space.
Due to the scale of these companies, the providers rarely fail. However, on the rare occasion that they do fail (oftentimes as a result of human error), they pull large sections of the internet down with them. Whilst it is possible to run a site on two or more providers (like the BBC), this is a technically complicated and costly process. What is needed is a democratization of the CDN and Cloud Hosting market, one that safeguards against one small error causing global upheaval and allows users to access content undisrupted.