Today, 19 July, the world has been hit by a tech outage causing global economic and social disruption of the worst kind. Hospitals had to cancel operations, payment systems failed, government services stopped functioning, and newsrooms went dark.
The root cause seems to be an automatic update of a cybersecurity product from CrowdStrike that crashed Microsoft Windows systems, leaving them in an inoperable state. The company has acknowledged this issue and is in the process of deploying a fix.
While this was not a cyber attack by a malicious actor, it is a cybersecurity incident nonetheless, one that will prompt questions in parliaments and draw attention to our reliance on digital infrastructure.
As people are scrambling to fix the issues, we want to share a few observations with respect to the Internet, given that we are the Internet Society, after all.
- This was not an Internet outage or shutdown. There was no loss of connectivity. The networks all continued to work. Data continued to flow.
- This was a failure of some systems using a specific operating system and a specific vendor’s management tools. Unfortunately, those systems were used widely and for many functions critical to people’s daily lives.
- This type of event demonstrates that for critical services, redundancy and diversity are key. We need diversity across all aspects of tech, including operating systems. For example, systems using Linux or macOS were not affected by this particular issue. We need to ensure that our systems and networks use a range of different products and services so that an issue with one system will not bring them all down.
- On the Internet, everything is about scale: deploying software at Internet scale, updates in particular, can cause issues on a massive scale. While errors and mistakes happen, there are questions to ask about the rollout of this update.
- This event also demonstrates the inter-reliance of systems and services. If enough upstream systems break, the societal impact can be enormous.
- There are people who attack systems across the Internet, and the people who deployed CrowdStrike protection software did so for good reason. They wanted to be careful and spot possible intrusions into their systems early. There is no need to point fingers at those who run CrowdStrike tools.
- The reality is that in our world of complex, interconnected systems, incidents like this happen. They have happened in the past and they will happen in the future. The important part is how we learn from them and how we improve the resilience of our systems, so that similar issues do not happen again. It is critical that the companies involved in this issue today are transparent about what happened and provide information from which we can all learn.
This incident will undoubtedly raise important questions about our understanding of our reliance on complex digital systems.
In our work, we speak about the need for resilience for Internet connectivity to ensure that Internet access continues to be available. But beyond connectivity, today’s incident highlights the need for resilience in the software and systems that we use in our daily lives.
It is not a good day for the many people affected by these system outages, and we hope those systems get restored soon.
Image credit: Alessandro Demetrio (note that this is not an image from today’s outage but an example from a past outage)