On Friday, July 19, 2024, the world suffered one of the most widespread and disruptive sets of IT outages on record, when a faulty CrowdStrike software update wreaked havoc on Windows systems worldwide, impacting transportation, commerce, critical infrastructure, and more.
Many customers and partners reached out to ask if we were impacted. Fortunately, the answer was “no.” But it would be an act of extreme hubris to take too much satisfaction in this. Failure is an inherent part of IT systems. Equipment, software updates, power systems, and networks can all fail. Storms, fires, floods, earthquakes, and war can take out entire data centers. And people, whether through malice or error, are inherently capable of causing widespread damage. Complex systems fail in complex ways.
So what does this have to do with the potato famine of the 1800s?
Ireland’s economy and food supply in the mid-1800s relied too heavily on one crop: the potato. That meant a single pathogen could destroy an entire nation's food supply.
A similar argument can be made that we rely too heavily on centralized architecture. CrowdStrike and Windows are ubiquitous platforms, and many organizations have an IT monoculture. This means that a "supply chain" incident (whether an actual attack, like the SolarWinds compromise disclosed in 2020, or a bad update, like the CrowdStrike one) can cause huge and widespread damage. Other "black swan" events, such as widespread storms or power outages, can be just as damaging if assets are concentrated.
Heterogeneous and broadly distributed infrastructure is needed.
To limit the damage from these incidents, the IT industry needs to reduce its reliance on centralized infrastructure (our proverbial potato). This means avoiding monoculture, avoiding reliance on a single provider or tech stack, and choosing partners who are themselves highly distributed and heterogeneous.
This doesn’t mean having multiple operating systems and cybersecurity providers, but rather thinking about the infrastructure that supports operations and finding ways to mitigate potential damage. Traditionally, companies have attempted to do this by making copies of data and storing them in multiple geographic regions. While this helps, it is incredibly expensive and is often still a homogeneous approach with a single provider.
The alternative that is gaining momentum is distributed cloud infrastructure. While it is impossible to prevent failures like these altogether, it is possible to mitigate risk and build resilience, and this underlies our design philosophy at Storj. We have built a system that is both highly distributed and highly heterogeneous. For our public storage offerings, we rely on a network of over 20,000 storage nodes in over 100 countries. Every file is encrypted, redundantly sharded, and distributed across a broad swath of those nodes. Those nodes are, of course, operated by different people, in different geographies, on different power supplies, with different equipment and different software. Similarly, our compute offerings, consisting of over 16,000 HPC GPUs, are distributed all over the world in Tier 3 and Tier 4 enterprise-grade data centers.
This level of heterogeneity is many times greater than what traditional IT infrastructure can offer.
How the distributed cloud performed during this IT outage.
While individual nodes and node operators were certainly impacted by Friday’s CrowdStrike incident (about 20% of our nodes run Windows, and most of the rest run some 10 different varieties of Linux), the Storj public network as a whole, and by extension every customer of that network, was unaffected. To date, while Storj has certainly had incidents and made mistakes, we have largely been able to shield our system and our customers from them. We’re not so arrogant as to believe that we are immune from failure, and that is what leads us to build resilient systems through distributed architecture and heterogeneity.
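To make the arithmetic behind that resilience concrete, here is a minimal sketch in Python. It assumes a file is split into n pieces, any k of which are enough to rebuild it, and that pieces land on nodes at random, so each piece independently has a 20% chance of sitting on an affected node (a worst case in which every Windows node goes down at once). The values n = 80 and k = 29 and the function name `survival_probability` are illustrative assumptions, not a statement of our production settings.

```python
from math import comb


def survival_probability(n: int, k: int, p_loss: float) -> float:
    """Probability that at least k of n pieces remain reachable.

    Assumes each piece is lost independently with probability p_loss
    (a simple binomial model; real failures can be correlated).
    """
    p_ok = 1.0 - p_loss
    return sum(
        comb(n, i) * p_ok**i * p_loss ** (n - i)
        for i in range(k, n + 1)
    )


# Hypothetical parameters, chosen only for illustration.
N_PIECES = 80   # pieces stored per file (assumption)
K_NEEDED = 29   # pieces required to rebuild the file (assumption)
P_LOSS = 0.20   # share of nodes assumed offline, from the ~20% Windows figure

single_copy = survival_probability(1, 1, P_LOSS)
sharded = survival_probability(N_PIECES, K_NEEDED, P_LOSS)

print(f"One copy on one node stays reachable: {single_copy:.2%}")
print(f"Sharded file stays reachable:         {sharded:.10%}")
```

Under these assumptions, a file kept as a single copy on a single node is reachable only 80% of the time, while the sharded file would need to lose 52 of its 80 pieces at once before it became unavailable. The model is deliberately simple, but it shows why spreading pieces across many dissimilar nodes matters more than making any one node more reliable.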
Heterogeneity and a sense of humility, whether in society, organizations, or IT, can be a source of strength. Moving toward a more distributed architecture is critical and is an actionable takeaway from this outage.