Things went very wrong on Friday.
A bugged CrowdStrike update temporarily bricked quite a lot of computers, bringing down such fun things as airlines, hospitals and 911 services.
It was serious out there.
Ryan Peterson: Crowdstrike outage has forced Starbucks to start writing your name on a cup in marker again and I like it.
What (Technically) Happened
My understanding is that it was a rather stupid bug, a NULL pointer dereference in the memory-unsafe C++ language.
Zack Vorhies: Memory in your computer is laid out as one giant array of numbers. We represent these numbers here as hexadecimal, which is base 16 (hexadecimal) because it's easier to work with... for reasons.
The problem area? The computer tried to read memory address 0x9c (aka 156).
Why is this bad?
This is an invalid region of memory for any program. Any program that tries to read from this region WILL IMMEDIATELY GET KILLED BY WINDOWS.
So why is memory address 0x9c trying to be read from? Well because... programmer error.
It turns out that C++, the language crowdstrike is using, likes to use address 0x0 as a special value to mean "there's nothing here", don't try to access it or you'll die.
…
And what's bad about this is that this is a special program called a system driver, which has PRIVILEGED access to the computer. So the operating system is forced to, out of an abundance of caution, crash immediately.
This is what is causing the blue screen of death. A computer can recover from a crash in non-privileged code by simply terminating the program, but not a system driver. When your computer crashes, 95% of the time it's because it's a crash in the system drivers.
If the programmer had done a check for NULL, or if they used modern tooling that checks these sorts of things, it could have been caught. But somehow it made it into production and then got pushed as a forced update by Crowdstrike... OOPS!
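To make the arithmetic in that explanation concrete, here is a toy model, taking the quoted account at face value. It is written in Python standing in for the C++, and the field offset, the tiny `memory` dict, and every function name are illustrative assumptions, not anything taken from CrowdStrike's actual driver.

```python
NULL = 0x0           # the "there's nothing here" address
FIELD_OFFSET = 0x9C  # assumed offset of some field inside a struct


class AccessViolation(Exception):
    """Stands in for the fault raised when an unmapped address is read."""


def read(memory, address):
    # Toy memory model: only addresses present in the dict are mapped.
    if address not in memory:
        raise AccessViolation(hex(address))
    return memory[address]


def read_field_unchecked(memory, base):
    # Trusts the base pointer blindly, as the buggy code effectively did.
    return read(memory, base + FIELD_OFFSET)


def read_field_checked(memory, base):
    # The missing check: refuse to dereference NULL.
    if base == NULL:
        return None
    return read(memory, base + FIELD_OFFSET)


# One mapped struct at a made-up base address of 0x1000.
memory = {0x1000 + FIELD_OFFSET: 42}
```

With a NULL base, the unchecked version ends up reading address 0x0 + 0x9c, which is the 156 from the quote, and faults. The checked version returns early instead. In user code that fault kills one program; in a driver it takes the machine down with it.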
Here is another technical breakdown.
A non technical breakdown would be:
CrowdStrike is set up to run whenever you start the computer.
Then someone pushed an update to a ton of computers.
Which is something CrowdStrike was authorized to do.
The update contained a stupid bug that would have been caught if those involved had used standard practices and tests.
With the bug, it tries to access memory in a way that causes a crash.
Which also crashes the computer.
So you have to do a manual fix to each computer to get around this.
If this had been malicious it could probably have permawiped all the computers, or inserted Trojans, or other neat stuff like that.
So we dodged a bullet.
Also, your AI safety plan needs to take into account that this was the level of security mindset and caution at CrowdStrike, despite CrowdStrike having this level of access and being explicitly in the security mindset business, and that they were given this level of access to billions of computers, and that their stock was only down 11% on the day so they probably keep most of that access and we aren’t going to fine them out of existence either.
EDIT, added 11:30am 7/22/24: Ben Thompson has a post summarizing what happened. It broadly agrees with what is described here, and in particular highlights the EU’s role via the 2009 Microsoft consent decree that prevents Microsoft from locking down the Windows kernel space. I am convinced that without that decree, Microsoft would probably have done that.
Who to Blame?
George Kurtz (CEO CrowdStrike): CrowdStrike is actively working with customers impacted by a defect found in a single content update for Windows hosts. Mac and Linux hosts are not impacted. This is not a security incident or cyberattack. The issue has been identified, isolated and a fix has been deployed.
We refer customers to the support portal for the latest updates and will continue to provide complete and continuous updates on our website. We further recommend organizations ensure they’re communicating with CrowdStrike representatives through official channels. Our team is fully mobilized to ensure the security and stability of CrowdStrike customers.
Dan Elton: No apology. Many people have been wounded or killed by this. They are just invisible because we can't point to them specifically. But think about it though -- EMS services were not working. Doctors couldn't access EMR & hospitals canceled medical scans.
Stock only down 8% [was 11% by closing].
I don't think the full scope of this disaster has really sunk in. Yes, the problems will be fixed within a few days & everything will go back to normal. However, 911 services are down across the country. Think about that for a second. Hospitals around the world running on paper.
It's hard to map one's mind around since all the people who have been killed and will be killed by this -- and I'm sure there are many -- are largely invisible.
Claude’s median estimate is roughly 1,000 people died due to the outage, when given the hypothetical scenario of an update with this bug being pushed and no other info.
Where Claude got it wrong is it expected a 50%+ drop in share price for CrowdStrike. We should be curious why this did not happen. When told it was 11%, Claude came up with many creative potential explanations, and predicted that this small a drop would become an object of future study.
Then again, perhaps no one cares about reputation these days? You get to have massive security failures and people still let you into their kernels?
Anshel Sag: For those who don't remember, in 2010, McAfee had a colossal glitch with Windows XP that took down a good part of the internet. The man who was McAfee's CTO at that time is now the CEO of Crowdstrike. The McAfee incident cost the company so much they ended up selling to Intel.
I mean, sure, it looks bad, now, in hindsight.
At this rate, the third time will be an AGI company.
So do we blame George Kurtz? Or do we blame all of you who let it happen?
How Did We Let This Happen
Aside from ‘letting a company run by George Kurtz access your kernel,’ that is.
It happened because various actors did not do deeply standard things they should obviously have been doing.
A fun game is to watch everyone say ‘the real problem is X and Y is a distraction’ with various things being both X and Y in different statements. It can all be ‘real’ problems.
Owen Lynch: Everyone is talking about how memory safety would have stopped the crowdstrike thingy. Seems to me that's a distraction; the real problem is that the windows security model is reactive (try to write software that detects hacks) rather than proactive (run processes in sealed sandboxes with permissions granted by-need instead of by-default). Then there's little need for antivirus in the same sense.
Of course, the kernel managing these sandboxes needs to be memory safe, but this is a low bar, ideally it should be either exhaustively fuzzed (like SQLite) or actually formally verified.
But most software should be allowed to be horrendously incorrect or actually malicious, but only in its little box.
Here is a thread where they debate whether to blame CrowdStrike or Microsoft.
Luke Parrish: Microsoft designed their OS to run driver files without even a checksum and you say they aren't responsible? They literally tried to execute a string of zeroes!
Jennifer Marriott: Still the issue is CrowdStrike. If I buy a program and install it on my computer and it bricks my computer I blame the program not the computer.
…
Luke Parrish: CrowdStrike is absolutely to blame, but so is Microsoft. Microsoft's software, Windows, is failing to do extremely basic basic checks on driver files before trying to load them and give them full root access to see and do everything on your computer.
This is analogous to the fire safety triangle: Heat, fuel, and oxygen. Any one of those can be removed to prevent combustion. Multiple failures led to this outcome. Microsoft could have prevented this with good engineering practices, just as CrowdStrike could have.
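For a sense of how basic the checks under discussion are, here is a minimal sketch of a validate-before-load step. This is not how Windows or CrowdStrike actually load files; the function name, the idea of shipping an expected SHA-256 alongside the file, and the specific checks are all assumptions made for illustration.

```python
import hashlib


def validate_content_file(data, expected_sha256):
    """Return a list of problems; an empty list means the file may be loaded."""
    problems = []
    if not data:
        problems.append("empty file")
    elif not any(data):
        # Every byte is zero, the case described in the thread above.
        problems.append("file is all zero bytes")
    if hashlib.sha256(data).hexdigest() != expected_sha256:
        problems.append("checksum mismatch")
    return problems


# Hypothetical usage: the publisher ships the digest with the update.
good = b"example rule data"
digest = hashlib.sha256(good).hexdigest()
print(validate_content_file(good, digest))       # no problems
print(validate_content_file(bytes(16), digest))  # zeros and a bad checksum
```

A check like this only catches corrupted or blank files. A well-formed file with bad contents would sail through, which is why it is one side of the triangle rather than the whole fix.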
The market did not think Microsoft would suffer especially adverse effects. The Wall Street Journal might say this was the ‘latest woe for Microsoft’ but their stock on Friday was down less than the Nasdaq. That seems right to me. Yes, Microsoft could and should have prevented this, but ultimately it will not cause people to switch.
The Wall Street Journal also attempts to portray this as a failure of Microsoft to have a ‘closed ecosystem’ the way Apple does (in a limited way on a Mac, presumably, this is not a phone). This, they say, is what happens when you let others actually do things for real on your machine, the horrors. There are a minimum of two ways this is Obvious Nonsense, even if you grant a bunch of other absurd assumptions.
Linux exists.
Microsoft is barred from not giving this access by a 2009 EU consent decree.
Did Microsoft massively screw up by not guarding against this particular failure mode? Oh, absolutely, everyone agrees on that. But they failed (as I understand it, according to essentially everyone) by not having proper safety checks and failure modes, not by failing to deny access.
There was a clear pattern where ‘critical infrastructure’ that is vitally important to keep online like airlines and banks and hospitals went down, while the software companies providing other non-critical services had no such issues.
‘Too important to improve’ (or ‘too vital to allow?’) is remarkably common.
Where you cannot f** around, you cannot find out. And where you cannot do either, it is hard to find good help.
Microsoft Worm: In retrospect it's pretty ~funny how most shitware SaaS companies & social media companies exclusively run Real Software for Grown-Ups while critical infrastructure (airlines, hospitals, etc.) all uses dotcom-era software from comically incompetent zombie firms with 650 PE ratios.
Gallabytes: We used to explain this bifurcation as a function of size but with most of the biggest companies being tech giants now that explanation has been revealed as cope. what's the real cause?
Sarah Constantin: My guess would be it's "do any good software engineers work there or not?" Good software engineers work at both startups and Big Tech cos but I have *one* smart programmer friend who works at a bank, and zero at hospitals, airlines, etc.
Gallabytes: This is downstream I think and not universal - plenty of good programmers in gaming industry but it's still full of this kind of madness. So far the most accurate classifier I've got is actually "does this company run on Windows?"
Scott Leibrand: I think it comes down to whether they hire mostly nerd vs. normie employees.
Illiane: Pretty sure it’s just a result of these tech companies starting out with a « cleaner » blank slate than critical infra that’s been here for decades and relies on mega legacy system which would be very hard and risky to replace. Banks still largely run on COBOL mainframes!
Tech companies at least started out able to find out and hire good help, and built their engineering cultures and software stacks around that. Banks do not have that luxury.
Regulatory Compliance
Why else might we have had this stunning display of incompetence?
Lina Khan, head of the FTC, has no sense of irony.
Lina Khan: All too often these days, a single glitch results in a system-wide outage, affecting industries from healthcare and airlines to banks and auto-dealers. Millions of people and businesses pay the price.
These incidents reveal how concentration can create fragile systems.
Concentrating production can concentrate risk, so that a single natural disaster or disruption has cascading effects.
This fragility has contributed to shortages in areas ranging from IV bags to infant formula.
Another area where we may lack resiliency is cloud computing.
In response to @FTC's inquiry, market participants shared concerns about widespread reliance on a handful of cloud providers, noting that consolidation can create single points of failure.
And we're continuing to collect public comment on serial acquisitions and roll-up strategies across the economy.
If you've encountered an area where a series of deals has consolidated a market, we welcome your input.
Yes. The problem is too much concentration in cloud providers, says Lina Khan. We must Do Something about that. I mean, how could this possibly have happened? That all the major cloud providers went down at the same time over the same software bug?
Must be a lack of regulation.
Except, well, actually, says Mark Atwood.
Mark Atwood: If you are in a regulated industry, you are required to install something like Crowdstrike on all your machines. If you use Crowdstrike, your auditor checks a single line and moves on. If you use anything else, your auditor opens up an expensive new chapter of his book.
The real culprit here is regulatory capture. Notice that everybody getting hit hard by this is in a heavily regulated industry: finance, airlines, healthcare, etc. That's because those regulations include IT security mandates, and Crowdstrike has positioned themselves as the only game in town for compliance. Hence you get this software monoculture prone to everything getting hit at once like this.
Anders Sandberg: A good point. I saw the same in the old FHI-Amlin systemic risk of risk modelling project: regulators inadvertently reduce model diversity, making model-mediated systemic risk grow. "Sure, you can use a model other than RMS, but it will be painful for both of us..."
Ray Taylor: what if you use Mac / Linux?
Anders Sandberg: You will have to use the right operating system to run the industry standard software. Even if it is Windows XP in 2017.
Some disputed this. I checked with Claude Sonnet 3.5. It looks like there are plenty of functional alternative services, and yes they will work, but CrowdStrike does automated compliance reporting and is widely recognized, and this is actually core to their pitch of why companies should use them - to reduce compliance costs.
I also checked with two friends who know about such things. It seems CrowdStrike did plausibly have a superior product to the alternatives, even discounting the regulatory questions.
It was also pointed out that while a lot of installs were to please auditors, a lot of what the auditors were checking for was not formal government regulations, rather it was largely industry standards without legal enforcement, but that you need to do to get contracts, like SOC 2 or ISO 27001.
In the end, is there a functional difference? In some ways, probably not.
So given the increasing number of requirements Claude was able to list off, and the costs of non-compliance, everyone in these ‘critical infrastructure’ businesses ended up turning to the company whose main differentiator, and perhaps to them its main product offering, was ‘regulatory compliance.’
That then set us up with additional single points of failure. It also meant that the company in charge of those failure points had a culture built around checking off boxes on government forms rather than actual computer security or having a security mindset.
You know who did not use CrowdStrike? Almost anyone who did not face this regulatory burden. It was only in 8.5 million Windows machines.
Byrne Hobart: <1% penetration. This Crowdstrike company seems like it's got a nice TAM to go after, just have to make sure they don't do anything to mess it up.
Another nice bit that I presume is a regulatory compliance issue: Rules around passwords and keys are reliably absurd.
Dan Elton: Many enterprises in healthcare use disk encryption like Bitlocker which complicates #CrowdStrike cleanup.
This is what one IT admin reports:
"We can't boot into safe mode because our BitLocker keys are stored inside of a service that we can't login to because our AD is down."
Another says "Most of our comms are down, most execs' laptops are in infinite BSOD boot loops, engineers can't get access to credentials to servers."
Consequences
Would it be better if the disaster were worse, such as what likely happens to a crypto project in this spot? Crypto advocate says yes, Gallabytes points out actually no.
Dystopia Breaker: in crypto, when a project has a large incompetence event (hack, insider compromise, whatever), the project loses all of their money and is dead forever.
In tradtech/bureautech, when a project has a large incompetence event, they do a 'post mortem' and maybe get some nastygrams.
Consider for a moment the incentives that this dynamic creates and the outcomes that arise by dialing out these two incentive gradients into the future.
It's actually worse than 'they get some nastygrams', what usually happens is that regulators (who usually know less than nothing about the technosphere) demand band-aid solutions (surveillance, usually) that increase systemic risk [e.g. CrowdStrike itself].
Gallabytes: And that's a huge downside of crypto!
Most systems will be back to normal by Monday, while in crypto many would be irreversibly broken.
It'd be better still if our institutions learned from this failure but I'm not holding my breath. you basically only see this kind of failure in over regulated oligopolistic markets, so the case for massive deregulation is much clearer than migration to crypto.
As George Carlin famously said, somewhere in the middle, the truth lies. Letting CrowdStrike off the hook because they ‘are the standard’ provides insufficiently strong incentives. Taking everything involved down hard is worse.
Careful With That AI
What about the role of AI?
Andrej Karpathy: What a case study of systemic risk with CrowdStrike outage... that a few bits in the wrong place can brick ~1 billion computers and all the 2nd, 3rd order effects of it. What other single points of instantaneous failure exist in the technosphere and how do we design against it.
Davidad: use LLMs to reimplement all kernel-mode software with formal verification.
How about we use human software engineers to do the rebuild, instead?
It is great that we can use AIs to write code faster, and enable people to skill up. For jobs like ‘rewrite the kernel,’ I am going to go ahead and say I want to stick with the humans. There are many overdetermined reasons.
Patrick Collison (responding to Karpathy): I've always thought that we should run scheduled internet outages.
Andrej Karpathy: National bit flip day.
Indomitable American Soul: Its crazy when you think that this could have all been avoided by testing the release on a single sandbox machine.
Andrej Karpathy: I just feel like this is the particular problem but not the *actual* deeper problem. Any part of the system should be allowed to go *crazy*, randomly or even adversarially, and the rest of it should be robust to that. This is what you want, even if robustness is very often at tension with efficiency.
There are two problems.
This error should not have been able to bring down the system.
This error should never have happened even if it couldn’t crash the system.
Either of these on its own should establish that we have a terrible situation that poses catastrophic risks even without AI, and which AI will make a lot worse, and urgently needs fixing.
Together, they are terrifying.
The obvious failure mode is not malicious. It is exactly what happened this time, except in the future, with AI.
AI accidentally outputs buggy code.
Human does not catch it.
What do you mean ‘unit tests’ and ‘canaries’?
Whoops.
Or the bug is more subtle than this, so we do run the standard tests, and it passes. That happens all the time, it is not usually quite this stupid and obvious.
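For those who do want to know what ‘canaries’ means: the idea is to push an update to a small slice of machines, check they are still alive, and only then widen the rollout. Here is a minimal sketch, assuming hypothetical `apply_update` and `is_healthy` callbacks and made-up stage sizes of 1%, 10% and 100%. It is not a description of any vendor's actual deployment pipeline.

```python
def staged_rollout(hosts, apply_update, is_healthy, stage_percents=(1, 10, 100)):
    """Update hosts in widening stages, halting if a stage looks unhealthy."""
    updated = []
    for percent in stage_percents:
        # Cumulative number of hosts that should be updated after this stage.
        target = max(1, len(hosts) * percent // 100)
        batch = hosts[len(updated):target]
        for host in batch:
            apply_update(host)
            updated.append(host)
        # Stop widening the rollout as soon as any canary fails its check.
        if not all(is_healthy(host) for host in batch):
            return "halted", updated
    return "complete", updated


# Hypothetical usage: an update that crashes every host it touches.
hosts = list(range(1000))
crashed = set()
status, updated = staged_rollout(hosts, crashed.add, lambda h: h not in crashed)
print(status, len(updated))
```

In this toy run the bad update halts after the first stage, so it reaches 10 machines instead of 1,000. The subtle bug described above, the one that passes the health check, still gets through every stage.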
The next failure is that the AI intentionally outputs bugged code, or malicious code, whether or not a human instructed it (explicitly, implicitly or by unfortunate implication) otherwise.
And of course the other failure mode is that the AI, or someone with an AI, intentionally seeks out the attack vector in order to deploy such code.
Shako: A rogue AI could probably brick every computer in the world indefinitely with ongoing zero days to exploit things like we saw today. Probably not too far from the capability either.
Arthur: It won't need zero days, we'll have given it root power globally because it's convenient.
Leo Gao (OpenAI, distinct thread): Thankfully, it's unimaginable that an AGI could ever compromise a large fraction of internet connected computers.
Jeffrey Ladish: Fortunately there are no single points of failure or over reliances on a single service provider with system level access to a large fraction of the computers that run, uh, everything.
Everyone: “Oh no the AGI will be able to discover 0days in every piece of software, we’ll be totally pwned”
AGI: “Why would I need 0days? 🙄”
Where should we worry about concentration? Is this a reason to want everyone to be using different AIs from different providers, instead of the same AI?
That depends on what constitutes the single point of failure (SPOF).
If the SPOF is ‘all the AIs turn rogue or go crazy or shut off at the same time’ then you want AI diversity.
If the SPOF is ‘every distinct frontier AI is itself an SPOF, because if even one of them goes fully off the rails then that is a catastrophe’ then you do not want AI diversity.
These questions can have very different answers for catastrophic or existential risk, versus mundane risk.
For mundane risk, you by default want your systems to fail at different times in distinct ways, but you need to worry about long dependency chains where you are only as strong as the weakest link. So if you are (for example) combining five different AI systems that each are the best at a particular subtask, and cannot easily swap them out in time, then you are vulnerable if any of them go haywire.
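The weakest-link point can be put as rough arithmetic. This is a sketch under two loud assumptions: that failures are independent, and that each system is up 99% of the time, a number made up purely for illustration.

```python
import math


def chain_availability(link_availabilities):
    # Every link must work, so the availabilities multiply.
    return math.prod(link_availabilities)


def redundant_availability(alternative_availabilities):
    # Any one alternative suffices, so only a simultaneous failure of all hurts.
    return 1 - math.prod(1 - a for a in alternative_availabilities)


five_systems = [0.99] * 5  # assumed, illustrative numbers
print(chain_availability(five_systems))      # five systems you all depend on
print(redundant_availability(five_systems))  # five interchangeable systems
```

Five systems chained together come out at roughly 95% availability, worse than any single one of them, while five systems you could freely swap between come out far better than any single one. The same diversity helps or hurts depending on whether you can substitute in time.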
For existential or catastrophic risk, it depends on your threat model.
Any single rogue agent under current conditions, be it human or AI, could potentially have set off the CrowdStrike bug, or a version of it that was far worse. There are doubtless many such cases. So do you think that ‘various good guys with various AIs’ could then defend against that? Would ‘some people defend and some don’t’ be sufficient, or do you need to almost always (or actually always) successfully defend?
I am very skeptical of the ‘good guy with an AI’ proposal, even if such defenses are physically possible (and I am skeptical of that too). Why didn’t a ‘good guy with a test machine or a debugger’ stop the CrowdStrike update? Because even if there was a perfectly viable way to act responsibly, that does not mean we are going to do that if it is trivially inconvenient or is not robustly checked.
Again, yes, if we allow it you are going to give the AI root access and take yourself out of the loop, because not doing so is going to be annoying, and expensive, and you are in competition with people who are willing to do such things. If you don’t, someone else will, and their AIs will end up with the market share and the power.
Indeed, the very fact that these many AIs are allowed to be in this intense competition with each other with rapid iteration will make it all but certain corners will be cut to absurd degrees, especially when it comes to things like collective security.
Another thing that can happen is the one dangerous AI suddenly becomes a lot of dangerous AIs, because it can be copied, or it can scale its resources with similar effect. Or by having many such potentially dangerous AIs, you place authority over it into many hands, and what happens if even one of them chooses to be sufficiently irresponsible or malicious with it?
What about the risk of regulatory capture happening with safety in AI, the way it happened here with mundane computer security and CrowdStrike? What happens if everyone is hiring a company, Acme Safety Compliance (ASC), to handle all their ‘AI safety’ needs, and ASC’s actual product is regulatory compliance?
Well, then we’re in very big trouble. As in dead.
Every time I look at an AI lab’s scaling policy, I say some form of:
If they implement the spirit of a good version of this document, I don’t know if that is good enough, but that would be a big help.
If they implement the letter of even a good version of this document, and game the requirements, then that is worth very little if anything.
If they don’t even implement the letter of it in the breach, it’s totally worthless.
We cannot rely on their word that they will implement even the letter of this.
This is another reason most of the value, right now, is in disclosure and information requirements on the largest frontier models. If you have to tell me what you are doing, then that is not an easy thing to meaningfully ‘capture.’
But yeah, this is going to be tough and a real danger. It always is. And it always needs to be balanced against the alternative options available, and what happens if you do nothing.
It can also be pointed out that this is another logical counter to ‘but you need to tell me exactly what constitutes compliance, and if I technically do that then I should have full safe harbor,’ as many demand for themselves in many contexts. That is a very good way to get exactly what is written down, and no more, to get the letter only and not the spirit. That works if there is a risk that can indeed be taken out of the room by adhering to particular rules. But if the risk is inherent in the system and not so easy to deal with, you cannot make the situation non-risky on one side of a line.
One thing to note is that CrowdStrike was an active menace. It was de facto mandatory that they be given this level of access. If CrowdStrike was (for example) instead a red teaming service that attempted to break into your computers, it would have been much harder (but not, indirectly, impossible) for it to cause this disaster.
Another key insight is that you do not only have to work around things that might go wrong when everyone does their jobs properly, and you face an actually hard problem.
Your solution must also be designed anticipating the stupidest failures.
Because that is what you probably first get.
And saying ‘oh there are like 5 ways someone would take action such that this would obviously not happen’ is a surprisingly weak defense.
Then, later, you also get the failures that happen when the AI is smarter than you.
And again, then, whatever happens, there is a good chance many will say ‘it would have been fine if we hadn’t acted like completely incompetent idiots and followed even a modicum of best practices’ and on this exact set of events they will have been right. But that will also be why that particular set of events happened, rather than something harder to fathom.
Unbanked
Also down were the banks. Anything requiring computer access was stopped cold.
Patrick McKenzie: In “could have come out of a tabletop exercise”, sudden surge by many customers of ATM transactions has them flagging customers as likely being fraud impacted.
Good news: you have an automated loop which allows a customer to recognize a transaction.
Bad news: Turns out that subdomain is running on Windows.
I’m not trying to grind their nose in it. Widespread coordinated outages are terrible and the few things that knock out all the PCs are always going to be nightmares.
I do have to observe that some people who write regulations which effectively mandate a monoculture don’t know what SPOF stands for and our political process is unlikely to put two and two together for them.
Same story at three banks, two GSFIs and one large regional, for anyone wanting a data point. Well I guess I know next week’s Bits about Money topic.
It was only a single point of failure for Windows machines that trusted CrowdStrike. But in a corporate context, that is likely to either be all or none of them.
That created some obvious issues, and offered opportunity for creative solutions.
Patrick McKenzie: Me: *cash*
Tradesman: Wait how did you get that with the banks down?
Me: *explains*
Tradesman: Oh that’s creative.
Me: Nah. Next plan was creative.
Tradesman: What was that?
Me: Going to the church and buying all cash on hand with a check.
Tradesman: What.
Me: I don’t drink.
Tradesman: What.
Me: The traditional business to use in this situation is the local bar, but I don’t drink and so the local bar doesn’t know me, so that’s right out.
Tradesman: What.
Me: Though come to think of it I certainly know someone who knows both me and the bar owner, so I could probably convince them to give me a workweek’s take on a handshake.
Tradesman: This is effed up.
Me: I mean money basically always works like this, in a way.
Called someone who I (accurately) assumed would have sufficient cash on hand and said “I need a favor.”, then he did what I’d do on receiving the same phone call.
Another obvious solution is ‘keep an emergency cash fund around.’ In a world where one’s bank accounts might all get frozen at once, or the banks might go down for a while, it seems sensible to have such a reserve somewhere you can access it in this kind of emergency. You are not giving up much in interest.
This is also a damn good reason to not ban or eliminate physical cash, in general.
I'm coming out of the woodwork to perform an Old Man Yells At Cloud.
I have a turbulent and mostly negative relationship with C++. I am a python programmer by preference, an ML programmer by trade, and I sat in an airport all of Friday evening, and didn't get home until 4am. I have every reason to be angry at Crowdstrike, and I *am* angry at Crowdstrike.
*But what kind of an idiot thinks you can have a fast, clean operating system without a bunch of OS level C and C++?!*
Blaming Crowdstrike for writing OS level code in C++ is a stupid, stupid, stupid take. It's like blaming the highway construction crew for using concrete.
Saying they should have caught their C++ mistake, or used modern tooling to catch it, is still a very stupid take. They're deeply embedded in fucking *Windows* and have been for a while, which means they're fairly battle tested, which means they almost certainly do generally use modern tooling and catch a lot of things, and this just happened to be the thing that slipped through.
Readers, if you are kind-of-a-programmer, or learned-programming-a-long-time-ago, please do not update from this. Zvi is signal boosting a fairly silly idea.
"[T]heir stock was only down 11% on the day..."
That's the most shocking thing about this to me. The only point of the software is to prevent incidents like this! And now it's caused one! Why on earth would anyone keep using it? I'm reminded of the XKCD on the TornadoGuard app (https://xkcd.com/937/).