On the CrowdStrike Incident

Jul 22, 2024

Things went very wrong on Friday.

46 Comments

K Tucker Andersen

Thanks - this is incredibly illuminating for someone like me who understands just enough of this (as a former programmer in machine language and COBOL in the 1960s on IBM 650s and 1620s in my first career) to know that a huge number of CIOs now combine regulatory capture with only a rudimentary first-order understanding of software and hardware.

As this thread points out, such a state of affairs makes the rapid drive towards the adoption of AI incredibly worrisome and dangerous. Certainly this post alone is worth becoming a paid subscriber for, if that helps incentivize you.


K Tucker Andersen

Addendum: it makes me laugh at the obvious incompetence and groupthink/action involved and shake my head in despair at the same time.


I have a friend traveling by air from Florida to San Diego for scheduled brain surgery. He can't get there due to this error - he is still sitting in Atlanta. Multiply this by millions.


Askwho Casts AI

I have been de-facto converting all of Zvi's posts for a while, as they are all super high value posts. Given that, I have now created a dedicated "Don't Worry About the Vase Podcast"

https://dwatvpodcast.substack.com/

Here is this post on this new feed.

https://open.substack.com/pub/dwatvpodcast/p/on-the-crowdstrike-incident

If you find this content valuable, it would be fantastic if you would consider becoming a paid subscriber. Your support would help cover the costs of creating these episodes, including the use of ElevenLabs for high-quality audio conversion.


That's great that you're doing this, and it's something I'll probably use, but asking for paid subscribers when you're just reposting someone else's content in a different form feels...dirty. Did you talk to Zvi before you did this?


Askwho Casts AI

Yes, I've communicated with Zvi about this project.

To be 100% clear, the thing I'm asking for is for those who like having the audio option to help chip in for the cost of its production. ElevenLabs, while it is in my opinion the best voice model for long-form content, is an expensive model to use. I personally really like being able to listen to Zvi's posts: I find them super insightful, and he gathers together in one place a lot of worthwhile takes. I'm hoping that some other people who also find it useful are willing to spread the cost of generation.


I am happy there is a feed. If he wants to ask for people to support him I see zero problem with that, as long as nothing is behind a paywall, same as nothing here is behind a paywall. Providing the feed costs time and money (although obviously not as much as it does to write the posts, and you can support me too if you'd like).


Tavis Ormandy disagrees strongly with the thread you linked at the start: https://x.com/taviso/status/1814762302337654829


Michael Bacarella

Concur. At the minimum, it's clearly not a NULL pointer dereference but *is* an invalid pointer dereference. This doesn't make CrowdStrike look good but the true error is not of the blatant "lol they didn't check for NULL, the most basic of C++ programming mistakes" variety.


Still likely a NULL pointer, you just get a pointer to a structure with one of the fields having offset 0x9c from the start (or an array but that's less likely) and try to get the value of that field. Voila, since "illegal pointer" a.k.a. NULL/nullptr is represented by 0x0000 on most architectures you read 0x0000 + 0x9c = 0x009c.
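
To make the arithmetic in this argument concrete, here is a minimal Python sketch using `ctypes`. The `Record` layout is hypothetical (0x9c bytes of other members placed before the field), chosen only so that the field lands at offset 0x9c. It illustrates the claim being made here, not what the actual driver does.

```python
import ctypes


class Record(ctypes.Structure):
    # Hypothetical layout: 0x9c bytes of other members, then the field being read.
    _fields_ = [
        ("other_members", ctypes.c_ubyte * 0x9C),
        ("value", ctypes.c_uint32),
    ]


def field_address(base: int, field_name: str) -> int:
    """Address that would be accessed when reading `field_name` through `base`."""
    return base + getattr(Record, field_name).offset


NULL = 0
print(hex(field_address(NULL, "value")))  # 0x9c
```

A small faulting address like this is consistent with a field read through a null base pointer, but it does not by itself prove that is what happened.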


Michael Bacarella

Read Tavis's thread for a disassembly explaining why that isn't it.


A couple of typos:

"a minimum of who ways"

"combining give different"


"[T]heir stock was only down 11% on the day..."

That's the most shocking thing about this to me. The only point of the software is to prevent incidents like this! And now it's caused one! Why on earth would anyone keep using it? I'm reminded of the XKCD on the TornadoGuard app (https://xkcd.com/937/).


Michael Bacarella

I too would like to understand why their stock is not $0.

Surely the total of all incoming lawsuits will be for damages greater than their market cap.

Surely their EULA will not save them from all liability, in all jurisdictions.


Edward Scizorhands

Jul 22 (edited)

I'm not saying there won't be lawsuits. Because no one can stop someone from filing a stupid lawsuit.

But the EULA on most software is "if it breaks, you get to keep both pieces."

Without specifically looking I'm confident it has "this software is not suitable for any software where failure could lead to loss of life."

I think there are some good arguments for software liability, but the people who argue for it usually don't know what world they're asking for. (Unless they just want to remove software from the world, which, okay, fair enough.)


Michael Bacarella

Jul 22 (edited)

Even if we can't pin loss of life on them, plenty of businesses simply lost a day of productivity. That's huge.

I think we're going to find out just how much liability EULAs can actually disclaim.


That varies widely by jurisdiction. In the US, I would agree that the EULA would generally protect the company, but I'm not so sure about that in other jurisdictions, particularly the EU.


I'm coming out of the woodwork to perform an Old Man Yells At Cloud.

I have a turbulent and mostly negative relationship with C++. I am a Python programmer by preference, an ML programmer by trade, and I sat in an airport all of Friday evening, and didn't get home until 4am. I have every reason to be angry at Crowdstrike, and I *am* angry at Crowdstrike.

*But what kind of an idiot thinks you can have a fast, clean operating system without a bunch of OS level C and C++?!*

Blaming Crowdstrike for writing OS level code in C++ is a stupid, stupid, stupid take. It's like blaming the highway construction crew for using concrete.

Saying they should have caught their C++ mistake, or used modern tooling to catch it, is still a very stupid take. They're deeply embedded in fucking *Windows* and have been for a while, which means they're fairly battle tested, which means they almost certainly do generally use modern tooling and catch a lot of things, and this just happened to be the thing that slipped through.

Readers, if you are kind-of-a-programmer, or learned-programming-a-long-time-ago, please do not update from this. Zvi is signal boosting a fairly silly idea.


Running a canary on their code seems much more reasonable, unless they did and it somehow didn’t break while running the canary. I hope they publicize why the bug was able to slip through, because we’re just guessing at the process that created this bug.


Michael Bacarella

Jul 22 (edited)

They implied that they bypassed their normal release engineering process because the update they pushed out was a small "content" update.

But yes, without a solid RCA from them we are speculating.


The problem here is the culture of accepting, nay expecting, frequent updates to systems. Frequent updates are cheaper in some legible ways but more expensive in less legible ways, which the people pushing frequent updates don't care about.


Anecdotally, two friends work as a doctor and a lab tech in different hospital systems. One friend's hospital system has a system that delays and batches the Windows updates, and they were fine. The other friend was unable to read charts all day that day.


Ah, good old "I didn't do anything, I just..." strikes again. :-(


Edward Scizorhands

Jul 22 (edited)

The engineer writing this could've made his code able to withstand bad inputs. He probably should have, too. (Maybe he was measured on performance and got rid of safety checks but press X to doubt.) But he was never instructed to do it. There was likely never a design document saying "given fuzzed inputs, our driver-level software will never crash." If management sucks they will attempt to apply this requirement post-hoc, "OF COURSE that was in the design docs, even if not said out loud."
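
As a rough sketch of what "given fuzzed inputs, our software will never crash" could look like as a testable requirement, here is a small Python example. The file format (`parse_channel_file`: a 4-byte count followed by that many 4-byte values) and the `ParseError` type are invented for illustration and are not CrowdStrike's actual format or code. The property being checked is that every random input is either parsed or rejected with the one documented error.

```python
import random
import struct


class ParseError(Exception):
    """The only failure the parser is allowed to signal for malformed input."""


def parse_channel_file(data: bytes) -> list:
    """Parse a made-up format: 4-byte little-endian count, then `count` 4-byte values."""
    if len(data) < 4:
        raise ParseError("truncated header")
    (count,) = struct.unpack_from("<I", data, 0)
    # Check the declared count against the real length before reading anything.
    if len(data) != 4 + 4 * count:
        raise ParseError("length does not match declared count")
    return list(struct.unpack_from(f"<{count}I", data, 4))


def fuzz(iterations: int = 10_000, seed: int = 0) -> int:
    """Feed random bytes to the parser; return how many inputs were rejected cleanly."""
    rng = random.Random(seed)
    rejected = 0
    for _ in range(iterations):
        size = rng.randint(0, 64)
        blob = bytes(rng.getrandbits(8) for _ in range(size))
        try:
            parse_channel_file(blob)
        except ParseError:
            rejected += 1
        # Any other exception type propagates and fails the fuzz run.
    return rejected
```

Calling `fuzz()` in a test suite turns the unwritten expectation into something a release can be blocked on. Random bytes are a weak fuzzer, so a real effort would use a coverage-guided tool, but the stated property is the same.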

If your business is in a place where your stuff can crash millions of computers, your business needs to dedicate actual resources to making sure it doesn't crash all of them. Until today no one ever lost their annual bonus because the code would crash if someone gave it a bad config file.

(I lack the imagination to figure out why they didn't have phased rollouts. I've done these updates and always felt like I was riding a roller-coaster pushing it out to the first several hundred customers all at once. Are they still there? In many ways it's harder to update all 10 million things simultaneously instead of having a hierarchy, but they somehow pulled it off.)
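
For readers unfamiliar with the idea, a phased rollout can be sketched in a few lines of Python. Everything here is an assumption for illustration: the `phased_rollout` function, the stage percentages, and the `deploy` callback (which stands in for "push the update to this host and report whether it is still healthy") are hypothetical and do not describe any vendor's real pipeline.

```python
from typing import Callable, Sequence, Tuple


def phased_rollout(
    hosts: Sequence[str],
    deploy: Callable[[str], bool],
    stage_percents: Sequence[int] = (1, 10, 100),
    max_failure_rate: float = 0.0,
) -> Tuple[bool, int]:
    """Deploy to growing slices of `hosts`, halting if a stage fails too often.

    Returns (completed, hosts_touched).
    """
    touched = 0
    for percent in stage_percents:
        # Cumulative number of hosts that should be covered after this stage (rounded up).
        target = min(len(hosts), (len(hosts) * percent + 99) // 100)
        batch = hosts[touched:target]
        if not batch:
            continue
        failures = sum(1 for host in batch if not deploy(host))
        touched = target
        if failures / len(batch) > max_failure_rate:
            return False, touched  # stop before the next, larger stage
    return touched == len(hosts), touched
```

With 1,000 hosts and an update that fails everywhere, this sketch touches 10 hosts and stops, instead of touching all 1,000.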


Michael Bacarella

> Linux exists.

Microsoft gets a lot of flak for this but if Linux was a popular enough desktop platform that people thought it was profitable enough to target with *desktop* malware, it would be under similar pressures to run a CrowdStrike-like agent.

Linux is not architecturally much different than Windows, and mitigating this malware/attackware vector would take the same form.

In fact, CrowdStrike, which has a Linux offering, has bricked Linux hosts in the past as well.

https://access.redhat.com/solutions/7068083

Personal computers need security models more like smartphones have if we want to do away with CrowdStrike-like endpoint detection and response tools that have ultimately privileged system taps.

Part of what I think the WSJ was driving at (or if not, let me copy/paste and fix what they were driving at) is that Windows risks breaking a lot of backward compatibility if they radically alter their security model, so the changes to their security story have been slow. Apple, to the contrary, has been a lot more shameless and aggressive at doing this on macOS, to the endless frustration and outrage of their users.


> 1. AI accidentally outputs buggy code.

> 2. Human does not catch it.

> 3. What do you mean ‘unit tests’ and ‘canaries’?

> 4. Whoops.

----

I see the better use for AI as asking "what could be wrong with this code?"

I really don't like forced autoupdate on this kind of scale. I turn off automatic updates on everything. Why should I be the gamma test?


The "bad config causes program to crash" problem is not quite a C/C++ problem - in my experience, it happens often enough in every language. Which is why there are normally a bunch of development practices that are designed to catch programs before they crash on bad configs, and bad configs before they cause a lot of production programs to crash. It seems CrowdStrike was pretty lazy on all of these.

And it's also distinct from the supply-chain attack problem - there are a lot of suppliers whose compromise could in turn compromise their customers, and in fact CrowdStrike is one of the *mitigations* for this problem, rather than a cause.

Also distinct from the "CrowdStrike can crash Windows" problem. In order to work, CrowdStrike needs quite low-level system access, and it's unfeasible for Windows to allow so much low-level access without the possibility of a crash. Given that there are many Windows components that can cause a crash, and normally Windows does not crash, I'm not sure Microsoft is that blameworthy.


https://nitter.privacydev.net/taviso/status/1814762302337654829 Tavis Ormandy (a well-known cybersec researcher) points out mistakes in Zach Vorhies' thread - it wasn't a null pointer.

MS pointing to the 2009 EU decree sounds like an obvious deflection.


> Claude’s median estimate is roughly 1,000 people died due to the outage...

It is discouraging to me that even people with sophisticated understanding of modern AI systems, who know enough to know better, still treat a system like Claude as though it were an oracle whom you can ask questions like these and in return get facts you can rely on.


Yeah, I’ve been wondering about this for a while with these posts. Many such cases. It’s never spelled out with how much of a pinch of salt Zvi takes these outputs — I’d assume, in context, “a decent pinch” — but then here he says “where Claude got it wrong” was the stock price estimate (easily checked), with an apparent unspoken assumption that the deaths estimate (almost impossible to check) was somehow right, at least by comparison. What gives?


As in, the other *checkable* predictions it made about the scenario did check out. The deaths number is of course unknowable; I just wanted to get a sanity-check Fermi estimate, and I got one. Seemed not insane, wouldn't take it as anything other than that.


Ah, interesting. In general, I’ve always assumed these AI quotes were more in the realm of “FWIW” sanity checks rather than oracular. But it’s illuminating to know that, at least in this instance, you’re basing the level of confidence on other, verified, predictions.


This suggests that you believe something like this: "If an LLM makes checkable predictions about a scenario that check out, its unknowable answers about the scenario can also be relied upon." This is not the lesson that I have learned from my interactions with LLMs. In my experience, where you leave the realm of the checkable is right about where you enter the realm where LLMs tend to just start making things up.


No, I am not saying that. I am saying that the number seemed like a sane finger-in-the-wind guess to me, and Claude seemed to be understanding the scenario pretty well. Some number is almost always more informative than no number.


> round 3 in 2038

If this guy goes to a Linux company, watch out.


I can't even be sure this was related to the CrowdStrike thing, but my hospital wasn't directly impacted except that we use Cerner (for our electronic medical records), which was completely flatlined for around 3 hours in the middle of the morning. So there's an additional layer of possible frustration: "we were small enough not to fall for this particular honeytrap but were nonetheless relying on infrastructure that might have been". It happened after the outages were already in the news (my boss called me and was asking about it before I'd even heard about it) and maybe an hour or so into my shift, so I can't be sure if it got hit or was just dealing with unusual network load and croaked.


I live next to Russia and very obviously always keep cash between my 'Factfulness'

