Gemenon Technologies
Kevin Luckenbach

The CrowdStrike Outage: What Happened and What Changed

The July 2024 CrowdStrike incident was a faulty update, not a breach. Here is what happened, how it was fixed, and the lessons that outlast it.

On July 19, 2024, a routine update from CrowdStrike took down an estimated 8.5 million Windows machines worldwide, grounding flights, disrupting hospitals, and freezing checkout lines. It is worth being precise about what this was: not a cyberattack or a breach, but a defective software update. The distinction matters, because the lessons are about release engineering as much as security.

What actually happened

CrowdStrike’s Falcon sensor receives frequent “content” updates separate from its main software. One of these, a configuration update, contained a logic error that the sensor could not safely process. On Windows, the Falcon sensor runs at the kernel level, so the fault did not just crash an application: it crashed the operating system, leaving machines stuck in a boot loop on the blue screen of death. Because the update was pushed to all systems at once, the failure was global within minutes.
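To make "could not safely process" concrete, here is a minimal Python sketch of the opposite behavior: checking an update before acting on it and falling back to the last known good version if it is malformed. This is a generic illustration, not CrowdStrike's code. The names Rule, validate_rules, and load_update, and the idea of rules that read numbered input fields, are assumptions made up for the example.

```python
from dataclasses import dataclass


class ContentError(ValueError):
    """Raised when an update does not match what the consumer expects."""


@dataclass(frozen=True)
class Rule:
    # Hypothetical rule format, invented for this sketch.
    field_index: int  # which input field the rule inspects
    pattern: str      # substring the rule looks for


def validate_rules(rules: list[Rule], field_count: int) -> list[Rule]:
    """Reject the whole update if any rule points outside the input fields."""
    for rule in rules:
        if not 0 <= rule.field_index < field_count:
            raise ContentError(
                f"rule reads field {rule.field_index}, "
                f"but only {field_count} fields are supplied"
            )
    return rules


def load_update(candidate: list[Rule], last_known_good: list[Rule],
                field_count: int) -> list[Rule]:
    """Fail closed: keep the previous rules when the new ones are invalid."""
    try:
        return validate_rules(candidate, field_count)
    except ContentError:
        return last_known_good
```

The design choice is that a bad update costs you freshness, not availability: the consumer keeps running on yesterday's rules instead of crashing on today's.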

How it was fixed

There was no clean remote fix, because affected machines could not boot. Remediation was largely manual: boot into safe mode or recovery, delete the offending channel file, and restart. Microsoft published recovery tooling to speed this up, but for many organizations it meant touching machines one by one, which is why recovery stretched on for days.

What changed going forward

In its root cause analysis, CrowdStrike committed to changes aimed squarely at the failure mode:

  • Staggered, canary deployment of content updates instead of pushing to everyone at once.
  • Customer control over how and when content updates roll out, rather than automatic, immediate delivery.
  • Stronger validation and testing of content updates before release, plus additional checks in the sensor itself.
  • Independent code and process reviews.
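The first commitment in the list above, staggered deployment, can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not a description of CrowdStrike's pipeline: staged_rollout, the deploy callback, the stage fractions, and the 1% failure threshold are all hypothetical.

```python
from typing import Callable, Sequence


def staged_rollout(
    hosts: Sequence[str],
    stages: Sequence[float],
    deploy: Callable[[str], bool],
    max_failure_rate: float = 0.01,
) -> tuple[bool, int]:
    """Deploy in widening stages; stop as soon as one stage looks unhealthy.

    stages holds cumulative fractions of the fleet, e.g. (0.01, 0.1, 1.0).
    deploy(host) returns True if the host is healthy after the update.
    Returns (completed, hosts_touched).
    """
    done = 0
    for fraction in stages:
        # Always advance by at least one host, never past the end of the fleet.
        target = max(done + 1, int(len(hosts) * fraction))
        target = min(target, len(hosts))
        batch = hosts[done:target]
        if not batch:
            continue
        failures = sum(1 for host in batch if not deploy(host))
        done = target
        if failures / len(batch) > max_failure_rate:
            return False, done  # halt: later stages never receive the update
    return done == len(hosts), done
```

With a fleet of 1,000 hosts and stages of (0.01, 0.1, 1.0), an update that breaks every machine it touches stops after the first 10 hosts instead of reaching all 1,000.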

The lessons that outlast the incident

You may never run Falcon, but the takeaways are universal:

  • Anything that auto-updates with kernel-level access is critical infrastructure. Treat vendor update mechanisms with the same scrutiny as your own deployments.
  • Phased rollouts and a kill switch are not luxuries. “Deploy everywhere at once” is a single point of failure regardless of who is deploying.
  • A tested recovery runbook for mass endpoint failure belongs in your continuity plan, because the cause may be entirely outside your walls.
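For your own fleet, the phased-rollout and kill-switch lesson might look something like the Python sketch below. It assumes each host has a stable identifier and that you control a flag that can stop installs. ring_for, may_install, and the ring sizes are hypothetical names and values chosen for illustration.

```python
import hashlib


def ring_for(host_id: str,
             ring_sizes: tuple[float, ...] = (0.01, 0.09, 0.90)) -> int:
    """Map a host to a rollout ring; the same host always lands in the same ring."""
    digest = hashlib.sha256(host_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes as a position between 0 and 1.
    position = int.from_bytes(digest[:8], "big") / 2**64
    cumulative = 0.0
    for ring, size in enumerate(ring_sizes):
        cumulative += size
        if position < cumulative:
            return ring
    return len(ring_sizes) - 1  # guard against floating-point rounding


def may_install(host_id: str, open_rings: int, kill_switch: bool) -> bool:
    """A host installs only if its ring is open and the kill switch is off."""
    return not kill_switch and ring_for(host_id) < open_rings
```

Hashing the host ID keeps ring membership stable without a central assignment table, and checking the kill switch on every install decision means that flipping a single flag stops the rollout for every ring at once.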

Resilience is not only about keeping attackers out. It is about surviving the day a trusted vendor ships a bad file.
