CrowdStrike Outage: The Aftermath

Much has already been said about Friday's CrowdStrike update-related outage, and I am not sure I can add any technical wisdom to the ongoing discussion. I do believe, however, that it is important to be aware of what happened and what it means to all of us in terms of incident response and ongoing cybersecurity.

For anyone who has been off the grid the past several days, there was a significant worldwide systems outage last Thursday / Friday (July 18-19, 2024) involving Microsoft Windows operating systems running the CrowdStrike Falcon security agent. This outage affected services and businesses across the Fortune 500, crippling airlines, hospitals, government agencies, and organizations large and small who relied on CrowdStrike Falcon for cybersecurity protection. This outage was widespread because CrowdStrike is a major vendor in the IT cybersecurity sector and their products and services are used across numerous industries and by a majority of the Fortune 500. According to CrowdStrike – “CrowdStrike is the leader in next-generation endpoint protection, threat intelligence and response services. CrowdStrike’s core technology, the Falcon platform, stops breaches by preventing and responding to all types of attacks — both malware and malware-free.”

The specifics of the incident are these: At approximately 4:09AM UTC on 7/19/2024 (11:09PM EST on 7/18/2024), CrowdStrike released a defective content update to their Falcon sensor agent for Microsoft Windows hosts. This defective content update caused bug-check and bluescreen crashes on the affected MS Windows hosts. Issues with the channel file in this update were detected and a fix was deployed by 5:27AM UTC on 7/19/2024 (12:27AM EST on 7/19/2024). These issues were related to human error and are NOT related to any form of cyberattack or malicious system compromise. The following is CrowdStrike’s official statement regarding this incident:

https://www.crowdstrike.com/blog/statement-on-falcon-content-update-for-windows-hosts/
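
For anyone copying these timestamps into an incident log, the conversions can be double-checked with a few lines of Python. This is only a sketch: the variable names are my own, and it assumes fixed UTC offsets (EST as UTC-5, EDT as UTC-4) rather than a time zone database. Keep in mind that US Eastern clocks were on daylight time in July, so the EST figures quoted above are one hour behind the wall-clock time on the East Coast.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets, assumed for illustration (no time zone database needed).
EST = timezone(timedelta(hours=-5), "EST")
EDT = timezone(timedelta(hours=-4), "EDT")

# Times as given in this post, in UTC.
released_utc = datetime(2024, 7, 19, 4, 9, tzinfo=timezone.utc)
fixed_utc = datetime(2024, 7, 19, 5, 27, tzinfo=timezone.utc)

# The same instants expressed with each fixed offset.
released_est = released_utc.astimezone(EST)
released_edt = released_utc.astimezone(EDT)
fixed_est = fixed_utc.astimezone(EST)

# Minutes between the defective release and the deployed fix.
exposure_minutes = int((fixed_utc - released_utc).total_seconds() // 60)

print("Released (EST):", released_est.isoformat())
print("Released (EDT):", released_edt.isoformat())
print("Fix deployed (EST):", fixed_est.isoformat())
print("Minutes between release and fix:", exposure_minutes)
```

The point of printing both offsets is simply to make the date rollover visible: the release falls late on July 18 in EST but just after midnight on July 19 in EDT, which is why both dates appear in write-ups of this incident.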

As of today, Tuesday, July 23, 2024, most systems have been recovered and/or are online, though some industries including air travel are still recovering from the prolonged outage-related delays.

There has been some debate in the IT security community as to whether or not the CrowdStrike outage should be considered an IT security incident. Many have argued that this was not a malicious act and that there is no evidence of system or data compromise, and as such, this should not be considered or documented as an IT security incident. I disagree and believe it should be documented as an incident for several reasons.

The CIA Triad has been used as a standard model for cybersecurity controls for many years. It stands for the concepts of Confidentiality, Integrity, and Availability, and these principles should be our goals as IT security professionals. We are tasked with ensuring the confidentiality of data, the integrity and validity of that data, and the timely availability of that data and the systems upon which it resides. The CrowdStrike outage is clearly a situation where system availability was severely compromised for a multitude of systems and resources worldwide. Services could not be rendered, records could not be accessed, and basic industry functions ground to a halt due to this outage.

Also, as part of the remediation process as provided by CrowdStrike, the CrowdStrike Falcon agent was disabled on numerous systems for a period of time, rendering those systems temporarily less secure and vulnerable to potential compromise. In any situation where cybersecurity controls are weakened or removed, the potential for malicious activity against systems and related data rises and the possibility for the loss of confidentiality and integrity increases.

Given these realities, I believe anyone in an organization directly or indirectly affected by this outage should treat it as an IT security incident and react accordingly, including documenting what happened and how your organization responded. It is important to note that this does not only apply to those organizations running CrowdStrike Falcon on internal systems. Most of us had a third-party vendor or cloud hosted solution affected by this outage and, therefore, should identify as a victim of this outage.

Many organizations, as part of ongoing incident response procedures, require documentation of incidents, both internal and industry related. For those of you who fall into this category, I want to provide a few cheat notes for your documentation process:

Date of Incident:

July 18-19, 2024

Incident Title:

CrowdStrike Failed Content Update – Microsoft Operating System Outage

Incident Type:

Industry Level and/or Internal Systems – Unexpected outage or interruption

Threat to the Organization:

In-house and 3rd party systems availability | Possible loss of security controls

Description of the Incident:

At approximately 4:09AM UTC on July 19, 2024 (11:09PM EST on July 18, 2024), CrowdStrike released a defective content update to their Falcon sensor agent for Microsoft Windows hosts. This defective content update caused bug-check and bluescreen crashes on the affected MS Windows hosts. Issues with the channel file in this update were detected and a fix was deployed by 5:27AM UTC (12:27AM EST) on July 19, 2024. These issues were related to human error and are NOT related to any form of cyberattack or malicious system compromise. The following is CrowdStrike’s official statement regarding this incident:

https://www.crowdstrike.com/blog/statement-on-falcon-content-update-for-windows-hosts/

How Did These Threats Affect the Organization’s IT Environment?:

If your organization used CrowdStrike Falcon and any systems were offline or rendered unusable for a period of time, note those issues here.

Also note any 3rd party systems or resources, including any cloud hosted services, that were offline or otherwise unavailable to your organization, and the timeframe during which they were offline.

Actions Taken to Protect the Organization:

Note any actions taken by your organization including any incident monitoring processes and any additional controls implemented or workaround processes invoked.
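
If your organization tracks incidents in a script or an export rather than a document, the same cheat notes map onto a simple record. The sketch below is one hypothetical way to do that in Python. The class name, field names, and render helper are my own illustration and not a standard schema, and the placeholder entries are meant to be replaced by each organization.

```python
from dataclasses import dataclass, field


@dataclass
class IncidentRecord:
    """Hypothetical record; fields mirror the headings in the notes above."""
    date_of_incident: str
    title: str
    incident_type: str
    threat: str
    description: str
    effects: list[str] = field(default_factory=list)        # filled in per organization
    actions_taken: list[str] = field(default_factory=list)  # filled in per organization

    def render(self) -> str:
        """Return the record as plain text, one heading per section."""
        def bullets(items: list[str]) -> str:
            # An empty list is flagged so the gap is visible in a review.
            return "\n".join(f"- {item}" for item in items) or "(none recorded)"

        sections = [
            ("Date of Incident", self.date_of_incident),
            ("Incident Title", self.title),
            ("Incident Type", self.incident_type),
            ("Threat to the Organization", self.threat),
            ("Description of the Incident", self.description),
            ("How Did These Threats Affect the Organization's IT Environment?",
             bullets(self.effects)),
            ("Actions Taken to Protect the Organization",
             bullets(self.actions_taken)),
        ]
        return "\n\n".join(f"{heading}:\n{body}" for heading, body in sections)


record = IncidentRecord(
    date_of_incident="July 18-19, 2024",
    title="CrowdStrike Failed Content Update – Microsoft Operating System Outage",
    incident_type="Industry Level and/or Internal Systems – Unexpected outage or interruption",
    threat="In-house and 3rd party systems availability | Possible loss of security controls",
    description=(
        "CrowdStrike released a defective content update to their Falcon sensor "
        "agent for Microsoft Windows hosts, causing bug-check and bluescreen crashes."
    ),
    # Placeholder only: replace with your own affected systems and timeframes.
    effects=["Example placeholder: list affected systems and how long they were offline"],
)

report = record.render()
print(report)
```

Leaving actions_taken empty in the example is deliberate: the rendered report shows "(none recorded)" under that heading, which makes an unfinished section easy to spot.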

This outage was scary and concerning on a variety of levels, and one of the common themes discussed since Friday is the implication of an outage when so many industries and organizations are reliant on a single product for cybersecurity and ongoing functionality. This is the great fear of having all of one’s eggs in a single basket, and this situation should bring to the forefront discussions around vendor diversity and redundant controls. This is why we document and review incidents – we need to learn from them and get stronger as we move forward. Start this conversation now in your organization. Learn from the pains of Friday, and work hard to get stronger and safer. Good luck!