Zack Garron, Software Quality Analyst III
You have probably heard of the CrowdStrike incident on July 19th. It affected computers around the world, at organizations ranging from retail stores to emergency services. I had CPR training scheduled that day, and the Oak Ridge Fire Department was unable to access their PowerPoint presentation! This has been called the largest IT outage in history and was mentioned on local and national news for weeks. How did this make it to the end customer?
CrowdStrike’s Falcon platform is cloud-based cybersecurity software used to protect its customers’ endpoints (primarily workstations and phones). It is used by thousands of companies, including Google, Intel, and Amazon. Stopping malicious attacks depends on speed: cybersecurity is a constantly evolving industry, and the Falcon agent has to be updated quickly to stay ahead of hackers.
CrowdStrike updates the Falcon platform in two ways. The first is what I think of as software development: code updates that go through levels of testing and deployment similar to G2’s code. The second is “Rapid Response Content,” which is delivered through a proprietary binary file. The issue on July 19th came from a rapid response content update that was pushed straight to production. The only testing it received was an “automated” check by a content validator. The validator itself had a bug, which allowed the flawed update through and sent Windows machines to the blue screen of death.
In its Root Cause Analysis, CrowdStrike reported six findings and outlined mitigations to prevent an issue like this in the future. To summarize, the new content defined twenty-one input fields, but the sensor code that processed it supplied only twenty values. The validator did not catch the mismatch, so when the update referenced the twenty-first field, the sensor read past the end of its input data. This out-of-bounds memory read caused systems to crash.
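To make the mismatch concrete, here is a minimal Python sketch of the two checks that would catch it: a field-count check before release and a bounds check at read time. This is an illustration only, not CrowdStrike’s code; the names EXPECTED_FIELDS, validate_inputs, and read_field are hypothetical. Python already refuses to read past the end of a list, while a lower-level language can silently read whatever memory happens to be there, which is why the explicit check matters.

```python
EXPECTED_FIELDS = 21  # hypothetical: number of fields the content refers to


def validate_inputs(values, expected=EXPECTED_FIELDS):
    """Return True only if every field the content expects was supplied."""
    return len(values) == expected


def read_field(values, index):
    """Bounds-checked read: fail loudly instead of reading past the end."""
    if not 0 <= index < len(values):
        raise IndexError(
            f"field {index} requested but only {len(values)} values supplied"
        )
    return values[index]


twenty = ["value"] * 20      # what the processing code supplied
twenty_one = ["value"] * 21  # what the content expected

print(validate_inputs(twenty))      # False: the mismatch is caught up front
print(validate_inputs(twenty_one))  # True
```

Asking read_field for index 20 of the twenty-item list raises an error rather than returning garbage, so the failure is caught in testing instead of on a customer’s machine.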
Going forward, CrowdStrike plans to change how rapid response content is handled from start to finish. They have already updated the automated tests to ensure all twenty-one fields are accounted for. They are also introducing staged deployment and canary testing, along with increased customer control over when rapid response content is delivered.
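Staged deployment with a canary group can be sketched in a few lines of Python. This is a simplified illustration under my own assumptions, not CrowdStrike’s process: the function staged_rollout, the stage sizes (1%, 10%, 100%), and the deploy and is_healthy callbacks are all hypothetical.

```python
def staged_rollout(hosts, deploy, is_healthy, stages=(0.01, 0.10, 1.0)):
    """Deploy to growing fractions of hosts, halting after the first
    stage in which any newly updated host is unhealthy.

    Returns (updated_hosts, completed).
    """
    updated = []
    for fraction in stages:
        # Total number of hosts that should be updated by the end of this stage.
        target = max(1, int(len(hosts) * fraction))
        batch = hosts[len(updated):target]
        for host in batch:
            deploy(host)
            updated.append(host)
        # Check the hosts that were just updated before widening the rollout.
        if not all(is_healthy(host) for host in batch):
            return updated, False
    return updated, True


hosts = [f"host-{i}" for i in range(100)]

# Simulate a bad update: every host that receives it crashes.
crashed = set()
updated, completed = staged_rollout(
    hosts,
    deploy=crashed.add,
    is_healthy=lambda host: host not in crashed,
)
print(len(updated), completed)
```

With a bad update, only the single canary host out of one hundred is affected before the rollout stops. The trade-off is time: each stage adds a pause for health checks, which is the cost of not pushing straight to production.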
Perhaps the easiest explanation for what happened here is a focus on speed over everything else, including quality control. The lax process that rapid response content went through existed partly to fulfill the “rapid” part. Rushing to meet deadlines and address new threats created an issue of its own.