The global breakdown of services from a bad software update reminds us of the importance of emergency preparedness
The Windows 10 and Windows 11 Blue Screen of Death. (Photo: Wikimedia Commons)
The CrowdStrike tech outage of 19 July 2024, which was caused by a buggy software update, has shown the world there’s no substitute for being ready for the worst – especially since the underlying risks remain.
The incident has become the largest IT outage in history, affecting 8.5 million laptops, PCs, and cloud-based servers running the Microsoft Windows operating system. While this is less than 1% of all Windows machines, many of the affected devices were used by enterprises or government agencies, so their crashes disrupted major financial, medical, transportation, and municipal services worldwide.
The difficulty of recovering the affected machines contributed to the severity of the outage. Each affected computer had to be manually rebooted into either Safe Mode or the Windows Recovery Environment to allow the offending update file to be deleted.
If the affected computer was protected with the Windows BitLocker encryption system – as most enterprise and government machines are – that computer’s unique 48-digit BitLocker recovery key would first need to be entered by hand before the machine could be booted into Safe Mode or the Windows Recovery Environment. Imagine being the IT officer who had to do this manually for tens or hundreds of their organisation’s affected computers!
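For readers who want to see the logic of the "delete the offending update file" step, here is a minimal Python sketch. It is an illustration only: in practice, the deletion was done by hand from a recovery command prompt, where Python would not normally be available. The function names are hypothetical, and the folder and file pattern in the comments come from publicly reported remediation guidance rather than from this article, so treat them as assumptions.

```python
from pathlib import Path


def find_offending_files(directory: Path, pattern: str) -> list[Path]:
    """Return files in `directory` whose names match the glob `pattern`, sorted by name."""
    if not directory.is_dir():
        return []
    return sorted(p for p in directory.glob(pattern) if p.is_file())


def remove_offending_files(directory: Path, pattern: str, dry_run: bool = True) -> list[Path]:
    """Delete matching files and return them.

    With dry_run=True (the default), nothing is deleted; the matches are
    only reported so that an operator can review them first.
    """
    matches = find_offending_files(directory, pattern)
    if not dry_run:
        for path in matches:
            path.unlink()
    return matches


# Assumed values, based on publicly reported guidance (not stated in this article):
#   folder  = Path(r"C:\Windows\System32\drivers\CrowdStrike")
#   pattern = "C-00000291*.sys"
# Example (dry run): remove_offending_files(folder, pattern)
```

Defaulting to a dry run reflects the same caution an IT officer would exercise at the keyboard: check what will be deleted before deleting it.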
Here in Singapore, while the CrowdStrike incident impacted Changi Airport’s check-in operations, Singapore Post’s mail services, dozens of public car parks, as well as the Lianhe Zaobao and The Straits Times, essential and government services remained mostly unaffected.
While this may appear to be good news, Mrs Josephine Teo, Minister for Digital Development and Information and Second Minister for Home Affairs, offered the following words of caution: “While we were less impacted, it will be unwise to think that we are more resilient than others.”
Her caution was warranted because the vulnerability which allowed the CrowdStrike incident to happen remains present in our computers and IT systems.
Almost all cybersecurity solutions today, including antivirus programs, require live updates of security and antivirus definitions. These are essential for protecting our computers against newly identified viruses and other malware.
Unfortunately, as the CrowdStrike incident has shown, a bad update – either from a bug which was not spotted during quality checks, or one which was planted by malicious hackers – has the potential to cause a global breakdown of services.
As Mrs Teo advised, “We have said before that even with best efforts to prevent them, such incidents are bound to occur. When they do, it is critical that we have the ability to recover quickly.”
Business continuity planning was one method of emergency preparedness which turned out to be crucial for recovering from the CrowdStrike incident. Such plans give detailed instructions for organisations to maintain their essential functions during major disruptions.
Here in Singapore, “some of the services that were disrupted, such as postal services, recovered relatively quickly as business continuity plans (BCP) were activated,” shared Mrs Teo.
How do we know if our BCPs will work? If we wait for a crisis to happen, it may be too late to fix a gap or problem that suddenly appears in a BCP. This is why regular testing is important. In a follow-up post, Mrs Teo shared that the Singapore government conducts annual tabletop exercises to stress-test and improve its BCPs:
“During each exercise, we ensure our technology is up-to-date and resilient against outages. We practise our incident responses and BCPs, so that we know what to do and who to contact during crises. Our people demonstrate their dedication and hone their knowledge and capabilities to respond under stress.”
Here at HTX, BCPs and tabletop exercises are very important methods for us to ensure that we can continue empowering the Home Team’s frontline in times of crisis. We will share more about this in a future article. Stay tuned for it!