System Outage Grounds Delta Flights Worldwide
A Tale of Failed Resiliency
Delta grounded all departing flights on August 8 after a power failure near its Atlanta headquarters triggered a major system outage, leaving airport departure areas and customers' travel plans in chaos. But as airlines increasingly computerize processes, experts ask why this airline wasn't better prepared to deal with such an outage.
"A power outage in Atlanta, which began at approximately 2:30 a.m. ET, has impacted Delta computer systems and operations worldwide, resulting in flight delays," Delta says in a statement. "Large-scale cancellations are expected today. All flights en route are operating normally."
Atlanta-based Delta, which is the second-largest U.S. airline based on miles flown, operates 5,000 flights per day. The airline has warned that the outages and disruptions could extend to August 12, and said all affected customers would receive refunds or be allowed to alter their tickets for free, for a limited time.
"That's some outage," says Alan Woodward, a computer science professor at the University of Surrey and a cybersecurity adviser to Europol - the EU's law enforcement intelligence agency.
"I think airlines have had the same exposure to system failure for some time now. What I think we're seeing is a failure to learn from history and allowing single points of failure in the system," Woodward tells me. "Whilst safety critical systems have failure modes analyzed, operational systems clearly are not undergoing the same degree of analysis. The result is not fatal, but nevertheless its impact can be enormous."
In this era of cloud-based computing, such outages should be almost entirely preventable. To prevent them, however, organizations must pay attention to the basics, including building in redundancy and employing uninterruptible power supplies. "For a power failure to cause this level of disruption in the 21st century is very surprising," Woodward says.
@ruskin147 after a 3rd major power outage in as many weeks causes chaos, why have forgotten the basics: UPS & no single point of failure.
Damage-Control Mode
Delta has been left apologizing for the outage and attempting to alert customers. "Customers should check the status of their flight before heading to the airport while the issue is being addressed," an August 8 tweet from Delta warns.
Somewhat maddeningly, however, Delta is also warning that flight-status systems - when they can be reached - have been relaying incorrect information. "We are aware that flight status systems, including airport screens, are incorrectly showing flights on time," it said. "We apologize to customers who are affected by this issue, and our teams are working to resolve the problem as quickly as possible."
By 8:40 a.m. ET on August 8, Delta said that some flights had been allowed to depart. Early indications suggest that Delta was able to get flights airborne after reverting to manual processes.
"Delta flights from Heathrow are experiencing delays due to the worldwide technical issue with their computer systems," a Heathrow airport spokesman tells me. "Check-in is currently operating using a backup system and airport staff are on hand to assist any passengers that are impacted by the delays. Passengers should check with the airline for updates on their flights."
1 hr.+ lines @HeathrowAirport for @Delta due to system outage #Heathrow #oldschool manual ticketing @DeltaAssist pic.twitter.com/syC0VwCBDD
Even so, Delta noted that "cancellations and delays continue," and that its website and Fly Delta mobile app, as well as the information being provided to its phone-based customer service employees, might still be inaccurate.
Follows Southwest Airlines Outage
Delta's outage follows a full "ground stop" that Southwest Airlines issued on July 20 after a one-hour "system outage," which occurred when a router failed and a backup failed to kick in, The Dallas Morning News reported. Ultimately, the outage led to five days of disruptions, with the airline canceling 2,300 flights, or about 11 percent of all flights that it operated in that time frame.
Industry analysts say that airlines are increasingly digitizing processes - such as allowing boarding passes to be downloaded to mobile devices - as well as adding new technological capabilities to airplanes, such as in-flight Wi-Fi, as the Guardian reports. But without proper planning - and resiliency - computerizing more systems and processes leaves them vulnerable, both to inadvertent outages and to anyone with malicious intent (see Hack Attack Grounds Airplanes).
Outstanding Investment, Design Questions
Power failures that lead to widespread system outages are not unique to the aviation sector. Notably, on the same day as the Southwest outage, a power loss at Telecity Harbor Exchange in London's Docklands, which houses equipment for the London Internet Exchange (LINX), also led to widespread outages, disrupting about 20 percent of all traffic that was being routed via LINX. Many customers of major U.K. internet service provider BT were reportedly also affected.
But such situations can be prevented, not least by ISPs and airlines. "With sufficient investment and suitable design this need not happen," Woodward says. "That it does occur suggests that one, the other or both are missing."