The recent problems at Southwest Airlines are a good example of a metastable failure at scale in the physical world:
TRIGGERS: Capacity-reducing triggers (reduced staff capacity due to sickness; snow storms in Denver, Chicago, and the rest of the country).
AMPLIFICATION: Capacity degradation amplification caused by a combination of factors such as:
-- the point-to-point business model meant crews were not in the right places,
-- the scheduling software broke down, resulting in manual matching of flights to crews (can’t even imagine how tedious this would have been… kudos to the manual schedulers),
-- crews were not able to communicate with the airline (!) because the phone systems were down, likely due to a metastable failure of the phone system itself, caused by overload from customers trying to reach the airline for rescheduling.
So, even if a flight had been matched to a crew, the crew might not have been aware of that assignment! And so, even as “system capacity” (airports, flights, crews) started becoming available, it couldn’t be used effectively…
MITIGATION: As with many metastable failures, the mitigation was load shedding: they temporarily reduced the number of flights to one third of the usual number…
Looks like the airline was running the system in an extremely vulnerable state (optimizing for fast turnarounds to improve efficiency, and packing the schedule without any headroom to handle overloads caused by capacity degradation).
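To make the trigger / amplification / mitigation pattern concrete, here is a toy simulation in Python. It is only a sketch under made-up assumptions: the function `simulate`, the capacity of 100, the loads of 90 and 70, and the backlog penalty are all hypothetical numbers chosen for illustration, not Southwest data. The model has a temporary capacity drop (the trigger), a backlog that eats into effective capacity (the amplification), and an optional cut of scheduled load to one third until the backlog clears (load shedding).

```python
def simulate(load, steps=60, nominal_capacity=100.0,
             trigger=range(5, 9), degraded_capacity=60.0,
             shed_from=None, shed_load=None):
    """Return the backlog after each step of a toy overload model.

    All parameters are illustrative assumptions, not real airline data.
    """
    backlog = 0.0
    history = []
    for t in range(steps):
        # TRIGGER: capacity drops for a few steps (sickness, storms).
        capacity = degraded_capacity if t in trigger else nominal_capacity

        # AMPLIFICATION: the backlog itself consumes capacity (manual
        # rescheduling, jammed phone lines), up to half of what is available.
        penalty = min(0.5, backlog / 1000.0)
        effective = capacity * (1.0 - penalty)

        # MITIGATION: once shedding starts, offer reduced load while any
        # backlog remains, then return to the normal schedule.
        shedding = shed_from is not None and t >= shed_from and backlog > 0
        offered = shed_load if shedding else load

        demand = offered + backlog
        served = min(demand, effective)
        backlog = demand - served
        history.append(backlog)
    return history


if __name__ == "__main__":
    stuck = simulate(load=90)                       # packed schedule
    roomy = simulate(load=70)                       # schedule with headroom
    shed = simulate(load=90, shed_from=20, shed_load=30)  # shed to 1/3

    print("no headroom, final backlog:  ", round(stuck[-1], 1))
    print("with headroom, final backlog:", round(roomy[-1], 1))
    print("load shedding, final backlog:", round(shed[-1], 1))
```

In this toy model the packed schedule keeps accumulating backlog long after capacity is back to normal, the schedule with headroom absorbs the same trigger and drains to zero on its own, and the packed schedule recovers only once load is shed. The exact numbers mean nothing; the point is that the failure state sustains itself after the trigger is gone.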
Hope they do a thorough incident analysis using the metastable failure framework and make improvements…
References:
https://www.cnn.com/2022/12/27/business/southwest-airlines-service-meltdown/index.html
https://www.cnn.com/2022/12/29/business/southwest-airlines-service-meltdown/index.html