How a Single Uncaught SQLException Grounded a Multibillion-Dollar Airline
In January 2023, the U.S. Federal Aviation Administration (FAA) announced new details on the cause of the Notice to Air Missions (NOTAM) system outage, which had delayed or cancelled more than 8,400 flights earlier that month.
The FAA announced that a contractor “deleted files while working to correct synchronization between the live primary database and a backup database.”
This reminded me of a case study I had read recently in the book “Release It! Design and Deploy Production-Ready Software” by Michael Nygard.
Interestingly, that case study also involved an airline and a database.
In the section "Case Study: The Exception That Grounded an Airline", the book describes how a tiny programming error can start a snowball rolling downhill.
The post-mortem analysis of a major outage at an airline found that the root cause was a single uncaught SQLException in the code of a session bean.
The incident happened after a routine database failover and maintenance, and it caused all check-in kiosks and IVR servers to stop servicing requests at the same time.
Through investigating the thread dumps, log files, and configurations of the servers, it was determined that the problem was caused by a resource leak in the connection pool of the application server.
The leak was caused by a failure to handle the SQLException thrown when closing a JDBC statement: the exception skipped the code that would have closed the connection, so the connection was never returned to the pool. Once the pool was exhausted, every later call to connectionPool.getConnection() blocked.
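To make the failure mode concrete, here is a minimal Java sketch of how an exception from closing a statement can leak a connection. It is not the airline's actual code: FakeConnection and FakeStatement are hypothetical stand-ins for pooled JDBC objects, and the assumption that close() throws after a failover is simulated rather than reproduced against a real database. The second method shows one common remedy, try-with-resources, which closes every resource even when an earlier close() throws.

```java
import java.sql.SQLException;

class ConnectionLeakDemo {

    // Hypothetical stand-in for a pooled connection: close() hands it back to the pool.
    static class FakeConnection implements AutoCloseable {
        boolean returnedToPool = false;

        @Override
        public void close() {
            returnedToPool = true;
        }
    }

    // Hypothetical stand-in for a statement whose close() fails (simulated failover damage).
    static class FakeStatement implements AutoCloseable {
        @Override
        public void close() throws SQLException {
            throw new SQLException("simulated failure while closing statement");
        }
    }

    // Leaky pattern: both close() calls share one finally block.
    static boolean leakyLookup() {
        FakeConnection conn = new FakeConnection();
        FakeStatement stmt = new FakeStatement();
        try {
            try {
                // ... the query would run here ...
            } finally {
                stmt.close(); // throws SQLException
                conn.close(); // skipped, so the connection never returns to the pool
            }
        } catch (SQLException e) {
            // The caller sees an error, but the connection is already lost.
        }
        return conn.returnedToPool;
    }

    // Safer pattern: try-with-resources closes each resource independently, in reverse order.
    static boolean safeLookup() {
        FakeConnection conn = new FakeConnection();
        try {
            try (FakeConnection c = conn; FakeStatement stmt = new FakeStatement()) {
                // ... the query would run here ...
            }
        } catch (SQLException e) {
            // stmt.close() still failed, but c.close() ran anyway.
        }
        return conn.returnedToPool;
    }

    public static void main(String[] args) {
        System.out.println("leaky version returned connection: " + leakyLookup());
        System.out.println("safe version returned connection: " + safeLookup());
    }
}
```

Running main prints false for the leaky version and true for the safe one. In a real pool with a fixed size, each false is one connection gone for good, which is how a handful of failed requests can eventually block every caller.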
This incident serves as a reminder of the importance of proper handling of exceptions in code, and the potential consequences of a seemingly small oversight.