Software maintenance at the root of CBP’s January outage, report finds
A Department of Homeland Security inspector general’s report examining a four-hour outage of Customs and Border Protection’s information technology system in January found that the agency had insufficient software testing and maintenance.
The problem ultimately had to be corrected through legacy system servers, according to the report on the Jan. 2 outage of CBP’s Advanced Passenger Information System, which processes passengers entering and leaving the U.S. using information from the agency’s TECS databases.
The nationwide system crash caused significant flight delays, with the Office of the Inspector General reporting that 13,000 travelers were hindered by the outage at Miami International Airport alone.
CBP’s TECS system had been operating on a modernized server environment that DHS began putting in place in 2008 to replace its legacy mainframe environment, but the newer servers failed to respond following CBP’s efforts to restart them.
At 8:22 p.m., more than three hours into the outage, CBP’s assistant commissioner of the Office of Information Technology, Phillip A. Landfried, directed staff to shift processing to the legacy mainframe, which brought the systems back online at airports less than two hours later, the report said.
The report noted that CBP’s investigation found the highest volume of passenger queries in a month had occurred on Jan. 2, contributing to a degradation of processing that preceded the outage.
However, that was coupled with a programming change that occurred when CBP shifted passenger query processes from the legacy mainframe to the modernized server environment between April and November 2016.
“Further review of application logs by CBP staff and contractors determined that an error handling routine in the TECS Modernization server environment was a direct cause of the outage,” the report said.
“Specifically, when processing a large number of queries, this error handling routine sometimes did not terminate processing as designed. As such, computer processing resources used by the routine remained unavailable to support other software applications.”
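The report does not publish the routine's code, so the following is only an illustrative Python sketch of the general failure mode it describes and one common guard against it. Every name here (`SlotPool`, `QueryFailed`, `handle_query`, `max_retries`) is hypothetical and not taken from TECS. The idea is that an error path should be bounded and should always give back the processing slot it holds, even when every attempt fails.

```python
class QueryFailed(Exception):
    """Hypothetical error raised when a backend query cannot complete."""


class SlotPool:
    """Hypothetical fixed pool of processing slots shared by all queries."""

    def __init__(self, slots):
        self.free = slots

    def acquire(self):
        # Refuse new work instead of waiting when no slot is free.
        if self.free == 0:
            return False
        self.free -= 1
        return True

    def release(self):
        self.free += 1


def handle_query(pool, run_query, max_retries=3):
    """Run one query with a bounded error path that always frees its slot."""
    if not pool.acquire():
        return "rejected"
    try:
        for _ in range(max_retries):
            try:
                return run_query()
            except QueryFailed:
                continue  # retry, but never more than max_retries times
        return "failed"
    finally:
        # Runs on success, failure, or exhaustion of retries.
        pool.release()
```

In this sketch, a handler that retried without a limit, or returned without releasing its slot, would leave `pool.free` stuck below its starting value under heavy load, which is the kind of resource starvation the report describes.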
The agency corrected the problem by adjusting the memory and processing resources allocated to the routine and adding new code to address error reporting, but the OIG said a lack of software testing, performance monitoring and updated patches could cause another outage.
The report also said that CBP failed to implement failover procedures in time to correct the outage and that alternative processing sites lacked adequate disaster recovery capabilities once it occurred.
At the time of the outage, the alternative recovery site for the TECS modernization server environment was not yet operational, with CBP staff in the process of configuring and testing it.
The OIG offered five recommendations:
- Ensure that the TECS test environment is sufficiently similar to the TECS production environment to simulate large query volume errors.
- Ensure OIT staff have timely notification of critical vulnerabilities to operating systems.
- Adjust Technology Operations Center alert criteria for earlier notification of system slowdowns and outages.
- Establish policy to implement recovery operations within an hour of an outage.
- Provide the DHS Chief Information Officer with a weekly status of CBP’s planned and actual modernization migration schedule and milestones detailing when the legacy mainframe environment is no longer needed, and the recovery site is fully functional.
CBP officials concurred with four out of the five recommendations and said they were in the process of implementing them.
They didn’t concur with the OIG’s recommendation on vulnerability notification, saying that software maintenance wasn’t a contributing factor to the outage and that the agency’s vulnerability management was adequate.
The OIG responded that the recommendation was meant to increase vulnerability communication with CBP’s software testing team and said it considers the matter unresolved.