Keep Traffic Moving: Lessons from the TSA.
What is Operations Assurance?
Picture an airport. TSA has two important jobs.
The most visible job of TSA is to detect and block questionable items from entering a secured area, so people can connect and conduct business with fewer worries about safety. Those of us who fly frequently recognize that this is important, so we grudgingly remove our laptops, shoes, and water bottles.
In contrast, the most fundamental job of TSA is to keep the line moving so that people can get to their flights. The main complaint that most of us have about airport security is that it takes too long to get through. That delay ties into the already stressful situation of making it onto a plane on time, which provides an anchor point for our anxiety and frustration, and then escalates to criticism of the whole process.
Where Indeni fits
Indeni knows that security infrastructure is similar in airports and in networks: keep business running by blocking bad stuff from getting into a trusted environment, but do it without becoming the thing that impedes business. Any time someone can’t get to their favorite website, or when an operations team is having network difficulty with a server, the first thought is to blame the firewall. This is the case whether or not there is even a firewall between the user and the service.
When people complain that security is too invasive, or that it violates privacy, the complaint is often brought to mind only by the inconvenience. Pain breeds pain: people will ignore true problems until something small interrupts their intended activity, at which point the backlog of issues becomes front of mind. The complicating factor is that the complaints may be valid: the big, blue, box-shaped backscatter machines really did reveal intimate physical details, and rebinding SSL connections to weaker encryption really does have the potential to expose personal information. However, even when these issues are remediated (backscatter machines have been pulled from use), passengers and users will remember the previous affronts rather than the quick response.
Awareness of the wrong issue – a focus on the delay – leads to distracting discussions of the wrong solutions. Is the firewall policy too restrictive? Should we disable content inspection? These aren’t bad questions, and they might even provide a moderate speed improvement when everything is working, but they don’t address the big problems. The real impact comes from the times when something goes wrong, and the network grinds to a halt. The firewall still gets the blame, plus the newly added ire that “they said they would fix it, but obviously they didn’t.”
In a TSA checkpoint, there’s a lot of complexity that is mostly hidden from the people moving through, including two types of body scanners, dual-wavelength differential X-ray baggage analyzers, chemical sniffers, and personnel trained both to operate the equipment and to monitor and manage the passengers. Regardless of what types of objects are prohibited (that is, whatever security policy is being enforced), an agent calling in sick can slow down a lane, and equipment failure can shut it down entirely.
Networks prepare for equipment problems by architecting for HA (High Availability). Opening a second TSA lane alongside the first is the equivalent of Active/Active: at the inspection point, there are two (or more) identical firewalls, each running the same policy, so traffic isn’t slowed down waiting for a single slow device. Much more common is Active/Standby, in which a single firewall handles all traffic, with another one ready to take over. If TSA operated this way, there would be two lanes, but only one open. At shift change, the new personnel would all go to the second lane and redirect the line of passengers to that lane, while the old personnel shut down the first lane and left.
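To make the difference between the two HA modes concrete, here is a minimal sketch in Python. It is an illustration only: the function name `distribute`, the firewall names, and the round-robin assignment are assumptions made for this example, and real firewall clusters typically balance traffic in other ways.

```python
from collections import Counter
from itertools import cycle


def distribute(flows, firewalls, mode):
    """Count how many flows each firewall handles under a given HA mode.

    Hypothetical model: round-robin is assumed for Active/Active purely
    to keep the example simple.
    """
    if mode == "active/active":
        # Every firewall takes a share of the traffic.
        targets = cycle(firewalls)
    elif mode == "active/standby":
        # Only the first firewall carries traffic; the rest wait idle.
        targets = cycle(firewalls[:1])
    else:
        raise ValueError(f"unknown HA mode: {mode}")
    return Counter(next(targets) for _ in range(flows))


if __name__ == "__main__":
    pair = ["fw-a", "fw-b"]
    print(distribute(10, pair, "active/active"))
    print(distribute(10, pair, "active/standby"))
```

In the Active/Standby case the second device carries nothing until a failover, which is why its readiness is easy to overlook.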
Let’s go back to the airport and imagine being in line at a security checkpoint. There’s excitement among the passengers if a TSA agent approaches some equipment not currently in use: will they open another lane? Since it takes several minutes to get the lane ready, the answer depends on whether that equipment is in a state of operational readiness. If the baggage scanner isn’t working, the TSA agent walks away, and the line of passengers collectively groans in disappointment. Now imagine that scenario happens at a shift change: the first lane has shut down, but the second lane isn’t running. Airport security relies on a TSA lane being available, so a failure to fail over means that if there’s not an architectural workaround like another TSA checkpoint elsewhere in the building, the airport is essentially closed to anyone not already on a plane.
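The shift-change scenario can be sketched in a few lines of Python. This is a toy model under stated assumptions, not how any particular firewall implements failover: the `Lane` class, its `ready` flag, and the `fail_over` function are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Lane:
    """One lane (or firewall) in an Active/Standby pair. Hypothetical model."""
    name: str
    ready: bool  # True if this lane could take traffic right now


def fail_over(active: Lane, standby: Lane) -> str:
    """Return the name of the lane carrying traffic after the active one stops.

    Raises RuntimeError if the standby is not ready. That is the
    "failure to fail over" case: no lane is left to carry traffic.
    """
    if not standby.ready:
        raise RuntimeError(
            f"{active.name} shut down but {standby.name} is not ready: "
            "checkpoint closed"
        )
    return standby.name


if __name__ == "__main__":
    print(fail_over(Lane("lane-1", True), Lane("lane-2", True)))
    try:
        fail_over(Lane("lane-1", True), Lane("lane-2", False))
    except RuntimeError as err:
        print(err)
```

The point of the sketch is that the outcome depends entirely on the standby's state at the moment of failover, so that state is worth checking long before the shift change.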
Indeni provides operational assurance for network security by detecting the issues that could lead to downtime or to HA failure. In TSA terms, we are not searching bags or testing the line by sending in fake weapons. Instead, we’re monitoring task-specific metrics, like making sure that the X-ray receiver is detecting both wavelengths, and that the conveyor belt motor power consumption is within spec, because those metrics are reliable indicators of problems. Indeni knows that these metrics often aren’t obvious, which is why we also include descriptions and remediation: high conveyor belt motor power consumption can indicate dirty gears, which can cause the conveyor belt to stop abruptly, so please inspect the gears and clean if necessary.
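The idea of pairing a metric check with a description and a remediation step can be sketched as follows. This is not Indeni's rule format or API. The `Rule` and `Alert` classes, the metric name `conveyor_motor_watts`, and the 400-watt limit are all assumptions invented for this example.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Rule:
    """A threshold check bundled with its explanation. Hypothetical structure."""
    metric: str
    max_value: float
    description: str
    remediation: str


@dataclass(frozen=True)
class Alert:
    metric: str
    value: float
    description: str
    remediation: str


def evaluate(rule: Rule, readings: Mapping[str, float]) -> Optional[Alert]:
    """Return an Alert if the reading exceeds the rule's limit, else None.

    A metric that is absent from the readings is treated as "no alert"
    in this simplified sketch.
    """
    value = readings.get(rule.metric)
    if value is None or value <= rule.max_value:
        return None
    return Alert(rule.metric, value, rule.description, rule.remediation)


# Assumed values for illustration only; 400 W is not a real specification.
CONVEYOR_RULE = Rule(
    metric="conveyor_motor_watts",
    max_value=400.0,
    description="High motor power consumption can indicate dirty gears, "
                "which can cause the conveyor belt to stop abruptly.",
    remediation="Inspect the gears and clean if necessary.",
)

if __name__ == "__main__":
    alert = evaluate(CONVEYOR_RULE, {"conveyor_motor_watts": 450.0})
    if alert is not None:
        print(f"{alert.metric}={alert.value}: {alert.description}")
        print(f"Remediation: {alert.remediation}")
```

Keeping the explanation and the fix next to the threshold means whoever receives the alert does not need to already know why the metric matters.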