Engineers are on a Failure Investigation Mission

Engineers have devised a method for quickly identifying a variety of potential failures in a system before it is deployed in the real world. Many of the services we rely on are managed by computers, from vehicle collision avoidance to airline scheduling systems to power supply grids. As the complexity and ubiquity of these autonomous systems grow, so do the ways in which they can fail.

MIT engineers have now developed a method that can be used in conjunction with any autonomous system to quickly identify a range of potential failures in that system before it is deployed in the real world. Beyond spotting failures, the approach can also recommend repairs to avoid system breakdowns.

The team demonstrated that the method can detect failures in a variety of simulated autonomous systems, including small and large power grid networks, an aircraft collision avoidance system, a team of rescue drones, and a robotic manipulator. In each system, the new approach, in the form of an automated sampling algorithm, quickly identifies a range of likely failures along with repairs to avoid them.

The new algorithm takes a different tack from other automated searches, which are designed to detect only the most severe failures in a system. Those approaches, the team says, can miss subtler but significant vulnerabilities that the new algorithm catches.

“In reality, there’s a whole range of messiness that could happen for these more complex systems,” says Charles Dawson, a graduate student in MIT’s Department of Aeronautics and Astronautics. “We want to be able to trust these systems to drive us around, or fly an aircraft, or manage a power grid. It’s really important to know their limits and in what cases they’re likely to fail.”

Dawson and Chuchu Fan, assistant professor of aeronautics and astronautics at MIT, are presenting their work this week at the Conference on Robot Learning.

Sensitivity over adversaries

A major system meltdown in Texas in 2021 got Fan and Dawson thinking. Winter storms rolled through the state in February of that year, bringing unexpectedly cold temperatures that triggered power outages across the grid. More than 4.5 million homes and businesses were left without power for several days, in what became the worst energy crisis in Texas history.

“That was a pretty major failure that made me wonder whether we could have predicted it beforehand,” Dawson says. “Could we use our knowledge of the physics of the electricity grid to understand where its weak points could be, and then target upgrades and software fixes to strengthen those vulnerabilities before something catastrophic happened?”

Dawson and Fan’s work focuses on robotic systems and finding ways to make them more resilient in their environment. Prompted in part by the Texas power crisis, they set out to expand their scope to spot and fix failures in other complex, large-scale autonomous systems. To do so, they realized they would have to shift the conventional approach to finding failures.

Designers frequently test the safety of autonomous systems by identifying their most likely failures. They begin with a computer simulation of the system that represents its underlying physics along with all of the variables that could affect its behavior. They then run the simulation with an algorithm that performs “adversarial optimization,” an approach that automatically optimizes for the worst-case scenario by making small changes to the system, over and over, until it zeroes in on the changes associated with the most severe failures.
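
To make the contrast concrete, here is a minimal sketch of what such an adversarial search might look like. The toy `simulate` function, the environment parameter vector, and the greedy hill-climbing loop are all illustrative assumptions, not the actual optimization used in the MIT work.

```python
import numpy as np

# Toy severity model (an assumption, standing in for a real system simulator):
# severity grows as environment parameters drift from their nominal values.
def simulate(env: np.ndarray) -> float:
    return float(np.sum(env ** 2))

def adversarial_search(env: np.ndarray, steps: int = 1000,
                       step_size: float = 0.05,
                       seed: int = 0) -> tuple[np.ndarray, float]:
    """Greedy worst-case search: propose small random changes and keep only
    those that make the simulated outcome more severe, zeroing in on a
    single most-severe failure."""
    rng = np.random.default_rng(seed)
    worst_env, worst_severity = env.copy(), simulate(env)
    for _ in range(steps):
        candidate = worst_env + step_size * rng.standard_normal(env.shape)
        severity = simulate(candidate)
        if severity > worst_severity:  # accept only changes that worsen things
            worst_env, worst_severity = candidate, severity
    return worst_env, worst_severity

worst_env, worst_severity = adversarial_search(np.zeros(4))
print(f"most severe failure found: severity={worst_severity:.2f}")
```

Because every step must worsen the outcome, a search like this converges on one extreme scenario and discards the milder failure modes it passed along the way.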

“By condensing all these changes into the most severe or likely failure, you lose a lot of complexity of behaviors that you could see,” Dawson notes. “Instead, we wanted to prioritize identifying a diversity of failures.”

To accomplish this, the team took a more “sensitive” approach. They created an algorithm that generates random changes within a system and evaluates the system’s sensitivity to those changes, meaning how strongly its behavior shifts in response. The more sensitive a system is to a given change, the more likely that change is linked to a possible failure.
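
One way to picture this, assuming the sampler behaves roughly like a Metropolis-style random walk (the paper’s actual algorithm may differ), is sketched below: rather than keeping only the worst change, the walk accepts changes in proportion to how sharply the simulated severity responds to them, so it collects many distinct failures.

```python
import numpy as np

def simulate(env: np.ndarray) -> float:
    # Toy severity model, as in the earlier sketch (an assumption).
    return float(np.sum(env ** 2))

def sample_failures(env: np.ndarray, n_steps: int = 2000,
                    step_size: float = 0.1, temperature: float = 0.5,
                    failure_threshold: float = 1.0,
                    seed: int = 1) -> list[np.ndarray]:
    """Random-walk sampler: changes that increase severity are always kept,
    while milder changes still get through sometimes, so the walk wanders
    across many failure modes instead of collapsing onto one worst case."""
    rng = np.random.default_rng(seed)
    current, current_severity = env.copy(), simulate(env)
    failures = []
    for _ in range(n_steps):
        candidate = current + step_size * rng.standard_normal(env.shape)
        severity = simulate(candidate)
        # Acceptance probability grows with the severity increase, i.e. with
        # how sensitive the system is to this particular change.
        accept = min(1.0, np.exp((severity - current_severity) / temperature))
        if rng.random() < accept:
            current, current_severity = candidate, severity
            if severity > failure_threshold:  # hypothetical cutoff for "failure"
                failures.append(current.copy())
    return failures

failures = sample_failures(np.zeros(4))
print(f"collected {len(failures)} failure samples")
```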

In this way, the approach surfaces a broader range of potential failures. The algorithm also allows researchers to identify fixes by backtracking through the chain of changes that led to a particular failure.

“We recognize there’s really a duality to the problem,” Fan says. “There are two sides to the coin. If you can predict a failure, you should be able to predict what to do to avoid that failure. Our method is now closing that loop.”
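
A hedged sketch of how that loop might close: given a damaging environmental change found by the sampler, nudge the system’s own design parameters downhill on the same severity score until the failure no longer occurs. The design/environment split and the finite-difference descent here are assumptions for illustration; the paper’s repair procedure may work differently.

```python
import numpy as np

def severity(design: np.ndarray, env: np.ndarray) -> float:
    # Toy model (an assumption): the outcome is bad when the design is
    # poorly matched to the environmental conditions.
    return float(np.sum((env - design) ** 2))

def suggest_repair(design: np.ndarray, failure_env: np.ndarray,
                   steps: int = 200, step_size: float = 0.05,
                   eps: float = 1e-4) -> np.ndarray:
    """Descend the severity score with respect to the design parameters,
    holding the discovered failure conditions fixed."""
    repaired = design.copy()
    for _ in range(steps):
        base = severity(repaired, failure_env)
        grad = np.zeros_like(repaired)
        for i in range(repaired.size):  # finite-difference gradient estimate
            bumped = repaired.copy()
            bumped[i] += eps
            grad[i] = (severity(bumped, failure_env) - base) / eps
        repaired -= step_size * grad  # move against the severity gradient
    return repaired

design, failure_env = np.zeros(3), np.array([1.0, -0.5, 2.0])
repaired = suggest_repair(design, failure_env)
print(f"severity before repair: {severity(design, failure_env):.3f}")
print(f"severity after repair:  {severity(repaired, failure_env):.3f}")
```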

Hidden failures

The researchers put the new method to the test on a variety of simulated autonomous systems, including small and large power grids. In those cases, they paired their algorithm with a simulation of generalized, regional-scale electricity networks. They demonstrated that, whereas traditional approaches flagged a single power line as the most vulnerable to failure, their algorithm discovered that the failure of that line combined with the failure of a second line could cause a total blackout.
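
The flavor of that finding can be shown with a deliberately tiny, invented grid model (none of the line names or numbers come from the study): every line failing alone is survivable, but one particular pair failing together drops capacity below demand.

```python
import itertools

# Invented four-line grid: capacities and demand in MW (illustrative only).
line_capacity = {"A": 60.0, "B": 60.0, "C": 100.0, "D": 100.0}
demand = 150.0

def blackout(failed_lines: set[str]) -> bool:
    """Crude proxy for a blackout: total surviving capacity below demand."""
    remaining = sum(cap for line, cap in line_capacity.items()
                    if line not in failed_lines)
    return remaining < demand

singles = [line for line in line_capacity if blackout({line})]
pairs = [pair for pair in itertools.combinations(line_capacity, 2)
         if blackout(set(pair))]

print("single-line blackouts:", singles)  # [] -- every single outage is survivable
print("two-line blackouts:   ", pairs)    # [('C', 'D')] -- the correlated failure
```

A search that only ranks individual lines by vulnerability never evaluates the joint outage, which is the kind of hidden correlation a diversity-seeking sampler can stumble onto.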

“Our method can discover hidden correlations in the system,” Dawson says. “Because we’re doing a better job of exploring the space of failures, we can find all sorts of failures, which sometimes includes even more severe failures than existing methods can find.”

The researchers demonstrated similarly varied results in other autonomous systems, such as aircraft collision avoidance and rescue drone coordination. They also demonstrated the approach on a robotic manipulator, a robotic arm designed to push and pick up objects, to see whether their failure predictions in simulation would hold true in reality.

The researchers first tested their algorithm on a simulation of a robot tasked with pushing a bottle out of the way without knocking it over. When they ran the same scenario in the lab with the actual robot, it failed in the ways the algorithm had predicted, such as knocking the bottle over or failing to reach it. When they applied the algorithm’s suggested fix, the robot successfully pushed the bottle away.

“This shows that, in reality, this system fails when we predict it will, and succeeds when we expect it to,” Dawson says.

In theory, the team’s approach could detect and repair flaws in any autonomous system, provided an accurate simulation of the system’s behavior is available. Dawson hopes the approach will one day be turned into an app that designers and engineers can use to tune and tighten their own systems before testing them in the real world.