5 Ways Big Failures Grow from Accumulated Small Ones

Notice: This blog post was originally published on Indeni before its acquisition by BlueCat.

The content reflects the expertise and perspectives of the Indeni team at the time of writing. While some references may be outdated, the insights remain valuable. For the latest updates and solutions, explore the rest of our blog.

Network engineers believe that outages, when they happen, usually come from unexpected causes, because if a network outage can be predicted, it can be prevented, right? Not quite. The cause of your next outage has probably already happened, and you can stop it if you can find it.

In an episode of the podcast “Cautionary Tales”, Tim Harford shared two principles of system failure. First, adding redundancy can actually make a system less stable. Galileo’s last published work, Discourses and Mathematical Demonstrations Relating to Two New Sciences, contains a story of a marble column that was in storage for a building project. It had been propped up in three places: one at each end, and one in the middle. The middle support was meant to prevent the column from sagging and breaking under its own weight. However, one of the end supports crumbled, so that end sagged, creating upward pressure in the middle. As predicted, the column cracked in the middle, but in an unexpected way. Adding redundancy caused the very problem it was designed to prevent.

Second, Harford related a principle put forth by sociologist Charles Perrow in his 1984 book, Normal Accidents: failures are unavoidable in systems that are both complex and tightly coupled. While Perrow’s book examines Three Mile Island, it’s easier to think of a line of dominoes: each additional domino makes the system slightly more complex; every domino added is another opportunity to knock the next one over; and tight coupling means that knocking one domino over cascades until the rest of them fall.
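The domino picture can be made concrete with a little arithmetic. The sketch below is an illustration only, not a model taken from Perrow or Harford: it assumes a line of dominoes in which each falling domino knocks over the next with a fixed probability (the hypothetical `coupling` parameter), and computes the expected number that fall once the first is pushed.

```python
def expected_cascade_length(n: int, coupling: float) -> float:
    """Expected number of fallen dominoes in a line of n.

    Assumption: the first domino always falls, and each fallen domino
    topples the next one independently with probability `coupling`.
    Domino k (counting from 0) therefore falls with probability coupling**k.
    """
    return sum(coupling ** k for k in range(n))


# Compare a tightly coupled line with progressively looser ones.
for coupling in (1.0, 0.9, 0.5):
    print(coupling, round(expected_cascade_length(10, coupling), 2))
```

With `coupling` at 1.0 every domino falls, however long the line is. As the coupling loosens, the expected damage stays small even when more dominoes are added, which is the intuition behind preferring loosely coupled designs.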

The mention of complex systems brings to mind a short treatise from 1998 by Dr. Richard I. Cook, “How Complex Systems Fail”. He observed that any complex system has guards against single points of failure, so multiple failures must occur for an overall system outage. As a result, remediation of small problems tends to get deferred, since they don’t affect production. The organization prioritizes work that does affect production, like moves/adds/changes, and rewards engineers who find faster ways to get things done. Over time, as both the system and the staff change, the list of small problems becomes increasingly inaccurate and fades from living memory: without periodic re-examination, new potentially hazardous interactions go undetected. When an outage does occur, an organization looking for a root cause can therefore always claim operator error, since the cause was something known that had been ignored. That outage is also inevitable, since well-performing systems are usually given additional workload without any additional workforce.

Combining these principles gives the following conclusions:

  1. Large system failures come from interaction of small problems.
  2. Adding redundancy necessarily adds complexity.
  3. The more complex the system, the more likely it is that there are existing small problems, and the less likely it is that their interaction can be understood.
  4. A system that seems to be performing well will be pushed harder, until it eventually breaks and experiences an outage.
  5. When the outage occurs, there will be a lot of small problems (re-)discovered, many of which will not be related to the actual cause or solution, but each of which will create a distraction that adds to the time needed to restore the overall system to operation.
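To see why conclusions 1 and 3 compound, consider a rough back-of-the-envelope model. It is a sketch under stated assumptions, not a result from Cook or Perrow: suppose every pair of open small problems has the same small, independent chance `q` of interacting badly (both the independence and the value of `q` are assumptions made for illustration). The number of pairs grows roughly with the square of the number of open problems, so the chance that at least one pair is hazardous climbs quickly.

```python
from math import comb


def hazardous_pair_probability(open_issues: int, q: float) -> float:
    """Chance that at least one pair of open issues interacts badly.

    Assumption: each of the comb(open_issues, 2) pairs is hazardous
    independently with probability q.
    """
    pairs = comb(open_issues, 2)
    return 1.0 - (1.0 - q) ** pairs


# q = 0.01 is an arbitrary illustrative value, not a measured rate.
for n in (5, 10, 20, 40):
    print(n, comb(n, 2), round(hazardous_pair_probability(n, 0.01), 3))
```

Doubling the backlog of small problems roughly quadruples the number of pairs to reason about, which is why a long list of deferred issues is harder to understand than its length alone suggests.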

Here’s how Indeni avoids the trap of a complex cascade:

  • Indeni is loosely coupled to the devices it checks for errors, so it does not contribute to system failure. 
  • Indeni uses multi-variable context in its detection elements, based on expert knowledge of complex systems, so it can detect issues that are likely to cascade.
  • When Indeni detects symptoms of an issue, its Auto-Triage feature digs deeper to provide a specific diagnosis of the root cause.
  • Each issue summary has an explanation of potential impact, and issues are prioritized based on that potential rather than the current state, so it’s easy to identify which small issues should be fixed.
  • Each issue summary includes a recommended remediation, so fixes can be done quickly.
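The idea of ranking by potential impact rather than current state can be sketched in a few lines. This is a generic illustration, not Indeni's implementation or API: the `Issue` class, its fields, the 1-to-5 impact scale, and the sample issues are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Issue:
    name: str                   # short description (hypothetical examples below)
    affecting_production: bool  # is it hurting production right now?
    potential_impact: int       # assumed scale: 1 = cosmetic ... 5 = likely outage


def prioritize(issues: list[Issue]) -> list[Issue]:
    """Order issues by what they could cause, not by whether they hurt today."""
    return sorted(issues, key=lambda issue: issue.potential_impact, reverse=True)


# None of these affects production yet, so a "current state" view ranks them equally.
backlog = [
    Issue("Interface description missing", False, 1),
    Issue("Standby cluster member out of sync", False, 5),
    Issue("Log partition at 85% capacity", False, 3),
]

for issue in prioritize(backlog):
    print(issue.potential_impact, issue.name)
```

Sorting on `potential_impact` surfaces the out-of-sync standby first, even though nothing in the backlog is causing visible trouble today.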

Find the source of your next outage and stop it before it happens, thanks to Indeni. Download Indeni or take a test drive (no hardware needed) today. 

Key takeaways

This key takeaway was generated through LLMs crawling the page and coming up with an overview of the content.

The article explains why outages often arise from interactions of existing small problems in complex, tightly coupled systems, drawing on insights from Tim Harford, Charles Perrow, and Dr. Richard I. Cook. It describes how redundancy and gradual accumulation of unresolved minor issues increase system fragility and how organizations unintentionally defer remediation as systems are pushed harder without added resources. The piece describes how Indeni addresses this operational challenge by using loose coupling, multi-variable context detection, Auto-Triage root-cause diagnosis, prioritized impact-based summaries, and recommended remediations to find and stop the source of the next outage before it happens.

Why can adding redundancy make a system less stable according to the article?

The article uses Galileo’s example of a marble column propped at three points to show that adding redundancy changes force interactions and can introduce new failure modes. When one support failed, the redundant middle support caused unexpected upward pressure that cracked the column in a different place. In complex systems, redundancy increases system complexity and coupling, creating additional interactions and opportunities for failures to cascade rather than guaranteeing greater stability.

How do small, deferred problems contribute to large system outages?

Drawing on Dr. Richard I. Cook’s observations, the article explains that complex systems already protect against single-point failures, so outages require multiple faults. Small problems that do not immediately affect production are often deferred and accumulate as the system evolves and staffing changes. Over time these unresolved issues and stale institutional knowledge increase the chance that multiple minor faults will interact unexpectedly, producing a larger outage that is difficult to fully diagnose because many unrelated small issues reappear during incident response.

What specific features of Indeni help prevent cascaded outages as described in the article?

According to the article, Indeni avoids contributing to failures by being loosely coupled to the devices it checks. It uses multi-variable context in its detection elements—based on expert knowledge of complex systems—to identify issues likely to cascade. Indeni’s Auto-Triage digs deeper to provide a specific root-cause diagnosis, and each issue summary explains potential impact prioritized by risk rather than current state. Finally, issue summaries include recommended remediations so teams can quickly fix prioritized small problems before they interact and cause outages.
