Why does infrastructure operations still suck?
Notice: This blog post was originally published on Indeni before its acquisition by BlueCat.
The content reflects the expertise and perspectives of the Indeni team at the time of writing. While some references may be outdated, the insights remain valuable. For the latest updates and solutions, explore the rest of our blog.
The article recounts a conversation with the leader of a 300-person infrastructure team who is frustrated with the OSS monitoring stack used across 50 global data centers, describing a fractured mix of open source and commercial tools that require heavy customization but still provide only basic up/down status. It highlights that over 70% of outages stem from human error and misconfiguration, yet existing tools detect only a tiny fraction of those issues, leaving teams reactive and overwhelmed by alerts as infrastructure complexity grows. The author argues that the monitoring market, crowded with more than 60 tools, has failed to deliver true visibility, and announces indeni's effort to introduce proactive infrastructure monitoring for major vendors like Cisco, Check Point, F5 and Palo Alto Networks, with more details promised in upcoming posts in 2016.
What are the core problems with the current OSS monitoring stack described in the article?
The article identifies several core problems: the OSS monitoring stack is a disjointed mix of open source and commercial tools that need extensive customization and bespoke extensions; after significant investment it still primarily offers only up/down monitoring and floods teams with data and alerts; and it lacks proactive capabilities to detect misconfigurations and human-error conditions, which account for over 70% of outages. This results in reactive operations, frequent surprises, and tools that become less useful as infrastructure complexity grows.
Why does the author believe the monitoring market has so many tools and why is that a problem?
The author reasons that a market with around 60 tools usually indicates either an extremely large addressable market or that existing solutions have failed to meet needs, prompting new entrants. In this case, the abundance of tools reflects persistent failure to deliver true visibility and proactive functionality. Many customers end up using multiple overlapping tools, investing dozens of person-years to achieve only basic up/down status, which wastes resources and perpetuates ineffective, reactive operations rather than providing the superior, consolidated solution customers need.
What solution and direction does the author propose going forward?
The author explains that indeni has decided to tackle the shortcomings by developing technology to bring proactivity to network and security teams across large enterprises. indeni claims success in delivering proactive capabilities for environments using Cisco, Check Point, F5 and Palo Alto Networks and intends to roll out a strategy to extend true proactivity to every piece of infrastructure in large enterprises. The author commits to publishing a series of posts detailing plans, actions, and rationale, positioning 2016 as a potential turning point for the market.
Last Friday, I met with someone who leads a 300-person team responsible for running the networking and computing infrastructure in 50 data centers around the globe. I asked him what he thought of his OSS stack – the set of tools his team uses to stay on top of what’s going on in their infrastructure.
He hates it.
As I want to keep this blog post PG-rated, I’ll refrain from using his adjectives, but I can tell you he’s not happy with it. It’s a hodgepodge of open source and commercial tools. The tools required a lot of customization and a variety of extensions written over the years. At the end of the day, though, the stack gives him only up/down monitoring and no ability to proactively avoid the next outage. Over 70% of outages occur due to human error and misconfigurations, and the tools available to him are incapable of identifying even one percent of those.
It is amazing to me that the Infrastructure Operations market has barely changed with regard to getting visibility into your infrastructure – an activity commonly referred to as monitoring: still the same SNMP monitoring tools, still flooding admins with data and alerts, still staying on the defensive and being reactive, still waking up every morning to a new surprise. The person I met on Friday, a veteran of the industry, actually believes that Infrastructure Monitoring has been going backwards in recent years. The infrastructure is growing in complexity, and the monitoring tools aren’t changing their approach to providing true visibility, so they are becoming even less useful at their job.
Just in the networking space, Wikipedia lists 65 different tools. I have seen most of these tools in use by mid-size and very large enterprises. They are usually very similar: they come with some basic monitoring functionality and allow some customization and extensibility. Even after the team invests dozens of man-years in setting up the system, it can only tell them whether their network is up or down. How useful is that?
When we started working on our product, this was baffling to us. Over 60 tools? That’s unheard of in tech. In each market, there are usually 5-10 competitors, with one or two being dominant. Look at workstations (Microsoft), servers (Unix-like systems), networking gear (Cisco), load balancing (F5), CRM (Salesforce), server virtualization (VMware) and other markets. A market with 60 tools generally means either that it is huge (tens to hundreds of billions, capable of supporting so many competitors) or that none of the solutions has delivered, so new ones keep showing up. Many customers we speak with have several different tools with overlapping capabilities, none of them delivering.
This needs to end. A superior technology and product must surface and provide the solution customers have been awaiting for nearly two decades. A few years ago, we at indeni decided to contend for this title, and we welcome anyone else attempting to do the same.
We have been successful at bringing proactivity to network and security teams around the globe who are utilizing Cisco, Check Point, F5 and Palo Alto Networks in their environments.
Over the next few months, we will be rolling out our strategy for delivering true proactivity to every single piece of infrastructure deployed in large enterprises. I will be detailing our plans and actions, and the rationale behind them, over a series of posts.
Stay tuned: 2016 will be a turning point in this market!