Correlate, Investigate, Resolve: Applying AI in remote IT operations

17 April 2020


For Ops teams of any flavor (NetOps, DevOps, ITOps, SREs, NOC technicians, et al.), it can feel like the weight of the world, or at least the weight of the enterprise, is well and truly on their shoulders. Now, more than ever, as humans are stretched beyond capacity, it is critical to understand the role AI can play in IT incident management.

Running IT Ops is not simple. The ease with which a relatively non-technically-minded business function can spin up a powerful, cloud-based application instance often gives people the impression that all of IT is that simple! However, keeping the enterprise’s operations working smoothly is an increasingly complex task, one that seems to many Ops professionals to have become a purely reactive role — having to jump to an increasing number of red flags and issues — rather than one that leaves time (and energy) to play a decisive, strategic role in the way the business develops.

With complexity comes change

At the heart of IT Ops is the reality that nothing stays the same in any IT environment. Particularly now.

There are new users hitting the network from home, changes to infrastructure and topology as companies move to hybrid cloud, data centers being moved to the cloud, new software container pods and microservices moving from development to testing to production environments, and new applications and services being rolled out for both internal and customer-facing roles. Furthermore, there’s the constant need for systems and applications to be patched, updated, and upgraded. The list is seemingly endless.

Each of these motions or initiatives creates change (sometimes understood, sometimes unplanned) and often triggers alerts from one or more of the systems monitoring applications, web servers, security systems, networks and more. Those alerts arrive in their thousands and in different forms: phones ring, emails start flying, and IT Ops teams are called to figure out what changed, what stopped performing because of what, and who needs to help fix issues. Manual processes and a reliance on people to catch and fix things might have worked years ago, but today outages happen and humans can’t keep up.

Time for AI in IT Operations?

With so many variables in the modern hybrid environment, plus the increasing need from the business for agility in DevOps and swift resolution times for problems, is it time to think more deeply about deploying some element of artificial intelligence and machine learning across the IT Operational arena?

In some ways, all the outward signs indicate that it certainly would be possible: after all, what AI excels at is ingesting large amounts of data and producing insights or results that would be beyond human teams combing through information manually. Enter AIOps.

AIOps is a capability, a toolset: neither a marketing attribute nor a single category of solutions. In forward-looking enterprises, AIOps is embedded in toolsets that work in specific parts of the incident lifecycle. It’s also being used as a catchphrase by vendors hoping to cash in on the hype bubble that’s around all things “smart” at the moment.

Here at TechHQ, we’ve been researching this area, trying to distinguish between the various products in this space and ascertain how far each vendor can go to help embattled Operations teams and, hopefully, set overworked IT professionals onto a more strategic, proactive footing.

And, of course, we’ve been trying to sort the good from the bad, the spurious claims from the reality, and the products that can deliver AI (or results) in a meaningful and useful way in the enterprise from those that can’t.

In this series of articles, we’ll be looking at AI-aided solutions specifically designed for the IT Ops role. There are, naturally, existing products in the ERP mold that have been retooled or, in some way, adapted so IT Operations can use them. Oftentimes, however, these products fall far short of expectations, and can (at worst) be just another contributor to IT teams’ resource overheads: one more source of the red flag storm, rather than a help.

Many products in the Ops space might leverage AI internally, but without oversight of all elements, the flags raised (albeit ones raised by self-learning algorithms) still just become part of the background noise.

Industry professionals will be aware of, if not have used, Apache Airflow, Datadog, and applications like VictorOps. But BigPanda is really the only platform that offers enterprises the type of solution that IT Ops teams are seeking. BigPanda uses explainable AI to correlate disparate event data from monitoring tools across the IT environment, then seeks to understand commonalities and outliers which point to the source of change, or the “root cause” of an incident.

Full consideration

Any solution in this space has to be fully cognizant of the changes that are a second-by-second occurrence in the modern enterprise’s IT stack and topology. What’s necessary, therefore, is a single place that gathers data from the existing tools that either affect or inform about the IT situation, plus, of course, the AI “muscle” needed to intelligently trigger appropriate responses.

That means the most common change management applications, collaboration platforms, topology tools, and device/networking monitors continually feed data into a single, auto-learning platform. Taken singly, any one of those data sources could consume a lot of resources, even with a large budget, if it is to be used to its full advantage.

But in combination, they make tracking down the causes of problems, bottlenecks, outages, and poor end-user experiences all but impossible by hand. This is, after all, big data in a literal sense. Little wonder that Operations teams are usually running just to stand still!

However, it should be stressed that such a platform should gather and normalize data from all existing sources, and be capable of doing the same for solutions yet to be deployed. There are probably no enterprises that are greenfield sites: legacy tools and platforms represent a significant investment and are, indubitably, powerful solutions in their own right.
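To make "gathered and normalized" a little more concrete, here is a minimal sketch of what normalizing alerts from two different sources into one common shape might look like. Everything in it is an assumption made for illustration: `tool_a` and `tool_b` are hypothetical monitoring tools, and their payload fields (`hostname`, `sev`, `ts`, `node`, `level`, `time`) are invented. It does not describe any vendor’s actual format or API.

```python
from datetime import datetime, timezone


def normalize_tool_a(payload):
    # Hypothetical tool A: epoch-second timestamps and a numeric severity.
    return {
        "source": "tool_a",
        "host": payload["hostname"].lower(),
        "severity": "critical" if payload["sev"] >= 3 else "warning",
        "raised_at": datetime.fromtimestamp(payload["ts"], tz=timezone.utc),
    }


def normalize_tool_b(payload):
    # Hypothetical tool B: ISO 8601 timestamps and a text severity level.
    return {
        "source": "tool_b",
        "host": payload["node"].lower(),
        "severity": payload["level"].lower(),
        "raised_at": datetime.fromisoformat(payload["time"]),
    }


# One entry per data source; a tool deployed later only needs a new entry.
NORMALIZERS = {
    "tool_a": normalize_tool_a,
    "tool_b": normalize_tool_b,
}


def normalize(source, payload):
    """Convert a tool-specific alert payload into the common alert shape."""
    return NORMALIZERS[source](payload)
```

The point of the lookup table is the one made above: supporting a solution that has yet to be deployed should mean adding one small adapter, not reworking everything downstream.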

BigPanda’s Open Box Machine Learning, for example, is capable of correlating topology data with incident alerts, change management progress information, DevOps orchestration data, and changes in cloud provisions. Armed with all that, BigPanda’s AI smarts produce a coherent list of probable root causes of incidents.
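As a rough illustration of the general idea of correlating alerts with topology and change data, consider the toy sketch below. It is emphatically not BigPanda’s algorithm, which is not described in this article: the `Alert` and `Change` types, the `depends_on` topology map, the 30-minute window, and the simple counting heuristic are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Alert:
    host: str
    raised_at: datetime


@dataclass(frozen=True)
class Change:
    change_id: str
    host: str
    applied_at: datetime


def probable_causes(alerts, changes, depends_on, window=timedelta(minutes=30)):
    """Rank changes by how many alerts they could plausibly explain.

    A change counts as a candidate cause for an alert when it was applied
    no more than `window` before the alert, either on the alerting host
    or on a host that the alerting host depends on (hypothetical topology).
    """
    scores = {}
    for alert in alerts:
        related_hosts = {alert.host} | set(depends_on.get(alert.host, ()))
        for change in changes:
            lag = alert.raised_at - change.applied_at
            if change.host in related_hosts and timedelta(0) <= lag <= window:
                scores[change.change_id] = scores.get(change.change_id, 0) + 1
    # Most-implicated change first; ties broken by change ID for stability.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))
```

Given two web servers that both depend on one database, a change applied to that database minutes before all three hosts raise alerts would top the list, while an unrelated change from hours earlier would not appear at all. A production system would weigh far more signals than this, but the shape of the output is the same: a ranked list of probable causes rather than a pile of unrelated alerts.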

Defining AI

Despite being a powerful technology (one so powerful it worries some commentators), even the most modern ML routines still work best with limited parameters. Despite the seeming complexity of the modern enterprise’s IT systems, smart systems can, given clean or even enriched data, learn the underlying patterns of cause and effect of what humans call incidents. To a human mind, a Kubernetes-orchestrated rollout might have little to do with a server outage on the other side of the globe.

But given correctly parsed data and a defined environment in which to work, AI engines like the BigPanda Open Box ML can be remarkably effective.

Being an AI system, such a “brain” can learn as it goes, and so within (typically) eight to twelve weeks it can route incident notifications to the correct place: levels one through three of response teams, or DevOps or NetOps, for example. There’s also integration with the likes of Slack, ServiceNow, and PagerDuty, so the platform can fit alongside existing tools. Plus, along with details of the primary incident (ignoring follow-on red flags is an art done well by AI), it can also supply the probable cause of the event, however far removed it may have been from the eventual effects.
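The notion of a system that learns where to route incidents can be pictured with a deliberately simple sketch. It assumes a frequency-based approach, where an incident goes to whichever team has most often resolved incidents from the same source in the past. The `IncidentRouter` class, the `source` labels, and the `level-1` default are hypothetical, and a real platform’s learning would be considerably more sophisticated.

```python
from collections import Counter, defaultdict


class IncidentRouter:
    """Routes an incident to the team that most often resolved similar ones."""

    def __init__(self, default_team="level-1"):
        # Where incidents go until there is any history for their source.
        self.default_team = default_team
        # Maps each incident source to a tally of the teams that resolved it.
        self._history = defaultdict(Counter)

    def record_resolution(self, source, team):
        """Learn from a resolved incident: note which team fixed it."""
        self._history[source][team] += 1

    def route(self, source):
        """Pick the team with the most past resolutions for this source."""
        resolved_by = self._history.get(source)
        if not resolved_by:
            return self.default_team
        return resolved_by.most_common(1)[0][0]
```

Early on, everything lands with the default team. As resolutions are recorded over the following weeks, incidents from, say, a network source start going straight to whichever team has actually been fixing them.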

That means issues are resolved more quickly, and overall times-to-resolution drop significantly. Operationally, new services and apps deploy more quickly and are more resilient from day one, and overall, teams can collaborate using a single source of truth about system, application and infrastructure performance. Having a single canonical reference is critically important, particularly as IT teams often handle operations remotely. Where possible, machines and AI are helping humans connect the dots, removing much of the manual investigation and root-cause analysis workloads.

Rather than being a catchall for all incident reporting (a function other platforms can achieve, with varying success), an intelligent system can respond immediately to real-time changes in topology and infrastructure, as well as to the day-to-day activities of DevOps, network teams, and ITIL change management functions: in fact, to all the variables at play.

BigPanda offers an online demo and also provides select organizations with 90 days’ free use of the platform. We suggest that evaluating BigPanda will improve your incident responses and overall ability to run an organization with agile IT Ops.