From the blame-game to the end-game – AIOps in the enterprise
Many articles around technology start with a dose of hyperbole, like: “IT is changing at the speed of a wild stallion.” That said, most technology professionals in IT Operations roles would have to agree that many of the accepted norms in their roles have indeed changed quite rapidly – in the last few years, especially.
IT Ops teams could attest to five recent shifts in the IT environment that are causing them, if not to lose oversight altogether, then at least to lose a night’s sleep on occasion:
- The expectations of decision-makers that everything should “go cloud”,
- The sudden switch to remote working wholesale (and its probable continuation, to some extent, permanently),
- Automation of legacy platforms, and the move to collaborative API-based solutions,
- Modernizing software and application delivery, whether that is moving to containerization/microservices or moving an ERP platform from on-premise to cloud,
- Traditional change management and helpdesk methods (ITIL, for example) making way for DevOps and SRE (site reliability engineering) models.
All of these shifts (and likely a few others) trigger changes to the IT environment, which then wreak havoc, raise red flags of one sort or another, and generally bring every function in the enterprise knocking on IT Ops’ door to fix the problem and, more importantly, to figure out how to stop it happening again!
That’s a tall order, given the number of possible root causes of every problem. A single issue raised by a customer care contact center describing poor end-user experiences might have its cause deep in changes made to a database schema as part of a managed change request. Or it might lie in any number of other more-or-less simultaneous alterations to the overall IT stack, instigated by any number of sources.
Solving an issue, and preventing its recurrence, is a long process of unpicking complex cause-and-effect chains. That process is made more difficult by siloed data, for sure, but also by siloed departments or divisions. NetOps, DevOps, CloudOps, security, storage, and so forth might each have different ways of tracking issues and use quite different toolsets, for different reasons.
The result is that Level One response teams are overstretched with too many alerts (often for the same issue, but raised after different knock-on effects transpire), and Level Two and Three teams are unsure of where the overall cause of an issue might lie.
It’s here that AIOps can hold the answers for IT Operations. We spoke to Paul Bevan at Bloor Research recently about how this emerging technology could not only help IT teams day-to-day but also change the way that, for instance, L2 and L3 teams are structured.
But first, Paul described a current typical scenario that will be familiar to most readers:
“A problem would involve two or three different siloed teams trying to find where the real change or problem first occurred or originated. It could be the application, it could be in the network, and the result was finger-pointing, for want of a better word… there was a wonderful term I once heard: getting the Lowest Mean Time to Innocence!”
Systems that use AIOps effectively give teams access to all possible sources of truth available to all those siloed teams across the enterprise: storage, cloud, networks, databases, data center operations, and so forth.
The result is less L1 “chaff,” and L2 and L3 teams getting very specific information, a fact which, Paul said, “obviates the need for the ‘war room,’ where everybody gets around the table and says ‘Well, I’ve looked at mine and it’s not my problem.’ And this can take days!”
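To make the idea concrete, here is a minimal sketch of how alerts from several siloed teams might be folded into a single incident and matched against recent changes. It is a plain time-window heuristic written for illustration only, not how BigPanda or any other AIOps product works; real platforms typically rely on machine learning rather than fixed rules. The names `Alert`, `Change`, `Incident`, `correlate` and `probable_cause`, along with the 10-minute and one-hour windows, are all assumptions.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class Alert:
    source: str          # team or tool that raised it, e.g. "network" (hypothetical)
    service: str         # business service affected
    raised_at: datetime
    message: str


@dataclass
class Change:
    system: str          # where the change was applied, e.g. "database" (hypothetical)
    applied_at: datetime
    description: str


@dataclass
class Incident:
    service: str
    alerts: list = field(default_factory=list)


def correlate(alerts, window=timedelta(minutes=10)):
    """Group alerts on the same service into one incident when each alert
    arrives within `window` of the previous alert in that incident."""
    incidents = []
    latest_by_service = {}
    for alert in sorted(alerts, key=lambda a: a.raised_at):
        current = latest_by_service.get(alert.service)
        if current is not None and alert.raised_at - current.alerts[-1].raised_at <= window:
            current.alerts.append(alert)
        else:
            current = Incident(service=alert.service, alerts=[alert])
            latest_by_service[alert.service] = current
            incidents.append(current)
    return incidents


def probable_cause(incident, changes, lookback=timedelta(hours=1)):
    """Return the most recent change applied in the `lookback` period before
    the incident's first alert, or None if there is no such change."""
    first_alert_at = incident.alerts[0].raised_at
    candidates = [
        c for c in changes
        if first_alert_at - lookback <= c.applied_at <= first_alert_at
    ]
    return max(candidates, key=lambda c: c.applied_at, default=None)
```

Under these assumptions, a database alert and a network alert raised a minute apart on the same service become one incident for L1, and a schema change applied a few minutes earlier is surfaced to L2 and L3 as the first thing to check.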
With that better focus and at least a decent steer towards clearer root cause analysis, L2 and L3 teams will find that their MTRs (mean times to resolution) are much lower.
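For readers who want to track that figure themselves, mean time to resolution is simply the average of the gap between an incident being opened and being resolved. The sketch below assumes incidents are available as pairs of open and resolve timestamps; the function name `mean_time_to_resolution` and that input shape are illustrative assumptions, not part of any particular ticketing tool.

```python
from datetime import datetime, timedelta


def mean_time_to_resolution(incidents):
    """Average resolution time over (opened_at, resolved_at) pairs.

    The input shape is an assumption made for this example; adapt it to
    whatever your ticketing system exports.
    """
    durations = [resolved_at - opened_at for opened_at, resolved_at in incidents]
    if not durations:
        raise ValueError("at least one resolved incident is required")
    # Summing timedeltas needs an explicit timedelta() start value.
    return sum(durations, timedelta()) / len(durations)
```

Comparing this value for the months before and after an AIOps rollout gives a rough, like-for-like view of whether resolution times are actually falling.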
Furthermore, Paul said that some teams are, post-AIOps installation, merging what were once separate L2 and L3 teams from different IT functions. Instead of playing territorial blame-games, the emphasis is placed back, quite rightly, on proper investigation and resolution.
And while even the most advanced AI-powered systems have yet to achieve the fully automated, intelligent NOC (although, watch this space), AIOps algorithms are, after proof-of-concept trials in most cases, more trusted by every part of the IT domain than oversight provided by any manual process, single-purpose platform or cluster of tools.
The objectivity of AI, and the fact that multiple, cross-departmental inputs feed into the silicon brain of the self-teaching routines, mean that issues flagged, with probable root cause, are trusted and appreciated by the skilled professionals tasked with keeping the IT function performing at its best.
In a previous article here on TechHQ, we looked at the BigPanda platform in the context of providing this type of AI-powered IT operational capability. And while it too has not yet developed the capability to use ML to fix issues anywhere in the enterprise’s IT provision, it combines insights gained by machine learning and automation to deliver incident alerts, open tickets, create chats and more. Suddenly, AIOps feels very real, and incredibly useful.
Download BigPanda’s guide “The Pragmatic Buyer’s Guide to AIOps Platforms” here.
25 September 2020
10 September 2020