Possible Root Cause: Accelerate Incident Resolution with Causal AI

It has been proven time and time again that business application outages are costly. The estimated cost of downtime can range from USD 50,000 to USD 500,000 per hour, and as companies move further into digitalization, that cost can rise even higher. Application complexity is also increasing, so site reliability engineers (SREs) can spend hours, sometimes days, identifying and resolving issues.

To alleviate this problem, we have introduced a new feature: Possible Root Cause, part of Instana®’s Intelligent Incident Resolution. When an incident is created, Instana uses causal AI to automatically analyze call statistics, topology and surrounding information, quickly and efficiently identifying possible causes of application errors. This allows SREs to resolve incidents by going directly to the cause of the problem rather than the symptom, saving hours of work and avoiding significant business costs.

Results in this space often depend on a well-known triple: the data, the assumptions and the methods applied.

Data

Instana captures 100% of call traces, retaining information about API calls, database queries, messaging and more across your infrastructure and applications. It also maintains infrastructure and application metrics at one-second granularity, as well as events, dynamic application and infrastructure topology, and additional relevant data points for users. This data granularity and availability allows causal AI to identify possible root causes with specific detail and accuracy.

Assumptions

One of the key assumptions of root cause analysis in most IT management tools is that the topology of the application is always available and complete at a very detailed level. For many IT management tools, this assumption fails because IT management processes are specialized and different teams own separate parts of multi-tiered applications. This often happens due to separation of duties between teams, the use of different monitoring tools across the organization, and various other management process-related reasons.

As a result, IT management tools may not have complete visibility into the topology of multi-tier applications. However, causal AI, combined with various algorithms, allows root causes to be identified even when data granularity is limited and only a partial topology is available. It can provide insight even when traces are noisy or missing.

Method

Causal AI allows you to combine disparate data sources, including calls, metrics, events and topology, to identify the root cause of errors affecting your applications. It can also build confidence in the entities it identifies by showing how and why each one was flagged as a possible cause. This gives SREs a clear starting point for locating and investigating problematic components.
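
To make the idea of combining call data with topology concrete, here is a minimal sketch of a simple dependency heuristic. It is not the causal AI that Instana uses. It assumes that errors propagate from callee to caller, so an unhealthy entity with no unhealthy downstream dependency is treated as a possible root cause. The function name, the tuple layout, the entity names and the 0.5 threshold are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def rank_root_cause_candidates(calls, error_threshold=0.5):
    """Rank possible root causes from call records (illustrative heuristic only).

    calls: iterable of (caller, callee, is_error) tuples (hypothetical layout).
    An entity is a candidate if its incoming error rate meets the threshold
    and none of the entities it calls are themselves unhealthy.
    """
    totals = defaultdict(int)      # incoming calls per callee
    errors = defaultdict(int)      # failed incoming calls per callee
    downstream = defaultdict(set)  # partial topology: caller -> callees

    for caller, callee, is_error in calls:
        totals[callee] += 1
        errors[callee] += int(is_error)
        downstream[caller].add(callee)

    error_rate = {entity: errors[entity] / totals[entity] for entity in totals}
    unhealthy = {e for e, rate in error_rate.items() if rate >= error_threshold}

    # Keep only unhealthy entities whose own dependencies look healthy.
    candidates = [
        e for e in unhealthy
        if not (downstream.get(e, set()) & unhealthy)
    ]
    return sorted(candidates, key=lambda e: error_rate[e], reverse=True)


# Hypothetical traces: one of three catalog replicas is failing,
# and the failures surface upstream in the web service.
sample_calls = [
    ("gateway", "web", True),
    ("gateway", "web", True),
    ("gateway", "web", False),
    ("web", "catalog-1", True),
    ("web", "catalog-1", True),
    ("web", "catalog-2", False),
    ("web", "catalog-3", False),
]
print(rank_root_cause_candidates(sample_calls))  # -> ['catalog-1']
```

Because the heuristic only uses the edges it actually observes in the calls, it still produces a ranking when the topology is partial, although a real causal model would weigh far more evidence than a single error-rate threshold.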

Example use case with Stan the SRE

Let’s take a look at the experience of Stan the SRE. Stan works for a small company that has deployed a Robot Shop application on a Kubernetes cluster monitored by Instana. The company recently turned on the Possible Root Cause feature and configured several application smart alerts.

One day he receives a message in a Slack notification channel, triggered by a smart alert configured for the company’s Robot Shop application. He notices that the Robot Shop application seems to be having performance issues. Stan clicks on the incident to find more information and begin his investigation.

The Incident page appears with a new Possible Root Causes panel. The page gives Stan more actionable information, but most importantly he now has a direction in which to begin his investigation. The possible root cause points to a specific process within the Robot Shop application: one instance (out of three replicas) of the catalog service.

He then clicks on the Possible Root Cause entity link, which takes him to the call analysis page, where he can immediately look at the erroneous calls that are affecting downstream latency.

He notices that all calls to this instance of the catalog pod have failed with a 503 (Service Unavailable) error. This leads him to check the infrastructure metrics, where he sees that the pod in question is low on available memory and has been running without a restart for quite some time. He restarts the pod as a short-term fix and flags it for review to prevent this from happening again.
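
The pattern Stan spots, where one replica fails every call while its siblings stay healthy, can be illustrated with a few lines of Python. This is only a sketch of the kind of per-pod breakdown involved, not how Instana computes it. The record layout (pod name, HTTP status code) and the pod names are hypothetical.

```python
from collections import Counter


def error_rate_by_pod(calls, status=503):
    """Return the share of calls per pod that ended with the given status.

    calls: iterable of (pod_name, http_status) pairs (hypothetical layout).
    """
    total = Counter()
    failed = Counter()
    for pod, code in calls:
        total[pod] += 1
        if code == status:
            failed[pod] += 1
    return {pod: failed[pod] / total[pod] for pod in total}


# Hypothetical call records for the three catalog replicas.
sample_calls = [
    ("catalog-a", 503), ("catalog-a", 503), ("catalog-a", 503),
    ("catalog-b", 200), ("catalog-b", 200),
    ("catalog-c", 200), ("catalog-c", 200),
]
print(error_rate_by_pod(sample_calls))
```

A rate of 1.0 for a single replica alongside 0.0 for the others points to a problem with that pod rather than with the catalog service as a whole, which matches Stan's decision to inspect that pod's memory metrics.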

Here you can see how much time Stan has saved in the incident investigation and resolution workflow. Without the Possible Root Cause feature, he would have had to start from the incident notification, navigate through application dashboards, manually inspect call traces, backtrack through them until he found the catalog service, and then identify which pod was the problem. He would then still need to determine whether this was the root cause and address it accordingly. The Possible Root Cause feature allows Stan to start remediation right away, saving both time and money.

Vision for the future

Over the coming months, we will expand our root cause capabilities beyond their current level. Identifying possible root causes already helps reduce the average time it takes to resolve application errors, and there are several further opportunities we plan to explore.

  • Improved explainability: Thanks to the use of causal AI, the algorithms are fully explainable, which makes it easy to build explainability tooling that tells SREs not only where their problem is but also why that conclusion was reached, all in an elegant and automated way. This allows us to build stories and experiences around identified root causes to create fast, reliable and intelligent resolutions.
  • Learn what happened, not only where it happened: We are continuously improving our solution to analyze what happened and how it happened, in addition to pinpointing where the root cause occurred. With further analysis, we aim to give the SRE an exact description of what is wrong within the faulty entity, rather than simply pointing out the faulty entity. This also paves the way for recommended resolution actions, a more powerful next step in the Intelligent Incident Resolution initiative.

We believe there is huge potential here and we are very proud of the work that has been done so far. This is a unique collaboration between engineering and IBM® Research that allows us to move quickly and solve problems immediately.

Note: The Possible Root Cause feature is currently in Technology Preview and is triggered for incidents generated by application or service-level smart alert configurations. The full version is coming soon!

Learn more about IBM Instana’s possible root cause capabilities and intelligent remediation pipeline.
