AI copilots have reshaped how we build software, speeding up coding and reviews. But when systems break in production, it’s a different story. Incidents demand context, workflows, and the hard-won knowledge of the team on call. Not just code snippets. At Relvy, we’ve seen firsthand that AI debugging only succeeds when engineers remain firmly in control.
Early results are encouraging. From analyst reports and conversations with colleagues across the industry, we've heard of companies seeing measurable gains from piloting AI-powered debugging assistants; in one example, freeing up nearly half of their SRE capacity for higher-value work. That's a big deal, and it signals the start of real disruption in observability.
But it hasn't been frictionless. We've also heard the stories, and seen it ourselves: AI-powered debugging tools that generated "recommendations" so noisy or misleading that teams escalated requests to shut them down. According to MIT's GenAI Divide report, a staggering 95% of generative-AI business pilots fail to deliver meaningful ROI; only 5% reach production and succeed, largely due to brittle workflows, lack of contextual learning, and poor alignment with daily operations. The report points out that what trips companies up isn't the models (they're powerful) but the absence of sufficient feedback loops. These systems don't learn, don't evolve, and don't retain memory from past incidents the way they need to. Meanwhile, many employees bypass slow, canned enterprise tools and run their own AI assistants ("shadow AI") because those tools offer context and adaptability that official systems don't.

Interestingly, the study also noted that among successful AI deployments so far (mostly in infra/back-office applications rather than front-office ones like sales and marketing automation), 67% of vendor-built solutions succeed versus only 33% of internally built ones. This isn't shocking when you consider that successful vendor-built solutions are often implemented in close partnership with customers, maximizing time-to-value at a lower total cost and with better alignment to operational workflows. We talk about the advantages in both cost and time of buying vs. building in our previous blog: Build vs. Buy: Agentic AI Troubleshooting in the Nvidia Blackwell & Dynamo Era.
Whether an agentic AI-powered debugging platform is built internally or bought, one thing is clear: debugging isn't just about surfacing data. It's about navigating a workflow with adequate feedback loops involving the team members who are actually on call, facing the incidents and outages their company hits each day. Without such feedback loops, adoption fails. The final clear takeaway from the study: on top of everything else, the most successful agentic AI solutions are those that minimize setup burden, deliver fast time-to-value, and learn and improve over time.
Modern systems run on sprawling webs of microservices, logs, traces, dashboards, and infrastructure layers. Add in the explosion of auto-generated code, and incidents are only becoming more complex.
The key insight—reinforced both by our work at Relvy and across the industry—is simple: AI without context is useless. Engineers know this intuitively. That’s why developers painstakingly maintain “context files” like .claude or .cursor configs to guide their coding AIs.
Debugging is no different. An AI assistant must be given the same playbooks, dashboards, and investigation strategies that your team uses; otherwise it will flail. That's why we've seen so much value in human-in-the-loop AI-powered debugging that can import existing runbooks and postmortems directly. Teams can then transform their static documents into living instruction sets that evolve over time, a huge step toward making debugging AI-guided, but team-driven.
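To make that concrete, here is a minimal sketch of what a runbook section might look like once distilled into a structured instruction set. The field names, the InstructionSet shape, and the example queries are purely illustrative assumptions, not Relvy's actual schema.

```python
# Hypothetical sketch: a runbook distilled into a structured instruction set
# an AI debugging assistant could follow. Names and fields are illustrative.
from dataclasses import dataclass, field


@dataclass
class InstructionStep:
    description: str      # what to check, in plain language
    data_source: str      # where to look: logs, metrics, or traces
    reference_query: str  # a known-good query the team already uses


@dataclass
class InstructionSet:
    name: str
    trigger: str          # the alert or symptom this set applies to
    steps: list[InstructionStep] = field(default_factory=list)


# An instruction set a product team might distill from its latency runbook.
checkout_latency = InstructionSet(
    name="checkout-latency",
    trigger="p99 latency alert on checkout-service",
    steps=[
        InstructionStep(
            description="Check for a spike in 5xx responses from the payment gateway",
            data_source="logs",
            reference_query="service:checkout-service status:5xx",
        ),
        InstructionStep(
            description="Compare pod restarts against the last 24 hours",
            data_source="metrics",
            reference_query="sum(kube_pod_container_status_restarts_total{namespace='checkout'})",
        ),
    ],
)
```

Because the instructions live as data rather than prose, the team can review, version, and refine them after every incident, which is what keeps the document "living" rather than static.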
We believe the missing piece in many AI pilots is the human. A path forward is for human-empowered platforms to keep engineers in control from day one:
Instruction sets, not black boxes - Teams create and refine instructions that tell the AI how they debug. These can be written manually or imported from existing runbooks, postmortems, or Confluence docs.
Learning mode before autopilot - Generate an investigation plan first, which engineers can approve, edit, or reject before any queries run (see the sketch after this list).
Centralized place for collaboration - Every incident should get a place where AI and engineers can co-investigate, step by step. Queries are visible, modifiable, and reusable for future incidents.
Team-specific workspaces - Each team, whether product, platform, or infra, should have its own workspace with configuration knobs to teach the agentic AI debugging platform how they work.
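As a rough illustration of the learning-mode flow, here is a minimal sketch. The propose_plan, review, and investigate functions are hypothetical placeholders, not a real Relvy API; the point is simply that the assistant drafts a plan and nothing executes until an engineer signs off on each step.

```python
# Minimal sketch of "learning mode before autopilot".
# Everything here is a placeholder to illustrate the flow.
from dataclasses import dataclass


@dataclass
class PlanStep:
    description: str
    query: str


def propose_plan(incident: str) -> list[PlanStep]:
    # In a real system the assistant would draft this from the team's
    # instruction sets; hard-coded here for illustration.
    return [
        PlanStep("Check error rate on the failing service",
                 "service:checkout-service status:5xx"),
        PlanStep("Look for recent deploys that correlate with the alert",
                 "deploys since:-1h service:checkout-service"),
    ]


def review(plan: list[PlanStep]) -> list[PlanStep]:
    # Engineers see every proposed step before any query runs and can
    # approve or drop each one (an editing path would work the same way).
    approved = []
    for step in plan:
        answer = input(f"Run '{step.description}'? [y/n] ").strip().lower()
        if answer == "y":
            approved.append(step)
    return approved


def investigate(incident: str) -> None:
    plan = propose_plan(incident)
    for step in review(plan):
        # Only approved queries execute; they stay visible and reusable.
        print(f"Executing: {step.query}")


if __name__ == "__main__":
    investigate("p99 latency alert on checkout-service")
```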
This approach mirrors the way you’d onboard a new engineer: have them shadow you, show them your workflows, let them make mistakes in a safe environment, and gradually give them autonomy.
One question we hear: “Is this only useful for product teams? What about infra?”
We reflected on incidents like Kubernetes IP exhaustion or global infrastructure outages, and the answer is that such a human-in-the-loop platform should be able to help.
Start with the platform/infrastructure team that faces global or infrastructure-related incidents, and the product team that focuses on issues within their service or its dependencies. Each team gets its own workspace and instruction sets, giving both the ability to replicate their workflows. For infrastructure teams specifically, human-in-the-loop debugging assists by adding new data sources on request to cover blind spots, ensuring the AI SRE sees what your engineers see.

This dual-team approach gives customers the full picture: Does such a setup help accelerate service-level debugging? Yes. Can it also provide value for infra-level investigations? Increasingly, yes.
AI SREs and AI On-Call Engineers following the approach above can excel at application-level incidents: latency spikes, 500 storms, cascading service failures. In these cases, AI can surface relevant logs, traces, and dashboards faster than any human could, cutting time-to-resolution dramatically.
Infra incidents are trickier. Some issues, like networking problems affecting Kubernetes clusters, still require vendor escalation or deep human expertise. But even here, such systems streamline triage, automate context-gathering, and ensure your team starts closer to the root cause.
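As a hedged example of what automated context-gathering might look like for a Kubernetes networking incident, here is a minimal sketch. It assumes kubectl is installed and configured, and the namespace is an illustrative placeholder rather than anything specific to a real cluster.

```python
# Hypothetical sketch: gather first-pass context for a Kubernetes networking
# incident so the on-call engineer starts closer to the root cause.
# Assumes kubectl is installed and pointed at the affected cluster.
import subprocess


def kubectl(*args: str) -> str:
    result = subprocess.run(
        ["kubectl", *args], capture_output=True, text=True, check=False
    )
    return result.stdout


def gather_context(namespace: str) -> dict[str, str]:
    return {
        # Pods stuck outside Running/Ready often point at scheduling or IP issues.
        "pods": kubectl("get", "pods", "-n", namespace, "-o", "wide"),
        # Recent warning events frequently surface IP exhaustion or CNI errors.
        "events": kubectl("get", "events", "-n", namespace,
                          "--field-selector", "type=Warning",
                          "--sort-by", ".lastTimestamp"),
        # Node conditions reveal pressure or NotReady states behind cluster-wide symptoms.
        "nodes": kubectl("get", "nodes", "-o", "wide"),
    }


if __name__ == "__main__":
    for name, output in gather_context("payments").items():
        print(f"--- {name} ---\n{output}")
```

None of this replaces vendor escalation or deep expertise; it simply hands the on-call engineer a consistent starting bundle of evidence instead of a blank terminal.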
AI debugging is poised to reshape observability, and as noted above, it's quietly starting to, much the way AI copilots reshaped coding. Certainly, in both cases a smaller team of engineers may be an option, but we are seeing that the future is not about replacing engineers outright; it's about amplifying them.
By combining structured context (runbooks, dashboards, reference queries) with human-in-the-loop workflows, AI SRE systems can turn debugging into a partnership: AI handles the heavy lifting, while engineers provide oversight, intuition, and correction.
As we wrote in Building AI Agents on Enterprise Data, context isn't optional; it's the foundation. Get it right, and you unlock faster investigations, stronger adoption, and teams that feel empowered rather than replaced. The future of debugging isn't AI alone. It's AI + human expertise, working together.
At Relvy, we believe engineers - not black-box AI - should stay in control. Incidents are high-stakes, so every step must be transparent, explainable, and guided by the team on call.
We're building AI-powered debugging that eliminates manual, time-consuming investigations, cutting resolution time while keeping humans in the loop. With Relvy, teams reduce failures, improve uptime, and trust their tools as much as their teammates.
Our mission: make reliability faster, smarter, and human-centered.