Monitoring an AI Agent in Production: Metrics, Alerts, Drift

Jul 2, 2026 · 6 min

Deploying an AI agent without monitoring is like driving without a dashboard. Problems still happen, but you only find out when the client complains. Here’s what I set up systematically.

The 5 Critical Metrics

1. Steps per session

An agent that takes 12 turns for a task that should take 3 has a problem — broken tool, ambiguous prompt, or implicit loop.

from dataclasses import dataclass, field
import time

@dataclass
class AgentSession:
    session_id: str
    steps: int = 0
    start_time: float = field(default_factory=time.monotonic)
    tool_calls: list[str] = field(default_factory=list)
    hit_max_steps: bool = False

    @property
    def duration(self) -> float:
        return time.monotonic() - self.start_time

Alert threshold: average > 5 steps over the last 30 minutes.

2. MAX_STEPS session rate

If more than 2% of sessions hit the limit, there’s a structural bug in the agent logic.
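As a rough sketch, the rate can be computed from the `hit_max_steps` flag of each finished session. The `min_sessions` guard is an added assumption, not part of the rule above: it avoids alerting on a handful of sessions where a single hit already exceeds 2%.

```python
def max_steps_rate(hit_flags: list[bool]) -> float:
    """Fraction of sessions that stopped at the step limit."""
    if not hit_flags:
        return 0.0
    return sum(hit_flags) / len(hit_flags)


def max_steps_alert(
    hit_flags: list[bool],
    threshold: float = 0.02,
    min_sessions: int = 50,  # assumed guard against tiny samples
) -> bool:
    if len(hit_flags) < min_sessions:
        return False
    return max_steps_rate(hit_flags) > threshold
```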

3. P95 latency

Median latency hides outliers. The P95 (95th percentile) tells you what your least lucky users experience.
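One way to compute it with the standard library, as a sketch: `statistics.quantiles` with `n=100` returns 99 cut points, and the one at index 94 is the 95th percentile. The `p95` helper name and the choice of the `"inclusive"` method are assumptions.

```python
import statistics


def p95(latencies_s: list[float]) -> float:
    """95th percentile of session latencies, in seconds."""
    if len(latencies_s) < 2:
        raise ValueError("need at least two samples")
    # n=100 yields 99 cut points; index 94 is the 95th percentile.
    return statistics.quantiles(latencies_s, n=100, method="inclusive")[94]


# 90 fast sessions and 10 slow ones: the median looks healthy, the P95 does not.
latencies = [1.0] * 90 + [40.0] * 10
median_latency = statistics.median(latencies)  # 1.0
p95_latency = p95(latencies)                   # 40.0
```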

4. Error rate per tool

Each agent tool needs its own error counter. A tool that fails 10% of the time silently degrades the whole system.
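A minimal per-tool counter might look like the following. `ToolErrorTracker` is a hypothetical helper, and the default 5% threshold is borrowed from the alert table later in this post.

```python
from collections import Counter


class ToolErrorTracker:
    """Counts calls and failures separately for each tool (sketch)."""

    def __init__(self) -> None:
        self.calls: Counter[str] = Counter()
        self.errors: Counter[str] = Counter()

    def record(self, tool: str, ok: bool) -> None:
        self.calls[tool] += 1
        if not ok:
            self.errors[tool] += 1

    def error_rate(self, tool: str) -> float:
        calls = self.calls[tool]  # Counter returns 0 for unknown tools
        return self.errors[tool] / calls if calls else 0.0

    def failing_tools(self, threshold: float = 0.05) -> list[str]:
        """Tools whose error rate exceeds the threshold, sorted by name."""
        return sorted(t for t in self.calls if self.error_rate(t) > threshold)
```

Calling `record(tool_name, ok)` from the wrapper that executes each tool keeps the counting in one place.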

5. Quality score (where measurable)

For agents with verifiable structured output (extraction, classification), compute an automatic success rate.

Detecting Drift

LLMs drift. A model update, a prompt change, or evolving input data can silently degrade results.

import statistics

def detect_drift(recent_scores: list[float], baseline_scores: list[float]) -> bool:
    if not recent_scores or not baseline_scores:
        return False
    recent_mean = statistics.mean(recent_scores)
    baseline_mean = statistics.mean(baseline_scores)
    # Alert if degradation > 10%
    return recent_mean < baseline_mean * 0.90

Always compare a recent window (last 24h) to a stable baseline (previous week).
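A sketch of how the two windows could be cut from timestamped scores. It reads "previous week" as the 7 days immediately before the recent window, which is an interpretation, and `split_windows` is a hypothetical helper. The two lists it returns can be passed to `detect_drift`.

```python
DAY_S = 24 * 3600


def split_windows(
    scored: list[tuple[float, float]],  # (timestamp_s, score) pairs
    now: float,
    recent_s: float = DAY_S,
    baseline_s: float = 7 * DAY_S,
) -> tuple[list[float], list[float]]:
    """Split scores into a recent window and the baseline window before it."""
    recent_start = now - recent_s
    baseline_start = recent_start - baseline_s
    recent = [s for ts, s in scored if recent_start <= ts <= now]
    # The baseline ends where the recent window begins, so they never overlap.
    baseline = [s for ts, s in scored if baseline_start <= ts < recent_start]
    return recent, baseline
```

Keeping the windows disjoint matters: if recent scores leak into the baseline, a degradation pulls the baseline down with it and the alert fires later than it should.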

Minimal Monitoring Stack

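import json
import logging

# The root logger defaults to WARNING, so INFO records are dropped unless
# a level is configured. The bare message format keeps each line pure JSON.
logging.basicConfig(level=logging.INFO, format="%(message)s")
logger = logging.getLogger("agent.sessions")

def log_session(session: AgentSession, result: str | None, error: str | None) -> None:
    # One flat JSON object per session. The result text itself is not logged.
    record = {
        "session_id": session.session_id,
        "steps": session.steps,
        "duration_s": round(session.duration, 2),
        "tools_used": session.tool_calls,
        "hit_max_steps": session.hit_max_steps,
        "success": error is None,
        "error": error,
    }
    logger.info(json.dumps(record))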

One JSON log per session. Ingestible by any tool: Datadog, Grafana, or even a simple grep for errors.
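When grep is not enough, a few lines of Python can summarize the same log. This is a sketch: `summarize_sessions` is a hypothetical helper, and it tolerates a logger prefix before the JSON object as well as unrelated lines.

```python
import json


def summarize_sessions(lines: list[str]) -> dict[str, int]:
    """Count logged sessions and failed sessions from raw log lines."""
    sessions = failed = 0
    for line in lines:
        start = line.find("{")  # skip any prefix such as "INFO:root:"
        if start == -1:
            continue
        try:
            record = json.loads(line[start:])
        except json.JSONDecodeError:
            continue  # not a session record
        if not isinstance(record, dict) or "session_id" not in record:
            continue
        sessions += 1
        if not record.get("success", False):
            failed += 1
    return {"sessions": sessions, "failed": failed}
```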

Priority Alerts

Alert	Threshold	Action
MAX_STEPS rate	> 2% over 1h	Investigate agent logic
Avg steps	> 5 over 30 min	Check tools
Tool error rate	> 5%	Debug the tool
P95 latency	> 30s	Check timeouts
Quality drift	> 10% degradation	Compare with model changelog
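The table can also live in code as data, so thresholds are defined in one place. A sketch under assumptions: the metric keys are hypothetical names, rates are expressed as fractions (0.02 for 2%), and the time windows are left to whatever computes the metrics.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertRule:
    metric: str       # key in the metrics dict (hypothetical names)
    threshold: float  # alert fires when the value is strictly above this
    action: str


RULES = [
    AlertRule("max_steps_rate", 0.02, "Investigate agent logic"),
    AlertRule("avg_steps", 5.0, "Check tools"),
    AlertRule("tool_error_rate", 0.05, "Debug the tool"),
    AlertRule("p95_latency_s", 30.0, "Check timeouts"),
    AlertRule("quality_drop", 0.10, "Compare with model changelog"),
]


def triggered(metrics: dict[str, float]) -> list[AlertRule]:
    """Rules whose metric is present and above its threshold."""
    return [
        rule for rule in RULES
        if rule.metric in metrics and metrics[rule.metric] > rule.threshold
    ]
```

Metrics that are missing from the dict are skipped rather than treated as zero, so a broken collector does not look like a healthy agent by accident. It still deserves its own alert.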

What I Deploy Systematically

On every production agent: a dashboard with these 5 metrics + an email/Slack alert on the first two. That’s 2 hours of setup that prevents 2am incidents.

Stéphanie Caumont

AI Product Owner