Deploying an AI agent without monitoring is like driving without a dashboard. Problems exist — you find out when the client complains, not before. Here’s what I set up systematically.
The 5 Critical Metrics
1. Steps per session
An agent that takes 12 turns for a task that should take 3 has a problem — broken tool, ambiguous prompt, or implicit loop.
from dataclasses import dataclass, field
import time
@dataclass
class AgentSession:
session_id: str
steps: int = 0
start_time: float = field(default_factory=time.monotonic)
tool_calls: list[str] = field(default_factory=list)
hit_max_steps: bool = False
@property
def duration(self) -> float:
return time.monotonic() - self.start_time
Alert threshold: average > 5 steps over the last 30 minutes.
2. MAX_STEPS session rate
If more than 2% of sessions hit the limit, there’s a structural bug in the agent logic.
3. P95 latency
Median latency hides outliers. The P95 (95th percentile) tells you what your least lucky users experience.
4. Error rate per tool
Each agent tool needs its own error counter. A tool that fails 10% of the time silently degrades the whole system.
5. Quality score (where measurable)
For agents with verifiable structured output (extraction, classification), compute an automatic success rate.
Detecting Drift
LLMs drift. A model update, a prompt change, or evolving input data can silently degrade results.
import statistics
def detect_drift(recent_scores: list[float], baseline_scores: list[float]) -> bool:
if not recent_scores or not baseline_scores:
return False
recent_mean = statistics.mean(recent_scores)
baseline_mean = statistics.mean(baseline_scores)
# Alert if degradation > 10%
return recent_mean < baseline_mean * 0.90
Always compare a recent window (last 24h) to a stable baseline (previous week).
Minimal Monitoring Stack
import logging
import json
def log_session(session: AgentSession, result: str | None, error: str | None) -> None:
record = {
"session_id": session.session_id,
"steps": session.steps,
"duration_s": round(session.duration, 2),
"tools_used": session.tool_calls,
"hit_max_steps": session.hit_max_steps,
"success": error is None,
"error": error,
}
logging.info(json.dumps(record))
One JSON log per session. Ingestible by any tool: Datadog, Grafana, or even a simple grep for errors.
Priority Alerts
| Alert | Threshold | Action |
|---|---|---|
| MAX_STEPS rate | > 2% over 1h | Investigate agent logic |
| Avg steps | > 5 | Check tools |
| Tool error rate | > 5% | Debug the tool |
| P95 latency | > 30s | Check timeouts |
| Quality drift | > 10% degradation | Compare with model changelog |
What I Deploy Systematically
On every production agent: a dashboard with these 5 metrics + an email/Slack alert on the first two. That’s 2 hours of setup that prevents 2am incidents.
Stéphanie Caumont
AI Product Owner · Learn more