How to evaluate an AI agent before going to production

May 12, 20258 min

The question everyone avoids until the last moment: how do you know if your AI agent is good? Not good in the sense of “it answers my questions,” but good in the sense of “I can put it in production with confidence.”

Here’s the framework I use.

Why evaluation is often rushed

In most AI projects I’ve seen, evaluation looks like this: you manually test a few cases, it seems to work, you deploy. Then you discover problems in production.

The problem is that LLMs have probabilistic behavior. An agent that “works” on your 5 test cases can fail on 20% of real cases. Without systematic measurement, you won’t know until you’re in production.

The 4 dimensions to evaluate

1. Functional accuracy

Does the agent do what it’s supposed to do?

To measure this, you need an evaluation dataset: at minimum 50 (input, expected output) pairs. Not easy cases you invented — real cases, representative of what your users will send, including ambiguous cases and edge cases.

2. Robustness

How does the agent behave with unexpected inputs?

Malformed inputs (truncated text, special characters, mixed languages)
Out-of-scope inputs (requests the agent isn’t supposed to handle)
Adversarial inputs (attempts to derail the agent)

3. Consistency

Does the agent give the same answer to equivalent inputs?

Test the same question rephrased 5 different ways. If responses vary significantly, your system prompt isn’t precise enough.

4. Latency and cost

Often forgotten in functional evaluation, but critical in production.

P50 and P95 latency (P95 is often 3x P50 — that’s what your slow users will experience)
Average cost per request × estimated volume = monthly budget

The minimum viable evaluation

Before production, these are non-negotiable:

50 representative test cases with human-validated expected outputs. Not generated by the AI itself.

An automated regression test: every time the system prompt changes, the 50 cases run automatically and alert you if the score drops.

Explicit thresholds: “We deploy if and only if precision score is > X% and critical error rate is < Y%.”

What this changes in practice

Having an evaluation system profoundly changes how you work on an agent — you can iterate with confidence, detect regressions before they reach users, and have an honest conversation with stakeholders about what the agent knows and doesn’t know.

It’s the most thankless work on the project. It’s also the most important.

SC

Stéphanie Caumont

AI Product Owner · Learn more