Prompt Engineering in Production: What Changes When It's Real Code

Jul 2, 20266 min

There’s a fundamental difference between writing prompts to explore an LLM and writing prompts that run in production 24/7 with real users.

The first time one of my agents broke in prod — incoherent response, infinite loop on an edge case — I realized my approach was still that of an explorer, not an engineer.

Reliability Is Non-Negotiable

In experimentation mode, a weird response every 10 calls is interesting. In production, it’s a bug.

First principle I internalized: treat prompts like code that can crash. Which means:

Define exactly the failure conditions (when the LLM should say “I don’t know”)
Test with malformed, contradictory, and empty inputs
Have an explicit degraded output path

A production prompt always has a fallback instruction. “If you don’t have the necessary information, respond only with: INSUFFICIENT_DATA”. No interpretation, no invention.

Structure Outputs for the Code That Consumes Them

In exploration, you read responses by hand. In production, code parses the output.

I systematically enforce a JSON structure for anything that needs to be consumed downstream:

Respond ONLY with this JSON, no markdown, no text before or after:
{
  "action": "create|update|skip",
  "confidence": 0.0-1.0,
  "reason": "one sentence max"
}

The confidence field is underrated. It lets you route edge cases to human review rather than letting the LLM decide alone.

Version Prompts Like Code

Changing a system prompt in production without versioning is like merging without a diff.

My structure:

prompts/
├── v1/
│   ├── system.md
│   └── classify.md
├── v2/
│   └── system.md
└── current -> v2/

Each version in production is tagged. If a regression appears, rollback in 30 seconds.

Context Length Is a Cost Variable

What tutorials don’t tell you: every token in the system prompt costs money on every call. On a looping agent, a 2,000-token prompt × 10,000 calls/day shows up in the bill.

I optimize production prompts like I optimize SQL: remove what’s unused, compress redundant instructions. A production prompt is the result of multiple compression iterations, not the first draft.

Models Are Not Interchangeable

Haiku, Sonnet, and Opus don’t behave the same on the same prompt. A finely tuned Sonnet prompt can produce different results on Haiku — not necessarily worse, just different.

I maintain separate configurations per target model. When a new version drops, I revalidate against my test suite before migrating.

What This Changes in Practice

Prompt engineering in production is closer to software engineering than creative writing:

Regression tests on a representative set of examples
Monitoring outputs (rate of off-format responses, fallback rate)
Versioning and changelogs for prompts
Review before any production change

It’s not more complex than maintaining an API — it’s just very different from what most introductory tutorials show.

Stéphanie Caumont

AI Product Owner · Learn more

← All articles Contact me