Prompt Engineering in Production: What Changes When It's Real Code
There’s a fundamental difference between writing prompts to explore an LLM and writing prompts that run in production 24/7 with real users.
The first time one of my agents broke in prod — incoherent response, infinite loop on an edge case — I realized my approach was still that of an explorer, not an engineer.
Reliability Is Non-Negotiable
In experimentation mode, a weird response every 10 calls is interesting. In production, it’s a bug.
First principle I internalized: treat prompts like code that can crash. Which means:
- Define exactly the failure conditions (when the LLM should say “I don’t know”)
- Test with malformed, contradictory, and empty inputs
- Have an explicit degraded output path
A production prompt always has a fallback instruction. “If you don’t have the necessary information, respond only with: INSUFFICIENT_DATA”. No interpretation, no invention.
Structure Outputs for the Code That Consumes Them
In exploration, you read responses by hand. In production, code parses the output.
I systematically enforce a JSON structure for anything that needs to be consumed downstream:
Respond ONLY with this JSON, no markdown, no text before or after:
{
"action": "create|update|skip",
"confidence": 0.0-1.0,
"reason": "one sentence max"
}
The confidence field is underrated. It lets you route edge cases to human review rather than letting the LLM decide alone.
Version Prompts Like Code
Changing a system prompt in production without versioning is like merging without a diff.
My structure:
prompts/
├── v1/
│ ├── system.md
│ └── classify.md
├── v2/
│ └── system.md
└── current -> v2/
Each version in production is tagged. If a regression appears, rollback in 30 seconds.
Context Length Is a Cost Variable
What tutorials don’t tell you: every token in the system prompt costs money on every call. On a looping agent, a 2,000-token prompt × 10,000 calls/day shows up in the bill.
I optimize production prompts like I optimize SQL: remove what’s unused, compress redundant instructions. A production prompt is the result of multiple compression iterations, not the first draft.
Models Are Not Interchangeable
Haiku, Sonnet, and Opus don’t behave the same on the same prompt. A finely tuned Sonnet prompt can produce different results on Haiku — not necessarily worse, just different.
I maintain separate configurations per target model. When a new version drops, I revalidate against my test suite before migrating.
What This Changes in Practice
Prompt engineering in production is closer to software engineering than creative writing:
- Regression tests on a representative set of examples
- Monitoring outputs (rate of off-format responses, fallback rate)
- Versioning and changelogs for prompts
- Review before any production change
It’s not more complex than maintaining an API — it’s just very different from what most introductory tutorials show.
Stéphanie Caumont
AI Product Owner · Learn more