“We should fine-tune the model” is a phrase I often hear when results aren’t satisfactory. Most of the time, it’s not the right solution. Here’s how to decide.
The fine-tuning reflex, and why it’s often premature
Fine-tuning is often presented as the way to “customize” an LLM for your domain. That’s true, but it is also expensive and time-consuming, and it introduces technical debt.
Before fine-tuning, the real question is: have you truly exhausted what prompt engineering can do?
In my experience, about 80% of the cases where fine-tuning comes up are resolved with a better system prompt, well-chosen few-shot examples, or better-structured input.
When prompt engineering is enough
Output format. You want the model to always respond in JSON with a precise structure → prompt engineering with explicit schema and examples.
Tone and style. You want an agent that speaks like your brand → prompt engineering with examples of desired phrasings.
Business rules. You want the agent to apply domain-specific rules → prompt engineering with rules explicitly listed.
Edge case behavior. You want the agent to say “I don’t know” rather than hallucinate → prompt engineering with explicit uncertainty handling instructions.
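To make the four cases above concrete, here is a minimal sketch of a system prompt that combines an explicit schema, listed rules, few-shot examples, and an uncertainty instruction, plus a check on the reply. Everything in it is an assumption for illustration: the support-triage task, the schema, the rules, and the helper names `build_system_prompt` and `parse_reply` are hypothetical, and no particular LLM API is implied.

```python
import json

# Hypothetical schema for a support-ticket triage task (illustration only).
SCHEMA = {
    "category": "string",
    "urgency": "low | medium | high | unknown",
    "summary": "string",
}

# Hypothetical business rules, including explicit uncertainty handling.
RULES = [
    "Respond with a single JSON object and nothing else.",
    "Use only the keys listed in the schema.",
    'If the request is unclear, set "category" and "urgency" to "unknown" '
    "instead of guessing.",
]

# Hypothetical few-shot examples: (user input, expected JSON output).
FEW_SHOT = [
    ("My invoice shows the wrong VAT rate.",
     {"category": "billing", "urgency": "medium",
      "summary": "Wrong VAT rate on invoice"}),
    ("asdf qwerty ???",
     {"category": "unknown", "urgency": "unknown",
      "summary": "Request could not be understood"}),
]


def build_system_prompt(schema, rules, examples):
    """Assemble schema, rules, and few-shot examples into one system prompt."""
    lines = [
        "You are a support triage assistant.",
        "",
        "Output schema:",
        json.dumps(schema, indent=2),
        "",
        "Rules:",
    ]
    lines += [f"{i}. {rule}" for i, rule in enumerate(rules, start=1)]
    lines += ["", "Examples:"]
    for user_input, output in examples:
        lines += [f"Input: {user_input}", f"Output: {json.dumps(output)}", ""]
    return "\n".join(lines).rstrip()


def parse_reply(reply, schema):
    """Return the reply as a dict if it is a JSON object with exactly the
    schema's keys; otherwise return None so the caller can retry or fall back."""
    try:
        data = json.loads(reply)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or set(data) != set(schema):
        return None
    return data
```

The string returned by `build_system_prompt(SCHEMA, RULES, FEW_SHOT)` would be sent as the system message to whichever model you use, and `parse_reply` gives you a cheap way to count how often the format is actually respected before concluding that prompting is not enough.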
When fine-tuning makes sense
Very high call volume. A smaller fine-tuned model can replace a large generic model for repetitive tasks, at an inference cost that can be around 10x lower. At 10M requests/month, the savings can be massive.
Highly specialized task with lots of data. If you have 10,000+ high-quality examples in a very specific domain, fine-tuning can outperform generic models.
Critical latency. A smaller fine-tuned model responds faster. For real-time applications, this can make the difference.
Privacy. If you fine-tune and host your own model, your production data doesn’t leave your infrastructure.
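The volume argument is worth checking with a quick back-of-the-envelope calculation before committing. The sketch below uses the 10M requests/month and the 10x cost ratio mentioned above; the per-request price and the fixed monthly overhead (training runs, hosting, maintenance) are made-up placeholders, so substitute your own figures.

```python
REQUESTS_PER_MONTH = 10_000_000             # volume quoted in the text
LARGE_COST_PER_1K = 2.00                    # hypothetical: $ per 1,000 requests, large generic model
SMALL_COST_PER_1K = LARGE_COST_PER_1K / 10  # the "10x lower" ratio from the text
FIXED_MONTHLY_OVERHEAD = 5_000.00           # hypothetical: retraining, hosting, maintenance ($/month)


def inference_cost(requests, cost_per_1k):
    """Monthly inference cost in dollars for a given request volume."""
    return requests / 1_000 * cost_per_1k


def net_monthly_savings(requests, large_per_1k, small_per_1k, overhead):
    """Savings from switching to the smaller model, after fixed overhead."""
    return (inference_cost(requests, large_per_1k)
            - inference_cost(requests, small_per_1k)
            - overhead)


def break_even_requests(large_per_1k, small_per_1k, overhead):
    """Monthly request volume at which the fine-tuned model pays for itself."""
    return overhead / (large_per_1k - small_per_1k) * 1_000


if __name__ == "__main__":
    savings = net_monthly_savings(
        REQUESTS_PER_MONTH, LARGE_COST_PER_1K, SMALL_COST_PER_1K, FIXED_MONTHLY_OVERHEAD
    )
    threshold = break_even_requests(
        LARGE_COST_PER_1K, SMALL_COST_PER_1K, FIXED_MONTHLY_OVERHEAD
    )
    print(f"Net savings per month: ${savings:,.0f}")
    print(f"Break-even volume: {threshold:,.0f} requests/month")
```

With these placeholder numbers, inference drops from $20,000 to $2,000 per month, and the fine-tuned model only pays for itself above roughly 2.8M requests/month. Below that volume, the overhead outweighs the inference savings.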
The decision process
Before discussing fine-tuning, answer these questions:
- Do you have at least 1,000 high-quality input/output examples for training?
- Do you have a team capable of maintaining the fine-tuning pipeline over time?
- Have you first tried optimizing the prompt with few-shot examples?
- Have you measured that the generic model is insufficient on your real cases?
If you answer no to any of these, fine-tuning is premature.
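The checklist can be written down as a small gate, which makes it easy to record the answers for a given project. This is only a sketch of the four questions above: the class and function names are hypothetical, and the 1,000-example threshold is the one stated in the list.

```python
from dataclasses import dataclass

MIN_QUALITY_EXAMPLES = 1_000  # threshold taken from the checklist


@dataclass
class FineTuningReadiness:
    """Answers to the four checklist questions for one project."""
    quality_examples: int                 # number of high-quality input/output pairs
    can_maintain_pipeline: bool           # a team can own the pipeline over time
    tried_few_shot_prompting: bool        # prompt optimization was attempted first
    measured_baseline_insufficient: bool  # generic model measured on real cases


def blockers(r: FineTuningReadiness) -> list[str]:
    """Return the checklist items that are not satisfied."""
    issues = []
    if r.quality_examples < MIN_QUALITY_EXAMPLES:
        issues.append("fewer than 1,000 quality examples")
    if not r.can_maintain_pipeline:
        issues.append("no team to maintain the fine-tuning pipeline")
    if not r.tried_few_shot_prompting:
        issues.append("prompt optimization with few-shot examples not tried")
    if not r.measured_baseline_insufficient:
        issues.append("generic model not measured on real cases")
    return issues


def is_premature(r: FineTuningReadiness) -> bool:
    """Fine-tuning is premature if any single checklist item fails."""
    return bool(blockers(r))
```

For example, `blockers(FineTuningReadiness(400, True, True, False))` lists two open items, so the conversation goes back to data collection and measurement rather than to training.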
Stéphanie Caumont
AI Product Owner