Integrating Claude API in Production: What I Wish I'd Known Before
I've integrated the Claude API into several production projects: an AI spec SaaS, an artisan ERP with voice processing, and an automated quoting engine. Here's what I didn't anticipate, and what I do differently now.
Don’t Treat the API as Infallible
Classic first mistake: writing code that assumes the API always responds correctly. In production, timeouts exist, rate limits exist, 529 errors exist.
My base wrapper:
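import Anthropic, { APIError } from '@anthropic-ai/sdk';

const anthropic = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

// Retries transient failures with backoff. sleep() and exponentialBackoff() are small local helpers.
// Note: the SDK also retries some errors on its own (maxRetries option), so the two layers stack.
async function callClaude(
  params: Anthropic.MessageCreateParamsNonStreaming,
  retries = 3,
): Promise<Anthropic.Message> {
  for (let attempt = 0; attempt < retries; attempt++) {
    try {
      return await anthropic.messages.create(params);
    } catch (err) {
      // Give up on the last attempt, or on errors a retry won't fix.
      if (attempt === retries - 1 || !isRetryable(err)) throw err;
      await sleep(exponentialBackoff(attempt));
    }
  }
  throw new Error('unreachable');
}

// 429 = rate limited, 529 = API overloaded.
function isRetryable(err: unknown): boolean {
  return err instanceof APIError && (err.status === 429 || err.status === 529);
}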
Without retries and exponential backoff, a load spike on Anthropic's side can take down your feature.
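The wrapper calls `sleep` and `exponentialBackoff` without showing them. Here is a minimal sketch of what they could look like. The bodies, the 500 ms base delay, the 8 s cap, and the choice of "full jitter" are my assumptions, not values from the original wrapper.

```typescript
// Hypothetical helpers: the names match the wrapper, the bodies are an assumption.
const BASE_DELAY_MS = 500;  // upper bound of the first retry delay
const MAX_DELAY_MS = 8_000; // cap so late retries don't wait forever

// Full-jitter backoff: a random delay in [0, min(cap, base * 2^attempt)).
// `random` is injectable so the function can be tested deterministically.
function exponentialBackoff(attempt: number, random: () => number = Math.random): number {
  const ceiling = Math.min(MAX_DELAY_MS, BASE_DELAY_MS * 2 ** attempt);
  return Math.floor(random() * ceiling);
}

// Promise-based sleep for use with await.
function sleep(ms: number): Promise<void> {
  return new Promise((resolve) => setTimeout(resolve, ms));
}
```

The jitter matters when many requests fail at once: without it, every client retries at the same instant and recreates the spike.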
Streaming Changes UX, Not Just Perf
Streaming isn't just a performance optimization, it's a UX decision. In my experience, a user who sees text appearing progressively will tolerate an 8-second response. The same user staring at a spinner gives up after about 3.
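For Next.js, I stream the SDK's events straight from a route handler:
// app/api/generate/route.ts
import Anthropic from '@anthropic-ai/sdk';

const anthropic = new Anthropic();

export async function POST(req: Request) {
  const stream = anthropic.messages.stream({
    model: 'claude-sonnet-4-6',
    max_tokens: 1024,
    messages: [{ role: 'user', content: await req.text() }],
  });
  // toReadableStream() emits newline-delimited JSON events rather than SSE frames,
  // so the client reads them with fetch (or MessageStream.fromReadableStream), not EventSource.
  return new Response(stream.toReadableStream(), {
    headers: { 'Content-Type': 'application/x-ndjson' },
  });
}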
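On the client side, something has to turn that byte stream back into text. The sketch below assumes the route returns the output of the SDK's `toReadableStream()`, which is one JSON event per line, and that only `content_block_delta` events carrying a `text_delta` matter for display. `createDeltaParser` is a hypothetical helper name, not part of the SDK.

```typescript
// Minimal event shape we care about; other event types are ignored.
type StreamEvent = { type: string; delta?: { type: string; text?: string } };

// Returns a function that accepts raw text chunks (which may split a line
// anywhere) and returns the text deltas found in the lines completed so far.
function createDeltaParser(): (chunk: string) => string {
  let buffer = '';
  return (chunk: string): string => {
    buffer += chunk;
    const lines = buffer.split('\n');
    buffer = lines.pop() ?? ''; // keep the trailing partial line for the next chunk
    let text = '';
    for (const line of lines) {
      if (!line.trim()) continue;
      const event = JSON.parse(line) as StreamEvent;
      if (event.type === 'content_block_delta' && event.delta?.type === 'text_delta') {
        text += event.delta.text ?? '';
      }
    }
    return text;
  };
}
```

In the browser, you would decode each chunk of `response.body` with a `TextDecoder` and append whatever the parser returns to the UI.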
Model Costs Before You Scale
The cost of an LLM in production is not linear with users — it depends on context length, model choice, and usage patterns.
What I measure on each call:
const usage = response.usage;
// Sonnet pricing: $3 per million input tokens, $15 per million output tokens (result in USD)
const cost = (usage.input_tokens * 3 + usage.output_tokens * 15) / 1_000_000;
On an agent processing 100 documents per day with 4,000 average context tokens, the difference between Sonnet and Haiku came to roughly €150/month in my case. It's worth choosing the model per task rather than per project.
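To compare models before committing, the same per-call formula can be projected over a month. This is a rough sketch: the Haiku prices, the 500 output tokens per call, and the one-call-per-document workload are assumptions for illustration, so check current pricing and your own logs before relying on the numbers.

```typescript
// Prices in USD per million tokens. Assumed values: verify against current pricing.
type Pricing = { inputPerMTok: number; outputPerMTok: number };

const PRICING: { sonnet: Pricing; haiku: Pricing } = {
  sonnet: { inputPerMTok: 3, outputPerMTok: 15 },
  haiku: { inputPerMTok: 1, outputPerMTok: 5 }, // assumption, not from the article
};

interface Workload {
  callsPerDay: number;
  inputTokensPerCall: number;
  outputTokensPerCall: number;
  days: number;
}

// Projected cost in USD for the whole period.
function monthlyCost(p: Pricing, w: Workload): number {
  const calls = w.callsPerDay * w.days;
  const input = calls * w.inputTokensPerCall * p.inputPerMTok;
  const output = calls * w.outputTokensPerCall * p.outputPerMTok;
  return (input + output) / 1_000_000;
}

const workload: Workload = {
  callsPerDay: 100,
  inputTokensPerCall: 4000,
  outputTokensPerCall: 500, // assumption
  days: 30,
};
console.log(monthlyCost(PRICING.sonnet, workload), monthlyCost(PRICING.haiku, workload));
```

With one call per document this gives about $58.50 versus $19.50 a month. An agent that makes several calls per document multiplies both figures, which is where the gap starts to matter.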
Route Calls by Task Complexity
Pattern I use on all my projects now: complexity routing.
const model = task.complexity === 'high'
  ? 'claude-sonnet-4-6'
  : 'claude-haiku-4-5-20251001';
Entity extraction, classification, short summaries → Haiku. Complex reasoning, spec generation, critical analysis → Sonnet. On my projects this cuts costs by 60–70% with no degradation that users notice.
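The mapping from task to model can live in one place so call sites never hard-code a model name. A minimal sketch, where the `TaskKind` names and `pickModel` are hypothetical and simply mirror the task categories listed above:

```typescript
// Hypothetical task categories mirroring the split described in the text.
type TaskKind =
  | 'extraction'
  | 'classification'
  | 'short_summary'
  | 'reasoning'
  | 'spec_generation'
  | 'critical_analysis';

// Tasks cheap enough for the smaller model.
const SIMPLE_TASKS: ReadonlySet<TaskKind> = new Set<TaskKind>([
  'extraction',
  'classification',
  'short_summary',
]);

// Anything not explicitly marked simple goes to the larger model.
function pickModel(kind: TaskKind): string {
  return SIMPLE_TASKS.has(kind) ? 'claude-haiku-4-5-20251001' : 'claude-sonnet-4-6';
}
```

Defaulting unknown or new task kinds to the larger model is a deliberate choice here: a misrouted task costs a little more instead of silently producing worse output.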
Handle Long Context Without Breaking the Budget
On the artisan ERP, conversations can reach 20+ exchanges. Sending the full history on every call multiplies costs by 10 quickly.
My strategy: sliding window + context summary.
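import type Anthropic from '@anthropic-ai/sdk';

// Must be async because summarizing the older history is itself an API call.
async function buildContext(
  messages: Anthropic.MessageParam[],
  maxTokens = 8000,
): Promise<Anthropic.MessageParam[]> {
  const recent = messages.slice(-6); // last 3 exchanges always present
  const older = messages.slice(0, -6);
  // Only summarize once the older part of the history outgrows the budget.
  if (estimateTokens(older) > maxTokens) {
    const summary = await summarizeHistory(older);
    return [{ role: 'user', content: `[Conversation summary: ${summary}]` }, ...recent];
  }
  return messages;
}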
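`estimateTokens` only needs to be good enough to decide when to summarize. A rough sketch, assuming plain-string message content and the common rule of thumb of about 4 characters per token for English text. This is a heuristic, not a tokenizer; the API also offers a token-counting endpoint when exact numbers matter.

```typescript
// Simplified message shape for this sketch (string content only).
type ChatMessage = { role: 'user' | 'assistant'; content: string };

const CHARS_PER_TOKEN = 4; // rough rule of thumb, an assumption, not a real tokenizer

// Approximates the token count of a message list from its character count.
function estimateTokens(messages: ChatMessage[]): number {
  const chars = messages.reduce((sum, m) => sum + m.content.length, 0);
  return Math.ceil(chars / CHARS_PER_TOKEN);
}
```

Because the estimate only gates the summarization step, being off by 20–30% shifts when the summary kicks in but doesn't break anything.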
What I Do Systematically Now
- Log every call: prompt, model, tokens, latency, estimated cost
- Explicit timeout: 30s max, no request hanging indefinitely
- Validate output format before passing it to downstream code
- Cost alerts: notification if spend exceeds a daily threshold
- Test edge cases: empty input, very long, wrong language
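Of these habits, output validation is the one that most often gets skipped. Here is a minimal sketch for a model asked to return JSON. The `Quote` shape and `parseQuote` are hypothetical examples, not the schema from my quoting engine, and a schema library would do the same job with less code.

```typescript
// Hypothetical expected shape of the model's JSON answer.
interface Quote {
  customer: string;
  totalCents: number;
}

// Returns a typed Quote, or null if the model's output is not usable.
function parseQuote(raw: string): Quote | null {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return null; // not JSON at all (e.g. the model added prose around it)
  }
  if (typeof data !== 'object' || data === null) return null;
  const { customer, totalCents } = data as Record<string, unknown>;
  if (typeof customer !== 'string' || customer.length === 0) return null;
  if (typeof totalCents !== 'number' || !Number.isInteger(totalCents) || totalCents < 0) {
    return null;
  }
  return { customer, totalCents };
}
```

A `null` result is the signal to retry, fall back, or surface an error, instead of letting a malformed answer reach downstream code.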
Integrating an LLM API in production ultimately resembles any external API integration — with an extra layer of non-determinism that demands more testing and monitoring.
Stéphanie Caumont
AI Product Owner