Integrating Claude API in Production: What I Wish I'd Known Before

Jul 2, 2026 · 7 min

I’ve integrated the Claude API into several production projects: an AI spec SaaS, an artisan ERP with voice processing, and an automated quoting engine. Here’s what I didn’t anticipate, and what I do differently now.

Don’t Treat the API as Infallible

Classic first mistake: writing code that assumes the API always responds correctly. In production, timeouts exist, rate limits (429) exist, and overloaded errors (529) exist.

My base wrapper:

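import Anthropic from '@anthropic-ai/sdk';

const anthropic = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

// Retries transient failures with exponential backoff and rethrows everything else.
// The SDK has its own maxRetries setting; this wrapper's retries come on top of it.
// sleep and exponentialBackoff are small helpers defined elsewhere.
async function callClaude(
  params: Anthropic.Messages.MessageCreateParamsNonStreaming,
  retries = 3,
): Promise<Anthropic.Messages.Message> {
  for (let attempt = 0; attempt < retries; attempt++) {
    try {
      return await anthropic.messages.create(params);
    } catch (err) {
      if (attempt === retries - 1) throw err; // out of attempts
      if (isRetryable(err)) {
        await sleep(exponentialBackoff(attempt));
        continue;
      }
      throw err; // e.g. 400 or 401: retrying will not help
    }
  }
  throw new Error('unreachable');
}

// Rate limits (429), overload (529), and dropped or timed-out connections are worth retrying.
function isRetryable(err: unknown): boolean {
  if (err instanceof Anthropic.APIConnectionError) return true; // includes timeouts
  if (!(err instanceof Anthropic.APIError)) return false;
  return err.status === 429 || err.status === 529;
}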

Without retries and exponential backoff, an Anthropic load spike can take down your feature.
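The wrapper calls sleep and exponentialBackoff without showing them. Here is a minimal sketch of what they could look like, assuming an "equal jitter" strategy. The base delay, the cap, and the injectable random parameter are illustrative choices, not values from the original code.

```typescript
// Resolves after the given number of milliseconds.
function sleep(ms: number): Promise<void> {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

// Exponential backoff with equal jitter: the delay ceiling doubles on each
// attempt up to capMs; half of it is fixed and half is randomized so that
// many clients do not retry at the same instant.
// baseMs, capMs and random are assumptions made for this sketch.
function exponentialBackoff(
  attempt: number,
  baseMs = 500,
  capMs = 8000,
  random: () => number = Math.random,
): number {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  const half = ceiling / 2;
  return half + Math.floor(random() * half);
}
```

Passing random as a parameter keeps the function deterministic in tests; in production the default Math.random is enough.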

Streaming Changes UX, Not Just Perf

Streaming isn’t just a performance optimization; it’s a UX decision. In my experience, a user who sees text appearing progressively tolerates an 8-second response. The same user staring at a spinner gives up after about 3 seconds.

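For Next.js, I forward the SDK stream straight from a route handler:

// app/api/generate/route.ts
import Anthropic from '@anthropic-ai/sdk';

const anthropic = new Anthropic();

export async function POST(req: Request) {
  const stream = anthropic.messages.stream({
    model: 'claude-sonnet-4-6',
    max_tokens: 1024,
    messages: [{ role: 'user', content: await req.text() }],
  });

  // toReadableStream() writes one JSON-encoded event per line rather than
  // SSE "data:" frames; the browser can rebuild the stream with
  // MessageStream.fromReadableStream() or parse the lines itself.
  return new Response(stream.toReadableStream(), {
    headers: { 'Content-Type': 'application/x-ndjson' },
  });
}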
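On the browser side, something still has to turn that byte stream into progressively rendered text. Here is a hedged sketch, assuming the route forwards stream.toReadableStream() and that each line of the body is one JSON event, which is how the SDK versions I know of encode it. The function name readTextDeltas and the onText callback are hypothetical.

```typescript
// Minimal shape of the events this sketch cares about; other fields are ignored.
type StreamEvent = {
  type: string;
  delta?: { type: string; text?: string };
};

// Reads newline-delimited JSON events from a byte stream and calls onText
// for every text delta. Returns the full concatenated text at the end.
async function readTextDeltas(
  body: ReadableStream<Uint8Array>,
  onText: (chunk: string) => void,
): Promise<string> {
  const reader = body.getReader();
  const decoder = new TextDecoder();
  let buffer = '';
  let full = '';

  const handleLine = (line: string): void => {
    if (line.trim() === '') return;
    const event = JSON.parse(line) as StreamEvent;
    if (event.type === 'content_block_delta' && event.delta?.type === 'text_delta') {
      const text = event.delta.text ?? '';
      full += text;
      onText(text);
    }
  };

  while (true) {
    const { value, done } = await reader.read();
    if (done) break;
    buffer += decoder.decode(value, { stream: true });
    const lines = buffer.split('\n');
    buffer = lines.pop() ?? ''; // keep the trailing partial line for the next chunk
    lines.forEach(handleLine);
  }
  buffer += decoder.decode(); // flush any bytes still held by the decoder
  handleLine(buffer);
  return full;
}
```

In a React component this would typically be called as readTextDeltas(res.body, (t) => setText((prev) => prev + t)) after a fetch to the route, with a null check on res.body.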

Model Costs Before You Scale

The cost of an LLM in production does not scale linearly with the number of users. It depends on context length, model choice, and usage patterns.

What I measure on each call:

const usage = response.usage;
// Sonnet pricing in USD per million tokens: 3 for input, 15 for output
const cost = (usage.input_tokens * 3 + usage.output_tokens * 15) / 1_000_000;

On an agent processing 100 documents per day with 4,000 average context tokens, the difference between Sonnet and Haiku is roughly €150/month. It pays to pick the model per task rather than per project.
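To compare models before scaling, the per-call formula can be generalized into a small estimator. This is a sketch under stated assumptions: the Sonnet rates come from the snippet above, the Haiku rates are an assumption to check against the current pricing page, and the names PRICES, estimateCost, and monthlyCost are hypothetical.

```typescript
type Usage = { input_tokens: number; output_tokens: number };
type Price = { inputPerMTok: number; outputPerMTok: number }; // USD per million tokens

const PRICES: Record<string, Price> = {
  sonnet: { inputPerMTok: 3, outputPerMTok: 15 }, // rates used in the article
  haiku: { inputPerMTok: 1, outputPerMTok: 5 },   // assumption: verify before relying on it
};

// Cost of a single call in USD.
function estimateCost(usage: Usage, price: Price): number {
  return (
    (usage.input_tokens * price.inputPerMTok +
      usage.output_tokens * price.outputPerMTok) / 1_000_000
  );
}

// Projected monthly cost for a steady number of similar calls per day.
function monthlyCost(callsPerDay: number, usage: Usage, price: Price, days = 30): number {
  return estimateCost(usage, price) * callsPerDay * days;
}
```

With 4,000 input tokens and a made-up 500 output tokens per call, 100 calls a day come to about $58.50 a month at the Sonnet rates and about $19.50 at the assumed Haiku rates. An agent that makes several calls per document multiplies both figures accordingly.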

Route Calls by Task Complexity

Pattern I use on all my projects now: complexity routing.

const model = task.complexity === 'high'
  ? 'claude-sonnet-4-6'
  : 'claude-haiku-4-5-20251001';

Entity extraction, classification, short summaries → Haiku. Complex reasoning, spec generation, critical analysis → Sonnet. In my projects this cuts costs by 60–70% without visible degradation for the user.
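The same routing can be made explicit with a lookup table, so each new task type forces a conscious model choice. This is only a sketch: the TaskKind names are hypothetical labels for the categories listed above, and the model IDs are the ones used in the snippet.

```typescript
type TaskKind =
  | 'entity_extraction'
  | 'classification'
  | 'short_summary'
  | 'complex_reasoning'
  | 'spec_generation'
  | 'critical_analysis';

const HAIKU = 'claude-haiku-4-5-20251001';
const SONNET = 'claude-sonnet-4-6';

// Record<TaskKind, ...> makes the compiler complain when a new task kind
// is added without deciding which model should handle it.
const MODEL_BY_TASK: Record<TaskKind, string> = {
  entity_extraction: HAIKU,
  classification: HAIKU,
  short_summary: HAIKU,
  complex_reasoning: SONNET,
  spec_generation: SONNET,
  critical_analysis: SONNET,
};

function pickModel(kind: TaskKind): string {
  return MODEL_BY_TASK[kind];
}
```

A table like this also gives one place to change when a task turns out to need the larger model after all.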

Handle Long Context Without Breaking the Budget

On the artisan ERP, conversations can reach 20+ exchanges. Sending the full history on every call quickly multiplies costs tenfold.

My strategy: sliding window + context summary.

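import Anthropic from '@anthropic-ai/sdk';

type MessageParam = Anthropic.Messages.MessageParam;

// Must be async: summarizing the older history is itself an API call.
async function buildContext(messages: MessageParam[], maxTokens = 8000): Promise<MessageParam[]> {
  const recent = messages.slice(-6); // last 3 exchanges always present
  const older = messages.slice(0, -6);

  // Only pay for a summary once the older history outgrows the budget.
  if (estimateTokens(older) > maxTokens) {
    const summary = await summarizeHistory(older);
    return [{ role: 'user', content: `[Conversation summary: ${summary}]` }, ...recent];
  }
  return messages;
}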
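buildContext relies on estimateTokens and summarizeHistory, which are not shown. Below is a self-contained sketch of the same sliding-window idea with the summarizer passed in as a parameter, so it can be tried without an API key. The 4-characters-per-token ratio is a rough heuristic and an assumption of this sketch (the API's token-counting endpoint is the better source for exact numbers), and ChatMessage and buildContextWith are hypothetical names.

```typescript
type ChatMessage = { role: 'user' | 'assistant'; content: string };

// Rough estimate, assuming about 4 characters per token for English text.
function estimateTokens(messages: ChatMessage[]): number {
  const chars = messages.reduce((sum, m) => sum + m.content.length, 0);
  return Math.ceil(chars / 4);
}

// Keeps the last keepRecent messages verbatim and replaces everything older
// with a single summary message once the older part exceeds maxTokens.
async function buildContextWith(
  messages: ChatMessage[],
  summarize: (older: ChatMessage[]) => Promise<string>,
  maxTokens = 8000,
  keepRecent = 6,
): Promise<ChatMessage[]> {
  if (messages.length <= keepRecent) return messages;

  const older = messages.slice(0, -keepRecent);
  const recent = messages.slice(-keepRecent);
  if (estimateTokens(older) <= maxTokens) return messages;

  const summary = await summarize(older);
  return [{ role: 'user', content: `[Conversation summary: ${summary}]` }, ...recent];
}
```

In production the summarize argument would wrap a cheap model call; caching the summary between turns avoids paying for it on every request.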

What I Do Systematically Now

Log every call: prompt, model, tokens, latency, estimated cost
Explicit timeout: 30s max, no request hanging indefinitely
Validate output format before passing it to downstream code
Cost alerts: notification if spend exceeds a daily threshold
Test edge cases: empty input, very long, wrong language
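Two items from this checklist, the explicit timeout and the output validation, are easy to sketch in isolation. This is an illustration under assumptions: withTimeout is a generic helper (the SDK also accepts a timeout option on the client and per request, which is usually the simpler route), and the Quote shape with customer and totalEur is a hypothetical output format.

```typescript
// Rejects if work has not settled within ms milliseconds.
// Note: this stops waiting, but does not cancel the underlying request.
function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`timed out after ${ms} ms`)), ms);
  });
  return Promise.race([work, timeout]).finally(() => clearTimeout(timer));
}

// Hypothetical structured output expected from the model.
type Quote = { customer: string; totalEur: number };

// Returns a typed Quote, or null when the model's text is not valid JSON
// or does not match the expected shape.
function parseQuote(raw: string): Quote | null {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return null;
  }
  if (typeof data !== 'object' || data === null) return null;
  const rec = data as Record<string, unknown>;
  if (typeof rec.customer !== 'string') return null;
  if (typeof rec.totalEur !== 'number' || !Number.isFinite(rec.totalEur)) return null;
  return { customer: rec.customer, totalEur: rec.totalEur };
}
```

A null from parseQuote is the signal to retry, fall back, or surface an error, instead of letting a malformed response reach downstream code.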

Integrating an LLM API in production ultimately resembles any external API integration — with an extra layer of non-determinism that demands more testing and monitoring.

Stéphanie Caumont

AI Product Owner
