AI Integration That Actually Works in Production
Beyond the demo — what it really takes to ship AI features that survive contact with real users.
The demo always works. You give the model a perfect prompt, it returns a perfect response, and the room is impressed. Then you put it in front of real users with real inputs — and it starts hallucinating, timing out, returning inconsistent formats, and costing three times what you budgeted.
The production AI gap is real and widely underestimated
Building a working LLM integration in a Jupyter notebook takes an afternoon. Building one that is reliable, observable, cost-controlled, and user-safe in a production environment takes weeks — and requires engineering disciplines that most AI tutorials skip entirely.
We've integrated AI features into over a dozen products in the last two years. Here's what we've learned — the hard way — about what actually matters.
1. Prompt engineering is an engineering discipline, not an art
Prompts need to be versioned, tested, and deployed with the same rigour as code. We treat prompts as first-class artefacts: stored in version control, tested against a fixed evaluation set before deployment, and rolled back if they regress key metrics.
Every production prompt we write has a corresponding evaluation suite: a set of inputs with expected outputs or output properties, run automatically on every prompt change. This sounds like overhead. It's how you avoid shipping a prompt that works for 90% of inputs and catastrophically fails on the other 10%.
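An evaluation suite like this can be very small and still catch regressions. A minimal sketch, assuming a hypothetical `run_prompt` stand-in for the real model call and illustrative property checks (the article doesn't specify a particular harness):

```python
# Minimal prompt-evaluation harness. run_prompt, EVAL_CASES, and the
# property checks are illustrative assumptions, not a specific library.
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    name: str
    input: str
    check: Callable[[str], bool]  # property the output must satisfy

def run_prompt(prompt_template: str, user_input: str) -> str:
    """Stand-in for a real model call; production code would hit the LLM API."""
    # Simulated model that returns a JSON-shaped summary.
    return f'{{"summary": "{user_input[:40]}"}}'

EVAL_CASES = [
    EvalCase("returns_json", "Summarise: the cat sat on the mat",
             check=lambda out: out.startswith("{") and out.endswith("}")),
    EvalCase("non_empty", "Summarise: quarterly revenue grew 12%",
             check=lambda out: len(out) > 10),
]

def evaluate(prompt_template: str) -> dict:
    """Run every case; gate deployment on all of them passing."""
    return {c.name: c.check(run_prompt(prompt_template, c.input)) for c in EVAL_CASES}

results = evaluate("You are a summariser. {input}")
assert all(results.values()), f"Prompt regression: {results}"
```

The point is the shape, not the checks: the suite runs on every prompt change, and a failing property blocks the deploy exactly like a failing unit test would.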
2. RAG is not magic — retrieval quality is everything
Retrieval-Augmented Generation (RAG) is the standard approach for giving LLMs access to your data. The theory is simple: retrieve relevant chunks from your knowledge base, stuff them into the context window, let the model answer. The practice is much more nuanced.
Retrieval quality — how well you find the right chunks — dominates output quality. A great LLM with bad retrieval produces worse answers than a mediocre model with excellent retrieval. This means your embedding model, your chunking strategy, your indexing approach, and your re-ranking logic all need to be carefully designed and continuously evaluated.
We use a four-stage retrieval pipeline: semantic search with embeddings, keyword search with BM25, cross-encoder re-ranking, and a deduplication step. It's more complex than vector search alone, but it's what produces consistently good results across diverse query types.
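The shape of that pipeline can be sketched in a few lines. This is a toy illustration under stated assumptions: the scoring functions are word-overlap stand-ins for a real embedding model, a BM25 index (e.g. rank_bm25), and a cross-encoder re-ranker — the four-stage structure is what carries over:

```python
# Toy sketch of the four-stage pipeline: semantic search, keyword search,
# re-ranking, deduplication. All scoring here is word overlap; real systems
# swap in embeddings, BM25, and a cross-encoder without changing the shape.

DOCS = [
    "Refund policy: refunds within 30 days of purchase.",
    "Shipping: orders ship within 2 business days.",
    "Refunds are issued to the original payment method.",
    "Refund policy: refunds within 30 days of purchase.",  # duplicate chunk
]

def semantic_search(query, docs, k=4):
    # Stand-in for embedding similarity: rank by shared words.
    q = set(query.lower().split())
    return sorted(docs, key=lambda d: -len(q & set(d.lower().split())))[:k]

def keyword_search(query, docs, k=4):
    # Stand-in for BM25: keep docs with at least one exact keyword hit.
    q = set(query.lower().split())
    return [d for d in docs if q & set(d.lower().split())][:k]

def rerank(query, docs, k=3):
    # Stand-in for a cross-encoder: prefer higher overlap, then shorter chunks.
    q = set(query.lower().split())
    return sorted(docs, key=lambda d: (-len(q & set(d.lower().split())), len(d)))[:k]

def dedupe(docs):
    seen, out = set(), []
    for d in docs:
        if d not in seen:
            seen.add(d)
            out.append(d)
    return out

def retrieve(query):
    candidates = semantic_search(query, DOCS) + keyword_search(query, DOCS)
    return dedupe(rerank(query, dedupe(candidates)))

chunks = retrieve("what is the refund policy")
```

Merging semantic and keyword candidates before re-ranking is what covers both paraphrased queries (where embeddings win) and exact-term queries (where keyword search wins).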
3. Cost is a product problem, not just an engineering problem
LLM inference costs can spiral fast. We've seen products that looked economically viable at 1,000 users become unsustainable at 10,000. The fix is treating token cost as a product metric alongside latency and accuracy — and designing your AI features with cost efficiency as a first-class constraint.
Practical techniques: aggressive caching of common queries, using smaller cheaper models for classification and routing before invoking the full model, streaming responses to reduce perceived latency, and batching non-real-time requests to take advantage of batch pricing.
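Two of those techniques — response caching and cheap-model routing — compose naturally into one request handler. A sketch with stand-in functions (`cheap_classify`, `call_big_model`, and the FAQ route are hypothetical names, not any particular API):

```python
# Sketch of caching plus small-model routing: a cache hit costs zero tokens,
# and a cheap classifier decides whether the expensive model is needed at all.
import hashlib

def cheap_classify(query: str) -> str:
    """Stand-in for a small, cheap classification model that routes queries."""
    return "faq" if "refund" in query.lower() else "complex"

def answer_from_faq(query: str) -> str:
    """Canned answer path: no expensive inference."""
    return "Refunds are available within 30 days."

def call_big_model(query: str) -> str:
    """Stand-in for the expensive LLM call."""
    return f"[LLM answer to: {query}]"

_cache: dict = {}

def handle(query: str) -> str:
    key = hashlib.sha256(query.strip().lower().encode()).hexdigest()
    if key in _cache:                 # 1. cache hit: zero token cost
        return _cache[key]
    route = cheap_classify(query)     # 2. cheap model picks the route
    answer = answer_from_faq(query) if route == "faq" else call_big_model(query)
    _cache[key] = answer              # 3. repeated queries become free
    return answer
```

In production the cache would live in Redis or similar with a TTL, and the router would be a small fine-tuned model — but the cost structure is the same: most queries should never reach the biggest model.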
4. Observability is non-negotiable
Every AI feature we ship is instrumented with logging of inputs, outputs, latency, token counts, and model version. We use a combination of structured logging and a purpose-built LLM observability tool (Langfuse in most of our stacks). Without this, you're flying blind — and you will not know when your model starts drifting or when a prompt update causes a regression.
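The bare minimum of that instrumentation fits in one wrapper function. This sketch emits structured JSON per call (Langfuse provides much richer tracing; the model version string and token estimates here are illustrative placeholders):

```python
# Minimal instrumentation wrapper: every model call logs input, output,
# latency, token counts, and model version as a structured JSON record.
import json
import sys
import time

MODEL_VERSION = "example-model-2024-06"  # illustrative placeholder

def instrumented_call(prompt: str, model_fn) -> str:
    start = time.perf_counter()
    output = model_fn(prompt)
    record = {
        "model": MODEL_VERSION,
        "prompt": prompt,
        "output": output,
        "latency_ms": round((time.perf_counter() - start) * 1000, 2),
        # Crude whitespace token estimate; real APIs return exact usage counts.
        "prompt_tokens": len(prompt.split()),
        "output_tokens": len(output.split()),
    }
    print(json.dumps(record), file=sys.stderr)  # ship to your log pipeline
    return output

result = instrumented_call("Summarise this ticket", lambda p: f"Summary of: {p}")
```

With records in this shape, drift and regressions become queries over your logs — spikes in latency or token counts after a prompt change show up immediately.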
5. Human-in-the-loop is an architecture choice, not a band-aid
For most production AI features, the right architecture includes meaningful human review for edge cases and low-confidence outputs. Not because the model is bad — because the cost of a confident wrong answer is usually higher than the cost of showing a loading state while a human reviews.
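One way to express that architecture is a confidence gate in front of the response path. A sketch under stated assumptions — the threshold value and the confidence signal (token logprobs, a verifier model) are illustrative, not prescribed by the article:

```python
# Confidence-gated review flow: confident answers ship immediately,
# low-confidence answers go to a human review queue instead.
from dataclasses import dataclass

REVIEW_THRESHOLD = 0.8  # illustrative; tune per feature and failure cost

@dataclass
class ModelResult:
    answer: str
    confidence: float  # e.g. derived from token logprobs or a verifier model

review_queue = []

def respond(result: ModelResult) -> str:
    if result.confidence >= REVIEW_THRESHOLD:
        return result.answer                 # confident: show immediately
    review_queue.append(result)              # uncertain: route to a human
    return "We're double-checking this answer and will get back to you shortly."

auto = respond(ModelResult("Your refund was approved.", 0.95))
held = respond(ModelResult("Your account will be deleted.", 0.40))
```

The threshold is a product decision, not a constant: the higher the cost of a confident wrong answer, the more traffic you deliberately route through the queue.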
AI integration done well is boring engineering: rigorous testing, careful monitoring, cost discipline, and incremental improvement. The teams that ship AI features that work at scale are the ones who treat it exactly like they treat any other production system — with respect for failure modes and deep investment in observability.
