The Demo-to-Production Gap Is an Architecture Problem
Every RAG demo we built worked on the first try. A vector store, an embedding model, a prompt template, and a retrieval call — you can wire that up in an afternoon. But of the 50 RAG systems we shipped into production across financial services, healthcare, and public sector clients between 2024 and early 2026, only 31 survived past the 90-day mark without a major architectural rework. The gap was never the retrieval itself. It was everything around it: query preprocessing, chunk lifecycle management, evaluation coverage, and the compliance surface area that regulated industries demand before a system touches real decisions.
The pattern we see repeatedly is teams treating RAG as a retrieval problem when it is actually a data pipeline problem. The embedding index is not a static artifact — it is a living system with ingestion schedules, staleness thresholds, and schema drift. Once we started treating the vector store with the same operational rigor as a production database, failure rates in our deployments dropped from roughly 38% to under 12% within six months.
Chunking Strategies That Actually Held Up
We tested five chunking strategies across our deployments: fixed-size (512 tokens), recursive character splitting, semantic paragraph segmentation, document-structure-aware chunking, and a hybrid approach that layers semantic boundaries on top of structural markers. Fixed-size chunking at 512 tokens with 64-token overlap remains our default starting point — not because it is optimal, but because it is predictable and debuggable. In 22 of our 50 systems, it was the final production configuration. Simplicity has compounding returns when you are operating at scale.
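The sliding-window arithmetic behind that default is simple enough to fit in a few lines. A minimal sketch over a pre-tokenized document, using the 512/64 numbers above (the `chunk_fixed` name and list-of-tokens input are illustrative, not our production code):

```python
def chunk_fixed(tokens, size=512, overlap=64):
    """Split a token list into fixed-size chunks with a sliding overlap."""
    step = size - overlap  # each new chunk starts `step` tokens after the last
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the tail is covered; avoid emitting tiny trailing slivers
    return chunks
```

The overlap is what makes this debuggable: a fact that straddles a boundary still appears whole in at least one chunk, and you can verify that invariant with a single slice comparison.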
Where fixed-size chunking failed was in document types with deep hierarchical structure: regulatory filings, clinical trial protocols, and multi-section contracts. For these, we moved to structure-aware chunking that respects heading boundaries, table integrity, and list groupings. Average retrieval relevance (measured by nDCG@10) jumped from 0.61 to 0.79 on our internal benchmarks when we switched. The cost was a 3x increase in preprocessing complexity and a pipeline that needed domain-specific parsers for each document type. For eight of our clients in banking and insurance, that tradeoff was non-negotiable.
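The heading-boundary idea can be sketched compactly, with a markdown-style heading pattern standing in for the domain-specific parsers mentioned above (function names, the whitespace token count, and the heading regex are all illustrative simplifications):

```python
import re

HEADING = re.compile(r"^#{1,6} ")

def chunk_by_structure(lines, max_tokens=512):
    """Group lines into chunks that never cross a heading boundary."""
    sections, current = [], []
    for line in lines:
        if HEADING.match(line) and current:
            sections.append(current)  # a new heading closes the open section
            current = []
        current.append(line)
    if current:
        sections.append(current)
    chunks = []
    for sec in sections:
        head = sec[0] if HEADING.match(sec[0]) else ""
        buf, count = [], 0
        for line in sec:
            n = len(line.split())  # crude token proxy; a real pipeline tokenizes
            if count + n > max_tokens and buf:
                chunks.append("\n".join(buf))
                # carry the heading into the next chunk as retrieval context
                buf = [head] if head else []
                count = len(head.split())
            buf.append(line)
            count += n
        if buf:
            chunks.append("\n".join(buf))
    return chunks
```

Carrying the heading forward into every split is the detail that moved relevance the most for us: a chunk of clause text with no section title is nearly unretrievable.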
The Embedding Staleness Problem Nobody Talks About
Embeddings go stale. This sounds obvious, but we watched three production systems degrade silently over weeks because the underlying source documents were updated while the vector index still held embeddings from the original versions. Our current standard is a freshness budget: every chunk carries an ingestion timestamp and a source-document hash. A nightly reconciliation job flags drift, and any chunk older than the configured TTL (typically 7 days for policy documents, 24 hours for market data) gets re-embedded or tombstoned. This added roughly 140ms of overhead per query for freshness validation, but it eliminated an entire class of silent-failure incidents, one that had previously triggered two client escalations.
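The reconciliation check itself reduces to comparing a stored hash and timestamp against the live source. A sketch using the TTLs quoted above (the chunk field names and `TTL_SECONDS` table are illustrative, not a fixed schema):

```python
import hashlib
import time

TTL_SECONDS = {"policy": 7 * 24 * 3600, "market": 24 * 3600}

def doc_hash(text: str) -> str:
    """Content hash of the source document at embedding time."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def needs_refresh(chunk, source_text, now=None):
    """Flag a chunk whose source changed or whose TTL expired.

    `chunk` carries `ingested_at`, `source_hash`, and `doc_type`,
    recorded when the chunk was embedded.
    """
    now = now if now is not None else time.time()
    if chunk["source_hash"] != doc_hash(source_text):
        return True  # source document drifted since the embedding was made
    ttl = TTL_SECONDS.get(chunk["doc_type"], 7 * 24 * 3600)
    return now - chunk["ingested_at"] > ttl
```

The nightly job simply maps this predicate over the index and routes flagged chunks to re-embedding or tombstoning.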
Evaluation Frameworks: Measuring What Matters in Production
The biggest lesson from our 50 deployments is that offline evaluation is necessary but nowhere near sufficient. We run a three-tier evaluation framework across all production RAG systems. Tier one is offline: a curated set of 200-500 query-answer pairs per domain, scored on faithfulness, relevance, and completeness using an LLM-as-judge pipeline calibrated against human ratings (Spearman correlation of 0.84 on our calibration set). Tier two is online: real-time monitoring of retrieval hit rates, answer latency (p50 target of 1.8s, p99 under 4.5s), and user feedback signals like copy events, follow-up query rates, and explicit thumbs-down actions.
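The tier-one harness is, structurally, just an aggregation loop over judged pairs. A sketch where `judge` stands in for the calibrated LLM-as-judge call and the dict keys mirror the three metrics above (all names are illustrative):

```python
from statistics import mean

def score_offline(eval_set, judge):
    """Aggregate per-metric judge scores over a curated query-answer set.

    `judge(query, answer, context)` returns 0-1 scores for faithfulness,
    relevance, and completeness; in production it is an LLM-as-judge
    pipeline calibrated against human ratings.
    """
    per_metric = {"faithfulness": [], "relevance": [], "completeness": []}
    for item in eval_set:
        scores = judge(item["query"], item["answer"], item["context"])
        for metric, values in per_metric.items():
            values.append(scores[metric])
    return {metric: mean(values) for metric, values in per_metric.items()}
```

Keeping the harness this dumb is deliberate: the intelligence lives in the judge and the curation of the eval set, and the aggregation stays trivially auditable.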
Tier three is adversarial. We run a monthly red-team exercise where domain experts deliberately craft ambiguous, out-of-scope, and adversarial queries to stress-test guardrails. In regulated environments — and 34 of our 50 deployments fall into that category — this tier is what keeps compliance teams comfortable. We log every retrieval decision, every reranking score, and every generation with its source attribution chain. The full audit trail adds approximately 15% storage overhead, but in industries where explainability is a regulatory requirement, it is the cost of doing business.
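Each logged generation is one append-only record carrying its full retrieval and reranking lineage. A sketch of the record shape (field names are illustrative; the point is the source attribution chain, not the schema):

```python
import json
import time
import uuid

def audit_record(query, retrieved, reranked, answer):
    """Serialize one audit entry linking an answer back to its sources.

    `retrieved` and `reranked` are lists of (chunk_id, score) pairs from
    the two retrieval stages.
    """
    return json.dumps({
        "id": str(uuid.uuid4()),
        "ts": time.time(),
        "query": query,
        "retrieved": [{"chunk_id": c, "score": s} for c, s in retrieved],
        "reranked": [{"chunk_id": c, "score": s} for c, s in reranked],
        "answer": answer,
        # the attribution chain regulators actually ask about
        "attribution": [c for c, _ in reranked],
    })
```

Storing the pre-rerank candidate list as well as the final one is what lets an auditor answer "what else could the system have cited, and why didn't it?"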
Query Understanding Is the Highest-Leverage Investment
If we could go back and change one thing about our earliest deployments, it would be investing more in query understanding upfront. Raw user queries are messy: they contain implicit context, ambiguous references, and domain jargon that embedding models handle poorly out of the box. We now run a lightweight query rewriting step — a small fine-tuned model that expands abbreviations, resolves coreferences from conversation history, and classifies intent — before anything hits the vector store. This single addition improved end-to-end answer quality by 23% across our healthcare deployments and reduced "I don't know" fallback rates from 19% to 7%. The model adds 120ms of latency, and it is the best latency budget we spend.
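Our production rewriter is a small fine-tuned model, but the shape of the step can be shown with rules: expand known abbreviations, and pull in conversation history when the query leans on a pronoun. The abbreviation table and pronoun heuristic below are hypothetical stand-ins, not the trained model's behavior:

```python
# Hypothetical domain glossary; the real mapping is learned, not hand-coded.
ABBREVIATIONS = {"bp": "blood pressure", "afib": "atrial fibrillation"}

def rewrite_query(query, history):
    """Rule-based stand-in for the fine-tuned query-rewriting model."""
    # Expand abbreviations the embedding model handles poorly.
    rewritten = " ".join(ABBREVIATIONS.get(w.lower(), w) for w in query.split())
    # Crude coreference: if the query leans on a pronoun, prepend the
    # previous user turn as context before retrieval.
    pronouns = {"it", "that", "they", "this"}
    if history and pronouns & {w.lower().strip("?.,") for w in query.split()}:
        rewritten = f"{history[-1]} ; {rewritten}"
    return rewritten
```

Even this crude version illustrates why the step pays for itself: the vector store never sees a bare "is it dangerous", only a query with enough context to retrieve against.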
What We Would Build Differently Today
After 50 deployments, our architecture template looks nothing like the one we started with. We default to a two-stage retrieval pipeline: a fast approximate nearest-neighbor search over the full index, followed by a cross-encoder reranker on the top 40 candidates. We enforce chunk-level provenance tracking from day one. We build the evaluation harness before we build the retrieval pipeline. And we treat the embedding index as a managed data product with an owner, an SLA, and a deprecation policy. None of this is exotic. It is operational discipline applied to a new class of system, and it is the difference between a demo that impresses and a system that survives its first quarter in production.
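The two-stage pipeline reduces to a few lines once the models are abstracted away. In this sketch a brute-force dot product stands in for the ANN library and `rerank` for the cross-encoder call; all names and the toy scoring are illustrative:

```python
def two_stage_retrieve(query, embed, index, texts, rerank, k_ann=40, k_final=5):
    """Fast first-pass retrieval, then heavier reranking on a short list.

    `index` is a list of chunk embedding vectors aligned with `texts`;
    `embed(query)` returns a query vector and `rerank(query, text)` a
    relevance score from the cross-encoder.
    """
    q = embed(query)
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    # Stage 1: cheap similarity over the full index (an ANN library such
    # as HNSW replaces this exact scan in production).
    sims = sorted(((dot(vec, q), i) for i, vec in enumerate(index)), reverse=True)
    candidates = [i for _, i in sims[:k_ann]]
    # Stage 2: cross-encoder-style scoring only on the top candidates.
    candidates.sort(key=lambda i: rerank(query, texts[i]), reverse=True)
    return candidates[:k_final]
```

The economics are the whole point: the expensive model sees only 40 candidates per query, so its latency cost stays bounded no matter how large the index grows.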