Production RAG Architecture

Production RAG needs more than a notebook. You need document pipelines, background indexing jobs, versioned vector stores, retrieval APIs, prompt version control, monitoring, and often tenant-aware access control. A sound architecture separates ingestion, indexing, retrieval, and generation so that each stage can be scaled and debugged independently.

A production RAG system is several services, not one script. The ingestion service runs on a schedule or event trigger: it fetches new documents, parses them, chunks them, embeds the chunks, and writes the results to the vector store. This is usually a background job (Celery, Prefect, Airflow) that runs independently of request serving. Versioning the index, that is, keeping a stable read-only version in service while a new version is written and switching reads over only once the new version is complete, prevents degraded serving during re-indexing.
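The build-then-switch pattern described above can be sketched with a toy in-memory store. This is a minimal illustration, not a real vector-store client: the `VersionedIndex` class, its method names, and the `min_chunks` sanity check are all hypothetical. Most vector stores offer some equivalent mechanism (for example aliases or collection swaps), but the exact API differs per product.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class VersionedIndex:
    """Hypothetical in-memory stand-in for a vector store with named versions."""

    versions: dict[str, dict[str, list[float]]] = field(default_factory=dict)
    live: str | None = None  # name of the version currently served to readers

    def build(self, version: str, embedded_chunks: dict[str, list[float]]) -> None:
        """Write a complete new version without touching the live one."""
        if version == self.live:
            raise ValueError("refusing to overwrite the live version")
        self.versions[version] = dict(embedded_chunks)

    def promote(self, version: str, min_chunks: int = 1) -> None:
        """Switch reads to `version` after a basic sanity check (assumed policy)."""
        candidate = self.versions[version]
        if len(candidate) < min_chunks:
            raise ValueError(f"version {version!r} has too few chunks to promote")
        self.live = version  # single assignment: readers see old or new, never a mix

    def read(self) -> dict[str, list[float]]:
        """Return the chunks of the live version."""
        if self.live is None:
            raise RuntimeError("no live index version has been promoted yet")
        return self.versions[self.live]


index = VersionedIndex()
index.build("v1", {"doc1#0": [0.1, 0.2]})
index.promote("v1")

# The ingestion job writes v2 while requests keep reading v1.
index.build("v2", {"doc1#0": [0.1, 0.3], "doc2#0": [0.5, 0.5]})
still_serving_v1 = index.read()
index.promote("v2")
```

Keeping the previous version around after promotion also gives a cheap rollback path: promoting `"v1"` again restores the old index without re-embedding anything.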

The serving path is a retrieval API that accepts a query, runs hybrid search, optionally reranks the results, assembles context, calls the LLM, and returns a structured response. This is typically a FastAPI or similar service, deployed behind a load balancer. Prompt templates are versioned separately from code, so that changing a prompt does not require a deployment. Caching at two levels, an embedding cache (to avoid re-embedding identical queries) and a retrieval cache (to return cached context for repeated queries), can substantially reduce cost and latency in high-traffic systems where queries repeat.
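A minimal sketch of the two cache levels follows. The `CachedRetriever` class and the `embed` and `search` callables are hypothetical placeholders for a real embedding client and vector search. Two choices here are assumptions rather than requirements: queries are normalized by lowercasing and collapsing whitespace, and the retrieval cache is keyed on the index version so that a re-index does not serve stale context.

```python
from __future__ import annotations

import hashlib
from typing import Callable


class CachedRetriever:
    """Two-level cache in front of hypothetical embed and search callables."""

    def __init__(
        self,
        embed: Callable[[str], list[float]],
        search: Callable[[list[float]], list[str]],
    ) -> None:
        self._embed = embed
        self._search = search
        self._embedding_cache: dict[str, list[float]] = {}
        self._retrieval_cache: dict[tuple[str, str], list[str]] = {}

    @staticmethod
    def _key(query: str) -> str:
        # Assumed normalization: case-insensitive, whitespace-collapsed.
        normalized = " ".join(query.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def retrieve(self, query: str, index_version: str) -> list[str]:
        query_key = self._key(query)
        retrieval_key = (query_key, index_version)

        # Level 2: the whole retrieval result for this query and index version.
        if retrieval_key in self._retrieval_cache:
            return self._retrieval_cache[retrieval_key]

        # Level 1: the query embedding, reusable across index versions.
        if query_key not in self._embedding_cache:
            self._embedding_cache[query_key] = self._embed(query)

        chunks = self._search(self._embedding_cache[query_key])
        self._retrieval_cache[retrieval_key] = chunks
        return chunks
```

Both dictionaries grow without bound in this sketch. A real deployment would more likely use a shared cache with expiry and eviction, and in a multi-tenant system the retrieval cache key would also need to include the tenant so that cached context is never shared across tenants.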

Observability is not optional. Every request should emit the query text, the retrieved chunk IDs and scores, the assembled context token count, the LLM latency, and any error codes. This telemetry drives both operational monitoring (SLA alerts, cost dashboards) and quality improvement (identifying the queries that consistently retrieve poor context). OpenTelemetry is a widely adopted, vendor-neutral standard for this kind of instrumentation. For multi-tenant systems, access control must be enforced at retrieval time: filtered vector search ensures users only retrieve documents they are authorised to see.
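The sketch below combines the two ideas in this paragraph: the tenant filter is applied before any scoring, and each call returns a telemetry record with the fields listed above. Everything here is illustrative. The `Chunk` type, the `retrieve_for_tenant` function, the dot-product scoring, and the whitespace-based token count are assumptions, and the record is a plain dictionary rather than an OpenTelemetry span. Only retrieval latency is measured, since the LLM call is outside this function.

```python
from __future__ import annotations

import time
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    tenant_id: str
    vector: tuple[float, ...]
    text: str


def dot(a: tuple[float, ...], b: tuple[float, ...]) -> float:
    return sum(x * y for x, y in zip(a, b))


def retrieve_for_tenant(
    query_text: str,
    query_vector: tuple[float, ...],
    tenant_id: str,
    chunks: list[Chunk],
    top_k: int = 3,
) -> tuple[list[Chunk], dict]:
    started = time.perf_counter()

    # Enforce access control first: other tenants' chunks are never scored.
    allowed = [c for c in chunks if c.tenant_id == tenant_id]

    scored = sorted(
        ((dot(query_vector, c.vector), c) for c in allowed),
        key=lambda pair: pair[0],
        reverse=True,
    )[:top_k]
    hits = [chunk for _, chunk in scored]

    record = {
        "query_text": query_text,
        "tenant_id": tenant_id,
        "chunk_ids": [c.chunk_id for c in hits],
        "scores": [score for score, _ in scored],
        # Whitespace split is a rough proxy; a real system would use its tokenizer.
        "context_tokens": sum(len(c.text.split()) for c in hits),
        "retrieval_latency_ms": (time.perf_counter() - started) * 1000.0,
    }
    return hits, record
```

In a real vector store the tenant condition would normally be passed as a metadata filter in the search request itself, so that the filtering happens inside the store rather than in application code. The record fields could then be attached to a trace span or emitted as a structured log line.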