End-to-end retrieval-augmented generation from chunking strategy to production architecture. Each lesson covers a pipeline stage with a flow diagram and a runnable Python script.
Parsing quality, normalisation, and metadata extraction
RAG quality is bounded by ingestion quality. Bad parsing creates broken chunks, duplicated fragments, and irrelevant boilerplate, all of which pollute retrieval. Preserving structural cues such as headings, timestamps, authors, and URLs turns raw text into filterable, rankable metadata.
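A minimal sketch of structure-preserving parsing, assuming markdown-style headings; the `ParsedSection` container and its metadata fields are illustrative, not a prescribed schema:

```python
import re
from dataclasses import dataclass, field

@dataclass
class ParsedSection:
    text: str
    metadata: dict = field(default_factory=dict)

def parse_markdown(raw: str, source_url: str) -> list[ParsedSection]:
    """Split a markdown document on headings, keeping each heading as metadata."""
    sections: list[ParsedSection] = []
    heading, buffer = None, []

    def flush():
        body = "\n".join(buffer).strip()
        if body:  # skip empty sections created by consecutive headings
            sections.append(ParsedSection(
                text=body,
                metadata={"heading": heading, "source": source_url},
            ))
        buffer.clear()

    for line in raw.splitlines():
        match = re.match(r"^(#{1,6})\s+(.*)", line)
        if match:
            flush()
            heading = match.group(2).strip()
        else:
            buffer.append(line)
    flush()
    return sections
```

Each section keeps enough metadata to filter or rank by source and heading later in the pipeline.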
Fixed, structural, and semantic chunking tradeoffs
Chunk size strongly affects retrieval quality. Tiny chunks lose context; large chunks reduce precision and waste context window budget. The right strategy depends on corpus structure and query style — and should always be validated empirically.
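A sketch of the two simplest strategies, fixed-size windows with overlap and paragraph-aware merging; word counts stand in for real token counts and the defaults are illustrative:

```python
def fixed_size_chunks(text: str, max_words: int = 200, overlap: int = 40) -> list[str]:
    """Split text into overlapping fixed-size windows measured in words."""
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        window = words[start:start + max_words]
        if window:
            chunks.append(" ".join(window))
        if start + max_words >= len(words):
            break
    return chunks

def structural_chunks(text: str, max_words: int = 200) -> list[str]:
    """Split on blank lines (paragraphs) and merge paragraphs up to a word budget."""
    chunks, current, count = [], [], 0
    for para in (p.strip() for p in text.split("\n\n") if p.strip()):
        size = len(para.split())
        if current and count + size > max_words:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(para)
        count += size
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

The fixed-size variant never breaks mid-word but happily breaks mid-thought; the structural variant respects paragraph boundaries at the cost of uneven chunk sizes.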
Dense vectors, similarity search, and vector store internals
Embeddings convert text into dense vectors so semantically related items appear closer in vector space. Quality depends on the model, the language/domain match, and chunking choices. Production systems use approximate nearest neighbour indexes (FAISS, pgvector, Qdrant) rather than brute-force search.
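A small end-to-end sketch assuming `sentence-transformers` and `faiss-cpu` are installed; the model name is just one common public choice and the HNSW parameters are illustrative:

```python
import faiss                                            # pip install faiss-cpu
import numpy as np
from sentence_transformers import SentenceTransformer  # pip install sentence-transformers

docs = [
    "Invoices are processed within 30 days.",
    "Refund requests must include the order number.",
    "Our office is closed on public holidays.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")
doc_vectors = model.encode(docs, normalize_embeddings=True).astype(np.float32)

# HNSW is an approximate nearest-neighbour index; with normalised vectors,
# inner product equals cosine similarity.
index = faiss.IndexHNSWFlat(doc_vectors.shape[1], 32, faiss.METRIC_INNER_PRODUCT)
index.add(doc_vectors)

query = model.encode(["how do I get my money back?"], normalize_embeddings=True).astype(np.float32)
scores, ids = index.search(query, 2)
for score, doc_id in zip(scores[0], ids[0]):
    print(f"{score:.3f}  {docs[doc_id]}")
```

At this corpus size brute force would be just as good; the approximate index only pays off at scale, which is why the demo mirrors the production shape rather than the production need.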
Recall, filtering, and query quality
A user query is often not the best retrieval query. Query rewriting, expansion, and decomposition can significantly improve results. Good retrieval is more about system design — metadata filters, top-k tuning, hybrid search — than raw model choice.
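A pattern sketch rather than a working retriever: `rewrite_query` stands in for an LLM-based rewriter and `vector_search` for your vector-store client, both hypothetical names; the point is fanning out query variants, filtering on metadata, and merging the results:

```python
def rewrite_query(query: str) -> list[str]:
    # In practice: prompt an LLM for paraphrases or sub-questions of the query.
    return [query, f"{query} policy details"]

def vector_search(query: str, top_k: int, filters: dict) -> list[dict]:
    # Stand-in for a real vector-store call (FAISS, pgvector, Qdrant, ...).
    return []

def retrieve(query: str, top_k: int = 5) -> list[dict]:
    filters = {"doc_type": "policy", "lang": "en"}   # illustrative metadata filter
    seen, merged = set(), []
    for variant in rewrite_query(query):
        for hit in vector_search(variant, top_k=top_k, filters=filters):
            if hit["chunk_id"] not in seen:          # keep each chunk once across variants
                seen.add(hit["chunk_id"])
                merged.append(hit)
    return merged[:top_k]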
Cross-encoders, deduplication, and token budget management
Initial retrieval casts a wide net. Rerankers — especially cross-encoders — reorder candidates based on direct query-document relevance. After reranking, near-duplicates are removed and the best chunks are packed into the prompt within a token budget.
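A sketch using a public cross-encoder from `sentence-transformers` (the model name is one common choice); deduplication here is exact-match after whitespace normalisation and the token estimate is a crude word count, both of which real systems replace with shingling and a proper tokenizer:

```python
from sentence_transformers import CrossEncoder   # pip install sentence-transformers

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank_and_pack(query: str, candidates: list[str], token_budget: int = 1500) -> list[str]:
    # 1. Score every (query, chunk) pair directly with the cross-encoder.
    scores = reranker.predict([(query, chunk) for chunk in candidates])
    ranked = [chunk for _, chunk in sorted(zip(scores, candidates), reverse=True)]

    # 2. Drop near-duplicates and pack the best chunks within the budget.
    packed, seen, used = [], set(), 0
    for chunk in ranked:
        key = " ".join(chunk.lower().split())
        cost = len(chunk.split())        # crude token estimate; use a real tokenizer in practice
        if key in seen or used + cost > token_budget:
            continue
        seen.add(key)
        packed.append(chunk)
        used += cost
    return packed
```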
Building prompts that force grounded, cited answers
Even with excellent retrieval, prompts determine whether the model answers from evidence or synthesises unsupported claims. A strong RAG prompt is explicit about allowed evidence, output format, citation requirements, and what to do when the context is insufficient.
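A minimal prompt-builder sketch; the chunk fields (`id`, `source`, `text`) and the refusal wording are illustrative:

```python
def build_prompt(question: str, chunks: list[dict]) -> str:
    """Assemble a grounded prompt from retrieved chunks."""
    evidence = "\n\n".join(
        f"[{c['id']}] (source: {c['source']})\n{c['text']}" for c in chunks
    )
    return (
        "Answer the question using ONLY the evidence below.\n"
        "Cite the evidence ids you used, e.g. [2], after each claim.\n"
        "If the evidence is insufficient, reply exactly: "
        "\"I can't answer this from the provided documents.\"\n\n"
        f"Evidence:\n{evidence}\n\n"
        f"Question: {question}\nAnswer:"
    )
```

The three levers are all visible in the string: an explicit evidence boundary, a citation format, and a defined behaviour for insufficient context.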
Hit-rate, recall@k, groundedness, and regression suites
RAG evaluation must separately inspect retrieval quality and answer quality. A bad answer can come from bad retrieval, bad reranking, or bad synthesis — each requires a different fix. The best teams maintain a gold dataset and run regression checks whenever any pipeline stage changes.
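A sketch of the retrieval-side metrics over a gold dataset; the gold-example schema and the `retrieve` callable are assumptions about how your pipeline is wrapped:

```python
def retrieval_metrics(gold: list[dict], retrieve, k: int = 5) -> tuple[float, float]:
    """gold items look like {'question': ..., 'relevant_ids': [...]} (illustrative schema);
    `retrieve` returns ranked chunk ids for a question. Returns (hit-rate, mean recall@k)."""
    hits, recalls = 0, []
    for example in gold:
        retrieved = set(retrieve(example["question"])[:k])
        relevant = set(example["relevant_ids"])
        overlap = retrieved & relevant
        hits += bool(overlap)                          # hit-rate: any relevant chunk in the top k
        recalls.append(len(overlap) / len(relevant))   # recall@k: share of relevant chunks found
    return hits / len(gold), sum(recalls) / len(gold)
```

Running this in CI against the same gold set after every chunking, embedding, or reranking change is the regression suite in its simplest form.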
Ingestion pipelines, serving, caching, and observability
Production RAG needs more than a notebook. You need document pipelines, background indexing jobs, versioned vector stores, retrieval APIs, prompt version control, monitoring, and often tenant-aware access control. A sound architecture separates ingestion, indexing, retrieval, and generation so each stage can be scaled and debugged independently.
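A structural sketch of that separation, assuming illustrative `Retriever` and `Generator` interfaces; the serving layer only reads from a versioned index that background ingestion and indexing jobs produce:

```python
from typing import Protocol

class Retriever(Protocol):
    def search(self, query: str, top_k: int, tenant_id: str) -> list[dict]: ...

class Generator(Protocol):
    def answer(self, prompt: str) -> str: ...

class RagService:
    """Serving layer only: ingestion and indexing run as background jobs that
    publish the versioned index this retriever reads from."""

    def __init__(self, retriever: Retriever, generator: Generator, index_version: str):
        self.retriever = retriever
        self.generator = generator
        self.index_version = index_version   # logged per request for observability

    def query(self, question: str, tenant_id: str) -> dict:
        chunks = self.retriever.search(question, top_k=5, tenant_id=tenant_id)
        prompt = f"Question: {question}\nEvidence: {chunks}"   # prompt construction as in the prompting lesson
        return {
            "answer": self.generator.answer(prompt),
            "chunks": chunks,
            "index_version": self.index_version,   # makes regressions traceable to an index build
        }
```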
Hybrid search, parent-child retrieval, query decomposition, and agentic loops
Advanced RAG combines lexical and semantic search, uses parent-child retrieval for granular indexing with coherent generation context, decomposes multi-hop questions into sub-queries, and optionally uses agentic retrieval loops — with careful attention to latency and failure paths.
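One concrete piece of this, hybrid fusion, sketched as reciprocal rank fusion over a lexical and a semantic ranking; RRF is a common fusion choice, not necessarily the one any particular stack uses:

```python
def reciprocal_rank_fusion(result_lists: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked id lists from lexical (BM25) and semantic (vector) search.
    Each document scores sum(1 / (k + rank)) over the lists it appears in."""
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Example: ids ranked by BM25 vs. by vector similarity.
lexical = ["d3", "d1", "d7"]
semantic = ["d1", "d5", "d3"]
print(reciprocal_rank_fusion([lexical, semantic]))   # documents found by both lists rise to the top
```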