End-to-end retrieval-augmented generation from chunking strategy to production architecture. Each lesson covers a pipeline stage with a flow diagram and a runnable Python script.
Parsing quality, normalisation, and metadata extraction
RAG quality is bounded by ingestion quality. Bad parsing creates broken chunks, duplicated fragments, and irrelevant boilerplate, all of which pollute retrieval. Preserving structural cues such as headings, timestamps, authors, and URLs turns raw text into filterable, rankable metadata.
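A minimal sketch of structure-preserving parsing, assuming markdown-style headings; the `ParsedSection` container and its metadata fields are illustrative, not a prescribed schema:

```python
import re
from dataclasses import dataclass, field

@dataclass
class ParsedSection:
    text: str
    metadata: dict = field(default_factory=dict)

def parse_markdown(raw: str, source_url: str) -> list[ParsedSection]:
    """Split a markdown document on headings, keeping each heading as metadata."""
    sections: list[ParsedSection] = []
    heading, buffer = None, []

    def flush():
        body = "\n".join(buffer).strip()
        if body:  # skip empty sections created by consecutive headings
            sections.append(ParsedSection(
                text=body,
                metadata={"heading": heading, "source": source_url},
            ))
        buffer.clear()

    for line in raw.splitlines():
        match = re.match(r"^(#{1,6})\s+(.*)", line)
        if match:
            flush()
            heading = match.group(2).strip()
        else:
            buffer.append(line)
    flush()
    return sections
```

Each section keeps enough metadata to filter or rank by source and heading later in the pipeline.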
Fixed, structural, and semantic chunking tradeoffs
Chunk size strongly affects retrieval quality. Tiny chunks lose context; large chunks reduce precision and waste context window budget. The right strategy depends on corpus structure and query style — and should always be validated empirically.
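A sketch of the two simplest strategies, fixed-size windows with overlap and paragraph-aware merging; word counts stand in for real token counts and the defaults are illustrative:

```python
def fixed_size_chunks(text: str, max_words: int = 200, overlap: int = 40) -> list[str]:
    """Split text into overlapping fixed-size windows measured in words."""
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        window = words[start:start + max_words]
        if window:
            chunks.append(" ".join(window))
        if start + max_words >= len(words):
            break
    return chunks

def structural_chunks(text: str, max_words: int = 200) -> list[str]:
    """Split on blank lines (paragraphs) and merge paragraphs up to a word budget."""
    chunks, current, count = [], [], 0
    for para in (p.strip() for p in text.split("\n\n") if p.strip()):
        size = len(para.split())
        if current and count + size > max_words:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(para)
        count += size
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

The fixed-size variant never breaks mid-word but happily breaks mid-thought; the structural variant respects paragraph boundaries at the cost of uneven chunk sizes.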
Dense vectors, similarity search, and vector store internals
Embeddings convert text into dense vectors so semantically related items appear closer in vector space. Quality depends on the model, the language/domain match, and chunking choices. Production systems use approximate nearest neighbour indexes (FAISS, pgvector, Qdrant) rather than brute-force search.
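A small end-to-end sketch assuming `sentence-transformers` and `faiss-cpu` are installed; the model name is just one common public choice and the HNSW parameters are illustrative:

```python
import faiss                                            # pip install faiss-cpu
import numpy as np
from sentence_transformers import SentenceTransformer  # pip install sentence-transformers

docs = [
    "Invoices are processed within 30 days.",
    "Refund requests must include the order number.",
    "Our office is closed on public holidays.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")
doc_vectors = model.encode(docs, normalize_embeddings=True).astype(np.float32)

# HNSW is an approximate nearest-neighbour index; with normalised vectors,
# inner product equals cosine similarity.
index = faiss.IndexHNSWFlat(doc_vectors.shape[1], 32, faiss.METRIC_INNER_PRODUCT)
index.add(doc_vectors)

query = model.encode(["how do I get my money back?"], normalize_embeddings=True).astype(np.float32)
scores, ids = index.search(query, 2)
for score, doc_id in zip(scores[0], ids[0]):
    print(f"{score:.3f}  {docs[doc_id]}")
```

At this corpus size brute force would be just as good; the approximate index only pays off at scale, which is why the demo mirrors the production shape rather than the production need.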
Recall, filtering, and query quality
A user query is often not the best retrieval query. Query rewriting, expansion, and decomposition can significantly improve results. Good retrieval is more about system design — metadata filters, top-k tuning, hybrid search — than raw model choice.
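A pattern sketch rather than a working retriever: `rewrite_query` stands in for an LLM-based rewriter and `vector_search` for your vector-store client, both hypothetical names; the point is fanning out query variants, filtering on metadata, and merging the results:

```python
def rewrite_query(query: str) -> list[str]:
    # In practice: prompt an LLM for paraphrases or sub-questions of the query.
    return [query, f"{query} policy details"]

def vector_search(query: str, top_k: int, filters: dict) -> list[dict]:
    # Stand-in for a real vector-store call (FAISS, pgvector, Qdrant, ...).
    return []

def retrieve(query: str, top_k: int = 5) -> list[dict]:
    filters = {"doc_type": "policy", "lang": "en"}   # illustrative metadata filter
    seen, merged = set(), []
    for variant in rewrite_query(query):
        for hit in vector_search(variant, top_k=top_k, filters=filters):
            if hit["chunk_id"] not in seen:          # keep each chunk once across variants
                seen.add(hit["chunk_id"])
                merged.append(hit)
    return merged[:top_k]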
Cross-encoders, deduplication, and token budget management
Initial retrieval casts a wide net. Rerankers — especially cross-encoders — reorder candidates based on direct query-document relevance. After reranking, near-duplicates are removed and the best chunks are packed into the prompt within a token budget.
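A sketch using a public cross-encoder from `sentence-transformers` (the model name is one common choice); deduplication here is exact-match after whitespace normalisation and the token estimate is a crude word count, both of which real systems replace with shingling and a proper tokenizer:

```python
from sentence_transformers import CrossEncoder   # pip install sentence-transformers

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank_and_pack(query: str, candidates: list[str], token_budget: int = 1500) -> list[str]:
    # 1. Score every (query, chunk) pair directly with the cross-encoder.
    scores = reranker.predict([(query, chunk) for chunk in candidates])
    ranked = [chunk for _, chunk in sorted(zip(scores, candidates), reverse=True)]

    # 2. Drop near-duplicates and pack the best chunks within the budget.
    packed, seen, used = [], set(), 0
    for chunk in ranked:
        key = " ".join(chunk.lower().split())
        cost = len(chunk.split())        # crude token estimate; use a real tokenizer in practice
        if key in seen or used + cost > token_budget:
            continue
        seen.add(key)
        packed.append(chunk)
        used += cost
    return packed
```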
Building prompts that force grounded, cited answers
Even with excellent retrieval, prompts determine whether the model answers from evidence or synthesises unsupported claims. A strong RAG prompt is explicit about allowed evidence, output format, citation requirements, and what to do when the context is insufficient.
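A minimal prompt-builder sketch; the chunk fields (`id`, `source`, `text`) and the refusal wording are illustrative:

```python
def build_prompt(question: str, chunks: list[dict]) -> str:
    """Assemble a grounded prompt from retrieved chunks."""
    evidence = "\n\n".join(
        f"[{c['id']}] (source: {c['source']})\n{c['text']}" for c in chunks
    )
    return (
        "Answer the question using ONLY the evidence below.\n"
        "Cite the evidence ids you used, e.g. [2], after each claim.\n"
        "If the evidence is insufficient, reply exactly: "
        "\"I can't answer this from the provided documents.\"\n\n"
        f"Evidence:\n{evidence}\n\n"
        f"Question: {question}\nAnswer:"
    )
```

The three levers are all visible in the string: an explicit evidence boundary, a citation format, and a defined behaviour for insufficient context.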
Hit-rate, recall@k, groundedness, and regression suites
RAG evaluation must separately inspect retrieval quality and answer quality. A bad answer can come from bad retrieval, bad reranking, or bad synthesis — each requires a different fix. The best teams maintain a gold dataset and run regression checks whenever any pipeline stage changes.
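A sketch of the retrieval-side metrics over a gold dataset; the gold-example schema and the `retrieve` callable are assumptions about how your pipeline is wrapped:

```python
def retrieval_metrics(gold: list[dict], retrieve, k: int = 5) -> tuple[float, float]:
    """gold items look like {'question': ..., 'relevant_ids': [...]} (illustrative schema);
    `retrieve` returns ranked chunk ids for a question. Returns (hit-rate, mean recall@k)."""
    hits, recalls = 0, []
    for example in gold:
        retrieved = set(retrieve(example["question"])[:k])
        relevant = set(example["relevant_ids"])
        overlap = retrieved & relevant
        hits += bool(overlap)                          # hit-rate: any relevant chunk in the top k
        recalls.append(len(overlap) / len(relevant))   # recall@k: share of relevant chunks found
    return hits / len(gold), sum(recalls) / len(gold)
```

Running this in CI against the same gold set after every chunking, embedding, or reranking change is the regression suite in its simplest form.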
Ingestion pipelines, serving, caching, and observability
Production RAG needs more than a notebook. You need document pipelines, background indexing jobs, versioned vector stores, retrieval APIs, prompt version control, monitoring, and often tenant-aware access control. A sound architecture separates ingestion, indexing, retrieval, and generation so each stage can be scaled and debugged independently.
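A structural sketch of that separation, assuming illustrative `Retriever` and `Generator` interfaces; the serving layer only reads from a versioned index that background ingestion and indexing jobs produce:

```python
from typing import Protocol

class Retriever(Protocol):
    def search(self, query: str, top_k: int, tenant_id: str) -> list[dict]: ...

class Generator(Protocol):
    def answer(self, prompt: str) -> str: ...

class RagService:
    """Serving layer only: ingestion and indexing run as background jobs that
    publish the versioned index this retriever reads from."""

    def __init__(self, retriever: Retriever, generator: Generator, index_version: str):
        self.retriever = retriever
        self.generator = generator
        self.index_version = index_version   # logged per request for observability

    def query(self, question: str, tenant_id: str) -> dict:
        chunks = self.retriever.search(question, top_k=5, tenant_id=tenant_id)
        prompt = f"Question: {question}\nEvidence: {chunks}"   # prompt construction as in the prompting lesson
        return {
            "answer": self.generator.answer(prompt),
            "chunks": chunks,
            "index_version": self.index_version,   # makes regressions traceable to an index build
        }
```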
Hybrid search, parent-child retrieval, query decomposition, and agentic loops
Advanced RAG combines lexical and semantic search, uses parent-child retrieval for granular indexing with coherent generation context, decomposes multi-hop questions into sub-queries, and optionally uses agentic retrieval loops — with careful attention to latency and failure paths.
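One concrete piece of this, hybrid fusion, sketched as reciprocal rank fusion over a lexical and a semantic ranking; RRF is a common fusion choice, not necessarily the one any particular stack uses:

```python
def reciprocal_rank_fusion(result_lists: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked id lists from lexical (BM25) and semantic (vector) search.
    Each document scores sum(1 / (k + rank)) over the lists it appears in."""
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Example: ids ranked by BM25 vs. by vector similarity.
lexical = ["d3", "d1", "d7"]
semantic = ["d1", "d5", "d3"]
print(reciprocal_rank_fusion([lexical, semantic]))   # documents found by both lists rise to the top
```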