

What RAG Is and When to Use It

Retrieval-augmented generation fundamentals

RAG combines retrieval from an external knowledge source with LLM generation. It solves the grounding problem — giving the model access to private, recent, or domain-specific documents at inference time rather than encoding all knowledge into model weights.

```mermaid
flowchart LR
    D[(Documents)] --> P[Parse & Clean]
    P --> C[Chunk]
    C --> E[Embed]
    E --> V[(Vector Index)]
    Q([User Query]) --> R[Retrieve Top-K]
    V --> R
    R --> G[Assemble Context]
    G --> LLM[LLM Generator]
    LLM --> A([Answer + Citations])
```

Language models are powerful synthesisers, but their knowledge is frozen at training time. RAG (Retrieval-Augmented Generation) breaks this constraint by separating what the model knows from what it can look up. At inference time the system retrieves relevant passages from an external corpus and passes them as context to the generator. The model answers from supplied evidence rather than relying on parametric memory alone.
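
The query-time half of this flow fits in a few lines. The sketch below is illustrative only: the `embed()` function is a toy bag-of-words stand-in for a real embedding model, the corpus text is made up, and the final generation step is left to whichever LLM client you use.

```python
# Minimal sketch of query-time RAG: embed the query, score it against
# pre-embedded chunks, and assemble the retrieved passages into a prompt.
# embed() is a toy stand-in for a dense encoder; the corpus is invented.
from collections import Counter
import math

CORPUS = [
    "The 2024 expense policy caps hotel stays at 180 EUR per night.",
    "Quarterly reports are due on the fifth business day after quarter end.",
    "Remote employees may claim a home-office stipend of 500 EUR per year.",
]

def embed(text: str) -> Counter:
    """Toy embedding: lower-cased bag of words (stand-in for a sentence encoder)."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Offline step: embed every chunk once; (vector, text) pairs act as the "index".
INDEX = [(embed(chunk), chunk) for chunk in CORPUS]

def retrieve(query: str, k: int = 2) -> list[str]:
    """Score all chunks against the query and return the top-k passages."""
    q = embed(query)
    ranked = sorted(INDEX, key=lambda item: cosine(q, item[0]), reverse=True)
    return [text for _, text in ranked[:k]]

def build_prompt(query: str, passages: list[str]) -> str:
    """Assemble the retrieved evidence into the context the generator sees."""
    context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        "Answer using only the sources below and cite them by number.\n\n"
        f"Sources:\n{context}\n\nQuestion: {query}\nAnswer:"
    )

question = "What is the nightly hotel cap?"
prompt = build_prompt(question, retrieve(question))
print(prompt)  # This prompt would then be sent to your LLM of choice.
```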

Choosing between RAG, fine-tuning, or pure prompting depends on your use case. Fine-tuning changes model behaviour permanently and is expensive to update — it is best for style, tone, and domain-specific reasoning patterns. RAG is best when knowledge changes frequently, documents are private, or answers must be traceable to a source. In practice many strong systems combine both: fine-tune once for domain reasoning, then use RAG for live document grounding.

The baseline RAG pipeline has eight stages: ingest documents, parse and clean, chunk into retrieval units, embed chunks into dense vectors, index vectors, retrieve top-k candidates at query time, optionally rerank and filter, then prompt the LLM with assembled context and generate a cited answer. Each stage has its own failure modes and tuning knobs — the lessons that follow cover each in depth.
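
As a rough sketch of the offline stages (parse and clean, chunk, embed, index), the skeleton below shows the general shape. Every function body is a stand-in: the chunk size, overlap, and dummy `embed()` vector are assumptions for illustration rather than recommended values, and a production system would swap in a document parser, an embedding model, and a vector store. The query-time stages (retrieve, rerank, assemble, generate) mirror the earlier sketch.

```python
# Offline half of the pipeline: parse/clean raw text, chunk it with overlap,
# embed each chunk, and build an in-memory index. All values are illustrative.
import re

def clean(raw: str) -> str:
    """Minimal parse-and-clean step: collapse whitespace left over from extraction."""
    return re.sub(r"\s+", " ", raw).strip()

def chunk(text: str, size: int = 200, overlap: int = 50) -> list[str]:
    """Fixed-size word-window chunking with overlap, so a fact that straddles
    a boundary still appears whole in at least one chunk."""
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, max(len(words) - overlap, 1), step)]

def embed(chunk_text: str) -> list[float]:
    """Placeholder for a call to an embedding model; returns a dummy vector here."""
    return [float(len(chunk_text)), float(chunk_text.count(" "))]

def build_index(documents: list[str]) -> list[tuple[list[float], str]]:
    """Embed every chunk and store (vector, text) pairs, standing in for a vector index."""
    index = []
    for doc in documents:
        for piece in chunk(clean(doc)):
            index.append((embed(piece), piece))
    return index

index = build_index(["Some   raw document text   pulled from a PDF export."])
print(len(index), "chunks indexed")
```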