Document Ingestion and Cleaning

Parsing quality, normalisation, and metadata extraction

RAG quality is bounded by ingestion quality. Bad parsing creates broken chunks, duplicated fragments, and irrelevant boilerplate that pollutes retrieval. Preserving structural clues — headings, timestamps, authors, URLs — turns raw text into filterable, rankable metadata.

flowchart TD
    S(["Raw Sources<br/>HTML · PDF · DOCX"]) --> P[Extract Text]
    P --> N["Remove Noise<br/>nav · footer · ads"]
    N --> W["Normalise<br/>whitespace · encoding"]
    W --> M["Extract Metadata<br/>title · date · author · URL"]
    M --> DD[Deduplicate]
    DD --> V["Version & Store"]
    V --> IDX[(Indexed Corpus)]

The most common RAG failure mode is not a bad embedding model or a weak reranker; it is bad ingestion. When text extraction leaves in navigation menus, cookie banners, duplicated footers, or garbled PDF columns, those fragments end up in the index, compete with real content for the top-ranked slots, and degrade retrieval quality for every query. Garbage in, garbage out is more acute in RAG than in almost any other ML pipeline because retrieved text is pasted directly into the prompt, so an ingestion error becomes part of the context the model answers from.
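As a rough illustration of noise removal for HTML sources, the sketch below uses Python's standard-library `html.parser` to drop text nested inside elements that usually hold boilerplate. The `NOISE_TAGS` set, the `MainTextExtractor` class, and the sample page are assumptions made for this example. Real sites often need per-source rules or a dedicated extraction library.

```python
from html.parser import HTMLParser

# Tags whose contents are treated as boilerplate (an assumption; tune per site).
NOISE_TAGS = {"nav", "footer", "aside", "script", "style", "form"}


class MainTextExtractor(HTMLParser):
    """Collects text that is not nested inside any NOISE_TAGS element."""

    def __init__(self):
        super().__init__()
        self.noise_depth = 0  # how many noise elements we are currently inside
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self.noise_depth += 1

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS and self.noise_depth > 0:
            self.noise_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that sits outside every noise element.
        if self.noise_depth == 0 and data.strip():
            self.parts.append(data.strip())


def extract_main_text(html: str) -> str:
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


# Hypothetical page used only for illustration.
page = """
<html><body>
<nav><a href="/">Home</a> <a href="/docs">Docs</a></nav>
<main><h1>Refund Policy</h1><p>Refunds are issued within 14 days.</p></main>
<footer>Accept cookies · Example Corp</footer>
</body></html>
"""

print(extract_main_text(page))
```

Only the heading and the paragraph inside `<main>` survive. The navigation links and the footer never reach the chunker, so they cannot be retrieved later.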

Good ingestion pipelines run in stages: extract raw text from the source format (HTML, PDF, DOCX, Markdown), remove structural noise (nav bars, footers, sidebars), normalise whitespace and encoding, extract structured metadata (title, author, date, section heading, source URL), detect and deduplicate near-identical passages, and version the corpus so you can roll back if an ingestion job introduces regressions.
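The normalisation, deduplication, and versioning stages can be sketched with the standard library alone. This is a minimal sketch under stated assumptions, not a production design. Near-duplicates are detected here with word-shingle Jaccard similarity, one of several possible techniques. The 0.8 threshold, the function names, and the sample passages are all illustrative choices. The pairwise comparison is quadratic, so a large corpus would need an approximate method such as MinHash.

```python
import hashlib
import re
import unicodedata


def normalise(text: str) -> str:
    """Apply Unicode NFKC normalisation and collapse whitespace runs."""
    text = unicodedata.normalize("NFKC", text)
    return re.sub(r"\s+", " ", text).strip()


def shingles(text: str, k: int = 3) -> set:
    """Return the set of lowercased k-word shingles for a passage."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: size of the intersection over size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def deduplicate(passages, threshold=0.8):
    """Keep a passage only if it is below the threshold against every kept one."""
    kept = []  # list of (normalised text, shingle set)
    for raw in passages:
        text = normalise(raw)
        sh = shingles(text)
        if all(jaccard(sh, other) < threshold for _, other in kept):
            kept.append((text, sh))
    return [text for text, _ in kept]


def corpus_version(passages) -> str:
    """Short content hash usable as a version tag for a corpus snapshot."""
    digest = hashlib.sha256()
    for p in passages:
        digest.update(p.encode("utf-8"))
        digest.update(b"\x00")  # separator so passage boundaries affect the hash
    return digest.hexdigest()[:12]


# Hypothetical passages: the second is a near-duplicate of the first.
passages = [
    "The refund policy allows returns within 14 days of the original purchase date",
    "Note the  refund policy allows returns within 14 days of the original purchase date",
    "Shipping takes three to five business days",
]

unique = deduplicate(passages)
print(len(unique), "passages kept, version", corpus_version(unique))
```

Storing the version tag alongside each index build gives a concrete handle for rollback: if a new ingestion run produces a different tag and retrieval quality drops, the previous snapshot can be restored by its tag.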

Metadata is the underused superpower of RAG. Every attribute you extract — date, author, document type, section level — becomes a potential filter at retrieval time. A question about a policy updated last month should not retrieve a stale version from three years ago. Structural clues like headings and section titles also improve chunking quality because they give natural boundary signals that a fixed-size splitter would otherwise miss.
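To make the filtering idea concrete, here is a minimal sketch of metadata-based pre-filtering before ranking. The `Chunk` fields, the `filter_chunks` helper, and the sample data are hypothetical and chosen only for illustration. Most vector stores expose their own filter syntax, which would take the place of this in-memory loop.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Chunk:
    text: str
    doc_type: str
    section: str   # heading the chunk was found under
    updated: date  # last-modified date extracted at ingestion time


def filter_chunks(chunks, doc_type=None, updated_after=None):
    """Keep chunks that satisfy every given constraint; None means no constraint."""
    result = []
    for c in chunks:
        if doc_type is not None and c.doc_type != doc_type:
            continue
        if updated_after is not None and c.updated <= updated_after:
            continue
        result.append(c)
    return result


# Made-up chunks: two versions of the same policy plus an unrelated document.
chunks = [
    Chunk("Refunds are issued within 30 days.", "policy", "Refunds", date(2021, 3, 1)),
    Chunk("Refunds are issued within 14 days.", "policy", "Refunds", date(2024, 5, 20)),
    Chunk("Release notes for version 2.", "changelog", "Releases", date(2024, 6, 1)),
]

candidates = filter_chunks(chunks, doc_type="policy", updated_after=date(2024, 1, 1))
for c in candidates:
    print(c.section, c.updated.isoformat(), c.text)
```

The stale 2021 policy chunk is excluded before any similarity scoring happens, so it cannot outrank the current version on wording alone.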