Document Ingestion and Cleaning
Parsing quality, normalisation, and metadata extraction
RAG quality is bounded by ingestion quality. Bad parsing creates broken chunks, duplicated fragments, and irrelevant boilerplate that pollutes retrieval. Preserving structural clues — headings, timestamps, authors, URLs — turns raw text into filterable, rankable metadata.
The most common RAG failure mode is often not a bad embedding model or a weak reranker but bad ingestion. When text extraction leaves in navigation menus, cookie banners, duplicated footers, or garbled PDF columns, those fragments end up in the index and compete with real content on every query that happens to match them. Garbage in, garbage out is especially acute in RAG because retrieved chunks are passed straight into the prompt, so an ingestion error reaches the model with nothing in between to catch it.
Good ingestion pipelines run in stages: extract raw text from the source format (HTML, PDF, DOCX, Markdown), remove structural noise (nav bars, footers, sidebars), normalise whitespace and encoding, extract structured metadata (title, author, date, section heading, source URL), detect and remove near-duplicate passages, and version the corpus so you can roll back if an ingestion job introduces regressions.
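The first few stages can be sketched for HTML using only the Python standard library. This is a minimal illustration, not a production parser: the set of noise tags, the `clean_html` function, and the shape of the returned dictionary are assumptions made for the example, and real pages usually need site-specific rules on top.

```python
import re
import unicodedata
from html.parser import HTMLParser

# Tags whose contents are treated as structural noise (an assumption; tune per source).
NOISE_TAGS = {"nav", "footer", "aside", "script", "style"}
# Tags that should introduce a line break in the extracted text.
BLOCK_TAGS = {"p", "div", "li", "section", "article", "h1", "h2", "h3", "h4", "h5", "h6"}


class _TextExtractor(HTMLParser):
    """Collects body text and the <title>, skipping anything inside noise tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.title_parts = []
        self._noise_depth = 0
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self._noise_depth += 1
        elif tag == "title":
            self._in_title = True
        elif tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS and self._noise_depth > 0:
            self._noise_depth -= 1
        elif tag == "title":
            self._in_title = False
        elif tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        if self._in_title:
            self.title_parts.append(data)
        elif self._noise_depth == 0:
            self.parts.append(data)


def normalise_text(text):
    """Apply Unicode NFKC, collapse runs of spaces, and squeeze repeated blank lines."""
    text = unicodedata.normalize("NFKC", text)
    lines = [re.sub(r"[ \t\r\f\v]+", " ", line).strip() for line in text.split("\n")]
    kept = []
    for line in lines:
        # Keep a blank line only if the previous kept line was not blank.
        if line or (kept and kept[-1]):
            kept.append(line)
    return "\n".join(kept).strip()


def clean_html(raw_html, source_url):
    """Return cleaned text plus a small metadata dict (hypothetical record shape)."""
    parser = _TextExtractor()
    parser.feed(raw_html)
    parser.close()
    return {
        "text": normalise_text("".join(parser.parts)),
        "metadata": {
            "title": normalise_text("".join(parser.title_parts)),
            "source_url": source_url,
        },
    }
```

Calling `clean_html(page, url)` on a page with a `<nav>` and a `<footer>` returns only the body text, with the page title and source URL kept alongside it as metadata. PDF and DOCX extraction would need a dedicated library and is not covered by this sketch.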
Metadata is the underused superpower of RAG. Every attribute you extract — date, author, document type, section level — becomes a potential filter at retrieval time. A question about a policy updated last month should not retrieve a stale version from three years ago. Structural clues like headings and section titles also improve chunking quality because they give natural boundary signals that a fixed-size splitter would otherwise miss.
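To make the filtering idea concrete, here is a small sketch of applying metadata filters to retrieved chunks before they reach the prompt. The `Chunk` class, the `updated` and `doc_type` metadata keys, and the `filter_chunks` function are all hypothetical names chosen for illustration; many vector stores can apply equivalent filters inside the query itself, which is usually preferable when available.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Chunk:
    text: str
    score: float  # similarity score from the retriever (assumed: higher is better)
    metadata: dict = field(default_factory=dict)


def filter_chunks(chunks, *, updated_after=None, doc_type=None):
    """Drop chunks that fail the metadata filters, then sort the rest by score."""
    kept = []
    for chunk in chunks:
        updated = chunk.metadata.get("updated")
        # Chunks with no date are excluded when a date filter is requested.
        if updated_after is not None and (updated is None or updated < updated_after):
            continue
        if doc_type is not None and chunk.metadata.get("doc_type") != doc_type:
            continue
        kept.append(chunk)
    return sorted(kept, key=lambda c: c.score, reverse=True)


candidates = [
    Chunk("Old refund policy", 0.92, {"updated": date(2021, 3, 1), "doc_type": "policy"}),
    Chunk("Current refund policy", 0.88, {"updated": date(2024, 5, 10), "doc_type": "policy"}),
]
recent = filter_chunks(candidates, updated_after=date(2024, 1, 1), doc_type="policy")
```

In this example the stale version scores higher on similarity alone, and only the date filter keeps it out of the results.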
Key Concepts
- Ingestion quality is the primary bottleneck — bad parsing creates broken, noisy chunks
- Remove structural boilerplate (nav, footer, ads) before chunking
- Preserve metadata: title, section headers, authors, timestamps, source URLs
- Deduplication prevents copies of the same content from crowding the top retrieval results
- Version the corpus so ingestion regressions can be detected and rolled back
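
The last two points, deduplication and versioning, can be sketched as follows. This is one simple approach under stated assumptions: near-duplicates are detected with word shingles and Jaccard similarity, the 0.8 threshold is an arbitrary starting point, and the corpus version is a fingerprint derived from content hashes. All function names are hypothetical. The pairwise comparison is quadratic in the number of passages, so large corpora would typically use an approximate method such as MinHash instead.

```python
import hashlib


def shingles(text, k=3):
    """Return the set of k-word shingles, ignoring case and extra whitespace."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def deduplicate(passages, threshold=0.8):
    """Keep the first occurrence of each passage and drop later near-duplicates."""
    kept, kept_shingles = [], []
    for passage in passages:
        current = shingles(passage)
        if any(jaccard(current, seen) >= threshold for seen in kept_shingles):
            continue
        kept.append(passage)
        kept_shingles.append(current)
    return kept


def corpus_version(passages):
    """Order-independent fingerprint of the corpus, usable as a version label."""
    digests = sorted(hashlib.sha256(p.encode("utf-8")).hexdigest() for p in passages)
    return hashlib.sha256("\n".join(digests).encode("utf-8")).hexdigest()[:12]
```

Storing the fingerprint next to each index build makes it possible to tell whether two ingestion runs produced the same corpus, and which snapshot to restore if a run introduces regressions.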