

Evaluating a RAG System

Hit-rate, recall@k, groundedness, and regression suites

RAG evaluation must separately inspect retrieval quality and answer quality. A bad answer can come from bad retrieval, bad reranking, or bad synthesis — each requires a different fix. The best teams maintain a gold dataset and run regression checks whenever any pipeline stage changes.

flowchart TD
    GD[(Gold Dataset<br/>queries + relevant chunks)] --> RE[Retrieval Eval]
    GD --> GE[Generation Eval]
    RE --> RM[Hit-Rate<br/>Recall@k · MRR · nDCG]
    GE --> GM[Groundedness<br/>Citation Accuracy<br/>Answer Relevance]
    RM --> RS[Regression Suite]
    GM --> RS
    RS --> GATE{Change to<br/>chunks / embeds<br/>/ prompt?}
    GATE -->|yes| RE
    GATE -->|no| SHIP([Ship])

End-to-end evaluation of a RAG system answers the wrong question: "is the final answer good?" gives you no signal about which stage to fix. You need to decompose evaluation by stage. Retrieval evaluation asks: given this query, did the relevant chunks appear in the top-k? The standard metrics are hit-rate (did any relevant chunk appear?), recall@k (what fraction of all relevant chunks appeared?), MRR (mean reciprocal rank of the first relevant result), and nDCG (normalised discounted cumulative gain, which accounts for rank position). These can be computed automatically if you have labelled relevance judgements.
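
A minimal sketch of these retrieval metrics, assuming binary relevance labels; the data layout (a ranked list of chunk IDs from the retriever, plus a human-labelled set of relevant IDs per query) is an assumption, not tied to any particular framework.

```python
import math

def hit_rate(ranked: list[str], relevant: set[str], k: int) -> float:
    """1.0 if any relevant chunk appears in the top-k, else 0.0."""
    return float(any(c in relevant for c in ranked[:k]))

def recall_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Fraction of all relevant chunks that appear in the top-k."""
    if not relevant:
        return 0.0
    return len(relevant & set(ranked[:k])) / len(relevant)

def mrr(ranked: list[str], relevant: set[str]) -> float:
    """Reciprocal rank of the first relevant chunk (0.0 if none retrieved)."""
    for i, chunk in enumerate(ranked, start=1):
        if chunk in relevant:
            return 1.0 / i
    return 0.0

def ndcg_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Binary-relevance nDCG: hits lower in the ranking are discounted."""
    dcg = sum(1.0 / math.log2(i + 1)
              for i, chunk in enumerate(ranked[:k], start=1) if chunk in relevant)
    ideal = sum(1.0 / math.log2(i + 1)
                for i in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0

# One labelled query from a gold dataset (chunk IDs are illustrative).
ranked = ["c7", "c2", "c9", "c4", "c1"]   # retriever output, best first
relevant = {"c2", "c4"}                   # human-labelled relevant chunks
print(hit_rate(ranked, relevant, k=5))    # 1.0
print(recall_at_k(ranked, relevant, k=5)) # 1.0
print(mrr(ranked, relevant))              # 0.5 (first hit at rank 2)
print(round(ndcg_at_k(ranked, relevant, k=5), 3))
```

Averaging each metric over every query in the gold dataset gives the per-run numbers you track across pipeline changes.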

Generation evaluation asks: given the context provided, is the answer grounded, accurate, and complete? Groundedness measures whether every claim in the answer can be traced to a retrieved source. Answer relevance measures whether the answer actually addresses the question. Citation accuracy checks whether cited source IDs contain the stated facts. These metrics are harder to compute automatically and often require LLM-as-judge pipelines (using a capable model to score answers) or human annotation.
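
A sketch of the LLM-as-judge pattern for groundedness. The prompt wording, the 0–1 scoring scheme, and the `judge` callable are assumptions; production pipelines usually split the answer into individual claims first and score each one, and any model client (OpenAI, Anthropic, a local model) can be wrapped to fit the callable.

```python
from typing import Callable

JUDGE_PROMPT = """You are grading a RAG answer for groundedness.

Context:
{context}

Answer:
{answer}

For each claim in the answer, decide whether it is supported by the context.
Reply with a single number between 0 and 1: the fraction of supported claims."""

def groundedness_score(answer: str,
                       retrieved_chunks: list[str],
                       judge: Callable[[str], str]) -> float:
    """Ask a judge model what fraction of the answer's claims the context supports.

    `judge` is any function that sends a prompt to a capable model and
    returns its text reply.
    """
    prompt = JUDGE_PROMPT.format(context="\n\n".join(retrieved_chunks),
                                 answer=answer)
    reply = judge(prompt)
    try:
        return max(0.0, min(1.0, float(reply.strip())))
    except ValueError:
        return 0.0  # unparseable judge output is treated as ungrounded
```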

The most important investment is a small, curated gold dataset: 50–200 questions with labelled relevant chunks and expected answer properties. Run this suite after every change to chunking strategy, embedding model, retrieval parameters, or prompt; regressions are common, and the suite catches them before they reach production. Frameworks like RAGAS, TruLens, and DeepEval provide ready-made evaluation pipelines; start with one of them rather than building evaluation tooling from scratch.
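
As one way to wire the gold dataset into CI, here is a pytest-style sketch of the regression gate. The file name, schema, and thresholds are placeholders, `retrieve()` stands in for your pipeline under test, and the metric helpers are the ones sketched earlier.

```python
import json

# Assumed gold-set layout, one record per question (name and schema are
# placeholders): {"question": "...", "relevant_chunk_ids": ["c2", "c4"]}
GOLD_PATH = "gold_set.json"
RECALL_FLOOR = 0.85  # floors should come from your own baseline runs
MRR_FLOOR = 0.70

def test_retrieval_regression():
    with open(GOLD_PATH) as f:
        gold = json.load(f)
    recalls, mrrs = [], []
    for case in gold:
        ranked = retrieve(case["question"], k=10)  # retrieve(): your pipeline under test
        relevant = set(case["relevant_chunk_ids"])
        recalls.append(recall_at_k(ranked, relevant, k=10))  # helpers from the sketch above
        mrrs.append(mrr(ranked, relevant))
    assert sum(recalls) / len(recalls) >= RECALL_FLOOR, "recall@10 regressed"
    assert sum(mrrs) / len(mrrs) >= MRR_FLOOR, "MRR regressed"
```

Generation-side checks (groundedness, citation accuracy) can be gated the same way, though judge-based scores are noisier, so thresholds there usually need wider margins.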