Reranking and Context Assembly
Cross-encoders, deduplication, and token budget management
Initial retrieval casts a wide net. Rerankers — especially cross-encoders — reorder candidates based on direct query-document relevance. After reranking, near-duplicates are removed and the best chunks are packed into the prompt within a token budget.
Bi-encoder retrieval embeds the query and each document separately, then compares them with a dot product or cosine similarity. It is fast because document embeddings can be precomputed and indexed, but imprecise: the two embeddings are computed independently, so the model never sees cross-attention signals between query and document tokens. Cross-encoder rerankers fix this by encoding each query-document pair jointly and producing a single relevance score. This is much slower, since nothing can be precomputed and the model must run once per candidate at query time, but it is typically considerably more accurate. The standard pattern is to retrieve broadly with a bi-encoder (top 20–100), then rerank tightly with a cross-encoder (keep top 3–5).
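The two-stage pattern can be sketched without any model dependencies. The following is a minimal illustration, not a production implementation: `retrieve_then_rerank`, `word_overlap`, and `phrase_aware` are hypothetical names, and the two toy scorers merely stand in for a real bi-encoder similarity and a real cross-encoder, which you would plug in through the same two callables.

```python
from typing import Callable, List, Sequence, Tuple

Scorer = Callable[[str, str], float]


def retrieve_then_rerank(
    query: str,
    corpus: Sequence[str],
    cheap_score: Scorer,
    pair_score: Scorer,
    retrieve_k: int = 50,
    keep_k: int = 5,
) -> List[Tuple[str, float]]:
    """Stage 1 keeps retrieve_k candidates; stage 2 rescores them and keeps keep_k."""
    # Stage 1: cheap scoring over the whole corpus (bi-encoder stand-in).
    candidates = sorted(
        corpus, key=lambda doc: cheap_score(query, doc), reverse=True
    )[:retrieve_k]
    # Stage 2: expensive scoring, one call per surviving candidate
    # (cross-encoder stand-in).
    rescored = [(doc, pair_score(query, doc)) for doc in candidates]
    rescored.sort(key=lambda item: item[1], reverse=True)
    return rescored[:keep_k]


def word_overlap(query: str, doc: str) -> float:
    """Toy stand-in for bi-encoder similarity: count of shared lowercase words."""
    return float(len(set(query.lower().split()) & set(doc.lower().split())))


def phrase_aware(query: str, doc: str) -> float:
    """Toy stand-in for a cross-encoder: overlap plus an exact-phrase bonus."""
    bonus = 2.0 if query.lower() in doc.lower() else 0.0
    return word_overlap(query, doc) + bonus


corpus = [
    "cross-encoder rerankers score query and document together",
    "bi-encoders embed query and document separately",
    "token budget management for prompts",
    "how to bake sourdough bread",
]
top = retrieve_then_rerank(
    "query and document together",
    corpus,
    cheap_score=word_overlap,
    pair_score=phrase_aware,
    retrieve_k=2,
    keep_k=1,
)
```

The point of the structure is cost control: the expensive scorer runs only `retrieve_k` times per query, regardless of corpus size.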
Deduplication is an often-skipped step that can noticeably improve answer quality. Overlapping chunks from the same document section produce near-duplicate content in the retrieved set. Packing five near-identical passages into the prompt wastes tokens and can cause the model to over-weight that one fact. After reranking, remove candidates whose text similarity to a higher-ranked chunk exceeds a threshold, keeping only the highest-ranked chunk in each near-duplicate cluster.
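One way to implement this step is a greedy pass over the reranked list. The sketch below assumes Jaccard similarity over word 3-grams as the similarity measure and 0.6 as the threshold; both are illustrative choices rather than recommendations, and embedding cosine similarity would slot in the same way. The helper names are hypothetical.

```python
from typing import FrozenSet, List, Tuple

Shingles = FrozenSet[Tuple[str, ...]]


def shingles(text: str, n: int = 3) -> Shingles:
    """Return the set of lowercase word n-grams in text."""
    words = text.lower().split()
    if len(words) < n:
        return frozenset({tuple(words)})
    return frozenset(tuple(words[i:i + n]) for i in range(len(words) - n + 1))


def jaccard(a: Shingles, b: Shingles) -> float:
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def dedupe_ranked(chunks: List[str], threshold: float = 0.6) -> List[str]:
    """Greedy dedup. chunks must already be sorted best-first by the reranker."""
    kept: List[str] = []
    kept_shingles: List[Shingles] = []
    for chunk in chunks:
        current = shingles(chunk)
        # Keep the chunk only if it is not too similar to any higher-ranked
        # chunk that was already kept.
        if all(jaccard(current, seen) <= threshold for seen in kept_shingles):
            kept.append(chunk)
            kept_shingles.append(current)
    return kept


ranked = [
    "the cache is invalidated after every write to the primary table",
    "the cache is invalidated after every write to the primary table and index",
    "token budgets must reserve room for the system prompt",
]
unique = dedupe_ranked(ranked)
```

Because the input is processed in rank order, the survivor of each near-duplicate cluster is always its highest-ranked member. Here the second chunk is dropped as a near-copy of the first.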
Context assembly is more than concatenating the top-k chunks. Adjacent chunks from the same document may need stitching to restore coherence. Ordering matters: lost-in-the-middle research suggests models attend better to evidence at the beginning and end of the context than to evidence in the middle. Each chunk should carry its source identifier so the generator can produce accurate citations. Finally, the assembled context must fit within the model's effective context window after accounting for the system prompt and expected output length.
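A rough sketch of budgeted assembly with source labels and edge-biased ordering follows. Several things here are assumptions made for illustration: the `Chunk` type and function names are hypothetical, whitespace word count is a crude proxy for the model's real tokenizer, and alternating chunks between the front and back is just one simple way to keep the strongest evidence at the edges. Stitching of adjacent chunks is left out.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Chunk:
    source_id: str
    text: str
    score: float  # reranker score, higher is better


def render(chunk: Chunk) -> str:
    """Prefix the text with its source ID so the generator can cite it."""
    return f"[{chunk.source_id}] {chunk.text}"


def count_tokens(text: str) -> int:
    """Crude proxy: whitespace-separated words. Swap in the model's tokenizer."""
    return len(text.split())


def pack_within_budget(chunks: List[Chunk], budget: int) -> List[Chunk]:
    """Greedily take chunks best-first, skipping any that would exceed the budget."""
    packed, used = [], 0
    for chunk in sorted(chunks, key=lambda c: c.score, reverse=True):
        cost = count_tokens(render(chunk))
        if used + cost <= budget:
            packed.append(chunk)
            used += cost
    return packed


def order_for_edges(ranked: List[Chunk]) -> List[Chunk]:
    """Place best-first chunks alternately at the front and back of the context."""
    front, back = [], []
    for i, chunk in enumerate(ranked):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]  # weakest chunks end up in the middle


def assemble_context(
    chunks: List[Chunk],
    context_window: int,
    system_prompt_tokens: int,
    reserved_output_tokens: int,
) -> str:
    # Whatever the system prompt and the expected answer need is not available
    # for retrieved evidence.
    budget = context_window - system_prompt_tokens - reserved_output_tokens
    ordered = order_for_edges(pack_within_budget(chunks, budget))
    return "\n\n".join(render(c) for c in ordered)


chunks = [
    Chunk("doc-a", "alpha beta gamma delta", 0.9),
    Chunk("doc-b", "epsilon zeta eta theta", 0.8),
    Chunk("doc-c", "iota kappa lambda mu", 0.7),
    Chunk("doc-d", "nu xi omicron pi", 0.6),
]
context = assemble_context(
    chunks, context_window=40, system_prompt_tokens=10, reserved_output_tokens=14
)
```

With a budget of 16 proxy tokens and 5 per rendered chunk, three chunks fit; the best lands first, the second-best last, and the third in the middle.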
Key Concepts
- Bi-encoders are fast but imprecise — cross-encoders are slow but significantly more accurate
- Retrieve broadly (top 20–100), then rerank tightly (keep top 3–5) for the best quality/speed tradeoff
- Deduplication after reranking prevents the model from over-weighting repeated facts
- Context ordering matters — models attend better to evidence at the beginning and end
- Each chunk should carry a source ID to enable accurate citation in the generated answer