Reranking and Context Assembly

Cross-encoders, deduplication, and token budget management

Initial retrieval casts a wide net. Rerankers — especially cross-encoders — reorder candidates based on direct query-document relevance. After reranking, near-duplicates are removed and the best chunks are packed into the prompt within a token budget.

flowchart TD
    Q([Query]) --> BI[Bi-encoder<br/>Retrieve Top-50]
    BI --> CE[Cross-encoder<br/>Rerank]
    CE --> DD[Deduplicate<br/>near-similar chunks]
    DD --> TB{Token Budget<br/>check}
    TB -->|fits| CTX[Assembled Context<br/>with source IDs]
    TB -->|overflow| TRIM[Trim lowest-ranked]
    TRIM --> CTX
    CTX --> LLM[LLM Generator]

Bi-encoder retrieval (embedding the query and each document separately, then computing a dot product) is fast but imprecise: because the two embeddings are computed independently, it misses cross-attention signals between query and document. Cross-encoder rerankers fix this by jointly encoding each query-document pair and scoring it in a single forward pass. This is much slower (the model must run once per candidate) but dramatically more accurate. The standard pattern is to retrieve broadly with a bi-encoder (top 20–100), then rerank tightly with a cross-encoder (keeping the top 3–5).
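The two-stage pattern can be sketched in a few lines. This is a toy illustration, not a real system: the document "embeddings" are hand-written vectors, and the cross-encoder is faked with word overlap. In practice both stages would call trained models (the point here is only the shape of the pipeline: score independently and broadly, then score jointly and narrowly).

```python
import re

# Toy corpus with hand-written "embeddings". In a real system these vectors
# come from a bi-encoder (sentence-embedding) model.
DOCS = [
    {"id": "d1", "text": "Cross-encoders jointly encode query and document.", "emb": [0.9, 0.1, 0.2]},
    {"id": "d2", "text": "Bi-encoders embed query and document separately.",  "emb": [0.8, 0.3, 0.1]},
    {"id": "d3", "text": "Bananas are rich in potassium.",                    "emb": [0.1, 0.9, 0.7]},
]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def bi_encoder_retrieve(query_emb, docs, top_k):
    """Stage 1: fast but imprecise -- rank by dot product of
    independently computed embeddings."""
    return sorted(docs, key=lambda d: dot(query_emb, d["emb"]), reverse=True)[:top_k]

def cross_encoder_score(query, doc_text):
    """Stand-in for a real cross-encoder: one scoring pass per
    (query, document) pair. Here word overlap fakes joint relevance."""
    q = set(re.findall(r"\w+", query.lower()))
    d = set(re.findall(r"\w+", doc_text.lower()))
    return len(q & d) / max(len(q), 1)

def rerank(query, candidates, keep):
    """Stage 2: slow but precise -- rescore every candidate jointly
    with the query, then keep only the best few."""
    return sorted(candidates,
                  key=lambda d: cross_encoder_score(query, d["text"]),
                  reverse=True)[:keep]

query = "how do cross-encoders score query and document pairs"
query_emb = [0.85, 0.2, 0.15]  # would come from the same bi-encoder
candidates = bi_encoder_retrieve(query_emb, DOCS, top_k=2)  # retrieve broadly
top = rerank(query, candidates, keep=1)                     # rerank tightly
```

Note the cost asymmetry this encodes: the bi-encoder scores every document with a cheap vector operation, while the cross-encoder runs only over the small candidate set it receives.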

Deduplication is an often-skipped step that significantly improves answer quality. Overlapping chunks from the same document section will produce near-duplicate content in the retrieved set. Packing five near-identical passages into the prompt wastes tokens and causes the model to over-weight that specific fact. After reranking, remove candidates whose text similarity exceeds a threshold, keeping only the highest-ranked unique chunk per near-duplicate cluster.

Context assembly is not just "concatenate the top-k chunks". Adjacent chunks from the same document may need stitching to restore coherence. Ordering matters — lost-in-the-middle research suggests models attend better to evidence at the beginning and end of context. Each chunk should carry its source identifier so the generator can produce accurate citations. And the assembled context must fit within the model's effective context window, accounting for the system prompt and expected output length.
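Pulling those requirements together, here is a hedged sketch of an assembler: it enforces the token budget by trimming lowest-ranked chunks, reorders survivors so the strongest evidence sits at the edges of the context (a simple lost-in-the-middle mitigation), and prefixes each chunk with its source ID for citation. Whitespace-split token counting is a placeholder for a real tokenizer, and the edge-ordering scheme is one reasonable choice, not a standard.

```python
def assemble_context(ranked_chunks, budget_tokens,
                     count_tokens=lambda s: len(s.split())):
    """Pack reranked chunks into a token budget, best-evidence-at-edges.
    `count_tokens` defaults to a crude whitespace split; swap in the
    generator model's real tokenizer in practice."""
    # 1. Take chunks in rank order while they fit; on overflow, it is the
    #    lower-ranked chunks that get trimmed.
    picked, used = [], 0
    for chunk in ranked_chunks:
        cost = count_tokens(chunk["text"])
        if used + cost > budget_tokens:
            continue
        picked.append(chunk)
        used += cost
    # 2. Reorder so the best chunks land at the start and end of the
    #    context and the weakest in the middle: ranks 1,2,3,4,5 are laid
    #    out as 1,3,5,4,2.
    ordered = picked[0::2] + picked[1::2][::-1]
    # 3. Carry source IDs so the generator can produce accurate citations.
    return "\n\n".join(f"[{c['id']}] {c['text']}" for c in ordered)

chunks = [
    {"id": "s1", "text": "alpha beta gamma"},      # rank 1, 3 tokens
    {"id": "s2", "text": "delta epsilon"},         # rank 2, 2 tokens
    {"id": "s3", "text": "zeta eta theta iota"},   # rank 3, 4 tokens
]
ctx = assemble_context(chunks, budget_tokens=6)  # s3 overflows and is trimmed
```

The budget passed in should already have the system prompt and expected output length subtracted, per the constraint above.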