Embeddings and Vector Search

Dense vectors, similarity search, and vector store internals

Embeddings convert text into dense vectors so semantically related items appear closer in vector space. Quality depends on the model, the language/domain match, and chunking choices. Production systems use approximate nearest neighbour indexes (FAISS, pgvector, Qdrant) rather than brute-force search.

```mermaid
flowchart LR
    T[Text Chunk] --> EM[Embedding Model]
    EM --> V[(Vector Store<br/>FAISS · pgvector<br/>Qdrant · Pinecone)]
    Q([User Query]) --> QE[Embed Query]
    QE --> ANN[ANN Search<br/>cosine similarity]
    V --> ANN
    ANN --> K[Top-K Candidates]
```

Embedding models map variable-length text onto fixed-length dense vectors such that semantically related passages cluster together in vector space. The embedding is the bridge between natural language and mathematical similarity. Two chunks that discuss the same concept in different words will have similar vectors even if they share no exact tokens — this is what allows RAG to retrieve by meaning rather than keyword overlap.
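The idea can be made concrete with cosine similarity on toy vectors. The four-dimensional vectors below are illustrative stand-ins, not real model output — production embedding models emit hundreds to thousands of dimensions:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: 1.0 = same direction, ~0.0 = unrelated."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical 4-d embeddings for three sentences. The first two discuss
# the same concept in different words and share no tokens; the third is
# off-topic.
v_refund  = np.array([0.9, 0.1, 0.2, 0.0])  # "How do I get my money back?"
v_policy  = np.array([0.8, 0.2, 0.3, 0.1])  # "What is the refund policy?"
v_weather = np.array([0.0, 0.1, 0.0, 0.9])  # "Will it rain tomorrow?"

print(cosine_similarity(v_refund, v_policy))   # high: same meaning, no shared tokens
print(cosine_similarity(v_refund, v_weather))  # low: unrelated topics
```

Retrieval by meaning is exactly this comparison, repeated against every stored chunk: the query lands near chunks about the same concept regardless of word choice.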

Vector search is almost always approximate nearest neighbour (ANN) search rather than exact brute-force cosine similarity. For small demo datasets brute force works fine, but at millions of chunks per-query latency becomes prohibitive. Production systems use specialised indexes — FAISS from Meta, pgvector as a Postgres extension, Qdrant, Weaviate, Pinecone, or Milvus. These indexes trade a small amount of recall for orders-of-magnitude speed improvements, and most expose tuning parameters (for example HNSW's efSearch or IVF's nprobe) that let you trade recall against latency.
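For intuition, here is the exact brute-force search that ANN indexes approximate — a minimal NumPy sketch over random stand-in embeddings, not a production implementation:

```python
import numpy as np

def brute_force_top_k(query: np.ndarray, corpus: np.ndarray, k: int = 5) -> np.ndarray:
    """Exact nearest-neighbour search: score every vector, keep the k best.

    Cost is O(n * d) per query — fine for demos, prohibitive at millions
    of chunks. ANN indexes (FAISS HNSW, pgvector IVFFlat, ...) return an
    approximation of this result in a fraction of the time.
    """
    # Normalise so that inner product equals cosine similarity.
    q = query / np.linalg.norm(query)
    c = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    scores = c @ q                    # one similarity score per chunk
    return np.argsort(-scores)[:k]    # indices of the k most similar chunks

rng = np.random.default_rng(0)
corpus = rng.normal(size=(10_000, 64))  # 10k fake 64-d chunk embeddings
query = rng.normal(size=64)
print(brute_force_top_k(query, corpus, k=3))
```

An ANN index is judged by how often its top-k matches this exact top-k (recall) and how much faster it gets there.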

Choosing an embedding model requires balancing several dimensions: retrieval quality on your domain, latency per request, cost per token, multilingual support, maximum context length, and whether the model can be deployed on-premises or must be called via API. General-purpose models like OpenAI text-embedding-3-large or Cohere embed-v3 work well for most English corpora. Domain-specific or multilingual corpora may benefit from purpose-built models. The only reliable way to compare models is to benchmark them on a representative subset of your actual queries.
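Such a benchmark usually boils down to computing recall@k on a labelled query set. A minimal sketch, using a hypothetical bag-of-words `toy_embed` as a stand-in for the models under comparison (a real run would pass wrappers around the actual embedding APIs):

```python
import numpy as np

def recall_at_k(embed, queries, chunks, relevant_idx, k=5):
    """Fraction of queries whose known-relevant chunk lands in the top-k.

    `embed` is whatever model you are evaluating — a wrapper around an
    embedding API call, a local sentence-transformer, etc.
    """
    C = np.array([embed(c) for c in chunks], dtype=float)
    C = C / np.linalg.norm(C, axis=1, keepdims=True)
    hits = 0
    for query, rel in zip(queries, relevant_idx):
        v = embed(query)
        v = v / np.linalg.norm(v)
        top_k = np.argsort(-(C @ v))[:k]
        hits += int(rel in top_k)
    return hits / len(queries)

# Hypothetical stand-in for a real model: a bag-of-words vector over a
# fixed vocabulary. Real embedding models are learned and dense.
chunks = [
    "refund policy for returns",
    "tomorrow weather forecast rain",
    "shipping times and delivery",
]
vocab = {tok: i for i, tok in enumerate(sorted({t for c in chunks for t in c.split()}))}

def toy_embed(text):
    v = np.zeros(len(vocab))
    for tok in text.lower().split():
        if tok in vocab:
            v[vocab[tok]] += 1.0
    return v

queries = ["refund returns", "rain forecast"]
relevant_idx = [0, 1]  # ground-truth chunk index for each query
print(recall_at_k(toy_embed, queries, chunks, relevant_idx, k=1))
```

This assumes one relevant chunk per query for simplicity; with multiple relevant chunks per query the same loop generalises to counting any hit in the top-k. Run the same function with each candidate model's embeddings over the same labelled set and compare the scores directly.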