Embedding models map variable-length text onto fixed-length dense vectors such that semantically related passages cluster together in vector space. The embedding is the bridge between natural language and mathematical similarity. Two chunks that discuss the same concept in different words will have similar vectors even if they share no exact tokens — this is what allows RAG to retrieve by meaning rather than keyword overlap.
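To make this concrete, here is a minimal sketch of semantic similarity using the open-source sentence-transformers library and its all-MiniLM-L6-v2 model (both chosen here only for illustration; any embedding API that returns dense vectors behaves the same way):

```python
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")

# Two chunks that share almost no tokens but discuss the same concept,
# plus one unrelated chunk for contrast.
a = "The patient exhibited elevated blood glucose after meals."
b = "Postprandial hyperglycemia was observed in the subject."
c = "The quarterly revenue report is due on Friday."

vecs = model.encode([a, b, c])  # shape (3, 384): fixed-length dense vectors

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(vecs[0], vecs[1]))  # high: same meaning, different words
print(cosine(vecs[0], vecs[2]))  # low: unrelated topic
```

The paraphrase pair scores far higher than the unrelated pair despite near-zero token overlap, which is exactly the property retrieval relies on.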
Vector search is almost always approximate nearest neighbour (ANN) search rather than exact brute-force cosine similarity. For small demo datasets brute-force works fine, but at millions of chunks the latency becomes prohibitive. Production systems use specialised indexes — FAISS from Meta, pgvector as a Postgres extension, Qdrant, Weaviate, Pinecone, or Milvus. These indexes trade a small amount of recall for orders-of-magnitude speed improvements, usually exposing knobs that tune the recall-versus-latency balance.
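The trade-off is easy to measure directly. The sketch below uses FAISS (one of the libraries named above) to compare an exact flat index against an HNSW graph index on random vectors; HNSW's `efSearch` parameter is a typical example of such a knob. The corpus size and parameter values are illustrative, not recommendations:

```python
import faiss
import numpy as np

d = 384                                             # embedding dimension
xb = np.random.rand(100_000, d).astype("float32")   # corpus vectors
xq = np.random.rand(10, d).astype("float32")        # query vectors
faiss.normalize_L2(xb)                              # unit vectors: inner product == cosine
faiss.normalize_L2(xq)

exact = faiss.IndexFlatIP(d)                        # brute-force, perfect recall
exact.add(xb)

ann = faiss.IndexHNSWFlat(d, 32, faiss.METRIC_INNER_PRODUCT)  # HNSW graph, M=32
ann.hnsw.efSearch = 64                              # higher = better recall, slower queries
ann.add(xb)

k = 10
_, gt = exact.search(xq, k)                         # ground-truth neighbours
_, approx = ann.search(xq, k)                       # approximate neighbours

# Recall@10: fraction of true neighbours the ANN index found.
recall = np.mean([len(set(g) & set(a)) / k for g, a in zip(gt, approx)])
print(f"recall@{k}: {recall:.3f}")
```

Raising `efSearch` pushes recall toward 1.0 at the cost of query latency, which is the configurable trade-off in practice.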
Choosing an embedding model requires balancing several factors: retrieval quality on your domain, latency per request, cost per token, multilingual support, maximum context length, and whether the model can be deployed on-premises or must be called via API. General-purpose models like OpenAI text-embedding-3-large or Cohere embed-v3 work well for most English corpora. Domain-specific or multilingual corpora may benefit from purpose-built models. The only reliable way to compare models is to benchmark them on a representative subset of your actual queries.
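A hedged sketch of that benchmarking step: given queries labelled with their known-relevant chunks (the corpus, labels, and candidate model names below are placeholders for your own), score each candidate by recall@k over the same corpus and pick the winner empirically:

```python
from sentence_transformers import SentenceTransformer
import numpy as np

corpus = ["...chunk 0...", "...chunk 1...", "...chunk 2..."]  # your real chunks
eval_set = [("example query", {0, 2})]                        # query -> relevant chunk ids

def recall_at_k(model_name: str, k: int = 5) -> float:
    model = SentenceTransformer(model_name)
    doc_vecs = model.encode(corpus, normalize_embeddings=True)
    score = 0.0
    for query, relevant in eval_set:
        q = model.encode([query], normalize_embeddings=True)[0]
        top = np.argsort(doc_vecs @ q)[::-1][:k]              # cosine via dot product
        score += len(relevant & set(top.tolist())) / len(relevant)
    return score / len(eval_set)

for name in ["all-MiniLM-L6-v2", "intfloat/e5-base-v2"]:      # candidate models
    print(name, recall_at_k(name))
```

A few hundred labelled queries is usually enough to separate candidates, and the same harness extends naturally to MRR or nDCG if rank position matters for your application.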