
Embeddings

Dense vector representations for text and retrieval.

Definition

Embeddings are dense, fixed-size numerical vectors that encode the semantic meaning of text (or other data modalities such as images and audio). When text is passed through an encoder model, semantically similar content produces vectors that are geometrically close in the high-dimensional space, so phrases like "customer support" and "help desk" map to nearby vectors, provided the encoder was trained on data where those usages overlap.

They are the bridge between raw text and vector databases. Both documents and queries must be embedded using the same encoder so that their vectors live in the same space and meaningful similarity comparisons can be made. The most common similarity metric is cosine similarity, though dot product and Euclidean distance are also used depending on the index configuration.
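As a quick illustration of how these metrics relate, the sketch below computes all three for a pair of toy vectors (the values are invented for illustration). Note that after L2-normalization, cosine similarity and dot product coincide, which is why many indexes store normalized vectors.

import numpy as np

a = np.array([0.2, 0.8, 0.1])   # toy vector, e.g. "help desk"
b = np.array([0.25, 0.7, 0.2])  # toy vector, e.g. "customer support"

cosine = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
dot = float(np.dot(a, b))
euclidean = float(np.linalg.norm(a - b))
print(f"cosine={cosine:.4f}  dot={dot:.4f}  euclidean={euclidean:.4f}")

# After L2-normalization, cosine similarity equals the dot product.
a_n, b_n = a / np.linalg.norm(a), b / np.linalg.norm(b)
assert np.isclose(np.dot(a_n, b_n), cosine)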

The choice of embedding model is one of the highest-impact decisions in a RAG system. Factors include vector dimensionality (higher = more expressive but more storage), context window (how much text the encoder processes at once), domain specificity (a legal or biomedical model may outperform a general-purpose one), multilingual support, and cost (API vs. self-hosted). Popular options include OpenAI text-embedding-3-large, Cohere Embed, and open-source sentence-transformers. See RAG architecture for how embeddings fit into the full pipeline.
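To make the storage side of that trade-off concrete, here is a back-of-the-envelope calculation assuming float32 vectors (4 bytes per dimension) and ignoring index overhead:

def storage_gb(num_vectors: int, dims: int, bytes_per_dim: int = 4) -> float:
    """Raw vector storage: num_vectors * dims * bytes per dimension."""
    return num_vectors * dims * bytes_per_dim / 1e9

print(storage_gb(1_000_000, 1536))  # ~6.1 GB for 1M chunks at 1536 dims
print(storage_gb(1_000_000, 384))   # ~1.5 GB for 1M chunks at 384 dims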

How it works

Encoding pipeline

The pipeline, step by step:

1. Text (a sentence, paragraph, or chunk) is fed into an encoder (e.g. OpenAI embeddings, Cohere, or open-source sentence-transformers).
2. The encoder outputs a fixed-size vector (e.g. 768 or 1536 dimensions).
3. Training uses contrastive or similar objectives so that semantically related texts get nearby vectors.
4. At query time, similarity is computed as cosine or dot product between the query vector and stored document vectors.

Models can be multilingual or domain-specific. For RAG, always use the same encoder for documents and queries so that distances are meaningful.
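A minimal end-to-end sketch of this pipeline with the open-source sentence-transformers library (the model name and example strings are illustrative; any bi-encoder works the same way):

from sentence_transformers import SentenceTransformer, util

# The same encoder embeds documents and queries, so vectors share one space.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")  # 384-dim

docs = [
    "Our help desk is open 9am to 5pm on weekdays.",
    "Refunds are processed within 30 days of purchase.",
]
doc_vecs = model.encode(docs, normalize_embeddings=True)
query_vec = model.encode("When can I reach customer support?", normalize_embeddings=True)

# Cosine similarity between the query and each document vector.
print(util.cos_sim(query_vec, doc_vecs))  # the first document should score higher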

When to use / When NOT to use

| Scenario | Use embeddings | Don't use embeddings |
|---|---|---|
| Semantic search ("find similar meaning") | Yes — embeddings capture semantic intent | No — keyword search if exact string match is needed |
| Multilingual retrieval | Yes — multilingual encoders map languages to the same space | No — language-specific BM25 if you only have one language |
| Short queries against long documents | Yes — embed the query and chunked docs | No — embedding entire long documents without chunking loses precision (see the chunking sketch below) |
| Exact lookup by ID or structured field | No — use a relational DB or metadata filter | Yes — embeddings not needed for exact match |
| Low latency, limited compute | Consider smaller models (e.g. MiniLM) | Avoid large API-based models for every request |
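A minimal sketch of the chunking point from the table above, using a naive fixed-size character splitter (the chunk size and overlap are arbitrary; production splitters usually respect sentence or section boundaries):

def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping character windows before embedding."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

doc = "All policy details. " * 300  # stand-in for a long document
chunks = chunk_text(doc)
# Embed each chunk separately so a short query can match the one passage
# that answers it, rather than an averaged whole-document vector.
print(f"{len(chunks)} chunks to embed")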

Comparisons

| Model | Dimensions | Context | Multilingual | Cost | Best for |
|---|---|---|---|---|---|
| OpenAI text-embedding-3-large | 3072 | 8191 tokens | Yes | API (paid) | High-accuracy production RAG |
| OpenAI text-embedding-3-small | 1536 | 8191 tokens | Yes | API (low cost) | Cost-sensitive apps |
| Cohere Embed v3 | 1024 | 512 tokens | Yes | API (paid) | Reranking + retrieval |
| sentence-transformers/all-MiniLM-L6-v2 | 384 | 256 tokens | No | Self-hosted (free) | Low-latency or offline |
| BAAI/bge-large-en-v1.5 | 1024 | 512 tokens | No | Self-hosted (free) | High-quality open-source |
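One practical note on the Dimensions column: the OpenAI text-embedding-3 models accept a dimensions parameter that shortens the returned vector, trading some accuracy for lower storage and faster search. A small sketch (the value 512 is an arbitrary choice):

from openai import OpenAI

client = OpenAI()

response = client.embeddings.create(
    model="text-embedding-3-small",
    input="The refund policy allows returns within 30 days.",
    dimensions=512,  # shortened from the model's default 1536
)
print(len(response.data[0].embedding))  # 512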

Pros and cons

| Pros | Cons |
|---|---|
| Captures semantic meaning, not just keywords | Vector space varies by model; can't mix encoders |
| Enables cross-lingual retrieval with multilingual models | Dimensionality increases storage and compute cost |
| Reusable: same vectors serve search, clustering, dedup | Quality depends heavily on model choice and domain fit |
| Fast at query time with ANN indexes (see the FAISS sketch below) | No interpretability — hard to debug why a chunk was returned |
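A minimal sketch of the ANN point using FAISS with random stand-in vectors. A flat index as shown here is actually exact search; approximate variants such as IndexHNSWFlat trade a little recall for large speedups on big corpora. With L2-normalized vectors, inner-product search is equivalent to cosine similarity:

import numpy as np
import faiss

d = 384  # vector dimensionality
rng = np.random.default_rng(0)

# Stand-in corpus: 10k random unit vectors (real ones come from an encoder).
vectors = rng.standard_normal((10_000, d)).astype("float32")
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)

index = faiss.IndexFlatIP(d)  # inner product == cosine on unit vectors
index.add(vectors)

query = vectors[:1]  # query with a known vector
scores, ids = index.search(query, 5)
print(ids[0], scores[0])  # nearest neighbor should be the query itself (id 0)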

Code examples

import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(text: str) -> list[float]:
    """Return the embedding vector for a single piece of text."""
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text,
    )
    return response.data[0].embedding

# Embed a document chunk and a query with the same encoder
doc_vec = embed("The refund policy allows returns within 30 days.")
query_vec = embed("How long do I have to return a product?")

# Cosine similarity (manual)
def cosine_sim(a, b):
    a, b = np.array(a), np.array(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(f"Similarity: {cosine_sim(doc_vec, query_vec):.4f}")
