
Embeddings

Dense vector representations for text and retrieval.

Definition

Embeddings are dense, fixed-size numerical vectors that encode the semantic meaning of text (or other data modalities such as images and audio). When text is passed through an encoder model, semantically similar content produces vectors that are geometrically close in the high-dimensional space, so phrases like "customer support" and "help desk" map to nearby vectors, provided the encoder was trained on data where those usages overlap.

They are the bridge between raw text and vector databases. Both documents and queries must be embedded using the same encoder so that their vectors live in the same space and meaningful similarity comparisons can be made. The most common similarity metric is cosine similarity, though dot product and Euclidean distance are also used depending on the index configuration.
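As a quick illustration of how these metrics relate, the sketch below computes all three for a pair of toy vectors (the values are invented for illustration). Note that after L2-normalization, cosine similarity and dot product coincide, which is why many indexes store normalized vectors.

import numpy as np

a = np.array([0.2, 0.8, 0.1])   # toy vector, e.g. "help desk"
b = np.array([0.25, 0.7, 0.2])  # toy vector, e.g. "customer support"

cosine = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
dot = float(np.dot(a, b))
euclidean = float(np.linalg.norm(a - b))
print(f"cosine={cosine:.4f}  dot={dot:.4f}  euclidean={euclidean:.4f}")

# After L2-normalization, cosine similarity equals the dot product.
a_n, b_n = a / np.linalg.norm(a), b / np.linalg.norm(b)
assert np.isclose(np.dot(a_n, b_n), cosine)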

The choice of embedding model is one of the highest-impact decisions in a RAG system. Factors include vector dimensionality (higher = more expressive but more storage), context window (how much text the encoder processes at once), domain specificity (a legal or biomedical model may outperform a general-purpose one), multilingual support, and cost (API vs. self-hosted). Popular options include OpenAI text-embedding-3-large, Cohere Embed, and open-source sentence-transformers. See RAG architecture for how embeddings fit into the full pipeline.
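To make the storage side of that trade-off concrete, here is a back-of-the-envelope calculation assuming float32 vectors (4 bytes per dimension) and ignoring index overhead:

def storage_gb(num_vectors: int, dims: int, bytes_per_dim: int = 4) -> float:
    """Raw vector storage: num_vectors * dims * bytes per dimension."""
    return num_vectors * dims * bytes_per_dim / 1e9

print(storage_gb(1_000_000, 1536))  # ~6.1 GB for 1M chunks at 1536 dims
print(storage_gb(1_000_000, 384))   # ~1.5 GB for 1M chunks at 384 dims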

How it works

Encoding pipeline

The pipeline, step by step:

1. Text (a sentence, paragraph, or chunk) is fed into an encoder (e.g. OpenAI embeddings, Cohere, or open-source sentence-transformers).
2. The encoder outputs a fixed-size vector (e.g. 768 or 1536 dimensions).
3. Training uses contrastive or similar objectives so that semantically related texts get nearby vectors.
4. At query time, similarity is computed as cosine or dot product between the query vector and stored document vectors.

Models can be multilingual or domain-specific. For RAG, always use the same encoder for documents and queries so that distances are meaningful.
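A minimal end-to-end sketch of this pipeline with the open-source sentence-transformers library (the model name and example strings are illustrative; any bi-encoder works the same way):

from sentence_transformers import SentenceTransformer, util

# The same encoder embeds documents and queries, so vectors share one space.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")  # 384-dim

docs = [
    "Our help desk is open 9am to 5pm on weekdays.",
    "Refunds are processed within 30 days of purchase.",
]
doc_vecs = model.encode(docs, normalize_embeddings=True)
query_vec = model.encode("When can I reach customer support?", normalize_embeddings=True)

# Cosine similarity between the query and each document vector.
print(util.cos_sim(query_vec, doc_vecs))  # the first document should score higher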

When to use / When NOT to use

| Scenario | Use embeddings | Don't use embeddings |
|---|---|---|
| Semantic search ("find similar meaning") | Yes — embeddings capture semantic intent | No — keyword search if exact string match is needed |
| Multilingual retrieval | Yes — multilingual encoders map languages to the same space | No — language-specific BM25 if you only have one language |
| Short queries against long documents | Yes — embed the query and chunked docs | No — embedding entire long documents without chunking loses precision (see the chunking sketch below) |
| Exact lookup by ID or structured field | No — use a relational DB or metadata filter | Yes — embeddings not needed for exact match |
| Low latency, limited compute | Consider smaller models (e.g. MiniLM) | Avoid large API-based models for every request |
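A minimal sketch of the chunking point from the table above, using a naive fixed-size character splitter (the chunk size and overlap are arbitrary; production splitters usually respect sentence or section boundaries):

def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping character windows before embedding."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

doc = "All policy details. " * 300  # stand-in for a long document
chunks = chunk_text(doc)
# Embed each chunk separately so a short query can match the one passage
# that answers it, rather than an averaged whole-document vector.
print(f"{len(chunks)} chunks to embed")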

Comparisons

| Model | Dimensions | Context | Multilingual | Cost | Best for |
|---|---|---|---|---|---|
| OpenAI text-embedding-3-large | 3072 | 8191 tokens | Yes | API (paid) | High-accuracy production RAG |
| OpenAI text-embedding-3-small | 1536 | 8191 tokens | Yes | API (low cost) | Cost-sensitive apps |
| Cohere Embed v3 | 1024 | 512 tokens | Yes | API (paid) | Reranking + retrieval |
| sentence-transformers/all-MiniLM-L6-v2 | 384 | 256 tokens | No | Self-hosted (free) | Low-latency or offline |
| BAAI/bge-large-en-v1.5 | 1024 | 512 tokens | No | Self-hosted (free) | High-quality open-source |
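One practical note on the Dimensions column: the OpenAI text-embedding-3 models accept a dimensions parameter that shortens the returned vector, trading some accuracy for lower storage and faster search. A small sketch (the value 512 is an arbitrary choice):

from openai import OpenAI

client = OpenAI()

response = client.embeddings.create(
    model="text-embedding-3-small",
    input="The refund policy allows returns within 30 days.",
    dimensions=512,  # shortened from the model's default 1536
)
print(len(response.data[0].embedding))  # 512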

Pros and cons

| Pros | Cons |
|---|---|
| Captures semantic meaning, not just keywords | Vector space varies by model; can't mix encoders |
| Enables cross-lingual retrieval with multilingual models | Dimensionality increases storage and compute cost |
| Reusable: same vectors serve search, clustering, dedup | Quality depends heavily on model choice and domain fit |
| Fast at query time with ANN indexes (see the FAISS sketch below) | No interpretability — hard to debug why a chunk was returned |
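A minimal sketch of the ANN point using FAISS with random stand-in vectors. A flat index as shown here is actually exact search; approximate variants such as IndexHNSWFlat trade a little recall for large speedups on big corpora. With L2-normalized vectors, inner-product search is equivalent to cosine similarity:

import numpy as np
import faiss

d = 384  # vector dimensionality
rng = np.random.default_rng(0)

# Stand-in corpus: 10k random unit vectors (real ones come from an encoder).
vectors = rng.standard_normal((10_000, d)).astype("float32")
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)

index = faiss.IndexFlatIP(d)  # inner product == cosine on unit vectors
index.add(vectors)

query = vectors[:1]  # query with a known vector
scores, ids = index.search(query, 5)
print(ids[0], scores[0])  # nearest neighbor should be the query itself (id 0)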

Code examples

import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(text: str) -> list[float]:
    """Return the embedding vector for a single piece of text."""
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text,
    )
    return response.data[0].embedding

# Embed a document chunk and a query with the same encoder
doc_vec = embed("The refund policy allows returns within 30 days.")
query_vec = embed("How long do I have to return a product?")

# Cosine similarity (manual)
def cosine_sim(a, b):
    a, b = np.array(a), np.array(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(f"Similarity: {cosine_sim(doc_vec, query_vec):.4f}")
