Semantic search

Search by meaning using embeddings and similarity.

Definition

Semantic search is a retrieval paradigm that returns results based on meaning and intent rather than exact keyword matching. A user query and the documents in the corpus are both encoded into dense vector representations (embeddings), and retrieval is performed by finding the documents whose vectors are most similar to the query vector — typically using cosine similarity or dot product. Because the embedding space is learned from large corpora, queries like "affordable accommodation" correctly retrieve documents containing "cheap hotels" even though they share no keywords.

The core insight is that a well-trained embedding model maps semantically similar text to nearby points in a high-dimensional vector space. This is achieved through contrastive training objectives: similar sentences are pulled together and dissimilar ones pushed apart. Models like Sentence-BERT, OpenAI Ada, and Cohere Embed are trained specifically for retrieval tasks, learning to distinguish subtle differences in meaning that a bag-of-words model would miss. The dimensionality of the embedding (commonly 768 to 3072) determines the expressiveness of the representation, while the choice of similarity function and approximate nearest-neighbor (ANN) index determines retrieval speed and accuracy.
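The geometry behind this can be shown with a toy example. The snippet below is a minimal sketch (the 3-dimensional vectors are made up for illustration; real embeddings have hundreds or thousands of dimensions) of how cosine similarity ranks a vector pointing in a similar direction above an unrelated one:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-d "embeddings" (real models use 768 to 3072 dimensions)
query = np.array([0.9, 0.1, 0.0])
doc_similar = np.array([0.8, 0.2, 0.1])    # points in a similar direction
doc_unrelated = np.array([0.0, 0.1, 0.9])  # points elsewhere

print(cosine_similarity(query, doc_similar))    # high, close to 1
print(cosine_similarity(query, doc_unrelated))  # low, close to 0
```

Because embedding models are typically trained so that paraphrases land in similar directions, this single scalar comparison is what makes "affordable accommodation" retrieve "cheap hotels".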

Semantic search is the retrieval backbone of RAG (Retrieval-Augmented Generation): user queries are embedded and matched against a library of pre-indexed document chunks, and the top results are injected into the LLM's context window. It also underpins recommendation systems ("similar items"), deduplication pipelines, and clustering. Hybrid search — combining semantic (dense) retrieval with keyword (sparse, BM25) retrieval and re-ranking the combined results — often outperforms either approach alone, especially for queries that mix natural language intent with specific technical terms or identifiers.

How it works

Embedding and indexing

Documents are chunked (for long-form content), embedded using a bi-encoder model, and stored in a vector index. The index can be a flat brute-force index (for small corpora), or an approximate nearest-neighbor index such as HNSW (Hierarchical Navigable Small World) or IVF (Inverted File Index) for large-scale retrieval.

Query execution

At query time, the incoming query is embedded with the same bi-encoder used for indexing, and the index returns the top-k documents whose vectors are most similar to the query vector (by cosine similarity or dot product). Because the expensive work of embedding the corpus happens offline, query-time cost reduces to a single embedding call plus an index lookup.

Hybrid search and reranking

Pure semantic search can miss results where exact terms matter (product codes, names, technical identifiers). Hybrid search runs both dense (semantic) and sparse (BM25 keyword) retrieval and merges results using Reciprocal Rank Fusion or a learned combination. A cross-encoder reranker then scores the top candidates by jointly encoding the query and each document — more accurate but slower than the bi-encoder retrieval step.
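Reciprocal Rank Fusion itself is straightforward to implement. A minimal sketch follows, where the document IDs are illustrative and k=60 is the smoothing constant proposed in the original RRF paper:

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists: each doc's score is the sum of 1/(k + rank)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# BM25 and dense retrieval disagree; RRF rewards docs ranked well by both
bm25_results = ["doc_a", "doc_b", "doc_c", "doc_d"]
dense_results = ["doc_c", "doc_a", "doc_e", "doc_b"]

print(reciprocal_rank_fusion([bm25_results, dense_results]))
# → ['doc_a', 'doc_c', 'doc_b', 'doc_e', 'doc_d']
```

Note how doc_a and doc_c, each ranked highly by both retrievers, end up above doc_b, which neither retriever placed first; the constant k dampens the influence of any single list's top ranks.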

When to use / When NOT to use

| Use when | Avoid when |
| --- | --- |
| Users express intent in natural language and exact keyword matching produces poor recall | Users always search with exact product codes, IDs, or structured filters |
| Corpus contains paraphrased or diverse phrasing for the same concepts | Corpus is small enough that full-text search with good tokenization suffices |
| Building RAG pipelines that need relevant context retrieval | Latency requirements cannot accommodate vector index lookup |
| Recommendation and "similar item" features in user-facing products | Privacy constraints prevent embedding documents in third-party models |

Comparisons

| Method | Matching strategy | Strengths | Limitations |
| --- | --- | --- | --- |
| Keyword (BM25) | Exact term frequency | Fast, interpretable, handles rare terms | Misses synonyms and paraphrases |
| Semantic (dense) | Embedding similarity | Handles synonymy, intent, context | Misses rare exact-match terms; needs embedding model |
| Hybrid (BM25 + dense) | Combined ranking | Best of both worlds | More infrastructure complexity |
| Cross-encoder reranker | Joint query-doc scoring | Highest accuracy | Slow; used only for top-k candidates |

Pros and cons

| Pros | Cons |
| --- | --- |
| Handles natural language queries robustly | Requires embedding model and vector index infrastructure |
| Works across languages if a multilingual model is used | Embedding quality determines retrieval ceiling; poor models produce poor results |
| Scales to millions of documents with ANN indexes | ANN indexes introduce recall-latency tradeoffs |
| Enables powerful RAG and recommendation systems | Chunking strategy and embedding granularity require careful tuning |

Code examples

Semantic search with Sentence-BERT and FAISS (Python)

```python
from sentence_transformers import SentenceTransformer
import faiss
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")

# Index a small corpus
corpus = [
    "How to fine-tune a transformer model on a custom dataset",
    "Introduction to reinforcement learning from human feedback",
    "Best practices for deploying machine learning models to production",
    "Understanding attention mechanisms in neural networks",
    "Data augmentation techniques for computer vision tasks",
]

# L2-normalize so that inner product equals cosine similarity
corpus_embeddings = model.encode(corpus, convert_to_numpy=True)
corpus_embeddings = corpus_embeddings / np.linalg.norm(corpus_embeddings, axis=1, keepdims=True)

# Build a flat (brute-force) FAISS index over inner product
index = faiss.IndexFlatIP(corpus_embeddings.shape[1])
index.add(corpus_embeddings.astype(np.float32))

# Embed and normalize the query the same way as the corpus
query = "how to deploy ML models"
query_embedding = model.encode([query], convert_to_numpy=True)
query_embedding = query_embedding / np.linalg.norm(query_embedding, axis=1, keepdims=True)

scores, indices = index.search(query_embedding.astype(np.float32), k=3)

print(f"Query: {query}\nTop results:")
for rank, (score, idx) in enumerate(zip(scores[0], indices[0])):
    print(f"  {rank + 1}. [{score:.3f}] {corpus[idx]}")
```
