Retrieval-augmented generation (RAG)
Combining retrieval with LLM generation for accurate, grounded answers.
Definition
Retrieval-augmented generation (RAG) is a technique that augments a large language model with an external retrieval step: given a user query, the system first retrieves relevant documents from a knowledge source (typically a vector store or search index), then passes those documents as context to the LLM to generate a grounded answer. This approach reduces hallucination by anchoring the model's output in real, verifiable data rather than relying solely on knowledge encoded during pre-training.
RAG emerged as a practical middle ground between two extremes: using a general-purpose LLM with no domain knowledge, and fine-tuning a model on domain-specific data. The original RAG architecture was proposed by Lewis et al. (2020) at Facebook AI, combining a retriever based on Dense Passage Retrieval with a sequence-to-sequence generator (BART). Since then, RAG has evolved into a widely adopted architectural pattern with many variations in chunking strategy, retrieval method, and generation technique.
RAG is particularly important in enterprise and production settings because it allows organizations to leverage proprietary or frequently changing data without the cost and complexity of model fine-tuning. It also enables source citation — the system can point to the exact documents that informed its answer, which is critical for trust, compliance, and auditability in domains like legal, healthcare, and finance.
How it works
Indexing (offline)
Before RAG can answer queries, your knowledge base must be indexed. Documents are split into chunks (paragraphs, sections, or sliding windows), each chunk is converted into a dense vector using an embedding model, and the resulting vectors are stored in a vector database. Chunking strategy significantly impacts retrieval quality: chunks that are too large dilute relevance, while chunks that are too small lose context.
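To make the chunking step concrete, here is a minimal sliding-window chunker. The 500-character size and 100-character overlap are arbitrary illustrative values, and real pipelines often split on document structure (headings, paragraphs) instead:

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 100) -> list[str]:
    """Split text into overlapping fixed-size character windows.

    Assumes overlap < chunk_size so the window always advances.
    """
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        window = text[start:start + chunk_size]
        if window.strip():
            chunks.append(window)
    return chunks

# Each chunk is then embedded and stored; with LangChain this might look like:
# vectors = OpenAIEmbeddings().embed_documents(chunk_text(raw_text))
```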
Retrieval (query-time)
When a user sends a query, it is embedded using the same model, and the system performs a similarity search (cosine or dot product) against the vector database to retrieve the top-k most relevant chunks. Advanced RAG pipelines add a reranking step after initial retrieval to improve precision — a cross-encoder model scores each retrieved chunk against the query and reorders them.
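A sketch of both stages, assuming chunk embeddings are stored as rows of a NumPy array and using a sentence-transformers cross-encoder as one possible reranker (the model name is just a publicly available example):

```python
import numpy as np
from sentence_transformers import CrossEncoder

def retrieve_top_k(query_vec: np.ndarray, chunk_vecs: np.ndarray, k: int = 10) -> list[int]:
    # Cosine similarity reduces to a dot product once vectors are L2-normalized
    q = query_vec / np.linalg.norm(query_vec)
    c = chunk_vecs / np.linalg.norm(chunk_vecs, axis=1, keepdims=True)
    scores = c @ q
    return np.argsort(scores)[::-1][:k].tolist()

def rerank(query: str, candidates: list[str], k: int = 4) -> list[str]:
    # A cross-encoder scores each (query, chunk) pair jointly, which is
    # slower than vector search but more precise
    reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # example model
    scores = reranker.predict([(query, c) for c in candidates])
    order = np.argsort(scores)[::-1][:k]
    return [candidates[i] for i in order]
```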
Generation (query-time)
The retrieved chunks are injected into the LLM prompt as context, alongside the original query. The LLM generates an answer grounded in this context. Prompt design matters here — instructions like "Answer using only the provided context" help reduce hallucination, while "If the context doesn't contain the answer, say so" prevents fabrication.
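Stripped of frameworks, the generation step is just prompt assembly plus one LLM call. A minimal sketch using the OpenAI Python client (the model name is an example; any chat model works):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_answer(query: str, chunks: list[str]) -> str:
    context = "\n\n".join(chunks)
    system = (
        "Answer using only the provided context. If the context "
        "doesn't contain the answer, say 'I don't know'.\n\n"
        f"Context:\n{context}"
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # example model
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": query},
        ],
    )
    return response.choices[0].message.content
```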
When to use / When NOT to use
| Use when | Avoid when |
|---|---|
| Knowledge changes frequently (docs, FAQs, policies) and retraining is impractical | The knowledge is static and small enough to fit entirely in the prompt context window |
| You need answers grounded in private or domain-specific data | You need the model to learn a new behavior or style (fine-tuning is better) |
| Source citation and auditability are requirements | The application is extremely latency-sensitive and the retrieval step would add unacceptable delay |
| You want to keep costs low — no training compute needed | The domain requires reasoning across the entire corpus, not just retrieved chunks |
| Multiple data sources need to be queried (multi-index RAG) | Your data is mostly structured/tabular (SQL or structured queries may be more appropriate) |
Comparisons
| Criteria | RAG | Fine-tuning |
|---|---|---|
| Knowledge update speed | Instant (update index) | Slow (retrain model) |
| Cost | Low (inference + embedding) | High (training compute + hosting) |
| Hallucination control | Strong (grounded in retrieved docs) | Moderate (depends on training data quality) |
| Source citation | Native (retrieved chunks are traceable) | Not supported |
| Custom behavior / style | Limited | Strong |
| Setup complexity | Moderate (chunking + vector DB + retrieval) | High (dataset curation + training pipeline) |
Pros and cons
| Pros | Cons |
|---|---|
| Reduces hallucination by grounding in real data | Retrieval quality depends heavily on chunking and embedding choices |
| No need to retrain when knowledge changes | Adds latency from the retrieval step |
| Enables source citation for trust and compliance | Requires maintaining a vector database and indexing pipeline |
| Works with any LLM (API or self-hosted) | Context window limits how many chunks can be passed |
| Lower cost than fine-tuning for most use cases | "Garbage in, garbage out" — poor document quality propagates to answers |
Benchmarks
- RAGAS — Framework for evaluating RAG pipelines (faithfulness, answer relevance, context precision/recall); a usage sketch follows this list
- MTEB Leaderboard — Embedding model benchmarks relevant to RAG retrieval quality
- RGB Benchmark — Benchmarking retrieval-augmented generation across noise, rejection, integration, and counterfactual scenarios
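As an illustration of the RAGAS entry above, a minimal evaluation run, assuming the ragas 0.1-style API (metric names and dataset columns have shifted between versions, so treat this as a sketch):

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy

# One evaluation record: the user query, the generated answer,
# and the chunks that were retrieved as context
data = Dataset.from_dict({
    "question": ["What is retrieval-augmented generation?"],
    "answer": ["RAG retrieves relevant documents and passes them to an LLM as context."],
    "contexts": [["Retrieval-augmented generation (RAG) augments an LLM with a retrieval step."]],
})

result = evaluate(data, metrics=[faithfulness, answer_relevancy])
print(result)  # per-metric scores between 0 and 1
```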
Code examples
Basic RAG pipeline with LangChain (Python)
```python
from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import OpenAIEmbeddings, ChatOpenAI

# Stand-in corpus; in practice, load and chunk your own documents
documents = [Document(page_content="Retrieval-augmented generation (RAG) combines retrieval with LLM generation.")]

# 1. Index documents (one-time or incremental)
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma.from_documents(documents, embeddings)

# 2. Retrieve relevant chunks
query = "What is retrieval-augmented generation?"
docs = vectorstore.similarity_search(query, k=4)
context = "\n\n".join(d.page_content for d in docs)

# 3. Generate grounded answer
prompt = ChatPromptTemplate.from_messages([
    ("system", "Answer using only the context below. If the context "
               "doesn't contain the answer, say 'I don't know'.\n\n{context}"),
    ("human", "{question}"),
])
llm = ChatOpenAI(model="gpt-4o")
chain = prompt | llm
answer = chain.invoke({"context": context, "question": query})
print(answer.content)
```

RAG with LlamaIndex (Python)
```python
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

# 1. Load and index documents
documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents)

# 2. Query with built-in retrieval + generation
query_engine = index.as_query_engine(similarity_top_k=4)
response = query_engine.query("What is RAG?")
print(response)

# Access source nodes for citation
for node in response.source_nodes:
    print(f"Source: {node.metadata['file_name']} (score: {node.score:.3f})")
```

Practical resources
- RAG paper — Lewis et al. (2020) — The original research paper introducing retrieval-augmented generation
- LangChain RAG tutorial — Step-by-step guide to building a RAG pipeline with LangChain
- LlamaIndex RAG guide — Official LlamaIndex documentation on RAG concepts and implementation
- Vertex AI RAG and grounding — RAG on Google Cloud with Vertex AI
- Pinecone RAG guide — Practical guide covering chunking, embedding, and retrieval strategies