RAG architecture
Components and design choices in RAG systems.
Definition
RAG (Retrieval-Augmented Generation) architecture defines how raw documents are transformed into retrievable knowledge and how that knowledge is injected into an LLM at inference time. The pipeline has two main phases: an offline indexing phase that processes and stores documents, and an online retrieval phase that fetches relevant context for each user query.
Design choices in this architecture directly affect the quality, latency, and cost of the final system. Chunk size controls how much context each retrieved segment carries: smaller chunks retrieve more precisely but may lack surrounding context, while larger chunks preserve context at the cost of retrieval precision, since their embeddings average over more content. The choice of embedding model determines how semantically meaningful the vector space is, and whether to use dense, sparse, or hybrid retrieval affects coverage for both semantic and keyword-based queries.
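As a sketch of the hybrid option, the snippet below fuses a BM25 keyword retriever with a dense retriever via LangChain's EnsembleRetriever (assuming the langchain, langchain-community, langchain-openai, and rank_bm25 packages); the toy documents and fusion weights are illustrative assumptions, not tuned values.

from langchain_core.documents import Document
from langchain_community.retrievers import BM25Retriever
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain.retrievers import EnsembleRetriever

docs = [
    Document(page_content="Error E-1042 indicates a corrupt index."),
    Document(page_content="Rebuilding the index restores search quality."),
]
bm25 = BM25Retriever.from_documents(docs)  # sparse: exact keyword matches
dense = Chroma.from_documents(docs, OpenAIEmbeddings()).as_retriever()  # dense: semantic similarity
# Weighted fusion of both result lists; weights are illustrative
hybrid = EnsembleRetriever(retrievers=[bm25, dense], weights=[0.4, 0.6])
results = hybrid.invoke("E-1042")  # keyword-heavy query benefits from the BM25 leg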
Advanced setups extend the base pipeline with query rewriting (rephrasing queries before embedding), multi-hop retrieval (chaining multiple retrievals), reranking (a cross-encoder that rescores top-k candidates), and citation extraction (attributing answers to source chunks). Each extension adds latency and complexity but can significantly improve answer quality for demanding use cases. See vector databases for indexing options.
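To make the reranking step concrete, here is a minimal sketch using the sentence-transformers CrossEncoder; the model name and candidate passages are illustrative assumptions.

from sentence_transformers import CrossEncoder

# A cross-encoder reads the query and passage together, so it scores
# relevance more accurately than the bi-encoder used for first-stage retrieval.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "How do I fix a corrupt index?"
candidates = [
    "Rebuilding the index restores search quality.",
    "Error E-1042 indicates a corrupt index.",
    "Dense retrieval maps text to embedding vectors.",
]
scores = reranker.predict([(query, c) for c in candidates])
# Keep the highest-scoring passages for the LLM prompt
top = [c for _, c in sorted(zip(scores, candidates), reverse=True)][:2]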
How it works
Indexing phase
Documents are ingested, split into chunks, and stored in a vector index.
Retrieval phase
At query time, the query is embedded and similar chunks are retrieved and optionally reranked.
1. Chunk: Documents are split into segments (by paragraph, sentence, or fixed token count); overlap and metadata can be added to each chunk.
2. Embed and index: Chunks are encoded into vectors via an embedding model and stored in a vector database.
3. Query: The user's query is embedded with the same encoder, and the top-k similar chunks are fetched using dense or hybrid search.
4. Rank: An optional reranker (e.g. a cross-encoder) rescores the top candidates before they are formatted into the LLM prompt.
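A dependency-free sketch of step 1, splitting by fixed character count with overlap; production splitters usually respect sentence or token boundaries instead of raw character offsets.

def chunk_text(text: str, size: int = 512, overlap: int = 64) -> list[str]:
    # Each chunk repeats the last `overlap` characters of its predecessor,
    # so content cut at a boundary still appears whole in one chunk.
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

chunks = chunk_text(open("document.txt").read())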
When to use / When NOT to use
| Scenario | Use RAG? | Rationale |
|---|---|---|
| Knowledge base is large and frequently updated | Yes | Chunking + indexing handle scale; fine-tuning is expensive to retrain |
| Answers need source attribution | Yes | Chunks carry provenance metadata; vanilla LLM generation loses attribution |
| Queries are highly keyword-specific | Yes, with hybrid retrieval | Pure dense retrieval may miss exact keyword matches |
| Knowledge fits in the context window | Probably not | Simpler to stuff the prompt directly; no retrieval layer needed |
| Real-time latency is critical | With optimizations | Cache repeated queries (sketch below) and use smaller models; avoid reranking + multi-hop at tight latency budgets |
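As referenced in the latency row above, a minimal sketch of query-level caching; rag_pipeline is a hypothetical stand-in for the full retrieve-plus-generate call, and real systems often cache at the embedding level or use semantic caches instead.

from functools import lru_cache

def rag_pipeline(query: str) -> str:
    return ""  # hypothetical stand-in for the full retrieve + generate call

@lru_cache(maxsize=1024)
def cached_answer(query: str) -> str:
    # Verbatim repeat queries skip retrieval and generation entirely
    return rag_pipeline(query)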
Comparisons
| Approach | Chunk size | Retrieval type | Reranker | Typical use |
|---|---|---|---|---|
| Naive RAG | Fixed 512 tokens | Dense only | None | Prototyping |
| Advanced RAG | Semantic / overlapping | Hybrid (dense + BM25) | Cross-encoder | Production Q&A |
| Modular RAG | Variable, with metadata | Hybrid + filters | Learned reranker | Enterprise search |
| Multi-hop RAG | Small for precision | Dense per hop | Optional | Complex reasoning (sketch below) |
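To make the multi-hop row concrete, a sketch of a two-hop loop; retrieve and generate are hypothetical stand-ins for the retriever and LLM calls.

def retrieve(query: str, k: int = 4) -> list[str]:
    return []  # hypothetical stand-in: plug in the real retriever here

def generate(prompt: str) -> str:
    return ""  # hypothetical stand-in: plug in the real LLM call here

def multi_hop_answer(question: str, hops: int = 2) -> str:
    # Each hop turns the evidence gathered so far into the next sub-query
    query, context = question, []
    for _ in range(hops):
        context += retrieve(query)
        evidence = "\n".join(context)
        query = generate(f"Given:\n{evidence}\nWhat should be looked up next to answer: {question}?")
    evidence = "\n".join(context)
    return generate(f"Context:\n{evidence}\nAnswer: {question}")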
Pros and cons
| Pros | Cons |
|---|---|
| Keeps knowledge up-to-date without retraining | Adds indexing and retrieval latency |
| Provides source attribution for answers | Chunking strategy significantly impacts quality |
| Scales to millions of documents | Requires maintaining a vector index |
| Composable with reranking and filtering | Query-document mismatch can hurt recall |
Code examples
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains import RetrievalQA
# --- Indexing ---
# Note: RecursiveCharacterTextSplitter sizes are in characters, not tokens
splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=64)
with open("document.txt") as f:
    docs = splitter.create_documents([f.read()])
embeddings = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(docs, embeddings)
# --- Retrieval + Generation ---
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})
qa_chain = RetrievalQA.from_chain_type(
llm=ChatOpenAI(model="gpt-4o-mini"),
retriever=retriever,
return_source_documents=True,
)
result = qa_chain.invoke({"query": "What does the document say about X?"})
print(result["result"])            # the generated answer
print(result["source_documents"])  # retrieved chunks, for source attribution
Practical resources
- LangChain – RAG architecture — End-to-end RAG walkthrough with LangChain components
- LlamaIndex – Document processing and indexing — Ingestion, chunking, and indexing pipelines
- Anthropic – RAG best practices — Claude-specific RAG guidance and tips