
RAG architecture

Components and design choices in RAG systems.

Definition

RAG (Retrieval-Augmented Generation) architecture defines how raw documents are transformed into retrievable knowledge and how that knowledge is injected into an LLM at inference time. The pipeline has two main phases: an offline indexing phase that processes and stores documents, and an online retrieval phase that fetches relevant context for each user query.

Design choices in this architecture directly affect the quality, latency, and cost of the final system. Chunk size controls how much context each retrieved segment carries: smaller chunks match queries more precisely but may lack surrounding context, while larger chunks preserve more context but dilute the embedding, making relevant passages harder to retrieve. The choice of embedding model determines how semantically meaningful the vector space is, and the choice between dense, sparse, or hybrid retrieval affects coverage for semantic versus keyword-based queries.
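As a rough sketch of the hybrid option, the snippet below fuses BM25 scores with dense cosine similarities over a toy corpus. The rank_bm25 and sentence-transformers libraries, the model name, and the 0.5 weighting are illustrative assumptions, not recommendations.

from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, util

chunks = [
    "RAG retrieves chunks before generation.",
    "BM25 ranks documents by exact term overlap.",
    "Dense retrieval uses embedding similarity.",
]
query = "How does keyword search rank documents?"

# Sparse side: BM25 over whitespace-tokenized chunks
bm25 = BM25Okapi([c.lower().split() for c in chunks])
sparse_scores = bm25.get_scores(query.lower().split())

# Dense side: cosine similarity in embedding space
model = SentenceTransformer("all-MiniLM-L6-v2")
dense_scores = util.cos_sim(model.encode(query), model.encode(chunks))[0].tolist()

# Fuse with a simple weighted sum; real systems normalize the two score
# scales first or use reciprocal-rank fusion.
alpha = 0.5
fused = [alpha * d + (1 - alpha) * s for d, s in zip(dense_scores, sparse_scores)]
best = max(range(len(chunks)), key=lambda i: fused[i])
print(chunks[best])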

Advanced setups extend the base pipeline with query rewriting (rephrasing queries before embedding), multi-hop retrieval (chaining multiple retrievals), reranking (a cross-encoder that rescores top-k candidates), and citation extraction (attributing answers to source chunks). Each extension adds latency and complexity but can significantly improve answer quality for demanding use cases. See vector databases for indexing options.
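A minimal reranking sketch, assuming the sentence-transformers CrossEncoder API and a public MS MARCO cross-encoder checkpoint; in a real pipeline the candidates would come from the retriever rather than a hard-coded list.

from sentence_transformers import CrossEncoder

query = "What is chunk overlap for?"
candidates = [
    "Overlap repeats a few tokens between adjacent chunks so context is not cut mid-sentence.",
    "Vector databases store embeddings for similarity search.",
    "Fine-tuning updates model weights on new data.",
]

# The cross-encoder scores each (query, candidate) pair jointly,
# which is slower than bi-encoder retrieval but more accurate.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
scores = reranker.predict([(query, c) for c in candidates])

# Keep candidates sorted by relevance, highest first
reranked = [c for _, c in sorted(zip(scores, candidates), reverse=True)]
print(reranked[0])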

How it works

Indexing phase

Documents are ingested, split into chunks, and stored in a vector index.

Retrieval phase

At query time, the query is embedded and similar chunks are retrieved and optionally reranked.

Chunk: Documents are split into segments (by paragraph, sentence, or fixed token count); overlap and metadata can be added to each chunk.
Embed and index: Chunks are encoded into vectors via an embedding model and stored in a vector database.
Query: The user's query is embedded with the same encoder.
Retrieve: The top-k most similar chunks are fetched using dense or hybrid search.
Rank: An optional reranker (e.g. a cross-encoder) rescores the top candidates before they are formatted into the LLM prompt.
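At its core, the retrieve step is a similarity search over vectors. A small NumPy sketch of that step, with random vectors standing in for real embedding-model output:

import numpy as np

chunk_vectors = np.random.rand(100, 384)   # stand-in for an index of 100 embedded chunks
query_vector = np.random.rand(384)         # stand-in for the embedded query
k = 4

# Normalize so the dot product equals cosine similarity
chunk_norm = chunk_vectors / np.linalg.norm(chunk_vectors, axis=1, keepdims=True)
query_norm = query_vector / np.linalg.norm(query_vector)

scores = chunk_norm @ query_norm           # one similarity score per chunk
top_k = np.argsort(scores)[-k:][::-1]      # indices of the k most similar chunks
print(top_k, scores[top_k])

A vector database performs the same comparison with approximate nearest-neighbor indexes so it scales beyond a brute-force dot product.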

When to use / When NOT to use

Scenario | Use | Don't use
Knowledge base is large and frequently updated | Yes — chunking + indexing handles scale | No — fine-tuning is expensive to retrain
Answers need source attribution | Yes — chunks carry provenance metadata | No — vanilla LLM generation loses attribution
Queries are highly keyword-specific | Yes — hybrid retrieval combines dense + sparse | No — pure dense retrieval may miss exact matches
Knowledge fits in the context window | Maybe — simpler to just stuff the prompt | Yes — no need for a retrieval layer
Real-time latency is critical | With optimizations — caching, smaller models | Avoid reranking + multi-hop at very low latency budgets

Comparisons

Approach | Chunk size | Retrieval type | Reranker | Typical use
Naive RAG | Fixed 512 tokens | Dense only | None | Prototyping
Advanced RAG | Semantic / overlapping | Hybrid (dense + BM25) | Cross-encoder | Production Q&A
Modular RAG | Variable, with metadata | Hybrid + filters | Learned reranker | Enterprise search
Multi-hop RAG | Small for precision | Dense per hop | Optional | Complex reasoning

Pros and cons

Pros | Cons
Keeps knowledge up-to-date without retraining | Adds indexing and retrieval latency
Provides source attribution for answers | Chunking strategy significantly impacts quality
Scales to millions of documents | Requires maintaining a vector index
Composable with reranking and filtering | Query-document mismatch can hurt recall

Code examples

from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains import RetrievalQA

# --- Indexing ---
splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=64)
docs = splitter.create_documents([open("document.txt").read()])

embeddings = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(docs, embeddings)

# --- Retrieval + Generation ---
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})
qa_chain = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model="gpt-4o-mini"),
    retriever=retriever,
    return_source_documents=True,
)

result = qa_chain.invoke({"query": "What does the document say about X?"})
print(result["result"])
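Because return_source_documents is enabled, the retrieved chunks come back alongside the answer and can be surfaced for attribution, for example:

# Inspect which chunks the answer was grounded in
for doc in result["source_documents"]:
    print(doc.metadata, doc.page_content[:80])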
