RAG for AI agents: Retrieval-Augmented Generation
An agent is only as good as the knowledge it can reach. RAG fetches the right passages at query time and feeds them to the model — so answers are grounded in your data, current, and citable instead of guessed from memory.
- 12 min read
- Intermediate
- Updated 2026
RAG, or retrieval-augmented generation, is the simplest reliable way to give a language model knowledge it was never trained on — and to make its answers traceable back to a source.
A base model only knows what was in its training data, frozen at a point in time. Ask it about last week's pricing change, an internal runbook, or a customer's open ticket and it will either refuse or, worse, confidently invent something. RAG closes that gap by retrieving relevant text from an external knowledge source at the moment of the query and placing it in the prompt, so the model generates its answer conditioned on real evidence.
For agents this is foundational. An LLM agent reasons and acts in a loop, but reasoning over wrong or missing facts just produces well-structured nonsense. RAG is the mechanism that grounds each step in your documents, your database, and the live world. It pairs naturally with agent memory and with the broader set of tools an agent can call.
This guide walks the full picture: what RAG is and why agents need it, the retrieve → augment → generate pipeline, how chunking, embeddings, and similarity search actually work, when to choose RAG over fine-tuning or a long context window, the rise of agentic RAG, and the evaluation pitfalls and failure modes that trip up real systems.
Why agents need retrieval
A model's weights are a lossy, frozen snapshot. Agents act in a world that is specific, private, and constantly changing — exactly what training data can't capture.
Think of a model's parameters as parametric memory: an enormous, compressed average of public text. It is broad but blurry, static, and impossible to cite. Your agent, meanwhile, needs to answer questions about things that were never public and never static — a refund policy edited this morning, the spec for an internal API, the last three emails in a thread.
RAG adds non-parametric memory: an external store you control, query, and update independently of the model. When the agent needs a fact, it retrieves the most relevant passages and reads them like a person consulting a reference, rather than recalling from a hazy memory. Update the store and the agent's knowledge updates instantly — no retraining required.
The payoff is concrete: fresher answers, far less hallucination on in-domain questions, the ability to show where each claim came from, and access to private knowledge without ever baking it into model weights you might share.
Freshness
Re-index a document and the agent knows the new version on the next query — no retraining, no deploy.
Grounding
Answers are conditioned on retrieved evidence, sharply cutting fabricated facts on in-domain questions.
Citations
Because the source passage is in the prompt, the agent can point to the exact document and section.
Private knowledge
Serve answers from your own docs and data without ever embedding them into shared model weights.
Retrieve → augment → generate
Every RAG system, however fancy, is these three stages. Get a clear mental model of each and the rest is tuning.
1 · Retrieve
Convert the query into an embedding and search a vector index (often combined with keyword search) for the top-k most relevant chunks. Optionally filter by metadata and re-rank the candidates for precision.
2 · Augment
Assemble a prompt that places the retrieved passages alongside the question, with clear instructions: answer only from this context, cite the sources, and say you don't know if the context is insufficient.
3 · Generate
The model reads the augmented prompt and produces an answer grounded in the passages — ideally with inline citations the agent or UI can link back to the original documents.
The whole game is precision at the top
The generator can only use what retrieval hands it. If the right passage isn't in the top-k, no amount of clever prompting recovers it. That's why teams invest most in the retrieve stage — chunking, embeddings, hybrid search, and re-ranking — long before they tune the prompt. A vector database is the workhorse that makes this fast at scale.
Chunking, embeddings, and similarity search
Retrieval rests on three ideas working together. Understanding each one tells you exactly where quality is won or lost.
Chunking
Documents are split into passages small enough to embed precisely but large enough to stay meaningful. Split on semantic boundaries — headings, paragraphs — keep a little overlap, and attach metadata (title, source, section) for filtering and citation.
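To make the chunking advice concrete, here is a minimal sketch that splits on paragraph boundaries, carries one paragraph of overlap into the next chunk, and attaches source metadata. The names `Chunk` and `chunk_document`, the blank-line paragraph convention, and the `max_chars` budget are all assumptions for illustration. Real pipelines often budget by tokens and split on headings as well.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    source: str    # where the chunk came from, kept for filtering and citation
    position: int  # order of the chunk within its document


def chunk_document(text, source, max_chars=500, overlap_paragraphs=1):
    """Pack paragraphs into chunks of roughly max_chars, with paragraph overlap."""
    # Assumption: paragraphs are separated by blank lines.
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], []
    for para in paragraphs:
        candidate = "\n\n".join(current + [para])
        if current and len(candidate) > max_chars:
            # Close the current chunk, then carry its tail forward as overlap.
            chunks.append(Chunk("\n\n".join(current), source, len(chunks)))
            current = current[-overlap_paragraphs:] if overlap_paragraphs else []
        current.append(para)
    if current:
        chunks.append(Chunk("\n\n".join(current), source, len(chunks)))
    return chunks


doc = "Refund policy overview.\n\nRefund eligibility rules.\n\nHow to file a claim."
for chunk in chunk_document(doc, source="refunds.md", max_chars=55):
    print(chunk.position, repr(chunk.text))
```

A single paragraph longer than `max_chars` is kept whole in this sketch, so a production splitter would need a fallback, for example splitting on sentences.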
Embeddings
An embedding model maps each chunk to a vector — a list of numbers where semantically similar text lands close together. The same model embeds the query, so 'reset my password' can match a chunk about 'account recovery' even with no shared words.
Similarity search
At query time, the index finds the chunks whose vectors are closest to the query vector (scored by cosine similarity or dot product) and returns the top-k. Approximate nearest-neighbour indexes keep this fast across millions of vectors.
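The search step can be sketched as an exact, brute-force scan. The three-dimensional vectors below are hand-made stand-ins for real embeddings, and `cosine_similarity` and `top_k` are illustrative names, not any library's API. A production index would replace the full scan with an approximate nearest-neighbour structure.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunk_vecs, k=2):
    """Return the ids of the k chunks most similar to the query (exact scan)."""
    scored = [
        (cosine_similarity(query_vec, vec), chunk_id)
        for chunk_id, vec in chunk_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk_id for _, chunk_id in scored[:k]]


# Toy vectors: assume an embedding model placed these chunks here.
chunk_vecs = {
    "account-recovery": [0.9, 0.1, 0.0],
    "refund-policy": [0.1, 0.9, 0.0],
    "api-spec": [0.0, 0.2, 0.9],
}
query_vec = [1.0, 0.2, 0.0]  # stand-in for the embedding of "reset my password"
print(top_k(query_vec, chunk_vecs, k=2))
```

If every vector is normalized to unit length ahead of time, the dot product alone gives the same ranking as cosine similarity.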
Hybrid search and re-ranking
Pure vector search is great at meaning but can miss exact terms — a part number, an error code, a rare name. Hybrid search blends semantic (vector) results with classic keyword (BM25) results, so you catch both the gist and the literal match.
A re-ranker then takes the merged candidate pool and scores each passage against the query with a more expensive cross-encoder, pushing the truly relevant chunks to the top. Cheap recall first, precise ranking second — this two-stage shape is what separates demo-grade RAG from production RAG.
- Semantic vectors capture meaning and paraphrase.
- Keyword search nails exact, rare, or symbolic terms.
- Metadata filters scope results to a tenant, date, or doc type.
- Re-ranking lifts the best passages into the top-k.
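One common way to blend the vector and keyword result lists is reciprocal rank fusion (RRF). The guide does not prescribe a fusion method, so treat this as one option under that assumption. The function name, the document ids, and the constant `k=60` (a conventional default) are illustrative.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked id lists: each list adds 1 / (k + rank) to a document's score."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Hypothetical outputs of the two searches for the same query.
vector_hits = ["recovery-guide", "login-faq", "sso-setup"]
keyword_hits = ["error-E4012", "recovery-guide", "billing"]

candidates = reciprocal_rank_fusion([vector_hits, keyword_hits])
print(candidates)  # this merged pool is what a re-ranker would then score
```

A document that appears in both lists accumulates two contributions, so it rises above documents that only one search found. The cross-encoder re-ranking step would run on this merged pool.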
A minimal RAG retrieval step
Strip away the framework and the retrieve stage is small: embed the query, search the index, then build a grounded prompt from the top chunks.
```python
# Assumes embed(), index, and llm() are supplied by your embedding model,
# vector store, and LLM client.

def retrieve(query, k=4):  # the 'retrieve' stage
    q_vec = embed(query)  # query → embedding vector
    hits = index.search(q_vec, top_k=k)  # nearest chunks
    return [h.chunk for h in hits]

def build_prompt(query, chunks):  # the 'augment' stage
    context = "\n\n".join(
        f"[source: {c.source}]\n{c.text}" for c in chunks
    )
    return (
        "Answer ONLY from the context. Cite sources.\n"
        "If it is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )

chunks = retrieve("What is our refund window?")  # agent calls retrieve as a tool
answer = llm(build_prompt("What is our refund window?", chunks))
```

That's the whole skeleton. The instruction to answer only from the context and to admit uncertainty is doing heavy lifting: it's the line between a grounded agent and one that drifts back into guessing. In an agent, retrieve is registered as a tool, so the model can call it when (and only when) it needs evidence. See AI agent tools for how that wiring works.
RAG vs fine-tuning vs long context
These three are often framed as rivals. They solve different problems, and strong systems frequently combine them.
| Dimension | RAG | Fine-tuning | Long context |
|---|---|---|---|
| Best for | Changing, citable knowledge | Durable skills, format, tone | One-off large inputs |
| Knowledge freshness | Instant (re-index) | Stale until retrained | Per-request only |
| Can cite sources | Yes | No | No |
| Update cost | Cheap, incremental | Expensive retraining | None |
| Scales to huge corpora | Yes | | No |
| Per-query cost | Low (small top-k) | Low | High (big prompt) |
| Adds new behavior | No | Yes | No |
The clean rule: RAG injects knowledge, fine-tuning teaches behavior, long context handles a single big input. If your facts change or must be cited, reach for RAG. If you need the model to always reply in a house style, follow a niche format, or master a domain skill, fine-tune. If you simply need to reason over one large document this turn, paste it into the context window. Many production agents do all three — fine-tuned for tone, RAG for knowledge, long context for the occasional big artifact. For a deeper, side-by-side breakdown, read RAG vs fine-tuning.
A common myth is that big context windows make RAG obsolete. They don't: you still can't fit a million documents in a prompt, large prompts are slow and costly, models lose precision in the middle of very long inputs, and a raw window gives you no citations or access control. RAG remains the way to find the right few thousand tokens out of millions.
Agentic RAG: the agent decides what to retrieve
Classic RAG retrieves once, up front. Agentic RAG makes retrieval a decision the agent owns — when to search, where to search, and whether to search again.
In naive RAG, retrieval is a fixed pre-step — it always runs once, with the user's raw question, against one index. That breaks down fast: some questions need no retrieval, some need several rounds, and many span multiple sources.
Agentic RAG hands those decisions to the agent. Retrieval becomes a tool the model can choose to call, so the agent can:
- Decide whether to retrieve at all — skip it for chit-chat or arithmetic it can do itself.
- Rewrite the query into a cleaner search string, or split a complex question into sub-queries.
- Route to the right source — a vector store for docs, SQL for structured data, web search for current events.
- Retrieve iteratively — read the first results, spot a gap, and search again with a refined query.
This is RAG fused with the agent loop: reason, retrieve, observe, repeat. It's more powerful and more robust — and it demands tighter evaluation, because now the agent can be wrong about whether to retrieve, not just what it found.
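The reason, retrieve, observe, repeat loop can be sketched as follows. In a real agent the model makes each decision. Here a hand-written `toy_decide` policy stands in for it so the control flow is visible. The action tuples, the `tools` registry, and the sample passage are all hypothetical.

```python
def agentic_answer(question, decide, tools, max_rounds=3):
    """Loop until the policy answers or the retrieval budget runs out.

    decide(question, evidence) returns either ("answer", text)
    or ("search", tool_name, query).
    """
    evidence = []
    for _ in range(max_rounds):
        action = decide(question, evidence)
        if action[0] == "answer":
            return action[1], evidence
        _, tool_name, query = action
        evidence.extend(tools[tool_name](query))  # route to the chosen source
    return "I don't know.", evidence  # budget exhausted: admit uncertainty


def toy_decide(question, evidence):
    """Stand-in for the model's choice; a real agent would ask the LLM."""
    if question.strip().lower() in {"hi", "hello"}:
        return ("answer", "Hello!")          # no retrieval needed
    if not evidence:
        return ("search", "docs", question)  # first round: search the docs
    return ("answer", f"Based on {len(evidence)} passage(s): {evidence[0]}")


# Made-up tool and passage, for illustration only.
tools = {"docs": lambda query: ["Refunds are accepted within 30 days of purchase."]}

print(agentic_answer("hi", toy_decide, tools))
print(agentic_answer("What is our refund window?", toy_decide, tools))
```

The `max_rounds` cap matters: without it, a policy that keeps deciding to search would never return.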
Passages per query
usually 3–8, not the whole corpus
Recall then re-rank
broad search, precise ranking
Retrieval rounds
agentic RAG can loop
Answers cited
the grounding goal
Evaluation pitfalls and failure modes
RAG that demos well often fails quietly in production. Knowing the failure modes — and measuring the right things — is what makes it dependable.
Common failure modes
Retrieval miss
The right passage isn't in the top-k — usually bad chunking, a weak embedding model, or no hybrid search. The generator never had a chance.
Wrong or stale context
Retrieval returns an outdated or duplicated passage and the model faithfully repeats the error. Garbage in, grounded garbage out.
Ignored context
Good passages are retrieved, but the model answers from parametric memory anyway, or misses the key fact because it is buried in a long, noisy context.
Broken citations
The answer cites the wrong source, or invents a citation that looks plausible but points nowhere — eroding the trust RAG is meant to build.
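A cheap guard against the broken-citation failure is to check that every source named in the answer was actually retrieved. This sketch assumes answers cite with a `[source: name]` marker, mirroring the prompt format used earlier in this guide. `check_citations` is a hypothetical helper and the sample answer is invented.

```python
import re


def check_citations(answer, retrieved_sources):
    """Flag citations that point at sources that were never retrieved."""
    cited = set(re.findall(r"\[source:\s*([^\]]+?)\s*\]", answer))
    return {
        "cited": cited,
        "unknown": cited - set(retrieved_sources),  # invented or wrong sources
        "uncited": not cited,                       # answer cites nothing at all
    }


answer = (
    "Refunds take five days [source: refunds.md] "
    "and are free of charge [source: pricing-2019.md]."
)
report = check_citations(answer, retrieved_sources=["refunds.md", "faq.md"])
print(report["unknown"])
```

This only proves the cited document was in the prompt. Whether the passage supports the claim still needs a separate faithfulness check.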
What to actually measure
- Retrieval quality — Is the right passage in the top-k? Track recall and precision against a labeled question set.
- Grounding / faithfulness — Is every claim in the answer supported by a retrieved passage, with nothing invented?
- Answer relevance — Does the response actually address the question, not just quote nearby text?
- Citation accuracy — Do the cited sources truly contain the claims they're attached to?
- Refusal calibration — When context is missing, does the agent say 'I don't know' instead of guessing?
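Retrieval quality is the easiest of these to compute. Given a labeled set that maps each question to the chunk ids that should be retrieved, recall@k and precision@k take only a few lines. The function names and chunk ids below are illustrative assumptions.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant chunks that appear in the top-k results."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k slots filled by relevant chunks."""
    if k <= 0:
        return 0.0
    return len(set(retrieved[:k]) & set(relevant)) / k


# One labeled example: ranked retrieval output vs. the chunks a human marked relevant.
retrieved = ["c7", "c2", "c9", "c4"]
relevant = {"c2", "c4", "c5"}

for k in (3, 4):
    print(k, recall_at_k(retrieved, relevant, k), precision_at_k(retrieved, relevant, k))
```

Averaging these over the whole labeled question set gives a retrieval score that can be tracked independently of the generator.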
Evaluate the stages separately
The biggest pitfall is grading only the final answer. A good answer can hide bad retrieval (the model knew it anyway), and a bad answer can hide good retrieval (the prompt was the problem). Measure retrieval and generation independently so you fix the stage that's actually broken.
Pair these metrics with agent memory hygiene — deduplicate sources, expire stale chunks, and keep provenance on every passage. The discipline that makes RAG trustworthy is the same discipline that makes any agent trustworthy: ground every claim, measure every stage, and let the system admit when it doesn't know.
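The hygiene steps can be sketched as a single pass over the store before re-indexing. The chunk dictionaries, the `updated_at` field, and the one-year expiry are assumptions for illustration. The `source` field is left untouched so provenance survives the cleanup.

```python
import hashlib
from datetime import datetime, timedelta, timezone


def clean_index(chunks, now, max_age_days=365):
    """Drop stale chunks and duplicates (same text after normalization)."""
    seen, kept = set(), []
    for chunk in chunks:
        if now - chunk["updated_at"] > timedelta(days=max_age_days):
            continue  # expired: too old to trust
        normalized = " ".join(chunk["text"].lower().split())
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # duplicate of a chunk already kept
        seen.add(digest)
        kept.append(chunk)
    return kept


utc = timezone.utc
chunks = [
    {"text": "Refunds within 30 days.", "source": "a.md",
     "updated_at": datetime(2025, 12, 1, tzinfo=utc)},
    {"text": "refunds  within 30 DAYS.", "source": "b.md",
     "updated_at": datetime(2025, 12, 2, tzinfo=utc)},
    {"text": "Old pricing table.", "source": "c.md",
     "updated_at": datetime(2023, 1, 1, tzinfo=utc)},
]
kept = clean_index(chunks, now=datetime(2026, 1, 1, tzinfo=utc))
print([c["source"] for c in kept])
```

Hashing normalized text only catches exact and near-exact copies. Paraphrased duplicates would need a similarity threshold on the embeddings instead.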
RAG for agents, answered
RAG is a technique that fetches relevant text from an external knowledge source at query time and inserts it into the model's prompt before it generates an answer. Instead of relying only on what the model memorized during training, the agent retrieves up-to-date, organization-specific passages — from documents, a database, or an API — and conditions its response on them. The result is an answer grounded in real source material, with the ability to cite where each fact came from.