RAG for AI agents: Retrieval-Augmented Generation

An agent is only as good as the knowledge it can reach. RAG fetches the right passages at query time and feeds them to the model — so answers are grounded in your data, current, and citable instead of guessed from memory.

  • 12 min read
  • Intermediate
  • Updated 2026

RAG, or retrieval-augmented generation, is the simplest reliable way to give a language model knowledge it was never trained on — and to make its answers traceable back to a source.

A base model only knows what was in its training data, frozen at a point in time. Ask it about last week's pricing change, an internal runbook, or a customer's open ticket and it will either refuse or, worse, confidently invent something. RAG closes that gap by retrieving relevant text from an external knowledge source at the moment of the query and placing it in the prompt, so the model generates its answer conditioned on real evidence.

For agents this is foundational. An LLM agent reasons and acts in a loop, but reasoning over wrong or missing facts just produces well-structured nonsense. RAG is the mechanism that grounds each step in your documents, your database, and the live world. It pairs naturally with agent memory and with the broader set of tools an agent can call.

This guide walks the full picture: what RAG is and why agents need it, the retrieve → augment → generate pipeline, how chunking, embeddings, and similarity search actually work, when to choose RAG over fine-tuning or a long context window, the rise of agentic RAG, and the evaluation pitfalls and failure modes that trip up real systems.

The problem RAG solves

Why agents need retrieval

A model's weights are a lossy, frozen snapshot. Agents act in a world that is specific, private, and constantly changing — exactly what training data can't capture.

Think of a model's parameters as parametric memory: an enormous, compressed average of public text. It is broad but blurry, static, and impossible to cite. Your agent, meanwhile, needs to answer questions about things that were never public and never static — a refund policy edited this morning, the spec for an internal API, the last three emails in a thread.

RAG adds non-parametric memory: an external store you control, query, and update independently of the model. When the agent needs a fact, it retrieves the most relevant passages and reads them like a person consulting a reference, rather than recalling from a hazy memory. Update the store and the agent's knowledge updates instantly — no retraining required.

The payoff is concrete: fresher answers, far less hallucination on in-domain questions, the ability to show where each claim came from, and access to private knowledge without ever baking it into model weights you might share.

Freshness

Re-index a document and the agent knows the new version on the next query — no retraining, no deploy.

Grounding

Answers are conditioned on retrieved evidence, sharply cutting fabricated facts on in-domain questions.

Citations

Because the source passage is in the prompt, the agent can point to the exact document and section.

Private knowledge

Serve answers from your own docs and data without ever embedding them into shared model weights.

The core pipeline

Retrieve → augment → generate

Every RAG system, however fancy, is these three stages. Get a clear mental model of each and the rest is tuning.

Query: User or agent question
Retrieve: Similarity search over chunks
Augment: Inject passages into prompt
Generate: Model answers from context
Cite: Attribute to source passages
The RAG pipeline. A query is embedded and used to retrieve relevant chunks; those chunks augment the prompt; the model generates a grounded, citable answer.
  1. Retrieve

    Convert the query into an embedding and search a vector index (often combined with keyword search) for the top-k most relevant chunks. Optionally filter by metadata and re-rank the candidates for precision.

  2. Augment

    Assemble a prompt that places the retrieved passages alongside the question, with clear instructions: answer only from this context, cite the sources, and say you don't know if the context is insufficient.

  3. Generate

    The model reads the augmented prompt and produces an answer grounded in the passages — ideally with inline citations the agent or UI can link back to the original documents.
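The retrieve step above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not a production retriever: `toy_embed` is a bag-of-words stand-in for a real embedding model, the names `Chunk`, `cosine`, and `top_k` are hypothetical, and the three-passage corpus is made-up sample data.

```python
import math
from collections import Counter
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Chunk:
    text: str
    source: str
    meta: dict = field(default_factory=dict)


def toy_embed(text: str) -> Counter:
    # Stand-in for a real embedding model: a sparse bag-of-words vector.
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity = dot(a, b) / (|a| * |b|); 0.0 if either vector is empty.
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query: str, chunks: list, k: int = 3, where: Optional[dict] = None) -> list:
    # Apply the optional metadata filter first, then rank survivors by similarity.
    q_vec = toy_embed(query)
    pool = [
        c for c in chunks
        if not where or all(c.meta.get(key) == val for key, val in where.items())
    ]
    ranked = sorted(pool, key=lambda c: cosine(q_vec, toy_embed(c.text)), reverse=True)
    return ranked[:k]


# Made-up sample corpus for illustration only.
corpus = [
    Chunk("refunds are accepted within 30 days of purchase", "policy.md", {"team": "support"}),
    Chunk("the api rate limit is 100 requests per minute", "api.md", {"team": "eng"}),
    Chunk("shipping takes five business days", "shipping.md", {"team": "support"}),
]
hits = top_k("how many days for refunds", corpus, k=1)
print(hits[0].source)
```

A real system would swap `toy_embed` for a learned embedding model and the linear scan for an approximate nearest-neighbor index, but the shape of the step (embed, filter, score, take the top k) stays the same.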

The whole game is precision at the top

The generator can only use what retrieval hands it. If the right passage isn't in the top-k, no amount of clever prompting recovers it. That's why teams invest most in the retrieve stage — chunking, embeddings, hybrid search, and re-ranking — long before they tune the prompt. A vector database is the workhorse that makes this fast at scale.
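Chunking is the first of those retrieval investments, and a simple version is easy to sketch. The function below splits text into fixed-size word windows that overlap, so a sentence that straddles a boundary still appears whole in at least one chunk. The name `chunk_words` and the default sizes are illustrative assumptions; real pipelines often split on headings, sentences, or tokens instead.

```python
def chunk_words(text: str, size: int = 200, overlap: int = 40) -> list:
    """Split text into windows of `size` words; neighbors share `overlap` words."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # this window already reached the end of the text
    return chunks


# Tiny demo with ten placeholder words and a window of four.
doc = " ".join(f"w{i}" for i in range(10))
pieces = chunk_words(doc, size=4, overlap=1)
for piece in pieces:
    print(piece)
```

Larger chunks carry more context per hit but dilute the similarity signal; smaller chunks rank more precisely but can cut a fact in half. The overlap is a cheap hedge against the second problem.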

See it in code

A minimal RAG retrieval step

Strip away the framework and the retrieve stage is small: embed the query, search the index, then build a grounded prompt from the top chunks.

retrieve.py (Python)

# embed(), index, and llm() are placeholders for your embedding model,
# vector index, and language-model client.

def retrieve(query, k=4):  # the 'retrieve' stage
    q_vec = embed(query)  # query → embedding vector
    hits = index.search(q_vec, top_k=k)  # nearest chunks
    return [h.chunk for h in hits]

def build_prompt(query, chunks):  # the 'augment' stage
    context = "\n\n".join(
        f"[source: {c.source}]\n{c.text}" for c in chunks
    )
    return (
        "Answer ONLY from the context. Cite sources.\n"
        "If it is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )

chunks = retrieve("What is our refund window?")  # agent calls retrieve as a tool
answer = llm(build_prompt("What is our refund window?", chunks))
A compact retrieve-then-augment step. The model is later asked to answer only from these passages and to cite them.

That's the whole skeleton. The instruction to answer only from the context and to admit uncertainty is doing heavy lifting — it's the line between a grounded agent and one that drifts back into guessing. In an agent, retrieve is registered as a tool, so the model can call it when (and only when) it needs evidence. See AI agent tools for how that wiring works.
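One way to picture "retrieve is registered as a tool" is a small registry plus a dispatcher. The sketch below is framework-agnostic and hypothetical throughout: the `TOOLS` layout, the JSON call shape, and the stubbed `retrieve` are assumptions for illustration, not any particular vendor's tool-calling API.

```python
import json


def retrieve(query: str, k: int = 4) -> list:
    # Placeholder retriever: a real one would embed the query and search an index.
    store = [{"source": "policy.md", "text": "Refunds are accepted within 30 days."}]
    return store[:k]


# Hypothetical registry: the description and parameters are what the model would see.
TOOLS = {
    "retrieve": {
        "description": "Search the knowledge base and return the top-k passages.",
        "parameters": {"query": "string", "k": "integer (optional)"},
        "fn": retrieve,
    }
}


def dispatch(tool_call_json: str) -> str:
    """Run a tool call the model emitted as JSON and return a JSON result string."""
    call = json.loads(tool_call_json)
    tool = TOOLS.get(call.get("name"))
    if tool is None:
        return json.dumps({"error": f"unknown tool: {call.get('name')}"})
    result = tool["fn"](**call.get("arguments", {}))
    return json.dumps({"result": result})


reply = dispatch('{"name": "retrieve", "arguments": {"query": "refund window"}}')
print(reply)
```

The agent runtime feeds `reply` back to the model as an observation. Because the model chooses whether to emit the call at all, retrieval only happens when it judges that evidence is needed.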

Choosing an approach

RAG vs fine-tuning vs long context

These three are often framed as rivals. They solve different problems, and strong systems frequently combine them.

Dimension | RAG | Fine-tuning | Long context
Best for | Changing, citable knowledge | Durable skills, format, tone | One-off large inputs
Knowledge freshness | Instant (re-index) | Stale until retrained | Per-request only
Can cite sources | Yes | No | No
Update cost | Cheap, incremental | Expensive retraining | None
Scales to huge corpora | Yes |  | No
Per-query cost | Low (small top-k) | Low | High (big prompt)
Adds new behavior | No | Yes | No

The clean rule: RAG injects knowledge, fine-tuning teaches behavior, long context handles a single big input. If your facts change or must be cited, reach for RAG. If you need the model to always reply in a house style, follow a niche format, or master a domain skill, fine-tune. If you simply need to reason over one large document this turn, paste it into the context window. Many production agents do all three — fine-tuned for tone, RAG for knowledge, long context for the occasional big artifact. For a deeper, side-by-side breakdown, read RAG vs fine-tuning.

A common myth is that big context windows make RAG obsolete. They don't: you still can't fit a million documents in a prompt, large prompts are slow and costly, models lose precision in the middle of very long inputs, and a raw window gives you no citations or access control. RAG remains the way to find the right few thousand tokens out of millions.
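A back-of-the-envelope calculation makes the scale argument concrete. Every number below is an illustrative assumption (chunk length, top-k, instruction overhead, corpus size), not a measurement of any real system.

```python
# Illustrative assumptions, not measurements.
CHUNK_TOKENS = 500         # assumed average chunk length
TOP_K = 4                  # passages placed in the prompt
OVERHEAD_TOKENS = 200      # assumed instructions plus the question
CORPUS_TOKENS = 5_000_000  # assumed size of the whole knowledge base

# Tokens the model actually reads per query with RAG.
rag_prompt_tokens = TOP_K * CHUNK_TOKENS + OVERHEAD_TOKENS

# Fraction of the corpus that ends up in the prompt.
share_of_corpus = rag_prompt_tokens / CORPUS_TOKENS

print(f"RAG prompt: {rag_prompt_tokens} tokens")
print(f"Share of corpus read per query: {share_of_corpus:.4%}")
```

Under these assumptions the model reads a few thousand tokens per query while the corpus holds millions, which is the gap retrieval exists to bridge.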

RAG inside the loop

Agentic RAG: the agent decides what to retrieve

Classic RAG retrieves once, up front. Agentic RAG makes retrieval a decision the agent owns — when to search, where to search, and whether to search again.

Assess: Do I need external knowledge?
Route: Vector store, SQL, or web?
Rewrite: Sharpen the search query
Retrieve: Fetch + read passages
Reflect: Enough? If not, retrieve again
In agentic RAG, the agent treats retrieval as a tool inside its reasoning loop — judging need, choosing a source, and re-querying until the evidence is sufficient.

In naive RAG, retrieval is a fixed pre-step — it always runs once, with the user's raw question, against one index. That breaks down fast: some questions need no retrieval, some need several rounds, and many span multiple sources.

Agentic RAG hands those decisions to the agent. Retrieval becomes a tool the model can choose to call, so the agent can:

  • Decide whether to retrieve at all — skip it for chit-chat or arithmetic it can do itself.
  • Rewrite the query into a cleaner search string, or split a complex question into sub-queries.
  • Route to the right source — a vector store for docs, SQL for structured data, web search for current events.
  • Retrieve iteratively — read the first results, spot a gap, and search again with a refined query.

This is RAG fused with the agent loop: reason, retrieve, observe, repeat. It's more powerful and more robust — and it demands tighter evaluation, because now the agent can be wrong about whether to retrieve, not just what it found.
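The loop described above (assess, route, rewrite, retrieve, reflect) can be sketched as a single function. This is a hypothetical skeleton, not a reference implementation: in a real agent each callable would be backed by a model call or a search backend, and here they are simple stubs so the control flow is visible. The hard `max_rounds` cap is an assumption added to keep the loop from running forever.

```python
from typing import Callable


def agentic_answer(
    question: str,
    needs_retrieval: Callable[[str], bool],
    route: Callable[[str], str],
    rewrite: Callable[[str, list], str],
    sources: dict,
    is_sufficient: Callable[[str, list], bool],
    max_rounds: int = 3,
) -> dict:
    """Assess -> rewrite -> route -> retrieve -> reflect, with a hard round limit."""
    if not needs_retrieval(question):
        return {"evidence": [], "rounds": 0, "sufficient": True}
    evidence = []
    for round_no in range(1, max_rounds + 1):
        query = rewrite(question, evidence)   # sharpen the query given what we know
        source = route(query)                 # pick a backend for this query
        new = [p for p in sources[source](query) if p not in evidence]
        evidence.extend(new)
        if is_sufficient(question, evidence):  # reflect: is this enough?
            return {"evidence": evidence, "rounds": round_no, "sufficient": True}
    return {"evidence": evidence, "rounds": max_rounds, "sufficient": False}


# Stub backends and policies, for illustration only.
backends = {
    "vector": lambda q: ["Refund window is 30 days."] if "refund" in q else [],
    "sql": lambda q: ["order 42 placed 2 days ago"],
}
result = agentic_answer(
    "Can order 42 still be refunded?",
    needs_retrieval=lambda q: True,
    route=lambda q: "sql" if "order" in q else "vector",
    rewrite=lambda q, ev: "refund window policy" if ev else "order 42 status",
    sources=backends,
    is_sufficient=lambda q, ev: len(ev) >= 2,
)
print(result["rounds"], result["evidence"])
```

With these stubs the first round queries SQL for the order, the reflection step finds the evidence incomplete, and the second round pulls the policy from the vector store. Returning `sufficient: False` when the round limit is hit gives the caller a clean signal to answer "I don't know".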

  • top-k: passages per query (usually 3–8, not the whole corpus)
  • 2-stage: recall then re-rank (broad search, precise ranking)
  • 1+: retrieval rounds (agentic RAG can loop)
  • 100%: answers cited (the grounding goal)

Make it trustworthy

Evaluation pitfalls and failure modes

RAG that demos well often fails quietly in production. Knowing the failure modes — and measuring the right things — is what makes it dependable.

Common failure modes

Retrieval miss

The right passage isn't in the top-k — usually bad chunking, a weak embedding model, or no hybrid search. The generator never had a chance.

Wrong or stale context

Retrieval returns an outdated or duplicated passage and the model faithfully repeats the error. Garbage in, grounded garbage out.

Ignored context

Good passages are retrieved but the model answers from parametric memory anyway, or buries the key fact in a long, noisy context.

Broken citations

The answer cites the wrong source, or invents a citation that looks plausible but points nowhere — eroding the trust RAG is meant to build.
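Invented citations are the easiest of these failures to catch mechanically. The sketch below assumes answers cite passages with the `[source: ...]` tag used in the prompt earlier in this guide; the regex and the `check_citations` helper are hypothetical. It flags citations that point at sources that were never retrieved. It does not detect a real source cited for the wrong claim, which needs a claim-by-claim comparison against the passage text.

```python
import re

# Matches tags like "[source: policy.md]" and captures the source name.
CITATION = re.compile(r"\[source:\s*([^\]]+)\]")


def check_citations(answer: str, retrieved_sources: set) -> dict:
    cited = {name.strip() for name in CITATION.findall(answer)}
    return {
        "cited": sorted(cited),
        "invented": sorted(cited - retrieved_sources),  # cited but never retrieved
        "uncited_answer": not cited,                     # answer carries no citations
    }


# Made-up example answer and retrieval set.
report = check_citations(
    "Refunds are accepted within 30 days [source: policy.md]; see also [source: faq.md].",
    {"policy.md", "shipping.md"},
)
print(report)
```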

What to actually measure

  • Retrieval quality: Is the right passage in the top-k? Track recall and precision against a labeled question set.
  • Grounding / faithfulness: Is every claim in the answer supported by a retrieved passage, with nothing invented?
  • Answer relevance: Does the response actually address the question, not just quote nearby text?
  • Citation accuracy: Do the cited sources truly contain the claims they're attached to?
  • Refusal calibration: When context is missing, does the agent say 'I don't know' instead of guessing?
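The retrieval-quality metric in the list above is straightforward to compute once you have a labeled question set. The sketch below uses the standard definitions of recall@k and precision@k; the two labeled examples and the passage IDs are made up for illustration.

```python
def recall_at_k(retrieved: list, relevant: set, k: int) -> float:
    """Fraction of the relevant passages that appear in the top-k results."""
    if not relevant or k <= 0:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def precision_at_k(retrieved: list, relevant: set, k: int) -> float:
    """Fraction of the top-k slots filled by a relevant passage."""
    if k <= 0:
        return 0.0
    return sum(1 for doc_id in retrieved[:k] if doc_id in relevant) / k


# Made-up labeled set: ranked retrieval output plus the passages a human marked relevant.
labeled = [
    {"retrieved": ["a", "b", "c"], "relevant": {"a", "d"}},
    {"retrieved": ["x", "y", "z"], "relevant": {"y"}},
]
K = 3
mean_recall = sum(recall_at_k(q["retrieved"], q["relevant"], K) for q in labeled) / len(labeled)
mean_precision = sum(precision_at_k(q["retrieved"], q["relevant"], K) for q in labeled) / len(labeled)
print(f"recall@{K}={mean_recall:.2f} precision@{K}={mean_precision:.2f}")
```

Recall tells you whether the generator had a chance; precision tells you how much noise it had to read past. Tracking both as you change chunking or embeddings shows whether a tweak actually helped.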

Evaluate the stages separately

The biggest pitfall is grading only the final answer. A good answer can hide bad retrieval (the model knew it anyway), and a bad answer can hide good retrieval (the prompt was the problem). Measure retrieval and generation independently so you fix the stage that's actually broken.
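Scoring the two stages independently yields four outcomes per question, and a small helper can turn them into a fix list. The labels and the `diagnose` function below are hypothetical; the per-question results are made-up inputs standing in for a real eval run.

```python
from collections import Counter


def diagnose(retrieval_hit: bool, answer_correct: bool) -> str:
    """Map one labeled example to the stage that needs attention."""
    if retrieval_hit and answer_correct:
        return "healthy"
    if retrieval_hit:
        return "generation problem"     # evidence was there, answer still wrong
    if answer_correct:
        return "hidden retrieval miss"  # right answer without the right passage
    return "retrieval problem"


# Made-up results: (right passage in top-k?, final answer correct?)
results = [(True, True), (True, False), (False, True), (False, False), (True, True)]
tally = Counter(diagnose(hit, ok) for hit, ok in results)
print(dict(tally))
```

The "hidden retrieval miss" bucket is the one that answer-only grading never surfaces: the system looks fine until a question arrives that the model cannot answer from memory.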

Pair these metrics with agent memory hygiene — deduplicate sources, expire stale chunks, and keep provenance on every passage. The discipline that makes RAG trustworthy is the same discipline that makes any agent trustworthy: ground every claim, measure every stage, and let the system admit when it doesn't know.
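The hygiene steps named above (deduplicate, expire, keep provenance) can be sketched as one pass over the store. This is a minimal illustration under assumptions: the `Passage` fields, the whitespace-and-case normalization used for duplicate detection, and the 90-day age limit are all hypothetical choices, and the sample passages are made up.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass(frozen=True)
class Passage:
    text: str
    source: str           # provenance: where the passage came from
    indexed_at: datetime  # provenance: when it was indexed


def clean(passages: list, max_age: timedelta, now: Optional[datetime] = None) -> list:
    now = now or datetime.now(timezone.utc)
    seen = set()
    kept = []
    # Newest first, so a duplicate keeps its most recently indexed copy.
    for p in sorted(passages, key=lambda p: p.indexed_at, reverse=True):
        if now - p.indexed_at > max_age:
            continue  # expired: too old to trust
        normalized = " ".join(p.text.lower().split())
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # duplicate of a newer passage
        seen.add(digest)
        kept.append(p)
    return kept


utc = timezone.utc
now = datetime(2026, 1, 31, tzinfo=utc)
passages = [
    Passage("Refunds within 30 days.", "policy.md", datetime(2026, 1, 30, tzinfo=utc)),
    Passage("refunds  within 30 DAYS.", "faq.md", datetime(2026, 1, 10, tzinfo=utc)),
    Passage("Old pricing table", "pricing-old.md", datetime(2025, 1, 1, tzinfo=utc)),
]
kept = clean(passages, max_age=timedelta(days=90), now=now)
print([p.source for p in kept])
```

Exact-hash matching only catches near-verbatim copies; paraphrased duplicates need a similarity threshold instead. Either way, keeping `source` and `indexed_at` on every passage is what lets the agent cite and lets you audit.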

FAQ

RAG for agents, answered

RAG is a technique that fetches relevant text from an external knowledge source at query time and inserts it into the model's prompt before it generates an answer. Instead of relying only on what the model memorized during training, the agent retrieves up-to-date, organization-specific passages — from documents, a database, or an API — and conditions its response on them. The result is an answer grounded in real source material, with the ability to cite where each fact came from.
