When should I use RAG instead of fine-tuning?

Use RAG when knowledge changes often, must be cited, or is too large to bake into weights — product docs, policies, tickets, a wiki. RAG updates instantly when you re-index, and you can show sources. Use fine-tuning to teach a durable skill, format, or tone the model should always apply — not to inject facts. The two are complementary: many production agents fine-tune for behavior and use RAG for knowledge. See our full comparison at /compare/rag-vs-fine-tuning.

What is agentic RAG?

Agentic RAG turns retrieval from a fixed pre-step into a tool the agent decides to use. Rather than always retrieving once, the agent judges whether it needs external knowledge, picks which source or index to query, rewrites the query, reads the results, and can retrieve again if the answer is still thin. It can route between a vector store, SQL, and live web search, and it knows when no retrieval is needed at all. This adaptive control loop is what makes RAG work inside real multi-step agents.

How does chunking affect RAG quality?

Chunking is how you split source documents before embedding them, and it quietly determines retrieval quality. Chunks that are too large dilute the embedding and pull in irrelevant text; chunks that are too small lose the context needed to answer. Good practice is to split on semantic boundaries (headings, paragraphs, sections), keep a modest overlap so ideas aren't cut mid-thought, and attach metadata like title, source, and section so you can filter and cite. Bad chunking is one of the most common reasons a RAG system retrieves the wrong passage.

Why does my RAG agent still hallucinate?

RAG reduces hallucination but does not remove it. If retrieval returns nothing relevant, the model may fall back on its parametric memory and invent an answer. If the retrieved passage is wrong or stale, the model will confidently repeat it. And even with good context, a model can ignore the sources or misattribute a citation. Fixes include retrieval-quality checks, instructing the model to answer only from context (or say it doesn't know), grounding and citation evaluation, and a re-retrieval step when confidence is low.
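Two of those fixes, a retrieval-quality check and an explicit "I don't know" path, can be sketched in a few lines. This is a minimal illustration, not a prescribed implementation: the `Hit` type, the `min_score` threshold of 0.35, and the `generate` callable are all assumptions made for the example, and a real threshold would need tuning against your own data.

```python
from dataclasses import dataclass

@dataclass
class Hit:
    text: str
    source: str
    score: float  # similarity to the query; higher is better (assumed 0..1 scale)

def usable_hits(hits: list[Hit], min_score: float = 0.35) -> list[Hit]:
    """Retrieval-quality check: drop hits too weak to count as evidence."""
    return [h for h in hits if h.score >= min_score]

def answer_or_abstain(question, hits, generate, min_score=0.35):
    """Call the model only when at least one hit clears the threshold.

    `generate` is a hypothetical callable that takes a prompt string and
    returns the model's answer.
    """
    good = usable_hits(hits, min_score)
    if not good:
        # No usable evidence: abstain instead of letting the model guess.
        return "I don't know: nothing relevant was retrieved."
    context = "\n\n".join(f"[source: {h.source}]\n{h.text}" for h in good)
    prompt = (
        "Answer ONLY from the context. If it is not there, say you don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return generate(prompt)
```

The abstain branch runs before the model is ever called, so a retrieval miss cannot turn into an invented answer. A re-retrieval step would slot in at the same point, in place of the early return.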

RAG · Retrieval-Augmented Generation

RAG for AI agents: Retrieval-Augmented Generation

An agent is only as good as the knowledge it can reach. RAG fetches the right passages at query time and feeds them to the model — so answers are grounded in your data, current, and citable instead of guessed from memory.

12 min read · Intermediate · Updated 2026

RAG, or retrieval-augmented generation, is the simplest reliable way to give a language model knowledge it was never trained on — and to make its answers traceable back to a source.

A base model only knows what was in its training data, frozen at a point in time. Ask it about last week's pricing change, an internal runbook, or a customer's open ticket and it will either refuse or, worse, confidently invent something. RAG closes that gap by retrieving relevant text from an external knowledge source at the moment of the query and placing it in the prompt, so the model generates its answer conditioned on real evidence.

For agents this is foundational. An LLM agent reasons and acts in a loop, but reasoning over wrong or missing facts just produces well-structured nonsense. RAG is the mechanism that grounds each step in your documents, your database, and the live world. It pairs naturally with agent memory and with the broader set of tools an agent can call.

This guide walks the full picture: what RAG is and why agents need it, the retrieve → augment → generate pipeline, how chunking, embeddings, and similarity search actually work, when to choose RAG over fine-tuning or a long context window, the rise of agentic RAG, and the evaluation pitfalls and failure modes that trip up real systems.

The problem RAG solves

Why agents need retrieval

A model's weights are a lossy, frozen snapshot. Agents act in a world that is specific, private, and constantly changing — exactly what training data can't capture.

Think of a model's parameters as parametric memory: an enormous, compressed average of public text. It is broad but blurry, static, and impossible to cite. Your agent, meanwhile, needs to answer questions about things that were never public and never static — a refund policy edited this morning, the spec for an internal API, the last three emails in a thread.

RAG adds non-parametric memory: an external store you control, query, and update independently of the model. When the agent needs a fact, it retrieves the most relevant passages and reads them like a person consulting a reference, rather than recalling from a hazy memory. Update the store and the agent's knowledge updates instantly — no retraining required.

The payoff is concrete: fresher answers, far less hallucination on in-domain questions, the ability to show where each claim came from, and access to private knowledge without ever baking it into model weights you might share.

Freshness

Re-index a document and the agent knows the new version on the next query — no retraining, no deploy.

Grounding

Answers are conditioned on retrieved evidence, sharply cutting fabricated facts on in-domain questions.

Citations

Because the source passage is in the prompt, the agent can point to the exact document and section.

Private knowledge

Serve answers from your own docs and data without ever embedding them into shared model weights.

The core pipeline

Retrieve → augment → generate

Every RAG system, however fancy, is these three stages. Get a clear mental model of each and the rest is tuning.

Query: user or agent question
Retrieve: similarity search over chunks
Augment: inject passages into prompt
Generate: model answers from context
Cite: attribute to source passages

The RAG pipeline. A query is embedded and used to retrieve relevant chunks; those chunks augment the prompt; the model generates a grounded, citable answer.

1 · Retrieve
Convert the query into an embedding and search a vector index (often combined with keyword search) for the top-k most relevant chunks. Optionally filter by metadata and re-rank the candidates for precision.
2 · Augment
Assemble a prompt that places the retrieved passages alongside the question, with clear instructions: answer only from this context, cite the sources, and say you don't know if the context is insufficient.
3 · Generate
The model reads the augmented prompt and produces an answer grounded in the passages — ideally with inline citations the agent or UI can link back to the original documents.

The whole game is precision at the top

The generator can only use what retrieval hands it. If the right passage isn't in the top-k, no amount of clever prompting recovers it. That's why teams invest most in the retrieve stage — chunking, embeddings, hybrid search, and re-ranking — long before they tune the prompt. A vector database is the workhorse that makes this fast at scale.

Under the hood

Chunking, embeddings, and similarity search

Retrieval rests on three ideas working together. Understanding each one tells you exactly where quality is won or lost.

Chunking

Documents are split into passages small enough to embed precisely but large enough to stay meaningful. Split on semantic boundaries — headings, paragraphs — keep a little overlap, and attach metadata (title, source, section) for filtering and citation.
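As a rough sketch of that advice, the function below splits on blank lines (a stand-in for real semantic boundaries), packs paragraphs into chunks up to a size limit, carries the last paragraph forward as overlap, and attaches metadata to every chunk. The name `chunk_document`, the 800-character limit, and the dictionary layout are illustrative assumptions. Production splitters usually count tokens and respect headings as well.

```python
def chunk_document(text, title, source, max_chars=800, overlap_paras=1):
    """Split on blank lines, pack paragraphs into chunks, carry overlap forward.

    A single paragraph longer than max_chars is kept whole rather than cut.
    """
    paras = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], []
    for para in paras:
        candidate = current + [para]
        if current and len("\n\n".join(candidate)) > max_chars:
            chunks.append(current)
            # Start the next chunk with the tail of the previous one so an
            # idea that spans the boundary is not cut mid-thought.
            current = current[-overlap_paras:] if overlap_paras else []
        current.append(para)
    if current:
        chunks.append(current)
    return [
        {"text": "\n\n".join(c), "title": title, "source": source, "chunk_id": i}
        for i, c in enumerate(chunks)
    ]

doc = "Intro paragraph.\n\nRefund rules paragraph.\n\nShipping rules paragraph."
for chunk in chunk_document(doc, title="Policies", source="policies.md", max_chars=45):
    print(chunk["chunk_id"], repr(chunk["text"]))
```

With the small `max_chars` used here, the middle paragraph appears in both chunks, which is the overlap doing its job.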

Embeddings

An embedding model maps each chunk to a vector — a list of numbers where semantically similar text lands close together. The same model embeds the query, so 'reset my password' can match a chunk about 'account recovery' even with no shared words.

Similarity search

At query time, the index finds the chunks whose vectors are nearest the query vector (by cosine similarity or dot product), returning the top-k. Approximate nearest-neighbour indexes keep this fast across millions of vectors.
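To make the idea concrete, here is a brute-force version of similarity search using cosine similarity. It is only a sketch: the two-dimensional vectors are hand-made toys (real embeddings have hundreds or thousands of dimensions), the `(chunk_id, vector)` index layout is an assumption, and a production system would use an approximate nearest-neighbour index instead of scoring every chunk.

```python
import math

def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, index, k=3):
    """Exact search over a list of (chunk_id, vector) pairs, best match first."""
    scored = [(cosine(query_vec, vec), chunk_id) for chunk_id, vec in index]
    scored.sort(reverse=True)
    return [(chunk_id, score) for score, chunk_id in scored[:k]]

# Toy index: three chunks with hand-made 2-D "embeddings".
index = [("a", [1.0, 0.0]), ("b", [0.0, 1.0]), ("c", [1.0, 1.0])]
print(top_k([1.0, 0.0], index, k=2))  # "a" ranks first, then "c"
```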

Beyond naive top-k

Hybrid search and re-ranking

Pure vector search is great at meaning but can miss exact terms — a part number, an error code, a rare name. Hybrid search blends semantic (vector) results with classic keyword (BM25) results, so you catch both the gist and the literal match.

A re-ranker then takes the merged candidate pool and scores each passage against the query with a more expensive cross-encoder, pushing the truly relevant chunks to the top. Cheap recall first, precise ranking second — this two-stage shape is what separates demo-grade RAG from production RAG.

Semantic vectors capture meaning and paraphrase.
Keyword search nails exact, rare, or symbolic terms.
Metadata filters scope results to a tenant, date, or doc type.
Re-ranking lifts the best passages into the top-k.
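The text above does not say how the vector and keyword results are merged. One widely used option is reciprocal rank fusion (RRF), sketched here as an assumption rather than as this guide's method. Each input is a ranked list of chunk ids, `k=60` is the constant conventionally used with RRF, and `allowed` is a hypothetical stand-in for a metadata filter. The cross-encoder re-ranking step is left out because it needs a trained model.

```python
def rrf_merge(rankings, k=60, allowed=None):
    """Fuse several ranked lists of chunk ids with reciprocal rank fusion.

    Each chunk scores 1 / (k + rank) in every list it appears in, and the
    scores are summed. `allowed`, if given, is a set of ids that pass the
    metadata filter; everything else is dropped.
    """
    scores = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            if allowed is not None and chunk_id not in allowed:
                continue
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for stable output.
    return sorted(scores, key=lambda c: (-scores[c], c))

vector_hits = ["d1", "d2", "d3"]   # ranked by meaning
keyword_hits = ["d3", "d1", "d4"]  # ranked by exact-term match
print(rrf_merge([vector_hits, keyword_hits]))
print(rrf_merge([vector_hits, keyword_hits], allowed={"d1", "d4"}))
```

Chunks that rank well in both lists rise to the top, which is the behaviour hybrid search is after. The fused list is then the candidate pool a re-ranker would score.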

Query: embed + tokenize
Vector search: top candidates by meaning
Keyword search: BM25 exact-term recall
Merge + filter: dedupe, apply metadata
Re-rank: cross-encoder scores top-k

A production retrieval stack: recall broadly, then re-rank for precision before augmenting the prompt.

See it in code

A minimal RAG retrieval step

Strip away the framework and the retrieve stage is small: embed the query, search the index, then build a grounded prompt from the top chunks.

retrieve.py (Python)

```python
# Assumes embed(), index, and llm() are supplied by your embedding model,
# vector store, and LLM client.

def retrieve(query, k=4):  # the 'retrieve' stage
    q_vec = embed(query)  # query → embedding vector
    hits = index.search(q_vec, top_k=k)  # nearest chunks
    return [h.chunk for h in hits]

def build_prompt(query, chunks):  # the 'augment' stage
    context = "\n\n".join(
        f"[source: {c.source}]\n{c.text}" for c in chunks
    )
    return (
        "Answer ONLY from the context. Cite sources.\n"
        "If it is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )

chunks = retrieve("What is our refund window?")  # agent calls retrieve as a tool
answer = llm(build_prompt("What is our refund window?", chunks))
```

A compact retrieve-then-augment step. The model is later asked to answer only from these passages and to cite them.

That's the whole skeleton. The instruction to answer only from the context and to admit uncertainty is doing heavy lifting — it's the line between a grounded agent and one that drifts back into guessing. In an agent, retrieve is registered as a tool, so the model can call it when (and only when) it needs evidence. See AI agent tools for how that wiring works.
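What "registered as a tool" means varies by framework, so the following is only a generic sketch and not any particular library's API. It assumes the model emits a tool call as JSON of the form `{"name": ..., "arguments": {...}}`. The `TOOLS` registry, the `dispatch` helper, and the canned one-entry corpus inside `retrieve` are all hypothetical.

```python
import json

def retrieve(query: str, k: int = 4) -> list[str]:
    # Stand-in for the real retrieve stage: a canned, placeholder corpus.
    corpus = {"refund": "Refunds are accepted within the posted window."}
    return [text for key, text in corpus.items() if key in query.lower()][:k]

# Registry: the description and parameter schema are what the model sees
# when deciding whether to call the tool; "fn" is what actually runs.
TOOLS = {
    "retrieve": {
        "description": "Search the knowledge base and return relevant passages.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}, "k": {"type": "integer"}},
            "required": ["query"],
        },
        "fn": retrieve,
    }
}

def dispatch(tool_call_json: str):
    """Run a tool call the model emitted as a JSON string."""
    call = json.loads(tool_call_json)
    tool = TOOLS[call["name"]]
    return tool["fn"](**call["arguments"])

print(dispatch('{"name": "retrieve", "arguments": {"query": "What is our refund window?"}}'))
```

The result of `dispatch` goes back into the conversation as an observation, and the model decides whether it now has enough evidence to answer.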

Choosing an approach

RAG vs fine-tuning vs long context

These three are often framed as rivals. They solve different problems, and strong systems frequently combine them.

Dimension	RAG	Fine-tuning	Long context
Best for	Changing, citable knowledge	Durable skills, format, tone	One-off large inputs
Knowledge freshness	Instant (re-index)	Stale until retrained	Per-request only
Can cite sources
Update cost	Cheap, incremental	Expensive retraining	None
Scales to huge corpora
Per-query cost	Low (small top-k)	Low	High (big prompt)
Adds new behavior

The clean rule: RAG injects knowledge, fine-tuning teaches behavior, long context handles a single big input. If your facts change or must be cited, reach for RAG. If you need the model to always reply in a house style, follow a niche format, or master a domain skill, fine-tune. If you simply need to reason over one large document this turn, paste it into the context window. Many production agents do all three — fine-tuned for tone, RAG for knowledge, long context for the occasional big artifact. For a deeper, side-by-side breakdown, read RAG vs fine-tuning.

A common myth is that big context windows make RAG obsolete. They don't: you still can't fit a million documents in a prompt, large prompts are slow and costly, models lose precision in the middle of very long inputs, and a raw window gives you no citations or access control. RAG remains the way to find the right few thousand tokens out of millions.

RAG inside the loop

Agentic RAG: the agent decides what to retrieve

Classic RAG retrieves once, up front. Agentic RAG makes retrieval a decision the agent owns — when to search, where to search, and whether to search again.

Assess: do I need external knowledge?
Route: vector store, SQL, or web?
Rewrite: sharpen the search query
Retrieve: fetch + read passages
Reflect: enough? If not, retrieve again

In agentic RAG, the agent treats retrieval as a tool inside its reasoning loop — judging need, choosing a source, and re-querying until the evidence is sufficient.

In naive RAG, retrieval is a fixed pre-step — it always runs once, with the user's raw question, against one index. That breaks down fast: some questions need no retrieval, some need several rounds, and many span multiple sources.

Agentic RAG hands those decisions to the agent. Retrieval becomes a tool the model can choose to call, so the agent can:

Decide whether to retrieve at all — skip it for chit-chat or arithmetic it can do itself.
Rewrite the query into a cleaner search string, or split a complex question into sub-queries.
Route to the right source — a vector store for docs, SQL for structured data, web search for current events.
Retrieve iteratively — read the first results, spot a gap, and search again with a refined query.

This is RAG fused with the agent loop: reason, retrieve, observe, repeat. It's more powerful and more robust — and it demands tighter evaluation, because now the agent can be wrong about whether to retrieve, not just what it found.
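The control flow described above can be sketched as a small loop. Every callable passed in (`needs_retrieval`, `route`, `rewrite`, `is_enough`, `generate`) is a hypothetical stand-in for a model call or a heuristic, `sources` maps a source name to a search function, and the `max_rounds` cap of 3 is an arbitrary assumption to keep the loop bounded.

```python
def agentic_answer(question, needs_retrieval, route, rewrite, sources,
                   is_enough, generate, max_rounds=3):
    """Assess, rewrite, route, retrieve, reflect; repeat until evidence suffices."""
    if not needs_retrieval(question):
        # Chit-chat or arithmetic: answer directly with no evidence.
        return generate(question, [])
    evidence = []
    query = rewrite(question, evidence)
    for _ in range(max_rounds):
        source = route(query)                    # e.g. "docs", "sql", "web"
        evidence.extend(sources[source](query))  # fetch and keep the passages
        if is_enough(question, evidence):
            break
        query = rewrite(question, evidence)      # refine using what was found
    return generate(question, evidence)

# Usage with trivial stubs in place of real model calls.
result = agentic_answer(
    "What is our refund window?",
    needs_retrieval=lambda q: True,
    route=lambda q: "docs",
    rewrite=lambda q, ev: f"{q} (round {len(ev) + 1})",
    sources={"docs": lambda q: [f"passage for: {q}"]},
    is_enough=lambda q, ev: len(ev) >= 2,
    generate=lambda q, ev: f"answered from {len(ev)} passages",
)
print(result)
```

Each decision point in the loop is a separate place the agent can be wrong, which is why the assess, route, and reflect steps deserve their own evaluation.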

top-k · Passages per query · usually 3–8, not the whole corpus
2-stage · Recall then re-rank · broad search, precise ranking
Retrieval rounds · agentic RAG can loop
100% · Answers cited · the grounding goal

Make it trustworthy

Evaluation pitfalls and failure modes

RAG that demos well often fails quietly in production. Knowing the failure modes — and measuring the right things — is what makes it dependable.

Common failure modes

Retrieval miss

The right passage isn't in the top-k — usually bad chunking, a weak embedding model, or no hybrid search. The generator never had a chance.

Wrong or stale context

Retrieval returns an outdated or duplicated passage and the model faithfully repeats the error. Garbage in, grounded garbage out.

Ignored context

Good passages are retrieved but the model answers from parametric memory anyway, or buries the key fact in a long, noisy context.

Broken citations

The answer cites the wrong source, or invents a citation that looks plausible but points nowhere — eroding the trust RAG is meant to build.
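Invented citations are the easiest of these failures to catch mechanically. The sketch below assumes answers cite passages with the same `[source: ...]` tag used when building the prompt; the `check_citations` helper and its return format are illustrative. It flags only citations that point outside the retrieved set, and it does not verify that a cited passage actually supports the claim.

```python
import re

# Matches tags like "[source: policy.md]" and captures the source name.
CITATION = re.compile(r"\[source:\s*([^\]]+?)\s*\]")

def check_citations(answer, retrieved_sources):
    """Compare the sources an answer cites with the sources actually retrieved."""
    cited = CITATION.findall(answer)
    allowed = set(retrieved_sources)
    return {
        "cited": cited,
        "invented": [c for c in cited if c not in allowed],
        "uncited_answer": not cited,
    }

report = check_citations(
    "Refunds are covered [source: policy.md]. See also [source: faq.md].",
    retrieved_sources=["policy.md", "runbook.md"],
)
print(report["invented"])  # faq.md was never retrieved
```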

What to actually measure

Retrieval quality — Is the right passage in the top-k? Track recall and precision against a labeled question set.
Grounding / faithfulness — Is every claim in the answer supported by a retrieved passage, with nothing invented?
Answer relevance — Does the response actually address the question, not just quote nearby text?
Citation accuracy — Do the cited sources truly contain the claims they're attached to?
Refusal calibration — When context is missing, does the agent say 'I don't know' instead of guessing?
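The retrieval-quality metrics at the top of that list are simple to compute once you have a labeled question set. A minimal sketch, assuming each question is labeled with the ids of the chunks that answer it (the chunk ids below are made up for illustration):

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant chunks that appear in the top-k results."""
    top = retrieved[:k]
    return len(set(top) & set(relevant)) / len(relevant) if relevant else 0.0

def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k results that are relevant."""
    top = retrieved[:k]
    return len(set(top) & set(relevant)) / len(top) if top else 0.0

retrieved = ["c1", "c7", "c3", "c9"]  # ranked ids returned for one question
relevant = {"c3", "c4"}               # ids labeled as answering it
print(recall_at_k(retrieved, relevant, k=3))     # one of two relevant chunks found
print(precision_at_k(retrieved, relevant, k=3))  # one of three results relevant
```

Averaging both numbers over the whole labeled set gives a retrieval score that is independent of the generator, which makes regressions in chunking or embeddings visible on their own.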

Evaluate the stages separately

The biggest pitfall is grading only the final answer. A good answer can hide bad retrieval (the model knew it anyway), and a bad answer can hide good retrieval (the prompt was the problem). Measure retrieval and generation independently so you fix the stage that's actually broken.

Pair these metrics with agent memory hygiene — deduplicate sources, expire stale chunks, and keep provenance on every passage. The discipline that makes RAG trustworthy is the same discipline that makes any agent trustworthy: ground every claim, measure every stage, and let the system admit when it doesn't know.
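Those hygiene steps can be run as a filter before indexing. This is a sketch under stated assumptions: chunks are dictionaries with `text`, `source`, and a timezone-aware `updated_at`; the 180-day expiry is an arbitrary placeholder; duplicates are detected by hashing whitespace-normalized, lowercased text, so near-duplicates with different wording are not caught; and treating a missing `source` as grounds for dropping a chunk is one interpretation of keeping provenance.

```python
import hashlib
from datetime import datetime, timedelta, timezone

def clean_chunks(chunks, now, max_age_days=180):
    """Drop chunks with no provenance, expired chunks, and exact duplicates."""
    seen, kept = set(), []
    cutoff = now - timedelta(days=max_age_days)
    for chunk in chunks:
        if not chunk.get("source") or chunk["updated_at"] < cutoff:
            continue  # no provenance, or stale
        normalized = " ".join(chunk["text"].split()).lower()
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # same text already kept from an earlier chunk
        seen.add(digest)
        kept.append(chunk)
    return kept

now = datetime(2026, 1, 1, tzinfo=timezone.utc)
chunks = [
    {"text": "Refund policy  v2", "source": "a.md", "updated_at": now - timedelta(days=1)},
    {"text": "refund policy v2", "source": "b.md", "updated_at": now - timedelta(days=2)},
    {"text": "Old pricing", "source": "c.md", "updated_at": now - timedelta(days=400)},
    {"text": "Orphan note", "source": "", "updated_at": now},
]
print([c["source"] for c in clean_chunks(chunks, now)])
```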

FAQ

RAG for agents, answered

What is RAG?

RAG is a technique that fetches relevant text from an external knowledge source at query time and inserts it into the model's prompt before it generates an answer. Instead of relying only on what the model memorized during training, the agent retrieves up-to-date, organization-specific passages (from documents, a database, or an API) and conditions its response on them. The result is an answer grounded in real source material, with the ability to cite where each fact came from.

Keep learning

Go deeper on grounding your agents

Vector databases: The index behind fast retrieval
AI agent memory: Context, state & non-parametric stores
AI agent tools: Register retrieval as a callable tool
LLM agents: The reason–act loop RAG plugs into
RAG vs fine-tuning: Pick the right approach
Embeddings glossary: Vectors and semantic similarity

Get started

Ship a grounded, citable RAG agent

Connect your docs, retrieve the right passages, and let your agent answer from real evidence. Free to start — no credit card required.
