What is RAG? Retrieval-Augmented Generation Explained

Retrieval-Augmented Generation (RAG) is a technique that enhances AI language models by connecting them to external knowledge sources at query time. Instead of relying solely on what the model learned during training — which has a fixed knowledge cutoff and can hallucinate facts — RAG retrieves relevant documents, database records, or other data and includes them in the prompt as context. The model then generates its response grounded in that retrieved information. This approach is widely used in enterprise AI applications, customer support bots, internal knowledge bases, and any scenario where accuracy and up-to-date information matter more than creative generation.

A typical RAG pipeline works in three stages. First, your knowledge base (documents, FAQs, product data) is split into chunks and converted into vector embeddings — numerical representations that capture semantic meaning. These embeddings are stored in a vector database. Second, when a user asks a question, the query is also converted into an embedding and compared against the stored vectors to find the most semantically similar chunks. Third, the retrieved chunks are injected into the prompt alongside the user's question, and the LLM generates an answer grounded in that specific context. The result is typically more accurate, and far easier to verify, than an answer generated from the model's memory alone.
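
To make the three stages concrete, here is a minimal sketch in Python. The embed function is a toy stand-in (a hashed bag-of-words), not a real embedding model; in practice you would call an embedding API and store vectors in a vector database rather than an in-memory list.

```python
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    """Toy stand-in for a real embedding model: hashed bag-of-words.
    In production, replace this with an embedding API call."""
    vec = np.zeros(dim)
    for token in text.lower().split():
        vec[hash(token) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

# Stage 1: chunk the knowledge base and index the embeddings
chunks = [
    "RAG retrieves relevant context at query time.",
    "Fine-tuning permanently adjusts model weights.",
]
index = [(chunk, embed(chunk)) for chunk in chunks]

# Stage 2: embed the query and rank chunks by similarity
# (vectors are unit-normalized, so the dot product is cosine similarity)
def retrieve(question: str, k: int = 3) -> list[str]:
    q = embed(question)
    ranked = sorted(index, key=lambda pair: float(np.dot(q, pair[1])), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

# Stage 3: inject the retrieved chunks into the prompt
def build_prompt(question: str) -> str:
    context = "\n\n".join(retrieve(question))
    return f"Answer using only this context:\n\n{context}\n\nQuestion: {question}"

print(build_prompt("How does RAG get fresh information?"))
```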

Choosing between RAG and fine-tuning is a common decision point. Fine-tuning permanently adjusts model weights with your data — it is expensive, requires retraining when data changes, and can introduce overfitting. RAG keeps the model unchanged and swaps in fresh context dynamically, making it cheaper, more flexible, and easier to audit. Most production AI applications start with RAG because it delivers better factual accuracy with lower cost and complexity. Understanding RAG fundamentals also helps you write better prompts — knowing how context is retrieved and composed lets you structure prompts that work well with augmented information.

RAG Prompt Templates

Copy-ready prompts for building and optimizing retrieval-augmented generation pipelines.

Document Q&A with Citations

You are a research assistant. Answer the user's question using ONLY the provided documents. For every claim, include an inline citation in the format [Doc N, Section].

<documents>
{{retrieved_documents}}
</documents>

Question: {{user_question}}

Instructions:
1. Read all provided documents carefully
2. Answer the question using only information found in the documents
3. Cite every factual claim with [Doc N, Section] references
4. If the documents don't contain enough information, say "The provided documents don't contain sufficient information to answer this question" and explain what's missing
5. Never fabricate information not present in the documents

Why it works: Explicit citation requirements force the model to ground every claim in retrieved context, dramatically reducing hallucination. The fallback instruction prevents confident-sounding fabrication when evidence is insufficient.
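
In a pipeline, the {{retrieved_documents}} and {{user_question}} placeholders are filled programmatically before the model call. A sketch, assuming retrieved chunks arrive as (section, text) pairs — a hypothetical shape; adapt it to your retriever's output:

```python
TEMPLATE = """You are a research assistant. Answer the user's question using ONLY \
the provided documents. For every claim, include an inline citation in the format \
[Doc N, Section].

<documents>
{retrieved_documents}
</documents>

Question: {user_question}"""

def format_documents(docs: list[tuple[str, str]]) -> str:
    # Label each chunk so the model can emit [Doc N, Section] citations
    return "\n\n".join(
        f"[Doc {n}, {section}]\n{text}"
        for n, (section, text) in enumerate(docs, start=1)
    )

prompt = TEMPLATE.format(
    retrieved_documents=format_documents([("Pricing", "The basic plan costs $10/month.")]),
    user_question="How much does the basic plan cost?",
)
```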

Knowledge Base Query Router

You are a knowledge base query router. Given a user question, determine the best retrieval strategy.

User question: {{user_question}}
Available knowledge bases: {{knowledge_bases}}

Analyze the question and return a JSON response:
{
  "query_type": "factual" | "comparative" | "procedural" | "exploratory",
  "target_bases": ["list of relevant knowledge base names"],
  "search_queries": ["2-3 reformulated search queries optimized for semantic retrieval"],
  "filters": {
    "date_range": "if time-sensitive, specify range",
    "category": "if category is implied"
  },
  "confidence": "high | medium | low"
}

Reformulate the original question into search queries that will retrieve the most relevant chunks. Decompose complex questions into simpler sub-queries.

Why it works: Query routing and reformulation are critical RAG optimizations. Decomposing complex questions into targeted sub-queries improves retrieval precision, and the structured JSON output makes it easy to integrate into automated pipelines.
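
Downstream, the router's JSON can be parsed and fanned out to the selected knowledge bases. A sketch assuming search_fns maps knowledge base names to search callables (a hypothetical interface); note that models sometimes wrap JSON in code fences, so strip those before parsing:

```python
import json

def route_and_search(router_output: str, search_fns: dict) -> list:
    """Parse the router's JSON plan and run each reformulated query
    against each knowledge base it selected."""
    # Strip markdown code fences the model may have added around the JSON
    cleaned = router_output.strip().removeprefix("```json").removesuffix("```").strip()
    plan = json.loads(cleaned)
    results = []
    for base in plan["target_bases"]:
        for query in plan["search_queries"]:
            results.extend(search_fns[base](query, filters=plan.get("filters", {})))
    return results
```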

Citation-Grounded Answer

Answer the following question based EXCLUSIVELY on the provided context passages. Follow these rules strictly:

<context>
{{context_passages}}
</context>

Question: {{question}}

Rules:
- Use ONLY information from the context passages above
- After each sentence that uses information from the context, add a citation: [passage_id]
- If multiple passages support a claim, cite all of them: [passage_1, passage_3]
- If the context does not contain the answer, respond: "NOT_FOUND: The provided context does not contain information to answer this question."
- Do not use any prior knowledge or training data
- Synthesize information across passages when relevant, but always cite sources

Why it works: The NOT_FOUND fallback with a machine-readable prefix makes it easy for downstream systems to detect when retrieval failed. Per-sentence citations create an auditable trail that users and automated checks can verify against source documents.
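
The prefix makes failure detection a one-line check. A minimal sketch of the downstream branch, plus citation extraction with a regex (both illustrative, not a prescribed implementation):

```python
import re

def handle_answer(answer: str) -> dict:
    # Machine-readable prefix: detect retrieval misses without parsing prose
    if answer.startswith("NOT_FOUND:"):
        return {"status": "retrieval_miss", "detail": answer}
    # Collect [passage_id] citations, e.g. "[passage_1, passage_3]"
    citations = re.findall(r"\[(passage_[^\]]+)\]", answer)
    return {"status": "ok", "answer": answer, "citations": citations}
```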

Context Window Optimization for RAG

You are a context optimization assistant. Given retrieved chunks and a token budget, select and order the most relevant context for answering the question.

Question: {{question}}
Token budget: {{token_budget}} tokens
Retrieved chunks (ranked by similarity score):

{{ranked_chunks}}

Tasks:
1. Remove chunks that are irrelevant or redundant to the question
2. If chunks overlap in content, keep only the most informative version
3. Order remaining chunks with the most relevant first and least relevant last
4. Ensure total token count stays within the budget
5. Return the optimized context with a brief explanation of what was removed and why

Output format:
OPTIMIZED_CONTEXT:
[Selected chunks in optimal order]

REMOVED:
- [Chunk ID]: [Reason for removal]

ESTIMATED_TOKENS: [count]

Why it works: RAG systems often retrieve more context than needed. This prompt acts as a reranking and compression step that maximizes information density within token limits. Placing the most relevant context first exploits the primacy bias in LLM attention.
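
The same budget rule can also be enforced deterministically, before or after the model pass. A greedy sketch; count_tokens here is a whitespace approximation, so swap in a real tokenizer for accurate counts:

```python
def fit_to_budget(ranked_chunks: list[str], token_budget: int,
                  count_tokens=lambda s: len(s.split())) -> tuple[list[str], int]:
    """Keep chunks in similarity order until the token budget is spent.
    count_tokens is a crude stand-in; use a real tokenizer in production."""
    kept, used = [], 0
    for chunk in ranked_chunks:
        cost = count_tokens(chunk)
        if used + cost > token_budget:
            continue  # skip chunks that would overflow the budget
        kept.append(chunk)
        used += cost
    return kept, used
```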

Retrieval Prompt Tuning

You are an expert at writing search queries optimized for vector similarity retrieval. Given the user's natural language question, generate multiple search queries that will retrieve the most relevant document chunks from an embedding-based vector store.

Original question: {{original_question}}
Domain: {{domain}}
Embedding model: {{embedding_model}}

Generate exactly 5 alternative search queries:
1. A direct reformulation that matches likely document language (not question format)
2. A broader query that captures the general topic
3. A specific query targeting the most critical detail needed
4. A query using synonyms and alternative terminology common in {{domain}}
5. A hypothetical ideal passage — write 1-2 sentences that would perfectly answer the question (HyDE approach)

For each query, explain in one line why it would retrieve different relevant chunks.

Why it works: Multi-query retrieval with the HyDE (Hypothetical Document Embedding) technique dramatically improves recall. The domain-aware synonym expansion catches documents that use different terminology, and the hypothetical passage approach often retrieves chunks that direct queries miss.
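
The template does not prescribe how to merge results from the five query variants; reciprocal rank fusion (RRF) is one common choice. A sketch, assuming each variant's retrieval returns an ordered list of chunk IDs:

```python
from collections import defaultdict

def reciprocal_rank_fusion(result_lists: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists from multiple query variants. Chunks that rank
    well across several variants accumulate the highest fused score."""
    scores: dict[str, float] = defaultdict(float)
    for results in result_lists:
        for rank, chunk_id in enumerate(results):
            scores[chunk_id] += 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.__getitem__, reverse=True)
```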

RAG Evaluation & Quality Check

You are a RAG pipeline evaluator. Given a question, the retrieved context, and the generated answer, evaluate the quality of the RAG response.

Question: {{question}}
Retrieved context:
{{retrieved_context}}

Generated answer:
{{generated_answer}}

Evaluate on these dimensions (score 1-5 each):

1. **Faithfulness**: Does every claim in the answer have supporting evidence in the retrieved context? Flag any unsupported claims.
2. **Answer relevance**: Does the answer actually address the question asked?
3. **Context relevance**: Were the retrieved passages relevant to the question? Identify any irrelevant chunks.
4. **Completeness**: Does the answer cover all aspects of the question that the context supports?
5. **Citation accuracy**: If citations are present, do they point to the correct passages?

Output format:
SCORES: faithfulness=X, relevance=X, context_relevance=X, completeness=X, citation_accuracy=X
OVERALL: X/5
ISSUES: [List specific problems found]
SUGGESTIONS: [How to improve retrieval or generation for this query]

Why it works: Systematic RAG evaluation catches hallucinations, retrieval failures, and incomplete answers before they reach users. The structured scoring format enables automated quality tracking, and the suggestions output creates a feedback loop for improving the pipeline.
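
The SCORES line is deliberately easy to parse, so evaluations can feed a dashboard or a regression test. A parsing sketch (it assumes the evaluator followed the format exactly; guard accordingly in production):

```python
import re

def parse_scores(evaluation: str) -> dict[str, int]:
    """Extract 'SCORES: faithfulness=5, relevance=4, ...' into a dict."""
    match = re.search(r"SCORES:\s*(.+)", evaluation)
    if not match:
        return {}  # evaluator deviated from the format; flag for review
    return {
        key.strip(): int(value)
        for key, value in (pair.split("=") for pair in match.group(1).split(","))
    }
```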