Retrieval-Augmented Generation (RAG)
Definition: Retrieval-Augmented Generation (RAG)
RAG augments the LLM's context with retrieved documents:
- Index: Embed all documents into vectors
- Retrieve: For a query q, find the top-k documents by cosine similarity
- Generate: Concatenate retrieved documents with the query as LLM context
```python
# Pseudocode: retrieve, then generate
query_vec = embed(query)                         # embed the user query
docs = vector_db.search(query_vec, k=5)          # top-5 nearest chunks
context = "\n\n".join(doc.text for doc in docs)  # stitch chunks together
response = llm(f"Context:\n{context}\n\nQuestion: {query}")
```
RAG is the most practical way to give an LLM access to domain-specific knowledge (e.g., 3GPP specs, internal research reports) without fine-tuning.
Definition: Vector Database
A vector database stores high-dimensional embedding vectors and supports efficient approximate nearest neighbor (ANN) search.
Common options:
- FAISS (Meta): CPU/GPU, in-memory, very fast
- ChromaDB: Lightweight, Python-native
- Pinecone: Managed cloud service
- Weaviate: Open-source, hybrid search
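To make the retrieval interface concrete, here is a minimal sketch using ChromaDB from the list above. It relies on ChromaDB's default embedding function; the collection name and documents are purely illustrative.

```python
import chromadb

# In-memory client; ChromaDB also supports on-disk persistence
client = chromadb.Client()
collection = client.create_collection(name="papers")

# add() embeds the documents with the collection's default embedding function
collection.add(
    documents=[
        "Beamforming focuses transmit energy toward a receiver.",
        "OFDM splits a wideband channel into narrow orthogonal subcarriers.",
    ],
    ids=["doc1", "doc2"],
)

# query() embeds the query text and returns the nearest documents
results = collection.query(query_texts=["How does OFDM work?"], n_results=1)
print(results["documents"][0])
```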
Index types: flat (exact), IVF (inverted file), HNSW (graph-based).
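The trade-offs between these index types are easiest to see in code. The FAISS sketch below builds all three; the dimension, corpus size, and tuning values (nlist, nprobe, HNSW connectivity) are illustrative, not recommendations.

```python
import numpy as np
import faiss

d = 384                                            # embedding dimension (e.g. all-MiniLM-L6-v2)
xb = np.random.rand(10_000, d).astype("float32")   # placeholder corpus vectors

# Flat: exact search, scans every vector per query
flat = faiss.IndexFlatL2(d)
flat.add(xb)

# IVF: cluster vectors into nlist cells, search only a few cells per query
nlist = 100
quantizer = faiss.IndexFlatL2(d)
ivf = faiss.IndexIVFFlat(quantizer, d, nlist)
ivf.train(xb)          # k-means clustering of the corpus
ivf.add(xb)
ivf.nprobe = 10        # number of cells to visit at query time

# HNSW: graph-based approximate search, no training step
hnsw = faiss.IndexHNSWFlat(d, 32)   # 32 = graph connectivity (M)
hnsw.add(xb)

# All three expose the same search interface
xq = np.random.rand(5, d).astype("float32")
D, I = flat.search(xq, 5)           # distances and indices of top-5 neighbors
```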
Example: Building a RAG System for Research Papers
Build a RAG system that retrieves relevant paper chunks for answering questions about wireless communications.
Implementation
```python
from sentence_transformers import SentenceTransformer
import faiss
import numpy as np

# 1. Embed documents
model = SentenceTransformer("all-MiniLM-L6-v2")
chunks = load_paper_chunks("papers/")   # list of text chunks
embeddings = model.encode(chunks)       # (num_chunks, 384) float32 array

# 2. Build FAISS index
dim = embeddings.shape[1]
index = faiss.IndexFlatIP(dim)          # inner product (cosine after normalization)
faiss.normalize_L2(embeddings)
index.add(embeddings)

# 3. Retrieve and generate
def ask(question, k=5):
    q_vec = model.encode([question])
    faiss.normalize_L2(q_vec)
    scores, indices = index.search(q_vec, k)
    context = "\n".join(chunks[i] for i in indices[0])
    return llm_call(question, context)
```
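The snippet leaves llm_call undefined. One possible sketch, assuming the OpenAI Python SDK with an API key in the environment; any chat-completion API would work the same way, and the model name is only an example.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def llm_call(question, context, model="gpt-4o-mini"):
    # Ground the model in the retrieved chunks and ask it to stay within them
    prompt = (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```

Instructing the model to answer only from the retrieved context is what gives RAG its grounding and reduces hallucination.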
[Interactive: Embedding Similarity Search (visualize how RAG retrieval finds relevant documents)]
[Interactive: RAG Pipeline Animation (watch the RAG pipeline process a query step by step)]
[Diagram: RAG Architecture]
Quick Check
When should you use RAG instead of fine-tuning?
- When the knowledge changes frequently (correct)
- When you need the fastest possible inference
- When you have unlimited compute budget

RAG lets you update the knowledge base without retraining. Fine-tuning bakes knowledge into the weights, so updating it requires retraining.
Common Mistake: Wrong Chunk Size for RAG
Mistake:
Using full papers as retrieval units.
Correction:
Chunk documents into 200-500 token segments with some overlap. Chunks that are too large dilute relevance; chunks that are too small lose context. Use recursive text splitters that respect semantic boundaries (paragraphs, sentences), as in the sketch below.
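A minimal word-based splitter illustrates fixed-size chunks with overlap. Word counts stand in for tokens here; a tokenizer-aware or recursive splitter (e.g. from a library such as LangChain) is the more precise option.

```python
def chunk_text(text, chunk_size=400, overlap=50):
    """Split text into overlapping chunks of roughly chunk_size words.

    Word counts approximate token counts; the overlap keeps sentences
    that straddle a boundary available in two neighboring chunks.
    """
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks
```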
Key Takeaway
RAG is the most practical way to give LLMs domain-specific knowledge. Chunk documents into 200-500 token segments, embed with a sentence transformer, store in a vector database, and retrieve top-k chunks as context for generation.
RAG vs Fine-Tuning
| Aspect | RAG | Fine-Tuning |
|---|---|---|
| Knowledge update | Instant (update index) | Requires retraining |
| Setup cost | Low (no training) | High (GPU, data) |
| Hallucination | Reduced (grounded) | Still possible |
| Latency | Higher (retrieval step) | Lower (single forward pass) |
| Domain adaptation | Good for factual recall | Better for style/behavior |
RAG (Retrieval-Augmented Generation)
A technique that augments LLM generation by first retrieving relevant documents from a knowledge base and including them in the prompt context.
Related: Vector Database
Vector Database
A database optimized for storing and searching high-dimensional embedding vectors using approximate nearest neighbor algorithms.
Related: RAG (Retrieval-Augmented Generation)