Retrieval-Augmented Generation (RAG)

Definition:

Retrieval-Augmented Generation (RAG)

RAG augments the LLM's context with retrieved documents:

  1. Index: Embed all documents $d_i$ into vectors $\mathbf{e}_i$
  2. Retrieve: For query $q$, find the top-$k$ documents by cosine similarity: $\text{top-}k = \arg\max_i \cos(\mathbf{e}_q, \mathbf{e}_i)$
  3. Generate: Concatenate retrieved documents with the query as LLM context
# Pseudocode: embed(), vector_db, and llm() are assumed helpers
query_vec = embed(query)                         # encode the query into a vector
docs = vector_db.search(query_vec, k=5)          # top-5 nearest chunks by similarity
context = "\n\n".join(doc.text for doc in docs)  # concatenate retrieved text
response = llm(f"Context:\n{context}\n\nQuestion: {query}")

RAG is the most practical way to give an LLM access to domain-specific knowledge (e.g., 3GPP specs, internal research reports) without fine-tuning.

Definition:

Vector Database

A vector database stores high-dimensional embedding vectors and supports efficient approximate nearest neighbor (ANN) search.

Common options:

  • FAISS (Meta): CPU/GPU, in-memory, very fast
  • ChromaDB: Lightweight, Python-native
  • Pinecone: Managed cloud service
  • Weaviate: Open-source, hybrid search

Index types: flat (exact), IVF (inverted file), HNSW (graph-based).
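
As a concrete sketch, here is a flat (exact) FAISS index on toy vectors; this assumes the faiss-cpu package, and the dimension and data are placeholders:

import numpy as np
import faiss

d = 384                                                 # embedding dimension (e.g., MiniLM)
doc_vecs = np.random.rand(10_000, d).astype("float32")  # toy document embeddings

index = faiss.IndexFlatL2(d)  # flat = exact search; IndexHNSWFlat(d, 32) for ANN,
index.add(doc_vecs)           # IndexIVFFlat (quantizer + training) for large corpora

query_vec = np.random.rand(1, d).astype("float32")
distances, ids = index.search(query_vec, 5)             # top-5 nearest neighbors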

Example: Building a RAG System for Research Papers

Build a RAG system that retrieves relevant paper chunks for answering questions about wireless communications.
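
A minimal end-to-end sketch, assuming the sentence-transformers and faiss-cpu packages; the two chunks are toy stand-ins for real paper segments, and llm() is the same assumed generation call as in the pseudocode above:

import faiss
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dim sentence embeddings

# chunks: 200-500 token segments from the papers (toy examples here)
chunks = ["OFDM divides the channel into orthogonal subcarriers...",
          "Massive MIMO uses large antenna arrays at the base station..."]

embeddings = model.encode(chunks, normalize_embeddings=True)
index = faiss.IndexFlatIP(embeddings.shape[1])   # inner product = cosine on unit vectors
index.add(embeddings)

def answer(query, k=2):
    q = model.encode([query], normalize_embeddings=True)
    _, ids = index.search(q, k)
    context = "\n\n".join(chunks[i] for i in ids[0])
    return llm(f"Context:\n{context}\n\nQuestion: {query}")  # llm() assumed, as above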

[Interactive demo: RAG pipeline animation. Watch the RAG pipeline process a query step by step.]

[Figure: RAG architecture. Retrieval-Augmented Generation: embed, retrieve, generate.]

Quick Check

When should you use RAG instead of fine-tuning?

  • When the knowledge changes frequently (correct: the index can be updated instantly, no retraining needed; see the table below)
  • When you need the fastest possible inference
  • When you have unlimited compute budget

Common Mistake: Wrong Chunk Size for RAG

Mistake:

Using full papers as retrieval units.

Correction:

Chunk documents into 200-500 token segments with overlap. Chunks that are too large dilute relevance; chunks that are too small lose context. Use recursive text splitters that respect semantic boundaries such as sentences and paragraphs.
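
A minimal sketch of overlapping chunking in plain Python; it splits on words as a stand-in for tokens, and chunk_text is an illustrative helper (a production system would count real tokenizer tokens or use a library splitter):

def chunk_text(text, chunk_size=300, overlap=50):
    # Split text into ~chunk_size-word segments, each sharing `overlap` words
    # with the previous segment so context is not cut off at the boundary.
    words = text.split()
    step = chunk_size - overlap
    return [" ".join(words[i:i + chunk_size])
            for i in range(0, max(len(words) - overlap, 1), step)]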

Key Takeaway

RAG is the most practical way to give LLMs domain-specific knowledge. Chunk documents into 200-500 token segments, embed with a sentence transformer, store in a vector database, and retrieve top-k chunks as context for generation.

RAG vs Fine-Tuning

Aspect            | RAG                      | Fine-Tuning
------------------|--------------------------|----------------------------
Knowledge update  | Instant (update index)   | Requires retraining
Setup cost        | Low (no training)        | High (GPU, data)
Hallucination     | Reduced (grounded)       | Still possible
Latency           | Higher (retrieval step)  | Lower (single forward pass)
Domain adaptation | Good for factual recall  | Better for style/behavior

RAG (Retrieval-Augmented Generation)

A technique that augments LLM generation by first retrieving relevant documents from a knowledge base and including them in the prompt context.

Related: Vector Database

Vector Database

A database optimized for storing and searching high-dimensional embedding vectors using approximate nearest neighbor algorithms.

Related: RAG (Retrieval-Augmented Generation)