Retrieval-Augmented Generation (RAG)
Definition: Retrieval-Augmented Generation (RAG)
RAG augments the LLM's context with retrieved documents:
- Index: Embed all documents into vectors
- Retrieve: For a query q, find the top-k documents by cosine similarity
- Generate: Concatenate retrieved documents with the query as LLM context
```python
# Pseudocode: retrieve, then generate
query_vec = embed(query)                         # embed the user query
docs = vector_db.search(query_vec, k=5)          # top-5 nearest chunks
context = "\n\n".join(doc.text for doc in docs)  # stitch chunks together
response = llm(f"Context:\n{context}\n\nQuestion: {query}")
```
RAG is the most practical way to give an LLM access to domain-specific knowledge (e.g., 3GPP specs, internal research reports) without fine-tuning.
Definition: Vector Database
A vector database stores high-dimensional embedding vectors and supports efficient approximate nearest neighbor (ANN) search.
Common options:
- FAISS (Meta): CPU/GPU, in-memory, very fast
- ChromaDB: Lightweight, Python-native
- Pinecone: Managed cloud service
- Weaviate: Open-source, hybrid search
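To make the retrieval interface concrete, here is a minimal sketch using ChromaDB from the list above. It relies on ChromaDB's default embedding function; the collection name and documents are purely illustrative.

```python
import chromadb

# In-memory client; ChromaDB also supports on-disk persistence
client = chromadb.Client()
collection = client.create_collection(name="papers")

# add() embeds the documents with the collection's default embedding function
collection.add(
    documents=[
        "Beamforming focuses transmit energy toward a receiver.",
        "OFDM splits a wideband channel into narrow orthogonal subcarriers.",
    ],
    ids=["doc1", "doc2"],
)

# query() embeds the query text and returns the nearest documents
results = collection.query(query_texts=["How does OFDM work?"], n_results=1)
print(results["documents"][0])
```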
Index types: flat (exact), IVF (inverted file), HNSW (graph-based).
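The trade-offs between these index types are easiest to see in code. The FAISS sketch below builds all three; the dimension, corpus size, and tuning values (nlist, nprobe, HNSW connectivity) are illustrative, not recommendations.

```python
import numpy as np
import faiss

d = 384                                            # embedding dimension (e.g. all-MiniLM-L6-v2)
xb = np.random.rand(10_000, d).astype("float32")   # placeholder corpus vectors

# Flat: exact search, scans every vector per query
flat = faiss.IndexFlatL2(d)
flat.add(xb)

# IVF: cluster vectors into nlist cells, search only a few cells per query
nlist = 100
quantizer = faiss.IndexFlatL2(d)
ivf = faiss.IndexIVFFlat(quantizer, d, nlist)
ivf.train(xb)          # k-means clustering of the corpus
ivf.add(xb)
ivf.nprobe = 10        # number of cells to visit at query time

# HNSW: graph-based approximate search, no training step
hnsw = faiss.IndexHNSWFlat(d, 32)   # 32 = graph connectivity (M)
hnsw.add(xb)

# All three expose the same search interface
xq = np.random.rand(5, d).astype("float32")
D, I = flat.search(xq, 5)           # distances and indices of top-5 neighbors
```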
Example: Building a RAG System for Research Papers
Build a RAG system that retrieves relevant paper chunks for answering questions about wireless communications.
Implementation
```python
from sentence_transformers import SentenceTransformer
import faiss
import numpy as np

# 1. Embed documents
model = SentenceTransformer("all-MiniLM-L6-v2")
chunks = load_paper_chunks("papers/")   # list of text chunks
embeddings = model.encode(chunks)       # (num_chunks, 384) float32 array

# 2. Build FAISS index
dim = embeddings.shape[1]
index = faiss.IndexFlatIP(dim)          # inner product (cosine after normalization)
faiss.normalize_L2(embeddings)
index.add(embeddings)

# 3. Retrieve and generate
def ask(question, k=5):
    q_vec = model.encode([question])
    faiss.normalize_L2(q_vec)
    scores, indices = index.search(q_vec, k)
    context = "\n".join(chunks[i] for i in indices[0])
    return llm_call(question, context)
```
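The snippet leaves llm_call undefined. One possible sketch, assuming the OpenAI Python SDK with an API key in the environment; any chat-completion API would work the same way, and the model name is only an example.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def llm_call(question, context, model="gpt-4o-mini"):
    # Ground the model in the retrieved chunks and ask it to stay within them
    prompt = (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```

Instructing the model to answer only from the retrieved context is what gives RAG its grounding and reduces hallucination.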
[Interactive: Embedding Similarity Search (visualize how RAG retrieval finds relevant documents)]
[Interactive: RAG Pipeline Animation (watch the RAG pipeline process a query step by step)]
[Diagram: RAG Architecture]
Quick Check
When should you use RAG instead of fine-tuning?
- When the knowledge changes frequently (correct)
- When you need the fastest possible inference
- When you have unlimited compute budget

RAG lets you update the knowledge base without retraining. Fine-tuning bakes knowledge into the weights, so updating it requires retraining.
Common Mistake: Wrong Chunk Size for RAG
Mistake:
Using full papers as retrieval units.
Correction:
Chunk documents into 200-500 token segments with some overlap. Chunks that are too large dilute relevance; chunks that are too small lose context. Use recursive text splitters that respect semantic boundaries (paragraphs, sentences), as in the sketch below.
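A minimal word-based splitter illustrates fixed-size chunks with overlap. Word counts stand in for tokens here; a tokenizer-aware or recursive splitter (e.g. from a library such as LangChain) is the more precise option.

```python
def chunk_text(text, chunk_size=400, overlap=50):
    """Split text into overlapping chunks of roughly chunk_size words.

    Word counts approximate token counts; the overlap keeps sentences
    that straddle a boundary available in two neighboring chunks.
    """
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks
```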
Key Takeaway
RAG is the most practical way to give LLMs domain-specific knowledge. Chunk documents into 200-500 token segments, embed with a sentence transformer, store in a vector database, and retrieve top-k chunks as context for generation.
RAG vs Fine-Tuning
| Aspect | RAG | Fine-Tuning |
|---|---|---|
| Knowledge update | Instant (update index) | Requires retraining |
| Setup cost | Low (no training) | High (GPU, data) |
| Hallucination | Reduced (grounded) | Still possible |
| Latency | Higher (retrieval step) | Lower (single forward pass) |
| Domain adaptation | Good for factual recall | Better for style/behavior |
RAG (Retrieval-Augmented Generation)
A technique that augments LLM generation by first retrieving relevant documents from a knowledge base and including them in the prompt context.
Related: Vector Database
Vector Database
A database optimized for storing and searching high-dimensional embedding vectors using approximate nearest neighbor algorithms.
Related: RAG (Retrieval-Augmented Generation)