Running Local Models
Definition: Local LLM Inference
Running LLMs locally eliminates per-call API costs, avoids network latency, keeps data private, and enables full customization. Key frameworks:
- Ollama: One-command setup, runs GGUF models
- vLLM: High-throughput serving with PagedAttention
- llama.cpp: CPU-optimized C++ inference
- HuggingFace Transformers: Full Python control
```bash
# Ollama example: download the model, then run a one-off prompt
ollama pull llama3:8b
ollama run llama3:8b "Explain MIMO"
```
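Beyond the CLI, Ollama also exposes a local REST API (default port 11434), so the same model can be scripted. A minimal sketch using the `requests` library, assuming the server started by the commands above is running:

```python
import requests

# Ask the local Ollama server for a completion over its REST API.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3:8b", "prompt": "Explain MIMO", "stream": False},
    timeout=120,
)
print(resp.json()["response"])
```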
Definition: Quantization for Local Inference
Quantization reduces model precision to lower memory requirements:
| Format | Bits | Memory (7B model) | Quality Loss |
|---|---|---|---|
| FP16 | 16 | 14 GB | None |
| INT8 | 8 | 7 GB | Minimal |
| INT4 (GPTQ) | 4 | 3.5 GB | Small |
| GGUF Q4_K_M | ~4.5 | 4 GB | Negligible |
A 7B model quantized to 4 bits needs roughly 4 GB of VRAM, so it fits comfortably on a consumer GPU such as an RTX 3060 while retaining roughly 95% of full-precision quality.
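The memory figures in the table follow directly from parameter count times bytes per weight. A quick sketch of that arithmetic (weights only; KV cache and activations typically add another 1-2 GB):

```python
# Weight memory = parameter count * bytes per parameter.
def weight_memory_gb(params_billion: float, bits: float) -> float:
    return params_billion * 1e9 * (bits / 8) / 1e9

for fmt, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4), ("GGUF Q4_K_M", 4.5)]:
    print(f"{fmt:>12}: {weight_memory_gb(7, bits):4.1f} GB")
# FP16 14.0 GB, INT8 7.0 GB, INT4 3.5 GB, Q4_K_M ~3.9 GB
```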
Example: Local Inference with HuggingFace
Load a quantized LLM and run inference locally with HuggingFace.
Implementation
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Load in 4-bit so the 8B model fits on consumer GPUs (~6 GB vs ~16 GB in FP16)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16
    ),
    device_map="auto",
)

messages = [{"role": "user", "content": "Explain OFDM in 3 sentences."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)
output = model.generate(input_ids.to(model.device), max_new_tokens=256)
# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```
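The 4-bit load via bitsandbytes shown here is one option; pre-quantized GPTQ or AWQ checkpoints from the HuggingFace Hub typically load through the same `from_pretrained` call, provided the corresponding backend library is installed.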
Interactive: Model Comparison Dashboard (compare local vs API models on cost, speed, and quality; parameters adjustable in the online version).
Common Mistake: Running Out of GPU Memory
Mistake:
Loading a model that exceeds available VRAM.
Correction:
Use quantization (INT4/INT8), let device_map='auto' spread layers across available GPUs and CPU, or offload explicitly to CPU. Always check VRAM requirements against available memory before loading, as in the sketch below.
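A minimal pre-flight check, assuming a CUDA GPU and using PyTorch's `torch.cuda.mem_get_info`; the 4 GB threshold is an illustrative INT4 budget for a 7B model, not a hard rule:

```python
import torch

# Compare free VRAM against the rough weight footprint of a 7B model.
if torch.cuda.is_available():
    free_bytes, total_bytes = torch.cuda.mem_get_info()
    print(f"Free VRAM: {free_bytes / 1e9:.1f} / {total_bytes / 1e9:.1f} GB")
    # Rough weight budgets for 7B: ~14 GB FP16, ~7 GB INT8, ~4 GB INT4.
    if free_bytes < 4e9:
        print("Below the ~4 GB INT4 budget: offload to CPU or pick a smaller model.")
else:
    print("No CUDA GPU detected; use llama.cpp or full CPU offload instead.")
```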
Key Takeaway
Local LLM inference provides privacy, zero marginal cost, and full customization. Quantized 7-8B models run on consumer GPUs (8 GB VRAM) with minimal quality loss, making them practical for research workflows.
Why This Matters: Data Privacy in Telecom Research
Telecom operators handle sensitive network data that cannot be sent to cloud APIs. Running local LLMs enables AI-assisted analysis of proprietary measurement data, configuration files, and internal reports without data leaving the organization.
See full treatment in Chapter 51