Running Local Models

Definition: Local LLM Inference

Running LLMs locally eliminates per-request API costs and network latency, keeps data on your own hardware, and allows full customization. Key frameworks:

  • Ollama: One-command setup, runs GGUF models
  • vLLM: High-throughput serving with PagedAttention
  • llama.cpp: CPU-optimized C++ inference
  • HuggingFace Transformers: Full Python control

# Ollama example: pull a model, then run a one-off prompt
ollama pull llama3:8b
ollama run llama3:8b "Explain MIMO"
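
For higher-throughput serving, vLLM offers a similar one-model workflow from Python. A minimal offline-inference sketch, assuming vllm is installed and a CUDA GPU is available (the model ID is illustrative):

# vLLM example (pip install vllm)
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")  # illustrative model ID
params = SamplingParams(temperature=0.7, max_tokens=200)

outputs = llm.generate(["Explain MIMO in one paragraph."], params)
print(outputs[0].outputs[0].text)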

Definition: Quantization for Local Inference

Quantization reduces the numerical precision of model weights to lower memory requirements:

Format        Bits   Memory (7B)   Quality Loss
FP16          16     14 GB         None
INT8          8      7 GB          Minimal
INT4 (GPTQ)   4      3.5 GB        Small
GGUF Q4_K_M   ~4.5   4 GB          Negligible

A 7B model in Q4 needs about 4 GB of VRAM, so it fits on an entry-level consumer GPU such as an RTX 3060 while retaining roughly 95% of full-precision quality.
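
The table's memory figures follow directly from parameter count times precision: weights need roughly params × bits / 8 bytes. A quick back-of-the-envelope check in Python (weight memory only; runtime use adds KV-cache and activation overhead):

# Rough VRAM needed for model weights: params × bits / 8 bytes
def weight_memory_gb(num_params: float, bits: float) -> float:
    return num_params * bits / 8 / 1e9

for fmt, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4), ("Q4_K_M", 4.5)]:
    print(f"{fmt:>7}: {weight_memory_gb(7e9, bits):.1f} GB for a 7B model")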

Example: Local Inference with HuggingFace

Load a quantized LLM and run inference locally with HuggingFace.
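
A minimal sketch using 4-bit quantization via bitsandbytes. It assumes transformers, accelerate, and bitsandbytes are installed and a CUDA GPU is available; the model ID is illustrative:

# 4-bit local inference with HuggingFace Transformers
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"  # illustrative; any causal LM works

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # INT4 weights: ~4 GB for a 7-8B model
    bnb_4bit_compute_dtype=torch.float16,  # compute in FP16 for speed
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",  # place layers on available GPU(s), spilling to CPU if needed
)

inputs = tokenizer("Explain MIMO in one paragraph.", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))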

Model Comparison Dashboard

Compare local vs. API models on cost, speed, and quality.

Common Mistake: Running Out of GPU Memory

Mistake: Loading a model that exceeds available VRAM.

Correction: Use quantization (INT4/INT8), use device_map='auto' to spread layers across available GPUs, or offload layers to CPU. Always check VRAM requirements before loading (a quick check is sketched below).
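
A pre-flight VRAM check, assuming PyTorch with CUDA; the rule of thumb in the comment mirrors the quantization table above:

# Check free VRAM before loading a model
import torch

if torch.cuda.is_available():
    free, total = torch.cuda.mem_get_info()  # bytes on the current device
    print(f"Free VRAM: {free / 1e9:.1f} GB of {total / 1e9:.1f} GB")
    # Weights need roughly params × bits / 8 bytes, plus headroom
    # for the KV cache and activations.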

Key Takeaway

Local LLM inference provides privacy, zero marginal cost, and full customization. Quantized 7-8B models run on consumer GPUs (8 GB VRAM) with minimal quality loss, making them practical for research workflows.

Why This Matters: Data Privacy in Telecom Research

Telecom operators handle sensitive network data that cannot be sent to cloud APIs. Running local LLMs enables AI-assisted analysis of proprietary measurement data, configuration files, and internal reports without data leaving the organization.

See full treatment in Chapter 51