Running Local Models

Definition: Local LLM Inference

Running LLMs locally eliminates per-request API costs and network latency, keeps data on your own hardware, and allows full customization. Key frameworks:

  • Ollama: One-command setup, runs GGUF models
  • vLLM: High-throughput serving with PagedAttention
  • llama.cpp: CPU-optimized C++ inference
  • HuggingFace Transformers: Full Python control

# Ollama example: pull a model, then run a one-off prompt
ollama pull llama3:8b
ollama run llama3:8b "Explain MIMO"
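
For higher-throughput serving, vLLM offers a similar one-model workflow from Python. A minimal offline-inference sketch, assuming vllm is installed and a CUDA GPU is available (the model ID is illustrative):

# vLLM example (pip install vllm)
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")  # illustrative model ID
params = SamplingParams(temperature=0.7, max_tokens=200)

outputs = llm.generate(["Explain MIMO in one paragraph."], params)
print(outputs[0].outputs[0].text)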

Definition: Quantization for Local Inference

Quantization reduces the numerical precision of model weights to lower memory requirements:

Format        Bits   Memory (7B)   Quality Loss
FP16          16     14 GB         None
INT8          8      7 GB          Minimal
INT4 (GPTQ)   4      3.5 GB        Small
GGUF Q4_K_M   ~4.5   4 GB          Negligible

A 7B model in Q4 needs about 4 GB of VRAM, so it fits on an entry-level consumer GPU such as an RTX 3060 while retaining roughly 95% of full-precision quality.
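
The table's memory figures follow directly from parameter count times precision: weights need roughly params × bits / 8 bytes. A quick back-of-the-envelope check in Python (weight memory only; runtime use adds KV-cache and activation overhead):

# Rough VRAM needed for model weights: params × bits / 8 bytes
def weight_memory_gb(num_params: float, bits: float) -> float:
    return num_params * bits / 8 / 1e9

for fmt, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4), ("Q4_K_M", 4.5)]:
    print(f"{fmt:>7}: {weight_memory_gb(7e9, bits):.1f} GB for a 7B model")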

Example: Local Inference with HuggingFace

Load a quantized LLM and run inference locally with HuggingFace.
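
A minimal sketch using 4-bit quantization via bitsandbytes. It assumes transformers, accelerate, and bitsandbytes are installed and a CUDA GPU is available; the model ID is illustrative:

# 4-bit local inference with HuggingFace Transformers
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"  # illustrative; any causal LM works

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # INT4 weights: ~4 GB for a 7-8B model
    bnb_4bit_compute_dtype=torch.float16,  # compute in FP16 for speed
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",  # place layers on available GPU(s), spilling to CPU if needed
)

inputs = tokenizer("Explain MIMO in one paragraph.", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))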

Model Comparison Dashboard

Compare local vs. API models on cost, speed, and quality.

Common Mistake: Running Out of GPU Memory

Mistake: Loading a model that exceeds available VRAM.

Correction: Use quantization (INT4/INT8), use device_map='auto' to spread layers across available GPUs, or offload layers to CPU. Always check VRAM requirements before loading (a quick check is sketched below).
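
A pre-flight VRAM check, assuming PyTorch with CUDA; the rule of thumb in the comment mirrors the quantization table above:

# Check free VRAM before loading a model
import torch

if torch.cuda.is_available():
    free, total = torch.cuda.mem_get_info()  # bytes on the current device
    print(f"Free VRAM: {free / 1e9:.1f} GB of {total / 1e9:.1f} GB")
    # Weights need roughly params × bits / 8 bytes, plus headroom
    # for the KV cache and activations.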

Key Takeaway

Local LLM inference provides privacy, zero marginal cost, and full customization. Quantized 7-8B models run on consumer GPUs (8 GB VRAM) with minimal quality loss, making them practical for research workflows.

Why This Matters: Data Privacy in Telecom Research

Telecom operators handle sensitive network data that cannot be sent to cloud APIs. Running local LLMs enables AI-assisted analysis of proprietary measurement data, configuration files, and internal reports without data leaving the organization.

See full treatment in Chapter 51