Model Families and Their Characteristics

Definition: Open-Weight LLM Families

Key open-weight model families (as of 2025):

| Family | Organization | Sizes | Key Features |
|---|---|---|---|
| LLaMA 3 | Meta | 8B, 70B, 405B | GQA, RoPE, 128K context |
| Mistral / Mixtral | Mistral AI | 7B, 8x7B (MoE) | Sliding-window attention, MoE |
| Gemma | Google | 2B, 7B, 27B | Multi-query attention |
| Qwen 2.5 | Alibaba | 0.5B-72B | Strong multilingual support |
| DeepSeek V3 | DeepSeek | 671B (MoE) | FP8 training, multi-token prediction |

All of these families use a decoder-only transformer; they differ mainly in the attention mechanism (multi-head vs. grouped-query vs. multi-query), positional encoding, and whether the feed-forward layers use a mixture of experts.
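These differences are visible directly in each model's configuration file. A minimal sketch using the Hugging Face transformers library; the repository IDs are assumptions, and some (e.g. the Meta repos) are gated and require accepting a license on the Hub:

```python
from transformers import AutoConfig

# Repository IDs are assumptions; some repos are gated and need Hub access.
model_ids = {
    "LLaMA 3 8B": "meta-llama/Meta-Llama-3-8B",
    "Mixtral 8x7B": "mistralai/Mixtral-8x7B-v0.1",
    "Qwen 2.5 7B": "Qwen/Qwen2.5-7B",
}

for name, repo in model_ids.items():
    cfg = AutoConfig.from_pretrained(repo)
    # num_key_value_heads < num_attention_heads indicates grouped-query attention (GQA);
    # num_local_experts (if present) indicates a mixture-of-experts FFN.
    print(
        name,
        "heads:", cfg.num_attention_heads,
        "kv_heads:", getattr(cfg, "num_key_value_heads", cfg.num_attention_heads),
        "context:", cfg.max_position_embeddings,
        "experts:", getattr(cfg, "num_local_experts", 1),
    )
```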

Definition: Mixture of Experts (MoE)

MoE replaces the dense FFN with $E$ expert networks and a learned router:

$$\text{MoE}(\mathbf{x}) = \sum_{i=1}^{K} g_i(\mathbf{x}) \cdot \text{FFN}_i(\mathbf{x})$$

where $g(\mathbf{x}) = \text{TopK}(\text{softmax}(\mathbf{W}_g \mathbf{x}))$ selects the top-$K$ experts (typically $K = 2$).

Total parameters grow as $E \times$ the FFN size, but only $K$ experts are active per token, so the per-token compute cost is roughly $K/E$ that of a dense model with the same total parameter count.
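The routing above can be written down directly. Below is a minimal PyTorch sketch of a top-K MoE feed-forward layer; names such as MoELayer are ours, and real implementations add load-balancing losses and fused expert kernels:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    """Minimal top-K mixture-of-experts FFN: each token is routed to K of E experts."""

    def __init__(self, d_model: int, d_ff: int, n_experts: int = 8, k: int = 2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts, bias=False)  # W_g
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model). Router scores -> probabilities over experts.
        probs = F.softmax(self.router(x), dim=-1)      # (tokens, E)
        gate, idx = probs.topk(self.k, dim=-1)         # keep the top-K experts per token
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e               # tokens whose slot-th choice is expert e
                if mask.any():
                    out[mask] += gate[mask, slot, None] * expert(x[mask])
        return out

moe = MoELayer(d_model=64, d_ff=256)
y = moe(torch.randn(10, 64))  # 10 tokens, only 2 of 8 experts run per token
```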

Mixtral 8x7B has 47B total parameters but only ~13B active per token, achieving performance comparable to LLaMA 2 70B.
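Those figures are easy to sanity-check. The sketch below uses approximate Mixtral 8x7B dimensions (hidden size 4096, FFN size 14336, 32 layers, 8 KV heads of dimension 128, as in its published config) and ignores layer norms and biases:

```python
# Back-of-the-envelope parameter count for Mixtral 8x7B (approximate dimensions;
# layer norms and biases are ignored).
d_model, d_ff, n_layers = 4096, 14336, 32
n_experts, top_k = 8, 2
vocab, kv_dim = 32_000, 1024                              # 8 KV heads x head_dim 128 (GQA)

expert_ffn = 3 * d_model * d_ff                           # gate/up/down projections per expert
attn = 2 * d_model * d_model + 2 * d_model * kv_dim       # Q/O full size, K/V reduced by GQA
shared = n_layers * attn + 2 * vocab * d_model            # attention + input/output embeddings

total = shared + n_layers * n_experts * expert_ffn
active = shared + n_layers * top_k * expert_ffn
print(f"total ~ {total / 1e9:.1f}B, active per token ~ {active / 1e9:.1f}B")
# -> total ~ 46.7B, active per token ~ 12.9B, consistent with "47B total, ~13B active"
```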

[Interactive widget: LLM Model Family Comparison — compare model families by size (parameters), performance, and efficiency.]

[Figure: LLM Timeline — timeline of major language models from GPT-1 (2018) to present.]

Quick Check

A Mixture of Experts model with 8 experts and top-2 routing has 47B total parameters. How many parameters are active per token?

47B

~12B

~6B

Historical Note: The Open-Source LLM Revolution

2023-2025

Meta's release of LLaMA (2023) catalyzed an explosion of open-source LLM development. Within months, the community produced fine-tuned variants (Alpaca, Vicuna), efficient inference frameworks (llama.cpp, vLLM), and training tools (Axolotl, TRL). This demonstrated that the key bottleneck was data and training recipes, not architecture.

Mixture of Experts (MoE)

An architecture where each token is processed by a subset of expert networks selected by a learned router, enabling large total parameter counts with lower per-token compute.

Related: GPT (Generative Pre-trained Transformer)