Model Families and Their Characteristics

Definition: Open-Weight LLM Families

Key open-weight model families (as of 2025):

| Family | Organization | Sizes | Key Features |
|---|---|---|---|
| LLaMA 3 | Meta | 8B, 70B, 405B | GQA, RoPE, 128K context |
| Mistral / Mixtral | Mistral AI | 7B, 8x7B (MoE) | Sliding-window attention, MoE |
| Gemma | Google | 2B, 7B, 27B | Multi-query attention |
| Qwen 2.5 | Alibaba | 0.5B-72B | Strong multilingual support |
| DeepSeek V3 | DeepSeek | 671B (MoE) | FP8 training, multi-token prediction |

All of these families use a decoder-only transformer; they differ mainly in the attention mechanism (multi-head vs. grouped-query vs. multi-query), positional encoding, and whether the feed-forward layers use a mixture of experts.
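These differences are visible directly in each model's configuration file. A minimal sketch using the Hugging Face transformers library; the repository IDs are assumptions, and some (e.g. the Meta repos) are gated and require accepting a license on the Hub:

```python
from transformers import AutoConfig

# Repository IDs are assumptions; some repos are gated and need Hub access.
model_ids = {
    "LLaMA 3 8B": "meta-llama/Meta-Llama-3-8B",
    "Mixtral 8x7B": "mistralai/Mixtral-8x7B-v0.1",
    "Qwen 2.5 7B": "Qwen/Qwen2.5-7B",
}

for name, repo in model_ids.items():
    cfg = AutoConfig.from_pretrained(repo)
    # num_key_value_heads < num_attention_heads indicates grouped-query attention (GQA);
    # num_local_experts (if present) indicates a mixture-of-experts FFN.
    print(
        name,
        "heads:", cfg.num_attention_heads,
        "kv_heads:", getattr(cfg, "num_key_value_heads", cfg.num_attention_heads),
        "context:", cfg.max_position_embeddings,
        "experts:", getattr(cfg, "num_local_experts", 1),
    )
```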

Definition: Mixture of Experts (MoE)

MoE replaces the dense FFN with $E$ expert networks and a learned router:

$$\text{MoE}(\mathbf{x}) = \sum_{i=1}^{K} g_i(\mathbf{x}) \cdot \text{FFN}_i(\mathbf{x})$$

where $g(\mathbf{x}) = \text{TopK}(\text{softmax}(\mathbf{W}_g \mathbf{x}))$ selects the top-$K$ experts (typically $K = 2$).

Total parameters grow as $E \times$ the FFN size, but only $K$ experts are active per token, so the per-token compute cost is roughly $K/E$ that of a dense model with the same total parameter count.
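The routing above can be written down directly. Below is a minimal PyTorch sketch of a top-K MoE feed-forward layer; names such as MoELayer are ours, and real implementations add load-balancing losses and fused expert kernels:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    """Minimal top-K mixture-of-experts FFN: each token is routed to K of E experts."""

    def __init__(self, d_model: int, d_ff: int, n_experts: int = 8, k: int = 2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts, bias=False)  # W_g
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model). Router scores -> probabilities over experts.
        probs = F.softmax(self.router(x), dim=-1)      # (tokens, E)
        gate, idx = probs.topk(self.k, dim=-1)         # keep the top-K experts per token
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e               # tokens whose slot-th choice is expert e
                if mask.any():
                    out[mask] += gate[mask, slot, None] * expert(x[mask])
        return out

moe = MoELayer(d_model=64, d_ff=256)
y = moe(torch.randn(10, 64))  # 10 tokens, only 2 of 8 experts run per token
```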

Mixtral 8x7B has 47B total parameters but only ~13B active per token, achieving performance comparable to LLaMA 2 70B.
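Those figures are easy to sanity-check. The sketch below uses approximate Mixtral 8x7B dimensions (hidden size 4096, FFN size 14336, 32 layers, 8 KV heads of dimension 128, as in its published config) and ignores layer norms and biases:

```python
# Back-of-the-envelope parameter count for Mixtral 8x7B (approximate dimensions;
# layer norms and biases are ignored).
d_model, d_ff, n_layers = 4096, 14336, 32
n_experts, top_k = 8, 2
vocab, kv_dim = 32_000, 1024                              # 8 KV heads x head_dim 128 (GQA)

expert_ffn = 3 * d_model * d_ff                           # gate/up/down projections per expert
attn = 2 * d_model * d_model + 2 * d_model * kv_dim       # Q/O full size, K/V reduced by GQA
shared = n_layers * attn + 2 * vocab * d_model            # attention + input/output embeddings

total = shared + n_layers * n_experts * expert_ffn
active = shared + n_layers * top_k * expert_ffn
print(f"total ~ {total / 1e9:.1f}B, active per token ~ {active / 1e9:.1f}B")
# -> total ~ 46.7B, active per token ~ 12.9B, consistent with "47B total, ~13B active"
```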

[Interactive widget: LLM Model Family Comparison — compare model families by size (parameters), performance, and efficiency.]

[Figure: LLM Timeline — timeline of major language models from GPT-1 (2018) to present.]

Quick Check

A Mixture of Experts model with 8 experts and top-2 routing has 47B total parameters. How many parameters are active per token?

47B

~12B

~6B

Historical Note: The Open-Source LLM Revolution

2023-2025

Meta's release of LLaMA (2023) catalyzed an explosion of open-source LLM development. Within months, the community produced fine-tuned variants (Alpaca, Vicuna), efficient inference frameworks (llama.cpp, vLLM), and training tools (Axolotl, TRL). This demonstrated that the key bottleneck was data and training recipes, not architecture.

Mixture of Experts (MoE)

An architecture where each token is processed by a subset of expert networks selected by a learned router, enabling large total parameter counts with lower per-token compute.

Related: GPT (Generative Pre-trained Transformer)