Model Families and Their Characteristics
Definition: Open-Weight LLM Families
Key open-weight model families (as of 2025):
| Family | Organization | Sizes | Key Features |
|---|---|---|---|
| LLaMA 3 | Meta | 8B, 70B, 405B | GQA, RoPE, 128K context |
| Mistral/Mixtral | Mistral AI | 7B, 8x7B MoE | Sliding window, MoE |
| Gemma | Google | 2B, 7B, 27B | Multi-query attention |
| Qwen 2.5 | Alibaba | 0.5B-72B | Strong multilingual |
| DeepSeek V3 | DeepSeek | 671B MoE | FP8, multi-token prediction |
All of these families use a decoder-only transformer architecture, differing mainly in attention mechanism, positional encoding, and whether they use mixture-of-experts layers.
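To make these axes of variation concrete, the sketch below models them as a small configuration object. The class and field names are illustrative only (they are not taken from any library), and the example values follow the table above.

```python
from dataclasses import dataclass

@dataclass
class DecoderConfig:
    """Hypothetical config capturing the main ways open-weight families differ."""
    attention: str            # e.g. "gqa" (LLaMA 3), "sliding_window" (Mistral), "mqa" (Gemma)
    positional_encoding: str  # e.g. "rope", with family-specific scaling for long context
    n_experts: int = 0        # 0 = dense FFN; 8 for Mixtral 8x7B
    top_k: int = 0            # experts active per token (2 for Mixtral 8x7B)
    context_length: int = 8192

# Illustrative instances based on the table above
llama3_like = DecoderConfig(attention="gqa", positional_encoding="rope",
                            context_length=131072)
mixtral_like = DecoderConfig(attention="sliding_window", positional_encoding="rope",
                             n_experts=8, top_k=2, context_length=32768)
```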
Definition: Mixture of Experts (MoE)
MoE replaces the dense FFN with $N$ expert networks $E_1, \dots, E_N$ and a learned router $g$:

$$\mathrm{MoE}(x) = \sum_{i \in \mathcal{T}(x)} g_i(x)\, E_i(x), \qquad \mathcal{T}(x) = \operatorname{TopK}\big(g(x), k\big)$$

where $\mathcal{T}(x)$ selects the top-$k$ experts for each token (typically $k = 2$).

Total parameters grow with $N \times$ the FFN size, but only $k$ experts are active per token, so per-token compute cost is roughly $k/N$ of the expert parameters plus the shared (attention and embedding) parameters, far less than an equally sized dense model.
Mixtral 8x7B has 47B total parameters but only ~13B active per token, achieving performance comparable to LLaMA 2 70B.
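A minimal sketch of top-$k$ expert routing is shown below, written as a PyTorch-style module. The class name, shapes, and per-expert loop are illustrative, not Mixtral's actual implementation (which uses fused, batched expert dispatch), but the routing logic is the same: each token runs through only `top_k` of the expert FFNs.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    """Minimal top-k mixture-of-experts FFN (illustrative sketch)."""

    def __init__(self, d_model: int, d_ff: int, n_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model) -- batch and sequence dims flattened for simplicity
        logits = self.router(x)                                # (tokens, n_experts)
        weights, idx = torch.topk(logits, self.top_k, dim=-1)  # pick top-k experts per token
        weights = F.softmax(weights, dim=-1)                   # renormalize over the chosen k
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e                       # tokens routed to expert e in this slot
                if mask.any():
                    w = weights[mask, slot].unsqueeze(-1)      # routing weight for those tokens
                    out[mask] += w * expert(x[mask])
        return out

# Usage: only 2 of 8 expert FFNs run for each token
layer = MoELayer(d_model=64, d_ff=256)
tokens = torch.randn(10, 64)
print(layer(tokens).shape)  # torch.Size([10, 64])
```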
[Interactive visualization: LLM Model Family Comparison — compare model families by parameter count, performance, and efficiency]

[Interactive visualization: LLM Timeline]
Quick Check
A Mixture of Experts model with 8 experts and top-2 routing has 47B total parameters. How many parameters are active per token?
47B
~12B
~6B
With top-2 routing, 2/8 = 25% of expert parameters plus shared parameters are active, giving roughly 12-13B active parameters.
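A rough back-of-the-envelope check of this answer, assuming a hypothetical split of about 1.5B shared parameters (attention, embeddings) with the remainder divided evenly across 8 expert FFNs:

```python
# Back-of-the-envelope for the quiz answer; the shared-parameter figure is
# an assumption for illustration, not an official breakdown.
total = 47e9
shared = 1.5e9                      # assumed shared params (attention, embeddings)
per_expert = (total - shared) / 8   # ~5.7B per expert FFN
active = shared + 2 * per_expert    # top-2 routing: 2 experts run per token
print(f"~{active / 1e9:.1f}B active per token")  # ~12.9B
```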
Historical Note: The Open-Source LLM Revolution
2023-2025: Meta's release of LLaMA (2023) catalyzed an explosion of open-source LLM development. Within months, the community produced fine-tuned variants (Alpaca, Vicuna), efficient inference frameworks (llama.cpp, vLLM), and training tools (Axolotl, TRL). This demonstrated that the key bottleneck was data and training recipes, not architecture.
Mixture of Experts (MoE)
An architecture where each token is processed by a subset of expert networks selected by a learned router, enabling large total parameter counts with lower per-token compute.