Multimodal Models

Definition: Models that accept and jointly process inputs from more than one modality (e.g., text and images), producing outputs conditioned on all of them.

Vision-Language Models

Multimodal models process both images and text. The typical architecture uses a vision encoder (e.g., CLIP ViT) whose output is projected into the LLM's embedding space:

\mathbf{h}_\text{img} = \text{Proj}(\text{ViT}(\text{image}))

The projected image tokens are concatenated with text tokens and processed by the LLM decoder.
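A minimal PyTorch sketch of this projection-and-concatenation step is below. The dimensions, the single linear projector, and the random stand-in tensors are illustrative assumptions; real systems such as LLaVA typically use a small MLP for Proj(·) and real ViT/text embeddings.

```python
import torch
import torch.nn as nn

# Illustrative dimensions (assumptions): ViT patch embeddings of width 1024,
# LLM hidden size 4096, 256 image patches, 32 text tokens.
vit_dim, llm_dim = 1024, 4096

# Projection from the vision encoder's output space into the LLM embedding space.
# A single linear layer is the simplest instance of Proj(.) in the equation above.
proj = nn.Linear(vit_dim, llm_dim)

vit_patches = torch.randn(1, 256, vit_dim)   # stand-in for ViT(image) patch embeddings
h_img = proj(vit_patches)                    # (1, 256, llm_dim): projected image tokens

text_embeds = torch.randn(1, 32, llm_dim)    # stand-in for embedded text tokens

# Concatenate image tokens and text tokens into one sequence for the LLM decoder.
llm_input = torch.cat([h_img, text_embeds], dim=1)  # (1, 288, llm_dim)
```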

Example: Using a Vision-Language Model

Use a VLM to analyze a constellation diagram image.
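A sketch of this workflow using the Hugging Face transformers LLaVA interface is below. The checkpoint name, image path, and prompt wording are assumptions; any chat-style VLM with an image input would work similarly.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"  # assumed checkpoint
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# Load the constellation diagram and build a prompt in LLaVA's chat format.
image = Image.open("constellation.png")  # assumed local image path
prompt = (
    "USER: <image>\n"
    "This is a constellation diagram of a received signal. "
    "Which modulation scheme does it most likely show, and how noisy is it?\n"
    "ASSISTANT:"
)

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```

The image is encoded by the model's vision tower, projected into the LLM's embedding space, and inserted at the position of the <image> placeholder before decoding, mirroring the architecture described above.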