Multimodal Models
Definition: Vision-Language Models
Multimodal models process both images and text. The typical architecture uses a vision encoder (e.g., CLIP ViT) whose output features are projected, usually via a learned linear layer or small MLP, into the LLM's embedding space:
\mathbf{h}_{\text{img}} = \text{Proj}(\text{ViT}(\text{image}))
The projected image tokens are concatenated with the text token embeddings, and the combined sequence is processed by the LLM decoder.
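This projection-and-concatenation step can be sketched in a few lines of PyTorch. The dimensions, patch count, and random tensors below are illustrative placeholders, not values from any particular model:

import torch
import torch.nn as nn

d_vision, d_model = 1024, 4096        # ViT feature dim, LLM embedding dim (illustrative)
proj = nn.Linear(d_vision, d_model)   # learned projection into the LLM's embedding space

vit_features = torch.randn(1, 256, d_vision)   # [batch, image patches, d_vision]
h_img = proj(vit_features)                     # [batch, 256, d_model]

text_embeds = torch.randn(1, 32, d_model)      # [batch, text tokens, d_model]
# Image tokens are concatenated with the text embeddings and fed to the decoder
inputs_embeds = torch.cat([h_img, text_embeds], dim=1)  # [batch, 288, d_model]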
Example: Using a Vision-Language Model
Use a VLM to analyze a constellation diagram image.
Solution
API Call
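A minimal runnable version, assuming the diagram is saved locally as constellation.png (a placeholder path) and that ANTHROPIC_API_KEY is set in the environment: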
import base64
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Base64-encode the constellation diagram (placeholder path)
with open("constellation.png", "rb") as f:
    img_b64 = base64.standard_b64encode(f.read()).decode("utf-8")

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,  # required by the Messages API
    messages=[{
        "role": "user",
        "content": [
            {"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": img_b64}},
            {"type": "text", "text": "Identify the modulation scheme in this constellation diagram."},
        ],
    }],
)
print(response.content[0].text)
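The model's reply text is available as response.content[0].text. Note that base64 image sources require a media_type field (here image/png), and max_tokens is a required argument of the Messages API.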