Multimodal Models

Definition: Models that accept and jointly process inputs from more than one modality (e.g., text and images), producing outputs conditioned on all of them.

Vision-Language Models

Multimodal models process both images and text. The typical architecture uses a vision encoder (e.g., CLIP ViT) whose output is projected into the LLM's embedding space:

\mathbf{h}_\text{img} = \text{Proj}(\text{ViT}(\text{image}))

The projected image tokens are concatenated with text tokens and processed by the LLM decoder.
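A minimal PyTorch sketch of this projection-and-concatenation step is below. The dimensions, the single linear projector, and the random stand-in tensors are illustrative assumptions; real systems such as LLaVA typically use a small MLP for Proj(·) and real ViT/text embeddings.

```python
import torch
import torch.nn as nn

# Illustrative dimensions (assumptions): ViT patch embeddings of width 1024,
# LLM hidden size 4096, 256 image patches, 32 text tokens.
vit_dim, llm_dim = 1024, 4096

# Projection from the vision encoder's output space into the LLM embedding space.
# A single linear layer is the simplest instance of Proj(.) in the equation above.
proj = nn.Linear(vit_dim, llm_dim)

vit_patches = torch.randn(1, 256, vit_dim)   # stand-in for ViT(image) patch embeddings
h_img = proj(vit_patches)                    # (1, 256, llm_dim): projected image tokens

text_embeds = torch.randn(1, 32, llm_dim)    # stand-in for embedded text tokens

# Concatenate image tokens and text tokens into one sequence for the LLM decoder.
llm_input = torch.cat([h_img, text_embeds], dim=1)  # (1, 288, llm_dim)
```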

Example: Using a Vision-Language Model

Use a VLM to analyze a constellation diagram image.
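A sketch of this workflow using the Hugging Face transformers LLaVA interface is below. The checkpoint name, image path, and prompt wording are assumptions; any chat-style VLM with an image input would work similarly.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"  # assumed checkpoint
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# Load the constellation diagram and build a prompt in LLaVA's chat format.
image = Image.open("constellation.png")  # assumed local image path
prompt = (
    "USER: <image>\n"
    "This is a constellation diagram of a received signal. "
    "Which modulation scheme does it most likely show, and how noisy is it?\n"
    "ASSISTANT:"
)

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```

The image is encoded by the model's vision tower, projected into the LLM's embedding space, and inserted at the position of the <image> placeholder before decoding, mirroring the architecture described above.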