Tutorial Thu, Jun 4, 2026 • PM 201

The Road to Convergence: Evolution of Unified Multimodal Models

Jindong Wang · Hao Chen · Jiakui Hu · Zhaolong Su · Sharon Li

Abstract

Unified multimodal models are emerging as a new paradigm that integrates understanding and generation across modalities within a single foundation model. This tutorial provides a comprehensive overview of these models, addressing the currently fragmented landscape of architectures, representations, and training strategies. It introduces a unified perspective on key design choices, including modeling paradigms, multimodal tokenization, and alignment methods, while reviewing benchmarks and real-world applications. The tutorial further highlights open challenges such as scalable representation learning and unified world modeling, offering a structured roadmap for future research in multimodal AI.