From Perception to Simulation: The Emergence of World Models in Multi-modal Reasoning
Abstract
World models are emerging as a new paradigm in computer vision and multi-modal learning, enabling systems to move beyond perception toward reasoning, simulation, and decision-making. This tutorial explores how world models have evolved from predictive frameworks into engines for multi-modal reasoning that can simulate environments, support counterfactual thinking, and enable planning. It examines key approaches for learning world dynamics from visual data, including discrete tokenization and diffusion-based methods, and highlights their role in modeling physical and causal structure. The tutorial then covers how these models support reasoning through simulation and how they are applied in embodied agents and robotics, and it discusses open challenges such as grounding, scalability, and causal understanding.