Tutorial Wed, Jun 3, 2026 • PM Mile High 3C

Principled Interpretability in Vision Models: From Mechanistic Understanding to Interpretable Models by Design

Tsui-Wei (Lily) Weng · Tuomas Oikarinen

Abstract

As deep learning systems are increasingly deployed in high-stakes applications, understanding their internal behavior is essential for ensuring trust, safety, and reliability. However, the field of interpretability remains fragmented: it spans diverse methods but lacks a unified framework or standardized evaluation. This tutorial provides a comprehensive overview of interpretability in vision models, bridging post-hoc mechanistic analysis with approaches that design inherently interpretable models. It reviews techniques for analyzing neural networks at multiple levels, from individual neurons to circuits, alongside recent advances in evaluating the faithfulness of explanations. The tutorial also covers emerging methods for learning interpretable models by design, such as concept-based approaches, and highlights practical applications in debugging, model editing, and safety auditing.