Principled Interpretability in Vision Models: From Mechanistic Understanding to Interpretable Models by Design
Abstract
As deep learning systems are increasingly deployed in high-stakes applications, understanding their internal behavior is essential for ensuring trust, safety, and reliability. However, the field of interpretability remains fragmented, spanning diverse methods without a unified framework or standardized evaluation. This tutorial provides a comprehensive overview of interpretability in vision models, bridging post-hoc mechanistic analysis with the design of inherently interpretable models. It reviews techniques for analyzing neural networks at multiple levels, from individual neurons to circuits, alongside recent advances in evaluating the faithfulness of explanations. The tutorial also covers emerging methods for learning interpretable models by design, such as concept-based approaches, and highlights practical applications in debugging, model editing, and safety auditing.