Large Transformer models have shown promising performance across a wide spectrum of AI and computer vision (CV) applications. These results have stimulated a recent surge of extremely large models. However, training these models generally requires more computation and longer training time. This has generated interest in both academia and industry in scaling up deep learning (DL) through distributed training on high-performance computing (HPC) resources such as TPU and GPU clusters.
Simply adding more devices, however, does not scale training as intended, because training at large scale requires overcoming both algorithmic and systems-related challenges. This limitation prevents DL and CV researchers from exploring more advanced model architectures.
Many existing works investigate and develop optimization techniques that overcome these problems and accelerate large-model training at larger scale. We categorize these works as improving either model accuracy or model efficiency. One way to maintain or improve model accuracy in a large-scale setting, while preserving computing efficiency, is to design algorithms that require less communication and memory. Notably, these goals are not mutually exclusive and can be optimized together to further accelerate training. This tutorial helps members of the CV community quickly master optimizations for large-scale DL training and successfully train large models at scale with different optimization techniques in a distributed environment.