EVENT DETAILS
Wednesday / CS Seminar
February 15th / 12:00 PM
Mudd 3514
Title: Democratizing Large-Scale AI Model Training via Heterogeneous Memory
Speaker: Dong Li
Abstract:
The size of large artificial intelligence (AI) models has increased by about 200x over the past three years. To train models with billions or even trillions of parameters, memory capacity becomes a major bottleneck, leading to a range of functional and performance issues. The memory capacity problem grows worse with increases in batch size, data modality, and training pipeline size and complexity. Recent advances in heterogeneous memory (HM) provide a cost-effective approach to increasing memory capacity. Using CPU memory as an extension of GPU memory, we can build an HM system that enables large-scale AI model training without using extra GPUs to accommodate the large memory consumption. However, not only does HM impose challenges on tensor allocation and migration within HM itself, it is also unclear how HM affects training throughput. AI model training has unique memory access patterns and data structures, which pose challenges for the promptness of data migration, load balancing, and tensor redundancy on the GPU. In this talk, we present our recent work on using HM to enable large-scale AI model training. We identify the major memory capacity bottleneck in tensors and minimize GPU memory usage through co-offloading of computation and tensors from the GPU. We also use analytical performance modeling to guide tensor migration between memory components in HM, in order to minimize migration volume and reduce load imbalance between batches. We show that using HM we can train industry-quality transformer models with over 13 billion parameters on a single GPU, a 10x increase in size compared with popular frameworks such as PyTorch, and we do so without requiring any model changes from data scientists or sacrificing computational efficiency. Our work has been integrated into Microsoft DeepSpeed and is employed in industry to democratize large-scale AI models. We also show that using HM we can enable large-scale GNN training on billion-scale graphs without losing accuracy or running out of memory (OOM).
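To make the offloading idea concrete, below is a minimal, hypothetical sketch (not the speaker's code) of using CPU memory as an extension of GPU memory through DeepSpeed's ZeRO offloading, the open-source framework the abstract notes this work was integrated into. The toy model, batch size, and learning rate are illustrative placeholders only.

import torch
import deepspeed

# Toy model standing in for a billion-parameter transformer.
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096),
    torch.nn.GELU(),
    torch.nn.Linear(4096, 1024),
)

# ZeRO Stage 3 with parameters and optimizer states offloaded to CPU (host)
# memory, forming a heterogeneous-memory setup on a single GPU.
ds_config = {
    "train_batch_size": 8,
    "fp16": {"enabled": True},
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "offload_param": {"device": "cpu", "pin_memory": True},
    },
}

# DeepSpeed wraps the model; tensors migrate between CPU and GPU memory
# during training as needed.
engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)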
Biography:
Dong Li is an associate professor in EECS at the University of California, Merced. Previously, he was a research scientist at Oak Ridge National Laboratory (ORNL). Dong earned his PhD in computer science from Virginia Tech. His research focuses on high performance computing (HPC) and maintains a strong relevance to computer systems. The core theme of his research is how to enable scalable and efficient execution of enterprise and scientific applications (including large-scale AI models) on increasingly complex parallel systems. Dong received an ORNL/CSMD Distinguished Contributor Award in 2013, a CAREER Award from the National Science Foundation in 2016, a Berkeley Lab University Faculty Fellowship in 2016, a Facebook research award in 2021, and an Oracle research award in 2022. His SC'14 paper was nominated for the best student paper award, and his ASPLOS'21 paper won the Distinguished Artifact Award. He was also the lead PI for the NVIDIA CUDA Research Center at UC Merced. He is an associate editor for IEEE Transactions on Parallel and Distributed Systems (TPDS).
TIME Wednesday February 15, 2023 at 12:00 PM - 1:00 PM
LOCATION 3514, Mudd Hall (formerly Seeley G. Mudd Library)
CONTACT Wynante R Charles wynante.charles@northwestern.edu
CALENDAR Department of Computer Science (CS)