Training large deep learning models at scale is highly challenging. This paper proposes Chimera, a novel pipeline parallelism scheme that combines bidirectional pipelines for efficiently training large-scale models. Chimera is a synchronous approach and therefore incurs no loss of accuracy, which makes it more convergence-friendly than asynchronous approaches. Compared with the latest synchronous pipeline approach, Chimera reduces the number of bubbles by up to 50%; benefiting from the sophisticated scheduling of bidirectional pipelines, Chimera also has a more balanced activation memory consumption. Evaluations are conducted on Transformer-based language models. For a GPT-2 model with 1.3 billion parameters running on 2,048 GPU nodes of the Piz Daint supercomputer...
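To make the bidirectional idea concrete, below is a minimal sketch (in Python, with illustrative names that are not taken from the paper or its code) of the stage-to-worker mapping such a scheme implies: the model is split into D stages, and each worker hosts one stage of the "down" pipeline together with the mirrored stage of the "up" pipeline, so the two pipelines traverse the workers in opposite directions and the micro-batches of one direction can fill bubbles left by the other.

def bidirectional_stage_mapping(num_stages):
    # Hypothetical helper, not the authors' implementation: returns
    # {worker: (down_stage, up_stage)} for a model split into num_stages
    # pipeline stages, one worker per stage. Worker w hosts stage w of the
    # down pipeline and the mirrored stage num_stages - 1 - w of the up
    # pipeline.
    return {w: (w, num_stages - 1 - w) for w in range(num_stages)}

if __name__ == "__main__":
    D = 4  # pipeline depth, e.g. 4 stages on 4 workers
    for worker, (down, up) in bidirectional_stage_mapping(D).items():
        print(f"worker {worker}: down-pipeline stage {down}, "
              f"up-pipeline stage {up}")

Under this mapping each worker keeps the weights of two stages, and an early stage of one pipeline (which holds many in-flight activations) is co-located with a late stage of the other (which holds few), which is the intuition behind the more balanced activation memory consumption mentioned in the abstract.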
In recent years, machine learning (ML) and, more noticeably, deep learning (DL), have become incre...
Long training times and non-ideal performance have been a major impediment to further continuing the u...
Neural Networks (NNs) are getting deeper and more complicated to the point where single accelerator ...
The scaling up of deep neural networks has been demonstrated to be effective in improving model qual...
Accelerating and scaling the training of deep neural networks (DNNs) is critical to keep up with gro...
Deep Neural Network (DNN) frameworks use distributed training to enable faster time to convergence a...
Alpa automates model-parallel training of large deep learning (DL) models by generating execution pl...
The Transformer architecture has improved the performance of deep learning models in domains such as...
Pipeline parallelism enables efficient training of Large Language Models (LLMs) on large-scale distr...
Transformer models have achieved state-of-the-art performance on various domains of applications and...
Deep neural networks have gained popularity in recent years, obtaining outstanding results in a wide...
Deep learning models are trained on servers with many GPUs, and training must scale with the number o...
I present a new way to parallelize the training of convolutional neural networks across multiple GPU...
Thesis (Master's), University of Washington, 2018. The recent success of Deep Neural Networks (DNNs) [...
With renewed global interest in Artificial Intelligence (AI) methods, the past decade ...