The Transformer architecture has improved the performance of deep learning models in domains such as Computer Vision and Natural Language Processing. Better performance, however, comes with larger model sizes, which run up against the memory wall of current accelerator hardware such as GPUs. Training large models such as Vision Transformer, BERT, and GPT on a single GPU or even a single machine is impractical, creating an urgent demand for training in distributed environments. However, distributed training, and model parallelism in particular, often requires domain expertise in computer systems and architecture, so implementing complex distributed training solutions remains a challenge for AI researchers. In this paper,...
The field of deep learning has been the focus of plenty of research and development over the last y...
As recent research demonstrates, the trend in model size across deep learning has rapidly increased,...
Thesis (Master's)--University of Washington, 2018. The recent success of Deep Neural Networks (DNNs) [...
In recent years, the number of parameters of one deep learning (DL) model has been growing much fast...
The scaling up of deep neural networks has been demonstrated to be effective in improving model qual...
This thesis is done as part of a service development task of distributed deep learning on the CSC pr...
Alpa automates model-parallel training of large deep learning (DL) models by generating execution pl...
Deep Neural Network (DNN) frameworks use distributed training to enable faster time to convergence a...
Transformer models have achieved state-of-the-art performance on various domains of applications and...
Scaling up model depth and size is now a common approach to raise accuracy in many deep learning (DL...
In this course, we will cover machine learning and deep learning and how to achieve scaling to high ...
Deep learning algorithms base their success on building high learning capacity models with millions ...
Accelerating and scaling the training of deep neural networks (DNNs) is critical to keep up with gro...
Deep neural networks have gained popularity in recent years, obtaining outstanding results in a wide...
Neural networks are becoming more and more popular in the scientific field and in industry. It is mo...