Continuously increasing data volumes from multiple sources, such as simulation and experimental measurements, demand efficient algorithms for analysis within a realistic timeframe. Deep learning models have proven capable of understanding and analyzing large quantities of data with high accuracy. However, training them on massive datasets remains a challenge and requires distributed learning that exploits High-Performance Computing systems. This study presents a comprehensive analysis and comparison of three well-established distributed deep learning frameworks - Horovod, DeepSpeed, and Distributed Data Parallel by PyTorch - with a focus on their runtime performance and scalability. Additionally, the performance of two data loaders, t...
Neural networks become more difficult and take longer to train as their depth increases. As deep neur...
Deep Learning applications are pervasive today, and efficient strategies are designed to reduce the...
In this paper, we analyze heterogeneous performance exhibited by some popular deep learning software...
With renewed global interest in Artificial Intelligence (AI) methods, the past decade ...
2016 became the year of the Artificial Intelligence explosion. AI technologies are getting more ...
Deep Learning frameworks, such as TensorFlow, MXNet, and Chainer, provide many basic building blocks for...
Deep learning has been a very popular topic in the Artificial Intelligence industry in recent years and can b...
This thesis was done as part of a service development task for distributed deep learning on the CSC pr...
The rapid growth of data and the ever-increasing model complexity of deep neural networks (DNNs) have en...
Deep learning algorithms base their success on building high learning capacity models with millions ...
Training deep learning (DL) models is a highly compute-intensive task since it involves operating on...
The aim of this project is to conduct a study of deep learning on multi-core processors. The study i...
Neural networks are becoming increasingly popular in the scientific field and in industry. It is mo...
Deep neural networks have gained popularity in recent years, obtaining outstanding results in a wide...