We find that, at sequence length 512, padding tokens represent in excess of 50% of the Wikipedia dataset used for pretraining BERT (Bidirectional Encoder Representations from Transformers). Therefore, by removing all padding we achieve a 2x speed-up in terms of sequences/sec. To exploit this characteristic of the dataset, we develop and contrast two deterministic packing algorithms. Both algorithms rely on the assumption that sequences are interchangeable, and therefore packing can be performed on the histogram of sequence lengths rather than per sample. This transformation of the problem leads to algorithms that are fast and have linear complexity in dataset size. The shortest-pack-first histogram-packing (SPFHP) algorithm determines the pac...
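The abstract above only names the idea of packing on a histogram of sequence lengths; the short Python sketch below illustrates a greedy "shortest pack first" strategy in that spirit. It is an assumption-laden illustration, not the authors' SPFHP implementation: the function name pack_histogram, the per-sequence loop, and the 512-token limit are ours for demonstration, and the published algorithm imposes additional constraints (such as a bound on how many sequences share a pack).

def pack_histogram(histogram, max_len=512):
    # Greedy "shortest pack first" packing over a histogram of sequence lengths.
    # histogram: dict mapping sequence length -> number of sequences of that length.
    # Returns a list of packs; each pack is a list of lengths summing to <= max_len.
    # Sketch only; not the SPFHP algorithm as published.
    open_packs = []  # each entry: [used_length, list_of_lengths]
    for length in sorted(histogram, reverse=True):  # place longest sequences first
        for _ in range(histogram[length]):
            # Prefer the least-filled open pack that can still take this sequence.
            candidates = [p for p in open_packs if p[0] + length <= max_len]
            if candidates:
                best = min(candidates, key=lambda p: p[0])
                best[0] += length
                best[1].append(length)
            else:
                open_packs.append([length, [length]])
    return [lengths for _, lengths in open_packs]

# Example: 14 sequences of mixed lengths end up in 6 packed rows of at most
# 512 tokens instead of 14 padded rows.
example = {512: 2, 384: 2, 128: 4, 60: 6}
print(pack_histogram(example, max_len=512))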
Fine-tuning BERT-based models is resource-intensive in memory, computation, and time. While many pri...
Large Language Models have become the core architecture upon which most modern natural language proc...
Recent advances with large-scale pre-trained language models (e.g., BERT) have brought significant p...
Transformer-based language models have become a key building block for natural language processing. ...
Pre-trained language models of the BERT family have defined the state of the art in a wide range of...
Recently, the development of pre-trained language models has brought natural language processing (NL...
Neural networks are powerful tools for decision making and for solving complex problems in r...
When processing a batch of graphs in machine learning models such as Graph Neural Networks (GNN), it...
The availability of large and rich quantities of text data is due to the emergence of the World Wide...
We combine the capacity of sparsely gated Mixture-of-Experts (MoE) with the speed and stability of l...
Since the first bidirectional deep learning model for natural language understanding, BERT, emerge...
As the demand for sophisticated Natural Language Processing (NLP) models continues to grow, so does ...
GPT-2 and BERT demonstrate the effectiveness of using pre-trained language models (LMs) on various n...
In the natural language processing (NLP) literature, neural networks are becoming increasingly deepe...