Large pretrained language models (PreLMs) are revolutionizing natural language processing across all benchmarks. However, their sheer size is prohibitive for small laboratories or for deployment on mobile devices. Approaches like pruning and distillation reduce the model size but typically retain the same model architecture. In contrast, we explore distilling PreLMs into a different, more efficient architecture, Continual Multiplication of Words (CMOW), which embeds each word as a matrix and uses matrix multiplication to encode sequences. We extend the CMOW architecture and its CMOW/CBOW-Hybrid variant with a bidirectional component for more expressive power, per-token representations for a general (task-agnostic) distillation during pretr...
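To make the matrix-based encoding concrete, below is a minimal sketch of the CMOW idea described in this abstract, together with one possible bidirectional variant. All names, dimensions, and the near-identity initialization are illustrative assumptions, not the authors' implementation.

```python
# Toy sketch of CMOW-style sequence encoding (assumptions, not the paper's code):
# each word id maps to a small d x d matrix, and a sequence is encoded by
# multiplying those matrices left to right.
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d = 1000, 4  # toy sizes (assumption)

# Word-as-matrix embedding table, initialized near the identity so the
# running product neither explodes nor vanishes early on (assumption).
E = np.tile(np.eye(d), (vocab_size, 1, 1)) + 0.1 * rng.standard_normal((vocab_size, d, d))

def cmow_encode(token_ids, bidirectional=False):
    """Encode a token-id sequence by continual matrix multiplication."""
    fwd = np.eye(d)
    for t in token_ids:
        fwd = fwd @ E[t]          # order-sensitive: matrix product is non-commutative
    if not bidirectional:
        return fwd.flatten()
    bwd = np.eye(d)
    for t in reversed(token_ids):
        bwd = bwd @ E[t]          # same table, reversed reading direction
    return np.concatenate([fwd.flatten(), bwd.flatten()])

print(cmow_encode([5, 42, 7], bidirectional=True).shape)  # (32,) = 2 * d * d
```

Because matrix multiplication is non-commutative, such an encoder is sensitive to word order, which is what distinguishes it from the additive CBOW component of the hybrid variant.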
Mixture of Experts layers (MoEs) enable efficient scaling of language models through conditional com...
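As a rough illustration of the conditional computation mentioned here: each token is routed to one expert, so only that expert's weights are applied to it. The routing scheme, sizes, and names below are assumptions for illustration only, not any specific library's API.

```python
# Toy top-1 mixture-of-experts layer (illustrative assumptions only).
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts = 8, 4
experts = [rng.standard_normal((d_model, d_model)) for _ in range(n_experts)]
router = rng.standard_normal((d_model, n_experts))

def moe_layer(x):
    """Apply one expert per token: compute grows with tokens, not with n_experts."""
    choice = (x @ router).argmax(axis=-1)     # top-1 expert index per token
    out = np.empty_like(x)
    for e in range(n_experts):
        mask = choice == e
        if mask.any():
            out[mask] = x[mask] @ experts[e]  # only the selected expert runs
    return out

tokens = rng.standard_normal((5, d_model))
print(moe_layer(tokens).shape)  # (5, 8)
```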
These improvements open many possibilities for solving downstream Natural Language Processing tasks. ...
Scaling language models with more data, compute and parameters has driven significant progress in na...
Continuous Bag of Words (CBOW) is a powerful text embedding method. Due to its strong capabilities t...
In the natural language processing (NLP) literature, neural networks are becoming increasingly deepe...
Continuous Bag of Words (CBOW) is a powerful text embedding method. Due to its strong capabilities t...
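For contrast with the matrix-based encoders discussed in the other entries, here is a minimal sketch of CBOW-style text embedding: word vectors are simply averaged into an order-invariant representation. Table size and dimensionality are arbitrary assumptions.

```python
# Toy CBOW-style text embedding: the mean of the word vectors (assumed sizes).
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 1000, 16
W = rng.standard_normal((vocab_size, dim))   # word embedding table

def cbow_encode(token_ids):
    """Order-invariant sentence embedding: averaging ignores word order entirely."""
    return W[token_ids].mean(axis=0)

print(cbow_encode([5, 42, 7]).shape)  # (16,)
```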
We propose a new neural model for word embeddings, which uses Unitary Matrices as the primary device...
Real-world business applications require a trade-off between language model performance and size. We...
Large language models have become a vital component in modern NLP, achieving state-of-the-art perfor...
Pre-trained language models (PLMs) have demonstrated impressive performance across various downstrea...
Since the first bidirectional deep learning model for natural language understanding, BERT, emerge...
Pretraining multilingual language models from scratch requires considerable computational resources ...
To obtain high-quality sentence embeddings from pretrained language models (PLMs), they must either ...
Large language models (LLMs) based on transformers have made significant strides in recent years, th...
LLMs, or Large Language Models, are machine learning models used to understand and genera...