Parameter-efficient tuning has been a prominent approach to adapting large language models to downstream tasks. Most previous work considers adding dense trainable parameters, where all parameters are used to adapt a given task. Using LoRA as an example, we find empirically that this is less effective: introducing more trainable parameters does not help. Motivated by this, we investigate the importance of leveraging "sparse" computation and propose SiRA: sparse mixture of low-rank adaptation. SiRA leverages the Sparse Mixture of Experts (SMoE) to boost the performance of LoRA. Specifically, it enforces top-$k$ expert routing with a capacity limit restricting the maximum number of tokens each expert can process. We propose a novel and simp...
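As a rough illustration of the mechanism described above (not the authors' implementation), the sketch below wires a set of LoRA-style low-rank adapters behind a top-k gate with a per-expert capacity limit. The module name (SparseLoRAMoE), the hyperparameters (rank, num_experts, top_k, capacity), and the simple first-come-first-served capacity rule are all illustrative assumptions.

```python
# Minimal sketch, assuming a LoRA-style delta computed by a sparse mixture of
# low-rank experts with top-k routing and a per-expert token capacity limit.
# Names and hyperparameters are hypothetical, not taken from the paper.

import torch
import torch.nn as nn
import torch.nn.functional as F


class SparseLoRAMoE(nn.Module):
    def __init__(self, d_in, d_out, rank=4, num_experts=8, top_k=2, capacity=64):
        super().__init__()
        self.top_k = top_k
        self.capacity = capacity  # max number of tokens each expert may process
        self.gate = nn.Linear(d_in, num_experts, bias=False)
        # Each expert is a low-rank (LoRA-style) adapter: x @ down[e] @ up[e].
        self.down = nn.Parameter(torch.randn(num_experts, d_in, rank) * 0.01)
        self.up = nn.Parameter(torch.zeros(num_experts, rank, d_out))

    def forward(self, x):  # x: (num_tokens, d_in)
        logits = self.gate(x)                                # (T, E)
        weights, experts = logits.topk(self.top_k, dim=-1)   # (T, k)
        weights = F.softmax(weights, dim=-1)

        out = torch.zeros(x.size(0), self.up.size(-1), device=x.device)
        for e in range(self.down.size(0)):
            # Tokens that routed to expert e in any of their top-k slots.
            token_idx, slot_idx = (experts == e).nonzero(as_tuple=True)
            # Capacity limit: overflow tokens are simply dropped for this expert.
            token_idx = token_idx[: self.capacity]
            slot_idx = slot_idx[: self.capacity]
            if token_idx.numel() == 0:
                continue
            h = x[token_idx] @ self.down[e] @ self.up[e]      # low-rank update
            out[token_idx] += weights[token_idx, slot_idx].unsqueeze(-1) * h
        return out


# Usage: the returned delta would be added to the output of a frozen base layer,
# mirroring how a standard LoRA update is applied.
x = torch.randn(10, 32)
delta = SparseLoRAMoE(d_in=32, d_out=32)(x)
print(delta.shape)  # torch.Size([10, 32])
```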
Sparse mixture of expert architectures (MoEs) scale model capacity without large increases in traini...
Mixture-of-experts (MoE) is becoming popular due to its success in improving the model quality, espe...
Standard fine-tuning of large pre-trained language models (PLMs) for downstream tasks requires updat...
The Mixture of Experts (MoE) is a widely known neural architecture where an ensemble of specialized ...
Sparse Mixture of Experts (MoE) has received great interest due to its promising scaling capabilit...
We present Generalized LoRA (GLoRA), an advanced approach for universal parameter-efficient fine-tun...
In this paper, we present Delta-LoRA, which is a novel parameter-efficient approach to fine-tune lar...
Sparse Mixture-of-Experts (MoE) is a neural architecture design that can be utilized to add learnabl...
Gigantic pre-trained models have become central to natural language processing (NLP), serving as the...
The pre-training and fine-tuning paradigm has contributed to a number of breakthroughs in Natural La...
In this paper, we move towards combining large parametric models with non-parametric prototypical ne...
Large language models (LLMs) based on transformers have made significant strides in recent years, th...
We consider the problem of accurate sparse fine-tuning of large language models (LLMs), that is, fin...
Fine-tuning BERT-based models is resource-intensive in memory, computation, and time. While many pri...
Meta-learning is critical for a variety of practical ML systems -- like personalized recommendations...