Traditionally, large language models have been either trained on general web crawls or domain-specific data. However, recent successes of generative large language models, have shed light on the benefits of cross-domain datasets. To examine the significance of prioritizing data diversity over quality, we present a German dataset comprising texts from five domains, along with another dataset aimed at containing high-quality data. Through training a series of models ranging between 122M and 750M parameters on both datasets, we conduct a comprehensive benchmark on multiple downstream tasks. Our findings demonstrate that the models trained on the cross-domain dataset outperform those trained on quality data alone, leading to improvements up to ...
With the advancement of deep learning technologies, general-purpose large models such as GPT-4 have ...
Obtaining text datasets with semantic annotations is an effortful process, yet crucial for supervise...
Data selection is an effective approach to domain adaptation in statistical ma-chine translation. Th...
Large language models have transformed the field of natural language processing (NLP). Their improve...
Cross-lingual transfer learning with large multilingual pre-trained models can be an effective appro...
The use of language models in Web applications and other areas of computing and business have grown ...
How cross-linguistically applicable are NLP models, specifically language models? A fair comparison ...
We present a systematic study and comprehensive evaluation of large language models for automatic mu...
Scaling language models with more data, compute and parameters has driven significant progress in na...
Recent strides in Large Language Models (LLMs) have saturated many NLP benchmarks (even professional...
The amount of available digital data for the languages of the world is constantly increasing. Unfort...
Large language models (LLMs)—machine learning algorithms that can recognize, summarize, translate,...
© 2016 Association for Computational Linguistics. In this paper we improve over the hierarchical Pit...
The use of language models in Web applications and other areas of computing and business have grown ...
The co-existence of two scenarios, “the massive amount of unstructured text data that humanity produ...
With the advancement of deep learning technologies, general-purpose large models such as GPT-4 have ...
Obtaining text datasets with semantic annotations is an effortful process, yet crucial for supervise...
Data selection is an effective approach to domain adaptation in statistical ma-chine translation. Th...
Large language models have transformed the field of natural language processing (NLP). Their improve...
Cross-lingual transfer learning with large multilingual pre-trained models can be an effective appro...
The use of language models in Web applications and other areas of computing and business have grown ...
How cross-linguistically applicable are NLP models, specifically language models? A fair comparison ...
We present a systematic study and comprehensive evaluation of large language models for automatic mu...
Scaling language models with more data, compute and parameters has driven significant progress in na...
Recent strides in Large Language Models (LLMs) have saturated many NLP benchmarks (even professional...
The amount of available digital data for the languages of the world is constantly increasing. Unfort...
Large language models (LLMs)—machine learning algorithms that can recognize, summarize, translate,...
© 2016 Association for Computational Linguistics. In this paper we improve over the hierarchical Pit...
The use of language models in Web applications and other areas of computing and business have grown ...
The co-existence of two scenarios, “the massive amount of unstructured text data that humanity produ...
With the advancement of deep learning technologies, general-purpose large models such as GPT-4 have ...
Obtaining text datasets with semantic annotations is an effortful process, yet crucial for supervise...
Data selection is an effective approach to domain adaptation in statistical ma-chine translation. Th...