We show that large pre-trained language models are extremely capable of identifying label errors in datasets: simply verifying data points in descending order of out-of-distribution loss significantly outperforms more complex mechanisms for detecting label errors on natural language datasets. We contribute a novel method for producing highly realistic, human-originated label noise from crowdsourced data, and demonstrate its effectiveness on TweetNLP, providing an otherwise difficult-to-obtain measure of realistic recall.
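The ranking procedure in the abstract lends itself to a short sketch. The version below assumes a k-fold cross-validation setup, where each example's "out-of-distribution loss" is its negative log-likelihood under a model trained on the other folds. The paper fine-tunes a large pre-trained language model; a TF-IDF + logistic-regression pipeline stands in here only to keep the sketch self-contained, and all names (`rank_by_ood_loss`, `texts`, `labels`) are illustrative, not from the paper.

```python
# Minimal sketch of loss-ranked label-error detection, under the
# assumptions stated above (cross-validated held-out loss; a simple
# classifier standing in for the paper's fine-tuned pre-trained LM).
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import make_pipeline

def rank_by_ood_loss(texts, labels, n_folds=5, seed=0):
    """Indices of examples sorted by descending held-out loss on their
    given label; the top of the ranking is sent to a human verifier first."""
    labels = np.asarray(labels)
    losses = np.empty(len(texts), dtype=float)
    skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=seed)
    for train_idx, held_idx in skf.split(texts, labels):
        model = make_pipeline(TfidfVectorizer(),
                              LogisticRegression(max_iter=1000))
        model.fit([texts[i] for i in train_idx], labels[train_idx])
        # Probability the held-out model assigns to each example's own
        # label, computed on data the model never saw during training.
        probs = model.predict_proba([texts[i] for i in held_idx])
        cols = np.searchsorted(model.classes_, labels[held_idx])
        losses[held_idx] = -np.log(probs[np.arange(len(held_idx)), cols] + 1e-12)
    return np.argsort(-losses)  # highest-loss (most suspicious) first

# Toy usage; in practice the dataset being audited (e.g. TweetNLP)
# supplies texts and labels. "awful plot" carries an injected label error.
texts = ["great movie", "loved it", "awful plot", "fantastic",
         "terrible film", "boring mess"]
labels = ["pos", "pos", "pos", "pos", "neg", "neg"]
print(rank_by_ood_loss(texts, labels, n_folds=2)[:2])
```

The key design point, per the abstract, is that verification proceeds in descending loss order: a reviewer simply works down the sorted list until the labeling budget runs out, with no noise-rate estimate or threshold required.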
With the recent growth of online content on the Web, there has been more user-generated data with n...
Deep neural networks that dominate NLP rely on an immense number of parameters and require large tex...
Nowadays, crowdsourcing is widely used to collect training data for solving classification pro...
Large language models (LLMs) have demonstrated significant capability to generalize across a large n...
State-of-the-art deep learning methods achieve human-like performance on many tasks, but make errors...
This paper addresses the problem of learning when high-quality labeled examples are an expensive re...
A key requirement for supervised machine learning is labeled training data, which is created by anno...
Correcting errors in a data set is a critical issue. This task can be either h...
This paper presents Scalpel-CD, a first-of-its-kind system that leverages both human and machine int...
Noisy labels are commonly present in data sets automatically collected from the internet, mislabeled...
Large scale datasets collected using non-expert labelers are prone to labeling errors. Errors in the...
We investigated the use of supervised learning methods that use labels from crowd workers to resolve...
This thesis focuses on label noise in real-life datasets. Due to the upcoming growing...