In behavioural testing, system functionalities underrepresented in the standard evaluation setting (with a held-out test set) are validated through controlled input-output pairs. Optimising performance on the behavioural tests during training (behavioural learning) would improve coverage of phenomena not sufficiently represented in the i.i.d. data and could lead to seemingly more robust models. However, there is the risk that the model narrowly captures spurious correlations from the behavioural test suite, leading to overestimation and misrepresentation of model performance -- one of the original pitfalls of traditional evaluation. In this work, we introduce BeLUGA, an analysis method for evaluating behavioural learning considering general...
Weak supervision is leveraged in a wide range of domains and tasks due to its ability to create mass...
Language model fine-tuning is essential for modern natural language processing, but is computational...
While machine learning models rapidly advance the state-of-the-art on various real-world tasks, out-...
Behavioural testing -- verifying system capabilities by validating human-designed input-output pairs...
Fine-tuning pre-trained models have achieved impressive performance on standard natural language pro...
Test suites assess natural language processing models' performance on specific functionalities: case...
Plotting a learner's generalization performance against the training set size results in a so-called...
A machine learning (ML) system must learn not only to match the output of a target function on a tra...
Natural Language Inference (NLI) is considered a representative task to test natural language unders...
In order to compare learning algorithms, experimental results reported in the machine learning liter...
Large language models generate complex, open-ended outputs: instead of outputting a class label they...
Machine learning systems often do not share the same inductive biases as humans and, as a result, ex...
Classifiers tend to learn a false causal relationship between an over-represented concept and a labe...
Neural network models have been very successful in natural language inference, with the best models ...
Over-parameterized models, typically pretrained language models (LMs), have shown an appealing expre...
Weak supervision is leveraged in a wide range of domains and tasks due to its ability to create mass...
Language model fine-tuning is essential for modern natural language processing, but is computational...
While machine learning models rapidly advance the state-of-the-art on various real-world tasks, out-...
Behavioural testing -- verifying system capabilities by validating human-designed input-output pairs...
Fine-tuning pre-trained models have achieved impressive performance on standard natural language pro...
Test suites assess natural language processing models' performance on specific functionalities: case...
Plotting a learner's generalization performance against the training set size results in a so-called...
A machine learning (ML) system must learn not only to match the output of a target function on a tra...
Natural Language Inference (NLI) is considered a representative task to test natural language unders...
In order to compare learning algorithms, experimental results reported in the machine learning liter...
Large language models generate complex, open-ended outputs: instead of outputting a class label they...
Machine learning systems often do not share the same inductive biases as humans and, as a result, ex...
Classifiers tend to learn a false causal relationship between an over-represented concept and a labe...
Neural network models have been very successful in natural language inference, with the best models ...
Over-parameterized models, typically pretrained language models (LMs), have shown an appealing expre...
Weak supervision is leveraged in a wide range of domains and tasks due to its ability to create mass...
Language model fine-tuning is essential for modern natural language processing, but is computational...
While machine learning models rapidly advance the state-of-the-art on various real-world tasks, out-...