Test suites assess natural language processing models' performance on specific functionalities: cases of interest involving model robustness, fairness, or particular linguistic capabilities. They enable fine-grained evaluations of model aspects that would otherwise go unnoticed in standard evaluation datasets, but they do not address the problem of how to fix the failure cases. Previous work has explored functionality learning by fine-tuning models on suite data. While this improves performance on seen functionalities, it often does not generalize to unseen ones and can harm general performance. This paper analyses a fine-tuning-free approach to functionality learning. For each functionality in a suite, we generate a specification instruc...
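Concretely, the fine-tuning-free approach can be read as pure prompting: for each functionality, a specification instruction plus a few labelled demonstrations are prepended to every test input, and the model is queried by inference alone. The sketch below is a minimal illustration of that setup; the prompt template, the suite layout, and the `query_model` wrapper are assumptions for exposition, not the paper's exact implementation.

```python
# Minimal sketch of fine-tuning-free functionality learning via prompting.
# Assumptions (not from the abstract): each functionality is given as a
# specification string plus labelled examples, and `query_model` wraps any
# instruction-following LM (e.g. an API call). No gradient updates occur.

from typing import Callable

def build_prompt(spec: str, demos: list[tuple[str, str]], test_input: str) -> str:
    """Prepend the functionality specification and a few demonstrations
    to the test input, so the behaviour is learned in context."""
    lines = [f"Instruction: {spec}", ""]
    for text, label in demos:
        lines.append(f"Input: {text}\nLabel: {label}\n")
    lines.append(f"Input: {test_input}\nLabel:")
    return "\n".join(lines)

def evaluate_functionality(
    spec: str,
    demos: list[tuple[str, str]],
    test_cases: list[tuple[str, str]],
    query_model: Callable[[str], str],
) -> float:
    """Pass rate on one functionality's test cases, via inference alone."""
    correct = 0
    for text, gold in test_cases:
        prediction = query_model(build_prompt(spec, demos, text)).strip()
        correct += prediction == gold
    return correct / len(test_cases)
```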
Recent work has shown that fine-tuning large pre-trained language models on a collection of tasks de...
Large language models (LMs) are able to in-context learn -- perform a new task via inference alone b...
Language models, given their black-box nature, often exhibit sensitivity to input perturbations, lea...
Recently, instruction fine-tuning has risen to prominence as a potential method for enhancing the ze...
In behavioural testing, system functionalities underrepresented in the standard evaluation setting (...
Language model fine-tuning is essential for modern natural language processing, but is computational...
Natural Language Inference (NLI) is considered a representative task to test natural language unders...
Large language models are becoming increasingly practical for translating code across programming la...
A central question in natural language understanding (NLU) research is whether high performance demo...
Prompts have been the center of progress in advancing language models' zero-shot and few-shot perfor...
In this work, we evaluate 10 open-source instructed LLMs on four representative code comprehension a...
Behavioural testing -- verifying system capabilities by validating human-designed input-output pairs...
Large language models are able to perform a task by conditioning on a few input-output demonstration...
A Feature Model (FM) is a compact representation of all the products of a software product line. Au...
We introduce BitFit, a sparse-finetuning method where only the bias-terms of the model (or a subset ...
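For context on the BitFit entry above, the core idea is bias-only sparse fine-tuning: freeze every weight and update only the bias terms. The snippet below is a minimal sketch of that recipe, assuming a Hugging Face checkpoint; the model name and learning rate are illustrative choices, not values from the paper.

```python
# Sketch of BitFit-style sparse fine-tuning: train only the bias terms.
# Model name and hyperparameters are illustrative assumptions.
import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")

# Freeze all parameters except those whose name marks them as biases.
for name, param in model.named_parameters():
    param.requires_grad = "bias" in name

# The optimizer only sees the (tiny) trainable subset of parameters.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-4)
```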