The ability to identify influential training examples enables us to debug training data and explain model behavior. Existing techniques for doing so trace the flow of training-data influence through the model parameters. For large models in NLP applications, it is often computationally infeasible to study this flow through all model parameters, so techniques usually restrict attention to the last layer of weights. However, we observe that because the activation feeding the last layer of weights encodes ``shared logic'', the data influence calculated via the last-layer weights is prone to a ``cancellation effect'', in which the influences of different examples have large magnitudes that contradict each other. The cancellation effect lowers the d...
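To make the cancellation effect concrete, the following is a minimal sketch (an illustration under stated assumptions, not the paper's implementation) of a TracIn-style influence score restricted to the last layer of weights. The gradient factorization, the function names, and the toy pair of oppositely labeled training examples sharing one penultimate activation are all assumptions made for the example.

    import numpy as np

    # Minimal sketch: TracIn-style influence via last-layer gradient dot products.
    # For a linear last layer with softmax cross-entropy, the loss gradient w.r.t.
    # the weight matrix factorizes as outer(p - y, h), where h is the penultimate
    # activation, p the predicted distribution, and y the one-hot label.

    def last_layer_grad(h, p, y):
        """Loss gradient w.r.t. the last-layer weight matrix, flattened."""
        return np.outer(p - y, h).ravel()

    def influence(train_ex, test_ex):
        """Gradient dot product: > 0 pushes the test prediction toward its
        label, < 0 pushes it away."""
        return float(last_layer_grad(*train_ex) @ last_layer_grad(*test_ex))

    # Two training examples with opposite labels but the same penultimate
    # activation h -- the ``shared logic'' -- yield influences of equal
    # magnitude and opposite sign, so their aggregate influence cancels.
    rng = np.random.default_rng(0)
    h = rng.normal(size=16)
    p = np.array([0.5, 0.5])                   # undecided prediction
    y_a, y_b = np.array([1.0, 0.0]), np.array([0.0, 1.0])
    test = (h, p, y_a)

    inf_a = influence((h, p, y_a), test)       # large positive
    inf_b = influence((h, p, y_b), test)       # equally large negative
    print(inf_a, inf_b, inf_a + inf_b)         # sum is 0: cancellation

Because the two gradients are exact negations of each other, summing the per-example influences hides the fact that each one is individually large, which is the pathology the abstract describes.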
Traditionally, it has been assumed that rules are necessary to explain language acquisition. Recentl...
Despite the recent trend of developing and applying neural source code models to software engineerin...
Knowledge Distillation (KD) is a prominent neural model compression technique which heavily relies o...
Biases and artifacts in training data can cause unwelcome behavior in text classifiers (such as shal...
Good models require good training data. For overparameterized deep models, the causal relationship b...
Deep neural networks that dominate NLP rely on an immense amount of parameters and require large tex...
In recent years, the field of language modelling has witnessed exciting developments. Especially, th...
As the complexity of machine learning (ML) models increases, resulting in a lack of prediction expla...
Modern supervised learning algorithms can learn very accurate and complex discriminating functions. ...
Natural Language Inference (NLI) models are known to learn from biases and artefacts within their tr...
We argue that extrapolation to examples outside the training space will often be easier for models t...
Language Models (LMs) pre-trained with self-supervision on large text corpora have become the defaul...
Many recent works indicate that deep neural networks tend to take dataset biases as shortcuts to...
This paper aims to compare different regularization strategies to address a common phenomenon, sever...
Programmatic Weak Supervision (PWS) aggregates the source votes of multiple weak supervision sources...