We introduce BLiMP (The Benchmark of Linguistic Minimal Pairs), a human-solvable challenge set for evaluating language models (LMs) that covers a broad range of major grammatical phenomena in English. BLiMP consists of over 30 datasets, each containing 1000 minimal pairs that isolate specific contrasts in syntax, morphology, or semantics. Like GLUE (Wang et al., 2018), BLiMP makes it easy to compare models directly. Evaluating n-gram, LSTM, and Transformer LMs (GPT-2 and Transformer-XL), we find that Transformers are strongest overall, achieving (near) human performance on agreement and binding. However, phenomena like wh-islands and NPI licensing remain challenging even for state-of-the-art LMs.
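To make the minimal-pair setup concrete, the sketch below shows one way such an evaluation could be scored: a model counts as correct on a pair when it gives the acceptable sentence a higher score than the unacceptable one. This is a minimal illustration and not the authors' code. The names `minimal_pair_accuracy` and `toy_log_prob`, the example sentences, and the bigram-penalty scorer are all assumptions made for illustration; a real evaluation would plug in a language model's sentence log-probability.

```python
from typing import Callable, Iterable, Tuple

Pair = Tuple[str, str]  # (acceptable sentence, unacceptable sentence)


def minimal_pair_accuracy(pairs: Iterable[Pair],
                          log_prob: Callable[[str], float]) -> float:
    """Fraction of pairs where the acceptable sentence scores strictly higher."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one minimal pair")
    correct = sum(1 for good, bad in pairs if log_prob(good) > log_prob(bad))
    return correct / len(pairs)


def toy_log_prob(sentence: str) -> float:
    """Hypothetical stand-in for an LM score: penalize a few bad bigrams.

    It only sees adjacent words, so it cannot handle agreement across
    intervening material.
    """
    penalized = {("cats", "is"), ("cat", "are")}
    tokens = sentence.lower().rstrip(".").split()
    return -float(sum(1 for bigram in zip(tokens, tokens[1:])
                      if bigram in penalized))


# Illustrative pairs (not taken from the benchmark).
pairs = [
    ("The cats are here.", "The cats is here."),
    ("The cat is here.", "The cat are here."),
    ("The cats near the dog are here.", "The cats near the dog is here."),
]

accuracy = minimal_pair_accuracy(pairs, toy_log_prob)
print(f"accuracy = {accuracy:.3f}")
```

The toy scorer gets the two adjacent-agreement pairs right but ties on the third, where a noun intervenes between subject and verb, so the strict comparison counts that pair as an error.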
I present a novel algorithm for minimally supervised formal grammar induction using a linguistically...
Recent progress in pretraining language models on large corpora has resulted in significant performa...
Pretrained language models (PLMs) have achieved superhuman performance on many benchmarks, creating ...
We introduce The Benchmark of Linguistic Minimal Pairs (BLiMP),1 a challenge set for evaluating the ...
We present a dataset for evaluating the grammatical sophistication of language models (LMs). We cons...
How cross-linguistically applicable are NLP models, specifically language models? A fair comparison ...
Minimal sentence pairs are frequently used to analyze the behavior of language models. It is often a...
Many patterns found in natural language syntax have multiple possible explanations or structural des...
Syntactic parsing is the process of automatically assigning a structure to a string of words, and i...