
Blackbird Language Matrices: A Framework to Investigate the Linguistic Competence of Language Models


Paola Merlo, Chunyang Jiang, Giuseppe Samo, Vivi Nastase

arXiv:2602.20966v1 Abstract: This article describes a novel language task, the Blackbird Language Matrices (BLM) task, inspired by intelligence tests, and illustrates the BLM datasets, their construction and benchmarking, and targeted experiments on chunking and systematicity. BLMs are multiple-choice problems, structured at multiple levels: within each sentence, across the input sequence, and within each candidate answer. Because of their rich structure, these curated but naturalistic datasets are key to answering some core questions about the abilities of current large language models: do LLMs detect linguistic objects and their properties? Do they detect and use systematic patterns across sentences? Are they more prone to linguistic or reasoning errors, and how do these interact? We show that BLMs, while challenging, can be solved at good levels of performance, in more than one language, with simple baseline models or, at better performance levels, with more tailored models. We show that their representations contain the grammatical objects and attributes relevant to solving a linguistic task. We also show that these solutions are reached by detecting systematic patterns across sentences. The paper supports the view that curated, structured datasets enable multi-faceted investigations of the properties of language and of large language models. Because they present a curated, articulated structure, because they comprise both learning contexts and expected answers, and because they are partly built by hand, BLMs fall into the category of datasets that can support explainability investigations and help ask why large language models behave the way they do.
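As a concrete illustration of the structure the abstract describes, a BLM item pairs a context sequence that instantiates a grammatical pattern with a set of candidate answers, only one of which continues the pattern. The sketch below renders that structure in Python; the field names and the toy agreement-style example are assumptions for illustration, not the authors' actual schema or data:

```python
# Minimal sketch of a BLM item (hypothetical schema, not the authors' format):
# a context sequence exhibiting a systematic grammatical pattern, plus
# candidate answers among which only one continues the pattern correctly.
from dataclasses import dataclass, field


@dataclass
class BLMItem:
    context: list[str]        # input sequence exhibiting the pattern
    choices: list[str]        # candidates: one correct, the rest contrastive errors
    answer_index: int = 0     # position of the correct continuation


# Toy item loosely modeled on subject-verb agreement with an intervening noun;
# the real BLM datasets are curated, multilingual, and far richer.
item = BLMItem(
    context=[
        "The vase of the lady leaks.",
        "The vase of the ladies leaks.",
        "The vases of the lady leak.",
    ],
    choices=[
        "The vases of the ladies leak.",   # continues the pattern
        "The vases of the ladies leaks.",  # agreement error
        "The vase of the lady leak.",      # breaks the sequence pattern
    ],
    answer_index=0,
)
```

Solving the item requires tracking both within-sentence properties (agreement between subject and verb) and the progression across the context sentences, which is what makes the task a probe of systematicity rather than of single-sentence grammaticality.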

Executive Summary

This article introduces the Blackbird Language Matrices (BLM) task, a novel language task inspired by intelligence tests. BLMs are multiple-choice problems with a rich multi-level structure, designed to investigate the linguistic competence of large language models (LLMs). The authors demonstrate that BLMs can be solved at good levels of performance with simple baseline models, and at better levels with more tailored models. The results show that LLMs can detect linguistic objects and their properties, as well as systematic patterns across sentences. The study highlights the value of curated, structured datasets for multi-faceted investigations of the properties of language and of LLMs, and their potential for explainability research. The findings have implications for the development and evaluation of LLMs and for the design of more effective language tasks.

Key Points

  • BLMs are a novel language task inspired by intelligence tests
  • BLMs can be solved at good levels of performance using simple or tailored models
  • LLMs can detect linguistic objects and systematic patterns across sentences
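To make the multiple-choice interface that baseline models operate over concrete, here is a toy scorer that picks the candidate most lexically similar to the pooled context. This is entirely hypothetical and is not one of the paper's baselines (the authors' models use learned sentence representations); it also deliberately shows why surface similarity is insufficient, since it ignores the systematic cross-sentence patterns BLMs probe:

```python
# Toy multiple-choice scorer over a BLM-style item (illustrative only, not
# the paper's baselines): represent sentences as bags of words, pool the
# context, and pick the candidate with the highest cosine similarity.
import math
from collections import Counter


def bow(sentence: str) -> Counter:
    """Bag-of-words representation of a sentence."""
    return Counter(sentence.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(cnt * b[tok] for tok, cnt in a.items())
    na = math.sqrt(sum(c * c for c in a.values()))
    nb = math.sqrt(sum(c * c for c in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def predict(context: list[str], choices: list[str]) -> int:
    """Return the index of the candidate most similar to the pooled context."""
    ctx = bow(" ".join(context))  # pooled context representation
    scores = [cosine(ctx, bow(c)) for c in choices]
    return max(range(len(choices)), key=scores.__getitem__)
```

For example, `predict(["red apple", "red berry"], ["red cherry", "blue sky"])` returns 0, because the first candidate shares vocabulary with the context. On real BLM items, where distractors differ from the correct answer only in a grammatical attribute, a scorer like this cannot separate the candidates, which is precisely why the task isolates pattern detection from lexical overlap.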

Merits

Strength of the BLM framework

The BLM framework offers a comprehensive and structured approach to investigating the linguistic competence of LLMs, allowing for a deeper understanding of their strengths and weaknesses.

Value of curated datasets

The study highlights the importance of curated, structured datasets in supporting multi-faceted investigations of language and LLM properties, and their potential for explainability investigations.

Demerits

Limited generalizability

The study's findings may not be generalizable to all language tasks or LLM architectures, and further research is needed to validate the results and explore their implications.

Dependence on human annotation

The BLM datasets require significant human annotation, which can be time-consuming and expensive, and may limit the scalability of the approach.

Expert Commentary

The article presents a novel and intriguing approach to investigating the linguistic competence of LLMs. The BLM framework provides a structured lens on their strengths and weaknesses and underscores the value of curated, structured datasets for multi-faceted investigation and explainability research. The findings carry significant implications for the development and evaluation of LLMs and for the design of more effective language tasks. However, the limited generalizability of the results and the dependence on human annotation are notable limitations that future research will need to address. Overall, the study makes a significant contribution to the field and offers new insights into how LLMs can be evaluated and understood.

Recommendations

  • Future research should focus on expanding the scope of the BLM framework to include more diverse language tasks and LLM architectures.
  • The development of more efficient and scalable methods for annotating and generating BLM datasets is essential to ensure the approach can be widely adopted.
