
Blackbird Language Matrices: A Framework to Investigate the Linguistic Competence of Language Models


Paola Merlo, Chunyang Jiang, Giuseppe Samo, Vivi Nastase

arXiv:2602.20966v1 Abstract: This article describes a novel language task, the Blackbird Language Matrices (BLM) task, inspired by intelligence tests, and illustrates the BLM datasets, their construction and benchmarking, and targeted experiments on chunking and systematicity. BLMs are multiple-choice problems, structured at multiple levels: within each sentence, across the input sequence, and within each candidate answer. Because of their rich structure, these curated but naturalistic datasets are key to answering some core questions about the abilities of current large language models: do LLMs detect linguistic objects and their properties? Do they detect and use systematic patterns across sentences? Are they more prone to linguistic or reasoning errors, and how do these interact? We show that BLMs, while challenging, can be solved at good levels of performance, in more than one language, with simple baseline models or, at better performance levels, with more tailored models. We show that their representations contain the grammatical objects and attributes relevant to solving a linguistic task. We also show that these solutions are reached by detecting systematic patterns across sentences. The paper supports the view that curated, structured datasets enable multi-faceted investigations of the properties of language and of large language models. Because they present a curated, articulated structure, because they comprise both learning contexts and expected answers, and because they are partly built by hand, BLMs fall into the category of datasets that can support explainability investigations and help ask why large language models behave the way they do.
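As a concrete illustration of the structure the abstract describes, a BLM item pairs a context sequence that instantiates a grammatical pattern with a set of candidate answers, only one of which continues the pattern. The sketch below renders that structure in Python; the field names and the toy agreement-style example are assumptions for illustration, not the authors' actual schema or data:

```python
# Minimal sketch of a BLM item (hypothetical schema, not the authors' format):
# a context sequence exhibiting a systematic grammatical pattern, plus
# candidate answers among which only one continues the pattern correctly.
from dataclasses import dataclass, field


@dataclass
class BLMItem:
    context: list[str]        # input sequence exhibiting the pattern
    choices: list[str]        # candidates: one correct, the rest contrastive errors
    answer_index: int = 0     # position of the correct continuation


# Toy item loosely modeled on subject-verb agreement with an intervening noun;
# the real BLM datasets are curated, multilingual, and far richer.
item = BLMItem(
    context=[
        "The vase of the lady leaks.",
        "The vase of the ladies leaks.",
        "The vases of the lady leak.",
    ],
    choices=[
        "The vases of the ladies leak.",   # continues the pattern
        "The vases of the ladies leaks.",  # agreement error
        "The vase of the lady leak.",      # breaks the sequence pattern
    ],
    answer_index=0,
)
```

Solving the item requires tracking both within-sentence properties (agreement between subject and verb) and the progression across the context sentences, which is what makes the task a probe of systematicity rather than of single-sentence grammaticality.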

Executive Summary

This article introduces the Blackbird Language Matrices (BLM) task, a novel language task inspired by intelligence tests. BLMs are multiple-choice problems with a rich multi-level structure, designed to investigate the linguistic competence of large language models (LLMs). The authors demonstrate that BLMs can be solved at good levels of performance with simple baseline models, and at better levels with more tailored models. The results show that LLMs can detect linguistic objects and their properties, as well as systematic patterns across sentences. The study highlights the value of curated, structured datasets for multi-faceted investigations of the properties of language and of LLMs, and their potential for explainability research. The findings have implications for the development and evaluation of LLMs and for the design of more effective language tasks.

Key Points

  • BLMs are a novel language task inspired by intelligence tests
  • BLMs can be solved at good levels of performance using simple or tailored models
  • LLMs can detect linguistic objects and systematic patterns across sentences
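To make the multiple-choice interface that baseline models operate over concrete, here is a toy scorer that picks the candidate most lexically similar to the pooled context. This is entirely hypothetical and is not one of the paper's baselines (the authors' models use learned sentence representations); it also deliberately shows why surface similarity is insufficient, since it ignores the systematic cross-sentence patterns BLMs probe:

```python
# Toy multiple-choice scorer over a BLM-style item (illustrative only, not
# the paper's baselines): represent sentences as bags of words, pool the
# context, and pick the candidate with the highest cosine similarity.
import math
from collections import Counter


def bow(sentence: str) -> Counter:
    """Bag-of-words representation of a sentence."""
    return Counter(sentence.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(cnt * b[tok] for tok, cnt in a.items())
    na = math.sqrt(sum(c * c for c in a.values()))
    nb = math.sqrt(sum(c * c for c in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def predict(context: list[str], choices: list[str]) -> int:
    """Return the index of the candidate most similar to the pooled context."""
    ctx = bow(" ".join(context))  # pooled context representation
    scores = [cosine(ctx, bow(c)) for c in choices]
    return max(range(len(choices)), key=scores.__getitem__)
```

For example, `predict(["red apple", "red berry"], ["red cherry", "blue sky"])` returns 0, because the first candidate shares vocabulary with the context. On real BLM items, where distractors differ from the correct answer only in a grammatical attribute, a scorer like this cannot separate the candidates, which is precisely why the task isolates pattern detection from lexical overlap.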

Merits

Strength of the BLM framework

The BLM framework offers a comprehensive and structured approach to investigating the linguistic competence of LLMs, allowing for a deeper understanding of their strengths and weaknesses.

Value of curated datasets

The study highlights the importance of curated, structured datasets in supporting multi-faceted investigations of language and LLM properties, and their potential for explainability investigations.

Demerits

Limited generalizability

The study's findings may not be generalizable to all language tasks or LLM architectures, and further research is needed to validate the results and explore their implications.

Dependence on human annotation

The BLM datasets require significant human annotation, which can be time-consuming and expensive, and may limit the scalability of the approach.

Expert Commentary

The article presents a novel and intriguing approach to investigating the linguistic competence of LLMs. The BLM framework provides a structured lens on their strengths and weaknesses and underscores the value of curated, structured datasets for multi-faceted investigation and explainability research. The findings carry significant implications for the development and evaluation of LLMs and for the design of more effective language tasks. However, the limited generalizability of the results and the dependence on human annotation are notable limitations that future research will need to address. Overall, the study makes a significant contribution to the field and offers new insights into how LLMs can be evaluated and understood.

Recommendations

  • Future research should focus on expanding the scope of the BLM framework to include more diverse language tasks and LLM architectures.
  • The development of more efficient and scalable methods for annotating and generating BLM datasets is essential to ensure the approach can be widely adopted.
