
Evaluating Monolingual and Multilingual Large Language Models for Greek Question Answering: The DemosQA Benchmark


Charalampos Mastrokostas, Nikolaos Giarelis, Nikos Karacapilidis

arXiv:2602.16811v1. Abstract: Recent advancements in Natural Language Processing and Deep Learning have enabled the development of Large Language Models (LLMs), which have significantly advanced the state-of-the-art across a wide range of tasks, including Question Answering (QA). Despite these advancements, research on LLMs has primarily targeted high-resourced languages (e.g., English), and only recently has attention shifted toward multilingual models. However, these models demonstrate a training data bias towards a small number of popular languages or rely on transfer learning from high- to under-resourced languages; this may lead to a misrepresentation of social, cultural, and historical aspects. To address this challenge, monolingual LLMs have been developed for under-resourced languages; however, their effectiveness remains less studied when compared to multilingual counterparts on language-specific tasks. In this study, we address this research gap in Greek QA by contributing: (i) DemosQA, a novel dataset, which is constructed using social media user questions and community-reviewed answers to better capture the Greek social and cultural zeitgeist; (ii) a memory-efficient LLM evaluation framework adaptable to diverse QA datasets and languages; and (iii) an extensive evaluation of 11 monolingual and multilingual LLMs on 6 human-curated Greek QA datasets using 3 different prompting strategies. We release our code and data to facilitate reproducibility.

Executive Summary

This study addresses the research gap in evaluating monolingual and multilingual Large Language Models (LLMs) for Greek Question Answering (QA). The authors contribute DemosQA, a novel dataset constructed from social media user questions and community-reviewed answers, together with a memory-efficient evaluation framework adaptable to diverse QA datasets and languages. Using this framework, they evaluate 11 monolingual and multilingual LLMs on 6 human-curated Greek QA datasets under 3 different prompting strategies. The findings highlight the limitations of multilingual LLMs in capturing the social and cultural aspects of under-resourced languages, with direct implications for how LLMs should be developed and evaluated for such languages, and they advance the state of QA research in Greek.

Key Points

  • The study contributes a novel dataset, DemosQA, for Greek QA
  • A memory-efficient evaluation framework, adaptable to diverse QA datasets and languages, is introduced
  • Monolingual and multilingual LLMs are extensively evaluated on Greek QA datasets

Merits

Strength in novel dataset creation

The study creates a novel dataset, DemosQA, that captures the Greek social and cultural zeitgeist, filling a research gap in QA for under-resourced languages.

Innovative evaluation framework

The authors develop a memory-efficient evaluation framework adaptable to diverse QA datasets and languages, enabling the evaluation of LLMs on a wide range of tasks and languages.
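The paper does not reproduce its framework's code here, but the core idea of a dataset-agnostic QA evaluation loop can be sketched as follows. All names (`QAExample`, `exact_match_accuracy`, the toy model) are illustrative assumptions, not the authors' API; a memory-efficient variant would additionally load and release one model at a time rather than holding all 11 in memory.

```python
from dataclasses import dataclass
from typing import Callable, Iterable

@dataclass
class QAExample:
    """One QA pair; any dataset mapped into this shape can be scored."""
    question: str
    answer: str

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so exact match ignores casing/spacing."""
    return " ".join(text.lower().split())

def exact_match_accuracy(model: Callable[[str], str],
                         dataset: Iterable[QAExample]) -> float:
    """Run the model over every question and score normalized exact match."""
    examples = list(dataset)
    correct = sum(
        normalize(model(ex.question)) == normalize(ex.answer)
        for ex in examples
    )
    return correct / len(examples)

# Toy usage: a stand-in "model" that only knows one fact.
toy_model = lambda q: "Athens" if "capital" in q else "unknown"
data = [QAExample("What is the capital of Greece?", "athens"),
        QAExample("Who wrote the Odyssey?", "Homer")]
print(exact_match_accuracy(toy_model, data))  # 0.5
```

Because the model is just a `Callable[[str], str]`, the same loop works for any backend and any language, which is the kind of adaptability the framework claims.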

Thorough evaluation of LLMs

The study extensively evaluates 11 monolingual and multilingual LLMs on 6 human-curated Greek QA datasets using 3 different prompting strategies, providing a comprehensive understanding of the effectiveness of LLMs for Greek QA.
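The summary does not name the 3 prompting strategies, so as an assumption the sketch below uses three commonly compared ones (zero-shot, few-shot, chain-of-thought); the templates and function names are hypothetical, not taken from the paper.

```python
def zero_shot(question: str) -> str:
    """Ask the question directly, with no demonstrations."""
    return f"Answer the question.\nQ: {question}\nA:"

def few_shot(question: str, examples: list[tuple[str, str]]) -> str:
    """Prepend worked (question, answer) demonstrations before the target question."""
    demos = "\n".join(f"Q: {q}\nA: {a}" for q, a in examples)
    return f"{demos}\nQ: {question}\nA:"

def chain_of_thought(question: str) -> str:
    """Elicit intermediate reasoning before the final answer."""
    return f"Q: {question}\nLet's think step by step.\nA:"

prompt = few_shot("What is the capital of Greece?",
                  [("What is the capital of France?", "Paris")])
print(prompt)
```

Evaluating every model under each such prompt builder, as the study does across its 6 datasets, separates a model's knowledge of Greek content from its sensitivity to prompt format.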

Demerits

Limited generalizability to other languages

The study focuses on Greek QA and may not generalize to other languages or tasks, limiting the applicability of the findings to broader research contexts.

Dependence on human-curated datasets

The evaluation framework relies on human-curated datasets, which may introduce biases and limit the diversity of the evaluation process.

Limited exploration of contextual factors

The study does not thoroughly investigate the impact of contextual factors, such as cultural and social aspects, on the performance of LLMs for Greek QA.

Expert Commentary

This study makes a meaningful contribution to Question Answering research, particularly for under-resourced languages. The creation of DemosQA and the development of a memory-efficient evaluation framework are both useful advancements. That said, the study's limitations, including its restricted generalizability to other languages and its dependence on human-curated datasets, point to the need for more comprehensive and diverse evaluation methodologies. The reported shortcomings of multilingual LLMs in capturing the social and cultural aspects of under-resourced languages further underscore the importance of contextual factors in language modeling. Overall, the results carry clear implications for the development and evaluation of LLMs and advance QA research in Greek.

Recommendations

  • Future studies should investigate the generalizability of the findings to other languages and tasks, and explore the impact of contextual factors on LLM performance.
  • Developers and researchers should adopt more nuanced and culturally sensitive approaches to language modeling, taking into account the social and cultural aspects of under-resourced languages.
