
Under-resourced studies of under-resourced languages: lemmatization and POS-tagging with LLM annotators for historical Armenian, Georgian, Greek and Syriac

arXiv:2602.15753v1 Announce Type: new Abstract: Low-resource languages pose persistent challenges for Natural Language Processing tasks such as lemmatization and part-of-speech (POS) tagging. This paper investigates the capacity of recent large language models (LLMs), including GPT-4 variants and open-weight Mistral models, to address these tasks in few-shot and zero-shot settings for four historically and linguistically diverse under-resourced languages: Ancient Greek, Classical Armenian, Old Georgian, and Syriac. Using a novel benchmark comprising aligned training and out-of-domain test corpora, we evaluate the performance of foundation models across lemmatization and POS-tagging, and compare them with PIE, a task-specific RNN baseline. Our results demonstrate that LLMs, even without fine-tuning, achieve competitive or superior performance in POS-tagging and lemmatization across most languages in few-shot settings. Significant challenges persist for languages characterized by complex morphology and non-Latin scripts, but we demonstrate that LLMs are a credible and relevant option for initiating linguistic annotation tasks in the absence of data, serving as an effective aid for annotation.

Executive Summary

This study investigates the capacity of large language models (LLMs) to perform lemmatization and part-of-speech (POS) tagging in four under-resourced languages: Ancient Greek, Classical Armenian, Old Georgian, and Syriac. The authors develop a novel benchmark with aligned training and out-of-domain test corpora and evaluate LLMs in few-shot and zero-shot settings against PIE, a task-specific RNN baseline. In few-shot settings, LLMs achieve competitive or superior performance on both tasks for most languages, even without fine-tuning, suggesting they are a credible option for bootstrapping linguistic annotation when no training data exists. Significant challenges persist, however, for languages with complex morphology and non-Latin scripts. The findings have implications for the development of NLP tools and for the annotation of under-resourced languages.

Key Points

  • LLMs can perform lemmatization and POS-tagging in few-shot settings for under-resourced languages
  • The study develops a novel benchmark for evaluating the performance of LLMs in under-resourced languages
  • LLMs demonstrate competitive or superior performance in POS-tagging and lemmatization across most languages

Merits

Strength of LLMs in few-shot settings

LLMs can achieve competitive or superior performance in POS-tagging and lemmatization without fine-tuning, making them a viable option for under-resourced languages.
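As a concrete illustration of the few-shot setting, the sketch below assembles a prompt that pairs a handful of annotated tokens with a new token to be labeled. This is a hypothetical reconstruction, not the paper's actual prompt: the example tokens, tags, and lemmata are illustrative, and the tagset shown (UPOS-style labels) is an assumption.

```python
# Hypothetical sketch of a few-shot prompt for joint POS-tagging and
# lemmatization. The Ancient Greek examples below are illustrative and
# are not drawn from the paper's benchmark.

FEW_SHOT_EXAMPLES = [
    # (token, POS tag, lemma)
    ("λόγοι", "NOUN", "λόγος"),
    ("ἔλεγεν", "VERB", "λέγω"),
]

def build_prompt(target_token: str) -> str:
    """Assemble a few-shot prompt asking an LLM for a POS tag and lemma."""
    lines = ["Annotate each Ancient Greek token with its POS tag and lemma.", ""]
    for token, pos, lemma in FEW_SHOT_EXAMPLES:
        lines.append(f"Token: {token}\nPOS: {pos}\nLemma: {lemma}\n")
    # Leave the answer fields open for the model to complete.
    lines.append(f"Token: {target_token}\nPOS:")
    return "\n".join(lines)

print(build_prompt("ἀνθρώπων"))
```

The resulting string would then be sent to the chosen LLM; in the zero-shot setting, the `FEW_SHOT_EXAMPLES` list is simply empty and only the instruction remains.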

Novel benchmark development

The study develops a novel benchmark that can be used to evaluate the performance of LLMs in under-resourced languages, providing a valuable resource for the NLP community.

Increased accessibility of annotation tools

LLMs can serve as an effective aid for annotation, making it easier to access and annotate under-resourced languages.

Demerits

Limitations for languages with complex morphology

Significant challenges persist for languages with complex morphology and non-Latin scripts, highlighting the need for further research and development.

Dependence on large datasets

LLMs are built on massive pretraining corpora in which under-resourced languages are barely represented, so their performance on these languages remains constrained by data that the languages, by definition, lack.

Lack of human evaluation

The study relies on automatic evaluation metrics, which have inherent limits; human evaluation would be needed to validate the results, particularly for ambiguous lemmata and tags.
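The automatic metrics in question typically reduce to token-level accuracy against a gold-annotated corpus. The sketch below shows that generic metric under illustrative labels; it is an assumption about the evaluation, not the paper's exact scoring code.

```python
# Minimal sketch of token-level accuracy, the kind of automatic metric
# such evaluations rely on. The gold and predicted tags are illustrative.

def token_accuracy(gold: list[str], predicted: list[str]) -> float:
    """Fraction of tokens whose predicted label matches the gold label."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must be aligned, token for token")
    if not gold:
        return 0.0
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)

gold_pos = ["NOUN", "VERB", "ADP", "NOUN"]
pred_pos = ["NOUN", "VERB", "ADP", "ADJ"]
print(token_accuracy(gold_pos, pred_pos))  # -> 0.75
```

A metric like this cannot tell a plausible-but-wrong lemma from a nonsensical one, which is precisely where human evaluation would add value.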

Expert Commentary

The study makes a significant contribution to NLP for under-resourced languages. The results demonstrate the potential of LLMs for lemmatization and POS-tagging, and the benchmark introduced here will be a valuable resource for the community. Further research is needed, however, to address the limitations of LLMs on languages with complex morphology and non-Latin scripts, and human evaluation would strengthen the automatic results. Overall, the study points to a promising direction for developing NLP tools and annotating under-resourced languages.

Recommendations

  • Further research is needed to develop more robust and effective LLMs for under-resourced languages with complex morphology and non-Latin scripts.
  • The development of more comprehensive annotation tools and guidelines is necessary to facilitate the annotation and analysis of under-resourced languages.
