
Building an Ensemble LLM Semantic Tagger for UN Security Council Resolutions

Hussein Ghaly

arXiv:2603.05895v1. Abstract: This paper introduces a new methodology for using LLM-based systems for accurate and efficient semantic tagging of UN Security Council resolutions. The main goal is to leverage LLM performance variability to build ensemble systems for data cleaning and semantic tagging tasks. We introduce two evaluation metrics, Content Preservation Ratio (CPR) and Tag Well-Formedness (TWF), in order to avoid hallucinations and unnecessary additions or omissions to the input text beyond the task requirement. These metrics allow the selection of the best output from multiple runs of several GPT models. GPT-4.1 achieved the highest metrics for both tasks (cleaning: CPR 84.9%; semantic tagging: CPR 99.99% and TWF 99.92%). In terms of cost, smaller models, such as GPT-4.1-mini, achieved comparable performance to the best model in each task at only 20% of the cost. These metrics ultimately allowed the ensemble to select the optimal output (both cleaned and tagged content) for all the LLM models involved, across multiple runs. With this ensemble design and the use of metrics, we create a reliable LLM system for performing semantic tagging on challenging texts.

Executive Summary

This study introduces an ensemble methodology for leveraging Large Language Model (LLM) performance variability to build accurate and efficient semantic tagging systems for UN Security Council resolutions. By employing two novel evaluation metrics, Content Preservation Ratio (CPR) and Tag Well-Formedness (TWF), the researchers successfully avoid hallucinations and optimize the output from multiple LLM runs. The results demonstrate that smaller models, such as GPT-4.1-mini, can achieve comparable performance at a fraction of the cost. This ensemble design and metric-based approach enable the development of a reliable LLM system for semantic tagging of challenging texts.

Key Points

  • Introduction of Content Preservation Ratio (CPR) and Tag Well-Formedness (TWF) metrics for evaluating LLM performance
  • Ensemble methodology leverages LLM performance variability for accurate semantic tagging
  • Smaller models, such as GPT-4.1-mini, achieve comparable performance at reduced cost

Merits

Methodological Innovation

The study introduces novel evaluation metrics (CPR and TWF) that effectively address the problem of hallucinations in LLM-based systems, enabling more accurate semantic tagging.
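
The paper's exact metric definitions are not reproduced in this summary, so the sketch below is only one plausible reading: CPR as the fraction of source text that survives in the tag-stripped output, and TWF as the fraction of tags that open and close in properly nested order. The function names and formulas here are illustrative assumptions, not the authors' implementation.

```python
import difflib
import re

def content_preservation_ratio(source: str, output: str) -> float:
    """Hypothetical CPR: fraction of the source text preserved in the
    output once tag markup is stripped, measured via matching-block
    overlap. A score below 1.0 signals omissions or additions."""
    stripped = re.sub(r"</?\w+[^>]*>", "", output)  # drop tag markup
    matcher = difflib.SequenceMatcher(None, source, stripped)
    preserved = sum(block.size for block in matcher.get_matching_blocks())
    return preserved / max(len(source), 1)

def tag_well_formedness(output: str) -> float:
    """Hypothetical TWF: fraction of tags that participate in a
    properly nested open/close pair, checked with a simple stack."""
    tags = re.findall(r"<(/?)(\w+)[^>]*>", output)
    stack, matched = [], 0
    for closing, name in tags:
        if not closing:
            stack.append(name)
        elif stack and stack[-1] == name:
            stack.pop()
            matched += 2  # count both the open and its close
    return matched / len(tags) if tags else 1.0
```

On untagged, unmodified text both functions return 1.0, so any drop below that threshold flags a candidate output for the ensemble to discard.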

Scalability and Efficiency

The use of smaller models, such as GPT-4.1-mini, demonstrates that high-performance semantic tagging can be achieved at reduced computational costs, making the system more scalable and efficient.
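
A rough way to picture this trade-off is quality per unit cost. The numbers below are illustrative placeholders, not figures reported in the paper:

```python
# Hypothetical model profiles: combined metric score and relative cost
# (best model normalized to 1.0). Values are illustrative only.
models = {
    "gpt-4.1":      {"score": 0.999, "cost": 1.00},
    "gpt-4.1-mini": {"score": 0.995, "cost": 0.20},
}

def quality_per_cost(profile: dict) -> float:
    return profile["score"] / profile["cost"]

best_value = max(models, key=lambda m: quality_per_cost(models[m]))
# A near-equal score at 20% of the cost gives the smaller model
# roughly 5x the quality-per-cost ratio.
```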

Reliability and Consistency

The ensemble design and metric-based approach ensure that the system produces reliable and consistent outputs, even across multiple LLM runs and tasks.
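
The selection step can be pictured as a best-of-N argmax over candidate outputs from multiple models and runs. The sketch below is an assumption about how such a selector might combine the scores (the paper's exact rule is not given in this summary), and the metric functions passed in are stand-ins:

```python
def select_best_output(source, candidates, cpr, twf):
    """Hypothetical ensemble selector: keep the candidate whose combined
    CPR * TWF score is highest. The product weighting is an assumption;
    the paper may combine the two metrics differently."""
    return max(candidates, key=lambda out: cpr(source, out) * twf(out))

# Usage with stand-in metric functions (illustrative only):
cpr = lambda src, out: 1.0 if src in out.replace("<t>", "").replace("</t>", "") else 0.5
twf = lambda out: 1.0 if out.count("<t>") == out.count("</t>") else 0.0

runs = ["<t>decides", "<t>decides</t>", "decides"]
best = select_best_output("decides", runs, cpr, twf)
```

Because any hallucinated text or malformed tag drags a candidate's score down, the argmax naturally filters run-to-run variability instead of being hurt by it.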

Demerits

Limited Generalizability

The study focuses on UN Security Council resolutions, which may limit the generalizability of the findings to other domains and text types.

Dependence on LLM Performance

The effectiveness of the ensemble system relies heavily on the performance of the underlying LLM models, which may be subject to variability and degradation over time.

Expert Commentary

The study presents a significant contribution to the field of NLP, introducing novel evaluation metrics and an ensemble methodology that effectively address the challenges of hallucinations and performance variability in LLM-based systems. While the study's focus on UN Security Council resolutions may limit generalizability, the findings have far-reaching implications for the development of more accurate and efficient semantic tagging systems. The use of smaller models, such as GPT-4.1-mini, demonstrates the potential for cost-effective and scalable solutions. Nevertheless, the dependence on LLM performance highlights the need for ongoing research into model robustness and reliability.

Recommendations

  • Future studies should investigate the generalizability of the ensemble methodology and metric-based approach to other domains and text types.
  • Researchers should explore ways to improve the robustness and reliability of LLM models, reducing the ensemble's sensitivity to run-to-run performance variability.
