
FMI@SU ToxHabits: Evaluating LLMs Performance on Toxic Habit Extraction in Spanish Clinical Texts

arXiv:2604.06403v1 Abstract: The paper presents an approach for the recognition of toxic habit named entities in Spanish clinical texts. The approach was developed for the ToxHabits Shared Task. Our team participated in subtask 1, which aims to detect substance use and abuse mentions in clinical case reports and classify them into four categories (Tobacco, Alcohol, Cannabis, and Drug). We explored various methods of utilizing LLMs for the task, including zero-shot, few-shot, and prompt optimization, and found that few-shot prompting with GPT-4.1 performed best in our experiments. Our method achieved an F1 score of 0.65 on the test set, demonstrating a promising result for recognizing named entities in languages other than English.

Sylvia Vassileva, Ivan Koychev, Svetla Boytcheva

Executive Summary

This paper, "FMI@SU ToxHabits: Evaluating LLMs Performance on Toxic Habit Extraction in Spanish Clinical Texts," details an investigation into the efficacy of Large Language Models (LLMs) for identifying and categorizing mentions of substance use and abuse within Spanish clinical case reports. Participating in the ToxHabits Shared Task, the research focused on subtask 1, aiming to detect Tobacco, Alcohol, Cannabis, and Drug use. The authors experimented with zero-shot, few-shot, and prompt optimization techniques, ultimately identifying few-shot prompting with GPT-4.1 as the most effective strategy. Achieving an F1 score of 0.65 on the test set, the study underscores the potential of LLMs for named entity recognition in non-English clinical contexts, particularly in the challenging domain of health informatics.

Key Points

  • The study addresses the critical need for automated detection of toxic habits (substance use/abuse) in Spanish clinical texts.
  • It evaluates various LLM prompting strategies (zero-shot, few-shot, prompt optimization) for named entity recognition (NER) in a specialized, non-English domain.
  • Few-shot prompting with GPT-4.1 demonstrated the highest performance, achieving an F1 score of 0.65 for classifying substance use into four categories.
  • The research contributes to the growing body of literature on applying LLMs to clinical natural language processing (NLP) tasks, particularly in languages other than English.
  • The work was conducted as part of the ToxHabits Shared Task, providing a standardized evaluation framework.
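The paper's actual prompts are not given in the abstract. As a purely illustrative sketch of the few-shot setup described above, a prompt for this four-category NER task could be assembled and its output parsed roughly as follows (the instruction wording, example snippets, and `mention|category` output format are all hypothetical, not the authors'):

```python
# Hypothetical sketch of few-shot prompt construction and output parsing
# for toxic-habit NER; not the authors' actual prompts or format.

CATEGORIES = ("Tobacco", "Alcohol", "Cannabis", "Drug")

# Hypothetical few-shot examples: Spanish clinical snippets with annotations.
FEW_SHOT = [
    ("Fumador de 20 cigarrillos al dia.", [("fumador", "Tobacco")]),
    ("Niega consumo de alcohol.", [("alcohol", "Alcohol")]),
]

def build_prompt(text: str) -> str:
    """Assemble an instruction, labeled examples, and the target text."""
    lines = [
        "Extract substance-use mentions from the Spanish clinical text.",
        f"Label each mention with one of: {', '.join(CATEGORIES)}.",
        "Answer with one 'mention|category' pair per line.",
        "",
    ]
    for example, annotations in FEW_SHOT:
        lines.append(f"Text: {example}")
        lines += [f"{span}|{cat}" for span, cat in annotations]
        lines.append("")
    lines.append(f"Text: {text}")
    return "\n".join(lines)

def parse_response(response: str) -> list[tuple[str, str]]:
    """Parse 'mention|category' lines, keeping only valid categories."""
    entities = []
    for line in response.splitlines():
        if "|" in line:
            span, _, cat = line.partition("|")
            if cat.strip() in CATEGORIES:
                entities.append((span.strip(), cat.strip()))
    return entities
```

In a pipeline like the one the paper describes, the assembled prompt would be sent to the model (GPT-4.1 in the authors' best run) and the returned text passed through a parser of this kind before evaluation.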

Merits

Addressing a Critical Clinical Need

Accurate identification of toxic habits is paramount for patient care, public health initiatives, and epidemiological research. Automating this process in clinical notes offers significant efficiency gains.

Focus on Non-English Clinical Text

The study's specific focus on Spanish clinical texts is highly valuable, as much of the cutting-edge NLP research is English-centric. This helps bridge a significant linguistic and cultural gap in health informatics.

Systematic LLM Evaluation

The exploration of different prompting techniques (zero-shot, few-shot, prompt optimization) provides a structured understanding of LLM capabilities and limitations in this specific NER task.

Participation in a Shared Task

Engaging with the ToxHabits Shared Task ensures a rigorous, benchmarked evaluation against other approaches, lending credibility and comparability to the results.

Demerits

Moderate F1 Score

While 'promising,' an F1 score of 0.65, particularly in a high-stakes domain like clinical text, suggests there is substantial room for improvement before real-world deployment. Both false positives and false negatives could carry clinical consequences.
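For context, entity-level F1 is the harmonic mean of precision and recall over predicted versus gold mentions. A minimal sketch of an exact-match version of this metric, assuming mentions are compared as (span, category) pairs (the shared task's own matching criteria may differ):

```python
# Sketch of exact-match entity-level F1 over (span, category) pairs;
# the ToxHabits evaluation script may use different matching rules.

def entity_f1(gold: set, predicted: set) -> float:
    """F1 = 2PR / (P + R) with exact matching of (span, category) pairs."""
    if not gold or not predicted:
        return 0.0
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted)  # correct / all predicted
    recall = true_positives / len(gold)          # correct / all gold
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Under exact matching, a mention with the right span but the wrong category counts as both a false positive and a false negative, which is one reason scores in this range leave meaningful headroom.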

Dependence on Proprietary Models

The reliance on GPT-4.1, a proprietary model, introduces concerns regarding cost, data privacy (especially with sensitive clinical data), reproducibility, and long-term accessibility/control for clinical institutions.

Limited Detail on Prompt Engineering

The abstract provides high-level information on prompt optimization but lacks granular details regarding the specific prompts, number of examples in few-shot, and iterative refinement processes. This limits replication and deeper understanding.

Lack of Error Analysis Specifics

The abstract does not delve into specific types of errors made by the model (e.g., distinguishing between mention of past vs. current use, family history, or negations), which would be crucial for future improvements.

Expert Commentary

This study makes a commendable contribution to the burgeoning field of clinical NLP, particularly by venturing into the under-researched domain of Spanish clinical texts. The F1 score of 0.65, while not indicative of clinical readiness for autonomous deployment, certainly demonstrates a 'promising' trajectory for LLMs in this challenging NER task. The systematic evaluation of prompting strategies is a strength, offering valuable insights into current best practices for LLM utilization. However, the reliance on GPT-4.1 raises critical legal and ethical questions regarding data governance, intellectual property, and patient confidentiality. Future work must rigorously address the nuances of de-identification and the potential for re-identification, especially when data leaves institutional control. Furthermore, the moderate performance necessitates detailed error analysis to understand the specific linguistic and contextual challenges in Spanish clinical narratives, paving the way for more robust, perhaps hybrid, approaches. The legal and regulatory landscape for AI in healthcare is rapidly evolving, and studies like this underscore the urgent need for clear guidelines on accountability and transparency when deploying such powerful, yet imperfect, tools.

Recommendations

  • Conduct a comprehensive error analysis to identify specific patterns of false positives and false negatives, informing targeted model improvements or data augmentation strategies.
  • Explore fine-tuning smaller, open-source LLMs on domain-specific Spanish clinical data to mitigate privacy concerns, reduce costs, and enhance institutional control and reproducibility.
  • Investigate hybrid approaches combining LLMs with traditional rule-based systems or domain-specific embeddings to improve accuracy and potentially offer greater interpretability.
  • Detail the specific prompt engineering techniques, few-shot examples, and data de-identification protocols used, ensuring transparency and facilitating replication by the research community.
  • Address the ethical implications of using LLMs for sensitive health data, focusing on bias mitigation, data security, and developing a clear framework for clinician oversight and accountability.

Sources

Original: arXiv - cs.CL