
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track - ACL Anthology


Saloni Potdar, Lina Rojas-Barahona, Sebastien Montella (Editors)
Anthology ID: 2025.emnlp-industry
Month: November
Year: 2025
Address: Suzhou, China
Venue: EMNLP
Publisher: Association for Computational Linguistics
URL: https://aclanthology.org/2025.emnlp-industry/
DOI: 10.18653/v1/2025.emnlp-industry
ISBN: 979-8-89176-333-3
PDF: https://aclanthology.org/2025.emnlp-industry.pdf

RAVEN++: Pinpointing Fine-Grained Violations in Advertisement Videos with Active Reinforcement Reasoning
Deyi Ji | Yuekui Yang | Liqun Liu | Peng Shu | Haiyang Wu | Shaogang Tang | Xudong Chen | Shaoping Ma | Tianrun Chen | Lanyun Zhu
Advertising (Ad) is a cornerstone of the digital economy, yet the moderation of video advertisements remains a significant challenge due to their complexity and the need for precise violation localization. While recent advancements, such as the RAVEN model, have improved coarse-grained violation detection, critical gaps persist in fine-grained understanding, explainability, and generalization. To address these limitations, we propose RAVEN++, a novel framework that introduces three key innovations: 1) Active Reinforcement Learning (RL), which dynamically adapts training to samples of varying difficulty; 2) Fine-Grained Violation Understanding, achieved through hierarchical reward functions and reasoning distillation; and 3) Progressive Multi-Stage Training, which systematically combines knowledge injection, curriculum-based passive RL, and active RL. Extensive experiments on both public and proprietary datasets, in both offline scenarios and online deployed A/B testing, demonstrate that RAVEN++ outperforms general-purpose LLMs and specialized models like RAVEN in terms of fine-grained violation understanding, reasoning capabilities, and generalization ability.

SAGE: A Generic Framework for LLM Safety Evaluation
Madhur Jindal | Hari Shrawgi | Parag Agrawal | Sandipan Dandapat
As Large Language Models are rapidly deployed across diverse applications, from healthcare to financial advice, safety evaluation struggles to keep pace. Current benchmarks focus on single-turn interactions with generic policies, failing to capture the conversational dynamics of real-world usage and the application-specific harms that emerge in context. Such oversights can lead to harms that go unnoticed by standard safety benchmarks and other current evaluation methodologies. To address these needs, we introduce SAGE (Safety AI Generic Evaluation), an automated modular framework designed for customized and dynamic harm evaluations. SAGE employs prompted adversarial agents with diverse personalities based on the Big Five model, enabling system-aware multi-turn conversations that adapt to target applications and harm policies. We evaluate seven state-of-the-art LLMs across three applications and harm policies. Multi-turn experiments show that harm increases with conversation length, that model behavior varies significantly when exposed to different user personalities and scenarios, and that some models minimize harm via high refusal rates that reduce usefulness. We also demonstrate policy sensitivity within a harm category: tightening a child-focused sexual policy substantially increases measured defects across applications. These results motivate adaptive, policy-aware, and context-specific testing for safer real-world deployment.
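The SAGE abstract describes persona-conditioned adversarial agents holding multi-turn conversations with a target system while harms are tallied. A minimal sketch of that evaluation loop follows; all three roles (adversary, target, judge) are stubbed toy functions here, whereas in the actual framework each would be a prompted LLM, and every name in the sketch is illustrative rather than taken from the paper.

```python
# Hypothetical sketch of a SAGE-style multi-turn harm probe: an adversarial
# "user" agent with a persona converses with a target model, and a judge
# flags harmful target turns. All three roles are stubbed callables here.
from dataclasses import dataclass, field

@dataclass
class Transcript:
    turns: list = field(default_factory=list)   # (role, text) pairs
    defects: int = 0                            # count of harmful target turns

def run_probe(adversary, target, judge, persona: str, n_turns: int) -> Transcript:
    t = Transcript()
    for _ in range(n_turns):
        user_msg = adversary(persona, t.turns)  # persona-conditioned attack turn
        t.turns.append(("user", user_msg))
        reply = target(t.turns)                 # system under test sees full history
        t.turns.append(("assistant", reply))
        if judge(reply):                        # policy-specific harm check
            t.defects += 1
    return t

# Toy stand-ins: the adversary escalates each turn, the target slips once the
# conversation gets long, and the judge string-matches a marker phrase.
adversary = lambda persona, hist: f"[{persona}] escalation {len(hist) // 2}"
target = lambda hist: "UNSAFE details" if len(hist) >= 5 else "I can't help with that."
judge = lambda reply: "UNSAFE" in reply

result = run_probe(adversary, target, judge, persona="impulsive", n_turns=4)
print(result.defects)  # → 2
```

Even this toy run reproduces the paper's headline observation in miniature: the defect count grows with conversation length, which is why single-turn benchmarks can miss such failures.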
CRAB: A Benchmark for Evaluating Curation of Retrieval-Augmented LLMs in Biomedicine
Hanmeng Zhong | Linqing Chen | Wentao Wu | Weilei Wang
Recent developments in Retrieval-Augmented Large Language Models (LLMs) have shown great promise in biomedical applications. However, a critical gap persists in reliably evaluating their curation ability—the process by which models select and integrate relevant references while filtering out noise. To address this, we introduce the benchmark for Curation of Retrieval-Augmented LLMs in Biomedicine (CRAB), the first multilingual benchmark tailored for evaluating the biomedical curation of retrieval-augmented LLMs, available in English, French, German, and Chinese. By incorporating a novel citation-based evaluation metric, CRAB quantifies the curation performance of retrieval-augmented LLMs in biomedicine. Experimental results reveal significant discrepancies in the curation performance of mainstream LLMs, underscoring the urgent need to improve it in the biomedical domain.

VENUS: A VLLM-driven Video Content Discovery System for Real Application Scenarios
Minyi Zhao | Yi Liu | Jianfeng Wen | Boshen Zhang | Hailang Chang | Zhiheng Ouyang | Jie Wang | Wensong He | Shuigeng Zhou
Video Content Discovery (VCD) aims to identify the specific videos defined by a pre-specified text policy (or constraint), and plays a crucial role in building a healthy, high-quality Web content ecology. Related works typically employ multiple classifiers or similarity-based systems to support VCD. However, these approaches are difficult to manage, lack generalization power, and suffer from low performance. To tackle these problems, this paper presents a new Vision-Language Large Model (VLLM)-driven VCD system called VENUS (short for Video contENt UnderStander). Concretely, we first develop an automatic policy-guided sequential annotator (APSA) to generate high-quality, VCD-specific, reasoning-equipped instruction-tuning data for model training, then extend the VLLM inference to better support VCD. Following that, we construct a real VCD test set called VCD-Bench, which covers 13 policies and 57K videos. Furthermore, to evaluate its practical efficacy, we deploy VENUS in three different real scenarios. Extensive experiments on both VCD-Bench and public evaluation datasets for various VCD-related tasks demonstrate the superiority of VENUS over existing baselines.

FT-MDT: Extracting Decision Trees from Medical Texts via a Novel Low-rank Adaptation Method
Yuheng Li | Jiechao Gao | Wei Han | Wenwen Ouyang | Wei Zhu | Hui Yi Leong
Knowledge of the medical decision process, which can be modeled as medical decision trees (MDTs), is critical to building clinical decision support systems. However, current MDT construction methods rely heavily on time-consuming and laborious manual annotation. To address this challenge, we propose PI-LoRA (Path-Integrated LoRA), a novel low-rank adaptation method for automatically extracting MDTs from clinical guidelines and textbooks. We integrate gradient path information to capture synergistic effects between different modules, enabling more effective and reliable rank allocation. This framework ensures that the most critical modules receive appropriate rank allocations while less important ones are pruned, resulting in a more efficient and accurate model for extracting medical decision trees from clinical texts. Extensive experiments on medical guideline datasets demonstrate that PI-LoRA significantly outperforms existing parameter-efficient fine-tuning approaches on the Text2MDT task, achieving better accuracy with substantially reduced model complexity. The proposed method achieves state-of-the-art results while maintaining a lightweight architecture, making it particularly suitable for clinical decision support systems where computational resources may be limited.

PolyNorm: Few-Shot LLM-Based Text Normalization for Text-to-Speech
Michel Wong | Ali Alshehri | Sophia Kao | Haotian He
Text Normalization (TN) is a key preprocessing step in Text-to-Speech (TTS) systems, converting written forms into their canonical spoken equivalents. Traditional TN systems can achieve high accuracy, but they require substantial engineering effort, are difficult to scale, and pose challenges for language coverage, particularly in low-resource settings. We propose PolyNorm, a prompt-based approach to TN using Large Language Models (LLMs), aiming to reduce reliance on manually crafted rules and enable broader linguistic applicability with minimal human intervention. Additionally, we present a language-agnostic pipeline for automatic data curation and evaluation, designed to facilitate scalable experimentation across diverse languages. Experiments across eight languages show consistent reductions in word error rate (WER) compared to a production-grade system. To support further research, we release PolyNorm-Benchmark, a multilingual dataset covering a diverse range of text normalization phenomena.

Audio Query Handling System with Integrated Expert Models and Contextual Understanding
Naveen Vakada | Arvind Krishna Sridhar | Yinyi Guo | Erik Visser
This paper presents an audio chatbot system designed to handle a wide range of audio-related queries by integrating multiple specialized audio processing models. The system uses an intent classifier, trained on a diverse audio query dataset, to route queries about audio content to expert models such as Automatic Speech Recognition (ASR), Speaker Diarization, Music Identification, and Text-to-Audio generation. A novel audio intent classification dataset was developed to build the intent classifier. A 3.8B LLM then takes inputs from an Audio Context Detection (ACD) module, which extracts audio event information, and post-processes text-domain outputs from the expert models to compute the final response to the user. We evaluated the system on custom audio tasks and on the MMAU sound-set benchmark. The custom datasets were motivated by target use cases not covered in industry benchmarks: we propose the ACD-timestamp-QA (Question Answering) and ACD-temporal-QA datasets to evaluate timestamp and temporal reasoning questions, respectively. First, we found that a BERT-based intent classifier outperforms an LLM few-shot intent classifier in routing queries. Experiments further show that our approach significantly improves accuracy on some custom tasks compared to state-of-the-art Large Audio Language Models, and outperforms models in the 7B-parameter range on the sound test set of the MMAU benchmark, thereby offering an attractive option for on-device deployment.

Generative Reviewer Agents: Scalable Simulacra of Peer Review
Nicolas Bougie | Narimawa Watanabe
The peer review process is fundamental to scientific progress, determining which papers meet the quality standards for publication. Yet the rapid growth of scholarly production and increasing specialization in knowledge areas strain traditional scientific feedback mechanisms. In light of this, we introduce Generative Agent Reviewers (GAR), leveraging LLM-empowered agents to simulate faithful peer reviewers. To enable generative reviewers, we design an architecture that extends a large language model with memory capabilities and equips agents with reviewer personas derived from historical data. Our experiments demonstrate that GAR performs comparably to human reviewers in providing detailed feedback and predicting paper outcomes. Beyond performance comparison, we conduct further experiments, such as evaluating the impact of reviewer expertise and examining fairness in reviews. By offering early expert-level feedback, typically restricted to a limited group of researchers, GAR democratizes access to transparent and in-depth evaluation.

Aligning LLMs for Multilingual Consistency in Enterprise Applications
Amit Agarwal | Hansa Meghwani | Hitesh Laxmichand Patel | Tao Sheng | Sujith Ravi | Dan Roth
Large language models (LLMs) remain unreliable for global enterprise applications due to substantial performance gaps between high-resource and mid/low-resource languages, driven by English-centric pretraining and internal reasoning biases. This inconsistency undermines customer experience and operational reliability in multilingual settings such as customer support, content moderation, and information retrieval. Even with advanced Retrieval-Augmented Generation (RAG) systems, we observe up to a 29% accuracy drop in non-English languages compared to English. We propose a practical, batch-wise alignment strategy for fine-tuning LLMs, leveraging semantically equivalent multilingual data in each training batch to directly align model outputs across languages. This approach improves non-English accuracy by up to 23.9% without compromising English performance, model reasoning, or retrieval quality. Our method is simple to implement, scalable, and integrates seamlessly with existing LLM training and deployment pipelines, enabling more robust and equitable multilingual AI solutions in industry.
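The multilingual-alignment abstract above hinges on one data-engineering idea: packing semantically equivalent examples across languages into the same training batch so a consistency objective can tie their outputs together. A minimal sketch of that batch construction follows; the parallel corpus, field names, and batching scheme are illustrative assumptions, not details from the paper.

```python
# Hedged sketch of batch-wise multilingual alignment data construction:
# each batch keeps all translations of an item together, tagged with a
# group id that a downstream consistency loss could use to align outputs.
from itertools import islice

def aligned_batches(parallel_corpus, items_per_batch):
    """parallel_corpus: list of dicts mapping language code -> same-meaning text.
    Yields batches of (lang, text, group_id) tuples; every translation of a
    given item lands in the same batch, never split across batches."""
    it = iter(enumerate(parallel_corpus))
    while chunk := list(islice(it, items_per_batch)):
        yield [(lang, text, gid)
               for gid, entry in chunk
               for lang, text in entry.items()]

# Illustrative parallel corpus (customer-support style, three languages).
corpus = [
    {"en": "Where is my order?", "de": "Wo ist meine Bestellung?", "hi": "मेरा ऑर्डर कहाँ है?"},
    {"en": "Cancel my ticket.", "de": "Storniere mein Ticket.", "hi": "मेरा टिकट रद्द करें।"},
]
batches = list(aligned_batches(corpus, items_per_batch=2))
# One batch of 6 examples; shared group_id marks which rows must agree.
```

The key design choice the abstract implies is that alignment happens inside the batch rather than via a separate distillation stage, which is why it can bolt onto an existing fine-tuning pipeline.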
RCI: A Score for Evaluating Global and Local Reasoning in Multimodal Benchmarks
Amit Agarwal | Hitesh Laxmichand Patel | Srikant Panda | Hansa Meghwani | Jyotika Singh | Karan Dua | Paul Li | Tao Sheng | Sujith Ravi | Dan Roth
Multimodal Large Language Models (MLLMs) have achieved impressive results on vision-language benchmarks, yet it remains unclear whether these benchmarks assess genuine global reasoning or allow success via localized visual cues. Existing evaluation methods do not explicitly measure this distinction, hindering effective dataset curation and real-world-focused model development. We introduce the Region Comprehension Index (RCI), the first model-based score to directly quantify a dataset's reliance on global versus local visual information. RCI systematically compares reference-model performance on image patches versus full images, revealing whether tasks require holistic image understanding or can be solved with partial or localized visual cues. Applying RCI to 13 widely used multimodal benchmarks, we observe that most favor localized reasoning and exhibit significant spatial biases, indicating potential risks in real-world applications. RCI equips researchers and practitioners with an actionable tool for diagnosing and mitigating these biases, enabling the construction of datasets and benchmarks that foster the development of robust, enterprise-ready multimodal systems.

LP Data Pipeline: Lightweight, Purpose-driven Data Pipeline for Large Language Models
Yungi Kim | Hyunsoo Ha | Seonghoon Yang | Sukyung Lee | Jihoo Kim | Chanjun Park
Creating high-quality, large-scale datasets for large language models (LLMs) often relies on resource-intensive, GPU-accelerated models for quality filtering, making the process time-consuming and costly. This dependence on GPUs limits accessibility for organizations lacking significant computational infrastructure. To add
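The RCI entry above describes comparing a reference model's performance on image patches versus full images to diagnose whether a benchmark rewards local shortcuts. A minimal sketch of such a diagnostic follows; the exact formula the paper uses is not given in the abstract, so this version, full-image accuracy minus best-single-patch accuracy, is an assumption chosen for illustration.

```python
# Illustrative RCI-style diagnostic: how much of a benchmark's accuracy
# actually requires the full image, versus what a single patch suffices for.
# Scores near zero suggest the tasks can be solved from localized cues alone.
def rci_score(full_correct, patch_correct):
    """full_correct: per-example 0/1 correctness on full images.
    patch_correct: per-example list of 0/1 correctness, one entry per patch."""
    n = len(full_correct)
    full_acc = sum(full_correct) / n
    # An example counts as patch-solvable if any single patch suffices.
    patch_acc = sum(max(p) for p in patch_correct) / n
    return full_acc - patch_acc

# Toy benchmark of 4 examples: 3 solved from full images, 2 solvable
# from some individual patch, so a quarter of accuracy is truly "global".
score = rci_score([1, 1, 1, 0], [[1, 0], [0, 0], [0, 1], [0, 0]])
print(round(score, 2))  # → 0.25
```

Under this reading, a benchmark where every question is answerable from a patch would score 0.0, matching the abstract's warning that such datasets favor localized reasoning.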

Executive Summary

The proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track feature applied NLP research, including RAVEN++ and SAGE. RAVEN++ introduces active reinforcement learning for fine-grained violation detection in advertisement videos, while SAGE provides a generic, multi-turn framework for Large Language Model safety evaluation. These contributions address significant industry challenges, such as precise violation localization and application-aware AI safety testing.

Key Points

  • RAVEN++ proposes a novel framework for fine-grained violation detection in advertisement videos
  • SAGE introduces a modular framework for customized and dynamic harm evaluation of Large Language Models
  • The research highlights the need for robust AI safety evaluation and fine-grained understanding in natural language processing applications

Merits

Innovative Methodologies

The proposed frameworks demonstrate concrete methodological advances: RAVEN++ combines active reinforcement learning, hierarchical reward functions, and progressive multi-stage training for fine-grained violation detection, while SAGE uses persona-driven adversarial agents for multi-turn, policy-aware safety evaluation. Both report gains over strong baselines on public and proprietary evaluations.

Demerits

Limited Contextualization

The proposed frameworks would benefit from clearer positioning within the broader NLP landscape, including how RAVEN++ relates to existing content-moderation pipelines, how SAGE compares with established safety benchmarks, and how either would interact with other deployed models and applications.

Expert Commentary

The research in the 2025 EMNLP Industry Track proceedings reflects a shift toward nuanced, context-aware evaluation. RAVEN++ moves ad moderation from coarse detection toward fine-grained, explainable violation localization, while SAGE replaces single-turn, generic safety benchmarks with multi-turn, persona- and policy-aware testing. As deployed systems become more conversational and application-specific, evaluation methodologies of this kind will be essential for surfacing harms and failures that static benchmarks miss.

Recommendations

  • Further research should focus on integrating the proposed frameworks, RAVEN++ and SAGE, with other natural language processing models and applications to explore potential synergies and areas for improvement.
  • Policymakers and practitioners should prioritize the development of robust safety evaluation protocols and fine-grained understanding in natural language processing applications to mitigate potential risks and ensure more effective deployment.
