When Agents Persuade: Propaganda Generation and Mitigation in LLMs
arXiv:2603.04636v1 Announce Type: new Abstract: Despite their wide-ranging benefits, LLM-based agents deployed in open environments can be exploited to produce manipulative material. In this study, we task LLMs with propaganda objectives and analyze their outputs using two domain-specific models: one that classifies text as propaganda or non-propaganda, and another that detects rhetorical techniques of propaganda (e.g., loaded language, appeals to fear, flag-waving, name-calling). Our findings show that, when prompted, LLMs exhibit propagandistic behaviors and use a variety of rhetorical techniques in doing so. We also explore mitigation via Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and ORPO (Odds Ratio Preference Optimization). We find that fine-tuning significantly reduces their tendency to generate such content, with ORPO proving most effective.
arXiv:2603.04636v1 Announce Type: new Abstract: Despite their wide-ranging benefits, LLM-based agents deployed in open environments can be exploited to produce manipulative material. In this study, we task LLMs with propaganda objectives and analyze their outputs using two domain-specific models: one that classifies text as propaganda or non-propaganda, and another that detects rhetorical techniques of propaganda (e.g., loaded language, appeals to fear, flag-waving, name-calling). Our findings show that, when prompted, LLMs exhibit propagandistic behaviors and use a variety of rhetorical techniques in doing so. We also explore mitigation via Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and ORPO (Odds Ratio Preference Optimization). We find that fine-tuning significantly reduces their tendency to generate such content, with ORPO proving most effective.
Executive Summary
This study examines the potential of Large Language Models (LLMs) to generate propagandistic content and explores mitigation strategies. The findings indicate that LLMs can exhibit propagandistic behaviors when prompted and employ various rhetorical techniques. Fine-tuning methods, particularly Odds Ratio Preference Optimization (ORPO), significantly reduce the generation of such content. The research highlights the importance of addressing propaganda generation in LLMs to prevent potential misuse.
Key Points
- ▸ LLMs can generate propagandistic content when prompted
- ▸ LLMs use various rhetorical techniques, including loaded language and appeals to fear
- ▸ Fine-tuning methods, such as ORPO, can mitigate propaganda generation
Merits
Comprehensive Analysis
The study provides a thorough examination of LLMs' potential to generate propaganda and explores effective mitigation strategies.
Methodological Rigor
The use of domain-specific models and fine-tuning methods demonstrates a rigorous approach to investigating propaganda generation and mitigation.
Demerits
Limited Generalizability
The study's findings may not be generalizable to all LLMs or contexts, as the results are based on a specific set of models and prompts.
Lack of Human Evaluation
The reliance on automated models for classification and detection may not fully capture the nuances of human evaluation and judgment.
Expert Commentary
This study contributes significantly to our understanding of the potential risks and challenges associated with LLMs. The findings underscore the importance of addressing propaganda generation and mitigation in LLMs to prevent potential misuse. The use of fine-tuning methods, such as ORPO, offers a promising approach to mitigating these risks. However, further research is needed to fully explore the generalizability and effectiveness of these methods in various contexts. Ultimately, responsible AI development and deployment require careful consideration of the potential consequences of LLMs and proactive measures to mitigate potential harms.
Recommendations
- ✓ Develop and deploy LLMs with built-in mitigation strategies to prevent propaganda generation
- ✓ Conduct further research on the generalizability and effectiveness of fine-tuning methods in various contexts and applications