Which English Do LLMs Prefer? Triangulating Structural Bias Towards American English in Foundation Models
Abstract (arXiv:2604.04204v1)
Large language models (LLMs) are increasingly deployed in high-stakes domains, yet they expose only limited language settings, most notably "English (US)," despite the global diversity and colonial history of English. Through a postcolonial framing to explain the broader significance, we investigate how geopolitical histories of data curation, digital dominance, and linguistic standardization shape the LLM development pipeline. Focusing on two dominant standard varieties, American English (AmE) and British English (BrE), we construct a curated corpus of 1,813 AmE–BrE variants and introduce DiAlign, a dynamic, training-free method for estimating dialectal alignment using distributional evidence. We operationalize structural bias by triangulating evidence across three stages: (i) audits of six major pretraining corpora reveal systematic skew toward AmE, (ii) tokenizer analyses show that BrE forms incur higher segmentation costs, and (iii) generative evaluations show a persistent AmE preference in model outputs. To our knowledge, this is the first systematic and multi-faceted examination of dialectal asymmetries in standard English varieties across the phases of LLM development. We find that contemporary LLMs privilege AmE as the de facto norm, raising concerns about linguistic homogenization, epistemic injustice, and inequity in global AI deployment, while motivating practical steps toward more dialectally inclusive language technologies.
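Finding (ii), the tokenizer asymmetry, is easy to illustrate informally. The snippet below is a minimal sketch, not the paper's protocol: it compares subword counts for a few illustrative AmE/BrE spelling pairs using the Hugging Face GPT-2 tokenizer, treating a higher token count as a higher segmentation cost.

```python
# Minimal sketch: compare subword segmentation cost for AmE vs BrE spellings.
# The word pairs are illustrative examples, not the paper's 1,813-item corpus.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

# (AmE, BrE) spelling variant pairs
pairs = [
    ("color", "colour"),
    ("analyze", "analyse"),
    ("center", "centre"),
    ("defense", "defence"),
    ("traveling", "travelling"),
]

for ame, bre in pairs:
    # A leading space matters for GPT-2's byte-level BPE: word-initial tokens differ
    # from word-internal ones, so each form is tokenized as it would appear mid-sentence.
    ame_tokens = tokenizer.tokenize(" " + ame)
    bre_tokens = tokenizer.tokenize(" " + bre)
    print(f"{ame:>12}: {len(ame_tokens)} tokens | {bre:>12}: {len(bre_tokens)} tokens")
```

If BrE forms consistently split into more subwords, they incur a per-word token cost that compounds into longer sequences, higher inference cost, and less efficient use of the context window, which is the kind of structural penalty the abstract describes.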
Executive Summary
This study examines the structural bias of large language models (LLMs) towards American English (AmE) and its implications for linguistic homogenization, epistemic injustice, and inequity in AI deployment. The authors introduce DiAlign, a dynamic, training-free method for estimating dialectal alignment, and audit six major pretraining corpora to demonstrate a systematic skew towards AmE. Tokenizer analyses and generative evaluations further corroborate this AmE preference. The findings underscore the need for more dialectally inclusive language technologies and motivate practical steps towards rectifying the bias.
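As context for the corpus-audit stage, a dialect-skew audit can be approximated with simple relative-frequency counts over spelling variant pairs. The sketch below assumes a plain-text sample and a tiny illustrative pair list; `corpus_sample.txt` and the pairs are placeholders, not the paper's six audited corpora or its full variant lexicon.

```python
# Minimal sketch of a dialect-skew audit: count AmE vs BrE variant frequencies
# in a plain-text corpus sample. The file path and pair list are placeholders.
import re
from collections import Counter

variant_pairs = [("color", "colour"), ("organize", "organise"), ("center", "centre")]

with open("corpus_sample.txt", encoding="utf-8") as f:
    text = f.read().lower()

counts = Counter()
for ame, bre in variant_pairs:
    counts["AmE"] += len(re.findall(rf"\b{re.escape(ame)}\b", text))
    counts["BrE"] += len(re.findall(rf"\b{re.escape(bre)}\b", text))

total = counts["AmE"] + counts["BrE"]
if total:
    print(f"AmE share: {counts['AmE'] / total:.1%} | BrE share: {counts['BrE'] / total:.1%}")
```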
Key Points
- ▸ The study demonstrates a systematic skew towards American English (AmE) in pretraining corpora.
- ▸ DiAlign, a dynamic method for estimating dialectal alignment, reveals the AmE preference in LLMs.
- ▸ The bias towards AmE raises concerns about linguistic homogenization, epistemic injustice, and AI deployment inequity.
Merits
Strengths in methodology
The study introduces DiAlign, a novel training-free method for estimating dialectal alignment, and systematically audits six major pretraining corpora, providing robust evidence for the AmE bias.
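This summary does not spell out how DiAlign computes its alignment scores, so the sketch below is only a hypothetical illustration of the general idea of a training-free, distribution-based dialect probe: comparing a causal LM's log-likelihoods for minimal AmE/BrE variant pairs placed in a shared context. It is not the authors' DiAlign implementation.

```python
# Hypothetical sketch of a training-free, distribution-based dialect preference probe.
# This illustrates the general idea only; it is NOT the paper's DiAlign method.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def sequence_logprob(sentence: str) -> float:
    """Total log-probability the model assigns to a sentence."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels=input_ids, the model returns the mean token-level cross-entropy.
        loss = model(ids, labels=ids).loss
    return -loss.item() * (ids.shape[1] - 1)  # undo the mean over predicted tokens

context = "She asked him to {} the report by Friday."
for ame, bre in [("finalize", "finalise"), ("analyze", "analyse")]:
    lp_ame = sequence_logprob(context.format(ame))
    lp_bre = sequence_logprob(context.format(bre))
    preferred = "AmE" if lp_ame > lp_bre else "BrE"
    print(f"{ame}/{bre}: model prefers {preferred} (delta = {lp_ame - lp_bre:+.2f})")
```

Aggregating such per-pair preferences over many contexts and variant pairs would yield a corpus-level estimate of dialectal alignment of the kind the paper reports.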
Interdisciplinary approach
The study incorporates insights from postcolonial studies, linguistics, and computer science to highlight the geopolitical implications of LLM development and deployment.
Demerits
Limited scope of analysis
The study focuses solely on AmE and BrE, neglecting other standard English varieties, such as Canadian English, Australian English, and Indian English, which may also experience bias in LLMs.
No concrete path to more diverse training data
While the study highlights the importance of more dialectally inclusive language technologies, it does not offer concrete suggestions for sourcing or incorporating more diverse training data to mitigate the AmE bias.
Expert Commentary
This study makes a significant contribution by highlighting the structural bias of LLMs towards AmE and its far-reaching implications for linguistic homogenization, epistemic injustice, and inequity in AI deployment. The introduction of DiAlign and the systematic audit of pretraining corpora provide robust evidence for the AmE bias. However, the study's limited scope and its lack of concrete mitigation strategies point to the need for continued research in this area. The findings carry practical and policy implications, underscoring the importance of prioritizing linguistic diversity and inclusivity in AI design and deployment.
Recommendations
- ✓ Developing more inclusive language technologies that account for dialectal variation and linguistic diversity.
- ✓ Establishing standards for LLMs to promote dialectal awareness and equity in AI applications.
Sources
Original: arXiv - cs.CL