Far Out: Evaluating Language Models on Slang in Australian and Indian English
arXiv:2602.15373v1 Announce Type: new Abstract: Language models exhibit systematic performance gaps when processing text in non-standard language varieties, yet their ability to comprehend variety-specific slang remains underexplored for several languages. We present a comprehensive evaluation of slang awareness in Indian English (en-IN) and Australian English (en-AU) across seven state-of-the-art language models. We construct two complementary datasets: \textsc{web}, containing 377 web-sourced usage examples from Urban Dictionary, and \textsc{gen}, featuring 1,492 synthetically generated usages of these slang terms across diverse scenarios. We assess language models on three tasks: target word prediction (TWP), guided target word prediction (TWP$^*$) and target word selection (TWS). Our results reveal three key findings: (1) higher average model performance on TWS versus TWP and TWP$^*$, with average accuracy increasing from 0.03 to 0.49; (2) stronger average model performance on the \textsc{web} versus the \textsc{gen} dataset, with average similarity scores increasing by 0.03 and 0.05 on the TWP and TWP$^*$ tasks respectively; (3) stronger performance on en-IN than en-AU tasks when averaged across all models and datasets, with TWS showing the largest disparity, average accuracy increasing from 0.44 to 0.54. These findings underscore fundamental asymmetries between generative and discriminative competencies for variety-specific language, particularly slang expressions, even in a technologically rich language such as English.
Executive Summary
The article 'Far Out: Evaluating Language Models on Slang in Australian and Indian English' investigates the performance of seven state-of-the-art language models in understanding slang specific to Indian English (en-IN) and Australian English (en-AU). The study employs two datasets: web-sourced examples from Urban Dictionary and synthetically generated slang usages. The evaluation tasks include target word prediction (TWP), guided target word prediction (TWP*), and target word selection (TWS). The findings reveal that models perform better on TWS tasks, on web-sourced data, and on en-IN slang compared to en-AU slang. These results highlight the challenges language models face in comprehending variety-specific slang, even within the same language.
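To make the generative/discriminative distinction concrete, the sketch below contrasts the two task formats in miniature. The scoring functions, the prompt format, and the en-AU example term are illustrative assumptions, not the paper's actual evaluation protocol: TWP is scored here with a simple string-similarity ratio, while TWS is scored as exact-match accuracy over a candidate list.

```python
# Hypothetical sketch of the evaluation formats described above.
# Prompts, candidates, and scoring choices are illustrative assumptions,
# not the paper's actual protocol.
from difflib import SequenceMatcher


def score_twp(prediction: str, target: str) -> float:
    """Target word prediction: the model generates the masked slang term
    with no options given; scored here with a string-similarity ratio."""
    return SequenceMatcher(None, prediction.lower(), target.lower()).ratio()


def score_tws(choice: str, target: str) -> float:
    """Target word selection: the model picks one term from a candidate
    list; scored as exact-match accuracy (1.0 or 0.0)."""
    return 1.0 if choice.lower() == target.lower() else 0.0


# Toy en-AU example: the slang term "arvo" (afternoon) is masked.
context = "See you this ___, mate."
candidates = ["arvo", "servo", "smoko"]  # TWS offers options; TWP does not
target = "arvo"

print(score_twp("avro", target))  # a near-miss generation earns partial credit
print(score_tws("arvo", target))  # a correct selection scores 1.0
```

This toy setup mirrors the asymmetry the review highlights: a model that cannot generate "arvo" unprompted may still pick it out of a short candidate list, so selection accuracy can far exceed prediction scores.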
Key Points
- ▸ Models perform better on TWS tasks compared to TWP and TWP* tasks.
- ▸ Performance is higher on web-sourced data compared to synthetically generated data.
- ▸ Models show better performance on en-IN slang tasks compared to en-AU slang tasks.
Merits
Comprehensive Evaluation
The study provides a thorough evaluation of language models' performance on slang in two distinct English varieties, using both real-world and synthetic data.
Methodological Rigor
The use of multiple evaluation tasks and datasets enhances the robustness of the findings.
Demerits
Limited Scope
The study focuses only on two English varieties, limiting the generalizability of the findings to other non-standard language varieties.
Dataset Bias
The reliance on Urban Dictionary for web-sourced data may introduce bias, as the platform may not represent the full spectrum of slang usage.
Expert Commentary
The article 'Far Out: Evaluating Language Models on Slang in Australian and Indian English' sheds light on a critical yet underexplored aspect of language model performance: the ability to comprehend variety-specific slang. The study's findings are significant as they reveal fundamental asymmetries between the models' generative and discriminative competencies. The better performance on TWS tasks suggests that models are more adept at selecting the correct slang term when presented with options, but struggle to predict slang terms in context. The higher performance on web-sourced data indicates that real-world usage examples may be more effective in training models to understand slang. The disparity in performance between en-IN and en-AU slang tasks highlights the need for more inclusive training data that represents the linguistic diversity within a single language. The study's limitations, such as the reliance on Urban Dictionary and the focus on only two English varieties, should be addressed in future research to provide a more comprehensive understanding of language model performance on non-standard language varieties.
Recommendations
- ✓ Future studies should expand the scope to include a broader range of non-standard language varieties to enhance the generalizability of the findings.
- ✓ Researchers should explore the use of more diverse and representative datasets for training and evaluating language models to improve their performance on variety-specific slang.