
Evaluating the Usage of African-American Vernacular English in Large Language Models

Deja Dunlap, R. Thomas McCoy

arXiv:2602.21485v1 Announce Type: new Abstract: In AI, most evaluations of natural language understanding tasks are conducted in standardized dialects such as Standard American English (SAE). In this work, we investigate how accurately large language models (LLMs) represent African American Vernacular English (AAVE). We analyze three LLMs to compare their usage of AAVE to the usage of humans who natively speak AAVE. We first analyzed interviews from the Corpus of Regional African American Language and TwitterAAE to identify the typical contexts where people use AAVE grammatical features such as ain't. We then prompted the LLMs to produce text in AAVE and compared the model-generated text to human usage patterns. We find that, in many cases, there are substantial differences between AAVE usage in LLMs and humans: LLMs usually underuse and misuse grammatical features characteristic of AAVE. Furthermore, through sentiment analysis and manual inspection, we found that the models replicated stereotypes about African Americans. These results highlight the need for more diversity in training data and the incorporation of fairness methods to mitigate the perpetuation of stereotypes.
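The comparison the abstract describes, measuring how often AAVE grammatical features appear in human corpora versus model-generated text, can be sketched with a simple rate comparison. The feature inventory and regular expressions below are illustrative assumptions, not the paper's actual method; the study's feature set (drawn from CORAAL and TwitterAAE) is far richer, and features like copula absence cannot be reliably detected with regexes alone.

```python
import re
from collections import Counter

# Hypothetical AAVE grammatical markers; a stand-in for the paper's
# actual feature inventory derived from CORAAL and TwitterAAE.
FEATURES = {
    "aint": r"\bain'?t\b",          # negator "ain't"
    "habitual_be": r"\bbe\s+\w+ing\b",  # habitual "be" + V-ing
}

def feature_rates(texts):
    """Return each feature's frequency per 1,000 words across texts."""
    total_words = sum(len(t.split()) for t in texts)
    counts = Counter()
    for name, pattern in FEATURES.items():
        counts[name] = sum(
            len(re.findall(pattern, t, re.IGNORECASE)) for t in texts
        )
    return {name: 1000 * n / total_words for name, n in counts.items()}

# Toy stand-ins for human AAVE speech and model output.
human = ["He ain't going nowhere.", "She be working late most nights."]
model = ["He is not going anywhere.", "She works late most nights."]

print(feature_rates(human))  # nonzero rates for both features
print(feature_rates(model))  # zero rates: the "model" underuses the features
```

Underuse, in this framing, shows up as the model's per-1,000-word rate falling well below the human rate for a given feature; misuse would require a further check that each detected occurrence matches the grammatical contexts humans actually use it in.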

Executive Summary

The article 'Evaluating the Usage of African-American Vernacular English in Large Language Models' investigates the representation of African American Vernacular English (AAVE) in large language models (LLMs). The study compares the usage of AAVE in three LLMs to human usage patterns, finding that LLMs often underuse and misuse grammatical features characteristic of AAVE. Additionally, the models were found to replicate stereotypes about African Americans. The authors highlight the need for more diverse training data and the incorporation of fairness methods to mitigate these issues.

Key Points

  • LLMs underuse and misuse grammatical features of AAVE compared to human usage.
  • Sentiment analysis and manual inspection revealed that LLMs replicate stereotypes about African Americans.
  • The study emphasizes the need for more diverse training data and fairness methods in AI development.

Merits

Comprehensive Analysis

The study provides a thorough analysis of AAVE usage in LLMs, comparing model outputs to human usage patterns from diverse sources.

Identification of Stereotypes

The identification of replicated stereotypes in LLMs is a significant contribution, highlighting an important ethical concern in AI.

Demerits

Limited Scope

The study focuses on only three LLMs, which may not be representative of the broader landscape of AI models.

Generalizability

The findings may not be generalizable to all contexts where AAVE is used, as the study relies on specific corpora.

Expert Commentary

The study 'Evaluating the Usage of African-American Vernacular English in Large Language Models' offers a critical examination of how well AI models represent AAVE. The findings are concerning: LLMs not only underuse and misuse grammatical features of AAVE but also replicate harmful stereotypes. This underscores the urgent need for more diverse training data and the implementation of fairness methods in AI development. The study's emphasis on the ethical implications of AI is particularly noteworthy, as it highlights the potential for AI to perpetuate and amplify existing biases. However, the study's limitations, such as its focus on only three LLMs and the limited generalizability of its findings beyond the corpora examined, should be acknowledged. Future research should expand the scope to include a broader range of AI models and contexts to provide a more comprehensive understanding of AAVE representation in AI. Additionally, collaboration among AI developers, linguists, and sociologists could yield more nuanced and accurate representations of vernacular varieties in AI models.

Recommendations

  • Expand the study to include a broader range of LLMs and contexts to enhance the generalizability of findings.
  • Incorporate diverse training data and fairness methods in AI development to better represent AAVE and mitigate stereotypes.
