
Modeling the human lexicon under temperature variations: linguistic factors, diversity and typicality in LLM word associations


Maria Andueza Rodriguez, Marie Candito, Richard Huyghe

arXiv:2603.18171v1

Abstract: Large language models (LLMs) achieve impressive results in terms of fluency in text generation, yet the nature of their linguistic knowledge - in particular the human-likeness of their internal lexicon - remains uncertain. This study compares human and LLM-generated word associations to evaluate how accurately models capture human lexical patterns. Using English cue-response pairs from the SWOW dataset and newly generated associations from three LLMs (Mistral-7B, Llama-3.1-8B, and Qwen-2.5-32B) across multiple temperature settings, we examine (i) the influence of lexical factors such as word frequency and concreteness on cue-response pairs, and (ii) the variability and typicality of LLM responses compared to human responses. Results show that all models mirror human trends for frequency and concreteness but differ in response variability and typicality. Larger models such as Qwen tend to emulate a single "prototypical" human participant, generating highly typical but minimally variable responses, while smaller models such as Mistral and Llama produce more variable yet less typical responses. Temperature settings further influence this trade-off, with higher values increasing variability but decreasing typicality. These findings highlight both the similarities and differences between human and LLM lexicons, emphasizing the need to account for model size and temperature when probing LLM lexical representations.

Executive Summary

This study investigates the nature of linguistic knowledge within large language models (LLMs) by comparing human and LLM-generated word associations. The researchers examine the influence of lexical factors such as word frequency and concreteness on cue-response pairs and evaluate the variability and typicality of LLM responses compared to human responses. The study finds that while LLMs mirror human trends for frequency and concreteness, they differ in response variability and typicality, with larger models producing more typical but less variable responses. The findings highlight the importance of considering model size and temperature when probing LLM lexical representations. This research has significant implications for the development and evaluation of LLMs, particularly in applications where linguistic accuracy and nuance are critical.

Key Points

  • LLMs mirror human trends for frequency and concreteness in word associations, but differ in response variability and typicality.
  • Larger models tend to produce more typical but less variable responses, while smaller models produce more variable yet less typical responses.
  • Temperature settings influence the trade-off between variability and typicality in LLM responses.
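The temperature effect in the last point follows from how sampling temperature rescales a model's output distribution before a token is drawn: dividing logits by a temperature below 1 sharpens the distribution toward the single most typical response, while a temperature above 1 flattens it and spreads probability over rarer associations. A minimal sketch of plain temperature-scaled softmax sampling (a generic illustration, not any specific model's decoding code):

```python
import math
import random

def sample_with_temperature(logits, temperature, rng=random):
    """Draw an index from a categorical distribution over `logits`
    after dividing them by `temperature`.

    Low temperature -> near-deterministic, highly typical choices;
    high temperature -> more variable, less typical choices.
    """
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Inverse-CDF sampling over the normalized probabilities.
    r = rng.random()
    cumulative = 0.0
    for i, p in enumerate(probs):
        cumulative += p
        if r < cumulative:
            return i
    return len(probs) - 1  # guard against floating-point round-off

# With a strongly peaked logit vector and very low temperature,
# sampling effectively collapses onto the argmax.
choice = sample_with_temperature([5.0, 1.0, 0.0], temperature=0.01)
```

At temperature 0.01 the first index wins essentially every draw; at temperature 2.0 all three indices become plausible, which is exactly the variability-for-typicality trade that the study reports.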

Merits

Insights into LLM Lexical Representations

This study provides valuable insights into the nature of linguistic knowledge within LLMs, shedding light on the factors that influence their lexical representations.

Methodological Contributions

The researchers pair the large-scale SWOW human association norms with newly generated associations from three LLMs across multiple temperature settings, yielding a concrete, behavior-based method for measuring the variability and typicality of model lexicons.
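Variability and typicality can be operationalized in several ways; the toy metrics below (a hypothetical simplification for illustration, not the paper's exact definitions) show how a "prototypical" model can score high on typicality yet low on variability, and vice versa:

```python
from collections import Counter

def response_variability(responses):
    """Fraction of distinct responses: 1.0 means every response is unique."""
    return len(set(responses)) / len(responses)

def response_typicality(model_responses, human_responses):
    """Mean human-production probability of each model response to a cue:
    how often an average human gave the same association."""
    human_counts = Counter(human_responses)
    total = len(human_responses)
    return sum(human_counts[r] / total for r in model_responses) / len(model_responses)

# Toy cue "dog": humans mostly answer "cat".
human = ["cat", "cat", "cat", "bone", "bark"]

# A 'prototypical' model repeats the modal human answer: high typicality,
# low variability.
proto_model = ["cat", "cat", "cat"]

# A more exploratory model spreads its answers: high variability,
# lower typicality.
varied_model = ["cat", "bone", "leash"]
```

Under these toy definitions, `proto_model` scores typicality 0.6 with variability 1/3, while `varied_model` scores variability 1.0 with typicality about 0.27, mirroring the trade-off the study observes between larger and smaller models.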

Demerits

Limited Generalizability

The study is limited to a specific task and dataset, and the findings may not generalize to other applications or domains.

Need for Further Research

The study characterizes how model size and temperature shape LLM word associations, but further work is needed to establish what these patterns imply for LLM development and evaluation.

Expert Commentary

This study offers a nuanced picture of the linguistic knowledge encoded in LLMs, highlighting both similarities to and differences from the human lexicon. Using freely generated word associations to measure variability and typicality gives a reusable, behavior-based framework for probing lexical representations, with clear relevance to applications where linguistic accuracy and nuance are critical. The main caveat is scope: the results rest on English cues from a single dataset (SWOW) and three models, so the size- and temperature-dependent trade-off between typicality and variability should be confirmed on other languages, datasets, and model families before it is used to guide LLM development and evaluation.

Recommendations

  • Future research should focus on investigating the implications of LLM lexical representations for applications where linguistic accuracy and nuance are critical.
  • LLM developers and evaluators should consider model size and temperature when probing LLM lexical representations.
