
Using Vision + Language Models to Predict Item Difficulty

arXiv:2603.04670v1. Abstract: This project investigates the capabilities of large language models (LLMs) to determine the difficulty of data visualization literacy test items. We explore whether features derived from item text (question and answer options), the visualization image, or a combination of both can predict item difficulty (proportion of correct responses) for U.S. adults. We use GPT-4.1-nano to analyze items and generate predictions based on these distinct feature sets. The multimodal approach, using both visual and text features, yields the lowest mean absolute error (MAE) (0.224), outperforming the unimodal vision-only (0.282) and text-only (0.338) approaches. The best-performing multimodal model was applied to a held-out test set for external evaluation and achieved a mean squared error of 0.10805, demonstrating the potential of LLMs for psychometric analysis and automated item development.

Samin Khan


Executive Summary

This article summarizes a study on using large language models (LLMs) to predict item difficulty, measured as the proportion of correct responses, in data visualization literacy tests. A multimodal approach combining visual and text features achieved the lowest mean absolute error (0.224), outperforming the vision-only (0.282) and text-only (0.338) approaches. Applied to a held-out test set, the best multimodal model achieved a mean squared error of 0.10805. The results suggest that LLMs can support psychometric analysis and automated item development, with implications for education and assessment.
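The two metrics reported above are straightforward to compute. Below is a minimal sketch of evaluating predicted item difficulty against observed proportion-correct values with MAE and MSE; the item values here are hypothetical and are not from the study.

```python
# Minimal sketch: scoring difficulty predictions against observed
# proportion-correct values, using the two metrics reported in the paper.

def mean_absolute_error(y_true, y_pred):
    # Average absolute deviation between observed and predicted difficulty.
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mean_squared_error(y_true, y_pred):
    # Average squared deviation; penalizes large misses more heavily.
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Observed difficulty = proportion of respondents answering an item
# correctly (0 to 1). These numbers are illustrative only.
observed = [0.85, 0.40, 0.62, 0.25]
predicted = [0.80, 0.50, 0.55, 0.30]

print(mean_absolute_error(observed, predicted))
print(mean_squared_error(observed, predicted))
```

Lower values on either metric indicate predictions closer to the empirically observed difficulty.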

Key Points

  • Multimodal approach outperforms unimodal approaches in predicting item difficulty
  • Large language models (LLMs) can be used for psychometric analysis and automated item development
  • The study demonstrates the potential of LLMs in education and assessment
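To make the multimodal setup concrete, here is a hypothetical sketch of packaging an item's question text, answer options, and chart image into a single prompt and asking a model for a difficulty estimate in [0, 1]. The `call_llm` function is a stand-in stub, not the study's actual GPT-4.1-nano pipeline, and the prompt structure is an assumption for illustration.

```python
import base64
import json

def build_prompt(question, options, image_bytes):
    # Combine text features (question, options) with the visualization
    # image (base64-encoded) into one multimodal request payload.
    return {
        "instruction": "Predict the proportion of U.S. adults who would "
                       "answer this visualization item correctly (0 to 1).",
        "question": question,
        "options": options,
        "image_b64": base64.b64encode(image_bytes).decode("ascii"),
    }

def call_llm(prompt_json):
    # Hypothetical stub standing in for a real model API call.
    return "0.62"

def predict_difficulty(question, options, image_bytes):
    raw = call_llm(json.dumps(build_prompt(question, options, image_bytes)))
    # Clamp the model's answer to a valid proportion.
    return min(max(float(raw), 0.0), 1.0)

p = predict_difficulty(
    "Which month had the highest sales?",
    ["January", "April", "July", "October"],
    b"\x89PNG...",  # placeholder bytes; a real chart image would go here
)
print(p)
```

Dropping either the `image_b64` field or the text fields from the payload corresponds to the paper's vision-only and text-only conditions, respectively.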

Merits

Innovative Approach

The study's use of LLMs to predict item difficulty is a novel and innovative approach that has shown promising results.

Improved Accuracy

The multimodal approach achieved a lower mean absolute error (MAE) compared to unimodal approaches, indicating improved accuracy in predicting item difficulty.
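Using the MAE values reported in the abstract, selecting the best feature set is a simple minimization (lower MAE is better):

```python
# MAE values as reported in the paper's abstract.
reported_mae = {"text-only": 0.338, "vision-only": 0.282, "multimodal": 0.224}

# The best-performing feature set is the one with the lowest MAE.
best = min(reported_mae, key=reported_mae.get)
print(best)
```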

Demerits

Limited Generalizability

The study's findings may not be generalizable to other contexts or populations, and further research is needed to validate the results.

Dependence on Data Quality

The accuracy of the LLMs depends on the quality of the training data, and poor data quality may lead to biased or inaccurate results.

Expert Commentary

The study's use of LLMs to predict item difficulty is a significant contribution to the field of education and assessment. The findings highlight the potential of AI to improve the efficiency and accuracy of assessment processes, with implications for developing more effective educational materials and tests. However, further research is needed to validate the results and address the study's limitations, including the potential for bias and the dependence on training-data quality. As these models improve, LLMs are likely to see wider use in education and assessment, and this study provides a useful foundation for that future work.

Recommendations

  • Validate the findings with additional populations and item types to address the study's generalizability limits
  • When developing educational materials and assessments, consider LLM-based difficulty prediction as a complement to traditional psychometric methods
