HEARTS: Benchmarking LLM Reasoning on Health Time Series

arXiv:2603.06638v1 Announce Type: new Abstract: The rise of large language models (LLMs) has shifted time series analysis from narrow analytics to general-purpose reasoning. Yet, existing benchmarks cover only a small set of health time series modalities and tasks, failing to reflect the diverse domains and extensive temporal dependencies inherent in real-world physiological modeling. To bridge these gaps, we introduce HEARTS (Health Reasoning over Time Series), a unified benchmark for evaluating hierarchical reasoning capabilities of LLMs over general health time series. HEARTS integrates 16 real-world datasets across 12 health domains and 20 signal modalities, and defines a comprehensive taxonomy of 110 tasks grouped into four core capabilities: Perception, Inference, Generation, and Deduction. Evaluating 14 state-of-the-art LLMs on more than 20K test samples reveals intriguing findings. First, LLMs substantially underperform specialized models, and their performance is only weakly related to general reasoning scores. Moreover, LLMs often rely on simple heuristics and struggle with multi-step temporal reasoning. Finally, performance declines with increasing temporal complexity, with similar failure modes within model families, indicating that scaling alone is insufficient. By making these gaps measurable, HEARTS provides a standardized testbed and living benchmark for developing next-generation LLM agents capable of reasoning over diverse health signals.

Executive Summary

This article introduces HEARTS, a unified benchmark for evaluating the hierarchical reasoning capabilities of large language models (LLMs) over general health time series data. HEARTS integrates 16 real-world datasets spanning 12 health domains and 20 signal modalities, and defines a taxonomy of 110 tasks grouped into four core capabilities: Perception, Inference, Generation, and Deduction. Evaluating 14 state-of-the-art LLMs on more than 20K test samples, the benchmark reveals that LLMs substantially underperform specialized models, often rely on simple heuristics, and struggle with multi-step temporal reasoning. The findings suggest that scaling LLMs alone is insufficient to improve performance on complex health time series tasks. HEARTS provides a standardized testbed for developing next-generation LLM agents capable of reasoning over diverse health signals.

Key Points

  • HEARTS is a unified benchmark for evaluating LLMs over general health time series data
  • The benchmark integrates 16 real-world datasets across 12 health domains and 20 signal modalities
  • LLMs substantially underperform specialized models on health time series tasks
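To make the benchmark's structure concrete, the sketch below shows one plausible way results from a taxonomy like HEARTS's could be aggregated: per-task accuracies rolled up into a mean score per capability (Perception, Inference, Generation, Deduction). This is an illustrative sketch only; the record fields, task names, and aggregation scheme are assumptions, not the paper's actual schema.

```python
from dataclasses import dataclass
from collections import defaultdict

# Hypothetical task record. HEARTS groups its 110 tasks under four core
# capabilities; the field and task names here are illustrative only.
@dataclass
class TaskResult:
    capability: str   # e.g. "Perception", "Inference", "Generation", "Deduction"
    task_name: str
    correct: int      # test samples answered correctly
    total: int        # test samples evaluated

def capability_scores(results):
    """Roll per-task accuracies up into a mean accuracy per capability."""
    buckets = defaultdict(list)
    for r in results:
        buckets[r.capability].append(r.correct / r.total)
    return {cap: round(sum(accs) / len(accs), 3) for cap, accs in buckets.items()}

results = [
    TaskResult("Perception", "trend-detection", 80, 100),
    TaskResult("Perception", "anomaly-spotting", 60, 100),
    TaskResult("Deduction", "multi-step-diagnosis", 30, 100),
]
print(capability_scores(results))
# → {'Perception': 0.7, 'Deduction': 0.3}
```

Aggregating at the capability level, rather than reporting a single overall score, is what lets a benchmark like this expose the pattern the paper describes: a model can do well on perception-style tasks while failing multi-step deduction.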

Merits

Strength in Methodology

The authors have developed a comprehensive and diverse benchmark that reflects the varied domains and temporal dependencies of real-world physiological modeling.

Standardization

HEARTS provides a standardized testbed for evaluating LLMs, enabling like-for-like comparisons between models.

Demerits

Limited Generalizability

The findings may not generalize to domains or tasks outside of health time series analysis.

Complexity of LLMs

The article does not explain in detail how the LLMs' hierarchical reasoning capabilities are evaluated, which may limit accessibility for readers without a background in LLMs.

Expert Commentary

The article provides a comprehensive analysis of the limitations and challenges of using LLMs for health time series analysis. By developing a standardized benchmark, the authors make a significant contribution to the field. At the same time, the findings underscore the need for further research and development to address the limitations of LLMs, particularly their reliance on simple heuristics and their difficulty with multi-step temporal reasoning. The implications are especially significant for healthcare applications: as LLMs continue to evolve, it is essential to develop more transparent and explainable models that can be trusted to make accurate decisions in high-stakes settings.

Recommendations

  • Developers should prioritize the development of more transparent and explainable LLMs
  • Investment in LLM research and development should be prioritized to address the limitations and challenges highlighted in the article
