HEARTS: Benchmarking LLM Reasoning on Health Time Series
arXiv:2603.06638v1 (Announce Type: new)

Abstract: The rise of large language models (LLMs) has shifted time series analysis from narrow analytics to general-purpose reasoning. Yet, existing benchmarks cover only a small set of health time series modalities and tasks, failing to reflect the diverse domains and extensive temporal dependencies inherent in real-world physiological modeling. To bridge these gaps, we introduce HEARTS (Health Reasoning over Time Series), a unified benchmark for evaluating hierarchical reasoning capabilities of LLMs over general health time series. HEARTS integrates 16 real-world datasets across 12 health domains and 20 signal modalities, and defines a comprehensive taxonomy of 110 tasks grouped into four core capabilities: Perception, Inference, Generation, and Deduction. Evaluating 14 state-of-the-art LLMs on more than 20K test samples reveals intriguing findings. First, LLMs substantially underperform specialized models, and their performance is only weakly related to general reasoning scores. Moreover, LLMs often rely on simple heuristics and struggle with multi-step temporal reasoning. Finally, performance declines with increasing temporal complexity, with similar failure modes within model families, indicating that scaling alone is insufficient. By making these gaps measurable, HEARTS provides a standardized testbed and living benchmark for developing next-generation LLM agents capable of reasoning over diverse health signals.
Executive Summary
This article introduces HEARTS, a unified benchmark for evaluating the hierarchical reasoning capabilities of large language models (LLMs) over general health time series data. By integrating 16 real-world datasets across 12 health domains and 20 signal modalities, HEARTS covers a diverse range of health-related tasks. The benchmark reveals that LLMs substantially underperform specialized models, rely on simple heuristics, and struggle with multi-step temporal reasoning. The findings suggest that scaling LLMs alone is insufficient to improve their performance on complex health time series tasks. HEARTS provides a standardized testbed for developing next-generation LLM agents capable of reasoning over diverse health signals.
Key Points
- ▸ HEARTS is a unified benchmark for evaluating LLMs over general health time series data
- ▸ The benchmark integrates 16 real-world datasets across 12 health domains and 20 signal modalities
- ▸ LLMs substantially underperform specialized models on health time series tasks
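To make the benchmark structure concrete, the sketch below shows how samples organized under HEARTS's four capability groups (Perception, Inference, Generation, Deduction) could be scored per capability. This is an illustrative assumption, not the paper's actual harness: the task name, sample fields, and the trivial "always predicts rising" model are hypothetical.

```python
from dataclasses import dataclass

# The four core capabilities named in the paper; task names below are hypothetical.
CAPABILITIES = ["Perception", "Inference", "Generation", "Deduction"]

@dataclass
class Sample:
    capability: str       # one of CAPABILITIES
    task: str             # e.g. "trend_detection" (illustrative name only)
    series: list          # the input health time series
    answer: str           # gold label

def evaluate(predict, samples):
    """Score a model's predictions, grouped by capability.

    Returns accuracy per capability, or None where no samples exist.
    """
    scores = {c: [0, 0] for c in CAPABILITIES}  # [correct, total]
    for s in samples:
        correct, total = scores[s.capability]
        scores[s.capability] = [correct + (predict(s) == s.answer), total + 1]
    return {c: (hit / n if n else None) for c, (hit, n) in scores.items()}

# Toy usage: a heuristic "model" that always answers "rising" scores 0.5
# on Perception here, illustrating how per-capability gaps become measurable.
samples = [
    Sample("Perception", "trend_detection", [70, 72, 75, 80], "rising"),
    Sample("Perception", "trend_detection", [80, 75, 72, 70], "falling"),
]
results = evaluate(lambda s: "rising", samples)
print(results["Perception"])  # 0.5
```

Grouping scores by capability, rather than reporting a single aggregate, is what lets a benchmark like this expose where heuristic-driven models fail (e.g. strong Perception, weak Deduction).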
Merits
Strength in Methodology
The authors have developed a comprehensive and diverse benchmark that faithfully reflects real-world physiological modeling tasks.
Standardization
HEARTS provides a standardized testbed for evaluating LLMs, enabling more accurate comparisons between models.
Demerits
Limited Generalizability
The findings may not generalize to domains or tasks outside health time series analysis.
Complexity of LLMs
The article does not explain in detail how the LLMs' hierarchical reasoning capabilities are evaluated, which may limit accessibility for readers without a background in LLMs.
Expert Commentary
The article provides a comprehensive analysis of the limitations and challenges of using LLMs for health time series analysis. The authors have made a significant contribution to the field by developing a standardized benchmark for evaluating LLMs. However, the article also highlights the need for further research to address the limitations of LLMs, particularly their reliance on simple heuristics and their difficulty with multi-step temporal reasoning. The implications of the findings are significant, particularly for the use of LLMs in healthcare. As LLMs continue to evolve, it is essential to develop more transparent and explainable models that can be trusted to make accurate decisions in high-stakes applications.
Recommendations
- ✓ Developers should prioritize the development of more transparent and explainable LLMs
- ✓ Research investment should target the limitations highlighted in the article, particularly multi-step temporal reasoning over health signals