
FinTexTS: Financial Text-Paired Time-Series Dataset via Semantic-Based and Multi-Level Pairing

Jaehoon Lee, Suhwan Park, Tae Yoon Lim, Seunghan Lee, Jun Seo, Dongwan Kang, Hwanil Choi, Minjae Kim, Sungdong Yoo, SoonYoung Lee, Yongjae Lee, Wonbin Ahn · March 7, 2026

#cs.AI #cs.LG

arXiv:2603.02702v1. Abstract: The financial domain involves a variety of important time-series problems. Recently, time-series analysis methods that jointly leverage textual and numerical information have gained increasing attention. Accordingly, numerous efforts have been made to construct text-paired time-series datasets in the financial domain. However, financial markets are characterized by complex interdependencies, in which a company's stock price is influenced not only by company-specific events but also by events in other companies and broader macroeconomic factors. Existing approaches that pair text with financial time-series data based on simple keyword matching often fail to capture such complex relationships. To address this limitation, we propose a semantic-based and multi-level pairing framework. Specifically, we extract company-specific context for the target company from SEC filings and apply an embedding-based matching mechanism to retrieve semantically relevant news articles based on this context. Furthermore, we classify news articles into four levels (macro-level, sector-level, related company-level, and target-company level) using large language models (LLMs), enabling multi-level pairing of news articles with the target company. Applying this framework to publicly available news datasets, we construct FinTexTS, a new large-scale text-paired stock price dataset. Experimental results on FinTexTS demonstrate the effectiveness of our semantic-based and multi-level pairing strategy in stock price forecasting. In addition to the publicly available news underlying FinTexTS, we show that applying our method to proprietary yet carefully curated news sources leads to higher-quality paired data and improved stock price forecasting performance.
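The embedding-based matching step in the abstract can be illustrated with a minimal sketch. The abstract does not name an embedding model or similarity measure, so everything here is an assumption: `embed` is a toy bag-of-words stand-in for a real sentence-embedding model, cosine similarity is one common choice of score, and the context string and headlines are made up for illustration.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector (assumption)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(context: str, articles: list[str], top_k: int = 2) -> list[tuple[float, str]]:
    """Rank articles by similarity to the company context and keep the top_k."""
    query = embed(context)
    scored = [(cosine(query, embed(article)), article) for article in articles]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Hypothetical company context (the paper extracts this from SEC filings).
company_context = "The company designs graphics processors and data center chips."

# Hypothetical headlines, invented for this example.
news = [
    "Chip export rules tighten for data center processors",
    "Coffee prices rise after poor harvest",
    "New graphics processors announced for gaming laptops",
]

results = retrieve(company_context, news, top_k=2)
for score, headline in results:
    print(f"{score:.3f}  {headline}")
```

A word-count vector only matches shared tokens, so it behaves much like the keyword matching the paper criticizes. Swapping `embed` for a learned sentence encoder is what would let the same ranking code surface articles that are related in meaning but share no keywords.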

Executive Summary

The article introduces FinTexTS, a large-scale dataset that pairs news text with stock price time series, built with a semantic-based and multi-level pairing framework. The framework targets a weakness of keyword-based pairing: a company's stock price responds not only to company-specific events but also to events at other companies and to macroeconomic factors. It extracts company-specific context from SEC filings, uses embedding-based matching to retrieve semantically relevant news articles, and uses LLMs to classify each article as macro-level, sector-level, related company-level, or target-company level. The authors report that this pairing strategy improves stock price forecasting on FinTexTS, and that applying the same method to proprietary, carefully curated news sources yields higher-quality paired data and further forecasting gains. The work is relevant to researchers and practitioners who combine text and numerical data in financial time-series analysis.

Key Points

▸ FinTexTS is a large-scale text-paired stock price dataset constructed using a semantic-based and multi-level pairing framework.
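▸ The framework pairs each target company with news at four levels (macro, sector, related company, and target company), so the paired text reflects events at other companies and macroeconomic factors as well as company-specific events.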
▸ Experimental results demonstrate the effectiveness of the proposed strategy in stock price forecasting.
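To make the multi-level pairing idea concrete, the sketch below attaches news to each trading day of a target company, grouped by the four levels named in the abstract. It is an illustration under stated assumptions, not the authors' pipeline: the `Article` class, the `pair_by_level` function, the level labels (assigned by hand here, by LLMs in the paper), and all dates, prices, and headlines are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass

# The four levels named in the abstract; the identifier spellings are assumptions.
LEVELS = ("macro", "sector", "related_company", "target_company")


@dataclass
class Article:
    date: str      # ISO date, e.g. "2024-01-02"
    level: str     # one of LEVELS; hand-assigned here, LLM-assigned in the paper
    headline: str


def empty_levels() -> dict:
    """A fresh mapping with one empty list per level."""
    return {level: [] for level in LEVELS}


def pair_by_level(articles, prices):
    """Attach each day's headlines, grouped by level, to the matching price row."""
    by_date = defaultdict(empty_levels)
    for article in articles:
        if article.level not in LEVELS:
            raise ValueError(f"unknown level: {article.level}")
        by_date[article.date][article.level].append(article.headline)
    return [
        {"date": date, "close": close, "texts": by_date.get(date, empty_levels())}
        for date, close in prices
    ]


# Made-up inputs for illustration only.
articles = [
    Article("2024-01-02", "macro", "Central bank holds interest rates"),
    Article("2024-01-02", "target_company", "Company announces new product line"),
    Article("2024-01-03", "sector", "Semiconductor demand forecast raised"),
]
prices = [("2024-01-02", 100.0), ("2024-01-03", 101.5), ("2024-01-04", 99.8)]

paired = pair_by_level(articles, prices)
for row in paired:
    print(row["date"], row["close"], row["texts"])
```

Keeping the levels as separate lists, rather than one merged list per day, lets a downstream forecasting model weight macro news differently from company-specific news or drop a level entirely.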

Merits

Strength

The framework addresses a concrete weakness of keyword-based pairing: it links a company's stock price to semantically relevant news about other companies, its sector, and the macroeconomy, not only to articles that mention the company by name.

Demerits

Limitation

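The framework relies on large language models (LLMs) to classify news articles into levels. The abstract does not report how accurate this classification is, and it may not generalize equally well across all financial contexts.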

Expert Commentary

The article contributes a new framework for constructing text-paired time-series datasets in finance. Its central idea, replacing keyword matching with semantic retrieval and multi-level classification, directly targets the interdependencies that existing pairing approaches miss. The dependence on LLMs for classification remains an open issue for future research. Beyond stock price forecasting, which the paper evaluates, the approach may also prove useful in areas such as financial regulation and risk assessment.

Recommendations

✓ Future research should explore the use of alternative classification methods to reduce reliance on LLMs.
✓ The FinTexTS dataset and proposed framework should be made publicly available to facilitate further research and development in the field of finance.

Sources

arXiv - cs.AI

FinTexTS: Financial Text-Paired Time-Series Dataset via Semantic-Based and Multi-Level Pairing
