DLT-Corpus: A Large-Scale Text Collection for the Distributed Ledger Technology Domain

Walter Hernandez Cruz, Peter Devine, Nikhil Vadgama, Paolo Tasca, Jiahua Xu · February 27, 2026

#cs.CL

arXiv:2602.22045v1. Abstract: We introduce DLT-Corpus, the largest domain-specific text collection for Distributed Ledger Technology (DLT) research to date: 2.98 billion tokens from 22.12 million documents spanning scientific literature (37,440 publications), United States Patent and Trademark Office (USPTO) patents (49,023 filings), and social media (22 million posts). Existing Natural Language Processing (NLP) resources for DLT focus narrowly on cryptocurrency price prediction and smart contracts, leaving domain-specific language underexplored despite the sector's ~$3 trillion market capitalization and rapid technological evolution. We demonstrate DLT-Corpus' utility by analyzing technology emergence patterns and market-innovation correlations. Findings reveal that technologies originate in scientific literature before reaching patents and social media, following traditional technology transfer patterns. While social media sentiment remains overwhelmingly bullish even during crypto winters, scientific and patent activity grow independently of market fluctuations, tracking overall market expansion in a virtuous cycle where research precedes and enables economic growth that funds further innovation. We publicly release the full DLT-Corpus; LedgerBERT, a domain-adapted model achieving 23% improvement over BERT-base on a DLT-specific Named Entity Recognition (NER) task; and all associated tools and code.

Executive Summary

This article presents DLT-Corpus, a large-scale text collection for Distributed Ledger Technology (DLT) research comprising 2.98 billion tokens from 22.12 million documents across scientific literature, patents, and social media. The authors demonstrate the corpus' utility by analyzing technology emergence patterns and market-innovation correlations: technologies appear first in scientific literature before reaching patents and social media, and scientific and patent activity track overall market expansion rather than short-term market fluctuations. The corpus, a domain-adapted model (LedgerBERT), and the associated tools and code are publicly released, offering a resource for NLP and DLT researchers.

Key Points

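▸ The authors describe DLT-Corpus as the largest domain-specific text collection for DLT research to date.
▸ The corpus comprises 2.98 billion tokens from 22.12 million documents: 37,440 scientific publications, 49,023 USPTO patent filings, and roughly 22 million social media posts.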
▸ The authors demonstrate the utility of DLT-Corpus in analyzing technology emergence patterns and market-innovation correlations.
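
The headline figures can be cross-checked with a little arithmetic. The sketch below uses only the numbers quoted in the abstract; the social media post count is derived here from the rounded totals rather than reported by the authors, and the function name `corpus_breakdown` is purely illustrative.

```python
# Figures quoted in the abstract.
TOTAL_TOKENS = 2.98e9        # 2.98 billion tokens
TOTAL_DOCUMENTS = 22.12e6    # 22.12 million documents
PUBLICATIONS = 37_440        # scientific publications
PATENTS = 49_023             # USPTO patent filings


def corpus_breakdown():
    """Return figures implied by the headline numbers.

    The social media count is an assumption: it is inferred as the
    remainder of the rounded document total, not taken from the paper.
    """
    non_social = PUBLICATIONS + PATENTS
    implied_posts = TOTAL_DOCUMENTS - non_social
    return {
        "non_social_documents": non_social,
        "implied_social_posts": implied_posts,
        "social_share_of_documents": implied_posts / TOTAL_DOCUMENTS,
        "mean_tokens_per_document": TOTAL_TOKENS / TOTAL_DOCUMENTS,
    }


if __name__ == "__main__":
    for name, value in corpus_breakdown().items():
        print(f"{name}: {value:,.4f}")
```

On these figures, social media posts account for more than 99% of documents by count, and the corpus averages roughly 135 tokens per document. The abstract does not report tokens per source, so this average says nothing about how the 2.98 billion tokens are split among publications, patents, and posts.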

Merits

Strength in breadth and depth of coverage

DLT-Corpus spans scientific literature (37,440 publications), USPTO patents (49,023 filings), and social media (22 million posts), which allows the same technology to be traced across research, intellectual property, and public discussion.

Utility in analyzing technology emergence patterns

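The authors use DLT-Corpus to analyze technology emergence patterns, finding that technologies originate in scientific literature before reaching patents and social media, in line with traditional technology transfer patterns, and that scientific and patent activity track overall market expansion in what they describe as a virtuous cycle of research-driven economic growth.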
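
One simple way to make a technology transfer ordering concrete is to compare the year a term first appears in each source. The sketch below is an illustration only, not the authors' method, which this summary does not describe; the function name `emergence_lags` and the years in `toy_first_seen` are hypothetical.

```python
from typing import Dict, List, Tuple


def emergence_lags(first_seen: Dict[str, int]) -> List[Tuple[str, int]]:
    """Order sources by first-appearance year of a term.

    Returns (source, lag) pairs, where lag is the number of years
    between the earliest appearance anywhere and the first appearance
    in that source.
    """
    earliest = min(first_seen.values())
    ordered = sorted(first_seen.items(), key=lambda item: item[1])
    return [(source, year - earliest) for source, year in ordered]


# Hypothetical first-appearance years for a single made-up term.
# These values are invented for illustration and are not corpus data.
toy_first_seen = {"social_media": 2016, "scientific": 2012, "patents": 2014}

if __name__ == "__main__":
    for source, lag in emergence_lags(toy_first_seen):
        print(f"{source}: +{lag} years")
```

With the toy values, the ordering is scientific literature first, then patents, then social media. Repeating such a comparison over many terms and summarizing the lags would be one way to test whether the reported ordering holds broadly; first-appearance dates are sensitive to each source's coverage window, so any such analysis would need to account for that.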

Demerits

Limited contextualization of findings

The article could benefit from a more nuanced discussion of the implications and limitations of the observed technology emergence patterns and market-innovation correlations.

Potential biases in social media sentiment analysis

The authors note that social media sentiment remains overwhelmingly bullish, but they do not thoroughly address potential biases in the analysis, such as sampling or representation biases.

Expert Commentary

The development of DLT-Corpus and LedgerBERT represents a significant contribution to the field of NLP in emerging technologies. The findings on technology emergence patterns and market-innovation correlations offer a nuanced understanding of the complex relationships between research, innovation, and economic growth in the DLT sector. However, further research is needed to contextualize and generalize these findings, particularly in terms of addressing potential biases and limitations in the analysis.

Recommendations

✓ Future research should aim to develop more domain-adapted models and NLP resources for the DLT sector, building on the success of LedgerBERT.
✓ Policymakers and industry stakeholders should consider the implications of the virtuous cycle of research-driven economic growth for innovation policy and technology transfer, and develop strategies to promote continued investment in research and development in the DLT sector.

Sources

arXiv - cs.CL
