ScrapeGraphAI-100k: A Large-Scale Dataset for LLM-Based Web Information Extraction
arXiv:2602.15189v1 Announce Type: cross Abstract: The use of large language models for web information extraction is becoming increasingly fundamental to modern web information retrieval pipelines. However, existing datasets tend to be small, synthetic, or text-only, failing to capture the structural context of the web. We introduce ScrapeGraphAI-100k, a large-scale dataset comprising real-world LLM extraction events, collected via opt-in ScrapeGraphAI telemetry during Q2 and Q3 of 2025. Starting from 9M events, we deduplicate and balance by schema to produce 93,695 examples spanning diverse domains and languages. Each instance includes Markdown content, a prompt, a JSON schema, the LLM response, and complexity/validation metadata. We characterize the dataset's structural diversity and its failure modes as schema complexity increases. We also provide a fine-tuning experiment showing that a small language model (1.7B) trained on a subset narrows the gap to larger baselines (30B), underscoring the dataset's utility for efficient extraction. ScrapeGraphAI-100k enables fine-tuning small models, benchmarking structured extraction, and studying schema induction for web IR indexing, and is publicly available on HuggingFace.
Executive Summary
This article presents ScrapeGraphAI-100k, a large-scale dataset for web information extraction with large language models (LLMs). The dataset comprises 93,695 real-world extraction events, deduplicated and schema-balanced from roughly 9M raw events collected via opt-in ScrapeGraphAI telemetry during Q2 and Q3 of 2025. Each instance includes Markdown content, a prompt, a JSON schema, the LLM response, and complexity/validation metadata. The authors characterize the dataset's structural diversity and its failure modes as schema complexity increases. They also report a fine-tuning experiment in which a small language model (1.7B) trained on a subset narrows the gap to a much larger baseline (30B), underscoring the dataset's utility for efficient extraction. ScrapeGraphAI-100k enables fine-tuning small models, benchmarking structured extraction, and studying schema induction for web IR indexing.
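To make the instance layout concrete, here is a minimal sketch of what one record might look like and how its response could be checked against its schema. The field names and the validation helper are illustrative assumptions, not the dataset's actual HuggingFace column names or the authors' validation pipeline:

```python
import json

# Hypothetical example instance; the actual dataset's field names may differ.
instance = {
    "markdown_content": "# ACME Corp\nFounded in 1999.",
    "prompt": "Extract the company name and founding year.",
    "schema": {
        "type": "object",
        "properties": {
            "name": {"type": "string"},
            "founded": {"type": "integer"},
        },
        "required": ["name", "founded"],
    },
    "response": '{"name": "ACME Corp", "founded": 1999}',
    "metadata": {"schema_depth": 1, "valid_json": True},
}

TYPE_MAP = {
    "string": str, "integer": int, "number": (int, float),
    "boolean": bool, "object": dict, "array": list,
}

def validates(schema: dict, payload: str) -> bool:
    """Minimal check of required keys and top-level types (not full JSON Schema)."""
    try:
        data = json.loads(payload)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    for key in schema.get("required", []):
        if key not in data:
            return False
    for key, spec in schema.get("properties", {}).items():
        if key in data and not isinstance(data[key], TYPE_MAP[spec["type"]]):
            return False
    return True

print(validates(instance["schema"], instance["response"]))  # True
```

A checker along these lines is one way the dataset's "validation metadata" could be reproduced locally; production use would call for a full JSON Schema validator rather than this top-level sketch.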
Key Points
- ▸ ScrapeGraphAI-100k is a large-scale dataset for web information extraction using LLMs.
- ▸ The dataset comprises 93,695 real-world extraction events collected from opt-in ScrapeGraphAI telemetry.
- ▸ Each instance includes Markdown content, a prompt, a JSON schema, the LLM response, and complexity/validation metadata.
- ▸ The authors characterize the dataset's structural diversity and failure modes as schema complexity increases.
Merits
Strength in diversity and volume
The dataset's 93,695 real-world extraction events span diverse domains, languages, and schemas, making it a valuable resource for researchers and practitioners.
Utility for fine-tuning and benchmarking
ScrapeGraphAI-100k enables fine-tuning small models and benchmarking structured extraction, making it a useful tool for evaluating the performance of LLMs.
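When benchmarking structured extraction on a dataset like this, one simple scoring choice is field-level exact match between the model's JSON output and a gold response. The metric below is a hedged sketch of that idea, not the evaluation protocol used in the paper:

```python
import json

def field_accuracy(gold: str, pred: str) -> float:
    """Fraction of gold top-level fields reproduced exactly in the prediction.

    An illustrative metric (an assumption here, not the paper's protocol):
    unparsable or non-object predictions score 0.
    """
    try:
        gold_obj, pred_obj = json.loads(gold), json.loads(pred)
    except json.JSONDecodeError:
        return 0.0
    if not isinstance(pred_obj, dict) or not isinstance(gold_obj, dict):
        return 0.0
    if not gold_obj:
        return 1.0
    hits = sum(1 for k, v in gold_obj.items() if pred_obj.get(k) == v)
    return hits / len(gold_obj)

gold = '{"name": "ACME Corp", "founded": 1999}'
print(field_accuracy(gold, '{"name": "ACME Corp", "founded": 1999}'))  # 1.0
print(field_accuracy(gold, '{"name": "ACME Corp", "founded": 2001}'))  # 0.5
print(field_accuracy(gold, 'not json'))                                # 0.0
```

Averaging such a score over held-out instances gives a quick way to compare a fine-tuned small model against a larger baseline, in the spirit of the paper's 1.7B-vs-30B experiment.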
Demerits
Limited temporal scope
The dataset covers only two quarters of opt-in ScrapeGraphAI telemetry (Q2 and Q3 of 2025), so it may not capture the full range of extraction events or schema complexities encountered in practice.
Expert Commentary
While ScrapeGraphAI-100k is a valuable resource for researchers and practitioners, its limited temporal scope should be kept in mind when generalizing results. The reported increase in failure modes as schema complexity grows highlights the need for more robust and efficient web extraction systems. The fine-tuning experiment is particularly noteworthy: a 1.7B model trained on a subset narrows the gap to a 30B baseline, underscoring the dataset's utility for efficient extraction. Further research is needed to fully exploit the dataset and to develop more effective web extraction systems.
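The paper's observation that failure modes grow with schema complexity invites a concrete notion of "complexity". The sketch below computes one plausible proxy, maximum nesting depth and total declared field count of a JSON schema; this is an illustrative assumption, not the complexity metadata the dataset actually ships:

```python
def schema_complexity(schema: dict) -> tuple:
    """Return (max nesting depth, total declared fields) for a JSON schema.

    A rough complexity proxy (an assumption, not the paper's metric): deeper,
    wider schemas are the ones the paper reports as harder to extract against.
    """
    def walk(node: dict, depth: int) -> tuple:
        if node.get("type") == "array":
            # Descend into array items without counting them as a level here.
            return walk(node.get("items", {}), depth)
        props = node.get("properties", {})
        max_depth, fields = depth, len(props)
        for child in props.values():
            d, f = walk(child, depth + 1)
            max_depth, fields = max(max_depth, d), fields + f
        return max_depth, fields

    return walk(schema, 0)

flat = {"type": "object", "properties": {"title": {"type": "string"}}}
nested = {
    "type": "object",
    "properties": {
        "company": {
            "type": "object",
            "properties": {
                "name": {"type": "string"},
                "address": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                },
            },
        }
    },
}
print(schema_complexity(flat))    # (1, 1)
print(schema_complexity(nested))  # (3, 4)
```

Binning instances by a proxy like this would let one plot extraction accuracy against complexity and reproduce the qualitative failure-mode trend the authors describe.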
Recommendations
- ✓ Researchers and practitioners should carefully consider the limitations of ScrapeGraphAI-100k, particularly its temporal scope, when designing experiments or deploying web extraction systems.
- ✓ Further research should focus on developing more robust and efficient web extraction systems that can handle complex schema and failure modes.