ScrapeGraphAI-100k: A Large-Scale Dataset for LLM-Based Web Information Extraction
arXiv:2602.15189v1 Announce Type: cross Abstract: The use of large language models for web information extraction is becoming increasingly fundamental to modern web information retrieval pipelines. However, existing datasets tend to be small, synthetic, or text-only, failing to capture the structural context of the web. We introduce ScrapeGraphAI-100k, a large-scale dataset comprising real-world LLM extraction events, collected via opt-in ScrapeGraphAI telemetry during Q2 and Q3 of 2025. Starting from 9M events, we deduplicate and balance by schema to produce 93,695 examples spanning diverse domains and languages. Each instance includes Markdown content, a prompt, a JSON schema, the LLM response, and complexity/validation metadata. We characterize the dataset's structural diversity and its failure modes as schema complexity increases. We also provide a fine-tuning experiment showing that a small language model (1.7B) trained on a subset narrows the gap to larger baselines (30B), underscoring the dataset's utility for efficient extraction. ScrapeGraphAI-100k enables fine-tuning small models, benchmarking structured extraction, and studying schema induction for web IR indexing, and is publicly available on HuggingFace.
Executive Summary
This article presents ScrapeGraphAI-100k, a large-scale dataset for web information extraction with large language models (LLMs). The dataset comprises 93,695 real-world extraction events, deduplicated and schema-balanced from roughly 9M raw events collected via opt-in ScrapeGraphAI telemetry during Q2 and Q3 of 2025. Each instance includes Markdown content, a prompt, a JSON schema, the LLM response, and complexity/validation metadata. The authors characterize the dataset's structural diversity and its failure modes as schema complexity increases. They also report a fine-tuning experiment in which a small language model (1.7B) trained on a subset narrows the gap to a much larger baseline (30B), underscoring the dataset's utility for efficient extraction. ScrapeGraphAI-100k enables fine-tuning small models, benchmarking structured extraction, and studying schema induction for web IR indexing.
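To make the instance layout concrete, here is a minimal sketch of what one record might look like and how its response could be checked against its schema. The field names and the validation helper are illustrative assumptions, not the dataset's actual HuggingFace column names or the authors' validation pipeline:

```python
import json

# Hypothetical example instance; the actual dataset's field names may differ.
instance = {
    "markdown_content": "# ACME Corp\nFounded in 1999.",
    "prompt": "Extract the company name and founding year.",
    "schema": {
        "type": "object",
        "properties": {
            "name": {"type": "string"},
            "founded": {"type": "integer"},
        },
        "required": ["name", "founded"],
    },
    "response": '{"name": "ACME Corp", "founded": 1999}',
    "metadata": {"schema_depth": 1, "valid_json": True},
}

TYPE_MAP = {
    "string": str, "integer": int, "number": (int, float),
    "boolean": bool, "object": dict, "array": list,
}

def validates(schema: dict, payload: str) -> bool:
    """Minimal check of required keys and top-level types (not full JSON Schema)."""
    try:
        data = json.loads(payload)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    for key in schema.get("required", []):
        if key not in data:
            return False
    for key, spec in schema.get("properties", {}).items():
        if key in data and not isinstance(data[key], TYPE_MAP[spec["type"]]):
            return False
    return True

print(validates(instance["schema"], instance["response"]))  # True
```

A checker along these lines is one way the dataset's "validation metadata" could be reproduced locally; production use would call for a full JSON Schema validator rather than this top-level sketch.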
Key Points
- ▸ ScrapeGraphAI-100k is a large-scale dataset for web information extraction using LLMs.
- ▸ The dataset comprises 93,695 real-world extraction events collected from opt-in ScrapeGraphAI telemetry.
- ▸ Each instance includes Markdown content, a prompt, a JSON schema, the LLM response, and complexity/validation metadata.
- ▸ The authors characterize the dataset's structural diversity and failure modes as schema complexity increases.
Merits
Strength in diversity and volume
The dataset's 93,695 real-world extraction events span diverse domains, languages, and schemas, making it a valuable resource for researchers and practitioners.
Utility for fine-tuning and benchmarking
ScrapeGraphAI-100k enables fine-tuning small models and benchmarking structured extraction, making it a useful tool for evaluating the performance of LLMs.
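When benchmarking structured extraction on a dataset like this, one simple scoring choice is field-level exact match between the model's JSON output and a gold response. The metric below is a hedged sketch of that idea, not the evaluation protocol used in the paper:

```python
import json

def field_accuracy(gold: str, pred: str) -> float:
    """Fraction of gold top-level fields reproduced exactly in the prediction.

    An illustrative metric (an assumption here, not the paper's protocol):
    unparsable or non-object predictions score 0.
    """
    try:
        gold_obj, pred_obj = json.loads(gold), json.loads(pred)
    except json.JSONDecodeError:
        return 0.0
    if not isinstance(pred_obj, dict) or not isinstance(gold_obj, dict):
        return 0.0
    if not gold_obj:
        return 1.0
    hits = sum(1 for k, v in gold_obj.items() if pred_obj.get(k) == v)
    return hits / len(gold_obj)

gold = '{"name": "ACME Corp", "founded": 1999}'
print(field_accuracy(gold, '{"name": "ACME Corp", "founded": 1999}'))  # 1.0
print(field_accuracy(gold, '{"name": "ACME Corp", "founded": 2001}'))  # 0.5
print(field_accuracy(gold, 'not json'))                                # 0.0
```

Averaging such a score over held-out instances gives a quick way to compare a fine-tuned small model against a larger baseline, in the spirit of the paper's 1.7B-vs-30B experiment.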
Demerits
Limited temporal scope
The dataset covers only two quarters of opt-in ScrapeGraphAI telemetry (Q2 and Q3 of 2025), so it may not capture the full range of extraction events or schema complexities encountered in practice.
Expert Commentary
While ScrapeGraphAI-100k is a valuable resource for researchers and practitioners, its limited temporal scope should be kept in mind when generalizing results. The reported increase in failure modes as schema complexity grows highlights the need for more robust and efficient web extraction systems. The fine-tuning experiment is particularly noteworthy: a 1.7B model trained on a subset narrows the gap to a 30B baseline, underscoring the dataset's utility for efficient extraction. Further research is needed to fully exploit the dataset and to develop more effective web extraction systems.
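The paper's observation that failure modes grow with schema complexity invites a concrete notion of "complexity". The sketch below computes one plausible proxy, maximum nesting depth and total declared field count of a JSON schema; this is an illustrative assumption, not the complexity metadata the dataset actually ships:

```python
def schema_complexity(schema: dict) -> tuple:
    """Return (max nesting depth, total declared fields) for a JSON schema.

    A rough complexity proxy (an assumption, not the paper's metric): deeper,
    wider schemas are the ones the paper reports as harder to extract against.
    """
    def walk(node: dict, depth: int) -> tuple:
        if node.get("type") == "array":
            # Descend into array items without counting them as a level here.
            return walk(node.get("items", {}), depth)
        props = node.get("properties", {})
        max_depth, fields = depth, len(props)
        for child in props.values():
            d, f = walk(child, depth + 1)
            max_depth, fields = max(max_depth, d), fields + f
        return max_depth, fields

    return walk(schema, 0)

flat = {"type": "object", "properties": {"title": {"type": "string"}}}
nested = {
    "type": "object",
    "properties": {
        "company": {
            "type": "object",
            "properties": {
                "name": {"type": "string"},
                "address": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                },
            },
        }
    },
}
print(schema_complexity(flat))    # (1, 1)
print(schema_complexity(nested))  # (3, 4)
```

Binning instances by a proxy like this would let one plot extraction accuracy against complexity and reproduce the qualitative failure-mode trend the authors describe.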
Recommendations
- ✓ Researchers and practitioners should carefully consider the limitations of ScrapeGraphAI-100k, particularly its temporal scope, when designing experiments or deploying web extraction systems.
- ✓ Further research should focus on developing more robust and efficient web extraction systems that can handle complex schema and failure modes.