
NeurIPS Datasets & Benchmarks Track: From Art to Science in AI Evaluations


December 5, 2025 · Communications Chairs

This post provides an update on the 2025 Datasets and Benchmarks Track, reflecting on how the new hosting and metadata requirements affected submissions and the review process. We present submission statistics and survey findings from 851 authors and 155 reviewers, and identify areas requiring continued development. This post differs from our earlier post on the review process by focusing on empirical outcomes rather than procedural changes.

Background: The D&B Track

The Datasets and Benchmarks Track was established in 2021 to provide a venue for work on datasets, benchmarks and evaluation methodologies that often fell outside traditional algorithmic research papers. The track has grown consistently, roughly doubling submissions annually before reaching 1,820 submissions in 2024 and 1,995 in 2025. This past year, the track operated with 41 senior area chairs, 281 area chairs and 2,680 reviewers.

For 2025, the organizers implemented two major changes to standardize quality evaluations and turn reproducibility from an aspiration into an expectation. First, paper submission requirements were aligned with the NeurIPS main track, while retaining dataset-specific elements such as optional single-blind submission. Second, the track introduced rigorous requirements for hosting datasets on persistent public repositories with mandatory Croissant metadata. These standards enabled automated checklists and standardized dataset summaries within OpenReview, reducing reviewer effort and helping ensure that datasets remain verifiable, accessible, and scientifically impactful over time.

Submission Statistics

Dataset Hosting Patterns

When we looked at where authors chose to host their datasets, a clear pattern emerged: over 80% of accepted papers used a handful of widely adopted platforms, namely Hugging Face, Kaggle, Dataverse, and OpenML.
Another 13% relied on self-hosted or bespoke solutions, with the rest distributed across smaller repositories such as Zenodo and the Open Science Framework.

Research Focus Areas

The distribution of accepted papers reflected broader trends in machine learning research. Eighty-four percent of accepted papers introduced new datasets as part of benchmark or evaluation contributions. The track mirrored main-track trends, particularly an increased focus on large language model evaluation, alongside continued activity in AI for science, domain-specific applications and socially beneficial AI.

Figure 1: Overview of author-provided keywords across accepted papers

Metadata Compliance

The majority of accepted papers included the required Croissant metadata, though gaps appeared in initial submissions. Missing fields included license information (11.9 percent), dataset descriptions (4.9 percent) and URLs (3.5 percent). Less than one percent failed to include dataset names. Where licensing information was provided, authors predominantly selected open and permissive terms, particularly Creative Commons BY 4.0 and CC0 1.0.

Adoption of the Croissant Responsible AI (RAI) extension remained minimal. This extension captures data collection practices, biases and sensitive content, but few submissions included RAI-compliant documentation.

Survey Results

Author Experience

After the acceptance notifications, we sent an anonymous survey to both authors and reviewers; 851 authors and 155 reviewers responded. Asked about the hosting process, 82 percent of authors reported smooth experiences, while 16 percent encountered difficulties. Common issues involved very large datasets (one terabyte or larger), platform rate limits and occasional instability near submission deadlines. Authors also noted that automated Croissant generation sometimes failed for complex datasets.
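As a concrete illustration of the completeness checks described under Metadata Compliance above, the following sketch flags missing required fields in a Croissant-style record. The field names ("name", "description", "license", "url") follow schema.org/Croissant conventions, but the check itself is an illustrative assumption, not the official NeurIPS tooling.

```python
# Required top-level fields, matching the gaps reported in submissions:
# license (11.9%), description (4.9%), url (3.5%), name (<1%).
REQUIRED_FIELDS = ("name", "description", "license", "url")

def missing_fields(metadata: dict) -> list:
    """Return the required fields that are absent or empty in the record."""
    return [field for field in REQUIRED_FIELDS if not metadata.get(field)]

# A minimal Croissant-style record exhibiting the most common gap:
# no license information.
example = {
    "@context": {"@vocab": "https://schema.org/"},
    "@type": "Dataset",
    "name": "toy-benchmark",
    "description": "A small illustrative dataset entry.",
    "url": "https://example.org/toy-benchmark",
}

print(missing_fields(example))  # -> ['license']
```

A validator along these lines is what makes automated, targeted feedback to authors cheap: the report can name exactly which fields to fill in rather than rejecting the submission wholesale.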
When asked about the review process, 58 percent of authors agreed that the new requirements led to fairer or more thorough reviews, 15 percent reported little effect, and 25 percent indicated that review quality needed improvement. Concerns among the latter group included limited reviewer engagement in rebuttals, reliance on AI-generated feedback and an emphasis on methodological novelty over real-world impact. For hosting and metadata, 63 percent of authors rated the requirements as effective or very effective in improving quality standards, with 16 percent neutral.

Figure 2: Responses of authors to the question “Do you think it was effective in improving the review process?”

Reviewer Feedback

Of the 155 reviewers who responded to the anonymous survey, 77 percent reported that datasets were easy to access. Around 10 percent encountered difficulties, most commonly due to missing or broken links and very large files. Eleven percent did not directly inspect datasets and instead based their evaluations solely on the accompanying papers.

Automated metadata reports were first introduced in 2025 with the goal of supporting more consistent and efficient review. In this first year of use, 69 percent of reviewers found them useful or very useful, and 70 percent indicated that the compliance checklist helped them assess submissions more efficiently. Several reviewers recommended that future iterations of the reports be shorter and more focused.

Figure 3: Responses of reviewers to the question “How did the requirement for all datasets to be hosted affect your review process?”

Looking Forward

Identified Challenges and Areas for Development

Several patterns emerged from the 2025 cycle that require continued attention:

Metadata Documentation: While most submissions included the required fields, gaps in the first submission round revealed a learning curve as authors adapt to structured format requirements.
Current guidance leaves room for interpretation, particularly around licensing documentation and descriptive context in machine-readable form. This could be improved through clearer documentation for metadata submission, and by refining automated validation tools to provide concise, targeted reports to authors and reviewers.

Responsible AI Documentation: Low adoption of RAI metadata indicates a gap between available standards and practical implementation. Authors need clearer instructions for documenting data provenance, biases, limitations and societal impacts. Platform support for RAI-compliant exports would reduce the documentation burden, and more extensive validation checks at submission time could ensure that the most needed information is provided.

Reviewer Expertise: Submissions increasingly span specialized domains, including AI for science, medicine, multimodal data and LLM evaluation. The reviewer pool shows limitations in diversity and domain coverage: each paper requires review by experts in data-centric machine learning as well as domain-specific knowledge. Future iterations would benefit from expanding the reviewer pool to include broader domain expertise.

Impact Assessment: Unlike algorithmic work with clear performance metrics, the impact of a dataset or benchmark depends on enabling future research by broadening applicability, surfacing underexplored problems or challenging dominant evaluation paradigms. The track needs shared frameworks to assess data coverage, representativeness and innovation aligned with community priorities. Future iterations could consider requiring a “demonstrated impact” section in papers or in the review form that maps dataset characteristics to evaluation results.

Large Dataset Handling and New Standards: Very large datasets presented challenges, underscoring the need for better platform support.
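To make the RAI documentation gap discussed above more concrete, here is a sketch of what RAI-compliant fields might add to a Croissant record. The property names ("rai:dataCollection", "rai:dataBiases", "rai:personalSensitiveInformation") follow the Croissant RAI extension published by MLCommons, but this record and its values are an illustrative assumption, not validated output from any official tool.

```python
# A Croissant-style record extended with illustrative RAI documentation.
record = {
    "@context": {
        "@vocab": "https://schema.org/",
        "rai": "http://mlcommons.org/croissant/RAI/",
    },
    "@type": "Dataset",
    "name": "toy-benchmark",
    "rai:dataCollection": "Collected from public web forums during 2023-2024.",
    "rai:dataBiases": "Over-represents English-language posts.",
    "rai:personalSensitiveInformation": "Usernames removed before release.",
}

# Documentation coverage: which RAI properties the record actually fills in.
rai_fields = sorted(key for key in record if key.startswith("rai:"))
print(rai_fields)
```

A submission-time check could compute exactly this kind of coverage and prompt authors for the provenance, bias and sensitive-content fields they left empty, rather than leaving RAI documentation optional in practice.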
Community feedback highlighted priorities for further development:

  • clearer self-hosting guidelines and platform partnerships for large datasets,
  • streamlined automated metadata reports for reviewer efficiency, and
  • stronger adoption of Responsible AI documentation through improved guidance and platform support.

The track continues to operate in a learning phase as the community establishes norms for data-centric research evaluation. The shift toward standardized hosting platforms, the introduction of machine-readable metadata and the implementation of automated review tools represent steps toward an infrastructure for reproducible, transparent dataset research.

Executive Summary

The NeurIPS Datasets and Benchmarks Track has grown rapidly, roughly doubling submissions annually from 2021 to 2024. The 2025 track implemented major changes to standardize quality evaluations and promote reproducibility, including aligning paper submission requirements with the NeurIPS main track and introducing rigorous hosting and metadata requirements. The analysis presents submission statistics and survey findings, and identifies areas requiring continued development. The track mirrored broader trends in machine learning research, with an increased focus on large language model evaluation and socially beneficial AI. However, gaps in metadata compliance appeared in initial submissions, highlighting room for improvement.

Key Points

  • The NeurIPS Datasets and Benchmarks Track has experienced significant growth, doubling submissions annually from 2021 to 2024.
  • The 2025 track implemented major changes to standardize quality evaluations and promote reproducibility.
  • The track saw alignment with broader trends in machine learning research, with a focus on large language model evaluation and socially beneficial AI.

Merits

Standardization of Quality Evaluations and Reproducibility

The introduction of hosting and metadata requirements enabled automated checklists and standardized dataset summaries, streamlining the process and promoting reproducibility.

Alignment with Broader Trends in Machine Learning Research

The track saw alignment with main track trends, particularly increased focus on large language model evaluation and socially beneficial AI.

Demerits

Gaps in Metadata Compliance

Initial submissions showed gaps in metadata compliance, with missing fields including license information, dataset descriptions, and URLs.

Need for Improvement in Metadata Requirements

The analysis highlights the need for improvement in metadata requirements to ensure datasets remain verifiable, accessible, and scientifically impactful over time.

Expert Commentary

The NeurIPS Datasets and Benchmarks Track's experience in standardizing quality evaluations and promoting reproducibility offers valuable insights for the broader research community. The track's alignment with main-track trends underscores the growing importance of large language model evaluation and socially beneficial AI. At the same time, the gaps in metadata compliance highlight the need for continued refinement of metadata requirements. As AI research evolves, prioritizing standardization and reproducibility is essential to preserving the scientific integrity and impact of research.

Recommendations

  • The research community should prioritize the development of metadata standards and infrastructure to support reproducibility in AI research.
  • The track's focus on standardization and reproducibility should serve as a model for other fields and research communities.

Sources

Original: NeurIPS
