NERdME: a Named Entity Recognition Dataset for Indexing Research Artifacts in Code Repositories

arXiv:2603.05750v1. Abstract: Existing scholarly information extraction (SIE) datasets focus on scientific papers and overlook implementation-level details in code repositories. README files describe datasets, source code, and other implementation-level artifacts; however, their free-form Markdown offers little semantic structure, making automatic information extraction difficult. To address this gap, NERdME is introduced: 200 manually annotated README files with over 10,000 labeled spans and 10 entity types. Baseline results using large language models and fine-tuned transformers show clear differences between paper-level and implementation-level entities, indicating the value of extending SIE benchmarks with entity types available in README files. A downstream entity-linking experiment demonstrates that entities derived from READMEs can support artifact discovery and metadata integration.

Executive Summary

The NERdME dataset addresses a critical gap in scholarly information extraction (SIE): existing benchmarks center on scientific papers and overlook implementation-level artifacts in code repositories. NERdME is a manually annotated corpus of 200 README files from code repositories, enriched with over 10,000 labeled spans across 10 entity types. Baseline experiments with large language models and fine-tuned transformers show that implementation-level entities differ meaningfully from paper-level entities, suggesting a valuable extension of SIE evaluation frameworks. An entity-linking experiment further validates the utility of README-derived entities for artifact discovery and metadata integration, with practical applications in repository navigation and resource discovery.
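The abstract does not specify the dataset's annotation format, but span-annotated NER corpora of this kind are commonly serialized as token-level BIO tags before fine-tuning a transformer. A minimal sketch of that conversion; the example sentence, offsets, and entity labels are illustrative assumptions, not taken from NERdME:

```python
# Convert character-level entity spans to token-level BIO tags,
# the standard input format for fine-tuning transformer NER models.

def spans_to_bio(text, spans):
    """spans: list of (start, end, label) character offsets into text."""
    # Record each whitespace token with its character offsets.
    tokens, pos = [], 0
    for word in text.split():
        start = text.index(word, pos)
        tokens.append((word, start, start + len(word)))
        pos = start + len(word)

    # Tag tokens: B- at a span's start, I- inside it, O elsewhere.
    tagged = []
    for word, t_start, t_end in tokens:
        tag = "O"
        for s_start, s_end, label in spans:
            if t_start >= s_start and t_end <= s_end:
                tag = ("B-" if t_start == s_start else "I-") + label
                break
        tagged.append((word, tag))
    return tagged

# Hypothetical README sentence with a dataset and a language mention.
text = "Trained on SQuAD using Python 3.10"
spans = [(11, 16, "DATASET"), (23, 29, "LANGUAGE")]
print(spans_to_bio(text, spans))
```

The resulting (token, tag) pairs feed directly into a token-classification head, which is presumably how the fine-tuned transformer baselines were trained.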

Key Points

  • Introduction of NERdME as a novel dataset for implementation-level artifacts
  • Annotation of 200 README files with 10 entity types and over 10,000 labeled spans
  • Evidence of differential performance between paper-level and implementation-level entity recognition

Merits

Gap-Filling

NERdME fills a critical void by providing the first benchmark for information extraction from code repository README files, enabling more comprehensive SIE coverage beyond traditional scientific paper sources.

Empirical Validation

Baseline experiments using state-of-the-art language models validate the distinctiveness and relevance of implementation-level entities, supporting the dataset’s value in extending existing SIE benchmarks.

Demerits

Limited Scope

The dataset’s focus on README files restricts applicability to other types of code repository content (e.g., issue trackers, commit messages), limiting generalizability.

Annotation Bias

Manual annotation of 200 files, while rigorous, raises consistency and scalability concerns: extending the scheme to larger repository collections would require automated validation to maintain annotation quality.

Expert Commentary

This work represents a significant methodological advance in the field of scholarly information extraction. Historically, SIE research has been disproportionately centered on textual content from academic publications, often excluding the rich, heterogeneous content found in code repositories—a domain increasingly central to modern research workflows. The NERdME dataset demonstrates a sophisticated understanding of the structural challenges inherent in Markdown-based README files and responds with a targeted, manually curated annotation scheme that preserves semantic diversity while enabling computational analysis. The distinction between paper-level and implementation-level entities is not merely semantic; it reflects a fundamental shift in research context—from static knowledge dissemination to dynamic, reproducible artifact discovery. Furthermore, the entity-linking experiment underscores the potential for integrating repository-level metadata into broader scholarly networks, enabling cross-referencing between datasets, code, and publications. While the dataset’s current scope is limited, its conceptual framing—particularly the recognition that metadata in code repositories demands specialized linguistic and structural modeling—sets a new precedent. Future work could extend this framework to other repository artifacts, potentially catalyzing a paradigm shift in how scholarly information is indexed, retrieved, and interlinked.
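The abstract does not describe the entity-linking method itself; one simple way the idea can work is to map normalized entity mentions onto records in an external metadata registry. A minimal sketch, where the registry contents and record identifiers are invented for illustration:

```python
# Link entity mentions extracted from READMEs to canonical records
# in a metadata registry via normalized exact matching. The registry
# here is a hypothetical stand-in for a real artifact catalogue.

def normalize(mention):
    """Case-fold and collapse hyphens so surface variants collide."""
    return mention.lower().replace("-", " ").strip()

# Hypothetical registry: normalized name -> canonical record ID.
REGISTRY = {
    "squad": "dataset:squad",
    "bert base": "model:bert-base",
}

def link(mentions):
    """Return mention -> record ID (None when unlinked)."""
    return {m: REGISTRY.get(normalize(m)) for m in mentions}

print(link(["SQuAD", "BERT-Base", "UnknownTool"]))
```

Unlinked mentions (those mapping to `None`) would fall back to fuzzy matching or manual review; real systems typically add alias tables and disambiguation on top of this exact-match core.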

Recommendations

  1. Expand NERdME to include additional repository artifacts (e.g., issue templates, CONTRIBUTING.md, API docs) to broaden applicability.
  2. Develop automated annotation pipelines that combine rule-based heuristics with ML to scale annotation for larger repositories without sacrificing quality.
  3. Integrate NERdME into existing SIE evaluation frameworks as a supplementary benchmark for implementation-level entities, encouraging comparative studies across artifact types.
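As a starting point for the second recommendation, a rule-based pass over README Markdown can surface high-precision entity candidates for a model or annotator to confirm: inline code spans often name tools and commands, and link text often names datasets or papers. A sketch under that assumption; the patterns and example README are illustrative only:

```python
import re

# Surface candidate entity mentions from README Markdown with two
# cheap heuristics: inline code spans (`...`) and Markdown link text
# ([text](url)). Candidates would then be filtered or typed by a model.

def candidate_mentions(markdown):
    code_spans = re.findall(r"`([^`\n]+)`", markdown)
    link_texts = re.findall(r"\[([^\]]+)\]\([^)]+\)", markdown)
    return {"code": code_spans, "link": link_texts}

readme = "Install with `pip install nerdme`. Data from [SQuAD](https://example.org)."
print(candidate_mentions(readme))
```

Such heuristics trade recall for precision, which is exactly what a human-in-the-loop annotation pipeline needs: cheap pre-annotations that are rarely wrong, leaving annotators to add what the rules miss.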