SPM-Bench: Benchmarking Large Language Models for Scanning Probe Microscopy
arXiv:2602.22971v1
Abstract: As LLMs have achieved breakthroughs in general reasoning, their proficiency in specialized scientific domains remains poorly measured: existing benchmarks suffer from data contamination, insufficient complexity, and prohibitive human labor costs. Here we present SPM-Bench, an original, PhD-level multimodal benchmark designed specifically for scanning probe microscopy (SPM). We propose a fully automated data synthesis pipeline that ensures both high authority and low cost. By employing Anchor-Gated Sieve (AGS) technology, we efficiently extract high-value image-text pairs from arXiv and journal papers published between 2023 and 2025. Through a hybrid cloud-local architecture in which VLMs return only spatial coordinates ("llbox") for local high-fidelity cropping, our pipeline achieves extreme token savings while maintaining high dataset purity. To evaluate the performance of LLMs accurately and objectively, we introduce the Strict Imperfection Penalty F1 (SIP-F1) score. This metric not only establishes a rigorous capability hierarchy but also, for the first time, quantifies model "personalities" (Conservative, Aggressive, Gambler, or Wise). By correlating these results with model-reported confidence and perceived difficulty, we expose the true reasoning boundaries of current AI in complex physical scenarios. These insights establish SPM-Bench as a generalizable paradigm for automated scientific data synthesis.
Executive Summary
The article introduces SPM-Bench, a novel, PhD-level multimodal benchmark designed to evaluate the proficiency of large language models (LLMs) in the specialized scientific domain of scanning probe microscopy (SPM). The authors address gaps in existing benchmarks by proposing an automated data synthesis pipeline that ensures high authority and low cost. Utilizing Anchor-Gated Sieve (AGS) technology, the pipeline extracts high-value image-text pairs from recent arXiv and journal papers. The study introduces the Strict Imperfection Penalty F1 (SIP-F1) score to rigorously evaluate LLM performance and categorize model 'personalities.' The research highlights the reasoning boundaries of current AI in complex physical scenarios and establishes SPM-Bench as a paradigm for automated scientific data synthesis.
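The abstract's hybrid cloud-local step is described only at a high level. Below is a minimal sketch of the general pattern, assuming a hypothetical `query_vlm_for_bbox` helper in place of the real VLM call and normalized `[x0, y0, x1, y1]` coordinates as the return format; neither detail is confirmed by the paper.

```python
from PIL import Image

def query_vlm_for_bbox(preview: Image.Image, query: str) -> tuple[float, float, float, float]:
    """Placeholder for the cloud VLM call. In the described pipeline the VLM
    returns only normalized box coordinates for the region matching `query`;
    this stub returns a full-image box so the sketch runs end to end."""
    return (0.0, 0.0, 1.0, 1.0)

def crop_via_cloud_bbox(image_path: str, query: str) -> Image.Image:
    """Hybrid cloud-local pattern: upload a small preview, receive only
    coordinates, and crop the full-resolution original locally."""
    original = Image.open(image_path)

    # Downscale before upload so the VLM call consumes few image tokens.
    preview = original.copy()
    preview.thumbnail((512, 512))

    # The cloud round-trip carries back four floats, not pixels.
    x0, y0, x1, y1 = query_vlm_for_bbox(preview, query)

    # Map normalized coordinates onto the original resolution, so the
    # crop keeps full fidelity despite the low-resolution upload.
    w, h = original.size
    return original.crop((round(x0 * w), round(y0 * h), round(x1 * w), round(y1 * h)))
```

The token saving comes from this asymmetry: pixels travel to the cloud once, downscaled, while only coordinates travel back; the high-fidelity crop never leaves the local machine.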
Key Points
- ▸ Introduction of SPM-Bench, a specialized benchmark for evaluating LLMs in SPM.
- ▸ Automated data synthesis pipeline using AGS technology for high-value image-text pairs.
- ▸ Introduction of SIP-F1 score to evaluate LLM performance and categorize model personalities.
- ▸ Insights into the reasoning boundaries of current AI in complex physical scenarios.
- ▸ SPM-Bench as a generalizable paradigm for automated scientific data synthesis.
Merits
Innovative Benchmark
SPM-Bench addresses a significant gap in evaluating LLMs for specialized scientific domains, particularly in SPM, which has been overlooked in existing benchmarks.
Automated Data Synthesis
The use of AGS technology for automated data synthesis ensures high authority and low-cost extraction of high-value image-text pairs, making the benchmark scalable and efficient.
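The AGS internals are not published in the abstract; the sketch below shows what anchor-gated sieving of caption text could look like, with an illustrative (not the paper's) anchor vocabulary and a minimum-hit gate as assumed design choices.

```python
import re

# Illustrative SPM anchor terms; the actual AGS vocabulary is not published.
ANCHOR_TERMS = [
    r"\bSTM\b", r"\bAFM\b", r"scanning tunnel(?:ing|ling) microscop",
    r"atomic force microscop", r"tip[- ]sample", r"dI/dV", r"topograph",
]
ANCHOR_RE = re.compile("|".join(ANCHOR_TERMS), re.IGNORECASE)

def passes_anchor_gate(caption: str, min_hits: int = 2) -> bool:
    """Keep a figure-caption pair only when several distinct domain anchors
    fire; requiring multiple hits is one way to 'gate' the sieve so a
    passing mention of microscopy does not slip through."""
    hits = {m.group(0).lower() for m in ANCHOR_RE.finditer(caption)}
    return len(hits) >= min_hits

def sieve(pairs: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Filter (image_path, caption) pairs down to likely SPM content."""
    return [(img, cap) for img, cap in pairs if passes_anchor_gate(cap)]
```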
Rigorous Evaluation Metric
The introduction of the SIP-F1 score provides a rigorous and objective evaluation metric that not only assesses performance but also categorizes model personalities, offering deeper insights into LLM behavior.
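The exact SIP-F1 formula is not given in the abstract. One plausible reading, sketched below, weights wrong answers more heavily than abstentions so that guessing cannot inflate the score; the `wrong_penalty` weight and the quadrant thresholds in `personality` are hypothetical, and the paper additionally correlates with model-reported confidence, which this sketch omits.

```python
def sip_f1(outcomes: list[str], wrong_penalty: float = 2.0) -> float:
    """One plausible 'Strict Imperfection Penalty' F1 over per-question
    outcomes ('correct' | 'wrong' | 'abstain'): wrong answers are penalized
    `wrong_penalty` times as hard as a missed answer, so abstaining beats
    guessing. The weighting is an assumption, not the paper's formula."""
    correct = outcomes.count("correct")
    wrong = outcomes.count("wrong")
    attempted = correct + wrong
    precision = correct / (correct + wrong_penalty * wrong) if attempted else 0.0
    recall = correct / len(outcomes) if outcomes else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def personality(attempt_rate: float, accuracy_on_attempts: float) -> str:
    """Hypothetical quadrant mapping onto the paper's four labels."""
    if attempt_rate < 0.5:
        return "Conservative"   # abstains often
    if accuracy_on_attempts >= 0.75:
        return "Wise"           # attempts broadly and is usually right
    if accuracy_on_attempts >= 0.5:
        return "Aggressive"     # overcommits but lands often enough
    return "Gambler"            # attempts broadly despite poor odds
```

Under such a penalty, a calibrated model that abstains when unsure can outscore a guesser with the same number of correct answers, which is what makes the personality distinction measurable.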
Demerits
Limited Scope
The focus on SPM, while valuable, limits the immediate applicability of the findings to other scientific domains, potentially reducing the generalizability of the benchmark.
Data Contamination Concerns
Despite efforts to ensure dataset purity, there remains a risk of data contamination, which could affect the reliability of the benchmark results.
Complexity of Implementation
The hybrid cloud-local architecture and the use of AGS technology may present implementation challenges for researchers and practitioners looking to adopt the benchmark.
Expert Commentary
The article presents a significant advancement in the evaluation of LLMs for specialized scientific domains, particularly in SPM. The introduction of SPM-Bench addresses a critical gap in existing benchmarks, providing a rigorous and objective framework for assessing model performance. The use of AGS technology for automated data synthesis is a notable innovation, ensuring high authority and low-cost extraction of valuable image-text pairs. The SIP-F1 score is a particularly insightful contribution, offering a nuanced evaluation metric that categorizes model personalities and exposes reasoning boundaries. However, the limited scope of SPM-Bench to the SPM domain and potential implementation challenges should be considered. Overall, the article sets a new standard for evaluating LLMs in specialized scientific domains and provides valuable insights into the capabilities and limitations of current AI models.
Recommendations
- ✓ Expand the scope of SPM-Bench to include other specialized scientific domains to enhance the generalizability of the benchmark.
- ✓ Conduct further research to address potential data contamination issues and ensure the reliability of the benchmark results.