FlowMS: Flow Matching for De Novo Structure Elucidation from Mass Spectra

Jianan Nie, Peng Gao

arXiv:2603.18397v1

Abstract: Mass spectrometry (MS) stands as a cornerstone analytical technique for molecular identification, yet de novo structure elucidation from spectra remains challenging due to the combinatorial complexity of chemical space and the inherent ambiguity of spectral fragmentation patterns. Recent deep learning approaches, including autoregressive sequence models, scaffold-based methods, and graph diffusion models, have made progress. However, diffusion-based generation for this task remains computationally demanding. Meanwhile, discrete flow matching, which has shown strong performance for graph generation, has not yet been explored for spectrum-conditioned structure elucidation. In this work, we introduce FlowMS, the first discrete flow matching framework for spectrum-conditioned de novo molecular generation. FlowMS generates molecular graphs through iterative refinement in probability space, enforcing chemical formula constraints while conditioning on spectral embeddings from a pretrained formula transformer encoder. Notably, it achieves state-of-the-art performance on 5 out of 6 metrics on the NPLIB1 benchmark: 9.15% top-1 accuracy (9.7% relative improvement over DiffMS) and 7.96 top-10 MCES (4.2% improvement over MS-BART). We also visualize the generated molecules, which further demonstrate that FlowMS produces structurally plausible candidates closely resembling ground truth structures. These results establish discrete flow matching as a promising paradigm for mass spectrometry-based structure elucidation in metabolomics and natural product discovery.
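To make "iterative refinement in probability space" under a formula constraint more concrete, here is a minimal toy sketch in Python. It is not the FlowMS implementation. It assumes a uniform-noise linear interpolation path with a simple Euler-style sampler, assumes that the formula constraint is enforced by fixing the heavy-atom list and sampling only bond types, and replaces the trained network with a hypothetical `toy_denoiser` that ignores the spectrum embedding. All names (`toy_denoiser`, `sample_bonds`, `NUM_BOND_TYPES`) are illustrative.

```python
import numpy as np

NUM_BOND_TYPES = 4  # hypothetical classes: 0 = none, 1 = single, 2 = double, 3 = triple


def toy_denoiser(bonds_t, t, spectrum_embedding, target):
    """Stand-in for a trained network: returns p(x_1 | x_t) for every atom pair.

    A real model would use bonds_t, t and spectrum_embedding. This toy ignores
    them and puts 90% of the probability mass on a fixed target graph.
    """
    n = target.shape[0]
    probs = np.full((n, n, NUM_BOND_TYPES), 0.1 / (NUM_BOND_TYPES - 1))
    rows, cols = np.indices((n, n))
    probs[rows, cols, target] = 0.9
    return probs


def sample_bonds(atom_types, spectrum_embedding, target, num_steps=50, seed=0):
    """Euler-style discrete flow matching sampler over bond types (toy version)."""
    rng = np.random.default_rng(seed)
    n = len(atom_types)  # atoms are fixed by the formula (assumption); only bonds move
    iu = np.triu_indices(n, k=1)  # each unordered atom pair once
    num_pairs = len(iu[0])

    # x_0: uniform noise over bond types for every atom pair
    pair_vals = rng.integers(0, NUM_BOND_TYPES, size=num_pairs)

    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = step * dt
        bonds = np.zeros((n, n), dtype=int)
        bonds[iu] = pair_vals
        bonds = bonds + bonds.T  # keep the adjacency symmetric

        # Predicted distribution over the clean bond type for each pair
        pair_probs = toy_denoiser(bonds, t, spectrum_embedding, target)[iu]

        # Draw a clean-graph guess per pair by inverting the CDF
        cdf = pair_probs.cumsum(axis=1)
        u = rng.random((num_pairs, 1))
        x1_guess = np.minimum((u > cdf).sum(axis=1), NUM_BOND_TYPES - 1)

        # Jump to the guess with probability dt / (1 - t), capped at 1
        jump = rng.random(num_pairs) < min(1.0, dt / (1.0 - t))
        pair_vals = np.where(jump, x1_guess, pair_vals)

    bonds = np.zeros((n, n), dtype=int)
    bonds[iu] = pair_vals
    return bonds + bonds.T


# Heavy atoms of ethanol (C2H6O) with a toy target graph C-C-O
atom_types = ["C", "C", "O"]
target = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]])
spectrum_embedding = np.zeros(8)  # placeholder for an encoder output
bonds = sample_bonds(atom_types, spectrum_embedding, target)
print(bonds)
```

The jump probability grows as t approaches 1, so early steps leave most of the noisy graph untouched and late steps commit to the denoiser's guesses. In a real system the denoiser would be a graph network conditioned on the spectrum embedding, and the final graph would still need a valence check.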

Executive Summary

The article introduces FlowMS, a discrete flow matching framework for spectrum-conditioned de novo molecular generation. FlowMS generates molecular graphs by iterative refinement in probability space, enforcing chemical formula constraints while conditioning on spectral embeddings from a pretrained formula transformer encoder. On the NPLIB1 benchmark it achieves state-of-the-art results on 5 of 6 metrics, including 9.15% top-1 accuracy (a 9.7% relative improvement over DiffMS) and a top-10 MCES of 7.96 (a 4.2% improvement over MS-BART). Visualizations of the generated molecules suggest that the candidates are structurally plausible and closely resemble the ground truth structures. The authors position discrete flow matching as a promising paradigm for mass spectrometry-based structure elucidation in metabolomics and natural product discovery, where accurate molecular identification is crucial. The abstract motivates the approach by noting that diffusion-based generation is computationally demanding, but it does not report efficiency figures for FlowMS itself.

Key Points

  • FlowMS introduces discrete flow matching for spectrum-conditioned de novo molecular generation.
  • Generation is conditioned on spectral embeddings from a pretrained formula transformer encoder, with chemical formula constraints enforced during iterative refinement.
  • FlowMS achieves state-of-the-art performance on 5 out of 6 metrics on the NPLIB1 benchmark.
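The abstract names top-1 accuracy as one of the benchmark metrics. As a rough illustration of how a top-k accuracy of this kind is typically computed, the sketch below counts a spectrum as a hit when the true structure appears among the first k ranked candidates. It assumes structures are compared through canonical identifier strings (for example InChIKeys). The function name and the toy identifiers are hypothetical, and the benchmark's actual matching rule may differ.

```python
def top_k_accuracy(ranked_candidates, truths, k):
    """Fraction of spectra whose true structure is among the first k candidates.

    ranked_candidates: one list of candidate identifiers per spectrum, best first.
    truths: the true identifier for each spectrum, in the same order.
    """
    if len(ranked_candidates) != len(truths):
        raise ValueError("need one candidate list per ground-truth structure")
    hits = sum(truth in cands[:k] for cands, truth in zip(ranked_candidates, truths))
    return hits / len(truths)


# Toy identifiers standing in for canonical structure strings
ranked = [["A", "B", "C"], ["D", "E", "F"], ["G", "H", "I"]]
truths = ["A", "E", "Z"]

top1 = top_k_accuracy(ranked, truths, k=1)  # only the first spectrum is a hit
top3 = top_k_accuracy(ranked, truths, k=3)  # first and second spectra are hits
```

Exact-match accuracy is strict, which is why benchmarks of this kind also report similarity-based metrics such as MCES that give partial credit for near misses.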

Merits

Strengths in Computational Efficiency

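The abstract motivates FlowMS by the computational cost of diffusion-based generation. If discrete flow matching does reduce that cost, FlowMS would be well suited to large-scale molecular identification, although the abstract reports no runtime or sampling-step comparison.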

State-of-the-Art Performance

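FlowMS achieves state-of-the-art performance on 5 out of 6 metrics on the NPLIB1 benchmark, including 9.15% top-1 accuracy (a 9.7% relative improvement over DiffMS) and a top-10 MCES of 7.96 (a 4.2% improvement over MS-BART), supporting its effectiveness for de novo molecular generation.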
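The relative improvements quoted in the abstract can be turned back into approximate baseline values. The short calculation below is a back-of-the-envelope check, not a set of figures taken from the source. It assumes that each improvement is measured relative to the baseline value, that higher is better for top-1 accuracy, and that lower is better for MCES.

```python
def implied_baseline_higher_is_better(new_value, relative_gain):
    """Baseline b such that new_value = b * (1 + relative_gain)."""
    return new_value / (1.0 + relative_gain)


def implied_baseline_lower_is_better(new_value, relative_reduction):
    """Baseline b such that new_value = b * (1 - relative_reduction)."""
    return new_value / (1.0 - relative_reduction)


# Figures quoted in the abstract
flowms_top1 = 9.15      # percent
flowms_mces_top10 = 7.96

# Implied baselines under the stated assumptions
implied_diffms_top1 = implied_baseline_higher_is_better(flowms_top1, 0.097)
implied_msbart_mces = implied_baseline_lower_is_better(flowms_mces_top10, 0.042)

print(round(implied_diffms_top1, 2))   # 8.34
print(round(implied_msbart_mces, 2))   # 8.31
```

Under these assumptions the absolute gain in top-1 accuracy is under one percentage point, which is worth keeping in mind when reading "9.7% relative improvement".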

Structural Plausibility of Generated Molecules

Visualizations of the generated molecules show structurally plausible candidates that closely resemble the ground truth structures. This is qualitative evidence that complements the benchmark metrics.

Demerits

Limited Generalizability

The framework's performance is evaluated on a specific benchmark (NPLIB1), and its generalizability to other datasets and applications remains to be investigated.

Dependence on Pretrained Models

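FlowMS relies on a pretrained formula transformer encoder for its spectral embeddings, so its output quality is bounded by how well that encoder represents the input spectra. This may limit its applicability to domains with little training data or to chemical classes that the encoder's pretraining data covers poorly.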

Potential Overfitting

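Overfitting remains a possible risk, particularly because results are reported on a single benchmark. The abstract does not describe how the influence of spectral embeddings and chemical formula constraints is balanced, so it is unclear how the model behaves when the two are poorly matched.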

Expert Commentary

The article presents a novel and promising approach to de novo molecular generation, applying discrete flow matching to spectrum-conditioned structure elucidation for the first time and reaching state-of-the-art results on 5 of 6 NPLIB1 metrics. Conditioning on embeddings from a pretrained formula transformer encoder, together with explicit formula constraints, is a reasonable way to narrow a very large search space. The absolute numbers are a reminder of how hard the task remains: a top-1 accuracy of 9.15% means that the top-ranked candidate is wrong for roughly nine out of ten spectra. The limitations noted above, including evaluation on a single benchmark and dependence on a pretrained encoder, require further investigation. Nonetheless, FlowMS represents a meaningful advance in molecular identification, with potential applications in metabolomics and natural product discovery.

Recommendations

  • Future research should investigate the generalizability of FlowMS to other datasets and applications, including its performance on diverse chemical structures and spectral fragmentation patterns.
  • The development of alternative pretrained models or learning strategies could help mitigate the dependence on specific models and improve the framework's robustness.

Sources