Instruction-Tuned LLMs for Parsing and Mining Unstructured Logs on Leadership HPC Systems

arXiv:2604.05168v1 Announce Type: new Abstract: Leadership-class HPC systems generate massive volumes of heterogeneous, largely unstructured system logs. Because these logs originate from diverse software, hardware, and runtime layers, they exhibit inconsistent formats, making structure extraction and pattern discovery extremely challenging. Therefore, robust log parsing and mining are critical to transform this raw telemetry into actionable insights that reveal operational patterns, diagnose anomalies, and enable reliable, efficient, and scalable system analysis. Recent advances in large language models (LLMs) offer a promising new direction for automated log understanding in leadership-class HPC environments. To capitalize on this opportunity, we present a domain-adapted, instruction-following, LLM-driven framework that leverages chain-of-thought (CoT) reasoning to parse and structure HPC logs with high fidelity. Our approach combines domain-specific log-template data with instruction-tuned examples to fine-tune an 8B-parameter LLaMA model tailored for HPC log analysis. We develop a hybrid fine-tuning methodology that adapts a general-purpose LLM to domain-specific log data, enabling a privacy-preserving, locally deployable, fast, and energy-efficient log-mining approach. We conduct experiments on a diverse set of log datasets from the LogHub repository. The evaluation confirms that our approach achieves parsing accuracy on par with significantly larger models, such as LLaMA 70B and Anthropic's Claude. We further validate the practical utility of our fine-tuned model by parsing over 600 million production logs from the Frontier supercomputer over a four-week window, uncovering critical patterns in temporal dynamics, node-level anomalies, and workload-error log correlations.
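To make the parsing task concrete: log parsing maps each raw message to a template whose variable fields are replaced by wildcards (commonly written `<*>` in LogHub-style datasets). The sketch below is illustrative only; the log line, template, and function are hypothetical examples, not artifacts from the paper:

```python
import re

def parse_with_template(log_line, template):
    """Match a raw log line against a template whose <*> wildcards
    mark variable fields; return the extracted parameter values,
    or None if the line does not fit the template."""
    # Escape literal text, then turn each <*> wildcard into a capture group.
    pattern = re.escape(template).replace(re.escape("<*>"), r"(\S+)")
    match = re.fullmatch(pattern, log_line)
    return list(match.groups()) if match else None

# Hypothetical HPC-style message (illustrative only).
line = "node frontier01234 reported ECC error on DIMM 7"
template = "node <*> reported ECC error on DIMM <*>"
print(parse_with_template(line, template))  # ['frontier01234', '7']
```

The paper's contribution is having the LLM infer such templates from unstructured messages; once a template exists, parameter extraction reduces to simple matching as above.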

Executive Summary

This article presents a novel framework leveraging instruction-tuned large language models (LLMs) to parse and mine unstructured logs from leadership-class high-performance computing (HPC) systems. The authors fine-tune an 8B-parameter LLaMA model using a hybrid methodology combining domain-specific log templates and instruction-following examples, enabling high-fidelity log structuring through chain-of-thought reasoning. The approach demonstrates parsing accuracy comparable to much larger models (e.g., LLaMA 70B, Claude) while remaining privacy-preserving, locally deployable, and energy-efficient. Validation on LogHub datasets and a real-world deployment on the Frontier supercomputer—processing over 600 million logs—reveals critical operational insights, including temporal patterns, node-level anomalies, and workload-error correlations. This work bridges the gap between general-purpose LLMs and domain-specific log analysis, offering a scalable solution for HPC system monitoring and diagnostics.

Key Points

  • Domain-adapted instruction-following LLMs enable high-fidelity parsing of heterogeneous HPC logs with inconsistent formats.
  • Hybrid fine-tuning (domain data + instruction examples) improves accuracy while maintaining efficiency (8B-parameter model vs. larger alternatives).
  • Real-world validation on Frontier supercomputer logs demonstrates scalability and practical utility in detecting operational anomalies and temporal patterns.
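The "hybrid fine-tuning" bullet pairs domain log templates with instruction-following examples. A training record in such a corpus might look like the sketch below; the field names, prompt wording, and log content are assumptions for illustration, not the paper's actual data format:

```python
import json

# Hypothetical instruction-tuning record pairing a raw log line with its
# target template. Field names and content are illustrative assumptions.
record = {
    "instruction": (
        "Extract the log template from the message, replacing variable "
        "fields with <*>. Think step by step before answering."  # CoT-style
    ),
    "input": "slurmd[4821]: error: Node n0042 not responding",
    "output": "slurmd[<*>]: error: Node <*> not responding",
}

# One such record per line (JSONL) is a common fine-tuning corpus layout.
jsonl_line = json.dumps(record)
print(json.loads(jsonl_line)["output"])
```

Combining many such records with generic instruction-following data is one plausible way to realize the hybrid methodology the summary describes.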

Merits

Innovation in Domain-Specific LLM Fine-Tuning

The article introduces a hybrid fine-tuning methodology that adapts a general-purpose LLM to HPC log parsing, achieving performance parity with much larger models while preserving efficiency and privacy.

Scalability and Practical Deployment

The framework is validated on over 600 million production logs from a leadership-class HPC system, proving its scalability and real-world applicability in high-stakes environments.

Balanced Trade-offs Between Accuracy and Efficiency

By leveraging an 8B-parameter model with domain-specific adaptations, the approach achieves high parsing accuracy while minimizing computational overhead, energy consumption, and deployment complexity.
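Parsing accuracy on LogHub-style benchmarks is commonly reported as grouping accuracy: the fraction of messages whose predicted template group exactly matches a ground-truth group. A minimal sketch of that metric, assuming per-message template labels (the function and data are illustrative, not from the paper):

```python
from collections import defaultdict

def grouping_accuracy(truth, pred):
    """Fraction of messages whose predicted template group (as a set of
    message indices) exactly matches some ground-truth group."""
    def groups(labels):
        g = defaultdict(set)
        for i, label in enumerate(labels):
            g[label].add(i)
        return g

    truth_groups = {frozenset(s) for s in groups(truth).values()}
    correct = sum(len(s) for s in groups(pred).values()
                  if frozenset(s) in truth_groups)
    return correct / len(truth)

# Messages 0-1 grouped correctly; 2 and 3 wrongly merged into one group.
print(grouping_accuracy(["A", "A", "B", "C"], ["t1", "t1", "t2", "t2"]))  # 0.5
```

Under this metric, a small model matching LLaMA 70B or Claude means it assigns messages to the correct template groups at a comparable rate, at a fraction of the inference cost.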

Demerits

Limited Generalizability to Non-HPC Logs

The framework is tailored specifically for HPC logs, which may limit its effectiveness when applied to logs from other domains with fundamentally different structures or terminologies.

Dependency on High-Quality Training Data

The accuracy of the fine-tuned model relies heavily on the quality and representativeness of the domain-specific log templates and instruction-tuning examples, which may require significant manual effort to curate.

Potential Overfitting to Narrow Use Cases

While effective for HPC logs, the model’s performance may degrade if deployed in scenarios with log formats or anomalies not represented in the training data.

Expert Commentary

This article represents a significant advancement in the application of LLMs to the notoriously challenging problem of log parsing in HPC systems. The authors’ hybrid fine-tuning methodology—combining domain-specific data with instruction-following capabilities—addresses a critical pain point in system monitoring: the ability to extract actionable insights from unstructured, heterogeneous logs at scale. The demonstration of parity with much larger models (e.g., LLaMA 70B, Claude) using an 8B-parameter model is particularly noteworthy, as it highlights the potential for efficient, locally deployable AI solutions that do not compromise on performance. The real-world validation on the Frontier supercomputer underscores the practical utility of this approach, revealing not only its scalability but also its capacity to uncover non-obvious patterns in system behavior. However, the authors’ focus on HPC-specific logs may limit the immediate applicability of this framework to other domains, and the reliance on high-quality training data poses a potential barrier to adoption. Nonetheless, this work sets a new benchmark for log parsing in technical environments and paves the way for further exploration into domain-adapted LLMs for complex data analysis tasks.

Recommendations

  • Future research should explore methods to enhance the generalizability of the framework to non-HPC logs while maintaining high accuracy, potentially through transfer learning or multi-domain pretraining.
  • Organizations adopting this approach should invest in robust data curation pipelines to ensure the training data is representative and diverse, mitigating the risk of overfitting to specific log formats or anomalies.
  • Collaboration between HPC operators, LLM researchers, and domain experts is encouraged to refine and expand the framework, ensuring it remains adaptable to evolving log formats and emerging use cases.

Sources

Original: arXiv - cs.AI