Natural Language Processing Models for Robust Document Categorization

arXiv:2602.20336v1. Abstract: This article presents an evaluation of several machine learning methods applied to automated text classification, alongside the design of a demonstrative system for unbalanced document categorization and distribution. The study focuses on balancing classification accuracy with computational efficiency, a key consideration when integrating AI into real-world automation pipelines. Three models of varying complexity were examined: a Naive Bayes classifier, a bidirectional LSTM network, and a fine-tuned, transformer-based BERT model. The experiments reveal substantial differences in performance. BERT achieved the highest accuracy, consistently exceeding 99%, but required significantly longer training times and greater computational resources. The BiLSTM model provided a strong compromise, reaching approximately 98.56% accuracy while maintaining moderate training costs and offering robust contextual understanding. Naive Bayes proved to be the fastest to train, on the order of milliseconds, yet delivered the lowest accuracy, averaging around 94.5%. Class imbalance influenced all methods, particularly in the recognition of minority categories. A fully functional demonstrative system was implemented to validate practical applicability, enabling automated routing of technical requests with throughput unattainable through manual processing. The study concludes that BiLSTM offers the most balanced solution for the examined scenario, while also outlining opportunities for future improvements and further exploration of transformer architectures.

Executive Summary

The article evaluates three machine learning models (Naive Bayes, a bidirectional LSTM, and a fine-tuned BERT) for automated text classification, focusing on the balance between accuracy and computational efficiency. BERT achieved the highest accuracy, consistently above 99%, but required the longest training times and the most computational resources. The BiLSTM offered a strong compromise, reaching approximately 98.56% accuracy at moderate training cost with robust contextual understanding. Naive Bayes trained fastest, on the order of milliseconds, but was the least accurate at around 94.5%. The study highlights the impact of class imbalance, especially on minority categories, and demonstrates a functional system for automated routing of technical requests. It concludes that BiLSTM is the most balanced solution for the examined scenario.
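To make the simplest of the three models concrete, here is a minimal sketch of a multinomial Naive Bayes text classifier. It is not the paper's implementation, which is not reproduced in this summary. The sketch assumes whitespace tokenization and add-one (Laplace) smoothing, and the function names, toy documents, and category labels are all invented for illustration.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(docs, labels, alpha=1.0):
    """Fit a multinomial Naive Bayes model with add-alpha smoothing."""
    class_doc_counts = Counter(labels)
    word_counts = defaultdict(Counter)  # label -> word -> count
    vocab = set()
    for text, label in zip(docs, labels):
        tokens = text.lower().split()  # assumption: whitespace tokenization
        word_counts[label].update(tokens)
        vocab.update(tokens)

    model = {"vocab": vocab, "log_prior": {}, "log_likelihood": {}}
    for label, n_docs in class_doc_counts.items():
        # Class prior: fraction of training documents with this label.
        model["log_prior"][label] = math.log(n_docs / len(docs))
        # Smoothed per-word likelihoods, so unseen words never get probability 0.
        denom = sum(word_counts[label].values()) + alpha * len(vocab)
        model["log_likelihood"][label] = {
            w: math.log((word_counts[label][w] + alpha) / denom) for w in vocab
        }
    return model


def predict(model, text):
    """Return the label with the highest posterior log-probability."""
    # Words never seen in training are ignored.
    tokens = [t for t in text.lower().split() if t in model["vocab"]]
    scores = {}
    for label, log_prior in model["log_prior"].items():
        likelihood = model["log_likelihood"][label]
        scores[label] = log_prior + sum(likelihood[t] for t in tokens)
    return max(scores, key=scores.get)


# Toy, made-up requests and categories (not data from the paper).
docs = [
    "printer is out of toner",
    "printer paper jam again",
    "cannot connect to vpn",
    "vpn password reset needed",
]
labels = ["hardware", "hardware", "network", "network"]
model = train_naive_bayes(docs, labels)
print(predict(model, "paper jam in the printer"))
```

Training amounts to counting words per class, which is consistent with the millisecond-scale training times reported for Naive Bayes. The model also treats words as independent of one another, which is one plausible reason it trails the context-aware BiLSTM and BERT models.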

Key Points

  • BERT achieved the highest accuracy (consistently above 99%) but required the longest training times and the most computational resources.
  • BiLSTM balanced accuracy (approximately 98.56%) against moderate training costs.
  • Naive Bayes was the fastest to train (on the order of milliseconds) but the least accurate (around 94.5%).
  • Class imbalance affected all models, particularly minority category recognition.
  • A functional demonstrative system validated the practical applicability of the models.

Merits

Comprehensive Evaluation

The study provides a thorough comparison of three distinct models, offering insights into their performance and resource requirements.

Practical Demonstration

The implementation of a functional system validates the real-world applicability of the models, showcasing their potential for automated document routing.

Balanced Approach

The study weighs accuracy against computational cost rather than ranking the models on accuracy alone, which gives a more nuanced picture of how each model would perform in deployment.

Demerits

Limited Scope

The study focuses on a specific set of models and a particular application, which may limit the generalizability of the findings.

Class Imbalance

The impact of class imbalance on model performance is acknowledged but not extensively explored, leaving room for further investigation.
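One way to see why headline accuracy can hide minority-class problems is to report per-class recall and macro-averaged F1 alongside it. The sketch below is illustrative only. The helper name `per_class_report` and the 95/5 label split are assumptions, not metrics or data from the paper.

```python
from collections import Counter


def per_class_report(y_true, y_pred):
    """Return overall accuracy, per-class recall, and macro-averaged F1."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    recall, f1 = {}, {}
    for c in labels:
        predicted = tp[c] + fp[c]
        actual = tp[c] + fn[c]
        precision = tp[c] / predicted if predicted else 0.0
        recall[c] = tp[c] / actual if actual else 0.0
        total = precision + recall[c]
        f1[c] = 2 * precision * recall[c] / total if total else 0.0

    accuracy = sum(tp.values()) / len(y_true)
    macro_f1 = sum(f1.values()) / len(labels)  # every class counts equally
    return accuracy, recall, macro_f1


# Made-up example: 95 "common" and 5 "rare" documents.
# The hypothetical classifier finds only 1 of the 5 rare ones.
y_true = ["common"] * 95 + ["rare"] * 5
y_pred = ["common"] * 95 + ["common"] * 4 + ["rare"] * 1
accuracy, recall, macro_f1 = per_class_report(y_true, y_pred)
print(f"accuracy={accuracy:.2f} rare recall={recall['rare']:.2f} macro F1={macro_f1:.2f}")
```

In this toy case accuracy is 0.96 while recall on the rare class is only 0.20, so a model can look strong overall and still misroute most minority-category documents.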

Resource Intensive Models

The high computational requirements of BERT may limit its practical use in resource-constrained environments.

Expert Commentary

The article presents a rigorous evaluation of machine learning models for automated text classification, offering valuable insights into the trade-offs between accuracy and computational efficiency. The study's comprehensive comparison of Naive Bayes, BiLSTM, and BERT models provides a nuanced understanding of their performance in practical applications. The demonstration of a functional system for automated document routing highlights the potential of these models to enhance efficiency in document management. However, the study's focus on a specific set of models and applications limits the generalizability of the findings. Additionally, the impact of class imbalance on model performance is acknowledged but not extensively explored, leaving room for further investigation. The high computational requirements of BERT may also limit its practical use in resource-constrained environments. Overall, the study contributes significantly to the field of automated text classification and offers valuable guidance for the selection of appropriate models in real-world applications.

Recommendations

  • Further research should explore the impact of class imbalance on model performance and develop mitigation strategies.
  • Future studies should evaluate a broader range of models and applications to enhance the generalizability of the findings.
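As an example of the kind of mitigation strategy the first recommendation points to, the sketch below computes inverse-frequency class weights, which can be used to scale a training loss so that errors on rare classes cost more. This is one common option, not a technique the paper is known to use. The function name and the 90/10 split are assumptions for illustration.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {c: n_samples / (n_classes * k) for c, k in counts.items()}


# Made-up label distribution: 90 "common" and 10 "rare" documents.
labels = ["common"] * 90 + ["rare"] * 10
weights = balanced_class_weights(labels)
print(weights)
```

With these weights each class contributes the same total weight to the loss, regardless of how many examples it has. The formula matches the "balanced" class-weight heuristic found in common machine learning libraries such as scikit-learn.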