CLASE: A Hybrid Method for Chinese Legalese Stylistic Evaluation

Yiran Rex Ma, Yuxiao Ye, Huiyuan Xie

arXiv:2602.12639v1

Abstract: Legal text generated by large language models (LLMs) can usually achieve reasonable factual accuracy, but it frequently fails to adhere to the specialised stylistic norms and linguistic conventions of legal writing. In order to improve stylistic quality, a crucial first step is to establish a reliable evaluation method. However, having legal experts manually develop such a metric is impractical, as the implicit stylistic requirements in legal writing practice are difficult to formalise into explicit rubrics. Meanwhile, existing automatic evaluation methods also fall short: reference-based metrics conflate semantic accuracy with stylistic fidelity, and LLM-as-a-judge evaluations suffer from opacity and inconsistency. To address these challenges, we introduce CLASE (Chinese LegAlese Stylistic Evaluation), a hybrid evaluation method that focuses on the stylistic performance of legal text. The method incorporates a hybrid scoring mechanism that combines 1) linguistic feature-based scores and 2) experience-guided LLM-as-a-judge scores. Both the feature coefficients and the LLM scoring experiences are learned from contrastive pairs of authentic legal documents and their LLM-restored counterparts. This hybrid design captures both surface-level features and implicit stylistic norms in a transparent, reference-free manner. Experiments on 200 Chinese legal documents show that CLASE achieves substantially higher alignment with human judgments than traditional metrics and pure LLM-as-a-judge methods. Beyond improved alignment, CLASE provides interpretable score breakdowns and suggestions for improvements, offering a scalable and practical solution for professional stylistic evaluation in legal text generation (Code and data for CLASE are available at: https://github.com/rexera/CLASE).

Executive Summary

The article introduces CLASE (Chinese LegAlese Stylistic Evaluation), a hybrid method for evaluating the stylistic quality of Chinese legal text generated by large language models (LLMs). CLASE combines linguistic feature-based scores with experience-guided LLM-as-a-judge scores; both the feature coefficients and the judge's scoring experiences are learned from contrastive pairs of authentic legal documents and their LLM-restored counterparts. The method targets two weaknesses of existing automatic evaluation: reference-based metrics conflate semantic accuracy with stylistic fidelity, and LLM-as-a-judge evaluations suffer from opacity and inconsistency. In experiments on 200 Chinese legal documents, CLASE aligns substantially more closely with human judgments than traditional metrics and pure LLM-as-a-judge methods, and it provides interpretable score breakdowns and improvement suggestions.

Key Points

  • CLASE is a hybrid method for evaluating the stylistic quality of legal text generated by LLMs.
  • It combines linguistic feature-based scores with experience-guided LLM-as-a-judge scores.
  • Both the feature coefficients and the LLM scoring experiences are learned from contrastive pairs of authentic legal documents and their LLM-restored counterparts.
  • On 200 Chinese legal documents, CLASE achieves substantially higher alignment with human judgments than traditional metrics and pure LLM-as-a-judge methods.
  • It provides interpretable score breakdowns and suggestions for improvements.
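
As a rough illustration of what "alignment with human judgments" can mean in practice, the sketch below computes a Spearman rank correlation between metric scores and human ratings. The abstract does not say which alignment statistic the authors report, so the choice of Spearman correlation, the function names, and the sample numbers are all assumptions made for illustration.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(metric_scores, human_scores):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    if len(metric_scores) != len(human_scores):
        raise ValueError("score lists must have the same length")
    rx, ry = average_ranks(metric_scores), average_ranks(human_scores)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / (sx * sy)


# Hypothetical scores for five documents (not data from the paper).
metric = [0.62, 0.80, 0.45, 0.91, 0.70]
human = [3, 4, 2, 5, 4]
print(round(spearman(metric, human), 3))
```

A metric whose ranking of documents matches the human ranking more closely yields a value nearer to 1, which is one common way to compare evaluation methods against each other.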

Merits

Hybrid Approach

The combination of linguistic feature-based scores and LLM-as-a-judge scores captures both surface-level features and implicit stylistic norms, providing a more comprehensive evaluation.
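
To make the idea of a hybrid score with a readable breakdown concrete, here is a minimal sketch. It is not the authors' implementation: the feature names, the logistic squashing of the weighted feature sum, the mixing weight `alpha`, and the assumption that the judge score is already rescaled to [0, 1] are all hypothetical choices for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class StyleScore:
    feature_score: float  # weighted linguistic features, squashed to (0, 1)
    judge_score: float    # LLM-as-a-judge score, assumed rescaled to [0, 1]
    total: float          # weighted mix of the two components
    breakdown: dict       # per-feature contribution to the raw feature sum


def hybrid_score(features, coefficients, judge_score, alpha=0.5):
    """Mix a feature-based score with a judge score (hypothetical scheme).

    features:     feature name -> measured value for one document
    coefficients: feature name -> learned weight (missing names count as 0)
    alpha:        assumed mixing weight given to the feature-based component
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    breakdown = {
        name: coefficients.get(name, 0.0) * value
        for name, value in features.items()
    }
    raw = sum(breakdown.values())
    feature_score = 1.0 / (1.0 + math.exp(-raw))
    total = alpha * feature_score + (1.0 - alpha) * judge_score
    return StyleScore(feature_score, judge_score, total, breakdown)


# Hypothetical feature values and weights for one generated document.
score = hybrid_score(
    features={"formulaic_phrase_rate": 0.4, "avg_sentence_length": 1.2},
    coefficients={"formulaic_phrase_rate": 2.0, "avg_sentence_length": -0.5},
    judge_score=0.7,
)
print(score.total, score.breakdown)
```

Keeping the per-feature contributions alongside the total is what allows a score to be explained: a reader can see which surface features raised or lowered the result, while the judge component covers norms that no explicit feature captures.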

Transparency and Interpretability

CLASE offers interpretable score breakdowns and suggestions for improvements, making it a practical tool for professional stylistic evaluation.

Scalability

Because CLASE is automatic and reference-free, it can be applied to large volumes of legal text without gold-standard reference documents, which suits automated evaluation in legal text generation.

Demerits

Limited Scope

The method is currently focused on Chinese legal text, which may limit its applicability to other legal systems and languages.

Dependence on LLM-as-a-Judge

While the hybrid approach mitigates some issues, the reliance on LLM-as-a-judge scores may still introduce some level of opacity and inconsistency.

Data Dependency

The effectiveness of CLASE depends on the quality and diversity of the contrastive pairs used for training, which may not cover all stylistic nuances in legal text.
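
The dependence on contrastive pairs can be seen in a small sketch of how feature weights might be fitted from them. The abstract does not describe the fitting procedure, so this example assumes a pairwise logistic model in which the authentic document should outscore its LLM-restored counterpart; the learning rate, epoch count, L2 penalty, and feature vectors are hypothetical.

```python
import math


def learn_coefficients(pairs, lr=0.1, epochs=200, l2=0.01):
    """Fit weights so authentic documents outscore their restored versions.

    pairs: list of (authentic_features, restored_features), each a list of
           floats of equal length.
    Model (an assumption): P(authentic preferred) = sigmoid(w . (a - r)).
    Training is stochastic gradient ascent on the log-likelihood with an
    L2 penalty that keeps the weights bounded.
    """
    n_features = len(pairs[0][0])
    w = [0.0] * n_features
    for _ in range(epochs):
        for authentic, restored in pairs:
            diff = [a - r for a, r in zip(authentic, restored)]
            z = sum(wi * di for wi, di in zip(w, diff))
            p = 1.0 / (1.0 + math.exp(-z))
            # Gradient of log(p) with respect to w is (1 - p) * diff.
            w = [wi + lr * ((1.0 - p) * di - l2 * wi)
                 for wi, di in zip(w, diff)]
    return w


# Hypothetical pairs: feature 0 is higher in authentic text, feature 1 lower.
pairs = [
    ([0.8, 0.1], [0.5, 0.3]),
    ([0.9, 0.2], [0.6, 0.4]),
]
print(learn_coefficients(pairs))
```

Under this model a feature only receives a non-zero weight if it differs between the two sides of the training pairs, so stylistic traits that the pairs do not exercise remain invisible to the learned coefficients.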

Expert Commentary

CLASE is a useful step forward in evaluating the stylistic quality of LLM-generated legal text. Its hybrid design addresses the main limitations of existing metrics by combining linguistic feature-based scores with experience-guided LLM-as-a-judge scores, so that both explicit surface features and implicit stylistic norms contribute to the result. The interpretable score breakdowns and improvement suggestions are particularly valuable, because they let legal professionals see why a text scored as it did and what to revise. The reported evidence, however, comes from 200 Chinese legal documents, and the method's focus on Chinese legal text limits its immediate applicability to other legal systems and languages. Future research could adapt CLASE to other languages and legal contexts. In addition, although the hybrid approach mitigates some of the problems of LLM-as-a-judge evaluation, the judge component remains a potential source of opacity and inconsistency, and further work on its consistency and reliability is needed. Overall, CLASE offers a promising template for stylistic evaluation in legal text generation and for more practical applications of AI in the legal domain.

Recommendations

  • Expand the scope of CLASE to include other languages and legal systems to enhance its global applicability.
  • Continue research to improve the consistency and reliability of LLM-as-a-judge evaluations, ensuring robust and transparent scoring mechanisms.
