
Multilingual KokoroChat: A Multi-LLM Ensemble Translation Method for Creating a Multilingual Counseling Dialogue Dataset


Ryoma Suzuki, Zhiyang Qi, Michimasa Inaba

arXiv:2603.22913v1 Announce Type: new Abstract: To address the critical scarcity of high-quality, publicly available counseling dialogue datasets, we created Multilingual KokoroChat by translating KokoroChat, a large-scale manually authored Japanese counseling corpus, into both English and Chinese. A key challenge in this process is that the optimal model for translation varies by input, making it impossible for any single model to consistently guarantee the highest quality. In a sensitive domain like counseling, where the highest possible translation fidelity is essential, relying on a single LLM is therefore insufficient. To overcome this challenge, we developed and employed a novel multi-LLM ensemble method. Our approach first generates diverse hypotheses from multiple distinct LLMs. A single LLM then produces a high-quality translation based on an analysis of the respective strengths and weaknesses of all presented hypotheses. The quality of "Multilingual KokoroChat" was rigorously validated through human preference studies. These evaluations confirmed that the translations produced by our ensemble method were preferred over those produced by any individual state-of-the-art LLM, confirming the superior quality of our method's outputs. The Multilingual KokoroChat is available at https://github.com/UEC-InabaLab/MultilingualKokoroChat.

Executive Summary

The article introduces Multilingual KokoroChat, a novel initiative to address the critical shortage of high-quality counseling dialogue datasets by translating a Japanese corpus into English and Chinese using a multi-LLM ensemble method. Recognizing that translation quality varies by input across LLMs, the authors first generate diverse hypotheses from multiple distinct models, then have a single LLM produce a final translation informed by the comparative strengths and weaknesses of all hypotheses, thereby mitigating the limitations of single-model approaches. The method’s effectiveness is validated through human preference studies, which demonstrate a clear preference for the ensemble outputs over those of individual LLMs. This approach offers a scalable and reliable solution for multilingual counseling dataset creation, particularly in sensitive domains where fidelity is paramount.
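The two-stage pipeline described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `call_llm` is a hypothetical stand-in for any chat-completion client, and the prompt wording is an assumption that merely paraphrases the paper's hypothesis-generation and synthesis stages.

```python
def generate_hypotheses(source_text, models, call_llm):
    """Stage 1: collect one candidate translation per model."""
    prompt = (
        "Translate the following Japanese counseling utterance into English:\n"
        + source_text
    )
    return [call_llm(model, prompt) for model in models]

def synthesize(source_text, hypotheses, synthesizer_model, call_llm):
    """Stage 2: a single LLM analyzes all candidates and emits one translation."""
    numbered = "\n".join(f"{i + 1}. {h}" for i, h in enumerate(hypotheses))
    prompt = (
        "Source (Japanese):\n" + source_text + "\n\n"
        "Candidate translations:\n" + numbered + "\n\n"
        "Analyze the strengths and weaknesses of each candidate, "
        "then output a single best English translation."
    )
    return call_llm(synthesizer_model, prompt)

def ensemble_translate(source_text, models, synthesizer_model, call_llm):
    """Full pipeline: diverse hypotheses, then one synthesized output."""
    hypotheses = generate_hypotheses(source_text, models, call_llm)
    return synthesize(source_text, hypotheses, synthesizer_model, call_llm)
```

The design choice worth noting is that the synthesizer sees all candidates at once, so it can borrow the best phrasing from each rather than merely voting among them.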

Key Points

  • Use of multi-LLM ensemble to overcome inconsistent translation quality
  • Translation of Japanese counseling corpus into English and Chinese
  • Human preference validation confirming ensemble superiority

Merits

Innovation

The multi-LLM ensemble introduces a novel approach to domain-sensitive translation by leveraging comparative strengths across models, enhancing fidelity in critical applications.

Validation

Rigorous human preference studies provide empirical evidence of the ensemble’s superior quality, lending credibility to the methodology.

Demerits

Complexity

Coordinating multiple LLMs introduces logistical and computational overhead, potentially limiting scalability without infrastructure support.

Generalizability

While validated for Japanese-to-English and Japanese-to-Chinese translation of a single counseling corpus, applicability to other language pairs or domains remains unproven and may require further adaptation.

Expert Commentary

This paper represents a substantive advance at the intersection of AI translation and domain-specific data curation. The authors intelligently navigate the inherent variability of large language models by deploying an ensemble strategy that acknowledges the limitations of individual systems while maximizing collective strengths. The validation methodology, anchored in human preference, is particularly commendable, as it aligns evaluation with user-centric outcomes rather than technical metrics alone. Moreover, the open-source distribution of Multilingual KokoroChat exemplifies a commitment to open science and accessibility. While the computational cost of multi-LLM coordination remains a practical hurdle, the authors’ approach offers a replicable template for similar initiatives in other sensitive domains. Their work underscores a broader shift toward hybrid, ensemble-based AI solutions that prioritize output quality over the simplicity of a single model in critical applications.

Recommendations

  • Adopt the multi-LLM ensemble framework for other multilingual counseling or clinical datasets where translation fidelity is paramount.
  • Explore automated metrics to complement human evaluations for scalable quality assurance in future multilingual AI projects.
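As one concrete direction for the second recommendation, a surface-overlap metric can cheaply screen translations before costly human studies. The sketch below is a from-scratch character n-gram F-score in the spirit of chrF; it is an illustrative assumption, not a metric used by the paper, and a production pipeline would more likely rely on an established implementation (e.g. sacreBLEU's chrF or a learned metric such as COMET).

```python
from collections import Counter

def char_ngrams(text, n):
    """Count character n-grams, ignoring spaces (chrF-style)."""
    text = text.replace(" ", "")
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def chrf_like(hypothesis, reference, max_n=3, beta=2.0):
    """Average character n-gram F_beta over n = 1..max_n, in [0, 1]."""
    scores = []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if not hyp or not ref:
            continue
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precision = overlap / sum(hyp.values())
        recall = overlap / sum(ref.values())
        if precision + recall == 0:
            scores.append(0.0)
            continue
        scores.append(
            (1 + beta ** 2) * precision * recall
            / (beta ** 2 * precision + recall)
        )
    return sum(scores) / len(scores) if scores else 0.0
```

Character-level matching makes the score usable across the dataset's language pairs without tokenization choices, which is why chrF-style metrics are a common baseline for Chinese and Japanese output.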

Sources

Original: arXiv - cs.CL