EPIC-EuroParl-UdS: Information-Theoretic Perspectives on Translation and Interpreting

Maria Kunilovskaya, Christina Pollkläsener

arXiv:2603.09785v1. Abstract: This paper introduces an updated and combined version of the bidirectional English-German EPIC-UdS (spoken) and EuroParl-UdS (written) corpora containing original European Parliament speeches as well as their translations and interpretations. The new version corrects metadata and text errors identified through previous use, refines the content, updates linguistic annotations, and adds new layers, including word alignment and word-level surprisal indices. The combined resource is designed to support research using information-theoretic approaches to language variation, particularly studies comparing written and spoken modes, and examining disfluencies in speech, as well as traditional translationese studies, including parallel (source vs. target) and comparable (original vs. translated) analyses. The paper outlines the updates introduced in this release, summarises previous results based on the corpus, and presents a new illustrative study. The study validates the integrity of the rebuilt spoken data and evaluates probabilistic measures derived from base and fine-tuned GPT-2 and machine translation models on the task of filler particles prediction in interpreting.

Executive Summary

The paper presents an updated and combined version of the EPIC-UdS (spoken) and EuroParl-UdS (written) corpora, a bidirectional English-German resource for research on language variation, translation, and interpreting. The corpus pairs original European Parliament speeches with their translations and interpretations, which supports comparisons of written and spoken modes, translationese studies, and work on disfluencies in speech. The release corrects metadata and text errors, refines the content, updates the linguistic annotations, and adds new layers, including word alignment and word-level surprisal indices. The paper also summarises previous results based on the corpus and presents a new illustrative study. That study validates the integrity of the rebuilt spoken data and evaluates probabilistic measures derived from base and fine-tuned GPT-2 and machine translation models on the task of predicting filler particles in interpreting.

Key Points

  • Update and combination of EPIC-UdS and EuroParl-UdS corpora
  • Corrections to metadata and text errors, refined content, and updated linguistic annotations
  • New layers added to the corpus, including word alignment and word-level surprisal indices
  • Illustrative study on filler particle prediction in interpreting
  • Evaluation of probabilistic measures derived from base and fine-tuned GPT-2 and machine translation models
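
To make the notion of a word-level surprisal index concrete, here is a minimal sketch in Python. Surprisal is the negative log probability of an item given its context, measured here in bits. The summary does not say how the corpus aggregates subword scores into word scores, so summing over a word's subword pieces is an assumption (a common convention), and the function names, the toy pieces, and the toy probabilities are all hypothetical. In practice the probabilities would come from a language model such as GPT-2.

```python
import math
from typing import List, Tuple


def token_surprisal(prob: float) -> float:
    """Surprisal in bits of one token: -log2 of its conditional probability."""
    if not 0.0 < prob <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -math.log2(prob)


def word_surprisals(
    tokens: List[Tuple[str, float, bool]]
) -> List[Tuple[str, float]]:
    """Aggregate subword surprisal to word level by summation (assumed convention).

    Each input item is (piece, conditional probability, starts_new_word).
    """
    words: List[List] = []
    for piece, prob, starts_word in tokens:
        s = token_surprisal(prob)
        if starts_word or not words:
            # Open a new word with this piece's surprisal.
            words.append([piece, s])
        else:
            # Continuation piece: extend the word and add its surprisal.
            words[-1][0] += piece
            words[-1][1] += s
    return [(w, s) for w, s in words]


# Hypothetical model output: made-up probabilities, not corpus data.
toy = [
    ("inter", 0.25, True),
    ("preting", 0.5, False),
    ("is", 0.5, True),
    ("hard", 0.125, True),
]
result = word_surprisals(toy)
for word, bits in result:
    print(f"{word}\t{bits:.2f} bits")
```

Summation follows from the chain rule: a word's probability is the product of its pieces' conditional probabilities, so its surprisal is the sum of their surprisals. Higher values mark words that are less predictable in context.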

Merits

Strength in Corpus Design

The updated corpus combines spoken and written European Parliament material in both English-German directions, so a single resource supports parallel (source vs. target) and comparable (original vs. translated) analyses across translation and interpreting.

Methodological Rigor

The paper documents the corpus updates in detail, and the illustrative study validates the integrity of the rebuilt spoken data in addition to evaluating model-derived measures.

Demerits

Limited Generalizability

The corpus is limited to European Parliament speeches, which may not be representative of language variation and translation processes in other contexts.

Dependence on Model-Derived Measures

The illustrative study relies on probabilistic measures derived from GPT-2 and machine translation models, which may limit how directly its findings apply to human translation and interpreting processes.

Expert Commentary

The paper delivers a significant update to the EPIC-EuroParl-UdS corpus, a valuable resource for research on language variation, translation, and interpreting. Its main strengths are the combined spoken and written design and the detailed account of the corpus updates. Its main limitations are the restriction to European Parliament speeches and the dependence of the illustrative study on model-derived probabilistic measures. The implications are primarily practical, with potential applications in the development of translation and interpreting tools and techniques. Policy implications are more indirect, though the resource may inform language teaching, language policy, and translation quality assessment.

Recommendations

  • Future studies should investigate the generalizability of the corpus and its findings to other contexts and languages.
  • Measures derived from GPT-2 and machine translation models should be complemented with human-oriented approaches to translation and interpreting, so that conclusions about these processes do not rest on model estimates alone.

Sources