THIVLVC: Retrieval Augmented Dependency Parsing for Latin
arXiv:2604.05564v1 Announce Type: new Abstract: We describe THIVLVC, a two-stage system for the EvaLatin 2026 Dependency Parsing task. Given a Latin sentence, we retrieve structurally similar entries from the CIRCSE treebank using sentence length and POS n-gram similarity, then prompt a large language model to refine the baseline parse from UDPipe using the retrieved examples and UD annotation guidelines. We submit two configurations: one without retrieval and one with retrieval (RAG). On poetry (Seneca), THIVLVC improves CLAS by +17 points over the UDPipe baseline; on prose (Thomas Aquinas), the gain is +1.5 CLAS. A double-blind error analysis of 300 divergences between our system and the gold standard reveals that, among unanimous annotator decisions, 53.3% favour THIVLVC, showing annotation inconsistencies both within and across treebanks.
Executive Summary
The paper introduces THIVLVC, a two-stage retrieval-augmented dependency parsing system for Latin, built for the EvaLatin 2026 Dependency Parsing task. The system first retrieves structurally similar Latin sentences from the CIRCSE treebank using sentence length and POS n-gram similarity, then feeds these retrieved examples to a large language model (LLM), which refines baseline parses from UDPipe under the guidance of the UD annotation guidelines. Results show substantial CLAS improvements, particularly for poetry (Seneca), with a +17 point gain over UDPipe, and a modest +1.5 point gain for prose (Thomas Aquinas). A double-blind error analysis of 300 divergences from the gold standard finds that, among unanimous annotator decisions, 53.3% favor THIVLVC, while also exposing annotation inconsistencies within and across treebanks. The study highlights the potential of retrieval-augmented methods for low-resource, morphologically complex languages such as Latin.
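The paper only names the two retrieval signals (sentence length and POS n-gram similarity), not how they are combined or scored. The following is a minimal sketch of one plausible realization, assuming Jaccard overlap over POS trigrams and an equal-weight combination with normalized length closeness; the weighting, n-gram order, and similarity measure are illustrative assumptions, not the paper's actual formula.

```python
from collections import Counter

def pos_ngrams(pos_tags, n=3):
    """Extract POS n-grams from a sequence of UPOS tags."""
    return Counter(tuple(pos_tags[i:i + n]) for i in range(len(pos_tags) - n + 1))

def similarity(query_pos, cand_pos, n=3, length_weight=0.5):
    """Combine length closeness with POS n-gram Jaccard overlap.
    The weighting is an illustrative assumption, not the paper's formula."""
    len_sim = 1.0 - abs(len(query_pos) - len(cand_pos)) / max(len(query_pos), len(cand_pos))
    q, c = pos_ngrams(query_pos, n), pos_ngrams(cand_pos, n)
    inter = sum((q & c).values())   # multiset intersection size
    union = sum((q | c).values())   # multiset union size
    ngram_sim = inter / union if union else 0.0
    return length_weight * len_sim + (1 - length_weight) * ngram_sim

def retrieve(query_pos, treebank, k=3):
    """Return the k treebank entries structurally most similar to the query."""
    ranked = sorted(treebank, key=lambda e: similarity(query_pos, e["pos"]), reverse=True)
    return ranked[:k]
```

Scoring on POS sequences rather than word forms is what makes the retrieval "structural": two sentences with disjoint vocabulary but parallel syntax can still rank as close neighbors.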
Key Points
- ▸ THIVLVC employs a two-stage retrieval-augmented parsing system combining structural retrieval from the CIRCSE treebank with LLM-based refinement guided by UD annotation guidelines.
- ▸ Performance gains are substantial for poetry (Seneca, +17 CLAS) but modest for prose (Thomas Aquinas, +1.5 CLAS), suggesting variability in effectiveness across different Latin text genres.
- ▸ In a double-blind error analysis of 300 divergences from the gold standard, over half of the unanimous annotator decisions (53.3%) favor THIVLVC, yet annotation inconsistencies within and across treebanks complicate evaluation.
- ▸ The system demonstrates the potential of retrieval-augmented generation (RAG) in enhancing dependency parsing for historically significant but low-resource languages.
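Since the headline numbers above are CLAS deltas, it may help to recall what the metric measures. A simplified sketch follows, computing labeled attachment F1 over content-word relations only, excluding the functional relations and punctuation per the CoNLL shared-task convention; the official definition additionally handles multiword tokens and language-specific relation subtypes, which are omitted here.

```python
# Functional relations excluded from CLAS, plus punctuation.
FUNCTIONAL = {"aux", "case", "cc", "clf", "cop", "det", "mark", "punct"}

def clas(gold, pred):
    """Simplified CLAS: labeled attachment F1 over content-word relations.
    gold/pred: per-token lists of (head_index, deprel)."""
    def content(rels):
        # Keep only relations whose universal deprel is a content relation.
        return {i: (h, d) for i, (h, d) in enumerate(rels)
                if d.split(":")[0] not in FUNCTIONAL}
    g, p = content(gold), content(pred)
    correct = sum(1 for i in g if i in p and p[i] == g[i])
    prec = correct / len(p) if p else 0.0
    rec = correct / len(g) if g else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0
```

Because function words are ignored, CLAS rewards getting the content-word skeleton of the tree right, which is where stylistically free word order, as in Senecan verse, hurts baseline parsers most.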
Merits
Innovative Methodology
The integration of retrieval-augmented generation (RAG) with dependency parsing for Latin represents a novel approach, leveraging structurally similar examples to refine baseline parses generated by established tools like UDPipe.
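The paper does not publish its prompt template, but the refinement stage it describes, conditioning an LLM on retrieved gold parses, the UDPipe baseline, and the annotation guidelines, can be sketched as a prompt builder. Every string below (the instruction wording, the section layout, the function name) is a hypothetical reconstruction, not THIVLVC's actual prompt.

```python
def build_prompt(sentence, baseline_conllu, examples, guidelines_excerpt):
    """Assemble an illustrative parse-refinement prompt.
    examples: list of (text, conllu) pairs retrieved from the treebank."""
    shots = "\n\n".join(f"Sentence: {t}\nGold parse:\n{c}" for t, c in examples)
    return (
        "You are an expert annotator of Latin Universal Dependencies.\n"
        f"Annotation guidelines:\n{guidelines_excerpt}\n\n"
        f"Structurally similar annotated sentences:\n{shots}\n\n"
        f"Sentence to parse: {sentence}\n"
        f"Baseline parse (UDPipe), possibly containing errors:\n{baseline_conllu}\n"
        "Return the corrected parse in CoNLL-U format."
    )
```

Framing the task as error correction of an existing parse, rather than parsing from scratch, lets the LLM keep the baseline's easy decisions and spend its capacity on the attachments the retrieved examples disambiguate.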
Significant Performance Gains
The system achieves notable improvements, particularly in poetry parsing, where a +17 CLAS gain over UDPipe underscores the efficacy of retrieval-augmented refinement in morphologically complex or stylistically distinct texts.
Rigorous Evaluation
The inclusion of a double-blind error analysis provides qualitative insights into the system's performance, revealing both strengths and limitations in handling divergences from the gold standard.
Demerits
Genre-Specific Limitations
The system's modest gain (+1.5 CLAS) for prose (Thomas Aquinas) compared to poetry suggests that retrieval-augmented methods may be less effective for certain Latin text genres, possibly due to stylistic or syntactic differences.
Annotation Inconsistencies
The error analysis highlights annotation inconsistencies within and across treebanks, which complicate the evaluation of parsing systems and may skew performance metrics.
Dependency on Treebank Size
The effectiveness of THIVLVC is contingent on the availability and quality of the CIRCSE treebank for retrieval, raising questions about scalability and applicability to other low-resource historical languages.
Expert Commentary
THIVLVC represents a significant advancement in the application of modern NLP techniques to historical linguistics, particularly for Latin, a language with rich morphological complexity and stylistic diversity. The two-stage retrieval-augmented approach effectively leverages structural similarity and LLM refinement to enhance dependency parsing, achieving impressive gains in poetry parsing. However, the modest improvements for prose highlight the need for genre-specific adaptations and further investigation into the factors influencing retrieval efficacy.

The error analysis, while illuminating, also underscores a critical challenge in historical NLP: inconsistency within and across treebanks. This issue is not merely an academic concern but a practical barrier to robust system evaluation and deployment. Future work should focus on improving treebank consistency and exploring hybrid models that combine retrieval-augmented techniques with traditional parsing methods to address genre-specific limitations. Additionally, the integration of LLMs into parsing workflows raises questions about scalability and computational feasibility, which warrant further exploration.
Recommendations
- ✓ Develop genre-specific retrieval strategies or fine-tune LLMs on domain-specific corpora to address the variability in performance across different Latin text genres.
- ✓ Conduct further research into the causes of annotation inconsistencies within and across treebanks, and establish standardized annotation guidelines to improve the reliability of parsing evaluations.
- ✓ Explore the integration of THIVLVC's methodology with other low-resource languages to assess its generalizability and scalability beyond Latin.
- ✓ Investigate the computational efficiency and resource requirements of retrieval-augmented parsing systems to ensure their practical deployment in resource-constrained environments.
- ✓ Collaborate with treebank curators to enhance the quality and consistency of annotations, thereby improving the foundation for NLP applications in historical linguistics.
Sources
Original: arXiv - cs.CL