SinhaLegal: A Benchmark Corpus for Information Extraction and Analysis in Sinhala Legislative Texts

Minduli Lasandi, Nevidu Jayatilleke

arXiv:2603.04854v1 (announce type: new)

Abstract: SinhaLegal introduces a Sinhala legislative text corpus containing approximately 2 million words across 1,206 legal documents. The dataset includes two types of legal documents: 1,065 Acts dated from 1981 to 2014 and 141 Bills from 2010 to 2014, which were systematically collected from official sources. The texts were extracted using OCR with Google Document AI, followed by extensive post-processing and manual cleaning to ensure high-quality, machine-readable content, along with dedicated metadata files for each document. A comprehensive evaluation was conducted, including corpus statistics, lexical diversity, word frequency analysis, named entity recognition, and topic modelling, demonstrating the structured and domain-specific nature of the corpus. Additionally, perplexity analysis using both large and small language models was performed to assess how effectively language models respond to domain-specific texts. The SinhaLegal corpus represents a vital resource designed to support NLP tasks such as summarisation, information extraction, and analysis, thereby bridging a critical gap in Sinhala legal research.
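The abstract lists lexical diversity and word frequency analysis among the evaluations but does not say how they were computed. As a rough illustration only, the sketch below computes a type-token ratio and the most frequent words for a piece of text. The whitespace tokenization, the `lexical_stats` helper, and the sample string are all assumptions made for this example, not the authors' method or data.

```python
from collections import Counter


def lexical_stats(text: str, top_n: int = 3):
    """Return (token count, type count, type-token ratio, top-n words).

    Assumption: tokens are whitespace-separated. The paper may use a
    different tokenizer or a different diversity measure.
    """
    tokens = text.split()
    counts = Counter(tokens)
    n_tokens = len(tokens)
    n_types = len(counts)
    # Type-token ratio: distinct words divided by total words.
    ttr = n_types / n_tokens if n_tokens else 0.0
    return n_tokens, n_types, ttr, counts.most_common(top_n)


# Hypothetical sample text, not taken from the corpus.
sample = "පනත අංක පනත නීතිය පනත අංක"
n_tokens, n_types, ttr, top_words = lexical_stats(sample, top_n=2)
print(n_tokens, n_types, round(ttr, 2), top_words)
```

A raw type-token ratio falls as texts get longer, so comparisons across documents of different lengths usually need a length-normalised variant.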

Executive Summary

The SinhaLegal corpus is a dataset of Sinhala legislative texts comprising approximately 2 million words across 1,206 legal documents: 1,065 Acts dated from 1981 to 2014 and 141 Bills from 2010 to 2014. The documents were collected from official sources, extracted using OCR with Google Document AI, and then post-processed and manually cleaned. Each document has a dedicated metadata file. The authors evaluate the corpus through corpus statistics, lexical diversity, word frequency analysis, named entity recognition, topic modelling, and perplexity analysis with both large and small language models. The resource aims to support NLP tasks such as summarization, information extraction, and analysis, addressing a significant gap in Sinhala legal research.

Key Points

  • Introduction of the SinhaLegal corpus, a dataset of Sinhala legislative texts
  • Comprehensive evaluation of the corpus, including corpus statistics and perplexity analysis
  • Support for NLP tasks such as summarization, information extraction, and analysis
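For readers unfamiliar with the perplexity analysis mentioned above, the following minimal sketch shows the standard definition: the exponential of the average negative log-probability a language model assigns to each token. The `perplexity` helper and the example log-probabilities are hypothetical; the models and the exact procedure used in the paper are not described in this summary.

```python
import math


def perplexity(token_log_probs):
    """Perplexity = exp(-mean of per-token natural-log probabilities)."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    avg_neg_log_prob = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_neg_log_prob)


# Hypothetical values: a model that gives every token probability 0.25
# is as uncertain as a uniform choice among four options.
uniform_guess = [math.log(0.25)] * 4
print(perplexity(uniform_guess))  # close to 4

# Higher probabilities on the observed tokens give lower perplexity.
confident_guess = [math.log(0.9)] * 4
print(perplexity(confident_guess) < perplexity(uniform_guess))
```

Lower perplexity means the model finds the text more predictable, which is why the measure is used to gauge how well a model handles domain-specific material.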

Merits

Comprehensive Dataset

With approximately 2 million words across 1,206 documents (1,065 Acts and 141 Bills), the SinhaLegal corpus offers a sizeable body of Sinhala legislative text for training and evaluating NLP models.

High-Quality Processing

The texts were extracted using OCR with Google Document AI and then post-processed and manually cleaned, a pipeline intended to yield high-quality, machine-readable content with fewer OCR errors.

Demerits

Limited Timeframe

The Acts span 1981 to 2014 and the Bills only 2010 to 2014, so the corpus does not cover legislation after 2014, which may limit its applicability to more recent legal developments.

Domain-Specific Focus

The corpus covers only legislative texts (Acts and Bills), which may restrict its usefulness for other legal genres or for general-domain Sinhala applications.

Expert Commentary

The SinhaLegal corpus represents a significant contribution to NLP and legal research, addressing a critical gap in Sinhala language processing. The reported evaluation and the OCR-plus-manual-cleaning pipeline support its reliability and usability. However, the limited timeframe and domain-specific focus may restrict its applicability. Further work is needed to expand the corpus and to explore its potential applications in legal research, policy-making, and governance.

Recommendations

  • Expansion of the corpus to include more recent documents and diverse legal texts
  • Exploration of the corpus's applications in legal research, policy-making, and governance
  • Development of NLP models and tools tailored to the Sinhala language and legislative domain
