Chitrakshara: A Large Multilingual Multimodal Dataset for Indian languages
arXiv:2603.23521v1 Announce Type: new Abstract: Multimodal research has predominantly focused on single-image reasoning, with limited exploration of multi-image scenarios. Recent models have sought to enhance multi-image understanding through large-scale pretraining on interleaved image-text datasets. However, most Vision-Language Models (VLMs) are trained primarily on English datasets, leading to inadequate representation of Indian languages. To address this gap, we introduce the Chitrakshara dataset series, covering 11 Indian languages sourced from Common Crawl. It comprises (1) Chitrakshara-IL, a large-scale interleaved pretraining dataset with 193M images, 30B text tokens, and 50M multilingual documents, and (2) Chitrakshara-Cap, which includes 44M image-text pairs with 733M tokens. This paper details the data collection pipeline, including curation, filtering, and processing methodologies. Additionally, we present a comprehensive quality and diversity analysis to assess the dataset's representativeness across Indic languages and its potential for developing more culturally inclusive VLMs.
Executive Summary
This article presents the Chitrakshara dataset series, a multilingual and multimodal dataset covering 11 Indian languages sourced from Common Crawl. The series has two components: Chitrakshara-IL, a large-scale interleaved pretraining dataset (193M images, 30B text tokens, 50M multilingual documents), and Chitrakshara-Cap, a collection of 44M image-text pairs (733M tokens). The authors detail their data collection pipeline, covering curation, filtering, and processing, and conduct a quality and diversity analysis to assess the dataset's representativeness across Indic languages. The Chitrakshara series has the potential to support the development of more culturally inclusive Vision-Language Models (VLMs) and to address the gap left by existing VLMs, which are trained primarily on English datasets.
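The abstract does not specify how the pipeline's language filtering works, but a common first pass for Indic-language web corpora is a Unicode-script heuristic: keep a crawled document only if most of its letters fall in Indic script blocks. The sketch below is a minimal, hypothetical illustration of such a step (the function names, the script subset, and the 0.5 threshold are assumptions, not the paper's method):

```python
# Hypothetical script-based filter for Indic-language web documents.
# The actual Chitrakshara pipeline is not described in the abstract;
# this only illustrates the general filtering idea.

# Unicode block ranges (start, end codepoints) for a few major Indic scripts.
INDIC_RANGES = {
    "devanagari": (0x0900, 0x097F),  # Hindi, Marathi, ...
    "bengali":    (0x0980, 0x09FF),
    "tamil":      (0x0B80, 0x0BFF),
    "telugu":     (0x0C00, 0x0C7F),
}

def indic_char_ratio(text: str) -> float:
    """Fraction of alphabetic characters that lie in an Indic block."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return 0.0
    indic = sum(
        1 for c in letters
        if any(lo <= ord(c) <= hi for lo, hi in INDIC_RANGES.values())
    )
    return indic / len(letters)

def keep_document(text: str, threshold: float = 0.5) -> bool:
    """Keep a document only if most of its letters are in an Indic script."""
    return indic_char_ratio(text) >= threshold
```

In practice a pipeline at this scale would likely pair such a cheap pre-filter with a trained language identifier to distinguish languages that share a script (e.g. Hindi and Marathi, both written in Devanagari).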
Key Points
- ▸ The Chitrakshara dataset series addresses the gap in existing Vision-Language Models (VLMs) trained primarily on English datasets.
- ▸ The series comprises two components: Chitrakshara-IL, a large-scale interleaved pretraining dataset (193M images, 30B text tokens, 50M documents), and Chitrakshara-Cap, 44M image-text pairs (733M tokens).
- ▸ A comprehensive quality and diversity analysis is conducted to assess the dataset's representativeness across Indic languages.
Merits
Strength in Addressing Language Gap
By covering 11 Indian languages at web scale, the Chitrakshara series directly targets the underrepresentation of Indic languages in VLM pretraining corpora, which remain dominated by English data.
Demerits
Limitation in Data Quality
The authors acknowledge the potential for data quality issues due to the use of Common Crawl, which may contain errors or biases.
Expert Commentary
The Chitrakshara dataset series is a significant contribution to multilingual and multimodal research, addressing a critical gap in existing VLMs with a large-scale resource for Indian languages. It remains essential, however, to acknowledge the data-quality limitations inherent in web-crawled sources and to curate and process the dataset so as to minimize errors and biases. The development of culturally inclusive VLMs also carries policy implications, particularly for language access and the digital divide in India: as AI models spread across industries, it is crucial that they remain culturally sensitive and inclusive in regions with diverse linguistic and cultural backgrounds.
Recommendations
- ✓ Future research should focus on developing more robust and culturally sensitive VLMs using the Chitrakshara dataset series.
- ✓ Policymakers and industry stakeholders should prioritize the development of culturally inclusive AI models to address the language access and digital divide in India.
Sources
Original: arXiv - cs.CL