Chitrakshara: A Large Multilingual Multimodal Dataset for Indian languages
arXiv:2603.23521v1 Announce Type: new Abstract: Multimodal research has predominantly focused on single-image reasoning, with limited exploration of multi-image scenarios. Recent models have sought to enhance multi-image understanding through large-scale pretraining on interleaved image-text datasets. However, most Vision-Language Models (VLMs) are trained primarily on English datasets, leading to inadequate representation of Indian languages. To address this gap, we introduce the Chitrakshara dataset series, covering 11 Indian languages sourced from Common Crawl. It comprises (1) Chitrakshara-IL, a large-scale interleaved pretraining dataset with 193M images, 30B text tokens, and 50M multilingual documents, and (2) Chitrakshara-Cap, which includes 44M image-text pairs with 733M tokens. This paper details the data collection pipeline, including curation, filtering, and processing methodologies. Additionally, we present a comprehensive quality and diversity analysis to assess the dataset's representativeness across Indic languages and its potential for developing more culturally inclusive VLMs.
Executive Summary
This article presents the Chitrakshara dataset series, a multilingual and multimodal dataset covering 11 Indian languages sourced from Common Crawl. The series has two components: Chitrakshara-IL, a large-scale interleaved pretraining dataset (193M images, 30B text tokens, 50M multilingual documents), and Chitrakshara-Cap, a collection of 44M image-text pairs (733M tokens). The authors detail their data collection pipeline, covering curation, filtering, and processing, and conduct a quality and diversity analysis to assess the dataset's representativeness across Indic languages. The Chitrakshara series has the potential to support the development of more culturally inclusive Vision-Language Models (VLMs) and to address the gap left by existing VLMs, which are trained primarily on English datasets.
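The abstract does not specify how the pipeline's language filtering works, but a common first pass for Indic-language web corpora is a Unicode-script heuristic: keep a crawled document only if most of its letters fall in Indic script blocks. The sketch below is a minimal, hypothetical illustration of such a step (the function names, the script subset, and the 0.5 threshold are assumptions, not the paper's method):

```python
# Hypothetical script-based filter for Indic-language web documents.
# The actual Chitrakshara pipeline is not described in the abstract;
# this only illustrates the general filtering idea.

# Unicode block ranges (start, end codepoints) for a few major Indic scripts.
INDIC_RANGES = {
    "devanagari": (0x0900, 0x097F),  # Hindi, Marathi, ...
    "bengali":    (0x0980, 0x09FF),
    "tamil":      (0x0B80, 0x0BFF),
    "telugu":     (0x0C00, 0x0C7F),
}

def indic_char_ratio(text: str) -> float:
    """Fraction of alphabetic characters that lie in an Indic block."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return 0.0
    indic = sum(
        1 for c in letters
        if any(lo <= ord(c) <= hi for lo, hi in INDIC_RANGES.values())
    )
    return indic / len(letters)

def keep_document(text: str, threshold: float = 0.5) -> bool:
    """Keep a document only if most of its letters are in an Indic script."""
    return indic_char_ratio(text) >= threshold
```

In practice a pipeline at this scale would likely pair such a cheap pre-filter with a trained language identifier to distinguish languages that share a script (e.g. Hindi and Marathi, both written in Devanagari).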
Key Points
- ▸ The Chitrakshara dataset series addresses the gap in existing Vision-Language Models (VLMs) trained primarily on English datasets.
- ▸ The series comprises two components: Chitrakshara-IL, a large-scale interleaved pretraining dataset (193M images, 30B text tokens, 50M documents), and Chitrakshara-Cap, 44M image-text pairs (733M tokens).
- ▸ A comprehensive quality and diversity analysis is conducted to assess the dataset's representativeness across Indic languages.
Merits
Strength in Addressing Language Gap
By covering 11 Indian languages at web scale, the Chitrakshara series directly targets the underrepresentation of Indic languages in VLM pretraining corpora, which remain dominated by English data.
Demerits
Limitation in Data Quality
The authors acknowledge the potential for data quality issues due to the use of Common Crawl, which may contain errors or biases.
Expert Commentary
The Chitrakshara dataset series is a significant contribution to multilingual and multimodal research, addressing a critical gap in existing VLMs with a large-scale resource for Indian languages. It remains essential, however, to acknowledge the data-quality limitations inherent in web-crawled sources and to curate and process the dataset so as to minimize errors and biases. The development of culturally inclusive VLMs also carries policy implications, particularly for language access and the digital divide in India: as AI models spread across industries, it is crucial that they remain culturally sensitive and inclusive in regions with diverse linguistic and cultural backgrounds.
Recommendations
- ✓ Future research should focus on developing more robust and culturally sensitive VLMs using the Chitrakshara dataset series.
- ✓ Policymakers and industry stakeholders should prioritize the development of culturally inclusive AI models to address the language access and digital divide in India.
Sources
Original: arXiv - cs.CL