Exa-PSD: a new Persian sentiment analysis dataset on Twitter
arXiv:2602.20892v1 Announce Type: new Abstract: Today, social networks such as Twitter are the most widely used platforms for communication. Analyzing this data yields useful information about the opinions people express in tweets. Sentiment analysis, which identifies individuals' opinions about a specific topic, plays a vital role in NLP. Natural language processing in Persian faces many challenges despite the advent of strong language models. The datasets available in Persian generally cover specific topics such as products, foods, and hotels, while users on social media may use irony and colloquial phrases. To overcome these challenges, a Persian sentiment analysis dataset built from Twitter is needed. In this paper, we introduce the Exa Persian sentiment analysis dataset (Exa-PSD), collected from Persian tweets. This dataset contains 12,000 tweets, annotated by 5 native Persian taggers. The data is labeled in 3 classes: positive, neutral, and negative. We present the characteristics and statistics of this dataset and use pre-trained ParsBERT and RoBERTa as base models to evaluate it. Our evaluation reached a macro F-score of 79.87, which shows that the model and data can be valuable for a sentiment analysis system.
Executive Summary
This article introduces Exa-PSD, a new Persian sentiment analysis dataset collected from Twitter. The dataset comprises 12,000 tweets, annotated by 5 native Persian speakers as positive, neutral, or negative. To evaluate the dataset, the authors use pre-trained ParsBERT and RoBERTa as base models and report a macro F-score of 79.87. While the dataset fills a gap in Persian sentiment analysis, where existing resources mostly cover narrow topics such as products, foods, and hotels, its relatively small size and single source (Twitter) may limit its generalizability. The study contributes to the development of sentiment analysis systems for the Persian language, particularly for informal social media text that contains irony and colloquial phrasing.
Key Points
- Exa-PSD is a newly introduced Persian sentiment analysis dataset collected from Twitter.
- The dataset comprises 12,000 tweets, annotated by 5 native Persian speakers as positive, neutral, or negative.
- Using pre-trained ParsBERT and RoBERTa as base models, the authors report a macro F-score of 79.87 on the dataset.
- The study highlights the role of sentiment analysis in NLP and the challenges of analyzing Persian text, including the irony and colloquial phrasing common on social media.
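For readers unfamiliar with the reported metric, the macro F-score is the unweighted mean of the per-class F1 scores, so each of the three classes counts equally regardless of how many tweets it contains. The following is a minimal sketch in plain Python, not the authors' evaluation code; the function name `macro_f1` and the toy labels are illustrative assumptions. It returns a fraction in [0, 1], whereas the paper reports the score on a 0-100 scale.

```python
from collections import Counter

# The three sentiment classes named in the abstract.
LABELS = ("positive", "neutral", "negative")


def macro_f1(y_true, y_pred, labels=LABELS):
    """Return the unweighted mean of per-class F1 scores (hypothetical helper)."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1   # correct prediction for this class
        else:
            fp[pred] += 1    # predicted class gains a false positive
            fn[truth] += 1   # true class gains a false negative
    scores = []
    for label in labels:
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never appears.
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy data, invented for illustration only (not drawn from Exa-PSD).
y_true = ["positive", "positive", "neutral", "negative", "negative", "neutral"]
y_pred = ["positive", "neutral", "neutral", "negative", "positive", "neutral"]
score = macro_f1(y_true, y_pred)
print(f"macro F1 = {score:.4f}")
```

Because every class contributes equally, a model that ignores a rare class is penalized more heavily under macro averaging than under plain accuracy, which is one reason the metric is commonly used for multi-class sentiment tasks.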
Merits
Strength in Addressing a Critical Gap
Exa-PSD fills a significant gap in Persian sentiment analysis, providing a valuable resource for researchers and developers working on NLP applications in the Persian language.
Demerits
Limitations in Dataset Size and Domain
With 12,000 tweets drawn from a single platform (Twitter), the dataset is relatively small and narrow in source, which may limit how well models trained on it generalize to other platforms, domains, or text genres.
Expert Commentary
The introduction of Exa-PSD is a useful step forward for Persian NLP. Its limitations should not be overlooked, but it offers researchers and developers a resource for social media text that earlier topic-specific Persian datasets did not cover. The study underlines the importance of language-specific datasets for training effective machine learning models. As NLP continues to evolve, sentiment analysis for non-English languages such as Persian will remain an important area of research with clear practical applications.
Recommendations
- Future research should expand the dataset's size and extend it beyond Twitter to improve its generalizability.
- Developers and researchers should consider integrating Exa-PSD into existing NLP pipelines to improve the accuracy of sentiment analysis systems for the Persian language.