Using Machine Learning to Enhance the Detection of Obfuscated Abusive Words in Swahili: A Focus on Child Safety

Phyllis Nabangi, Abdul-Jalil Zakaria, Jema David Ndibwile

arXiv:2602.13455v1

Abstract: The rise of digital technology has dramatically increased the potential for cyberbullying and online abuse, necessitating enhanced measures for detection and prevention, especially among children. This study focuses on detecting abusive obfuscated language in Swahili, a low-resource language that poses unique challenges due to its limited linguistic resources and technological support. Swahili is chosen due to its popularity and being the most widely spoken language in Africa, with over 16 million native speakers and upwards of 100 million speakers in total, spanning regions in East Africa and some parts of the Middle East. We employed machine learning models including Support Vector Machines (SVM), Logistic Regression, and Decision Trees, optimized through rigorous parameter tuning and techniques like Synthetic Minority Over-sampling Technique (SMOTE) to handle data imbalance. Our analysis revealed that, while these models perform well in high-dimensional textual data, our dataset's small size and imbalance limit our findings' generalizability. Precision, recall, and F1 scores were thoroughly analyzed, highlighting the nuanced performance of each model in detecting obfuscated language. This research contributes to the broader discourse on ensuring safer online environments for children, advocating for expanded datasets and advanced machine-learning techniques to improve the effectiveness of cyberbullying detection systems. Future work will focus on enhancing data robustness, exploring transfer learning, and integrating multimodal data to create more comprehensive and culturally sensitive detection mechanisms.

Executive Summary

The article 'Using Machine Learning to Enhance the Detection of Obfuscated Abusive Words in Swahili: A Focus on Child Safety' addresses cyberbullying and online abuse in Swahili, a low-resource language. The study applies Support Vector Machines (SVM), Logistic Regression, and Decision Trees to detect obfuscated abusive language, tuning their parameters and using the Synthetic Minority Over-sampling Technique (SMOTE) to handle class imbalance. Although these model families generally perform well on high-dimensional text data, the small and imbalanced dataset limits the generalizability of the findings. The authors call for expanded datasets and more advanced techniques to improve cyberbullying detection and make online environments safer for children.

Key Points

  • The study focuses on detecting abusive obfuscated language in Swahili, a low-resource language with limited linguistic resources.
  • Machine learning models such as SVM, Logistic Regression, and Decision Trees were employed, optimized through parameter tuning, and paired with SMOTE to handle class imbalance.
  • The small size and imbalance of the dataset limit the generalizability of the findings.
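To make the SMOTE step mentioned in the key points more concrete, the following is a minimal pure-Python sketch of the core idea: each synthetic minority sample is placed at a random position on the line segment between a minority sample and one of its nearest minority neighbours. The abstract does not say which implementation or settings the authors used, so the function name smote_oversample, the parameter k, and the toy 2-D vectors are all assumptions for illustration only. In practice a library implementation such as the one in imbalanced-learn would normally be used instead.

```python
import math
import random


def smote_oversample(minority, n_new, k=2, seed=0):
    """Return n_new synthetic points interpolated between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        # Pick a random minority sample as the base point.
        i = rng.randrange(len(minority))
        base = minority[i]
        # Find its k nearest minority neighbours (excluding itself).
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        # Place the new point somewhere on the segment base -> neighbour.
        gap = rng.random()
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Hypothetical 2-D feature vectors (assumption); real text features
# would have far more dimensions.
abusive = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1)]  # minority class
benign = [
    (5.0, 5.0), (5.1, 4.8), (4.9, 5.2),
    (5.3, 5.1), (4.7, 4.9), (5.0, 5.4),
]  # majority class

new_points = smote_oversample(abusive, n_new=len(benign) - len(abusive))
balanced_minority = abusive + new_points
print(len(balanced_minority), len(benign))
```

Because the synthetic points are interpolated from existing minority samples, they stay inside the region the minority class already occupies. This balances class counts but cannot add genuinely new examples, which is consistent with the dataset-size limitation noted above.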

Merits

Innovative Approach

The study addresses a critical gap in the detection of abusive language in a low-resource language, Swahili, which is widely spoken in East Africa.

Comprehensive Analysis

The research thoroughly analyzes precision, recall, and F1 scores, providing nuanced insights into the performance of each model.
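As a reminder of what these three metrics measure, here is a small sketch that computes them from scratch for the positive (abusive) class. The labels are invented for illustration and are not results from the paper. The function name precision_recall_f1 is likewise hypothetical.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Hypothetical labels (assumption): 1 = abusive, 0 = benign.
# The classes are imbalanced: 8 benign messages, 2 abusive ones.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(accuracy, precision, recall, f1)
```

In this toy case accuracy is 0.8 while precision, recall, and F1 for the abusive class are each 0.5. This illustrates why per-class metrics are more informative than accuracy on an imbalanced dataset.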

Advocacy for Child Safety

The study contributes to the broader discourse on ensuring safer online environments for children, emphasizing the need for expanded datasets and advanced machine-learning techniques.

Demerits

Limited Dataset

The dataset is small and imbalanced, which limits how far the findings can be generalized.

Technical Limitations

The study acknowledges that, although the chosen model families generally perform well on high-dimensional text data, the reported results may not carry over to other Swahili text because of the dataset's limitations.

Expert Commentary

The article presents a timely and relevant exploration of the challenges in detecting obfuscated abusive language in Swahili, a language that is widely spoken but under-researched in the context of digital safety. The use of machine learning models such as SVM, Logistic Regression, and Decision Trees is a commendable approach, given the complexity of the task. However, the limitations imposed by the small and imbalanced dataset are significant and warrant further attention. The study's emphasis on the need for expanded datasets and advanced techniques is crucial for the future development of robust detection systems. The implications for child safety are profound, as the digital landscape continues to evolve, and the need for effective measures to combat cyberbullying becomes increasingly urgent. The research contributes valuable insights to the field and sets a foundation for future work in this critical area.

Recommendations

  • Future research should focus on expanding the dataset to include a more diverse range of obfuscated abusive language examples, thereby enhancing the generalizability of the findings.
  • Exploring transfer learning and integrating multimodal data could significantly improve the robustness and cultural sensitivity of detection mechanisms.
