NAU-QMUL: Utilizing BERT and CLIP for Multi-modal AI-Generated Image Detection
arXiv:2602.23863v1 (cross-listed) Abstract: With the aim of detecting AI-generated images and identifying the specific models responsible for their generation, we propose a multi-modal multi-task model. The model leverages pre-trained BERT and CLIP vision encoders for text and image feature extraction, respectively, and employs cross-modal feature fusion with a tailored multi-task loss function. Additionally, a pseudo-labeling-based data augmentation strategy was used to expand the training dataset with high-confidence samples. The model achieved fifth place in both Tasks A and B of the "CT2: AI-Generated Image Detection" competition, with F1 scores of 83.16% and 48.88%, respectively. These findings highlight the effectiveness of the proposed architecture and its potential for advancing AI-generated content detection in real-world scenarios. The source code for our method is published at https://github.com/xxxxxxxxy/AIGeneratedImageDetection.
Executive Summary
This article proposes a multi-modal multi-task model that leverages pre-trained BERT and CLIP vision encoders for text and image feature extraction, respectively. The model achieved notable results in the "CT2: AI-Generated Image Detection" competition, placing fifth in both Tasks A and B with F1 scores of 83.16% and 48.88%. The combination of cross-modal feature fusion, a tailored multi-task loss function, and pseudo-labeling-based data augmentation demonstrates the effectiveness of the proposed architecture, and its potential for advancing AI-generated content detection in real-world scenarios is substantial. However, the article lacks a thorough evaluation of the model on diverse datasets, leaving its generalizability to other AI-generated image detection tasks unclear, and it does not provide a detailed comparison with existing state-of-the-art models. Despite these limitations, the proposed architecture is a significant contribution to the field of AI-generated image detection.
Key Points
- ▸ The proposed model leverages pre-trained BERT and CLIP vision encoders for text and image feature extraction, respectively.
- ▸ The model fuses the two modalities through cross-modal feature fusion and is trained with a tailored multi-task loss function (a sketch of this architecture follows this list).
- ▸ Pseudo-labeling-based data augmentation expands the training dataset with high-confidence samples (see the second sketch below).
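As a concrete illustration, the following is a minimal PyTorch sketch of a dual-encoder, multi-task design of the kind the abstract describes. The checkpoint names, the concatenation-based fusion, the number of generator classes, and the loss weight `alpha` are all illustrative assumptions, not details taken from the paper.

```python
# A minimal sketch of the dual-encoder, multi-task architecture described
# in the abstract. Checkpoints, fusion strategy, and loss weighting are
# assumptions for illustration; the paper's exact design may differ.
import torch
import torch.nn as nn
from transformers import BertModel, CLIPVisionModel

NUM_GENERATORS = 10  # Task B class count; assumed, not taken from the paper


class MultiModalDetector(nn.Module):
    def __init__(self, num_generators: int = NUM_GENERATORS):
        super().__init__()
        # Pre-trained encoders for the two modalities.
        self.text_encoder = BertModel.from_pretrained("bert-base-uncased")
        self.image_encoder = CLIPVisionModel.from_pretrained(
            "openai/clip-vit-base-patch32"
        )
        text_dim = self.text_encoder.config.hidden_size    # 768
        image_dim = self.image_encoder.config.hidden_size  # 768
        # Simple cross-modal fusion: project the concatenated features.
        self.fusion = nn.Sequential(
            nn.Linear(text_dim + image_dim, 512),
            nn.ReLU(),
            nn.Dropout(0.1),
        )
        # Task A: real vs. AI-generated. Task B: which generator produced it.
        self.detect_head = nn.Linear(512, 2)
        self.attrib_head = nn.Linear(512, num_generators)

    def forward(self, input_ids, attention_mask, pixel_values):
        text_feat = self.text_encoder(
            input_ids=input_ids, attention_mask=attention_mask
        ).pooler_output
        image_feat = self.image_encoder(pixel_values=pixel_values).pooler_output
        fused = self.fusion(torch.cat([text_feat, image_feat], dim=-1))
        return self.detect_head(fused), self.attrib_head(fused)


def multitask_loss(detect_logits, attrib_logits, detect_labels,
                   attrib_labels, alpha: float = 0.5):
    """Weighted sum of the two cross-entropy terms (weighting is assumed)."""
    ce = nn.CrossEntropyLoss()
    return ce(detect_logits, detect_labels) + alpha * ce(attrib_logits, attrib_labels)
```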
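The pseudo-labeling augmentation can be sketched in the same spirit: run the current model over unlabeled data and keep only predictions above a confidence threshold as new training samples. The 0.95 threshold and the batch layout are assumptions for illustration.

```python
# A minimal sketch of pseudo-labeling-based augmentation: label unlabeled
# samples with the current model and keep only high-confidence predictions.
# The threshold and batch structure are illustrative assumptions.
import torch
import torch.nn.functional as F


@torch.no_grad()
def pseudo_label(model, unlabeled_loader, device, threshold: float = 0.95):
    model.eval()
    augmented = []
    for batch in unlabeled_loader:
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        pixel_values = batch["pixel_values"].to(device)
        detect_logits, _ = model(input_ids, attention_mask, pixel_values)
        probs = F.softmax(detect_logits, dim=-1)
        conf, pred = probs.max(dim=-1)
        # Keep only samples the model is confident about; the predicted
        # class becomes the pseudo-label for the next training round.
        for i in torch.nonzero(conf >= threshold).flatten():
            augmented.append({
                "input_ids": batch["input_ids"][i],
                "attention_mask": batch["attention_mask"][i],
                "pixel_values": batch["pixel_values"][i],
                "label": pred[i].item(),
            })
    return augmented
```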
Merits
Strength in Multi-modal Fusion
The proposed model combines text and image features through BERT and CLIP vision encoders, and its competition results demonstrate the effectiveness of multi-modal fusion for AI-generated image detection.
Demerits
Lack of Diversity in Dataset Evaluation
The article does not thoroughly evaluate the model on diverse datasets, which makes it difficult to assess its generalizability to other AI-generated image detection tasks.
Insufficient Comparison with Existing Models
The article does not provide a detailed comparison with existing state-of-the-art models, making it difficult to assess whether the proposed model actually improves on prior work in AI-generated image detection.
Expert Commentary
The proposed model is a significant contribution to the field of AI-generated image detection, combining multi-modal fusion with pseudo-labeling-based data augmentation. However, the evaluation's limited dataset diversity and the absence of comparisons with existing models are notable gaps. To build confidence in the model's performance, future work should evaluate it on more diverse datasets and benchmark it against state-of-the-art baselines. The model's implications for deepfake detection and policy regulation also deserve deeper exploration.
Recommendations
- ✓ Future research should focus on evaluating the proposed model's performance on diverse datasets and comparing it with existing state-of-the-art models.
- ✓ The model's potential implications for deepfake detection and policy regulation should be explored in greater detail, including the development of guidelines for the regulation of AI-generated content.