Click it or Leave it: Detecting and Spoiling Clickbait with Informativeness Measures and Large Language Models
arXiv:2602.18171v1 Announce Type: new Abstract: Clickbait headlines degrade the quality of online information and undermine user trust. We present a hybrid approach to clickbait detection that combines transformer-based text embeddings with linguistically motivated informativeness features. Using natural language processing techniques, we evaluate classical vectorizers, word embedding baselines, and large language model embeddings paired with tree-based classifiers. Our best-performing model, XGBoost over embeddings augmented with 15 explicit features, achieves an F1-score of 91%, outperforming TF-IDF, Word2Vec, GloVe, LLM prompt-based classification, and feature-only baselines. The proposed feature set enhances interpretability by highlighting salient linguistic cues such as second-person pronouns, superlatives, numerals, and attention-oriented punctuation, enabling transparent and well-calibrated clickbait predictions. We release code and trained models to support reproducible research.
Executive Summary
This article proposes a hybrid approach to detecting clickbait headlines by combining transformer-based text embeddings with linguistically motivated informativeness features. The model achieves an F1-score of 91%, outperforming traditional baselines. The proposed feature set enhances interpretability by highlighting salient linguistic cues, enabling transparent and well-calibrated clickbait predictions. The study contributes to the development of more effective clickbait detection methods, improving online information quality and user trust.
Key Points
- ▸ Hybrid approach combining transformer-based text embeddings and linguistically motivated informativeness features
- ▸ Achieves an F1-score of 91%, outperforming traditional baselines
- ▸ Proposed feature set enhances interpretability by highlighting salient linguistic cues
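The cues the abstract names (second-person pronouns, superlatives, numerals, attention-oriented punctuation) can be counted with simple surface heuristics. A minimal sketch of four such counts follows; the function name and the heuristics are illustrative assumptions, not the paper's implementation, which uses 15 explicit features:

```python
import re

def clickbait_cues(headline: str) -> dict:
    """Count a small illustrative subset of the linguistic cues the
    paper highlights (the full model uses 15 explicit features)."""
    tokens = re.findall(r"[A-Za-z']+", headline.lower())
    second_person = {"you", "your", "yours", "yourself"}
    # Crude superlative heuristic: common irregulars plus an '-est' suffix.
    superlative_lexicon = {"best", "worst", "most", "least"}
    return {
        "second_person": sum(t in second_person for t in tokens),
        "superlative": sum(
            t in superlative_lexicon or (t.endswith("est") and len(t) > 4)
            for t in tokens
        ),
        "numeral": len(re.findall(r"\d+", headline)),
        "attention_punct": headline.count("!") + headline.count("?"),
    }

print(clickbait_cues("10 Things You Won't Believe About Your Phone!"))
# → {'second_person': 2, 'superlative': 0, 'numeral': 1, 'attention_punct': 1}
```

Counts like these are cheap to compute and directly inspectable, which is what makes the downstream predictions easier to explain than embedding dimensions alone.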
Merits
High Accuracy
The proposed model achieves a high F1-score of 91%, indicating its effectiveness in detecting clickbait headlines
Interpretability
The proposed feature set provides insights into the linguistic cues that contribute to clickbait detection, enhancing model transparency
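The interpretability gain comes from keeping the explicit features as named slots alongside the anonymous embedding dimensions: a tree model's importance scores can then be read off per linguistic cue. A minimal sketch of that vector assembly, with hypothetical names and a toy embedding (no real embedding model or XGBoost call):

```python
def hybrid_vector(embedding, cue_features):
    """Concatenate a dense embedding with named explicit features.
    Named slots let a downstream tree model's feature importances be
    attributed to linguistic cues rather than opaque dimensions."""
    names = [f"emb_{i}" for i in range(len(embedding))] + list(cue_features)
    values = list(embedding) + [cue_features[k] for k in cue_features]
    return names, values

names, values = hybrid_vector([0.12, -0.30, 0.05],
                              {"second_person": 2, "numeral": 1})
print(names)   # → ['emb_0', 'emb_1', 'emb_2', 'second_person', 'numeral']
print(values)  # → [0.12, -0.3, 0.05, 2, 1]
```

In the paper's setup this combined vector would be fed to XGBoost; the concatenation itself is the only part sketched here.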
Demerits
Limited Generalizability
The study's results may not generalize to other domains or datasets, highlighting the need for further testing and validation
Dependence on Large Language Models
The proposed approach relies on large language model embeddings, which are computationally expensive to produce and require significant resources
Expert Commentary
The proposed hybrid approach to clickbait detection represents a significant advancement in the field, leveraging the strengths of both transformer-based text embeddings and linguistically motivated informativeness features. The study's emphasis on interpretability is particularly noteworthy, as it enables a deeper understanding of the linguistic cues that contribute to clickbait detection. However, further research is needed to address the limitations of the approach, including its dependence on large language models and potential lack of generalizability to other domains.
Recommendations
- ✓ Further testing and validation of the proposed approach on diverse datasets and domains
- ✓ Exploration of alternative approaches that can reduce the computational expenses associated with large language models