Non-Zipfian Distribution of Stopwords and Subset Selection Models
arXiv:2603.04691v1 Announce Type: new Abstract: Stopwords are words that are not very informative about the content or meaning of a language text. Most stopwords are function words, but they can also be common verbs, adjectives, and adverbs. In contrast to the well-known Zipf's law for the rank-frequency plot of all words, the rank-frequency plot for stopwords is best fitted by the Beta Rank Function (BRF). On the other hand, the rank-frequency plots of non-stopwords also deviate from Zipf's law, but are better fitted by a quadratic function of log-token-count over log-rank than by the BRF. Based on the observed ranks of stopwords in the full word list, we propose a stopword (subset) selection model in which the probability of being selected, as a function of the word's rank $r$, is a decreasing Hill function ($1/(1+(r/r_{mid})^\gamma)$), whereas the probability of not being selected is the standard Hill function ($1/(1+(r_{mid}/r)^\gamma)$). We validate this selection probability model by a direct estimation from an independent collection of texts. We also show analytically that this model leads to a BRF rank-frequency distribution for stopwords when the original full word list follows Zipf's law, and that it explains the quadratic fitting function for the non-stopwords.
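The two Hill functions in the abstract are complementary by construction: for any rank $r$, the probability of being selected as a stopword and the probability of not being selected sum to one. A minimal sketch, with illustrative parameter values ($r_{mid}$, $\gamma$) that are not taken from the paper:

```python
import numpy as np

def p_selected(r, r_mid=100.0, gamma=2.0):
    """Decreasing Hill function: probability that the word of rank r
    is selected as a stopword. r_mid is the rank at which the
    probability crosses 1/2; gamma controls the sharpness."""
    return 1.0 / (1.0 + (r / r_mid) ** gamma)

def p_not_selected(r, r_mid=100.0, gamma=2.0):
    """Standard (increasing) Hill function: probability that the word
    of rank r is NOT selected as a stopword."""
    return 1.0 / (1.0 + (r_mid / r) ** gamma)

ranks = np.arange(1, 10001, dtype=float)
# Complementarity: with x = (r/r_mid)^gamma, the two terms are
# 1/(1+x) and x/(1+x), so they sum to 1 at every rank.
assert np.allclose(p_selected(ranks) + p_not_selected(ranks), 1.0)
# At r = r_mid the selection probability is exactly 1/2.
assert np.isclose(p_selected(100.0), 0.5)
```

Under this model, low-rank (high-frequency) words are selected as stopwords with probability close to one, and the probability decays toward zero past $r_{mid}$, which is consistent with stopword lists being dominated by the most frequent function words.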
Executive Summary
This article challenges the conventional view that stopwords conform to Zipf's law, instead proposing the Beta Rank Function (BRF) to describe their distribution. The authors also present a novel stopword selection model that predicts the probability of a word being selected as a stopword from its rank. The model is validated by direct estimation from an independent collection of texts and shown analytically to produce a BRF rank-frequency distribution for stopwords and a quadratic fitting function for non-stopwords. This research offers insight into the nature of stopwords and has practical implications for natural language processing and text analysis.
Key Points
- ▸ Stopwords do not follow Zipf's law, but instead exhibit a Beta Rank Function (BRF) distribution.
- ▸ A novel stopword selection model is proposed, which accurately predicts the probability of a word being selected based on its rank.
- ▸ The model is validated using independent collections of texts and analytically proven to produce the expected distributions.
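Both rank-frequency forms mentioned in the key points are linear in log-space and can be fitted by ordinary least squares. A hedged sketch on synthetic data, using the common parameterization of the BRF, $f(r) = A\,(N+1-r)^b / r^a$ (the exact functional form and parameters used in the paper are not reproduced here):

```python
import numpy as np

N = 500
r = np.arange(1, N + 1, dtype=float)

# Synthetic token counts following a BRF with illustrative parameters.
A_true, a_true, b_true = 1.0e4, 1.0, 0.3
f = A_true * (N + 1 - r) ** b_true / r ** a_true

# BRF is log-linear: log f = log A - a*log r + b*log(N + 1 - r),
# so the parameters follow from a least-squares fit.
X = np.column_stack([np.ones(N), -np.log(r), np.log(N + 1 - r)])
logA, a_hat, b_hat = np.linalg.lstsq(X, np.log(f), rcond=None)[0]

# The alternative form for non-stopwords is a quadratic in log-rank:
# log f = c2*(log r)^2 + c1*log r + c0, fitted the same way.
c2, c1, c0 = np.polyfit(np.log(r), np.log(f), 2)
```

On noise-free BRF data the recovered parameters match the generating ones; on real token counts, comparing the residuals of the two fits is the natural way to reproduce the paper's model-selection claim.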
Merits
Strength in Mathematical Rigor
The article demonstrates a high level of mathematical rigor, presenting a well-substantiated argument for the BRF distribution of stopwords and analytically validating the stopword selection model.
Insightful Analysis of Stopword Distribution
The research sheds new light on the nature of stopwords, challenging the conventional view that they conform to Zipf's law and providing a more accurate understanding of their distribution.
Demerits
Limited Scope
The article's focus on stopwords and their distribution may limit its broader applicability to natural language processing and text analysis.
Complexity of the Stopword Selection Model
The proposed model, while analytically validated, may be complex to implement and interpret in practice.
Expert Commentary
This article makes a significant contribution to natural language processing and text analysis by challenging the conventional Zipfian view of stopwords and proposing a novel stopword selection model. The research demonstrates a high level of mathematical rigor and sheds new light on the statistical nature of stopwords. However, the complexity of the proposed model and the narrow focus of the study may restrict its broader applicability. Nevertheless, the article's findings have practical implications for the development of more effective text analysis and processing algorithms.
Recommendations
- ✓ Future research should aim to generalize the stopword selection model to other languages and contexts.
- ✓ The development of more practical and interpretable models of stopwords and their distribution is essential for the effective application of natural language processing and text analysis algorithms.