Addressing shortcomings of the K-means algorithm via the central limit theorem and Chebyshev’s inequality

Peter Olubunmi Olukanmi

K-means is the most popular algorithm for clustering, a classic task in machine learning and related fields. It is even listed among the top ten algorithms in the entire data mining field. Although shown to be effective in many practical situations, k-means has a few wellknown drawbacks. Chiefly, it assumes the user has prior knowledge of the number k of clusters; it scales poorly with increase in data size; it is sensitive to initialization, thus, yielding local minima which can lead to arbitrarily bad accuracy; and, it is sensitive to outliers. In this thesis, these problems are addressed. First, we present an approach to scaling k-means without trading off accuracy. We show that the classical central limit theorem (CLT) addresses a special case (k = 1) of the k-means problem. Then the theorem is extended to the general case, leading to two new algorithms: scaled versions of k-means and k-means++ (a well-known k-means variant). The new algorithms are named k-means-lite and k-means-lite++, respectively. Their running time is independent of data size. These two new algorithms are presented to demonstrate the proposed approach; but the approach can be applied to create a constant-time version of any other k-means clustering algorithm, since it does not modify the internal workings of the base algorithm. To the author’s knowledge, the proposed approach represents the first claim of constant time in the k-means literature. Secondly, k-means' accuracy is boosted via hybridization with k-medoids, a highly accurate but grossly inefficient algorithm. Two new algorithms are presented, namely, HP (hybrid partitioning) and HP++. They benefit from k-medoids’ high accuracy without inheriting its inefficiency. Contrary to what might be expected of hybrid algorithms, the proposed algorithms are faster than both k-means and k-medoids. Yet, they are not only more accurate than k-means, but also than k-means++ (which already improves k-means’ accuracy), although only HP++ records statistically significant difference when compared with kmeans++. Automatic estimation of the number k of clusters in a dataset, is widely considered a difficult problem. This study discovers a surprisingly simple and efficient solution to the problem. The proposed technique, named Automatic k-means (k-means-auto), takes a k... Ph.D. (Electrical & Electronic Engineering)

Addressing shortcomings of the K-means algorithm via the central limit theorem and Chebyshev’s inequality

Abstract

Files and links (1)

Metrics

Details