We present a comprehensive theoretical analysis of k-means clustering algorithms, a fundamental family of partitional clustering methods that minimize distances between data points and their assigned cluster centers. While k-means clustering has been extensively studied, a systematic theoretical analysis and comparison of different k-means variants remains lacking. Our work first establishes the NP-completeness of k-means clustering. We then characterize the structure of locally optimal solutions and prove convergence properties for the k-means algorithm. We further examine various k-means paradigms, such as K-Medoids, Fuzzy K-Means, and IK-Means. Through rigorous theoretical analysis and experimental evaluation, we provide insights into the trade-offs among these variants. Our results offer theoretical guidance for selecting appropriate k-means variants and suggest directions for future algorithmic improvements in k-means clustering.
"Do not forget that clusters are, in large part, on the eye of the beholder."
Clustering, as a fundamental problem in computer science, focuses on partitioning a set of objects into groups (clusters) such that objects within the same cluster are more similar to each other than to those in different clusters. Among various clustering approaches, k-means clustering has emerged as one of the most widely studied and applied methods, owing to its simplicity, efficiency, and theoretical guarantees. The k-means clustering problem, which aims to minimize the sum of squared distances between data points and their assigned cluster centers, presents significant computational challenges and has been proven to be NP-hard when the number of clusters exceeds one.
Within the broader context of k-means methodologies, various algorithmic variants have been developed to address different aspects of the clustering challenge. These variants can be broadly classified based on:
Each variant presents unique theoretical properties and computational characteristics that warrant careful analysis. These algorithms typically operate through iterative refinement, where cluster centers are repeatedly updated to minimize the within-cluster sum of squares. This iterative nature has led to various innovations in initialization strategies, high-dimensional data handling, and scalable optimization techniques.
This survey provides a systematic examination of k-means clustering algorithms and their variants from an algorithmic perspective. We focus on three key aspects:
The main contributions of this work include:
The remainder of this paper is organized as follows:
What is clustering? Clustering is a fundamental problem in unsupervised learning that aims to partition data points into groups based on their similarity. Before delving into specific algorithms, we first present a formal mathematical framework.
We first define a class of k-means-type clustering problems. Given a dataset \(\mathcal{X} = \{x_1, \ldots, x_n\} \subset \mathbb{R}^d\) and a distance function \(d: \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}_{\geq 0}\), we seek:
The objective function for k-means-type problems captures the total distortion within clusters:
\begin{equation} G((\mathcal{X}, d), (C_1,\ldots,C_k)) = \sum_{i=1}^k \sum_{x \in C_i} d(x, \mu_i) \end{equation}This function measures the sum of distances between each point and its assigned cluster center. The classical k-means algorithm emerges as a special case where \(d(x,y) = \|x-y\|^2\), chosen for its mathematical properties that enable efficient optimization.
For k-means-type problems, we define a center update rule \(\phi_d\) that maps a cluster to its representative center:
\(\phi_d: 2^{\mathcal{X}} \to \mathbb{R}^d\)
Here, \(2^{\mathcal{X}}\) denotes the power set of \(\mathcal{X}\). The specific form of \(\phi_d\) depends on the problem structure and distance function \(d\). For example, in classical k-means with squared Euclidean distance, \(\phi_d\) computes the arithmetic mean.
The clustering problem can be formulated as the following optimization problem:
\begin{equation} \text{Problem P:} \quad \text{minimize} \quad \sum_{i=1}^k \sum_{j=1}^n w_{ij} d(x_j, \mu_i) \end{equation} \begin{equation} \text{subject to} \quad \sum_{i=1}^k w_{ij} = 1, \quad \forall j = 1,\ldots,n \end{equation} \begin{equation} w_{ij} \in \{0,1\}, \quad \forall i,j \end{equation} \begin{equation} \mu_i = \phi_d(\{x_j: w_{ij} = 1\}), \quad \forall i = 1,\ldots,k \end{equation}The binary variables \(w_{ij}\) encode cluster assignments, where \(w_{ij} = 1\) indicates point \(x_j\) belongs to cluster \(i\). The constraints ensure each point is assigned to exactly one cluster.
For general k-means-type problems, we extend Lloyd's algorithm with the following structure:
The convergence criterion can be:
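For concreteness, the following is a minimal sketch of this generalized Lloyd iteration, using the decrease of the objective falling below a tolerance as the (assumed) convergence criterion; the function names, defaults, and random initialization are illustrative choices, not a reference implementation.

```python
import numpy as np

def generalized_lloyd(X, k, dist, update_center, max_iter=100, tol=1e-8, seed=0):
    """Sketch of a generalized Lloyd iteration for k-means-type problems.

    dist(x, mu)        -- distance d(x, mu) >= 0
    update_center(pts) -- center update rule phi_d applied to one cluster
    """
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]   # random initial centers
    prev_cost = np.inf
    for _ in range(max_iter):
        # Assignment step: each point goes to its closest center under d.
        D = np.array([[dist(x, mu) for mu in centers] for x in X])
        labels = D.argmin(axis=1)
        # Update step: recompute each center with phi_d.
        centers = np.array([update_center(X[labels == i]) if np.any(labels == i)
                            else centers[i] for i in range(k)])
        cost = D[np.arange(len(X)), labels].sum()
        if prev_cost - cost < tol:        # convergence: objective stops decreasing
            break
        prev_cost = cost
    return labels, centers, cost

# Classical k-means: squared Euclidean distance, arithmetic-mean update.
# labels, centers, cost = generalized_lloyd(X, 3,
#     dist=lambda x, mu: np.sum((x - mu) ** 2),
#     update_center=lambda pts: pts.mean(axis=0))
```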
We prove that the k-means clustering problem is NP-complete even when restricted to k = 2. The proof proceeds by reduction from a variant of Not-All-Equal 3SAT (NaeSat*), which is known to be NP-complete.
Definition 3.1 (NaeSat*)
Input: A Boolean formula ϕ(x1, ..., xn) in 3CNF where:
Output: true if there exists an assignment where each clause has exactly one or two true literals, false otherwise.
We want to prove the following theorem.
Theorem. 2-means clustering is NP-complete.
Lemma 3.3. k-means clustering is in NP.
Proof. Given a solution (C1, ..., Ck), we can verify in polynomial time:
1. The partition is valid: each point appears in exactly one cluster.
2. The cost is computed correctly for the given partition: each reported center is the centroid of its assigned cluster.
3. The total cost is below the specified threshold.
The verification takes O(nk) time where n is the number of points.
Given an instance ϕ of NaeSat* with n variables and m clauses, we construct a 2-means clustering instance that captures the satisfiability properties of ϕ. The construction proceeds as follows:
1. For each variable xi in ϕ, we create two corresponding points in our metric space: one representing the positive literal xi and one representing the negative literal ¬xi. This gives us 2n points in total.
2. We define a distance matrix D that encodes the logical structure of ϕ. For any two literals α and β:
\begin{equation} D_{\alpha \beta} = \begin{cases} 0 & \text{if $\alpha$ and $\beta$ are identical} \\ 1 + \Delta & \text{if $\alpha$ is the negation of $\beta$} \\ 1 + \delta & \text{if $\alpha$ and $\beta$ appear together in a clause} \\ 1 & \text{otherwise} \end{cases} \end{equation}
where the parameters \(\delta\) and \(\Delta\) satisfy:
3. We set our target clustering cost threshold to \(c(\phi) = n-1+\frac{2\delta m}{n}\).
We prove the reduction is correct through two key lemmas that establish the equivalence between NaeSat* solutions and low-cost clusterings.
Lemma (Forward Direction): If \(\phi\) has a satisfying assignment for NaeSat*, then there exists a 2-means clustering with cost exactly \(c(\phi)\).
Proof
Given a satisfying assignment for \(\phi\), we construct an optimal clustering as follows:
1. Let \(C_1\) contain all literals that are true in the satisfying assignment.
2. Let \(C_2\) contain all literals that are false in the satisfying assignment.
This clustering has several important properties:
The total clustering cost can be calculated as:
Lemma (Backward Direction): If there exists a 2-means clustering with cost \(\leq c(\phi)\), then \(\phi\) is satisfiable.
We prove this through a sequence of structural properties:
1. First, we prove that no cluster can contain both a literal and its negation:
2. This implies that each cluster must contain exactly \(n\) points, with exactly one literal from each variable pair.
3. The cost formula then simplifies to:
4. For this to be \(\leq c(\phi)\), every clause must be "split" between clusters.
5. Therefore, assigning true to all literals in \(C_1\) and false to all literals in \(C_2\) gives a valid NaeSat* solution.
The previous lemmas established that our construction yields a valid reduction from NaeSat* to 2-means clustering if we can represent the distances in Euclidean space. We now show this is possible using theorems established by Schoenberg.
Theorem (Geometric Realization): Given our distance matrix \(D(\phi)\), there exist points \(\{x_\alpha\}_{\alpha \in \text{literals}} \subset \mathbb{R}^{2n}\) such that \(D_{\alpha \beta} = \|x_\alpha - x_\beta\|^2\) for all literals \(\alpha, \beta\).
Proof
The proof proceeds in two steps:
First, we show that \(D(\phi)\) satisfies Schoenberg's criterion for embeddability into \(\ell_2^2\).
Then, we explain how to explicitly construct the embedding using classical multidimensional scaling.
Step 1: Verifying Schoenberg's Criterion
By Schoenberg's theorem, a symmetric matrix \(D\) can be realized as squared Euclidean distances if and only if \(-HDH\) is positive semidefinite, where \(H = I - \frac{1}{N}\mathbf{1}\mathbf{1}^T\) is the centering matrix.
We will show this by proving that \(v^T HDHv \leq 0\) for all vectors \(v \in \mathbb{R}^N\). Let \(u = Hv\). Note that \(u \cdot \mathbf{1} = 0\) for any such \(u\), because \(H\mathbf{1} = 0\). Therefore, it suffices to show \(u^T Du \leq 0\) for all vectors \(u\) with \(u \cdot \mathbf{1} = 0\).
For any such vector \(u\), let \(u^+\) denote its first \(n\) coordinates and \(u^-\) its last \(n\) coordinates. Then:
The last inequality follows from our choice of parameters where \(2\delta n \leq 1-\Delta\).
Step 2: Constructing the Embedding
Given that \(D\) satisfies Schoenberg's criterion, we can explicitly construct the embedding points \(\{x_\alpha\}\) as follows:
This construction is guaranteed to work because \(B\) is positive semidefinite, and the resulting points \(x_\alpha \in \mathbb{R}^{2n}\) will satisfy \(\|x_\alpha - x_\beta\|^2 = D_{\alpha \beta}\) for all pairs of literals \(\alpha, \beta\).
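As a concrete illustration of this step, the following is a small sketch of classical multidimensional scaling, assuming \(D\) is a valid squared-distance matrix and \(B = -\tfrac{1}{2}HDH\) is the centered Gram matrix; the function name is ours.

```python
import numpy as np

def embed_from_squared_distances(D):
    """Classical MDS: recover points whose squared Euclidean distances match D.

    Assumes D is symmetric with zero diagonal and satisfies Schoenberg's
    criterion (B = -1/2 * H D H is positive semidefinite).
    """
    N = D.shape[0]
    H = np.eye(N) - np.ones((N, N)) / N        # centering matrix
    B = -0.5 * H @ D @ H                       # centered Gram matrix
    eigvals, eigvecs = np.linalg.eigh(B)       # B is symmetric PSD (up to numerical noise)
    eigvals = np.clip(eigvals, 0.0, None)      # discard tiny negative eigenvalues
    X = eigvecs * np.sqrt(eigvals)             # rows are the embedded points
    return X

# Sanity check: pairwise squared distances of X reproduce D.
# X = embed_from_squared_distances(D)
# assert np.allclose(((X[:, None] - X[None, :]) ** 2).sum(-1), D, atol=1e-6)
```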
This geometric realization completes our reduction by showing that the abstract distance matrix \(D(\phi)\) can be realized by actual points in Euclidean space, making our reduction valid for the standard k-means clustering problem.
To analyze the problem structure more deeply, we use the reduced function introduced by Selim, which eliminates the center variables:
Definition. The reduced function \(F(W)\) of Problem P is defined by:
where \(W = [w_{ij}]\) is a \(k \times n\) matrix of assignment variables.
This leads to the reduced problem:
To understand the properties of this reduced problem, we first analyze the structure of the reduced function \(F(W)\) under different conditions on the distance function \(d\).
Theorem. When \(d(x,y)\) is convex in \(y\) for fixed \(x\), the following properties hold:
Proof
The concavity follows from:
The binary property follows from the concavity of \(F\): a concave function attains its minimum over the feasible polytope at an extreme point, so any fractional assignment can be replaced by a vertex of the feasible region without increasing the objective.
For classical k-means with non-empty clusters and \(d(x,y) = \|x-y\|^2\):
where \(\sum_{l=1}^n w_{il} > 0\) for all \(i\).
Definition. A point \((W^*, Z^*)\) is called a partial optimal solution if:
\begin{equation} f(W^*, Z^*) \leq f(W, Z^*) \quad \text{for all } W \in S \end{equation} \begin{equation} f(W^*, Z^*) \leq f(W^*, Z) \quad \text{for all } Z \in \mathbb{R}^{kd} \end{equation}where \(S\) is the feasible set of cluster assignment matrices.
To find partial optimal solutions, we consider two subproblems:
To better understand the structure of these partial optimal solutions, we can characterize them using the Kuhn-Tucker optimality conditions from nonlinear programming theory.
Theorem. Equivalence of Partial Optimal Solutions and Kuhn-Tucker Points
Under differentiability conditions, a point \((W^*, Z^*)\) is a partial optimal solution if and only if it satisfies the Kuhn-Tucker conditions:
\begin{equation} \text{i)} \quad \nabla_{w_j}f(W, Z) + \Pi^t E \geq 0, \quad j = 1,\ldots,n \end{equation} \begin{equation} \text{ii)} \quad (\nabla_{w_j}f(W, Z) + \Pi^t E)\, w_j = 0, \quad j = 1,\ldots,n \end{equation} \begin{equation} \text{iii)} \quad E w_j = 1, \quad w_j \geq 0, \quad j = 1,\ldots,n \end{equation} \begin{equation} \text{iv)} \quad \nabla_{z_i}f(W, Z) = 0, \quad i = 1,\ldots,k \end{equation}where \(E\) is a vector of ones and \(\Pi\) is the vector of Lagrange multipliers.
(\(\Rightarrow\)) First, assume \((W^*, Z^*)\) is a partial optimal solution.
(\(\Leftarrow\)) Now assume \((W^*, Z^*)\) satisfies conditions i)-iv).
Therefore, \((W^*, Z^*)\) satisfies both conditions of a partial optimal solution.
Theorem: K-means Convergence
The k-means algorithm converges to a partial optimal solution of Problem P in a finite number of iterations.
Proof
We prove by contradiction that the algorithm visits each extreme point of \(S\) at most once.
Assume there exist iterations \(r_1\) and \(r_2\) (\(r_1 \neq r_2\)) where the algorithm visits the same extreme point:
Following the algorithm steps:
Since \(W^{r_1} = W^{r_2}\), we have:
while also having a strictly decreasing sequence between these points. Therefore, the algorithm cannot visit the same extreme point twice.
Since the feasible region has a finite number of extreme points and the algorithm never revisits the same extreme point, it must terminate in a finite number of iterations. When the algorithm terminates, the solution satisfies partial optimality conditions.
Having established the theoretical properties of the classical k-means algorithm, including its convergence guarantees and local optimality conditions, we now explore how these foundations can be extended and modified to address various practical challenges. While k-means is widely used, its limitations, such as sensitivity to initialization, restriction to the squared Euclidean distance, and the requirement for numeric data, have motivated numerous variants of the algorithm.
In this section, we summarize some of the well-known variants of K-Means, which involve different problem setups and different algorithmic paradigms. We introduce their problem formulations, practical algorithms, relations to K-Means, and some theoretical results. Specifically, we cover some of the most well-established K-Means variants, grouped by the following aspects: (1) representative elements and distance-generating function for the clusters (K-Medoids, K-Medians, K-Modes, Mean Shift Clustering); (2) feature transformation (Fuzzy K-Means, Kernel K-Means, Weighted K-Means); (3) initial representative element assignment (IK-Means); (4) algorithmic paradigm (Bisecting K-Means Clustering).
K-Medoids Clustering is a variant of K-Means Clustering. Similar to K-Means, it also aims to find a set of \(\{C_1, C_2, \cdots, C_k\}\) that minimizes a pre-defined loss function. However, the problem formulation is different in the following two ways:
Intuitively, because the squared \(\mathit{l}_2\) norm amplifies large deviations while the \(\mathit{l}_1\) norm does not, a model trained with the \(\mathit{l}_1\) norm is more robust to outliers. Moreover, the requirement that all representative elements be points of the given input set \(\mathcal{X}\) makes K-Medoids useful in scenarios where more "natural", exemplar-based cluster representatives are desired.
Similar to K-Means, the K-Medoid clustering algorithms also follow the paradigm of iteratively updating the representative points and the clusters until convergence. A basic K-Medoid clustering algorithm is provided below:
In the naive implementation of k-Medoids (Option 1 in the algorithm), after selecting the initial medoids, we need to compute the total cost of swapping the representative object \(m\) with \(x_i\) for all non-representative objects \(x_i\). This is an \(O(n^2)\) operation.
To address this issue, a modification of the k-Medoids algorithm, Partitioning Around Medoids (PAM), has been proposed. The algorithm first builds a dissimilarity matrix on the given dataset, a step referred to in the literature as the BUILD operation. The second part of the algorithm, referred to as the SWAP operation, iteratively selects a random non-representative point and checks whether swapping a representative object with it yields a lower loss, as illustrated in Option 2 of the algorithm.
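As a concrete illustration (not the original PAM implementation), the sketch below runs a greedy swap-based k-medoids loop on a precomputed dissimilarity matrix `D`; the simplified random initialization stands in for the BUILD step, and accepting any cost-reducing swap stands in for the SWAP step.

```python
import numpy as np

def k_medoids_swap(D, k, max_iter=100, seed=0):
    """Sketch of swap-based k-medoids on a precomputed n x n dissimilarity matrix D."""
    n = D.shape[0]
    rng = np.random.default_rng(seed)
    medoids = list(rng.choice(n, size=k, replace=False))   # simplified initialization

    def cost(meds):
        # Each point contributes its dissimilarity to the closest medoid.
        return D[:, meds].min(axis=1).sum()

    best = cost(medoids)
    for _ in range(max_iter):
        improved = False
        for i in range(k):                                  # try replacing medoid i
            for x in range(n):
                if x in medoids:
                    continue
                trial = medoids[:i] + [x] + medoids[i + 1:]
                c = cost(trial)
                if c < best:                                # keep the swap if it lowers the cost
                    medoids, best, improved = trial, c, True
        if not improved:
            break
    labels = D[:, medoids].argmin(axis=1)
    return medoids, labels, best
```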
Compared to k-Means Clustering, k-Medoids Clustering:
However, the computational complexity of k-Medoids is higher, restricting its application to larger datasets. Classical works like the Clustering LARge Applications (CLARA) algorithm extend the idea of PAM to larger datasets through sampling methods.
K-Medians Clustering selects the medians of the clusters, as opposed to the means in K-Means Clustering. For each cluster \(C_i\), the median \(med\) is defined as
where \(med_{ij}\) is the \(j\)-th feature of the median point in the cluster \(C_i\), and \(x_{j}\) is the \(j\)-th feature of data point \(x\) in the cluster \(C_i\). The loss function for K-Medians is defined as:
K-Medians is more robust to outliers than K-Means. Note that K-Medians' representative elements are not restricted to points in \(\mathcal{X}\): since the median is selected feature-wise, the resulting representative element need not be one of the given points. For instance, given the cluster \((0,0), (1,-1), (2,1)\), the representative point is \((1,0)\), which is not an element of the cluster.
The classical algorithm uses a Lloyd-style iteration that alternates between an expectation step and a maximization step, as in k-Means. In the expectation step, input data points are assigned to their nearest median. In the maximization step, the medians are recomputed by taking the median in each dimension separately.
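A brief sketch of this iteration is given below, assuming the \(l_1\) (Manhattan) distance for the assignment step and a coordinate-wise median for the update step; names and defaults are illustrative.

```python
import numpy as np

def k_medians(X, k, max_iter=100, seed=0):
    """Sketch of Lloyd-style k-medians: l1 assignments, coordinate-wise median updates."""
    rng = np.random.default_rng(seed)
    medians = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Expectation step: assign each point to the nearest median under the l1 distance.
        dists = np.abs(X[:, None, :] - medians[None, :, :]).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Maximization step: recompute each median feature-wise.
        new_medians = np.array([np.median(X[labels == i], axis=0) if np.any(labels == i)
                                else medians[i] for i in range(k)])
        if np.allclose(new_medians, medians):
            break
        medians = new_medians
    return labels, medians
```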
Even in metric spaces, k-median is known to be NP-hard. For approximation algorithms, there exists an \(O(1)\)-approximation algorithm that runs in \(\tilde{O}(kn)\) time, and it has been proven that there is no \(O(1)\)-approximation algorithm for metric k-median with \(o(kn)\) runtime.
Theorem 5.1
For any \(\rho \geq 1\), every approximation algorithm (including randomized algorithms) with approximation ratio \(\rho\) for estimating the cost of metric k-median with \(k=\frac{n}{2}\) requires \(\Omega(n^2)\) time.
Proof
For simplicity, we denote the number of input points as \(2n\) and set the number of clusters to \(k=\frac{2n}{2}=n\). We then partition the \(2n\) points into two sets: \(n\) points in \(P\) and \(n\) points in \(F\). Between these two sets of points, we construct a perfect matching \(\mathcal{M}\) and choose an edge \(e \in \mathcal{M}\) at random. We then define the distances in \((P \cup F, D)\) as follows:
The metric space on \((P \cup F, D)\) is well-defined. For arbitrarily large \(n\), we know that the optimal solution is the following clustering:
If \(D(e) = Q\), then the cost of the \(k\)-Median problem is \(2n\rho +1\), and if \(D(e) = 1\), then the cost is \(2n\). Hence, any \(\rho\)-factor approximation algorithm for the \(k\)-Median problem must distinguish between these two instances. However, doing so requires determining whether there exists an edge of length \(Q\), which is known to require \(\Omega(n^2)\) distance queries, even for randomized algorithms.
Definition
Let \(\mathcal{A} = \{A_1, A_2, \ldots, A_d\}\) denote a set of \(d\) distinct categorical attributes, each associated with a finite set \(\mathcal{O}_j\) \((1 \leq j \leq d)\) as its domain, where \(\text{DOM}(A_j) = \mathcal{O}_j\) contains at least two discrete values.
A categorical dataset \(\mathcal{D} = \{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n\}\) comprises \(n\) categorical data points, where each object \(\mathbf{x}_i \in \mathcal{D}\) \((1 \leq i \leq n)\) is a tuple \(\mathbf{x}_i = (x_{i1}, x_{i2}, \ldots, x_{id}) \in \mathcal{O}_1 \times \mathcal{O}_2 \times \cdots \times \mathcal{O}_d\).
For categorical data, k-Means Clustering is not applicable because the \(l_2\) distance between categorical values is not meaningful. Intuitively, if we encode the categories as \(0, 1, 2\), the distance between categories \(0\) and \(2\) should equal the distance between categories \(0\) and \(1\), since each pair simply consists of two distinct categories. In other words, the distance should be invariant to the (arbitrary) ordering of the categories, which is not the case for the \(l_2\) distance.
Definition
The \(l_0\)-norm of a vector \(\mathbf{x} = (x_1, x_2, \ldots, x_m)\), denoted by \(\|\mathbf{x}\|_0\), is defined as the number of non-zero entries in \(\mathbf{x}\). Mathematically, it is expressed as:
where \(\mathbf{1}(\cdot)\) is the indicator function, which equals \(1\) if the condition inside is true, and \(0\) otherwise. The \(L_0\)-norm is commonly used to measure the sparsity of a vector.
Instead, K-Modes Clustering uses the \(l_0\) norm as the distance-generating function: the difference between two data points \(x, y\) is the number of features on which the two disagree. A basic K-Modes algorithm is described below:
The algorithm works by iteratively assigning data points to the nearest cluster mode, where the mode takes the most frequent value of each feature within the cluster.
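For illustration, here is a compact sketch of that loop for categorical data encoded as non-negative integer arrays, using the Hamming-style mismatch count as the distance and the feature-wise most frequent value as the mode; the encoding assumption and defaults are ours.

```python
import numpy as np

def k_modes(X, k, max_iter=100, seed=0):
    """Sketch of k-modes for integer-encoded categorical data X (n x d)."""
    rng = np.random.default_rng(seed)
    modes = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(max_iter):
        # Distance = number of features on which a point and a mode disagree.
        dists = (X[:, None, :] != modes[None, :, :]).sum(axis=2)
        labels = dists.argmin(axis=1)
        new_modes = modes.copy()
        for i in range(k):
            members = X[labels == i]
            if len(members) == 0:
                continue
            # Mode = most frequent category in each feature within the cluster.
            new_modes[i] = [np.bincount(members[:, j]).argmax() for j in range(X.shape[1])]
        if np.array_equal(new_modes, modes):
            break
        modes = new_modes
    return labels, modes
```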
Mean Shift Clustering is a popular non-parametric clustering method. Its objective is to discover the modes (local maxima) in the data distribution by iteratively shifting data points toward the direction of maximum density. This process is based on the Parzen window kernel density estimation method and performs gradient ascent until convergence. The stationary points of the estimated density represent the modes, which serve as the cluster centers.
Definition: Parzen Window Kernel Density Estimate
Given \( N \) data points \(\{x_i\}_{i=1}^N\) in a \( d \)-dimensional space \(\mathbb{R}^d\), the multivariate Parzen window kernel density estimate \( f(x) \) is computed using a kernel function \( K(x) \) and window radius \( h \) as:
Definition: Mean Shift Vector
The mean shift vector \( m_h(x) \), which points toward the direction of the maximum increase in density, is calculated as:
where \( g(s) = -k'(s) \) is the negative derivative of the profile \( k \) of the kernel \( K(x) \).
Algorithm: Mean Shift Clustering
The Mean Shift Clustering algorithm iteratively computes \( m_h(x) \) for each data point \( x \) and updates \( x \) until convergence to a stationary point.
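A minimal sketch of this procedure with a Gaussian kernel is shown below; the bandwidth \(h\), iteration cap, and stopping tolerance are illustrative assumptions rather than prescribed choices.

```python
import numpy as np

def mean_shift(X, h=1.0, max_iter=200, tol=1e-5):
    """Sketch of mean shift with a Gaussian kernel: shift every point toward a density mode."""
    X = np.asarray(X, dtype=float)
    modes = X.copy()
    for _ in range(max_iter):
        shifted = np.empty_like(modes)
        for i, x in enumerate(modes):
            # Gaussian weights of all data points relative to the current position.
            w = np.exp(-np.sum((X - x) ** 2, axis=1) / (2 * h ** 2))
            shifted[i] = (w[:, None] * X).sum(axis=0) / w.sum()   # weighted mean = x + m_h(x)
        if np.max(np.abs(shifted - modes)) < tol:
            break
        modes = shifted
    # Points whose converged positions (nearly) coincide belong to the same cluster.
    return modes
```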
The Mean Shift Clustering Algorithm has broad applications in pattern recognition and computer vision, for example, image segmentation, object tracking, and feature space analysis. However, its performance is sensitive to the choice of the kernel function and the bandwidth \( h \).
Standard K-Means restricts each data point's membership to exactly one cluster. However, in many real-world applications the underlying structure contains overlapping clusters. Fuzzy K-Means Clustering, often referred to as Fuzzy C-Means (FCM) clustering, relaxes the single-membership restriction to address this issue.
In the Fuzzy C-Means (FCM) clustering algorithm, the loss function for a clustering solution \(\mathcal{C}\) is given by:
Here, \(w_{xik}\) is the membership weight of point \(x_i\) belonging to cluster \(C_k\). These membership weights enter both the assignment of points and the update of the cluster centers.
Membership Weight
The membership weight \(w_{xik}\) is computed as:
Cluster Center
The cluster center \(c_k\) is calculated as:
Algorithm: Fuzzy C-Means Clustering
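As an illustration of the alternating membership/center updates described above, the following is a minimal sketch of an FCM loop under squared Euclidean distance with the common fuzzifier \(m = 2\) as the default; it is a hedged sketch, not a reference implementation.

```python
import numpy as np

def fuzzy_c_means(X, k, m=2.0, max_iter=100, tol=1e-6, seed=0):
    """Sketch of Fuzzy C-Means: soft memberships U (n x k) and centers C (k x d)."""
    rng = np.random.default_rng(seed)
    n = len(X)
    U = rng.dirichlet(np.ones(k), size=n)          # random memberships, rows sum to 1
    for _ in range(max_iter):
        Um = U ** m
        C = (Um.T @ X) / Um.sum(axis=0)[:, None]   # membership-weighted centers
        # Squared distances to each center (small epsilon avoids division by zero).
        d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2) + 1e-12
        # Membership update: u_ik proportional to d2_ik^(-1/(m-1)), normalized over clusters.
        inv = d2 ** (-1.0 / (m - 1))
        U_new = inv / inv.sum(axis=1, keepdims=True)
        if np.max(np.abs(U_new - U)) < tol:
            U = U_new
            break
        U = U_new
    return U, C
```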
Kernel Fuzzy C-Means (KFCM)
KFCM (Kernel Fuzzy C-Means) is an improved version of the FCM algorithm, leveraging kernel methods to map data into a higher-dimensional feature space. Its key modifications include:
KFCM extends Fuzzy C-Means by using the kernel trick to handle non-linearly separable clusters, making it effective for complex clustering scenarios such as pattern recognition and biomedical applications.
Kernel K-Means is an extension of the standard K-Means algorithm that works by projecting the data into a high-dimensional kernel space, where clusters can be more easily separated. This projection is achieved using kernel functions, but only the distances between points in the kernel space are required, so an explicit representation of the projection function \( \phi \) is not needed.
Common kernel functions include polynomial, Gaussian (RBF), and sigmoid kernels, as shown in Table 1. We present Mercer's Theorem (Theorem 1) without proof.
Theorem 1: Mercer's Theorem
Let \( K(x, y) \) be a continuous, symmetric, and positive semi-definite kernel function defined on a compact domain \( \mathcal{X} \times \mathcal{X} \). Then, \( K(x, y) \) can be expressed as:
where:
Table 1: Common kernel functions

| Kernel Type | Kernel Function |
|---|---|
| Polynomial Kernel | \( K(\mathbf{a}, \mathbf{b}) = (\mathbf{a} \cdot \mathbf{b} + c)^d \) |
| Gaussian Kernel (RBF) | \( K(\mathbf{a}, \mathbf{b}) = \exp\left(-\frac{\|\mathbf{a} - \mathbf{b}\|^2}{2\sigma^2}\right) \) |
| Sigmoid Kernel | \( K(\mathbf{a}, \mathbf{b}) = \tanh(c (\mathbf{a} \cdot \mathbf{b}) + \theta) \) |
Objective Function
The objective function for Kernel K-Means minimizes the sum of squared errors (SSE) in the high-dimensional feature space:
where \( c_k \) is the centroid of cluster \( C_k \) in the kernel space.
Algorithm: Kernel K-Means Clustering
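To illustrate the key step, the sketch below assigns points using only the kernel matrix \(K\), relying on the standard expansion \(\|\phi(x) - c_k\|^2 = K(x,x) - \frac{2}{|C_k|}\sum_{y \in C_k} K(x,y) + \frac{1}{|C_k|^2}\sum_{y,z \in C_k} K(y,z)\); this is a hedged illustration, not a canonical implementation.

```python
import numpy as np

def kernel_kmeans(K, k, max_iter=100, seed=0):
    """Sketch of kernel k-means operating directly on an n x n kernel matrix K."""
    n = K.shape[0]
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, k, size=n)              # random initial assignment
    for _ in range(max_iter):
        dist = np.zeros((n, k))
        for c in range(k):
            idx = np.where(labels == c)[0]
            if len(idx) == 0:
                dist[:, c] = np.inf
                continue
            # ||phi(x) - center_c||^2 expressed purely through kernel evaluations.
            term2 = K[:, idx].sum(axis=1) / len(idx)
            term3 = K[np.ix_(idx, idx)].sum() / len(idx) ** 2
            dist[:, c] = np.diag(K) - 2 * term2 + term3
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels

# Example: Gaussian (RBF) kernel matrix for data X with bandwidth sigma.
# sq = ((X[:, None] - X[None, :]) ** 2).sum(-1); K = np.exp(-sq / (2 * sigma ** 2))
```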
Weighted Kernel K-Means
Weighted Kernel K-Means extends Kernel K-Means by incorporating weights for each data point, enabling the algorithm to account for varying importance of points. The objective is to minimize the following distortion function:
where \( w(x) \) is the weight of point \( x \), and \( c_j \) is the weighted mean of cluster \( C_j \) in the kernel space:
Weighted K-Means (WK-Means) is an extension of the standard K-Means algorithm that introduces a feature weighting mechanism to account for the varying importance of features. By iteratively learning weights for different features, WK-Means improves clustering performance, especially in datasets where certain features are more informative than others. A user-defined parameter \( \beta \) controls the sensitivity of the algorithm to feature weights.
Weighted Distance
In WK-Means algorithm, the distance between a data point \( x_i \) and a cluster centroid \( c_k \) is defined using a weighted Euclidean distance as:
At each iteration, WK-Means assigns data points to the nearest cluster based on this weighted distance and updates the cluster centroids. Once the centroids are updated, the feature weights \( w_v \) are recalculated using:
where \( D_v \) is the sum of within-cluster variances for feature \( v \), defined as:
The weight update ensures that the weights \( w_v \) are normalized such that:
Algorithm: Weighted K-Means Clustering
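The sketch below shows one way the described loop can be realized; the weight-update rule \(w_v = 1 / \sum_u (D_v / D_u)^{1/(\beta-1)}\), the default \(\beta = 2\), and the convergence test on labels are assumptions based on the standard WK-Means formulation.

```python
import numpy as np

def wk_means(X, k, beta=2.0, max_iter=100, seed=0):
    """Sketch of Weighted K-Means: alternates assignments, centroids, and feature weights."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    centers = X[rng.choice(n, size=k, replace=False)]
    w = np.full(d, 1.0 / d)                       # feature weights, normalized to sum to 1
    labels = np.zeros(n, dtype=int)
    for _ in range(max_iter):
        # Assign points under the weighted distance sum_v w_v^beta (x_v - c_v)^2.
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2 * w ** beta).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        centers = np.array([X[labels == i].mean(axis=0) if np.any(labels == i)
                            else centers[i] for i in range(k)])
        # D_v: within-cluster dispersion of feature v, summed over all clusters.
        D = np.zeros(d)
        for i in range(k):
            members = X[labels == i]
            if len(members):
                D += ((members - centers[i]) ** 2).sum(axis=0)
        D = np.maximum(D, 1e-12)                  # avoid division by zero
        # Weight update: w_v = 1 / sum_u (D_v / D_u)^(1/(beta-1)); weights sum to 1.
        w = 1.0 / ((D[:, None] / D[None, :]) ** (1.0 / (beta - 1))).sum(axis=1)
    return labels, centers, w
```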
By incorporating feature importance, WK-Means empirically achieves better clustering results compared to standard K-Means, especially for datasets with heterogeneous feature importance. However, the computational cost of WK-Means is higher than traditional K-Means due to the additional weight learning step.
Intelligent K-Means (IK-Means) is a clustering method that follows the principle: the farther a point is from the centroid, the more we care about it. Unlike traditional K-Means, IK-Means introduces a deterministic way of selecting initial centroids by considering points that are farthest from the existing centroid. This approach enables better clustering for datasets with wide scatter. The method leverages ideas from principal component analysis (PCA) to identify points with maximum data scatter and uses them as potential cluster centroids, often representing anomalous patterns in the data.
Step 1: Center of Gravity
The first step of IK-Means involves calculating the centroid for the entire dataset, referred to as the center of gravity \( c_g \):
Step 2: Farthest Point Selection
Once \( c_g \) is determined, the algorithm identifies the point \( c \) farthest from \( c_g \):
Step 3: Cluster Assignment
This point \( c \) becomes a candidate for an anomalous cluster centroid. The algorithm creates a cluster \( S_{\text{iter}} \) by assigning all points \( x_i \) to \( S_{\text{iter}} \) if they are closer to \( c \) than to \( c_g \):
where \( d(x_i, c) \) denotes the distance between \( x_i \) and \( c \).
Step 4: Cluster Centroid Update
Once the cluster \( S_{\text{iter}} \) is formed, the centroid of this cluster \( s_g \) is updated as the mean of the points in \( S_{\text{iter}} \):
Algorithm: Intelligent K-Means (IK-Means)
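The following is a compact sketch of Steps 1-4 above, applied repeatedly to extract anomalous clusters; the stopping rule (an optional cap on the number of extracted clusters) and the handling of degenerate ties are assumptions.

```python
import numpy as np

def ik_means_init(X, max_clusters=None):
    """Sketch of IK-Means anomalous-pattern extraction for choosing initial centroids."""
    remaining = np.asarray(X, dtype=float)
    c_g = remaining.mean(axis=0)                  # Step 1: center of gravity of the full dataset
    centroids = []
    while len(remaining) > 0 and (max_clusters is None or len(centroids) < max_clusters):
        # Step 2: the point farthest from the center of gravity seeds an anomalous cluster.
        c = remaining[np.argmax(((remaining - c_g) ** 2).sum(axis=1))]
        while True:
            # Step 3: points closer to the candidate centroid c than to c_g join the cluster.
            in_cluster = (((remaining - c) ** 2).sum(axis=1)
                          <= ((remaining - c_g) ** 2).sum(axis=1))
            if not in_cluster.any():              # degenerate case: keep at least the seed point
                in_cluster[np.argmin(((remaining - c) ** 2).sum(axis=1))] = True
            # Step 4: recenter the anomalous cluster; stop when the centroid no longer moves.
            new_c = remaining[in_cluster].mean(axis=0)
            if np.allclose(new_c, c):
                break
            c = new_c
        centroids.append(c)
        remaining = remaining[~in_cluster]        # remove the extracted cluster and repeat
    return np.array(centroids)
```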
Unlike traditional K-Means, which is non-deterministic due to random initialization, IK-Means is a deterministic algorithm. It is particularly effective in scenarios where clusters are spread widely across the dataset and is well-suited for detecting anomalous patterns.
Bisecting K-Means Clustering builds a binary tree and iteratively splits the leaf nodes \( C \) into two child nodes \( C_1, C_2 \) by running 2-means. This hierarchical approach combines divisive clustering with K-Means, offering a flexible way to cluster datasets. The algorithm is detailed below.
Algorithm: Bisecting K-Means Clustering
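Here is a minimal sketch of the bisecting loop under two assumptions of ours: the leaf cluster with the largest within-cluster sum of squares is chosen for splitting, and a plain Lloyd-style 2-means serves as the splitting subroutine.

```python
import numpy as np

def two_means(X, max_iter=50, seed=0):
    """Plain 2-means (Lloyd iterations) used as the splitting subroutine."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=2, replace=False)]   # assumes at least two points
    for _ in range(max_iter):
        labels = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
        new = np.array([X[labels == i].mean(axis=0) if np.any(labels == i) else centers[i]
                        for i in range(2)])
        if np.allclose(new, centers):
            break
        centers = new
    return labels

def bisecting_kmeans(X, k, seed=0):
    """Sketch of bisecting k-means: repeatedly split the leaf cluster with the largest SSE."""
    clusters = [np.arange(len(X))]                       # start with one cluster of all points
    while len(clusters) < k:
        # Pick the leaf cluster with the largest within-cluster sum of squares.
        sse = [((X[idx] - X[idx].mean(axis=0)) ** 2).sum() for idx in clusters]
        target = clusters.pop(int(np.argmax(sse)))
        labels = two_means(X[target], seed=seed)         # split it into two children with 2-means
        for side in (0, 1):
            child = target[labels == side]
            if len(child):
                clusters.append(child)
    return clusters
```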
Key Characteristics
Bisecting K-Means combines divisive hierarchical clustering with K-Means. It splits clusters iteratively, ensuring better partitioning for complex datasets. However, the computational complexity is higher compared to standard K-Means due to repeated 2-means operations at each step.
We conducted extensive experiments to evaluate and compare different k-means clustering variants across multiple dimensions including accuracy, computational efficiency, and scalability. Our experimental framework was designed to provide comprehensive insights into the practical performance of these algorithms.
Our experimental evaluation utilized four well-known benchmark datasets from the UCI Machine Learning Repository:
We implemented and compared ten variants of K-means clustering:
To ensure reliable and consistent clustering results, we implemented a comprehensive preprocessing pipeline:
Outlier Treatment: We employed the Interquartile Range (IQR) method to handle outliers:
Feature Standardization: All features were standardized using Standard Scaler:
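A hedged sketch of this pipeline is shown below; the 1.5×IQR clipping rule and per-feature treatment are assumptions about the exact outlier handling, and standardization uses scikit-learn's StandardScaler (zero mean, unit variance per feature).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

def preprocess(X):
    """Sketch of the preprocessing pipeline: IQR-based outlier clipping, then standardization."""
    X = np.asarray(X, dtype=float)
    q1, q3 = np.percentile(X, 25, axis=0), np.percentile(X, 75, axis=0)
    iqr = q3 - q1
    # Clip each feature to [Q1 - 1.5*IQR, Q3 + 1.5*IQR] (assumed IQR rule).
    X_clipped = np.clip(X, q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    # Standardize features to zero mean and unit variance.
    return StandardScaler().fit_transform(X_clipped)
```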
We employed comprehensive metrics to evaluate clustering performance:
Clustering Accuracy: This metric quantifies the agreement between predicted cluster labels and true class labels:
Adjusted Rand Index (ARI): The ARI measures the similarity between two clustering partitions:
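Both metrics can be computed as sketched below: clustering accuracy via an optimal matching of cluster labels to class labels (the Hungarian algorithm from `scipy.optimize.linear_sum_assignment`), and the ARI via scikit-learn's `adjusted_rand_score`. This is an illustrative sketch of the standard computations, not the exact evaluation code used in our experiments.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import adjusted_rand_score

def clustering_accuracy(y_true, y_pred):
    """Best-match accuracy: optimally map predicted cluster labels to class labels, then score."""
    classes, clusters = np.unique(y_true), np.unique(y_pred)
    # Contingency table: rows = predicted clusters, columns = true classes.
    cont = np.array([[np.sum((y_pred == c) & (y_true == t)) for t in classes]
                     for c in clusters])
    row, col = linear_sum_assignment(-cont)       # maximize the total matched count
    return cont[row, col].sum() / len(y_true)

# ari = adjusted_rand_score(y_true, y_pred)       # Adjusted Rand Index
```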
To provide an intuitive understanding of how different clustering algorithms perform, we visualize their clustering results on a synthetic dataset. This comparison allows us to directly observe how each algorithm partitions the data space:
These visualizations reveal several interesting patterns:
Our experimental evaluation highlights several notable trends in clustering performance across the four benchmark datasets:
We analyzed how different clustering algorithms scale with increasing dataset size:
Traditional K-means variants (KMeans, KMedians, KMedoids) demonstrate remarkable stability across different dataset sizes, maintaining consistently high performance (accuracy > 0.90) even as the dataset grows. This stability suggests these algorithms are well-suited for real-world applications where data volume may vary.
The runtime analysis reveals significant differences in computational efficiency across algorithms:
From the results on the Digits dataset, we observe three distinct scaling patterns:
Based on our comprehensive experimental analysis:
Future research could focus on:
This study presents a systematic investigation of k-means clustering and its variants, offering a comprehensive analysis of their algorithmic properties and practical performance. Through theoretical analysis, we have elucidated the computational complexity and convergence properties of these algorithms, providing a general mathematical framework for understanding their behavior. Our experimental evaluations across diverse datasets validate these theoretical findings while offering practical insights into their strengths and limitations.