We present a comprehensive theoretical analysis of k-means clustering algorithms, a fundamental family of partitional clustering methods that minimize distances between data points and their assigned cluster centers. While k-means clustering has been extensively studied, a systematic theoretical analysis and comparison of different k-means variants remains lacking. Our work first establishes the NP-completeness of k-means clustering. We then characterize the structure of locally optimal solutions and prove convergence properties for the k-means algorithm. We further examine various k-means paradigms, such as K-Medoids, Fuzzy K-Means, and IK-Means. Through rigorous theoretical analysis and experimental evaluation, we provide insights into the trade-offs among these variants. Our results offer theoretical guidance for selecting appropriate k-means variants and suggest directions for future algorithmic improvements in k-means clustering.
"Do not forget that clusters are, in large part, on the eye of the beholder."
Clustering, as a fundamental problem in computer science, focuses on partitioning a set of objects into groups (clusters) such that objects within the same cluster are more similar to each other than to those in different clusters. Among various clustering approaches, k-means clustering has emerged as one of the most widely studied and applied methods, owing to its simplicity, efficiency, and theoretical guarantees. The k-means clustering problem, which aims to minimize the sum of squared distances between data points and their assigned cluster centers, presents significant computational challenges and has been proven to be NP-hard when the number of clusters exceeds one.
Within the broader context of k-means methodologies, various algorithmic variants have been developed to address different aspects of the clustering challenge. These variants can be broadly classified based on:
Each variant presents unique theoretical properties and computational characteristics that warrant careful analysis. These algorithms typically operate through iterative refinement, where cluster centers are repeatedly updated to minimize the within-cluster sum of squares. This iterative nature has led to various innovations in initialization strategies, high-dimensional data handling, and scalable optimization techniques.
This survey provides a systematic examination of k-means clustering algorithms and their variants from an algorithmic perspective. We focus on three key aspects:
The main contributions of this work include:
The remainder of this paper is organized as follows:
What is clustering? Clustering is a fundamental problem in unsupervised learning that aims to partition data points into groups based on their similarity. Before delving into specific algorithms, we first present a formal mathematical framework.
We first define a class of k-means-type clustering problems. Given a dataset \(\mathcal{X} = \{x_1, \ldots, x_n\} \subset \mathbb{R}^d\) and a distance function \(d: \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}_{\geq 0}\), we seek:
The objective function for k-means-type problems captures the total distortion within clusters:
\begin{equation} G((\mathcal{X}, d), (C_1,\ldots,C_k)) = \sum_{i=1}^k \sum_{x \in C_i} d(x, \mu_i) \end{equation}This function measures the sum of distances between each point and its assigned cluster center. The classical k-means algorithm emerges as a special case where \(d(x,y) = \|x-y\|^2\), chosen for its mathematical properties that enable efficient optimization.
For k-means-type problems, we define a center update rule \(\phi_d\) that maps a cluster to its representative center:
\(\phi_d: 2^{\mathcal{X}} \to \mathbb{R}^d\)
Here, \(2^{\mathcal{X}}\) denotes the power set of \(\mathcal{X}\). The specific form of \(\phi_d\) depends on the problem structure and distance function \(d\). For example, in classical k-means with squared Euclidean distance, \(\phi_d\) computes the arithmetic mean.
The clustering problem can be formulated as the following optimization problem:
\begin{equation} \text{Problem P:} \quad \text{minimize} \quad \sum_{i=1}^k \sum_{j=1}^n w_{ij} d(x_j, \mu_i) \end{equation} \begin{equation} \text{subject to} \quad \sum_{i=1}^k w_{ij} = 1, \quad \forall j = 1,\ldots,n \end{equation} \begin{equation} w_{ij} \in \{0,1\}, \quad \forall i,j \end{equation} \begin{equation} \mu_i = \phi_d(\{x_j: w_{ij} = 1\}), \quad \forall i = 1,\ldots,k \end{equation}The binary variables \(w_{ij}\) encode cluster assignments, where \(w_{ij} = 1\) indicates point \(x_j\) belongs to cluster \(i\). The constraints ensure each point is assigned to exactly one cluster.
For general k-means-type problems, we extend Lloyd's algorithm with the following structure:
The convergence criterion can be:
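For concreteness, the following is a minimal sketch of this generalized Lloyd iteration, using the decrease of the objective falling below a tolerance as the (assumed) convergence criterion; the function names, defaults, and random initialization are illustrative choices, not a reference implementation.

```python
import numpy as np

def generalized_lloyd(X, k, dist, update_center, max_iter=100, tol=1e-8, seed=0):
    """Sketch of a generalized Lloyd iteration for k-means-type problems.

    dist(x, mu)        -- distance d(x, mu) >= 0
    update_center(pts) -- center update rule phi_d applied to one cluster
    """
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]   # random initial centers
    prev_cost = np.inf
    for _ in range(max_iter):
        # Assignment step: each point goes to its closest center under d.
        D = np.array([[dist(x, mu) for mu in centers] for x in X])
        labels = D.argmin(axis=1)
        # Update step: recompute each center with phi_d.
        centers = np.array([update_center(X[labels == i]) if np.any(labels == i)
                            else centers[i] for i in range(k)])
        cost = D[np.arange(len(X)), labels].sum()
        if prev_cost - cost < tol:        # convergence: objective stops decreasing
            break
        prev_cost = cost
    return labels, centers, cost

# Classical k-means: squared Euclidean distance, arithmetic-mean update.
# labels, centers, cost = generalized_lloyd(X, 3,
#     dist=lambda x, mu: np.sum((x - mu) ** 2),
#     update_center=lambda pts: pts.mean(axis=0))
```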
We prove that the k-means clustering problem is NP-complete even when restricted to k = 2. The proof proceeds by reduction from a variant of Not-All-Equal 3SAT (NaeSat*), which is known to be NP-complete.
Definition 3.1 (NaeSat*)
Input: A Boolean formula ϕ(x1, ..., xn) in 3CNF where:
Output: true if there exists an assignment where each clause has exactly one or two true literals, false otherwise.
We want to prove the following theorem.
Theorem. 2-means clustering is NP-complete.
Lemma 3.3. k-means clustering is in NP.
Proof. Given a solution (C1, ..., Ck), we can verify in polynomial time:
1. The partition is valid: each point appears in exactly one cluster.
2. The cost is computed correctly for the given partition: each reported center is the centroid of its assigned cluster.
3. The total cost is below the specified threshold.
The verification takes O(nk) time where n is the number of points.
Given an instance ϕ of NaeSat* with n variables and m clauses, we construct a 2-means clustering instance that captures the satisfiability properties of ϕ. The construction proceeds as follows:
1. For each variable xi in ϕ, we create two corresponding points in our metric space: one representing the positive literal xi and one representing the negative literal ¬xi. This gives us 2n points in total.
2. We define a distance matrix D that encodes the logical structure of ϕ. For any two literals α and β:
\begin{equation} D_{\alpha \beta} = \begin{cases} 0 & \text{if $\alpha$ and $\beta$ are identical} \\ 1 + \Delta & \text{if $\alpha$ is the negation of $\beta$} \\ 1 + \delta & \text{if $\alpha$ and $\beta$ appear together in a clause} \\ 1 & \text{otherwise} \end{cases} \end{equation}
where the parameters \(\delta\) and \(\Delta\) satisfy:
3. We set our target clustering cost threshold to \(c(\phi) = n-1+\frac{2\delta m}{n}\).
We prove the reduction is correct through two key lemmas that establish the equivalence between NaeSat* solutions and low-cost clusterings.
Lemma (Forward Direction): If \(\phi\) has a satisfying assignment for NaeSat*, then there exists a 2-means clustering with cost exactly \(c(\phi)\).
Proof
Given a satisfying assignment for \(\phi\), we construct an optimal clustering as follows:
1. Let \(C_1\) contain all literals that are true in the satisfying assignment.
2. Let \(C_2\) contain all literals that are false in the satisfying assignment.
This clustering has several important properties:
The total clustering cost can be calculated as:
Lemma (Backward Direction): If there exists a 2-means clustering with cost \(\leq c(\phi)\), then \(\phi\) is satisfiable.
We prove this through a sequence of structural properties:
1. First, we prove that no cluster can contain both a literal and its negation:
2. This implies that each cluster must contain exactly \(n\) points, with exactly one literal from each variable pair.
3. The cost formula then simplifies to:
4. For this to be \(\leq c(\phi)\), every clause must be "split" between clusters.
5. Therefore, assigning true to all literals in \(C_1\) and false to all literals in \(C_2\) gives a valid NaeSat* solution.
The previous lemmas established that our construction yields a valid reduction from NaeSat* to 2-means clustering if we can represent the distances in Euclidean space. We now show this is possible using theorems established by Schoenberg.
Theorem (Geometric Realization): Given our distance matrix \(D(\phi)\), there exist points \(\{x_\alpha\}_{\alpha \in \text{literals}} \subset \mathbb{R}^{2n}\) such that \(D_{\alpha \beta} = \|x_\alpha - x_\beta\|^2\) for all literals \(\alpha, \beta\).
Proof
The proof proceeds in two steps:
First, we show that \(D(\phi)\) satisfies Schoenberg's criterion for embeddability into \(\ell_2^2\).
Then, we explain how to explicitly construct the embedding using classical multidimensional scaling.
Step 1: Verifying Schoenberg's Criterion
By Schoenberg's theorem, a symmetric matrix \(D\) can be realized as squared Euclidean distances if and only if \(-HDH\) is positive semidefinite, where \(H = I - \frac{1}{N}\mathbf{1}\mathbf{1}^T\) is the centering matrix.
We will show this by proving that \(v^T HDHv \leq 0\) for all vectors \(v \in \mathbb{R}^N\). Let \(u = Hv\). Note that \(u \cdot \mathbf{1} = 0\) for any such \(u\), because \(H\mathbf{1} = 0\). Therefore, it suffices to show \(u^T Du \leq 0\) for all vectors \(u\) with \(u \cdot \mathbf{1} = 0\).
For any such vector \(u\), let \(u^+\) denote its first \(n\) coordinates and \(u^-\) its last \(n\) coordinates. Then:
The last inequality follows from our choice of parameters where \(2\delta n \leq 1-\Delta\).
Step 2: Constructing the Embedding
Given that \(D\) satisfies Schoenberg's criterion, we can explicitly construct the embedding points \(\{x_\alpha\}\) as follows:
This construction is guaranteed to work because \(B\) is positive semidefinite, and the resulting points \(x_\alpha \in \mathbb{R}^{2n}\) will satisfy \(\|x_\alpha - x_\beta\|^2 = D_{\alpha \beta}\) for all pairs of literals \(\alpha, \beta\).
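As a concrete illustration of this step, the following is a small sketch of classical multidimensional scaling, assuming \(D\) is a valid squared-distance matrix and \(B = -\tfrac{1}{2}HDH\) is the centered Gram matrix; the function name is ours.

```python
import numpy as np

def embed_from_squared_distances(D):
    """Classical MDS: recover points whose squared Euclidean distances match D.

    Assumes D is symmetric with zero diagonal and satisfies Schoenberg's
    criterion (B = -1/2 * H D H is positive semidefinite).
    """
    N = D.shape[0]
    H = np.eye(N) - np.ones((N, N)) / N        # centering matrix
    B = -0.5 * H @ D @ H                       # centered Gram matrix
    eigvals, eigvecs = np.linalg.eigh(B)       # B is symmetric PSD (up to numerical noise)
    eigvals = np.clip(eigvals, 0.0, None)      # discard tiny negative eigenvalues
    X = eigvecs * np.sqrt(eigvals)             # rows are the embedded points
    return X

# Sanity check: pairwise squared distances of X reproduce D.
# X = embed_from_squared_distances(D)
# assert np.allclose(((X[:, None] - X[None, :]) ** 2).sum(-1), D, atol=1e-6)
```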
This geometric realization completes our reduction by showing that the abstract distance matrix \(D(\phi)\) can be realized by actual points in Euclidean space, making our reduction valid for the standard k-means clustering problem.
To analyze the problem structure more deeply, we use the reduced function introduced by Selim, which eliminates the center variables:
Definition. The reduced function \(F(W)\) of Problem P is defined by:
where \(W = [w_{ij}]\) is a \(k \times n\) matrix of assignment variables.
This leads to the reduced problem:
To understand the properties of this reduced problem, we first analyze the structure of the reduced function \(F(W)\) under different conditions on the distance function \(d\).
Theorem. When \(d(x,y)\) is convex in \(y\) for fixed \(x\), the following properties hold:
Proof
The concavity follows from:
The binary property follows from the concavity of \(F\): a concave function attains its minimum over the feasible polytope at an extreme point, so any fractional assignment can be replaced by a vertex of the feasible region without increasing the objective.
For classical k-means with non-empty clusters and \(d(x,y) = \|x-y\|^2\):
where \(\sum_{l=1}^n w_{il} > 0\) for all \(i\).
Definition. A point \((W^*, Z^*)\) is called a partial optimal solution if:
\begin{equation} f(W^*, Z^*) \leq f(W, Z^*) \quad \text{for all } W \in S \end{equation} \begin{equation} f(W^*, Z^*) \leq f(W^*, Z) \quad \text{for all } Z \in \mathbb{R}^{kd} \end{equation}where \(S\) is the feasible set of cluster assignment matrices.
To find partial optimal solutions, we consider two subproblems:
To better understand the structure of these partial optimal solutions, we can characterize them using the Kuhn-Tucker optimality conditions from nonlinear programming theory.
Theorem. Equivalence of Partial Optimal Solutions and Kuhn-Tucker Points
Under differentiability conditions, a point \((W^*, Z^*)\) is a partial optimal solution if and only if it satisfies the Kuhn-Tucker conditions:
\begin{equation} \text{i)} \quad \nabla_{w_j}f(W, Z) + \Pi^t E \geq 0, \quad j = 1,\ldots,n \end{equation} \begin{equation} \text{ii)} \quad (\nabla_{w_j}f(W, Z) + \Pi^t E)\, w_j = 0, \quad j = 1,\ldots,n \end{equation} \begin{equation} \text{iii)} \quad E w_j = 1, \quad w_j \geq 0, \quad j = 1,\ldots,n \end{equation} \begin{equation} \text{iv)} \quad \nabla_{z_i}f(W, Z) = 0, \quad i = 1,\ldots,k \end{equation}where \(E\) is a vector of ones and \(\Pi\) is the vector of Lagrange multipliers.
(\(\Rightarrow\)) First, assume \((W^*, Z^*)\) is a partial optimal solution.
(\(\Leftarrow\)) Now assume \((W^*, Z^*)\) satisfies conditions i)-iv).
Therefore, \((W^*, Z^*)\) satisfies both conditions of a partial optimal solution.
Theorem: K-means Convergence
The k-means algorithm converges to a partial optimal solution of Problem P in a finite number of iterations.
Proof
We prove by contradiction that the algorithm visits each extreme point of \(S\) at most once.
Assume there exist iterations \(r_1\) and \(r_2\) (\(r_1 \neq r_2\)) where the algorithm visits the same extreme point:
Following the algorithm steps:
Since \(W^{r_1} = W^{r_2}\), we have:
while also having a strictly decreasing sequence between these points. Therefore, the algorithm cannot visit the same extreme point twice.
Since the feasible region has a finite number of extreme points and the algorithm never revisits the same extreme point, it must terminate in a finite number of iterations. When the algorithm terminates, the solution satisfies partial optimality conditions.
Having established the theoretical properties of the classical k-means algorithm, including its convergence guarantees and local optimality conditions, we now explore how these foundations can be extended and modified to address various practical challenges. While k-means is widely used, its limitations, such as sensitivity to initialization, restriction to the squared Euclidean distance, and the requirement for numeric data, have motivated numerous variants of the algorithm.
In this section, we summarize some of the well-known variants of K-Means, which involve different problem setups and different algorithmic paradigms. We introduce their problem formulations, practical algorithms, relations to K-Means, and some theoretical results. Specifically, we cover some of the most well-established K-Means variants, grouped by the following aspects: (1) representative elements and distance-generating function for the clusters (K-Medoids, K-Medians, K-Modes, Mean Shift Clustering); (2) feature transformation (Fuzzy K-Means, Kernel K-Means, Weighted K-Means); (3) initial representative element assignment (IK-Means); (4) algorithmic paradigm (Bisecting K-Means Clustering).
K-Medoids Clustering is a variant of K-Means Clustering. Similar to K-Means, it also aims to find a set of \(\{C_1, C_2, \cdots, C_k\}\) that minimizes a pre-defined loss function. However, the problem formulation is different in the following two ways:
Intuitively, because the squared \(\mathit{l}_2\) norm amplifies large deviations while the \(\mathit{l}_1\) norm does not, a model trained with the \(\mathit{l}_1\) norm is more robust to outliers. Moreover, the requirement that all representative elements be points of the given input set \(\mathcal{X}\) makes K-Medoids useful in scenarios where more "natural", exemplar-based cluster representatives are desired.
Similar to K-Means, the K-Medoid clustering algorithms also follow the paradigm of iteratively updating the representative points and the clusters until convergence. A basic K-Medoid clustering algorithm is provided below:
In the naive implementation of k-Medoids (Option 1 in the algorithm), after selecting the initial medoids, we need to compute the total cost of swapping the representative object \(m\) with \(x_i\) for all non-representative objects \(x_i\). This is an \(O(n^2)\) operation.
To address this issue, a modification of the k-Medoids algorithm, Partitioning Around Medoids (PAM), has been proposed. The algorithm first builds a dissimilarity matrix on the given dataset, a step referred to in the literature as the BUILD operation. The second part of the algorithm, referred to as the SWAP operation, iteratively selects a random non-representative point and checks whether swapping a representative object with it yields a lower loss, as illustrated in Option 2 of the algorithm.
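As a concrete illustration (not the original PAM implementation), the sketch below runs a greedy swap-based k-medoids loop on a precomputed dissimilarity matrix `D`; the simplified random initialization stands in for the BUILD step, and accepting any cost-reducing swap stands in for the SWAP step.

```python
import numpy as np

def k_medoids_swap(D, k, max_iter=100, seed=0):
    """Sketch of swap-based k-medoids on a precomputed n x n dissimilarity matrix D."""
    n = D.shape[0]
    rng = np.random.default_rng(seed)
    medoids = list(rng.choice(n, size=k, replace=False))   # simplified initialization

    def cost(meds):
        # Each point contributes its dissimilarity to the closest medoid.
        return D[:, meds].min(axis=1).sum()

    best = cost(medoids)
    for _ in range(max_iter):
        improved = False
        for i in range(k):                                  # try replacing medoid i
            for x in range(n):
                if x in medoids:
                    continue
                trial = medoids[:i] + [x] + medoids[i + 1:]
                c = cost(trial)
                if c < best:                                # keep the swap if it lowers the cost
                    medoids, best, improved = trial, c, True
        if not improved:
            break
    labels = D[:, medoids].argmin(axis=1)
    return medoids, labels, best
```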
Compared to k-Means Clustering, k-Medoids Clustering:
However, the computational complexity of k-Medoids is higher, restricting its application to larger datasets. Classical works like the Clustering LARge Applications (CLARA) algorithm extend the idea of PAM to larger datasets through sampling methods.
K-Medians Clustering selects the medians of the clusters, as opposed to the means in K-Means Clustering. For each cluster \(C_i\), the median \(med\) is defined as
where \(med_{ij}\) is the \(j\)-th feature of the median point in the cluster \(C_i\), and \(x_{j}\) is the \(j\)-th feature of data point \(x\) in the cluster \(C_i\). The loss function for K-Medians is defined as:
K-Medians is more robust to outliers than K-Means. Note that K-Medians' representative elements are not restricted to points in \(\mathcal{X}\): since the median is selected feature-wise, the resulting representative element need not be one of the given points. For instance, given the cluster \((0,0), (1,-1), (2,1)\), the representative point is \((1,0)\), which is not an element of the cluster.
The classical algorithm uses a Lloyd-style iteration that alternates between an expectation step and a maximization step, as in k-Means. In the expectation step, input data points are assigned to their nearest median. In the maximization step, the medians are recomputed by taking the median in each dimension separately.
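A brief sketch of this iteration is given below, assuming the \(l_1\) (Manhattan) distance for the assignment step and a coordinate-wise median for the update step; names and defaults are illustrative.

```python
import numpy as np

def k_medians(X, k, max_iter=100, seed=0):
    """Sketch of Lloyd-style k-medians: l1 assignments, coordinate-wise median updates."""
    rng = np.random.default_rng(seed)
    medians = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Expectation step: assign each point to the nearest median under the l1 distance.
        dists = np.abs(X[:, None, :] - medians[None, :, :]).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Maximization step: recompute each median feature-wise.
        new_medians = np.array([np.median(X[labels == i], axis=0) if np.any(labels == i)
                                else medians[i] for i in range(k)])
        if np.allclose(new_medians, medians):
            break
        medians = new_medians
    return labels, medians
```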
Even in metric spaces, k-median is known to be NP-hard. For approximation algorithms, there exists an \(O(1)\)-approximation algorithm that runs in \(\tilde{O}(kn)\) time, and it has been proven that there is no \(O(1)\)-approximation algorithm for metric k-median with \(o(kn)\) runtime.
Theorem 5.1
For any \(\rho \geq 1\), every approximation algorithm (including randomized algorithms) with approximation ratio \(\rho\) for estimating the cost of metric k-median with \(k=\frac{n}{2}\) requires \(\Omega(n^2)\) time.
Proof
For simplicity, we denote the number of input points as \(2n\) and set the number of clusters to \(k=\frac{2n}{2}=n\). We then partition the \(2n\) points into two sets: \(n\) points in \(P\) and \(n\) points in \(F\). Between these two sets of points, we construct a perfect matching \(\mathcal{M}\) and choose an edge \(e \in \mathcal{M}\) at random. We then define the distances in \((P \cup F, D)\) as follows:
The metric space on \((P \cup F, D)\) is well-defined. For arbitrarily large \(n\), we know that the optimal solution is the following clustering:
If \(D(e) = Q\), then the cost of the \(k\)-Median problem is \(2n\rho +1\), and if \(D(e) = 1\), then the cost is \(2n\). Hence, any \(\rho\)-factor approximation algorithm for the \(k\)-Median problem must distinguish between these two instances. However, doing so requires determining whether there exists an edge of length \(Q\), which is known to require \(\Omega(n^2)\) distance queries, even for randomized algorithms.
Definition
Let \(\mathcal{A} = \{A_1, A_2, \ldots, A_d\}\) denote a set of \(d\) distinct categorical attributes, each associated with a finite set \(\mathcal{O}_j\) \((1 \leq j \leq d)\) as its domain, where \(\text{DOM}(A_j) = \mathcal{O}_j\) contains at least two discrete values.
A categorical dataset \(\mathcal{D} = \{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n\}\) comprises \(n\) categorical data points, where each object \(\mathbf{x}_i \in \mathcal{D}\) \((1 \leq i \leq n)\) is a tuple \(\mathbf{x}_i = (x_{i1}, x_{i2}, \ldots, x_{id}) \in \mathcal{O}_1 \times \mathcal{O}_2 \times \cdots \times \mathcal{O}_d\).
For categorical data, k-Means Clustering is not applicable because the \(l_2\) distance between categorical values is not meaningful. Intuitively, if we encode the categories as \(0, 1, 2\), the distance between categories \(0\) and \(2\) should equal the distance between categories \(0\) and \(1\), since each pair simply consists of two distinct categories. In other words, the distance should be invariant to the (arbitrary) ordering of the categories, which is not the case for the \(l_2\) distance.
Definition
The \(l_0\)-norm of a vector \(\mathbf{x} = (x_1, x_2, \ldots, x_m)\), denoted by \(\|\mathbf{x}\|_0\), is defined as the number of non-zero entries in \(\mathbf{x}\). Mathematically, it is expressed as:
where \(\mathbf{1}(\cdot)\) is the indicator function, which equals \(1\) if the condition inside is true, and \(0\) otherwise. The \(L_0\)-norm is commonly used to measure the sparsity of a vector.
Instead, K-Modes Clustering uses the \(l_0\) norm as the distance-generating function: the difference between two data points \(x, y\) is the number of features on which the two disagree. A basic K-Modes algorithm is described below:
The algorithm works by iteratively assigning data points to the nearest cluster mode, where the mode takes the most frequent value of each feature within the cluster.
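For illustration, here is a compact sketch of that loop for categorical data encoded as non-negative integer arrays, using the Hamming-style mismatch count as the distance and the feature-wise most frequent value as the mode; the encoding assumption and defaults are ours.

```python
import numpy as np

def k_modes(X, k, max_iter=100, seed=0):
    """Sketch of k-modes for integer-encoded categorical data X (n x d)."""
    rng = np.random.default_rng(seed)
    modes = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(max_iter):
        # Distance = number of features on which a point and a mode disagree.
        dists = (X[:, None, :] != modes[None, :, :]).sum(axis=2)
        labels = dists.argmin(axis=1)
        new_modes = modes.copy()
        for i in range(k):
            members = X[labels == i]
            if len(members) == 0:
                continue
            # Mode = most frequent category in each feature within the cluster.
            new_modes[i] = [np.bincount(members[:, j]).argmax() for j in range(X.shape[1])]
        if np.array_equal(new_modes, modes):
            break
        modes = new_modes
    return labels, modes
```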
Mean Shift Clustering is a popular non-parametric clustering method. Its objective is to discover the modes (local maxima) in the data distribution by iteratively shifting data points toward the direction of maximum density. This process is based on the Parzen window kernel density estimation method and performs gradient ascent until convergence. The stationary points of the estimated density represent the modes, which serve as the cluster centers.
Definition: Parzen Window Kernel Density Estimate
Given \( N \) data points \(\{x_i\}_{i=1}^N\) in a \( d \)-dimensional space \(\mathbb{R}^d\), the multivariate Parzen window kernel density estimate \( f(x) \) is computed using a kernel function \( K(x) \) and window radius \( h \) as:
Definition: Mean Shift Vector
The mean shift vector \( m_h(x) \), which points toward the direction of the maximum increase in density, is calculated as:
where \( g(s) = -k'(s) \) is the negative derivative of the profile \( k \) of the kernel \( K(x) \).
Algorithm: Mean Shift Clustering
The Mean Shift Clustering algorithm iteratively computes \( m_h(x) \) for each data point \( x \) and updates \( x \) until convergence to a stationary point.
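A minimal sketch of this procedure with a Gaussian kernel is shown below; the bandwidth \(h\), iteration cap, and stopping tolerance are illustrative assumptions rather than prescribed choices.

```python
import numpy as np

def mean_shift(X, h=1.0, max_iter=200, tol=1e-5):
    """Sketch of mean shift with a Gaussian kernel: shift every point toward a density mode."""
    X = np.asarray(X, dtype=float)
    modes = X.copy()
    for _ in range(max_iter):
        shifted = np.empty_like(modes)
        for i, x in enumerate(modes):
            # Gaussian weights of all data points relative to the current position.
            w = np.exp(-np.sum((X - x) ** 2, axis=1) / (2 * h ** 2))
            shifted[i] = (w[:, None] * X).sum(axis=0) / w.sum()   # weighted mean = x + m_h(x)
        if np.max(np.abs(shifted - modes)) < tol:
            break
        modes = shifted
    # Points whose converged positions (nearly) coincide belong to the same cluster.
    return modes
```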
The Mean Shift Clustering Algorithm has broad applications in pattern recognition and computer vision, for example, image segmentation, object tracking, and feature space analysis. However, its performance is sensitive to the choice of the kernel function and the bandwidth \( h \).
Standard K-Means restricts each data point's membership to exactly one cluster. However, in many real-world applications the underlying structure contains overlapping clusters. Fuzzy K-Means Clustering, often referred to as Fuzzy C-Means (FCM) clustering, relaxes the single-membership restriction to address this issue.
In the Fuzzy C-Means (FCM) clustering algorithm, the loss function for a clustering solution \(\mathcal{C}\) is given by:
Here, \(w_{xik}\) is the membership weight of point \(x_i\) belonging to cluster \(C_k\). These membership weights enter both the assignment of points and the update of the cluster centers.
Membership Weight
The membership weight \(w_{xik}\) is computed as:
Cluster Center
The cluster center \(c_k\) is calculated as:
Algorithm: Fuzzy C-Means Clustering
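As an illustration of the alternating membership/center updates described above, the following is a minimal sketch of an FCM loop under squared Euclidean distance with the common fuzzifier \(m = 2\) as the default; it is a hedged sketch, not a reference implementation.

```python
import numpy as np

def fuzzy_c_means(X, k, m=2.0, max_iter=100, tol=1e-6, seed=0):
    """Sketch of Fuzzy C-Means: soft memberships U (n x k) and centers C (k x d)."""
    rng = np.random.default_rng(seed)
    n = len(X)
    U = rng.dirichlet(np.ones(k), size=n)          # random memberships, rows sum to 1
    for _ in range(max_iter):
        Um = U ** m
        C = (Um.T @ X) / Um.sum(axis=0)[:, None]   # membership-weighted centers
        # Squared distances to each center (small epsilon avoids division by zero).
        d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2) + 1e-12
        # Membership update: u_ik proportional to d2_ik^(-1/(m-1)), normalized over clusters.
        inv = d2 ** (-1.0 / (m - 1))
        U_new = inv / inv.sum(axis=1, keepdims=True)
        if np.max(np.abs(U_new - U)) < tol:
            U = U_new
            break
        U = U_new
    return U, C
```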
Kernel Fuzzy C-Means (KFCM)
KFCM (Kernel Fuzzy C-Means) is an improved version of the FCM algorithm, leveraging kernel methods to map data into a higher-dimensional feature space. Its key modifications include:
KFCM extends Fuzzy C-Means by using the kernel trick to handle non-linearly separable clusters, making it effective for complex clustering scenarios such as pattern recognition and biomedical applications.
Kernel K-Means is an extension of the standard K-Means algorithm that works by projecting the data into a high-dimensional kernel space, where clusters can be more easily separated. This projection is achieved using kernel functions, but only the distances between points in the kernel space are required, so an explicit representation of the projection function \( \phi \) is not needed.
Common kernel functions include polynomial, Gaussian (RBF), and sigmoid kernels, as shown in Table 1. We present Mercer's Theorem (Theorem 1) without proof.
Theorem 1: Mercer's Theorem
Let \( K(x, y) \) be a continuous, symmetric, and positive semi-definite kernel function defined on a compact domain \( \mathcal{X} \times \mathcal{X} \). Then, \( K(x, y) \) can be expressed as:
where:
Table 1: Common kernel functions

| Kernel Type | Kernel Function |
|---|---|
| Polynomial Kernel | \( K(\mathbf{a}, \mathbf{b}) = (\mathbf{a} \cdot \mathbf{b} + c)^d \) |
| Gaussian Kernel (RBF) | \( K(\mathbf{a}, \mathbf{b}) = \exp\left(-\frac{\|\mathbf{a} - \mathbf{b}\|^2}{2\sigma^2}\right) \) |
| Sigmoid Kernel | \( K(\mathbf{a}, \mathbf{b}) = \tanh(c (\mathbf{a} \cdot \mathbf{b}) + \theta) \) |
Objective Function
The objective function for Kernel K-Means minimizes the sum of squared errors (SSE) in the high-dimensional feature space:
where \( c_k \) is the centroid of cluster \( C_k \) in the kernel space.
Algorithm: Kernel K-Means Clustering
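To illustrate the key step, the sketch below assigns points using only the kernel matrix \(K\), relying on the standard expansion \(\|\phi(x) - c_k\|^2 = K(x,x) - \frac{2}{|C_k|}\sum_{y \in C_k} K(x,y) + \frac{1}{|C_k|^2}\sum_{y,z \in C_k} K(y,z)\); this is a hedged illustration, not a canonical implementation.

```python
import numpy as np

def kernel_kmeans(K, k, max_iter=100, seed=0):
    """Sketch of kernel k-means operating directly on an n x n kernel matrix K."""
    n = K.shape[0]
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, k, size=n)              # random initial assignment
    for _ in range(max_iter):
        dist = np.zeros((n, k))
        for c in range(k):
            idx = np.where(labels == c)[0]
            if len(idx) == 0:
                dist[:, c] = np.inf
                continue
            # ||phi(x) - center_c||^2 expressed purely through kernel evaluations.
            term2 = K[:, idx].sum(axis=1) / len(idx)
            term3 = K[np.ix_(idx, idx)].sum() / len(idx) ** 2
            dist[:, c] = np.diag(K) - 2 * term2 + term3
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels

# Example: Gaussian (RBF) kernel matrix for data X with bandwidth sigma.
# sq = ((X[:, None] - X[None, :]) ** 2).sum(-1); K = np.exp(-sq / (2 * sigma ** 2))
```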
Weighted Kernel K-Means
Weighted Kernel K-Means extends Kernel K-Means by incorporating weights for each data point, enabling the algorithm to account for varying importance of points. The objective is to minimize the following distortion function:
where \( w(x) \) is the weight of point \( x \), and \( c_j \) is the weighted mean of cluster \( C_j \) in the kernel space:
Weighted K-Means (WK-Means) is an extension of the standard K-Means algorithm that introduces a feature weighting mechanism to account for the varying importance of features. By iteratively learning weights for different features, WK-Means improves clustering performance, especially in datasets where certain features are more informative than others. A user-defined parameter \( \beta \) controls the sensitivity of the algorithm to feature weights.
Weighted Distance
In WK-Means algorithm, the distance between a data point \( x_i \) and a cluster centroid \( c_k \) is defined using a weighted Euclidean distance as:
At each iteration, WK-Means assigns data points to the nearest cluster based on this weighted distance and updates the cluster centroids. Once the centroids are updated, the feature weights \( w_v \) are recalculated using:
where \( D_v \) is the sum of within-cluster variances for feature \( v \), defined as:
The weight update ensures that the weights \( w_v \) are normalized such that:
Algorithm: Weighted K-Means Clustering
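The sketch below shows one way the described loop can be realized; the weight-update rule \(w_v = 1 / \sum_u (D_v / D_u)^{1/(\beta-1)}\), the default \(\beta = 2\), and the convergence test on labels are assumptions based on the standard WK-Means formulation.

```python
import numpy as np

def wk_means(X, k, beta=2.0, max_iter=100, seed=0):
    """Sketch of Weighted K-Means: alternates assignments, centroids, and feature weights."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    centers = X[rng.choice(n, size=k, replace=False)]
    w = np.full(d, 1.0 / d)                       # feature weights, normalized to sum to 1
    labels = np.zeros(n, dtype=int)
    for _ in range(max_iter):
        # Assign points under the weighted distance sum_v w_v^beta (x_v - c_v)^2.
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2 * w ** beta).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        centers = np.array([X[labels == i].mean(axis=0) if np.any(labels == i)
                            else centers[i] for i in range(k)])
        # D_v: within-cluster dispersion of feature v, summed over all clusters.
        D = np.zeros(d)
        for i in range(k):
            members = X[labels == i]
            if len(members):
                D += ((members - centers[i]) ** 2).sum(axis=0)
        D = np.maximum(D, 1e-12)                  # avoid division by zero
        # Weight update: w_v = 1 / sum_u (D_v / D_u)^(1/(beta-1)); weights sum to 1.
        w = 1.0 / ((D[:, None] / D[None, :]) ** (1.0 / (beta - 1))).sum(axis=1)
    return labels, centers, w
```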
By incorporating feature importance, WK-Means empirically achieves better clustering results compared to standard K-Means, especially for datasets with heterogeneous feature importance. However, the computational cost of WK-Means is higher than traditional K-Means due to the additional weight learning step.
Intelligent K-Means (IK-Means) is a clustering method that follows the principle: the farther a point is from the centroid, the more we care about it. Unlike traditional K-Means, IK-Means introduces a deterministic way of selecting initial centroids by considering points that are farthest from the existing centroid. This approach enables better clustering for datasets with wide scatter. The method leverages ideas from principal component analysis (PCA) to identify points with maximum data scatter and uses them as potential cluster centroids, often representing anomalous patterns in the data.
Step 1: Center of Gravity
The first step of IK-Means involves calculating the centroid for the entire dataset, referred to as the center of gravity \( c_g \):
Step 2: Farthest Point Selection
Once \( c_g \) is determined, the algorithm identifies the point \( c \) farthest from \( c_g \):
Step 3: Cluster Assignment
This point \( c \) becomes a candidate for an anomalous cluster centroid. The algorithm creates a cluster \( S_{\text{iter}} \) by assigning all points \( x_i \) to \( S_{\text{iter}} \) if they are closer to \( c \) than to \( c_g \):
where \( d(x_i, c) \) denotes the distance between \( x_i \) and \( c \).
Step 4: Cluster Centroid Update
Once the cluster \( S_{\text{iter}} \) is formed, the centroid of this cluster \( s_g \) is updated as the mean of the points in \( S_{\text{iter}} \):
Algorithm: Intelligent K-Means (IK-Means)
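The following is a compact sketch of Steps 1-4 above, applied repeatedly to extract anomalous clusters; the stopping rule (an optional cap on the number of extracted clusters) and the handling of degenerate ties are assumptions.

```python
import numpy as np

def ik_means_init(X, max_clusters=None):
    """Sketch of IK-Means anomalous-pattern extraction for choosing initial centroids."""
    remaining = np.asarray(X, dtype=float)
    c_g = remaining.mean(axis=0)                  # Step 1: center of gravity of the full dataset
    centroids = []
    while len(remaining) > 0 and (max_clusters is None or len(centroids) < max_clusters):
        # Step 2: the point farthest from the center of gravity seeds an anomalous cluster.
        c = remaining[np.argmax(((remaining - c_g) ** 2).sum(axis=1))]
        while True:
            # Step 3: points closer to the candidate centroid c than to c_g join the cluster.
            in_cluster = (((remaining - c) ** 2).sum(axis=1)
                          <= ((remaining - c_g) ** 2).sum(axis=1))
            if not in_cluster.any():              # degenerate case: keep at least the seed point
                in_cluster[np.argmin(((remaining - c) ** 2).sum(axis=1))] = True
            # Step 4: recenter the anomalous cluster; stop when the centroid no longer moves.
            new_c = remaining[in_cluster].mean(axis=0)
            if np.allclose(new_c, c):
                break
            c = new_c
        centroids.append(c)
        remaining = remaining[~in_cluster]        # remove the extracted cluster and repeat
    return np.array(centroids)
```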
Unlike traditional K-Means, which is non-deterministic due to random initialization, IK-Means is a deterministic algorithm. It is particularly effective in scenarios where clusters are spread widely across the dataset and is well-suited for detecting anomalous patterns.
Bisecting K-Means Clustering builds a binary tree and iteratively splits the leaf nodes \( C \) into two child nodes \( C_1, C_2 \) by running 2-means. This hierarchical approach combines divisive clustering with K-Means, offering a flexible way to cluster datasets. The algorithm is detailed below.
Algorithm: Bisecting K-Means Clustering
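Here is a minimal sketch of the bisecting loop under two assumptions of ours: the leaf cluster with the largest within-cluster sum of squares is chosen for splitting, and a plain Lloyd-style 2-means serves as the splitting subroutine.

```python
import numpy as np

def two_means(X, max_iter=50, seed=0):
    """Plain 2-means (Lloyd iterations) used as the splitting subroutine."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=2, replace=False)]   # assumes at least two points
    for _ in range(max_iter):
        labels = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
        new = np.array([X[labels == i].mean(axis=0) if np.any(labels == i) else centers[i]
                        for i in range(2)])
        if np.allclose(new, centers):
            break
        centers = new
    return labels

def bisecting_kmeans(X, k, seed=0):
    """Sketch of bisecting k-means: repeatedly split the leaf cluster with the largest SSE."""
    clusters = [np.arange(len(X))]                       # start with one cluster of all points
    while len(clusters) < k:
        # Pick the leaf cluster with the largest within-cluster sum of squares.
        sse = [((X[idx] - X[idx].mean(axis=0)) ** 2).sum() for idx in clusters]
        target = clusters.pop(int(np.argmax(sse)))
        labels = two_means(X[target], seed=seed)         # split it into two children with 2-means
        for side in (0, 1):
            child = target[labels == side]
            if len(child):
                clusters.append(child)
    return clusters
```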
Key Characteristics
Bisecting K-Means combines divisive hierarchical clustering with K-Means. It splits clusters iteratively, ensuring better partitioning for complex datasets. However, the computational complexity is higher compared to standard K-Means due to repeated 2-means operations at each step.
We conducted extensive experiments to evaluate and compare different k-means clustering variants across multiple dimensions including accuracy, computational efficiency, and scalability. Our experimental framework was designed to provide comprehensive insights into the practical performance of these algorithms.
Our experimental evaluation utilized four well-known benchmark datasets from the UCI Machine Learning Repository:
We implemented and compared ten variants of K-means clustering:
To ensure reliable and consistent clustering results, we implemented a comprehensive preprocessing pipeline:
Outlier Treatment: We employed the Interquartile Range (IQR) method to handle outliers:
Feature Standardization: All features were standardized using Standard Scaler:
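A hedged sketch of this pipeline is shown below; the 1.5×IQR clipping rule and per-feature treatment are assumptions about the exact outlier handling, and standardization uses scikit-learn's StandardScaler (zero mean, unit variance per feature).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

def preprocess(X):
    """Sketch of the preprocessing pipeline: IQR-based outlier clipping, then standardization."""
    X = np.asarray(X, dtype=float)
    q1, q3 = np.percentile(X, 25, axis=0), np.percentile(X, 75, axis=0)
    iqr = q3 - q1
    # Clip each feature to [Q1 - 1.5*IQR, Q3 + 1.5*IQR] (assumed IQR rule).
    X_clipped = np.clip(X, q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    # Standardize features to zero mean and unit variance.
    return StandardScaler().fit_transform(X_clipped)
```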
We employed comprehensive metrics to evaluate clustering performance:
Clustering Accuracy: This metric quantifies the agreement between predicted cluster labels and true class labels:
Adjusted Rand Index (ARI): The ARI measures the similarity between two clustering partitions:
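Both metrics can be computed as sketched below: clustering accuracy via an optimal matching of cluster labels to class labels (the Hungarian algorithm from `scipy.optimize.linear_sum_assignment`), and the ARI via scikit-learn's `adjusted_rand_score`. This is an illustrative sketch of the standard computations, not the exact evaluation code used in our experiments.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import adjusted_rand_score

def clustering_accuracy(y_true, y_pred):
    """Best-match accuracy: optimally map predicted cluster labels to class labels, then score."""
    classes, clusters = np.unique(y_true), np.unique(y_pred)
    # Contingency table: rows = predicted clusters, columns = true classes.
    cont = np.array([[np.sum((y_pred == c) & (y_true == t)) for t in classes]
                     for c in clusters])
    row, col = linear_sum_assignment(-cont)       # maximize the total matched count
    return cont[row, col].sum() / len(y_true)

# ari = adjusted_rand_score(y_true, y_pred)       # Adjusted Rand Index
```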
To provide an intuitive understanding of how different clustering algorithms perform, we visualize their clustering results on a synthetic dataset. This comparison allows us to directly observe how each algorithm partitions the data space:
These visualizations reveal several interesting patterns:
Our experimental evaluation highlights several notable trends in clustering performance across the four benchmark datasets:
We analyzed how different clustering algorithms scale with increasing dataset size:
Traditional K-means variants (KMeans, KMedians, KMedoids) demonstrate remarkable stability across different dataset sizes, maintaining consistently high performance (accuracy > 0.90) even as the dataset grows. This stability suggests these algorithms are well-suited for real-world applications where data volume may vary.
The runtime analysis reveals significant differences in computational efficiency across algorithms:
From the results on the Digits dataset, we observe three distinct scaling patterns:
Based on our comprehensive experimental analysis:
Future research could focus on:
This study presents a systematic investigation of k-means clustering and its variants, offering a comprehensive analysis of their algorithmic properties and practical performance. Through theoretical analysis, we have elucidated the computational complexity and convergence properties of these algorithms, providing a general mathematical framework for understanding their behavior. Our experimental evaluations across diverse datasets validate these theoretical findings while offering practical insights into their strengths and limitations.