Keywords
K-Means Clustering
Description
This study evaluates the performance of the K-Means clustering algorithm across a variety of benchmark datasets, including low-dimensional, high-dimensional, overlapping, and imbalanced data. Using four metrics (Mean Squared Error (MSE), Adjusted Rand Index (ARI), Normalized Mutual Information (NMI), and Silhouette Score), the paper shows that K-Means performs exceptionally well on well-separated and high-dimensional datasets but struggles with overlapping clusters and clusters of varying density. Through visualization and quantitative analysis, the paper highlights both the strengths and the limitations of K-Means in unsupervised learning.
Abstract
Clustering is a fundamental technique in unsupervised machine learning, widely applied in domains such as pattern recognition, data segmentation, and anomaly detection. This study evaluates the performance of the K-Means clustering algorithm on multiple benchmark datasets, including low-dimensional, high-dimensional, and imbalanced datasets. The clustering results are assessed using four evaluation metrics: Mean Squared Error (MSE), Adjusted Rand Index (ARI), Normalized Mutual Information (NMI), and Silhouette Score. Experimental results show that K-Means performs effectively on datasets with well-separated clusters, particularly in high-dimensional spaces, where it achieves near-perfect clustering accuracy. However, its performance deteriorates on datasets with overlapping clusters and varying cluster densities, which highlights its sensitivity to initialization and cluster structure.
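To make the evaluation procedure concrete, the following is a minimal sketch of K-Means (Lloyd's algorithm) scored with two of the four metrics, MSE and ARI, on a toy dataset of two well-separated clusters. It is not the paper's code: the synthetic data, the helper names (`kmeans`, `mean_squared_error`, `adjusted_rand_index`), and the deterministic farthest-point initialization are all assumptions made for illustration, and MSE is taken here to mean the average squared distance from each point to its assigned centroid. NMI and Silhouette Score are omitted to keep the sketch short.

```python
import math
import random
from collections import Counter


def kmeans(points, k, max_iter=100):
    """Lloyd's algorithm with a deterministic farthest-point initialization."""
    # Assumed initialization (not taken from the paper): start from the first
    # point, then repeatedly add the point farthest from the chosen centroids.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j])) for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels, centroids


def mean_squared_error(points, labels, centroids):
    """Average squared distance from each point to its assigned centroid."""
    total = sum(math.dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))
    return total / len(points)


def adjusted_rand_index(true_labels, pred_labels):
    """Pair-counting ARI: 1.0 for identical partitions, near 0 for random ones."""
    def pairs(counts):
        return sum(math.comb(c, 2) for c in counts)

    sum_cells = pairs(Counter(zip(true_labels, pred_labels)).values())
    sum_true = pairs(Counter(true_labels).values())
    sum_pred = pairs(Counter(pred_labels).values())
    expected = sum_true * sum_pred / math.comb(len(true_labels), 2)
    max_index = (sum_true + sum_pred) / 2
    if max_index == expected:
        return 1.0  # degenerate case: both partitions are trivial
    return (sum_cells - expected) / (max_index - expected)


# Hypothetical toy data: two well-separated square clusters of 50 points each.
rng = random.Random(0)
points, truth = [], []
for label, (cx, cy) in enumerate([(0.0, 0.0), (10.0, 10.0)]):
    for _ in range(50):
        points.append((cx + rng.uniform(-1, 1), cy + rng.uniform(-1, 1)))
        truth.append(label)

labels, centroids = kmeans(points, k=2)
mse = mean_squared_error(points, labels, centroids)
ari = adjusted_rand_index(truth, labels)
print(f"MSE = {mse:.3f}, ARI = {ari:.3f}")
```

Because the two toy clusters are far apart relative to their spread, the recovered partition matches the ground truth and the ARI is 1. Moving the cluster centers closer together so that the clusters overlap is one simple way to observe the degradation the abstract describes.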
Course Name
STA 6367 Data Science 2
Instructor Name
Dr. Rui Xie
Rights
This work is licensed under a Creative Commons Attribution 4.0 International License.
College
College of Sciences
STARS Citation
Deb, Dipok, "Clustering Dataset Using K-Mean Clustering" (2025). Data Science and Data Mining. 39.
https://stars.library.ucf.edu/data-science-mining/39