Keywords

K-Means Clustering

Description

This study evaluates the K-Means clustering algorithm on a range of benchmark datasets, including low-dimensional, high-dimensional, overlapping, and imbalanced data. It uses four metrics: Mean Squared Error (MSE), Adjusted Rand Index (ARI), Normalized Mutual Information (NMI), and Silhouette Score. The results show that K-Means performs very well on well-separated and high-dimensional datasets but struggles with overlapping clusters and varying cluster densities. Through visualization and quantitative analysis, the paper highlights both the strengths and the limitations of K-Means in unsupervised learning.

Abstract

Clustering is a fundamental technique in unsupervised machine learning, widely applied in domains such as pattern recognition, data segmentation, and anomaly detection. This study evaluates the performance of the K-Means clustering algorithm on multiple benchmark datasets, including low-dimensional, high-dimensional, and imbalanced datasets. The clustering results are assessed with four evaluation metrics: Mean Squared Error (MSE), Adjusted Rand Index (ARI), Normalized Mutual Information (NMI), and Silhouette Score. Experimental results show that K-Means is effective on datasets with well-separated clusters, particularly in high-dimensional spaces, where it achieves near-perfect clustering accuracy. However, its performance deteriorates on datasets with overlapping clusters and varying cluster densities, which highlights its sensitivity to initialization and cluster structure.
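
The evaluation described in the abstract can be illustrated with a minimal sketch. It assumes scikit-learn is available and uses synthetic `make_blobs` data as a stand-in, not the paper's benchmark datasets. The helper name `evaluate_kmeans`, the cluster centers, the spreads, and the seed are all hypothetical choices made for illustration. The paper's exact MSE definition is not given here, so the sketch assumes it is the K-Means inertia divided by the number of points.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (
    adjusted_rand_score,
    normalized_mutual_info_score,
    silhouette_score,
)


def evaluate_kmeans(X, y_true, n_clusters, seed=0):
    """Fit K-Means and return the four metrics named in the abstract."""
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(X)
    labels = model.labels_
    return {
        # inertia_ is the sum of squared distances to the nearest centroid;
        # dividing by the number of points is an assumed per-point MSE.
        "mse": model.inertia_ / X.shape[0],
        # ARI and NMI compare predicted labels with the known labels.
        "ari": adjusted_rand_score(y_true, labels),
        "nmi": normalized_mutual_info_score(y_true, labels),
        # Silhouette needs only the data and the predicted labels.
        "silhouette": silhouette_score(X, labels),
    }


# Hypothetical 2-D data: same centers, small spread vs. large spread.
centers = [(-10, -10), (0, 0), (10, 10)]
X_sep, y_sep = make_blobs(n_samples=300, centers=centers, cluster_std=0.5, random_state=0)
X_ovl, y_ovl = make_blobs(n_samples=300, centers=centers, cluster_std=6.0, random_state=0)

separated = evaluate_kmeans(X_sep, y_sep, n_clusters=3)
overlapping = evaluate_kmeans(X_ovl, y_ovl, n_clusters=3)

for name, scores in [("well-separated", separated), ("overlapping", overlapping)]:
    print(name, {k: round(v, 3) for k, v in scores.items()})
```

ARI and NMI require ground-truth labels, which benchmark datasets provide; MSE and Silhouette Score can be computed from the data and the predicted clusters alone. On data like this, the well-separated case should score higher on ARI than the overlapping case, which is consistent with the trend the abstract reports.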

Course Name

STA 6367 Data Science 2

Instructor Name

Dr. Rui Xie

Rights

This work is licensed under a Creative Commons Attribution 4.0 International License.

College

College of Sciences

Included in

Data Science Commons
