Data Science K-Means Clustering

1. What Is K-Means?

🔹 K-Means is an unsupervised machine learning algorithm that partitions data into K groups (clusters).
🔹 It groups similar data points together by assigning each point to its nearest centroid (cluster center).
🔹 It is commonly used for customer segmentation, anomaly detection, and image compression.

2. How Does K-Means Work?

1️⃣ Choose K β†’ Decide the number of clusters (K).
2️⃣ Randomly initialize K centroids.
3️⃣ Assign each point to the nearest centroid.
4️⃣ Recalculate centroids (average of all points in a cluster).
5️⃣ Repeat steps 3 & 4 until centroids don’t change.

👉 Mathematical Representation
The objective of K-Means is to minimize the Within-Cluster Sum of Squares (WCSS):

\( WCSS = \sum_{i=1}^{K} \sum_{x \in C_i} \| x - \mu_i \|^2 \)

Where:

  • \( K \) = Number of clusters
  • \( C_i \) = Set of points assigned to cluster \( i \)
  • \( \mu_i \) = Centroid (mean) of cluster \( i \)
  • \( x \) = A data point in cluster \( C_i \)
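
To make the formula concrete, the sketch below computes WCSS by hand and compares it with the `inertia_` attribute of a fitted scikit-learn `KMeans` model. The data set and the variable names (`X_wcss`, `km_wcss`, `wcss_manual`) are assumptions chosen for this example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X_wcss, _ = make_blobs(n_samples=200, centers=3, random_state=0)
km_wcss = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_wcss)

# Sum of squared distances from each point to its own cluster's centroid
wcss_manual = 0.0
for i in range(km_wcss.n_clusters):
    members = X_wcss[km_wcss.labels_ == i]   # points in C_i
    mu_i = km_wcss.cluster_centers_[i]       # centroid of C_i
    wcss_manual += np.sum((members - mu_i) ** 2)

print(wcss_manual, km_wcss.inertia_)
```

The two printed numbers should agree up to floating-point error.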

3. Implementing K-Means in Python

Step 1: Import Required Libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

4. Example 1: K-Means on Synthetic Data

Step 2: Generate Sample Data

# Generate random data with 3 clusters
X, y = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=42)

# Standardize data (optional but recommended)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Plot the standardized data before clustering (no labels yet)
plt.scatter(X_scaled[:, 0], X_scaled[:, 1], s=50)
plt.title("Standardized Data (Unclustered)")
plt.show()
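
As a quick check of what standardization does, the sketch below regenerates the same blobs and confirms that each feature ends up with mean close to 0 and standard deviation close to 1. The names `X_chk` and `X_chk_scaled` are introduced only for this check.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

X_chk, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=42)
X_chk_scaled = StandardScaler().fit_transform(X_chk)

# Per-feature statistics before and after scaling
print("Means before:", X_chk.mean(axis=0))
print("Means after: ", X_chk_scaled.mean(axis=0).round(6))
print("Std after:   ", X_chk_scaled.std(axis=0).round(6))
```

Because K-Means relies on Euclidean distance, putting features on a common scale keeps any single feature from dominating the distance calculation.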

Step 3: Apply K-Means Algorithm

# Apply K-Means with K=3 (n_init=10 runs 10 random initializations and keeps the best)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
y_pred = kmeans.fit_predict(X_scaled)

# Get cluster centers
centroids = kmeans.cluster_centers_
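
Once a model is fitted, `predict` can assign new observations to the nearest learned centroid. The sketch below is self-contained and assumes two hypothetical new points (`X_new`) given in the original, unscaled units. The same fitted scaler has to be applied before predicting.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

X_fit, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=42)
scaler_fit = StandardScaler().fit(X_fit)
km_fit = KMeans(n_clusters=3, n_init=10, random_state=42).fit(scaler_fit.transform(X_fit))

# Hypothetical new observations, in the original (unscaled) units
X_new = np.array([[0.0, 0.0], [5.0, 2.0]])

# Reuse the fitted scaler, then assign each point to its nearest centroid
new_labels = km_fit.predict(scaler_fit.transform(X_new))
print(new_labels)
```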

Step 4: Visualize Clusters

# Plot clustered data
plt.scatter(X_scaled[:, 0], X_scaled[:, 1], c=y_pred, cmap='viridis', s=50)
plt.scatter(centroids[:, 0], centroids[:, 1], c='red', marker='X', s=200, label="Centroids")
plt.title("K-Means Clustering")
plt.legend()
plt.show()

5. Choosing the Best K (Elbow Method)

🔹 How do we determine the best K value?
🔹 The Elbow Method is a common heuristic: plot WCSS against K and look for the point where adding more clusters stops helping much.

Step 5: Elbow Method

wcss = []  # Store Within-Cluster Sum of Squares

# Try different K values (1 to 10)
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=42)  # separate name keeps the Step 3 model intact
    km.fit(X_scaled)
    wcss.append(km.inertia_)  # inertia_ is scikit-learn's name for WCSS

# Plot Elbow Graph
plt.plot(range(1, 11), wcss, marker='o')
plt.xlabel("Number of Clusters (K)")
plt.ylabel("WCSS")
plt.title("Elbow Method for Optimal K")
plt.show()

👉 Look for the 'elbow': the K beyond which WCSS decreases only slowly. That K is usually a good choice.
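
Reading an elbow off a plot can be subjective. One possible cross-check is to compute the silhouette score for several K values and pick the highest. This is a sketch under assumptions: the blob centers are hand-picked for illustration, and the names `X_sel`, `scores`, and `best_k` are introduced only here.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Three well-separated blobs at hand-picked (illustrative) centers
X_sel, _ = make_blobs(n_samples=300, centers=[[0, 0], [10, 10], [-10, 10]],
                      cluster_std=1.0, random_state=42)

scores = {}
for k in range(2, 7):  # the silhouette score needs at least 2 clusters
    labels_k = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_sel)
    scores[k] = silhouette_score(X_sel, labels_k)

# K with the highest average silhouette
best_k = max(scores, key=scores.get)
print(scores)
print("Best K by silhouette:", best_k)
```

On real data the two methods may disagree, so it helps to treat both as guides rather than as a definitive answer.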

 

6. Example 2: K-Means on Real-World Data (Iris Dataset)

Step 6: Load Dataset

from sklearn.datasets import load_iris

# Load Iris dataset
iris = load_iris()
X_iris = iris.data  # Features

# Standardize data
X_iris_scaled = StandardScaler().fit_transform(X_iris)

Step 7: Apply K-Means (K=3)

# Apply K-Means with K=3
kmeans_iris = KMeans(n_clusters=3, n_init=10, random_state=42)
y_iris_pred = kmeans_iris.fit_predict(X_iris_scaled)
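
Centroids learned on standardized data are hard to interpret directly. The sketch below keeps the fitted scaler (unlike the one-liner in Step 6) so the centroids can be mapped back to centimetres with `inverse_transform`. The names `scaler_c`, `km_c`, and `centers_df` are assumptions made for this example.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

iris_c = load_iris()
scaler_c = StandardScaler().fit(iris_c.data)  # keep the fitted scaler for later
km_c = KMeans(n_clusters=3, n_init=10, random_state=42).fit(scaler_c.transform(iris_c.data))

# Map centroids from standardized space back to the original units (cm)
centers_cm = scaler_c.inverse_transform(km_c.cluster_centers_)
centers_df = pd.DataFrame(centers_cm, columns=iris_c.feature_names).round(2)
print(centers_df)
```

Each row is one cluster's "typical" flower, which is often easier to explain than standardized values.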

Step 8: Visualize Clusters (First Two Features)

plt.scatter(X_iris_scaled[:, 0], X_iris_scaled[:, 1], c=y_iris_pred, cmap='viridis', s=50)
plt.scatter(kmeans_iris.cluster_centers_[:, 0], kmeans_iris.cluster_centers_[:, 1], 
            c='red', marker='X', s=200, label="Centroids")
plt.title("K-Means Clustering on Iris Dataset")
plt.legend()
plt.show()

7. Evaluating Clustering Performance

✅ Silhouette Score → Measures how well each point fits its own cluster compared with the nearest other cluster. It ranges from -1 to 1; higher is better.

from sklearn.metrics import silhouette_score

sil_score = silhouette_score(X_iris_scaled, y_iris_pred)
print("Silhouette Score:", sil_score)
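
The single number above is an average over all points. To see which clusters are tight and which are not, `silhouette_samples` returns one value per point. The sketch below is self-contained, and its variable names (`X_s`, `labels_s`, `per_sample`) are introduced only for this example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_samples, silhouette_score
from sklearn.preprocessing import StandardScaler

X_s = StandardScaler().fit_transform(load_iris().data)
labels_s = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X_s)

# One silhouette value per sample, each in [-1, 1]
per_sample = silhouette_samples(X_s, labels_s)

for c in np.unique(labels_s):
    print(f"Cluster {c}: mean silhouette = {per_sample[labels_s == c].mean():.3f}")

# The overall score is the mean of the per-sample values
print("Overall:", silhouette_score(X_s, labels_s))
```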

✅ Comparing with Actual Labels (If Available)

from sklearn.metrics import adjusted_rand_score

# Compare clusters with actual species labels (1.0 = perfect agreement, around 0 = random)
ari = adjusted_rand_score(iris.target, y_iris_pred)
print("Adjusted Rand Index:", ari)
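
A contingency table can show where clusters and species agree or disagree, which a single index cannot. The sketch below uses `pd.crosstab`. It refits the model so the block runs on its own, and the names `iris_t`, `labels_t`, and `table` are assumptions for this example.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

iris_t = load_iris()
labels_t = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(
    StandardScaler().fit_transform(iris_t.data))

# Rows: true species, columns: cluster IDs (cluster numbering is arbitrary)
species = pd.Series(iris_t.target_names[iris_t.target], name="species")
table = pd.crosstab(species, pd.Series(labels_t, name="cluster"))
print(table)
```

A species whose row is spread across two columns is one that K-Means had trouble separating.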

8. Advantages & Disadvantages of K-Means

✅ Advantages

✔ Simple and easy to implement
✔ Scales well to large datasets
✔ Works well when clusters are well separated and roughly spherical

❌ Disadvantages

❌ K must be specified in advance
❌ Sensitive to outliers and to the initial centroid positions
❌ Performs poorly on non-spherical clusters or clusters of very different sizes

Summary

✔ K-Means clusters similar data points into K groups.
✔ Uses centroids to define cluster centers.
✔ The Elbow Method helps find the best K.
✔ Silhouette Score and Adjusted Rand Index can evaluate performance.