Data Science K-Means Clustering

1. What Is K-Means?

🔹 K-Means is an unsupervised machine learning algorithm that partitions data into K groups (clusters).
🔹 It groups similar data points together based on centroids (cluster centers).
🔹 Commonly used in customer segmentation, anomaly detection, image compression, etc.

2. How Does K-Means Work?

1️⃣ Choose K β†’ Decide the number of clusters (K).
2️⃣ Randomly initialize K centroids.
3️⃣ Assign each point to the nearest centroid.
4️⃣ Recalculate centroids (average of all points in a cluster).
5️⃣ Repeat steps 3 & 4 until centroids don’t change.

👉 Mathematical Representation
The objective of K-Means is to minimize the Within-Cluster Sum of Squares (WCSS):

\( WCSS = \sum_{i=1}^{K} \sum_{x \in C_i} \| x - \mu_i \|^2 \)

Where:

  • \( K \) = Number of clusters
  • \( C_i \) = Set of points assigned to cluster \( i \)
  • \( \mu_i \) = Centroid (mean) of cluster \( i \)
  • \( x \) = A data point in cluster \( C_i \)
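
To make the formula concrete, here is a small worked example that computes WCSS by hand. The four points and their cluster labels are made up for illustration and are assumed to be already assigned.

```python
import numpy as np

# Toy data: two clusters of two points each (labels assumed already known)
X_toy = np.array([[1.0, 1.0], [1.0, 3.0], [5.0, 5.0], [7.0, 5.0]])
labels = np.array([0, 0, 1, 1])

wcss = 0.0
for i in np.unique(labels):
    cluster_points = X_toy[labels == i]            # C_i
    mu_i = cluster_points.mean(axis=0)             # centroid of cluster i
    wcss += np.sum((cluster_points - mu_i) ** 2)   # squared distances to mu_i

print(wcss)  # 4.0
```

Cluster 0 has centroid (1, 2) and contributes 1 + 1 = 2. Cluster 1 has centroid (6, 5) and also contributes 2, giving a total of 4.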

3. Implementing K-Means in Python

Step 1: Install and Import Required Libraries

# Install first if needed: pip install numpy pandas matplotlib scikit-learn
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

4. Example 1: K-Means on Synthetic Data

Step 2: Generate Sample Data

# Generate random data with 3 clusters
X, y = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=42)

# Standardize data (optional but recommended)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Plot the standardized data before clustering
plt.scatter(X_scaled[:, 0], X_scaled[:, 1], s=50)
plt.title("Standardized Data (Before Clustering)")
plt.show()

Step 3: Apply K-Means Algorithm

# Apply K-Means with K=3
# (n_init is set explicitly because its default differs between scikit-learn versions)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
y_pred = kmeans.fit_predict(X_scaled)

# Get cluster centers
centroids = kmeans.cluster_centers_

Step 4: Visualize Clusters

# Plot clustered data
plt.scatter(X_scaled[:, 0], X_scaled[:, 1], c=y_pred, cmap='viridis', s=50)
plt.scatter(centroids[:, 0], centroids[:, 1], c='red', marker='X', s=200, label="Centroids")
plt.title("K-Means Clustering")
plt.legend()
plt.show()
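
Once the model is fitted, it can also assign new observations to the existing clusters. The sketch below rebuilds the same synthetic data so that it runs on its own. The two new_points are hypothetical values chosen for illustration, and n_init=10 is set explicitly as an assumption to keep results consistent across scikit-learn versions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=42)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X_scaled)

# Hypothetical new observations in the original (unscaled) feature space
new_points = np.array([[0.0, 0.0], [-5.0, 5.0]])

# Reuse the fitted scaler so new points are on the same scale as the training data
new_points_scaled = scaler.transform(new_points)
new_labels = kmeans.predict(new_points_scaled)
print("Assigned clusters:", new_labels)

# Number of training points in each cluster
print("Cluster sizes:", np.bincount(kmeans.labels_))
```

Note that the new points go through scaler.transform, not fit_transform. Refitting the scaler on two points would put them on a different scale from the data the centroids were learned on.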

5. Choosing the Best K (Elbow Method)

🔹 How do we determine the best K value?
🔹 Use the Elbow Method to find the optimal number of clusters.

Step 5: Elbow Method

wcss = []  # Store Within-Cluster Sum of Squares

# Try different K values (1 to 10)
for k in range(1, 11):
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)
    kmeans.fit(X_scaled)
    wcss.append(kmeans.inertia_)  # Inertia = WCSS

# Plot Elbow Graph
plt.plot(range(1, 11), wcss, marker='o')
plt.xlabel("Number of Clusters (K)")
plt.ylabel("WCSS")
plt.title("Elbow Method for Optimal K")
plt.show()

👉 Look for the 'elbow' point, where adding more clusters stops reducing WCSS sharply. That K is usually a good choice.
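
Reading an elbow off a plot can be subjective. One possible cross-check, sketched here as a complement to the Elbow Method and not a replacement for it, is to pick the K with the highest Silhouette Score. The K range of 2 to 10 and n_init=10 are assumptions for this example.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=42)
X_scaled = StandardScaler().fit_transform(X)

scores = {}
for k in range(2, 11):  # the silhouette score needs at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_scaled)
    scores[k] = silhouette_score(X_scaled, labels)

# Pick the K whose clustering has the highest average silhouette
best_k = max(scores, key=scores.get)
print("Best K by silhouette:", best_k)
```

If the elbow and the silhouette maximum disagree, that is a hint the data has no single clear cluster count.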

 

6. Example 2: K-Means on Real-World Data (Iris Dataset)

Step 6: Load Dataset

from sklearn.datasets import load_iris

# Load Iris dataset
iris = load_iris()
X_iris = iris.data  # Features

# Standardize data
X_iris_scaled = StandardScaler().fit_transform(X_iris)

Step 7: Apply K-Means (K=3)

# Apply K-Means with K=3
kmeans_iris = KMeans(n_clusters=3, n_init=10, random_state=42)
y_iris_pred = kmeans_iris.fit_predict(X_iris_scaled)

Step 8: Visualize Clusters (First Two Features)

plt.scatter(X_iris_scaled[:, 0], X_iris_scaled[:, 1], c=y_iris_pred, cmap='viridis', s=50)
plt.scatter(kmeans_iris.cluster_centers_[:, 0], kmeans_iris.cluster_centers_[:, 1], 
            c='red', marker='X', s=200, label="Centroids")
plt.title("K-Means Clustering on Iris Dataset")
plt.legend()
plt.show()

7. Evaluating Clustering Performance

✅ Silhouette Score → Measures how well each point fits its own cluster compared with the nearest other cluster. It ranges from -1 to 1; higher is better!

from sklearn.metrics import silhouette_score

sil_score = silhouette_score(X_iris_scaled, y_iris_pred)
print("Silhouette Score:", sil_score)
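
The single score above is an average over all points. To see which clusters are tight and which are not, the per-point values can be inspected with silhouette_samples. This is a small sketch that refits the Iris model so it runs on its own; n_init=10 is an assumption.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_samples, silhouette_score
from sklearn.preprocessing import StandardScaler

X_iris_scaled = StandardScaler().fit_transform(load_iris().data)
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X_iris_scaled)

# One silhouette value per data point, each in [-1, 1]
sample_scores = silhouette_samples(X_iris_scaled, labels)

for cluster_id in np.unique(labels):
    cluster_mean = sample_scores[labels == cluster_id].mean()
    print(f"Cluster {cluster_id}: mean silhouette = {cluster_mean:.3f}")

# The overall score is the mean of the per-point values
overall = silhouette_score(X_iris_scaled, labels)
print("Matches overall score:", np.isclose(sample_scores.mean(), overall))
```

A cluster with a noticeably lower mean than the others is a candidate for overlapping with a neighbor.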

✅ Comparing with Actual Labels (If available)

from sklearn.metrics import adjusted_rand_score

# Compare clusters with actual species labels
# (ARI is 1.0 for a perfect match and close to 0 for random labeling)
ari = adjusted_rand_score(iris.target, y_iris_pred)
print("Adjusted Rand Index:", ari)
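
A single number hides which species get mixed up. A cross-tabulation of species against cluster IDs can show that, as in this sketch. It refits the model so it runs on its own, and n_init=10 and the relabeling trick at the end are assumptions added for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X_iris_scaled = StandardScaler().fit_transform(iris.data)
y_iris_pred = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X_iris_scaled)

# Rows: true species, columns: cluster IDs (cluster numbering is arbitrary)
species = iris.target_names[iris.target]
table = pd.crosstab(species, y_iris_pred, rownames=["species"], colnames=["cluster"])
print(table)

# ARI ignores how clusters are numbered: renumbering them leaves the score unchanged
relabeled = (y_iris_pred + 1) % 3
same = np.isclose(adjusted_rand_score(iris.target, y_iris_pred),
                  adjusted_rand_score(iris.target, relabeled))
print("ARI unchanged after relabeling:", same)
```

This is why plain accuracy is not used here: cluster 0 has no reason to correspond to species 0.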

8. Advantages & Disadvantages of K-Means

✅ Advantages

✔ Simple and easy to implement
✔ Scales well to large datasets
✔ Works well when clusters are compact and well-separated

❌ Disadvantages

❌ Requires K to be specified in advance
❌ Sensitive to outliers, because centroids are means
❌ Struggles with non-spherical clusters (e.g. elongated or crescent-shaped)

Summary

✔ K-Means clusters similar data points into K groups.
✔ Uses centroids to define cluster centers.
✔ The Elbow Method helps find the best K.
✔ Silhouette Score and Adjusted Rand Index can evaluate performance.