Data Science K-Means Clustering

🔹 K-Means is an unsupervised machine learning algorithm used for clustering data into K groups.
🔹 It groups similar data points together based on centroids (cluster centers).
🔹 Commonly used in customer segmentation, anomaly detection, image compression, etc.

2. How Does K-Means Work?

1️⃣ Choose K → Decide the number of clusters (K).
2️⃣ Randomly initialize K centroids.
3️⃣ Assign each point to the nearest centroid.
4️⃣ Recalculate centroids (average of all points in a cluster).
5️⃣ Repeat steps 3 & 4 until the centroids stop moving (or a maximum number of iterations is reached).
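The five steps above can be sketched in plain NumPy. This is a minimal illustration, not scikit-learn's implementation: the function name `kmeans_sketch`, its parameters, and the toy data are assumptions made for this example.

```python
import numpy as np

def kmeans_sketch(X, k, n_iters=100, seed=0):
    """Plain K-Means following the five steps listed above."""
    rng = np.random.default_rng(seed)
    # Step 2: pick k distinct data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 3: assign each point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: move each centroid to the mean of its points
        # (an empty cluster keeps its previous centroid)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 5: stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Two well-separated toy groups (hypothetical data)
rng = np.random.default_rng(1)
X_toy = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
labels_toy, centroids_toy = kmeans_sketch(X_toy, k=2)
print(centroids_toy)
```

The sketch starts from randomly chosen data points, so a different `seed` can give a different starting position; that is one reason library implementations usually run several initializations.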

👉 Mathematical Representation
The objective of K-Means is to minimize the Within-Cluster Sum of Squares (WCSS):

\( WCSS = \sum_{i=1}^{K} \sum_{x \in C_i} \| x - \mu_i \|^2 \)

Where:

  • \( K \) = Number of clusters
  • \( C_i \) = Cluster \( i \)
  • \( x \) = A data point in cluster \( C_i \)
  • \( \mu_i \) = Centroid of cluster \( i \)
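To make the formula concrete, here is a small worked example on four hand-picked points. The helper `wcss` and the arrays are hypothetical names used only for this illustration. Cluster 0 holds (1, 1) and (1, 3) with centroid (1, 2); cluster 1 holds (5, 5) and (7, 5) with centroid (6, 5). Each point sits at squared distance 1 from its centroid, so WCSS = 1 + 1 + 1 + 1 = 4.

```python
import numpy as np

def wcss(X, labels, centroids):
    """Sum of squared distances from each point to its assigned centroid."""
    diffs = X - centroids[labels]
    return float(np.sum(diffs ** 2))

X_small = np.array([[1.0, 1.0], [1.0, 3.0], [5.0, 5.0], [7.0, 5.0]])
labels_small = np.array([0, 0, 1, 1])

# Centroid of each cluster = mean of the points assigned to it
centroids_small = np.array([
    X_small[labels_small == j].mean(axis=0) for j in range(2)
])

print(wcss(X_small, labels_small, centroids_small))  # 4.0
```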

3. Implementing K-Means in Python

Step 1: Import Required Libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

4. Example 1: K-Means on Synthetic Data

Step 2: Generate Sample Data

# Generate random data with 3 clusters
X, y = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=42)

# Standardize data (optional but recommended)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Plot the standardized data before clustering
plt.scatter(X_scaled[:, 0], X_scaled[:, 1], s=50)
plt.title("Standardized Data (Before Clustering)")
plt.show()
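Why is standardizing recommended? K-Means relies on distances, so a feature measured in large units can dominate the result. The sketch below uses made-up customer values (income and age are assumptions for illustration, not part of the dataset above) to show how the distances change after scaling.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical customers: [annual income in dollars, age in years]
customers = np.array([[30_000.0, 25.0], [32_000.0, 60.0], [90_000.0, 27.0]])

def dist(a, b):
    """Euclidean distance between two rows."""
    return float(np.linalg.norm(a - b))

# Raw units: the income column swamps the age column
raw_01 = dist(customers[0], customers[1])  # similar income, very different age
raw_02 = dist(customers[0], customers[2])  # similar age, very different income
print("raw:", raw_01, raw_02)

scaled = StandardScaler().fit_transform(customers)
# After scaling, each column has mean 0 and unit variance
scaled_01 = dist(scaled[0], scaled[1])
scaled_02 = dist(scaled[0], scaled[2])
print("scaled:", scaled_01, scaled_02)
```

In raw units the age gap barely registers next to the income gap; after scaling, both features contribute on a comparable footing.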

Step 3: Apply K-Means Algorithm

# Apply K-Means with K=3
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)  # explicit n_init keeps results consistent across scikit-learn versions
y_pred = kmeans.fit_predict(X_scaled)

# Get cluster centers
centroids = kmeans.cluster_centers_
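Once the model is fitted, two follow-up tasks often come up: reading the centroids in the original units, and assigning new points to a cluster. The sketch below rebuilds the data from Step 2 so that it runs on its own; the `new_points` coordinates are made up for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=42)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X_scaled)

# Centroids live in standardized space; map them back to the original units
centroids_original = scaler.inverse_transform(kmeans.cluster_centers_)
print(centroids_original)

# New points must go through the same scaler before predict()
new_points = np.array([[0.0, 0.0], [-7.0, -7.0]])  # hypothetical coordinates
new_labels = kmeans.predict(scaler.transform(new_points))
print(new_labels)
```

Note that the cluster IDs (0, 1, 2) are arbitrary labels; only the grouping itself carries meaning.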

Step 4: Visualize Clusters

# Plot clustered data
plt.scatter(X_scaled[:, 0], X_scaled[:, 1], c=y_pred, cmap='viridis', s=50)
plt.scatter(centroids[:, 0], centroids[:, 1], c='red', marker='X', s=200, label="Centroids")
plt.title("K-Means Clustering")
plt.legend()
plt.show()

5. Choosing the Best K (Elbow Method)

🔹 How do we determine the best K value?
🔹 Use the Elbow Method to find the optimal number of clusters.

Step 5: Elbow Method

wcss = []  # Store Within-Cluster Sum of Squares

# Try different K values (1 to 10)
for k in range(1, 11):
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)
    kmeans.fit(X_scaled)
    wcss.append(kmeans.inertia_)  # Inertia = WCSS

# Plot Elbow Graph
plt.plot(range(1, 11), wcss, marker='o')
plt.xlabel("Number of Clusters (K)")
plt.ylabel("WCSS")
plt.title("Elbow Method for Optimal K")
plt.show()

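👉 Look for the ‘elbow’: the K after which adding more clusters reduces WCSS only slightly. That K is usually a good choice, though the elbow is a visual heuristic and is not always clear-cut.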
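Reading the elbow by eye can be subjective. One possible heuristic (an assumption of this sketch, not a scikit-learn feature) is to rescale both axes to [0, 1], draw a straight line from the first point of the curve to the last, and pick the K whose WCSS lies furthest below that line.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=42)
X_scaled = StandardScaler().fit_transform(X)

ks = np.arange(1, 11)
wcss = np.array([
    KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_scaled).inertia_
    for k in ks
])

# Rescale both axes to [0, 1] so K and WCSS are comparable
k_norm = (ks - ks[0]) / (ks[-1] - ks[0])
w_norm = (wcss - wcss.min()) / (wcss.max() - wcss.min())

# Straight line joining the first and last points of the curve
line = w_norm[0] + (w_norm[-1] - w_norm[0]) * k_norm

# The elbow is taken as the K whose WCSS lies furthest below that line
gap = line - w_norm
elbow_k = int(ks[np.argmax(gap)])
print("Suggested K:", elbow_k)
```

Treat the suggestion as a starting point and check it against the plot; on data without a sharp bend the heuristic can be misleading.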

 

6. Example 2: K-Means on Real-World Data (Iris Dataset)

Step 6: Load Dataset

from sklearn.datasets import load_iris

# Load Iris dataset
iris = load_iris()
X_iris = iris.data  # Features

# Standardize data
X_iris_scaled = StandardScaler().fit_transform(X_iris)

Step 7: Apply K-Means (K=3)

# Apply K-Means with K=3
kmeans_iris = KMeans(n_clusters=3, n_init=10, random_state=42)
y_iris_pred = kmeans_iris.fit_predict(X_iris_scaled)

Step 8: Visualize Clusters (First Two Features)

plt.scatter(X_iris_scaled[:, 0], X_iris_scaled[:, 1], c=y_iris_pred, cmap='viridis', s=50)
plt.scatter(kmeans_iris.cluster_centers_[:, 0], kmeans_iris.cluster_centers_[:, 1], 
            c='red', marker='X', s=200, label="Centroids")
plt.title("K-Means Clustering on Iris Dataset")
plt.legend()
plt.show()

7. Evaluating Clustering Performance

Silhouette Score → Measures how close each point is to its own cluster compared with the nearest other cluster. It ranges from -1 to 1; higher is better.

from sklearn.metrics import silhouette_score

sil_score = silhouette_score(X_iris_scaled, y_iris_pred)
print("Silhouette Score:", sil_score)
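The silhouette score can also be used to compare several values of K, as a complement to the Elbow Method. The sketch below is one way to do that on the standardized Iris features; the range of K values tried is an arbitrary choice.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

X_iris_scaled = StandardScaler().fit_transform(load_iris().data)

scores = {}
for k in range(2, 7):  # the silhouette score needs at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_iris_scaled)
    scores[k] = silhouette_score(X_iris_scaled, labels)

for k, score in scores.items():
    print(f"K={k}: silhouette={score:.3f}")

best_k = max(scores, key=scores.get)
print("Highest silhouette at K =", best_k)
```

The K with the highest silhouette does not have to match the number of known classes in the data, so it is worth reading this alongside the elbow plot and domain knowledge.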

Comparing with Actual Labels (If available)

from sklearn.metrics import adjusted_rand_score

# Compare clusters with actual species labels
rand_score = adjusted_rand_score(iris.target, y_iris_pred)
print("Adjusted Rand Index:", rand_score)
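A cross-tabulation of species against cluster IDs shows where the clusters and the true labels disagree. The sketch below also checks that renaming the clusters leaves the Adjusted Rand Index unchanged, since cluster IDs are arbitrary. The variable names are assumptions for this example.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X_iris_scaled = StandardScaler().fit_transform(iris.data)
y_iris_pred = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X_iris_scaled)

# Rows: true species, columns: cluster IDs
species = pd.Series(iris.target_names[iris.target], name="species")
clusters = pd.Series(y_iris_pred, name="cluster")
table = pd.crosstab(species, clusters)
print(table)

# Renaming the clusters (0 -> 1, 1 -> 2, 2 -> 0) does not change the score
relabeled = (y_iris_pred + 1) % 3
ari = adjusted_rand_score(iris.target, y_iris_pred)
ari_relabeled = adjusted_rand_score(iris.target, relabeled)
print(ari, ari_relabeled)
```

This is why a plain accuracy score is not suitable here: cluster 0 has no built-in link to species 0.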

8. Advantages & Disadvantages of K-Means

✅ Advantages

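🔹 Simple and easy to implement
🔹 Scales well to large datasets
🔹 Works well when clusters are well separated

❌ Disadvantages

🔹 K must be specified in advance
🔹 Sensitive to outliers and to the initial centroid positions
🔹 Struggles with non-spherical clusters or clusters of very different sizes and densities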
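The trouble with non-spherical clusters can be seen on scikit-learn's two-moons toy data, where the two groups are interleaved crescents. This is a small sketch under assumed settings (`noise=0.05`, 300 samples); the exact score will depend on them.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaved crescent-shaped groups with known labels
X_moons, y_moons = make_moons(n_samples=300, noise=0.05, random_state=42)

labels_moons = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X_moons)

# An ARI well below 1 means the clusters do not follow the crescents
ari_moons = adjusted_rand_score(y_moons, labels_moons)
print("ARI on two moons:", round(ari_moons, 3))
```

K-Means separates the points with a straight boundary between the two centroids, so it cuts across both crescents instead of following their shape.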

Summary

🔹 K-Means clusters similar data points into K groups.
🔹 Each cluster is represented by its centroid (the mean of its points).
🔹 The Elbow Method helps find a suitable K.
🔹 Silhouette Score and Adjusted Rand Index can evaluate clustering quality (the latter only when true labels are available).