🔹 K-Means is an unsupervised machine learning algorithm used for clustering data into K groups.
🔹 It groups similar data points together based on centroids (cluster centers).
🔹 Commonly used in customer segmentation, anomaly detection, image compression, etc.
2. How Does K-Means Work?
1️⃣ Choose K → Decide the number of clusters (K).
2️⃣ Randomly initialize K centroids.
3️⃣ Assign each point to the nearest centroid.
4️⃣ Recalculate centroids (average of all points in a cluster).
5️⃣ Repeat steps 3 & 4 until the centroids don't change.
📌 Mathematical Representation
The objective of K-Means is to minimize the Within-Cluster Sum of Squares (WCSS):
\( WCSS = \sum_{i=1}^{K} \sum_{x \in C_i} \| x - \mu_i \|^2 \)
Where:
- \( K \) = Number of clusters
- \( C_i \) = Cluster \( i \)
- \( \mu_i \) = Centroid of cluster \( i \)
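To make steps 3 and 4 (and the WCSS objective) concrete, here is a minimal from-scratch sketch of the K-Means loop in plain NumPy. The function name kmeans_naive and its parameters are illustrative only; it does not handle edge cases such as empty clusters, and the scikit-learn implementation used in the next section is what you would use in practice.

import numpy as np

def kmeans_naive(X, k, max_iter=100, tol=1e-6, seed=42):
    """Illustrative K-Means loop: assign points, recompute centroids, repeat."""
    rng = np.random.default_rng(seed)
    # Step 2: pick K random data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 3: assign each point to its nearest centroid (Euclidean distance)
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 4: recompute each centroid as the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 5: stop when the centroids no longer move (within a small tolerance)
        if np.linalg.norm(new_centroids - centroids) < tol:
            break
        centroids = new_centroids
    wcss = ((X - centroids[labels]) ** 2).sum()  # the WCSS objective defined above
    return labels, centroids, wcss

Calling, for example, kmeans_naive(X_scaled, 3) on the synthetic data generated below returns the labels, centroids, and final WCSS for one run.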
3. Implementing K-Means in Python
Step 1: Install and Import Required Libraries

# If needed: pip install numpy pandas matplotlib scikit-learn
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
4. Example 1: K-Means on Synthetic Data
Step 2: Generate Sample Data
# Generate random data with 3 clusters
X, y = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=42)

# Standardize data (optional but recommended)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Plot raw data
plt.scatter(X_scaled[:, 0], X_scaled[:, 1], s=50)
plt.title("Raw Data")
plt.show()
Step 3: Apply K-Means Algorithm
# Apply K-Means with K=3
# n_init set explicitly to avoid a FutureWarning in newer scikit-learn versions
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
y_pred = kmeans.fit_predict(X_scaled)

# Get cluster centers
centroids = kmeans.cluster_centers_
Step 4: Visualize Clusters
# Plot clustered data
plt.scatter(X_scaled[:, 0], X_scaled[:, 1], c=y_pred, cmap='viridis', s=50)
plt.scatter(centroids[:, 0], centroids[:, 1], c='red', marker='X', s=200, label="Centroids")
plt.title("K-Means Clustering")
plt.legend()
plt.show()
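The fitted model can also assign clusters to new, unseen points via predict. The coordinates below are made-up values on the standardized feature scale, purely for illustration:

# Assign new (unseen) points to the nearest learned centroid
new_points = np.array([[0.0, 0.0], [2.0, 2.0]])  # illustrative coordinates (standardized scale)
print(kmeans.predict(new_points))                 # cluster index for each new point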
5. Choosing the Best K (Elbow Method)
🔹 How do we determine the best K value?
🔹 Use the Elbow Method to find the optimal number of clusters.
Step 5: Elbow Method
wcss = []  # Store Within-Cluster Sum of Squares

# Try different K values (1 to 10)
for k in range(1, 11):
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)
    kmeans.fit(X_scaled)
    wcss.append(kmeans.inertia_)  # Inertia = WCSS

# Plot Elbow Graph
plt.plot(range(1, 11), wcss, marker='o')
plt.xlabel("Number of Clusters (K)")
plt.ylabel("WCSS")
plt.title("Elbow Method for Optimal K")
plt.show()
📌 Find the 'elbow' point where the WCSS curve stops dropping sharply and begins to flatten. That's the best K!
6. Example 2: K-Means on Real-World Data (Iris Dataset)
Step 6: Load Dataset
from sklearn.datasets import load_iris

# Load Iris dataset
iris = load_iris()
X_iris = iris.data  # Features

# Standardize data
X_iris_scaled = StandardScaler().fit_transform(X_iris)
Step 7: Apply K-Means (K=3)
# Apply K-Means with K=3
kmeans_iris = KMeans(n_clusters=3, n_init=10, random_state=42)
y_iris_pred = kmeans_iris.fit_predict(X_iris_scaled)
Step 8: Visualize Clusters (First Two Features)
plt.scatter(X_iris_scaled[:, 0], X_iris_scaled[:, 1], c=y_iris_pred, cmap='viridis', s=50)
plt.scatter(kmeans_iris.cluster_centers_[:, 0], kmeans_iris.cluster_centers_[:, 1],
            c='red', marker='X', s=200, label="Centroids")
plt.title("K-Means Clustering on Iris Dataset")
plt.legend()
plt.show()
7. Evaluating Clustering Performance
✅ Silhouette Score – Measures how well each point fits into its own cluster compared to neighboring clusters (ranges from −1 to 1). Higher is better!
from sklearn.metrics import silhouette_score

sil_score = silhouette_score(X_iris_scaled, y_iris_pred)
print("Silhouette Score:", sil_score)
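As a complementary check (not part of the elbow analysis above), the silhouette score can also be compared across several candidate K values; the range 2–6 below is just an illustrative choice, and the K with the highest score is preferred:

# Compare silhouette scores for several K values (silhouette needs at least 2 clusters)
for k in range(2, 7):
    labels_k = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_iris_scaled)
    print(f"K={k}: silhouette = {silhouette_score(X_iris_scaled, labels_k):.3f}")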
✅ Comparing with Actual Labels (if available)
from sklearn.metrics import adjusted_rand_score

# Compare clusters with actual species labels
rand_score = adjusted_rand_score(iris.target, y_iris_pred)
print("Adjusted Rand Index:", rand_score)
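For a more readable comparison, a contingency table of species versus cluster labels can be built with pandas (imported in Step 1); this is an optional illustration:

# Cross-tabulate true species labels against cluster assignments
print(pd.crosstab(pd.Series(iris.target, name="species"),
                  pd.Series(y_iris_pred, name="cluster")))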
8. Advantages & Disadvantages of K-Means
✅ Advantages
✔ Simple and easy to implement
✔ Scales well to large datasets
✔ Works well if clusters are well-separated
❌ Disadvantages
✖ Needs to specify K in advance
✖ Sensitive to outliers
✖ Fails for non-spherical clusters (see the example below)
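A quick way to see the last limitation is to cluster the two interleaving half-moons from scikit-learn's make_moons. This is an illustrative sketch; the dataset and parameters are chosen only to expose the failure mode (KMeans and plt are already imported above):

from sklearn.datasets import make_moons

# Two crescent-shaped (non-spherical) clusters
X_moons, _ = make_moons(n_samples=300, noise=0.05, random_state=42)

# K-Means assumes roughly spherical clusters, so it splits each crescent incorrectly
moon_labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X_moons)

plt.scatter(X_moons[:, 0], X_moons[:, 1], c=moon_labels, cmap='viridis', s=50)
plt.title("K-Means on Non-Spherical Clusters (Poor Fit)")
plt.show()

Density-based or spectral methods handle this kind of shape better, but that is beyond the scope of these notes.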
Summary
✔ K-Means clusters similar data points into K groups.
✔ Uses centroids to define cluster centers.
✔ The Elbow Method helps find the best K.
✔ Silhouette Score and Adjusted Rand Index can evaluate performance.