Scikit-learn for Machine Learning

Scikit-learn is one of the most widely used libraries in Python for building machine learning models. It provides simple and efficient tools for data analysis and modeling, including classification, regression, clustering, and model evaluation. In this tutorial, we’ll walk through the basics of Scikit-learn and demonstrate how to build machine learning models for various tasks.

1. Installing Scikit-learn

If you haven’t installed Scikit-learn yet, you can install it via pip:

pip install scikit-learn

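To confirm the installation, you can print the installed version from a Python shell:

import sklearn
print(sklearn.__version__)
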
2. Importing Scikit-learn

Before using Scikit-learn, import it into your Python script. You can also import other necessary libraries like NumPy and pandas for data manipulation:

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

3. Loading and Preparing Data

Scikit-learn provides easy access to datasets like the famous Iris dataset. You can load it using the load_iris() function:

from sklearn.datasets import load_iris

# Load the Iris dataset
iris = load_iris()
X = iris.data  # Features
y = iris.target  # Target labels

Alternatively, you can load your own dataset using pandas:

# Load your own data (e.g., CSV file)
df = pd.read_csv('your_data.csv')
X = df.iloc[:, :-1]  # Features (all columns except the last)
y = df.iloc[:, -1]  # Target (last column)

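If you load your own data, it is worth a quick sanity check before modeling. A short sketch, assuming df was loaded as above:

# Quick look at the data before modeling
print(df.head())        # first five rows
df.info()               # column names, dtypes, and non-null counts
print(df.isna().sum())  # missing values per column
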
4. Splitting Data into Training and Testing Sets

Before training a model, you need to split your data into training and testing sets. This ensures that you evaluate the model’s performance on unseen data:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

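A quick check of the resulting shapes confirms the 80/20 split; for the 150-row Iris dataset this works out to 120 training and 30 test samples:

print(X_train.shape, X_test.shape)  # (120, 4) (30, 4) for the Iris data
print(y_train.shape, y_test.shape)  # (120,) (30,)
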
5. Feature Scaling

Many machine learning algorithms, especially those based on distances or gradient descent, perform better when features are on a similar scale. You can standardize the features with StandardScaler, fitting it on the training set only and applying the same transformation to the test set:

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

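As an aside, the scaler and the model can also be chained into a single estimator with a Pipeline, so that scaling is fit on the training data only and applied consistently at prediction time. A minimal sketch, using the Logistic Regression classifier introduced in the next section:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Chain scaling and classification into one estimator
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression()),
])
pipe.fit(X_train, y_train)          # the scaler is fit on the training data only
print(pipe.score(X_test, y_test))   # accuracy on the (unscaled) test features
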
6. Building a Classification Model (Logistic Regression)

In this example, we’ll use Logistic Regression for a classification task: import the model, fit it on the scaled training data, and make predictions on the test set:

from sklearn.linear_model import LogisticRegression

# Create a Logistic Regression model
model = LogisticRegression()

# Train the model
model.fit(X_train_scaled, y_train)

# Make predictions
y_pred = model.predict(X_test_scaled)

7. Evaluating the Model

Once the model has been trained and predictions are made, we need to evaluate the model’s performance. Common evaluation metrics include accuracy, confusion matrix, and classification report:

from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy}')

# Confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred)
print(f'Confusion Matrix: \n{conf_matrix}')

# Classification report
class_report = classification_report(y_test, y_pred)
print(f'Classification Report: \n{class_report}')

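Beyond a single train/test split, cross-validation gives a more stable estimate of performance by averaging scores over several splits. A brief sketch using 5-fold cross-validation on the full dataset, wrapping the scaler and model in a pipeline so scaling is re-fit inside each fold:

from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Re-fit the scaler inside each fold to avoid leaking test information
cv_model = make_pipeline(StandardScaler(), LogisticRegression())

# 5-fold cross-validation accuracy on the full dataset
scores = cross_val_score(cv_model, X, y, cv=5)
print(f'Cross-validation accuracy: {scores.mean():.3f} +/- {scores.std():.3f}')
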
8. Building a Regression Model (Linear Regression)

For a regression task, let’s build a simple Linear Regression model. Note that regression expects a continuous numeric target, so this step assumes y holds continuous values (for example, from your own CSV dataset) rather than the Iris class labels:

from sklearn.linear_model import LinearRegression

# Create a Linear Regression model
reg_model = LinearRegression()

# Train the model
reg_model.fit(X_train_scaled, y_train)

# Make predictions
y_pred_reg = reg_model.predict(X_test_scaled)

9. Evaluating the Regression Model

For regression tasks, we can evaluate the model using metrics like Mean Squared Error (MSE) and R-squared:

from sklearn.metrics import mean_squared_error, r2_score

# Calculate Mean Squared Error
mse = mean_squared_error(y_test, y_pred_reg)
print(f'Mean Squared Error: {mse}')

# Calculate R-squared
r2 = r2_score(y_test, y_pred_reg)
print(f'R-squared: {r2}')

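Because the Iris target holds class labels rather than continuous values, a cleaner end-to-end regression example uses a dataset with a numeric target. A self-contained sketch using Scikit-learn’s built-in Diabetes dataset:

from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# The Diabetes dataset has a continuous target (a disease progression score)
X_d, y_d = load_diabetes(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_d, y_d, test_size=0.2, random_state=42)

reg = LinearRegression()
reg.fit(X_tr, y_tr)
pred = reg.predict(X_te)

print(f'Mean Squared Error: {mean_squared_error(y_te, pred):.2f}')
print(f'R-squared: {r2_score(y_te, pred):.3f}')
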
10. Building a Clustering Model (K-Means)

In addition to classification and regression, Scikit-learn also supports clustering algorithms. Clustering is unsupervised, so it uses only the features and ignores the labels. Let’s use K-Means to group the data points into clusters, choosing 3 clusters to match the three Iris species:

from sklearn.cluster import KMeans

# Create a K-Means model with 3 clusters
kmeans = KMeans(n_clusters=3)

# Train the model
kmeans.fit(X_train_scaled)

# Make predictions
y_pred_kmeans = kmeans.predict(X_test_scaled)

# Cluster centers
print(f'Cluster Centers: \n{kmeans.cluster_centers_}')

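Because clustering has no ground-truth labels to compare against, one common way to judge cluster quality is the silhouette score. A brief sketch evaluating the clusters found on the scaled training data:

from sklearn.metrics import silhouette_score

# Ranges from -1 to 1; higher means tighter, better-separated clusters
score = silhouette_score(X_train_scaled, kmeans.labels_)
print(f'Silhouette Score: {score:.3f}')
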
Conclusion

Scikit-learn is a powerful and versatile library for building machine learning models. With just a few lines of code, you can implement classification, regression, and clustering models, along with tools for evaluating their performance. Whether you’re exploring a small structured dataset or building more complex machine learning workflows, Scikit-learn is an essential tool for data scientists.