Data Science: Introduction to ML

Machine Learning (ML) is a core part of Data Science, enabling computers to learn from data and make predictions without being explicitly programmed. This tutorial introduces ML concepts and includes a simple Python example.

1. What is Machine Learning?

Machine Learning is a branch of Artificial Intelligence (AI) that focuses on creating systems that can learn from data and improve performance over time.

Types of Machine Learning

  1. Supervised Learning
    • The model learns from labeled data (input-output pairs).
    • Example: Spam detection (emails labeled as spam or not).
    • Algorithms: Linear Regression, Decision Trees, Random Forest, SVM, Neural Networks.
  2. Unsupervised Learning
    • The model learns from unlabeled data (finds patterns).
    • Example: Customer segmentation (grouping similar customers).
    • Algorithms: K-Means, DBSCAN, PCA, Hierarchical Clustering.
  3. Reinforcement Learning
    • The model learns by trial and error to maximize rewards.
    • Example: Self-driving cars, AlphaGo playing the board game Go.
    • Algorithms: Q-Learning, Deep Q-Networks (DQN), Policy Gradient.
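
To make the unsupervised case above more concrete, here is a minimal sketch of the K-Means idea on one-dimensional data, written in plain Python rather than with a library. The spending figures and the names `spend`, `centroids`, and `labels` are made up for illustration. Starting the two centroids at the smallest and largest value is an assumption of this sketch, not how library implementations pick starting points.

```python
# Hypothetical monthly spend per customer (no labels are provided).
spend = [10, 12, 11, 95, 100, 98]

# Assumption for this sketch: start the two centroids at the min and max value.
centroids = [float(min(spend)), float(max(spend))]

for _ in range(10):  # a few iterations are plenty for this tiny data set
    # Assignment step: give each point the index of its nearest centroid.
    labels = [min(range(len(centroids)), key=lambda k: abs(x - centroids[k]))
              for x in spend]

    # Update step: move each centroid to the mean of its assigned points.
    new_centroids = []
    for k in range(len(centroids)):
        members = [x for x, lab in zip(spend, labels) if lab == k]
        new_centroids.append(sum(members) / len(members) if members else centroids[k])

    if new_centroids == centroids:  # stop once nothing moves
        break
    centroids = new_centroids

print("Cluster labels:", labels)
print("Centroids:", centroids)
```

The algorithm is never told which customers are "low" or "high" spenders. The two groups emerge only from the distances between the data points.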

2. Machine Learning Workflow

  1. Data Collection → Gather data from sources (CSV, database, APIs).
  2. Data Preprocessing → Clean missing values, normalize, and encode.
  3. Exploratory Data Analysis (EDA) → Understand trends and distributions.
  4. Model Selection → Choose an algorithm suited to the problem and the data (Linear Regression, Decision Trees, etc.).
  5. Training the Model → Fit the model on the training data so it learns patterns.
  6. Model Evaluation → Test the model with unseen data.
  7. Model Deployment → Deploy the model for real-world predictions.
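
The preprocessing step (step 2 above) is often the least obvious one, so here is a small sketch of what "clean missing values, normalize, and encode" can look like. It uses plain Python on made-up records. The column names, the mean-fill strategy, and min-max scaling are assumptions chosen for illustration, and real projects often use library helpers for these steps.

```python
# Hypothetical raw records: house size in sq ft (None = missing) and a city label.
sizes = [500.0, None, 1500.0, 2000.0]
cities = ["A", "B", "A", "C"]

# 1) Clean missing values: replace None with the mean of the observed sizes.
observed = [s for s in sizes if s is not None]
mean_size = sum(observed) / len(observed)
filled = [mean_size if s is None else s for s in sizes]

# 2) Normalize: rescale sizes to the range [0, 1] (min-max scaling).
lo, hi = min(filled), max(filled)
normalized = [(s - lo) / (hi - lo) for s in filled]

# 3) Encode: turn each city label into a one-hot vector.
categories = sorted(set(cities))  # ['A', 'B', 'C']
one_hot = [[1 if c == cat else 0 for cat in categories] for c in cities]

print("Filled sizes:", filled)
print("Normalized sizes:", normalized)
print("One-hot cities:", one_hot)
```

After these steps every feature is numeric and on a comparable scale, which is what most ML algorithms expect as input.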

3. Simple Machine Learning Example in Python

Predict House Prices Using Linear Regression

Problem Statement:
Predict house prices based on the size of the house (sq ft).

Step 1: Import Required Libraries

import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

Step 2: Create Sample Data

# House sizes (square feet)
X = np.array([500, 800, 1000, 1500, 1800, 2000, 2500, 3000, 3500, 4000]).reshape(-1, 1)

# House prices (in $1000s)
y = np.array([150, 200, 250, 300, 350, 400, 450, 500, 550, 600])

Step 3: Split Data into Training & Testing Sets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Step 4: Train the Machine Learning Model

# Create a Linear Regression model
model = LinearRegression()

# Train the model
model.fit(X_train, y_train)
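
To give some intuition for what fitting does here: linear regression looks for the straight line that minimizes the sum of squared differences between predicted and actual prices. With a single feature, that line has a closed-form solution, sketched below in plain Python. This is an illustration of the idea, not a description of scikit-learn's internals. It also uses all ten sample points rather than the training split, so its numbers will differ somewhat from the fitted `model`. The names `sizes`, `prices`, `slope`, and `intercept` are introduced only for this sketch.

```python
# Same sample data as in Step 2 (all ten points, no train/test split here).
sizes = [500, 800, 1000, 1500, 1800, 2000, 2500, 3000, 3500, 4000]
prices = [150, 200, 250, 300, 350, 400, 450, 500, 550, 600]

n = len(sizes)
mean_x = sum(sizes) / n
mean_y = sum(prices) / n

# Ordinary least squares for one feature:
#   slope     = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x) ** 2)
#   intercept = mean_y - slope * mean_x
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(sizes, prices))
         / sum((x - mean_x) ** 2 for x in sizes))
intercept = mean_y - slope * mean_x

print(f"price = {slope:.4f} * size + {intercept:.2f}")
```

On a fitted scikit-learn model, the corresponding values are available as `model.coef_` and `model.intercept_`.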

Step 5: Make Predictions

y_pred = model.predict(X_test)

# Print predicted values
print("Predicted House Prices:", y_pred)

Step 6: Evaluate the Model

mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error:", mse)

# R-squared on the test set (closer to 1 is better).
# Named r2 so it does not shadow sklearn.metrics.r2_score if that is imported later.
r2 = model.score(X_test, y_test)
print("R-squared Score:", r2)

Step 7: Visualize the Regression Line

plt.scatter(X, y, color="blue", label="Actual Prices")
plt.plot(X, model.predict(X), color="red", label="Regression Line")
plt.xlabel("House Size (sq ft)")
plt.ylabel("Price ($1000s)")
plt.title("House Price Prediction")
plt.legend()
plt.show()

4. Understanding the Output

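  • Mean Squared Error (MSE) → The average squared difference between predicted and actual prices (lower is better). Because prices are in $1000s, the MSE is in squared $1000s.
  • R-squared Score → The proportion of the variation in price that the model explains (closer to 1 is better). On test data it can be negative when the model does worse than always predicting the mean.
  • Graph → Shows the actual prices as blue points and the fitted regression line in red.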
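
Both metrics are simple to compute by hand, which can help in reading their values. The sketch below uses three made-up actual and predicted prices (not the output of the model above) and plain Python. The names `y_true` and `y_hat` are chosen only for this illustration.

```python
# Hypothetical actual and predicted prices (in $1000s).
y_true = [200, 300, 400]
y_hat = [210, 290, 400]

n = len(y_true)

# MSE: mean of the squared prediction errors.
mse = sum((t - p) ** 2 for t, p in zip(y_true, y_hat)) / n

# R-squared: 1 - (residual sum of squares / total sum of squares),
# where the total sum of squares measures spread around the mean of y_true.
mean_true = sum(y_true) / n
ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_hat))
ss_tot = sum((t - mean_true) ** 2 for t in y_true)
r2 = 1 - ss_res / ss_tot

print("MSE:", mse)
print("R-squared:", r2)
```

Here the errors are 10, 10, and 0, so the MSE is 200 / 3 (about 66.7) and R-squared is 1 - 200 / 20000 = 0.99.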

Summary

  • Machine Learning helps computers learn patterns from data.
  • Supervised Learning is the most common type (predicting prices, spam detection).
  • Linear Regression is a simple model for predicting continuous values.
  • Scikit-learn (sklearn) is a popular Python library for ML.