Data Science: Handling Missing Data

Missing data is a common issue in real-world datasets. Handling missing data properly is crucial for ensuring the accuracy of your analysis and machine learning models.

1. Identifying Missing Data

Before you can handle missing data, you need to detect it. Pandas provides several useful functions to identify missing values.

1.1 Checking for Missing Values

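  • isnull() and notnull()
    isnull() returns a Boolean DataFrame that is True wherever a value is missing; notnull() returns the inverse. The methods isna() and notna() are aliases for the same checks.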

    import pandas as pd
    
    # Create a sample DataFrame
    data = {
        'Name': ['Alice', 'Bob', 'Charlie', None],
        'Age': [25, None, 30, 22],
        'Salary': [50000, 60000, None, 45000]
    }
    df = pd.DataFrame(data)
    
    # Check for missing values
    print(df.isnull())
    


  • sum() on isnull()
    Because True counts as 1, summing the result of isnull() gives the number of missing entries in each column.

    # Count missing values in each column
    missing_counts = df.isnull().sum()
    print(missing_counts)
    


  • info()
    The info() method provides a summary of the DataFrame, including the number of non-null entries.

    df.info()
    

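The checks above can be combined into a slightly fuller overview. The following is a minimal sketch that rebuilds the same sample DataFrame; the variable names (df_with_age, missing_pct, rows_with_missing) are illustrative and not part of the original tutorial.

```python
import pandas as pd

# Same sample data as in the tutorial
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', None],
    'Age': [25, None, 30, 22],
    'Salary': [50000, 60000, None, 45000],
})

# notnull() is the inverse of isnull(): keep only rows where 'Age' is present
df_with_age = df[df['Age'].notnull()]
print(df_with_age)

# Percentage of missing values per column
# (the mean of a Boolean column is the share of True values)
missing_pct = df.isnull().mean() * 100
print(missing_pct)

# Rows that contain at least one missing value
rows_with_missing = df[df.isnull().any(axis=1)]
print(rows_with_missing)
```

Looking at percentages rather than raw counts tends to make it easier to judge whether dropping or imputing is the more reasonable choice for a given column.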

2. Techniques for Handling Missing Data

There are two primary approaches to handling missing data: removing it or imputing it.

2.1 Removing Missing Data

This approach is straightforward: delete the rows or columns that contain missing values. It is acceptable when only a small fraction of the data is missing and the missing values are not concentrated in a particular group of rows.

a) Removing Rows with Missing Data

# Remove rows with any missing values
df_dropped = df.dropna()
print(df_dropped)


b) Removing Columns with Missing Data

# Remove columns with any missing values
# (in the sample df every column has one, so no columns remain)
df_dropped_cols = df.dropna(axis=1)
print(df_dropped_cols)


Note: Be cautious when dropping data, as it may lead to a loss of valuable information if too much data is removed.
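One way to limit that loss is to make dropna() more selective. The sketch below uses the how, subset, and thresh parameters of DataFrame.dropna on the same sample data; the result variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', None],
    'Age': [25, None, 30, 22],
    'Salary': [50000, 60000, None, 45000],
})

# Drop a row only if every value in it is missing (no such row here)
df_all_missing = df.dropna(how='all')

# Drop rows only where 'Age' is missing; gaps in other columns are tolerated
df_age_required = df.dropna(subset=['Age'])

# Keep rows that have at least 3 non-missing values
df_min_three = df.dropna(thresh=3)

print(len(df_all_missing), len(df_age_required), len(df_min_three))
```

With this sample, how='all' keeps all four rows, subset=['Age'] keeps three, and thresh=3 keeps only the one fully populated row.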

 

2.2 Imputing Missing Data

Imputation involves filling in the missing values with substituted data. Common strategies include using the mean, median, mode, or even more complex methods like interpolation.

a) Imputing with a Constant Value

# Fill missing values with a constant (e.g., 0 or 'Unknown')
df_filled_const = df.fillna({
    'Name': 'Unknown',
    'Age': 0,
    'Salary': 0
})
print(df_filled_const)


b) Imputing with the Mean (For Numerical Data)

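# Fill missing numerical values with the mean of the column.
# Work on a copy so the original df keeps its missing values for later examples,
# and assign the result back: fillna(inplace=True) on a selected column is
# chained assignment and may not update the DataFrame in recent pandas versions.
df_mean = df.copy()
df_mean['Age'] = df_mean['Age'].fillna(df_mean['Age'].mean())
df_mean['Salary'] = df_mean['Salary'].fillna(df_mean['Salary'].mean())
print(df_mean)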

c) Imputing with the Median

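# Fill missing numerical values with the median of the column
# (on a copy, assigning the result back instead of using inplace=True)
df_median = df.copy()
df_median['Age'] = df_median['Age'].fillna(df_median['Age'].median())
df_median['Salary'] = df_median['Salary'].fillna(df_median['Salary'].median())
print(df_median)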

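The choice between mean and median matters most when a column contains outliers, because the median is less sensitive to extreme values. The following sketch uses a small made-up salary series (not from the tutorial's sample data) to show the difference.

```python
import pandas as pd

# Hypothetical salaries with one extreme value and one missing entry
salaries = pd.Series([40000, 42000, 45000, None, 1000000])

# The mean is pulled upward by the outlier
filled_mean = salaries.fillna(salaries.mean())

# The median stays close to the typical values
filled_median = salaries.fillna(salaries.median())

print(filled_mean[3])    # 281750.0
print(filled_median[3])  # 43500.0
```

For skewed columns such as income, the median is therefore often the more representative fill value.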

d) Imputing with the Mode (For Categorical Data)

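# Fill missing categorical values with the mode (the most frequent value).
# mode() returns a Series because ties are possible; [0] takes the first one.
df_mode = df.copy()
df_mode['Name'] = df_mode['Name'].fillna(df_mode['Name'].mode()[0])
print(df_mode)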

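In the sample data every name appears only once, so the mode is simply the first name in sorted order. Mode imputation is more meaningful for a column with repeated categories. The sketch below assumes a hypothetical 'Department' column to illustrate this.

```python
import pandas as pd

# Hypothetical categorical column with repeated values and one gap
df_cat = pd.DataFrame({
    'Department': ['Sales', 'IT', None, 'Sales', 'IT', 'Sales']
})

# mode() ignores missing values and returns the most frequent category first
most_common = df_cat['Department'].mode()[0]

# Assign the filled column back to the DataFrame
df_cat['Department'] = df_cat['Department'].fillna(most_common)
print(df_cat)
```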

e) Using Interpolation (For Time-Series Data)

Interpolation is useful when the data is ordered in time: it estimates each missing value from its neighboring values.

# Create a time-series DataFrame with missing values
time_data = {
    'Date': pd.date_range(start='2025-01-01', periods=5, freq='D'),
    'Value': [10, None, 15, None, 20]
}
df_time = pd.DataFrame(time_data)

# Set 'Date' as the index
df_time = df_time.set_index('Date')

# Interpolate missing values (linear interpolation by default)
df_time_interpolated = df_time.interpolate()
print(df_time_interpolated)

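The default linear method treats the observations as equally spaced. When the timestamps are irregular, interpolate(method='time') weights the estimate by the elapsed time instead; it requires a DatetimeIndex. The sketch below uses made-up dates and values to show the difference.

```python
import pandas as pd

# Irregularly spaced dates: a 1-day gap followed by a 3-day gap
dates = pd.to_datetime(['2025-01-01', '2025-01-02', '2025-01-05'])
s = pd.Series([10.0, None, 18.0], index=dates)

# Default linear interpolation ignores the index and takes the midpoint
linear = s.interpolate()

# Time-based interpolation accounts for the actual spacing of the dates
by_time = s.interpolate(method='time')

print(linear.iloc[1])   # 14.0
print(by_time.iloc[1])  # 12.0
```

The missing point lies one quarter of the way between the two known dates, so the time-based estimate is 10 plus one quarter of the 8-unit difference.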

3. Best Practices for Handling Missing Data

  • Analyze the Pattern: Determine if the missing data is random or follows a pattern. This can influence the method you choose.
  • Understand the Impact: Consider how missing data might affect your analysis and models.
  • Avoid Data Leakage: When imputing missing values, ensure that you do not use future data (in time-series) or information from the validation/test sets.
  • Document Your Process: Keep track of the methods you use to handle missing data for reproducibility and transparency.
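To illustrate the data-leakage point, the following sketch computes the imputation value on the training rows only and reuses it for the test rows. The split by position and the names (data, train, test, train_mean) are assumptions made for this example, not a prescribed workflow.

```python
import pandas as pd

# Hypothetical dataset with missing ages
data = pd.DataFrame({'Age': [25, None, 30, 22, None, 40]})

# Simple positional split: first four rows for training, the rest for testing
train = data.iloc[:4].copy()
test = data.iloc[4:].copy()

# Compute the fill value from the training set only
train_mean = train['Age'].mean()

# Apply the same training-derived value to both sets
train['Age'] = train['Age'].fillna(train_mean)
test['Age'] = test['Age'].fillna(train_mean)

print(train_mean)
print(test)
```

Computing the mean on the full dataset would let the test value of 40 influence the imputed training values, which is the kind of leakage the guideline warns about.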

 

Handling missing data is a critical step in data preprocessing. Whether you choose to remove or impute missing values depends on the context and the amount of missing data in your dataset.