Data Science Pandas for Data Manipulation

Pandas is one of the most popular Python libraries for data manipulation and analysis. It provides flexible, easy-to-use data structures, chiefly the Series (a labeled one-dimensional array) and the DataFrame (a labeled two-dimensional table), which are everyday tools for data scientists. In this tutorial, we will explore the basics of Pandas and how to use it for viewing, modifying, filtering, cleaning, sorting, and grouping data.

1. Installing Pandas

If you haven’t installed Pandas yet, you can install it via pip:

pip install pandas

2. Importing Pandas

Before using Pandas, you need to import it into your Python script. By convention it is imported under the alias pd:

import pandas as pd

3. Creating a Pandas DataFrame

The primary data structure in Pandas is the DataFrame, which is essentially a table or 2D array with labeled axes (rows and columns). Here’s how to create a DataFrame from a dictionary:

data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [25, 30, 35],
        'City': ['New York', 'Los Angeles', 'Chicago']}

df = pd.DataFrame(data)
print(df)

Output:

      Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago
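
To make the "labeled axes" idea concrete, the following is a minimal sketch that rebuilds the same sample DataFrame and inspects its shape, column labels, and row labels. The variable name ages is only illustrative.

```python
import pandas as pd

# Same sample data as in the example above.
data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [25, 30, 35],
        'City': ['New York', 'Los Angeles', 'Chicago']}
df = pd.DataFrame(data)

print(df.shape)          # (3, 3): three rows, three columns
print(list(df.columns))  # ['Name', 'Age', 'City']: the column labels
print(list(df.index))    # [0, 1, 2]: the default row labels

# Each column is a Series that shares the DataFrame's row index.
ages = df['Age']
print(type(ages).__name__)  # Series
```

Because no index was passed, Pandas numbers the rows 0, 1, 2, which is why the printed table above starts each row with those values.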

4. Viewing Data

Once you have a DataFrame, you can view the first few or last few rows using:

  • First few rows: df.head() returns the first 5 rows by default; pass a number, such as df.head(10), to change that.
    print(df.head())  # Output: First 5 rows
      
  • Last few rows: df.tail() returns the last 5 rows by default and accepts a row count in the same way.
    print(df.tail())  # Output: Last 5 rows
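
The three-row sample is too short to show any trimming, so here is a small sketch on a hypothetical 10-row frame (the columns n and square are made up for illustration) that shows the default and an explicit row count.

```python
import pandas as pd

# Hypothetical 10-row frame so head() and tail() have something to trim.
df = pd.DataFrame({'n': range(10), 'square': [i * i for i in range(10)]})

first_five = df.head()   # default: first 5 rows
first_two = df.head(2)   # explicit row count
last_three = df.tail(3)  # last 3 rows

print(first_two)
print(last_three)
```

Both methods return a new DataFrame and leave df itself unchanged.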
      

5. Accessing Data in DataFrame

DataFrames allow you to access data by column, row, or by both row and column:

  • Accessing a single column: Use df['column_name']
    print(df['Name'])  # Output: Column 'Name'
      
  • Accessing multiple columns: Use df[['col1', 'col2']]
    print(df[['Name', 'City']])
      
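  • Accessing a row by position: Use df.iloc[row_position]. iloc counts rows from 0 regardless of their labels; use df.loc[row_label] to select by index label instead.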
    print(df.iloc[1])  # Output: Second row (index 1)
      
  • Accessing a specific cell: Use df.at[row_label, 'column_name']. With the default index, the row label is the same as the row number.
    print(df.at[1, 'Age'])  # Output: 30 (second row, Age column)
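
The difference between position-based and label-based access only becomes visible when the index is not the default 0, 1, 2. The sketch below assumes a hypothetical string index ('a', 'b', 'c') purely to illustrate that difference.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'],
                   'Age': [25, 30, 35]},
                  index=['a', 'b', 'c'])  # hypothetical string labels

by_position = df.iloc[1]           # second row, whatever its label is
by_label = df.loc['b']             # row whose index label is 'b'
cell_by_label = df.at['b', 'Age']  # label-based access to a single value
cell_by_position = df.iat[1, 1]    # position-based access to a single value

print(by_position['Name'], cell_by_label)
```

Here df.iloc[1] and df.loc['b'] return the same row, while df.loc[1] would raise a KeyError because no row carries the label 1.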
      

6. Modifying Data

To modify or update data in a DataFrame, you can use the following methods:

  • Adding a new column: Assign a value to a new column name.
    df['Country'] = ['USA', 'USA', 'USA']
    print(df)
      
  • Updating an existing column: Assign a new value to an existing column.
    df['Age'] = df['Age'] + 1
    print(df)
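
Beyond assigning a full list, two other assignment patterns come up often: broadcasting a single value and updating only the rows that match a condition. This is a minimal sketch; the AgeInMonths column and the new value for Bob are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'],
                   'Age': [25, 30, 35]})

# A single value is broadcast to every row.
df['Country'] = 'USA'

# A new column can be derived from an existing one.
df['AgeInMonths'] = df['Age'] * 12

# Update only the matching rows with .loc[row_mask, column].
df.loc[df['Name'] == 'Bob', 'Age'] = 31

print(df)
```

Using .loc with a boolean mask changes the DataFrame directly, which avoids the ambiguity of chained indexing such as df[mask]['Age'] = 31.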
      

7. Filtering Data

Filtering rows based on conditions is a common operation in data manipulation:

  • Filter by condition: You can filter rows based on a condition applied to a column.
    over_30 = df[df['Age'] > 30]  # Keep only rows where Age is greater than 30
    print(over_30)
      
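  • Multiple conditions: Use & (and) or | (or) to combine conditions. Wrap each condition in parentheses, because & and | bind more tightly than comparison operators such as > and ==.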
    result = df[(df['Age'] > 30) & (df['City'] == 'Chicago')]
    print(result)
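
The same boolean-mask approach extends to "or", negation, and membership tests. The following sketch reuses the original sample data; the result variable names are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'],
                   'Age': [25, 30, 35],
                   'City': ['New York', 'Los Angeles', 'Chicago']})

# OR: a row is kept if either condition holds.
young_or_chicago = df[(df['Age'] < 30) | (df['City'] == 'Chicago')]

# NOT: ~ inverts a boolean mask.
not_chicago = df[~(df['City'] == 'Chicago')]

# isin() tests membership in a list of allowed values.
coastal = df[df['City'].isin(['New York', 'Los Angeles'])]

print(young_or_chicago)
print(not_chicago)
print(coastal)
```

Python's own and, or, and not keywords do not work on whole columns; the element-wise operators &, |, and ~ are needed instead.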
      

8. Data Cleaning

Cleaning data is crucial in any data manipulation process. Pandas provides several functions to handle missing data:

  • Checking for missing data: Use df.isnull() to identify missing values.
    print(df.isnull())  # Output: Boolean DataFrame indicating missing values
      
  • Filling missing data: You can fill missing values with a specified value.
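    df['Age'] = df['Age'].fillna(30)  # Fill missing ages with 30 and assign the result back (inplace=True on a selected column is deprecated in recent Pandas versions)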
      
  • Dropping missing data: Use df.dropna() to remove rows with missing values.
    df.dropna(inplace=True)  # Drop rows with missing values
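
The sample DataFrame used so far has no missing values, so the calls above have nothing to act on. The sketch below builds a hypothetical frame with gaps (the extra row for Dana and the missing entries are invented) to show what each step does.

```python
import pandas as pd

nan = float('nan')  # stand-in for a missing number

# Hypothetical data with missing ages and a missing city.
df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie', 'Dana'],
                   'Age': [25, nan, 35, nan],
                   'City': ['New York', 'Los Angeles', None, 'Boston']})

# Count missing values per column.
missing_per_column = df.isnull().sum()

# Fill missing ages with the mean of the known ages, on a copy.
filled = df.copy()
filled['Age'] = filled['Age'].fillna(filled['Age'].mean())

# Drop rows with any missing value, or only rows missing a City.
complete_rows = df.dropna()
rows_with_city = df.dropna(subset=['City'])

print(missing_per_column)
print(filled)
```

Filling with the column mean is only one possible choice; whether to fill, and with what, depends on the data and the analysis.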
      

9. Sorting Data

Sorting data by column values is a common operation:

  • Sort by a single column: Use df.sort_values(by='column_name')
    df_sorted = df.sort_values(by='Age')
    print(df_sorted)
      
  • Sort by multiple columns: You can sort by multiple columns by passing a list.
    df_sorted = df.sort_values(by=['City', 'Age'])
    print(df_sorted)
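
sort_values sorts in ascending order by default. As a sketch of the other common options, the example below uses a hypothetical four-row frame to show descending order, a different direction per column, and renumbering the rows afterwards.

```python
import pandas as pd

# Hypothetical data with two people per city.
df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie', 'Dana'],
                   'Age': [25, 30, 35, 30],
                   'City': ['New York', 'Chicago', 'Chicago', 'New York']})

# Descending order on one column.
oldest_first = df.sort_values(by='Age', ascending=False)

# City A to Z, then oldest first within each city.
mixed = df.sort_values(by=['City', 'Age'], ascending=[True, False])

# Sorting keeps the original row labels; reset_index renumbers them.
renumbered = mixed.reset_index(drop=True)

print(mixed)
print(renumbered)
```

sort_values returns a new DataFrame, so df keeps its original order unless the result is assigned back.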
      

10. Grouping Data

Grouping is useful for aggregating data based on certain columns:

  • Group by a column: Use df.groupby('column_name') to group data by a specific column.
    grouped = df.groupby('City')['Age'].mean()  # Group by 'City' and calculate the mean of the numeric 'Age' column (calling mean() on text columns raises an error in Pandas 2.0 and later)
    print(grouped)
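
With one row per city, grouping the sample data just returns each person's age. The sketch below uses a hypothetical frame with two people per city to show a per-group mean and several aggregations at once.

```python
import pandas as pd

# Hypothetical data with two people per city.
df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie', 'Dana'],
                   'Age': [25, 30, 36, 45],
                   'City': ['New York', 'Chicago', 'Chicago', 'New York']})

# Mean age per city; selecting 'Age' first keeps text columns out of the mean.
mean_age = df.groupby('City')['Age'].mean()

# Several aggregations at once, one result column per function.
summary = df.groupby('City')['Age'].agg(['count', 'mean', 'max'])

# Number of rows in each group.
group_sizes = df.groupby('City').size()

print(mean_age)
print(summary)
```

The result is indexed by the grouping column, so summary.loc['Chicago'] returns the statistics for that city.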
      

Conclusion

Pandas is an essential library for data scientists when it comes to data manipulation. With its tools for cleaning, filtering, sorting, and grouping data, you can efficiently process datasets and prepare them for analysis or machine learning.