Python Pandas Basics: A Beginner’s Guide

Pandas is an open-source Python library that provides high-performance data manipulation and analysis tools. It is widely used in data science and machine learning for handling structured data. In this tutorial, we will explore the basics of Pandas, including how to create DataFrames, manipulate data, handle missing values, and perform common operations.

1. Installing Pandas

If you haven’t installed Pandas yet, you can install it using pip, the Python package manager. Run the following command:

pip install pandas


2. Importing Pandas

Once Pandas is installed, you can import it into your Python script. It is commonly imported with the alias pd, as follows:

import pandas as pd

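A quick way to confirm that the import worked is to print the installed version. This is only a sanity check; the exact version string depends on your installation.

```python
import pandas as pd

# __version__ is a plain string; its value differs from machine to machine
version = pd.__version__
print(version)
```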

3. Creating a DataFrame

A DataFrame is a two-dimensional data structure in Pandas, similar to a table or a spreadsheet, with rows and columns. Here are some ways to create DataFrames:

3.1 Creating a DataFrame from a Dictionary

data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [25, 30, 35],
        'City': ['New York', 'Los Angeles', 'Chicago']}
df = pd.DataFrame(data)
print(df)


3.2 Creating a DataFrame from a List of Lists

data = [['Alice', 25, 'New York'],
        ['Bob', 30, 'Los Angeles'],
        ['Charlie', 35, 'Chicago']]
df = pd.DataFrame(data, columns=['Name', 'Age', 'City'])
print(df)

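After creating a DataFrame, it often helps to inspect its shape and column types before working with it. The sketch below rebuilds the same sample data and looks at a few basic attributes.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'],
                   'Age': [25, 30, 35],
                   'City': ['New York', 'Los Angeles', 'Chicago']})

shape = df.shape            # (number of rows, number of columns)
columns = list(df.columns)  # column labels, in order

print(shape)        # (3, 3)
print(columns)      # ['Name', 'Age', 'City']
print(df.dtypes)    # data type of each column
print(df.head(2))   # first two rows
```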

4. Accessing Data in a DataFrame

Once you have a DataFrame, you can access its data using various methods:

4.1 Accessing Columns

# Accessing a single column
print(df['Name'])

# Accessing multiple columns
print(df[['Name', 'Age']])


4.2 Accessing Rows

# Accessing a single row by integer position
print(df.iloc[1])  # Second row (positions start at 0)

# Accessing rows using a condition
print(df[df['Age'] > 30])  # Filter rows where Age > 30

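Conditions can also be combined, and iloc accepts slices as well as single positions. A minimal sketch using the same sample data (the variable names are only illustrative):

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'],
                   'Age': [25, 30, 35],
                   'City': ['New York', 'Los Angeles', 'Chicago']})

# Combine conditions with & (and) or | (or); wrap each condition in parentheses
older_not_chicago = df[(df['Age'] > 25) & (df['City'] != 'Chicago')]
print(older_not_chicago)  # only Bob's row matches both conditions

# A slice with iloc returns several rows by position (end position excluded)
first_two = df.iloc[0:2]
print(first_two)
```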

4.3 Accessing Specific Data (Row and Column)

# Accessing a specific cell by row label and column name
print(df.at[1, 'City'])  # Row labeled 1 (the second row with the default index), 'City' column

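The difference between label-based and position-based access is easier to see when the row labels are not the default 0, 1, 2. The sketch below assumes a hypothetical index of 'a', 'b', 'c' purely for illustration.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'],
                   'Age': [25, 30, 35],
                   'City': ['New York', 'Los Angeles', 'Chicago']},
                  index=['a', 'b', 'c'])  # illustrative row labels

city_by_label = df.at['b', 'City']   # at: row label and column name
city_by_position = df.iat[1, 2]      # iat: row position and column position

print(city_by_label)     # Los Angeles
print(city_by_position)  # Los Angeles

# loc selects by label and can return several columns at once
print(df.loc['b', ['Name', 'Age']])
```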

5. Basic DataFrame Operations

Pandas allows you to perform a variety of operations on your DataFrame, such as sorting, filtering, and aggregating data:

5.1 Sorting Data

# Sorting by a specific column
print(df.sort_values('Age'))  # Sort by 'Age' column

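sort_values returns a new DataFrame and leaves the original unchanged. The following sketch shows descending order and sorting by more than one column.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'],
                   'Age': [25, 30, 35],
                   'City': ['New York', 'Los Angeles', 'Chicago']})

# Sort from oldest to youngest
by_age_desc = df.sort_values('Age', ascending=False)
print(by_age_desc)

# Sort by City first, then by Age within each city
by_city_then_age = df.sort_values(['City', 'Age'])
print(by_city_then_age)

# The original DataFrame keeps its order
print(df)
```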

5.2 Filtering Data

# Filter rows where 'Age' is greater than 30
print(df[df['Age'] > 30])


5.3 Aggregating Data

# Get the average of the 'Age' column
print(df['Age'].mean())  # Output: 30.0

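Several aggregates can be computed in one call with agg. This is a small sketch on the same sample data; describe() gives a broader summary of the numeric columns.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'],
                   'Age': [25, 30, 35],
                   'City': ['New York', 'Los Angeles', 'Chicago']})

# agg with a list of function names returns a Series indexed by those names
summary = df['Age'].agg(['min', 'max', 'mean'])
print(summary)

total_age = df['Age'].sum()
print(total_age)  # 90

# Count, mean, standard deviation, and quartiles for numeric columns
print(df.describe())
```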

6. Handling Missing Data

It’s common to encounter missing values in datasets. Pandas provides methods to detect, fill, and drop them. Note that the sample DataFrame built earlier has no missing values, so these calls leave it unchanged:

6.1 Detecting Missing Data

# Check for missing values (returns a DataFrame of True/False flags)
print(df.isnull())

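To see detection in action, the sketch below builds a variant of the sample data with two gaps. The name df_missing and the placement of the gaps are made up for illustration.

```python
import pandas as pd

# None is stored as a missing value (NaN in the numeric Age column)
df_missing = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'],
                           'Age': [25, None, 35],
                           'City': ['New York', 'Los Angeles', None]})

print(df_missing.isnull())  # True wherever a value is missing

# Summing the True/False flags counts the missing values per column
missing_per_column = df_missing.isnull().sum()
print(missing_per_column)
```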

6.2 Filling Missing Data

# Filling missing values with a specific value
df['Age'] = df['Age'].fillna(0)


6.3 Dropping Missing Data

# Dropping rows with missing values
df = df.dropna()

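Filling and dropping lead to different results, so it is worth comparing them on data that actually has gaps. A minimal sketch, assuming an illustrative df_missing with one missing Age and one missing City:

```python
import pandas as pd

df_missing = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'],
                           'Age': [25, None, 35],
                           'City': ['New York', 'Los Angeles', None]})

# Fill each column with its own replacement value
filled = df_missing.fillna({'Age': 0, 'City': 'Unknown'})
print(filled)

# Drop every row that has at least one missing value
dropped = df_missing.dropna()
print(dropped)  # only Alice's row remains

# Drop rows only when a specific column is missing
dropped_age_only = df_missing.dropna(subset=['Age'])
print(dropped_age_only)  # Alice and Charlie remain
```

Both methods return new DataFrames; df_missing itself is not modified unless you assign the result back.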

7. Combining DataFrames

Sometimes you may need to combine multiple DataFrames. Pandas provides functions like concat() and merge() to combine data:

7.1 Concatenating DataFrames

df2 = pd.DataFrame({'Name': ['David'], 'Age': [40], 'City': ['Miami']})
df_combined = pd.concat([df, df2])
print(df_combined)

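By default, concat keeps each DataFrame's original row labels, so the combined result can contain duplicate index values. Passing ignore_index=True renumbers the rows, as this sketch shows:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'],
                   'Age': [25, 30, 35],
                   'City': ['New York', 'Los Angeles', 'Chicago']})
df2 = pd.DataFrame({'Name': ['David'], 'Age': [40], 'City': ['Miami']})

# Original labels are kept: David's row is labeled 0, like Alice's
kept = pd.concat([df, df2])
print(list(kept.index))        # [0, 1, 2, 0]

# ignore_index=True assigns fresh labels 0..n-1
renumbered = pd.concat([df, df2], ignore_index=True)
print(list(renumbered.index))  # [0, 1, 2, 3]
```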

7.2 Merging DataFrames

df3 = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Salary': [50000, 60000]})
df_merged = pd.merge(df, df3, on='Name')  # Inner join by default: Charlie has no match in df3 and is dropped
print(df_merged)

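merge performs an inner join unless told otherwise, which keeps only the names present in both DataFrames. The how parameter changes that; the sketch below compares the default with a left join.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'],
                   'Age': [25, 30, 35],
                   'City': ['New York', 'Los Angeles', 'Chicago']})
df3 = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Salary': [50000, 60000]})

# Inner join (default): only names found in both DataFrames
inner = pd.merge(df, df3, on='Name')
print(inner)

# Left join: every row of df is kept; Charlie's Salary is missing (NaN)
left = pd.merge(df, df3, on='Name', how='left')
print(left)
```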

8. Saving DataFrames to Files

Once you’ve manipulated your data, you may want to save it to a file. Pandas supports various formats, including CSV, Excel, and SQL:

8.1 Saving to CSV

df.to_csv('data.csv', index=False)  # Save DataFrame to a CSV file

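A saved CSV can be loaded back with read_csv. To keep this sketch self-contained it writes to an in-memory text buffer instead of a file on disk; passing a path such as 'data.csv' works the same way.

```python
import io

import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'],
                   'Age': [25, 30, 35],
                   'City': ['New York', 'Los Angeles', 'Chicago']})

# Write the DataFrame as CSV text; index=False leaves out the row labels
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)

# Rewind the buffer and read the CSV back into a new DataFrame
buffer.seek(0)
df_loaded = pd.read_csv(buffer)
print(df_loaded)
```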

8.2 Saving to Excel

df.to_excel('data.xlsx', index=False)  # Save DataFrame to an Excel file (requires an Excel engine such as openpyxl)


Conclusion

In this tutorial, we covered the basics of working with Pandas, including creating and accessing DataFrames, performing basic operations, handling missing data, combining DataFrames, and saving your data to files. Pandas is an extremely powerful and versatile tool for data manipulation and analysis in Python, and it forms the backbone of many data science workflows.