Choosing a visualization starts with the type of data being shown, because the data type determines which charts can represent it faithfully and how the result should be analyzed and interpreted. In this tutorial, we will explore the main data types and the visualization techniques best suited to each.
1. Types of Data in Data Science
Data in Data Science is broadly classified into:
1.1. Qualitative (Categorical) Data
Categorical data represents discrete groups or labels that do not have numerical meaning.
a) Nominal Data
- Categories with no inherent order (e.g., colors, gender, countries).
- Best Visualization Types:
- Bar Chart
- Pie Chart
- Count Plot
Example: Visualizing Gender Distribution with a Bar Chart

    import matplotlib.pyplot as plt
    import seaborn as sns
    import pandas as pd

    # Sample categorical data
    data = pd.DataFrame({'Gender': ['Male', 'Female', 'Female', 'Male', 'Male', 'Female']})

    # Count plot
    sns.countplot(x='Gender', data=data)
    plt.title("Gender Distribution")
    plt.show()

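The list above also names the pie chart, which encodes each category's share of the total rather than its raw count. A minimal sketch, assuming the same sample `Gender` column; the `counts` and `shares` names, the off-screen `Agg` backend, and the output file name are illustrative choices, not part of the original example:

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # render off-screen; remove to use an interactive backend
import matplotlib.pyplot as plt
import pandas as pd

# Same sample categorical data as in the count plot example
data = pd.DataFrame({'Gender': ['Male', 'Female', 'Female', 'Male', 'Male', 'Female']})

# Count each category, then convert the counts to shares of the total
counts = data['Gender'].value_counts()
shares = counts / counts.sum()

# A pie chart draws one wedge per category, sized by its share
fig, ax = plt.subplots()
ax.pie(counts.to_numpy(), labels=list(counts.index), autopct='%1.0f%%')
ax.set_title("Gender Distribution")

# Hypothetical output location; call plt.show() instead when working interactively
out_path = os.path.join(tempfile.gettempdir(), "gender_pie.png")
fig.savefig(out_path)
```

Pie charts tend to read well only with a handful of categories; with many categories, a bar chart of the same counts is usually easier to compare.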
b) Ordinal Data
- Categories with a meaningful order but no fixed or measurable distance between levels (e.g., low/medium/high satisfaction levels).
- Best Visualization Types:
- Bar Chart (ordered)
- Histogram (when the levels are encoded as ordered integers)
- Box Plot
Example: Visualizing Education Level with a Bar Chart

    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt

    # Sample ordinal data
    education_levels = pd.DataFrame({'Education': ['High School', 'Bachelor', 'Master', 'PhD', 'Bachelor', 'Master']})

    # Count plot with sorted order
    order = ['High School', 'Bachelor', 'Master', 'PhD']
    sns.countplot(x='Education', data=education_levels, order=order)
    plt.title("Education Level Distribution")
    plt.show()

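The `order` list in the example above only affects the plot. One way to store the ranking in the data itself is pandas' ordered categorical dtype, so that sorting and comparisons respect it too. A sketch under that assumption; the `levels`, `counts`, and `at_least_master` names are illustrative:

```python
import pandas as pd

order = ['High School', 'Bachelor', 'Master', 'PhD']
education = pd.Series(['High School', 'Bachelor', 'Master', 'PhD', 'Bachelor', 'Master'])

# Attach the ranking to the column as an ordered categorical dtype
levels = education.astype(pd.CategoricalDtype(categories=order, ordered=True))

# Counts listed by rank (lowest level first) rather than by frequency
counts = levels.value_counts().sort_index()

# Order-aware operations now work: min/max and comparisons against a level
lowest, highest = levels.min(), levels.max()
at_least_master = int((levels >= 'Master').sum())

print(counts)
print(lowest, highest, at_least_master)
```

With the dtype in place, arithmetic such as a mean is still not defined on the column, which matches the idea that ordinal levels have rank but no measurable spacing.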
1.2. Quantitative (Numerical) Data
Numerical data represents measurable quantities and can be analyzed mathematically.
a) Discrete Data
- Countable values (e.g., number of students, cars in a parking lot).
- Best Visualization Types:
- Bar Chart
- Histogram
- Dot Plot
Example: Visualizing Number of Students in Different Classes

    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt

    # Sample discrete data
    data = pd.DataFrame({'Class': ['A', 'B', 'C', 'D'], 'Students': [30, 45, 25, 40]})

    # Bar chart
    sns.barplot(x='Class', y='Students', data=data)
    plt.title("Number of Students in Each Class")
    plt.show()

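The list above also names the histogram. For discrete values, a common approach is to center one bin on each integer so that no two values share a bar. A minimal sketch, assuming made-up data (`cars_per_household` is a hypothetical sample, not from the original example):

```python
import numpy as np

# Hypothetical discrete sample: number of cars owned by ten households
cars_per_household = np.array([0, 1, 1, 2, 2, 2, 3, 1, 0, 2])

low, high = int(cars_per_household.min()), int(cars_per_household.max())

# Bin edges at -0.5, 0.5, 1.5, ... so each bin is centered on one integer
edges = np.arange(low - 0.5, high + 1.5)
counts, _ = np.histogram(cars_per_household, bins=edges)

for value, count in zip(range(low, high + 1), counts):
    print(f"{value} cars: {count} households")
```

The same `edges` array can be passed as `bins=edges` to `plt.hist` to draw the chart.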
b) Continuous Data
- Any value within a range (e.g., height, weight, temperature).
- Best Visualization Types:
- Histogram
- Line Plot
- Box Plot
- Scatter Plot
Example: Visualizing Height Distribution with a Histogram

    import numpy as np
    import matplotlib.pyplot as plt

    # Generate random continuous data
    heights = np.random.normal(loc=170, scale=10, size=1000)

    # Histogram
    plt.hist(heights, bins=20, edgecolor='black')
    plt.title("Height Distribution")
    plt.xlabel("Height (cm)")
    plt.ylabel("Frequency")
    plt.show()

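A box plot, also listed above, condenses the same kind of distribution into a few numbers. The sketch below computes them with NumPy to show what the plot would draw. The fixed seed is an assumption added for repeatability, and the 1.5 × IQR whisker rule is the default used by Matplotlib's `plt.boxplot`:

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed (an assumption) for repeatable output
heights = rng.normal(loc=170, scale=10, size=1000)

# The box spans the first to third quartile, with a line at the median
q1, median, q3 = np.percentile(heights, [25, 50, 75])
iqr = q3 - q1

# Whiskers reach the most extreme points within 1.5 * IQR of the box
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
inside = heights[(heights >= lower_fence) & (heights <= upper_fence)]
outliers = heights[(heights < lower_fence) | (heights > upper_fence)]

print(f"box: {q1:.1f} to {q3:.1f}, median {median:.1f}")
print(f"whiskers: {inside.min():.1f} to {inside.max():.1f}")
print(f"points drawn individually: {outliers.size}")
```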
2. Mixed Data Types in Visualization
Sometimes, we need to visualize relationships between different types of data.
2.1. Categorical vs. Numerical Data
Best Visualization Types:
- Box Plot (to compare numerical distributions across categories)
- Violin Plot (to see the density of numerical values across categories)
Example: Visualizing Salary Distribution by Job Title with a Box Plot

    import seaborn as sns
    import pandas as pd
    import matplotlib.pyplot as plt

    # Sample mixed data
    data = pd.DataFrame({
        'Job': ['Engineer', 'Doctor', 'Teacher', 'Engineer', 'Doctor', 'Teacher'],
        'Salary': [70000, 120000, 50000, 75000, 130000, 52000]
    })

    # Box plot
    sns.boxplot(x='Job', y='Salary', data=data)
    plt.title("Salary Distribution by Job Title")
    plt.show()

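A box plot summarizes each group's distribution, and the same comparison can be checked numerically by grouping on the categorical column. A sketch using the sample data from the example above (`summary` is an illustrative name):

```python
import pandas as pd

# Same sample mixed data as in the box plot example
data = pd.DataFrame({
    'Job': ['Engineer', 'Doctor', 'Teacher', 'Engineer', 'Doctor', 'Teacher'],
    'Salary': [70000, 120000, 50000, 75000, 130000, 52000]
})

# One row per job title: the range and center a box plot would show per group
summary = data.groupby('Job')['Salary'].agg(['min', 'median', 'max'])
print(summary)
```

A violin plot takes the same arguments, as in `sns.violinplot(x='Job', y='Salary', data=data)`. With only two observations per group, as in this sample, its density estimate would not be meaningful, so a larger dataset is assumed for that chart.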
2.2. Numerical vs. Numerical Data
Best Visualization Types:
- Scatter Plot (for relationships between two continuous variables)
- Line Chart (for trends over time)
- Heatmap (for correlation between multiple numerical variables)
Example: Visualizing the Relationship Between Age and Salary with a Scatter Plot

    import numpy as np
    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt

    # Generate sample data
    np.random.seed(10)
    data = pd.DataFrame({
        'Age': np.random.randint(20, 60, 100),
        'Salary': np.random.randint(30000, 100000, 100)
    })

    # Scatter plot
    sns.scatterplot(x='Age', y='Salary', data=data)
    plt.title("Age vs. Salary Relationship")
    plt.show()

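For the heatmap option, the usual input is a correlation matrix. A minimal sketch, assuming a third made-up column (`Experience`, hypothetical and derived from `Age`) so that the matrix has more than two variables and at least one correlated pair:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(10)
age = rng.integers(20, 60, size=100)

data = pd.DataFrame({
    'Age': age,
    'Salary': rng.integers(30000, 100000, size=100),
    # Hypothetical column tied to Age so at least one pair is correlated
    'Experience': age - 18 - rng.integers(0, 3, size=100),
})

# Pairwise Pearson correlations between all numerical columns
corr = data.corr()
print(corr.round(2))
```

Passing the result to `sns.heatmap(corr, annot=True, vmin=-1, vmax=1)` would draw the matrix as a colored grid. Because `Age` and `Salary` are generated independently here, their correlation should come out close to zero, while `Age` and `Experience` should be close to one.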
3. Summary of Data Types and Visualization Choices
Data Type | Example | Best Visualizations |
---|---|---|
Nominal (Categorical) | Gender, Country | Bar Chart, Pie Chart |
Ordinal (Ordered Categorical) | Satisfaction Level, Education | Ordered Bar Chart, Box Plot |
Discrete (Numerical Countable) | Number of Students, Cars | Bar Chart, Histogram |
Continuous (Numerical Measurable) | Height, Temperature | Histogram, Line Plot, Scatter Plot |
Categorical vs. Numerical | Job vs. Salary | Box Plot, Violin Plot |
Numerical vs. Numerical | Age vs. Salary | Scatter Plot, Line Chart, Heatmap |
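
The table can also be read as a lookup. The sketch below encodes its first four rows as a dictionary and guesses the data type of a pandas column from its dtype. Both `SUGGESTED_CHARTS` and `classify` are hypothetical helpers, and the dtype rule is only a heuristic: integer codes such as ZIP codes are nominal even though they are stored as numbers.

```python
import pandas as pd
from pandas.api.types import is_bool_dtype, is_float_dtype, is_integer_dtype

# Hypothetical lookup mirroring the single-variable rows of the summary table
SUGGESTED_CHARTS = {
    "nominal": ["Bar Chart", "Pie Chart"],
    "ordinal": ["Ordered Bar Chart", "Box Plot"],
    "discrete": ["Bar Chart", "Histogram"],
    "continuous": ["Histogram", "Line Plot", "Scatter Plot"],
}


def classify(series: pd.Series) -> str:
    """Guess a column's data type from its dtype (a heuristic, not a rule)."""
    dtype = series.dtype
    if isinstance(dtype, pd.CategoricalDtype):
        return "ordinal" if dtype.ordered else "nominal"
    if is_bool_dtype(dtype):
        return "nominal"
    if is_integer_dtype(dtype):
        return "discrete"
    if is_float_dtype(dtype):
        return "continuous"
    return "nominal"  # strings and other labels


heights = pd.Series([170.2, 165.8, 181.0])
kind = classify(heights)
print(kind, SUGGESTED_CHARTS[kind])
```
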
Choosing the right visualization for each data type is essential for extracting meaningful insights. Once you know whether your data is categorical, numerical, or mixed, you can select the charts that make it most interpretable and impactful.