
Simple data analysis examples

Goals of This Chapter

In this chapter you will:

  • run small, complete analysis examples with pandas and matplotlib,
  • practice summary statistics, filtering, grouping, and sorting,
  • explore correlation and a simple trend line,
  • handle missing values in a tiny dataset.

We will not re-explain what data science, NumPy, pandas, or matplotlib are—those are covered earlier. Here we focus on practical, small examples you can run and modify.

All examples assume you have:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

and, if needed, %matplotlib inline in a notebook environment.


Example 1: Analyzing Student Scores

Imagine you have a simple dataset of students and their exam scores in two subjects.

Creating a small dataset

In a real project, you would load a CSV, but here we’ll create a tiny dataset directly:

data = {
    "name": ["Alice", "Bob", "Charlie", "Diana", "Evan"],
    "math": [85, 62, 90, 70, 95],
    "english": [88, 75, 85, 60, 92],
}
df = pd.DataFrame(data)
print(df)

Sample output:

      name  math  english
0    Alice    85       88
1      Bob    62       75
2  Charlie    90       85
3    Diana    70       60
4     Evan    95       92
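As noted above, in a real project the same data would usually come from a file. A minimal sketch of loading it with pd.read_csv (here simulating a hypothetical scores.csv with an in-memory string so the example is self-contained):

```python
import io
import pandas as pd

# Simulated file contents; in practice you would pass a filename
# such as "scores.csv" instead of the StringIO object.
csv_text = """name,math,english
Alice,85,88
Bob,62,75
Charlie,90,85
Diana,70,60
Evan,95,92
"""
df = pd.read_csv(io.StringIO(csv_text))
print(df)
```

The rest of the chapter works the same whether the DataFrame was built by hand or loaded from a file.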

Basic statistics

Let’s explore the scores:

print("Math average:", df["math"].mean())
print("English average:", df["english"].mean())
print("Highest math score:", df["math"].max())
print("Lowest english score:", df["english"].min())

You can get summary statistics for all numeric columns:

print(df.describe())

This shows count, mean, std (standard deviation), min, max, and the 25th, 50th, and 75th percentiles by default.
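If the default percentiles are not the ones you care about, describe() accepts a percentiles parameter. For example, to see the 10th, 50th, and 90th percentiles instead:

```python
import pandas as pd

df = pd.DataFrame({
    "math": [85, 62, 90, 70, 95],
    "english": [88, 75, 85, 60, 92],
})
# Request specific percentiles instead of the default 25/50/75%.
summary = df.describe(percentiles=[0.1, 0.5, 0.9])
print(summary)
```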

Adding a new column

Let’s calculate a total score and an average score per student:

df["total"] = df["math"] + df["english"]
df["average"] = df["total"] / 2
print(df)

Now you can see per-student aggregates.
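Instead of summing and dividing by hand, you can also average across columns directly with mean(axis=1), which scales better if more subjects are added later:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", "Bob"],
    "math": [85, 62],
    "english": [88, 75],
})
# axis=1 averages across the selected columns for each row.
df["average"] = df[["math", "english"]].mean(axis=1)
print(df)
```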

Filtering data (simple queries)

Find students who are doing very well:

top_students = df[df["average"] >= 85]
print(top_students)

Find students who might need help in English:

needs_help_english = df[df["english"] < 70]
print(needs_help_english)
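Conditions can also be combined: use & for "and" and | for "or", with parentheses around each condition. A small sketch:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "Diana", "Evan"],
    "math": [85, 62, 90, 70, 95],
    "english": [88, 75, 85, 60, 92],
})
# Parentheses around each condition are required.
strong_both = df[(df["math"] >= 80) & (df["english"] >= 80)]
print(strong_both)

weak_either = df[(df["math"] < 70) | (df["english"] < 70)]
print(weak_either)
```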

Simple visualization

Let’s visualize math vs English scores:

plt.scatter(df["math"], df["english"])
plt.xlabel("Math score")
plt.ylabel("English score")
plt.title("Math vs English scores")
plt.grid(True)
plt.show()

And a bar chart of total scores by student:

plt.bar(df["name"], df["total"])
plt.xlabel("Student")
plt.ylabel("Total score")
plt.title("Total scores by student")
plt.show()

What you practiced here:

  • creating a DataFrame from a dictionary,
  • computing basic statistics (mean, max, min, describe),
  • adding derived columns,
  • filtering rows with boolean conditions,
  • drawing scatter and bar plots.

Example 2: Daily Steps – Time Series Overview

Now imagine you tracked your daily steps for two weeks.

Creating a time-based dataset

We’ll create a date range and some fake step counts:

dates = pd.date_range(start="2025-01-01", periods=14, freq="D")
steps = [4500, 7000, 8000, 3000, 10000, 12000, 9000,
         6000, 11000, 4000, 7500, 13000, 5000, 9500]
steps_df = pd.DataFrame({
    "date": dates,
    "steps": steps,
})
print(steps_df)

Basic analysis

Total and average steps:

print("Total steps:", steps_df["steps"].sum())
print("Average steps per day:", steps_df["steps"].mean())

Find your most active day:

max_row = steps_df.loc[steps_df["steps"].idxmax()]
print("Most active day:")
print(max_row)
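To see the top several days rather than just one, nlargest is handy:

```python
import pandas as pd

steps_df = pd.DataFrame({
    "date": pd.date_range("2025-01-01", periods=7, freq="D"),
    "steps": [4500, 7000, 8000, 3000, 10000, 12000, 9000],
})
# The three most active days, highest first.
top3 = steps_df.nlargest(3, "steps")
print(top3)
```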

Filter days below an 8,000-step goal:

below_goal = steps_df[steps_df["steps"] < 8000]
print("Days below goal:")
print(below_goal)

Simple time series plot

Plot steps over time:

plt.plot(steps_df["date"], steps_df["steps"], marker="o")
plt.xlabel("Date")
plt.ylabel("Steps")
plt.title("Daily Steps Over Two Weeks")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

Grouping by weekday vs weekend (basic grouping)

Let’s see average steps on weekdays vs weekends.

First, add a column to show the day of week:

steps_df["day_name"] = steps_df["date"].dt.day_name()
print(steps_df[["date", "day_name", "steps"]])

Now create a column is_weekend:

steps_df["is_weekend"] = steps_df["day_name"].isin(["Saturday", "Sunday"])
print(steps_df[["date", "day_name", "is_weekend", "steps"]])

Group and calculate average:

avg_by_weekend = steps_df.groupby("is_weekend")["steps"].mean()
print(avg_by_weekend)

You can also plot it:

avg_by_weekend.plot(kind="bar")
plt.xticks([0, 1], ["Weekday", "Weekend"], rotation=0)
plt.ylabel("Average steps")
plt.title("Average Steps: Weekdays vs Weekends")
plt.show()
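Beyond the mean, groupby can compute several statistics in one call with .agg. A sketch using the same two-week step data:

```python
import pandas as pd

steps_df = pd.DataFrame({
    "date": pd.date_range("2025-01-01", periods=14, freq="D"),
    "steps": [4500, 7000, 8000, 3000, 10000, 12000, 9000,
              6000, 11000, 4000, 7500, 13000, 5000, 9500],
})
steps_df["is_weekend"] = steps_df["date"].dt.day_name().isin(["Saturday", "Sunday"])

# Mean, total, and number of days per group in one call.
summary = steps_df.groupby("is_weekend")["steps"].agg(["mean", "sum", "count"])
print(summary)
```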

What you practiced here:

  • building a date range with pd.date_range,
  • summing and averaging a time series,
  • locating the maximum row with idxmax,
  • extracting day names with .dt.day_name(),
  • grouping with groupby and plotting group averages.

Example 3: Simple Sales Analysis

Imagine a tiny store with a log of sales. Each row is one transaction.

Creating a small sales dataset

sales_data = {
    "date": pd.to_datetime([
        "2025-02-01", "2025-02-01", "2025-02-02",
        "2025-02-02", "2025-02-03", "2025-02-03",
    ]),
    "item": ["Book", "Pen", "Notebook", "Book", "Pen", "Book"],
    "price": [12.5, 1.5, 5.0, 12.5, 1.5, 12.5],
    "quantity": [1, 3, 2, 1, 10, 2],
}
sales_df = pd.DataFrame(sales_data)
print(sales_df)

Calculating revenue

Revenue per transaction is price * quantity. Add that as a column:

sales_df["revenue"] = sales_df["price"] * sales_df["quantity"]
print(sales_df)

Total revenue:

print("Total revenue:", sales_df["revenue"].sum())

Revenue per day:

revenue_by_day = sales_df.groupby("date")["revenue"].sum()
print(revenue_by_day)

Revenue per item:

revenue_by_item = sales_df.groupby("item")["revenue"].sum()
print(revenue_by_item)
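You can also cross-tabulate both at once: pivot_table puts dates on the rows and items on the columns, which makes it easy to see what sold when. A sketch with the same data:

```python
import pandas as pd

sales_df = pd.DataFrame({
    "date": pd.to_datetime(["2025-02-01", "2025-02-01", "2025-02-02",
                            "2025-02-02", "2025-02-03", "2025-02-03"]),
    "item": ["Book", "Pen", "Notebook", "Book", "Pen", "Book"],
    "price": [12.5, 1.5, 5.0, 12.5, 1.5, 12.5],
    "quantity": [1, 3, 2, 1, 10, 2],
})
sales_df["revenue"] = sales_df["price"] * sales_df["quantity"]

# Rows are dates, columns are items; combinations with no sales become 0.
table = pd.pivot_table(sales_df, values="revenue", index="date",
                       columns="item", aggfunc="sum", fill_value=0)
print(table)
```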

Sorting to find top items

Sort items by revenue (descending):

revenue_by_item_sorted = revenue_by_item.sort_values(ascending=False)
print(revenue_by_item_sorted)

Visualizing sales

Bar chart of revenue per item:

revenue_by_item_sorted.plot(kind="bar")
plt.ylabel("Total revenue")
plt.title("Revenue by Item")
plt.xticks(rotation=0)
plt.show()

Line plot of daily revenue:

revenue_by_day.plot(kind="line", marker="o")
plt.ylabel("Revenue")
plt.title("Revenue by Day")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

What you practiced here:

  • computing a derived revenue column from price and quantity,
  • grouping by date and by item with groupby,
  • sorting aggregates to find top items,
  • bar and line plots of aggregated data.

Example 4: Basic Correlation and Relationships

Suppose you have data on hours studied and exam scores for a small group of students.

Creating the dataset

study_data = {
    "hours_studied": [1, 2, 2.5, 3, 4, 5, 6, 7],
    "score":         [50, 55, 60, 62, 70, 78, 85, 90],
}
study_df = pd.DataFrame(study_data)
print(study_df)

Visualizing the relationship

Plot hours studied vs score:

plt.scatter(study_df["hours_studied"], study_df["score"])
plt.xlabel("Hours studied")
plt.ylabel("Exam score")
plt.title("Hours Studied vs Exam Score")
plt.grid(True)
plt.show()

You’ll probably see an upward trend.

Calculating correlation

Use the .corr() method:

corr = study_df["hours_studied"].corr(study_df["score"])
print("Correlation between hours studied and score:", corr)

Correlation values range from -1 to +1:

  • close to +1: strong positive linear relationship,
  • close to 0: little or no linear relationship,
  • close to -1: strong negative linear relationship.
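With more than two numeric columns, calling .corr() on the whole DataFrame gives a correlation matrix of every pair at once:

```python
import pandas as pd

study_df = pd.DataFrame({
    "hours_studied": [1, 2, 2.5, 3, 4, 5, 6, 7],
    "score": [50, 55, 60, 62, 70, 78, 85, 90],
})
# Pairwise correlations between all numeric columns;
# the diagonal is always 1.0 (a column with itself).
corr_matrix = study_df.corr()
print(corr_matrix)
```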

Simple “trend line” (using numpy)

We can add a simple best-fit line visually:

x = study_df["hours_studied"]
y = study_df["score"]
# Fit a line y = m*x + b
m, b = np.polyfit(x, y, 1)
plt.scatter(x, y, label="Data points")
plt.plot(x, m * x + b, color="red", label="Trend line")
plt.xlabel("Hours studied")
plt.ylabel("Exam score")
plt.title("Hours Studied vs Exam Score with Trend Line")
plt.legend()
plt.grid(True)
plt.show()
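The fitted slope m and intercept b are not just for drawing: you can plug a new x value into y = m*x + b to get a rough prediction. A sketch predicting the score for a hypothetical 4.5 hours of study:

```python
import numpy as np

x = np.array([1, 2, 2.5, 3, 4, 5, 6, 7])
y = np.array([50, 55, 60, 62, 70, 78, 85, 90])

# Fit a line y = m*x + b, then evaluate it at a new point.
m, b = np.polyfit(x, y, 1)
predicted = m * 4.5 + b
print(f"Slope: {m:.2f}, intercept: {b:.2f}, prediction: {predicted:.1f}")
```

This is only a rough extrapolation from eight data points, not a real forecasting model.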

What you practiced here:

  • visualizing a relationship with a scatter plot,
  • computing a correlation coefficient with .corr(),
  • fitting and drawing a trend line with np.polyfit.

Example 5: Handling Missing Data (Small Demo)

Real datasets often have missing values. Let’s see a tiny example of dealing with them.

Creating a dataset with missing values

temp_data = {
    "day": ["Mon", "Tue", "Wed", "Thu", "Fri"],
    "temperature": [20.5, np.nan, 22.0, np.nan, 21.5],
}
temp_df = pd.DataFrame(temp_data)
print(temp_df)

You’ll see NaN (Not a Number) for missing temperatures.

Detecting missing values

print(temp_df.isna())
print("Number of missing values per column:")
print(temp_df.isna().sum())

Dropping missing values

Sometimes it’s OK to just remove rows with missing data:

temp_drop = temp_df.dropna()
print(temp_drop)

Now only complete rows remain.

Filling missing values

Other times you want to fill missing values with something reasonable, like the mean:

mean_temp = temp_df["temperature"].mean()
print("Mean temperature:", mean_temp)
temp_filled = temp_df.copy()
temp_filled["temperature"] = temp_filled["temperature"].fillna(mean_temp)
print(temp_filled)

Or fill with a fixed value:

temp_fixed = temp_df.copy()
temp_fixed["temperature"] = temp_fixed["temperature"].fillna(0)
print(temp_fixed)
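For ordered data like daily temperatures, linear interpolation is often more sensible than a constant, since each gap is filled from its neighbouring readings:

```python
import numpy as np
import pandas as pd

temp_df = pd.DataFrame({
    "day": ["Mon", "Tue", "Wed", "Thu", "Fri"],
    "temperature": [20.5, np.nan, 22.0, np.nan, 21.5],
})
# Each NaN becomes the midpoint of the values around it.
temp_df["temperature"] = temp_df["temperature"].interpolate()
print(temp_df)
```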

What you practiced here:

  • detecting missing values with isna(),
  • dropping incomplete rows with dropna(),
  • filling gaps with a mean or a fixed value using fillna().

Putting It All Together: A Mini Workflow Pattern

Most simple data analysis tasks follow a pattern like:

  1. Get data
    • Load from CSV, Excel, database, or create it manually (as we did).
  2. Inspect data
    • df.head(), df.info(), df.describe().
  3. Clean data
    • Handle missing values.
    • Fix data types if needed.
    • Remove obvious errors.
  4. Transform / enrich data
    • Add new columns (totals, averages, flags like is_weekend).
    • Group and aggregate (sum, mean, count).
  5. Analyze
    • Compute statistics.
    • Filter subsets of interest.
    • Compare groups.
  6. Visualize
    • Use plots (line, bar, scatter) to see trends and relationships.
  7. Interpret
    • Turn results into simple statements or decisions:
      • “Average steps are higher on weekends.”
      • “Book sales bring the most revenue.”
      • “More hours studied is associated with higher scores.”

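The steps above can be sketched end to end as one short script (the sales figures here are invented for illustration):

```python
import numpy as np
import pandas as pd

# 1. Get data (invented sample: daily revenue with one missing value)
df = pd.DataFrame({
    "date": pd.date_range("2025-03-01", periods=6, freq="D"),
    "revenue": [120.0, np.nan, 95.0, 180.0, 210.0, 160.0],
})

# 2. Inspect
print(df.describe())

# 3. Clean: fill the missing revenue with the column mean
df["revenue"] = df["revenue"].fillna(df["revenue"].mean())

# 4. Transform / enrich: flag weekends
df["is_weekend"] = df["date"].dt.day_name().isin(["Saturday", "Sunday"])

# 5. Analyze: compare groups
avg = df.groupby("is_weekend")["revenue"].mean()
print(avg)

# 6. Visualize (e.g. avg.plot(kind="bar")) and
# 7. interpret the comparison in plain language.
```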
These small examples are good templates for your own projects: you can replace the sample data with your own CSVs or logs and keep the same basic steps.


Ideas for Your Own Simple Analyses

Here are some small project ideas you can try using the same techniques:

  • analyze your own step or sleep data exported from a phone or watch,
  • track personal expenses and group them by category or by week,
  • log daily temperatures and practice handling missing readings,
  • record study hours and grades and look for a correlation.

Reuse the patterns from this chapter: create/load a DataFrame, clean it, analyze with groupby, and visualize with matplotlib.
