
Simple data analysis examples

Goals of This Chapter

In this chapter you will:

  • run small, complete analysis examples with pandas and matplotlib,
  • practice summary statistics, filtering, grouping, and sorting,
  • explore correlation and a simple trend line,
  • handle missing values in a tiny dataset.

We will not re-explain what data science, NumPy, pandas, or matplotlib are—those are covered earlier. Here we focus on practical, small examples you can run and modify.

All examples assume you have:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

and, if needed, %matplotlib inline in a notebook environment.


Example 1: Analyzing Student Scores

Imagine you have a simple dataset of students and their exam scores in two subjects.

Creating a small dataset

In a real project, you would load a CSV, but here we’ll create a tiny dataset directly:

data = {
    "name": ["Alice", "Bob", "Charlie", "Diana", "Evan"],
    "math": [85, 62, 90, 70, 95],
    "english": [88, 75, 85, 60, 92],
}
df = pd.DataFrame(data)
print(df)

Sample output:

      name  math  english
0    Alice    85       88
1      Bob    62       75
2  Charlie    90       85
3    Diana    70       60
4     Evan    95       92
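As noted above, in a real project the same data would usually come from a file. A minimal sketch of loading it with pd.read_csv (here simulating a hypothetical scores.csv with an in-memory string so the example is self-contained):

```python
import io
import pandas as pd

# Simulated file contents; in practice you would pass a filename
# such as "scores.csv" instead of the StringIO object.
csv_text = """name,math,english
Alice,85,88
Bob,62,75
Charlie,90,85
Diana,70,60
Evan,95,92
"""
df = pd.read_csv(io.StringIO(csv_text))
print(df)
```

The rest of the chapter works the same whether the DataFrame was built by hand or loaded from a file.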

Basic statistics

Let’s explore the scores:

print("Math average:", df["math"].mean())
print("English average:", df["english"].mean())
print("Highest math score:", df["math"].max())
print("Lowest english score:", df["english"].min())

You can get summary statistics for all numeric columns:

print(df.describe())

This shows count, mean, std (standard deviation), min, max, and the 25th, 50th, and 75th percentiles by default.
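If the default percentiles are not the ones you care about, describe() accepts a percentiles parameter. For example, to see the 10th, 50th, and 90th percentiles instead:

```python
import pandas as pd

df = pd.DataFrame({
    "math": [85, 62, 90, 70, 95],
    "english": [88, 75, 85, 60, 92],
})
# Request specific percentiles instead of the default 25/50/75%.
summary = df.describe(percentiles=[0.1, 0.5, 0.9])
print(summary)
```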

Adding a new column

Let’s calculate a total score and an average score per student:

df["total"] = df["math"] + df["english"]
df["average"] = df["total"] / 2
print(df)

Now you can see per-student aggregates.
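Instead of summing and dividing by hand, you can also average across columns directly with mean(axis=1), which scales better if more subjects are added later:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", "Bob"],
    "math": [85, 62],
    "english": [88, 75],
})
# axis=1 averages across the selected columns for each row.
df["average"] = df[["math", "english"]].mean(axis=1)
print(df)
```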

Filtering data (simple queries)

Find students who are doing very well:

top_students = df[df["average"] >= 85]
print(top_students)

Find students who might need help in English:

needs_help_english = df[df["english"] < 70]
print(needs_help_english)
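Conditions can also be combined: use & for "and" and | for "or", with parentheses around each condition. A small sketch:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "Diana", "Evan"],
    "math": [85, 62, 90, 70, 95],
    "english": [88, 75, 85, 60, 92],
})
# Parentheses around each condition are required.
strong_both = df[(df["math"] >= 80) & (df["english"] >= 80)]
print(strong_both)

weak_either = df[(df["math"] < 70) | (df["english"] < 70)]
print(weak_either)
```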

Simple visualization

Let’s visualize math vs English scores:

plt.scatter(df["math"], df["english"])
plt.xlabel("Math score")
plt.ylabel("English score")
plt.title("Math vs English scores")
plt.grid(True)
plt.show()

And a bar chart of total scores by student:

plt.bar(df["name"], df["total"])
plt.xlabel("Student")
plt.ylabel("Total score")
plt.title("Total scores by student")
plt.show()

What you practiced here:

  • creating a DataFrame from a dictionary,
  • computing basic statistics (mean, max, min, describe),
  • adding derived columns,
  • filtering rows with boolean conditions,
  • drawing scatter and bar plots.

Example 2: Daily Steps – Time Series Overview

Now imagine you tracked your daily steps for two weeks.

Creating a time-based dataset

We’ll create a date range and some fake step counts:

dates = pd.date_range(start="2025-01-01", periods=14, freq="D")
steps = [4500, 7000, 8000, 3000, 10000, 12000, 9000,
         6000, 11000, 4000, 7500, 13000, 5000, 9500]
steps_df = pd.DataFrame({
    "date": dates,
    "steps": steps,
})
print(steps_df)

Basic analysis

Total and average steps:

print("Total steps:", steps_df["steps"].sum())
print("Average steps per day:", steps_df["steps"].mean())

Find your most active day:

max_row = steps_df.loc[steps_df["steps"].idxmax()]
print("Most active day:")
print(max_row)
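To see the top several days rather than just one, nlargest is handy:

```python
import pandas as pd

steps_df = pd.DataFrame({
    "date": pd.date_range("2025-01-01", periods=7, freq="D"),
    "steps": [4500, 7000, 8000, 3000, 10000, 12000, 9000],
})
# The three most active days, highest first.
top3 = steps_df.nlargest(3, "steps")
print(top3)
```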

Filter days below an 8,000-step goal:

below_goal = steps_df[steps_df["steps"] < 8000]
print("Days below goal:")
print(below_goal)

Simple time series plot

Plot steps over time:

plt.plot(steps_df["date"], steps_df["steps"], marker="o")
plt.xlabel("Date")
plt.ylabel("Steps")
plt.title("Daily Steps Over Two Weeks")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

Grouping by weekday vs weekend (basic grouping)

Let’s see average steps on weekdays vs weekends.

First, add a column to show the day of week:

steps_df["day_name"] = steps_df["date"].dt.day_name()
print(steps_df[["date", "day_name", "steps"]])

Now create a column is_weekend:

steps_df["is_weekend"] = steps_df["day_name"].isin(["Saturday", "Sunday"])
print(steps_df[["date", "day_name", "is_weekend", "steps"]])

Group and calculate average:

avg_by_weekend = steps_df.groupby("is_weekend")["steps"].mean()
print(avg_by_weekend)

You can also plot it:

avg_by_weekend.plot(kind="bar")
plt.xticks([0, 1], ["Weekday", "Weekend"], rotation=0)
plt.ylabel("Average steps")
plt.title("Average Steps: Weekdays vs Weekends")
plt.show()
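Beyond the mean, groupby can compute several statistics in one call with .agg. A sketch using the same two-week step data:

```python
import pandas as pd

steps_df = pd.DataFrame({
    "date": pd.date_range("2025-01-01", periods=14, freq="D"),
    "steps": [4500, 7000, 8000, 3000, 10000, 12000, 9000,
              6000, 11000, 4000, 7500, 13000, 5000, 9500],
})
steps_df["is_weekend"] = steps_df["date"].dt.day_name().isin(["Saturday", "Sunday"])

# Mean, total, and number of days per group in one call.
summary = steps_df.groupby("is_weekend")["steps"].agg(["mean", "sum", "count"])
print(summary)
```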

What you practiced here:

  • building a date range with pd.date_range,
  • summing and averaging a time series,
  • locating the maximum row with idxmax,
  • extracting day names with .dt.day_name(),
  • grouping with groupby and plotting group averages.

Example 3: Simple Sales Analysis

Imagine a tiny store with a log of sales. Each row is one transaction.

Creating a small sales dataset

sales_data = {
    "date": pd.to_datetime([
        "2025-02-01", "2025-02-01", "2025-02-02",
        "2025-02-02", "2025-02-03", "2025-02-03",
    ]),
    "item": ["Book", "Pen", "Notebook", "Book", "Pen", "Book"],
    "price": [12.5, 1.5, 5.0, 12.5, 1.5, 12.5],
    "quantity": [1, 3, 2, 1, 10, 2],
}
sales_df = pd.DataFrame(sales_data)
print(sales_df)

Calculating revenue

Revenue per transaction is price * quantity. Add that as a column:

sales_df["revenue"] = sales_df["price"] * sales_df["quantity"]
print(sales_df)

Total revenue:

print("Total revenue:", sales_df["revenue"].sum())

Revenue per day:

revenue_by_day = sales_df.groupby("date")["revenue"].sum()
print(revenue_by_day)

Revenue per item:

revenue_by_item = sales_df.groupby("item")["revenue"].sum()
print(revenue_by_item)
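You can also cross-tabulate both at once: pivot_table puts dates on the rows and items on the columns, which makes it easy to see what sold when. A sketch with the same data:

```python
import pandas as pd

sales_df = pd.DataFrame({
    "date": pd.to_datetime(["2025-02-01", "2025-02-01", "2025-02-02",
                            "2025-02-02", "2025-02-03", "2025-02-03"]),
    "item": ["Book", "Pen", "Notebook", "Book", "Pen", "Book"],
    "price": [12.5, 1.5, 5.0, 12.5, 1.5, 12.5],
    "quantity": [1, 3, 2, 1, 10, 2],
})
sales_df["revenue"] = sales_df["price"] * sales_df["quantity"]

# Rows are dates, columns are items; combinations with no sales become 0.
table = pd.pivot_table(sales_df, values="revenue", index="date",
                       columns="item", aggfunc="sum", fill_value=0)
print(table)
```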

Sorting to find top items

Sort items by revenue (descending):

revenue_by_item_sorted = revenue_by_item.sort_values(ascending=False)
print(revenue_by_item_sorted)

Visualizing sales

Bar chart of revenue per item:

revenue_by_item_sorted.plot(kind="bar")
plt.ylabel("Total revenue")
plt.title("Revenue by Item")
plt.xticks(rotation=0)
plt.show()

Line plot of daily revenue:

revenue_by_day.plot(kind="line", marker="o")
plt.ylabel("Revenue")
plt.title("Revenue by Day")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

What you practiced here:

  • computing a derived revenue column from price and quantity,
  • grouping by date and by item with groupby,
  • sorting aggregates to find top items,
  • bar and line plots of aggregated data.

Example 4: Basic Correlation and Relationships

Suppose you have data on hours studied and exam scores for a small group of students.

Creating the dataset

study_data = {
    "hours_studied": [1, 2, 2.5, 3, 4, 5, 6, 7],
    "score":         [50, 55, 60, 62, 70, 78, 85, 90],
}
study_df = pd.DataFrame(study_data)
print(study_df)

Visualizing the relationship

Plot hours studied vs score:

plt.scatter(study_df["hours_studied"], study_df["score"])
plt.xlabel("Hours studied")
plt.ylabel("Exam score")
plt.title("Hours Studied vs Exam Score")
plt.grid(True)
plt.show()

You’ll probably see an upward trend.

Calculating correlation

Use the .corr() method:

corr = study_df["hours_studied"].corr(study_df["score"])
print("Correlation between hours studied and score:", corr)

Correlation values range from -1 to +1:

  • close to +1: strong positive linear relationship,
  • close to 0: little or no linear relationship,
  • close to -1: strong negative linear relationship.
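With more than two numeric columns, calling .corr() on the whole DataFrame gives a correlation matrix of every pair at once:

```python
import pandas as pd

study_df = pd.DataFrame({
    "hours_studied": [1, 2, 2.5, 3, 4, 5, 6, 7],
    "score": [50, 55, 60, 62, 70, 78, 85, 90],
})
# Pairwise correlations between all numeric columns;
# the diagonal is always 1.0 (a column with itself).
corr_matrix = study_df.corr()
print(corr_matrix)
```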

Simple “trend line” (using numpy)

We can add a simple best-fit line visually:

x = study_df["hours_studied"]
y = study_df["score"]
# Fit a line y = m*x + b
m, b = np.polyfit(x, y, 1)
plt.scatter(x, y, label="Data points")
plt.plot(x, m * x + b, color="red", label="Trend line")
plt.xlabel("Hours studied")
plt.ylabel("Exam score")
plt.title("Hours Studied vs Exam Score with Trend Line")
plt.legend()
plt.grid(True)
plt.show()
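The fitted slope m and intercept b are not just for drawing: you can plug a new x value into y = m*x + b to get a rough prediction. A sketch predicting the score for a hypothetical 4.5 hours of study:

```python
import numpy as np

x = np.array([1, 2, 2.5, 3, 4, 5, 6, 7])
y = np.array([50, 55, 60, 62, 70, 78, 85, 90])

# Fit a line y = m*x + b, then evaluate it at a new point.
m, b = np.polyfit(x, y, 1)
predicted = m * 4.5 + b
print(f"Slope: {m:.2f}, intercept: {b:.2f}, prediction: {predicted:.1f}")
```

This is only a rough extrapolation from eight data points, not a real forecasting model.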

What you practiced here:

  • visualizing a relationship with a scatter plot,
  • computing a correlation coefficient with .corr(),
  • fitting and drawing a trend line with np.polyfit.

Example 5: Handling Missing Data (Small Demo)

Real datasets often have missing values. Let’s see a tiny example of dealing with them.

Creating a dataset with missing values

temp_data = {
    "day": ["Mon", "Tue", "Wed", "Thu", "Fri"],
    "temperature": [20.5, np.nan, 22.0, np.nan, 21.5],
}
temp_df = pd.DataFrame(temp_data)
print(temp_df)

You’ll see NaN (Not a Number) for missing temperatures.

Detecting missing values

print(temp_df.isna())
print("Number of missing values per column:")
print(temp_df.isna().sum())

Dropping missing values

Sometimes it’s OK to just remove rows with missing data:

temp_drop = temp_df.dropna()
print(temp_drop)

Now only complete rows remain.

Filling missing values

Other times you want to fill missing values with something reasonable, like the mean:

mean_temp = temp_df["temperature"].mean()
print("Mean temperature:", mean_temp)
temp_filled = temp_df.copy()
temp_filled["temperature"] = temp_filled["temperature"].fillna(mean_temp)
print(temp_filled)

Or fill with a fixed value:

temp_fixed = temp_df.copy()
temp_fixed["temperature"] = temp_fixed["temperature"].fillna(0)
print(temp_fixed)
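For ordered data like daily temperatures, linear interpolation is often more sensible than a constant, since each gap is filled from its neighbouring readings:

```python
import numpy as np
import pandas as pd

temp_df = pd.DataFrame({
    "day": ["Mon", "Tue", "Wed", "Thu", "Fri"],
    "temperature": [20.5, np.nan, 22.0, np.nan, 21.5],
})
# Each NaN becomes the midpoint of the values around it.
temp_df["temperature"] = temp_df["temperature"].interpolate()
print(temp_df)
```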

What you practiced here:

  • detecting missing values with isna(),
  • dropping incomplete rows with dropna(),
  • filling gaps with a mean or a fixed value using fillna().

Putting It All Together: A Mini Workflow Pattern

Most simple data analysis tasks follow a pattern like:

  1. Get data
    • Load from CSV, Excel, database, or create it manually (as we did).
  2. Inspect data
    • df.head(), df.info(), df.describe().
  3. Clean data
    • Handle missing values.
    • Fix data types if needed.
    • Remove obvious errors.
  4. Transform / enrich data
    • Add new columns (totals, averages, flags like is_weekend).
    • Group and aggregate (sum, mean, count).
  5. Analyze
    • Compute statistics.
    • Filter subsets of interest.
    • Compare groups.
  6. Visualize
    • Use plots (line, bar, scatter) to see trends and relationships.
  7. Interpret
    • Turn results into simple statements or decisions:
      • “Average steps are higher on weekends.”
      • “Book sales bring the most revenue.”
      • “More hours studied is associated with higher scores.”

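The steps above can be sketched end to end as one short script (the sales figures here are invented for illustration):

```python
import numpy as np
import pandas as pd

# 1. Get data (invented sample: daily revenue with one missing value)
df = pd.DataFrame({
    "date": pd.date_range("2025-03-01", periods=6, freq="D"),
    "revenue": [120.0, np.nan, 95.0, 180.0, 210.0, 160.0],
})

# 2. Inspect
print(df.describe())

# 3. Clean: fill the missing revenue with the column mean
df["revenue"] = df["revenue"].fillna(df["revenue"].mean())

# 4. Transform / enrich: flag weekends
df["is_weekend"] = df["date"].dt.day_name().isin(["Saturday", "Sunday"])

# 5. Analyze: compare groups
avg = df.groupby("is_weekend")["revenue"].mean()
print(avg)

# 6. Visualize (e.g. avg.plot(kind="bar")) and
# 7. interpret the comparison in plain language.
```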
These small examples are good templates for your own projects: you can replace the sample data with your own CSVs or logs and keep the same basic steps.


Ideas for Your Own Simple Analyses

Here are some small project ideas you can try using the same techniques:

  • analyze your own step or sleep data exported from a phone or watch,
  • track personal expenses and group them by category or by week,
  • log daily temperatures and practice handling missing readings,
  • record study hours and grades and look for a correlation.

Reuse the patterns from this chapter: create/load a DataFrame, clean it, analyze with groupby, and visualize with matplotlib.
