Goals of This Chapter
In this chapter you will:
- See end-to-end examples of small data analysis tasks in Python.
- Combine pandas, numpy, and matplotlib in simple workflows.
- Practice common patterns: loading data, cleaning, analyzing, and visualizing.
- Get ideas for your own small data projects.
We will not re-explain what data science, NumPy, pandas, or matplotlib are—those are covered earlier. Here we focus on practical, small examples you can run and modify.
All examples assume you have:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
and, if needed, %matplotlib inline in a notebook environment.
Example 1: Analyzing Student Scores
Imagine you have a simple dataset of students and their exam scores in two subjects.
Creating a small dataset
In a real project, you would load a CSV, but here we’ll create a tiny dataset directly:
data = {
"name": ["Alice", "Bob", "Charlie", "Diana", "Evan"],
"math": [85, 62, 90, 70, 95],
"english": [88, 75, 85, 60, 92],
}
df = pd.DataFrame(data)
print(df)
Sample output:
      name  math  english
0    Alice    85       88
1      Bob    62       75
2  Charlie    90       85
3    Diana    70       60
4     Evan    95       92
Basic statistics
Let’s explore the scores:
print("Math average:", df["math"].mean())
print("English average:", df["english"].mean())
print("Highest math score:", df["math"].max())
print("Lowest english score:", df["english"].min())You can get summary statistics for all numeric columns:
print(df.describe())
This usually shows count, mean, std (standard deviation), min, max, and some percentiles.
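If you want different percentiles, describe() accepts a percentiles argument; for example:
# request the 10th, 50th, and 90th percentiles instead of the defaults
print(df.describe(percentiles=[0.1, 0.5, 0.9]))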
Adding a new column
Let’s calculate a total score and an average score per student:
df["total"] = df["math"] + df["english"]
df["average"] = df["total"] / 2
print(df)
Now you can see per-student aggregates.
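As an aside, you could compute the same per-student average in one step with a row-wise mean (axis=1 means "across columns"):
# equivalent to (math + english) / 2 for each row
df["average"] = df[["math", "english"]].mean(axis=1)
print(df)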
Filtering data (simple queries)
Find students who are doing very well:
top_students = df[df["average"] >= 85]
print(top_students)
Find students who might need help in English:
needs_help_english = df[df["english"] < 70]
print(needs_help_english)
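You can also combine conditions with & (and) or | (or); each condition needs its own parentheses. For example, students doing well in both subjects:
# both conditions must hold; note the parentheses around each one
strong_in_both = df[(df["math"] >= 80) & (df["english"] >= 80)]
print(strong_in_both)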
Simple visualization
Let’s visualize math vs English scores:
plt.scatter(df["math"], df["english"])
plt.xlabel("Math score")
plt.ylabel("English score")
plt.title("Math vs English scores")
plt.grid(True)
plt.show()
And a bar chart of total scores by student:
plt.bar(df["name"], df["total"])
plt.xlabel("Student")
plt.ylabel("Total score")
plt.title("Total scores by student")
plt.show()
What you practiced here:
- Creating a DataFrame.
- Calculating statistics.
- Adding new columns.
- Filtering rows based on conditions.
- Simple scatter and bar plots.
Example 2: Daily Steps – Time Series Overview
Now imagine you tracked your daily steps for two weeks.
Creating a time-based dataset
We’ll create a date range and some fake step counts:
dates = pd.date_range(start="2025-01-01", periods=14, freq="D")
steps = [4500, 7000, 8000, 3000, 10000, 12000, 9000,
6000, 11000, 4000, 7500, 13000, 5000, 9500]
steps_df = pd.DataFrame({
"date": dates,
"steps": steps,
})
print(steps_df)
Basic analysis
Total and average steps:
print("Total steps:", steps_df["steps"].sum())
print("Average steps per day:", steps_df["steps"].mean())Find your most active day:
max_row = steps_df.loc[steps_df["steps"].idxmax()]
print("Most active day:")
print(max_row)
Filter days below an 8,000-step goal:
below_goal = steps_df[steps_df["steps"] < 8000]
print("Days below goal:")
print(below_goal)
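A related one-liner: averaging a boolean Series gives the fraction of True values, so you can get the share of days that met the goal directly:
# True/False values average to the fraction of days at or above 8,000 steps
print("Share of days meeting goal:", (steps_df["steps"] >= 8000).mean())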
Simple time series plot
Plot steps over time:
plt.plot(steps_df["date"], steps_df["steps"], marker="o")
plt.xlabel("Date")
plt.ylabel("Steps")
plt.title("Daily Steps Over Two Weeks")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
Grouping by weekday vs weekend (basic grouping)
Let’s see average steps on weekdays vs weekends.
First, add a column to show the day of week:
steps_df["day_name"] = steps_df["date"].dt.day_name()
print(steps_df[["date", "day_name", "steps"]])
Now create a column is_weekend:
steps_df["is_weekend"] = steps_df["day_name"].isin(["Saturday", "Sunday"])
print(steps_df[["date", "day_name", "is_weekend", "steps"]])Group and calculate average:
Group and calculate average:
avg_by_weekend = steps_df.groupby("is_weekend")["steps"].mean()
print(avg_by_weekend)
You can also plot it:
avg_by_weekend.plot(kind="bar")
plt.xticks([0, 1], ["Weekday", "Weekend"], rotation=0)
plt.ylabel("Average steps")
plt.title("Average Steps: Weekdays vs Weekends")
plt.show()
What you practiced here:
- Working with dates and time series.
- Finding min/max rows.
- Grouping data and calculating averages.
- Plotting time series and grouped data.
Example 3: Simple Sales Analysis
Imagine a tiny store with a log of sales. Each row is one transaction.
Creating a small sales dataset
sales_data = {
"date": pd.to_datetime([
"2025-02-01", "2025-02-01", "2025-02-02",
"2025-02-02", "2025-02-03", "2025-02-03",
]),
"item": ["Book", "Pen", "Notebook", "Book", "Pen", "Book"],
"price": [12.5, 1.5, 5.0, 12.5, 1.5, 12.5],
"quantity": [1, 3, 2, 1, 10, 2],
}
sales_df = pd.DataFrame(sales_data)
print(sales_df)
Calculating revenue
Revenue per transaction is price * quantity. Add that as a column:
sales_df["revenue"] = sales_df["price"] * sales_df["quantity"]
print(sales_df)
Total revenue:
print("Total revenue:", sales_df["revenue"].sum())Revenue per day:
revenue_by_day = sales_df.groupby("date")["revenue"].sum()
print(revenue_by_day)
Revenue per item:
revenue_by_item = sales_df.groupby("item")["revenue"].sum()
print(revenue_by_item)
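If you want several aggregates at once, groupby also supports named aggregation; here is a brief sketch:
# total revenue and units sold per item in one pass
item_summary = sales_df.groupby("item").agg(
    revenue=("revenue", "sum"),
    units=("quantity", "sum"),
)
print(item_summary)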
Sorting to find top items
Sort items by revenue (descending):
revenue_by_item_sorted = revenue_by_item.sort_values(ascending=False)
print(revenue_by_item_sorted)
Visualizing sales
Bar chart of revenue per item:
revenue_by_item_sorted.plot(kind="bar")
plt.ylabel("Total revenue")
plt.title("Revenue by Item")
plt.xticks(rotation=0)
plt.show()
Line plot of daily revenue:
revenue_by_day.plot(kind="line", marker="o")
plt.ylabel("Revenue")
plt.title("Revenue by Day")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
What you practiced here:
- Creating new columns using existing ones.
- Grouping and aggregating (sum).
- Sorting results.
- Plotting grouped data.
Example 4: Basic Correlation and Relationships
Suppose you have data on hours studied and exam scores for a small group of students.
Creating the dataset
study_data = {
"hours_studied": [1, 2, 2.5, 3, 4, 5, 6, 7],
"score": [50, 55, 60, 62, 70, 78, 85, 90],
}
study_df = pd.DataFrame(study_data)
print(study_df)
Visualizing the relationship
Plot hours studied vs score:
plt.scatter(study_df["hours_studied"], study_df["score"])
plt.xlabel("Hours studied")
plt.ylabel("Exam score")
plt.title("Hours Studied vs Exam Score")
plt.grid(True)
plt.show()
You’ll probably see an upward trend.
Calculating correlation
Use the .corr() method:
corr = study_df["hours_studied"].corr(study_df["score"])
print("Correlation between hours studied and score:", corr)Correlation values:
- Close to 1: strong positive relationship.
- Close to 0: weak or no linear relationship.
- Close to -1: strong negative relationship.
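With more numeric columns you can ask for all pairwise correlations at once; on this two-column DataFrame the result is a small 2x2 matrix:
# pairwise correlation matrix for all numeric columns
print(study_df.corr())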
Simple “trend line” (using numpy)
We can add a simple best-fit line visually:
x = study_df["hours_studied"]
y = study_df["score"]
# Fit a line y = m*x + b
m, b = np.polyfit(x, y, 1)
plt.scatter(x, y, label="Data points")
plt.plot(x, m * x + b, color="red", label="Trend line")
plt.xlabel("Hours studied")
plt.ylabel("Exam score")
plt.title("Hours Studied vs Exam Score with Trend Line")
plt.legend()
plt.grid(True)
plt.show()
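As a follow-up, once you have m and b you can plug in new values for rough, purely illustrative predictions:
# rough estimate from the fitted line; not a substitute for a real model
hours = 4.5  # hypothetical study time
print("Predicted score for", hours, "hours:", round(m * hours + b, 1))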
What you practiced here:
- Using scatter plots.
- Computing correlation.
- Using numpy.polyfit to draw a simple trend line.
Example 5: Handling Missing Data (Small Demo)
Real datasets often have missing values. Let’s see a tiny example of dealing with them.
Creating a dataset with missing values
temp_data = {
"day": ["Mon", "Tue", "Wed", "Thu", "Fri"],
"temperature": [20.5, np.nan, 22.0, np.nan, 21.5],
}
temp_df = pd.DataFrame(temp_data)
print(temp_df)
You’ll see NaN (Not a Number) for missing temperatures.
Detecting missing values
print(temp_df.isna())
print("Number of missing values per column:")
print(temp_df.isna().sum())
Dropping missing values
Sometimes it’s OK to just remove rows with missing data:
temp_drop = temp_df.dropna()
print(temp_drop)
Now only complete rows remain.
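In wider tables you often want to drop a row only when particular columns are missing; dropna takes a subset argument for that:
# drop rows only where "temperature" is missing, keep everything else
print(temp_df.dropna(subset=["temperature"]))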
Filling missing values
Other times you want to fill missing values with something reasonable, like the mean:
mean_temp = temp_df["temperature"].mean()
print("Mean temperature:", mean_temp)
temp_filled = temp_df.copy()
temp_filled["temperature"] = temp_filled["temperature"].fillna(mean_temp)
print(temp_filled)
Or fill with a fixed value:
temp_fixed = temp_df.copy()
temp_fixed["temperature"] = temp_fixed["temperature"].fillna(0)
print(temp_fixed)
What you practiced here:
- Checking for missing values.
- Dropping rows with missing values.
- Filling missing values with a calculated or fixed value.
Putting It All Together: A Mini Workflow Pattern
Most simple data analysis tasks follow a pattern like:
- Get data
  - Load from a CSV, Excel file, or database, or create it manually (as we did).
- Inspect data
  - df.head(), df.info(), df.describe().
- Clean data
  - Handle missing values.
  - Fix data types if needed.
  - Remove obvious errors.
- Transform / enrich data
  - Add new columns (totals, averages, flags like is_weekend).
  - Group and aggregate (sum, mean, count).
- Analyze
  - Compute statistics.
  - Filter subsets of interest.
  - Compare groups.
- Visualize
  - Use plots (line, bar, scatter) to see trends and relationships.
- Interpret
  - Turn results into simple statements or decisions:
    - “Average steps are higher on weekends.”
    - “Book sales bring the most revenue.”
    - “More hours studied is associated with higher scores.”
These small examples are good templates for your own projects: you can replace the sample data with your own CSVs or logs and keep the same basic steps.
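As a compact sketch, here is the whole pattern on a hypothetical CSV, using the imports from the start of the chapter; the file name and the price, quantity, and category columns are made up for illustration:
# hypothetical file and column names, for illustration only
df = pd.read_csv("my_data.csv")                   # 1. get data
print(df.head())                                  # 2. inspect
df = df.dropna()                                  # 3. clean (simplest option)
df["total"] = df["price"] * df["quantity"]        # 4. transform (assumed columns)
summary = df.groupby("category")["total"].sum()   # 5. analyze (assumed column)
summary.plot(kind="bar")                          # 6. visualize
plt.ylabel("Total")
plt.show()                                        # 7. interpret the chart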
Ideas for Your Own Simple Analyses
Here are some small project ideas you can try using the same techniques:
- Track your daily screen time by app and analyze which days you use your phone most.
- Record your spending for a month and find which categories cost the most.
- Analyze weather data (temperature/rain) for your city and see patterns by month or weekday.
- Log your study time and test results to see how preparation affects performance.
Reuse the patterns from this chapter: create/load a DataFrame, clean it, analyze with groupby, and visualize with matplotlib.
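For instance, the spending idea might start like this, assuming a hypothetical spending.csv with date, category, and amount columns:
# hypothetical file with columns: date, category, amount
spending = pd.read_csv("spending.csv", parse_dates=["date"])
by_category = spending.groupby("category")["amount"].sum()
print(by_category.sort_values(ascending=False))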