
Introduction to pandas

What is pandas?

In data science, you often work with tables of data: rows and columns, like a spreadsheet.
pandas is a Python library that makes working with this kind of data much easier.

pandas gives you two main tools:

Series: a labeled one-dimensional array, like a single column of data
DataFrame: a two-dimensional table with labeled rows and columns

In this chapter you will:

create Series and DataFrames
load data from CSV files
inspect, select, and filter data
add and modify columns
handle missing values
group data and compute simple statistics
save results back to CSV

You don’t need to memorize everything—this chapter is about getting a feel for how pandas works.

Installing and importing pandas

pandas is not part of Python’s standard library, so you usually install it with pip (covered earlier):

pip install pandas

In your Python code, you almost always import it like this:

import pandas as pd

The as pd part gives the library a short alias, so you can write pd.DataFrame() instead of pandas.DataFrame().

The Series: a labeled one-dimensional array

A Series is like a single column of values with labels (called the index).

Creating a Series

You can create a Series from a Python list:

import pandas as pd
numbers = [10, 20, 30]
s = pd.Series(numbers)
print(s)

Output (the left side is the index, the right side is the value):

0    10
1    20
2    30
dtype: int64

Custom index labels

You can give your own labels:

temperatures = [18.5, 20.0, 21.2]
days = ["Mon", "Tue", "Wed"]
s = pd.Series(temperatures, index=days)
print(s)

Output:

Mon    18.5
Tue    20.0
Wed    21.2
dtype: float64

You can access a value by its label:

print(s["Tue"])   # 20.0

The DataFrame: working with tables

A DataFrame is the main structure you will use in pandas.
It looks and behaves like a table:

each column has a name and behaves like a Series
each row has a label (the index)
different columns can hold different data types

Creating a DataFrame from a dictionary

A very common way to build a DataFrame in code is from a dictionary of lists:

import pandas as pd
data = {
    "name": ["Alice", "Bob", "Charlie"],
    "age":  [25, 30, 35],
    "city": ["London", "Paris", "Berlin"]
}
df = pd.DataFrame(data)
print(df)

Output:

      name  age    city
0    Alice   25  London
1      Bob   30   Paris
2  Charlie   35  Berlin

Custom index for rows

You can set your own row labels:

df = pd.DataFrame(data, index=["a", "b", "c"])
print(df)

Output:

      name  age    city
a    Alice   25  London
b      Bob   30   Paris
c  Charlie   35  Berlin

You can now use "a", "b", "c" to refer to rows.

Loading data from CSV files

In data science, you often receive data in CSV (Comma-Separated Values) files.

A simple CSV file (people.csv) might look like this:

name,age,city
Alice,25,London
Bob,30,Paris
Charlie,35,Berlin

Reading a CSV with pandas

Use read_csv:

import pandas as pd
df = pd.read_csv("people.csv")
print(df)

pandas will:

use the first line of the file as column names
create a default integer index (0, 1, 2, ...)
try to infer a suitable data type for each column

If the file is not in the same folder, you will have to use a full or relative file path (covered earlier in the course).
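
For example, with a relative or absolute path (the file locations here are only placeholders for illustration):

df = pd.read_csv("data/people.csv")        # relative path: a "data" subfolder
df = pd.read_csv("/home/user/people.csv")  # absolute path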

Common `read_csv` options (beginner-friendly)

You don’t need many options to start, but these can be helpful:

sep: the column separator (a comma by default; semicolons are also common)
header: which row contains the column names (header=None means the file has no header row)

Examples:

df = pd.read_csv("data_semicolon.csv", sep=";")
df = pd.read_csv("data_no_header.csv", header=None)

If you set header=None, pandas will create default column names: 0, 1, 2, ...
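
If you prefer readable column names over 0, 1, 2, ..., read_csv also accepts a names parameter; the column names below are just an assumption for illustration:

df = pd.read_csv("data_no_header.csv", header=None, names=["name", "age", "city"])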

Quick ways to inspect your data

Once you have a DataFrame, you’ll usually want to quickly look at it.

Assume this DataFrame:

import pandas as pd
data = {
    "name": ["Alice", "Bob", "Charlie", "Diana", "Eve"],
    "age":  [25, 30, 35, 40, 45],
    "city": ["London", "Paris", "Berlin", "London", "Paris"],
    "salary": [40000, 50000, 60000, 70000, 80000]
}
df = pd.DataFrame(data)

Viewing the first and last rows

print(df.head())
print(df.head(3))  # first 3 rows
print(df.tail())
print(df.tail(2))  # last 2 rows

Basic information about the DataFrame

print(df.shape)  # (5, 4) -> 5 rows, 4 columns
print(df.columns)
# Index(['name', 'age', 'city', 'salary'], dtype='object')
df.info()

It shows:

the number of rows and columns
each column’s name, non-null count, and data type
the approximate memory usage

Quick statistics for numeric columns

Use describe() to get statistics:

print(df.describe())

Output (similar to):

             age        salary
count   5.000000      5.000000
mean   35.000000  60000.000000
std     7.905694  15811.388301
min    25.000000  40000.000000
25%    30.000000  50000.000000
50%    35.000000  60000.000000
75%    40.000000  70000.000000
max    45.000000  80000.000000

You don’t need to fully understand all of these yet, but mean, min, max are especially useful.
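
You can also compute individual statistics directly; for example, with the DataFrame above:

print(df["age"].mean())    # 35.0
print(df["salary"].max())  # 80000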

Selecting columns

Think of each column as a Series inside the DataFrame.

Selecting a single column

Use square brackets with the column name:

ages = df["age"]
print(ages)

Output:

0    25
1    30
2    35
3    40
4    45
Name: age, dtype: int64

You can also access df.age if the column name has no spaces or special characters:

print(df.age)

(Be aware this style can sometimes cause confusion with other attributes, so df["age"] is safer.)
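
To see why, here is a small throwaway example (df2 is made up just for this illustration):

df2 = pd.DataFrame({"first name": ["Alice", "Bob"], "count": [1, 2]})
print(df2["first name"])   # bracket access works even with a space in the name
print(df2["count"])        # gets the column called "count"
print(df2.count())         # calls the DataFrame method count(), not the column!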

Selecting multiple columns

Pass a list of column names inside the brackets:

name_and_city = df[["name", "city"]]
print(name_and_city)

Output:

      name    city
0    Alice  London
1      Bob   Paris
2  Charlie  Berlin
3    Diana  London
4      Eve   Paris

Selecting rows

There are several ways to select rows. Here are simple ones to start.

Selecting by row index position: `iloc`

iloc is used for selection by position (like list indexing):

# First row (position 0)
print(df.iloc[0])
# First three rows (positions 0, 1, 2)
print(df.iloc[0:3])

Selecting by row label: `loc`

loc uses labels. This is more interesting when you set a custom index.
Here, we will use the default numeric index as labels:

# Row with label 2 (third row)
print(df.loc[2])
# Rows with labels 1 to 3 (inclusive!)
print(df.loc[1:3])

Note the difference:

iloc[0:3] selects by position and excludes the end, so you get positions 0, 1, and 2
loc[1:3] selects by label and includes the end, so you get the rows labeled 1, 2, and 3
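
You can also pick out a single value by combining a row and a column selector; a small sketch using the DataFrame above:

print(df.loc[2, "name"])   # row label 2, column "name" -> Charlie
print(df.iloc[2, 0])       # row position 2, column position 0 -> Charlie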

Filtering rows with conditions

Filtering lets you select only rows that match some condition.

Simple condition on one column

For example, rows where age is greater than 30:

older_than_30 = df[df["age"] > 30]
print(older_than_30)

Output:

      name  age    city  salary
2  Charlie   35  Berlin   60000
3    Diana   40  London   70000
4      Eve   45   Paris   80000

What happens here:

df["age"] > 30 produces a Series of True/False values (a boolean mask)
df[...] with that mask keeps only the rows where the mask is True

Combining conditions

Use bitwise operators & (and) and | (or) with parentheses:

mask = (df["age"] > 30) & (df["city"] == "London")
result = df[mask]
print(result)

Output:

    name  age    city  salary
3  Diana   40  London   70000
mask = (df["city"] == "Paris") | (df["salary"] > 75000)
result = df[mask]
print(result)

Adding and modifying columns

You can create new columns or update existing ones using simple expressions.

Adding a new column

Example: convert salary to thousands:

df["salary_k"] = df["salary"] / 1000
print(df)

Output:

      name  age    city  salary  salary_k
0    Alice   25  London   40000      40.0
1      Bob   30   Paris   50000      50.0
2  Charlie   35  Berlin   60000      60.0
3    Diana   40  London   70000      70.0
4      Eve   45   Paris   80000      80.0

Modifying an existing column

For example, increase all salaries by 10%:

df["salary"] = df["salary"] * 1.10
print(df[["name", "salary"]])

This changes the values in the salary column.
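
The same pattern works for text columns; as a small sketch, pandas string methods (available via .str) let you change all values at once:

df["city"] = df["city"].str.upper()
print(df[["name", "city"]])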

Handling missing data (basics)

Real-world data often has missing values.

pandas usually represents missing values as NaN (“Not a Number”).

Example:

data = {
    "name": ["Alice", "Bob", "Charlie"],
    "age":  [25, None, 35],
    "city": ["London", "Paris", None]
}
df = pd.DataFrame(data)
print(df)

Output:

      name   age    city
0    Alice  25.0  London
1      Bob   NaN   Paris
2  Charlie  35.0    NaN

Finding missing values

Use isna() (or isnull(), which is the same):

print(df.isna())

This shows True where the value is missing, False otherwise.

To count missing values in each column:

print(df.isna().sum())

Simple ways to deal with missing values

Two very common options:

  1. Drop rows with missing values:

   df_clean = df.dropna()
   print(df_clean)

  2. Fill missing values with some value (e.g., 0 or "Unknown"):

   df_filled = df.fillna({
       "age": 0,
       "city": "Unknown"
   })
   print(df_filled)

More advanced handling of missing data is usually part of deeper data cleaning, but this gives you a first taste.
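
As one more small sketch (whether this is sensible depends on your data), a numeric column can be filled with its own mean instead of 0:

df["age"] = df["age"].fillna(df["age"].mean())
print(df["age"])   # Bob's missing age becomes 30.0, the mean of 25 and 35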

Grouping and simple aggregation

pandas can quickly answer questions like:

What is the average salary per city?
How many entries are there for each city?

This is done with groupby and aggregation methods.

Assume this DataFrame:

data = {
    "city":   ["London", "Paris", "London", "Berlin", "Paris"],
    "salary": [40000, 50000, 45000, 55000, 60000]
}
df = pd.DataFrame(data)

Average salary per city

avg_salary_per_city = df.groupby("city")["salary"].mean()
print(avg_salary_per_city)

Output:

city
Berlin    55000.0
London    42500.0
Paris     55000.0
Name: salary, dtype: float64

Explanation:

groupby("city") groups the rows by the value in the city column
["salary"] picks the salary column within each group
.mean() computes the average salary for each group
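
Other aggregations work the same way; for example:

print(df.groupby("city")["salary"].max())
print(df.groupby("city")["salary"].agg(["mean", "min", "max"]))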

Counting rows per group

How many entries per city?

count_per_city = df["city"].value_counts()
print(count_per_city)

Output:

Paris     2
London    2
Berlin    1
Name: city, dtype: int64

value_counts() is a quick way to count how many times each value appears in a column.

Saving data to CSV

After working with your DataFrame, you might want to save the result to a CSV file.

Use to_csv:

df.to_csv("output.csv", index=False)

If you omit index=False, the index will be saved as an extra column in the CSV.
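
A quick way to check the result is to read the file back in (a small sanity check, not required):

df_check = pd.read_csv("output.csv")
print(df_check.head())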

Putting it all together: a tiny pandas workflow

Here is a short example that combines several of the ideas you’ve seen:

  1. Read data from a CSV
  2. Inspect it
  3. Filter rows
  4. Add a new column
  5. Compute a grouped statistic
  6. Save the result

import pandas as pd
# 1. Read data
df = pd.read_csv("employees.csv")
# 2. Quick look
print(df.head())
df.info()
# 3. Filter: keep only employees older than 30 (.copy() lets us safely add a column next)
df_filtered = df[df["age"] > 30].copy()
# 4. Add a new column: yearly_bonus = 5% of salary
df_filtered["yearly_bonus"] = df_filtered["salary"] * 0.05
# 5. Average salary per department
avg_salary_per_dept = df_filtered.groupby("department")["salary"].mean()
print(avg_salary_per_dept)
# 6. Save filtered data to a new CSV
df_filtered.to_csv("employees_over_30.csv", index=False)

You don’t need to fully understand every line yet. The goal is to see how pandas lets you describe data tasks with relatively simple, readable code.

Tips for beginners using pandas

Prefer df["column"] over df.column; bracket access always works, even for names with spaces.
Use head(), info(), and describe() right after loading a file to get a quick overview.
Remember that loc selects by label (end included) and iloc selects by position (end excluded).
When combining filter conditions, use & and | and put parentheses around each condition.

As you continue with data science, you’ll use pandas constantly. This chapter gives you the basics so you can start exploring real datasets and simple analyses.
