What is pandas?
In data science, you often work with tables of data: rows and columns, like a spreadsheet.
pandas is a Python library that makes working with this kind of data much easier.
pandas gives you two main tools:
- Series: a one-dimensional labeled array (like a column in a table)
- DataFrame: a two-dimensional labeled table (like a whole spreadsheet)
In this chapter you will:
- Create simple pandas objects
- Load data from a CSV file
- View and explore data
- Select rows and columns
- Do basic calculations and filtering
You don’t need to memorize everything—this chapter is about getting a feel for how pandas works.
Installing and importing pandas
pandas is not part of Python’s standard library, so you usually install it with pip (covered earlier):
```
pip install pandas
```

In your Python code, you almost always import it like this:

```python
import pandas as pd
```
The as pd is a common shortcut, so you can write pd.DataFrame() instead of pandas.DataFrame().
The Series: a labeled one-dimensional array
A Series is like a single column with labels (called the index).
Creating a Series
You can create a Series from a Python list:
```python
import pandas as pd

numbers = [10, 20, 30]
s = pd.Series(numbers)
print(s)
```

Output (the left side is the index, the right side is the value):

```
0    10
1    20
2    30
dtype: int64
```

The numbers 0, 1, 2 on the left are the index labels; `dtype` shows the data type.
Custom index labels
You can give your own labels:
```python
temperatures = [18.5, 20.0, 21.2]
days = ["Mon", "Tue", "Wed"]
s = pd.Series(temperatures, index=days)
print(s)
```

Output:

```
Mon    18.5
Tue    20.0
Wed    21.2
dtype: float64
```

You can access a value by its label:

```python
print(s["Tue"])  # 20.0
```

The DataFrame: working with tables
A DataFrame is the main structure you will use in pandas.
It looks and behaves like a table:
- Rows (with an index)
- Columns (with names)
- Each column can have its own data type
Creating a DataFrame from a dictionary
A very common way to build a DataFrame in code is from a dictionary of lists:
```python
import pandas as pd

data = {
    "name": ["Alice", "Bob", "Charlie"],
    "age": [25, 30, 35],
    "city": ["London", "Paris", "Berlin"]
}
df = pd.DataFrame(data)
print(df)
```

Output:

```
      name  age    city
0    Alice   25  London
1      Bob   30   Paris
2  Charlie   35  Berlin
```

- Columns: `name`, `age`, `city`
- Index: 0, 1, 2 (automatically created)
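Since each column can have its own data type, you can inspect them with `df.dtypes`; a quick sketch reusing the same `data` dictionary:

```python
import pandas as pd

data = {
    "name": ["Alice", "Bob", "Charlie"],
    "age": [25, 30, 35],
    "city": ["London", "Paris", "Berlin"]
}
df = pd.DataFrame(data)

# Each column reports its own dtype: strings show up as "object",
# whole numbers as an integer dtype
print(df.dtypes)
```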
Custom index for rows
You can set your own row labels:
```python
df = pd.DataFrame(data, index=["a", "b", "c"])
print(df)
```

Output:

```
      name  age    city
a    Alice   25  London
b      Bob   30   Paris
c  Charlie   35  Berlin
```
You can now use "a", "b", "c" to refer to rows.
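For example, with these labels in place you can pick out a single row by name using `loc` (covered in more detail later in this chapter); a small sketch:

```python
import pandas as pd

data = {
    "name": ["Alice", "Bob", "Charlie"],
    "age": [25, 30, 35],
    "city": ["London", "Paris", "Berlin"]
}
df = pd.DataFrame(data, index=["a", "b", "c"])

# Selecting a row by its label gives a Series whose index
# is the column names
row = df.loc["b"]
print(row["name"])  # Bob
print(row["age"])   # 30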
Loading data from CSV files
In data science, you often receive data in CSV (Comma-Separated Values) files.
A simple CSV file (people.csv) might look like this:
```
name,age,city
Alice,25,London
Bob,30,Paris
Charlie,35,Berlin
```

Reading a CSV with pandas
Use `read_csv`:

```python
import pandas as pd

df = pd.read_csv("people.csv")
print(df)
```

pandas will:

- Read the file
- Use the first row as column names
- Create a DataFrame for you
If the file is not in the same folder, you will have to use a full or relative file path (covered earlier in the course).
Common `read_csv` options (beginner-friendly)
You don’t need many options to start, but these can be helpful:
- `sep` – change the separator if it’s not a comma. For example: `sep=";"`.
- `header` – set which row is the header (column names). For example: `header=0`.
Examples:
```python
df = pd.read_csv("data_semicolon.csv", sep=";")
df = pd.read_csv("data_no_header.csv", header=None)
```

If you set `header=None`, pandas will create default column names: 0, 1, 2, ...
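You can see this without creating a file by reading CSV text from an in-memory buffer (`io.StringIO` here stands in for a real file):

```python
import io
import pandas as pd

# Two data rows, no header row
csv_text = "Alice,25,London\nBob,30,Paris\n"
df = pd.read_csv(io.StringIO(csv_text), header=None)

# pandas falls back to integer column names 0, 1, 2
print(df.columns.tolist())  # [0, 1, 2]
print(df[1].tolist())       # [25, 30]
```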
Quick ways to inspect your data
Once you have a DataFrame, you’ll usually want to quickly look at it.
Assume this DataFrame:
```python
import pandas as pd

data = {
    "name": ["Alice", "Bob", "Charlie", "Diana", "Eve"],
    "age": [25, 30, 35, 40, 45],
    "city": ["London", "Paris", "Berlin", "London", "Paris"],
    "salary": [40000, 50000, 60000, 70000, 80000]
}
df = pd.DataFrame(data)
```

Viewing the first and last rows

`head()` shows the first few rows (default: 5):

```python
print(df.head())
print(df.head(3))  # first 3 rows
```

`tail()` shows the last few rows:

```python
print(df.tail())
print(df.tail(2))  # last 2 rows
```

Basic information about the DataFrame

`df.shape` shows the number of rows and columns:

```python
print(df.shape)  # (5, 4) -> 5 rows, 4 columns
```

`df.columns` shows column names:

```python
print(df.columns)
# Index(['name', 'age', 'city', 'salary'], dtype='object')
```

`df.info()` gives a summary:

```python
df.info()
```

It shows:
- Number of rows
- Column names and data types
- How many non-missing values each column has
Quick statistics for numeric columns
Use `describe()` to get statistics:

```python
print(df.describe())
```

Output (similar to):

```
             age        salary
count   5.000000      5.000000
mean   35.000000  60000.000000
std     7.905694  15811.388301
min    25.000000  40000.000000
25%    30.000000  50000.000000
50%    35.000000  60000.000000
75%    40.000000  70000.000000
max    45.000000  80000.000000
```

You don’t need to fully understand all of these yet, but `mean`, `min`, and `max` are especially useful.
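The same statistics are also available as individual methods on a column, which is often all you need:

```python
import pandas as pd

data = {
    "age": [25, 30, 35, 40, 45],
    "salary": [40000, 50000, 60000, 70000, 80000]
}
df = pd.DataFrame(data)

# One statistic, one method call
print(df["age"].mean())    # 35.0
print(df["salary"].min())  # 40000
print(df["salary"].max())  # 80000
```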
Selecting columns
Think of each column as a Series inside the DataFrame.
Selecting a single column
Use square brackets with the column name:
```python
ages = df["age"]
print(ages)
```

Output:

```
0    25
1    30
2    35
3    40
4    45
Name: age, dtype: int64
```

You can also write `df.age` if the column name has no spaces or special characters:

```python
print(df.age)
```

(Be aware this style can clash with existing DataFrame attributes and methods, so `df["age"]` is safer.)
Selecting multiple columns
Pass a list of column names inside the brackets:
```python
name_and_city = df[["name", "city"]]
print(name_and_city)
```

Output:

```
      name    city
0    Alice  London
1      Bob   Paris
2  Charlie  Berlin
3    Diana  London
4      Eve   Paris
```

Selecting rows
There are several ways to select rows. Here are simple ones to start.
Selecting by row index position: `iloc`
`iloc` is used for selection by position (like list indexing):

```python
# First row (position 0)
print(df.iloc[0])

# First three rows (positions 0, 1, 2)
print(df.iloc[0:3])
```

Selecting by row label: `loc`

`loc` uses labels. This is most useful when you set a custom index; here, the default numeric index doubles as the labels:

```python
# Row with label 2 (third row)
print(df.loc[2])

# Rows with labels 1 to 3 (inclusive!)
print(df.loc[1:3])
```

Note the difference:

- `iloc[1:3]` uses positions (like normal Python slicing: stops before 3)
- `loc[1:3]` uses labels and includes both 1 and 3
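A small sketch makes the difference easy to see, using a Series whose labels happen to equal its positions:

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40, 50])

# iloc slices by position and stops BEFORE position 3
print(s.iloc[1:3].tolist())  # [20, 30]

# loc slices by label and INCLUDES label 3
print(s.loc[1:3].tolist())   # [20, 30, 40]
```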
Filtering rows with conditions
Filtering lets you select only rows that match some condition.
Simple condition on one column
For example, rows where age is greater than 30:
```python
older_than_30 = df[df["age"] > 30]
print(older_than_30)
```

Output:

```
      name  age    city  salary
2  Charlie   35  Berlin   60000
3    Diana   40  London   70000
4      Eve   45   Paris   80000
```

What happens here:

- `df["age"] > 30` creates a Series of `True`/`False` values
- `df[...]` keeps only rows where the condition is `True`
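You can print the intermediate mask to watch this happen step by step (a sketch with a shortened version of the same data):

```python
import pandas as pd

data = {
    "name": ["Alice", "Bob", "Charlie", "Diana", "Eve"],
    "age": [25, 30, 35, 40, 45],
}
df = pd.DataFrame(data)

# Step 1: the comparison produces a boolean Series
mask = df["age"] > 30
print(mask.tolist())  # [False, False, True, True, True]

# Step 2: indexing with the mask keeps only the True rows
print(df[mask]["name"].tolist())  # ['Charlie', 'Diana', 'Eve']
```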
Combining conditions
Use bitwise operators & (and) and | (or) with parentheses:
- People older than 30 AND living in London:
```python
mask = (df["age"] > 30) & (df["city"] == "London")
result = df[mask]
print(result)
```

Output:

```
    name  age    city  salary
3  Diana   40  London   70000
```

- People in Paris OR with a salary over 75,000:

```python
mask = (df["city"] == "Paris") | (df["salary"] > 75000)
result = df[mask]
print(result)
```

Adding and modifying columns
You can create new columns or update existing ones using simple expressions.
Adding a new column
Example: convert salary to thousands:
```python
df["salary_k"] = df["salary"] / 1000
print(df)
```

Output:

```
      name  age    city  salary  salary_k
0    Alice   25  London   40000      40.0
1      Bob   30   Paris   50000      50.0
2  Charlie   35  Berlin   60000      60.0
3    Diana   40  London   70000      70.0
4      Eve   45   Paris   80000      80.0
```

Modifying an existing column
For example, increase all salaries by 10%:
```python
df["salary"] = df["salary"] * 1.10
print(df[["name", "salary"]])
```
This changes the values in the salary column.
Handling missing data (basics)
Real-world data often has missing values.
pandas usually represents missing values as NaN (“Not a Number”).
Example:
```python
data = {
    "name": ["Alice", "Bob", "Charlie"],
    "age": [25, None, 35],
    "city": ["London", "Paris", None]
}
df = pd.DataFrame(data)
print(df)
```

Output:

```
      name   age    city
0    Alice  25.0  London
1      Bob   NaN   Paris
2  Charlie  35.0     NaN
```

Finding missing values

Use `isna()` (or `isnull()`, which is the same):

```python
print(df.isna())
```

This shows `True` where the value is missing, `False` otherwise.

To count missing values in each column:

```python
print(df.isna().sum())
```

Simple ways to deal with missing values
Two very common options:
- Drop rows with missing values:
```python
df_clean = df.dropna()
print(df_clean)
```

- Fill missing values with some value (e.g., 0 or "Unknown"):

```python
df_filled = df.fillna({
    "age": 0,
    "city": "Unknown"
})
print(df_filled)
```

More advanced handling of missing data is usually part of deeper data cleaning, but this gives you a first taste.
Grouping and simple aggregation
pandas can quickly answer questions like:
- What is the average salary per city?
- How many people live in each city?
This is done with groupby and aggregation methods.
Assume this DataFrame:
```python
data = {
    "city": ["London", "Paris", "London", "Berlin", "Paris"],
    "salary": [40000, 50000, 45000, 55000, 60000]
}
df = pd.DataFrame(data)
```

Average salary per city

```python
avg_salary_per_city = df.groupby("city")["salary"].mean()
print(avg_salary_per_city)
```

Output:

```
city
Berlin    55000.0
London    42500.0
Paris     55000.0
Name: salary, dtype: float64
```

Explanation:

- `groupby("city")` groups the rows by city
- `["salary"]` focuses on the salary column
- `mean()` calculates the average salary for each city
Counting rows per group
How many entries per city?
```python
count_per_city = df["city"].value_counts()
print(count_per_city)
```

Output:

```
Paris     2
London    2
Berlin    1
Name: city, dtype: int64
```

`value_counts()` is a quick way to count how many times each value appears in a column.
Saving data to CSV
After working with your DataFrame, you might want to save the result to a CSV file.
Use `to_csv`:

```python
df.to_csv("output.csv", index=False)
```

- `"output.csv"` is the name of the file to create
- `index=False` tells pandas not to write the row index to the file (usually what you want for simple exports)

If you omit `index=False`, the index will be saved as an extra column in the CSV.
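If you call `to_csv` without a filename, pandas returns the CSV text as a string instead of writing a file, which makes the effect of `index=False` easy to see:

```python
import pandas as pd

df = pd.DataFrame({"name": ["Alice", "Bob"], "age": [25, 30]})

# index=False: only the data columns appear
print(df.to_csv(index=False))

# default: the row index becomes an extra (unnamed) first column
print(df.to_csv())
```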
Putting it all together: a tiny pandas workflow
Here is a short example that combines several of the ideas you’ve seen:
- Read data from a CSV
- Inspect it
- Filter rows
- Add a new column
- Compute a grouped statistic
- Save the result
```python
import pandas as pd

# 1. Read data
df = pd.read_csv("employees.csv")

# 2. Quick look
print(df.head())
df.info()

# 3. Filter: keep only employees older than 30
# (.copy() makes an independent DataFrame, so the next step
# does not trigger a SettingWithCopyWarning)
df_filtered = df[df["age"] > 30].copy()

# 4. Add a new column: yearly_bonus = 5% of salary
df_filtered["yearly_bonus"] = df_filtered["salary"] * 0.05

# 5. Average salary per department
avg_salary_per_dept = df_filtered.groupby("department")["salary"].mean()
print(avg_salary_per_dept)

# 6. Save filtered data to a new CSV
df_filtered.to_csv("employees_over_30.csv", index=False)
```

You don’t need to fully understand every line yet. The goal is to see how pandas lets you describe data tasks with relatively simple, readable code.
Tips for beginners using pandas
- Always start with `df.head()` and `df.info()` to understand what your data looks like.
- Use clear, descriptive column names when possible.
- Don’t be afraid to print intermediate results (`print(...)`) while you’re learning.
- Keep your experiments small: work with a few columns or rows until you understand what your code does.
- Use the official documentation and simple search queries like “pandas filter rows” or “pandas groupby mean” to find examples.
As you continue with data science, you’ll use pandas constantly. This chapter gives you the basics so you can start exploring real datasets and simple analyses.