
Introduction to pandas

What is pandas?

In data science, you often work with tables of data: rows and columns, like a spreadsheet.
pandas is a Python library that makes working with this kind of data much easier.

pandas gives you two main tools:

Series: a labeled one-dimensional array, like a single column of data
DataFrame: a two-dimensional table with labeled rows and columns

In this chapter you will:

create Series and DataFrames
load data from CSV files
inspect, select, and filter data
add and modify columns
handle missing values
group data and compute simple statistics
save results back to CSV

You don’t need to memorize everything—this chapter is about getting a feel for how pandas works.

Installing and importing pandas

pandas is not part of Python’s standard library, so you usually install it with pip (covered earlier):

pip install pandas

In your Python code, you almost always import it like this:

import pandas as pd

The as pd part gives the library a short alias, so you can write pd.DataFrame() instead of pandas.DataFrame().

The Series: a labeled one-dimensional array

A Series is like a single column of values with labels (called the index).

Creating a Series

You can create a Series from a Python list:

import pandas as pd
numbers = [10, 20, 30]
s = pd.Series(numbers)
print(s)

Output (the left side is the index, the right side is the value):

0    10
1    20
2    30
dtype: int64

Custom index labels

You can give your own labels:

temperatures = [18.5, 20.0, 21.2]
days = ["Mon", "Tue", "Wed"]
s = pd.Series(temperatures, index=days)
print(s)

Output:

Mon    18.5
Tue    20.0
Wed    21.2
dtype: float64

You can access a value by its label:

print(s["Tue"])   # 20.0

The DataFrame: working with tables

A DataFrame is the main structure you will use in pandas.
It looks and behaves like a table:

each column has a name and behaves like a Series
each row has a label (the index)
different columns can hold different data types

Creating a DataFrame from a dictionary

A very common way to build a DataFrame in code is from a dictionary of lists:

import pandas as pd
data = {
    "name": ["Alice", "Bob", "Charlie"],
    "age":  [25, 30, 35],
    "city": ["London", "Paris", "Berlin"]
}
df = pd.DataFrame(data)
print(df)

Output:

      name  age    city
0    Alice   25  London
1      Bob   30   Paris
2  Charlie   35  Berlin

Custom index for rows

You can set your own row labels:

df = pd.DataFrame(data, index=["a", "b", "c"])
print(df)

Output:

      name  age    city
a    Alice   25  London
b      Bob   30   Paris
c  Charlie   35  Berlin

You can now use "a", "b", "c" to refer to rows.

Loading data from CSV files

In data science, you often receive data in CSV (Comma-Separated Values) files.

A simple CSV file (people.csv) might look like this:

name,age,city
Alice,25,London
Bob,30,Paris
Charlie,35,Berlin

Reading a CSV with pandas

Use read_csv:

import pandas as pd
df = pd.read_csv("people.csv")
print(df)

pandas will:

use the first line of the file as column names
create a default integer index (0, 1, 2, ...)
try to infer a suitable data type for each column

If the file is not in the same folder, you will have to use a full or relative file path (covered earlier in the course).
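
For example, with a relative or absolute path (the file locations here are only placeholders for illustration):

df = pd.read_csv("data/people.csv")        # relative path: a "data" subfolder
df = pd.read_csv("/home/user/people.csv")  # absolute path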

Common `read_csv` options (beginner-friendly)

You don’t need many options to start, but these can be helpful:

sep: the column separator (a comma by default; semicolons are also common)
header: which row contains the column names (header=None means the file has no header row)

Examples:

df = pd.read_csv("data_semicolon.csv", sep=";")
df = pd.read_csv("data_no_header.csv", header=None)

If you set header=None, pandas will create default column names: 0, 1, 2, ...
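
If you prefer readable column names over 0, 1, 2, ..., read_csv also accepts a names parameter; the column names below are just an assumption for illustration:

df = pd.read_csv("data_no_header.csv", header=None, names=["name", "age", "city"])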

Quick ways to inspect your data

Once you have a DataFrame, you’ll usually want to quickly look at it.

Assume this DataFrame:

import pandas as pd
data = {
    "name": ["Alice", "Bob", "Charlie", "Diana", "Eve"],
    "age":  [25, 30, 35, 40, 45],
    "city": ["London", "Paris", "Berlin", "London", "Paris"],
    "salary": [40000, 50000, 60000, 70000, 80000]
}
df = pd.DataFrame(data)

Viewing the first and last rows

print(df.head())
print(df.head(3))  # first 3 rows
print(df.tail())
print(df.tail(2))  # last 2 rows

Basic information about the DataFrame

print(df.shape)  # (5, 4) -> 5 rows, 4 columns
print(df.columns)
# Index(['name', 'age', 'city', 'salary'], dtype='object')
df.info()

It shows:

the number of rows and columns
each column’s name, non-null count, and data type
the approximate memory usage

Quick statistics for numeric columns

Use describe() to get statistics:

print(df.describe())

Output (similar to):

             age        salary
count   5.000000      5.000000
mean   35.000000  60000.000000
std     7.905694  15811.388301
min    25.000000  40000.000000
25%    30.000000  50000.000000
50%    35.000000  60000.000000
75%    40.000000  70000.000000
max    45.000000  80000.000000

You don’t need to fully understand all of these yet, but mean, min, max are especially useful.
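
You can also compute individual statistics directly; for example, with the DataFrame above:

print(df["age"].mean())    # 35.0
print(df["salary"].max())  # 80000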

Selecting columns

Think of each column as a Series inside the DataFrame.

Selecting a single column

Use square brackets with the column name:

ages = df["age"]
print(ages)

Output:

0    25
1    30
2    35
3    40
4    45
Name: age, dtype: int64

You can also access df.age if the column name has no spaces or special characters:

print(df.age)

(Be aware this style can sometimes cause confusion with other attributes, so df["age"] is safer.)
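
To see why, here is a small throwaway example (df2 is made up just for this illustration):

df2 = pd.DataFrame({"first name": ["Alice", "Bob"], "count": [1, 2]})
print(df2["first name"])   # bracket access works even with a space in the name
print(df2["count"])        # gets the column called "count"
print(df2.count())         # calls the DataFrame method count(), not the column!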

Selecting multiple columns

Pass a list of column names inside the brackets:

name_and_city = df[["name", "city"]]
print(name_and_city)

Output:

      name    city
0    Alice  London
1      Bob   Paris
2  Charlie  Berlin
3    Diana  London
4      Eve   Paris

Selecting rows

There are several ways to select rows. Here are simple ones to start.

Selecting by row index position: `iloc`

iloc is used for selection by position (like list indexing):

# First row (position 0)
print(df.iloc[0])
# First three rows (positions 0, 1, 2)
print(df.iloc[0:3])

Selecting by row label: `loc`

loc uses labels. This is more interesting when you set a custom index.
Here, we will use the default numeric index as labels:

# Row with label 2 (third row)
print(df.loc[2])
# Rows with labels 1 to 3 (inclusive!)
print(df.loc[1:3])

Note the difference:

iloc[0:3] selects by position and excludes the end, so you get positions 0, 1, and 2
loc[1:3] selects by label and includes the end, so you get the rows labeled 1, 2, and 3
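
You can also pick out a single value by combining a row and a column selector; a small sketch using the DataFrame above:

print(df.loc[2, "name"])   # row label 2, column "name" -> Charlie
print(df.iloc[2, 0])       # row position 2, column position 0 -> Charlie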

Filtering rows with conditions

Filtering lets you select only rows that match some condition.

Simple condition on one column

For example, rows where age is greater than 30:

older_than_30 = df[df["age"] > 30]
print(older_than_30)

Output:

      name  age    city  salary
2  Charlie   35  Berlin   60000
3    Diana   40  London   70000
4      Eve   45   Paris   80000

What happens here:

df["age"] > 30 produces a Series of True/False values (a boolean mask)
df[...] with that mask keeps only the rows where the mask is True

Combining conditions

Use bitwise operators & (and) and | (or) with parentheses:

mask = (df["age"] > 30) & (df["city"] == "London")
result = df[mask]
print(result)

Output:

    name  age    city  salary
3  Diana   40  London   70000
mask = (df["city"] == "Paris") | (df["salary"] > 75000)
result = df[mask]
print(result)

Adding and modifying columns

You can create new columns or update existing ones using simple expressions.

Adding a new column

Example: convert salary to thousands:

df["salary_k"] = df["salary"] / 1000
print(df)

Output:

      name  age    city  salary  salary_k
0    Alice   25  London   40000      40.0
1      Bob   30   Paris   50000      50.0
2  Charlie   35  Berlin   60000      60.0
3    Diana   40  London   70000      70.0
4      Eve   45   Paris   80000      80.0

Modifying an existing column

For example, increase all salaries by 10%:

df["salary"] = df["salary"] * 1.10
print(df[["name", "salary"]])

This changes the values in the salary column.
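
The same pattern works for text columns; as a small sketch, pandas string methods (available via .str) let you change all values at once:

df["city"] = df["city"].str.upper()
print(df[["name", "city"]])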

Handling missing data (basics)

Real-world data often has missing values.

pandas usually represents missing values as NaN (“Not a Number”).

Example:

data = {
    "name": ["Alice", "Bob", "Charlie"],
    "age":  [25, None, 35],
    "city": ["London", "Paris", None]
}
df = pd.DataFrame(data)
print(df)

Output:

      name   age    city
0    Alice  25.0  London
1      Bob   NaN   Paris
2  Charlie  35.0    NaN

Finding missing values

Use isna() (or isnull(), which is the same):

print(df.isna())

This shows True where the value is missing, False otherwise.

To count missing values in each column:

print(df.isna().sum())

Simple ways to deal with missing values

Two very common options:

  1. Drop rows with missing values:

   df_clean = df.dropna()
   print(df_clean)

  2. Fill missing values with some value (e.g., 0 or "Unknown"):

   df_filled = df.fillna({
       "age": 0,
       "city": "Unknown"
   })
   print(df_filled)

More advanced handling of missing data is usually part of deeper data cleaning, but this gives you a first taste.
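
As one more small sketch (whether this is sensible depends on your data), a numeric column can be filled with its own mean instead of 0:

df["age"] = df["age"].fillna(df["age"].mean())
print(df["age"])   # Bob's missing age becomes 30.0, the mean of 25 and 35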

Grouping and simple aggregation

pandas can quickly answer questions like:

What is the average salary per city?
How many entries are there for each city?

This is done with groupby and aggregation methods.

Assume this DataFrame:

data = {
    "city":   ["London", "Paris", "London", "Berlin", "Paris"],
    "salary": [40000, 50000, 45000, 55000, 60000]
}
df = pd.DataFrame(data)

Average salary per city

avg_salary_per_city = df.groupby("city")["salary"].mean()
print(avg_salary_per_city)

Output:

city
Berlin    55000.0
London    42500.0
Paris     55000.0
Name: salary, dtype: float64

Explanation:

groupby("city") groups the rows by the value in the city column
["salary"] picks the salary column within each group
.mean() computes the average salary for each group
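
Other aggregations work the same way; for example:

print(df.groupby("city")["salary"].max())
print(df.groupby("city")["salary"].agg(["mean", "min", "max"]))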

Counting rows per group

How many entries per city?

count_per_city = df["city"].value_counts()
print(count_per_city)

Output:

Paris     2
London    2
Berlin    1
Name: city, dtype: int64

value_counts() is a quick way to count how many times each value appears in a column.

Saving data to CSV

After working with your DataFrame, you might want to save the result to a CSV file.

Use to_csv:

df.to_csv("output.csv", index=False)

If you omit index=False, the index will be saved as an extra column in the CSV.
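
A quick way to check the result is to read the file back in (a small sanity check, not required):

df_check = pd.read_csv("output.csv")
print(df_check.head())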

Putting it all together: a tiny pandas workflow

Here is a short example that combines several of the ideas you’ve seen:

  1. Read data from a CSV
  2. Inspect it
  3. Filter rows
  4. Add a new column
  5. Compute a grouped statistic
  6. Save the result

import pandas as pd
# 1. Read data
df = pd.read_csv("employees.csv")
# 2. Quick look
print(df.head())
df.info()
# 3. Filter: keep only employees older than 30 (.copy() lets us safely add a column next)
df_filtered = df[df["age"] > 30].copy()
# 4. Add a new column: yearly_bonus = 5% of salary
df_filtered["yearly_bonus"] = df_filtered["salary"] * 0.05
# 5. Average salary per department
avg_salary_per_dept = df_filtered.groupby("department")["salary"].mean()
print(avg_salary_per_dept)
# 6. Save filtered data to a new CSV
df_filtered.to_csv("employees_over_30.csv", index=False)

You don’t need to fully understand every line yet. The goal is to see how pandas lets you describe data tasks with relatively simple, readable code.

Tips for beginners using pandas

Prefer df["column"] over df.column; bracket access always works, even for names with spaces.
Use head(), info(), and describe() right after loading a file to get a quick overview.
Remember that loc selects by label (end included) and iloc selects by position (end excluded).
When combining filter conditions, use & and | and put parentheses around each condition.

As you continue with data science, you’ll use pandas constantly. This chapter gives you the basics so you can start exploring real datasets and simple analyses.
