Why “working with data” matters
In data science, almost everything you do starts with data:
- Reading it from files, databases, or the web
- Cleaning and fixing problems
- Transforming it into a useful shape
- Summarizing it to answer questions
This chapter focuses on basic, practical tasks you’ll do with raw data before using more powerful tools like NumPy and pandas (covered in later sections of this chapter).
We’ll use only built-in Python features here, so you can understand the core ideas first.
Types of data you’ll often see
Real-world data usually comes in a few common formats:
- Plain text: lines of text, one item per line
- CSV files: “comma-separated values”, like a simple spreadsheet
- JSON: structured text often used in web APIs
- Tables: rows and columns (like a spreadsheet)
You’ll later learn libraries that make these easier, but understanding them as plain text first is very helpful.
Reading raw data as text
Most data you’ll use starts as text in a file or from some source.
You can read all lines from a text file into a list of strings:
with open("data.txt", "r", encoding="utf-8") as f:
lines = f.readlines()
print(lines[:5]) # show first 5 linesKey ideas:
- Each element in
linesis a string, usually ending with\n(newline). - You often need to strip whitespace:
clean_lines = [line.strip() for line in lines]From here, your job is to turn these strings into useful Python data: numbers, lists, dictionaries, etc.
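As a small variation, `read()` plus `splitlines()` reads the whole file and drops the trailing newlines in one step; a minimal sketch, assuming the same `data.txt`:

```python
# splitlines() removes the trailing "\n" from each line for you
with open("data.txt", "r", encoding="utf-8") as f:
    clean_lines = f.read().splitlines()
```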
Splitting lines into pieces
Many text formats put several values on one line, separated by a character such as a comma or a tab.
Example line from a CSV-like file:
line = "Alice,25,London"
parts = line.split(",")
print(parts) # ['Alice', '25', 'London']Now you can convert elements to useful types:
name = parts[0]
age = int(parts[1])
city = parts[2]This is the basic pattern of parsing data: turn text into structured Python objects.
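The separator depends on the file format. For example, a tab-separated line (a hypothetical TSV input) uses the same pattern with a different argument to `split()`:

```python
# Same parsing pattern, tab-separated instead of comma-separated
line = "Alice\t25\tLondon"
parts = line.split("\t")
print(parts)  # ['Alice', '25', 'London']
```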
Simple table-like data: lists of lists
A very common way to store tabular data in plain Python is:
- Each row is a list
- All rows are stored in a list
```python
rows = [
    ["name", "age", "city"],
    ["Alice", "25", "London"],
    ["Bob", "31", "Paris"],
    ["Carol", "29", "Berlin"],
]
```

You can:

- Access a row: `rows[1]`
- Access a cell: `rows[1][0]` (row 1, column 0)
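To make the indexing concrete, here is a small sketch that loops over the data rows of the `rows` list defined above (everything after the header) and unpacks each one into named variables:

```python
# rows[0] is the header row; rows[1:] are the data rows
for row in rows[1:]:
    name, age, city = row  # one variable per column
    print(name, "is", age, "and lives in", city)
```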
If you read a CSV file manually:
```python
table = []
with open("people.csv", "r", encoding="utf-8") as f:
    for line in f:
        line = line.strip()
        if not line:
            continue  # skip empty lines
        row = line.split(",")
        table.append(row)
```
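Because every row has the same column order, you can pull out a single column with a list comprehension. A minimal sketch, reusing the `table` built above (assuming a header row and an age column at index 1, as in `rows`):

```python
# Extract the "age" column (index 1), skipping the header row
ages_text = [row[1] for row in table[1:]]
print(ages_text)  # still strings, e.g. ['25', '31', '29']
```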
In later sections you’ll see how pandas makes this much easier, but this model (rows and columns) is the same idea.
Converting text to numbers
Data read from text files starts out as strings.
You must convert it to numbers yourself:

- `int("42")` → `42`
- `float("3.14")` → `3.14`
This is crucial for doing any numeric analysis.
Example:
```python
raw_ages = ["25", "31", "29", "not available"]
clean_ages = []
for x in raw_ages:
    try:
        clean_ages.append(int(x))
    except ValueError:
        # Skip invalid values for now
        pass

print(clean_ages)  # [25, 31, 29]
```
You’ll learn more about try / except in the errors chapter, but notice how we have to handle “bad” values in real data.
Handling missing and bad data
Real-world data is messy:
- Missing values: `""`, `"NA"`, `"N/A"`, `"-"`, `"?"`
- Wrong types: `"twenty"` where you expect a number
- Duplicates or impossible values: negative ages, years far in the future
Common simple strategies:
- Skip invalid rows:

```python
cleaned_rows = []
for row in table[1:]:  # skip the header row
    name, age_text, city = row
    try:
        age = int(age_text)
    except ValueError:
        continue  # skip this row
    cleaned_rows.append([name, age, city])
```

- Replace invalid values with a default:

```python
def parse_age(text):
    try:
        return int(text)
    except ValueError:
        return None  # means "missing"

ages = [parse_age(x) for x in raw_ages]
```

- Filter out suspicious values:

```python
valid_ages = []
for age in ages:
    if age is None:
        continue
    if 0 <= age <= 120:
        valid_ages.append(age)
```

These simple rules are the beginning of data cleaning.
Basic summaries: counts, min, max, average
Once you have clean numeric data, you can calculate simple statistics.
Given a list of numbers $x_1, x_2, \dots, x_n$:
- Count: `len(x)` gives $n$
- Minimum: `min(x)`
- Maximum: `max(x)`
- Sum: `sum(x)`
- Mean (average):

$$\text{mean} = \frac{\text{sum}}{\text{count}}$$
Example:
```python
ages = [25, 31, 29, 40, 22]

count = len(ages)
min_age = min(ages)
max_age = max(ages)
total = sum(ages)
mean_age = total / count

print("Count:", count)
print("Min age:", min_age)
print("Max age:", max_age)
print("Average age:", mean_age)
```

Later, libraries like NumPy and pandas will do this much faster and more easily, but it's important to see how it works at the basic Python level.
Grouping data with dictionaries
You often want to answer questions like:
- “How many people are in each city?”
- “What is the average age per city?”
A simple pattern for grouping is to use dictionaries.
Counting by category
Suppose you have:
```python
cities = ["London", "Paris", "Berlin", "London", "Paris", "London"]
```

You can count how many times each city appears:

```python
counts = {}
for city in cities:
    if city in counts:
        counts[city] += 1
    else:
        counts[city] = 1

print(counts)  # {'London': 3, 'Paris': 2, 'Berlin': 1}
```

This pattern (or variations of it) appears constantly in data work.
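For reference, the standard library's `collections.Counter` implements exactly this counting pattern in one call:

```python
from collections import Counter

cities = ["London", "Paris", "Berlin", "London", "Paris", "London"]
counts = Counter(cities)
print(counts)  # Counter({'London': 3, 'Paris': 2, 'Berlin': 1})
```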
Grouping values by key
Now imagine rows like `[name, age, city]` and you want ages per city:

```python
rows = [
    ["Alice", 25, "London"],
    ["Bob", 31, "Paris"],
    ["Carol", 29, "Berlin"],
    ["Dave", 40, "London"],
    ["Eve", 22, "Paris"],
]

ages_by_city = {}
for name, age, city in rows:
    if city not in ages_by_city:
        ages_by_city[city] = []
    ages_by_city[city].append(age)

print(ages_by_city)
# {'London': [25, 40], 'Paris': [31, 22], 'Berlin': [29]}
```

Now you can compute an average per city:

```python
for city, ages in ages_by_city.items():
    avg_age = sum(ages) / len(ages)
    print(city, "average age:", avg_age)
```

This is a core idea of data analysis: group → aggregate (e.g. "group by city, then compute average age").
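As an aside, `collections.defaultdict` removes the "is the key already there?" check by creating the empty list automatically. A minimal sketch, reusing `rows` from above:

```python
from collections import defaultdict

ages_by_city = defaultdict(list)
for name, age, city in rows:
    ages_by_city[city].append(age)  # a missing key starts as []
```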
Working with CSV data
CSV (comma-separated values) is one of the most common formats in data science.
A tiny example (`people.csv`):

```
name,age,city
Alice,25,London
Bob,31,Paris
Carol,29,Berlin
```

You can read it manually, as shown earlier, but Python's standard `csv` module makes it easier.
Using `csv.reader`
```python
import csv

rows = []
with open("people.csv", newline="", encoding="utf-8") as f:
    reader = csv.reader(f)
    for row in reader:
        rows.append(row)

for r in rows:
    print(r)
```

This gives you a list of lists, similar to our earlier "table" example.
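If you want the header separate from the data rows, a common pattern is to pull the first row off the reader with `next()`; a small sketch, assuming the same `people.csv`:

```python
import csv

with open("people.csv", newline="", encoding="utf-8") as f:
    reader = csv.reader(f)
    header = next(reader)     # ['name', 'age', 'city']
    data_rows = list(reader)  # everything after the header

print(header)
print(data_rows[0])  # ['Alice', '25', 'London']
```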
Using `csv.DictReader`
`DictReader` gives each column a name, using the header row:

```python
import csv

people = []
with open("people.csv", newline="", encoding="utf-8") as f:
    reader = csv.DictReader(f)
    for row in reader:
        people.append(row)

print(people[0])
# {'name': 'Alice', 'age': '25', 'city': 'London'}
```

Values are still strings; you can convert them:

```python
for person in people:
    person["age"] = int(person["age"])
```

Now you can work with the data like this:

```python
ages = [p["age"] for p in people]
print("Average age:", sum(ages) / len(ages))
```

Understanding CSV at this level will make it much easier when you move to `pandas.read_csv()` later.
Simple text-based data cleaning example
Let’s put several ideas together:
Suppose `temperatures.csv` looks like:

```
city,temperature_c
London,15
Paris,20
Berlin,NaN
London,18
Paris,?
Berlin,17
```

We want the average temperature per city, ignoring bad values.
```python
import csv

temps_by_city = {}

with open("temperatures.csv", newline="", encoding="utf-8") as f:
    reader = csv.DictReader(f)
    for row in reader:
        city = row["city"]
        temp_text = row["temperature_c"].strip()

        # Handle missing / bad values
        if temp_text in ("", "NaN", "?", "-"):
            continue
        try:
            temp = float(temp_text)
        except ValueError:
            continue

        if city not in temps_by_city:
            temps_by_city[city] = []
        temps_by_city[city].append(temp)

# Compute averages
for city, temps in temps_by_city.items():
    avg = sum(temps) / len(temps)
    print(city, "average temperature:", avg)
```

This small example already captures a lot of real-world data work:
- Read raw data
- Clean weird values
- Convert types
- Group by a category
- Calculate summary values
Working with JSON data (very briefly)
JSON is a very common format for web data. It maps naturally to Python data structures.
A JSON string might look like:
```python
json_text = '''
[
    {"name": "Alice", "age": 25, "city": "London"},
    {"name": "Bob", "age": 31, "city": "Paris"}
]
'''
```

You can turn it into Python objects with the `json` module:
```python
import json

data = json.loads(json_text)
print(data[0]["name"])  # 'Alice'
```

Later, when you work with web APIs or complex nested data, JSON will be important. For now, remember:
- JSON → lists and dictionaries in Python
- Use `json.loads()` for strings and `json.load()` for files
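For completeness, here is a minimal sketch of the file-based counterpart, assuming a hypothetical `people.json` containing the same list:

```python
import json

# json.load reads JSON from an open file (loads works on strings)
with open("people.json", "r", encoding="utf-8") as f:
    data = json.load(f)

# json.dumps goes the other way: Python objects -> JSON string
print(json.dumps(data, indent=2))
```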
Thinking like a data scientist (at the beginner level)
When you “work with data” in Python, even at a simple level, you are already doing data science thinking:
- Understand the question
  - Example: "What is the average age per city?"
- Understand the data
  - What does each column mean?
  - Which values look wrong or missing?
- Prepare the data
  - Read from a file
  - Clean and convert values
  - Filter invalid or impossible entries
- Transform and summarize
  - Group values (e.g. by city)
  - Compute statistics (count, mean, min, max)
- Check if the result makes sense
  - Do numbers look reasonable?
  - Are there obvious mistakes (e.g. negative ages)?
In the next sections, you’ll see how libraries like NumPy and pandas make these steps faster, safer, and more powerful. But the core ideas you’ve seen here—reading, cleaning, grouping, summarizing—stay the same.