
Working with data

Why “working with data” matters

In data science, almost everything you do starts with data: reading it, cleaning it, and summarizing it.

This chapter focuses on basic, practical tasks you’ll do with raw data before using more powerful tools like NumPy and pandas (covered in later sections of this chapter).

We’ll use only built-in Python features here, so you can understand the core ideas first.

Types of data you’ll often see

Real-world data usually comes in a few common formats: plain text files, CSV tables, and JSON documents.

You’ll later learn libraries that make these easier, but understanding them as plain text first is very helpful.

Reading raw data as text

Most data you’ll use starts as text in a file or from some source.

You can read all lines from a text file into a list of strings:

with open("data.txt", "r", encoding="utf-8") as f:
    lines = f.readlines()
print(lines[:5])  # show first 5 lines

Key ideas: readlines() returns one string per line, and each string keeps its trailing newline character. You can strip that whitespace with a list comprehension:

clean_lines = [line.strip() for line in lines]

From here, your job is to turn these strings into useful Python data: numbers, lists, dictionaries, etc.
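For instance, here is a minimal sketch of that conversion, assuming (hypothetically) a file containing one number per line, simulated with a list of already-stripped strings:

```python
# Simulated stripped lines from a file (hypothetical data)
clean_lines = ["3.5", "7.0", "2.25"]

# Convert each string to a float
values = [float(line) for line in clean_lines]
print(values)  # [3.5, 7.0, 2.25]
```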

Splitting lines into pieces

Many text formats put several values on one line, separated by a character such as a comma or a tab.

Example line from a CSV-like file:

line = "Alice,25,London"
parts = line.split(",")
print(parts)  # ['Alice', '25', 'London']

Now you can convert elements to useful types:

name = parts[0]
age = int(parts[1])
city = parts[2]

This is the basic pattern of parsing data: turn text into structured Python objects.

Simple table-like data: lists of lists

A very common way to store tabular data in plain Python is:

rows = [
    ["name", "age", "city"],
    ["Alice", "25", "London"],
    ["Bob", "31", "Paris"],
    ["Carol", "29", "Berlin"],
]

You can access rows by index, loop over them, and pick out individual columns by position.
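For example, a quick sketch of indexing rows and pulling out one column from the table above:

```python
rows = [
    ["name", "age", "city"],
    ["Alice", "25", "London"],
    ["Bob", "31", "Paris"],
    ["Carol", "29", "Berlin"],
]

header = rows[0]        # first row is the header
first_person = rows[1]  # ['Alice', '25', 'London']

# Extract the "city" column (index 2) from all data rows
cities = [row[2] for row in rows[1:]]
print(cities)  # ['London', 'Paris', 'Berlin']
```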

If you read a CSV file manually:

table = []
with open("people.csv", "r", encoding="utf-8") as f:
    for line in f:
        line = line.strip()
        if not line:
            continue  # skip empty lines
        row = line.split(",")
        table.append(row)

In later sections you’ll see how pandas makes this much easier, but this model (rows and columns) is the same idea.

Converting text to numbers

Data read from files is always text at first.
You must convert it yourself, typically with int() for whole numbers and float() for decimals.

This is crucial for doing any numeric analysis.

Example:

raw_ages = ["25", "31", "29", "not available"]
clean_ages = []
for x in raw_ages:
    try:
        clean_ages.append(int(x))
    except ValueError:
        # Skip invalid values for now
        pass
print(clean_ages)  # [25, 31, 29]

You’ll learn more about try / except in the errors chapter, but notice how we have to handle “bad” values in real data.

Handling missing and bad data

Real-world data is messy: values can be missing, mistyped, or simply wrong.

Common simple strategies:

  1. Skip invalid rows
   cleaned_rows = []
   for row in table[1:]:   # skip header
       name, age_text, city = row
       try:
           age = int(age_text)
       except ValueError:
           continue  # skip this row
       cleaned_rows.append([name, age, city])
  2. Replace invalid values with a default
   def parse_age(text):
       try:
           return int(text)
       except ValueError:
           return None  # means "missing"
   ages = [parse_age(x) for x in raw_ages]
  3. Filter out suspicious values
   valid_ages = []
   for age in ages:
       if age is None:
           continue
       if 0 <= age <= 120:
           valid_ages.append(age)

These simple rules are the beginning of data cleaning.

Basic summaries: counts, min, max, average

Once you have clean numeric data, you can calculate simple statistics.

Given a list of numbers $x_1, x_2, \dots, x_n$: the count is $n$, the minimum and maximum are the smallest and largest of the $x_i$, and the average (mean) is $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$.

Example:

ages = [25, 31, 29, 40, 22]
count = len(ages)
min_age = min(ages)
max_age = max(ages)
total = sum(ages)
mean_age = total / count
print("Count:", count)
print("Min age:", min_age)
print("Max age:", max_age)
print("Average age:", mean_age)

Later, libraries like NumPy and pandas will do this much faster and more easily, but it’s important to see how it works at the basic Python level.

Grouping data with dictionaries

You often want to answer questions like: how many people are in each city, or what is the average age per city?

A simple pattern for grouping is to use dictionaries.

Counting by category

Suppose you have:

cities = ["London", "Paris", "Berlin", "London", "Paris", "London"]

You can count how many times each city appears:

counts = {}
for city in cities:
    if city in counts:
        counts[city] += 1
    else:
        counts[city] = 1
print(counts)  # {'London': 3, 'Paris': 2, 'Berlin': 1}

This pattern (or variations of it) appears constantly in data work.
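As a side note, the standard library's collections.Counter does the same counting in one step; a brief sketch:

```python
from collections import Counter

cities = ["London", "Paris", "Berlin", "London", "Paris", "London"]

# Counter builds the same category -> count mapping in one call
counts = Counter(cities)
print(counts["London"])  # 3
```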

Grouping values by key

Now imagine rows like [name, age, city] and you want ages per city:

rows = [
    ["Alice", 25, "London"],
    ["Bob",   31, "Paris"],
    ["Carol", 29, "Berlin"],
    ["Dave",  40, "London"],
    ["Eve",   22, "Paris"],
]
ages_by_city = {}
for name, age, city in rows:
    if city not in ages_by_city:
        ages_by_city[city] = []
    ages_by_city[city].append(age)
print(ages_by_city)
# {'London': [25, 40], 'Paris': [31, 22], 'Berlin': [29]}

Now you can compute an average per city:

for city, ages in ages_by_city.items():
    avg_age = sum(ages) / len(ages)
    print(city, "average age:", avg_age)

This is a core idea of data analysis: group → aggregate (e.g. “group by city, then compute average age”).
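The same group-then-aggregate idea can be written a little more compactly with dict.setdefault, which creates the empty list only the first time a key is seen; a sketch using the same rows:

```python
rows = [
    ["Alice", 25, "London"],
    ["Bob",   31, "Paris"],
    ["Carol", 29, "Berlin"],
    ["Dave",  40, "London"],
    ["Eve",   22, "Paris"],
]

ages_by_city = {}
for name, age, city in rows:
    # setdefault returns the existing list, or inserts and returns a new one
    ages_by_city.setdefault(city, []).append(age)

# Aggregate: average age per city
averages = {city: sum(ages) / len(ages) for city, ages in ages_by_city.items()}
print(averages)  # {'London': 32.5, 'Paris': 26.5, 'Berlin': 29.0}
```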

Working with CSV data

CSV (comma-separated values) is one of the most common formats in data science.

A tiny example (people.csv):

name,age,city
Alice,25,London
Bob,31,Paris
Carol,29,Berlin

You can read it manually, as shown earlier, but Python’s standard csv module makes it easier.

Using `csv.reader`

import csv
rows = []
with open("people.csv", newline="", encoding="utf-8") as f:
    reader = csv.reader(f)
    for row in reader:
        rows.append(row)
for r in rows:
    print(r)

This gives you a list of lists, similar to our earlier “table” example.

Using `csv.DictReader`

DictReader gives each column a name, using the header row:

import csv
people = []
with open("people.csv", newline="", encoding="utf-8") as f:
    reader = csv.DictReader(f)
    for row in reader:
        people.append(row)
print(people[0])
# {'name': 'Alice', 'age': '25', 'city': 'London'}

Values are still strings; you can convert them:

for person in people:
    person["age"] = int(person["age"])

Now you can work with the data like this:

ages = [p["age"] for p in people]
print("Average age:", sum(ages) / len(ages))

Understanding CSV at this level will make it much easier when you move to pandas.read_csv() later.

Simple text-based data cleaning example

Let’s put several ideas together: reading a CSV file, skipping bad values, grouping by city, and computing averages.

Suppose temperatures.csv looks like:

city,temperature_c
London,15
Paris,20
Berlin,NaN
London,18
Paris,?
Berlin,17

We want the average temperature per city, ignoring bad values.

import csv
temps_by_city = {}
with open("temperatures.csv", newline="", encoding="utf-8") as f:
    reader = csv.DictReader(f)
    for row in reader:
        city = row["city"]
        temp_text = row["temperature_c"].strip()
        # Handle missing / bad values
        if temp_text in ("", "NaN", "?", "-"):
            continue
        try:
            temp = float(temp_text)
        except ValueError:
            continue
        if city not in temps_by_city:
            temps_by_city[city] = []
        temps_by_city[city].append(temp)
# Compute averages
for city, temps in temps_by_city.items():
    avg = sum(temps) / len(temps)
    print(city, "average temperature:", avg)

This small example already captures a lot of real-world data work: reading a file, handling missing and invalid values, grouping by a key, and aggregating per group.

Working with JSON data (very briefly)

JSON is a very common format for web data. It maps naturally to Python data structures.

A JSON string might look like:

json_text = '''
[
  {"name": "Alice", "age": 25, "city": "London"},
  {"name": "Bob",   "age": 31, "city": "Paris"}
]
'''

You can turn it into Python objects with the json module:

import json
data = json.loads(json_text)
print(data[0]["name"])  # 'Alice'
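Going the other way, json.dumps turns Python objects back into a JSON string, so the two functions form a round trip; a brief sketch:

```python
import json

data = [{"name": "Alice", "age": 25, "city": "London"}]

# dumps: Python -> JSON string; loads: JSON string -> Python
text = json.dumps(data)
restored = json.loads(text)
print(restored == data)  # True
```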

Later, when you work with web APIs or complex nested data, JSON will be important. For now, remember: JSON arrays become Python lists, and JSON objects become Python dictionaries.

Thinking like a data scientist (at the beginner level)

When you “work with data” in Python, even at a simple level, you are already doing data science thinking:

  1. Understand the question
    Example: “What is the average age per city?”
  2. Understand the data
    • What does each column mean?
    • Which values look wrong or missing?
  3. Prepare the data
    • Read from a file
    • Clean and convert values
    • Filter invalid or impossible entries
  4. Transform and summarize
    • Group values (e.g. by city)
    • Compute statistics (count, mean, min, max)
  5. Check if the result makes sense
    • Do numbers look reasonable?
    • Are there obvious mistakes (e.g. negative ages)?

In the next sections, you’ll see how libraries like NumPy and pandas make these steps faster, safer, and more powerful. But the core ideas you’ve seen here—reading, cleaning, grouping, summarizing—stay the same.
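The steps above can be sketched as one tiny pipeline, using made-up inline rows instead of a file:

```python
# Hypothetical raw rows: [name, age_text, city]; ages are still strings,
# and one of them is invalid.
raw_rows = [
    ["Alice", "25", "London"],
    ["Bob", "not available", "Paris"],
    ["Carol", "29", "Berlin"],
    ["Dave", "40", "London"],
]

# Prepare: convert ages, filter invalid or impossible values
ages_by_city = {}
for name, age_text, city in raw_rows:
    try:
        age = int(age_text)
    except ValueError:
        continue  # skip rows with unparseable ages
    if not (0 <= age <= 120):
        continue  # skip impossible ages
    ages_by_city.setdefault(city, []).append(age)

# Transform and summarize: average age per city
for city, ages in ages_by_city.items():
    print(city, "average age:", sum(ages) / len(ages))
```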
