16.1 What is data science?

Understanding Data Science

Data science is about using data to answer questions and make decisions. It combines three main areas:

Programming (like Python)
Math and statistics
Knowledge of a real-world problem (business, health, sports, etc.)

In simple terms:

$$\text{Data Science} \approx \text{Programming} + \text{Math/Stats} + \text{Domain Knowledge}$$

You don’t need to be an expert in all three to start, but data science lives where they overlap.

What Counts as Data?

Data is any information you can store and process. Some common types:

Numbers

Sales per day, temperatures, exam scores, steps per day

Text

Product reviews, emails, tweets, news articles

Categories

"Yes"/"No", "Male"/"Female", "Beginner"/"Intermediate"/"Advanced"

Time-based data

Stock prices over days, website visits per hour

Images, audio, video

Photos for face recognition, recordings for voice assistants

Data science often turns these into a structured form (like tables) so they can be analyzed.
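As a minimal sketch of what "structured form" can mean, the snippet below stores a few made-up course reviews as a list of dictionaries, where each dictionary is one row and each key is one column. The field names and values are invented for illustration; later sections use libraries such as pandas for the same idea.

```python
# Hypothetical records: each dict is one row, each key is one column.
reviews = [
    {"day": "2024-01-01", "rating": 4, "level": "Beginner", "text": "Great intro course"},
    {"day": "2024-01-02", "rating": 2, "level": "Advanced", "text": "Too slow for me"},
    {"day": "2024-01-02", "rating": 5, "level": "Beginner", "text": "Clear and friendly"},
]

# Once the data is tabular, simple questions become short expressions.
ratings = [row["rating"] for row in reviews]
average_rating = sum(ratings) / len(ratings)
beginner_count = sum(1 for row in reviews if row["level"] == "Beginner")

print(f"Average rating: {average_rating:.2f}")
print(f"Beginner reviews: {beginner_count}")
```

Note that one row mixes several of the data types listed above: a date, a number, a category, and free text.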

The Typical Data Science Process

Data science isn’t just “running a model.” It’s a step-by-step process. A very common loop looks like this:

Define the question
Collect data
Clean and prepare data
Explore and visualize
Model and analyze
Draw conclusions and communicate
Act on results and iterate

1. Define the Question

Good data science starts with a clear question, like:

“Which customers are likely to cancel their subscription?”
“At what times is our website busiest?”
“Which products are most often bought together?”
“Can we predict tomorrow’s temperature from past data?”

A clear question guides what data you need and how you’ll analyze it.

2. Collect Data

Data can come from many places:

Databases (e.g. sales records)
Files (CSV, Excel, JSON, text)
Web APIs (e.g. weather services, social media APIs)
Logs (e.g. website visits, app usage)
Sensors and devices (e.g. fitness trackers)

Sometimes you collect data continuously; other times you work with snapshots.
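To make the "files" case concrete, here is a small sketch that reads CSV data with Python's built-in `csv` module. The CSV text is made up and held in a string so the example runs on its own; with a real file you would pass an opened file object instead.

```python
import csv
import io

# Made-up CSV content standing in for a real file of sales records.
csv_text = """date,product,units
2024-03-01,A,12
2024-03-01,B,7
2024-03-02,A,15
"""

# csv.DictReader yields one dict per data row, keyed by the header line.
rows = list(csv.DictReader(io.StringIO(csv_text)))

print(len(rows))   # number of data rows (the header is not counted)
print(rows[0])     # first row as a dict
```

Every value arrives as text, including the `units` column, which is one reason the cleaning step that follows is needed.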

3. Clean and Prepare Data

Real-world data is often messy:

Missing values (blank cells)
Wrong formats (numbers stored as text)
Duplicated entries
Outliers (values that are unusually large or small)
Inconsistent labels (e.g. "NY" vs "New York")

Data cleaning involves:

Fixing or removing bad data
Converting data types
Selecting the columns (features) you care about
Creating new useful features
(e.g. extracting the year from a date, or the domain from an email address)

This step often takes a large share of the time spent on a data science project.
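The sketch below applies several of these cleaning ideas to a handful of made-up rows: dropping rows with a missing value, removing exact duplicates, converting text to numbers, unifying labels, and creating a new feature. The column names and the `CITY_NAMES` mapping are assumptions chosen for illustration, and dropping incomplete rows is only one of several reasonable ways to handle missing values.

```python
# Made-up raw rows with typical problems.
raw_rows = [
    {"city": "NY", "units": "12", "email": "ann@example.com"},
    {"city": "New York", "units": "", "email": "bob@example.org"},   # missing value
    {"city": "NY", "units": "12", "email": "ann@example.com"},       # duplicate
    {"city": "Boston", "units": "7", "email": "cy@example.com"},
]

# Hypothetical mapping used to make inconsistent labels uniform.
CITY_NAMES = {"NY": "New York"}

cleaned = []
seen = set()
for row in raw_rows:
    if row["units"] == "":
        continue  # drop rows with a missing value
    key = (row["city"], row["units"], row["email"])
    if key in seen:
        continue  # drop exact duplicates
    seen.add(key)
    cleaned.append({
        "city": CITY_NAMES.get(row["city"], row["city"]),  # unify labels
        "units": int(row["units"]),                        # text -> number
        "email_domain": row["email"].split("@")[1],        # new feature
    })

for row in cleaned:
    print(row)
```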

4. Explore and Visualize

Before building any fancy models, you want to understand your data:

Calculate simple statistics:

Minimum, maximum, average (mean), median
Counts and percentages

Look at distributions:

Histograms of ages, salaries, ratings

Examine relationships:

Scatter plots for how one variable changes with another
Grouped bar charts to compare categories

Exploration helps you spot:

Patterns (e.g. “sales spike on weekends”)
Problems in the data
Ideas for what to model
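As a rough illustration of this kind of exploration, the following sketch computes a few summary statistics with Python's built-in `statistics` module and prints a text-only histogram. The exam scores are made up, and the choice of 10-point buckets is arbitrary.

```python
import statistics
from collections import Counter

# Made-up exam scores.
scores = [55, 62, 62, 70, 71, 75, 78, 84, 90, 98]

lowest = min(scores)
highest = max(scores)
mean_score = statistics.mean(scores)
median_score = statistics.median(scores)
print(f"min={lowest} max={highest} mean={mean_score} median={median_score}")

# Group scores into 10-point buckets (50-59, 60-69, ...) to see the distribution.
buckets = Counter((score // 10) * 10 for score in scores)
for start in sorted(buckets):
    print(f"{start}-{start + 9}: {'#' * buckets[start]}")
```

Even this crude histogram shows where most scores cluster, which is the kind of pattern exploration is meant to surface.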

5. Model and Analyze

Models are mathematical tools that help you describe or predict things. In data science, models are used to:

Describe

“On average, larger houses have higher prices.”

Predict

“Given these features, what is the likely price of this house?”

Classify

“Is this email spam or not spam?”

Group similar items

“Group customers into segments based on their behavior.”

At a beginner level, you might start with:

Simple averages and totals
Basic correlations (do two variables move together?)
Very simple prediction models (like linear regression)

You don’t need advanced math right away; you can still do useful work with basic tools.
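To show how small a "very simple prediction model" can be, here is a sketch of a straight-line fit (simple linear regression by least squares) written in plain Python. The house sizes and prices are invented, and `fit_line` and `predict` are hypothetical helper names, not functions from any library.

```python
# Made-up data: house size in square metres and price in thousands.
sizes = [50, 70, 90, 110, 130]
prices = [150, 200, 260, 310, 360]


def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Slope = covariance of x and y divided by variance of x.
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(size, slope, intercept):
    """Predict a price for a given size using the fitted line."""
    return intercept + slope * size


slope, intercept = fit_line(sizes, prices)
estimate = predict(100, slope, intercept)
print(f"price is roughly {intercept:.1f} + {slope:.2f} * size")
print(f"estimated price for 100 square metres: {estimate:.1f}")
```

The slope describes the relationship ("larger houses have higher prices"), while `predict` uses the same line to estimate a price for a size that is not in the data.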

6. Draw Conclusions and Communicate

Data science only matters if the results lead to actions.

Examples of conclusions:

“We should stock more of product A on weekends.”
“Customers who use feature X are less likely to cancel.”
“The average delivery time is increasing month by month.”

Communication often involves:

Simple charts and tables
Short explanations in plain language
Clear answers to the original question

7. Act and Iterate

After sharing results, people may:

Change a product or process
Launch a new marketing campaign
Adjust prices or inventory
Run an experiment (A/B test) to confirm ideas

New actions generate new data, which can be analyzed again. Data science is usually an ongoing cycle.

How Data Science Differs from Related Fields

Data science overlaps with several other areas, but the focus is slightly different.

Data Science vs Data Analysis

Data analysis often focuses on:

Understanding what happened
Creating reports and dashboards
Answering specific questions

Data science often includes:

Data analysis
Building predictive models
More automation and experimentation

In practice, people use these terms loosely, and roles can overlap.

Data Science vs Machine Learning

Machine learning is mainly about building algorithms that learn from data to make predictions or decisions.
Data science uses machine learning as one of its tools, but also includes:

Data collection and cleaning
Business understanding
Communicating and deploying results

You can do data science without heavy machine learning, especially when starting out.

Where Data Science Is Used

Data science shows up in many everyday products and services:

Online shopping

“You might also like” recommendations
Predicting which products to show first

Streaming services

Suggesting movies, shows, or songs you’ll enjoy

Finance

Detecting fraud in credit card transactions
Assessing loan risk

Healthcare

Predicting disease risk from patient data
Planning hospital staffing

Sports

Analyzing player performance
Optimizing team strategies

Transportation

Ride-sharing pricing and car positioning
Traffic predictions for navigation apps

Anywhere decisions are made using data, data science is relevant.

Why Python Is Popular in Data Science

Python is one of the main languages used in data science because:

It’s relatively easy to read and write
It has powerful libraries for:

Handling data (e.g. tables)
Doing math and statistics
Visualizing data

It works well with notebooks and tools that let you:

Write code
See results
Add notes and explanations in one place

In the remaining sections of this chapter, you’ll see how Python is used in practice for:

Working with data
Using libraries like NumPy and pandas
Visualizing data and doing simple analyses

The Skills Data Scientists Use

Over time, a data scientist typically develops skills in:

Programming

Writing scripts to load, clean, and transform data

Statistics and probability

Understanding averages, variability, uncertainty, and significance

Data visualization

Creating charts that make patterns easy to see

Domain knowledge

Understanding the field they work in (e.g. finance, health, marketing)

Communication

Turning technical results into clear, actionable insights

You don’t need all of these to begin; you can build them step by step.

A Simple Example Flow

Here’s a simplified, conceptual walkthrough of what a small data science task might look like:

Question:
“At what time of day do we get the most website visitors?”
Data:
Website logs with:

Visit time
User ID
Page URL

Steps:

Convert visit times to “hour of day” (0–23)
Count visits per hour
Plot a bar chart of visits vs hour

Result:
You might discover that:

Traffic peaks around 8–10 pm
Afternoons are quieter

Action:
Decide to:

Schedule important announcements in the evening
Plan maintenance for low-traffic hours

This is data science even without advanced models: using data to answer a real question and guide decisions.
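For readers curious how the steps above might translate into code, here is a minimal sketch using only Python's standard library. The log entries are made up, the timestamp format is an assumption, and a text bar chart stands in for a real plot.

```python
from collections import Counter
from datetime import datetime

# Made-up website log: (visit time, user ID, page URL).
log_lines = [
    ("2024-05-01 08:15:00", "u1", "/home"),
    ("2024-05-01 20:05:00", "u2", "/pricing"),
    ("2024-05-01 20:40:00", "u3", "/home"),
    ("2024-05-01 21:10:00", "u1", "/blog"),
    ("2024-05-02 14:30:00", "u4", "/home"),
    ("2024-05-02 20:55:00", "u2", "/home"),
]

# Steps 1 and 2: convert each visit time to an hour of day (0-23) and count.
visits_per_hour = Counter(
    datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S").hour
    for timestamp, _user_id, _url in log_lines
)

# Step 3: a text bar chart of visits per hour.
for hour in sorted(visits_per_hour):
    print(f"{hour:02d}:00  {'#' * visits_per_hour[hour]}")

busiest_hour, busiest_count = visits_per_hour.most_common(1)[0]
print(f"Busiest hour: {busiest_hour}:00 with {busiest_count} visits")
```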

What You’ll Explore Next

In the following sections of this chapter, you will:

Load and inspect simple datasets
Do basic calculations and summaries
Visualize data with simple plots
See how libraries make these tasks easier

This will give you a practical first taste of doing data science with Python.
