Understanding Data Science
Data science is about using data to answer questions and make decisions. It combines three main areas:
- Programming (like Python)
- Math and statistics
- Knowledge of a real-world problem (business, health, sports, etc.)
In simple terms:
$$\text{Data Science} \approx \text{Programming} + \text{Math/Stats} + \text{Domain Knowledge}$$
You don’t need to be an expert in all three to start, but data science lives where they overlap.
What Counts as Data?
Data is any information you can store and process. Some common types:
- Numbers
  - Sales per day, temperatures, exam scores, steps per day
- Text
  - Product reviews, emails, tweets, news articles
- Categories
  - "Yes"/"No", "Male"/"Female", "Beginner"/"Intermediate"/"Advanced"
- Time-based data
  - Stock prices over days, website visits per hour
- Images, audio, video
  - Photos for face recognition, recordings for voice assistants
Data science often turns these into a structured form (like tables) so they can be analyzed.
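As a rough illustration of "structured form", the sketch below turns two pieces of free text into table-like rows. The reviews and the column names (`review_id`, `text`, `word_count`) are invented for this example.

```python
# Hypothetical raw text data: two product reviews.
reviews = ["Great product, fast delivery", "Broke after two days"]

# One row per review, with the same columns in every row (a simple table).
table = [
    {"review_id": i, "text": text, "word_count": len(text.split())}
    for i, text in enumerate(reviews, start=1)
]

for row in table:
    print(row)
```

Once every row has the same columns, it becomes easy to sort, filter, and summarize the data.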
The Typical Data Science Process
Data science isn’t just “running a model.” It’s a step-by-step process. A very common loop looks like this:
- Define the question
- Collect data
- Clean and prepare data
- Explore and visualize
- Model and analyze
- Draw conclusions and communicate
- Act on results and iterate
1. Define the Question
Good data science starts with a clear question, like:
- “Which customers are likely to cancel their subscription?”
- “At what times is our website busiest?”
- “Which products are most often bought together?”
- “Can we predict tomorrow’s temperature from past data?”
A clear question guides what data you need and how you’ll analyze it.
2. Collect Data
Data can come from many places:
- Databases (e.g. sales records)
- Files (CSV, Excel, JSON, text)
- Web APIs (e.g. weather services, social media APIs)
- Logs (e.g. website visits, app usage)
- Sensors and devices (e.g. fitness trackers)
Sometimes you collect data continuously; other times you work with snapshots.
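A minimal sketch of reading two of the file formats listed above, using only Python's standard library. The file contents and field names (`date`, `units_sold`, `city`, `temp_c`) are made up for illustration; with real data you would open a file or call an API instead of using inline strings.

```python
import csv
import io
import json

# Hypothetical CSV content. With a real file you would use
# open("some_file.csv", newline="") instead of io.StringIO.
csv_text = "date,units_sold\n2024-01-01,12\n2024-01-02,7\n"

# DictReader turns each row into a dict keyed by the header names.
sales_rows = list(csv.DictReader(io.StringIO(csv_text)))

# Hypothetical JSON content, similar to what a web API might return.
json_text = '{"city": "Oslo", "temp_c": [1.5, -0.5, 2.0]}'
weather = json.loads(json_text)

print(sales_rows[0])
print(weather["temp_c"])
```

Note that the CSV values arrive as text (`"12"`, not `12`), which is one reason the cleaning step that follows is needed.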
3. Clean and Prepare Data
Real-world data is often messy:
- Missing values (blank cells)
- Wrong formats (numbers stored as text)
- Duplicated entries
- Outliers (values that are unusually large or small)
- Inconsistent labels (e.g. "NY" vs "New York")
Data cleaning involves:
- Fixing or removing bad data
- Converting data types
- Selecting the columns (features) you care about
- Creating new useful features (e.g. extracting the year from a date, or the domain from an email address)
This step usually takes a large portion of a data scientist’s time.
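The cleaning tasks above can be sketched in plain Python as follows. The records, the `CITY_LABELS` mapping, and the `clean` helper are all hypothetical. Dropping rows with a missing amount is just one possible choice; filling in a default value is another.

```python
from datetime import date

# Hypothetical raw records showing typical problems.
raw_rows = [
    {"email": "ana@example.com", "city": "NY", "amount": "19.99", "signup": "2023-05-14"},
    {"email": "ana@example.com", "city": "NY", "amount": "19.99", "signup": "2023-05-14"},  # duplicate
    {"email": "bo@example.org", "city": "New York", "amount": "", "signup": "2024-01-02"},  # missing value
    {"email": "cy@example.com", "city": "Boston", "amount": "5", "signup": "2022-11-30"},
]

# Map inconsistent labels to a single spelling.
CITY_LABELS = {"NY": "New York"}


def clean(rows):
    """Return cleaned copies of the rows (hypothetical helper)."""
    seen = set()
    cleaned = []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key in seen:           # skip exact duplicates
            continue
        seen.add(key)
        if row["amount"] == "":   # skip rows with a missing amount
            continue
        cleaned.append({
            "city": CITY_LABELS.get(row["city"], row["city"]),
            "amount": float(row["amount"]),                          # text -> number
            "signup_year": date.fromisoformat(row["signup"]).year,   # new feature
            "email_domain": row["email"].split("@")[1],              # new feature
        })
    return cleaned


clean_rows = clean(raw_rows)
for row in clean_rows:
    print(row)
```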
4. Explore and Visualize
Before building any fancy models, you want to understand your data:
- Calculate simple statistics:
  - Minimum, maximum, average (mean), median
  - Counts and percentages
- Look at distributions:
  - Histograms of ages, salaries, ratings
- Examine relationships:
  - Scatter plots for how one variable changes with another
  - Grouped bar charts to compare categories
Exploration helps you spot:
- Patterns (e.g. “sales spike on weekends”)
- Problems in the data
- Ideas for what to model
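The simple statistics mentioned above can be computed with Python's standard library alone. This is a minimal sketch; the exam scores and skill levels are invented sample data.

```python
import statistics
from collections import Counter

# Hypothetical sample data.
scores = [55, 62, 70, 70, 81, 90, 97]
levels = ["Beginner", "Advanced", "Beginner", "Intermediate", "Beginner"]

# Basic numeric summary.
summary = {
    "min": min(scores),
    "max": max(scores),
    "mean": statistics.mean(scores),
    "median": statistics.median(scores),
}

# Counts and percentages for a categorical column.
counts = Counter(levels)
percentages = {level: 100 * n / len(levels) for level, n in counts.items()}

print(summary)
print(counts)
print(percentages)
```

Comparing the mean and the median is a quick first check: a large gap between them often hints at outliers or a skewed distribution.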
5. Model and Analyze
Models are mathematical tools that help you describe or predict things. In data science, models are used to:
- Describe
  - “On average, larger houses have higher prices.”
- Predict
  - “Given these features, what is the likely price of this house?”
- Classify
  - “Is this email spam or not spam?”
- Group similar items
  - “Group customers into segments based on their behavior.”
At a beginner level, you might start with:
- Simple averages and totals
- Basic correlations (do two variables move together?)
- Very simple prediction models (like linear regression)
You don’t need advanced math right away; you can still do useful work with basic tools.
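To make "very simple prediction model" concrete, here is a sketch of linear regression with one input, fitted by ordinary least squares in plain Python. The house sizes and prices are invented, and `fit_line` is a hypothetical helper written for this example. In practice a library would usually handle this.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sum of squared deviations of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: house size in square meters, price in thousands.
sizes = [50, 80, 100, 120, 150]
prices = [150, 240, 310, 355, 445]

slope, intercept = fit_line(sizes, prices)

# Describe: a positive slope means larger houses tend to cost more.
print(f"slope = {slope:.2f}, intercept = {intercept:.2f}")

# Predict: estimated price for a 110 square meter house.
predicted = slope * 110 + intercept
print(f"predicted price = {predicted:.1f}")
```

The same fitted line serves both purposes from the list above: the slope describes the relationship, and plugging in a new size predicts a price.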
6. Draw Conclusions and Communicate
Data science only matters if the results lead to actions.
Examples of conclusions:
- “We should stock more of product A on weekends.”
- “Customers who use feature X are less likely to cancel.”
- “The average delivery time is increasing month by month.”
Communication often involves:
- Simple charts and tables
- Short explanations in plain language
- Clear answers to the original question
7. Act and Iterate
After sharing results, people may:
- Change a product or process
- Launch a new marketing campaign
- Adjust prices or inventory
- Run an experiment (A/B test) to confirm ideas
New actions generate new data, which can be analyzed again. Data science is usually an ongoing cycle.
How Data Science Differs from Related Fields
Data science overlaps with several other areas, but the focus is slightly different.
Data Science vs Data Analysis
- Data analysis often focuses on:
  - Understanding what happened
  - Creating reports and dashboards
  - Answering specific questions
- Data science often includes:
  - Data analysis
  - plus building predictive models
  - plus more automation and experimentation
In practice, people use these terms loosely, and roles can overlap.
Data Science vs Machine Learning
- Machine learning is mainly about building algorithms that learn from data to make predictions or decisions.
- Data science uses machine learning as one of its tools, but also includes:
  - Data collection and cleaning
  - Business understanding
  - Communicating and deploying results
You can do data science without heavy machine learning, especially when starting out.
Where Data Science Is Used
Data science shows up in many everyday products and services:
- Online shopping
  - “You might also like” recommendations
  - Predicting which products to show first
- Streaming services
  - Suggesting movies, shows, or songs you’ll enjoy
- Finance
  - Detecting fraud in credit card transactions
  - Assessing loan risk
- Healthcare
  - Predicting disease risk from patient data
  - Planning hospital staffing
- Sports
  - Analyzing player performance
  - Optimizing team strategies
- Transportation
  - Ride-sharing pricing and car positioning
  - Traffic predictions for navigation apps
Anywhere decisions are made using data, data science is relevant.
Why Python Is Popular in Data Science
Python is one of the main languages used in data science because:
- It’s relatively easy to read and write
- It has powerful libraries for:
  - Handling data (e.g. tables)
  - Doing math and statistics
  - Visualizing data
- It works well with notebooks and tools that let you:
  - Write code
  - See results
  - Add notes and explanations in one place
In the remaining sections of this chapter, you’ll see how Python is used in practice for:
- Working with data
- Using libraries like NumPy and pandas
- Visualizing data and doing simple analyses
The Skills Data Scientists Use
Over time, a data scientist typically develops skills in:
- Programming
  - Writing scripts to load, clean, and transform data
- Statistics and probability
  - Understanding averages, variability, uncertainty, and significance
- Data visualization
  - Creating charts that make patterns easy to see
- Domain knowledge
  - Understanding the field they work in (e.g. finance, health, marketing)
- Communication
  - Turning technical results into clear, actionable insights
You don’t need all of these to begin; you can build them step by step.
A Simple Example Flow
Here’s a very simplified example of what a small data science task might look like conceptually (without focusing on the code yet):
- Question: “At what time of day do we get the most website visitors?”
- Data: Website logs with:
  - Visit time
  - User ID
  - Page URL
- Steps:
  - Convert visit times to “hour of day” (0–23)
  - Count visits per hour
  - Plot a bar chart of visits vs hour
- Result: You might discover that:
  - Traffic peaks around 8–10 pm
  - Afternoons are quieter
- Action: Decide to:
  - Schedule important announcements in the evening
  - Plan maintenance for low-traffic hours
This is data science even without advanced models: using data to answer a real question and guide decisions.
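For readers who want a preview, the steps in this example could be sketched in Python roughly as follows. The log entries are invented for illustration, and a row of `#` characters stands in for a real bar chart so the sketch needs no plotting library.

```python
from collections import Counter
from datetime import datetime

# Hypothetical log entries: (visit time, user ID, page URL).
log = [
    ("2024-03-01 09:15:00", "u1", "/home"),
    ("2024-03-01 20:05:00", "u2", "/pricing"),
    ("2024-03-01 20:40:00", "u3", "/home"),
    ("2024-03-01 21:10:00", "u1", "/blog"),
    ("2024-03-02 20:55:00", "u4", "/home"),
    ("2024-03-02 14:30:00", "u2", "/home"),
]

# Step 1: convert visit times to hour of day (0-23).
hours = [
    datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S").hour
    for timestamp, _user_id, _url in log
]

# Step 2: count visits per hour.
visits_per_hour = Counter(hours)

# Step 3: print a text stand-in for a bar chart of visits vs hour.
for hour in sorted(visits_per_hour):
    print(f"{hour:02d}:00 {'#' * visits_per_hour[hour]}")

# The busiest hour is the one with the highest count.
busiest_hour, busiest_count = visits_per_hour.most_common(1)[0]
print(f"Busiest hour: {busiest_hour}:00 with {busiest_count} visits")
```

With real logs the list would be far longer, but the three steps stay the same.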
What You’ll Explore Next
In the following sections of this chapter, you will:
- Load and inspect simple datasets
- Do basic calculations and summaries
- Visualize data with simple plots
- See how libraries make these tasks easier
This will give you a practical first taste of doing data science with Python.