Overview
Descriptive statistics is about summarizing and describing data. Instead of looking at a long list of numbers, we use a small set of numbers, tables, and graphs to capture the main features of the data:
- Where the data tend to be (center).
- How spread out the data are (variability).
- What the overall “shape” looks like (distribution).
- Whether there are unusual values (outliers).
Descriptive statistics does not try to make predictions about the future or about a larger population. It only describes what has been observed. Using these summaries is the first step before doing any deeper statistical analysis.
In this chapter, we survey the main descriptive tools and show where the specific measures treated later (mean, variance, and standard deviation) fit among them.
Types of Data and Why They Matter
Before summarizing, it is important to keep in mind what kind of data you have, because not all descriptive summaries are appropriate for all data types.
Broadly, you may have:
- Categorical data (also called qualitative): values belong to categories, like eye color (blue, brown, green), or type of transport (car, bus, bike). Any numbers you attach as codes (e.g., 1 = car, 2 = bus) are only labels; doing arithmetic on them is meaningless.
- Numerical data (also called quantitative): values are numbers that measure something, like height in centimeters, test scores, number of siblings.
For categorical data, we usually describe:
- How often each category occurs (counts).
- What proportion or percentage each category represents.
For numerical data, we usually describe:
- Center: typical value.
- Spread: how varied the values are.
- Shape: pattern of the distribution (symmetry, skewness, etc.).
The measures in later subsections (mean, variance, standard deviation) are for numerical data.
Frequency Tables
A frequency table shows how often each value (or each category) appears.
For categorical data, a simple frequency table might look like:
Category | Count | Relative frequency
---------|-------|-------------------
Car | 18 | 0.45
Bus | 12 | 0.30
Bike | 10 | 0.25
- The count is how many times the category appears.
- The relative frequency is the proportion of the total, often written as a decimal or percentage. It is computed as:
$$\text{relative frequency} = \frac{\text{count}}{\text{total count}}.$$
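As a minimal sketch of the formula above, the following Python snippet rebuilds the transport table from raw responses. The `responses` list is hypothetical data constructed to match the counts in the table; it is not taken from a real survey.

```python
from collections import Counter

# Hypothetical raw data, built to match the counts in the table above.
responses = ["Car"] * 18 + ["Bus"] * 12 + ["Bike"] * 10

counts = Counter(responses)      # how many times each category appears
total = sum(counts.values())     # total count

# relative frequency = count / total count
relative = {category: count / total for category, count in counts.items()}

# Print one row per category, most common first.
for category, count in counts.most_common():
    print(f"{category:<5} {count:>3} {relative[category]:.2f}")
```

The relative frequencies always add up to 1 (or 100%), which is a quick check that no category was missed.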
For numerical data with many possible values, we usually group values into intervals (also called classes or bins). For example, for test scores:
Score range | Count
-----------|------
0–49 | 2
50–59 | 5
60–69 | 10
70–79 | 8
80–89 | 4
90–100 | 1
Frequency tables make it easier to see patterns such as which values or ranges are most common.
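Grouping into intervals can be sketched in a few lines of Python. The class boundaries below are the ones from the table; the `scores` list and the helper name `bin_counts` are hypothetical, and the sketch assumes whole-number scores so that every score falls into exactly one class.

```python
# Class boundaries from the table: (label, low, high), both ends inclusive.
BINS = [
    ("0-49", 0, 49),
    ("50-59", 50, 59),
    ("60-69", 60, 69),
    ("70-79", 70, 79),
    ("80-89", 80, 89),
    ("90-100", 90, 100),
]

def bin_counts(scores):
    """Count how many integer scores fall into each class."""
    counts = {label: 0 for label, _, _ in BINS}
    for score in scores:
        for label, low, high in BINS:
            if low <= score <= high:
                counts[label] += 1
                break  # each score belongs to exactly one class
    return counts

# Hypothetical scores, for illustration only.
scores = [45, 52, 58, 61, 64, 67, 69, 72, 75, 78, 83, 88, 95]
table = bin_counts(scores)

for label, count in table.items():
    print(f"{label:<7} {count}")
```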
Graphical Summaries
Graphs are visual forms of descriptive statistics. They help you see patterns at a glance.
Bar charts
A bar chart is used mainly for categorical data.
- Each category is shown by a bar.
- The height of the bar represents the count or proportion for that category.
- Bars are separate (with spaces) because the categories are distinct.
Bar charts help compare sizes of categories easily.
Histograms
A histogram is used for numerical data that have been grouped into intervals.
- The horizontal axis shows numeric intervals (e.g., 0–10, 10–20, …); a value on a shared boundary, such as 10, is counted in only one of the two intervals.
- The vertical axis shows the count (or proportion) of data values in each interval.
- Bars usually touch each other to emphasize that the intervals cover the number line without gaps.
Histograms are useful for seeing:
- Where the bulk of the data lies.
- Whether the distribution is roughly symmetric or skewed.
- Whether there are multiple peaks.
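To make the construction concrete, here is a rough text-only histogram in Python. It assumes equal-width intervals that include their left endpoint and exclude their right endpoint (so 10 is counted in 10–20, not 0–10); this is one common convention, not the only one. The data and the function name `histogram_counts` are hypothetical.

```python
def histogram_counts(values, start, width, n_bins):
    """Count values in equal-width intervals [low, low + width)."""
    counts = [0] * n_bins
    for value in values:
        index = int((value - start) // width)
        if 0 <= index < n_bins:   # values outside the covered range are ignored
            counts[index] += 1
    return counts

# Hypothetical data, for illustration only.
values = [3, 7, 12, 15, 18, 21, 24, 25, 29, 33, 38, 47]
counts = histogram_counts(values, start=0, width=10, n_bins=5)

# One row per interval; the number of '#' marks is the count.
for i, count in enumerate(counts):
    low = i * 10
    print(f"{low:>2}-{low + 10:<3} {'#' * count}")
```

Turned on its side, the rows of `#` marks play the role of the bars: the longest row marks where the bulk of the data lies.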
Pie charts
A pie chart is a circle divided into slices, usually for categorical data.
- Each slice represents a category.
- The angle or area of the slice corresponds to its proportion of the total.
Pie charts are mainly used to emphasize how a whole is divided among categories.
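The slice angles follow directly from the proportions: each category receives its share of the full 360°. A small sketch, reusing the transport counts from the frequency table earlier in this chapter:

```python
# Counts from the transport frequency table.
counts = {"Car": 18, "Bus": 12, "Bike": 10}
total = sum(counts.values())

# Slice angle in degrees = (count / total) * 360
angles = {category: count * 360 / total for category, count in counts.items()}

for category, angle in angles.items():
    print(f"{category:<5} {angle:.0f} degrees")
```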
Boxplots (Box-and-Whisker Plots)
A boxplot is a compact summary of a numerical data set using a few special values:
- Minimum (smallest value).
- First quartile ($Q_1$).
- Median ($Q_2$).
- Third quartile ($Q_3$).
- Maximum (largest value).
These five values form the five-number summary. A boxplot shows:
- A box from $Q_1$ to $Q_3$.
- A line inside the box at the median.
- “Whiskers” extending toward the minimum and maximum (possibly excluding outliers, depending on the method).
Boxplots are helpful for quickly comparing different groups side by side and for spotting outliers and skewness.
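The five-number summary can be computed with Python's standard `statistics` module, as in the sketch below. Textbooks and software use slightly different conventions for quartiles, so the choice of `method="inclusive"` here is an assumption, and other tools may report slightly different $Q_1$ and $Q_3$ for the same data. The data and the function name `five_number_summary` are hypothetical.

```python
import statistics

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    ordered = sorted(values)
    # n=4 splits the data into quarters and returns the three cut points.
    q1, q2, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return ordered[0], q1, q2, q3, ordered[-1]

# Hypothetical sample, for illustration only.
data = [7, 2, 9, 4, 12, 5, 4, 11, 8]
summary = five_number_summary(data)
print(summary)
```

For this sample the box would run from 4 to 9 with the median line at 7, and the whiskers would reach down to 2 and up to 12.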
Measures of Center
Measures of center try to capture a “typical” or “central” value for a numerical data set.
Common measures of center are:
- Mean (arithmetic average): adds up all values and divides by the number of values. This is the focus of the later “Mean” chapter.
- Median: the middle value when the data are sorted.
- Mode: the most frequently occurring value (or values).
You should know conceptually that:
- The mean can be pulled in the direction of extreme values (outliers).
- The median is more resistant to outliers and skewed data.
- The mode can be useful for categorical data (the most common category).
The exact computation and deeper properties of the mean are treated in the “Mean” subsection; here we only place it among other descriptive measures.
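A short experiment illustrates how differently the three measures react to a single extreme value. The numbers below are hypothetical; the functions come from Python's standard `statistics` module.

```python
import statistics

values = [3, 4, 4, 5, 6]        # hypothetical data
with_outlier = values + [60]    # the same data plus one extreme value

mean_before = statistics.mean(values)
mean_after = statistics.mean(with_outlier)
median_before = statistics.median(values)
median_after = statistics.median(with_outlier)
modes_before = statistics.multimode(values)       # list of most frequent values
modes_after = statistics.multimode(with_outlier)

print("mean:  ", mean_before, "->", round(mean_after, 2))
print("median:", median_before, "->", median_after)
print("mode:  ", modes_before, "->", modes_after)
```

In this example the mean roughly triples (from 4.4 to about 13.67), while the median moves only from 4 to 4.5 and the mode stays at 4.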
Measures of Spread
Describing the center of a data set is not enough on its own; it also matters how spread out the values are. Two groups can have the same mean but very different variability.
Key concepts:
- Range: largest value minus smallest value.
- Interquartile range (IQR): $Q_3 - Q_1$; measures the spread of the middle 50% of the data.
- Variance: an average of squared deviations from the mean (treated fully in the “Variance” subsection).
- Standard deviation: the square root of the variance (treated fully in the “Standard deviation” subsection).
Conceptually:
- A small spread means most values are close together (clustered around the center).
- A large spread means the data are more widely scattered.
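The range is easy to compute but depends only on the two most extreme values, so a single outlier can change it drastically. Variance and standard deviation use every value, and the IQR describes the middle half of the data, so together they give a richer description of variability.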
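The sketch below compares two hypothetical groups that share the same mean but differ in spread. Two choices here are assumptions: the variance is the population version (squared deviations divided by the number of values; the sample version, which divides by one less, is left to the later treatment), and quartiles use the `"inclusive"` convention of Python's `statistics` module.

```python
import statistics

def spread_summary(values):
    """Return range, IQR, population variance, and population standard deviation."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "range": max(values) - min(values),
        "iqr": q3 - q1,
        "variance": statistics.pvariance(values),   # divides by n
        "std_dev": statistics.pstdev(values),       # square root of the variance
    }

# Two hypothetical groups with the same mean (50) but different variability.
group_a = [48, 49, 50, 51, 52]
group_b = [30, 40, 50, 60, 70]

summary_a = spread_summary(group_a)
summary_b = spread_summary(group_b)
print(statistics.mean(group_a), summary_a)
print(statistics.mean(group_b), summary_b)
```

Every measure of spread is larger for `group_b`, even though a summary based on the mean alone could not tell the two groups apart.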
Shape of a Distribution
The shape of a numerical distribution describes the overall pattern you see in a histogram or boxplot.
Some common shapes:
- Symmetric: left and right sides roughly mirror each other. A special symmetric shape is bell-shaped, which becomes important for the normal distribution.
- Skewed right (positively skewed): a long tail extends to the right (toward larger values). Often, the mean is larger than the median.
- Skewed left (negatively skewed): a long tail extends to the left (toward smaller values). Often, the mean is smaller than the median.
- Uniform: all values or intervals have similar frequencies; the histogram looks flat.
- Bimodal or multimodal: the histogram has two or more distinct peaks, suggesting that the data might come from a mixture of different groups.
Understanding shape guides you in choosing appropriate summary measures and later modeling choices.
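The "mean versus median" observation for skewed data can be turned into a rough check, sketched below. This is only a heuristic and not a formal definition of skewness; the function name `skew_hint`, the `tolerance` parameter, and the data are all hypothetical, and a histogram remains the more reliable way to judge shape.

```python
import statistics

def skew_hint(values, tolerance=0.05):
    """Guess the direction of skew by comparing the mean with the median."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    spread = statistics.pstdev(values)
    # Treat small differences (relative to the spread) as "no clear skew".
    if spread == 0 or abs(mean - median) <= tolerance * spread:
        return "symmetric"
    return "right" if mean > median else "left"

# Hypothetical data sets, for illustration only.
right_tailed = [1, 2, 2, 3, 3, 3, 4, 4, 15]
left_tailed = [1, 12, 12, 13, 13, 13, 14, 14, 15]
balanced = [1, 2, 3, 4, 5]

print(skew_hint(right_tailed), skew_hint(left_tailed), skew_hint(balanced))
```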
Outliers
An outlier is a data point that is unusually far from the rest of the values.
Outliers can occur for different reasons:
- A real but rare event (for example, an extremely high income).
- A data recording error (for example, typing $1000$ instead of $100$).
- A value from a different process or group than the other data.
Descriptively, outliers matter because:
- They can strongly affect the mean and the variance/standard deviation.
- They may indicate interesting phenomena or serious errors.
- They can influence the shape of a histogram or boxplot.
Boxplots and numerical criteria (for example, based on quartiles and IQR) are common tools to flag possible outliers, though deciding what to do with them depends on context.
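One widely used numerical criterion, assumed here as an example since the text does not fix a specific rule, flags any value more than 1.5 × IQR below $Q_1$ or above $Q_3$. The sketch below applies it to hypothetical data; the multiplier `k=1.5` and the `"inclusive"` quartile convention are both conventional choices rather than fixed laws.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# Hypothetical data with one suspiciously large value.
data = [10, 12, 12, 13, 14, 15, 15, 16, 18, 45]
flagged = iqr_outliers(data)
print(flagged)
```

The rule only flags candidates. Whether a flagged value is an error to correct or a genuine observation to keep still has to be decided from context.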
Summarizing a Data Set in Practice
When you describe a data set, you often combine several descriptive statistics instead of relying on just one number. A typical descriptive summary for numerical data might include:
- A measure of center (mean or median).
- A measure of spread (standard deviation or IQR).
- Information about shape (symmetric, skewed) and outliers.
- A visual display (histogram and/or boxplot).
For categorical data, a typical descriptive summary might include:
- Counts and proportions for each category (frequency table).
- One or more graphs (bar chart or pie chart).
Descriptive statistics provide the foundation for the later chapters in this section:
- The Mean chapter focuses on a central measure.
- The Variance and Standard deviation chapters focus on quantifying spread.
- Later topics in probability and statistics build on these summaries to make inferences and decisions based on data.