Understanding Descriptive Statistics
Descriptive statistics summarize and describe the main features of a data set using numbers. Instead of looking at every single value, you use a few measures to capture typical behavior, spread, and shape. In MATLAB, these summaries are usually computed from vectors, tables, or columns of data.
You will mainly work with numeric arrays, for example a vector x that contains measurements. Most descriptive measures have a dedicated MATLAB function, such as mean, median, and std.
Measures of Central Tendency
Central tendency describes where the "center" of the data lies. MATLAB provides several ways to compute this, each capturing a different notion of the center.
The most common measure is the arithmetic mean. For a vector $x$ with $n$ elements, the mean is
$$
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i.
$$
In MATLAB, you compute this with:
m = mean(x);
The result m is a single scalar that represents the average value of x.
The median is the middle value of the data when sorted. It is less sensitive to extreme values than the mean. For a vector x:
m_med = median(x);
If the number of elements is odd, the median is the central element after sorting. If it is even, the median is the average of the two central elements.
Sometimes you want to know the most frequent value, called the mode. MATLAB computes this with:
m_mode = mode(x);
The mode is useful when the data contain repeated discrete values, for example survey responses or counts. For continuous numeric measurements, the mode can be less informative because exact repeated values may be rare.
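As a small illustration of how the three measures can disagree, the following sketch applies them to a made-up vector with one extreme value. The data are invented for the example.

```matlab
% Hypothetical data: five small values and one extreme value
x = [2 3 3 4 5 100];

m      = mean(x);    % pulled upward by the value 100
m_med  = median(x);  % average of the two central sorted values, 3 and 4
m_mode = mode(x);    % 3 is the only value that appears twice

fprintf('mean = %.2f, median = %.2f, mode = %d\n', m, m_med, m_mode);
```

Here the mean is 19.5 while the median is 3.5, which shows how a single extreme value can move the mean far away from where most of the data lie.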
Measures of Spread or Variability
Spread describes how much the data vary around the center. Two data sets can have the same mean but very different variability.
A basic measure is the range. For a vector x, the range is
$$
\text{range}(x) = \max(x) - \min(x).
$$
In MATLAB, the range function, which is part of the Statistics and Machine Learning Toolbox, computes this directly:
r = range(x);
You can also compute max(x) and min(x) individually if you want the extremes themselves.
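The definition can also be applied directly with min and max, which may be handy when the range function is not available in a given installation. The values below are invented for illustration.

```matlab
% Hypothetical measurements
x = [4 8 15 16 23 42];

x_min = min(x);             % smallest value
x_max = max(x);             % largest value
r_manual = x_max - x_min;   % range computed from its definition

fprintf('min = %d, max = %d, range = %d\n', x_min, x_max, r_manual);
```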
Variance and standard deviation describe how far values tend to deviate from the mean. For a sample vector $x$ with $n$ values and mean $\bar{x}$, the sample variance is
$$
s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2.
$$
The corresponding standard deviation is
$$
s = \sqrt{s^2}.
$$
In MATLAB, by default, var and std compute the sample versions:
v = var(x);
s = std(x);
If you need the population versions with denominator $n$ instead of $n - 1$, you can specify a second argument:
v_pop = var(x, 1);
s_pop = std(x, 1);
Larger standard deviation values indicate data that are more spread out around the mean.
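To connect the formulas with the function calls, the sketch below computes both variances by hand and compares them with var and std. The sample values are made up, and the names ss, v_manual, and v_pop_manual are introduced only for this example.

```matlab
% Hypothetical sample
x = [2 4 4 4 5 5 7 9];
n = numel(x);
xbar = sum(x) / n;              % arithmetic mean

ss = sum((x - xbar).^2);        % sum of squared deviations from the mean
v_manual     = ss / (n - 1);    % sample variance, denominator n - 1
v_pop_manual = ss / n;          % population variance, denominator n

% Each row shows the manual value next to the built-in result
disp([v_manual, var(x)]);
disp([v_pop_manual, var(x, 1)]);
disp([sqrt(v_pop_manual), std(x, 1)]);
```

For these values the sum of squared deviations is 32, so the population variance is 4, the population standard deviation is 2, and the sample variance is 32/7.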
Minimum, Maximum, and Percentiles
Extremes and percentiles describe where data lie in terms of their ordered positions. These are often reported together with central tendency and spread.
The minimum and maximum are direct:
x_min = min(x);
x_max = max(x);
They show the smallest and largest values in the dataset.
Percentiles divide ordered data into 100 equal parts. For example, the $p$th percentile is a value such that approximately $p$ percent of the data lie below it. In MATLAB, you compute percentiles using prctile:
p25 = prctile(x, 25); % 25th percentile
p50 = prctile(x, 50); % 50th percentile, often equal to the median
p75 = prctile(x, 75); % 75th percentile
You can request multiple percentiles at once by passing a vector of percentages, for example prctile(x, [5 95]) to get the 5th and 95th percentiles.
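A minimal sketch of the vector form, using invented data, might look like this. The names p_low, p_mid, and p_high are only for illustration.

```matlab
% Hypothetical data: the integers 1 to 10
x = 1:10;

p = prctile(x, [5 50 95]);   % one call, three percentiles

p_low  = p(1);   % 5th percentile
p_mid  = p(2);   % 50th percentile
p_high = p(3);   % 95th percentile

fprintf('5th = %.2f, 50th = %.2f, 95th = %.2f\n', p_low, p_mid, p_high);
```

The results come back in the same order as the requested percentages, so indexing into p is enough to pick out each one.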
Quartiles are special percentiles that cut the data into four equal parts. The first quartile $Q_1$ is the 25th percentile, the second quartile $Q_2$ is the 50th percentile, and the third quartile $Q_3$ is the 75th percentile. Quartiles are useful summaries of the distribution, especially when combined into the interquartile range.
Interquartile Range and Robust Spread
The interquartile range (IQR) measures the spread of the middle 50 percent of the data. It is defined as
$$
\text{IQR} = Q_3 - Q_1.
$$
In MATLAB, there is a dedicated function:
iq = iqr(x);
Because the IQR ignores the lowest 25 percent and highest 25 percent of values, it is less affected by extreme values than the full range or standard deviation.
You can also access quartiles using prctile, then compute the IQR manually if you wish:
q = prctile(x, [25 75]);
iq_manual = q(2) - q(1);
Robust measures like the median and IQR are often preferred when data contain outliers.
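The following sketch illustrates this robustness on invented data by replacing the largest value with an outlier and comparing the standard deviation with the IQR. The IQR is computed from prctile, as in the manual approach above.

```matlab
x_clean   = 1:10;         % hypothetical data without outliers
x_outlier = [1:9 1000];   % same data, largest value replaced by an outlier

% The standard deviation uses every value, so it reacts strongly
s_clean   = std(x_clean);
s_outlier = std(x_outlier);

% The IQR depends only on the 25th and 75th percentiles
q_clean    = prctile(x_clean,   [25 75]);
q_outlier  = prctile(x_outlier, [25 75]);
iq_clean   = q_clean(2)   - q_clean(1);
iq_outlier = q_outlier(2) - q_outlier(1);

fprintf('std: %.2f vs %.2f\n', s_clean, s_outlier);
fprintf('IQR: %.2f vs %.2f\n', iq_clean, iq_outlier);
```

In this example the outlier lies above the 75th percentile, so the quartiles and the IQR stay the same while the standard deviation grows by roughly two orders of magnitude.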
Shape of the Distribution: Skewness and Kurtosis
Beyond center and spread, you might want to capture the shape of the distribution. Two common shape measures are skewness and kurtosis.
Skewness measures asymmetry. A distribution with positive skewness has a longer tail to the right. A distribution with negative skewness has a longer tail to the left. In MATLAB:
sk = skewness(x);
This returns a number that is close to zero for roughly symmetric data.
Kurtosis measures how heavy the tails are compared to a normal distribution. Larger kurtosis often indicates more extreme values in the tails. In MATLAB:
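kt = kurtosis(x);
MATLAB's kurtosis is not the excess kurtosis: a normal distribution has a kurtosis of 3. Values above 3 suggest heavier tails or more extreme outliers than a normal distribution, and values below 3 suggest lighter tails. An optional second argument controls whether a bias correction is applied.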
These measures are more advanced summaries. They are useful when you need a compact description of distribution shape without plotting, for example when comparing different datasets programmatically.
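To make the two measures less opaque, the sketch below computes them from central moments: the third central moment divided by the second raised to the power 3/2, and the fourth divided by the second squared, all with denominator n. These moment-based definitions are assumed to correspond to the default, uncorrected output of skewness and kurtosis; the documentation for those functions gives the exact definitions. The data and the names sk_manual and kt_manual are invented for the example.

```matlab
x = [1 2 3 4 10];            % hypothetical data with a long right tail

d  = x - mean(x);            % deviations from the mean
m2 = mean(d.^2);             % second central moment (denominator n)
m3 = mean(d.^3);             % third central moment
m4 = mean(d.^4);             % fourth central moment

sk_manual = m3 / m2^(3/2);   % positive here because of the long right tail
kt_manual = m4 / m2^2;       % non-excess kurtosis (normal distribution: 3)

fprintf('skewness = %.3f, kurtosis = %.3f\n', sk_manual, kt_manual);
```

Because the code uses only mean and element-wise powers, it runs without any toolbox and can serve as a cross-check of the toolbox functions.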
Working with Vectors vs Matrices
Most descriptive statistics functions operate along a chosen dimension for matrices or higher dimensional arrays. If X is a matrix, mean(X) by default returns a row vector containing the mean of each column. Similarly, std(X) returns standard deviations for each column.
If you want summaries along a different dimension, you can specify it explicitly. For example, to compute the mean of each row in X, use:
rowMean = mean(X, 2);
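Many other functions used in descriptive statistics, such as median, std, var, min, max, prctile, and iqr, also accept a dimension argument, but its position differs between functions. For example, median(X, 2) follows the same pattern as mean, whereas std(X, 0, 2), var(X, 0, 2), min(X, [], 2), and max(X, [], 2) take the dimension as the third argument, and prctile(X, p, 2) takes it after the percentages.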
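The position of the dimension argument is worth checking for each function. The sketch below shows row-wise summaries for several functions on a small invented matrix.

```matlab
X = [1 2 3;
     4 5 6];                 % hypothetical 2-by-3 matrix

colMean = mean(X);           % default: one value per column
rowMean = mean(X, 2);        % one value per row

rowMedian = median(X, 2);    % dimension is the second argument
rowStd    = std(X, 0, 2);    % second argument is the weight (0 = sample), third is the dimension
rowVar    = var(X, 0, 2);    % same layout as std
rowMin    = min(X, [], 2);   % empty placeholder, then the dimension
rowMax    = max(X, [], 2);   % same layout as min

disp(colMean);
disp([rowMean rowMedian rowStd rowVar rowMin rowMax]);
```

Writing min(X, 2) by mistake does not operate along rows. It compares every element of X with the scalar 2, which is why the empty placeholder is needed.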
When you are working with tables, some functions can operate directly on table variables, but often you extract a table column into a vector, for instance X = T.Height;, then apply the usual numeric functions.
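A minimal sketch of the extract-then-summarize pattern, with a small table built by hand, could look like this. The table T, its Height variable, and the values are hypothetical.

```matlab
% Hypothetical table with a single numeric variable named Height
T = table([170; 165; 180; 172], 'VariableNames', {'Height'});

h = T.Height;        % extract the column as an ordinary numeric vector

h_mean = mean(h);    % the usual numeric functions now apply
h_med  = median(h);

fprintf('mean height = %.2f, median height = %.2f\n', h_mean, h_med);
```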
Dealing with Missing Data in Summaries
Real datasets often contain missing values represented as NaN. Many basic functions propagate NaN: if any input value is NaN, the result is NaN as well, which can spoil your summaries. For example, mean([1 2 NaN]) returns NaN. To ignore missing values, the Statistics and Machine Learning Toolbox provides nan versions of several functions.
For the mean, variance, and standard deviation, typical usage looks like:
m = nanmean(x);
s = nanstd(x);
v = nanvar(x);
MathWorks documents these nan functions as not recommended in recent releases. The core functions accept a flag that tells them to omit missing values. For example:
m2 = mean(x, 'omitnan');
md = median(x, 'omitnan');
Here, the 'omitnan' flag instructs MATLAB to compute the summary ignoring NaN entries. Not every function takes this flag: min and max skip NaN values by default, and prctile is documented as treating NaN values as missing rather than accepting 'omitnan'. Checking the function documentation for how missing values are handled is therefore useful.
Careful handling of missing data is essential, especially if you want summaries that reflect the observed information rather than just returning NaN.
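The effect is easy to see on a short invented vector. The sketch below counts the missing entries, then compares the default mean with two ways of ignoring NaN.

```matlab
x = [1 2 NaN 4];                 % hypothetical data with one missing value

n_missing = sum(isnan(x));       % number of NaN entries
m_default = mean(x);             % NaN, because the missing value propagates
m_omit    = mean(x, 'omitnan');  % mean of the three observed values

x_obs    = x(~isnan(x));         % alternative: drop NaN entries explicitly
m_manual = mean(x_obs);          % same result as the 'omitnan' flag

fprintf('missing = %d, default = %g, omitnan = %.4f\n', n_missing, m_default, m_omit);
```

Reporting the number of missing values alongside the summary makes it clear how many observations the statistic is actually based on.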
Quick Combined Summaries
When you need several descriptive measures at once, using a single function can be convenient. MATLAB provides the summary function for tables, and some toolboxes offer higher-level functions, but for numeric arrays a common pattern is to compute a small set of core summaries together.
A typical snippet to summarize a vector x might be:
results.mean = mean(x, 'omitnan');
results.median = median(x, 'omitnan');
results.min = min(x);
results.max = max(x);
results.std = std(x, 'omitnan');
results.iqr = iqr(x);
results.skewness = skewness(x);
results.kurtosis = kurtosis(x);
This creates a structure containing several summary statistics that you can inspect or save. Such combined summaries are often the starting point for further analysis, plotting, or reporting.
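One possible way to inspect such a structure is to convert it to a one-row table. The sketch below restricts itself to functions from base MATLAB. The data, the extra field n, and the variable summaryRow are assumptions made for this example.

```matlab
x = [3 7 NaN 5 9 4];                  % hypothetical data with one missing value

results = struct();
results.n      = sum(~isnan(x));      % number of observed values
results.mean   = mean(x, 'omitnan');
results.median = median(x, 'omitnan');
results.std    = std(x, 'omitnan');
results.min    = min(x);              % min and max skip NaN by default
results.max    = max(x);

summaryRow = struct2table(results);   % one-row table, one variable per statistic
disp(summaryRow);
```

A table of this form can be stacked with the rows of other datasets or written to a file, which is convenient for reporting.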
Key points to remember:
Descriptive statistics summarize a dataset with a few numbers, not by examining every value.
Central tendency measures include mean, median, and mode. They describe "typical" values.
Spread measures such as range, variance, standard deviation, and interquartile range describe variability.
Percentiles and quartiles show where values lie in the ordered data. The interquartile range is $Q_3 - Q_1$.
Functions like mean, median, std, var, min, max, prctile, iqr, skewness, and kurtosis provide standard summaries in MATLAB.
Be careful with missing values (NaN). Use options like 'omitnan' or the nan variants of functions so that a single missing value does not turn the whole summary into NaN.