Histograms and Density Plots

Understanding Histograms

Histograms show how data values are distributed across different ranges. Instead of plotting each data point separately, a histogram groups values into bins and shows how many values fall into each bin. This is very useful when you have many observations and want to see the overall shape of the data, such as whether it is symmetric, skewed, or has multiple peaks.

In MATLAB, histograms are usually created from a vector of numeric data. Each element of the vector is treated as one observation. Histograms are especially helpful when you want to understand the spread and concentration of values, detect outliers, or compare distributions between different data sets.

Creating Simple Histograms with `histogram`

The basic function to draw a histogram in modern MATLAB is histogram. You call it with a data vector and MATLAB creates a bar-like plot where each bar represents one bin.

A minimal example is:

matlab
data = randn(1000,1);      % 1000 samples from a normal distribution
histogram(data);

Here, randn draws values from the standard normal distribution, so the histogram is roughly bell-shaped around zero. histogram automatically chooses the bins and, by default, shows the number of observations in each bin on the vertical axis.

You can store the returned object if you want to modify properties later:

matlab
h = histogram(data);

The output h is a histogram object that you can adjust, for example by changing its face color or line style.

Controlling Bin Size and Bin Edges

The appearance and usefulness of a histogram depend strongly on the choice of bins. If bins are too wide, important details are hidden. If bins are too narrow, the plot looks noisy.

MATLAB can choose bins automatically, but you can also control them. The most direct way is to set the number of bins with the NumBins property:

matlab
histogram(data,'NumBins',30);

You can also specify the bin edges exactly. This is useful if you know the ranges that matter for your problem. For example, if you want bins from -5 to 5 in steps of 0.5, you can create a vector of edges and pass it to the BinEdges name-value pair:

matlab
edges = -5:0.5:5;
histogram(data,'BinEdges',edges);

Once you call histogram, you can modify the bins via the object:

matlab
h = histogram(data);
h.NumBins = 50;

You can also set the bin width instead of the number of bins:

matlab
histogram(data,'BinWidth',0.2);

MATLAB adjusts the number of bins to cover the full range of your data with the specified width.
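
The three binning controls can also be compared without drawing anything. The following is a minimal sketch that uses histcounts, which accepts the same binning options as histogram but returns the counts and edges as arrays. The variable names and the fixed random seed are chosen here only for illustration.

```matlab
rng(0);                                     % fixed seed so the example is repeatable
data = randn(1000,1);

[n1,e1] = histcounts(data,30);              % ask for 30 bins
[n2,e2] = histcounts(data,'BinWidth',0.2);  % ask for bins of width 0.2
edges   = -5:0.5:5;                         % explicit edges, which define 20 bins
n3      = histcounts(data,edges);

fprintf('30 bins requested   -> bin width %.3f\n', e1(2)-e1(1));
fprintf('width 0.2 requested -> %d bins\n', numel(n2));
fprintf('edges given         -> %d bins\n', numel(n3));
```

Printing the results side by side makes the trade-off visible: fixing the number of bins lets MATLAB pick the width, while fixing the width lets MATLAB pick the number of bins.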

Counts, Normalization, and Histogram Types

By default, histograms show counts, which means the height of each bar equals the number of observations in that bin. Sometimes you want a different interpretation of the vertical axis, such as probabilities or densities, especially when you compare histograms from different sized data sets.

You can use the Normalization property of histogram to change this behavior. Common options include:

  1. Normalization set to 'count'
    This is the default. The height is the number of observations in each bin.
  2. Normalization set to 'probability'
    The height is the fraction of the total number of observations in each bin. All bar heights sum to 1.
  3. Normalization set to 'pdf'
    The histogram is scaled so that the total area equals 1. The heights approximate a probability density function (PDF). This is often used when overlaying theoretical probability densities or other smooth curves.
  4. Normalization set to 'countdensity'
    This is similar to 'pdf' but in terms of counts per unit on the x-axis. The area under the bars gives the total number of observations.

Here is an example with probabilities:

matlab
histogram(data,'Normalization','probability');

And an example with a density:

matlab
histogram(data,'Normalization','pdf');

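When using 'pdf', bar heights can exceed 1, for example when the bins are very narrow or the data are concentrated in a range much shorter than 1 unit. This is not an error: the heights are densities, not probabilities, and the key property is that the total area under all bars equals 1.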
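
The normalization rules described above can be checked numerically. This is a small sketch that uses histcounts with the same Normalization option, so no figure is needed. The variable names are illustrative.

```matlab
rng(0);                                     % fixed seed so the example is repeatable
data = randn(1000,1);

[p,edgesP] = histcounts(data,'Normalization','probability');
[d,edgesD] = histcounts(data,'Normalization','pdf');
[c,edgesC] = histcounts(data,'Normalization','countdensity');

sumOfHeights = sum(p);                      % 'probability': heights add up to 1
areaPdf      = sum(d .* diff(edgesD));      % 'pdf': height times width adds up to 1
areaCounts   = sum(c .* diff(edgesC));      % 'countdensity': area is the sample size

fprintf('probability:  sum of heights = %.4f\n', sumOfHeights);
fprintf('pdf:          total area     = %.4f\n', areaPdf);
fprintf('countdensity: total area     = %.1f\n', areaCounts);
```

Up to rounding, the first two values should be 1 and the third should equal the number of observations.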

Overlaying Histograms for Comparison

To compare distributions from different samples, you often want to display more than one histogram in the same figure. There are two common arrangements: overlaying the histograms in the same axes, or placing them in separate axes within the figure.

To overlay, you can call hold on and then draw multiple histograms. To avoid one histogram completely hiding the other, you can adjust the transparency through the FaceAlpha property and possibly use the same bin edges for both data sets.

For example:

matlab
data1 = randn(1000,1);
data2 = 1 + randn(800,1);   % shifted data
edges = -4:0.25:6;
h1 = histogram(data1,'BinEdges',edges,...
                     'Normalization','probability',...
                     'FaceColor',[0 0.4470 0.7410],...
                     'FaceAlpha',0.5);
hold on;
h2 = histogram(data2,'BinEdges',edges,...
                     'Normalization','probability',...
                     'FaceColor',[0.8500 0.3250 0.0980],...
                     'FaceAlpha',0.5);
hold off;

Using consistent bin edges is important for a meaningful comparison. If each histogram chooses its own bins, the shapes are harder to compare. Transparency lets you see both data sets where they overlap.

You can also use separate axes or subplots, which is covered in other chapters, when overlapping becomes confusing.

Histogram Objects and Customization

The histogram function returns a histogram object. This object has properties you can inspect and modify to adjust the visual appearance and behavior of the plot.

After creating a histogram:

matlab
h = histogram(data);

you can query or change properties such as:

matlab
h.FaceColor = 'k';       % black bars
h.EdgeColor = 'w';       % white edges
h.LineWidth = 1.5;       % thicker borders
h.FaceAlpha = 0.3;       % semi transparent bars

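The object also stores computed values, for example the bar heights and bin edges. With the default 'count' normalization, Values holds the bin counts; with another Normalization setting it holds the normalized heights instead: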

matlab
counts = h.Values;
edges  = h.BinEdges;

These values can be useful if you want to perform additional calculations based on the histogram, such as computing cumulative counts or manually drawing extra elements on the plot.
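
As an illustration of such calculations, the sketch below computes cumulative counts and bin centers from the histogram object and then draws an extra line through the bar tops. The names centers and cumulativeCounts are introduced here and are not histogram properties.

```matlab
rng(0);                                     % fixed seed so the example is repeatable
data = randn(1000,1);
h = histogram(data);                        % default 'count' normalization

counts  = h.Values;                         % bin counts, since Normalization is 'count'
edges   = h.BinEdges;
centers = edges(1:end-1) + diff(edges)/2;   % midpoint of each bin
cumulativeCounts = cumsum(counts);          % running total from left to right

hold on;
plot(centers,counts,'r.-');                 % extra element: line through the bar tops
hold off;
```

The last entry of cumulativeCounts equals the number of observations, which is a quick sanity check that every value landed in a bin.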

Understanding Density Plots

While histograms use rectangles to approximate the distribution of data, density plots represent the same idea with a smooth curve. The main concept is a probability density function (PDF). A PDF describes how likely it is to observe values in a small interval around each point on the horizontal axis.

A smooth density plot can make it easier to see the overall shape of the distribution, such as multiple peaks, without being too sensitive to the choice of bin edges. Density plots are often constructed through kernel density estimation, which spreads each data point with a smooth kernel function and then sums the contributions.

In MATLAB, densities are often viewed through 'pdf' normalized histograms, or estimated with a kernel density estimation function such as ksdensity, which is part of the Statistics and Machine Learning Toolbox rather than base MATLAB.
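
The kernel density idea described above can be written out by hand in a few lines. The following is only a sketch under stated assumptions: it uses a Gaussian kernel, a rule-of-thumb bandwidth (1.06 times the standard deviation times n^(-1/5)), and implicit expansion, which requires MATLAB R2016b or later. It is not how any particular MATLAB function is implemented.

```matlab
rng(0);                                     % fixed seed so the example is repeatable
data = randn(500,1);                        % column vector of observations
n    = numel(data);
bw   = 1.06 * std(data) * n^(-1/5);         % assumed rule-of-thumb bandwidth

% Row vector of evaluation points, padded beyond the data range
x = linspace(min(data)-3*bw, max(data)+3*bw, 200);

% One Gaussian bump per observation: row i is the bump centered at data(i)
bumps = exp(-0.5*((x - data)/bw).^2) / (bw*sqrt(2*pi));
f     = mean(bumps,1);                      % average of all bumps is the density estimate

plot(x,f,'LineWidth',2);
totalArea = trapz(x,f);                     % numerical area under the curve, close to 1
```

A smaller bw makes each bump narrower and the curve more wiggly, while a larger bw smooths the curve and can blur separate peaks together.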

Approximating Density with Normalized Histograms

If you normalize a histogram with Normalization set to 'pdf', the bar heights approximate an underlying density. As the number of data points increases and the bins become smaller, the histogram with 'pdf' normalization tends to resemble the true density more closely.

For example:

matlab
data = randn(5000,1);
histogram(data,'Normalization','pdf');

To emphasize the density view, you can hide the edges and use a filled style:

matlab
h = histogram(data,'Normalization','pdf');
h.EdgeColor = 'none';

If you want an even smoother appearance, you can combine a 'pdf' normalized histogram with a continuous curve, usually from a density estimate or a known theoretical distribution.
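
For data that are expected to follow a standard normal distribution, the theoretical density can be written out directly and drawn over the histogram. This sketch uses only base MATLAB; the formula is the standard normal density with mean 0 and standard deviation 1, which matches the randn data used here.

```matlab
rng(0);                                     % fixed seed so the example is repeatable
data = randn(5000,1);
h = histogram(data,'Normalization','pdf');
h.EdgeColor = 'none';

x = linspace(-4,4,200);                     % points at which to evaluate the curve
f = exp(-x.^2/2) / sqrt(2*pi);              % standard normal density

hold on;
plot(x,f,'r','LineWidth',2);                % smooth theoretical curve over the bars
hold off;
```

Because both the bars and the curve are on the density scale, they can be compared directly. With 'count' normalization the curve would appear almost flat at the bottom of the plot.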

Kernel Density Estimation with `ksdensity`

The Statistics and Machine Learning Toolbox provides the function ksdensity for kernel density estimation. This function takes your data and returns estimated density values together with the points along the x-axis at which they were evaluated. You can then plot these as a smooth curve.

A basic usage pattern is:

matlab
data = randn(1000,1);
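[f,x] = ksdensity(data);   % density values first, evaluation points second
plot(x,f,'LineWidth',2);

Here, x is a vector of evaluation points sorted in increasing order, and f is the estimated density at these points. Note that ksdensity returns the density values as its first output and the evaluation points as its second. The area under the curve is approximately 1, which means it represents a probability density.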

To compare the density estimate with a histogram, you can overlay them:

matlab
histogram(data,'Normalization','pdf'); 
hold on;
[f,x] = ksdensity(data);
plot(x,f,'r','LineWidth',2);
hold off;

This combination often gives a nice visual summary, where the histogram shows the raw bin structure while the curve highlights the overall trend.

The function ksdensity chooses sensible defaults for the number of evaluation points and the kernel bandwidth, but you can adjust them using name-value pairs. The defaults work well for most introductory uses.
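
To see what the bandwidth does, the sketch below draws the default estimate next to two estimates with manually chosen bandwidths. The values 0.1 and 1 are arbitrary choices for illustration, and the example assumes the Statistics and Machine Learning Toolbox is installed.

```matlab
rng(0);                                     % fixed seed so the example is repeatable
data = randn(1000,1);

[fDefault,xDefault] = ksdensity(data);                  % default bandwidth
[fNarrow,xNarrow]   = ksdensity(data,'Bandwidth',0.1);  % small bandwidth, wigglier curve
[fWide,xWide]       = ksdensity(data,'Bandwidth',1);    % large bandwidth, flatter curve

plot(xDefault,fDefault,'k','LineWidth',2);
hold on;
plot(xNarrow,fNarrow,'r');
plot(xWide,fWide,'b');
hold off;
legend('default','Bandwidth 0.1','Bandwidth 1');
```

The narrow bandwidth follows individual clusters of points, while the wide bandwidth spreads the curve out and lowers its peak.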

Choosing Between Histograms and Density Plots

Both histograms and density plots describe how data are distributed, but they emphasize different aspects.

Histograms directly reflect counts in bins. They show discrete groupings that depend on bin edges. Histograms are intuitive and closely tied to how many observations fall in each interval. They are very common in exploratory analysis and are easy to interpret for beginners.

Density plots with a function like ksdensity emphasize a smooth view of the distribution. They can reveal features like multiple modes more clearly, especially when there is a lot of data. However, they depend on choices like bandwidth, and they can hide some discrete aspects of the data.

In practice, many users begin with a histogram, then add a smooth density curve if more detail is needed. For example, you might start with:

matlab
histogram(data);

and then, if you suspect a particular shape, refine it to:

matlab
histogram(data,'Normalization','pdf');
hold on;
[f,x] = ksdensity(data);
plot(x,f,'r','LineWidth',2);
hold off;

Visual Interpretation of Distribution Shape

When you look at a histogram or density plot, you are mainly interested in the shape of the distribution.

If the bars cluster around the center and fall off symmetrically, the distribution may resemble a normal distribution. If there is a long tail on one side, the distribution is skewed. Heavy tails show more extreme values than a normal distribution would suggest, while several peaks indicate that there might be subgroups or mixtures of populations in the data.

You can change bin settings or bandwidths to check whether apparent features are real or just artifacts of the plotting choices. A pattern that persists as you adjust these settings is likely to be meaningful.

Density plots are especially helpful when distributions overlap. If you overlay two smooth curves, their differences in location and spread become easier to see than if you only look at overlapping bars.
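
A minimal sketch of such an overlay is shown here, again assuming the Statistics and Machine Learning Toolbox. Both estimates are evaluated on the same grid, called pts in this example, by passing the grid as the second input to ksdensity.

```matlab
rng(0);                                     % fixed seed so the example is repeatable
data1 = randn(1000,1);
data2 = 1 + randn(800,1);                   % shifted data
pts   = linspace(-4,6,200);                 % common evaluation grid for both curves

f1 = ksdensity(data1,pts);                  % density of data1 at the grid points
f2 = ksdensity(data2,pts);                  % density of data2 at the grid points

plot(pts,f1,'LineWidth',2);
hold on;
plot(pts,f2,'LineWidth',2);
hold off;
legend('data1','data2');
```

Because each curve has an area of about 1, the different sample sizes (1000 and 800) do not distort the comparison.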

Working with Logarithmic Scales for Histograms

Sometimes the data span several orders of magnitude. In such cases, it can be helpful to display the histogram on a logarithmic scale, usually along the horizontal axis.

You can use axis scaling commands on a histogram just as you would on other plots. For example:

matlab
data = lognrnd(0,1,1000,1);   % lognormal data (needs Statistics and Machine Learning Toolbox)
histogram(data);
set(gca,'XScale','log');

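This keeps the histogram bars but changes the scale for reading the values. The bins are still equally wide in data units, so on the log axis the bars look progressively narrower toward the right. Be aware that the interpretation of bin widths changes, since equal spacing on a log axis corresponds to multiplicative changes in the original scale.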

Using log scales can reveal patterns that are invisible on a linear scale, such as multiplicative relationships or power law behaviors.
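
If you want the bars to have equal width on the log axis, one option is to use logarithmically spaced bin edges. The sketch below builds such edges with logspace, rounded outward to whole powers of 10 so that all data fall inside the bins. The data are generated with exp(randn(...)), which gives lognormal values using only base MATLAB functions; the choice of 30 bins is arbitrary.

```matlab
rng(0);                                     % fixed seed so the example is repeatable
data = exp(randn(1000,1));                  % lognormal data, strictly positive

lo = floor(log10(min(data)));               % power of 10 at or below the smallest value
hi = ceil(log10(max(data)));                % power of 10 at or above the largest value
edges = logspace(lo,hi,31);                 % 31 edges define 30 bins

histogram(data,edges);
set(gca,'XScale','log');                    % bars now appear equally wide
```

Each bin covers the same ratio between its right and left edge, which matches the multiplicative reading of a log axis.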

Important points to remember:

  1. Use histogram for basic histograms, and control binning with NumBins, BinWidth, or BinEdges.
  2. Change Normalization to 'probability' or 'pdf' when comparing samples or approximating densities.
  3. Use consistent bin edges and transparency when overlaying multiple histograms.
  4. Use ksdensity to create smooth density curves and overlay them on 'pdf' normalized histograms.
  5. Adjust axes and visual properties through the histogram object to clarify the shape of distributions.
