
16.3 Basic Correlation and Regression

Understanding Relationships Between Variables

In data analysis you often want to know whether two quantities move together and how to describe that relationship with a simple mathematical model. Correlation and regression are basic tools for this purpose in MATLAB.

This chapter focuses on how to compute and interpret simple correlation and basic linear regression in MATLAB, and how to work with the most important functions and outputs. It does not cover more advanced statistical modeling or time series, which appear elsewhere in the course.

Correlation in MATLAB

Correlation measures the strength and direction of a linear relationship between two variables. In MATLAB you typically use corrcoef to compute the Pearson correlation coefficient.

Suppose you have two numeric vectors x and y of the same length. You can compute the correlation matrix as:

x = [1 2 3 4 5];
y = [2.1 4.0 6.1 7.8 10.2];
R = corrcoef(x, y)

The result R is a 2-by-2 matrix. The diagonal elements are 1, and the two off-diagonal elements are both equal to the correlation between x and y. You usually care about one of these off-diagonal values:

r = R(1,2);

This r is the sample Pearson correlation coefficient, which ranges between -1 and 1. Positive values indicate that as x increases, y tends to increase. Negative values indicate that as x increases, y tends to decrease. Values near 0 indicate little or no linear relationship.

You can also compute correlations for multiple variables at once. If you have a matrix X where each column is a different variable observed for the same cases, corrcoef(X) returns a correlation matrix whose entry in row i, column j is the correlation between column i and column j of X.
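
As a small sketch, assuming a hypothetical matrix X whose three columns are three variables measured on the same five cases:

% Hypothetical data: each column is one variable, each row one observation
X = [1  2.0  5.1;
     2  3.9  4.8;
     3  6.2  4.0;
     4  8.1  3.2;
     5  9.8  2.5];
R = corrcoef(X);     % 3-by-3 correlation matrix
r12 = R(1,2);        % correlation between column 1 and column 2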

If your data is in a table, you can convert selected variables to a matrix using {:,:} or table2array, then pass that matrix to corrcoef.
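
For example, assuming a hypothetical table T with three numeric variables, a sketch could be:

% Hypothetical table with three numeric variables A, B, and C
T = table((1:5).', [2.1; 4.0; 6.1; 7.8; 10.2], [9.9; 8.1; 6.2; 3.8; 2.0], ...
          'VariableNames', {'A', 'B', 'C'});
M = table2array(T);   % or T{:,:} to extract all variables as a matrix
R = corrcoef(M);      % pairwise correlations of A, B, and C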

Visualizing Correlation with Scatter Plots

It is helpful to look at a scatter plot before trusting a correlation value. You can create a basic scatter plot with:

scatter(x, y);
xlabel('x');
ylabel('y');
title('Scatter of y vs x');

A roughly straight cloud of points that slopes upward corresponds to a positive correlation. A downward slope corresponds to a negative correlation. If the pattern is curved or very irregular, the correlation coefficient alone may be misleading, even if its magnitude is large.

Scatter plots are also useful for spotting outliers, which can strongly affect the correlation value.
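
To see this effect in a small experiment with made-up numbers (the specific values are only illustrative):

x = [1 2 3 4 5];
y = [2.1 4.0 6.1 7.8 10.2];
R1 = corrcoef(x, y);          % strong positive correlation
y_out = y;
y_out(5) = -5;                % replace the last point with an outlier
R2 = corrcoef(x, y_out);      % the correlation changes markedly
scatter(x, y_out);            % the scatter plot makes the outlier obvious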

Simple Linear Regression with polyfit

Basic linear regression fits a straight line to describe how a response variable y changes as a predictor x changes. A simple way to do this in MATLAB is with the function polyfit.

For a straight line, you use a polynomial of degree 1. The model is

$$
y \approx p_1 x + p_2,
$$

where $p_1$ is the slope and $p_2$ is the intercept. In MATLAB:

p = polyfit(x, y, 1);
slope     = p(1);
intercept = p(2);

You can then compute fitted values of y for given x using polyval:

yfit = polyval(p, x);

If x is a vector of original data points, yfit contains the predicted values at those points. If you create a finer grid for plotting, for example with linspace, you can draw a smooth regression line.

Plotting a Regression Line

To see the fitted line on top of your data, you can combine scatter, polyfit, and polyval. Here is a typical pattern:

x = [1 2 3 4 5];
y = [2.1 4.0 6.1 7.8 10.2];
p = polyfit(x, y, 1);       % fit line
xline = linspace(min(x), max(x), 100);
yline = polyval(p, xline);  % predicted values
scatter(x, y, 'filled');
hold on;
plot(xline, yline, 'r-', 'LineWidth', 2);
hold off;
xlabel('x');
ylabel('y');
title('Linear Regression Fit');

The hold on and hold off commands let you plot the points and the line in the same axes. The red line shows the linear trend estimated by regression.

Regression with fitlm for Extra Output

If you have the Statistics and Machine Learning Toolbox, the function fitlm provides a richer interface for linear regression. It uses a model that can be written as

$$
y = \beta_0 + \beta_1 x + \varepsilon,
$$

where $\beta_0$ is the intercept, $\beta_1$ is the slope, and $\varepsilon$ is random error.

A basic usage for vectors x and y is:

mdl = fitlm(x, y);

The result mdl is a linear model object that contains many properties. Some of the most useful for beginners, with a short access example after this list, are:

mdl.Coefficients: a table with the estimated intercept and slope.
mdl.Rsquared.Ordinary: the coefficient of determination, a measure between 0 and 1 of how much of the variation in y is explained by the linear model.
mdl.Fitted: the fitted values of y for each observation.
mdl.Residuals: the differences between observed and fitted values.
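
As a small sketch of using these properties, assuming mdl comes from the fitlm call above, you could plot the residuals against the fitted values:

yfit = mdl.Fitted;            % fitted value for each observation
res  = mdl.Residuals.Raw;     % raw residuals (observed minus fitted)
scatter(yfit, res);
xlabel('Fitted values');
ylabel('Raw residuals');
title('Residuals vs Fitted Values');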

You can display a summary in the Command Window simply by typing:

mdl

If your data is in a table, you can call fitlm with table variable names. For example, if T is a table with variables Height and Weight, you can write:

mdl = fitlm(T, 'Weight ~ Height');

The formula 'Weight ~ Height' tells MATLAB to model Weight as a linear function of Height.
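
A self-contained sketch with made-up heights and weights (the numbers are only illustrative):

Height = [160; 172; 181; 168; 175];     % hypothetical heights in cm
Weight = [ 55;  68;  80;  62;  74];     % hypothetical weights in kg
T = table(Height, Weight);
mdl = fitlm(T, 'Weight ~ Height');      % Weight modeled as a linear function of Height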

Interpreting Basic Regression Results

For a simple regression with one predictor, the key numbers are:

The slope, which tells you the estimated change in y for a one unit increase in x.
The intercept, which is the estimated value of y when x is zero.
The coefficient of determination $R^2$, which summarizes how well the line fits the data.

If you used fitlm, you can access these as:

coefTable = mdl.Coefficients;
intercept = coefTable.Estimate(1);
slope     = coefTable.Estimate(2);
R2        = mdl.Rsquared.Ordinary;

A value of $R^2$ close to 1 indicates that the line accounts for most of the variability in y. A small $R^2$ means the linear model does not explain much of the variation.

You can add the fitted line from a fitlm model to a scatter plot using the plot method of the model object:

plot(mdl);

This command creates a figure that includes the data and the regression line, along with some diagnostic information. If you want more control, you can extract the fitted values manually and use plot as shown before.
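
For example, assuming mdl was fitted to the x and y vectors from earlier, one way to plot the data and the fitted values manually is:

scatter(x, y, 'filled');
hold on;
plot(x, mdl.Fitted, 'r-', 'LineWidth', 2);   % fitted values from the model object
hold off;
xlabel('x');
ylabel('y');
title('fitlm Fit Plotted Manually');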

Correlation and Regression Together

Although correlation and regression are related, they are not the same thing. Correlation summarizes how strongly two variables vary together in a linear way. Regression provides an explicit equation that describes how one variable changes on average when the other changes.

In practice with MATLAB, you will often compute both. A typical workflow, sketched in code after this list, is:

Use scatter to visualize the relationship.
Use corrcoef to compute the correlation coefficient.
Use polyfit or fitlm to obtain a regression line and fitted values.
Inspect residuals or $R^2$ to judge the quality of the fit.
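
Putting these steps together, a minimal end-to-end sketch using the example vectors from earlier (here $R^2$ is computed by hand from the polyfit output rather than taken from fitlm):

x = [1 2 3 4 5];
y = [2.1 4.0 6.1 7.8 10.2];

scatter(x, y, 'filled');                 % 1. visualize the relationship
R = corrcoef(x, y);
r = R(1,2);                              % 2. correlation coefficient

p = polyfit(x, y, 1);                    % 3. straight-line fit
yfit = polyval(p, x);

residuals = y - yfit;                    % 4. judge the fit
SSres = sum(residuals.^2);
SStot = sum((y - mean(y)).^2);
R2 = 1 - SSres/SStot;                    % coefficient of determination

hold on;
plot(x, yfit, 'r-', 'LineWidth', 2);
hold off;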

Remember that both correlation and simple linear regression, as described here, focus on linear relationships and may not capture curved or more complex patterns.

Important points to remember:
Use corrcoef for basic Pearson correlation between numeric variables.
Always inspect scatter plots to see if a linear summary is reasonable.
Use polyfit and polyval for quick straight line fits and plotting.
Use fitlm for richer regression output such as coefficients and $R^2$.
Correlation and regression describe linear relationships and can be misleading when the pattern is strongly nonlinear or affected by outliers.
