16.2 Sorting, Filtering, and Grouping Data

Introduction

Sorting, filtering, and grouping are three fundamental operations when you start exploring and shaping data in MATLAB. They help you reorder observations, keep only the records you care about, and summarize information across categories. This chapter focuses on how to perform these tasks on arrays and tables using core MATLAB functions, and how they fit into a basic data analysis workflow.

Sorting Data

Sorting means arranging data in a defined order. In MATLAB, sorting is commonly done with the sort function for numeric and string arrays, and sortrows for matrices and tables where you care about keeping rows together.

Sorting Numeric Vectors and Matrices

For a simple numeric vector, sort returns the values in ascending order:

x = [4 1 7 3];
y = sort(x);      % y = [1 3 4 7]

If you also want to know where the sorted elements came from in the original vector, request a second output:

[y, idx] = sort(x);
% y   = [1 3 4 7]
% idx = [2 4 1 3]

The index vector idx can be used to rearrange one variable according to the sorting order of another, which is especially useful when you have related vectors:

age   = [30 22 27 25];
score = [80 95 70 88];
[ageSorted, idx] = sort(age);
scoreSorted      = score(idx);

By default, sort uses ascending order. For descending order, specify the direction:

yDesc = sort(x, "descend");

For matrices, sort works column by column. This means each column is sorted independently while rows are not treated as records:

A = [7 1; 4 3];
B = sort(A);    % B = [4 1; 7 3]; each column is sorted separately, so the original rows are split up

If you need to keep each row as an observation, use sortrows instead.
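
The difference between the three behaviors is easiest to see side by side. The following is a minimal sketch with a made-up matrix M (the values are illustrative only); it also uses the optional dimension argument of sort, which sorts along rows instead of columns:

```matlab
% Illustrative matrix; the values are made up for this sketch
M = [3 9; 8 2; 5 4];

byCol = sort(M);       % sorts down each column:   [3 2; 5 4; 8 9]
byRow = sort(M, 2);    % sorts along each row:     [3 9; 2 8; 4 5]
asRec = sortrows(M);   % reorders whole rows:      [3 9; 5 4; 8 2]
```

Only sortrows keeps each original row intact, which is what you want when a row is one observation.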

Sorting Rows of Matrices

When rows represent observations and columns represent variables, sortrows orders the rows according to one or more columns:

A = [30 80; 22 95; 27 70; 25 88];  % [age score]
A1 = sortrows(A, 1);                % sort by first column (age)

You can sort by multiple columns by providing a vector of column indices. MATLAB will sort by the first column in the list, then break ties using the next column, and so on:

B = [25 80; 25 90; 22 95; 25 85];
B1 = sortrows(B, [1 2]);    % sort by age, then score

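To give each column of a matrix its own direction, put a minus sign in front of a column index to sort that column in descending order, for example sortrows(B, [1 -2]). With tables you name the directions instead, using "ascend" and "descend", as covered in the next section.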
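
As a small sketch of mixed directions on a matrix, the example below sorts by age ascending and then by score descending within equal ages. It reuses the values of B from above; the names B2 and B3 are introduced here only for illustration:

```matlab
B = [25 80; 25 90; 22 95; 25 85];   % [age score]

% Negative index: column 1 ascending, column 2 descending
B2 = sortrows(B, [1 -2]);
% B2 = [22 95; 25 90; 25 85; 25 80]

% The same ordering written with an explicit direction argument
B3 = sortrows(B, [1 2], {'ascend', 'descend'});
sameResult = isequal(B2, B3);       % true
```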

Sorting Tables by Variables

For table data, sortrows works with variable names, which is often clearer than using numeric indices. Assume a table T with variables Age, Score, and Group:

Tsorted = sortrows(T, "Age");

To sort by more than one variable, pass a string array or cell array of names:

Tsorted = sortrows(T, ["Group" "Score"]);

You can also set sort directions for each variable:

Tsorted = sortrows(T, "Score", "descend");
Tsorted2 = sortrows(T, ...
    ["Group" "Score"], ...
    ["ascend" "descend"]);

Sorting tables by variables is a key step when you later want to group or summarize data by those variables.
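
To try these calls yourself you need a table to work with. The sketch below builds a small hypothetical table T (the ages, scores, and groups are made up for illustration) and sorts it by Group, then by Score in descending order. The directions are passed as a cell array here, which is one of the accepted forms:

```matlab
% Hypothetical sample data; names and values are made up for illustration
Age   = [30; 22; 27; 25; 41; 35];
Score = [80; 95; 70; 88; 60; 77];
Group = categorical(["A"; "A"; "B"; "B"; "A"; "B"]);
T = table(Age, Score, Group);

% Group ascending, then Score descending within each group
Tsorted2 = sortrows(T, ["Group" "Score"], {'ascend', 'descend'});
disp(Tsorted2)
```

Each row moves as a unit, so the Age values follow their Score and Group values into the new order.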

Filtering Data with Logical Conditions

Filtering means keeping only the elements or rows that meet some condition and discarding the rest. In MATLAB this is usually done with logical indexing. You create a logical array that is true where the condition holds, then use it to select the desired values.

Filtering Numeric Vectors

Consider a numeric vector:

x = [4 1 7 3 9];

To keep only values greater than 4, create a logical condition:

ix = x > 4;      % ix = [0 0 1 0 1] (logical)
xFiltered = x(ix);   % [7 9]

You can combine multiple conditions using logical operators like & (and) or | (or). For example, values between 3 and 8 inclusive:

ix = (x >= 3) & (x <= 8);
xBetween = x(ix);

Filtering with logical indices does not change the original vector; it returns a new one. To update the original variable, assign the result back to it.
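
The sketch below rounds out the operators mentioned above with | (or) and ~ (not), and shows the assign-back step. The variable names are chosen only for this example:

```matlab
x = [4 1 7 3 9];

% OR: values below 3 or above 8
xOutside = x((x < 3) | (x > 8));     % [1 9]

% NOT: everything except values greater than 4
xNotLarge = x(~(x > 4));             % [4 1 3]

% Count the matches without extracting them
nLarge = nnz(x > 4);                 % 2

% x is unchanged until you assign the filtered result back
x = x(x > 4);                        % x is now [7 9]
```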

Filtering Rows of Matrices

When rows correspond to observations, you usually build a logical index from one column and use it to select entire rows. Suppose the first column is age and the second is score:

A = [30 80; 22 95; 27 70; 25 88];  % [age score]
ix = A(:,1) >= 25;      % ages 25 or older
Aolder = A(ix, :);      % keep matching rows

The colon : selects all columns, while the logical index ix selects only the rows you want.
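
The index does not have to come from a single column. As a brief sketch using the same illustrative matrix, you can combine conditions on two columns and then pick either whole rows or just one column of the matching rows:

```matlab
A = [30 80; 22 95; 27 70; 25 88];   % [age score]

% Rows with age >= 25 and score >= 80
ix = (A(:,1) >= 25) & (A(:,2) >= 80);
Asel = A(ix, :);          % [30 80; 25 88]

% Only the scores (second column) of the matching rows
scoresSel = A(ix, 2);     % [80; 88]
```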

Filtering Table Rows

With tables, filtering is more readable because you can refer to variables by name. Assume a table T with variables Age, Score, and Group:

ix = T.Age >= 25;
T1 = T(ix, :);

To filter based on a categorical or string variable, you can compare to a value or use ismember. For example, keep only rows where Group is "A":

ix = (T.Group == "A");
TA = T(ix, :);

or if you want several groups:

ix = ismember(T.Group, ["A" "C"]);
TAC = T(ix, :);

You can combine multiple conditions with logical operators. For example, age at least 25 and score above 80:

ix = (T.Age >= 25) & (T.Score > 80);
Tfiltered = T(ix, :);

Filtering is one of the central tools for basic data cleaning and preparing subsets for further analysis.
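
Putting these pieces together, here is a minimal self-contained sketch. The table contents are hypothetical and chosen only so that the filter has something to do:

```matlab
% Hypothetical sample data for illustration
Age   = [30; 22; 27; 25; 41; 35];
Score = [80; 95; 70; 88; 60; 77];
Group = categorical(["A"; "A"; "B"; "C"; "A"; "C"]);
T = table(Age, Score, Group);

% Age at least 25, and in group A or C
ix = (T.Age >= 25) & ismember(T.Group, ["A" "C"]);
Tsel = T(ix, :);

% Same rows, but keep only some of the variables
TselSmall = T(ix, ["Age" "Score"]);

nKept = height(Tsel);    % 4
```

Replacing the colon with a list of variable names filters rows and selects columns in one step.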

Removing or Replacing Values

Filtering can also be used to remove or replace specific values such as missing data or obvious outliers.

To remove all negative values from a vector:

x = [4 -1 7 -3 9];
x(x < 0) = [];   % deletes elements where condition is true

To replace suspiciously large values with NaN instead:

x = [4 1 700 3 900];
x(x > 100) = NaN;

For tables, you usually apply this idea columnwise:

ix = T.Score < 0;
T.Score(ix) = NaN;

Removing or replacing values in this way is often an early step before computing summary statistics or fitting models.
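
Missing values need slightly different handling, because NaN never compares equal to anything, including itself. The sketch below, with made-up data, uses isnan to build the logical index and shows rmmissing as a shorter alternative; for tables, rmmissing drops every row that contains a missing value:

```matlab
x = [4 NaN 7 3 NaN];

% x == NaN is always false, so test with isnan instead
xNoNaN = x(~isnan(x));        % [4 7 3]

% rmmissing removes missing entries from a vector
xNoNaN2 = rmmissing(x);       % [4 7 3]

% For a table, rmmissing removes whole rows (hypothetical data)
Score = [80; NaN; 70];
Group = ["A"; "B"; "A"];
Tsmall = table(Score, Group);
Tclean = rmmissing(Tsmall);   % keeps rows 1 and 3
```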

Grouping Data Conceptually

Grouping means splitting your data into subsets that share a common property, and often then applying some operation within each subset. Grouping does not have to change your data type; it is more about how you organize calculations.

In a simple case you can perform grouping manually by selecting subsets with logical indexing. Suppose you have a vector of scores and a grouping variable of the same length:

scores = [80 95 70 88 60 77];
group  = ["A" "A" "B" "B" "A" "B"];

You can compute the mean score separately for groups "A" and "B":

meanA = mean(scores(group == "A"));
meanB = mean(scores(group == "B"));

This is already a combination of filtering and grouping. You use a logical condition based on the group variable to filter rows, then apply a function like mean to the filtered values.

For more complex cases, MATLAB provides functions that group and aggregate at the same time, such as groupsummary and varfun on tables. These functions go beyond this chapter, but the key idea is the same: define one or more grouping variables, then summarize another variable within each group.
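
Just to give a flavor of what such a function looks like, here is a brief sketch with groupsummary on a small hypothetical table. It is a preview rather than a full treatment, and the data are made up:

```matlab
% Hypothetical sample data for illustration
Score = [80; 95; 70; 88; 60; 77];
Group = categorical(["A"; "A"; "B"; "B"; "A"; "B"]);
T = table(Score, Group);

% One output row per group, with the group size and the mean of Score
G = groupsummary(T, "Group", "mean", "Score");
disp(G)
% G has the variables Group, GroupCount, and mean_Score
```

The result is the same mean you would get from the manual logical-indexing approach, collected into a table with one row per group.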

Sorting Before Grouping

Sorting is often useful before grouping because it arranges rows with the same group together, which can make grouped operations easier to understand or implement manually.

If you sort a table by a Group variable:

Tsorted = sortrows(T, "Group");

then all rows from group "A" appear together, followed by all rows from group "B", and so on. This can help you visually inspect group structure or write simple loops that walk through each block of rows for a group. Although you can group without sorting, the combination is common in workflows where you want both structure and human readability.
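
The same idea works on plain vectors. In the sketch below (made-up values), the index output of sort reorders the scores so that each group forms one contiguous block, and a comparison of neighboring labels finds where each block begins:

```matlab
scores = [80 95 70 88 60 77];
group  = ["A" "A" "B" "B" "A" "B"];

% Sort the labels and carry the scores along with the same index
[groupSorted, idx] = sort(group);
scoresSorted = scores(idx);
% groupSorted  = ["A" "A" "A" "B" "B" "B"]
% scoresSorted = [80 95 60 70 88 77]

% A block starts wherever the label differs from the previous one
isStart    = [true, groupSorted(2:end) ~= groupSorted(1:end-1)];
blockStart = find(isStart);          % [1 4]
```

Because sort keeps tied elements in their original order, the scores within each block stay in the order they had before sorting.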

Filtering Within Groups

Sometimes you need to apply different filtering rules inside different groups. You can combine grouping conditions with other criteria in the logical expression.

For instance, keep only rows where Group is "A" and Score is at least 80:

ix = (T.Group == "A") & (T.Score >= 80);
T_A_high = T(ix, :);

If every group uses the same rule, the condition on the variable of interest is enough on its own. If each group needs its own rule, build one condition per group, each combining group membership with that group's rule, and join them with |. Either way, you are again just building a logical index from several parts.
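
As an illustration of different rules in different groups, the sketch below keeps scores of at least 80 in group "A" and at least 75 in group "B". Both thresholds and the data are hypothetical:

```matlab
% Hypothetical sample data for illustration
Score = [80; 95; 70; 88; 60; 77];
Group = categorical(["A"; "A"; "B"; "B"; "A"; "B"]);
T = table(Score, Group);

% One condition per group: membership AND that group's rule
ixA = (T.Group == "A") & (T.Score >= 80);
ixB = (T.Group == "B") & (T.Score >= 75);

% A row is kept if it passes the rule of its own group
Tkeep = T(ixA | ixB, :);
% Tkeep.Score is [80; 95; 88; 77]
```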

Simple Grouped Summaries with Manual Logic

Even without dedicated grouping functions, you can build basic grouped summaries by combining indexing and functions like mean, sum, or median.

Suppose T has variables Group (categorical) and Score:

groups = categories(T.Group);
for k = 1:numel(groups)
    g = groups{k};
    ix = (T.Group == g);
    avgScore = mean(T.Score(ix));
    fprintf("Group %s: mean score = %.2f\n", g, avgScore);
end

Here you first list the distinct groups, then for each group create a logical filter that selects only the rows for that group. Inside the loop you compute a summary measure for that subset. This pattern reflects the basic idea of grouping: split the data by some variable, then apply the same operation to each part.
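
Printing is fine for a quick look, but you will often want to keep the results. One possible variation, sketched here with made-up data and the hypothetical names meanScore, count, and groupStats, stores one value per group and collects everything in a small table:

```matlab
% Hypothetical sample data for illustration
Score = [80; 95; 70; 88; 60; 77];
Group = categorical(["A"; "A"; "B"; "B"; "A"; "B"]);
T = table(Score, Group);

groups    = categories(T.Group);        % cell array of category names
meanScore = zeros(numel(groups), 1);    % preallocate one slot per group
count     = zeros(numel(groups), 1);

for k = 1:numel(groups)
    ix = (T.Group == groups{k});        % rows belonging to group k
    meanScore(k) = mean(T.Score(ix));
    count(k)     = nnz(ix);             % number of rows in the group
end

groupStats = table(groups, count, meanScore, ...
    'VariableNames', {'Group', 'Count', 'MeanScore'});
disp(groupStats)
```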

Combining Sorting, Filtering, and Grouping in a Workflow

In practice, you rarely use sorting, filtering, or grouping in isolation. A simple workflow might look like this:

1) Filter out invalid or unwanted rows, for example missing scores or ages outside a realistic range.
2) Sort the cleaned data by a key variable, for example by Group then by Score.
3) Group and summarize, by selecting subsets by group with logical conditions and computing summary statistics within each subset.

For example:

% 1) Filter: keep only ages between 18 and 65
ix = (T.Age >= 18) & (T.Age <= 65);
Tclean = T(ix, :);
% 2) Sort by Group then Score (descending)
Tsort = sortrows(Tclean, ["Group" "Score"], ["ascend" "descend"]);
% 3) Grouped mean score, done manually
groups = categories(Tsort.Group);
for k = 1:numel(groups)
    g = groups{k};
    ixg = (Tsort.Group == g);
    avgScore = mean(Tsort.Score(ixg));
    fprintf("Group %s: mean score = %.2f\n", g, avgScore);
end

This example uses simple tools that you have already seen in this chapter. Dedicated grouping and aggregation functions will let you do this more compactly, but understanding the underlying logic now will make those functions easier to learn later.
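
If you want to run the whole workflow end to end, the following sketch supplies a small hypothetical table and also drops rows with a missing score in the filter step. All values are made up for illustration:

```matlab
% Hypothetical sample data, including one missing score and two unrealistic ages
Age   = [30; 22; 17; 25; 41; 70; 35];
Score = [80; 95; 70; NaN; 60; 77; 88];
Group = categorical(["A"; "A"; "B"; "B"; "A"; "B"; "B"]);
T = table(Age, Score, Group);

% 1) Filter: ages between 18 and 65, and no missing score
ix = (T.Age >= 18) & (T.Age <= 65) & ~isnan(T.Score);
Tclean = T(ix, :);

% 2) Sort by Group, then Score (descending)
Tsort = sortrows(Tclean, ["Group" "Score"], {'ascend', 'descend'});

% 3) Grouped mean score, done manually
groups = categories(Tsort.Group);
for k = 1:numel(groups)
    ixg = (Tsort.Group == groups{k});
    fprintf("Group %s: mean score = %.2f\n", groups{k}, mean(Tsort.Score(ixg)));
end
```

Filtering out the NaN first matters here, because mean returns NaN for any group that still contains a missing score.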

Important points to remember:
Use sort for vectors and columns, and sortrows when rows represent observations that must stay together.
Filtering is done with logical indexing: build a logical condition, then use it to select rows or elements.
Combine conditions with & and | to filter on multiple criteria.
Group conceptually by using a grouping variable as part of your logical condition, then apply summary functions within each group.
Sorting by group variables before grouping can make group structure easier to see and to process.
