16.4 Simple Data Cleaning Techniques

Why Data Cleaning Matters

Before any meaningful analysis, your data must be in a usable state. Simple data cleaning aims to fix common problems such as missing values, impossible values, duplicated rows, and inconsistent formats. In MATLAB you rarely change the original file directly. Instead, you typically import the data into a table or arrays, clean the result in memory, and then save a cleaned version.

This chapter focuses on straightforward, practical techniques you can apply to small and medium-sized datasets using core MATLAB functions.

Inspecting Data for Problems

The first step of cleaning is to look at your data. Once you have your data in a table T, basic inspection usually includes:

Use head(T) and tail(T) to inspect the first and last few rows. This helps you spot obvious issues such as headers being read as data, extra rows, or strange values at the end.

Use summary(T) to get a quick overview of each variable. For numeric variables, summary reports minimum, maximum, and counts of missing values. For categorical and string variables, it shows categories and counts.

Use size(T) to check how many rows and columns you have, and T.Properties.VariableNames to inspect column names.

A small example:

T = readtable("data.csv");
head(T)
summary(T)
size(T)
T.Properties.VariableNames

These inspection steps guide what cleaning steps you apply next.

Handling Missing Values

Missing data appears in different ways depending on the data type and how the file was imported. Common representations are NaN for numeric data, "" or <missing> for strings, <undefined> for categorical arrays, NaT for datetimes, and empty character vectors in cell arrays.

For simple cleaning, you typically either remove rows with missing values or fill them with replacement values.

To find missing entries in a numeric vector x you can use:

idxMissing = isnan(x);

To find missing values in a table you can use ismissing. It returns a logical array with true where values are missing:

M = ismissing(T);

You can then remove any row that has at least one missing value:

T(any(ismissing(T), 2), :) = [];
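
The built-in rmmissing function does the same thing in a single call; a brief equivalent sketch:

T = rmmissing(T);   % removes every row that contains at least one missing value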

Sometimes it is better to fill missing values rather than drop rows. For numeric columns, use fillmissing. For example, to replace missing entries in variable Temperature with a moving mean computed over a window of five neighboring samples:

T.Temperature = fillmissing(T.Temperature, "movmean", 5);

or with a constant:

T.Temperature = fillmissing(T.Temperature, "constant", 0);

For time-like sequences, fillmissing can interpolate or use neighboring values. The key idea is that you can choose a strategy that fits your analysis, such as constant replacement, interpolation, or local averaging, without writing low-level loops.
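
A minimal sketch of interpolation-based filling, assuming x is a numeric vector sampled at regular intervals:

x = [1 2 NaN NaN 5 6];
xLinear = fillmissing(x, "linear");     % interpolate linearly across the gaps
xHold = fillmissing(x, "previous");     % carry the last observed value forward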

Detecting and Removing Duplicates

Duplicate rows can bias your results. MATLAB provides simple tools to detect and remove them.

To detect duplicate rows in a table, use unique with the "stable" option, which returns the indices of the first occurrence of each distinct row:

[~, idxFirst] = unique(T, "stable");
idxDup = ~ismember((1:height(T))', idxFirst);

idxDup is true for rows that have already appeared earlier. To keep only the first occurrence of each unique row, remove duplicates:

T(idxDup, :) = [];

If you only care about duplicates based on specific columns, use those columns instead:

[~, idxUnique] = unique(T(:, ["ID", "Date"]), "rows");
T = T(idxUnique, :);

Here unique keeps a single row for each combination of ID and Date, and you select those rows from the original table; the "stable" option again preserves the original row order.

Fixing Inconsistent Text and Categories

Text data, such as city names or categories, can be inconsistent. Simple problems include differences in case, extra spaces, or slightly different spellings that refer to the same concept.

Assume a table variable T.City holds city names as strings. To trim unwanted leading and trailing spaces use:

T.City = strtrim(T.City);

To make everything lower case or upper case:

T.City = lower(T.City);
% or
T.City = upper(T.City);

If some values should be corrected or normalized, for example when "NY", "New York", and "N.Y." should all become "New York", you can use string replacement:

T.City = replace(T.City, "NY", "New York");
T.City = replace(T.City, "N.Y.", "New York");

For string arrays or character vectors, regexprep can handle more complicated patterns, but for simple cleaning, straightforward replace is often enough.
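
Note that replace substitutes substrings wherever they occur, so for whole-value corrections exact matching with logical indexing is safer. A brief sketch, assuming T.City is a string array:

T.City(T.City == "NY" | T.City == "N.Y.") = "New York";   % corrects only exact matches
T.City = regexprep(T.City, "\s+", " ");                   % collapses runs of whitespace to single spaces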

When working with categorical variables, you may want to merge several categories into one. Suppose T.Outcome is categorical and contains "ok", "OK", "Ok", and "failure". First standardize the text, then convert to categorical:

s = string(T.Outcome);
s = upper(strtrim(s));
T.Outcome = categorical(s);

Now all "ok" variants become one category "OK".

Simple Outlier Detection and Handling

Outliers are values that look unusually large or small compared to the rest of the data. Simple methods to detect them use summary statistics such as the mean and standard deviation or quantiles.

For a numeric vector x, you can compute the mean and standard deviation:

m = mean(x, "omitnan");
s = std(x, "omitnan");

A basic rule is to mark values more than 3 standard deviations away from the mean:

idxOutlier = abs(x - m) > 3*s;

You can inspect these outliers, remove them, or replace them:

xClean = x;
xClean(idxOutlier) = NaN;

After that, you can either drop rows with missing values or fill them.

An alternative uses quantiles. Compute the 1st and 99th percentiles:

q = quantile(x, [0.01 0.99]);
idxOutlier = x < q(1) | x > q(2);

Again you can decide how to handle those values. These rules are simple and not suitable for every situation, but they are easy to apply and often helpful for an initial pass.
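
MATLAB also packages similar rules into the built-in isoutlier and filloutliers functions. A minimal sketch using the mean-and-standard-deviation rule:

idxOutlier = isoutlier(x, "mean");               % true for values more than three standard deviations from the mean
xRepaired = filloutliers(x, "linear", "mean");   % replace those values using linear interpolation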

Correcting Obvious Invalid Values

Sometimes you know that a variable must be within a realistic range. For example, age should not be negative, and percentages should be between 0 and 100.

Assume a numeric column T.Age. You can mark impossible values as missing:

idxInvalid = T.Age < 0 | T.Age > 120;
T.Age(idxInvalid) = NaN;

Similarly for a percentage variable T.Rate:

idxInvalid = T.Rate < 0 | T.Rate > 100;
T.Rate(idxInvalid) = NaN;

You can combine this with fillmissing or row removal. The key is to use logical expressions to select values that violate your assumptions and then either set them to missing or correct them if you know the proper value.
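
A minimal sketch of that combination, assuming missing ages should be filled with the median of the remaining valid values:

T.Age(T.Age < 0 | T.Age > 120) = NaN;                                % mark impossible ages as missing
T.Age = fillmissing(T.Age, "constant", median(T.Age, "omitnan"));    % then fill with the median of the rest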

Cleaning and Standardizing Date and Time Data

Date and time information often comes as strings in different formats. For simple cleaning, your tasks are to convert them to MATLAB datetime, fix obvious invalid dates, and ensure they all use the same format and time zone.

Suppose T.Date is currently a string column in which every date uses the same format, such as "2025-01-15". Convert it to datetime by telling MATLAB the input format:

T.Date = datetime(T.Date, "InputFormat", "uuuu-MM-dd");

If the column instead uses a different format, such as "01/15/2025":

T.Date = datetime(T.Date, "InputFormat", "MM/dd/uuuu");

If some entries have invalid values, such as 2025-02-30, they become NaT which is the missing value for datetimes. You can find and handle them with ismissing:

idxMissingDates = ismissing(T.Date);

Once the dates are converted, you can standardize the display format if you like:

T.Date.Format = "dd-MMM-uuuu";

If time zones matter, you can set or convert them, but for simple cleaning it is often enough to make sure all dates are valid and in a consistent representation.
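
A minimal sketch of handling time zones, assuming the raw timestamps were recorded in UTC:

T.Date.TimeZone = "UTC";             % declare the zone the data was recorded in
T.Date.TimeZone = "Europe/Berlin";   % reassigning converts the same instants to another zone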

Putting the Steps Together

In practice you will combine these simple techniques to clean a dataset. For example, a basic cleaning function for a small table might:

Read the raw file into a table.
Trim and standardize text fields.
Convert date strings to datetime.
Mark invalid numeric ranges as missing.
Fill or drop missing values.
Remove duplicate rows.

A short illustration:

T = readtable("raw_data.csv");
% Standardize city names
T.City = strtrim(lower(string(T.City)));
% Convert date
T.Date = datetime(T.Date, "InputFormat", "MM/dd/uuuu");
% Fix invalid ages
idxInvalidAge = T.Age < 0 | T.Age > 120;
T.Age(idxInvalidAge) = NaN;
% Remove rows with any missing values
T(any(ismissing(T), 2), :) = [];
% Remove duplicate rows
T = unique(T, "stable");
writetable(T, "clean_data.csv");

This is not a universal recipe, but it shows how the simple tools described in this chapter can be combined into a straightforward cleaning workflow.

Key points to remember:
Use head, tail, and summary to inspect data before cleaning.
Find missing values with ismissing or isnan, then either remove or fill them.
Remove duplicate rows by keeping only the unique rows with unique, using the "stable" option to preserve row order.
Standardize text using strtrim, lower, upper, and replace.
Detect obvious invalid numeric values using logical conditions and set them to NaN.
Convert date strings to datetime with the correct InputFormat and handle NaT.
Always work on a copy or save a cleaned version instead of overwriting raw data.
