This script explores different outlier (or novelty) detection methods on the unlabeled Air Quality data set (https://archive.ics.uci.edu/dataset/360/air+quality), including:
- Identifying outliers in a single variable
- Multivariate outlier detection methods
- Robust multivariate covariance method
What are outliers? What are novelties?
Outliers (or anomalies) are observations that deviate significantly from a dataset's normal behavior. They can represent unwanted data or events of interest (novelties).
Outlier detection vs novelty detection:
- Outlier detection looks for "abnormal" outliers within the known data -> known data already contains outliers
- Novelty detection aims to identify new data points that deviate from the known pattern -> known data is assumed to be outlier-free
Outlier types: point outliers, segment outliers, time series outliers.
Applications:
- Outlier detection: fraud detection, system monitoring, and quality control
- Novelty detection: discovering emerging trends, new customer behaviors, or identifying potential new market
How to find outliers?
- Visual inspection and statistical methods: scatter plot, histogram, boxplot etc.
- Multivariate distance-based methods: $t^2$ statistic method, robust multivariate covariance method
- Density-based methods: DBSCAN
- Model-based methods: Local Outlier Factor, Isolation Forest, Robust Random Cut Forest, One-class SVM
- Deep learning methods: Autoencoder
Download the data if needed.
downloadData = exist(fullfile("Data","AirQualityUCI.xlsx"), "file") ~= 2;
if downloadData
if ~isfolder("Data")
mkdir("Data") % Create the destination folder before downloading into it.
end
url = "https://archive.ics.uci.edu/static/public/360/air+quality.zip";
websave(fullfile("Data","AirQuality.zip"), url)
unzip(fullfile("Data","AirQuality.zip"), "Data")
end
Read in as timetable and preprocess before outlier detection.
dataTT = readtimetable(fullfile("Data","AirQualityUCI.xlsx"), "VariableNamingRule","preserve");
Combine the date row times with the Time variable to form full datetimes, then remove the Time variable.
dataTT.Properties.RowTimes = dataTT.Properties.RowTimes+dataTT.Time;
dataTT.Time =[];
Standardize the missing values: the original data uses -200 to indicate missing entries.
dataTT= standardizeMissing(dataTT,-200);
90% of the data in the NMHC(GT) variable is missing, so we exclude this column first.
dataTT.("NMHC(GT)")=[];
Remove the remaining rows with missing entries.
dataTT= rmmissing(dataTT);
Plot all the variables in a stacked plot.
figure
stackedplot(dataTT)
Extract and normalize the features data in matrix form for the following machine learning algorithms.
data = dataTT.Variables;
dataNorm = normalize(data);
Take O3 as an example. We first examine whether the data is normally distributed before choosing the outlier detection methods.
ozone = dataTT.("PT08.S5(O3)");
figure
tiledlayout("flow")
nexttile
histogram(ozone, "NumBins",50)
title("Ozone Histogram( mean: "+ round(mean(ozone), 2)+", std:"+round(std(ozone), 2)+")")
nexttile
boxchart(ozone, "Orientation","horizontal", "JitterOutliers","on")
yticklabels("Ozone")
Try out a few appropriate outlier detection methods for this variable, e.g., quantile, moving median, and isolation forest.
% Quantile method
quantileOutlierIdx = isoutlier(ozone,"quartiles");
% Moving median method
moveMedianOutlierIdx = isoutlier(ozone,"movmedian",9);
% Iforest method
[mdl, iforestOutlierIdx] = iforest(ozone, "ContaminationFraction",0.03);
figure
% Plot the original data.
plot(dataTT.Properties.RowTimes, ozone, "Color",[0.5, 0.5, 0.5])
hold on
% Quantile outliers
scatter(dataTT.Properties.RowTimes(quantileOutlierIdx), ozone(quantileOutlierIdx), "MarkerEdgeColor","b", "Marker","x")
% Moving median outliers
scatter(dataTT.Properties.RowTimes(moveMedianOutlierIdx), ozone(moveMedianOutlierIdx), "MarkerEdgeColor","r")
% Iforest outliers
scatter(dataTT.Properties.RowTimes(iforestOutlierIdx), ozone(iforestOutlierIdx), "MarkerEdgeColor","k")
hold off
xlim tight
legend(["Input data","Quantile("+nnz(quantileOutlierIdx)+")", ...
"Moving median("+nnz(moveMedianOutlierIdx)+")", "Isolation Forest ("+nnz(iforestOutlierIdx)+")"], "Location","best")
For univariate outlier detection, we can also use the Clean Outlier Data Live Task.
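A rough programmatic sketch of what such a cleaning step could look like, filling the detected outliers instead of only flagging them (the fill method and the 9-sample window are assumptions mirroring the moving median detection above):
% Fill moving-median outliers in the ozone series by linear interpolation.
[ozoneCleaned, filledIdx] = filloutliers(ozone, "linear", "movmedian", 9);
figure
plot(dataTT.Properties.RowTimes, ozone, "Color",[0.7, 0.7, 0.7])
hold on
plot(dataTT.Properties.RowTimes, ozoneCleaned, "b")
scatter(dataTT.Properties.RowTimes(filledIdx), ozone(filledIdx), "Marker","x", "MarkerEdgeColor","r")
hold off
xlim tight
legend(["Original", "Cleaned", "Filled outliers ("+nnz(filledIdx)+")"], "Location","best")
title("Ozone with Moving Median Outliers Filled")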
[rho, pValues] = corr(dataNorm,"Type","Pearson");
featureNames = string(dataTT.Properties.VariableNames);
figure
heatmap(featureNames, featureNames, rho, "Colormap", cool())
title("Data Correlation")
[C, S, e, t2, explained] = pca(dataNorm);
figure
tiledlayout(1,2)
nexttile
pareto(explained)
xlabel("Principal Component")
nexttile
plot(e, "o-", "LineWidth", 2)
xlabel("Principal Component")
ylabel("Eigenvalue")
To find the number of components to use, you can use
- the elbow method
- white noise method
- broken stick method (sketched after the cumulative variance computation below)
- cumulative variance method (i.e., to capture certain amount of variance in the data, how many components do we need?)
varThreshold = 85;
numComponents = find(cumsum(explained)>=varThreshold,1)
numComponents = 3
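For comparison, a minimal sketch of the broken stick criterion mentioned above, using the explained output from pca (the criterion keeps components whose explained variance exceeds the broken-stick expectation; this implementation is illustrative, not a built-in function):
% Broken-stick expectation for each component, in percent of total variance.
p = numel(explained);
brokenStick = 100/p * arrayfun(@(k) sum(1./(k:p)), (1:p)');
% Keep components up to the first one that falls below its broken-stick value.
numComponentsBrokenStick = find(explained < brokenStick, 1) - 1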
Visualize the scores using the first three principal components. Show the positions of the original vectors in the new coordinate system.
figure
biplot( C(:, 1:3), "Scores", S(:, 1:3), "VarLabels",featureNames)
title( "Scores and Variables")
To understand which of the original features are important to the reduced PCA components, we can use a post-hoc rotation provided by [rotatefactors](matlab: doc rotatefactors).
A = rotatefactors(C);
A = round(abs(A));
[~, impIdx] = max(A);
importantFeatures = featureNames(impIdx(1:numComponents))';
disp(importantFeatures)
"C6H6(GT)"
"T"
"RH"
- C6H6(GT): true hourly averaged benzene concentration in microg/m^3 (reference analyzer). Benzene (C6H6) is considered a precursor to ozone formation, meaning it can chemically react with other pollutants in the atmosphere, particularly in the presence of sunlight, to produce ozone as a secondary pollutant, especially when combined with nitrogen oxides (NOx). Ground-level ozone is a harmful air pollutant that is closely linked to air quality: it is a major component of smog, and it can damage human health and the environment.
- T: temperature ($^{\circ } C$)
- RH: relative humidity (%)
Hotelling's $T^2$ statistic measures the multivariate distance of each observation from the center of the data set; pca returns it in the t2 output, computed from the scores scaled by their corresponding eigenvalues.
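As a sanity check, the statistic can be recomputed from the scores and eigenvalues returned above (a sketch that assumes all returned components have nonzero variance, which holds for this data set):
% Hotelling's T^2 as the sum of squared scores divided by the eigenvalues.
t2Check = sum((S.^2)./e', 2);
max(abs(t2Check - t2)) % should be numerically close to zero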
Set the outlier fraction between 0.01 and 0.05.
contaminationFraction = 0.03;
t2Threshold = quantile(t2, 1-contaminationFraction);
t2OutlierIdx = t2 > t2Threshold;
figure
histogram(t2, "Normalization","cdf", "FaceAlpha",0.1, "EdgeAlpha",0.4)
hold on
yline(1-contaminationFraction, "Label","Cumulative probability="+(1-contaminationFraction), ...
"Color","r","LineWidth",2,"LabelHorizontalAlignment","left","LabelColor","m", "LabelVerticalAlignment","bottom")
xline(t2Threshold, "Color","r", "Label","Threshold value="+round(t2Threshold,2),"LabelColor","m","LineWidth",2, "LabelVerticalAlignment","bottom")
title("t2 Emperical CDF")
Confirm that it is a reasonable outlier fraction by visualizing the sorted $t^2$ values against the fraction of data.
figure
plot((1:length(t2))/length(t2), sort(t2, "descend"))
xlabel("Percentage of data")
ylabel("t^2-statistic")
xline(contaminationFraction, "Label", "Outlier fraction = "+contaminationFraction)
title("t^2-statistic Outlier fraction")
Visualize in the first three principal components and highlight the outliers.
visualizeOutliersinPCA(S,t2OutlierIdx);
Tabulate the outliers, and display the results.
outliersTT = dataTT;
outliersTT.T2Outlier = t2OutlierIdx;
tSNE (t-distributed Stochastic Neighbor Embedding) constructs a set of embedded points in a low-dimensional space (Y, Student t distribution) whose relative similarities mimic those of the original high-dimensional points (X).
Assumption: distant (or nearby) points in high-dimensional space correspond to distant (or nearby) embedded low-dimensional points. (Generally, it is impossible to match distances exactly between high-dimensional and low-dimensional spaces.)
How: minimize the Kullback-Leibler divergence between a Gaussian distribution in the high-dimensional space (X) and a t distribution in the low-dimensional space (Y).
Steps:
- Remove rows with NaN. Standardize the data and, optionally, reduce it to a given number of principal components (recommended).
- Calculate the pairwise distances (Euclidean metric by default) between the high-dimensional points, $d(x_i ,x_j )$.
- Estimate a Gaussian distribution for each $x_i$ and the conditional probability $p_{j|i}$ (i.e., the probability that $x_i$ would pick $x_j$ as its neighbor if neighbors were picked in proportion to their probability density under a Gaussian centered at $x_i$): $p_{j|i} =\frac{\exp (-d(x_i ,x_j )^2 /(2\sigma_i^2 ))}{\sum_{k\not= i} \exp (-d(x_i ,x_k )^2 /(2\sigma_i^2 ))}$. The joint probability is $p_{ij} =\frac{p_{j|i} +p_{i|j} }{2N}$. $P_i$ denotes the conditional probability distribution over all other data points given data point $x_i$; its Shannon entropy is $H(P_i )=-\sum_j p_{j|i} \log_2 (p_{j|i} )$, and the Perplexity parameter of tsne corresponds to $2^{H(P_i )}$.
- Initialize a set of random Gaussian-distributed points Y.
tsne then calculates the similarities between each pair of points in Y. The probability model $q_{ij}$ of the distribution of the distances between points $y_i$ and $y_j$ is $q_{ij} =\frac{(1+\|y_i -y_j \|^2 )^{-1} }{\sum_k \sum_{l\not= k} (1+\|y_k -y_l \|^2 )^{-1} }$. The Kullback-Leibler divergence between the joint distributions P and Q is $KL(P\|Q)=\sum_j \sum_{i\not= j} p_{ij} \log \frac{p_{ij} }{q_{ij} }$.
rng default
[tsneEmbeddings, loss] = tsne(data,"NumDimensions",3, "NumPCAComponents",numComponents,"Standardize",true, "Distance","mahalanobis");
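The second output, loss, is the Kullback-Leibler divergence that tsne minimizes; displaying it gives a quick check on how well the embedding preserves the high-dimensional similarities (lower is better).
disp("tSNE Kullback-Leibler divergence: " + loss)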
Mahalanobis distance measures the distance between a point x and the center of a multivariate distribution. Given a probability distribution with mean $\mu$ and covariance matrix $\Sigma$, the Mahalanobis distance of a point $x$ is $d_M (x)=\sqrt{(x-\mu )^T \Sigma^{-1} (x-\mu )}$.
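As a minimal sketch, the distance can be computed directly or with pdist2, here using the sample mean and covariance of the normalized data as stand-ins for the distribution parameters:
% Mahalanobis distance of each observation from the sample mean.
mu0 = mean(dataNorm);
Sigma0 = cov(dataNorm);
dM = pdist2(dataNorm, mu0, "mahalanobis", Sigma0); % n-by-1 vector of distances
% Equivalent explicit form for a single observation x:
% sqrt((x - mu0) / Sigma0 * (x - mu0)')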
figure
ax = axes;
scatter3(tsneEmbeddings(:,1), tsneEmbeddings(:, 2), tsneEmbeddings(:,3), ...
"filled", "MarkerFaceColor",[ 0.0620,0.50,0.5010], "MarkerFaceAlpha",0.8)
hold on
m = mean(tsneEmbeddings);
M = cov(tsneEmbeddings);
sigmaNum = 5; % Squared Mahalanobis distance from the center that defines the ellipsoid boundary.
f = @(x,y,z)([x,y,z]-m)*inv(M)*([x;y;z]-m')-sigmaNum; %#ok<MINV>
f1 = @(x,y,z) arrayfun(f,x,y,z);
h = fimplicit3(f1,axis(gca));
h.EdgeColor = "none";
h.FaceColor = [1.0000, 0.8190, 0.6190];
h.FaceAlpha = 0.5;
hold off
xlabel("tsneEmbeddingsX")
ylabel("tsneEmbeddingsY")
zlabel("tsneEmbeddingsZ")
title("Air Quality Data in 3D tSNE Representation")
Euclidean Distance Vs Mahalanobis Distance :
- Mahalanobis distance can be thought of as a generalized form of Euclidean distance; Euclidean distance is the special case in which the covariance matrix is the identity matrix.
- Euclidean distance treats all dimensions equally; Mahalanobis distance scales distances based on the variability and correlations in the data.
- Euclidean distance is suitable for simple, uncorrelated data, while Mahalanobis distance is preferred for multivariate data where correlations between variables are significant.
Use the robustcov function to compute robust Mahalanobis distances and robust estimates for the mean and covariance of the data, which are less sensitive to outliers than the estimates from the cov and mean functions. By default, the function assumes that the data set follows a multivariate normal distribution; if this assumption doesn't hold, the method might not work very well.
robustcov minimizes the covariance determinant over (1-contaminationFraction)*100% of the observations.
[sigma,mu,s_robustcov,robustcovOutlierIdx] = robustcov(S(:,1:numComponents), ...
"OutlierFraction",contaminationFraction);
Visualize in the first three principal components and highlight the outliers.
visualizeOutliersinPCA(S,robustcovOutlierIdx);
Tabulate the outliers, and display the results.
outliersTT.robustcovOutlier = robustcovOutlierIdx;
[DBSCAN](matlab: doc dbscan) (Density-based spatial clustering of applications with noise) is a density-based clustering algorithm. DBSCAN is frequently used to analyze and segment point clouds, effectively identifying clusters of points within a 3D space based on their density, making it particularly useful for tasks like object detection and surface segmentation in applications like autonomous driving and 3D reconstruction where point cloud data is prevalent. Doc example about point clouds https://uk.mathworks.com/help/vision/point-cloud-processing.html
The algorithm identifies three kinds of points: core points, border points, and noise points (i.e., outliers).
- Core points in a cluster are points that have at least a minimum number of neighbors (minpts) in their epsilon neighborhood (epsilon). Each cluster must contain at least one core point.
- Border points in a cluster are points that have fewer than the required minimum number of neighbors for a core point (minpts) in their epsilon neighborhood (epsilon).
- Noise points are outliers that do not belong to any cluster.
Empirically, minpts can start at the number of input features + 1.
minpts = numComponents+1;
One strategy for estimating a value for epsilon is to generate a k-distance graph for the input data X. For each point in X, find the distance to the kth nearest point, and plot the sorted points against this distance.
[kD, kDIdx] = pdist2(S(:,1:numComponents),S(:,1:numComponents),"euc","Smallest",minpts);
Visualize the distances in ascending order. The graph contains a knee. The distance that corresponds to the knee is generally a good choice for epsilon, because it is the region where points start tailing off into outlier (noise) territory.
sortedKD = sort(kD(end,:));
figure
plot(sortedKD);
title('k-distance graph')
xlabel("Points sorted with "+ minpts + " nearest distances")
ylabel(minpts + " nearest distances")
grid on
hold on
Set epsilon to the value at the knee of this graph.
epsilon = 0.5;
yline(epsilon, "Color","r","Label","epsilon = "+ epsilon,"LabelHorizontalAlignment","left", "LabelColor","m")
hold off
Perform dbscan using the chosen parameters epsilon
and minpts
.
grpIdx = dbscan(S(:,1:numComponents),epsilon,minpts);
The dbscan algorithm returns the cluster indices for all the observations. An index equal to -1 indicates an outlier (or noise point).
DBSCANOutlierIdx = grpIdx == -1;
Visualize in the first three principal components and highlight the outliers.
visualizeOutliersinPCA(S, DBSCANOutlierIdx)
Tabulate the outliers, and display the results.
outliersTT.DBSCANOutlier = DBSCANOutlierIdx;
- DBSCAN is fast and intuitive, and it works in three or more dimensions. However, selecting the optimal parameters can be tricky, and the model needs to be recalibrated whenever new data needs to be analyzed.
The local outlier factor ([LOF](matlab: doc lof)) algorithm detects outliers based on the relative density of an observation with respect to the surrounding neighborhood. Outliers are points that have a substantially lower density than their neighbors.
The locality is given by the k nearest neighbors, whose distances are used to estimate the density. In the following figure, point A has a much lower density than its neighbors. The number of neighbors k is set by the NumNeighbors name-value argument of lof.
Unsupervised workflow: we don't know which observations are anomalies, but we can make an empirical guess of what fraction of the data is anomalous.
For normal observations, the local outlier factor scores are less than or close to 1, indicating that the local reachability density of an observation is higher than or similar to that of its neighbors. A local outlier factor score greater than 1 can indicate an anomaly.
[lofMdl,LOFOutlierIdx,lof_scores] = lof(S(:,1:numComponents), ...
"ContaminationFraction",contaminationFraction);
Visualize in the first three principal components and highlight the outliers.
visualizeOutliersinPCA(S,LOFOutlierIdx)
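For consistency with the other score-based methods below, you could also inspect the LOF scores with the same helper used later in this script (the model's ScoreThreshold corresponds to the chosen contamination fraction):
scoreHistogram(lof_scores, lofMdl.ScoreThreshold, contaminationFraction)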
For all the model-based outlier detection methods, if a new contamination fraction is set, we don't need to retrain the models; we can use the already computed scores to find the outliers under the new contamination fraction.
newContaminationFraction = 0.05;
newScoreThreshold = quantile(lof_scores,1-newContaminationFraction);
newLOFOutlierIdx = lof_scores > newScoreThreshold;
visualizeOutliersinPCA(S,newLOFOutlierIdx)
Tabulate the outliers, and display the results.
outliersTT.LOFOutlier = LOFOutlierIdx;
- LOF can work with either all continuous or all categorical variables.
- LOF can be computationally expensive in higher dimensions, so scalability is an issue.
- However, compared to DBSCAN, LOF doesn't require an epsilon neighborhood-radius parameter, even though it is also a density-based outlier detection method.
Semi-Supervised workflow for novelty detection
We train an outlier model on all normal observations and use this model to check whether unseen data is anomalous (isanomaly). This workflow also suits other model-based outlier detection methods, including Local Outlier Factor, Isolation Forest, Robust Random Cut Forest, and One-class SVM.
lofNormalMdl = lof(NormalData) % ContaminationFraction defaults to 0, so all training data is treated as normal.
NewDataOutlierTF = isanomaly(lofNormalMdl, NewData)
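A concrete sketch of this workflow on the current data set (the split below is purely illustrative: the first 80% of the PCA scores are assumed to be outlier-free training data, and the remaining rows play the role of unseen data):
% Semi-supervised novelty detection with LOF on the PCA scores.
nTrain = floor(0.8*size(S,1));
normalScores = S(1:nTrain, 1:numComponents); % assumed outlier-free (illustration only)
newScores = S(nTrain+1:end, 1:numComponents); % unseen data
lofNoveltyMdl = lof(normalScores); % ContaminationFraction defaults to 0
newOutlierTF = isanomaly(lofNoveltyMdl, newScores);
nnz(newOutlierTF) % number of new observations flagged as anomalies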
Isolation forest ([iforest](matlab: doc iforest)) builds an ensemble of isolation trees for the dataset to detect outliers. Each isolation tree is trained on a subset of the training observations, sampled without replacement. The isolation trees form a forest, and the results of the individual trees are averaged.
Anomalies are few and different; therefore, an anomaly lands in a separate leaf node closer to the root node and has a shorter path length (the distance from the root node to the leaf node) than normal points.
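For reference, the anomaly score defined by Liu et al. for isolation forests, on which the scores returned by iforest are based, is $s(x,n)=2^{-E[h(x)]/c(n)}$, where $h(x)$ is the path length of observation $x$ in a tree, $E[h(x)]$ is its average over the trees, and $c(n)=2H(n-1)-\frac{2(n-1)}{n}$ (with $H(i)\approx \ln (i)+0.5772$) is the average path length of an unsuccessful search in a binary tree with $n$ observations. Scores close to 1 indicate anomalies; scores well below 0.5 indicate normal observations.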
[iforestMdl, iforestOutlierIdx, iforestScores] = iforest(S(:,1:numComponents), ...
"ContaminationFraction", contaminationFraction);
Visualize the data in the first three principal components and highlight the outliers.
visualizeOutliersinPCA(S, iforestOutlierIdx)
Visualize the scores in a histogram.
scoreHistogram(iforestScores, iforestMdl.ScoreThreshold, contaminationFraction)
Tabulate the outliers, and display the results.
outliersTT.IforestOutlier = iforestOutlierIdx;
The robust random cut forest algorithm classifies a point as a normal point or an anomaly based on the change in model complexity introduced by the point. Similar to the Isolation Forest algorithm, the robust random cut forest algorithm builds an ensemble of trees. The two algorithms differ in how they choose a split variable in the trees and how they define anomaly scores.
rng("default") % For reproducibility
[rforestMdl,rforestOutlierIdx,rforestScores] = rrcforest(S(:,1:numComponents), ...
"ContaminationFraction",contaminationFraction,"StandardizeData",true);
Visualize the data in the first three principal components and highlight the outliers.
visualizeOutliersinPCA(S, rforestOutlierIdx)
Tabulate the outliers, and display the results.
outliersTT.rforestOutlier = rforestOutlierIdx;
Visualize the scores in a histogram.
scoreHistogram(rforestScores, rforestMdl.ScoreThreshold, contaminationFraction)
[OCSVM](matlab: doc ocsvm) learns a hyperplane in a higher-dimensional feature space that separates the data points from the origin with maximum margin.
The objective function of one-class SVM is $\min_{w,\xi ,\rho } \frac{1}{2}\|w\|^2 +\frac{1}{\nu n}\sum_{i=1}^n \xi_i -\rho$ subject to $w\cdot \phi (x_i )\ge \rho -\xi_i$ and $\xi_i \ge 0$, where $\phi$ is the feature map induced by the kernel, $\nu$ controls the fraction of observations allowed to fall outside the margin, and $\rho$ is the offset of the hyperplane from the origin.
In one-class SVM, kernel functions are used to measure the similarity between pairs of data in the input space. Common kernel functions include Gaussian, polynomial and sigmoid methods. These kernels map the original input space into a higher-dimension space, where data points become linearly separable.
Perform One-class SVM outlier detection.
[OCSVMMdl,OCSVMOutlierIdx,OCSVMScores] = ocsvm(S(:,1:numComponents), ...
"ContaminationFraction",contaminationFraction, "StandardizeData",true);
The score from OCSVM is unbounded (from -inf to inf). A negative score value with large magnitude indicates a normal observation, and a large positive value indicates an outlier.
Visualize in the first three principal components and highlight the outliers.
visualizeOutliersinPCA(S,OCSVMOutlierIdx)
Visualize the scores in a histogram.
scoreHistogram(OCSVMScores, OCSVMMdl.ScoreThreshold, contaminationFraction)
OCSVM can be slow because of the nonlinear kernel, and it might not separate the outliers correctly around the margin.
Tabulate the outliers, and display the results.
outliersTT.OCSVMOutlier = OCSVMOutlierIdx;
An autoencoder is a type of neural network that attempts to reconstruct its input at its output layer. Outliers are data points with large reconstruction errors.
An autoencoder network consists of an encoder and a decoder. The encoder maps the input to an efficient representation (i.e., a feature space with fewer dimensions but more informative than the input features). The decoder maps this representation back to the original input.
The encoder maps the input vector $x$ to the encoded vector $z=h(Wx+b)$, where $h$ is a transfer function, $W$ a weight matrix, and $b$ a bias vector. The decoder maps the encoded representation $z$ back to an estimate of the original input, $\hat{x} =h^{\prime } (W^{\prime } z+b^{\prime } )$, using its own transfer function, weights, and bias.
First, transpose the normalized features matrix to features-by-samples, as expected by the autoencoder functions.
feaTrans = dataNorm';
Then use [trainAutoencoder](matlab: doc trainAutoencoder), setting the hidden layer size and the maximum number of epochs used to train the autoencoder.
doTrain = false;
if doTrain
autoenc = trainAutoencoder(feaTrans, 20, "L2WeightRegularization",0.0002, "SparsityProportion",0.15, ...
"MaxEpochs",1000, "ShowProgressWindow",false);
save(fullfile("Data","autoenc.mat"), "autoenc")
else
load(fullfile("Data","autoenc.mat"))
end
trainAutoencoder only provides one hidden layer. You can train stacked autoencoders or define a more complicated deep learning network for the autoencoder, for example, Time Series Anomaly Detection Using Deep Learning. MATLAB also provides the built-in function [deepSignalAnomalyDetector](matlab: doc deepSignalAnomalyDetector) in Signal Processing Toolbox, which uses either an LSTM autoencoder or a convolutional autoencoder to detect outliers in signals.
View the autoencoder graphic diagram.
figure
view(autoenc)
Obtain the reconstructed features from the autoencoder model using the predict function.
feaReconstructed = predict(autoenc,feaTrans);
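Equivalently, the encoder and decoder halves can be applied separately with encode and decode, which makes the latent representation explicit (a sketch using the model trained above):
zLatent = encode(autoenc, feaTrans); % latent representation (hiddenSize-by-numSamples)
feaDecoded = decode(autoenc, zLatent); % reconstruction from the latent representation
max(abs(feaDecoded - feaReconstructed), [], "all") % should match the predict output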
Compute the overall mean squared error and each sample's reconstruction error (the root of the sum of squared differences).
mseError = mse(feaTrans,feaReconstructed, "all")
mseError = 0.0048
E1 = sqrt(sum((feaTrans - feaReconstructed).^2));
Flag observations as outliers if their reconstruction errors fall in the top contaminationFraction of the error distribution.
ETranspose = E1';
figure
bar(ETranspose)
title("Reconstruction Error")
Set the reconstruction threshold for outliers.
threshold = quantile(ETranspose,1-contaminationFraction)
threshold = 0.4473
AutoEncoderOutlierIdx = ETranspose >= threshold;
Visualize in the first three principal components and highlight the outliers.
visualizeOutliersinPCA(S, AutoEncoderOutlierIdx)
Visualize the scores in a histogram.
scoreHistogram(ETranspose, threshold, contaminationFraction)
Tabulate the outliers, and display the results.
outliersTT.AutoEncoderOutlier = AutoEncoderOutlierIdx;
A majority vote or other combination methods, such as a weighted sum, can be applied. For example, if you have a small number of labeled outliers, you can use each method's true positive rate to weight its importance in the final outlier decision (a weighted-sum sketch follows the majority-vote code below).
outliersTT.FinalOutlier = mean(outliersTT{:,["T2Outlier","AutoEncoderOutlier","DBSCANOutlier","IforestOutlier","LOFOutlier","OCSVMOutlier","rforestOutlier","robustcovOutlier"]},2) >=0.5;
visualizeOutliersinPCA(S, outliersTT.FinalOutlier )
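A sketch of the weighted-sum alternative mentioned above; the uniform weights are placeholders, and in practice they could come from, for example, each method's true positive rate on a small labeled subset:
% Weighted combination of the individual outlier flags (hypothetical weights).
methodNames = ["T2Outlier","AutoEncoderOutlier","DBSCANOutlier","IforestOutlier", ...
"LOFOutlier","OCSVMOutlier","rforestOutlier","robustcovOutlier"];
w = ones(1, numel(methodNames))/numel(methodNames); % replace with validated weights
weightedVote = double(outliersTT{:, methodNames}) * w';
weightedOutlierIdx = weightedVote >= 0.5;
visualizeOutliersinPCA(S, weightedOutlierIdx)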
Visualize the final outliers in the important features identified by PCA, along with the ozone variable of interest.
figure
tiledlayout("flow")
interestedFeatures = [importantFeatures;"PT08.S5(O3)"];
for k = 1: length(interestedFeatures)
nexttile
plot(outliersTT.Properties.RowTimes, outliersTT.(interestedFeatures(k)), "Color",[0.5, 0.5, 0.5])
hold on
scatter(outliersTT.Properties.RowTimes(outliersTT.FinalOutlier),outliersTT{outliersTT.FinalOutlier,interestedFeatures(k)}, "Marker","x")
hold off
legend(["Data", "Outliers"], "Location","best")
title(interestedFeatures(k))
end
Helper Functions
function visualizeOutliersinPCA(S,outlierIdx)
figure
scatter3(S(:, 1), S(:, 2), S(:, 3), ".")
hold on
scatter3(S(outlierIdx, 1), S(outlierIdx, 2), S(outlierIdx, 3), "o", "MarkerEdgeColor","r")
hold off
grid on
xlabel("PC1")
ylabel("PC2")
zlabel("PC3")
legend(["Observations", "Outliers"], "Location","best")
title("Outliers in Principal Component Space")
subtitle("Outliers No.: "+ sum(outlierIdx))
end
function scoreHistogram(scores, scoreThreshold, contaminationFraction)
figure
histogram(scores)
xline(scoreThreshold,"r-",...
join(["Threshold Score =", scoreThreshold; "Outlier fraction =", contaminationFraction]), "LabelColor","m", "LineWidth",2)
title("Anomaly Scores histogram")
end