In this script we discuss the theory and motivation for principal component analysis (PCA), and walk through the steps involved when applying PCA to a data set.
Interpreting eigenvalues and eigenvectors
Diagonalization of real symmetric matrices
Application to covariance/correlation matrices
Applying PCA: a geometric interpretation
Translate the data so that the centroid lies at the origin.
Rescale the data so that each dimension has standard deviation one.
Compute the principal components.
Visualize the second principal component.
Overlay the ellipse with axes defined by the principal components.
Visualize the transformed data in the new coordinate system.
Interpreting the coefficients, scores, and eigenvalues
Deciding how many components to keep
Selecting representative features
PCA for image feature extraction
Applications of PCA include:
- Dimensionality reduction
- Understanding how features are related
- Identifying groups of similar features (clustering)
- Finding outliers
PCA can be used whenever we have:
- a collection of continuous numeric variables, or
- a correlation or covariance matrix corresponding to a set of variables.
PCA is a matrix transformation designed to decorrelate a set of variables, represented in a matrix. We use the following notation:
- $X$ - a numeric matrix of size $n\times p$, where $n$ is the number of observations (rows) and $p$ is the number of variables (columns)
- $V$ - an orthogonal matrix (i.e., $VV^T =V^T V=I_{p\times p}$) of size $p\times p$
- $S$ - a numeric matrix of size $n\times p$, obtained by $S=XV$
PCA is based on linear correlation, so it cannot be relied upon to detect nonlinear relationships in data sets.
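As a quick illustration (not part of the original data set), a perfect but nonlinear relationship can have a linear correlation close to zero:
x = linspace( -1, 1, 1000 ).';
y = x.^2; % y is completely determined by x, but the relationship is quadratic
disp( corr( x, y ) ) % approximately 0, because the dependence is not linear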
In this script we work with a processed dataset from the National Health and Nutrition Examination Survey (NHANES).
The dataset comprises a set of continuous and categorical variables recording basic body measurements for a collection of individuals.
Each row of the table holds the data for one individual, and each column of the table is a feature or variable.
load( "HealthData.mat" )
disp( head( H ) )
Age Pulse SystolicBloodPressure DiastolicBloodPressure Height Weight BMI LegLength ArmLength ArmCircumference Waist TricepsSkinfold SubscapularSkinfold MeanArterialPressure SelfReportedHeight SelfReportedWeight BodySurfaceArea
___ _____ _____________________ ______________________ ______ ______ _____ _________ _________ ________________ _____ _______________ ___________________ ____________________ __________________ __________________ _______________
41475 752 66 128 64 154.7 138.9 58.04 34.2 37.6 45.2 156.3 NaN NaN 85.333 157.48 139.71 2.2615
41477 860 78 144 60 167.1 83.9 30.05 32.4 38.2 34.1 109.5 15.6 25.8 88 172.72 86.182 1.9303
41479 630 62 112 70 154.4 65.7 27.56 33.3 34.1 33.2 95.4 13.3 25.9 84 170.18 64.41 1.6429
41481 254 52 112 62 182.7 77.9 23.34 43.6 43 31 79.5 8.4 9.2 78.667 185.42 79.379 1.9954
41482 779 76 116 78 173.8 101.6 33.64 43.5 40 32.8 117 18 20.1 90.667 172.72 103.42 2.1544
41483 804 60 110 62 173.8 133.1 44.06 36.5 38.9 40.5 145.6 17.6 NaN 78 180.34 131.09 2.4165
41485 361 68 108 44 157.9 64.8 25.99 34.3 32.6 30.7 89.7 21.1 30.5 65.333 160.02 62.596 1.66
41486 735 62 126 64 166.2 86.2 31.21 35 36 35.3 97 24.6 24.8 84.667 162.56 84.822 1.945
Suppose we have a linear regression model of the form $\textrm{Weight}=\beta_0 +\beta_1 \,\textrm{Height}+\beta_2 \,\textrm{TricepsSkinfold}+\varepsilon$.
load( "HealthData.mat" )
modelVars = H(:, ["Height", "TricepsSkinfold", "Weight"]);
model = fitlm( modelVars, "linear" );
disp( model )
Linear regression model:
Weight ~ 1 + Height + TricepsSkinfold
Estimated Coefficients:
Estimate SE tStat pValue
________ ________ _______ ______
(Intercept) -157.92 3.1314 -50.431 0
Height 1.2616 0.017782 70.95 0
TricepsSkinfold 1.324 0.021209 62.424 0
Number of observations: 5690, Error degrees of freedom: 5687
Root Mean Squared Error: 12.9
R-squared: 0.546, Adjusted R-Squared: 0.545
F-statistic vs. constant model: 3.41e+03, p-value = 0
Evaluate the model over a grid of height and triceps skinfold values.
[mn, mx] = bounds( H.Height );
fineHeight = linspace( mn, mx );
[mn, mx] = bounds( H.TricepsSkinfold );
fineTriceps = linspace( mn, mx );
[HeightGrid, TricepsGrid] = meshgrid( fineHeight, fineTriceps );
WeightGrid = predict( model, [HeightGrid(:), TricepsGrid(:)] );
WeightGrid = reshape( WeightGrid, size( HeightGrid ) );
Visualize the data and fitted model.
figure
scatter3( H.Height, H.TricepsSkinfold, H.Weight, [], H.Weight, "." )
hold on
surf( HeightGrid, TricepsGrid, WeightGrid, "FaceAlpha", 0.5 )
shading interp
xlabel( "Height (cm)" )
ylabel( "Triceps Skinfold (cm)" )
zlabel( "Weight (kg)" )
title( "Weight vs. Height and Triceps Skinfold" )
colorbar
How do we interpret this model? Since the model is determined completely by its coefficients, it is natural to use the values of $\beta_1$ and $\beta_2$:
- $\beta_1$ represents the change in weight (kg) per unit change in height (cm),
- $\beta_2$ represents the change in weight (kg) per unit change in triceps skinfold (cm).
However, this interpretation is complicated by the fact that height and triceps skinfold are linearly correlated: if height changes, we would expect triceps skinfold to change as well.
figure
ax = axes;
scatter( H.Height, H.TricepsSkinfold, "." )
bestFitLine = lsline();
bestFitLine.LineWidth = 2;
bestFitLine.Color = ax.ColorOrder(2, :);
xlabel( "Height (cm)" )
ylabel( "Triceps Skinfold (cm)" )
title( "Triceps Skinfold vs. Height" )
grid on
[rho, corrPValue] = corr( H.Height, H.TricepsSkinfold, "Rows", "pairwise" );
legend( "Observed data", "Correlation = " + rho )
PCA is based upon some core statistical concepts, primarily the covariance or correlation matrix corresponding to a collection of features. Let's first define the mean and variance.
For a random variable $X$, the mean (or expected value) is denoted by $\mu_X =\mathbb{E}(X)$.
If we have $n$ observations $x_1 ,\ldots ,x_n$ of $X$, we estimate the mean by the sample mean $\bar{x} =\frac{1}{n}\sum_{k=1}^n x_k$.
For a random variable $X$ with mean $\mu_X$, the variance is defined by $\textrm{var}(X)=\mathbb{E}\left\lbrack (X-\mu_X )^2 \right\rbrack$.
This is also equivalent to $\textrm{var}(X)=\mathbb{E}(X^2 )-\mu_X^2$.
The units of variance are equal to the squared units of measurement of $X$.
If we take the square root of the variance, we obtain the standard deviation $\sigma_X =\sqrt{\textrm{var}(X)}$, which has the same units of measurement as $X$.
For two random variables $X$ and $Y$, the covariance is defined by $\textrm{cov}(X,Y)=\mathbb{E}\left\lbrack (X-\mu_X )(Y-\mu_Y )\right\rbrack$.
Note that:
- Covariance can be interpreted as a measure of the dependence between $X$ and $Y$
- If $X$ and $Y$ are independent, then $\textrm{cov}(X,Y)=0$, so computing $\textrm{cov}(X,Y)$ is only meaningful when $X$ and $Y$ are dependent
- The converse is false unless $X,Y$ are also jointly normally distributed
- The units of measurement of covariance are given by the product of the units of $X$ and $Y$
If $\textrm{cov}(X,Y)>0$, then $X$ and $Y$ tend to increase (or decrease) together.
If $\textrm{cov}(X,Y)<0$, then $Y$ tends to decrease as $X$ increases, and vice versa.
If a stick of length 1 is broken randomly into two pieces of length $X$ and $Y=1-X$, then $\textrm{cov}(X,Y)=\textrm{cov}(X,1-X)=-\textrm{var}(X)<0$: the longer one piece is, the shorter the other must be.
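A minimal simulation of the broken stick example (the variable names here are illustrative, not from the original script):
rng default
lengthX = rand( 1e6, 1 ); % uniform break point on [0, 1]
lengthY = 1 - lengthX; % length of the other piece
C = cov( lengthX, lengthY ); % 2-by-2 sample covariance matrix
disp( C(1, 2) ) % approximately -1/12, i.e., -var(X)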
It's difficult to compare covariances between different pairs of variables. This is partly due to the arbitrary scale that the variables may be measured on, and partly due to different units of measurement. For example, how could we compare a covariance of -100 for one pair of variables with a covariance recorded in completely different units for another pair?
If we divide the covariance by the product of the standard deviations, we obtain a dimensionless quantity that can be compared across pairs of variables.
For two random variables $X$ and $Y$, the correlation is defined by $\textrm{corr}(X,Y)=\dfrac{\textrm{cov}(X,Y)}{\sigma_X \sigma_Y}$.
If $X$ and $Y$ are perfectly linearly related, then $\textrm{corr}(X,Y)=\pm 1$; in general, $-1\le \textrm{corr}(X,Y)\le 1$.
Note that if the variables ${\mathbf{x}}_1 ,\ldots ,{\mathbf{x}}_p$ (the columns of the data matrix $X$) have been standardized to have zero mean and unit standard deviation, then $\mathbb{E}({\mathbf{x}}_i {\mathbf{x}}_j )=\frac{1}{n}\sum_{k=1}^n x_{k,i} x_{k,j} =\frac{1}{n}(X^T X)_{i,j}$.
In this case, the correlation matrix of the variables is simply $R=\frac{1}{n}X^T X$.
Extract the continuous variables.
requiredVars = 1:13;
bodyVars = H{:, requiredVars};
bodyVarNames = H(:, requiredVars).Properties.VariableNames;
The variables are recorded using different units of measurement with large differences in scale, so it's good practice to standardize them before applying any machine learning technique.
figure
boxchart( bodyVars )
xticklabels(bodyVarNames)
title("Continuous Body Variables")
grid on
The normalize function applies the z-score transformation by default: each variable is centered by subtracting its mean and scaled by dividing by its standard deviation.
bodyVarsZScore = normalize( bodyVars, "zscore" );
figure
boxchart( bodyVarsZScore )
xticklabels( bodyVarNames )
title( "Continuous Body Variables" )
grid on
Visualize the correlation matrix.
rho = corr( bodyVarsZScore, "Rows", "pairwise" );
figure
heatmap( bodyVarNames, bodyVarNames, rho, "Colormap", cool() )
title( "Body Measurements: Correlation" )
If $A$ is a square matrix, then a nonzero vector $v$ is an eigenvector of $A$, with corresponding eigenvalue $\lambda$, if $Av=\lambda v$.
If $A=\left\lbrack \begin{array}{cc} 2 & 0\newline 0 & 3 \end{array}\right\rbrack$, then the eigenvalues of $A$ are 2 and 3, with corresponding eigenvectors $\left\lbrack \begin{array}{c} 1\newline 0 \end{array}\right\rbrack$ and $\left\lbrack \begin{array}{c} 0\newline 1 \end{array}\right\rbrack$.
If $A$ is $p\times p$ with eigenvalues $\lambda_1 ,\ldots ,\lambda_p$ and corresponding eigenvectors $v_1 ,\ldots ,v_p$, we can collect the eigenvectors as the columns of a matrix $V$ and place the eigenvalues on the diagonal of a matrix $\Lambda$, so that $AV=V\Lambda$.
If the eigenvectors are linearly independent, then $V$ is invertible and $A=V\Lambda V^{-1}$; we say that $A$ is diagonalizable.
If $A$ is real and symmetric ($A=A^T$), then the spectral theorem guarantees that the eigenvalues are real and that $A$ can be diagonalized by an orthogonal matrix: $A=V\Lambda V^T$ with $V^T V=VV^T =I$.
Any correlation matrix is symmetric, so the spectral theorem applies and we can diagonalize the correlation matrix via an orthogonal matrix. Correlation matrices are also positive semidefinite, i.e., $x^T Rx\ge 0$ for every vector $x$, so all of the eigenvalues are nonnegative (and strictly positive unless some variable is an exact linear combination of the others).
We saw above that for our standardized data matrix $X$, the correlation matrix is $R=\frac{1}{n}X^T X$. Writing $R=V\Lambda V^T$ and defining $S=XV$, we have $\frac{1}{n}S^T S=V^T RV=\Lambda$, which is diagonal, so the transformed variables are uncorrelated and their variances are the eigenvalues.
The columns of the matrix $V$ are the principal component coefficients, conventionally ordered so that the corresponding eigenvalues decrease from left to right.
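As a minimal numeric check (using the body measurement variables defined above, with rows containing missing values removed and the remaining rows standardized), the eigenvectors and eigenvalues of the correlation matrix match the coefficients and eigenvalues returned by the pca function used later in this script:
Z = normalize( rmmissing( bodyVars ), "zscore" ); % standardize the complete rows
R = corr( Z ); % symmetric correlation matrix
[Veig, D] = eig( R, "vector" ); % eigenvectors and eigenvalues
[lambdaEig, idx] = sort( D, "descend" ); % order by decreasing eigenvalue
Veig = Veig(:, idx);
[Vpca, ~, lambdaPca] = pca( Z );
disp( max( abs( lambdaEig - lambdaPca ) ) ) % approximately 0
disp( max( abs( abs( Veig ) - abs( Vpca ) ), [], "all" ) ) % approximately 0 (column signs may differ)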
In this section, we'll work through an example of applying PCA in two dimensions. This should help to provide a geometric interpretation of the algorithm.
Create a scatter plot of height and weight.
figure
scatter( H.Height, H.Weight, "." )
xlabel( "Height (cm)" )
ylabel( "Weight (kg)" )
title( "Weight vs. height" )
grid on
Plot the centroid (mean
c = mean( H{:, ["Height", "Weight"]}, "omitmissing" );
hold on
plot( c(1), c(2), ".", "MarkerSize", 20 )
legend( "Data", "Centroid: (" + num2str( c(1), "%.2f" ) + "cm, " + num2str( c(2), "%.2f" ) + "kg)", "Location", "northwest" )
axis equal
We subtract the mean value of each variable. This creates new data centered at the origin.
centeredHeight = H.Height - c(1);
centeredWeight = H.Weight - c(2);
Recreate the scatter graph.
figure
ax = axes();
scatter( centeredHeight, centeredWeight, "." )
xlabel( "Mean-centered height (cm)" )
ylabel( "Mean-centered weight (kg)" )
title( "Weight vs. height" )
grid on
Plot the new centroid.
c = mean( [centeredHeight, centeredWeight], "omitmissing" );
hold on
plot( c(1), c(2), ".", "MarkerSize", 20 )
legend( "Data", "Centroid: (0, 0)", "Location", "northwest" )
axis equal
ax.XAxisLocation = "origin";
ax.YAxisLocation = "origin";
We divide the centered data by the standard deviation of each variable. This creates new data with standard deviation (and variance) one in each dimension.
zscoredHeight = centeredHeight / std( H.Height, "omitmissing" );
zscoredWeight = centeredWeight / std( H.Weight, "omitmissing" );
Recreate the scatter graph.
figure
ax = axes();
scatter( zscoredHeight, zscoredWeight, "." )
xlabel( "z-scored height" )
ylabel( "z-scored weight" )
title( "Z-scored weight vs. z-scored height" )
grid on
hold on
plot( c(1), c(2), ".", "MarkerSize", 20 )
legend( "Data", "Centroid: (0, 0)", "Location", "northwest" )
axis equal
ax.XAxisLocation = "origin";
ax.YAxisLocation = "origin";
Note that the pca function removes rows containing missing values by default.
[V, S] = pca( [zscoredHeight, zscoredWeight] );
Extract column vectors corresponding to the two principal component coefficients.
pc1 = V(:, 1);
pc2 = V(:, 2);
Visualize the first principal component. First, recreate the scatter graph and centroid.
figure
ax = axes();
scatter( zscoredHeight, zscoredWeight, "." )
xlabel( "z-scored height" )
ylabel( "z-scored weight" )
title( "Z-scored weight vs. z-scored height" )
grid on
hold on
plot( c(1), c(2), ".", "MarkerSize", 20 )
Next, format the equation of the first principal component in terms of the variables.
if pc1(2) >= 0
s = " + " + num2str( pc1(2) );
else
s = " - " + num2str( -pc1(2) );
end % if
comp1Eqn = num2str( pc1(1) ) + " * Height" + s + " * Weight";
Add a quiver arrow to represent the principal component vector.
scale = 3;
quiver( c(1), c(2), pc1(1), pc1(2), scale, ...
"LineWidth", 1.5, "Color", "r", "MaxHeadSize", 1 )
axis equal
legend( "Data", "Centroid: (0, 0)", ...
"Principal component 1: " + newline() + comp1Eqn, ...
"Location", "southoutside" )
ax.XAxisLocation = "origin";
ax.YAxisLocation = "origin";
Format the equation of the second principal component.
if pc2(2) >= 0
s = " + " + num2str( pc2(2) );
else
s = " - " + num2str( -pc2(2) );
end % if
comp2Eqn = num2str( pc2(1) ) + " * Height " + s + " * Weight";
quiver( c(1), c(2), pc2(1), pc2(2), scale, ...
"LineWidth", 1.5, "Color", "r", "MaxHeadSize", 1, ...
"DisplayName", "Principal component 2: " + newline() + comp2Eqn )
In this example, the coefficient matrix is $V=\frac{1}{\sqrt{2}}\left\lbrack \begin{array}{cc} 1 & -1\newline 1 & 1 \end{array}\right\rbrack =\left\lbrack \begin{array}{cc} \cos (\pi /4) & -\sin (\pi /4)\newline \sin (\pi /4) & \cos (\pi /4) \end{array}\right\rbrack$.
The matrix $V$ is a rotation matrix: the transformation $S=XV$ expresses the data in a coordinate system whose axes are the original axes rotated through an angle of $\pi /4$ radians (45 degrees).
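As a quick numeric check on the 2-by-2 coefficient matrix V computed above (exact values may differ slightly and column signs may be flipped):
disp( V.' * V ) % approximately the 2-by-2 identity matrix, so V is orthogonal
disp( acosd( abs( V(1, 1) ) ) ) % approximately 45 degrees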
The parametric equation of an ellipse centered at the origin with semi-axes $a$ and $b$ is $x=a\cos (t)$, $y=b\sin (t)$ for $0\le t<2\pi$. Here we take $a=b$ equal to the scale factor used for the quiver arrows, and rotate the resulting points by $V$ so that the axes of the ellipse align with the principal components.
t = linspace( 0, 2 * pi, 500 ).';
x = scale * cos(t);
y = scale * sin(t);
e = [x, y] * V.';
xellipse = e(:, 1);
yellipse = e(:, 2);
plot( xellipse, yellipse, ":", "LineWidth", 2, "DisplayName", "Ellipse with principal component axes" )
Plot the transformed data points (the scores) and the centroid.
figure
ax = axes();
scatter( S(:, 1), S(:, 2), ".", "DisplayName", "Transformed height/weight data" )
hold on
plot( c(1), c(2), ".", "MarkerSize", 20, "DisplayName", "Centroid: (0,0)" )
Annotate the chart and draw the axes to pass through the origin.
ax.XAxisLocation = "origin";
ax.YAxisLocation = "origin";
xlabel( "Component 1" )
ylabel( "Component 2" )
title( "Data points in the transformed coordinates" )
grid on
Overlay the component vectors.
quiver( c(1), c(2), 1, 0, scale, ...
"LineWidth", 2, "Color", "r", "MaxHeadSize", 1, ...
"DisplayName", "Principal component 1" )
quiver( c(1), c(2), 0, 1, scale, ...
"LineWidth", 2, "Color", "r", "MaxHeadSize", 1, ...
"DisplayName", "Principal component 2" )
axis equal
Visualize the ellipse with axes defined by the principal component vectors.
plot( x, y, ":", "LineWidth", 2, "DisplayName", "Ellipse with principal component axes" )
legend( "Location", "southoutside" )
Now that we've applied PCA in two dimensions and interpreted the results geometrically, let's scale up to the entire data set.
bodyVarsClean = rmmissing( bodyVarsZScore );
[V, S, Lambda] = pca( bodyVarsClean );
Verify that the transformed variables are uncorrelated.
srho = corr( S );
figure
heatmap( srho, "Colormap", gray( 2 ), "ColorLimits", [0, 1] )
title( "Scores: Correlation" )
The transformation matrix $V$ contains the principal component coefficients (also known as loadings): the entry $V_{i,k}$ is the weight of the $i$th original variable in the $k$th principal component.
In other words, ${\textrm{PC}}_k =\sum_{i=1}^p V_{i,k} {\mathbf{x}}_i$ for each $k=1,\ldots ,p$.
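As a quick verification (using V, S, and bodyVarsClean from above), each score is exactly this linear combination of the mean-centered variables:
Xc = bodyVarsClean - mean( bodyVarsClean ); % pca centers the data by default
disp( max( abs( Xc * V - S ), [], "all" ) ) % approximately 0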
p = width( bodyVars );
figure
imagesc( 1:p, 1:p, V, [-1, 1] )
colormap cool
xticks( 1:p )
xlabel( "Principal component" )
yticks( 1:p )
yticklabels( bodyVarNames )
title( "Coefficients" )
Visualize the variables that have a high association (coefficient greater than 0.30) with each principal component.
highIdx = V > 0.30;
hold on
spy( highIdx, "g.", 12 )
We see that:
- Component 1 represents weight-related variables
- Component 2 represents height-related variables
- Component 3 represents age and blood pressure-related variables
- Component 4 is closely related to pulse
- Subsequent components don't contribute much more information
The scores are the transformed variables, $S=XV$: each row of $S$ corresponds to one observation and each column to one principal component.
Visualize the scores using the first three principal components. Show the positions of the original vectors in the new coordinate system.
figure
biplot( V(:, 1:3), "Scores", S(:, 1:3), "VarLabels", bodyVarNames )
title( "Scores and Variables" )
The eigenvalues are the variances of the principal components; relative to their sum, they give the proportion of the total variance explained by each component.
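For reference, here is one way to convert the eigenvalues into percentages of the total variance (the same percentages are also available as the fifth output of the pca function):
explainedPct = 100 * Lambda / sum( Lambda ); % percentage of total variance per component
disp( explainedPct(1:4).' ) % contribution of the first four components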
figure
plot( Lambda, "o-", "LineWidth", 2 )
xticks( 1:p )
xlabel( "Principal component" )
ylabel( "Eigenvalue" )
title( "Eigenvalues from PCA" )
grid on
Show the same information as a cumulative percentage of the total variance.
figure
pareto( Lambda )
xlabel( "Principal component" )
ylabel( "Eigenvalue" )
title( "Pareto Plot for the Health Data PCA" )
grid on
In this method, we look for an elbow (a point at which the gradient changes rapidly) in the eigenvalue plot. For this data set, we could choose to keep 4 components.
With this approach, we keep only those principal components which contain more information than we'd obtain by applying the same algorithm to a data set consisting of white noise (i.e., data sampled independently from the standard normal distribution $N(0,1)$). For standardized white noise, every eigenvalue of the correlation matrix is approximately equal to one, so we keep the components whose eigenvalues exceed one.
figure
plot( Lambda, "o-", "LineWidth", 2 )
xticks( 1:p )
xlabel( "Principal component" )
ylabel( "Eigenvalue" )
title( "Eigenvalues from PCA" )
grid on
yline( 1, "LineWidth", 2, "Label", "White noise threshold" )
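As a supporting check (an illustration added here, not part of the original analysis), the eigenvalues of the correlation matrix of pure white noise of the same size as our data are all close to one:
rng default
noise = randn( height( bodyVarsClean ), p ); % white-noise data with the same dimensions
disp( sort( eig( corr( noise ) ), "descend" ).' ) % every eigenvalue is close to 1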
If a stick of length 1 is broken randomly into $p$ pieces, the expected length of the $k$th-longest piece is $\frac{1}{p}\sum_{i=k}^p \frac{1}{i}$.
For example, the expected length of the longest segment is $\frac{1}{p}\left(1+\frac{1}{2}+\cdots +\frac{1}{p}\right)$.
We can choose to keep those principal components whose eigenvalues exceed the corresponding expected segment lengths. Since our eigenvalues sum to $p$ (each standardized variable contributes a variance of one), we compare the $k$th eigenvalue against $p$ times the expected length of the $k$th-longest piece, i.e., against $\sum_{i=k}^p \frac{1}{i}$.
reciprocals = 1 ./ (1:p);
expectedLengths = sum( reciprocals ) - cumsum( [0, reciprocals(1:end-1)] );
hold on
plot( expectedLengths, "o-", "LineWidth", 2 )
legend( "Eigenvalues", "White noise threshold", "Broken stick method" )
Following the broken stick method, we would keep two components.
Let's look at the positions of the variables within the new coordinate system.
figure
biplot( V(:, 1:3), "VarLabels", bodyVarNames )
title( "Grouping Variables" )
Observations:
- There are three main groups, or clusters, of variables, corresponding to the principal components we identified above
- Orthogonal vectors are uncorrelated
- Parallel vectors are correlated
- Long vectors are well-represented in the given coordinates
- Short vectors are not well-represented in the given coordinates (e.g., pulse)
If our objective is to select features for a model, we may want to work with observed variables rather than principal components. This could be for ease of interpretation: a variable such as height or weight is easy to understand, whereas principal components cannot be observed directly - they are linear combinations of features and so exist only in an abstract sense. We might also want to select a representative feature from each of the groups of features that we've previously identified.
We can perform another orthogonal transformation to maximize the correlation between the principal component axes and the original variables.
A = rotatefactors( V );
figure
heatmap( 1:p, bodyVarNames, abs(A), "Colormap", cool() )
title( "After Factor Rotation" )
The disadvantage of this approach is that by selecting original variables instead of principal components, we reintroduce mutual correlation. Unless the selected variables are already uncorrelated, they will not explain as much variance as selecting the same number of principal components.
Since a grayscale image is simply a numeric matrix, we can apply PCA to it directly, treating each row of pixels as an observation and each column of pixels as a variable.
Select an image file and convert it to a grayscale floating-point matrix.
imageFile = "peppers.png";
rgbImage = imread( imageFile );
grayImage = im2gray( rgbImage );
I = double( grayImage );
Apply PCA. Do not center the data - this makes the reconstruction simpler. We don't standardize the image data because all elements already belong to a common grayscale range of integer values.
[IV, IS, ILambda] = pca( I, "Centered", false );
numComponents = 50;
Approximate the grayscale image.
approxI = IS(:, 1:numComponents) * IV(:, 1:numComponents).';
Show the results.
figure
imshowpair( grayImage, approxI, "montage" )
title( "Number of components: " + numComponents )
Each row of the scores matrix corresponds to an observation in the new coordinate system. We can use the distance to the origin in the new coordinates, weighted by the variance of each principal component, to measure how unusual each data point is. These values are referred to as Hotelling's $T^2$ statistic, and they are available as the fourth output of the pca function.
Alternatively, we can compute them directly from the scores and eigenvalues as follows.
t2 = sum( S.^2 ./ Lambda.', 2 );
The vector t2 contains one value for each observation. Observations with large values of t2 lie far from the centroid in the transformed coordinates and are therefore candidate outliers.
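As a sanity check, these values agree with the T-squared statistic returned as the fourth output of pca (computed on the same cleaned data):
[~, ~, ~, t2FromPca] = pca( bodyVarsClean );
disp( max( abs( t2 - t2FromPca ) ) ) % approximately 0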
We'll learn more about identifying outliers using the $T^2$ statistic in a later script.