
Principal Component Analysis using singular value decomposition

Principal Component Analysis

What is PCA and why should I use it?

Principal component analysis (PCA) is the process of computing the principal components and using them to perform a change of basis on the data, sometimes using only the first few principal components and ignoring the rest. The principal components of a collection of points in a real coordinate space are a sequence of p unit vectors, where the i-th vector is the direction of a line that best fits the data while being orthogonal to the first i-1 vectors. Here, a best-fitting line is defined as one that minimizes the average squared distance from the points to the line. These directions constitute an orthonormal basis in which the individual dimensions of the data are linearly uncorrelated.

So what?

Right? Just build a correlation matrix and drop the features whose values are too small. Well... yes and no. Building the correlation matrix is just one part of your analysis; you will also need to look through each feature individually and make an educated guess about which ones to drop and which to keep. Not so with PCA. It does the ordering for you and even tells you how strongly each record follows the trend of a principal component of your dataset.

Theory

To analyse the principal components of a given set of data $\bar{X}$ (with $n$ records as rows), the initial data first needs to be transformed. So we calculate the mean value over all records

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x^{(i)},$$

then multiply a column vector of ones of size $n$ with the resulting row vector

$$\tilde{X} = \mathbf{1}\,\bar{x}.$$

We subtract the result from the input dataset

$$B = \bar{X} - \tilde{X}$$

and finally normalize the data

$$B = \frac{1}{\sqrt{n}}\,B.$$

This gives us the input matrix for the singular value decomposition, SVD for short, which can (in most languages) simply be imported via a library. The SVD gives us

$$B = U \Sigma V^T,$$

which we can now use to perform our principal component analysis.
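As a rough sketch of these steps in code, the following uses the Eigen library and its `JacobiSVD` (an assumption for illustration; the C++ project linked at the end of this article ships its own math implementation) on a made-up toy dataset:

```cpp
#include <cmath>
#include <iostream>
#include <Eigen/Dense>

int main() {
    // n x m data matrix: n records (rows), m features (columns).
    // The values are made up for illustration.
    Eigen::MatrixXd X(5, 2);
    X << 2.5, 2.4,
         0.5, 0.7,
         2.2, 2.9,
         1.9, 2.2,
         3.1, 3.0;

    const double n = static_cast<double>(X.rows());

    // x_bar: mean of each feature over all records (a 1 x m row vector)
    Eigen::RowVectorXd xBar = X.colwise().mean();

    // B = (X - 1 * x_bar) / sqrt(n): center every record, then normalize
    Eigen::MatrixXd B = (X.rowwise() - xBar) / std::sqrt(n);

    // singular value decomposition B = U * Sigma * V^T
    Eigen::JacobiSVD<Eigen::MatrixXd> svd(B, Eigen::ComputeThinU | Eigen::ComputeThinV);

    // principal components T = U * Sigma; loadings live in V
    Eigen::MatrixXd T = svd.matrixU() * svd.singularValues().asDiagonal();

    std::cout << "T = U*Sigma:\n" << T << "\n\n"
              << "V (loadings):\n" << svd.matrixV() << "\n";
}
```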

The principal components are $T = U\Sigma$, and $V$ contains the so-called "loadings", i.e. each record's load towards the corresponding component in the dataset.

So let's say $V_{0,4} = 0.5$. This means the first record has a load of 0.5 on the 5th principal component.

Therefore, based on the loads of the components, we can decide whether to keep or discard them.
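Continuing the Eigen-based sketch from above, keeping only the first $r$ components is a simple truncation of $U$, $\Sigma$ and $V$ (the choice $r = 1$ here is hypothetical):

```cpp
// keep only the first r principal components (r = 1 is a hypothetical choice)
int r = 1;
Eigen::MatrixXd T_r = svd.matrixU().leftCols(r)
                    * svd.singularValues().head(r).asDiagonal();
Eigen::MatrixXd V_r = svd.matrixV().leftCols(r);

// rank-r approximation of the centered data: B is approximately T_r * V_r^T
Eigen::MatrixXd B_r = T_r * V_r.transpose();
```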

Example

All of the above seems simple enough to give it a try, right?

For simplicity, we imagine a Gaussian (normally) distributed set of data. It may look like this:

[Figure: scatter plot of the Gaussian distributed sample data]

To make it a bit harder, we apply linear transformations to the dataset, rotating and stretching it:

[Figure: original dataset (left) and transformed dataset (right)]

On the left is the original data, on the right the transformed dataset.
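The exact distribution parameters and transformation behind these plots are not given here, but a sketch that produces a qualitatively similar dataset (all constants are made up) could look like this:

```cpp
#include <cmath>
#include <random>
#include <Eigen/Dense>

// Draw n samples from a 2D standard normal distribution,
// then stretch and rotate them (all constants are made up).
Eigen::MatrixXd makeToyData(int n, unsigned seed = 42) {
    std::mt19937 gen(seed);
    std::normal_distribution<double> dist(0.0, 1.0);

    Eigen::MatrixXd X(n, 2);
    for (int i = 0; i < n; ++i) {
        X(i, 0) = dist(gen);
        X(i, 1) = dist(gen);
    }

    // stretch: factor 3 along x, 0.5 along y
    Eigen::Matrix2d S;
    S << 3.0, 0.0,
         0.0, 0.5;

    // rotation by 45 degrees
    const double a = 3.14159265358979323846 / 4.0;
    Eigen::Matrix2d R;
    R << std::cos(a), -std::sin(a),
         std::sin(a),  std::cos(a);

    // apply y = R * S * x to every record (records are rows, hence the transpose)
    return X * (R * S).transpose();
}
```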

Then, after performing the PCA described above, we achieve the following result:

[Figure: result of the PCA]

For a C++ implementation of the example, see github.com/philsupertramp/game-math