Principal Component Analysis using singular value decomposition
Principal Component Analysis
What is PCA and why should I use it?
Principal component analysis (PCA) is the process of computing the principal components and using them to perform a change of basis on the data, sometimes using only the first few principal components and ignoring the rest. The principal components of a collection of points in a real coordinate space are a sequence of unit vectors, where the $i$-th vector is the direction of a line that best fits the data while being orthogonal to the first $i-1$ vectors. Here, a best-fitting line is defined as one that minimizes the average squared distance from the points to the line. These directions constitute an orthonormal basis in which the individual dimensions of the data are linearly uncorrelated.
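Formally, and assuming $X$ denotes the data matrix with the mean of each column subtracted, the components can be defined one at a time as the directions of maximal remaining variance:

```latex
w_1 = \arg\max_{\lVert w \rVert = 1} \lVert X w \rVert^2,
\qquad
w_i = \arg\max_{\substack{\lVert w \rVert = 1 \\ w \,\perp\, w_1, \dots, w_{i-1}}} \lVert X w \rVert^2 .
```

For centered data, maximizing the variance along $w$ is equivalent to minimizing the average squared distance of the points to the line through $w$, by the Pythagorean theorem.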
So what?
Right? Just build a correlation matrix and drop the features with values that are too small. Well... yes and no. Building the correlation matrix is only one part of the analysis: you still need to inspect each feature individually and make an educated guess about which ones to drop and which to keep. Not so with PCA. It does the ordering for you and even tells you how much of the dataset's variance each principal component explains.
Theory
To analyse the principal components of a given set of data, arrange it as a matrix $X \in \mathbb{R}^{n \times m}$ ($n$ samples, $m$ features) with each column centered to zero mean, and compute its singular value decomposition

$$X = U \Sigma V^T.$$

The principal components are the columns of $V$; they form an orthonormal basis, ordered by the singular values $\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_m$ on the diagonal of $\Sigma$.

So let's say we want to know how strongly the data follows the $i$-th principal component. The variance it explains is

$$\lambda_i = \frac{\sigma_i^2}{n-1},$$

and its share of the total variance is $\lambda_i / \sum_j \lambda_j$. Therefore, according to this load of each component, we can decide to keep or discard it.
Example
Everything above seems simple enough to give it a try, right?
For simplicity, we imagine a normally (Gaussian) distributed set of data. It may look like this:
To make it a bit harder, we apply linear transformations to the dataset, rotating and stretching it:
On the left is the original data; on the right, the transformed dataset.
Then, after performing the PCA described above, we obtain the result:
For a C++ implementation of the example, see github.com/philsupertramp/game-math.