Principal Component Analysis on Big data

Problem Statement

Principal component analysis (PCA) is a fairly common statistical technique used to reduce the dimensions of data containing a large set of interrelated variables. Problems start to arise when the number of observations and variables are very large. At this point classical PCA techniques require sizeable memory and immense computational time. While working with a biological dataset we were hindered by a similar problem.

We were attempting to do PCA on a dataset which had over 22000 features and 22000 observations. The limitation for us here was clearly the size of the data. We were unable to do PCA on such a large dataset using a machine with 12GB RAM, in under 4 hours. At this point we started exploring alternate methods. This blog describes these alternate methods of processing PCA on large datasets and compares the results with the classical PCA techniques we have.

A brief about PCA

Before jumping into the comparison let’s briefly revisit the mathematics behind PCA.

PCA aims to transform the dataset by finding its Eigen vector and Eigen values. There are essentially two ways to find Eigen vector:

  1. singular value decomposition (SVD) on mean-centered data
  2. Eigen value decomposition of centered co-variance matrix of the data

PCA relies on variance of the data: if data has low variance (or high redundancy) then this data can easily be reduced to less dimensions. We find out a co-variance matrix which represents the co-variance between all features. With the help of this co-variance matrix we obtain Eigen values (numbers on diagonal of co-variance matrix) and Eigen vector.

Procedure

For the purpose of this blog we would be using R, and different packages available in R, to carry out principal component analysis. The most commonly used PCA function in R is prcomp present in the Stats package. We would be using this package as the gold standard for comparison of all other packages and their results. There are various packages which provide principal component analysis functionality in different languages. Here we will only look at three packages in R:

  1. Stats package (prcomp)
  2. bigpca package (big.PCA)
  3. flashpca R package (flashpca)

Furthermore, for the purpose of comparison and analysis we have used a 1000 cross 100 matrix.

Principal component analysis using prcomp

Prcomp is the most commonly used PCA library in R. It relies on doing SVD on the mean centered data instead of Eigen decomposition on the co-variance matrix of the data. SVD breaks down the original dataset into matrices having Eigen vectors and singular values (square root of eigen values). Let’s say we have a dataset X with dimension N x p then we say that SVD of X is

X = UDVT

where,

U is an N x k matrix, whose column represents the Eigen vector of X*XT 

D is k x k matrix which represents singular values which is square root of Eigen values

V is k x p matrix whose column represents the Eigen vector of XT*X

 

Thus, prcomp does not require co-variance matrix to be computed explicitly, we can directly pass the dataset to find Eigen vector and Eigen value. Prcomp also returns some other information related to PCA such as standard deviation of principal components, proportion of variances explained by all the principal components.

Principal component analysis using bigpca

Bigpca is a recent addition to the list of R packages which provide PCA. Its differentiator lies in its algorithm which the developers of the package claim can do quick computation while still keeping the memory consumption lower, comparatively.

Bigpca provides principal component analysis on big matrix object (which is used to handle large matrices efficiently). With the help of bigpca we can perform principal component analysis on very large matrices using any of the previously mentioned methods: singular value decomposition, or Eigen decomposition on co-variance matrix. This can be done by setting SVD argument in bigpca as TRUE or FALSE. Another important argument of bigpca is thin. This parameter is used to reduce the dimension of data. As we know most of the variance is explained by the top principal components, we can calculate only top principal component by reducing the dimension of dataset.

Some of the ways of reducing dataset dimension are uniform sub setting, random sub setting, correlated sub setting and PCA sub setting. Here, uniform sub setting truncates dimensions(rows/column) of dataset uniformly. For example, if we want to keep 500 rows, it uniformly selects first 500 rows from original dataset. Random sub setting means it randomly selects dimensions from dataset. Correlated sub setting calculates correlation for given data matrix and returns rows which are having high absolute sum of correlation values. PCA sub setting performs PCA on a small set of columns and all rows, and selects rows which are most correlated to the first ‘n’ principle components. Here ‘n’ is chosen by the function quick.elbow() which returns those rows whose variance is greater than the threshold. The number of variables selected corresponding to each component is weighted according to how much of the variance is explained by each component. Flowchart of big.PCA is shown as below:

 

Principal component analysis using flashpca

flashpca R package was an outcome of a 2014 paper titled “Fast Principal Component Analysis of Large-Scale Genome-Wide Data”. flashpca is very useful for our specific purpose, that is to compute PCA on large datasets.

flashpca relies on performing principal component analysis on smaller subset of data that represents most of the variance of data. flashpca uses randomised PCA approach to find out this small matrix. Randomised PCA approach is based on computing a small matrix which captures top Eigen vector and Eigen values of original dataset.

To understand this better, let’s assume we have a dataset

Xn*p

Where,

n is the number of rows

p are the number of columns
Let’s say we want “d” top principal component at the output, then randomised PCA computes a small random orthogonal matrix

R(d+e)*p

where,

d is the no. of principal component we want

e is additional dimension required to increase accuracy

p is the number of columns in the original matrix Xn*p

After constructing random matrix, we multiply it with the complete data. Multiplying a random vector by a square matrix projects that vector onto the eigenvectors of that matrix. flashpca then computes QR decomposition on that matrix which gives us an orthogonal matrix

Qn*(d+e)

where,

n is the number of rows in the original matrix X

d is the no. of principal component we want

e is additional dimension required to increase accuracy

Finally, the transpose of Q is multiplied with X so as to obtain a small matrix of dimension (d+e) x p, i.e.,

S(d+e)*p  = QTn*(d+e) x Xn*p

It is on this matrix, S, that we perform PCA to obtain the principal component values.

Time Comparison of different ways of PCA

We performed principal component analysis on 500 x 500 data matrix and compared the time taken by different PCA methods (using thin value = 0.5 in bigpca). Time comparison of different ways of PCA can be found in the following table. Looking at this table, we can say that flashpca takes the least time to perform principal component analysis.

Memory Comparison of different ways of PCA

We performed principal component analysis on 1000 x 1000 data matrix and compared the RAM consumption of all the three methods. From the table below we can say that flashpca requires the least memory consumption to perform principal component analysis.

Result Comparisons of different ways of PCA

We compared the result obtained from the different methods of PCA (1000 features) by computing correlation value for the top 3 principal components. Correlation values of the principal component for different PCA method comparison are shown below

From the above table, we can see that there is a high correlation between the results obtained from prcomp and flashpca, whereas bigpca shows less correlation with other methods. In this case we assume that the result of Prcomp is the most reliable.

Final verdict

Clearly, flashpca is the better way to perform principal component analysis as it takes significantly lesser time and memory while still reaching PCs which are very similar to the well validated Prcomp.

 

Written by Raghav Sehgal (Data Scientist) and Tushar Chabbra (Intern) at Elucidata.

 

Vipul Jain Written by:

One Comment

  1. November 24, 2018
    Reply

    I’m so glad I found my solution online.

Leave a Reply

Your email address will not be published. Required fields are marked *