PCA first or normalization first?

Machine Learning 2017. 1. 17. 20:53

http://stackoverflow.com/questions/10119913/pca-first-or-normalization-first

When doing regression or classification, what is the correct (or better) way to preprocess the data?

Normalize the data -> PCA -> training
PCA -> normalize PCA output -> training
Normalize the data -> PCA -> normalize PCA output -> training

Which of the above is more correct, or is the "standard" way to preprocess the data? By "normalize" I mean standardization, linear scaling, or some other technique.

------------

You should normalize the data before doing PCA. For example, consider the following situation. I create a data set X with a known correlation matrix C:

>> C = [1 0.5; 0.5 1];
>> A = chol(C);
>> X = randn(100,2) * A;

If I now perform PCA, I correctly find that the principal components (the rows of the weights matrix) are oriented at an angle to the coordinate axes:

>> wts=pca(X)
wts =
    0.6659    0.7461
   -0.7461    0.6659

If I now scale the first feature of the data set by 100, intuitively we think that the principal components shouldn't change:

>> Y = X;
>> Y(:,1) = 100 * Y(:,1);

However, we now find that the principal components are aligned with the coordinate axes:

>> wts=pca(Y)
wts =
    1.0000    0.0056
   -0.0056    1.0000
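
The same effect can be reproduced outside MATLAB. The following is a minimal sketch in plain Python (standard library only), not the code used in this answer. It relies on the fact that for two features the first principal axis of the covariance matrix has a closed form: its angle to the first coordinate axis is 0.5 * atan2(2*s_xy, s_xx - s_yy). The helper names (`make_correlated`, `covariance`, `first_pc_angle`), the sample size and the seed are illustrative assumptions.

```python
import math
import random


def make_correlated(n, rho, seed=0):
    """Draw n points from two unit-variance Gaussians with correlation rho."""
    rng = random.Random(seed)
    points = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        # The second feature mixes z1 and z2 so that corr(x1, x2) is rho in expectation.
        points.append((z1, rho * z1 + math.sqrt(1 - rho ** 2) * z2))
    return points


def covariance(points):
    """Return (s_xx, s_yy, s_xy), the entries of the sample covariance matrix."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    s_xx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    s_yy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    s_xy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return s_xx, s_yy, s_xy


def first_pc_angle(points):
    """Angle in degrees between the first principal axis and the first feature axis."""
    s_xx, s_yy, s_xy = covariance(points)
    return math.degrees(0.5 * math.atan2(2 * s_xy, s_xx - s_yy))


X = make_correlated(1000, 0.5)
Y = [(100 * x1, x2) for x1, x2 in X]  # rescale the first feature only

angle_X = first_pc_angle(X)
angle_Y = first_pc_angle(Y)
print(f"original: {angle_X:.1f} deg, rescaled: {angle_Y:.1f} deg")
```

With equal variances the first axis sits near 45 degrees. After the rescaling the variance of the first feature dominates the covariance matrix, so the axis collapses onto that feature.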

To resolve this, there are two options. First, I could rescale the data:

>> Ynorm = bsxfun(@rdivide,Y,std(Y))

(The weird bsxfun notation is used to do vector-matrix arithmetic in Matlab. All this line does is divide each feature by its standard deviation; it does not subtract the mean.)

We now get sensible results from PCA:

>> wts = pca(Ynorm)
wts =
   -0.7125   -0.7016
    0.7016   -0.7125

They're slightly different to the PCA on the original data because we've now guaranteed that our features have unit standard deviation, which wasn't the case originally.

The other option is to perform PCA using the correlation matrix of the data, instead of the outer product:

>> wts = pca(Y,'corr')
wts =
    0.7071    0.7071
   -0.7071    0.7071

In fact this is completely equivalent to standardizing the data by subtracting the mean and then dividing by the standard deviation. It's just more convenient. In my opinion you should always do this unless you have a good reason not to (e.g. if you want to pick up differences in the variation of each feature).
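
The equivalence can be checked numerically. The sketch below is plain Python (standard library only) and is an illustration rather than the MATLAB code above: it standardizes two badly scaled features, computes their covariance matrix, and compares it with the correlation matrix of the raw features. The helper names (`standardize`, `sample_cov`, `sample_corr`) and the synthetic data are assumptions made for the example.

```python
import math
import random
import statistics


def standardize(column):
    """Subtract the mean and divide by the sample standard deviation."""
    mean = statistics.fmean(column)
    std = statistics.stdev(column)
    return [(v - mean) / std for v in column]


def sample_cov(a, b):
    """Sample covariance of two equal-length columns (n - 1 denominator)."""
    ma, mb = statistics.fmean(a), statistics.fmean(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / (len(a) - 1)


def sample_corr(a, b):
    """Pearson correlation: covariance divided by both standard deviations."""
    return sample_cov(a, b) / (statistics.stdev(a) * statistics.stdev(b))


rng = random.Random(1)
x1 = [rng.gauss(0, 1) for _ in range(500)]
x2 = [0.5 * a + math.sqrt(0.75) * rng.gauss(0, 1) for a in x1]
y1 = [100 * a for a in x1]  # badly scaled first feature

z1, z2 = standardize(y1), standardize(x2)

# Covariance matrix of the standardized features.
cov_of_standardized = [[sample_cov(z1, z1), sample_cov(z1, z2)],
                       [sample_cov(z2, z1), sample_cov(z2, z2)]]

# Correlation matrix of the raw (unstandardized) features.
r = sample_corr(y1, x2)
corr_of_raw = [[1.0, r], [r, 1.0]]

print(cov_of_standardized)
print(corr_of_raw)
```

The two matrices agree up to floating-point error, so their eigenvectors (the principal components) are the same. For two features the correlation matrix always has the form [[1, r], [r, 1]], whose eigenvectors lie along (1, 1) and (1, -1) whenever r is nonzero, which is why every weight in the 'corr' output above has magnitude 0.7071.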
