http://stackoverflow.com/questions/26225344/why-feature-scaling

Why feature scaling?

I found that scaling in SVM (Support Vector Machine) problems really improves its performance. I have read this explanation:

"The main advantage of scaling is to avoid attributes in greater numeric ranges dominating those in smaller numeric ranges."

Unfortunately this didn't help me. Can somebody provide a better explanation? Thank you in advance!

    
Are you talking about log-normalizing data? – Leo Oct 6 '14 at 21:54
Maybe you should ask this question at stats.stackexchange.com - this forum is for programming questions; your question sounds like a theoretical one – Leo Oct 6 '14 at 22:01

3 Answers

The true reason behind scaling features in SVM is the fact that this classifier is not affine transformation invariant. In other words, if you multiply one feature by 1000, the solution given by the SVM will be completely different. It has nearly nothing to do with the underlying optimization techniques (although they are affected by these scale problems, they should still converge to the global optimum).

Consider an example: you have men and women, encoded by their sex and height (two features). Let us assume a very simple case with such data:

0-man, 1-woman

1 150
1 160
1 170
0 180
0 190
0 200

And let us do something silly. Train it to predict the sex of the person, so we are trying to learn f(x,y)=x (ignoring the second parameter).

It is easy to see that for such data the largest margin classifier will "cut" the plane horizontally somewhere around height "175", so once we get a new sample "1 178" (a woman of 178cm height) we get the classification that she is a man.

However, if we scale everything down to [0,1] we get something like

1 0.0
1 0.2
1 0.4
0 0.6
0 0.8
0 1.0

and now the largest margin classifier "cuts" the plane nearly vertically (as expected), and so the new sample "1 178", which is scaled to around "1 0.56", is classified as a woman (correct!).

So in general, scaling ensures that features are not treated as the main predictor simply because their numeric values happen to be large.
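Below is a minimal sketch of the toy example above, assuming scikit-learn with a linear-kernel SVC (the library, the min-max range, and the exact predictions are my assumptions for illustration, not part of the original answer):

```python
# Sketch of the toy example above (assumes scikit-learn; illustrative only).
# Features: [sex, height_cm]; the target we try to learn is the sex itself
# (0 = man, 1 = woman), i.e. f(x, y) = x.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 150], [1, 160], [1, 170],
              [0, 180], [0, 190], [0, 200]], dtype=float)
y = X[:, 0].astype(int)                          # the label is just the first feature

new_sample = np.array([[1, 178]], dtype=float)   # a 178 cm woman

# Unscaled: the height axis dominates the margin, so the prediction
# effectively follows height alone.
clf_raw = SVC(kernel="linear", C=1.0).fit(X, y)
print(clf_raw.predict(new_sample))               # likely [0] -> "man" (wrong)

# Height scaled to [0, 1]: both features now have comparable influence,
# and the decision boundary cuts on the sex feature as expected.
h_min, h_max = 150.0, 200.0
X_scaled = X.copy()
X_scaled[:, 1] = (X_scaled[:, 1] - h_min) / (h_max - h_min)
new_scaled = new_sample.copy()
new_scaled[:, 1] = (new_scaled[:, 1] - h_min) / (h_max - h_min)

clf_scaled = SVC(kernel="linear", C=1.0).fit(X_scaled, y)
print(clf_scaled.predict(new_scaled))            # likely [1] -> "woman" (correct)
```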


Feature scaling is a general trick applied to optimization problems (not just SVM). The underlying algorithm used to solve the optimization problem of SVM is gradient descent. Andrew Ng has a great explanation in his Coursera videos here.

I will illustrate the core ideas here (I borrow Andrew's slides). Suppose you have only two parameters and one of them can take a relatively large range of values. Then the contours of the cost function can look like very tall, skinny ovals (see the blue ovals below). Your gradient descent path (drawn in red) could take a long time, bouncing back and forth, to find the optimal solution.
[figure: elongated cost-function contours with a zig-zagging gradient descent path]

Instead, if you scale your features, the contours of the cost function might look like circles; then gradient descent can take a much straighter path and reach the optimal point much faster.

[figure: near-circular contours with a direct gradient descent path]
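A small sketch of this effect, assuming NumPy (the two-feature regression problem, step sizes, and iteration cap below are made up purely for illustration):

```python
# Sketch: plain gradient descent on a least-squares cost where one feature
# has a much larger range than the other (illustrative data, not from the answer).
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.uniform(0, 1, 200)            # small-range feature
x2 = rng.uniform(0, 1000, 200)         # large-range feature
X = np.column_stack([x1, x2])
y = 3 * x1 + 0.002 * x2 + rng.normal(0, 0.1, 200)

def gradient_descent(X, y, lr, max_steps=10_000, tol=1e-8):
    w = np.zeros(X.shape[1])
    for step in range(max_steps):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w_new = w - lr * grad
        if np.linalg.norm(w_new - w) < tol:
            return w_new, step
        w = w_new
    return w, max_steps

# Unscaled: the step size must be tiny to avoid diverging along x2,
# so progress along x1 is extremely slow (elongated contours) and the
# run typically exhausts the iteration budget.
w_raw, iters_raw = gradient_descent(X, y, lr=1e-7)

# Standardized: contours are close to circles, so a much larger step
# size converges in relatively few iterations.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
y_centered = y - y.mean()              # center the target too, since the
                                       # standardized features have zero mean
w_std, iters_std = gradient_descent(X_std, y_centered, lr=0.1)

print("iterations (unscaled):    ", iters_raw)
print("iterations (standardized):", iters_std)
```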

    
Thank you so much greeness. Your answer is really clear, but it explains why scaling improves computation speed, not accuracy as I asked, in my humble opinion. Thank you! – Kevin Oct 7 '14 at 9:42
    
@Venik I think the reason for the above is in his answer. I am not exactly sure though: <<Your gradients (the path of gradient is drawn in red) could take a long time and go back and forth to find the optimal solution.>> – Parag S. Chandakkar Oct 8 '14 at 5:11
    
This answer is not correct, SVM is not solved with SGD in most implementations, and the reason for feature scaling is completely different. – lejlot Nov 14 '14 at 0:55
I don't agree. To avoid the big values' dominating effect is probably the primary advantage. However, the author of libsvm also pointed out that feature scaling has the advantage of preventing numeric problems. see Section 2.2 csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf – greeness Nov 14 '14 at 8:31
    
I also don't know why you think gradient descent is not used to solve SVM in most implementations. In different versions of libsvm, I see implementations of coordinate gradient descent and also sub-gradient descent. – greeness Nov 14 '14 at 8:36

Just personal thoughts from another perspective.
1. Why does feature scaling have an influence?
There is a saying in applied machine learning: 'garbage in, garbage out'. The more faithfully your features reflect the data, the more accurate your algorithm will be. The same applies to how machine learning algorithms treat relationships between features. Unlike the human brain, when machine learning algorithms classify, for example, all features are expressed and computed in the same coordinate system, which in some sense builds a prior assumption into the features (not a real reflection of the data itself). Also, the nature of most algorithms is to find the most appropriate relative weighting of the features to fit the data. So when the input consists of unscaled features, the large-scale features have more influence on the weights, which is not actually a reflection of the data itself.
2. Why does feature scaling usually improve accuracy?
A common practice in unsupervised machine learning for hyper-parameter (or hyper-hyper-parameter) selection (for example, the hierarchical Dirichlet process, hLDA) is that you should not add any personal subjective assumptions about the data. The best approach is simply to assume that all outcomes have an equal probability of appearing. I think the same applies here: feature scaling just makes the assumption that all features have an equal opportunity to influence the weights, which more truly reflects the information/knowledge you have about the data. This commonly also results in better accuracy.
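In practice, giving every feature this "equal opportunity" usually just means min-max scaling or standardizing each column before training. A minimal sketch, assuming scikit-learn (the data reuses the sex/height toy example from the first answer):

```python
# Sketch: two common ways to put features on comparable scales
# (assumes scikit-learn; fit the scaler on training data only).
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X_train = np.array([[1.0, 150.0], [1.0, 160.0], [1.0, 170.0],
                    [0.0, 180.0], [0.0, 190.0], [0.0, 200.0]])

# Min-max scaling: each column is mapped to [0, 1].
minmax = MinMaxScaler().fit(X_train)
print(minmax.transform(X_train))

# Standardization: each column is mapped to zero mean and unit variance.
standard = StandardScaler().fit(X_train)
print(standard.transform(X_train))

# New samples must be transformed with the scaler fitted on the training data.
print(minmax.transform(np.array([[1.0, 178.0]])))
```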

BTW, about affine transformation invariance and faster convergence, there is an interesting link here on stats.stackexchange.com.
