Why feature scaling?
I found that scaling the features in SVM (Support Vector Machine) problems really improves performance... I have read this explanation:
Unfortunately this didn't help me... Can somebody provide a better explanation? Thank you in advance!
The true reason behind scaling features in SVM is the fact that this classifier is not invariant to affine transformations of the features. In other words, if you multiply one feature by 1000, the solution given by the SVM will be completely different. This has nearly nothing to do with the underlying optimization technique (although such scale problems do affect it, it should still converge to the global optimum).

Consider an example: you have men and women, encoded by their sex and height (two features). Assume a very simple case with the following data (first column: 0 = man, 1 = woman; second column: height in cm):

1 150
1 160
1 170
0 180
0 190
0 200

And let us do something silly: train the classifier to predict the sex of the person, so we are trying to learn f(x, y) = x (ignoring the second feature). It is easy to see that for such data the largest-margin classifier will "cut" the plane almost horizontally somewhere around height 175, so given a new sample (1, 178) (a woman of 178 cm height) we get the classification that she is a man.

However, if we scale everything down to [0, 1], we get something like

1 0.0
1 0.2
1 0.4
0 0.6
0 0.8
0 1.0

and now the largest-margin classifier "cuts" the plane nearly vertically (as expected), so the same sample, which scales to roughly (1, 0.56), is classified as a woman (correct!).

So in general: scaling ensures that features do not end up as the main predictor merely because their values happen to be large.
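Below is a minimal sketch of this example using scikit-learn; the linear-kernel SVC, the MinMaxScaler, and the value of C are my own choices for illustration, not part of the original answer.

```python
# Minimal sketch of the example above (assumes scikit-learn is installed).
# Features: (sex, height in cm); label: sex -- the deliberately "silly" setup.
import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import MinMaxScaler

X = np.array([[1, 150], [1, 160], [1, 170],
              [0, 180], [0, 190], [0, 200]], dtype=float)
y = np.array([1, 1, 1, 0, 0, 0])  # 1 = woman, 0 = man

new = np.array([[1.0, 178.0]])  # a woman of 178 cm

# Without scaling: height dominates the margin, so the hyperplane is
# nearly horizontal (roughly "height above ~175 cm => man").
raw_svm = SVC(kernel="linear", C=100).fit(X, y)
print(raw_svm.predict(new))  # [0] -> misclassified as a man

# With min-max scaling to [0, 1]: the sex feature now carries comparable
# weight, and the hyperplane cuts nearly vertically along that axis.
scaler = MinMaxScaler().fit(X)
scaled_svm = SVC(kernel="linear", C=100).fit(scaler.transform(X), y)
print(scaled_svm.predict(scaler.transform(new)))  # [1] -> correctly a woman
```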
Feature scaling is a general trick applied to optimization problems (not just SVM). The underlying algorithm used to solve the SVM optimization problem is gradient descent. Andrew Ng has a great explanation in his Coursera videos here. I will illustrate the core idea (borrowing Andrew's slides). Suppose you have only two parameters and one of them can take a relatively large range of values. Then the contours of the cost function look like very tall and skinny ovals (see the blue ovals below). Your gradient steps (the gradient path is drawn in red) can take a long time, going back and forth, to find the optimal solution. Instead, if you scale your features, the contours of the cost function might look like circles; then gradient descent can take a much straighter path and reach the optimum much faster.
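As a rough, self-contained illustration of that contour argument (the toy data, learning rates, and tolerance below are invented for this sketch, not taken from Andrew's slides), compare how long plain gradient descent takes on a least-squares cost before and after standardizing the features:

```python
# A toy least-squares cost with two features on very different scales
# (feature 1 spans ~[0, 1], feature 2 spans ~[0, 1000]).
import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([rng.uniform(0, 1, 200),
                     rng.uniform(0, 1000, 200)])
y = X @ np.array([2.0, 0.003]) + rng.normal(0, 0.1, 200)

def gradient_descent(X, y, lr, max_steps=50_000, tol=1e-8):
    """Plain batch gradient descent on the mean-squared-error cost."""
    w = np.zeros(X.shape[1])
    for step in range(max_steps):
        grad = 2.0 / len(y) * X.T @ (X @ w - y)
        if np.linalg.norm(grad) < tol:
            return w, step
        w -= lr * grad
    return w, max_steps

# Unscaled: the cost surface is a long, narrow valley, so only a tiny
# learning rate is stable and the run hits the iteration cap.
_, steps_raw = gradient_descent(X, y, lr=1e-6)

# Standardized (zero mean, unit variance): contours are near-circular,
# a much larger learning rate is stable, and it converges in ~100 steps.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
_, steps_scaled = gradient_descent(X_std, y, lr=0.1)

print(steps_raw, steps_scaled)
```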
Just personal thoughts from another perspective. By the way, regarding affine transformation invariance and faster convergence, there is an interesting link here on stats.stackexchange.com.