http://stats.stackexchange.com/questions/121886/when-should-i-apply-feature-scaling-for-my-data
When should I apply feature scaling for my data
I started a discussion with a colleague of mine and we started to wonder: when should one apply feature normalization / scaling to the data? Let's say we have a set of features, some with a very broad range of values and some with a much narrower range. If I were doing principal component analysis I would need to normalize the data, that much is clear, but let's say we are trying to classify the data using a plain and simple k-nearest neighbor / linear regression method. Under what conditions should or shouldn't I normalize the data, and why? A short and simple example highlighting the point added to the answer would be perfect.
You should normalize when the scale of a feature is irrelevant or misleading, and not normalize when the scale is meaningful. K-means, for example, treats Euclidean distance as meaningful: if one feature has a much larger scale than another, but that scale truly represents greater diversity, then clustering along that dimension should be penalized. In regression, as long as you have a bias term it does not matter whether you normalize, since you are discovering an affine map, and the composition of a scaling transformation and an affine map is still an affine map. When learning rates are involved, e.g. when you're doing gradient descent, the input scale effectively scales the gradients, which might require some kind of second-order method to stabilize per-parameter learning rates. It's probably easier to normalize the inputs if it doesn't matter otherwise.
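A minimal sketch in plain NumPy (the toy data and learning rate are made up, not from the answer) of the learning-rate point: a step size that converges on standardized inputs makes plain gradient descent diverge when one feature has a much larger scale.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.c_[rng.normal(size=200), rng.normal(scale=100.0, size=200)]  # second feature on a much larger scale
y = X @ np.array([2.0, 0.03]) + 1.0

def gd_mse(X, y, lr=0.01, steps=500):
    """Plain batch gradient descent on least squares, with a bias column."""
    Xb = np.c_[np.ones(len(X)), X]
    w = np.zeros(Xb.shape[1])
    for _ in range(steps):
        grad = Xb.T @ (Xb @ w - y) / len(y)   # the gradient inherits the feature scale
        w -= lr * grad
    return np.mean((Xb @ w - y) ** 2)

Xs = (X - X.mean(axis=0)) / X.std(axis=0)     # standardized copy of the inputs
print(gd_mse(X, y))    # diverges (loss overflows): the large-scale feature blows up the step
print(gd_mse(Xs, y))   # converges to near zero with the same learning rate
```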
In my view the question of scaling or not scaling the features in machine learning is a statement about the measurement units of your features, and it is related to the prior knowledge you have about the problem. Some algorithms, like Linear Discriminant Analysis and Naive Bayes, do feature scaling by design, and performing it manually would have no effect. Others, like kNN, can be gravely affected by it.

With a kNN-type classifier you have to measure the distances between pairs of samples, and those distances will of course be influenced by the measurement units. Imagine you are classifying a population into males and females and you have a bunch of measurements including height. Your classification result will then be influenced by the units the height was reported in. If height is measured in nanometers, any k nearest neighbors will likely just be the samples with similar heights, so you have to scale (see the sketch below).

As a contrasting example, imagine classifying something whose features already share equal units but are recorded with noise, like a photograph, a microarray, or a spectrum. In this case you know a priori that your features have equal units. If you were to scale them anyway, you would amplify the effect of features that are constant across all samples but were measured with noise (like the background of the photo). This again influences kNN and might drastically reduce performance if your data has more noisy, nearly constant features than genuinely varying ones: any similarity between k nearest neighbors would then be driven by noise.

So this is like everything else in machine learning: use prior knowledge whenever possible, and in the case of black-box features do both and cross-validate.
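A minimal sketch (hypothetical height/shoe-size data, assuming scikit-learn) of the unit example above: whichever feature carries the larger numbers dominates the kNN distance, and standardizing removes the dependence on the unit choice.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
n = 200
height_m = np.r_[rng.normal(1.65, 0.07, n), rng.normal(1.80, 0.07, n)]  # height in metres
shoe_eu  = np.r_[rng.normal(38.0, 1.5, n), rng.normal(44.0, 1.5, n)]    # EU shoe size
y = np.r_[np.zeros(n), np.ones(n)]

X_metres = np.c_[height_m, shoe_eu]        # shoe size dominates the Euclidean distance
X_nano   = np.c_[height_m * 1e9, shoe_eu]  # height in nanometres now dominates instead
X_scaled = StandardScaler().fit_transform(X_metres)  # unit choice no longer matters

for name, X in [("metres", X_metres), ("nanometres", X_nano), ("standardized", X_scaled)]:
    acc = KNeighborsClassifier(n_neighbors=5).fit(X, y).score(X, y)
    print(name, acc)   # accuracy shifts with the unit until the features are standardized
```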
There are several methods of normalization. For regression, if you plan on normalizing a feature by a single factor, there is no need: single-factor normalization, like dividing or multiplying by a constant, is already absorbed into the weights (e.g. if the weight of a feature is 3 and we divide all of that feature's values by 2, the new weight becomes 6, so the overall effect is the same; see the sketch below). In contrast, if you are planning to mean-normalize, it is a different story. Mean normalization is useful when there is a huge spread in the feature values (e.g. 1, 70, 300, 4). It is also useful when a single feature can have both a positive and a negative effect, because when you mean-normalize a set of positive values, the values below the mean become negative while those above the mean become positive.

For k-nearest neighbours, normalization should be performed all the time, because in kNN it is the distances between points that drive the grouping. If you apply kNN to a problem with two features, the first ranging from 1 to 10 and the other from 1 to 1000, then the neighborhoods will be determined almost entirely by the second feature, since differences on the 1 to 10 scale are tiny compared to differences on the 1 to 1000 scale, and everything can end up lumped into a single group.
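A minimal sketch (assuming scikit-learn; the numbers are made up) checking the two claims above: dividing a feature by 2 simply doubles its fitted weight, and mean-normalizing all-positive values leaves the below-mean values negative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
x = rng.uniform(1, 300, size=100)                 # feature with a wide spread of values
y = 3.0 * x + 10.0 + rng.normal(scale=1.0, size=100)

w_raw    = LinearRegression().fit(x[:, None], y).coef_[0]           # ~3.0
w_halved = LinearRegression().fit((x / 2.0)[:, None], y).coef_[0]   # ~6.0: same predictions
print(w_raw, w_halved)

x_centred = x - x.mean()                  # mean normalization of all-positive values
print(x_centred.min(), x_centred.max())   # values below the mean are now negative
```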
Here's another chemometric application example where feature scaling would be disastrous: there are lots of classification (qualitative analysis) tasks of the form "test whether some analyte (= substance of interest) content is below (or above) a given threshold (e.g. a legal limit)". In this case, the sensors producing the input data for the classifier would be chosen so that the signal is a function of the analyte content, preferably a steep and even linear function around that threshold.
In this situation, feature scaling would essentially erase all relevant information from the raw data. In general, asking what scaling does to the information your features carry for the task at hand helps to decide whether scaling is a good idea.
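A minimal sketch (toy concentrations and a hypothetical legal limit of 0.5, not from the answer) of why scaling is disastrous here: standardizing each batch separately destroys the fixed, absolute threshold that the sensor signal was designed to encode.

```python
import numpy as np

legal_limit = 0.5                                     # hypothetical absolute threshold
batch_a = np.array([0.10, 0.20, 0.40, 0.60, 0.90])    # samples spread around the limit
batch_b = np.array([0.40, 0.45, 0.50, 0.55, 0.60])    # samples clustered near the limit

def standardize(x):
    return (x - x.mean()) / x.std()

# In raw sensor units the decision rule is the same fixed threshold for every batch.
print(batch_a > legal_limit, batch_b > legal_limit)

# After per-batch standardization the physical limit maps to a different value in each
# batch, so the absolute information the sensor was designed to deliver is gone.
print(standardize(batch_a), standardize(batch_b))
```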
I contend that it's better to actually scale your data before splitting it into a train and test set, for two reasons: (1) there is no danger of leakage, as there is no model fitting being done when scaling features; and (2) this mimics what you might do when receiving a new set of data: you might have the X for the new data but not the y, and there is no reason not to include that X with the old data when preprocessing. In some situations it might be the only logical thing to do. The sketch below illustrates the workflow.
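A minimal sketch (assuming scikit-learn; the array names and data are made up) of the workflow this answer describes: the scaler is fit on all available X, including new rows that have no y yet, before the train/test split.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
X_labelled = rng.normal(size=(100, 3)) * [1.0, 10.0, 1000.0]   # features on very different scales
y_labelled = (X_labelled[:, 2] > 0).astype(int)
X_new = rng.normal(size=(20, 3)) * [1.0, 10.0, 1000.0]         # new data: X known, y not yet

# Fit the scaler on every X we have, labelled or not, then split for training.
scaler = StandardScaler().fit(np.vstack([X_labelled, X_new]))
X_train, X_test, y_train, y_test = train_test_split(
    scaler.transform(X_labelled), y_labelled, random_state=0)

clf = KNeighborsClassifier().fit(X_train, y_train)
print(clf.score(X_test, y_test))
predictions_for_new = clf.predict(scaler.transform(X_new))     # same scaler reused on new X
```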