http://stats.stackexchange.com/questions/121886/when-should-i-apply-feature-scaling-for-my-data

 

 

When should I apply feature scaling for my data?

I started a discussion with a colleague of mine, and we started to wonder: when should one apply feature normalization / scaling to the data? Let's say we have a set of features, some of which have a very broad range of values and some of which have a much narrower range.

If I were doing principal component analysis I would need to normalize the data, that much is clear, but let's say we are trying to classify the data using a plain and simple k-nearest neighbor / linear regression method.

Under what conditions should or shouldn't I normalize the data, and why? A short and simple example highlighting the point, added to the answer, would be perfect.


5 Answers

You should normalize when the scale of a feature is irrelevant or misleading, and not normalize when the scale is meaningful.

K-means considers Euclidean distance to be meaningful. If one feature has a much larger scale than another, and that larger scale genuinely reflects greater diversity, then grouping points that differ along that dimension should be penalized more, so you should not normalize the scale away.

In regression, as long as you have a bias term it does not matter whether you normalize, since you are discovering an affine map, and the composition of a scaling transformation and an affine map is still an affine map.
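As a quick illustration of that invariance (a minimal sketch of my own with made-up data, not part of the original answer): rescaling one feature only rescales its fitted weight, and the least-squares predictions with an intercept are unchanged.

```python
import numpy as np

# Minimal sketch (invented data): fit ordinary least squares with an intercept,
# then refit after multiplying one feature by 1000. The corresponding weight
# shrinks by the same factor and the predictions of the two fits are identical,
# illustrating the affine-map argument above.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 3 * X[:, 0] - 2 * X[:, 1] + 5 + rng.normal(scale=0.1, size=100)

def fit_with_intercept(X, y):
    A = np.column_stack([X, np.ones(len(X))])   # append a bias column
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef                                 # [w1, w2, bias]

X_scaled = X.copy()
X_scaled[:, 0] *= 1000.0                        # rescale the first feature

w = fit_with_intercept(X, y)
w_s = fit_with_intercept(X_scaled, y)

pred = np.column_stack([X, np.ones(len(X))]) @ w
pred_s = np.column_stack([X_scaled, np.ones(len(X))]) @ w_s
print(np.allclose(pred, pred_s))                # True: same affine map
print(w[0], w_s[0] * 1000.0)                    # the weight absorbed the scaling
```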

When there are learning rates involved, e.g. when you're doing gradient descent, the input scale effectively scales the gradients, which might require some kind of second order method to stabilize per-parameter learning rates. It's probably easier to normalize the inputs if it doesn't matter otherwise.
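A small numerical sketch of the gradient-scaling point (again with invented data, not from the answer): for squared-error loss, each weight's gradient component is proportional to the values of its feature, so a feature on a roughly 1000x larger scale yields a roughly 1000x larger gradient component, while standardizing the inputs puts the components back on comparable scales.

```python
import numpy as np

# Sketch: the gradient of mean squared error with respect to each weight is
# proportional to that feature's values, so a feature on a ~1000x larger scale
# produces a ~1000x larger gradient component, forcing a tiny shared learning rate.
rng = np.random.default_rng(1)
X = np.column_stack([rng.uniform(0, 1, 200),       # feature on a scale of ~1
                     rng.uniform(0, 1000, 200)])   # feature on a scale of ~1000
y = X @ np.array([2.0, 0.003]) + rng.normal(size=200)

w = np.zeros(2)                                    # starting point for gradient descent
grad = -2 * X.T @ (y - X @ w) / len(y)             # gradient of mean squared error
print(grad)                                        # second component is vastly larger

X_std = (X - X.mean(axis=0)) / X.std(axis=0)       # standardize the inputs
grad_std = -2 * X_std.T @ (y - X_std @ w) / len(y)
print(grad_std)                                    # components now comparable in size
```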

+1 Thank you for your help! =) – jjepsuomi Oct 29 '14 at 10:56

In my view, the question of whether to scale features in machine learning comes down to the measurement units of your features, and it is tied to the prior knowledge you have about the problem.

Some algorithms, like Linear Discriminant Analysis and Naive Bayes, handle feature scale by design, so performing scaling manually has no effect. Others, like kNN, can be gravely affected by it.

With a kNN-type classifier you have to measure the distances between pairs of samples, and those distances are of course influenced by the measurement units you use. Imagine you are classifying a population into males and females and you have a bunch of measurements, including height. Your classification result will then depend on the units in which the height was reported. If the height is measured in nanometers, the distances are dominated by it, and it's likely that any k nearest neighbors will merely have similar heights. You have to scale.
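For instance (a toy sketch with invented numbers, not from the answer): merely changing the unit of the height column flips which candidate is the nearest neighbour under Euclidean distance.

```python
import numpy as np

# Toy sketch with invented numbers: the same people, with height recorded in
# metres or in nanometres, plus one 0/1 feature. Changing the height unit
# flips which candidate is the nearest neighbour under Euclidean distance.
query        = np.array([1.80, 1.0])
candidates_m = np.array([[1.79, 0.0],   # nearly identical height, different binary feature
                         [1.60, 1.0]])  # clearly different height, same binary feature

def nearest(q, data):
    d = np.linalg.norm(data - q, axis=1)
    return int(np.argmin(d)), d

print(nearest(query, candidates_m))                  # nearest: row 1, both features matter

to_nm = np.array([1e9, 1.0])                         # express height in nanometres instead
print(nearest(query * to_nm, candidates_m * to_nm))  # nearest: row 0, height alone dominates
```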

However, as a contrasting example, imagine classifying something whose features all share the same unit of measurement but are recorded with noise, like a photograph, a microarray, or some spectrum. In this case you already know a priori that your features have equal units. If you were to scale them all, you would amplify the effect of features that are constant across all samples but were measured with noise (like the background of a photo). This again will influence kNN, and might drastically reduce performance if your data has more noisy, near-constant values than ones that vary: any similarity between the k nearest neighbors would then be driven by noise.

So this is like everything else in machine learning: use prior knowledge whenever possible, and in the case of black-box features, try both and cross-validate.
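A rough sketch of "try both and cross-validate", assuming scikit-learn is available; the wine dataset is just a convenient stand-in whose features live on very different scales.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Compare cross-validated accuracy of kNN with and without standardization.
X, y = load_wine(return_X_y=True)

raw    = KNeighborsClassifier(n_neighbors=5)
scaled = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))

print(cross_val_score(raw, X, y, cv=5).mean())     # accuracy without scaling
print(cross_val_score(scaled, X, y, cv=5).mean())  # accuracy with standardization
```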

Good examples... – Neil G Oct 29 '14 at 10:23
    
+1 Thank you for your help! =) – jjepsuomi Oct 29 '14 at 10:59
Just a quick follow-up, why would kNN be affected by feature scaling? The Mahalanobis distance should already account for that as far as I understand. – biostats101 Feb 11 '15 at 3:41
    
@SebastianRaschka When kNN was mentioned for some reason I only had Euclidean distance in mind. This should explain the confusion. kNN of course can be used with other distance metrics and thank you for noticing this. – Karolis Koncevičius Feb 11 '15 at 21:51

There are several methods of normalization.

In regard to regression, if you plan to normalize a feature by a single factor, there is no need. The reason is that single-factor normalization, such as dividing or multiplying by a constant, is already absorbed by the weights (e.g., say the weight of a feature is 3; if we normalize all the values of that feature by dividing by 2, the new weight becomes 6, so the overall effect is the same). In contrast, if you are planning to mean-normalize, it is a different story. Mean normalization is good when there is a huge spread in the feature values (e.g., 1, 70, 300, 4). It is also good when a single feature can have both a positive and a negative effect, because when you mean-normalize a set of positive values, the values below the mean become negative while those above the mean become positive.
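For instance, applying mean normalization to the example values above (a small sketch of my own):

```python
import numpy as np

# Small sketch of mean normalization on the example values above: subtracting
# the mean (and dividing by the range) makes values below the mean negative
# and values above it positive.
x = np.array([1.0, 70.0, 300.0, 4.0])
x_mean_norm = (x - x.mean()) / (x.max() - x.min())
print(x.mean())      # 93.75
print(x_mean_norm)   # approx. [-0.31, -0.08, 0.69, -0.30]
```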

In regard to k-nearest neighbours, normalization should always be performed. This is because in kNN the distances between points drive the grouping. So if you are applying kNN to a problem with two features, the first ranging from 1-10 and the other from 1-1000, then the groupings will be driven almost entirely by the second feature, as differences on the 1-10 scale are tiny compared to differences on the 1-1000 scale, and along the first feature everything effectively collapses into a single group.
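For example, a minimal min-max scaling sketch (values invented to match those ranges) maps both features onto [0, 1] so that neither range dominates the Euclidean distances used by kNN:

```python
import numpy as np

# Minimal sketch: min-max scaling rescales each feature to the [0, 1] interval.
rng = np.random.default_rng(2)
X = np.column_stack([rng.uniform(1, 10, 5),      # feature 1: range 1-10
                     rng.uniform(1, 1000, 5)])   # feature 2: range 1-1000

X_minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
print(X_minmax.min(axis=0), X_minmax.max(axis=0))   # both features now span [0, 1]
```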

"…if a single feature can have both a positive and negative effect, then it is good to mean normalize. This is because when you mean normalize a given set of positive values then the values below mean become negative while those above mean become positive." — won't the existence of a bias term allow any feature to have a positive or negative effect despite a positive range of values? – Neil G Oct 29 '14 at 9:53
    
+1 Thank you for your help! =) – jjepsuomi Oct 29 '14 at 10:55

Here's another application example, from chemometrics, where feature scaling would be disastrous:

There are lots of classification (qualitative analysis) tasks of the form "test whether some analyte (= substance of interest) content is below (or above) a given threshold (e.g. a legal limit)". In this case, the sensors used to produce the input data for the classifier would be chosen to have

signal = f(analyte concentration), preferably with f being a steep and ideally even linear function.

 

In this situation, feature scaling would essentially erase all relevant information from the raw data.


In general, some questions that help to decide whether scaling is a good idea:

  • What does normalization do to your data with respect to solving the task at hand? Does the task become easier, or do you risk deleting important information?
  • Does your algorithm/classifier react sensitively to the (numeric) scale of the data? (convergence)
  • Is the algorithm/classifier heavily influenced by different scales of different features?
  • If so, do your features share the same (or comparable) scales or even physical units?
  • Does your classifier/algorithm/actual implementation perform its own normalisation?
    
Thank you for your help! Appreciate it! =) – jjepsuomi Oct 16 '15 at 10:32

I contend that it's better to actually scale your data before splitting it into a train and test set for two reasons:

1 - There is no danger of leakage as there is no fitting being done when scaling features.

2 - This actually mimics what you might do when receiving a new set of data. You might have the X for a new set of data but not the y. There is no reason not to include the X from the new data with the old data when preprocessing. In some situations it might be the only logical thing to do.
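As a minimal sketch of the workflow this answer describes (data and split invented for illustration): the scaling statistics are computed on the full feature matrix before the train/test split.

```python
import numpy as np

# Sketch: min-max statistics are computed on the full X, then the rows are
# split into train and test sets.
rng = np.random.default_rng(3)
X = rng.uniform(0, 100, size=(10, 3))
y = (X[:, 0] > 50).astype(int)

X_scaled = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))  # scale everything first

split = 8                                        # first 8 rows train, last 2 test
X_train, X_test = X_scaled[:split], X_scaled[split:]
y_train, y_test = y[:split], y[split:]
```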
