For many machine learning algorithms, normalizing the data to be analyzed is a must. A supervised example is neural networks: it is well known that normalizing the input data to a network improves the results. If you don't believe me, that's ok (no offense taken), but you may prefer to believe Yann LeCun (Director of AI Research at Facebook and a founding father of convolutional networks) by checking section 4.3 of this paper. The first sentence of that section sums up the idea: "Convergence [of backprop] is usually faster if the average of each input variable over the training set is close to zero." Among other reasons, when the neural network tries to correct a prediction error, it updates the weights by an amount proportional to the input vector, which is harmful if the input is large.
Another example, in this case an unsupervised algorithm, is K-means. This algorithm tries to group data into clusters so that the data in each cluster shares some common characteristics. The algorithm iterates over two steps: first, it updates each cluster center as the mean of the points assigned to it; second, it assigns each point to its nearest cluster center.
In this second step, the distance between each point and the cluster centers is usually computed as a Minkowski distance (most commonly the famous Euclidean distance). Every feature carries the same weight in this calculation, so features measured in large ranges will influence the result more than those measured in small ranges; e.g., the same feature would influence the calculation more if measured in millimeters than in kilometers (because the numbers would be bigger). Therefore, the scales of the features must lie in comparable ranges.
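As a quick sketch of this effect (with made-up numbers), here is the same pair of points measured in two different units. The only thing that changes is the unit of the first feature, yet the Euclidean distance is dominated by whichever feature happens to produce the biggest numbers:

```python
import numpy as np

# Two points with features [length, weight_kg]; only the length unit changes.
a_mm = np.array([1000.0, 1.0])   # length = 1 m, expressed in millimeters
b_mm = np.array([2000.0, 2.0])   # length = 2 m, expressed in millimeters
a_km = np.array([0.001, 1.0])    # the same 1 m, expressed in kilometers
b_km = np.array([0.002, 2.0])    # the same 2 m, expressed in kilometers

# In millimeters the length feature dominates the distance;
# in kilometers the weight feature dominates instead.
print(np.linalg.norm(a_mm - b_mm))  # ~1000.0 (length dominates)
print(np.linalg.norm(a_km - b_km))  # ~1.0 (weight dominates)
```

Same physical data, two completely different distance landscapes, which is exactly why K-means needs comparable feature scales.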
Now that you know normalization is important, let us see what options we have to normalize our data.
Each feature is normalized within the limits of that feature: x_scaled = (x − x_min) / (x_max − x_min).
This is a common technique for scaling data into a range. But the problem with normalizing each feature within its empirical limits (that is, the maximum and minimum values found in that column) is that noise may be amplified.
One example: imagine we have internet data from a particular house and we want to build a model to predict something (maybe the price to charge). One of our hypothetical features could be the bandwidth of the fiber optic connection. Suppose the house purchased a 30 Mbit/s internet connection, so the bit rate is approximately the same every time we measure it (lucky guy).
It looks like a pretty stable connection, right? As the bandwidth is measured on a scale far from 1, let us scale it between 0 and 1 using our feature scaling method (sklearn.preprocessing.MinMaxScaler).
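A minimal sketch of what happens, using made-up bandwidth measurements (the exact values are hypothetical, but the shape matches the scenario above):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical bandwidth measurements (Mbit/s): essentially flat around 30,
# with only tiny measurement noise.
bandwidth = np.array([[29.9], [30.1], [30.0], [29.95], [30.05], [30.02]])

scaler = MinMaxScaler()
scaled = scaler.fit_transform(bandwidth)
print(scaled.ravel())
# The ~0.2 Mbit/s of noise now spans the whole [0, 1] range,
# so the flat signal looks wildly variable after scaling.
```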
After the scaling, our data is distorted. What was an almost flat signal now looks like a connection with a lot of variation. This tells us that feature scaling is not adequate for nearly constant signals.
Next try. Ok, scaling to a range didn't work for a noisy flat signal, but what about standardizing the signal? Each feature would be normalized by subtracting its mean and dividing by its standard deviation: x_scaled = (x − mean) / std.
This could work in the previous case, but don't open the bottle yet. The mean and standard deviation are very sensitive to outliers (small demonstration). This means that outliers may attenuate the non-outlier part of the data.
Now imagine we have data about how often the word "hangover" is posted on Facebook (for real). The frequency is like a sine wave, with lows during the weekdays and highs on weekends. It also has big outliers after Halloween and similar dates. We have idealized this situation with the next dataset (3 parties in 50 days. Not bad).
Despite the outliers, we would like to clearly distinguish that there exists a measurable difference between weekdays and weekends. Now we want to predict something (that's our business), and we would like to preserve the fact that the values are higher during the weekends, so we think of standardizing the data (sklearn.preprocessing.StandardScaler). We check the basic parameters of the standardization.
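Here is a sketch of that standardization on an idealized signal of this kind (a weekly sine plus three big party outliers; the exact numbers are made up for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

days = np.arange(50)
# Idealized "hangover" frequency: weekly oscillation (highs on weekends)...
freq = 10 + 5 * np.sin(2 * np.pi * days / 7)
# ...plus three big party outliers
freq[[10, 25, 40]] += 500

scaled = StandardScaler().fit_transform(freq.reshape(-1, 1)).ravel()
print(scaled.min(), scaled.max())
# The outliers inflate the standard deviation, so the weekday/weekend
# oscillation gets squashed into a narrow band well away from [0, 1].
```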
What happened? First, we were not able to scale the data between 0 and 1. Second, we now have negative numbers, which is not a dead end but complicates the analysis. And third, we are no longer able to clearly distinguish the differences between weekdays and weekends (all values are close to 0), because the outliers have interfered with the data.
From very promising data, we now have almost irrelevant data. One solution to this situation could be to preprocess the data and eliminate the outliers (things change when the outliers are removed).
The next idea that comes to mind is to scale the data by dividing it by its maximum value. Let's see how it behaves with our datasets (sklearn.preprocessing.MaxAbsScaler).
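A sketch with the same kind of idealized party data (weekly sine plus three outliers, made-up numbers) shows the problem:

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler

days = np.arange(50)
freq = 10 + 5 * np.sin(2 * np.pi * days / 7)  # weekly pattern, peaks near 15
freq[[10, 25, 40]] += 500                     # party outliers above 500

scaled = MaxAbsScaler().fit_transform(freq.reshape(-1, 1)).ravel()
# Everything is divided by the largest absolute value (an outlier),
# so the ordinary days end up crowded near zero.
print(scaled.max())
```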
Good! Our data is in the range [0, 1]… But wait! What happened to the differences between weekdays and weekends? They are all close to zero! As in the case of standardization, outliers flatten the differences among the data when scaling by the maximum.
The next tool in the data scientist's box is to normalize each sample individually to unit norm (check this if you don't remember what a norm is).
This data rings a bell, right? Let's normalize it (here by hand, but it is also available as sklearn.preprocessing.Normalizer).
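A sketch on the same kind of idealized signal (weekly sine plus three outliers, made-up numbers). Note that Normalizer works per sample (per row), so to normalize the whole signal we treat the 50 days as a single row vector:

```python
import numpy as np
from sklearn.preprocessing import Normalizer

days = np.arange(50)
freq = 10 + 5 * np.sin(2 * np.pi * days / 7)
freq[[10, 25, 40]] += 500

# One sample with 50 features, divided by its Euclidean (L2) norm.
scaled = Normalizer(norm='l2').fit_transform(freq.reshape(1, -1)).ravel()
print(scaled.max())
# Even the biggest outlier stays well below 1, because the norm of the whole
# vector is larger than any single value, flattening the rest even more.
```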
At this point in the post you know the story, but this case is worse than the previous ones. Here we don't even get the highest outlier scaled to 1; it is scaled to 0.74, which flattens the rest of the data even more.
The last option we are going to evaluate is the robust scaler (sklearn.preprocessing.RobustScaler). This method removes the median and scales the data according to the interquartile range (IQR). It is supposed to be robust to outliers.
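One more sketch on the idealized party data (weekly sine plus three outliers, made-up numbers):

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

days = np.arange(50)
freq = 10 + 5 * np.sin(2 * np.pi * days / 7)
freq[[10, 25, 40]] += 500

scaled = RobustScaler().fit_transform(freq.reshape(-1, 1)).ravel()
# The median and IQR are computed mostly from the ordinary days, so the
# weekly pattern keeps a visible spread; however, the output is neither
# confined to [0, 1] nor free of negative values.
print(scaled.min(), scaled.max())
```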
You may not see it in the plot (but you can see it in the output): this scaler introduced negative numbers and did not limit the data to the range [0, 1]. (Ok, I quit.)
There are other methods to normalize your data (based on PCA, taking into account possible physical bounds, etc.), but now you know how to evaluate whether your normalization is going to negatively influence your data.
There is no ideal method to normalize or scale every dataset. So it is the job of the data scientist to know how the data is distributed, be aware of the existence of outliers, check the ranges, know the physical limits (if any), and so on. With this knowledge, one can select the best technique to normalize each feature, probably using a different method for each one.
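As a sketch of that per-feature approach, sklearn's ColumnTransformer lets you apply a different scaler to each column. The tiny dataset below is hypothetical: one clean, bounded feature and one feature with a heavy outlier:

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, RobustScaler

# Hypothetical dataset: column 0 is bounded and outlier-free,
# column 1 has a heavy outlier, so each gets its own scaler.
X = np.array([[0.0, 1.0],
              [5.0, 2.0],
              [10.0, 500.0]])

per_feature = ColumnTransformer([
    ('bounded', MinMaxScaler(), [0]),   # scale the clean feature to [0, 1]
    ('outliers', RobustScaler(), [1]),  # robust scaling for the noisy one
])
print(per_feature.fit_transform(X))
```

This keeps the clean feature in a tidy [0, 1] range while the outlier-ridden one is scaled around its median, instead of forcing one compromise method on both.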
If you know nothing about your data, I would recommend that you first check for the existence of outliers (remove them if necessary) and then scale by the maximum of each feature (while crossing your fingers).