https://www.quora.com/In-classification-how-do-you-handle-an-unbalanced-training-set

 

 

In classification, how do you handle an unbalanced training set?

In some cases, you have a lot more examples of one class than of the others. How do you tackle this problem? What are some gotchas to be aware of in this situation?
30 Answers
Sergey Feldman
Sergey Feldman, machine learning PhD & consultant @ www.data-cowboys.com

the ones listed above/below are great! here are a few more:

1) let's say you have L times as many examples of the abundant class as of the rare class. for stochastic gradient descent, take L separate steps each time you encounter a training example from the rare class.

2) divide the more abundant class into L distinct clusters. then train L predictors, where each predictor is trained on only one of the distinct clusters, but on all of the data from the rare class. to be clear, the data from the rare class is used in the training of all L predictors. finally, use model averaging for the L learned predictors as your final predictor.

3) this is similar to Kripa's number (2), but a little different. let N be the number of samples in the rare class. cluster the abundant class into N clusters (agglomerative clustering may be best here), and use the resulting cluster medoids as the training data for the abundant class. to be clear, you throw out the original training data from the abundant class, and use the medoids instead. voila, now your classes are balanced!

4) whatever method you use will help in some ways, but hurt in others. to mitigate that, you could train a separate model using all of the methods listed on this page, and then perform model averaging over all of them!

5) A recent ICML paper (similar to Kripa's (1)) shows that adding data that are "corrupt[ed] training examples with noise from known distributions" can actually improve performance. The paper isn't totally relevant to the problem of unbalanced classes because they add the data implicitly with math (that is, the dataset size remains unchanged). But I think the table of corrupting distributions in the paper is useful if you want to implement your own surrogate data code.

More details than you need: imho, the most interesting of the corrupting distributions is the blankout distribution, where you just zero out a random subset of features. Why is it interesting? Because you are helping your classifier be sturdier/hardier by giving it variations of your data that have essentially missing features. So it has to learn to classify correctly even in adverse conditions.

A related idea is dropout in neural networks, where random hidden units are removed (zeroed-out) during training. This forces the NN to, again, be hardier than it would be otherwise. See here for a recent treatment: http://www.stanford.edu/~sidaw/c...

Here’s a nice package that does a lot of this stuff and is compatible with the scikit-learn API: scikit-learn-contrib/imbalanced-learn
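
To make the blankout idea from point 5 concrete, here is a minimal sketch of explicit "surrogate data" code for the rare class. The paper adds the corruption implicitly, so this is a simplification rather than its method, and the names blankout and augment_rare_class, the probability p and the number of copies are all assumptions made for illustration.

```python
import random

def blankout(x, p, rng):
    """Return a copy of feature vector x with each feature zeroed with probability p."""
    return [0.0 if rng.random() < p else v for v in x]

def augment_rare_class(X, y, rare_label, copies, p=0.2, seed=0):
    """Append `copies` blankout-corrupted variants of every rare-class example."""
    rng = random.Random(seed)
    X_aug, y_aug = list(X), list(y)
    for x, label in zip(X, y):
        if label == rare_label:
            for _ in range(copies):
                X_aug.append(blankout(x, p, rng))
                y_aug.append(label)
    return X_aug, y_aug

# Toy data: three abundant examples, one rare example.
X = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0], [0.5, 0.1, 0.9]]
y = [0, 0, 0, 1]
X_aug, y_aug = augment_rare_class(X, y, rare_label=1, copies=2)
print(len(X_aug), y_aug.count(0), y_aug.count(1))
```

With copies chosen close to the imbalance ratio, the rare class ends up roughly as large as the abundant one, and every added row is a variant with some features missing.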

Roar Nybø
Roar Nybø, Using ML for diagnostics in offshore drilling
The data set example has a strong class imbalance, which can mislead some classification algorithms. In particular, some will always output '0' since that is correct in 99.97% of cases. The easiest remedy is to train on just 300 examples from each class.
However, you also have to consider the following:
  1. Are the almost 1 million examples labeled '0' similar enough to be captured by one class, or is it simpler from an ML point of view, to work with many sub-classes?
  2. Are the 300 examples labeled '1' representative of the class, or are there parts of the parameter space that are simply not covered, because we have too few examples?
  3. What is the misclassification rate? If only one in every ten thousand examples in class '0' gets misclassified, that is still about 100 false positives, a sizeable fraction of the 300 examples in class '1'.

One situation where this occurs is intrusion detection in computer networks. Almost all network traffic is legitimate, with viruses and hacking attempts making up only a tiny amount, so this will yield a data set with a strong class imbalance. We also lack a representative collection of intrusion attempts, because new methods are developed all the time. The intrusion class is an open class.
When classes not seen in the training set may appear in later use, the problem is referred to as open set classification [1].

Methods to deal with issue 2 include anomaly detection and one-class learning. Here the goal is to draw a line around a well-documented class and classify cases as simply belonging or not belonging to that class. In the example above you'd flag as intrusion anything not similar to normal traffic. In [2], issues 1 and 2 are attacked simultaneously by using an ensemble of one-class classifiers. (Actually the same strategy as Charles H Martin is proposing here.)

Issue 3 is a hard one, but ensembles of classifiers can be of some use in this regard as well. If none of the one-class classifiers fire on an example, this can be taken as an indication that the example is hard to classify. This is an improvement over binary classifiers, which by virtue of the decision boundary will always try to classify every example.
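
As a rough illustration of "drawing a line around a well-documented class", here is a minimal one-class sketch: it learns a centroid and a radius from normal examples only, and flags anything outside that radius. The centroid-plus-radius model, the quantile parameter and the function names are assumptions chosen for brevity; the methods in [2] are considerably more elaborate.

```python
import math

def fit_one_class(normal, quantile=0.95):
    """Fit a centroid and a distance threshold on examples of the well-documented class only."""
    dim = len(normal[0])
    centroid = [sum(x[d] for x in normal) / len(normal) for d in range(dim)]
    dists = sorted(math.dist(x, centroid) for x in normal)
    # Radius that encloses roughly `quantile` of the training examples.
    idx = min(len(dists) - 1, int(quantile * len(dists)))
    return centroid, dists[idx]

def is_outlier(x, model):
    """True if x lies outside the learned radius, i.e. is not similar to the normal class."""
    centroid, radius = model
    return math.dist(x, centroid) > radius

# Toy "normal traffic": points in the unit square.
normal = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
model = fit_one_class(normal)
print(is_outlier([0.4, 0.6], model), is_outlier([5.0, 5.0], model))
```

An ensemble in the spirit of the answer would fit one such model per sub-class of normal traffic and flag an example only when none of the models accepts it.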

[1] GORI, M. & SCARSELLI, F. (1998) Are multilayer perceptrons adequate for pattern recognition and verification? IEEE Transactions on Pattern Analysis and Machine Intelligence, 20, 1121-1132.
[2] GIACINTO, G., PERDISCI, R., DEL RIO, M. & ROLI, F. (2008) Intrusion detection in computer networks by a modular ensemble of one-class classifiers. Information Fusion, 9, 69-82.
Abhishek Ghose
Abhishek Ghose, Data Scientist @[24]7
This is a very practical problem and here are some ways to get around this:
  1. Random undersampling of the majority class
  2. Random oversampling of the minority class
  3. Random undersampling leads to potential loss of information, since a lot of data instances are just 'thrown away'. You can perform informed undersampling by first finding out the distribution of the data and then selectively picking the points to be thrown away.
  4. You can oversample with synthetically generated data points that are not too different from the minority class data points you actually have. SMOTE is a popular technique.
  5. Use a cost-sensitive classifier. For example, in certain kinds of decision trees (ex C5.0: An Informal Tutorial) you can explicitly state that misclassifying a data instance from the minority class as the majority class is much more expensive than the other kind of misclassification. libsvm, the popular SVM package, allows this using the per-class weight options "-wi" (for example, -w1 for class 1).
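
Items 1 and 2 of the list above can be sketched in a few lines. This is a minimal version for the binary case using only the standard library; the function names and the choice to balance the classes exactly 1:1 are assumptions made for illustration.

```python
import random

def random_undersample(X, y, majority_label, seed=0):
    """Keep all minority examples and a random majority subset of equal size."""
    rng = random.Random(seed)
    major = [i for i, label in enumerate(y) if label == majority_label]
    minor = [i for i, label in enumerate(y) if label != majority_label]
    keep = sorted(minor + rng.sample(major, len(minor)))
    return [X[i] for i in keep], [y[i] for i in keep]

def random_oversample(X, y, minority_label, seed=0):
    """Duplicate random minority examples (with replacement) until the classes are equal."""
    rng = random.Random(seed)
    minor = [i for i, label in enumerate(y) if label == minority_label]
    n_major = len(y) - len(minor)
    extra = [rng.choice(minor) for _ in range(n_major - len(minor))]
    idx = list(range(len(y))) + extra
    return [X[i] for i in idx], [y[i] for i in idx]

# Toy data with a 97:3 imbalance.
X = [[i] for i in range(100)]
y = [0] * 97 + [1] * 3
X_under, y_under = random_undersample(X, y, majority_label=0)
X_over, y_over = random_oversample(X, y, minority_label=1)
print(len(y_under), len(y_over))
```

Either way, resample only the training split; the validation and test data should keep the original class proportions so that the scores reflect real-world conditions.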

The above list is not exhaustive. This paper provides a good survey: http://www.ele.uri.edu/faculty/h...

Make sure that you are using a scoring mechanism that deals with imbalance. For example, if your data is 97% -ve and 3% +ve and you use accuracy as the performance metric, I can easily score 97% by classifying all points as -ve. So accuracy is not a good metric here; something like the F1-score is more suitable.
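
The 97%/3% example can be checked directly. The sketch below uses hand-written helper functions (not taken from any particular library) to score a classifier that labels every point as negative.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

y_true = [1] * 3 + [0] * 97      # 3% positive, 97% negative
y_pred = [0] * 100               # classify everything as negative
print(accuracy(y_true, y_pred))  # 0.97
print(f1_score(y_true, y_pred))  # 0.0
```

The same do-nothing classifier looks excellent by accuracy and useless by F1, which is the point of the paragraph above.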
Ian Vala
Ian Vala, Senior Data Scientist working in Silicon Valley. Harvard grad.

Haha you know what's funny? You get 90% accuracy for your model and you are like “awesome!” until you find out, well, 90% of the data was all in one class lol

This is actually a very good interview question, and what you are referring to is called “imbalanced data”. It is a very common problem when you get a real dataset. For example, you get cancer patient data. They tell you to go predict whether the person has cancer or not. Great! Making the world a better place, and making a 6 figure salary! You are excited and get the dataset, and it’s 98% no cancer and 2% cancer.

Crap…

Lo and behold, thankfully there are some solutions for this:

  1. Resample differently. Over-sample from your minority class and under-sample from your majority class, so you get a more balanced dataset.
  2. Try metrics other than raw accuracy, such as the confusion matrix or the ROC curve. Accuracy can be broken down into sensitivity and specificity, and models can be chosen based on the trade-off between the two at different thresholds.
  3. Use Penalized Models. Like penalized-SVM and penalized-LDA. They put additional cost on the model for making classification mistakes on the minority class during training. These penalties can bias the model towards paying attention to minority class.
  4. Try anomaly detection techniques and the models often used there, although that is probably only necessary if your data is even more imbalanced.

Still not satisfied? Here’s the book on this: Imbalanced Learning: Foundations, Algorithms, and Applications: Haibo He, Yunqian Ma: 9781118074626: Amazon.com: Books

Kripa Chettiar
Kripa Chettiar, NLP and ML practitioner; Personalization @ Amazon - Music Recommendations
You could do either of the following:

  1. Generate synthetic data using SMOTE or SMOTEBoost.[http://wiki.pentaho.com/display/... ]. This should give you good results in most cases.
  2. Decompose your larger class into a smaller number of other classes. This is tricky and totally dependent on the kind of data you have. For something like 30 instances of A vs 4000 instances of B, you would decompose the 4000 instances of B into 1000 instances of B1 and 3000 instances of unknown. Effectively you are reducing the difference in number of instances between A and B1.
  3. A medley of 1 and 2 could also work.
  4. In the worst case use a One Class Classifier. What you are doing here is considering the smaller class an outlier and confidently learning the decision boundary for ONLY the larger class. Weka provides a one-class classifier library. [http://webcache.googleuserconten...
  5. Get more data!
Dayvid Victor
Dayvid Victor, Machine Learning Researcher | PhD candidate in Computer Science | Data Scientist

There are a few notes and suggestions:

  1. Regular classification rate (classification accuracy) isn't a good metric, because if you correctly classify only the instances of the majority class (class with many samples), this metric still gives you a high rate. The Area Under the ROC Curve (AUC) is a good metric for evaluation of classifiers in such datasets.
  2. You can increase the number of minority class samples by:
    1. Resampling: bootstrapping samples of the minority class.
    2. Oversampling: generate new samples of the minority class, for this, I'd recommend to use SMOTE, SPIDER or any variant. You can also use Prototype Generation (PG) in order to generate new samples of the minority class - there are specific PG techniques for imbalanced datasets such as ASGP and EASGP.
  3. You can reduce the number of majority class samples by:
    1. Random Undersampling.
    2. Prototype Selection (PS) to reduce the imbalance level, such as One-Sided Selection (OSS). Or you can use Tomek Links, Edited-Nearest Neighbors (ENN) and others, but only to remove the majority class outliers.
  4. In your K-Fold validation, try to use the same proportion between the classes. If the number of instances of the minority class is too low, you can reduce the number of K, until there are enough.
  5. Use Multiple Classifier Systems (MCS):
    1. Using Ensemble Learning has been proposed as an interesting solution to learn from imbalanced data.
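
Point 4 above (keeping the same class proportion in every fold) is usually called stratified K-fold. A minimal sketch with the standard library follows; the function name stratified_folds and the round-robin dealing are assumptions made for illustration, and libraries such as scikit-learn provide a ready-made equivalent.

```python
import random
from collections import defaultdict

def stratified_folds(y, k, seed=0):
    """Split indices into k folds that each keep roughly the original class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(y):
        by_class[label].append(i)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class's examples round-robin so every fold gets its share.
        for j, i in enumerate(indices):
            folds[j % k].append(i)
    return folds

y = [0] * 90 + [1] * 10
folds = stratified_folds(y, k=5)
print([sum(y[i] for i in fold) for fold in folds])  # minority count per fold
```

If the minority count per fold drops to zero or one, that is the signal to reduce K, as suggested in the list.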

Also, be careful with the techniques/algorithms you will use. In prototype generation, for example, there are techniques that have a high performance on regular datasets, but if you use them with unbalanced datasets, they will misclassify most (or all) instances of the minority class.

In cases like that, search for a similar algorithm, adapted to handle unbalanced datasets, or adapt it yourself. If you get good results, you can even publish it (I've done that).

Yevgeny Popkov
Yevgeny Popkov, Data Scientist @ CQuotient
These two alternative approaches are typically used:
1) Quick&dirty approach: balance the data set by either under-sampling the majority class(es) or over-sampling the minority class(es).
2) A preferred approach: use cost-sensitive learning by applying class-specific weights in your loss function (smaller weights for the majority class cases and larger weights for the minority class cases). For starters, the weights can be set to be inversely proportional to the fraction of cases of the corresponding class.  In case of binary classification the weight of one of the classes can be tuned using resampling to squeeze even more predictive power out of the model.
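
As a starting point for approach 2, here is a minimal sketch of inverse-frequency class weights plugged into a weighted log loss. The function names are assumptions; the weight formula n / (k * count) is the same one scikit-learn documents for class_weight='balanced', and the loss is a plain binary cross-entropy rather than any specific library's implementation.

```python
import math
from collections import Counter

def inverse_frequency_weights(y):
    """Weight each class inversely to its share of the data: n / (k * count)."""
    counts = Counter(y)
    n, k = len(y), len(counts)
    return {label: n / (k * c) for label, c in counts.items()}

def weighted_log_loss(y_true, p_pos, weights, eps=1e-12):
    """Class-weighted binary cross-entropy; p_pos holds the predicted P(y=1)."""
    total, norm = 0.0, 0.0
    for t, p in zip(y_true, p_pos):
        p = min(max(p, eps), 1 - eps)  # avoid log(0)
        w = weights[t]
        total += -w * (math.log(p) if t == 1 else math.log(1 - p))
        norm += w
    return total / norm

y = [0] * 90 + [1] * 10
weights = inverse_frequency_weights(y)
print(weights)
# Always predicting "almost surely negative" is now penalised heavily.
print(weighted_log_loss(y, [0.01] * 100, weights))
print(weighted_log_loss(y, [0.5] * 100, weights))
```

With these weights both classes contribute the same total weight to the loss, so a model can no longer score well by ignoring the minority class.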
Qingbin Li
Qingbin Li, Data Scientist @ ServiceNow | Machine Learning Enthusiast

When dealing with imbalanced classification problem, I would consider the following aspects :

  1. Whether more data could be collected or not. Sometimes the dataset is imbalanced because we haven't collected enough data.
  2. Utilize techniques to balance the data. We can consider downsampling the majority class or upsampling the minority class or generating synthetic data, etc. The main goal is to convert the imbalanced classification problem into a balanced classification problem so the regular classification algorithms can be used.
  3. Choose an approach that is insensitive to imbalanced data, like cost-sensitive learning. The basic idea is to adjust the misclassification cost of the various classes.
  4. Use the correct performance metric, like AUC, f-score, Confusion Matrix, etc. It’s quite important to choose the right metric to evaluate your model performance.
Charles H Martin
Charles H Martin, Calculation Consulting; we predict things
I would just try the simplest thing possible... train a bunch of binary SVMs, each with an equal balance of positive and random negative samples, and take the majority vote as the label.

I have this problem all the time in industry...sometimes something just bone head simple works quite well
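
A rough sketch of that "bone head simple" recipe is below. To stay dependency-free it swaps in a nearest-centroid classifier as a stand-in for the binary SVMs, so it illustrates the balanced-subsample-and-vote structure only; the function names, n_models and the toy data are assumptions.

```python
import math
import random

def fit_centroids(X, y):
    """Stand-in base learner (nearest centroid), used here instead of an SVM."""
    model = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        model[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return model

def predict_centroids(model, x):
    return min(model, key=lambda label: math.dist(x, model[label]))

def fit_balanced_ensemble(X, y, positive=1, n_models=5, seed=0):
    """Train each model on all positives plus an equal-sized random sample of negatives."""
    rng = random.Random(seed)
    pos = [x for x, t in zip(X, y) if t == positive]
    neg = [x for x, t in zip(X, y) if t != positive]
    models = []
    for _ in range(n_models):
        neg_sample = rng.sample(neg, len(pos))
        Xs = pos + neg_sample
        ys = [1] * len(pos) + [0] * len(neg_sample)
        models.append(fit_centroids(Xs, ys))
    return models

def predict_majority(models, x):
    """Return 1 if more than half of the models vote positive, else 0."""
    votes = sum(predict_centroids(m, x) for m in models)
    return 1 if 2 * votes > len(models) else 0

# Toy data: 200 negatives around (0, 0), 10 positives around (5, 5).
data_rng = random.Random(1)
X = [[data_rng.gauss(0, 1), data_rng.gauss(0, 1)] for _ in range(200)]
X += [[data_rng.gauss(5, 1), data_rng.gauss(5, 1)] for _ in range(10)]
y = [0] * 200 + [1] * 10
models = fit_balanced_ensemble(X, y)
print(predict_majority(models, [5.0, 5.0]), predict_majority(models, [0.0, 0.0]))
```

Using an odd n_models avoids tied votes, and each model sees a different slice of the negatives, so the ensemble as a whole still uses most of the majority-class data.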

---

Additionally...ask yourself "why would a simple SVM fail (using class weights)?" After all, the SVM only learns from data on the margin, so, in theory, the mis-balanced classes should not matter.  There are two reasons:

1.  You are using mean squared error to evaluate the accuracy during cross validation.  Another metric may suit your problem better

2.  The data is non-separable without adding some slack, and you cannot adjust more than a single regularization parameter (and a few kernel parameters), and therefore you cannot control the slack at the margin... where the data is non-separable.

 By selecting N random sets rather than class weights, the hope is you can adjust the slack variables on each set.  Vapnik has developed a 'similar' approach, which he calls "Master Learning" (although here the Master knowledge is information added by adjusting the slack variables).

of course, if you find the regularization and kernel parameters are the same for every random sample, you may not have gained anything
Muktabh Mayank
Muktabh Mayank, Data Scientist,CoFounder @ ParallelDots, BITSian for life, love new technology trends
Here is a recent paper which addresses the same problem. It uses a method similar to SVM.
I am working on an implementation, will open source it soon. Page on hal.inria.fr

There is a nice method to cope with unbalanced data set with a theoretical justification.
The method is based on the boosting algorithm Robert E. Schapire presented in "The strength of weak learnability" (Machine Learning, 5(2):197–227, 1990. http://rob.schapire.net/papers/s... ).

In this paper Schapire presented a boosting algorithm based upon recursively combining triplets of weak learners. By the way, this was the first boosting algorithm.

We can use the first step of the algorithm (even without the recursion) to cope with the lack of balance.

The algorithm trains the first learner, L1, on the original data set.
The second learner, L2, is trained on a set on which L1 has a 50% chance of being correct (by sampling from the original distribution).
The third learner, L3, is trained on the cases on which L1 and L2 disagree.
As output, return the majority of the classifiers.
See the paper to see why it improves the classification.

Now for the application of the method to an imbalanced set.
Assume the concept is binary and the majority of the samples are classified as true.

Let L1 always return true.
L2 is trained where L1 has a 50% chance of being right. Since L1 always returns true, L2 is trained on a balanced data set.
L3 is trained on the cases where L1 and L2 disagree, hence where L2 predicts false.
The ensemble predicts by majority, hence predicts false only when both L2 and L3 predict false.

I used this method in practice many times and it is very useful.
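
The three-learner construction for the imbalanced case can be sketched as follows. The base learner here is a nearest-centroid stand-in chosen only to keep the example self-contained; any learner that returns a predict function would do. The function names and toy data are assumptions.

```python
import math
import random

def fit_centroid(X, y):
    """Stand-in weak learner: returns a function predicting the label of the nearest centroid."""
    centroids = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return lambda x: min(centroids, key=lambda lab: math.dist(x, centroids[lab]))

def fit_three_learner(X, y, fit, seed=0):
    """y holds booleans, True being the majority class; fit(X, y) returns a predict function."""
    rng = random.Random(seed)
    pos = [x for x, t in zip(X, y) if t]
    neg = [x for x, t in zip(X, y) if not t]
    # L1 always answers True, so it is right on `pos` and wrong on `neg`.
    # L2 is trained where L1 is right half the time, i.e. on a balanced set.
    pos_sample = rng.sample(pos, len(neg))
    l2 = fit(pos_sample + neg, [True] * len(neg) + [False] * len(neg))
    # L3 is trained where L1 and L2 disagree, i.e. where L2 predicts False.
    pairs = [(x, t) for x, t in zip(X, y) if not l2(x)]
    if pairs:
        l3 = fit([x for x, _ in pairs], [t for _, t in pairs])
    else:
        l3 = lambda x: True  # L2 never disagrees with L1
    # Majority of (True, L2, L3): False only when both L2 and L3 say False.
    return lambda x: l2(x) or l3(x)

# Toy data: 200 True examples around (0, 0), 10 False examples around (5, 5).
data_rng = random.Random(1)
X = [[data_rng.gauss(0, 1), data_rng.gauss(0, 1)] for _ in range(200)]
X += [[data_rng.gauss(5, 1), data_rng.gauss(5, 1)] for _ in range(10)]
y = [True] * 200 + [False] * 10
clf = fit_three_learner(X, y, fit_centroid)
print(clf([0.0, 0.0]), clf([5.0, 5.0]))
```

Because L1 is fixed to the majority label, the ensemble is conservative about predicting the minority class: both trained learners have to agree before it does.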

Kaushik Kasi
Kaushik Kasi, (Data Science && Bitcoin) Enthusiast
Here are some options
  • Replicate data points for the lagging class and add those to the training data (might increase bias)
  • Artificially generate more data points by manually tweaking features. This is sometimes done with systems involving images, where an existing image can be distorted to create a new image. You run the risk of training the model based on (artificial) samples that aren't representative of the test samples that the model would encounter in the real world.
  • Treat the multi-class classifier as a binary classifier for the larger sample. Define the larger class as positive and the smaller class as negative. This would train your model distinctly on what the larger class looks like, and theoretically classify the smaller class as "negatives". Your performance will depend on how many features you have for your samples and how tolerant your model can be to overfitting.
Aayush
Aayush, Data Science Intern

I’d recommend three ways to solve the problem, each has (basically) been derived from Chapter 16: Remedies for Severe Class Imbalance of Applied Predictive Modeling by Max Kuhn and Kjell Johnson.

  1. Random Forests w/ SMOTE Boosting: Use a hybrid SMOTE that undersamples the majority class and generates synthetic samples for the minority class by adjustable percentages. Select these percentages depending on the distribution of your response variable in the training set. Feed this data to your RF model. Always cross-validate/perform a grid search to find the best parameter settings for your RFs.
  2. XGBoost w/ hyper-parameter optimisation: Again, cross-validate or perform a grid search to find the best parameter settings for the model. I found this post extremely useful in my case. Additionally, xgboost offers parameters to balance positive and negative weights using scale_pos_weight. See the parameter documentation for a complete list.
  3. Support Vector Machines w/ Cost Sensitive Training:
    - SVMs allow for some degree of misclassification of data and then repeatedly optimize a loss function to find the required “best fit” model.
    - It controls the complexity using a cost function that increases the penalty if samples are on the incorrect side of the current class boundary.
    - For class imbalances, unequal costs for each class can adjust the parameters to increase or decrease the sensitivity of the model to particular classes.

I have used methods one and two and have been able to obtain an AUC of over 0.8 on the test data-set, and the data-set I was working on had very serious class imbalance with the minority class making up as little as 0.26% of the data-set.

Some important advice that I received and I’d like to give to anyone dealing with severely skewed classes is that:

  • Understand your data. Know what your requirements are and what type of classification matters more to you and which classification are you willing to trade-off.
  • Do not use % accuracy as a metric, use AUC, Sensitivity-Specificity and Cohen’s Kappa scores instead. The fundamental problem with using accuracy as a metric is that your model can simply classify everything into the majority class and give you a very high accuracy which is definitely one of the biggest “gotchas”, as mentioned in the question.
  • Run a lot of tests on multiple models. Intuition can take you a long way in data-science - if your gut tells you that an ensemble of classifiers will give you the best results, go ahead and try it.

On a closing note, I’d like to say here that XGBoost when tuned correctly has rarely ever disappointed anyone, but that shouldn’t stop you from experimenting!

Shehroz Khan
Shehroz Khan, ML Researcher, Postdoc @U of Toronto

To handle skewed data, you can employ different strategies, such as

  • Over sampling the minority class or under sampling the majority class
  • Cost sensitive classification
  • One-class Classification

You can read my detailed answer to a similar question here - Shehroz Khan's answer to I have an imbalanced dataset with two classes. Would it be considered OK if I oversample the minority class and also change the costs of misclassification on the training set to create the model?

C.V. Krishnakumar
C.V. Krishnakumar, studied Computer Science at Stanford University (2010)
Here is my opinion on this. Please take it with a grain of salt, since the answer might differ with different applications.
  • Gotcha #1: Do not use accuracy alone as a metric. That way, we would get 99% accuracy with everything classified as the majority class, which would not mean anything. Precision & Recall, with respect to each class might be a better one.
  • If you are more interested in the classification of the minority class, you can use a cost-sensitive classifier (http://weka.wikispaces.com/CostS...) through which you can state the cost of misclassification of the different classes. E.g. misclassifying the minority might be considered to be costlier.
  • You might want to boost the number of minority class training examples by artificially creating new samples from the existing samples.
  • Simplest of all, you could also just resample the set, to have a proportional number of samples in both the classes, if that is an option.
Feng Qi (奇峰)
Feng Qi (奇峰), Software Engineer, Quora
1. assign higher weights to training data of rare classes
2. up-sampling rare class
3. sometimes highly imbalanced data can still give good training results. I think it makes sense for some metrics, for example, auc.

I found this blogpost to be helpful.

In short - use F1 scoring, and pick an algorithm that can handle unbalanced classes by weighting samples differently by class.

I did not find sub sampling very helpful because my rare class was only 3% of observations and that led to too little data for my algorithm to work with.

Yiyao Liu
Yiyao Liu, improving every day

This article, Practical Guide to deal with Imbalanced Classification Problems in R, describes four methods with some worked cases, which are clear and helpful.

  1. Undersampling
  2. Oversampling
  3. Synthetic Data Generation
  4. Cost Sensitive Learning
Adding to what Krishnakumar said, and specifically the third point about generating artificial or synthetic samples of the minority class to maintain a balance, check out this paper on SMOTE (Synthetic Minority Over-sampling TEchnique) - http://arxiv.org/pdf/1106.1813.pdf. It uses a nearest-neighbor approach to generate synthetic samples of the minority class. In all probability, there will be a lot of noise in the artificial samples generated.

I did have quite a bit of success in using this approach to classify credit card transactions as fraudulent.
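
To show what the nearest-neighbor generation looks like, here is a simplified SMOTE-style sketch: each synthetic point is placed at a random position on the segment between a minority example and one of its k nearest minority neighbors. It is an illustration of the core idea only, not the reference implementation from the paper; the function name and defaults are assumptions.

```python
import math
import random

def smote_like(minority, n_new, k=3, seed=0):
    """Generate n_new synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # The k nearest minority neighbours of x, excluding x itself.
        others = [m for m in minority if m is not x]
        neighbours = sorted(others, key=lambda m: math.dist(x, m))[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from x to nb
        synthetic.append([a + gap * (b - a) for a, b in zip(x, nb)])
    return synthetic

minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote_like(minority, n_new=10)
print(len(new_points))
```

Since every synthetic point lies between two real minority points, the noise mentioned above shows up mainly when the minority class overlaps the majority class or contains outliers.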
If you are looking to use a Support Vector Machine for classification, you may consider using the Twin SVM, which is suited to unbalanced datasets.
Twin Support Vector Machines for Pattern Classification
The objective is to find two separating hyperplanes, each of which is closer to points of one class and far from points of the other. Predictions on test points are computed based on the distance from the two hyperplanes.
Todd Jamison
Todd Jamison, Imagery Scientist, develop machine learning algorithms for earth observation

In our software, we use an accuracy metric that balances the Producer’s and User’s accuracy and applies a larger penalty for higher levels of error. It can accommodate any number of classes, not just binary problems. Our accuracy function equation is:

where:

N = the number of classes,

Mn = the number of data points in each class,

Rn = the number of data points assigned to the class by the solution,

pni = the classification results for the ith training sample in class n, and

rni = the classification results for the ith training sample assigned to class n.

There are two items within the primary summation. The first item is related to the Producer’s Accuracy and the second is related to the User’s Accuracy. In this formula, we calculate the Producer’s and User’s error rates for each class instead of the accuracy rates by inverting the results values (i.e., (1-pni) and (1-rni) respectively). The error rates in the range 0 to 1 are root-mean-squared, in order to emphasize larger error rates. The error metric is then converted back to an accuracy metric by subtracting the result from 1. If you are using it as a cost function, you don’t do the final subtraction.

This metric should result in solutions that have a more balanced level of accuracy across classes and between Producer and User perspectives.
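
The equation itself is not shown above, so the following is only one reading of the description, not the software's actual formula: for each class, take the Producer's error rate (share of that class's samples that were misassigned) and the User's error rate (share of samples assigned to that class that belong elsewhere), take the root mean square over all 2N error rates, and subtract from 1. The equal weighting of the 2N terms and the handling of a class that is never predicted are assumptions.

```python
import math

def balanced_rms_accuracy(y_true, y_pred):
    """1 minus the RMS of per-class Producer's and User's error rates."""
    classes = sorted(set(y_true))
    squared = []
    for c in classes:
        m = sum(t == c for t in y_true)  # M_n: samples truly in class n
        r = sum(p == c for p in y_pred)  # R_n: samples assigned to class n
        correct = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        producer_err = 1 - correct / m
        # Assumption: a class that is never predicted counts as full User's error.
        user_err = 1 - correct / r if r else 1.0
        squared += [producer_err ** 2, user_err ** 2]
    return 1 - math.sqrt(sum(squared) / len(squared))

y_true = [0] * 97 + [1] * 3
all_majority = [0] * 100
print(balanced_rms_accuracy(y_true, all_majority))
print(balanced_rms_accuracy(y_true, y_true))
```

Under this reading, predicting the majority class everywhere scores below 0.3 even though plain accuracy would be 0.97, which matches the stated aim of balancing accuracy across classes.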

Nigel Legg
Nigel Legg, Discovering meaning in your data.
This is actually something I'm thinking about right now.  I have a data file of 49,000 records that need to be pushed through the classifier.  We have two classifications, Yes and No. From experience with a smaller data file, we will get >99% No, and <1% Yes, but the Yes classification is the important one. So I have manually classified 200 records, for the training set (gave 2xYes, 198xNo), and this is currently queued to be classified.
I'm assuming that this will give some odd results - there aren't enough examples of the 'Yes' classification.  I think a better training set would be a more balanced one, in spite of the imbalance in the test set.  Thus a better approach would be to hunt out the 'Yes' records to ensure that there are >50 classified in the training set.  The danger with this, however, is that you could end up classifying all the 'Yes' records by hand, leaving nothing for the classifier to do. ;-)
Alex Gilgur
Alex Gilgur, Data Scientist; IT Analyst; occasional lecturer at UCBerkeley's MIDS program.

I think you are referring to an imbalanced, rather than skewed, dataset. Regardless, it is a big problem in classification, because as you pointed out, data are rarely balanced. One way to solve it is, when training your classifier, to sample separately from the positive and the negative sets and combine them 9:1 when done: it's all about conditional probabilities.

Prem R. Adhikari
Prem R. Adhikari, Machine learning student and learner
Here you will find 8 Tactics to Combat Imbalanced Classes in Your Machine Learning Dataset - Machine Learning Mastery
I would like to add to the ideas suggested in the link and by Abhishek Ghose that it is not the choice of classification algorithm but the classification metric that is important for imbalanced datasets. AUC, F-measure, and log loss are often used for imbalanced datasets.
Ziyu Yao
Ziyu Yao, studied at Beijing University of Posts and Telecommunications
When I worked on social spam detection, I met the same problem. I think I can give some ideas:
1. For training step, you can perform under-sampling for unbalanced training dataset;
2. For the testing step, some measurements are fair enough to evaluate your model. In my experience, AUC, the lift chart and the F-measure can be very helpful. You can try them.
Hope my advice helps.
Have a look at down/up sampling methods and hybrid one like SMOTE

You can also play with the threshold (default=0.5) in order to reach a specific precision or specificity requirement, but it's more artificial.

NB: if your classes are not well balanced, don't choose accuracy as your evaluation metric (e.g. with Y={0:90%, 1:10%}, the silly predictor Y_hat=0 gets an accuracy of 90%). One could prefer the AUC (Area Under the ROC Curve).
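
Playing with the threshold can be sketched like this: scan the candidate thresholds and keep the lowest one whose precision still meets the requirement. The function name, the toy scores and the 0.75 target are assumptions, and in practice the threshold should be chosen on held-out data.

```python
def threshold_for_precision(y_true, scores, target):
    """Lowest threshold whose precision on (y_true, scores) meets the target, or None."""
    best = None
    for thr in sorted(set(scores), reverse=True):
        pred = [s >= thr for s in scores]
        tp = sum(p and t == 1 for p, t in zip(pred, y_true))
        fp = sum(p and t == 0 for p, t in zip(pred, y_true))
        # tp + fp >= 1 because thr is itself one of the scores.
        if tp / (tp + fp) >= target:
            best = thr
    return best

y_true = [0, 0, 0, 0, 1, 0, 1, 1]
scores = [0.1, 0.2, 0.3, 0.4, 0.45, 0.6, 0.7, 0.9]
print(threshold_for_precision(y_true, scores, target=0.75))  # 0.45
```

On this toy data the default threshold of 0.5 finds two of the three positives, while 0.45 finds all three at a precision of 0.75.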
I would recommend using an error metric that is not sensitive to the imbalance between the classes. My personal favorite is AUC.

See What is the best out-of-sample accuracy test for a logistic regression and why? for more on this.
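
To see why AUC is not fooled by imbalance, it helps to compute it from its definition: the probability that a randomly chosen positive gets a higher score than a randomly chosen negative. The sketch below does this by brute force over all pairs, which is fine for small data; the function name is an assumption.

```python
def roc_auc(y_true, scores):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# A model that gives every example the same score: 90% accuracy at
# threshold 0.5 on this data, but no ranking ability at all.
print(roc_auc([0] * 9 + [1], [0.0] * 10))             # 0.5
print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))   # 0.75
```

Because the value depends only on how positives rank against negatives, changing the class proportions does not change it.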
Quora User
Quora User, studied at Stanford University
I deal almost exclusively with this problem in the workplace.  Usually undersampling the 0 class (which I will assume, WLOG, is the larger one) is the right way to do this.
We encounter this scenario often in online advertising (specifically display advertising), where the conversion rate is really low. I have experimented with several methods, but the one that gave me the best results was a binary classifier (logistic regression or SVM) with appropriate weights on negative and positive instances.  There are several tricks to compute these weights, and they depend on the profit and cost associated with positive and negative examples respectively.

A good read on this topic is: Learning from Imbalanced Classes

Posted by uniqueone