# Training a decision tree against unbalanced data (2022)

78

$\begingroup$

This is an interesting and very frequent problem in classification - not just in decision trees but in virtually all classification algorithms.

As you found empirically, a training set consisting of different numbers of representatives from either class may result in a classifier that is biased towards the majority class. When applied to a test set that is similarly imbalanced, this classifier yields an optimistic accuracy estimate. In an extreme case, the classifier might assign every single test case to the majority class, thereby achieving an accuracy equal to the proportion of test cases belonging to the majority class. This is a well-known phenomenon in binary classification (and it extends naturally to multi-class settings).

This is an important issue, because an imbalanced dataset may lead to inflated performance estimates. This in turn may lead to false conclusions about the significance with which the algorithm has performed better than chance.

The machine-learning literature on this topic has essentially developed three solution strategies.

1. You can restore balance on the training set by undersampling the large class or by oversampling the small class, to prevent bias from arising in the first place.

2. Alternatively, you can modify the costs of misclassification, as noted in a previous response, again to prevent bias.

3. An additional safeguard is to replace the accuracy by the so-called balanced accuracy. It is defined as the arithmetic mean of the class-specific accuracies, $\phi := \frac{1}{2}\left(\pi^+ + \pi^-\right),$ where $\pi^+$ and $\pi^-$ represent the accuracy obtained on positive and negative examples, respectively. If the classifier performs equally well on either class, this term reduces to the conventional accuracy (i.e., the number of correct predictions divided by the total number of predictions). In contrast, if the conventional accuracy is above chance only because the classifier takes advantage of an imbalanced test set, then the balanced accuracy, as appropriate, will drop to chance.
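To make the third point concrete, here is a minimal sketch in plain Python. The labels are made up for illustration (90 negatives, 10 positives), and the "classifier" is the extreme case described above that assigns everything to the majority class. The helper names are my own, not from any library.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def balanced_accuracy(y_true, y_pred, positive=1):
    """Arithmetic mean of the accuracies on positive and negative examples."""
    pos = [p for t, p in zip(y_true, y_pred) if t == positive]
    neg = [p for t, p in zip(y_true, y_pred) if t != positive]
    acc_pos = sum(p == positive for p in pos) / len(pos)  # pi^+
    acc_neg = sum(p != positive for p in neg) / len(neg)  # pi^-
    return 0.5 * (acc_pos + acc_neg)


# Hypothetical test set: 90 negatives and 10 positives.
y_true = [0] * 90 + [1] * 10
# Degenerate classifier that always predicts the majority class.
y_pred = [0] * 100

acc = accuracy(y_true, y_pred)
bal_acc = balanced_accuracy(y_true, y_pred)
print(f"accuracy = {acc:.2f}, balanced accuracy = {bal_acc:.2f}")
# prints: accuracy = 0.90, balanced accuracy = 0.50
```

The conventional accuracy looks respectable only because of the class proportions, whereas the balanced accuracy sits exactly at chance.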

I would recommend considering at least two of the above approaches in conjunction. For example, you could oversample your minority class to prevent your classifier from acquiring a bias in favour of the majority class. Following this, when evaluating the performance of your classifier, you could replace the accuracy by the balanced accuracy. The two approaches are complementary. When applied together, they should help you both prevent your original problem and avoid false conclusions following from it.
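As a rough illustration of the oversampling half of that recipe, the sketch below resamples the minority class with replacement until both classes have the same size. The function name, the toy data and the fixed seed are assumptions made for the example. It is meant to be applied to the training set only, so that the test set keeps its natural class proportions.

```python
import random


def oversample_minority(X, y, seed=0):
    """Resample smaller classes with replacement up to the largest class size."""
    rng = random.Random(seed)
    by_class = {}
    for features, label in zip(X, y):
        by_class.setdefault(label, []).append(features)
    target = max(len(rows) for rows in by_class.values())
    X_bal, y_bal = [], []
    for label, rows in by_class.items():
        # Draw the missing rows at random from the existing rows of this class.
        extra = [rng.choice(rows) for _ in range(target - len(rows))]
        for features in rows + extra:
            X_bal.append(features)
            y_bal.append(label)
    return X_bal, y_bal


# Hypothetical training set: 95 negatives and 5 positives, one feature each.
X_train = [[float(i)] for i in range(100)]
y_train = [0] * 95 + [1] * 5

X_bal, y_bal = oversample_minority(X_train, y_train)
print(y_bal.count(0), y_bal.count(1))  # prints: 95 95
```

Note that the duplicated rows carry no new information; they only change how much weight the minority class receives during training.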

I would be happy to post some additional references to the literature if you would like to follow up on this.

answered May 8, 2012 at 20:11

Kay Brodersen

$\endgroup$

7

10

$\begingroup$

The following four ideas may help you tackle this problem.

1. Select an appropriate performance measure and then fine-tune the hyperparameters of your model (e.g. regularization) to attain satisfactory results on the cross-validation dataset. Once satisfied, test your model on the testing dataset. For these purposes, set apart 15% of your data for cross-validation and 15% for final testing. An established measure in machine learning, advocated by Andrew Ng, is the F1 statistic, defined as $F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$. Try to maximize this figure on the cross-validation dataset and make sure that the performance is stable on the testing dataset as well.

2. Use the 'prior' parameter in the decision tree to inform the algorithm of the prior frequency of the classes in the dataset, i.e. if there are 1,000 positives in a dataset of 1,000,000, set prior = c(0.001, 0.999) (in R).

3. Use the 'weights' argument in the classification function you use to severely penalize the algorithm for misclassifying the rare positive cases.

4. Use the 'cost' argument in some classification algorithms (e.g. rpart in R) to define relative costs for misclassifying positives and negatives. You should naturally set a high cost for misclassifying the rare class.
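For the first idea, the F1 statistic can be computed directly from the confusion counts. The sketch below uses plain Python and invented predictions (6 true positives, 4 false negatives, 3 false positives, 87 true negatives); the function name is my own.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Return precision, recall and F1 for the given positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical validation set: 10 positives followed by 90 negatives.
y_true = [1] * 10 + [0] * 90
# 6 positives found, 4 missed, 3 false alarms, 87 correct rejections.
y_pred = [1] * 6 + [0] * 4 + [1] * 3 + [0] * 87

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(f"precision = {precision:.3f}, recall = {recall:.3f}, F1 = {f1:.3f}")
# prints: precision = 0.667, recall = 0.600, F1 = 0.632
```

The accuracy on this toy example would be 93%, which says little about how well the rare class is handled; F1 ignores the 87 true negatives entirely.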

I am not in favor of oversampling, since it introduces dependent observations in the dataset and this violates assumptions of independence made both in Statistics and Machine Learning.

edited Apr 7, 2017 at 8:05

Ferdi

answered Apr 7, 2017 at 6:32

rf7

$\endgroup$

3

$\begingroup$

Adding to the first solution strategy in @Kay's answer: in my experience, the Synthetic Minority Oversampling Technique (SMOTE) usually does better than plain under- or oversampling, as I think it strikes a compromise between the two. It creates synthetic samples of the minority class from the data points in the multivariate predictor space: it more or less takes midpoints between adjacent minority points in that space to create new synthetic points, and hence balances the class sizes. (I am not sure about the midpoints; details of the algorithm here.)
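On the midpoint question: as far as I know, the published SMOTE algorithm places each synthetic point at a random position along the line segment between a minority point and one of its k nearest minority neighbours, rather than exactly at the midpoint. The following is a simplified sketch of that idea in plain Python, not a reference implementation; the function name, the toy points and the choice of k = 3 are assumptions.

```python
import math
import random


def smote_sketch(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority neighbours of the base point (excluding itself).
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, 0 <= gap < 1
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Hypothetical minority class with two numeric predictors.
minority = [[1.0, 1.0], [1.5, 2.0], [2.0, 1.0], [2.5, 2.5], [3.0, 1.5]]
new_points = smote_sketch(minority, n_new=10)
print(len(new_points))  # prints: 10
```

Because every synthetic point is a convex combination of two existing minority points, the new points stay inside the region already occupied by the minority class.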

answered Nov 9, 2017 at 15:10

Bharat Ram Ammu

$\endgroup$

2

$\begingroup$

I gave an answer in a recent topic:

What we do is pick a sample with different proportions. In the aforementioned example, that would be 1,000 "YES" cases and, for instance, 9,000 "NO" cases. This approach gives more stable models. However, it has to be tested on a real sample (the one with 1,000,000 rows).

Not only does this give more stable models, but the models are also generally better as far as measures such as lift are concerned.
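A scaled-down sketch of that sampling scheme, in plain Python: keep every "YES" row and draw nine "NO" rows for each of them. The function name, the row identifiers and the 10-in-10,000 toy data are assumptions chosen to keep the example small; the 1:9 ratio is the one from the example above.

```python
import random


def undersample_majority(rows, labels, minority_label, ratio=9, seed=0):
    """Keep all minority rows and at most `ratio` majority rows per minority row."""
    rng = random.Random(seed)
    minority = [r for r, l in zip(rows, labels) if l == minority_label]
    majority = [(r, l) for r, l in zip(rows, labels) if l != minority_label]
    n_keep = min(len(majority), ratio * len(minority))
    kept = rng.sample(majority, n_keep)  # sampling without replacement
    sample = [(r, minority_label) for r in minority] + kept
    rng.shuffle(sample)
    return [r for r, _ in sample], [l for _, l in sample]


# Hypothetical data: 10 "YES" rows among 10,000 rows in total.
rows = list(range(10_000))
labels = ["YES"] * 10 + ["NO"] * 9_990

train_rows, train_labels = undersample_majority(rows, labels, "YES")
print(train_labels.count("YES"), train_labels.count("NO"))  # prints: 10 90
```

The model is then fitted on `train_rows`, while the evaluation still runs on data with the original proportions.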

You can search for it as "oversampling in statistics"; the first result is pretty good: http://www.statssa.gov.za/isi2009/ScientificProgramme/IPMS/1621.pdf

edited Apr 13, 2017 at 12:44

CommunityBot

1

answered May 8, 2012 at 18:51

grotos

$\endgroup$

$\begingroup$

To follow up on the three approaches @Kay mentioned above: when dealing with unbalanced data, whether you use undersampling/oversampling or a weighted cost function, you are shifting your fit in the original feature space relative to the original data. So "undersampling/oversampling" and "weighted cost" are essentially the same in terms of result.

(I do not know how to ping @Kay.) I think what @Kay means by "balanced accuracy" is only a way of evaluating a model by a measurement; it has nothing to do with the model itself. However, in order to compute $\pi^+$ and $\pi^-$, you have to decide on a threshold value for the classification. I hope more detail can be provided on how to get the confusion matrix {40, 8, 5, 2}.

In real life, most of the cases I have met involve unbalanced data, so I choose the cutoff myself instead of using the default 0.5 suited to balanced data. I find it more realistic to use the F1 score mentioned by the other author to determine the threshold and to evaluate the model.
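A minimal sketch of that cutoff search in plain Python: try every observed score as a threshold and keep the one with the highest F1. The scores and labels below are invented, and the helper names are my own. In practice the search would be run on validation data, not on the final test set.

```python
def f1_at_threshold(y_true, scores, threshold):
    """F1 score when every score >= threshold is labelled positive."""
    y_pred = [1 if s >= threshold else 0 for s in scores]
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_threshold(y_true, scores):
    """Return the observed score that maximizes F1 when used as the cutoff."""
    candidates = sorted(set(scores))
    return max(candidates, key=lambda th: f1_at_threshold(y_true, scores, th))


# Hypothetical predicted probabilities for 12 cases, 3 of them positive.
scores = [0.02, 0.05, 0.08, 0.10, 0.12, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.60]
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1]

cutoff = best_threshold(y_true, scores)
print(cutoff)                                           # prints: 0.3
print(round(f1_at_threshold(y_true, scores, 0.5), 3))     # prints: 0.5
print(round(f1_at_threshold(y_true, scores, cutoff), 3))  # prints: 0.857
```

With the default cutoff of 0.5 only one of the three positives is found; lowering the cutoff to 0.3 finds all three at the price of one false positive.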

edited Apr 23, 2020 at 22:37

answered Apr 23, 2020 at 22:18

Stella

$\endgroup$

## FAQs

### Can a decision tree handle imbalanced data?

The decision tree algorithm is effective for balanced classification, although it does not perform well on imbalanced datasets. The split points of the tree are chosen to best separate examples into two groups with minimum mixing.

### How do you train on an imbalanced dataset?

Approach to deal with the imbalanced dataset problem
1. Choose Proper Evaluation Metric. The accuracy of a classifier is the total number of correct predictions by the classifier divided by the total number of predictions. ...
2. Resampling (Oversampling and Undersampling) ...
3. SMOTE. ...
4. BalancedBaggingClassifier. ...
5. Threshold moving.
Jun 21, 2021

### What are the 3 ways to handle an imbalanced dataset?

The following are a series of steps and decisions you can carry out in order to overcome the issues with an imbalanced dataset.
1. Can you collect more data. ...
2. Change Performance metric. ...
3. Try Different Algorithms. ...
4. Resample the Dataset. ...
5. Generate Synthetic samples. ...
Jun 21, 2020

### Which algorithm is best for unbalanced data?

A widely adopted and perhaps the most straightforward method for dealing with highly imbalanced datasets is called resampling. It consists of removing samples from the majority class (under-sampling) and/or adding more examples from the minority class (over-sampling).

### Is a decision tree sensitive to outliers?

Decision trees are also not sensitive to outliers since the partitioning happens based on the proportion of samples within the split ranges and not on absolute values.

### What is the class imbalance problem?

It is the problem in machine learning where the number of examples in one class (positive) is far smaller than the number of examples in another class (negative).

### How do you deal with imbalanced classification without rebalancing the data?

How To Deal With Imbalanced Classification, Without Re-balancing the Data
1. import numpy as np. import pandas as pd. ...
2. Xtrain, Xtest, ytrain, ytest = model_selection.train_test_split( ...
3. hardpredtst=gbc.predict(Xtest) ...
4. predtst=gbc.predict_proba(Xtest)[:,1] ...
5. hardpredtst_tuned_thresh = np.where(predtst >= 0.00035, 1, 0)

### How do you deal with unbalanced data in logistic regression?

In logistic regression, another technique comes in handy for working with an imbalanced class distribution: using class weights in accordance with the class distribution. A class weight is the extent to which the algorithm is penalized for a wrong prediction of that class.
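One common way to derive such weights is to make them inversely proportional to the class frequencies, using n_samples / (n_classes * class_count). To my knowledge this is the same heuristic scikit-learn applies for `class_weight="balanced"`. The sketch below computes it in plain Python on made-up labels; the function name is my own.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {c: n_samples / (n_classes * n) for c, n in counts.items()}


# Hypothetical labels: 900 negatives and 100 positives.
labels = [0] * 900 + [1] * 100
weights = balanced_class_weights(labels)
print({c: round(w, 3) for c, w in weights.items()})
# prints: {0: 0.556, 1: 5.0}
```

With these weights, each class contributes the same total weight to the loss (900 × 0.556 ≈ 100 × 5.0 = 500), so an error on a rare positive costs nine times as much as an error on a negative.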

### How do you balance your data?

7 Techniques to Handle Imbalanced Data
1. Use the right evaluation metrics. ...
2. Resample the training set. ...
3. Use K-fold Cross-Validation in the right way. ...
4. Ensemble different resampled datasets. ...
5. Resample with different ratios. ...
6. Cluster the abundant class. ...
7. Design your own models.
Jun 1, 2017

### What are possible steps that can be taken to overcome class imbalance?

Overcoming Class Imbalance using SMOTE Techniques
• Random Under-Sampling.
• Random Over-Sampling. ...
• Random under-sampling with imblearn.
• Random over-sampling with imblearn.
• Under-sampling: Tomek links.
• Synthetic Minority Oversampling Technique (SMOTE)
• NearMiss.
• Change the performance metric.
Jul 23, 2020

### How do you handle missing or corrupted data in a dataset?

1. Method 1 is deleting rows or columns. We usually use this method when it comes to empty cells. ...
2. Method 2 is replacing the missing data with aggregated values. ...
3. Method 3 is creating an unknown category. ...
4. Method 4 is predicting missing values.

### How do you handle an imbalanced dataset in text classification?

The simplest way to fix an imbalanced dataset is to balance it by oversampling instances of the minority class or undersampling instances of the majority class. Advanced techniques such as SMOTE (Synthetic Minority Over-sampling Technique) create new synthetic instances of the minority class.

### How do you classify unbalanced data?

Another way to describe the imbalance of classes in a dataset is to summarize the class distribution as percentages of the training dataset. For example, an imbalanced multiclass classification problem may have 80 percent examples in the first class, 18 percent in the second class, and 2 percent in a third class.

### Is gradient boosting good for imbalanced data?

It is fine to use a gradient boosting machine when dealing with an imbalanced dataset. With a strongly imbalanced dataset, it is much more relevant to question the suitability of the metric used.

### How do you balance an imbalanced image dataset?

One of the basic approaches to dealing with imbalanced datasets is data augmentation and re-sampling. There are two types of re-sampling: under-sampling, where we remove data from the majority class, and over-sampling, where we add repeated data to the minority class.

### How do decision trees deal with outliers?

Decision trees are not sensitive to noisy data or outliers, since extreme values never cause much reduction in the residual sum of squares (RSS) because they are never involved in the split.

### How does a decision tree handle missing attribute values?

Different decision tree algorithms use different methods. Simply ignoring the missing values (as ID3 and other old algorithms do) or treating the missing values as another category (in the case of a nominal feature) does not really handle missing values.

### What strategies can help reduce overfitting in decision trees?

Pruning refers to removing parts of the decision tree to prevent it from growing to its full depth. By tuning the hyperparameters of the decision tree model, one can prune the tree and prevent it from overfitting. There are two types of pruning: pre-pruning and post-pruning.

### What is an unbalanced data set?

In simple terms, an unbalanced dataset is one in which the target variable has more observations in one specific class than the others. For example, let's suppose that we have a dataset used to detect a fraudulent transaction.

### Why is imbalanced data a problem?

It is a problem typically because data is hard or expensive to collect and we often collect and work with a lot less data than we might prefer. As such, this can dramatically impact our ability to gain a large enough or representative sample of examples from the minority class.

### What is the impact of using class imbalanced training data samples?

Data imbalance occurs when the sample size of one class is much smaller or larger than that of another class. The performance of predictive models is greatly affected when the dataset is highly imbalanced and the sample size increases. Overall, imbalanced training data can have a major negative impact on performance.

### How can you reduce false negatives in classification?

To minimize the number of False Negatives (FN) or False Positives (FP) we can also retrain a model on the same data with slightly different output values more specific to its previous results. This method involves taking a model and training it on a dataset until it optimally reaches a global minimum.

### Why is accuracy not good for an imbalanced dataset?

Even when a model fails to predict any crashes, its accuracy is still 90%, because 90% of the data are labelled "Landed Safely". So accuracy does not hold up well for imbalanced data. In business scenarios, most data will not be balanced, so accuracy becomes a poor measure for evaluating a classification model.

### What's the difference between imbalanced and unbalanced?

In common usage, imbalance is the noun meaning the state of being not balanced, while unbalance is the verb meaning to cause the loss of balance. In the context stated, the noun form should be used.

### Is logistic regression good for unbalanced data?

Logistic regression does not support imbalanced classification directly. Instead, the training algorithm used to fit the logistic regression model must be modified to take the skewed distribution into account.

### Does data need to be balanced for logistic regression?

Logistic regression requires a dependent variable in binary form, i.e. 0 and 1. A balanced sample means that if you have thirty 0s, you also need thirty 1s. However, logistic regression imposes no such condition.

### How can I improve my F1 score with skewed classes?

How to improve F1 score for classification
1. StandardScaler()
2. GridSearchCV for Hyperparameter Tuning.
3. Recursive Feature Elimination(for feature selection)
4. SMOTE(the dataset is imbalanced so I used SMOTE to create new examples from existing examples)
Jul 1, 2020

## Videos

1. Handling Class Imbalance Problem in R: Improving Predictive Model Performance
(Dr. Bharatendra Rai)
2. SMOTE (Synthetic Minority Oversampling Technique) for Handling Imbalanced Datasets
(Bhavesh Bhatt)
3. Tutorial 45-Handling imbalanced Dataset using python- Part 1
(Krish Naik)
4. This is why you should care about unbalanced data .. as a data scientist
(ritvikmath)
5. Wayfair Data Science Explains It All: Handling Imbalanced Data
(Wayfair Data Science)
6. Handling Imbalanced Data in machine learning classification (Python) - 1
(Lianne and Justin)

Article information

Last Updated: 07/27/2022
