Handling Imbalanced Data for Classification

Last Updated : 02 Jan, 2024

Handling imbalanced data is a key part of machine learning classification tasks. Imbalanced data is characterized by a skewed class distribution in which one class considerably outnumbers the others. The difficulty this poses is that models tend to become biased towards the majority class, optimizing overall accuracy at the expense of correctly recognizing instances of the minority classes.

This problem can be addressed with specialized strategies such as resampling (oversampling the minority class or undersampling the majority class), using evaluation metrics suited to imbalance (precision, recall, F1-score), and applying algorithms designed to work with imbalanced datasets.

What is Imbalanced Data and How to handle it?

Imbalanced data pertains to datasets where the distribution of observations across the target classes is uneven. In other words, one class label has a significantly higher number of observations, while the other has a notably lower count.

When one class greatly outnumbers the others in a classification dataset, the data is imbalanced. Machine learning models may become biased in their predictions as a result, favoring the majority class. Resampling techniques, such as oversampling the minority class or undersampling the majority class, are used to remedy this.

Furthermore, model performance can be evaluated more meaningfully by replacing accuracy with metrics such as precision, recall, or F1-score. Specialized techniques such as ensemble approaches and synthetic data generation can further improve the handling of imbalanced datasets, producing more reliable and equitable predictions.

Problem with Handling Imbalanced Data for Classification

  • Algorithms may become biased towards the majority class and thus tend to predict the majority class for most inputs.
  • Minority class observations can look like noise to the model and may be ignored.
  • An imbalanced dataset yields a misleading accuracy score: a model that always predicts the majority class can still score highly.

Ways to handle Imbalanced Data for Classification

Addressing imbalanced data in classification is crucial for fair model performance. Techniques include resampling (oversampling or undersampling), synthetic data generation, specialized algorithms, and alternative evaluation metrics. Implementing these strategies helps produce more accurate and unbiased predictions across all classes.

1. Different Evaluation Metric

Classifier accuracy is calculated by dividing the number of correct predictions by the total number of predictions; it is suitable for balanced classes but misleading on imbalanced datasets. Precision gauges how many of the instances predicted for a class actually belong to it, while recall assesses how many instances of a class the classifier correctly identifies. On imbalanced datasets, the F1 score emerges as a preferred metric, striking a balance between precision and recall and providing a more informative evaluation of a classifier's performance. It is the harmonic mean of precision and recall:

F1 = 2 \times \frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}}

Precision and F1 score both decrease when the classifier incorrectly predicts the minority class, increasing the number of false positives. Recall and F1 score also drop if the classifier has trouble accurately identifying the minority class, leading to more false negatives. The F1 score therefore improves only when both the quantity and the correctness of the minority-class predictions improve.

The F1 score is essentially a summary statistic that captures the trade-off between precision and recall, which is critical for assessing classifier performance on imbalanced datasets.
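
As a quick illustration, scikit-learn computes these metrics directly. This is a minimal sketch with made-up labels, unrelated to the datasets used later in this article:

Python3
from sklearn.metrics import precision_score, recall_score, f1_score

# Toy labels for an imbalanced problem: class 1 is the rare class
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]  # one TP, one FP, one FN

print("Precision:", precision_score(y_true, y_pred))  # 1 / (1 + 1) = 0.5
print("Recall:", recall_score(y_true, y_pred))        # 1 / (1 + 1) = 0.5
print("F1 score:", f1_score(y_true, y_pred))          # harmonic mean = 0.5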

2. Resampling (Undersampling and Oversampling)

This method adjusts the balance between the minority and majority classes through upsampling or downsampling. Oversampling randomly duplicates minority-class rows (sampling with replacement) until the classes are balanced. Conversely, undersampling randomly removes rows from the majority class until its size matches that of the minority class.

This sampling approach yields a balanced dataset, ensuring comparable representation for both majority and minority classes. Achieving a similar number of records for both classes in the dataset signifies that the classifier will assign equal importance to each class during training.

Python3
import numpy as np
from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler
from collections import Counter

# Create an imbalanced dataset
X, y = make_classification(n_classes=2, class_sep=2, weights=[0.1, 0.9],
                           n_informative=3, n_redundant=1, flip_y=0,
                           n_features=20, n_clusters_per_class=1,
                           n_samples=1000, random_state=42)

print("Original class distribution:", Counter(y))

# Oversampling using RandomOverSampler
oversample = RandomOverSampler(sampling_strategy='minority')
X_over, y_over = oversample.fit_resample(X, y)
print("Oversampled class distribution:", Counter(y_over))

# Undersampling using RandomUnderSampler
undersample = RandomUnderSampler(sampling_strategy='majority')
X_under, y_under = undersample.fit_resample(X, y)
print("Undersampled class distribution:", Counter(y_under))

Output:

Original class distribution: Counter({1: 900, 0: 100})
Oversampled class distribution: Counter({1: 900, 0: 900})
Undersampled class distribution: Counter({0: 100, 1: 100})

3. BalancedBaggingClassifier

When dealing with imbalanced datasets, traditional classifiers tend to favor the majority class, neglecting the minority class due to its lower representation. The BalancedBaggingClassifier, provided by the imbalanced-learn library as a wrapper around sklearn classifiers, addresses this imbalance by incorporating additional balancing during training. It exposes parameters like "sampling_strategy", which determines the type of resampling (e.g., 'majority' to resample only the majority class, 'all' to resample all classes), and "replacement", which dictates whether the sampling occurs with or without replacement. This classifier ensures a more equitable treatment of classes, which is particularly beneficial when handling imbalanced datasets.

Importing Libraries

Python3
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from imblearn.ensemble import BalancedBaggingClassifier
from sklearn.metrics import accuracy_score, classification_report

This code demonstrates the usage of a BalancedBaggingClassifier from the imbalanced-learn library to handle imbalanced datasets. It creates an imbalanced dataset, splits it, and trains a Random Forest classifier with balanced bagging, assessing the model's performance through accuracy and a classification report.

Creating imbalanced dataset and splitting

Python3
# Create an imbalanced dataset
X, y = make_classification(n_classes=2, class_sep=2, weights=[0.1, 0.9],
                           n_informative=3, n_redundant=1, flip_y=0,
                           n_features=20, n_clusters_per_class=1,
                           n_samples=1000, random_state=42)

# Split the dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

This code creates a two-class, imbalanced dataset, divides it into training and testing sets, and uses a fixed random state to guarantee reproducibility. The dataset has 20 features, and the minority class has a weight of 0.1, indicating a notable class imbalance.

Creating a random forest classifier

Python3
# Create a Random Forest Classifier (you can use any classifier)
base_classifier = RandomForestClassifier(random_state=42)

This initializes a Random Forest classifier with a fixed random state to serve as the base classifier in the subsequent steps. The random state guarantees reproducible model training.

Creating a balanced bagging classifier

Python3
# Create a BalancedBaggingClassifier
balanced_bagging_classifier = BalancedBaggingClassifier(
    base_classifier,
    sampling_strategy='auto',  # You can adjust this parameter
    replacement=False,         # Whether to sample with or without replacement
    random_state=42)

This code creates a BalancedBaggingClassifier around the previously defined RandomForestClassifier. Options such as "sampling_strategy" and "replacement" control how each bootstrap sample is balanced, and a random state is set for reproducibility.

Fitting the model and making predictions

Python3
# Fit the model
balanced_bagging_classifier.fit(X_train, y_train)

# Make predictions
y_pred = balanced_bagging_classifier.predict(X_test)

This code uses the training data (X_train, y_train) to train the BalancedBaggingClassifier. It then predicts labels for the test data (X_test), storing the results in the variable y_pred.

Evaluation

Python3
# Evaluate the performance
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))

Output:

Accuracy: 1.0
Classification Report:
               precision    recall  f1-score   support

           0        1.00      1.00      1.00        13
           1        1.00      1.00      1.00       187

    accuracy                            1.00       200
   macro avg        1.00      1.00      1.00       200
weighted avg        1.00      1.00      1.00       200

This code computes and prints the balanced bagging classifier's accuracy on the test set, along with a classification report detailing each class's precision, recall, and F1-score.

4. SMOTE

The Synthetic Minority Oversampling Technique (SMOTE) addresses imbalanced datasets by synthetically generating new instances for the minority class. Unlike simple duplication of records, SMOTE enhances diversity by creating artificial instances. In simpler terms, SMOTE looks at a minority-class instance, selects one of its k nearest minority-class neighbors at random, and generates a synthetic instance at a random point between the two in feature space.

Python3
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE
from collections import Counter

# Create an imbalanced dataset
X, y = make_classification(n_classes=2, class_sep=2, weights=[0.1, 0.9],
                           n_informative=3, n_redundant=1, flip_y=0,
                           n_features=20, n_clusters_per_class=1,
                           n_samples=1000, random_state=42)

# Split the dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Display class distribution before SMOTE
print("Class distribution before SMOTE:", Counter(y_train))

# Apply SMOTE to oversample the minority class
smote = SMOTE(sampling_strategy='auto', random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)

# Display class distribution after SMOTE
print("Class distribution after SMOTE:", Counter(y_train_resampled))

Output:

Class distribution before SMOTE: Counter({1: 713, 0: 87})
Class distribution after SMOTE: Counter({1: 713, 0: 713})

This code demonstrates how to rectify class imbalance in a dataset using SMOTE. An imbalanced dataset is first produced, with about 10% of the data belonging to the minority class. After the data is divided into training and testing sets, the class distribution before SMOTE is printed. The minority class is then oversampled using SMOTE to produce synthetic instances, and the class distribution after SMOTE is printed, showing an equal representation of both classes in the resampled training data. Note that SMOTE is applied only to the training set, leaving the test set untouched.

5. Threshold Moving

In classifiers, predictions are often expressed as probabilities of class membership. The conventional threshold for assigning predictions to classes is typically set at 0.5. However, in the context of imbalanced class problems, this default threshold may not yield optimal results. To enhance classifier performance, it is essential to adjust the threshold to a value that effectively discriminates between the two classes.

Techniques such as ROC Curves and Precision-Recall Curves are employed to identify the optimal threshold. Additionally, grid search methods or exploration within a specified range of values can be utilized to pinpoint the most suitable threshold for the classifier.
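
For instance, the ROC curve already enumerates candidate thresholds, and one common heuristic (a minimal sketch, not the only criterion) is to pick the threshold that maximizes Youden's J statistic, i.e., TPR - FPR:

Python3
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_curve
from sklearn.model_selection import train_test_split

# Same style of synthetic imbalanced dataset as in the earlier examples
X, y = make_classification(n_classes=2, weights=[0.1, 0.9],
                           n_samples=1000, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = RandomForestClassifier(random_state=42).fit(X_train, y_train)
y_proba = model.predict_proba(X_test)[:, 1]

# roc_curve returns one candidate threshold per operating point;
# Youden's J = TPR - FPR selects the point farthest above the chance line
fpr, tpr, thresholds = roc_curve(y_test, y_proba)
best = np.argmax(tpr - fpr)
print("Threshold maximizing Youden's J:", thresholds[best])

Alternatively, the threshold can simply be swept over a grid while tracking a metric such as the F1 score, as the following example does.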

Python3
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

# Create an imbalanced dataset
X, y = make_classification(n_classes=2, class_sep=2, weights=[0.1, 0.9],
                           n_informative=3, n_redundant=1, flip_y=0,
                           n_features=20, n_clusters_per_class=1,
                           n_samples=1000, random_state=42)

# Split the dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Train a classification model (Random Forest as an example)
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

# Predict probabilities on the test set
y_proba = model.predict_proba(X_test)[:, 1]

# Set a threshold (initially 0.5)
threshold = 0.5

# Adjust threshold based on your criteria (e.g., maximizing F1-score)
while threshold >= 0:
    y_pred = (y_proba >= threshold).astype(int)
    f1 = f1_score(y_test, y_pred)

    print(f"Threshold: {threshold:.2f}, F1 Score: {f1:.4f}")

    # Move the threshold (you can customize the step size)
    threshold -= 0.02

Output:

Threshold: 0.50, F1 Score: 1.0000
Threshold: 0.48, F1 Score: 1.0000
Threshold: 0.46, F1 Score: 1.0000
Threshold: 0.44, F1 Score: 1.0000
Threshold: 0.42, F1 Score: 1.0000
Threshold: 0.40, F1 Score: 1.0000
Threshold: 0.38, F1 Score: 1.0000
Threshold: 0.36, F1 Score: 1.0000
Threshold: 0.34, F1 Score: 1.0000
Threshold: 0.32, F1 Score: 1.0000
Threshold: 0.30, F1 Score: 1.0000
Threshold: 0.28, F1 Score: 0.9973
Threshold: 0.26, F1 Score: 0.9973
Threshold: 0.24, F1 Score: 0.9973
Threshold: 0.22, F1 Score: 0.9947
Threshold: 0.20, F1 Score: 0.9947
Threshold: 0.18, F1 Score: 0.9947
Threshold: 0.16, F1 Score: 0.9920
Threshold: 0.14, F1 Score: 0.9920
Threshold: 0.12, F1 Score: 0.9894
Threshold: 0.10, F1 Score: 0.9842
Threshold: 0.08, F1 Score: 0.9740
Threshold: 0.06, F1 Score: 0.9664
Threshold: 0.04, F1 Score: 0.9664
Threshold: 0.02, F1 Score: 0.9664

6. Using Tree-Based Models

The hierarchical structure of tree-based models, such as Decision Trees, Random Forests, and Gradient Boosted Trees, allows them to handle imbalanced datasets better than non-tree-based models; a short weighted-forest sketch follows the list below.

  • Decision Trees: Decision trees build a tree-like structure by partitioning the feature space into regions according to feature values. Because their decision boundaries adapt to the data, they can pick up minority-class patterns, though they are prone to overfitting.
  • Random Forests: Random Forests consist of many Decision Trees, each trained on random subsets of the data and features. By combining numerous trees, they reduce overfitting, improve generalization, and are more robust against imbalanced datasets.
  • Gradient Boosted Trees: Gradient Boosted Trees are built sequentially, with each new tree correcting the errors of the previous ones. This sequential focus on misclassified instances makes them perform well in imbalanced settings, although they can be sensitive to noise.
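
Beyond their structure, scikit-learn's tree models also expose a class_weight parameter that reweights training so that minority-class errors cost more. A minimal sketch, using the same kind of synthetic data as the earlier examples (class_weight='balanced' is one illustrative choice among several weighting schemes):

Python3
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

X, y = make_classification(n_classes=2, weights=[0.1, 0.9],
                           n_samples=1000, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# class_weight='balanced' scales each class inversely to its frequency,
# so errors on the minority class are penalized more heavily
clf = RandomForestClassifier(class_weight='balanced', random_state=42)
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))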

7. Using Anomaly Detection Algorithms

  • Anomaly or outlier detection algorithms are 'one-class classification' algorithms that help identify outliers (rare data points) in a dataset.
  • In an imbalanced dataset, treat the majority class records as 'normal' data and the minority class records as 'outlier' data.
  • These algorithms are trained only on the normal data.
  • A trained model can then predict whether a new record is normal or an outlier, as in the sketch below.
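
A minimal sketch of this idea using scikit-learn's IsolationForest as the one-class detector (one possible choice among several, such as OneClassSVM); the majority class plays the role of "normal" data, and the contamination value is an illustrative choice that roughly matches the minority share here:

Python3
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import IsolationForest

# Synthetic imbalanced data: class 1 is the majority ("normal") class
X, y = make_classification(n_classes=2, weights=[0.1, 0.9],
                           n_samples=1000, random_state=42)

# Train the detector only on the majority-class records
detector = IsolationForest(contamination=0.1, random_state=42)
detector.fit(X[y == 1])

# predict() returns +1 for inliers (normal) and -1 for outliers (minority)
pred = detector.predict(X)
print("Records flagged as outliers:", np.sum(pred == -1))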
