Handling Imbalanced Data for Classification

Last Updated : 02 Jan, 2024

Handling imbalanced data is a key part of machine learning classification tasks. An imbalanced dataset has a skewed class distribution, with one class heavily overrepresented relative to the others. The difficulty is that models trained on such data tend to become biased towards the majority class: they can achieve high overall accuracy while rarely recognizing instances of the minority classes.

This problem can be addressed with specialized strategies such as resampling (oversampling the minority class or undersampling the majority class), evaluation metrics better suited to imbalance (F1-score, precision, recall), and algorithms designed to work well on imbalanced datasets.

What is Imbalanced Data and How to handle it?

Imbalanced data refers to datasets where the distribution of observations across the target classes is uneven: one class label has a significantly higher number of observations, while the other has a notably lower count.

Because the majority class dominates the training signal, machine learning models fitted to such data tend to favor it in their predictions. Resampling techniques, such as oversampling the minority class or undersampling the majority class, are used to remedy this.

Furthermore, model performance can be assessed more faithfully by replacing accuracy with other evaluation metrics, such as precision, recall, or the F1-score. Specialized techniques such as ensemble approaches and synthetic data generation can further improve the handling of imbalanced datasets, yielding more reliable and equitable predictions.

Problem with Handling Imbalanced Data for Classification

  • Algorithms may become biased towards the majority class and thus tend to predict the majority class for most inputs.
  • Observations from the minority class can look like noise to the model and end up being ignored.
  • An imbalanced dataset makes the accuracy score misleading.

Ways to handle Imbalanced Data for Classification

Addressing imbalanced data in classification is crucial for fair model performance. Techniques include resampling (oversampling or undersampling), synthetic data generation, specialized algorithms, and alternative evaluation metrics. Implementing these strategies ensures more accurate and unbiased predictions across all classes.

1. Different Evaluation Metric

Classifier accuracy is calculated by dividing the number of correct predictions by the total number of predictions; it works well for balanced classes but is much less informative on imbalanced datasets. Precision gauges how often the classifier is right when it predicts a specific class, while recall assesses its ability to correctly identify all instances of that class. On imbalanced datasets the F1 score emerges as the preferred metric: it is the harmonic mean of precision and recall, striking a balance between the two and providing a more comprehensive evaluation of a classifier’s performance.

F1 = 2 × (precision × recall) / (precision + recall)

Precision and the F1 score both decrease when the classifier incorrectly predicts the minority class, increasing the number of false positives. Recall and the F1 score also drop when the classifier has trouble correctly identifying the minority class, leading to more false negatives. The F1 score therefore improves only when both precision and recall improve.

The F1 score is thus a single statistic that captures the trade-off between precision and recall, which is critical for assessing classifier performance on imbalanced datasets.
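As a quick illustration of why accuracy alone can mislead, the following minimal sketch scores a small set of hypothetical predictions (the labels here are made up purely for demonstration) with scikit-learn's metric functions:

Python3

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Hypothetical labels: class 1 is the rare (minority) class
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

print("Accuracy :", accuracy_score(y_true, y_pred))   # 0.8 - looks good
print("Precision:", precision_score(y_true, y_pred))  # 0.5
print("Recall   :", recall_score(y_true, y_pred))     # 0.5
print("F1 score :", f1_score(y_true, y_pred))         # 0.5 - exposes the weakness

Despite 80% accuracy, the classifier finds only half of the minority instances, which the precision, recall, and F1 values make visible.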

2. Resampling (Undersampling and Oversampling)

This method adjusts the balance between the minority and majority classes by resampling. Oversampling draws additional samples, with replacement, from the minority class until it matches the size of the majority class; undersampling instead randomly removes rows from the majority class until it matches the size of the minority class.

Either approach yields a balanced dataset with comparable representation for the majority and minority classes, so the classifier assigns similar importance to each class during training. In practice, resampling should be applied only to the training split: resampling before the train/test split can leak duplicated or synthetic records into the test set and inflate evaluation scores.

Python3

import numpy as np
from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler
from collections import Counter
 
# Create an imbalanced dataset
X, y = make_classification(n_classes=2, class_sep=2, weights=[0.1, 0.9],
                           n_informative=3, n_redundant=1, flip_y=0,
                           n_features=20, n_clusters_per_class=1,
                           n_samples=1000, random_state=42)
 
print("Original class distribution:", Counter(y))
 
# Oversampling using RandomOverSampler
oversample = RandomOverSampler(sampling_strategy='minority')
X_over, y_over = oversample.fit_resample(X, y)
print("Oversampled class distribution:", Counter(y_over))
 
 
# Undersampling using RandomUnderSampler
undersample = RandomUnderSampler(sampling_strategy='majority')
X_under, y_under = undersample.fit_resample(X, y)
print("Undersampled class distribution:", Counter(y_under))
                      
                       

Output:

Original class distribution: Counter({1: 900, 0: 100})
Oversampled class distribution: Counter({1: 900, 0: 900})
Undersampled class distribution: Counter({0: 100, 1: 100})

3. BalancedBaggingClassifier

When dealing with imbalanced datasets, traditional classifiers tend to favor the majority class, neglecting the minority class due to its lower representation. The BalancedBaggingClassifier, imbalanced-learn's variant of scikit-learn's BaggingClassifier, addresses this by adding a balancing step to each bootstrap sample during training. It exposes parameters such as “sampling_strategy”, which determines the type of resampling (e.g., ‘majority’ to resample only the majority class, ‘all’ to resample all classes), and “replacement”, which dictates whether sampling occurs with or without replacement. This classifier ensures a more equitable treatment of classes, which is particularly beneficial when handling imbalanced datasets.

Importing Libraries

Python3

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from imblearn.ensemble import BalancedBaggingClassifier
from sklearn.metrics import accuracy_score, classification_report

This code demonstrates the usage of a BalancedBaggingClassifier from the imbalanced-learn library to handle imbalanced datasets. It creates an imbalanced dataset, splits it, and trains a Random Forest classifier with balanced bagging, assessing the model’s performance through accuracy and a classification report.

Creating imbalanced dataset and splitting

Python3

# Create an imbalanced dataset
X, y = make_classification(n_classes=2, class_sep=2, weights=[0.1, 0.9],
                           n_informative=3, n_redundant=1, flip_y=0,
                           n_features=20, n_clusters_per_class=1,
                           n_samples=1000, random_state=42)
 
# Split the dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

This code creates a two-class imbalanced dataset, divides it into training and testing sets, and fixes the random state to guarantee reproducibility. The final dataset has 20 features, and the minority class has a weight of 0.1, giving a notable class imbalance.

Creating a random forest classifier

Python3

# Create a Random Forest Classifier (you can use any classifier)
base_classifier = RandomForestClassifier(random_state=42)

This step initializes a Random Forest classifier as the base classifier for the subsequent steps; fixing the random state guarantees reproducible model training.

Creating a balanced bagging classifier

Python3

# Create a BalancedBaggingClassifier
balanced_bagging_classifier = BalancedBaggingClassifier(base_classifier,
                                                        sampling_strategy='auto',  # You can adjust this parameter
                                                        replacement=False,  # Whether to sample with or without replacement
                                                        random_state=42)

This code builds a BalancedBaggingClassifier around the RandomForestClassifier defined above. Options such as “sampling_strategy” and “replacement” are supplied to address class imbalance, and a random state is set for reproducibility.

Fitting the model and making predictions

Python3

# Fit the model
balanced_bagging_classifier.fit(X_train, y_train)
 
# Make predictions
y_pred = balanced_bagging_classifier.predict(X_test)

This code trains the BalancedBaggingClassifier on the training data (X_train, y_train), then predicts labels for the test data (X_test), storing the results in the variable y_pred.

Evaluation

Python3

# Evaluate the performance
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))
                      
                       

Output:

Accuracy: 1.0
Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00        13
           1       1.00      1.00      1.00       187

    accuracy                           1.00       200
   macro avg       1.00      1.00      1.00       200
weighted avg       1.00      1.00      1.00       200

This code computes and prints the balanced bagging classifier’s accuracy on the test set, followed by a full classification report listing each class’s precision, recall, and F1-score. The perfect scores here are an artifact of the synthetic dataset: class_sep=2 makes the two classes almost trivially separable.

4. SMOTE

The Synthetic Minority Oversampling Technique (SMOTE) addresses imbalanced datasets by synthetically generating new instances of the minority class. Unlike simple duplication of records, SMOTE adds diversity: for an instance of the minority class, it selects one of its k nearest minority-class neighbors at random and generates a synthetic instance at a random point along the line segment between the two in feature space.

Python3

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE
from collections import Counter
 
# Create an imbalanced dataset
X, y = make_classification(n_classes=2, class_sep=2, weights=[0.1, 0.9],
                           n_informative=3, n_redundant=1, flip_y=0,
                           n_features=20, n_clusters_per_class=1,
                           n_samples=1000, random_state=42)
 
# Split the dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
 
# Display class distribution before SMOTE
print("Class distribution before SMOTE:", Counter(y_train))
 
# Apply SMOTE to oversample the minority class
smote = SMOTE(sampling_strategy='auto', random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)
 
# Display class distribution after SMOTE
print("Class distribution after SMOTE:", Counter(y_train_resampled))
                      
                       

Output:

Class distribution before SMOTE: Counter({1: 713, 0: 87})
Class distribution after SMOTE: Counter({1: 713, 0: 713})

This code demonstrates how to rectify class imbalance using SMOTE. An imbalanced dataset is generated with roughly 10% of the samples in the minority class, and after splitting into training and test sets, the class distribution of the training data is printed. SMOTE is then applied to oversample the minority class with synthetic instances, and the class distribution is printed again, now showing equal representation of both classes in the resampled training data.

5. Threshold Moving

In classifiers, predictions are often expressed as probabilities of class membership. The conventional threshold for assigning predictions to classes is typically set at 0.5. However, in the context of imbalanced class problems, this default threshold may not yield optimal results. To enhance classifier performance, it is essential to adjust the threshold to a value that efficiently discriminates between the two classes.

Techniques such as ROC Curves and Precision-Recall Curves are employed to identify the optimal threshold. Additionally, grid search methods or exploration within a specified range of values can be utilized to pinpoint the most suitable threshold for the classifier.

Python3

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
 
# Create an imbalanced dataset
X, y = make_classification(n_classes=2, class_sep=2, weights=[0.1, 0.9],
                           n_informative=3, n_redundant=1, flip_y=0,
                           n_features=20, n_clusters_per_class=1,
                           n_samples=1000, random_state=42)
 
# Split the dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
 
# Train a classification model (Random Forest as an example)
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)
 
# Predict probabilities on the test set
y_proba = model.predict_proba(X_test)[:, 1]
 
# Set a threshold (initially 0.5)
threshold = 0.5
 
# Adjust threshold based on your criteria (e.g., maximizing F1-score)
while threshold >= 0:
    y_pred = (y_proba >= threshold).astype(int)
    f1 = f1_score(y_test, y_pred)
 
    print(f"Threshold: {threshold:.2f}, F1 Score: {f1:.4f}")
 
    # Move the threshold (you can customize the step size)
    threshold -= 0.02

Output:

Threshold: 0.50, F1 Score: 1.0000
Threshold: 0.48, F1 Score: 1.0000
Threshold: 0.46, F1 Score: 1.0000
Threshold: 0.44, F1 Score: 1.0000
Threshold: 0.42, F1 Score: 1.0000
Threshold: 0.40, F1 Score: 1.0000
Threshold: 0.38, F1 Score: 1.0000
Threshold: 0.36, F1 Score: 1.0000
Threshold: 0.34, F1 Score: 1.0000
Threshold: 0.32, F1 Score: 1.0000
Threshold: 0.30, F1 Score: 1.0000
Threshold: 0.28, F1 Score: 0.9973
Threshold: 0.26, F1 Score: 0.9973
Threshold: 0.24, F1 Score: 0.9973
Threshold: 0.22, F1 Score: 0.9947
Threshold: 0.20, F1 Score: 0.9947
Threshold: 0.18, F1 Score: 0.9947
Threshold: 0.16, F1 Score: 0.9920
Threshold: 0.14, F1 Score: 0.9920
Threshold: 0.12, F1 Score: 0.9894
Threshold: 0.10, F1 Score: 0.9842
Threshold: 0.08, F1 Score: 0.9740
Threshold: 0.06, F1 Score: 0.9664
Threshold: 0.04, F1 Score: 0.9664
Threshold: 0.02, F1 Score: 0.9664
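As an alternative to sweeping thresholds in a loop, candidate thresholds can be read directly off the precision-recall curve and the one maximizing F1 selected. This is a minimal sketch (reusing y_test and y_proba from the snippet above), not part of the original walkthrough:

Python3

import numpy as np
from sklearn.metrics import precision_recall_curve

# precision/recall have one more entry than thresholds, so drop the last pair
precision, recall, thresholds = precision_recall_curve(y_test, y_proba)
f1_scores = 2 * precision[:-1] * recall[:-1] / (precision[:-1] + recall[:-1] + 1e-12)

best_idx = np.argmax(f1_scores)
print(f"Best threshold: {thresholds[best_idx]:.3f}, F1: {f1_scores[best_idx]:.4f}")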

6. Using Tree Based Models

The hierarchical structure of tree-based models, such as Decision Trees, Random Forests, and Gradient Boosted Trees, allows them to handle imbalanced datasets better than many non-tree-based models (a class-weighting sketch follows the list below).

  • Decision Trees: Decision Trees split the feature space into regions based on feature values, building a tree-like structure. They can adapt their decision boundaries to capture minority-class patterns, but they are prone to overfitting.
  • Random Forests: Random Forests combine many Decision Trees, each trained on random subsets of the data and features. Aggregating many trees reduces overfitting and improves robustness to imbalanced datasets.
  • Gradient Boosted Trees: Gradient Boosted Trees are built sequentially, with each new tree correcting the errors of the previous ones. This sequential focus on misclassified instances helps them perform well in imbalanced settings, although they can be sensitive to noise.
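Scikit-learn's tree-based classifiers also expose a class_weight parameter; setting it to 'balanced' reweights classes inversely to their frequency, a lightweight complement to the resampling techniques above. A minimal sketch, reusing the same synthetic dataset as the earlier sections:

Python3

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

X, y = make_classification(n_classes=2, class_sep=2, weights=[0.1, 0.9],
                           n_informative=3, n_redundant=1, flip_y=0,
                           n_features=20, n_clusters_per_class=1,
                           n_samples=1000, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# 'balanced' weighs each class inversely to its frequency in y_train
clf = RandomForestClassifier(class_weight='balanced', random_state=42)
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))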

7. Using Anomaly Detection Algorithms

  • Anomaly or outlier detection algorithms are ‘one-class classification algorithms’ that help identify outliers (rare data points) in a dataset.
  • In an imbalanced dataset, treat the majority-class records as ‘normal’ data and the minority-class records as ‘outlier’ data.
  • These algorithms are trained on the normal data only.
  • The trained model can then predict whether a new record is normal or an outlier (see the sketch after this list).
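As a rough illustration of this framing, the sketch below trains scikit-learn's IsolationForest only on majority-class ("normal") records and then flags records as normal (+1) or outlier (-1). Treating the outlier flag as the minority-class prediction is an assumption of this sketch, and the contamination value is an arbitrary choice:

Python3

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import IsolationForest

X, y = make_classification(n_classes=2, class_sep=2, weights=[0.1, 0.9],
                           n_informative=3, n_redundant=1, flip_y=0,
                           n_features=20, n_clusters_per_class=1,
                           n_samples=1000, random_state=42)

# Train only on the majority ("normal") class, label 1 (assumption for this sketch)
detector = IsolationForest(contamination=0.1, random_state=42)
detector.fit(X[y == 1])

# predict() returns +1 for inliers ("normal") and -1 for outliers
pred = detector.predict(X)
print("Records flagged as outliers:", np.sum(pred == -1), "of", len(X))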

