Target encoding using nested CV in sklearn pipeline

Last Updated: 27 May, 2024

In machine learning, feature engineering plays a pivotal role in enhancing model performance. One such technique is target encoding, which is particularly useful for categorical variables. However, improper implementation can lead to data leakage and overfitting. This article delves into the intricacies of target encoding using nested cross-validation (CV) within an Sklearn pipeline, ensuring a robust and unbiased model evaluation.

Table of Content

  • Understanding Target Encoding
  • The Challenge of Data Leakage: Nested Cross-Validation (CV)
  • Target Encoding Using Nested CV in a Scikit-Learn Pipeline
  • Practical Considerations and Best Practices

Understanding Target Encoding

Target encoding, also known as mean encoding, replaces each categorical value with the mean of the target variable for that category. It is particularly useful for high-cardinality categorical features, where one-hot encoding would produce a sparse matrix and invite overfitting. However, the technique can itself overfit if applied incorrectly, especially when the same data is used both to calculate the category means and to train the model.
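As a minimal illustration of the idea (the city column and its values below are made up for demonstration), the per-category target means can be computed with a pandas groupby and mapped back onto the column:

Python
import pandas as pd

df = pd.DataFrame({
    'city':   ['NY', 'SF', 'NY', 'LA', 'SF', 'NY'],
    'target': [1, 0, 1, 0, 1, 0],
})

# Per-category target means: NY -> 2/3, SF -> 0.5, LA -> 0.0
means = df.groupby('city')['target'].mean()
df['city_encoded'] = df['city'].map(means)
print(df[['city', 'city_encoded']])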

Benefits of Target Encoding

  1. Dimensionality Reduction: Unlike one-hot encoding, target encoding reduces the number of features, leading to a more compact representation.
  2. Handling High Cardinality: It is effective for categorical variables with many unique values.
  3. Potential Performance Boost: By capturing the relationship between categorical features and the target variable, it can improve model performance.

The Challenge of Data Leakage: Nested Cross-Validation (CV)

One of the primary concerns with target encoding is data leakage. If the encoding is done on the entire dataset before splitting into training and testing sets, information from the test set can leak into the training process, leading to overly optimistic performance estimates. To prevent overfitting and data leakage when using target encoding within cross-validation, it's crucial to fit the encoder on the training folds and transform both the training and validation folds in each cross-validation step. This approach ensures that the model is not exposed to any information from the validation set during training, which is essential for maintaining the integrity of the cross-validation process.

  • The encoder must be fit on the training folds only, never on the validation fold, in each cross-validation step; this is what prevents overfitting and data leakage.
  • If the encoder is instead fit on the entire dataset, including the validation rows, the encoded values carry information about the validation targets, biasing the model toward the validation set and inflating its scores, as sketched below.
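Here is a minimal sketch of the difference, using TargetEncoder from the category_encoders package on a made-up dataset (variable names are illustrative):

Python
import pandas as pd
from sklearn.model_selection import KFold
from category_encoders import TargetEncoder

X_demo = pd.DataFrame({'category': ['A', 'B', 'A', 'C', 'B', 'A', 'C', 'C']})
y_demo = pd.Series([0, 1, 0, 1, 0, 1, 0, 1])

# Leaky: fitting on all rows lets validation targets shape the encoding
# X_leaky = TargetEncoder(cols=['category']).fit_transform(X_demo, y_demo)

# Correct: fit on the training fold only, then transform both folds
for train_idx, val_idx in KFold(n_splits=4, shuffle=True, random_state=0).split(X_demo):
    encoder = TargetEncoder(cols=['category'])
    X_train_enc = encoder.fit_transform(X_demo.iloc[train_idx], y_demo.iloc[train_idx])
    X_val_enc = encoder.transform(X_demo.iloc[val_idx])  # validation targets never seen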

Nested cross-validation is a robust technique to mitigate data leakage and ensure unbiased model evaluation. It involves two layers of cross-validation:

  1. Outer CV: Used for model evaluation.
  2. Inner CV: Used for hyperparameter tuning and feature engineering, including target encoding.

Benefits of Nested CV

  • Prevents Data Leakage: By separating the data used for encoding and model training.
  • Reliable Performance Estimates: Provides a more accurate measure of model performance on unseen data.

Target Encoding Using Nested CV in a Scikit-Learn Pipeline

Implementing target encoding in a pipeline while leveraging nested CV requires careful design to avoid data leakage. Scikit-Learn's Pipeline (and, where needed, FeatureUnion) can be used in conjunction with custom transformers to ensure proper target encoding, with the following steps:

  • Create a Custom Transformer for Target Encoding: This transformer should handle both the fitting and the transformation for target encoding (a minimal sketch follows this list).
  • Integrate the Transformer in a Pipeline: Include the custom transformer in a Scikit-Learn pipeline.
  • Apply Nested Cross-Validation: Use nested CV to evaluate the model within the pipeline.
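For the first step, a minimal sketch of such a custom transformer is shown below. The class name MeanTargetEncoder and its fall-back-to-global-mean behavior are illustrative choices of ours (assuming pandas DataFrame input), not a library API; the worked example that follows instead uses the ready-made TargetEncoder from the category_encoders package.

Python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class MeanTargetEncoder(BaseEstimator, TransformerMixin):
    """Replace each listed column's categories with the mean of y seen during fit()."""

    def __init__(self, cols):
        self.cols = cols

    def fit(self, X, y):
        y = pd.Series(np.asarray(y), index=X.index)
        self.global_mean_ = y.mean()
        # Learn category -> mean-target mappings from the training data only
        self.mappings_ = {col: y.groupby(X[col]).mean() for col in self.cols}
        return self

    def transform(self, X):
        X = X.copy()
        for col in self.cols:
            # Categories unseen during fit fall back to the global training mean
            X[col] = X[col].map(self.mappings_[col]).fillna(self.global_mean_)
        return X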

Let's walk through a step-by-step implementation of target encoding using nested cross-validation within an Sklearn pipeline.

Step 1: Import Necessary Libraries and Create a Sample Dataset

Python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold, cross_val_score, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from category_encoders import TargetEncoder

# Sample dataset
data = {
    'category': ['A', 'B', 'A', 'C', 'B', 'A', 'C', 'C', 'B', 'A'],
    'feature': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'target': [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]
}
df = pd.DataFrame(data)
X = df[['category', 'feature']]
y = df['target']

Step 2: Define the Pipeline

We will create a pipeline that chains target encoding with a classifier. The Sklearn pipeline includes:

  • TargetEncoder for target encoding the category feature.
  • StandardScaler for scaling the numerical feature.
  • RandomForestClassifier as the classifier.
Python
pipeline = Pipeline([
    ('target_encoder', TargetEncoder(cols=['category'])),
    ('scaler', StandardScaler()),
    ('classifier', RandomForestClassifier())
])

Step 3: Nested Cross-Validation

We will use nested cross-validation to evaluate the model. The outer loop will handle the model evaluation, while the inner loop will handle hyperparameter tuning and target encoding. The outer and inner cross-validation strategies are defined using KFold. A parameter grid is defined for hyperparameter tuning of the RandomForestClassifier.

Python
# Define the outer and inner cross-validation strategies
outer_cv = KFold(n_splits=5, shuffle=True, random_state=42)
inner_cv = KFold(n_splits=3, shuffle=True, random_state=42)

# Define the parameter grid for hyperparameter tuning
param_grid = {
    'classifier__n_estimators': [50, 100],
    'classifier__max_depth': [None, 10, 20]
}

# Perform nested cross-validation
grid_search = GridSearchCV(estimator=pipeline, param_grid=param_grid, cv=inner_cv, scoring='accuracy')
nested_scores = cross_val_score(grid_search, X, y, cv=outer_cv, scoring='accuracy')
print(f'Nested CV Accuracy: {np.mean(nested_scores):.4f} ± {np.std(nested_scores):.4f}')

Output:

Nested CV Accuracy: 0.1000 ± 0.2000

A nested cross-validation accuracy of 0.1000 ± 0.2000 indicates that the model's performance is not reliable.

  • The mean accuracy of 0.1000 suggests that, on average, the model is correctly predicting the target class for only 10% of the samples.
  • However, the large standard deviation of 0.2000 indicates high variability in model performance across different folds or iterations of cross-validation.
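This instability is expected: with only 10 samples, each outer test fold contains just two observations, so individual fold scores swing wildly. As a hypothetical sanity check (the dataset below is synthetic, and the randomly generated category column is deliberately uninformative), rerunning the grid_search and outer_cv from above on a larger dataset should give far more stable estimates:

Python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification

# Hypothetical larger dataset: 500 rows of numeric features plus a random category
X_num, y_big = make_classification(n_samples=500, n_features=5, random_state=42)
X_big = pd.DataFrame(X_num, columns=[f'f{i}' for i in range(5)])
X_big['category'] = np.random.default_rng(42).choice(['A', 'B', 'C', 'D'], size=500)

# Reuses grid_search, outer_cv and cross_val_score from the steps above
big_scores = cross_val_score(grid_search, X_big, y_big, cv=outer_cv, scoring='accuracy')
print(f'Nested CV Accuracy: {big_scores.mean():.4f} ± {big_scores.std():.4f}')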

Practical Considerations and Best Practices

Implementing target encoding within nested cross-validation demands careful attention to several details. Below are common pitfalls and best practices for maximizing the effectiveness of this technique:

  • Choosing Appropriate Encoding Techniques: Different categorical variables may require different encoding techniques. For ordinal variables, methods like ordinal encoding might be suitable, while for nominal variables, techniques like target encoding or one-hot encoding could be considered. Understanding the nature of the categorical variables in your dataset is crucial for selecting the most appropriate encoding method.
  • Handling Missing Values During Encoding: Missing values within categorical variables pose a challenge during encoding. It's essential to decide how to handle these missing values before applying target encoding. Options include treating missing values as a separate category, imputing them with the mode or median, or using advanced imputation techniques. The chosen approach should align with the specific characteristics of the dataset and the objectives of the analysis.
  • Dealing with Rare or Unseen Categories: In real-world datasets, categorical variables may contain rare categories, or categories may appear at prediction time that were not present in the training data. Target encoding such categories based solely on the training set may lead to biased or unreliable results. To address this issue, consider techniques such as frequency thresholding or combining rare categories into a single group. Additionally, incorporating domain knowledge or external data sources can aid in properly handling rare categories during encoding; the sketch after this list shows one way to handle both missing and unseen categories.
  • Preventing Overfitting and Data Leakage: Overfitting and data leakage are significant concerns when using target encoding within nested cross-validation. To mitigate these risks, ensure that the encoding is performed solely on the training folds during cross-validation. This prevents information from the validation set from influencing the encoding process, leading to more reliable model evaluation. By adhering to this practice, the model can generalize better to unseen data and provide more accurate performance estimates.
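Several of these concerns map directly onto parameters of category_encoders' TargetEncoder. A minimal sketch (the specific values below are illustrative, not tuned recommendations):

Python
from category_encoders import TargetEncoder

encoder = TargetEncoder(
    cols=['category'],
    handle_missing='value',  # missing values are encoded with the prior (global target mean)
    handle_unknown='value',  # categories unseen at fit time also receive the prior
    smoothing=10.0,          # stronger shrinkage of rare-category means toward the prior
)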

Conclusion

Target encoding is a powerful technique for handling categorical variables, especially with high cardinality. Implementing it correctly in a Scikit-Learn pipeline using nested cross-validation can prevent data leakage and overfitting, ensuring robust model performance. By integrating these practices, data scientists can build more reliable and accurate predictive models.

