
One Hot Encoding in Machine Learning

Last Updated : 07 Feb, 2025

One Hot Encoding is a method for converting categorical variables into a binary format. It creates a new column for each category, where 1 means the category is present and 0 means it is not. Its primary purpose is to make categorical data usable by machine learning models.

Importance of One Hot Encoding

We use One Hot Encoding because:

  1. Eliminating Ordinality: Many categorical variables have no inherent order (e.g., “Male” and “Female”). If we were to assign numerical values (e.g., Male = 0, Female = 1), the model might mistakenly interpret this as a ranking, which can lead to biased predictions. One Hot Encoding eliminates this risk by treating each category independently.
  2. Improving Model Performance: By giving each category its own column, One Hot Encoding can help improve the performance of machine learning models. It lets a model learn a separate effect for each category, which might be missed if the variable were treated as a single numeric column.
  3. Compatibility with Algorithms: Many machine learning algorithms, particularly those based on linear regression and gradient descent, require numerical input. One Hot Encoding ensures that categorical variables are converted into a suitable format.

How One-Hot Encoding Works: An Example

To grasp the concept better, let’s explore a simple example. Imagine we have a dataset with fruits, their categorical values, and corresponding prices. Using one-hot encoding, we can transform these categorical values into numerical form. For example:

  • Wherever the fruit is “Apple,” the Apple column will have a value of 1 while the other fruit columns (like Mango or Orange) will contain 0.
  • This pattern ensures that each categorical value gets its own column represented with binary values (1 or 0) making it usable for machine learning models.

Fruit     Categorical value of fruit    Price
apple     1                             5
mango     2                             10
apple     1                             15
orange    3                             20

The output after applying one-hot encoding on the data is given as follows,

Fruit_apple    Fruit_mango    Fruit_orange    price
1              0              0               5
0              1              0               10
1              0              0               15
0              0              1               20
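
As a minimal sketch, the fruit table above can be reproduced with pandas. The DataFrame name fruits and the dtype=int argument (used so the result shows 1/0 rather than True/False) are choices made for this illustration.

```python
import pandas as pd

# The fruit table from the example above (without the integer code column)
fruits = pd.DataFrame({
    "Fruit": ["apple", "mango", "apple", "orange"],
    "price": [5, 10, 15, 20],
})

# One new column per distinct fruit; dtype=int yields 1/0 instead of True/False
encoded = pd.get_dummies(fruits, columns=["Fruit"], dtype=int)

# get_dummies places the untouched columns first, so reorder to match the table
encoded = encoded[["Fruit_apple", "Fruit_mango", "Fruit_orange", "price"]]
print(encoded)
```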

Implementing One-Hot Encoding Using Python

To implement one-hot encoding in Python, we can use either the Pandas library or the Scikit-learn library, both of which provide efficient and convenient methods for this task.

1. Using Pandas

Pandas offers the get_dummies function which is a simple and effective way to perform one-hot encoding. This method converts categorical variables into multiple binary columns.

  • For example, the Gender column with values 'M' and 'F' becomes two binary columns: Gender_F and Gender_M.
  • drop_first=True in pandas drops the first category of each encoded column, which is redundant. For Gender it drops Gender_F and keeps only Gender_M, which avoids multicollinearity.
Python
import pandas as pd

data = {
    'Employee id': [10, 20, 15, 25, 30],
    'Gender': ['M', 'F', 'F', 'M', 'F'],
    'Remarks': ['Good', 'Nice', 'Good', 'Great', 'Nice']
}

df = pd.DataFrame(data)
print(f"Original Employee Data:\n{df}\n")

# Use pd.get_dummies() to one-hot encode the categorical columns.
# drop_first=True drops the first category of each column (Gender_F, Remarks_Good).
df_pandas_encoded = pd.get_dummies(df, columns=['Gender', 'Remarks'], drop_first=True)
print(f"One-Hot Encoded Data using Pandas:\n{df_pandas_encoded}\n")

Output:

Original Employee Data:
   Employee id Gender Remarks
0           10      M    Good
1           20      F    Nice
2           15      F    Good
3           25      M   Great
4           30      F    Nice

One-Hot Encoded Data using Pandas:
   Employee id  Gender_M  Remarks_Great  Remarks_Nice
0           10      True          False         False
1           20     False          False          True
2           15     False          False         False
3           25      True           True         False
4           30     False          False          True

The data has 3 unique Remarks values and 2 unique Gender values, yet the encoded output contains only 2 Remarks columns and 1 Gender column. That is because a variable with n unique labels can be fully described with n-1 columns. For example, keeping only the Gender_M column and dropping Gender_F still conveys the entire information: a value of 1 (True) means male and 0 (False) means female. This way we can encode the categorical data and reduce the number of parameters as well.
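
The n-1 argument can be checked directly. The sketch below, which uses only the Gender values from the example, shows that the full encoding always sums to 1 per row (the source of the redundancy) and that the dropped column can be rebuilt from the one that is kept. The variable names are illustrative.

```python
import pandas as pd

gender = pd.Series(["M", "F", "F", "M", "F"], name="Gender")

# Full encoding: one column per label (Gender_F, Gender_M)
full = pd.get_dummies(gender, prefix="Gender", dtype=int)

# Reduced encoding: the first label (F) is dropped, only Gender_M remains
reduced = pd.get_dummies(gender, prefix="Gender", dtype=int, drop_first=True)

# Every row of the full encoding sums to 1, so its columns are linearly dependent
row_sums = full.sum(axis=1)

# The dropped column is fully determined by the column that was kept
recovered_f = 1 - reduced["Gender_M"]

print(full.assign(row_sum=row_sums))
print(recovered_f.equals(full["Gender_F"]))
```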

2. One Hot Encoding using Scikit Learn Library

Scikit-learn (sklearn) is a popular machine learning library in Python that provides numerous tools for data preprocessing. It provides the OneHotEncoder class, which encodes categorical variables into binary vectors. In the code below, the categorical columns are first picked out with the pandas call df.select_dtypes(include=['object']):

  • This selects only the columns with categorical data (data type object).
  • In this case, ['Gender', 'Remarks'] are identified as categorical columns.
Python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

data = {'Employee id': [10, 20, 15, 25, 30],
        'Gender': ['M', 'F', 'F', 'M', 'F'],
        'Remarks': ['Good', 'Nice', 'Good', 'Great', 'Nice'],
        }
df = pd.DataFrame(data)
print(f"Employee data : \n{df}")

# Select the columns with dtype object, here ['Gender', 'Remarks']
categorical_columns = df.select_dtypes(include=['object']).columns.tolist()

# sparse_output=False returns a dense array (scikit-learn 1.2+; older versions use sparse=False)
encoder = OneHotEncoder(sparse_output=False)
one_hot_encoded = encoder.fit_transform(df[categorical_columns])

# Wrap the encoded array in a DataFrame with readable column names
one_hot_df = pd.DataFrame(one_hot_encoded,
                          columns=encoder.get_feature_names_out(categorical_columns))

# Append the encoded columns, then drop the original categorical columns
df_encoded = pd.concat([df, one_hot_df], axis=1)
df_encoded = df_encoded.drop(categorical_columns, axis=1)
print(f"Encoded Employee data : \n{df_encoded}")

Output:

Employee data : 
   Employee id Gender Remarks
0           10      M    Good
1           20      F    Nice
2           15      F    Good
3           25      M   Great
4           30      F    Nice
Encoded Employee data : 
   Employee id  Gender_F  Gender_M  Remarks_Good  Remarks_Great  Remarks_Nice
0           10       0.0       1.0           1.0            0.0           0.0
1           20       1.0       0.0           0.0            0.0           1.0
2           15       1.0       0.0           1.0            0.0           0.0
3           25       0.0       1.0           0.0            1.0           0.0
4           30       1.0       0.0           0.0            0.0           1.0

Both Pandas and Scikit-Learn offer robust solutions for one-hot encoding.

  • Use Pandas get_dummies() when you need quick and simple encoding.
  • Use Scikit-Learn OneHotEncoder when working within a machine learning pipeline or when you need finer control over encoding behavior.

Advantages and Disadvantages of One Hot Encoding

Advantages of Using One Hot Encoding

  1. It allows the use of categorical variables in models that require numerical input.
  2. It can improve model performance by providing more information to the model about the categorical variable.
  3. It avoids imposing a false order on the data, a problem that can occur when categories with no natural ordering (e.g. “red”, “green”, “blue”) are encoded as integers.

Disadvantages of Using One Hot Encoding

  1. It can lead to increased dimensionality as a separate column is created for each category in the variable. This can make the model more complex and slow to train.
  2. It can lead to sparse data as most observations will have a value of 0 in most of the one-hot encoded columns.
  3. It can lead to overfitting especially if there are many categories in the variable and the sample size is relatively small.
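
The dimensionality and sparsity costs are easy to see on a deliberately extreme case. The sketch below uses a made-up column in which every row has its own label, so the numbers are illustrative rather than typical.

```python
import pandas as pd

# Hypothetical high-cardinality column: 1,000 rows, 1,000 distinct labels
ids = pd.DataFrame({"user": [f"user_{i}" for i in range(1000)]})

encoded = pd.get_dummies(ids, columns=["user"], dtype=int)

n_rows, n_cols = encoded.shape
# Share of cells that hold a 1; everything else is 0
fraction_nonzero = encoded.to_numpy().mean()

print(f"{n_rows} rows x {n_cols} columns, {fraction_nonzero:.1%} of cells are non-zero")
```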

Best Practices for One Hot Encoding

To make the most of One Hot Encoding, consider the following best practices:

  1. Limit the Number of Categories: If you have high-cardinality categorical variables, consider limiting the number of categories through grouping or feature engineering.
  2. Use Feature Selection: Implement feature selection techniques to identify and retain only the most relevant features after One Hot Encoding. This can help reduce dimensionality and improve model performance.
  3. Monitor Model Performance: Regularly evaluate your model’s performance after applying One Hot Encoding. If you notice signs of overfitting or other issues, consider alternative encoding methods.
  4. Understand Your Data: Before applying One Hot Encoding, take the time to understand the nature of your categorical variables. Determine whether they have a natural order and whether One Hot Encoding is appropriate.
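
One possible way to limit the number of categories, as suggested in the first practice, is to merge rare labels into a single bucket before encoding. The city values, the "Other" label and the min_count threshold below are assumptions made for this sketch.

```python
import pandas as pd

# Hypothetical column with a few frequent labels and several rare ones
city = pd.Series(
    ["Delhi", "Delhi", "Delhi", "Mumbai", "Mumbai", "Pune", "Agra", "Kochi"],
    name="city",
)

min_count = 2  # assumed threshold; tune it for your data
counts = city.value_counts()
frequent = counts[counts >= min_count].index

# Keep frequent labels and replace every other label with "Other"
grouped = city.where(city.isin(frequent), "Other")

encoded = pd.get_dummies(grouped, prefix="city", dtype=int)
print(encoded)
```

This produces three columns instead of five. Scikit-learn's OneHotEncoder offers similar grouping through its min_frequency and max_categories parameters in recent versions.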

Alternatives to One Hot Encoding

While One Hot Encoding is a popular choice for handling categorical data, there are several alternatives that may be more suitable depending on the context:

  1. Label Encoding: When a categorical variable has a natural order (e.g., “Low,” “Medium,” “High”), label encoding can be a better option. It assigns a unique integer to each category. Because the categories really are ordered, the integers do not introduce the misleading ranking they would for nominal data, provided they are assigned in that order.
  2. Binary Encoding: This technique combines ideas from One Hot Encoding and label encoding. It first assigns each category an integer, writes that integer in binary, and then creates one column per binary digit. For n categories this needs roughly log2(n) columns instead of n, which reduces dimensionality while still giving every category a distinct code.
  3. Target Encoding: In target encoding, we replace each category with the mean of the target variable for that category. This method can be particularly useful for categorical variables with a high number of unique values, but it also carries a risk of leakage if not handled properly.
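
The three alternatives can be sketched with plain pandas. The toy data, column names and the explicit order mapping are assumptions for illustration. Dedicated encoders exist in libraries, but this version spells out the mechanics.

```python
import pandas as pd

# Toy data: an ordered category and a binary target (values invented for the sketch)
df = pd.DataFrame({
    "size": ["Low", "High", "Medium", "Low", "High", "Medium"],
    "target": [0, 1, 1, 0, 1, 0],
})

# 1. Label (ordinal) encoding with an explicit order, so Low < Medium < High
order = {"Low": 0, "Medium": 1, "High": 2}
df["size_ordinal"] = df["size"].map(order)

# 2. Binary encoding: write each integer code in base 2, one column per bit
n_bits = max(1, int(df["size_ordinal"].max()).bit_length())
for bit in range(n_bits):
    df[f"size_bit{bit}"] = (df["size_ordinal"] // (2 ** bit)) % 2

# 3. Target (mean) encoding: replace each category with its mean target value.
# The means are computed on the full data here only for brevity; in practice
# compute them on the training split alone to avoid target leakage.
means = df.groupby("size")["target"].mean()
df["size_target"] = df["size"].map(means)

print(df)
```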



Lekhana_Ganji