Encoding Categorical Data in R
Last Updated : 26 Apr, 2025
Encoding is the process of converting categorical data into numerical values. Categorical data is data that can be classified into categories or groups (such as colors or job titles). Since many statistical and machine learning algorithms require numeric input, encoding is necessary to represent categorical variables in a format that these models can process.
Different Techniques to Encode Categorical Data
Categorical data can be encoded in R using a variety of techniques. We'll go over three of the most popular approaches: one-hot encoding, label encoding, and frequency encoding.
1. One-Hot Encoding
One-hot encoding is a technique used to convert categorical data into a binary matrix. Each unique category value in a variable is assigned its own column in the matrix. For each row, if a category value is present, the corresponding column is marked with a 1, while all other columns for that row are set to 0. This technique ensures that categorical values are represented numerically, allowing them to be used in machine learning models.
In this example, we create a sample dataset and convert the gender column, which is a categorical column, to a numerical format using one-hot encoding.
R
gender <- c("male", "female", "male", "male", "female")
age <- c(23, 34, 52, 21, 19)
income <- c(50000, 70000, 80000, 45000, 55000)
df <- data.frame(gender, age, income)
# "- 1" drops the intercept so every gender level gets its own 0/1 column
encoded_gender <- model.matrix(~ gender - 1, data = df)
print(encoded_gender)
Output (figure): One-hot Encoding
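As a minimal sketch of a possible next step, the indicator columns can be joined back to the numeric columns so that the whole data frame is numeric. The name `df_encoded` is chosen for this illustration only and does not come from the example above.

```r
# Same sample data as in the example above
gender <- c("male", "female", "male", "male", "female")
age <- c(23, 34, 52, 21, 19)
income <- c(50000, 70000, 80000, 45000, 55000)
df <- data.frame(gender, age, income)

# One 0/1 indicator column per gender level ("- 1" drops the intercept)
encoded_gender <- model.matrix(~ gender - 1, data = df)

# Replace the original gender column with its indicator columns
# (df_encoded is a hypothetical name used only in this sketch)
df_encoded <- cbind(df[, c("age", "income")], encoded_gender)
print(df_encoded)
```

The indicator columns are named by pasting the variable name and the level together, here `genderfemale` and `gendermale`.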
2. Label Encoding
Label encoding assigns a distinct integer to each unique value of a categorical variable. For instance, the values 1, 2, and 3 might be assigned to a categorical variable with the three unique values "red," "green," and "blue," respectively. In R, the factor() function turns a categorical variable into a factor, which can then be converted to integers with the as.integer() function. By default, factor() sorts the levels, so for character data the integer codes follow alphabetical order rather than order of appearance.
In this example, the data frame contains a categorical column named color. We label encode the color column using the factor() function and convert the result to integers using the as.integer() function.
R
color <- c("red", "green", "blue", "blue", "red")
df <- data.frame(color)
# Levels are sorted alphabetically: blue = 1, green = 2, red = 3
df$color <- as.integer(factor(df$color))
print(df)
Output (figure): Label Encoding
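When the categories have a natural order, it can help to state that order explicitly instead of relying on the default level ordering. The following is a minimal sketch under that assumption; the `size` vector and the variable names are hypothetical and not part of the example above.

```r
# Hypothetical ordinal variable
size <- c("medium", "low", "high", "low", "medium")

# Default: levels are sorted alphabetically (high = 1, low = 2, medium = 3)
default_codes <- as.integer(factor(size))

# Explicit levels: low = 1, medium = 2, high = 3
ordered_codes <- as.integer(factor(size, levels = c("low", "medium", "high")))

print(default_codes)
print(ordered_codes)
```

Passing `levels` keeps the integer codes aligned with the intended ranking, which matters if a model later treats the codes as ordered numbers.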
3. Frequency Encoding
Frequency encoding replaces each distinct value of a categorical variable with the number of times it occurs in the data. For example, if a categorical variable has three distinct values (red, green, and blue) that appear three, four, and two times, respectively, they are encoded as 3, 4, and 2.
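The red, green, and blue case described above can be sketched as follows. The data vector is made up to match those counts, and the names `color_freq` and `color_rel_freq` are assumptions for this illustration. The relative-frequency variant is an optional extra, not something the text above requires.

```r
# Hypothetical data: red appears 3 times, green 4 times, blue 2 times
color <- c("red", "green", "blue", "green", "red",
           "green", "blue", "red", "green")

# Count occurrences of each distinct value
freq_count <- table(color)

# Look up the count for every observation by its category name
color_freq <- as.integer(freq_count[color])
print(color_freq)

# Optional variant: proportions instead of raw counts
color_rel_freq <- as.numeric(prop.table(freq_count)[color])
print(round(color_rel_freq, 3))
```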
In this example, we will frequency encode the color column of the data frame.
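R
color <- c("red", "green", "blue", "blue", "red")
df <- data.frame(color)
# Count how often each color occurs
freq_count <- table(df$color)
# Replace each color with its count (match() would only return its position in the table)
df$color <- as.integer(freq_count[as.character(df$color)])
print(df)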
Output (figure): Frequency Encoding
Choosing an Encoding Method
The choice of encoding method depends on the type of analysis or model being used and the characteristics of the data. One-Hot Encoding is typically preferred for categorical variables with a small number of unique values, because it adds one column per category. For variables with many unique values, Label Encoding and Frequency Encoding are more compact, since each produces a single numeric column.
It's important to note that Label Encoding and Frequency Encoding can introduce unintended order or hierarchy into the data, which may affect the validity of analysis or machine learning models. In such cases, One-Hot Encoding may be a more suitable choice.
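To make the comparison concrete, here is a minimal sketch of how many columns each method produces for a variable with many categories. The `city` variable and its 50 made-up codes are hypothetical and serve only to illustrate the effect.

```r
# Hypothetical variable: 50 distinct city codes, each repeated 4 times
city <- rep(paste0("city_", 1:50), times = 4)
df <- data.frame(city)

# One-hot encoding: one indicator column per distinct city
one_hot <- model.matrix(~ city - 1, data = df)
print(ncol(one_hot))

# Label encoding: a single integer column
label <- as.integer(factor(df$city))

# Frequency encoding: a single column of counts
freq <- as.integer(table(df$city)[as.character(df$city)])

print(length(label))
print(unique(freq))
```

Here one-hot encoding yields 50 columns, while label and frequency encoding each yield one. Every city occurs equally often in this data, so frequency encoding maps all of them to the same value, which shows that it cannot distinguish categories with equal counts.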
Difference Between all the Methods
Encoding Method | Description | When to Use |
---|---|---|
One-Hot Encoding | Converts each category into a binary vector, where one element is set to 1 and all others are 0. | When there is no inherent order between categories. Best suited to variables with a small number of unique values, since each category adds a column. |
Frequency Encoding | Assigns a numerical value to each category based on its frequency in the dataset. | When there are many categories and you want to retain information about the frequency of categories. Useful for large datasets but may imply an unintended hierarchy. |
Label Encoding | Assigns a unique integer to each category. | When there is an ordinal relationship between categories (e.g., low, medium, high). Suitable for variables with a limited number of categories. Not recommended for nominal data, as it may create an artificial ranking. |
In this article, we discussed three encoding methods (One-Hot Encoding, Frequency Encoding, and Label Encoding) and when to use each, based on the nature of the categorical data and the requirements of the analysis or model.