Demand forecasting in retail using CatBoost

Last Updated : 17 Jun, 2024

In retail, accurate demand forecasting is crucial for optimizing inventory management, minimizing costs, and ensuring customer satisfaction. Traditional forecasting methods often fall short in capturing the complexity and dynamic nature of retail demand. This is where machine learning techniques like CatBoost come into play. CatBoost, a gradient boosting algorithm developed by Yandex, is well suited to demand forecasting in retail because it handles categorical data natively and includes safeguards against overfitting. This article walks through a CatBoost demand forecasting workflow on a synthetic retail dataset.

What is CatBoost?

CatBoost stands for "Categorical Boosting," and it is designed to handle categorical features efficiently without extensive preprocessing. Many gradient boosting implementations expect categorical columns to be encoded as numbers beforehand. CatBoost instead accepts them directly and converts them internally using statistics computed from the target, which makes it effective for retail datasets that often include categorical features such as product categories, store locations, and promotional strategies.
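The idea of encoding a category from target statistics can be sketched in a few lines. The function below is a simplified illustration, not CatBoost's actual implementation: it visits rows in a random order and encodes each row from the targets of the same-category rows seen before it, so a row's own target never leaks into its encoding. The name `ordered_target_statistic` and the `prior_weight` parameter are assumptions made for this example.

```python
import numpy as np
import pandas as pd

def ordered_target_statistic(categories, target, prior_weight=1.0, seed=0):
    """Encode each row using only targets of earlier rows in a random order."""
    categories = pd.Series(categories).reset_index(drop=True)
    target = pd.Series(target, dtype=float).reset_index(drop=True)
    prior = target.mean()  # fallback value when a category has no history yet
    order = np.random.default_rng(seed).permutation(len(target))

    sums, counts = {}, {}
    encoded = np.empty(len(target))
    for idx in order:
        cat = categories[idx]
        s = sums.get(cat, 0.0)
        c = counts.get(cat, 0)
        # Smoothed mean of the targets seen so far for this category
        encoded[idx] = (s + prior_weight * prior) / (c + prior_weight)
        # Only now add this row's target to the running totals
        sums[cat] = s + target[idx]
        counts[cat] = c + 1
    return encoded

enc = ordered_target_statistic(["a", "a", "b"], [10, 20, 30])
print(enc)
```

A category seen for the first time receives the prior (the overall target mean), which is why the single "b" row is encoded as 20.0 here.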

Benefits of CatBoost for Retail Demand Forecasting

  • Handling Categorical Data: Retail datasets are rich in categorical data. CatBoost's native support for categorical features removes the need for one-hot encoding, which keeps the feature matrix small and reduces computational overhead.
  • Reduced Overfitting: CatBoost employs ordered boosting, which uses random permutations of the training data so that the residual for each example is estimated by a model that was not trained on that example's target. This reduces target leakage and overfitting, which matters in retail, where consumer behavior is volatile and the data are noisy.
  • Speed and Efficiency: CatBoost supports training on both CPU and GPU, which can shorten training time on large datasets. This matters in retail, where forecasts are only useful if they arrive in time to inform decisions.
  • Robustness: CatBoost handles missing values in numerical features without manual imputation, which helps in real-world retail environments where data quality can be inconsistent.
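To make the dimensionality point concrete, the sketch below counts the columns produced by one-hot encoding a small toy frame. The frame and its values are made up for illustration; only `pd.get_dummies` is a real pandas call.

```python
import pandas as pd

# Toy frame with the kinds of categorical columns a retail dataset has
toy = pd.DataFrame({
    "store_id": [1, 2, 3, 4, 5, 1],
    "product_id": [10, 11, 12, 13, 14, 15],
    "promo": [0, 1, 0, 1, 0, 1],
})

# One-hot encoding creates one column per distinct value of each feature
one_hot = pd.get_dummies(toy, columns=["store_id", "product_id", "promo"])

n_native_columns = toy.shape[1]       # columns passed as-is to a model with native support
n_one_hot_columns = one_hot.shape[1]  # 5 stores + 6 products + 2 promo values
print(n_native_columns, n_one_hot_columns)  # 3 13
```

With thousands of products, the one-hot matrix grows by one column per product, while the native representation stays at one column per feature.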

Code Implementation of Demand forecasting in retail using CatBoost

We will now go through a step-by-step implementation of demand forecasting in retail using CatBoost.

Step 1: Create a Synthetic Dataset

We will create a synthetic dataset which we will be using for our analysis:

Python
import pandas as pd
import numpy as np

# Create a synthetic dataset
np.random.seed(42)

date_range = pd.date_range(start='2021-01-01', end='2021-12-31', freq='D')
store_ids = np.arange(1, 6)     # 5 stores
product_ids = np.arange(1, 21)  # 20 products
holidays = pd.to_datetime(['2021-01-01', '2021-12-25'])

data = []
for date in date_range:
    # Flag New Year's Day and Christmas Day
    holiday = 1 if date in holidays else 0
    for store_id in store_ids:
        for product_id in product_ids:
            sales = np.random.poisson(lam=20)  # units sold, drawn independently of every feature
            promo = np.random.choice([0, 1])   # promotion flag, chosen at random
            data.append([date, store_id, product_id, sales, promo, holiday])

df = pd.DataFrame(data, columns=['date', 'store_id', 'product_id', 'sales', 'promo', 'holiday'])
df.to_csv('synthetic_retail_sales_data.csv', index=False)
df.info()

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36500 entries, 0 to 36499
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 date 36500 non-null datetime64[ns]
1 store_id 36500 non-null int64
2 product_id 36500 non-null int64
3 sales 36500 non-null int64
4 promo 36500 non-null int64
5 holiday 36500 non-null int64
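The nested loop above appends 36,500 rows one at a time. A vectorized construction is a possible alternative, sketched here under two assumptions: it uses NumPy's `default_rng` generator, so the random values differ from the loop version, and the name `fast_df` is introduced only for this example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # assumption: Generator API, values differ from np.random.seed(42)

dates = pd.date_range("2021-01-01", "2021-12-31", freq="D")
# Cartesian product of dates x stores x products, in the same row order as the loop
index = pd.MultiIndex.from_product(
    [dates, np.arange(1, 6), np.arange(1, 21)],
    names=["date", "store_id", "product_id"],
)
fast_df = index.to_frame(index=False)

# Draw all random columns in one call each
fast_df["sales"] = rng.poisson(lam=20, size=len(fast_df))
fast_df["promo"] = rng.integers(0, 2, size=len(fast_df))

# Flag the two holiday dates
holidays = pd.to_datetime(["2021-01-01", "2021-12-25"])
fast_df["holiday"] = fast_df["date"].isin(holidays).astype(int)

print(fast_df.shape)  # (36500, 6)
```

The row count is 365 days × 5 stores × 20 products = 36,500, matching the `df.info()` output above.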

Step 2: Load and Preprocess the Data

  • This step adds time-based features. The code extracts the day of the week, month, and year from the date column of the DataFrame and stores them as new columns.
  • These features can help the model capture temporal patterns such as weekly and seasonal cycles. Because the synthetic data covers only 2021, the year column is constant here and carries no information; it becomes useful once the data spans several years.
  • The snippet also imports CatBoostRegressor and Pool from catboost, train_test_split from scikit-learn, and the error metrics. They are used in the later steps, where the data is split, the model is trained, and the Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE) are computed on the test set.
Python
from catboost import CatBoostRegressor, Pool
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Create additional time-based features
df['day_of_week'] = df['date'].dt.dayofweek
df['month'] = df['date'].dt.month
df['year'] = df['date'].dt.year
df.info()

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36500 entries, 0 to 36499
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 date 36500 non-null datetime64[ns]
1 store_id 36500 non-null int64
2 product_id 36500 non-null int64
3 sales 36500 non-null int64
4 promo 36500 non-null int64
5 holiday 36500 non-null int64
6 day_of_week 36500 non-null int32
7 month 36500 non-null int32
8 year 36500 non-null int32
dtypes: datetime64[ns](1), int32(3), int64(5)
memory usage: 2.1 MB

Step 3: Feature Engineering

This step creates lag features, which are particularly useful for time series forecasting. It adds two new columns: sales_last_week and sales_last_month. They hold the sales of the same store and product combination 7 days and 30 days earlier, respectively. The groupby method groups the data by store_id and product_id so that each lag is computed within its own series, and the shift method moves the sales values down by 7 or 30 rows. Because every series has exactly one row per day in date order, a shift of 7 rows equals 7 days. The first 30 days of each of the 100 store and product series have no 30-day lag, so dropping the NaN rows removes 30 × 100 = 3,000 rows and leaves 33,500. Lag features let the model learn from past sales patterns.

Python
# Create lag features within each (store, product) series
df['sales_last_week'] = df.groupby(['store_id', 'product_id'])['sales'].shift(7)
df['sales_last_month'] = df.groupby(['store_id', 'product_id'])['sales'].shift(30)

# Drop rows with NaN values (the first 30 days of every series)
df.dropna(inplace=True)
df.info()

Output:

<class 'pandas.core.frame.DataFrame'>
Index: 33500 entries, 3000 to 36499
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 date 33500 non-null datetime64[ns]
1 store_id 33500 non-null int64
2 product_id 33500 non-null int64
3 sales 33500 non-null int64
4 promo 33500 non-null int64
5 holiday 33500 non-null int64
6 day_of_week 33500 non-null int32
7 month 33500 non-null int32
8 year 33500 non-null int32
9 sales_last_week 33500 non-null float64
10 sales_last_month 33500 non-null float64
dtypes: datetime64[ns](1), float64(2), int32(3), int64(5)
memory usage: 2.7 MB
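The behavior of a grouped shift is easiest to check on a tiny frame. The sketch below uses made-up values and a hypothetical column name `sales_prev_day`. It shows that each store only sees its own history even when rows from different stores are interleaved, and that the number of missing values equals the lag times the number of groups.

```python
import pandas as pd

# Two stores, three days, rows interleaved by date as in the article's dataset
toy = pd.DataFrame({
    "date": pd.to_datetime(["2021-01-01", "2021-01-01",
                            "2021-01-02", "2021-01-02",
                            "2021-01-03", "2021-01-03"]),
    "store_id": [1, 2, 1, 2, 1, 2],
    "sales": [10, 100, 11, 101, 12, 102],
})

# Lag of one day, computed separately for each store
toy["sales_prev_day"] = toy.groupby("store_id")["sales"].shift(1)

n_missing = int(toy["sales_prev_day"].isna().sum())
print(toy)
print(n_missing)  # 2 = lag of 1 x 2 stores
```

The same arithmetic explains the article's numbers: a lag of 30 across 100 store and product series leaves 3,000 rows without a value.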

Step 4: Define Features and Target Variable

The code snippet defines the features and target variable for a machine learning model. The features list includes a set of predictors that the model will use to learn and make predictions.

Python
# Define features and target variable
features = ['store_id', 'product_id', 'promo', 'holiday', 'day_of_week',
            'month', 'year', 'sales_last_week', 'sales_last_month']
target = 'sales'

Step 5: Split the Dataset

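The code splits the data into training and testing sets, holding out 20% of the rows for evaluation. Note that train_test_split shuffles rows at random, so the training set contains days that come after some of the test days. That is acceptable for this synthetic walkthrough, but on real sales data a chronological split gives a more honest estimate of forecasting accuracy.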

Python
# Split the data into training and testing sets
X = df[features]
y = df[target]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
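For forecasting tasks, a split that holds out the most recent dates is a common alternative to a random split. The sketch below is one possible way to do it and is not part of the original walkthrough; the function name `time_based_split` and its parameters are assumptions made for this example.

```python
import pandas as pd

def time_based_split(frame, date_col="date", test_fraction=0.2):
    """Hold out the most recent dates as the test set."""
    unique_dates = frame[date_col].drop_duplicates().sort_values()
    n_test_dates = max(1, int(round(len(unique_dates) * test_fraction)))
    cutoff = unique_dates.iloc[-n_test_dates]  # first date that belongs to the test set
    train = frame[frame[date_col] < cutoff]
    test = frame[frame[date_col] >= cutoff]
    return train, test

# Toy usage: 10 days with 2 rows per day
toy = pd.DataFrame({
    "date": pd.date_range("2021-01-01", periods=10, freq="D").repeat(2),
    "sales": range(20),
})
train, test = time_based_split(toy)
print(len(train), len(test))  # 16 4
```

Every training date precedes every test date, so the evaluation mimics predicting days the model has not seen yet.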

Step 6: Handle Categorical Data

The code converts specified features to categorical data types, which can be particularly beneficial when working with certain machine learning models like CatBoost, which can natively handle categorical features without needing to one-hot encode them.

Python
# Work on copies so the assignments below do not trigger SettingWithCopyWarning
X_train = X_train.copy()
X_test = X_test.copy()

# Convert categorical features to categorical data type
categorical_features = ['store_id', 'product_id', 'promo', 'holiday', 'day_of_week', 'month', 'year']
for feature in categorical_features:
    X_train[feature] = X_train[feature].astype('category')
    X_test[feature] = X_test[feature].astype('category')

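Output (the code above prints nothing; this listing is df.info() on the full DataFrame after the lag features were added and before the NaN rows were dropped):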

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36500 entries, 0 to 36499
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 date 36500 non-null datetime64[ns]
1 store_id 36500 non-null int64
2 product_id 36500 non-null int64
3 sales 36500 non-null int64
4 promo 36500 non-null int64
5 holiday 36500 non-null int64
6 day_of_week 36500 non-null int32
7 month 36500 non-null int32
8 year 36500 non-null int32
9 sales_last_week 35800 non-null float64
10 sales_last_month 33500 non-null float64
dtypes: datetime64[ns](1), float64(2), int32(3), int64(5)
memory usage: 2.6 MB

Step 7: Initialize and Train CatBoost Model

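Wrap the training and test sets in Pool objects, declaring which columns are categorical, then initialize and train the CatBoost model. The test pool is passed as eval_set so that training stops early once the MAE on it has not improved for 50 iterations. Because the same test set drives early stopping, the final test metrics are slightly optimistic; a separate validation set avoids this.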

Python
# Create Pool objects for training and validation
train_pool = Pool(X_train, y_train, cat_features=categorical_features)
test_pool = Pool(X_test, y_test, cat_features=categorical_features)

# Initialize CatBoostRegressor
model = CatBoostRegressor(
    iterations=1000,
    learning_rate=0.1,
    depth=6,
    loss_function='RMSE',  # objective minimized during training
    eval_metric='MAE',     # metric reported in the log and used for early stopping
    verbose=100
)

# Train the model
model.fit(train_pool, eval_set=test_pool, early_stopping_rounds=50)


Output:

0:    learn: 3.5340190    test: 3.5190577    best: 3.5190577 (0)    total: 113ms    remaining: 1m 52s
Stopped by overfitting detector (50 iterations wait)

bestTest = 3.519057696
bestIteration = 0

Shrink model to first 1 iterations.
<catboost.core.CatBoostRegressor at 0x799573d23fa0>

Step 8: Make Predictions and Evaluate the Model

This code demonstrates how to make predictions using a trained model, evaluate the model's performance using Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE), and print the results.

Python
# Make predictions
y_pred = model.predict(X_test)

# Evaluate the model
mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))

print(f'Mean Absolute Error (MAE): {mae}')
print(f'Root Mean Squared Error (RMSE): {rmse}')

Output:

Mean Absolute Error (MAE): 3.5706688593058047
Root Mean Squared Error (RMSE): 4.4738763403131685

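The Mean Absolute Error (MAE) of about 3.57 means that, on average, the model's predictions differ from the actual sales by roughly 3.57 units, with every error counted in proportion to its size. The Root Mean Squared Error (RMSE) of about 4.47 squares the errors before averaging, so large errors weigh more heavily. Both numbers should be read against how the data was generated. The sales in Step 1 are drawn from a Poisson distribution with mean 20, independently of store, product, promotion, holiday, and past sales. The standard deviation of that distribution is √20 ≈ 4.47, so an RMSE of 4.47 is what a model that always predicts the mean would score. This also explains the training log in Step 7, where the best iteration was 0 and early stopping ended training: the features contain no signal for the model to learn.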
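A useful check for any forecast is a comparison with a naive baseline. The sketch below writes out MAE and RMSE with NumPy and scores a constant prediction of 20 on freshly simulated Poisson(20) sales. The helper names `mae_score` and `rmse_score`, the seed, and the sample size are assumptions for this example, and the simulated values are not the article's dataset.

```python
import numpy as np

def mae_score(y_true, y_pred):
    """Mean of the absolute errors."""
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))

def rmse_score(y_true, y_pred):
    """Square root of the mean squared error."""
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))

rng = np.random.default_rng(0)
sales = rng.poisson(lam=20, size=100_000)   # same distribution as the synthetic sales
baseline_pred = np.full(sales.shape, 20.0)  # always predict the true mean

baseline_mae = mae_score(sales, baseline_pred)
baseline_rmse = rmse_score(sales, baseline_pred)
print(round(baseline_mae, 2), round(baseline_rmse, 2))
```

The baseline RMSE comes out close to √20 ≈ 4.47, and the baseline MAE lands near the MAE reported above. On data with no learnable signal, the constant prediction is effectively the best any model can do.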


Conclusion

Accurate demand forecasting is crucial in retail for optimizing inventory, reducing costs, and ensuring customer satisfaction. Traditional methods often fall short in capturing the complexities of retail demand. CatBoost, a gradient boosting algorithm developed by Yandex, handles categorical data natively and includes safeguards against overfitting, which makes it well suited to retail demand forecasting.

The walkthrough covers the full workflow: building temporal and lag features, passing categorical columns to CatBoost without one-hot encoding, and evaluating the predictions with MAE and RMSE. On this synthetic dataset the model reaches an MAE of about 3.57 and an RMSE of about 4.47. These figures match the noise level of the randomly generated sales, so they demonstrate the pipeline rather than CatBoost's forecasting accuracy. Real retail data with genuine seasonality and promotion effects is needed to judge how much the temporal and lag features improve predictions.



