Demand Forecasting in Retail Using CatBoost

Last Updated : 17 Jun, 2024

In retail, accurate demand forecasting is crucial for optimizing inventory management, minimizing costs, and ensuring customer satisfaction. Traditional forecasting methods often fall short in capturing the complexity and dynamic nature of retail demand. This is where machine learning techniques like CatBoost come into play. CatBoost, a gradient boosting algorithm developed by Yandex, is well-suited for demand forecasting in retail because it handles categorical data natively and includes mechanisms that mitigate overfitting. In this article, we walk through a CatBoost-based demand forecasting workflow on a synthetic retail dataset.

What is CatBoost?

CatBoost stands for "Categorical Boosting," and it is designed to handle categorical features efficiently without extensive preprocessing. Many gradient boosting libraries expect categorical variables to be encoded numerically beforehand. CatBoost instead encodes them internally, using statistics of the target computed per category. This makes it a good fit for retail datasets, which often include categorical features such as product categories, store locations, and promotional strategies.

Benefits of CatBoost for Retail Demand Forecasting

Handling Categorical Data: Retail datasets are rich in categorical data. CatBoost's native support for categorical features eliminates the need for one-hot encoding, reducing dimensionality and computational overhead.
Reduced Overfitting: CatBoost employs ordered boosting, a scheme in which the gradient estimate for each training example comes from a model fitted only on the examples that precede it in a random permutation of the dataset. This limits target leakage during training and so reduces overfitting, which is particularly beneficial in retail, where volatile consumer behavior makes overfitting a real risk.
Speed and Efficiency: CatBoost supports training on both CPU and GPU, which can shorten training time on large datasets. This efficiency matters in retail, where timely forecasts can significantly impact decision-making.
Robustness: CatBoost's ability to handle missing values and noisy data makes it robust in real-world retail environments where data quality can be inconsistent.
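To make the idea of encoding categories from earlier examples only more concrete, here is a minimal sketch of an ordered target statistic in plain Python. It is a simplified illustration, not CatBoost's actual implementation: the function name `ordered_target_stats`, the `prior_weight` parameter, and the toy store and sales values are all assumptions made for this example.

```python
import random

def ordered_target_stats(categories, targets, prior, prior_weight=1.0, seed=0):
    """Encode each row's category using only targets of rows that come
    earlier in a random permutation (a simplified, hypothetical sketch)."""
    order = list(range(len(categories)))
    random.Random(seed).shuffle(order)  # random permutation of row indices

    sums, counts = {}, {}
    encoded = [0.0] * len(categories)
    for idx in order:
        cat = categories[idx]
        s = sums.get(cat, 0.0)
        c = counts.get(cat, 0)
        # Smoothed mean of the targets seen so far for this category
        encoded[idx] = (s + prior_weight * prior) / (c + prior_weight)
        # Only now does the row's own target become visible to later rows
        sums[cat] = s + targets[idx]
        counts[cat] = c + 1
    return encoded

# Toy data: store label and sales per row (made-up values)
stores = ["A", "B", "A", "B", "A", "C"]
sales = [10.0, 30.0, 12.0, 28.0, 14.0, 20.0]
prior = sum(sales) / len(sales)

encoded = ordered_target_stats(stores, sales, prior)
print(encoded)
```

Because a row's own target is added to the running totals only after the row has been encoded, the encoding never leaks the value it is later used to predict. A category seen for the first time, such as store "C" here, simply receives the prior.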

Code Implementation of Demand Forecasting in Retail Using CatBoost

We will now go through a step-by-step implementation of demand forecasting in retail using CatBoost.

Step 1: Create a Synthetic Dataset

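We start by creating a synthetic dataset: one year of daily sales for 5 stores and 20 products, with a random promotion flag and a holiday flag for two fixed dates. Note that sales are drawn from a Poisson distribution with mean 20, independently of every other column.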

Python

import pandas as pd
import numpy as np

# Create a synthetic dataset
np.random.seed(42)

date_range = pd.date_range(start='2021-01-01', end='2021-12-31', freq='D')
store_ids = np.arange(1, 6)  # 5 stores
product_ids = np.arange(1, 21)  # 20 products

data = []
for date in date_range:
    for store_id in store_ids:
        for product_id in product_ids:
            sales = np.random.poisson(lam=20)
            promo = np.random.choice([0, 1])
            holiday = 1 if date in pd.to_datetime(['2021-01-01', '2021-12-25']) else 0
            data.append([date, store_id, product_id, sales, promo, holiday])

df = pd.DataFrame(data, columns=['date', 'store_id', 'product_id', 'sales', 'promo', 'holiday'])
df.to_csv('synthetic_retail_sales_data.csv', index=False)
df.info()

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36500 entries, 0 to 36499
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   date        36500 non-null  datetime64[ns]
 1   store_id    36500 non-null  int64         
 2   product_id  36500 non-null  int64         
 3   sales       36500 non-null  int64         
 4   promo       36500 non-null  int64         
 5   holiday     36500 non-null  int64
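The 36,500 rows follow directly from the grid: 365 days times 5 stores times 20 products. As an alternative to the nested loops, the same grid can be built in a vectorized way. The sketch below is one possible approach, not the article's method; the variable names `grid` and `rng` are assumptions, and because it uses a different random generator, the individual sales and promo values will differ from those produced by the loop.

```python
import numpy as np
import pandas as pd

# Assumption: a NumPy Generator is used here, so the random values
# will not match the np.random.seed(42) loop version.
rng = np.random.default_rng(42)

dates = pd.date_range("2021-01-01", "2021-12-31", freq="D")

# Cartesian product of dates, stores and products (date varies slowest)
index = pd.MultiIndex.from_product(
    [dates, np.arange(1, 6), np.arange(1, 21)],
    names=["date", "store_id", "product_id"],
)
grid = index.to_frame(index=False)

# Draw all random columns at once instead of row by row
grid["sales"] = rng.poisson(lam=20, size=len(grid))
grid["promo"] = rng.integers(0, 2, size=len(grid))
grid["holiday"] = grid["date"].isin(
    pd.to_datetime(["2021-01-01", "2021-12-25"])
).astype(int)

print(len(grid))  # 365 days * 5 stores * 20 products
```

The row order matches the loop version (date, then store, then product), so later steps that rely on that ordering would behave the same way.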

Step 2: Load and Preprocess the Data

We continue with the DataFrame df created in Step 1. The code below imports the CatBoost and scikit-learn utilities used in the later steps and then extracts the day of the week, month, and year from the date column, adding them to the dataset as new features.
These calendar features can help the model capture temporal patterns such as weekly or seasonal cycles.
Splitting the data, training the CatBoostRegressor, and evaluating it with Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE) are covered in the steps that follow.

Python

from catboost import CatBoostRegressor, Pool
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Create additional time-based features
df['day_of_week'] = df['date'].dt.dayofweek
df['month'] = df['date'].dt.month
df['year'] = df['date'].dt.year
df.info()

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36500 entries, 0 to 36499
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   date         36500 non-null  datetime64[ns]
 1   store_id     36500 non-null  int64         
 2   product_id   36500 non-null  int64         
 3   sales        36500 non-null  int64         
 4   promo        36500 non-null  int64         
 5   holiday      36500 non-null  int64         
 6   day_of_week  36500 non-null  int32         
 7   month        36500 non-null  int32         
 8   year         36500 non-null  int32         
dtypes: datetime64[ns](1), int32(3), int64(5)
memory usage: 2.1 MB
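As a quick illustration of what the `.dt` accessors return, the following sketch applies them to three hand-picked dates. The small `sample` frame is an assumption made for this example and is not part of the article's dataset.

```python
import pandas as pd

# Three example dates: a Friday, a Sunday and a Saturday
sample = pd.DataFrame(
    {"date": pd.to_datetime(["2021-01-01", "2021-01-03", "2021-12-25"])}
)

# dayofweek counts from Monday=0 to Sunday=6
sample["day_of_week"] = sample["date"].dt.dayofweek
sample["month"] = sample["date"].dt.month
sample["year"] = sample["date"].dt.year

print(sample)
```

Note that `dayofweek` is zero-based starting on Monday, so weekend days are encoded as 5 and 6.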

Step 3: Feature Engineering

This step creates two lag features, sales_last_week and sales_last_month, which hold the sales of the same store and product 7 and 30 days earlier. The data is grouped by store_id and product_id so that each lag is computed within a single store-product series, and the shift method moves the sales column down by 7 or 30 rows within each group. Because every series has exactly one row per day in date order, a shift of 7 rows corresponds to 7 days and a shift of 30 rows to 30 days. The first 30 days of each series have no 30-day lag, so those rows contain NaN values and are dropped. Lag features like these let the model learn from past sales patterns.

Python

# Create lag features within each store-product series
df['sales_last_week'] = df.groupby(['store_id', 'product_id'])['sales'].shift(7)
df['sales_last_month'] = df.groupby(['store_id', 'product_id'])['sales'].shift(30)

# Drop rows with NaN values (the first 30 days of each series)
df.dropna(inplace=True)

# Inspect the result
df.info()

Output:

<class 'pandas.core.frame.DataFrame'>
Index: 33500 entries, 3000 to 36499
Data columns (total 11 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   date              33500 non-null  datetime64[ns]
 1   store_id          33500 non-null  int64         
 2   product_id        33500 non-null  int64         
 3   sales             33500 non-null  int64         
 4   promo             33500 non-null  int64         
 5   holiday           33500 non-null  int64         
 6   day_of_week       33500 non-null  int32         
 7   month             33500 non-null  int32         
 8   year              33500 non-null  int32         
 9   sales_last_week   33500 non-null  float64       
 10  sales_last_month  33500 non-null  float64       
dtypes: datetime64[ns](1), float64(2), int32(3), int64(5)
memory usage: 2.7 MB
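The following sketch shows on a tiny example how a grouped shift stays inside each group, and checks the row count reported above. The `toy` frame, its values, and the column name `sales_prev_day` are made up for illustration.

```python
import pandas as pd

# Two stores, three days, rows interleaved by date as in the main dataset
toy = pd.DataFrame({
    "date": pd.to_datetime([
        "2021-01-01", "2021-01-01",
        "2021-01-02", "2021-01-02",
        "2021-01-03", "2021-01-03",
    ]),
    "store_id": [1, 2, 1, 2, 1, 2],
    "sales": [10, 100, 11, 101, 12, 102],
})

# Shift by one row within each store: each row gets that store's previous day
toy["sales_prev_day"] = toy.groupby("store_id")["sales"].shift(1)
print(toy)

# Row count after dropping NaNs in the main dataset:
# 30 missing days for each of the 5 * 20 store-product series
dropped = 30 * 5 * 20
remaining = 36500 - dropped
print(dropped, remaining)
```

The first row of each store has no previous day and is therefore NaN, while store 1's values never leak into store 2's lag. The same logic explains why exactly 3,000 rows disappear after `dropna` in the main dataset.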

Step 4: Define Features and Target Variable

The code snippet defines the features and the target variable for the model. The features list contains the predictors the model will learn from; the raw date column is left out because its information is already carried by the derived calendar features. The target is the sales column.

Python

# Define features and target variable
features = ['store_id', 'product_id', 'promo', 'holiday', 'day_of_week', 'month', 'year', 'sales_last_week', 'sales_last_month']
target = 'sales'

Step 5: Split the Dataset

The code snippet demonstrates how to split the data into training and testing sets, which is a crucial step in building a machine learning model.

Python

# Split the data into training and testing sets
X = df[features]
y = df[target]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
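The split above is random, so training and test rows are mixed across the whole year. For forecasting tasks, a chronological split, where the model is trained on earlier dates and tested on later ones, is a common alternative because it mirrors how forecasts are used in practice. The sketch below shows one way to do that; the `frame` data, the 80/20 cutoff rule, and all variable names are assumptions for illustration and are not part of the article's pipeline.

```python
import numpy as np
import pandas as pd

# Toy data: 10 days, 2 stores per day (made-up values)
dates = pd.date_range("2021-01-01", periods=10, freq="D")
frame = pd.DataFrame({
    "date": np.repeat(dates, 2),
    "store_id": np.tile([1, 2], len(dates)),
    "sales": np.arange(20),
})

# Pick the date that separates the first 80% of days from the last 20%
unique_dates = frame["date"].drop_duplicates().sort_values().reset_index(drop=True)
cutoff = unique_dates.iloc[int(len(unique_dates) * 0.8)]

# Everything before the cutoff is training data, the rest is test data
train = frame[frame["date"] < cutoff]
test = frame[frame["date"] >= cutoff]

print(cutoff, len(train), len(test))
```

With this kind of split, every test row lies strictly after every training row, so the evaluation reflects forecasting into the future rather than filling gaps in the past.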

Step 6: Handle Categorical Data

This step converts the listed features to the pandas category dtype. CatBoost handles categorical features natively, so there is no need to one-hot encode them.

Python

# Convert categorical features to categorical data type
categorical_features = ['store_id', 'product_id', 'promo', 'holiday', 'day_of_week', 'month', 'year']

# Work on copies so the assignments below do not trigger pandas' SettingWithCopyWarning
X_train = X_train.copy()
X_test = X_test.copy()

for feature in categorical_features:
    X_train[feature] = X_train[feature].astype('category')
    X_test[feature] = X_test[feature].astype('category')

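The conversion itself prints nothing. For reference, the output below is what df.info() reports right after the lag features are created in Step 3 and before the NaN rows are dropped; note the missing values in the two lag columns: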

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36500 entries, 0 to 36499
Data columns (total 11 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   date              36500 non-null  datetime64[ns]
 1   store_id          36500 non-null  int64         
 2   product_id        36500 non-null  int64         
 3   sales             36500 non-null  int64         
 4   promo             36500 non-null  int64         
 5   holiday           36500 non-null  int64         
 6   day_of_week       36500 non-null  int32         
 7   month             36500 non-null  int32         
 8   year              36500 non-null  int32         
 9   sales_last_week   35800 non-null  float64       
 10  sales_last_month  33500 non-null  float64       
dtypes: datetime64[ns](1), float64(2), int32(3), int64(5)
memory usage: 2.6 MB

Step 7: Initialize and Train CatBoost Model

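We wrap the training and test sets in Pool objects, which tell CatBoost which columns are categorical, and then train a CatBoostRegressor. The model optimizes RMSE, reports MAE on the evaluation set, and stops early if that metric has not improved for 50 iterations. Because the test set doubles as the evaluation set for early stopping here, the reported test metrics can be slightly optimistic; a separate validation set is preferable in practice.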

Python

# Create Pool objects for training and validation
train_pool = Pool(X_train, y_train, cat_features=categorical_features)
test_pool = Pool(X_test, y_test, cat_features=categorical_features)

# Initialize CatBoostRegressor
model = CatBoostRegressor(
    iterations=1000,
    learning_rate=0.1,
    depth=6,
    loss_function='RMSE',
    eval_metric='MAE',
    verbose=100
)

# Train the model
model.fit(train_pool, eval_set=test_pool, early_stopping_rounds=50)

Output:

0:    learn: 3.5340190    test: 3.5190577    best: 3.5190577 (0)    total: 113ms    remaining: 1m 52s
Stopped by overfitting detector  (50 iterations wait)

bestTest = 3.519057696
bestIteration = 0

Shrink model to first 1 iterations.
<catboost.core.CatBoostRegressor at 0x799573d23fa0>
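The log above reports that training was stopped by the overfitting detector after waiting 50 iterations, with the best iteration at 0. The sketch below illustrates the general patience-based rule behind this kind of early stopping. It is a simplified stand-in, not CatBoost's internal code; the function name `best_iteration_with_patience` and the score sequences are hypothetical.

```python
def best_iteration_with_patience(eval_scores, patience):
    """Return (best_index, stop_index) for a lower-is-better metric.

    Training stops once `patience` iterations have passed without
    improving on the best score seen so far.
    """
    best_idx = 0
    best = float("inf")
    for i, score in enumerate(eval_scores):
        if score < best:
            best, best_idx = score, i
        elif i - best_idx >= patience:
            return best_idx, i  # patience exhausted: stop here
    return best_idx, len(eval_scores) - 1  # ran through all iterations

# Hypothetical scores: the first iteration is the best, later ones are worse
never_improves = [3.519] + [3.520 + 0.001 * k for k in range(100)]
result_flat = best_iteration_with_patience(never_improves, patience=50)

# Hypothetical scores that keep improving: no early stop is triggered
keeps_improving = [5.0, 4.0, 3.0, 2.0, 1.0]
result_improving = best_iteration_with_patience(keeps_improving, patience=2)

print(result_flat, result_improving)
```

When the evaluation metric never improves after the first iteration, the rule keeps iteration 0 as the best one and stops once the patience window is used up, which is the pattern visible in the training log.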

Step 8: Make Predictions and Evaluate the Model

This code demonstrates how to make predictions using a trained model, evaluate the model's performance using Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE), and print the results.

Python

# Make predictions
y_pred = model.predict(X_test)

# Evaluate the model
mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))

print(f'Mean Absolute Error (MAE): {mae}')
print(f'Root Mean Squared Error (RMSE): {rmse}')

Output:

Mean Absolute Error (MAE): 3.5706688593058047
Root Mean Squared Error (RMSE): 4.4738763403131685

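The Mean Absolute Error (MAE) of about 3.57 means that, on average, the model's predictions differ from the actual sales by roughly 3.57 units. This metric weights all errors equally regardless of their magnitude. The Root Mean Squared Error (RMSE) of about 4.47 squares the errors before averaging, so larger errors weigh more heavily. Both numbers should be read against how the data was generated: sales were drawn from a Poisson distribution with mean 20, independently of every feature, so the standard deviation of the noise is √20, or about 4.47. The RMSE sits right at that noise floor, which means the model does no better than predicting the average, and this is also why training stopped with the best iteration at 0. On real retail data with seasonality and promotion effects, the calendar and lag features can carry actual signal for the model to learn.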
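To make the two metrics concrete, the sketch below computes them by hand with NumPy on simulated data. It assumes, as in Step 1, that sales follow a Poisson distribution with mean 20, and it uses a hypothetical constant forecast of 20 for every row as a baseline. The sample size and variable names are choices made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated "actual" sales: pure Poisson noise around a mean of 20
y_true = rng.poisson(lam=20, size=200_000)

# Baseline forecast: always predict the true mean
y_pred = np.full(y_true.shape, 20.0)

errors = y_true - y_pred
mae = np.mean(np.abs(errors))          # average absolute error
rmse = np.sqrt(np.mean(errors ** 2))   # square, average, then take the root

print(f"MAE:  {mae:.3f}")
print(f"RMSE: {rmse:.3f}")
print(f"sqrt(20): {np.sqrt(20):.3f}")
```

For this baseline the RMSE lands close to the square root of 20, the standard deviation of the Poisson noise. Comparing a trained model against such a constant baseline is a simple way to judge whether the model has learned anything beyond the average.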

Conclusion

Accurate demand forecasting is crucial in retail for optimizing inventory, reducing costs, and ensuring customer satisfaction. Traditional methods often fall short in capturing the complexities of retail demand. CatBoost, a gradient boosting algorithm developed by Yandex, handles categorical data natively and includes mechanisms that mitigate overfitting, which makes it a strong candidate for retail demand forecasting.

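The implementation shows how to combine calendar and lag features with CatBoost's native handling of categorical features, which avoids one-hot encoding and the extra dimensionality it brings. On the synthetic dataset, the model reached a Mean Absolute Error (MAE) of about 3.57 and a Root Mean Squared Error (RMSE) of about 4.47. Because the synthetic sales were generated as noise that does not depend on any feature, these figures reflect the noise level of the data rather than predictive skill. The same pipeline is best judged on real retail data, where past sales, promotions, and calendar effects can carry usable signal.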
