Demand Forecasting in Retail Using CatBoost
Last Updated : 17 Jun, 2024
In retail, accurate demand forecasting is crucial for optimizing inventory management, minimizing costs, and ensuring customer satisfaction. Traditional forecasting methods often fall short in capturing the complexity and dynamic nature of retail demand. This is where machine learning techniques like CatBoost come into play. CatBoost, a gradient boosting algorithm developed by Yandex, is well-suited for demand forecasting in retail due to its ability to handle categorical data and mitigate overfitting. In this article, we walk through a CatBoost forecasting pipeline on a synthetic retail dataset.
What is CatBoost?
CatBoost stands for "Categorical Boosting," and it is designed to handle categorical features efficiently without extensive preprocessing. Instead of expecting categorical variables to be encoded beforehand, CatBoost encodes them internally using statistics of the target variable. This makes it effective for retail datasets, which often include categorical features such as product categories, store locations, and promotional strategies.
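To make the idea of encoding categories from target statistics more concrete, here is a minimal sketch of an ordered target statistic written in plain pandas. It is a simplified illustration, not CatBoost's actual implementation (which uses several permutations and further refinements); the function name `ordered_target_statistic` and the `prior_weight` parameter are hypothetical names chosen for this example.

```python
import numpy as np
import pandas as pd

def ordered_target_statistic(categories, targets, prior_weight=1.0, seed=0):
    """Encode a categorical column using only the targets of rows that come
    earlier in a random permutation (simplified ordered target statistic)."""
    frame = pd.DataFrame({"cat": categories, "y": targets})
    prior = frame["y"].mean()  # global mean of the target, used as the prior

    # Visit the rows in a random order
    rng = np.random.default_rng(seed)
    shuffled = frame.iloc[rng.permutation(len(frame))]

    grouped = shuffled.groupby("cat")["y"]
    seen_sum = grouped.cumsum() - shuffled["y"]  # targets seen before this row
    seen_count = grouped.cumcount()              # earlier rows in the same category

    encoded = (seen_sum + prior_weight * prior) / (seen_count + prior_weight)
    return encoded.reindex(frame.index)  # restore the original row order

stores = ["A", "B", "A", "A", "B", "C"]
sales = [10, 20, 12, 14, 22, 30]
encoded = ordered_target_statistic(stores, sales)
print(encoded)
```

Because each row is encoded only from rows that precede it in the permutation, a row's own target never leaks into its encoding. A category seen for the first time, such as store "C" here, simply receives the prior.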
Benefits of CatBoost for Retail Demand Forecasting
- Handling Categorical Data: Retail datasets are rich in categorical data. CatBoost's native support for categorical features eliminates the need for one-hot encoding, reducing dimensionality and computational overhead.
- Reduced Overfitting: CatBoost employs ordered boosting, in which the residual for each training example is estimated by a model fitted only on the examples that precede it in a random permutation of the dataset. This limits target leakage and reduces overfitting, which is useful in retail, where demand is noisy and a model can easily fit that noise.
- Speed and Efficiency: CatBoost is optimized for both CPU and GPU, allowing it to train models faster than many other gradient boosting algorithms. This efficiency is critical in retail, where timely forecasts can significantly impact decision-making.
- Robustness: CatBoost's ability to handle missing values and noisy data makes it robust in real-world retail environments where data quality can be inconsistent.
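The dimensionality point in the first bullet can be checked with a small sketch. It assumes a hypothetical toy frame with the same 5 stores and 20 products used later in this article, and compares the number of columns before and after one-hot encoding with `pd.get_dummies`.

```python
import pandas as pd

# Hypothetical toy frame: 5 stores x 20 products, one row per pair
frame = pd.DataFrame(
    [(s, p) for s in range(1, 6) for p in range(1, 21)],
    columns=["store_id", "product_id"],
)

# One-hot encoding creates one column per distinct category value
one_hot = pd.get_dummies(frame, columns=["store_id", "product_id"])

print(frame.shape[1])    # 2 columns when the categories are kept as-is
print(one_hot.shape[1])  # 25 columns after one-hot encoding (5 + 20)
```

With thousands of products, the one-hot version grows by one column per product, while the native representation stays at two columns.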
Code Implementation of Demand Forecasting in Retail Using CatBoost
We will now go through a step-by-step implementation of demand forecasting in retail using CatBoost.
Step 1: Create a Synthetic Dataset
We will create a synthetic dataset which we will be using for our analysis:
Python
import pandas as pd
import numpy as np

# Create a synthetic dataset
np.random.seed(42)
date_range = pd.date_range(start='2021-01-01', end='2021-12-31', freq='D')
store_ids = np.arange(1, 6)     # 5 stores
product_ids = np.arange(1, 21)  # 20 products
holidays = pd.to_datetime(['2021-01-01', '2021-12-25'])

data = []
for date in date_range:
    # Flag New Year's Day and Christmas Day
    holiday = 1 if date in holidays else 0
    for store_id in store_ids:
        for product_id in product_ids:
            sales = np.random.poisson(lam=20)
            promo = np.random.choice([0, 1])
            data.append([date, store_id, product_id, sales, promo, holiday])

df = pd.DataFrame(data, columns=['date', 'store_id', 'product_id', 'sales', 'promo', 'holiday'])
df.to_csv('synthetic_retail_sales_data.csv', index=False)
df.info()
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36500 entries, 0 to 36499
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 date 36500 non-null datetime64[ns]
1 store_id 36500 non-null int64
2 product_id 36500 non-null int64
3 sales 36500 non-null int64
4 promo 36500 non-null int64
5 holiday 36500 non-null int64
Step 2: Load and Preprocess the Data
- This step imports CatBoostRegressor and Pool from catboost, along with train_test_split and the error metrics from scikit-learn, all of which are used in later steps.
- It then extracts the day of the week, month, and year from the date column and adds them as new features, which can help the model capture temporal patterns. Because the dataset covers only 2021, year is constant here and carries no information; it becomes useful on multi-year data.
Python
from catboost import CatBoostRegressor, Pool
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Create additional time-based features
df['day_of_week'] = df['date'].dt.dayofweek
df['month'] = df['date'].dt.month
df['year'] = df['date'].dt.year
df.info()
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36500 entries, 0 to 36499
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 date 36500 non-null datetime64[ns]
1 store_id 36500 non-null int64
2 product_id 36500 non-null int64
3 sales 36500 non-null int64
4 promo 36500 non-null int64
5 holiday 36500 non-null int64
6 day_of_week 36500 non-null int32
7 month 36500 non-null int32
8 year 36500 non-null int32
dtypes: datetime64[ns](1), int32(3), int64(5)
memory usage: 2.1 MB
Step 3: Feature Engineering
This step creates two lag features, sales_last_week and sales_last_month, which hold the sales of the same store and product combination 7 and 30 days earlier. The data is grouped by store_id and product_id so that the lags are computed within each series, and the shift method moves the sales values down by 7 and 30 rows. Because the dataset has exactly one row per day for each series and is already in date order, a shift of 7 rows equals a lag of 7 days. The first 30 rows of each series have no value for sales_last_month, so dropping the NaN rows removes 3,000 of the 36,500 rows. These lag features let the model learn from past sales patterns.
Python
# Create lag features within each store/product series
# (rows are already in date order, one row per day per series)
df['sales_last_week'] = df.groupby(['store_id', 'product_id'])['sales'].shift(7)
df['sales_last_month'] = df.groupby(['store_id', 'product_id'])['sales'].shift(30)

# Drop rows with NaN values
df.dropna(inplace=True)
df.info()
Output:
<class 'pandas.core.frame.DataFrame'>
Index: 33500 entries, 3000 to 36499
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 date 33500 non-null datetime64[ns]
1 store_id 33500 non-null int64
2 product_id 33500 non-null int64
3 sales 33500 non-null int64
4 promo 33500 non-null int64
5 holiday 33500 non-null int64
6 day_of_week 33500 non-null int32
7 month 33500 non-null int32
8 year 33500 non-null int32
9 sales_last_week 33500 non-null float64
10 sales_last_month 33500 non-null float64
dtypes: datetime64[ns](1), float64(2), int32(3), int64(5)
memory usage: 2.7 MB
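Since shift works on row positions rather than calendar dates, the lag features are only correct when each series is sorted by date with one row per day. The sketch below illustrates this on a hypothetical toy frame that is deliberately unsorted; the column name `sales_prev_day` is introduced only for this example.

```python
import pandas as pd

# Hypothetical toy data: 2 stores, 1 product, 4 days, deliberately unsorted
toy = pd.DataFrame({
    "date": pd.to_datetime(
        ["2021-01-02", "2021-01-01", "2021-01-03", "2021-01-04"] * 2
    ),
    "store_id": [1] * 4 + [2] * 4,
    "product_id": [1] * 8,
    "sales": [11, 10, 12, 13, 21, 20, 22, 23],
})

# shift() moves values by rows, so sort by date within each series first
toy = toy.sort_values(["store_id", "product_id", "date"]).reset_index(drop=True)

# A lag of one row equals a lag of one day: each series has one row per day
toy["sales_prev_day"] = toy.groupby(["store_id", "product_id"])["sales"].shift(1)

print(toy)
```

The first row of each store has no previous day, so its lag is NaN, and no value crosses from one store's series into the other.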
Step 4: Define Features and Target Variable
The code snippet defines the features and target variable for a machine learning model. The features list includes a set of predictors that the model will use to learn and make predictions.
Python
# Define features and target variable
features = ['store_id', 'product_id', 'promo', 'holiday', 'day_of_week',
            'month', 'year', 'sales_last_week', 'sales_last_month']
target = 'sales'
Step 5: Split the Dataset
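The data is split into training and testing sets, with 20% of the rows held out for testing. Note that train_test_split shuffles the rows, so the test set contains dates interleaved with the training dates. For a real forecasting task, holding out the most recent period instead gives a more honest estimate of how the model performs on future demand.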
Python
# Split the data into training and testing sets
X = df[features]
y = df[target]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
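As an alternative to a random split, a chronological split can be sketched as follows. This is one possible approach, not the split used in this article; the small `frame` below is a hypothetical stand-in for the article's DataFrame, and the 80% cutoff is an illustrative choice.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the article's DataFrame: one row per day
rng = np.random.default_rng(42)
dates = pd.date_range("2021-01-01", "2021-12-31", freq="D")
frame = pd.DataFrame({"date": dates, "sales": rng.poisson(lam=20, size=len(dates))})

# Hold out the most recent 20% of dates instead of a random 20% of rows
unique_dates = frame["date"].drop_duplicates().sort_values().reset_index(drop=True)
cutoff = unique_dates.iloc[int(len(unique_dates) * 0.8)]

train = frame[frame["date"] < cutoff]
test = frame[frame["date"] >= cutoff]

print(train["date"].max(), test["date"].min())
```

Every test date is later than every training date, which mirrors how the model would be used in practice: trained on the past and asked about the future.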
Step 6: Handle Categorical Data
The code converts specified features to categorical data types, which can be particularly beneficial when working with certain machine learning models like CatBoost, which can natively handle categorical features without needing to one-hot encode them.
Python
# Convert categorical features to categorical data type
categorical_features = ['store_id', 'product_id', 'promo', 'holiday',
                        'day_of_week', 'month', 'year']

# Work on copies to avoid pandas' SettingWithCopyWarning
X_train = X_train.copy()
X_test = X_test.copy()

for feature in categorical_features:
    X_train[feature] = X_train[feature].astype('category')
    X_test[feature] = X_test[feature].astype('category')
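For reference, the following df.info() output was captured after creating the lag features but before dropping the NaN rows. It shows the missing lag values: 700 for sales_last_week and 3,000 for sales_last_month.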
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36500 entries, 0 to 36499
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 date 36500 non-null datetime64[ns]
1 store_id 36500 non-null int64
2 product_id 36500 non-null int64
3 sales 36500 non-null int64
4 promo 36500 non-null int64
5 holiday 36500 non-null int64
6 day_of_week 36500 non-null int32
7 month 36500 non-null int32
8 year 36500 non-null int32
9 sales_last_week 35800 non-null float64
10 sales_last_month 33500 non-null float64
dtypes: datetime64[ns](1), float64(2), int32(3), int64(5)
memory usage: 2.6 MB
Step 7: Initialize and Train CatBoost Model
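Wrap the training and test data in Pool objects, declaring which columns are categorical, then initialize the CatBoostRegressor and train it with early stopping. Note that the test set is passed as eval_set here, so it also determines when training stops.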
Python
# Create Pool objects for training and validation
train_pool = Pool(X_train, y_train, cat_features=categorical_features)
test_pool = Pool(X_test, y_test, cat_features=categorical_features)

# Initialize CatBoostRegressor
model = CatBoostRegressor(
    iterations=1000,
    learning_rate=0.1,
    depth=6,
    loss_function='RMSE',
    eval_metric='MAE',
    verbose=100
)

# Train the model
model.fit(train_pool, eval_set=test_pool, early_stopping_rounds=50)
Output:
0: learn: 3.5340190 test: 3.5190577 best: 3.5190577 (0) total: 113ms remaining: 1m 52s
Stopped by overfitting detector (50 iterations wait)
bestTest = 3.519057696
bestIteration = 0
Shrink model to first 1 iterations.
<catboost.core.CatBoostRegressor at 0x799573d23fa0>
Step 8: Make Predictions and Evaluate the Model
This code demonstrates how to make predictions using a trained model, evaluate the model's performance using Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE), and print the results.
Python
# Make predictions
y_pred = model.predict(X_test)

# Evaluate the model
mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))

print(f'Mean Absolute Error (MAE): {mae}')
print(f'Root Mean Squared Error (RMSE): {rmse}')
Output:
Mean Absolute Error (MAE): 3.5706688593058047
Root Mean Squared Error (RMSE): 4.4738763403131685
The MAE of about 3.57 means that, on average, the model's predictions differ from actual sales by roughly 3.57 units; it weights all errors equally regardless of their magnitude. The RMSE of about 4.47 squares the errors before averaging, so larger errors have a greater impact. These scores should be read in context: the synthetic sales are drawn independently from a Poisson distribution with mean 20, so they do not depend on any feature, and the RMSE is essentially the standard deviation of that noise (√20 ≈ 4.47). This is also why the training log reports bestIteration = 0: additional trees did not reduce the evaluation error. On real retail data, where promotions, holidays, and seasonality affect demand, the same pipeline has signal to learn from.
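A useful sanity check for any forecast score is a baseline that ignores the features entirely. The sketch below assumes the same generating process as the synthetic dataset (independent Poisson sales with mean 20) and scores a constant prediction of 20; the sample size of 100,000 is an arbitrary choice for a stable estimate.

```python
import numpy as np

rng = np.random.default_rng(42)

# Same generating process as the synthetic dataset: i.i.d. Poisson(20) sales
sales = rng.poisson(lam=20, size=100_000)

# Baseline that ignores all features and always predicts the true mean
baseline_pred = np.full(sales.shape, 20.0)

errors = sales - baseline_pred
mae = np.mean(np.abs(errors))         # average absolute error
rmse = np.sqrt(np.mean(errors ** 2))  # square root of the mean squared error

print(f"Baseline MAE:  {mae:.2f}")
print(f"Baseline RMSE: {rmse:.2f}")
print(f"sqrt(20):      {np.sqrt(20):.2f}")
```

The baseline RMSE comes out close to √20, and the baseline MAE lands near the figure reported above, which suggests the trained model is doing about as well as a constant prediction on this particular dataset.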
Conclusion
Accurate demand forecasting is crucial in retail for optimizing inventory, reducing costs, and ensuring customer satisfaction. Traditional methods often fall short in capturing the complexities of retail demand. CatBoost, a gradient boosting algorithm developed by Yandex, effectively handles categorical data and mitigates overfitting, making it ideal for retail demand forecasting.
The implementation shows how temporal features, lag features, and native categorical handling fit together in a CatBoost pipeline, and how that native handling avoids the extra dimensions and computational overhead of one-hot encoding. On this synthetic dataset, the model reaches an MAE of about 3.57 and an RMSE of about 4.47, which reflects the noise level of the randomly generated sales rather than learned structure. The value of the approach shows on real data with genuine demand patterns.