An ensemble of trees is an efficient technique for combining multiple weak learners into a strong learner. The core idea is to aggregate the results of many individual trees, each of which may perform poorly on its own. The aggregation offsets the weaknesses of the individual trees and therefore improves overall performance. Scikit-Learn provides several ensemble algorithms that also estimate the importance of each feature, which can then be used for feature transformation.
Scikit-Learn is a popular machine learning library that can be used for feature engineering and for implementing various machine learning algorithms. It also provides a comprehensive set of tools for working with ensembles of trees. In this article, we discuss two main tree-based ensemble techniques.
What is an Ensemble of Trees?
Let's say you are training a model on a huge dataset. You have cleaned the data, applied the appropriate feature engineering techniques, and selected the best features. You then tune the hyperparameters and settle on the most optimal set of values. However, the model still does not meet your accuracy requirements: even with everything in place, you don't get sufficiently accurate results. Such a model is called a weak learner.
If you still want a model with high accuracy, you need a way to turn this weak learner into a strong learner. This is where an ensemble of trees comes into the picture. The word ensemble means "a group of things taken as a whole". In a tree ensemble, we have multiple decision trees, each receiving the same input and producing an output. Finally, the aggregate of all the trees' outputs is taken, which makes the model a strong learner.
In general, there are two types of ensemble techniques: bagging and boosting.
- Bagging - works by training different models on different subsets of the training data and then taking a majority vote (for classification) or an average (for regression) of their predictions.
- Boosting - works by chaining weak learners sequentially, so that each model tries to improve on the errors of the previous one.
In both cases, we end up with a model whose accuracy is higher than that of any individual learner. A minimal sketch of both approaches is shown below.
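As a quick, hedged sketch of the two families, here is one way to try them with scikit-learn's BaggingClassifier and AdaBoostClassifier; the small synthetic dataset and the parameter values are assumptions made purely for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import train_test_split

# Small synthetic dataset, assumed only for this illustration
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Bagging: independent models trained on bootstrap samples, predictions aggregated by voting
bagging = BaggingClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)

# Boosting: models trained sequentially, each focusing on the previous ones' mistakes
boosting = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)

print("Bagging accuracy :", bagging.score(X_test, y_test))
print("Boosting accuracy:", boosting.score(X_test, y_test))
```

Both wrappers default to decision trees as their base learners, which is why they are a natural fit for tree ensembles.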
How do ensemble models work?
Both ensemble models work by combining the predictions of multiple trees; however, there are slight differences. In a Gradient Boosting Classifier, each tree is built iteratively, i.e. sequentially, based on the aggregate performance of the trees built so far. In a Random Forest Classifier, each tree is built independently and the results of all trees are then aggregated to produce the output.
We know that in both cases decision trees are built. A decision tree consists of internal nodes and leaf nodes; when an input x is passed through the tree, it ends up in exactly one leaf node. If you have trained a model with N decision trees, each of those trees has multiple leaf nodes.

For a given input, only one leaf node in each of these trees is activated. If we one-hot encode the leaf indices, then for each tree we get a vector in which the value is 1 for the activated leaf node and 0 for the rest. Hence, for N trees in a trained model you get N one-hot encoded vectors, and concatenating them gives the transformed feature vector. This transformation effectively turns a dense feature vector into a sparse one; a minimal sketch of the idea follows.
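Here is a minimal, self-contained sketch of that idea using RandomForestClassifier.apply() (which returns the leaf index each sample lands in for every tree) together with OneHotEncoder; the dataset size and tree parameters are assumptions for illustration only:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import OneHotEncoder

# Small forest on a synthetic dataset, assumed only for this illustration
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
forest = RandomForestClassifier(n_estimators=5, max_depth=3, random_state=0).fit(X, y)

# apply() returns, for each sample, the index of the leaf it lands in within each tree
leaf_indices = forest.apply(X)                       # shape: (n_samples, n_trees)

# One-hot encode the leaf indices: one 1 per tree, 0 elsewhere
encoder = OneHotEncoder()
leaf_features = encoder.fit_transform(leaf_indices)  # sparse matrix

print("Original dense shape:    ", X.shape)
print("Transformed sparse shape:", leaf_features.shape)
```

The result is a sparse matrix with one column per leaf across all trees, which is exactly the concatenated one-hot representation described above.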
Random Forest Classifier
The working of a random forest classifier is quite simple. In a plain decision tree algorithm, you have a single decision tree through which you pass the input to get the output. With a random forest classifier, you have many such decision trees, and each tree receives a random subset of the data points and a random subset of the features. When input data is passed to these different decision trees, we get an output from each of them.
For classification problems, the majority vote of the trees is taken; for regression problems, the outputs are averaged. Since each tree has seen a different subset of data points and features, the aggregated prediction is more accurate and less prone to overfitting, so the model captures the underlying pattern rather than memorising the training data.
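To make the aggregation concrete, here is a small sketch that collects the predictions of the individual trees of a fitted forest and takes a majority vote. Note this is only an approximation: scikit-learn's RandomForestClassifier actually averages the trees' predicted class probabilities rather than counting hard votes. The dataset and forest size below are assumed for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Fit a small forest on a synthetic dataset (illustration only)
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X, y)

# Each tree in estimators_ produces its own prediction for every sample
per_tree_preds = np.array([tree.predict(X) for tree in forest.estimators_])  # (n_trees, n_samples)

# Majority vote across trees (for binary 0/1 labels, the rounded mean is the vote)
majority_vote = np.round(per_tree_preds.mean(axis=0)).astype(int)

# Compare with the forest's own prediction (which averages probabilities instead)
agreement = (majority_vote == forest.predict(X)).mean()
print("Agreement between majority vote and forest.predict:", agreement)
```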
Let's use Random Forest Classifier for feature transformation.
Importing Libraries and Splitting Data
Firstly, we create a random dataset using the make_classification method and then split it into train and test data.
```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Create a synthetic dataset for illustration
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```
Implementing Random Forest Classifier
```python
from sklearn.ensemble import RandomForestClassifier

# Train a Random Forest classifier
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)
rf_classifier.fit(X_train, y_train)
```
When we create the random forest classifier we pass n_estimators=100, which sets the number of trees in the forest to 100. random_state controls the randomness of the bootstrap sampling of data points and the sampling of features used while building the trees.
Feature Importance
```python
# Get feature importances
feature_importances = rf_classifier.feature_importances_

# Use feature importances for feature selection or transformation
selected_features_X_train = X_train[:, feature_importances > 0.02]
selected_features_X_test = X_test[:, feature_importances > 0.02]
print(feature_importances)
```
Output:
```
[0.01695078 0.10181962 0.02108998 0.01368187 0.01554572 0.36313417
0.02333904 0.01519547 0.01549653 0.0158737 0.01915803 0.02569037
0.01922067 0.01651946 0.07557534 0.01734487 0.01852994 0.01427792
0.17725206 0.01430447]
```
We can use the feature_importances_ property of RandomForestClassifier to find out which features are important. In our case, we keep only the features whose feature_importances_ value is greater than 0.02, which trims the feature set from 20 columns down to 7, i.e. only 7 significant features are used for classification.
When you print feature_importances you get an array containing the importance of each feature for the classification. The importance of a feature is computed from how much that feature contributes to reducing impurity across the decision trees.
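As a small follow-up, you can rank the features by importance and see which ones clear the 0.02 threshold; this sketch assumes the feature_importances array computed above:

```python
import numpy as np

# Rank features from most to least important
ranking = np.argsort(feature_importances)[::-1]
for idx in ranking:
    marker = "selected" if feature_importances[idx] > 0.02 else "dropped"
    print(f"feature {idx:2d}  importance = {feature_importances[idx]:.4f}  ({marker})")
```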
Logistic Regression
You can train a Logistic Regression model directly on the split dataset, and also on the dataset we just produced through feature transformation.
```python
from sklearn.linear_model import LogisticRegression

model1 = LogisticRegression()
model1.fit(X_train, y_train)

model2 = LogisticRegression()
model2.fit(selected_features_X_train, y_train)
```
This code trains two models using scikit-learn's LogisticRegression. The first model (model1) is trained on the full feature set (X_train and y_train), while the second (model2) is trained only on the selected features (selected_features_X_train and y_train). Restricting the second model to the important predictors is effectively a feature selection step driven by the random forest.
Evaluation
```python
score1 = model1.score(X_test, y_test)
score2 = model2.score(selected_features_X_test, y_test)

print("Mean accuracy without ensemble of tree: ", score1)
print("Mean accuracy with ensemble of tree: ", score2)
```
Output:
```
Mean accuracy without ensemble of tree: 0.855
Mean accuracy with ensemble of tree: 0.875
```
Comparing the mean accuracy scores, accuracy improves by 0.02 after the feature transformation.
Since this is a synthetic dataset we created ourselves, you may not notice a large difference in accuracy. On real datasets, depending on the use case, Random Forest based feature transformation can be a genuinely helpful technique.
Gradient Boosting Classifier
In a gradient boosting classifier, each model tries to reduce the errors of the preceding one. The models are arranged sequentially so that the final output has high accuracy. Importantly, rather than refitting the same model, each stage fits a new model to the residual errors made by the previous one, which is what boosts the overall performance.
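To illustrate the residual-fitting idea, here is a minimal hand-rolled boosting loop for a regression problem using DecisionTreeRegressor. This is only a sketch of the principle, not the exact algorithm scikit-learn implements, and the dataset, learning rate, and number of rounds are assumptions for illustration:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# Tiny regression problem, assumed only for this illustration
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)

learning_rate = 0.1
prediction = np.full_like(y, y.mean())   # start from a trivial constant model
trees = []

for _ in range(50):
    residuals = y - prediction                         # errors of the ensemble so far
    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, residuals)
    prediction += learning_rate * tree.predict(X)      # each new tree corrects the residuals
    trees.append(tree)

print("Mean squared error after boosting:", np.mean((y - prediction) ** 2))
```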
To get started with scikit-learn's implementation, you need to import the following from sklearn.
Importing Libraries
```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
```
Generating and Splitting Data
```python
# Create a synthetic dataset for illustration
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```
We generate a synthetic dataset using make_classification() and then split it into training and test sets.
Training a Gradient Boosting Classifier
```python
# Train a Gradient Boosted Trees classifier
gb_classifier = GradientBoostingClassifier(n_estimators=100, random_state=42)
gb_classifier.fit(X_train, y_train)
```
When you create a GradientBoostingClassifier you can specify n_estimators, which is the number of boosting stages to perform. By default, n_estimators is set to 100.
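If you want to see how performance evolves with the number of boosting stages, staged_predict yields the ensemble's predictions after each stage; this sketch assumes the gb_classifier and the train/test split defined above:

```python
from sklearn.metrics import accuracy_score

# Test accuracy of the ensemble after each boosting stage
stage_accuracies = [
    accuracy_score(y_test, y_pred)
    for y_pred in gb_classifier.staged_predict(X_test)
]
print("Accuracy after  10 stages:", stage_accuracies[9])
print("Accuracy after 100 stages:", stage_accuracies[-1])
```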
Feature Importance
```python
# Get feature importances
feature_importances = gb_classifier.feature_importances_

# Use feature importances for feature selection or transformation
selected_features_X_train = X_train[:, feature_importances > 0.02]
selected_features_X_test = X_test[:, feature_importances > 0.02]
print(feature_importances)
```
Output:
```
[0.00528748 0.01243391 0.01087386 0.00099458 0.00248339 0.66505179
0.01199457 0.01062141 0.00485092 0.00563588 0.00230154 0.01865276
0.01184804 0.00581178 0.17462893 0.00929922 0.00624585 0.00471869
0.03393672 0.00232869]
```
You can inspect the importance of each feature through the feature_importances_ property, which returns an array with one importance value per feature; a higher value means the feature matters more. We then keep only the features whose importance is greater than 0.02.
Training the Models
```python
# Logistic Regression without Gradient Boosting Classifier
model1 = LogisticRegression()
model1.fit(X_train, y_train)

# Logistic Regression with Gradient Boosting Classifier
model2 = LogisticRegression()
model2.fit(selected_features_X_train, y_train)
```
Now we train two logistic regression models: one on the full feature set and one on the features selected using the gradient boosting classifier's importances.
Evaluation
```python
score1 = model1.score(X_test, y_test)
score2 = model2.score(selected_features_X_test, y_test)

print("Mean accuracy without ensemble of tree: ", score1)
print("Mean accuracy with ensemble of tree: ", score2)
```
Output:
```
Mean accuracy without ensemble of tree: 0.855
Mean accuracy with ensemble of tree: 0.865
```
Comparing the two models by their mean accuracy, we again see that accuracy improves when we apply the feature transformation based on the GradientBoostingClassifier.
ROC Curve
Now that we have used both ensemble techniques, let's compare the two classifiers using the ROC curve.
```python
from sklearn.metrics import roc_curve, roc_auc_score
import matplotlib.pyplot as plt

# Predict probabilities for positive class
y_probs_rf = rf_classifier.predict_proba(X_test)[:, 1]
y_probs_gb = gb_classifier.predict_proba(X_test)[:, 1]

# Calculate ROC curve
fpr_rf, tpr_rf, _ = roc_curve(y_test, y_probs_rf)
fpr_gb, tpr_gb, _ = roc_curve(y_test, y_probs_gb)

# Calculate AUC (Area Under the Curve) score
auc_rf = roc_auc_score(y_test, y_probs_rf)
auc_gb = roc_auc_score(y_test, y_probs_gb)

# Plot ROC curve
plt.figure(figsize=(4, 4))
plt.plot(fpr_rf, tpr_rf, color='blue', label=f'Random Forest (AUC = {auc_rf:.2f})')
plt.plot(fpr_gb, tpr_gb, color='orange', label=f'Gradient Boosting (AUC = {auc_gb:.2f})')
plt.plot([0, 1], [0, 1], color='gray', linestyle='--')  # Diagonal line for reference
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve for Random Forest and Gradient Boosting')
plt.legend()
plt.show()
```
Output:
ROC curve comparing Random Forest and Gradient Boosting (plot produced by the code above).
This code evaluates the Receiver Operating Characteristic (ROC) curves and Area Under the Curve (AUC) scores for the two classifiers, Random Forest (rf_classifier) and Gradient Boosting (gb_classifier). It first predicts the positive-class probabilities on the test set with each classifier, then computes the False Positive Rate (fpr) and True Positive Rate (tpr) for each ROC curve. The AUC scores quantify how well each classifier separates the two classes. Finally, both ROC curves are plotted together with a diagonal reference line and their respective AUC scores.
Conclusion
To sum up, combining tree ensembles with feature transformations in Scikit-Learn is a powerful way to improve predictive modeling. Methods like Random Forests and Gradient Boosting tap into the combined strength of many decision trees, which lets them handle complex relationships in the data effectively. Ensemble methods also identify feature importance naturally, which helps select the relevant predictors automatically; this improves model accuracy and offers insight into how much each feature matters. By integrating feature transformations with tree-based ensembles, Scikit-Learn provides a solid framework for handling intricate patterns across a variety of datasets, leading to machine learning solutions that are both more interpretable and more efficient.