Feature Transformations with Ensembles of Trees in Scikit Learn

Last Updated : 17 Dec, 2023

An ensemble of trees is an effective technique for combining multiple weak learners into a single strong learner. The core idea is to aggregate the results of many individual trees, none of which may perform well on its own. The aggregation offsets the weaknesses of the individual trees and therefore improves overall performance. Scikit-Learn provides several algorithms that estimate the importance of each feature, which can then be used for feature transformation.

Scikit-Learn is a popular machine learning library that provides tools for feature engineering and implementations of a wide range of machine learning algorithms, including a comprehensive set of tools for working with ensembles of trees. In this article, we discuss two main tree-based ensemble techniques.

What is an Ensemble of Trees?

Say you are training a model on a huge dataset: you have cleaned the data, applied the appropriate feature engineering techniques, and selected the best features. You have tuned the hyperparameters and settled on the most optimal set of parameters. Yet the model still does not meet your accuracy requirements; even with everything in place, it does not produce the most accurate results. Such a model is called a weak learner.

If you still insist on a model with high accuracy, you need something that turns this weak learner into a strong one. This is where the ensemble of trees comes into the picture. The word ensemble means "a group of things taken as a whole". In a tree ensemble, we have multiple decision trees, each receiving the same input and producing its own output. Finally, the outputs of all the trees are aggregated, which turns the model into a strong learner.

In general, there are two types of ensemble techniques: bagging and boosting.

  • Bagging - Trains each model on a different subset of the training data and then takes a majority vote (for classification) or the average (for regression) of their predictions.
  • Boosting - Connects the weak learners sequentially, so that the output of one model is passed to the next model for improvement.

In both cases, we end up with a model of noticeably higher accuracy. A minimal sketch of both styles follows.
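
To make the distinction concrete, here is a minimal sketch using scikit-learn's BaggingClassifier and AdaBoostClassifier, both of which default to decision-tree base learners. These two estimators are illustrative only and are not used in the rest of the article.

Python3
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Bagging: each tree is trained on a bootstrap sample; predictions are combined by vote
bagging = BaggingClassifier(n_estimators=50, random_state=42).fit(X_train, y_train)

# Boosting: trees are added one after another, each correcting its predecessors
boosting = AdaBoostClassifier(n_estimators=50, random_state=42).fit(X_train, y_train)

print("Bagging accuracy :", bagging.score(X_test, y_test))
print("Boosting accuracy:", boosting.score(X_test, y_test))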

How do ensemble models work?

Both ensemble models work by combining the predictions of multiple trees, but there are slight differences. In a Gradient Boosting Classifier, each tree is built sequentially, based on the aggregate performance of the trees built so far. In a Random Forest Classifier, each tree is built independently, and the results of all the trees are aggregated to produce the output.

In both cases, decision trees are built. A decision tree consists of internal nodes and leaf nodes. When an input x is passed through the tree, it ends up in exactly one leaf node. If you have trained the model with N decision trees, each decision tree will have multiple leaf nodes, as shown below:

[Figure: an ensemble of N trees, each with multiple leaf nodes]

From each of these trees, exactly one leaf node is activated per input. If we apply a one-hot encoder, each tree yields a vector whose value is 1 for the activated leaf node and 0 for all the other leaf nodes. Hence, for a trained model with N trees, you get N one-hot encoded vectors. Concatenating all these one-hot encoded vectors gives the transformed feature vector. This transformation thus turns a dense feature vector into a sparse one.
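
The examples later in this article work with feature importances, but the leaf-node transformation described above can also be computed directly. Here is a minimal sketch: apply() returns, for every sample, the index of the leaf it reaches in each of the trees, and one-hot encoding those indices produces the sparse concatenated vector.

Python3
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import OneHotEncoder

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
forest = RandomForestClassifier(n_estimators=10, random_state=42).fit(X, y)

# Shape (n_samples, n_trees): the leaf index each sample lands in, per tree
leaf_indices = forest.apply(X)

# One block of columns per tree; each block contains a single 1 (the active leaf)
encoder = OneHotEncoder(handle_unknown="ignore")
X_transformed = encoder.fit_transform(leaf_indices)

print(X.shape, "->", X_transformed.shape)  # dense 20 features -> sparse leaf features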

Random Forest Classifier

The working of the random forest classifier is quite simple. In a plain decision tree algorithm, you have a single decision tree through which you pass the input to get the output. A random forest classifier contains many such decision trees. Each tree is trained on a random bootstrap sample of the data points, and a random subset of the features is considered at each split. When input data is passed to the different decision trees, we get an output from each tree.

For classification problems, a majority vote is taken; for regression problems, the outputs are averaged. Since each tree has seen a somewhat different set of data points and features, the aggregated prediction is more accurate and far less prone to overfitting: the model captures the underlying pattern rather than the noise.
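
As an illustration (not part of the original walkthrough), the voting can be reproduced by hand from the fitted trees in estimators_. Note that scikit-learn's forest actually averages the trees' predicted probabilities rather than counting hard votes, so this manual vote can differ slightly from forest.predict.

Python3
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
forest = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)

# Each individual tree's class prediction: shape (n_trees, n_samples)
per_tree = np.stack([tree.predict(X) for tree in forest.estimators_])

# Majority vote across the trees (binary labels 0/1)
majority_vote = (per_tree.mean(axis=0) > 0.5).astype(int)
print("agreement with forest.predict:", np.mean(majority_vote == forest.predict(X)))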

Let's use Random Forest Classifier for feature transformation.

Importing Libraries and Splitting Data

First, we create a synthetic dataset using the make_classification method and then split it into training and test sets.

Python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Create a synthetic dataset for illustration
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Implementing Random Forest Classifier

Python3
from sklearn.ensemble import RandomForestClassifier

# Train a Random Forest classifier
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)
rf_classifier.fit(X_train, y_train)

When we create the random forest classifier we pass n_estimators=100, which sets the number of trees in the forest to 100. random_state controls the randomness of the bootstrap sampling used while building the trees and of the feature sampling.

Feature Importance

Python3
# Get feature importances
feature_importances = rf_classifier.feature_importances_

# Use feature importances for feature selection or transformation
selected_features_X_train = X_train[:, feature_importances > 0.02]
selected_features_X_test = X_test[:, feature_importances > 0.02]

print(feature_importances)

Output:

[0.01695078 0.10181962 0.02108998 0.01368187 0.01554572 0.36313417
0.02333904 0.01519547 0.01549653 0.0158737 0.01915803 0.02569037
0.01922067 0.01651946 0.07557534 0.01734487 0.01852994 0.01427792
0.17725206 0.01430447]

We can use the feature_importances_ property of RandomForestClassifier to find out which features are important. Here, we select only the features whose feature_importances_ value is greater than 0.02. This trims our columns from 20 down to 7, i.e. only 7 significant features are kept for classification.

When you print feature_importances you get an array containing the importance of each feature in the classification. The importance of a feature is calculated from how much that feature contributes to reducing impurity across the decision trees.
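
As an alternative to the manual threshold above (not part of the original code), scikit-learn's SelectFromModel wraps the same idea; with prefit=True it reuses the already-fitted rf_classifier instead of refitting it.

Python3
from sklearn.feature_selection import SelectFromModel

# Keep the features whose importance exceeds 0.02, reusing the fitted model
selector = SelectFromModel(rf_classifier, threshold=0.02, prefit=True)
X_train_sel = selector.transform(X_train)
X_test_sel = selector.transform(X_test)

print(X_train.shape, "->", X_train_sel.shape)  # 20 features -> the 7 above the threshold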

Logistic Regression

You can train a Logistic Regression model both on the original split dataset and on the dataset to which we just applied the feature transformation.

Python3
from sklearn.linear_model import LogisticRegression

model1 = LogisticRegression()
model1.fit(X_train, y_train)

model2 = LogisticRegression()
model2.fit(selected_features_X_train, y_train)

This code trains two models using scikit-learn's LogisticRegression. The first model (model1) is trained on the entire feature set (X_train and y_train), while the second model (model2) is trained on the selected subset of features (selected_features_X_train and y_train). Training on selected features aims to improve performance by concentrating on the most informative predictors; in effect, it is a feature selection / dimensionality reduction step.

Evaluation

Python3
score1 = model1.score(X_test, y_test)
score2 = model2.score(selected_features_X_test, y_test)

print("Mean accuracy without ensemble of tree: ", score1)
print("Mean accuracy with ensemble of tree: ", score2)

Output:

Mean accuracy without ensemble of tree:  0.855
Mean accuracy with ensemble of tree: 0.875

Comparing the mean accuracy scores, we see an improvement of 0.02 after performing the feature transformation.

Since this is a synthetic dataset we created ourselves, you may not notice a dramatic difference in accuracy. On real datasets, depending on the use case, you can expect Random Forest to be a very helpful technique.

Gradient Boosting Classifier

In a gradient boosting classifier, each model tries to reduce the errors of the preceding model. The models are arranged sequentially so that the final output has high accuracy. It is important to note that, rather than refitting the same model, each stage fits a new model to the residual errors made by the previous one. This is what boosts the overall performance.
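
To see the residual-fitting idea in isolation, here is a simplified regression sketch. Boosting for classification works on gradients of the loss, but squared-error residuals make the mechanism easiest to follow; this toy loop is an illustration, not GradientBoostingClassifier's exact implementation.

Python3
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(42)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=200)

# Stage 0: start from the mean prediction
prediction = np.full_like(y, y.mean())

# Each stage fits a small tree to the current residuals and adds a
# scaled copy of its output to the running prediction (learning rate 0.1)
for _ in range(50):
    residuals = y - prediction
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
    prediction += 0.1 * tree.predict(X)

print("training MSE:", np.mean((y - prediction) ** 2))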

To get started, you need the following imports from sklearn.

Importing Libraries

Python3
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

Generating and Splitting Data

Python3
# Create a synthetic dataset for illustration
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

We generate a synthetic dataset using make_classification() and then split it into training and test sets.

Training a Gradient Boosting Classifier

Python3
# Train a Gradient Boosted Trees classifier
gb_classifier = GradientBoostingClassifier(n_estimators=100, random_state=42)
gb_classifier.fit(X_train, y_train)

When you create a GradientBoostingClassifier model, you can specify n_estimators, the number of boosting stages to perform. By default, n_estimators is set to 100.
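
Because the stages are sequential, you can watch the model improve stage by stage. A small sketch (assuming the gb_classifier and the train/test split defined above) using staged_predict:

Python3
import numpy as np

# Test accuracy after each boosting stage
staged_accuracy = [np.mean(stage_pred == y_test)
                   for stage_pred in gb_classifier.staged_predict(X_test)]

print("after 1 stage   :", staged_accuracy[0])
print("after 100 stages:", staged_accuracy[-1])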

Feature importance

Python3
# Get feature importances
feature_importances = gb_classifier.feature_importances_

# Use feature importances for feature selection or transformation
selected_features_X_train = X_train[:, feature_importances > 0.02]
selected_features_X_test = X_test[:, feature_importances > 0.02]

print(feature_importances)

Output:

[0.00528748 0.01243391 0.01087386 0.00099458 0.00248339 0.66505179
0.01199457 0.01062141 0.00485092 0.00563588 0.00230154 0.01865276
0.01184804 0.00581178 0.17462893 0.00929922 0.00624585 0.00471869
0.03393672 0.00232869]

You can inspect the importance of each feature through the feature_importances_ property, which returns an array with an importance value for each feature. We then select only the features whose importance is greater than 0.02.

If you print feature_importances, you get an array in which each value is the importance of the corresponding feature; a higher value means more importance.
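
If you want the features ranked rather than just thresholded, a short snippet (an addition to the original code) sorts them by importance:

Python3
import numpy as np

# Indices of the features, most important first
ranking = np.argsort(feature_importances)[::-1]
for idx in ranking[:5]:
    print(f"feature {idx}: importance {feature_importances[idx]:.4f}")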

Training the model

Python3
# Logistic Regression on all features
model1 = LogisticRegression()
model1.fit(X_train, y_train)

# Logistic Regression on features selected via the Gradient Boosting Classifier
model2 = LogisticRegression()
model2.fit(selected_features_X_train, y_train)

Now we train two logistic regression models: one on the full feature set and one on the features selected using the gradient boosting classifier's importances.

Evaluation

Python3
score1 = model1.score(X_test, y_test)
score2 = model2.score(selected_features_X_test, y_test)

print("Mean accuracy without ensemble of tree: ", score1)
print("Mean accuracy with ensemble of tree: ", score2)

Output:

Mean accuracy without ensemble of tree:  0.855
Mean accuracy with ensemble of tree: 0.865

Comparing the two models by mean accuracy, we see that accuracy improves when we apply the feature transformation based on the GradientBoostingClassifier.

ROC Curve

Now that we have used both ensemble techniques, let's compare them using the ROC curve.

Python3
from sklearn.metrics import roc_curve, roc_auc_score
import matplotlib.pyplot as plt

# Predict probabilities for the positive class
y_probs_rf = rf_classifier.predict_proba(X_test)[:, 1]
y_probs_gb = gb_classifier.predict_proba(X_test)[:, 1]

# Calculate ROC curves
fpr_rf, tpr_rf, _ = roc_curve(y_test, y_probs_rf)
fpr_gb, tpr_gb, _ = roc_curve(y_test, y_probs_gb)

# Calculate AUC (Area Under the Curve) scores
auc_rf = roc_auc_score(y_test, y_probs_rf)
auc_gb = roc_auc_score(y_test, y_probs_gb)

# Plot ROC curves
plt.figure(figsize=(4, 4))
plt.plot(fpr_rf, tpr_rf, color='blue', label=f'Random Forest (AUC = {auc_rf:.2f})')
plt.plot(fpr_gb, tpr_gb, color='orange', label=f'Gradient Boosting (AUC = {auc_gb:.2f})')
plt.plot([0, 1], [0, 1], color='gray', linestyle='--')  # Diagonal line for reference
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve for Random Forest and Gradient Boosting')
plt.legend()
plt.show()

Output:

[Figure: ROC curves for Random Forest and Gradient Boosting, with AUC scores]

This code snippet evaluates the Receiver Operating Characteristic (ROC) curves and Area Under the Curve (AUC) scores of the two classifiers, Random Forest (rf_classifier) and Gradient Boosting (gb_classifier). It first predicts the positive-class probabilities on the test set with both classifiers, then computes each classifier's false positive rate (fpr) and true positive rate (tpr) for the ROC curves. The AUC scores quantify how well each classifier discriminates between the classes. Finally, the code plots both ROC curves with their AUC scores, along with a diagonal reference line.

Conclusion

To sum up, using ensembles of trees and feature transformations in Scikit-Learn is a strong way to improve predictive modeling. Methods like Random Forests and Gradient Boosting tap into the combined strength of many decision trees, which makes it possible to handle complex relationships in the data effectively. Because ensemble methods naturally expose feature importance, they help automate the selection of relevant predictors; this increases model accuracy and offers insight into which features matter most. By integrating feature transformations with tree-based ensembles, Scikit-Learn provides a stable framework for handling complex patterns in a variety of datasets, advancing machine learning solutions that are both more interpretable and more efficient.

