Python | Decision tree implementation

Last Updated : 14 May, 2024

The decision tree is one of the most powerful and popular machine learning algorithms. It falls under the category of supervised learning and works for both continuous and categorical output variables. In this article, we are going to implement a decision tree in Python on the Balance Scale Weight & Distance Database from the UCI Machine Learning Repository.

Decision Tree

A Decision tree is a tree-like structure that represents a set of decisions and their possible consequences. Each node in the tree represents a decision, and each branch represents an outcome of that decision. The leaves of the tree represent the final decisions or predictions.

Decision trees are created by recursively partitioning the data into smaller and smaller subsets. At each partition, the data is split based on a specific feature, and the split is made in a way that maximizes the information gain.

[Figure: Decision Tree]

In the figure above, the decision tree is a flowchart-like tree structure used to make decisions. It consists of a root node (WINDY) and internal nodes (OUTLOOK, TEMPERATURE), which represent tests on attributes, and leaf nodes, which represent the final decisions. The branches of the tree represent the possible outcomes of the tests.

Key Components of Decision Trees in Python

  1. Root Node: The starting node of the decision tree, representing the complete dataset.
  2. Branch Nodes: Internal nodes that represent decision points, where the data is split based on a specific attribute.
  3. Leaf Nodes: Terminal nodes that represent the final classification or prediction.
  4. Decision Rules: Rules that govern the splitting of data at each branch node.
  5. Attribute Selection: The process of choosing the most informative attribute for each split.
  6. Splitting Criteria: Metrics such as information gain, entropy, or the Gini index, used to determine the optimal split.

Assumptions we make while using a decision tree

  • At the beginning, we consider the whole training set as the root.
  • For information gain, attributes are assumed to be categorical; for the Gini index, attributes are assumed to be continuous.
  • Records are distributed recursively on the basis of attribute values.
  • Statistical methods are used to order attributes as the root or as internal nodes.

Pseudocode of Decision tree

  1. Find the best attribute and place it at the root node of the tree.
  2. Now, split the training set into subsets. When making a subset, make sure that each subset of the training data has the same value for the chosen attribute.
  3. Find leaf nodes in all branches by repeating steps 1 and 2 on each subset (a minimal recursive sketch follows this list).
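
This pseudocode maps directly onto a short recursive implementation. Below is a minimal, hypothetical sketch (the function names entropy and build_tree and the dictionary-based tree representation are our own choices, not code from this article) that grows a tree on categorical attributes using information gain:

Python
import numpy as np

def entropy(labels):
    # H = -sum(p * log2(p)) over the observed class proportions
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def build_tree(X, y, attributes):
    # Leaf: the node is pure, or no attributes are left to split on
    if len(np.unique(y)) == 1 or not attributes:
        values, counts = np.unique(y, return_counts=True)
        return values[np.argmax(counts)]  # majority class

    # Step 1: choose the attribute with the highest information gain
    def gain(a):
        vals, counts = np.unique(X[:, a], return_counts=True)
        weighted = sum((c / len(y)) * entropy(y[X[:, a] == v])
                       for v, c in zip(vals, counts))
        return entropy(y) - weighted

    best = max(attributes, key=gain)

    # Step 2: split the training set into subsets, one per value of `best`
    node = {best: {}}
    remaining = [a for a in attributes if a != best]
    for v in np.unique(X[:, best]):
        mask = X[:, best] == v
        # Step 3: repeat steps 1 and 2 on each subset
        node[best][v] = build_tree(X[mask], y[mask], remaining)
    return node

Calling build_tree(X, Y, attributes=[0, 1, 2, 3]) on a NumPy feature matrix would return a nested dictionary mapping attribute indices to sub-trees, with majority-class labels at the leaves.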

Key concepts in Decision Trees

Gini index and information gain are both used to select, from the n attributes of the dataset, the attribute to place at the root node or an internal node.

Gini index

\text{Gini Index} = 1 - \sum_{j} p_{j}^{2}

  • Gini Index is a metric to measure how often a randomly chosen element would be incorrectly identified.
  • An attribute with a lower Gini index should therefore be preferred.
  • Sklearn supports the “gini” criterion for the Gini index, and it is the default value of the criterion parameter (a short computational sketch follows this list).
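
As a quick illustration (our own sketch, not part of the article's code), the Gini index of a label array can be computed directly from the class proportions:

Python
import numpy as np

def gini_index(labels):
    # Gini = 1 - sum(p_j^2) over the class proportions p_j
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1 - np.sum(p ** 2)

print(gini_index(['L', 'L', 'R', 'B']))  # 0.625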

Entropy

If a random variable x can take N different values, the i-th value x_{i} with probability p(x_{i}), we can associate the following entropy with x:

H(x) = -\sum_{i=1}^{N} p(x_{i}) \log_{2} p(x_{i})

  • Entropy is the measure of uncertainty of a random variable; it characterizes the impurity of an arbitrary collection of examples. The higher the entropy, the higher the information content (a small worked sketch follows below).
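
A small sketch (our own example, using the formula above) makes the behaviour concrete: entropy is zero for a pure set and maximal for an even split:

Python
import numpy as np

def entropy(labels):
    # H(x) = -sum(p_i * log2(p_i)) over the observed class proportions
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

print(entropy(['L', 'L', 'L', 'L']))  # -0.0, i.e. zero: a pure set
print(entropy(['L', 'L', 'R', 'R']))  # 1.0: the maximum for two classes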

Information Gain

Definition: Suppose S is a set of instances, A is an attribute, S_{v} is the subset of S with A = v, and Values(A) is the set of all possible values of A. Then

\text{Gain}(S, A) = \text{Entropy}(S) - \sum_{v \in \text{Values}(A)} \frac{|S_{v}|}{|S|} \cdot \text{Entropy}(S_{v})

  • The entropy typically changes when we use a node in a Python decision tree to partition the training instances into smaller subsets. Information gain is a measure of this change in entropy.
  • Sklearn supports the “entropy” criterion for information gain; to use it, we have to pass criterion="entropy" explicitly (see the sketch after this list).
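
To make the definition concrete, here is a hedged sketch (the helper name information_gain is our own) that computes Gain(S, A) for one column of a NumPy feature matrix:

Python
import numpy as np

def information_gain(X, y, attribute):
    # Gain(S, A) = Entropy(S) - sum_v (|S_v| / |S|) * Entropy(S_v)
    def entropy(labels):
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return -np.sum(p * np.log2(p))

    vals, counts = np.unique(X[:, attribute], return_counts=True)
    weighted = sum((c / len(y)) * entropy(y[X[:, attribute] == v])
                   for v, c in zip(vals, counts))
    return entropy(y) - weighted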

Python Decision Tree Implementation

Dataset Description:

Title                   : Balance Scale Weight & Distance Database
Number of Instances     : 625 (49 balanced, 288 left, 288 right)
Number of Attributes    : 4 (numeric) + class name = 5
Attribute Information   :
  1. Class Name (Target variable): 3
       L [balance scale tips to the left]
       B [balance scale is balanced]
       R [balance scale tips to the right]
  2. Left-Weight: 5 (1, 2, 3, 4, 5)
  3. Left-Distance: 5 (1, 2, 3, 4, 5)
  4. Right-Weight: 5 (1, 2, 3, 4, 5)
  5. Right-Distance: 5 (1, 2, 3, 4, 5)
Missing Attribute Values: None
Class Distribution:
  1. 46.08 percent are L
  2. 07.84 percent are B
  3. 46.08 percent are R

You can find more details of the dataset on the UCI Machine Learning Repository.

Prerequisites

  1. sklearn:
    • In Python, sklearn is a machine learning package which includes a lot of ML algorithms.
    • Here, we are using some of its modules, such as train_test_split, DecisionTreeClassifier and accuracy_score.
  2. NumPy:
    • It is a numeric Python module which provides fast math functions for calculations.
    • It is used to read data into NumPy arrays and for manipulation purposes.
  3. Pandas:
    • Used to read and write different files.
    • Data manipulation can be done easily with dataframes.

Installation of the packages

In Python, scikit-learn (sklearn) is the package that contains the modules required to implement machine learning algorithms. You can install it with either of the commands given below.

using pip

pip install -U scikit-learn 

Before using the above command, make sure you have the scipy and numpy packages installed. If you don't have pip, you can install it using

python get-pip.py 

using conda

conda install scikit-learn 
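
If the installation succeeded, the following command (a quick sanity check, not part of the original article) should print the installed version:

python -c "import sklearn; print(sklearn.__version__)"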

While implementing the decision tree in Python we will go through the following two phases:

  1. Building Phase
    • Preprocess the dataset.
    • Split the dataset into train and test sets using the sklearn package.
    • Train the classifier.
  2. Operational Phase
    • Make predictions.
    • Calculate the accuracy.

Data Import

  • To import and manipulate the data, we are using the pandas package.
  • Here, we are using a URL which fetches the dataset directly from the UCI site, so there is no need to download it. When you run this code on your system, make sure it has an active Internet connection.
  • As the dataset is separated by ",", we have to pass "," as the value of the sep parameter.
  • Another thing to notice is that the dataset doesn't contain a header, so we pass None as the value of the header parameter. If we do not pass the header parameter, the first line of the dataset is treated as the header.

Data Slicing

  • Before training the model we have to split the dataset into the training and testing dataset.
  • To split the dataset for training and testing we are using the sklearn module train_test_split
  • First of all we have to separate the target variable from the attributes in the dataset.
X = balance_data.values[:, 1:5]
Y = balance_data.values[:, 0]
  • The lines above separate the dataset: variable X contains the attributes, while variable Y contains the target variable.
  • The next step is to split the dataset for training and testing purposes.
X_train, X_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.3, random_state=100)
  • The line above splits the dataset for training and testing. As we are splitting the dataset in a ratio of 70:30 between training and testing, we pass 0.3 as the value of the test_size parameter.
  • random_state is a pseudo-random number generator seed used for random sampling; fixing it makes the split reproducible.

Building a Decision Tree in Python

Below is the code for the sklearn decision tree in Python.

Import Library

Importing the necessary libraries required for the implementation of decision tree in Python.

Python
# Importing the required packages
import numpy as np
import pandas as pd
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
import matplotlib.pyplot as plt

Data Import and Exploration

Python
# Function to import the dataset
def importdata():
    balance_data = pd.read_csv(
        'https://archive.ics.uci.edu/ml/machine-learning-' +
        'databases/balance-scale/balance-scale.data',
        sep=',', header=None)

    # Displaying dataset information
    print("Dataset Length: ", len(balance_data))
    print("Dataset Shape: ", balance_data.shape)
    print("Dataset: ", balance_data.head())

    return balance_data

Data Splitting

  1. splitdataset(balance_data): This function splits the dataset into training and testing sets. It separates the target variable (class labels) from the features and splits the data using train_test_split() from scikit-learn, with a test size of 30% and a random state of 100 for reproducibility.
Python
# Function to split the dataset into features and target variables
def splitdataset(balance_data):

    # Separating the target variable
    X = balance_data.values[:, 1:5]
    Y = balance_data.values[:, 0]

    # Splitting the dataset into train and test
    X_train, X_test, y_train, y_test = train_test_split(
        X, Y, test_size=0.3, random_state=100)

    return X, Y, X_train, X_test, y_train, y_test

Training with Gini Index:

  1. train_using_gini(X_train, X_test, y_train): This function trains a decision tree classifier using the Gini index as the splitting criterion. It creates a classifier object with the specified parameters (criterion, random state, max depth, min samples leaf) and fits it on the training data.
Python
def train_using_gini(X_train, X_test, y_train):

    # Creating the classifier object
    clf_gini = DecisionTreeClassifier(criterion="gini",
                                      random_state=100, max_depth=3,
                                      min_samples_leaf=5)

    # Performing training
    clf_gini.fit(X_train, y_train)
    return clf_gini

Training with Entropy:

  1. train_using_entropy(X_train, X_test, y_train): This function trains a decision tree classifier using entropy as the splitting criterion. It creates a classifier object with the specified parameters (criterion, random state, max depth, min samples leaf) and fits it on the training data.
Python
def train_using_entropy(X_train, X_test, y_train):

    # Decision tree with entropy
    clf_entropy = DecisionTreeClassifier(
        criterion="entropy", random_state=100,
        max_depth=3, min_samples_leaf=5)

    # Performing training
    clf_entropy.fit(X_train, y_train)
    return clf_entropy

Prediction and Evaluation:

  1. prediction(X_test, clf_object): This function makes predictions on the test data using the trained classifier object. It passes the test data to the classifier's predict() method and prints the predicted class labels.
  2. cal_accuracy(y_test, y_pred): This function calculates the accuracy of the predictions. It computes and prints the confusion matrix, accuracy score, and classification report, providing a detailed performance evaluation.
Python
# Function to make predictions
def prediction(X_test, clf_object):
    y_pred = clf_object.predict(X_test)
    print("Predicted values:")
    print(y_pred)
    return y_pred

# Function to calculate accuracy and print evaluation metrics
def cal_accuracy(y_test, y_pred):
    print("Confusion Matrix: ",
          confusion_matrix(y_test, y_pred))
    print("Accuracy : ",
          accuracy_score(y_test, y_pred) * 100)
    print("Report : ",
          classification_report(y_test, y_pred))

Plotting the Decision Tree

We use the plot_tree function from the sklearn.tree submodule to plot the decision tree. The function takes the following arguments:

  • clf_object: The trained decision tree model object.
  • filled=True: This argument fills the nodes of the tree with different colors based on the predicted class majority.
  • feature_names: This argument provides the names of the features used in the decision tree.
  • class_names: This argument provides the names of the different classes.
  • rounded=True: This argument rounds the corners of the nodes for a more aesthetically pleasing appearance.
Python
from sklearn.tree import plot_tree

# Function to plot the decision tree
def plot_decision_tree(clf_object, feature_names, class_names):
    plt.figure(figsize=(15, 10))
    plot_tree(clf_object, filled=True, feature_names=feature_names,
              class_names=class_names, rounded=True)
    plt.show()

The main block below trains and visualizes two decision tree classifiers based on different splitting criteria: one using the Gini index and the other using entropy.

Python
if __name__ == "__main__":
    data = importdata()
    X, Y, X_train, X_test, y_train, y_test = splitdataset(data)

    clf_gini = train_using_gini(X_train, X_test, y_train)
    clf_entropy = train_using_entropy(X_train, X_test, y_train)

    # Visualizing the Decision Trees
    plot_decision_tree(clf_gini, ['X1', 'X2', 'X3', 'X4'], ['L', 'B', 'R'])
    plot_decision_tree(clf_entropy, ['X1', 'X2', 'X3', 'X4'], ['L', 'B', 'R'])

Output:

Dataset Length:  625
Dataset Shape:  (625, 5)
Dataset:     0  1  2  3  4
0  B  1  1  1  1
1  R  1  1  1  2
2  R  1  1  1  3
3  R  1  1  1  4
4  R  1  1  1  5

Using Gini Index

[Figure: decision tree trained with the Gini index, as rendered by plot_decision_tree]

Using Entropy

[Figure: decision tree trained with entropy, as rendered by plot_decision_tree]

The remaining code performs the operational phase of the decision tree model. Overall, the program:

  • Imports and splits data for training and testing.
  • Uses Gini and entropy criteria to train two decision trees.
  • Generates class labels for test data using each model.
  • Calculates and compares accuracy of both models.

It evaluates the performance of the trained decision trees on the unseen test data using the confusion matrix, accuracy score, and classification report, providing insight into their effectiveness for this classification task.

Results using Gini Index

Python
# Operational Phase
print("Results Using Gini Index:")
y_pred_gini = prediction(X_test, clf_gini)
cal_accuracy(y_test, y_pred_gini)

Output:

Results Using Gini Index:
Predicted values:
['R' 'L' 'R' 'R' 'R' 'L' 'R' 'L' 'L' 'L' 'R' 'L' 'L' 'L' 'R' 'L' 'R' 'L'
 'L' 'R' 'L' 'R' 'L' 'L' 'R' 'L' 'L' 'L' 'R' 'L' 'L' 'L' 'R' 'L' 'L' 'L'
 'L' 'R' 'L' 'L' 'R' 'L' 'R' 'L' 'R' 'R' 'L' 'L' 'R' 'L' 'R' 'R' 'L' 'R'
 'R' 'L' 'R' 'R' 'L' 'L' 'R' 'R' 'L' 'L' 'L' 'L' 'L' 'R' 'R' 'L' 'L' 'R'
 'R' 'L' 'R' 'L' 'R' 'R' 'R' 'L' 'R' 'L' 'L' 'L' 'L' 'R' 'R' 'L' 'R' 'L'
 'R' 'R' 'L' 'L' 'L' 'R' 'R' 'L' 'L' 'L' 'R' 'L' 'R' 'R' 'R' 'R' 'R' 'R'
 'R' 'L' 'R' 'L' 'R' 'R' 'L' 'R' 'R' 'R' 'R' 'R' 'L' 'R' 'L' 'L' 'L' 'L'
 'L' 'L' 'L' 'R' 'R' 'R' 'R' 'L' 'R' 'R' 'R' 'L' 'L' 'R' 'L' 'R' 'L' 'R'
 'L' 'L' 'R' 'L' 'L' 'R' 'L' 'R' 'L' 'R' 'R' 'R' 'L' 'R' 'R' 'R' 'R' 'R'
 'L' 'L' 'R' 'R' 'R' 'R' 'L' 'R' 'R' 'R' 'L' 'R' 'L' 'L' 'L' 'L' 'R' 'R'
 'L' 'R' 'R' 'L' 'L' 'R' 'R' 'R']
Confusion Matrix:  [[ 0  6  7]
 [ 0 67 18]
 [ 0 19 71]]
Accuracy :  73.40425531914893
Report :               precision    recall  f1-score   support

           B       0.00      0.00      0.00        13
           L       0.73      0.79      0.76        85
           R       0.74      0.79      0.76        90

    accuracy                           0.73       188
   macro avg       0.49      0.53      0.51       188
weighted avg       0.68      0.73      0.71       188

Results using Entropy

Python
print("Results Using Entropy:") y_pred_entropy = prediction(X_test, clf_entropy) cal_accuracy(y_test, y_pred_entropy) 

Output:

Results Using Entropy:
Predicted values:
['R' 'L' 'R' 'L' 'R' 'L' 'R' 'L' 'R' 'R' 'R' 'R' 'L' 'L' 'R' 'L' 'R' 'L'
 'L' 'R' 'L' 'R' 'L' 'L' 'R' 'L' 'R' 'L' 'R' 'L' 'R' 'L' 'R' 'L' 'L' 'L'
 'L' 'L' 'R' 'L' 'R' 'L' 'R' 'L' 'R' 'R' 'L' 'L' 'R' 'L' 'L' 'R' 'L' 'L'
 'R' 'L' 'R' 'R' 'L' 'R' 'R' 'R' 'L' 'L' 'R' 'L' 'L' 'R' 'L' 'L' 'L' 'R'
 'R' 'L' 'R' 'L' 'R' 'R' 'R' 'L' 'R' 'L' 'L' 'L' 'L' 'R' 'R' 'L' 'R' 'L'
 'R' 'R' 'L' 'L' 'L' 'R' 'R' 'L' 'L' 'L' 'R' 'L' 'L' 'R' 'R' 'R' 'R' 'R'
 'R' 'L' 'R' 'L' 'R' 'R' 'L' 'R' 'R' 'L' 'R' 'R' 'L' 'R' 'R' 'R' 'L' 'L'
 'L' 'L' 'L' 'R' 'R' 'R' 'R' 'L' 'R' 'R' 'R' 'L' 'L' 'R' 'L' 'R' 'L' 'R'
 'L' 'R' 'R' 'L' 'L' 'R' 'L' 'R' 'R' 'R' 'R' 'R' 'L' 'R' 'R' 'R' 'R' 'R'
 'R' 'L' 'R' 'L' 'R' 'R' 'L' 'R' 'L' 'R' 'L' 'R' 'L' 'L' 'L' 'L' 'L' 'R'
 'R' 'R' 'L' 'L' 'L' 'R' 'R' 'R']
Confusion Matrix:  [[ 0  6  7]
 [ 0 63 22]
 [ 0 20 70]]
Accuracy :  70.74468085106383
Report :               precision    recall  f1-score   support

           B       0.00      0.00      0.00        13
           L       0.71      0.74      0.72        85
           R       0.71      0.78      0.74        90

     accuracy                           0.71       188
   macro avg       0.47      0.51      0.49       188
weighted avg       0.66      0.71      0.68       188

Applications of Decision Trees

Decision trees are versatile tools with a wide range of applications in machine learning:

  1. Classification: Predicting categorical outcomes, such as whether an email is spam or not.
  2. Regression: Estimating continuous values, for example, predicting home prices from their features.
  3. Feature Selection: Identifying which features are most relevant to a given task, which lowers dimensionality and boosts model performance (a brief sketch follows this list).
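
As a brief illustration of points 2 and 3 (a hedged sketch on synthetic data, not code from this article), scikit-learn's DecisionTreeRegressor predicts continuous values and exposes feature_importances_ after fitting, which can guide feature selection:

Python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: the target depends mostly on the first feature
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 3))
y = 3 * X[:, 0] + 0.1 * X[:, 1] + rng.normal(0, 0.5, size=200)

reg = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X, y)
print(reg.predict(X[:3]))         # continuous-valued predictions
print(reg.feature_importances_)   # the first feature should dominate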

Conclusion

Python decision trees provide a strong and comprehensible method for handling machine learning tasks. They are an invaluable tool for a variety of applications because of their ease of use, efficiency, and capacity to handle both numerical and categorical data. Decision trees are a useful tool for making precise forecasts and insightful analysis when used carefully.

