Iterative Dichotomiser 3 (ID3) Algorithm From Scratch
Last Updated : 02 Jan, 2024
In the realm of machine learning and data mining, decision trees stand as versatile tools for classification and prediction tasks. The ID3 (Iterative Dichotomiser 3) algorithm serves as one of the foundational pillars upon which decision tree learning is built. Developed by Ross Quinlan in the 1980s, ID3 remains a fundamental algorithm, forming the basis for subsequent tree-based methods like C4.5 and CART (Classification and Regression Trees).
Introduction to Decision Trees
Decision trees are machine learning models that recursively partition the input data according to feature values in order to reach a decision. Every internal node represents a feature test, every branch denotes a possible outcome of that test, and every leaf node produces a prediction. This tree structure makes the model simple to interpret and visualize. At each step of construction, the best feature is chosen so as to maximize information gain (or, equivalently, minimize impurity). Decision trees are adaptable and can be used for both regression and classification tasks. Although they can overfit, this is frequently mitigated by strategies such as pruning.
Decision Trees
Before delving into the intricacies of the ID3 algorithm, let's grasp the essence of decision trees. Picture a tree-like structure where each internal node represents a test on an attribute, each branch signifies an outcome of that test, and each leaf node denotes a class label or a decision. Decision trees mimic human decision-making processes by recursively splitting data based on different attributes to create a flowchart-like structure for classification or regression.
ID3 Algorithm
The Iterative Dichotomiser 3 (ID3) algorithm is a well-known decision tree learning method in machine learning. It recursively constructs a tree by choosing, at each node, the attribute whose split of the data yields the highest information gain. The goal is to make the final subsets as homogeneous as possible. By repeatedly selecting the attribute that offers the greatest reduction in entropy (uncertainty), ID3 iteratively grows the tree. The procedure continues until a stopping criterion is satisfied, such as a minimum subset size or a maximum tree depth. Although ID3 is a foundational method, later variants such as C4.5 and CART have addressed several of its limitations.
How ID3 Works
The ID3 algorithm is specifically designed for building decision trees from a given dataset. Its primary objective is to construct a tree that best explains the relationship between attributes in the data and their corresponding class labels.
1. Selecting the Best Attribute
- ID3 employs the concept of entropy and information gain to determine the attribute that best separates the data. Entropy measures the impurity or randomness in the dataset.
- The algorithm evaluates, for each attribute, the weighted entropy of the subsets produced by splitting on it, and selects the attribute that yields the largest information gain.
2. Creating Tree Nodes
- The chosen attribute is used to split the dataset into subsets based on its distinct values.
- For each subset, ID3 recurses to find the next best attribute to further partition the data, forming branches and new nodes accordingly.
3. Stopping Criteria
- The recursion continues until one of the stopping criteria is met, such as when all instances in a branch belong to the same class or when all attributes have been used for splitting.
4. Handling Missing Values
- Basic ID3 has no built-in mechanism for missing attribute values; practical implementations handle them with strategies such as substituting the attribute's mean/mode or assigning the majority value.
5. Tree Pruning
- Pruning is a technique to prevent overfitting. While not directly included in ID3, post-processing techniques or variations like C4.5 incorporate pruning to improve the tree's generalization.
Mathematical Concepts of ID3 Algorithm
Now let's examine the formulas linked to the main theoretical ideas in the ID3 algorithm:
1. Entropy
Entropy is a measure of disorder or uncertainty in a set of data. ID3 uses entropy to quantify a dataset's impurity; the objective is to reduce entropy by dividing the data into subsets that are as homogeneous as possible.
For a set S with classes {c1, c2, ..., cn}, the entropy is calculated as:
H(S) = -\sum^n_{i=1} p_i \log_2(p_i)
Where, p_i is the proportion of instances of class c_i in the set S.
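For example, if a set S contains 3 instances of one class and 1 instance of another, then H(S) = -(3/4)\log_2(3/4) - (1/4)\log_2(1/4) \approx 0.811. A perfectly pure set has entropy 0, while a perfectly balanced two-class set has entropy 1.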
2. Information Gain
Information Gain measures how well a given attribute reduces uncertainty. At each step, ID3 splits the data on the attribute that maximizes Information Gain, which is computed as the difference between the entropy before the split and the weighted entropy after the split.
Information Gain measures the effectiveness of an attribute A in reducing uncertainty in set S.
IG(A,S) = H(S) - \sum_{v \in values(A)} \frac{|S_v|}{|S|} \cdot H(S_v)
Where, |S_v| is the size of the subset of S for which attribute A has value v.
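For example, suppose S contains two instances of each of two classes, so H(S) = 1, and attribute A splits S into two pure subsets of size 2 each. Then IG(A,S) = 1 - (2/4)\cdot 0 - (2/4)\cdot 0 = 1, the maximum possible gain for this set.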
3. Gain Ratio
Gain Ratio is a refinement of Information Gain that corrects for Information Gain's bias toward attributes with many distinct values. It does so by normalizing the Information Gain by the attribute's split information.
GR(A,S) = \frac{IG(A,S)}{-\sum_{v \in values(A)} \frac{|S_v|}{|S|} \cdot \log_2\left(\frac{|S_v|}{|S|}\right)}
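Continuing the example above, attribute A's split information is -(2/4)\log_2(2/4) - (2/4)\log_2(2/4) = 1, so GR(A,S) = 1/1 = 1. An attribute that split the same four instances into four singleton subsets would also reach IG = 1 but would have split information \log_2(4) = 2, giving a gain ratio of only 0.5, which is exactly the bias correction Gain Ratio is meant to provide.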
Iterative Dichotomiser 3 (ID3) Implementation using Python
Let's create a simplified version of the ID3 algorithm from scratch using Python.
Importing Libraries
Importing the necessary libraries:
Python3
from collections import Counter
import numpy as np
- collections for the Counter class to count occurrences.
- numpy as np for numerical operations and array handling.
Defining Node Class
Python3
class Node:
    def __init__(self, feature=None, value=None, results=None, true_branch=None, false_branch=None):
        self.feature = feature            # Feature to split on
        self.value = value                # Value of the feature to split on
        self.results = results            # Stores class labels if node is a leaf node
        self.true_branch = true_branch    # Branch for values that are True for the feature
        self.false_branch = false_branch  # Branch for values that are False for the feature
The provided Python code defines a class called Node for constructing nodes in a decision tree. Each node encapsulates information crucial for decision-making within the tree. The feature attribute signifies the feature used for splitting, while value stores the specific value of that feature for the split. In the case of a leaf node, results holds class labels. The node also has branches, with true_branch representing the path for values evaluating to True for the feature, and false_branch for values evaluating to False. This class forms a fundamental building block for creating decision trees, enabling the representation of decision points and outcomes in a hierarchical structure.
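As a quick illustration (not part of the original walkthrough), a tiny two-leaf tree that routes samples whose feature 0 is less than or equal to 0 to class 0 and all other samples to class 1 could be assembled by hand:
Python3
# Illustrative only: hand-built two-leaf tree using the Node class above
leaf_class_0 = Node(results=0)   # Leaf predicting class 0
leaf_class_1 = Node(results=1)   # Leaf predicting class 1
root = Node(feature=0, value=0,
            true_branch=leaf_class_0,    # taken when sample[0] <= 0
            false_branch=leaf_class_1)   # taken when sample[0] > 0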
Entropy Calculation Function
Python3
def entropy(data):
    counts = np.bincount(data)           # Occurrences of each class label
    probabilities = counts / len(data)   # Convert counts to probabilities
    entropy = -np.sum([p * np.log2(p) for p in probabilities if p > 0])
    return entropy
The entropy function calculates the entropy of a given dataset using the formula for information entropy. It first computes the counts of occurrences for each unique element in the dataset using np.bincount. Then, it calculates the probabilities of each element and uses these probabilities to compute the entropy using the standard formula -\sum_i p_i \cdot \log_2(p_i). The function ensures that the logarithm is not taken for zero probabilities, avoiding mathematical errors. The result is the entropy value for the input dataset, reflecting its degree of disorder or uncertainty.
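A couple of quick checks (illustrative, not from the original walkthrough) confirm the expected values:
Python3
# Quick checks of the entropy function on small label arrays
print(entropy(np.array([0, 0, 1, 1])))  # 1.0    (perfectly mixed, maximum entropy)
print(entropy(np.array([0, 0, 0, 1])))  # ~0.811 (mostly one class, lower entropy)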
Splitting Data Function
Python3
def split_data(X, y, feature, value):
    true_indices = np.where(X[:, feature] <= value)[0]   # Rows that satisfy the split condition
    false_indices = np.where(X[:, feature] > value)[0]   # Rows that do not
    true_X, true_y = X[true_indices], y[true_indices]
    false_X, false_y = X[false_indices], y[false_indices]
    return true_X, true_y, false_X, false_y
The split_data function divides a dataset into two subsets based on a specified feature and threshold value. It uses NumPy to identify indices where the feature values satisfy the condition (<= value for the true branch and > value for the false branch). Then, it extracts the corresponding subsets for features (true_X and false_X) and labels (true_y and false_y). The function returns these subsets, enabling the partitioning of data for further use in constructing a decision tree.
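For instance (an illustrative call, not part of the original walkthrough), splitting a three-row array on feature 0 with a threshold of 5 behaves as follows:
Python3
# Illustrative usage of split_data on a small array
X_demo = np.array([[2, 7], [5, 1], [8, 3]])
y_demo = np.array([0, 1, 1])
true_X, true_y, false_X, false_y = split_data(X_demo, y_demo, feature=0, value=5)
print(true_X)   # rows with feature 0 <= 5 -> [[2 7] [5 1]]
print(true_y)   # [0 1]
print(false_X)  # rows with feature 0 > 5  -> [[8 3]]
print(false_y)  # [1]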
Building the Tree Function
Python3
def build_tree(X, y):
    # If all labels are identical, return a leaf node
    if len(set(y)) == 1:
        return Node(results=y[0])

    best_gain = 0
    best_criteria = None
    best_sets = None
    n_features = X.shape[1]

    current_entropy = entropy(y)

    # Try every (feature, value) pair and keep the split with the highest information gain
    for feature in range(n_features):
        feature_values = set(X[:, feature])
        for value in feature_values:
            true_X, true_y, false_X, false_y = split_data(X, y, feature, value)
            true_entropy = entropy(true_y)
            false_entropy = entropy(false_y)
            p = len(true_y) / len(y)
            gain = current_entropy - p * true_entropy - (1 - p) * false_entropy

            if gain > best_gain:
                best_gain = gain
                best_criteria = (feature, value)
                best_sets = (true_X, true_y, false_X, false_y)

    if best_gain > 0:
        true_branch = build_tree(best_sets[0], best_sets[1])
        false_branch = build_tree(best_sets[2], best_sets[3])
        return Node(feature=best_criteria[0], value=best_criteria[1],
                    true_branch=true_branch, false_branch=false_branch)

    # No split improves purity: return a leaf labeled with the majority class
    return Node(results=Counter(y).most_common(1)[0][0])
The build_tree function recursively constructs a decision tree using the ID3 algorithm. It first checks whether the labels in the current subset are homogeneous; if so, it creates a leaf node with the corresponding class label. Otherwise, it iterates through all features and candidate values, calculating the information gain for each split and identifying the one with the highest gain. The function then recursively calls itself to build the true and false branches using the best split criteria. The process continues until further splits do not yield positive information gain, at which point a leaf node labeled with the majority class of the remaining samples is created.
Prediction Function
Python3
def predict(tree, sample):
    if tree.results is not None:
        return tree.results   # Leaf node: return the stored class label
    else:
        # Follow the branch that matches the sample's value for the split feature
        branch = tree.false_branch
        if sample[tree.feature] <= tree.value:
            branch = tree.true_branch
        return predict(branch, sample)
The predict function uses a trained decision tree to predict the class label for a given sample. It recursively navigates the tree by checking whether the current node is a leaf node (indicated by a non-None results attribute). If it is a leaf, it returns the stored class label. Otherwise, it determines the next branch to traverse by comparing the sample's value for the split feature against the node's splitting threshold. The function then calls itself with the appropriate branch until a leaf node is reached, yielding the final predicted class label for the input sample.
Dataset and Tree Building
Python3
# Sample dataset: four samples with two binary features each
X = np.array([[1, 1],
              [1, 0],
              [0, 1],
              [0, 0]])
y = np.array([1, 1, 0, 0])

# Building the tree
decision_tree = build_tree(X, y)
The code creates a dataset X with binary features and their corresponding labels y. Then, it constructs a decision tree using the build_tree function, which recursively builds the tree using the ID3 algorithm based on the provided dataset. The resulting decision_tree is the root node of the constructed decision tree.
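To see what was actually built, a small hypothetical helper (not part of the original code) can print the tree structure:
Python3
def print_tree(node, indent=""):
    # Hypothetical inspection helper, assuming the Node class defined above
    if node.results is not None:
        print(indent + f"Predict: {node.results}")
    else:
        print(indent + f"Is feature {node.feature} <= {node.value}?")
        print(indent + "--> True:")
        print_tree(node.true_branch, indent + "    ")
        print(indent + "--> False:")
        print_tree(node.false_branch, indent + "    ")

print_tree(decision_tree)
For this dataset the root splits on feature 0 at value 0, with the true branch predicting class 0 and the false branch predicting class 1.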
Prediction
Python3
sample = np.array([1, 0])
prediction = predict(decision_tree, sample)
print(f"Prediction for sample {sample}: {prediction}")
Output:
Prediction for sample [1 0]: 1
- Predicts the class label for the sample using the built decision tree and prints the prediction.
- For the sample [1, 0], the algorithm traverses the decision tree starting from the root node, which splits on feature 0 at value 0. Since the sample's feature 0 is 1 (greater than 0), traversal follows the false branch, whose leaf predicts 1 (Class 1). A quick sanity check over all training samples is shown below.
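As an illustrative check (not part of the original article), the tree can be applied to every training sample and compared with the true labels:
Python3
# Sanity check: the tree should reproduce the training labels
for s, label in zip(X, y):
    print(f"Sample {s} -> predicted {predict(decision_tree, s)}, actual {label}")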
Advantages and Limitations of ID3
Advantages
- Interpretability: Decision trees generated by ID3 are easily interpretable, making them suitable for explaining decisions to non-technical stakeholders.
- Handles Categorical Data: ID3 can effectively handle categorical attributes without requiring explicit data preprocessing steps.
- Computationally Inexpensive: The algorithm is relatively straightforward and computationally less expensive compared to some complex models.
Limitations
- Overfitting: ID3 tends to create complex trees that may overfit the training data, impacting generalization to unseen instances.
- Sensitive to Noise: Noise or outliers in the data can lead to the creation of non-optimal or incorrect splits.
- Limited Attribute Handling: classic ID3 splits only on categorical attribute values (one branch per value) and has no native support for continuous features; the simplified implementation above works around this with binary threshold splits, which limits how directly more complex relationships in the data can be represented.
Conclusion
The ID3 algorithm laid the groundwork for decision tree learning, providing a robust framework for understanding attribute selection and recursive partitioning. Despite its limitations, ID3's simplicity and interpretability have paved the way for more sophisticated algorithms that address its drawbacks while retaining its essence.
As machine learning continues to evolve, the ID3 algorithm remains a crucial piece in the mosaic of tree-based methods, serving as a stepping stone for developing more advanced and accurate models in the quest for efficient data analysis and pattern recognition.