Skip to content
geeksforgeeks
  • Courses
    • DSA to Development
    • Get IBM Certification
    • Newly Launched!
      • Master Django Framework
      • Become AWS Certified
    • For Working Professionals
      • Interview 101: DSA & System Design
      • Data Science Training Program
      • JAVA Backend Development (Live)
      • DevOps Engineering (LIVE)
      • Data Structures & Algorithms in Python
    • For Students
      • Placement Preparation Course
      • Data Science (Live)
      • Data Structure & Algorithm-Self Paced (C++/JAVA)
      • Master Competitive Programming (Live)
      • Full Stack Development with React & Node JS (Live)
    • Full Stack Development
    • Data Science Program
    • All Courses
  • Tutorials
    • Data Structures & Algorithms
    • ML & Data Science
    • Interview Corner
    • Programming Languages
    • Web Development
    • CS Subjects
    • DevOps And Linux
    • School Learning
  • Practice
    • Build your AI Agent
    • GfG 160
    • Problem of the Day
    • Practice Coding Problems
    • GfG SDE Sheet
  • Contests
    • Accenture Hackathon (Ending Soon!)
    • GfG Weekly [Rated Contest]
    • Job-A-Thon Hiring Challenge
    • All Contests and Events
  • Data Science
  • Data Science Projects
  • Data Analysis
  • Data Visualization
  • Machine Learning
  • ML Projects
  • Deep Learning
  • NLP
  • Computer Vision
  • Artificial Intelligence
Open In App
Next Article:
Upper Confidence Bound Algorithm in Reinforcement Learning
Next article icon

Epsilon-Greedy Algorithm in Reinforcement Learning

Last Updated : 10 Jan, 2023
Comments
Improve
Suggest changes
Like Article
Like
Report

In Reinforcement Learning, the agent or decision-maker learns what to do—how to map situations to actions—so as to maximize a numerical reward signal. The agent is not explicitly told which actions to take, but instead must discover which action yields the most reward through trial and error.

Multi-Armed Bandit Problem

The multi-armed bandit problem is used in reinforcement learning to formalize the notion of decision-making under uncertainty. In a multi-armed bandit problem, an agent(learner) chooses between k different actions and receives a reward based on the chosen action.
The multi-armed bandits are also used to describe fundamental concepts in reinforcement learning, such as rewards, timesteps, and values.
For selecting an action by an agent, we assume that each action has a separate distribution of rewards and there is at least one action that generates maximum numerical reward. Thus, the probability distribution of the rewards corresponding to each action is different and is unknown to the agent(decision-maker). Hence, the goal of the agent is to identify which action to choose to get the maximum reward after a given set of trials.

Action-Value and Action-Value Estimate
For an agent to decide which action yields the maximum reward, we must define the value of taking each action. We use the concept of probability to define these values using the action-value function.
The value of selecting an action is defined as the expected reward received when taking that action from a set of all possible actions. Since the value of selecting an action is not known to the agent, so we use the ‘sample-average’ method to estimate the value of taking an action.

Exploration vs Exploitation

Exploration allows an agent to improve its current knowledge about each action, hopefully leading to long-term benefit. Improving the accuracy of the estimated action-values, enables an agent to make more informed decisions in the future.

Exploitation on the other hand, chooses the greedy action to get the most reward by exploiting the agent’s current action-value estimates. But by being greedy with respect to action-value estimates, may not actually get the most reward and lead to sub-optimal behaviour.
When an agent explores, it gets more accurate estimates of action-values. And when it exploits, it might get more reward. It cannot, however, choose to do both simultaneously, which is also called the exploration-exploitation dilemma.

Epsilon-Greedy Action Selection
Epsilon-Greedy is a simple method to balance exploration and exploitation by choosing between exploration and exploitation randomly.
The epsilon-greedy, where epsilon refers to the probability of choosing to explore, exploits most of the time with a small chance of exploring.

Code: Python code for Epsilon-Greedy




# Import required libraries
import numpy as np
import matplotlib.pyplot as plt
  
# Define Action class
class Actions:
  def __init__(self, m):
    self.m = m
    self.mean = 0
    self.N = 0
  
  # Choose a random action
  def choose(self): 
    return np.random.randn() + self.m
  
  # Update the action-value estimate
  def update(self, x):
    self.N += 1
    self.mean = (1 - 1.0 / self.N)*self.mean + 1.0 / self.N * x
  
  
def run_experiment(m1, m2, m3, eps, N):
      
  actions = [Actions(m1), Actions(m2), Actions(m3)]
  
  data = np.empty(N)
    
  for i in range(N):
    # epsilon greedy
    p = np.random.random()
    if p < eps:
      j = np.random.choice(3)
    else:
      j = np.argmax([a.mean for a in actions])
    x = actions[j].choose()
    actions[j].update(x)
  
    # for the plot
    data[i] = x
  cumulative_average = np.cumsum(data) / (np.arange(N) + 1)
  
  # plot moving average ctr
  plt.plot(cumulative_average)
  plt.plot(np.ones(N)*m1)
  plt.plot(np.ones(N)*m2)
  plt.plot(np.ones(N)*m3)
  plt.xscale('log')
  plt.show()
  
  for a in actions:
    print(a.mean)
  
  return cumulative_average
  
  
if __name__ == '__main__':
      
  c_1 = run_experiment(1.0, 2.0, 3.0, 0.1, 100000)
  c_05 = run_experiment(1.0, 2.0, 3.0, 0.05, 100000)
  c_01 = run_experiment(1.0, 2.0, 3.0, 0.01, 100000)
 
 

Output:


Code: Python code for getting the log output plot




# log scale plot
plt.plot(c_1, label ='eps = 0.1')
plt.plot(c_05, label ='eps = 0.05')
plt.plot(c_01, label ='eps = 0.01')
plt.legend()
plt.xscale('log')
plt.show()
 
 

Output:

Code: Python code for getting the linear output plot




# linear plot
plt.figure(figsize = (12, 8))
plt.plot(c_1, label ='eps = 0.1')
plt.plot(c_05, label ='eps = 0.05')
plt.plot(c_01, label ='eps = 0.01')
plt.legend()
plt.show()
 
 

Output:



Next Article
Upper Confidence Bound Algorithm in Reinforcement Learning
author
samishawl
Improve
Article Tags :
  • Machine Learning
  • Artificial Intelligence
  • python
Practice Tags :
  • Machine Learning
  • python

Similar Reads

  • Dyna Algorithm in Reinforcement Learning
    The Dyna algorithm introduces a hybrid approach that leverages both real-world and simulated experiences, enhancing the agent's learning efficiency. This article delves into the key concepts, architecture, and benefits of the Dyna algorithm, along with its applications. Table of Content Understandin
    5 min read
  • Upper Confidence Bound Algorithm in Reinforcement Learning
    In Reinforcement learning, the agent or decision-maker generates its training data by interacting with the world. The agent must learn the consequences of its actions through trial and error, rather than being explicitly told the correct action. Multi-Armed Bandit Problem In Reinforcement Learning,
    6 min read
  • Actor-Critic Algorithm in Reinforcement Learning
    Actor-Critic Algorithm is a type of reinforcement learning algorithm that combines aspects of both policy-based methods (Actor) and value-based methods (Critic). This hybrid approach is designed to address the limitations of each method when used individually. In the actor-critic framework, an agent
    7 min read
  • Hierarchical Reinforcement Learning (HRL) in AI
    In the rapidly evolving field of Artificial Intelligence (AI), Reinforcement Learning (RL) has emerged as a powerful tool for solving complex decision-making problems. Traditional RL algorithms have shown remarkable success in various domains, from gaming to robotics. However, as tasks become more i
    7 min read
  • Genetic Algorithm for Reinforcement Learning : Python implementation
    In reinforcement learning, the challenge is to find the best policy or the best set of parameters for a given environment. Genetic Algorithm (GA) is an optimization algorithm inspired by the process of natural evolution. It is used to find approximate solutions to complex problems by evolving a popu
    8 min read
  • Function Approximation in Reinforcement Learning
    Function approximation is a critical concept in reinforcement learning (RL), enabling algorithms to generalize from limited experience to a broader set of states and actions. This capability is essential when dealing with complex environments where the state and action spaces are vast or continuous.
    5 min read
  • Multi-Agent Reinforcement Learning in AI
    Reinforcement learning (RL) can solve complex problems through trial and error, learning from the environment to make optimal decisions. While single-agent reinforcement learning has made remarkable strides, many real-world problems involve multiple agents interacting within the same environment. Th
    7 min read
  • Deep Q-Learning in Reinforcement Learning
    Deep Q-Learning integrates deep neural networks into the decision-making process. This combination allows agents to handle high-dimensional state spaces, making it possible to solve complex tasks such as playing video games or controlling robots. Before diving into Deep Q-Learning, it’s important to
    4 min read
  • Q-Learning in Reinforcement Learning
    Q-learning is a model-free reinforcement learning algorithm used to train agents (computer programs) to make optimal decisions by interacting with an environment. It helps the agent explore different actions and learn which ones lead to better outcomes. The agent uses trial and error to determine wh
    9 min read
  • Model-Based Reinforcement Learning (MBRL) in AI
    Model-based reinforcement learning is a subclass of reinforcement learning where the agent constructs an internal model of the environment's dynamics and uses it to simulate future states, predict rewards, and optimize actions efficiently. Key Components of MBRLModel of the Environment: This is typi
    7 min read
geeksforgeeks-footer-logo
Corporate & Communications Address:
A-143, 7th Floor, Sovereign Corporate Tower, Sector- 136, Noida, Uttar Pradesh (201305)
Registered Address:
K 061, Tower K, Gulshan Vivante Apartment, Sector 137, Noida, Gautam Buddh Nagar, Uttar Pradesh, 201305
GFG App on Play Store GFG App on App Store
Advertise with us
  • Company
  • About Us
  • Legal
  • Privacy Policy
  • In Media
  • Contact Us
  • Advertise with us
  • GFG Corporate Solution
  • Placement Training Program
  • Languages
  • Python
  • Java
  • C++
  • PHP
  • GoLang
  • SQL
  • R Language
  • Android Tutorial
  • Tutorials Archive
  • DSA
  • Data Structures
  • Algorithms
  • DSA for Beginners
  • Basic DSA Problems
  • DSA Roadmap
  • Top 100 DSA Interview Problems
  • DSA Roadmap by Sandeep Jain
  • All Cheat Sheets
  • Data Science & ML
  • Data Science With Python
  • Data Science For Beginner
  • Machine Learning
  • ML Maths
  • Data Visualisation
  • Pandas
  • NumPy
  • NLP
  • Deep Learning
  • Web Technologies
  • HTML
  • CSS
  • JavaScript
  • TypeScript
  • ReactJS
  • NextJS
  • Bootstrap
  • Web Design
  • Python Tutorial
  • Python Programming Examples
  • Python Projects
  • Python Tkinter
  • Python Web Scraping
  • OpenCV Tutorial
  • Python Interview Question
  • Django
  • Computer Science
  • Operating Systems
  • Computer Network
  • Database Management System
  • Software Engineering
  • Digital Logic Design
  • Engineering Maths
  • Software Development
  • Software Testing
  • DevOps
  • Git
  • Linux
  • AWS
  • Docker
  • Kubernetes
  • Azure
  • GCP
  • DevOps Roadmap
  • System Design
  • High Level Design
  • Low Level Design
  • UML Diagrams
  • Interview Guide
  • Design Patterns
  • OOAD
  • System Design Bootcamp
  • Interview Questions
  • Inteview Preparation
  • Competitive Programming
  • Top DS or Algo for CP
  • Company-Wise Recruitment Process
  • Company-Wise Preparation
  • Aptitude Preparation
  • Puzzles
  • School Subjects
  • Mathematics
  • Physics
  • Chemistry
  • Biology
  • Social Science
  • English Grammar
  • Commerce
  • World GK
  • GeeksforGeeks Videos
  • DSA
  • Python
  • Java
  • C++
  • Web Development
  • Data Science
  • CS Subjects
@GeeksforGeeks, Sanchhaya Education Private Limited, All rights reserved
We use cookies to ensure you have the best browsing experience on our website. By using our site, you acknowledge that you have read and understood our Cookie Policy & Privacy Policy
Lightbox
Improvement
Suggest Changes
Help us improve. Share your suggestions to enhance the article. Contribute your expertise and make a difference in the GeeksforGeeks portal.
geeksforgeeks-suggest-icon
Create Improvement
Enhance the article with your expertise. Contribute to the GeeksforGeeks community and help create better learning resources for all.
geeksforgeeks-improvement-icon
Suggest Changes
min 4 words, max Words Limit:1000

Thank You!

Your suggestions are valuable to us.

What kind of Experience do you want to share?

Interview Experiences
Admission Experiences
Career Journeys
Work Experiences
Campus Experiences
Competitive Exam Experiences