Actor-Critic Algorithm in Reinforcement Learning

Last Updated : 26 Feb, 2025

The Actor-Critic algorithm is a reinforcement learning method that combines aspects of both policy-based methods (the actor) and value-based methods (the critic). This hybrid approach is designed to address the limitations of each method when used individually.

In the actor-critic framework, the actor learns a policy for selecting actions, while the critic learns a value function and evaluates the actions taken by the actor by estimating their value or quality. This division of labor allows the method to strike a balance between exploration and exploitation, leveraging the strengths of both policy and value functions.

Roles of Actor and Critic

  • Actor: The actor makes decisions by selecting actions based on the current policy. Its responsibility lies in exploring the action space to maximize expected cumulative rewards. By continuously refining the policy, the actor adapts to the dynamic nature of the environment.
  • Critic: The critic evaluates the actions taken by the actor. It estimates the value or quality of these actions by providing feedback on their performance. The critic's role is pivotal in guiding the actor towards actions that lead to higher expected returns, contributing to the overall improvement of the learning process.

Key Terms in the Actor-Critic Algorithm

There are two key terms:

  • Policy (Actor):
    • The policy, denoted as \pi(a|s), represents the probability of taking action a in state s.
    • The actor seeks to maximize the expected return by optimizing this policy.
    • The policy is modeled by the actor network, and its parameters are denoted by \theta.
  • Value Function (Critic):
    • The value function, denoted as V(s), estimates the expected cumulative reward starting from state s.
    • The value function is modeled by the critic network, and its parameters are denoted by w.

How the Actor-Critic Algorithm Works
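
At a high level, the two components interact with the environment in a loop: the actor samples an action from its policy, the environment returns a reward and the next state, the critic scores the transition with a temporal-difference (TD) error, and both networks are updated from that signal. The sketch below is a conceptual, framework-agnostic outline of one such step, not part of the CartPole implementation later in the article; `env`, `actor`, `critic` and their methods are assumed placeholder objects.

Python
# Conceptual sketch of one actor-critic interaction step.
# `env`, `actor`, `critic` and their methods are assumed placeholder objects.
def actor_critic_step(env, actor, critic, state, gamma=0.99):
    action = actor.sample_action(state)              # actor: sample a ~ pi(a|s)
    next_state, reward, done = env.step(action)      # environment transition
    # critic: the one-step TD error serves as an estimate of the advantage A(s, a)
    td_error = reward + gamma * critic.value(next_state) * (not done) - critic.value(state)
    actor.update(state, action, td_error)            # push up log pi(a|s) in proportion to the advantage
    critic.update(state, td_error)                   # move V(s) toward the TD target
    return next_state, done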

Actor-Critic Algorithm Objective Function

  • The objective function for the Actor-Critic algorithm combines the policy gradient (for the actor) and the value-function loss (for the critic).
  • The overall objective is typically expressed as the sum of two components:

Policy Gradient (Actor)

\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \nabla_\theta \log\pi_\theta (a_i|s_i)\cdot A(s_i,a_i)

Here,

  • J(\theta) represents the expected return under the policy parameterized by \theta.
  • \pi_\theta(a|s) is the policy function.
  • N is the number of sampled experiences.
  • A(s,a) is the advantage function, representing the advantage of taking action a in state s.
  • i represents the index of the sample.
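
As a concrete illustration of this estimator, the snippet below computes the sample-based objective from a small batch of action probabilities and advantages and negates it, so that minimizing the resulting loss performs gradient ascent on J(\theta). All tensor values are made-up placeholders, not outputs of the CartPole networks defined later.

Python
import tensorflow as tf

# Placeholder batch of N = 3 sampled experiences (illustrative values only)
action_probs = tf.constant([[0.7, 0.3],    # pi_theta(. | s_i) over two actions
                            [0.4, 0.6],
                            [0.5, 0.5]])
actions = tf.constant([0, 1, 1])           # actions a_i actually taken
advantages = tf.constant([1.2, -0.3, 0.5]) # A(s_i, a_i), assumed supplied by the critic

# log pi_theta(a_i | s_i) for the chosen actions
taken_probs = tf.reduce_sum(action_probs * tf.one_hot(actions, depth=2), axis=1)
log_probs = tf.math.log(taken_probs)

# Sample estimate of J(theta); the minus sign turns gradient ascent into a loss to minimize
actor_loss = -tf.reduce_mean(log_probs * advantages)
print(actor_loss.numpy())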

Value Function Update (Critic)

\nabla_w J(w) \approx \frac{1}{N}\sum_{i=1}^{N} \nabla_w (V_{w}(s_i)- Q_{w}(s_i , a_i))^2

Here,

  • \nabla_w J(w) is the gradient of the loss function with respect to the critic's parameters w.
  • N is the number of samples.
  • V_w(s_i) is the critic's estimate of the value of state s_i, with parameters w.
  • Q_w(s_i, a_i) is the critic's estimate of the action-value of taking action a_i in state s_i.
  • i represents the index of the sample.
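
The snippet below evaluates this critic loss for a small batch of placeholder values. Since a separate action-value network is usually not trained in practice, the TD target r_i + \gamma V_w(s'_i) stands in for Q_w(s_i, a_i), just as in the CartPole code further down; all numbers are illustrative.

Python
import tensorflow as tf

# Placeholder critic estimates for N = 3 states (illustrative values only)
state_values = tf.constant([0.8, 1.1, 0.4])         # V_w(s_i)
# Stand-ins for Q_w(s_i, a_i); in practice these are TD targets r_i + gamma * V_w(s'_i)
action_value_targets = tf.constant([1.0, 0.9, 0.7])

# Mean-squared error minimized by gradient descent on the critic's parameters w
critic_loss = tf.reduce_mean(tf.square(state_values - action_value_targets))
print(critic_loss.numpy())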

Update Rules

The update rules for the actor and critic involve adjusting their respective parameters using gradient ascent (for the actor) and gradient descent (for the critic).

Actor Update

\theta_{t+1}= \theta_t + \alpha \nabla_\theta J(\theta_t)

Here,

  • \alpha is the learning rate for the actor.
  • t is the time step within an episode.

Critic Update

w_{t+1} = w_t - \beta \nabla_w J(w_t)

Here,

  • w represents the parameters of the critic network
  • \beta is the learning rate for the critic
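
In deep-RL code these updates are normally delegated to an optimizer such as Adam, but written out directly they are just the two rules above. A tiny NumPy illustration with made-up parameter and gradient values:

Python
import numpy as np

alpha, beta = 0.001, 0.005            # actor and critic learning rates (placeholder values)

theta = np.array([0.2, -0.1])         # actor parameters theta_t
grad_J_theta = np.array([0.5, 1.0])   # estimated policy gradient at theta_t

w = np.array([0.7, 0.3])              # critic parameters w_t
grad_J_w = np.array([-0.2, 0.4])      # gradient of the critic loss at w_t

theta = theta + alpha * grad_J_theta  # gradient ascent on J(theta)
w = w - beta * grad_J_w               # gradient descent on the critic loss
print(theta, w)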

Advantage Function

The advantage function, A(s, a), measures the advantage of taking action a in state s over the expected value of that state under the current policy.

A(s, a) = Q(s, a) - V(s)

The advantage function, then, provides a measure of how much better or worse an action is compared to the average action.

These mathematical expressions highlight the essential computations involved in the Actor-Critic method. The actor is updated based on the policy gradient, encouraging actions with higher advantages, while the critic is updated to minimize the difference between the estimated value and the action-value.
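
For instance, with placeholder values Q(s, a) = 1.8 and V(s) = 1.5, the advantage is A(s, a) = 1.8 - 1.5 = 0.3: the action is slightly better than the policy's average behaviour from s, so the actor update increases its probability. A negative advantage would instead push the policy away from that action.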

Training an Agent: Actor-Critic Algorithm

Let's understand how the Actor-Critic algorithm works in practice. Below is an implementation of a simple Actor-Critic algorithm using TensorFlow and OpenAI Gym to train an agent in the CartPole environment.

Step 1: Import Libraries

Python
import numpy as np
import tensorflow as tf
import gym

Step 2: Creating CartPole Environment

Create the CartPole environment using the gym.make() function from the Gym library, which provides a standardized and convenient interface to a wide range of reinforcement learning tasks.

Python
# Create the CartPole environment
env = gym.make('CartPole-v1')

Step 3: Defining Actor and Critic Networks

  • The actor and the critic are implemented as neural networks using TensorFlow's Keras API.
  • The actor network maps the state to a probability distribution over actions.
  • The critic network estimates the value of the state.
Python
# Define the actor and critic networks
actor = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dense(env.action_space.n, activation='softmax')
])

critic = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dense(1)
])

Step 4: Defining the Optimizers

We use the Adam optimizer for both networks; the actor and critic losses themselves are computed inside the training loop.

Python
# Define optimizers for the actor and critic
actor_optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)
critic_optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)

Step 5: Training Loop

The training loop runs for 1000 episodes, with the agent interacting with the environment, calculating advantages, and updating both the actor and critic.

Python
# Main training loop
# Note: this code uses the older Gym API (env.reset() returns the state,
# env.step() returns 4 values), i.e. gym < 0.26.
num_episodes = 1000
gamma = 0.99

for episode in range(num_episodes):
    state = env.reset()
    episode_reward = 0

    with tf.GradientTape(persistent=True) as tape:
        for t in range(1, 10000):  # Limit the number of time steps
            # Choose an action using the actor
            action_probs = actor(np.array([state]))
            action = np.random.choice(env.action_space.n, p=action_probs.numpy()[0])

            # Take the chosen action and observe the next state and reward
            next_state, reward, done, _ = env.step(action)

            # Compute the TD-error advantage (no bootstrapping from terminal states)
            state_value = critic(np.array([state]))[0, 0]
            next_state_value = critic(np.array([next_state]))[0, 0]
            advantage = reward + gamma * next_state_value * (1.0 - float(done)) - state_value

            # Compute actor and critic losses
            actor_loss = -tf.math.log(action_probs[0, action]) * advantage
            critic_loss = tf.square(advantage)

            episode_reward += reward

            # Update actor and critic
            actor_gradients = tape.gradient(actor_loss, actor.trainable_variables)
            critic_gradients = tape.gradient(critic_loss, critic.trainable_variables)
            actor_optimizer.apply_gradients(zip(actor_gradients, actor.trainable_variables))
            critic_optimizer.apply_gradients(zip(critic_gradients, critic.trainable_variables))

            # Move to the next state
            state = next_state

            if done:
                break

    if episode % 10 == 0:
        print(f'Episode {episode}, Reward: {episode_reward}')

env.close()

Output:

(The training loop prints the episode index and total episode reward every 10 episodes as training progresses.)
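
After training, the learned actor can be sanity-checked by running it greedily, i.e. always picking the most probable action. The sketch below reuses the actor defined above, creates a fresh environment (the training loop closed the original one), and, like the training code, assumes the older Gym reset/step API.

Python
# Greedy evaluation of the trained actor for one episode (sketch; older Gym API assumed)
eval_env = gym.make('CartPole-v1')
state = eval_env.reset()
total_reward, done = 0.0, False
while not done:
    action_probs = actor(np.array([state]))
    action = int(np.argmax(action_probs.numpy()[0]))  # most probable action instead of sampling
    state, reward, done, _ = eval_env.step(action)
    total_reward += reward
eval_env.close()
print(f'Greedy evaluation reward: {total_reward}')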

Advantages of the Actor-Critic Algorithm

The Actor-Critic method offers several advantages:

  • Improved Sample Efficiency: The hybrid nature of Actor-Critic algorithms often leads to improved sample efficiency, requiring fewer interactions with the environment to achieve optimal performance.
  • Faster Convergence: The method's ability to update both the policy and value function concurrently contributes to faster convergence during training, enabling quicker adaptation to the learning task.
  • Versatility Across Action Spaces: Actor-Critic architectures can seamlessly handle both discrete and continuous action spaces, offering flexibility in addressing a wide range of RL problems.
  • Off-Policy Learning (in some variants): Learns from past experiences, even when not directly following the current policy.

Variants of Actor-Critic Algorithms

Several variants of the Actor-Critic algorithm have been developed to address specific challenges or improve performance in certain types of environments:

  • Advantage Actor-Critic (A2C): A2C uses the critic's value estimates to form an advantage function, which measures how much better or worse an action is compared to the average action in that state (a short numeric sketch follows after this list). The advantage function is defined as:

A(s_t, a_t) = Q(s_t, a_t) - V(s_t)

Weighting the policy gradient by the advantage rather than the raw return reduces its variance, leading to better learning performance.

  • Asynchronous Advantage Actor-Critic (A3C): A3C is an extension of A2C that uses multiple agents (threads) running in parallel to update the policy asynchronously. This allows for more stable and faster learning by reducing correlations between updates.
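
Since Q(s_t, a_t) is usually not learned explicitly, A2C-style implementations approximate the advantage with the one-step TD error computed from the critic's value estimates. A minimal numeric sketch with placeholder values:

Python
import numpy as np

# A2C-style advantage estimate via the one-step TD error (placeholder numbers)
gamma = 0.99
reward = 1.0            # r_t
value_s = 12.4          # V(s_t) from the critic
value_next_s = 11.9     # V(s_{t+1}) from the critic

# A(s_t, a_t) = Q(s_t, a_t) - V(s_t) is approximated by r_t + gamma * V(s_{t+1}) - V(s_t)
advantage = reward + gamma * value_next_s - value_s
print(advantage)        # weights grad log pi(a_t | s_t) in the actor update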
