Curiosity-Driven Exploration in Reinforcement Learning

Last Updated : 25 Jun, 2025

Curiosity-driven exploration is an approach in reinforcement learning (RL) that addresses the challenge of sparse or delayed rewards by giving agents internal, self-generated incentives to explore and learn.

Why Curiosity-Driven Exploration?

The Sparse Reward Problem: In many RL environments, agents receive external (extrinsic) rewards only after completing significant milestones. For example, in a game like Mario, the agent might only get a reward after finishing a level, while most actions yield no feedback at all. This makes learning extremely slow and inefficient, as the agent might take millions of random actions before stumbling upon a rewarding sequence.

The Need for Internal Motivation: To overcome this, researchers introduced the concept of intrinsic motivation: rewards generated internally by the agent for behaviors such as exploring new states or reducing uncertainty. This mimics human curiosity, where we are driven to explore and learn even without immediate external rewards.

Types of Rewards

  • Extrinsic Reward: Comes from the environment (e.g., points for finishing a level).
  • Intrinsic (Curiosity) Reward: Generated by the agent, typically for visiting novel or unpredictable states.

Curiosity Reward Calculation: The Core Idea

The most common method is prediction-based curiosity:

  • The agent builds a model (often a neural network) to predict the next state given the current state and action.
  • After taking an action, the agent compares its predicted next state to the actual next state.
  • The difference (prediction error) becomes the curiosity reward: the larger the error, the more novel or surprising the state, and the higher the reward.
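
A minimal sketch of this idea (assuming NumPy arrays for the predicted and actual next states; the function name and scaling are illustrative, not taken from a specific library):

```python
import numpy as np

def curiosity_reward(predicted_next_state, actual_next_state, eta=1.0):
    # Mean squared prediction error: the more surprising the transition,
    # the larger the error and hence the larger the intrinsic reward.
    error = np.mean((np.asarray(predicted_next_state) - np.asarray(actual_next_state)) ** 2)
    return eta * error
```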

Key Architecture: Intrinsic Curiosity Module (ICM)

The ICM is a popular module for curiosity-driven RL, typically consisting of three main components:

1. Encoder

  • Purpose: Converts high-dimensional observations (e.g., images) into lower-dimensional feature vectors, denoted as \phi(s_t) for the state s_t and \phi(s_{t+1}) for the next state s_{t+1}.
  • Mathematical Representation:

\phi(s_t) = \mathrm{Encoder}(s_t)

\phi(s_{t+1}) = \mathrm{Encoder}(s_{t+1})

The encoder is typically a neural network (e.g., a CNN for image inputs).
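
For illustration, a minimal PyTorch sketch of such an encoder (assuming 84x84 grayscale frames; the layer sizes and feature dimension are assumptions for the example, not specified in the article):

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Maps a raw observation s_t to a feature vector phi(s_t)."""
    def __init__(self, feature_dim=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, stride=2, padding=1), nn.ELU(),
            nn.Conv2d(32, 32, kernel_size=3, stride=2, padding=1), nn.ELU(),
            nn.Conv2d(32, 32, kernel_size=3, stride=2, padding=1), nn.ELU(),
        )
        # An 84x84 input shrinks to 42 -> 21 -> 11 after the three stride-2 convolutions.
        self.fc = nn.Linear(32 * 11 * 11, feature_dim)

    def forward(self, obs):                      # obs: (batch, 1, 84, 84)
        x = self.conv(obs)
        return self.fc(x.flatten(start_dim=1))   # phi(s): (batch, feature_dim)
```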

2. Inverse Dynamics Model

  • Purpose: Predicts the action \hat{a}_t  taken by the agent, given the encoded representations of the current and next states. This encourages the encoder to focus on aspects of the environment controlled by the agent, filtering out irrelevant or uncontrollable features (e.g., background noise).
  • Mathematical Representation:

\hat{a}_t = g\left( \phi(s_t), \phi(s_{t+1}); \theta_I \right)

where g is the inverse model: a neural network with parameters \theta_I.

Loss Function:

  • For discrete actions (e.g., Atari), use cross-entropy loss: \mathcal{L}_{\text{inv}} = -\log P(a_t \mid \phi(s_t), \phi(s_{t+1}))
  • For continuous actions, use mean squared error (MSE): \mathcal{L}_{\text{inv}} = \| a_t - \hat{a}_t \|^2

Role: Optimizing this loss ensures the encoder learns features that encode only agent-relevant (controllable) factors.
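
A sketch of the inverse model and its loss for discrete actions, continuing the illustrative PyTorch example above (the hidden size and number of actions are assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class InverseModel(nn.Module):
    """Predicts the action a_t from phi(s_t) and phi(s_{t+1})."""
    def __init__(self, feature_dim=256, num_actions=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * feature_dim, 256), nn.ReLU(),
            nn.Linear(256, num_actions),     # logits over the discrete actions
        )

    def forward(self, phi_s, phi_next):
        return self.net(torch.cat([phi_s, phi_next], dim=1))

def inverse_loss(inverse_model, phi_s, phi_next, actions):
    # Cross-entropy between the predicted action logits and the actions actually taken.
    logits = inverse_model(phi_s, phi_next)
    return F.cross_entropy(logits, actions)
```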

3. Forward Dynamics Model

  • Purpose: Predicts the encoded feature vector of the next state \hat{\phi}(s_{t+1}), given the encoded current state \phi(s_t) and the action a_t.
  • Mathematical Representation: \hat{\phi}(s_{t+1}) = f\left( \phi(s_t), a_t; \theta_F \right) where f is the forward model (a neural network with parameters \theta_F ).
  • Loss Function: The forward model is trained to minimize the prediction error in the feature space:

Mean squared error in feature space: \mathcal{L}_{\text{fwd}} = \frac{1}{2} \left\| \hat{\phi}(s_{t+1}) - \phi(s_{t+1}) \right\|^2

Intrinsic Reward (Curiosity Signal): The agent receives an intrinsic reward proportional to this prediction error:

r_t^{\text{int}} = \eta \cdot \frac{1}{2} \left\| \hat{\phi}(s_{t+1}) - \phi(s_{t+1}) \right\|^2

where \eta is a scaling factor for the curiosity reward.
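
Continuing the same illustrative sketch, the forward model, its loss, and the intrinsic reward could look like this (actions are assumed to be one-hot encoded, and \eta = 0.01 is just an example value):

```python
import torch
import torch.nn as nn

class ForwardModel(nn.Module):
    """Predicts phi(s_{t+1}) from phi(s_t) and a one-hot encoded action a_t."""
    def __init__(self, feature_dim=256, num_actions=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feature_dim + num_actions, 256), nn.ReLU(),
            nn.Linear(256, feature_dim),
        )

    def forward(self, phi_s, action_onehot):
        return self.net(torch.cat([phi_s, action_onehot], dim=1))

def forward_loss(phi_next_pred, phi_next):
    # L_fwd = 1/2 * ||phi_hat(s_{t+1}) - phi(s_{t+1})||^2, averaged over the batch.
    return 0.5 * (phi_next_pred - phi_next).pow(2).sum(dim=1).mean()

def intrinsic_reward(phi_next_pred, phi_next, eta=0.01):
    # r_int = eta * 1/2 * ||phi_hat(s_{t+1}) - phi(s_{t+1})||^2, one value per transition.
    return eta * 0.5 * (phi_next_pred - phi_next).pow(2).sum(dim=1)
```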

4. Combined Optimization

  • Total Loss: The ICM is trained by combining the inverse and forward losses: \mathcal{L}_{\text{ICM}} = (1 - \lambda)\mathcal{L}_{\text{inv}} + \lambda \mathcal{L}_{\text{fwd}}
  • \lambda: Hyperparameter balancing the two losses (e.g., \lambda = 0.1 in the original paper).
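
In code, the combined objective is a simple weighted sum (a sketch, using the inverse and forward losses from the examples above):

```python
def icm_loss(loss_inv, loss_fwd, lam=0.1):
    # L_ICM = (1 - lambda) * L_inv + lambda * L_fwd, with lambda = 0.1 as in the original paper.
    return (1 - lam) * loss_inv + lam * loss_fwd
```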

Policy Training: The agent’s policy is trained using both extrinsic (environment) and intrinsic (curiosity) rewards:

r_t = r_t^{\text{ext}} + \beta r_t^{\text{int}}

where r_t is the total reward received by the agent at time step t and \beta controls the influence of curiosity.
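
As a small sketch, the reward actually fed to the RL algorithm is just this weighted sum (the value of \beta here is an illustrative choice):

```python
def total_reward(r_ext, r_int, beta=0.01):
    # r_t = r_ext + beta * r_int; beta controls how strongly curiosity drives the agent.
    return r_ext + beta * r_int
```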

Training Flow:

  • The encoder and inverse dynamics model are trained together, ensuring the encoder learns meaningful representations.
  • The forward model’s prediction error provides the curiosity reward, which is combined with any extrinsic rewards to train the RL agent.
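
Putting the pieces together, one ICM update on a batch of transitions might look roughly like the following sketch. It reuses the illustrative Encoder, InverseModel, and ForwardModel classes defined above and assumes the policy itself is updated separately (e.g., with PPO or A2C); none of the specific values are prescribed by the article:

```python
import torch
import torch.nn.functional as F

encoder, inverse_model, forward_model = Encoder(), InverseModel(), ForwardModel()
icm_params = (list(encoder.parameters()) + list(inverse_model.parameters())
              + list(forward_model.parameters()))
optimizer = torch.optim.Adam(icm_params, lr=1e-3)

def icm_step(obs, actions, next_obs, num_actions=4, lam=0.1, eta=0.01):
    """One ICM update; returns the intrinsic rewards for the batch."""
    phi_s, phi_next = encoder(obs), encoder(next_obs)

    # Inverse model: predict the taken action from the two feature vectors.
    loss_inv = F.cross_entropy(inverse_model(phi_s, phi_next), actions)

    # Forward model: predict phi(s_{t+1}); its error is the curiosity signal.
    action_onehot = F.one_hot(actions, num_classes=num_actions).float()
    phi_next_pred = forward_model(phi_s, action_onehot)
    loss_fwd = 0.5 * (phi_next_pred - phi_next.detach()).pow(2).sum(dim=1).mean()

    optimizer.zero_grad()
    ((1 - lam) * loss_inv + lam * loss_fwd).backward()
    optimizer.step()

    # Intrinsic rewards are read off without tracking gradients.
    with torch.no_grad():
        r_int = eta * 0.5 * (phi_next_pred - phi_next).pow(2).sum(dim=1)
    return r_int   # add beta * r_int to the extrinsic rewards before the policy update
```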

Addressing Challenges

  • Noisy TV Problem: Agents might be attracted to unpredictable but irrelevant phenomena (like random noise on a TV screen), since these maximize prediction error. The ICM’s encoder and inverse dynamics model help mitigate this by focusing on agent-controllable aspects of the environment.
  • Trivial Randomness: Background elements or unrelated environment features can cause high prediction error. The encoder, trained via inverse dynamics, filters out such distractions.

Practical Example: Curiosity in Action

Game Example (Mario):

  • In a sparse-reward version of Mario, the agent rarely receives external rewards.
  • With curiosity-driven exploration, the agent is rewarded for exploring new areas or experiencing surprising outcomes, even if it hasn't reached the end of the level.
  • Over time, the agent learns to traverse more of the environment, discovers new strategies, and eventually finds the path to the goal much more efficiently than an agent relying only on random exploration.

Empirical Results: Studies show that curiosity-driven agents explore significantly more of the environment and learn faster than those using random or naive exploration strategies.

