Curiosity-Driven Exploration in Reinforcement Learning
Curiosity-driven exploration is an approach in reinforcement learning (RL) that addresses the challenge of sparse or delayed rewards by introducing internal, self-generated incentives for agents to explore and learn.
Why Curiosity-Driven Exploration?
The Sparse Reward Problem: In many RL environments, agents receive external (extrinsic) rewards only after completing significant milestones. For example, in a game like Mario, the agent might only get a reward after finishing a level, while most actions yield no feedback at all. This makes learning extremely slow and inefficient, as the agent might take millions of random actions before stumbling upon a rewarding sequence.
The Need for Internal Motivation: To overcome this, researchers introduced the concept of intrinsic motivation: rewards generated internally by the agent for behaviors such as exploring new states or reducing uncertainty. This mimics human curiosity, where we are driven to explore and learn even without immediate external rewards.
Types of Rewards
- Extrinsic Reward: Comes from the environment (e.g., points for finishing a level).
- Intrinsic (Curiosity) Reward: Generated by the agent, typically for visiting novel or unpredictable states.
Curiosity Reward Calculation: The Core Idea
The most common method is prediction-based curiosity:
- The agent builds a model (often a neural network) to predict the next state given the current state and action.
- After taking an action, the agent compares its predicted next state to the actual next state.
- The difference (prediction error) becomes the curiosity reward: the larger the error, the more novel or surprising the state, and the higher the reward.
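To make the idea concrete, here is a toy sketch in PyTorch (the curiosity_reward function, the scaling factor value, and the example numbers are illustrative assumptions, not part of any specific algorithm):

```python
import torch

def curiosity_reward(predicted_next, actual_next, eta=0.01):
    """Intrinsic reward proportional to the agent's prediction error."""
    return eta * 0.5 * (predicted_next - actual_next).pow(2).sum()

# A surprising transition (large prediction error) yields a larger reward
# than a familiar, well-predicted one.
familiar = curiosity_reward(torch.tensor([0.5, 0.5]), torch.tensor([0.52, 0.49]))
surprising = curiosity_reward(torch.tensor([0.5, 0.5]), torch.tensor([1.8, -0.7]))
print(familiar.item(), surprising.item())
```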
Key Architecture: Intrinsic Curiosity Module (ICM)
The ICM is a popular module for curiosity-driven RL. It typically consists of three main components (an encoder, an inverse dynamics model, and a forward dynamics model) that are trained together through a combined objective:
1. Encoder
- Purpose: Converts high-dimensional observations (e.g., images) into lower-dimensional feature vectors, denoted as \phi(s_t) for the current state s_t and \phi(s_{t+1}) for the next state s_{t+1} .
- Mathematical Representation:
\phi(s_t) = \mathrm{Encoder}(s_t)
\phi(s_{t+1}) = \mathrm{Encoder}(s_{t+1})
The encoder is typically a neural network (such as a CNN for image input).
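A minimal sketch of such an encoder, assuming PyTorch and 84x84 single-channel image observations (the framework, input size, and layer sizes are illustrative choices, not prescribed here):

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Maps a raw observation s_t to a feature vector phi(s_t)."""
    def __init__(self, feature_dim=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, stride=2, padding=1), nn.ELU(),
            nn.Conv2d(32, 32, kernel_size=3, stride=2, padding=1), nn.ELU(),
            nn.Conv2d(32, 32, kernel_size=3, stride=2, padding=1), nn.ELU(),
            nn.Flatten(),
        )
        # An 84x84 input downsampled three times (stride 2) gives 11x11 feature maps.
        self.fc = nn.Linear(32 * 11 * 11, feature_dim)

    def forward(self, obs):
        return self.fc(self.conv(obs))

# Example: phi_t = Encoder()(torch.randn(1, 1, 84, 84))  -> feature vector of shape (1, 256)
```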
2. Inverse Dynamics Model
- Purpose: Predicts the action \hat{a}_t taken by the agent, given the encoded representations of the current and next states. This encourages the encoder to focus on aspects of the environment controlled by the agent, filtering out irrelevant or uncontrollable features (e.g., background noise).
- Mathematical Representation:
\hat{a}_t = g\left( \phi(s_t), \phi(s_{t+1}); \theta_I \right)
where g is the inverse model, a neural network with parameters \theta_I .
Loss Function:
- For discrete actions (e.g., Atari), use cross-entropy loss: \mathcal{L}_{\text{inv}} = -\log P(a_t \mid \phi(s_t), \phi(s_{t+1}))
- For continuous actions, use mean squared error (MSE): \mathcal{L}_{\text{inv}} = \| a_t - \hat{a}_t \|^2
Role: Optimizing this loss ensures the encoder learns features that encode only agent-relevant (controllable) factors.
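A minimal sketch of the inverse model for discrete actions, reusing the feature dimension from the encoder sketch above (PyTorch; the layer sizes and action count are assumptions for illustration):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class InverseModel(nn.Module):
    """Predicts the action a_t from phi(s_t) and phi(s_{t+1})."""
    def __init__(self, feature_dim=256, n_actions=4, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * feature_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),   # logits over discrete actions
        )

    def forward(self, phi_t, phi_next):
        return self.net(torch.cat([phi_t, phi_next], dim=-1))

# Example with random features and a batch of 8 discrete actions:
phi_t, phi_next = torch.randn(8, 256), torch.randn(8, 256)
actions = torch.randint(0, 4, (8,))
logits = InverseModel()(phi_t, phi_next)
inv_loss = F.cross_entropy(logits, actions)   # MSE would replace this for continuous actions
```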
3. Forward Dynamics Model
- Purpose: Predicts the encoded feature vector of the next state \hat{\phi}(s_{t+1}) , given the encoded current state \phi(s_t) and the action a_t .
- Mathematical Representation: \hat{\phi}(s_{t+1}) = f\left( \phi(s_t), a_t; \theta_F \right) where f is the forward model (a neural network with parameters \theta_F ).
- Loss Function: The forward model is trained to minimize the prediction error in the feature space.
Mean squared error in feature space: \mathcal{L}_{\text{fwd}} = \frac{1}{2} \left\| \hat{\phi}(s_{t+1}) - \phi(s_{t+1}) \right\|^2
Intrinsic Reward (Curiosity Signal): The agent receives an intrinsic reward proportional to this prediction error:
r_t^{\text{int}} = \eta \cdot \frac{1}{2} \left\| \hat{\phi}(s_{t+1}) - \phi(s_{t+1}) \right\|^2
where \eta is a scaling factor for the curiosity reward.
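A minimal sketch of the forward model and the intrinsic reward computation, again assuming PyTorch, a 256-dimensional feature space, and discrete actions encoded as one-hot vectors (all illustrative choices):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ForwardModel(nn.Module):
    """Predicts phi(s_{t+1}) from phi(s_t) and a one-hot encoding of a_t."""
    def __init__(self, feature_dim=256, n_actions=4, hidden=256):
        super().__init__()
        self.n_actions = n_actions
        self.net = nn.Sequential(
            nn.Linear(feature_dim + n_actions, hidden), nn.ReLU(),
            nn.Linear(hidden, feature_dim),
        )

    def forward(self, phi_t, action):
        a_onehot = F.one_hot(action, num_classes=self.n_actions).float()
        return self.net(torch.cat([phi_t, a_onehot], dim=-1))

def intrinsic_reward(phi_pred, phi_next, eta=0.01):
    """r_int = eta * 0.5 * ||phi_hat(s_{t+1}) - phi(s_{t+1})||^2, per transition."""
    return eta * 0.5 * (phi_pred - phi_next).pow(2).sum(dim=-1)

# Example: intrinsic rewards for a batch of 8 transitions.
phi_t, phi_next = torch.randn(8, 256), torch.randn(8, 256)
actions = torch.randint(0, 4, (8,))
r_int = intrinsic_reward(ForwardModel()(phi_t, actions), phi_next)   # shape: (8,)
```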
4. Combined Optimization
- Total Loss: The ICM is trained by combining the inverse and forward losses:
- \mathcal{L}_{\text{ICM}} = (1 - \lambda)\mathcal{L}_{\text{inv}} + \lambda \mathcal{L}_{\text{fwd}}
- \lambda : Hyperparameter balancing the two losses (the original ICM paper sets this weight to 0.2).
Policy Training: The agent’s policy is trained using both extrinsic (environment) and intrinsic (curiosity) rewards:
r_t = r_t^{\text{ext}} + \beta r_t^{\text{int}}
where r_t is the total reward received by the agent at time step t and β controls the influence of curiosity.
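As a tiny worked example of this combination (the function name, the value of β, and the sample rewards are made up for illustration):

```python
def total_reward(r_ext: float, r_int: float, beta: float = 0.2) -> float:
    """Combine extrinsic and intrinsic rewards: r_t = r_ext + beta * r_int."""
    return r_ext + beta * r_int

# Even a sparse-reward step with r_ext = 0 still yields a learning signal:
print(total_reward(r_ext=0.0, r_int=0.5))
```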
Training Flow:
- The encoder and inverse dynamics model are trained together, ensuring the encoder learns meaningful representations.
- The forward model’s prediction error provides the curiosity reward, which is combined with any extrinsic rewards to train the RL agent.
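Putting the pieces together, here is a sketch of one ICM update step for a batch of transitions, reusing the Encoder, InverseModel, and ForwardModel sketched above. It assumes PyTorch and a single optimizer holding the parameters of all three modules; detaching the features in the forward loss and the hyperparameter values are common implementation choices, not requirements stated here:

```python
import torch
import torch.nn.functional as F

def icm_update(encoder, inverse_model, forward_model, optimizer,
               obs_t, obs_next, actions, lam=0.1, eta=0.01):
    """One optimization step of the ICM; returns the intrinsic rewards for the batch."""
    phi_t, phi_next = encoder(obs_t), encoder(obs_next)

    # Inverse dynamics loss: shapes the encoder toward agent-controllable features.
    logits = inverse_model(phi_t, phi_next)
    inv_loss = F.cross_entropy(logits, actions)

    # Forward dynamics loss in feature space. Features are detached here so the
    # encoder is trained only through the inverse loss (a common implementation choice).
    phi_pred = forward_model(phi_t.detach(), actions)
    fwd_loss = 0.5 * F.mse_loss(phi_pred, phi_next.detach())

    # Combined ICM objective: L_ICM = (1 - lambda) * L_inv + lambda * L_fwd.
    icm_loss = (1 - lam) * inv_loss + lam * fwd_loss
    optimizer.zero_grad()
    icm_loss.backward()
    optimizer.step()

    # Curiosity reward per transition, to be combined with the extrinsic reward.
    with torch.no_grad():
        r_int = eta * 0.5 * (phi_pred - phi_next).pow(2).sum(dim=-1)
    return r_int
```

The returned r_int is then added to the extrinsic reward, as in the formula above, before the policy is updated with any standard RL algorithm.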
Addressing Challenges
- Noisy TV Problem: Agents might be attracted to unpredictable but irrelevant phenomena (like random noise on a TV screen), since these maximize prediction error. The ICM’s encoder and inverse dynamics model help mitigate this by focusing on agent-controllable aspects of the environment.
- Trivial Randomness: Background elements or unrelated environment features can cause high prediction error. The encoder, trained via inverse dynamics, filters out such distractions.
Practical Example: Curiosity in Action
Game Example (Mario):
- In a sparse-reward version of Mario, the agent rarely receives external rewards.
- With curiosity-driven exploration, the agent is rewarded for exploring new areas or experiencing surprising outcomes, even if it hasn't reached the end of the level.
- Over time, the agent learns to traverse more of the environment, discovers new strategies, and eventually finds the path to the goal much more efficiently than an agent relying only on random exploration.
Empirical Results: Studies show that curiosity-driven agents explore significantly more of the environment and learn faster than those using random or naive exploration strategies.