Genetic Algorithm for Reinforcement Learning: Python Implementation

Last Updated: 08 Apr, 2025

In reinforcement learning, the challenge is to find the best policy or the best set of parameters for a given environment. Genetic Algorithm (GA) is an optimization algorithm inspired by the process of natural evolution. It is used to find approximate solutions to complex problems by evolving a population of candidate solutions over generations.

Integrating Genetic Algorithms with Reinforcement Learning lets us optimize the policy of an RL model.

Why Use a Genetic Algorithm for RL?

  1. Exploration of Non-differentiable Spaces: If the reward function is not differentiable, traditional RL methods may struggle. GAs explore the solution space by evolving individuals without relying on gradients.
  2. Global Optimization: GAs are good at finding a global optimum in large or complex search spaces, whereas gradient-based methods can get stuck in local optima.
  3. Avoiding Local Minima: GAs maintain population diversity, reducing the risk of premature convergence to local minima, a common issue with gradient-descent methods.

Example of Genetic Algorithm for Policy Optimization

Let’s imagine that we are applying a GA to evolve a policy for a simple RL task like balancing a pole on a cart.

  1. Initialization: Start with a population of random neural networks representing different policies. Each individual in the population could be a neural network with random weights.
  2. Evaluation: Run the environment with each network and calculate the cumulative reward (fitness) for each agent. For example, the fitness could be how long the agent can balance the pole.
  3. Selection: Select the top-performing policies based on fitness. The best-performing networks are more likely to “reproduce.”
  4. Crossover: Create new networks (offspring) by combining parts of the weights of the top-performing networks.
  5. Mutation: Introduce small random changes to the offspring networks’ weights to add diversity.
  6. Repeat: The process is repeated for several generations. With each generation, the population evolves toward better-performing policies for balancing the pole. A minimal toy run of these six steps is sketched just below.
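
Before the CartPole version, here is a minimal, self-contained toy run of these six steps. The fitness is just the sum of a 10-gene vector, a stand-in for cumulative reward; every name in this sketch is illustrative and not part of the implementation that follows.

Python
import numpy as np

rng = np.random.default_rng(0)

def fitness(ind):
    # Toy fitness: sum of the genes (stands in for episode reward)
    return ind.sum()

pop = [rng.normal(size=10) for _ in range(20)]            # 1. Initialization
for gen in range(30):
    scores = [fitness(ind) for ind in pop]                # 2. Evaluation
    ranked = sorted(zip(scores, pop), key=lambda p: -p[0])
    parents = [ind for _, ind in ranked[:10]]             # 3. Selection (top half)
    children = []
    for _ in range(20):
        i, j = rng.choice(len(parents), size=2, replace=False)
        cut = rng.integers(1, 10)                         # 4. Crossover point
        child = np.concatenate([parents[i][:cut], parents[j][cut:]])
        mask = rng.random(10) < 0.1                       # 5. Mutation (~10% of genes)
        child = child + mask * rng.normal(scale=0.1, size=10)
        children.append(child)
    pop = children                                        # 6. Repeat
print("best toy fitness:", max(fitness(ind) for ind in pop))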

Python Implementation of Genetic Algorithm for Reinforcement Learning

To implement it, we follow the steps below:

1. Installing the necessary libraries

We will implement a Genetic Algorithm to optimize the policy of an RL agent, using the OpenAI Gym framework to create the environment. Note that recent Gym releases (0.26 and later) return (observation, info) from reset() and five values from step(); the code below follows that API.

pip install gym

2. Importing Libraries and Creating the Environment

We import numpy, random and matplotlib, and set up the environment. The environment used is CartPole-v1, a classic control problem where the agent has to balance a pole on a cart.

Python
import gym
import numpy as np
import random
import matplotlib.pyplot as plt

env = gym.make('CartPole-v1')

3. Population Initialization

This block defines the population initialization function. A population of agents is generated randomly. Each agent’s policy is represented by a set of weights that determines how the agent will act based on the current state.

  • np.random.randn(): Generates the random weight matrix representing each individual in the population.
  • input_dim: corresponds to the number of state variables (4 for CartPole).
  • output_dim: corresponds to the number of possible actions (2 for CartPole: left or right).
  • * 0.5: scales the random weights into a smaller initial range so the starting policies are not too extreme.
Python
def initialize_population(pop_size, input_dim, output_dim):
    population = []
    for _ in range(pop_size):
        individual = np.random.randn(input_dim, output_dim) * 0.5
        population.append(individual)
    return population
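
As a quick sanity check (a hypothetical usage line, not part of the original walkthrough), each individual for CartPole is a 4x2 weight matrix, one column of weights per action:

Python
# 5 random policies for CartPole: 4 state variables, 2 actions
population = initialize_population(pop_size=5, input_dim=4, output_dim=2)
print(len(population), population[0].shape)  # 5 (4, 2)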

4. Fitness Evaluation Function

This function evaluates how well an individual policy performs in the environment. The agent’s performance is measured by the total reward it accumulates. The function terminates either when the episode ends or after a set number of steps (max_steps).

  • np.dot(state, individual): Computes the action to take based on the current state and the individual’s policy weights. This represents a linear function approximator for the agent’s decision-making.
  • env.step(action): Applies the selected action to the environment and returns the next state, the reward, the terminated and truncated flags (which signal the end of the episode, e.g. the pole falling or a time limit being reached) and an info dict.
  • np.argmax: Selects the action with the highest value from the computed values.
Python
def fitness_function(individual, env, max_steps=100):
    state, _ = env.reset()  # Gym 0.26+: reset() returns (observation, info)
    done = False
    total_reward = 0
    steps = 0
    while not done and steps < max_steps:
        # Linear policy: pick the action with the highest score
        action = np.argmax(np.dot(state, individual))
        # Gym 0.26+: step() returns five values
        state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        total_reward += reward
        steps += 1
        print(f"Step: {steps}, Action: {action}, Reward: {reward}, Total Reward: {total_reward}, Done: {done}")
    return total_reward

5. Tournament Selection

This function selects individuals from the population using tournament selection. A random subset of the population is chosen and the best individual from the subset is selected to move on to the next generation.

  • random.sample: Selects a random set of individuals from the population.
  • np.argmax(tournament_fitness): Chooses the fittest individual from the tournament.
Python
def tournament_selection(population, fitness_scores, tournament_size=3):
    selected = []
    for _ in range(len(population)):
        tournament = random.sample(range(len(population)), tournament_size)
        tournament_fitness = [fitness_scores[i] for i in tournament]
        winner = tournament[np.argmax(tournament_fitness)]
        selected.append(population[winner])
    return selected

6. Crossover

The crossover function combines two parent solutions to create two offspring by exchanging parts of their “genetic material” (policy weights). This introduces diversity into the population and is an important part of the evolutionary process.

  • random.randint(1, len(parent1) - 1): Randomly selects a crossover point within the parent policies.
  • np.concatenate: Joins the parts of the two parents to create two offspring.
Python
def crossover(parent1, parent2):
    crossover_point = random.randint(1, len(parent1) - 1)
    offspring1 = np.concatenate((parent1[:crossover_point], parent2[crossover_point:]), axis=0)
    offspring2 = np.concatenate((parent2[:crossover_point], parent1[crossover_point:]), axis=0)
    return offspring1, offspring2

7. Mutation

Mutation introduces random changes to an individual’s policy to maintain genetic diversity and help the algorithm explore different parts of the solution space. The mutation rate determines the probability of a gene (policy weight) being altered.

  • random.random() < mutation_rate: Determines if a mutation should occur at a particular position.
  • np.random.uniform(-0.5, 0.5): Introduces a random change to the policy weights within a fixed range, adding diversity to the population.
Python
def mutate(individual, mutation_rate=0.05):
    # Note: individual is a 2-D weight matrix, so each candidate "gene"
    # here is a whole row; the same random offset is added to that row.
    for i in range(len(individual)):
        if random.random() < mutation_rate:
            individual[i] += np.random.uniform(-0.5, 0.5)
    return individual

8. Genetic Algorithm Loop

This function initializes the population, evaluates the fitness of each individual, performs selection, crossover and mutation to create the next generation, and repeats for a specified number of generations.

  • initialize_population: Creates the initial population of agents (policies).
  • The loop iterates through generations, evaluating each individual, selecting the fittest, performing crossover and mutation and updating the population.
Python
def genetic_algorithm(env, pop_size=50, generations=10, mutation_rate=0.01, max_steps_per_generation=50):
    input_dim = env.observation_space.shape[0]
    output_dim = env.action_space.n
    population = initialize_population(pop_size, input_dim, output_dim)

    for gen in range(generations):
        print(f"Generation {gen} start")
        fitness_scores = []
        for individual in population:
            total_reward = fitness_function(individual, env, max_steps=max_steps_per_generation)
            fitness_scores.append(total_reward)

        print(f"Generation {gen}, Best Fitness: {max(fitness_scores)}")

        selected_population = tournament_selection(population, fitness_scores)

        next_generation = []
        for i in range(0, len(selected_population), 2):
            parent1, parent2 = selected_population[i], selected_population[i + 1]
            offspring1, offspring2 = crossover(parent1, parent2)
            next_generation.append(mutate(offspring1, mutation_rate))
            next_generation.append(mutate(offspring2, mutation_rate))

        population = next_generation

    return population

9. Running the Genetic Algorithm

Here we run the genetic algorithm on the CartPole environment.

Python
final_population = genetic_algorithm(env, pop_size=50, generations=10, mutation_rate=0.01, max_steps_per_generation=50) 

Output:

(Screenshot: a per-step log showing Step, Action, Reward, Total Reward and Done for every timestep, followed by each generation’s best fitness.)

Model Working

Each line represents one timestep during the agent’s operation. Here’s a breakdown of the key parts:

  • Step: The current step within the episode.
  • Action: The action taken by the agent at that step, either 0 (push the cart left) or 1 (push it right).
  • Reward: The reward received for that action. CartPole gives a reward of 1.0 for every step the pole remains balanced.
  • Total Reward: The cumulative reward for the agent at that point in the episode.
  • Done: A flag indicating whether the episode has finished. It is False here, meaning the agent is still balancing the pole.

The Best Fitness at Generation 9 is 50.0, meaning the best individual balanced the pole for all 50 allowed steps (the per-generation cap, max_steps_per_generation=50). This suggests the agent is progressing; over more generations the population should continue to evolve toward better performance. A simple way to see this progress is to plot the best fitness of each generation, as sketched below.
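
matplotlib is imported at the start but never used in the loop above. A small hypothetical extension (assuming the same helper functions defined earlier) records the best fitness of each generation and plots the learning curve:

Python
def genetic_algorithm_with_history(env, pop_size=50, generations=10,
                                   mutation_rate=0.01, max_steps=50):
    # Same loop as genetic_algorithm, but it also records the best
    # fitness of every generation so progress can be plotted.
    input_dim = env.observation_space.shape[0]
    output_dim = env.action_space.n
    population = initialize_population(pop_size, input_dim, output_dim)
    best_per_gen = []
    for gen in range(generations):
        scores = [fitness_function(ind, env, max_steps) for ind in population]
        best_per_gen.append(max(scores))
        selected = tournament_selection(population, scores)
        population = []
        for i in range(0, len(selected), 2):
            o1, o2 = crossover(selected[i], selected[i + 1])
            population += [mutate(o1, mutation_rate), mutate(o2, mutation_rate)]
    plt.plot(best_per_gen, marker='o')
    plt.xlabel('Generation')
    plt.ylabel('Best cumulative reward')
    plt.title('GA learning curve on CartPole-v1')
    plt.show()
    return population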

10. Visualization

This function evaluates the agent’s performance using the best policy found by the GA. The agent interacts with the environment by selecting actions based on the learned policy until the episode ends or max_steps is reached; a way to render the run for visualization is shown afterwards.

  • np.argmax(np.dot(state, policy)): The agent selects an action based on the current state and the learned policy.
Python
def evaluate_best_policy(policy, env, max_steps=500):
    state, _ = env.reset()
    done = False
    total_reward = 0
    steps = 0
    while not done and steps < max_steps:
        action = np.argmax(np.dot(state, policy))
        state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        total_reward += reward
        steps += 1
    print(f"Total Reward: {total_reward}")

# Take an individual from the final (evolved) population; re-evaluating
# fitness and picking the argmax would guarantee the single best one.
best_policy = final_population[0]
evaluate_best_policy(best_policy, env)
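
The function above only prints the total reward. To actually watch the pole being balanced, one option (a sketch, assuming Gym 0.26+, where the render mode is requested when the environment is created) is:

Python
# Hypothetical rendering run: Gym 0.26+ takes the render mode at creation
render_env = gym.make('CartPole-v1', render_mode='human')
evaluate_best_policy(best_policy, render_env)
render_env.close()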

Output:

Total Reward: 97.0

The output shows that the agent, using the best-found policy, balanced the pole for 97 steps, accumulating 97.0 reward points. The genetic algorithm has therefore evolved a policy that works fairly well on the CartPole task.

By combining the exploration capabilities of GAs with the decision-making framework of RL, we can enhance the ability of agents to adapt and optimize in challenging tasks, leading to more robust and diverse solutions.


