{ "cells": [ { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "xRQE_P5OYSTr" }, "source": [ " *by Gerard Caravaca Ibáñez*" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "2d-NLtNzYSqb" }, "source": [ "## **TD3 implementation**\n", "\n", "This notebook is an implementation of the TD3 algorithm for reinforcement learning, proposed in [1]. \n", "\n", "**TD3 (Twin Delayed Deep Deterministic Policy Gradient)** is a state-of-the-art reinforcement learning algorithm that is designed to learn continuous control policies in environments with high-dimensional state and action spaces. TD3 is an extension of the original DDPG algorithm, which was limited by its susceptibility to overestimation of the Q-function and sensitivity to hyperparameters.\n", "\n", "TD3 improves upon DDPG by introducing several key modifications, including the use of twin Q-networks to reduce overestimation bias, delayed policy updates to improve stability, and target policy smoothing to reduce variance.\n", "\n", "\n", "*[1] Fujimoto, S., Hoof, H., & Meger, D. (2018, July). Addressing function approximation error in actor-critic methods. In International conference on machine learning (pp. 1587-1596). 
PMLR.*" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "tE636EOu985l" }, "source": [ "# **Imports**" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "70sYGfWnZIeY", "outputId": "0241a45a-c55d-4a68-a06a-10a3e90e1f9d" }, "outputs": [], "source": [ "!pip install box2d-py\n", "!pip install gym[box2d]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "OOgTdhE796G8" }, "outputs": [], "source": [ "import numpy as np\n", "import tensorflow as tf\n", "import tensorflow.keras as keras\n", "from tensorflow.keras.layers import Dense\n", "from tensorflow.keras.optimizers.legacy import Adam\n", "import gym\n", "import os\n", "import matplotlib.pyplot as plt\n", "from gym.wrappers import RecordVideo\n", "import glob\n", "import io\n", "import base64\n", "from IPython.display import HTML\n", "from IPython import display" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "CVWtIfGZQE2p", "outputId": "87687587-23c6-4d9f-e265-26b644b2f3bc" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2.12.0\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/usr/local/lib/python3.9/dist-packages/ipykernel/ipkernel.py:283: DeprecationWarning: `should_run_async` will not call `transform_cell` automatically in the future. 
Please pass the result to `transformed_cell` argument and any exception that happen during thetransform in `preprocessing_exc_tuple` in IPython 7.17 and above.\n", " and should_run_async(code)\n" ] } ], "source": [ "print(tf.__version__)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "4jclbREX5Ps3", "outputId": "5495a9af-5e84-4f1e-fd50-0d093fe12fc0" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Mounted at /content/drive\n" ] } ], "source": [ "from google.colab import drive\n", "drive.mount('/content/drive')" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "QR-_j-ZQ-yDh" }, "source": [ "# **Replay Buffer**" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "LJUAwgV0YcZx" }, "source": [ "The following class implements a **replay buffer** based on NumPy arrays. This buffer stores experiences during the training phase. It allows the agent to learn from past experiences by randomly sampling a batch of transitions from the buffer, instead of relying solely on the most recent experience. The buffer has a fixed maximum size to avoid exhausting the device's RAM; once it is full, the oldest experiences are overwritten.\n", "\n", "In particular, this class implements two functions:\n", "\n", "* Store: stores an experience in the replay buffer. It takes as input a state, an action, a reward, a new state, and a boolean flag indicating whether the episode is done (done). It writes the provided values at the current storage index (id), which wraps around when the buffer is full, and then increments the storage counter.\n", "* Sample: randomly samples a batch of experiences from the replay buffer. It takes as input the desired batch size. It first determines the maximum index to sample from, which is the minimum of the current storage index (self.curr_storage) and the maximum storage size (self.max_storage). 
It then randomly selects batch_size indices (with replacement) from the range of valid indices and uses them to retrieve the corresponding samples from the state, action, reward, new state, and terminal buffers. Finally, it returns these samples as separate arrays.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "4QFzMS7r-2YZ" }, "outputs": [], "source": [ "class ReplayBuffer:\n", "    def __init__(self, max_size, input_shape, n_actions):\n", "        self.max_storage = max_size\n", "        self.curr_storage = 0\n", "        self.curr_buffer = np.zeros((self.max_storage, *input_shape))\n", "        self.new_buffer = np.zeros((self.max_storage, *input_shape))\n", "        self.action_buffer = np.zeros((self.max_storage, n_actions))\n", "        self.reward_buffer = np.zeros(self.max_storage)\n", "        self.term_buffer = np.zeros(self.max_storage, dtype=bool)\n", "\n", "    def store(self, state, action, reward, new_state, done):\n", "        # Write at the current index, wrapping around once the buffer is full\n", "        id = self.curr_storage % self.max_storage\n", "\n", "        self.curr_buffer[id] = state\n", "        self.new_buffer[id] = new_state\n", "        self.action_buffer[id] = action\n", "        self.reward_buffer[id] = reward\n", "        self.term_buffer[id] = done\n", "\n", "        self.curr_storage += 1\n", "\n", "    def sample(self, batch_size):\n", "        # Only sample from the part of the buffer that has been filled\n", "        max = min(self.curr_storage, self.max_storage)\n", "        batch = np.random.choice(max, batch_size)\n", "\n", "        return self.curr_buffer[batch], self.action_buffer[batch], self.reward_buffer[batch], self.new_buffer[batch], self.term_buffer[batch]" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "nKuGmLOnCIrP" }, "source": [ "# **Critic**" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "WqN2Sl5WYf9G" }, "source": [ "This class implements the critic neural network, which approximates the Q-function for a given state and action. The class subclasses keras.Model: the init function defines the model architecture and the call function performs the forward pass through the network. 
I decided to use a feed-forward network with two dense hidden layers with ReLU activations, followed by a linear output layer that produces a single Q-value.\n", "\n", "**Important:** if you want to run this notebook, modify the *chkpt_dir* parameter. This is the path of the directory where the models are stored during training." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "L_H9pPa7Bgeq" }, "outputs": [], "source": [ "class Critic(keras.Model):\n", "    def __init__(self, n1, n2, name='critic', chkpt_dir='/content/drive/MyDrive/MAI/ATCI/implementation/models'):\n", "        super(Critic, self).__init__()\n", "        self.l1_size = n1\n", "        self.l2_size = n2\n", "        self.model_name = name\n", "        self.checkpoint_dir = chkpt_dir\n", "        self.checkpoint_file = os.path.join(self.checkpoint_dir, name)\n", "\n", "        # Net architecture\n", "        self.l1 = Dense(self.l1_size, activation='relu')\n", "        self.l2 = Dense(self.l2_size, activation='relu')\n", "        self.q = Dense(1, activation=None)\n", "\n", "    def call(self, state, action):\n", "        # The critic receives the state and the action concatenated\n", "        q = self.l1(tf.concat([state, action], axis=1))\n", "        q = self.l2(q)\n", "\n", "        q = self.q(q)\n", "\n", "        return q" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "DMq0jeTCF8H5" }, "source": [ "# **Actor**" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "99dkreCzYkJs" }, "source": [ "This class implements the actor neural network, which takes the current state as input and outputs an action according to the current policy. The output layer uses a tanh activation, so each action component lies in [-1, 1]. Everything explained for the critic also applies to this class.\n", "\n", "**Important:** if you want to run this notebook, modify the *chkpt* parameter. This is the path of the directory where the models are stored during training." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "E0EsO-UOGDB3" }, "outputs": [], "source": [ "class Actor(keras.Model):\n", "    def __init__(self, n1, n2, n_actions, name='actor',\n", "                 chkpt='/content/drive/MyDrive/MAI/ATCI/implementation/models'):\n", "        super(Actor, self).__init__()\n", "        self.l1_size = n1\n", "        self.l2_size = n2\n", "        self.n_actions = n_actions\n", "        self.model_name = name\n", "        self.checkpoint_dir = chkpt\n", "        self.checkpoint_file = os.path.join(self.checkpoint_dir, name)\n", "\n", "        # Net architecture\n", "        self.l1 = Dense(self.l1_size, activation='relu')\n", "        self.l2 = Dense(self.l2_size, activation='relu')\n", "        self.mu = Dense(self.n_actions, activation='tanh')\n", "\n", "    def call(self, state):\n", "        prob = self.l1(state)\n", "        prob = self.l2(prob)\n", "\n", "        mu = self.mu(prob)\n", "\n", "        return mu" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "uAurJoSTHNaF" }, "source": [ "# **Agent**" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "VFYFEs0pYmfG" }, "source": [ "This class implements the complete agent, which learns to interact with an environment in order to achieve a specific goal.\n", "\n", "All the logic of the algorithm is implemented in this class. The three distinguishing features of TD3 can be identified in it:\n", "\n", "* Twin Q-networks: the initialisation of the class creates 2 Critic models. 
Moreover, in the training function the minimum of the two target Q-values is used (pessimistic Q-learning).\n", "* Clipped noise: the next_action function adds Gaussian exploration noise to the policy output and clips the result to the valid action range. The training function also adds clipped noise to the target actions (target policy smoothing).\n", "* Delayed policy update: in the training function the actor is only updated once every update_actor_interval critic updates.\n", "\n", "Each of the functions implemented in this class is explained below:\n", "\n", "* The __init__ method is the constructor of the Agent class. It initializes the agent with various parameters, such as learning rates (lr_a and lr_c), input shape (input_shape), the environment (env), update interval for the actor network (update_actor_interval), number of actions (n_actions), maximum size of the replay buffer (max_size), layer sizes for the actor and critic models (l1_size and l2_size), a parameter for soft target network updates (tau), and a name for the agent (name).\n", "\n", "* The __next_action__ method determines the next action to take based on the current observation. It takes as input an observation and optional parameters for noise (noise) and a discount factor (gamma, which this method does not use). It first converts the observation to a tensor and passes it through the actor network (self.actor) to obtain the mean action (mu). Gaussian noise is added to the mean action, and the resulting action is clipped to the valid action range. The method also increments the time step count.\n", "\n", "* The __save_mem__ method stores an experience in the agent's replay buffer (self.replay_buffer). It takes as input a state, action, reward, new state, and done flag, and calls the store method of the replay buffer to store the experience.\n", "\n", "* The __training__ method trains the agent's critic and actor networks. It takes an optional batch size (batch_size), soft target network update parameter (tau), and discount factor (gamma). 
If the replay buffer does not have enough experiences to form a batch, the method returns early. Otherwise, it samples a batch of experiences from the replay buffer. The states, actions, rewards, and new states are converted to tensors. Within a tf.GradientTape context, the target actions are computed by passing the new states through the target actor network (self.target_actor). Some noise is added to the target actions, which are then clipped to the valid action range. The target Q-values are computed by passing the new states and target actions through the target critic networks (self.target_critic1 and self.target_critic2). The current Q-values are obtained by passing the states and actions through the critic networks (self.critic1 and self.critic2). The minimum of the target Q-values is taken as the Q-value estimate. The critic loss is computed by comparing the target Q-values with the current Q-values. The loss is then used to compute the gradients for the critic networks (self.critic1 and self.critic2), and the optimizer is applied to update the critic networks' weights. The learn_step_count is incremented, and if it is not a multiple of update_actor_interval, the method returns early. Otherwise, within another tf.GradientTape context, the actor loss is computed by passing the states through the actor network (self.actor). 
The gradients for the actor network are computed using the actor loss, and the optimizer is applied to update the actor network's weights.\n", "\n", "* Finally, the __update_network__ method is called to update the weights of the target networks (self.target_actor, self.target_critic1, and self.target_critic2) based on the current network weights.\n", "\n", "\n", "One thing to note is that, although every model has an additional model called target, only the main model is trained by gradient descent. The target model is a separate network whose weights are periodically moved towards the weights of the main model with a soft update controlled by tau. This is done to improve the stability of the models in the training phase.\n", "\n", "Another tricky point is the use of the tf.squeeze function. It is used to reshape tensors from shape (batch_size, 1) to shape (batch_size,).\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "qxXxMe6iHHay" }, "outputs": [], "source": [ "class Agent:\n", "    def __init__(self, lr_a, lr_c, input_shape, env,\n", "                 update_actor_interval=2,\n", "                 n_actions=2, max_size=1000000,\n", "                 l1_size=400, l2_size=300, tau=0.005, name=''):\n", "\n", "        self.n_actions = n_actions\n", "\n", "        self.replay_buffer = ReplayBuffer(max_size, input_shape, n_actions)\n", "\n", "        self.update_actor_interval = update_actor_interval\n", "        self.time_step = 0\n", "        self.learn_step_count = 0\n", "        self.max_action = env.action_space.high[0]\n", "        self.min_action = env.action_space.low[0]\n", "\n", "        # Needed networks\n", "        self.actor = Actor(l1_size, l2_size, n_actions=n_actions, name='actor_'+name)\n", "        self.critic1 = Critic(l1_size, l2_size, name='critic1_'+name)\n", "        self.critic2 = Critic(l1_size, l2_size, name='critic2_'+name)\n", "        self.target_actor = Actor(l1_size, l2_size, n_actions=n_actions, name='target_actor_'+name)\n", "        self.target_critic1 = Critic(l1_size, l2_size, name='target_critic1_'+name)\n", "        self.target_critic2 = Critic(l1_size, l2_size, name='target_critic2_'+name)\n", "\n", "        # Optimizers\n", "        opt_actor = Adam(learning_rate=lr_a)\n", "        opt_critics = Adam(learning_rate=lr_c)\n", "\n", "        self.actor.compile(optimizer=opt_actor, loss='mean')\n", "        self.target_actor.compile(optimizer=opt_actor, loss='mean')\n", "        self.critic1.compile(optimizer=opt_critics, loss='mean_squared_error')\n", "        self.target_critic1.compile(optimizer=opt_critics, loss='mean_squared_error')\n", "        self.critic2.compile(optimizer=opt_critics, loss='mean_squared_error')\n", "        self.target_critic2.compile(optimizer=opt_critics, loss='mean_squared_error')\n", "\n", "        self.update_network(tau=tau)\n", "\n", "    def next_action(self, observation, noise=0.1, gamma=0.99):\n", "        # Get policy with some noise\n", "        state = tf.convert_to_tensor([observation], dtype=tf.float32)\n", "        mu = self.actor(state)[0]\n", "        # Independent Gaussian exploration noise for each action dimension\n", "        mu_ = mu + tf.random.normal(shape=tf.shape(mu), stddev=noise)\n", "\n", "        # Clipped action exploration\n", "        mu_ = tf.clip_by_value(mu_, self.min_action, self.max_action)\n", "        self.time_step += 1\n", "\n", "        return mu_\n", "\n", "    def save_mem(self, state, action, reward, new_state, done):\n", "        self.replay_buffer.store(state, action, reward, new_state, done)\n", "\n", "    def training(self, batch_size=100, tau=0.005, gamma=0.99):\n", "        if self.replay_buffer.curr_storage < batch_size:\n", "            return\n", "\n", "        states, actions, rewards, new_states, dones = self.replay_buffer.sample(batch_size)\n", "\n", "        # convert to tensor for training\n", "        states = tf.convert_to_tensor(states, dtype=tf.float32)\n", "        actions = tf.convert_to_tensor(actions, dtype=tf.float32)\n", "        rewards = tf.convert_to_tensor(rewards, dtype=tf.float32)\n", "        new_states = tf.convert_to_tensor(new_states, dtype=tf.float32)\n", "\n", "        with tf.GradientTape(persistent=True) as tape:\n", "            # Select action according to the target policy and add clipped noise\n", "            target_actions = self.target_actor(new_states)\n", "            # Target policy smoothing: independent clipped noise for every sampled action\n", "            smoothing_noise = tf.clip_by_value(\n", "                tf.random.normal(shape=tf.shape(target_actions), stddev=0.2), -0.5, 0.5)\n", "            target_actions = target_actions + smoothing_noise\n", "            target_actions = tf.clip_by_value(target_actions, self.min_action,\n", "                                              self.max_action)\n", "            # Compute the target Q value\n", "            q1_ = self.target_critic1(new_states, target_actions)\n", "            q2_ = self.target_critic2(new_states, target_actions)\n", "            q1 = tf.squeeze(self.critic1(states, actions), 1)\n", "            q2 = tf.squeeze(self.critic2(states, actions), 1)\n", "            # shape is [batch_size, 1], want to collapse to [batch_size],\n", "            # squeeze removes dimensions of size 1 from the shape of a tensor.\n", "            q1_ = tf.squeeze(q1_, 1)\n", "            q2_ = tf.squeeze(q2_, 1)\n", "            # pessimistic double-Q learning\n", "            q = tf.math.minimum(q1_, q2_)\n", "\n", "            # Compute critic loss\n", "            target = rewards + gamma*q*(1-dones)\n", "            critic1_loss = keras.losses.MSE(target, q1)\n", "            critic2_loss = keras.losses.MSE(target, q2)\n", "\n", "        # Optimize the critic\n", "        critic1_gradient = tape.gradient(critic1_loss,\n", "                                         self.critic1.trainable_variables)\n", "        critic2_gradient = tape.gradient(critic2_loss,\n", "                                         self.critic2.trainable_variables)\n", "        self.critic1.optimizer.apply_gradients(\n", "            zip(critic1_gradient, self.critic1.trainable_variables))\n", "        self.critic2.optimizer.apply_gradients(\n", "            zip(critic2_gradient, self.critic2.trainable_variables))\n", "\n", "        self.learn_step_count += 1\n", "\n", "        # delayed policy update\n", "        if self.learn_step_count % self.update_actor_interval != 0:\n", "            return\n", "\n", "        with tf.GradientTape() as tape:\n", "            # Compute actor loss\n", "            new_actions = self.actor(states)\n", "            critic1_value = self.critic1(states, new_actions)\n", "            actor_loss = -tf.math.reduce_mean(critic1_value)\n", "\n", "        # Optimize the actor\n", "        actor_gradient = tape.gradient(actor_loss, self.actor.trainable_variables)\n", "        self.actor.optimizer.apply_gradients(zip(actor_gradient, self.actor.trainable_variables))\n", "\n", "        self.update_network(tau=tau)\n", "\n", "    def update_network(self, tau):\n", "        # Soft update: target = tau * network + (1 - tau) * target\n", "        weights = []\n", "        targets = self.target_actor.weights\n", "        for i, weight in enumerate(self.actor.weights):\n", "            weights.append(weight * tau + targets[i]*(1-tau))\n", "\n", "        self.target_actor.set_weights(weights)\n", "\n", "        weights = []\n", "        targets = self.target_critic1.weights\n", "        for i, weight in enumerate(self.critic1.weights):\n", "            weights.append(weight * tau + targets[i]*(1-tau))\n", "\n", "        self.target_critic1.set_weights(weights)\n", "\n", "        weights = []\n", "        targets = self.target_critic2.weights\n", "        for i, weight in enumerate(self.critic2.weights):\n", "            weights.append(weight * tau + targets[i]*(1-tau))\n", "\n", "        self.target_critic2.set_weights(weights)\n", "\n", "    def save_models(self):\n", "        print('Saving models ...')\n", "        self.actor.save_weights(self.actor.checkpoint_file)\n", "        self.critic1.save_weights(self.critic1.checkpoint_file)\n", "        self.critic2.save_weights(self.critic2.checkpoint_file)\n", "        self.target_actor.save_weights(self.target_actor.checkpoint_file)\n", "        self.target_critic1.save_weights(self.target_critic1.checkpoint_file)\n", "        self.target_critic2.save_weights(self.target_critic2.checkpoint_file)\n", "\n", "    def load_models(self):\n", "        print('Loading models ...')\n", "        self.actor.load_weights(self.actor.checkpoint_file)\n", "        self.critic1.load_weights(self.critic1.checkpoint_file)\n", "        self.critic2.load_weights(self.critic2.checkpoint_file)\n", "        self.target_actor.load_weights(self.target_actor.checkpoint_file)\n", "        self.target_critic1.load_weights(self.target_critic1.checkpoint_file)\n", "        self.target_critic2.load_weights(self.target_critic2.checkpoint_file)\n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "ie-4HhccFElW" }, "source": [ "# **Extra functions**" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "3TY10d6zYrVX" }, "source": [ "This section contains the functions that have been used to run the experiments." 
] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "ekOJrYz9FOr3" }, "source": [ "Parameters" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "4-fXBFsrFOXi" }, "outputs": [], "source": [ "ENV='Pendulum-v1'\n", "LR_ACTOR=0.001\n", "LR_CRITIC=0.002\n", "GAMMA=0.99\n", "TAU=0.005\n", "NOISE=0.1\n", "BATCH_SIZE=128\n", "L1_SIZE=512\n", "L2_SIZE=512\n", "UPDATE_ACTOR_INTERVAL=2\n", "N_ITERATIONS=300\n", "MAX_SIZE=100000\n", "EXP_NAME='pendulum'\n", "SAVE_POINTS=[100, 200, 250, 300, 350, 400, 450]" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "nt66Fcv9F1S3" }, "source": [ "Training function\n", "\n", "The train function trains an agent in a gym environment using the Agent class. It performs a specified number of iterations. In each iteration, the agent interacts with the environment, collects experiences, and updates its critic and actor networks. The function keeps track of the scores achieved in each episode and the average scores over time. At the end of training, the trained agent, episode scores, and average scores are returned." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "WXE6M3g6FDAx" }, "outputs": [], "source": [ "def train():\n", "    env = gym.make(ENV)\n", "\n", "    agent = Agent(lr_a=LR_ACTOR, lr_c=LR_CRITIC, input_shape=env.observation_space.shape, env=env, update_actor_interval=UPDATE_ACTOR_INTERVAL,\n", "                  n_actions=env.action_space.shape[0], l1_size=L1_SIZE, l2_size=L2_SIZE, max_size=MAX_SIZE, name=EXP_NAME)\n", "\n", "    best_score = env.reward_range[0]\n", "    scores=[]\n", "    avg_history=[]\n", "\n", "    with tf.device('GPU:0'):\n", "        tf.random.set_seed(123)\n", "        for i in range(N_ITERATIONS):\n", "            obs = env.reset()\n", "            done = False\n", "            score = 0\n", "            while not done:\n", "                action = agent.next_action(observation=obs, noise=NOISE, gamma=GAMMA)\n", "                new_obs, reward, done, info = env.step(action)\n", "                agent.save_mem(state=obs, action=action, reward=reward, new_state=new_obs, done=done)\n", "                agent.training(batch_size=BATCH_SIZE, tau=TAU)\n", "                score += reward\n", "                obs = new_obs\n", "\n", "            scores.append(score)\n", "            # Mean of last 100 scores\n", "            mean_score = np.mean(scores[-100:])\n", "\n", "            if mean_score > best_score:\n", "                best_score = mean_score\n", "\n", "            if i in SAVE_POINTS:\n", "                agent.save_models()\n", "\n", "            avg_history.append(mean_score)\n", "            print(f\"# Episode: {i}, Reward: {score}, Mean reward: {mean_score}.\")\n", "\n", "    agent.save_models()\n", "    env.close()\n", "\n", "    return agent, scores, avg_history\n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "dg305ALJ8A60" }, "source": [ "Plot Training Curve" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "The show_curve function is used to visualize the training progress of an agent by plotting the average test reward over episodes. It takes two arguments: avg, which is a list representing the average test rewards at each episode, and score, which is a list representing the test scores achieved at each episode. 
The function creates a line plot with the x-axis representing the episode numbers and the y-axis representing the average test reward. The average test rewards are plotted in red, while the test scores are plotted in blue with reduced opacity. The function then displays the plot." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "HyAaDH5Ig_c3" }, "outputs": [], "source": [ "def show_curve(avg, score):\n", "    plt.plot(range(len(avg)), avg, 'r')\n", "    plt.plot(score, color='b', alpha=0.5)\n", "    plt.xlabel(\"Episode\")\n", "    plt.ylabel(\"Average Test Reward\")\n", "    plt.show()" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "n0QgMVqw8Lni" }, "source": [ "Show Experiment Video" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "\n", "The show_video function is used to display a video file in the Jupyter Notebook environment. It looks for video files with the .mp4 extension in the \"video\" directory. If a video file is found, it reads the file, encodes it using base64, and displays it as an HTML5 video element. The video is set to autoplay, loop, and has controls for playback. The function uses the display and HTML functions from IPython to show the video. If no video file is found, it prints a message indicating that the video could not be found." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "pLd_ZAaXxeLb" }, "outputs": [], "source": [ "def show_video():\n", "    mp4list = glob.glob('video/*.mp4')\n", "    if len(mp4list) > 0:\n", "        mp4 = mp4list[0]\n", "        video = io.open(mp4, 'r+b').read()\n", "        encoded = base64.b64encode(video)\n", "        display.display(HTML(data='''<video alt=\"test\" autoplay \n", "                loop controls style=\"height: 400px;\">\n", "                <source src=\"data:video/mp4;base64,{0}\" type=\"video/mp4\" />\n", "             </video>'''.format(encoded.decode('ascii'))))\n", "    else:\n", "        print(\"Could not find video\")" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "KD4h8Qqy8yvA" }, "source": [ "Test Learned Behavior" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "\n", "The test_behavior function is used to test the behavior of an agent in a gym environment and record a video of the agent's interactions." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "dH8KkUmA8vwM" }, "outputs": [], "source": [ "def test_behavior(agent):\n", "    env = RecordVideo(gym.make(ENV,render_mode='rgb_array',new_step_api=True),'video',new_step_api=True)\n", "\n", "    obs = env.reset()\n", "    agent.load_models()\n", "    while True:\n", "        env.render()\n", "        action = agent.next_action(observation=obs, noise=NOISE, gamma=GAMMA)\n", "        new_obs, reward, done, truncated, info = env.step(action)\n", "        if done or truncated:\n", "            break\n", "        obs = new_obs\n", "    env.close()\n", "    show_video()" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "JZEk1Yg6-eOV" }, "source": [ "Test Random Behavior" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "\n", "The random_behavior function is used to observe the behavior of an agent that takes random actions in a gym environment and record a video of its interactions. It does not rely on any pre-trained agent or specific algorithm." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "AdCgcsyw-gLj" }, "outputs": [], "source": [ "def random_behavior():\n", "    env = RecordVideo(gym.make(ENV,render_mode='rgb_array',new_step_api=True),'video',new_step_api=True)\n", "\n", "    observation = env.reset()\n", "\n", "    for _ in range(300):\n", "        env.render()\n", "        action = env.action_space.sample() # this takes random actions\n", "        observation, reward, terminated, truncated, info = env.step(action)\n", "        if terminated or truncated:\n", "            break\n", "    env.close()\n", "    show_video()" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "ESj1kJ4t9XRH" }, "source": [ "# **Experiments**" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "dvCTvuOlYucK" }, "source": [ "To test the algorithm, I used two Gym environments: \n", "\n", "* Pendulum\n", "* Lunar Lander (continuous version)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "A0ELfvjv9cZj" }, "source": [ "# **1-Pendulum**" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "s_TDcqqy9ZbX" }, "outputs": [], "source": [ "ENV='Pendulum-v1'\n", "LR_ACTOR=0.001\n", "LR_CRITIC=0.002\n", "GAMMA=0.99\n", "TAU=0.005\n", "NOISE=0.1\n", "BATCH_SIZE=128\n", "L1_SIZE=512\n", "L2_SIZE=512\n", "UPDATE_ACTOR_INTERVAL=2\n", "N_ITERATIONS=300\n", "MAX_SIZE=100000\n", "EXP_NAME='pendulum'" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "AUifM-hU-Boq" }, "source": [ "Behaviour before learning:" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "b7AnDMzBZAvG" }, "source": [ "As expected, the initial behaviour of the untrained agent is totally random."
] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "nsFl8nv2-BAV", "outputId": "e4035bfc-9970-4f0e-f41f-bfbd0a236c7c" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/usr/local/lib/python3.9/dist-packages/gym/wrappers/record_video.py:78: UserWarning: \u001b[33mWARN: Overwriting existing videos at /content/video folder (try specifying a different `video_folder` for the `RecordVideo` wrapper if this is not desired)\u001b[0m\n", " logger.warn(\n" ] }, { "data": { "text/html": [ "<video alt=\"test\" autoplay \n", " loop controls style=\"height: 400px;\">\n", " <source src=\"data:video/mp4;base64,\" type=\"video/mp4\" />\n", " </video>" ], "text/plain": [ "<IPython.core.display.HTML object>" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "random_behavior()" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "220rrsOfA6BI" }, "source": [ "Training:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "Q9pDgPJwA4fI", "outputId": "a7463993-ef2c-4054-a861-8dd9902a3bc2" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/usr/local/lib/python3.9/dist-packages/gym/core.py:317: DeprecationWarning: \u001b[33mWARN: Initializing wrapper in old step API which returns one bool instead of two. It is recommended to set `new_step_api=True` to use new step API. This will be the default behaviour in future.\u001b[0m\n", " deprecation(\n", "/usr/local/lib/python3.9/dist-packages/gym/wrappers/step_api_compatibility.py:39: DeprecationWarning: \u001b[33mWARN: Initializing environment in old step API which returns one bool instead of two. It is recommended to set `new_step_api=True` to use new step API. 
This will be the default behaviour in future.\u001b[0m\n", " deprecation(\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "# Episode: 0, Reward: -1562.7877723822164, Mean reward: -1562.7877723822164.\n", "# Episode: 1, Reward: -1361.4354621836085, Mean reward: -1462.1116172829124.\n", "# Episode: 2, Reward: -1343.92031217996, Mean reward: -1422.7145155819283.\n", "# Episode: 3, Reward: -1356.8107492610375, Mean reward: -1406.2385740017057.\n", "# Episode: 4, Reward: -1402.3624143374261, Mean reward: -1405.46334206885.\n", "# Episode: 5, Reward: -1610.3995975035982, Mean reward: -1439.619384641308.\n", "# Episode: 6, Reward: -1642.1109866652127, Mean reward: -1468.5467563590087.\n", "# Episode: 7, Reward: -1426.276993509213, Mean reward: -1463.2630360027842.\n", "# Episode: 8, Reward: -1526.5718965928436, Mean reward: -1470.297353846124.\n", "# Episode: 9, Reward: -1601.0449361066994, Mean reward: -1483.3721120721816.\n", "# Episode: 10, Reward: -1632.4195736371332, Mean reward: -1496.9218813053592.\n", "# Episode: 11, Reward: -1618.3070587284117, Mean reward: -1507.0373127572802.\n", "# Episode: 12, Reward: -1500.0601440474577, Mean reward: -1506.5006074719092.\n", "# Episode: 13, Reward: -1443.7748427851704, Mean reward: -1502.0201957085706.\n", "# Episode: 14, Reward: -1605.856108746786, Mean reward: -1508.9425899111184.\n", "# Episode: 15, Reward: -1407.7304553550618, Mean reward: -1502.6168315013647.\n", "# Episode: 16, Reward: -1431.5150723891027, Mean reward: -1498.4343750829962.\n", "# Episode: 17, Reward: -1542.1406742952418, Mean reward: -1500.86250281701.\n", "# Episode: 18, Reward: -1333.9310054769703, Mean reward: -1492.0766345359552.\n", "# Episode: 19, Reward: -1596.194111067101, Mean reward: -1497.2825083625125.\n", "# Episode: 20, Reward: -1544.1416115454483, Mean reward: -1499.5138942283666.\n", "# Episode: 21, Reward: -1334.8386853561851, Mean reward: -1492.028657461449.\n", "# Episode: 22, Reward: -1222.440017689133, Mean reward: 
-1480.3074122539572.\n", "# Episode: 23, Reward: -1244.7076291702547, Mean reward: -1470.4907546254697.\n", "# Episode: 24, Reward: -1849.8587307451246, Mean reward: -1485.6654736702558.\n", "# Episode: 25, Reward: -1139.5673601294225, Mean reward: -1472.3540077648393.\n", "# Episode: 26, Reward: -1209.5439464295478, Mean reward: -1462.6203017894582.\n", "# Episode: 27, Reward: -1283.6819527180066, Mean reward: -1456.2296464654776.\n", "# Episode: 28, Reward: -1311.7590436580167, Mean reward: -1451.2479015410822.\n", "# Episode: 29, Reward: -1727.5391400533667, Mean reward: -1460.457609491492.\n", "# Episode: 30, Reward: -1344.8025328583071, Mean reward: -1456.7268005678409.\n", "# Episode: 31, Reward: -1263.0352807211386, Mean reward: -1450.6739405726314.\n", "# Episode: 32, Reward: -1090.0322702825226, Mean reward: -1439.7454051092948.\n", "# Episode: 33, Reward: -964.6306197058785, Mean reward: -1425.7714408327238.\n", "# Episode: 34, Reward: -999.1826075827698, Mean reward: -1413.5831884541537.\n", "# Episode: 35, Reward: -820.7368043974894, Mean reward: -1397.1152333414686.\n", "# Episode: 36, Reward: -1001.6962232742645, Mean reward: -1386.428233069382.\n", "# Episode: 37, Reward: -616.4755407853949, Mean reward: -1366.1663201145402.\n", "# Episode: 38, Reward: -878.4856897301255, Mean reward: -1353.661688566222.\n", "# Episode: 39, Reward: -873.1413496213816, Mean reward: -1341.6486800926007.\n", "# Episode: 40, Reward: -997.7200058398942, Mean reward: -1333.2601758425346.\n", "# Episode: 41, Reward: -591.3227006294347, Mean reward: -1315.5949978612703.\n", "# Episode: 42, Reward: -1726.895061254992, Mean reward: -1325.1601156146126.\n", "# Episode: 43, Reward: -758.8838761360727, Mean reward: -1312.2902010810094.\n", "# Episode: 44, Reward: -987.714786203871, Mean reward: -1305.0774140837398.\n", "# Episode: 45, Reward: -1767.1234013579933, Mean reward: -1315.1218920679626.\n", "# Episode: 46, Reward: -761.3189814508729, Mean reward: 
-1303.3388514165351.\n", "# Episode: 47, Reward: -714.8536243176349, Mean reward: -1291.0787425186415.\n", "# Episode: 48, Reward: -1820.2651821640177, Mean reward: -1301.8784657767105.\n", "# Episode: 49, Reward: -1081.2301419013795, Mean reward: -1297.4654992992039.\n", "# Episode: 50, Reward: -1152.3863088325393, Mean reward: -1289.2574700282103.\n", "# Episode: 51, Reward: -1013.215303901061, Mean reward: -1282.2930668625593.\n", "# Episode: 52, Reward: -1278.043395431276, Mean reward: -1280.9755285275858.\n", "# Episode: 53, Reward: -1181.6151216893502, Mean reward: -1277.471615976152.\n", "# Episode: 54, Reward: -899.5065282463488, Mean reward: -1267.4144982543303.\n", "# Episode: 55, Reward: -890.3884405564133, Mean reward: -1253.0142751153867.\n", "# Episode: 56, Reward: -1012.4450936221324, Mean reward: -1240.420957254525.\n", "# Episode: 57, Reward: -1015.956456487028, Mean reward: -1232.2145465140813.\n", "# Episode: 58, Reward: -961.4900692646497, Mean reward: -1220.9129099675174.\n", "# Episode: 59, Reward: -503.2769183732298, Mean reward: -1198.9575496128482.\n", "# Episode: 60, Reward: -647.2510231762222, Mean reward: -1179.25417860363.\n", "# Episode: 61, Reward: -569.9195764795688, Mean reward: -1158.2864289586528.\n", "# Episode: 62, Reward: -1768.2528971160455, Mean reward: -1163.6502840200246.\n", "# Episode: 63, Reward: -1849.4999410253265, Mean reward: -1171.764785984828.\n", "# Episode: 64, Reward: -247.69127711997874, Mean reward: -1144.6014893522918.\n", "# Episode: 65, Reward: -124.520686421582, Mean reward: -1118.9372939736222.\n", "# Episode: 66, Reward: -250.96258953738746, Mean reward: -1095.326244316588.\n", "# Episode: 67, Reward: -123.12736136432878, Mean reward: -1066.9459780579696.\n", "# Episode: 68, Reward: -247.47727449729308, Mean reward: -1045.2169034383762.\n", "# Episode: 69, Reward: -486.35152685808197, Mean reward: -1023.0200517541957.\n", "# Episode: 70, Reward: -121.06801176253148, Mean reward: -994.5585797585372.\n", 
"# Episode: 71, Reward: -126.12178950966162, Mean reward: -970.3842418416068.\n", "# Episode: 72, Reward: -124.68299345969903, Mean reward: -948.4291013570182.\n", "# Episode: 73, Reward: -1768.81771946083, Mean reward: -958.9113031628295.\n", "# Episode: 74, Reward: -1317.8592103785847, Mean reward: -948.2713127554989.\n", "# Episode: 75, Reward: -125.96508157099453, Mean reward: -927.9992671843304.\n", "# Episode: 76, Reward: -124.86154925421897, Mean reward: -906.3056192408238.\n", "# Episode: 77, Reward: -125.10810520554872, Mean reward: -883.1341422905746.\n", "# Episode: 78, Reward: -1817.0037797866976, Mean reward: -893.239037013148.\n", "# Episode: 79, Reward: -485.14270681520463, Mean reward: -868.391108348385.\n", "# Episode: 80, Reward: -242.25248770571577, Mean reward: -846.3401074453332.\n", "# Episode: 81, Reward: -0.16088524511623073, Mean reward: -821.0826195358127.\n", "# Episode: 82, Reward: -0.37655954144025167, Mean reward: -799.289505320991.\n", "# Episode: 83, Reward: -246.84852421148202, Mean reward: -784.9338634111032.\n", "# Episode: 84, Reward: -238.85350418128604, Mean reward: -769.7272813430735.\n", "# Episode: 85, Reward: -126.51073678903167, Mean reward: -755.8427599909043.\n", "# Episode: 86, Reward: -0.07694916273697687, Mean reward: -735.8103745086737.\n", "# Episode: 87, Reward: -0.3540789145446045, Mean reward: -723.4879452712568.\n", "# Episode: 88, Reward: -1684.4994792339974, Mean reward: -739.6082210613343.\n", "# Episode: 89, Reward: -369.9571952701445, Mean reward: -729.5445379743095.\n", "# Episode: 90, Reward: -1714.0712311386797, Mean reward: -743.8715624802852.\n", "# Episode: 91, Reward: -125.9597728941272, Mean reward: -734.5643039255791.\n", "# Episode: 92, Reward: -1767.3403761160225, Mean reward: -735.3732102227997.\n", "# Episode: 93, Reward: -244.61978599329467, Mean reward: -725.0879284199441.\n", "# Episode: 94, Reward: -241.08306082269254, Mean reward: -710.1552939123205.\n", "# Episode: 95, Reward: 
-505.494147641591, Mean reward: -684.9227088379924.\n", "# Episode: 96, Reward: -123.51443717383053, Mean reward: -672.1666179524516.\n", "# Episode: 97, Reward: -490.983110234996, Mean reward: -667.6892076707987.\n", "# Episode: 98, Reward: -125.18571577681207, Mean reward: -633.7876183430548.\n", "# Episode: 99, Reward: -1.1043235683937573, Mean reward: -612.1851019763951.\n", "# Episode: 100, Reward: -127.29385311973925, Mean reward: -591.683252862139.\n", "# Episode: 101, Reward: -502.8341819494419, Mean reward: -581.4756304231066.\n", "# Episode: 102, Reward: -239.67339602983338, Mean reward: -560.7082304350777.\n", "# Episode: 103, Reward: -496.1652082366493, Mean reward: -546.9992321660237.\n", "# Episode: 104, Reward: -482.17880518355645, Mean reward: -538.6526777047679.\n", "# Episode: 105, Reward: -126.33762090037119, Mean reward: -523.3716613116471.\n", "# Episode: 106, Reward: -241.69381511593505, Mean reward: -507.95663574152314.\n", "# Episode: 107, Reward: -246.71170044522913, Mean reward: -492.57174062068714.\n", "# Episode: 108, Reward: -120.10206967396866, Mean reward: -475.74398062887354.\n", "# Episode: 109, Reward: -242.8586391075948, Mean reward: -470.5356150435608.\n", "# Episode: 110, Reward: -125.58743670460788, Mean reward: -460.1023433141286.\n", "# Episode: 111, Reward: -124.27458571661924, Mean reward: -451.18944349886954.\n", "# Episode: 112, Reward: -241.83636866399033, Mean reward: -420.6611129298284.\n", "# Episode: 113, Reward: -1.3274321462674412, Mean reward: -383.6976627522472.\n", "# Episode: 114, Reward: -123.94303897522518, Mean reward: -381.22269798935224.\n", "# Episode: 115, Reward: -373.94044732125946, Mean reward: -386.21109320734575.\n", "# Episode: 116, Reward: -118.73771596185165, Mean reward: -383.566595735835.\n", "# Episode: 117, Reward: -124.26929108545362, Mean reward: -383.5894343302576.\n", "# Episode: 118, Reward: -633.7380780478151, Mean reward: -391.314650401268.\n", "# Episode: 119, Reward: 
-1.541880794669653, Mean reward: -381.61845747999973.\n", "# Episode: 120, Reward: -121.50155744527255, Mean reward: -381.62712839365463.\n", "# Episode: 121, Reward: -124.52600256123664, Mean reward: -381.5952126546861.\n", "# Episode: 122, Reward: -124.85262059229919, Mean reward: -381.5986051973381.\n", "# Episode: 123, Reward: -376.208367292793, Mean reward: -353.7464181539774.\n", "# Episode: 124, Reward: -122.00326222363539, Mean reward: -329.8292991908783.\n", "# Episode: 125, Reward: -367.1106707356174, Mean reward: -334.6522109741708.\n", "# Episode: 126, Reward: -356.23341307469735, Mean reward: -339.2796482505803.\n", "# Episode: 127, Reward: -120.0949415648, Mean reward: -339.17938497776527.\n", "# Episode: 128, Reward: -826.2249272372575, Mean reward: -319.36380792677653.\n", "# Episode: 129, Reward: -500.2865687294518, Mean reward: -319.6666851650615.\n", "# Episode: 130, Reward: -378.71330013889093, Mean reward: -322.39590141372497.\n", "# Episode: 131, Reward: -126.12191029085925, Mean reward: -324.9151219146398.\n", "# Episode: 132, Reward: -237.24766276264194, Mean reward: -329.65254397906386.\n", "# Episode: 133, Reward: -126.32992069071126, Mean reward: -327.24217190864846.\n", "# Episode: 134, Reward: -373.22689524867855, Mean reward: -329.92963972999627.\n", "# Episode: 135, Reward: -122.79656442383008, Mean reward: -329.8553562826923.\n", "# Episode: 136, Reward: -477.8493830780354, Mean reward: -339.41080496099823.\n", "# Episode: 137, Reward: -237.9090219379713, Mean reward: -344.16190382146686.\n", "# Episode: 138, Reward: -245.17722494788754, Mean reward: -315.37545873574453.\n", "# Episode: 139, Reward: -121.40257875138226, Mean reward: -310.40436640536933.\n", "# Episode: 140, Reward: -494.4256033648532, Mean reward: -286.0114538498928.\n", "# Episode: 141, Reward: -125.55905050942413, Mean reward: -286.00343940219875.\n", "# Episode: 142, Reward: -1698.0615380999548, Mean reward: -284.6178626418774.\n", "# Episode: 143, Reward: 
-629.3311811993128, Mean reward: -292.3120905459978.\n", "# Episode: 144, Reward: -240.82374586366487, Mean reward: -292.3069042468172.\n", "# Episode: 145, Reward: -122.47784356970755, Mean reward: -284.64657816537954.\n", "# Episode: 146, Reward: -507.80681776733803, Mean reward: -292.3324257772497.\n", "# Episode: 147, Reward: -126.58288442134442, Mean reward: -285.04442126097666.\n", "# Episode: 148, Reward: -1760.6736737476926, Mean reward: -317.7541804203943.\n", "# Episode: 149, Reward: -364.72354768719316, Mean reward: -325.0265649027703.\n", "# Episode: 150, Reward: -123.27602769175391, Mean reward: -324.94620839421054.\n", "# Episode: 151, Reward: -1771.3319720752559, Mean reward: -350.31616419672685.\n", "# Episode: 152, Reward: -124.23363037757943, Mean reward: -348.00736888368175.\n", "# Episode: 153, Reward: -1730.8088551189248, Mean reward: -372.70024182132727.\n", "# Episode: 154, Reward: -123.48948701029381, Mean reward: -365.52645545786197.\n", "# Episode: 155, Reward: -0.6877930997149249, Mean reward: -363.0134589018489.\n", "# Episode: 156, Reward: -479.24491774042053, Mean reward: -367.7644809543385.\n", "# Episode: 157, Reward: -1749.4411637718683, Mean reward: -397.81907022087137.\n", "# Episode: 158, Reward: -234.7894012979782, Mean reward: -400.11281685335155.\n", "# Episode: 159, Reward: -124.0752807777356, Mean reward: -397.7371496867543.\n", "# Episode: 160, Reward: -1853.5012822111185, Mean reward: -432.2954265968846.\n", "# Episode: 161, Reward: -249.27536163254666, Mean reward: -434.7954421152031.\n", "# Episode: 162, Reward: -0.8212808710090668, Mean reward: -429.97514035934347.\n", "# Episode: 163, Reward: -125.80852591884518, Mean reward: -432.46476223479505.\n", "# Episode: 164, Reward: -475.3331263756432, Mean reward: -439.49256398280346.\n", "# Episode: 165, Reward: -121.21107206631153, Mean reward: -434.4379764777044.\n", "# Episode: 166, Reward: -124.66968982389449, Mean reward: -434.55661595494536.\n", "# Episode: 167, 
Reward: -240.72541497635777, Mean reward: -436.8857384327634.\n", "# Episode: 168, Reward: -124.8731233752167, Mean reward: -426.70843933931144.\n", "# Episode: 169, Reward: -485.66480025168914, Mean reward: -436.39089772845176.\n", "# Episode: 170, Reward: -126.66536474871323, Mean reward: -436.4941738745207.\n", "# Episode: 171, Reward: -118.65891999581214, Mean reward: -436.3768322232122.\n", "# Episode: 172, Reward: -246.17153340974937, Mean reward: -438.8032104795612.\n", "# Episode: 173, Reward: -363.2510662905924, Mean reward: -438.5440644595172.\n", "# Episode: 174, Reward: -484.09859319637314, Mean reward: -445.7859710789719.\n", "# Episode: 175, Reward: -236.38326880285177, Mean reward: -443.1714230403166.\n", "# Episode: 176, Reward: -126.36848232827298, Mean reward: -438.5741244253882.\n", "# Episode: 177, Reward: -238.04197712698976, Mean reward: -440.9330651366319.\n", "# Episode: 178, Reward: -238.67079798957732, Mean reward: -429.1819825516783.\n", "# Episode: 179, Reward: -242.8256705719036, Mean reward: -424.0327645885273.\n", "# Episode: 180, Reward: -485.0264553384061, Mean reward: -426.1590276925177.\n", "# Episode: 181, Reward: -1602.4835093430172, Mean reward: -455.68625967356076.\n", "# Episode: 182, Reward: -353.76008847672705, Mean reward: -458.0165081878425.\n", "# Episode: 183, Reward: -235.0935304679208, Mean reward: -460.19178038338674.\n", "# Episode: 184, Reward: -678.1178857306672, Mean reward: -466.2896001930265.\n", "# Episode: 185, Reward: -236.76264038541217, Mean reward: -468.5689217122581.\n", "# Episode: 186, Reward: -487.23733918412796, Mean reward: -468.75668083437995.\n", "# Episode: 187, Reward: -122.65071484673048, Mean reward: -466.45151469255524.\n", "# Episode: 188, Reward: -124.02019596121741, Mean reward: -464.0283741128218.\n", "# Episode: 189, Reward: -122.53127719499612, Mean reward: -464.05094808169406.\n", "# Episode: 190, Reward: -125.25444918012978, Mean reward: -456.66752499799964.\n", "# Episode: 191, 
Reward: -118.88667126332855, Mean reward: -456.5340774130777.\n", "# Episode: 192, Reward: -124.67504150867458, Mean reward: -425.066347481252.\n", "# Episode: 193, Reward: -501.1280225383371, Mean reward: -422.5022843080326.\n", "# Episode: 194, Reward: -370.4501802593436, Mean reward: -425.09481299594614.\n", "# Episode: 195, Reward: -241.75560982760368, Mean reward: -427.48036832110404.\n", "# Episode: 196, Reward: -363.8158122931417, Mean reward: -424.60054821162004.\n", "# Episode: 197, Reward: -121.0296843038079, Mean reward: -424.48948420926934.\n", "# Episode: 198, Reward: -0.5112335062875162, Mean reward: -389.28623540444124.\n", "# Episode: 199, Reward: -126.31769086994498, Mean reward: -384.5181182680963.\n", "# Episode: 200, Reward: -485.20503530877636, Mean reward: -391.75669842043675.\n", "# Episode: 201, Reward: -371.9951140300931, Mean reward: -363.76996125953355.\n", "# Episode: 202, Reward: -241.9497502315216, Mean reward: -366.1242836566123.\n", "# Episode: 203, Reward: -120.98942221760763, Mean reward: -333.92789499858605.\n", "# Episode: 204, Reward: -494.26052090390465, Mean reward: -341.34331567645813.\n", "# Episode: 205, Reward: -616.4256339666218, Mean reward: -353.65807249379634.\n", "# Episode: 206, Reward: -367.5930059462569, Mean reward: -351.425034257913.\n", "# Episode: 207, Reward: -237.27439865577907, Mean reward: -321.1816989555913.\n", "# Episode: 208, Reward: -123.68938582351767, Mean reward: -318.9596986461021.\n", "# Episode: 209, Reward: -125.2938839856628, Mean reward: -318.98407071026065.\n", "# Episode: 210, Reward: -728.4823474171959, Mean reward: -296.4836920143822.\n", "# Episode: 211, Reward: -124.97789897631857, Mean reward: -293.9977427612576.\n", "# Episode: 212, Reward: -119.85795048787519, Mean reward: -296.37847615359493.\n", "# Episode: 213, Reward: -0.971242988957709, Mean reward: -293.88173049499716.\n", "# Episode: 214, Reward: -376.00070795778646, Mean reward: -291.89508212664003.\n", "# Episode: 215, 
Reward: -124.73794321574529, Mean reward: -291.96561954962874.\n", "# Episode: 216, Reward: -122.47885047906468, Mean reward: -291.9218027627321.\n", "# Episode: 217, Reward: -587.688620484883, Mean reward: -298.86106687290265.\n", "# Episode: 218, Reward: -124.47928281090125, Mean reward: -298.8531900616163.\n", "# Episode: 219, Reward: -121.64504522908108, Mean reward: -291.57279496116416.\n", "# Episode: 220, Reward: -123.41753496219765, Mean reward: -291.5078383654338.\n", "# Episode: 221, Reward: -478.49870728914794, Mean reward: -298.7046341113006.\n", "# Episode: 222, Reward: -120.91837788140563, Mean reward: -296.1995710007336.\n", "# Episode: 223, Reward: -127.2760616707372, Mean reward: -291.4800709083366.\n", "# Episode: 224, Reward: -366.2734563044792, Mean reward: -289.12356817049874.\n", "# Episode: 225, Reward: -503.16430298528144, Mean reward: -294.4591888541473.\n", "# Episode: 226, Reward: -241.9717213622038, Mean reward: -296.77125363482594.\n", "# Episode: 227, Reward: -123.05558841254964, Mean reward: -294.4715258605371.\n", "# Episode: 228, Reward: -121.80718490347297, Mean reward: -292.13425359881506.\n", "# Episode: 229, Reward: -123.79532577609437, Mean reward: -289.75364670289883.\n", "# Episode: 230, Reward: -0.6850578776810112, Mean reward: -280.06681875368434.\n", "# Episode: 231, Reward: -471.10478355079016, Mean reward: -257.43924423783983.\n", "# Episode: 232, Reward: -124.29877757456948, Mean reward: -252.8500180197967.\n", "# Episode: 233, Reward: -367.56708852951965, Mean reward: -255.49948918102865.\n", "# Episode: 234, Reward: -357.09788583844744, Mean reward: -249.07908918318427.\n", "# Episode: 235, Reward: -473.53633139052744, Mean reward: -253.81456300328654.\n", "# Episode: 236, Reward: -474.2203954168461, Mean reward: -253.5542241279409.\n", "# Episode: 237, Reward: -124.15082681618107, Mean reward: -253.5842263673299.\n", "# Episode: 238, Reward: -235.75908427855242, Mean reward: -255.8190041336766.\n", "# Episode: 239, 
Reward: -125.9571484448382, Mean reward: -255.88752155867343.\n", "# Episode: 240, Reward: -368.78659108437006, Mean reward: -260.7581643967583.\n", "# Episode: 241, Reward: -0.8092003950439985, Mean reward: -258.39661497939255.\n", "# Episode: 242, Reward: -541.8134661029964, Mean reward: -266.73938347127904.\n", "# Episode: 243, Reward: -122.09277078577028, Mean reward: -259.15867843622766.\n", "# Episode: 244, Reward: -123.80453062400085, Mean reward: -254.2257654435208.\n", "# Episode: 245, Reward: -244.80746129818309, Mean reward: -254.2868024729324.\n", "# Episode: 246, Reward: -362.04810334426253, Mean reward: -254.2514482939548.\n", "# Episode: 247, Reward: -122.05596876655771, Mean reward: -254.2719739832098.\n", "# Episode: 248, Reward: -125.37210840676055, Mean reward: -256.76919148121925.\n", "# Episode: 249, Reward: -474.1867420017607, Mean reward: -263.72657250385555.\n", "# Episode: 250, Reward: -239.80757243639619, Mean reward: -258.81862324640804.\n", "# Episode: 251, Reward: -242.61737417057233, Mean reward: -256.23106844921756.\n", "# Episode: 252, Reward: -485.84023114353573, Mean reward: -261.1088780674579.\n", "# Episode: 253, Reward: -121.19956134226047, Mean reward: -261.1130808499509.\n", "# Episode: 254, Reward: -122.1725311107124, Mean reward: -253.6713210540871.\n", "# Episode: 255, Reward: -122.01004446138667, Mean reward: -243.7830092639824.\n", "# Episode: 256, Reward: -122.23518076405887, Mean reward: -238.87585276033843.\n", "# Episode: 257, Reward: -363.13613684553275, Mean reward: -241.39308752413348.\n", "# Episode: 258, Reward: -123.96380651125054, Mean reward: -241.39857593788813.\n", "# Episode: 259, Reward: -590.1960660907439, Mean reward: -250.69661957998977.\n", "# Episode: 260, Reward: -361.6273079238908, Mean reward: -243.35951879012364.\n", "# Episode: 261, Reward: -124.78511563576797, Mean reward: -243.35566312331267.\n", "# Episode: 262, Reward: -120.851830872494, Mean reward: -243.37554073100506.\n", "# Episode: 263, 
Reward: -365.70910960846214, Mean reward: -250.67029806339514.\n", "# Episode: 264, Reward: -598.043352046622, Mean reward: -255.1111509451718.\n", "# Episode: 265, Reward: -240.62694313127963, Mean reward: -257.4289309434825.\n", "# Episode: 266, Reward: -124.14126245732913, Mean reward: -257.4621791830478.\n", "# Episode: 267, Reward: -366.56377000308623, Mean reward: -253.03968217341185.\n", "# Episode: 268, Reward: -125.4914809119177, Mean reward: -253.0599261354322.\n", "# Episode: 269, Reward: -479.2894882132265, Mean reward: -260.2128149951151.\n", "# Episode: 270, Reward: -120.6822045282293, Mean reward: -260.1581083864357.\n", "# Episode: 271, Reward: -122.24722647550198, Mean reward: -253.0330787701628.\n", "# Episode: 272, Reward: -123.5513554825655, Mean reward: -253.08573832218596.\n", "# Episode: 273, Reward: -359.8891182636622, Mean reward: -257.7379994540445.\n", "# Episode: 274, Reward: -246.86296773367806, Mean reward: -255.34978968262848.\n", "# Episode: 275, Reward: -124.48266052282159, Mean reward: -247.77615683337928.\n", "# Episode: 276, Reward: -244.32212199841155, Mean reward: -247.82316484610342.\n", "# Episode: 277, Reward: -245.43571584904677, Mean reward: -250.27076739483343.\n", "# Episode: 278, Reward: -361.4745719035413, Mean reward: -255.0641151348348.\n", "# Episode: 279, Reward: -354.90591119778327, Mean reward: -259.6863268432686.\n", "# Episode: 280, Reward: -117.78860747055151, Mean reward: -262.02839783512593.\n", "# Episode: 281, Reward: -126.38756989652042, Mean reward: -255.13405356204055.\n", "# Episode: 282, Reward: -125.1538419620272, Mean reward: -255.15115484978966.\n", "# Episode: 283, Reward: -479.7216582908638, Mean reward: -257.3942462450166.\n", "# Episode: 284, Reward: -240.10265221730293, Mean reward: -255.0543415725937.\n", "# Episode: 285, Reward: -1.6677042294417748, Mean reward: -245.61696902937197.\n", "# Episode: 286, Reward: -123.1305317936327, Mean reward: -238.59517175690772.\n", "# Episode: 287, 
Reward: -123.73483675826444, Mean reward: -238.58685195574938.\n", "# Episode: 288, Reward: -120.7551181431495, Mean reward: -236.28677263304132.\n", "# Episode: 289, Reward: -238.8284109740542, Mean reward: -238.54419788362569.\n", "# Episode: 290, Reward: -246.42512770716246, Mean reward: -236.0969686160815.\n", "# Episode: 291, Reward: -364.9531862373078, Mean reward: -243.37984833292674.\n", "# Episode: 292, Reward: -125.50550185474584, Mean reward: -235.05368904796177.\n", "# Episode: 293, Reward: -238.08844118645425, Mean reward: -237.37360245597546.\n", "# Episode: 294, Reward: -369.49941348024845, Mean reward: -242.2875001131004.\n", "# Episode: 295, Reward: -506.84745463890573, Mean reward: -247.5282999799149.\n", "# Episode: 296, Reward: -502.7416480484216, Mean reward: -250.34217087399801.\n", "# Episode: 297, Reward: -518.3245393486237, Mean reward: -258.2675422856393.\n", "# Episode: 298, Reward: -126.61400713309797, Mean reward: -258.2923802601661.\n", "# Episode: 299, Reward: -238.47390221545294, Mean reward: -253.57812346443993.\n", "Saving models ...\n" ] } ], "source": [ "agent, scores, avg_history = train()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 449 }, "id": "qzM6DahtEVvT", "outputId": "f083766c-a95c-48bb-9ad8-6e8ba2d52f4c" }, "outputs": [ { "data": { "image/png": "", "text/plain": [ "<Figure size 640x480 with 1 Axes>" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_curve(avg_history, scores)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "UOTTN7wKZDiL" }, "source": [ "This graph shows that the problem could be solved using 300 episodes. However, from episode 210 onwards, the results are very similar and the training stabilises. Although the agent has managed to learn to solve the problem in a reduced number of episodes, the learning line that the reward shows is quite unstable and has peaks. 
This may be because the hyperparameters used are not optimal." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 493 }, "id": "Gi54iOb_Dkp3", "outputId": "3dfb0725-6e77-4271-dd19-72e6a2efedb2" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/usr/local/lib/python3.9/dist-packages/gym/wrappers/record_video.py:78: UserWarning: \u001b[33mWARN: Overwriting existing videos at /content/video folder (try specifying a different `video_folder` for the `RecordVideo` wrapper if this is not desired)\u001b[0m\n", " logger.warn(\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Loading models ...\n" ] }, { "data": { "text/html": [ "<video alt=\"test\" autoplay \n", " loop controls style=\"height: 400px;\">\n", " <source src=\"data:video/mp4;base64,\" type=\"video/mp4\" />\n", " </video>" ], "text/plain": [ "<IPython.core.display.HTML object>" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "test_behavior(agent)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "rsxvJ5U3ZGZA" }, "source": [ "This video shows how the agent has learned to use inertia effectively to reach the vertical position as quickly as possible. Moreover, it manages to stay in that position, which is the objective of the problem."
] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "p_03BMxoTWgl" }, "source": [ "# **2-LunarLander**" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "JgbOh9h5TVkI" }, "outputs": [], "source": [ "ENV='LunarLanderContinuous-v2'\n", "LR_ACTOR=0.001\n", "LR_CRITIC=0.002\n", "GAMMA=0.99\n", "TAU=0.005\n", "NOISE=0.1\n", "BATCH_SIZE=128\n", "L1_SIZE=512\n", "L2_SIZE=512\n", "UPDATE_ACTOR_INTERVAL=2\n", "N_ITERATIONS=400\n", "MAX_SIZE=100000\n", "EXP_NAME='lunar'\n", "SAVE_POINTS=[100, 200, 250, 300, 350, 400]" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "BzmyEcLqZKu5" }, "source": [ "Behaviour before learning:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 476 }, "id": "UQ4BuyZZ5ZOY", "outputId": "92b617a7-3c67-41f1-b863-e047b03bb4d1" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/usr/local/lib/python3.9/dist-packages/gym/wrappers/record_video.py:78: UserWarning: \u001b[33mWARN: Overwriting existing videos at /content/video folder (try specifying a different `video_folder` for the `RecordVideo` wrapper if this is not desired)\u001b[0m\n", " logger.warn(\n" ] }, { "data": { "text/html": [ "<video alt=\"test\" autoplay \n", " loop controls style=\"height: 400px;\">\n", " <source src=\"data:video/mp4;base64,\" type=\"video/mp4\" />\n", " </video>" ], "text/plain": [ "<IPython.core.display.HTML object>" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "random_behavior()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "UzKT0Rut5c7L", "outputId": "8828e0f7-e15f-4232-99d5-6745a8e0ca63" }, "outputs": [ { "name": "stderr", 
"output_type": "stream", "text": [ "/usr/local/lib/python3.9/dist-packages/gym/core.py:317: DeprecationWarning: \u001b[33mWARN: Initializing wrapper in old step API which returns one bool instead of two. It is recommended to set `new_step_api=True` to use new step API. This will be the default behaviour in future.\u001b[0m\n", " deprecation(\n", "/usr/local/lib/python3.9/dist-packages/gym/wrappers/step_api_compatibility.py:39: DeprecationWarning: \u001b[33mWARN: Initializing environment in old step API which returns one bool instead of two. It is recommended to set `new_step_api=True` to use new step API. This will be the default behaviour in future.\u001b[0m\n", " deprecation(\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "# Episode: 0, Reward: -317.00588299169215, Mean reward: -317.00588299169215.\n", "# Episode: 1, Reward: -421.27300686331415, Mean reward: -369.1394449275032.\n", "# Episode: 2, Reward: -604.968276132241, Mean reward: -447.74905532908247.\n", "# Episode: 3, Reward: -1057.7104739971176, Mean reward: -600.2394099960912.\n", "# Episode: 4, Reward: -486.55860873379964, Mean reward: -577.503249743633.\n", "# Episode: 5, Reward: -449.35420010037814, Mean reward: -556.1450748030904.\n", "# Episode: 6, Reward: -713.9958220688386, Mean reward: -578.6951815553401.\n", "# Episode: 7, Reward: -710.742066617938, Mean reward: -595.2010421881649.\n", "# Episode: 8, Reward: -623.9746857697377, Mean reward: -598.3981136972285.\n", "# Episode: 9, Reward: -671.0811449407813, Mean reward: -605.6664168215838.\n", "# Episode: 10, Reward: -210.4783305659402, Mean reward: -569.7402271619799.\n", "# Episode: 11, Reward: -115.0221911868141, Mean reward: -531.8470574973827.\n", "# Episode: 12, Reward: -133.72554171374398, Mean reward: -501.22232551402584.\n", "# Episode: 13, Reward: -81.19858847003853, Mean reward: -471.2206300108839.\n", "# Episode: 14, Reward: -89.88899170601356, Mean reward: -445.7985207905592.\n", "# Episode: 15, Reward: 
-75.629684862501, Mean reward: -422.6629685450556.\n", "# Episode: 16, Reward: -47.59158938874976, Mean reward: -400.5999462417435.\n", "# Episode: 17, Reward: -362.57820036492086, Mean reward: -398.48762702636446.\n", "# Episode: 18, Reward: -232.28965787373116, Mean reward: -389.7403654920153.\n", "# Episode: 19, Reward: -301.12135991218855, Mean reward: -385.30941521302395.\n", "# Episode: 20, Reward: -139.61145658613535, Mean reward: -373.60951242126737.\n", "# Episode: 21, Reward: -244.97229041626318, Mean reward: -367.7623659664945.\n", "# Episode: 22, Reward: -166.72829269120786, Mean reward: -359.0217540849603.\n", "# Episode: 23, Reward: -224.17194085115608, Mean reward: -353.40301186688515.\n", "# Episode: 24, Reward: -289.86182566063394, Mean reward: -350.86136441863505.\n", "# Episode: 25, Reward: -235.20180660104913, Mean reward: -346.41291988718945.\n", "# Episode: 26, Reward: -367.93251797241226, Mean reward: -347.20994203849403.\n", "# Episode: 27, Reward: -379.4019851711283, Mean reward: -348.3596578646596.\n", "# Episode: 28, Reward: -169.25890259442176, Mean reward: -342.1837697518927.\n", "# Episode: 29, Reward: -286.11224387221887, Mean reward: -340.314718889237.\n", "# Episode: 30, Reward: -276.0166249463987, Mean reward: -338.24058682656477.\n", "# Episode: 31, Reward: -127.26372768306308, Mean reward: -331.64755997833026.\n", "# Episode: 32, Reward: -218.82943600506536, Mean reward: -328.2288289488374.\n", "# Episode: 33, Reward: -157.1201143117069, Mean reward: -323.19621969480414.\n", "# Episode: 34, Reward: -165.93905716458718, Mean reward: -318.7031579082265.\n", "# Episode: 35, Reward: -115.16716258118797, Mean reward: -313.04938026025326.\n", "# Episode: 36, Reward: -314.57341126593565, Mean reward: -313.0905702874339.\n", "# Episode: 37, Reward: -175.87203125733163, Mean reward: -309.47955610243116.\n", "# Episode: 38, Reward: -32.73904339271533, Mean reward: -302.38364552013076.\n", "# Episode: 39, Reward: -314.37523625977497, Mean 
reward: -302.6834352886218.\n", "# Episode: 40, Reward: -238.72495068979748, Mean reward: -301.1234722496261.\n", "# Episode: 41, Reward: -246.71338178788068, Mean reward: -299.8279939052988.\n", "# Episode: 42, Reward: -142.71648974386844, Mean reward: -296.1742379945679.\n", "# Episode: 43, Reward: -228.96177657828173, Mean reward: -294.64668205328866.\n", "# Episode: 44, Reward: -275.84604190453894, Mean reward: -294.22889004998314.\n", "# Episode: 45, Reward: -105.5486470413312, Mean reward: -290.12714563675155.\n", "# Episode: 46, Reward: -155.06039973229124, Mean reward: -287.2533850855928.\n", "# Episode: 47, Reward: -448.40794088626, Mean reward: -290.6107716647734.\n", "# Episode: 48, Reward: -201.38060398542066, Mean reward: -288.78974783458256.\n", "# Episode: 49, Reward: -183.42643902222156, Mean reward: -286.6824816583353.\n", "# Episode: 50, Reward: -546.7020498051294, Mean reward: -291.78090456317443.\n", "# Episode: 51, Reward: -236.66532645517492, Mean reward: -290.72098959955906.\n", "# Episode: 52, Reward: -246.8673739733887, Mean reward: -289.8935628896313.\n", "# Episode: 53, Reward: -98.65727527518754, Mean reward: -286.3521501560305.\n", "# Episode: 54, Reward: -56.777400273285444, Mean reward: -282.17806379452605.\n", "# Episode: 55, Reward: 152.86192746380277, Mean reward: -274.40949252205587.\n", "# Episode: 56, Reward: 170.5804837854585, Mean reward: -266.60265083245037.\n", "# Episode: 57, Reward: -127.70245529352314, Mean reward: -264.20781987488266.\n", "# Episode: 58, Reward: -184.81219086149733, Mean reward: -262.86213124753715.\n", "# Episode: 59, Reward: -123.86765957380311, Mean reward: -260.5455567196416.\n", "# Episode: 60, Reward: -336.4711329338211, Mean reward: -261.7902382969232.\n", "# Episode: 61, Reward: -355.6750866516587, Mean reward: -263.30451004458024.\n", "# Episode: 62, Reward: -54.75907280799288, Mean reward: -259.99426500907884.\n", "# Episode: 63, Reward: -353.783701506248, Mean reward: -261.4597249543471.\n", 
"# Episode: 64, Reward: -228.76561799117354, Mean reward: -260.9567386933752.\n", "# Episode: 65, Reward: -167.87566756776886, Mean reward: -259.5464194338963.\n", "# Episode: 66, Reward: -419.38995359459733, Mean reward: -261.9321438243545.\n", "# Episode: 67, Reward: -325.0313670566028, Mean reward: -262.8600735777699.\n", "# Episode: 68, Reward: -486.50601610806166, Mean reward: -266.1013191216872.\n", "# Episode: 69, Reward: -291.2265270949895, Mean reward: -266.46025066416297.\n", "# Episode: 70, Reward: -289.680549562538, Mean reward: -266.78729712752033.\n", "# Episode: 71, Reward: 175.71411730779232, Mean reward: -260.6414441492521.\n", "# Episode: 72, Reward: 11.931092203256064, Mean reward: -256.90757378825884.\n", "# Episode: 73, Reward: 242.90891015494174, Mean reward: -250.15329697821554.\n", "# Episode: 74, Reward: 232.98387219362706, Mean reward: -243.71146805592429.\n", "# Episode: 75, Reward: 211.7046910453925, Mean reward: -237.71915017301222.\n", "# Episode: 76, Reward: 36.5852169085756, Mean reward: -234.15675579532927.\n", "# Episode: 77, Reward: -333.35058732152856, Mean reward: -235.42847158412673.\n", "# Episode: 78, Reward: 217.5386096076324, Mean reward: -229.69471106271203.\n", "# Episode: 79, Reward: 240.7859618303707, Mean reward: -223.8137026515486.\n", "# Episode: 80, Reward: 202.99102732384532, Mean reward: -218.54450845432152.\n", "# Episode: 81, Reward: 248.19823012440432, Mean reward: -212.85252383750776.\n", "# Episode: 82, Reward: -33.01594734646966, Mean reward: -210.68581809665187.\n", "# Episode: 83, Reward: 244.24550967468204, Mean reward: -205.26996895651695.\n", "# Episode: 84, Reward: 255.76195238327594, Mean reward: -199.8460639995782.\n", "# Episode: 85, Reward: 237.10123724620445, Mean reward: -194.76528142695287.\n", "# Episode: 86, Reward: 242.92959767116082, Mean reward: -189.73430580513545.\n", "# Episode: 87, Reward: -37.28279937294455, Mean reward: -188.00190232295145.\n", "# Episode: 88, Reward: 
220.2876610100787, Mean reward: -183.41437913943426.\n", "# Episode: 89, Reward: 232.89982340643212, Mean reward: -178.78866577781352.\n", "# Episode: 90, Reward: 230.41586728563445, Mean reward: -174.29191266722617.\n", "# Episode: 91, Reward: 271.70257713762635, Mean reward: -169.44414647369516.\n", "# Episode: 92, Reward: 159.09021989072363, Mean reward: -165.91151887837884.\n", "# Episode: 93, Reward: -587.2285801806544, Mean reward: -170.39361527521154.\n", "# Episode: 94, Reward: -434.8888466022839, Mean reward: -173.1777756049702.\n", "# Episode: 95, Reward: -208.30058648677038, Mean reward: -173.5436382183223.\n", "# Episode: 96, Reward: -81.27143005613809, Mean reward: -172.59237834036165.\n", "# Episode: 97, Reward: 208.83737692981504, Mean reward: -168.70023798046188.\n", "# Episode: 98, Reward: 207.90768267362694, Mean reward: -164.8961175698145.\n", "# Episode: 99, Reward: 206.94447625368548, Mean reward: -161.17771163157948.\n", "Saving models ...\n", "# Episode: 100, Reward: -294.8118310070248, Mean reward: -160.95577111173282.\n", "# Episode: 101, Reward: 206.1072681600317, Mean reward: -154.68196836149937.\n", "# Episode: 102, Reward: 281.76666741276813, Mean reward: -145.81461892604926.\n", "# Episode: 103, Reward: 171.83950022456332, Mean reward: -133.5191191838325.\n", "# Episode: 104, Reward: -447.82758234583775, Mean reward: -133.13180891995285.\n", "# Episode: 105, Reward: 173.61773928922923, Mean reward: -126.90208952605677.\n", "# Episode: 106, Reward: 199.91420035415672, Mean reward: -117.76298930182682.\n", "# Episode: 107, Reward: -455.9848803947525, Mean reward: -115.21541743959497.\n", "# Episode: 108, Reward: -159.26472612356457, Mean reward: -110.56831784313323.\n", "# Episode: 109, Reward: -187.845261921528, Mean reward: -105.73595901294073.\n", "# Episode: 110, Reward: -242.79442341921637, Mean reward: -106.0591199414735.\n", "# Episode: 111, Reward: -314.4964205387612, Mean reward: -108.05386223499296.\n", "# Episode: 112, Reward: 
-60.8710872578847, Mean reward: -107.32531769043435.\n", "# Episode: 113, Reward: -7.259810048931579, Mean reward: -106.5859299062233.\n", "# Episode: 114, Reward: 184.95886691311728, Mean reward: -103.83745132003197.\n", "# Episode: 115, Reward: -90.83522507636904, Mean reward: -103.98950672217066.\n", "# Episode: 116, Reward: 286.72671356718456, Mean reward: -100.64632369261132.\n", "# Episode: 117, Reward: 219.7476168848529, Mean reward: -94.82306552011357.\n", "# Episode: 118, Reward: -84.70205682639795, Mean reward: -93.34718950964023.\n", "# Episode: 119, Reward: -91.77373584181716, Mean reward: -91.25371326893654.\n", "# Episode: 120, Reward: 13.176220112053258, Mean reward: -89.72583650195466.\n", "# Episode: 121, Reward: -66.04697513686762, Mean reward: -87.9365833491607.\n", "# Episode: 122, Reward: 3.813005172847312, Mean reward: -86.23117037052015.\n", "# Episode: 123, Reward: 30.814469329432995, Mean reward: -83.68130626871424.\n", "# Episode: 124, Reward: 286.0864878579562, Mean reward: -77.92182313352833.\n", "# Episode: 125, Reward: 224.3291649836283, Mean reward: -73.32651341768157.\n", "# Episode: 126, Reward: 188.05448458554355, Mean reward: -67.76664339210201.\n", "# Episode: 127, Reward: 159.90984700957924, Mean reward: -62.373525070294924.\n", "# Episode: 128, Reward: -431.3276905836456, Mean reward: -64.99421295018716.\n", "# Episode: 129, Reward: -99.7353391426771, Mean reward: -63.130443902891756.\n", "# Episode: 130, Reward: 284.7895911781392, Mean reward: -57.52238174164638.\n", "# Episode: 131, Reward: 287.22475144301734, Mean reward: -53.377496950385584.\n", "# Episode: 132, Reward: 38.62912747847153, Mean reward: -50.80291131555022.\n", "# Episode: 133, Reward: 207.1597110592462, Mean reward: -47.16011306184068.\n", "# Episode: 134, Reward: -131.39780175913504, Mean reward: -46.814700507786156.\n", "# Episode: 135, Reward: -18.536628226210482, Mean reward: -45.84839516423638.\n", "# Episode: 136, Reward: -153.70561202913686, Mean 
reward: -44.2397171718684.\n", "# Episode: 137, Reward: -42.18554434002289, Mean reward: -42.9028523026953.\n", "# Episode: 138, Reward: 230.62850385102794, Mean reward: -40.26917683025788.\n", "# Episode: 139, Reward: 106.98186390246234, Mean reward: -36.0556058286355.\n", "# Episode: 140, Reward: -490.31592767018685, Mean reward: -38.57151559843939.\n", "# Episode: 141, Reward: 174.3208201394026, Mean reward: -34.361173579166554.\n", "# Episode: 142, Reward: -992.5234913051908, Mean reward: -42.859243594779784.\n", "# Episode: 143, Reward: 137.29481921707668, Mean reward: -39.196677636826195.\n", "# Episode: 144, Reward: 256.8196031259344, Mean reward: -33.87002118652146.\n", "# Episode: 145, Reward: 253.78617669968978, Mean reward: -30.27667294911126.\n", "# Episode: 146, Reward: 258.96686613824136, Mean reward: -26.136400290405927.\n", "# Episode: 147, Reward: 245.45434909002148, Mean reward: -19.197777390643115.\n", "# Episode: 148, Reward: 185.10192373551217, Mean reward: -15.332952113433787.\n", "# Episode: 149, Reward: 216.08815764701916, Mean reward: -11.337806146741382.\n", "# Episode: 150, Reward: 262.7082925212612, Mean reward: -3.2437027234774747.\n", "# Episode: 151, Reward: -17.071248757718493, Mean reward: -1.047761946502916.\n", "# Episode: 152, Reward: 269.66544668569554, Mean reward: 4.11756626008793.\n", "# Episode: 153, Reward: 255.0585054663092, Mean reward: 7.654724067502901.\n", "# Episode: 154, Reward: -56.99143148946241, Mean reward: 7.652583755341132.\n", "# Episode: 155, Reward: 254.8883091065191, Mean reward: 8.67284757176829.\n", "# Episode: 156, Reward: 8.277436558748377, Mean reward: 7.049817099501188.\n", "# Episode: 157, Reward: 212.31946879089338, Mean reward: 10.450036340345354.\n", "# Episode: 158, Reward: 309.9273263881561, Mean reward: 15.39743151284189.\n", "# Episode: 159, Reward: 266.65468869406413, Mean reward: 19.302654995520562.\n", "# Episode: 160, Reward: 212.68603602941812, Mean reward: 24.794226685152953.\n", "# 
Episode: 161, Reward: -48.5525542343689, Mean reward: 27.86545200932585.\n", "# Episode: 162, Reward: 290.05430598167027, Mean reward: 31.313585797222476.\n", "# Episode: 163, Reward: 272.7546915875227, Mean reward: 37.57896972816019.\n", "# Episode: 164, Reward: 265.50561682465616, Mean reward: 42.52168207631849.\n", "# Episode: 165, Reward: -243.9581031707284, Mean reward: 41.7608577202889.\n", "# Episode: 166, Reward: 252.3366751188105, Mean reward: 48.47812400742298.\n", "# Episode: 167, Reward: -32.773974207651406, Mean reward: 51.4006979359125.\n", "# Episode: 168, Reward: 278.82211115923815, Mean reward: 59.053979208585496.\n", "# Episode: 169, Reward: 269.0617072553255, Mean reward: 64.65686155208864.\n", "# Episode: 170, Reward: -9.475556258604783, Mean reward: 67.45891148512797.\n", "# Episode: 171, Reward: 245.5797865843922, Mean reward: 68.15756817789398.\n", "# Episode: 172, Reward: 236.8983192798549, Mean reward: 70.40724044865996.\n", "# Episode: 173, Reward: -41.606044159923385, Mean reward: 67.5620909055113.\n", "# Episode: 174, Reward: 37.06755447036201, Mean reward: 65.60292772827866.\n", "# Episode: 175, Reward: -419.7883785847724, Mean reward: 59.28799703197699.\n", "# Episode: 176, Reward: 16.84711656387806, Mean reward: 59.09061602853002.\n", "# Episode: 177, Reward: 17.30878560858966, Mean reward: 62.59720975783121.\n", "# Episode: 178, Reward: -4.2117337575149065, Mean reward: 60.37970632417974.\n", "# Episode: 179, Reward: -47.694908703117946, Mean reward: 57.49489761884485.\n", "# Episode: 180, Reward: 261.5974883265053, Mean reward: 58.08096222887145.\n", "# Episode: 181, Reward: 284.753105782645, Mean reward: 58.44651098545386.\n", "# Episode: 182, Reward: 225.53281924544427, Mean reward: 61.03199865137301.\n", "# Episode: 183, Reward: 10.852422710704431, Mean reward: 58.698067781733215.\n", "# Episode: 184, Reward: -289.74253797593053, Mean reward: 53.24302287814116.\n", "# Episode: 185, Reward: 251.4986638303434, Mean reward: 
53.38699714398254.\n", "# Episode: 186, Reward: 46.73856589171402, Mean reward: 51.42508682618807.\n", "# Episode: 187, Reward: 239.93110469505845, Mean reward: 54.1972258668681.\n", "# Episode: 188, Reward: 276.7142481677652, Mean reward: 54.761491738444974.\n", "# Episode: 189, Reward: -127.60809553775071, Mean reward: 51.15641254900314.\n", "# Episode: 190, Reward: 19.424365252973985, Mean reward: 49.04649752867654.\n", "# Episode: 191, Reward: 298.13554679850034, Mean reward: 49.31082722528527.\n", "# Episode: 192, Reward: -22.244239143173985, Mean reward: 47.497482634946294.\n", "# Episode: 193, Reward: 220.25239778932735, Mean reward: 55.57229241464611.\n", "# Episode: 194, Reward: 272.49360803494585, Mean reward: 62.646116961018414.\n", "# Episode: 195, Reward: 242.75664417451173, Mean reward: 67.15668926763124.\n", "# Episode: 196, Reward: 191.20155568710453, Mean reward: 69.88141912506367.\n", "# Episode: 197, Reward: -19.99854410992033, Mean reward: 67.59305991466631.\n", "# Episode: 198, Reward: 228.62552011930006, Mean reward: 67.80023828912304.\n", "# Episode: 199, Reward: 15.154113754629293, Mean reward: 65.88233466413247.\n", "Saving models ...\n", "# Episode: 200, Reward: 248.0983853946604, Mean reward: 71.31143682814931.\n", "# Episode: 201, Reward: 283.8547146422926, Mean reward: 72.08891129297193.\n", "# Episode: 202, Reward: -40.497258867174565, Mean reward: 68.86627203017251.\n", "# Episode: 203, Reward: 301.298402149618, Mean reward: 70.16086104942306.\n", "# Episode: 204, Reward: 288.767886772147, Mean reward: 77.5268157406029.\n", "# Episode: 205, Reward: -374.5865011634521, Mean reward: 72.04477333607609.\n", "# Episode: 206, Reward: -210.56895167313263, Mean reward: 67.93994181580318.\n", "# Episode: 207, Reward: 60.86013871968626, Mean reward: 73.10839200694758.\n", "# Episode: 208, Reward: 16.86631954274536, Mean reward: 74.86970246361068.\n", "# Episode: 209, Reward: 217.14643621053972, Mean reward: 78.91961944493136.\n", "# Episode: 
210, Reward: 262.94303224839894, Mean reward: 83.97699400160751.\n", "# Episode: 211, Reward: 159.34787131348594, Mean reward: 88.71543692012997.\n", "# Episode: 212, Reward: 252.1848460837185, Mean reward: 91.84599625354602.\n", "# Episode: 213, Reward: 266.0463694989302, Mean reward: 94.57905804902464.\n", "# Episode: 214, Reward: -337.72232503844054, Mean reward: 89.35224612950906.\n", "# Episode: 215, Reward: 208.49865120268183, Mean reward: 92.34558489229957.\n", "# Episode: 216, Reward: 255.33977501556953, Mean reward: 92.0317155067834.\n", "# Episode: 217, Reward: 267.98292905910074, Mean reward: 92.51406862852589.\n", "# Episode: 218, Reward: 282.2261691521652, Mean reward: 96.18335088831152.\n", "# Episode: 219, Reward: 298.67825971239563, Mean reward: 100.08787084385366.\n", "# Episode: 220, Reward: 120.99800535242758, Mean reward: 101.16608869625739.\n", "# Episode: 221, Reward: 186.24945073476576, Mean reward: 103.6890529549737.\n", "# Episode: 222, Reward: 257.84792002154, Mean reward: 106.22940210346067.\n", "# Episode: 223, Reward: -25.412064554474185, Mean reward: 105.66713676462159.\n", "# Episode: 224, Reward: 25.541615280610557, Mean reward: 103.06168803884813.\n", "# Episode: 225, Reward: 232.26201525206372, Mean reward: 103.14101654153248.\n", "# Episode: 226, Reward: 288.1843845779491, Mean reward: 104.14231554145654.\n", "# Episode: 227, Reward: -235.92407743401327, Mean reward: 100.18397629702059.\n", "# Episode: 228, Reward: 312.6065648005639, Mean reward: 107.6233188508627.\n", "# Episode: 229, Reward: 276.3778768539682, Mean reward: 111.38445101082917.\n", "# Episode: 230, Reward: 273.39941391689206, Mean reward: 111.2705492382167.\n", "# Episode: 231, Reward: -77.33745780618037, Mean reward: 107.62492714572473.\n", "# Episode: 232, Reward: 253.73959726413597, Mean reward: 109.77603184358135.\n", "# Episode: 233, Reward: 224.52035925298782, Mean reward: 109.94963832551875.\n", "# Episode: 234, Reward: 252.08740288348682, Mean reward: 
113.78449037194497.\n", "# Episode: 235, Reward: 258.583356394405, Mean reward: 116.55569021815114.\n", "# Episode: 236, Reward: 301.9362927658928, Mean reward: 121.11210926610143.\n", "# Episode: 237, Reward: 251.055000602341, Mean reward: 124.04451471552507.\n", "# Episode: 238, Reward: 266.14158781936874, Mean reward: 124.3996455552085.\n", "# Episode: 239, Reward: 258.0417206347706, Mean reward: 125.91024412253157.\n", "# Episode: 240, Reward: -86.04525107795116, Mean reward: 129.95295088845393.\n", "# Episode: 241, Reward: 278.0675990611987, Mean reward: 130.9904186776719.\n", "# Episode: 242, Reward: 267.6654467282247, Mean reward: 143.59230805800604.\n", "# Episode: 243, Reward: 280.1131603273913, Mean reward: 145.02049146910917.\n", "# Episode: 244, Reward: 279.5532810959419, Mean reward: 145.24782824880924.\n", "# Episode: 245, Reward: 232.8112122586534, Mean reward: 145.0380786043989.\n", "# Episode: 246, Reward: -107.03250209245756, Mean reward: 141.37808492209192.\n", "# Episode: 247, Reward: 307.1097266546795, Mean reward: 141.9946386977385.\n", "# Episode: 248, Reward: -57.45671937144893, Mean reward: 139.5690522666689.\n", "# Episode: 249, Reward: 246.837409666567, Mean reward: 139.87654478686437.\n", "Saving models ...\n", "# Episode: 250, Reward: 264.166044489662, Mean reward: 139.8911223065484.\n", "# Episode: 251, Reward: -0.6443060496861364, Mean reward: 140.0553917336287.\n", "# Episode: 252, Reward: 255.35792949291599, Mean reward: 139.9123165617009.\n", "# Episode: 253, Reward: -161.20973215462345, Mean reward: 135.7496341854916.\n", "# Episode: 254, Reward: 257.2849813152287, Mean reward: 138.8923983135385.\n", "# Episode: 255, Reward: 260.2114400702295, Mean reward: 138.9456296231756.\n", "# Episode: 256, Reward: 261.83742292217414, Mean reward: 141.48122948680984.\n", "# Episode: 257, Reward: 278.8055244498257, Mean reward: 142.14609004339917.\n", "# Episode: 258, Reward: 266.3079031852393, Mean reward: 141.70989581137002.\n", "# Episode: 
259, Reward: -153.0939509424878, Mean reward: 137.51240941500447.\n", "# Episode: 260, Reward: 267.3918608759608, Mean reward: 138.05946766346992.\n", "# Episode: 261, Reward: 254.22780272933716, Mean reward: 141.08727123310697.\n", "# Episode: 262, Reward: 66.97525580473618, Mean reward: 138.85648073133763.\n", "# Episode: 263, Reward: 264.3547862058284, Mean reward: 138.7724816775207.\n", "# Episode: 264, Reward: 256.2582857958642, Mean reward: 138.68000836723277.\n", "# Episode: 265, Reward: 243.55432036860188, Mean reward: 143.55513260262606.\n", "# Episode: 266, Reward: 39.43074682215479, Mean reward: 141.42607331965948.\n", "# Episode: 267, Reward: 252.22550849597903, Mean reward: 144.27606814669582.\n", "# Episode: 268, Reward: 254.62102299159682, Mean reward: 144.03405726501938.\n", "# Episode: 269, Reward: 283.9802595691933, Mean reward: 144.18324278815805.\n", "# Episode: 270, Reward: 6.862836523049737, Mean reward: 144.34662671597462.\n", "# Episode: 271, Reward: 310.962211117787, Mean reward: 145.00045096130853.\n", "# Episode: 272, Reward: 26.51955856372369, Mean reward: 142.89666335414725.\n", "# Episode: 273, Reward: -188.64352163898673, Mean reward: 141.4262885793566.\n", "# Episode: 274, Reward: 250.32666217009793, Mean reward: 143.558879656354.\n", "# Episode: 275, Reward: 299.17476643045717, Mean reward: 150.74851110650627.\n", "# Episode: 276, Reward: 304.0101477501148, Mean reward: 153.62014141836866.\n", "# Episode: 277, Reward: 273.03873064955985, Mean reward: 156.17744086877835.\n", "# Episode: 278, Reward: 242.0018106094555, Mean reward: 158.63957631244807.\n", "# Episode: 279, Reward: 237.14594014716715, Mean reward: 161.48798480095093.\n", "# Episode: 280, Reward: 267.11942848961667, Mean reward: 161.54320420258205.\n", "# Episode: 281, Reward: 287.30552488703006, Mean reward: 161.5687283936259.\n", "# Episode: 282, Reward: 258.55129224939134, Mean reward: 161.89891312366535.\n", "# Episode: 283, Reward: 271.2417703962688, Mean reward: 
164.50280660052096.\n", "# Episode: 284, Reward: 257.10919514126607, Mean reward: 169.97132393169298.\n", "# Episode: 285, Reward: 235.35181197598575, Mean reward: 169.8098554131494.\n", "# Episode: 286, Reward: 275.54947313889477, Mean reward: 172.0979644856212.\n", "# Episode: 287, Reward: 289.67835443726176, Mean reward: 172.59543698304321.\n", "# Episode: 288, Reward: 252.58048386578076, Mean reward: 172.3540993400234.\n", "# Episode: 289, Reward: 293.62001312639916, Mean reward: 176.56638042666486.\n", "# Episode: 290, Reward: 282.4733968431755, Mean reward: 179.1968707425669.\n", "# Episode: 291, Reward: 39.65437066713278, Mean reward: 176.6120589812532.\n", "# Episode: 292, Reward: 297.9017198494957, Mean reward: 179.81351857117988.\n", "# Episode: 293, Reward: 317.1942923108396, Mean reward: 180.78293751639504.\n", "# Episode: 294, Reward: 262.25144496655764, Mean reward: 180.68051588571115.\n", "# Episode: 295, Reward: 313.9951187216442, Mean reward: 181.39290063118244.\n", "# Episode: 296, Reward: 289.5770310181008, Mean reward: 182.3766553844924.\n", "# Episode: 297, Reward: 266.59159913449275, Mean reward: 185.2425568169365.\n", "# Episode: 298, Reward: 249.34753318142026, Mean reward: 185.4497769475577.\n", "# Episode: 299, Reward: 281.70983093351856, Mean reward: 188.1153341193466.\n", "Saving models ...\n", "# Episode: 300, Reward: 278.37913480775126, Mean reward: 188.41814161347753.\n", "# Episode: 301, Reward: 228.53108308625622, Mean reward: 187.86490529791718.\n", "# Episode: 302, Reward: 298.84596041649615, Mean reward: 191.2583374907539.\n", "# Episode: 303, Reward: 251.32050704134167, Mean reward: 190.75855853967113.\n", "# Episode: 304, Reward: 253.66518384378728, Mean reward: 190.4075315103875.\n", "# Episode: 305, Reward: 292.528186470421, Mean reward: 197.07867838672627.\n", "# Episode: 306, Reward: 275.27996361232294, Mean reward: 201.9371675395808.\n", "# Episode: 307, Reward: 280.11565129622136, Mean reward: 204.12972266534615.\n", "# 
Episode: 308, Reward: 276.5459283463827, Mean reward: 206.72651875338255.\n", "# Episode: 309, Reward: 265.5606700785799, Mean reward: 207.21066109206294.\n", "# Episode: 310, Reward: 241.86098241099444, Mean reward: 206.99984059368893.\n", "# Episode: 311, Reward: 255.74316251665155, Mean reward: 207.96379350572056.\n", "# Episode: 312, Reward: 255.72596207624977, Mean reward: 207.99920466564583.\n", "# Episode: 313, Reward: 294.0712555591008, Mean reward: 208.2794535262476.\n", "# Episode: 314, Reward: 255.18829641607826, Mean reward: 214.20855974079277.\n", "# Episode: 315, Reward: 276.66613493357295, Mean reward: 214.89023457810174.\n", "# Episode: 316, Reward: 233.37932181610861, Mean reward: 214.6706300461071.\n", "# Episode: 317, Reward: 258.42570946142314, Mean reward: 214.57505785013032.\n", "# Episode: 318, Reward: 283.72476064904333, Mean reward: 214.59004376509907.\n", "# Episode: 319, Reward: 266.0558443614178, Mean reward: 214.26381961158927.\n", "# Episode: 320, Reward: 276.77626654587186, Mean reward: 215.82160222352374.\n", "# Episode: 321, Reward: 259.42279421048823, Mean reward: 216.55333565828096.\n", "# Episode: 322, Reward: 246.65470802937998, Mean reward: 216.44140353835934.\n", "# Episode: 323, Reward: 301.28708939370347, Mean reward: 219.70839507784115.\n", "# Episode: 324, Reward: 278.02743845337636, Mean reward: 222.2332533095688.\n", "# Episode: 325, Reward: 275.2978176863519, Mean reward: 222.6636113339117.\n", "# Episode: 326, Reward: 270.1470580248672, Mean reward: 222.48323806838084.\n", "# Episode: 327, Reward: 267.30000791003897, Mean reward: 227.51547892182137.\n", "# Episode: 328, Reward: 281.44320028016176, Mean reward: 227.20384527661736.\n", "# Episode: 329, Reward: 254.71384454464967, Mean reward: 226.98720495352418.\n", "# Episode: 330, Reward: 256.1801172474028, Mean reward: 226.8150119868293.\n", "# Episode: 331, Reward: 262.6212301064545, Mean reward: 230.21459886595568.\n", "# Episode: 332, Reward: 26.928456854999908, 
Mean reward: 227.9464874618643.\n", "# Episode: 333, Reward: 264.9311579177372, Mean reward: 228.35059544851177.\n", "# Episode: 334, Reward: 284.73130333657605, Mean reward: 228.67703445304267.\n", "# Episode: 335, Reward: 265.3386042547016, Mean reward: 228.74458693164564.\n", "# Episode: 336, Reward: 267.3305268398667, Mean reward: 228.3985292723854.\n", "# Episode: 337, Reward: 269.76141246795095, Mean reward: 228.58559339104144.\n", "# Episode: 338, Reward: 288.7157590775905, Mean reward: 228.8113351036237.\n", "# Episode: 339, Reward: 289.974962018923, Mean reward: 229.13066751746516.\n", "# Episode: 340, Reward: 259.3462867389672, Mean reward: 232.58458289563438.\n", "# Episode: 341, Reward: 290.4434281349759, Mean reward: 232.70834118637217.\n", "# Episode: 342, Reward: 280.6280500970514, Mean reward: 232.83796722006042.\n", "# Episode: 343, Reward: 260.40024594842816, Mean reward: 232.64083807627082.\n", "# Episode: 344, Reward: 283.24811970059375, Mean reward: 232.67778646231736.\n", "# Episode: 345, Reward: 263.425411957228, Mean reward: 232.9839284593031.\n", "# Episode: 346, Reward: 277.6667700869442, Mean reward: 236.8309211810971.\n", "# Episode: 347, Reward: 278.71563837940516, Mean reward: 236.54698029834438.\n", "# Episode: 348, Reward: 262.92477815604184, Mean reward: 239.75079527361925.\n", "# Episode: 349, Reward: 271.3407469456791, Mean reward: 239.99582864641036.\n", "Saving models ...\n", "# Episode: 350, Reward: 234.18460735551633, Mean reward: 239.69601427506893.\n", "# Episode: 351, Reward: 273.9843669166363, Mean reward: 242.44230100473217.\n", "# Episode: 352, Reward: 295.0045967294483, Mean reward: 242.83876767709754.\n", "# Episode: 353, Reward: 244.75708773720777, Mean reward: 246.89843587601584.\n", "# Episode: 354, Reward: 299.36145185384447, Mean reward: 247.31920058140201.\n", "# Episode: 355, Reward: 268.51025063464124, Mean reward: 247.4021886870461.\n", "# Episode: 356, Reward: 266.8224971927194, Mean reward: 
247.45203942975152.\n", "# Episode: 357, Reward: 288.37485932424397, Mean reward: 247.5477327784957.\n", "# Episode: 358, Reward: 251.3047946163591, Mean reward: 247.39770169280692.\n", "# Episode: 359, Reward: 265.8107165308721, Mean reward: 251.58674836754054.\n", "# Episode: 360, Reward: 279.96262912705254, Mean reward: 251.71245605005143.\n", "# Episode: 361, Reward: 34.095106380591034, Mean reward: 249.51112908656395.\n", "# Episode: 362, Reward: 289.86137391224145, Mean reward: 251.739990267639.\n", "# Episode: 363, Reward: 293.12904044643756, Mean reward: 252.02773281004508.\n", "# Episode: 364, Reward: -151.20554576496804, Mean reward: 247.95309449443678.\n", "# Episode: 365, Reward: 280.57044543578957, Mean reward: 248.32325574510867.\n", "# Episode: 366, Reward: 255.00162700972342, Mean reward: 250.47896454698432.\n", "# Episode: 367, Reward: 71.99783707672148, Mean reward: 248.67668783279177.\n", "# Episode: 368, Reward: 242.19704963020192, Mean reward: 248.55244809917778.\n", "# Episode: 369, Reward: 260.48968225461806, Mean reward: 248.31754232603205.\n", "# Episode: 370, Reward: 57.700204281136024, Mean reward: 248.82591600361292.\n", "# Episode: 371, Reward: 243.28416587826325, Mean reward: 248.14913555121765.\n", "# Episode: 372, Reward: 273.8513460236304, Mean reward: 250.62245342581676.\n", "# Episode: 373, Reward: 245.01993316344834, Mean reward: 254.9590879738411.\n", "# Episode: 374, Reward: 242.48147193248005, Mean reward: 254.8806360714649.\n", "# Episode: 375, Reward: 230.1740336033593, Mean reward: 254.19062874319388.\n", "# Episode: 376, Reward: 281.0425667424446, Mean reward: 253.9609529331172.\n", "# Episode: 377, Reward: 255.68567042834934, Mean reward: 253.7874223309051.\n", "# Episode: 378, Reward: 265.592021893534, Mean reward: 254.0233244437459.\n", "# Episode: 379, Reward: 249.38718460169164, Mean reward: 254.14573688829117.\n", "# Episode: 380, Reward: 12.241604364698247, Mean reward: 251.596958647042.\n", "# Episode: 381, Reward: 
42.487006085593436, Mean reward: 249.1487734590276.\n", "# Episode: 382, Reward: 71.19467294947037, Mean reward: 247.27520726602842.\n", "# Episode: 383, Reward: 290.5295624400628, Mean reward: 247.46808518646634.\n", "# Episode: 384, Reward: 280.0255799252849, Mean reward: 247.69724903430654.\n", "# Episode: 385, Reward: 51.34068722345009, Mean reward: 245.8571377867812.\n", "# Episode: 386, Reward: 290.8368303258207, Mean reward: 246.01001135865047.\n", "# Episode: 387, Reward: 266.9087209831739, Mean reward: 245.7823150241096.\n", "# Episode: 388, Reward: 251.89296708267764, Mean reward: 245.77543985627852.\n", "# Episode: 389, Reward: 230.40109774644964, Mean reward: 245.143250702479.\n", "# Episode: 390, Reward: 215.99643993117303, Mean reward: 244.47848113335897.\n", "# Episode: 391, Reward: 245.57356310470828, Mean reward: 246.53767305773476.\n", "# Episode: 392, Reward: 288.9303478000724, Mean reward: 246.44795933724058.\n", "# Episode: 393, Reward: 267.51360012886425, Mean reward: 245.95115241542078.\n", "# Episode: 394, Reward: 290.0632633816501, Mean reward: 246.2292705995717.\n", "# Episode: 395, Reward: 282.9890007996357, Mean reward: 245.91920942035162.\n", "# Episode: 396, Reward: 294.59040216848604, Mean reward: 245.96934313185545.\n", "# Episode: 397, Reward: 51.548604392016614, Mean reward: 243.8189131844307.\n", "# Episode: 398, Reward: 222.57826869936832, Mean reward: 243.55122053961023.\n", "# Episode: 399, Reward: 38.1930135055608, Mean reward: 241.1160523653306.\n", "Saving models ...\n" ] } ], "source": [ "agent, scores, avg_history = train()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 449 }, "id": "QtqayElp5fe9", "outputId": "c79d40f7-25e8-4d02-9cc9-a2beabe4e9b6" }, "outputs": [ { "data": { "image/png": "", "text/plain": [ "<Figure size 640x480 with 1 Axes>" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_curve(avg_history, scores)" 
] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "pwkOuaN5ZOLk" }, "source": [ "In this case the training curve is more stable, which suggests that the chosen parameters are better suited to a problem of this complexity. As can be seen, a stable mean reward above 200 is reached within about 400 episodes, which I consider a very positive result. \n", "\n", "The main problem with this environment is that once the agent learns to stay in the air, episodes can last a long time, which makes training quite time-consuming. With this in mind, I decided not to rerun the experiment with more episodes, even though the reward was still rising. Even so, this experiment shows that the algorithm works as it should, and it is likely that with more episodes the agent would be able to solve the environment." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 494 }, "id": "-FCs5Lin5i5U", "outputId": "0764c196-ebaa-4c86-af16-9a7979887d9d" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/usr/local/lib/python3.9/dist-packages/gym/wrappers/record_video.py:78: UserWarning: \u001b[33mWARN: Overwriting existing videos at /content/video folder (try specifying a different `video_folder` for the `RecordVideo` wrapper if this is not desired)\u001b[0m\n", " logger.warn(\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Loading models ...\n" ] }, { "data": { "text/html": [ "<video alt=\"test\" autoplay \n", " loop controls style=\"height: 400px;\">\n", " <source src=\"data:video/mp4;base64,\" type=\"video/mp4\" />\n", " </video>" ], "text/plain": [ "<IPython.core.display.HTML object>" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "test_behavior(agent)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "MO1fMyuRaq49" }, "source": [ "The video shows 
the behaviour learned by the agent after training. As can be seen, the agent has learnt to descend quickly until a few moments before reaching the ground, where it uses its propulsion to slow its descent and land softly. At this point in training, the agent still needs to centre the landing a little more to the left. I think it could learn this with more episodes, as explained above." ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "jv18bAnob1gu" }, "source": [ "# **Conclusions**" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "JxcQBASxb8kG" }, "source": [ "The TD3 (Twin Delayed DDPG) algorithm for reinforcement learning was implemented and evaluated in environments such as Pendulum-v1 and LunarLanderContinuous-v2. The experiments showed that TD3 can achieve high performance and produce robust results on these tasks. The results indicate that TD3 is effective at learning policies that solve the given tasks: the trained agent achieved high rewards and learned to control the pendulum and to land the lunar lander smoothly.\n", "\n", "However, although the results show that the algorithm can learn useful policies with the set parameters, it is important to note that the performance of TD3 can be sensitive to the choice of hyperparameters. This implies that hyperparameter fine-tuning may be necessary to achieve optimal performance on specific tasks or environments.\n", "\n", "Overall, the implementation of TD3 showcased its capability to learn effective policies in reinforcement learning tasks. The robustness of TD3 across different environments highlights its potential as a reliable algorithm for various real-world applications. 
Further research and experimentation can be conducted to explore additional environments and optimize the hyperparameters for improved performance.\n", "\n" ] } ], "metadata": { "accelerator": "GPU", "colab": { "collapsed_sections": [ "A0ELfvjv9cZj", "p_03BMxoTWgl" ], "provenance": [] }, "gpuClass": "standard", "kernelspec": { "display_name": "Python 3", "name": "python3" }, "language_info": { "name": "python" } }, "nbformat": 4, "nbformat_minor": 0 }