Double DQN implementation on Super Mario Bros.
DDQN implementation using the Mario (gym) API
Double DQN: Super Mario Bros.
This project is built based on the Double DQN paper [1] and the official PyTorch tutorial [2].
The main motivation for building this project was to gain a better understanding of how reinforcement learning works in practice through code. The code applies DDQN, which is quite similar to DQN [3]. For further information regarding the difference between the two, please see the Background section below or [1].
Showcase
To Get Started
In order to properly run the code, first clone the repository:
git clone git@github.com:3seoksw/DDQN-mario.git
Then create a new virtual environment using requirements.txt. Here, conda is used:
conda create --name <your-env> --file requirements.txt
conda activate <your-env>
After setting up the virtual environment and the code, run main.py.
Note: nes-py 8.2.1 (required by gym-super-mario-bros 7.4.0) and gym 0.26.0 are not compatible: the reset() function's signature differs between the two APIs.
There are a few workarounds; here, every truncated keyword in time_limit.py inside the installed gym package directory was removed.
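For illustration only (this is not the workaround used in this repository), the mismatch could also be bridged with a small wrapper that adapts the old-style nes-py signatures to the gym 0.26 API; the class below is a hypothetical sketch:

```python
import gym


class NewAPIShim(gym.Wrapper):
    """Hypothetical sketch: adapt an old-style nes-py env to gym 0.26 signatures."""

    def reset(self, *, seed=None, options=None):
        # Old-style reset() takes no keyword arguments and returns obs only;
        # gym 0.26 expects reset(seed=..., options=...) returning (obs, info).
        obs = self.env.reset()
        return obs, {}

    def step(self, action):
        # Old-style step() returns (obs, reward, done, info);
        # gym 0.26 expects (obs, reward, terminated, truncated, info).
        obs, reward, done, info = self.env.step(action)
        return obs, reward, done, False, info
```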
Problem Justification
Because the project uses several packages, explicit (manual) definition of \(\text{Environment, State, Action, Reward}\) is not necessary.
For the sake of understanding RL, however, you might want to see [3], which provides detailed explanations on solving Atari games with RL.
The following describes how the problem can be formulated in a simple manner:
\[\begin{align*} \text{Environment}: &\text{the world the agent interacts with} \\ & \textit{i.e.} \text{) stage, blocks, mushrooms, } \textit{etc.} \\ \text{State}: &\text{current image frame } (\text{channel} \times \text{height} \times \text{width}) \\ \text{Action}: &\text{set of actions the agent Mario can take} \\ &\textit{e.g.}) \text{ move forward, jump, } \textit{etc.} \\ \text{Reward}: &\text{distance the agent moved} \\ &\text{coins the agent acquired} \\ &\text{enemies the agent killed} \\ &\text{time consumed} \\ &\text{reaching the final flag (terminal state)} \\ &\textit{etc.} \\ \end{align*}\]The main goal is for the agent to maximize its rewards.
Since states are in the form of image frames, a CNN is used; it consists of three convolutional layers, each paired with a ReLU activation, followed by a flattening layer and two fully connected layers, as sketched below.
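A minimal PyTorch sketch of such a network is shown below; the layer sizes are assumptions following the tutorial in [2] (84×84 input frames) and may differ from this repository's actual implementation.

```python
import torch.nn as nn


def build_q_network(in_channels: int, n_actions: int) -> nn.Sequential:
    """Minimal sketch of the CNN described above (layer sizes follow [2];
    assumes 84x84 input frames, which may differ from this repository)."""
    return nn.Sequential(
        # Three convolutional layers, each followed by ReLU.
        nn.Conv2d(in_channels, 32, kernel_size=8, stride=4),
        nn.ReLU(),
        nn.Conv2d(32, 64, kernel_size=4, stride=2),
        nn.ReLU(),
        nn.Conv2d(64, 64, kernel_size=3, stride=1),
        nn.ReLU(),
        # Flatten the feature maps, then two fully connected layers.
        nn.Flatten(),
        nn.Linear(3136, 512),  # 3136 = 64 * 7 * 7 for 84x84 inputs
        nn.ReLU(),
        nn.Linear(512, n_actions),
    )
```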
The agent judges the current situation using the state information.
Actions can vary (right-only, simple, or complex movement sets) per nes-py 8.2.1. The agent chooses an action based on the DDQN algorithm in order to maximize the reward.
Rewards are the key to solving RL problems, since the agent takes actions based on them. Here, rewards include whether the agent reached the final state (flag), the distance the agent moved, etc. A short setup sketch follows below.
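As a minimal sketch of this setup (the stage id, action set, and random policy below are illustrative choices, not necessarily the repository's defaults), and assuming the gym/nes-py compatibility note above has been addressed:

```python
import gym_super_mario_bros
from gym_super_mario_bros.actions import SIMPLE_MOVEMENT
from nes_py.wrappers import JoypadSpace

# Environment: the Super Mario Bros. world; actions restricted to a small set.
env = gym_super_mario_bros.make("SuperMarioBros-1-1-v0")
env = JoypadSpace(env, SIMPLE_MOVEMENT)

state = env.reset()                      # State: raw image frame from the emulator
done = False
while not done:
    action = env.action_space.sample()   # random policy, for illustration only
    # Reward: driven by distance moved, coins, time, reaching the flag, etc.
    state, reward, done, info = env.step(action)
env.close()
```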
Background
DQN and Double DQN (DDQN) are similar in that both learn and use \(Q\) values, but they differ in how the target used to update the \(Q\) value is computed. For further reference, please see [1].
DQN
In \(Q\)-learning, the objective is to find the optimal \(Q\) value, which is parameterized by \(\theta\). Given a state \(S_t\), an action \(A_t\), and a reward \(R_{t+1}\), the \(Q\)-learning update is:
\[\begin{aligned} \theta_{t+1} &= \theta_t + \alpha(Y^{Q}_t - Q(S_t, A_t; \theta_t)) \nabla_{\theta_t}Q(S_t, A_t; \theta_t) \end{aligned}\]where \(\alpha\) is the step size, and the target \(Y_t^Q\) is defined as:
\[\begin{aligned} Y_t^Q \equiv R_{t+1} + \gamma \text{max}_aQ(S_{t+1}, a; \theta_t) \\ Y^Q_t \approx Q(S_t, A_t; \theta_t) \end{aligned}\]However, because the \(\text{max}\) operator uses the same values both to select and to evaluate an action, this target tends to overestimate \(Q\) values. Therefore, DDQN is proposed.
Double DQN
While keeping the same fundamental components as DQN, namely experience replay and a target network, DDQN uses the two \(Q\)-networks for different roles: the online network selects the best action, and the target network evaluates that action. You can think of DDQN as decoupling DQN's action selection from its action evaluation.
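As an illustrative sketch (helper names are hypothetical, not the repository's), the target network can be created as a frozen copy of the online network and periodically re-synced:

```python
import copy

import torch.nn as nn


def make_online_and_target(q_net: nn.Module):
    """Illustrative helper: clone the online network to create the target
    network and freeze the target's parameters."""
    target_net = copy.deepcopy(q_net)
    for p in target_net.parameters():
        p.requires_grad = False
    return q_net, target_net


def sync_target(online: nn.Module, target: nn.Module) -> None:
    """Copy online weights into the target network (done every N steps)."""
    target.load_state_dict(online.state_dict())
```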
The DQN target, which uses the target network, is as follows:
\[\begin{aligned} Y_t^{\text{DQN}} \equiv R_{t+1} + \gamma \text{max}_a Q(S_{t+1}, a; \theta_t^{-}) \end{aligned}\]where \(\theta^{-}\) is the parameter vector of the target network. The Double \(Q\)-learning target can then be written as follows:
\[\begin{aligned} Y_t^{\text{DoubleQ}} \equiv R_{t+1} + \gamma Q_{\text{eval}}^{\text{target}}(S_{t+1}, \text{argmax}_a Q_{\text{select}}^{\text{online}}(S_{t+1}, a; \theta_t); \theta_t^{-} ) \end{aligned}\]where \(\theta\) parameterizes the online network.
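A minimal PyTorch sketch of this target computation is shown below; the function and tensor names, and the value of \(\gamma\), are illustrative assumptions rather than the repository's exact implementation.

```python
import torch


@torch.no_grad()
def double_q_target(reward, next_state, done, online_net, target_net, gamma=0.9):
    """Compute Y_t^DoubleQ for a batch of transitions.

    reward:     (B,)          rewards R_{t+1}
    next_state: (B, C, H, W)  next states S_{t+1}
    done:       (B,)          1.0 if S_{t+1} is terminal, else 0.0
    """
    # Action selection with the online network: argmax_a Q(S_{t+1}, a; theta_t).
    best_action = online_net(next_state).argmax(dim=1, keepdim=True)
    # Action evaluation with the target network (parameters theta_t^-).
    next_q = target_net(next_state).gather(1, best_action).squeeze(1)
    # Terminal states contribute no bootstrap term.
    return reward + gamma * (1.0 - done) * next_q
```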
References
[1] H. v. Hasselt, A. Guez, and D. Silver. “Deep Reinforcement Learning with Double Q-learning,” Proceedings of the AAAI Conference on Artificial Intelligence, 30(1), 2016.
[2] Y. Feng, S. Subramanian, H. Wang, and S. Guo. “Train a Mario-playing RL Agent,” PyTorch Tutorials, accessed January 27, 2024, https://pytorch.org/tutorials/intermediate/mario_rl_tutorial.html.
[3] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller. “Playing Atari with Deep Reinforcement Learning,” arXiv preprint arXiv:1312.5602, 2013.