RL Series #2: Learning with Deep Q Network (DQN)

Exploring the theory and code behind DQN

Saasha Nair
12 min read · Mar 23, 2020
Photo by SpaceX on Unsplash

Updated 21 September 2020: Hi dear reader, I am happy to announce that I have shifted to my own platform, where I continue to write on topics relating to AI in general and RL in particular. If you would like to read more of my content, please do check out https://www.saashanair.com/blog/. Looking forward to meeting you there. 🙃

Note1: PyTorch implementation for this blog post can be found on Github.

Note2: This post is part of a series, the first post can be found here. 😄

Self-isolation at home starting to get to you? How about going to the moon instead? #socialdistancing_lunar_edition😋

Yes, I see the confusion in your eyes. No, social isolation hasn’t caused me to ‘lose it’. Not yet anyway. 😅 I was talking about this:

Screenshot of OpenAI Gym’s LunarLander-v2. The purple object is the spacecraft that the RL agent will learn to control

Eh? What is LunarLander-v2?

Recall from the previous post that RL tasks require an agent that interacts with an environment to learn an optimal sequence of steps towards its goal. Here, OpenAI Gym’s LunarLander-v2 acts as the environment in which our agent learns. The purple object in the image is the spacecraft that we will learn to operate via RL, and the goal is to land it on the landing pad, i.e. between the two yellow poles, on the surface of the moon.

To map this onto the block diagram (Fig. 1) from the previous post, the LunarLander-v2 environment provides the following components, which can also be examined in the code here (a quick way to inspect them from Python is shown in the snippet after the list):

  1. The state is represented as a vector with 8 elements: [horizontal position, vertical position, horizontal velocity, vertical velocity, angle, angular velocity, left leg touchdown, right leg touchdown]
  2. At each time step, the lander can execute one of 4 discrete actions: no-op (do nothing), fire the left engine, fire the main engine, or fire the right engine
  3. The reward is calculated by the environment based on the amount of fuel used, the landing location, and whether the legs make contact with the surface.
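If you want to poke at these components yourself, a quick inspection in Python looks something like this (this sketch uses the classic Gym API from around the time of writing; newer Gym/Gymnasium releases have slightly different reset/step signatures, and the Box2D extras need to be installed):

```python
import gym

env = gym.make('LunarLander-v2')      # requires gym[box2d]

print(env.observation_space)          # Box(8,): the 8-element state vector described above
print(env.action_space)               # Discrete(4): no-op, left engine, main engine, right engine

state = env.reset()                   # initial 8-element state
action = env.action_space.sample()    # pick a random action
next_state, reward, done, info = env.step(action)  # reward and 'done' come from the environment
```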

Ooooh, take me to the moon! 🚀

Your wish is my command, dear reader, so let’s prepare for takeoff! But before that, we need to figure out how to train our spacecraft to land safely. Though almost obsolete now, having been replaced by more sophisticated algorithms, DQN is a good starting point for exploring Deep RL. It was popularised back in 2015 (yes, it does feel like a lifetime ago, things move really fast in Deep Learning 😅) by DeepMind as “the first deep learning model to successfully learn control policies directly from high-dimensional sensory input using reinforcement learning”. Still can’t recall it? Maybe this video will help.

Video of DeepMind’s DQN agent learning to play Breakout (source)

What is with the name of this algorithm?

Let’s unpack the term Deep Q Network. Q, or Q(s,a), is called the action-value function, and it indicates the “goodness” or quality of executing a certain action in a given state. More specifically, the action-value function quantifies the expected return gained if an agent following a certain policy 𝜋 takes action a in state s and then continues to act according to that policy. That was quite a mouthful, so let’s dig a little deeper into what policy and expected return mean before getting back to the definition of Q.

A policy can be defined as the behaviour exhibited by an agent. It is the rule that the agent follows to decide which action to perform in the given state s, and is expressed as:

Eq. 1: Policy
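In the notation commonly used in the RL literature, a stochastic policy maps each state to a probability distribution over actions:

\pi(a \mid s) = \mathbb{P}\left[ A_t = a \mid S_t = s \right]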

Expected Return is defined as the sum of all future rewards. Do you see a problem with this formulation? If you thought, “but we humans don’t assign uniform weight to near- and far-term rewards”, then you are absolutely right. Even right now, you know that following along with this post will be useful in the long run, but your mind is trying to get you to switch to Netflix for that short-term reward, and so you need to rationalise which rewards to focus on. A similar setup can be achieved in this computational formulation of the problem by introducing a discount factor 𝛾 such that 0 ≤ 𝛾 ≤ 1. The resulting expected return can then be expressed as:

Eq. 2: Expected Return
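Written out with the discount factor, the return from time step t is the discounted sum of all future rewards:

G_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k \, r_{t+k+1}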

The discount factor 𝛾 helps the agent determine what to focus on. Fig. 1 below gives a more intuitive sense of its effect. A 𝛾 value closer to 0 results in a myopic agent that focuses only on immediate gains, which in the LunarLander environment (indicated by the orange line) leads to low scores, while a 𝛾 value closer to 1 (set to 0.99 in the plot, as per the DQN paper) results in far-sightedness, allowing the agent to accumulate higher scores.

Fig 1: Plots indicating the effect of discounting factor on episodic reward. The plot on the left tracks how the average reward over the past 100 episodes varies over time during training, while the plot on the right tracks how reward per episode varies over time in test mode

Phew, that was a long diversion! Getting back to the action-value function, i.e. the Q of DQN. To reiterate, the action-value function for an agent following a policy 𝜋 measures the expected return of executing action a in state s and then continuing to follow the same policy, that is:

Eq. 3: Action-value function or Q-value of following policy 𝜋
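In symbols, this is the expected (discounted) return when starting from the pair (s, a) and following 𝜋 thereafter:

Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[ G_t \mid S_t = s, A_t = a \right]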

This can be further extended to define an optimal action-value function Q*, which is the Q-value of starting in state s, taking action a, and then following an optimal policy forever after.

Eq. 4: Optimal action-value function

As one might notice, in line 3 of Eq. 4 the part within the parentheses looks similar to line 2, with the difference being that the sequence of r’s starts from the reward at time step t+1. Thus, the portion within the parentheses can itself be expressed as the Q* of the state-action pair at time step t+1. Both the action-value function Q and its optimal counterpart Q* allow this recursive formulation. The optimal action-value function is therefore the sum of the immediate reward and the discounted optimal action-value of the successor state. This formulation (Eq. 5) of the Q-value is referred to as the Bellman equation.

Eq. 5: Bellman optimality equation for the action-value function
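Using Q^{*}(s, a) = \max_{\pi} Q^{\pi}(s, a) for the optimal action-value function, the recursive structure described above can be written as the Bellman optimality equation:

Q^{*}(s, a) = \mathbb{E}\left[ r_{t+1} + \gamma \max_{a'} Q^{*}(s_{t+1}, a') \,\middle|\, S_t = s, A_t = a \right]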

Okay, so now we know what the Q in DQN denotes, but what about the DN, or ‘deep network’? Well, simply put, it indicates that this technique is the deep learning counterpart of the traditional Q-learning algorithm. This is useful because, with large state and action spaces, calculating and storing a Q-value for each possible state-action pair quickly becomes impractical. DQN therefore uses a Neural Network to approximate the optimal action-value function, i.e. Q(s,a;𝜃) ≈ Q*(s,a). Given a state-action pair, the Neural Network trained for the task can then give you an estimate of the corresponding optimal Q-value.

Oh Neural Networks, I know what those are. But I am still confused, how do the networks learn?

That is actually the fun bit. Remember how, when using Neural Networks for a supervised learning task, you have data to train on and a set of labels telling the network what the correct answer is? Well, training the network here takes a similar approach.

Fig 2: Block diagram depicting the architecture for the Neural Network that acts as the Q-function approximator in DQN

As shown in Fig. 2, the current state s is passed in as the input to the network, while the output layer has one node per action that the agent can execute in the given environment. The architecture of the network (the gray coloured box in Fig. 2) can be as simple or as complex as warranted by the task you are trying to solve. You can use a whole range of network architectures, from simple FeedForward Neural Networks (FFNN) to Convolutional Neural Networks (CNN) or even Recurrent Neural Networks (RNN). That said, for this series we stick to a simple FFNN that takes as input the 8-component vector describing the state of the spacecraft and whose output layer contains 4 nodes, one for each of the possible actions discussed earlier.
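As a rough sketch (class and variable names here are my own, not necessarily those used in the accompanying repository), the FFNN described above could look like this in PyTorch, with the 8 state inputs, 4 action outputs and the two 256-node hidden layers listed in the hyperparameter section further down:

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Feedforward Q-function approximator: state in, one Q-value per action out."""
    def __init__(self, state_dim=8, action_dim=4, hidden_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, action_dim),  # no activation: Q-values are unbounded
        )

    def forward(self, state):
        return self.net(state)
```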

Well, now that we know what the network should look like, the next question is which loss function to use for optimising its parameters. Looking back at the supervised learning setting, the mean squared error (MSE) for a single sample is expressed as L = (target_y - predicted_y)². The loss function in the RL setting takes a similar form. The predicted_y, in this setup, is the Q-value computed by the network for the given state s and action a. “But what about the target”, you wonder, “we don’t have labels in RL”. True, and that is why we use the concept of ‘bootstrapping’, which, as Sutton and Barto describe, involves “updating estimates of the values of states based on estimates of the values of successor states”. In other words, target_y is the Bellman formulation of the action-value function from above. Thus, essentially, the loss for each sample can be represented as L = (targetQ - predictedQ)².

Woah, any more theory and my head will explode! Can we get to the code already?

Well, yes, we are almost there. We now have all the components we need to write the code: we know what the network should look like, we have seen how the network updates its weights, and we know how to interact with the environment. “So where is the hold up then?”, you ask. The problem is that the naive formulation we discussed above leads to unstable training. To combat this, the paper by DeepMind suggests two useful tricks:

1. Experience Replay Buffer

When learning directly from trajectories, subsequent input samples are temporally correlated. Moreover, the training distribution shifts wildly, because the current parameters of the network determine the next samples that the weights will be updated on. However, most commonly used machine learning theory is built on the assumption that the training data is i.i.d. (independent and identically distributed). The DQN algorithm therefore uses Experience Replay to satisfy this assumption.

Fig 3: Depiction of the working of the Experience Replay Buffer (source: Tutorial on Deep RL by David Silver)

You can think of the replay buffer as a big dataset (depicted in Fig. 3) that stores a subset of the previous experiences the agent has encountered. At each time step, a 4-tuple (state, action, reward, next_state) representing the agent’s experience is added to the replay buffer. For the weight updates, a mini-batch of experience transitions is then sampled uniformly at random from the replay buffer, thereby breaking the temporal correlations mentioned earlier.

Additionally, since the replay buffer allows samples to be reused, each experience tuple can contribute to multiple weight updates. This improves sample efficiency, meaning the agent needs fewer samples to learn the task at hand.
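A minimal replay buffer can be sketched with a deque; note that I also store the episode-termination flag ‘done’ alongside the 4-tuple from the text, since it comes in handy when computing targets later (names here are illustrative, not the repository’s):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size store of past (state, action, reward, next_state, done) transitions."""
    def __init__(self, capacity=3000):
        self.memory = deque(maxlen=capacity)  # oldest experiences are dropped automatically

    def push(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def sample(self, batch_size=64):
        # Uniform random sampling breaks the temporal correlations discussed above
        batch = random.sample(self.memory, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.memory)
```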

2. Fixed Q-targets

Going back to the discussion on calculating target_y for the loss function: we said that to update our estimates of the Q-values we use the estimated Q-values of successor states. This creates a circular feedback loop, because the weight update driven by the loss also shifts the estimated Q-value used for target_y, leaving the network chasing a non-stationary target.

To solve this problem, a second Q-network, called the ‘target network’, is introduced. It is identical in architecture to the existing policy network, but it is kept frozen, with the weights of the policy network only periodically copied over to it. During learning we thus bootstrap towards frozen targets estimated with this target network.
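In code, ‘freezing’ simply means the target network is never trained directly; its weights are only refreshed by copying them over from the policy network every so often. A sketch, assuming the QNetwork class from the earlier snippet and an illustrative update frequency:

```python
policy_net = QNetwork()
target_net = QNetwork()
target_net.load_state_dict(policy_net.state_dict())  # start from identical weights
target_net.eval()                                     # the target network is never optimised

TARGET_UPDATE_FREQ = 1000  # example value: copy weights every C environment steps

def maybe_sync_target(step):
    # called once per environment step inside the training loop
    if step % TARGET_UPDATE_FREQ == 0:
        target_net.load_state_dict(policy_net.state_dict())
```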

Effectively, these two modifications bring the RL problem closer to the supervised learning formulation we are familiar with: the experience replay acts as the training dataset, while the target_y’s produced by the target network are akin to the labels in the supervised setting. The loss function used for training, incorporating both modifications, can be expressed as:

Eq. 6: Loss function used for weight updates via gradient descent
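With the replay buffer D, the policy-network parameters 𝜃 and the frozen target-network parameters 𝜃⁻, this loss takes the squared-error form used in the DQN paper:

L(\theta) = \mathbb{E}_{(s, a, r, s') \sim D}\left[ \left( r + \gamma \max_{a'} Q(s', a'; \theta^{-}) - Q(s, a; \theta) \right)^{2} \right]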

Hmm, I think I got it. Could you maybe summarise how all these different parts fit in together?

With all the puzzle pieces uncovered, it is finally time to put it all together.

Fig 4: Flowchart depicting the DQN algorithm

The main thing to notice in Fig. 4 is the calculation of targetQ. For each data point in the mini-batch sampled from the Experience Replay Buffer, check whether it represents a terminal state, i.e. one where the episode sets ‘done’ to True. If it does, there are no future states to look towards, and hence no expected rewards from future actions, so targetQ is simply the immediate reward earned by taking action a in the current state s. If it is not a terminal state, then, following the Bellman equation (Eq. 5), we consider not only the immediate reward but also the discounted expected return from actions the agent might take in future states.
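In practice, this terminal-state check is usually implemented with the stored ‘done’ flags acting as a mask over the bootstrapped term. Below is a sketch of one update step, assuming the networks from the earlier snippets and a mini-batch already converted to tensors (states and next_states as float tensors, actions as a long tensor, rewards and dones as float tensors):

```python
import torch
import torch.nn.functional as F

GAMMA = 0.99  # discount factor

def dqn_update(policy_net, target_net, optimizer, states, actions, rewards, next_states, dones):
    # Q-values predicted by the policy network for the actions that were actually taken
    predicted_q = policy_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # Bootstrapped targets from the frozen target network; (1 - dones) zeroes out the
    # future term for terminal states, leaving only the immediate reward
    with torch.no_grad():
        max_next_q = target_net(next_states).max(dim=1)[0]
        target_q = rewards + GAMMA * max_next_q * (1 - dones)

    loss = F.mse_loss(predicted_q, target_q)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```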

Do I just replicate this flowchart in code?

Yes, pretty much. Fig. 4 shows the exact flow of control for implementing the DQN algorithm. The hyperparameters you would need to consider tuning, along with the values I used in my implementation for the LunarLander environment, are listed below (a skeleton of the training loop using these values follows the list):

  1. Capacity of replay buffer = 3000 (usually kept at 10⁶)
  2. Mini-batch of transitions sampled from the replay buffer = 64 (usually kept at either 32 or 64)
  3. Replay start size, N = 1000 (used to populate the replay memory using a random policy before learning starts)
  4. Max episodes = 3000
  5. Epsilon = 0.05 (kept at a fixed value for the plots in this post, but we will look at this parameter in detail in the next post 😄)
  6. Discount Factor, 𝛾 = 0.99
  7. Network architecture = FeedForward Neural Network with one input layer containing 8 nodes, 2 hidden layers with 256 nodes each and an output layer with 4 nodes; Adam, with default values, is used as the optimiser.
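Tying it all together, a skeleton of the training loop using these hyperparameters might look as follows. It assumes the QNetwork, ReplayBuffer and dqn_update sketches from earlier, uses the classic Gym API, and is deliberately simplified (no logging, saving or evaluation):

```python
import random
import gym
import numpy as np
import torch

BUFFER_CAPACITY, BATCH_SIZE, REPLAY_START_SIZE = 3000, 64, 1000
MAX_EPISODES, EPSILON, TARGET_UPDATE_FREQ = 3000, 0.05, 1000

env = gym.make('LunarLander-v2')
policy_net, target_net = QNetwork(), QNetwork()
target_net.load_state_dict(policy_net.state_dict())
optimizer = torch.optim.Adam(policy_net.parameters())  # Adam with default settings
buffer = ReplayBuffer(BUFFER_CAPACITY)

step = 0
for episode in range(MAX_EPISODES):
    state, done = env.reset(), False  # classic Gym API; newer versions return (obs, info)
    while not done:
        # epsilon-greedy action selection with a fixed epsilon
        if random.random() < EPSILON:
            action = env.action_space.sample()
        else:
            with torch.no_grad():
                q_values = policy_net(torch.as_tensor(state, dtype=torch.float32))
                action = q_values.argmax().item()

        next_state, reward, done, _ = env.step(action)
        buffer.push(state, action, reward, next_state, float(done))
        state = next_state
        step += 1

        if len(buffer) >= REPLAY_START_SIZE:
            states, actions, rewards, next_states, dones = buffer.sample(BATCH_SIZE)
            dqn_update(
                policy_net, target_net, optimizer,
                torch.tensor(np.array(states), dtype=torch.float32),
                torch.tensor(actions, dtype=torch.int64),
                torch.tensor(rewards, dtype=torch.float32),
                torch.tensor(np.array(next_states), dtype=torch.float32),
                torch.tensor(dones, dtype=torch.float32),
            )

        if step % TARGET_UPDATE_FREQ == 0:  # refresh the frozen target network
            target_net.load_state_dict(policy_net.state_dict())
```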

The result of training with these hyperparameters can be observed in the following plots (Fig. 5):

Fig 5: Plots indicating performance of a DQN agent trained in the LunarLander-v2 environment with the hyperparameters listed above. Left: Rewards accumulated by the agent per episode during training, along with a moving average of the scores over the last 100 episodes. Right: Rewards per episode accumulated by the agent in test mode. Note: A reward above 200 is considered to solve the environment, as indicated by the red dotted line.

So what next?

I hope this post helped you better understand the working of DQN. Now, dear reader, it’s your turn. Try to implement the algorithm by following the flowchart, to cement the concepts that you learnt here.

You can find the PyTorch implementation (documented and in-sync with the nomenclature utilised here) of the code for the DQN Agent used to generate the plots in this post on Github. Also, in case this is your first time working with Gym, you might find their documentation helpful. 😊

Also, if you have the time and want to get a deeper understanding of the algorithm, I would recommend checking out the following:

  1. Human-level control — the original paper that describes the two modifications introduced by DeepMind
  2. DeepRL Bootcamp 2017: Deep Q-Networks — an hour long video lecture by one of the contributors of the DQN paper

I hope that your first step into the world of Deep RL proved fruitful and that you found this post enjoyable. I would love to hear from you about your experience with the implementation of DQN. Looking forward to hearing from you in the comments section, or if you prefer, via e-mail at saasha.allthingsai@gmail.com! 😁

See you in the next post, and stay safe! 🦠
