Reinforcement learning

It’s been a while we haven’t covered any machine learning algorithm. Last time we discussed the Markov Decision Process (or MDP).

Today we’re going to build our knowledge on top of the MDP and see how we can generalise our MDP to solve more complex problems.

Reinforcement learning really hit the news back in 2013 when a computer learned how to play a bunch of old Atari games (like Breakout) just by observing the pixels on the screen. Let’s find out how this is possible!

Image from

In Breakout the goal is to destroy all the bricks at the top of the screen by bouncing a ball over a paddle that you can move horizontally at the bottom of the screen. Every time the ball hits a brick your score increases.

The idea

If we want to teach a computer how to play that game, one solution might be to use a classifier such as a neural network where the input would be the screen images and the output the action to perform. It makes sense as for each screen we want to know if the paddle should move left or right (or stay still). The problem is that we need lots of training examples. But this is not how we learn! We don’t want someone to tell us a million time what we should do!

Instead we learn by observing the reward we get as a consequence of our actions. This is the essence of reinforcement learning. An agent (the player) performs some actions which change the state of the environment (the pixels on the screen) and gets a reward for his actions (increase of score).

Reinforcement learning lies somewhere between supervised and unsupervised learning. In supervised learning we have a label for every training example and in unsupervised learning there is no label at all. In reinforcement learning we have rewards which are sparse and time-delayed labels. Using only on those rewards the agent has to learn to behave in the environment.

It may seem simple but there are some challenges along the way. When the ball hits a brick it has nothing to do with the action you just take. In fact it was the action performed when the ball bounced on the paddle that sent the ball against the brick. The problem of identifying which action is at the origin of the reward is known as the credit assignment problem.

Markov Decision Process

Doesn’t it sound familiar? Indeed, it looks quite similar to our tiny robot example we use to illustrate the MDP.
The robot (the agent) moves (take actions) and gets a reward as it moves closer to its destination.

In fact our problem can be modelled as an MDP as well because one game can be modelled as a finite sequences of states, actions and rewards:


Moreover the next state \(s_i\) depends only on the current state \(s_i\) and the current action \(a_i\) and not on any previous states.

We are still interesting in maximising the long term reward

\(R = r_1+r_2+r_3+…+r_{n-1}+r_n\)

or considering the long term reward from time t:

\(R_t = r_t+r_{t+1}+…+r_{n-1}+r_n\)

Discounted rewards

As our process is stochastic there is no guarantee that the same action will give the same reward (e.g. when we start the game the ball starts in a random direction). It means that the most distant rewards are also the most uncertain.

For this reason we want to “discard” rewards far away in the future using a discount factor \(\gamma\):

R_t & = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + … + \gamma^{n-t} r_n \\
R_t & = r_t + \gamma (r_{t+1} + \gamma r_{t+2} + … + \gamma^{n-t-1} r_n) \\
R_t & = r_t + \gamma R_{t+1}

We used the same technique when we covered the MDP with our tiny robot example.

Q learning

The strategy we used to choose an action was the one that maximises the discounted reward. It can be expressed by defining a function \(Q\) that represents the discounted long-term reward when we performed an action \(a\) in state \(s\) and perform optimally afterwards.

Q(s_t, a_t) = \max_\pi R_{t+1}

In this case \(Q\) represents the best possible score at the end of the game if we perform action \(a\) in state \(s\).

\(\pi\) is the policy to follow to choose an action in a given state. If we assume that we know \(Q\) then choosing the action becomes simple: Just pick the action that gives the maximum score at the end of the game

\(\pi(s) = \operatorname{argmax}_a Q(s,a)\)

But there is one problem! How can we know the function \(Q\)? How can we know the score at the end of the game when we only know the current state and action. Well, we just can’t but we can assume that such a function exists from a theoretical point of view.

Bellman equation

If we break it down to a single step in the game or a single transition from state \(s\) to state \(s’\) we get

\(Q(s,a) = r + \gamma \max_{a’}Q(s’, a’)\)

It means that by taking action \(a\) in state \(s\) we get the corresponding reward \(r\) plus the maximum discounted reward we can get from here by choosing the best action in the next state.

The recursive nature of the Bellman equation helps to solve the credit assignment problem. It should also sound familiar as it’s the same Bellman equation that we’ve seen when solving MDPs. Except last time we used another value function \(U\).

In fact \(U\) can be expressed in terms of the \(Q\) function:

\(U(s) = \max_a Q(s,a)\)

With \(Q\) we specify the action to perform, we somehow force \(Q\) to learn \(a\) whereas \(U\) takes only the current state \(s\) and figures out what the best action is for that state.

Solving the Bellman equation is the key to solve the MDP and there are several techniques to do it. Last time we’ve seen how we can use linear algebra or the value iteration algorithm.

The idea was to start with a random utility function and refine it over time until we got something that stabilises:

  1. Start with an arbitrary utility function
  2. Update the utility for a state based on all the states it can reach
  3. Repeat 2. until convergence

which can be summarised by the following equation:

\hat{U}_{\pi^*}(s) = r_s + \gamma \max_a \sum_{s’} T(s,a,s’) \hat{U}_{\pi^*}(s’)

which can be written in terms of the \(Q\) function

\hat{Q}_{\pi^*}(s, a) & = r_s + \gamma \sum_{s’} T(s,a,s’) \max_{a’} \hat{Q}_{\pi^*}(s’, a’) \\

Similarly we start with a random \(Q\) function and we refine it as we visit more state over time.

Jolly good, but how does that fit with learning how to play Breakout ? Well, there is one more problem here. Remember that \(T(s,a,s’)\) is the transition model, i.e. the probabilities for reaching state \(s’\) given that we are in state \(s\) and perform action \(a\). And in the Breakout game we don’t have access to the transition model.

Are we doomed? No, we can actually learn the \(Q\) function from the data, i.e. by observing what happens when we perform an action \(a\) in state \(s\). The idea is that if we visit state \(s\) and perform action \(a\) many times we will finally learn the real value \(Q(s,a)\).

To make things simple imagine that we that we store the \(Q\) values in a table where the rows are the possible state and the columns the possible actions.

This gives us the following pseudo code:

initialise Q[num_states, num_action] randomly
start from initial state s
while not terminated do
   select and perform action a
   observe reward r and next state s'
   update Q[s, a] = (1 - α)Q[s, a] + α (r + γ max_a' Q[s', a'])
   s = s'

The key point is to understand how we update the estimated \(\hat{Q}(s,a)\) every time we take action \(a\) in state \(s\)

\hat{Q}(s, a) = (1 – \alpha) \hat{Q}(s, a) + \alpha (r + \gamma \max_{a’} \hat{Q}(s’, a’)

Here the update is simple because we always know \(\hat{Q}\) because we initialised all the values randomly. It may not be correct but there is a value we can use for our computation.

There is also a new parameter \(\alpha\). \(\alpha\) is the learning rate and has a value between 0 and 1.

  • If \(\alpha = 0\) then we are not learning anything because we’re not update \(\hat{Q}\).
  • If \(\alpha = 1\) then we forget everything we’ve learned so far and only consider the new value.

The idea is that \(\alpha\) is decreasing over time because we want to learn a lot at the beginning and then take smaller steps as we accumulate knowledge and converge to the true value.

Exploration / Exploitation dilemma

There is some point we need to clarify:

  • How do we choose the action to perform?
  • How do we initialise \(\hat{Q}\)?

If we always choose the same action \(a\) we won’t learn anything. On the other hand if we always choose \(a\) randomly we will explore all the possibilities and converge to “real” \(Q\) but we’re not going to use that knowledge.

We need somehow to use the knowledge accumulated in \(\hat{Q}\). We can simply use \(\hat{\pi}(s) = \operatorname{argmax}_a \hat{Q}(s,a)\). Sounds good but \(\hat{Q} \neq Q\) especially at the beginning where \(\hat{Q}\) is initialised randomly.

It means that at the beginning we pick the actions randomly but as soon as we find something that “seems” to work we’re stick with that. E.g. we can find out that always moving the paddle to the left is a valuable strategy however it surely isn’t the optimal strategy. And that’s the problem with this algorithm, as soon as it fines something that looks better than the alternatives it always picks this action and stop trying anything else. It’s like being stuck in a local minima. For this reason we say that the algorithm is greedy.

To solve this problem we can slightly modify our policy \(\hat{\pi}\) to choose the action given by \(\operatorname{argmax}_a \hat{Q}(s,a)\) with probability \((1-\epsilon)\) and choose a random action with property \(\epsilon\).

It means that we do what seems to be right most of the time (by following \(\hat{Q}\)) but once in a while we pick a random action to get a chance to explore the whole space of possibilities.

We can have \(\epsilon\) decreasing over time so that we take more random actions at the beginning to favour the exploration of the space and we as gain more knowledge we use it more and more often to pick the right action.

This is the exploration / exploitation dilemma.

We can also initialise \(\hat{Q}\) with high values (bigger than any values the \(Q\) function can get) so that any unexplored value looks better than anything which have been tried. This is a known as optimism in face of uncertainty.

Deep Q Network

The state of the environment is represented by the pixels on the screen. However for Breakout we need several consecutive images to capture the direction and speed of the ball.

This is what the DeepMind team behind the paper “Playing Atari with Deep Reinforcement Learning” did. They use 4 consecutive screenshots – reduced to 86×86 images with 256 grey levels – to describe the state of the system.

If we were to use our Q table to store all the possible state we’d be left with \(256^{86*86*4}\) game states. This number is so big that there is no way to build such table. It’s true that the images have some structures (bricks at the top, paddle at the bottom, …) so in practice there is much less possible state.

Still using a sparse table to store only the visited states would take too long to compute the \(Q\) function as many states will be visited very rarely.

This is where deep learning comes in! The idea is to use a neural network to compute the \(Q\) function. Neural network are very efficient to find good features for highly-structured data.

This is exactly what the DeepMind team did. They used a convolutional layer with 3 layers followed by 2 fully connected layers.

Layer Input Filter stride activation output
conv1 84x84x4 8×8 4 ReLU 20x20x32
conv2 20x20x32 4×4 2 ReLU 9x9x64
conv3 9x9x64 3×3 1 ReLU 7x7x64
fc4 7x7x64 ReLU 512
fc5 512 Linear 18

The output is the Q-values for the 18 possible actions on the Atari.

As Q-values are scalar values it’s possible to use a simple squared loss error function to do the back-propagation:

L=\frac{1}{2}[\underbrace{r + \gamma max_{a’}Q(s’,a’)}_{\text{target}} – \underbrace{Q(s,a)}_{\text{prediction}}]^2

Using this neural network our update rule for the Q-function becomes:

  1. Perform a feed-forward pass for the current state \(s\)
  2. Perform a feed-forward pass for the next state \(s’\) and calculate \(\max+{a’}Q(s’,a’)\)
  3. Set the Q-value target for action \(a\) to \(r+\gamma\max_{a’}Q(s’,a’)\) (i.e. the max calculated in step 2). For all other actions, set the Q-value target to the same value as returned from step 1 (it makes the error 0 for the other actions)
  4. Perform a back propagation pass to update the weights of the Q function

We now have a good idea how to use a neural network to approximate the \(Q\) function. However it turns out that approximation of Q-values using non-linear function is not very stable. You actually need lots of tricks to make it converge.

The most important trick is experience replay. During the game all the transitions \(\) are stored in memory. When training the network, random samples from the replay memory are used instead of the most recent transition. This avoids being stuck in local minima.

Taking all this into consideration we can update our Q-learning algorithm as follow

initialise weights for function Q randomly
observe initial state s
while not terminated do
  select an action a:
    with probability ε choose a random action
    otherwise chose action a = argmax_a' Q(s,a')
  perform action a
  observe reward r and next state s'
  store transition <s,a,r,s'> into replay memory

  sample random transitions <ss, aa, rr, ss'> from replay memory
  calculate target for each mini batch transition:
    if ss' is terminal state then target = rr
    otherwise target = rr + γ max_a’ Q(ss’, aa)
  train the neural network using (target - Q(ss,aa))^2 as loss

  s = s'

Of course there are more tricks involved like error clipping, reward clipping, … but it’s quite awesome that MDP can be solved just by observing data as we interact with the system.