Now that we know about Markov chain, let’s focus on a slightly different process: the Markov Decision Process.

This process is quite similar to a Markov chain but adds more concept into it: Actions and Rewards. Having a reward means that it’s possible to learn which action yield the best rewards. This type of learning is also known as reinforcement learning.

In this post we’re going to see what exactly is a Markov decision process and how to solve it in an optimal way.

#### The defintion

A Markov decision process is similar to a Markov chain but adds actions and rewards to it. That means it is defined by the following properties:

- A set of states \(S = s_0, s_1, s_2, …, s_m\)
- An initial state \(s_0\)
- A set of actions \(A = a_0, a_1, a_2, …, a_n\)
- A transition model \(T(s, a, s’)\)
- A reward function \(R(s)\)

At every step an action can be performed. Like a Markov chain the Markov decision process is also a stochastic process (i.e. the transitions are based on probabilities). Given a state and an action there is a probability distribution that indicates what the next step is. These probabilities are given by the transition model \(T(s, a, s’) \sim Pr(s’|s,a)\). The transition model is stationery (the probabilities don’t change over time).

For each state that we’re in we are also given a reward (positive or negative – depending how “good” or “useful” the state is).

Note that the reward function \(R(s)\) depends only on the state \(s\). This is a simplification that we assume in this post but it doesn’t have to be this way. (e.g. the reward may depend on both the state and the action \(R(s, a)\) or even on the state, the action and the new state \(R(s, a, s’)\)).

Like the Markov chain the Markov process depends only on the current state (and on the chosen action). It doesn’t depends on any of the previous states.

#### The robot example

That’s all it is – but let’s take an example to see what all of this theory actually means. A fairly common example is a robot that can move into different positions. Each position is defined by its coordinates.

The set of states are the different positions. The set of actions are the moves the robot can make: Up, Down, Left, Right. If the robot hits a wall it stays at the same position (its state doesn’t change). That allows us to have a static set of actions \(A\) and not a set of action that depends on the current state \(A(s)\).

For every state the robot gets a reward of \(-0.04\) because it consumes some battery power. When it reaches its dock station at position \((4,3)\) it gets a \(+1\) reward because it can charge its batteries and if it falls down into the trap at position \((4,2)\) it gets a \(-1\) reward.

Remember that our process is stochastic. If the robot choose to move up it doesn’t mean it actually does. It certainly does most of the time but in some cases it may end up in a different position for whatever reason (e.g. a cat was on the robot’s way or someone accidentally stepped on it, …). To put some figure in place let’s assume that if the robot chose to move in one direction it does 80% of the time. (and it ends up on the left-side of that direction 10 % of the time and similarly 10% of the time for the right-side). However in any case the robot knows its current position (i.e. its current state).

Now using the reward function we can choose the next action in order to maximise the reward. E.g if the robot is at position \((3,3)\) the “best” action it can do is move right which hopefully will lead to position \((4,3)\) where there is a \(+1\) reward. Alternatively if we are in the starting state \(s_0 = (1,1)\) all the actions give the same reward of \(-0.04\), so which action should the robot choose?

This is where we introduce the notion of policy. A policy \(\pi\) is just a function that given a state returns an action. This is a solution to a Markov decision process.

And of course we are interested in the optimal policy \(\pi^*\) that given a state returns the action that eventually gives the most reward. The optimal policy takes into consideration the reward for the current state but also all the rewards the robot can get over its lifetime.

#### The reward matters

With a \(-0.4\) reward the optimal policy guides the robot to the dock station where it gets a \(+1\) reward.

Notice that the optimal plan starting from position \((1,1)\) is up, up, right, right, right and not right, right, up, up, right – which seems equivalent. Why ? Because the process is stochastic and going right first the robot passes close to the trap and according to the stochastic nature of the process it has some chance to fall into the trap even tough it wants to go up. Similarly it’s the same thing that happens for position \((3,1)\) where the policy advise to take the longest path (going back to the start) in order to avoid passing close to the trap (even though this path is shorter – but also more risky).

If we change the reward the optimal policy also changes. If we don’t want the robot to end the game you can give him a reward of \(+2\). In this case it will try to avoid both the trap and the dock station.

In the opposite if we want the robot to reach one of the terminal state (dock-station or trap) as quickly as possible we can give a reward of \(-2\). In this case it becomes more “painful” to continue the game than falling into the trap.

All of this to say that the reward function must be chosen carefully and reflect some domain knowledge so that the “robot” can learn the behaviour that we want.

Also notice that the policy is stationery – it doesn’t change over time. And that because we assume that the process can run forever. If the robot had only a finite number of steps it can make (e.g. 3 or 4 steps) then the policy would be different (e.g if the robot is in state \((3,1)\) it might become worth to take the shortest path to get to the dock-station event though there is a risk to fall down into the trap because otherwise there is no way to got to the dock-station going the long way around). For the rest of this post we assume an infinite time steps.

#### Sequence of rewards

Now let’s define a utility function that computes the reward that the robot gets for a sequence of states that it traverses:

\(\begin{align}

U(s_0, s_1, s_2, …) = \sum_{t=0}^{\infty} R(s_t)

\end{align}

\)

The utility function is simply the sum of all the rewards the robot gets in the different states it goes through.

However assuming that robot lives forever there is a problem with this expression. Taking an infinity number of steps and accumulating an infinite number of rewards gives infinity (if the rewards are positive).

From the above example it is clear that the first sequence is better than the second. However they both have a utility that is infinite. In order to deal with this we can slightly change the utility definition:

\(\begin{align}

U(s_0, s_1, s_2, …) = \sum_{t=0}^{\infty} \gamma^tR(s_t)

\text{ where } 0 \le \gamma \lt 1

\end{align}

\)

The intuition behind it is that \(\gamma^t\) “fades” away the rewards that are far away in the future. We have a “finite horizon” of the rewards we can get.

Now U is no longer infinite but bounded by the following geometric serie:

\(\begin{align}

U(s_0, s_1, s_2, …) & = \sum_{t=0}^{\infty} \gamma^tR(s_t) \\

& \le \sum_{t=0}^{\infty} \gamma^tR_{\max} = \frac{R_{\max}}{1 – \gamma}

\end{align}

\)

This is known as the discounted rewards.

#### Finding the optimal policy

Now we are all setup to try to find out the optimal policy \(\pi^*\). \(\pi^*\) is the policy that gives the most rewards over the robot’s lifetime. This is also the one the maximises the expectation of our discounted serie:

\(\begin{align}

\pi^* = \operatorname{argmax}_{\pi} \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^tR(s_t) \middle| \pi\right]

\end{align}

\)

We can also redefine \(U^{\pi}(s)\) to be the utility (the reward accumulation) starting from state \(s\) and following the policy \(\pi\).

\(\begin{align}

U^{\pi}(s) = \mathbb{E}\left[\sum_{t=0}^{\infty}\gamma^tR(s_t) \middle| \pi, s_0=s\right]

\end{align}

\)

This answers the question: How useful is it to be in a given state? Well it is as useful to be in that state as what we expect to see from that point on given that we follow the policy \(\pi\).

Note that it’s not the same as the reward we get for entering state \(s\). The reward is the immediate feedback whereas the utility is the long term feedback (as it includes the current reward and all the other rewards we expect to get if we follow the policy \(\pi\)).

We can now rewrite \(\pi^*\) in terms of the utility function

\(\begin{align}

\pi^*(s) = \operatorname{argmax}_a \sum_{s’} T(s,a,s’)U^{\pi^*}(s’)

\end{align}

\)

What does this equation means? Simply that the optimum policy for state \(s\) is the action that’s going to give the maximum utility function and as we’re dealing with a stochastic process we have to consider each step \(s’\) that we may end up in given that we performed action \(a\) in state \(s\).

Note that this definition is recursive because \(\pi^*\) is defined in terms of \(\pi^*\).

Let’s take an example with our robot example. Let’s say the robot is in state \((1, 1)\). There are 4 possible actions: up, down, left and right. If the robot goes up it may end up in state \((1,2)\) with a 0.8 probability, it can stay in same state with a probability of 0.1 and move to the right with a probability of 0.1.

\(\begin{align}

\sum_{s’} T(s_{(1,1)},a_{up},s’)U^{\pi^*}(s’) = 0.8 U^{\pi^*}(s_{(1,2)}) + 0.1 U^{\pi^*}(s_{(1,1)}) + 0.1 U^{\pi^*}(s_{(2,1)})

\end{align}

\)

Using the transition probability ensures that something with a low probability to happen (e.g. 0.1) counts less than something with a high-probability (e.g. 0.8).

Now that it’s clear we can also rewrite \(U^{\pi^*}\) as

\(\begin{align}

U^{\pi^*}(s) = R(s) + \gamma \max_a \sum_{s’} T(s,a,s’) U^{\pi^*}(s’)

\end{align}

\)

This recursive equation is known as the Bellman equation and is a fundamental equation in reinforcement learning. If we can solve the Bellman equation we can figure out the optimum policy \(\pi^*\).

#### The value iteration algorithm

From the definition of the MDP (Markov Decision Process) we said that we have a set of \(m\) states. That means that we actually have a set of \(m\) equations. What unknowns do we have? Well, \(R\) is known, \(T\) is known and even \(\gamma\) is known. The only unknown is \(U\). So we have \(m\) equations and \(m\) unknowns. Seems like something we can solve – except that this system in non-linear (because of the \(\max\) function).

What can we do, then? We can start with an arbitrary utility value and then refine the values until we got something that stabilises (that no longer changes).

- Start with an arbitrary utility function
- Update the utility for a state based on all the states it can reach
- Repeat 2. until convergence

Probably it’ll become clearer if we rewrite Bellman’s equation in terms of updates:

\(\begin{align}

\hat{U}_{t+1}^{\pi^*}(s) = R(s) + \gamma \max_a \sum_{s’} T(s,a,s’) \hat{U}_t^{\pi^*}(s’)

\end{align}

\)

Why does it work? The intuition behind it is that we start with incorrect values of \(\hat{U}\) but after every iteration we had more truth into \(\hat{U}\) because \(R(s)\) is the “true” reward we get at state \(s\). The more we iterate, the more truth we add into \(\hat{U}\) and after a while the original arbitrary values are just negligible.

#### The policy iteration algorithm

The utility function gives us the reward we can expect by taking a given action but what we are trying to solve is to find out \(\pi^*\) which is a mapping from states to actions. In fact it doesn’t really matter how much reward we can get as long as the policy gives the correct action.

This time we’re going to focus on the policies rather than on the utilities. The policy iteration algorithm is

- Start with an arbitrary policy function \(\pi_0\)
- Given \(\pi_t\), evaluate \(U_t = U^{\pi_t}\)
- Improve \(\pi_{t+1} = \operatorname{argmax}_a \sum_{s’} T(s,a,s’) U_t(s’)\)

We can also solve the policy iteration using linear algebra. We can re-write the Bellman’s equation in terms of policy

\(\begin{align}

U_t(s) = R(s) + \gamma \sum_{s’} T(s,\pi_t(s),s’) U_t(s’)

\end{align}

\)

In this equation the \(\pi_t(s)\) gives us the “best” action to perform in state \(s\) so the only unknown is \(U\). This is a system of \(m\) equations with \(m\) unknowns. This time the equations are linear (the \(\max\) function is no longer needed).

We can solve this system with regular linear algebra

\(\begin{align}

U & = R + \gamma T U \\

U – \gamma T U & = R \\

(I – \gamma T) U & = R \\

U & = (I – \gamma T)^{-1} R

\end{align}

\)

Note that we need to inverse the transition matrix and if the number of states is large this might become an issue.

So sometimes it’s better to use the value iteration algorithm (or even start with the value iteration algorithm before using the policy iteration algorithm).