Understanding RL: The Bellman Equations

Step-by-step derivation, explanation, and demystification of the most important equations in reinforcement learning

In the previous post we learnt about MDPs and some of the principal components of the Reinforcement Learning framework. In this post, we will build upon that theory and learn about value functions and the Bellman equations.

Reward and Return

As discussed previously, RL agents learn to maximize cumulative future reward. The word used to describe cumulative future reward is return and is often denoted with R. We also use a subscript t to give the return from a certain time step. In mathematical notation, it looks like this:

\[R_t = r_{t+1} + r_{t+2} + r_{t+3} + r_{t+4} + ... = \sum_{k = 0}^{\infty} r_{t+k+1}\]

If we let this series go on to infinity, then we might end up with infinite return, which really doesn’t make a lot of sense for our definition of the problem. Therefore, this equation only makes sense if we expect the series of rewards to end. Tasks that always terminate are called episodic. Card games are good examples of episodic problems. The episode starts by dealing cards to everyone, and inevitably comes to an end depending on the rules of the particular game. Then, another episode is started with the next round by dealing the cards again.

More common than using future cumulative reward as return is using future cumulative discounted reward:

\[R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \gamma^3 r_{t+4} + ... = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}\]

where . The two benefits of defining return this way is that the return is well defined for infinite series, and that it gives a greater weight to sooner rewards, meaning that we care more about imminent rewards and less about rewards we will receive further in the future. The smaller the value we select for \gamma the more true this is. This can be seen in the special cases where we let \gamma equal 0 or 1. If \gamma is 1, we arrive back at our first equation where we care about all rewards equally, not matter how far into the future they are. On the other hand, when \gamma is 0 we care only about the immediate reward, and do not care about any reward after that. This would lead our algorithm to be extremely short-sighted. It would learn to take the action that is best for that moment, but won’t take into account the effects that action will have on its future.


A policy, written \pi (s, a), describes a way of acting. It is a function that takes in a state and an action and returns the probability of taking that action in that state. Therefore, for a given state, it must be true that \sum_a \pi (s, a) = 1. In the example below, when we are Hungry we can choose between two actions, Eat or Don’t Eat.

Our policy should describe how to act in each state, so an equiprobable random policy would look something like \pi (hungry, E) = 0.5, \pi(hungry, \overline{E}) = 0.5, \pi(full, \overline{E}) = 1.0 where E is the action Eat, and \overline{E} is the action Don’t Eat. This means that if you are in the state Hungry, you will choose the action Eat and Don’t Eat with equal probability.

Our goal in reinforcement learning is to learn an optimal policy, \pi^*. An optimal policy is a policy which tells us how to act to maximize return in every state. Since this is such a simple example, it is easy to see that the optimal policy in this case is to always eat when hungry, \pi^*(hungry, E) = 1.0. In this instance, as is the case for many MDPs, the optimal policy is deterministic. There is one optimal action to take in each state. Sometimes this is written as \pi^*(s) = a, which is a mapping from states to optimal actions in those states.

Value Functions

To learn the optimal policy, we make use of value functions. There are two types of value functions that are used in reinforcement learning: the state value function, denoted V(s), and the action value function, denoted Q(s, a).

The state value function describes the value of a state when following a policy. It is the expected return when starting from state s acting according to our policy \pi:

(1)   \[V^{\pi}(s) = \mathbb{E}_{\pi} \big[R_t | s_t = s \big] \]

It is important to note that even for the same environment the value function changes depending on the policy. This is because the value of the state changes depending on how you act, since the way that you act in that particular state affects how much reward you expect to see. Also note the importance of the expectation. (As a refresher, an expectation is much like a mean; it is literally what return you expect to see.) The reason we use an expectation is that there is some randomness in what happens after you arrive at a state. You may have a stochastic policy, which means we need to combine the results of all the different actions that we take. Also, the transition function can be stochastic, meaning that we may not end up in any state with 100% probability. Remember in the example above: when you select an action, the environment returns the next state. There may be multiple states it could return, even given one action. We will see more of this as we look at the Bellman equations. The expectation takes all of this randomness into account.

The other value function we will use is the action value function. The action value function tells us the value of taking an action in some state when following a certain policy. It is the expected return given the state and action under \pi:

(2)   \[Q^{\pi}(s, a) = \mathbb{E}_{\pi} \big[ R_t | s_t = s, a_t = a \big] \]

The same notes for the state value function apply to the action value function. The expectation takes into account the randomness in future actions according to the policy, as well as the randomness of the returned state from the environment.

The Bellman Equations

Richard Bellman was an American applied mathematician who derived the following equations which allow us to start solving these MDPs. The Bellman equations are ubiquitous in RL and are necessary to understand how RL algorithms work. But before we get into the Bellman equations, we need a little more useful notation. We will define \mathcal{P} and \mathcal{R} as follows:

\[\mathcal{P}_{s s'}^{a} = Pr(s_{t+1} = s' | s_t = s, a_t = a)\]

\mathcal{P} is the transition probability. If we start at state s and take action a we end up in state s' with probability \mathcal{P}_{s s'}^{a}.

\[\mathcal{R}_{s s'}^{a} = \mathbb{E}[ r_{t+1} | s_t = s, s_{t+1} = s', a_t = a ]\]

\mathcal{R}_{s s'}^{a} is another way of writing the expected (or mean) reward that we receive when starting in state s, taking action a, and moving into state s'.

Finally, with these in hand, we are ready to derive the Bellman equations. We will consider the Bellman equation for the state value function. Using the definition for return, we could rewrite equation (1) as follows:

\[V^{\pi}(s) =\mathbb{E}_{\pi} \Big[r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + ... | s_t = s \Big] = \mathbb{E}_{\pi} \Big[ \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} | s_t = s \Big]\]

If we pull out the first reward from the sum, we can rewrite it like so:

\[V^{\pi}(s) = \mathbb{E}_{\pi} \Big[r_{t+1} + \gamma \sum_{k=0}^{\infty} \gamma^k r_{t+k+2} | s_t = s \Big]\]

The expectation here describes what we expect the return to be if we continue from state s following policy \pi. The expectation can be written explicitly by summing over all possible actions and all possible returned states. The next two equations can help us make the next step.

\[\mathbb{E}_{\pi} [r_{t+1} | s_t = s] = \sum_{a} \pi(s, a) \sum_{s'} \mathcal{P}_{s s'}^{a} \mathcal{R}_{s s'}^{a}\]
\[\mathbb{E}_{\pi} \Big[ \gamma \sum_{k=0}^{\infty} \gamma^k r_{t+k+2} | s_t = s \Big] = \sum_{a} \pi(s, a) \sum_{s'} \mathcal{P}_{s s'}^{a} \gamma \mathbb{E}_{\pi} \Big[ \sum_{k=0}^{\infty} \gamma^k r_{t+k+2} | s_{t+1} = s' \Big]\]

By distributing the expectation between these two parts, we can then manipulate our equation into the form:

\[V^{\pi}(s) = \sum_{a} \pi (s, a) \sum_{s'} \mathcal{P}_{s s'}^{a} \Bigg[ \mathcal{R}_{s s'}^{a} +\gamma \mathbb{E}_{\pi} \Big[ \sum_{k=0}^{\infty} \gamma^k r_{t+k+2}  | s_{t+1} = s' \Big] \Bigg]\]

Now, note that equation (1) is in the same form as the end of this equation. We can therefore substitute it in, giving us

(3)   \[V^{\pi}(s) = \sum_{a} \pi (s, a) \sum_{s'} \mathcal{P}_{s s'}^{a} \Big[ \mathcal{R}_{s s'}^{a} + \gamma V^{\pi}(s') \Big] \]

The Bellman equation for the action value function can be derived in a similar way. The specific steps are included at the end of this post for those interested. The end result is as follows:

(4)   \[Q^{\pi}(s,a) = \sum_{s'} \mathcal{P}_{s s'}^{a} \Big[ \mathcal{R}_{s s'}^{a} + \gamma \sum_{a'} \pi (s', a') Q^{\pi}(s', a') \Big]\]

The importance of the Bellman equations is that they let us express values of states as values of other states. This means that if we know the value of s_{t+1}, we can very easily calculate the value of s_t. This opens a lot of doors for iterative approaches for calculating the value for each state, since if we know the value of the next state, we can know the value of the current state. The most important things to remember here are the numbered equations. Finally, with the Bellman equations in hand, we can start looking at how to calculate optimal policies and code our first reinforcement learning agent.

Next Steps: Dynamic Programming

In the next post we will look at calculating optimal policies using dynamic programming, which will once again lay the foundation for more advanced algorithms. However, this will be the first opportunity to actually code a reinforcement learning algorithm. We will be looking at policy iteration and value iteration and their benefits and weaknesses. Until then, thank you for reading!

As promised: deriving the Bellman equation for the Action Value Function

Following much the same process as for when we derived the Bellman equation for the state value function, we get this series of equations, starting with equation (2):

\[Q^{\pi}(s, a) = \mathbb{E}_{\pi} \Big[ r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + ... | s_t = s, a_t = a \Big] = \mathbb{E}_{\pi} \Big[ \sum_{k = 0}^{\infty} \gamma^k r_{t + k + 1} | s_t = s, a_t = a \Big]\]
\[Q^{\pi}(s,a) = \mathbb{E}_{\pi} \Big[ r_{t+1} + \gamma \sum_{k=0}^{\infty}\gamma^k r_{t+k+2} | s_t = s, a_t = a \Big]\]
\[Q^{\pi}(s,a) = \sum_{s'} \mathcal{P}_{s s'}^{a} \Bigg[ \mathcal{R}_{s s'}^{a} + \gamma \mathbb{E}_{\pi} \Big[ \sum_{k=0}^{\infty} \gamma^k r_{t+k+2} | s_{t+1} = s' \Big] \Bigg]\]
\[Q^{\pi}(s,a) = \sum_{s'} \mathcal{P}_{s s'}^{a} \Bigg[ \mathcal{R}_{s s'}^{a} + \gamma \sum_{a'} \mathbb{E}_{\pi} \Big[ \sum_{k=0}^{\infty} \gamma^k r_{t+k+2} | s_{t+1} = s', a_{t+1} = a' \Big] \Bigg]\]
\[Q^{\pi}(s,a) = \sum_{s'} \mathcal{P}_{s s'}^{a} \Big[ \mathcal{R}_{s s'}^{a} + \gamma \sum_{a'} \pi (s', a') Q^{\pi}(s', a') \Big]\]