Understanding RL: The Bellman Equations

Step-by-step derivation, explanation, and demystification of the most important equations in reinforcement learning

In the previous post we learnt about MDPs and some of the principal components of the Reinforcement Learning framework. In this post, we will build upon that theory and learn about value functions and the Bellman equations.

Reward and Return

As discussed previously, RL agents learn to maximize cumulative future reward. The word used to describe cumulative future reward is return and is often denoted with R. We also use a subscript t to give the return from a certain time step. In mathematical notation, it looks like this:

    \[R_t = r_{t+1} + r_{t+2} + r_{t+3} + r_{t+4} + ... = \sum_{k = 0}^{\infty} r_{t+k+1}\]

If we let this series go on to infinity, then we might end up with infinite return, which really doesn’t make a lot of sense for our definition of the problem. Therefore, this equation only makes sense if we expect the series of rewards to end. Tasks that always terminate are called episodic. Card games are good examples of episodic problems. The episode starts by dealing cards to everyone, and inevitably comes to an end depending on the rules of the particular game. Then, another episode is started with the next round by dealing the cards again.

More common than using future cumulative reward as return is using future cumulative discounted reward:

    \[R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \gamma^3 r_{t+4} + ... = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}\]

where 0 \leq \gamma \leq 1 (with \gamma < 1 required for the infinite sum to converge). The two benefits of defining return this way are that the return is well defined for infinite series, and that it gives greater weight to earlier rewards, meaning that we care more about imminent rewards and less about rewards we will receive further in the future. The smaller the value we select for \gamma, the more true this is. This can be seen in the special cases where we let \gamma equal 0 or 1. If \gamma is 1, we arrive back at our first equation, where we care about all rewards equally, no matter how far into the future they are. On the other hand, when \gamma is 0 we care only about the immediate reward and ignore every reward after that. This would make our algorithm extremely short-sighted: it would learn to take the action that is best for the current moment, without taking into account the effects that action will have on its future.
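
To make the effect of \gamma concrete, here is a minimal Python sketch (with made-up reward values) that computes the discounted return of a short, finite reward sequence for a few different discount factors:

# A minimal sketch: discounted return of a short, finite reward sequence.
# The reward values are made up purely for illustration.
def discounted_return(rewards, gamma):
    return sum(gamma**k * r for k, r in enumerate(rewards))

rewards = [1.0, 0.0, 2.0, 3.0]            # r_{t+1}, r_{t+2}, r_{t+3}, r_{t+4}
print(discounted_return(rewards, 1.0))    # 6.0   -- all rewards count equally
print(discounted_return(rewards, 0.9))    # ~4.81 -- later rewards count less
print(discounted_return(rewards, 0.0))    # 1.0   -- only the immediate reward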


Policy

A policy, written \pi (s, a), describes a way of acting. It is a function that takes in a state and an action and returns the probability of taking that action in that state. Therefore, for a given state, it must be true that \sum_a \pi (s, a) = 1. In the hungry/full example from the previous post, when we are Hungry we can choose between two actions, Eat or Don’t Eat.

Our policy should describe how to act in each state, so an equiprobable random policy would look something like \pi (hungry, E) = 0.5, \pi(hungry, \overline{E}) = 0.5, \pi(full, \overline{E}) = 1.0 where E is the action Eat, and \overline{E} is the action Don’t Eat. This means that if you are in the state Hungry, you will choose the action Eat and Don’t Eat with equal probability.

Our goal in reinforcement learning is to learn an optimal policy, \pi^*. An optimal policy is a policy which tells us how to act to maximize return in every state. Since this is such a simple example, it is easy to see that the optimal policy in this case is to always eat when hungry, \pi^*(hungry, E) = 1.0. In this instance, as is the case for many MDPs, the optimal policy is deterministic. There is one optimal action to take in each state. Sometimes this is written as \pi^*(s) = a, which is a mapping from states to optimal actions in those states.
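
As a quick illustration, a policy like the equiprobable one above can be stored as a simple table of action probabilities. This is just a sketch; the state and action names mirror the example, and the representation is one choice among many:

import random

# A sketch of the equiprobable policy above as a table of action probabilities.
policy = {
    "hungry": {"eat": 0.5, "dont_eat": 0.5},
    "full":   {"dont_eat": 1.0},
}

# For a valid policy, the probabilities in each state must sum to 1.
assert all(abs(sum(probs.values()) - 1.0) < 1e-9 for probs in policy.values())

def sample_action(policy, state):
    actions = list(policy[state])
    weights = list(policy[state].values())
    return random.choices(actions, weights=weights)[0]

print(sample_action(policy, "hungry"))    # "eat" or "dont_eat", each half the time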

Value Functions

To learn the optimal policy, we make use of value functions. There are two types of value functions that are used in reinforcement learning: the state value function, denoted V(s), and the action value function, denoted Q(s, a).

The state value function describes the value of a state when following a policy. It is the expected return when starting from state s and acting according to our policy \pi:

(1)   \[V^{\pi}(s) = \mathbb{E}_{\pi} \big[R_t | s_t = s \big] \]

It is important to note that, even for the same environment, the value function changes depending on the policy. This is because the value of a state depends on how you act, since the way you act in that particular state affects how much reward you expect to see. Also note the importance of the expectation. (As a refresher, an expectation is much like a mean; it is literally what return you expect to see.) The reason we use an expectation is that there is some randomness in what happens after you arrive at a state. You may have a stochastic policy, which means we need to combine the results of all the different actions we might take. The transition function can also be stochastic, meaning the next state is not determined with certainty. Remember the example above: when you select an action, the environment returns the next state, and there may be multiple states it could return even given one action. We will see more of this as we look at the Bellman equations. The expectation takes all of this randomness into account.
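
One way to build intuition for this expectation: if we could run many episodes starting from s and following \pi, recording the return of each, their average would approach V^{\pi}(s). A tiny sketch with invented numbers:

# The expectation in equation (1), viewed as an average: if we ran many
# episodes from s under pi and recorded each return R_t, their mean would
# approach V^pi(s). The returns below are invented numbers.
sampled_returns = [4.7, 5.2, 3.9, 5.0, 4.6]
v_estimate = sum(sampled_returns) / len(sampled_returns)
print(v_estimate)    # ~4.68, a Monte Carlo estimate of V^pi(s)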

The other value function we will use is the action value function. The action value function tells us the value of taking an action in some state when following a certain policy. It is the expected return given the state and action under \pi:

(2)   \[Q^{\pi}(s, a) = \mathbb{E}_{\pi} \big[ R_t | s_t = s, a_t = a \big] \]

The same notes for the state value function apply to the action value function. The expectation takes into account the randomness in future actions according to the policy, as well as the randomness of the returned state from the environment.

The Bellman Equations

Richard Bellman was an American applied mathematician who derived the following equations which allow us to start solving these MDPs. The Bellman equations are ubiquitous in RL and are necessary to understand how RL algorithms work. But before we get into the Bellman equations, we need a little more useful notation. We will define \mathcal{P} and \mathcal{R} as follows:

    \[\mathcal{P}_{s s'}^{a} = Pr(s_{t+1} = s' | s_t = s, a_t = a)\]

\mathcal{P} is the transition probability. If we start at state s and take action a we end up in state s' with probability \mathcal{P}_{s s'}^{a}.

    \[\mathcal{R}_{s s'}^{a} = \mathbb{E}[ r_{t+1} | s_t = s, s_{t+1} = s', a_t = a ]\]

\mathcal{R}_{s s'}^{a} is another way of writing the expected (or mean) reward that we receive when starting in state s, taking action a, and moving into state s'.
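
As a concrete (and entirely made-up) illustration, \mathcal{P} and \mathcal{R} for a small MDP like the hungry/full example can be stored as nested dictionaries indexed by state, action, and next state. The probabilities and rewards below are assumptions chosen only to show the data layout:

# P[s][a][s'] and R[s][a][s'] for a tiny hungry/full MDP. The numbers are
# assumptions chosen only to illustrate the data layout.
P = {
    "hungry": {"eat":      {"full": 1.0},
               "dont_eat": {"hungry": 1.0}},
    "full":   {"dont_eat": {"hungry": 1.0}},
}
R = {
    "hungry": {"eat":      {"full": 2.0},
               "dont_eat": {"hungry": -1.0}},
    "full":   {"dont_eat": {"hungry": 0.0}},
}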

Finally, with these in hand, we are ready to derive the Bellman equations. We will consider the Bellman equation for the state value function. Using the definition for return, we could rewrite equation (1) as follows:

    \[V^{\pi}(s) =\mathbb{E}_{\pi} \Big[r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + ... | s_t = s \Big] = \mathbb{E}_{\pi} \Big[ \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} | s_t = s \Big]\]

If we pull out the first reward from the sum, we can rewrite it like so:

    \[V^{\pi}(s) = \mathbb{E}_{\pi} \Big[r_{t+1} + \gamma \sum_{k=0}^{\infty} \gamma^k r_{t+k+2} | s_t = s \Big]\]

The expectation here describes what we expect the return to be if we continue from state s following policy \pi. The expectation can be written explicitly by summing over all possible actions and all possible returned states. The next two equations can help us make the next step.

    \[\mathbb{E}_{\pi} [r_{t+1} | s_t = s] = \sum_{a} \pi(s, a) \sum_{s'} \mathcal{P}_{s s'}^{a} \mathcal{R}_{s s'}^{a}\]

    \[\mathbb{E}_{\pi} \Big[ \gamma \sum_{k=0}^{\infty} \gamma^k r_{t+k+2} | s_t = s \Big] = \sum_{a} \pi(s, a) \sum_{s'} \mathcal{P}_{s s'}^{a} \gamma \mathbb{E}_{\pi} \Big[ \sum_{k=0}^{\infty} \gamma^k r_{t+k+2} | s_{t+1} = s' \Big]\]

By distributing the expectation between these two parts, we can then manipulate our equation into the form:

    \[V^{\pi}(s) = \sum_{a} \pi (s, a) \sum_{s'} \mathcal{P}_{s s'}^{a} \Bigg[ \mathcal{R}_{s s'}^{a} +\gamma \mathbb{E}_{\pi} \Big[ \sum_{k=0}^{\infty} \gamma^k r_{t+k+2}  | s_{t+1} = s' \Big] \Bigg]\]

Now, note that equation (1) is in the same form as the end of this equation. We can therefore substitute it in, giving us

(3)   \[V^{\pi}(s) = \sum_{a} \pi (s, a) \sum_{s'} \mathcal{P}_{s s'}^{a} \Big[ \mathcal{R}_{s s'}^{a} + \gamma V^{\pi}(s') \Big] \]
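
To see equation (3) as something you can compute, here is a short sketch that applies it as an update rule on the toy MDP from the earlier sketch. Everything here (transition probabilities, rewards, policy, discount) is assumed for illustration; repeatedly sweeping the backup over all states converges to V^{\pi}:

# Equation (3) as an update rule, on the toy hungry/full MDP sketched above.
# All numbers (dynamics, rewards, policy, discount) are assumptions.
P = {"hungry": {"eat": {"full": 1.0}, "dont_eat": {"hungry": 1.0}},
     "full":   {"dont_eat": {"hungry": 1.0}}}
R = {"hungry": {"eat": {"full": 2.0}, "dont_eat": {"hungry": -1.0}},
     "full":   {"dont_eat": {"hungry": 0.0}}}
pi = {"hungry": {"eat": 0.5, "dont_eat": 0.5}, "full": {"dont_eat": 1.0}}
gamma = 0.9

def bellman_backup(s, V):
    # V^pi(s) = sum_a pi(s,a) sum_s' P[s][a][s'] (R[s][a][s'] + gamma V[s'])
    return sum(pi[s][a] * sum(p * (R[s][a][s2] + gamma * V[s2])
                              for s2, p in P[s][a].items())
               for a in pi[s])

V = {s: 0.0 for s in P}                        # arbitrary initial guess
for _ in range(100):                           # repeated sweeps converge to V^pi
    V = {s: bellman_backup(s, V) for s in P}
print(V)    # roughly {'hungry': 3.45, 'full': 3.10} with these made-up numbers

Changing the policy (say, always eating when hungry) changes the resulting values, which is exactly why the value function is defined with respect to a policy.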

The Bellman equation for the action value function can be derived in a similar way. The specific steps are included at the end of this post for those interested. The end result is as follows:

(4)   \[Q^{\pi}(s,a) = \sum_{s'} \mathcal{P}_{s s'}^{a} \Big[ \mathcal{R}_{s s'}^{a} + \gamma \sum_{a'} \pi(s', a') Q^{\pi}(s', a') \Big]\]

The importance of the Bellman equations is that they let us express the value of a state in terms of the values of other states. This means that if we know the values of the possible next states s_{t+1}, we can easily calculate the value of s_t. This opens a lot of doors for iterative approaches to calculating the value of each state, because knowing the values of the next states lets us work out the value of the current state. The most important things to remember here are the numbered equations. Finally, with the Bellman equations in hand, we can start looking at how to calculate optimal policies and code our first reinforcement learning agent.

Next Steps: Dynamic Programming

In the next post we will look at calculating optimal policies using dynamic programming, which will once again lay the foundation for more advanced algorithms. However, this will be the first opportunity to actually code a reinforcement learning algorithm. We will be looking at policy iteration and value iteration and their benefits and weaknesses. Until then, thank you for reading!


As promised: deriving the Bellman equation for the Action Value Function

Following much the same process as for when we derived the Bellman equation for the state value function, we get this series of equations, starting with equation (2):

    \[Q^{\pi}(s, a) = \mathbb{E}_{\pi} \Big[ r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + ... | s_t = s, a_t = a \Big] = \mathbb{E}_{\pi} \Big[ \sum_{k = 0}^{\infty} \gamma^k r_{t + k + 1} | s_t = s, a_t = a \Big]\]

    \[Q^{\pi}(s,a) = \mathbb{E}_{\pi} \Big[ r_{t+1} + \gamma \sum_{k=0}^{\infty}\gamma^k r_{t+k+2} | s_t = s, a_t = a \Big]\]

    \[Q^{\pi}(s,a) = \sum_{s'} \mathcal{P}_{s s'}^{a} \Bigg[ \mathcal{R}_{s s'}^{a} + \gamma \mathbb{E}_{\pi} \Big[ \sum_{k=0}^{\infty} \gamma^k r_{t+k+2} | s_{t+1} = s' \Big] \Bigg]\]

    \[Q^{\pi}(s,a) = \sum_{s'} \mathcal{P}_{s s'}^{a} \Bigg[ \mathcal{R}_{s s'}^{a} + \gamma \sum_{a'} \pi(s', a') \mathbb{E}_{\pi} \Big[ \sum_{k=0}^{\infty} \gamma^k r_{t+k+2} | s_{t+1} = s', a_{t+1} = a' \Big] \Bigg]\]

    \[Q^{\pi}(s,a) = \sum_{s'} \mathcal{P}_{s s'}^{a} \Big[ \mathcal{R}_{s s'}^{a} + \gamma \sum_{a'} \pi(s', a') Q^{\pi}(s', a') \Big]\]


Everything You Need to Know to Get Started in Reinforcement Learning

By the end of this two-part series you will know all the basic theory required to understand how reinforcement learning algorithms work.

In two posts I will share the most important elements of about 85 pages of reinforcement learning textbook. RL is an extremely useful tool to have in any machine learning practitioner’s toolkit, and these posts are designed as a primer to the foundations of reinforcement learning so that you can get into implementing the latest models as quickly as possible. Of course, for a more thorough treatment of the subject, I recommend picking up the textbook “Reinforcement Learning: An Introduction” by Sutton and Barto, but this post will attempt to give a quick, intuitive grounding in the theory behind reinforcement learning.

Supervised vs Evaluative Learning

For many problems of interest, the paradigm of supervised learning doesn’t give us the flexibility we need. The main difference between supervised learning and reinforcement learning is whether the feedback received is evaluative or instructive. Instructive feedback tells you how to achieve your goal, while evaluative feedback tells you how well you achieved your goal. Supervised learning solves problems based on instructive feedback, and reinforcement learning solves them based on evaluative feedback. Image classification is an example of a supervised problem with instructive feedback; when the algorithm attempts to classify a certain piece of data, it is told what the true class is. Evaluative feedback, on the other hand, merely tells you how well you did at achieving your goal. If you were training a classifier using evaluative feedback, your classifier might say “I think this is a hamster”, and in return it would receive 50 points. Without any more context, we really don’t know what 50 points means. We would need to make other classifications and explore to find out whether our 50 points means we were accurate or not. Maybe 10,000 is a more respectable score, but we just don’t know until we attempt to classify some other data points.

Two gold stars and a smiley face for guessing hamster. If you guessed gerbil you can have a silver star and half a thumbs up.

In many problems of interest, the idea of evaluative feedback is much more intuitive and accessible. For example, imagine a system that controls the temperature in a data center. Instructive feedback doesn’t seem to make much sense here: how do you tell your algorithm what the correct setting of each component is at any given time step? Evaluative feedback makes much more sense. You could easily feed back data such as how much electricity was used in a certain time period, what the average temperature was, or even how many machines overheated. This is actually how Google tackles the problem, with reinforcement learning. So let’s jump straight into it.

Markov Decision Processes

A state s is said to be Markov if the future from that state is conditionally independent of the past given s. In other words, the current state summarizes everything relevant about all the past states leading up to it. If that doesn’t make much sense, it is easier to see by example. Consider a ball flying through the air. If its state is its position and velocity, that is sufficient to describe where it has been and where it will go (given a physics model and no outside influences). Therefore, the state has the Markov property. However, if we only knew the position of the ball but not its velocity, its state is no longer Markov: the current state doesn’t summarize all past states, and we would need information from the previous time step to start building a proper model of the ball.

Reinforcement Learning is most often modeled as a Markov Decision Process, or MDP. An MDP is a directed graph which has states for its nodes and edges which describe transitions between Markov states. Here is a simple example:


A simple MDP for learning about MDPs.

This MDP shows the process of learning about MDPs. At first you are in the state Don’t understand. From there, you have two possible actions, Study or Don’t Study. If you choose to not study, there is a 100% chance that you will end up back in the Don’t understand state. However, if you study, there is a 20% chance you’ll end up back where you started, but an 80% chance of ending up in the Understand state.

(Really, I’m sure there’s a much higher than 80% chance of transitioning to the Understand state.) At their core, MDPs are very simple: from a state there is a set of actions you can take, and after you take an action there is some distribution over the states you can transition into. As in the case of the Don’t Study action, the transition can also be deterministic.
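
As a small sketch, the MDP in the figure can be written down as a transition table and sampled from. The state and action names are just my labels for the diagram; the probabilities are the ones quoted above:

import random

# The Study / Don't Study MDP from the figure as a transition table.
transitions = {
    ("dont_understand", "study"):      {"understand": 0.8, "dont_understand": 0.2},
    ("dont_understand", "dont_study"): {"dont_understand": 1.0},
}

def step(state, action):
    next_states = transitions[(state, action)]
    return random.choices(list(next_states), weights=list(next_states.values()))[0]

print(step("dont_understand", "study"))    # "understand" about 80% of the time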

The goal in reinforcement learning is to learn how to act so as to spend more time in more valuable states. To say which states are valuable, we need to add more information to our MDP: rewards.

You don’t need an MDP to teach you that not eating could make you starve. A reinforcement learning agent might, though.

This MDP has the addition of rewards. Each time you make a transition into a state, you receive a reward. In this example, you get negative rewards for ending up hungry, and a large negative reward for starving. If you become full, however, you receive a positive reward. Now that our MDP is fully formed, we can start thinking about how to act so as to receive the most reward possible!

Since this MDP is very simple, it is easy to see that the way to stay in areas of higher reward is to eat whenever we’re hungry. We don’t have much choice of how to act when we’re full in this model, but we will inevitably end up hungry again and can then immediately choose to eat. Problems that are interesting to solve with reinforcement learning have much larger, much more complex MDPs, and often we don’t know them beforehand but need to learn them through exploration.

Formalizing the Reinforcement Learning Problem

Now that we have many of the building blocks we need, we should look at the terminology used in RL. The most important components are the agent and the environment. An agent exists in some environment, over which it has indirect control. Looking back at our MDPs, the agent can choose which action to take in a given state, which has a significant effect on the states it sees. However, the agent does not completely control the dynamics of the environment. The environment, upon receiving these actions, returns the new state and the reward.

Image from Sutton and Barto – Reinforcement Learning: an Introduction

This image, taken from Sutton and Barto’s book “Reinforcement Learning: An Introduction” (which is highly recommended), explains the agent and environment interactions very well. At some time step t, the agent is in state s_t and takes an action a_t. The environment then responds with a new state s_{t+1} and a reward r_{t+1}. The reward carries the index t+1 because it is returned by the environment together with the state at t+1, so it makes sense to keep them together (as in the image).
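
The interaction loop in the image translates almost directly into code. The sketch below assumes hypothetical `env` and `agent` objects with `reset()`, `step(action)`, and `act(state)` methods; this is not any particular library’s API, just the shape of the loop:

# A sketch of the interaction loop in the image. `env` and `agent` are
# hypothetical objects: env.reset() -> s_0, env.step(a_t) -> (s_{t+1}, r_{t+1}, done),
# and agent.act(s_t) -> a_t are assumed interfaces, not a particular library's API.
def run_episode(env, agent, max_steps=1000):
    state = env.reset()
    total_reward = 0.0
    for t in range(max_steps):
        action = agent.act(state)                     # agent picks a_t from s_t
        next_state, reward, done = env.step(action)   # environment returns s_{t+1}, r_{t+1}
        total_reward += reward
        state = next_state
        if done:                                      # episodic tasks terminate
            break
    return total_reward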


We now have a framework for the reinforcement learning problem, and are ready to start looking at how to go about maximizing our reward. In the next post we will learn about state value functions and action value functions, as well as the Bellman equations, which form the foundation of the algorithms for solving reinforcement learning problems. We will also explore some simple and effective dynamic programming solutions. If you would like to hear things explained in a different way or want to delve a little deeper into the subject, I recommend David Silver’s YouTube lecture series on reinforcement learning, as well as Sutton and Barto’s book “Reinforcement Learning: An Introduction”. Thanks for reading! View part two here.


Image-to-Image Translation: U-Nets and cGANs

Machine learning is everywhere in translation tasks. After all, the Chinese word ‘mao’ and the English word ‘cat’ carry the same meaning, though in different forms, and machine learning is great at learning underlying patterns and representations. This isn’t much different from taking a picture of a cat and then sketching the photo: they are two different representations of the same idea. “Image-to-Image Translation with Conditional Adversarial Networks” [1] by the Berkeley team of Isola, Zhu, Zhou, and Efros tackles this idea and produces fantastic results. In this post I seek to explain their two major contributions, the conditional generative adversarial network (cGAN) and the patch discriminator, as well as their use of U-Nets [2]. To stop this post from getting too long, I assume some prior knowledge in deep learning. If you’d like me to expand on anything, feel free to contact me and request a blog post on it.


A GAN is made up of two parts, a generator network and a discriminator network. The generator G takes a noise vector z as input and tries to create an image. The discriminator D then takes an image as its input and outputs the probability of it being a real image. At the beginning of this process, D is not very good at discriminating between real and fake images, and G is not very good at generating realistic images. As D sees more real images and generated images, it becomes better at discriminating what is real and what is not, and as D improves, G needs to produce better images to trick D. Therefore, over time, G learns to create images similar to those that D has seen during training. This gives us G : z → y, where y is the generated image.

cGANs are a little different. They start with some input x (e.g. a black and white image), map it to a latent representation z, and then reconstruct an image y (which in this case could be a colored image). Thus we have G : {x, z} → y. This is why it is called a conditional GAN: it doesn’t just create realistic images from noise, it creates an image conditioned on some input image. Three examples they use in the paper are decolored images to colored images, sketches to photos, and aerial photos to maps.
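
For reference, the conditional adversarial objective has the standard minimax form, with G trying to minimize the quantity below while D tries to maximize it (this is the usual formulation; see the paper for the exact notation):

    \[\mathcal{L}_{cGAN}(G, D) = \mathbb{E}_{x, y} \big[ \log D(x, y) \big] + \mathbb{E}_{x, z} \big[ \log \big( 1 - D(x, G(x, z)) \big) \big]\]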

When training a cGAN, D is shown a pair of images: either x and the real y, or x and the generated G(x). It is then tasked with learning which pairs are the correct ones. This allows us to create powerful loss functions, which test whether images are visually similar and whether one is a plausible translation of the other. This is hard to achieve with L1 or L2 loss alone, as they assume every pixel is independent and simply minimize a pixel-by-pixel loss. The discriminator loss looks more at which textures are likely to be in the image and at what position. In the paper, Isola et al. actually use a mix of discriminator loss and L1 loss. In my experiments, adding a weighted L1 term to the loss made it significantly easier for the generator to train. I hope to upload my TensorFlow code to my GitHub (github.com/joshgreaves) within the next week.
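
Here is a rough sketch of what that mixed generator loss can look like in TensorFlow. The placeholders stand in for tensors you would already have in your graph (the discriminator’s logits for the generated pair, the generated image, and the ground-truth image), and the shapes and L1 weight are assumptions rather than the paper’s exact settings:

import tensorflow as tf

# A rough sketch of the mixed generator loss. The placeholders stand in for
# tensors you would already have in your graph; the shapes and the L1 weight
# are assumptions for illustration.
d_logits_fake = tf.placeholder(tf.float32, [None, 1])         # D's logits for (x, G(x))
generated = tf.placeholder(tf.float32, [None, 256, 256, 3])   # G(x)
target = tf.placeholder(tf.float32, [None, 256, 256, 3])      # the real y

l1_weight = 100.0                                              # assumed weighting
gan_loss = tf.reduce_mean(
    tf.nn.sigmoid_cross_entropy_with_logits(
        logits=d_logits_fake, labels=tf.ones_like(d_logits_fake)))
l1_loss = tf.reduce_mean(tf.abs(target - generated))
g_loss = gan_loss + l1_weight * l1_loss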


For how simple U-Nets are, they are remarkably powerful in image translation. U-Nets are very similar to autoencoders; they start with an input x, reduce the dimensionality over several fully connected or convolutional layers to create a code z, and then decode back to y. However, U-Nets differ from autoencoders in their use of “skip connections”. These skip connections connect the corresponding layers in the encoder and decoder: if we label the layers in the encoder as e_1, e_2, …, e_n, and the layers in the decoder as d_1, d_2, …, d_n, then there are connections between e_i and d_{n-i}.

The skip connections in a U-Net allow the network to learn what it wants from each layer in the encoder network. If it is important for the network to know of specific pixel values and simple lines and textures, it can draw from the information gathered in the earlier layers of the encoder. If it is useful to have a higher level of abstraction, such as what it is the image contains, it can draw from the later layers. This is very useful in image translation, as the different translations frequently share the outlines and important details between the inputs and the outputs.

The skip connections in the paper appear to be implemented by concatenating the corresponding layers together. This can be achieved easily in TensorFlow with the following:

tf.concat(3, [e[i], d[n - i]])

Here we use 3 as the first parameter as we concatenate the layers in the channels (or filter number) dimension, since we have tensors of shape [batch, height, width, channels].

Patch Discriminator

Generally, people have used discriminators that look at an entire image and decide whether it is real or fake. Here, the authors of the paper break the image into patches and ask the discriminator whether each patch is real or fake. Afterwards, they average the results of discriminating each patch to get the overall probability of the image being real. This is incredibly useful, as it allows the discriminator to work much more easily on images of varying sizes; the discriminator doesn’t need to be designed to fit just one specified image size.

It appears tf.extract_image_patches doesn’t have a gradient defined for the operation yet, which means that back-prop won’t work if it is in your computation graph. However, there is a simple workaround using TensorFlow operations which do have gradients defined:

import numpy as np
import tensorflow as tf

def get_patches(input, patch_dim):
    # input has shape [batch, height, width, channels]
    in_filters = input.get_shape().as_list()[3]
    out_filters = patch_dim**2 * in_filters
    # Identity filter: each pixel/channel in a patch_dim x patch_dim window
    # is copied into its own output channel
    filter = tf.constant(
        np.eye(out_filters).reshape(patch_dim, patch_dim, in_filters, out_filters),
        tf.float32)
    # Stride by patch_dim in both spatial dimensions so patches don't overlap
    return tf.nn.conv2d(input, filter, [1, patch_dim, patch_dim, 1], "VALID")

The result of this function is a tensor of shape [batch, num_patches_h, num_patches_w, channels * patch_height * patch_width], where each spatial position holds one flattened patch. It can then be reshaped:

tf.reshape(patches, [-1, num_patches, patch_height, patch_width, channels])

After the reshape you can run tf.nn.conv3d over the patches, which allows you to convolve over each patch separately but with the same filters. After several conv3d operations, you can reduce the dimensions to [batch, num_patches, 1, 1, 1], then take the mean of the probabilities across the patches, and finally the mean across the batch for the total loss.
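
A sketch of that final averaging step, assuming a hypothetical `patch_logits` tensor of shape [batch, num_patches, 1, 1, 1] produced by the conv3d stack (16 patches here is an arbitrary choice):

import tensorflow as tf

# A sketch of the final averaging step. `patch_logits` is a hypothetical tensor
# produced by the conv3d stack; 16 patches is an arbitrary choice.
patch_logits = tf.placeholder(tf.float32, [None, 16, 1, 1, 1])
patch_probs = tf.nn.sigmoid(patch_logits)                      # per-patch "real" probability
per_image = tf.reduce_mean(patch_probs, axis=[1, 2, 3, 4])     # mean over patches -> [batch]
mean_prob = tf.reduce_mean(per_image)                          # mean over the batch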


While the three ideas presented in this post are incremental improvements on already existing ideas, they are a great step towards designing better, more descriptive loss functions. If you want to implement these ideas yourself, please check out the paper for yourself. As I mentioned earlier, I hope to post my code on my GitHub soon. On my GitHub I also have an image sketcher which will run a sketch filter over a large number of images at once, so you can create your own sketch-photo pair dataset with your own photos, or take advantage of the many great datasets out there.

Also, this is my first proper blog post on machine learning, and I would love to hear your feedback if you read it! I’m still building the website, and getting used to presenting machine learning ideas. If you have any suggestions, comments, or requests please let me know. You can contact me at my email address on the “contact” page.

Thank you for reading, and I hope this helped!
