Everything You Need to Know to Get Started in Reinforcement Learning

By the end of this two-part series you will know all the basic theory required to understand how reinforcement learning algorithms work.

In two posts I will share the most important elements of about 85 pages of a reinforcement learning textbook. RL is an extremely useful tool to have in any machine learning practitioner’s toolkit, and these posts are designed as a primer on the foundations of reinforcement learning so that you can get into implementing the latest models as quickly as possible. Of course, for a more thorough treatment of the subject, I recommend picking up the textbook “Reinforcement Learning: An Introduction” by Sutton and Barto, but this post will attempt to give a quick, intuitive grounding in the theory behind reinforcement learning.

Supervised vs Evaluative Learning

For many problems of interest, the paradigm of supervised learning doesn’t give us the flexibility we need. The main difference between supervised learning and reinforcement learning is whether the feedback received is instructive or evaluative. Instructive feedback tells you how to achieve your goal, while evaluative feedback tells you how well you achieved your goal. Supervised learning solves problems based on instructive feedback, and reinforcement learning solves them based on evaluative feedback. Image classification is an example of a supervised problem with instructive feedback: when the algorithm attempts to classify a certain piece of data, it is told what the true class is. Evaluative feedback, on the other hand, merely gives you a score. If you were training a classifier using evaluative feedback, your classifier might say “I think this is a hamster”, and in return it would receive 50 points. Without any more context, we really don’t know what 50 points means. We would need to make other classifications and explore to find out whether 50 points means that we were accurate or not. Maybe 10,000 is a more respectable score, but we just don’t know until we attempt to classify some other data points.

Two gold stars and a smiley face for guessing hamster. If you guessed gerbil you can have a silver star and half a thumbs up.
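To make the distinction concrete, here is a minimal Python sketch of the two kinds of feedback. The score table is made up for illustration (the post only mentions 50 points for guessing hamster), and the function names are hypothetical.

```python
# Hypothetical score table: only the 50 points for "hamster" comes from the
# text above; the other values are assumptions made up for this sketch.
HIDDEN_SCORES = {"hamster": 50, "gerbil": 20, "cat": 0}


def instructive_feedback(guess, true_label="hamster"):
    """Supervised-style feedback: the learner is told the correct answer."""
    return true_label


def evaluative_feedback(guess):
    """RL-style feedback: the learner only receives a score for its guess."""
    return HIDDEN_SCORES[guess]


def explore(guesses):
    """Try every guess once and keep the one that scored best."""
    scores = {guess: evaluative_feedback(guess) for guess in guesses}
    best = max(scores, key=scores.get)
    return best, scores


best_guess, observed = explore(["cat", "gerbil", "hamster"])
print(best_guess, observed)
```

With instructive feedback a single wrong guess is enough to learn the right label. With evaluative feedback, the 50 points for hamster only look good once the learner has tried the other guesses and seen their scores.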

In many problems of interest, the idea of evaluative feedback is much more intuitive and accessible. For example, imagine a system that controls the temperature in a data center. Instructive feedback doesn’t seem to make much sense here: how do you tell your algorithm what the correct setting of each component is at any given time step? Evaluative feedback makes much more sense. You could easily feed back data such as how much electricity was used over a certain time period, what the average temperature was, or even how many machines overheated. This is actually how Google tackles the problem, with reinforcement learning. So let’s jump straight into it.

Markov Decision Processes

A state s is said to be Markov if the future from that state is conditionally independent of the past given that we know s. This means that s summarizes everything relevant about the past states up to the current one. If that doesn’t make much sense, it is much easier to see it by example. Consider a ball flying through the air. If its state is its position and velocity, that is sufficient to describe where it will go (given a physics model and that there are no outside influences). Therefore, the state has the Markov property. However, if we only knew the position of the ball but not its velocity, its state would no longer be Markov. The current state doesn’t summarize all past states; we need information from the previous time step to start building a proper model of the ball.
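The ball example can be sketched in a few lines of Python. This is only an illustration under simple assumptions: one-dimensional motion, constant gravity, no drag, and an arbitrary time step. The names GRAVITY, DT and step are my own.

```python
GRAVITY = -9.81  # m/s^2, assumed constant, with no drag or other outside influences
DT = 0.1         # time step in seconds (an arbitrary choice for this sketch)


def step(state):
    """Advance a (height, vertical velocity) state by one time step."""
    height, velocity = state
    new_height = height + velocity * DT
    new_velocity = velocity + GRAVITY * DT
    return (new_height, new_velocity)


# Two balls at exactly the same height, one rising and one falling.
rising = (10.0, 5.0)
falling = (10.0, -5.0)

next_rising = step(rising)
next_falling = step(falling)
print(next_rising, next_falling)
```

Both balls start at the same position, yet they end up in different places. Position alone cannot predict the future, so it is not a Markov state, while the (position, velocity) pair is all that step needs.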

Reinforcement learning is most often modeled as a Markov Decision Process, or MDP. An MDP can be drawn as a directed graph whose nodes are Markov states and whose edges describe the transitions between them. Here is a simple example:


A simple MDP for learning about MDPs.

This MDP shows the process of learning about MDPs. At first you are in the state Don’t understand. From there, you have two possible actions, Study or Don’t Study. If you choose to not study, there is a 100% chance that you will end up back in the Don’t understand state. However, if you study, there is a 20% chance you’ll end up back where you started, but an 80% chance of ending up in the Understand state.

Really, I’m sure there’s a much higher than 80% chance of transitioning to the Understand state. At their core, MDPs are really very simple. From a state there is a set of actions you can take. After you take an action, there is some distribution over which states you can transition into. As in the case of the Don’t Study action, the transition could very well be deterministic too.
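One way to write this MDP down in code is as a nested dictionary of transition probabilities. The sketch below is just one possible encoding. The probabilities are the ones quoted above, while the layout, the name sample_next_state, and treating Understand as a state with no further actions are my own assumptions.

```python
import random

# TRANSITIONS[state][action] is a list of (next_state, probability) pairs.
TRANSITIONS = {
    "Don't understand": {
        "Study": [("Don't understand", 0.2), ("Understand", 0.8)],
        "Don't Study": [("Don't understand", 1.0)],  # deterministic transition
    },
    # Assumption: no actions are defined once you understand.
    "Understand": {},
}


def sample_next_state(state, action, rng=random):
    """Sample the next state from the distribution for (state, action)."""
    outcomes = TRANSITIONS[state][action]
    next_states = [next_state for next_state, _ in outcomes]
    probabilities = [probability for _, probability in outcomes]
    return rng.choices(next_states, weights=probabilities, k=1)[0]


print(sample_next_state("Don't understand", "Study"))
```

Calling sample_next_state repeatedly with Study should land in Understand roughly four times out of five, while Don’t Study always returns Don’t understand.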

The goal in reinforcement learning is to learn how to act to spend more time in more valuable states. To have a valuable state we need more information in our MDP.

You don’t need an MDP to teach you that not eating could make you starve. A reinforcement learning agent might, though.

This MDP has the addition of rewards. Each time you make a transition into a state, you receive a reward. In this example, you get negative rewards for ending up hungry, and a large negative reward for starving. If you become full, however, you receive a positive reward. Now that our MDP is fully formed, we’re able to start thinking about how to choose actions to receive the most reward we can!

Since this MDP is very simple, it is easy to see that the way to stay in areas of higher reward is to eat whenever we’re hungry. We don’t have much choice of how to act when we’re full in this model, but we will inevitably end up hungry again and can immediately choose to eat. Problems that are of interest to solve with reinforcement learning have much larger, much more complex MDPs, and often we don’t know them beforehand but need to learn them from exploration.
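To see why eating when hungry wins, we can add up rewards under two different ways of acting. The sketch below is a simplified stand-in for the MDP described above. Every reward value, the deterministic transitions, and the action names are assumptions chosen for illustration, since the exact numbers from the figure are not reproduced here.

```python
# MODEL[(state, action)] = (next_state, reward received on entering next_state).
# All numbers and the deterministic transitions are assumed for this sketch.
MODEL = {
    ("Hungry", "Eat"): ("Full", 2),
    ("Hungry", "Don't eat"): ("Starving", -10),
    ("Full", "Wait"): ("Hungry", -1),
    ("Starving", "Eat"): ("Full", 2),
    ("Starving", "Don't eat"): ("Starving", -10),
}


def eat_when_hungry(state):
    """Policy: eat whenever we are not full."""
    return "Wait" if state == "Full" else "Eat"


def never_eat(state):
    """Policy: refuse to eat no matter what."""
    return "Wait" if state == "Full" else "Don't eat"


def total_reward(policy, start="Hungry", steps=10):
    """Follow a policy for a fixed number of steps and sum the rewards."""
    state, total = start, 0
    for _ in range(steps):
        state, reward = MODEL[(state, policy(state))]
        total += reward
    return total


print(total_reward(eat_when_hungry), total_reward(never_eat))
```

A function from states to actions like eat_when_hungry is usually called a policy. Under these assumed numbers it collects far more reward than never_eat, which matches the intuition above.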

Formalizing the Reinforcement Learning Problem

Now that we have many of the building blocks we need, we should look at the terminology used in RL. The most important components are the agent and the environment. An agent exists in some environment which it has indirect control over. Looking back at our MDPs, we can see that the agent chooses which action to take in a given state, which has a significant effect on the states it sees. However, the agent does not control the dynamics of the environment completely. The environment, upon receiving these actions, returns the new state and the reward.

Image from Sutton and Barto, “Reinforcement Learning: An Introduction”

This image, taken from Sutton and Barto’s book “Reinforcement Learning: An Introduction” (which is highly recommended), explains the agent and environment interactions very well. At some time step t, the agent is in state s_t and takes an action a_t. The environment then responds with a new state s_{t+1} and a reward r_{t+1}. The reward is indexed at t+1 because it is returned by the environment together with the state at t+1, so it makes sense to keep them together (as in the image).
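The interaction loop can be sketched in Python as follows. This is a toy illustration, not a standard API: the StudyEnvironment class, its step method, and the reward of 1 for reaching Understand are all assumptions I have added on top of the study MDP from earlier.

```python
import random


class StudyEnvironment:
    """Hypothetical environment based on the study MDP from earlier."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.state = "Don't understand"

    def step(self, action):
        """Take action a_t and return the pair (s_{t+1}, r_{t+1})."""
        if (self.state == "Don't understand" and action == "Study"
                and self.rng.random() < 0.8):
            self.state = "Understand"
        # Assumed reward: 1 for being in Understand, 0 otherwise.
        reward = 1.0 if self.state == "Understand" else 0.0
        return self.state, reward


def always_study(state):
    """A trivial agent that studies in every state."""
    return "Study"


def never_study(state):
    """A trivial agent that never studies."""
    return "Don't Study"


def run_episode(env, policy, max_steps=20):
    """Run the agent-environment loop and record each transition."""
    state = env.state
    history = []
    for _ in range(max_steps):
        action = policy(state)                 # agent picks a_t from s_t
        next_state, reward = env.step(action)  # environment returns s_{t+1}, r_{t+1}
        history.append((state, action, reward, next_state))
        state = next_state
        if state == "Understand":
            break
    return history


for transition in run_episode(StudyEnvironment(seed=0), always_study):
    print(transition)
```

Notice that the reward and the next state come back from the same call to step, which is why they share the index t+1.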


We now have a framework for the reinforcement learning problem, and are ready to start looking at how to go about maximizing our reward. In the next post we will learn about state value functions and action value functions, as well as the Bellman equations, which form the foundation of the algorithms for solving reinforcement learning problems. We will also explore some simple and effective dynamic programming solutions. If you would like to hear things explained in a different way or want to delve a little deeper into the subject, I recommend David Silver’s YouTube lecture series on reinforcement learning, as well as Sutton and Barto’s book “Reinforcement Learning: An Introduction”. Thanks for reading! View part two here.


