## Developing Reinforcement Learning from the Bellman Equation

Note: This tutorial uses a functional notation, e.g., equations using $V_x$ indicate the relationship holds true for the function $V(x)$ at all $x$. If you don't see equations on this page (or you seem them in their raw Latex ugliness), enable Javascript. If that doesn't work, get a better web browser. This tutorial is intended for a reader who already knows a few things about optimal control and reinforcement learning, not for a reader who knows nothing about either. Your mileage may vary.

***

The main purpose of this document is to combat the inexplicable mysticism that often surrounds reinforcement learning methods among a significant number of people who should be able to easily draw the connection to their optimal control background. The principles between the two disciplines are very much the same but, exceptions aside, surprisingly few techniques seem to cross between artificial intelligence and control communities. Sutton's book is a good introduction to reinforcement learning, but was written during the early years of reinforcement learning and does not give the reader an up-to-date description of techniques in the field. If one understands optimal control well, one can get a tangible sense of the connection between these two fields while reading this book. The intent of this tutorial is to give a rudimentary approximation of this understanding by demonstrating how the $Q$-learning algorithm can be developed directly from the Bellman equation.

Consider the Bellman equation for a discrete-time deterministic system with process model $f$. Letting $V$ denote the value function over the state $x$, the Bellman equation is $V_x = \min_u \left[ r_{xu} + \gamma V_f \right]$ where the subscript $f$ on $V_f$ indicates the value at the state that will be reached under process model $f$. The value function encodes the utility at being at state $x$ for every state, i.e., the sum of rewards the system will receive starting from $x$ and using the best possible choice of control $u$ at every future time step.

An analogous idea is a function that encodes the utility of being at $x$ and using the specific control $u$. This is the $Q$ function. It can be defined recursively similarly to the value function. \begin{align} Q_{xu} &= r_{xu} + \gamma V_f \\ &= r_{xu} + \gamma \min_{u'} Q_{fu'} \end{align}

You are already familiar with one iterative method to compute the value function (or a function converging to it) based on bootstrapping; value iteration. Bootstrapping is a somewhat foreign term in control theory, but it simply refers to choosing our estimate of $V$ at $x$ based on our (bootstrapped) estimate of $V$ at $f(x,u)$. The algorithm is $\hat{V}_x \leftarrow \min_u \left[ r_{xu} + \gamma \hat{V}_f \right]$ with the hat notation implying a computation of an approximation of $V$. The arrow implies specification of an algorithm, where equality is due to explicitly choosing the quantity on the left to be identical to the quantity on the right.

While the functions $V$ and $Q$ have nearly identical recursive definitions, value iteration and $Q$-learning are significantly different algorithms. The reason is simply due to these two algorithms being solutions to two different problems with differing assumptions. For value iteration (solving the optimal control problem), we assume full knowledge of the process model and reward function. For $Q$-learning (solving the reinforcement learning problem), we assume no knowledge of process model or reward function.

What does this practically mean? Note that in the value iteration algorithm we set the left-hand side to be equal to a quantity that depends on a sum of an a priori known reward and the value at an a priori known destination state. (In the stochastic case, we assume full knowledge of a probabilistic model and can compute expectations of these quantities based on this.) Without a priori knowledge of these two functions, this algorithm is impossible.

The reinforcement learning problem statement requires only that

• there exists a memoryless function $f$,
• there exists a memoryless function $r_{xu}$, and
• at every stage we provide a control $u$ at $x$ and receive a report that we have transitioned to $x'$ and received reward $r$.
Thus, we must construct an algorithm that works without additional a priori information.

Begin by re-arranging the $Q$-function. $r_{xu} = Q_{xu} - \gamma \min_{u'} Q_{fu'}$ This can be interpreted as stating that the reward being received the current step is equal to the sum total of rewards to be received from this time step forward minus the sum total of rewards to be received one time step in the future going forward. This is obviously true, and is the exact same optimality principle driving the Bellman equation. If we do the same thing with an estimate of $Q$, $\hat{Q}$, we'll get an estimate of the reward we should receive at the next stage. $\hat{r}_{xu} = \hat{Q}_{xu} - \gamma \min_{u'} \hat{Q}_{fu'}$ If our estimate of $Q$ is bad, it is likely that so too will be our estimate of the next stage reward.

To improve on $\hat{Q}$, we can employ gradient descent using the reward error signal. \begin{align} \hat{Q}_{xu} &\leftarrow \hat{Q}_{xu} + \alpha \left[r_{xu} - \hat{r}_{xu}\right] \\ &= \hat{Q}_{xu} + \alpha \left[r_{xu} - \left(\hat{Q}_{xu} - \gamma \min_{u'} \hat{Q}_{fu'} \right)\right] \end{align} This is the famous algorithm that you likely recognize as $Q$-learning. Note that this algorithm has no dependence on process model or reward function. The next iterate of $\hat{Q}$ requires only the immediate reward and some quantities based on our estimate of $\hat{Q}$. It is worth noting that this error signal methodology is the exact same approach we use to construct state observers.

The parameter $\alpha$ is the learning rate, which is also known as rate of descent in optimization theory. The usual discussions on rate of convergence and overshoot apply here. The most elegant way to understand the learning rate can be given in one line. Rewrite the right-hand side of the $Q$-learning algorithm by grouping the $\hat{Q}_{xu}$ terms. $\hat{Q} \leftarrow (1-\alpha) \hat{Q}_{xu} - \alpha \left(r_{xu} + \gamma \min_{u'} \hat{Q}_{fu'} \right)$ What this says is that we are making our next guess at $Q$ from a linear combination of our current guess and a guess based on the new reward information we've just collected (bootstrapped from our current guess, of course). How much we weight what we just saw is our learning rate. A high learning rate will ensure the algorithm assimilates new information quickly, but may take a long time to converge to a reasonable understanding of the overall situation.

Of course, the functional $Q$-learning algorithm I've given is somewhat of a lie because we only have an input (control) and output (state, reward) at one point in the environment. Thus, the canonical point-wise form. $\hat{Q}(x,u) \leftarrow \hat{Q}(x,u) + \alpha \left[ r(x,u) + \gamma \min_{u'} \hat{Q}(f(x,u), u') - \hat{Q}(x,u) \right]$ Because of this information capacity limit, it is impossible to update the $Q$ function faster than at one point per time step. (Or is it...?)

The algorithm is guaranteed to converge, i.e., $\hat{Q} \rightarrow Q$ point-wise, given the modification that we choose $\alpha$ as a function that converges to zero. Convergence at the right rate is important, so $\alpha(i) = \frac{1}{1+i}$ where $i$ is the number of time steps is good and $\alpha(i) = 0$ is bad. Of course, there are a few technical conditions that I'll omit in this discussion.

The final step of this tutorial fulfills my promise to develop $Q$-learning for stochastic systems. In order to accomplish this, we need to do... nothing. Remember we assumed no knowledge of $f$ and $r$? We also assumed no knowledge of the form of $f$ and $r$ other than being memoryless. This translates to $f$ and $r$ being Markov in the stochastic case. Under that assumption, we're good to go with the current algorithm. Of course, we need to take additional steps to prove/ensure convergence including additional details on choice of $\alpha$ and a lot of "infinitely often"/persistence of excitation conditions (depending on your choice of vernacular). I am certain these details are in the paper you were reading before you came here, so I'll send you back there for the intricate details.

Good luck and any time you build something that might become Skynet, please add a kill switch.