# Underactuated Robotics

Algorithms for Walking, Running, Swimming, Flying, and Manipulation

Russ Tedrake

How to cite these notes, use annotations, and give feedback.

Note: These are working notes used for a course being taught at MIT. They will be updated throughout the Spring 2024 semester. Lecture videos are available on YouTube.

# Dynamic Programming

In chapter 2, we spent some time thinking about the phase portrait of the simple pendulum, and concluded with a challenge: can we design a nonlinear controller to reshape the phase portrait, with a very modest amount of actuation, so that the upright fixed point becomes globally stable? With enough actuators and unbounded torque, feedback-cancellation solutions (e.g., invert gravity) can work well, but can also require an unnecessarily large amount of control effort. The energy-based swing-up control solutions presented for the Acrobot and cart-pole systems are considerably more appealing, but required some cleverness and might not scale to more complicated systems. Here we investigate another approach to the problem, using computational optimal control to synthesize a feedback controller directly.

# Formulating control design as an optimization

In this chapter, we will introduce optimal control. Reinforcement learning is the machine learning name for optimal control; we'll discuss that machine learning perspective later in the notes. Optimal control is powerful for a number of reasons. First and foremost, it is very general - allowing us to specify the goal of control equally well for fully-actuated or underactuated, linear or nonlinear, deterministic or stochastic, and continuous or discrete systems. Second, it permits concise descriptions of potentially very complex desired behaviors, specifying the goal of control as a scalar objective (plus a list of constraints). Finally, and most importantly, optimal control is very amenable to numerical solutions. Bertsekas00a is a fantastic reference on this material for those who want a somewhat rigorous treatment; Sutton18 is an excellent (free) reference for those who want something more approachable.

The fundamental idea in optimal control is to formulate the goal of control as the long-term optimization of a scalar cost function. Let's introduce the basic concepts by considering a system that is even simpler than the simple pendulum.

# Optimal Control Formulations for the Double Integrator

Consider the "double integrator" system $$\ddot{q} = u, \quad |u| \le 1.$$ If you would like a mechanical analog of the system (I always do), then you can think about this as a unit mass brick moving along the $x$-axis on a frictionless surface, with a control input which provides a horizontal force, $u$. The task is to design a control system, $u = \pi(\bx,t)$, $\bx=[q,\dot{q}]^T$ to regulate this brick to $\bx = [0,0]^T$.

In order to formulate this control design problem using optimal control, we must define a scalar objective which scores the long-term performance of running each candidate control policy, $\pi(\bx,t)$, from each initial condition, $(\bx_0,t_0)$, and a list of constraints that must be satisfied. For the task of driving the double integrator to the origin, one could imagine a number of optimal control formulations which would accomplish the task, e.g.:

• Minimum time: $\min_\pi t_f,$ subject to $\bx(t_0) = \bx_0,$ $\bx(t_f) = {\bf 0}.$
• Quadratic cost: $\min_\pi \int_{t_0}^{\infty} \left[ \bx^T(t) {\bf Q} \bx(t) \right] dt,$ ${\bf Q}\succ0$.
The first is a constrained optimization formulation which optimizes time, and the second accrues a penalty at every instance according to the distance that the state is away from the origin (with weights defined by the matrix ${\bf Q}$), and therefore encourages trajectories that go more directly towards the goal, possibly at the expense of requiring a longer time to reach the goal. In fact the quadratic cost will result in an exponential approach to the goal, where as the minimum-time formulation will arrive at the goal in finite time. Note that both optimization problems only have well-defined solutions if it is possible for the system to actually reach the origin, otherwise the minimum-time problem cannot satisfy the terminal constraint, and the integral in the quadratic cost would not converge to a finite value as time approaches infinity (fortunately the double-integrator system is controllable, and therefore can be driven to the goal in finite time).

Note that the input limits, $|u|\le1$ are also required to make this problem well-posed; otherwise both optimizations would result in the optimal policy using infinite control input to approach the goal infinitely fast. Besides input limits, another common approach to limiting the control effort is to add an additional quadratic cost on the input (or "effort"), e.g. $\int \left[ \bu^T(t) {\bf R} \bu(t) \right] dt,$ ${\bf R}\succ0$. This could be added to either formulation above. We will examine many of these formulations in some detail in the examples worked out at the end of this chapter.

Optimal control has a long history in robotics. It started out motivated by achieving peak performance in niche applications like programming minimum-time paths for a pick-and-place robot in a factory. But it is the generality of the approach (one can program almost any task using similar tools) that has made optimization-based and now machine-learning-based control dominate the way that state-of-the-art robot control systems are developed.

# The minimum time problem for the double integrator

For more intuition, let's do an informal derivation of the solution to the minimum time problem for the double integrator with input constraints: \begin{align*} \minimize_{\pi} \quad & t_f\\ \subjto \quad & \bx(t_0) = \bx_0, \\ & \bx(t_f) = {\bf 0}, \\ & \ddot{q}(t) = u(t), \\ & |u(t)| \le 1. \end{align*} What behavior would you expect an optimal controller to exhibit?

Your intuition might tell you that the best thing that the brick can do, to reach the goal in minimum time with limited control input, is to accelerate maximally towards the goal until reaching a critical point, then hitting the brakes in order to come to a stop exactly at the goal. This is called a bang-bang control policy; these are often optimal for systems with bounded input, and it is in fact optimal for the double integrator although we will not prove it until we have developed more tools.

Let's work out the details of this bang-bang policy. First, we can figure out the states from which, when the brakes are fully applied, the system comes to rest precisely at the origin. Let's start with the case where $q(0) < 0$, and $\dot{q}(0)>0$, and "hitting the brakes" implies that $u=-1.$ Integrating the equations, we have \begin{gather*} \ddot{q}(t) = u = -1, \\\dot{q}(t) = \dot{q}(0) - t, \\ q(t) = q(0) + \dot{q}(0) t - \frac{1}{2} t^2. \end{gather*} Substituting $t = \dot{q}(0) - \dot{q}$ into the solution reveals that the system trajectories are parabolic arcs: $q = -\frac{1}{2} \dot{q}^2 + c_{-},$ with $c_{-} = q(0) + \frac{1}{2}\dot{q}^2(0)$.

Similarly, the solutions for $u=1$ are $q = \frac{1}{2} \dot{q}^2 + c_{+}$, with $c_{+}=q(0)-\frac{1}{2}\dot{q}^2(0)$.

Perhaps the most important of these trajectories are the ones that pass directly through the origin (e.g., $c_{-}=0$). Following our initial logic, if the system is going slower than this $\dot{q}$ for any $q$, then the optimal thing to do is to slam on the accelerator ($u=-\text{sgn}(q)$). If it's going faster than the $\dot{q}$ that we've solved for, then still the best thing to do is to brake; but inevitably the system will overshoot the origin and have to come back. We can summarize this policy with: $u = \begin{cases} +1, & \text{if } (\dot{q} < 0 \text{ and } q \le \frac{1}{2} \dot{q}^2) \text{ or } (\dot{q}\ge 0 \text{ and } q < -\frac{1}{2} \dot{q}^2), \\ 0, & \text{if } q=0 \text{ and } \dot{q}=0, \text{ or} \\ -1, & \text{otherwise.} \end{cases}$

We can illustrate some of the optimal solution trajectories:

And for completeness, we can compute the optimal time to the goal by solving for the amount of time required to reach the switching surface plus the amount of time spent moving along the switching surface to the goal. With a little algebra, you will find that the time to the goal, $J(\bx)$, is given by $J(\bx) = \begin{cases} 2\sqrt{\frac{1}{2}\dot{q}^2-q} - \dot{q}, & \text{for } u=+1 \text{ regime}, \\ 0, & \text{for } u=0, \\ \dot{q} + 2\sqrt{\frac{1}{2}\dot{q}^2+q}, & \text{for } u=-1, \end{cases}$ which is plotted here:

Notice that the "time-to-the-origin function" is continuous over state (though not smooth), even though the policy is discontinuous.

As we begin to develop theoretical and algorithmic tools for optimal control, we will see that some formulations are much easier to deal with than others. One important example is the dramatic simplification that can come from formulating objective functions using additive cost, because they often yield recursive solutions. In the additive cost formulation, the long-term objective for a trajectory can be written as $$\int_0^T \ell(x(t),u(t)) dt,$$ where $\ell()$ is the instantaneous cost (also referred to as the "running cost"), and $T$ can be either a finite real number or $\infty$. We will call a problem specification with a finite $T$ a "finite-horizon" problem, and $T=\infty$ an "infinite-horizon" problem. Problems and solutions for infinite-horizon problems tend to be more elegant, but care is required to make sure that the integral converges for the optimal controller (typically by having an achievable goal that allows the robot to accrue zero cost, or by having costs that discount/decay to zero as time increases).

The "quadratic cost" formulation we considered in the double integrator example above is clearly written as an additive cost. At first glance, our "minimum-time" formulation doesn't appear to be of this form, but we actually can write it as an additive cost problem using an infinite horizon and the instantaneous cost $$\ell(x,u) = \begin{cases} 0 & \text{if } x=0, \\ 1 & \text{otherwise.} \end{cases}$$

We will examine a number of approaches to solving optimal control problems throughout the next few chapters. For the remainder of this chapter, we will focus on additive-cost problems and their solution via dynamic programming.

# Continuous dynamic programming

I find the graph search algorithm extremely satisfying as a first step, but also become quickly frustrated by the limitations of the discretization required to use it. In many cases, we can do better; coming up with algorithms which work more natively on continuous dynamical systems. We'll explore those extensions in this section.

# The Hamilton-Jacobi-Bellman Equation

It's important to understand that the value iteration equations, equations (\ref{eq:value_update}) and (\ref{eq:policy_update}), are more than just an algorithm. They are also sufficient conditions for optimality: if we can produce a $J^*$ and $\pi^*$ which satisfy these equations, then $\pi^*$ must be an optimal controller. There are an analogous set of conditions for the continuous systems. For a system $\dot{\bx} = f(\bx,\bu)$ and an infinite-horizon additive cost $\int_0^\infty \ell(\bx,\bu)dt,$ we have: \begin{gather} 0 = \min_\bu \left[ \ell(\bx,\bu) + \pd{J^*}{\bx}f(\bx,\bu) \right], \label{eq:HJB} \\ \pi^*(\bx) = \argmin_\bu \left[ \ell(\bx,\bu) + \pd{J^*}{\bx}f(\bx,\bu) \right]. \end{gather} Equation \ref{eq:HJB} is known as the Hamilton-Jacobi-Bellman (HJB) equation. Technically, a Hamilton-Jacobi equation is a PDE whose time derivative depends on the first-order partial derivatives over state, which we get in the finite-time deriviation; Eq \ref{eq:HJB} is the steady-state solution of the Hamilton-Jacobi equation. Bertsekas05 gives an informal derivation of these equations as the limit of a discrete-time approximation of the dynamics, and also gives the famous "sufficiency theorem". The treatment in Bertsekas05 is for the finite-horizon case; I have modified it to one particular form of an infinite-horizon case here.

# HJB Sufficiency Theorem (stated informally)

Consider a system $\dot{\bx}=f(\bx,\bu)$ and an infinite-horizon additive cost $\int_0^\infty \ell(\bx,\bu)dt$, with $f$ and $\ell$ continuous functions, and $\ell$ a strictly-positive-definite function that obtains zero only at a unique state $\bx^*$. Suppose $J(\bx)$ is a solution to the HJB equation: $J$ is continuously differentiable in $\bx$ and satisfies $0 = \min_{\bu \in U} \left[\ell(\bx,\bu) + \pd{J}{\bx}f(\bx,\bu) \right],\quad \text{for all } \bx.$ Further assume that $J(\bx)$ is positive definite and that $\pi(\bx)$ is the minimizer for all $\bx$. Then, under some technical conditions on the existence and boundedness of solutions, we have that $J(\bx) - J(\bx^*)$ is the optimal cost-to-go and $\pi$ is an optimal policy.

Here is a more formal version from a personal communication with Sasha Megretski.

Given an open subset $\Omega\subset\mathbb R^n$, with a selected element $x^*$, a subset $U\subset\mathbb R^m$, continuous functions $f:~\Omega\times U\to\mathbb R^n$, $g:~\Omega\times U\to[0,\infty)$, continuously differentiable function $V:~\Omega\to[0,\infty)$, and a function $\mu:~\Omega\to U$, such that

1. $g(x,u)$ is strictly positive definite, in the sense that $\lim_{k\to\infty}x_k=x^*$ for every sequence of vectors $x_k\in\Omega$, $u_k\in U$ ($k=1,2,\dots$) such that $\lim_{k\to\infty}g(x_k,u_k)=0$;
2. the function $g_V:~\Omega\times U\to\mathbb R$, defined by $$\label{e0} g_V(x,u)=g(x,u)+\frac{dV(x)}{dx}f(x,u),$$ is non-negative, and achieves zero value whenever $u=\mu(x)$;
3. for every $x_0\in\Omega$, the system of equations $$\label{e1} \dot x(t)=f(x(t),u(t)),~x(0)=x_0~\left(\text{i.e.,}~ x(t)=x_0+\int_0^tf(x(\tau),u(\tau)d\tau\right)$$ $$\label{e2} u(t)=\mu(x(t)),$$ has at least one piecewise continuous solution $x:~[0,\infty)\to\Omega$, $u:~[0,\infty)\to U$.
Then, for every $x_0\in\Omega$, the functional $J(x,u)=\int_0^\infty g(x(t),u(t))dt,$ defined on the set of all piecewise continuous pairs $(x,u)$ satisfying (\ref{e1}), achieves its minimal value of $V(x_0)-V(x^*)$ at every piecewise continuous solution of (\ref{e1}), (\ref{e2}).

Proof. First, observe that, for all piecewise continuous $(x,u)$ satisfying (\ref{e1}),

1. due to (\ref{e0}), $$\label{e3} \int_0^Tg(x(t),u(t))dt=V(x_0)-V(x(T))+\int_0^Tg_V(x(t),u(t))dt;$$
2. whenever $J(x,u)<\infty$, there exists a sequece of positive numbers $T_k$, $k=1,2,\dots$, $T_k\to+\infty$, such that $g(x(T_k),u(T_k))\to0$, which, according to assumption (a), implies $x(T_k)\to x^*$ as $k\to\infty$.
In particular, for a piecewise continuous solution $(x,u)$ of (\ref{e1}), satisfying, in addition, equation (\ref{e2}), condition (I) means $\int_0^Tg(x(t),u(t))dt=V(x_0)-V(x(T))\le V(x_0).$ Since the upper bound $V(x_0)$ does not depend on $T$, and $g(x(t),u(t))\ge0$ is non-negative, we conclude that $\int_0^\infty g(x(t),u(t))dt<\infty.$ Therefore observation (**) applies, and, for the corresponding $T_k$, \begin{align*} J(x,u)&=\int_0^\infty g(x(t),u(t))dt=\lim_{k\to\infty}\int_0^{T_k}g(x(t),u(t))dt \\&= \lim_{k\to\infty}V(x_0)-V(x(T_k))=V(x_0)-V(x^*).\end{align*} Moreover, since $g$ is non-negative, the inequality $V(x_0)\ge V(x^*)$ must hold for every $x_0\in\Omega$.

To finish the proof, for an arbitrary piecewise continuous solution of (\ref{e1}), equation (\ref{e3}), together with non-negativity of $g_V$, implies $J(x,u)=\int_0^Tg(x(t),u(t))dt\ge V(x_0)-V(x(T)).$ When $J(x,u)<\infty$, applying this to $T=T_k$, where $T_k$ are described in observation (II), and taking the limit as $k\to\infty$ (and $x(T_k)\to x^*$) yields $J(x,u)\ge V(x_0)-V(x^*)$. $\Box$

It is possible to prove sufficiency under different assumptions, too. The particular assumptions here were chosen to ensure that $J(\bx(0)) < \infty$ implies that $\bx(t) \rightarrow \bx^*$. As Sasha says, "without something like this, all sorts of counter-examples emerge."

As a tool for verifying optimality, the HJB equations are actually surprisingly easy to work with: we can verify optimality for an infinite-horizon objective without doing any integration; we simply have to check a derivative condition on the optimal cost-to-go function $J^*$. Let's see this play out on the double integrator example.

# HJB for the Double Integrator

Consider the problem of regulating the double integrator (this time without input limits) to the origin using a quadratic cost: $$\ell(\bx,\bu) = q^2 + \dot{q}^2 + u^2.$$ I claim (without derivation) that the optimal controller for this objective is $$\pi(\bx) = -q - \sqrt{3}\dot{q}.$$ To convince you that this is indeed optimal, I have produced the following cost-to-go function: $$J(\bx) = \sqrt{3} q^2 + 2 q \dot{q} + \sqrt{3} \dot{q}^2.$$

Taking \begin{gather*} \pd{J}{q} = 2\sqrt{3} q + 2\dot{q}, \qquad \pd{J}{\dot{q}} = 2q + 2\sqrt{3}\dot{q}, \end{gather*} we can write \begin{align*} \ell(\bx,\bu) + \pd{J}{\bx}f(\bx,\bu) &= q^2 + \dot{q}^2 + u^2 + (2\sqrt{3} q + 2\dot{q}) \dot{q} + (2q + 2\sqrt{3}\dot{q}) u \end{align*} This is a convex quadratic function in $u$, so we can find the minimum with respect to $u$ by finding where the gradient with respect to $u$ evaluates to zero. $\pd{}{u} \left[ \ell(\bx,\bu) + \pd{J}{\bx} f(\bx,\bu) \right] = 2u + 2q + 2\sqrt{3}\dot{q}.$ Setting this equal to $0$ and solving for $u$ yields: $$u^* = -q - \sqrt{3} \dot{q},$$ thereby confirming that our policy $\pi$ is in fact the minimizer. Substituting $u^*$ back into the HJB reveals that the right side does in fact simplify to zero. I hope you are convinced!

Note that evaluating the HJB for the time-to-go of the minimum-time problem for the double integrator will also reveal that the HJB is satisfied wherever that gradient is well-defined. This is certainly mounting evidence in support of our bang-bang policy being optimal, but since $\pd{J}{\bx}$ is not defined everywhere, it does not actually satisfy the requirements of the sufficiency theorem as stated above. Since cusps are commonplace in optimal value functions, one could say that the assumption in the sufficiency theorem that $\pd{J}{\bx}$ is defined everywhere makes our current statement very weak.

# Solving for the minimizing control

We still face a few barriers to actually using the HJB in an algorithm. The first barrier is the minimization over $\bu$. When the action set was discrete, as in the graph search version, we could evaluate the one-step cost plus cost-to-go for every possible action, and then simply take the best. For continuous action spaces, in general we cannot rely on the strategy of evaluating a finite number of possible $\bu$'s to find the minimizer.

All is not lost. In the quadratic cost double integrator example above, we were able to solve explicitly for the minimizing $\bu$ in terms of the cost-to-go. It turns out that this strategy will actually work for a number of the problems we're interested in, even when the system (which we are given) or cost function (which we are free to pick, but which should be expressive) gets more complicated.

Recall that I've already tried to convince you that a majority of the systems of interest are control affine, e.g. I can write $f(\bx,\bu) = f_1(\bx) + f_2(\bx)\bu.$ We can make another simplification by restricting ourselves to running cost functions of the form $\ell(\bx,\bu) = \ell_1(\bx) + \bu^T {\bf R} \bu, \qquad {\bf R}={\bf R}^T \succ 0.$ In my view, this is not very restrictive - many of the cost functions that I find myself choosing to write down can be expressed in this form. Given these assumptions, we can write the HJB as $0 = \min_{\bu} \left[ \ell_1(\bx) + \bu^T {\bf R} \bu + \pd{J}{\bx} \left[ f_1(\bx) + f_2(\bx)\bu \right]\right].$ Since this is a positive quadratic function in $\bu$, if the system does not have any constraints on $\bu$, then we can solve in closed-form for the minimizing $\bu$ by taking the gradient of the right-hand side: $\pd{}{\bu} = 2\bu^T {\bf R} + \pd{J}{\bx} f_2(\bx) = 0,$ and setting it equal to zero to obtain $\bu^* = -\frac{1}{2}{\bf R}^{-1}f_2^T(\bx) \pd{J}{\bx}^T.$ If there are linear constraints on the input, such as torque limits, then more generally this could be solved (at any particular $\bx$) as a quadratic program.

Exploiting the ability to solve for the optimal action in closed form is also nice for generating benchmark problems for numerical optimal control, as it can be used for "converse optimal control" Doyle96.

# Converse optimal control for a cubic polynomial

Consider a one-dimensional system of the form $\dot{x} = f_1(x) + xu,$ and the running cost function $\ell(x,u) = x^2 + u^2.$ The optimal policy is $u^* = -\frac{1}{2}x\pd{J^*}{x},$ leading to the HJB $$0 = x^2 - \frac{1}{4}x^2\left[\pd{J^*}{x}\right]^2 + \pd{J^*}{x} f_1(x).$$ Choosing $J^*(x) = x^2$, we find that this is the cost-to-go when $f_1(x) = -\frac{1}{2}x + \frac{1}{2}x^3$ and $u^* = -x^2.$

What happens in the case where our system is not control affine or if we really do need to specify an instantaneous cost function on $\bu$ that is not simply quadratic? If the goal is to produce an iterative algorithm, like value iteration, then one common approach is to make a (positive-definite) quadratic approximation in $\bu$ of the HJB, and updating that approximation on every iteration of the algorithm. This broad approach is often referred to as differential dynamic programming (c.f. Jacobson70).

# Numerical solutions for $J^*$

The other major barrier to using the HJB in a value iteration algorithm is that the estimated optimal cost-to-go function, $\hat{J}^*$, must somehow be represented with a finite set of numbers, but we don't yet know anything about the potential form it must take. In fact, knowing the time-to-goal solution for minimum-time problem with the double integrator, we see that this function might need to be non-smooth for even very simple dynamics and objectives.

One natural way to parameterize $\hat{J}^*$ -- a scalar valued-function defined over the state space -- is to define the values on a mesh. This approach then admits algorithms with close ties to the relatively very advanced numerical methods used to solve other partial differential equations (PDEs), such as the ones that appear in finite element modeling or fluid dynamics. One important difference, however, is that our PDE lives in the dimension of the state space, while many of the mesh representations from the other disciplines are optimized for two or three dimensional space. Also, our PDE may have discontinuities (or at least discontinuous gradients) at locations in the state space which are not known apriori.

These days, the other natural choice is to represent $J^*$ using a neural network, or some similarly richly parameterized function approximator. This is common practice in reinforcement learning, which we will discuss more later in the book.

# Value iteration with function approximation

If we approximate $J^*$ with a finitely-parameterized function $\hat{J}_\balpha^*$, with parameter vector $\balpha$, then this immediately raises many important questions. If the true cost-to-go function does not live in the prescribed function class -- e.g., there does not exist an $\balpha$ which satisfies the sufficiency conditions for all $\bx$ -- then we might instead write a cost function to match the optimality conditions as closely as possible. But our choice of cost function is important. Are errors in all states equally important? If we are balancing an Acrobot, presumably having an accurate cost-to-go in the vicinity of the upright condition is more important that some rarely visited state? Even if we can define a cost function that we like, then can we say anything about the convergence of our algorithm in this more general case?

Let us start by considering a least-squares approximation of the value iteration update.

Using the least squares solution in a value iteration update is sometimes referred to as fitted value iteration, where $\bx_k$ are some number of samples taken from the continuous space and for discrete-time systems the iterative approximate solution to \begin{gather*} J^*(\bx_0) = \min_{u[\cdot]} \sum_{n=0}^\infty \ell(\bx[n],\bu[n]), \\ \text{ s.t. } \bx[n+1] = f(\bx[n], \bu[n]), \bx[0] = \bx_0\end{gather*} becomes \begin{gather} J^d_k = \min_\bu \left[ \ell(\bx_k,\bu) + \hat{J}^*_\alpha\left({f(\bx_k,\bu)}\right) \right], \\ \balpha \Leftarrow \argmin_\balpha \sum_k \left(\hat{J}^*_\balpha(\bx_k) - J^d_k \right)^2. \label{eq:fitted_value_iteration} \end{gather} Since the desired values $J^d_k$ are depend on the current guess of the cost-to-go, we will apply this value iteration algorithm iteratively until (hopefully) we achieve some numerical convergence.

Note that the update in \eqref{eq:fitted_value_iteration} is not quite the same as doing least-squares optimization of $$\sum_k \left(\hat{J}^*_\balpha(\bx_k) - \min_\bu \left[ \ell(\bx_k,\bu) + \hat{J}^*_\alpha\left({f(\bx_k,\bu)}\right) \right] \right)^2,$$ because in this equation $\alpha$ has an effect on both occurrences of $\hat{J}^*$. In \eqref{eq:fitted_value_iteration}, we cut that dependence by taking $J_k^d$ as fixed desired values; this version performs better in practice. In the deep reinforcement learning literature, we cut the dependence in a similar way by using a "target network" with the weights, $\beta,$ that are updated more slowly than $\alpha:$ $$\sum_k \left(\hat{J}^*_\balpha(\bx_k) - \min_\bu \left[ \ell(\bx_k,\bu) + \hat{J}^*_\beta\left({f(\bx_k,\bu)}\right) \right] \right)^2.$$ For nonlinear function approximators, the update to $\alpha$ in \eqref{eq:fitted_value_iteration} is typically approximated by taking multiple steps of (stochastic) gradient descent.

# Linear function approximators

In general, the convergence and accuracy guarantees for value iteration with generic function approximators are quite weak. But we do have some results for the special case of linear function approximators. A linear function approximator takes the form: $\hat{J}^*_\balpha(\bx) = \sum_i \alpha_i \psi_i(\bx) = \bpsi^T(\bx) \balpha,$ where $\bpsi(\bx)$ is a vector of potentially nonlinear features. Common examples of features include polynomials, radial basis functions, or most interpolation schemes used on a mesh. The distinguishing feature of a linear function approximator is the ability to exactly solve for $\balpha$ in order to represent a desired function optimally, in a least-squares sense. For linear function approximators, this is simply: \begin{gather*} \balpha \Leftarrow \begin{bmatrix} \bpsi^T(\bx_1) \\ \vdots \\ \bpsi^T(\bx_K)\end{bmatrix}^+ \begin{bmatrix} J^d_1 \\ \vdots \\ J^d_K \end{bmatrix}, \end{gather*} where the $^+$ notation refers to a Moore-Penrose pseudoinverse. One "theory of deep learning" argues that we can think of deep networks as linear function approximators (with randomly-initialized weights causing the last hidden layer to be a rich random basis vector, which is combined linearly into the outputs), which might justify how they can obtain zero training error quickly and then slowly adapt the features based on implicit regularization.

In finite state and action settings, fitted value iteration with linear function approximators is known to converge to the globally optimal $\balpha^*$Tsitsiklis97. Munos08 extends these results and treats the continuous state-action case. Konda03 gives the important generalization to actor-critic algorithms, which we will discuss soon. There can be some subtleties required for achieving convergence in practice. In the next chapter, we'll see why exponentially-discounted cost functions are often used help the value iterations converge even for infinite-horizon formulations.

# Value iteration on a mesh

Imagine that we use a mesh to approximate the cost-to-go function over that state space with $K$ mesh points $\bx_k$. We would like to perform the value iteration update: $$\forall k, \hat{J}^*(\bx_k) \Leftarrow \min_\bu \left[ \ell(\bx_k,\bu) + \hat{J}^*\left({f(\bx_k,\bu)}\right) \right], \label{eq:mesh_value_iteration}$$ but must deal with the fact that $f(\bx_k,\bu)$ might not result in a next state that is directly at a mesh point. Most interpolation schemes for a mesh can be written as some weighted combination of the values at nearby mesh points, e.g. $\hat{J}^*(\bx) = \sum_i \beta_i(\bx) \hat{J}^*(\bx_i), \quad \sum_i \beta_i = 1$ with $\beta_i$ the relative weight of the $i$th mesh point. In we have implemented barycentric interpolationMunos98. Taking $\alpha_i = \hat{J}^*(\bx_i)$, the cost-to-go estimate at mesh point $i$, we can see that this is precisely an important special case of fitted value iteration with linear function approximation. Furthermore, assuming $\beta_i(\bx_i) = 1,$ (e.g., the only point contributing to the interpolation at a mesh point is the value at the mesh point itself), the update in Eq. (\ref{eq:mesh_value_iteration}) is precisely the least-squares update (and it achieves zero error). This is the representation used in the value iteration examples that you've already experimented with above.

# Neural fitted value iteration

Of course the most popular function approximators today are based on neural networks. In these much more general approximators, we cannot solve the least-squares update \eqref{eq:fitted_value_iteration} exactly, but instead attempt to minimize the loss function: with a few steps of (stochastic) gradient descent on each iteration. In practice, we can use any of the standard optimizers (the examples below uses AdamKingma14 as the default).

What sort of neural architectures work best for value iteration in robotics? AlphaGo famously used convolutional networksSilver16; here they could represent the state of the board as an image and the some notion of translational equivariance makes sense. But in "deep" reinforcement learning for robotics, when the input to the network is a state vector, then the standard choice is to use a simple multi-layered perceptron like the one I've used in the examples below. For rotational coordinates, instead of passing in $\theta$ directly, we typically pass in both $\sin\theta$ and $\cos\theta.$

When it comes to neural network software, I'll mostly use PyTorch throughout these notes. But for small problems (with small neural nets and modest batch sizes during training) like the ones we'll consider here, the cost of going back and forth to PyTorch and especially the GPU outweighs most of the benefits. I've also implemented a simple CPU-based MultilayerPerceptron in , which can be used directly in the Systems framework, and will use it when it makes sense. My goal is to provide examples of both approaches so that you can pick and choose.

# Neural Fitted Value Iteration

Let us try reproducing our double-integrator value iteration examples using neural networks:

# Continuous Neural Fitted Value Iteration

This is nice work on continuous neural fitted value iteration in Lutter21. Bansal21 has some of the most complete and convincing HJB solutions with neural networks that I have seen recently, and recommends a number of architectural and optimization training tweaks which could make a significant difference.

# Continuous-time systems

For solutions to systems with continuous-time dynamics, I have to uncover one of the details that I've so far been hiding to keep the notation simpler. Let us consider a problem with a finite-horizon: \begin{gather*} \min_{\bu[\cdot]} \sum_{n=0}^N \ell(\bx[n],\bu[n]), \\ \text{ s.t. } \bx[n+1] = f(\bx[n], \bu[n]), \bx[0] = \bx_0\end{gather*} When the horizon is finite, the optimal cost-to-go and the optimal policy, in general, can depend on time (if the post office closes at 5pm, then your actions at 4:55pm might be very different than your actions at 4pm).

The way that we compute this is by solving the time-varying cost-to-go function backwards in time: \begin{gather*}J^*(\bx,N) = \min_\bu \ell(\bx, \bu) \\ J^*(\bx,n-1) = \min_\bu \left[ \ell(\bx, \bu) + J^*(f(\bx,\bu), n) \right]. \end{gather*} Even in the infinite-horizon setting, one can actually understand the convergence of the value iteration update as solving this time-varying cost-to-go backwards in time until it reaches a steady-state solution (the infinite-horizon solution). This is one way to see why value iteration only converges if the optimal cost-to-go is bounded.

Now let's consider the continuous-time version. Again, we have a time-varying cost-to-go, $J^*(\bx,t).$ Now $$\frac{dJ^*}{dt} = \pd{J^*}{\bx}f(\bx,\bu) + \pd{J^*}{t},$$ and our sufficiency condition is $$0 = \min_\bu \left[\ell(\bx, \bu) + \pd{J^*}{\bx}f(\bx,\bu) + \pd{J^*}{t} \right].$$ But since $\pd{J^*}{t}$ doesn't depend on $\bu$, we can pull it out of the $\min$ and write the (true) HJB: $$-\pd{J^*}{t} = \min_\bu \left[\ell(\bx, \bu) + \pd{J^*}{\bx}f(\bx,\bu) \right].$$ The standard numerical recipe Osher03 for solving this is to approximate $\hat{J}^*(\bx,t)$ on a mesh and then integrate the equations backwards in time (until convergence, if the goal is to find the infinite-horizon solution). If, for mesh point $\bx_i$ we have $\alpha_i(t) = \hat{J}^*(\bx_i, t)$, then: $$-\dot\alpha_i(t) = \min_\bu \left[\ell(\bx_i, \bu) + \pd{J^*(\bx_i, t)}{\bx}f(\bx_i,\bu) \right],$$ where the partial derivative is estimated with a suitable finite-difference approximation on the mesh and often some "viscosity" terms are added to the right-hand side to provide additional numerical robustness; see the Lax-Friedrichs scheme Osher03 (section 5.3.1) for an example.

Probably most visible and consistent campaign using numerical HJB solutions in applied control (at least in robotics) has come from Claire Tomlin's group at Berkeley. Their work leverages Ian Mitchell's Level Set Toolbox, which solves the Hamilton-Jacobi PDEs on a Cartesian mesh using the technique cartooned above, and even includes the minimum-time problem for the double integrator as a tutorial exampleMitchell05.

try just running snopt on quartic objective

# Extensions

finite-horizon / time-varying dynamics or cost.

There are many many nice extensions to the basic formulations that we've presented so far. I'll try to list a few of the most important ones here. I've also had a number of students in this course explore very interesting extensions; for example Yang20 looked at imposing a low-rank structure on the (discrete-state) value function using ideas from matrix completion to obtain good estimates despite updating only a small fraction of the states.

# Discounted and average cost formulations

Throughout this chapter, we have focused primarily on minimizing an infinite-horizon sum of running costs. Whenever you write an infinite sum (or integral), one should immediately question whether that sum converges to a finite value. For many control problems, it does -- for instance, if we set the cost to be zero at the goal and the controller can drive the system to the goal in finite time, or exponentially at a rate that is faster than cost is accrued. However, once we enter the realm of approximate optimal control, we can run into problems. For instance, if we form a discrete state and action approximation of the dynamics, it may not be possible for the discretized system to arrive exactly at the zero-cost goal.

A common way to protect oneself from this is to add an exponential discount factor to the integral cost. This is a natural choice since it preserves the recursive structure of the Bellman equation. In discrete time, it takes the form $$\min_{\bu[\cdot]} \sum_{n=0}^\infty \gamma^n \ell(\bx[n], \bu[n]),$$ which leads to the Bellman equation $$J^*(\bx) = \min_\bu \left[ \ell(\bx, \bx) + \gamma J^*(f(\bx,\bu) \right].$$ In continuous time we haveDoya00 $$\min_\bu(\cdot) \int_0^\infty e^{-\frac{t}{\tau}} \ell(\bx(t), \bu(t)) dt,$$ which leads to the corresponding HJB: $$\frac{1}{\tau} J^*(\bx) = \min_\bu \left[ \ell(\bx, \bu) + \pd{J^*}{\bx} f(\bx,\bu)\right].$$

A word of warning: this discount factor does change the optimal policy; and it is not without its pitfalls. We'll work out a specific case of the discounted cost for the LQR problem, and see that the discounted controller is exactly a controller which assumes that the plant is more stable than it actually is. It is possible to find an optimal controller for the discounted cost that results in an unstable controller.

A more robust alternative is to consider an average-cost formulations... (details coming soon!)

# Stochastic control for finite MDPs

One of the most amazing features of the dynamic programming, additive cost approach to optimal control is the relative ease with which it extends to optimizing stochastic systems.

For discrete systems, we can generalize our dynamics on a graph by adding action-dependent transition probabilities to the edges. This new dynamical system is known as a Markov Decision Process (MDP), and we write the dynamics completely in terms of the transition probabilities $\Pr(s[n+1] = s' | s[n] = s, a[n] = a).$ For discrete systems, this is simply a big lookup table. The cost that we incur for any execution of system is now a random variable, and so we formulate the goal of control as optimizing the expected cost, e.g. $$J^*(s[0]) = \min_{a[\cdot]} E \left[ \sum_{n=0}^\infty \ell(s[n],a[n]) \right]. \label{eq:stochastic_dp_optimality_cond}$$ Note that there are many other potential objectives, such as minimizing the worst-case error, but the expected cost is special because it preserves the dynamic programming recursion: $J^*(s) = \min_a E \left[\ell(s,a) + J^*(s')\right] = \min_a \left[ \ell(s,a) + \sum_{s'} \Pr(s'|s,a) J^*(s') \right].$ Remarkably, if we use these optimality conditions to construct our value iteration algorithm $\hat{J}(s) \Leftarrow \min_a \left[ \ell(s,a) + \sum_{s'} \Pr(s'|s,a) \hat{J}(s') \right],$ then this algorithm has the same strong convergence guarantees of its counterpart for deterministic systems. And it is essentially no more expensive to compute!

We will continue to develop a much more general toolkit for for stochastic and robust control later in the notes.

# Stochastic interpretation of deterministic, continuous-state value iteration

There is a particularly nice observation to be made here. Let's assume that we have discrete control inputs and discrete-time dynamics, but a continuous state space. Recall the fitted value iteration on a mesh algorithm described above. In fact, the resulting update is exactly the same as if we treated the system as a discrete state MDP with $\beta_i$ representing the probability of transitioning to state $\bx_i$! This sheds some light on the impact of discretization on the solutions -- discretization error here will cause a sort of diffusion corresponding to the probability of spreading across neighboring nodes.

# Linear Programming Dynamic Programming

For discrete MDPs, value iteration is a magical algorithm because it is simple, but known to converge to the global optimal, $J^*$. However, other important algorithms are known; one of the most important is a solution approach using linear programming. This formulation provides an alternative view, but may also be more generalizable and even more efficient for some instances of the problem.

Recall the optimality conditions from Eq. \eqref{eq:value_update}. If we describe the cost-to-go function as a vector, $J_i = J(s_i)$, then these optimality conditions can be rewritten in vector form as $$\bJ = \min_a \left[ {\bf \ell}(a) + \bT(a) \bJ \right], \label{eq:vector_stochastic_dp}$$ where $\ell_i(a) = \ell(s_i,a)$ is the cost vector, and $T_{i,j}(a) = \Pr(s_j|s_i,a)$ is the transition probability matrix. Let us take $\bJ$ as the vector of decision variables, and replace Eq. (\ref{eq:vector_stochastic_dp}) with the constraints: $$\forall a, \bJ \le {\bf \ell}(a) + \bT(a) \bJ.$$ Clearly, for finite $a$, this is finite list of linear constraints, and for any vector $\bJ$ satisfying these constraints we have $\bJ \le \bJ^*$ (elementwise). Now write the linear program: \begin{gather*} \maximize_\bJ \quad \bc^T \bJ, \\ \subjto \quad \forall a, \bJ \le {\bf \ell}(a) + \bT(a) \bJ, \end{gather*} where $c$ is any positive vector. The solution to this problem is $\bJ = \bJ^*$.

Perhaps even more interesting is that this approach can be generalized to linear function approximators. Taking a vector form of my notation above, now we have a matrix of features with $\bPsi_{i,j} = \psi^T_j(\bx_i)$ and we can write the LP \begin{gather} \maximize_\balpha \quad \bc^T \bPsi \balpha, \\ \subjto \quad \forall a, \bPsi \balpha \le {\bf \ell}(a) + \bT(a) \bPsi \balpha. \end{gather} This time the solution is not necessarily optimal, because $\bPsi \balpha^*$ only approximates $\bJ^*$, and the relative values of the elements of $\bc$ (called the "state-relevance weights") can determine the relative tightness of the approximation for different features Farias02.

Add an example here; and perhaps reference the problem set question.

# Sums-of-Squares Dynamic Programming

There are some cases where the linear programming approach to dynamic programming can be extended to continuous states and actions, with strong guarantees. This approach follows naturally from our study of Lyapunov functions in an upcoming chapter, and builds on the computational tools we develop there. Feel free to skip ahead.

# Discounting and the Convergence of Value Iteration

Let's consider the discrete time, state, and action version of value iteration: $$\hat{J}^*(s_i) \Leftarrow \min_{a \in A} \left[ \ell(s_i,a) + \hat{J}^*\left({f(s_i,a)}\right) \right],$$ which finds the optimal policy for the infinite horizon cost function $\sum_{n=0}^\infty \ell(s[n],a[n]).$

1. If $J^*$ is the true optimal cost-to-go, show that any solution $\hat{J}^* = J^* + c$, where $c$ is any constant scalar, is a fixed point of this value iteration update.
2. Is the controller still optimal, even if the estimated cost-to-go function is off by a constant factor?

Surprisingly, adding a discount factor can help with this. Consider the infinite-horizon discounted cost: $\sum_{n=0}^\infty \gamma^n \ell(s[n],a[n])$, where $0 < \gamma \le 1$ is the discount factor. The corresponding value iteration update is $$\hat{J}^*(s_i) \Leftarrow \min_{a \in A} \left[ \ell(s_i,a) + \gamma\hat{J}^*\left({f(s_i,a)}\right) \right].$$ For simplicity, assume that there exists a state $s_i$ that is a zero-cost fixed point under the optimal policy; e.g. $\ell(s_i, \pi^*(s_i)) = 0$ and $f(s_i, \pi^*(s_i)) = s_i$.

1. Is $\hat{J}^* = J^* + c$ still a fixed point of the value iteration update when $\gamma < 1$? Show your work.

# Choosing a Cost Function

The figure above shows an autonomous car moving at constant velocity $v>0$ on a straight road. Let $x$ be the (longitudinal) position of the car along the road, $y$ its (transversal) distance from the centerline, and $\theta$ the angle between the centerline and the direction of motion. The only control action is the steering velocity $u$, which is constrained in the interval $[u_{\text{min}}, u_{\text{max}}]$ (where $u_{\text{min}}<0$ and $u_{\text{max}}>0$). We describe the car dynamics with the simple kinematic model \begin{align*}\dot x &= v \cos \theta, \\ \dot y &= v \sin \theta, \\ \dot \theta &= u.\end{align*} Let $\bx = [x, y, \theta]^T$ be the state vector. To optimize the car trajectory we consider a quadratic objective function $$J = \int_{0}^{\infty} [\bx^T(t) \bQ \bx(t) + R u^2(t)] dt,$$ where $\bQ$ is a constant positive-semidefinite (hence symmetric) matrix and $R$ is a constant nonnegative scalar (note that $R=0$ is allowed here).

1. Suppose our only goal is to keep the distance $y$ between the car and the centerline as small as possible, as fast as possible, without worrying about anything else. What would be your choice for $\bQ$ and $R$?
2. How would the behavior of the car change if you were to multiply the weights $\bQ$ and $R$ from the previous point by an arbitrary positive coefficient $\alpha$?
3. The cost function from point (a) might easily lead to excessively sharp turns. Which entry of $\bQ$ or $R$ would you increase to mitigate this issue?
4. Country roads are more slippery on the sides than in the center. Is this class of objective functions (quadratic objectives) rich enough to include a penalty on sharp turns that increases with the distance of the car from the centerline?
5. With this dynamics model and this objective function, would you ever choose a weight matrix $\bQ$ which is strictly positive definite (independent of the task you want the car to accomplish)? Why?

# Ill-Posed Optimal Control Problem

In this exercise we will see how seemingly simple cost functions can give surprising results. Consider the single-integrator system $\dot x = u$ with initial state $x(0)=0$. We would like to find the control signal $u(t)$ that minimizes the seemingly innocuous cost function $$J = \int_0^T x^2(t) + (u^2(t) - 1)^2 dt,$$ with $T$ finite. To this end, we consider a square-wave control parameterized by $\tau>0$: $$u_\tau(t) = \begin{cases} 1 &\text{if} & t \in [0, \tau) \cup [3 \tau, 5 \tau) \cup [7 \tau, 9 \tau) \cup \cdots \\ -1 &\text{if} & t \in [\tau, 3 \tau) \cup [5 \tau, 7 \tau) \cup [9 \tau, 11 \tau) \cup \cdots \end{cases}.$$

1. What are the states $x$ and the inputs $u$ for which the running cost $$\ell(x, u) = x^2 + (u^2 - 1)^2$$ is equal to zero?
2. Consider now two control signals $u_{\tau_1}(t)$ and $u_{\tau_2}(t)$, with $\tau_1 = \tau_2 / 2$. Which one of the two incurs the lower cost $J$? (Hint: start by sketching how the system state $x(t)$ evolves under these two control signals.)
3. What happens to the cost $J$ when $\tau$ gets smaller and smaller? What does the optimal control signal look like? Could you implement it with a finite-bandwidth controller?

# A Linear System with Quadratic Cost

Consider the scalar control differential equation $$\dot{x} = x + u,$$ and the infinite horizon cost function $$J = \int_0^{\infty} [3x^2(t) + u^2(t)] dt.$$ As we will see in the chapter on linear-quadratic regulation, the optimal cost-to-go for a problem of this kind assumes the form $J^* = S x^2$. It is not hard to see that this, in turn, implies that the optimal controller has the form $u^* = - K x$.

1. Imagine that you plugged the expression $J^* = S x^2$ in the HJB equations for this problem, you solved them for $S$, and you got $S \leq 0$. Would this result ring any alarm bells? Explain your answer.
2. Use the HJB sufficiency theorem to derive the optimal values of $S$ and $K$.
3. Working with digital controllers, we typically have to sample the dynamics of our systems, and approximate them with discrete-time models. Let us introduce a time step $h > 0$ and discretize the dynamics as $$\frac{x[n+1] - x[n]}{h} = x[n] + u[n],$$ and the objective as $$h \sum_{n=0}^{\infty} (3 x^2[n] + u^2[n]).$$ One of the following expressions is the correct cost-to-go $J_h^*(x)$ for this discretized problem. Can you identify it without solving the discrete-time HJB equation? Explain your answer.
1. $J_h^* (x)= S_h x^4$ with $S_h = 3 + h + \sqrt{6}h^2$.
2. $J_h^* (x)= S_h x^2$ with $S_h = 1 + 2h + 2 \sqrt{h^2 + h +1}$.
3. $J_h^* (x)= S_h x^2$ with $S_h = 3 + 2h^2 + \sqrt{h^2 + h + 2}$.

# Value Iteration for Minimum-Time Problems

In this exercise we analyze the performance of the value-iteration algorithm, considering its application to the minimum time problem for the double integrator. , you will find everything you need for this analysis. Take the time to go through the notebook and understand the code in it, then answer the following questions.

1. At the end of the notebook section "Performance of the Value-Iteration Policy", we plot the state trajectory of the double integrator in closed loop with the value-iteration policy, as well as the resulting control signal.
1. Does the state-space trajectory follow the theoretically-optimal quadratic arcs we have seen in the example?
2. Is the control policy we get from value iteration a bang-bang policy? In other words, does the control signal take values in the set $\{-1, 0, 1\}$ exclusively?
3. Explain in a couple of sentences, what is the reason for this behavior.
2. In the "Value Iteration Algorithm" section of the notebook, increase the number of knot points on the $q$ and the $\dot q$ axes to refine the state-space mesh used in the value iteration. Does this fix the issues you have seen in point (a)?
3. In the final section of the notebook, implement the theoretically-optimal control policy from the example, and use the plots to verify that the closed-loop system behaves as expected.

# Linear Program for Dynamic Programming

In this exercise we solve the dynamic programming using a linear program, considering its application to the grid world. , you will write a linear program for solving the optimal cost-to-go in the grid world.

# Continuous Fitted Value Iteration for Pendulum Swingup

In this excercise we will implement value iteration for a continuous time system with continuous actions and states. We will achieve by training a neural network to approximate our value function. You will work exclusively in to implement the algorithms described in the Continuous Dynamic Programming section.

# Ordering of the Value Iteration Updates

The batched Bellman update used in batched value iteration is as follows: $$\forall i \in |S| \quad \hat{J}^{\ast}(s_i) \leftarrow \min_{a \in A} [l (s_i, a) + \hat{J}^{\ast} (f(s_i, a))]$$ The batched Bellman update consists of $|S|$ individual Bellman updates on each iteration of the algorithm. Throughout this problem, we use the following goal reaching cost: \begin{align*} l(s_i, a) = \left\{ \begin{array}{lr} 0 & \text{if } s_i = s_5\\ 1 & \text{else} \end{array} \right\} \end{align*}
1. Suppose we represent our dynamics with a graph $G = (V, E)$ where the vertices represent the states and edges represent the available actions/state transitions. Calculate the batched value iteration tables until convergence for the following directed acyclic graph (DAG), and calculate the total number of individual Bellman updates. Note: You do not need to include the updates required to check convergence.
2. Suppose we are performing asynchronous value iteration with a hand-specified update order. Produce the update order for the above graph that minimizes the number of individual Bellman updates till convergence, and calculate the total number of individual updates. Note: You do not need to include the updates required to check convergence.
3. Generalize the update order in part (b) to arbitrary DAGs with a single start node and single goal node by defining the set of states to be updated at iteration $n+1$ in terms of the set of states that were updated at iteration $n$. Note: Describe the update sequence iteratively in terms of sets of states instead of individual states (as defined by asynchronous VI) because this simplifies the description of the update sequence.
4. Suppose we extend the class of dynamics graphs in part c) to allow cycles. Take an instance of this class of graphs and apply value iteration to get an optimal value function. Can an optimal policy associated with this value function contain a cycle? Explain your answer.
5. (Optional, ungraded) Suppose we are uniformly sampling states under asynchronous value iteration (updating each $i$ one at a time in a random order). What is the probability that asynchronous value iteration converges with less total Bellman updates than batched value iteration?

# References

1. , "Dynamic Programming and Optimal Control", Athena Scientific , 2000.

2. , "Reinforcement Learning: An Introduction", MIT Press , 2018.

3. , "Neuro-Dynamic Programming", Athena Scientific , October, 1996.

4. , "Reinforcement Learning: An Introduction", MIT Press , 1998.

5. , "Dynamic Programming \& Optimal Control", Athena Scientific , vol. I and II, May 1, 2005.

6. , "Nonlinear games: examples and counterexamples", Proceedings of 35th IEEE Conference on Decision and Control , vol. 4, pp. 3915--3920, 1996.

7. , "Differential Dynamic Programming", American Elsevier Publishing Company, Inc. , 1970.

8. , "An Analysis of Temporal-Difference Learning with Function Approximation", IEEE Transactions on Automatic Control, vol. 42, no. 5, pp. 674-690, May, 1997.

9. , "Finite-Time Bounds for Fitted Value Iteration", J. Mach. Learn. Res., vol. 9, pp. 815--857, June, 2008.

10. , "Actor-Critic Algorithms", SIAM Journal on Control and Optimization, vol. 42, no. 4, pp. 1143-1166, 2003.

11. , "Barycentric Interpolators for Continuous Space and Time Reinforcement Learning", Advances in Neural Information Processing Systems , vol. 11, pp. 1024--1030, 1998.

12. , "Adam: A method for stochastic optimization", arXiv preprint arXiv:1412.6980, 2014.

13. , "Mastering the game of Go with deep neural networks and tree search", nature, vol. 529, no. 7587, pp. 484--489, 2016.

14. , "Value Iteration in Continuous Actions, States and Time", International Conference on Machine Learning , pp. 7224--7234, 2021.

15. , "Deepreach: A deep learning approach to high-dimensional reachability", 2021 IEEE International Conference on Robotics and Automation (ICRA) , pp. 1817--1824, 2021.

16. , "Level Set Methods and Dynamic Implicit Surfaces", Springer , 2003.

17. , "A Toolbox of {H}amilton-{J}acobi Solvers for Analysis of Nondeterministic Continuous and Hybrid Systems", Proceedings of Hybrid Systems Computation and Control , March, 2005.

18. , "Harnessing Structures for Value-Based Planning and Reinforcement Learning", International Conference on Learning Representations , 2020.

19. , "Reinforcement Learning in Continuous Time and Space", Neural Computation, vol. 12, no. 1, pp. 219-245, January, 2000.

20. , "The Linear Programming Approach to Approximate Dynamic Programming: Theory and Application", PhD thesis, Stanford University, June, 2002.