Algorithms for Walking, Running, Swimming, Flying, and Manipulation
© Russ Tedrake, 2022
Last modified .
How to cite these notes, use annotations, and give feedback.
Note: These are working notes used for a course being taught at MIT. They will be updated throughout the Spring 2022 semester. Lecture videos are available on YouTube.
Previous Chapter | Table of contents | Next Chapter |
In chapter 2, we spent some time thinking about the phase portrait of the simple pendulum, and concluded with a challenge: can we design a nonlinear controller to reshape the phase portrait, with a very modest amount of actuation, so that the upright fixed point becomes globally stable? With unbounded torque, feedback-cancellation solutions (e.g., invert gravity) can work well, but can also require an unnecessarily large amount of control effort. The energy-based swing-up control solutions presented for the acrobot and cart-pole systems are considerably more appealing, but required some cleverness and might not scale to more complicated systems. Here we investigate another approach to the problem, using computational optimal control to synthesize a feedback controller directly.
In this chapter, we will introduce optimal control - a control design
process using optimization. This approach is powerful for a number of
reasons. First and foremost, it is very general - allowing us to specify
the goal of control equally well for fully- or under-actuated, linear or
nonlinear, deterministic or stochastic, and continuous or discrete
systems. Second, it permits concise descriptions of potentially very
complex desired behaviours, specifying the goal of control as a scalar
objective (plus a list of constraints). Finally, and most importantly,
optimal control is very amenable to numerical solutions.
The fundamental idea in optimal control is to formulate the goal of control as the long-term optimization of a scalar cost function. Let's introduce the basic concepts by considering a system that is even simpler than the simple pendulum.
Consider the double integrator system $$\ddot{q} = u, \quad |u| \le 1.$$ If you would like a mechanical analog of the system (I always do), then you can think about this as a unit mass brick moving along the x-axis on a frictionless surface, with a control input which provides a horizontal force, $u$. The task is to design a control system, $u = \pi(\bx,t)$, $\bx=[q,\dot{q}]^T$ to regulate this brick to $\bx = [0,0]^T$.
In order to formulate this control design problem using optimal control, we must define a scalar objective which scores the long-term performance of running each candidate control policy, $\pi(\bx,t)$, from each initial condition, $(\bx_0,t_0)$, and a list of constraints that must be satisfied. For the task of driving the double integrator to the origin, one could imagine a number of optimal control formulations which would accomplish the task, e.g.:
Note that the input limits, $|u|\le1$ are also required to make this problem well-posed; otherwise both optimizations would result in the optimal policy using infinite control input to approach the goal infinitely fast. Besides input limits, another common approach to limiting the control effort is to add an additional quadratic cost on the input (or "effort"), e.g. $\int \left[ \bu^T(t) {\bf R} \bu(t) \right] dt,$ ${\bf R}\succ0$. This could be added to either formulation above. We will examine many of these formulations in some details in the examples worked out at the end of this chapter.
Optimal control has a long history in robotics. For instance, there has been a great deal of work on the minimum-time problem for pick-and-place robotic manipulators, and the linear quadratic regulator (LQR) and linear quadratic regulator with Gaussian noise (LQG) have become essential tools for any practicing controls engineer. With increasingly powerful computers and algorithms, the popularity of numerical optimal control has grown at an incredible pace over the last few years.
For more intuition, let's do an informal derivation of the solution to the minimum time problem for the double integrator with input constraints: \begin{align*} \minimize_{\pi} \quad & t_f\\ \subjto \quad & \bx(t_0) = \bx_0, \\ & \bx(t_f) = {\bf 0}, \\ & \ddot{q}(t) = u(t), \\ & |u| \le 1. \end{align*} What behavior would you expect an optimal controller exhibit?
Your intuition might tell you that the best thing that the brick can do, to reach the goal in minimum time with limited control input, is to accelerate maximally towards the goal until reaching a critical point, then hitting the brakes in order to come to a stop exactly at the goal. This would be called a bang-bang control policy; these are often optimal for systems with bounded input, and it is in fact optimal for the double integrator, although we will not prove it until we have developed more tools.
Let's work out the details of this bang-bang policy. First, we can figure out the states from which, when the brakes are fully applied, the system comes to rest precisely at the origin. Let's start with the case where $q(0) < 0$, and $\dot{q}(0)>0$, and "hitting the brakes" implies that $u=-1$ . Integrating the equations, we have \begin{gather*} \ddot{q}(t) = u = -1 \\\dot{q}(t) = \dot{q}(0) - t \\ q(t) = q(0) + \dot{q}(0) t - \frac{1}{2} t^2. \end{gather*} Substituting $t = \dot{q}(0) - \dot{q}$ into the solution reveals that the system orbits are parabolic arcs: \[ q = -\frac{1}{2} \dot{q}^2 + c_{-}, \] with $c_{-} = q(0) + \frac{1}{2}\dot{q}^2(0)$.
Similarly, the solutions for $u=1$ are $q = \frac{1}{2} \dot{q}^2 + c_{+}$, with $c_{+}=q(0)-\frac{1}{2}\dot{q}^2(0)$.
Perhaps the most important of these orbits are the ones that pass directly through the origin (e.g., $c_{-}=0$). Following our initial logic, if the system is going slower than this $\dot{q}$ for any $q$, then the optimal thing to do is to slam on the accelerator ($u=-\text{sgn}(q)$). If it's going faster than the $\dot{q}$ that we've solved for, then still the best thing to do is to brake; but inevitably the system will overshoot the origin and have to come back. We can summarize this policy with: \[ u = \begin{cases} +1 & \text{if } (\dot{q} < 0 \text{ and } q \le \frac{1}{2} \dot{q}^2) \text{ or } (\dot{q}\ge 0 \text{ and } q < -\frac{1}{2} \dot{q}^2) \\ 0 & \text{if } q=0 \text{ and } \dot{q}=0 \\ -1 & \text{otherwise} \end{cases} \]
and illustrate some of the optimal solution trajectories:
And for completeness, we can compute the optimal time to the goal by solving for the amount of time required to reach the switching surface plus the amount of time spent moving along the switching surface to the goal. With a little algebra, you will find that the time to goal, $T(\bx)$, is given by \[ T(\bx) = \begin{cases} 2\sqrt{\frac{1}{2}\dot{q}^2-q} - \dot{q} & \text{for } u=+1 \text{ regime}, \\ 0 & \text{for } u=0, \\ \dot{q} + 2\sqrt{\frac{1}{2}\dot{q}^2+q} & \text{for } u=-1, \end{cases} \] plotted here:
Notice that the function is continuous (though not smooth), even though the policy is discontinuous.
As we begin to develop theoretical and algorithmic tools for optimal control, we will see that some formulations are much easier to deal with than others. One important example is the dramatic simplification that can come from formulating objective functions using additive cost, because they often yield recursive solutions. In the additive cost formulation, the long-term "score" for a trajectory can be written as $$\int_0^T \ell(x(t),u(t)) dt,$$ where $\ell()$ is the instantaneous cost (also referred to as the "running cost"), and $T$ can be either a finite real number or $\infty$. We will call a problem specification with a finite $T$ a "finite-horizon" problem, and $T=\infty$ an "infinite-horizon" problem. Problems and solutions for infinite-horizon problems tend to be more elegant, but care is required to make sure that the integral converges for the optimal controller (typically by having an achievable goal that allows the robot to accrue zero-cost).
The quadratic cost function suggested in the double integrator example above is clearly written as an additive cost. At first glance, our minimum-time problem formulation doesn't appear to be of this form, but we actually can write it as an additive cost problem using an infinite horizon and the instantaneous cost $$\ell(x,u) = \begin{cases} 0 & \text{if } x=0, \\ 1 & \text{otherwise.} \end{cases}$$
We will examine a number of approaches to solving optimal control problems throughout the next few chapters. For the remainder of this chapter, we will focus on additive-cost problems and their solution via dynamic programming.
For systems with continuous states and continuous actions, dynamic programming is a set of theoretical ideas surrounding additive cost optimal control problems. For systems with a finite, discrete set of states and a finite, discrete set of actions, dynamic programming also represents a set of very efficient numerical algorithms which can compute optimal feedback controllers. Many of you will have learned it before as a tool for graph search.
Imagine you have a directed graph with states (or nodes) $\{s_1,s_2,...\} \in S$ and "actions" associated with edges labeled as $\{a_1,a_2,...\} \in A$, as in the following trivial example:
Let us also assume that each edge has an associate weight or cost, using $\ell(s,a)$ to denote the cost of being in state $s$ and taking action $a$. Furthermore we will denote the transition "dynamics" using \[ s[n+1] = f(s[n],a[n]). \] For instance, in the graph above, $f(s_1,a_1) = s_2$.
There are many algorithms for finding (or approximating) the optimal path
from a start to a goal on directed graphs. In dynamic programming, the key
insight is that we can find the shortest path from every node by solving
recursively for the optimal cost-to-go (the cost that will be
accumulated when running the optimal controller) from every node to the
goal. One such algorithm starts by initializing an estimate $\hat{J}^*=0$
for all $s_i$, then proceeds with an iterative algorithm which sets
\begin{equation} \hat{J}^*(s_i) \Leftarrow \min_{a \in A} \left[ \ell(s_i,a)
+ \hat{J}^*\left({f(s_i,a)}\right) \right]. \label{eq:value_update}
\end{equation} In software, $\hat{J}^*$ can be represented as a vector with
dimension equal to the number of discrete states. This algorithm,
appropriately known as value iteration, is guaranteed to converge
to the optimal cost-to-go up to a constant factor, $\hat{J}^* \rightarrow
J^* + c$
Value iteration is an amazingly simple algorithm, but it accomplishes something quite amazing: it efficiently computes the long-term cost of an optimal policy from every state by iteratively evaluating the one-step cost. If we know the optimal cost-to-go, then it's easy to extract the optimal policy, $a = \pi^*(s)$: \begin{equation} \pi^*(s_i) = \argmin_a \left[ \ell(s_i,a) + J^*\left( f(s_i,a) \right) \right]. \label{eq:policy_update} \end{equation} It's a simple algorithm, but playing with an example can help our intuition.
Imagine a robot living in a grid (finite state) world. Wants to get
to the goal location. Possibly has to negotiate cells with obstacles.
Actions are to move up, down, left, right, or do nothing.
You can run value iteration for the double integrator (using
barycentric interpolation to interpolate between nodes) in
Please do take some time to try different cost functions by editing the code yourself.
Let's take a minute to appreciate how amazing this is. Our solution to finding the optimal controller for the double integrator wasn't all that hard, but it required some mechanical intuition and solutions to differential equations. The resulting policy was non-trivial -- bang-bang control with a parabolic switching surface. The value iteration algorithm doesn't use any of this directly -- it's a simple algorithm for graph search. But remarkably, it can generate effectively the same policy with just a few moments of computation.
It's important to note that there are some differences between the computed policy and the optimal policy that we derived, due to discretization errors. We will ask you to explore these in the problems.
The real value of this numerical solution, however, is unlike our analytical solution for the double integrator, we can apply this same algorithm to any number of dynamical systems virtually without modification. Let's apply it now to the simple pendulum, which was intractable analytically.
You can run value iteration for the simple pendulum (using barycentric
interpolation to interpolate between nodes) in
Again, you can easily try different cost functions by editing the code yourself.
I find the graph search algorithm extremely satisfying as a first step, but also become quickly frustrated by the limitations of the discretization required to use it. In many cases, we can do better; coming up with algorithms which work more natively on continuous dynamical systems. We'll explore those extensions in this section.
It's important to understand that the value iteration equations,
equations (\ref{eq:value_update}) and (\ref{eq:policy_update}), are more
than just an algorithm. They are also sufficient conditions for
optimality: if we can produce a $J^*$ and $\pi^*$ which satisfy these
equations, then $\pi^*$ must be an optimal controller. There are an
analogous set of conditions for the continuous systems. For a system
$\dot{\bx} = f(\bx,\bu)$ and an infinite-horizon additive cost
$\int_0^\infty \ell(\bx,\bu)dt$, we have: \begin{gather} 0 = \min_\bu \left[
\ell(\bx,\bu) + \pd{J^*}{\bx}f(\bx,\bu) \right], \label{eq:HJB} \\
\pi^*(\bx) = \argmin_\bu \left[ \ell(\bx,\bu) + \pd{J^*}{\bx}f(\bx,\bu)
\right]. \end{gather} Equation \ref{eq:HJB} is known as the
Hamilton-Jacobi-Bellman (HJB) equation.
Consider a system $\dot{\bx}=f(\bx,\bu)$ and an infinite-horizon additive cost $\int_0^\infty \ell(\bx,\bu)dt$, with $f$ and $\ell$ continuous functions, and $\ell$ a strictly-positive-definite function that obtains zero only when $\bx=\bx^*$. Suppose $J(\bx)$ is a positive-definite solution to the HJB equation; that is $J$ is continuously differentiable in $\bx$ and is such that \[ 0 = \min_{\bu \in U} \left[\ell(\bx,\bu) + \pd{J}{\bx}f(\bx,\bu) \right],\quad \text{for all } \bx, \] and that $\pi(\bx)$ is the minimizer for all $\bx$. Then, under some technical conditions on the existence and boundedness of solutions, we have that $J(\bx) - J(\bx^*)$ is the optimal cost-to-go and $\pi$ is an optimal policy.
Given an open subset $\Omega\subset\mathbb R^n$, with a selected element $x^*$, a subset $U\subset\mathbb R^m$, continuous functions $f:~\Omega\times U\to\mathbb R^n$, $g:~\Omega\times U\to[0,\infty)$, continuously differentiable function $V:~\Omega\to[0,\infty)$, and a function $\mu:~\Omega\to U$, such that
Proof. First, observe that, for all piecewise continuous $(x,u)$ satisfying (\ref{e1}),
To finish the proof, for an arbitrary piecewise continuous solution of (\ref{e1}), equation (\ref{e3}), together with non-negativity of $g_V$, implies \[ J(x,u)=\int_0^Tg(x(t),u(t))dt\ge V(x_0)-V(x(T)).\] When $J(x,u)<\infty$, applying this to $T=T_k$, where $T_k$ are described in observation (II), and taking the limit as $k\to\infty$ (and $x(T_k)\to x^*$) yields $J(x,u)\ge V(x_0)-V(x^*)$. $\Box$
It is possible to prove sufficiency under different assumptions, too. The particular assumptions here were chosen to ensure that $J(\bx(0)) < \infty$ implies that $\bx(t) \rightarrow \bx^*$. As Sasha says, "without something like this, all sorts of counter-examples emerge."
As a tool for verifying optimality, the HJB equations are actually surprisingly easy to work with: we can verify optimality for an infinite-horizon objective without doing any integration; we simply have to check a derivative condition on the optimal cost-to-go function $J^*$. Let's see this play out on the double integrator example.
Consider the problem of regulating the double integrator (this time without input limits) to the origin using a quadratic cost: $$ \ell(\bx,\bu) = q^2 + \dot{q}^2 + u^2. $$ I claim (without derivation) that the optimal controller for this objective is $$\pi(\bx) = -q - \sqrt{3}\dot{q}.$$ To convince you that this is indeed optimal, I have produced the following cost-to-go function: $$J(\bx) = \sqrt{3} q^2 + 2 q \dot{q} + \sqrt{3} \dot{q}^2.$$
Taking \begin{gather*} \pd{J}{q} = 2\sqrt{3} q + 2\dot{q}, \qquad \pd{J}{\dot{q}} = 2q + 2\sqrt{3}\dot{q}, \end{gather*} we can write \begin{align*} \ell(\bx,\bu) + \pd{J}{\bx}f(\bx,\bu) &= q^2 + \dot{q}^2 + u^2 + (2\sqrt{3} q + 2\dot{q}) \dot{q} + (2q + 2\sqrt{3}\dot{q}) u \end{align*} This is a convex quadratic function in $u$, so we can find the minimum with respect to $u$ by finding where the gradient with respect to $u$ evaluates to zero. \[ \pd{}{u} \left[ \ell(\bx,\bu) + \pd{J}{\bx} f(\bx,\bu) \right] = 2u + 2q + 2\sqrt{3}\dot{q}. \] Setting this equal to $0$ and solving for $u$ yields: $$u^* = -q - \sqrt{3} \dot{q},$$ thereby confirming that our policy $\pi$ is in fact the minimizer. Substituting $u^*$ back into the HJB reveals that the right side does in fact simplify to zero. I hope you are convinced!
Note that evaluating the HJB for the time-to-go of the minimum-time problem for the double integrator will also reveal that the HJB is satisfied wherever that gradient is well-defined. This is certainly mounting evidence in support of our bang-bang policy being optimal, but since $\pd{J}{\bx}$ is not defined everywhere, it does not actually satisfy the requirements of the sufficiency theorem as stated above. In some sense, the assumption in the sufficiency theorem that $\pd{J}{\bx}$ is defined everywhere makes it very weak.
We still face a few barriers to actually using the HJB in an algorithm. The first barrier is the minimization over $u$. When the action set was discrete, as in the graph search version, we could evaluate the one-step cost plus cost-to-go for every possible action, and then simply take the best. For continuous action spaces, in general we cannot rely on the strategy of evaluating a finite number of possible $\bu$'s to find the minimizer.
All is not lost. In the quadratic cost double integrator example above, we were able to solve explicitly for the minimizing $\bu$ in terms of the cost-to-go. It turns out that this strategy will actually work for a number of the problems we're interested in, even when the system (which we are given) or cost function (which we are free to pick, but which should be expressive) gets more complicated.
Recall that I've already tried to convince you that a majority of the systems of interest are control affine, e.g. I can write \[ f(\bx,\bu) = f_1(\bx) + f_2(\bx)\bu. \] We can make another dramatic simplification by restricting ourselves to instantaneous cost functions of the form \[ \ell(\bx,\bu) = \ell_1(\bx) + \bu^T {\bf R} \bu, \qquad {\bf R}={\bf R}^T \succ 0. \] In my view, this is not very restrictive - many of the cost functions that I find myself choosing to write down can be expressed in this form. Given these assumptions, we can write the HJB as \[ 0 = \min_{\bu} \left[ \ell_1(\bx) + \bu^T {\bf R} \bu + \pd{J}{\bx} \left[ f_1(\bx) + f_2(\bx)\bu \right]\right]. \] Since this is a positive quadratic function in $\bu$, if the system does not have any constraints on $\bu$, then we can solve in closed-form for the minimizing $\bu$ by taking the gradient of the right-hand side: \[ \pd{}{\bu} = 2\bu^T {\bf R} + \pd{J}{\bx} f_2(\bx) = 0, \] and setting it equal to zero to obtain \[ \bu^* = -\frac{1}{2}{\bf R}^{-1}f_2^T(\bx) \pd{J}{\bx}^T.\] If there are linear constraints on the input, such as torque limits, then more generally this could be solved (at any particular $\bx$) as a quadratic program.
What happens in the case where our system is not control affine or if
we really do need to specify an instantaneous cost function on $\bu$
that is not simply quadratic? If the goal is to produce an iterative
algorithm, like value iteration, then one common approach is to make a
(positive-definite) quadratic approximation in $\bu$ of the HJB, and
updating that approximation on every iteration of the algorithm. This
broad approach is often referred to as differential dynamic
programming (c.f.
The other major barrier to using the HJB in a value iteration algorithm is that the estimated optimal cost-to-go function, $\hat{J}^*$, must somehow be represented with a finite set of numbers, but we don't yet know anything about the potential form it must take. In fact, knowing the time-to-goal solution for minimum-time problem with the double integrator, we see that this function might need to be non-smooth for even very simple dynamics and objectives.
One natural way to parameterize $\hat{J}^*$ -- a scalar valued-function defined over the state space -- is to define the values on a mesh. This approach then admits algorithms with close ties to the relatively very advanced numerical methods used to solve other partial differential equations (PDEs), such as the ones that appear in finite element modeling or fluid dynamics. One important difference, however, is that our PDE lives in the dimension of the state space, while many of the mesh representations from the other disciplines are optimized for two or three dimensional space. Also, our PDE may have discontinuities (or at least discontinuous gradients) at locations in the state space which are not known apriori.
A slightly more general view of the problem would describe the mesh (and the associated interpolation functions) as just one form of representations for function approximation. Using a neural network to represent the cost-to-go also falls under the domain of function approximation, perhaps representing the other extreme in terms of complexity; using neural networks in approximate dynamic programming is common in reinforcement learning, which we will discuss more later in the book.
If we approximate $J^*$ with a finitely-parameterized function $\hat{J}_\balpha^*$, with parameter vector $\balpha$, then this immediately raises many important questions:
Using the least squares solution in a value iteration update is sometimes referred to as fitted value iteration, where $\bx_k$ are some number of samples taken from the continuous space and for discrete-time systems the iterative approximate solution to \begin{gather*} J^*(\bx_0) = \min_{u[\cdot]} \sum_{n=0}^\infty \ell(\bx[n],\bu[n]), \\ \text{ s.t. } \bx[n+1] = f(\bx[n], \bu[n]), \bx[0] = \bx_0\end{gather*} becomes \begin{gather} J^d_k = \min_\bu \left[ \ell(\bx_k,\bu) + \hat{J}^*_\alpha\left({f(\bx_k,\bu)}\right) \right], \\ \balpha \Leftarrow \argmin_\balpha \sum_k \left(\hat{J}^*_\balpha(\bx_k) - J^d_k \right)^2. \label{eq:fitted_value_iteration} \end{gather} Since the desired values $J^d_k$ are only an initial guess of the cost-to-go, we will apply this algorithm iteratively until (hopefully) we achieve some numerical convergence.
Note that the update in \eqref{eq:fitted_value_iteration} is not quite the same as doing least-squares optimization of $$\sum_k \left(\hat{J}^*_\balpha(\bx_k) - \min_\bu \left[ \ell(\bx_k,\bu) + \hat{J}^*_\alpha\left({f(\bx_k,\bu)}\right) \right] \right)^2,$$ because in this equation $\alpha$ has an effect on both occurences of $\hat{J}^*$. In \eqref{eq:fitted_value_iteration}, we cut that dependence by taking $J_k^d$ as fixed desired values; this version performs better in practice. Like many things, this is an old idea that has been given a new name in the deep reinforcement learning literature -- people think of the $\hat{J}^*_\alpha$ on the right hand side as being the output from a fixed "target network". For nonlinear function approximators, the update to $\alpha$ in \eqref{eq:fitted_value_iteration} is often replaced with just one step of gradient descent.
Let us try reproducing our double-integrator value iteration examples using neural networks in PyTorch:
In general, the convergence and accuracy guarantees for value iteration with generic function approximators are quite weak. But we do have some results for the special case of linear function approximators. A linear function approximator takes the form: \[ \hat{J}^*_\balpha(\bx) = \sum_i \alpha_i \psi_i(\bx) = \bpsi^T(\bx) \balpha, \] where $\bpsi(\bx)$ is a vector of potentially nonlinear features. Common examples of features include polynomials, radial basis functions, or most interpolation schemes used on a mesh. The distinguishing feature of a linear function approximator is the ability to exactly solve for $\balpha$ in order to represent a desired function optimally, in a least-squares sense. For linear function approximators, this is simply: \begin{gather*} \balpha \Leftarrow \begin{bmatrix} \bpsi^T(\bx_1) \\ \vdots \\ \bpsi^T(\bx_K)\end{bmatrix}^+ \begin{bmatrix} J^d_1 \\ \vdots \\ J^d_K \end{bmatrix}, \end{gather*} where the $^+$ notation refers to a Moore-Penrose pseudoinverse. Remarkably, for linear function approximators, this update is still known to converge to the globally optimal $\balpha^*$.
Imagine that we use a mesh to approximate the cost-to-go function
over that state space with $K$ mesh points $\bx_k$. We would like to
perform the value iteration update: \begin{equation} \forall k,
\hat{J}^*(\bx_k) \Leftarrow \min_\bu \left[ \ell(\bx_k,\bu) +
\hat{J}^*\left({f(\bx_k,\bu)}\right) \right],
\label{eq:mesh_value_iteration} \end{equation} but must deal with the
fact that $f(\bx_k,\bu)$ might not result in a next state that is
directly at a mesh point. Most interpolation schemes for a mesh can be
written as some weighted combination of the values at nearby mesh
points, e.g. \[ \hat{J}^*(\bx) = \sum_i \beta_i(\bx) \hat{J}^*(\bx_i),
\quad \sum_i \beta_i = 1 \] with $\beta_i$ the relative weight of the
$i$th mesh point. In
For solutions to systems with continuous-time dynamics, I have to uncover one of the details that I've so far been hiding to keep the notation simpler. Let us consider a problem with a finite-horizon: \begin{gather*} \min_{\bu[\cdot]} \sum_{n=0}^N \ell(\bx[n],\bu[n]), \\ \text{ s.t. } \bx[n+1] = f(\bx[n], \bu[n]), \bx[0] = \bx_0\end{gather*} In fact, the way that we compute this is by solving the time-varying cost-to-go function backwards in time: \begin{gather*}J^*(\bx,N) = \min_\bu \ell(\bx, \bu) \\ J^*(\bx,n-1) = \min_\bu \left[ \ell(\bx, \bu) + J^*(f(\bx,\bu), n) \right]. \end{gather*} The convergence of the value iteration update is equivalent to solving this time-varying cost-to-go backwards in time until it reaches a steady-state solution (the infinite-horizon solution). Which explains why value iteration only converges if the optimal cost-to-go is bounded.
Now let's consider the continuous-time version. Again, we have a
time-varying cost-to-go, $J^*(\bx,t)$. Now $$\frac{dJ^*}{dt} =
\pd{J^*}{\bx}f(\bx,\bu) + \pd{J^*}{t},$$ and our sufficiency condition
is $$0 = \min_\bu \left[\ell(\bx, \bu) + \pd{J^*}{\bx}f(\bx,\bu) +
\pd{J^*}{t} \right].$$ But since $\pd{J^*}{t}$ doesn't depend on $\bu$,
we can pull it out of the $\min$ and write the (true) HJB:
$$-\pd{J^*}{t} = \min_\bu \left[\ell(\bx, \bu) + \pd{J^*}{\bx}f(\bx,\bu)
\right].$$ The standard numerical recipe
Probably most visible and consistent campaign using numerical HJB
solutions in applied control (at least in robotics) has come from Claire Tomlin's group
at Berkeley. Their work leverages Ian Mitchell's Level
Set Toolbox, which solves the Hamilton-Jacobi PDEs on a Cartesian
mesh using the technique cartooned above, and even includes the
minimum-time problem for the double integrator as a tutorial
example
There are many many nice extensions to the basic formulations that we've
presented so far. I'll try to list a few of the most important ones here.
I've also had a number of students in this course explore very interesting
extensions; for example
One of the most amazing features of the dynamic programming, additive cost approach to optimal control is the relative ease with which it extends to optimizing stochastic systems.
For discrete systems, we can generalize our dynamics on a graph by adding action-dependent transition probabilities to the edges. This new dynamical system is known as a Markov Decision Process (MDP), and we write the dynamics completely in terms of the transition probabilities \[\Pr(s[n+1] = s' | s[n] = s, a[n] = a). \] For discrete systems, this is simply a big lookup table. The cost that we incur for any execution of system is now a random variable, and so we formulate the goal of control as optimizing the expected cost, e.g. \begin{equation} J^*(s[0]) = \min_{a[\cdot]} E \left[ \sum_{n=0}^\infty \ell(s[n],a[n]) \right]. \label{eq:stochastic_dp_optimality_cond} \end{equation} Note that there are many other potential objectives, such as minimizing the worst-case error, but the expected cost is special because it preserves the dynamic programming recursion: \[ J^*(s) = \min_a E \left[\ell(s,a) + J^*(s')\right] = \min_a \left[ \ell(s,a) + \sum_{s'} \Pr(s'|s,a) J^*(s') \right].\] Remarkably, if we use these optimality conditions to construct our value iteration algorithm \[ \hat{J}(s) \Leftarrow \min_a \left[ \ell(s,a) + \sum_{s'} \Pr(s'|s,a) \hat{J}(s') \right],\] then this algorithm has the same strong convergence guarantees of its counterpart for deterministic systems. And it is essentially no more expensive to compute!
There is a particularly nice observation to be made here. Let's assume that we have discrete control inputs and discrete-time dynamics, but a continuous state space. Recall the fitted value iteration on a mesh algorithm described above. In fact, the resulting update is exactly the same as if we treated the system as a discrete state MDP with $\beta_i$ representing the probability of transitioning to state $x_i$! This sheds some light on the impact of discretization on the solutions -- discretization error here will cause a sort of diffusion corresponding to the probability of spreading across neighboring nodes.
This is a preview of a much more general toolkit that we develop later for stochastic and robust control.
For discrete MDPs, value iteration is a magical algorithm because it is simple, but known to converge to the global optimal, $J^*$. However, other important algorithms are known; one of the most important is a solution approach using linear programming. This formulation provides an alternative view, but may also be more generalizable and even more efficient for some instances of the problem.
Recall the optimality conditions from Eq. \eqref{eq:value_update}. If we describe the cost-to-go function as a vector, $J_i = J(s_i)$, then these optimality conditions can be rewritten in vector form as \begin{equation} \bJ = \min_a \left[ {\bf \ell}(a) + \bT(a) \bJ \right], \label{eq:vector_stochastic_dp} \end{equation} where $\ell_i(a) = \ell(s_i,a)$ is the cost vector, and $T_{i,j}(a) = \Pr(s_j|s_i,a)$ is the transition probability matrix. Let us take $\bJ$ as the vector of decision variables, and replace Eq. (\ref{eq:vector_stochastic_dp}) with the constraints: \begin{equation} \forall a, \bJ \le {\bf \ell}(a) + \bT(a) \bJ.\end{equation} Clearly, for finite $a$, this is finite list of linear constraints, and for any vector $\bJ$ satisfying these constraints we have $\bJ \le \bJ^*$ (elementwise). Now write the linear program: \begin{gather*} \maximize_\bJ \quad \bc^T \bJ, \\ \subjto \quad \forall a, \bJ \le {\bf \ell}(a) + \bT(a) \bJ, \end{gather*} where $c$ is any positive vector. The solution to this problem is $\bJ = \bJ^*$.
Perhaps even more interesting is that this approach can be generalized
to linear function approximators. Taking a vector form of my notation
above, now we have a matrix of features with $\bPsi_{i,j} =
\psi^T_j(\bx_i)$ and we can write the LP \begin{gather} \maximize_\balpha
\quad \bc^T \bPsi \balpha, \\ \subjto \quad \forall a, \bPsi \balpha \le
{\bf \ell}(a) + \bT(a) \bPsi \balpha. \end{gather} This time the solution is not
necessarily optimal, because $\bPsi \balpha^*$ only approximates $\bJ^*$,
and the relative values of the elements of $\bc$ (called the
"state-relevance weights") can determine the relative tightness of the
approximation for different features
Let's consider the discrete time, state, and action version of value iteration: \begin{equation} \hat{J}^*(s_i) \Leftarrow \min_{a \in A} \left[ \ell(s_i,a) + \hat{J}^*\left({f(s_i,a)}\right) \right],\end{equation} which finds the optimal policy for the infinite horizon cost function $\sum_{n=0}^\infty \ell(s[n],a[n]).$ If $J^*$ is the true optimal cost-to-go, show that any solution $\hat{J}^* = J^* + c$, where $c$ is any constant scalar, is a fixed point of this value iteration update. Is the controller still optimal, even if the estimated cost-to-go function is off by a constant factor?
Surprisingly, adding a discount factor can help with this. Consider the infinite-horizon discounted cost: $\sum_{n=0}^\infty \gamma^n \ell(s[n],a[n])$, where $0 < \gamma \le 1$ is the discount factor. The corresponding value iteration update is \begin{equation} \hat{J}^*(s_i) \Leftarrow \min_{a \in A} \left[ \ell(s_i,a) + \gamma\hat{J}^*\left({f(s_i,a)}\right) \right].\end{equation} For simplicity, assume that there exists a state $s_i$ that is a zero-cost fixed point under the optimal policy; e.g. $\ell(s_i, \pi^*(s_i)) = 0$ and $f(s_i, \pi^*(s_i)) = s_i$. Is $\hat{J}^* = J^* + c$ still a fixed point of the value iteration update when $\gamma < 1$? Show your work.
The figure above shows an autonomous car moving at constant velocity $v>0$ on a straight road. Let $x$ be the (longitudinal) position of the car along the road, $y$ its (transversal) distance from the centerline, and $\theta$ the angle between the centerline and the direction of motion. The only control action is the steering velocity $u$, which is constrained in the interval $[u_{\text{min}}, u_{\text{max}}]$ (where $u_{\text{min}}<0$ and $u_{\text{max}}>0$). We describe the car dynamics with the simple kinematic model \begin{align*}\dot x &= v \cos \theta, \\ \dot y &= v \sin \theta, \\ \dot \theta &= u.\end{align*} Let $\bx = [x, y, \theta]^T$ be the state vector. To optimize the car trajectory we consider a quadratic objective function $$J = \int_{0}^{\infty} [\bx^T(t) \bQ \bx(t) + R u^2(t)] dt,$$ where $\bQ$ is a constant positive-semidefinite (hence symmetric) matrix and $R$ is a constant nonnegative scalar (note that $R=0$ is allowed here).
In this exercise we will see how seemingly simple cost functions can give surprising results. Consider the single-integrator system $\dot x = u$ with initial state $x(0)=0$. We would like to find the control signal $u(t)$ that minimizes the seemingly innocuous cost function $$J = \int_0^T x^2(t) + (u^2(t) - 1)^2 dt,$$ with $T$ finite. To this end, we consider a square-wave control parameterized by $\tau>0$: $$u_\tau(t) = \begin{cases} 1 &\text{if} & t \in [0, \tau) \cup [3 \tau, 5 \tau) \cup [7 \tau, 9 \tau) \cup \cdots \\ -1 &\text{if} & t \in [\tau, 3 \tau) \cup [5 \tau, 7 \tau) \cup [9 \tau, 11 \tau) \cup \cdots \end{cases}.$$
Consider the scalar control differential equation $$\dot{x} = x + u,$$ and the infinite horizon cost function $$J = \int_0^{\infty} [3x^2(t) + u^2(t)] dt.$$ As we will see in the chapter on linear-quadratic regulation, the optimal cost-to-go for a problem of this kind assumes the form $J^* = S x^2$. It is not hard to see that this, in turn, implies that the optimal controller has the form $u^* = - K x$.
In this exercise we analyze the performance of the value-iteration algorithm, considering its application to the minimum time problem for the double integrator. In this python notebook, you will find everything you need for this analysis. Take the time to go through the notebook and understand the code in it, then answer the following questions.
In this exercise we will implement fitted value iteration with a neural network, and use it to generate a controler for the double integrator. In this python notebook, you will need to implement a few things. Take the time to go through the notebook and understand the code in it, and then answer the following questions. The written questions will also be listed in the notebook for your convenience.
X
before performing the fit. Note that this does not include Xnext
.G
before performing the fit. Assume we are using the quadratic cost.Jd, ind = torch.min(G + discount*Jnext, dim=1)
.
Previous Chapter | Table of contents | Next Chapter |