Algorithms for Walking, Running, Swimming, Flying, and Manipulation
© Russ Tedrake, 2024
Last modified 2024-11-18.
Note: These are working notes used for a course being taught at MIT. They will be updated throughout the Spring 2024 semester. Lecture videos are available on YouTube.
In chapter 2, we spent some time thinking about the phase portrait of the simple pendulum, and concluded with a challenge: can we design a nonlinear controller to reshape the phase portrait, with a very modest amount of actuation, so that the upright fixed point becomes globally stable? With enough actuators and unbounded torque, feedback-cancellation solutions (e.g., invert gravity) can work well, but can also require an unnecessarily large amount of control effort. The energy-based swing-up control solutions presented for the Acrobot and cart-pole systems are considerably more appealing, but required some cleverness and might not scale to more complicated systems. Here we investigate another approach to the problem, using computational optimal control to synthesize a feedback controller directly.
In this chapter, we will introduce optimal control. Reinforcement
learning is the machine learning name for optimal control; we'll
discuss that machine learning perspective later in the notes. Optimal
control is powerful for a number of reasons. First and foremost, it is very
general - allowing us to specify the goal of control equally well for
fully-actuated or underactuated, linear or nonlinear, deterministic or
stochastic, and continuous or discrete systems. Second, it permits concise
descriptions of potentially very complex desired behaviors, specifying the
goal of control as a scalar objective (plus a list of constraints).
Finally, and most importantly, optimal control is very amenable to
numerical solutions.
The fundamental idea in optimal control is to formulate the goal of control as the long-term optimization of a scalar cost function. Let's introduce the basic concepts by considering a system that is even simpler than the simple pendulum.
Consider the "double integrator" system
In order to formulate this control design problem using optimal control, we
must define a scalar objective which scores the long-term performance of running
each candidate control policy,
Note that the input limits, $|u| \le 1$, are an important part of the problem specification; without them, the brick could be driven to the origin arbitrarily quickly.
Optimal control has a long history in robotics. It started out motivated by achieving peak performance in niche applications like programming minimum-time paths for a pick-and-place robot in a factory. But it is the generality of the approach (one can program almost any task using similar tools) that has made optimization-based and now machine-learning-based control dominate the way that state-of-the-art robot control systems are developed.
For more intuition, let's do an informal derivation of the solution to
the minimum time problem for the double integrator with input constraints:
Your intuition might tell you that the best thing that the brick can do, to reach the goal in minimum time with limited control input, is to accelerate maximally towards the goal until reaching a critical point, then hit the brakes in order to come to a stop exactly at the goal. This is called a bang-bang control policy; these policies are often optimal for systems with bounded inputs, and it is in fact optimal for the double integrator, although we will not prove it until we have developed more tools.
Let's work out the details of this bang-bang policy. First, we can
figure out the states from which, when the brakes are fully applied, the
system comes to rest precisely at the origin. Let's start with the case
where
Similarly, the solutions for
Perhaps the most important of these trajectories are the ones that pass
directly through the origin (e.g.,
We can illustrate some of the optimal solution trajectories:
And for completeness, we can compute the optimal time to the goal by
solving for the amount of time required to reach the switching surface
plus the amount of time spent moving along the switching surface to the
goal. With a little algebra, you will find that the time to the goal,
Notice that the "time-to-the-origin function" is continuous over state (though not smooth), even though the policy is discontinuous.
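To make this concrete, here is a minimal Python sketch of the bang-bang policy and a quick rollout (the function and variable names are my own, and I'm assuming the input limit $|u| \le 1$):

```python
import numpy as np

u_max = 1.0  # input limit, |u| <= 1

def bang_bang_policy(q, qdot):
    """Minimum-time policy for the double integrator, qddot = u, |u| <= 1.

    The switching surface q = -0.5 * qdot * |qdot| is the set of states from
    which applying full braking brings the brick to rest exactly at the origin.
    """
    s = q + 0.5 * qdot * np.abs(qdot)      # which side of the switching surface are we on?
    if np.isclose(q, 0.0) and np.isclose(qdot, 0.0):
        return 0.0                          # already at the goal
    if np.isclose(s, 0.0):
        return -u_max * np.sign(qdot)       # on the surface: keep braking
    return -u_max * np.sign(s)              # otherwise, accelerate toward the surface

# Forward-Euler rollout from x0 = (1, 0): the brick accelerates to the left, then
# brakes along the parabolic switching surface and (approximately) stops at the origin.
q, qdot, dt = 1.0, 0.0, 1e-3
for _ in range(int(3.0 / dt)):
    u = bang_bang_policy(q, qdot)
    q, qdot = q + dt * qdot, qdot + dt * u
print(f"final state: q = {q:.3f}, qdot = {qdot:.3f}")
```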
As we begin to develop theoretical and algorithmic tools for optimal
control, we will see that some formulations are much easier to deal with
than others. One important example is the dramatic simplification that
can come from formulating objective functions using additive
cost, because they often yield recursive solutions. In the additive
cost formulation, the long-term objective for a trajectory can be written
as
The "quadratic cost" formulation we considered in the double
integrator example above is clearly written as an additive cost. At
first glance, our "minimum-time" formulation doesn't appear to be of this
form, but we actually can write it as an additive cost problem using an
infinite horizon and the instantaneous cost
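As a sketch of the notation (I'm writing the instantaneous cost as $\ell(x,u)$; the symbols here are assumptions for this sketch rather than quotes from the equations above), the additive-cost objective and the minimum-time running cost look like this:

```latex
J = \sum_{n=0}^{N} \ell\big(x[n], u[n]\big)
  \quad\text{or}\quad
J = \int_0^{T} \ell\big(x(t), u(t)\big)\,dt,
\qquad
\ell_{\text{min-time}}(x,u) =
  \begin{cases} 1 & x \neq 0, \\ 0 & x = 0. \end{cases}
```

With this running cost and an infinite horizon, the accumulated cost of a trajectory is exactly the time taken to reach the origin.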
We will examine a number of approaches to solving optimal control problems throughout the next few chapters. For the remainder of this chapter, we will focus on additive-cost problems and their solution via dynamic programming.
For systems with continuous states and continuous actions, dynamic programming is a set of theoretical ideas surrounding additive-cost optimal control problems. For systems with a finite, discrete set of states and a finite, discrete set of actions, dynamic programming also represents a set of very efficient numerical algorithms which can compute optimal feedback controllers. Many of you will have seen it introduced before as a tool for graph search.
Imagine you have a directed graph with states (or nodes)
Let us also assume that each edge has an associated weight or cost, using
There are many algorithms for finding (or approximating) the optimal
path from a start to a goal on directed graphs. In dynamic programming,
the key insight is that we can find the shortest path from every node by
solving recursively for the optimal cost-to-go (the cost that will
be accumulated when running the optimal controller), which we'll denote
Value iteration is a remarkably simple algorithm, but it accomplishes
something quite amazing: it efficiently computes the long-term cost of an
optimal policy from every state by iteratively evaluating the
one-step cost. If we know the optimal cost-to-go, then it's easy to extract
the optimal policy,
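Here is a minimal sketch of both steps -- the value-iteration update and the policy extraction -- on a tiny deterministic graph (the four-state chain, its costs, and the variable names are all made up for illustration):

```python
import numpy as np

# A tiny deterministic "graph": four states on a chain, two actions
# (0 = stay, 1 = step right), with state 3 as a zero-cost absorbing goal.
next_state = np.array([[0, 1], [1, 2], [2, 3], [3, 3]])             # s' = f(s, a)
cost = np.array([[1.0, 1.0], [1.0, 1.0], [1.0, 1.0], [0.0, 0.0]])   # l(s, a)

J = np.zeros(cost.shape[0])              # estimate of the optimal cost-to-go
for _ in range(100):                     # iterate the Bellman update until convergence
    Q = cost + J[next_state]             # Q[s, a] = l(s, a) + J(f(s, a))
    J_new = Q.min(axis=1)
    if np.allclose(J, J_new):
        break
    J = J_new

policy = Q.argmin(axis=1)                # extract the greedy (optimal) policy
print("J* =", J, " policy =", policy)
```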
Imagine a robot living in a grid (finite state) world
As always, I encourage you to run the notebook and play with the code yourself:
You can run value iteration for the double integrator (using
barycentric interpolation to discretize the continuous state space, as
we'll describe below) in
Please do take some time to try different cost functions by editing the code yourself.
Let's take a minute to appreciate how amazing this is. Our solution to finding the optimal controller for the double integrator wasn't all that hard, but it required some mechanical intuition and solutions to differential equations. The resulting policy was non-trivial -- bang-bang control with a parabolic switching surface. The value iteration algorithm doesn't use any of this directly -- it's a simple algorithm for graph search. But remarkably, it can generate effectively the same policy with just a few moments of computation.
It's important to note that there are some differences between the computed policy and the optimal policy that we derived, due to discretization errors. We will ask you to explore these in the exercises.
The real value of this numerical solution, however, is that, unlike our analytical solution for the double integrator, we can apply this same algorithm to any number of dynamical systems virtually without modification. Let's apply it now to the simple pendulum, which was intractable analytically.
You can run value iteration for the simple pendulum (again using
barycentric interpolation) in
Again, you can easily try different cost functions by editing the code yourself.
I find the graph search algorithm extremely satisfying as a first step, but also become quickly frustrated by the limitations of the discretization required to use it. In many cases we can do better, by developing algorithms that work more natively on continuous dynamical systems. We'll explore those extensions in this section.
It's important to understand that the value iteration equations,
equations (
Consider a system
Given an open subset
Proof.
First, observe that, for all piecewise continuous
To finish the proof, for an arbitrary piecewise continuous
solution of (
It is possible to prove sufficiency under different assumptions, too.
The particular assumptions here were chosen to ensure that
As a tool for verifying optimality, the HJB equations are actually
surprisingly easy to work with: we can verify optimality for an
infinite-horizon objective without doing any integration; we simply have
to check a derivative condition on the optimal cost-to-go function
Consider the problem of regulating the double integrator (this time
without input limits) to the origin using a quadratic cost:
Taking
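To make the verification step concrete, here is a small sympy sketch. I'm assuming the quadratic cost $\ell(x,u) = q^2 + \dot{q}^2 + u^2$ and the candidate cost-to-go $J(x) = \sqrt{3}\,q^2 + 2q\dot{q} + \sqrt{3}\,\dot{q}^2$ (these particular expressions are my illustrative choices); the script simply checks that minimizing over $u$ drives the HJB right-hand side to zero.

```python
import sympy as sp

q, qd, u = sp.symbols('q qd u', real=True)

# Candidate optimal cost-to-go and instantaneous cost (assumed for this sketch).
J = sp.sqrt(3) * q**2 + 2 * q * qd + sp.sqrt(3) * qd**2
ell = q**2 + qd**2 + u**2

# HJB right-hand side for the double integrator (qddot = u):
#   l(x, u) + dJ/dq * qdot + dJ/dqdot * u
hjb = ell + sp.diff(J, q) * qd + sp.diff(J, qd) * u

# Minimize over u by setting d(hjb)/du = 0 (the expression is convex in u).
u_star = sp.solve(sp.diff(hjb, u), u)[0]
print("u* =", sp.simplify(u_star))                        # expect -q - sqrt(3)*qd
print("HJB residual =", sp.simplify(hjb.subs(u, u_star)))  # expect 0
```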
Note that evaluating the HJB for the time-to-go of the minimum-time
problem for the double integrator will also reveal that the HJB is
satisfied wherever that gradient is well-defined. This is certainly
mounting evidence in support of our bang-bang policy being optimal, but
since
We still face a few barriers to actually using the HJB in an algorithm.
The first barrier is the minimization over
All is not lost. In the quadratic cost double integrator example
above, we were able to solve explicitly for the minimizing
Recall that I've already tried to convince you that a majority of the
systems of interest are control affine, e.g. I can write
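As a sketch of why this structure helps (assuming dynamics $\dot{x} = f_1(x) + f_2(x)u$ and an instantaneous cost that is quadratic in $u$, say $\ell(x,u) = \ell_1(x) + u^T R u$ with $R = R^T \succ 0$; these symbols are my shorthand), the minimization in the HJB can be performed in closed form:

```latex
0 = \min_u \left[ \ell_1(x) + u^T R u
      + \frac{\partial J}{\partial x}\big(f_1(x) + f_2(x)\,u\big) \right]
\quad\Longrightarrow\quad
u^* = -\frac{1}{2} R^{-1} f_2^T(x) \left(\frac{\partial J}{\partial x}\right)^T,
```

since the bracketed expression is a strictly convex quadratic in $u$.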
Exploiting the ability to solve for the optimal action in closed form
is also nice for generating benchmark problems for numerical optimal
control, as it can be used for "converse optimal control"
Consider a one-dimensional system of the form
What happens in the case where our system is not control affine or if
we really do need to specify an instantaneous cost function on
The other major barrier to using the HJB in a value iteration
algorithm is that the estimated optimal cost-to-go function,
One natural way to parameterize
These days, the other natural choice is to represent
If we approximate
Let us start by considering a least-squares approximation of the value iteration update.
Using the least squares solution in a value iteration update is
sometimes referred to as fitted value iteration, where
Note that the update in
In general, the convergence and accuracy guarantees for value
iteration with generic function approximators are quite weak. But we
do have some results for the special case of linear function
approximators. A linear function approximator takes the form:
In finite state and action settings, fitted value iteration with
linear function approximators is known to converge to the globally
optimal
Imagine that we use a mesh to approximate the cost-to-go function
over that state space with
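Here is a sketch of fitted value iteration with a linear function approximator for a discrete-time double integrator. For brevity I use simple polynomial features rather than mesh/barycentric features, and the action grid, cost, and discount are illustrative choices of mine:

```python
import numpy as np

# Fitted value iteration for a discrete-time double integrator,
# x[n+1] = A x[n] + B u[n] (Euler discretization), using a linear function
# approximator J_hat(x) = phi(x)^T alpha.
dt, gamma = 0.05, 0.9
A = np.array([[1.0, dt], [0.0, 1.0]])
B = np.array([[0.0], [dt]])

def phi(X):
    """Quadratic monomial features [q^2, q*qd, qd^2, q, qd, 1] for each row of X."""
    q, qd = X[:, 0], X[:, 1]
    return np.stack([q**2, q * qd, qd**2, q, qd, np.ones_like(q)], axis=1)

def running_cost(X, u):
    return dt * (X[:, 0]**2 + X[:, 1]**2 + u**2)

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 2))        # sample states
actions = np.linspace(-1, 1, 9)              # discretized action set
alpha = np.zeros(6)

for _ in range(100):
    # One-step lookahead targets: J_d(x) = min_u [ l(x, u) + gamma * J_hat(f(x, u)) ]
    Q = np.stack([running_cost(X, u) + gamma * (phi(X @ A.T + u * B.T) @ alpha)
                  for u in actions], axis=1)
    J_target = Q.min(axis=1)
    # Least-squares fit of the parameters to the new targets.
    alpha, *_ = np.linalg.lstsq(phi(X), J_target, rcond=None)

print("fitted parameters:", np.round(alpha, 2))
```

The key point is that the fit in the last line of the loop is an ordinary least-squares problem, which we can solve exactly at every iteration.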
Of course the most popular function approximators today are based on
neural networks. In these much more general approximators, we cannot
solve the least-squares update in closed form; instead, we take a few steps of gradient descent on the same least-squares objective at each iteration.
What sort of neural architectures work best for value iteration in
robotics? AlphaGo famously used convolutional
networks
When it comes to neural network software, I'll mostly use PyTorch throughout these notes. But for
small problems (with small neural nets and modest batch sizes during
training) like the ones we'll consider here, the cost of going back and
forth to PyTorch and especially the GPU outweighs most of the benefits.
I've also implemented a simple CPU-based
MultilayerPerceptron
in Drake's Systems
framework, and will use it when it
makes sense. My goal is to provide examples of both approaches so that
you can pick and choose.
Let us try reproducing our double-integrator value iteration examples using neural networks:
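Here is a minimal PyTorch sketch in that spirit, for the discrete-time double integrator (the architecture, cost, discount, and hyperparameters are all illustrative choices of mine, not the settings used in the notebook): instead of a closed-form least-squares solve, each iteration takes a gradient step toward the one-step lookahead targets.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
dt, gamma = 0.05, 0.99
A = torch.tensor([[1.0, dt], [0.0, 1.0]])     # discrete-time double integrator (Euler)
B = torch.tensor([0.0, dt])
actions = torch.linspace(-1.0, 1.0, 9)

# A small MLP approximating the cost-to-go, J_hat(x).
J_hat = nn.Sequential(nn.Linear(2, 64), nn.Tanh(),
                      nn.Linear(64, 64), nn.Tanh(),
                      nn.Linear(64, 1))
optimizer = torch.optim.Adam(J_hat.parameters(), lr=1e-3)

for _ in range(2000):
    X = 6.0 * torch.rand(256, 2) - 3.0         # sample a batch of states in [-3, 3]^2
    with torch.no_grad():                      # targets use the current ("frozen") J_hat
        Q = []
        for u in actions:
            X_next = X @ A.T + u * B
            ell = dt * (X[:, :1]**2 + X[:, 1:]**2 + u**2)
            Q.append(ell + gamma * J_hat(X_next))
        J_target = torch.stack(Q, dim=-1).min(dim=-1).values
    loss = nn.functional.mse_loss(J_hat(X), J_target)   # regress J_hat toward the targets
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```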
There is nice work on continuous neural fitted value iteration in
For solutions to systems with continuous-time dynamics, I have to
uncover one of the details that I've so far been hiding to keep the
notation simpler. Let us consider a problem with a finite-horizon:
The way that we compute this is by solving the
time-varying cost-to-go function backwards in time:
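For the discrete-time, finite-state version, that backward pass is just a loop over time; here is a minimal sketch (the tiny chain example and its terminal cost are made up for illustration):

```python
import numpy as np

# Finite-horizon, discrete-state/action dynamic programming: solve the time-varying
# cost-to-go J[n](s) backwards in time, starting from a terminal cost at n = N.
N = 10
next_state = np.array([[0, 1], [1, 2], [2, 3], [3, 3]])             # s' = f(s, a)
cost = np.array([[1.0, 1.0], [1.0, 1.0], [1.0, 1.0], [0.0, 0.0]])
terminal_cost = np.array([3.0, 2.0, 1.0, 0.0])
num_states = cost.shape[0]

J = np.zeros((N + 1, num_states))
policy = np.zeros((N, num_states), dtype=int)
J[N] = terminal_cost
for n in range(N - 1, -1, -1):                   # backwards in time
    Q = cost + J[n + 1][next_state]              # Q[s, a] = l(s, a) + J[n+1](f(s, a))
    J[n] = Q.min(axis=1)
    policy[n] = Q.argmin(axis=1)                 # the optimal policy is time-varying

print("J at n = 0:", J[0])
```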
Now let's consider the continuous-time version. Again, we have a
time-varying cost-to-go,
Probably the most visible and consistent campaign using numerical HJB
solutions in applied control (at least in robotics) has come from Claire Tomlin's group
at Berkeley. Their work leverages Ian Mitchell's Level
Set Toolbox, which solves the Hamilton-Jacobi PDEs on a Cartesian
mesh using the technique cartooned above, and even includes the
minimum-time problem for the double integrator as a tutorial
example
There are many, many nice extensions to the basic formulations that we've
presented so far. I'll try to list a few of the most important ones here.
I've also had a number of students in this course explore very interesting
extensions; for example
Throughout this chapter, we have focused primarily on minimizing an infinite-horizon sum of running costs. Whenever you write an infinite sum (or integral), one should immediately question whether that sum converges to a finite value. For many control problems, it does -- for instance, if we set the cost to be zero at the goal and the controller can drive the system to the goal in finite time, or exponentially at a rate that is faster than cost is accrued. However, once we enter the realm of approximate optimal control, we can run into problems. For instance, if we form a discrete state and action approximation of the dynamics, it may not be possible for the discretized system to arrive exactly at the zero-cost goal.
A common way to protect oneself from this is to add an exponential
discount factor to the integral cost. This is a natural choice since it
preserves the recursive structure of the Bellman equation. In discrete
time, it takes the form
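As a sketch (with $\gamma \in (0,1)$ denoting the discount factor; the symbol is my choice), the discounted objective and the corresponding value-iteration update are:

```latex
J^\pi(s_0) = \sum_{n=0}^{\infty} \gamma^n\, \ell\big(s[n], \pi(s[n])\big),
\qquad
\hat{J}(s_i) \leftarrow \min_{a} \Big[ \ell(s_i, a) + \gamma\, \hat{J}\big(f(s_i, a)\big) \Big].
```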
A word of warning: this discount factor does change the optimal policy, and it is not without its pitfalls. We'll work out a specific case of the discounted cost for the LQR problem, and see that the discounted controller is exactly a controller which assumes that the plant is more stable than it actually is. It is even possible for the optimal controller for the discounted cost to leave the system unstable.
A more robust alternative is to consider an average-cost formulation... (details coming soon!)
One of the most amazing features of the dynamic programming, additive cost approach to optimal control is the relative ease with which it extends to optimizing stochastic systems.
For discrete systems, we can generalize our dynamics on a graph by
adding action-dependent transition probabilities to the edges. This new
dynamical system is known as a Markov Decision Process (MDP), and we
write the dynamics completely in terms of the transition probabilities
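Value iteration carries over almost unchanged: the one-step lookahead simply becomes an expectation over those transition probabilities. Here is a sketch on a made-up three-state MDP (the transition model, costs, and discount are purely illustrative):

```python
import numpy as np

# Value iteration for a small MDP, with transition probabilities T[a, s, s'] = P(s' | s, a).
gamma = 0.9
T = np.array([[[0.9, 0.1, 0.0],     # action 0: mostly stay put
               [0.1, 0.8, 0.1],
               [0.0, 0.0, 1.0]],
              [[0.2, 0.8, 0.0],     # action 1: push toward the goal state 2
               [0.0, 0.2, 0.8],
               [0.0, 0.0, 1.0]]])
cost = np.array([[1.0, 2.0],        # cost[s, a]; the absorbing goal state 2 is cost-free
                 [1.0, 2.0],
                 [0.0, 0.0]])

J = np.zeros(3)
for _ in range(1000):
    expected_next = (T @ J).T                # E_{s'}[J(s')] for each (s, a); shape (3, 2)
    Q = cost + gamma * expected_next         # stochastic Bellman backup
    J_new = Q.min(axis=1)
    if np.allclose(J, J_new, atol=1e-9):
        break
    J = J_new

print("J* =", np.round(J, 3), " policy =", Q.argmin(axis=1))
```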
We will continue to develop a much more general toolkit for stochastic and robust control later in the notes.
There is a particularly nice observation to be made here. Let's
assume that we have discrete control inputs and discrete-time dynamics,
but a continuous state space. Recall the fitted value iteration on a
mesh algorithm described above. In fact, the resulting update is
exactly the same as if we treated the system as a discrete state MDP
with
For discrete MDPs, value iteration is a magical algorithm because it
is simple, but known to converge to the global optimal,
Recall the optimality conditions from Eq.
Perhaps even more interesting is that this approach can be
generalized to linear function approximators. Taking a vector form of
my notation above, now we have a matrix of features with
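Here is a sketch of the tabular version of that linear program, using scipy and the same toy chain graph as in the value-iteration sketch above (I've added a discount factor to keep the LP bounded; all specific numbers are illustrative):

```python
import numpy as np
from scipy.optimize import linprog

# Linear-programming form of dynamic programming on the toy chain graph:
#   maximize   sum_s J(s)
#   subject to J(s) <= l(s, a) + gamma * J(f(s, a))   for every state s and action a.
gamma = 0.9
next_state = np.array([[0, 1], [1, 2], [2, 3], [3, 3]])
cost = np.array([[1.0, 1.0], [1.0, 1.0], [1.0, 1.0], [0.0, 0.0]])
num_states, num_actions = cost.shape

A_ub, b_ub = [], []
for s in range(num_states):
    for a in range(num_actions):
        row = np.zeros(num_states)
        row[s] += 1.0
        row[next_state[s, a]] -= gamma       # J(s) - gamma * J(f(s, a)) <= l(s, a)
        A_ub.append(row)
        b_ub.append(cost[s, a])

# linprog minimizes, so maximize sum(J) by minimizing -sum(J); J is constrained
# only by the Bellman inequalities, so leave the variable bounds open.
res = linprog(c=-np.ones(num_states), A_ub=np.array(A_ub), b_ub=np.array(b_ub),
              bounds=[(None, None)] * num_states)
print("J* =", np.round(res.x, 3))
```

At the optimum, each state's constraint is tight for at least one action, which is exactly the Bellman equation.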
There are some cases where the linear programming approach to dynamic programming can be extended to continuous states and actions, with strong guarantees. This approach follows naturally from our study of Lyapunov functions in an upcoming chapter, and builds on the computational tools we develop there. Feel free to skip ahead.
Let's consider the discrete time, state, and action version of value
iteration:
Surprisingly, adding a discount factor can help with this. Consider
the infinite-horizon discounted cost:
The figure above shows an autonomous car moving at constant velocity
In this exercise we will see how seemingly simple cost functions can
give surprising results. Consider the single-integrator system
Consider the scalar control differential equation
In this exercise we analyze the performance of the value-iteration algorithm, considering its application to the minimum time problem for the double integrator. In this python notebook, you will find everything you need for this analysis. Take the time to go through the notebook and understand the code in it, then answer the following questions.
In this exercise we solve the dynamic programming using a linear program, considering its application to the grid world. In this python notebook, you will write a linear program for solving the optimal cost-to-go in the grid world.
In this exercise we will implement value iteration for a continuous-time system with continuous actions and states. We will achieve this by training a neural network to approximate our value function. You will work exclusively in this python notebook to implement the algorithms described in the Continuous Dynamic Programming section.