Algorithms for Walking, Running, Swimming, Flying, and Manipulation

© Russ Tedrake, 2023

Last modified .

How to cite these notes, use annotations, and give feedback.

**Note:** These are working notes used for a course being taught
at MIT. They will be updated throughout the Spring 2023 semester. Lecture videos are available on YouTube.

Previous Chapter | Table of contents | Next Chapter |

The view of dynamics and controls taken in these notes builds heavily on
tools from optimization -- and our success in practice depends heavily on the
effective application of numerical optimization. There are many excellent
books on optimization, for example

The intention of this chapter is, therefore, mainly to provide a launchpad and to address some topics that are particularly relevant in robotics but might be more hidden in the existing general treatments. You can get far very quickly as a consumer of these tools with just a high-level understanding. But, as with many things, sometimes the details matter and it becomes important to understand what is going on under the hood. We can often formulate an optimization problem in multiple ways that might be mathematically equivalent, but perform very differently in practice.

Some of the algorithms from optimization are quite simple to implement yourself; stochastic gradient descent is perhaps the classic example. Even more of them are conceptually fairly simple, but for some of the algorithms, the implementation details matter a great deal -- the difference between an expert implementation and a novice implementation of the numerical recipe, in terms of speed and/or robustness, can be dramatic. These packages often use a wealth of techniques for numerically conditioning the problems, for discarding trivially valid constraints, and for warm-starting optimization between solves. Not all solvers support all types of objectives and constraints, and even if we have two commercial solvers that both claim to solve, e.g. quadratic programs, then they might perform differently on the particular structure/conditioning of your problem. In some cases, we also have nicely designed open-source alternatives to these commercial solvers (often written by academics who are experts in optimization) -- sometimes they compete well with the commercial solvers or even outperform them in particular problem types.

As a result, there are also a handful of software packages that attempt
to provide a layer of abstraction between the formulation of a mathematical
program and the instantiation of that problem in each of the particular
solvers. Famous examples of this include CVX and YALMIP for MATLAB, and JuMP for Julia.

We have a number of tutorials on mathematical programming in

The general formulation is ...

It's important to realize that even though this formulation is incredibly general, it does have it's limits. As just one example, when write optimizations to plan trajectories of a robot, in this formulation we typically have to choose a-priori as particular number of decision variables that will encode the solution. Although we can of course write algorithms that change the number of variables and call the optimizer again, somehow I feel that this "general" formulation fails to capture, for instance, the type of mathematical programming that occurs in sample-based optimization planning -- where the resulting solutions can be described by any finite number of parameters, and the computation flexibly transitions amongst them.

Local optima. Convex functions, convex constraints.

If an optimization problem is nonconvex, it does not necessarily mean that the optimization is hard. There are many cases in deep learning where we can reliably solve seemingly very high-dimensional and nonconvex optimization problems. Our understanding of these phenomenon is evolving rapidly, and I suspect anything I write here will shortly become outdated. But the current thinking in supervised learning mostly revolves around the idea that over-parameterization is key -- that many of these success stories are happening in a regime where we actually have more decision variables than we have data, and that the search space is actually dense with solutions that can fit the data perfectly (the so-called "interpolating solutions"), that not all global minima are equally robust, and that the optimization algorithms are performing some form of implicit regularization in order to pick a "good" interpolating solution.

Given the equality-constrained optimization problem $$\minimize_\bz \ell(\bz) \quad \subjto \quad \bphi(\bz) = 0,$$ where $\bphi$ is a vector. Define a vector $\lambda$ of Lagrange multipliers, the same size as $\phi$, and the scalar Lagrangian function, $$L(\bz,{\bf \lambda}) = \ell(\bz) + \lambda^T\phi(\bz).$$ A necessary condition for $\bz^*$ to be an optimal value of the constrained optimization is that the gradients of $L$ vanish with respect to both $\bz$ and $\lambda$: $$\pd{L}{\bz} = 0, \quad \pd{L}{\lambda} = 0.$$ Note that $\pd{L}{\lambda} = \phi(\bz)$, so requiring this to be zero is equivalent to requiring the constraints to be satisfied.

Consider the following optimization: $$\min_{x,y} x+y, \quad \subjto \quad x^2 + y^2 = 1.$$ The level sets of $x+y$ are straight lines with slope $-1$, and the constraint requires that the solution lives on the unit circle.

Simply by inspection, we can determine that the optimal solution should be $x=y=-\frac{\sqrt{2}}{2}.$ Let's make sure we can obtain the same result using Lagrange multipliers.

Formulating $$L = x+y+\lambda(x^2+y^2-1),$$ we can take the derivatives and solve \begin{align*} \pd{L}{x} = 1 + 2\lambda x = 0 \quad & \Rightarrow & \lambda = -\frac{1}{2x}, \\ \pd{L}{y} = 1 + 2\lambda y = 1 - \frac{y}{x} = 0 \quad & \Rightarrow & y = x,\\ \pd{L}{\lambda} = x^2+y^2-1 = 2x^2 -1 = 0 \quad & \Rightarrow & x = \pm \frac{1}{\sqrt{2}}. \end{align*} Given the two solutions which satisfy the necessary conditions, the negative solution is clearly the minimizer of the objective.

One of the most powerful use cases for semidefinite programming is
in writing convex relaxations of more general non-convex optimization
problems. One particularly useful example of this is the general
quadratic optimization problem, written as \begin{align*} \min_\by
\quad &\bx^T\bQ\bx \\ \subjto \quad & \bx^T \bA_i \bx \ge 0, \\ &
\bB\bx \ge {\bf 0}, \\& \bx = \begin{bmatrix} 1 \\ \by \end{bmatrix}.
\end{align*} In the following, we won't make any assumptions about
$\bQ$ or $\bA_i$ being PSD. For instance, quadratic equality
constraints of the form $\bx^T\bA\bx=0$ (which are generically
nonconvex) can be encoded with two constraints: $ \bx^T\bA\bx \ge 0$
and $-\bx^T\bA\bx \ge 0.$ The SDP relaxation

Consider the problem: \begin{align*} \min_x \quad & \| x - a \|^2
\\ \text{subject to} \quad & \| x - b \|^2 \ge 1 \end{align*} This
problem is non-convex, however we can write a convex relaxation of
the problem using the semidefinite-programming
relaxation:\begin{align*} \min_{x,y} \quad & y - 2ax + a^2 \\
\text{subject to} \quad & y - 2bx + b^2 \ge 1 \\ & y \ge x^2
\end{align*} where we write $y \ge x^2$ as the semidefinite
constraint $\begin{bmatrix} y & x \\ x & 1 \end{bmatrix} \succeq 0.$
^{†}^{†} Note that this particular SDP constraint is
so simple that it could have alternatively been written as a
second-order cone constraint.

I've plotted the feasible region and the objective with an arrow. As you can see, the feasible region is the epigraph of $y=x^2$ intersected with the linear constraint. Now here is the key point: for linear objectives, the optimal solution will (almost) never be active on the linear constraint unless it's at a vertex of the feasible set (this is related to the classical picture from linear programming). In this example specifically, the constraint $y=x^2$ will be inactive only if $a = b;$ for every other objective, this relaxation will give $y=x^2$, and will give the optimal solution to the original problem. Note that we could have equally well written a quadratic equality constraint.

Another useful example is \begin{align*} \min_\bx \quad & \bx^T{\bf A}\bx + {\bf b}^T \bx, \\ \text{subject to} \quad & \bx^T\bx = 1. \end{align*} Again this optimization is non-convex, but we can write the semidefinite-programming relaxation as \begin{align*} \min_{{\bf Y},\bx} \quad & \trace({\bf AY}) + {\bf b}^T \bx, \\ \text{subject to} \quad & \trace({\bf Y}) = 1, \\ & {\bf Y} \succeq \bx\bx^T, \end{align*} where the PSD constraint is again written using the Schur complement. For the case where $\bx\in\Re^2,$ the PSD constraint gives us that $Y_{11} \ge x_1^2$ and $Y_{22} \ge x_2^2$; combined with the trace constraint this gives us $$Y_{22} = 1-Y_{11} \ge x_2^2 \Rightarrow 1 - x_2^2 \ge Y_{11} \ge x_1^2 \Rightarrow 1 \ge x_1^2 + x_2^2.$$ Once again the picture looks like the affine section of a cone (now in higher dimensions). Given the linear objective, we expect the optimal solution to be unbounded, or active on one ore more of the constraints. Almost always the constraint $${\bf Y} \succeq \bx \bx^T$$ will be active, and when ${\bf Y} = \bx\bx^T$ we have $\bx^T\bx = 1.$ When the convex relaxation gives the optimal solution to the original problem, we say that the convex relaxation is "tight".

Let's compare this to the simpler relaxation, \begin{align*} \min_\bx \quad & \bx^T{\bf A}\bx + {\bf b}^T \bx, \\ \text{subject to} \quad & \bx^T\bx \le 1, \end{align*} which can be written as a second-order cone program (SOCP). When the objective is linear $({\bf A}=0, {\bf b} \neq 0)$, even this simpler relaxation will be tight. But when the objective is quadratic, if the minimum of the quadratic falls inside the unit disk, then the relaxation will not be tight (it will be "loose") -- the optimal solution will satisfy the constraints $\bx^T\bx \le 1$, but will not have $\bx^T \bx = 1.$ In the PSD relaxation, the convex relaxation will be tight, because the objective is now linear (in the lifted coordinates), and will push towards the boundary so long as ${\bf b} \neq 0$.

I've given a numerical example you can play with in the notebook: What's amazing is that this SDP relaxation even gives the optimal solution when the objective is non-convex. I've given an example in the notebook with a concave objective (e.g. ${\bf A} \preceq 0$).

It turns out that in the same way that we can use SDP to search over
the positive quadratic equations, we can generalize this to search over
the positive polynomial equations. To be clear, for quadratic equations
we have \[ {\bf P}\succeq 0 \quad \Rightarrow \quad \bx^T {\bf P} \bx \ge
0, \quad \forall \bx. \] It turns out that we can generalize this to \[
{\bf P}\succeq 0 \quad \Rightarrow \quad {\bf m}^T(\bx) {\bf P} {\bf
m}(\bx) \ge 0, \quad \forall \bx, \] where ${\bf m}(\bx)$ is a vector of
polynomial equations, typically chosen as a vector of *monomials*
(polynomials with only one term). The set of positive polynomials
parameterized in this way is exactly the set of polynomials that can be
written as a *sum of squares*

Even better, there is quite a bit that is known about how to choose
the terms in ${\bf m}(\bx)$. For example, if you give me the polynomial
\[ p(x) = 2 - 4x + 5x^2 \] and ask if it is positive for all real $x$, I
can convince you that it is by producing the sums-of-squares
factorization \[ p(x) = 1 + (1-2x)^2 + x^2, \] or the factorization \[
p(x) = \begin{bmatrix} 1 \\ x \end{bmatrix}^T \begin{bmatrix} 2 & -2 \\
-2 & 5 \end{bmatrix} \begin{bmatrix} 1 \\ x \end{bmatrix}, \] since the
matrix in this quadratic form is PSD. In general, I can produce a
monomial vector containing all monomials with degree up to the the
half of the degree of $p$ (we can actually do much better, but I
will defer those details). In practice, what this means for us is that
people have authored optimization front-ends which take a high-level
specification with constraints on the positivity of polynomials and they
automatically generate the SDP problem for you, without having to think
about what terms should appear in ${\bf m}(\bx)$ (c.f.

As it happens, the particular choice of ${\bf m}(\bx)$ can have a huge
impact on the numerics of the resulting semidefinite program (and on your
ability to solve it with commercial solvers).

We will write optimizations using sums-of-squares constraints as \[ p(\bx) \sos \] as shorthand for the constraint that $p(\bx) \ge 0$ for all $\bx$, as demonstrated by finding a sums-of-squares decomposition.

This is surprisingly powerful. It allows us to use convex optimization to solve what appear to be very non-convex optimization problems.

Consider the following famous non-linear function of two variables
(called the "six-hump-camel": $$p(x) = 4x^2 + xy - 4y^2 - 2.1x^4 + 4y^4
+ \frac{1}{3}x^6.$$ This function has six local minima, two of them
being global minima

By formulating a simple sums-of-squares optimization, we can actually find the minimum value of this function (technically, it is only a lower bound, but in this case and many cases, it is surprisingly tight) by writing: \begin{align*} \max_\lambda \ \ & \lambda \\ \text{s.t. } & p(x) - \lambda \text{ is sos.} \end{align*} Go ahead and play with the code (most of the lines are only for plotting; the actual optimization problem is nice and simple to formulate).

Note that this finds the minimum value, but does not actually
produce the $\bx$ value which mimimizes it. This is possible

The S-procedure.

The S-procedure

Using the quotient ring

Quotient rings via sampling

The generic formulation of a nonlinear optimization problem is \[
\min_z c(z) \quad \subjto \quad \bphi(z) \le 0, \] where $z$ is a
vector of *decision variables*, $c$ is a scalar *objective
function* and $\phi$ is a vector of *constraints*. Note
that, although we write $\phi \le 0$, this formulation captures
positivity constraints on the decision variables (simply multiply the
constraint by $-1$) and equality constraints (simply list both $\phi\le0$
and $-\phi\le0$) as well.

The picture that you should have in your head is a nonlinear, potentially non-convex objective function defined over (multi-dimensional) $z$, with a subset of possible $z$ values satisfying the constraints.

Note that minima can be the result of the objective function having zero
derivative *or* due to a sloped objective up against a
constraint.

Numerical methods for solving these optimization problems require an
initial guess, $\hat{z}$, and proceed by trying to move down the objective
function to a minima. Common approaches include *gradient descent*,
in which the gradient of the objective function is computed or estimated, or
second-order methods such as *sequential quadratic programming (SQP)*
which attempts to make a local quadratic approximation of the objective
function and local linear approximations of the constraints and solves a
quadratic program on each iteration to jump directly to the minimum of the
local approximation.

While not strictly required, these algorithms can often benefit a great
deal from having the gradients of the objective and constraints computed
explicitly; the alternative is to obtain them from numerical
differentiation. Beyond pure speed considerations, I strongly prefer to
compute the gradients explicitly because it can help avoid numerical
accuracy issues that can creep in with finite difference methods. The
desire to calculate these gradients will be a major theme in the discussion
below, and we have gone to great lengths to provide explicit gradients of
our provided functions and automatic differentiation of user-provided
functions in

When I started out, I was of the opinion that there is nothing difficult
about implementing gradient descent or even a second-order method, and I
wrote all of the solvers myself. I now realize that I was wrong. The
commercial solvers available for nonlinear programming are substantially
higher performance than anything I wrote myself, with a number of tricks,
subtleties, and parameter choices that can make a huge difference in
practice. Some of these solvers can exploit sparsity in the problem (e.g.,
if the constraints operate in a sparse way on the decision variables).
Nowadays, we make heaviest use of SNOPT

Augmented Lagrangian

An advanced, but very readable book on MIP

- "Numerical optimization", Springer , 2006. ,
- "Convex Optimization", Cambridge University Press , 2004. ,
- "Lecture notes from MIT 6.7230 - Algebraic techniques and semidefinite optimization", , 2023. ,
- "General heuristics for nonconvex quadratically constrained quadratic programming", arXiv preprint arXiv:1703.07870, 2017. ,
- "Structured Semidefinite Programs and Semialgebraic Geometry Methods in Robustness and Optimization", PhD thesis, California Institute of Technology, May 18, 2000. ,
- "SOSTOOLS: Sum of Squares Optimization Toolbox for MATLAB User's guide", , June 1, 2004. ,
- "Reduction methods in semidefinite and conic optimization", PhD thesis, Massachusetts Institute of Technology, 2017. [ link ] ,
- "GloptiPoly 3: moments, optimization and semidefinite programming", Optimization Methods \& Software, vol. 24, no. 4-5, pp. 761--779, 2009. ,
- "User's Guide for SNOPT Version 7: Software for Large -Scale Nonlinear Programming", , February 12, 2006. ,
- "Integer programming", Springer , vol. 271, 2014. ,
- "Mixed integer linear programming formulation techniques", Siam Review, vol. 57, no. 1, pp. 3--57, 2015. ,

Previous Chapter | Table of contents | Next Chapter |