Underactuated Robotics

Algorithms for Walking, Running, Swimming, Flying, and Manipulation

Russ Tedrake

© Russ Tedrake, 2024

Note: These are working notes used for a course being taught at MIT. They will be updated throughout the Spring 2024 semester. Lecture videos are available on YouTube.


Imitation Learning

Imitation learning, also known as "learning from demonstrations" (LfD), is the problem of learning a policy from a collection of demonstrations. For state-based feedback, these demonstrations take the form of a set of state-action sequences, $\left[ \bx[\cdot], \bu[\cdot]\right]$. For the richer class of output feedback, this takes the form of observation-action sequences, $\left[ \by[\cdot], \bu[\cdot]\right]$. Note that we do not require any explicit definition of a cost or reward function; in most cases we assume that the demonstrations are obtained from an optimal or near-optimal policy.

Broadly speaking, most approaches to imitation learning can be categorized as either behavior cloning (BC) or inverse reinforcement learning (IRL). Behavior cloning attempts to learn a policy directly from the data using supervised learning. Inverse RL (aka inverse optimal control) attempts to learn a cost function from the data, and then uses potentially more traditional optimal control approaches to synthesize a policy for this cost function, in the hopes of generalizing significantly beyond the demonstration data. See Argall09+Bagnell15+Osa18 for some fairly recent surveys, though admittedly they appeared a bit before the latest boom.

I think it's fair to say that today, in 2024, behavior cloning is once again taking the robotics world by storm (especially in manipulation research), and it now seems like the shortest path to building "robot foundation models". We'll devote most of this chapter to it.

Behavior cloning

Behavior cloning Pomerleau88+Bain95 attempts to learn a policy directly from the policy's input-output data using supervised learning. In the case of full-state feedback, where we know that the optimal policy can be represented as a simple function of the state, this is typically cast as a simple regression problem, e.g. $\min_\theta \sum_{i} |\bu_i - \pi_\theta(\bx_i)|^2.$ In the richer case of output feedback, where we have learned to think about a policy as a dynamical system, this becomes a sequence learning problem (given a sequence of observations, predict a sequence of actions).
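
As a minimal sketch of the full-state regression problem above, assume a linear policy class $\pi_\theta(\bx) = {\bf W}\bx$ and synthetic demonstrations. The expert gain `K_expert`, the noise level, and the helper name `fit_linear_policy` are all hypothetical choices made for illustration. In that case the least-squares problem can be solved in closed form:

```python
import numpy as np


def fit_linear_policy(X, U):
    """Return W minimizing sum_i |u_i - W x_i|^2 (rows of X, U are samples)."""
    # lstsq solves X @ W.T ~= U in the least-squares sense.
    W_T, *_ = np.linalg.lstsq(X, U, rcond=None)
    return W_T.T


rng = np.random.default_rng(0)
K_expert = np.array([[1.0, 2.0]])  # hypothetical expert: u = -K_expert x
X = rng.normal(size=(500, 2))  # demonstration states x_i
# Expert actions u_i, with a little demonstration noise.
U = -X @ K_expert.T + 0.01 * rng.normal(size=(500, 1))

W = fit_linear_policy(X, U)
print(W)  # close to -K_expert
```

With a neural-network policy the same loss would typically be minimized with stochastic gradient descent rather than in closed form.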

Famously, Large Language Models are trained with behavior cloning (and then fine-tuned to make them more aligned with human preferences Ziegler19). OpenAI's GPT models are autoregressive models that predict the next token given a "context" of recent tokens Radford18+Radford19+Brown20+Bubeck23. More recently, we've seen the extension to multi-modal models, such as visually-conditioned language models (VLMs) like GPT-4V and LLaVA Liu23, with an increasing wealth of open-source reproductions Karamcheti24+Laurençon24.

Is predicting actions fundamentally different from next-token prediction in language? There are a few reasons why it might be. Actions are continuous and high-dimensional, whereas language tokens are discrete. Our control systems get put into the feedback loop with physics, and have to deal with stochasticity from the environment that LLMs don't experience. Google DeepMind released the RT series of "vision-language-action" (VLA) models (RT-1, RT-2, and RT-2-X) which started to show that predicting actions may not be so fundamentally different than language, at least for simple pick-and-place tasks. Then in 2023, there were two main lines of work which convincingly demonstrated this capability could even extend to surprisingly dexterous manipulation. Those were the Diffusion Policy Chi24 and the Action Chunking Transformer (ACT) from the ALOHA project Zhao23. Of course, these were built on a long line of progress on "visuomotor policies" which we'll discuss below.

Diffusion Policy started as an intern project(!) by Cheng Chi at TRI, and blossomed into a great collaboration with Shuran Song.

Diffusion Policy and ALOHA seem to have been the watershed results for dexterous manipulation in robotics. Since then, the internet has become rich with videos of highly dexterous manipulation from all sorts of robots, up to and including humanoid robots with dexterous hands.

In one sense, it might seem a little disappointing that, after we've spent so much time in these notes exploring the rich mathematical foundations of dynamics and control, behavior cloning from human teleop demonstrations can outperform some of our best methods, at least for some class of problems (which are arguably more about understanding the world than about dynamics and control). But think of it this way: using supervised learning is an awfully clever way to explore the space of policy parameterizations, and has accelerated us into bigger questions about using cameras in the feedback loop, learning multitask/foundation models, and leveraging structure (such as 3D geometry / objectness) in our representations or not. I must say that the success of LLMs (and now multimodal models) is undeniable, and it would be a mistake to ignore the great new possibilities that have opened up. I'm very confident that our knowledge of dynamics and control will help us penetrate the new vistas enabled by these high-capacity models.

Let me make one important point. Sometimes people say that BC is fundamentally limited because it can never outperform the human demonstrator. But one of the early results from behavior cloning was that this limit is not strictly true. [TODO: cite the classic result; I forget which classic paper it came from. Chowdery23 figure 6 is relevant, but they seem to hedge a bit because perhaps the metric isn't great.] Certainly ChatGPT can produce text output that by some metrics surpasses the capabilities of any one human. Here is a recent case study of this phenomenon for chess.

There is also a precedent for combining BC with other methods to improve beyond the original demonstrations. For DeepMind's AlphaGo Silver16, behavior cloning was only the first step -- it was enough to get us to a strong Go player, but not enough to be championship level. Adding "self-play" and improvement via Monte-Carlo tree search was fundamental to "mastering the game of Go".

Visuomotor policies (aka control from pixels)

In 2016, a few years after the start of the deep learning revolution, Levine16 introduced the concept of a "visuomotor policy", and produced fairly stunning videos of robots performing complicated tasks. The name "visuomotor" is chosen to emphasize that the policy is trained to predict actions directly from RGB camera images, using architectures similar to the one pictured below; in a sense this was a return to the early ideas of Pomerleau88, but now powered with deep learning architectures and pretraining.

The (now) classical visuomotor policy architecture, adapted from Levine16+Florence20. The emphasis here is that $z$ is a learned state representation which encodes the state of the environment.

My own journey with imitation learning started with a project led by Pete Florence and Lucas Manuelli on using a particular form of self-supervised learning for 3D geometries to train the policies Florence20. For me, this was the first time I got to experience directly how the model was trained, and to see the incredibly impressive robustness that was possible in the roll-outs, even with a relatively modest amount of robot training data.

Writing feedback controllers which operate directly on the RGB camera images was something entirely new for me. We had been using RGB-D cameras to do some amount of visual state estimation / pose estimation before that, but were leaning pretty heavily on the depth channel and very explicit 3D reasoning/matching. But I remember around the time of our first imitation learning project, I asked the students "if you could only choose one, RGB or depth, which would you choose". They chose RGB. There are many situations where a task is unclear or even ambiguous when looking only at a depth image. The ambiguities can often be resolved with the addition of RGB, and there are many depth cues in RGB that allow us and our visuomotor policies to be successful for 3D tasks even without an explicit depth sensor.

In my experience, control theorists have very satisfying answers to almost any dynamics and control problem. But (with a few rare exceptions) they didn't do computer vision. This was something entirely new. The sensor model -- the mapping from, e.g. the state and parameters of a MultibodyPlant to the output image $y$ -- is potentially a full game-engine-quality renderer. Even though there are lots of projects now on making differentiable renderers, these can only do so much because the pixelation process is inherently very local/non-smooth. Going from the image back into a manageable intermediate (latent) state representation, $z$, started becoming viable with the rise of deep networks for perception. Data/learning does feel fundamental here -- mapping from RGB into a meaningful representation for control is more about the statistics of natural scenes than about the model-based physics of propagating light.

Behavior cloning as sequence modeling

Supervised learning in a feedback loop: dealing with distribution shift

One of the famous challenges in imitation learning is the problem of distribution shift. Imagine that you are training a policy for driving a car...

DAGGER Ross11

Teacher/Student.

Dealing with suboptimal and multimodal demonstrations

Discrete maze example w/ symmetries. Push-T example.

Architectures for visuomotor policies

Desiderata

scalable, reliable training, ..., can cope with multimodality, ...

Output/action decoders

The visuomotor policies that we'll study here should output low-level robot actions -- $\bu$ in the parlance of these notes. These need not be torque commands directly... in fact it's more typical for them to output a slightly higher-level command like joint velocity or end-effector velocity, which gets passed to a low-level controller. Note that there is also a now large body of literature where people use LLMs or VLMs to determine a sequence of high level actions, but assume that someone has authored or otherwise obtained a set of "skill libraries" that map the discrete high-level actions to control; while interesting, I would not call those approaches visuomotor policies and will not discuss those here.

In all cases, the input encoders (discussed next) map the recent history of observations into some latent representation, which then eventually gets mapped back into actions via the action decoder. It is quite useful to categorize the different visuomotor policy architectures based on the different choices that they make about the action decoder.

There is a series of work, now commonly referred to as VLA (vision-language-action) architectures, which leverage the successful transformer architectures from language and vision by discretizing and tokenizing the robot action space. Early examples include Decision Transformer Chen21, Gato Reed22, and more recently these powered the Robot Transformer (RT) line of models from Google DeepMind: RT-1 Brohan22, RT-2 Brohan23, RT-X Padalkar23. For example, in RT, the action is represented using a uniform discretization with 256 bins over each coordinate (using the 1st and 99th quantile of the actions in the training data as the minimum and maximum values, respectively); in those models the action space was taken to be the end-effector pose and gripper extension. Then they literally output a string of integers corresponding to those coordinates as desired text output for the VLM; if the integers are not already represented as tokens in the VLM, then they simply overwrite the 256 least frequently used tokens Brohan23. You can find a strong open-source reproduction of RT-2-X models in the OpenVLA project Kim24.
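
The uniform discretization described above can be sketched as follows. This is an illustrative sketch, not the RT implementation: the clipping of out-of-range actions, the decoding to bin centers, the helper names, and the synthetic 7-dimensional action data are all assumptions.

```python
import numpy as np

NUM_BINS = 256


def fit_bounds(actions, lo_q=0.01, hi_q=0.99):
    """Per-coordinate range from the 1st and 99th quantiles of the data."""
    return np.quantile(actions, lo_q, axis=0), np.quantile(actions, hi_q, axis=0)


def tokenize(actions, lo, hi):
    """Map continuous actions to integer bin indices in [0, NUM_BINS - 1]."""
    frac = (np.clip(actions, lo, hi) - lo) / (hi - lo)
    # frac == 1.0 would land in bin NUM_BINS, so cap it at the last bin.
    return np.minimum((frac * NUM_BINS).astype(int), NUM_BINS - 1)


def detokenize(tokens, lo, hi):
    """Map bin indices back to continuous values (assumed: bin centers)."""
    return lo + (tokens + 0.5) / NUM_BINS * (hi - lo)


rng = np.random.default_rng(0)
actions = rng.uniform(-1.0, 1.0, size=(1000, 7))  # synthetic 7-D actions
lo, hi = fit_bounds(actions)
tokens = tokenize(actions, lo, hi)
recovered = detokenize(tokens, lo, hi)
# The "string of integers" for one action:
print(" ".join(str(t) for t in tokens[0]))
```

Under these assumptions, the round-trip error for any in-range action is at most half a bin width, i.e. $(\text{max} - \text{min})/512$ per coordinate.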

These tokenized-action architectures naturally learn a probability over next tokens, which deals very directly with the potential multimodality in the training data. But this comes at the cost of having discretized the action space. I don't worry much about the resolution of this discretization being limiting, but I'm a bit worried that the discretization destroys the natural inductive bias of the continuous space. (For instance, the end-effector position 0.1 is closer to 0.2 than to 0.5, but this information is completely discarded in the discretization.) Other people don't seem as concerned, and perhaps when we eventually have enough data it will all be in the noise.
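
To make the concern about inductive bias concrete, here is a small sketch assuming a standard one-hot cross-entropy loss over 256 bins. The specific bins and probabilities are made up for illustration. A prediction that misses by one bin and a prediction that misses by 175 bins incur exactly the same loss:

```python
import numpy as np

num_bins, target = 256, 25


def cross_entropy(probs, target_bin):
    """Next-token loss for a one-hot target: -log p(target)."""
    return -np.log(probs[target_bin])


def expected_bin_error(probs, target_bin):
    """Distance between the mean predicted bin and the target bin."""
    return abs(np.dot(probs, np.arange(len(probs))) - target_bin)


# Both predictions put 10% of the mass on the correct bin ...
near_miss = np.zeros(num_bins)
near_miss[[target, target + 1]] = [0.1, 0.9]  # ... and 90% on the adjacent bin
far_miss = np.zeros(num_bins)
far_miss[[target, 200]] = [0.1, 0.9]  # ... and 90% on a distant bin

ce_near = cross_entropy(near_miss, target)
ce_far = cross_entropy(far_miss, target)
err_near = expected_bin_error(near_miss, target)
err_far = expected_bin_error(far_miss, target)
print(ce_near, ce_far)    # identical losses
print(err_near, err_far)  # very different distances in action space
```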

Behavior Transformers (BeT) Shafiullah22 and VQ-BeT Lee24 put more work into discretizing the action space, by preprocessing the data with k-means (in BeT) or Residual Vector Quantization (in VQ-BeT), and then doing some continuous up-sampling.

There were a number of attempts to handle multimodality in a more natively continuous setting. Implicit BC Florence22 called this out very explicitly, and also explored the use of Mixture Density Networks Bishop94 for behavior cloning. But things really started working well when some of the new tools from generative AI started getting applied to robotics; most notably in the Action-chunking transformers (ACT) Zhao23 which used Conditional Variational Autoencoders (CVAE) and in Diffusion Policy Chi24 which used Diffusion models. We'll discuss these in more detail below.

Both the ACT paper and the Diffusion Policy paper strongly emphasized another detail about the output encoding: rather than predicting a single (current) action to take, these models predict an entire sequence of future actions, and then operate in a fashion similar to model-predictive control.
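
One way to read "in a fashion similar to model-predictive control" is a receding-horizon loop: predict a chunk of actions, execute only the first few, then predict again from the new observation. The sketch below illustrates that loop under strong assumptions. The scalar integrator plant, the hand-written `predict_chunk` standing in for a learned model, and the parameter `num_execute` are all hypothetical.

```python
def predict_chunk(x, horizon=8):
    """Stand-in for the learned model: plan `horizon` actions from state x."""
    actions = []
    for _ in range(horizon):
        u = -0.5 * x  # hypothetical expert behavior: halve the state each step
        actions.append(u)
        x = x + u  # internal prediction of the next state
    return actions


def step(x, u):
    """Toy plant (an assumption for illustration): x[n+1] = x[n] + u[n]."""
    return x + u


def rollout(x, num_steps, num_execute):
    """Execute the first `num_execute` actions of each chunk, then re-predict."""
    executed = []
    while len(executed) < num_steps:
        chunk = predict_chunk(x)
        for u in chunk[:num_execute]:
            if len(executed) == num_steps:
                break
            x = step(x, u)
            executed.append(u)
    return x, executed


x_final, executed = rollout(x=1.0, num_steps=10, num_execute=4)
print(x_final, len(executed))
```

In this noise-free toy problem the result does not depend on `num_execute`; the choice only starts to matter once there are disturbances or prediction errors between re-predictions.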

(Multi-modal) input encoders

Although researchers are rapidly adopting additional input modalities, by far the most common input modalities are the robot proprioception (e.g. joint sensors), which can be passed into the model directly, and image observations which need to be encoded from raw RGB into some intermediate representation. Although there is a torrent of literature on this, there are a few choices that have clearly emerged as the standards: ResNet and ViT. For instance, the original Diffusion Policy paper used a ResNet-18 (without pretraining) with small modifications, e.g. to maintain spatial information Chi24, but as the scale of the experiments has increased we have seen more success with CLIP-pretrained ViT Chi24a.

Language-conditioned multitask policies... Padalkar23+Kim24.

Diffusion Policy

One particularly successful form of behavior cloning for visuomotor policies with continuous action spaces is the Diffusion Policy Chi23. The dexterous manipulation team at TRI had been working on behavior cloning for some time, but the Diffusion Policy architecture (which started as a summer internship project!) has allowed us to very reliably train incredibly dexterous tasks and really start to scale up our ambitions for manipulation.

Denoising Diffusion models

"Denoising Diffusion" models are an approach to generative AI, made famous by their ability to generate high-resolution photorealistic images. Inspired by the "manifold hypothesis" (i.e. the idea that realistic images live on a low-dimensional manifold in pixel space), the intuition behind denoising diffusion is that we train a model by adding noise to samples drawn from the data distribution, then learn to predict the noise from the noisy images, in order to "denoise" random images back on to the manifold. While image generation made these models famous, they have proven to be highly capable in generating samples from a wide variety of high-dimensional continuous distributions, even distributions that are conditioned on high-dimensional inputs. I recommend this blog post and Nakkiran24 as excellent introductions.

Let's consider samples $\bu \in \Re^m$ drawn from a training dataset $\mathcal{D}.$ Diffusion models are trained to estimate a noise vector ${\bf \epsilon} \in \Re^m$ to minimize the loss function $$\ell(\theta) = \mathbb{E}_{\bu, {\bf \epsilon}, \sigma} || {\bf f}_\theta(\bu + \sigma {\bf \epsilon}, \sigma ) - {\bf \epsilon} ||^2,$$ where $\theta$ is the parameter vector, and $f_\theta$ is typically some high-capacity neural network. In practice, training is done by randomly sampling $\bu$ from $\mathcal{D}$, ${\bf \epsilon}$ from $\mathcal{N}({\bf 0}_m, {\bf I}_{m \times m})$, and $\sigma$ from a uniform distribution over a positive set of numbers denoted as $\{\sigma_k\}_{k=0}^K,$ where we have $\sigma_k > \sigma_{k-1}.$
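
As a rough sketch of how this loss could be estimated from a minibatch, consider the following. The toy single-sample dataset, the geometric noise schedule, and the helper names are assumptions for illustration, and the two hand-written denoisers stand in for a neural network.

```python
import numpy as np


def diffusion_loss(f, data, sigmas, rng, batch_size=10_000):
    """Monte Carlo estimate of E || f(u + sigma * eps, sigma) - eps ||^2."""
    u = data[rng.integers(len(data), size=batch_size)]  # u ~ D
    eps = rng.standard_normal(u.shape)  # eps ~ N(0, I)
    sigma = rng.choice(sigmas, size=(batch_size, 1))  # sigma ~ Uniform{sigma_k}
    residual = f(u + sigma * eps, sigma) - eps
    return np.mean(np.sum(residual**2, axis=1))


rng = np.random.default_rng(0)
u_star = np.array([1.0, -2.0])
data = u_star[None, :]  # toy dataset D with a single sample (m = 2)
sigmas = np.geomspace(0.01, 10.0, 50)  # assumed noise levels, increasing in k


def ideal_denoiser(u, sigma):
    # For a single-sample dataset the added noise is recoverable exactly.
    return (u - u_star) / sigma


def zero_denoiser(u, sigma):
    return np.zeros_like(u)


loss_ideal = diffusion_loss(ideal_denoiser, data, sigmas, rng)
loss_zero = diffusion_loss(zero_denoiser, data, sigmas, rng)
print(loss_ideal, loss_zero)  # roughly 0 and roughly m = 2
```

A denoiser that always predicts zero pays $\mathbb{E}||{\bf \epsilon}||^2 = m$, which gives a useful baseline when reading training curves.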

To sample a new output from the model, the denoising diffusion implicit models (DDIM) sampler Song20 takes multiple steps: $$\bu_{k-1} = \bu_k + (\sigma_{k-1} - \sigma_k)f_\theta(\bu_k, \sigma_k).$$ This specific parameterization of the update (and my preferred notation more generally) comes from Permenter23. I've presented the deterministic denoiser here, but some analysis suggests that Langevin sampling can improve performance if one takes many denoising steps Xu23.
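
The deterministic update above is only a few lines of code. The sketch below assumes a toy dataset with a single sample, for which the noise can be recovered in closed form, so no trained network is needed. The schedule and names are illustrative assumptions.

```python
import numpy as np


def ddim_sample(f, u_K, sigmas):
    """Deterministic sampler; sigmas[k] = sigma_k with sigma_0 < ... < sigma_K."""
    u = np.array(u_K, dtype=float)
    for k in range(len(sigmas) - 1, 0, -1):
        # u_{k-1} = u_k + (sigma_{k-1} - sigma_k) * f(u_k, sigma_k)
        u = u + (sigmas[k - 1] - sigmas[k]) * f(u, sigmas[k])
    return u


u_star = np.array([1.0, -2.0])  # toy dataset with a single sample


def ideal_denoiser(u, sigma):
    return (u - u_star) / sigma


rng = np.random.default_rng(0)
sigmas = np.geomspace(0.01, 10.0, 50)  # assumed noise schedule
u_K = sigmas[-1] * rng.standard_normal(2)  # start from noise at level sigma_K
u_0 = ddim_sample(ideal_denoiser, u_K, sigmas)
print(u_0)  # close to u_star
```

The sampler starts from pure noise at the largest noise level and walks down the schedule, so `sigmas` must be ordered to match the convention $\sigma_k > \sigma_{k-1}$.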

Diffusion models have a slightly convoluted history. The term "diffusion" came from a paper Sohl-Dickstein15 which used an analogy from thermodynamics to use a prescribed diffusion process to slowly transform data into random noise, and then learned to reverse this procedure by training an inverse diffusion. Well before that, a series of work starting with Hyvarinen05 studied the problem of learning the score function (the gradient of the log probability of a distribution) of a data distribution, and Vincent11 made a connection to denoising autoencoders. Song19 put all of this together beautifully and combined it with deep learning to propose denoising diffusion as a generative modeling technique. They learned a single network that was conditioned on the noise level. This was followed quickly by Ho20, which introduced denoising diffusion probabilistic models (DDPM) using an even simpler update and showed results competitive with other leading generative modeling techniques, leading to Song20 giving us the DDIM update above. Permenter23 gives a deterministic interpretation as learning the distance function from the data manifold, and sampling as performing approximate gradient descent on this function.

Examples! Something like Figure 1 from Sohl-Dickstein15 would be good.

It is straightforward to condition the generative model on an exogenous input, by simply adding an additional signal, $\by$, to the denoiser: $f_\theta(\bu, \sigma, \by).$

Diffusion Policy

Behavior cloning is perhaps the simplest form of imitation learning -- it simply attempts to learn a policy using supervised learning to match expert demonstrations. While it is tempting to learn deterministic output-feedback policies (maps from history of observations to actions), one quickly finds that human demonstrations are typically not unique. Perhaps this is not surprising, as we know that optimal feedback policies in general are not unique! To address this non-uniqueness / multi-modality in the human demonstrations, it's well understood that behavior cloning benefits from learning a conditional distribution over actions.

Diffusion Policy is the natural application of (conditional) denoising diffusion models to learning these policies. It was inspired, in particular, by the modeling choices in Diffuser Janner22. Rather than generating a single action, the denoiser in diffusion policy outputs a sequence of actions with horizon $H_u$; like Zhao23 we found experimentally that this leads to more stable roll-outs. Block23b provides some possible theoretical justification for this choice. We condition the input on a history of observations of length $H_y.$

Diffusion Policy for LQG

Let me be clear, it almost certainly does not make sense to use a diffusion policy to implement LQG control. But because we understand LQG so well at this point, it can be helpful to understand what the Diffusion Policy looks like in this extremely simplified case.

Consider the case where we have the standard linear-Gaussian dynamical system: \begin{gather*} \bx[n+1] = \bA\bx[n] + \bB\bu[n] + \bw[n], \\ \by[n] = \bC\bx[n] + \bD\bu[n] + \bv[n], \\ \bw[n] \sim \mathcal{N}({\bf 0}, {\bf \Sigma}_w), \quad \bv[n] \sim \mathcal{N}({\bf 0}, {\bf \Sigma}_v). \end{gather*} Imagine that we create a dataset by rolling out trajectory demonstrations using the optimal LQG policy. The question is: what (exactly) does the diffusion policy learn?

Let's start with the $\mathcal{H}_2$ problem (e.g. LQR with Gaussian noise), where the observation and prediction horizons are limited to a single step, $H_y = H_u = 1$, and the denoiser is conditioned directly on state observations. We will generate roll-outs using the optimal policy, $\bu = - \bK\bx$ (where $\bK$ is the optimal LQR gain), given a Gaussian distribution of initial conditions and Gaussian process noise. In this case, the training loss function reduces to $$\ell(\theta) = \mathbb{E}_{\bx, {\bf \epsilon}, \sigma} || {\bf f}_\theta(-\bK\bx + \sigma {\bf \epsilon}, \sigma, \bx) - {\bf \epsilon} ||^2,$$ where the expectation in $\bx$ is over the stationary distribution of the optimal policy. In this case, we don't need a neural network; take $f_\theta$ to be a simple function. In particular the optimal denoiser is given by $${\bf f}_\theta(\bu, \sigma, \bx) = \frac{1}{\sigma}\left[\bu + \bK\bx\right].$$ At evaluation time, the sampling iterations, $$\bu_{k-1} = \bu_k + \frac{\sigma_{k-1} - \sigma_k}{\sigma_k}\left[\bu_k + \bK\bx\right],$$ will converge on $\bu_0 = -\bK\bx.$ (Clearly $\bu_k = -\bK\bx$ is a fixed point of the iteration, and the $\frac{\sigma_{k-1} - \sigma_k}{\sigma_k}$ term is like the step-size of gradient descent.)
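
Both claims can be checked with a short calculation. This is a derivation sketch rather than something taken from the references, and the symbols $\tilde{\bu}$ and ${\bf e}_k$ are introduced here only for illustration. Conditioned on $\bx$, the noisy training input is $\tilde{\bu} = -\bK\bx + \sigma{\bf \epsilon}$, so the noise can be recovered exactly, $${\bf \epsilon} = \frac{1}{\sigma}\left[\tilde{\bu} + \bK\bx\right],$$ and a denoiser of this form drives the loss to zero, its minimum. Defining the residual ${\bf e}_k = \bu_k + \bK\bx$, the sampling iteration becomes $${\bf e}_{k-1} = {\bf e}_k + \frac{\sigma_{k-1} - \sigma_k}{\sigma_k}{\bf e}_k = \frac{\sigma_{k-1}}{\sigma_k}{\bf e}_k \quad \Rightarrow \quad {\bf e}_0 = \frac{\sigma_0}{\sigma_K}{\bf e}_K.$$ The distance to $-\bK\bx$ therefore contracts by exactly the factor $\sigma_0/\sigma_K$ over the $K$ steps, so $\bu_0 \approx -\bK\bx$ whenever $\sigma_0 \ll \sigma_K$.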

Returning to LQG, the diffusion policy architecture (with $H_u=1$) will be learning a denoiser conditioned on a finite history of actions and observations, \begin{gather*}f_\theta(\bu[n], \sigma, \bar{\by}_{H_y}, \bar{\bu}_{H_y}), \\ \bar{\by}_{H_y} = \left[\by[n-1],... ,\by[n-H_y]\right], \\ \bar{\bu}_{H_y} = \left[\bu[n-1],... ,\bu[n-H_y]\right].\end{gather*} We know that for LQG, the optimal actor that we will use for generating training data takes the form of a Kalman filter followed by LQR feedback on the estimated state. We can "unroll" the (truncated) Kalman filter into a linear function of the history of actions and observations; for observable and stabilizable linear systems we know that this truncation error will converge to zero as we increase $H_y$. Let's call this unrolled policy $\bu[n] = \hat{\pi}_{LQG}(\bar{\by}_{H_y}, \bar{\bu}_{H_y}).$ With some care, it can be shown that the optimal denoiser is given by $${\bf f}_\theta(\bu, \sigma, \bar{\by}_{H_y}, \bar{\bu}_{H_y}) = \frac{1}{\sigma}\left[\bu - \hat{\pi}_{LQG}(\bar{\by}_{H_y}, \bar{\bu}_{H_y})\right],$$ (the sign differs from the LQR case only because the policy there was written as $-\bK\bx$), so that sampling will converge onto the output of the truncated Kalman filter followed by LQR feedback.

Notebook example

Predicting actions multiple steps into the future is a fundamentally important aspect of the Diffusion Policy architecture Block23b. For LQG, ...

Reduced-order LQG

Inverse reinforcement learning

Some good notes here: https://web.stanford.edu/class/cs237b/pdfs/lecture/lecture_10111213.pdf

Vistas

Multitask / foundation models for control

If control directly from pixels was the first capability unlocked by imitation learning, I would say that large-scale multitask decision making is the second. Multitask learning has a long history in the machine learning community Zhang21, and multitask learning for control was popular first in "multitask RL" citation?. But the rise of foundation models Bommasani21 in language and now in multimodal models, has completely changed our capabilities and our expectations. Suddenly, one can imagine a single policy, represented in a high-capacity neural network model, whose inputs are the robot sensors (including RGB cameras) plus a natural language command ("Robot, please make me a pizza"), and for the first time ever it seems like something like this could plausibly work.

In my view, this has potentially profound implications for how we think about control. Our basic control definitions start with, e.g. we have a state $\bx$, inputs $\bu$, outputs $\by.$ The discussion on output feedback got us thinking a little about state representations for control -- for instance a belief state is a sufficient (but not necessary) state because it is a sufficient statistic of the history of actions and observations. But multitask in the imitation learning setting changes things. In the simple case, we'll say that our inputs $\bu$, and outputs $\by$ are the same across all of the tasks. But it may well be that the underlying state space is not. (I admit that philosophically there is a state of the entire universe which is the same across tasks, but I mean the more tractable representations of state that we've been using through the notes.) What does it mean to learn a state representation for control across tasks where even the dimension/cardinality of the state space can be different? Even our catch-all definition of belief state breaks down in this case.

Are there tractable ways to describe distributions over tasks that are amenable to our strongest theoretical tools, but still relevant for the complexity and diversity of the real world? When we talked about stochastic optimal control, we gave examples where taking an average over many possible rollouts can actually simplify the loss landscape, avoiding some local minima and making optimization easier. Can multitask control formulations have a similar effect?

Going further, how exactly is it that solving/learning one task can potentially help us in solving/learning another? This brings up basic questions about designing a curriculum for our control systems. Is it possible for us to soar to higher and higher heights if we sequence our control problem instances correctly?

Distributed decentralized learning (aka "fleet learning")

When I start using phrases like "learning" and "curriculum", then it becomes very natural to think in terms of our natural intelligence. How did we learn to walk? To play tennis? But let's remember that these analogies only go so far. For me, the GPT series of models are clearly unlike any single natural intelligence; they are more like a collective intelligence of the entire species (though still certainly deficient in some metrics). In the age of foundation models, it may not be the case that every robot needs to learn to use a toaster; the dream of "fleet learning" is that one robot will learn how to use a toaster and then they will all have learned.

This brings up fundamental questions about the learning algorithms, about data efficiency (and privacy). But it also challenges our theories of dynamics and control. For instance, there are open questions about how to balance being a generalist and using only shared data vs being a specialist. Certainly if a particular robot is solving problems in a particular warehouse, then while the statistics of tasks across the world may help form robust representations, this robot can almost certainly perform better if it narrows and specializes the policies (and world models) to exploit the distribution of tasks in the warehouse.

A particular version of this question appears in the context of "cross-embodiment" data and models. Right now, robot data with action labels is scarce (compared with online data for text and images). This, in part, has motivated the use of datasets which combine data from many robots/platforms Shah23+Padalkar23, and architectures that can share representations and transfer learning even when the number/types of sensors and actuators changes across the fleet.

Be rigorous

The fact that many of these fundamental questions are now being asked makes this a simply amazing time to be a roboticist. However, the pace of new innovations is so fast that often researchers feel pressure to race to publication before having done proper rigorous theoretical or empirical work. We are building tall towers but with somewhat shaky foundations. I firmly believe that the tenets of dynamics and control (amongst other rigorous technical tools) have a lot to contribute to understanding and continuing to push the field forward, and that some of the maturity with which we can understand these simpler (but not very simple!) problems can serve as a model for what we should expect about our understanding of the even more complex ones.

References

  1. Brenna D Argall and Sonia Chernova and Manuela Veloso and Brett Browning, "A survey of robot learning from demonstration", Robotics and autonomous systems, vol. 57, no. 5, pp. 469--483, 2009.

  2. J.A. Bagnell, "An Invitation to Imitation", Tech. Report, CMU-RI-TR-15-08, March, 2015.

  3. Takayuki Osa and Joni Pajarinen and Gerhard Neumann and J Andrew Bagnell and Pieter Abbeel and Jan Peters and others, "An algorithmic perspective on imitation learning", Foundations and Trends in Robotics, vol. 7, no. 1-2, pp. 1--179, 2018.

  4. Dean A Pomerleau, "Alvinn: An autonomous land vehicle in a neural network", Advances in neural information processing systems, vol. 1, 1988.

  5. Michael Bain and Claude Sammut, "A Framework for Behavioural Cloning", Machine Intelligence 15, pp. 103--129, 1995.

  6. Daniel M Ziegler and Nisan Stiennon and Jeffrey Wu and Tom B Brown and Alec Radford and Dario Amodei and Paul Christiano and Geoffrey Irving, "Fine-tuning language models from human preferences", arXiv preprint arXiv:1909.08593, 2019.

  7. Alec Radford and Karthik Narasimhan and Tim Salimans and Ilya Sutskever and others, "Improving language understanding by generative pre-training", 2018.

  8. Alec Radford and Jeffrey Wu and Rewon Child and David Luan and Dario Amodei and Ilya Sutskever and others, "Language models are unsupervised multitask learners", OpenAI blog, vol. 1, no. 8, pp. 9, 2019.

  9. Tom Brown and Benjamin Mann and Nick Ryder and Melanie Subbiah and Jared D Kaplan and Prafulla Dhariwal and Arvind Neelakantan and Pranav Shyam and Girish Sastry and Amanda Askell and others, "Language models are few-shot learners", Advances in neural information processing systems, vol. 33, pp. 1877--1901, 2020.

  10. Sébastien Bubeck and Varun Chandrasekaran and Ronen Eldan and Johannes Gehrke and Eric Horvitz and Ece Kamar and Peter Lee and Yin Tat Lee and Yuanzhi Li and Scott Lundberg and others, "Sparks of artificial general intelligence: Early experiments with gpt-4", arXiv preprint arXiv:2303.12712, 2023.

  11. Haotian Liu and Chunyuan Li and Qingyang Wu and Yong Jae Lee, "Visual Instruction Tuning", 2023.

  12. Siddharth Karamcheti and Suraj Nair and Ashwin Balakrishna and Percy Liang and Thomas Kollar and Dorsa Sadigh, "Prismatic vlms: Investigating the design space of visually-conditioned language models", arXiv preprint arXiv:2402.07865, 2024.

  13. Hugo Laurençon and Léo Tronchon and Matthieu Cord and Victor Sanh, "What matters when building vision-language models?", 2024.

  14. Cheng Chi and Zhenjia Xu and Siyuan Feng and Eric Cousineau and Yilun Du and Benjamin Burchfiel and Russ Tedrake and Shuran Song, "Diffusion Policy: Visuomotor Policy Learning via Action Diffusion", 2024.

  15. Tony Z Zhao and Vikash Kumar and Sergey Levine and Chelsea Finn, "Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware", arXiv preprint arXiv:2304.13705, 2023.

  16. David Silver and Aja Huang and Chris J Maddison and Arthur Guez and Laurent Sifre and George Van Den Driessche and Julian Schrittwieser and Ioannis Antonoglou and Veda Panneershelvam and Marc Lanctot and others, "Mastering the game of Go with deep neural networks and tree search", Nature, vol. 529, no. 7587, pp. 484--489, 2016.

  17. Sergey Levine and Chelsea Finn and Trevor Darrell and Pieter Abbeel, "End-to-end training of deep visuomotor policies", The Journal of Machine Learning Research, vol. 17, no. 1, pp. 1334--1373, 2016.

  18. Peter Florence and Lucas Manuelli and Russ Tedrake, "Self-Supervised Correspondence in Visuomotor Policy Learning", IEEE Robotics and Automation Letters, vol. 5, no. 2, pp. 492--499, April, 2020.

  19. Stéphane Ross and Geoffrey Gordon and Drew Bagnell, "A reduction of imitation learning and structured prediction to no-regret online learning", Proceedings of the fourteenth international conference on artificial intelligence and statistics, pp. 627--635, 2011.

  20. Lili Chen and Kevin Lu and Aravind Rajeswaran and Kimin Lee and Aditya Grover and Misha Laskin and Pieter Abbeel and Aravind Srinivas and Igor Mordatch, "Decision transformer: Reinforcement learning via sequence modeling", Advances in neural information processing systems, vol. 34, pp. 15084--15097, 2021.

  21. Scott Reed and Konrad Zolna and Emilio Parisotto and Sergio Gomez Colmenarejo and Alexander Novikov and Gabriel Barth-Maron and Mai Gimenez and Yury Sulsky and Jackie Kay and Jost Tobias Springenberg and others, "A generalist agent", arXiv preprint arXiv:2205.06175, 2022.

  22. Anthony Brohan and Noah Brown and Justice Carbajal and Yevgen Chebotar and Joseph Dabis and Chelsea Finn and Keerthana Gopalakrishnan and Karol Hausman and Alex Herzog and Jasmine Hsu and Julian Ibarz and Brian Ichter and Alex Irpan and Tomas Jackson and Sally Jesmonth and Nikhil Joshi and Ryan Julian and Dmitry Kalashnikov and Yuheng Kuang and Isabel Leal and Kuang-Huei Lee and Sergey Levine and Yao Lu and Utsav Malla and Deeksha Manjunath and Igor Mordatch and Ofir Nachum and Carolina Parada and Jodilyn Peralta and Emily Perez and Karl Pertsch and Jornell Quiambao and Kanishka Rao and Michael Ryoo and Grecia Salazar and Pannag Sanketi and Kevin Sayed and Jaspiar Singh and Sumedh Sontakke and Austin Stone and Clayton Tan and Huong Tran and Vincent Vanhoucke and Steve Vega and Quan Vuong and Fei Xia and Ted Xiao and Peng Xu and Sichun Xu and Tianhe Yu and Brianna Zitkovich, "RT-1: Robotics Transformer for Real-World Control at Scale", arXiv preprint arXiv:2212.06817, 2022.

  23. Anthony Brohan and Noah Brown and Justice Carbajal and Yevgen Chebotar and Xi Chen and Krzysztof Choromanski and Tianli Ding and Danny Driess and Avinava Dubey and Chelsea Finn and Pete Florence and Chuyuan Fu and Montse Gonzalez Arenas and Keerthana Gopalakrishnan and Kehang Han and Karol Hausman and Alex Herzog and Jasmine Hsu and Brian Ichter and Alex Irpan and Nikhil Joshi and Ryan Julian and Dmitry Kalashnikov and Yuheng Kuang and Isabel Leal and Lisa Lee and Tsang-Wei Edward Lee and Sergey Levine and Yao Lu and Henryk Michalewski and Igor Mordatch and Karl Pertsch and Kanishka Rao and Krista Reymann and Michael Ryoo and Grecia Salazar and Pannag Sanketi and Pierre Sermanet and Jaspiar Singh and Anikait Singh and Radu Soricut and Huong Tran and Vincent Vanhoucke and Quan Vuong and Ayzaan Wahid and Stefan Welker and Paul Wohlhart and Jialin Wu and Fei Xia and Ted Xiao and Peng Xu and Sichun Xu and Tianhe Yu and Brianna Zitkovich, "RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control", arXiv preprint arXiv:2307.15818, 2023.

  24. Abhishek Padalkar and Acorn Pooley and Ajinkya Jain and Alex Bewley and Alex Herzog and Alex Irpan and Alexander Khazatsky and Anant Rai and Anikait Singh and Anthony Brohan and others, "Open X-Embodiment: Robotic learning datasets and RT-X models", arXiv preprint arXiv:2310.08864, 2023.

  25. Moo Jin Kim and Karl Pertsch and Siddharth Karamcheti and Ted Xiao and Ashwin Balakrishna and Suraj Nair and Rafael Rafailov and Ethan Foster and Grace Lam and Pannag Sanketi and Quan Vuong and Thomas Kollar and Benjamin Burchfiel and Russ Tedrake and Dorsa Sadigh and Sergey Levine and Percy Liang and Chelsea Finn, "OpenVLA: An Open-Source Vision-Language-Action Model", arXiv preprint arXiv:2406.09246, 2024.

  26. Nur Muhammad Shafiullah and Zichen Cui and Ariuntuya (Arty) Altanzaya and Lerrel Pinto, "Behavior Transformers: Cloning k modes with one stone", Advances in Neural Information Processing Systems, vol. 35, pp. 22955--22968, 2022.

  27. Seungjae Lee and Yibin Wang and Haritheja Etukuru and H Jin Kim and Nur Muhammad Mahi Shafiullah and Lerrel Pinto, "Behavior generation with latent actions", arXiv preprint arXiv:2403.03181, 2024.

  28. Pete Florence and Corey Lynch and Andy Zeng and Oscar A Ramirez and Ayzaan Wahid and Laura Downs and Adrian Wong and Johnny Lee and Igor Mordatch and Jonathan Tompson, "Implicit behavioral cloning", Conference on Robot Learning, pp. 158--168, 2022.

  29. Christopher M Bishop, "Mixture density networks", 1994.

  30. Cheng Chi and Zhenjia Xu and Chuer Pan and Eric Cousineau and Benjamin Burchfiel and Siyuan Feng and Russ Tedrake and Shuran Song, "Universal Manipulation Interface: In-The-Wild Robot Teaching Without In-The-Wild Robots", 2024.

  31. Cheng Chi and Siyuan Feng and Yilun Du and Zhenjia Xu and Eric Cousineau and Benjamin Burchfiel and Shuran Song, "Diffusion Policy: Visuomotor Policy Learning via Action Diffusion", Proceedings of Robotics: Science and Systems, 2023.

  32. Preetum Nakkiran and Arwen Bradley and Hattie Zhou and Madhu Advani, "Step-by-Step Diffusion: An Elementary Tutorial", 2024.

  33. Jiaming Song and Chenlin Meng and Stefano Ermon, "Denoising Diffusion Implicit Models", International Conference on Learning Representations, 2020.

  34. Frank Permenter and Chenyang Yuan, "Interpreting and Improving Diffusion Models Using the Euclidean Distance Function", arXiv preprint arXiv:2306.04848, 2023.

  35. Yilun Xu and Mingyang Deng and Xiang Cheng and Yonglong Tian and Ziming Liu and Tommi Jaakkola, "Restart Sampling for Improving Generative Processes", Advances in Neural Information Processing Systems, vol. 36, pp. 76806--76838, 2023.

  36. Jascha Sohl-Dickstein and Eric Weiss and Niru Maheswaranathan and Surya Ganguli, "Deep unsupervised learning using nonequilibrium thermodynamics", International conference on machine learning, pp. 2256--2265, 2015.

  37. Aapo Hyvarinen, "Estimation of Non-Normalized Statistical Models by Score Matching", Journal of Machine Learning Research, vol. 6, pp. 695--708, 2005.

  38. Pascal Vincent, "A connection between score matching and denoising autoencoders", Neural computation, vol. 23, no. 7, pp. 1661--1674, 2011.

  39. Yang Song and Stefano Ermon, "Generative Modeling by Estimating Gradients of the Data Distribution", Advances in Neural Information Processing Systems, vol. 32, 2019.

  40. Jonathan Ho and Ajay Jain and Pieter Abbeel, "Denoising diffusion probabilistic models", Advances in neural information processing systems, vol. 33, pp. 6840--6851, 2020.

  41. Michael Janner and Yilun Du and Joshua Tenenbaum and Sergey Levine, "Planning with Diffusion for Flexible Behavior Synthesis", International Conference on Machine Learning, pp. 9902--9915, 2022.

  42. Adam Block and Ali Jadbabaie and Daniel Pfrommer and Max Simchowitz and Russ Tedrake, "Provable Guarantees for Generative Behavior Cloning: Bridging Low-Level Stability and High-Level Behavior", Thirty-seventh Conference on Neural Information Processing Systems, 2023.

  43. Yu Zhang and Qiang Yang, "A survey on multi-task learning", IEEE Transactions on Knowledge and Data Engineering, vol. 34, no. 12, pp. 5586--5609, 2021.

  44. Rishi Bommasani and Drew A Hudson and Ehsan Adeli and Russ Altman and Simran Arora and Sydney von Arx and Michael S Bernstein and Jeannette Bohg and Antoine Bosselut and Emma Brunskill and others, "On the opportunities and risks of foundation models", arXiv preprint arXiv:2108.07258, 2021.

  45. Dhruv Shah and Ajay Sridhar and Nitish Dashora and Kyle Stachowicz and Kevin Black and Noriaki Hirose and Sergey Levine, "ViNT: A Foundation Model for Visual Navigation", Conference on Robot Learning, pp. 711--733, 2023.
