Algorithms for Walking, Running, Swimming, Flying, and Manipulation
© Russ Tedrake, 2024
For notational simplicity, let's start the discussion with discrete time, states, actions, and even observations. This allows us to have a lossless representation of the probability masses over these quantities, and to write them as vectors.
A partially-observable Markov decision process (POMDP) is a stochastic
state-space dynamical system with discrete states $X$, actions $U$,
observations $Y$, initial condition probabilities $P_{x_0}(x)$, transition
probabilities $P_x(x'|x,u)$, observation probabilities $P_y(y | x)$, and a
deterministic reward function $R: X \times U \rightarrow \Re$.
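To fix ideas, a minimal tabular representation of such a POMDP in Python/NumPy might look like the following sketch; the sizes and all probability values are arbitrary toy choices for illustration, not taken from the text.

```python
import numpy as np

# A toy POMDP with |X| = 2 states, |U| = 2 actions, |Y| = 2 observations.
# All probability values below are arbitrary placeholders.
num_states, num_actions, num_observations = 2, 2, 2

# Initial condition probabilities P_x0(x).
P_x0 = np.array([0.5, 0.5])

# Transition probabilities P_x(x'|x,u), stored as P_x[u][x', x] so that each
# column is a distribution over next states (column-stochastic).
P_x = np.array([
    [[0.9, 0.2],
     [0.1, 0.8]],   # action u = 0
    [[0.5, 0.5],
     [0.5, 0.5]],   # action u = 1
])

# Observation probabilities P_y(y|x), stored as P_y[y, x].
P_y = np.array([
    [0.8, 0.3],
    [0.2, 0.7],
])

# Deterministic reward R(x, u).
R = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
])
```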
An important idea for POMDPs is the notion of the "belief state": \[[b_n]_i = \Pr(x_n = x_i | u_0, y_0, \ldots, u_{n-1}, y_{n-1}).\] $b_n \in \mathcal{B}_n \subset \Delta^{|X|}$, where $\Delta^n \subset \Re^n$ is the $n$-dimensional simplex and we use $[b_n]_i$ to denote the $i$th element. $b_n$ is a sufficient statistic for the POMDP, in the sense that it captures all information about the history that can be used to predict the future state (and therefore also the observations and rewards) of the process. Using $h_n = [u_0, y_0, \ldots, u_{n-1}, y_{n-1}]$ to denote the history of all actions and observations, we have $$\Pr(x_{n+1} = x | b_n, h_n, u_n) = \Pr(x_{n+1} = x | b_n, u_n).$$ It is also sufficient for optimal control, in the sense that the optimal policy can be written as $u_n^* = \pi^*(b_n, n).$
The belief state is observable, with the optimal Bayesian estimator/observer given in (belief-)state-space form by \begin{gather*} [b_{n+1}]_i = f_i(b_n, u_n, y_n) = \frac{P_y(y_n | x_i) \sum_{j=1}^{|X|} P_x(x_i|x_j,u_n) [b_n]_j}{\Pr(y_n | b_n, u_n)},\end{gather*} where the initial belief is given by the prior, $[b_0]_i = P_{x_0}(x_i)$, and $$\Pr(y_j | b, u) = \sum_{x' \in X} P_y(y_j | x') \sum_{i=1}^{|X|} P_x(x' | x_i, u) [b]_i = {\bf P}_y(y_j|x) {\bf P}_{x,u} b,$$ where ${\bf P}_y(y_j|x)$ is the $|X|$-element row vector of observation probabilities and ${\bf P}_{x,u}$ is the $|X| \times |X|$ matrix of transition probabilities given $u$. We can even write $$b_{n+1} = f(b_n, u_n, y_n) = \frac{{\bf D}_{y_n} {\bf P}_{x,u_n} b_{n}}{\Pr(y_n|b_n, u_n)},$$ where ${\bf D}_{y_j}$ is the diagonal matrix with $P_y(y_j|x_i)$ as its $i$th diagonal entry. Moreover, we can use this belief update to form the "belief MDP" for planning by marginalizing over the future observations, giving the belief-transition probabilities \begin{align*} \Pr(b_{n+1}=b \mid b_n, u_n) = \sum_{\{y \mid b = f(b_n, u_n, y)\}} \Pr(y \mid b_n, u_n) = \sum_{\{j \mid b=f(b_n, u_n, y_j)\}} {\bf P}_y(y_j|x) {\bf P}_{x,u_n} b_n.\end{align*} Note that although $b_n$ is represented with a vector of real values, it actually lives in a finite space, $\mathcal{B}_n$, which is why we use probability mass and MDP notation here. It is well known that the optimal policy $\pi^*(b_n, n)$ is the optimal state feedback for the belief MDP.
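A short sketch of this recursion, reusing the toy `P_x0`, `P_x`, `P_y` arrays and `num_observations` from the snippet above (again, purely illustrative), could look like this:

```python
def belief_update(b, u, y):
    """One step of the Bayesian belief update, b_{n+1} = f(b_n, u_n, y_n)."""
    b_pred = P_x[u] @ b             # predict: P_{x,u} b
    unnormalized = P_y[y] * b_pred  # correct: elementwise, i.e. D_{y} P_{x,u} b
    return unnormalized / unnormalized.sum()  # divide by Pr(y | b, u)

def belief_mdp_transition(b, u):
    """Marginalize over observations: returns (probability, next belief) pairs."""
    successors = []
    for y in range(num_observations):
        prob_y = P_y[y] @ (P_x[u] @ b)   # Pr(y | b, u)
        if prob_y > 0:
            successors.append((prob_y, belief_update(b, u, y)))
    return successors

# Example: start from the prior, take action u = 0, and observe y = 1.
b0 = P_x0.copy()
print(belief_update(b0, u=0, y=1))
print(belief_mdp_transition(b0, u=0))
```

To exactly match the belief-MDP transition above, successors whose observations produce the same next belief should have their probabilities summed; the sketch simply returns one pair per observation.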
Information-gathering actions...
It's worth noting that, even for fully-actuated robots, planning in belief space is underactuated. Even in the linear-Gaussian setting, the dimension of the belief space is the number of means (equal to the number of deterministic states) plus the number of covariances (equal to the number of deterministic states squared), minus one if we take into account the constraint that the probabilities have to integrate to one. No robot has enough actuators to instantaneously control all of these degrees of freedom.
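To make the dimension counting concrete, here is a small sketch (not from the text) for a hypothetical fully-actuated linear-Gaussian system; the matrices `A`, `B`, `C`, `W`, `V` and the size `n = 2` are arbitrary illustrative choices. In the linear-Gaussian setting the Kalman filter plays the role of the Bayesian belief update, and counting the belief's degrees of freedom against the number of actuators shows the mismatch.

```python
import numpy as np

# Toy fully-actuated linear-Gaussian system: x' = A x + B u + w,  y = C x + v.
# (All matrices and the size n = 2 are hypothetical choices for illustration.)
n = 2
A, B, C = np.eye(n), np.eye(n), np.eye(n)    # one actuator per state
W, V = 0.1 * np.eye(n), 0.1 * np.eye(n)      # process / measurement noise

def gaussian_belief_update(mu, Sigma, u, y):
    """One step of the Gaussian belief dynamics (the Kalman filter)."""
    # Predict.
    mu_p = A @ mu + B @ u
    Sigma_p = A @ Sigma @ A.T + W
    # Measurement update.
    K = Sigma_p @ C.T @ np.linalg.inv(C @ Sigma_p @ C.T + V)
    return mu_p + K @ (y - C @ mu_p), (np.eye(n) - K @ C) @ Sigma_p

# Counting as in the text: n means + n^2 covariance entries, but only n inputs.
print(f"belief dimension = {n + n * n}, number of actuators = {n}")
```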
A number of other techniques that we have discussed are intimately related to this notion of planning under uncertainty and belief-space, even if that connection is not typically made explicitly. For instance, if we perform policy search in a partially-observable setting, using a dynamic policy, then the state realization in the policy is best thought of as an (approximate) belief state. These days it's also common to learn value functions or Q-functions with internal state; again the interpretation should be that the observations are being collated into an (approximate) belief state.
Let us contrast this, however, with the "student-teacher" architecture that is sometimes used in RL -- where one first trains a state-feedback policy with RL, then uses supervised learning to turn this into an output feedback policy. This optimal state-feedback policy can be fundamentally different -- it will not take any information-gathering actions...