Algorithms for Walking, Running, Swimming, Flying, and Manipulation
© Russ Tedrake, 2024
For notational simplicity, let's start the discussion with discrete time, states, actions, and even observations. This allows us to have a lossless representation of the probability masses over these quantities, and to write them as vectors.
A partially-observable Markov decision process (POMDP) is a stochastic
state-space dynamical system with discrete states $X$, actions $U$,
observations $Y$, initial condition probabilities $P_{x_0}(x)$, transition
probabilities $P_x(x'|x,u)$, observation probabilities $P_y(y | x)$, and a
deterministic reward function $R: X \times U \rightarrow \Re$.
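To fix ideas, a minimal tabular representation of such a POMDP in Python/NumPy might look like the following sketch; the sizes and all probability values are arbitrary toy choices for illustration, not taken from the text.

```python
import numpy as np

# A toy POMDP with |X| = 2 states, |U| = 2 actions, |Y| = 2 observations.
# All probability values below are arbitrary placeholders.
num_states, num_actions, num_observations = 2, 2, 2

# Initial condition probabilities P_x0(x).
P_x0 = np.array([0.5, 0.5])

# Transition probabilities P_x(x'|x,u), stored as P_x[u][x', x] so that each
# column is a distribution over next states (column-stochastic).
P_x = np.array([
    [[0.9, 0.2],
     [0.1, 0.8]],   # action u = 0
    [[0.5, 0.5],
     [0.5, 0.5]],   # action u = 1
])

# Observation probabilities P_y(y|x), stored as P_y[y, x].
P_y = np.array([
    [0.8, 0.3],
    [0.2, 0.7],
])

# Deterministic reward R(x, u).
R = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
])
```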
An important idea for POMDPs is the notion of the "belief state": \[[b_n]_i = \Pr(x_n = x_i | u_0, y_0, \ldots, u_{n-1}, y_{n-1}).\] $b_n \in \mathcal{B}_n \subset \Delta^{|X|}$, where $\Delta^n \subset \Re^n$ is the $n$-dimensional simplex and we use $[b_n]_i$ to denote the $i$th element. $b_n$ is a sufficient statistic for the POMDP, in the sense that it captures all information about the history that can be used to predict the future state (and therefore also the observations and rewards) of the process. Using $h_n = [u_0, y_0, \ldots, u_{n-1}, y_{n-1}]$ to denote the history of all actions and observations, we have $$\Pr(x_{n+1} = x | b_n, h_n, u_n) = \Pr(x_{n+1} = x | b_n, u_n).$$ It is also sufficient for optimal control, in the sense that the optimal policy can be written as $u_n^* = \pi^*(b_n, n).$
The belief state is observable, with the optimal Bayesian estimator/observer given in (belief-)state-space form by \begin{gather*} [b_{n+1}]_i = f_i(b_n, u_n, y_n) = \frac{P_y(y_n | x_i) \sum_{j=1}^{|X|} P_x(x_i|x_j,u_n) [b_n]_j}{\Pr(y_n | b_n, u_n)},\end{gather*} where the initial belief is given by the prior, $[b_0]_i = P_{x_0}(x_i)$, and $$\Pr(y_j | b, u) = \sum_{x' \in X} P_y(y_j | x') \sum_{i=1}^{|X|} P_x(x' | x_i, u) [b]_i = {\bf P}_y(y_j|x) {\bf P}_{x,u} b,$$ where ${\bf P}_y(y_j|x)$ is the $|X|$-element row vector of observation probabilities and ${\bf P}_{x,u}$ is the $|X| \times |X|$ matrix of transition probabilities given $u$. We can even write $$b_{n+1} = f(b_n, u_n, y_n) = \frac{{\bf D}_{y_n} {\bf P}_{x,u_n} b_{n}}{\Pr(y_n|b_n, u_n)},$$ where ${\bf D}_{y_j}$ is the diagonal matrix with $P_y(y_j|x_i)$ as its $i$th diagonal entry. Moreover, we can use this belief update to form the "belief MDP" for planning by marginalizing over the future observations, giving the belief-transition probabilities \begin{align*} \Pr(b_{n+1}=b \mid b_n, u_n) = \sum_{\{y \mid b = f(b_n, u_n, y)\}} \Pr(y \mid b_n, u_n) = \sum_{\{j \mid b=f(b_n, u_n, y_j)\}} {\bf P}_y(y_j|x) {\bf P}_{x,u_n} b_n.\end{align*} Note that although $b_n$ is represented with a vector of real values, it actually lives in a finite space, $\mathcal{B}_n$, which is why we use probability mass and MDP notation here. It is well known that the optimal policy $\pi^*(b_n, n)$ is the optimal state feedback for the belief MDP.
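A short sketch of this recursion, reusing the toy `P_x0`, `P_x`, `P_y` arrays and `num_observations` from the snippet above (again, purely illustrative), could look like this:

```python
def belief_update(b, u, y):
    """One step of the Bayesian belief update, b_{n+1} = f(b_n, u_n, y_n)."""
    b_pred = P_x[u] @ b             # predict: P_{x,u} b
    unnormalized = P_y[y] * b_pred  # correct: elementwise, i.e. D_{y} P_{x,u} b
    return unnormalized / unnormalized.sum()  # divide by Pr(y | b, u)

def belief_mdp_transition(b, u):
    """Marginalize over observations: returns (probability, next belief) pairs."""
    successors = []
    for y in range(num_observations):
        prob_y = P_y[y] @ (P_x[u] @ b)   # Pr(y | b, u)
        if prob_y > 0:
            successors.append((prob_y, belief_update(b, u, y)))
    return successors

# Example: start from the prior, take action u = 0, and observe y = 1.
b0 = P_x0.copy()
print(belief_update(b0, u=0, y=1))
print(belief_mdp_transition(b0, u=0))
```

To exactly match the belief-MDP transition above, successors whose observations produce the same next belief should have their probabilities summed; the sketch simply returns one pair per observation.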
Information-gathering actions...
It's worth noting that, even for fully-actuated robots, planning in belief space is underactuated. Even in the linear-Gaussian setting, the dimension of the belief space is the number of means (equal to the number of deterministic states) plus the number of covariances (equal to the number of deterministic states squared), minus one if we take into account the constraint that the probabilities have to integrate to one. No robot has enough actuators to instantaneously control all of these degrees of freedom.
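To make the dimension counting concrete, here is a small sketch (not from the text) for a hypothetical fully-actuated linear-Gaussian system; the matrices `A`, `B`, `C`, `W`, `V` and the size `n = 2` are arbitrary illustrative choices. In the linear-Gaussian setting the Kalman filter plays the role of the Bayesian belief update, and counting the belief's degrees of freedom against the number of actuators shows the mismatch.

```python
import numpy as np

# Toy fully-actuated linear-Gaussian system: x' = A x + B u + w,  y = C x + v.
# (All matrices and the size n = 2 are hypothetical choices for illustration.)
n = 2
A, B, C = np.eye(n), np.eye(n), np.eye(n)    # one actuator per state
W, V = 0.1 * np.eye(n), 0.1 * np.eye(n)      # process / measurement noise

def gaussian_belief_update(mu, Sigma, u, y):
    """One step of the Gaussian belief dynamics (the Kalman filter)."""
    # Predict.
    mu_p = A @ mu + B @ u
    Sigma_p = A @ Sigma @ A.T + W
    # Measurement update.
    K = Sigma_p @ C.T @ np.linalg.inv(C @ Sigma_p @ C.T + V)
    return mu_p + K @ (y - C @ mu_p), (np.eye(n) - K @ C) @ Sigma_p

# Counting as in the text: n means + n^2 covariance entries, but only n inputs.
print(f"belief dimension = {n + n * n}, number of actuators = {n}")
```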
A number of other techniques that we have discussed are intimately related to this notion of planning under uncertainty and belief-space, even if that connection is not typically made explicitly. For instance, if we perform policy search in a partially-observable setting, using a dynamic policy, then the state realization in the policy is best thought of as an (approximate) belief state. These days it's also common to learn value functions or Q-functions with internal state; again the interpretation should be that the observations are being collated into an (approximate) belief state.
Let us contrast this, however, with the "student-teacher" architecture that is sometimes used in RL -- where one first trains a state-feedback policy with RL, then uses supervised learning to turn this into an output feedback policy. This optimal state-feedback policy can be fundamentally different -- it will not take any information-gathering actions...