POMDP

  • Definitions:
    • Underlying state: the state space of the base problem
      • Only indirectly inferable from
      • a series of sensor readings
      • someone else's sensors
      • e.g. does the world disappear when I turn my back? I assume not.
    • Sensors: those things that actually return data
      • One possible definition: assume we have a set of symbols O
      • Sensor model is a probability distribution P(o|s,a)
      • dependancy on a allows sensing actions
      • This definition assumes we know the underlying state
      • In reality the reverse holds - we infer underlying state from our sensors
    • Belief state: A probability distribution over underlying states
      • Transition and sensor models are markov over belief state
    • Markov state: Any state description where the function of interest (transition function) only depends upon the current state
      • a belief state representation is a markov state representation
      • The instantaneous sensor readings in a partially observable problem do not usually form a markov state
      • There are other markov state representations apart from belief state
  • Problem is Partially Observable
    • Using Sensors as state leads to a non-markov problem
      • How the world behaves when the sensors give a particular reading depends on more than just that reading (e.g. the history)
  • Solution 1: ignore it
    • Use the sensor space as our state space
    • When we learn our model we'll average out the non-markov nature of things
      • Exactly how things average out might change as our policy changes
      • Assuming we ignore that too, our world might look non-stationary
      • Assuming there is NO way to sense the underlying non-markov-ness, this is the right solution
      • Any way that it could make a difference to know the underlying state is a way to detect that state
  • Solution 2: Use Tracking to estimate state
    • Make decision assuming you know where you are
    • track state to decide where you are
    • suboptimal. e.g. T intersection and looking up
      • world is a T intersection
      • stochastic movement actions
      • no default sensing
      • lookup action gives bools for 4 neighbouring walls, takes a timestep
      • solution to MDP will never use lookup action
      • If you miss the T, then you just sit hitting the wall while the expected location heads up the corridor
    • e.g. 2: AIBO playing soccer
      • MDP: Robot's head never moves/moves randomly - it knows where everything is anyway.
      • Having 'GPS' track location and ball doesn't matter because the hopeless head control means your getting less information than you need
    • Quadrants: Markov chain, HMM, MDP, POMDP
      • tracking is like using HMM to track state and MDP to decide action
  • Solution 3: Use entire history as 'state' for policy
    • can't get more data than this
    • HUGE state space
    • State space is variable length
    • Is actually used by some algorithms
      • Use function approximation to either
      • a) only look at a subset of the history/"state"
        • e.g. U-Tree (did the driving example you saw yesterday)
      • b) reduce the dimensionality of this "state"
  • Solution 4: MDP over "belief state" not underlying world state
    • Original problem: Discrete with k variables (each with m possible values for m^k states)
    • POMDP version: we have a probability we are in each state
      • each state has a corresponding number [0-1]
      • each discrete state in the underlying problem is a continuous dimension in the POMDP
    • Now need to solve continuous MDP
    • e.g. One light - 3 states - "On", "Dim", "Off" - 4 actions - "move switch up", "move switch down", "look in room", "wait"
      • 2 D state space (probabilities sum to 1 - removes one degree of freedom from the system)
      • Corners are where probabilites are known
      • centre is equal probability
      • look action moves belief state towards one of the corners (corner depends upon underlying state of the world)
      • move switch action moves belief state towards one the corners (corner depends upon action)
      • wait action moves belief state towards the centre (some probability someone else flicked the switch in an unknown way)
      • Don't actually need to look in this case
    • Transition function in belief state is based on transition function in underlying state
    • It turns out that the value function over belief state can be shown to be piecewise linear and concave
      • ignorance never helps - you're always better off near the corners
      • I'm not going to show this here
    • What happens if underlying state is continuous?
      • Need some representation of the belief distribution over that state
      • parametized form (gaussian)
      • functional form (however you want to represent that function - variable length parametisation)
      • solve the MDP formed over that space
  • Solution 5: Use other markov state
    • If we know that there is a k dim markov state space, any transformation of that state space will do
    • Predictive state representations
      • Imagine k 'tests'
      • If I perform this series of actions, what is the chance I get exactly this series of observations,
      • If we chose the right k tests, they can form a basis for our k dimensional markov state
      • Note - we don't actually need to execute the tests
      • without proof
      • A set of such tests exists
      • where no test is longer than k steps
      • Note: No concept of underlying state required
      • e.g. two identical worlds side by side
      • we don't know which world we're in
      • PSR would not care which world we're in (belief state would represent the possibilities)
      • Figuring out good tests and the transition function over the tests from a trajectory through the world is ongoing research
      • Still need to solve the continuous MDP over this state
  • Solution 6: Arbitrary memory and policy gradient
    • Doug Aberdeen's work
    • Review policy gradient MDP solving with neural nets
      • Neural net is a smooth parameterized policy
      • Use gradient descent to find the best weights = best policy
    • Use a second neural net to track markov state
      • Use gradient descent over the pair of neural nets to fit both together