POMDP
- Definitions:
- Underlying state: the state space of the base problem
- Only indirectly inferable from
- a series of sensor readings
- someone else's sensors
- e.g. does the world disappear when I turn my back? I assume not.
- Sensors: those things that actually return data
- One possible definition: assume we have a set of symbols O
- Sensor model is a probability distribution P(o|s,a)
- dependancy on a allows sensing actions
- This definition assumes we know the underlying state
- In reality the reverse holds - we infer underlying state from our sensors
- Belief state: A probability distribution over underlying states
- Transition and sensor models are markov over belief state
- Markov state: Any state description where the function of interest (transition function) only depends upon the current state
- a belief state representation is a markov state representation
- The instantaneous sensor readings in a partially observable problem do not usually form a markov state
- There are other markov state representations apart from belief state
- Underlying state: the state space of the base problem
- Problem is Partially Observable
- Using Sensors as state leads to a non-markov problem
- How the world behaves when the sensors give a particular reading depends on more than just that reading (e.g. the history)
- Using Sensors as state leads to a non-markov problem
- Solution 1: ignore it
- Use the sensor space as our state space
- When we learn our model we'll average out the non-markov nature of things
- Exactly how things average out might change as our policy changes
- Assuming we ignore that too, our world might look non-stationary
- Assuming there is NO way to sense the underlying non-markov-ness, this is the right solution
- Any way that it could make a difference to know the underlying state is a way to detect that state
- Solution 2: Use Tracking to estimate state
- Make decision assuming you know where you are
- track state to decide where you are
- suboptimal. e.g. T intersection and looking up
- world is a T intersection
- stochastic movement actions
- no default sensing
- lookup action gives bools for 4 neighbouring walls, takes a timestep
- solution to MDP will never use lookup action
- If you miss the T, then you just sit hitting the wall while the expected location heads up the corridor
- e.g. 2: AIBO playing soccer
- MDP: Robot's head never moves/moves randomly - it knows where everything is anyway.
- Having 'GPS' track location and ball doesn't matter because the hopeless head control means your getting less information than you need
- Quadrants: Markov chain, HMM, MDP, POMDP
- tracking is like using HMM to track state and MDP to decide action
- Solution 3: Use entire history as 'state' for policy
- can't get more data than this
- HUGE state space
- State space is variable length
- Is actually used by some algorithms
- Use function approximation to either
- a) only look at a subset of the history/"state"
- e.g. U-Tree (did the driving example you saw yesterday)
- b) reduce the dimensionality of this "state"
- Solution 4: MDP over "belief state" not underlying world state
- Original problem: Discrete with k variables (each with m possible values for m^k states)
- POMDP version: we have a probability we are in each state
- each state has a corresponding number [0-1]
- each discrete state in the underlying problem is a continuous dimension in the POMDP
- Now need to solve continuous MDP
- e.g. One light - 3 states - "On", "Dim", "Off" - 4 actions - "move switch up", "move switch down", "look in room", "wait"
- 2 D state space (probabilities sum to 1 - removes one degree of freedom from the system)
- Corners are where probabilites are known
- centre is equal probability
- look action moves belief state towards one of the corners (corner depends upon underlying state of the world)
- move switch action moves belief state towards one the corners (corner depends upon action)
- wait action moves belief state towards the centre (some probability someone else flicked the switch in an unknown way)
- Don't actually need to look in this case
- Transition function in belief state is based on transition function in underlying state
- It turns out that the value function over belief state can be shown to be piecewise linear and concave
- ignorance never helps - you're always better off near the corners
- I'm not going to show this here
- What happens if underlying state is continuous?
- Need some representation of the belief distribution over that state
- parametized form (gaussian)
- functional form (however you want to represent that function - variable length parametisation)
- solve the MDP formed over that space
- Solution 5: Use other markov state
- If we know that there is a k dim markov state space, any transformation of that state space will do
- Predictive state representations
- Imagine k 'tests'
- If I perform this series of actions, what is the chance I get exactly this series of observations,
- If we chose the right k tests, they can form a basis for our k dimensional markov state
- Note - we don't actually need to execute the tests
- without proof
- A set of such tests exists
- where no test is longer than k steps
- Note: No concept of underlying state required
- e.g. two identical worlds side by side
- we don't know which world we're in
- PSR would not care which world we're in (belief state would represent the possibilities)
- Figuring out good tests and the transition function over the tests from a trajectory through the world is ongoing research
- Still need to solve the continuous MDP over this state
- Solution 6: Arbitrary memory and policy gradient
- Doug Aberdeen's work
- Review policy gradient MDP solving with neural nets
- Neural net is a smooth parameterized policy
- Use gradient descent to find the best weights = best policy
- Use a second neural net to track markov state
- Use gradient descent over the pair of neural nets to fit both together