Partially Observable Markov Decision Processes

Earlier we talked about models. Then I gave a vague definition about a model giving a probability distribution over observations and rewards at any given point in time. Now it's time to make that more concrete. I'm going to introduce something very similar to Partially Observable Markov Decision Processes (but slightly different).

Let's assume that our model is a stochastic finite state machine. There will be six parts to this, <S, A, O, T, R, V>. S is a finite set of states. A is a finite set of actions. O is a finite set of observations. Then we get the functions; T defines the probability distribution over next states given a state and an action. R defines the probability distribution over rewards given a state and an action, and V defines a probability distribution over observations given a state.

e.g. imagine a 10 x 10 gridworld. That defines 100 states. We have four actions: N, S, E and W that, as we'll see, move us in the usual compass directions. We have 16 observations that correspond to 4 bits noting whether a wall is immediately North, South, East or West of the current state. T is the part of the model that tells us how the actions actually work. For any state without a wall in the given direction, we move in that direction with the appropriate action 90% of the time. 6% of the time (2% each) we move in one of the other directions, and 4% of the time we don't move. If there is a wall in a given direction, then trying to move in that direction leaves you where you are. The observation function, V, tells us which walls are immediately next to the current state, but has a 5% chance of flipping each bit. The reward is 1 in middle four squares in the room, and 0 elsewhere. There is a reward of -1 if you step into a wall.

  • Discussion: Are walls part of the state? Why/Why not?

Tracking the state

In the section on BayesianPrediction, models had to