We've already seen search through a discrete space. Now let's look at its relationship with Dynamic Programming.
- Consider backwards uniform-cost search
- Store (best) distance to goal at each node : V(n) = min V(n') + 1
- Note that you can update the 'best-distance' estimates in almost any order an it'll work, assuming enough repetitions
- Even random - stochastic dynamic programming
- The caching of the distance estimates allows the order to be decoupled from the search
- This best distance is a perfect heuristic! Pity it takes as long to find as the full search :)
- Consider a stochastic problem (but assume fully observable and markov with a reachable goal)
- Instead of best distance to goal, store best expected distance to goal
- Each action has an expected distance to goal == average distance to goal given the transition function
- The best expected distance to goal is the expected distance to goal of the best action.
- Can define this recursively:
- Let V(n) = expected best distance to goal from node n
- Let Q(n, a) = expected distance to goal from node n, if I choose action a
- Let T(n, a, n') be the probability we reach node n' if we choose action a in node n
- Q(n, a) = 1 + Sum_n' T(n, a, n') V(n')
- V(n) = max_a Q(n, a)
- Can also define the best policy easily:
- pi(n) = best action at node n = argmax_a Q(n, a)
- V(n) = Q(n, pi(n))
- Solving Markov Decision Processes
- Talk about maximising reward rather than minimising distance or cost
- Talk about states rather than nodes
- Rewards can vary for each step (say, each state/action) - use R(s, a)
- Rewards can actually be stochastic. R(s, a) then refers to the expected value
- Rewards can also depend upon the resulting state, but it doesn't add anything: R(s, a) = sum_s' T(s, a, s') . R(s, a, s')
- This gives the standard Bellman Equations for finite horizon problems:
- Q(s,a) = R(s, a) + Sum_s' T(s, a, s') V(s')
- pi(s) = argmax_s Q(s, a)
- V(s) = Q(s, pi(s))
- Goal criterion (mazimising reward)
- Finite horizon:
- At some point we reach a state that is absorbing (goal state)
- maximise sum_t R_t
- Infinite horizon: not guaranteed to terminate
- Naieve solution
- maximise sum_t R_t
- sum diverges
- Add a 'discount factor':
- maximise sum_t \gamma^t R_t
- view gamma as:
- interest rate
- mathematical trick
- probability of death on each step
- sum is bounded
- Q(s,a) = R(s, a) + \gamma Sum_s' T(s, a, s') V(s')
- This is the most common MDP framework
- Average reward
- Maximise lim T->infinity sum_t=0->T 1/T R_t
- Hard to work with, but a very nice framework in many respects
- is related to discounted setup as gamma -> 1
- Can be many equally optimal solutions
- Simple gain optimality ignores finite length prefixes to an optimal cycle
- Add an extra optimalilty criterion - bias optimality
- Choose the policy that has the best finite horizon rewards when not in a cycle
- Update regemes
- Value iteration: for each state update Q, V and Pi
- Policy iteration: update Q for all states/actions, then update V and Pi for all states/actions, etc
- Any mixture that updates all states/actions infinitely often will work
- Model free learning
- Can approximate Q/V/Pi without knowing T explicitly by sampling from the world
- Q(s,a) <- alpha Q(s,a) + (1-alpha) [ R + V(s') ]
- This is doing stochastic gradient descent to solve the same Bellman equations
- There are a number of different forms of this including TD(lambda) learning and SARSA