We've already seen search through a discrete space. Now let's look at its relationship with Dynamic Programming.

  • Consider backwards uniform-cost search
    • Store (best) distance to goal at each node : V(n) = min V(n') + 1
    • Note that you can update the 'best-distance' estimates in almost any order an it'll work, assuming enough repetitions
    • Even random - stochastic dynamic programming
    • The caching of the distance estimates allows the order to be decoupled from the search
    • This best distance is a perfect heuristic! Pity it takes as long to find as the full search :)
  • Consider a stochastic problem (but assume fully observable and markov with a reachable goal)
    • Instead of best distance to goal, store best expected distance to goal
    • Each action has an expected distance to goal == average distance to goal given the transition function
    • The best expected distance to goal is the expected distance to goal of the best action.
    • Can define this recursively:
      • Let V(n) = expected best distance to goal from node n
      • Let Q(n, a) = expected distance to goal from node n, if I choose action a
      • Let T(n, a, n') be the probability we reach node n' if we choose action a in node n
      • Q(n, a) = 1 + Sum_n' T(n, a, n') V(n')
      • V(n) = max_a Q(n, a)
    • Can also define the best policy easily:
      • pi(n) = best action at node n = argmax_a Q(n, a)
      • V(n) = Q(n, pi(n))
  • Solving Markov Decision Processes
    • Talk about maximising reward rather than minimising distance or cost
    • Talk about states rather than nodes
    • Rewards can vary for each step (say, each state/action) - use R(s, a)
    • Rewards can actually be stochastic. R(s, a) then refers to the expected value
    • Rewards can also depend upon the resulting state, but it doesn't add anything: R(s, a) = sum_s' T(s, a, s') . R(s, a, s')
    • This gives the standard Bellman Equations for finite horizon problems:
    • Q(s,a) = R(s, a) + Sum_s' T(s, a, s') V(s')
    • pi(s) = argmax_s Q(s, a)
    • V(s) = Q(s, pi(s))
    • Goal criterion (mazimising reward)
    • Finite horizon:
      • At some point we reach a state that is absorbing (goal state)
      • maximise sum_t R_t
    • Infinite horizon: not guaranteed to terminate
      • Naieve solution
      • maximise sum_t R_t
      • sum diverges
      • Add a 'discount factor':
      • maximise sum_t \gamma^t R_t
      • view gamma as:
        • interest rate
        • mathematical trick
        • probability of death on each step
      • sum is bounded
        • Q(s,a) = R(s, a) + \gamma Sum_s' T(s, a, s') V(s')
      • This is the most common MDP framework
      • Average reward
      • Maximise lim T->infinity sum_t=0->T 1/T R_t
      • Hard to work with, but a very nice framework in many respects
      • is related to discounted setup as gamma -> 1
      • Can be many equally optimal solutions
        • Simple gain optimality ignores finite length prefixes to an optimal cycle
        • Add an extra optimalilty criterion - bias optimality
        • Choose the policy that has the best finite horizon rewards when not in a cycle
    • Update regemes
    • Value iteration: for each state update Q, V and Pi
    • Policy iteration: update Q for all states/actions, then update V and Pi for all states/actions, etc
    • Any mixture that updates all states/actions infinitely often will work
  • Model free learning
    • Can approximate Q/V/Pi without knowing T explicitly by sampling from the world
    • Q(s,a) <- alpha Q(s,a) + (1-alpha) [ R + V(s') ]
    • This is doing stochastic gradient descent to solve the same Bellman equations
    • There are a number of different forms of this including TD(lambda) learning and SARSA