Review

  • Building Models

Today

  • Building Heuristics (deterministic, goal state(s))
    • Backward search / Dynamic programming
      • Both cost and search heuristics
    • Relaxing problem to find admissible (optimistic) heuristic
  • Dynamic Programming for Markov Decision Processes
    • Stochasticity
    • Random order of update
      • Value iteration
      • Policy iteration
      • Modified Policy iteration
    • State and state/action value functions
      • Q(s,a) = E(r + V(s')) = Sum_s' P(s,a,s') ( r + V(s') )
      • "Expected reward for taking action a in state s and then following policy pi"
      • Note: P(s,a,s') is the model of the world
      • V(s) = max_a Q(s, a)
      • "Expected reward for following policy pi from state s"
      • pi*(s) = argmax_a Q(s, a)
      • pi* is the optimal policy (assuming everything has converged)
    • Other optimality criterion
      • One way to view what we were doing is to find a policy that maximises the long term sum of reward
      • sum_{0 < t < infinity} r_t
      • Maximise "sum of reward" is tricky if sums can be infinite (loops)
      • infinite horizon, discounted
      • Find policy to maximise sum_{0 < t < infinity} gamma^t r_t
      • gamma is:
        • trick to keep things bounded
        • Interest rate
        • (1-gamma) = probability of transitioning to 0-value terminal state each step
      • Q(s,a) = E(r + gamma V(s'))
      • average reward
      • Maximise lim t_max -> infinity 1/t_max sum_{0 < t < t_max} r_t
    • Discussion
    • MDP solving finds a complete policy -> looks at entire state space

Introduction to Soft-Max / Boltzmann distribution The formula is p(i) = e^(f(i) / T) / (sum of all probabilities)

Written pretty here: http://computing.dcu.ie/~humphrys/Notes/AI/boltzmann.html