Review
- Building Models
Today
- Building Heuristics (deterministic, goal state(s))
- Backward search / Dynamic programming
- Both cost and search heuristics
- Relaxing problem to find admissible (optimistic) heuristic
- Backward search / Dynamic programming
- Dynamic Programming for Markov Decision Processes
- Stochasticity
- Random order of update
- Value iteration
- Policy iteration
- Modified Policy iteration
- State and state/action value functions
- Q(s,a) = E(r + V(s')) = Sum_s' P(s,a,s') ( r + V(s') )
- "Expected reward for taking action a in state s and then following policy pi"
- Note: P(s,a,s') is the model of the world
- V(s) = max_a Q(s, a)
- "Expected reward for following policy pi from state s"
- pi*(s) = argmax_a Q(s, a)
- pi* is the optimal policy (assuming everything has converged)
- Other optimality criterion
- One way to view what we were doing is to find a policy that maximises the long term sum of reward
- sum_{0 < t < infinity} r_t
- Maximise "sum of reward" is tricky if sums can be infinite (loops)
- infinite horizon, discounted
- Find policy to maximise sum_{0 < t < infinity} gamma^t r_t
- gamma is:
- trick to keep things bounded
- Interest rate
- (1-gamma) = probability of transitioning to 0-value terminal state each step
- Q(s,a) = E(r + gamma V(s'))
- average reward
- Maximise lim t_max -> infinity 1/t_max sum_{0 < t < t_max} r_t
- Discussion
- MDP solving finds a complete policy -> looks at entire state space
Introduction to Soft-Max / Boltzmann distribution The formula is p(i) = e^(f(i) / T) / (sum of all probabilities)
Written pretty here: http://computing.dcu.ie/~humphrys/Notes/AI/boltzmann.html