Agent Example:

  • Example one - DunceBot

    • Get to corner of room
    • Unknown map
    • Known sensors, actuators/actions
    • Deterministic, Markov

    • Discussion: Which is more important: Building a map, or reaching goal?

      • Exploration/Exploitation trade off - depends upon how many times you'll use the map
      • Are you going to be switching goals much?
      • Including goals as state
    • As so much is known, it is possible to invent a policy

    • This is an AI class - how can we make the system do more of the work?
      • If I were to change the goals, would that just require another search or a human-in-the-loop re-design?
  • Example Two

    • Non-deterministic
    • Can use Markov sensing to determine failures
  • Example Three

    • Unknown actions as well as unknown map
    • Deterministic
    • Have to learn model of actions at least
    • Trade off between prior knowledge assumed and amount of data needed for learning
    • Once you have the model, can use search/planning
  • Example Four

    • Unknown actions/map
    • Non-deterministic
    • Model based
      • Learn model
      • Solve using Dynamic Programming, or direct policy search
    • Model free
      • Solve using direct policy search (gradient descent in policy space - each function evaluation is a set of trial runs in the real world)
      • Model free dynamic programming (Which is what we introduce next)
  • Example Five

    • Known map
    • Known (but stochastic) actions.
    • Known sensor models.
    • Non-markov/partially observable

    • Use DP or gradient descent to find policy

    • Use Bayes filter to estimate state
    • Note this is an approximate solution - it doesn't learn when to perform sense actions
      • For more discussion about this Partially observable case, wait a week :)
  • Example Six

    • Unknown map
    • Known (but stochastic) actions.
    • Known sensor models
    • Non-markov/partially observable

    • Use Bayes filter to estimate both map and current state in map (SLAM)

    • Once you have enough map, use DP or gradient descent to find policy
    • Note: as above - this is still approximate

Q-Learning - Model free, Dynamic Programming based, MDP solving

  • Already seen model based q-function
    • Q(s,a) = E_s' ( r(s,a,s') + \gamma V(s') )
    • Q(s,a) = Sum_s' P(s,a,s') ( r(s,a,s') + \gamma V(s') )
  • This requires P(s,a,s') -> the action model
  • Can replace the model with a stochastic approxamiation
    • Calculate a short-term estimate of Q(s,a) as if the action we just took was deterministic
      • Q-short(s,a,s',r) = r + \gamma V(s')
    • Update the long-term estimate of Q(s,a) a short step of the way towards the short term estimate
      • Q-long(s,a) = alpha * Q-long(s,a) + (1-alpha) * Q-short(s,a,s',r)
      • Q(s,a) = alpha * Q(s,a) + (1-alpha) (r + \gamma V(s'))