Agent Examples:

• Example one - DunceBot

• Get to corner of room
• Unknown map
• Known sensors, actuators/actions
• Deterministic, Markov

• Discussion: Which is more important: building a map, or reaching the goal?

• Exploration/exploitation trade-off - depends on how many times you'll use the map
• Are you going to be switching goals much?
• Including goals as state
• Since so much is known, it is possible to hand-design a policy (e.g., drive until you hit a wall, then follow it until it meets another wall - a corner)

• This is an AI class - how can we make the system do more of the work?
• If I were to change the goals, would that just require another search, or a human-in-the-loop re-design? (See the search sketch below.)
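
To make "just another search" concrete, here is a minimal breadth-first planner sketch on a grid map (the grid representation and function name are assumptions, not from the notes): once a map exists, a new goal is just a new argument to the same search, not a re-design.

```python
from collections import deque

def bfs_plan(grid, start, goal):
    """Shortest action sequence on a known grid map.
    grid[r][c] is True where the cell is free."""
    moves = {"N": (-1, 0), "S": (1, 0), "E": (0, 1), "W": (0, -1)}
    frontier = deque([start])
    parent = {start: None}                 # state -> (prev state, action)
    while frontier:
        state = frontier.popleft()
        if state == goal:                  # walk back to recover the plan
            plan = []
            while parent[state] is not None:
                state, action = parent[state]
                plan.append(action)
            return plan[::-1]
        r, c = state
        for action, (dr, dc) in moves.items():
            nr, nc = r + dr, c + dc
            if (0 <= nr < len(grid) and 0 <= nc < len(grid[0])
                    and grid[nr][nc] and (nr, nc) not in parent):
                parent[(nr, nc)] = (state, action)
                frontier.append((nr, nc))
    return None                            # goal unreachable

room = [[True] * 4 for _ in range(3)]      # toy 3x4 room, all free
print(bfs_plan(room, (1, 1), (0, 0)))      # one corner
print(bfs_plan(room, (1, 1), (2, 3)))      # changing goals = another search
```
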
• Example Two

• Non-deterministic
• With Markov (fully observable) sensing, the agent can detect when an action has failed and simply retry or replan (tiny sketch below)
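
A tiny sketch of that point (the action and its 70% success rate are made-up numbers): because the state is fully observable, a failure shows up directly in the next sensed state, and the policy can simply retry.

```python
import random

def step(state, action):
    """Hypothetical non-deterministic actuator: the action
    succeeds only 70% of the time, else the state is unchanged."""
    return action(state) if random.random() < 0.7 else state

def move_right(state):
    return state + 1

# Markov sensing: we always see the true state, so a failed
# action is detected immediately and we just try again.
state, target = 0, 3
while state != target:
    state = step(state, move_right)
print("reached", state)
```
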
• Example Three

• Unknown actions as well as unknown map
• Deterministic
• Have to learn model of actions at least
• Trade-off between the prior knowledge assumed and the amount of data needed for learning
• Once you have the model, you can use search/planning (sketched below)
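
A minimal sketch of "learn the action model, then plan" for the deterministic case (the toy states, actions, and data are assumptions): one observed transition per (s, a) pair pins down that model entry exactly, after which an ordinary search runs over the learned table.

```python
from collections import deque

def learn_model(experience):
    """Deterministic actions: a single observed transition fully
    determines the model entry model[(s, a)] = s'."""
    model = {}
    for s, a, s_next in experience:
        model[(s, a)] = s_next
    return model

def plan(model, start, goal):
    """Breadth-first search over the learned model."""
    frontier = deque([(start, [])])
    visited = {start}
    while frontier:
        s, actions = frontier.popleft()
        if s == goal:
            return actions
        for (s0, a), s1 in model.items():
            if s0 == s and s1 not in visited:
                visited.add(s1)
                frontier.append((s1, actions + [a]))
    return None

# Transitions gathered by wandering around (made-up toy data):
model = learn_model([("A", "right", "B"), ("B", "right", "C"),
                     ("B", "left", "A")])
print(plan(model, "A", "C"))  # ['right', 'right']
```
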
• Example Four

• Unknown actions/map
• Non-deterministic
• Model-based
• Learn model
• Solve using Dynamic Programming or direct policy search
• Model-free
• Solve using direct policy search (gradient descent in policy space - each function evaluation is a set of trial runs in the real world; sketched below)
• Model-free dynamic programming (which is what we introduce next)
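
A minimal sketch of model-free direct policy search (the 1-D toy problem and all constants are made up): the policy is just a parameter vector, the only way to evaluate it is to average the returns of real trial runs, and a finite-difference gradient estimate drives the search - so every gradient step costs a batch of trials.

```python
import random

def rollout_return(theta):
    """Stand-in for one trial run in the real world: a noisy return
    for policy parameter theta (made-up toy, optimum at theta = 2)."""
    return -(theta - 2.0) ** 2 + random.gauss(0.0, 0.1)

def evaluate(theta, trials=20):
    """Each 'function evaluation' is a set of trial runs, averaged."""
    return sum(rollout_return(theta) for _ in range(trials)) / trials

def policy_search(theta=0.0, step=0.1, eps=0.05, iters=100):
    """Finite-difference gradient ascent on return in policy space
    (equivalently, gradient descent on cost); no model, no values."""
    for _ in range(iters):
        grad = (evaluate(theta + eps) - evaluate(theta - eps)) / (2 * eps)
        theta += step * grad
    return theta

print(policy_search())  # drifts towards the optimum near 2.0
```
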
• Example Five

• Known map
• Known (but stochastic) actions
• Known sensor models
• Non-Markov/partially observable

• Use DP or gradient descent to find policy

• Use a Bayes filter to estimate state (minimal sketch below)
• Note this is an approximate solution - it doesn't learn when to perform sense actions
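
A minimal histogram Bayes filter sketch for this setting (the corridor world and all probabilities are made up; the models themselves are known, as above): predict the belief through the stochastic action model, then reweight by the sensor likelihood and normalise.

```python
def bayes_filter_step(belief, action_model, sensor_model, a, z):
    """One predict/correct step of a discrete Bayes filter.
    belief[s]               -- P(current state = s)
    action_model[s][a][s']  -- P(s' | s, a), known but stochastic
    sensor_model[s][z]      -- P(z | s), known
    """
    states = list(belief)
    # Predict: push the belief through the action model.
    predicted = {s2: sum(belief[s] * action_model[s][a].get(s2, 0.0)
                         for s in states) for s2 in states}
    # Correct: reweight by the likelihood of the observation z.
    posterior = {s: predicted[s] * sensor_model[s][z] for s in states}
    norm = sum(posterior.values())
    return {s: p / norm for s, p in posterior.items()}

# Made-up 3-cell corridor: 'move' shifts right with prob 0.8.
A = {0: {"move": {0: 0.2, 1: 0.8}},
     1: {"move": {1: 0.2, 2: 0.8}},
     2: {"move": {2: 1.0}}}
S = {0: {"wall": 0.1, "open": 0.9},   # 'wall' reading mostly
     1: {"wall": 0.1, "open": 0.9},   # fires in the end cell
     2: {"wall": 0.9, "open": 0.1}}
b = {0: 1.0, 1: 0.0, 2: 0.0}
b = bayes_filter_step(b, A, S, "move", "open")
print(b)  # most of the probability mass is now on cell 1
```
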
• Example Six

• Unknown map
• Known (but stochastic) actions
• Known sensor models
• Non-Markov/partially observable

• Use a Bayes filter to estimate both the map and the current state within it (SLAM; toy sketch below)

• Once you have enough map, use DP or gradient descent to find policy
• Note: as above - this is still approximate
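
The same filter extends to SLAM by folding the map into the state. A toy sketch (corridor length, motion and sensor reliabilities are all made up): the joint belief is over (pose, map) pairs, prediction moves only the pose (the map is static), correction uses the known sensor model, and marginalising the joint belief recovers a map estimate.

```python
from itertools import product

N = 3                      # corridor length (made-up toy)
P_MOVE, P_HIT = 0.9, 0.85  # known motion and sensor reliabilities

# Joint belief over (pose, map): pose in 0..N-1, map = tuple of 0/1
# marks. Start: pose 0 known, map unknown (uniform over 2^N maps).
belief = {(0, m): 1.0 / 2 ** N for m in product((0, 1), repeat=N)}

def slam_step(belief, z):
    """Move right (succeeds with prob P_MOVE, wraps around), then
    sense the mark in the current cell (correct with prob P_HIT)."""
    new = {}
    for (pose, m), p in belief.items():    # predict: map is static
        for pose2, q in ((pose + 1) % N, P_MOVE), (pose, 1 - P_MOVE):
            new[(pose2, m)] = new.get((pose2, m), 0.0) + p * q
    for (pose, m) in new:                  # correct with reading z
        new[(pose, m)] *= P_HIT if m[pose] == z else 1 - P_HIT
    norm = sum(new.values())
    return {s: p / norm for s, p in new.items()}

for z in (1, 0, 1):                        # a few noisy readings
    belief = slam_step(belief, z)

# Marginalise the joint belief to read off the map estimate.
marks = [sum(p for (pose, m), p in belief.items() if m[i])
         for i in range(N)]
print([round(x, 2) for x in marks])        # P(cell i is marked)
```
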

Q-Learning - Model-free, Dynamic-Programming-based MDP solving

• Already seen the model-based Q-function
• Q(s,a) = E_{s'} [ r(s,a,s') + \gamma V(s') ]
• Q(s,a) = \sum_{s'} P(s,a,s') ( r(s,a,s') + \gamma V(s') )
• This requires P(s,a,s') -> the action model
• Can replace the model with a stochastic approximation
• Calculate a short-term estimate of Q(s,a) as if the action we just took was deterministic
• Q-short(s,a,s',r) = r + \gamma V(s')
• Update the long-term estimate of Q(s,a) a short step of the way towards the short-term estimate
• Q-long(s,a) = (1 - alpha) * Q-long(s,a) + alpha * Q-short(s,a,s',r)
• Q(s,a) = (1 - alpha) * Q(s,a) + alpha * (r + \gamma V(s'))
• Here V(s') = max_{a'} Q(s',a'), computed from the Q-table itself, so no action model is needed (see the sketch below)
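
Putting the update above into code, a minimal tabular Q-learning sketch (the env interface - reset(), step(a), actions - is an assumed stand-in, not a specific library): V(s') is read off the agent's own Q-table as max_{a'} Q(s',a'), so P(s,a,s') never appears.

```python
import random
from collections import defaultdict

def q_learning(env, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular Q-learning. `env` is assumed to offer reset() -> s and
    step(a) -> (s', r, done), with env.actions listing the actions."""
    Q = defaultdict(float)                 # Q[(s, a)], zero-initialised

    def best(s):                           # V(s) = max_a' Q(s, a')
        return max(Q[(s, a)] for a in env.actions)

    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # Epsilon-greedy exploration over the current Q estimates.
            if random.random() < epsilon:
                a = random.choice(env.actions)
            else:
                a = max(env.actions, key=lambda act: Q[(s, act)])
            s2, r, done = env.step(a)
            # Short-term estimate r + gamma * V(s'), as if the action
            # were deterministic; terminal states have no future value.
            target = r + (0.0 if done else gamma * best(s2))
            # Step part-way from the long-term estimate to the target.
            Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target
            s = s2
    return Q
```
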