Agent Example:
Example one - DunceBot
- Get to corner of room
- Unknown map
- Known sensors, actuators/actions
Deterministic, Markov
Discussion: Which is more important: Building a map, or reaching goal?
- Exploration/Exploitation trade off - depends upon how many times you'll use the map
- Are you going to be switching goals much?
- Including goals as state
As so much is known, it is possible to invent a policy
- This is an AI class - how can we make the system do more of the work?
- If I were to change the goals, would that just require another search or a human-in-the-loop re-design?
Example Two
- Non-deterministic
- Can use Markov sensing to determine failures
Example Three
- Unknown actions as well as unknown map
- Deterministic
- Have to learn model of actions at least
- Trade off between prior knowledge assumed and amount of data needed for learning
- Once you have the model, can use search/planning
Example Four
- Unknown actions/map
- Non-deterministic
- Model based
- Learn model
- Solve using Dynamic Programming, or direct policy search
- Model free
- Solve using direct policy search (gradient descent in policy space - each function evaluation is a set of trial runs in the real world)
- Model free dynamic programming (Which is what we introduce next)
Example Five
- Known map
- Known (but stochastic) actions.
- Known sensor models.
Non-markov/partially observable
Use DP or gradient descent to find policy
- Use Bayes filter to estimate state
- Note this is an approximate solution - it doesn't learn when to perform sense actions
- For more discussion about this Partially observable case, wait a week :)
Example Six
- Unknown map
- Known (but stochastic) actions.
- Known sensor models
Non-markov/partially observable
Use Bayes filter to estimate both map and current state in map (SLAM)
- Once you have enough map, use DP or gradient descent to find policy
- Note: as above - this is still approximate
Q-Learning - Model free, Dynamic Programming based, MDP solving
- Already seen model based q-function
- Q(s,a) = E_s' ( r(s,a,s') + \gamma V(s') )
- Q(s,a) = Sum_s' P(s,a,s') ( r(s,a,s') + \gamma V(s') )
- This requires P(s,a,s') -> the action model
- Can replace the model with a stochastic approxamiation
- Calculate a short-term estimate of Q(s,a) as if the action we just took was deterministic
- Q-short(s,a,s',r) = r + \gamma V(s')
- Update the long-term estimate of Q(s,a) a short step of the way towards the short term estimate
- Q-long(s,a) = alpha * Q-long(s,a) + (1-alpha) * Q-short(s,a,s',r)
- Q(s,a) = alpha * Q(s,a) + (1-alpha) (r + \gamma V(s'))
- Calculate a short-term estimate of Q(s,a) as if the action we just took was deterministic