Course Syllabus COMP3431 - 2010
Introduction
Definition of a sense/act agent (reward?) Concept of an agent making decisions based on its history Limitations on Worlds (Markov, Deterministic, Partially Observable, Continuous, ...)
Introduction to probabilities - sets of possible outcomes, each associated with a number - numbers must be consistent with set operations (Venn Diagram) - For discrete sets, can associate numbers with each element - For continuous sets, associate 'densities' with each element (derivative of a probability) - Probabilities > 0 - Probabilities sum to 1
Conditioning P(x|y) = P(x&y)/P(y)
Bayes Rule - Applies to individual probabilities - Know how to derive
Probability functions - If you do the same thing pairwise to all elements of a set, you can think of it as applying to the functions. e.g. forall x, f(x) * g(y) = (f*g)(x) x * x^2 = x^3
Bayes Rule applied to tracking - Motion model P(s'|s,a) - Observation model P(o|s) - tables
Representations of probability distributions - Closed form formulae (e.g. normal/gaussian, uniform, poisson, exponential) - Samples from distribution - Tabular
Combining distributions - Independent distributions (factored representations) - Multi-dimensional distributions: f(x,y) = g(x)*h(y) - Conditionally independent distributions - P(x,y|q) = P(x|q).P(y|q)
- Covariance matrix if normal distribution independent dimensions
Kalman filters - Use a Normal distribution for Bayesian tracking - Didn't derive the math. - Only works with linear motion and observation models - Extended Kalman Filters (EKF) - Use the tangent approximation (taylor epxansion) - requires matrices of derivatives (to get multi-dim slope)
Particle Filters - Motion update - Sample from the motion model - Observation update - Weight the particles by the P(o|s) (s == particle) - Resample with replacement (use cumulative distribution)
Unscented Kalman filters mentioned (UKF) - Would use these instead of EKF - but I didn't really describe in detail - Use particles instead of deriviaitves to handle non-linear models.
Simultaneous Localisation and Mapping (SLAM) - e.g. localise robots on robocup field, but someone has moved the beacons - Move the unknowns into the state space - e.g. state space was x,y,theta -> x,y,theta, x_b, y_b - We end up with locations correllated - arbitrarily assign location 0,0 to the start location
Other tricks with tracking - Correlation allows finding ball speed from two separate observations - Briefly talked about Distributed kalman filters - Replicate the filter across the robots - Each sends its observations and actions to the others - Observations can be bunched up (do gaussian multiplication locally (Kalman filter) and just send result) - Must reset Kalman filter on send so that no information is sent twice
Action/Planning
Introduce the POMDP definition
- Cut back to MDP (assume fully obs)
- Assume known models
- Defined optimality criteria
- Discounted Reward (most common)
- Undiscounted Reward
- Fixed Finite horizon
- Receding horizon
- Average Reward
Bellman Equation (dynamic programming) Q(s,a) = r(s,a) + \gamma \sum_{s' \in S} P(s'|s,a)V(s') V(s) = Q(s, \pi(s)) \pi(s) = argmax_a Q(s,a)
- Updates can happen in any order as long as you do each one often enough
Policy iteration as a solution method
Prioritized Sweeping
- use a priority Queue to choose update order
- Increase priority of children by |Delta V|
How to build a model (fully observable)
- Given a history s1, a1, r=6, s4, a3, r=-3, s1, a2, r=10, ....
- break this up into tuples
- Keep counts (and sum of reward for each s,a) - use ratios of counts as probabilities
Exploration/Exploitation tradeoff - n-armed bandits example - have to keep exploring - e-greedy (choose a random action e% of the time) - Boltzman exploration p(a) = e^{Q(s,a)/t} / (Sum_{a'} Q(s,a')/t) - give each action a weight of e^{Q(s,a)/t} and normalise to get P
Model Free learning - Q(s,a) = (1-\alpha)Q(s,a) + \alpha[r + \gamma V(s')]
Stochastic forward search - Expectimax (like a minimax search, but with expectation for 'opponent')
Policy Gradient - Sample 100 trajectories into the future - Find expected reward - Choose policy that maximises expected reward - Use optimisation techniques to optimise policy
- Discrete policy and make it differentiable by...
- turn it into a stochastic policy (for each state keep P of each action)
Discreization techniques for continuous problems
RRTs
Plan space vs state space search
Planning - State space planning - Plan space planning - partial order (least-commitment) planning - Understand Precond/Adds/deletes (STRIPS representation) - Understand total order plan - Understand partial order plan - What is a threat? - A link and a plan step. - The plan step COULD be ordered after the start and before the end of the link - The plan step has a post-condition (Add || delete) that COULD contradict the link predicate - How do I resolve them? - resolutions mean making the 'COULD's not match - bind variable to stop potential match - add a temporal constraint so that the step appears before or after the link
POMDPs (re-introduced Partial Observability) - Assume we know model (learning model in POMDPs is current research) - Belief state (= probability distribution over state space)
Optimistion - Simulated Annealing (move using Boltzmann exploration) - Simplex method (There are two algoirthms with this name - we covered the one NOT used for linear programming.) - Golden section search for 1D optimisation - Brent's method for 1D optimsiation (magic mixture of Golden section search and parabolic interpolation) - Conjugate gradients - Gradient methods (gradient does NOT point at minimum) - Powell's method for using 1D search - Note: ways to you can get zig-zag - cycle through basis function - at minima in one dimension, gradient is perpendicular to that direction - In steep valley simple gradient methods and just oscillate across the valley - If certain conditions apply - Convex - Not too flat - gradient doesn't change too quickly - Then, simple downhill with step proportional to gradient can work optimally