Utility Theory I

  • Think animal training

    • Can't explain the goal to the animal
    • Can give 'good/bad' feedback.
  • Goal of the agent is to maximise rewards over time

    • Is this enough? Could you 'teach' Asimov's three laws with this setup?
    • What does it mean to 'maximise rewards over time'?
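    • g = \sum_{t=0}^{\infty} r_t can diverge (e.g. if every reward is positive and bounded away from zero)
    • Limit the horizon to N steps: g = \sum_{t=\text{now}}^{\text{now} + N - 1} r_t
    • Discount the future: g = \sum_{t=0}^{\infty} \gamma^t r_t with 0 \le \gamma < 1 (finite whenever rewards are bounded: |r_t| \le R gives |g| \le R / (1 - \gamma))
    • Average rewards: g = \lim_{N \rightarrow \infty} \frac{1}{N} \sum_{t=\text{now}}^{\text{now} + N - 1} r_t, or g = \lim_{\gamma \rightarrow 1} (1 - \gamma) \sum_{t=0}^{\infty} \gamma^t r_t (without the (1 - \gamma) factor this limit diverges; the two agree whenever the first limit exists)
    • Gain vs. bias optimality: the average reward (gain) ignores any finite prefix of rewards, so many behaviours tie on gain; bias optimality breaks those ties by also comparing the transient rewards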
  • Given a history in (AOR)*, i.e. a sequence of action, observation, reward steps, each of these definitions reduces its rewards to a single 'quality' number.

  • This is an uncommon 'stateless' definition. We'll revisit this with state later (don't worry now if that doesn't mean anything).
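A minimal Python sketch of these definitions, assuming rewards are a plain list of floats and that the infinite sums are approximated by truncating a long finite reward sequence (the function names are hypothetical). The example uses alternating rewards 1, 0, 1, 0, ..., whose average reward is 0.5 and whose normalised discounted return (1 - \gamma) g equals 1 / (1 + \gamma), which tends to 0.5 as \gamma \rightarrow 1.

```python
from typing import Sequence


def finite_horizon_return(rewards: Sequence[float], now: int, n: int) -> float:
    """Sum of the n rewards r_now, ..., r_{now+n-1}."""
    return sum(rewards[now:now + n])


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """sum_t gamma^t r_t; assumes 0 <= gamma < 1 and a truncated (finite) reward list."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))


def average_reward(rewards: Sequence[float], now: int, n: int) -> float:
    """(1/n) * sum of r_now, ..., r_{now+n-1}: a finite-n estimate of the limit."""
    return sum(rewards[now:now + n]) / n


# Alternating rewards 1, 0, 1, 0, ... truncated at 100,000 steps (an assumption:
# gamma ** 100_000 is about e^-100, so the truncation error is negligible).
rewards = [1.0 if t % 2 == 0 else 0.0 for t in range(100_000)]
gamma = 0.999

print(finite_horizon_return(rewards, now=0, n=10))      # 5.0
print(average_reward(rewards, now=0, n=len(rewards)))   # 0.5
# Normalised discounted return: close to 1 / (1 + gamma), roughly 0.50025
print((1 - gamma) * discounted_return(rewards, gamma))
```

Multiplying by (1 - \gamma) is what keeps the discounted sum on the same scale as the average reward as \gamma approaches 1.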

Bayesian Agents I

Combining BayesianPrediction and Utility Theory.

  • Start with a set of models of the environment, each with a prior weight
    • Models are functions from histories (plus the action about to be taken) to probability distributions over the next observation and reward
  • At each time step

    • Do a forward search, to some fixed depth (horizon)
    • For observations, consider each possible observation
    • On the way down each branch, do the Bayesian update as if you'd just seen that observation
    • On the way out, average the branch values, weighted by each observation's predicted probability under the current posterior
    • Rewards are handled like observations (predicted, updated on, averaged over), and each branch's reward is added to its value
    • For actions, consider each possible action. On the way out, take the maximum.
  • We'll revisit this without the search later...
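A minimal Python sketch of this search (expectimax over a Bayesian mixture), under several assumptions not in the notes: the observation and reward are bundled into one hypothetical `percept` tuple, models are plain Python functions, the search stops at a fixed `depth`, and the two-armed bandit models below are invented purely for illustration.

```python
from typing import Callable, Dict, List, Tuple

Action = int
Percept = Tuple[int, float]                       # (observation, reward)
History = Tuple[Tuple[Action, Percept], ...]
# A model maps (history, action) to a distribution over the next percept.
Model = Callable[[History, Action], Dict[Percept, float]]


def mixture(models: List[Model], weights: List[float], history: History,
            action: Action) -> Dict[Percept, float]:
    """Predictive distribution: sum_m w_m * P_m(percept | history, action)."""
    dist: Dict[Percept, float] = {}
    for model, w in zip(models, weights):
        for percept, p in model(history, action).items():
            dist[percept] = dist.get(percept, 0.0) + w * p
    return dist


def bayes_update(models: List[Model], weights: List[float], history: History,
                 action: Action, percept: Percept) -> List[float]:
    """Posterior weights: w_m proportional to w_m * P_m(percept | history, action)."""
    unnorm = [w * model(history, action).get(percept, 0.0)
              for model, w in zip(models, weights)]
    total = sum(unnorm)
    return [u / total for u in unnorm]


def q_value(models, weights, history, actions, action, depth) -> float:
    """Expected reward of taking `action`, then acting optimally for depth-1 steps."""
    value = 0.0
    for percept, p in mixture(models, weights, history, action).items():
        if p <= 0.0:
            continue
        # Way down: Bayesian update as if this percept had just been seen.
        post = bayes_update(models, weights, history, action, percept)
        _, reward = percept
        future = search(models, post, history + ((action, percept),), actions, depth - 1)
        # Way out: weight each branch by its predicted probability.
        value += p * (reward + future)
    return value


def search(models, weights, history, actions, depth) -> float:
    """Expectimax value: maximum over actions, expectation over percepts."""
    if depth == 0:
        return 0.0
    return max(q_value(models, weights, history, actions, a, depth) for a in actions)


def best_action(models, weights, history, actions, depth) -> Action:
    """Action with the highest value (first one wins ties); assumes depth >= 1."""
    return max(actions, key=lambda a: q_value(models, weights, history, actions, a, depth))


def bandit_model(p_win: Tuple[float, float]) -> Model:
    """Hypothetical model: arm a pays reward 1 with probability p_win[a]; observation = reward."""
    def model(history: History, action: Action) -> Dict[Percept, float]:
        p = p_win[action]
        return {(1, 1.0): p, (0, 0.0): 1.0 - p}
    return model


models = [bandit_model((0.8, 0.2)), bandit_model((0.2, 0.8))]
print(best_action(models, [0.9, 0.1], (), [0, 1], depth=1))   # 0
print(search(models, [0.5, 0.5], (), [0, 1], depth=2))        # about 1.18
```

With an even prior, a one-step search values each arm at 0.5, but the two-step value is about 1.18 rather than 1.0: the first pull's outcome updates the posterior, and the search credits the better-informed second choice.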