Introduction to Bayesian Statistics

  • Perception is inherently uncertain
  • Perception is inherently subjective
    • If your sensors give you a different response from someone else's, then you'll believe something different from them.
  • Bayesian statistics is a way of numerically representing subjective uncertainty
  • Cox's theorem
    • There are other ways of defining probability-like objects, but they're either equivalent under a transform, or weird in some way.
    • Cox had three desiderata:
      • Divisibility and comparability - The plausibility of a statement is a real number and is dependent on information we have related to the statement.
      • Common sense - Plausibilities should vary sensibly with the assessment of plausibilities in the model.
      • Consistency - If the plausibility of a statement can be derived in many ways, all the results must be equal.
  • Probability theory basics (Another view of this is on p14 of Probabilistic Robotics)
    • Divide the possible worlds into sets
    • Assign each set of possible worlds a number representing how likely the true world is in that set (could use Measure theory here)
      • We could think in terms of 'individual worlds' rather than sets of worlds, but that leads to problems with continuous spaces. We'll do this for discrete spaces though.
    • No set of worlds has a negative probability; \forall X, P(X) \geq 0
    • The empty set has probability 0; P(\emptyset) = 0
      • We assume that the 'true world' is in our set
      • Other sets may also have probability 0
    • The probability of the universal set is 1
      • In a discrete universe, if we sum the probabilities of all the worlds then we get 1
      • In a continuous universe, use of Lebesgue integration on the probabilities on the sets of worlds should give 1
    • For disjoint sets A and B, P(A \; \text{or} \; B) = P( A \cup B ) = P( A ) + P( B )
  • When talking about a space of possible worlds, we often use parameters to define individual worlds or sets of worlds
    • e.g. If we have three possible worlds corresponding to X = 1, 2 and 3, then we might write P(X = 2) = 0.2 to say the probability of the second world is 0.2.
    • You'll sometimes see X referred to as a 'random variable'. This isn't very meaningful when talking about Bayesian statistics.
  • We'll use | to mean 'given' as in P(X = 2 | Y = 2) = \frac{2}{3} \approx 0.67.
    • P(A | B) = \frac{P(A \& B)}{P(B)}
    • Imagine we had 7 worlds
      • P(X = 1, Y = 1) = 0.1, P(X=2, Y=1)=0.1, P(X=2,Y=2)=0.2, P(X=2,Y=3)=0.3, P(X=3,Y=2)=0.1, P(X=3,Y=3)=0.1, P(X=4,Y=3)=0.1
      • Y = 2 in two of those worlds: (X=2, Y=2) and (X=3, Y=2)
      • So P(Y = 2) = P(X=2,Y=2) + P(X=3,Y=2) = 0.2 + 0.1 = 0.3
      • Of those two worlds, X = 2 in one of them, with P(X = 2 \;\&\; Y = 2) = 0.2
      • P(X = 2 | Y = 2) = \frac{P(X = 2 \;\&\; Y = 2)}{P(Y=2)} = \frac{0.2}{0.3} = \frac{2}{3}
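The conditional probability worked through above can be checked with a short sketch. The function names `prob` and `conditional` are made up for this illustration, and the worlds are the ones listed in the example; exact fractions are used so the result comes out as 2/3 rather than a rounded decimal.

```python
from fractions import Fraction

# Joint probabilities of the worlds listed in the example, keyed by (x, y).
# Fractions keep the arithmetic exact.
joint = {
    (1, 1): Fraction(1, 10),
    (2, 1): Fraction(1, 10),
    (2, 2): Fraction(2, 10),
    (2, 3): Fraction(3, 10),
    (3, 2): Fraction(1, 10),
    (3, 3): Fraction(1, 10),
    (4, 3): Fraction(1, 10),
}


def prob(event):
    """Sum the probability of every world (x, y) for which event(x, y) holds."""
    return sum((p for (x, y), p in joint.items() if event(x, y)), Fraction(0))


def conditional(event_a, event_b):
    """P(A | B) = P(A & B) / P(B); undefined when P(B) = 0."""
    p_b = prob(event_b)
    if p_b == 0:
        raise ValueError("cannot condition on an event of probability 0")
    return prob(lambda x, y: event_a(x, y) and event_b(x, y)) / p_b


p_y2 = prob(lambda x, y: y == 2)
p_x2_given_y2 = conditional(lambda x, y: x == 2, lambda x, y: y == 2)
print(p_y2, p_x2_given_y2)  # prints: 3/10 2/3
```

Representing events as predicates over worlds mirrors the 'sets of possible worlds' view: an event is just the set of worlds for which the predicate is true.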
  • When talking about continuous worlds, where probability is defined over bounded sets, e.g. 1.1 < x < 1.2, we'll define the probability density function as the derivative of the cumulative probability P(X \leq x) with respect to x. e.g. If I don't know exactly where I am, but I know I'm near x = 1.1, I might define a probability density x \sim \mathrm{Normal}(1.1, 1). I need to integrate this density over a range of x to get an actual probability. (The Normal or Gaussian Distribution is a probability density function that we'll see a fair bit.)
    • Another way to consider this: Imagine we wanted to represent a probability distribution over contiguous sets of real numbers in [0,1]. We'll discretize the 0,1 line into N sections of width \Delta = \frac{1}{N}. If we just stored the probability mass in each of these sections, then as N \rightarrow \infty, and \Delta \rightarrow 0 the probabilities, P, all drop to 0 - useless. So what we'll store is \frac{P}{\Delta}. Now, as the buckets get smaller, that cancels and we end up with reasonable numbers. But these numbers are not probabilities, they are probability densities.
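The bucket argument can be tried numerically. This is a minimal sketch that assumes a hypothetical density f(x) = 2x on [0, 1] (so the cumulative probability is x^2); neither the density nor the function names come from the notes. It looks at the bucket containing x = 0.5 and shows that the mass P shrinks towards 0 while P / \Delta settles near f(0.5) = 1.

```python
def cdf(x):
    """Cumulative probability P(X <= x) for the assumed density f(x) = 2x on [0, 1]."""
    return x * x


def bucket_mass_and_density(n, point=0.5):
    """Return (P, P / delta) for the bucket of width 1/n that contains `point`."""
    delta = 1.0 / n
    index = min(int(point * n), n - 1)  # bucket index, clamped so point == 1 is valid
    lo, hi = index * delta, (index + 1) * delta
    mass = cdf(hi) - cdf(lo)  # probability mass stored in this bucket
    return mass, mass / delta


for n in (10, 100, 1000, 10000):
    mass, density = bucket_mass_and_density(n)
    print(n, mass, density)
```

As N grows, the first printed number (a probability) heads to 0 and the second (a density) stays at a useful scale, which is the reason for storing \frac{P}{\Delta}.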
  • If you have a distribution that does not sum/integrate to 1, it is usually possible to renormalise it. (This is using the term normalise similarly to the way vectors can be normalised to a length of 1 (but we don't use the L2 norm to do so).) First the sum or integral over the entire space is calculated, and then the entire distribution is divided by that value so that it now sums to one. This fails if the original sum was unbounded or zero.
    • Normalisation is a separate concept from the Normal or Gaussian distribution.
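A minimal sketch of renormalising a discrete distribution follows. The weights and the name `normalise` are hypothetical, chosen only to illustrate the compute-the-total-then-divide procedure described above.

```python
def normalise(weights):
    """Divide non-negative weights by their total so that they sum to 1."""
    total = sum(weights.values())
    if total <= 0 or total == float("inf"):
        raise ValueError("cannot normalise: total must be positive and finite")
    return {key: w / total for key, w in weights.items()}


# Hypothetical unnormalised weights over two options.
unnormalised = {"apple pie": 3.0, "peach pie": 1.0}
distribution = normalise(unnormalised)
print(distribution)  # prints: {'apple pie': 0.75, 'peach pie': 0.25}
```

The division uses the plain sum of the weights (an L1-style total), not the Euclidean length used when normalising vectors.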
  • If a probability distribution has a parameter that can meaningfully be summed (e.g. a numeric parameter is ok, an alphabetic one less so), then we can calculate an expected value for that parameter. E_{x} P = \sum_{x} x P(x)
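As a small illustration of the expected value formula, the sketch below uses a three-world distribution over X. Only P(X = 2) = 0.2 appears earlier in these notes; the other two probabilities are assumed values picked so the distribution sums to 1.

```python
from fractions import Fraction

# P(X = 2) = 0.2 is taken from the earlier example; the other two values
# are assumptions made purely for this illustration.
p_x = {
    1: Fraction(1, 2),
    2: Fraction(1, 5),
    3: Fraction(3, 10),
}


def expected_value(distribution):
    """E[X] = sum over x of x * P(x), for a discrete numeric parameter x."""
    return sum((x * p for x, p in distribution.items()), Fraction(0))


mean_x = expected_value(p_x)
print(mean_x, float(mean_x))  # prints: 9/5 1.8
```

Note that the expected value, 1.8, is not itself one of the possible worlds.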
  • To marginalise a variable means to sum (or, for a continuous variable, integrate) over all values of that variable, e.g. P(y) = \sum_{x} P(x, y). If we have a probability distribution defined with two variables, x and y, and we consider the probability of sets whose definition does not refer to x, e.g. (1.0 < y < 2.0), then those sets implicitly include all values of x.
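Marginalisation over a discrete variable can be sketched by reusing the joint distribution over X and Y from the conditional probability example. The helper name `marginalise_x` is made up for this illustration.

```python
from collections import defaultdict
from fractions import Fraction

# Joint probabilities P(X = x, Y = y) from the earlier worked example.
joint = {
    (1, 1): Fraction(1, 10),
    (2, 1): Fraction(1, 10),
    (2, 2): Fraction(2, 10),
    (2, 3): Fraction(3, 10),
    (3, 2): Fraction(1, 10),
    (3, 3): Fraction(1, 10),
    (4, 3): Fraction(1, 10),
}


def marginalise_x(joint_xy):
    """Sum out x, leaving P(Y = y) = sum over x of P(X = x, Y = y)."""
    marginal = defaultdict(Fraction)  # missing keys start at Fraction(0)
    for (x, y), p in joint_xy.items():
        marginal[y] += p
    return dict(marginal)


p_y = marginalise_x(joint)
print(p_y[1], p_y[2], p_y[3])  # prints: 1/5 3/10 1/2
```

The marginal for Y = 2 is the same 0.3 that appears as the denominator in the conditional probability example.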

Example Sets

At this stage I haven't said anything about how to use probability distributions. Let's just have a look at the sorts of things we can place probability distributions over.

  • A small discrete set of options: This is either i) an apple pie, or ii) a peach pie.
  • A continuous set of options: I am x cm tall, where x is a real number.
    • Note that in practice you need to talk about ranges to define sets on continuous spaces: e.g. what is the probability I'm between 170 and 171 cm tall?
    • When talking about the numbers we integrate to get probabilities, we call them probability densities. As these are usually functions of some variable, they are also often called probability density functions.
  • A multi-dimensional set of options (with a fixed dimensionality): I am between 170 and 171 cm tall, and you are between 180 and 181 cm tall.
  • A variable-dimensional world. e.g. consider a 'stretch of road' to contain a number of cars. Each car has a position and a velocity. We can now consider sets of stretches of road. e.g. all stretches of road with a car between 10 and 11m from the end of the road.
  • If we consider an agent, then its history of length n will be a sequence of action, observation pairs: h = a_1o_1a_2o_2a_3o_3\ldots{}a_no_n. We could consider sets of histories, e.g. the set of histories where I had a feeling of just having woken up, I opened my eyes, I saw the ceiling, I looked down, and saw my bedroom, then ...
  • We can define what a valid logical formula is. Then we can consider sets of formulae, e.g. \mathrm{tasty}(p_1) \; \textrm{and} \; \mathrm{tasty}(p_2) \; \textrm{and} \; \mathrm{tasty}(p_3) \; \textrm{and} \; \forall_{p \in \mathrm{Pumpkins}} \; \mathrm{tasty}(p) (See http://dx.doi.org/10.1007/s10472-009-9136-7)
  • We could consider sets of Haskell programs. e.g. Sets of lazy Haskell functions of type \mathrm{history} \rightarrow \mathrm{boolean} that don't return false when given a specific history.