Some Cheerful Facts About Probability

In the course of training to be a scientist, you generally learn some statistics and probability theory. I’ve grown to be quite fond of the topic, and as I’ve learned it, a few things in particular have struck me as brilliantly satisfying: simple tricks, some of which may seem counter-intuitive, but which fascinated me once I grasped them.

Here are a few of them.

Population size is irrelevant

When one wanders into the comments section of a news article reporting on a study about the American people, it’s not uncommon to find someone griping that it’s impossible for a survey of 1,000 people to produce valid knowledge about the population as a whole.

As it turns out, it can!

A good deal of statistical inference is based on the law of large numbers: a set of statistical properties that describe what happens when you make a large number of observations. Crucially, these properties depend only on the number of observations, not on the number of possible things to be observed.

If we want to estimate the average wingspan $\mu$ of dragons, we can randomly select $n$ dragons, measure their wingspans, and compute the average $\bar x$. As $n$ gets larger — we measure more and more dragons — our estimate $\bar x$ approaches $\mu$. Moreover, the rate at which it approaches the true average wingspan is independent of the number of dragons that exist in the world. Therefore, we only need to sample sufficiently many dragons to achieve an estimate within our desired tolerance (a tolerance we can compute from the number of measured dragons and some assumptions about the distribution of dragon wingspans), and we do not care (or even need to know) how many dragons there are.

Specifically, if dragon wingspans are normally distributed (a reasonable assumption, as many physical properties are approximately normally distributed), then the wingspan $x$ of a dragon is $x \sim \mathcal{N}(\mu, \sigma^2)$; if we compute $\bar x$ by measuring $n$ randomly-selected dragons, then $\bar x$ follows the probability distribution $\mathcal{N}(\mu, \sigma^2/n)$. Note that the total number of dragons does not appear in this distribution!
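
To see this concretely, here is a quick simulation sketch (NumPy, with made-up numbers for the wingspan mean and standard deviation): it surveys $n = 1000$ dragons out of two finite populations of very different sizes, and the error in $\bar x$ comes out about the same either way, tracking $\sigma/\sqrt{n}$ rather than the population size.

```python
import numpy as np

rng = np.random.default_rng(42)
mu, sigma, n = 12.0, 3.0, 1000   # hypothetical wingspan mean/sd and survey size

for pop_size in (10_000, 1_000_000):
    # a finite "population" of dragons; survey n of them without replacement
    population = rng.normal(mu, sigma, size=pop_size)
    errors = [abs(rng.choice(population, size=n, replace=False).mean() - mu)
              for _ in range(300)]
    print(f"population {pop_size:>9,}: mean abs. error of x-bar = {np.mean(errors):.3f}")

# both population sizes give errors on the order of sigma / sqrt(n)
print("sigma / sqrt(n) =", sigma / np.sqrt(n))
```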

Now, we still need to select our dragons well so that we have a representative sample of dragons. But representativeness is a function of the sampling strategy, not of the sample size, the population size, or the proportion of the population sampled. So long as we make sure that every dragon is equally likely to be selected, then for the purposes of estimating the mean wingspan (or any other population parameter), we only need to look at the sample size.

There are, in practice, many problems with obtaining representative samples, and for some types of analyses we need to augment the sampling strategy (for example, oversampling to enable rigorous subgroup analysis). But the fact that 1000 is much less than 300,000,000 is not one of them.

Rejection sampling works

Suppose we have some silly probability distribution, say a mountain range distribution (a lumpy density with several peaks), and we want to generate samples from it.

Most statistical and scientific computing environments have functions for drawing random variables from common distributions: uniform and Gaussian almost assuredly, and many provide gamma, beta, binomial, and several others as well.

It turns out that we can use a simple probability distribution $S$ and a weighted coin to sample from a difficult distribution $X$ with a technique called rejection sampling. Basically (a code sketch follows these steps):

  1. Draw a sample $x \sim S$, and compute its probability $P_S(x)$.
  2. Compute the probability of $x$ under $X$, $P_X(x)$.
  3. Flip a coin with weight $\frac{P_X(x)}{m P_S(x)}$, where $m > 1$ is a parameter bounding the relationship between $X$ and $S$ (grossly oversimplifying here). If it comes up heads, accept the sample; if not, try again.
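
Here is a minimal sketch of those three steps in Python, assuming a made-up two-peaked "mountain range" density as the target $X$, a uniform proposal as $S$, and a hand-picked bound $m$; none of these choices come from any particular library or application.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def mountain_pdf(x):
    """A made-up 'mountain range' density: a mixture of two normal peaks."""
    return 0.6 * stats.norm.pdf(x, loc=-2, scale=0.5) + 0.4 * stats.norm.pdf(x, loc=3, scale=1.0)

proposal = stats.uniform(loc=-6, scale=12)   # S: uniform on [-6, 6], so P_S(x) = 1/12

# m must satisfy P_X(x) <= m * P_S(x) everywhere; the target's tallest peak is
# below 0.5, and 6 * (1/12) = 0.5, so m = 6 is a safe (if loose) bound.
m = 6.0

def rejection_sample(n):
    samples = []
    while len(samples) < n:
        x = proposal.rvs(random_state=rng)            # 1. draw x ~ S
        p_s, p_x = proposal.pdf(x), mountain_pdf(x)   # 1-2. P_S(x) and P_X(x)
        if rng.random() < p_x / (m * p_s):            # 3. weighted coin flip
            samples.append(x)                         #    heads: accept
    return np.array(samples)                          #    tails: loop and try again

draws = rejection_sample(5000)
print(f"{len(draws)} samples, mean {draws.mean():.2f}")
```

With these choices, only about one in $m = 6$ proposals survives the coin flip; a tighter bound wastes fewer draws, which is the main practical cost of rejection sampling.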

This technique can also be adapted to rewrite sampling strategies: given sufficient data sampled with strategy $X$, we can resample with rejection sampling to produce a sample approximating $Y$ (with some limitations — all possible desired values must be represented in the original data, for example). It can also be extended, via the Metropolis-Hastings algorithm, to even more complicated scenarios.
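
As a rough sketch of that resampling trick (continuing in the same hypothetical vein: `p_x` and `p_y` are the densities of the original and desired strategies, and $m$ bounds their ratio over the data), we just walk over the existing sample and keep each point with probability $P_Y(x)/(m\,P_X(x))$:

```python
import numpy as np

def resample(data, p_x, p_y, m, rng=None):
    """Keep each point (originally drawn under X) with probability p_Y(x) / (m * p_X(x)).

    The survivors approximate a sample from Y, provided p_Y(x) <= m * p_X(x)
    wherever the data can fall; names and arguments here are illustrative.
    """
    rng = rng if rng is not None else np.random.default_rng()
    data = np.asarray(data)
    keep = rng.random(len(data)) < p_y(data) / (m * p_x(data))
    return data[keep]
```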

I love the elegant power of rejection sampling.

Probability is the expected value of the characteristic function

This is one that I just learned in the last month. It seems obvious in retrospect, but is one of those things that you don’t necessarily think about until it’s pointed out.

In probability theory, we often talk about the probability of an event; for example $P(X \in A)$, the probability that a random variable $X$ has a value in the set $A$. We also discuss the expected value of a variable, $E(X)$, which is the value that we expect, on average, to obtain if we observe an $X$.

When dealing with sets, we have what are called characteristic functions: indicator functions that denote whether or not a value is in the set. For a set $A$, the characteristic function $c_A(x) = 1$ if $x \in A$, and $0$ otherwise.

Beautifully and elegantly, it turns out that $P(X \in A) = E(c_A(X))$. That is, the probability of $X$ being in $A$ is equal to the expected value of $A$’s characteristic function applied to $X$.
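
A one-line check of why this works: the indicator only ever takes the values 1 and 0, so its average is just the probability of the event that makes it 1. Writing the expectation against the distribution $P_X$ of $X$:

$$E(c_A(X)) \;=\; \int c_A(x)\,\mathrm{d}P_X(x) \;=\; 1 \cdot P(X \in A) + 0 \cdot P(X \notin A) \;=\; P(X \in A).$$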

I learned this reading David Pollard’s A User’s Guide to Measure Theoretic Probability, which is so far an excellent text. Pollard uses this identity to unify probability and expected value, so that he can deal solely with expected values and have the related probability results arise from applying expected-value properties to characteristic functions.