# Some Cheerful Facts About Probability

In the course of training to be a scientist, you generally learn some statistics and probability theory. I’ve grown to be quite fond of the topic, but as I’ve learned it, there are a few things in particular that I’ve found brilliantly satisfying. Simple tricks, some of which may seem counter-intuitive, but for some reason fascinated me when I grasped them.

Here are a few of them.

## Population size is irrelevant

When one wanders into the comments section of a news article reporting on a study that found something about the American people, it’s not uncommon to find someone griping that it’s impossible for a survey of 1000 people to produce valid knowledge about the population as a whole.

As it turns out, it can!

A good deal of statistical inference is based on the *law of large numbers*: a family of results that describe what happens when you make a large number of observations. Crucially, these results depend only on the number of observations, *not* on the number of possible things to be observed.

If we want to estimate the average wingspan \(\mu\) of dragons, we can randomly select \(n\) dragons, measure their wingspans^{1}, and compute the average \(\bar x\). As \(n\) gets larger — we measure more and more dragons — our estimate \(\bar x\) approaches \(\mu\). Moreover, the rate at which it approaches the true average wingspan is *independent* of the number of dragons that exist in the world. Therefore, we only need to sample sufficiently many dragons to achieve an estimate within our desired tolerance (a tolerance we can compute from the number of measured dragons and some assumptions about the distribution of dragon wingspans), and we do not care (or even need to know) how many dragons there are.

Specifically, if dragon wingspans are normally distributed (a reasonable assumption, as many physical properties are approximately normally distributed), then the wingspan of a randomly-selected dragon is \(x \sim \mathcal{N}(\mu,\sigma^2)\); if we compute \(\bar x\) by measuring \(n\) randomly-selected dragons, then \(\bar x\) follows the probability distribution \(\mathcal{N}(\mu, \sigma^2/n)\). Note that the total number of dragons does not appear in this distribution!
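
If you don’t quite believe it, a quick simulation makes the point. This is a rough sketch (Python with NumPy, made-up values for the true mean and standard deviation, and two arbitrary population sizes): whether the world holds twenty thousand dragons or two million, the spread of the sample mean comes out around \(\sigma/\sqrt{n}\).

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n = 10.0, 2.0, 1000        # made-up "true" wingspan mean/sd, and our sample size

for pop_size in (20_000, 2_000_000):  # two very different numbers of dragons in the world
    population = rng.normal(mu, sigma, pop_size)   # the finite population of wingspans
    # Repeat the survey a couple hundred times to see how the estimate spreads.
    estimates = [rng.choice(population, n, replace=False).mean() for _ in range(200)]
    print(f"N={pop_size:>9}: sd of estimates = {np.std(estimates):.4f}, "
          f"sigma/sqrt(n) = {sigma / np.sqrt(n):.4f}")
```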

Now, we still need to select our dragons well so that we have a *representative* sample of dragons. But representativeness is a function of the sampling strategy, not of the sample size, population size, or sampling fraction. So long as we make sure that any dragon is equally likely to be selected, then for the purposes of estimating the mean wingspan (or any other population parameter), we need only look at the sample size.

There are, in practice, many problems with obtaining representative samples, and for some types of analyses we need to augment the sampling strategy (for example, oversampling to enable rigorous subgroup analysis). But the fact that 1000 is much less than 300,000,000 is not one of them.

## Rejection sampling works

Suppose we have some silly probability distribution, say a mountain range distribution, and we want to generate samples from it.

Most statistical and scientific computing environments have functions for drawing random variables from common distributions: uniform and Gaussian almost assuredly, and many provide gamma, beta, binomial, and several others as well.

It turns out that we can use a simple probability distribution \(S\) and a weighted coin to sample from a difficult distribution \(X\) with a technique called *rejection sampling*. Basically (a sketch in code follows the list):

- Draw a sample \(x \sim S\), and compute its probability \(P_S(x)\).
- Compute the probability of \(x\) under \(X\), \(P_X(x)\).
- Flip a coin with weight \(\frac{P_X(x)}{m P_S(x)}\), where \(m\) is a constant chosen so that \(P_X(x) \le m P_S(x)\) for every possible \(x\) (grossly oversimplifying here). If it comes up heads, accept the sample; if not, try again.
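
Here is a minimal sketch of that recipe in Python/NumPy. The two-peaked “mountain range” density and the bound \(m = 3\) are invented for the example; a uniform distribution on \([-6, 6]\) plays the role of the easy distribution \(S\) (the sliver of probability the target puts outside that interval gets ignored).

```python
import numpy as np

rng = np.random.default_rng(0)

def target_pdf(x):
    # A made-up "mountain range" density: an equal mix of two normal bumps at -2 and +2.
    return (0.5 * np.exp(-0.5 * (x + 2) ** 2) / np.sqrt(2 * np.pi)
            + 0.5 * np.exp(-0.5 * (x - 2) ** 2) / np.sqrt(2 * np.pi))

lo, hi = -6.0, 6.0
proposal_pdf = 1.0 / (hi - lo)   # S is uniform on [lo, hi], so P_S(x) = 1/12 there
m = 3.0                          # satisfies target_pdf(x) <= m * proposal_pdf on [lo, hi]

def sample_one():
    while True:
        x = rng.uniform(lo, hi)                                 # draw x ~ S
        if rng.random() < target_pdf(x) / (m * proposal_pdf):   # flip the weighted coin
            return x                                            # heads: accept
        # tails: reject and try again

samples = np.array([sample_one() for _ in range(10_000)])
```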

This technique can also be adapted to rewrite sampling strategies: given sufficient data sampled with strategy \(X\), we can resample with rejection sampling to produce a sample approximating \(Y\) (with some limitations — all possible desired values must be represented in the original data, for example). It can also be extended, via the Metropolis-Hastings algorithm, to even more complicated scenarios.
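
As a small illustration of that resampling idea (the densities and the bound \(m\) below are again invented for the example): suppose our data were collected under \(X = \mathcal{N}(0, 2^2)\), but we would rather have a sample that looks like it came from \(Y = \mathcal{N}(1, 1)\). We keep each point with probability \(P_Y(x) / (m P_X(x))\).

```python
import numpy as np

rng = np.random.default_rng(1)

def pdf_X(x):   # density the data was actually collected under: N(0, 2^2)
    return np.exp(-0.5 * (x / 2.0) ** 2) / (2.0 * np.sqrt(2 * np.pi))

def pdf_Y(x):   # density we wish we had sampled from: N(1, 1)
    return np.exp(-0.5 * (x - 1.0) ** 2) / np.sqrt(2 * np.pi)

data = rng.normal(0.0, 2.0, 100_000)       # the existing sample, drawn under X

m = 4.0                                    # chosen so that pdf_Y(x) <= m * pdf_X(x) for all x
keep = rng.random(data.size) < pdf_Y(data) / (m * pdf_X(data))
resampled = data[keep]                     # approximately distributed as Y
print(resampled.mean(), resampled.std())   # should land near 1 and 1
```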

I love the elegant power of rejection sampling.

## Probability is the expected value of the characteristic function

This is one that I just learned in the last month. It seems obvious in retrospect, but is one of those things that you don’t necessarily think about until it’s pointed out.

In probability theory, we often talk about the probability of an event; for example \(P(X\in A)\), the probability that a random variable \(X\) has a value in the set \(A\). We also discuss the expected value of a variable, \(E(X)\), which is the value that we expect, on average, to obtain if we observe an \(X\).

When dealing with sets, we have what are called *characteristic functions*: indicator functions that denote whether or not a value is in the set. For a set \(A\), the characteristic function \(c_A(x) = 1\) if \(x \in A\), and \(0\) otherwise.

Beautifully and elegantly, it turns out that \(P(X \in A) = E(c_A(X))\). That is, the probability of \(X\) being in \(A\) is equal to the expected value of \(A\)’s characteristic function applied to \(X\).
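
One way to see why: \(c_A(X)\) only takes the values 1 and 0, so its expected value is just

\[
E(c_A(X)) = 1 \cdot P(c_A(X) = 1) + 0 \cdot P(c_A(X) = 0) = P(X \in A).
\]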

I learned this reading *A User’s Guide to Measure Theoretic Probability*, which is so far an excellent text. Pollard uses this identity to unify probability and expected value so that he can deal solely with expected values and have related probability results arise from applying expected value properties to characteristic functions.