Notes on Probability

Photo of a roulette table (photo by Kaysha on Unsplash).

This document summarizes key concepts in probability theory. It is intended to be a concise reference for what I use in classes and research writing, not a thorough tutorial exposition.

I also present the concepts here in 🧑🏻‍🏫 CS 533 Week 4.

Set Concepts and Notation

  • A set $A$ is an unordered collection of distinct elements.
  • $\emptyset$ is the empty set.
  • $A \cup B$ is the union: all elements in either $A$ or $B$ (or both).
  • $A \cap B$ is the intersection: all elements in both $A$ and $B$.
  • $A \setminus B$ is the set difference: all elements in $A$ but not in $B$.
  • If $A$ is a subset of some larger set $U$ that contains all the possible elements under consideration, then the complement $A^c = U \setminus A$ is the set of elements not in $A$.
  • $|A|$ is the cardinality (or size) of $A$. It may be infinite.
  • $\mathcal{P}(A)$ is the power set of $A$: the set of all subsets of $A$.

Kinds of Sets

There are, broadly speaking, three kinds of sets in terms of their cardinality:

  • Finite sets have a finite number of elements: there is some natural number $n$ such that $|A| = n$.
  • Countable sets (or countably infinite sets) are infinite sets with the same cardinality as the set of natural numbers ($|A| = |\mathbb{N}|$). Formally, there exists a bijection (a one-to-one, onto mapping) between the members of $A$ and $\mathbb{N}$. The natural numbers, integers, rationals, and algebraics (rationals and roots) are all countable sets.
  • Uncountable sets are infinite sets whose cardinality is larger than that of the natural numbers. The real numbers ($\mathbb{R}$) and the power set of the natural numbers ($\mathcal{P}(\mathbb{N})$) are two frequently encountered uncountable sets.

We also talk about discrete and continuous sets:

  • A continuous set $A$ with an ordering $<$ is a set where we can always find an element to fit between any two other elements: for any $a, b \in A$ such that $a < b$, there is a $c \in A$ such that $a < c < b$.
  • A discrete set is a set that is not continuous: there are irreducible gaps between elements.

All finite sets are discrete. The natural numbers and integers are also discrete. The real numbers are continuous. Rationals and algebraics are also continuous, but we won’t be using them directly.
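
To make the “element in between” property concrete, here is a minimal Python sketch (my own illustration, not part of the original notes) using exact rational arithmetic; the helper name `between` is mine:

```python
from fractions import Fraction

def between(a: Fraction, b: Fraction) -> Fraction:
    """Return a rational strictly between a and b (assumes a < b)."""
    # The midpoint of two rationals is always a rational between them.
    return (a + b) / 2

a, b = Fraction(1, 3), Fraction(1, 2)
c = between(a, b)
print(a < c < b, c)  # True 5/12
```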

Events

A random process (or a process modeled as random) produces distinct individual outcomes, called elementary events. We use $E$ to denote the set of such outcomes; for a coin flip, $E = \{H, T\}$. For a random process that produces a count, $E = \mathbb{N}$.

Probability is defined over events. An event $A$ is a subset of $E$ ($A \subseteq E$). Elementary events, treated as events, are represented as singletons: $A = \{H\}$ means “coin is heads”. $E$ itself, the set of all elementary events, is the event “something happened”.

We use set operations to combine events (for these examples, we consider $E$ to be a deck of 52 standard playing cards; $A$ is “2” and $B$ is “red card”):

Illustration: the deck of cards, with events $A$ and $B$ marked.
  • $A \cap B$ is the event “both $A$ and $B$ happened”; for our example, the conjunction is “red 2”, of which there are 2 (2♥, 2♦).

    Illustration: highlighting $A \cap B$.
  • $A \cup B$ is the event “either $A$ or $B$ (or both) happened”; for our example, the disjunction is “2 or red” — any 2, or any red card; this set has size 28: the 26 red cards (13 of each red suit), plus the two black 2s.

    Illustration: highlighting $A \cup B$.
  • $A \setminus B$ is the event “$A$ happened but not $B$”. If $A \subseteq B$, then $A \setminus B = \emptyset$. For our example, the difference is “black 2”: the set of 2s that are not red.

    Illustration: highlighting $A \setminus B$.

With these definitions, we can now define the event space: $\mathcal{F}$ is the set of all possible events (subsets of $E$). This is a set of sets. It does not necessarily contain every subset of $E$, but it has the following properties:

  • $E \in \mathcal{F}$.
  • If $A \in \mathcal{F}$, then its complement $A^c \in \mathcal{F}$. We say $\mathcal{F}$ is closed under complement.
    • Since $E \in \mathcal{F}$ and $E^c = \emptyset$, we have $\emptyset \in \mathcal{F}$.
  • If $A_1, A_2, \dots, A_n \in \mathcal{F}$, then their union $\bigcup_i A_i \in \mathcal{F}$. This also applies to unions of countably many sets. We say $\mathcal{F}$ is closed under countable unions.

$\mathcal{F}$ is called a sigma algebra (or sigma field). For a finite set $E$, we usually use $\mathcal{F} = \mathcal{P}(E)$, the power set of $E$. This means that every possible subset of $E$ (and therefore every conceivable set of elementary events) is an event.

Here are some additional properties of sigma algebras (these are listed separately from the previous properties because those are the definition of a sigma algebra and these are consequences — we can prove them from the definitions and axioms):

  • If $A, B \in \mathcal{F}$, then $A \cap B \in \mathcal{F}$ (by De Morgan’s laws, $A \cap B = (A^c \cup B^c)^c$, which is built entirely from complements and unions).
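
For a finite outcome set, we can check these closure properties directly. The following is a minimal Python sketch (my own illustration, not from the original notes; the helper `power_set` is a name I made up), using $\mathcal{F} = \mathcal{P}(E)$ for a coin flip:

```python
from itertools import chain, combinations

E = frozenset({"H", "T"})

def power_set(s):
    """All subsets of s, as frozensets."""
    items = list(s)
    return {frozenset(c) for c in chain.from_iterable(
        combinations(items, r) for r in range(len(items) + 1))}

F = power_set(E)

assert E in F                                  # contains E
assert all(E - A in F for A in F)              # closed under complement
assert all(A | B in F for A in F for B in F)   # closed under (finite) unions
assert all(A & B in F for A in F for B in F)   # consequence: intersections too
print(len(F))  # 4 events: {}, {H}, {T}, {H, T}
```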

Probability

Now that we have a sigma algebra, we can define the concept of probability. A probability distribution (or probability measure) $\mathrm{P}$ over a sigma algebra $\mathcal{F}$ is a function that obeys the following (Kolmogorov’s axioms):

  1. $\mathrm{P}[E] = 1$ — the probability of something happening is 1.
  2. $\mathrm{P}[A] \ge 0$ — non-negativity: probabilities are not negative.
  3. If $A_1, A_2, \dots, A_n$ are (countably many) disjoint events in $\mathcal{F}$, then $\mathrm{P}[\bigcup_i A_i] = \sum_i \mathrm{P}[A_i]$ (countable additivity).

A collection of disjoint sets is also called mutually exclusive. What it means is that for any distinct $A_i, A_j$ in the collection, $A_i \cap A_j = \emptyset$ — the two events cannot both happen simultaneously.

We call a field of events equipped with a probability measure, $(E, \mathcal{F}, \mathrm{P})$, a probability space.

What this probability measure does is describe how “much” of the total probability is associated with each event. This is sometimes called the probability mass, because probability acts like a conserved quantity (like mass or energy in physics). There is a total probability of 1 (from the first axiom, $\mathrm{P}[E] = 1$); the probability measure over other events tells us how likely they are relative to other events by quantifying how much of the probability mass is placed on them: if $\mathrm{P}[A] = 0.5$, that tells us that half the probability mass is on event $A$. This then has a variety of interpretations:

  • Interpreted as a description of long-run frequencies: if we repeated the random process infinitely many times, half of the outcomes should be $A$.
  • Interpreted as an expectation of future observations of currently-unobserved outcomes: $A$ is just as likely as not.

The non-negativity axiom keeps us from trying to assign negative probabilities to events because they won’t be meaningful, and the countable additivity axiom ensures that probabilities “make sense” in a way consistent with describing a distribution of mass across the various events. If we have two distinct, disjoint events, then the probability mass assigned to the pair is the sum of their individual masses; the axiom generalizes this to countable sets of disjoint events.

Some additional facts about probability can be derived from the above axioms (the sketch after the list checks several of them numerically):

  • $\mathrm{P}[A] \le 1$ (combined with non-negativity, we have $0 \le \mathrm{P}[A] \le 1$)
  • $\mathrm{P}[A \cup B] = \mathrm{P}[A] + \mathrm{P}[B] - \mathrm{P}[A \cap B]$
  • $\mathrm{P}[A^c] = 1 - \mathrm{P}[A]$
  • $\mathrm{P}[A \setminus B] = \mathrm{P}[A] - \mathrm{P}[A \cap B]$
  • If $A \subseteq B$, then $\mathrm{P}[A] \le \mathrm{P}[B]$
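
Here is a minimal Python sketch (my own check, not from the original notes) that verifies several of these facts on the card-deck example, using the uniform measure $\mathrm{P}[A] = |A| / |E|$:

```python
from fractions import Fraction

# A = "the card is a 2", B = "the card is red", as in the examples above.
ranks = list(range(2, 11)) + ["J", "Q", "K", "A"]
suits = ["hearts", "diamonds", "clubs", "spades"]
deck = {(r, s) for r in ranks for s in suits}          # |E| = 52

A = {c for c in deck if c[0] == 2}                      # the four 2s
B = {c for c in deck if c[1] in ("hearts", "diamonds")} # the 26 red cards

def P(event):
    """Uniform probability measure: |event| / |E|."""
    return Fraction(len(event), len(deck))

assert P(A | B) == P(A) + P(B) - P(A & B)   # inclusion-exclusion
assert P(deck - A) == 1 - P(A)              # complement rule
assert P(A - B) == P(A) - P(A & B)          # difference rule
print(P(A | B))  # 7/13, i.e. 28/52
```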

Joint and Conditional Probability

We define the joint probability $\mathrm{P}[A, B] = \mathrm{P}[A \cap B]$: the probability of both $A$ and $B$ happening in the same observation. This is sometimes also written $\mathrm{P}[A; B]$, and commas and semicolons are sometimes mixed, usually to separate different kinds of events in the probability statement.

The conditional probability $\mathrm{P}[B|A]$, read “the probability of $B$ given $A$”, is the probability of $B$ conditioned on the knowledge that $A$ has happened.

Conditional and joint probabilities decompose as follows:

  • $\mathrm{P}[A, B] = \mathrm{P}[A|B]\, \mathrm{P}[B]$
  • $\mathrm{P}[A, B] = \mathrm{P}[B|A]\, \mathrm{P}[A]$

From this we can derive Bayes’ theorem:

$$\mathrm{P}[B|A] = \frac{\mathrm{P}[A|B]\, \mathrm{P}[B]}{\mathrm{P}[A]}$$

Bayes’ theorem gives us a tool to invert a conditional probability: given $\mathrm{P}[A|B]$ and the associated unconditional probabilities $\mathrm{P}[A]$ and $\mathrm{P}[B]$, we can obtain $\mathrm{P}[B|A]$. Crucially, this inversion requires the base rates of $A$ and $B$, as expressed by their unconditional probabilities — the conditional probability alone does not provide sufficient information.
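
As a worked example, here is a minimal Python sketch with made-up numbers (my own illustration): a rare condition $B$ and a fairly accurate test $A$, showing how much the base rate matters when inverting $\mathrm{P}[A|B]$ into $\mathrm{P}[B|A]$:

```python
# B = "has the condition", A = "test is positive"; all numbers are invented.
p_B = 0.01              # base rate P[B]
p_A_given_B = 0.95      # P[A|B]: sensitivity of the test
p_A_given_not_B = 0.05  # P[A|B^c]: false positive rate

# P[A] by marginalizing over B and its complement.
p_A = p_A_given_B * p_B + p_A_given_not_B * (1 - p_B)

# Bayes' theorem: P[B|A] = P[A|B] P[B] / P[A]
p_B_given_A = p_A_given_B * p_B / p_A
print(round(p_B_given_A, 3))  # about 0.161: far below the 95% sensitivity
```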

In many interesting settings, such as Bayesian inference, we are looking to compute and compare $\mathrm{P}[B|A]$ for several different $B$ and the same $A$. For such computations, we do not actually need $\mathrm{P}[A]$ unless we need the actual probabilities; if we simply wish to know which $B$ is the most likely for a given $A$, we treat $\mathrm{P}[A]$ as fixed and look for the largest joint probability $\mathrm{P}[A|B]\, \mathrm{P}[B]$ (sometimes called an unscaled probability, because we have not scaled it by $\mathrm{P}[A]$ to obtain a proper conditional probability). For such computations, we say that $\mathrm{P}[B|A] \propto \mathrm{P}[A|B]\, \mathrm{P}[B]$ ($\mathrm{P}[B|A]$ is proportional to $\mathrm{P}[A|B]\, \mathrm{P}[B]$).

Finally, we can marginalize a joint distribution by summing. If $\mathcal{B} = \{B_1, B_2, \dots, B_n\}$ is a collection of mutually exclusive events that span $E$, then:

$$\mathrm{P}[A] = \sum_{B \in \mathcal{B}} \mathrm{P}[A, B]$$

We call $\mathcal{B}$ a partition of $E$. By “span $E$”, we mean that for any $e \in E$, there is some $B_i \in \mathcal{B}$ such that $e \in B_i$.
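
Here is a minimal Python sketch of marginalizing over a partition (my own illustration; the events and probabilities are made up):

```python
# A small joint distribution P[A, B]: A is commute speed, and the
# partition B is the weather, which is either sunny or rainy.
joint = {
    ("slow", "sunny"): 0.10,
    ("fast", "sunny"): 0.60,
    ("slow", "rainy"): 0.20,
    ("fast", "rainy"): 0.10,
}

# P[A] = sum over the partition of P[A, B]
p_slow = sum(p for (a, _), p in joint.items() if a == "slow")
print(p_slow)  # 0.30000000000000004 (floating point; mathematically 0.3)
```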

Independence

Two events are independent if knowing the outcome of one tells you nothing about the probability of the other. The following are true if and only if $A$ and $B$ are independent (the sketch after the list checks them on the card example):

  • $\mathrm{P}[A|B] = \mathrm{P}[A]$
  • $\mathrm{P}[B|A] = \mathrm{P}[B]$
  • $\mathrm{P}[A, B] = \mathrm{P}[A]\, \mathrm{P}[B]$
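
On the card-deck example, $A$ = “2” and $B$ = “red” happen to be independent. A minimal Python check (my own illustration):

```python
from fractions import Fraction

# Counts from the 52-card deck: |E|, |A|, |B|, |A ∩ B|.
n, n_A, n_B, n_AB = 52, 4, 26, 2
P_A, P_B, P_AB = (Fraction(k, n) for k in (n_A, n_B, n_AB))

assert P_AB == P_A * P_B   # P[A, B] = P[A] P[B]: 2/52 == (4/52)(26/52)
assert P_AB / P_B == P_A   # P[A|B] = P[A]
print(P_A, P_AB / P_B)     # 1/13 1/13
```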

Continuous Probability & Random Variables

If $E$ is continuous (typically $E = \mathbb{R}$), then we can’t meaningfully talk about the probabilities of elementary events. The probability that an observation is exactly any particular value $x \in \mathbb{R}$ is (typically) zero.

Instead, we define a sigma field where events are intervals:

  • $E = \mathbb{R}$
  • $\mathcal{F}$ is the set of intervals, together with their complements and countable unions. It contains arbitrarily small intervals; under a continuous distribution, any individual value receives probability zero.

This is not the only way to define probabilities over continuous event spaces, but it is the common way of defining probabilities over real values. This particular sigma field is called the Borel sigma algebra, and we will denote it $(\mathbb{R}, \mathcal{B})$.

We often talk about continuous distributions as the distribution of a random variable $X$. A random variable is a variable that takes on random values; formally, it is a function that maps elementary events to values (here, real numbers). We can (often) observe or sample a random variable.

We define continuous probabilities in terms of a distribution function $F_X$:

$$F_X(x) = \mathrm{P}[X < x]$$

This is also called the cumulative distribution function (CDF).

We can use it to compute the probability for any interval:

$$\mathrm{P}[x_1 \le X < x_2] = F_X(x_2) - F_X(x_1)$$

This probability is called the probability mass on a particular interval.

Distributions are often defined by a probability density function $p$ such that

$$F_X(x) = \int_{-\infty}^x p(x_*)\, dx_*$$

Unlike probabilities or probability mass, densities can exceed 1. When you use sns.distplot and it shows the kernel density estimator (KDE), it is showing you an estimate of the density; that is why the $y$ axis is not bounded by 1.
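
Here is a minimal Python sketch of both points using SciPy (my own illustration; it assumes scipy is installed). Note that SciPy’s `cdf` computes $\mathrm{P}[X \le x]$, which agrees with the definition above for continuous distributions, since single points carry no mass:

```python
from scipy import stats

# P[x1 <= X < x2] = F(x2) - F(x1), here for a standard normal:
X = stats.norm(loc=0, scale=1)
print(round(X.cdf(1.0) - X.cdf(-1.0), 3))  # 0.683: mass within one sigma

# Densities are not probabilities and can exceed 1: the uniform
# distribution on [0, 0.1] has density 10 on that whole interval.
U = stats.uniform(loc=0, scale=0.1)
print(U.pdf(0.05))  # 10.0
```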

We can also talk about joint and conditional continuous probabilities and densities. When marginalizing a continuous probability density, we replace the sum with an integral:

$$p(x) = \int p(x, y)\, dy$$

Expectation

The expected value of a random variable $X$, $\mathrm{E}[X]$, is its mean. It is computed as the weighted sum over the possible values of $x$, where the weight for each value is its probability (or density). For discrete $X$ with probability measure $\mathrm{P}$, we have:

$$\mathrm{E}[X] = \sum_{x \in X} x\, \mathrm{P}[x]$$

If $X$ is continuous and has probability density $p$, we have:

$$\mathrm{E}[X] = \int x\, p(x)\, dx$$

We can also talk about the conditional expectation $\mathrm{E}[X|A]$, the expected value of $X$ given that we know event $A$ happened. It is defined as $\mathrm{E}[X|A] = \int x\, p(x|A)\, dx$.
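
Here is a minimal Python sketch of both computations (my own illustration): an exact weighted sum for a fair die, and a simple Riemann-sum approximation of the integral for a standard normal:

```python
import numpy as np
from scipy import stats

# Discrete: a fair six-sided die, E[X] = sum over x of x * P[x].
values = np.arange(1, 7)
probs = np.full(6, 1 / 6)
print(round(values @ probs, 6))  # 3.5

# Continuous: approximate E[X] = integral of x p(x) dx for a standard
# normal on a grid (analytically it is exactly 0).
xs = np.linspace(-8, 8, 10_001)
dx = xs[1] - xs[0]
approx_mean = np.sum(xs * stats.norm.pdf(xs)) * dx
print(abs(approx_mean) < 1e-9)  # True: the mean is 0 up to float error
```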

Variance and Covariance

The variance of a random variable $X$ is the expected value of its squared deviation from its mean:

$$\mathrm{Var}(X) = \mathrm{E}[(X - \mathrm{E}[X])^2]$$

The standard deviation is the square root of the variance ($\sigma_X = \sqrt{\mathrm{Var}(X)}$).

The covariance of two random variables is the expected value of the product of their deviations from their means:

$$\mathrm{Cov}(X, Y) = \mathrm{E}[(X - \mathrm{E}[X])(Y - \mathrm{E}[Y])]$$

The correlation is $r_{XY} = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \sigma_Y}$.

We can also show that $\mathrm{Var}(X) = \mathrm{Cov}(X, X)$.

Random variables can also be described as independent in the same way as events: knowing one tells you nothing about the other. If two random variables are independent, then their covariance $\mathrm{Cov}(X, Y) = 0$ (this implication is one-directional — there exist non-independent random variables whose covariance is 0).
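
A minimal Python sketch (my own illustration) estimating these quantities from samples of two dependent variables:

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(size=100_000)
y = 2 * x + rng.normal(size=100_000)  # y depends on x: Cov(X, Y) = 2

print(round(x.var(), 2))                  # about 1.0: Var(X)
print(round(np.cov(x, y)[0, 1], 2))       # about 2.0: Cov(X, Y)
print(round(np.corrcoef(x, y)[0, 1], 2))  # about 0.89: r = 2 / sqrt(1 * 5)
```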

Properties of Expected Values

Expected value obeys a number of useful properties ($X$ and $Y$ are random variables, and $\alpha$, $\beta$, etc. are real numbers):

  • Linearity of expectation:
    • $\mathrm{E}[X + Y] = \mathrm{E}[X] + \mathrm{E}[Y]$
    • $\mathrm{E}[\alpha X] = \alpha\, \mathrm{E}[X]$
  • If $X$ and $Y$ are independent, then $\mathrm{E}[XY] = \mathrm{E}[X]\, \mathrm{E}[Y]$
  • If $\mathrm{E}[X] = 0$, then $\mathrm{Var}(X) = \mathrm{E}[X^2]$
  • If $\mathrm{E}[X] = \mathrm{E}[Y] = 0$, then $\mathrm{Cov}(X, Y) = \mathrm{E}[XY]$
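
A minimal Python sketch (my own illustration) checking linearity of expectation on sample means; note that it holds even though $X$ and $Y$ here are dependent:

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.exponential(size=500_000)
y = x + rng.normal(size=500_000)  # y is not independent of x

# E[X + Y] = E[X] + E[Y], estimated from samples:
print(np.isclose((x + y).mean(), x.mean() + y.mean()))  # True

# E[aX] = a E[X]:
print(np.isclose((3 * x).mean(), 3 * x.mean()))  # True
```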

Expectation of Indicator Functions

Sets can be described by an indicator function (or characteristic function) $\mathbb{I}_A: E \to \{0, 1\}$. This function is defined as:

$$\mathbb{I}_A(x) = \begin{cases} 1 & x \in A \\ 0 & x \notin A \end{cases}$$

Then the expected value of this function is the same as the probability of $A$:

$$\mathrm{E}[\mathbb{I}_A(X)] = \mathrm{P}[A]$$
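
A minimal Python sketch (my own illustration): estimating $\mathrm{P}[X > 1]$ for a standard normal by averaging the indicator function over samples:

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.normal(size=1_000_000)

# I_A applied to each sample, for the event A = "X > 1".
indicator = (samples > 1.0).astype(float)
print(round(indicator.mean(), 3))  # about 0.159 (the true value is 0.1587)
```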

Odds

Another way of working with probability is to compute with odds: the ratio of the probability for an event to the probability against it. This is given by:

$$\mathrm{Odds}(A) = \frac{\mathrm{P}[A]}{\mathrm{P}[A^c]} = \frac{\mathrm{P}[A]}{1 - \mathrm{P}[A]}$$

The log odds are often computationally convenient, and are the basis of logistic regression:

$$\log \mathrm{Odds}(A) = \log \mathrm{P}[A] - \log(1 - \mathrm{P}[A])$$

The logit function converts probabilities to log-odds.

We can also compute an odds ratio of two outcomes:

$$\mathrm{OR}(A, B) = \frac{\mathrm{Odds}(A)}{\mathrm{Odds}(B)}$$
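
A minimal Python sketch (my own illustration) of odds, log-odds, and the logit, using `scipy.special.logit` and its inverse `expit`:

```python
from scipy.special import logit, expit

p = 0.8
odds = p / (1 - p)
print(odds)             # 4.000000000000001: 4-to-1 odds, up to float error
print(logit(p))         # about 1.386 = log(4): the log-odds
print(expit(logit(p)))  # 0.8 (up to float error): expit inverts logit

# An odds ratio between two outcomes:
q = 0.5
print(odds / (q / (1 - q)))  # about 4.0: OR(A, B)
```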

Interpretation

So far, this document has focused on probability as a mathematical object: we have a probability space obeying a set of axioms, and we can do various things with it. We have not, however, discussed what a probability is. Is it a description of frequencies? Something else?

Under the instrumentalist school of thought, the mathematical definition above is what a probability is. Probability is a measure over a sigma algebra that satisfies Kolmogorov’s axioms, nothing more, nothing less. In this view, all other interpretations of probability are simply applications of probability.

Frequentism defines probability as the long-run behavior of infinite sequences: the probability of an event is the fraction of times it would appear if we repeated the experiment or observation infinitely many independent times. $\mathrm{P}[H] = 0.5$ because, if we flip a coin infinitely many times, half of the results will be heads.

Subjectivism or subjective Bayesianism defines probability as a consistent description of the beliefs of a rational agent. That is, it describes what the agent currently believes about as-yet-unobserved outcomes. $\mathrm{P}[H] = 0.5$ because, prior to flipping a coin, the agent believes heads to be just as likely as tails.

There are variants on these theories and other theories as well. They may seem similar, but they are very different in terms of their implications and application. For one thing, under frequentism, probabilities only make sense in the context of repeated (or theoretically repeatable) random events, and we cannot talk about the probability of a fixed but unknown thing, such as a population parameter. Under subjective Bayesianism, because probability is about the agent’s subjective state of belief, we can talk about the probability of a population parameter, because all we are doing is describing the agent’s belief about the parameter’s value.

Further Reading

This thread by Michael Betancourt provides a good overview of the instrumentalist idea of probability, which treats probabilities simply as their mathematical objects. It influenced how I write about mass above.

Jeremy Kun has also written some useful tutorials on these topics. If you want to dive more deeply into probability theory, Michael Betancourt’s case studies are rather mathematically dense but quite good.