## An indicator approach to discrete probability

The purpose of this elementary post is to illustrate that much of discrete probability can be analyzed in terms of indicator functions, linearity and independence. These ideas are well known in probability, but the following approach is seldom seen in elementary probability books. Of course, these results can be derived by other means, e.g. directly or by analytic tools such as generating functions.

We first recall some terminology. Let $(\Omega, {\mathcal{F}}, {\mathbb{P}})$ be a probability space that supports whatever random variables we will consider. For $E \subset \Omega$, let $1_E: \Omega \rightarrow \{0, 1\}$ be the indicator function of $E$, i.e. $1_E(\omega) = 1$ if and only if $\omega \in E$. We will also use the concept of independence extensively; visit my earlier post if you need a refresher. We only mention that a collection of events is independent if and only if their indicators are (prove this). For any random variable $Y$, the mean is $\mu_Y = {\mathbb{E}}(Y)$ and the variance is $Var(Y) = {\mathbb{E}}(Y - \mu_Y)^2 = {\mathbb{E}}(Y^2) - \mu_Y^2$ (assuming existence). If $Y_1, ..., Y_N$ are independent (uncorrelatedness suffices), then

$Var(Y_1 + ... + Y_N) = Var(Y_1) + ... + Var(Y_N)$.

Some useful properties of indicator functions

The following holds for all $A, B \subset \Omega$:

$A = B \Leftrightarrow 1_A = 1_B$

$1_{A^c} = 1 - 1_A$

$1_{A \cap B} = 1_{AB} = 1_A1_B$

$1_{A \cup B} = 1_A + 1_B - 1_A1_B$

$1_A 1_A = 1_A$, $1_A(1 - 1_A) = 0$

$1_{A \triangle B} = | 1_A - 1_B |$

For example, the inclusion-exclusion formula can be proved easily by indicators (see this post). Also, the last property easily implies that $\rho(A, B) = {\mathbb{P}}(A \triangle B)$ defines a pseudo-metric on ${\mathcal{F}}$. (Prove this. And what is the corresponding completion?)
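The pointwise identities above can be verified by brute force on a small finite sample space. A minimal sketch, with the sets `Omega`, `A`, `B` chosen purely for illustration:

```python
# Hypothetical small sample space and two arbitrary events.
Omega = set(range(8))
A = {0, 1, 2, 5}
B = {2, 3, 5, 7}

def ind(E):
    """Indicator function of E, as a dict omega -> 0/1."""
    return {w: 1 if w in E else 0 for w in Omega}

oneA, oneB = ind(A), ind(B)

for w in Omega:
    assert ind(Omega - A)[w] == 1 - oneA[w]                        # 1_{A^c} = 1 - 1_A
    assert ind(A & B)[w] == oneA[w] * oneB[w]                      # 1_{A∩B} = 1_A 1_B
    assert ind(A | B)[w] == oneA[w] + oneB[w] - oneA[w] * oneB[w]  # 1_{A∪B}
    assert ind(A ^ B)[w] == abs(oneA[w] - oneB[w])                 # 1_{A△B} = |1_A - 1_B|
```

Since the identities are pointwise in $\omega$, checking them on every point of a finite $\Omega$ is a genuine proof for that $\Omega$.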

1. Bernoulli

Fix $0 < p < 1$ and let $A$ be an event with ${\mathbb{P}}(A) = p$. Let $X = 1_A$. Then $X$ is distributed as Bernoulli($p$), i.e.

${\mathbb{P}}(X = 1) = p, {\mathbb{P}}(X = 0) = 1 - p$.

This is the simplest non-trivial random variable. The mean of $X$ is simply ${\mathbb{E}}(X) = {\mathbb{E}}(1_A) = {\mathbb{P}}(A) = p$, and the variance is $Var(X) = {\mathbb{E}}(X^2) - p^2 = p - p^2 = p(1 - p)$. (Note that $X^2 = X$ in this case.)

2. Binomial

Fix $0 < p < 1$, $n \geq 1$, and let $X_i$, $i = 1, ..., n$, be i.i.d. Bernoulli($p$) random variables. Let

$X = X_1 + X_2 + ... + X_n$.

Then $X$ is distributed as Binomial($n$, $p$). For $0 \leq k \leq n$, the probability ${\mathbb{P}}(X = k)$ equals

${\mathbb{P}}(X_1 + ... + X_n = k) = {\mathbb{P}}\big( \bigcup_{i_1 < ... < i_k} \{X_{i_1} = ... = X_{i_k} = 1, X_j = 0 \text{ for } j \notin \{i_1, ..., i_k\}\} \big) = C_k^n p^k (1 - p)^{n - k}$.

The union is disjoint, and by independence each of the $C_k^n$ events in it has probability $p^k (1 - p)^{n - k}$.

The mean of $X$ is

${\mathbb{E}}(X) = \sum_{i = 1}^n {\mathbb{E}}(X_i) = np$.

Similarly, the variance is

$Var(X) = \sum_{i = 1}^n Var(X_i) = np(1 - p)$.

This is almost trivial, but much less so if you compute directly from the definition (try!):

$\sum_{k = 0}^n k C_k^n p^k (1 - p)^{n - k} = np$.

$\sum_{k = 0}^n (k - np)^2 C_k^n p^k (1 - p)^{n - k} = np(1 - p)$.

Thus we see the power of expressing $X$ in terms of a sum of i.i.d. Bernoulli random variables.
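Although proving the two sums directly takes some algebra, checking them numerically takes a few lines. A quick sanity check with the hypothetical values $n = 10$, $p = 0.3$, using only the standard library:

```python
from math import comb, isclose

n, p = 10, 0.3

# Mean: sum_k k * C(n,k) p^k (1-p)^(n-k) should equal n*p.
mean = sum(k * comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1))
assert isclose(mean, n * p)

# Variance: sum_k (k - np)^2 * C(n,k) p^k (1-p)^(n-k) should equal n*p*(1-p).
var = sum((k - n * p)**2 * comb(n, k) * p**k * (1 - p)**(n - k)
          for k in range(n + 1))
assert isclose(var, n * p * (1 - p))
```

Of course this verifies the identities only for one $(n, p)$, while the indicator argument gives them for all.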

3. Geometric

Again we fix $0 < p < 1$. Let $\{X_n\}_{n = 1}^{\infty}$ be an infinite sequence of i.i.d. Bernoulli($p$) random variables. Let

$X = \inf \{n \geq 1: X_n = 1\}$.

Then $X$ is distributed as Geometric($p$). The probabilistic meaning is that you wait until the first “success” occurs. A standard Borel-Cantelli argument shows that ${\mathbb{P}}(X < \infty) = 1$. Note also that

$X = \sum_{n = 1}^{\infty} n 1_{\{X_k = 0, k < n, X_n = 1\}}$.

For $n = 1, 2, ...$,

${\mathbb{P}}(X = n) = {\mathbb{P}}(X_k = 0, k < n, X_n = 1) = (1 - p)^{n - 1}p$.

We would like to calculate ${\mathbb{E}}(X)$ and $Var(X)$. Again you may try to calculate

$\mu_X = \sum_{n = 1}^{\infty} n (1 - p)^{n - 1}p$   and   $Var(X) = \sum_{n = 1}^{\infty} (n - \mu_X)^2 (1 - p)^{n - 1}p$,

but we prefer the following approach which is more probabilistic.

We first calculate the mean. By the tower property of conditional expectation,

${\mathbb{E}}(X) = {\mathbb{E}}({\mathbb{E}}(X | X_1)) = {\mathbb{E}}(X | X_1 = 0) {\mathbb{P}}(X_1 = 0) + {\mathbb{E}}(X | X_1 = 1){\mathbb{P}}(X_1 = 1)$.

On the set $\{X_1 = 1\}$, $X = 1$. Hence the second term is just $p$. Next, observe that on the set $\{X_1 = 0\}$,

$X = \inf\{n \geq 1: X_n = 1\} = 1 + \inf\{n \geq 1: X_{n+1} = 1\} = 1 + Y$,

where $Y = \inf\{n \geq 1: X_{n+1} = 1\}$. Clearly, $Y$ (defined everywhere on $\Omega$) is independent of $X_1$ and has the same distribution as $X$. Hence, if $X_1$ fails, the process “regenerates” itself. It follows that

${\mathbb{E}}(X | X_1 = 0) = 1 + {\mathbb{E}}(Y | X_1 = 0) = 1 + {\mathbb{E}}(Y) = 1 + {\mathbb{E}}(X)$.

In summary, we have derived the equation

${\mathbb{E}}(X) = p + (1 - p)(1 + {\mathbb{E}}(X))$,

and solving yields $\mu_X = {\mathbb{E}}(X) = 1/p$. Actually, this can be rephrased in terms of the Markov property.

We can calculate ${\mathbb{E}}(X^2)$ (and hence $Var(X)$) by a similar method. We get

${\mathbb{E}}(X^2) = {\mathbb{E}}({\mathbb{E}}(X^2 | X_1)) = p \cdot 1 + (1 - p) {\mathbb{E}}((1 + X)^2)$.

Now ${\mathbb{E}}((1 + X)^2) = 1 + 2/p + {\mathbb{E}}(X^2)$. Hence after solving we get ${\mathbb{E}}(X^2) = (2 - p) / p^2$ and hence

$Var(X) = {\mathbb{E}}(X^2) - \mu_X^2 = \frac{1 - p}{p^2}$.
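Both formulas can be checked against the direct series definition by truncating the sums. A sketch with the hypothetical choice $p = 0.25$; the geometric tail beyond $N = 2000$ is negligible:

```python
from math import isclose

p = 0.25
mu = 1 / p               # claimed E(X) = 1/p
var = (1 - p) / p**2     # claimed Var(X) = (1-p)/p^2

# Truncated series from the pmf P(X = n) = (1-p)^(n-1) * p.
N = 2000
pmf = [(1 - p)**(n - 1) * p for n in range(1, N + 1)]
mean_series = sum(n * q for n, q in zip(range(1, N + 1), pmf))
var_series = sum((n - mu)**2 * q for n, q in zip(range(1, N + 1), pmf))

assert isclose(mean_series, mu, rel_tol=1e-9)
assert isclose(var_series, var, rel_tol=1e-9)
```

The regeneration argument, by contrast, never touches the series at all.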

4. Negative binomial (Next time)

5. Multinomial

Let $m, n$ be positive integers, and fix a non-degenerate probability vector $(p_1, ..., p_m)$, i.e. one whose components are all strictly positive. Let $Y_1, ..., Y_n$ be i.i.d. random variables with ${\mathbb{P}}(Y_1 = i) = p_i$. You may think of them as $n$ independent draws from an urn containing $m$ labelled balls.

For each $i$, let $X_i = \sum_{j = 1}^n 1\{Y_j = i\}$ be the count of “ball $i$”. Let $X$ be the random vector $X = (X_1, ..., X_m)$. Then $X$ is said to be distributed as Multinomial($n$, $m$, $p$). (This notation is not standard.) We have

${\mathbb{P}}((X_1, ..., X_m) = (k_1, ..., k_m)) = \frac{n!}{k_1!...k_m!} p_1^{k_1}...p_m^{k_m}$

for each $m$-tuple $(k_1, ..., k_m)$ of non-negative integers such that $k_1 + ... + k_m = n$. To see this, note that each atom (with positive mass) of the event corresponds to a partition of $\{1, ..., n\}$ into $m$ classes of sizes $k_1, ..., k_m$. And, given such a partition, the probability of drawing balls according to this partition is $p_1^{k_1}...p_m^{k_m}$. Since there are $\frac{n!}{k_1!...k_m!}$ such partitions, multiplying gives the probability.
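As a sanity check, the pmf should sum to $1$ over all tuples with $k_1 + ... + k_m = n$. A brute-force verification with the hypothetical values $n = 5$, $p = (0.2, 0.3, 0.5)$:

```python
from itertools import product
from math import factorial, isclose

n, probs = 5, (0.2, 0.3, 0.5)   # n draws, m = 3 categories
m = len(probs)

def pmf(ks):
    """Multinomial pmf n!/(k_1!...k_m!) * p_1^{k_1} ... p_m^{k_m}."""
    coef = factorial(n)
    for k in ks:
        coef //= factorial(k)
    out = coef
    for k, p in zip(ks, probs):
        out *= p**k
    return out

# Enumerate all m-tuples of counts summing to n and add up their masses.
total = sum(pmf(ks) for ks in product(range(n + 1), repeat=m)
            if sum(ks) == n)
assert isclose(total, 1.0)
```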

The first thing to note is that the marginal distribution of each $X_i$ is simply Binomial($n$, $p_i$), i.e.

${\mathbb{P}}(X_i = k) = C_k^n p_i^k (1 - p_i)^{n - k}$.

This follows directly from the definition $X_i = \sum_{j = 1}^n 1\{Y_j = i\}$. Compare this with the following direct but somewhat messy calculation:

$C_k^n p_i^k (1 - p_i)^{n - k} = \sum_{\substack{k_1, ..., k_{i-1}, k_{i+1}, ..., k_m \\ \sum_{j \neq i} k_j = n - k}} \frac{n!}{k_1!...k_m!} p_1^{k_1}...p_m^{k_m}$,

where $k_i = k$ on the right-hand side.

It follows that ${\mathbb{E}}(X_i) = np_i$ and $Var(X_i) = np_i(1 - p_i)$.
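The messy marginalization can at least be confirmed by brute force. A sketch with the hypothetical values $n = 5$, $p = (0.2, 0.3, 0.5)$, focusing on the category with 0-based index `i = 1`, i.e. $p_i = 0.3$:

```python
from itertools import product
from math import comb, factorial, isclose

n, probs = 5, (0.2, 0.3, 0.5)
m = len(probs)
i = 1                    # category of interest (0-based), p_i = 0.3

def pmf(ks):
    """Multinomial pmf n!/(k_1!...k_m!) * p_1^{k_1} ... p_m^{k_m}."""
    coef = factorial(n)
    for k in ks:
        coef //= factorial(k)
    out = coef
    for k, p in zip(ks, probs):
        out *= p**k
    return out

p_i = probs[i]
for k in range(n + 1):
    # Marginal P(X_i = k): sum the multinomial pmf over all other counts.
    marginal = sum(pmf(ks) for ks in product(range(n + 1), repeat=m)
                   if sum(ks) == n and ks[i] == k)
    # Should match the Binomial(n, p_i) pmf.
    assert isclose(marginal, comb(n, k) * p_i**k * (1 - p_i)**(n - k))
```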

Our next aim is to calculate the covariances among the components of $X$. This gives some idea of the dependence structure of the variable. Recall that the covariance of $X_i$ and $X_j$ is defined by

$Cov(X_i, X_j) = {\mathbb{E}}(X_i - \mu_{X_i})(X_j - \mu_{X_j}) = {\mathbb{E}}(X_iX_j) - \mu_{X_i} \mu_{X_j}$.

If $i = j$ then $Cov(X_i, X_i) = Var(X_i)$. (This is nothing but $\langle x, x \rangle = \|x\|^2$ in terms of inner products.)

It is a disaster to calculate ${\mathbb{E}}(X_iX_j)$ directly from the definition (of course, you may try). We shall approach it using indicators:

${\mathbb{E}}(X_iX_j) = {\mathbb{E}}(\sum_{r = 1}^n1\{Y_r = i\} \sum_{s = 1}^n 1\{Y_s = j\}) = \sum_{r, s = 1}^n {\mathbb{P}}(Y_r = i, Y_s = j)$.

We focus on the case $i \neq j$. Then the diagonal terms vanish (why?) and the sum becomes

$\sum_{r \neq s}{\mathbb{P}}(Y_r = i, Y_s = j) = n(n-1)p_ip_j$.

Hence $Cov(X_i, X_j) = n(n-1)p_ip_j - np_i \cdot np_j = -np_ip_j$. This is negative, and is to be expected: on average, if you get more of something, you get less of everything else. One more word: the correlation coefficient is

$\rho_{ij} = \frac{Cov(X_i, X_j)}{\sqrt{Var(X_i)}\sqrt{Var(X_j)}} = -\sqrt{\frac{p_ip_j}{(1 - p_i)(1 - p_j)}}$.

It does not depend on $n$. (Ask yourself why.)
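Both the covariance formula and the $n$-independence of $\rho_{ij}$ can be confirmed by exact enumeration over the multinomial support. A sketch with the hypothetical values $n = 6$, $p = (0.2, 0.3, 0.5)$:

```python
from itertools import product
from math import factorial, isclose, sqrt

n, probs = 6, (0.2, 0.3, 0.5)
m = len(probs)

def pmf(ks):
    """Multinomial pmf n!/(k_1!...k_m!) * p_1^{k_1} ... p_m^{k_m}."""
    coef = factorial(n)
    for k in ks:
        coef //= factorial(k)
    out = coef
    for k, p in zip(ks, probs):
        out *= p**k
    return out

support = [ks for ks in product(range(n + 1), repeat=m) if sum(ks) == n]
i, j = 0, 1

# E(X_i X_j) computed exactly from the joint pmf.
E_XiXj = sum(ks[i] * ks[j] * pmf(ks) for ks in support)
cov = E_XiXj - (n * probs[i]) * (n * probs[j])
assert isclose(cov, -n * probs[i] * probs[j])          # Cov = -n p_i p_j

rho = cov / sqrt(n * probs[i] * (1 - probs[i]) * n * probs[j] * (1 - probs[j]))
# rho should be free of n.
assert isclose(rho, -sqrt(probs[i] * probs[j] / ((1 - probs[i]) * (1 - probs[j]))))
```

Rerunning with a different $n$ leaves `rho` unchanged, matching the formula above.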