## Martingale Theory I: Background

Last revised: 28/10

I am going to write about martingale theory. After discussing some background knowledge, our first aim are the optional stopping and convergence theorems in discrete time. After that, we shall deal with refinements in continuous time and some applications. Your comments and corrections are highly appreciated.

Here are some good books that come to my mind: [C], [D1][DM], [DW], [S], [K].

*******************************

Introduction

Martingale originally means a gambling strategy: Suppose you double your bet whenever you lose. Then when you finally win, you win all the previous losses back. As a mathematical object, it was used quite a bit by Levy and others in 1930s, but it was Doob who proved the main theorems and illustrated its amazing uses. Nowadays, martingale is a major tool in both theoretical and applied probability. Regarding the history of martingale theory, I would like to quote Leo Breiman:

Probability has two hands: On the right is the
rigorous; the left hand thinks probabilistically,
reduces problems to gambling situations, coin-
tossing, motions of a physical particle.

This also serves as the guiding principle of our exposition.

*******************************

(1) Classical example: Gambler’s ruin (symmetric case)

We follow tradition and start by describing a famous gambling problem in intuitive terms:

Suppose you enter a casino with $x$ dollars. Each time you bet \$1 against the toss of a fair coin (where the tosses are independent of each other). You quit if either you lose all money or have $N$ dollars in your pocket. What is the probability that you lose your shirt? (or, in other words, to lose $x$ dollars before winning $N - x$ dollors?)

Let $X_n$, $n = 1, 2, ...$ denote the outcome of the $n$-th bet. Then ${\mathbb{P}}(X_n = 1) = {\mathbb{P}}(X_n = -1) = 1/2$ and your position after the $n$-th bet is

$S_n = x + X_1 + X_2 + ... + X_n$.

We set $S_0 = x$. Let $T$ be the first time that $S_n = 0$ or $N$. The required probability is thus ${\mathbb{P}}(S_T = 0)$.

Suppose $S_n = y$ after the first $n$ tosses. Then $S_{n+1}$ depends only on the $n+1$-th toss. Now, $S_{n+1}$ is either $S_n + 1$ or $S_n - 1$, and both has probability 1/2. The conditional expectation is thus $S_n$. We record this fact by writing

${\mathbb{E}}[S_{n+1} | S_0, S_1, ..., S_n] = S_n$.

This says that $S_n$ is a martingale. Among other things, this implies that

${\mathbb{E}}(S_n) = x$

for all $n$. This means that given the past outcomes, the next bet is always “fair”.

*******************************

To emphasize the gambling flavor, we deliberately use a “loose” language in the above discussion, using words like “coin tosses”, “outcome”, “first time”, “suppose … after …” etc. The concept of martingale is a substantial generalization of the idea of “fair game“, but the definition requires a certain amount of measure-theoretic machinery. Before embarking on a rigorous development, let us quickly solve the ruin problem and give some remarks. We write the required probability as $u(x) = {\mathbb{P}}_x(S_T = 0)$. Here the subscript $x$ means that the initial position is $x$.

1.1 Solution 1: Recall that ${\mathbb{E}}_x(S_n) = x$. We simply put $n = T$ (see Remark 1.3). Then ${\mathbb{E}}_x(S_T) = x$. Now $S_T$ is either $0$ or $N$, with probabilities $u(x)$ and $1 - u(x)$ respectively. It follows that

${\mathbb{E}}_x(S_T) = 0 \times u(x) + N \times (1 - u(x)) = x \Rightarrow u(x) = \frac{N - x}{N}$. $\blacksquare$

1.2 Solution 2: Consider the situation after the first toss. With probability 1/2, $S_1 = x - 1$, and the (conditional) probability is $u(x-1)$. Otherwise, $S_1 = x + 1$ and the (conditional) probability is $u(x + 1)$. Hence for $x = 1, ..., N - 1$, total probability law (see Remark 1.4) gives

$u(x) = \frac{1}{2}u(x+1) + \frac{1}{2}u(x-1)$

This gives a difference equation in $u(x)$. Together with the boundary conditions $u(0) = 1$ and $u(N) = 0$, we get $u(x) = \frac{N - x}{N}$ as above. $\blacksquare$

Can you formulate and solve the asymmetric case?

*******************************

1. 3 Remark. In Solution 1 we used the following reasoning: The gamble is fair at any time $n$. Hence it is still fair at the first time that $S_n$ hits $0$ or $N$. But watch out! Here $T = \inf\{n: S_n = 0 \text{ or } N \}$ is random! Can we substitute $T$ into the formula ${\mathbb{E}}_x(S_n) = x$, where $n$ is fixed? This is true only for a certain class of random times called stopping times. We will discuss this again when we study the optional sampling theorem.

1.4 Remark. Actually, in Solution 2 we did not use the martingale property of $S_n$. What we used is the Markov property of $S_n$. The difference equation can be given a potential-theoretic meaning. If we let

$\Delta u(x) = [\frac{1}{2}u(x+1) + \frac{1}{2}u(x-1)] - u(x), x = 1, ..., N - 1$

to be the discrete Laplace operator associated to $S_n$, then $u(x) = {\mathbb{E}}_x(1(S_T = 0))$ solves the Dirichlet problem

$\Delta u(x) = 0, x = 1, ..., N - 1, u(0) = 1, u(N) = 0$.

We will discuss probabilistic potential theory only later.

*******************************

(2) Basic ideas of stochastic processes

Our development will rest on the modern measure-theoretic foundation of probability. Unfortunately, the material is somewhat top-heavy: there are a lot of measure-theoretic notions to handle if we want to be really rigorous. As you have just seen, probability is full of intuitive terms, and it is sometimes hard for the novice to separate “probabilistic thinking” from “rigor”. Actually, many probabilists and statisticians in the early-mid 20th century resisted the invasion of measure theory (see [D2] for an interesting historical survey). Several decades have passed, and now measure theory is widely accepted as the(?) proper foundation of probability. But the drawback is that things become more and more technical. Here is a relevant quote from Meyer:

One needs [for stochastic integration] a six month
course [to cover only] the definitions. What is there to do?

This is a dilemma of modern math education.

*******************************

First, we want a mathematical model of chance processes (over time). There are many examples: our position in a casino, the ups and downs of stock prices, the motion of a particle in a cup of still water, the inevitable noise over the communication channel, the population of a species, the trajectory of a rocket, etc. We shall model these phenomena mathematically in terms of stochastic processes.

Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space.

2.1 Definition. A discrete-time (real-valued) stochastic process is a collection $X = \{X_n: n \geq 0\}$ of random variables defined on $(\Omega, \mathcal{F}, \mathbb{P})$.

Recall that a random variable is nothing but a real-valued measurable function on $\Omega$. Although not apparent in the discrete time setting, a better point of view is to treat $X$ as a function on the product space ${\mathbb{N}} \times \Omega$. That is, $X: {\mathbb{N}} \times \Omega \rightarrow {\mathbb{R}}$, where for each $n \in {\mathbb{N}}$, the function $\omega \in \Omega \mapsto X_n(\omega) = X(n, \omega)$ is a random variable. (Note: It follows that if ${\mathbb{N}}$ is given the discrete sigma-field, then $X$ is automatically measurable with respect to the product sigma-field on ${\mathbb{N}} \times \Omega$. This technicality is quite important in continuous time.)

For each $\omega \in \Omega$, the real sequence $\{X_n(\omega)\}_{n = 0}^{\infty}$ (in ordinary sense) is called the sample path or trajectory of $\omega$. The sample path can be regarded as an element of the function space ${\mathbb{R}}^{{\mathbb{N}}}$ endowed with the product sigma-field. Hence, a third formulation of a stochastic process is simply a measurable function from $\Omega$ to ${\mathbb{R}}^{{\mathbb{N}}}$.

*******************************

2.2 Examples
Here is a list of processes commonly encountered in probability books.

(1) Let $X_0, X_1, X_2, ...$ be a sequence of independent and identically distributed (i.i.d.) random variables,  Then $\{X_n\}_{n = 0}^{\infty}$ is a stochastic process, called the independent stationary process.

(2) If each $X_n$ in (1)  is distributed as Bernoulli($p$), then $\{X_n\}_{n = 0}^{\infty}$ is a mathematical model of an infinite sequence of (possibly unfair) coin tosses.

(3) If each $X_n$  in (1) is distributed as $N(0, \sigma^2)$ (normal distribution), then $\{X_n\}_{n = 0}^{\infty}$ is a mathematical model of (discrete-time) white noise, useful in signal processing and time series analysis.

The independent stationary process by itself is not so interesting as the $X_n$‘s at different times are independent. Very often, $\{X_n\}$ is used as a building block to construct other processes:

(4) Fix $x \in {\mathbb{R}}$ and define $S_0 = x$,

$S_n = x + X_1 + X_2 + ... X_n, n \geq 1$.

The process $\{S_n\}$ is called a random walk. (See our first example.)

(5) In (1), suppose $X_1$ (and hence for all $n$) has finite expectation (i.e. ${\mathbb{E}}|X_1| < \infty$). Let

$\overline{X}_n = \frac{1}{n}(X_1 + X_2 + ... + X_n)$

be the average of the first $n$ “observations”. The strong law of large number (look at the picture) states that $\overline{X}_n$ converges to $\mu = {\mathbb{E}}X_1$ almost surely as $n \rightarrow \infty$.

(6) Let $M_n = \max_{1 \leq i \leq n} |X_i|$. $\{M_n\}$ is called the maximal process associated to $\{X_n\}$.

(7) In (1), suppose that each $X_n$ is uniformly distributed on $[0, 1]$. Let $S$ be a countable set, and let $F: S \times [0, 1] \rightarrow S$ be such that for each $s \in S$, $F(s, \cdot)$ is measurable. Let $s_0 \in S$ be fixed and define

$Z_0 = s_0,$

$Z_{n+1} = F(Z_n, X_{n+1})$.

Then $\{Z_n\}$ is a Markov chain on $S$, which can be viewed as a randomized dynamical system (if $F(s, \cdot) = F(s)$ is constant for each $s$, then $Z_n = F^n(s_0)$). This representation of Markov chain also provides a convenient method for simulation.

*******************************

Some remarks on probability distributions (You may skip this)
We used several distributions in the above example. In general, the distribution of a random variable $Z$ is the Borel measure $\mu_Z(\cdot) = {\mathbb{P}}(Z \in \cdot) = {\mathbb{P}} \circ Z^{-1}$ on ${\mathbb{R}}$. (Here, $\{Z \in E\}$ is a shorthand of the set $\{\omega \in \Omega: Z(\omega) \in E\}$.) In probability, the Bernoulli, binomial, Poisson and normal (etc) distributions are just Borel measures on ${\mathbb{R}}$. For example, the standard Normal distribution $N(0, 1)$ is the Borel measure

$\gamma(E) = \int_E \frac{1}{\sqrt{2 \pi}} \exp(-\frac{x^2}{2}) dx$.

We say that a certain random variable $Z$ has a certain distribution if ${\mathbb{P}} \circ Z^{-1}$ is that measure. For example, we say that $Z$ has standard normal distribution (denoted as $Z \sim N(0, 1)$ if ${\mathbb{P}}(Z \in E) = \gamma(E)$ for all Borel sets $E$.

More generally, if $(E, {\mathcal{E}})$ is a measurable space and $X: \Omega \rightarrow E$ is measurable, we say that $X$ is a random element in $E$. The distribution of $X$ is the measure $\mu_X(\cdot) = {\mathbb{P}} \circ X^{-1} (\cdot)$ on $(E, {\mathcal{E}})$. Applying this to the case where $E$ is a function space, we may define the distribution of a stochastic process. For example, the Wiener measure corresponding to Brownian motion (to be introduced later) is a measure on $C[0, 1]$, the space of continuous functions on $[0, 1]$.

*******************************

Sigma-fields
The interesting thing of a stochastic process is the dependence structure of the random variables $\{X_n\}$. For example, the values $X_1, ..., X_{10}$ may tell us something about $X_{11}$. Naturally, we will use conditional probability to describe the structure.

In the general framework, we will use sigma-fields to represent information. Here is a trivial (but important) example.

2.3 Example. Take $(\Omega, {\mathcal{F}}) = ([0, 1), {\mathcal{B}})$, the unit interval with Borel sigma-field. For each $n$, let ${\mathcal{F}}_n$ be the sigma algebra generated by dyadic intervals of the form $[\frac{k}{2^n}, \frac{k+1}{2^n}), k = 0, ..., 2^n - 1$. Now let $\omega \in [0, 1)$ be a fixed but unknown point. For each $A \in {\mathcal{F}}_n$, we may ask whether $x \in A$. Suppose we know the answer for all $A \in {\mathcal{F}}_n$, i.e. we know the values $1_A(\omega), A \in {\mathcal{F}}_n$. (Here $1_A$ is the indicator function of $A$.) This is equivalent to knowing the unique dyadic interval of level $n$ that contains $\omega$. As $n$ increases, the position of $\omega$ is known more and more precisely. In this sense, we may say that ${\mathcal{F}}_n$ contains more information than ${\mathcal{F}}_m$ if $n > m$.

The dynamic nature of information in stochastic processes is illustrated by the following gambling situation.

2.4 Example. Suppose we are betting against the sum of several dices (e.g. “Big-small”). Let $X_n$ be the profit/loss of the $n$-th gamble per unit bet, and let $A_n$ be your bet on the $n$-th draw. Having observed a few previous outcomes, you may choose not to bet at time $n$ by choosing $A_n(\omega) = 0$. Suppose initially you have $x$ dollars. Then your position after the $n$-th draw is

$M_n = x + A_1X_1 + ... + A_nX_n$.

For example, suppose your strategy is to bet 1 unit whenever the previous 3 draws are “big”. Then we may set $A_n = 1(X_{n-1}, X_{n-2}, X_{n-3} \text{ are big})$.

Intuitively, your choice $A_n$ should depend only on the previous outcomes $X_1, ..., X_{n-1}$. It is then reasonable that

$A_n(\omega) = f_n(X_1(\omega), ..., X_{n-1}(\omega))$

for some (measurable) function $f_n$. A fancy way of expressing this is to say that $A_n$ is measurable with respect to the sigma-field generated by $X_1, ..., X_{n-1}$, i.e. $A_n \in {\mathcal{F}}_{n-1} = \sigma(X_1, ..., X_{n-1})$ (see the next definition). As time passes (i.e. $n$ increases), ${\mathcal{F}}_n$ also increases, and we are making decisions based on more and more information.

*******************************

2.5 Definition. Let $\{(E_i, {\mathcal{E}}_i)\}_{i \in I}$ be a family of measurable spaces, and let $f_i: \Omega \rightarrow E_i$ be measurable. The sigma-algebra generated by $\{f_i\}_{i \in I}$, denoted by $\sigma(\{f_i\}_{i \in I})$, is the smallest sigma-algebra containing sets of the form $f_i^{-1}(E_i)$, where $E_i \in {\mathcal{E}}_i$ and $i \in I$.

We formulate a general version of Example 2.4. Roughly speaking, it says that if $g$ is measurable with respect to a certain sigma-field, then $g$ is a function of the random element that generates the sigma-field.

2. 6 Doob-Dynkin Lemma. Let $(E, {\mathcal{E}})$ be a measurable space and let $f: \Omega \rightarrow E$ be measurable. Then $g: \Omega \rightarrow {\mathbb{R}}$ is $\sigma(f)$-measruable if and only if there exists a measurable $h: E \rightarrow {\mathbb{R}}$ such that $g = h \circ f$.

Proof: We consider first the case that $g$ takes countably many distinct values $a_n$. For each $n$, $\{g = a_n\} = g^{-1}(a_n)$ is $\sigma(f)$-measurable, hence $\{g = a_n\} = \{f \in A_n\}$ for some $A_n \in {\mathcal{E}}$. Now, although the sets $\{g = a_n\}$ are disjoint, $A_n$ may not be (why?). But we can remedy this by letting $B_n = A_n \setminus \bigcup_{k = 1}^{n - 1}A_k$. We let then $h = \sum_n a_n1_{B_n} + a1_B$, where $a$ is distinct from all $a_n$ and $B$ is the complement of $\bigcup_n B_n$. (Why need this extra fuss?) It follows that $f = h \circ g$.

For the general case, assume that $g$ is measurable. Recall that every measurable function can be written as a limit of simple functions. Explicitly,

$g_n = \sum_{k \in {\mathbb{Z}}} k2^{-n}1\{k2^{-n} < g \leq (k+1)2^{-n}\}$

converges to $g$ (uniformly). For each $n$, we may construct $h_n$ such that $g_n = h_n \circ f$. Now, it is attempting to let $h = \lim_n h_n$, but this is not legitimate (WHY?). Instead, let $H = \{x \in E: \lim_n h_n(x) \text{ exists}\}$. We then set $h = \lim_n h_n$ on $H$ and $0$ elsewhere. Then $g = h \circ f$ everywhere. $\blacksquare$

2.7 Exercise. Formulate Example 2.4 above by Doob-Dynkin lemma.

2.8 Exercise. Let ${\mathcal{G}}$ be a sub-sigma-field of ${\mathcal{F}}$ and let a random variable $Z$ be ${\mathcal{G}}$-measurable. On $\Omega$, consider the following equivalence relation:

$\omega \sim \omega' \Leftrightarrow 1_A(\omega) = 1_A(\omega')$ for all $A \in {\mathcal{G}}$.

Then $Z$ is constant on each equivalence class.

*******************************

(3) Conditional probability

There are many reasons that probability theory is not merely a branch of measure theory. One reason is the following. Measure theory focuses on sigma-additivity, but in probability there is also a multiplication law that deals with conditional probabilities. (Technically it is related to product measure, but (question) what is its role in real analysis, except in iterated intergrals?)

The mathematics in this section is intended only to sharpen intuition and will not be used subsequently. I also want to take this chance to introduce a few ideas from statistics. As above we work on a given probability space $(\Omega, {\mathcal{F}}, {\mathbb{P}})$.

3.1 Definition. Let $A, B \in {\mathcal{F}}$ with ${\mathbb{P}}(B) > 0$. The conditional probability of $A$ given $B$ is

${\mathbb{P}}(A | B) = \frac{{\mathbb{P}}(A \cap B)}{{\mathbb{P}}(B)}$.

Any events $A$ and $B$ (${\mathbb{P}}(B) > 0$ or not) are said to be independent if ${\mathbb{P}}(A \cap B) = {\mathbb{P}}(A){\mathbb{P}}(B)$. More generally, events $A_1, A_2, ..., A_n$ are said to be independent if for any subcollection $A_{i_1}, ..., A_{i_m}$,

${\mathbb{P}}(\bigcap_{j = 1}^m A_{i_j}) = {\mathbb{P}}(A_{i_1}) \times ... \times {\mathbb{P}}(A_{i_m})$.

We emphasize that the definition depends on the probability measure ${\mathbb{P}}$. This definition will be modified later to suit more general purpose.

To understand the subtlety of conditional probability, do the following

3.2 Exercise. A couple has two children, and each child has equal probability of being male or female. Supposedly different births are independent of each other, so each combination has probability $1/4 = 1/2 \times 1/2$. Given that one of the children is a boy, what is the conditional probability that the other is also a boy? The answer is not 1/2!!!!!

3.3 Exercise. For $n \geq 3$, give an example of $n$ dependent events such that every $n - 1$ events of them are independent. (Hint: If you have no idea, try $n = 3$ first.)

A good conceptual viewpoint towards conditional probability is the multiplication law

${\mathbb{P}}(A \cap B) = {\mathbb{P}}(B) {\mathbb{P}}(A | B)$.

Here is the meaning: You ask for the likelihood of the joint event A and B. Logically, if both A and B happen, B must happen. Hence we first ask for the likelihood that B happens. And then we ask for the likelihood of A in the occurence of B. We multiply them and this gives us the likelihood of the joint event. We may reason with A first, and this gives ${\mathbb{P}}(A \cap B) = {\mathbb{P}}(A) {\mathbb{P}}(B | A)$, and both approaches are consistent. For more about probabilistic logic, see Cox’s theorem and [J].

*******************************

No treatment of conditional probability is complete without a discussion of Bayes’ theorem. Roughly speaking, it tells us how to update our probability in light of data. Although mathematically simple, it is the crux of probabilistic logic.

3.4 Bayes’ theorem. Let $A, I$ and $D$ be events. Then

${\mathbb{P}}(H | I, D) = \frac{{\mathbb{P}}(D | H, I)}{{\mathbb{P}}(D | I)} {\mathbb{P}}(H | I)$.

Whenever the conditional probabilities are defined. (Here comma (,) means intersection.)

Proof: The proof is just a matter of definition:

${\mathbb{P}}(H | I, D) = \frac{{\mathbb{P}}(H, I, D)}{{\mathbb{P}}(I, D)} = \frac{{\mathbb{P}}(H, I, D)}{{\mathbb{P}}(H, I)} \frac{{\mathbb{P}}(I)}{{\mathbb{P}}(I, D)} \frac{{\mathbb{P}}(H, I)}{{\mathbb{P}}(I)} = \frac{{\mathbb{P}}(D | H, I)}{{\mathbb{P}}(D | I)} {\mathbb{P}}(H | I)$. $\blacksquare$

The theorem is interpreted as follows. You are trying to infer the likelihood of a hypothesis $H$. Here, $I$ denotes your background knowledge. Then ${\mathbb{P}}(H | I)$ represents your view towards $H$ before you see the data $D$ (this is called the prior probability). After you see the data $D$, you update your probability to ${\mathbb{P}}(H | I, D)$, the posterior probability. How? Your compare the probabilities ${\mathbb{P}}(D | H, I)$ and ${\mathbb{P}}(D | I)$. If the data is more plausible assuming $H$ is true, your posterior probability should be higher than your prior probability. If this seems confusing, study the next examples.

*******************************

3.5 Example (Monte Hall problem). (a) In a game show, there are three boxes:

[B1]            [B2]             [B3]

and behind one of the doors there lies a prize. You are asked to choose a door. Then you choose one, say B1. (*) At least one of the remaining boxes is empty. The gamemaster opens one of the empty boxes at random. Suppose B3 is opened:

[B1]           [B2]             (B3)

Now, she asks: Do you want to change your choice to B2?

(b) At (*), the gamemaster opens one of the three boxes at random, and it turns out that (1) your box is not opened and (2) the box opened is empty. The gamemaster then asks the same question: Do you want to change your choice to B2?

Solution: (a) Let $H_i, i = 1, 2, 3$ be the event that box $i$ contains the prize. Without additional information (we simply take $I = \Omega$ be the sure event), our prior probabilities are

${\mathbb{P}}(H_i) = \frac{1}{3}, i = 1, 2, 3.$

Our data is that D = “the gamemaster opens B3”. Since we have chosen B1, we would like to compare ${\mathbb{P}}(H_2 | D)$ and ${\mathbb{P}}(H_1 | D)$. By Bayes’ theorem,

${\mathbb{P}}(H_i | D) = \frac{{\mathbb{P}}(D | H_i)}{{\mathbb{P}}(D)} {\mathbb{P}}(H_i)$.

Note that only the term ${\mathbb{P}}(D | H_i)$ depends on $i$.

If the prize is in B1, the gamemaster opens B2 or B3, with equal probability. Hence

${\mathbb{P}}(D | H_1) = \frac{1}{2}$.

If the prize is in B2, the gamemaster must open B3 (since she cannot open B1, for it is chosen). Hence

${\mathbb{P}}(D | H_2) = 1$.

Finally, if the prize is in B3, the gamemaster will not open B3. So ${\mathbb{P}}(D | H_3) = 0$. Hence, after simple normalization, we see that the posterior distribution is given by

${\mathbb{P}}(H_1 | D) = \frac{1}{3}, {\mathbb{P}}(H_2 | D) = \frac{2}{3}$.

and so we should switch to B2!

(b) Exercise! (Hint: What is the difference between (a) and (b)? The action of the gamemaster is now independent of the position of the prize!) $\blacksquare$

3.6 Example (statistical inference). Suppose there are 10 urns, and each contains 10 balls. The $i$-th urn contains $i$ red balls and $10 - i$ blue balls. Consider the following experiment. First, one of the urns is chosen at random, and $n$ balls are drawn from the chosen urn with replacement. Suppose that $k$ of the balls are red, can you guess which of the urns is chosen?

Solution. Let $H_i, i = 1, ..., 10$ be the event that urn $i$ is chosen. Again we take $I = \Omega$ be the sure event. Our prior probabilities are

${\mathbb{P}}(H_i) = 0.1, i = 1, 2, ..., 10$.

The data is D = “$k$ of the $n$ balls are red”. Conditioned on $H_i$, the number of red balls has Binomial($n$, $i/10$) distribution, i.e.

${\mathbb{P}}(D | H_i) = C_k^n \left( \frac{i}{10} \right)^k \left(1 - \frac{i}{10} \right) ^{n - k}$.

By Bayes’ theorem, ${\mathbb{P}}(H_i | D) = \frac{{\mathbb{P}}(D | H_i)}{{\mathbb{P}}(D)} {\mathbb{P}}(H_i)$. Note that only ${\mathbb{P}}(D | H_i)$ depends on $i$, and the remainding part is a constant that we may ignore. The collection $\{ {\mathbb{P}}(H_i | D) \}_{i = 1}^{10}$ gives the posterior distribution of the urn chosen.

For illustration, suppose that urn 3 is chosen (but unknown to the statistician). We sample 100 balls from the urn and get

RBBBBRBBRRBBRBBBRBBB
BBBBRBBBBBRBRBBRBBBR
BBBBBBBBBBRBRBBRBBBR
BBRBBBBBBBBBBRBRRBBB
RRBBBBBBBBBRBBBRBBRR

Following the above procedure, we compute the posteior distributions after 0, 5, 20, 50 and 100 draws:

Comments for ($n$, $k$):

(0, 0): Before any draws, the distribution is uniform, i.e. all urns are equally likely.

(5, 1): There is only 1 red ball in 5 draws. At this point, the most probable hypothesis is $H_2$, but due to the small sample size the distribution gives substantial mass to other possibilities. However, note that $H_{10}$ is automatically rejected (with probability 0), since there is at least one red ball.

(20, 6), (50, 11), (100, 25): As more and more draws are revealed, you see that the distribution concentrates more and more around 2 and 3. After 100 draws, there are 25 red balls and so the posterior distribution cannot distinguish between $H_2$ and $H_3$ – but we are very sure that the other hypotheses are false. As $n \rightarrow \infty$, the posterior distribution will converge to the unit mass at 3 (can you formulate and prove this?). I think this is very impressive, although the result is expected. $\blacksquare$

3.7 Exercise. Verify that my figures are correct.

3.8 Exercise. Rigorously, construct probability spaces that corresponds to the situations in Examples 3.5 and 3.6.

**************END**************

The measure-theoretic treatment of conditional expectation (and probability) will begin in Martingale Theory II: Conditional expectation and basic properties of martingale.

This entry was posted in Probability. Bookmark the permalink.

### One Response to Martingale Theory I: Background

1. Anonymous says:

A solution for the asymmetric case?