Last revised: 28/10
I am going to write about martingale theory. After discussing some background knowledge, our first aim are the optional stopping and convergence theorems in discrete time. After that, we shall deal with refinements in continuous time and some applications. Your comments and corrections are highly appreciated.
Martingale originally means a gambling strategy: Suppose you double your bet whenever you lose. Then when you finally win, you win all the previous losses back. As a mathematical object, it was used quite a bit by Levy and others in 1930s, but it was Doob who proved the main theorems and illustrated its amazing uses. Nowadays, martingale is a major tool in both theoretical and applied probability. Regarding the history of martingale theory, I would like to quote Leo Breiman:
Probability has two hands: On the right is the
rigorous; the left hand thinks probabilistically,
reduces problems to gambling situations, coin-
tossing, motions of a physical particle.
This also serves as the guiding principle of our exposition.
(1) Classical example: Gambler’s ruin (symmetric case)
We follow tradition and start by describing a famous gambling problem in intuitive terms:
Suppose you enter a casino with dollars. Each time you bet $1 against the toss of a fair coin (where the tosses are independent of each other). You quit if either you lose all money or have dollars in your pocket. What is the probability that you lose your shirt? (or, in other words, to lose dollars before winning dollors?)
Let , denote the outcome of the -th bet. Then and your position after the -th bet is
We set . Let be the first time that or . The required probability is thus .
Suppose after the first tosses. Then depends only on the -th toss. Now, is either or , and both has probability 1/2. The conditional expectation is thus . We record this fact by writing
This says that is a martingale. Among other things, this implies that
for all . This means that given the past outcomes, the next bet is always “fair”.
To emphasize the gambling flavor, we deliberately use a “loose” language in the above discussion, using words like “coin tosses”, “outcome”, “first time”, “suppose … after …” etc. The concept of martingale is a substantial generalization of the idea of “fair game“, but the definition requires a certain amount of measure-theoretic machinery. Before embarking on a rigorous development, let us quickly solve the ruin problem and give some remarks. We write the required probability as . Here the subscript means that the initial position is .
1.1 Solution 1: Recall that . We simply put (see Remark 1.3). Then . Now is either or , with probabilities and respectively. It follows that
1.2 Solution 2: Consider the situation after the first toss. With probability 1/2, , and the (conditional) probability is . Otherwise, and the (conditional) probability is . Hence for , total probability law (see Remark 1.4) gives
This gives a difference equation in . Together with the boundary conditions and , we get as above.
Can you formulate and solve the asymmetric case?
1. 3 Remark. In Solution 1 we used the following reasoning: The gamble is fair at any time . Hence it is still fair at the first time that hits or . But watch out! Here is random! Can we substitute into the formula , where is fixed? This is true only for a certain class of random times called stopping times. We will discuss this again when we study the optional sampling theorem.
1.4 Remark. Actually, in Solution 2 we did not use the martingale property of . What we used is the Markov property of . The difference equation can be given a potential-theoretic meaning. If we let
to be the discrete Laplace operator associated to , then solves the Dirichlet problem
We will discuss probabilistic potential theory only later.
(2) Basic ideas of stochastic processes
Our development will rest on the modern measure-theoretic foundation of probability. Unfortunately, the material is somewhat top-heavy: there are a lot of measure-theoretic notions to handle if we want to be really rigorous. As you have just seen, probability is full of intuitive terms, and it is sometimes hard for the novice to separate “probabilistic thinking” from “rigor”. Actually, many probabilists and statisticians in the early-mid 20th century resisted the invasion of measure theory (see [D2] for an interesting historical survey). Several decades have passed, and now measure theory is widely accepted as the(?) proper foundation of probability. But the drawback is that things become more and more technical. Here is a relevant quote from Meyer:
One needs [for stochastic integration] a six month
course [to cover only] the definitions. What is there to do?
This is a dilemma of modern math education.
First, we want a mathematical model of chance processes (over time). There are many examples: our position in a casino, the ups and downs of stock prices, the motion of a particle in a cup of still water, the inevitable noise over the communication channel, the population of a species, the trajectory of a rocket, etc. We shall model these phenomena mathematically in terms of stochastic processes.
Let be a probability space.
2.1 Definition. A discrete-time (real-valued) stochastic process is a collection of random variables defined on .
Recall that a random variable is nothing but a real-valued measurable function on . Although not apparent in the discrete time setting, a better point of view is to treat as a function on the product space . That is, , where for each , the function is a random variable. (Note: It follows that if is given the discrete sigma-field, then is automatically measurable with respect to the product sigma-field on . This technicality is quite important in continuous time.)
For each , the real sequence (in ordinary sense) is called the sample path or trajectory of . The sample path can be regarded as an element of the function space endowed with the product sigma-field. Hence, a third formulation of a stochastic process is simply a measurable function from to .
Here is a list of processes commonly encountered in probability books.
(1) Let be a sequence of independent and identically distributed (i.i.d.) random variables, Then is a stochastic process, called the independent stationary process.
(2) If each in (1) is distributed as Bernoulli(), then is a mathematical model of an infinite sequence of (possibly unfair) coin tosses.
(3) If each in (1) is distributed as (normal distribution), then is a mathematical model of (discrete-time) white noise, useful in signal processing and time series analysis.
The independent stationary process by itself is not so interesting as the ‘s at different times are independent. Very often, is used as a building block to construct other processes:
(4) Fix and define ,
The process is called a random walk. (See our first example.)
(5) In (1), suppose (and hence for all ) has finite expectation (i.e. ). Let
be the average of the first “observations”. The strong law of large number (look at the picture) states that converges to almost surely as .
(6) Let . is called the maximal process associated to .
(7) In (1), suppose that each is uniformly distributed on . Let be a countable set, and let be such that for each , is measurable. Let be fixed and define
Then is a Markov chain on , which can be viewed as a randomized dynamical system (if is constant for each , then ). This representation of Markov chain also provides a convenient method for simulation.
Some remarks on probability distributions (You may skip this)
We used several distributions in the above example. In general, the distribution of a random variable is the Borel measure on . (Here, is a shorthand of the set .) In probability, the Bernoulli, binomial, Poisson and normal (etc) distributions are just Borel measures on . For example, the standard Normal distribution is the Borel measure
We say that a certain random variable has a certain distribution if is that measure. For example, we say that has standard normal distribution (denoted as if for all Borel sets .
More generally, if is a measurable space and is measurable, we say that is a random element in . The distribution of is the measure on . Applying this to the case where is a function space, we may define the distribution of a stochastic process. For example, the Wiener measure corresponding to Brownian motion (to be introduced later) is a measure on , the space of continuous functions on .
The interesting thing of a stochastic process is the dependence structure of the random variables . For example, the values may tell us something about . Naturally, we will use conditional probability to describe the structure.
In the general framework, we will use sigma-fields to represent information. Here is a trivial (but important) example.
2.3 Example. Take , the unit interval with Borel sigma-field. For each , let be the sigma algebra generated by dyadic intervals of the form . Now let be a fixed but unknown point. For each , we may ask whether . Suppose we know the answer for all , i.e. we know the values . (Here is the indicator function of .) This is equivalent to knowing the unique dyadic interval of level that contains . As increases, the position of is known more and more precisely. In this sense, we may say that contains more information than if .
The dynamic nature of information in stochastic processes is illustrated by the following gambling situation.
2.4 Example. Suppose we are betting against the sum of several dices (e.g. “Big-small”). Let be the profit/loss of the -th gamble per unit bet, and let be your bet on the -th draw. Having observed a few previous outcomes, you may choose not to bet at time by choosing . Suppose initially you have dollars. Then your position after the -th draw is
For example, suppose your strategy is to bet 1 unit whenever the previous 3 draws are “big”. Then we may set .
Intuitively, your choice should depend only on the previous outcomes . It is then reasonable that
for some (measurable) function . A fancy way of expressing this is to say that is measurable with respect to the sigma-field generated by , i.e. (see the next definition). As time passes (i.e. increases), also increases, and we are making decisions based on more and more information.
2.5 Definition. Let be a family of measurable spaces, and let be measurable. The sigma-algebra generated by , denoted by , is the smallest sigma-algebra containing sets of the form , where and .
We formulate a general version of Example 2.4. Roughly speaking, it says that if is measurable with respect to a certain sigma-field, then is a function of the random element that generates the sigma-field.
2. 6 Doob-Dynkin Lemma. Let be a measurable space and let be measurable. Then is -measruable if and only if there exists a measurable such that .
Proof: We consider first the case that takes countably many distinct values . For each , is -measurable, hence for some . Now, although the sets are disjoint, may not be (why?). But we can remedy this by letting . We let then , where is distinct from all and is the complement of . (Why need this extra fuss?) It follows that .
For the general case, assume that is measurable. Recall that every measurable function can be written as a limit of simple functions. Explicitly,
converges to (uniformly). For each , we may construct such that . Now, it is attempting to let , but this is not legitimate (WHY?). Instead, let . We then set on and elsewhere. Then everywhere.
2.7 Exercise. Formulate Example 2.4 above by Doob-Dynkin lemma.
2.8 Exercise. Let be a sub-sigma-field of and let a random variable be -measurable. On , consider the following equivalence relation:
for all .
Then is constant on each equivalence class.
(3) Conditional probability
There are many reasons that probability theory is not merely a branch of measure theory. One reason is the following. Measure theory focuses on sigma-additivity, but in probability there is also a multiplication law that deals with conditional probabilities. (Technically it is related to product measure, but (question) what is its role in real analysis, except in iterated intergrals?)
The mathematics in this section is intended only to sharpen intuition and will not be used subsequently. I also want to take this chance to introduce a few ideas from statistics. As above we work on a given probability space .
3.1 Definition. Let with . The conditional probability of given is
Any events and ( or not) are said to be independent if . More generally, events are said to be independent if for any subcollection ,
We emphasize that the definition depends on the probability measure . This definition will be modified later to suit more general purpose.
To understand the subtlety of conditional probability, do the following
3.2 Exercise. A couple has two children, and each child has equal probability of being male or female. Supposedly different births are independent of each other, so each combination has probability . Given that one of the children is a boy, what is the conditional probability that the other is also a boy? The answer is not 1/2!!!!!
3.3 Exercise. For , give an example of dependent events such that every events of them are independent. (Hint: If you have no idea, try first.)
A good conceptual viewpoint towards conditional probability is the multiplication law
Here is the meaning: You ask for the likelihood of the joint event A and B. Logically, if both A and B happen, B must happen. Hence we first ask for the likelihood that B happens. And then we ask for the likelihood of A in the occurence of B. We multiply them and this gives us the likelihood of the joint event. We may reason with A first, and this gives , and both approaches are consistent. For more about probabilistic logic, see Cox’s theorem and [J].
No treatment of conditional probability is complete without a discussion of Bayes’ theorem. Roughly speaking, it tells us how to update our probability in light of data. Although mathematically simple, it is the crux of probabilistic logic.
3.4 Bayes’ theorem. Let and be events. Then
Whenever the conditional probabilities are defined. (Here comma (,) means intersection.)
Proof: The proof is just a matter of definition:
The theorem is interpreted as follows. You are trying to infer the likelihood of a hypothesis . Here, denotes your background knowledge. Then represents your view towards before you see the data (this is called the prior probability). After you see the data , you update your probability to , the posterior probability. How? Your compare the probabilities and . If the data is more plausible assuming is true, your posterior probability should be higher than your prior probability. If this seems confusing, study the next examples.
3.5 Example (Monte Hall problem). (a) In a game show, there are three boxes:
[B1] [B2] [B3]
and behind one of the doors there lies a prize. You are asked to choose a door. Then you choose one, say B1. (*) At least one of the remaining boxes is empty. The gamemaster opens one of the empty boxes at random. Suppose B3 is opened:
[B1] [B2] (B3)
Now, she asks: Do you want to change your choice to B2?
(b) At (*), the gamemaster opens one of the three boxes at random, and it turns out that (1) your box is not opened and (2) the box opened is empty. The gamemaster then asks the same question: Do you want to change your choice to B2?
Solution: (a) Let be the event that box contains the prize. Without additional information (we simply take be the sure event), our prior probabilities are
Our data is that D = “the gamemaster opens B3”. Since we have chosen B1, we would like to compare and . By Bayes’ theorem,
Note that only the term depends on .
If the prize is in B1, the gamemaster opens B2 or B3, with equal probability. Hence
If the prize is in B2, the gamemaster must open B3 (since she cannot open B1, for it is chosen). Hence
Finally, if the prize is in B3, the gamemaster will not open B3. So . Hence, after simple normalization, we see that the posterior distribution is given by
and so we should switch to B2!
(b) Exercise! (Hint: What is the difference between (a) and (b)? The action of the gamemaster is now independent of the position of the prize!)
3.6 Example (statistical inference). Suppose there are 10 urns, and each contains 10 balls. The -th urn contains red balls and blue balls. Consider the following experiment. First, one of the urns is chosen at random, and balls are drawn from the chosen urn with replacement. Suppose that of the balls are red, can you guess which of the urns is chosen?
Solution. Let be the event that urn is chosen. Again we take be the sure event. Our prior probabilities are
The data is D = “ of the balls are red”. Conditioned on , the number of red balls has Binomial(, ) distribution, i.e.
By Bayes’ theorem, . Note that only depends on , and the remainding part is a constant that we may ignore. The collection gives the posterior distribution of the urn chosen.
For illustration, suppose that urn 3 is chosen (but unknown to the statistician). We sample 100 balls from the urn and get
Following the above procedure, we compute the posteior distributions after 0, 5, 20, 50 and 100 draws:
Comments for (, ):
(0, 0): Before any draws, the distribution is uniform, i.e. all urns are equally likely.
(5, 1): There is only 1 red ball in 5 draws. At this point, the most probable hypothesis is , but due to the small sample size the distribution gives substantial mass to other possibilities. However, note that is automatically rejected (with probability 0), since there is at least one red ball.
(20, 6), (50, 11), (100, 25): As more and more draws are revealed, you see that the distribution concentrates more and more around 2 and 3. After 100 draws, there are 25 red balls and so the posterior distribution cannot distinguish between and – but we are very sure that the other hypotheses are false. As , the posterior distribution will converge to the unit mass at 3 (can you formulate and prove this?). I think this is very impressive, although the result is expected.
3.7 Exercise. Verify that my figures are correct.
3.8 Exercise. Rigorously, construct probability spaces that corresponds to the situations in Examples 3.5 and 3.6.
The measure-theoretic treatment of conditional expectation (and probability) will begin in Martingale Theory II: Conditional expectation and basic properties of martingale.