
Martingale theory lecture notes

Márton Balázs
Some proofs have been transferred over from Probability 3 revision notes by Aaron Smith.

School of Mathematics, University of Bristol

These notes summarise the lectures and exercise classes of the unit Martingale Theory with Applications, taught since Autumn 2021 at the University of Bristol.


1 A quick summary of some parts of measure theory

…from the probabilist's point of view. This section mostly follows Shiryaev [1]. The aim is to build up a mathematical model of random experiments and of measurements thereof (random variables, that is). Probabilities of these random outcomes are also to be constructed.

The power set \(\mathcal P(\Omega )\) of a set \(\Omega \) is the set of all subsets of \(\Omega \), and \(A^\text c:\, =\Omega -A\) denotes the complement of a set \(A\subseteq \Omega \).

Definition 1.1

A family \(\mathcal A\subseteq \mathcal P(\Omega )\) of sets is called an algebra, if

  • \(\Omega \in \mathcal A\),

  • for every \(A,\, B\in \mathcal A\), \(A\cup B\in \mathcal A\),

  • for every \(A\in \mathcal A\), \(A^\text c\in \mathcal A\).

This simple construction allows us to build a prototype of probabilities as follows.
Definition 1.2

Let \(\mathcal A\) be an algebra. A set function \(\mu \, :\, \mathcal A\to [0,\, \infty ]\) is a finitely additive measure on \(\mathcal A\), if for all disjoint \(A,\, B\in \mathcal A\),

\[ \mu (A\cup B)=\mu (A)+\mu (B). \]

As it turns out, the objects we defined so far are too general for our purposes, hence a refinement comes next.

Definition 1.3

A family \(\mathcal F\subseteq \mathcal P(\Omega )\) of sets is called a \(\sigma \)-algebra, if

  • \(\Omega \in \mathcal F\),

  • for every countable collection of sets \(A_1,\, A_2,\, \dots \in \mathcal F\), \(\bigcup _nA_n\in \mathcal F\),

  • for every \(A\in \mathcal F\), \(A^\text c\in \mathcal F\).

In this case the pair \((\Omega ,\, \mathcal F)\) is called a measurable space. Any set \(A\in \mathcal F\) is said to be \(\mathcal F\)-measurable, or just measurable.

The novelty here is the requirement that \(\mathcal F\) be closed under countable unions, as opposed to finite unions only for an algebra.
Example 1.4

Here are examples of \(\sigma \)-algebras (check!).

  • \(\mathcal F_*=\{ \emptyset ,\, \Omega \} \) is called the trivial \(\sigma \)-algebra.

  • If \(\Omega \) is countable, very often \(\mathcal F^*=\mathcal P(\Omega )\) is considered, which is a \(\sigma \)-algebra in this case.

  • For a set \(A\subset \Omega \), \(\mathcal F_A=\{ \emptyset ,\, A,\, A^\text c,\, \Omega \} \) is the \(\sigma \)-algebra generated by \(A\).
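For readers who like to experiment, the closure properties of \(\mathcal F_A\) can be checked mechanically on a small finite \(\Omega \); the Python sketch below (not part of the notes' formal development, with an arbitrary choice of \(\Omega \) and \(A\)) verifies the three axioms of Definition 1.3.

```python
# Verify that F_A = {emptyset, A, A^c, Omega} is a sigma-algebra on a small
# finite Omega. On a finite family, countable unions reduce to finite ones.
Omega = frozenset(range(6))
A = frozenset({0, 1, 2})

F_A = {frozenset(), A, Omega - A, Omega}

assert Omega in F_A                                     # contains Omega
assert all(Omega - S in F_A for S in F_A)               # closed under complement
assert all(S | T in F_A for S in F_A for T in F_A)      # closed under unions
```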

Definition 1.5

A measure \(\mu \) on an algebra \(\mathcal A\) is a set function \(\mu \colon \mathcal A\to [0,\, \infty ]\) such that for any mutually disjoint sets \(A_1,\, A_2,\, \dots \in \mathcal A\) with \(\bigcup _nA_n\in \mathcal A\),

\[ \mu \Bigl(\bigcup _nA_n\Bigr)=\sum _n\mu (A_n)\qquad (1.1) \]

holds. If \(\mu (\Omega )=1\), then we call \(\mu \) a probability measure, and often use \({\mathbb P}\) instead. In this case the triplet \((\Omega ,\, \mathcal F,\, {\mathbb P})\) is called a probability space.

Notice that a \(\sigma \)-algebra is automatically an algebra, thus this definition applies for measures on \(\sigma \)-algebras. Property (1.1) is referred to as \(\sigma \)-additivity.

When modeling a random experiment, the set \(\Omega \) is called the sample space of the experiment; it contains all elementary outcomes. Its measurable subsets \(A\in \mathcal F\) are called events. These are exactly those sets of outcomes which have a probability. The empty set \(\emptyset \) is always an event, called the null event. The above definitions imply that \({\mathbb P}(\emptyset )=0\).

Here is a rather useful characterisation of probability measures.

Theorem 1.6

Let \({\mathbb P}\) be a finitely additive measure on an algebra \(\mathcal A\), and assume \({\mathbb P}(\Omega )=1\). Then the following are equivalent.

  1. \({\mathbb P}\) is a probability measure.

  2. If \((A_n)_{n\ge 1}\) is an increasing sequence of sets in \(\mathcal A\) (that is, \(A_n\subseteq A_{n+1}\) for \(n\ge 1\)) and \(\bigcup _{n=1}^\infty A_n\in \mathcal A\), then the limit below exists, and

    \[ \lim _{n\to \infty }{\mathbb P}(A_n)={\mathbb P}\Bigl(\bigcup _{n=1}^\infty A_n\Bigr). \]
  3. If \((A_n)_{n\ge 1}\) is a decreasing sequence of sets in \(\mathcal A\) (that is, \(A_n\supseteq A_{n+1}\) for \(n\ge 1\)) and \(\bigcap _{n=1}^\infty A_n\in \mathcal A\), then the limit below exists, and

    \[ \lim _{n\to \infty }{\mathbb P}(A_n)={\mathbb P}\Bigl(\bigcap _{n=1}^\infty A_n\Bigr). \]
  4. If \((A_n)_{n\ge 1}\) is a decreasing sequence of sets in \(\mathcal A\) and \(\bigcap _{n=1}^\infty A_n=\emptyset \), then the limit below exists, and

    \[ \lim _{n\to \infty }{\mathbb P}(A_n)=0. \]
Notice that the union of increasing sets, and the intersection of decreasing sets, are sometimes called the limit of the sets.

Proof.

We first show that 1 implies 2. Let \(B_1 = A_1\), \(B_n = A_n \backslash A_{n-1}\) for all \(n\geq 2\). Then \(\{ B_n\} _{n=1}^\infty \) satisfies \(\bigcup _{n=1}^\infty B_n = \bigcup _{n=1}^\infty A_n\) and \(B_i\cap B_j = \emptyset \) for all \(i\neq j\). So \(\{ B_n\} _{n=1}^\infty \) forms a disjoint partition of \(\bigcup _{n=1}^\infty A_n\). By the \(\sigma \)-additivity assumption in 1 we have that

\[ {\mathbb P}(\bigcup _{n=1}^\infty A_n)={\mathbb P}(\bigcup _{n=1}^\infty B_n) = \sum _{n=1}^\infty {\mathbb P}(B_n) = \lim _{n\to \infty } \sum _{k=1}^n {\mathbb P}(B_k) = \lim _{n\to \infty } {\mathbb P}(\bigcup _{k=1}^n B_k) = \lim _{n\to \infty } {\mathbb P}(A_n). \]

Now we show 2 implies 3. Since \(A_n\) is a decreasing sequence, \(\Omega \backslash A_n\) is an increasing sequence. Moreover,

\[ {\mathbb P}(\bigcap _{n=1}^\infty A_n) = 1 - {\mathbb P}( \bigcup _{n=1}^\infty \Omega \backslash A_n ) = 1 - \lim _{n\to \infty } {\mathbb P}( \Omega \backslash A_n) = \lim _{n\to \infty } {\mathbb P}(A_n). \]

3 implies 4 trivially, as it is simply a special case. It remains to show that 4 implies 1, i.e., that if \(\lim _{n\to \infty } {\mathbb P}(A_n) = 0\) holds for every decreasing sequence \(A_n\) with \(\bigcap _{n=1}^\infty A_n = \emptyset \), then \({\mathbb P}\) has the \(\sigma \)-additivity property.

Take any disjoint family of sets \(\{ A_k\} _{k=1}^\infty \). Then by finite additivity,

\[ \sum _{k=1}^\infty {\mathbb P}(A_k) = \lim _{n\to \infty } \sum _{k=1}^n {\mathbb P}(A_k) = \lim _{n\to \infty } {\mathbb P}(\bigcup _{k=1}^n A_k ) = \lim _{n\to \infty } [{\mathbb P}(\bigcup _{k=1}^\infty A_k ) - {\mathbb P}(\bigcup _{k=n+1}^\infty A_k )]. \]

Now \(\bigcup _{k=n+1}^\infty A_k\) is a decreasing sequence in \(n\) with \(\bigcap _{n=1}^\infty \bigcup _{k=n+1}^\infty A_k = \emptyset \). Indeed, if \(\omega \in \bigcup _{k=2}^\infty A_k\), then \(\omega \in A_N\) for a unique \(N\) (by disjointness of the family), hence \(\omega \notin \bigcup _{k=n+1}^\infty A_k\) once \(n\ge N\), so \(\omega \) is not in the intersection of all tail unions. By 4, \({\mathbb P}\bigl(\bigcup _{k=n+1}^\infty A_k \bigr) \to 0\), and \(\sigma \)-additivity follows.

Remark 1.7

If \(A_n\) is an increasing sequence then \({\mathbb P}(A_n)\) is an increasing sequence. Indeed, \(A_n\subseteq A_{n+1}\) implies that \(A_{n+1} = (A_{n+1}\backslash A_n) \cup A_n\), a disjoint union. By the additivity of \({\mathbb P}\) we have that \({\mathbb P}(A_{n+1}) = {\mathbb P}(A_{n+1}\backslash A_n) + {\mathbb P}(A_n) \geq {\mathbb P}(A_n)\). Moreover, by Theorem 1.6, \({\mathbb P}(A_n)\nearrow {\mathbb P}(\lim _{n\to \infty } A_n)\).

This applies similarly to decreasing sequences. If \(A_n\) is decreasing then \({\mathbb P}(A_n)\) is a decreasing sequence and \({\mathbb P}(A_n)\searrow {\mathbb P}(\lim _{n\to \infty } A_n)\).
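The monotone continuity in Theorem 1.6 can be illustrated numerically (a sketch only, using interval length as the measure, an assumption matching Lebesgue measure on \((0,\, 1]\)): the sets \(A_n=\bigl(0,\, 1-\frac1n\bigr]\) increase to \((0,\, 1)\), and their measures increase to \(1\).

```python
# Continuity along increasing sets: A_n = (0, 1 - 1/n] has Lebesgue measure
# 1 - 1/n, which increases to the measure of the limit set (0, 1), namely 1.
def lebesgue_interval(a, b):
    """Length of the interval (a, b], i.e. its Lebesgue measure."""
    return b - a

measures = [lebesgue_interval(0.0, 1.0 - 1.0 / n) for n in range(1, 10001)]
assert all(m1 <= m2 for m1, m2 in zip(measures, measures[1:]))  # increasing
assert abs(measures[-1] - 1.0) < 1e-3                           # approaches 1
```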

We now proceed with a short summary on how some commonly used \(\sigma \)-algebras are constructed. Here is the main tool for this:

Lemma 1.8

Let \(\mathcal E\subseteq \mathcal P(\Omega )\). Then

  • there is a smallest algebra \(\alpha (\mathcal E)\) that contains all sets from \(\mathcal E\);

  • there is a smallest \(\sigma \)-algebra \(\sigma (\mathcal E)\) that contains all sets from \(\mathcal E\).

Proof.

The intersection of algebras is an algebra, and the intersection of \(\sigma \)-algebras is a \(\sigma \)-algebra. To find the ones in the lemma, take the intersection of all algebras, respectively \(\sigma \)-algebras, that contain all sets in \(\mathcal E\).

The above \(\alpha (\mathcal E)\) and \(\sigma (\mathcal E)\) are said to be generated by \(\mathcal E\).

Let us now consider the collection of sets

\[ \mathcal A:\, =\bigl\{ \text{finite unions of disjoint intervals }(a,\, b]\, :\, -\infty \le a{\lt}b\le \infty \bigr\} ,\qquad (1.2) \]

where \((a,\, \infty ]\) is understood as the interval \((a,\, \infty )\). This is an algebra (check!), but not a \(\sigma \)-algebra: each of \(\bigl(0,\, 1-\frac1n\bigr]\) is in \(\mathcal A\), but the union of these sets for all \(n\) is \((0,\, 1)\), which is not in \(\mathcal A\). However, this algebra can be used to generate the following \(\sigma \)-algebra:

Definition 1.9

The Borel \(\sigma \)-algebra on \(\mathbb R\), denoted \(\mathcal B(\mathbb R)\), is the \(\sigma \)-algebra generated by (1.2). Sets in \(\mathcal B(\mathbb R)\) are said to be Borel sets.

This \(\sigma \)-algebra contains all subsets of \(\mathbb R\) that are "of practical interest". Indeed, it is not easy to come up with a non-Borel set in \(\mathbb R\). Those interested can look up the Vitali set for an example.

In a similar way, \(n\)-dimensional rectangles:

\[ \bigl\{ (a_1,\, b_1]\times (a_2,\, b_2]\times \dots \times (a_n,\, b_n]\, :\, a_1{\lt}b_1,\ a_2{\lt}b_2,\dots ,a_n{\lt}b_n\text{ in }\mathbb R\bigr\} , \]

rather than one-dimensional intervals, can be used to generate \(\mathcal B(\mathbb R^n)\), the Borel \(\sigma \)-algebra on \(\mathbb R^n\). This will contain "all \(n\)-dimensional sets of practical interest".

One can then proceed to \(\mathbb R^\infty \), the set of real-valued sequences, by considering the \(\sigma \)-algebra \(\mathcal B(\mathbb R^\infty )\) generated by rectangles of arbitrary finite dimension. Again, "practical sets", such as

\[ \bigl\{ (x_n)\, :\, \lim _{n\to \infty }x_n\text{ exists and is finite}\bigr\} ;\ \bigl\{ (x_n)\, :\, \sup _nx_n{\gt}5\bigr\} ;\ \bigl\{ (x_n)\, :\, \liminf _{n\to \infty }x_n{\gt}5\bigr\} \]

all belong to \(\mathcal B(\mathbb R^\infty )\).

One can even define \(\mathcal B(\mathbb R^T)\) with an uncountable set \(T\), for example \(\sigma \)-algebras on function spaces. This usually requires some restrictions on the family of functions considered.

The next theorem, which we state without proof, allows us to construct measures on generated \(\sigma \)-algebras.

Theorem 1.10 (Carathéodory)

Let \(\mathcal A\) be an algebra on \(\Omega \). If \(\mu _0\) is a \(\sigma \)-additive measure on \((\Omega ,\, \mathcal A)\), then there exists a unique extension of it to \(\bigl(\Omega ,\, \sigma (\mathcal A)\bigr)\) (the generated \(\sigma \)-algebra).

Definition 1.11

Let \(F\, :\, \mathbb R\to [0,\, 1]\) be a cumulative distribution function, and define the \(\sigma \)-additive measure \({\mathbb P}\) on (1.2) by \({\mathbb P}(a,\, b]:\, =F(b)-F(a)\). This extends to the Lebesgue-Stieltjes measure on \(\bigl(\mathbb R,\, \mathcal B(\mathbb R)\bigr)\).

When \(F\) is the Uniform(0, 1) distribution function, we obtain the Lebesgue measure on \(\mathcal B([0,\, 1])\) this way.
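A small Python sketch of Definition 1.11 (illustrative only; the cdfs `F_uniform` and `F_exp` are example choices, the latter being the Exponential(1) distribution function): the Lebesgue-Stieltjes measure assigns to each interval \((a,\, b]\) the increment of the cdf.

```python
import math

# Lebesgue-Stieltjes measure of an interval (a, b]: P((a, b]) = F(b) - F(a).
def stieltjes_measure(F, a, b):
    """Measure of (a, b] induced by the cdf F."""
    return F(b) - F(a)

F_uniform = lambda x: min(max(x, 0.0), 1.0)                # Uniform(0,1) cdf
F_exp = lambda x: 1.0 - math.exp(-x) if x > 0 else 0.0     # Exponential(1) cdf

# Under F_uniform this is just Lebesgue measure (length) on [0, 1].
assert abs(stieltjes_measure(F_uniform, 0.2, 0.5) - 0.3) < 1e-12
assert abs(stieltjes_measure(F_exp, 0.0, math.log(2)) - 0.5) < 1e-12
```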

We briefly mention that \({\mathbb P}\) can be extended to \(\mathbb R^n\) in a natural way, then Kolmogorov’s extension theorem can be used, under certain circumstances, to extend further to \(\mathbb R^\infty \) or even \(\mathbb R^T\).

The next task is to construct random variables on a probability space \((\Omega ,\, \mathcal F,\, {\mathbb P})\).

Definition 1.12

Let \((\Omega ,\, \mathcal F)\) be a measurable space. A function \(X\, :\, \Omega \to \mathbb R\) is called measurable, if for any \(B\in \mathcal B(\mathbb R)\), \(X^{-1}(B)\in \mathcal F\). A (real-valued) random variable on a probability space \((\Omega ,\, \mathcal F,\, {\mathbb P})\) is a measurable function \(X\, :\, \Omega \to \mathbb R\).

It should now be clear how probabilities associated with random variables work. The above definition says exactly that a random variable taking its value in a Borel set \(B\in \mathcal B(\mathbb R)\) is an event in our probability space (i.e., the set of outcomes in question belongs to \(\mathcal F\)). As an example, let us consider the distribution function \(F\) of a random variable. When first encountered, it is usually defined as \(F(x)={\mathbb P}\{ X\le x\} \). With the above construction, it should rather be written as

\[ F(x)={\mathbb P}\{ \omega \in \Omega \, :\, X(\omega )\le x\} ={\mathbb P}\bigl\{ X^{-1}(-\infty ,\, x]\bigr\} , \]

which of course has the same meaning; we just understand properly now what is behind the notation. The set \((-\infty ,\, x]\in \mathcal B(\mathbb R)\) is a Borel set, and \(X\) is a measurable function, which implies that \(X^{-1}(-\infty ,\, x]\in \mathcal F\). Hence this is an event and it makes sense to talk about its probability in the sample space \(\Omega \).

We remark that limits, sums, differences, products, ratios (where defined), and Borel-measurable functions of random variables are each random variables again; in other words, these operations do not ruin measurability.

Next we briefly summarise without proofs how to construct expectations of random variables. (In measure theory, this would be called integrals of measurable functions.) We sometimes omit mentioning the probability space \((\Omega ,\, \mathcal F,\, {\mathbb P})\).

Definition 1.13

A random variable \(X\) is called simple, if there exist \(n{\gt}0\), \(x_1,\, x_2,\, \dots ,\, x_n\in \mathbb R\), and \(A_1,\, A_2,\, \dots ,\, A_n\in \mathcal F\) with which

\[ X(\omega )=\sum _{k=1}^nx_k\cdot {\bf 1}_{A_k}(\omega ). \] Here \({\bf 1}_A\) stands for the indicator function:

\[ {\bf 1}_A(\omega )=\left\{ \begin{aligned} & 1,& & \text{if }\omega \in A,\\ & 0,& & \text{if }\omega \notin A. \end{aligned} \right. \]

The next theorem we borrow from measure theory without proof.

Theorem 1.14
  1. For any random variable \(X\) there exists a sequence \(X_1,\, X_2,\, \dots \) of simple random variables such that \(|X_n|\le |X|\) for all \(n\), and \(X_n(\omega )\to X(\omega )\) as \(n\to \infty \) for all \(\omega \in \Omega \).

  2. Moreover, if \(X(\omega )\ge 0\) for every \(\omega \in \Omega \), then \(X_n\) can be chosen to be non-decreasing in \(n\) for every fixed \(\omega \in \Omega \) (denoted \(X_n(\omega )\nearrow X(\omega )\) as \(n\to \infty \) for all \(\omega \in \Omega \)).

Definition 1.15 (Expectations)
  1. If \(X\) is simple with \(X=\sum _{k=1}^nx_k\cdot {\bf 1}_{A_k}\), then \(\operatorname{{\mathbb E}}X:\, =\sum _{k=1}^nx_k\cdot {\mathbb P}(A_k)\).

  2. If \(X\ge 0\) is a random variable, then \(\operatorname{{\mathbb E}}X:\, =\lim _{n\to \infty }\operatorname{{\mathbb E}}X_n\), where \(X_n\nearrow X\) are simple random variables. (Such a sequence exists by the above, and it is a theorem that this limit does not depend on the choice of the sequence.) Notice that \(\operatorname{{\mathbb E}}X=\infty \) is possible.

  3. If \(X\) is a random variable, \(\operatorname{{\mathbb E}}X:\, =\operatorname{{\mathbb E}}X^+-\operatorname{{\mathbb E}}X^-\), unless both expectations on the right-hand side are infinite, in which case \(\operatorname{{\mathbb E}}X\) is not defined.

Here the positive and negative parts are used:

\[ X^+:\, =\max (X,\, 0),\qquad X^-:\, =\max (-X,\, 0),\qquad X=X^+-X^-. \]

Notice that the options for \(\operatorname{{\mathbb E}}X\) are "not defined", \(=\infty \), \(=-\infty \), or \(\in \mathbb R\).
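The first clause of Definition 1.15 can be checked concretely; the Python sketch below (not part of the notes' formal development, with illustrative choices of \(\Omega \), \({\mathbb P}\) and the simple \(X\)) computes \(\operatorname{{\mathbb E}}X=\sum _kx_k\,{\mathbb P}(A_k)\) on a small finite space.

```python
from fractions import Fraction

# Expectation of a simple random variable, E X = sum_k x_k P(A_k).
# Omega = {1,...,12} with the uniform measure is an illustrative choice.
Omega = range(1, 13)
P = {w: Fraction(1, 12) for w in Omega}

# X = 1 * 1_{1..4} + 2 * 1_{5..8} + 3 * 1_{9..12}
blocks = [(1, {1, 2, 3, 4}), (2, {5, 6, 7, 8}), (3, {9, 10, 11, 12})]
EX = sum(x * sum(P[w] for w in A) for x, A in blocks)
assert EX == 2  # each block has probability 1/3, so E X = (1+2+3)/3
```

Exact rational arithmetic via `fractions.Fraction` avoids any floating-point blur in such finite computations.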

2 Conditional expectation and a toy example

We start this section by stating, without proof, Kolmogorov’s Theorem on conditional expectations from Williams [2]. Notation: \(\operatorname{{\mathbb E}}(\cdot \, ;\, G):\, =\operatorname{{\mathbb E}}(\cdot {\bf 1}_G)\).

Theorem 2.1

Let \(X\) be a random variable on the probability space \((\Omega ,\, \mathcal F,\, {\mathbb P})\), with \(\operatorname{{\mathbb E}}|X|{\lt}\infty \). Let \(\mathcal G\subseteq \mathcal F\) be a sub-\(\sigma \)-algebra. Then there exists a random variable \(V\) such that

  1. \(V\) is \(\mathcal G\)-measurable,

  2. \(\operatorname{{\mathbb E}}|V|{\lt}\infty \),

  3. \(\operatorname{{\mathbb E}}(V\, ;\, G)=\operatorname{{\mathbb E}}(X\, ;\, G)\) for any \(G\in \mathcal G\).

Such \(V\) is called a version of the conditional expectation \(\operatorname{{\mathbb E}}(X\, |\, \mathcal G)\). Indeed, two random variables \(V\) and \(V'\) with the above properties agree almost everywhere: \({\mathbb P}(V=V')=1\).

Our toy example will be the following. Let \(\Omega =\{ 1,\, 2,\, \dots ,\, 12\} \), \(\mathcal F=\mathcal P(\Omega )\), and \({\mathbb P}\) be the uniform measure on the finite set \(\Omega \). Elementary outcomes in \(\Omega \) will be denoted by \(\omega \). Define the random variables

\[ Y:\, =\Bigl\lceil \frac\omega 4\Bigr\rceil =\left\{ \begin{aligned} & 1,& & \text{ if }\omega =1,\, 2,\, 3,\, 4,\\ & 2,& & \text{ if }\omega =5,\, 6,\, 7,\, 8,\\ & 3,& & \text{ if }\omega =9,\, 10,\, 11,\, 12, \end{aligned} \right.\qquad X:\, =\Bigl\lceil \frac\omega 2\Bigr\rceil =\left\{ \begin{aligned} & 1,& & \text{ if }\omega =1,\, 2,\\ & 2,& & \text{ if }\omega =3,\, 4,\\ & 3,& & \text{ if }\omega =5,\, 6,\\ & 4,& & \text{ if }\omega =7,\, 8,\\ & 5,& & \text{ if }\omega =9,\, 10,\\ & 6,& & \text{ if }\omega =11,\, 12. \end{aligned} \right. \]

The \(\sigma \)-algebra generated by \(Y\) is

\[ \mathcal G:\, =\sigma (Y):\, =\sigma \bigl(Y^{-1}\bigl(\mathcal B(\mathbb R)\bigr)\bigr)=\sigma \bigl(\{ 1,\, 2,\, 3,\, 4\} ,\, \{ 5,\, 6,\, 7,\, 8\} ,\, \{ 9,\, 10,\, 11,\, 12\} \bigr). \]
Similarly, the \(\sigma \)-algebra generated by \(X\) is

\[ \mathcal H:\, =\sigma (X):\, =\sigma \bigl(X^{-1}\bigl(\mathcal B(\mathbb R)\bigr)\bigr)=\sigma \bigl(\{ 1,\, 2\} ,\, \{ 3,\, 4\} ,\, \{ 5,\, 6\} ,\, \{ 7,\, 8\} ,\, \{ 9,\, 10\} ,\, \{ 11,\, 12\} \bigr). \]

We see that \(\mathcal G\subset \mathcal H\subset \mathcal F\). The \(\sigma \)-algebra \(\mathcal G\) is coarser (contains less information), while \(\mathcal H\) is finer (more information). We also see that

  • \(Y\) is \(\mathcal G\)-measurable (by definition).

  • \(Y\) is \(\mathcal H\)-measurable (due to \(\mathcal G\subset \mathcal H\)).

  • \(X\) is \(\mathcal H\)-measurable (by definition).

  • \(X\) is not \(\mathcal G\)-measurable (e.g., \(X^{-1}\{ 1\} =\{ 1,\, 2\} \notin \mathcal G\)).

Next, we find the conditional expectation \(\operatorname{{\mathbb E}}(X\, |\, \mathcal G)\) based on the definition above. As \(\mathcal G=\sigma (Y)\), an equivalent notation for this is \(\operatorname{{\mathbb E}}(X\, |\, \mathcal G)=\operatorname{{\mathbb E}}(X\, |\, Y)\). Due to \(|\Omega |=12{\lt}\infty \), finite mean of \(V=\operatorname{{\mathbb E}}(X\, |\, \mathcal G)\) is not an issue. We look for a \(\mathcal G\)-measurable random variable \(V\) with \(\operatorname{{\mathbb E}}(V\, ;\, G)=\operatorname{{\mathbb E}}(X\, ;\, G)\) for any \(G\in \mathcal G\). An efficient choice for \(G\) is \(\{ 1,\, 2,\, 3,\, 4\} \). As \(V\) is \(\mathcal G\)-measurable, and \(\mathcal G\) has no set that distinguishes between these four outcomes, we find that \(V(\omega )\) is the same for \(\omega =1,\, 2,\, 3,\, 4\). The above expectations turn into

\[ \begin{aligned} V(1){\mathbb P}\{ 1\} +V(2){\mathbb P}\{ 2\} +V(3){\mathbb P}\{ 3\} +V(4){\mathbb P}\{ 4\} & =X(1){\mathbb P}\{ 1\} +X(2){\mathbb P}\{ 2\} +X(3){\mathbb P}\{ 3\} +X(4){\mathbb P}\{ 4\} \\ V(1){\mathbb P}\{ 1\} +V(1){\mathbb P}\{ 2\} +V(1){\mathbb P}\{ 3\} +V(1){\mathbb P}\{ 4\} & =X(1){\mathbb P}\{ 1\} +X(2){\mathbb P}\{ 2\} +X(3){\mathbb P}\{ 3\} +X(4){\mathbb P}\{ 4\} \\ V(1)=V(2)=V(3)=V(4)& =\frac{1\cdot \frac1{12}+1\cdot \frac1{12}+2\cdot \frac1{12}+2\cdot \frac1{12}}{\frac1{12}+\frac1{12}+\frac1{12}+\frac1{12}}=1.5. \end{aligned} \]

Similarly, with the respective choices \(G=\{ 5,\, 6,\, 7,\, 8\} \) and \(G=\{ 9,\, 10,\, 11,\, 12\} \),

\[ \begin{aligned} V(5)=V(6)=V(7)=V(8)& =\frac{3\cdot \frac1{12}+3\cdot \frac1{12}+4\cdot \frac1{12}+4\cdot \frac1{12}}{\frac1{12}+\frac1{12}+\frac1{12}+\frac1{12}}=3.5,\\ V(9)=V(10)=V(11)=V(12)& =\frac{5\cdot \frac1{12}+5\cdot \frac1{12}+6\cdot \frac1{12}+6\cdot \frac1{12}}{\frac1{12}+\frac1{12}+\frac1{12}+\frac1{12}}=5.5. \end{aligned} \]

Hence the conditional expectation is the random variable

\[ \operatorname{{\mathbb E}}(X\, |\, \mathcal G)(\omega )=V(\omega )=\left\{ \begin{aligned} & 1.5,& & \text{ if }\omega =1,\, 2,\, 3,\, 4,\\ & 3.5,& & \text{ if }\omega =5,\, 6,\, 7,\, 8,\\ & 5.5,& & \text{ if }\omega =9,\, 10,\, 11,\, 12, \end{aligned} \right. \]

being just the average of \(X\) over the smallest nontrivial respective units in \(\mathcal G\).
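The block-averaging recipe above is easy to mechanise; the following Python sketch (an informal check, not part of the notes) recomputes \(\operatorname{{\mathbb E}}(X\, |\, \mathcal G)\) for the toy example and recovers the values \(1.5\), \(3.5\), \(5.5\).

```python
from fractions import Fraction

# E(X | G) on the toy example: average X over each generating block of G.
P = {w: Fraction(1, 12) for w in range(1, 13)}
X = lambda w: (w + 1) // 2            # X = ceil(w/2)
G_blocks = [{1, 2, 3, 4}, {5, 6, 7, 8}, {9, 10, 11, 12}]

V = {}
for block in G_blocks:
    avg = sum(X(w) * P[w] for w in block) / sum(P[w] for w in block)
    for w in block:
        V[w] = avg                    # V is constant on each block

assert V[1] == Fraction(3, 2) and V[5] == Fraction(7, 2) and V[9] == Fraction(11, 2)
```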

In a similar way one can check

\[ \operatorname{{\mathbb E}}(Y\, |\, \mathcal G)(\omega )=\left\{ \begin{aligned} & 1,& & \text{ if }\omega =1,\, 2,\, 3,\, 4,\\ & 2,& & \text{ if }\omega =5,\, 6,\, 7,\, 8,\\ & 3,& & \text{ if }\omega =9,\, 10,\, 11,\, 12 \end{aligned} \right\} =Y(\omega ), \]

and indeed it is always the case that \(\operatorname{{\mathbb E}}(Y\, |\, Y)=Y\) almost everywhere (a.e.).

A further example is \(\operatorname{{\mathbb E}}\bigl(X\, |\, \{ \emptyset ,\, \Omega \} \bigr)\), where the random variable \(V\) we are looking for is measurable w.r.t. the trivial \(\sigma \)-algebra \(\{ \emptyset ,\, \Omega \} \), in other words is a constant. Picking \(G=\emptyset \) gives \(\operatorname{{\mathbb E}}(V\, ;\, \emptyset )=0=\operatorname{{\mathbb E}}(X\, ;\, \emptyset )\), which is not very informative. The choice \(G=\Omega \) on the other hand fixes the value of the constant \(V\):

\[ \begin{aligned} \operatorname{{\mathbb E}}(V\, ;\, \Omega )& =\operatorname{{\mathbb E}}(X\, ;\, \Omega )\\ \operatorname{{\mathbb E}}V& =\operatorname{{\mathbb E}}X\\ V& =\operatorname{{\mathbb E}}X, \end{aligned} \]

that is, \(\operatorname{{\mathbb E}}\bigl(X\, |\, \{ \emptyset ,\, \Omega \} \bigr)=\operatorname{{\mathbb E}}X\). This is again true a.e. in general: conditioning on the trivial \(\sigma \)-algebra always produces the full expectation.

If, on the other hand, one conditions on the full \(\sigma \)-algebra \(\mathcal F\) that has all information that can be available in the probability space \((\Omega ,\, \mathcal F,\, {\mathbb P})\), then every event \(G\in \mathcal F\) can be substituted, and the very detailed ones completely fix the conditional expectation. In our example we can e.g., take \(\{ 7\} \) to obtain

\[ \begin{aligned} \operatorname{{\mathbb E}}(V\, ;\, \{ 7\} )& =\operatorname{{\mathbb E}}(X\, ;\, \{ 7\} ),\\ V(7)\cdot {\mathbb P}\{ 7\} & =X(7)\cdot {\mathbb P}\{ 7\} ,\\ V(7)& =X(7)=4. \end{aligned} \]

Similarly, for any \(\omega \in \Omega \) one has \(V(\omega )=X(\omega )\), which leads us to \(\operatorname{{\mathbb E}}(X\, |\, \mathcal F)=V=X\). This is again a.e. true for general probability spaces: conditioning on the full information does not do any averaging and gives back the random variable instead.

Our final example is conditioning on the \(\sigma \)-algebra

\[ \mathcal I:\, =\sigma \bigl(\{ 1,\, 5,\, 9\} ,\, \{ 3,\, 7,\, 11\} ,\, \{ 2,\, 4,\, 6,\, 8,\, 10,\, 12\} \bigr). \]
We compute \(V=\operatorname{{\mathbb E}}(Y\, |\, \mathcal I)\) as before. This is \(\mathcal I\)-measurable, hence constant on \(\{ 1,\, 5,\, 9\} \), as well as on \(\{ 3,\, 7,\, 11\} \) and on \(\{ 2,\, 4,\, 6,\, 8,\, 10,\, 12\} \). Substituting these as \(G\) (the rest in \(\mathcal I\) will not provide additional help) in \(\operatorname{{\mathbb E}}(V\, ;\, G)=\operatorname{{\mathbb E}}(Y\, ;\, G)\) results in

\[ \begin{aligned} V(1)=V(5)=V(9)& =\frac{Y(1){\mathbb P}\{ 1\} +Y(5){\mathbb P}\{ 5\} +Y(9){\mathbb P}\{ 9\} }{{\mathbb P}\{ 1\} +{\mathbb P}\{ 5\} +{\mathbb P}\{ 9\} }=\frac{1\cdot \frac1{12}+2\cdot \frac1{12}+3\cdot \frac1{12}}{\frac1{12}+\frac1{12}+\frac1{12}}=2,\\ V(3)=V(7)=V(11)& =\frac{Y(3){\mathbb P}\{ 3\} +Y(7){\mathbb P}\{ 7\} +Y(11){\mathbb P}\{ 11\} }{{\mathbb P}\{ 3\} +{\mathbb P}\{ 7\} +{\mathbb P}\{ 11\} }=\frac{1\cdot \frac1{12}+2\cdot \frac1{12}+3\cdot \frac1{12}}{\frac1{12}+\frac1{12}+\frac1{12}}=2,\\ V(2)=V(4)=V(6)\qquad \qquad & \\ =V(8)=V(10)=V(12)& =\tfrac {Y(2){\mathbb P}\{ 2\} +Y(4){\mathbb P}\{ 4\} +Y(6){\mathbb P}\{ 6\} +Y(8){\mathbb P}\{ 8\} +Y(10){\mathbb P}\{ 10\} +Y(12){\mathbb P}\{ 12\} }{{\mathbb P}\{ 2\} +{\mathbb P}\{ 4\} +{\mathbb P}\{ 6\} +{\mathbb P}\{ 8\} +{\mathbb P}\{ 10\} +{\mathbb P}\{ 12\} }\\ & =\frac{1\cdot \frac1{12}+1\cdot \frac1{12}+2\cdot \frac1{12}+2\cdot \frac1{12}+3\cdot \frac1{12}+3\cdot \frac1{12}}{\frac1{12}+\frac1{12}+\frac1{12}+\frac1{12}+\frac1{12}+\frac1{12}}=2. \end{aligned} \]

We find that \(\operatorname{{\mathbb E}}(Y\, |\, \mathcal I)\) is actually a constant, and in fact \(=\operatorname{{\mathbb E}}Y\).

We can repeat this calculation with any function \(f\, :\, \mathbb R\to \mathbb R\) (in general this is chosen to be bounded and measurable) to find \(\operatorname{{\mathbb E}}\bigl(f(Y)\, |\, \mathcal I\bigr)=\operatorname{{\mathbb E}}\bigl(f(Y)\bigr)\), a constant. This is when we say that the random variable \(Y\) is independent of the \(\sigma \)-algebra \(\mathcal I\). Knowing which of the events \(\{ 1,\, 5,\, 9\} \) and \(\{ 3,\, 7,\, 11\} \) did or did not happen gives us no information about \(Y\).
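The independence can be seen numerically by the same block averaging; this Python sketch (informal, not part of the notes) confirms that the average of \(Y\) equals \(2=\operatorname{{\mathbb E}}Y\) on every generating block of \(\mathcal I\).

```python
from fractions import Fraction

# E(Y | I) on the toy example is the constant E Y = 2 on every block of I.
P = {w: Fraction(1, 12) for w in range(1, 13)}
Y = lambda w: (w + 3) // 4            # Y = ceil(w/4)
I_blocks = [{1, 5, 9}, {3, 7, 11}, {2, 4, 6, 8, 10, 12}]

for block in I_blocks:
    avg = sum(Y(w) * P[w] for w in block) / sum(P[w] for w in block)
    assert avg == 2                   # constant on each block

assert sum(Y(w) * P[w] for w in P) == 2   # equals E Y itself
```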

If \(\mathcal I\) happens to be generated by yet another random variable \(Z\), \(\mathcal I=\sigma (Z)\), then the above is equivalent to variables \(Y\) and \(Z\) being independent.

An important property and tool with conditional expectations is the following:

Theorem 2.2 (Tower rule)

Let \(Z\) be a random variable on the probability space \((\Omega ,\, \mathcal F,\, {\mathbb P})\), with \(\operatorname{{\mathbb E}}|Z|{\lt}\infty \). Let \(\mathcal G\subseteq \mathcal H\subseteq \mathcal F\) be sub-\(\sigma \)-algebras (\(\mathcal G\) is coarser and \(\mathcal H\) is finer). Then

\[ \operatorname{{\mathbb E}}\bigl(\operatorname{{\mathbb E}}(Z\, |\, \mathcal G)\, |\, \mathcal H\bigr)=\operatorname{{\mathbb E}}\bigl(\operatorname{{\mathbb E}}(Z\, |\, \mathcal H)\, |\, \mathcal G\bigr)=\operatorname{{\mathbb E}}(Z\, |\, \mathcal G). \] The proof follows from the definition of conditional expectations after a bit of manipulation; we leave this to the reader. Some special cases of interest:

  • If \(\mathcal H=\mathcal F\), the full \(\sigma \)-algebra in the probability space \((\Omega ,\, \mathcal F,\, {\mathbb P})\), then \(\operatorname{{\mathbb E}}(Z\, |\, \mathcal H)=\operatorname{{\mathbb E}}(Z\, |\, \mathcal F)=Z\) for any random variable. The above then reads \(\operatorname{{\mathbb E}}(Z\, |\, \mathcal G)\) for all three terms.

  • If \(\mathcal G=\{ \emptyset ,\, \Omega \} \), the trivial \(\sigma \)-algebra, then \(\operatorname{{\mathbb E}}(\cdot \, |\, \mathcal G)=\operatorname{{\mathbb E}}(\cdot )\). The Tower rule then becomes

    \[ \operatorname{{\mathbb E}}\bigl((\operatorname{{\mathbb E}}Z)\, |\, \mathcal H\bigr)=\operatorname{{\mathbb E}}\bigl(\operatorname{{\mathbb E}}(Z\, |\, \mathcal H)\bigr)=\operatorname{{\mathbb E}}Z. \]

    The first of these terms is uninteresting, but the second equality is very useful and might be familiar from earlier studies, especially when \(\mathcal H=\sigma (V)\), the \(\sigma \)-algebra generated by another random variable \(V\). In this case it reads \(\operatorname{{\mathbb E}}\bigl(\operatorname{{\mathbb E}}(Z|V)\bigr)=\operatorname{{\mathbb E}}Z\).
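On a finite space the Tower rule can be verified by direct computation; the Python sketch below (an informal check) does this on the toy example of Section 2, with the illustrative choice \(Z(\omega )=\omega \), which is not \(\mathcal H\)-measurable.

```python
from fractions import Fraction

# Check E(E(Z|H) | G) = E(Z|G) by block averaging on the toy example.
P = {w: Fraction(1, 12) for w in range(1, 13)}

def cond_exp(f, blocks):
    """Conditional expectation on a finite space: average f over each block."""
    out = {}
    for block in blocks:
        avg = sum(f(w) * P[w] for w in block) / sum(P[w] for w in block)
        for w in block:
            out[w] = avg
    return out

Z = lambda w: w
H_blocks = [{1, 2}, {3, 4}, {5, 6}, {7, 8}, {9, 10}, {11, 12}]
G_blocks = [{1, 2, 3, 4}, {5, 6, 7, 8}, {9, 10, 11, 12}]

inner = cond_exp(Z, H_blocks)                    # E(Z | H)
lhs = cond_exp(lambda w: inner[w], G_blocks)     # E(E(Z|H) | G)
rhs = cond_exp(Z, G_blocks)                      # E(Z | G)
assert lhs == rhs
```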

3 Probability toolbox

The following statements are widely used across probability, and will be built on in this unit. We always assume the probability space \((\Omega ,\, \mathcal F,\, {\mathbb P})\) in the background.

We start with an important fact from calculus.

Lemma 3.1

Let \(a_k\in (0,\, 1)\) with \(\lim _{k\to \infty }a_k=0\). Then

\[ \prod _{k=1}^\infty (1-a_k)=0\quad \Leftrightarrow \quad \sum _ka_k=\infty . \]

Proof.

Convexity of the exponential function implies \(1-x\le \text{\rm e}^{-x}\) for any \(x\in \mathbb R\). As terms in the product are non-negative,

\[ 0\le \prod _{k=1}^\infty (1-a_k)\le \prod _{k=1}^\infty \text{\rm e}^{-a_k}=\text{\rm e}^{-\sum _{k=1}^\infty a_k}. \]

This proves \(\Leftarrow \).

The function \(\text{\rm e}^{-2x}\) is smooth with value \(1\) and derivative \(-2\) at \(x=0\). Hence for all small enough \(x{\gt}0\), \(1-x\ge \text{\rm e}^{-2x}\). Since \(a_k\to 0\), there is an index \(K\) such that \(a_k\) is small enough for this purpose for every \(k\ge K\). Therefore

\[ \prod _{k=K}^\infty (1-a_k)\ge \prod _{k=K}^\infty \text{\rm e}^{-2a_k}=\text{\rm e}^{-2\sum _{k=K}^\infty a_k}. \]

If \(\prod _{k=1}^\infty (1-a_k)=0\), then the left-hand side above is also zero (the finitely many omitted factors are all positive), hence \(\sum _{k=K}^\infty a_k=\infty \), which proves \(\Rightarrow \).
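A numerical sanity check of Lemma 3.1 (illustrative only, not a proof): with \(a_k=1/(k+1)\) the series diverges and the partial products telescope to zero, while with \(a_k=1/(k+1)^2\) the series converges and the products stay bounded away from zero.

```python
# Divergent series: product of (1 - 1/(k+1)) = k/(k+1) telescopes to 1/(N+1).
N = 10_000
prod_div = 1.0
for k in range(1, N + 1):
    prod_div *= 1.0 - 1.0 / (k + 1)
assert abs(prod_div - 1.0 / (N + 1)) < 1e-12   # -> 0 as N grows

# Convergent series: product of (1 - 1/(k+1)^2) tends to 1/2, not 0.
prod_conv = 1.0
for k in range(1, N + 1):
    prod_conv *= 1.0 - 1.0 / (k + 1) ** 2
assert prod_conv > 0.5                          # bounded away from zero
```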

Definition 3.2

Let \(A_1,\, A_2,\, \dots \) be events. Then

\[ \limsup _nA_n:\, =\bigcap _{n=1}^\infty \bigcup _{k=n}^\infty A_k. \] By decoding the union and the intersection it becomes clear that this event describes that infinitely many of the \(A_n\)’s occur, in other words \(A_n\)’s occur infinitely often (i.o.).
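For an eventually periodic family of sets the tail unions \(\bigcup _{k\ge n}A_k\) stabilise after one full period, so \(\limsup _nA_n\) can be found by a finite computation; the Python sketch below (with the illustrative choice \(A_n=\{ n\bmod 3\} \)) confirms that every point occurs infinitely often.

```python
# limsup of the periodic family A_n = {n mod 3}. By periodicity, the tail
# union over any full period already equals the full tail union U_{k>=n} A_k.
period = 3
A = lambda n: {n % period}

tail_union = lambda n: set().union(*(A(k) for k in range(n, n + period)))
limsup = set.intersection(*(tail_union(n) for n in range(1, 10)))
assert limsup == {0, 1, 2}   # each point lies in A_n for infinitely many n
```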
Theorem 3.3 (Borel-Cantelli lemmas)
  1. If \(A_1,\, A_2,\, \dots \) are any events with \(\sum _n{\mathbb P}(A_n){\lt}\infty \), then \({\mathbb P}(\limsup _nA_n)=0\).

  2. If \(A_1,\, A_2,\, \dots \) are independent events with \(\sum _n{\mathbb P}(A_n)=\infty \), then \({\mathbb P}(\limsup _nA_n)=1\).

Proof.
  1. Notice that \(\bigcup _{k=n}^\infty A_k\) is decreasing in \(n\). Thus, by continuity of probability (Theorem 1.6) and Boole’s inequality,

    \[ {\mathbb P}\Bigl(\bigcap _{n=1}^\infty \bigcup _{k=n}^\infty A_k\Bigr)=\lim _{n\to \infty }{\mathbb P}\Bigl(\bigcup _{k=n}^\infty A_k\Bigr)\le \lim _{n\to \infty }\sum _{k=n}^\infty {\mathbb P}(A_k)=0. \]
  2. Notice that \(\bigcap _{k=n}^\infty A_k^\text c\) is increasing in \(n\). Thus,

    \[ {\mathbb P}\Bigl(\bigcap _{n=1}^\infty \bigcup _{k=n}^\infty A_k\Bigr)=1-{\mathbb P}\Bigl(\bigcup _{n=1}^\infty \bigcap _{k=n}^\infty A_k^\text c\Bigr)=1-\lim _{n\to \infty }{\mathbb P}\Bigl(\bigcap _{k=n}^\infty A_k^\text c\Bigr)=1-\lim _{n\to \infty }\prod _{k=n}^\infty \bigl(1-{\mathbb P}(A_k)\bigr). \]

    The product is \(0\) for any \(n\) due to Lemma 3.1, which completes the proof.

We now turn to the interchangeability of limits and expectations. The statements below are standard parts of measure theory, where they are treated for integrals more general than the expectations and sums seen here.

Theorem 3.4 (Monotone convergence)

Let \(Y,\, X,\, X_1,\, X_2,\, X_3,\, \dots \) be random variables.

  1. If \(X_n\ge Y\) for each \(n\), \(\operatorname{{\mathbb E}}Y{\gt}-\infty \), and \(X_n\nearrow X\) for every \(\omega \in \Omega \), then \(\operatorname{{\mathbb E}}X_n\nearrow \operatorname{{\mathbb E}}X\).

  2. If \(X_n\le Y\) for each \(n\), \(\operatorname{{\mathbb E}}Y{\lt}\infty \), and \(X_n\searrow X\) for every \(\omega \in \Omega \), then \(\operatorname{{\mathbb E}}X_n\searrow \operatorname{{\mathbb E}}X\).

Proof.

Proof of 1 only; 2 follows similarly.

Suppose first that \(Y \equiv 0\), i.e., \(Y(\omega )=0\) for all \(\omega \in \Omega \). Then \(X_n\geq 0\), so by Theorem 1.14 (from the measure theory section) for each \(X_k\) there exists a sequence of simple (that is, constant on finitely many measurable sets) random variables \(X_k^{(n)}\) such that \(X_k^{(n)}\nearrow X_k\) as \(n\to \infty \).

\[ \begin{matrix} X_1 & \leq & X_2 & \leq & X_3 & \leq & \cdots & \leq & X \\ \bigvee \! | & & \bigvee \! | & & \bigvee \! | \\ \vdots & & \vdots & & \vdots \\ \bigvee \! | & & \bigvee \! | & & \bigvee \! | \\ X_1^{(3)} & & X_2^{(3)} & & X_3^{(3)} \\ \bigvee \! | & & \bigvee \! | & & \bigvee \! | \\ X_1^{(2)} & & X_2^{(2)} & & X_3^{(2)} \\ \bigvee \! | & & \bigvee \! | & & \bigvee \! | \\ X_1^{(1)} & & X_2^{(1)} & & X_3^{(1)} \end{matrix} \]

Define \(Z^{(n)}:=\max _{1\leq j \leq n} X_j^{(n)}\). That is, \(Z^{(n)}\) is the maximum value of the first \(n\) terms in the \(n\)th row from the bottom in the table above.

Properties of \(Z^{(n)}\):

  • For all \(1\leq k\leq n\) we have \(X_k^{(n)}\leq Z^{(n)} \leq X_n\). The first inequality follows immediately from the definition of \(Z^{(n)}\), it is simply the maximum of such values, and is hence an upper bound. The second inequality follows from chasing the column up within which the maximum lies and then across to the value \(X_n\). Formally, for some \(1\leq k \leq n\), \(Z^{(n)}=X_k^{(n)}\leq X_k \leq X_n\).

  • \(Z^{(n-1)}\leq Z^{(n)}\). Why? \(Z^{(n-1)}=\max _{1\leq j \leq n-1} X_j^{(n-1)}\leq \max _{1\leq j \leq n-1} X_j^{(n)}\le Z^{(n)}.\) The inequality follows from the fact that for all \(j\in \mathbb N\) we have \(X_j^{(n-1)}\leq X_j^{(n)}\), and then we maximise over a larger domain.

Define \(Z:=\lim _{n\to \infty } Z^{(n)}\), which exists because \(Z^{(n)}\) is an increasing sequence (the limit may possibly be infinite).

Since for all \(1\leq k\leq n\) we have \(X_k^{(n)}\leq Z^{(n)} \leq X_n\), taking \(n\to \infty \) we see that

\[ \lim _{n\to \infty } X_k^{(n)}\leq \lim _{n\to \infty } Z^{(n)} \leq \lim _{n\to \infty } X_n \implies X_k \leq Z \leq X \underbrace{\implies }_{k\to \infty }Z=X. \]

Note that, since the \(Z^{(n)}\)s are simple (indeed, each is a maximum of finitely many simple random variables), by the definition of expectation via limits of simple random variables,

\[ \operatorname{{\mathbb E}}X = \operatorname{{\mathbb E}}Z = \operatorname{{\mathbb E}}\lim _{n\to \infty } Z^{(n)} = \lim _{n\to \infty } \operatorname{{\mathbb E}}Z^{(n)} \leq \lim _{n\to \infty } \operatorname{{\mathbb E}}X_n. \]

Thus it remains to show that \(\operatorname{{\mathbb E}}X \geq \lim _{n\to \infty } \operatorname{{\mathbb E}}X_n\). Since \(X_n\nearrow X\) we have that \(X_n\leq X\) for all \(n\), which implies that \(\operatorname{{\mathbb E}}X_n \leq \operatorname{{\mathbb E}}X\). Hence,

\[ \lim _{n\to \infty } \operatorname{{\mathbb E}}X_n \leq \operatorname{{\mathbb E}}X. \]

If \(Y\not\equiv 0\), we repeat the above analysis with \(X_n-Y\), which is a non-negative random variable.
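The monotone increase of the expectations can be seen concretely on a toy example. The following Python sketch is an illustration only (the distribution and truncations are chosen here, not in the notes): it truncates a discrete variable \(X\) with \({\mathbb P}\{ X=k\} =2^{-k}\) at level \(n\) and watches \(\operatorname{{\mathbb E}}X_n\) climb towards \(\operatorname{{\mathbb E}}X=2\).

```python
from fractions import Fraction

# Toy space: P(X = k) = 2^{-k}, k = 1, 2, ...; E X = 2.
# The truncations X_n = min(X, n) increase to X, so E X_n should increase to E X.

def expect_truncated(n, terms=60):
    # E min(X, n) = sum_k min(k, n) * 2^{-k}; the series is cut far in the tail
    return sum(Fraction(min(k, n), 2**k) for k in range(1, terms + 1))

means = [expect_truncated(n) for n in range(1, 11)]
# monotonicity holds exactly, since min(k, n) is non-decreasing in n termwise
assert all(means[i] <= means[i + 1] for i in range(len(means) - 1))
print([float(m) for m in means])  # increases towards E X = 2
```

The exact rational arithmetic avoids any floating-point doubt about the monotonicity claim.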

For the next statement, notice that every sequence has a liminf.

Theorem 3.5 (Fatou’s lemma)

Let \(Y,\, X_1,\, X_2,\, X_3,\, \dots \) be random variables with \(X_n\ge Y\) for each \(n\), \(\operatorname{{\mathbb E}}Y{\gt}-\infty \). Then \(\liminf _n\operatorname{{\mathbb E}}X_n\ge \operatorname{{\mathbb E}}\liminf _nX_n\).

It is sometimes convenient to pick \(Y\equiv 0\) in the above theorems.

Proof.

Define \(Z_n := \inf _{m\geq n} X_m\). Then \(Z_n\) is an increasing sequence; indeed \(\inf _{m\geq n} X_m\leq \inf _{m\geq n+1} X_m\), since the infimum on the right is taken over a smaller set. Furthermore, \(Z_n\nearrow Z:=\liminf _{n\to \infty } X_n\); this follows from the fact that \(Z_n\) is increasing and by definition

\[ \liminf _{n\to \infty } X_n = \lim _{n\to \infty } (\inf _{m\geq n} X_m)=\lim _{n\to \infty } Z_n. \]

Now \(Z_n = \inf _{m\geq n} X_m\geq Y\) as \(X_m\geq Y\) for all \(m\in \mathbb N\). Thus we are in good shape to apply the monotone convergence theorem:

\[ \lim _{n\to \infty } \operatorname{{\mathbb E}}Z_n = \operatorname{{\mathbb E}}Z= \operatorname{{\mathbb E}}\liminf _{n\to \infty } X_n \]

But on the left-hand side, as the limit exists, it is equal to the \(\liminf \). Now \(Z_n = \inf _{m\geq n} X_m\leq X_n\), hence \(\operatorname{{\mathbb E}}Z_n\leq \operatorname{{\mathbb E}}X_n\), and thus

\[ \operatorname{{\mathbb E}}\liminf _{n\to \infty } X_n=\liminf _{n\to \infty } \operatorname{{\mathbb E}}Z_n\leq \liminf _{n\to \infty } \operatorname{{\mathbb E}}X_n, \]

which is the claim.

Theorem 3.6 (Dominated convergence)

Let \(Y,\, X,\, X_1,\, X_2,\, X_3,\, \dots \) be random variables, and assume \(|X_n|\le Y\) for each \(n\), \(\operatorname{{\mathbb E}}Y{\lt}\infty \), and \(X_n\to X\) almost surely (a.s., that is, \({\mathbb P}\{ X_n\to X\} =1\)). Then \(\operatorname{{\mathbb E}}|X|{\lt}\infty \), \(\operatorname{{\mathbb E}}X_n\to \operatorname{{\mathbb E}}X\), and \(\operatorname{{\mathbb E}}|X-X_n|\to 0\).

Proof.

To prove finiteness of the expectation note that \(X_n\xrightarrow {\text{a.s.}} X\) as \(n\to \infty \) implies that \(|X_n|\xrightarrow {\text{a.s.}} |X|\) as \(n\to \infty \) (the absolute value is a continuous function). Furthermore, since \(|X_n|\leq Y\), we have that \(|X|=\lim _{n\to \infty } |X_n| \leq Y\) almost surely, hence \(\operatorname{{\mathbb E}}|X|\leq \operatorname{{\mathbb E}}Y{\lt}\infty \).

To prove convergence of the expectation we construct the following chain of inequalities:

\[ \operatorname{{\mathbb E}}X=\operatorname{{\mathbb E}}\liminf _{n\to \infty } X_n\overset {*}{\leq }\liminf _{n\to \infty } \operatorname{{\mathbb E}}X_n\leq \limsup _{n\to \infty } \operatorname{{\mathbb E}}X_n\overset {**}{\leq }\operatorname{{\mathbb E}}\limsup _{n\to \infty } X_n=\operatorname{{\mathbb E}}X, \]

where \(*\) follows from Fatou’s lemma (with lower bound \(-Y\)) and \(**\) follows from Fatou’s lemma applied to \(-X_n\) (again with lower bound \(-Y\)); indeed we have the relation \(\liminf _{n\to \infty } (-X_n) = - \limsup _{n\to \infty } X_n.\) Thus equality holds throughout the chain and we conclude that

\[ \underbrace{\liminf _{n\to \infty } \operatorname{{\mathbb E}}X_n = \limsup _{n\to \infty } \operatorname{{\mathbb E}}X_n}_{\implies =\lim _{n\to \infty } \operatorname{{\mathbb E}}X_n} = \operatorname{{\mathbb E}}X. \]

Since the \(\liminf \) and \(\limsup \) agree, \(\lim _{n\to \infty } \operatorname{{\mathbb E}}X_n\) exists and is equal to \(\operatorname{{\mathbb E}}X\). Thus

\[ \lim _{n\to \infty } \operatorname{{\mathbb E}}X_n = \operatorname{{\mathbb E}}X (\underset {\star }{=} \operatorname{{\mathbb E}}\lim _{n\to \infty } X_n), \]

where \(\star \) follows from the fact that \(\operatorname{{\mathbb E}}\lim _{n\to \infty } X_n\) is also an element in the chain.

Finally, to prove \(\operatorname{{\mathbb E}}|X-X_n|\to 0\) we note that \(|X_n-X|\leq |X_n| + |X| \leq 2Y\). We then repeat the analysis above with \(|X_n-X|\) in place of \(X_n\) and with \(2Y\) as the dominating random variable.

Example 3.7

Let

\[ X_n=\left\{ \begin{aligned} & n^2-1,& & \text{with probability }\frac1{n^2},\\ & -1,& & \text{with probability }1-\frac1{n^2}, \end{aligned} \right. \]

be independent. One easily checks \(\operatorname{{\mathbb E}}X_n=0\) for all \(n\), hence \(\lim _{n\to \infty }\operatorname{{\mathbb E}}X_n=0\). However, the probabilities in the first line are summable, hence the Borel-Cantelli lemma implies that a.s. \(X_n\ne -1\) happens only for finitely many \(n\). It follows that \(X_n\to -1\) a.s.: the limit and the expectation cannot be swapped. The conditions of both Monotone and Dominated convergence fail.
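Both claims of this example can be checked mechanically. The hedged sketch below (an illustration; path counts and thresholds are chosen here) verifies the expectation identity exactly with rational arithmetic, and simulates paths to confirm that values other than \(-1\) stop occurring.

```python
import random
from fractions import Fraction

# Exact expectation: (n^2 - 1) * 1/n^2 + (-1) * (1 - 1/n^2) = 0 for every n.
for n in range(1, 50):
    p = Fraction(1, n * n)
    assert (n * n - 1) * p + (-1) * (1 - p) == 0

# Simulation: along each path, X_n != -1 should happen for finitely many n only.
random.seed(0)
paths, horizon, late = 200, 2000, 0
for _ in range(paths):
    # record whether any "large" outcome occurs beyond n = 100
    if any(random.random() < 1 / (n * n) for n in range(101, horizon)):
        late += 1
# P(some X_n != -1 for n > 100) <= sum_{n>100} 1/n^2, roughly 0.01 per path
assert late <= paths * 0.2
print(late, "of", paths, "paths saw a value other than -1 after n = 100")
```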

Two important corollaries concern swapping sum and expectation. There is no issue with finite sums, but infinite sums require some thought. These will be important later on, hence the proof is provided.

Theorem 3.8 (Tonelli)

Let \(X_n\ge 0\) be random variables. Then \(\operatorname{{\mathbb E}}\sum _{k=1}^\infty X_k=\sum _{k=1}^\infty \operatorname{{\mathbb E}}X_k\).

Proof.

First notice that \(\sum _{k=1}^nX_k\ge 0\) is non-decreasing in \(n\), hence the expectations and the infinite sums are well-defined (possibly infinite). The statement follows from Monotone convergence applied to the sequence \(\sum _{k=1}^nX_k\), which converges monotonically to the infinite sum:

\[ \operatorname{{\mathbb E}}\sum _{k=1}^\infty X_k=\operatorname{{\mathbb E}}\lim _{n\to \infty }\sum _{k=1}^nX_k=\lim _{n\to \infty }\operatorname{{\mathbb E}}\sum _{k=1}^nX_k=\lim _{n\to \infty }\sum _{k=1}^n\operatorname{{\mathbb E}}X_k=\sum _{k=1}^\infty \operatorname{{\mathbb E}}X_k. \]
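A finite toy model makes the swap of sum and expectation tangible. The sketch below is an illustration only (the space and variables are chosen here): it evaluates both sides of Tonelli's identity exactly on \(\Omega =\{ 0,1,2,\dots \} \) with \({\mathbb P}(j)=2^{-(j+1)}\) and \(X_k={\bf 1}\{ j\ge k\} \), where both sides equal \(\operatorname{{\mathbb E}}j=1\).

```python
from fractions import Fraction

# X_k(j) = 1{j >= k}, so sum_k X_k(j) = j; both sides of Tonelli should agree.
J, K = 80, 80                      # truncation levels; the tails are ~2^{-80}
P = [Fraction(1, 2 ** (j + 1)) for j in range(J)]

# E sum_k X_k: for each outcome j, count the indicators that fire
lhs = sum(P[j] * sum(1 for k in range(1, K) if j >= k) for j in range(J))
# sum_k E X_k: for each k, total the probability that the indicator fires
rhs = sum(sum(P[j] for j in range(J) if j >= k) for k in range(1, K))
assert lhs == rhs                  # exact equality in rational arithmetic
print(float(lhs))                  # both sides are (essentially) 1
```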

Theorem 3.9 (Fubini)

Let \(X_n\) be random variables with \(\operatorname{{\mathbb E}}\sum _{k=1}^\infty |X_k|{\lt}\infty \). Then \(\operatorname{{\mathbb E}}\sum _{k=1}^\infty X_k=\sum _{k=1}^\infty \operatorname{{\mathbb E}}X_k\).

Proof.

Recall (1.3), and notice \(|x|=x^++x^-\) for any real \(x\). By positivity and Tonelli’s theorem,

\[ \infty {\gt}\operatorname{{\mathbb E}}\sum _{k=1}^\infty |X_k|=\operatorname{{\mathbb E}}\sum _{k=1}^\infty X_k^++\operatorname{{\mathbb E}}\sum _{k=1}^\infty X_k^-, \tag{3.1} \]

which also implies that both sums on the right are a.s. finite. Therefore, a.s.,

\[ \begin{aligned} \sum _{k=1}^\infty X_k& =\lim _{n\to \infty }\sum _{k=1}^nX_k=\lim _{n\to \infty }\sum _{k=1}^n\bigl(X_k^+-X_k^-\bigr)\\ & =\lim _{n\to \infty }\Bigl(\sum _{k=1}^nX_k^+-\sum _{k=1}^nX_k^-\Bigr)=\lim _{n\to \infty }\sum _{k=1}^nX_k^+-\lim _{n\to \infty }\sum _{k=1}^nX_k^-=\sum _{k=1}^\infty X_k^+-\sum _{k=1}^\infty X_k^-. \end{aligned} \]

By (3.1), we can apply \(\operatorname{{\mathbb E}}\) separately on this difference. Both sums on the right-hand side are of non-negative terms, hence Tonelli’s theorem applies separately:

\[ \begin{aligned} \operatorname{{\mathbb E}}\sum _{k=1}^\infty X_k& =\operatorname{{\mathbb E}}\sum _{k=1}^\infty X_k^+-\operatorname{{\mathbb E}}\sum _{k=1}^\infty X_k^-=\sum _{k=1}^\infty \operatorname{{\mathbb E}}X_k^+-\sum _{k=1}^\infty \operatorname{{\mathbb E}}X_k^-\\ & =\lim _{n\to \infty }\sum _{k=1}^n\operatorname{{\mathbb E}}X_k^+-\lim _{n\to \infty }\sum _{k=1}^n\operatorname{{\mathbb E}}X_k^-=\lim _{n\to \infty }\Bigl(\sum _{k=1}^n\operatorname{{\mathbb E}}X_k^+-\sum _{k=1}^n\operatorname{{\mathbb E}}X_k^-\Bigr)=\lim _{n\to \infty }\sum _{k=1}^n\operatorname{{\mathbb E}}X_k=\sum _{k=1}^\infty \operatorname{{\mathbb E}}X_k. \end{aligned} \]

When joining the two limits we used that, by (3.1) and the same application of Tonelli’s theorem, each of \(\sum _{k=1}^n\operatorname{{\mathbb E}}X_k^+\) and \(\sum _{k=1}^n\operatorname{{\mathbb E}}X_k^-\) has a finite limit.

Here is a simple, but very useful theorem.

Theorem 3.10 (Jensen’s inequality)

Let \(X\) be a random variable with \(\operatorname{{\mathbb E}}|X|{\lt}\infty \), and \(g\) a convex \(\mathbb R\to \mathbb R\) function. Then \(g(\operatorname{{\mathbb E}}X)\le \operatorname{{\mathbb E}}g(X)\).

Proof.

Since \(g\) is convex, for all \(x_0\in \mathbb R\) there exists \(\lambda \) such that \(g(x)\geq g(x_0) + \lambda (x-x_0)\). (If \(g\) happens to be differentiable, the tangent line at \(x_0\) is such a lower bound, but differentiability is not needed.) Hence \(g(X)\geq g(x_0) + \lambda (X-x_0)\); in particular, for \(x_0=\operatorname{{\mathbb E}}X \in \mathbb R\),

\[ g(X)\geq g(\operatorname{{\mathbb E}}X) + \lambda (X-\operatorname{{\mathbb E}}X). \]

Note that \(\lambda \) is a constant which depends on the function \(g\) and the value of \(\operatorname{{\mathbb E}}X\) only (the slope of the bounding line is dependent only on the position \(x_0\) on the curve) and thus it is not random. Therefore \(\operatorname{{\mathbb E}}\bigl(g(X)^-\bigr){\lt}\infty \), hence \(\operatorname{{\mathbb E}}g(X)\) exists, and

\[ \operatorname{{\mathbb E}}g(X)\geq \operatorname{{\mathbb E}}g(\operatorname{{\mathbb E}}X) + \operatorname{{\mathbb E}}[ \lambda (X-\operatorname{{\mathbb E}}X)] = g(\operatorname{{\mathbb E}}X) + \lambda [\operatorname{{\mathbb E}}X - \operatorname{{\mathbb E}}X] = g(\operatorname{{\mathbb E}}X). \qedhere \]
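A quick numeric spot check of Jensen's inequality; the discrete distribution and the convex functions below are chosen here purely for illustration.

```python
import math

# Check g(E X) <= E g(X) for a simple discrete X and a few convex g.
xs = [-2.0, -0.5, 1.0, 3.0]
ps = [0.1, 0.4, 0.3, 0.2]          # a probability vector
mean = sum(p * x for p, x in zip(ps, xs))

for g in (lambda x: x * x, abs, math.exp):
    lhs = g(mean)                                   # g(E X)
    rhs = sum(p * g(x) for p, x in zip(ps, xs))     # E g(X)
    assert lhs <= rhs + 1e-12
print("Jensen holds for x^2, |x| and exp at this distribution")
```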

Next we turn to expectations of powers of random variables.

Definition 3.11

Given the probability space \((\Omega ,\, \mathcal F,\, {\mathbb P})\) and a \(p{\gt}0\) real, we denote by \(L^p(\Omega ,\, \mathcal F,\, {\mathbb P})\) the set of random variables with finite \(p^\text {th}\) absolute moment. We also introduce the notation \(||X||_p:\, =\bigl(\operatorname{{\mathbb E}}|X|^p\bigr)^{1/p}\), with the convention that here the \(p^\text {th}\) power is inside the expectation, while the \(1/p\) power is outside. Hence \(L^p(\Omega ,\, \mathcal F,\, {\mathbb P})\) is exactly the set of those random variables with finite \(||X||_p\).

As we will see, the cases \(p\ge 1\) will often be the relevant ones.

Next we explore useful properties of \(||\cdot ||_p\).

Theorem 3.12 (Ljapunov’s inequality)

For any real \(0{\lt}p{\lt}q\) and any random variable, \(||X||_p\le ||X||_q\).

Proof.

Write \(s\) and \(t\) for \(p\) and \(q\), so that \(0{\lt}s{\lt}t\). Then

\[ ||X||_s=\bigl(\operatorname{{\mathbb E}}|X|^s\bigr)^{1/s}=\Bigl(\bigl(\operatorname{{\mathbb E}}|X|^s\bigr)^{\frac ts}\Bigr)^{1/t}\overset {\star \star }{\leq }\Bigl(\operatorname{{\mathbb E}}\bigl(|X|^s\bigr)^{\frac ts}\Bigr)^{1/t}=\bigl(\operatorname{{\mathbb E}}|X|^t\bigr)^{1/t}=||X||_t, \]

where \(\star \star \) follows from Jensen’s inequality on the function \((\, \cdot \, )^\frac {t}{s}:[0,\, \infty )\to \mathbb R\), which is convex since \(t{\gt}s\) and therefore \(\frac{t}{s}{\gt}1\).

Theorem 3.13 (Hölder’s inequality)

Let \(p,\, q{\gt}1\) that satisfy \(\frac1p+\frac1q=1\). If \(||X||_p{\lt}\infty \) and \(||Y||_q{\lt}\infty \), then \(\operatorname{{\mathbb E}}|XY|\le ||X||_p\cdot ||Y||_q\).

The case \(p=q=2\) should be familiar under the name Cauchy-Schwarz inequality.

Proof.

Since \(\log \, :\, (0,\, \infty )\to \mathbb R\) is a concave function and \(\frac{x}{p}+\frac{y}{q}\) is a convex combination of \(x,\, y{\gt}0\), we have that

\[ \log \bigg(\frac{x}{p}+\frac{y}{q}\bigg) \geq \frac{\log (x)}{p}+\frac{\log (y)}{q} \implies \frac{x}{p}+\frac{y}{q} \geq x^\frac {1}{p}\cdot y^\frac {1}{q}, \]

since the exponential function is increasing. Let \(x=\frac{|X|^p}{\operatorname{{\mathbb E}}|X|^p}\) and \(y=\frac{|Y|^q}{\operatorname{{\mathbb E}}|Y|^q}\). Then

\[ \frac{1}{p}\frac{|X|^p}{\operatorname{{\mathbb E}}|X|^p}+\frac{1}{q}\frac{|Y|^q}{\operatorname{{\mathbb E}}|Y|^q} \geq \frac{|X|}{(\operatorname{{\mathbb E}}|X|^p)^\frac {1}{p}}\cdot \frac{|Y|}{(\operatorname{{\mathbb E}}|Y|^q)^\frac {1}{q}}. \]

Taking expectations of both sides,

\[ 1=\frac{1}{p}\frac{\operatorname{{\mathbb E}}|X|^p}{\operatorname{{\mathbb E}}|X|^p}+\frac{1}{q}\frac{\operatorname{{\mathbb E}}|Y|^q}{\operatorname{{\mathbb E}}|Y|^q} \geq \frac{\operatorname{{\mathbb E}}|XY|}{||X||_p\cdot ||Y||_q}, \]

which rearranges to the claim. (If \(||X||_p=0\) or \(||Y||_q=0\), then \(XY=0\) a.s. and the inequality is trivial; the argument above assumed both norms positive.)

Notice that, by \(|X|^p\ge 0\), \(||X||_p=0\) implies that \(X=0\) a.s. Also, \(||\lambda X||_p=|\lambda |\cdot ||X||_p\) for any \(\lambda \in \mathbb R\) is easily checked from the definition. This, together with the triangle inequality below, justifies the name \(p\)-norm for \(||\cdot ||_p\) when \(p\ge 1\).

Theorem 3.14 (Minkowski’s inequality)

Let \(p\ge 1\), and \(||X||_p{\lt}\infty \), \(||Y||_p{\lt}\infty \). Then \(||X+Y||_p\le ||X||_p+||Y||_p\).

Proof.

If either \(\| X\| _p\) or \(\| Y\| _p\) were infinite, the inequality would hold trivially; hence the content of the statement is in the case \(\| X\| _p{\lt}\infty \) and \(\| Y\| _p{\lt}\infty \), as assumed.

For the case \(p=1\) the statement follows immediately from the triangle inequality for the absolute value.

Now consider the case \(p{\gt}1\). Define \(F(x) = (a+x)^p-2^{p-1}(a^p+x^p)\), \(x{\gt}0\); where \(a{\gt}0\) is some constant. This has the derivative

\[ F'(x)=p(a+x)^{p-1}-2^{p-1} p x^{p-1}, \]

and so \(F\) is stationary at \(x=a\). Furthermore, \(F'(x){\gt}0\) if and only if \((a+x)^{p-1}{\gt}(2x)^{p-1}\), that is, if and only if \(x{\lt}a\) (here we used \(p{\gt}1\)). Similarly, \(F'(x)=0\) if and only if \(x=a\) and \(F'(x){\lt}0\) if and only if \(x{\gt}a\). Thus \(F\) is increasing for \(x{\lt}a\), reaches a global maximum at \(x=a\), and is decreasing for \(x{\gt}a\). Therefore, \(F(x)\leq F(a)=0\) for all \(x{\gt}0\). We therefore have the inequality

\[ (a+x)^p\leq 2^{p-1}(a^p+x^p) \]

for all \(a{\gt}0\), \(x{\gt}0\), \(p{\gt}1\). Applying this:

\[ |X+Y|^p\leq (|X|+|Y|)^p\leq 2^{p-1}(|X|^p+|Y|^p). \]

Taking expectations,

\[ \operatorname{{\mathbb E}}|X+Y|^p \leq 2^{p-1}( \operatorname{{\mathbb E}}|X|^p + \operatorname{{\mathbb E}}|Y|^p){\lt}\infty \]

since we assumed that both \(\| X\| _p{\lt}\infty \) and \(\| Y\| _p{\lt}\infty \). This verifies that \(\| X+Y\| _p{\lt}\infty \), that is, \(X+Y \in L^p(\Omega )\). Now we prove the Minkowski inequality:

\[ \operatorname{{\mathbb E}}|X+Y|^p=\operatorname{{\mathbb E}}\bigl(|X+Y|\cdot |X+Y|^{p-1}\bigr)\overset {*}{\leq }\operatorname{{\mathbb E}}\bigl(|X|\cdot |X+Y|^{p-1}\bigr)+\operatorname{{\mathbb E}}\bigl(|Y|\cdot |X+Y|^{p-1}\bigr). \]

Above, \(*\) follows from the triangle inequality on \(\mathbb R\). Let \(q\) be such that \(\frac{1}{p}+\frac{1}{q}=1\); this implies \(q=p/(p-1)\). By Hölder’s inequality,

\[ \operatorname{{\mathbb E}}\bigl(|X|\cdot |X+Y|^{p-1}\bigr)\leq \| X\| _p\cdot \bigl(\operatorname{{\mathbb E}}|X+Y|^{(p-1)q}\bigr)^{1/q}=\| X\| _p\cdot \| X+Y\| _p^\frac {p}{q}, \]

and similarly with \(Y\) in the first factor. Plugging this into the above yields that,

\[ \underbrace{\operatorname{{\mathbb E}}|X+Y|^p}_{=\| X+Y\| _p^p} \leq \big(\| X\| _p+\| Y\| _p\big)\| X+Y\| _p^\frac {p}{q}. \]

If \(\| X+Y\| _p=0\) the inequality is trivial. Otherwise, dividing through by \(\| X+Y\| _p^\frac {p}{q}{\gt}0\) and noting that \(p-\frac pq=1\) by the definition of \(q\) (multiply the defining relation by \(p\)), this is precisely the Minkowski inequality.
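The three norm inequalities of this section can be spot-checked on a small discrete probability space. The sketch below is an illustration only (the distribution, the variables and the exponents are chosen here); it verifies Ljapunov, Hölder and Minkowski numerically.

```python
# Omega = {0, 1, 2} with the probabilities below; ||Z||_p = (E |Z|^p)^{1/p}.
ps = [0.2, 0.5, 0.3]
X = [1.0, -2.0, 0.5]
Y = [3.0, 0.5, -1.0]

def norm(Z, p):
    return sum(pr * abs(z) ** p for pr, z in zip(ps, Z)) ** (1 / p)

# Ljapunov: ||X||_p <= ||X||_q for p < q
assert norm(X, 1) <= norm(X, 2) <= norm(X, 4) + 1e-12

# Hoelder with p = 3, q = 3/2: E|XY| <= ||X||_3 ||Y||_{3/2}
exy = sum(pr * abs(x * y) for pr, x, y in zip(ps, X, Y))
assert exy <= norm(X, 3) * norm(Y, 1.5) + 1e-12

# Minkowski with p = 2: ||X + Y||_2 <= ||X||_2 + ||Y||_2
XpY = [x + y for x, y in zip(X, Y)]
assert norm(XpY, 2) <= norm(X, 2) + norm(Y, 2) + 1e-12
print("Ljapunov, Hoelder and Minkowski all hold at this example")
```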

4 Modes of convergence

There are several ways to state that a sequence of random variables converges to a limit. We define the most commonly used modes and state some of their connections.

Definition 4.1
  • Random variables \(X_n\) converge weakly to \(X\), denoted \(X_n\overset {\text w}{\longrightarrow }X\), if for every bounded and continuous \(f\, :\, \mathbb R\to \mathbb R\) function, \(\operatorname{{\mathbb E}}f(X_n)\to \operatorname{{\mathbb E}}f(X)\). This is equivalent to convergence of the distribution functions: \(F_{X_n}(x)\to F_X(x)\) at every \(x\) where the limit distribution function \(F_X\) is continuous. (Those interested can look up Portmanteau’s theorem.) Other commonly used notation for this is \(X_n\Rightarrow X\).

  • Random variables \(X_n\) converge in probability to \(X\), denoted \(X_n\overset {{\mathbb P}}{\longrightarrow }X\), if for any \(\varepsilon {\gt}0\), \({\mathbb P}\{ |X-X_n|\ge \varepsilon \} \to 0\).

  • Random variables \(X_n\) converge strongly, or almost surely to \(X\), if \({\mathbb P}\{ X_n\to X\} =1\).

  • Fix a \(p{\gt}0\). Random variables \(X_n\) converge in \(L^p\), denoted \(X_n\overset {L^p}{\longrightarrow }X\) if \(||X-X_n||_p\to 0\).

Notice that for weak convergence one does not even need the random variables to be defined on the same probability space. This mode only features the distributions, not the actual values of the random variables. The other three modes compare values, hence require the random variables to be defined on a common probability space.

Theorem 4.2
  1. \(X_n\xrightarrow {\text{a.s.}} X\) if and only if for all \(\varepsilon {\gt}0\), \({\mathbb P}(\sup _{k\geq n} |X_k-X|\geq \varepsilon )\to 0\) as \(n\to \infty \).

  2. \(X_n\) is Cauchy almost surely if and only if for all \(\varepsilon {\gt}0\), \({\mathbb P}(\sup _{k,\ell \geq n}|X_k-X_\ell | \geq \varepsilon )\to 0\) as \(n\to \infty \). This is also equivalent to: for all \(\varepsilon {\gt}0\), \({\mathbb P}(\sup _{k\geq 0}|X_{n+k}-X_n | \geq \varepsilon )\to 0\) as \(n\to \infty \).

Remark 4.3

Note that 1 is like a ‘boosted’ version of convergence in probability, where we require that all points onwards from \(n\) are within \(\varepsilon \) of \(X\).

Proof.

1 Define the events \(A_k^m:=\{ \omega \in \Omega : |X_k-X|\geq \frac{1}{m}\} \) and \(A^m:=\bigcap _{n=1}^\infty \bigcup _{k=n}^\infty A_k^m\), which is the event that \(|X_k-X|\geq \frac{1}{m}\) for infinitely many \(k\).

Note that \(X_n\not\to X\) if and only if for some \(m\in \mathbb N\), \(A^m\) occurs; that is, for some \(m{\gt}0\), \(|X_k-X|\geq \frac{1}{m}\) for infinitely many \(k\). The event that this happens for at least one \(m\) is \(\bigcup _{m=1}^\infty A^m\). Thus \(X_n\xrightarrow {\text{a.s.}} X\) if and only if \({\mathbb P}(\bigcup _{m=1}^\infty A^m)=0\). Since \({\mathbb P}(A^m)\leq {\mathbb P}(\bigcup _{m=1}^\infty A^m) \leq \sum _{m=1}^\infty {\mathbb P}(A^m)\), \({\mathbb P}(\bigcup _{m=1}^\infty A^m)=0\) if and only if \({\mathbb P}(A^m)=0\) for all \(m\in \mathbb N\). In turn,

\[ {\mathbb P}(A^m)={\mathbb P}\Bigl(\bigcap _{n=1}^\infty \underbrace{\bigcup _{k=n}^\infty A_k^m}_{\star }\Bigr)=\lim _{n\to \infty }{\mathbb P}\Bigl(\underbrace{\bigcup _{k=n}^\infty A_k^m}_{\star \star }\Bigr)=\lim _{n\to \infty }{\mathbb P}\Bigl(\sup _{k\geq n} |X_k-X|\geq \frac1m\Bigr). \]

\(\star \) defines a decreasing sequence of events, which justifies passing to the limit by continuity of measure. \(\star \star \) is the event that for some \(k\geq n\) we have \(|X_k-X|\geq \frac{1}{m}\), which occurs if and only if the supremum, \(\sup _{k\geq n} |X_k-X|\geq 1/m\).

As any \(\varepsilon {\gt}0\) is bounded below by \(\frac1m\) for a suitable \(m\), the proof is complete.

To prove 2, we repeat exactly the same analysis with the event \(B_{k,\ell }^m=\{ \omega \in \Omega : |X_k-X_\ell | \geq \frac{1}{m}\} \).

Theorem 4.4
  1. \(X_n\to X\) a.s. implies \(X_n\overset {{\mathbb P}}{\longrightarrow }X\).

  2. Fix any \(p{\gt}0\). Then \(X_n\overset {L^p}{\longrightarrow }X\) implies \(X_n\overset {{\mathbb P}}{\longrightarrow }X\).

  3. \(X_n\overset {{\mathbb P}}{\longrightarrow }X\) implies \(X_n\overset {\text w}{\longrightarrow }X\).

Proof.

To prove 1 we use the previous theorem, that is, that \(X_n \xrightarrow {\text{a.s.}} X\) if and only if for all \(\varepsilon {\gt}0\), \({\mathbb P}(\sup _{k\geq n} |X_k-X|\geq \varepsilon )\to 0\) as \(n\to \infty \). The event \(|X_n-X|\geq \varepsilon \) implies that \(\sup _{k\geq n} |X_k-X|\geq \varepsilon \), and thus

\[ {\mathbb P}(|X_n-X|\geq \varepsilon )\leq {\mathbb P}(\sup _{k\geq n} |X_k-X|\geq \varepsilon )\to 0, \]

which implies \(X_n\overset {{\mathbb P}}{\longrightarrow }X\).

To prove 2 we use Markov’s inequality, which says that for a non-negative random variable \(Z\), \({\mathbb P}(Z\geq c) \leq \operatorname{{\mathbb E}}Z /c\). Then for all \(p{\gt}0\),

\[ {\mathbb P}(|X_n-X|\geq \varepsilon )={\mathbb P}(|X_n-X|^p\geq \varepsilon ^p)\leq \frac{\operatorname{{\mathbb E}}|X_n-X|^p}{\varepsilon ^p} = \frac{\| X_n-X\| _p^p}{\varepsilon ^p}\to 0 \]

as \(n\to \infty \), since \(X_n\overset {L^p}{\longrightarrow }X\) means exactly \(\| X_n-X\| _p\to 0\).

Now we prove 3 which says that convergence in probability implies weak convergence (in distribution). This is the most difficult to prove. Fix \(f\) bounded and continuous such that \(|f(x)|\leq c\) \((\spadesuit )\) for some \(c{\gt}0\). We want to show that \(\operatorname{{\mathbb E}}f(X_n) \to \operatorname{{\mathbb E}}f(X)\), or equivalently that \(|\operatorname{{\mathbb E}}f(X_n) - \operatorname{{\mathbb E}}f(X)|\to 0\). Fix \(\varepsilon {\gt}0\).

Assuming \(X\) has a proper distribution, i.e., that \(X\) is finite almost surely, there exists \(N{\gt}0\) such that

\[ {\mathbb P}\{ |X|{\gt}N\} \leq \frac{\varepsilon }{2c}. \]

\([-2N,2N]\) is compact, and thus \(f\) is uniformly continuous on \([-2N,2N]\). Hence there exists \(0{\lt}\delta \leq N\) such that for all \(x,y\in [-2N,2N]\) with \(|x-y|\leq \delta \) we have

\[ |f(x)-f(y)|\leq \varepsilon . \]

Define \(\operatorname{{\mathbb E}}(Z;A) := \operatorname{{\mathbb E}}(Z {\bf 1}_A)\). Then for events \(A_1,A_2,\ldots ,A_n\) that cover the sample space, \(\operatorname{{\mathbb E}}|Z| \leq \operatorname{{\mathbb E}}(|Z|;A_1) + \operatorname{{\mathbb E}}(|Z|;A_2) + \cdots + \operatorname{{\mathbb E}}(|Z|;A_n)\), since \({\bf 1}_{A_1}+\cdots +{\bf 1}_{A_n}\geq 1\) on \(\Omega \) and expectation is linear and monotone.

Since \(X_n\overset {{\mathbb P}}{\longrightarrow }X\), there exists \(n_0\) such that \(n\geq n_0\) implies that \({\mathbb P}\{ |X_n-X|{\gt}\delta \} \leq \frac{\varepsilon }{2c}\). Thus \(n\geq n_0\) implies that

\[ \begin{aligned} |\operatorname{{\mathbb E}}f(X_n) - \operatorname{{\mathbb E}}f(X)|&\leq \operatorname{{\mathbb E}}\bigl(|f(X_n)-f(X)|;\, |X_n-X|\leq \delta ,\, |X|\leq N\bigr)\\ &\quad +\operatorname{{\mathbb E}}\bigl(|f(X_n)-f(X)|;\, |X|{\gt}N\bigr)+\operatorname{{\mathbb E}}\bigl(|f(X_n)-f(X)|;\, |X_n-X|{\gt}\delta \bigr)\\ &\leq \varepsilon +2c\cdot {\mathbb P}\{ |X|{\gt}N\} +2c\cdot {\mathbb P}\{ |X_n-X|{\gt}\delta \} \leq \varepsilon +\varepsilon +\varepsilon =3\varepsilon . \end{aligned} \]

On the first event \(|X|\leq N\) and \(|X_n|\leq N+\delta \leq 2N\), so uniform continuity applies; on the other two events we simply used the bound \((\spadesuit )\). As \(\varepsilon {\gt}0\) was arbitrary, \(\operatorname{{\mathbb E}}f(X_n)\to \operatorname{{\mathbb E}}f(X)\), completing the proof.
We conclude this part with examples that show how reverse implications can fail in the above theorem.

Example 4.5

Let \(U\sim \text{Uniform}(0,\, 1)\), and

\[ \begin{aligned} X_1& ={\bf 1}_{[0,\, 1]}(U),\quad X_2={\bf 1}_{[0,\, \frac12]}(U),\quad X_3={\bf 1}_{[\frac12,\, 1]}(U),\quad X_4={\bf 1}_{[0,\, \frac13]}(U),\quad X_5={\bf 1}_{[\frac13,\, \frac23]}(U),\quad X_6={\bf 1}_{[\frac23,\, 1]}(U),\\ X_7& ={\bf 1}_{[0,\, \frac14]}(U),\quad X_8={\bf 1}_{[\frac14,\, \frac24]}(U),\quad X_9={\bf 1}_{[\frac24,\, \frac34]}(U),\quad X_{10}={\bf 1}_{[\frac34,\, 1]}(U),\quad X_{11}={\bf 1}_{[0,\, \frac15]}(U),\quad \dots \end{aligned} \]

This sequence converges to \(0\) in \(L^p\), therefore in probability and weakly as well. (Just check the probability that \(X_n\ne 0\).) However, there is always a later \(X_n\) with value 1, hence the sequence does not converge a.s.
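The structure of this sequence can be traced deterministically. The sketch below (the block-indexing convention is chosen here) reconstructs the sweeping indicator intervals and evaluates them at a fixed sample point: hits at \(1\) never stop, while the interval lengths shrink to \(0\).

```python
def typewriter(n):
    """Block index m and interval [a, b] of the n-th indicator (1-indexed)."""
    m, start = 1, 0
    while start + m < n:          # block m holds indicators start+1, ..., start+m
        start += m
        m += 1
    j = n - start - 1             # position within block m
    return m, (j / m, (j + 1) / m)

u = 0.3                           # a fixed sample point, U(omega) = 0.3
hits = []
for n in range(1, 2000):
    m, (a, b) = typewriter(n)
    if a <= u <= b:               # X_n(omega) = 1
        hits.append(n)

# Every block's intervals cover [0, 1], so each block contributes a hit:
# X_n(omega) = 1 infinitely often and there is no a.s. convergence.
# But the interval length 1/m -> 0, giving convergence in probability and L^p.
assert len(hits) >= 50
print("first n with X_n(0.3) = 1:", hits[:6])
```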

Example 4.6

Let \(U\sim \text{Uniform}(0,\, 1)\), and

  • \(X_n:\, =U^n\). This converges to \(0\) in all senses.

  • \(X_n:\, =nU^n\). This converges to \(0\) a.s., hence in probability and weakly as well. However,

    \[ \operatorname{{\mathbb E}}|X_n-0|=n\operatorname{{\mathbb E}}U^n=n\cdot \frac1{n+1}\to 1\ne 0, \]

    therefore \(L^1\) convergence does not hold. Notice how both Monotone and Dominated convergence fail for \(X_n\).

  • We can push this further with \(X_n:\, =\text{\rm e}^{n}{\bf 1}_{[0,\, \frac1n]}(U)\). Again, this converges to \(0\) a.s. However, \(||0-X_n||_p=\frac1{n^{1/p}}\cdot \text{\rm e}^{n}\to \infty \) for any \(p{\gt}0\).
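The failure of \(L^1\) convergence in the second bullet can be checked numerically. The sketch below (the quadrature scheme and step counts are chosen here) approximates \(\operatorname{{\mathbb E}}|X_n|=n/(n+1)\) and illustrates the pointwise decay of \(nU^n\).

```python
# X_n = n U^n converges to 0 a.s., while E|X_n| = n/(n+1) -> 1.
def mean_nUn(n, steps=200000):
    # midpoint rule for E[n U^n] = integral of n u^n du over [0, 1]
    h = 1.0 / steps
    return sum(n * ((i + 0.5) * h) ** n * h for i in range(steps))

for n in (1, 5, 50):
    assert abs(mean_nUn(n) - n / (n + 1)) < 1e-3

# pointwise: for a fixed u < 1, n u^n -> 0 very fast
u = 0.9
assert 1000 * u ** 1000 < 1e-40
print("E|X_n| for n = 1, 5, 50:", [round(mean_nUn(n), 4) for n in (1, 5, 50)])
```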

Example 4.7

Given a sequence \(0\le p_n\le 1\), let \(X_n\sim \text{Bernoulli}(p_n)\) be independent. Then

  • by the definitions, \(p_n\to 0\) is equivalent to each of \(L^p\) and in probability convergence to \(0\),

  • by the two Borel-Cantelli lemmas, \(\sum _np_n{\lt}\infty \) is equivalent to a.s. convergence to \(0\).

The choice \(p_n=\frac1n\) therefore gives \(L^p\) but not a.s. convergence.
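A hedged simulation (path counts, seed and thresholds are chosen here) illustrates the two Borel-Cantelli regimes of this example.

```python
import random

random.seed(1)
paths, N = 50, 10000

# p_n = 1/n: by the second Borel-Cantelli lemma, successes recur forever
late_harmonic = sum(
    1 for _ in range(paths)
    if any(random.random() < 1 / n for n in range(100, N))
)
# p_n = 1/n^2: by the first Borel-Cantelli lemma, only finitely many successes
late_square = sum(
    sum(1 for n in range(100, N) if random.random() < 1 / n ** 2)
    for _ in range(paths)
)

# with p_n = 1/n, the chance of NO success in [100, N) is about 100/N = 1%
assert late_harmonic >= paths - 10
# with p_n = 1/n^2, the expected total of late successes over 50 paths is ~0.5
assert late_square <= 25
print(late_harmonic, "of", paths, "harmonic paths had a late success;",
      late_square, "total late successes for 1/n^2")
```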

We close this section by noting that, with the assumptions as stated there, Monotone convergence, Fatou’s lemma and Dominated convergence hold if we require \(X_n\to X\) a.s., or \(X_n\overset {{\mathbb P}}{\longrightarrow }X\) only, instead of convergence for all \(\omega \in \Omega \).

5 Martingales, stopping times

Finally, all the background is in place to consider martingales. From here we mostly follow Williams [2]. Here are the required definitions.

Definition 5.1

A filtered space is \(\bigl(\Omega ,\, \mathcal F,\, (\mathcal F_n)_{n=0}^\infty ,\, {\mathbb P}\bigr)\), where \((\Omega ,\, \mathcal F,\, {\mathbb P})\) is a probability space, and \(\mathcal F_0\subseteq \mathcal F_1\subseteq \mathcal F_2\subseteq \dots \subseteq \mathcal F\) are \(\sigma \)-algebras, jointly called a filtration. We also define \(\mathcal F_\infty :\, =\sigma \bigl(\bigcup _n\mathcal F_n\bigr)\subseteq \mathcal F\).

Definition 5.2

A process (sequence of random variables, that is) \(X_n\) is adapted to the filtration \((\mathcal F_n)_{n\ge 0}\), if for every \(n\), the variable \(X_n\) is \(\mathcal F_n\)-measurable.

This is to say that \(\mathcal F_n\) contains all information about \(X_n\), in other words given \(\mathcal F_n\), \(X_n\) is not random anymore. An often used scenario is defining \(\mathcal F_n=\sigma (X_0,\, X_1,\, \dots ,\, X_n)\) from a process \((X_n)_{n\ge 0}\). This is usually assumed when a filtration is not explicitly given.
Definition 5.3

A process \((M_n)_{n\ge 0}\) in a probability space \((\Omega ,\, \mathcal F,\, {\mathbb P})\) is a martingale with respect to a filtration \((\mathcal F_n)_{n\ge 0}\), if

  • it is adapted to \((\mathcal F_n)_{n\ge 0}\);

  • \(\operatorname{{\mathbb E}}|M_n|{\lt}\infty \) \(\forall n\ge 0\);

  • \(\operatorname{{\mathbb E}}(M_{n+1}\, |\, \mathcal F_n)=M_n\) a.s., \(\forall n\ge 0\).

If instead, in the last equality, we have \(\le \), then we call \(M\) a supermartingale, while if it is \(\ge \) then it is a submartingale.

Notice that \(M_n\) is a (sub-, super-)martingale if and only if \(M_n-M_0\) is. Also, the tower rule shows that \(\operatorname{{\mathbb E}}M_n\begin{smallmatrix} = \\ \le \\ \ge \end{smallmatrix}\operatorname{{\mathbb E}}M_0\) in the respective cases.
Example 5.4

Let \(X_k\) be independent random variables with \(\operatorname{{\mathbb E}}X_k=0\) for each \(k\ge 1\). Then \(\sum _{k=1}^nX_k\) is a martingale. (For \(n=0\) we have an empty sum which we postulate to be zero.)

Example 5.5

Let \(X_k\) be independent random variables with \(\operatorname{{\mathbb E}}X_k=1\) for each \(k\ge 1\). Then \(\prod _{k=1}^nX_k\) is a martingale. (For \(n=0\) we have an empty product which we postulate to be one.)
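A Monte Carlo sanity check of the constant-mean property \(\operatorname{{\mathbb E}}M_n=\operatorname{{\mathbb E}}M_0\) for both of these examples; the particular distributions, seed and tolerances below are chosen here for illustration.

```python
import random

# Partial sums of centred variables and products of mean-one variables
# should both keep a constant expectation over time.
random.seed(2)
paths, n = 100000, 10
sum_mean = 0.0
prod_mean = 0.0
for _ in range(paths):
    s, p = 0.0, 1.0
    for _ in range(n):
        s += random.choice((-1.0, 1.0))   # E X_k = 0: sum martingale
        p *= random.choice((0.5, 1.5))    # E X_k = 1: product martingale
    sum_mean += s / paths
    prod_mean += p / paths

assert abs(sum_mean) < 0.05       # E M_n = E M_0 = 0 for the sum
assert abs(prod_mean - 1.0) < 0.1 # E M_n = E M_0 = 1 for the product
print(round(sum_mean, 4), round(prod_mean, 4))
```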

Example 5.6

Let \(\xi \in L^1\) and \(\mathcal F_n\) be a filtration. Then \(M_n:\, =\operatorname{{\mathbb E}}(\xi \, |\, \mathcal F_n)\) is a martingale due to the Tower rule.

The first applications come from stopping a martingale at a time when something particular happens to a process. This is covered next.

Definition 5.7

A process \((C_n)_{n\ge 1}\) is predictable (also said previsible), if for every \(n\ge 1\), \(C_n\) is \(\mathcal F_{n-1}\)-measurable.

Imagine a game in rounds that is based on random outcomes \(X_n\). For example, \(X_n\) can be the share price of a certain stock. If a player possesses \(C_k\) of these stocks in the \(k^\text {th}\) round of the game, then their joint value changes by \(C_k\cdot (X_k-X_{k-1})\) in this round. The total change in wealth up to time \(n\) is given by

\[ (C\bullet X)_n:\, =\sum _{k=1}^nC_k\cdot (X_k-X_{k-1}). \tag{5.1} \]

Definition 5.8

The process \((C\bullet X)_n\) in (5.1) is called the martingale transform of \(X\) by \(C\), or the discrete stochastic integral of \(C\) by \(X\).

In this setup the player can change the amount \(C_k\) they own of the stock. However, as players do not foresee the future, it is natural to assume that \(C\) is predictable.
Theorem 5.9

Let \(C\) be a predictable process.

  1. If there is a bound \(K{\gt}0\): \(0\le C_n\le K\) (\(\forall n,\, \omega \)), and \(X\) is (super)martingale, then so is \(C\bullet X\).

  2. If \(|C_n|\le K\) (\(\forall n,\, \omega \)), and \(X\) is martingale, then so is \(C\bullet X\).

The bound on \(C_n\) can be relaxed to both \(C_n\) and \(X_n\) being in \(L^2\) for each \(n\).

Proof.

That \(C\bullet X\) is adapted follows easily. Consider

\[ \operatorname{{\mathbb E}}\bigl((C\bullet X)_{n+1}-(C\bullet X)_n\, |\, \mathcal F_n\bigr)=\operatorname{{\mathbb E}}\bigl(C_{n+1}\cdot (X_{n+1}-X_n)\, |\, \mathcal F_n\bigr)=C_{n+1}\operatorname{{\mathbb E}}\bigl((X_{n+1}-X_n)\, |\, \mathcal F_n\bigr). \]

The last step used predictability of \(C\), together with the bounds imposed on \(C\) (or on \(C\) and \(X\)); these guarantee that \(C\bullet X\in L^1\) and allow pulling \(C_{n+1}\) out of the conditional expectation. The theorem follows from the (super)martingale property of \(X\) in the respective cases.

Definition 5.10

A random variable \(T\) with values in \(\{ 0,\, 1,\, 2,\, \dots \} \cup \{ \infty \} \) is a stopping time, if \(\forall n\le \infty \), the event \(\{ T\le n\} \in \mathcal F_n\). For a process \(X_n\), we call \(X^T_n:\, =X_{T\land n}\) the stopped process.

Here \(i\land j\) means the smaller of the two numbers \(i\) or \(j\). This definition expresses the property that we can decide whether the stopping time has arrived or not at time \(n\) just by looking at the history of the process(es) up to \(n\). Notice that \(n=\infty \) is included in the definition. As an elementary example, the time of the first Head in a sequence of coinflips is a stopping time, while the time one before the first Head is not.
Theorem 5.11

Let \(T\) be a stopping time. If \(X\) is a (super)martingale, then so is \(X^T\).

Proof.

Define

\[ C_n:\, ={\bf 1}\{ n\le T\} ={\bf 1}\{ n-1{\lt}T\} =1-{\bf 1}\{ T\le n-1\} , \]

which is \(\mathcal F_{n-1}\)-measurable by the right-hand side. In other words, \(C\) is predictable, which implies that \(C\bullet X\) is a (super)martingale. On the other hand, the sum with this \(C\) is telescopic:

\[ (C\bullet X)_n=\sum _{k=1}^n{\bf 1}\{ k\le T\} \cdot (X_k-X_{k-1})=\sum _{k=1}^{T\land n}(X_k-X_{k-1})=X_{T\land n}-X_0=X^T_n-X_0. \]
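The telescoping identity can be verified mechanically on a fixed path. In the sketch below, \(T\) is a fixed number rather than a genuine random stopping time; the path is also made up here, so only the algebra of the transform is illustrated.

```python
# Check (C . X)_n = X_{T and n} - X_0 for C_k = 1{k <= T} on one fixed path.
X = [0, 1, 0, -1, -2, -1, 0, 1, 2, 3]   # one path, X_0 = 0
T = 4                                    # pretend the path was stopped at time 4

for n in range(len(X)):
    # discrete stochastic integral with the indicator gambling strategy
    CdotX = sum((1 if k <= T else 0) * (X[k] - X[k - 1]) for k in range(1, n + 1))
    assert CdotX == X[min(T, n)] - X[0]  # the stopped process, shifted by X_0
print("telescoping identity verified on this path")
```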

Theorem 5.12 (Doob’s optional stopping theorem)

Let \(T\) be a stopping time and \(X\) a supermartingale. If either of the below holds:

  1. \(T\) is bounded,

  2. \(X\) is bounded and \(T\) is a.s. finite,

  3. \(X\) is of bounded increments: \(\exists K{\gt}0\): \(|X_n-X_{n-1}|\le K\) \(\forall n\ge 1\), and \(\operatorname{{\mathbb E}}T{\lt}\infty \),

then \(\operatorname{{\mathbb E}}X_T\le \operatorname{{\mathbb E}}X_0\). If \(M\) is a martingale and either of 1, 2, 3 holds, then \(\operatorname{{\mathbb E}}M_T=\operatorname{{\mathbb E}}M_0\).

Notice that one of the homework sheets (will) contain(s) an improved version of this theorem.

Proof.

In all cases \(T{\lt}\infty \) a.s., hence \(X_{T\land n}\underset {n\to \infty }\longrightarrow X_T\) a.s. We also know by Theorem 5.11 that \(\operatorname{{\mathbb E}}X_{T\land n}\le \operatorname{{\mathbb E}}X_0\). The question is whether we can pass this limit through the expectation. In the respective cases:

  1. for any \(n\) larger than the bound on \(T\), \(\operatorname{{\mathbb E}}X_0\ge \operatorname{{\mathbb E}}X_{T\land n}=\operatorname{{\mathbb E}}X_T\).

  2. the bound on \(X\) allows us to use Dominated convergence.

  3. in this case we have, for every \(n\),

    \[ |X_{T\land n}-X_0|=\Bigl|\sum _{k=1}^{T\land n}(X_k-X_{k-1})\Bigr|\le \sum _{k=1}^{T\land n}|X_k-X_{k-1}|\le K\cdot (T\land n)\le KT. \]

    As the right-hand side is independent of \(n\) and has finite mean, it can act as the dominating \(Y\) variable in Dominated convergence, which again allows us to pass the limit through the expectation. The proof for a martingale \(M\) is analogous.

Notice that we only used, hence it is sufficient to verify, the conditions on the stopped process \(X^T\).
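A classical application is the gambler's ruin probability: for a simple symmetric random walk started at \(0\) and stopped at \(T\), the first exit from \((-a,\, b)\), the stopped walk is bounded and \(T\) is a.s. finite, so case 2 gives \(\operatorname{{\mathbb E}}X_T=\operatorname{{\mathbb E}}X_0=0\), forcing \({\mathbb P}\{ \text{hit }b\} =a/(a+b)\). A simulation sketch (the levels, seed and tolerance are chosen here):

```python
import random

random.seed(3)
a, b, paths = 3, 7, 20000
hit_b = 0
for _ in range(paths):
    x = 0
    while -a < x < b:                 # run until the walk exits (-a, b)
        x += random.choice((-1, 1))
    hit_b += (x == b)

estimate = hit_b / paths
# optional stopping: 0 = E X_T = b P(hit b) - a (1 - P(hit b))
assert abs(estimate - a / (a + b)) < 0.02   # a/(a+b) = 0.3
print("P(hit b) ~", estimate)
```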

Corollary 5.13

If \(M\) is a martingale of bounded increments, \(C\) is bounded and predictable, \(T\in L^1\) is a stopping time, then \(\operatorname{{\mathbb E}}(C\bullet M)_T=0\).

When \(X\ge 0\), we can apply Fatou’s lemma instead of Dominated convergence to relax the conditions of optional stopping:
Theorem 5.14

If \(X\ge 0\) is a supermartingale and \(T\) is an a.s. finite stopping time, then \(\operatorname{{\mathbb E}}X_T\le \operatorname{{\mathbb E}}X_0\).

Proof.

In this case \(X_{T\land n}\ge 0\) and converges to \(X_T\) a.s., hence by Fatou’s lemma,

\[ \operatorname{{\mathbb E}}X_T=\operatorname{{\mathbb E}}\lim _{n\to \infty }X_{T\land n}=\operatorname{{\mathbb E}}\liminf _nX_{T\land n}\le \liminf _n\operatorname{{\mathbb E}}X_{T\land n}\le \liminf _n\operatorname{{\mathbb E}}X_0=\operatorname{{\mathbb E}}X_0. \]

The next lemma is useful for checking whether \(\operatorname{{\mathbb E}}T{\lt}\infty \) holds. It is enough to check that, no matter the history, stopping happens within a fixed time window with a probability bounded away from zero:

Lemma 5.15

Let \(T\) be a stopping time, and assume \(\exists N,\, \varepsilon {\gt}0\, :\, {\mathbb P}\{ T\le n+N\, |\, \mathcal F_n\} \ge \varepsilon \) a.s. for all \(n\ge 0\). Then \(\operatorname{{\mathbb E}}T{\lt}\infty \).

Proof.

For \(k\ge 1\),

\[ \begin{aligned} {\mathbb P}\{ T{\gt}kN\} & ={\mathbb P}\bigl\{ \{ T{\gt}kN\} \cap \{ T{\gt}(k-1)N\} \bigr\} ={\mathbb P}\{ T{\gt}kN\, |\, T{\gt}(k-1)N\} {\mathbb P}\{ T{\gt}(k-1)N\} \\ & \le (1-\varepsilon ){\mathbb P}\{ T{\gt}(k-1)N\} \end{aligned} \]

by the assumption, which recursively gives \({\mathbb P}\{ T{\gt}kN\} \le (1-\varepsilon )^k\). The sequence \({\mathbb P}\{ T{\gt}\ell \} \) is non-increasing, and its sum gives the expectation of \(T\) (check if you have not seen this before!). Hence

\[ \operatorname{{\mathbb E}}T=\sum _{\ell =0}^\infty {\mathbb P}\{ T{\gt}\ell \} \le N\sum _{k=0}^\infty {\mathbb P}\{ T{\gt}kN\} \le N\frac1{1-(1-\varepsilon )}=\frac N\varepsilon {\lt}\infty . \]
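The tail-sum identity \(\operatorname{{\mathbb E}}T=\sum _{\ell \ge 0}{\mathbb P}\{ T{\gt}\ell \} \) used in this proof is worth checking on a small example. Here is a sketch with a hypothetical bounded \(T\) (a fair die roll), done with exact fractions:

```python
from fractions import Fraction

# T uniform on {1,...,6} (a fair die roll); check E T = sum_{l>=0} P(T > l)
pmf = {k: Fraction(1, 6) for k in range(1, 7)}

# direct expectation
mean = sum(k * p for k, p in pmf.items())
# sum of tail probabilities P(T > l) for l = 0,...,5 (the tail vanishes beyond 5)
tail_sum = sum(sum(p for k, p in pmf.items() if k > l) for l in range(0, 6))

print(mean, tail_sum)  # both 7/2
```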

We turn to a few applications.

Example 5.16

A monkey repeatedly types any of the 26 letters of the English alphabet independently with equal chance. Let \(T\) be the number of letters that have been typed when the entire word “ABRACADABRA” first appears. We are after \(\operatorname{{\mathbb E}}T\).

First notice that this is a stopping time and is finite due to Lemma 5.15: no matter what happened before, with probability \(\varepsilon =26^{-11}\) the stopping occurs in at most \(N=11\) steps.

To proceed, we add a gambling process to this problem. Before each new letter is typed, a new gambler arrives and bets £1 on the letter “A”. If the new letter is something else, the gambler loses their bet and exits the game. If the new letter is “A”, then the gambler receives £26, all of which they bet on the next letter being “B”. If this next letter is something else, the whole bet is lost and the gambler exits the game. If this next letter is “B”, then the gambler receives £\(26^2\), all of which they bet on the third letter being “R”, and so on. If the gambler wins all the way down to “ABRACADABRA”, they exit the game with wealth £\(26^{11}\), and \(T\) is the total number of letters typed at this point. Otherwise the gambler loses all their money and exits the game. Meanwhile, a new gambler arrives at each step and starts the same betting strategy. Denote by \(X_n\) the combined wealth of all gamblers who are in play after the \(n^\text {th}\) letter has been typed. We claim that \(M_n:\, =X_n-n\) is a martingale.

To check this, notice that \(M_n\) is adapted to the natural filtration generated by the monkey’s letters, and is bounded for any given \(n\), hence is in \(L^1\). The martingale property is checked by first observing that the expected wealth of a gambler already in play does not change: their bet is lost with probability \(\frac{25}{26}\) and multiplied by \(26\) with probability \(\frac1{26}\), hence the mean value stays the same. A gambler already out of play stays with their 0 wealth, so that does not change either. However, a new gambler arrives, and their expected wealth after the next letter is \(26\cdot \frac1{26}=1\), which adds to the conditional expectation of \(X_{n+1}\). Therefore,

\[ \operatorname{{\mathbb E}}(M_{n+1}\, |\, \mathcal F_n)=\operatorname{{\mathbb E}}(X_{n+1}\, |\, \mathcal F_n)-(n+1)=X_n+1-(n+1)=M_n. \]

Notice that \(\operatorname{{\mathbb E}}M_1=\operatorname{{\mathbb E}}X_1-1=0\), hence we can extend the definition by \(M_0=0\).

The stopped martingale \(M^T\) is of bounded increments, and \(\operatorname{{\mathbb E}}T{\lt}\infty \), hence Optional stopping applies and gives

\[ 0=M_0=\operatorname{{\mathbb E}}M_T=\operatorname{{\mathbb E}}X_T-\operatorname{{\mathbb E}}T. \]

It remains to calculate \(X_T\). At the time of stopping, there are only three gamblers in play. The one who came at time \(T-11\) has wealth \(26^{11}\). Less lucky is the gambler who arrived at \(T-4\); they successfully bet on “ABRA” and have £\(26^4\). Finally, the gambler who just arrived at \(T\) and bet on “A” has £26. Thus,

\[ \operatorname{{\mathbb E}}T=\operatorname{{\mathbb E}}X_T=X_T=26^{11}+26^4+26\simeq 3.7\cdot 10^{15}. \]

Notice how repetitions in “ABRACADABRA” increase \(\operatorname{{\mathbb E}}T\) (why?).
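The same gambling argument works for any target word: at time \(T\) the gamblers still in play correspond exactly to the suffixes of the word that are also prefixes, so \(\operatorname{{\mathbb E}}T\) is the sum of \(26^{|s|}\) over such suffixes \(s\). A short sketch of this computation (the helper name is ours):

```python
def expected_typing_time(word: str, alphabet_size: int = 26) -> int:
    """Sum alphabet_size**len(s) over suffixes s of word that are also prefixes.

    These correspond to the gamblers still in play when the word first appears.
    """
    return sum(alphabet_size ** k
               for k in range(1, len(word) + 1)
               if word[-k:] == word[:k])

print(expected_typing_time("ABRACADABRA"))  # 26**11 + 26**4 + 26
```

The same function answers the “why?” above: a word with no self-overlaps, such as “ABRACADABRZ”, gives only the single term \(26^{11}\).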

Example 5.17

Let \(X_k\) be i.i.d. \(\pm 1\)-valued with equal chance. The sum \(S_n=\sum _{k=1}^nX_k\) is called a simple symmetric random walk (SSRW). Empty sums are zero, hence \(S_0=0\). Then \(S_n\) is a martingale, and

\[ T:\, =\inf \{ n\, :\, S_n=1\} \]

is a stopping time in the natural filtration of the walk. We will check in Example 5.18 that \(T\) is a.s. finite, and \(S_T=1\) a.s. Also, \(\operatorname{{\mathbb E}}S_{T\land n}=\operatorname{{\mathbb E}}S_0=0\) from the martingale property, as seen in Theorem 5.11. However, the conditions of Optional stopping are not met, and

\[ 1=\operatorname{{\mathbb E}}S_T\ne \operatorname{{\mathbb E}}S_0=0. \]

In particular, via case 3 of Optional stopping, this shows that \(\operatorname{{\mathbb E}}T=\infty \), since \(S_n\) is of bounded increments.

Example 5.18

Consider the SSRW and the stopping time \(T\) as in Example 5.17. We show \(T{\lt}\infty \) a.s., and even calculate its moment generating function. Finiteness of \(T\) implies recurrence of SSRW.

For any \(\theta {\gt}0\), the process

\[ M_n^\theta :\, =\frac{\text{\rm e}^{\theta S_n}}{(\operatorname{{cosh}}\theta )^n} \]

is a martingale. It is clearly adapted and finite for each \(n\), and the martingale property works out:

\[ \operatorname{{\mathbb E}}\bigl(M_{n+1}^\theta \, |\, \mathcal F_n\bigr)=\frac{\operatorname{{\mathbb E}}\bigl(\text{\rm e}^{\theta (S_n+X_{n+1})}\, |\, \mathcal F_n\bigr)}{(\operatorname{{cosh}}\theta )^{n+1}}=\frac{\text{\rm e}^{\theta S_n}}{(\operatorname{{cosh}}\theta )^{n+1}}\cdot \frac{\text{\rm e}^{\theta }+\text{\rm e}^{-\theta }}2=\frac{\text{\rm e}^{\theta S_n}}{(\operatorname{{cosh}}\theta )^n}=M^\theta _n. \]

We cannot rule out \(T=\infty \). However, on this event we know \(S_n\le 0\) for each \(n\), hence

\[ 0\le M^\theta _n\le \frac{\text{\rm e}^{\theta \cdot 0}}{(\operatorname{{cosh}}\theta )^n}\underset {n\to \infty }\longrightarrow 0 \]

by \(\cosh \theta {\gt}1\). On the event \(\{ T=\infty \} \) we therefore define \(M^\theta _T=0\) and have \(M^\theta _{T\land n}=M^\theta _n\underset {n\to \infty }\longrightarrow M^\theta _T\). We also have \(M^\theta _{T\land n}\underset {n\to \infty }\longrightarrow M^\theta _T\) on \(\{ T{\lt}\infty \} \).

By definition of \(T\),

\[ M_{T\land n}^\theta =\frac{\text{\rm e}^{\theta S_{T\land n}}}{(\operatorname{{cosh}}\theta )^{T\land n}}\le \frac{\text{\rm e}^{\theta }}{(\operatorname{{cosh}}\theta )^{T\land n}}\le \text{\rm e}^{\theta }, \]

which is a bound independent of \(n\), hence dominated convergence applies on the above limit:

\[ \begin{aligned} 1& =\lim _{n\to \infty }\operatorname{{\mathbb E}}M_{T\land n}^\theta =\operatorname{{\mathbb E}}\lim _{n\to \infty }M_{T\land n}^\theta =\operatorname{{\mathbb E}}M^\theta _T=\operatorname{{\mathbb E}}(M^\theta _T\, ;\, T{\lt}\infty )=\operatorname{{\mathbb E}}\Bigl(\frac{\text{\rm e}^{\theta S_T}}{(\operatorname{{cosh}}\theta )^T}\, ;\, T{\lt}\infty \Bigr)\\ & =\operatorname{{\mathbb E}}\Bigl(\frac{\text{\rm e}^{\theta }}{(\operatorname{{cosh}}\theta )^T}\, ;\, T{\lt}\infty \Bigr)=\operatorname{{\mathbb E}}\Bigl(\frac{\text{\rm e}^{\theta }}{(\operatorname{{cosh}}\theta )^T}\Bigr), \end{aligned} \]

where the martingale property \(\operatorname{{\mathbb E}}M_{T\land n}^\theta =M_0^\theta =1\), then \(M^\theta _T=0\) on \(\{ T=\infty \} \), then \(S_T=1\) on \(\{ T{\lt}\infty \} \), and finally \(\cosh \theta {\gt}1\) were used. From this we conclude

\[ \operatorname{{\mathbb E}}\Bigl(\frac1{(\operatorname{{cosh}}\theta )^T}\Bigr)=\text{\rm e}^{-\theta }. \tag{5.2} \]

Next we take \(\theta \searrow 0\), which makes \(\cosh \theta \searrow 1\). If \(T=\infty \), then \(\frac1{(\operatorname{{cosh}}\theta )^T}=0\) for any \(\theta {\gt}0\), hence its limit is 0 as well. However, if \(T{\lt}\infty \), then the limit is 1. Together,

\[ \lim _{\theta \searrow 0}\frac1{(\operatorname{{cosh}}\theta )^T}={\bf 1}\{ T{\lt}\infty \} . \]

As the bound \(0\le \frac1{(\operatorname{{cosh}}\theta )^T}\le 1\) holds no matter the value of \(T\), we can apply dominated convergence on the above limit:

\[ 1=\lim _{\theta \searrow 0}\text{\rm e}^{-\theta }=\lim _{\theta \searrow 0}\operatorname{{\mathbb E}}\Bigl(\frac1{(\operatorname{{cosh}}\theta )^T}\Bigr)=\operatorname{{\mathbb E}}\Bigl(\lim _{\theta \searrow 0}\frac1{(\operatorname{{cosh}}\theta )^T}\Bigr)=\operatorname{{\mathbb E}}{\bf 1}\{ T{\lt}\infty \} ={\mathbb P}\{ T{\lt}\infty \} , \]

which was our goal.

Substitute \(0{\lt}\alpha =\frac1{\operatorname{{cosh}}\theta }{\lt}1\) into (5.2) and solve for \(\text{\rm e}^{-\theta }\) to obtain (check!) the generating function of the random variable \(T\):

\[ \operatorname{{\mathbb E}}\alpha ^T=\frac{1-\sqrt{1-\alpha ^2}}\alpha . \]
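As a numerical sanity check, one can compare this generating function with the partial sums built from the known first-passage probabilities \({\mathbb P}\{ T=2k-1\} =C_{k-1}\, 2^{-(2k-1)}\), where \(C_{k-1}\) is a Catalan number (a standard fact, used here as an outside input):

```python
import math

def gen_fn(alpha: float) -> float:
    """E alpha^T for the first hitting time of 1 by the SSRW."""
    return (1 - math.sqrt(1 - alpha**2)) / alpha

def series(alpha: float, terms: int = 200) -> float:
    """Partial sum of P(T = 2k-1) * alpha^(2k-1), with Catalan probabilities
    P(T = 2k-1) = C_{k-1} * 2^{-(2k-1)}."""
    total = 0.0
    for k in range(1, terms + 1):
        catalan = math.comb(2 * k - 2, k - 1) // k   # C_{k-1}, exact integer
        total += catalan * (alpha / 2) ** (2 * k - 1)
    return total

print(gen_fn(0.5), series(0.5))  # the two values agree to high precision
```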

6 Martingale convergence

This section gives an introduction to yet another strength of the martingale property: convergence under rather general assumptions. We start by examining upcrossings:

Definition 6.1

Fix \(a{\lt}b\) real numbers and let \(X_n\) be a process. The upcrossing number until \(N\) is defined as

\[ U_N[a,\, b]:\, =\max \{ k\, :\, \exists 0\le s_1{\lt}t_1{\lt}s_2{\lt}t_2{\lt}\dots {\lt}s_k{\lt}t_k\le N\text{ with }X_{s_i}{\lt}a,\ X_{t_i}{\gt}b\text{ for all }1\le i\le k\} . \] This definition counts what the name suggests: how many times the process increases from below level \(a\) to above level \(b\) before time \(N\). To capture these increments, we also define

\[ C_0:\, =0,\qquad C_n:\, ={\bf 1}\{ C_{n-1}=1\} {\bf 1}\{ X_{n-1}\le b\} +{\bf 1}\{ C_{n-1}=0\} {\bf 1}\{ X_{n-1}{\lt}a\} \quad (n\ge 1). \]

Think about \(C_n\) as an on-off switch: once it is on (\(C_{n-1}=1\)), it stays on unless \(X\) exceeds level \(b\). Once it is off (\(C_{n-1}=0\)), it stays off until \(X\) descends below level \(a\). Notice that if \(X_n\) is adapted, then this \(C_n\) is predictable. Let

\[ Y_n:\, =(C\bullet X)_n=\sum _{k=1}^n(X_k-X_{k-1})C_k. \]

This captures the increments of \(X\) during the on periods of \(C\). It then follows that

\[ Y_N\ge (b-a)\cdot U_N[a,\, b]-(X_N-a)^-. \]

To see this, notice that any upcrossing contributes an increment of at least \(b-a\), all of which is captured by \(Y_N\). However, if \(X\) descends below \(a\) after the last upcrossing, that descent is also captured by \(Y_N\) and might contribute with a negative sign. This is taken care of by the last term.
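The upcrossing count can be traced in code; the boolean `below` plays the role of the switch \(C\): it turns on once the path drops below \(a\) and turns off, completing one upcrossing, once the path exceeds \(b\). A small sketch:

```python
def upcrossings(path, a, b):
    """U_N[a,b]: number of completed upcrossings of [a,b] along the path."""
    count, below = 0, False
    for x in path:
        if not below and x < a:
            below = True          # found some X_{s_i} < a
        elif below and x > b:
            below = False         # found X_{t_i} > b: one upcrossing done
            count += 1
    return count

# one dip below 0 then above 1, twice:
print(upcrossings([0, -1, 2, 1, -2, 3, 0], a=0, b=1))  # 2
```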

Lemma 6.2 (Doob’s upcrossing lemma)

If \(X_n\) is a supermartingale, then \((b-a)\operatorname{{\mathbb E}}U_N[a,\, b]\le \operatorname{{\mathbb E}}(X_N-a)^-\).

Proof.

By Theorem 5.9, \(Y_n\) is a supermartingale, hence \(\operatorname{{\mathbb E}}Y_N\le \operatorname{{\mathbb E}}Y_0=0\). Taking expectations in the inequality above then gives \((b-a)\operatorname{{\mathbb E}}U_N[a,\, b]\le \operatorname{{\mathbb E}}Y_N+\operatorname{{\mathbb E}}(X_N-a)^-\le \operatorname{{\mathbb E}}(X_N-a)^-\).

Corollary 6.3

Let \(a{\lt}b\), and let \(X_n\) be a supermartingale that is bounded in \(L^1\). (That is, \(\sup _n\operatorname{{\mathbb E}}|X_n|{\lt}\infty \).) Then

\[ (b-a)\operatorname{{\mathbb E}}U_\infty [a,\, b]\le |a|+\sup _m\operatorname{{\mathbb E}}|X_m|{\lt}\infty . \]

In particular, \(U_\infty [a,\, b]\) is a.s. finite.

Proof.

As \(U_N[a,\, b]\ge 0\) is monotone in \(N\), we have

\[ (b-a)\operatorname{{\mathbb E}}U_\infty [a,\, b]=(b-a)\operatorname{{\mathbb E}}\lim _NU_N[a,\, b]=(b-a)\lim _N\operatorname{{\mathbb E}}U_N[a,\, b]\le \lim _N\operatorname{{\mathbb E}}(X_N-a)^- \]

by monotone convergence. To finish the proof,

\[ \operatorname{{\mathbb E}}(X_N-a)^-\le \operatorname{{\mathbb E}}|X_N-a|\le \operatorname{{\mathbb E}}|X_N|+|a|\le |a|+\sup _m\operatorname{{\mathbb E}}|X_m|, \]

and the right-hand side is independent of \(N\).

Theorem 6.4 (Doob’s forward convergence theorem)

Let \(X_n\) be a supermartingale that is bounded in \(L^1\). Then \(X_\infty :\, =\lim _n X_n\) exists a.s. and is finite.

Proof.
\[ \begin{aligned} \{ X_n\text{ does not converge}\} =\{ \liminf _nX_n{\lt}\limsup _nX_n\} & =\bigcup _{\begin{smallmatrix} a{\lt}b \\ a,b\in {\mathbb Q} \end{smallmatrix}}\{ \liminf _nX_n{\lt}a{\lt}b{\lt}\limsup _nX_n\} \\ & \subseteq \bigcup _{\begin{smallmatrix} a{\lt}b \\ a,b\in {\mathbb Q} \end{smallmatrix}}\{ U_\infty [a,\, b]=\infty \} . \end{aligned} \]

The right-hand side is a countable union of zero-probability events (Corollary 6.3), hence has zero probability itself.

To see that the limit is a.s. finite, use Fatou’s lemma:

\[ \operatorname{{\mathbb E}}|X_\infty |=\operatorname{{\mathbb E}}\liminf _n|X_n|\le \liminf _n\operatorname{{\mathbb E}}|X_n|\le \sup _n\operatorname{{\mathbb E}}|X_n|{\lt}\infty , \]

which implies \(|X_\infty |{\lt}\infty \) a.s.

Notice that if \(X_n\ge 0\) is a supermartingale, then \(L^1\)-boundedness is automatic: \(\operatorname{{\mathbb E}}|X_n|=\operatorname{{\mathbb E}}X_n\le \operatorname{{\mathbb E}}X_0\).

Finally, we have a bit to say about \(L^2\)-martingales. The scalar product of \(L^2\)-random variables \(X\) and \(Y\) is defined by \(\langle X,\, Y\rangle :\, =\operatorname{{\mathbb E}}(XY)\). Check that this indeed defines a scalar product. If \(M_n\) is an \(L^2\)-martingale and \(k\le m{\lt}n\), then \(M_n-M_m\) is orthogonal to every \(\mathcal F_k\)-measurable \(L^2\) variable. Namely, for any \(\mathcal F_k\)-measurable random variable \(Y\in L^2\),

\[ \langle M_n-M_m,\, Y\rangle =\operatorname{{\mathbb E}}\bigl((M_n-M_m)Y\bigr)=\operatorname{{\mathbb E}}\operatorname{{\mathbb E}}\bigl((M_n-M_m)Y\, |\, \mathcal F_k\bigr)=\operatorname{{\mathbb E}}\bigl(Y\operatorname{{\mathbb E}}(M_n-M_m\, |\, \mathcal F_k)\bigr)=\operatorname{{\mathbb E}}\bigl(Y\cdot (M_k-M_k)\bigr)=0. \]

In particular, increments of an \(L^2\)-martingale in disjoint time intervals are orthogonal: for \(\ell {\lt}k\le m{\lt}n\), \(\langle M_n-M_m,\, M_k-M_\ell \rangle =0\). Another way of stating this is that these increments are uncorrelated. This is the main observation used in the next theorem.

Theorem 6.5

An \(L^2\) martingale \(M_n\) is bounded in \(L^2\) if and only if \(\sum _{k=1}^\infty \operatorname{{\mathbb E}}(M_k-M_{k-1})^2{\lt}\infty \). In this case \(M_n\to M_\infty \) a.s. and in \(L^2\).

Proof.

Due to the orthogonal increments as above, all cross terms of the square below have zero mean, and the following Pythagorean theorem holds:

\[ \operatorname{{\mathbb E}}M_n^2=\operatorname{{\mathbb E}}\Bigl(M_0+\sum _{k=1}^n(M_k-M_{k-1})\Bigr)^2=\operatorname{{\mathbb E}}M_0^2+\sum _{k=1}^n\operatorname{{\mathbb E}}(M_k-M_{k-1})^2\nearrow \operatorname{{\mathbb E}}M_0^2+\sum _{k=1}^\infty \operatorname{{\mathbb E}}(M_k-M_{k-1})^2 \]

as \(n\to \infty \). This proves the first statement: the left-hand side is bounded iff the infinite sum is finite.

If \(M_n\) is bounded in \(L^2\), then it is also bounded in \(L^1\) by Ljapunov’s inequality, and the forward convergence theorem provides the a.s. limit. To see the \(L^2\) convergence, use Fatou’s lemma and the Pythagorean theorem as

\[ \begin{aligned} \operatorname{{\mathbb E}}(M_\infty -M_n)^2=\operatorname{{\mathbb E}}\lim _{r\to \infty }(M_{n+r}-M_n)^2\le \liminf _r\operatorname{{\mathbb E}}(M_{n+r}-M_n)^2& =\liminf _r\sum _{k=n+1}^{n+r}\operatorname{{\mathbb E}}(M_k-M_{k-1})^2\\ & =\sum _{k=n+1}^\infty \operatorname{{\mathbb E}}(M_k-M_{k-1})^2\underset {n\to \infty }\longrightarrow 0 \end{aligned} \]

due to finiteness of the infinite sum.
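The Pythagorean identity \(\operatorname{{\mathbb E}}M_n^2=\operatorname{{\mathbb E}}M_0^2+\sum _{k=1}^n\operatorname{{\mathbb E}}(M_k-M_{k-1})^2\) can be verified exactly for a concrete martingale; here is a sketch with the SSRW (\(M_0=0\), i.i.d. \(\pm 1\) increments), enumerating all paths with exact arithmetic:

```python
from itertools import product
from fractions import Fraction

def pythagoras_holds(n: int) -> bool:
    """Check E M_n^2 = E M_0^2 + sum_k E(M_k - M_{k-1})^2 for the SSRW."""
    paths = list(product([-1, 1], repeat=n))
    weight = Fraction(1, 2 ** n)                     # each path equally likely
    lhs = sum(weight * sum(path) ** 2 for path in paths)              # E M_n^2
    rhs = sum(weight * sum(dx ** 2 for dx in path) for path in paths)  # sum of E(dM_k)^2
    return lhs == rhs == n   # both equal n for the SSRW

print(pythagoras_holds(5))  # True
```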

7 Doob decomposition

We briefly cover a very useful technique of separating out the martingale part of a stochastic process. This is called the Doob decomposition, and its continuous-time analogue, the Doob-Meyer decomposition, is the basis of stochastic integration.

Theorem 7.1 (Doob decomposition)

Let \((X_n)_{n\ge 0}\) be an adapted process in \(L^1\). Then

  1. there is

    • a martingale \((M_n)_{n\ge 0}\) with \(M_0=0\),

    • a predictable process \((A_n)_{n\ge 0}\) with \(A_0=0\)

    such that \(X_n=X_0+M_n+A_n\). This decomposition is almost everywhere unique in the sense that for any other pair \((\widehat M_n,\ \widehat A_n)_{n\ge 0}\) with the above properties we have \({\mathbb P}\{ M_n=\widehat M_n\text{ and }A_n=\widehat A_n\text{ for all }n\ge 0\} =1\).

  2. \((X_n)_{n\ge 0}\) is a submartingale if and only if \({\mathbb P}\{ A_n\le A_{n+1}\text{ for all }n\ge 0\} =1\) in the above decomposition.

Assuming this decomposition works, we can actually guess what \(A\) should be via a next-step analysis: since \(M\) is a martingale and \(A\) is predictable,

\[ \operatorname{{\mathbb E}}(X_{n+1}-X_n\, |\, \mathcal F_n)=\operatorname{{\mathbb E}}(M_{n+1}-M_n\, |\, \mathcal F_n)+\operatorname{{\mathbb E}}(A_{n+1}-A_n\, |\, \mathcal F_n)=A_{n+1}-A_n. \tag{7.1} \]

We start the proof by summing this display for the definition of \(A\).

Proof.

Define \(A_0=0\) and for \(n\ge 1\)

\[ A_n:\, =\sum _{k=1}^n\operatorname{{\mathbb E}}(X_k-X_{k-1}\, |\, \mathcal F_{k-1}). \]

This is predictable because each summand is \(\mathcal F_{k-1}\)-measurable with \(k-1\le n-1\), by the properties of the conditional expectation. Then let

\[ M_n:\, =X_n-X_0-A_n. \]

\(M_0=0\), \(M_n\in L^1\), and \(M\) being adapted are clear and, by separating the last term in the sum,

\[ \begin{aligned} \operatorname{{\mathbb E}}(M_n\, |\, \mathcal F_{n-1})& =\operatorname{{\mathbb E}}(X_n-X_0-A_n\, |\, \mathcal F_{n-1})=\operatorname{{\mathbb E}}(X_n\, |\, \mathcal F_{n-1})-X_0-\sum _{k=1}^n\operatorname{{\mathbb E}}(X_k-X_{k-1}\, |\, \mathcal F_{k-1})\\ & =X_{n-1}-X_0-\sum _{k=1}^{n-1}\operatorname{{\mathbb E}}(X_k-X_{k-1}\, |\, \mathcal F_{k-1})=X_{n-1}-X_0-A_{n-1}=M_{n-1}. \end{aligned} \]

Hence the above decomposition has the required properties. For a.e. uniqueness, notice that any decomposition has to satisfy (7.1), which leaves no other choice of \(A\), hence no other choice of \(M\) (up to zero-probability sets). Part 2 follows from \(A_n=X_n-X_0-M_n\):

\[ A_{n+1}-A_n=\operatorname{{\mathbb E}}(A_{n+1}\, |\, \mathcal F_n)-A_n=\operatorname{{\mathbb E}}(X_{n+1}\, |\, \mathcal F_n)-X_0-\operatorname{{\mathbb E}}(M_{n+1}\, |\, \mathcal F_n)-X_n+X_0+M_n=\operatorname{{\mathbb E}}(X_{n+1}\, |\, \mathcal F_n)-X_n. \]
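As a toy illustration (a hypothetical example, not from the notes): if \(X_n\) is a sum of \(n\) i.i.d. Bernoulli(\(p\)) variables, the defining sum for \(A_n\) gives \(A_n=np\), hence \(M_n=X_n-np\). The martingale property of \(M\) can be verified exactly by enumerating all short paths:

```python
from itertools import product
from fractions import Fraction

def martingale_check(p: Fraction, n: int) -> bool:
    """For X_n a sum of i.i.d. Bernoulli(p): Doob decomposition A_n = n*p,
    M_n = X_n - n*p. Verify E(M_{n+1} | F_n) = M_n on every length-n path."""
    for path in product([0, 1], repeat=n):
        x_n = sum(path)
        m_n = x_n - n * p
        # condition on the path: next increment is 1 w.p. p, 0 w.p. 1-p
        cond = (1 - p) * (x_n - (n + 1) * p) + p * (x_n + 1 - (n + 1) * p)
        if cond != m_n:
            return False
    return True

print(martingale_check(Fraction(1, 3), 3))  # True
```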

Definition 7.2

Let \((M_n)_{n\ge 0}\) be an \(L^2\)-martingale with \(M_0=0\). Then \(M^2\) has the a.e. unique Doob decomposition into a martingale \(N\) and a predictable process \(A\):

\[ M^2=N+A. \]

The process \(A\) is often denoted as \(\langle M\rangle \), and is called the brackets process of \(M\).

It is easy to check, and we will see later, that \((M^2_n)_{n\ge 0}\) is a submartingale. It therefore follows that \(A\) is a.s. non-decreasing, with an a.s. limit \(A_\infty :\, =\lim _{n\to \infty }A_n\). Since \(\operatorname{{\mathbb E}}M_n^2=\operatorname{{\mathbb E}}A_n\), we have that \(M\) is bounded in \(L^2\) if and only if \(\operatorname{{\mathbb E}}A_\infty {\lt}\infty \).

We also note that

\[ A_n-A_{n-1}=\operatorname{{\mathbb E}}(M_n^2-M_{n-1}^2\, |\, \mathcal F_{n-1})=\operatorname{{\mathbb E}}\bigl((M_n-M_{n-1})^2\, |\, \mathcal F_{n-1}\bigr). \]

8 Uniform integrability

We have seen before that \(L^p\) convergence is stronger than convergence in probability. However, there is a condition that allows us to conclude \(L^p\) convergence from convergence in probability. This is explored below, and will be used later for martingales.

Notice that by monotone convergence, for a single random variable \(X\in L^p\),

\[ \lim _{c\to \infty }\operatorname{{\mathbb E}}\bigl(|X|^p\, ;\, |X|\ge c\bigr)=0. \tag{8.1} \]

This helps in understanding the following definition.

Definition 8.1

A sequence \(X_n\) of random variables is \(p^\text {th}\) power uniformly integrable, if

\[ \lim _{c\to \infty }\sup _n\operatorname{{\mathbb E}}\bigl(|X_n|^p\, ;\, |X_n|\ge c\bigr)=0. \] The following lemma will help exploiting uniform integrability.
Lemma 8.2

Let \(X\in L^1\). Then \(\forall \varepsilon {\gt}0\) \(\exists \delta {\gt}0\) such that \(\forall F\in \mathcal F\) event with \({\mathbb P}(F){\lt}\delta \), \(\operatorname{{\mathbb E}}\bigl(|X|\, ;\, F\bigr){\lt}\varepsilon \) holds.

Proof.

By contradiction, assume that there is an \(\varepsilon {\gt}0\) and a sequence \(F_n\) of events such that \({\mathbb P}(F_n){\lt}2^{-n}\), but \(\operatorname{{\mathbb E}}\bigl(|X|\, ;\, F_n\bigr)\ge \varepsilon \). Denote \(H:\, =\limsup _n F_n\). Then, on one hand, Borel-Cantelli 1 implies that \({\mathbb P}(H)=0\). On the other hand, an application of Fatou’s lemma on \(-|X|\cdot {\bf 1}_{F_n}\) gives

\[ \operatorname{{\mathbb E}}\bigl(|X|\, ;\, H\bigr)=\operatorname{{\mathbb E}}\bigl(\limsup _n|X|\cdot {\bf 1}_{F_n}\bigr)\ge \limsup _n\operatorname{{\mathbb E}}\bigl(|X|\, ;\, F_n\bigr)\ge \varepsilon , \]

which is a contradiction.

We can reprove (8.1) with this lemma. If \(X\in L^p\), fix \(Y=|X|^p\in L^1\), \(\varepsilon {\gt}0\), and \(\delta \) for this \(Y\) as in Lemma 8.2. For this \(\delta \), via Markov’s inequality, there is a large enough \(K\), such that

\[ {\mathbb P}\{ |X|\ge K\} ={\mathbb P}\{ |X|^p\ge K^p\} \le \frac{\operatorname{{\mathbb E}}|X|^p}{K^p}{\lt}\delta . \]

Then, with \(F=\{ |X|\ge K\} \), the lemma says

\[ \operatorname{{\mathbb E}}\bigl(|X|^p\, ;\, |X|\ge K\bigr)=\operatorname{{\mathbb E}}(Y\, ;\, F){\lt}\varepsilon . \]

That is, by picking large enough \(K\), we could bring the expectation below \(\varepsilon \). This is equivalent to (8.1).

With the help of the above, we can now go from in probability convergence and uniform integrability to \(L^p\) convergence.

Theorem 8.3

Let \(p\ge 1\), suppose \(X,\, X_1,\, X_2,\, \ldots \in L^p\), and \(X_n\overset {{\mathbb P}}{\longrightarrow }X\). Then \(4\Rightarrow 1\Leftrightarrow 2\Leftrightarrow 3\Leftarrow 5\), where

  1. \(X_n\overset {L^p}{\longrightarrow }X\);

  2. \(X_n\) is \(p^\text {th}\) power uniformly integrable;

  3. \(\operatorname{{\mathbb E}}|X_n|^p\to \operatorname{{\mathbb E}}|X|^p\);

  4. there exists a \(p{\lt}q{\lt}\infty \), such that \(\sup _n\operatorname{{\mathbb E}}|X_n|^q{\lt}\infty \);

  5. there exists a \(Y\in L^p\), such that \(\forall n,\quad |X_n|{\lt}Y\).

A partial proof of this statement is given below.

Proof of \(1\Rightarrow 2\), for \(p=1\).

Fix \(\varepsilon {\gt}0\), we seek \(K\) such that \(\forall n\), \(\operatorname{{\mathbb E}}\bigl(|X_n|\, ;\, |X_n|\ge K\bigr)\le \varepsilon \).

By the assumed \(L^1\)-convergence, there is an \(N\) such that \(\operatorname{{\mathbb E}}|X-X_n|{\lt}\frac\varepsilon 2\) whenever \(n{\gt}N\). Lemma 8.2 provides positive \(\delta _0,\, \delta _1,\, \delta _2,\, \dots ,\, \delta _N\) such that

\[ \forall F\text{, if }{\mathbb P}(F){\lt}\delta _0\text{, then }\operatorname{{\mathbb E}}\bigl(|X|\, ;\, F\bigr){\lt}\frac\varepsilon 2,\qquad \forall F\text{, if }{\mathbb P}(F){\lt}\delta _n\text{, then }\operatorname{{\mathbb E}}\bigl(|X_n|\, ;\, F\bigr){\lt}\varepsilon \]

for \(1\le n\le N\). Set \(\delta =\min \{ \delta _0,\, \delta _1,\, \dots ,\, \delta _N\} \) and notice that this is still positive.

When \(n\le N\), define \(F_n=\{ |X_n|\ge K\} \), and pick \(K\) large enough that \({\mathbb P}(F_n){\lt}\delta \) for each \(1\le n\le N\). This is possible because only finitely many of the \(X_n\)’s are involved in this case. By the above, we then have \(\operatorname{{\mathbb E}}\bigl(|X_n|\, ;\, |X_n|\ge K\bigr){\lt}\varepsilon \).

When \(n{\gt}N\), we argue as follows. The assumed \(L^1\) convergence implies boundedness in \(L^1\): \(\sup _r\operatorname{{\mathbb E}}|X_r|{\lt}\infty \). By increasing \(K\) if necessary, we can achieve

\[ \frac{\sup _r\operatorname{{\mathbb E}}|X_r|}K{\lt}\delta . \tag{8.2} \]

A triangle inequality gives (with common factor \({\bf 1}\{ |X_n|\ge K\} \))

\[ \operatorname{{\mathbb E}}\bigl(|X_n|\, ;\, |X_n|\ge K\bigr)\le \operatorname{{\mathbb E}}\bigl(|X|\, ;\, |X_n|\ge K\bigr)+\operatorname{{\mathbb E}}\bigl(|X_n-X|\, ;\, |X_n|\ge K\bigr)\le \operatorname{{\mathbb E}}\bigl(|X|\, ;\, |X_n|\ge K\bigr)+\operatorname{{\mathbb E}}|X_n-X|. \]

For the first term, take \(F_n=\{ |X_n|\ge K\} \), and notice \({\mathbb P}(F_n)\le \frac{\operatorname{{\mathbb E}}|X_n|}K{\lt}\delta \) due to (8.2). The choice we made with \(\delta \le \delta _0\) then bounds this term by \(\frac\varepsilon 2\). The second term is also bounded by \(\frac\varepsilon 2\) due to our initial choice of \(N\) and \(n{\gt}N\).

Proof of \(2\Rightarrow 1\), for \(p=1\).

Define the cutoff function

\[ \varphi _K(x)=\left\{ \begin{aligned} & K,& & \text{if }x{\gt}K,\\ & x,& & \text{if }-K\le x\le K,\\ & -K,& & \text{if }x{\lt}-K. \end{aligned} \right. \]

Fix \(\varepsilon {\gt}0\) and notice that by the assumed uniform integrability and by (8.1), there is a \(K\) for which

\[ \operatorname{{\mathbb E}}\bigl|\varphi _K(X_n)-X_n\bigr|{\lt}\frac\varepsilon 3,\qquad \text{and}\qquad \operatorname{{\mathbb E}}\bigl|\varphi _K(X)-X\bigr|{\lt}\frac\varepsilon 3. \]

Also, \(\bigl|\varphi _K(x)-\varphi _K(y)\bigr|\le |x-y|\) for any \(x,\, y\in \mathbb R\), hence \(\varphi _K(X_n)\overset {{\mathbb P}}{\longrightarrow }\varphi _K(X)\) holds via \(X_n\overset {{\mathbb P}}{\longrightarrow }X\) (check!). As \(|\varphi _K(\cdot )|\le K\), Dominated convergence implies \(\operatorname{{\mathbb E}}\bigl|\varphi _K(X_n)-\varphi _K(X)\bigr|\to 0\); in particular this can be brought below \(\frac\varepsilon 3\) for large \(n\). Combining via the triangle inequality,

\[ \operatorname{{\mathbb E}}|X_n-X|\le \operatorname{{\mathbb E}}\bigl|X_n-\varphi _K(X_n)\bigr|+\operatorname{{\mathbb E}}\bigl|\varphi _K(X_n)-\varphi _K(X)\bigr|+\operatorname{{\mathbb E}}\bigl|\varphi _K(X)-X\bigr|{\lt}\varepsilon . \]
Proof of \(1\Rightarrow 3\).

This is just two triangle inequalities:

\[ \begin{aligned} ||X_n||_p& =||X_n-X+X||_p\le ||X_n-X||_p+||X||_p\\ ||X||_p& =||X-X_n+X_n||_p\le ||X-X_n||_p+||X_n||_p\text{, i.e.,}\\ ||X_n||_p& \ge ||X||_p-||X-X_n||_p. \end{aligned} \]

As we assumed \(L^p\) convergence, \(||X-X_n||_p\to 0\). This results in

\[ ||X||_p\le \liminf _n||X_n||_p\le \limsup _n||X_n||_p\le ||X||_p, \]

that is liminf and limsup agree and \(||X_n||_p\to ||X||_p\).

\(3\Rightarrow 1\) is called Scheffé’s theorem; it will not be used later on, and its proof is somewhat tedious, therefore it is skipped. Those interested can ask me about it in drop-in sessions.

Proof of \(4\Rightarrow 2\).

Given \(p{\lt}q\), we start with a little exercise in algebra. Set

\[ p_\text H=\frac qp{\gt}1,\qquad q_\text H=\frac q{q-p}{\gt}1,\qquad \text{and check }\frac1{p_\text H}+\frac1{q_\text H}=1. \]

Apply Hölder’s inequality to the variables \(|X_n|^p\) and \({\bf 1}\{ |X_n|\ge c\} \), with these \(p_\text H\) and \(q_\text H\) parameters:

\[ \operatorname{{\mathbb E}}\bigl(|X_n|^p\, ;\, |X_n|\ge c\bigr)\le \bigl(\operatorname{{\mathbb E}}|X_n|^q\bigr)^\frac pq\cdot \bigl(\operatorname{{\mathbb E}}{\bf 1}\{ |X_n|\ge c\} ^\frac q{q-p}\bigr)^\frac {q-p}q=\bigl(\operatorname{{\mathbb E}}|X_n|^q\bigr)^\frac pq\cdot \bigl({\mathbb P}\{ |X_n|\ge c\} \bigr)^\frac {q-p}q. \]

Here we also used that raising an indicator to a positive power does not change anything. Markov’s inequality gives \({\mathbb P}\{ |X_n|\ge c\} ={\mathbb P}\{ |X_n|^q\ge c^q\} \le \frac{\operatorname{{\mathbb E}}|X_n|^q}{c^q}\), which further bounds the above by

\[ \bigl(\operatorname{{\mathbb E}}|X_n|^q\bigr)^\frac pq\cdot \frac{\bigl(\operatorname{{\mathbb E}}|X_n|^q\bigr)^\frac {q-p}q}{c^{q-p}}=\frac{\operatorname{{\mathbb E}}|X_n|^q}{c^{q-p}}. \]

Under the assumptions of 4, taking \(\lim _{c\to \infty }\sup _n\) brings this to 0.
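The chain of bounds just derived, \(\operatorname{{\mathbb E}}(|X_n|^p\, ;\, |X_n|\ge c)\le \operatorname{{\mathbb E}}|X_n|^q/c^{q-p}\), can be checked on a concrete discrete distribution (the values below are hypothetical), with exact arithmetic:

```python
from fractions import Fraction

# hypothetical discrete distribution: value -> probability
dist = {1: Fraction(1, 2), 3: Fraction(1, 4), 10: Fraction(1, 4)}
p, q, c = 1, 2, 5

# left-hand side: E(|X|^p ; |X| >= c)
truncated = sum(prob * abs(x) ** p for x, prob in dist.items() if abs(x) >= c)
# right-hand side: E|X|^q / c^(q-p)
holder_bound = sum(prob * abs(x) ** q for x, prob in dist.items()) / Fraction(c ** (q - p))

print(truncated, holder_bound, truncated <= holder_bound)
```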

Proof of \(5\Rightarrow 1\).

This is just an application of Dominated convergence.

9 Uniformly integrable martingales

Uniform integrability gives further powerful tools with martingales:

Theorem 9.1

Let \(M_n\) be a uniformly integrable martingale. Then \(M_\infty =\lim _{n\to \infty }M_n\) exists a.s. and in \(L^1\), and \(M_n=\operatorname{{\mathbb E}}(M_\infty \, |\, \mathcal F_n)\) a.s.

Proof.

Due to uniform integrability, we can pick a \(c{\gt}0\) for which \(\sup _n\operatorname{{\mathbb E}}(|M_n|\, ;\, |M_n|\ge c)\le 1\). Then,

\[ \operatorname{{\mathbb E}}|M_n|=\operatorname{{\mathbb E}}(|M_n|\, ;\, |M_n|\ge c)+\operatorname{{\mathbb E}}(|M_n|\, ;\, |M_n|{\lt}c)\le 1+c, \]

where we used the simple algebraic fact that \(|x|\cdot {\bf 1}\{ |x|{\lt}c\} \le c\) for any real \(x\). As the right-hand side is independent of \(n\), it follows that \(M_n\) is bounded in \(L^1\), thus converges a.s. to a finite limit by Doob’s forward convergence. That implies convergence in probability, which in turn gives \(L^1\) convergence when uniform integrability is added (Theorem 8.3).

For the last bit, fix \(r{\gt}n{\gt}0\), \(F\in \mathcal F_n\):

\[ |\operatorname{{\mathbb E}}(M_n\, ;\, F)-\operatorname{{\mathbb E}}(M_\infty \, ;\, F)|=|\operatorname{{\mathbb E}}(M_r\, ;\, F)-\operatorname{{\mathbb E}}(M_\infty \, ;\, F)|=|\operatorname{{\mathbb E}}(M_r-M_\infty \, ;\, F)|\le \operatorname{{\mathbb E}}\bigl(|M_r-M_\infty |\, ;\, F\bigr)\le \operatorname{{\mathbb E}}|M_r-M_\infty | \]

a.s., where first the martingale property (check how!), then Jensen’s inequality on the \(|\cdot |\) function was used. By the \(L^1\) convergence, the right-hand side goes to 0 as \(r\to \infty \). As the left-hand side has no \(r\) in it, it is therefore zero: \(\operatorname{{\mathbb E}}(M_n\, ;\, F)=\operatorname{{\mathbb E}}(M_\infty \, ;\, F)\). This proves \(M_n=\operatorname{{\mathbb E}}(M_\infty \, |\, \mathcal F_n)\) a.s. due to the Kolmogorov definition of conditional expectations and \(M_n\) being \(\mathcal F_n\)-measurable.

The next theorem is the reverse statement in some sense.

Theorem 9.2 (Lévy’s upwards theorem)

Let \(\xi \in L^1\), \((\mathcal F_n)_n\) be a filtration, and \(M_n:\, =\operatorname{{\mathbb E}}(\xi \, |\, \mathcal F_n)\) (this is well defined almost everywhere). Then \(M_n\) is a uniformly integrable martingale, \(M_n\) converges a.s. and in \(L^1\) to a limit \(M_\infty \), and \(M_\infty =\operatorname{{\mathbb E}}(\xi \, |\, \mathcal F_\infty )\) a.s.

Proof.

The proof has three parts.

1. That \(M_n\) is a martingale is a simple application of the Tower rule.

2. Next we show that \(M_n\) is uniformly integrable. Fix \(\varepsilon {\gt}0\) and \(\delta {\gt}0\) for \(\xi \) as in Lemma 8.2. Then pick \(K{\gt}\frac{\operatorname{{\mathbb E}}|\xi |}\delta \). An application of Markov’s inequality, Jensen’s inequality on the conditional expectation, then the Tower rule shows

\[ {\mathbb P}\{ |M_n|\ge K\} \le \frac{\operatorname{{\mathbb E}}|M_n|}K=\frac{\operatorname{{\mathbb E}}\bigl|\operatorname{{\mathbb E}}(\xi \, |\, \mathcal F_n)\bigr|}K\le \frac{\operatorname{{\mathbb E}}\operatorname{{\mathbb E}}\bigl(|\xi |\, \big|\, \mathcal F_n\bigr)}K=\frac{\operatorname{{\mathbb E}}|\xi |}K{\lt}\delta , \]

making \(\{ |M_n|\ge K\} \) a suitable event for Lemma 8.2. We apply this lemma in the last step below, following conditional Jensen again, the fact that \(\{ |M_n|\ge K\} \in \mathcal F_n\) and the Tower rule:

\[ \begin{aligned} \operatorname{{\mathbb E}}[|M_n|\, ;\, |M_n|\ge K]& =\operatorname{{\mathbb E}}\bigl[\bigl|\operatorname{{\mathbb E}}(\xi \, |\, \mathcal F_n)\bigr|\, ;\, |M_n|\ge K\bigr]\\ & \le \operatorname{{\mathbb E}}\bigl[\operatorname{{\mathbb E}}\bigl(|\xi |\, \big|\, \mathcal F_n\bigr)\, ;\, |M_n|\ge K\bigr]\\ & =\operatorname{{\mathbb E}}\operatorname{{\mathbb E}}\bigl(|\xi |\, ;\, |M_n|\ge K\, \big|\, \mathcal F_n\bigr)=\operatorname{{\mathbb E}}\bigl(|\xi |\, ;\, |M_n|\ge K\bigr)\le \varepsilon . \end{aligned} \]

The right-hand side has no \(n\), hence uniform integrability is proved. This implies existence of the limit \(M_\infty \).

3. We need to show that this limit \(M_\infty \) a.s. coincides with \(\eta :\, =\operatorname{{\mathbb E}}(\xi \, |\, \mathcal F_\infty )\). We do this for the case \(\xi \ge 0\); the decomposition \(\xi =\xi ^+-\xi ^-\) then provides the general proof. For any set \(F\in \mathcal F_\infty \), let

\[ {\mathbb Q}_1(F):\, =\operatorname{{\mathbb E}}(\eta \, ;\, F)\qquad \text{and}\qquad {\mathbb Q}_2(F):\, =\operatorname{{\mathbb E}}(M_\infty \, ;\, F), \]

these are measures on \((\Omega ,\, \mathcal F_\infty )\). If \(F\in \mathcal F_n\), then Tower rules, measurability (recall \(\mathcal F_n\subseteq \mathcal F_\infty \)), and \(M_n=\operatorname{{\mathbb E}}(M_\infty \, |\, \mathcal F_n)\) from the previous theorem imply

\[ \begin{aligned} {\mathbb Q}_1(F)=\operatorname{{\mathbb E}}(\eta \, ;\, F)& =\operatorname{{\mathbb E}}\bigl(\operatorname{{\mathbb E}}(\xi \, |\, \mathcal F_\infty )\, ;\, F\bigr)=\operatorname{{\mathbb E}}\operatorname{{\mathbb E}}(\xi \, ;\, F\, |\, \mathcal F_\infty )=\operatorname{{\mathbb E}}(\xi \, ;\, F)\\ & =\operatorname{{\mathbb E}}\operatorname{{\mathbb E}}(\xi \, ;\, F\, |\, \mathcal F_n)=\operatorname{{\mathbb E}}\bigl(\operatorname{{\mathbb E}}(\xi \, |\, \mathcal F_n)\, ;\, F\bigr)=\operatorname{{\mathbb E}}(M_n\, ;\, F)\\ & =\operatorname{{\mathbb E}}\bigl(\operatorname{{\mathbb E}}(M_\infty \, |\, \mathcal F_n)\, ;\, F\bigr)=\operatorname{{\mathbb E}}\operatorname{{\mathbb E}}(M_\infty \, ;\, F\, |\, \mathcal F_n)=\operatorname{{\mathbb E}}(M_\infty \, ;\, F)={\mathbb Q}_2(F). \end{aligned} \]

This shows that \({\mathbb Q}_1\) agrees with \({\mathbb Q}_2\) on \(\bigcup _n\mathcal F_n\), hence also on \(\mathcal F_\infty =\sigma \Bigl(\bigcup _n\mathcal F_n\Bigr)\). Since \(\{ \eta {\gt}M_\infty \} \in \mathcal F_\infty \),

\[ 0={\mathbb Q}_1\{ \eta {\gt}M_\infty \} -{\mathbb Q}_2\{ \eta {\gt}M_\infty \} =\operatorname{{\mathbb E}}(\eta -M_\infty \, ;\, \eta {\gt}M_\infty ), \]

therefore \({\mathbb P}\{ \eta {\gt}M_\infty \} =0\). In a similar way \({\mathbb P}\{ \eta {\lt}M_\infty \} =0\), which completes the proof.

Next we prove an important theorem in probability using the machinery built up so far.

Definition 9.3

Let \(X_1,\, X_2,\, \dots \) be random variables, and \(\mathcal T_n:\, =\sigma (X_{n+1},\, X_{n+2},\, \dots )\). The tail \(\sigma \)-algebra is \(\mathcal T:\, =\bigcap _n\mathcal T_n\).

An event in \(\mathcal T\) is \(\mathcal T_n\)-measurable for every \(n\). In other words, it does not depend on any finite number of changes in the sequence \(X_1,\, X_2,\, \dots \).
Theorem 9.4 (Kolmogorov’s 0-1 law)

The tail \(\sigma \)-algebra of independent variables is trivial. That is, with the above definition, for any \(F\in \mathcal T\) we have \({\mathbb P}(F)=0\) or \(1\).

It is important here that the random variables are independent.

Proof.

Let, as before, \(\mathcal F_n=\sigma (X_1,\, X_2,\, \dots ,\, X_n)\) and \(F\in \mathcal T\). Define \(\eta ={\bf 1}_F\). As \(F\in \mathcal T\subseteq \mathcal F_\infty \), \(\eta \) is \(\mathcal F_\infty \)-measurable, and is of course in \(L^1\). Apply Lévy’s upward theorem:

\[ \eta =\operatorname{{\mathbb E}}(\eta \, |\, \mathcal F_\infty )=\lim _{n\to \infty }\operatorname{{\mathbb E}}(\eta \, |\, \mathcal F_n). \]

However, \(F\in \mathcal T\subseteq \mathcal T_{n}\), which is independent of \(\mathcal F_n\), as these are generated by a disjoint set of independent random variables. It follows that

\[ {\bf 1}_F=\eta =\lim _{n\to \infty }\operatorname{{\mathbb E}}(\eta \, |\, \mathcal F_n)=\lim _{n\to \infty }\operatorname{{\mathbb E}}\eta =\operatorname{{\mathbb E}}\eta ={\mathbb P}(F), \]

which shows \({\mathbb P}(F)\) is either 0 or 1.

Example 9.5

For a SSRW \(S_n=\sum _{k=1}^nX_k\), let \(F:\, =\{ \frac{S_n}n\to v\} \) be the event that the walk has asymptotic velocity \(v\). Changing any finite number of the i.i.d. \(X_k\)’s does not influence the liminf, nor the limsup, of \(\frac{S_n}n\), hence \(F\) is in the tail \(\sigma \)-algebra of the \(X_k\)’s. It follows that \(F\) is trivial: it has probability either 0 or 1. Indeed, the Strong law of large numbers states that \({\mathbb P}(F)=1\) when \(v=0\), and zero in all other cases.
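The zero asymptotic velocity can be glimpsed numerically. A minimal simulation sketch (walk length and seed chosen arbitrarily for illustration):

```python
import random

random.seed(1)

def velocity_estimate(n):
    """Simulate n steps of a simple symmetric random walk and return S_n / n."""
    s = 0
    for _ in range(n):
        s += random.choice((-1, 1))
    return s / n

# The asymptotic velocity of the SSRW is 0 a.s., so S_n / n should be
# small for large n (its typical size is of order n^{-1/2}).
v = velocity_estimate(100_000)
print(v)
```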

Recall that a filtration is an increasing system of \(\sigma \)-algebras and a natural interpretation is the expanding information collected from observing a process starting at time zero. In what follows we still have a filtration, but the \(\sigma \)-algebras are indexed up to time \(-1\), rather than starting from time \(0\) as before. Accordingly, the limit is taken as the \(\sigma \)-algebras decrease, rather than increase which is what has been done so far. An application follows further down.

Theorem 9.6 (Lévy’s downward theorem)

Let \(\mathcal G_{-n}\), \(n\ge 1\), be \(\sigma \)-algebras with

\[ \mathcal G_{-n}\subseteq \mathcal G_{-n+1}\subseteq \mathcal G_{-n+2}\subseteq \dots \subseteq \mathcal G_{-1},\qquad \text{and}\qquad \mathcal G_{-\infty }:\, =\bigcap _{n=1}^\infty \mathcal G_{-n}. \]

Let \(\gamma \in L^1\), and \(M_{-n}:\, =\operatorname{{\mathbb E}}(\gamma \, |\, \mathcal G_{-n})\). Then \(M_{-\infty }:\, =\lim _{n\to \infty }M_{-n}\) exists a.s. and in \(L^1\), and \(M_{-\infty }=\operatorname{{\mathbb E}}(\gamma \, |\, \mathcal G_{-\infty })\) a.s.

Proof.

The martingale property \(\operatorname{{\mathbb E}}(M_{-n+1}\, |\, \mathcal G_{-n})=M_{-n}\) is checked the very same way as in Example 5.6, except the time index is negative.

Uniform integrability works the same way as in Lévy’s upward theorem. This implies \(L^1\)-boundedness, and the upcrossing proof works as well to show a.s. convergence. Together with uniform integrability, \(L^1\) convergence follows.

Finally, to show \(M_{-\infty }=\operatorname{{\mathbb E}}(\gamma \, |\, \mathcal G_{-\infty })\) a.s., fix \(r{\gt}0\) and an event \(G\in \mathcal G_{-\infty }\subseteq \mathcal G_{-r}\). Then

\[ \operatorname{{\mathbb E}}(\gamma \, ;\, G)=\operatorname{{\mathbb E}}\operatorname{{\mathbb E}}(\gamma \, ;\, G\, |\, \mathcal G_{-r})=\operatorname{{\mathbb E}}\bigl(\operatorname{{\mathbb E}}(\gamma \, |\, \mathcal G_{-r})\, ;\, G\bigr)=\operatorname{{\mathbb E}}(M_{-r}\, ;\, G). \]

Further, due to Jensen’s inequality and the \(L^1\) convergence,

\[ |\operatorname{{\mathbb E}}(M_{-r}\, ;\, G)-\operatorname{{\mathbb E}}(M_{-\infty }\, ;\, G)|=\bigl|\operatorname{{\mathbb E}}(M_{-r}-M_{-\infty }\, ;\, G)\bigr|\le \operatorname{{\mathbb E}}\bigl(|M_{-r}-M_{-\infty }|\, ;\, G\bigr)\le \operatorname{{\mathbb E}}|M_{-r}-M_{-\infty }|\underset {r\to \infty }\longrightarrow 0. \]

Hence \(\operatorname{{\mathbb E}}(\gamma \, ;\, G)=\operatorname{{\mathbb E}}(M_{-\infty }\, ;\, G)\), and \(M_{-\infty }=\operatorname{{\mathbb E}}(\gamma \, |\, \mathcal G_{-\infty })\) a.s. follows from \(M_{-\infty }\) being \(\mathcal G_{-\infty }\)-measurable and the Kolmogorov definition of conditional expectations.

As an application, here is the proof of the Strong law of large numbers using martingales.

Theorem 9.7 (Strong law of large numbers)

Let \(X_k\) be i.i.d. random variables in \(L^1\), and \(S_n=\sum _{k=1}^nX_k\). Then \(\frac{S_n}n\to \operatorname{{\mathbb E}}X_1\) a.s. and in \(L^1\).

Proof.

Let \(\mathcal G_{-n}=\sigma (S_n,\, S_{n+1},\, \dots )\), and notice that this system satisfies the conditions of the Downward theorem. Notice also that due to independence and the structure of \(S_n\),

\[ \mathcal G_{-n}=\sigma \bigl\{ \sigma (S_n),\, \sigma (X_{n+1},\, X_{n+2},\, \dots )\bigr\} , \]

where the \(\sigma \)-algebras \(\sigma (S_n),\, \sigma (X_{n+1},\, X_{n+2},\, \dots )\) are independent. This is to say that the process \(S_m\), \(m\ge n\), is determined by \(S_n\) and, independently, by the \(X_k\)’s for \(k{\gt}n\). It follows by symmetry (check!) that

\[ M_{-n}:\, =\operatorname{{\mathbb E}}(X_1\, |\, \mathcal G_{-n})=\operatorname{{\mathbb E}}\bigl(X_1\, |\, \sigma (S_n)\bigr)=\frac{S_n}n. \]

Lévy’s downward theorem applies on this martingale, and gives the existence of the a.s. and \(L^1\) limit \(M_{-\infty }\). All that is left is to identify what this limit is. To this end, notice that \(M_{-\infty }=\lim _{n\to \infty }M_{-n}=\lim _{n\to \infty }\frac{S_n}n\) a.s. (one can instead use limsup here to make sure it is defined surely) does not depend on changes made on any finitely many of the \(X_k\)’s. It is therefore in the tail \(\sigma \)-algebra of the i.i.d. sequence, which is trivial by Kolmogorov’s 0-1 law. For any \(c\in \mathbb R\), \({\mathbb P}\{ M_{-\infty }=c\} \) is therefore 0 or 1. However, it cannot be 0 for every \(c\): there exists a non-random value which \(M_{-\infty }\) takes a.s. and, due to \(\operatorname{{\mathbb E}}M_{-\infty }=\operatorname{{\mathbb E}}M_{-1}=\operatorname{{\mathbb E}}X_1\), this value can only be \(\operatorname{{\mathbb E}}X_1\).
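The symmetry step \(\operatorname{{\mathbb E}}\bigl(X_1\, |\, \sigma (S_n)\bigr)=\frac{S_n}n\) can be checked by brute-force enumeration in a tiny example, here with i.i.d. fair \(\pm 1\) steps and \(n=3\) (an illustrative sketch, not part of the proof):

```python
from itertools import product
from fractions import Fraction

n = 3
outcomes = list(product((-1, 1), repeat=n))  # all 2^n equally likely paths

# For each value s of S_n, E(X_1 | S_n = s) should equal s/n by symmetry.
ok = True
for s in {sum(o) for o in outcomes}:
    matching = [o for o in outcomes if sum(o) == s]
    cond_exp = Fraction(sum(o[0] for o in matching), len(matching))
    ok = ok and (cond_exp == Fraction(s, n))
print(ok)
```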

10 Doob’s submartingale inequality

We continue with yet another important property of (sub)martingales. Compare the below with Markov’s inequality.

Theorem 10.1 (Doob’s submartingale inequality)

Let \(Z_n\ge 0\) be a submartingale. Then for every \(c{\gt}0\) real and \(n{\gt}0\) integer,

\[ {\mathbb P}\bigl\{ \sup _{k\le n}Z_k\ge c\bigr\} \le \frac{\operatorname{{\mathbb E}}\bigl(Z_n\, ;\, \sup _{k\le n}Z_k\ge c\bigr)}c\le \frac{\operatorname{{\mathbb E}}Z_n}c. \] Since we are looking at a finite number of values, ‘max’ would be appropriate instead of ‘sup’. However, ‘sup’ is the usual formulation as it also works for the continuous time version of the theorem.

Proof.

If the event \(F:\, =\bigl\{ \sup _{k\le n}Z_k\ge c\bigr\} \) occurs, then there must be a first instance of \(k\) where \(Z_k\ge c\). This is captured by the disjoint union

\[ F=\bigcup _{k=0}^nF_k,\qquad \text{where}\qquad \begin{aligned} F_0:& =\{ Z_0\ge c\} ,\\ F_k:& =\bigcap _{i=0}^{k-1}\{ Z_i{\lt}c\} \cap \{ Z_k\ge c\} \qquad (k{\gt}0). \end{aligned} \]

Notice that \(F_k\in \mathcal F_k\), hence the submartingale property gives, for \(k\le n\),

\[ \operatorname{{\mathbb E}}(Z_n\, ;\, F_k)=\operatorname{{\mathbb E}}\operatorname{{\mathbb E}}\bigl(Z_n\, ;\, F_k\, \big|\, \mathcal F_k\bigr)=\operatorname{{\mathbb E}}\bigl(\operatorname{{\mathbb E}}(Z_n\, |\, \mathcal F_k)\, ;\, F_k\bigr)\ge \operatorname{{\mathbb E}}(Z_k\, ;\, F_k). \]

The event \(F_k\) implies \(Z_k\ge c\), hence we can proceed by

\[ \operatorname{{\mathbb E}}(Z_n\, ;\, F_k)\ge \operatorname{{\mathbb E}}(Z_k\, ;\, F_k)\ge c{\mathbb P}(F_k). \]

Summing this in \(k\) proves the theorem via the disjoint union above.

Submartingales occur more often than one would first think. Let \(M_n\) be a martingale, \(g\) a convex function, and assume \(\operatorname{{\mathbb E}}|g(M_n)|{\lt}\infty \) for each \(n\). Then by Jensen’s inequality,

\[ \operatorname{{\mathbb E}}\bigl(g(M_{n+1})\, |\, \mathcal F_n\bigr)\ge g\bigl(\operatorname{{\mathbb E}}(M_{n+1}\, |\, \mathcal F_n)\bigr)=g(M_n), \]

hence \(g(M_n)\) is a submartingale.
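The inequality can be checked exactly in small cases by enumerating all paths of a SSRW, whose square is a non-negative submartingale as just noted. A sketch with arbitrarily chosen walk length and threshold:

```python
from itertools import product

n, c = 12, 16.0  # walk length and threshold for the submartingale Z_k = S_k^2

paths = list(product((-1, 1), repeat=n))  # all 2^n equally likely SSRW paths

def running_max_sq(path):
    """Return sup_{k<=n} S_k^2 along one path."""
    s, m = 0, 0
    for step in path:
        s += step
        m = max(m, s * s)
    return m

# Left-hand side: P( sup_{k<=n} Z_k >= c ), computed exactly.
lhs = sum(1 for p in paths if running_max_sq(p) >= c) / len(paths)

# Right-hand side: E Z_n / c, and E S_n^2 = n for the SSRW.
rhs = n / c
print(lhs, rhs)
```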

Example 10.2

A total of \(n\) people queue up to buy one ticket each for a small performance with \(n\) seats in the theatre. The price is one pound, and each person independently, with equal chance either has exact change, or a two pound coin in which case a one pound coin change is needed from the cashier. The cashier starts the service with \(m\) one pound coins. We seek an upper bound on the probability that the cashier runs out of one pound coins at some point while serving this queue.

Set \(X_i\) to be 1 if person \(i\) has one pound, and \(-1\) if they have a two pound coin. Then \(S_k:\, =\sum _{i=1}^kX_i\) is the change in the number of one pound coins with the cashier due to serving the first \(k\) customers. It is a SSRW hence a martingale, and the cashier runs out of one pound coins iff \(m+\min _{1\le k\le n}S_k{\lt}0\).

A non-negative submartingale is produced by taking the square (a convex function) of \(S_k\). Hence, the probability of the cashier running out of change is bounded by

\[ {\mathbb P}\bigl\{ \min _{1\le k\le n}S_k{\lt}-m\bigr\} \le {\mathbb P}\bigl\{ \sup _{1\le k\le n}S_k^2\ge (m+1)^2\bigr\} \le \frac{\operatorname{{\mathbb E}}S_n^2}{(m+1)^2}=\frac n{(m+1)^2}. \]

As an example, \(n=100\) and \(m=30\) already gives a bound of circa 10%.
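For these parameters the bound is far from sharp: the exact probability can be computed by dynamic programming over the walk with an absorbing barrier at \(-(m+1)\). A sketch (not part of the notes):

```python
n, m = 100, 30

# dist[s] = probability the walk is at s after the current step and has
# never gone below -m; 'absorbed' collects paths that ran out of coins.
dist = {0: 1.0}
absorbed = 0.0
for _ in range(n):
    new = {}
    for pos, pr in dist.items():
        for step in (-1, 1):
            nxt = pos + step
            if nxt == -(m + 1):
                absorbed += pr / 2
            else:
                new[nxt] = new.get(nxt, 0.0) + pr / 2
    dist = new

bound = n / (m + 1) ** 2
print(absorbed, bound)  # the exact probability is far below the ~10% bound
```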

11 A discrete Black-Scholes option pricing formula

As an application, a very simple version of the option pricing formula is presented, still following Williams [ 2 ] . First, the probability space that will govern the stock market is assumed in this form:

\[ \Omega =\{ -1,\, 1\} ^N,\qquad {\mathbb P}\{ \omega _n=1\} =p,\qquad {\mathbb P}\{ \omega _n=-1\} =1-p,\tag{11.1} \]

independently for different \(n\)’s. This generates the natural filtration \(\mathcal F_n=\sigma \{ \omega _1,\, \dots ,\, \omega _n\} \).

Stocks have value \(S_n\), while bonds have value \(B_n\) per unit on day \(n\), \(0\le n\le N\). We have \(A_n\) stocks, and \(V_n\) bonds in the morning of day \(n\), so our total wealth is \(A_nS_n+V_nB_n\). During the day, we are allowed to exchange these, and by the evening of day \(n\) we might have \(A_{n+1}\) stocks and \(V_{n+1}\) bonds. However, the total wealth during transactions must be conserved:

\[ A_nS_n+V_nB_n=A_{n+1}S_n+V_{n+1}B_n.\tag{11.2} \]

Overnight, the values change from \(S_n\) to \(S_{n+1}\) for stocks, and from \(B_n\) to \(B_{n+1}\) for bonds. Bonds are not very exciting, their value is deterministic, \(B_n=(1+r)^nB_0\) with a fixed \(-1{\lt}r{\lt}\infty \), which one can also write as

\[ B_n-B_{n-1}=rB_{n-1}. \]

Stocks, on the other hand, will change randomly. With \(-1{\lt}a{\lt}r{\lt}b{\lt}\infty \) also fixed, the random rates are governed by the probability space as

\[ R_n=\frac{a+b}2+\frac{b-a}2\, \omega _n=\begin{cases} b,& \text{if }\omega _n=1,\\ a,& \text{if }\omega _n=-1,\end{cases}\tag{11.3} \]

and we have

\[ S_n-S_{n-1}=R_nS_{n-1}. \]

The European option is a contract made on day 0. It allows (but does not force) us to buy, at the end of day \(N\), a stock at striking price \(K\). Its value on day \(N\) is therefore \((S_N-K)^+\) (when \(S_N{\lt}K\), we do not use it). But how much is it worth on day 0?

Definition 11.1

A hedging strategy for the above option with initial value \(x\) is a predictable process \((A_n,\, V_n)\) such that, with wealth \(X_n:\, =A_nS_n+V_nB_n\), for every \(\omega \in \Omega \),

  • \(X_0=x\),

  • \(X_n\ge 0\) for all \(0\le n\le N\),

  • \(X_N=(S_N-K)^+\).

(\(A_n\) and \(V_n\) can possibly go negative.)

If there exists a hedging strategy for exactly one value of \(x\), then this should be the price of the European option (anything cheaper, and everybody would buy it; anything more expensive, and people would rather follow the hedging strategy). This is exactly the case:
Theorem 11.2

A hedging strategy as above exists if and only if

\[ x=(r+1)^{-N}\cdot \operatorname{{\mathbb E}}(S_N-K)^+ \]

with respect to \(\Omega \) (11.1) with \(p=\frac{r-a}{b-a}\). This strategy is unique, and features \(A_n\ge 0\) for all \(0\le n\le N\).

To prove this theorem, we need two lemmas. Define the martingale (with \(Z_0:\, =0\))

\[ Z_n:\, =\sum _{k=1}^n(\omega _k-2p+1).\tag{11.4} \]

The space \(\Omega \) is simple enough to derive any other martingale from this one:

Lemma 11.3

Let \(M_n\) be any martingale in \(\bigl(\Omega ,\, (\mathcal F_n)_{n\ge 0}\bigr)\). Then there is a unique predictable process \(H_n\), such that

\[ M_n=M_0+(H\bullet Z)_n=M_0+\sum _{k=1}^nH_k(Z_k-Z_{k-1}). \]

Proof.

The proof is by brute force. We can write all random variables as functions of sequences of \(\omega _k\)’s. Those that are \(\mathcal F_n\)-measurable will be functions of \(\omega _1,\, \omega _2,\, \dots ,\, \omega _n\) only. Now let us reverse-engineer what \(H_n\) needs to be:

\[ \begin{aligned} M_n-M_{n-1}& =\sum _{k=1}^nH_k(Z_k-Z_{k-1})-\sum _{k=1}^{n-1}H_k(Z_k-Z_{k-1})=H_n(Z_n-Z_{n-1})=H_n(\omega _n-2p+1)\\ & =H_n(2-2p){\bf 1}\{ \omega _n=1\} -H_n2p{\bf 1}\{ \omega _n=-1\} . \end{aligned} \]

From here, checking the two cases \(\omega _n=\pm 1\),

\[ \begin{aligned} H_n& =\frac{M_n(\omega _1,\, \dots ,\, \omega _{n-1},\, 1)-M_{n-1}}{2-2p}\qquad \text{on }\{ \omega _n=1\} ,\\ H_n& =\frac{M_{n-1}-M_n(\omega _1,\, \dots ,\, \omega _{n-1},\, -1)}{2p}\qquad \text{on }\{ \omega _n=-1\} . \end{aligned}\tag{11.5} \]

However, \(H_n\) needs to be predictable: it cannot depend on \(\omega _n\). In other words, the two lines of this display must agree. This is where the martingale property of \(M_n\) comes in handy:

\[ M_{n-1}=\operatorname{{\mathbb E}}(M_n\, |\, \mathcal F_{n-1})=pM_n(\omega _1,\, \dots ,\, \omega _{n-1},\, 1)+(1-p)M_n(\omega _1,\, \dots ,\, \omega _{n-1},\, -1). \]

Rearranging,

\[ (1-p)\bigl(M_{n-1}-M_n(\omega _1,\, \dots ,\, \omega _{n-1},\, -1)\bigr)=p\bigl(M_n(\omega _1,\, \dots ,\, \omega _{n-1},\, 1)-M_{n-1}\bigr) \]

exactly saying that the two lines of (11.5) indeed agree, and \(H_n\) does not depend on \(\omega _n\). The choice

\[ H_n(\omega _1,\, \dots ,\, \omega _{n-1})=\frac{M_n(\omega _1,\, \dots ,\, \omega _{n-1},\, 1)-M_{n-1}}{2-2p}=\frac{M_{n-1}-M_n(\omega _1,\, \dots ,\, \omega _{n-1},\, -1)}{2p} \]

will then work, and, following the proof, is also unique.
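The representation can also be verified numerically on a small space: take a biased \(p\), an arbitrary function \(f\) of three coins, set \(M_n=\operatorname{{\mathbb E}}(f\, |\, \mathcal F_n)\), compute \(H_n\) from the displayed formula, and check \(M_N=M_0+(H\bullet Z)_N\) on every path. An illustrative sketch (the choice of \(f\) and \(p\) is arbitrary):

```python
from itertools import product

p, N = 0.3, 3                                  # biased coin, three time steps
f = lambda w: (1 + w[0]) * (2 + w[1]) + w[2]   # arbitrary function on Omega

def prob(tail):
    """Probability of a given sequence of +-1 coins."""
    q = 1.0
    for w in tail:
        q *= p if w == 1 else 1 - p
    return q

def M(prefix):
    """M_n = E(f | omega_1..omega_n), computed by summing over the rest."""
    k = N - len(prefix)
    return sum(prob(tail) * f(prefix + tail)
               for tail in product((-1, 1), repeat=k))

ok = True
for w in product((-1, 1), repeat=N):
    total = M(())                              # M_0
    for n in range(1, N + 1):
        prefix = w[:n - 1]
        H_n = (M(prefix + (1,)) - M(prefix)) / (2 - 2 * p)
        total += H_n * (w[n - 1] - 2 * p + 1)  # H_n (Z_n - Z_{n-1})
    ok = ok and abs(total - f(w)) < 1e-12      # M_N = f on this path?
print(ok)
```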

Definition 11.4

Following a hedging strategy \((A_n,\, V_n)\), the discounted value of our wealth is

\[ Y_n:\, =(1+r)^{-n}X_n=(1+r)^{-n}(A_nS_n+V_nB_n)\tag{11.6} \]

(recall (11.2)).

This is the amount we would need to invest in bonds at time zero to achieve wealth \(X_n\) by time \(n\) purely via the fixed bond interest rate \(r\).
Lemma 11.5

For any hedging strategy \((A_n,\, V_n)\), the discounted wealth (11.6) is a martingale in the probability space (11.1) with \(p:\, =\frac{r-a}{b-a}\). It can be obtained by transforming the martingale (11.4) as \(Y_n=Y_0+(H\bullet Z)_n\) by the unique predictable process

\[ H_n=\frac{b-a}2\, (1+r)^{-n}A_nS_{n-1}.\tag{11.7} \]

Proof.

Rewrite the definition (11.6) using \(B_n=(1+r)^nB_0\):

\[ Y_n=(1+r)^{-n}\cdot A_nS_n+V_nB_0=(1+r)^{-n}\cdot A_{n+1}S_n+V_{n+1}B_0. \]

Its increment can be written as (use the first expression for \(Y_n\) and the second one for \(Y_{n-1}\))

\[ Y_n-Y_{n-1}=(1+r)^{-n}\cdot A_n\bigl(S_n-(1+r)S_{n-1}\bigr)=(1+r)^{-n}\cdot A_n(R_n-r)S_{n-1}. \]

To proceed, assume \(p=\frac{r-a}{b-a}\), that is

\[ r=(b-a)p+a=\frac{a+b}2+\frac{b-a}2(2p-1), \]

the mean of \(R_n\) (11.3). Hence

\[ Y_n-Y_{n-1}=\frac{b-a}2(1+r)^{-n}\cdot A_nS_{n-1}(\omega _n-2p+1)=H_n\cdot (\omega _n-2p+1), \]

with the definition (11.7) of \(H_n\). This is equivalent to \(Y_n=Y_0+(H\bullet Z)_n\) with the martingale \(Z_n\) from (11.4). This implies that \(Y\) is a martingale, and the previous lemma then assures that no other predictable \(H\) can provide \(Y=Y_0+H\bullet Z\).

Proof of Theorem 11.2.

Define the martingale

\[ Y_n:\, =(r+1)^{-N}\cdot \operatorname{{\mathbb E}}\bigl[(S_N-K)^+\, |\, \mathcal F_n\bigr]. \]

We construct the hedging strategy with this being its discounted wealth as in (11.6). To do that, use Lemma 11.3 to find the unique predictable process \(H_n\) for this martingale, from which the predictable strategy \(A_n\) can be read off via (11.7). (Identity (11.2) then produces \(V_n\) as well.) The wealth process with this strategy is \(X_n=(1+r)^nY_n\), and we check \(X_0=(1+r)^0Y_0=(r+1)^{-N}\cdot \operatorname{{\mathbb E}}(S_N-K)^+\) since \(\mathcal F_0\) is trivial; \(X_n\ge 0\) is obvious, and \(X_N=(1+r)^N(1+r)^{-N}(S_N-K)^+=(S_N-K)^+\) since all random variables are \(\mathcal F_N\)-measurable. Hence the \((A_n,\, V_n)\) created this way is a hedging strategy for the European option with initial value as stated in the theorem.

To see uniqueness of this strategy, assume there is another one \((A'_n,\, V'_n)\) hedging the same European option. Its discounted wealth \(Y_n'\) is a martingale that satisfies

\[ Y'_N=(1+r)^{-N}X'_N=(r+1)^{-N}\cdot (S_N-K)^+. \]

The martingale property then implies

\[ Y'_n=\operatorname{{\mathbb E}}(Y'_N\, |\, \mathcal F_n)=(r+1)^{-N}\cdot \operatorname{{\mathbb E}}\bigl[(S_N-K)^+\, |\, \mathcal F_n\bigr], \]

thus \(Y'_n=Y_n\), which in turn implies that \(A'_n=A_n\) for each \(n\). Uniqueness of the martingale in particular implies uniqueness of the initial value \(x=X_0=Y_0=(r+1)^{-N}\cdot \operatorname{{\mathbb E}}(S_N-K)^+\) of the hedging strategy.

It remains to show that \(A_n\ge 0\) for each \(n\) under this strategy. This is equivalent to \(H_n\ge 0\), and from the lemma we have

\[ \begin{aligned} H_n& =\frac{Y_n(\omega _1,\, \dots ,\, \omega _{n-1},\, 1)-Y_{n-1}}{2-2p}\\ & =\frac{(1+r)^{-N}}{2-2p}\bigl[\operatorname{{\mathbb E}}\bigl((S_N-K)^+\, |\, \omega _1,\, \dots ,\, \omega _{n-1},\, 1\bigr)-\operatorname{{\mathbb E}}\bigl((S_N-K)^+\, |\, \omega _1,\, \dots ,\, \omega _{n-1}\bigr)\bigr]. \end{aligned} \]

Hence we need to prove

\[ \operatorname{{\mathbb E}}\bigl((S_N-K)^+\, |\, \omega _1,\, \dots ,\, \omega _{n-1},\, 1\bigr)\ge \operatorname{{\mathbb E}}\bigl((S_N-K)^+\, |\, \omega _1,\, \dots ,\, \omega _{n-1}\bigr), \]

which happens if and only if

\[ \operatorname{{\mathbb E}}\bigl((S_N-K)^+\, |\, \omega _1,\, \dots ,\, \omega _{n-1},\, 1\bigr)\ge \operatorname{{\mathbb E}}\bigl((S_N-K)^+\, |\, \omega _1,\, \dots ,\, \omega _{n-1},\, -1\bigr). \]

To see this, first notice that the function \((S_N-K)^+\) is non-decreasing in each of the variables \(\omega _1,\, \dots ,\, \omega _N\). Calculating the above conditional expectations involves summing over \(\omega _{n+1},\, \dots ,\, \omega _N\) each taking values \(\pm 1\). For every such outcome,

\[ (S_N-K)^+(\omega _1,\, \dots ,\, \omega _{n-1},\, 1,\, \omega _{n+1},\, \dots ,\, \omega _N)\ge (S_N-K)^+(\omega _1,\, \dots ,\, \omega _{n-1},\, -1,\, \omega _{n+1},\, \dots ,\, \omega _N), \]

and the inequality survives the summation for the conditional expectations.
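The price formula itself is easy to evaluate: under \(p=\frac{r-a}{b-a}\) the stock is a binomial tree, so \(x=(r+1)^{-N}\operatorname{{\mathbb E}}(S_N-K)^+\) is a discounted binomial expectation. A minimal sketch, with parameters chosen arbitrarily for illustration:

```python
from math import comb

# Arbitrary illustrative parameters satisfying -1 < a < r < b.
S0, K, N = 100.0, 100.0, 3
a, r, b = -0.5, 0.0, 0.5

p = (r - a) / (b - a)   # the martingale probability from the theorem

# x = (1+r)^{-N} E (S_N - K)^+, where S_N = S0 (1+b)^j (1+a)^{N-j}
# and j ~ Binomial(N, p) counts the up-moves of the stock.
x = (1 + r) ** (-N) * sum(
    comb(N, j) * p ** j * (1 - p) ** (N - j)
    * max(S0 * (1 + b) ** j * (1 + a) ** (N - j) - K, 0.0)
    for j in range(N + 1)
)
print(x)  # 34.375 with these parameters
```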

References

  • [1]

    Albert Nikolaevich Shiryaev. Probability (2nd Ed.). Springer, 1996.

  • [2]

    David Williams. Probability with Martingales. Cambridge University Press, 1991.
