MA427 Ergodic Theorymaslav/Teaching/warwick... · 2012-12-20 · MA427 Ergodic Theory...

MA427 Ergodic Theory

Course Notes (2012-13)

1 Introduction

1.1 Orbits

Let X be a mathematical space. For example, X could be the unit interval [0, 1], a circle, a torus, or something far more complicated like a Cantor set. Let T : X → X be a function that maps X into itself.

Let x ∈ X be a point. We can repeatedly apply the map T to the point x to obtain the sequence:

{x, T(x), T(T(x)), T(T(T(x))), . . .}.

We will often write $T^n(x) = T(\cdots(T(T(x))))$ (n times). The sequence of points

x, T(x), $T^2(x)$, . . .

is called the orbit of x.

We think of applying the map T as the passage of time. Thus we think of T(x) as where the point x has moved to after time 1, $T^2(x)$ is where the point x has moved to after time 2, etc.

Some points x ∈ X return to where they started. That is, $T^n(x) = x$ for some n > 1. We say that such a point x is periodic with period n.

By way of contrast, points may move densely around the space X. (A sequence is said to be dense if (loosely speaking) it comes arbitrarily close to every point of X.)

If we take two points x, y of X that start very close then their orbits will initially be close. However, it often happens that in the long term their orbits move apart and indeed become dramatically different. This is known as sensitive dependence on initial conditions, and is popularly known as chaos.

In general, for a given dynamical system T it is impossible to understand the orbit structure of every orbit. Ergodic theory takes a more qualitative approach: we aim to describe the long term behaviour of a typical orbit, at least in the case when T satisfies a technical condition called ‘measure-preserving’.

To make the notion of ‘typical’ precise, we need to use measure theory. Roughly speaking, a measure is a function that assigns a ‘size’ to a given subset of X. One of the simplest measures is Lebesgue measure on [0, 1]; here the measure of an interval [a, b] ⊂ [0, 1] is just its length b − a.


Let T : [0, 1] → [0, 1] and fix a subinterval [a, b] ⊂ [0, 1]. Let x ∈ [0, 1]. What is the frequency with which the orbit of x hits the set [a, b]? Recall that the characteristic function $\chi_A$ of a subset A is defined by

\[ \chi_A(x) = \begin{cases} 1 & \text{if } x \in A, \\ 0 & \text{if } x \notin A. \end{cases} \]

Then the number of times the first n points of the orbit of x hit [a, b] is given by

\[ \sum_{j=0}^{n-1} \chi_{[a,b]}(T^j(x)). \]

Thus the proportion of the first n points in the orbit of x that lie in [a, b] is equal to

\[ \frac{1}{n} \sum_{j=0}^{n-1} \chi_{[a,b]}(T^j(x)). \]

Hence the frequency with which the orbit of x lies in [a, b] is given by

\[ \lim_{n\to\infty} \frac{1}{n} \sum_{j=0}^{n-1} \chi_{[a,b]}(T^j(x)) \]

(assuming of course that this limit exists!).

One of the main results of the course, namely Birkhoff’s ergodic theorem, tells us that when T is ergodic (a technical property, albeit an important one, that we won’t define here) then for ‘most’ orbits the above frequency is equal to the measure of the interval [a, b]. In the case of Lebesgue measure, this means that:

\[ \lim_{n\to\infty} \frac{1}{n} \sum_{j=0}^{n-1} \chi_{[a,b]}(T^j(x)) = b - a, \quad \text{for almost all } x \in X. \]

(Here ‘almost all’ is the technical measure-theoretic way of saying ‘most’.)

One way of looking at Birkhoff’s ergodic theorem is the following: the time average of a typical point x ∈ X (i.e. the frequency with which its orbit lands in a given subset) is equal to the space average (namely, the measure of that subset).

In this course, we develop the necessary background that builds up to Birkhoff’s ergodic theorem, together with some illuminating examples. We also study en route some interesting diversions to other areas of mathematics, notably number theory.

1.2 Introducing the doubling map

Let X = [0, 1] denote the unit interval. Define the map T : X → X by:

\[ T(x) = 2x \bmod 1 = \begin{cases} 2x & \text{if } 0 \le x < 1/2, \\ 2x - 1 & \text{if } 1/2 \le x \le 1 \end{cases} \]

(‘mod 1’ stands for ‘modulo 1’ and means ‘ignore the integer part’; for example 3.456 mod 1 is .456).
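As a small illustration of the definition, the following Python sketch iterates the doubling map on [0, 1) using exact rational arithmetic. The function names `doubling_map` and `orbit` and the starting point 1/7 are choices made here for illustration, not part of the notes. Exact fractions are used because, in binary floating point, repeated doubling mod 1 discards one bit of the starting value at each step, so a floating-point orbit would not be a faithful picture of the true one.

```python
from fractions import Fraction

def doubling_map(x):
    """Return 2x mod 1 for x in [0, 1), using exact rational arithmetic."""
    return (2 * x) % 1

def orbit(x, steps):
    """Return the points x, T(x), ..., T^(steps-1)(x) as a list."""
    points = []
    for _ in range(steps):
        points.append(x)
        x = doubling_map(x)
    return points

# The orbit of 1/7 returns to its starting point after three steps.
print(orbit(Fraction(1, 7), 4))
```

Here the orbit of 1/7 is 1/7, 2/7, 4/7, 1/7, so 1/7 is a periodic point.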


Exercise 1.1. Draw the graph of the doubling map. By sketching the orbit in this graph, indicate that 7/15 is periodic. Try sketching the orbits of some points near 7/15.

In §1.1 we mentioned in passing that we will be interested in a technical condition called ‘measure-preserving’. We can illustrate this property here. Fix an interval [a, b] and consider the set

\[ T^{-1}[a, b] = \{x \in [0, 1] \mid T(x) \in [a, b]\}. \]

One can easily check that

\[ T^{-1}[a, b] = \left[\frac{a}{2}, \frac{b}{2}\right] \cup \left[\frac{a+1}{2}, \frac{b+1}{2}\right], \]

so that $T^{-1}[a, b]$ is the union of two intervals, each of length (b − a)/2. Hence the length of $T^{-1}[a, b]$ is equal to b − a, which is the same as the length of [a, b].
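The preimage computation above can be checked on a concrete interval. The sketch below is only an illustration: the helper names `preimage_of_interval` and `total_length` and the sample interval [1/5, 7/10] are assumptions made here.

```python
from fractions import Fraction

def preimage_of_interval(a, b):
    """Return the two intervals that make up T^{-1}[a, b] for the doubling map."""
    return [(a / 2, b / 2), ((a + 1) / 2, (b + 1) / 2)]

def total_length(intervals):
    """Sum of the lengths of a list of (left, right) intervals."""
    return sum(right - left for left, right in intervals)

a, b = Fraction(1, 5), Fraction(7, 10)
pieces = preimage_of_interval(a, b)

# Each endpoint of each piece is mapped by x -> 2x mod 1 to an endpoint of [a, b].
for left, right in pieces:
    print((2 * left) % 1, (2 * right) % 1)

# The total length of the preimage equals the length of [a, b].
print(total_length(pieces) == b - a)
```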

1.3 Leading digits

The leading digit of a number n ∈ N is the digit (between 1 and 9) that appears at the leftmost end of n when n is written in base 10. Thus, the leading digit of 4629 is 4, etc.

Consider the sequence $2^n$:

1, 2, 4, 8, 16, 32, 64, 128, . . .

and consider the sequence of leading digits:

1, 2, 4, 8, 1, 3, 6, 1, . . . .

Exercise 1.2. By writing down the sequence of leading digits for $2^n$ for n = 1, 2, . . . , something large of your choosing, try guessing the frequency with which the digit 1 appears as a leading digit. (Hint: it isn’t 3/10ths.) Do the same for the digit 2. Can you guess the frequency with which the digit r appears?

We will study this problem in greater detail later.
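For readers who would rather experiment than write the sequence out by hand, here is a minimal Python sketch that tabulates leading-digit frequencies. The function names and the cut-off of 1000 terms are arbitrary choices made here; the code only counts, and makes no claim about the limiting values.

```python
from collections import Counter

def leading_digit(n):
    """Leading base-10 digit of a positive integer n."""
    return int(str(n)[0])

def leading_digit_frequencies(count):
    """Frequencies of the leading digits of 2^0, 2^1, ..., 2^(count-1)."""
    tally = Counter(leading_digit(2 ** n) for n in range(count))
    return {digit: tally[digit] / count for digit in range(1, 10)}

# Python integers have arbitrary precision, so 2**n is computed exactly.
for digit, frequency in leading_digit_frequencies(1000).items():
    print(digit, round(frequency, 3))
```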


2 Uniform Distribution

2.1 Uniform distribution and Weyl’s criterion

Before we discuss dynamical systems in greater detail, we shall consider a simpler setting which highlights some of the main ideas in ergodic theory.

Let $x_n$ be a sequence of real numbers. We may decompose $x_n$ as the sum of its integer part $[x_n] = \sup\{m \in \mathbb{Z} \mid m \le x_n\}$ (i.e. the largest integer which is less than or equal to $x_n$) and its fractional part $\{x_n\} = x_n - [x_n]$. Clearly, $0 \le \{x_n\} < 1$. The study of $x_n$ mod 1 is the study of the sequence $\{x_n\}$ in [0, 1).

Definition 2.1. We say that the sequence $x_n$ is uniformly distributed mod 1 if for every a, b with 0 ≤ a < b < 1, we have that

\[ \frac{1}{n} \operatorname{card}\{j \mid 0 \le j \le n-1,\ \{x_j\} \in [a, b]\} \to b - a, \quad \text{as } n \to \infty. \]

(The condition is saying that the proportion of the sequence $\{x_n\}$ lying in [a, b] converges to b − a, the length of the interval.)

Remark 2.2. We can replace [a, b] by [a, b), (a, b] or (a, b) with the same result.

Exercise 2.3. Show that if $x_n$ is uniformly distributed mod 1 then $\{x_n\}$ is dense in [0, 1).

The following result gives a necessary and sufficient condition for $x_n$ to be uniformly distributed mod 1.

Theorem 2.4 (Weyl’s Criterion). The following are equivalent:

(i) the sequence $x_n$ is uniformly distributed mod 1;

(ii) for each ℓ ∈ Z \ {0}, we have

\[ \frac{1}{n} \sum_{j=0}^{n-1} e^{2\pi i \ell x_j} \to 0 \]

as n → ∞.

2.2 The sequence xn = nα

The behaviour of the sequence $x_n = n\alpha$ depends on whether α is rational or irrational. If α ∈ Q, it is easy to see that {nα} can take on only finitely many values in [0, 1): if α = p/q (p ∈ Z, q ∈ N, hcf(p, q) = 1) then {nα} takes the q values

\[ 0, \left\{\frac{p}{q}\right\}, \left\{\frac{2p}{q}\right\}, \ldots, \left\{\frac{(q-1)p}{q}\right\}. \]

In particular, {nα} is not uniformly distributed mod 1.


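If α ∈ R \ Q then the situation is completely different. We shall apply Weyl’s Criterion. For ℓ ∈ Z \ {0}, $e^{2\pi i \ell \alpha} \ne 1$, so we have

\[ \frac{1}{n} \sum_{j=0}^{n-1} e^{2\pi i \ell j \alpha} = \frac{1}{n} \, \frac{e^{2\pi i \ell n \alpha} - 1}{e^{2\pi i \ell \alpha} - 1}. \]

Hence

\[ \left| \frac{1}{n} \sum_{j=0}^{n-1} e^{2\pi i \ell j \alpha} \right| \le \frac{1}{n} \, \frac{2}{|e^{2\pi i \ell \alpha} - 1|} \to 0, \quad \text{as } n \to \infty. \]

Hence nα is uniformly distributed mod 1.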
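The two behaviours can be observed numerically. The sketch below is an illustration under assumptions made here: the function names are hypothetical, α = √2 and the interval [0.2, 0.5] are arbitrary sample choices, and floating-point arithmetic is used, so the output is only approximate evidence and not a proof.

```python
import cmath
import math

def fractional_part(x):
    """Return {x} = x - [x]."""
    return x - math.floor(x)

def proportion_in_interval(alpha, a, b, n):
    """Proportion of the points {k * alpha}, 0 <= k < n, that lie in [a, b]."""
    hits = sum(1 for k in range(n) if a <= fractional_part(k * alpha) <= b)
    return hits / n

def weyl_average(alpha, ell, n):
    """Modulus of (1/n) * sum over 0 <= k < n of exp(2 pi i ell k alpha)."""
    total = sum(cmath.exp(2j * math.pi * ell * k * alpha) for k in range(n))
    return abs(total) / n

alpha = math.sqrt(2)
# Irrational alpha: the proportion should be close to b - a and the Weyl average close to 0.
print(proportion_in_interval(alpha, 0.2, 0.5, 10000))
print(weyl_average(alpha, 1, 10000))
# Rational alpha = 1/2: only the values 0 and 1/2 occur, so [0.1, 0.4] is never hit.
print(proportion_in_interval(0.5, 0.1, 0.4, 1000))
```

For α = 1/2 and ℓ = 2 every term of the exponential sum equals 1, which is the failure of condition (ii) in the rational case.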

Remarks 2.5. (i) More generally, we could consider the sequence $x_n = n\alpha + \beta$. It is easy to see by modifying the above arguments that $x_n$ is uniformly distributed mod 1 if and only if α is irrational.

(ii) Fix α > 1 and consider the sequence $x_n = \alpha^n x$, for some x ∈ (0, 1). Then it is possible to show that for almost every x, the sequence $x_n$ is uniformly distributed mod 1. We will prove this later in the course, at least in the cases when α = 2, 3, 4, . . ..

(iii) Suppose in the above remark we fix x = 1 and consider the sequence $x_n = \alpha^n$. Then one can show that $x_n$ is uniformly distributed mod 1 for almost all α > 1. However, not a single example of such an α is known!

Exercise 2.6. Calculate the frequency with which $2^n$ has r (r = 1, . . . , 9) as the leading digit of its base 10 representation. (You may assume that $\log_{10} 2$ is irrational.)
(Hint: first show that $2^n$ has leading digit r if and only if

\[ r \, 10^{\ell} \le 2^n < (r + 1) 10^{\ell} \]

for some $\ell \in \mathbb{Z}^+$.)

Exercise 2.7. Calculate the frequency with which $2^n$ has r (r = 0, 1, . . . , 9) as the second digit of its base 10 representation.

2.3 Proof of Weyl’s criterion

Proof. Since $e^{2\pi i x_j} = e^{2\pi i \{x_j\}}$, we may suppose, without loss of generality, that $x_j = \{x_j\}$.

(i) ⇒ (ii): Suppose that $x_j$ is uniformly distributed mod 1. If $\chi_{[a,b]}$ is the characteristic function of the interval [a, b], then we may rewrite the definition of uniform distribution in the form

\[ \frac{1}{n} \sum_{j=0}^{n-1} \chi_{[a,b]}(x_j) \to \int_0^1 \chi_{[a,b]}(x) \, dx, \quad \text{as } n \to \infty. \]

From this we deduce that

\[ \frac{1}{n} \sum_{j=0}^{n-1} f(x_j) \to \int_0^1 f(x) \, dx, \quad \text{as } n \to \infty, \]


whenever f is a step function, i.e., a linear combination of characteristic functions of intervals.

Now let g be a continuous function on [0, 1] (with g(0) = g(1)). Then, given ε > 0, we can find a step function f with $\|g - f\|_\infty \le \varepsilon$. We have the estimate

\begin{align*}
\left| \frac{1}{n} \sum_{j=0}^{n-1} g(x_j) - \int_0^1 g(x) \, dx \right|
&\le \left| \frac{1}{n} \sum_{j=0}^{n-1} (g(x_j) - f(x_j)) \right|
 + \left| \frac{1}{n} \sum_{j=0}^{n-1} f(x_j) - \int_0^1 f(x) \, dx \right|
 + \left| \int_0^1 f(x) \, dx - \int_0^1 g(x) \, dx \right| \\
&\le 2\varepsilon + \left| \frac{1}{n} \sum_{j=0}^{n-1} f(x_j) - \int_0^1 f(x) \, dx \right|.
\end{align*}

Since the last term converges to zero, we thus obtain

\[ \limsup_{n\to\infty} \left| \frac{1}{n} \sum_{j=0}^{n-1} g(x_j) - \int_0^1 g(x) \, dx \right| \le 2\varepsilon. \]

Since ε > 0 is arbitrary, this gives us that

\[ \frac{1}{n} \sum_{j=0}^{n-1} g(x_j) \to \int_0^1 g(x) \, dx, \]

as n → ∞, and this holds, in particular, for $g(x) = e^{2\pi i \ell x}$. If ℓ ≠ 0 then

\[ \int_0^1 e^{2\pi i \ell x} \, dx = 0, \]

so the first implication is proved.

(ii) ⇒ (i): Suppose now that Weyl’s Criterion holds. Then

\[ \frac{1}{n} \sum_{j=0}^{n-1} g(x_j) \to \int_0^1 g(x) \, dx, \quad \text{as } n \to \infty, \]

whenever $g(x) = \sum_{k=1}^{m} \alpha_k e^{2\pi i \ell_k x}$ is a trigonometric polynomial.

Let f be any continuous function on [0, 1] with f(0) = f(1). Given ε > 0 we can find a trigonometric polynomial g such that $\|f - g\|_\infty \le \varepsilon$. (This is a consequence of Fejér’s Theorem.) As in the first part of the proof, we can conclude that

\[ \frac{1}{n} \sum_{j=0}^{n-1} f(x_j) \to \int_0^1 f(x) \, dx, \quad \text{as } n \to \infty. \]


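Now consider the interval [a, b] ⊂ [0, 1). Given ε > 0, we can find continuous functions $f_1, f_2$ (with $f_1(0) = f_1(1)$, $f_2(0) = f_2(1)$) such that

\[ f_1 \le \chi_{[a,b]} \le f_2 \]

and

\[ \int_0^1 f_2(x) - f_1(x) \, dx \le \varepsilon. \]

We then have that

\begin{align*}
\liminf_{n\to\infty} \frac{1}{n} \sum_{j=0}^{n-1} \chi_{[a,b]}(x_j)
&\ge \liminf_{n\to\infty} \frac{1}{n} \sum_{j=0}^{n-1} f_1(x_j) = \int_0^1 f_1(x) \, dx \\
&\ge \int_0^1 f_2(x) \, dx - \varepsilon \ge \int_0^1 \chi_{[a,b]}(x) \, dx - \varepsilon
\end{align*}

and

\begin{align*}
\limsup_{n\to\infty} \frac{1}{n} \sum_{j=0}^{n-1} \chi_{[a,b]}(x_j)
&\le \limsup_{n\to\infty} \frac{1}{n} \sum_{j=0}^{n-1} f_2(x_j) = \int_0^1 f_2(x) \, dx \\
&\le \int_0^1 f_1(x) \, dx + \varepsilon \le \int_0^1 \chi_{[a,b]}(x) \, dx + \varepsilon.
\end{align*}

Since ε > 0 is arbitrary, we have shown that

\[ \lim_{n\to\infty} \frac{1}{n} \sum_{j=0}^{n-1} \chi_{[a,b]}(x_j) = \int_0^1 \chi_{[a,b]}(x) \, dx = b - a, \]

so that $x_j$ is uniformly distributed mod 1.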

2.4 Generalisation to Higher Dimensions

We shall now look at the distribution of sequences in $\mathbb{R}^k$.

Definition 2.8. A sequence $x_n = (x_n^1, \ldots, x_n^k) \in \mathbb{R}^k$ is said to be uniformly distributed mod 1 if, for each choice of k intervals $[a_1, b_1], \ldots, [a_k, b_k] \subset [0, 1)$, we have that

\[ \frac{1}{n} \sum_{j=0}^{n-1} \prod_{i=1}^{k} \chi_{[a_i, b_i]}(\{x_j^i\}) \to \prod_{i=1}^{k} (b_i - a_i), \quad \text{as } n \to \infty. \]

We have the following criterion for uniform distribution.

Theorem 2.9 (Multi-dimensional Weyl’s Criterion). The sequence $x_n \in \mathbb{R}^k$ is uniformly distributed mod 1 if and only if

\[ \frac{1}{n} \sum_{j=0}^{n-1} e^{2\pi i (\ell_1 x_j^1 + \cdots + \ell_k x_j^k)} \to 0, \quad \text{as } n \to \infty, \]

for all $\ell = (\ell_1, \ldots, \ell_k) \in \mathbb{Z}^k \setminus \{0\}$.


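Remark 2.10. Here and throughout $0 \in \mathbb{Z}^k$ denotes the zero vector (0, . . . , 0).

Proof. The proof is essentially the same as in the case k = 1.

We shall apply this result to the sequence $x_n = (n\alpha_1, \ldots, n\alpha_k)$, for real numbers $\alpha_1, \ldots, \alpha_k$.

Suppose first that the numbers $\alpha_1, \ldots, \alpha_k, 1$ are rationally independent. This means that if $r_1, \ldots, r_k, r$ are rational numbers such that

\[ r_1 \alpha_1 + \cdots + r_k \alpha_k + r = 0, \]

then $r_1 = \cdots = r_k = r = 0$. In particular, for $\ell = (\ell_1, \ldots, \ell_k) \in \mathbb{Z}^k \setminus \{0\}$ and n ∈ N,

\[ \ell_1 n \alpha_1 + \cdots + \ell_k n \alpha_k \notin \mathbb{Z}, \]

so that

\[ e^{2\pi i (\ell_1 n \alpha_1 + \cdots + \ell_k n \alpha_k)} \ne 1. \]

We therefore have that

\begin{align*}
\left| \frac{1}{n} \sum_{j=0}^{n-1} e^{2\pi i (\ell_1 j \alpha_1 + \cdots + \ell_k j \alpha_k)} \right|
&= \left| \frac{1}{n} \, \frac{e^{2\pi i n (\ell_1 \alpha_1 + \cdots + \ell_k \alpha_k)} - 1}{e^{2\pi i (\ell_1 \alpha_1 + \cdots + \ell_k \alpha_k)} - 1} \right| \\
&\le \frac{1}{n} \, \frac{2}{|e^{2\pi i (\ell_1 \alpha_1 + \cdots + \ell_k \alpha_k)} - 1|} \to 0, \quad \text{as } n \to \infty.
\end{align*}

Therefore, by Weyl’s Criterion, $(n\alpha_1, \ldots, n\alpha_k)$ is uniformly distributed mod 1.

Now suppose that the numbers $\alpha_1, \ldots, \alpha_k, 1$ are rationally dependent, i.e. there exist rational numbers $r_1, \ldots, r_k, r$, not all equal to zero, such that $r_1 \alpha_1 + \cdots + r_k \alpha_k + r = 0$. Then there exists $\ell = (\ell_1, \ldots, \ell_k) \in \mathbb{Z}^k \setminus \{0\}$ such that

\[ \ell_1 \alpha_1 + \cdots + \ell_k \alpha_k \in \mathbb{Z}. \]

Thus $e^{2\pi i (\ell_1 n \alpha_1 + \cdots + \ell_k n \alpha_k)} = 1$ for all n ∈ N and so

\[ \frac{1}{n} \sum_{j=0}^{n-1} e^{2\pi i (\ell_1 j \alpha_1 + \cdots + \ell_k j \alpha_k)} = 1 \not\to 0, \quad \text{as } n \to \infty. \]

Therefore, $(n\alpha_1, \ldots, n\alpha_k)$ is not uniformly distributed mod 1.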

2.5 Generalisation to polynomials

We shall now consider another generalisation of the sequence nα. Write

\[ p(n) = \alpha_k n^k + \alpha_{k-1} n^{k-1} + \cdots + \alpha_1 n + \alpha_0. \]

Theorem 2.11 (Weyl). If any one of $\alpha_1, \ldots, \alpha_k$ is irrational then p(n) is uniformly distributed mod 1.



(Note that it is irrelevant whether or not $\alpha_0$ is irrational.) To prove this theorem we shall need the following technical result.

Lemma 2.12 (van der Corput’s Inequality). Let $z_0, \ldots, z_{n-1} \in \mathbb{C}$ and let 1 ≤ m ≤ n − 1. Then

\[ m^2 \left| \sum_{j=0}^{n-1} z_j \right|^2 \le m(n + m - 1) \sum_{j=0}^{n-1} |z_j|^2 + 2(n + m - 1) \operatorname{Re} \sum_{j=1}^{m-1} (m - j) \sum_{i=0}^{n-1-j} z_{i+j} \overline{z_i}. \]

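Proof. Consider the following sums:

\begin{align*}
S_1 &= z_0 \\
S_2 &= z_0 + z_1 \\
&\ \vdots \\
S_m &= z_0 + z_1 + \cdots + z_{m-1} \\
S_{m+1} &= z_1 + z_2 + \cdots + z_m \\
&\ \vdots \\
S_n &= z_{n-m} + z_{n-m+1} + \cdots + z_{n-1} \\
S_{n+1} &= z_{n-m+1} + z_{n-m+2} + \cdots + z_{n-1} \\
&\ \vdots \\
S_{n+m-2} &= z_{n-2} + z_{n-1} \\
S_{n+m-1} &= z_{n-1}.
\end{align*}

Notice that each $z_j$ occurs in exactly m of the sums $S_k$. Thus

\[ S_1 + \cdots + S_{n+m-1} = m \sum_{j=0}^{n-1} z_j \]

and so

\begin{align*}
m^2 \left| \sum_{j=0}^{n-1} z_j \right|^2 &= |S_1 + \cdots + S_{n+m-1}|^2 \\
&\le (|S_1| + \cdots + |S_{n+m-1}|)^2 \\
&\le (n + m - 1)(|S_1|^2 + \cdots + |S_{n+m-1}|^2),
\end{align*}

using the fact that

\[ \left( \sum_{k=1}^{l} a_k \right)^2 \le l \sum_{k=1}^{l} a_k^2. \]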



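Now, using the formula

\[ \left| \sum_{k=1}^{l} a_k \right|^2 = \sum_{k=1}^{l} |a_k|^2 + 2 \operatorname{Re} \left( \sum_{i<j} a_i \overline{a_j} \right), \]

we have

\[ |S_1|^2 + \cdots + |S_{n+m-1}|^2 = m \left( \sum_{j=0}^{n-1} |z_j|^2 \right) + 2 \operatorname{Re} \left( \sum_{r=1}^{m-1} (m - r) \sum_{j=0}^{n-r-1} z_j \overline{z_{j+r}} \right). \]

Hence

\[ m^2 \left| \sum_{j=0}^{n-1} z_j \right|^2 \le m(n + m - 1) \left( \sum_{j=0}^{n-1} |z_j|^2 \right) + 2(n + m - 1) \operatorname{Re} \left( \sum_{j=1}^{m-1} (m - j) \sum_{i=0}^{n-j-1} z_i \overline{z_{i+j}} \right), \]

as required.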
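As a sanity check on the indices in van der Corput's inequality, one can evaluate both sides on random data. The Python sketch below is an illustration only: the function name `van_der_corput_sides` and the random test data are assumptions made here, and a numerical check is of course no substitute for the proof.

```python
import random

def van_der_corput_sides(z, m):
    """Return (lhs, rhs) of van der Corput's inequality for a list z of n
    complex numbers and an integer m with 1 <= m <= n - 1."""
    n = len(z)
    lhs = m ** 2 * abs(sum(z)) ** 2
    square_term = m * (n + m - 1) * sum(abs(w) ** 2 for w in z)
    # Double sum over j = 1..m-1 and i = 0..n-1-j of (m - j) * z_{i+j} * conj(z_i).
    cross = sum(
        (m - j) * sum(z[i + j] * z[i].conjugate() for i in range(n - j))
        for j in range(1, m)
    )
    rhs = square_term + 2 * (n + m - 1) * cross.real
    return lhs, rhs

random.seed(0)
z = [complex(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(12)]
for m in range(1, len(z)):
    lhs, rhs = van_der_corput_sides(z, m)
    print(m, lhs <= rhs + 1e-9)
```

For m = 1 the double sum is empty and the inequality reduces to the Cauchy-Schwarz bound used in the proof.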

Let $x_n \in \mathbb{R}$. For each m ≥ 1 define the sequence $x_n^{(m)} = x_{n+m} - x_n$ of mth differences. The following lemma allows us to infer the uniform distribution of the sequence $x_n$ if we know the uniform distribution of each of the mth differences of $x_n$.

Lemma 2.13. Let $x_n \in \mathbb{R}$ be a sequence. Suppose that for each m ≥ 1 the sequence $x_n^{(m)}$ of mth differences is uniformly distributed mod 1. Then $x_n$ is uniformly distributed mod 1.

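Proof. We shall apply Weyl’s Criterion. We need to show that if ℓ ∈ Z \ {0} then

\[ \frac{1}{n} \sum_{j=0}^{n-1} e^{2\pi i \ell x_j} \to 0, \quad \text{as } n \to \infty. \]

Let $z_j = e^{2\pi i \ell x_j}$ for j = 0, . . . , n − 1. Note that $|z_j| = 1$. Let 1 < m < n. By van der Corput’s inequality,

\begin{align*}
\frac{m^2}{n^2} \left| \sum_{j=0}^{n-1} e^{2\pi i \ell x_j} \right|^2
&\le \frac{m}{n^2} (n + m - 1) n + \frac{2(n + m - 1)}{n} \operatorname{Re} \sum_{j=1}^{m-1} \frac{(m - j)}{n} \sum_{i=0}^{n-1-j} e^{2\pi i \ell (x_{i+j} - x_i)} \\
&= \frac{m}{n} (m + n - 1) + \frac{2(n + m - 1)}{n} \operatorname{Re} \sum_{j=1}^{m-1} (m - j) A_{n,j}
\end{align*}

where

\[ A_{n,j} = \frac{1}{n} \sum_{i=0}^{n-1-j} e^{2\pi i \ell (x_{i+j} - x_i)} = \frac{1}{n} \sum_{i=0}^{n-1-j} e^{2\pi i \ell x_i^{(j)}}. \]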



As the sequence $x_i^{(j)}$ of jth differences is uniformly distributed mod 1, by Weyl’s criterion we have that $A_{n,j} \to 0$ for each j = 1, . . . , m − 1. Hence for each m ≥ 1

\[ \limsup_{n\to\infty} \frac{m^2}{n^2} \left| \sum_{j=0}^{n-1} e^{2\pi i \ell x_j} \right|^2 \le \limsup_{n\to\infty} \frac{m(n + m - 1)}{n} = m. \]

Hence, for each m > 1 we have

\[ \limsup_{n\to\infty} \frac{1}{n} \left| \sum_{j=0}^{n-1} e^{2\pi i \ell x_j} \right| \le \frac{1}{\sqrt{m}}. \]

As m > 1 is arbitrary, the result follows.

Proof of Weyl’s Theorem. We will only prove Weyl’s theorem in the special case where the leading coefficient $\alpha_k$ of

\[ p(n) = \alpha_k n^k + \cdots + \alpha_1 n + \alpha_0 \]

is irrational. (The general case, where $\alpha_i$ is irrational for some 1 ≤ i ≤ k, can be deduced very easily from this special case, but we will not go into this.)

We shall use induction on the degree of p. Let ∆(k) denote the statement ‘for every polynomial q of degree ≤ k, with irrational leading coefficient, the sequence q(n) is uniformly distributed mod 1’. We know that ∆(1) is true.

Suppose that ∆(k − 1) is true. Let $p(n) = \alpha_k n^k + \cdots + \alpha_1 n + \alpha_0$ be an arbitrary polynomial of degree k with $\alpha_k$ irrational. For each m ∈ N, we have that

\begin{align*}
p(n + m) - p(n)
&= \alpha_k (n + m)^k + \alpha_{k-1} (n + m)^{k-1} + \cdots + \alpha_1 (n + m) + \alpha_0 \\
&\quad - \alpha_k n^k - \alpha_{k-1} n^{k-1} - \cdots - \alpha_1 n - \alpha_0 \\
&= \alpha_k n^k + \alpha_k k n^{k-1} m + \cdots + \alpha_{k-1} n^{k-1} + \alpha_{k-1} (k - 1) n^{k-2} m \\
&\quad + \cdots + \alpha_1 n + \alpha_1 m + \alpha_0 - \alpha_k n^k - \alpha_{k-1} n^{k-1} - \cdots - \alpha_1 n - \alpha_0.
\end{align*}

After cancellation, we can see that, for each m, p(n + m) − p(n) is a polynomial of degree k − 1, with irrational leading coefficient $\alpha_k k m$. Therefore, by the inductive hypothesis, p(n + m) − p(n) is uniformly distributed mod 1. We may now apply Lemma 2.13 to conclude that p(n) is uniformly distributed mod 1 and so ∆(k) holds. This completes the induction.

Exercise 2.14. Let $p(n) = \alpha_k n^k + \alpha_{k-1} n^{k-1} + \cdots + \alpha_1 n + \alpha_0$, $q(n) = \beta_k n^k + \beta_{k-1} n^{k-1} + \cdots + \beta_1 n + \beta_0$. Show that (p(n), q(n)) is uniformly distributed mod 1 if at least one of $(\alpha_k, \beta_k, 1), \ldots, (\alpha_1, \beta_1, 1)$ is rationally independent.


3 Examples of Dynamical Systems

3.1 The circle

Several of the key examples in the course take place on the circle. There are two different, although equivalent, ways of thinking about the circle.

We can think of the circle as the quotient group

\[ \mathbb{R}/\mathbb{Z} = \{x + \mathbb{Z} \mid x \in \mathbb{R}\} \]

which is easily seen to be equivalent to [0, 1) mod 1. We refer to this as additive notation.

Alternatively, we can regard the circle as

\[ S^1 = \{z \in \mathbb{C} \mid |z| = 1\} = \{\exp 2\pi i \theta \mid \theta \in [0, 1)\}. \]

We refer to this as multiplicative notation.

The two viewpoints are obviously equivalent, and we shall use whichever is most convenient given the circumstances.

We will also be interested in maps of the k-dimensional torus. The k-dimensional torus is defined to be

\[ \mathbb{R}^k/\mathbb{Z}^k = \{x + \mathbb{Z}^k \mid x \in \mathbb{R}^k\} = [0, 1)^k \bmod 1 \]

(in additive notation) and

\[ S^1 \times \cdots \times S^1 \ (k \text{ times}) = \{(\exp 2\pi i \theta_1, \ldots, \exp 2\pi i \theta_k) \mid \theta_1, \ldots, \theta_k \in [0, 1)\} \]

(in multiplicative notation).

3.2 Rotations on a circle

Fix α ∈ [0, 1) and define the map

\[ T : \mathbb{R}/\mathbb{Z} \to \mathbb{R}/\mathbb{Z} : x \mapsto x + \alpha \bmod 1. \]

(In multiplicative notation this is: $\exp 2\pi i \theta \mapsto \exp 2\pi i (\theta + \alpha)$.) This map acts on the circle by rotating it by angle α. Clearly, we have that $T^n(0) = n\alpha \bmod 1 = \{n\alpha\}$, i.e. the fractional parts we considered in section 2 form the orbit of 0.

Suppose that α = p/q is rational (here, p, q ∈ Z, q ≠ 0). Then

\[ T^q(x) = x + q \frac{p}{q} \bmod 1 = x + p \bmod 1 = x. \]

Hence every point of R/Z is periodic.

When α is irrational, one can show that every point x ∈ R/Z has a dense orbit. This can be deduced from uniform distribution, but it can also be proved directly.

Exercise 3.1. Prove that, for an irrational rotation of the circle, every orbit is dense. (Recall that the orbit of x is dense if: for all y ∈ R/Z and for all ε > 0, there exists n > 0 such that $d(T^n(x), y) < \varepsilon$.)

(Hints: (1) First show that $T^n(x) = T^n(0) + x$ and conclude that it’s sufficient to prove that the orbit of 0 is dense. (2) Prove that $T^n(x) \ne T^m(x)$ for n ≠ m. (3) Show that for each ε > 0 there exists n > 0 such that 0 < nα mod 1 < ε (you will need to remember that the circle is sequentially compact). (4) Now show that the orbit of 0 is dense.)


3.3 The doubling map

We have already seen the doubling map

\[ T : \mathbb{R}/\mathbb{Z} \to \mathbb{R}/\mathbb{Z} : x \mapsto 2x \bmod 1. \]

(In multiplicative notation this is

\[ T(\exp 2\pi i \theta) = \exp 2\pi i (2\theta), \]

or, writing $z = e^{2\pi i \theta}$, $T(z) = z^2$.)

Proposition 3.2. Let T be the doubling map.

(i) There are $2^n - 1$ points of period n.

(ii) The periodic points are dense.

(iii) There exists a dense orbit.

Proof. We prove (i). Notice that

\[ T^n(x) = 2^n x = x \bmod 1 \]

if and only if there exists an integer p ≥ 0 such that

\[ 2^n x = x + p. \]

Hence

\[ x = \frac{p}{2^n - 1}. \]

We get distinct values of x ∈ [0, 1) for p = 0, 1, . . . , $2^n - 2$. Hence there are $2^n - 1$ points of period n.
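The count in part (i) can be confirmed for small n by listing the points $p/(2^n - 1)$ and applying the map n times. This is a minimal sketch; the helper names `doubling_iterate` and `points_fixed_by_iterate` are introduced here for illustration.

```python
from fractions import Fraction

def doubling_iterate(x, n):
    """Apply x -> 2x mod 1 to x exactly n times."""
    for _ in range(n):
        x = (2 * x) % 1
    return x

def points_fixed_by_iterate(n):
    """The points p / (2^n - 1) for p = 0, 1, ..., 2^n - 2, as in the proof."""
    denominator = 2 ** n - 1
    return [Fraction(p, denominator) for p in range(denominator)]

n = 3
candidates = points_fixed_by_iterate(n)
# Number of candidate points, and whether each one satisfies T^n(x) = x.
print(len(candidates), all(doubling_iterate(x, n) == x for x in candidates))
```

As in the proof, 'period n' here means $T^n(x) = x$; the list therefore includes points of smaller period, such as the fixed point 0.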

We leave (ii) as an exercise.

Exercise 3.3. Prove (ii).

We sketch the proof of (iii). Let us denote the interval [0, 1/2) by the symbol 0 and denote the interval [1/2, 1) by 1. Let x ∈ [0, 1). For each n ≥ 0 let $x_n$ denote the symbol corresponding to the interval in which $T^n(x)$ lies. Thus to each x ∈ [0, 1) we associate a sequence $(x_0, x_1, \ldots)$ of 0s and 1s. It is easy to see that

\[ x = \sum_{n=0}^{\infty} \frac{x_n}{2^{n+1}} \]

so that the sequence $(x_0, x_1, \ldots)$ corresponds to the base 2 expansion of x.

Notice that if x has coding $(x_0, x_1, \ldots)$ then

\[ T(x) = 2x \bmod 1 = \sum_{n=0}^{\infty} \frac{2 x_n}{2^{n+1}} \bmod 1 = x_0 + \sum_{n=0}^{\infty} \frac{x_{n+1}}{2^{n+1}} \bmod 1 = \sum_{n=0}^{\infty} \frac{x_{n+1}}{2^{n+1}} \]


so that T(x) has expansion $(x_1, x_2, \ldots)$, i.e. T can be thought of as acting on the coding of x by shifting the associated sequence one place to the left.
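The coding can be tried out on a dyadic rational, where everything is finite. In the sketch below the function names `itinerary` and `from_binary_digits` and the sample point 5/16 are choices made here for illustration.

```python
from fractions import Fraction

def itinerary(x, length):
    """Symbols x_0, ..., x_(length-1): 0 if T^n(x) lies in [0, 1/2), 1 otherwise."""
    symbols = []
    for _ in range(length):
        symbols.append(0 if x < Fraction(1, 2) else 1)
        x = (2 * x) % 1
    return symbols

def from_binary_digits(digits):
    """The rational number given by the finite sum of digits[n] / 2^(n+1)."""
    return sum(Fraction(d, 2 ** (n + 1)) for n, d in enumerate(digits))

x = Fraction(5, 16)            # 0.0101 in base 2
code = itinerary(x, 6)
print(code)                    # the itinerary reproduces the binary digits
print(from_binary_digits(code) == x)
# Applying T drops the first symbol of the coding.
print(itinerary((2 * x) % 1, 5) == code[1:])
```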

For each n-tuple $x_0, x_1, \ldots, x_{n-1}$ let

\[ I(x_0, \ldots, x_{n-1}) = \{x \in [0, 1) \mid T^k(x) \text{ lies in interval } x_k \text{ for } k = 0, 1, \ldots, n-1\}. \]

That is, $I(x_0, \ldots, x_{n-1})$ corresponds to the set of all x ∈ [0, 1) whose base 2 expansion starts $x_0, \ldots, x_{n-1}$. We call $I(x_0, \ldots, x_{n-1})$ a cylinder of rank n.

Exercise 3.4. Draw all cylinders of length ≤ 4.

One can show:

(i) a cylinder of rank n is an interval of length $2^{-n}$.

(ii) for each x ∈ [0, 1) with base 2 expansion $x_0, x_1, \ldots$, the intervals $I(x_0, \ldots, x_n)$ ‘converge’ as n → ∞ (in an appropriate sense) to x.

From these observations it is easy to see that, in order to construct a dense orbit, it is sufficient to construct a point x such that for every cylinder I there exists n = n(I) such that $T^n(x) \in I$. To do this, firstly write down all possible cylinders (there are countably many):

0, 1, 00, 01, 10, 11, 000, 001, 010, 011, 100, 101, 110, 111, 0000, 0001, . . . .

Now take x to be the point with base 2 expansion

010001101100000101001110010111011100000001 . . .

(that is, just adjoin all the symbolic representations of all cylinders in some order). One can easily check that such a point x has a dense orbit.

Exercise 3.5. Write down the proof of Proposition 3.2(iii), adding in complete details.

Remark 3.6. This technique of coding the orbits of a given dynamical system by partitioning the space X and forming an itinerary map is a very powerful technique that can be used to study many different classes of dynamical system.

3.4 Shifts of finite type

Let S = {1, 2, . . . , k} be a finite set of symbols. We will be interested in sets consisting of sequences of these symbols, subject to certain conditions. We will impose the following conditions: we assume that for each symbol i we allow certain symbols (depending only on i) to follow i and disallow all other symbols.

This information is best recorded in a k × k matrix A with entries in {0, 1}. That is, we allow the symbol j to follow the symbol i if and only if the corresponding (i, j)th entry of the matrix A (denoted by $A_{i,j}$) is equal to 1.



Definition 3.7. Let A be a k × k matrix with entries in {0, 1}. Let

\[ \Sigma_A^+ = \{(x_j)_{j=0}^{\infty} \mid A_{x_j, x_{j+1}} = 1, \text{ for } j \in \mathbb{Z}^+\} \]

denote the set of all infinite sequences of symbols $(x_j)$ where symbol j can follow symbol i precisely when $A_{i,j} = 1$. We call $\Sigma_A^+$ a (one-sided) shift of finite type.

Let

\[ \Sigma_A = \{(x_j)_{j=-\infty}^{\infty} \mid A_{x_j, x_{j+1}} = 1, \text{ for } j \in \mathbb{Z}\} \]

denote the set of all bi-infinite sequences of symbols subject to the same conditions. We call $\Sigma_A$ a (two-sided) shift of finite type.

Sometimes for brevity we refer to $\Sigma_A^+$ or $\Sigma_A$ as a shift space.

An alternative description of $\Sigma_A^+$ and $\Sigma_A$ can be given as follows. Consider a directed graph with vertex set {1, 2, . . . , k} and with a directed edge from vertex i to vertex j precisely when $A_{i,j} = 1$. Then $\Sigma_A^+$ and $\Sigma_A$ correspond to the set of all infinite (respectively, bi-infinite) paths in this graph.

Define

\[ \sigma^+ : \Sigma_A^+ \to \Sigma_A^+ \]

by

\[ (\sigma^+(x))_j = x_{j+1}. \]

Then $\sigma^+$ takes a sequence in $\Sigma_A^+$ and shifts it one place to the left (deleting the first term). We call $\sigma^+$ the (one-sided, left) shift map.

There is a corresponding shift map on the two-sided shift space. Define

\[ \sigma : \Sigma_A \to \Sigma_A \]

by

\[ (\sigma(x))_j = x_{j+1}, \]

so that σ shifts sequences one place to the left. Notice that in this case, we do not need to delete any terms in the sequence. We call σ the (two-sided, left) shift map.

Notice that σ is invertible but $\sigma^+$ is not. For ease of notation, we shall often write σ to denote both the one-sided and the two-sided shift map.

Examples 3.8.

Take A to be the k × k matrix with each entry equal to 1. Then any symbol can follow any other symbol. Hence $\Sigma_A^+$ is the space of all sequences of symbols {1, 2, . . . , k}. In this case we write $\Sigma_k^+$ for $\Sigma_A^+$ and refer to it as the full one-sided k-shift. Similarly, we can define the full two-sided k-shift.

Take A to be the matrix

\[ \begin{pmatrix} 1 & 1 \\ 1 & 0 \end{pmatrix}. \]

Then $\Sigma_A^+$ consists of all sequences of 1s and 2s subject to the condition that each 2 must be followed by a 1.



The following two exercises show that, for certain A, $\Sigma_A^+$ (or $\Sigma_A$) can be rather uninteresting.

Exercise 3.9. Let

\[ A = \begin{pmatrix} 0 & 1 \\ 0 & 0 \end{pmatrix}. \]

Show that $\Sigma_A^+$ is empty.

Exercise 3.10. Let

\[ A = \begin{pmatrix} 1 & 1 \\ 0 & 1 \end{pmatrix}. \]

Calculate $\Sigma_A^+$.

The following conditions on A guarantee that $\Sigma_A^+$ (or $\Sigma_A$) is more interesting than the examples in Exercises 3.9 and 3.10.

Definition 3.11. Let A be a k × k matrix with entries in {0, 1}. We say that A is irreducible if for each i, j ∈ {1, 2, . . . , k} there exists n = n(i, j) > 0 such that $(A^n)_{i,j} > 0$. (Here, $(A^n)_{i,j}$ denotes the (i, j)th entry of the nth power of A.)

Definition 3.12. Let A be a k × k matrix with entries in {0, 1}. We say that A is aperiodic if there exists n > 0 such that for all i, j ∈ {1, 2, . . . , k} we have $(A^n)_{i,j} > 0$.

In graph-theoretic terms, the matrix A is irreducible if there exists a path along edges from any vertex to any other vertex. The matrix A is aperiodic if this path can be chosen to have the same length (i.e. consist of the same number of edges), irrespective of the two vertices chosen.
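Both definitions can be tested mechanically by computing matrix powers. The Python sketch below is one possible way to do this, not a method taken from these notes: the function names are hypothetical, the irreducibility test uses the fact that a shortest path or cycle in a graph with k vertices has at most k edges, and the aperiodicity test relies on Wielandt's bound (a standard result, not proved here) that an aperiodic k × k matrix already satisfies $A^n > 0$ for $n = (k-1)^2 + 1$.

```python
def mat_mul(A, B):
    """Product of two k x k matrices given as lists of rows."""
    k = len(A)
    return [[sum(A[i][r] * B[r][j] for r in range(k)) for j in range(k)]
            for i in range(k)]

def is_irreducible(A):
    """True if for every (i, j) some power A^n, 1 <= n <= k, has (A^n)_{i,j} > 0."""
    k = len(A)
    reached = [[False] * k for _ in range(k)]
    power = A
    for _ in range(k):
        for i in range(k):
            for j in range(k):
                if power[i][j] > 0:
                    reached[i][j] = True
        power = mat_mul(power, A)
    return all(all(row) for row in reached)

def is_aperiodic(A):
    """True if A^n has all entries positive for n = (k - 1)^2 + 1 (Wielandt's bound)."""
    k = len(A)
    power = A
    for _ in range((k - 1) ** 2):
        power = mat_mul(power, A)
    return all(entry > 0 for row in power for entry in row)

# A 2-cycle: symbol 1 must be followed by 2 and symbol 2 by 1.
two_cycle = [[0, 1], [1, 0]]
print(is_irreducible(two_cycle), is_aperiodic(two_cycle))
```

For the 2-cycle the output shows an irreducible matrix that is not aperiodic: every path from a vertex to itself has even length, and every path between the two distinct vertices has odd length.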

Exercise 3.13. (i) Consider the matrix

\[ \begin{pmatrix} 1 & 1 \\ 1 & 0 \end{pmatrix}. \]

Draw the corresponding directed graph. Is this matrix irreducible? Is it aperiodic?

(ii) Consider the matrix

\[ \begin{pmatrix} 0 & 1 & 0 & 1 \\ 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 \\ 1 & 0 & 1 & 0 \end{pmatrix}. \]

Draw the corresponding directed graph. Is this matrix irreducible? Is it aperiodic?

Remark 3.14. These shift spaces may seem very strange at first sight; it takes a long time to get used to them. However (as we shall see) they are particularly tractable examples of chaotic dynamical systems. Moreover, a wide class of dynamical systems (notably hyperbolic dynamical systems) can be modeled in terms of shifts of finite type. We have already seen a particularly simple example of this: the doubling map can be modeled by the full one-sided 2-shift.


3.5 Periodic points

A sequence $x = (x_j)_{j=0}^{\infty} \in \Sigma_A^+$ is periodic for the shift σ if there exists n > 0 such that $\sigma^n x = x$.

One can easily check that this means that

\[ x_j = x_{j+n} \quad \text{for all } j \in \mathbb{Z}^+. \]

That is, the sequence x is determined by a finite block of symbols $x_0, \ldots, x_{n-1}$ and

\[ x = (x_0, x_1, \ldots, x_{n-1}, x_0, x_1, \ldots, x_{n-1}, \ldots). \]

Exercise 3.15. Consider the full one-sided k-shift. How many periodic points of period n are there?

3.6 Cylinders

Later on we will need a particularly tractable class of subsets of shift spaces. These are the cylinder sets and are formed by fixing a finite set of co-ordinates. More precisely, in $\Sigma_A$ we define

\[ [y_{-m}, \ldots, y_{-1}, y_0, y_1, \ldots, y_n]_{-m,n} = \{x \in \Sigma_A \mid x_j = y_j, \ -m \le j \le n\}, \]

and in $\Sigma_A^+$ we define

\[ [y_0, y_1, \ldots, y_n]_{0,n} = \{x \in \Sigma_A^+ \mid x_j = y_j, \ 0 \le j \le n\}. \]

3.7 A metric on Σ+A

What does it mean for two sequences in $\Sigma_A^+$ to be ‘close’? Heuristically we will say that two sequences $(x_j)_{j=0}^{\infty}$ and $(y_j)_{j=0}^{\infty}$ are close if they agree for a large number of initial places.

More formally, for two sequences $x = (x_j)_{j=0}^{\infty}$, $y = (y_j)_{j=0}^{\infty} \in \Sigma_A^+$ we define n(x, y) by setting n(x, y) = n if $x_j = y_j$ for j = 0, . . . , n − 1 but $x_n \ne y_n$. Thus n(x, y) is the first place in which the sequences x and y disagree. (We set n(x, y) = ∞ if x = y.)

We define a metric d on $\Sigma_A^+$ by

\[ d((x_j)_{j=0}^{\infty}, (y_j)_{j=0}^{\infty}) = \left(\frac{1}{2}\right)^{n(x,y)} \quad \text{if } x \ne y \]

and $d((x_j)_{j=0}^{\infty}, (y_j)_{j=0}^{\infty}) = 0$ if x = y.

Exercise 3.16. Show that this is a metric.

In the two-sided case, we can define a metric in a similar way. Let $x = (x_j)_{j=-\infty}^{\infty}$, $y = (y_j)_{j=-\infty}^{\infty} \in \Sigma_A$. Define n(x, y) by setting n(x, y) = n if $x_j = y_j$ for |j| ≤ n − 1 and either $x_n \ne y_n$ or $x_{-n} \ne y_{-n}$. Thus n(x, y) is the first place, going either forwards or backwards, in which the sequences x, y disagree. (We again set n(x, y) = ∞ if x = y.)


We define a metric d on $\Sigma_A$ in the same way:

\[ d((x_j)_{j=-\infty}^{\infty}, (y_j)_{j=-\infty}^{\infty}) = \left(\frac{1}{2}\right)^{n(x,y)} \quad \text{if } x \ne y \]

and $d((x_j)_{j=-\infty}^{\infty}, (y_j)_{j=-\infty}^{\infty}) = 0$ if x = y.

Theorem 3.17. Let $\Sigma_A^+$ be a shift of finite type.

(i) $\Sigma_A^+$ is a compact metric space.

(ii) The shift map σ is continuous.

Remark 3.18. The corresponding statements for the two-sided case also hold.

Proof. (i) If $\Sigma_A^+ = \emptyset$ or if $\Sigma_A^+$ is finite then trivially it is compact. Thus we may assume that $\Sigma_A^+$ is infinite.

Let $x^{(m)} \in \Sigma_A^+$ be a sequence (in reality, a sequence of sequences!). We need to show that $x^{(m)}$ has a convergent subsequence. Since $\Sigma_A^+ = \bigcup_{i=1}^{k} [i]$, at least one cylinder [i] contains infinitely many elements of the sequence $x^{(m)}$; call this $[y_0]$. Thus there are infinitely many m for which $x^{(m)} \in [y_0]$.

Since $[y_0] = \bigcup_{i \,:\, A_{y_0,i} = 1} [y_0 i]$ we similarly obtain a cylinder of length 2, $[y_0 y_1]$ say, containing infinitely many elements of the sequence $x^{(m)}$.

Continue inductively in this way to obtain a nested family of cylinders $[y_0, \ldots, y_n]$, n ≥ 0, each containing infinitely many elements of the sequence $x^{(m)}$.

Set $y = (y_n)_{n=0}^{\infty} \in \Sigma_A^+$. Then for each n ≥ 0, there exist infinitely many m for which $x^{(m)} \in [y_0, \ldots, y_n]$, and hence for which $d(y, x^{(m)}) \le (1/2)^{n+1}$. Thus y is the limit of some subsequence of $x^{(m)}$.

(ii) We want to show the following: ∀ε > 0 ∃δ > 0 s.t. d(x, y) < δ ⇒ d(σ(x), σ(y)) < ε.

Let ε > 0. Choose n such that $1/2^n < \varepsilon$. Let $\delta = 1/2^{n+1}$. Suppose that d(x, y) < δ. Then n(x, y) > n + 1, so that x and y agree in the first n + 1 places. Hence σ(x) and σ(y) agree in the first n places, so that n(σ(x), σ(y)) > n. Hence $d(\sigma(x), \sigma(y)) = 1/2^{n(\sigma(x), \sigma(y))} < 1/2^n < \varepsilon$.

Exercise 3.19. Let A be an irreducible k × k matrix with entries in {0, 1}. Show that the set of all periodic points for σ is dense in $\Sigma_A^+$. (Recall that a subset Y of a set X is said to be dense if: for all x ∈ X and for all ε > 0 there exists y ∈ Y such that d(x, y) < ε, i.e. any point of X can be arbitrarily well approximated by a point of Y.)

Exercise 3.20. Let A be an irreducible k × k matrix with entries in {0, 1}. Show that there exists a point $x \in \Sigma_A^+$ with a dense orbit. (Hint: first show that if the orbit of a point visits each cylinder then it is dense. To construct such a point, mimic the argument used for the doubling map above. Use irreducibility to show that one can concatenate cylinders together by inserting finite strings of symbols between them.)


3.8 The continued fraction map

Every x ∈ (0, 1) can be expressed as a continued fraction:

\[ x = \cfrac{1}{x_0 + \cfrac{1}{x_1 + \cfrac{1}{x_2 + \cfrac{1}{x_3 + \cdots}}}} \tag{1} \]

for $x_n \in \mathbb{N}$.

For example,

\[ \frac{-1 + \sqrt{5}}{2} = \cfrac{1}{1 + \cfrac{1}{1 + \cfrac{1}{1 + \cfrac{1}{1 + \cdots}}}}, \]

\[ \frac{3}{4} = \cfrac{1}{1 + \cfrac{1}{3}}, \]

\[ \pi = 3 + \cfrac{1}{7 + \cfrac{1}{15 + \cfrac{1}{1 + \cfrac{1}{292 + \cdots}}}}. \]

One can show that rational numbers have a finite continued fraction expansion (that is, the above expression terminates at $x_n$ for some n). Conversely, it is clear that a finite continued fraction expansion gives rise to a rational number.

Thus each irrational x ∈ (0, 1) has an infinite continued fraction expansion of the form (1). Moreover, one can show that this expansion is unique. For brevity, we will sometimes write (1) as $x = [x_0; x_1; x_2; \ldots]$.

Recall that earlier in this section we saw how the doubling map $x \mapsto 2x \bmod 1$ can be used to determine the base 2 expansion of x. Here we introduce a dynamical system that allows us to determine the continued fraction expansion of x.

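We can read off the numbers $x_i$ from the transformation T : [0, 1] → [0, 1] defined by T(0) = 0 and, for 0 < x < 1,

\[ T(x) = \frac{1}{x} \bmod 1. \]

Then

\[ x_0 = \left[\frac{1}{x}\right], \quad x_1 = \left[\frac{1}{Tx}\right], \quad \ldots, \quad x_n = \left[\frac{1}{T^n x}\right]. \]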

This is called the continued fraction map or the Gauss map.
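The recipe above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `gauss_map` and `cf_digits` are our own, and exact rational arithmetic is used (via `fractions.Fraction`) so that the expansion of a rational number terminates cleanly.

```python
from fractions import Fraction
from math import floor


def gauss_map(x):
    """Continued fraction map: T(0) = 0 and T(x) = 1/x mod 1 otherwise."""
    if x == 0:
        return Fraction(0)
    y = 1 / x
    return y - floor(y)


def cf_digits(x, max_digits=20):
    """Read off x_0, x_1, ... as the integer parts of 1/x, 1/T(x), 1/T^2(x), ..."""
    digits = []
    for _ in range(max_digits):
        if x == 0:  # the expansion of a rational number terminates
            break
        digits.append(floor(1 / x))
        x = gauss_map(x)
    return digits


print(cf_digits(Fraction(3, 4)))  # digits of 3/4 = 1/(1 + 1/3)
```

For an irrational x one would have to work with floating point approximations, in which case only the first few digits can be trusted.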

Exercise 3.21. Draw the graph of the continued fraction map.

Later in the course we will study the ergodic theoretic properties of the continued fraction map and use them to deduce some interesting facts about continued fractions.

3.9 Endomorphisms of a torus

Take X = Rk/Zk to be the k-torus.

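Let $A = (a_{ij})$ be a k × k matrix with entries in $\mathbb{Z}$ and with $\det A \neq 0$. We can define a linear map $\mathbb{R}^k \to \mathbb{R}^k$ by

\[ \begin{pmatrix} x_1 \\ \vdots \\ x_k \end{pmatrix} \mapsto A \begin{pmatrix} x_1 \\ \vdots \\ x_k \end{pmatrix}. \]

For brevity, we shall often write this as $(x_1, \ldots, x_k) \mapsto A(x_1, \ldots, x_k)$.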

Since A is an integer matrix, it maps $\mathbb{Z}^k$ to itself. We claim that A allows us to define a map

\[ T = T_A : \mathbb{R}^k/\mathbb{Z}^k \to \mathbb{R}^k/\mathbb{Z}^k, \qquad (x_1, \ldots, x_k) \mapsto A(x_1, \ldots, x_k) \bmod 1. \]

To see that this map is well defined, we need to check that if $x, y \in \mathbb{R}^k$ determine the same point in $\mathbb{R}^k/\mathbb{Z}^k$ then Ax mod 1 and Ay mod 1 are the same point in $\mathbb{R}^k/\mathbb{Z}^k$. But this is clear: if $x, y \in \mathbb{R}^k$ give the same point in the torus, then x = y + n for some $n \in \mathbb{Z}^k$. Hence Ax = A(y + n) = Ay + An. As A maps $\mathbb{Z}^k$ to itself, we see that $An \in \mathbb{Z}^k$, so that Ax and Ay determine the same point in the torus.

Definition 3.22. Let $A = (a_{ij})$ denote a k × k matrix with integer entries such that $\det A \neq 0$. Then we call the map $T_A : \mathbb{R}^k/\mathbb{Z}^k \to \mathbb{R}^k/\mathbb{Z}^k$ a linear toral endomorphism.

The map T is not invertible in general. However, if $\det A = \pm 1$ then $A^{-1}$ exists and is an integer matrix. Hence we have a map $T^{-1}$ given by

\[ T^{-1}(x_1, \ldots, x_k) = A^{-1}(x_1, \ldots, x_k) \bmod 1. \]

One can easily check that $T^{-1}$ is the inverse of T.

Definition 3.23. Let $A = (a_{ij})$ denote a k × k matrix with integer entries such that $\det A = \pm 1$. Then we call the map $T_A : \mathbb{R}^k/\mathbb{Z}^k \to \mathbb{R}^k/\mathbb{Z}^k$ a linear toral automorphism.

Example 3.24. Take A to be the matrix

\[ A = \begin{pmatrix} 2 & 1 \\ 1 & 1 \end{pmatrix} \]

and define $T : \mathbb{R}^2/\mathbb{Z}^2 \to \mathbb{R}^2/\mathbb{Z}^2$ to be the induced map:

\[ T(x_1, x_2) = (2x_1 + x_2 \bmod 1, \; x_1 + x_2 \bmod 1). \]

Then T is a linear toral automorphism and is called Arnold’s cat map. (CAT stands for ‘C’ontinuous ‘A’utomorphism of the ‘T’orus.)
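As a small illustration, the cat map can be iterated exactly on points with rational co-ordinates. The sketch below is our own (the names `cat_map` and `orbit` do not appear in the notes); it follows the orbit of the sample point (1/2, 1/2), which returns to its starting position after three steps.

```python
from fractions import Fraction


def cat_map(point):
    """Arnold's cat map T(x1, x2) = (2*x1 + x2 mod 1, x1 + x2 mod 1)."""
    x1, x2 = point
    return ((2 * x1 + x2) % 1, (x1 + x2) % 1)


def orbit(point, steps):
    """Return the list [x, T(x), ..., T^steps(x)]."""
    points = [point]
    for _ in range(steps):
        point = cat_map(point)
        points.append(point)
    return points


start = (Fraction(1, 2), Fraction(1, 2))
for p in orbit(start, 3):
    print(p)
```

Using `Fraction` rather than floating point numbers avoids rounding errors, which the cat map would otherwise amplify at every step.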

Definition 3.25. Suppose that $\det A = \pm 1$. Then we call T a hyperbolic toral automorphism if A has no eigenvalues of modulus 1.

Exercise 3.26. Check that Arnold’s cat map is hyperbolic. Decide whether the following matrices give hyperbolic toral automorphisms:

\[ A_1 = \begin{pmatrix} 1 & 1 \\ 0 & 1 \end{pmatrix}, \qquad A_2 = \begin{pmatrix} 1 & 1 \\ 1 & 0 \end{pmatrix}. \]

Let us consider the special case of a toral automorphism of the 2-dimensional torus R2/Z2.

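Proposition 3.27. Let T be a hyperbolic toral automorphism of $\mathbb{R}^2/\mathbb{Z}^2$ with corresponding matrix A having eigenvalues $\lambda_1, \lambda_2$.

(i) The periodic points of T correspond precisely with the set of rational points of $\mathbb{R}^2/\mathbb{Z}^2$:

\[ \left\{ \left( \frac{p_1}{q}, \frac{p_2}{q} \right) + \mathbb{Z}^2 \;\middle|\; p_1, p_2, q \in \mathbb{N}, \; 0 \le p_1, p_2 < q \right\}. \]

(In particular, the periodic points are dense.)

(ii) Suppose that $\det A = 1$. Then the number of points of period n is given by:

\[ \operatorname{card}\{x \in \mathbb{R}^2/\mathbb{Z}^2 \mid T^n(x) = x\} = |\lambda_1^n + \lambda_2^n - 2|. \]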

Proof. (i) If $(x_1, x_2) = (p_1/q, p_2/q)$ has rational co-ordinates then we can write

\[ T^n(x_1, x_2) = \left( \frac{p_1^{(n)}}{q}, \frac{p_2^{(n)}}{q} \right) \]

where $0 \le p_1^{(n)}, p_2^{(n)} < q$ are integers. As there are at most $q^2$ distinct possibilities for $p_1^{(n)}, p_2^{(n)}$, this sequence (in n) must be eventually periodic. Hence there exists $n_1 > n_0$ such that $T^{n_1}(x_1, x_2) = T^{n_0}(x_1, x_2)$. As T is invertible, we see that $T^{n_1 - n_0}(x_1, x_2) = (x_1, x_2)$, so that $(x_1, x_2)$ is periodic.

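Conversely, if $(x_1, x_2) \in \mathbb{R}^2/\mathbb{Z}^2$ is periodic then $T^n(x_1, x_2) = (x_1, x_2)$ for some n > 0. Hence

\[ A^n \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} + \begin{pmatrix} n_1 \\ n_2 \end{pmatrix} \tag{2} \]

for some $n_1, n_2 \in \mathbb{Z}$. As A is hyperbolic, A has no eigenvalues of modulus 1. Hence $A^n$ has no eigenvalues of modulus 1, and in particular 1 is not an eigenvalue. Hence $A^n - I$ is invertible. Hence solutions to (2) have the form

\[ \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = (A^n - I)^{-1} \begin{pmatrix} n_1 \\ n_2 \end{pmatrix}. \]

As $A^n - I$ has entries in $\mathbb{Z}$, the matrix $(A^n - I)^{-1}$ has entries in $\mathbb{Q}$. Hence $x_1, x_2 \in \mathbb{Q}$.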

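(ii) A point $(x_1, x_2)$ is periodic with period n for T if and only if

\[ (A^n - I) \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} n_1 \\ n_2 \end{pmatrix} \tag{3} \]

for some $n_1, n_2 \in \mathbb{Z}$. We may take $x_1, x_2 \in [0, 1)$. Let $u = (A^n - I)(0, 1)$, $v = (A^n - I)(1, 0)$. The map $A^n - I$ maps [0, 1) × [0, 1) onto the parallelogram

\[ R = \{\alpha u + \beta v \mid 0 \le \alpha, \beta < 1\}. \]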

For the point $(x_1, x_2) \in [0, 1) \times [0, 1)$ to be periodic, it follows from (3) that $(A^n - I)(x_1, x_2)$ must be an integer point of R. Thus the number of periodic points of period n is equal to the number of integer points in R. One can check that the number of such points is equal to the area of R. Hence the number of periodic points of period n is given by $|\det(A^n - I)|$.

Let us calculate the eigenvalues of $A^n - I$. Let µ be an eigenvalue of $A^n - I$ with eigenvector v. Then

\[ (A^n - I)v = \mu v \iff A^n v = (\mu + 1)v \]

so that µ + 1 is an eigenvalue of $A^n$. As the eigenvalues of A are given by $\lambda_1, \lambda_2$, the eigenvalues of $A^n$ are given by $\lambda_1^n, \lambda_2^n$. Hence the eigenvalues of $A^n - I$ are $\lambda_1^n - 1, \lambda_2^n - 1$. As the determinant of a matrix is given by the product of the eigenvalues, we have that

\begin{align*}
|\det(A^n - I)| &= |(\lambda_1^n - 1)(\lambda_2^n - 1)| \\
&= |(\lambda_1 \lambda_2)^n + 1 - (\lambda_1^n + \lambda_2^n)| \\
&= |\lambda_1^n + \lambda_2^n - 2|,
\end{align*}

as $\lambda_1 \lambda_2 = \det A = 1$.
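The count in part (ii) can be checked numerically for a given matrix. The following sketch is our own and makes two assumptions that are not spelled out in the proof: it uses the identity $\lambda_1^n + \lambda_2^n = \operatorname{trace}(A^n)$, and it uses the fact that every solution of (3) has co-ordinates with denominator dividing $q = |\det(A^n - I)|$, so that it suffices to test the points $(p_1/q, p_2/q)$. The helper names are hypothetical.

```python
def mat_mul(A, B):
    """Product of two 2x2 integer matrices."""
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def mat_pow(A, n):
    """n-th power of a 2x2 integer matrix, for n >= 0."""
    result = [[1, 0], [0, 1]]
    for _ in range(n):
        result = mat_mul(result, A)
    return result


def periodic_point_counts(A, n):
    """Return (brute-force count, |det(A^n - I)|, |trace(A^n) - 2|).

    Assumes A is hyperbolic, so that det(A^n - I) is non-zero.
    """
    An = mat_pow(A, n)
    M = [[An[0][0] - 1, An[0][1]], [An[1][0], An[1][1] - 1]]  # A^n - I
    q = abs(M[0][0] * M[1][1] - M[0][1] * M[1][0])
    # (p1/q, p2/q) satisfies T^n(x) = x exactly when M p = 0 mod q.
    brute = sum(
        1
        for p1 in range(q)
        for p2 in range(q)
        if (M[0][0] * p1 + M[0][1] * p2) % q == 0
        and (M[1][0] * p1 + M[1][1] * p2) % q == 0
    )
    return brute, q, abs(An[0][0] + An[1][1] - 2)


cat = [[2, 1], [1, 1]]
for n in range(1, 5):
    print(n, periodic_point_counts(cat, n))
```

For the cat map the three numbers agree for each n, as Proposition 3.27(ii) predicts.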

4 Measure Theory

4.1 Background

In section 1 we remarked that ergodic theory is the study of the qualitative distributional properties of typical orbits of a dynamical system and that these properties are expressed in terms of measure theory. Measure theory therefore lies at the heart of ergodic theory. However, we will not need to know the (many!) intricacies of measure theory and this section will be devoted to an expository account of the required facts.

4.2 Measure spaces

Loosely speaking, a measure is a function that, when given a subset of a space X, will say how ‘big’ that subset is. A motivating example is given by Lebesgue measure. The Lebesgue measure of an interval is given by its length. In defining an abstract measure space, we will be taking the properties of ‘length’ (or, in higher dimensions, ‘volume’) and abstracting them, in much the same way that a metric space abstracts the properties of ‘distance’.

It turns out that in general it is not possible to define the measure of an arbitrary subset of X. Instead, we will usually have to restrict our attention to a class of subsets of X.

Definition 4.1. A collection B of subsets of X is called a σ-algebra if the following properties hold:

(i) ∅ ∈ B,

(ii) if E ∈ B then its complement X \ E ∈ B,

(iii) if $E_n \in \mathcal{B}$, n = 1, 2, 3, . . ., is a countable sequence of sets in B then their union $\bigcup_{n=1}^{\infty} E_n \in \mathcal{B}$.

Examples 4.2.

The trivial σ-algebra is given by B = {∅, X}.

The full σ-algebra is given by B = P(X), i.e. the collection of all subsets of X.
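When X is a finite set, the axioms of Definition 4.1 can be checked mechanically. The sketch below is illustrative only; the function name `is_sigma_algebra` is our own. It relies on the observation that, for a finite collection of sets, closure under countable unions reduces to closure under pairwise unions.

```python
from itertools import combinations


def is_sigma_algebra(X, collection):
    """Check Definition 4.1 for a collection of subsets of a finite set X."""
    X = frozenset(X)
    sets = {frozenset(E) for E in collection}
    if any(not E <= X for E in sets):  # every member must be a subset of X
        return False
    if frozenset() not in sets:  # (i) the empty set belongs to the collection
        return False
    if any(X - E not in sets for E in sets):  # (ii) closed under complements
        return False
    # (iii) closed under unions; pairwise unions suffice for a finite collection
    return all(E | F in sets for E, F in combinations(sets, 2))


X = {1, 2, 3}
print(is_sigma_algebra(X, [set(), X]))        # the trivial sigma-algebra
print(is_sigma_algebra(X, [set(), {1}, X]))   # fails: the complement {2, 3} is missing
```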

Here are some easy properties of σ-algebras:

Lemma 4.3. Let B be a σ-algebra of subsets of X. Then

(i) X ∈ B;

(ii) if $E_n \in \mathcal{B}$ then $\bigcap_{n=1}^{\infty} E_n \in \mathcal{B}$.

Exercise 4.4. Prove Lemma 4.3.

In the special case when X is a compact metric space there is a particularly important σ-algebra.

Definition 4.5. Let X be a compact metric space. We define the Borel σ-algebra B(X) to be the smallest σ-algebra of subsets of X which contains all the open subsets of X.

Remarks 4.6.

By ‘smallest’ we mean that if C is another σ-algebra that contains all open subsets of X then B(X) ⊂ C.

We say that the Borel σ-algebra is generated by the open sets. We call the sets in B(X) Borel sets.

By Definition 4.1(ii), the Borel σ-algebra also contains all closed sets and is the smallest σ-algebra with this property.

Let X be a set and let B be a σ-algebra of subsets of X.

Definition 4.7. A function µ : B → R+ ∪ {∞} is called a measure if:

(i) µ(∅) = 0;

(ii) if $E_n$ is a countable collection of pairwise disjoint sets in B (i.e. $E_n \cap E_m = \emptyset$ for n ≠ m) then

\[ \mu\left( \bigcup_{n=1}^{\infty} E_n \right) = \sum_{n=1}^{\infty} \mu(E_n). \]

(If µ(X) < ∞ then we call µ a finite measure.) We call (X, B, µ) a measure space.

If µ(X) = 1 then we call µ a probability or probability measure and refer to (X, B, µ) as a probability space.

Remark 4.8. Thus a measure just abstracts properties of ‘length’ or ‘volume’. Condition (i) says that the empty set has zero length, and condition (ii) says that the length of a disjoint union is the sum of the lengths of the individual sets.

Definition 4.9. We say that a property holds almost everywhere if the set of points on which the property fails to hold has measure zero.

We will usually be interested in studying measures on the Borel σ-algebra of a compact metric space X. To define such a measure, we need to define the measure of an arbitrary Borel set. In general, the Borel σ-algebra is extremely large. In the next section we see that it is often unnecessary to do this and instead it is sufficient to define the measure of a certain class of subsets.

4.3 The Kolmogorov Extension Theorem

A collection A of subsets of X is called an algebra if:

(i) ∅ ∈ A,

(ii) if A,B ∈ A then A ∩ B ∈ A;

(iii) if A ∈ A then Ac ∈ A.

Thus an algebra is like a σ-algebra, except that we do not assume that A is closed under countable unions.

Example 4.10. Take X = [0, 1], and A = {all finite unions of subintervals}.

Let B(A) denote the σ-algebra generated by A, i.e., the smallest σ-algebra containing A. (In the above example B(A) is the Borel σ-algebra.)

Theorem 4.11 (Kolmogorov Extension Theorem). Let A be an algebra of subsets of X. Suppose that $\mu : \mathcal{A} \to \mathbb{R}^+$ satisfies:

(i) µ(∅) = 0;

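(ii) there exist finitely or countably many sets $X_n \in \mathcal{A}$ such that $X = \bigcup_n X_n$ and $\mu(X_n) < \infty$;

(iii) if $E_n \in \mathcal{A}$, n ≥ 1, are pairwise disjoint and if $\bigcup_{n=1}^{\infty} E_n \in \mathcal{A}$ then

\[ \mu\left( \bigcup_{n=1}^{\infty} E_n \right) = \sum_{n=1}^{\infty} \mu(E_n). \]

Then there is a unique measure $\mu : \mathcal{B}(\mathcal{A}) \to \mathbb{R}^+$ which is an extension of $\mu : \mathcal{A} \to \mathbb{R}^+$.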

Remarks 4.12.

(i) The important hypotheses are (i) and (iii). Thus the Kolmogorov Extension Theorem says that if we have a function µ that looks like a measure on an algebra A, then it is indeed a measure when extended to B(A).

(ii) We will often use the Kolmogorov Extension Theorem as follows. Take X = [0, 1] and take A to be the algebra consisting of all finite unions of subintervals of X. We then define the ‘measure’ µ of a subinterval in such a way as to be consistent with the hypotheses of the Kolmogorov Extension Theorem. It then follows that µ does indeed define a measure on the Borel σ-algebra.

(iii) Here is another way in which we shall use the Kolmogorov Extension Theorem. Suppose we have two measures, µ and ν, and we want to see if µ = ν. A priori we would have to check that µ(B) = ν(B) for all B ∈ B. The Kolmogorov Extension Theorem says that it is sufficient to check that µ(E) = ν(E) for all E in an algebra A that generates B. For example, to show that two measures on [0, 1] are equal, it is sufficient to show that they give the same measure to each subinterval.

4.4 Examples of measure spaces

Lebesgue measure on [0, 1]. Take X = [0, 1] and take A to be the collection of all finite unions of subintervals of [0, 1]. For a subinterval [a, b] define

\[ \mu([a, b]) = b - a. \]

This satisfies the hypotheses of the Kolmogorov Extension Theorem, and so defines a measure on the Borel σ-algebra B. This is Lebesgue measure.

Lebesgue measure on R/Z. Take X = R/Z = [0, 1) mod 1 and take A to be the collection of all finite unions of subintervals of [0, 1). For a subinterval [a, b] define

\[ \mu([a, b]) = b - a. \]

This satisfies the hypotheses of the Kolmogorov Extension Theorem, and so defines a measure on the Borel σ-algebra B. This is Lebesgue measure on the circle.

Lebesgue measure on the k-dimensional torus. Take $X = \mathbb{R}^k/\mathbb{Z}^k = [0, 1)^k$ mod 1 and take A to be the collection of all finite unions of k-dimensional sub-cubes $\prod_{j=1}^{k} [a_j, b_j]$ of $[0, 1)^k$. For a sub-cube $\prod_{j=1}^{k} [a_j, b_j]$ of $[0, 1)^k$, define

\[ \mu\left( \prod_{j=1}^{k} [a_j, b_j] \right) = \prod_{j=1}^{k} (b_j - a_j). \]

This satisfies the hypotheses of the Kolmogorov Extension Theorem, and so defines a measure on the Borel σ-algebra B. This is Lebesgue measure on the torus.

Stieltjes measures. Take X = [0, 1] and let $\rho : [0, 1] \to \mathbb{R}^+$ be an increasing function such that ρ(1) − ρ(0) = 1. Take A to be the algebra of finite unions of subintervals and define

\[ \mu_\rho([a, b]) = \rho(b) - \rho(a). \]

This satisfies the hypotheses of the Kolmogorov Extension Theorem, and so defines a measure on the Borel σ-algebra B. We say that $\mu_\rho$ is the measure on [0, 1] with density ρ.

Dirac measures. Finally, we give an example of a class of measures that do not fall into the above categories. Let X be an arbitrary space and let B be an arbitrary σ-algebra. Let x ∈ X. Define the measure $\delta_x$ by

\[ \delta_x(A) = \begin{cases} 1 & \text{if } x \in A, \\ 0 & \text{if } x \notin A. \end{cases} \]

Then $\delta_x$ defines a probability measure. It is called the Dirac measure at x.

4.5 Integration: The Riemann integral

Before discussing the Lebesgue theory of integration, we briefly review the construction of the Riemann integral. This gives a method for defining the integral of (sufficiently nice) functions defined on [a, b]. In the next subsection we will see how the Lebesgue integral is a generalisation of the Riemann integral, in the sense that it allows us to integrate functions defined on spaces more general than subintervals of R (as well as a wider class of functions). However, the Lebesgue integral has other nice properties, for example it is well-behaved with respect to limits. Here we give a brief exposition about some inadequacies of the Riemann integral and how they motivate the Lebesgue integral.

Let f : [a, b] → R be a bounded function (for the moment we impose no other conditions on f).

A partition ∆ of [a, b] is a finite set of points $\Delta = \{x_0, x_1, x_2, \ldots, x_n\}$ with

\[ a = x_0 < x_1 < x_2 < \cdots < x_n = b. \]

In other words, we are dividing [a, b] up into subintervals.

We then form the upper and lower Riemann sums

\[ U(f, \Delta) = \sum_{i=0}^{n-1} \sup_{x \in [x_i, x_{i+1}]} f(x) \, (x_{i+1} - x_i), \]

\[ L(f, \Delta) = \sum_{i=0}^{n-1} \inf_{x \in [x_i, x_{i+1}]} f(x) \, (x_{i+1} - x_i). \]

The idea is then that if we make the subintervals in the partition small, these sums will be a good approximation to (our intuitive notion of) the integral of f over [a, b]. More precisely, if

\[ \inf_{\Delta} U(f, \Delta) = \sup_{\Delta} L(f, \Delta), \]

where the infimum and supremum are taken over all possible partitions of [a, b], then we write

\[ \int_a^b f(x) \, dx \]

for their common value and call it the (Riemann) integral of f between those limits. We also say that f is Riemann integrable.

The class of Riemann integrable functions includes continuous functions and step functions (i.e. finite linear combinations of characteristic functions of intervals).
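To see the upper and lower sums in action, here is a minimal sketch for the special case of an increasing function, where the supremum and infimum over each subinterval are attained at the right and left endpoints respectively. The restriction to increasing functions and to partitions into n equal subintervals is an assumption made purely to keep the example short, and the function name is our own.

```python
def riemann_sums_increasing(f, a, b, n):
    """Upper and lower Riemann sums of an increasing function f on [a, b],
    for the partition of [a, b] into n equal subintervals."""
    width = (b - a) / n
    points = [a + i * width for i in range(n + 1)]
    # f increasing: sup on [x_i, x_{i+1}] is f(x_{i+1}), inf is f(x_i)
    upper = sum(f(points[i + 1]) * (points[i + 1] - points[i]) for i in range(n))
    lower = sum(f(points[i]) * (points[i + 1] - points[i]) for i in range(n))
    return upper, lower


for n in (10, 100, 1000):
    upper, lower = riemann_sums_increasing(lambda x: x * x, 0.0, 1.0, n)
    print(n, lower, upper)
```

For $f(x) = x^2$ on [0, 1] the two sums squeeze the value 1/3, and their difference is (f(1) − f(0))/n, which tends to 0 as n grows.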

However, there are many functions for which one wishes to define an integral but which are not Riemann integrable, making the theory rather unsatisfactory. For example, define f : [0, 1] → R by

\[ f(x) = \begin{cases} 1 & \text{if } x \in \mathbb{Q}, \\ 0 & \text{otherwise.} \end{cases} \]

Since between any two distinct real numbers we can find both a rational number and an irrational number, given 0 ≤ y < z ≤ 1, we can find y < x < z with f(x) = 1 and y < x′ < z with f(x′) = 0. Hence for any partition $\Delta = \{x_0, x_1, \ldots, x_n\}$ of [0, 1], we have

\[ U(f, \Delta) = \sum_{i=0}^{n-1} (x_{i+1} - x_i) = 1, \qquad L(f, \Delta) = 0. \]

Taking the infimum and supremum, respectively, over all partitions ∆, shows that f is not Riemann integrable.

Why does Riemann integration not work for the above function and how could we go about improving it? Let us look again at (and slightly rewrite) the formulæ for U(f, ∆) and L(f, ∆). We have

\[ U(f, \Delta) = \sum_{i=0}^{n-1} \sup_{x \in [x_i, x_{i+1}]} f(x) \, l([x_i, x_{i+1}]) \]

and

\[ L(f, \Delta) = \sum_{i=0}^{n-1} \inf_{x \in [x_i, x_{i+1}]} f(x) \, l([x_i, x_{i+1}]), \]

where, for an interval [y, z],

\[ l([y, z]) = z - y \]

denotes its length. In the example above, things didn’t work because dividing [0, 1] into intervals (no matter how small) did not ‘separate out’ the different values that f could take. But suppose we had a notion of ‘length’ that worked for more general sets than intervals. Then we could do better by considering more complicated ‘partitions’ of [0, 1], where by partition we now mean a collection of subsets $\{E_1, \ldots, E_m\}$ of [0, 1] such that $E_i \cap E_j = \emptyset$, if i ≠ j, and $\bigcup_{i=1}^{m} E_i = [0, 1]$.

In the example, for instance, it might be reasonable to write

\begin{align*}
\int_0^1 f(x) \, dx &= 1 \times l([0, 1] \cap \mathbb{Q}) + 0 \times l([0, 1] \setminus \mathbb{Q}) \\
&= l([0, 1] \cap \mathbb{Q}).
\end{align*}

Instead of using subintervals, the Lebesgue integral uses a much wider class of subsets (namely sets in the given σ-algebra) together with a notion of ‘generalised length’ (namely, measure).

4.6 Integration: The Lebesgue integral

Let (X, B, µ) be a measure space. We are interested in how to integrate functions defined on X with respect to the measure µ. In the special case when X = [0, 1], B is the Borel σ-algebra and µ is Lebesgue measure, this will extend the definition of the Riemann integral to functions that are not Riemann integrable.

Definition 4.13. A function f : X → R is measurable if $f^{-1}(D) \in \mathcal{B}$ for every Borel subset D of R, or, equivalently, if $f^{-1}(c, \infty) \in \mathcal{B}$ for all c ∈ R.

A function f : X → C is measurable if both the real and imaginary parts, Re f and Im f, are measurable.

We define integration via simple functions.

Definition 4.14. A function f : X → R is simple if it can be written as a linear combination of characteristic functions of sets in B, i.e.:

\[ f = \sum_{i=1}^{r} a_i \chi_{A_i}, \]

for some $a_i \in \mathbb{R}$, $A_i \in \mathcal{B}$, where the $A_i$ are pairwise disjoint.

For a simple function f : X → R we define

\[ \int f \, d\mu = \sum_{i=1}^{r} a_i \mu(A_i) \]

(which can be shown to be independent of the representation of f as a simple function). Thus for simple functions, the integral can be thought of as being defined to be the area underneath the graph.
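The definition of the integral of a simple function translates directly into code. In the sketch below (our own illustration, with hypothetical names) a simple function is stored as a list of pairs $(a_i, A_i)$, each set $A_i$ is taken to be a half-open interval [a, b) represented by the pair (a, b), and a measure is any function that assigns a number to such an interval. Restricting to intervals is an assumption made only to keep the example concrete.

```python
from fractions import Fraction


def integrate_simple(terms, measure):
    """Integral of f = sum a_i * chi_{A_i}, namely sum a_i * measure(A_i)."""
    return sum(a * measure(A) for a, A in terms)


def lebesgue(interval):
    """Lebesgue measure of the interval [a, b): its length."""
    a, b = interval
    return b - a


def dirac(x):
    """Dirac measure at x, evaluated on intervals [a, b)."""
    return lambda interval: 1 if interval[0] <= x < interval[1] else 0


half = Fraction(1, 2)
# f = 2 on [0, 1/2) and f = 5 on [1/2, 1)
f = [(2, (Fraction(0), half)), (5, (half, Fraction(1)))]
print(integrate_simple(f, lebesgue))                 # 2 * 1/2 + 5 * 1/2
print(integrate_simple(f, dirac(Fraction(3, 4))))    # the value of f at 3/4
```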

If f : X → R, f ≥ 0, is measurable then one can show that there exists an increasing sequence of simple functions $f_n$ such that $f_n \uparrow f$ pointwise as n → ∞ (i.e. for every x, $f_n(x)$ is an increasing sequence and $f_n(x) \to f(x)$ as n → ∞) and we define

\[ \int f \, d\mu = \lim_{n \to \infty} \int f_n \, d\mu. \]

This can be shown to be independent of the choice of sequence $f_n$.

For an arbitrary measurable function f : X → R, we write $f = f^+ - f^-$, where $f^+ = \max\{f, 0\} \ge 0$ and $f^- = \max\{-f, 0\} \ge 0$, and define

\[ \int f \, d\mu = \int f^+ \, d\mu - \int f^- \, d\mu. \]

Finally, for a measurable function f : X → C, we define

\[ \int f \, d\mu = \int \operatorname{Re} f \, d\mu + i \int \operatorname{Im} f \, d\mu. \]

We say that f is integrable if ∫|f | dµ < +∞.

4.7 Examples

Lebesgue measure. Let X = [0, 1] and let µ denote Lebesgue measure on the Borel σ-algebra. If f : [0, 1] → R is Riemann integrable then it is also Lebesgue integrable and the two definitions agree.

The Stieltjes integral. Let $\rho : [0, 1] \to \mathbb{R}^+$ and suppose that ρ is differentiable. Then

\[ \int f \, d\mu_\rho = \int f(x) \rho'(x) \, dx. \]
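The identity above can be tested numerically. The sketch below is a rough check rather than a proof, and it rests on an assumption of ours: for continuous f, the integral with respect to $\mu_\rho$ is approximated by sums of the form $\sum_i f(x_i)\,\mu_\rho([x_i, x_{i+1}])$ over a fine partition. The sample choices $\rho(x) = x^2$ and $f(x) = x$ are also ours; for these the exact value of both sides is 2/3.

```python
def stieltjes_sum(f, rho, n):
    """Approximate the integral of f with respect to mu_rho by
    sum f(x_i) * (rho(x_{i+1}) - rho(x_i)) with x_i = i/n."""
    return sum(f(i / n) * (rho((i + 1) / n) - rho(i / n)) for i in range(n))


def density_sum(f, rho_prime, n):
    """Approximate the integral of f(x) * rho'(x) dx by a left Riemann sum."""
    return sum(f(i / n) * rho_prime(i / n) / n for i in range(n))


n = 10000
left = stieltjes_sum(lambda x: x, lambda x: x * x, n)
right = density_sum(lambda x: x, lambda x: 2 * x, n)
print(left, right)  # both should be close to 2/3
```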

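Integration with respect to Dirac measures. Let x ∈ X and recall that we defined the Dirac measure by

\[ \delta_x(A) = \begin{cases} 1 & \text{if } x \in A, \\ 0 & \text{if } x \notin A. \end{cases} \]

If $\chi_A$ denotes the characteristic function of A then

\[ \int \chi_A \, d\delta_x = \begin{cases} 1 & \text{if } x \in A, \\ 0 & \text{if } x \notin A. \end{cases} \]

Hence if $f = \sum a_i \chi_{A_i}$ is a simple function then $\int f \, d\delta_x = a_i$ where $x \in A_i$. Now let f : X → R. By choosing an increasing sequence of simple functions, we see that

\[ \int f \, d\delta_x = f(x). \]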

4.8 The Lp Spaces

Let us say that two measurable functions f, g : X → C are equivalent if f = g µ-a.e. We shall write $L^1(X, \mathcal{B}, \mu)$ (or $L^1(\mu)$) for the set of equivalence classes of integrable functions on (X, B, µ). We define

\[ \|f\|_1 = \int |f| \, d\mu. \]

Then $d(f, g) = \|f - g\|_1$ is a metric on $L^1(X, \mathcal{B}, \mu)$.

More generally, for any p ≥ 1, we can define the space $L^p(X, \mathcal{B}, \mu)$ consisting of (equivalence classes of) measurable functions f : X → C such that $|f|^p$ is integrable. We can again define a metric on $L^p(X, \mathcal{B}, \mu)$ by defining $d(f, g) = \|f - g\|_p$ where

\[ \|f\|_p = \left( \int |f|^p \, d\mu \right)^{1/p}. \]

It is worth remarking that convergence in $L^p$ neither implies nor is implied by convergence almost everywhere.

If (X, B, µ) is a finite measure space and if 1 ≤ p < q then

\[ L^q(X, \mathcal{B}, \mu) \subset L^p(X, \mathcal{B}, \mu). \]

Apart from $L^1$, the most interesting $L^p$ space is $L^2(X, \mathcal{B}, \mu)$. This is a Hilbert space with the inner product

\[ \langle f, g \rangle = \int f \bar{g} \, d\mu. \]

Remark 4.15. We shall continually abuse notation by saying that, for example, a function $f \in L^1(X, \mathcal{B}, \mu)$ when, strictly speaking, we mean that the equivalence class of f lies in $L^1(X, \mathcal{B}, \mu)$.

Exercise 4.16. Give an example of a sequence of functions $f_n \in L^1([0, 1], \mathcal{B}, \mu)$ (µ = Lebesgue) such that $f_n \to 0$ µ-a.e. but $f_n \not\to 0$ in $L^1$.

Exercise 4.17. Give an example to show that

\[ L^2(\mathbb{R}, \mathcal{B}, \mu) \not\subset L^1(\mathbb{R}, \mathcal{B}, \mu) \]

where µ is Lebesgue measure.

4.9 Convergence theorems

We state the following two convergence theorems for integration.

Theorem 4.18 (Monotone Convergence Theorem). Suppose that $f_n : X \to \mathbb{R}$ is an increasing sequence of integrable functions on (X, B, µ). If $\int f_n \, d\mu$ is a bounded sequence of real numbers then $\lim_{n \to \infty} f_n$ exists µ-a.e. and is integrable and

\[ \int \lim_{n \to \infty} f_n \, d\mu = \lim_{n \to \infty} \int f_n \, d\mu. \]

Theorem 4.19 (Dominated Convergence Theorem). Suppose that g : X → R is integrable and that $f_n : X \to \mathbb{R}$ is a sequence of measurable functions with $|f_n| \le g$ µ-a.e. and $\lim_{n \to \infty} f_n = f$ µ-a.e. Then f is integrable and

\[ \lim_{n \to \infty} \int f_n \, d\mu = \int f \, d\mu. \]

Remark 4.20. Both the Monotone Convergence Theorem and the Dominated Convergence Theorem fail for Riemann integration.

5 Measures on Compact Metric Spaces

5.1 The Riesz Representation Theorem

Let X be a compact metric space and let

C(X,R) = {f : X → R | f is continuous}

denote the space of all continuous functions on X. Equip C(X,R) with the metric

\[ d(f, g) = \|f - g\|_\infty = \sup_{x \in X} |f(x) - g(x)|. \]

Let B denote the Borel σ-algebra on X and let µ be a probability measure on (X, B). Then we can think of µ as a functional that acts on C(X, R), namely

\[ C(X, \mathbb{R}) \to \mathbb{R} : f \mapsto \int f \, d\mu. \]

We will often write µ(f) for $\int f \, d\mu$.

Notice that this map enjoys several natural properties:

(i) the functional defined by µ is continuous: i.e. if $f_n \in C(X, \mathbb{R})$ and $f_n \to f$ then $\mu(f_n) \to \mu(f)$.

(i’) the functional defined by µ is bounded: i.e. if f ∈ C(X, R) then $|\mu(f)| \le \|f\|_\infty$.

(ii) the functional defined by µ is linear:

\[ \mu(\lambda_1 f_1 + \lambda_2 f_2) = \lambda_1 \mu(f_1) + \lambda_2 \mu(f_2) \]

where $\lambda_1, \lambda_2 \in \mathbb{R}$ and $f_1, f_2 \in C(X, \mathbb{R})$.

(iii) if f ≥ 0 then µ(f) ≥ 0 (i.e. the map µ is positive);

(iv) consider the function 1 defined by 1(x) ≡ 1 for all x; then µ(1) = 1 (i.e. the map µ is normalised).

Exercise 5.1. Prove the above assertions.

Remark 5.2. It can be shown that a linear functional is continuous if and only if it is bounded. Thus in the presence of (ii), we have that (i) is equivalent to (i’).

The Riesz Representation Theorem says that the above properties characterise all Borel probability measures on X. That is, if we have a map w : C(X, R) → R that satisfies the above four properties, then w must be given by integrating with respect to a Borel probability measure. This will be a very useful method of constructing measures: we need only construct continuous positive normalised linear functionals!

Theorem 5.3 (Riesz Representation Theorem). Let w : C(X, R) → R be a functional such that:

(i) w is bounded: i.e. for all f ∈ C(X, R) we have $|w(f)| \le \|f\|_\infty$;

(ii) w is linear: i.e. $w(\lambda_1 f_1 + \lambda_2 f_2) = \lambda_1 w(f_1) + \lambda_2 w(f_2)$;

(iii) w is positive: i.e. if f ≥ 0 then w(f) ≥ 0;

(iv) w is normalised: i.e. w(1) = 1.

Then there exists a Borel probability measure µ ∈ M(X) such that

\[ w(f) = \int f \, d\mu. \]

Moreover, µ is unique.

5.2 The space M(X)

In all of the examples that we shall consider, X will be a compact metric space and B will be the Borel σ-algebra.

We will also be interested in the space of continuous R-valued functions

C(X,R) = {f : X → R | f is continuous}.

This space is also a metric space. We can define a metric on C(X, R) by first defining

\[ \|f\|_\infty = \sup_{x \in X} |f(x)| \]

and then defining

\[ \rho(f, g) = \|f - g\|_\infty. \]

This metric turns C(X, R) into a complete metric space. (Recall that a metric space is said to be complete if every Cauchy sequence is convergent.) Note also that C(X, R) is a vector space.

An important property of C(X, R) that will prove to be useful later on is that it is separable, that is, it contains a countable dense subset.

Rather than fixing one measure on (X, B), it is interesting to consider the totality of possible (probability) measures. To formalise this, let M(X) denote the set of all probability measures on (X, B). The following simple fact will be useful later on.

Proposition 5.4. The space M(X) is convex: if $\mu_1, \mu_2 \in M(X)$ and 0 ≤ α ≤ 1 then $\alpha \mu_1 + (1 - \alpha) \mu_2 \in M(X)$.

Exercise 5.5. Prove the above proposition.

5.3 The weak∗ topology on M(X)

It will be very important to have a sensible notion of convergence in M(X); this is called weak∗ convergence. We say that a sequence of probability measures $\mu_n$ weak∗ converges to µ, as n → ∞ if, for every f ∈ C(X, R),

\[ \int f \, d\mu_n \to \int f \, d\mu, \quad \text{as } n \to \infty. \]

If $\mu_n$ weak∗ converges to µ then we write $\mu_n \rightharpoonup \mu$. (Note that with this definition it is not necessarily true that $\mu_n(B) \to \mu(B)$, as n → ∞, for B ∈ B.) We can make M(X) into a metric space compatible with this definition of convergence by choosing a countable dense subset $\{f_n\}_{n=1}^{\infty} \subset C(X, \mathbb{R})$ and, for µ, m ∈ M(X), setting

\[ d(\mu, m) = \sum_{n=1}^{\infty} \frac{1}{2^n \|f_n\|_\infty} \left| \int f_n \, d\mu - \int f_n \, dm \right|. \]

However, we will not need to work with a particular metric: what is important is the definition of convergence.

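Notice that there is a continuous embedding of X in M(X) given by the map $X \to M(X) : x \mapsto \delta_x$, where $\delta_x$ is the Dirac measure at x:

\[ \delta_x(A) = \begin{cases} 1 & \text{if } x \in A, \\ 0 & \text{if } x \notin A, \end{cases} \]

(so that $\int f \, d\delta_x = f(x)$).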

Exercise 5.6. Show that the map δ : X → M(X) is continuous. (Hint: This is really just unravelling the underlying definitions.)

Exercise 5.7. Let X be a compact metric space. For µ ∈ M(X) define

\[ \|\mu\| = \sup_{f \in C(X, \mathbb{R}), \, \|f\|_\infty \le 1} \left| \int f \, d\mu \right|. \]

We say that $\mu_n$ converges strongly to µ if $\|\mu_n - \mu\| \to 0$ as n → ∞. The topology this determines is called the strong topology (or the operator topology).

(i) Show that if $\mu_n \to \mu$ strongly then $\mu_n \rightharpoonup \mu$ in the weak∗ topology.

(ii) Show that $X \hookrightarrow M(X) : x \mapsto \delta_x$ is not continuous in the strong topology.

(iii) Prove that $\|\delta_x - \delta_y\| = 2$ if x ≠ y. (You may use Urysohn’s Lemma: Let A and B be disjoint closed subsets of a metric space X. Then there is a continuous function f ∈ C(X, R) such that 0 ≤ f ≤ 1 on X while f ≡ 0 on A and f ≡ 1 on B.)

Hence prove that M(X) is not compact in the strong topology when X is infinite.

Exercise 5.8. Give an example of a sequence of measures $\mu_n$ and a set B such that $\mu_n \rightharpoonup \mu$ but $\mu_n(B) \not\to \mu(B)$.

5.4 M(X) is weak∗ compact

We can use the Riesz Representation Theorem to establish another important property of M(X): that it is compact.

Theorem 5.9. Let X be a compact metric space. Then M(X) is weak∗ compact.

Proof. In fact, we shall show that M(X) is sequentially compact, i.e., that any sequence $\mu_n \in M(X)$ has a convergent subsequence. For convenience, we shall write $\mu(f) = \int f \, d\mu$.

Since C(X, R) is separable, we can choose a countable dense subset of functions $\{f_i\}_{i=1}^{\infty} \subset C(X, \mathbb{R})$. Given a sequence $\mu_n \in M(X)$, we shall first consider the sequence of real numbers $\mu_n(f_1) \in \mathbb{R}$. We have that $|\mu_n(f_1)| \le \|f_1\|_\infty$ for all n, so $\mu_n(f_1)$ is a bounded sequence of real numbers. As such, it has a convergent subsequence, $\mu_n^{(1)}(f_1)$ say.

Next we apply the sequence of measures $\mu_n^{(1)}$ to $f_2$ and consider the sequence $\mu_n^{(1)}(f_2) \in \mathbb{R}$. Again, this is a bounded sequence of real numbers and so it has a convergent subsequence $\mu_n^{(2)}(f_2)$.

In this way we obtain, for each i ≥ 1, nested subsequences $\{\mu_n^{(i)}\} \subset \{\mu_n^{(i-1)}\}$ such that $\mu_n^{(i)}(f_j)$ converges for 1 ≤ j ≤ i. Now consider the diagonal sequence $\mu_n^{(n)}$. Since, for n ≥ i, $\mu_n^{(n)}$ is a subsequence of $\mu_n^{(i)}$, $\mu_n^{(n)}(f_i)$ converges for every i ≥ 1.

We can now use the fact that $\{f_i\}$ is dense to show that $\mu_n^{(n)}(f)$ converges for all f ∈ C(X, R), as follows. For any ε > 0, we can choose $f_i$ such that $\|f - f_i\|_\infty \le \varepsilon$. Since $\mu_n^{(n)}(f_i)$ converges, there exists N such that if n, m ≥ N then

\[ |\mu_n^{(n)}(f_i) - \mu_m^{(m)}(f_i)| \le \varepsilon. \]

Thus if n, m ≥ N we have

\begin{align*}
|\mu_n^{(n)}(f) - \mu_m^{(m)}(f)| &\le |\mu_n^{(n)}(f) - \mu_n^{(n)}(f_i)| + |\mu_n^{(n)}(f_i) - \mu_m^{(m)}(f_i)| + |\mu_m^{(m)}(f_i) - \mu_m^{(m)}(f)| \\
&\le 3\varepsilon,
\end{align*}

so $\mu_n^{(n)}(f)$ converges, as required.

To complete the proof, write $w(f) = \lim_{n \to \infty} \mu_n^{(n)}(f)$. We claim that w satisfies the hypotheses of the Riesz Representation Theorem and so corresponds to integration with respect to a probability measure.

(i) By construction, w is a linear mapping: w(λf + µg) = λw(f ) + µw(g).

(ii) As |w(f )| ≤ ‖f ‖∞, we see that w is bounded.

(iii) If f ≥ 0 then it is easy to check that w(f ) ≥ 0. Hence w is positive.

(iv) It is easy to check that w is normalised: w(1) = 1.

Therefore, by the Riesz Representation Theorem, there exists µ ∈ M(X) such that $w(f) = \int f \, d\mu$. We then have that $\int f \, d\mu_n^{(n)} \to \int f \, d\mu$, as n → ∞, for all f ∈ C(X, R), i.e., that $\mu_n^{(n)}$ converges weak∗ to µ, as n → ∞.

6 Measure Preserving Transformations

6.1 Invariant measures

Let (X, B, µ) be a probability space. A transformation T : X → X is said to be measurable if $T^{-1}B \in \mathcal{B}$ for all $B \in \mathcal{B}$.

Definition 6.1. We say that T is a measure-preserving transformation (m.p.t.) or, equivalently, µ is said to be a T-invariant measure, if $\mu(T^{-1}B) = \mu(B)$ for all $B \in \mathcal{B}$.
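Note that the definition involves the pre-image $T^{-1}B$ rather than the image TB. As an illustration (our own, not part of the formal development), consider the doubling map T(x) = 2x mod 1 with Lebesgue measure. The pre-image of an interval [a, b) ⊂ [0, 1) consists of two intervals, each of half the length, so the total length is preserved. The helper names below are hypothetical.

```python
from fractions import Fraction


def doubling_preimage(a, b):
    """Pre-image of [a, b) under T(x) = 2x mod 1, as two disjoint intervals."""
    return [(a / 2, b / 2), ((a + 1) / 2, (b + 1) / 2)]


def total_length(intervals):
    """Lebesgue measure of a disjoint union of intervals."""
    return sum(b - a for a, b in intervals)


a, b = Fraction(1, 3), Fraction(3, 4)
print(total_length(doubling_preimage(a, b)), b - a)  # the two lengths agree
```

By contrast, the image of [0, 1/2) under T is the whole of [0, 1), which has measure 1 rather than 1/2; this is why the definition is phrased in terms of pre-images.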

Remark 6.2. We write $L^1(X, \mathcal{B}, \mu)$ for the space of (equivalence classes of) all functions f : X → R that are integrable with respect to the measure µ, i.e.

\[ L^1(X, \mathcal{B}, \mu) = \left\{ f : X \to \mathbb{R} \;\middle|\; f \text{ is measurable and } \int |f| \, d\mu < \infty \right\}. \]

Lemma 6.3. The following are equivalent:

(i) T is a measure-preserving transformation;

(ii) for each $f \in L^1(X, \mathcal{B}, \mu)$, we have

\[ \int f \, d\mu = \int f \circ T \, d\mu. \]

Proof. Suppose first that (ii) holds. For $B \in \mathcal{B}$ we have $\chi_B \in L^1(X, \mathcal{B}, \mu)$ and $\chi_B \circ T = \chi_{T^{-1}B}$, so that

\[ \mu(B) = \int \chi_B \, d\mu = \int \chi_B \circ T \, d\mu = \int \chi_{T^{-1}B} \, d\mu = \mu(T^{-1}B). \]

This proves one implication.

Conversely, suppose that T is a measure-preserving transformation. For any characteristic function $\chi_B$, $B \in \mathcal{B}$,

\[ \int \chi_B \, d\mu = \mu(B) = \mu(T^{-1}B) = \int \chi_{T^{-1}B} \, d\mu = \int \chi_B \circ T \, d\mu \]

and so the equality holds for any simple function (a finite linear combination of characteristic functions). Given any $f \in L^1(X, \mathcal{B}, \mu)$ with f ≥ 0, we can find an increasing sequence of simple functions $f_n$ with $f_n \to f$ pointwise, as n → ∞. For each n we have

\[ \int f_n \, d\mu = \int f_n \circ T \, d\mu \]

and, applying the Monotone Convergence Theorem to both sides, we obtain

\[ \int f \, d\mu = \int f \circ T \, d\mu. \]

To extend the result to general real-valued f, consider the positive and negative parts. This completes the proof.

6.2 Continuous transformations

We shall now concentrate on the special case where X is a compact metric space, B is the Borel σ-algebra and T is a continuous mapping (in which case T is measurable). The map T induces a mapping on the set of (Borel) probability measures M(X) as follows:

Definition 6.4. Define the induced mapping T∗ : M(X)→ M(X) by

(T∗µ)(B) = µ(T−1B).

(We call T∗µ the push-forward of µ by T .)

Exercise 6.5. Check that T∗µ is a probability measure.

Then µ is T -invariant if and only if T∗µ = µ. Write

M(X,T ) = {µ ∈ M(X) | T∗µ = µ}.

Lemma 6.6. For f ∈ C(X, R) we have

\[ \int f \, d(T_*\mu) = \int f \circ T \, d\mu. \]

Proof. From the definition, for $B \in \mathcal{B}$,

\[ \int \chi_B \, d(T_*\mu) = \int \chi_B \circ T \, d\mu. \]

Thus the result also holds for simple functions. If f ∈ C(X, R) is such that f ≥ 0, we can choose an increasing sequence of simple functions $f_n$ converging to f pointwise. We have

\[ \int f_n \, d(T_*\mu) = \int f_n \circ T \, d\mu \]

and, applying the Monotone Convergence Theorem to each side, we obtain

\[ \int f \, d(T_*\mu) = \int f \circ T \, d\mu. \]

The result extends to general real-valued f ∈ C(X, R) by considering positive and negative parts.
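Lemma 6.6 is easy to see concretely for a measure supported on finitely many points. In the sketch below (our own illustration) a measure $\mu = \sum_i w_i \delta_{x_i}$ is stored as a dictionary mapping each point $x_i$ to its weight $w_i$; its push-forward is then $\sum_i w_i \delta_{T x_i}$. The map T(x) = 4x(1 − x) on [0, 1] and the function $f(x) = x^2$ are sample choices, and the function names are hypothetical.

```python
from fractions import Fraction


def push_forward(mu, T):
    """Push-forward of a finitely supported measure {point: weight} by T."""
    nu = {}
    for x, w in mu.items():
        y = T(x)
        nu[y] = nu.get(y, 0) + w  # weights of points with the same image add up
    return nu


def integrate(f, mu):
    """Integral of f with respect to a finitely supported measure."""
    return sum(f(x) * w for x, w in mu.items())


def T(x):
    return 4 * x * (1 - x)


def f(x):
    return x * x


mu = {Fraction(1, 4): Fraction(1, 2),
      Fraction(3, 4): Fraction(1, 4),
      Fraction(1, 2): Fraction(1, 4)}
nu = push_forward(mu, T)
print(integrate(f, nu), integrate(lambda x: f(T(x)), mu))  # the two integrals agree
```

Here the points 1/4 and 3/4 have the same image 3/4, so their weights are added together in the push-forward.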

Lemma 6.7. Let T : X → X be a continuous mapping of a compact metric space. The following are equivalent:

(i) $T_*\mu = \mu$;

(ii) for all f ∈ C(X, R)

\[ \int f \, d\mu = \int f \circ T \, d\mu. \]

Proof. (i) ⇒ (ii): This follows from Lemma 6.3, since $C(X, \mathbb{R}) \subset L^1(X, \mathcal{B}, \mu)$.

(ii) ⇒ (i): Define two linear functionals $w_1, w_2 : C(X, \mathbb{R}) \to \mathbb{R}$ as follows:

\[ w_1(f) = \int f \, d\mu, \qquad w_2(f) = \int f \, dT_*\mu. \]

Note that both $w_1$ and $w_2$ are bounded positive normalised linear functionals on C(X, R). Moreover, by Lemma 6.6

\[ w_2(f) = \int f \, dT_*\mu = \int f \circ T \, d\mu = \int f \, d\mu = w_1(f) \]

so that $w_1$ and $w_2$ determine the same linear functional. By uniqueness in the Riesz Representation Theorem, this implies that $T_*\mu = \mu$.

Exercise 6.8. Show that the map T∗ : M(X)→ M(X) is continuous in the weak∗ topology.

6.3 Existence of invariant measures

Given a continuous mapping T : X → X of a compact metric space, it is natural to ask whether invariant measures necessarily exist, i.e., whether M(X,T ) ≠ ∅. The next result shows that this is the case.

Theorem 6.9. Let T : X → X be a continuous mapping of a compact metric space. Then there exists at least one T -invariant probability measure.

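Proof. Let σ ∈ M(X) be a probability measure (for example, we could take σ to be a Dirac measure). Define the sequence µn ∈ M(X) by

\[ \mu_n = \frac{1}{n} \sum_{j=0}^{n-1} T_*^j \sigma, \]

so that, for B ∈ B,

\[ \mu_n(B) = \frac{1}{n}\left( \sigma(B) + \sigma(T^{-1}B) + \cdots + \sigma(T^{-(n-1)}B) \right). \]

Since M(X) is weak∗ compact, some subsequence µnk converges, as k → ∞, to a measure µ ∈ M(X). We shall show that µ ∈ M(X,T ). By Lemma 6.7, this is equivalent to showing that

\[ \int f \, d\mu = \int f \circ T \, d\mu \quad \forall f \in C(X,\mathbb{R}). \]

To see this, note that

\[
\begin{aligned}
\left| \int f \circ T \, d\mu - \int f \, d\mu \right|
&= \lim_{k\to\infty} \left| \int f \circ T \, d\mu_{n_k} - \int f \, d\mu_{n_k} \right| \\
&= \lim_{k\to\infty} \left| \frac{1}{n_k} \int \sum_{j=0}^{n_k-1} \left( f \circ T^{j+1} - f \circ T^{j} \right) d\sigma \right| \\
&= \lim_{k\to\infty} \left| \frac{1}{n_k} \int \left( f \circ T^{n_k} - f \right) d\sigma \right| \\
&\le \lim_{k\to\infty} \frac{2\|f\|_\infty}{n_k} = 0.
\end{aligned}
\]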


Therefore, µ ∈ M(X,T ), as required.
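The telescoping estimate in this proof can be checked numerically. The following Python sketch is illustrative only: it assumes that T is a circle rotation, takes σ to be the Dirac measure at a hypothetical starting point x0, and uses f(x) = cos 2πx (so that ‖f‖∞ = 1). With σ a Dirac measure, integrating against µn amounts to averaging along the first n points of the orbit of x0. All function names are chosen for this example and do not come from the notes.

```python
import math

def T(x, alpha=math.sqrt(2)):
    """Circle rotation x -> x + alpha mod 1 (an illustrative choice of T)."""
    return (x + alpha) % 1.0

def f(x):
    """A continuous test function on R/Z with sup norm 1."""
    return math.cos(2 * math.pi * x)

def integral_against_mu_n(g, x0, n):
    """Integrate g against mu_n = (1/n) * sum_{j<n} T_*^j delta_{x0},
    i.e. average g over the first n points of the orbit of x0."""
    total, x = 0.0, x0
    for _ in range(n):
        total += g(x)
        x = T(x)
    return total / n

def invariance_defect(x0, n):
    """|int f∘T d(mu_n) - int f d(mu_n)|; the sum telescopes to
    |f(T^n x0) - f(x0)| / n, which is at most 2 * ||f||_inf / n."""
    shifted = integral_against_mu_n(lambda x: f(T(x)), x0, n)
    plain = integral_against_mu_n(f, x0, n)
    return abs(shifted - plain)

for n in (10, 100, 1000):
    # Compare the defect with the bound 2 * ||f||_inf / n from the proof.
    print(n, invariance_defect(0.1, n), 2.0 / n)
```

The printed defect should stay below the bound 2/n, which is why any weak∗ limit point of the µn is forced to be T -invariant.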

6.4 Properties of M(X,T )

We now know that M(X,T ) ≠ ∅. The next result gives us some basic information about its structure.

Theorem 6.10. (i) M(X,T ) is convex: i.e. µ1, µ2 ∈ M(X,T ) ⇒ αµ1 + (1 − α)µ2 ∈ M(X,T ), for all 0 ≤ α ≤ 1.

(ii) M(X,T ) is weak∗ closed (and hence compact).

Proof. (i) If µ1, µ2 ∈ M(X,T ) and 0 ≤ α ≤ 1 then

\[
\begin{aligned}
(\alpha\mu_1 + (1-\alpha)\mu_2)(T^{-1}B)
&= \alpha\mu_1(T^{-1}B) + (1-\alpha)\mu_2(T^{-1}B) \\
&= \alpha\mu_1(B) + (1-\alpha)\mu_2(B) = (\alpha\mu_1 + (1-\alpha)\mu_2)(B),
\end{aligned}
\]

so αµ1 + (1 − α)µ2 ∈ M(X,T ).

(ii) Let µn be a sequence in M(X,T ) and suppose that µn ⇀ µ ∈ M(X), as n → ∞. For f ∈ C(X,R),

\[ \int f \, d\mu_n = \int f \circ T \, d\mu_n. \]

As n → ∞, the left-hand side converges to ∫ f dµ and the right-hand side converges to ∫ f ◦ T dµ. Hence ∫ f dµ = ∫ f ◦ T dµ and so, by Lemma 6.7, µ ∈ M(X,T ). This shows that M(X,T ) is closed. It is compact since it is a closed subset of the compact set M(X).

6.5 Simple examples

We give two methods by which one can show that a given dynamical system preserves a given measure. We shall illustrate these two methods by proving that (i) a rotation of a torus, and (ii) the doubling map preserve Lebesgue measure. Let us first recall how these examples are defined.

6.5.1 Rotations on tori

Take X = Rk/Zk, the k-dimensional torus. Recall that Lebesgue measure µ is defined by first defining the measure of a k-dimensional cube [a1, b1] × · · · × [ak, bk] to be

\[ \mu\left( \prod_{j=1}^{k} [a_j, b_j] \right) = \prod_{j=1}^{k} (b_j - a_j) \]

and then extending this to the Borel σ-algebra by using the Kolmogorov Extension Theorem.

Fix α = (α1, . . . , αk) ∈ Rk and define T : X → X by

\[ T(x_1, \ldots, x_k) = (x_1 + \alpha_1, \ldots, x_k + \alpha_k) \bmod 1. \]


(In multiplicative notation this becomes:

\[ T(e^{2\pi i\theta_1}, \ldots, e^{2\pi i\theta_k}) = (e^{2\pi i(\theta_1+\alpha_1)}, \ldots, e^{2\pi i(\theta_k+\alpha_k)}).) \]

This is the rotation of the k-dimensional torus Rk/Zk by the vector (α1, . . . , αk).

In dimension k = 1 we get a rotation of a circle defined by

T : R/Z → R/Z : x ↦ x + α mod 1.

6.5.2 The doubling map

Let X = R/Z denote the circle. The doubling map is defined to be

T : R/Z → R/Z : x ↦ 2x mod 1.

6.6 Kolmogorov Extension Theorem

Recall the Kolmogorov Extension Theorem:

Theorem 6.11 (Kolmogorov Extension Theorem). Let A be an algebra of subsets of X. Suppose that µ : A → R+ ∪ {∞} satisfies:

(i) µ(∅) = 0;

(ii) there exist finitely or countably many sets Xn ∈ A such that X = ⋃n Xn and µ(Xn) < ∞;

(iii) if En ∈ A, n ≥ 1, are pairwise disjoint and if ⋃n≥1 En ∈ A then

\[ \mu\left( \bigcup_{n=1}^{\infty} E_n \right) = \sum_{n=1}^{\infty} \mu(E_n). \]

Then there is a unique measure µ : B(A) → R+ ∪ {∞} which is an extension of µ : A → R+ ∪ {∞}.

That is, if something looks like a measure on an algebra A, then it extends uniquely to a measure defined on the σ-algebra B(A) generated by A.

Corollary 6.12. Let A be an algebra of subsets of X. Suppose that µ1 and µ2 are two measures on B(A) such that µ1(E) = µ2(E) for all E ∈ A. Then µ1 = µ2 on B(A).

To show that a dynamical system T preserves a probability measure µ we have to show that T∗µ = µ. By the above corollary, we see that it is sufficient to check that T∗µ = µ on an algebra that generates the σ-algebra.

Recall that the collection of all finite unions of sub-intervals forms an algebra of subsets of both [0, 1] and R/Z that generates the Borel σ-algebra. Similarly, the collection of all finite unions of k-dimensional sub-cubes of Rk/Zk forms an algebra of subsets of the k-dimensional torus Rk/Zk that generates the Borel σ-algebra.

Thus to show that a dynamical system T defined on R/Z preserves a measure µ we need only check that

\[ T_*\mu(a, b) = \mu(T^{-1}(a, b)) = \mu(a, b) \]

for all subintervals (a, b).

6.6.1 Rotations of a circle

We claim that the rotation T (x) = x + α mod 1 preserves Lebesgue measure µ. First note that

\[ T^{-1}(a, b) = \{x \mid T(x) \in (a, b)\} = (a - \alpha, b - \alpha). \]

Hence

\[
\begin{aligned}
T_*\mu(a, b) &= \mu(T^{-1}(a, b)) \\
&= \mu(a - \alpha, b - \alpha) \\
&= (b - \alpha) - (a - \alpha) \\
&= b - a = \mu(a, b).
\end{aligned}
\]

Hence T∗µ = µ on the algebra of finite unions of subintervals. As this algebra generates the Borel σ-algebra, by uniqueness in the Kolmogorov Extension Theorem we see that T∗µ = µ; i.e. Lebesgue measure is T -invariant.
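The interval computation above can be reproduced in exact arithmetic. The Python sketch below is an illustration under stated assumptions: intervals are represented as pairs of rationals with 0 ≤ a < b ≤ 1, and the arc (a − α, b − α) is split into two pieces when it wraps around 0 in R/Z. The helper names are hypothetical.

```python
from fractions import Fraction

def rotation_preimage(a, b, alpha):
    """Return T^{-1}(a, b) for T(x) = x + alpha mod 1 as a list of
    subintervals of [0, 1]; the arc may wrap around 0 and split in two."""
    lo = (a - alpha) % 1          # left endpoint reduced mod 1
    hi = lo + (b - a)             # right endpoint before reduction
    if hi <= 1:
        return [(lo, hi)]
    return [(lo, Fraction(1)), (Fraction(0), hi - 1)]

def lebesgue(intervals):
    """Total length of a list of pairwise disjoint intervals."""
    return sum(hi - lo for lo, hi in intervals)

a, b, alpha = Fraction(1, 10), Fraction(3, 10), Fraction(1, 4)
pieces = rotation_preimage(a, b, alpha)
# The total length of the preimage should equal b - a.
print(pieces, lebesgue(pieces), b - a)
```

Rational endpoints are used only so that the comparison is exact; the argument in the text does not depend on α being rational.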


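6.6.2 The doubling map

We claim that the doubling map T (x) = 2x mod 1 preserves Lebesgue measure µ. First note that

\[ T^{-1}(a, b) = \{x \mid T(x) \in (a, b)\} = \left( \frac{a}{2}, \frac{b}{2} \right) \cup \left( \frac{a+1}{2}, \frac{b+1}{2} \right). \]

Hence

\[
\begin{aligned}
T_*\mu(a, b) &= \mu(T^{-1}(a, b)) \\
&= \mu\left( \left( \frac{a}{2}, \frac{b}{2} \right) \cup \left( \frac{a+1}{2}, \frac{b+1}{2} \right) \right) \\
&= \frac{b}{2} - \frac{a}{2} + \frac{b+1}{2} - \frac{a+1}{2} = b - a = \mu(a, b).
\end{aligned}
\]

Hence T∗µ = µ on the algebra of finite unions of subintervals. As this algebra generates the Borel σ-algebra, by uniqueness in the Kolmogorov Extension Theorem we see that T∗µ = µ; i.e. Lebesgue measure is T -invariant.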
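As a quick sanity check of the preimage formula for the doubling map, the following Python sketch computes the two branches in exact arithmetic. It assumes 0 ≤ a < b ≤ 1 and uses hypothetical helper names; it illustrates the calculation above and is not part of the proof.

```python
from fractions import Fraction

def doubling_preimage(a, b):
    """T^{-1}(a, b) for T(x) = 2x mod 1: the two half-length intervals
    (a/2, b/2) and ((a+1)/2, (b+1)/2)."""
    return [(a / 2, b / 2), ((a + 1) / 2, (b + 1) / 2)]

def lebesgue(intervals):
    """Total length of a list of pairwise disjoint intervals."""
    return sum(hi - lo for lo, hi in intervals)

a, b = Fraction(1, 5), Fraction(7, 10)
pieces = doubling_preimage(a, b)
# Each branch has half the length of (a, b), so the total is b - a.
print(pieces, lebesgue(pieces), b - a)
```

Note that it is the preimage, not the image, whose measure is compared: the image of a short interval under T has twice its length, so µ(T B) = µ(B) fails in general even though µ(T−1B) = µ(B) holds.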


6.7 Fourier series

Let B denote the Borel σ-algebra on R/Z and let µ be Lebesgue measure. Given a Lebesgue integrable function f ∈ L1(R/Z,B, µ), we can associate to f the Fourier series

\[ \frac{a_0}{2} + \sum_{n=1}^{\infty} \left( a_n \cos 2\pi n x + b_n \sin 2\pi n x \right), \]

where

\[ a_n = 2\int_0^1 f(x) \cos 2\pi n x \, d\mu, \qquad b_n = 2\int_0^1 f(x) \sin 2\pi n x \, d\mu. \]

(Notice that we are not claiming that the series converges; we are just formally associating the Fourier series to f.)

We shall find it more convenient to work with a complex form of the Fourier series:

\[ \sum_{n=-\infty}^{\infty} c_n e^{2\pi i n x}, \]

where

\[ c_n = \int_0^1 f(x) e^{-2\pi i n x} \, d\mu. \]

(In particular, c0 = ∫ f dµ.)

We are still not making any assumption as to (i) whether the series converges at all, or (ii) whether, if the series does converge, it converges to f (x). In general, answering these questions relies on the class of function to which f belongs.

The weakest class of function is f ∈ L1(X,B, µ). In this case, we only know that the coefficients cn → 0 as |n| → ∞. Although this condition is clearly necessary for $\sum_{n=-\infty}^{\infty} c_n e^{2\pi i n x}$ to converge, it is not sufficient, and there exist examples of functions f ∈ L1(X,B, µ) for which the series does not converge to f (x).

Lemma 6.13 (Riemann-Lebesgue Lemma). If f ∈ L1(R/Z,B, µ) then cn → 0 as |n| → ∞, i.e.:

\[ \lim_{n\to\pm\infty} \int_0^1 f(x) e^{2\pi i n x} \, d\mu = 0. \]

It is of great interest and practical importance to know when and in what sense the Fourier series converges to the original function f. For convenience, we shall write the nth partial sum of a Fourier series as

\[ s_n(x) = \sum_{\ell=-n}^{n} c_\ell e^{2\pi i \ell x} \]

and the average of the first n partial sums as

\[ \sigma_n(x) = \frac{1}{n}\left( s_0(x) + s_1(x) + \cdots + s_{n-1}(x) \right). \]

We define L2(X,B, µ) to be the set of all functions f : X → R such that ∫ |f|² dµ < ∞. Notice that L2 ⊂ L1.



Theorem 6.14. (i) (Riesz-Fischer Theorem) If f ∈ L2(R/Z,B, µ) then sn converges to f in L2, i.e.,

\[ \int |s_n - f|^2 \, d\mu \to 0, \quad \text{as } n \to \infty. \]

(ii) (Fejér’s Theorem) If f ∈ C(R/Z) then σn converges uniformly to f as n → ∞, i.e., ‖σn − f‖∞ → 0, as n → ∞.

In summary:

Class of function    Property of Fourier coefficients         Fourier series converges to function
L1                   cn → 0                                   Not in general
L2                   partial sums sn converge                 Yes, sn → f (convergence in L2 sense)
continuous           averages σn of partial sums converge     Yes, σn → f (uniform convergence)

6.7.1 Rotations of a circle

Let T (x) = x + α mod 1 be a circle rotation. We now give an alternative method of proving that µ is T -invariant using Fourier series. Recall Lemma 6.7: µ is T -invariant if and only if

\[ \int f \circ T \, d\mu = \int f \, d\mu \quad \text{for all } f \in C(X,\mathbb{R}). \]

Heuristically, the argument is as follows. First note that

\[ \int e^{2\pi i n x} \, d\mu = \begin{cases} 0, & \text{if } n \neq 0 \\ 1, & \text{if } n = 0. \end{cases} \]

If f ∈ C(X,R) has Fourier series $\sum_{n\in\mathbb{Z}} c_n e^{2\pi i n x}$ then f ◦ T has Fourier series $\sum_{n\in\mathbb{Z}} c_n e^{2\pi i n \alpha} e^{2\pi i n x}$. The underlying idea is the following:

\[
\begin{aligned}
\int f \circ T \, d\mu &= \int \sum_{n\in\mathbb{Z}} c_n e^{2\pi i n \alpha} e^{2\pi i n x} \, d\mu \\
&= \sum_{n\in\mathbb{Z}} c_n e^{2\pi i n \alpha} \int e^{2\pi i n x} \, d\mu \\
&= c_0 = \int f \, d\mu.
\end{aligned}
\]

Notice that the above involves saying that ‘the integral of an infinite sum is the infinite sum of the integrals’. This is not necessarily the case, so to make this argument rigorous we need to use Theorem 6.14(ii) to justify this step.

Let f ∈ C(X,R). Then f has a Fourier series

\[ \sum_{n\in\mathbb{Z}} c_n e^{2\pi i n x}. \]



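Let sn(x) denote the nth partial sum:

\[ s_n(x) = \sum_{\ell=-n}^{n} c_\ell e^{2\pi i \ell x}. \]

Then

\[ s_n(Tx) = \sum_{\ell=-n}^{n} c_\ell e^{2\pi i \ell \alpha} e^{2\pi i \ell x} \]

and this is the nth partial sum for the Fourier series of f ◦ T. As ∫ e^{2πiℓx} dµ = 0 unless ℓ = 0, it follows that

\[ \int s_n \, d\mu = c_0 = \int s_n \circ T \, d\mu. \]

Consider

\[ \sigma_n(x) = \frac{1}{n}\left( s_0 + \cdots + s_{n-1} \right)(x). \]

Then σn(x) → f (x) uniformly. Moreover, σn(Tx) → f (Tx) uniformly. Hence

\[ \int f \, d\mu = \lim_{n\to\infty} \int \sigma_n \, d\mu = c_0 = \lim_{n\to\infty} \int \sigma_n \circ T \, d\mu = \int f \circ T \, d\mu \]

and Lemma 6.7 implies that Lebesgue measure is invariant.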


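6.7.2 The doubling map

Define T : X → X by T (x) = 2x mod 1.

Heuristically, the argument is as follows: if f has Fourier series $\sum_n c_n e^{2\pi i n x}$ then f ◦ T has Fourier series $\sum_n c_n e^{2\pi i 2n x}$. Hence

\[
\begin{aligned}
\int f \circ T \, d\mu &= \int \sum_n c_n e^{2\pi i 2n x} \, d\mu \\
&= \sum_n c_n \int e^{2\pi i 2n x} \, d\mu \\
&= c_0 \\
&= \int f \, d\mu.
\end{aligned}
\]

Again, this needs to be made rigorous, and the argument is similar to that above.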

6.7.3 Higher dimensional Fourier series

Let X = Rk/Zk be the k-dimensional torus and let µ denote Lebesgue measure on X. Let f ∈ L1(X,B, µ) be an integrable function defined on the torus. For each n = (n1, . . . , nk) ∈ Zk define

\[ c_n = \int f(x) e^{-2\pi i \langle n, x \rangle} \, d\mu \]

where ⟨n, x⟩ = n1x1 + · · · + nkxk. Then we can associate to f the Fourier series:

\[ \sum_{n\in\mathbb{Z}^k} c_n e^{2\pi i \langle n, x \rangle}, \]

where n = (n1, . . . , nk), x = (x1, . . . , xk). Essentially the same convergence results hold as in the case k = 1, provided that we write

\[ s_n(x) = \sum_{\ell_1=-n}^{n} \cdots \sum_{\ell_k=-n}^{n} c_\ell e^{2\pi i \langle \ell, x \rangle}. \]

Exercise 6.15. For an integer k ≥ 2 define T : R/Z → R/Z by T (x) = kx mod 1. Show that T preserves Lebesgue measure.

Exercise 6.16. Let β > 1 denote the golden ratio (so that β² = β + 1). Define T : [0, 1] → [0, 1] by T (x) = βx mod 1. Show that T does not preserve Lebesgue measure. Define the measure µ by µ(B) = ∫B k(x) dx where

\[
k(x) = \begin{cases}
\dfrac{1}{\frac{1}{\beta} + \frac{1}{\beta^3}} & \text{on } [0, 1/\beta) \\[2ex]
\dfrac{1}{\beta\left( \frac{1}{\beta} + \frac{1}{\beta^3} \right)} & \text{on } [1/\beta, 1).
\end{cases}
\]

By using the Kolmogorov Extension Theorem, show that T preserves µ.

Exercise 6.17. Define the logistic map T : [0, 1] → [0, 1] by T (x) = 4x(1 − x). Define the measure µ by

\[ \mu(B) = \frac{1}{\pi} \int_B \frac{1}{\sqrt{x(1-x)}} \, dx. \]

(i) Check that µ is a probability measure.

(ii) By using the Kolmogorov Extension Theorem, show that T preserves µ.


6.8 The continued fraction map

Recall that the continued fraction map T : [0, 1) → [0, 1) is defined by

\[
T(x) = \begin{cases}
0 & \text{if } x = 0, \\
\left\{ \dfrac{1}{x} \right\} = \dfrac{1}{x} \bmod 1 & \text{if } 0 < x < 1.
\end{cases}
\]

One can easily show that the continued fraction map does not preserve Lebesgue measure, i.e. there exists B ∈ B such that T−1B and B have different measure. (Indeed, choose B to be any interval.)

Although the continued fraction map does not preserve Lebesgue measure, it does preserve Gauss’ measure µ, defined by

\[ \mu(B) = \frac{1}{\log 2} \int_B \frac{1}{1+x} \, dx. \]


Remark 6.18. Two measures are said to be equivalent if they have the same sets of measure zero. Gauss’ measure and Lebesgue measure are equivalent. This means that any property that holds for µ-almost every point also holds for Lebesgue almost every point. This remark will have applications later when we use Birkhoff’s Ergodic Theorem to describe properties of the continued fraction expansion for typical (i.e. Lebesgue almost every) points.

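Proof. Using the Kolmogorov Extension Theorem argument, we only have to check that µ(T−1I) = µ(I) for intervals. If I = (a, b) then

\[ T^{-1}(a, b) = \bigcup_{n=1}^{\infty} \left( \frac{1}{b+n}, \frac{1}{a+n} \right). \]

Thus

\[
\begin{aligned}
\mu(T^{-1}(a, b))
&= \frac{1}{\log 2} \sum_{n=1}^{\infty} \int_{\frac{1}{b+n}}^{\frac{1}{a+n}} \frac{1}{1+x} \, dx \\
&= \frac{1}{\log 2} \sum_{n=1}^{\infty} \left[ \log\left( 1 + \frac{1}{a+n} \right) - \log\left( 1 + \frac{1}{b+n} \right) \right] \\
&= \frac{1}{\log 2} \sum_{n=1}^{\infty} \left[ \log(a+n+1) - \log(a+n) - \log(b+n+1) + \log(b+n) \right] \\
&= \lim_{N\to\infty} \frac{1}{\log 2} \sum_{n=1}^{N} \left[ \log(a+n+1) - \log(a+n) - \log(b+n+1) + \log(b+n) \right] \\
&= \frac{1}{\log 2} \lim_{N\to\infty} \left[ \log(a+N+1) - \log(a+1) - \log(b+N+1) + \log(b+1) \right] \\
&= \frac{1}{\log 2} \left( \log(b+1) - \log(a+1) + \lim_{N\to\infty} \log\left( \frac{a+N+1}{b+N+1} \right) \right) \\
&= \frac{1}{\log 2} \left( \log(b+1) - \log(a+1) \right) \\
&= \frac{1}{\log 2} \int_a^b \frac{1}{1+x} \, dx = \mu(a, b),
\end{aligned}
\]

as required.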
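The telescoping sum in this proof can be approximated numerically. The Python sketch below truncates the union of branch intervals at a finite number of terms (an assumption made purely for computation, so the agreement is only approximate) and compares Gauss' measure and Lebesgue measure of the truncated preimage with the measure of (a, b). The function names and the truncation level are hypothetical.

```python
import math

def gauss_measure(a, b):
    """Gauss measure of (a, b): (1/log 2) * integral over (a, b) of dx/(1+x)."""
    return (math.log1p(b) - math.log1p(a)) / math.log(2)

def gauss_measure_of_preimage(a, b, terms=100_000):
    """Sum of the Gauss measures of the first `terms` branch intervals
    (1/(b+n), 1/(a+n)) of T^{-1}(a, b) for the continued fraction map."""
    return sum(gauss_measure(1 / (b + n), 1 / (a + n))
               for n in range(1, terms + 1))

def lebesgue_measure_of_preimage(a, b, terms=100_000):
    """Same truncated sum, but with Lebesgue measure (interval length)."""
    return sum(1 / (a + n) - 1 / (b + n) for n in range(1, terms + 1))

a, b = 0.0, 0.5
# Gauss measure: the two values should agree up to the truncation error.
print(gauss_measure_of_preimage(a, b), gauss_measure(a, b))
# Lebesgue measure: the preimage of (0, 1/2) is visibly longer than 1/2.
print(lebesgue_measure_of_preimage(a, b), b - a)
```

The truncation error for Gauss' measure is the tail term log((b + N + 1)/(a + N + 1))/log 2 from the proof, which tends to 0 as N → ∞.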

Exercise 6.19. Define the map T : [0, 1] → [0, 1] by

\[
T(x) = \begin{cases}
\dfrac{x}{1-x} & \text{if } 0 \le x \le 1/2 \\[1.5ex]
\dfrac{1-x}{x} & \text{if } 1/2 \le x \le 1.
\end{cases}
\]

Define the measure µ on [0, 1] by

\[ \mu(B) = \int_B \frac{dx}{x} \]

(note that the measure µ is not a probability measure as µ([0, 1]) = ∞).


(i) Show that µ([a, b]) = log b − log a.

(ii) Show that

\[ T^{-1}[a, b] = \left[ \frac{a}{1+a}, \frac{b}{1+b} \right] \cup \left[ \frac{1}{1+b}, \frac{1}{1+a} \right]. \]

(iii) Show that µ is T -invariant.

(iv) Define h : [0, 1] → [0, ∞] by

\[ h(x) = \frac{1}{x} - 1. \]

Define S = hTh−1 : [0, ∞] → [0, ∞] (so that S and T are topologically conjugate, i.e. they have the same dynamics). Show that we have

\[
S(x) = \begin{cases}
x - 1 & \text{if } 1 \le x < \infty \\
\dfrac{1}{x} - 1 & \text{if } 0 \le x < 1.
\end{cases}
\]

Relate the map S to continued fractions.

6.9 Linear toral endomorphisms

Let T : Rk/Zk → Rk/Zk be a linear toral endomorphism. Recall that this means that T is given as follows:

T (x1, . . . , xk) = A(x1, . . . , xk) mod 1

where A = (ai,j) is a k × k matrix with entries in Z and with det A ≠ 0.

We shall show that µ is T -invariant by using Fourier series.

6.9.1 Fourier series in higher dimensions

Let X = Rk/Zk be the k-dimensional torus and let µ denote Lebesgue measure on X. Let f ∈ L1(X,B, µ) be an integrable function defined on the torus. For each n = (n1, . . . , nk) ∈ Zk define

\[ c_n = \int f(x_1, \ldots, x_k) e^{-2\pi i \langle n, x \rangle} \, d\mu \]

where ⟨n, x⟩ = n1x1 + · · · + nkxk.

Then we can associate to f the Fourier series:

\[ \sum_{n\in\mathbb{Z}^k} c_n e^{2\pi i \langle n, x \rangle}, \]

where n = (n1, . . . , nk), x = (x1, . . . , xk). Essentially the same convergence results hold as in the case k = 1, provided that we write

\[ s_n(x) = \sum_{\ell_1=-n}^{n} \cdots \sum_{\ell_k=-n}^{n} c_\ell e^{2\pi i \langle \ell, x \rangle}. \]


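As in the one-dimensional case, we have that

\[ c_0 = \int f \, d\mu, \]

and

\[ \int e^{2\pi i \langle n, x \rangle} \, d\mu = \begin{cases} 0 & \text{if } n \neq (0, \ldots, 0) \\ 1 & \text{if } n = (0, \ldots, 0). \end{cases} \]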

6.9.2 Lebesgue measure is an invariant measure for a toral endomorphism

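Let µ denote Lebesgue measure. To show that µ is T -invariant, it is sufficient to prove that for each continuous function f ∈ C(X,R) we have

\[ \int f \circ T \, d\mu = \int f \, d\mu. \]

We associate to such an f its Fourier series:

\[ \sum_{n\in\mathbb{Z}^k} c_n e^{2\pi i \langle n, x \rangle}. \]

Then f ◦ T has Fourier series

\[ \sum_{n\in\mathbb{Z}^k} c_n e^{2\pi i \langle n, Ax \rangle}. \]

Intuitively, we can write

\[
\begin{aligned}
\int f \circ T \, d\mu &= \int \sum_{n\in\mathbb{Z}^k} c_n e^{2\pi i \langle n, Ax \rangle} \, d\mu \\
&= \int \sum_{n\in\mathbb{Z}^k} c_n e^{2\pi i \langle nA, x \rangle} \, d\mu \\
&= \sum_{n\in\mathbb{Z}^k} c_n \int e^{2\pi i \langle nA, x \rangle} \, d\mu.
\end{aligned}
\]

Using the fact that det A ≠ 0, we see that nA = 0 if and only if n = 0. Hence, all of the integrals above are 0, except for the term corresponding to n = 0. Hence

\[ \int f \circ T \, d\mu = c_0 = \int f \, d\mu. \]

(This argument can be made rigorous as for circle rotations.)

Therefore, by Lemma 6.7, µ is T -invariant.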

Exercise 6.20. Fix α ∈ R and define the map T : R2/Z2 → R2/Z2 by

T (x, y) = (x + α, x + y).

By using Fourier Series, sketch a proof that Lebesgue measure is T -invariant.


6.10 Shifts of finite type

Let A be a k × k matrix with entries equal to 0 or 1. Recall that we have defined the (two-sided) shift of finite type by

\[ \Sigma_A = \{ x = (x_n) \in \{1, \ldots, k\}^{\mathbb{Z}} \mid A(x_n, x_{n+1}) = 1 \ \forall n \in \mathbb{Z} \} \]

and the (one-sided) shift of finite type

\[ \Sigma_A^+ = \{ x = (x_n) \in \{1, \ldots, k\}^{\mathbb{Z}^+} \mid A(x_n, x_{n+1}) = 1 \ \forall n \in \mathbb{Z}^+ \}. \]

The shift maps σ : ΣA → ΣA, σ : Σ+A → Σ+A are defined by

\[ \sigma(\ldots, x_{-1}, \underbrace{x_0}_{\text{0th place}}, x_1, x_2, \ldots) = (\ldots, x_0, \underbrace{x_1}_{\text{0th place}}, x_2, x_3, \ldots), \]

\[ \sigma(x_0, x_1, x_2, x_3, \ldots) = (x_1, x_2, x_3, \ldots), \]

respectively, i.e., σ shifts sequences one place to the left.

As an analogue of intervals in this case, we have so-called ‘cylinder sets’, formed by fixing a finite set of co-ordinates. More precisely, in ΣA we define

\[ [y_{-m}, \ldots, y_{-1}, y_0, y_1, \ldots, y_n] = \{ x \in \Sigma_A \mid x_i = y_i, \ -m \le i \le n \}, \]

and in Σ+A we define

\[ [y_0, y_1, \ldots, y_n] = \{ x \in \Sigma_A^+ \mid x_i = y_i, \ 0 \le i \le n \}. \]

In each case the cylinder sets form a semi-algebra which generates the Borel σ-algebra. (Cylinder sets are both open and closed.)

6.10.1 The Perron-Frobenius theorem

The following standard result will be useful.

Theorem 6.21 (Perron-Frobenius). Let B be a non-negative aperiodic k × k matrix (i.e. Bi,j ≥ 0 for each 1 ≤ i, j ≤ k and there exists n > 0 such that (B^n)i,j > 0 for all 1 ≤ i, j ≤ k). Then:

(i) there exists a positive eigenvalue λ > 0 such that all other eigenvalues λi ∈ C satisfy |λi| < λ,

(ii) the eigenvalue λ is simple (i.e. the corresponding eigenspace is one-dimensional),

(iii) there is a unique right-eigenvector v = (v1, . . . , vk)^T such that vj > 0, $\sum_{j=1}^{k} |v_j| = 1$ and Bv = λv,

(iv) there is a unique positive left-eigenvector u = (u1, . . . , uk) such that uj > 0, $\sum_{j=1}^{k} |u_j| = 1$ and uB = λu,

(v) eigenvectors corresponding to eigenvalues other than λ are not positive: i.e. at least one co-ordinate is positive and at least one co-ordinate is negative.



6.10.2 Markov measures

We will now see how to construct a large class of σ-invariant measures on shifts of finite type.

Definition 6.22. A k × k matrix P is said to be stochastic if:

(i) P (i, j) ≥ 0, i, j = 1, . . . , k,

(ii) $\sum_{j=1}^{k} P(i, j) = 1$, i = 1, . . . , k.

Suppose that P is compatible with A, i.e.,

P (i, j) > 0 ⇐⇒ A(i, j) = 1.

Suppose in addition that P, or equivalently A, is aperiodic, i.e., there exists n such that for each i, j we have P^n(i, j) > 0.

Applying the Perron-Frobenius theorem, we see that there exists a unique maximal eigenvalue λ. As P is stochastic, we must have that λ = 1 (why?). Moreover, by (ii) in the above definition, the right-eigenvector is (1, . . . , 1). Let p = (p1, . . . , pk) be the corresponding (strictly positive) left eigenvector, normalised so that $\sum_{i=1}^{k} p_i = 1$.

Now we define a probability measure µ = µP on ΣA, Σ+A by

\[ \mu_P[y_\ell, y_{\ell+1}, \ldots, y_n] = p_{y_\ell} P(y_\ell, y_{\ell+1}) \cdots P(y_{n-1}, y_n), \]

on cylinder sets. (By the Kolmogorov Extension Theorem, this uniquely defines a measure on the whole Borel σ-algebra.)

We shall show that the measure µP on Σ+A is σ-invariant. By the Kolmogorov Extension Theorem, it is enough to show that µP and σ∗µP agree on cylinder sets. Now

\[
\begin{aligned}
\sigma_*\mu_P[y_0, y_1, \ldots, y_n] &= \mu_P(\sigma^{-1}[y_0, y_1, \ldots, y_n]) \\
&= \mu_P\left( \bigcup_{j=1}^{k} [j, y_0, y_1, \ldots, y_n] \right) \\
&= \sum_{j=1}^{k} \mu_P[j, y_0, y_1, \ldots, y_n] \\
&= \sum_{j=1}^{k} p_j P(j, y_0) P(y_0, y_1) \cdots P(y_{n-1}, y_n) \\
&= p_{y_0} P(y_0, y_1) \cdots P(y_{n-1}, y_n) \\
&= \mu_P[y_0, y_1, \ldots, y_n],
\end{aligned}
\]

as required (where to get the penultimate line we have used the fact that pP = p). Probability measures of this form are called Markov measures.
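The invariance calculation on cylinder sets can be checked on a small example. The Python sketch below assumes a specific, hypothetical choice: the one-sided shift on two symbols in which the transition from the second symbol to itself is forbidden, with symbols indexed from 0 rather than from 1 as in the text. The matrix P and vector p are illustrative values chosen so that pP = p.

```python
from fractions import Fraction as F

# Hypothetical example: symbols {0, 1}, transition 1 -> 1 forbidden.
P = [[F(1, 2), F(1, 2)],
     [F(1),    F(0)]]
p = [F(2, 3), F(1, 3)]   # left eigenvector: p P = p, entries sum to 1

def cylinder_measure(word):
    """mu_P[y_0, ..., y_n] = p_{y_0} P(y_0, y_1) ... P(y_{n-1}, y_n)."""
    m = p[word[0]]
    for s, t in zip(word, word[1:]):
        m *= P[s][t]
    return m

def pushforward_cylinder_measure(word):
    """mu_P(sigma^{-1}[word]): sum over all symbols j of mu_P[j, word]."""
    return sum(cylinder_measure((j,) + tuple(word)) for j in range(len(p)))

for word in [(0,), (1,), (0, 1), (1, 0), (0, 0, 1), (1, 1)]:
    # The two columns should agree for every cylinder.
    print(word, cylinder_measure(word), pushforward_cylinder_measure(word))
```

Cylinders containing the forbidden transition, such as (1, 1), receive measure 0 on both sides, consistent with P being compatible with A.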

Given an aperiodic k × k matrix A there are of course many compatible stochastic matrices P. Each such stochastic matrix generates a different Markov measure. However, there are several naturally defined measures that turn out to be Markov, and we give two of them here.



6.10.3 Full shifts

Recall that if A(i, j) = 1 for all i, j then

\[ \Sigma_A = \{1, \ldots, k\}^{\mathbb{Z}}, \qquad \Sigma_A^+ = \{1, \ldots, k\}^{\mathbb{Z}^+} \]

are the full shifts on k symbols. In this case we may define a (family of) measures by taking p = (p1, . . . , pk) to be any (positive) probability vector (i.e., pi > 0, $\sum_{i=1}^{k} p_i = 1$) and defining

\[ \mu[y_\ell, \ldots, y_n] = p_{y_\ell} \cdots p_{y_n}. \]

Such a µ is called a Bernoulli measure.

Exercise 6.23. Show that Bernoulli measures are Markov measures (i.e. construct a matrix P for which (p1, . . . , pk)P = (p1, . . . , pk)).

6.10.4 The Parry measure

As A is a non-negative aperiodic matrix, by the Perron-Frobenius theorem there exists a unique maximal eigenvalue λ with corresponding left and right eigenvectors u = (u1, . . . , uk) and v = (v1, . . . , vk), respectively. Define

\[ P_{i,j} = \frac{A_{i,j} v_j}{\lambda v_i}, \qquad p_i = \frac{u_i v_i}{c}, \]

where $c = \sum_{i=1}^{k} u_i v_i$.

Exercise 6.24. Show that P is a stochastic matrix and that p is a normalised left-eigenvector for P.

The corresponding Markov measure is called the Parry measure.
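The construction of P and p can be tried out numerically. The Python sketch below is an illustration under stated assumptions: it approximates the Perron eigenvalue and eigenvectors by power iteration (a numerical method not discussed in the notes), and it uses the hypothetical 2 × 2 matrix A with rows (1, 1) and (1, 0) as the example. All function names are chosen for this example.

```python
def perron_eigen(A, transpose=False, iters=200):
    """Approximate the maximal eigenvalue of an aperiodic non-negative matrix
    and a positive eigenvector with entries summing to 1, by power iteration.
    Returns a right eigenvector of A, or a left one if transpose=True."""
    k = len(A)
    M = [[A[j][i] for j in range(k)] for i in range(k)] if transpose else A
    v = [1.0 / k] * k
    lam = 0.0
    for _ in range(iters):
        w = [sum(M[i][j] * v[j] for j in range(k)) for i in range(k)]
        lam = sum(w)              # v sums to 1, so sum(Mv) approximates lambda
        v = [x / lam for x in w]
    return lam, v

def parry(A):
    """Return (lambda, P, p) with P[i][j] = A[i][j] v[j] / (lambda v[i])
    and p[i] = u[i] v[i] / c, where c = sum_i u[i] v[i]."""
    k = len(A)
    lam, v = perron_eigen(A)
    _, u = perron_eigen(A, transpose=True)
    P = [[A[i][j] * v[j] / (lam * v[i]) for j in range(k)] for i in range(k)]
    c = sum(u[i] * v[i] for i in range(k))
    p = [u[i] * v[i] / c for i in range(k)]
    return lam, P, p

A = [[1, 1], [1, 0]]              # hypothetical example matrix
lam, P, p = parry(A)
print(lam)                        # maximal eigenvalue of A
print([sum(row) for row in P])    # row sums of P
print(p)                          # candidate stationary vector
```

For this example the row sums of P should be numerically 1 and p should satisfy pP = p up to rounding, which is what Exercise 6.24 asks to be shown in general.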


7 Ergodicity

7.1 The definition of ergodicity

In this section, we introduce what it means to say that a transformation is ergodic with respect to an invariant measure. Ergodicity is an important concept for many reasons, not least because Birkhoff’s Ergodic Theorem holds:

Theorem 7.1. Let T be an ergodic transformation of the probability space (X,B, µ) and let f ∈ L1(X,B, µ). Then

\[ \frac{1}{n} \sum_{j=0}^{n-1} f(T^j x) \to \int f \, d\mu \]

for µ-almost every x ∈ X.

Definition 7.2. Let (X,B, µ) be a probability space and let T : X → X be a measure-preserving transformation. We say that T is an ergodic transformation (or µ is an ergodic measure) if, for B ∈ B,

T−1B = B ⇒ µ(B) = 0 or 1.

Remark 7.3. One can view ergodicity as an indecomposability condition. If ergodicity does not hold and we have T−1A = A with 0 < µ(A) < 1, then one can split T : X → X into T : A → A and T : X \ A → X \ A with invariant probability measures $\frac{1}{\mu(A)}\mu(\cdot \cap A)$ and $\frac{1}{1-\mu(A)}\mu(\cdot \cap (X \setminus A))$, respectively.

It will sometimes be convenient for us to weaken the condition T−1B = B to µ(T−1B △ B) = 0, where △ denotes the symmetric difference:

A △ B = (A \ B) ∪ (B \ A).

The next lemma allows us to do this.

Lemma 7.4. If B ∈ B satisfies µ(T−1B △ B) = 0 then there exists B∞ ∈ B with T−1B∞ = B∞ and µ(B △ B∞) = 0. (In particular, µ(B) = µ(B∞).)

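Proof. For each j ≥ 0, we have the inclusion

\[ T^{-j}B \,\triangle\, B \subset \bigcup_{i=0}^{j-1} \left( T^{-(i+1)}B \,\triangle\, T^{-i}B \right) = \bigcup_{i=0}^{j-1} T^{-i}\left( T^{-1}B \,\triangle\, B \right) \]

and so (since T preserves µ)

\[ \mu(T^{-j}B \,\triangle\, B) \le j\,\mu(T^{-1}B \,\triangle\, B) = 0. \]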


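Let

\[ B_\infty = \bigcap_{j=0}^{\infty} \bigcup_{i=j}^{\infty} T^{-i}B. \]

We have that

\[ \mu\left( B \,\triangle\, \bigcup_{i=j}^{\infty} T^{-i}B \right) \le \sum_{i=j}^{\infty} \mu(B \,\triangle\, T^{-i}B) = 0. \]

Since the sets $\bigcup_{i=j}^{\infty} T^{-i}B$ decrease as j increases we hence have µ(B △ B∞) = 0. Also,

\[ T^{-1}B_\infty = \bigcap_{j=0}^{\infty} \bigcup_{i=j}^{\infty} T^{-(i+1)}B = \bigcap_{j=0}^{\infty} \bigcup_{i=j+1}^{\infty} T^{-i}B = B_\infty, \]

as required.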

Corollary 7.5. If T is ergodic and µ(T−1B △ B) = 0 then µ(B) = 0 or 1.

Remark 7.6. Occasionally, instead of saying that µ(A △ B) = 0, we will say that A = B a.e. or A = B mod 0.

7.2 An alternative characterisation of ergodicity

The next result characterises ergodicity in a convenient way.

Proposition 7.7. Let T be a measure-preserving transformation of (X,B, µ). The following are equivalent:

(i) T is ergodic;

(ii) whenever f ∈ L1(X,B, µ) satisfies f ◦ T = f µ-a.e. we have that f is constant µ-a.e.

Remark 7.8. We can replace L1 in the statement by measurable or by L2.

Proof. (i) ⇒ (ii): Suppose that T is ergodic and that f ∈ L1(X,B, µ) with f ◦ T = f µ-a.e. For k ∈ Z and n ∈ N, define

\[ X(k, n) = \left\{ x \in X \;\middle|\; \frac{k}{2^n} \le f(x) < \frac{k+1}{2^n} \right\} = f^{-1}\left( \left[ \frac{k}{2^n}, \frac{k+1}{2^n} \right) \right). \]

Since f is measurable, X(k, n) ∈ B.

We have that

\[ T^{-1}X(k, n) \,\triangle\, X(k, n) \subset \{ x \in X \mid f(Tx) \neq f(x) \} \]

so that

\[ \mu(T^{-1}X(k, n) \,\triangle\, X(k, n)) = 0. \]


Hence µ(X(k, n)) = 0 or µ(X(k, n)) = 1.

For each fixed n, the union $\bigcup_{k\in\mathbb{Z}} X(k, n)$ is equal to X up to a set of measure zero, i.e.,

\[ \mu\left( X \,\triangle\, \bigcup_{k\in\mathbb{Z}} X(k, n) \right) = 0, \]

and this union is disjoint. Hence we have

\[ \sum_{k\in\mathbb{Z}} \mu(X(k, n)) = \mu(X) = 1 \]

and so there is a unique kn for which µ(X(kn, n)) = 1. Let

\[ Y = \bigcap_{n=1}^{\infty} X(k_n, n). \]

Then µ(Y ) = 1 and, by construction, f is constant on Y, i.e., f is constant µ-a.e.

(ii) ⇒ (i): Suppose that B ∈ B with T−1B = B. Then χB ∈ L1(X,B, µ) and χB ◦ T (x) = χB(x) ∀ x ∈ X, so, by hypothesis, χB is constant µ-a.e. Since χB only takes the values 0 and 1, we must have χB = 0 µ-a.e. or χB = 1 µ-a.e. Therefore

\[ \mu(B) = \int_X \chi_B \, d\mu = 0 \text{ or } 1, \]

and T is ergodic.

7.3 Rotations of a circle

Fix α ∈ R and define T : R/Z → R/Z by T (x) = x + α mod 1. We have already seen that T preserves Lebesgue measure.

Theorem 7.9. Let T (x) = x + α mod 1.

(i) If α ∈ Q then T is not ergodic.

(ii) If α ∉ Q then T is ergodic.

Proof. Suppose that α ∈ Q and write α = p/q for p, q ∈ Z with q ≠ 0. Define

\[ f(x) = e^{2\pi i q x} \in L^2(X,\mathcal{B},\mu). \]

Then f is not constant but

\[ f(Tx) = e^{2\pi i q (x + p/q)} = e^{2\pi i (qx + p)} = e^{2\pi i q x} = f(x). \]

Hence T is not ergodic.


Suppose that α ∉ Q. Suppose that f ∈ L2(X,B, µ) is such that f ◦ T = f a.e. Suppose that f has Fourier series

\[ \sum_{n=-\infty}^{\infty} c_n e^{2\pi i n x}. \]

Then f ◦ T has Fourier series

\[ \sum_{n=-\infty}^{\infty} c_n e^{2\pi i n \alpha} e^{2\pi i n x}. \]

Comparing Fourier coefficients we see that

\[ c_n = c_n e^{2\pi i n \alpha}, \]

for all n ∈ Z. As α ∉ Q, e^{2πinα} ≠ 1 unless n = 0. Hence cn = 0 for n ≠ 0. Hence f has Fourier series c0, i.e. f is constant a.e.
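The dichotomy between rational and irrational rotations can be observed in ergodic averages of the kind appearing in Theorem 7.1. The Python sketch below is a numerical illustration only, not a proof: it uses the real part of the function e^{2πiqx} from the first half of the proof with the hypothetical choice q = 4, whose Lebesgue integral is 0, and evaluates averages along a single orbit in floating point. The function names are chosen for this example.

```python
import math

def birkhoff_average(f, alpha, x0, n):
    """(1/n) * sum_{j<n} f(T^j x0) for the rotation T(x) = x + alpha mod 1."""
    return sum(f((x0 + j * alpha) % 1.0) for j in range(n)) / n

def f(x, q=4):
    """Real part of e^{2 pi i q x}; it is invariant under rotation by p/q
    and has Lebesgue integral 0."""
    return math.cos(2 * math.pi * q * x)

# Rational rotation alpha = 1/4: f is constant along each orbit, so the
# average depends on the starting point rather than on the integral of f.
print(birkhoff_average(f, 0.25, 0.0, 1000))
print(birkhoff_average(f, 0.25, 0.1, 1000))

# Irrational rotation alpha = sqrt(2): the average should be close to 0.
print(birkhoff_average(f, math.sqrt(2), 0.1, 20000))
```

For the rational rotation the two starting points give different limits, which is incompatible with the conclusion of Theorem 7.1 and so reflects the failure of ergodicity.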

Exercise 7.10. Show that, when α ∈ Q, the rotation T (x) = x + α mod 1 is not ergodic from the definition, i.e. find an invariant set B = T−1B which has Lebesgue measure 0 < µ(B) < 1.

7.4 The doubling map

Let X = R/Z and define T : X → X by T (x) = 2x mod 1.

Proposition 7.11. The doubling map T is ergodic with respect to Lebesgue measure µ.

Proof. Let f ∈ L2(R/Z,B, µ) and suppose that f ◦ T = f µ-a.e. Let f have Fourier series

\[ f(x) = \sum_{m=-\infty}^{\infty} a_m e^{2\pi i m x} \quad (\text{in } L^2). \]

For each j ≥ 0, f ◦ T^j has Fourier series

\[ \sum_{m=-\infty}^{\infty} a_m e^{2\pi i m 2^j x}. \]

Comparing Fourier coefficients we see that

\[ a_m = a_{2^j m} \]

for all m ∈ Z and each j = 0, 1, 2, . . .. The Riemann-Lebesgue Lemma says that an → 0 as |n| → ∞. Hence, if m ≠ 0, we have that $a_m = a_{2^j m} \to 0$ as j → ∞. Hence for m ≠ 0 we have that am = 0. Thus f has Fourier series a0, and so must be equal to a constant a.e. Hence T is ergodic with respect to µ.


7.5 Linear toral automorphisms

Let X = Rk/Zk and let µ denote Lebesgue measure. Let A be a k × k integer matrix with det A = ±1 and define T : X → X by

T (x1, . . . , xk) = A(x1, . . . , xk) mod 1.

Proposition 7.12. A linear toral automorphism T is ergodic with respect to µ if and only if no eigenvalue of A is a root of unity.

Remark 7.13. In particular, hyperbolic toral automorphisms (i.e. no eigenvalues of modulus 1) are ergodic with respect to Lebesgue measure.

To prove this criterion, we shall use the following:

Lemma 7.14. Let T be a linear toral automorphism. The following are equivalent:

(i) T is ergodic with respect to µ;

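(ii) the only m ∈ Zk for which there exists p > 0 such that

\[ e^{2\pi i \langle m, A^p x \rangle} = e^{2\pi i \langle m, x \rangle} \quad \mu\text{-a.e.} \]

is m = 0.

Proof. (i) ⇒ (ii): Suppose that T is ergodic and that there exists m ∈ Zk and p > 0 such that

\[ e^{2\pi i \langle m, A^p x \rangle} = e^{2\pi i \langle m, x \rangle} \quad \mu\text{-a.e.} \]

Let p be the least such exponent and define

\[
\begin{aligned}
f(x) &= e^{2\pi i \langle m, x \rangle} + e^{2\pi i \langle m, Ax \rangle} + \cdots + e^{2\pi i \langle m, A^{p-1}x \rangle} \\
&= e^{2\pi i \langle m, x \rangle} + e^{2\pi i \langle mA, x \rangle} + \cdots + e^{2\pi i \langle mA^{p-1}, x \rangle}.
\end{aligned}
\]

Then f ∈ L2(Rk/Zk,B, µ) and, since e^{2πi⟨m,·⟩} ◦ T = e^{2πi⟨m,A·⟩}, we have f ◦ T = f µ-a.e. Since T is ergodic, we thus have f = constant a.e. However, the only way that this can happen is if m = 0.

(ii) ⇒ (i): Now suppose that if, for some m ∈ Zk and p > 0, we have

\[ e^{2\pi i \langle m, A^p x \rangle} = e^{2\pi i \langle m, x \rangle} \quad \mu\text{-a.e.}, \]

then m = 0. Let f ∈ L2(Rk/Zk,B, µ) and suppose that f ◦ T = f µ-a.e. We shall show that T is ergodic by showing that f is constant µ-a.e.

We may expand f as a Fourier series

\[ f(x) = \sum_{m\in\mathbb{Z}^k} a_m e^{2\pi i \langle m, x \rangle} \quad (\text{in } L^2). \]

Since f ◦ T^p = f µ-a.e., for all p > 0, we have

\[ \sum_{m\in\mathbb{Z}^k} a_m e^{2\pi i \langle mA^p, x \rangle} = \sum_{m\in\mathbb{Z}^k} a_m e^{2\pi i \langle m, x \rangle}, \]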


for all p > 0. By the uniqueness of Fourier expansions, we can compare coefficients and obtain, for every m ∈ Zk,

\[ a_m = a_{mA} = \cdots = a_{mA^p} = \cdots. \]

If am ≠ 0 then there can only be finitely many indices in the above list, for otherwise it would contradict the fact that am → 0 as |m| → ∞. In other words, there exists p > 0 such that

\[ m = mA^p. \]

But then e^{2πi⟨m,A^p x⟩} = e^{2πi⟨m,x⟩} and so, by hypothesis, m = 0. Thus am = 0 for all m ≠ 0 and so f is equal to the constant a0 µ-a.e. Therefore T is ergodic.

Proof of Proposition 7.12. We shall prove the contrapositive statements in each case.

First suppose that T is not ergodic. Then, by the Lemma, there exists m ∈ Zk \ {0} and p > 0 such that

\[ e^{2\pi i \langle m, A^p x \rangle} = e^{2\pi i \langle m, x \rangle}, \]

or, equivalently, e^{2πi⟨mA^p,x⟩} = e^{2πi⟨m,x⟩}, which is to say that mA^p = m. Thus A^p has 1 as an eigenvalue and hence A has a pth root of unity as an eigenvalue.

Now suppose that A has a pth root of unity as an eigenvalue. Then A^p has 1 as an eigenvalue and so

\[ m(A^p - I) = 0 \]

for some m ∈ Rk \ {0}. Since A is an integer matrix, we can in fact take m ∈ Zk \ {0}. We have

\[ e^{2\pi i \langle m, A^p x \rangle} = e^{2\pi i \langle mA^p, x \rangle} = e^{2\pi i \langle m, x \rangle}, \]

so, by the Lemma, T is not ergodic.
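The criterion "A^p has 1 as an eigenvalue for some p > 0" used in this proof can be tested mechanically for 2 × 2 matrices. The Python sketch below is a hedged illustration: it relies on the standard fact, not proved in these notes, that a root of unity which is an eigenvalue of a 2 × 2 integer matrix must have order 1, 2, 3, 4 or 6, so that it suffices to check det(A^p − I) = 0 for p = 1, . . . , 6. The function names and example matrices are hypothetical.

```python
def mat_mul(A, B):
    """Product of two 2x2 integer matrices given as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def det2(M):
    """Determinant of a 2x2 matrix."""
    return M[0][0] * M[1][1] - M[0][1] * M[1][0]

def has_root_of_unity_eigenvalue(A):
    """Check whether A^p has eigenvalue 1, i.e. det(A^p - I) = 0,
    for some p in 1..6 (sufficient for 2x2 integer matrices)."""
    power = [[1, 0], [0, 1]]
    for _ in range(6):
        power = mat_mul(power, A)
        shifted = [[power[0][0] - 1, power[0][1]],
                   [power[1][0], power[1][1] - 1]]
        if det2(shifted) == 0:
            return True
    return False

def is_ergodic_automorphism(A):
    """Apply the criterion of Proposition 7.12 to a 2x2 integer matrix."""
    if det2(A) not in (1, -1):
        raise ValueError("not a toral automorphism: det A must be +1 or -1")
    return not has_root_of_unity_eigenvalue(A)

print(is_ergodic_automorphism([[2, 1], [1, 1]]))    # the cat map
print(is_ergodic_automorphism([[0, -1], [1, 0]]))   # rotation by a quarter turn
```

Exact integer arithmetic is used so that the test det(A^p − I) = 0 is not affected by rounding; in higher dimensions the range of p to check would have to be enlarged.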

Exercise 7.15. Define T : R2/Z2 → R2/Z2 by

T (x, y) = (x + α, x + y).

Suppose that α ∉ Q. By using Fourier series, show that T is ergodic with respect to Lebesgue measure.

Exercise 7.16. (This exercise is outside the scope of the course and so is not examinable.)

It is easy to construct lots of examples of hyperbolic toral automorphisms (i.e. no eigenvalues of modulus 1; the cat map is such an example), which must necessarily be ergodic with respect to Lebesgue measure. It is harder to show that there are ergodic toral automorphisms with some eigenvalues of modulus 1.

(i) Show that to have an ergodic toral automorphism of Rk/Zk with an eigenvalue of modulus 1, we must have k ≥ 4.

Consider the matrix

\[
A = \begin{pmatrix}
0 & 1 & 0 & 0 \\
0 & 0 & 1 & 0 \\
0 & 0 & 0 & 1 \\
-1 & 8 & -6 & 8
\end{pmatrix}.
\]


(ii) Show that A defines a linear toral automorphism TA of the 4-dimensional torus R4/Z4.

(iii) Show that A has four eigenvalues, two of which have modulus 1.

(iv*) Show that TA is ergodic with respect to Lebesgue measure. (Hint: you have to show that the two eigenvalues of modulus 1 are not roots of unity, i.e. are not solutions to λ^n − 1 = 0 for some n. The best way to do this is to use results from Galois theory on the irreducibility of polynomials.)

7.6 Existence of ergodic measures

We now return to the general theory of studying the structure of continuous transformations of compact metric spaces. Recall that we have already seen that the space M(X,T ) of T -invariant probability measures is always non-empty. We now describe how ergodic measures (for T ) fit in to the picture we have developed of M(X,T ). We shall then use this to show that, in this case, ergodic measures always exist.

Recall that M(X,T ) is convex: if µ1, µ2 ∈ M(X,T ) then αµ1 + (1 − α)µ2 ∈ M(X,T ) for every 0 ≤ α ≤ 1.

A point in a convex set is called an extremal point if it cannot be written as a non-trivial convex combination of (other) elements of the set. More precisely, µ is an extremal point of M(X,T ) if whenever

µ = αµ1 + (1 − α)µ2,

with µ1, µ2 ∈ M(X,T ), 0 < α < 1 then we have µ1 = µ2 = µ.

Remarks 7.17.

(i) Let Y be the unit square

Y = {(x, y) | 0 ≤ x ≤ 1, 0 ≤ y ≤ 1} ⊂ R2.

Then the extremal points of Y are the corners (0, 0), (0, 1), (1, 0), (1, 1).

(ii) Let Y be the (closed) unit disc

Y = {(x, y) : x² + y² ≤ 1} ⊂ R2.

Then the set of extremal points of Y is precisely the unit circle {(x, y) | x² + y² = 1}.

The next result will allow us to show that ergodic measures for continuous transformations on compact metric spaces always exist.

Theorem 7.18. The following are equivalent:

(i) the T -invariant probability measure µ is ergodic;

(ii) µ is an extremal point of M(X,T ).



Proof. For the moment, we shall only prove (ii) ⇒ (i): if µ is extremal then it is ergodic. In fact, we shall prove the contrapositive. Suppose that µ is not ergodic; we show that µ is not extremal. As µ is not ergodic, there exists B ∈ B such that T^{-1}B = B and 0 < µ(B) < 1.

Define probability measures µ1 and µ2 on X by

\[ \mu_1(A) = \frac{\mu(A \cap B)}{\mu(B)}, \qquad \mu_2(A) = \frac{\mu(A \cap (X \setminus B))}{\mu(X \setminus B)}. \]

(The condition 0 < µ(B) < 1 ensures that the denominators are not equal to zero.) Clearly, µ1 ≠ µ2, since µ1(B) = 1 while µ2(B) = 0.

Since T^{-1}B = B, we also have T^{-1}(X \ B) = X \ B. Thus we have

\[
\mu_1(T^{-1}A) = \frac{\mu(T^{-1}A \cap B)}{\mu(B)}
= \frac{\mu(T^{-1}A \cap T^{-1}B)}{\mu(B)}
= \frac{\mu(T^{-1}(A \cap B))}{\mu(B)}
= \frac{\mu(A \cap B)}{\mu(B)}
= \mu_1(A)
\]

and (by the same argument)

\[ \mu_2(T^{-1}A) = \frac{\mu(T^{-1}A \cap (X \setminus B))}{\mu(X \setminus B)} = \mu_2(A), \]

i.e., µ1 and µ2 are both in M(X,T).

However, we may write µ as the non-trivial (since 0 < µ(B) < 1) convex combination

µ = µ(B)µ1 + (1 − µ(B))µ2,

so µ is not extremal.

We defer the proof of (i) ⇒ (ii) until later (as an appendix to section 9) as it requires the Radon-Nikodym Theorem, which we have yet to state.
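The construction in the proof of (ii) ⇒ (i) can be checked by hand on a finite model. The following is a minimal sketch only: the five-point space, the permutation T and the uniform measure are hypothetical choices made for illustration, not taken from the notes. It builds µ1 and µ2 from an invariant set B with 0 < µ(B) < 1 and confirms that both are invariant and that µ = µ(B)µ1 + (1 − µ(B))µ2.

```python
from fractions import Fraction

# Hypothetical finite model: X = {0,...,4}, T a permutation with two cycles.
X = range(5)
T = {0: 1, 1: 2, 2: 0, 3: 4, 4: 3}
mu = {x: Fraction(1, 5) for x in X}   # uniform measure, which is T-invariant
B = {0, 1, 2}                          # satisfies T^{-1}B = B and 0 < mu(B) < 1


def measure(m, A):
    """Measure of the set A under the point-mass dictionary m."""
    return sum(m[x] for x in A)


def preimage(A):
    """The set T^{-1}A."""
    return {x for x in X if T[x] in A}


def is_invariant(m):
    """On a finite space it suffices to test invariance on singletons."""
    return all(measure(m, preimage({y})) == m[y] for y in X)


mu_B = measure(mu, B)
# mu1 = mu conditioned on B, mu2 = mu conditioned on the complement of B.
mu1 = {x: (mu[x] / mu_B if x in B else Fraction(0)) for x in X}
mu2 = {x: (mu[x] / (1 - mu_B) if x not in B else Fraction(0)) for x in X}
combo = {x: mu_B * mu1[x] + (1 - mu_B) * mu2[x] for x in X}
```

Here `combo` reproduces `mu` exactly, so the uniform measure on this two-cycle system is a non-trivial convex combination of two distinct invariant measures, and hence is not extremal.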

Theorem 7.19. Let T : X → X be a continuous mapping of a compact metric space. Then there exists at least one ergodic measure in M(X,T).

Proof. By Theorem 7.18, it is equivalent to prove that M(X,T) has an extremal point.

Choose a countable dense subset of C(X,R), {f_i}_{i=0}^∞ say. Consider the first function f_0. Since the map

\[ M(X,T) \to \mathbb{R} : \mu \mapsto \int f_0 \, d\mu \]

is (weak∗) continuous and M(X,T) is compact, there exists (at least one) ν ∈ M(X,T) such that

\[ \int f_0 \, d\nu = \sup_{\mu \in M(X,T)} \int f_0 \, d\mu. \]

If we define

\[ M_0 = \left\{ \nu \in M(X,T) \;\middle|\; \int f_0 \, d\nu = \sup_{\mu \in M(X,T)} \int f_0 \, d\mu \right\} \]

then the above shows that M_0 is non-empty. Also, M_0 is closed and hence compact.

We now consider the next function f_1 and define

\[ M_1 = \left\{ \nu \in M_0 \;\middle|\; \int f_1 \, d\nu = \sup_{\mu \in M_0} \int f_1 \, d\mu \right\}. \]

By the same reasoning as above, M_1 is a non-empty closed subset of M_0.

Continuing inductively, we define

\[ M_j = \left\{ \nu \in M_{j-1} \;\middle|\; \int f_j \, d\nu = \sup_{\mu \in M_{j-1}} \int f_j \, d\mu \right\} \]

and hence obtain a nested sequence of sets

M(X,T) ⊃ M_0 ⊃ M_1 ⊃ · · · ⊃ M_j ⊃ · · ·

with each M_j non-empty and closed.

Now consider the intersection

\[ M_\infty = \bigcap_{j=0}^{\infty} M_j. \]

Recall that the countable intersection of a decreasing sequence of non-empty compact sets is non-empty. Hence M_∞ is non-empty and we can pick µ_∞ ∈ M_∞. We shall show that µ_∞ is extremal (and hence ergodic).

Suppose that we can write µ_∞ = αµ1 + (1 − α)µ2, µ1, µ2 ∈ M(X,T), 0 < α < 1. We have to show that µ1 = µ2. Since {f_j}_{j=0}^∞ is dense in C(X,R), it suffices to show that

\[ \int f_j \, d\mu_1 = \int f_j \, d\mu_2 \quad \forall\, j \ge 0. \]

Consider f_0. By assumption

\[ \int f_0 \, d\mu_\infty = \alpha \int f_0 \, d\mu_1 + (1-\alpha) \int f_0 \, d\mu_2. \]

In particular,

\[ \int f_0 \, d\mu_\infty \le \max\left\{ \int f_0 \, d\mu_1, \int f_0 \, d\mu_2 \right\}. \]

However µ_∞ ∈ M_0 and so

\[ \int f_0 \, d\mu_\infty = \sup_{\mu \in M(X,T)} \int f_0 \, d\mu \ge \max\left\{ \int f_0 \, d\mu_1, \int f_0 \, d\mu_2 \right\}. \]

Therefore

\[ \int f_0 \, d\mu_1 = \int f_0 \, d\mu_2 = \int f_0 \, d\mu_\infty. \]

Thus, the first identity we require is proved and µ1, µ2 ∈ M_0. This last fact allows us to employ the same argument on f_1 (with M(X,T) replaced by M_0) and conclude that

\[ \int f_1 \, d\mu_1 = \int f_1 \, d\mu_2 = \int f_1 \, d\mu_\infty \]

and µ1, µ2 ∈ M_1.

Continuing inductively, we show that for an arbitrary j ≥ 0,

\[ \int f_j \, d\mu_1 = \int f_j \, d\mu_2 \]

and µ1, µ2 ∈ M_j. This completes the proof.

Exercise 7.20. Define the North-South Map as follows. Let X be the circle of radius 1 centred at (0, 1) in R^2. Call (0, 2) the North Pole (N) and (0, 0) the South Pole (S) of X. Define a map φ : X \ {N} → R × {0} by drawing a straight line through N and x and denoting by φ(x) the unique point on the x-axis that this line crosses (this is just stereographic projection of the circle). Define T : X → X by

\[
T(x) = \begin{cases} \varphi^{-1}\left( \tfrac{1}{2}\varphi(x) \right) & \text{if } x \in X \setminus \{N\}, \\ N & \text{if } x = N. \end{cases}
\]

Hence T(N) = N, T(S) = S and if x ≠ N, S then T^n(x) → S as n → ∞.

(i) Show that δ_N and δ_S (the Dirac delta measures at N, S, respectively) are T-invariant.

(ii) Show that any invariant measure assigns zero measure to X \ {N, S}. (Hint: take x ≠ N, S and consider the interval I = [x, T(x)). Then ∪_{n∈Z} T^{-n}I is a disjoint union.)

(iii) Hence find all invariant measures for the North-South map.

(iv) Find all ergodic measures for the North-South map.
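The convergence T^n(x) → S stated in the exercise can be observed numerically. The sketch below is illustrative only: the explicit coordinate formulas for φ and φ^{-1} are derived here from the geometric description (line through N = (0, 2) meeting the x-axis) and are an assumption of this example rather than something given in the notes; the function names are hypothetical.

```python
import math

N = (0.0, 2.0)  # North Pole
S = (0.0, 0.0)  # South Pole


def phi(p):
    """Stereographic projection from N: the line through N and p = (a, b)
    meets the x-axis at 2a / (2 - b)."""
    a, b = p
    return 2.0 * a / (2.0 - b)


def phi_inv(t):
    """Inverse projection: the point of the circle lying above t on the x-axis."""
    d = t * t + 4.0
    return (4.0 * t / d, 2.0 * t * t / d)


def T(p):
    """The North-South map: halve the projected coordinate, fix N."""
    if p == N:
        return N
    return phi_inv(0.5 * phi(p))


def orbit_point(p, n):
    """Return T^n(p)."""
    for _ in range(n):
        p = T(p)
    return p
```

Starting from any point other than N, the projected coordinate is halved at each step, so `orbit_point(p, n)` approaches S geometrically fast.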


7.7 Bernoulli Shifts

Let σ : Σ_k → Σ_k be the full shift on k symbols. (The following discussion works equally well for the one-sided version Σ_k^+.) Let p = (p_1, . . . , p_k) be a probability vector and let µ_p be the Bernoulli measure determined by p, i.e., on cylinder sets

\[ \mu_p[z_0, \dots, z_{n-1}] = p_{z_0} \cdots p_{z_{n-1}}. \]

We shall show that σ is ergodic with respect to µ_p.

We shall use the following fact: given B ∈ B and ε > 0 we can find a finite disjoint collection of cylinder sets C_1, . . . , C_N such that

\[ \mu_p\Big( B \,\triangle\, \bigcup_{j=1}^{N} C_j \Big) < \varepsilon. \]

Suppose that B ∈ B satisfies σ^{-1}B = B. Choosing C_1, . . . , C_N as above and writing E = ∪_{j=1}^N C_j, we have

|µ_p(B) − µ_p(E)| < ε.

The key point is the following: if n is sufficiently large then F = σ^{-n}E and E depend on different co-ordinates. Hence, since µ_p is defined by a product,

\[
\mu_p(F \cap E) = \mu_p(F)\,\mu_p(E) = \mu_p(\sigma^{-n}E)\,\mu_p(E) = \mu_p(E)^2,
\]

since µ_p is σ-invariant.

We also have the estimate

\[
\mu_p(B \,\triangle\, F) = \mu_p(\sigma^{-n}B \,\triangle\, \sigma^{-n}E) = \mu_p(\sigma^{-n}(B \,\triangle\, E)) = \mu_p(B \,\triangle\, E) < \varepsilon.
\]

Since B △ (E ∩ F) ⊂ (B △ E) ∪ (B △ F), we therefore obtain

\[ \mu_p(B \,\triangle\, (E \cap F)) \le \mu_p(B \,\triangle\, E) + \mu_p(B \,\triangle\, F) < 2\varepsilon \]

and hence

|µ_p(B) − µ_p(E ∩ F)| < 2ε.

Hence we can estimate

\[
\begin{aligned}
|\mu_p(B) - \mu_p(B)^2| &\le |\mu_p(B) - \mu_p(E \cap F)| + |\mu_p(E \cap F) - \mu_p(B)^2| \\
&< 2\varepsilon + |\mu_p(E)^2 - \mu_p(B)^2| \\
&= 2\varepsilon + \underbrace{(\mu_p(E) + \mu_p(B))}_{\le 2}\;\underbrace{|\mu_p(E) - \mu_p(B)|}_{< \varepsilon} \\
&\le 4\varepsilon.
\end{aligned}
\]

Since ε > 0 is arbitrary, we have µ_p(B) = µ_p(B)^2. This is only possible if µ_p(B) = 0 or µ_p(B) = 1. Therefore σ is ergodic with respect to µ_p.
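The key step, that E and σ^{-n}E become independent once n is large enough for them to depend on different co-ordinates, can be verified by brute force for a single cylinder. This is a minimal sketch under stated assumptions: the probability vector `p`, the 0-based symbol labels and the helper names are hypothetical, and only the one-sided finite-word computation is carried out.

```python
from fractions import Fraction
from itertools import product

# Hypothetical probability vector on k = 2 symbols, labelled 0 and 1.
p = (Fraction(1, 3), Fraction(2, 3))


def word_measure(w):
    """Bernoulli measure of the cylinder [w_0, ..., w_{m-1}]: a product of weights."""
    out = Fraction(1)
    for s in w:
        out *= p[s]
    return out


def measure_E_cap_shifted(z, n):
    """mu_p(E ∩ sigma^{-n}E) for the cylinder E = [z_0, ..., z_{m-1}].

    Sums the measure of every word of length n + m that starts with z
    (so the point lies in E) and also carries z in positions n..n+m-1
    (so the point lies in sigma^{-n}E).
    """
    m = len(z)
    total = Fraction(0)
    for w in product(range(len(p)), repeat=n + m):
        if w[:m] == tuple(z) and w[n:n + m] == tuple(z):
            total += word_measure(w)
    return total
```

For a cylinder of length m the product formula holds exactly as soon as n ≥ m; for smaller n the two sets share co-ordinates and the identity can fail, which is why the proof insists on n being sufficiently large.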


Remark 7.21. For general subshifts of finite type σ : Σ_A → Σ_A, σ is ergodic with respect to the Markov measure µ_P if and only if the stochastic matrix P is irreducible (i.e., for each (i, j) there exists n > 0 such that P^n(i, j) > 0).

Exercise 7.22. We have seen that there are lots of (indeed, uncountably many) ergodic measures for the full one-sided two-shift. We can use this fact to show that there are uncountably many ergodic measures for the doubling map.

Let Σ_2^+ = {0, 1}^{Z^+} be the one-sided full shift on two symbols. Define π : Σ_2^+ → R/Z by

\[ \pi(x_0, x_1, \dots) = \frac{x_0}{2} + \frac{x_1}{2^2} + \cdots + \frac{x_n}{2^{n+1}} + \cdots \]

(i) Show that π is continuous.

(ii) Let T : R/Z → R/Z be the doubling map: T(x) = 2x mod 1. Show that π ◦ σ = T ◦ π. (Thus T is a factor of σ.)

(iii) If µ is a σ-invariant probability measure on Σ_2^+, show that π_∗µ (where π_∗µ(B) = µ(π^{-1}B) for a Borel subset B ⊂ R/Z) is a T-invariant probability measure on R/Z. (Lebesgue measure on R/Z corresponds to choosing µ to be the Bernoulli (1/2, 1/2)-measure on Σ_2^+.)

(iv) Show that if µ is an ergodic measure for σ, then π∗µ is an ergodic measure for T .

(v) Conclude that there are infinitely many ergodic measures for the doubling map.

7.8 Remarks on the continued fraction map

Recall that the continued fraction map T : [0, 1) → [0, 1) is defined by

\[
T(x) = \begin{cases} 0 & \text{if } x = 0, \\ \left\{ \dfrac{1}{x} \right\} = \dfrac{1}{x} \bmod 1 & \text{if } 0 < x < 1. \end{cases}
\]

This sends the point

\[ x = \cfrac{1}{x_0 + \cfrac{1}{x_1 + \cfrac{1}{x_2 + \cfrac{1}{x_3 + \cdots}}}} \]

to the point

\[ T(x) = \cfrac{1}{x_1 + \cfrac{1}{x_2 + \cfrac{1}{x_3 + \cfrac{1}{x_4 + \cdots}}}}. \]

That is, T acts by shifting the continued fraction expansion one place to the left (and forgetting the 0th term). Thus we can think of T as a full one-sided shift, albeit with an infinite number of symbols.

Using this analogy, we can then define a cylinder to be a set of the form

I(x_0, x_1, . . . , x_n) = {x ∈ (0, 1) | x has continued fraction expansion starting x_0, x_1, . . . , x_n}.
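The description of T as a shift on continued fraction digits can be checked with exact rational arithmetic. The following is a sketch only: it works with rational x, whose expansions are finite, whereas the text is concerned with typical (irrational) points; the function names are hypothetical.

```python
from fractions import Fraction


def gauss_map(x):
    """The continued fraction map: T(0) = 0 and T(x) = 1/x mod 1 otherwise."""
    if x == 0:
        return Fraction(0)
    y = 1 / x
    return y - (y.numerator // y.denominator)


def cf_digits(x):
    """Continued fraction digits of a rational x in (0, 1).

    Each digit is the integer part of 1/x; applying the map then
    discards that digit and moves on to the next one.
    """
    digits = []
    while x != 0:
        y = 1 / x
        digits.append(y.numerator // y.denominator)
        x = gauss_map(x)
    return digits
```

For example, x = 115/266 = 1/(2 + 1/(3 + 1/(5 + 1/7))) has digits 2, 3, 5, 7, and T(x) = 36/115 has digits 3, 5, 7: the expansion has been shifted one place to the left.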


Recall that T preserves Gauss’ measure, defined by

\[ \mu(B) = \frac{1}{\log 2} \int_B \frac{dx}{1+x}. \]

We claim that µ is an ergodic measure. One proof of this uses ideas similar to those in the proof that Bernoulli measures are ergodic. However, a crucial property of Bernoulli measures that was used there is the following: given two cylinders, E and F, we have µ(E ∩ σ^{-n}F) = µ(E)µ(F) provided n is sufficiently large. This equality holds because the formula for the Bernoulli measure of a cylinder is ‘locally constant’, i.e. it depends only on a finite number of co-ordinates. The formula for Gauss’ measure is not locally constant: the function 1/(1 + x) depends on all (i.e. infinitely many) co-efficients in the continued fraction expansion of x. However, with some effort, one can prove that there exist constants c, C > 0 such that

c µ(E)µ(F) ≤ µ(E ∩ T^{-n}F) ≤ C µ(E)µ(F)

for ‘cylinders’ for the continued fraction map. It turns out that this is sufficient to prove ergodicity. In summary:

Proposition 7.23. Let T denote the continued fraction map. Then T is ergodic with respect to Gauss’ measure.


8 Recurrence and Unique Ergodicity

8.1 Poincaré’s Recurrence Theorem

We now go back to the general setting of a measure-preserving transformation of a probability space (X,B, µ). The following is the most basic result about the distribution of orbits.

Theorem 8.1 (Poincaré’s Recurrence Theorem). Let T : X → X be a measure-preserving transformation of (X,B, µ) and let A ∈ B have µ(A) > 0. Then for µ-a.e. x ∈ A, the orbit {T^n x}_{n=0}^∞ returns to A infinitely often.

Proof. Let

E = {x ∈ A | T^n x ∈ A for infinitely many n},

then we have to show that µ(A \ E) = 0.

If we write

F = {x ∈ A | T^n x ∉ A ∀ n ≥ 1}

then we have the identity

\[ A \setminus E = \bigcup_{k=0}^{\infty} (T^{-k}F \cap A). \]

Thus we have the estimate

\[
\mu(A \setminus E) = \mu\Big( \bigcup_{k=0}^{\infty} (T^{-k}F \cap A) \Big) \le \mu\Big( \bigcup_{k=0}^{\infty} T^{-k}F \Big) \le \sum_{k=0}^{\infty} \mu(T^{-k}F).
\]

Since µ(T^{-k}F) = µ(F) ∀ k ≥ 0 (because the measure is preserved), it suffices to show that µ(F) = 0.

First suppose that n > m and that T^{-m}F ∩ T^{-n}F ≠ ∅. If y lies in this intersection then T^m y ∈ F and T^{n−m}(T^m y) = T^n y ∈ F ⊂ A, which contradicts the definition of F. Thus T^{-m}F and T^{-n}F are disjoint.

Since {T^{-k}F}_{k=0}^∞ is a disjoint family, we have

\[ \sum_{k=0}^{\infty} \mu(T^{-k}F) = \mu\Big( \bigcup_{k=0}^{\infty} T^{-k}F \Big) \le \mu(X) = 1. \]

Since the terms in the summation have the constant value µ(F), we must have µ(F) = 0.

Exercise 8.2. Construct an example to show that Poincaré’s recurrence theorem does not hold on infinite measure spaces. (Recall that a measure space (X,B, µ) is infinite if µ(X) = ∞.)


8.2 Unique Ergodicity

We finish this section by looking at the case where T : X → X has a unique invariant probability measure.

Definition 8.3. Let (X,B) be a measurable space and let T : X → X be a measurable transformation. If there is a unique T-invariant probability measure then we say that T is uniquely ergodic.

Remark 8.4. You might wonder why we don’t just call such T ‘uniquely invariant’. Recall from Theorem 7.18 that the extremal points of M(X,T) are precisely the ergodic measures. If M(X,T) consists of just one measure then that measure is extremal, and so must be ergodic.

Unique ergodicity (for continuous maps) implies the following strong convergence result.

Theorem 8.5. Let X be a compact metric space and let T : X → X be a continuous transformation. The following are equivalent:

(i) T is uniquely ergodic;

(ii) for each f ∈ C(X) there exists a constant c(f) such that

\[ \frac{1}{n} \sum_{j=0}^{n-1} f(T^j x) \to c(f), \]

uniformly for x ∈ X, as n → ∞.

Proof. (ii) ⇒ (i): Suppose that µ, ν are T-invariant probability measures; we shall show that µ = ν.

Integrating the expression in (ii), we obtain

\[
\int f \, d\mu = \lim_{n\to\infty} \frac{1}{n} \sum_{j=0}^{n-1} \int f \circ T^j \, d\mu
= \int \lim_{n\to\infty} \frac{1}{n} \sum_{j=0}^{n-1} f \circ T^j \, d\mu
= \int c(f) \, d\mu = c(f),
\]

and, by the same argument,

\[ \int f \, d\nu = c(f). \]

Therefore

\[ \int f \, d\mu = \int f \, d\nu \quad \forall f \in C(X) \]

and so µ = ν (by the Riesz Representation Theorem).


(i) ⇒ (ii): Let M(X,T) = {µ}. If (ii) is true, then, by the Dominated Convergence Theorem, we must necessarily have c(f) = ∫ f dµ. Suppose that (ii) is false. Then we can find f ∈ C(X) and sequences n_k ∈ N and x_k ∈ X such that

\[ \lim_{k\to\infty} \frac{1}{n_k} \sum_{j=0}^{n_k-1} f(T^j x_k) \ne \int f \, d\mu. \]

For each k ≥ 1, define a measure ν_k ∈ M(X) by

\[ \nu_k = \frac{1}{n_k} \sum_{j=0}^{n_k-1} T^j_* \delta_{x_k}, \]

so that

\[ \int f \, d\nu_k = \frac{1}{n_k} \sum_{j=0}^{n_k-1} f(T^j x_k). \]

By the proof of Theorem 13.1, ν_k has a subsequence which converges weak∗ to a measure ν ∈ M(X,T). In particular, we have

\[ \int f \, d\nu = \lim_{k\to\infty} \int f \, d\nu_k \ne \int f \, d\mu. \]

Therefore, ν ≠ µ, contradicting unique ergodicity.

8.3 Example: The Irrational Rotation

Let X = R/Z, T : X → X : x ↦ x + α mod 1, α irrational. Then T is uniquely ergodic (and µ = Lebesgue measure is the unique invariant probability measure).

Proof. Let m be an arbitrary T-invariant probability measure; we shall show that m = µ.

Write e_k(x) = e^{2πikx}. Then

\[
\int e_k(x) \, dm = \int e_k(Tx) \, dm = \int e_k(x + \alpha) \, dm = e^{2\pi i k \alpha} \int e_k(x) \, dm.
\]

Since α is irrational, if k ≠ 0 then e^{2πikα} ≠ 1 and so

\[ \int e_k \, dm = 0. \tag{4} \]

Let f ∈ C(X) have Fourier series Σ_{k=−∞}^{∞} a_k e_k, so that a_0 = ∫ f dµ. For n ≥ 1, we let σ_n denote the average of the first n partial sums. Then σ_n → f uniformly as n → ∞. Hence

\[ \lim_{n\to\infty} \int \sigma_n \, dm = \int f \, dm. \]

However, using (4), we may calculate that

\[ \int \sigma_n \, dm = a_0 = \int f \, d\mu. \]

Thus we have that ∫ f dm = ∫ f dµ for every f ∈ C(X), and so m = µ.
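Unique ergodicity of the rotation means that Birkhoff averages of continuous functions converge uniformly in the starting point. For the characters e_k this can be seen directly by summing a geometric series, and the sketch below checks the resulting bound numerically. The choice α = √2, the function names and the floating-point tolerance used when testing are assumptions of this illustration.

```python
import cmath
import math

ALPHA = math.sqrt(2)  # hypothetical irrational rotation number


def birkhoff_average(k, x, n):
    """(1/n) * sum_{j<n} e_k(T^j x), with e_k(x) = exp(2*pi*i*k*x), T x = x + ALPHA.

    The reduction mod 1 is omitted because e_k is 1-periodic.
    """
    return sum(cmath.exp(2j * math.pi * k * (x + j * ALPHA)) for j in range(n)) / n


def geometric_bound(k, n):
    """Bound 1 / (n * |sin(pi*k*ALPHA)|) obtained from the geometric series.

    It does not depend on x, which is the uniformity in Theorem 8.5(ii).
    """
    return 1.0 / (n * abs(math.sin(math.pi * k * ALPHA)))
```

For k ≠ 0 the averages therefore tend to 0 = ∫ e_k dµ at rate O(1/n), uniformly in x, in agreement with equation (4) and Theorem 8.5.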


9 Birkhoff’s Ergodic Theorem

9.1 Introduction

An ergodic theorem is a result that describes the limiting behaviour of the sequence

\[ \frac{1}{n} \sum_{j=0}^{n-1} f \circ T^j \tag{5} \]

as n → ∞. The precise formulation of an ergodic theorem depends on the class of function f (for example, one could assume that f is integrable, L^2, or continuous), and the notion of convergence that we use (for example, we could study pointwise convergence, L^2 convergence, or uniform convergence). The result that we are interested in here, Birkhoff’s Ergodic Theorem, deals with pointwise convergence of (5) for an integrable function f.

9.2 Conditional expectation

We will need the concepts of Radon-Nikodym derivatives and conditional expectation.

Definition 9.1. Let µ be a measure on (X,B). We say that a measure ν is absolutely continuous with respect to µ and write ν ≪ µ if ν(B) = 0 whenever µ(B) = 0, B ∈ B.

Remark 9.2. Thus ν is absolutely continuous with respect to µ if sets of µ-measure zero also have ν-measure zero (but there may be more sets of ν-measure zero).

For example, let f ∈ L^1(X,B, µ) be non-negative and define a measure ν by

\[ \nu(B) = \int_B f \, d\mu. \]

Then ν ≪ µ.

The following theorem says that, essentially, all absolutely continuous measures occur in this way.

Theorem 9.3 (Radon-Nikodym). Let (X,B, µ) be a probability space. Let ν be a measure defined on B and suppose that ν ≪ µ. Then there is a non-negative measurable function f such that

\[ \nu(B) = \int_B f \, d\mu, \quad \text{for all } B \in \mathcal{B}. \]

Moreover, f is unique in the sense that if g is a measurable function with the same property then f = g µ-a.e.

Exercise 9.4. If ν ≪ µ then it is customary to write dν/dµ for the function given by the Radon-Nikodym theorem, that is

\[ \nu(B) = \int_B \frac{d\nu}{d\mu} \, d\mu. \]

Prove the following relations:


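(i) If ν ≪ µ and f is a µ-integrable function then

\[ \int f \, d\nu = \int f \, \frac{d\nu}{d\mu} \, d\mu. \]

(ii) If ν1, ν2 ≪ µ then

\[ \frac{d(\nu_1 + \nu_2)}{d\mu} = \frac{d\nu_1}{d\mu} + \frac{d\nu_2}{d\mu}. \]

(iii) If λ ≪ ν ≪ µ then

\[ \frac{d\lambda}{d\mu} = \frac{d\lambda}{d\nu} \, \frac{d\nu}{d\mu}. \]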

Let 𝒜 ⊂ B be a sub-σ-algebra. Note that µ defines a measure on 𝒜 by restriction. Let f ∈ L^1(X,B, µ), with f non-negative. Then we can define a measure ν on 𝒜 by setting

\[ \nu(A) = \int_A f \, d\mu, \quad A \in \mathcal{A}. \]

Note that ν ≪ µ|_𝒜. Hence by the Radon-Nikodym theorem, there is a unique 𝒜-measurable function E(f | 𝒜) such that

\[ \nu(A) = \int_A E(f \mid \mathcal{A}) \, d\mu, \quad A \in \mathcal{A}. \]

We call E(f | 𝒜) the conditional expectation of f with respect to the σ-algebra 𝒜.

So far, we have only defined E(f | 𝒜) for non-negative f. To define E(f | 𝒜) for an arbitrary f, we split f into positive and negative parts f = f_+ − f_−, where f_+, f_− ≥ 0, and define

E(f | 𝒜) = E(f_+ | 𝒜) − E(f_− | 𝒜).

Thus we can view conditional expectation as an operator

E(· | 𝒜) : L^1(X,B, µ) → L^1(X,𝒜, µ).

Note that E(f | 𝒜) is uniquely determined (up to sets of µ-measure zero) by the two requirements that

(i) E(f | 𝒜) is 𝒜-measurable, and

(ii) ∫_A f dµ = ∫_A E(f | 𝒜) dµ for all A ∈ 𝒜.

Intuitively, one can think of E(f | 𝒜) as the best approximation to f in the smaller space of all 𝒜-measurable functions.
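When the sub-σ-algebra is generated by a finite partition, the conditional expectation is simply the average of f over each block, and the two defining requirements can be checked directly. The sketch below assumes a hypothetical six-point space with the uniform measure and an arbitrary function f; none of these specific choices come from the notes.

```python
from fractions import Fraction

# Hypothetical finite probability space with uniform measure.
X = list(range(6))
mu = {x: Fraction(1, 6) for x in X}
partition = [{0, 1}, {2, 3, 4}, {5}]        # generates the sub-sigma-algebra
f = {0: 3, 1: 5, 2: 0, 3: 6, 4: 3, 5: -2}   # an arbitrary integrable function


def integral(g, A):
    """Integral of g over the set A with respect to mu."""
    return sum(g[x] * mu[x] for x in A)


def cond_exp(g, blocks):
    """Conditional expectation of g given the partition: on each block A,
    the constant value (integral of g over A) / mu(A)."""
    out = {}
    for A in blocks:
        m_A = sum(mu[x] for x in A)
        average = integral(g, A) / m_A
        for x in A:
            out[x] = average
    return out


Ef = cond_exp(f, partition)
```

The result `Ef` is constant on each block (so it is measurable with respect to the partition) and has the same integral as f over every block, which are exactly requirements (i) and (ii).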

Exercise 9.5. (i) Prove that f ↦ E(f | 𝒜) is linear.

(ii) Suppose that g is 𝒜-measurable and |g| < ∞ µ-a.e. Show that E(f g | 𝒜) = g E(f | 𝒜).

(iii) Suppose that T is a measure-preserving transformation. Show that E(f | 𝒜) ◦ T = E(f ◦ T | T^{-1}𝒜).

(iv) Show that E(f | B) = f.

(v) Let N denote the trivial σ-algebra consisting of all sets of measure 0 and 1. Show that E(f | N) = ∫ f dµ.

To state Birkhoff’s Ergodic Theorem precisely, we will need the sub-σ-algebra I of T-invariant subsets, namely:

I = {B ∈ B | T^{-1}B = B a.e.}.

Exercise 9.6. Prove that I is a σ-algebra.

9.3 Birkhoff’s Pointwise Ergodic Theorem

Birkhoff’s Ergodic Theorem deals with the behaviour of (1/n) Σ_{j=0}^{n−1} f(T^j x) for µ-a.e. x ∈ X, and for f ∈ L^1(X,B, µ).

Theorem 9.7 (Birkhoff’s Ergodic Theorem). Let (X,B, µ) be a probability space and let T : X → X be a measure-preserving transformation. Let I denote the σ-algebra of T-invariant sets. Then for every f ∈ L^1(X,B, µ), we have

\[ \frac{1}{n} \sum_{j=0}^{n-1} f(T^j x) \to E(f \mid \mathcal{I}) \]

for µ-a.e. x ∈ X.

Corollary 9.8. Let (X,B, µ) be a probability space and let T : X → X be an ergodic measure-preserving transformation. Let f ∈ L^1(X,B, µ). Then

\[ \frac{1}{n} \sum_{j=0}^{n-1} f(T^j x) \to \int f \, d\mu, \quad \text{as } n \to \infty, \]

for µ-a.e. x ∈ X.

Proof. If T is ergodic then I is the trivial σ-algebra N consisting of sets of measure 0 and 1. If f ∈ L^1(X,B, µ) then E(f | N) = ∫ f dµ. The result follows from the general version of Birkhoff’s ergodic theorem.

9.4 The proof of Birkhoff’s Ergodic Theorem

The proof is something of a tour de force of hard analysis. It is based on the following inequality.

Theorem 9.9 (Maximal Inequality). Let (X,B, µ) be a probability space, let T : X → X be a measure-preserving transformation and let f ∈ L^1(X,B, µ). Define f_0 = 0 and, for n ≥ 1,

f_n = f + f ◦ T + · · · + f ◦ T^{n−1}.

For n ≥ 1, set

\[ F_n = \max_{0 \le j \le n} f_j \]

(so that F_n ≥ 0). Then

\[ \int_{\{x \mid F_n(x) > 0\}} f \, d\mu \ge 0. \]

Proof. Clearly F_n ∈ L^1(X,B, µ). For 0 ≤ j ≤ n, we have F_n ≥ f_j, so F_n ◦ T ≥ f_j ◦ T. Hence

F_n ◦ T + f ≥ f_j ◦ T + f = f_{j+1}

and therefore

\[ F_n \circ T(x) + f(x) \ge \max_{1 \le j \le n} f_j(x). \]

If F_n(x) > 0 then

\[ \max_{1 \le j \le n} f_j(x) = \max_{0 \le j \le n} f_j(x) = F_n(x), \]

so we obtain that

f ≥ F_n − F_n ◦ T

on the set A = {x | F_n(x) > 0}.

Hence

\[
\begin{aligned}
\int_A f \, d\mu &\ge \int_A F_n \, d\mu - \int_A F_n \circ T \, d\mu \\
&= \int_X F_n \, d\mu - \int_A F_n \circ T \, d\mu \\
&\ge \int_X F_n \, d\mu - \int_X F_n \circ T \, d\mu \\
&= 0
\end{aligned}
\]

where we have used

(i) F_n = 0 on X \ A

(ii) F_n ◦ T ≥ 0

(iii) µ is T-invariant.
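The maximal inequality can be tested exactly on a small finite system. The following is a minimal sketch assuming a hypothetical seven-point space with the uniform measure, a cyclic permutation T (which preserves that measure) and an arbitrary integer-valued f; the names are illustrative only.

```python
from fractions import Fraction

N_POINTS = 7
f = [2, -5, 1, 3, -4, -1, 2]   # hypothetical integrable function on {0,...,6}


def T(x):
    """A cyclic permutation, which preserves the uniform measure."""
    return (x + 3) % N_POINTS


def partial_sum(x, j):
    """f_j(x) = f(x) + f(Tx) + ... + f(T^{j-1}x), with f_0 = 0."""
    total, y = 0, x
    for _ in range(j):
        total += f[y]
        y = T(y)
    return total


def F(x, n):
    """F_n(x) = max over 0 <= j <= n of f_j(x); always >= 0 since f_0 = 0."""
    return max(partial_sum(x, j) for j in range(n + 1))


def maximal_integral(n):
    """Integral of f over the set {x | F_n(x) > 0} for the uniform measure."""
    return sum(Fraction(f[x], N_POINTS) for x in range(N_POINTS) if F(x, n) > 0)
```

For n = 1 the set {F_1 > 0} is just {f > 0}, so the inequality is obvious; for larger n the set also contains points where f is negative but a later partial sum is positive, and the theorem says the integral of f over it still cannot be negative.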

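Corollary 9.10. If g ∈ L^1(X,B, µ) and if

\[ B_\alpha = \left\{ x \in X \;\middle|\; \sup_{n \ge 1} \frac{1}{n} \sum_{j=0}^{n-1} g(T^j x) > \alpha \right\} \]

then for all A ∈ B with T^{-1}A = A we have that

\[ \int_{B_\alpha \cap A} g \, d\mu \ge \alpha \mu(B_\alpha \cap A). \]

Proof. Suppose first that A = X. Let f = g − α, then

\[
B_\alpha = \bigcup_{n=1}^{\infty} \left\{ x \;\middle|\; \sum_{j=0}^{n-1} g(T^j x) > n\alpha \right\}
= \bigcup_{n=1}^{\infty} \{ x \mid f_n(x) > 0 \}
= \bigcup_{n=1}^{\infty} \{ x \mid F_n(x) > 0 \}
\]

(since f_n(x) > 0 ⇒ F_n(x) > 0 and F_n(x) > 0 ⇒ f_j(x) > 0 for some 1 ≤ j ≤ n). Write C_n = {x | F_n(x) > 0} and observe that C_n ⊂ C_{n+1}. Thus χ_{C_n} converges to χ_{B_α} and so f χ_{C_n} converges to f χ_{B_α}, as n → ∞. Furthermore, |f χ_{C_n}| ≤ |f|. Hence, by the Dominated Convergence Theorem,

\[ \int_{C_n} f \, d\mu = \int_X f \chi_{C_n} \, d\mu \to \int_X f \chi_{B_\alpha} \, d\mu = \int_{B_\alpha} f \, d\mu, \quad \text{as } n \to \infty. \]

Applying the maximal inequality, we have, for all n ≥ 1,

\[ \int_{C_n} f \, d\mu \ge 0. \]

Therefore

\[ \int_{B_\alpha} f \, d\mu \ge 0, \]

i.e.,

\[ \int_{B_\alpha} g \, d\mu \ge \alpha \mu(B_\alpha). \]

For the general case, we work with the restriction of T to A, T : A → A, and apply the maximal inequality on this subset to get

\[ \int_{B_\alpha \cap A} g \, d\mu \ge \alpha \mu(B_\alpha \cap A), \]

as required.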

Proof of Birkhoff’s Ergodic Theorem. Let

\[ f^*(x) = \limsup_{n\to\infty} \frac{1}{n} \sum_{j=0}^{n-1} f(T^j x) \]

and

\[ f_*(x) = \liminf_{n\to\infty} \frac{1}{n} \sum_{j=0}^{n-1} f(T^j x). \]

Writing

\[ a_n(x) = \frac{1}{n} \sum_{j=0}^{n-1} f(T^j x), \]

observe that

\[ \frac{n+1}{n} \, a_{n+1}(x) = a_n(Tx) + \frac{1}{n} f(x). \]

Taking the lim sup and lim inf as n → ∞ gives us that f^* ◦ T = f^* and f_* ◦ T = f_*.

We have to show

(i) f^* = f_* µ-a.e.

(ii) f^* ∈ L^1(X,B, µ)

(iii) ∫ f^* dµ = ∫ f dµ.

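We prove (i). For α, β ∈ R, define

E_{α,β} = {x ∈ X | f_*(x) < β and f^*(x) > α}.

Note that

\[ \{ x \in X \mid f_*(x) < f^*(x) \} = \bigcup_{\beta < \alpha,\ \alpha, \beta \in \mathbb{Q}} E_{\alpha,\beta} \]

(a countable union). Thus, to show that f^* = f_* µ-a.e., it suffices to show that µ(E_{α,β}) = 0 whenever β < α. Since f_* ◦ T = f_* and f^* ◦ T = f^*, we see that T^{-1}E_{α,β} = E_{α,β}. If we write

\[ B_\alpha = \left\{ x \in X \;\middle|\; \sup_{n \ge 1} \frac{1}{n} \sum_{j=0}^{n-1} f(T^j x) > \alpha \right\} \]

then E_{α,β} ∩ B_α = E_{α,β}.

Applying Corollary 9.10 we have that

\[
\int_{E_{\alpha,\beta}} f \, d\mu = \int_{E_{\alpha,\beta} \cap B_\alpha} f \, d\mu \ge \alpha \mu(E_{\alpha,\beta} \cap B_\alpha) = \alpha \mu(E_{\alpha,\beta}).
\]

Replacing f, α and β by −f, −β and −α and using the fact that (−f)^* = −f_* and (−f)_* = −f^*, we also get

\[ \int_{E_{\alpha,\beta}} f \, d\mu \le \beta \mu(E_{\alpha,\beta}). \]

Therefore

α µ(E_{α,β}) ≤ β µ(E_{α,β})

and since β < α this shows that µ(E_{α,β}) = 0. Thus f^* = f_* µ-a.e. and

\[ \lim_{n\to\infty} \frac{1}{n} \sum_{j=0}^{n-1} f(T^j x) = f^*(x) \quad \mu\text{-a.e.} \]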

We prove (ii). Let

\[ g_n = \left| \frac{1}{n} \sum_{j=0}^{n-1} f \circ T^j \right|. \]

Then g_n ≥ 0 and

\[ \int g_n \, d\mu \le \int |f| \, d\mu \]

so we can apply Fatou’s Lemma to conclude that lim_{n→∞} g_n = |f^*| is integrable, i.e., that f^* ∈ L^1(X,B, µ).

We prove (iii). For n ∈ N and k ∈ Z, define

\[ D^n_k = \left\{ x \in X \;\middle|\; \frac{k}{n} \le f^*(x) < \frac{k+1}{n} \right\}. \]

For every ε > 0, we have that

\[ D^n_k \cap B_{\frac{k}{n} - \varepsilon} = D^n_k. \]

Since T^{-1}D^n_k = D^n_k, we can apply Corollary 9.10 again to obtain

\[ \int_{D^n_k} f \, d\mu \ge \left( \frac{k}{n} - \varepsilon \right) \mu(D^n_k). \]

Since ε > 0 is arbitrary, we have

\[ \int_{D^n_k} f \, d\mu \ge \frac{k}{n} \mu(D^n_k). \]

Thus

\[
\int_{D^n_k} f^* \, d\mu \le \frac{k+1}{n} \mu(D^n_k) \le \frac{1}{n} \mu(D^n_k) + \int_{D^n_k} f \, d\mu
\]

(where the first inequality follows from the definition of D^n_k). Since

\[ X = \bigcup_{k \in \mathbb{Z}} D^n_k \]

(a disjoint union), summing over k ∈ Z gives

\[
\int_X f^* \, d\mu \le \frac{1}{n} \mu(X) + \int_X f \, d\mu = \frac{1}{n} + \int_X f \, d\mu.
\]

Since this holds for all n ≥ 1, we obtain

\[ \int_X f^* \, d\mu \le \int_X f \, d\mu. \]

Applying the same argument to −f gives

\[ \int (-f)^* \, d\mu \le \int -f \, d\mu \]

so that

\[ \int f^* \, d\mu = \int f_* \, d\mu \ge \int f \, d\mu. \]

Therefore

\[ \int f^* \, d\mu = \int f \, d\mu, \]

as required.

Finally, we prove that f^* = E(f | I). First note that as f^* is T-invariant, it is measurable with respect to I. Moreover, if I is any T-invariant set then

\[ \int_I f \, d\mu = \int_I f^* \, d\mu. \]

Hence f^* = E(f | I).

9.5 Consequences of the Ergodic Theorem

Here we give some simple corollaries of Birkhoff’s Ergodic Theorem. The first result says that, for a typical orbit of an ergodic dynamical system, ‘time averages’ equal ‘space averages’.

Corollary 9.11. If T is ergodic and if B ∈ B then for µ-a.e. x ∈ X, the frequency with which the orbit of x lies in B is given by µ(B), i.e.,

\[ \lim_{n\to\infty} \frac{1}{n} \operatorname{card}\{ j \in \{0, 1, \dots, n-1\} \mid T^j x \in B \} = \mu(B) \quad \mu\text{-a.e.} \]

Proof. Apply the Birkhoff Ergodic Theorem with f = χ_B.

It is possible to characterise ergodicity in terms of the behaviour of sets, rather than points, under iteration. The next result deals with this.

Theorem 9.12. Let (X,B, µ) be a probability space and let T : X → X be a measure-preserving transformation. The following are equivalent:

(i) T is ergodic;

(ii) for all A, B ∈ B,

\[ \frac{1}{n} \sum_{j=0}^{n-1} \mu(T^{-j}A \cap B) \to \mu(A)\mu(B), \]

as n → ∞.

Proof. (i) ⇒ (ii): Suppose that T is ergodic. Since χ_A ∈ L^1(X,B, µ), Birkhoff’s Ergodic Theorem tells us that

\[ \frac{1}{n} \sum_{j=0}^{n-1} \chi_A \circ T^j \to \mu(A), \quad \text{as } n \to \infty, \]

µ-a.e. Multiplying both sides by χ_B gives

\[ \frac{1}{n} \sum_{j=0}^{n-1} \chi_A \circ T^j \, \chi_B \to \mu(A) \chi_B, \quad \text{as } n \to \infty, \]

µ-a.e. Since the left-hand side is bounded (by 1), we can apply the Dominated Convergence Theorem to see that

\[
\frac{1}{n} \sum_{j=0}^{n-1} \mu(T^{-j}A \cap B) = \frac{1}{n} \sum_{j=0}^{n-1} \int \chi_A \circ T^j \, \chi_B \, d\mu
= \int \frac{1}{n} \sum_{j=0}^{n-1} \chi_A \circ T^j \, \chi_B \, d\mu \to \mu(A)\mu(B),
\]

as n → ∞.

(ii) ⇒ (i): Now suppose that the convergence holds. Suppose that T^{-1}A = A and take B = A. Then µ(T^{-j}A ∩ B) = µ(A) so

\[ \frac{1}{n} \sum_{j=0}^{n-1} \mu(A) \to \mu(A)^2, \]

as n → ∞. This gives µ(A) = µ(A)^2. Therefore µ(A) = 0 or 1 and so T is ergodic.
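The characterisation in Theorem 9.12 can be seen at work on a finite cyclic group, where everything is computable exactly. This is a sketch under assumptions: the six-point space with uniform measure, the maps x ↦ x + step mod 6 and the function name are hypothetical. With step 1 the map is a single cycle (ergodic); with step 2 it splits into two invariant cycles (not ergodic).

```python
from fractions import Fraction

N_POINTS = 6  # hypothetical finite space Z/6 with the uniform measure


def cesaro_average(A, B, n, step):
    """(1/n) * sum_{j<n} mu(T^{-j}A ∩ B) for T(x) = x + step mod N_POINTS."""
    total = Fraction(0)
    for j in range(n):
        # T^{-j}A consists of the points whose j-th iterate lies in A.
        preimage = {x for x in range(N_POINTS) if (x + j * step) % N_POINTS in A}
        total += Fraction(len(preimage & B), N_POINTS)
    return total / n
```

For the single cycle the average over one full period equals µ(A)µ(B) exactly, since each point of B visits A exactly card(A) times per period. For step 2 and the invariant set A = B = {0, 2, 4} the average stays at µ(A) = 1/2 rather than µ(A)^2 = 1/4, matching the argument in (ii) ⇒ (i).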

9.6 Normal numbers

A number x ∈ [0, 1) is called normal (in base 2) if it has a unique binary expansion, the digit 0 occurs in its binary expansion with frequency 1/2, and the digit 1 occurs in its binary expansion with frequency 1/2. We will show that Lebesgue a.e. x ∈ [0, 1) is normal.

To see this, observe that Lebesgue almost every x ∈ [0, 1) has a unique binary expansion x = ·x_1 x_2 . . ., x_i ∈ {0, 1}. Define Tx = 2x mod 1. Then x_n = 0 if and only if T^{n−1}x ∈ [0, 1/2). Thus

\[ \frac{1}{n} \operatorname{card}\{ 1 \le i \le n \mid x_i = 0 \} = \frac{1}{n} \sum_{i=0}^{n-1} \chi_{[0,1/2)}(T^i x). \]

Since T is ergodic (with respect to Lebesgue measure), for Lebesgue almost every point x the above expression converges to ∫ χ_{[0,1/2)}(x) dx = 1/2. Similarly the frequency with which the digit 1 occurs is equal to 1/2. Hence Lebesgue almost every point in [0, 1) is normal.

Exercise 9.13. (i) Let r ≥ 2. What would it mean to say that a number x ∈ [0, 1) is normal in base r?

(ii) Prove that for each r, Lebesgue a.e. x ∈ [0, 1) is normal in base r.

(iii) Conclude that Lebesgue a.e. x ∈ [0, 1) is simultaneously normal in every base r = 2, 3, 4, . . ..

Exercise 9.14. Prove that the arithmetic mean of the digits appearing in the base 10 expansion of Lebesgue-a.e. x ∈ [0, 1) is equal to 4.5, i.e. prove that if x = Σ_{j=0}^∞ x_j/10^{j+1}, x_j ∈ {0, 1, . . . , 9} then

\[ \lim_{n\to\infty} \frac{1}{n} (x_0 + x_1 + \cdots + x_{n-1}) = 4.5 \quad \text{a.e.} \]

9.7 Continued fractions

We will show that for Lebesgue a.e. x ∈ (0, 1) the frequency with which the natural number k occurs in the continued fraction expansion of x is given by

\[ \frac{1}{\log 2} \log\left( \frac{(k+1)^2}{k(k+2)} \right). \]

Let λ denote Lebesgue measure and let µ denote Gauss’ measure. Then λ-a.e. and µ-a.e. x ∈ (0, 1) is irrational and has an infinite continued fraction expansion

\[ x = \cfrac{1}{x_1 + \cfrac{1}{x_2 + \cfrac{1}{x_3 + \cfrac{1}{x_4 + \cdots}}}}. \]

Let T denote the continued fraction map. Then x_n = [1/T^{n−1}x].

Fix k ∈ N. Then x_n = k precisely when [1/T^{n−1}x] = k, i.e.

\[ k \le \frac{1}{T^{n-1}x} < k + 1 \]

which is equivalent to requiring

\[ \frac{1}{k+1} < T^{n-1}x \le \frac{1}{k}. \]

Hence

\[
\begin{aligned}
\frac{1}{n} \operatorname{card}\{ 1 \le i \le n \mid x_i = k \} &= \frac{1}{n} \sum_{i=0}^{n-1} \chi_{(1/(k+1),\,1/k]}(T^i x) \\
&\to \int \chi_{(1/(k+1),\,1/k]} \, d\mu \quad \text{for } \mu\text{-a.e. } x \\
&= \frac{1}{\log 2} \left[ \log\left( 1 + \frac{1}{k} \right) - \log\left( 1 + \frac{1}{k+1} \right) \right] \\
&= \frac{1}{\log 2} \log \frac{(k+1)^2}{k(k+2)}.
\end{aligned}
\]

As µ and λ are equivalent, this holds for Lebesgue almost every point.
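The closed form for the digit frequencies can be cross-checked against the Gauss measure of the interval (1/(k+1), 1/k]. The sketch below only evaluates the two formulas from the text in floating point; the function names are hypothetical and no orbit of the continued fraction map is simulated.

```python
import math


def gauss_measure(a, b):
    """Gauss measure of (a, b] within [0, 1): (1/log 2) * integral of dx/(1+x)."""
    return (math.log1p(b) - math.log1p(a)) / math.log(2)


def digit_frequency(k):
    """Limiting frequency of the digit k: (1/log 2) * log((k+1)^2 / (k(k+2)))."""
    return math.log((k + 1) ** 2 / (k * (k + 2))) / math.log(2)
```

The frequencies telescope when summed: the first N of them add up to 1 − log_2((N+2)/(N+1)), which tends to 1, as it must for a probability distribution on the digits. The digit 1 alone accounts for log_2(4/3), a little over 41% of all digits.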


Exercise 9.15. (i) Deduce from Birkhoff’s Ergodic Theorem that if T is an ergodic measure-preserving transformation of a probability space (X,B, µ) and f ≥ 0 is measurable but ∫ f dµ = ∞ then

\[ \frac{1}{n} \sum_{j=0}^{n-1} f(T^j x) \to \infty \quad \mu\text{-a.e.} \]

(Hint: define f_M = min{f, M} and note that f_M ∈ L^1(X,B, µ). Apply Birkhoff’s Ergodic Theorem to each f_M.)

(ii) For x ∈ (0, 1) \ Q write its infinite continued fraction expansion as

\[ x = \cfrac{1}{x_1 + \cfrac{1}{x_2 + \cfrac{1}{x_3 + \cdots}}}. \]

Show that for Lebesgue almost every x ∈ (0, 1) we have

\[ \frac{1}{n} (x_1 + x_2 + \cdots + x_n) \to \infty \]

as n → ∞. (That is, for a typical point x, the average value of the co-efficients in its continued fraction expansion is infinite.)

9.8 Appendix

Completion of the proof of Theorem 7.18. Suppose that µ is ergodic and that µ = αµ1 + (1 − α)µ2, with µ1, µ2 ∈ M(X,T) and 0 < α < 1. We shall show that µ1 = µ (so that µ2 = µ, also), i.e., that µ is extremal.

If µ(A) = 0 then µ1(A) = 0, so µ1 ≪ µ. Therefore the Radon-Nikodym derivative dµ1/dµ ≥ 0 exists. One can easily deduce from the statement of the Radon-Nikodym Theorem that µ1 = µ if and only if dµ1/dµ = 1 µ-a.e. We shall show that this is indeed the case by showing that the sets where, respectively, dµ1/dµ < 1 and dµ1/dµ > 1 both have µ-measure zero.

Let

\[ B = \left\{ x \in X : \frac{d\mu_1}{d\mu}(x) < 1 \right\}. \]

We have that

\[
\begin{aligned}
\int_{B \cap T^{-1}B} \frac{d\mu_1}{d\mu} \, d\mu + \int_{B \setminus T^{-1}B} \frac{d\mu_1}{d\mu} \, d\mu &= \int_B \frac{d\mu_1}{d\mu} \, d\mu \\
&= \mu_1(B) = \mu_1(T^{-1}B) \\
&= \int_{T^{-1}B} \frac{d\mu_1}{d\mu} \, d\mu \\
&= \int_{B \cap T^{-1}B} \frac{d\mu_1}{d\mu} \, d\mu + \int_{T^{-1}B \setminus B} \frac{d\mu_1}{d\mu} \, d\mu.
\end{aligned}
\tag{$*$}
\]

Comparing the first and last terms, we see that

\[ \int_{B \setminus T^{-1}B} \frac{d\mu_1}{d\mu} \, d\mu = \int_{T^{-1}B \setminus B} \frac{d\mu_1}{d\mu} \, d\mu. \]

In fact, these integrals are taken over sets of the same µ-measure:

\[
\mu(T^{-1}B \setminus B) = \mu(T^{-1}B) - \mu(T^{-1}B \cap B) = \mu(B) - \mu(T^{-1}B \cap B) = \mu(B \setminus T^{-1}B).
\]

However on the LHS of (*) the integrand dµ1/dµ < 1 and on the RHS of (*) the integrand dµ1/dµ ≥ 1. Thus we conclude that µ(B \ T^{-1}B) = µ(T^{-1}B \ B) = 0, which is to say that µ(T^{-1}B △ B) = 0. Therefore (since T is ergodic) by Corollary 8.5, µ(B) = 0 or µ(B) = 1.

We can rule out the possibility that µ(B) = 1 by observing that if µ(B) = 1 then

\[ 1 = \mu_1(X) = \int_X \frac{d\mu_1}{d\mu} \, d\mu = \int_B \frac{d\mu_1}{d\mu} \, d\mu < \mu(B) = 1, \]

a contradiction. Therefore µ(B) = 0.

If we define

\[ C = \left\{ x \in X : \frac{d\mu_1}{d\mu}(x) > 1 \right\} \]

then repeating essentially the same argument gives µ(C) = 0.

Hence

\[
\mu\left\{ x \in X : \frac{d\mu_1}{d\mu}(x) = 1 \right\} = \mu(X \setminus (B \cup C)) = \mu(X) - \mu(B) - \mu(C) = 1,
\]

i.e., dµ1/dµ = 1 µ-a.e. Therefore µ1 = µ, as required.


10 Entropy

10.1 The Classification Problem

The classification problem is to decide when two measure-preserving transformations are ‘the same’. We say that two measure-preserving transformations are ‘the same’ if they are (measure theoretically) isomorphic.

Definition 10.1. We say that two measure-preserving transformations (X,B, µ, T) and (Y, C, m, S) are (measure theoretically) isomorphic if there exist M ∈ B and N ∈ C such that

(i) TM ⊂ M, SN ⊂ N,

(ii) µ(M) = 1, m(N) = 1,

and there exists a bijection φ : M → N such that

(i) φ, φ^{-1} are measurable and measure-preserving,

(ii) φ ◦ T = S ◦ φ.

Remark 10.2. We often say ‘metrically isomorphic’ in place of ‘measure-theoretically isomorphic’. Here, ‘metric’ is a contraction of ‘measure-theoretic’; it has no connection with metric spaces!

How can we decide whether two measure-preserving transformations are isomorphic? A partial answer is given by looking for isomorphism invariants. The most important and successful invariant is the (measure theoretic) entropy. This is a non-negative real number that characterizes the complexity of the measure-preserving transformation. It was introduced by Kolmogorov and Sinai in 1958/59 and immediately solved the outstanding open problem in the subject: whether, for example,

S : R/Z → R/Z : x ↦ 2x mod 1

T : R/Z → R/Z : x ↦ 3x mod 1

are isomorphic. (The invariant measure is Lebesgue in each case.) The answer is no since the systems have different entropies (log 2 and log 3, respectively).

10.2 Conditional expectation

Recall that if 𝒜 ⊂ B is a σ-algebra then we define the operator

E(· | 𝒜) : L^1(X,B, µ) → L^1(X,𝒜, µ)

such that if f ∈ L^1(X,B, µ) then

(i) E(f | 𝒜) is 𝒜-measurable, and

(ii) for all A ∈ 𝒜, ∫_A E(f | 𝒜) dµ = ∫_A f dµ.

Definition 10.3. Let 𝒜 ⊂ B be a σ-algebra. We define the conditional probability of B ∈ B given 𝒜 to be the function

µ(B | 𝒜) = E(χ_B | 𝒜).

Suppose that α is a countable partition of the set X. By this we mean that α = {A_1, A_2, . . .}, A_i ∈ B, and

(i) ∪_i A_i = X

(ii) A_i ∩ A_j = ∅ if i ≠ j

(up to sets of measure zero). (More precisely µ(∪_i A_i △ X) = 0 and µ(A_i ∩ A_j) = 0 if i ≠ j.)

The partition α generates a σ-algebra. By an abuse of notation, we denote this σ-algebra by α. The conditional expectation of an integrable function f with respect to the partition α is easily seen to be:

\[ E(f \mid \alpha) = \sum_{A \in \alpha} \chi_A(x) \, \frac{\int_A f \, d\mu}{\mu(A)}. \]

Finally, we will need the following useful result.

Theorem 10.4 (Increasing martingale theorem). Let 𝒜_1 ⊂ 𝒜_2 ⊂ · · · ⊂ 𝒜_n ⊂ · · · be an increasing sequence of σ-algebras such that 𝒜_n ↑ 𝒜 (i.e. ∪_n 𝒜_n generates 𝒜). Then

(i) E(f | 𝒜_n) → E(f | 𝒜) a.e., and

(ii) E(f | 𝒜_n) → E(f | 𝒜) in L^1, i.e.

\[ \int |E(f \mid \mathcal{A}_n) - E(f \mid \mathcal{A})| \, d\mu \to 0 \]

as n → ∞.

Proof. Omitted.

10.3 Information and Entropy

We begin with some motivation. Suppose we are trying to locate a point x in a probabilityspace (X,B, µ). To do this we use a (countable) partition α = {A1, A2, . . .}, Aj ∈ B. Bythis we mean that

(i) ∪iAi = X

(ii) Ai ∩ Aj = ∅ if i 6= j

(up to sets of measure zero). (More precisely µ(∪iAi4X) = 0 and µ(Ai4Aj) = 0 if i 6= j .)If we find that x ∈ Ai then we have received some information. We want to define a

functionI(α) : X → R+

82

10.3 Information and Entropy 10 ENTROPY

such that I(α)(x) is the amount of information we receive on learning that x ∈ Ai . We wouldlike this to depend only on the size of Ai , i.e., µ(Ai), and to be large when µ(Ai) is small andsmall when µ(Ai) is large. Thus we require I(α) to have the form

I(α)(x) =∑A∈α

χA(x)φ(µ(A)) (6)

for some function φ : [0, 1]→ R+, as yet unspecified.Let α = {A1, A2, A3, . . .}, β = {B1, B2, B3, . . .} be two partitions. Define the join α ∨ β

of α and β to be the partition

α ∨ β = {A ∩ B | A ∈ α,B ∈ β}.

We say that two partitions α, β are independent if µ(A ∩ B) = µ(A)µ(B), wheneverA ∈ α, B ∈ β.

It is then natural to require that if α and β are two independent partitions then

I(α ∨ β) = I(α) + I(β). (7)

That is, if α and β are independent then the amount of information we obtain by knowingwhich element of α∨β we are in is equal to the amount of information we obtain by knowingwhich element of α we are in together with the amount of information we obtain by knowingwhich element of β we are in.

Applying (7) to (6), we see that we have

φ(µ(A ∩ B)) = φ(µ(A)µ(B)) = φ(µ(A)) + φ(µ(B)).

If we also want φ to be continuous, then φ(t) must be (a multiple of) − log t. Throughout, we will use the convention that 0 × log 0 = 0.

Definition 10.5. Given a partition α, we define the information I(α) : X → R+ by

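\[ I(\alpha)(x) = -\sum_{A \in \alpha} \chi_A(x) \log \mu(A). \]

We define the entropy H(α) of the partition α to be the average value, i.e.,

\[ H(\alpha) = \int I(\alpha) \, d\mu = \int -\sum_{A \in \alpha} \chi_A \log \mu(A) \, d\mu = -\sum_{A \in \alpha} \mu(A) \log \mu(A). \]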
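The definition can be checked numerically. The sketch below computes H(α) from the list of masses µ(A) and confirms the additivity requirement (7) for the join of two independent partitions. The particular masses are hypothetical, chosen only for illustration.

```python
import math

def entropy(masses):
    """H(alpha) = -sum mu(A) log mu(A), with the convention 0 log 0 = 0."""
    return -sum(m * math.log(m) for m in masses if m > 0)

def join_independent(a, b):
    """Masses of the join of two independent partitions: mu(A n B) = mu(A) mu(B)."""
    return [x * y for x in a for y in b]

# Hypothetical masses of the elements of two partitions.
alpha = [0.5, 0.25, 0.25]
beta = [0.9, 0.1]

H_alpha = entropy(alpha)
H_beta = entropy(beta)
H_join = entropy(join_independent(alpha, beta))
```

Here H_alpha equals (3/2) log 2, and H_join agrees with H_alpha + H_beta up to rounding, as (7) predicts after integrating.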


10.4 Conditional Information and Entropy

Let A be a sub-σ-algebra of B. We define the conditional information of α given A to be

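\[ I(\alpha \mid \mathcal{A})(x) = -\sum_{A \in \alpha} \chi_A(x) \log \mu(A \mid \mathcal{A})(x), \]

where \(\mu(A \mid \mathcal{A}) = E(\chi_A \mid \mathcal{A})\).

Once again, the conditional entropy H(α | A) is the average

\[ H(\alpha \mid \mathcal{A}) = \int -\sum_{A \in \alpha} \chi_A \log \mu(A \mid \mathcal{A}) \, d\mu = \int -\sum_{A \in \alpha} \mu(A \mid \mathcal{A}) \log \mu(A \mid \mathcal{A}) \, d\mu \]

(by one of the properties of conditional expectation and the Monotone Convergence Theorem).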

As a special case, we have that I(α | N) = I(α) and H(α | N) = H(α), where N is the trivial σ-algebra consisting of sets of measure 0 and 1.

Exercise 10.6. Show that H(α | A) = 0 (or I(α | A) ≡ 0) if and only if α ⊂ A. (In particular, H(α | B) = 0, I(α | B) ≡ 0.)

10.5 Basic Properties

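Recall that if α is a countable partition of a measurable space (X, B) and if C ⊂ B is a sub-σ-algebra then we define the conditional information and conditional entropy of α relative to C to be

\[ I(\alpha \mid \mathcal{C}) = -\sum_{A \in \alpha} \chi_A \log \mu(A \mid \mathcal{C}) \]

and

\[ H(\alpha \mid \mathcal{C}) = -\int \sum_{A \in \alpha} \chi_A \log \mu(A \mid \mathcal{C}) \, d\mu = -\int \sum_{A \in \alpha} \mu(A \mid \mathcal{C}) \log \mu(A \mid \mathcal{C}) \, d\mu, \]

respectively.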

Exercise 10.7. Show that if γ is a countable partition of X then

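\[ \mu(A \mid \gamma) = \sum_{C \in \gamma} \chi_C \, \frac{\int_C \chi_A \, d\mu}{\mu(C)} = \sum_{C \in \gamma} \chi_C \, \frac{\mu(A \cap C)}{\mu(C)}. \]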


Lemma 10.8 (The Basic Identities). For three countable partitions α, β, γ we have that

I(α ∨ β | γ) = I(α | γ) + I(β | α ∨ γ),

H(α ∨ β | γ) = H(α | γ) +H(β | α ∨ γ).

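Proof. We only need to prove the first identity; the second follows by integration.

If x ∈ A ∩ B, A ∈ α, B ∈ β, then

\[ I(\alpha \vee \beta \mid \gamma)(x) = -\log \mu(A \cap B \mid \gamma)(x) \]

and

\[ \mu(A \cap B \mid \gamma) = \sum_{C \in \gamma} \chi_C \, \frac{\mu(A \cap B \cap C)}{\mu(C)} \]

(exercise). Thus, if x ∈ A ∩ B ∩ C, A ∈ α, B ∈ β, C ∈ γ, we have

\[ I(\alpha \vee \beta \mid \gamma)(x) = -\log\left( \frac{\mu(A \cap B \cap C)}{\mu(C)} \right). \]

On the other hand, if x ∈ A ∩ C, A ∈ α, C ∈ γ, then

\[ I(\alpha \mid \gamma)(x) = -\log\left( \frac{\mu(A \cap C)}{\mu(C)} \right) \]

and if x ∈ A ∩ B ∩ C, A ∈ α, B ∈ β, C ∈ γ, then

\[ I(\beta \mid \alpha \vee \gamma)(x) = -\log\left( \frac{\mu(B \cap A \cap C)}{\mu(A \cap C)} \right). \]

Hence, if x ∈ A ∩ B ∩ C, A ∈ α, B ∈ β, C ∈ γ, we have

\[ I(\alpha \mid \gamma)(x) + I(\beta \mid \alpha \vee \gamma)(x) = -\log\left( \frac{\mu(A \cap B \cap C)}{\mu(C)} \right) = I(\alpha \vee \beta \mid \gamma)(x). \]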
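The second Basic Identity can be sanity-checked on a finite probability space. In the sketch below, partitions are represented as labelings of points, and the eight-point space with its weights is a hypothetical example rather than anything taken from the notes.

```python
import math
from collections import defaultdict

def masses(mu, *labelings):
    """Mass of each atom of the join of the given partitions.

    A partition is a dict point -> label; atoms of the join are label tuples.
    """
    out = defaultdict(float)
    for x, weight in mu.items():
        out[tuple(lab[x] for lab in labelings)] += weight
    return out

def cond_entropy(mu, fine, coarse):
    """H(join of `fine` | join of `coarse`) for lists of partitions."""
    joint = masses(mu, *fine, *coarse)
    given = masses(mu, *coarse)
    k = len(fine)
    # -sum mu(A n C) log( mu(A n C) / mu(C) ) over atoms A n C
    return -sum(m * math.log(m / given[key[k:]])
                for key, m in joint.items() if m > 0)

# Hypothetical space: 8 points with weights 1/36, ..., 8/36.
X = range(8)
mu = {x: (x + 1) / 36 for x in X}
alpha = {x: x % 2 for x in X}
beta = {x: x // 4 for x in X}
gamma = {x: x % 3 for x in X}

lhs = cond_entropy(mu, [alpha, beta], [gamma])
rhs = cond_entropy(mu, [alpha], [gamma]) + cond_entropy(mu, [beta], [alpha, gamma])
```

The two sides agree up to floating-point rounding, and conditioning a partition on a refinement of itself gives zero, in line with Exercise 10.10.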

Definition 10.9. Let α and β be countable partitions of X. We say that β is a refinement of α and write α ≤ β if every set in α is a union of sets in β.

Exercise 10.10. Show that if α ≤ β then I(α | β) = 0. (This corresponds to an intuitive understanding of how information should behave: if α ≤ β then we receive no information from knowing which element of α a point is in, given that we know which element of β it lies in.)

Corollary 10.11. If γ ≥ β then

I(α ∨ β | γ) = I(α | γ),

H(α ∨ β | γ) = H(α | γ).

Proof. If γ ≥ β then β ≤ γ ≤ α ∨ γ and so I(β | α ∨ γ) ≡ 0, H(β | α ∨ γ) = 0. The result now follows from the Basic Identities.


10.6 Entropy of a Transformation Relative to a Partition

We are now (at last!) in a position to bring measure-preserving transformations back into the picture. We are going to define the entropy of a measure-preserving transformation T relative to a partition α (with H(α) < +∞). Later we shall remove the dependence on α to obtain the genuine entropy.

We first need the following standard analytic lemma.

Lemma 10.16. Let a_n be a sub-additive sequence of real numbers (i.e. a_{n+m} ≤ a_n + a_m). Then the sequence a_n/n converges to its infimum as n → ∞.

Proof. Omitted. (As an exercise, you might want to try to prove this.)
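The lemma can be illustrated numerically. The sequence a_n = 2n + √n used below is a hypothetical example (it is sub-additive because √(n+m) ≤ √n + √m); the sketch checks sub-additivity on a finite range and shows a_n/n decreasing towards its infimum 2.

```python
import math

def a(n):
    # Hypothetical sub-additive sequence: sqrt(n + m) <= sqrt(n) + sqrt(m).
    return 2 * n + math.sqrt(n)

# Check a(n + m) <= a(n) + a(m) on a finite range of indices.
subadditive = all(a(n + m) <= a(n) + a(m)
                  for n in range(1, 60) for m in range(1, 60))

# a(n)/n = 2 + 1/sqrt(n), which decreases to the infimum 2.
ratios = [a(n) / n for n in (1, 10, 100, 10_000, 1_000_000)]
```

This is only a finite check on one example, not a proof of the lemma.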

Exercise 10.17. Let α be a countable partition of X. Show that T^{-1}α = {T^{-1}A | A ∈ α} is a countable partition of X. Show that H(T^{-1}α) = H(α).

Let us write

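\[ H_n(\alpha) = H\left( \bigvee_{i=0}^{n-1} T^{-i}\alpha \right). \]

Using the basic identity (with γ equal to the trivial partition) we have that

\[
\begin{aligned}
H_{n+m}(\alpha) &= H\left( \bigvee_{i=0}^{n+m-1} T^{-i}\alpha \right) \\
&= H\left( \bigvee_{i=0}^{n-1} T^{-i}\alpha \right) + H\left( \bigvee_{i=n}^{n+m-1} T^{-i}\alpha \,\middle|\, \bigvee_{i=0}^{n-1} T^{-i}\alpha \right) \\
&\le H\left( \bigvee_{i=0}^{n-1} T^{-i}\alpha \right) + H\left( \bigvee_{i=n}^{n+m-1} T^{-i}\alpha \right) \\
&= H\left( \bigvee_{i=0}^{n-1} T^{-i}\alpha \right) + H\left( T^{-n} \bigvee_{i=0}^{m-1} T^{-i}\alpha \right) \\
&= H_n(\alpha) + H_m(\alpha).
\end{aligned}
\]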

We have just shown that H_n(α) is a sub-additive sequence. Therefore, by Lemma 10.16,

\[ \lim_{n \to \infty} \frac{1}{n} H_n(\alpha) \]

exists and we can make the following definition.

Definition 10.18. We define the entropy of a measure-preserving transformation T relative to a partition α (with H(α) < +∞) to be

\[ h(T, \alpha) = \lim_{n \to \infty} \frac{1}{n} H\left( \bigvee_{i=0}^{n-1} T^{-i}\alpha \right). \]


Remark 10.19. Since

\[ H_n(\alpha) \le H_{n-1}(\alpha) + H(\alpha) \le \cdots \le n H(\alpha) \]

we have

\[ 0 \le h(T, \alpha) \le H(\alpha). \]

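Remark 10.20. Here is an alternative formula for h(T, α). Let

\[ \alpha_n = \alpha \vee T^{-1}\alpha \vee \cdots \vee T^{-(n-1)}\alpha. \]

Then

\[
\begin{aligned}
H(\alpha_n) &= H(\alpha \mid T^{-1}\alpha \vee \cdots \vee T^{-(n-1)}\alpha) + H(T^{-1}\alpha \vee \cdots \vee T^{-(n-1)}\alpha) \\
&= H(\alpha \mid T^{-1}\alpha \vee \cdots \vee T^{-(n-1)}\alpha) + H(\alpha_{n-1}).
\end{aligned}
\]

Hence

\[
\frac{H(\alpha_n)}{n} = \frac{H(\alpha \mid T^{-1}\alpha \vee \cdots \vee T^{-(n-1)}\alpha)}{n} + \frac{H(\alpha \mid T^{-1}\alpha \vee \cdots \vee T^{-(n-2)}\alpha)}{n} + \cdots + \frac{H(\alpha \mid T^{-1}\alpha)}{n} + \frac{H(\alpha)}{n}.
\]

Since

\[ H(\alpha \mid T^{-1}\alpha \vee \cdots \vee T^{-(n-1)}\alpha) \le H(\alpha \mid T^{-1}\alpha \vee \cdots \vee T^{-(n-2)}\alpha) \le \cdots \le H(\alpha) \]

and

\[ H(\alpha \mid T^{-1}\alpha \vee \cdots \vee T^{-(n-1)}\alpha) \to H\left( \alpha \,\middle|\, \bigvee_{i=1}^{\infty} T^{-i}\alpha \right) \]

(by the Increasing Martingale Theorem), and since the Cesàro averages of a convergent sequence converge to the same limit, we have

\[ h(T, \alpha) = \lim_{n \to \infty} \frac{1}{n} H(\alpha_n) = H\left( \alpha \,\middle|\, \bigvee_{i=1}^{\infty} T^{-i}\alpha \right). \]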
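The two facts used in this remark (the increments H(α_n) − H(α_{n−1}) are non-increasing, and H(α_n)/n is their average) can be observed numerically. The sketch below is a hypothetical example: a stationary three-state Markov chain whose states 1 and 2 are lumped into one element of α, with H(α_n) computed by brute-force enumeration of paths.

```python
import itertools
import math

# Hypothetical doubly stochastic transition matrix, so the uniform vector is stationary.
P = [[0.5, 0.5, 0.0],
     [0.0, 0.5, 0.5],
     [0.5, 0.0, 0.5]]
p = [1 / 3, 1 / 3, 1 / 3]
label = [0, 1, 1]  # alpha lumps states 1 and 2 into a single element

def H(n):
    """Entropy of alpha_n = alpha v T^-1 alpha v ... v T^-(n-1) alpha."""
    mass = {}
    for path in itertools.product(range(3), repeat=n):
        m = p[path[0]]
        for s, t in zip(path, path[1:]):
            m *= P[s][t]
        key = tuple(label[s] for s in path)  # element of alpha_n containing the path
        mass[key] = mass.get(key, 0.0) + m
    return -sum(m * math.log(m) for m in mass.values() if m > 0)

# increments[n-1] = H(alpha_n) - H(alpha_{n-1}) = H(alpha | T^-1 alpha v ... v T^-(n-1) alpha)
increments = [H(1)] + [H(n) - H(n - 1) for n in range(2, 9)]
```

The increments decrease with n, and H(α_n)/n lies above the latest increment, consistent with both quantities converging to the same limit h(T, α).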

10.7 The entropy of a measure-preserving transformation

Finally, we can define the entropy of T with respect to the measure µ.

Definition 10.21. Let T be a measure-preserving transformation of the probability space(X,B, µ). Then the entropy of T with respect to µ is defined to be

h(T ) = sup{h(T,α) | α is a countable partition such that H(α) <∞}.


10.8 Entropy as an isomorphism invariant

Recall the definition of what it means to say that two measure-preserving transformations are metrically isomorphic.

Definition 10.22. We say that two measure-preserving transformations (X, B, µ, T) and (Y, C, m, S) are (measure theoretically) isomorphic if there exist M ∈ B and N ∈ C such that

(i) TM ⊂ M, SN ⊂ N,

(ii) µ(M) = 1, m(N) = 1,

and there exists a bijection φ : M → N such that

(i) φ, φ−1 are measurable and measure-preserving (i.e. µ(φ−1A) = m(A) for all A ∈ C),

(ii) φ ◦ T = S ◦ φ.

We prove that two metrically isomorphic measure-preserving transformations have the same entropy.

Theorem 10.23. Let T : X → X be a measure-preserving transformation of (X, B, µ) and let S : Y → Y be a measure-preserving transformation of (Y, C, m). If T and S are isomorphic then h(T) = h(S).

Proof. Let M ⊂ X, N ⊂ Y and φ : M → N be as above. If α is a partition of Y then (changing it on a set of measure zero if necessary) it is also a partition of N. The inverse image φ^{-1}α = {φ^{-1}A | A ∈ α} is a partition of M and hence of X. Furthermore,

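\[
\begin{aligned}
H_\mu(\phi^{-1}\alpha) &= -\sum_{A \in \alpha} \mu(\phi^{-1}A) \log \mu(\phi^{-1}A) \\
&= -\sum_{A \in \alpha} m(A) \log m(A) \\
&= H_m(\alpha).
\end{aligned}
\]

More generally,

\[
\begin{aligned}
H_\mu\left( \bigvee_{j=0}^{n-1} T^{-j}(\phi^{-1}\alpha) \right) &= H_\mu\left( \phi^{-1}\left( \bigvee_{j=0}^{n-1} S^{-j}\alpha \right) \right) \\
&= H_m\left( \bigvee_{j=0}^{n-1} S^{-j}\alpha \right).
\end{aligned}
\]

Therefore, dividing by n and letting n → ∞, we have

\[ h(S, \alpha) = h(T, \phi^{-1}\alpha). \]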


Thus

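\[
\begin{aligned}
h(S) &= \sup\{ h(S, \alpha) \mid \alpha \text{ partition of } Y, \ H_m(\alpha) < \infty \} \\
&= \sup\{ h(T, \phi^{-1}\alpha) \mid \alpha \text{ partition of } Y, \ H_m(\alpha) < \infty \} \\
&\le \sup\{ h(T, \beta) \mid \beta \text{ partition of } X, \ H_\mu(\beta) < \infty \} \\
&= h(T).
\end{aligned}
\]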

By symmetry, we also have h(T ) ≤ h(S). Therefore h(T ) = h(S).

Note that the converse to Theorem 10.23 is false in general: if two measure-preserving transformations have the same entropy then they are not necessarily metrically isomorphic.

10.9 Calculating entropy

At first sight, the entropy of a measure-preserving transformation seems hard to calculate as it involves taking a supremum over all possible (finite entropy) partitions. However, some short cuts are possible.

10.10 Generators and Sinai’s theorem

A major complication in the definition of entropy is the need to take the supremum over all finite entropy partitions. Sinai’s theorem guarantees that h(T) = h(T, α) for a partition α whose refinements generate the full σ-algebra.

We begin by proving the following result.

Theorem 10.24 (Abramov’s theorem). Suppose that α1 ≤ α2 ≤ · · · ↑ B are countable partitions such that H(αn) < ∞ for all n ≥ 1. Then

\[ h(T) = \lim_{n \to \infty} h(T, \alpha_n). \]

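Proof. Choose any countable partition β such that H(β) < ∞. Fix n > 0. Then

\[
\begin{aligned}
H\left( \bigvee_{j=0}^{k-1} T^{-j}\beta \right) &\le H\left( \bigvee_{j=0}^{k-1} T^{-j}\beta \vee \bigvee_{j=0}^{k-1} T^{-j}\alpha_n \right) \\
&\le H\left( \bigvee_{j=0}^{k-1} T^{-j}\alpha_n \right) + H\left( \bigvee_{j=0}^{k-1} T^{-j}\beta \,\middle|\, \bigvee_{j=0}^{k-1} T^{-j}\alpha_n \right),
\end{aligned}
\]

by the basic identity.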


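Observe that

\[
\begin{aligned}
H\left( \bigvee_{j=0}^{k-1} T^{-j}\beta \,\middle|\, \bigvee_{j=0}^{k-1} T^{-j}\alpha_n \right)
&= H\left( \beta \,\middle|\, \bigvee_{j=0}^{k-1} T^{-j}\alpha_n \right) + H\left( \bigvee_{j=1}^{k-1} T^{-j}\beta \,\middle|\, \beta \vee \bigvee_{j=0}^{k-1} T^{-j}\alpha_n \right) \\
&\le H(\beta \mid \alpha_n) + H\left( \bigvee_{j=1}^{k-1} T^{-j}\beta \,\middle|\, \bigvee_{j=1}^{k-1} T^{-j}\alpha_n \right) \\
&= H(\beta \mid \alpha_n) + H\left( \bigvee_{j=0}^{k-2} T^{-j}\beta \,\middle|\, \bigvee_{j=0}^{k-2} T^{-j}\alpha_n \right).
\end{aligned}
\]

Continuing this inductively we see that

\[ H\left( \bigvee_{j=0}^{k-1} T^{-j}\beta \,\middle|\, \bigvee_{j=0}^{k-1} T^{-j}\alpha_n \right) \le k H(\beta \mid \alpha_n). \]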

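Hence

\[
\begin{aligned}
h(T, \beta) &= \lim_{k \to \infty} \frac{1}{k} H\left( \bigvee_{j=0}^{k-1} T^{-j}\beta \right) \\
&\le \lim_{k \to \infty} \frac{1}{k} H\left( \bigvee_{j=0}^{k-1} T^{-j}\alpha_n \right) + H(\beta \mid \alpha_n) \\
&= h(T, \alpha_n) + H(\beta \mid \alpha_n).
\end{aligned}
\]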

We now prove that H(β | αn) → 0 as n → ∞. To do this, it is sufficient to prove that I(β | αn) → 0 in L1 as n → ∞. Recall that

\[ I(\beta \mid \alpha_n)(x) = -\sum_{B \in \beta} \chi_B(x) \log \mu(B \mid \alpha_n)(x) = -\log \mu(B \mid \alpha_n)(x) \]

if x ∈ B, B ∈ β. By the Increasing Martingale Theorem, we know that

\[ \mu(B \mid \alpha_n)(x) \to \chi_B(x) \quad \text{a.e.} \]

Hence for x ∈ B

\[ I(\beta \mid \alpha_n)(x) \to -\log \chi_B(x) = 0. \]

Hence for any countable partition β with H(β) < ∞ we have that h(T, β) ≤ lim_{n→∞} h(T, αn). The result follows by taking the supremum over all such β.

Definition 10.25. We say that a countable partition α is a generator if T is invertible and

\[ \bigvee_{j=-(n-1)}^{n-1} T^{-j}\alpha \to \mathcal{B} \]

as n → ∞. We say that a countable partition α is a strong generator if

\[ \bigvee_{j=0}^{n-1} T^{-j}\alpha \to \mathcal{B} \]

as n → ∞.

Remark 10.26. To check whether a partition α is a generator (respectively, a strong generator) it is sufficient to check that it separates almost every pair of points. That is, for almost every x, y ∈ X, there exists n such that x, y are in different elements of the partition \(\bigvee_{j=-(n-1)}^{n-1} T^{-j}\alpha\) (\(\bigvee_{j=0}^{n-1} T^{-j}\alpha\), respectively).

The following important theorem will be the main tool in calculating entropy.

Theorem 10.27 (Sinai’s theorem). Suppose α is a strong generator or that T is invertible and α is a generator. If H(α) < ∞ then

h(T ) = h(T,α).

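Proof. The proofs of the two cases are similar; we prove the case when T is invertible and α is a generator of finite entropy.

Let n ≥ 1. Then

\[
\begin{aligned}
h\left( T, \bigvee_{j=-n}^{n} T^{-j}\alpha \right)
&= \lim_{k \to \infty} \frac{1}{k} H\left( T^{n}\alpha \vee \cdots \vee T^{-n}\alpha \vee T^{-(n+1)}\alpha \vee \cdots \vee T^{-(n+k-1)}\alpha \right) \\
&= \lim_{k \to \infty} \frac{1}{k} H\left( \alpha \vee \cdots \vee T^{-(2n+k-1)}\alpha \right) \\
&= h(T, \alpha)
\end{aligned}
\]

for each n. As α is a generator, we have that

\[ \bigvee_{j=-n}^{n} T^{-j}\alpha \to \mathcal{B}. \]

By Abramov’s theorem (applied to this increasing sequence of partitions), h(T, α) = h(T).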

10.11 Entropy of a power

Observe that if T preserves the measure µ then so does T^k. The following result relates the entropy of T and T^k.

Theorem 10.28. (i) For k ≥ 0 we have that h(T^k) = k h(T).

(ii) If T is invertible then h(T) = h(T^{-1}).


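Proof. We prove (i), leaving the case k = 0 as an exercise. Choose a countable partition α with H(α) < ∞. Then

\[
\begin{aligned}
h\left( T^k, \bigvee_{j=0}^{k-1} T^{-j}\alpha \right) &= \lim_{n \to \infty} \frac{1}{n} H\left( \bigvee_{j=0}^{nk-1} T^{-j}\alpha \right) \\
&= k \lim_{n \to \infty} \frac{1}{nk} H\left( \bigvee_{j=0}^{nk-1} T^{-j}\alpha \right) = k h(T, \alpha).
\end{aligned}
\]

Thus,

\[
\begin{aligned}
k h(T) &= \sup_{H(\alpha) < \infty} k h(T, \alpha) \\
&= \sup_{H(\alpha) < \infty} h\left( T^k, \bigvee_{j=0}^{k-1} T^{-j}\alpha \right) \le \sup_{H(\alpha) < \infty} h(T^k, \alpha) = h(T^k).
\end{aligned}
\]

On the other hand,

\[
\begin{aligned}
h(T^k, \alpha) &= \lim_{n \to \infty} \frac{1}{n} H\left( \bigvee_{j=0}^{n-1} T^{-jk}\alpha \right) \\
&\le \lim_{n \to \infty} \frac{1}{n} H\left( \bigvee_{j=0}^{nk-1} T^{-j}\alpha \right) \quad \text{by Corollary 25.3} \\
&= k \lim_{n \to \infty} \frac{1}{nk} H\left( \bigvee_{j=0}^{nk-1} T^{-j}\alpha \right) = k h(T, \alpha),
\end{aligned}
\]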

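and so h(T^k) ≤ k h(T), completing the proof.

We prove (ii). We have

\[
\begin{aligned}
H\left( \bigvee_{j=0}^{n-1} T^{-j}\alpha \right) &= H\left( T^{n-1} \bigvee_{j=0}^{n-1} T^{-j}\alpha \right) \\
&= H\left( \bigvee_{j=0}^{n-1} T^{j}\alpha \right).
\end{aligned}
\]

Therefore

\[
\begin{aligned}
h(T, \alpha) &= \lim_{n \to \infty} \frac{1}{n} H\left( \bigvee_{j=0}^{n-1} T^{-j}\alpha \right) \\
&= \lim_{n \to \infty} \frac{1}{n} H\left( \bigvee_{j=0}^{n-1} T^{j}\alpha \right) = h(T^{-1}, \alpha).
\end{aligned}
\]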

Taking the supremum over α gives h(T ) = h(T−1).

Exercise 10.29. Prove that the entropy of the identity map is zero.


10.12 Calculating entropy using generators

In this subsection, we show how generators and Sinai’s theorem can be used to calculate the entropy for some of our examples.

10.13 Subshifts of finite type

Let A be an irreducible k × k matrix with entries from {0, 1}. Recall that we define the shifts of finite type to be the spaces

\[
\begin{aligned}
\Sigma_A &= \{ (x_n)_{n=-\infty}^{\infty} \in \{1, \ldots, k\}^{\mathbb{Z}} \mid A(x_n, x_{n+1}) = 1 \text{ for all } n \in \mathbb{Z} \}, \\
\Sigma_A^+ &= \{ (x_n)_{n=0}^{\infty} \in \{1, \ldots, k\}^{\mathbb{N}} \mid A(x_n, x_{n+1}) = 1 \text{ for all } n \in \mathbb{N} \},
\end{aligned}
\]

and the shift maps \(\sigma : \Sigma_A \to \Sigma_A\), \(\sigma : \Sigma_A^+ \to \Sigma_A^+\) by \((\sigma x)_n = x_{n+1}\).

Let P be a stochastic matrix and let p be a normalised left eigenvector so that pP = p. Suppose that P is compatible with A, so that \(P_{i,j} > 0\) if and only if A(i, j) = 1. Recall that we define the Markov measure \(\mu_P\) by defining it on cylinder sets by

\[ \mu_P[z_0, z_1, \ldots, z_n] = p_{z_0} P_{z_0 z_1} \cdots P_{z_{n-1} z_n}, \]

and then extending it to the full σ-algebra by using the Kolmogorov Extension Theorem.

We shall calculate \(h_{\mu_P}(\sigma)\) for the one-sided shift, which for notational brevity we denote by \(\sigma : \Sigma_A \to \Sigma_A\); the calculation for the two-sided shift is similar.

Let α be the partition {[1], . . . , [k]} of \(\Sigma_A\) into cylinders of length 1. Then

\[ H(\alpha) = -\sum_{i=1}^{k} \mu_P[i] \log \mu_P[i] = -\sum_{i=1}^{k} p_i \log p_i < \infty. \]

The partition \(\alpha_n = \bigvee_{i=0}^{n} \sigma^{-i}\alpha\) consists of all allowed cylinders of length n + 1:

\[ \bigvee_{i=0}^{n} \sigma^{-i}\alpha = \{ [z_0, z_1, \ldots, z_n] \mid A(z_i, z_{i+1}) = 1, \ i = 0, \ldots, n-1 \}. \]


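Hence α is a strong generator. Moreover, we have

\[
\begin{aligned}
H\left( \bigvee_{i=0}^{n} \sigma^{-i}\alpha \right)
&= -\sum_{[z_0, z_1, \ldots, z_n] \in \alpha_n} \mu_P[z_0, z_1, \ldots, z_n] \log \mu_P[z_0, z_1, \ldots, z_n] \\
&= -\sum_{[z_0, z_1, \ldots, z_n] \in \alpha_n} p_{z_0} P_{z_0 z_1} \cdots P_{z_{n-1} z_n} \log\left( p_{z_0} P_{z_0 z_1} \cdots P_{z_{n-1} z_n} \right) \\
&= -\sum_{i_0=1}^{k} \cdots \sum_{i_n=1}^{k} p_{i_0} P_{i_0 i_1} \cdots P_{i_{n-1} i_n} \log\left( p_{i_0} P_{i_0 i_1} \cdots P_{i_{n-1} i_n} \right) \\
&= -\sum_{i_0=1}^{k} \cdots \sum_{i_n=1}^{k} p_{i_0} P_{i_0 i_1} \cdots P_{i_{n-1} i_n} \left( \log p_{i_0} + \log P_{i_0 i_1} + \cdots + \log P_{i_{n-1} i_n} \right) \\
&= -\sum_{i_0=1}^{k} p_{i_0} \log p_{i_0} - n \sum_{i,j=1}^{k} p_i P_{ij} \log P_{ij},
\end{aligned}
\]

where we have used the identities \(\sum_{j=1}^{k} P_{ij} = 1\) and \(\sum_{i=1}^{k} p_i P_{ij} = p_j\).

Therefore

\[
\begin{aligned}
h_{\mu_P}(\sigma) &= h_{\mu_P}(\sigma, \alpha) \\
&= \lim_{n \to \infty} \frac{1}{n+1} H\left( \bigvee_{i=0}^{n} \sigma^{-i}\alpha \right) \\
&= -\sum_{i,j=1}^{k} p_i P_{ij} \log P_{ij}.
\end{aligned}
\]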
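The calculation above can be reproduced numerically. The sketch below uses a hypothetical two-state stochastic matrix (the transition 2 → 2 is forbidden), finds the stationary vector p by power iteration, evaluates the entropy formula, and compares the closed form for the entropy of the partition into cylinders of length n + 1 with a brute-force enumeration. The function names and the choice of matrix are assumptions made for illustration.

```python
import itertools
import math

def stationary(P, iterations=200):
    """Left eigenvector p with pP = p, by power iteration.

    Assumes P is irreducible and aperiodic so that the iteration converges.
    """
    k = len(P)
    p = [1.0 / k] * k
    for _ in range(iterations):
        p = [sum(p[i] * P[i][j] for i in range(k)) for j in range(k)]
    return p

def markov_entropy(p, P):
    """-sum_{i,j} p_i P_ij log P_ij, with the convention 0 log 0 = 0."""
    k = len(P)
    return -sum(p[i] * P[i][j] * math.log(P[i][j])
                for i in range(k) for j in range(k) if P[i][j] > 0)

def cylinder_entropy(p, P, n):
    """Entropy of the partition into cylinders [z_0, ..., z_n], by enumeration."""
    total = 0.0
    for word in itertools.product(range(len(P)), repeat=n + 1):
        m = p[word[0]]
        for a, b in zip(word, word[1:]):
            m *= P[a][b]
        if m > 0:  # forbidden words have measure zero
            total -= m * math.log(m)
    return total

# Hypothetical example: the second state must be followed by the first.
P = [[0.5, 0.5],
     [1.0, 0.0]]
p = stationary(P)          # approximately (2/3, 1/3)
h = markov_entropy(p, P)   # approximately (2/3) log 2
H_p = -sum(q * math.log(q) for q in p)
```

For this matrix the enumeration agrees with H(p) + n h for each small n, matching the last line of the displayed computation.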

Exercise 10.30. Carry out the above calculation for a full shift on k symbols with a Bernoulli measure determined by the probability vector p = (p1, . . . , pk) to show that in this case the entropy is \(-\sum_{i=1}^{k} p_i \log p_i\).

Recall from Theorem 10.23 that if two measure-preserving transformations are metrically isomorphic then they have the same entropy but that the converse is not necessarily true. However, for Markov measures on two-sided shifts of finite type entropy is a complete invariant:

Theorem 10.31 (Ornstein’s theorem). Any two 2-sided Bernoulli shifts with the same entropy are metrically isomorphic.

Theorem 10.32 (Ornstein and Friedman). Any two 2-sided aperiodic Markov shifts with the same entropy are metrically isomorphic.

Remark 10.33. Both of these theorems are false for 1-sided shifts. The isomorphism problem for 1-sided shifts is a very subtle problem.

10.14 The continued fraction map


Recall that the continued fraction map is defined by T(x) = 1/x mod 1 and preserves Gauss’ measure µ defined by

\[ \mu(B) = \frac{1}{\log 2} \int_B \frac{1}{1+x} \, dx. \]

Let An = (1/(n + 1), 1/n) and let α be the partition α = {An | n = 1, 2, 3, . . .}.

Exercise 10.34. Check that H(α) < ∞. (Hint: use the fact that Gauss’ measure µ and Lebesgue measure λ are comparable, i.e. there exist constants c, C > 0 such that c ≤ µ(B)/λ(B) ≤ C for all B ∈ B.)

We claim that α is a strong generator for T. To see this, recall that each irrational x has a distinct continued fraction expansion. Hence α separates irrational, hence almost all, points.

For notational convenience let

\[
\begin{aligned}
[x_0, \ldots, x_{n-1}] &= A_{x_0} \cap T^{-1}A_{x_1} \cap \cdots \cap T^{-(n-1)}A_{x_{n-1}} \\
&= \{ x \in [0, 1] \mid T^j(x) \in A_{x_j} \text{ for } j = 0, \ldots, n-1 \}
\end{aligned}
\]

so that [x0, . . . , xn−1] is the set of all x ∈ [0, 1] whose continued fraction expansion starts x0, . . . , xn−1.

If x ∈ [x0, . . . , xn] then

\[ I(\alpha \mid T^{-1}\alpha \vee \cdots \vee T^{-n}\alpha)(x) = -\log \frac{\mu([x_0, \ldots, x_n])}{\mu([x_1, \ldots, x_n])}. \]

We will use the following fact: if In(x) is a nested sequence of intervals such that In(x) ↓ {x} as n → ∞ then

\[ \lim_{n \to \infty} \frac{1}{\lambda(I_n(x))} \int_{I_n(x)} f(y) \, dy = f(x) \]

where λ denotes Lebesgue measure. We will also need the fact that

\[ \lim_{n \to \infty} \frac{\lambda([x_0, \ldots, x_n])}{\lambda([x_1, \ldots, x_n])} = \frac{1}{|T'(x)|}. \]

Hence

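\[
\begin{aligned}
\frac{\mu([x_0, \ldots, x_n])}{\mu([x_1, \ldots, x_n])}
&= \frac{\int_{[x_0, \ldots, x_n]} \frac{dx}{1+x}}{\int_{[x_1, \ldots, x_n]} \frac{dx}{1+x}} \\
&= \left( \frac{\int_{[x_0, \ldots, x_n]} \frac{dx}{1+x}}{\lambda([x_0, \ldots, x_n])} \middle/ \frac{\int_{[x_1, \ldots, x_n]} \frac{dx}{1+x}}{\lambda([x_1, \ldots, x_n])} \right) \times \frac{\lambda([x_0, \ldots, x_n])}{\lambda([x_1, \ldots, x_n])} \\
&\to \left( \frac{1}{1+x} \middle/ \frac{1}{1+Tx} \right) \frac{1}{|T'(x)|}.
\end{aligned}
\]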



Hence

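\[ I\left( \alpha \,\middle|\, \bigvee_{j=1}^{\infty} T^{-j}\alpha \right) = -\log\left( \frac{1+Tx}{1+x} \, \frac{1}{|T'(x)|} \right). \]

Using the fact that µ is T-invariant we see that

\[
\begin{aligned}
H\left( \alpha \,\middle|\, \bigvee_{j=1}^{\infty} T^{-j}\alpha \right)
&= \int I\left( \alpha \,\middle|\, \bigvee_{j=1}^{\infty} T^{-j}\alpha \right) d\mu \\
&= \int -\log \frac{1}{|T'(x)|} \, d\mu \\
&= \int \log |T'(x)| \, d\mu.
\end{aligned}
\]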

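Now T(x) = 1/x mod 1 so that T′(x) = −1/x² and log |T′(x)| = −2 log x. Hence

\[ h(T) = H\left( \alpha \,\middle|\, \bigvee_{j=1}^{\infty} T^{-j}\alpha \right) = -\frac{2}{\log 2} \int_0^1 \frac{\log x}{1+x} \, dx. \]

Since \(\int_0^1 \frac{\log x}{1+x} \, dx = -\pi^2/12\), this gives h(T) = π²/(6 log 2).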
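As a numerical check on this integral formula, the sketch below estimates −(2/log 2) ∫₀¹ log x/(1 + x) dx with a midpoint rule and compares it with π²/(6 log 2), the standard closed form for the entropy of the continued fraction map (which follows from ∫₀¹ log x/(1 + x) dx = −π²/12). The step count and the function name are arbitrary choices for illustration.

```python
import math

def gauss_entropy(num_steps=200_000):
    """Midpoint-rule estimate of -(2 / log 2) * integral_0^1 log(x) / (1 + x) dx.

    The integrand has an integrable logarithmic singularity at 0; the
    midpoint rule never evaluates it at the endpoint itself.
    """
    step = 1.0 / num_steps
    integral = step * sum(
        math.log((i + 0.5) * step) / (1.0 + (i + 0.5) * step)
        for i in range(num_steps)
    )
    return -2.0 / math.log(2) * integral

estimate = gauss_entropy()
closed_form = math.pi ** 2 / (6 * math.log(2))
```

Both values are close to 2.373, so the numerical estimate is consistent with the closed form.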

Exercise 10.35. Define T : [0, 1] → [0, 1] to be the doubling map T(x) = 2x mod 1. Let µ denote Lebesgue measure. We know that µ is a T-invariant probability measure. Prove that h(T) = log 2.

Exercise 10.36. Define T : [0, 1] → [0, 1] by T(x) = 4x(1 − x). Define the measure µ by

\[ \mu(B) = \frac{1}{\pi} \int_B \frac{1}{\sqrt{x(1-x)}} \, dx. \]

We have seen in a previous exercise that µ is an invariant probability measure. Show that h(T) = log 2.

(Hint: you may use the fact that the partition α = {[0, 1/2], [1/2, 1]} is a strong generator.)

Exercise 10.37. Let β > 1 be the golden mean, so that β² = β + 1. Define T(x) = βx mod 1. Define the density

\[
k(x) =
\begin{cases}
\dfrac{1}{\frac{1}{\beta} + \frac{1}{\beta^3}} & \text{on } [0, 1/\beta), \\[2ex]
\dfrac{1}{\beta\left( \frac{1}{\beta} + \frac{1}{\beta^3} \right)} & \text{on } [1/\beta, 1),
\end{cases}
\]

and define the measure

\[ \mu(B) = \int_B k(x) \, dx. \]

In a previous exercise, we saw that µ is T-invariant. Assuming that α = {[0, 1/β), [1/β, 1]} is a strong generator, show that h(T) = log β.



Exercise 10.38. Let T(x) = 1/x mod 1 and let

\[ \mu(B) = \frac{1}{\log 2} \int_B \frac{1}{1+x} \, dx \]

be Gauss’ measure.

Let Ak = (1/(k + 1), 1/k]. Explain why \(\alpha = \{A_k\}_{k=1}^{\infty}\) is a strong generator for T.

Show that the entropy of T with respect to µ can be written as

\[ h(T) = -\frac{1}{\log 2} \int_0^1 \frac{\log x^2}{1+x} \, dx. \]
