
Chapter 2  Generating Random Objects

Zheng Levin

January 30, 2019

1 Uniform Random Variables

It is basic in simulation to generate realizations u_i of i.i.d. uniform random variables U_i on the interval (0, 1). The generated values u_i are called random numbers. There are two ways to generate random numbers.

1. Using physical devices, such as a radioactively decaying element.

2. Deterministic recursive algorithms that generate pseudorandom numbers.

The first method is not our concern, but we should know that it is usually very slow compared to algorithmic generators.

1.1 Deterministic Recursive Algorithms

Most algorithmic random number generators are designed in the following manner, described by a quadruple (E, µ, f, g). The generator has some internal state s_t at time t. The state space E is finite (but huge), and the state of the generator evolves according to the recursive formula s_{t+1} = f(s_t). At each state, a random number is generated using the mapping g : E → (0, 1) as u_t = g(s_t). The initial state s_1 is specified by hand and is known as the seed.

Since E is finite, there is always a finite period d such that s_t = s_{t+d}, and thus u_t = u_{t+d}. From time t + d onward the generated sequence just repeats itself in blocks of length d. Thus the values generated from this scheme are not truly random, although they may be approximately random (for example when the period d is extremely long).

Example 1.1. Linear Congruential Generator. This is a simple but ancient generator. We first fix a big integer M and some clever choices of integers a, b, and then set

s_{t+1} = (a s_t + b) mod M,    u_t = s_t / M

The difficult part of designing a good generator with this scheme of course lies in choosing appropriate M, a, b so that the generated sequence has a long period.
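As a rough illustration only (the constants below are a common 32-bit textbook choice, not parameters recommended by these notes), a minimal sketch of such a generator in Python might look as follows.

def lcg(seed, n, a=1664525, b=1013904223, M=2**32):
    """Minimal linear congruential generator sketch.

    Yields n pseudorandom numbers u_t = s_t / M in (0, 1).
    The constants a, b, M are illustrative only.
    """
    s = seed
    out = []
    for _ in range(n):
        s = (a * s + b) % M
        out.append(s / M)
    return out

# Example usage: first five numbers from seed 12345.
print(lcg(12345, 5))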


We can further generalize the evolution equation of the state to a linear combination of several successive states,

s_{t+1} = (a_0 s_t + a_1 s_{t−1} + · · · + a_n s_{t−n}) mod M

Again the a_i are delicately chosen integers that make the period long.

Example 1.2. Mixed Generator. By mixing generators, we can obtain a longer period than that of a single linear congruential one. For example, use two linear congruential generators

x_n = (a_1 x_{n−2} − a_2 x_{n−3}) mod M_1

y_n = (b_1 y_{n−1} − b_2 y_{n−3}) mod M_2

and then mix the two sequences by

z_n = (x_n − y_n) mod M_1

Finally, if z_n > 0, return u_n = z_n/(M_1 + 1); otherwise return u_n = M_1/(M_1 + 1).
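A short sketch of this combined recursion follows. The moduli and multipliers used here are the values commonly quoted for L'Ecuyer's MRG32k3a; treat them purely as illustrative, not as constants prescribed by the text.

def mrg_combined(n, seed1=(12345, 12345, 12345), seed2=(12345, 12345, 12345)):
    """Sketch of the mixed generator of Example 1.2 (illustrative constants)."""
    M1, M2 = 4294967087, 4294944443
    a1, a2 = 1403580, 810728      # x_n = (a1*x_{n-2} - a2*x_{n-3}) mod M1
    b1, b2 = 527612, 1370589      # y_n = (b1*y_{n-1} - b2*y_{n-3}) mod M2
    x = list(seed1)               # holds (x_{n-3}, x_{n-2}, x_{n-1})
    y = list(seed2)
    out = []
    for _ in range(n):
        xn = (a1 * x[1] - a2 * x[0]) % M1
        yn = (b1 * y[2] - b2 * y[0]) % M2
        x = [x[1], x[2], xn]
        y = [y[1], y[2], yn]
        z = (xn - yn) % M1
        out.append(z / (M1 + 1) if z > 0 else M1 / (M1 + 1))
    return out

print(mrg_combined(3))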

Example 1.3. Sometimes we also look for linear congruential generators with modulus 2, to exploit the binary representation used by computers. For example, a general scheme is given by

x_n = A x_{n−1},   y_n = B x_n,   u_n = y_{n,1}/2 + · · · + y_{n,l}/2^l    (1.1)

where x_k ∈ {0, 1}^k, y_k ∈ {0, 1}^l and A, B are matrices with shapes k × k and l × k. Here u_n is just the result of writing the bit sequence y_{n,1} · · · y_{n,l} behind the decimal point "0.".

Special cases of the above framework include Tausworthe generators, given by

z_n = (a_1 z_{n−1} + · · · + a_k z_{n−k}) mod 2,   u_n = z_{nt+1}/2 + · · · + z_{nt+r}/2^r

1.2 Statistical Tests

As mentioned above, algorithmic random number generators give pseudorandom numbers, and in principle one can always design a specific statistical test showing that the generated sequence is not truly uniform on (0, 1) and that the values are not truly independent.

The first method we introduce is the Q-Q plot, but a few concepts need to be clarified before we proceed.

Definition 1.1. Given n values X_1, · · · , X_n ∈ R, together they are called a sample. We define their empirical (sample) distribution by

F_n(x) = (1/n) ∑_{i=1}^{n} I(X_i ≤ x)    (1.2)

so we see that it is a step function with each step jumping up by 1/n.

According to the Glivenko-Cantelli theorem, this empirical distribution converges to the underlying distribution if the X_i are i.i.d. random variables.


Definition 1.2. An α-quantile of a distribution F is defined to be a value q_α such that q_α = inf{x : F(x) ≥ α}. If F is continuous and strictly increasing, then q_α = F^{−1}(α). The quantile of the empirical distribution is naturally called the empirical quantile.

Now we are ready to introduce the following tests for uniformity. All of the following methods belong to the family of goodness-of-fit tests.

1. Q-Q plot. The quantile-quantile plot is a graphical test, where the x-axis represents the empirical quantiles from the data and the y-axis represents the theoretical quantiles (in the case of the uniform distribution, q_α = α). If the sample comes from the specified distribution, the plotted points should lie close to the line y = x.

2. χ²-test. This is a significance test. Suppose the sample X_i comes from the uniform distribution on (0, 1) (the null hypothesis), and partition the interval (0, 1) into K subintervals I_k = (k/K, (k + 1)/K). Define N_k = ∑_{i=1}^{n} I(X_i ∈ I_k), the number of values X_i falling in subinterval I_k. The expected number of observations in each subinterval is n/K. Under the null hypothesis, the statistic

∑_{k=0}^{K−1} (N_k − n/K)² / N_k

is asymptotically (as the sample size n grows) χ²-distributed with K − 1 degrees of freedom, provided that n/K is not too small. If this statistic falls beyond a certain level, we tend to believe that the X_i are not uniformly distributed.

3. Kolmogorov-Smirnov statistic. This is also a significance test. It is defined as

max_x |F_n(x) − F(x)|

where F_n(x) is the empirical distribution of the data X_i and F(x) = x is the uniform distribution function. Again, if this statistic lies beyond a certain level, we tend to believe that the data X_i are not uniformly distributed. (A short code sketch of this test and the previous one follows this list.)
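As a rough illustration (not part of the original notes), the χ² and Kolmogorov-Smirnov checks above can be carried out with numpy and scipy roughly as follows; only np.histogram, scipy.stats.chi2 and scipy.stats.kstest are assumed.

import numpy as np
from scipy import stats

def uniformity_tests(u, K=20):
    """Sketch: chi-square and Kolmogorov-Smirnov uniformity checks for u in (0,1)."""
    u = np.asarray(u)
    n = len(u)
    # Chi-square test on K equal subintervals; expected count n/K in each.
    counts, _ = np.histogram(u, bins=K, range=(0.0, 1.0))
    chi2_stat = np.sum((counts - n / K) ** 2 / counts)   # N_k in the denominator, as in the text
    chi2_pval = stats.chi2.sf(chi2_stat, df=K - 1)
    # Kolmogorov-Smirnov statistic against the uniform(0,1) cdf F(x) = x.
    ks = stats.kstest(u, "uniform")
    return chi2_stat, chi2_pval, ks.statistic, ks.pvalue

rng = np.random.default_rng(0)
print(uniformity_tests(rng.random(10_000)))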

We also need to test whether the generated values are independent (independence tests).

1. (d-blocks) Group the generated values into d-blocks (u_{k+1}, · · · , u_{k+d}), viewing each block as a random vector in R^d; there are n/d such d-blocks in total (where n is the number of generated values). We then use a χ² test for uniformity of these d-blocks in the d-dimensional cube (0, 1)^d, for d = 2, 3, · · · . Cut the cube (0, 1)^d into K^d sub-cubes, each of width 1/K, and count the number of d-blocks falling in each sub-cube, denoted N_i for the i-th sub-cube. We then use the statistic

∑_{i=1}^{K^d} (N_i − n/(dK^d))² / N_i

where n/(dK^d) is the expected number of d-blocks in each sub-cube if the uniformity hypothesis is true. This statistic is approximately χ²-distributed with K^d − 1 degrees of freedom as n → ∞, and if it is too large we tend to believe that the generated values u_k are not independent.

2. (Normal transform, correlation) Apply the transformation x_k = Φ^{−1}(u_k); then the x_k should be normally distributed if the u_k are uniformly distributed. Since jointly normal variables are independent if and only if they are uncorrelated, we can compute the empirical correlations between the x_k (using the empirical correlations at different lags k = 1, 2, · · · as our statistics). If the x_k are independent, then the back-transformed variables u_k = Φ(x_k) are also independent.

3. (Gap test) Take any sub-interval J ⊂ (0, 1). If the u_k are i.i.d. uniform on (0, 1), then the first index k such that u_k ∈ J is geometrically distributed with success probability |J| < 1.

4. (Coupon collector's test) View the generated values u_k as a random sequence of digits, and count the length of the minimal subsequence, starting from some position, needed to collect all elements of some specified set of K digits (for example all K = 10 digits, or K = 3 digits such as {1, 3, 5}). Then compare the empirical distribution of these counted lengths with the theoretical distribution.

1.3 Randomness Criteria

What criteria should u_1, · · · satisfy in order to be interpreted as realizations of i.i.d. uniform variables?

1. (∞-distributed) For any d = 1, 2, · · · , the empirical distribution of the d-blocks

(u_{k+1}, · · · , u_{k+d})

should converge to the uniform distribution on the d-dimensional unit cube (0, 1)^d as the number of generated values n → ∞.

2. (Unpredictability) Every subsequence of digits, when normalized by putting a "0." in front of it, should also be ∞-distributed.

Exercises

Exercise 1.1. Explain how Tausworthe generators fit into the framework (1.1).

Solution 1.1. Take the bit vector x_{n−1} = [z_{n−1}, · · · , z_{n−k}]′ and the recursion matrix

A =
[ a_1  · · ·  a_{k−1}   a_k ]
[ I_{k−1}               0   ]

where I_{k−1} is the (k − 1)-dimensional identity matrix and 0 is a (k − 1)-dimensional zero vector. The output matrix B = I_k is the k-dimensional identity matrix.


2 Nonuniform Random Variables

Example 2.1. (Simple discrete r.v.'s) Suppose the probability mass function of X is given by P(X = i) = p_i, with support {1, · · · , N}. Then we can simulate from this distribution by first generating a random number U and setting X = 1 if U < p_1, and otherwise X = n if

p_1 + · · · + p_{n−1} ≤ U < p_1 + · · · + p_n

If the probability mass function is given by P(X = i) = k_i/n, then we can speed up the simulation by storing the integer 1 in the first k_1 slots of a table and storing the integer i in positions k_1 + · · · + k_{i−1} + 1 through k_1 + · · · + k_i. We can then simulate from this distribution by generating a random number U and returning the value in slot [nU] + 1, where [x] denotes the largest integer smaller than or equal to x. (Trading space for time.)
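A minimal sketch of this table-lookup scheme (assuming integer counts k_i that sum to the table size n; names are illustrative):

import random

def build_table(counts):
    """Build the lookup table: value i+1 occupies counts[i] consecutive slots."""
    table = []
    for i, k in enumerate(counts, start=1):
        table.extend([i] * k)
    return table

def sample_from_table(table, rng=random):
    """Return the value in slot [nU] + 1, i.e. table[floor(n*U)] with 0-based indexing."""
    n = len(table)
    return table[int(n * rng.random())]

table = build_table([2, 5, 3])   # P(X=1)=2/10, P(X=2)=5/10, P(X=3)=3/10
print([sample_from_table(table) for _ in range(10)])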

For continuous random variables, there are two general simulation methods, namely inversion and acceptance-rejection. And for any particular distribution, there are also many ad hoc simulation methods around.

2.1 Inversion

We first define the generalized inverse of a distribution function F as follows.

Definition 2.1. F^←(u) = inf{x : F(x) ≥ u}

For a continuous, strictly increasing distribution F, we have F^{−1} = F^←. Note that the above definition is exactly the same as that of the quantile function of the distribution F.

Remark 2.1. Note that we use a left arrow in the superscript because this version of the inverse function is in fact left-continuous (see Exercise 2.3). We can also define a right-continuous version of the inverse function by

F^→(u) = inf{x : F(x) > u}

We have the following useful proposition.

Proposition 2.1. 1. The so-called switching formula:

u ≤ F(x) ⇔ F^←(u) ≤ x    (2.1)

2. If U is uniform on (0, 1), then the inverse-transformed variable F^←(U) has distribution F.

3. If X has a continuous distribution F, then the transformed variable F(X) is uniformly distributed on (0, 1).


Proof. 1. Suppose u ≤ F(x_0). Then since x_0 ∈ {x : F(x) ≥ u}, we certainly have

inf{x : F(x) ≥ u} = F^←(u) ≤ x_0

On the other hand, if F^←(u) ≤ x_0, then since F is nondecreasing and right-continuous, the infimum F^←(u) itself belongs to {x : F(x) ≥ u}, so F(x_0) ≥ F(F^←(u)) ≥ u.

2. By the switching formula, the distribution function of F^←(U) can be computed as

P(F^←(U) ≤ x) = P(U ≤ F(x)) = F(x)

Thus the transformed variable F^←(U) indeed has distribution function F.

3. Let us compute the tail distribution function of F(X). Again by the switching formula, and by continuity of F,

P(F(X) ≥ u) = P(X ≥ F^←(u)) = P(X > F^←(u)) = 1 − F(F^←(u))

We need to show that this expression equals 1 − u, i.e. that F(F^←(u)) = u. Taking x = F^←(u) in the switching formula gives F(F^←(u)) ≥ u. For the opposite direction, take y_n < F^←(u) for all n with y_n ↑ F^←(u). Then again by the switching formula, F(y_n) < u for all n; letting n → ∞ and using the continuity of F, we obtain F(F^←(u)) ≤ u.

Remark 2.2. Strengths and weaknesses of the inversion method.

1. (Strength) Very efficient (compared to the acceptance-rejection method of the next section), because only one transformation is needed.

2. (Weakness) Only applicable when the inverse function F^← has a closed-form expression (so that it can be programmed).

In what follows, we will assume that Ui are i.i.d random numbers.

Example 2.2. (Exponential r.v.) The distribution function of an exponential r.v. Z with rate 1 is F(z) = 1 − e^{−z}, so the inverse function is F^{−1}(y) = − log(1 − y). We can therefore simulate a standard exponential r.v. by generating a random number U and setting Z = − log(1 − U). But since 1 − U is also uniform on (0, 1), we may simply set Z = − log(U).

To simulate an exponential r.v. X with rate λ, we use the relation X = Z/λ, and thus set

X = − log(U)/λ
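A one-line sketch of this inversion (standard library only; 1 − U is used inside the log to avoid the measure-zero case random() = 0):

import math
import random

def exponential(rate, rng=random):
    """Sample Exp(rate) by inversion: X = -log(U)/rate."""
    return -math.log(1.0 - rng.random()) / rate

print([round(exponential(2.0), 3) for _ in range(5)])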

Example 2.3. (Gamma r.v. with integer shape parameter n) A gamma-distributed r.v. X with integer shape parameter n and rate λ can be simulated by simply summing n i.i.d. exponential r.v.'s with common rate λ. That is, with n i.i.d. random numbers U_i, we set

X = −(1/λ) ∑_{i=1}^{n} log(U_i) = −(1/λ) log(∏_{i=1}^{n} U_i)

The second expression is more efficient than the first because it only needs to evaluate the log function once.


Example 2.4. (Poisson r.v.) Suppose we want to simulate a Poisson r.v. with mean λ. Notice that the number of arrivals by time 1 of a Poisson process with rate λ is exactly Poisson distributed with mean λ. Also recall that the interarrival times are i.i.d. exponential r.v.'s with rate λ. So we can set

N = max{ n ≥ 0 : −(1/λ) ∑_{i=1}^{n} log(U_i) ≤ 1 } = max{ n ≥ 0 : ∏_{i=1}^{n} U_i ≥ e^{−λ} }

Again the second expression is more efficient for simulation, because only one exp function needs to be evaluated, whereas the first requires n log evaluations.
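A short sketch of this product form (standard library only; the loop multiplies uniforms until the running product drops below e^{−λ}):

import math
import random

def poisson(lam, rng=random):
    """Sample Poisson(lam): count how many uniforms keep the running product >= exp(-lam)."""
    threshold = math.exp(-lam)
    n, prod = 0, 1.0
    while True:
        prod *= rng.random()
        if prod < threshold:
            return n
        n += 1

print([poisson(3.5) for _ in range(10)])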

Sometimes, when no explicit formula for the inverse transform F^← exists, we can use an approximate expression. For example:

Example 2.5. (Standard normal r.v.) We use

Φ^{−1}(u) ≈ y + (∑_{i=0}^{4} p_i y^i) / (∑_{i=0}^{4} q_i y^i),   0.5 < u < 1    (2.2)

for some appropriately chosen coefficients p_i, q_i, where y = √(−2 log(1 − u)). The case 0 < u < 0.5 can be handled by symmetry.

In what follows, suppose the inverse transformation F^← is available; then we can simulate the following conditional distributions.

Example 2.6. (X | X ∈ (a, b)) The conditional distribution of X given that X ∈ (a, b) is

G(x) = (F(x) − F(a)) / (F(b) − F(a)),   x ∈ (a, b)

Writing F_a = F(a), F_b = F(b) and F_{ba} = F_b − F_a, the inverse transformation of this conditional distribution is

G^←(y) = F^←(F_a + y F_{ba}) = F^←(F(a) + y (F(b) − F(a)))

Thus to simulate from this conditional distribution, we just need to set

X = F^←(F_a + U F_{ba}) = F^←((1 − U) F_a + U F_b)

for a random number U.

Example 2.7. (X − a | X > a) Suppose we want to simulate from the conditional distribution of the exceedance X − a given that X > a. Define Y ∼ (X − a | X > a). The conditional distribution function is

P(Y ≤ y) = P(X ≤ a + y | X > a) = (F(y + a) − F(a)) / (1 − F(a))

so we can simulate from this distribution by setting

Y = F^←((1 − F_a) U + F_a) − a

for a random number U, where again F_a = F(a).
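A rough sketch of both conditional samplers for one concrete case where F and F^← are explicit (the standard exponential; all names below are illustrative):

import math
import random

# Illustrative choice: standard exponential, F(x) = 1 - exp(-x), Finv(u) = -log(1 - u).
F = lambda x: 1.0 - math.exp(-x)
Finv = lambda u: -math.log(1.0 - u)

def truncated(a, b, rng=random):
    """Sample X | X in (a, b) via X = Finv((1-U)F(a) + U F(b))."""
    u = rng.random()
    return Finv((1.0 - u) * F(a) + u * F(b))

def excess(a, rng=random):
    """Sample Y = X - a | X > a via Y = Finv((1 - F(a))U + F(a)) - a."""
    u = rng.random()
    return Finv((1.0 - F(a)) * u + F(a)) - a

print(truncated(1.0, 2.0), excess(1.0))   # for the exponential, excess(a) is again Exp(1)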


2.2 Simple Acceptance-Rejection

Suppose we want to simulate from a density f(x). The idea is to find a density g(x) that is easier to simulate from and bounds f in the sense f(x) ≤ C g(x) for all x, for some (the smaller the better) constant C > 1. The algorithm then proceeds as in Algorithm 1.

Algorithm 1 Acceptance-Rejection

Generate Y from density g(x), and a random number U.
if U < f(Y)/(C g(Y)) then
    Return X = Y (accepting this Y).
else
    Go back to the previous step (rejecting this Y).
end if

We check below that the generated X indeed has the target density f(x).

Proof. The generated X is distributed like X ∼ (Y | U < f(Y)/(C g(Y))). Thus the density of this X is

P(X ∈ dx) = P(U < f(x)/(C g(x))) P(Y ∈ dx) / P(U < f(Y)/(C g(Y)))
          = [f(x)/(C g(x))] g(x) dx / P(· · ·) = f(x) dx / (C P(· · ·))

Integrating over x and using the fact that the above expression is a density (so it integrates to 1), we get

P(U < f(Y)/(C g(Y))) = 1/C

and that X indeed has density function f.

Remark 2.3. Let us compute the probability of accepting a Y at each step. That is,

P(U < f(Y)/(C g(Y))) = ∫ [f(y)/(C g(y))] g(y) dy = 1/C

As can be seen, we must have C ≥ 1. The number of iterations needed before the algorithm terminates is geometrically distributed with success probability 1/C, so the expected number of runs is exactly C. Hence an efficient A-R algorithm should have C as small as possible; in other words, g should resemble f in shape as closely as possible.

Remark 2.4. Strengths and weaknesses of the acceptance-rejection method.

1. (Strength) No need to know closed-form formulas for the distribution function F or its inverse F^←.

2. (Weaknesses)

(a) Only applicable when an explicit formula for the density f is available.

(b) One needs to find an appropriate proposal density g that can be easily simulated from and looks like f in shape, so that the expected run time C is small. If such a proposal g is very hard to find, or the resulting C is large (making the algorithm very inefficient), then the A-R method is of little worth.

Remark 2.5. The idea of the A-R method. Suppose we want to simulate from a density f but the inversion method is not applicable, while another density g can be relatively easily simulated from. Consider the ratio f/g. A natural question is: can we first simulate from g and then recover data following f by incorporating the ratio f/g? Answering this question leads to the A-R method.

Example 2.8. Let f(x) = √(2/π) e^{−x²/2} be the density of the absolute value of a standard normal r.v., X = |Z| with Z ∼ N(0, 1). We choose g(x) = e^{−x}, the density of a standard exponential r.v. Then

h(x) = f(x)/g(x) = √(2e/π) e^{−(x−1)²/2} ≤ √(2e/π) = C

So the expected run time is C ≈ 1.3.

The standard normal r.v. can then be simulated by attaching a random sign (equally likely + and −) to X.
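A sketch of this A-R scheme (standard library only; the exponential proposal is drawn by inversion as in Example 2.2, and the acceptance ratio f(y)/(Cg(y)) simplifies to exp(−(y−1)²/2)):

import math
import random

def standard_normal_ar(rng=random):
    """Sample Z ~ N(0,1): accept-reject |Z| with an Exp(1) proposal, then attach a random sign."""
    while True:
        y = -math.log(1.0 - rng.random())          # proposal Y ~ Exp(1)
        if rng.random() < math.exp(-0.5 * (y - 1.0) ** 2):
            sign = 1.0 if rng.random() < 0.5 else -1.0
            return sign * y

print([round(standard_normal_ar(), 3) for _ in range(5)])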

An example showing how to engineer an appropriate proposal density g is as follows.

Example 2.9. Suppose the target density is gamma with parameters (α, λ) = (4.3, 1). Formally,

f(x) = e^{−x} x^{3.3} / Γ(4.3),   x > 0

We have already used the inversion method to simulate gamma distributions with integer shape parameters in Example 2.3. So we want to choose an appropriate integer n and some rate µ, and use for g(x) the density of the gamma distribution with parameters (n, µ). By the principle of Remark 2.3, we should choose n ≈ α = 4.3 (for example n = 4 or 5) and µ ≈ λ = 1, in order to make g ≈ f. On the other hand, to have f(x) ≤ C g(x) for some constant C ≥ 1 and all x, we must make g decay more slowly than f both as x → 0 and as x → ∞ (otherwise the inequality f ≤ Cg is easily violated at those limits). Focusing on the side x → 0, the only choice is n = 4: for n = 5 the decay rate of g is O(x^4), which is faster than that of f, namely O(x^{3.3}), as x → 0. Then

g(x) = µ^4 e^{−µx} x^3 / 3!,   x > 0

To determine µ, we focus on x → ∞, where

f(x) ∼ O(x^{3.3} e^{−x}),   g(x) ∼ O(x^3 e^{−µx})

Then as x → ∞,

g(x)/f(x) ∼ O(x^{−0.3} e^{−x(µ−1)})

So after choosing n = 4, we cannot simply set µ = λ = 1, because that would make g = o(f), i.e. g would decay faster than f as x → ∞. To make g decay more slowly than f, we need µ < 1, for example µ = 0.9.

In conclusion, we choose g(x) to be the density of the gamma distribution with parameters (n, µ) = (4, 0.9).
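A sketch of the resulting A-R sampler. The ratio f/g is proportional to x^{0.3} e^{−0.1x}, which is maximized at x = 3, so the constant C is computed there; the proposal Gamma(4, 0.9) is drawn as a sum of four exponentials as in Example 2.3.

import math
import random

ALPHA, LAM = 4.3, 1.0          # target Gamma(4.3, 1)
N, MU = 4, 0.9                 # proposal Gamma(4, 0.9), engineered as above

def f(x):
    """Target density."""
    return math.exp(-LAM * x) * x ** (ALPHA - 1) * LAM ** ALPHA / math.gamma(ALPHA)

def g(x):
    """Proposal density."""
    return MU ** N * math.exp(-MU * x) * x ** (N - 1) / math.factorial(N - 1)

C = f(3.0) / g(3.0)            # f/g is maximized at x = 0.3/0.1 = 3

def gamma_ar(rng=random):
    """Sample Gamma(4.3, 1) by A-R with a Gamma(4, 0.9) proposal."""
    while True:
        y = -sum(math.log(1.0 - rng.random()) for _ in range(N)) / MU
        if rng.random() < f(y) / (C * g(y)):
            return y

print(round(C, 4), round(gamma_ar(), 3))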

Remark 2.6. Strategy in using the A-R method. We first need to find a proposal density g that resembles the target density f in shape, so that the resulting C is close to 1 (and thus the algorithm is efficient). Then we tailor g to minimize C as much as possible.

Also, a hint from the above example: g should decay more slowly than f at the tails (as x approaches the endpoints of the support of f). Otherwise, if g decays at a faster rate in the tails, the inequality f ≤ Cg is easily violated there.

Example 2.10. (Log-concave density) A function f(x) > 0 is said to be log-concave if the function log(f(x)) is concave.

Let f be a log-concave decreasing density with support [0, ∞). Without loss of generality, we may assume f(0) = 1. Indeed, for a general r.v. X with density f, we could set X̃ = f(0)X, which has density

P(X̃ ∈ dx) = P( X ∈ (x/f(0), x/f(0) + dx/f(0)) ) = (1/f(0)) f(x/f(0)) dx

and in particular this density equals 1 at x = 0. We now point out that

f(x) ≤ min(1, e^{1−x})    (2.3)

Proof. Since f is decreasing, we know that f(x) ≤ f(0) = 1. And since e^{1−x} is strictly decreasing and equals 1 when x = 1, we only need to prove that f(x) ≤ e^{1−x} for x > 1.

We use the fact that f is a density, i.e. ∫ f = 1. Suppose the claim is false, that is, for some x_0 > 1 we have f(x_0) > e^{1−x_0}, or log(f(x_0)) > 1 − x_0. Also, by log-concavity (and log f(0) = 0), for x ∈ (0, x_0),

log(f(x)) ≥ (x/x_0) log(f(x_0)) > (1 − x_0) x / x_0

or, rewritten,

f(x) > e^{(1−x_0)x/x_0},   x ∈ (0, x_0)

Thus

1 = ∫_0^∞ f(x) dx > ∫_0^{x_0} e^{(1−x_0)x/x_0} dx = [x_0/(x_0 − 1)] (1 − e^{−(x_0−1)})

But that would make

x_0 e^{−x_0} > e^{−1}

which is a contradiction, because for x_0 > 1 the function x_0 e^{−x_0} is decreasing in x_0.


Then we can use the density g(x) = (1/2) min(1, e^{1−x}) to simulate from f by A-R. Since

f(x)/g(x) ≤ 2 = C

the expected number of runs, according to the previous remark, is 2. As for g, it is exactly the density of a r.v. Y distributed like a uniform r.v. U with probability 1/2 and like 1 + Z with probability 1/2, where Z is a standard exponential r.v. Indeed, the density of Y is

P(Y ∈ dy) = (1/2) P(U ∈ dy) + (1/2) P(1 + Z ∈ dy)

When y ≤ 1, P(U ∈ dy) = dy and

P(1 + Z ∈ dy) = P(Z ∈ (y − 1, y − 1 + dy)) = 0

while when y > 1, P(U ∈ dy) = 0 and

P(1 + Z ∈ dy) = P(Z ∈ (y − 1, y − 1 + dy)) = e^{1−y} dy

Therefore we indeed have g(y) dy = P(Y ∈ dy), and we can simulate from g by generating this Y.

An example of a log-concave density is the density of the absolute value of a standard normal:

log(f(x)) = (1/2) log(2/π) − x²/2

is concave in x.

Example 2.11. (Conditional distribution given X ∈ (a, b), log-concave density) Consider a log-concave density f(x) whose support this time is not necessarily the positive real line. By log-concavity, for any x > a we have

(log(f(x)) − log(f(a)))/(x − a) ≤ d log(f(x))/dx |_{x=a} = f′(a)/f(a) = β

or, rewritten,

f(x) ≤ f(a) e^{β(x−a)}

Now consider the r.v. X with this density, conditioned on X ∈ (a, b); its conditional density is

f̃(x) = f(x)/(F(b) − F(a)),   x ∈ (a, b)

Using the above inequality, we have

f̃(x) ≤ [f(a) e^{−βa}/(F(b) − F(a))] e^{βx},   x ∈ (a, b)

Define

g(x) = β e^{βx}/(e^{βb} − e^{βa}),   x ∈ (a, b)


Then, using the above relation,

f̃(x)/g(x) ≤ C = (e^{β(b−a)} − 1) f(a) / (β (F(b) − F(a)))

and we can simulate from the conditional density using g and the A-R method, where g itself can be simulated by inversion. The distribution function of g is

G(x) = (e^{βx} − e^{βa})/(e^{βb} − e^{βa}),   x ∈ (a, b)

and thus Y with this distribution can be generated as

Y = (1/β) log((1 − U) e^{βa} + U e^{βb})

for a random number U.

2.3 Ad Hoc Algorithms

Recall some conclusions from probability theory.

1. The χ² distribution with k degrees of freedom is exactly the gamma distribution with shape parameter k/2 and rate 1/2.

2. A χ² r.v. with k degrees of freedom can be expressed as the sum of k squared i.i.d. standard normal r.v.'s.

Example 2.12. (Box-Muller method for normal r.v.'s) Suppose Z_1, Z_2 are two independent standard normal r.v.'s. Then R² = Z_1² + Z_2² is χ²-distributed with 2 degrees of freedom, and thus gamma distributed with shape 1 and rate 1/2, which is exactly an exponential r.v. with rate 1/2. Thus R² can be easily simulated as

R² ∼ −2 log(U_1)

for a random number U_1.

Let us then consider the joint density of Z_1, Z_2,

f(z_1, z_2) = (1/(2π)) exp(−(z_1² + z_2²)/2)

and make the following transformation (a version of polar coordinates, except that we use the squared radius instead of the radius itself):

r² = z_1² + z_2²,   θ = arctan(z_2/z_1)

The Jacobian of this transformation contributes a factor 1/2, so the density in the new variables becomes

f(r², θ) = (1/(2π)) (1/2) e^{−r²/2},   r² > 0, θ ∈ (0, 2π)


Thus we see that R² = Z_1² + Z_2² and Θ = arctan(Z_2/Z_1) are independent r.v.'s; R² is exponentially distributed with rate 1/2 (which coincides with the previous conclusion) and Θ is uniformly distributed over the interval (0, 2π).

We can therefore first generate the radius by

R ∼ √(−2 log(U_1))

and then the angle by

Θ ∼ 2π U_2

where U_1, U_2 are independent random numbers, and then set

Z_1 = R cos(Θ) = √(−2 log(U_1)) cos(2π U_2)
Z_2 = R sin(Θ) = √(−2 log(U_1)) sin(2π U_2)

Using the above expressions, we can generate pairs of independent standard normal r.v.'s.
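A compact sketch of the Box-Muller pair (standard library only; 1 − U_1 is used inside the log to avoid log(0)):

import math
import random

def box_muller(rng=random):
    """Generate one pair of independent N(0,1) variates from two random numbers."""
    u1, u2 = rng.random(), rng.random()
    r = math.sqrt(-2.0 * math.log(1.0 - u1))
    theta = 2.0 * math.pi * u2
    return r * math.cos(theta), r * math.sin(theta)

print(box_muller())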

Example 2.13. (Marsaglia's polar method for standard normal r.v.'s) The insight is that if (V_1, V_2) is uniformly distributed over the unit disk, that is, its joint density is

f(v_1, v_2) = 1/π,   v_1² + v_2² ≤ 1

then, making the transformation related to polar coordinates,

r² = v_1² + v_2²,   θ = arctan(v_2/v_1)

and noting that the Jacobian of this transformation produces a factor 1/2 in front of the transformed density,

f(r², θ) = 1/(2π),   r² ≤ 1, θ ∈ (0, 2π)

we see that R² = V_1² + V_2² and Θ = arctan(V_2/V_1) are independent r.v.'s and both uniformly distributed, the first on (0, 1) and the second on (0, 2π).

Thus, according to the discussion in the previous example (on the Box-Muller method), the r.v.'s defined by

Z_1 = √(−2 log(R²)) cos(Θ) = √(−2 log(V_1² + V_2²)/(V_1² + V_2²)) V_1
Z_2 = √(−2 log(R²)) sin(Θ) = √(−2 log(V_1² + V_2²)/(V_1² + V_2²)) V_2

are independent standard normal r.v.'s.

So in order to simulate Z_1, Z_2 we just need to figure out how to simulate V_1, V_2. In fact, we generate random numbers U_1, U_2 and set Ṽ_i = 2U_i − 1 (uniformly distributed on (−1, 1), so that (Ṽ_1, Ṽ_2) is uniformly distributed on the square (−1, 1)²). If Ṽ_1² + Ṽ_2² ≤ 1, we accept them and set V_i = Ṽ_i; otherwise, we rerun the previous step until acceptance.
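A sketch of the polar method (standard library only):

import math
import random

def marsaglia_polar(rng=random):
    """Generate one pair of independent N(0,1) variates via the polar (rejection) method."""
    while True:
        v1, v2 = 2.0 * rng.random() - 1.0, 2.0 * rng.random() - 1.0
        s = v1 * v1 + v2 * v2                      # s = R^2, uniform on (0,1) when accepted
        if 0.0 < s <= 1.0:
            factor = math.sqrt(-2.0 * math.log(s) / s)
            return factor * v1, factor * v2

print(marsaglia_polar())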


Remark 2.7. The above A-R step for sampling V_1, V_2 makes this method slower than Box-Muller, although it requires fewer evaluations of special functions like sin and cos.

Example 2.14. (Geometric r.v.) Suppose N is geometrically distributed with success probability p. The probability mass function is

P(N = n) = p(n) = q^{n−1} p,   n = 1, 2, · · ·

where q = 1 − p.

One way to simulate from this distribution is to simulate repeatedly from the Bernoulli distribution with success probability p (like flipping a coin that lands heads with probability p), obtaining i.i.d. Bernoulli r.v.'s X_1, X_2, · · · . Recall that a Bernoulli r.v. with success probability p can easily be simulated by setting

X = I(U < p)

where U is a random number and I(·) is the indicator function. Then we set N = n if X_k = 0 for all k < n and X_n = 1 (the first success occurs at time n). The expected number of random numbers needed is thus 1/p, so if p → 0 this method becomes very inefficient (on the other hand, if p → 1 it is very efficient).

Another method for generating N is inversion. In fact, set

N = [ log(U)/log(q) + 1 ]

where U is a random number. Then

P(N = n) = P( n ≤ log(U)/log(q) + 1 < n + 1 ) = P( q^n < U ≤ q^{n−1} ) = q^{n−1} − q^n = p q^{n−1}

which shows that N is indeed geometrically distributed with success probability p.
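A one-line sketch of this inversion (standard library only; 1 − U is used so that the argument of the log stays in (0, 1]):

import math
import random

def geometric(p, rng=random):
    """Sample Geometric(p) on {1, 2, ...} by inversion: N = floor(log(U)/log(1-p)) + 1."""
    u = 1.0 - rng.random()
    return math.floor(math.log(u) / math.log(1.0 - p)) + 1

print([geometric(0.3) for _ in range(10)])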

Example 2.15. (Alias method for a discrete r.v. with finite support) Suppose we want to simulate from a probability mass function p(k) with support {1, · · · , n}. The alias method says that we can find n − 1 two-point distributions q_i(k) (i.e. q_i(k) > 0 for only two values of k) such that p(k) can be expressed as an even mixture of them:

p(k) = (1/(n − 1)) ∑_{i=1}^{n−1} q_i(k)

To simplify the discussion, we single out the following fact as a lemma.

Lemma 2.1. For the above probability mass function p(k),

1. there must be some k such that p(k) < 1/(n − 1);

2. for this k, there is an l ≠ k such that p(k) + p(l) > 1/(n − 1).


Proof. For the first part: if instead p(k) ≥ 1/(n − 1) for all k, then 1 = ∑_k p(k) ≥ n/(n − 1), a contradiction.

For the second part: if for this k we had p(k) + p(l) ≤ 1/(n − 1) for every l ≠ k, then summing over the n − 1 values l ≠ k would give

1 ≥ (n − 1)p(k) + ∑_{l≠k} p(l) = (n − 1)p(k) + 1 − p(k) = (n − 2)p(k) + 1

so (n − 2)p(k) ≤ 0, i.e. p(k) ≤ 0, a contradiction since k belongs to the support of p (p(k) > 0).

We now describe the algorithm for constructing the distributions q_i(k). To start, choose k and l as specified in Lemma 2.1, and define q_1(·) with mass concentrated only on positions k and l; in particular, all of the mass of p at position k is placed into q_1, i.e.

q_1(k)/(n − 1) = p(k)

and q_1(l) = 1 − q_1(k) is then determined. Note that q_1(k) = (n − 1)p(k) < 1 by the first part of the lemma.

Subtracting q_1/(n − 1) from p, we obtain another probability mass function p̃ after normalization (so that ∑ p̃ = 1):

p̃ = [(n − 1)/(n − 2)] ( p − q_1/(n − 1) )

Inside the bracket, position k becomes 0 and positions other than l are untouched (subtracting 0). At position l we have

p(l) − q_1(l)/(n − 1) = p(l) + p(k) − 1/(n − 1) > 0

by the second part of the lemma. Thus everything remains valid.

We then treat p̃ as our new probability mass function, supported on n − 1 points, and run the above procedure again. After n − 2 steps we are reduced to a probability mass function supported on only two points, which requires no further treatment. In other words, p̃ can be factorized by the above procedure into

p̃ = (1/(n − 2)) ∑_{i=2}^{n−1} q_i

where q_2, · · · , q_{n−1} are all probability mass functions supported on only two points. Thus finally,

p = (1/(n − 1)) ∑_{i=1}^{n−1} q_i

(using the expression for p̃).

In order to simulate from p(k), we can first simulate I, equally likely to be any of 1, · · · , n − 1, and then simulate from the two-point distribution q_I. This is the so-called alias method.


Consider any mixture distribution F = ∑_{i=1}^{n} p_i F_i, where the F_i are distributions from which we can relatively easily simulate and (p_i) is a probability mass function on {1, · · · , n}. To simulate from this mixture, we can first use the above alias method to simulate I from the distribution (p_i), and then simulate from F_I. This works because, by conditioning on I,

F(x) = P(X ≤ x) = ∑_i P(X ≤ x | I = i) p_i = ∑_i F_i(x) p_i

2.4 Further Uses of Acceptance-Rejection Ideas

Example 2.16. (Unknown normalizing factor in the target density) The A-R method can be generalized to the case where the target density has the form f(x) = f̃(x)/D for some unknown constant D and some positive function f̃(x) that is not necessarily normalized (i.e. possibly ∫ f̃ ≠ 1). We can still try to find a density h(x) that can be (relatively easily) simulated from, and a constant C such that

f̃(x)/h(x) ≤ C,   for all x

and then simulate Y from h, accepting it and setting X = Y with probability f̃(Y)/(C h(Y)) (which is less than 1 by the above inequality). In particular, we generate a random number U and accept Y if

U < f̃(Y)/(C h(Y))

We can prove that the resulting X indeed has density f.

Proof. First let us determine the probability of accepting a generated Y, by conditioning on the value of Y:

P(U < f̃(Y)/(C h(Y))) = ∫ [f̃(y)/(C h(y))] h(y) dy = (D/C) ∫ [f̃(x)/D] dx = D/C

So D/C ≤ 1 must hold, and the expected number of runs is proportional to C, because the number of runs is geometrically distributed with the above success probability.

Next, note that the resulting X can be expressed as

X ∼ Y | U < f̃(Y)/(C h(Y))

In words, it has the conditional distribution of Y given acceptance. So the density of X is

P(X ∈ dx) = P( Y ∈ dx | U < f̃(Y)/(C h(Y)) )

= P( U < f̃(Y)/(C h(Y)) | Y ∈ dx ) P(Y ∈ dx) / P( U < f̃(Y)/(C h(Y)) )

= [f̃(x)/(C h(x))] h(x) dx / (D/C) = [f̃(x)/D] dx = f(x) dx

Thus the density of the generated X is indeed f.

Before we proceed, we must point out the following general and useful technique for simulating uniform r.v.'s on a bounded region in space.

Proposition 2.2. Suppose Ω is a bounded region in d-dimensional space R^d, and suppose X_1, · · · , X_d are jointly uniformly distributed on this region. Then to simulate these r.v.'s, we can first find a rectangle R in R^d that contains Ω, generate independent uniform r.v.'s U_1, · · · , U_d on R, and accept them if they fall inside Ω.

Remark 2.8. The rectangle R can be found as follows. Define

a_i = inf{x_i : (x_1, · · · , x_d) ∈ Ω},   b_i = sup{x_i : (x_1, · · · , x_d) ∈ Ω}

to be the left-most and right-most values of the region Ω along the x_i-axis, for i = 1, · · · , d. Then the rectangle is

R = [a_1, b_1] × · · · × [a_d, b_d]

Each U_i is uniformly distributed on [a_i, b_i], independently of the others. We define the volume of this rectangle to be

|R| = ∏_{i=1}^{d} (b_i − a_i)

Of course we assume a_i < b_i; otherwise the region Ω is flat and we can work in a lower-dimensional space (possibly after some rotation and transformation of Ω). We use |Ω| to denote the volume of the region Ω.

We need to prove that the above procedure indeed gives uniform r.v.'s on the region Ω. In fact, the X_i are distributed like

(X_1, · · · , X_d) ∼ (U_1, · · · , U_d) | (U_i)_{i=1}^d ∈ Ω

Thus the joint density of the X_i is

P((X_i)_{i=1}^d ∈ dx_1 · · · dx_d) = P((U_i)_{i=1}^d ∈ dx_1 · · · dx_d | (U_i)_{i=1}^d ∈ Ω)

= P((U_i)_{i=1}^d ∈ dx_1 · · · dx_d) / P((U_i)_{i=1}^d ∈ Ω) = (1/|R|) / (|Ω|/|R|) = 1/|Ω|

where we have used the shorthand

dx_1 · · · dx_d = (x_1, x_1 + dx_1) × · · · × (x_d, x_d + dx_d)

for the infinitesimal rectangle, with (x_i)_{i=1}^d ∈ Ω, and we have noted that

P((U_i)_{i=1}^d ∈ Ω | (U_i)_{i=1}^d ∈ dx_1 · · · dx_d) = 1

Therefore the generated X_i are indeed jointly uniformly distributed on the region Ω (although not necessarily independent).

Example 2.17. (Ratio of uniforms) Suppose the target density is f(x). In this example we propose the following method.


1. Simulate (V_1, V_2) jointly uniformly distributed on the region

Ω = Ω_c(f) = {(v_1, v_2) : 0 < v_1 ≤ √(c f(v_2/v_1))}

for some appropriately chosen constant factor c.

2. Return X = V_2/V_1.

The A-R idea comes into play when we want to simulate V_1, V_2. In practice, we find a rectangle R that contains the region Ω, generate jointly uniform r.v.'s W_1, W_2 on R, and reject until (W_1, W_2) ∈ Ω. Upon acceptance we set V_i = W_i (as in Proposition 2.2).

For example (viewing x = v_2/v_1), we can choose

a = sup{√(c f(x)) : x ∈ R}

and

b_+ = sup{√(c x² f(x)) : x > 0},   b_− = − sup{√(c x² f(x)) : x < 0}

Then obviously 0 < v_1 ≤ a. And if v_2 > 0, then

v_1 ≤ √(c f(v_2/v_1)) ⇒ v_2² = v_1² (v_2/v_1)² ≤ c (v_2/v_1)² f(v_2/v_1) ≤ (b_+)²

and similarly if v_2 < 0, then

v_1 ≤ √(c f(v_2/v_1)) ⇒ |v_2| ≤ √(c (v_2/v_1)² f(v_2/v_1)) ≤ −b_−

i.e. v_2 ≥ b_−. In general v_2 ∈ [b_−, b_+]. Thus we can generate independent uniform r.v.'s U_1, U_2 on the rectangle R = [0, a] × [b_−, b_+] and accept them if (U_1, U_2) ∈ Ω. The acceptance probability is thus

|Ω|/|R| = c / (2a(b_+ − b_−))    (2.4)

where |Ω| = c/2 is proven below.

If we assume that V_1, V_2 can be simulated (as above), then the original algorithm works because of the following.

Proof. Define the r.v. Y = V_1 and consider the random vector (X, Y). The transformation between (X, Y) and (V_1, V_2) is

v_1 = y,   v_2 = xy

Computing the Jacobian of this transformation, we see that the joint density of (X, Y) is that of (V_1, V_2) multiplied by a factor of y, because

dv_1 dv_2 = y dx dy

In particular, letting |Ω| be the volume (area) of the region Ω, the joint density of (V_1, V_2) is

g(v_1, v_2) = 1/|Ω|,   (v_1, v_2) ∈ Ω


while in the (x, y) coordinates the joint density of (X, Y) is

g(x, y) = y/|Ω|,   (x, y) ∈ Ω

where the region is now described by

Ω = {(x, y) : 0 < y ≤ √(c f(x))}

We determine the area |Ω| from

1 = ∫_Ω g(x, y) dx dy = ∫ dx ∫_0^{√(c f(x))} (y/|Ω|) dy = (c/(2|Ω|)) ∫ f(x) dx = c/(2|Ω|)

so |Ω| = c/2 < ∞, which is indeed finite.

Thus the marginal density of X is obtained by integrating out y from the joint density of (X, Y):

∫_0^{√(c f(x))} g(x, y) dy = f(x)

which shows that the generated X is indeed distributed with the target density f.

Consider the following example use of this method. Suppose the target is the standard normal r.v. We choose c = √(2π); then c f(x) = e^{−x²/2} and √(c f(x)) = e^{−x²/4}. The acceptance region becomes

Ω = {(v_1, v_2) : 0 < v_1 ≤ e^{−(v_2/v_1)²/4}} = {(v_1, v_2) : v_1 > 0, log(v_1) ≤ −(v_2/v_1)²/4}

and the boundaries of the rectangle R are

a = 1,   b_+ = √(2/e) = −b_−

The resulting ratio-of-uniforms algorithm for generating a standard normal r.v. is given as Algorithm 2 below. The acceptance probability is given by equation (2.4); in this case it is √(πe)/4 ≈ 73%.

Algorithm 2 Ratio of Uniforms for a Standard Normal r.v.

Generate independent random numbers U_1, U_2.
Set U′_1 = U_1, U′_2 = √(2/e)(2U_2 − 1).
if −4 log(U′_1) ≥ (U′_2/U′_1)² then
    Return X = U′_2/U′_1.
else
    Go back to the first step.
end if
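A sketch of Algorithm 2 (standard library only; a guard against the measure-zero case U_1 = 0 is added before taking the log):

import math
import random

def normal_ratio_of_uniforms(rng=random):
    """Sample N(0,1) by the ratio-of-uniforms method (Algorithm 2)."""
    b = math.sqrt(2.0 / math.e)
    while True:
        u1 = rng.random()
        u2 = b * (2.0 * rng.random() - 1.0)
        if u1 > 0.0 and -4.0 * math.log(u1) >= (u2 / u1) ** 2:
            return u2 / u1

print([round(normal_ratio_of_uniforms(), 3) for _ in range(5)])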


Example 2.18. (Von Neumann, exponential) Our target in this example is to simulate a standard exponential X. Separate X into its integer part [X] = M and its fractional part Y = X − [X]. We will argue that M + 1 (one plus the integer part) is geometrically distributed with success probability 1 − e^{−1}, and that the fractional part Y is a standard exponential r.v. conditioned on Y ∈ (0, 1), that is, Y has density

g(y) = e^{−y}/(1 − e^{−1}),   y ∈ (0, 1)

The A-R idea comes into play when we try to simulate this Y, but that is a story for later. We first prove that the above decomposition indeed gives the desired standard exponential.

Proof. First focus on M + 1. Its probability mass function is

P(M + 1 = m) = P([X] = m − 1) = P(m − 1 ≤ X < m) = (1 − e^{−m}) − (1 − e^{−(m−1)}) = e^{−(m−1)}(1 − e^{−1})

Thus M + 1 is indeed geometric with success probability 1 − e^{−1}.

Next focus on Y. Conditioning on [X], we have

P(Y ∈ dy) = ∑_{m=1}^{∞} P(X ∈ (m − 1 + y, m − 1 + y + dy)) = ∑_{m=1}^{∞} e^{−(m−1+y)} dy = e^{−y} ∑_{m=0}^{∞} (e^{−1})^m dy = e^{−y}/(1 − e^{−1}) dy = g(y) dy

Moreover, each single term P([X] = m − 1, Y ∈ dy) = e^{−(m−1+y)} dy factorizes as the product of the two marginals above, so M and Y are independent, which is what lets us generate them separately and add them.

In order to simulate Y, Von Neumann proposed the following method. Let the U_i be i.i.d. random numbers, and let

N = min{ n ≥ 2 : U_1 > · · · > U_{n−1} < U_n }

In words, N is the index of the first random number U_N that breaks the descending order of the U_i sequence. Note that

N > n ⇔ U_1 > · · · > U_n

and since all n! orderings of n random numbers are equally likely,

P(N > n) = 1/n!

Considering U_1, which bounds the whole decreasing run from above, and noting that the values of the U_i are independent of their relative rank order, we have

P(N > n, U_1 ≤ y) = P(N > n) P(U_i ≤ y, i = 1, · · · , n) = y^n/n!


And thus

P(N = n, U_1 ≤ y) = y^{n−1}/(n − 1)! − y^n/n!

Summing over the even integers n = 2k, k = 1, 2, · · · , we have

P(N is even, U_1 ≤ y) = −∑_{n=1}^{∞} (−y)^n/n! = 1 − e^{−y}

and setting y = 1,

P(N is even) = 1 − e^{−1}

Thus, given that N is even, the conditional distribution function of U_1 is

P(U_1 ≤ y | N is even) = (1 − e^{−y})/(1 − e^{−1})

So Y ∼ (U_1 | N is even), and we obtain Algorithm 3 for generating a standard exponential r.v.

Algorithm 3 Von Neumann method for Exp(1)

Generate i.i.d. random numbers U_1, U_2, · · · until the decreasing order U_1 > U_2 > · · · is broken. Let N be the first index at which it is broken (as discussed above).
if N is even then
    Generate a geometric r.v. M′ with success probability 1 − e^{−1} (using the methods of Example 2.14) and set M = M′ − 1.
    Return X = M + U_1.
else
    Go back to the first step.
end if
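A sketch of Algorithm 3 (standard library only; the geometric part is drawn by inversion as in Example 2.14, and M = M′ − 1 simplifies to the integer part of an Exp(1) variate):

import math
import random

def exponential_von_neumann(rng=random):
    """Sample Exp(1) by von Neumann's method (Algorithm 3)."""
    while True:
        u1 = rng.random()
        prev, n = u1, 1
        while True:                      # draw until the descending run is broken
            u = rng.random()
            n += 1
            if u >= prev:                # N = n is the first index breaking U_1 > U_2 > ...
                break
            prev = u
        if n % 2 == 0:                   # accept only even N
            m = math.floor(-math.log(1.0 - rng.random()))   # M = M' - 1, M' ~ Geometric(1 - 1/e)
            return m + u1

print([round(exponential_von_neumann(), 3) for _ in range(5)])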

As can be seen, we only accept even N and reject odd N; that is where the A-R idea comes in.

So how efficient is this algorithm? More specifically, how many random numbers are needed to generate one desired X? Let T be the number of iterations, and let N_i be the number of random numbers generated in the i-th iteration. Then the total number of random numbers required can be written as

S = ∑_{i=1}^{T} N_i

Since T is a stopping time of the process {N_i}, by Wald's equation the expected total number of U_j can be computed as

E[S] = E[T] E[N_1]

Now T is a geometric r.v. with success probability P(N is even) = 1 − e^{−1}, so E[T] = (1 − e^{−1})^{−1} ≈ 1.6; that is, there are on average 1.6 iterations before termination. And

E[N_1] = ∑_{n=0}^{∞} P(N > n) = e ≈ 2.7


So in each iteration, on average 2.7 random numbers are used. Finally,

E[S] = e/(1 − e^{−1}) ≈ 4.3

Therefore the expected total number of random numbers generated by this algorithm before termination is about 4.3.

Example 2.19. (Squeezing) This is a generalization (or variant) of the A-R method. The idea is to find a positive function h (not necessarily a density) that is cheap to evaluate and satisfies h(x) ≤ f(x)/(C g(x)) for all x; in particular h ≤ 1 must hold. Then we accept a generated Y from density g if either U < h(Y) or h(Y) ≤ U < f(Y)/(C g(Y)) holds, for a random number U.

Algorithm 4 Squeezing A-R

Generate Y from density g. Generate a random number U.
if U < h(Y) or U < f(Y)/(C g(Y)) then
    Return X = Y.
else
    Go back to the first step.
end if

This algorithm indeed works, because the output X is distributed like

X ∼ Y | { U < h(Y) or h(Y) ≤ U < f(Y)/(C g(Y)) }

where Y has density g. Thus the density of X can be computed as

P(X ∈ dx) = P( U < h(Y) or h(Y) ≤ U < f(Y)/(C g(Y)) | Y ∈ dx ) P(Y ∈ dx) / K

= ( h(x) + f(x)/(C g(x)) − h(x) ) g(x) dx / K = f(x) dx / (CK)

where K = P( U < h(Y) or h(Y) ≤ U < f(Y)/(C g(Y)) ). Integrating over x, this should equal 1, so K = 1/C, and X indeed has density f.

The acceptance probability is still 1/C, and thus the expected run time is still C. But the function f is evaluated only if U ≥ h(Y), which saves much time when f is hard to evaluate.

2.5 Transform-Based Methods

To better understand the contents of this subsection, besides reading the text, one may also need to look at [1, chapter XIV, section 3].

The situation in this subsection is that the density function f(x) is very hard to evaluate (there is no closed-form formula, no approximating series, etc.), but its characteristic function

φ(s) = ∫ f(x) e^{ixs} dx

is available and easier to evaluate. The problem is now to generate an r.v. X with density f using the characteristic function φ instead of f.

Example 2.20. One basic idea is to select n points x_i on the real line (larger n yields a more accurate approximation), find a way to evaluate f(x_i) from the characteristic function φ, and construct an n-point discrete distribution p supported on these points with probability mass function

p(x_i) ∝ f(x_i)    (2.5)

We then simulate X from this discrete distribution, as an approximate version of the desired X.

So here is one solution. Choose a starting point x_0 and a step length ε > 0, and take the n − 1 further points

x_k = x_{k−1} + ε = x_0 + kε,   k = 1, · · · , n − 1

We want to find a probability mass function p_k = p(x_k) as described above. Let us take the discrete Fourier transform of the n values p_k and see what happens:

p̂_l = (1/n) ∑_{k=0}^{n−1} p_k e^{2πikl/n}

Substituting k = (x_k − x_0)/ε, we obtain

p̂_l = (1/n) ∑_{k=0}^{n−1} p_k exp(2πil(x_k − x_0)/(nε)) = (1/n) exp(−2πilx_0/(nε)) ∑_{k=0}^{n−1} p_k exp(2πilx_k/(nε))

Since we want the p_k to satisfy equation (2.5), we can view the sum above as an approximation of the following integral (p_k ↔ f(x_k) dx):

∑_{k=0}^{n−1} p_k exp(2πilx_k/(nε)) ≈ ∫ exp(2πilx/(nε)) f(x) dx = φ(2πl/(nε))

Thus everything boils down to evaluating the characteristic function φ. We now just need to take the inverse discrete Fourier transform of the approximate DFT values

q_l = (1/n) exp(−2πilx_0/(nε)) φ(2πl/(nε))

to get

p_k = ∑_{l=0}^{n−1} q_l exp(−2πikl/n)

Then we normalize the p_k to obtain p*_k, a valid probability mass function, and simulate X from p*_k.


To make the generated value continuous instead of being restricted to the finite set of points x_k, one usually adds uniform noise to the result: X + ε(U − 1/2), where U is a random number.

The discrete Fourier transform (and its inverse) can be efficiently implemented via the FFT (Fast Fourier Transform).

Example 2.21. We also have the following method based on the Fourier transform of the density. By the Fourier inversion formula,

f(x) = (1/(2π)) ∫_{−∞}^{∞} φ(s) e^{−isx} ds    (2.6)

Thus we have the following upper bound for f:

f(x) ≤ (1/(2π)) ∫_{−∞}^{∞} |φ(s)| ds = c

Using integration by parts twice in equation (2.6), if φ′′ is integrable, i.e. the distribution F has a second moment, we also have

f(x) = −(1/(2πx²)) ∫_{−∞}^{∞} φ′′(s) e^{−isx} ds    (2.7)

which gives the following decaying upper bound for f:

f(x) ≤ x^{−2} (1/(2π)) ∫_{−∞}^{∞} |φ′′(s)| ds = k/x²

In conclusion,

f(x) ≤ c ∧ (k/x²)    (2.8)

for all x, where ∧ means minimum.

Define g(x) = A (c ∧ (k/x²)) to be a density function, where A is a normalizing constant. Equating c = k/x², we solve for the intersection point x = √(k/c), and thus

g(x) = A · { c for |x| < √(k/c);  k/x² for |x| ≥ √(k/c) }

We determine the factor A by integrating over x and setting the result equal to 1:

1 = ∫ g = A ( 2c√(k/c) + 2 ∫_{√(k/c)}^{∞} (k/x²) dx ) = A · 4√(kc)

Thus A = 1/(4√(kc)).

We now have f(x) ≤ g(x)/A from equation (2.8). Thus by the A-R method, we can simulate Y with density g and accept it with probability A f(Y)/g(Y), where f(Y) is evaluated using equation (2.6).


To simulate Y, we can view it as a mixture, with equal weights, of a uniform r.v. U on [0, √(k/c)] and an independent variable X with support (√(k/c), ∞) and density h(x) = k/(√(kc) x²), with a random sign attached afterwards. This X has distribution function

H(x) = ∫_{√(k/c)}^{x} h(y) dy = 1 − √(k/c)/x,   x > √(k/c)

Thus we can set X = √(k/c)/U for a random number U, by inversion. So we have the following algorithm (Algorithm 5) for simulating Y.

Algorithm 5 Simulating Y

Generate random numbers U_1, U_2, U_3.
if U_1 < 1/2 then
    Set the sign s = 1.
else
    Set the sign s = −1.
end if
if U_2 < 1/2 then
    Return Y = s √(k/c) U_3.
else
    Return Y = s √(k/c)/U_3.
end if

Exercises

Exercise 2.1. Write a routine for generating a Weibull r.v. X with tail

P(X > x) = e^{−x^β},   x > 0

by inversion. Check the routine via a histogram of simulated values plotted against the theoretical density, say for β = 1/2.

Solution 2.1. The distribution function is

F(x) = 1 − P(X > x) = 1 − exp(−x^β)

The inverse map is obtained by solving F(x) = u:

F^{−1}(u) = (− log(1 − u))^{1/β}

And since 1 − U is also uniform if U is, we can simulate from the Weibull by setting

X = (− log(U))^{1/β}

The density of the Weibull distribution is

f(x) = β x^{β−1} exp(−x^β)
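A sketch of such a routine, with an optional histogram check (the plotting lines are illustrative only and assume matplotlib and numpy are installed):

import math
import random

def weibull(beta, rng=random):
    """Sample X with tail P(X > x) = exp(-x**beta) by inversion: X = (-log U)**(1/beta)."""
    return (-math.log(1.0 - rng.random())) ** (1.0 / beta)

def density(x, beta):
    """Theoretical Weibull density f(x) = beta * x**(beta-1) * exp(-x**beta)."""
    return beta * x ** (beta - 1.0) * math.exp(-x ** beta)

samples = [weibull(0.5) for _ in range(100_000)]
# Optional histogram check:
# import numpy as np, matplotlib.pyplot as plt
# xs = np.linspace(0.01, 20, 400)
# plt.hist(samples, bins=200, range=(0, 20), density=True)
# plt.plot(xs, [density(x, 0.5) for x in xs])
# plt.show()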


Exercise 2.2. For the uniform(0, 1) distribution, derive the relevant formulas for generating max(U_1, · · · , U_n) and min(U_1, · · · , U_n) by inversion.

Solution 2.2. The distribution of max_i U_i is

P( max_i U_i ≤ x ) = P(U_i ≤ x, i = 1, · · · , n) = x^n

Thus by inversion, we can simulate max_i U_i as U^{1/n} for a random number U.

On the other hand, since min_i U_i = 1 − max_i(1 − U_i), we can use the above method and take 1 − U^{1/n} to simulate the minimum (using the fact that 1 − U is also uniform on (0, 1) if U is).

Exercise 2.3. Show that F^← is left-continuous. Is the distinction between right and left continuity important for r.v. generation by inversion?

Solution 2.3. Note that from the switching formula (2.1) we also have the equivalent strict inequalities

y > F(x) ⇔ F^←(y) > x    (2.9)

Now consider an increasing sequence u_n ↑ u < 1. Since F^← is nondecreasing,

F^←(u_{n−1}) ≤ F^←(u_n) ≤ F^←(u)

for all n. Denote the limit

L = lim_{n→∞} F^←(u_n)

which exists because the increasing sequence F^←(u_n) is bounded by F^←(u). Taking the limit n → ∞ in F^←(u_n) ≤ F^←(u), we get L ≤ F^←(u).

On the other hand, choose an increasing sequence x_m ↑ F^←(u) with x_m < F^←(u) for all m. By (2.9), u > F(x_m) for all m. For fixed m, we can find n large enough that u_n > F(x_m) (since u_n ↑ u). Using (2.9) again, F^←(u_n) > x_m. Letting n → ∞, since F^←(u_n) is increasing, L ≥ x_m. Finally, letting m → ∞, L ≥ F^←(u).

In conclusion L = F^←(u), so F^← is left-continuous. Note that we used the strict-inequality form (2.9) extensively in the proof.

In fact, the distinction between right- and left-continuity can matter for r.v. generation by inversion. Consider the distribution

F(x) = { x for x ∈ [0, 1/3];  1/3 for x ∈ [1/3, 2/3];  2x − 1 for x ∈ [2/3, 1] }

If the inverse is taken to be left-continuous, then F^←(1/3) = 1/3; if instead it is taken to be right-continuous, then F^→(1/3) = 2/3. Thus the output values of inversion under the two conventions can be very different.


Exercise 2.4. Let f be a density and S(f) = {(v, x) : 0 ≤ v ≤ f(x)} the region under f and above the real line. Show that if a bivariate r.v. (V, X) has a uniform distribution on S(f), then the marginal distribution of X has density f. More generally, show that the same is true if S(f, c) = {(v, x) : 0 ≤ v ≤ c f(x)} and (V, X) has a uniform distribution on S(f, c).

Solution 2.4. The area of the region S(f) is |S(f)| = ∫ f = 1, since f is a density. The joint density of (V, X) is thus

g(v, x) = 1/|S(f)| = 1,   (v, x) ∈ S(f)

Integrating out v, we get the marginal density of X:

∫_0^{f(x)} g(v, x) dv = f(x)

Thus the marginal density of X is indeed f.

More generally, the area of the region S(f, c) is |S(f, c)| = ∫ c f = c. The joint density of (V, X) is

g(v, x) = 1/|S(f, c)| = 1/c,   (v, x) ∈ S(f, c)

Integrating out v,

∫_0^{c f(x)} g(v, x) dv = c f(x)/c = f(x)

Therefore the marginal density of X is also f in this case.

Exercise 2.5. As a fast but approximate method for generating an r.v. X from N(0, 1), it has been suggested to use a normalized sum of n uniforms,

X = (S_n − a_n)/b_n

where S_n = U_1 + · · · + U_n. Find a_n, b_n such that X has the correct first two moments. What are the particular features of the common choice n = 12?

Solution 2.5. Equating

E[X] = (E[S_n] − a_n)/b_n = (n/2 − a_n)/b_n = 0
Var(X) = Var(S_n)/b_n² = n/(12 b_n²) = 1

we solve

a_n = n/2,   b_n = √(n/12)

When n = 12, b_n = 1 and the formula simplifies to X = S_{12} − 6.
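A two-line sketch of this approximation (standard library only):

import random

def approx_normal(rng=random):
    """Approximate N(0,1) as the sum of 12 uniforms minus 6 (correct first two moments)."""
    return sum(rng.random() for _ in range(12)) - 6.0

print([round(approx_normal(), 3) for _ in range(5)])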


Exercise 2.6. Assume that the density of X can be written as f(x) = c g(x) H(x), where g(x) is the density of an r.v. Y and H(x) = P(Z > x) is the tail of an r.v. Z. Show that X can be generated by sampling Y, Z independently and rejecting until Y ≤ Z.

Solution 2.6. The algorithm is: sample independent Y, Z, and accept Y if Y < Z. In this way X is distributed like X ∼ (Y | Y < Z), with Y, Z independent. The density of this X is

P(X ∈ dx) = P(Y ∈ dx | Y < Z) = P(Y < Z | Y ∈ dx) P(Y ∈ dx) / P(Y < Z)

= P(Z > x) P(Y ∈ dx) / P(Y < Z) = H(x) g(x) dx / P(Y < Z)

Integrating over x and using ∫ f = 1, we get ∫ gH = 1/c; and since the expression above is also a density, the acceptance probability is P(Y < Z) = 1/c. Therefore X indeed has density f.

Exercise 2.7. Produce 100,000 standard normal r.v.’s using each of the following methods:

1. inversion using the approximation,

2. Box-Muller,

3. Marsaglia polar,

4. A-R,

5. Ratio of uniforms

6. Routine of a standard package.

The desired results are the CPU times needed by the methods. Note that the results of course depend not only on hardware and software, but also on implementation issues such as the efficiency of the r.v. generation scheme for the exponential r.v.'s in method 4.
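A rough sketch of such a timing experiment (Python/NumPy; only two of the six methods are shown, the rest follow the same pattern, and the function names are my own). As noted above, the absolute numbers are hardware- and implementation-dependent.

import time
import numpy as np

rng = np.random.default_rng(2)
N = 100_000

def box_muller(n):
    # Box-Muller: each pair of uniforms gives two independent standard normals
    u1 = rng.random(n // 2)
    u2 = rng.random(n // 2)
    r = np.sqrt(-2.0 * np.log(u1))
    return np.concatenate([r * np.cos(2 * np.pi * u2), r * np.sin(2 * np.pi * u2)])

def library_routine(n):
    # routine of a standard package (method 6)
    return rng.standard_normal(n)

for name, gen in [("Box-Muller", box_muller), ("library", library_routine)]:
    t0 = time.perf_counter()
    x = gen(N)
    dt = time.perf_counter() - t0
    print(f"{name:10s}: {dt * 1e3:7.2f} ms, sample var {x.var():.3f}")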

Exercise 2.8. Write a routine for generation of an inverse Gaussian r.v. X with density

f(x; c, ξ) = c/(x^{3/2}√(2π)) · exp(ξc − (c²/x + ξ²x)/2),  x > 0

by A-R when ξ = c = 1. Check the routine via confidence intervals for E[X], E[X²], and (a little harder!) Var(X), using the known formulas E[X] = c/ξ, Var(X) = c/ξ³.

Solution 2.8. Using the density of the exponential with rate 1/2 as our proposal, i.e. g(x) = e^{−x/2}/2, we have

f(x; 1, 1)/g(x) = √(2/π) · e · x^{−3/2} e^{−1/(2x)},  x > 0

Taking the derivative we find the critical point x = 1/3; plugging in gives the bound

C = √(2/π) · e · 3^{3/2} e^{−3/2} = 3√(6/(πe)) ≈ 2.51
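A sketch of the resulting A-R routine (Python/NumPy; function names are mine), using the Exp(1/2) proposal and the bound C derived above.

import numpy as np

rng = np.random.default_rng(3)

def f_ig(x, c=1.0, xi=1.0):
    """Inverse Gaussian density f(x; c, xi) for x > 0."""
    return c / (x**1.5 * np.sqrt(2 * np.pi)) * np.exp(xi * c - (c**2 / x + xi**2 * x) / 2)

def g_exp(x, rate=0.5):
    return rate * np.exp(-rate * x)

C = 3 * np.sqrt(6 / (np.pi * np.e))    # bound on f/g when xi = c = 1

def sample_ig(n):
    out = []
    while len(out) < n:
        x = rng.exponential(1 / 0.5)            # proposal Exp(1/2)
        u = rng.random()
        if u <= f_ig(x) / (C * g_exp(x)):       # accept with probability f/(C g)
            out.append(x)
    return np.array(out)

x = sample_ig(100_000)
print("E[X]  ~", x.mean(), "(theory 1)")
print("Var(X)~", x.var(), "(theory 1)")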


3 Multivariate Random Variables

Suppose there are n r.v’s Xi, which will be denoted as random vector X = (Xi)ni=1.

3.1 Multivariate Normals

Suppose the X_i are jointly normally distributed with mean vector µ = E[X] and covariance matrix Σ = E[(X − µ)(X − µ)′], where A′ denotes the transpose of a matrix (or vector) A. Since we can always generate a multivariate normal X̃ with mean vector 0 and then set X = µ + X̃ to get the desired vector, in what follows we assume E[X] = 0.

Example 3.1. Consider the case n = 2, where the covariance matrix is completely characterized by

Var(X_i) = σ_i², i = 1, 2,  Cov(X_1, X_2) = ρσ_1σ_2

where ρ is the correlation between X_1, X_2. Then we can first generate 3 independent standard normal r.v's Y_1, Y_2, Y_3 and set

X_1 = σ_1(√(1 − |ρ|) Y_1 + √|ρ| Y_3)
X_2 = σ_2(√(1 − |ρ|) Y_2 + sign(ρ) √|ρ| Y_3)

Proof. The transform between X and Y is given by X = AY where

A = [ σ_1√(1 − |ρ|)    0               σ_1√|ρ|
      0                σ_2√(1 − |ρ|)   sign(ρ) σ_2√|ρ| ]

So

E[X] = A E[Y] = 0

and the covariance matrix Σ = AA′ is exactly

[ σ_1²       ρσ_1σ_2
  ρσ_1σ_2    σ_2²     ]

Note that we have used Y_3 to induce the correlation between X_1, X_2, and Y_1, Y_2 to account for the remaining variances of X_1, X_2 respectively.

For a general multivariate normal X with mean 0 and covariance matrix Σ, we can always find a matrix C such that Σ = CC′ (so C can be viewed as a square root of Σ). Then we only need to generate n independent standard normal r.v's Z = (Z_i)_{i=1}^n, whose covariance matrix is the identity I, and setting X = CZ yields the desired random vector.

In fact, since Σ is a real-valued, symmetric, positive semi-definite matrix, an eigen-decomposition (or spectral decomposition) and a square root are always available. That is, we can always find an orthogonal matrix B such that Σ = BΛB′, where Λ = diag(λ_i) is the diagonal matrix of eigenvalues λ_i of Σ (with λ_i ≥ 0 by semi-definiteness). Then C = B√Λ B′ is a square root of Σ, where √Λ = diag(√λ_i) (well defined since λ_i ≥ 0).

One should note that the square root of Σ is not unique, and we need a way to compute such a matrix C efficiently.
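A sketch of both routes in Python/NumPy (the symmetric square root via the spectral decomposition, and the Cholesky factor discussed next); the covariance matrix below is just an illustrative choice.

import numpy as np

rng = np.random.default_rng(4)

Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])        # example positive definite covariance
mu = np.zeros(2)

# route 1: symmetric square root from the spectral decomposition Sigma = B Lambda B'
lam, B = np.linalg.eigh(Sigma)
C_sym = B @ np.diag(np.sqrt(lam)) @ B.T

# route 2: Cholesky factor (lower triangular), Sigma = C C'
C_chol = np.linalg.cholesky(Sigma)

Z = rng.standard_normal((100_000, 2))
X = mu + Z @ C_chol.T                 # X = mu + C Z, one row per sample

print("empirical covariance:\n", np.cov(X, rowvar=False))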


Remark 3.1. If the covariance matrix Σ is not positive definite, i.e. there is a zero eigenvalue, or equivalently a linear relationship among the X_i, then we face a degenerate situation, which is not our concern here. So in what follows we always assume Σ is positive definite, with all eigenvalues positive.

Example 3.2. (Cholesky factorization) This is an algorithm for decomposing a symmetric positive definite matrix Σ with positive diagonal elements into the form Σ = CC′, where C is a lower triangular matrix, also with positive diagonal elements. It is best described in the following block matrix way.

Write the matrix C in 4 blocks as

C = [ c     0
      c̃     C̃ ]

where c is a scalar, 0 is 1 × (n−1), c̃ is (n−1) × 1 and C̃ is (n−1) × (n−1). Then equating

CC′ = [ c²      c c̃′
        c c̃     c̃c̃′ + C̃C̃′ ]  =  [ σ_1²    Σ_12
                                     Σ_21    Σ_22 ]  = Σ

where Σ has been written in 4 blocks of the same shapes as those of C, and Σ_12′ = Σ_21 by symmetry. So we can solve

c = σ_1 > 0,  c̃ = Σ_21/c

and

C̃C̃′ = Σ_22 − c̃c̃′    (3.1)

If we can prove that the right hand side of equation (3.1) is still a positive definite matrix (of shape (n−1) × (n−1)), then we can recursively factorize it into C̃ by the same procedure, and so obtain the desired C (a sketch of the recursion is given after the proof below).

Proof. The right hand side of equation (3.1) is equal to

Σ̃ = Σ_22 − Σ_21Σ_12/σ_1²

On the other hand, since Σ is positive definite, for any non-zero n-dimensional vector x we have x′Σx > 0. Writing x in 2 blocks x = [x_1, x̃′]′, where x_1 is a scalar and x̃ is an (n−1)-dimensional vector, then

0 < x′Σx = x̃′Σ_22 x̃ + x_1²σ_1² + x̃′Σ_21 x_1 + x_1 Σ_12 x̃

If we choose x_1 = −Σ_12 x̃/σ_1² = −x̃′Σ_21/σ_1², the inequality still holds since x̃ is arbitrary (and x ≠ 0 whenever x̃ ≠ 0). Plugging this in we get

0 < x̃′Σ_22 x̃ + x̃′Σ_21Σ_12 x̃/σ_1² − 2 x̃′Σ_21Σ_12 x̃/σ_1² = x̃′Σ̃x̃

Since x̃ is arbitrary, Σ̃ is also positive definite.
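A direct translation of this block recursion into Python/NumPy (a plain sketch for exposition; in practice one would call a library routine such as numpy.linalg.cholesky):

import numpy as np

def cholesky_recursive(Sigma):
    """Return lower-triangular C with Sigma = C C', following the block recursion above."""
    Sigma = np.asarray(Sigma, dtype=float)
    n = Sigma.shape[0]
    C = np.zeros((n, n))
    c = np.sqrt(Sigma[0, 0])                          # c = sigma_1 > 0
    C[0, 0] = c
    if n == 1:
        return C
    c_vec = Sigma[1:, 0] / c                          # c~ = Sigma_21 / c
    C[1:, 0] = c_vec
    schur = Sigma[1:, 1:] - np.outer(c_vec, c_vec)    # Sigma_22 - c~ c~'
    C[1:, 1:] = cholesky_recursive(schur)             # recurse on the (n-1) x (n-1) block
    return C

Sigma = np.array([[4.0, 2.0, 0.6],
                  [2.0, 3.0, 0.4],
                  [0.6, 0.4, 1.0]])
C = cholesky_recursive(Sigma)
print(np.allclose(C @ C.T, Sigma))                    # True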


Example 3.3. (Symmetric positive correlations) Suppose the covariance is given by

Σ_ii = σ²,  Σ_ij = ρσ², i ≠ j

for some correlation ρ > 0 (with ρ ≤ 1 of course). Then we can easily simulate X using independent standard normal r.v's Z, Y_1, · · · , Y_n and setting

X_k = σ√ρ Z + σ√(1 − ρ) Y_k

Proof. The X_k are still jointly normally distributed because the transform is linear. Besides,

Var(X_k) = σ²ρ + σ²(1 − ρ) = σ²
Cov(X_i, X_j) = Var(σ√ρ Z) = ρσ², i ≠ j
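A short sketch of this one-common-factor construction (Python/NumPy; σ and ρ are illustrative values):

import numpy as np

rng = np.random.default_rng(5)

n, sigma, rho = 5, 2.0, 0.4
m = 200_000

Z = rng.standard_normal(m)                    # shared factor, one per sample
Y = rng.standard_normal((m, n))               # idiosyncratic factors
X = sigma * np.sqrt(rho) * Z[:, None] + sigma * np.sqrt(1 - rho) * Y

emp = np.cov(X, rowvar=False)
print("diagonal  ~", emp[0, 0], "(target", sigma**2, ")")
print("off-diag  ~", emp[0, 1], "(target", rho * sigma**2, ")")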

Example 3.4. (Symmetric negative correlations) The covariance matrix takes the same form as in example 3.3, except that now ρ < 0. But in order to make the covariance matrix Σ positive definite, we need to constrain ρ > −1/(n−1) (and if we only require Σ to be positive semi-definite, we can allow the equality ρ = −1/(n−1)).

Proof. The covariance matrix Σ is positive definite if and only if for any non-zero n-dimensional vector x we have

0 < x′Σx = Σ_{i,j} Σ_ij x_i x_j = Σ_i σ²x_i² + Σ_{i≠j} ρσ² x_i x_j = σ²[(1 − ρ) Σ_i x_i² + ρ (Σ_i x_i)²]

Writing ||x||² = Σ_i x_i², to make the above hold for every x ≠ 0 when ρ < 0 we must have

|ρ|/(1 − ρ) < min_x Σ_i x_i²/(Σ_i x_i)² = 1/max_x (Σ_i x_i/||x||)²

Now x/||x|| is just the normalized version of x, so the optimization problem in the denominator is equivalent to

max_x (Σ_i x_i)²  subject to  Σ_i x_i² = 1

Using the Lagrangian method, the optimum is attained at

x* = (1, · · · , 1)/√n

with optimal value n. Plugging back in we get

|ρ|/(1 − ρ) < 1/n  ⇔  ρ > −1/(n − 1)


Then our proposal is to use n independent standard normal r.v's Y_k and set

X_k = b Y_k − a Σ_{l≠k} Y_l

for some constants a, b, which can be solved from

Var(X_k) = b² + a²(n − 1) = σ²
Cov(X_i, X_j) = −2ab + a²(n − 2) = ρσ²

If ρ = −1/(n − 1), i.e. the maximal negative correlation is reached, then we can solve

a = σ/√(n(n − 1)),  b = σ√(1 − 1/n)

and in this case

X_k = σ ε_k/√(1 − 1/n),  ε_k = Y_k − Ȳ,  Ȳ = (1/n) Σ_i Y_i

where Ȳ is the sample mean and ε_k is the k-th residual.

3.2 Other Parametric Multivariate Distributions

The following examples show 2 distributions with a common marginal distribution for each component.

Example 3.5. (Multivariate t) Suppose the Z_i are i.i.d standard normals, and W is an independent χ² r.v with ν degrees of freedom. Then the variable

T_i = Z_i/√(W/ν)

is t distributed with ν degrees of freedom, and the vector T = (T_i)_{i=1}^n is called multivariate t distributed with ν degrees of freedom.

Note that the T_i are dependent through the common denominator √(W/ν). One can also generalize the situation to dependent Z_i.

Example 3.6. (Multivariate Laplace) Suppose the Z_i are i.i.d standard normals, and W is an independent standard exponential. Then the variable L_i = √W Z_i has a Laplace distribution, and the vector L = (L_i)_{i=1}^n has a multivariate Laplace distribution.

Note that these L_i are dependent through the common factor √W.

The following examples show some distributions with marginal distributions all from one family.

Example 3.7. (Dirichlet) Suppose the Y_i are independent gammas with parameters (a_i, 1), for i = 1, · · · , n, and let S = Σ_i Y_i be their sum. Then the following vector has the so-called Dirichlet distribution with parameters (a_i)_{i=1}^n,

D = (Y_i/S)_{i=1}^n

For each component D_i, its marginal distribution is beta with parameters (a_i, a − a_i) (note that the second parameter is not a but a − a_i), where a = Σ_i a_i.


Proof. We will prove the following general result.

Lemma 3.1. Suppose X, Y are independent gammas with parameters (α, λ) and (β, λ) respectively (so they have the same rate but possibly different shapes). Then the variable

B = X/(X + Y)

is beta distributed with parameters (α, β).

Write down the joint density function of X, Y,

f(x, y) = λ²e^{−λ(x+y)} (λx)^{α−1}(λy)^{β−1}/(Γ(α)Γ(β)),  x, y > 0

Then perform the transform x = sb, y = s(1 − b). Computing the Jacobian of this transform (which contributes a factor s), the joint density of S = X + Y and B is

f(s, b) = λ²s e^{−λs} (λsb)^{α−1}(λs(1 − b))^{β−1}/(Γ(α)Γ(β))
        = [λe^{−λs}(λs)^{α+β−1}/Γ(α + β)] · [b^{α−1}(1 − b)^{β−1} Γ(α + β)/(Γ(α)Γ(β))] = f_S(s) f_B(b)

As can be seen, S and B are also independent, with S gamma distributed with parameters (α + β, λ) and B beta distributed with parameters (α, β).

Returning to the discussion of the Dirichlet distribution: for any component i we can write

D_i = Y_i/S = Y_i/(Y_i + Σ_{j≠i} Y_j)

where Y_i and Σ_{j≠i} Y_j are independent gamma variables with respective parameters (a_i, 1) and (a − a_i, 1). Thus by the previous lemma, D_i is indeed beta distributed with parameters (a_i, a − a_i).

Example 3.8. (Multinomial) Suppose we perform n independent experiments, each of which yields outcome i with probability p_i, i = 1, · · · , k. The number of outcomes equal to i is denoted X_i; the vector X = (X_i)_{i=1}^k is then called multinomially distributed with success probabilities (p_i)_{i=1}^k. The probability mass function is given by

f(n_1, · · · , n_k) = n!/(n_1! · · · n_k!) ∏_{i=1}^k p_i^{n_i},  Σ_i n_i = n

One naive approach to simulate from this distribution is to perform the n experiments one by one, simulating each outcome from the distribution (p_i), and then count the numbers of the different outcomes. This method costs O(n) random numbers.

A more efficient method is to use the conditional distribution of X_i given X_1, · · · , X_{i−1}. First consider the marginal distribution of (X_1, · · · , X_i). If we view all outcomes other than 1, · · · , i as one single outcome, then (X_1, · · · , X_i) is also multinomially distributed, with parameters (p_1, · · · , p_i, 1 − Σ_{j=1}^i p_j),

P(X_j = n_j, j ≤ i) = n!/(n_1! · · · n_i! (n − Σ_{j=1}^i n_j)!) · ∏_{j=1}^i p_j^{n_j} · (1 − Σ_{j=1}^i p_j)^{n − Σ_{j=1}^i n_j}

Therefore the conditional distribution of X_i given all previous counts X_j, j < i, is

P(X_i = n_i | X_j = n_j, j < i) = P(X_j = n_j, j ≤ i)/P(X_j = n_j, j < i) = (m_i choose n_i) q_i^{n_i} (1 − q_i)^{m_i − n_i}

where m_i = n − Σ_{j=1}^{i−1} n_j and q_i = p_i/(1 − Σ_{j=1}^{i−1} p_j). Thus we see that X_i | X_1, · · · , X_{i−1} is binomially distributed with parameters

( n − Σ_{j=1}^{i−1} X_j,  p_i/(1 − Σ_{j=1}^{i−1} p_j) )

In particular, X_1 is binomially distributed with parameters (n, p_1). This is very intuitive: given that the remaining n − Σ_{j=1}^{i−1} X_j experiments contain no outcome in 1, · · · , i−1, we are basically flipping a coin with head corresponding to outcome i and tail corresponding to an outcome larger than i. So we arrive at the following algorithm for simulating from the multinomial distribution.

Algorithm 6 Multinomial Distribution

for i = 1 to k − 1 do
  Generate X_i from the binomial distribution with parameters (n, p_i).
  Set n ← n − X_i, and
  for j = i + 1 to k do
    Set p_j ← p_j/(1 − p_i)
  end for
end for
Set X_k = n.

Note that at each iteration of the algorithm we are excluding the possibility of outcome i.
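A sketch of Algorithm 6 in Python/NumPy (the function name is mine); it uses k − 1 binomial draws instead of n categorical draws.

import numpy as np

rng = np.random.default_rng(6)

def multinomial_sequential(n, p):
    """Sample one multinomial vector via X_i | past ~ Binomial(remaining n, p_i / remaining prob)."""
    p = np.asarray(p, dtype=float)
    k = len(p)
    x = np.zeros(k, dtype=int)
    remaining_n = n
    remaining_p = 1.0
    for i in range(k - 1):
        prob = min(1.0, p[i] / remaining_p)   # guard against round-off slightly above 1
        x[i] = rng.binomial(remaining_n, prob)
        remaining_n -= x[i]
        remaining_p -= p[i]
    x[k - 1] = remaining_n
    return x

print(multinomial_sequential(100, [0.2, 0.3, 0.5]))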

Before we proceed to the next example, a little detour into differential forms.

Definition 3.1. The area form in n-dimensional space is defined to be the following (n−1)-form

ω(x) = Σ_{j=1}^n (−1)^{j−1} x_j dx_1 ∧ · · · ∧ dx̂_j ∧ · · · ∧ dx_n

where dx̂_j means that the corresponding term does not contain the differential dx_j.

The infinitesimal area in n-dimensional space is defined to be the absolute value of this area form, |ω(x)|.


For example, suppose the transform is given by (spherical transform)

x_i = r u_i, i = 1, · · · , n

where r² = Σ_i x_i² and Σ_i u_i² = 1. Then the original n-form can be written as

∧_i dx_i = ∧_i (u_i dr + r du_i) = r^{n−1} dr ∧ ω(u)

where we have discarded the term r^n du_1 ∧ · · · ∧ du_n because it represents the infinitesimal volume (up to a sign) on the unit sphere, which is 0 (the sphere has lower dimension).

Example 3.9. (Uniform on a region Ω) Suppose in d-dimensional space R^d there is a finite (bounded) region Ω, and we want to simulate X = (X_i)_{i=1}^d uniformly distributed on this region.

The case where |Ω| > 0 (the region has positive volume) was already discussed in proposition 2.2 and remark 2.8.

In this example we consider a special case of the situation where |Ω| = 0 (the region has lower dimension than d, or is simply too sparse). In particular, suppose Ω is the unit sphere in R^d,

Ω = {x : x_1² + · · · + x_d² = 1}

We can simulate the uniform distribution on this sphere by generating d i.i.d standard normals Z_i and setting

X_i = Z_i/√(Σ_j Z_j²)

Proof. The joint density function of the Z_i is given by

f(z_1, · · · , z_n) = (1/√(2π))^n exp(−Σ_i z_i²/2)

Then we make the transform

z_i = r x_i, i = 1, · · · , n,  r² = Σ_i z_i²

According to the previous discussion, the infinitesimal probability becomes

f(z) dz_1 · · · dz_n = (1/√(2π))^n e^{−r²/2} r^{n−1} dr dS

where dS = |ω(x)| is the infinitesimal area. We argue that the density on the unit sphere is obtained by integrating out r in the above expression. So let us compute it:

(dS/√(2π)^n) ∫_0^∞ r^{n−1} e^{−r²/2} dr = (2^{n/2−1} dS/√(2π)^n) ∫_0^∞ t^{n/2−1} e^{−t} dt = (Γ(n/2)/(2π^{n/2})) dS = dS/A_n

where A_n is the surface area of the unit sphere in n-dimensional space. Thus the joint density of the X_i is indeed the inverse of the surface area A_n, and hence they are uniformly distributed on the unit sphere.
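A two-line sketch of this recipe in Python/NumPy:

import numpy as np

rng = np.random.default_rng(7)

d, m = 3, 100_000
Z = rng.standard_normal((m, d))
X = Z / np.linalg.norm(Z, axis=1, keepdims=True)    # points uniform on the unit sphere in R^d

print("unit norms:", np.allclose(np.linalg.norm(X, axis=1), 1.0))
print("mean      :", X.mean(axis=0))                # ~ 0 by symmetry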


Example 3.10. (Marshall-Olkin bivariate exponentials) Let T_1, T_2, T_12 be independent exponentials with rates λ_1, λ_2, λ_12, respectively. Define X_i = T_i ∧ T_12, for i = 1, 2, where ∧ means minimum. X_1, X_2 are correlated through the common term T_12.

The joint tail distribution of X_1, X_2 can be computed as

P(X_1 > x_1, X_2 > x_2) = P(T_1 > x_1, T_2 > x_2, T_12 > max(x_1, x_2)) = exp(−(λ_1x_1 + λ_2x_2 + λ_12 max(x_1, x_2)))

due to the independence of T_1, T_2, T_12.
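Simulation of this pair is immediate; a minimal sketch (Python/NumPy, rates chosen for illustration):

import numpy as np

rng = np.random.default_rng(8)

def marshall_olkin(m, lam1, lam2, lam12):
    """Generate m samples of (X1, X2) = (T1 ^ T12, T2 ^ T12) with independent exponentials."""
    t1 = rng.exponential(1 / lam1, m)
    t2 = rng.exponential(1 / lam2, m)
    t12 = rng.exponential(1 / lam12, m)
    return np.minimum(t1, t12), np.minimum(t2, t12)

x1, x2 = marshall_olkin(100_000, 1.0, 2.0, 0.5)
# empirical check of the joint tail at (0.3, 0.2)
emp = np.mean((x1 > 0.3) & (x2 > 0.2))
theo = np.exp(-(1.0 * 0.3 + 2.0 * 0.2 + 0.5 * max(0.3, 0.2)))
print(emp, theo)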

3.3 Copulas

Definition 3.2. A copula is the distribution of a random vector U = (U_i)_{i=1}^n where each U_i is marginally uniformly distributed on (0, 1), but the U_i may be dependent among each other.

Recall from proposition 2.1 that if X has distribution function F, then the transformed variable F(X) is uniformly distributed on (0, 1). We can then define the copula of a random vector.

Definition 3.3. For a general random vector X = (X_i)_{i=1}^n, suppose X_i has distribution function F_i. The copula of X is the distribution of the random vector

U = F(X) = (F_i(X_i))_{i=1}^n

so U is indeed a copula in the sense of the first definition.

Definition 3.4. The distribution of the vector 1 − U = (1 − U_i)_{i=1}^n is called the tail copula, where U is itself a copula.

Example 3.11. If a copula U is completely positively dependent, i.e. U_1 = · · · = U_n, then the copula is called comonotone. The distribution of the comonotone copula is

P(U_i ≤ u_i, i = 1, · · · , n) = P(U_1 ≤ min_i u_i) = min_i u_i

If the copula variables U_i are independent, then the distribution is

P(U_i ≤ u_i, i = 1, · · · , n) = ∏_{i=1}^n u_i

If X is multivariate normally distributed then its copula is called a Gaussian copula. If X is multivariate t distributed, then its copula is called a multivariate t copula.

Example 3.12. Copulas can be used in the following way. Take as a starting point a random vector X with some dependence structure (for example, X multivariate normal with non-diagonal covariance matrix). Suppose the F_i are the marginal distributions of the X_i. Then we construct the copula of X by

U = F(X) = (F_1(X_1), · · · , F_n(X_n))

Then choose some marginal distributions G_i and apply inversion to each of these variables, getting the target

Y = G^{−1}(U) = (G_i^{−1}(F_i(X_i)))_{i=1}^n

Then each Y_i has marginal distribution G_i, but the dependence structure inside X is inherited by Y.
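A sketch of this two-step construction (Python with NumPy/SciPy; the Gaussian starting point, the correlation, and the exponential targets are illustrative choices of mine):

import numpy as np
from scipy import stats

rng = np.random.default_rng(9)

# step 1: dependent starting vector X (bivariate normal with correlation 0.7)
rho = 0.7
C = np.linalg.cholesky(np.array([[1.0, rho], [rho, 1.0]]))
X = rng.standard_normal((100_000, 2)) @ C.T

# step 2: push each margin through its own c.d.f., then through the target inverse c.d.f.
U = stats.norm.cdf(X)                        # the Gaussian copula of X
Y = stats.expon.ppf(U, scale=[1.0, 2.0])     # exponential marginals with means 1 and 2

rho_s, _ = stats.spearmanr(Y[:, 0], Y[:, 1])
print("marginal means        :", Y.mean(axis=0))   # ~ [1, 2]
print("rank correlation kept :", rho_s)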

Sometimes, however, one simulates directly from a given copula U, without any starting vector, as shown in the following examples.

To simplify the discussion, in what follows we only consider the case n = 2, i.e. copulas with 2 variables. We will use C(u_1, u_2) to denote the distribution function and c(u_1, u_2) the density function of a copula.

Example 3.13. (A-R) If the density function c(u_1, u_2) is explicitly available and bounded above by a constant K, we can use the acceptance-rejection method to simulate this copula: generate independent random numbers U_1, U_2 and accept them with probability c(U_1, U_2)/K (using another random number U, accept if U ≤ c(U_1, U_2)/K).

Proof. The resulting pair — denoted (V_1, V_2) — is distributed as

(V_1, V_2) ∼ (U_1, U_2) | U ≤ c(U_1, U_2)/K

and its joint density is given by

P(V_1, V_2 ∈ dv_1 dv_2) = P(U ≤ c(v_1, v_2)/K) P(U_1, U_2 ∈ dv_1 dv_2)/P(U ≤ c(U_1, U_2)/K)
                        = c(v_1, v_2) dv_1 dv_2/(K P(U ≤ c(U_1, U_2)/K))

where P(U_1, U_2 ∈ dv_1 dv_2) = dv_1 dv_2. Since c(v_1, v_2) is a density over the unit square (0, 1)², integrating the above w.r.t. v_1, v_2 and equating to 1, the acceptance probability is

P(U ≤ c(U_1, U_2)/K) = 1/K

and the density of the resulting (V_1, V_2) is indeed c(v_1, v_2).

Thus the expected number of trials is K, i.e. 2K proposal uniforms are needed on average (plus one more uniform per trial for the acceptance check). For general n, the expected number of proposal uniforms is nK.

Following are some important copulas and their simulation algorithms.

Example 3.14. (Farlie-Gumbel-Morgenstern, [2, chapter 10]) The distribution of this copula is given by

C(u_1, u_2) = u_1u_2(1 + ε(1 − u_1)(1 − u_2))

where ε ∈ [−1, 1]. To generate this copula, we can first generate a random number V_1, set U_1 = V_1, and then generate U_2 from the conditional distribution of U_2 given U_1.


So let us compute the conditional distribution of U_2 given U_1. First, the joint density function is obtained from the distribution by differentiating,

c(u_1, u_2) = 1 + ε(2u_1 − 1)(2u_2 − 1)

Since the marginal density of U_1 is simply 1 (each component of a copula is uniform on (0, 1)), we have

c(u_2|u_1) = c(u_1, u_2) = 1 + ε(2u_1 − 1)(2u_2 − 1)

The conditional distribution of U_2 given U_1 = u_1 is thus

C(u_2|u_1) = ∫_0^{u_2} c(x|u_1) dx = u_2 + ε(2u_1 − 1) u_2(u_2 − 1)

Equating C(u_2|u_1) = v and solving for the inverse function C^{−1}(v; u_1), we get the quadratic equation in u_2

a u_2² + b u_2 − v = 0,  a = ε(2u_1 − 1),  b = 1 − a

whose root lying in [0, 1] is (for a ≠ 0)

u_2 = (√d − b)/(2a) = C^{−1}(v; u_1),  d = b² + 4av

(when a = 0 simply u_2 = v). Then we can generate a random number V_2 and set U_2 = C^{−1}(V_2; U_1).
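A sketch of this conditional-inversion sampler (Python/NumPy; the value of ε is illustrative), using the conditional distribution derived above:

import numpy as np

rng = np.random.default_rng(10)

def fgm_copula(m, eps):
    """Sample m pairs from the Farlie-Gumbel-Morgenstern copula with parameter eps in [-1, 1]."""
    u1 = rng.random(m)
    v = rng.random(m)
    a = eps * (2 * u1 - 1)              # coefficient of u2^2 in a*u2^2 + b*u2 - v = 0
    b = 1 - a
    d = b**2 + 4 * a * v
    a_safe = np.where(np.abs(a) < 1e-12, 1e-12, a)   # avoid division by zero when a ~ 0
    u2 = np.where(np.abs(a) < 1e-12, v, (np.sqrt(d) - b) / (2 * a_safe))
    return u1, u2

u1, u2 = fgm_copula(200_000, eps=0.8)
print("marginal means:", u1.mean(), u2.mean())        # both ~ 1/2
print("correlation   :", np.corrcoef(u1, u2)[0, 1])   # theory: eps/3 ~ 0.267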

Example 3.15. (Archimedean family) The following is a family of copulas whose distribution is given by

C(u_1, u_2) = φ^{−1}(φ(u_1) + φ(u_2)) if φ(u_1) + φ(u_2) ≤ φ(0), and 0 otherwise,

where the generator φ : [0, 1] → [0, ∞] is a twice-differentiable function on (0, 1) with φ(1) = 0 and φ′(u) < 0, φ″(u) > 0 for all u ∈ (0, 1).

To simulate an Archimedean copula one can first generate 2 independent random numbers V_1, V_2 and then set

U_1 = φ^{−1}(V_1 φ(ω)),  U_2 = φ^{−1}((1 − V_1) φ(ω))

where ω = K^{−1}(V_2) with

K(t) = t − φ(t)/φ′(t)

One could check theorem 4.3.4 in [3].

Some examples of Archimedean copulas: the Clayton copula has generator φ(u) = u^{−α} − 1, α > 0. The Frank copula has generator

φ(u) = −log((e^{αu} − 1)/(e^{α} − 1))


Example 3.16. (Frailty) In this example there is a latent variable Z, given which the copula variables U_1, U_2 are conditionally independent. That is,

P(U_1, U_2 | Z) = P(U_1 | Z) P(U_2 | Z)

In addition, the conditional distribution is given by

P(U_i ≤ u_i | Z = z) = u_i^z, i = 1, 2

Then the joint distribution of this copula is

P(U_i ≤ u_i, i = 1, 2) = ∫ P(U_i ≤ u_i, i = 1, 2 | Z = z) f_Z(z) dz = ∫ u_1^z u_2^z f_Z(z) dz = E[(u_1u_2)^Z]

Note that

(u_1u_2)^Z = exp(Z(log u_1 + log u_2))

If we define ϕ(s) = E[e^{sZ}] to be the moment generating function of Z, then the above joint distribution is exactly

C(u_1, u_2) = ϕ(log u_1 + log u_2)

Remark 3.2. I found another formulation of the frailty copula as follows. Let the Y_i be i.i.d standard exponentials, and let Z be another independent variable, called the frailty. Define X_i = Y_i/Z; then the marginal tail distribution of X_i is

P(X_i > x_i) = ∫ e^{−z x_i} f_Z(z) dz = E[e^{−Z x_i}] = ϕ(−x_i)

Thus all X_i have the same marginal tail distribution F̄(x) = ϕ(−x), where again ϕ is the moment generating function of Z. The joint tail distribution of the X_i is then

P(X_i > x_i, i = 1, 2) = E[e^{−(x_1+x_2)Z}] = ϕ(−(x_1 + x_2))

Consider the tail copula U_i = F̄(X_i) = ϕ(−X_i), whose distribution function is

P(U_i ≤ u_i, i = 1, 2) = P(X_i ≥ F̄^{−1}(u_i), i = 1, 2) = P(X_i ≥ −ϕ^{−1}(u_i), i = 1, 2) = ϕ(ϕ^{−1}(u_1) + ϕ^{−1}(u_2))

Thus the copula (U_1, U_2) defined above is in fact Archimedean with generator ϕ^{−1}.

This formulation seems to be the standard one, since it can be used to solve exercise 3.6, while the one in the text cannot.

3.4 Tail Dependence

Consider a bivariate random vector (X_1, X_2) with joint c.d.f F(x_1, x_2) = P(X_i ≤ x_i, i = 1, 2) and marginals F_i(x) = P(X_i ≤ x).

Before proceeding, we introduce the following concepts.


Definition 3.5. Consider a set S ⊂ R², and (x_1, y_1), (x_2, y_2) ∈ S. We say S is increasing if

x_1 < x_2 ⇒ y_1 ≤ y_2

and decreasing if

x_1 < x_2 ⇒ y_1 ≥ y_2

An increasing or decreasing set must be a line or curve, in the sense that for any point (x_0, y_0) ∈ S there is no ε > 0 such that the ε-neighbourhood

U_0(ε) = {(x, y) : |x − x_0|² + |y − y_0|² < ε²}

lies totally inside S. In fact, suppose S is increasing and there is some ε_0 > 0 such that U_0(ε_0) ⊂ S; then choosing

x_1 = x_0 + ε_0/2,  y_1 = y_0 − ε_0/2

we have (x_1, y_1) ∈ U_0(ε_0) ⊂ S, but x_1 > x_0 and y_1 < y_0, contradicting the definition of an increasing set.

We then have the following inequality between joint and marginal distributions.

Proposition 3.1. 1. Fréchet-Hoeffding bounds:

(F_1(x_1) + F_2(x_2) − 1)^+ ≤ F(x_1, x_2) ≤ F_1(x_1) ∧ F_2(x_2)    (3.2)

(recall that ∧ means minimum).

2. The upper bound is attained if and only if the support of (X_1, X_2) is an increasing set.

3. The lower bound is attained if and only if the support is a decreasing set.

An example of an increasing (decreasing) set is the graph of a nondecreasing (nonincreasing) function; in particular, consider f(x) = ±x. We have the following corollary.

Corollary 3.1. Suppose the marginals are identical, F = F_1 = F_2. The upper bound is attained for the comonotonic copula X_1 = X_2, and the lower bound is attained for the countermonotonic copula X_1 = F←(U), X_2 = F←(1 − U) with U uniform on (0, 1).

Proof. We only need to prove that the support of the countermonotonic copula is indeed decreasing. Let X_1′ = F←(U′), X_2′ = F←(1 − U′) for another value U′ in (0, 1). If X_1 < X_1′, i.e. F←(U) < F←(U′), then U < U′, and thus 1 − U > 1 − U′; applying F← gives X_2 ≥ X_2′, as desired.

We now introduce a characterization of extreme-event dependence, which is important in risk management.

Definition 3.6. The limit

λ_u = lim_{t↑1} P(F_1(X_1) > t | F_2(X_2) > t) = 2 − lim_{t↑1} (1 − C(t, t))/(1 − t)

is called the upper tail dependence coefficient, and similarly,

λ_l = lim_{t↓0} P(F_1(X_1) ≤ t | F_2(X_2) ≤ t) = lim_{t↓0} C(t, t)/t

is called the lower tail dependence coefficient.


For the independence copula C(t, t) = t², we have λ_u = 0, and for the comonotonic copula C(t, t) = t, we have λ_u = 1. But as we shall see, for the bivariate Gaussian copula with correlation ρ < 1, λ_u is also 0, which indicates that the tail dependence coefficient is a rather rough characterization of the dependence structure.

Exercises

Exercise 3.1. Let X1, · · · , Xp be continuous r.v.’s and h1, · · · , hp strictly increasing func-tions. Show that X1, · · · , Xp and h1(X1), · · · , hp(Xp) generate the same copula.

Solution 3.1. Suppose X_i has distribution F_i. Then the copula of (X_i)_{i=1}^p is given by (F_i(X_i))_{i=1}^p. On the other hand, since h_i is strictly increasing, the inverse function h_i^{−1} exists. The distribution of the transformed variable h_i(X_i) is thus

P(h_i(X_i) ≤ x) = P(X_i ≤ h_i^{−1}(x)) = F_i(h_i^{−1}(x))

So the distribution of h_i(X_i) is F_i ∘ h_i^{−1}, where ∘ means functional composition. The copula of (h_i(X_i))_{i=1}^p is therefore

(F_i ∘ h_i^{−1}(h_i(X_i)))_{i=1}^p = (F_i(X_i))_{i=1}^p

which is exactly the copula of (X_i)_{i=1}^p.

Exercise 3.2. Show that the tail copula of the Marshall–Olkin bivariate exponential distribution is given by

C(u_1, u_2) = (u_1^{1−α_1} u_2) ∧ (u_1 u_2^{1−α_2})

where α_i = λ_12/(λ_i + λ_12). (Recall that ∧ means the minimum of the two sides.)

Solution 3.2. See example 3.10 for a recap. The marginal tail distribution of X_i is e^{−(λ_i+λ_12)x}, thus the tail copula of the Marshall-Olkin bivariate exponential is the distribution of

(exp(−(λ_1 + λ_12)X_1), exp(−(λ_2 + λ_12)X_2))

Its joint distribution function can be computed (using the joint tail distribution of the X_i from example 3.10) as

C(u_1, u_2) = P(exp(−(λ_i + λ_12)X_i) ≤ u_i, i = 1, 2)
            = P(X_i ≥ −log(u_i)/(λ_i + λ_12), i = 1, 2)
            = u_1^{1−α_1} u_2^{1−α_2} (u_1^{α_1} ∧ u_2^{α_2})

which is exactly equal to the desired expression.

Exercise 3.3. Show that the Morgenstern copula depends only on ε but not F1, F2.

Solution 3.3. The bivariate Morgenstern (FGM) distribution has the form F(x_1, x_2) = F_1(x_1)F_2(x_2)(1 + ε(1 − F_1(x_1))(1 − F_2(x_2))). Substituting u_i = F_i(x_i), its copula is C(u_1, u_2) = u_1u_2(1 + ε(1 − u_1)(1 − u_2)), as in example 3.14, which involves only ε and not F_1, F_2.


Exercise 3.4. Show that the frailty copula with Z having the Gamma density z^{1/α−1}e^{−z}/Γ(1/α) reduces to the Clayton copula.

Solution 3.4. The moment generating function of the gamma variable Z is

ϕ(s) = (1 − s)^{−1/α}, s < 1

Its inverse can be solved as

ϕ^{−1}(u) = 1 − u^{−α}

Thus, according to remark 3.2, the copula with frailty Z is

C(u_1, u_2) = (u_1^{−α} + u_2^{−α} − 1)^{−1/α}    (3.3)

which is the same as the Archimedean copula with Clayton generator φ(u) = u^{−α} − 1.

Exercise 3.5. Show that the Clayton copula approaches comonotonicity as α → ∞ and independence as α → 0. Show more generally that the dependence is increasing in α, in the sense that the c.d.f is nondecreasing in α.

Solution 3.5. Recall from example 3.11 that the comonotone copula has distribution C(u_1, u_2) = min(u_1, u_2) and the independence copula has distribution C(u_1, u_2) = u_1u_2. Let us compute the limits w.r.t. α, using the expression (3.3) for the Clayton copula.

(α → ∞) Write C(u_1, u_2) = u_1(1 + (u_1/u_2)^α − u_1^α)^{−1/α}. If u_1 = min(u_1, u_2) < 1, then (u_1/u_2)^α → 0 and u_1^α → 0 as α → ∞, so the bracket tends to 1 and

lim_{α→∞} C(u_1, u_2) = u_1

Similarly if u_2 = min(u_1, u_2) < 1, the limit is u_2. In conclusion lim_{α→∞} C(u_1, u_2) = min(u_1, u_2), the comonotone copula; as α → ∞ the Clayton copula approaches comonotonicity.

(α → 0) We have

lim_{α→0} C(u_1, u_2) = u_1u_2 lim_{α→0} (u_1^α + u_2^α − (u_1u_2)^α)^{−1/α} = u_1u_2

so as α → 0 the Clayton copula approaches independence.

(Monotonicity in α) Write p = u_1^{−α}, q = u_2^{−α} (both ≥ 1) and T = p + q − 1, so that log C = −(1/α) log T. Differentiating,

∂/∂α log C = (1/α²)[log T − α T′/T],  where α T′ = p log p + q log q

so ∂C/∂α ≥ 0 is equivalent to T log T ≥ p log p + q log q. The function G(p, q) = (p + q − 1) log(p + q − 1) − p log p − q log q satisfies G(1, q) = 0 and ∂G/∂p = log((p + q − 1)/p) ≥ 0 for q ≥ 1, hence G ≥ 0 for all p, q ≥ 1. Thus the distribution of the Clayton copula is nondecreasing in α, and the dependence inside the Clayton copula increases with α.

Exercise 3.6. Show that a frailty copula is Archimedean with generator given by φ^{−1} = ϕ.

Solution 3.6. See remark 3.2.

Exercise 3.7. Show that the bivariate Gaussian copula is tail-independent (λ = 0), but thebivariate t copula is not.


Solution 3.7. The following results are according to [5].

Definition 3.7. The multivariate distribution of a random vector (X_i)_{i=1}^n is elliptical if its density has the form

f(x) = (1/√|Σ|) g((x − µ)′Σ^{−1}(x − µ))

where µ is the mean vector and Σ the covariance matrix. The function g is called the density generator.

Note that the density generator is different from the density itself.

Proposition 3.2. Suppose the elliptically distributed (X_1, X_2) has density generator g and, for every t > 1,

lim_{x→∞} g(tx)/g(x) = 0

Then their copula must be tail independent.

The bivariate Gaussian distribution is elliptical. For the bivariate Gaussian copula with zero mean, the density generator has the form

g(x) = (1/(2π)) exp(−x/2)

and for t > 1,

lim_{x→∞} g(tx)/g(x) = lim_{x→∞} exp(−x(t − 1)/2) = 0

Thus the bivariate Gaussian copula is tail independent by the above proposition.

Definition 3.8. For a function f : R_+ → R_+ (mapping the positive real line to itself), if for any t > 0 we have

lim_{x→∞} f(tx)/f(x) = t^α

for some constant α, then we say f is regularly varying at infinity, and α is called the regular variation index of f.

Proposition 3.3. Suppose the elliptically distributed (X_1, X_2) has a density generator g that is regularly varying. Then their copula is tail dependent with coefficient

λ = ∫_0^{h(ρ)} u^α/√(1 − u²) du  /  ∫_0^1 u^α/√(1 − u²) du

where ρ is the correlation coefficient between X_1, X_2, and

h(ρ) = √((1 + ρ)/2)


We must point out that even when ρ = 0, i.e. X_1, X_2 are uncorrelated, we have h(ρ) = 1/√2 and thus λ ∈ (0, 1). Also, the upper and lower tail dependence coefficients coincide, due to the symmetry of the elliptical distribution.

Now consider a bivariate t distributed (X_1, X_2) with ν degrees of freedom, whose distribution is also elliptical, with density generator

g(x) = (Γ((ν + 2)/2)/(νπΓ(ν/2))) (1 + x/ν)^{−(ν+2)/2}

For all t > 0,

lim_{x→∞} g(tx)/g(x) = lim_{x→∞} ((1 + tx/ν)/(1 + x/ν))^{−(ν+2)/2} = t^{−(ν+2)/2}

Thus the density generator of the bivariate t distribution is regularly varying at infinity with index −(ν + 2)/2, and therefore the bivariate t copula possesses tail dependence.

4 Simple Stochastic Processes

Example 4.1. (Discrete Markov process — naive approach) Consider a Markov process X_n with finite state space E, transition probabilities p_ij, and initial distribution π_i. To generate this process, we can simulate the initial state X_0 from the distribution π, and then generate the next state according to the conditional probability

P(X_1 = j | X_0 = i) = p_ij

which is exactly the ith row of the transition probability matrix. For the subsequent steps, if the current state is X_n = i, we still use the ith row of the transition matrix to simulate the next state X_{n+1}, due to the Markov property

P(X_{n+1} = j | X_n = i) = p_ij

Example 4.2. (Markov processes in time series — recursion approach) In some cases, particularly in the field of time series, the natural way to generate a process is through a recursion formula, and it then turns out that the process is actually Markov.

A simple example is the autoregressive process characterized by the recursion

X_{n+1} = aX_n + ε_n

where the ε_n are i.i.d with common density f. The transition probabilities are then given by

P(X_{n+1} = y | X_n = x) = P(ε_n = y − ax) = f(y − ax)

which is clearly stationary (and, from the recursion, given X_n the next value X_{n+1} depends only on X_n and is independent of the older past).

Another example is the GARCH model (Generalized Autoregressive Conditionally Heteroscedastic), described by the recursions

X_n = σ_n Z_n,  σ_n² = α_0 + α_1 X_{n−1}² + β σ_{n−1}²


where the Z_k are i.i.d with density f. The Markov process in this case is not X_n alone but the pair (X_n, σ_n²). In fact we can see this from

P(X_{n+1} = y, σ_{n+1}² = t | X_n = x, σ_n² = s) = P(Z_{n+1} = y/√t) I(α_0 + α_1x² + βs = t) = f(y/√t) I(α_0 + α_1x² + βs = t)

Note that given X_n, σ_n² we can completely determine the value of σ_{n+1}² from the above recursion.

Example 4.3. (Inhomogeneous Poisson process — thinning approach) Let us first consider the case of a homogeneous Poisson process N(t) with rate λ. Its interarrival times are i.i.d exponentials with rate λ. Thus we can generate the arrival times S_n by first generating these i.i.d exponentials X_1, X_2, · · · and then setting

S_n = Σ_{i=1}^n X_i

Suppose now we have an inhomogeneous Poisson process N′(t) with intensity λ(t) at time t. Also suppose the intensity is bounded, λ(t) ≤ λ for all t, which is required for the following simulation method to work.

To generate N′(t), we first use the above method (with rate λ) to generate a stream of arrival times S_n, and then accept an arrival at time S_n = t with probability λ(t)/λ. This thinned (or sampled) process is exactly N′(t).

Proof. Let us compute the probability that there is exactly one accepted event within the interval (t, t + dt). Let the underlying homogeneous Poisson process be N(t); then

P(N′(t + dt) − N′(t) = 1) = P(N(t + dt) − N(t) = 1) P(event counted | N(t + dt) − N(t) = 1) = λ dt · λ(t)/λ = λ(t) dt

Note that to increase the acceptance probability we want the bound λ to be close to λ(t) for every t. For a general intensity λ(t) (not necessarily bounded), we can use a different rate λ on each of several intervals to envelope it piecewise; this way we can also keep the acceptance probability large over time. See [4, chapter 11].
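A sketch of the thinning procedure (Python/NumPy), for the illustrative intensity λ(t) = 2 + sin(t), which is bounded by λ̄ = 3:

import numpy as np

rng = np.random.default_rng(11)

def thinned_poisson(T, lam_fn, lam_bar):
    """Arrival times on [0, T] of an inhomogeneous Poisson process with intensity lam_fn <= lam_bar."""
    arrivals = []
    t = 0.0
    while True:
        t += rng.exponential(1.0 / lam_bar)          # next arrival of the dominating homogeneous process
        if t > T:
            break
        if rng.random() <= lam_fn(t) / lam_bar:      # keep it with probability lambda(t)/lambda_bar
            arrivals.append(t)
    return np.array(arrivals)

lam = lambda t: 2.0 + np.sin(t)
times = thinned_poisson(100.0, lam, 3.0)
print("number of arrivals:", len(times), " expected ~", 2 * 100 + (1 - np.cos(100)))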

An alternative procedure is available when the function

Λ(t) = ∫_0^t λ(s) ds

has an inverse function Λ^{−1}(z). Before introducing the method, let us recall a property of the nonhomogeneous Poisson process.

Proposition 4.1. The increment N(t) − N(s) is Poisson distributed with mean Λ(t) − Λ(s).


For a proof, see [4, chapter 5]. We can then obtain the distribution of the first arrival time T_1, and the conditional distribution of T_n given T_{n−1}, as follows:

P(T_1 > t) = P(N(t) = 0) = e^{−Λ(t)}

and

P(T_n > t | T_{n−1} = s) = P(N(t) − N(s) = 0) = e^{−(Λ(t)−Λ(s))}

Thus we can generate independent standard exponentials X_1, X_2, · · · and set

T_1 = Λ^{−1}(X_1),  T_n = Λ^{−1}(Σ_{i=1}^n X_i)

This procedure is correct since both Λ and Λ^{−1} are increasing, and

P(T_1 > t) = P(X_1 > Λ(t)) = e^{−Λ(t)}
P(T_n > t | T_{n−1} = s) = P(X_1 + · · · + X_n > Λ(t) | X_1 + · · · + X_{n−1} = Λ(s)) = P(X_n > Λ(t) − Λ(s)) = e^{−(Λ(t)−Λ(s))}

Example 4.4. (Continuous time Markov process) Suppose we want to generate a continuous time Markov chain X_t with state space E = {1, · · · , n}, initial distribution π_i, transition rates λ_i and transition probability matrix p_ij (of course p_ii = 0). The simulation of this process is straightforward, by noting that if the current state is i, the chain stays in this state for an exponential time with rate λ_i and then transits to state j with probability p_ij.

The procedure for simulating the continuous time Markov chain is then given by

Algorithm 7 Continuous Time Markov Process

Simulate X_0 from the initial distribution π_i. Set t = 0.
Simulate an exponential r.v with rate λ_{X_t}, denoted Z. Set t = t + Z.
Simulate the next state from the distribution (p_{X_t, j})_{j=1}^n and set X_t equal to this generated state. Go back to the previous step.
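A sketch of Algorithm 7 (Python/NumPy; the two-state rates and transition matrix are illustrative choices):

import numpy as np

rng = np.random.default_rng(12)

def simulate_ctmc(T, pi0, rates, P):
    """Simulate a continuous-time Markov chain on {0, ..., n-1} up to time T.
    Returns the jump times and the states entered at those times."""
    state = rng.choice(len(pi0), p=pi0)
    t, times, states = 0.0, [0.0], [state]
    while True:
        t += rng.exponential(1.0 / rates[state])   # holding time Exp(lambda_i)
        if t > T:
            break
        state = rng.choice(len(pi0), p=P[state])   # next state from row p_{i, .}
        times.append(t)
        states.append(state)
    return times, states

rates = np.array([1.0, 3.0])
P = np.array([[0.0, 1.0],
              [1.0, 0.0]])
times, states = simulate_ctmc(50.0, [0.5, 0.5], rates, P)
print("number of jumps:", len(times) - 1)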

Example 4.5. (Uniformization of a Markov process) This is an alternative procedure for simulating a continuous time Markov chain. We first choose some upper bound on the transition rates, η ≥ sup_i λ_i, and generate a Poisson process with rate η. Upon each Poisson arrival, if the current state is i, the Markov chain stays in the same state with probability 1 − λ_i/η, and moves to another state j ≠ i with probability λ_i p_ij/η.

This procedure works because of the following.

Proof. Let T_i be the time before the Markov chain leaves state i. Then by conditioning on the number of Poisson events in the interval (0, t), we have

P(T_i > t) = Σ_{n=0}^∞ P(T_i > t | N_t = n) P(N_t = n) = Σ_{n=0}^∞ (1 − λ_i/η)^n e^{−ηt}(ηt)^n/n! = e^{−ηt} e^{(η−λ_i)t} = e^{−λ_i t}


Thus we see that T_i is indeed exponentially distributed with rate λ_i. On the other hand, given that the chain leaves its current state i at time t, the next state follows a distribution proportional to λ_i p_ij/η, which is exactly p_ij after normalization.

Example 4.6. (Markov-modulated Poisson process) In this process there is an underlying Markov chain X_t as in example 4.4, and a Poisson process N(t) whose rate is β_i while the underlying chain is in state X_t = i. An event is defined to be either a Poisson arrival or a transition out of the current state (i → j, j ≠ i). Thus the event arrival rate is λ_i + β_i when the current state is X_t = i.

We want to simulate the arrival events of this process. A naive approach, similar to algorithm 7, is given below.

Algorithm 8 Markov-modulated Poisson, Naive method

Generate X_0 from the initial distribution π_i. Set t = 0.
Record the current state i ← X_t. Generate an exponential with rate λ_i + β_i, denoted Z; set t ← t + Z.
Generate a random number U.
if U < λ_i/(λ_i + β_i) then
  Make a transition into j with probability p_ij, set X_t ← j.
else
  A new Poisson arrival: set N_t ← N_t + 1.
end if
Go back to the second step.

We could also use a uniformization method similar to example 4.5. Find an upper bound for the event arrival rate, η ≥ sup_i(λ_i + β_i), and generate a Poisson process with rate η, denoted M_t. Upon each Poisson arrival of M_t, given the current state is i, make a transition with probability λ_i/η, register a Poisson arrival with probability β_i/η, and do nothing with probability 1 − (β_i + λ_i)/η.

The correctness proofs for both the Markov chain and the modulated Poisson process are similar to example 4.5.

5 Further Selected Random Objects

5.1 Order Statistics

Our goal is to generate the ordered version of n i.i.d random variables X_i, denoted X_(i), such that

X_(1) < · · · < X_(n)

A naive approach is of course to first generate the X_i and then sort them in ascending order. Sorting in general takes run time at least of order O(n log n) (for example, the quick-sort algorithm), and this is very inefficient when only a few of the X_(i) are actually needed (for example, the minimum or the maximum).


Remark 5.1. In some cases sorting can be done in O(n) time; for example, sorting n uniformly distributed values U_i: split the interval [0, 1] into n/k subintervals of equal width k/n, so that on average each subinterval receives k values. When k ≪ n the sorting within each subinterval is very fast, and parallel computing can also be exploited.

In order to speed up the simulation process, we utilize the following proposition.

Proposition 5.1. If the U_(i) are the order statistics of n random numbers U_i, and F is a distribution function with inverse F^{−1}, then the variables X_(i) = F^{−1}(U_(i)) are the order statistics of n i.i.d X_i with common distribution F.

In order to prove the above proposition, we need to take a detour into the probability theory of order statistics.

5.1.1 Probability Theory About Order Statistics

The Distribution of the kth Order Statistic

First let us consider the density function of X_(k). By conditioning on which X_i equals X_(k), we can write

f_{X_(k)}(x) dx = P(X_(k) ∈ dx)
 = Σ_{i=1}^n P(X_i ∈ dx, exactly k − 1 others < x, exactly n − k others > x)
 = n f(x) dx (n−1 choose k−1) F(x)^{k−1} F̄(x)^{n−k}

where F̄ = 1 − F denotes the tail of F and we have used the fact that the X_i are i.i.d. We record the density here for later use:

f_{X_(k)}(x) = n f(x) (n−1 choose k−1) F(x)^{k−1} F̄(x)^{n−k}    (5.1)

In particular the density of the minimum is

f_{X_(1)}(x) = n f(x) F̄(x)^{n−1}    (5.2)

and the density of the maximum is

f_{X_(n)}(x) = n f(x) F(x)^{n−1}    (5.3)

Let us determine the tail distribution of X_(k) first. To do so, we need to compute the integral

P(X_(k) > x) = n (n−1 choose k−1) ∫_x^∞ f(y) F(y)^{k−1} F̄(y)^{n−k} dy = (n!/((k−1)!(n−k)!)) ∫_x^∞ F(y)^{k−1} F̄(y)^{n−k} dF(y)


Making the change of variable z = F̄(y) = 1 − F(y) (so dz = −dF(y)), we have

P(X_(k) > x) = (n!/((k−1)!(n−k)!)) ∫_0^{F̄(x)} (1 − z)^{k−1} z^{n−k} dz

We need to evaluate this integral. Using integration by parts repeatedly,

∫_0^p (1 − z)^{k−1} z^{n−k} dz
 = p^{n−k+1}(1 − p)^{k−1}/(n − k + 1) + ((k − 1)/(n − k + 1)) ∫_0^p z^{n−k+1}(1 − z)^{k−2} dz
 = · · ·
 = p^{n−k+1}(1 − p)^{k−1}/(n − k + 1) + · · · + ((k − 1)!/((n − k + 1) · · · (n − 1))) ∫_0^p z^{n−1} dz
 = p^{n−k+1}(1 − p)^{k−1}/(n − k + 1) + · · · + ((k − 1)!(n − k)!/n!) p^n
 = ((k − 1)!(n − k)!/n!) Σ_{i=1}^{k} (n choose k−i) p^{n−k+i}(1 − p)^{k−i}

Thus plugging in p = F̄(x), we have

P(X_(k) > x) = Σ_{i=1}^{k} (n choose k−i) F̄(x)^{n−k+i} F(x)^{k−i}    (5.4)

In particular, the tail distribution of the minimum is

P(X_(1) > x) = F̄(x)^n    (5.5)

and the tail distribution of the maximum is

P(X_(n) > x) = Σ_{i=1}^{n} (n choose i) F̄(x)^i F(x)^{n−i} = 1 − F(x)^n    (5.6)

We could also derive the same expression for the tail distribution by conditioning on the number of variables that exceed x:

P(X_(k) > x) = Σ_{i=1}^{k} P(n − k + i of the X's exceed x and k − i are below x) = Σ_{i=1}^{k} (n choose k−i) F̄(x)^{n−k+i} F(x)^{k−i}

Joint Distribution of 2 Order Statistics

Suppose k < l. For s and x > 0,

f_{X_(k),X_(l)}(s, x + s) ds dx = P(X_(k) ∈ ds, X_(l) ∈ d(x + s))
 = Σ_{i=1}^n Σ_{j≠i} P(X_i ∈ ds, X_j ∈ d(x + s), k − 1 others < s, n − l others > x + s, l − k − 1 others ∈ (s, x + s))
 = (n!/((k−1)!(l−k−1)!(n−l)!)) f(s) f(x + s) F(s)^{k−1} F̄(x + s)^{n−l} (F(x + s) − F(s))^{l−k−1} ds dx

In particular, the joint density of 2 consecutive order statistics X_(k−1), X_(k) is

f_{X_(k−1),X_(k)}(s, x + s) = (n!/((k−2)!(n−k)!)) f(s) f(x + s) F(s)^{k−2} F̄(x + s)^{n−k}    (5.7)

(There are no other order statistics within the interval (X_(k−1), X_(k)).)

Joint Distribution of All Order Statistics

By conditioning on the actual rank order of X_1, · · · , X_n, we have

f_{X_(1),···,X_(n)}(x_1, · · · , x_n) = n! f(x_1) · · · f(x_n),  x_1 < · · · < x_n    (5.8)

In principle, any other joint density of order statistics can be derived from this one through integration.

Now return to the proof of proposition 5.1. Since the distribution function of a random number is F_U(p) = p for p ∈ (0, 1), the density of the kth order statistic of n random numbers is given by (equation (5.1))

f_{U_(k)}(p) = n (n−1 choose k−1) p^{k−1}(1 − p)^{n−k}    (5.9)

Thus the density of X_(k) = F^{−1}(U_(k)) is

f_{X_(k)}(x) = P(F^{−1}(U_(k)) ∈ (x, x + dx))/dx = P(U_(k) ∈ (F(x), F(x + dx)))/dx = f_{U_(k)}(F(x)) f(x)

and plugging p = F(x) into equation (5.9) we see that X_(k) is distributed exactly as the kth order statistic of the X_i (according to equation (5.1)).

5.1.2 Simulation of Uniform Order Statistics

With the help of proposition 5.1, to generate order statistics from a distribution with computable inverse, we can focus on generating ordered random numbers.

Method 1
We can generate ordered uniforms using the following method. First generate independent standard exponentials Y_1, · · · , Y_{n+1}, and then set

U_(i) = (Y_1 + · · · + Y_i)/(Y_1 + · · · + Y_{n+1}), i = 1, · · · , n

Proof. Denote

S_i = Y_1 + · · · + Y_i

which is the arrival time of the ith event in a Poisson process with rate 1, the Y_i being the interarrival times. Suppose more generally the rate of event occurrence is λ (it will turn out that the construction is independent of the choice of λ). Recall from chapter 5, proposition 5.4, that given S_{n+1} = t, the previous n arrival times S_1, · · · , S_n are distributed like the order statistics of n i.i.d uniform variables on the interval (0, t). Thus

(S_1/S_{n+1}, · · · , S_n/S_{n+1}) | S_{n+1} = t ∼ (U_(1), · · · , U_(n))

On the other hand, by equation (5.8), the joint density of the order statistics of n i.i.d random numbers is

f_{U_(1),···,U_(n)} = n!

Then, conditioning on S_{n+1}, the joint density of (S_1/S_{n+1}, · · · , S_n/S_{n+1}) is

∫_0^∞ (λe^{−λt}(λt)^n/n!) · n! dt = ∫_0^∞ e^{−t} t^n dt = n!

We see that S_k/S_{n+1} is indeed distributed like U_(k), and thus we can simulate the ordered uniforms by setting

U_(k) = (Y_1 + · · · + Y_k)/(Y_1 + · · · + Y_{n+1})
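A vectorized sketch of Method 1 (Python/NumPy), which also applies a target inverse c.d.f. as in proposition 5.1 (the exponential target is an illustrative choice):

import numpy as np

rng = np.random.default_rng(13)

def ordered_uniforms(n):
    """U_(1) < ... < U_(n) via normalized cumulative sums of n+1 standard exponentials."""
    y = rng.exponential(1.0, n + 1)
    s = np.cumsum(y)
    return s[:-1] / s[-1]

u_sorted = ordered_uniforms(10)
print(u_sorted)

# ordered Exp(1) sample without sorting, using X_(i) = F^{-1}(U_(i)) = -log(1 - U_(i))
x_sorted = -np.log1p(-u_sorted)
print(np.all(np.diff(x_sorted) > 0))   # True: already in increasing order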

Method 2
First generate independent random numbers U_1, · · · , U_n. Then set, backwards in i,

U_(n) = U_1^{1/n},  U_(i) = U_(i+1) U_{n−i+1}^{1/i}, i = n − 1, · · · , 1

Proof. The conditional joint density of U_(1), · · · , U_(n−1) given U_(n) = y is, by equations (5.1) and (5.8),

f(u_(1), · · · , u_(n−1) | U_(n) = y) = n!/(n y^{n−1}) = (n − 1)! (1/y)^{n−1}

which is just the joint density of the order statistics of n − 1 i.i.d uniform variables on the interval (0, y).

The distribution function of U_(n) is

F_{U_(n)}(x) = x^n

and thus by the inverse transformation method we can simulate it by setting

U_(n) = U_1^{1/n}

for a random number U_1. Then, given U_(n), U_(n−1) has conditional distribution

F_{U_(n−1)|U_(n)}(x) = (x/U_(n))^{n−1}

So again by the inverse transformation method we can simulate U_(n−1) by setting

U_(n−1) = U_(n) U_2^{1/(n−1)}

for a random number U_2. Similarly, for U_(k) we can use the value of U_(k+1) and set

U_(k) = U_(k+1) U_{n−k+1}^{1/k}

for a random number U_{n−k+1}.

Method 3
This method uses a property of the order statistics of standard exponentials. From the above discussion, given X_(k−1), the hazard rate of the difference X_(k) − X_(k−1) at x is (n − k + 1)λ(X_(k−1) + x), where λ(·) is the hazard rate function of the X_i. If the original X_i are i.i.d exponentials with rate λ, their hazard rate function is constantly λ, and thus, given X_(k−1), the difference X_(k) − X_(k−1) is exponentially distributed with rate (n − k + 1)λ (by the one-to-one correspondence between distributions and hazard rate functions). Thus we can express the order statistics of exponentials as

X_(k) = (1/λ) Σ_{i=1}^{k} Z_i/(n − i + 1)

where the Z_i are independent standard exponentials.

Choosing λ = 1 and applying the transform u = 1 − e^{−x}, we can get the order statistics of uniforms from these X_(k).

5.1.3 Direct Simulation Using the Accept-Reject Method

In this paragraph we introduce an A-R method for generating the order statistics of a density f without invoking the inversion method discussed above.

Suppose we have found an upper bound for f of the form cg, where c > 1 is a constant and g is another density whose order statistics are easier to simulate than those of f. Then, choosing some integer m > n (see [1, page 221] for a suggested choice of m), we have the following method for generating order statistics of f. This method again only takes run time of order O(n) by [1].

Algorithm 9 Order Statistics, A-R Method

repeat
  Generate m order statistics of density g, denoted Y_(i).
  Generate m random numbers U_i.
  Delete those Y_(i) for which U_i > f(Y_(i))/(c g(Y_(i))).
until the edited sample has N ≥ n elements.
Randomly pick n objects out of this sample of size N.


5.1.4 Simulation of the Maximum

Suppose we only want to simulate the maximum of the n variables, i.e. X_(n). In this section we call it M_n, or simply M if the sample size n is clear from the context. In what follows, F is the distribution of the samples X_i.

Method 1
First generate a random number U, and then set

M = F^{−1}(U^{1/n})

Proof. The distribution of M is

P(M ≤ x) = P(U ≤ F^n(x)) = F^n(x)

which is exactly the distribution of the maximum (see equation (5.6)).

This method takes O(1) time.
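A one-line sketch (Python/NumPy), here for the maximum of n standard exponentials, for which F^{−1}(u) = −log(1 − u):

import numpy as np

rng = np.random.default_rng(14)

n, m = 1000, 100_000
u = rng.random(m)
M = -np.log1p(-u**(1.0 / n))                 # M = F^{-1}(U^{1/n}) for F(x) = 1 - exp(-x)

# sanity check: E[M_n] for Exp(1) maxima is the harmonic number H_n ~ log(n) + Euler-Mascheroni constant
print(M.mean(), np.log(n) + np.euler_gamma)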

Method 2 — Quick Elimination Algorithm
In this method we choose some suitably large x (imagining it as close to the maximum) and define the tail probability p = F̄(x) = 1 − F(x). In principle one should choose x such that p is of order log(n)/n. Then the algorithm works as

Algorithm 10 Quick Elimination

Generate an integer L from the binomial distribution with parameters (n, p) (thus E[L] = np — which is ∼ O(log n) if p ∼ log(n)/n as mentioned above — is usually small since x is large).
if L = 0 (probability (1 − p)^n = F^n(x)) then
  Sample n independent objects from the distribution F conditioned to (−∞, x], and set M to be the maximum of these values.
else
  Simulate L samples from F conditioned to (x, ∞), denoted Y_1, · · · , Y_L, and set M to be the maximum of these values.
end if

The expected number of objects generated from F is

n(1 − p)^n + E[L | L > 0] P(L > 0) = E[L] + n(1 − p)^n = n(p + (1 − p)^n)

5.1.5 Record

Definition 5.1. For a sequence of variables X_i, the first record is the first value exceeding all previous values, and the kth record is defined recursively. More formally, the time at which the first record occurs is

N_1 = min{n > 1 : X_n > max_{i<n} X_i}

So X_{N_1} is the first record value. Recursively, the time at which the kth record occurs is the first time the sequence exceeds the last record,

N_k = min{n : X_n > X_{N_{k−1}}}, k = 2, · · ·

and X_{N_k} is the kth record value.

We can generate the record times and values by the following algorithm (suppose the X_i are i.i.d with distribution F).

Algorithm 11 Record Times and Values

Generate the starting point X_1 from the distribution F. Set k = 1 and N_0 = 1.
Let x ← X_{N_{k−1}} be the last record value. Generate a geometric variable N with success probability F̄(x) = 1 − F(x) (success meaning exceeding the last record value x).
Set N_k = N_{k−1} + N.
Generate the new record value X_{N_k} by simulating from F conditioned to (x, ∞).
Update k ← k + 1 and go back to step 2.

5.2 Random Permutations

Suppose we want to generate a random permutation of the set {1, · · · , n} such that all n! possible permutations are equally likely.

The first method starts with the set L = {1, · · · , n} and an empty (ordered) list S, and works in n steps. At each step it chooses one element of L uniformly at random, removes it from L, and appends it to S. The resulting list S is a random permutation of the original set of elements.

The second method is by successive random swapping. Starting with the ordered list L = (1, · · · , n), we proceed as below.

Algorithm 12 Random permute-swap

Let k = 2.
repeat
  Generate a random number U, and swap the elements of L at positions k and [kU] + 1.
  Set k ← k + 1.
until k = n + 1
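A sketch of the swapping method (Python; this is equivalent to the Fisher-Yates shuffle):

import numpy as np

rng = np.random.default_rng(15)

def random_permutation(n):
    """Uniform random permutation of 1..n by successive swaps (Algorithm 12)."""
    L = list(range(1, n + 1))
    for k in range(2, n + 1):
        j = int(k * rng.random())          # uniform on {0, ..., k-1}, i.e. position [kU]+1 in 1-based terms
        L[k - 1], L[j] = L[j], L[k - 1]    # swap positions k and j+1 (1-based)
    return L

print(random_permutation(10))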

5.3 Point Processes

A point process in space describes randomly occurring points in space. We want to simulate a Poisson point process on a 2-dimensional region (no time dimension). First recall the following characterization.

Definition 5.2. A Poisson point process with rate λ on a region Ω in space is characterized by:

1. The number of points occurring in a subregion with area A is Poisson distributed with mean λA.

2. The numbers of points in any 2 disjoint subregions are independent.

Then, using the following fact, we can simulate the Poisson point process.

Proposition 5.2. For a Poisson point process, given the number of points in a subregion R, denoted N, these N points are independently and uniformly distributed on R.

Proof. Given N(R) = n, i.e. there are exactly n points inside the region R, suppose they are (X_i, Y_i), i = 1, · · · , n, and write |R| for the area of R.

Place little rectangles R_i with width dx_i and height dy_i inside R so that they do not overlap (thus are disjoint), and suppose they all have the same shape, dx_i = dx, dy_i = dy for all i = 1, · · · , n.

We first compute the conditional probability that each of them contains exactly one point, given that the number of points inside R is n:

P(R_i contains exactly 1 point, i = 1, · · · , n | N(R) = n)
 = P(R_i contains exactly 1 point, i = 1, · · · , n, no points in R − ∪_{i=1}^n R_i)/P(N(R) = n)
 = P(no points in R − ∪_{i=1}^n R_i) ∏_{i=1}^n P(R_i contains exactly 1 point)/P(N(R) = n)
 = e^{−λ(|R| − Σ_i dx_i dy_i)} ∏_{i=1}^n e^{−λ dx_i dy_i} λ dx_i dy_i / (e^{−λ|R|}(λ|R|)^n/n!)
 = (n!/|R|^n) ∏_{i=1}^n dx_i dy_i

where the second equality holds because the regions R_i and R − ∪_{i=1}^n R_i are disjoint and by assumption 2 about the 2D Poisson process in section 11.5.2, and the third equality is by assumption 1 about the 2D Poisson process.

On the other hand, we can also compute the above probability by conditioning on which point falls in which little rectangle. There are n! ways to assign the n points to the n rectangles without repetition, and since all rectangles have the same shape,

P(R_i contains exactly 1 point, i = 1, · · · , n | N(R) = n) = n! P((X_i, Y_i) ∈ R_i, i = 1, · · · , n | N(R) = n)

Therefore

P((X_i, Y_i) ∈ R_i, i = 1, · · · , n | N(R) = n) = (1/|R|^n) dx_1 dy_1 · · · dx_n dy_n

Since dx_1 dy_1 · · · dx_n dy_n is the volume element of the vector (X_1, Y_1, · · · , X_n, Y_n), the conditional joint density of these n points given N(R) = n is

f(x_i, y_i; i = 1, · · · , n | N(R) = n) = 1/|R|^n

This tells us that the n points inside R are independent and uniformly distributed across the whole region.

The procedure to simulate a Poisson point process with rate λ on Ω is then: first generate N, Poisson distributed with mean λ|Ω| (using a Poisson generator), and then generate N independent points uniformly distributed inside the region Ω (using the method in remark 2.8).
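A sketch for a rectangular region (Python/NumPy); for a general bounded Ω one would sample uniformly in a bounding box and keep the points that fall in Ω, as in remark 2.8.

import numpy as np

rng = np.random.default_rng(16)

def poisson_point_process_rect(lam, width, height):
    """Homogeneous Poisson point process with rate lam on [0, width] x [0, height]."""
    n = rng.poisson(lam * width * height)     # N ~ Poisson(lambda * |Omega|)
    x = rng.uniform(0, width, n)              # given N, points are i.i.d uniform on the region
    y = rng.uniform(0, height, n)
    return np.column_stack([x, y])

pts = poisson_point_process_rect(lam=5.0, width=4.0, height=2.0)
print("points generated:", len(pts), " expected ~", 5.0 * 4.0 * 2.0)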

Exercises

Exercise 5.1. Consider the goal of generating a random sample of size n out of a population of size N. Give simulation algorithms.

Solution 5.1. One can check [1, chapter XII, section 2] for more algorithms.

A-R method
This procedure uses a set S to collect the sampled objects. We sample uniformly from the set {1, · · · , N} and check whether the sampled object is already in S; if not, we add it to S, otherwise we drop it and sample again. Repeat until S has n items.

Sequential methodConsider the following probabilities.

1. The probability that item k is chosen. When k is chosen, the other n − 1 sampled items must come from the remaining N − 1 items, so there are \binom{N-1}{n-1} possibilities, and the desired probability is
\[
\frac{\binom{N-1}{n-1}}{\binom{N}{n}} = \frac{n}{N}
\]

2. The conditional probability that item j is chosen given that item i (i ≠ j) is chosen. First, the probability that both i and j are chosen is
\[
\frac{\binom{N-2}{n-2}}{\binom{N}{n}} = \frac{n(n-1)}{N(N-1)}
\]
Dividing this by the probability in 1 yields the desired conditional probability, which is just (n − 1)/(N − 1).

3. The conditional probability that item j is chosen given that item i (i ≠ j) is not chosen. First, the probability that i is not chosen but j is chosen is
\[
\frac{\binom{N-2}{n-1}}{\binom{N}{n}} = \frac{n(N-n)}{N(N-1)}
\]
Dividing this by the probability that i is not chosen, which is 1 − n/N = (N − n)/N according to 1, yields the desired conditional probability n/(N − 1).


4. The conditional probability that item k is chosen given that r other specified items are chosen and s other specified items are not. First, the probability that item k and the r other items are chosen while the s other items are not is
\[
\frac{\binom{N-r-s-1}{n-r-1}}{\binom{N}{n}}
\]
The probability that those r items are chosen and the s items are not is
\[
\frac{\binom{N-r-s}{n-r}}{\binom{N}{n}}
\]
Dividing the first by the second, we get the desired conditional probability
\[
\frac{\binom{N-r-s-1}{n-r-1}}{\binom{N-r-s}{n-r}} = \frac{n-r}{N-r-s}
\]

Thus we can simulate a random sample by the following algorithm.

Algorithm 13 Random Sampling, sequential method

Let S ← ∅ be the set recording the selected items. Let k ← 1.
repeat
    Let r ← |S| be the number of selected items, and s ← k − r − 1 the number of examined but not selected items.
    Generate a random number U.
    if U < (n − r)/(N − r − s) then
        Add k to the set S: S ← S ∪ {k}.
    end if
    k ← k + 1
until |S| = n
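A minimal Python rendering of Algorithm 13 (the function name is a placeholder); it visits the items 1, · · · , N in order and always terminates after at most N steps:

```python
import random

def sample_sequential(n, N):
    """Sequential sampling (Algorithm 13): visit items 1..N in order and keep
    item k with conditional probability (n - r) / (N - r - s)."""
    S = []
    for k in range(1, N + 1):
        r = len(S)        # items selected so far
        s = (k - 1) - r   # items examined but not selected
        if random.random() < (n - r) / (N - r - s):
            S.append(k)
        if len(S) == n:
            break
    return S
```

At step k, N − r − s is exactly the number of items not yet examined (including k), so the acceptance probability matches the conditional probability derived in 4.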

Exercise 5.2. Let 0 < R_1 < R_2 < · · · be the ordered radii of the points of a Poisson process on the disk Ω = {(x_1, x_2) : x_1^2 + x_2^2 < r^2}. Give algorithms for simulation of the homogeneous Poisson process on Ω by verifying and using that

1. the Ri are the points of an inhomogeneous Poisson process on (0, r),

2. R_1^2, R_2^2 − R_1^2, · · · are i.i.d. exponentials.

Solution 5.2. Suppose the original Poisson process on the disk Ω has intensity λ.

Consider the first proposition. Let N(a, b) denote the number of points whose radii fall in the interval (a, b); the event {N(a, b) = n} is equivalent to there being exactly n points inside the ring
\[
E(a, b) = \{(x_1, x_2) : a^2 < x_1^2 + x_2^2 < b^2\}
\]
Thus the probability
\[
P(N(s, s + ds) = 1) = P(N(E(s, s + ds)) = 1) = \lambda\pi\left((s + ds)^2 - s^2\right) = 2\pi\lambda s\, ds
\]


where we have discarded the term ds^2 because it is negligible compared to ds. Thus we see that the R_i form an inhomogeneous Poisson process on (0, r) with intensity function given by

λ(s) = 2πλs

We also point out that, given the radius of a point, the conditional distribution of its angle is uniform on [0, 2π]. Consider the joint density of the radius s and the angle θ. The probability that the region (in polar coordinates)

dΩ = (s, s+ ds)× (θ, θ + dθ)

contains exactly 1 point is given by

\[
P(N(d\Omega) = 1) = e^{-\lambda s\, ds\, d\theta}\,\lambda s\, ds\, d\theta = \lambda s\, ds\, d\theta
\]
where we have expanded the factor
\[
e^{-\lambda s\, ds\, d\theta} = 1 - \lambda s\, ds\, d\theta + \cdots
\]
and discarded all terms smaller than ds dθ (note that s ds dθ is the area of dΩ). We already know that 2πλs is the density of the radius s; dividing λs ds dθ by 2πλs ds leaves dθ/(2π), so the conditional density of θ given the radius is 1/(2π), i.e., θ is indeed uniformly distributed on [0, 2π] independently of the radius.

Then we can first simulate the inhomogeneous Poisson process with intensity λ(s) (see section 4; this is exactly the hazard-rate-function method discussed in [4, chapter 11]), and assign uniform angles to those points whose radii are less than r.
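The notes refer to the hazard-rate-function method of section 4 for the radius process; as an alternative, the following small Python sketch instead uses standard thinning with the dominating rate 2πλr on (0, r). The function name is a placeholder, and thinning is a substitute technique, not the method described in the notes.

```python
import math
import random

def radii_by_thinning(lam, r):
    """Radii of the disk process: an inhomogeneous Poisson process on (0, r)
    with intensity 2*pi*lam*s, simulated by thinning a homogeneous process
    of rate lam_max = 2*pi*lam*r."""
    lam_max = 2.0 * math.pi * lam * r
    radii, s = [], 0.0
    while True:
        s += random.expovariate(lam_max)  # candidate point of the dominating process
        if s >= r:
            break
        if random.random() < s / r:       # accept with probability (2*pi*lam*s)/lam_max
            radii.append(s)
    return radii
```

Each accepted radius is then paired with an independent angle drawn uniformly from [0, 2π], as argued above.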

Now consider the second proposition. The event {R_1^2 > x} is equivalent to there being no points in the disk of radius \sqrt{x}, which has area πx. Thus the tail distribution of R_1^2 is
\[
P(R_1^2 > x) = P(\text{no points in the disk of radius } \sqrt{x}) = e^{-\lambda\pi x}
\]
and therefore R_1^2 is exponentially distributed with rate λπ.

Then let us consider the conditional tail distribution of R_2^2 − R_1^2 given R_1 = r_1. Given R_1 = r_1, the event {R_2^2 − R_1^2 > x} is equivalent to there being no points in the ring E(r_1, \sqrt{x + r_1^2}) (see the definition above), thus
\[
P(R_2^2 - R_1^2 > x \mid R_1 = r_1) = P\left(N\left(E\left(r_1, \sqrt{x + r_1^2}\right)\right) = 0\right) = e^{-\lambda\pi(x + r_1^2 - r_1^2)} = e^{-\lambda\pi x}
\]

which is also exponentially distributed with rate λπ and is independent of R_1. In fact, due to independent increments (imagine rings with increasing radii), R_1^2 and the differences R_{i+1}^2 − R_i^2 are independent, and from the discussion above it is not hard to see that they all have the same exponential distribution with rate λπ.

Then we can sequentially simulate the differences of the squared radii from the exponential distribution with rate λπ, add them up to recover the squared radii (and hence the radii), stopping once a radius exceeds r, and finally assign independent uniform angles on [0, 2π] to the corresponding points, as pointed out above.
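A minimal Python sketch of this second algorithm (the function name is chosen here): squared radii are built as cumulative sums of Exp(λπ) increments, generation stops once a radius leaves the disk, and each kept radius gets an independent uniform angle.

```python
import math
import random

def poisson_disk(lam, r):
    """Homogeneous Poisson process on the disk of radius r: squared radii are
    cumulative sums of Exp(lam * pi) increments, each paired with an
    independent uniform angle on [0, 2*pi]."""
    points = []
    sq_radius = 0.0
    while True:
        sq_radius += random.expovariate(lam * math.pi)  # R_i^2 - R_{i-1}^2
        if sq_radius >= r * r:
            break                                       # the point falls outside Omega
        rad = math.sqrt(sq_radius)
        theta = random.uniform(0.0, 2.0 * math.pi)
        points.append((rad * math.cos(theta), rad * math.sin(theta)))
    return points
```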


6 Discrete-Event Systems and GSMPs

Consider the following model, called a GSMP (Generalized Semi-Markov Process), a generalized version of the semi-Markov process. The model consists of the following elements:

1. S, the set of states, i.e., the state space.

2. E, the set of all possible events. Events trigger state transitions, and events may or may not be active; multiple events may be active at the same time. The set of events active in state s ∈ S is denoted by E(s) ⊂ E.

State transitions of a GSMP are controlled by competing clocks. Suppose an event e ∈ E is scheduled in state s ∈ S. Then a clock corresponding to this event e is set, counting down at rate r_{s,e} (a rate specified by the event and the current state). The clock's countdown is deterministic, with no randomness in it. When the clock hits 0, the event e* corresponding to this clock occurs (we call this the alarm event), and a state transition is triggered: the process moves to state s' with probability p(s'; s, e*). When the new state is entered, new events need to be scheduled. When a new event e' is scheduled, a new clock with a random reading drawn from the distribution F(x; s', e', s, e*) is set for this event. The old clocks of events that have not yet hit 0 continue to count down, but at the rates corresponding to the new state s' and their own events.

The randomness lies in (1) the initial clock readings and (2) the state transitions, while the clocks' countdown is deterministic, since it proceeds at a fixed rate determined by the state.

Exercises

Exercise 6.1. A system has N machines and M < N repairmen. The operating time of a machine has distribution F and the repair time has distribution G. A failed machine is repaired immediately by a repairman if one is available, and otherwise it joins a queue of machines awaiting repair. Write up a GSMP representation of the system.

Solution 6.1. The state of the system is completely described by the number of working machines n: the number of failed machines is then N − n, the number of machines being repaired is m = min(N − n, M), the number of failed machines waiting for repair (the queue length) is N − n − m, the number of busy repairmen is m, and the number of idle repairmen is M − m. Thus the state space is S = {0, 1, · · · , N}.

We separate the clocks into two groups, one for the working machines and the other for those being repaired. Each clock in the working group corresponds to a working machine, with its reading equal to the remaining operating time of that machine. Each clock in the repairing group corresponds to a machine being repaired, with its reading equal to the remaining repair time of that machine. We do not assign clocks to machines waiting in line.

The clocks count down at rate 1, regardless of the current state, because in this system all clock readings represent physical time (and time elapses at a fixed rate).

We use C^w_i to denote the readings of the clocks in the working group, and C^r_j to denote those in the repairing group.


Note that in this model, the transitions between states are deterministic: a newly fixed machine causes n to increase by 1, and a newly failed machine causes n to decrease by 1.

The possible events are: 1. a machine goes down; 2. a machine is fixed.

Suppose initially all N machines have just started working. We then sample N independent random values from distribution F and assign them to C^w_i, i = 1, · · · , N, representing their remaining operating times. In the beginning there is no clock in the repairing group, since no machine is being repaired.

When a machine goes down, the state (the number of working machines) decreases by 1 (and the queue length increases by 1 if no repairman is free). We remove the clock with reading 0 from the working group; it corresponds exactly to the newly failed machine. If a repairman is free, we add a new clock to the repairing group and set its reading to a random value from distribution G, representing the remaining repair time. Otherwise, if no repairman is free, the machine waits in line and no clock is added to either group.

When a machine is fixed, the state increases by 1. We remove the clock with reading 0 from the repairing group; it corresponds exactly to the newly repaired machine, and we add a new clock to the working group with its reading set to a random value from distribution F, representing the remaining operating time. Moreover, if the queue is non-empty, the machine at the head of the queue begins repair: the queue length decreases by 1 and a new clock with a fresh reading from G is added to the repairing group for it.
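A minimal discrete-event sketch of this machine-repair GSMP (all names are placeholders; sample_F and sample_G stand for user-supplied samplers from the distributions F and G, and all clocks run at rate 1 as described above):

```python
import random

def simulate_repair_shop(N, M, sample_F, sample_G, horizon):
    """Machine-repair GSMP: clocks are stored as absolute alarm times,
    the state is the number of working machines, and the next event is
    always the clock with the smallest reading."""
    t = 0.0
    working = [sample_F() for _ in range(N)]  # alarm times, working-group clocks
    repairing = []                            # alarm times, repairing-group clocks
    queue = 0                                 # failed machines awaiting a repairman
    history = [(t, N)]
    while True:
        t = min(working + repairing)          # next alarm event
        if t >= horizon:
            break
        if working and t == min(working):     # event: a machine goes down
            working.remove(t)
            if len(repairing) < M:            # a repairman is free
                repairing.append(t + sample_G())
            else:
                queue += 1
        else:                                 # event: a machine is fixed
            repairing.remove(t)
            working.append(t + sample_F())
            if queue > 0:                     # a waiting machine starts repair
                queue -= 1
                repairing.append(t + sample_G())
        history.append((t, len(working)))
    return history

# e.g. exponential operating and repair times (an arbitrary choice for illustration)
path = simulate_repair_shop(N=5, M=2,
                            sample_F=lambda: random.expovariate(1.0),
                            sample_G=lambda: random.expovariate(3.0),
                            horizon=100.0)
```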

Remark 6.1. The GSMP representation of the system is described above, but in order to use this representation to estimate quantities like the average waiting time in the queue and the average busy period, a lot of further work needs to be done.


References

[1] Luc Devroye. Non-Uniform Random Variate Generation. Springer Science+Business Media, LLC, 1986.

[2] Mark E. Johnson. Multivariate Statistical Simulation. John Wiley and Sons, Inc., 1987.

[3] Roger B. Nelsen. An Introduction to Copulas. Springer-Verlag, 1999.

[4] Sheldon M. Ross. Introduction to Probability Models. Elsevier Science and Technology Books, 2014.

[5] Rafael Schmidt. Tail dependence for elliptically contoured distributions. Mathematical Methods of Operations Research, 55(2):301–327, May 2002.
