
Set theory background for probability

Defining sets (a very naïve approach)

A set is a collection of distinct objects. The objects within a set may be arbitrary, with the order of

objects within them having no significance. Here are several examples, demonstrating the above

properties:

$\{2,3,5,7,11,13,17,19\}$ (the set of all prime numbers smaller than 20)

, , , , , , ′ ′

The braces $\{\dots\}$ surround the objects that belong to a certain set. Objects are separated by commas.

Note that:

$\{1,2,3\} = \{1,3,2\}$ (there is no ordering of objects within a set)

$\{1,2,3,3\} = \{1,2,3\}$ (sets contain one copy of each distinct object)

Working with sets

1. Membership

If an object $x$ is a member of a set $A$ we write:

$x \in A$.

If an object $x$ is not a member of a set $A$ we write:

$x \notin A$.

2. Subsets

Assume that we have a set $A$ and another set $B$, such that every element in $A$ is also a member of $B$. Then we can write:

$A \subseteq B$.

We say that $A$ is a subset of $B$.

Formally, we can write:

$\forall x \in A: x \in B$ (any $x$ that is a member of $A$ is a member of $B$)

No need to worry: we will usually not encounter this formal logic notation.

Notice that the sets $A$, $B$ may actually be equal. If they are not, there must be some member of $B$ that is not a member of $A$. In this case we can write:

$A \subset B$, i.e. $A$ is a true (proper) subset of $B$, or in logic notation: $(\forall x \in A: x \in B) \wedge (\exists y \in B: y \notin A)$ (any $x$ that is in $A$ is in $B$ AND there exists a $y$ in $B$ that is not in $A$).

3. Set equality

Sets $A$, $B$ are equal if and only if $A \subseteq B$ and $B \subseteq A$.

4. Abbreviated representation of sets and some common sets

We can use 3 dots to represent a continued or an infinite list with some apparent rule:

$\{1,2,3,\dots,N\}$   Set of integers in the range 1‐N.

$\mathbb{N} = \{1,2,3,\dots\}$   Set of all natural numbers (integers > 0).

$\{0,1,2,3,\dots\}$   Set of all non‐negative integers.

$\mathbb{Z} = \{\dots,-2,-1,0,1,2,\dots\}$   Set of all integers.

We can define a set in abbreviated form by defining a rule:

$\mathbb{Q} = \{a/b : a, b \in \mathbb{Z},\ b \neq 0\}$   Set of all rational numbers.

Other common sets are the set of real numbers $\mathbb{R}$ (this includes all rational numbers plus many others, such as $\pi$, $\sqrt{2}$, $e$, …) and the set of complex numbers $\mathbb{C}$, which can be defined simply as follows:

$\mathbb{C} = \{a + bi : a, b \in \mathbb{R},\ i = \sqrt{-1}\}$.

5. Venn diagrams

It is convenient to depict sets by circles with individual objects marked as dots within them (not all

objects need be marked). For example:

[Fig. 1: Venn diagram of two sets, A and B, with the circle A drawn inside the circle B.]

This implies that $A \subset B$.

We can have more complicated interactions between pairs (and larger collections) of sets. For

example:

[Fig. 2: Venn diagram of two partially overlapping sets, A and B.]

The overlap area (brown) includes objects that are members of both $A$ and $B$. The green and red areas include, respectively, objects that are solely in $A$ or solely in $B$, but not in both. We now come to operators that deal with such interactions between sets.

6. Set operators

The union of two sets $A$, $B$ is a new set, containing all the objects from both sets. We write the union this way:

$C = A \cup B$.

A formal definition would be:

$A \cup B = \{x : x \in A \vee x \in B\}$.

Example:

$A = \{2,3,5,8\}$, $B = \{1,5,9\}$.

Then $C = A \cup B = \{1,2,3,5,8,9\}$.

The number of elements in $C$ is at least as large as the number of objects in the largest of the sets $A$, $B$.

The intersection of two sets $A$, $B$ is a new set, containing only objects that are both in $A$ and in $B$. We write:

$A \cap B$,

or

$A \cap B = \{x : x \in A \wedge x \in B\}$.

For the above example, $A \cap B = \{5\}$.
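The union and intersection example above can be checked with Python's built-in set type. This is only a small sketch; the name `D` for the intersection is introduced here for illustration and does not appear in the notes.

```python
# Example sets from the text.
A = {2, 3, 5, 8}
B = {1, 5, 9}

C = A | B  # union: objects that are in A or in B (or in both)
D = A & B  # intersection: objects that are in both A and B (D is a hypothetical name)

print(sorted(C))
print(sorted(D))

# The union is at least as large as the larger of the two sets.
print(len(C) >= max(len(A), len(B)))
```

Because Python sets are unordered and hold one copy of each object, `{1, 2, 3, 3} == {1, 3, 2}` also evaluates to `True`, mirroring the two properties noted at the start of these notes.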

7. Cardinality

Cardinality is a term that we can usually replace with size. For finite sets, these terms are interchangeable. The notation for cardinality is a pair of straight vertical lines surrounding the set in question.

Examples:

$|\{1,2,3\}| = 3$.

$|\{p : p \text{ is prime},\ p < 20\}| = 8$.

$|\mathbb{N}| = \infty$ (actually, this is an inaccurate statement… $\aleph_0$ is the notation for this kind of countable infinity.)

8. Universal sets

Sometimes we are interested only in sets that are all subsets of a certain “universal” set

(sometimes also referred to as the space).

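Let’s consider as an example sets that are all subsets of the natural number set $\mathbb{N}$.

$P = \{p : p \in \mathbb{N},\ p \text{ is prime}\}$ is one such subset of $\mathbb{N}$. Within this universal set, we can define the complement set $P^\complement$ (in some texts denoted by $P'$ or $\bar{P}$) as follows:

$P^\complement = \{x \in \mathbb{N} : x \notin P\}$. For the above example, the set $P^\complement$ is the set of all non‐prime natural numbers.

The Venn diagram would make sense if we imagine all sets discussed being contained within a space, marked usually by a rectangle:

[Fig. 3: Venn diagram showing the set P as a circle inside a rectangle; the rest of the rectangle is the complement of P.]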

The relative complement is an asymmetric operator between two sets:

$A \setminus B \equiv \{x : x \in A \wedge x \notin B\}$.

For $A = \{2,3,5,8\}$, $B = \{1,5,9\}$,

$A \setminus B = \{2,3,8\}$.

9. The empty set

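The empty set, denoted by $\emptyset$, is a set without any objects (i.e. for any object $x$, $x \notin \emptyset$). Notice (or try to show!) that:

For any set $A$, $\emptyset \subseteq A$.

For any set $A$, $\emptyset \cup A = A$.

For any set $A$, $\emptyset \cap A = \emptyset$.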

10. Sets of sets

Sets can include arbitrary objects, including other sets [1]. In many cases, once given a set $S$, we would be interested in generating a new set that contains as its objects some or all of the subsets of $S$. Let’s start with an example.

Assume that $S = \{1,2,3,4,5,6,7,8,9,10\}$. Now let’s imagine all possible pairs of numbers from S. Each pair is a subset of S itself. For example, $\{2,6\} \subset S$. How many such pairs do we have?

Notice that since we are interested in a set containing two elements from S, the order of the elements doesn’t matter. We have 10 ways of choosing the first object in the pair, and then 9 ways of choosing the second from the remaining objects. Since every pair was wrongly counted twice in this process ($\{a,b\} = \{b,a\}$), we must divide the result by 2 to get the number of unique pairs:

$(10 \cdot 9)/2 = 45$.

We can imagine a new set, T, whose members are all the sets of pairs from S:

$T = \{\{a,b\} : a, b \in S,\ a \neq b\}$.

Notice that S and T have no objects in common. This is because the objects in T are all sets

themselves, while the objects in S are integers. To make things clear:

.

.

[1] We will ignore the difficulties that this may generate. Just as a teaser, consider Russell’s beautiful paradox. Imagine that we define a rule for a set S: S includes all sets that do not include themselves. Does the set S include S as one of its objects?

11. The power set

Given a set $S$ we can construct a set $\mathcal{P}(S)$ that contains all subsets of $S$. For example, for the set $S = \{1,2,3,4\}$, the power set is:

$\mathcal{P}(S) = \{\emptyset, S, \{1\}, \{2\}, \{3\}, \{4\}, \{1,2\}, \{2,3\}, \{3,4\}, \{1,3\}, \{2,4\}, \{1,4\}, \{1,2,3\}, \{2,3,4\}, \{1,2,4\}, \{1,3,4\}\}$.

Notice that the numbers 1‐4 themselves are NOT objects of $\mathcal{P}(S)$, only the subsets $\{1\}, \{2\}, \{3\}, \{4\}$ are!

Notice that $|\mathcal{P}(S)| = 16 = 2^4 = 2^{|S|}$.

Try to show (or provide an intuition) why for any finite set $S$ with cardinality N (i.e. $|S| = N$), the cardinality of the power set is $|\mathcal{P}(S)| = 2^N$.
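As a rough illustration of the power set, the subsets of a small set can be enumerated with the standard library. The helper name `power_set` is an assumption made for this sketch; subsets are stored as `frozenset` objects so that they can themselves be members of a collection.

```python
from itertools import chain, combinations


def power_set(s):
    """Return all subsets of s as a list of frozensets (hypothetical helper)."""
    items = list(s)
    # Collect the subsets of size 0, 1, ..., len(items).
    all_sizes = chain.from_iterable(
        combinations(items, k) for k in range(len(items) + 1)
    )
    return [frozenset(c) for c in all_sizes]


S = {1, 2, 3, 4}
P = power_set(S)

print(len(P))              # number of subsets
print(len(P) == 2 ** len(S))
print(1 in P)              # the number 1 is not a member of the power set
print(frozenset({1}) in P) # but the subset {1} is
```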

12. Cartesian products

Given the sets $A$, $B$, we can generate a new set

$A \times B = \{(a,b) : a \in A,\ b \in B\}$.

$A \times B$ is a set containing as its objects ordered sequences of elements. The fact that the objects of $A \times B$ are sequences is denoted by the regular brackets (parentheses) surrounding the comma‐separated list. In the above example, the first element in each sequence is an object from $A$, while the second element in each sequence is an object from $B$.

For example, we can think of all points in the two‐dimensional real space as the Cartesian product

$\mathbb{R}^2 = \mathbb{R} \times \mathbb{R} = \{(x,y) : x, y \in \mathbb{R}\}$.

Because the elements of $\mathbb{R}^2$ are sequences, $(x,y) = (y,x)$ if and only if $x = y$.

Cartesian products are very useful for conceptualizing repeated experiments. For example, consider

a coin toss, where the possible outcomes are $S = \{H, T\}$. Then all possible results of 3 consecutive coin tosses are contained in the set $S^3 = S \times S \times S$:

$S^3 = \{(a,b,c) : a, b, c \in S\} = \{(H,H,H), (H,H,T), (H,T,H), (H,T,T), (T,H,H), (T,H,T), (T,T,H), (T,T,T)\}$

The cardinality of the resulting set is the product of the cardinalities of the generating sets. In the above example,

$|S^3| = |S| \cdot |S| \cdot |S| = |S|^3 = 8$.
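The three-toss example can be sketched with `itertools.product`, which builds Cartesian products of finite collections. The variable names below are chosen for this illustration only.

```python
from itertools import product

S = ("H", "T")                    # outcomes of a single coin toss
S3 = list(product(S, repeat=3))   # S x S x S: all ordered triples

print(len(S3))                    # cardinality of the product
print(len(S3) == len(S) ** 3)

# Elements are ordered sequences, so swapping entries gives a different element.
print(("H", "T", "T") == ("T", "T", "H"))
```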

13. Algebra of sets – some basic laws

A relatively comprehensive list of set theory laws can be found here:

http://en.wikipedia.org/wiki/Algebra_of_sets

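Skipping the trivial ones (e.g. $A \cap B = B \cap A$, $A \cup B = B \cup A$, …), here are some important ones which you can try to prove:

Distributive laws:

$A \cup (B \cap C) = (A \cup B) \cap (A \cup C)$

$A \cap (B \cup C) = (A \cap B) \cup (A \cap C)$

De Morgan’s laws:

$(A \cup B)^\complement = A^\complement \cap B^\complement$

$(A \cap B)^\complement = A^\complement \cup B^\complement$

Laws pertaining to complements:

$A \setminus B = A \cap B^\complement$

$(A \setminus B)^\complement = A^\complement \cup B$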

(The basic technique for proving equalities in set theory is to take a member of the set defined on each side of the equal sign, and show that it must also be a member of the set defined on the other side. Thus, both sets are shown to contain one another and must be equal.)
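A brute-force check is not a proof, but it can build confidence before attempting one. The sketch below tests the distributive, De Morgan and complement laws on every triple of subsets of a small universal set; the universe `U` and the helper names are assumptions made for this illustration.

```python
from itertools import combinations

U = frozenset(range(4))  # a small universal set, chosen only for illustration


def subsets(s):
    """All subsets of s, as frozensets."""
    items = sorted(s)
    return [frozenset(c) for k in range(len(items) + 1)
            for c in combinations(items, k)]


def comp(x):
    """Complement of x within the universal set U."""
    return U - x


def laws_hold(A, B, C):
    return (
        A | (B & C) == (A | B) & (A | C)        # distributive law 1
        and A & (B | C) == (A & B) | (A & C)    # distributive law 2
        and comp(A | B) == comp(A) & comp(B)    # De Morgan 1
        and comp(A & B) == comp(A) | comp(B)    # De Morgan 2
        and A - B == A & comp(B)                # relative complement
        and comp(A - B) == comp(A) | B          # complement of a difference
    )


all_subsets = subsets(U)
all_ok = all(laws_hold(A, B, C)
             for A in all_subsets for B in all_subsets for C in all_subsets)
print(all_ok)
```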

Exercises (difficult exercise marked by *):

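1. Prove that within the universal space $S$, for every $A \subset S$, $|A^\complement| = |S| - |A|$.

2. Prove that for any two sets $A$, $B$,

$|A \cup B| = |A| + |B| - |A \cap B|$.

3. Show that for any two sets $A$, $B$,

$(A \setminus B) \cup (B \setminus A) = (A \cup B) \setminus (A \cap B)$.

4. (*) Prove the inclusion‐exclusion principle:

For any choice of sets $A_1, A_2, \dots, A_n$,

$|A_1 \cup A_2 \cup \dots \cup A_n| = \sum_{i} |A_i| - \sum_{i<j} |A_i \cap A_j| + \sum_{i<j<k} |A_i \cap A_j \cap A_k| - \cdots + (-1)^{n+1} |A_1 \cap \dots \cap A_n|$

(can be written in shorthand as: $\left|\bigcup_{i=1}^{n} A_i\right| = \sum_{k=1}^{n} (-1)^{k+1} \sum_{1 \le i_1 < \cdots < i_k \le n} |A_{i_1} \cap \cdots \cap A_{i_k}|$)

o Notice that for $n = 2$, this is just exercise 2.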

Combinatorics for probability

Probability theory deals with both continuous and discrete distributions. Discrete distributions are

those in which the outcomes can be counted, e.g. the number of phone calls arriving within a certain

time interval, the result of a dice toss, etc.

In the case of discrete distributions, many questions in probability boil down to questions of

counting the number of possible outcomes. For example, assume we toss two dice, and want to

know the probability that the sum is 5.

Assuming the dice are fair, all outcomes are equally probable. There are $6 \cdot 6 = 36$ different outcomes. From these, the following pairs of values lead to a sum of 5: (1,4),(2,3),(3,2),(4,1). Notice that (1,4) and (4,1) are different outcomes (just imagine the two dice have different colours).

So the probability that the sum is 5 is $4/36 = 1/9$.
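The counting argument for the two dice can be reproduced by listing all outcomes. This is a minimal sketch using exact fractions; the variable names are illustrative only.

```python
from fractions import Fraction
from itertools import product

# All ordered outcomes of two distinguishable fair dice.
outcomes = list(product(range(1, 7), repeat=2))

# Outcomes whose values sum to 5.
favourable = [o for o in outcomes if sum(o) == 5]

p_sum_5 = Fraction(len(favourable), len(outcomes))
print(len(outcomes), favourable, p_sum_5)
```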

In general, when all outcomes are equally likely, the probability of a set $A = \{a_1, a_2, \dots, a_k \mid a_i \in S\}$ (the $a_i$’s are different outcomes) is given by

$P[A] = \frac{|A|}{|S|} = \frac{k}{|S|}$.

What we will try to cover in this short introduction are some general methods for counting that are

useful for probability.

Sequences without replacement / partial permutations

Given a set $S$, $|S| = n$, how many unique sequences of $k$ unique objects can be formed? We denote this by $P(n,k)$. The word sequence implies that the order of choice is significant. A real‐life problem may be this: From an institute with 50 professors, a dean, a student dean and a seminar organiser must be chosen. Each person may only hold one chair. The terminology “without replacement” refers to the fact that once we’ve used a certain object from the group, it is not available for further choices. Thus, the object is not “replaced” into the set once used. We can calculate this sequentially: There are 50 choices for dean; once a dean has been chosen there remain 49 possible student deans, and then 48 possible seminar organisers. Altogether, there are $P(50,3) = 50 \cdot 49 \cdot 48 = 117{,}600$ unique choices! In general, the formula is:

$P(n,k) = n \cdot (n-1) \cdots (n-k+1) = \frac{n \cdot (n-1) \cdots 1}{(n-k) \cdot (n-k-1) \cdots 1} = \frac{n!}{(n-k)!}$.

When $k = n$, this reduces to the question of how many different orderings (permutations) of $n$ objects exist. Defining $0! = 1$, we get

$P(n,n) = n \cdot (n-1) \cdots 1 = n!$
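The formula $n!/(n-k)!$ can be checked against a direct enumeration. In this sketch the function name `partial_permutations` is an assumption; `math.perm` (available from Python 3.8) computes the same quantity and is used only as a cross-check.

```python
import math
from itertools import permutations


def partial_permutations(n, k):
    """Number of ordered sequences of k distinct objects out of n: n!/(n-k)!."""
    return math.factorial(n) // math.factorial(n - k)


# The dean / student dean / seminar organiser example: 50 * 49 * 48.
chairs = partial_permutations(50, 3)
print(chairs)

# Brute-force check on a small case: ordered triples out of 5 objects.
small = len(list(permutations(range(5), 3)))
print(small == partial_permutations(5, 3))

# k = n gives the number of full permutations, n!.
print(partial_permutations(4, 4) == math.factorial(4))
```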

Sequences with replacement

Given a set Σ, $|Σ| = n$, how many sequences of length $k$ can be generated using these objects? Notice that because we allow objects to be used multiple times, the objects may be thought of as “replaced” into the set after use, allowing them to be used again. In this case, each object in the $k$‐sequence can be any of the $n$ objects in Σ, and therefore the number of possible $k$‐sequences is $n \cdot n \cdots n$ ($k$ times) $= n^k$. An example (highly relevant to calculating entropy / information) is: how many words of length $k$ can be generated with an alphabet Σ of size $n$?

Unordered samples without replacement / Combinations

A k‐combination from a set S is a choice of k objects from the set. A k‐combination differs from a

partial permutation in that the order of choice is not important. An example would be setting up a

committee of k people out of a group of n people, where all committee members are equal and have

no individual titles. How many such k‐combinations can be formed from a set of cardinality n?

We notice that we can start by calculating the partial permutation $P(n,k) = \frac{n!}{(n-k)!}$. The problem is that each group of k people was counted multiple times, while we now don’t care for the order. So to get the number of k‐combinations we must divide $P(n,k)$ by the number of times each combination was counted.

How many times is each k‐combination counted in $P(n,k)$? The answer is that every permutation of the order is counted once, so we must divide $P(n,k)$ by the number of permutations of k objects, i.e. $P(k,k) = k!$ (see above). Therefore we get the result:

$C(n,k) = \frac{P(n,k)}{P(k,k)} = \frac{n!}{k!\,(n-k)!} \equiv \binom{n}{k}$.
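The division argument (ordered choices divided by the number of orderings of each group) can be sketched as follows. The function name `n_choose_k` is hypothetical; `math.comb` is the standard-library equivalent and serves as a cross-check.

```python
import math
from itertools import combinations


def n_choose_k(n, k):
    """Partial permutations P(n, k) divided by the k! orderings of each group."""
    ordered = math.factorial(n) // math.factorial(n - k)  # P(n, k)
    return ordered // math.factorial(k)                   # divide by P(k, k) = k!


# Committees of 3 people out of 5, listed explicitly.
committees = list(combinations(range(5), 3))

print(n_choose_k(5, 3), len(committees), math.comb(5, 3))
print(n_choose_k(10, 2))  # the number of pairs from a set of 10 objects
```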

The bracketed expression $\binom{n}{k}$ is usually called “n choose k” or “n over k”. This term pops up in a variety of fields in mathematics, and was systematically studied by Blaise Pascal in the 17th century.

Here are some uses of the expression:

Polynomial coefficients:

Let’s consider the expression $(1+x)^n$. Opening the brackets, we will get a polynomial of degree n: $a_0 + a_1 x + a_2 x^2 + \cdots + a_n x^n$. As an example, because there is just one way of choosing just $x$’s when opening the brackets, $a_n = 1$. The coefficients are given by $a_k = \binom{n}{k}$. We shall show this using a combinatorial approach:

$(1+x)^n = (1+x) \cdot (1+x) \cdots (1+x)$

When opening the brackets, we must expand all possible products, by choosing each time a different combination of either $x$ or 1 from each bracket and multiplying them. The coefficient $a_k$ will be an integer that indicates how many different choices of k $x$’s and (n‐k) 1’s there are. This is exactly the definition of $\binom{n}{k}$.

Therefore:

$(1+x)^n = \sum_{k=0}^{n} \binom{n}{k} x^k$.

The cardinality of the power set of $S$, $|S| = n$:

The cardinality of the power set is the number of all subsets of S. We can think of it as the sum of all k‐combinations, summing over $k = 0, 1, 2, \dots, n$. If we plug $x = 1$ into the above formula, we get:

$2^n = (1+1)^n = \sum_{k=0}^{n} \binom{n}{k}$

So we see by a different approach that $|\mathcal{P}(S)| = 2^n$.

Plugging in $x = -1$, we can get the identity (for $n \ge 1$):

$0 = 0^n = (1-1)^n = \sum_{k=0}^{n} (-1)^k \binom{n}{k}$

Rearranging the terms yields:

$\sum_{k=1}^{n} (-1)^{k+1} \binom{n}{k} = \binom{n}{0} = 1$

which is useful in proving the inclusion‐exclusion principle.

Pascal’s triangle:

You probably all once saw the following structure:

              1
            1   1
          1   2   1
        1   3   3   1
      1   4   6   4   1
    1   5  10  10   5   1

The top row is the 0th row, the one below the 1st row etc. Each value is the sum of its two nearest neighbours in the row just above. Confirm that the numbers in rows 0‐5 correspond to the polynomial coefficients of order 0‐5. Notice that this implies that:

$\binom{n}{k} = \binom{n-1}{k-1} + \binom{n-1}{k}$. Try to provide a combinatorial explanation for this equation (hint: what happens when we add a new object to a set?).

Multisets: choosing objects with replacement, without order

Assume that an animal samples four available ports (the set of choices S is of cardinality n=4).

Now let’s assume that the animal samples these ports repetitively, say 10 times (k=10). We

further assume that the animal can re‐choose the same port over and over (hence the notion of

“replacement”: the previous choice is still available after choosing it, as opposed, say, to lottery

numbers which can each only be chosen once).

Now assume we don’t care about the order of port visits, only about the number of visits to

each. How many different results are possible?

This is much trickier than the previous counting problems, and I suggest you play around with it

for a while. Once you’re frustrated, come back and read the neat solution (warning: spoiler ).

The trick to solving this problem is called the “stars and bars” method. Let’s assume that the

animal visits the ports (a,b,c,d) 3,2,1 and 4 times correspondingly. We don’t care about the order

of the visits, only about the identity, so we cluster all the a’s, b’s… together as follows:

aaabbcdddd

We can forget about the letters and form a more abstract representation:

***|**|*|****

We can easily disambiguate the port identity: they progress from a to d sequentially with every

vertical bar.

Altogether we now have 10+4‐1 symbols (or $n+k-1$ symbols). Each distinct choice of k objects with replacement from a set of n objects corresponds to a different placement of (n‐1) vertical bars amongst $n+k-1$ elements. But this is just the binomial term

$\binom{n+k-1}{n-1} = \binom{n+k-1}{k}$
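The stars-and-bars count can be compared with an explicit enumeration of multisets. In this sketch, `itertools.combinations_with_replacement` lists every unordered choice of k = 10 visits to the n = 4 ports; the port labels follow the example above and the variable names are illustrative.

```python
import math
from itertools import combinations_with_replacement

ports = "abcd"   # n = 4 ports
k = 10           # number of visits

# Every unordered result of 10 visits, e.g. ('a','a','a','b','b','c','d','d','d','d').
results = list(combinations_with_replacement(ports, k))

# Stars and bars: choose where the n-1 bars go among n+k-1 symbols.
count_formula = math.comb(len(ports) + k - 1, len(ports) - 1)

print(len(results), count_formula)
print(tuple("aaabbcdddd") in results)
```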

Summary – number of different outcomes:

             Without replacement      With replacement

Ordered      $\frac{n!}{(n-k)!}$      $n^k$

Unordered    $\binom{n}{k}$           $\binom{n+k-1}{k}$

Multinomial

Exercises:

1. Verify that $\binom{n}{k} = \binom{n}{n-k}$. Provide a combinatorial / set theory explanation for this equality.

2. A coin is tossed 10 times with results , , … , . How many combinations exist in which

, , … , ? If the coin is fair, what is the probability of such an event?

3. (*) In a big bag, there are $b$ blue balls and $r$ red balls. Without looking inside, you draw $n$ balls without replacement. How many different combinations of draws would result in having $k$ blue balls? What is the probability of observing this result given that all balls are equally likely to be chosen?

4. (*) How many solutions in non‐negative integers does the following equation have?

$x_1 + x_2 + \cdots + x_n = k$

5. Assume we have a set Σ, $|Σ| = m$. We call this set the alphabet. A string/word of length n is defined as any sequence $s_1 s_2 \cdots s_n$ such that $s_i \in Σ$. Now let’s assume that there exist some rules which determine which words are admissible. We will denote by $W_n$ all admissible words of length n. The language $L$ is the set of all admissible words of all lengths, i.e. $L = \bigcup_n W_n$.

A simple example (language RE1): assume $Σ = \{0,1\}$. Words are generated by a process that has “refractoriness”: after each 1, there must always be at least one 0.

a. Calculate $|W_n|$ for $n = 1,2,3,4,5$ for RE1, by directly listing all admissible words.

b. (*) Can you derive a (recursive) formula for $|W_n|$ in RE1?

c. Use Matlab to check what percentage of possible strings of length n are achievable in RE1 (i.e. calculate $|W_n|/2^n$) for $1 \le n \le 100$.

6. We define the topological entropy of a language as:

$h = \lim_{n \to \infty} \frac{1}{n} \log |W_n|$.

a. What is the topological entropy of a language over alphabet Σ, $|Σ| = m$, in which all words are admissible?

b. Use Matlab to study the behaviour of the expression $\frac{1}{n} \log |W_n|$ of language RE1 for $n = 1,2,\dots,500$.

c. What would happen to the entropy as the refractory period gets longer (RE2, RE3, …)?

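7. We have shown that:

$\sum_{k=0}^{n} \binom{n}{k} = 2^n$

Calculate the following sum, which would come in very handy later:

$\sum_{k=0}^{n} k \cdot \binom{n}{k}$

8. Given a finite set $S$, $|S| = n$, a partition of $S$ is a division of $S$ into non‐overlapping and non‐empty subsets $S_1, S_2, \dots, S_k$. This means that:

a. For all $i$, $S_i \neq \emptyset$.

b. For all $i \neq j$, $S_i \cap S_j = \emptyset$.

c. $\bigcup_i S_i = S$.

Example: The set of all dice toss outcomes is $S = \{1,2,3,4,5,6\}$. A partition into 4 parts can be for example: {2},{3,5},{1},{4,6}. We denote the number of possible k‐partitions of a set of size n as: $\left\{ {n \atop k} \right\}$ (notice brace notation in contrast to the binomial bracket notation; this is called the Stirling number of the second kind).

Prove that:

$\left\{ {n \atop k} \right\} = \left\{ {n-1 \atop k-1} \right\} + k \cdot \left\{ {n-1 \atop k} \right\}$.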

Probability and Statistics 1

Introduction

We toss a coin and observe the outcome. There are only two possible outcomes, Head (H) and Tails

(T). There are many questions we may be interested in knowing the answer to:

1. What is the probability of observing H vs. T?

2. What is the probability of observing a certain sequence of H’s and T’s?

3. When the coin is tossed multiple times, are the outcomes of consecutive tosses related to each

other?

4. A paranoid question: what is the chance that I observe a series of H’s and T’s that is highly

atypical of the actual probabilities?

Probability can of course rear its head in more complicated scenarios. For example, we can ask what

we may know about the weight and height of a person picked at random on the street. This differs

significantly from the previous scenario of coin tossing. Weight and height values are not limited to

two outcomes, and in fact, can assume infinitely many values (if measurement resolution is not an

issue). Can we answer questions such as: “What is the probability of finding a 1.76m tall person who

weighs 54kg”?

What we’ll try to do in this short intro to probability and statistics is give some background relating

to just these questions:

• We’ll define a concept of “probability” relating to possible outcomes of some experiment.

• We’ll introduce several well‐known probability distributions, both discrete ones (like the coin toss) and continuous ones (like the weight / height example).

So let’s start with some examples, and develop the concepts as we go along.

Our first discrete distribution – the Bernoulli trial

In a Bernoulli trial, we deal with experiments that can have two outcomes, 1 and 0 (can be H’s / T’s,

Success / Failure, …). Each outcome has an associated probability:

$P[1] = p, \qquad P[0] = 1 - p$

Bernoulli trials are named after Jacob Bernoulli, a professor at Uni Basel in the 17th century.

Probability 0 means that something has no chance of happening. Probability 1 means that something

is bound to happen (this is a very loose hand‐waving way of defining probability but suffices for our

needs).

We could ask two seemingly “stupid” questions:

• What is the probability that the outcome is BOTH H AND T? Obviously, the answer is 0. An experiment can’t be both a success and a failure. We say that these two events are DISJOINT. We are now using terminology from Set Theory. Since $H \cap T = \emptyset$ (the intersection of H and T is empty), we get $P[H \cap T] = P[\emptyset] = 0$. So, the probability of not getting ANY outcome is 0.

• What is the probability that the outcome is EITHER H OR T? Since these are the only possible outcomes, it is obvious that in this case, $P[H \cup T] = P[H] + P[T] = p + (1 - p) = 1$, where $p = P[H]$. Again, an obvious result: an experiment is expected to have some outcome with certainty.

Probability spaces I

To illustrate the concept of a probability space, we move to another example. A dice has 6 sides,

each associated with a probability. We can write in the meantime something like this:

$P[1] = p_1,\ P[2] = p_2,\ \dots,\ P[6] = p_6$

And because the experiment should have some outcome, it’s obvious that:

$p_1 + p_2 + \cdots + p_6 = 1$

The set containing all the possible outcomes of an experiment is called the sample space. In this case, the sample space S is:

$S = \{1,2,3,4,5,6\}$

Formal Definition 1: A sample space is the set of all possible outcomes of a particular experiment.

We can ask many more questions. For example, what is the probability of the dice outcome being

odd? This would be:

$P[\{1,3,5\}]$

What is the probability of the result being larger than 3?

$P[\{4,5,6\}]$

These are two examples of events. Such compound events are made up of individual outcomes, and

form a subset of the entire sample space.

Formal Definition 2: An event A is a collection (subset) of possible outcomes of an experiment, i.e. $A \subseteq S$ where S is the sample space. This can include the empty set and S itself.

We would like to associate probabilities with events, such that if $A \subseteq S$ we would have an associated probability $P[A] \ge 0$.

Why at all work with events and not with the individual outcomes? The reason is that individual

outcomes may be very numerous and many times we’re really interested just in some summary

property of the outcome (was the outcome even or odd? Was the person’s BMI within the healthy

range?...)

The downside of working with events is that arithmetic is not straightforward. While for outcomes, we can break up the probability to a sum, e.g.

$P[\{1\} \cup \{2\}] = P[1] + P[2]$

this doesn’t necessarily hold for events. For example, let’s define the sets $Q = \{1,3,5\}$ and $T = \{4,5,6\}$.

What is the probability that the outcome is either odd or larger than 3? For a fair dice,

$P[Q \cup T] = P[\{1,3,5\} \cup \{4,5,6\}] = P[\{1,3,4,5,6\}] = \frac{5}{6}$

But on the other hand,

$P[\{1,3,5\}] + P[\{4,5,6\}] = 2 \cdot \frac{1}{2} = 1$

So

$P[Q \cup T] \neq P[Q] + P[T]$

The reason is that in the second sum, the probability of the outcome 5 is counted twice, once as an

odd number (in P[Q]) and once as a number larger than 3 (in P[T]), despite the fact it relates to a

single outcome.
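The double counting of the outcome 5 can be made concrete with a few lines of code. This sketch assumes a fair dice, so that the probability of an event is its size divided by 6; the helper `prob` is a name introduced here for illustration.

```python
from fractions import Fraction

S = {1, 2, 3, 4, 5, 6}   # sample space of one fair dice


def prob(event):
    """Probability of an event under equally likely outcomes (assumed fair dice)."""
    return Fraction(len(event & S), len(S))


Q = {1, 3, 5}   # outcome is odd
T = {4, 5, 6}   # outcome is larger than 3

p_union = prob(Q | T)
p_naive_sum = prob(Q) + prob(T)               # counts the outcome 5 twice
p_corrected = p_naive_sum - prob(Q & T)       # subtract the overlap once

print(p_union, p_naive_sum, p_corrected)
```

Subtracting the probability of the overlap is the probabilistic counterpart of the set identity for the size of a union of two sets from the set theory exercises.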

It seems reasonable to attach a probability to any subset of our sample space (e.g. $\emptyset$, $S$, $\{1\}$, $\{1,6\}$, $\{2,3,5\}$ …). In the dice example, we have 6 outcomes. This means that in total we have $2^6 = 64$ different events (we either include or exclude each outcome). In general, the complete event space over a sample space $S$ is denoted as $2^S$.

We will see that this is not a reasonable way to construct events. To see this, let’s play around with

the simplest of continuous distributions.

Our first continuous distribution ‐ The Uniform Distribution

In this experiment, we choose a random real number $x$ between $a$ and $b$. That is, $x \in [a,b]$ ($x$ is in the closed interval ranging from $a$ to $b$). Our sample space is therefore $S = \{x \in \mathbb{R} : a \le x \le b\}$.

All numbers in this interval are equally likely. Can we attach a probability to a unique outcome, i.e. can we define $P[x]$? This turns out to be impossible. Because all (infinite number of) values $x \in [a,b]$ are equally likely, attaching a value $P[x] > 0$ to each of them would lead to their sum being infinity.

What can we do then? Our intuition tells us that if we have a uniform distribution between 0 and 1,

there is probability ½ of being between 0 and 0.5, and probability ½ of being between 0.5 and 1 etc.

In general, we could imagine assigning a probability to anything that has a positive “measure”

(roughly speaking, some subset of the interval $[a,b]$ to which we can attribute a notion of length).

Probability spaces II

Formal Definition 3: A collection of subsets of S is called a σ‐algebra (or Borel field), denoted by $\mathcal{B}$, if it satisfies the following three properties:

a. $\emptyset \in \mathcal{B}$ (the empty set is an element of $\mathcal{B}$).

b. If $A \in \mathcal{B}$, then $A^\complement \in \mathcal{B}$ ($\mathcal{B}$ is closed under complementation).

c. If $A_1, A_2, \dots \in \mathcal{B}$, then $\bigcup_{i=1}^{\infty} A_i \in \mathcal{B}$ ($\mathcal{B}$ is closed under countable unions).

The smallest “legit” σ‐algebra on a sample space S has just two objects in it: $\{\emptyset, S\}$. This trivial σ‐algebra is quite boring. Obviously, $P[S] = 1$ and $P[\emptyset] = 0$. At the other end, the maximal σ‐algebra is $2^S$, which we’ve just seen may be too rich to allow us to attach probabilities to all its members. So the question is how rich a σ‐algebra can be while still allowing us to attach meaningful probabilities to the events it defines.

For the uniform distribution, assume $a \le c \le d \le b$. Then it makes sense to attach a probability to the line segment $[c,d]$ that is proportional to its relative length. We then get

$P[[c,d]] = \frac{|d - c|}{|b - a|}$

We can get more interesting events by using the complement and the union rules of definition 3.

Let’s define now what a probability space is:

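Formal Definition 4: Given a sample space S and an associated σ‐algebra $\mathcal{B}$, a probability function P is a function with domain $\mathcal{B}$ and range [0,1] ($P: \mathcal{B} \to [0,1]$) such that:

1. For every $A \in \mathcal{B}$, $P[A] \ge 0$.

2. $P[S] = 1$.

3. If $A_1, A_2, \dots \in \mathcal{B}$ are pairwise disjoint (i.e. $A_i \cap A_j = \emptyset$ for all $i \neq j$), then $P\left[\bigcup_i A_i\right] = \sum_i P[A_i]$.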

Conditional probability

Example 1: What is the probability of tossing two fair dice and getting a sum of 10?

When all outcomes have equal probability, we can just count them and get

$P[\text{sum} = 10] = \frac{|\{(4,6),(5,5),(6,4)\}|}{36} = \frac{3}{36} = \frac{1}{12}$.

But there is a different way of making this calculation: We could ask what the probability is of getting a sum of 10 imagining that the dice are tossed one after the other. The first dice must show a number larger than 3, otherwise the sum cannot be 10. Once the first dice result is seen (let’s name it $d_1$), the second dice must show exactly $10 - d_1$ for the sum to be 10. In probability language we can say:

$P[\text{sum} = 10] = \frac{1}{2} \cdot \frac{1}{6} = \frac{1}{12}$

But how exactly do we write such probabilities, that depend on previous outcomes?

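Example 2: What is the probability that we pick four aces out of a pack of 52 cards? Assuming all cards are picked with equal probability, there are $\binom{52}{4}$ unique sets of 4 cards. Only one of them is the set of four aces. Therefore $P[4\text{ aces}] = 1 / \binom{52}{4}$.

But we could imagine picking cards one after the other, and “updating” our probability space accordingly:

$P[4\text{ aces}] = \frac{4}{52} \cdot \frac{3}{51} \cdot \frac{2}{50} \cdot \frac{1}{49}$

(Notice that $\frac{4}{52} \cdot \frac{3}{51} \cdot \frac{2}{50} \cdot \frac{1}{49} = \frac{4! \cdot 48!}{52!} = 1 / \binom{52}{4}$)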
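Both routes to the four-aces probability can be compared exactly. This is a small sketch using exact fractions; the variable names are chosen for illustration.

```python
import math
from fractions import Fraction

# Counting route: one favourable set out of all 4-card sets.
p_counting = Fraction(1, math.comb(52, 4))

# Sequential route: an ace on each of four successive draws without replacement.
p_sequential = (Fraction(4, 52) * Fraction(3, 51)
                * Fraction(2, 50) * Fraction(1, 49))

print(p_counting, p_sequential, p_counting == p_sequential)
```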

But the factors in the multiplication are from different probability spaces, each with a decreasing

number of cards (the number decreases with each choice), and also with a decreasing number of

aces (because we assume that all previous picks were aces)!

What we basically did was to calculate conditional probabilities: probabilities that depend on

outcomes of other experiments in our model.

Formal definition 4: If $A$, $B$ are events in sample space $S$, and $P[B] > 0$, then the conditional probability of A given B, written $P[A|B]$, is

$P[A|B] = \frac{P[A \cap B]}{P[B]}$

A Venn diagram can help understand how the term $P[B]$ normalizes the conditional probability to behave like a proper probability space:

[Figure: Venn diagram of events A and B within the sample space S.]

Assuming that the sizes of the circles in the Venn diagram above are proportional to their relative probability, it is obvious that $P[A] \ll 1$. But what happens if we’re already told that the outcome is in B? What is now the probability that the result is also in A?

What basically happens is that we update our probability space to include only outcomes in B. The probability of being in B may be small, but once we know our outcome is in it, all outcomes within it become more probable. On the other hand, all events outside it receive probability 0:

[Figure: Venn diagram in which events A and B do not overlap within the sample space S.]

This is because the numerator is $P[A \cap B] = P[\emptyset] = 0$ in this case.

Events can of course partially overlap, in which case:

[Figure: Venn diagram of partially overlapping events A and B within the sample space S.]

In this case, $P[A|B]$ will be >0 but very small, because the intersection $A \cap B$ is small relative to the updated probability space $B$.

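Let’s make sure that the conditional probability $P[\,\cdot\,|B]$, $P[B] > 0$, is really a probability function:

1. For every $A$ in the Borel field, $P[A|B] = \frac{P[A \cap B]}{P[B]} \ge 0$.

2. $P[S|B] = \frac{P[S \cap B]}{P[B]} = \frac{P[B]}{P[B]} = 1$.

3. If $A_1, A_2, \dots$ are pairwise disjoint (i.e. $A_i \cap A_j = \emptyset$ for all $i \neq j$), then:

$P\left[\bigcup_i A_i \,\middle|\, B\right] = \frac{P\left[\left(\bigcup_i A_i\right) \cap B\right]}{P[B]} = \frac{P\left[\bigcup_i (A_i \cap B)\right]}{P[B]} = \frac{\sum_i P[A_i \cap B]}{P[B]} = \sum_i P[A_i|B]$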

Bayes’ rule

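Notice that by definition, if $P[A], P[B] > 0$, then:

$P[A|B] = \frac{P[A \cap B]}{P[B]} \;\Rightarrow\; P[A \cap B] = P[A|B] \cdot P[B]$

$P[B|A] = \frac{P[A \cap B]}{P[A]} \;\Rightarrow\; P[A \cap B] = P[B|A] \cdot P[A]$

Equating both sides we get:

$P[A|B] \cdot P[B] = P[B|A] \cdot P[A]$

from which we can derive Bayes’ Rule:

$P[A|B] = \frac{P[B|A] \cdot P[A]}{P[B]}$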

This rule is extremely useful in neuroscience, as we’ll see in some examples that follow.

The Law of Total Probability

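Imagine we have a partitioning $A_1, A_2, \dots$ of our sample space $S$, i.e.

1. $\bigcup_i A_i = S$

2. If $i \neq j$, then $A_i \cap A_j = \emptyset$.

Then for any event $B$,

$P[B] = P[B \cap S] = P\left[B \cap \bigcup_i A_i\right] = P\left[\bigcup_i (B \cap A_i)\right] = \sum_i P[B \cap A_i] = \sum_i P[B|A_i] \cdot P[A_i]$

For simple cases, we may think of it as a “weighted” probability.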

Example: The probability of haemophilia at birth is $P[H] = 1/10{,}000$. The probability of haemophilia at birth of males is $P[H|♂] = 1/5{,}000$. Assuming that $P[♂] = P[♀] = 0.5$, then:

$P[H] = 0.0001 = P[H|♂] \cdot P[♂] + P[H|♀] \cdot P[♀] = 0.0002 \cdot 0.5 + P[H|♀] \cdot 0.5$

We can thus extract the probability of a female haemophilia birth:

$P[H|♀] = \frac{0.0001 - 0.0002 \cdot 0.5}{0.5} = 0$

which is close to the true situation (almost no female haemophilia cases).

Bayes’ Rule v. II

Plugging the Law of Total Probability into Bayes’ rule, we get:

$P[A_i|B] = \frac{P[B|A_i] \cdot P[A_i]}{\sum_j P[B|A_j] \cdot P[A_j]}$
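As a sketch, the haemophilia example can be rerun with exact fractions, first solving the Law of Total Probability for the unknown term and then applying the second form of Bayes’ rule. The numbers are those quoted in the example; the variable names, and the final question about the probability of being male given haemophilia, are additions made here for illustration.

```python
from fractions import Fraction

p_h = Fraction(1, 10000)            # P[H], haemophilia at birth
p_h_given_male = Fraction(1, 5000)  # P[H | male]
p_male = Fraction(1, 2)
p_female = Fraction(1, 2)

# Law of Total Probability, solved for the unknown P[H | female].
p_h_given_female = (p_h - p_h_given_male * p_male) / p_female

# Bayes' rule with the total probability in the denominator.
total = p_h_given_male * p_male + p_h_given_female * p_female
p_male_given_h = p_h_given_male * p_male / total

print(p_h_given_female, total, p_male_given_h)
```

With these rounded figures every haemophilia birth is attributed to males, which is the idealised version of the observation that female cases are extremely rare.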

Statistical independence

In some cases, the occurrence of an event $B$ has no effect on the probability of another event, $A$. This means that:

$P[A|B] = P[A]$.

Plugging this into the equation defining conditional probability, we get:

$P[A|B] = \frac{P[A \cap B]}{P[B]} = P[A] \;\Rightarrow\; P[A \cap B] = P[A] \cdot P[B]$

• When two events are independent, their joint probability $P[A \cap B]$ is the product of their individual probabilities $P[A] \cdot P[B]$.

• When $P[A \cap B] \neq P[A] \cdot P[B]$, we say that $A$ and $B$ are statistically dependent. This means that information about one event influences what we know about the other (see Bayes’ Rule).

• A collection of events $A_1, A_2, \dots, A_n$ is independent only if for every sub‐collection $A_{i_1}, A_{i_2}, \dots$,

$P[A_{i_1} \cap A_{i_2} \cap \dots] = P[A_{i_1}] \cdot P[A_{i_2}] \cdots$

(i.e. all pairs, triplets, … of events must exhibit this separability property)

Examples: We toss a fair coin twice. Knowledge about the first outcome (H/T) doesn’t contribute knowledge about the outcome of the second toss. But, if we think of one event $A$ as indicating whether the first outcome was H, and a second event $B$ indicating whether both tosses resulted in H, obviously $A$ contributes information about $B$!
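The two-toss example can be checked against the product rule directly. This sketch assumes a fair coin, so all four outcomes are equally likely; the event names are chosen for illustration.

```python
from fractions import Fraction
from itertools import product

S = set(product("HT", repeat=2))   # sample space of two tosses


def prob(event):
    """Probability under equally likely outcomes (assumed fair coin)."""
    return Fraction(len(event), len(S))


first_h = {s for s in S if s[0] == "H"}    # first toss is H
second_h = {s for s in S if s[1] == "H"}   # second toss is H
both_h = {("H", "H")}                      # both tosses are H

# The two tosses are independent: the joint probability factorises.
tosses_independent = prob(first_h & second_h) == prob(first_h) * prob(second_h)

# "First toss is H" and "both tosses are H" are dependent events.
events_dependent = prob(first_h & both_h) != prob(first_h) * prob(both_h)

print(tosses_independent, events_dependent)
```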

Random variables

In many cases, we are interested in the probability of something that summarizes the behavior of an

experiment or several experiments. For example, we may toss coins and be interested only in the

number of heads, not in the order at which they occurred. We are willing to ignore the exact identity

of our outcomes.

Formal definition 5: A random variable is a function from a sample space into the real numbers.

Example: Let’s assume that our experiment is the toss of two dice. Then the sample space is the Cartesian product $S = \{1,2,3,4,5,6\} \times \{1,2,3,4,5,6\}$ which contains 36 outcomes. Now we construct a function $X: S \to \mathbb{R}$ such that for each point in the sample space, $X$ gives the sum of the two dice. For example, $X((2,3)) = 5$. The function $X$ summarizes a property of interest of our outcomes (in this case – the sum of the dice) while ignoring others (in this case – the actual dice values). Instead of 36 outcomes, the target of $X$ contains only 11 possible values (2,3,…,12).

We can now calculate $P[X = 5]$:

$P[X = 5] = P[\{(a,b) \in S : a + b = 5\}] = P[\{(1,4),(2,3),(3,2),(4,1)\}] = \frac{4}{36} = \frac{1}{9}$

So we see that each possible value that the random variable can assume points back to several (or

just one) outcome in the original sample space.
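A random variable is just a function on the sample space, which the following sketch makes explicit for the sum of two fair dice. The function name `X` follows the text; everything else is an illustrative choice.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

S = list(product(range(1, 7), repeat=2))   # 36 equally likely outcomes


def X(outcome):
    """The random variable: maps an outcome (a, b) to the sum a + b."""
    return outcome[0] + outcome[1]


# Count how many outcomes map to each value of X, then normalise.
counts = Counter(X(s) for s in S)
pmf = {value: Fraction(n, len(S)) for value, n in counts.items()}

print(pmf[5])
print(sorted(pmf))   # the possible values of X
```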

Geometric distribution

To demonstrate the power of random variables in summarizing experiment results, we describe a

different summary of Bernoulli trials. In this case, we imagine a (possibly) infinite repetition of

independent Bernoulli trials, all having a probability $p$ of the result “1” and probability $1 - p$ of the result “0”.

The question we ask: How many trials does it take until we observe the first “1”?

Our sample space is $S = \{0,1\}^{\infty}$. We can define a random variable $X: S \to \mathbb{R}$ such that $X$ indicates the number of the trial on which the first “1” was observed.

It’s easy to see that

$P[X = k] = P[\underbrace{00\cdots0}_{k-1}1\cdots] = (1-p)^{k-1} \cdot p$

Exercise: Verify that $\sum_{k=1}^{\infty} P[X = k] = 1$.

[Notice that an infinite number of points in $S$ map into $X = k$, but we don’t really care about the values of the Bernoulli trials after the point $k$, since we’re only interested in how long it took us to get there!]
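A numerical check is no substitute for the exercise, but it shows what to expect: the partial sums of the geometric probabilities approach 1. The value `p = 0.3` and the cut-off of 200 terms are arbitrary choices made for this sketch.

```python
def geometric_pmf(k, p):
    """P[X = k]: k-1 zeros followed by the first one on trial k."""
    return (1 - p) ** (k - 1) * p


p = 0.3  # arbitrary success probability, chosen only for illustration

# Sum the first 200 terms; the remaining tail is (1 - p)**200, which is negligible.
partial_sum = sum(geometric_pmf(k, p) for k in range(1, 201))

print(partial_sum)
```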

Binomial distribution

We perform a set of exactly $n$ Bernoulli trials. In this distribution, there are two parameters: the Bernoulli probability $p$, and the number of trials $n$.

The question we ask: how many “1”’s do we get?

We define $X: \{0,1\}^n \to \mathbb{R}$ such that $X$ indicates the number of 1’s in the sequence.

It’s obvious that every sequence of length $n$ with $k$ “1”’s (and hence $n-k$ “0”’s) has an equal probability of $p^k \cdot (1-p)^{n-k}$. But how many such sequences do we have? Going back to combinatorics it’s obvious that there are $\binom{n}{k}$ such sequences. Therefore,

$P[X = k \mid n, p] = \binom{n}{k} \cdot p^k \cdot (1-p)^{n-k}$

The random variable induces a σ‐algebra on the sample space and we can verify (exercise!) that $P[X = k \mid n, p]$ satisfies the probability axioms. As shorthand, we many times write $P[k]$ instead of $P[X = k]$.
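The binomial probabilities can be computed directly from the formula. The sketch below uses n = 3 and p = 0.5 as an arbitrary small case; the function name `binomial_pmf` is an assumption made for this illustration.

```python
import math


def binomial_pmf(k, n, p):
    """Probability of exactly k ones in n independent Bernoulli(p) trials."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)


n, p = 3, 0.5
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]

print(pmf)
print(sum(pmf))   # the probabilities over k = 0..n should add up to 1
```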

Distribution functions

Formal definition 6: The cumulative distribution function (cdf) of a random variable $X$ is defined by:

$F_X(x) = P[X \le x]$ for all $x$.

Example: Assume that a neuron fires in each 1‐ms bin with an independent probability of 0.001. What is the probability that the neuron doesn’t fire at all for 2 seconds?

• What distribution describes this kind of problem?

• Derive an expression for the probability of 2‐second quiescence.

• How is this related to the cdf?

For discrete distributions, the cdf will be a staircase function:

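Assume $X \sim \text{Binomial}(3, \tfrac{1}{2})$.

$F_X(x) = \begin{cases} 0, & -\infty < x < 0 \\ \frac{1}{8}, & 0 \le x < 1 \\ \frac{1}{2}, & 1 \le x < 2 \\ \frac{7}{8}, & 2 \le x < 3 \\ 1, & 3 \le x < \infty \end{cases}$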
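The staircase shape follows from summing the binomial probabilities up to the largest integer not exceeding x. This is a sketch with exact fractions; the function name `binomial_cdf` and the sample points are choices made here for illustration.

```python
import math
from fractions import Fraction


def binomial_cdf(x, n, p):
    """P[X <= x] for X ~ Binomial(n, p); defined for every real x."""
    if x < 0:
        return Fraction(0)
    top = min(n, math.floor(x))   # largest attainable value not exceeding x
    return sum(Fraction(math.comb(n, k)) * p ** k * (1 - p) ** (n - k)
               for k in range(top + 1))


half = Fraction(1, 2)
points = (-1, 0, 0.5, 1, 2, 2.9, 3, 10)
values = [binomial_cdf(x, 3, half) for x in points]

for x, v in zip(points, values):
    print(x, v)
```

Evaluating at non-integer points such as 0.5 and 2.9 shows the flat steps between jumps, and the value at each integer already includes the jump there, which is what right‐continuity means.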

Notice that:

• $F_X(x)$ is defined for all $x \in \mathbb{R}$ despite the fact that $X$ can assume in this case only the values 0‐3!

• $F_X(x)$ is right‐continuous (but not necessarily left‐continuous).

• $F_X(x)$ is a monotonically non‐decreasing function of $x$ (why?)

• $\lim_{x \to -\infty} F_X(x) = 0$, $\lim_{x \to \infty} F_X(x) = 1$.

Example: A continuous cdf is given by:

$F_X(x) = \frac{1}{1 + e^{-x}}$

This is the logistic distribution. Verify that it satisfies the requirements of a cdf!

Probability density and mass functions

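Formal definition 7: If $X$ is a discrete random variable, then its probability mass function is given by:

$f_X(x) = P[X = x]$ for all $x$.

Notice that for a discrete variable that can only attain integer values (i.e. if $x \notin \mathbb{Z}$, $f_X(x) = 0$),

$F_X(x) = \sum_{k=-\infty}^{\lfloor x \rfloor} f_X(k)$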

In a similar fashion, for continuous variables we can define a probability density function with an

integral replacing the sum:

Formal definition 8: If $X$ is a continuous random variable, then its probability density function is the function $f_X(x)$ satisfying:

$F_X(x) = \int_{-\infty}^{x} f_X(t)\,dt$ for all $x$.

Example: Taking the logistic distribution and using the fundamental theorem of calculus, we get:

$f_X(x) = \frac{d}{dx} F_X(x) = \frac{e^{-x}}{(1 + e^{-x})^2}$
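The relation between the logistic cdf and its density can be checked numerically by comparing the density with a finite-difference derivative of the cdf. The step size, the test points and the function names are assumptions made for this sketch.

```python
import math


def logistic_cdf(x):
    return 1.0 / (1.0 + math.exp(-x))


def logistic_pdf(x):
    """Derivative of the logistic cdf: e^(-x) / (1 + e^(-x))^2."""
    e = math.exp(-x)
    return e / (1.0 + e) ** 2


def numerical_derivative(f, x, h=1e-5):
    """Central finite difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)


test_points = [-3.0, -1.0, 0.0, 0.5, 2.0]
max_error = max(abs(numerical_derivative(logistic_cdf, x) - logistic_pdf(x))
                for x in test_points)
print(max_error)

# Probability of falling in an interval, here [-1, 1], as a difference of cdf values.
p_interval = logistic_cdf(1.0) - logistic_cdf(-1.0)
print(p_interval)
```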

• Notice that despite the fact we can’t define a probability of a single value for continuous variables, we can define a “density”.

• The probability of being within some interval $[a,b]$ would be:

$P[a \le X \le b] = F_X(b) - F_X(a) = \int_a^b f_X(x)\,dx$
