Bayes Unit1

download Bayes Unit1

of 45

description

Bayesian Theory slide

Transcript of Bayes Unit1

  • George Mason University

    Unit 1 (v5) - 1 -

    Department of Systems Engineering and Operations Research

    Kathryn Blackmond Laskey Spring 2015

    Bayesian Inference andDecision Theory

    Unit 1: A Brief Tour of Bayesian Inference and Decision Theory

    Instructor: Kathryn Blackmond Laskey Room 2214 ENGR

    (703) 993-1644 Office Hours: Wednesday and Thursday 4:30-5:30 PM,

    or by appointment

    Spring 2015

  • George Mason University

    Unit 1 (v5) - 2 -

    Department of Systems Engineering and Operations Research

    Kathryn Blackmond Laskey Spring 2015

    What this Course is About You will learn a way of thinking about problems of inference

    under uncertainty You will learn to construct mathematical models for inference

    and decision problems You will learn how to apply these models to draw inferences

    from data and to make decisions These methods are based on Bayesian Decision Theory, a

    formal theory for rational inference and decision making

  • George Mason University

    Unit 1 (v5) - 3 -

    Department of Systems Engineering and Operations Research

    Kathryn Blackmond Laskey Spring 2015

    Logistics Web site

    http://seor.gmu.edu/~klaskey/SYST664 Blackboard site: http://mymason.gmu.edu

    Textbook and Software Hoff, A First Course in Bayesian Statistical Methods, Springer, 2009 Other recommended texts on course web site Some assignments can be done in Excel We will use R, a free open-source statistical computing environment:

    http://www.r-project.org/. R code for many textbook examples is on authors web site We will use JAGS, an open-source package for Markov Chain Monte Carlo simulation:

    http://mcmc-jags.sourceforge.net/ Requirements

    Regular assignments (20%): can be handed in on paper or through Blackboard Take-home midterm (30%) and final (30%) Project (20%): apply methods to a problem of your choosing

    Office hours Official office hours are 4:30-5:30PM Wednesdays and Thursdays I respond to questions by email and am available by appointment

    Policies and Resources Academic integrity policy Read the policies and resources section of the syllabus

  • George Mason University

    Unit 1 (v5) - 4 -

    Department of Systems Engineering and Operations Research

    Kathryn Blackmond Laskey Spring 2015

    Course OutlineUnit 1: A Brief Tour of Bayesian Inference and Decision Theory Unit 2: Random Variables, Parametric Models, and Inference

    from Observation Unit 3: Statistical Models with a Single Parameter Unit 4: Monte Carlo Approximation Unit 5: The Normal Model Unit 6: Gibbs Sampling Unit 7: The Multivariate Normal Model Unit 6: Bayesian Regression and Analysis of Variance Unit 8: Hierarchical Bayesian Models Unit 9: Linear Regression Unit 10: Metropolis-Hastings Sampling

    (Later units are subject to change)

  • George Mason University

    Unit 1 (v5) - 5 -

    Department of Systems Engineering and Operations Research

    Kathryn Blackmond Laskey Spring 2015

    Learning Objectives for Unit 1 Describe the elements of a decision model Refresh knowledge of probability Apply Bayes rule for simple inference and decision problems

    and interpret the results Use a graph to express conditional independence among

    uncertain variables Explain why Bayesians believe inference cannot be separated

    from decision making Compare Bayesian and frequentist philosophies of statistical

    inference Compute and interpret the expected value of information (VOI)

    for a decision problem with an option to collect information

  • George Mason University

    Unit 1 (v5) - 6 -

    Department of Systems Engineering and Operations Research

    Kathryn Blackmond Laskey Spring 2015

    Bayesian Inference Bayesians use probability to quantify rational degrees of belief Bayesians view inference as belief dynamics

    Begin with prior beliefs Use evidence to update prior beliefs to posterior beliefs Posterior beliefs become prior beliefs for next evidence

    Inference problems are usually embedded in decision problems

  • George Mason University

    Unit 1 (v5) - 7 -

    Department of Systems Engineering and Operations Research

    Kathryn Blackmond Laskey Spring 2015

    Decision Theory Decision theory is a formal theory of decision making under

    uncertainty A decision problem consists of:

    Possible actions: {a}aA States of the world (usually uncertain): {s}sS Possible consequences: {c}cC (depends on state and action)

    Question: What is the best action? Answer (according to decision theory):

    Measure goodness of consequences with a utility function u(c) Measure likelihood of states with probability distribution p(s) Best action with respect to model maximizes expected utility:

    Caveat emptor: How good it is for you depends on fidelity of model to your beliefs

    and preferences

    a *= argmaxa

    E[u(c) |a]{ } For brevity, we may write E[u(a)] for E[u(c) | a]

  • George Mason University

    Unit 1 (v5) - 8 -

    Department of Systems Engineering and Operations Research

    Kathryn Blackmond Laskey Spring 2015

    Illustrative Example (Highly Oversimplified)

    Decision problem: Should patient be treated for disease? We suspect she may have disease but do not know Without treatment the disease will lead to long illness Treatment has unpleasant side effects

    Decision model: Actions: aT (treat) and aN (dont treat) States of world: sD (disease now) and sW (well now) Consequences: cWN (well shortly, no side effects), cWS (well shortly, side

    effects), cDN (disease for long time, no side effects) Probabilities and Utilities:

    P(sD) = 0.3 u(cWN) = 100, u(cWS) = 90; u(cDN) = 0

    Expected utility: Treat: .390 + .790 = 90 Don't treat: .30 + .7100 = 70

    Best action is aT (treat patient)

    u(cWN) 100

    u(cDN) 0

    u(cWS) 90

    P(sD) = 0.3 P(sW) = 0.7

    EU(aN) 70

  • George Mason University

    Unit 1 (v5) - 9 -

    Department of Systems Engineering and Operations Research

    Kathryn Blackmond Laskey Spring 2015

    Sensitivity Analysis: Optimal Decision as Function of Sickness Probability Expected utility of the two actions:

    E[U|aT] = 90 E[U|aN] = 0p + 100(1-p) = 100(1 p) We should treat if p > 0.1, dont treat if p < 0.1

    Typically we are uncertain about the value of p The chart shows how the optimal decision changes as we vary p

    We may want to collect more information to refine our estimate of p

    Expected gain from treatment at p = 0.3

    Expected utility of not treating depends on p = P(sD), the probability that the patient has the disease

  • George Mason University

    Unit 1 (v5) - 10 -

    Department of Systems Engineering and Operations Research

    Kathryn Blackmond Laskey Spring 2015

    Why be a Bayesian? Arguments from theory

    A coherent decision maker uses probability to represent uncertainty, uses utility to represent value, and maximizes expected utility

    If you are not coherent then someone can make "Dutch book" on you (turn you into a "money pump")

    Pragmatic arguments Decision theory provides a useful and principled methodology for modeling

    problems of inference, decision and learning from experience Engineering tradeoffs between accuracy, complexity and cost can be analyzed

    and evaluated Both empirical data and informed engineering judgment can be explicitly

    represented and incorporated into a model Bayesian methods can handle small, moderate and large sample sizes; small,

    moderate and large numbers of parameters With other approaches it is often more difficult to understand why you got the

    results you did and how to improve your model Arguments from experience

    Successful applications Success attributed to decision theoretic technology

    Caution: Uncritical application of cookbook methods can lead to disaster! Good modelers iteratively assess, check and revise assumptions

  • George Mason University

    Unit 1 (v5) - 11 -

    Department of Systems Engineering and Operations Research

    Kathryn Blackmond Laskey Spring 2015

    A probability distribution is a function P() applied to sets such that: P(A) 0 for all subsets A of the universal set % P() = 1 If AiAj = then P(A1A2) = P(A1) + P(A2)+

    From this basic definition we can deduce other identities of probability theory, e.g.: P() = 0 P(A) 1 for all subsets A of % P(AB) = P(A) + P(B) - P(AB) for any A and B If AiAj = and A=A1A2, then P(AB) = i P(B|Ai)P(Ai)

    The conditional probability of A given B for any two sets A and B is defined as a number P(A|B) satisfying: P(A|B)P(B) = P(AB) If P(B)0 this is equivalent to the traditional

    formula:

    A is independent of B if P(A|B) = P(A)

    Review: Probability Basics

    P(A | B) = P(A B)P(B)

    Law of total probability

  • George Mason University

    Unit 1 (v5) - 12 -

    Department of Systems Engineering and Operations Research

    Kathryn Blackmond Laskey Spring 2015

    Extending the Disease Example: Gathering Information

    We may be able to perform a test before deciding whether to treat the patient

    Test has two outcomes: tP (positive) and tN (negative) Quality of test is characterized by two numbers:

    Sensitivity: Probability that test is positive if patient has disease Specificity: Probability that test is negative if patient does not have

    disease

    We will assume: Sensitivity: P(tP | sD) = 0.95 Specificity: P(tN | sW) = 0.85

    How does the model change if test results are available? Take test, observe outcome t Revise prior beliefs P(sD) to obtain posterior beliefs P(sD|t) Re-compute optimal decision using P(sD|t)

  • George Mason University

    Unit 1 (v5) - 13 -

    Department of Systems Engineering and Operations Research

    Kathryn Blackmond Laskey Spring 2015

    Bayes Rule: The Law of Belief Dynamics Objective: use evidence E to update beliefs about a hypothesis H

    H: patient has (does not have) disease E: evidence from test

    Procedure: apply Bayes Rule (standard form):

    Bayes Rule (odds likelihood form):

    Terminology: P(H) - The prior probability of H P(E|H) - The likelihood for E given H P(E) - The predictive probability of E P(H|E) - The posterior probability of H given E - The likelihood ratio for H1 versus H2 - The prior odds ratio for H1 versus H2

    The posterior probability of H1 increases relative to H2 if the evidence is more likely given H1 than given H2

    P(E |H1)P(E |H2)

    P(H1)P(H2)

    P(E)>0, HiHj = , =H1H2

    P(E)>0, P(H2)>0

    P(H | E) = P(EH )P(E) =P(E |H )P(H )

    P(E) =P(E |H )P(H )P(E |Hi )P(Hi )i

    P(H1 | E)P(H2 | E)

    = P(E |H1)P(H1)P(E |H2 )P(H2 )

  • George Mason University

    Unit 1 (v5) - 14 -

    Department of Systems Engineering and Operations Research

    Kathryn Blackmond Laskey Spring 2015

    Disease Example with Test Review of Problem Ingredients:

    P(sD) = 0.3 (prior probability of disease) P(tP | sD) = 0.95; P(tN | sW) = 0.85 (sensitivity & specificity of test) u(cWN) = 100, u(cWS) = 90; u(cDN) = 0 (utilities)

    If test is negative: P(sD | tN) = (0.3 x 0.05)/(0.3 x 0.05 + 0.7 x 0.85) = 0.025 EU(aN | tN) = 0.025 0 + 0.975 100 = 97.5 EU(aT | tN) = 0.025 90 + 0.975 90 = 90 Best action is not to treat

    If test is positive: P(sD | tP) = (0.3 x 0.95)/(0.3 x 0.95 + 0.7 x 0.15) = 0.731 EU(aN | tP) = 0.731 0 + 0.269 100 = 26.9 EU(aT | tP) = 0.731 90 + 0.269 90 = 90 Best action is to treat

    Optimal policy is strategy aF (FollowTest): Treat if test is positive; dont treat if test is negative

    EU(aN | tN)

    97.5

    EU(aN | tP) 26.9

    EU(aT) = EU(aT|tP) = EU(aT|tP)

    90

    EU(aN) 70

  • George Mason University

    Unit 1 (v5) - 15 -

    Department of Systems Engineering and Operations Research

    Kathryn Blackmond Laskey Spring 2015

    Value of Information Reminder of problem ingredients:

    P(sD) = 0.3 (prior probability of disease) P(tP | sD) = 0.95; P(tN | sW) = 0.85 (sensitivity & specificity of test) u(cWN) = 100, u(cWS) = 90; u(cDN) = 0 (utilities)

    Probability test will be positive: P(tP) = P(tP | sD) P(sD) + P(tP | sW) P(sW) = 0.950.3 + 0.150.7 = 0.39

    If test is positive we should treat, with EU(aT) = 90 If test is negative we should not treat, with EU(aN | tN) = 97.5 Expected utility of FollowTest strategy (treat if test is positive,

    otherwise dont): EU(aF) = P(tP) EU(aT) + P(tN) EU(aN | tN)

    = 0.39 90 + (1-0.39) 97.5 = 94.575 If we do not perform the test, our best action is to treat everyone

    with EU(aT) = 90 Gain from test is 94.575 90 = 4.575 This is called the Expected Value of Sample Information (EVSI)

  • George Mason University

    Unit 1 (v5) - 16 -

    Department of Systems Engineering and Operations Research

    Kathryn Blackmond Laskey Spring 2015

    Expected Value of Perfect Information

    Expected Value of Perfect Information (EVPI) is increase in utility from perfect knowledge of an uncertain variable

    Suppose an oracle will tell us whether patient is sick An oracle has Sensitivity = Specificity = 1

    30% chance we discover she is sick and treat - utility 90 70% chance we discover she is well and dont treat - utility 100 Expected utility if we ask the oracle 0.3 x 90 + 0.7 x 100 = 97 EVPI = 97 - 90 = 7

    EVPI EVSI 0 EVPI = EVSI = 0 if the test will not change your decision

  • George Mason University

    Unit 1 (v5) - 17 -

    Department of Systems Engineering and Operations Research

    Kathryn Blackmond Laskey Spring 2015

    Should We Collect Information? General Principle: Free information can never hurt To analyze decision of whether to collect information D on outcome

    variable V: Find maximum expected utility option if we don't collect information Compute its expected utility U0 Find EVPI

    For each possible value V=v, assume it is known to be the true outcome, find optimal decision, calculate its expected utility, and multiply by the probability that V=v

    Add these values and subtract from no-information expected utility to get EVPI Compare EVPI with cost of information If EVPI is too small in relation to cost then stop; otherwise, compute EVSI

    For each possible result D=d of the experiment, find the maximum expected utility action a(d) and its utility u(a(d))

    For each outcome V=v and result D=d of the experiment, find the joint probability P(v,d) Calculate the expected utility with information USI = v,d P(v,d)u(a(d)) Compute difference USI = U0 in expected utility between no-information decision and

    decision with information to get EVSI Compare EVSI with cost of information Collect information if EVSI is large enough in relation to cost of information

  • George Mason University

    Unit 1 (v5) - 18 -

    Department of Systems Engineering and Operations Research

    Kathryn Blackmond Laskey Spring 2015

    FollowTest strategy treats if test is positive and otherwise not

    E[U|aF] = 0.9590 + .050 + 0.15 (1-)90 + 0.85 (1-)100

    = 98.5 13

    World State

    Probability Action Utility

    Sick, Positive

    .95 Treat 90

    Sick, Negative

    .05 NoTreat 0

    Well, Positive

    .15(1-) Treat 90

    Well, Negative

    .85(1-) NoTreat 100

    Expected Utility of FollowTest Policy as Function of = P(sD)

  • George Mason University

    Unit 1 (v5) - 19 -

    Department of Systems Engineering and Operations Research

    Kathryn Blackmond Laskey Spring 2015

    FollowTest strategy treats if test is positive and otherwise not

    E[U|aF] = P(sD|tP)P(tP)90 + . P(sD|tN)P(tN)0

    + P(sW|tP)P(tP)90 + P(sW|tN)P(tN)100= P(tP) [P(sD|tP)90 + P(sW|tP)90]

    + P(tN) [P(sD|tN)0 + P(sW|tN)100]= P(tP) EU[aT | tP] + P(tN) EU[aN | tN]

    World State

    Probability Action Utility

    Sick, Positive

    P(sD|tP)P(tP) Treat 90

    Sick, Negative

    P(sD|tN)P(tN) NoTreat 0

    Well, Positive

    P(sW|tP)P(tP) Treat 90

    Well, Negative

    P(sW|tN)P(tN) NoTreat 100

    Another Way to FindExpected Utility of FollowTest Policy

    (Compare with Slide 15)

    Before doing test, we think: We are uncertain about the test

    result: it will be positive with probability P(tP) and negative with probability P(tN).

    If it is positive our expected utility will be EU[aT | tP].

    If it is negative our expected utility will be EU[aN | tN].

    So our expected utility of following the test is P(tP) EU[aT|tP] + P(tN) EU[aN|tN]

  • George Mason University

    Unit 1 (v5) - 20 -

    Department of Systems Engineering and Operations Research

    Kathryn Blackmond Laskey Spring 2015

    FollowTest: EU(aF) = 98.5 13 AlwaysTreat: EU(aT) = 90

    FollowTest is better when 98.5 13 > 90 or < 8.5/13 = 0.654 NoTreat: EU(aN) = 100(1 )

    FollowTest is better when 98.5 - 13 > 100(1-) or > 1.5/87 = 0.017

    Strategy Regions

    EVSI

    EVSI is positive for 0.017 < ps < 0.654

  • George Mason University

    Unit 1 (v5) - 21 -

    Department of Systems Engineering and Operations Research

    Kathryn Blackmond Laskey Spring 2015

    FollowTest: EU(aF) = 98.5 13 c (c is cost of test) AlwaysTreat: EU(aT) = 90

    FollowTest is better when 98.5 13 c > 90 or < (8.5-c)/13 NoTreat: EU(aN) = 100(1 )

    FollowTest is better when 98.5 - 13 c > 100(1-) or > (1.5+c)/87

    Strategy Regions for Costly Test

    88"

    90"

    92"

    94"

    96"

    98"

    100"

    0" 0.1" 0.2" 0.3" 0.4" 0.5" 0.6" 0.7" 0.8" 0.9" 1"

    Expected(U*lity(of(Op*mal(Strategy(with(Costly(Test(

    c=0"

    c=1"

    c=4"

    c7.2"

    Gain from doing test with c=1 at p=0.3

    Test is worth doing if (1.5+c)/87 < ps < (8.5-c)/13

  • George Mason University

    Unit 1 (v5) - 22 -

    Department of Systems Engineering and Operations Research

    Kathryn Blackmond Laskey Spring 2015

    For a test with cost c: E[U|FollowTest] = 98.5 13 - c NoTreat is best when < (1.5+c)/87 FollowTest is best when (1.5+c)/87 < < (8.5-c)/13

    Probability range where testing is optimal depends on cost of test If 0.018

  • George Mason University

    Unit 1 (v5) - 23 -

    Department of Systems Engineering and Operations Research

    Kathryn Blackmond Laskey Spring 2015

    Value of Information: Summary Collecting information may have value if it might change your

    decision Expected value of perfect information (EVPI) is utility gain from knowing true

    value of uncertain variable Expected value of sample information (EVSI) is utility gain from available

    information In our example, EVSI is positive for 0.017 < < 0.654

    If 0.017 0.1 EVSI is 87 - 1.5 If 0.1 0.654 EVSI is 8.5 - 13 If = 0.3 EVSI is 8.5 - 13 = 4.6 (testing is optimal)

    Costly information has value when EVSI is greater than cost of information

    In our example: If 0.017 0.1 Test if 87 - 1.5 > c (where c is cost of test) If 0.1 0.654 Test if 8.5 - 13 > c If p = 0.3 Test if 4.6 > c (test if c is less than 4.6)

  • George Mason University

    Unit 1 (v5) - 24 -

    Department of Systems Engineering and Operations Research

    Kathryn Blackmond Laskey Spring 2015

    When Parameters of a Model are Unknown Our disease model depends on several parameters

    Prior probability of disease Sensitivity of test Specificity of test

    Usually these probabilities are estimated from data and/or expert judgment

    Randomized clinical trials have established that Test T has sensitivity 0.95 and specificity 0.85 for Disease D

    Given the presenting symptoms and my clinical judgment, I estimate a 60% probability that the patient has Disease D.

    How does a Bayesian combine data and expert judgment? Use clinical judgment to quantify uncertainty about the parameter as a

    probability distribution Collect data and use Bayes rule to obtain posterior distribution for the

    parameters given the data If appropriate, use clinical judgment to adjust results of studies to apply to a

    particular patient

  • George Mason University

    Unit 1 (v5) - 25 -

    Department of Systems Engineering and Operations Research

    Kathryn Blackmond Laskey Spring 2015

    Learning from a Random Sample drawn from a Parametric Model

    Many statistical models assume observations are drawn at random from a common probability distribution

    Data X1, , Xn are drawn at random from distribution with probability mass function f(x|) if:

    P(Xi=x|) = f(x|) for all i Xi is independent of Xj given for ij

    Example: Middle-aged male patients who complain of symptom S are drawn at random

    from a population with a proportion who have disease D Data X1, , Xn are independent and identically distributed (iid) given = P(Sick)

    If the value is unknown we can express our uncertainty about it by defining a probability distribution for its value

    Xi=1 (disease) or Xi=0 (no disease) Pr(Xi = 1 | ) = (this is called a Bernoulli distribution) Pr( = ) = g()

    We now use Bayes rule to obtain a posterior distribution g( | X1, , Xn )

    Notation convention: uppercase letters for unknowns; lowercase letters for particular values the unknowns can take on

    Xi iid means that if is known, learning the condition of some patients does not affect our beliefs about the conditions of the remaining patients

  • George Mason University

    Unit 1 (v5) - 26 -

    Department of Systems Engineering and Operations Research

    Kathryn Blackmond Laskey Spring 2015

    Random Sample: Graphical Representation We can use a graph to represent conditional

    independence Arc from to Xi means the distribution

    of Xi depends on No arc from Xi to Xj means that Xi and Xj

    are independent given A plate represents repeated structure

    Because all the Xi are inside the same plate, they all follow the same probability distribution

    X1

    X2X3

    Xii=1,,3

    ~ g()

    Xi | ~ Bernoulli()

  • George Mason University

    Unit 1 (v5) - 27 -

    Department of Systems Engineering and Operations Research

    Kathryn Blackmond Laskey Spring 2015

    Example: Bayesian Inference about a Parameter (with a very small sample)

    Uninformative prior distribution Pr( = ) = g() We assume that can have one of 20 equally

    likely values: 0.025, 0.075, , 0.975 We could represent prior knowledge by assigning

    some values of a greater probability than others Observe 5 iid cases: X1, X2, X3, X4, X5 Case 2 has disease; cases 1, 3, 4 and 5 do not Likelihood function (probability of observations as function of ):

    Pr(X1, X2, X3, X4, X5 | = ) = (1-)4 The likelihood function depends only on how many cases have the disease The number of cases having the disease and the total number of cases are

    sufficient for inference about % Use Bayes rule to calculate posterior

    distribution for :

    Because the prior distribution is uniform, the posterior distribution is proportional to the likelihood

    actually has a continuous range of values. We will treat continuous parameters later. For now we approximate with a finite set of values.

    g( | x) = g( ) f (x | )g( ') f (x | ')

    '=

    120(1 )

    4

    120 '(1 ')

    4 '

    =(1 )4 '(1 ')4

    '0.025 0.125 0.225 0.325 0.425 0.525 0.625 0.725 0.825 0.925

    Prior Distribution for Theta

    Theta

    Prio

    r Pro

    babi

    lity

    0.00

    0.02

    0.04

    0.06

    0.08

    0.10

    0.12

    0.14

    0.025 0.125 0.225 0.325 0.425 0.525 0.625 0.725 0.825 0.925

    Posterior Distribution for Theta

    Theta

    Pos

    terio

    r Pro

    babi

    lity

    0.00

    0.02

    0.04

    0.06

    0.08

    0.10

    0.12

    0.14

    Underscore indicates a vector: x=(x1, x2, x3, x4, x5)

  • George Mason University

    Unit 1 (v5) - 28 -

    Department of Systems Engineering and Operations Research

    Kathryn Blackmond Laskey Spring 2015

    Bayesian Inference Example: R Code

    0.025 0.125 0.225 0.325 0.425 0.525 0.625 0.725 0.825 0.925

    Prior Distribution for Theta

    Theta

    Prio

    r Pro

    babi

    lity

    0.00

    0.02

    0.04

    0.06

    0.08

    0.10

    0.12

    0.14

    0.025 0.125 0.225 0.325 0.425 0.525 0.625 0.725 0.825 0.925

    Posterior Distribution for Theta

    Theta

    Pos

    terio

    r Pro

    babi

    lity

    0.00

    0.02

    0.04

    0.06

    0.08

    0.10

    0.12

    0.14

    Horizontal axis is = P(Sick); height of bar is probability that =

  • George Mason University

    Unit 1 (v5) - 29 -

    Department of Systems Engineering and Operations Research

    Kathryn Blackmond Laskey Spring 2015

    R Computing Environment R (http://www.r-project.org) is a free, open source statistical

    computing language and environment that includes: data handling and storage matrix and array operations tools for data analysis graphical facilities for data analysis and display programming language which includes conditionals, loops, user-defined

    recursive functions and input and output facilities Vibrant and active user community contributes new functionality

    Users can contribute packages to extend functionality Many resources exist to help users at a variety of levels

    http://cran.r-project.org/doc/manuals/R-intro.pdf RStudio (http://www.rstudio.com) is a free, open-source integrated

    development environment for R We will use R heavily in this course Most R assignments can be done by modifying sample code I will

    provide

  • George Mason University

    Unit 1 (v5) - 30 -

    Department of Systems Engineering and Operations Research

    Kathryn Blackmond Laskey Spring 2015

    Bayesian Learning and Sample Size When the sample size is very large:

    The posterior distribution will be concentrated around the maximum likelihood estimate and is relatively insensitive to the prior distribution

    We wont go too far wrong if we act as if the parameter is equal to the maximum likelihood estimate

    When the sample size is very small: The posterior distribution is highly dependent on the prior distribution Reasonable people may disagree on the value of the parameter

    When the sample size is moderate, Bayesian learning can be a big improvement on either expert judgment alone or data alone

    Achieving the benefit requires careful modeling This course will teach methods for constructing Bayesian models

    A powerful characteristic of the Bayesian approach is the flexibility to tailor results to moderate-sized sub-populations

    Bayesian estimate shrinks estimates of sub-population parameters toward population average

    Amount of shrinkage depends on sample size and similarity of sub-population to overall population

    Shrinkage improves estimates for small to moderate sized sub-populations

  • George Mason University

    Unit 1 (v5) - 31 -

    Department of Systems Engineering and Operations Research

    Kathryn Blackmond Laskey Spring 2015

    Effect of Sample Size on Posterior Distribution

    These plots show the posterior distribution for when:

    Prior distribution is uniform 20% of patients in sample have

    the disease Posterior distribution becomes more concentrated around 1/5 as sample size gets larger

    0"

    0.05"

    0.1"

    0.15"

    0.2"

    0.25"

    0.3"

    0.35"

    0.4"

    0.025" 0.075" 0.125" 0.175" 0.225" 0.275" 0.325" 0.375" 0.425" 0.475" 0.525" 0.575" 0.625" 0.675" 0.725" 0.775" 0.825" 0.875" 0.925" 0.975"

    0"

    0.05"

    0.1"

    0.15"

    0.2"

    0.25"

    0.3"

    0.35"

    0.4"

    0.025" 0.075" 0.125" 0.175" 0.225" 0.275" 0.325" 0.375" 0.425" 0.475" 0.525" 0.575" 0.625" 0.675" 0.725" 0.775" 0.825" 0.875" 0.925" 0.975"

    Sample size 5: 1 with, 4 without

    Sample size 20: 4 with, 16 without

    0"

    0.05"

    0.1"

    0.15"

    0.2"

    0.25"

    0.3"

    0.35"

    0.4"

    0.025" 0.075" 0.125" 0.175" 0.225" 0.275" 0.325" 0.375" 0.425" 0.475" 0.525" 0.575" 0.625" 0.675" 0.725" 0.775" 0.825" 0.875" 0.925" 0.975"

    Sample size 80: 16 with, 64 without

    Horizontal axis is = P(sD); height of bar is probability that =

  • George Mason University

    Unit 1 (v5) - 32 -

    Department of Systems Engineering and Operations Research

    Kathryn Blackmond Laskey Spring 2015

    Effect of Sample Size on Impact of the Prior Distribution

    Prior distribution favors low probabilities:

    Prior distribution favors high probabilities:

    Bayesian inference shrinks posterior distribution toward prior expectations

    Posterior distribution for small sample is very sensitive to prior distribution Posterior distribution for larger sample is less sensitive to prior distribution

    0"

    0.05"

    0.1"

    0.15"

    0.2"

    0.25"

    0.3"

    0.35"

    0.4"

    0.025" 0.075" 0.125" 0.175" 0.225" 0.275" 0.325" 0.375" 0.425" 0.475" 0.525" 0.575" 0.625" 0.675" 0.725" 0.775" 0.825" 0.875" 0.925" 0.975"

    Prior%Distribu+on%%

    0"

    0.05"

    0.1"

    0.15"

    0.2"

    0.25"

    0.3"

    0.35"

    0.4"

    0.025" 0.075" 0.125" 0.175" 0.225" 0.275" 0.325" 0.375" 0.425" 0.475" 0.525" 0.575" 0.625" 0.675" 0.725" 0.775" 0.825" 0.875" 0.925" 0.975"

    Posterior(Distribu,on(for(1(Case(in(5(Samples(

    0"

    0.05"

    0.1"

    0.15"

    0.2"

    0.25"

    0.3"

    0.35"

    0.4"

    0.025" 0.075" 0.125" 0.175" 0.225" 0.275" 0.325" 0.375" 0.425" 0.475" 0.525" 0.575" 0.625" 0.675" 0.725" 0.775" 0.825" 0.875" 0.925" 0.975"

    Posterior(Distribu,on(for(10(Cases(in(50(Samples(

    0"

    0.05"

    0.1"

    0.15"

    0.2"

    0.25"

    0.3"

    0.35"

    0.4"

    0.025" 0.075" 0.125" 0.175" 0.225" 0.275" 0.325" 0.375" 0.425" 0.475" 0.525" 0.575" 0.625" 0.675" 0.725" 0.775" 0.825" 0.875" 0.925" 0.975"

    Prior%Distribu+on%

    0"

    0.05"

    0.1"

    0.15"

    0.2"

    0.25"

    0.3"

    0.35"

    0.4"

    0.025" 0.075" 0.125" 0.175" 0.225" 0.275" 0.325" 0.375" 0.425" 0.475" 0.525" 0.575" 0.625" 0.675" 0.725" 0.775" 0.825" 0.875" 0.925" 0.975"

    Posterior(Distribu,on(for(1(Case(in(5(Samples(

    0"

    0.05"

    0.1"

    0.15"

    0.2"

    0.25"

    0.3"

    0.35"

    0.4"

    0.025" 0.075" 0.125" 0.175" 0.225" 0.275" 0.325" 0.375" 0.425" 0.475" 0.525" 0.575" 0.625" 0.675" 0.725" 0.775" 0.825" 0.875" 0.925" 0.975"

    Posterior(Distribu,on(for(10(Cases(in(50(Samples(

    Horizontal axis is = P(sD); height of bar is probability that =

    Horizontal axis is = P(sD); height of bar is probability that =

  • George Mason University

    Unit 1 (v5) - 33 -

    Department of Systems Engineering and Operations Research

    Kathryn Blackmond Laskey Spring 2015

    The Bayesian Resurgence Bayesian inference is as old as probability Bayesian view fell into disfavor in nineteenth and early

    twentieth centuries Positivism, empiricism, and quest for objectivity in science Paradoxes and systematic deviation of human judgment from

    Bayesian norm There has been a recent resurgence

    Computational advances make calculation possible for complex models Bayesian models can coherently integrate many different kinds of

    information Physical cause and effect Logical implication Informed expert judgment Empirical observation

    Unified theory and methods for data-rich and data-poor problems Clear connection to decision making

  • George Mason University

    Unit 1 (v5) - 34 -

    Department of Systems Engineering and Operations Research

    Kathryn Blackmond Laskey Spring 2015

    Probability really is none of these things. Probability can represent all of these things.

    Some Concepts of Probability Classical - Probability is a ratio of favorable cases to total (equipossible)

    cases Frequency - Probability is the limiting value as the number of trials

    becomes infinite of the frequency of occurrence of some event Logical - Probability is a logical property of ones state of information

    about a phenomenon Propensity - Probability is a propensity for certain kinds of event to occur

    and is a property of physical systems Subjectivist - Probability is an ideal rational agents degree of belief about

    an uncertain event Algorithmic - The algorithmic probability of a finite sequence is the

    probability that a universal computer fed a random input will give the sequence as output (related to Kolmogorov complexity)

    Game Theoretic - Probability is an agents optimal announced certainty for an event in a multi-agent game in which agents receive rewards that depend on both forecasts and outcomes

  • George Mason University

    Unit 1 (v5) - 35 -

    Department of Systems Engineering and Operations Research

    Kathryn Blackmond Laskey Spring 2015

    Historical Notes People have long noticed that some events are imperfectly predictable Mathematical probability first arose to describe regularities in games of

    chance In the twentieth century it became clear that probability theory provided a

    good model for a much broader class of problems: Physical (thermodynamics; quantum mechanics) Social (actuarial tables; sample surveys) Industrial (equipment failures)

    The subjectivist interpretation dates from the 18th century but fell out of favor because of the positivitist orientation of Western 19th and 20th century science

    Von Mises formulated a rigorous (and much-debated) frequency theory in the mid-twentieth century.

    Hierarchy of generality: Classical interpretation is restricted to equipossible cases Frequency interpretation is restricted to repeatable, random phenomena Subjectivist interpretation applies to any event about which the agent is uncertain Game theoretic interpretation applies even when probabilities are not true beliefs

  • George Mason University

    Unit 1 (v5) - 36 -

    Department of Systems Engineering and Operations Research

    Kathryn Blackmond Laskey Spring 2015

    The Frequentist A frequentist believes:

    Probability can be legitimately applied only to repeatable problems Probability is an objective property in the real world Probability applies only to random processes Probabilities are associated only with collectives not individual events

    Frequentist Inference Data are drawn from a distribution of known form but with an unknown parameter

    (this includes nonparametric statistics in which the unknown parameter is the distribution itself)

    Often this distribution arises from explicit randomization (when not, statistician argues that the procedure was close enough to randomized that inferences apply)

    Inferences regard the data as random and the parameter as fixed (even though the data are known and the parameter is unknown)

    For example: A sample X1,XN is drawn from a normal distribution with mean . A 95% confidence interval is constructed. The interpretation is:

    If an experiment like this were performed many times we would expect in 95% of the cases that an interval calculated by the procedure we applied would include the true value of .

    A frequentist can say nothing about this experiment or about what we should believe about !

  • George Mason University

    Unit 1 (v5) - 37 -

    Department of Systems Engineering and Operations Research

    Kathryn Blackmond Laskey Spring 2015

    The Subjectivist A subjectivist believes:

    Probability as an expression of a rational agents degrees of belief about uncertain propositions.

    Rational agents may disagree. There is no one correct probability. If the agent receives feedback her assessed probabilities will in the limit

    converge to observed frequencies Subjectivist Inference:

    Probability distributions are assigned to the unknown parameters and to the observations given the unknown parameters.

    Condition on knowns; use probability to express uncertainty about unknowns

    For example: A sample X1,XN is drawn from a normal distribution with mean having prior distribution g(). A 95% posterior credible interval is constructed, and the result is the interval (3.7, 4.9). The interpretation is:

    Given the prior distribution for and the observed data, the probability that lies between 3.7 and 4.9 is 95%.

    A subjectivist can draw conclusions about what we should believe about and about what we should expect on the next trial

  • George Mason University

    Unit 1 (v5) - 38 -

    Department of Systems Engineering and Operations Research

    Kathryn Blackmond Laskey Spring 2015

    Comparison: Understandability, Subjectivity and Honest Reporting

    Often the Bayesian answer is what the decision maker really wants to hear.

    Untrained people often interpret results in the Bayesian way. Frequentists are disturbed by the dependence of the posterior

    interval on the subjective prior distribution. It is more important that stochastics provides a means of communication among researchers whose personal beliefs about the phenomena under study may differ. If these beliefs are allowed to contaminate the reporting of results, how are the results of different researchers to be compared? - H. Dinges

    Bayesians say the prior distribution is not the only subjective element in an analysis. Assumptions about the sampling distribution are often also subjective.

    Bayesian probability statements are always subjective, but statistical analyses are often done for public consumption. Whose probability distribution should be reported?

    When there are enough data, a good Bayesian analysis and a good frequentist analysis will typically be in close agreement

    If the results are sensitive to the prior distribution, a Bayesian analyst is obligated to report this sensitivity, and to present the range of results obtained from a range of prior distributions

  • George Mason University

    Unit 1 (v5) - 39 -

    Department of Systems Engineering and Operations Research

    Kathryn Blackmond Laskey Spring 2015

    Comparison: Generality Subjectivists can handle problems the frequentist approach

    cannot (in particular, problems with not enough data for sound frequentist inference).

    Frequentist statisticians say this comes at a price -- when there are not enough data the result will be highly dependent on the prior distribution.

    Subjectivists often apply frequentist techniques but with a Bayesian interpretation

    Frequentists often apply Bayesian methods if they have good frequency properties

  • George Mason University

    Unit 1 (v5) - 40 -

    Department of Systems Engineering and Operations Research

    Kathryn Blackmond Laskey Spring 2015

    Axioms for Probability De Groot, 1970

    There is a qualitative relationship of relative likelihood , that operates on pairs of events, that satisfies the following conditions: SP1. For any two uncertain events A and B, one of the following relations holds: A

    B, A B or A ~ B. SP2. If A1, A2, B1, and B2 are four events such that A1A2=, B1B2=, and if Ai

    Bi, for i = 1,2, then A1A2 B1B2. If in addition Ai Bi for either i=1 or i=2, then A1A2 B1B2.

    SP3. If A is any event, then A. Furthermore, there is some event A0 for which A0.

    SP4. If A1A2 is a decreasing sequence of events, and B is some event such

    that Ai B for i=1, 2, , then Aii=1 B .

    SP5. There is an experiment, with a numerical outcome between the values of 0 and 1, such that if Ai is the event that the outcome x lies within the interval ai x bi, for i=1,2, then A1 A2 if and only if (b1-a1) (b2-a2).

  • George Mason University

    Unit 1 (v5) - 41 -

    Department of Systems Engineering and Operations Research

    Kathryn Blackmond Laskey Spring 2015

    Axioms for Utility Watson and Buede, 1987

    A reward is a prize the decision maker cares about. A lottery is a situation in which the decision maker will receive one of the possible rewards, where the reward to be received is governed by a probability distribution. There is a qualitative relationship of relative preference * , that operates on lotteries, that satisfies the following conditions: SU1. For any two lotteries L1 and L2, either L1 *L2, L1 * L2, or L1~*L2.

    Furthermore, if L1, L2, and L3 are any lotteries such that L1 *L2 and L2 *L3, then L1 *L3.

    SU2. If r1, r2 and r3 are rewards such that r1 * r2 * r3, then there exists a probability p such that [r1: p; r3: (1-p)] ~* r2, where [r1:p; r3:(1-p)] is a lottery that pays r1 with probability p and r3 with probability (1-p ).

    SU3. If r1 ~* r2 are rewards, then for any probability p and any reward r3, [r1: p; r3: (1-p)] ~* [r2: p; r3: (1-p )]

    SU4. If r1 * r2 are rewards, then [r1: p; r2: (1-p)] * [r1: q; r2: (1-q)] if and only if p > q.

    SU5. Consider three lotteries, Li = [r1: pi; r2: (1-pi)], i = 1, 2, 3, giving different probabilities of the two rewards r1 and r2. Suppose lottery M gives entry to lottery L2 with probability q and L3 with probability 1-q. Then L1~*M if and only if p1 = qp2 + (1-q)p3.

  • George Mason University

    Unit 1 (v5) - 42 -

    Department of Systems Engineering and Operations Research

    Kathryn Blackmond Laskey Spring 2015

    Probabilities and Utilities If your beliefs satisfy SP1-SP5, then there is a probability

    distribution Pr() over events such that for any two events A1 and A2, Pr(A1) Pr(A2) if and only if A1 A2.

    If your preferences satisfy SU1-SU5, then there is a utility function u() defined on rewards such that for any two lotteries L1 and L2, L1 * L2 if and only if E[u(L1)] E[u(L2)], where E[] denotes the expected value with respect to the probability distribution Pr().

    ?

  • George Mason University

    Unit 1 (v5) - 43 -

    Department of Systems Engineering and Operations Research

    Kathryn Blackmond Laskey Spring 2015

    Myth of the Cold-Hearted Rationalist A common criticism of decision theory is that its adherents are cold-hearted

    technocrats who care about numbers and not about what really matters They would put a dollar value on a human life They would send people to possible death on the basis of utilitarian calculations And so on

    The applied decision theorists response These kinds of tradeoffs are unquestionably difficult Whether we quantify them or not, as a society and as individuals we make them all the time They will be irrational and capricious if we approach them without a principled methodology Refusing to think about the tradeoffs will only ensure that they will be addressed haphazardly and/

    or by back door manipulation by powerful special interests As a society we need open debate and discussion of the difficult tradeoffs we are forced to make.

    Decision theory provides a justifiable, communicable framework for doing so disagreements about fact are clearly separated from disagreements about value inconsistencies can be spotted, discussed, and resolved commonly recurring problems need not be revisited once consensus has been reached

    Decision theory can be misused if models are sloppily built and leave out important elements

    When a group or society has not reached consensus there is no clear best choice Explicitly modeling subjective elements of a problem provides a framework for

    informed debate

  • George Mason University

    Unit 1 (v5) - 44 -

    Department of Systems Engineering and Operations Research

    Kathryn Blackmond Laskey Spring 2015

    Unit 1: Summary and Synthesis The inventors of probability theory thought of it as a logic of

    enlightened rational reasoning. In the nineteenth century this was replaced by a view of probability as measuring objective propensities of intrinsically random phenomena

    The twentieth century has seen a resurgence of interest in subjective probability and an increased understanding of the appropriate role of subjectivity in science

    Bayesian methods often require more computational power than traditional frequentist methods

    The computer revolution has enabled the Bayesian resurgence Most statistics texts and courses take a frequentist approach but this

    is changing Bayesian decision theory provides methodology for rational choice

    under uncertainty Bayesian statistics is a theory of rational belief dynamics We took a broad-brush tour of Bayesian methodology We applied Bayesian thinking to a simple example that illustrates

    many of the concepts we will be learning this semester

  • George Mason University

    Unit 1 (v5) - 45 -

    Department of Systems Engineering and Operations Research

    Kathryn Blackmond Laskey Spring 2015

    References for Unit 1 Bayes, Thomas. An essay towards solving a problem in the doctrine of chances.

    Philosophical Transactions of the Royal Society of London, 53:370- 418, 1763. Bashir, S.A., Getting Started in R, http://www.luchsinger-mathematics.ch/Bashir.pdf Dawid, A.P. and Vovk, V.G. (1999), Prequential Probability: Principles and Properties,

    Bernoulli, 5: 125-162. de Finetti, Bruno. Theory of Probability: A Critical Introductory Treatment. New York:

    Wiley, 1974. Gelman, et al., Chapter 1 Hjek, Alan, "Interpretations of Probability", The Stanford Encyclopedia of Philosophy

    (Summer 2003 Edition), Edward N. Zalta(ed.), URL = .

    Lee, Chapter 1 Li, Ming and Vitanyi, Paul. An Introduction to Kolmogorov Complexity and Its

    Applications. (2nd ed) Springer-Verlag, 2005. Nau, Robert F. (1999), Arbitrage, Incomplete Models, And Interactive Rationality,

    working paper, Fuqua School of Business, Duke University. Neapolitan, R. Learning Bayesian Networks, Prentice Hall, 2003. Jaynes, E., Probability Theory: The Logic of Science, Cambridge University Press,

    2003) Savage, L.J., The Foundations of Statistics. Dover, 1972. Shafer, G. Probability and Finance: Its Only a Game, Wiley, 2001. von Mises R., 1957, Probability, Statistics and Truth, revised English edition, New

    York: Macmillan