
    EECS 440: Machine Learning

    Soumya Ray

    http://engr.case.edu/ray_soumya/eecs440_fall13/

    [email protected]

    Office: Olin 516

Office hours: Th, Fri 1:30-2:30 or by appointment

Text: Machine Learning by Tom Mitchell


    Today

    Bayesian Learning

Read Mitchell Chapter 6, extra material on the website


Naïve Bayes

    Simplest generative classifier for discrete data


$p(\mathbf{x}_i, y_i) = p(\mathbf{X} = \mathbf{x}_i \mid Y = y_i)\, p(Y = y_i)$

$= p(x_{i1}, \ldots, x_{in} \mid Y = y_i)\, p(Y = y_i)$

$= \prod_j p(X_j = x_{ij} \mid Y = y_i)\, p(Y = y_i)$

Naïve Bayes parameters: instead of storing probabilities for each example, we will only store these conditional probabilities and use this formula to calculate the probability for an example.

Naïve Bayes assumption: attributes are conditionally independent given the class.


    ML Hypothesis

If every hypothesis in H has equal prior probability, only the first term matters.

This gives the maximum likelihood (ML) hypothesis:


$h_{ML} = \arg\max_{h \in H} \Pr(D \mid h)$


Naïve Bayes Parameter MLEs


$p(X_i = 1 \mid Y = 1) = \dfrac{p(X_i = 1, Y = 1)}{p(Y = 1)} = \dfrac{\#\text{ observed examples with } X_i = 1 \text{ and } Y = 1}{\#\text{ observed examples with } Y = 1}$

$p(Y = 1) = \dfrac{\#\text{ observed examples with } Y = 1}{\#\text{ observed examples}}$


    Example


             Has-fur?  Long-Teeth?  Scary?  Lion?
    Animal1  Yes       No           No      No
    Animal2  No        Yes          Yes     No
    Animal3  Yes       Yes          Yes     Yes

    p(Has-fur=Yes|Lion)=?, p(Has-fur=Yes|Not-Lion)=?

    p(Long-Teeth=Yes|Lion)=?, p(Long-Teeth=Yes|Not-Lion)=?

    p(Scary=Yes|Lion)=?, p(Scary=Yes|Not-Lion)=?

    p(Lion)=?
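To make the counting concrete, here is a minimal Python sketch that computes these estimates from the table above; the data encoding and helper name are illustrative, not from the lecture:

```python
# Toy animal table from the slide, encoded as dicts (hypothetical encoding).
examples = [
    dict(has_fur=True,  long_teeth=False, scary=False, lion=False),  # Animal1
    dict(has_fur=False, long_teeth=True,  scary=True,  lion=False),  # Animal2
    dict(has_fur=True,  long_teeth=True,  scary=True,  lion=True),   # Animal3
]

def mle(attr, label):
    """Estimate p(attr = Yes | Lion = label) by counting matching examples."""
    with_label = [e for e in examples if e["lion"] == label]
    return sum(e[attr] for e in with_label) / len(with_label)

p_lion = sum(e["lion"] for e in examples) / len(examples)
print(p_lion)                                             # p(Lion) = 1/3
print(mle("has_fur", True), mle("has_fur", False))        # 1.0, 0.5
print(mle("long_teeth", True), mle("long_teeth", False))  # 1.0, 0.5
print(mle("scary", True), mle("scary", False))            # 1.0, 0.5
```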


    Smoothing probability estimates

What happens if a certain value for a variable is not in our set of examples, for a certain class?

Suppose we're trying to classify lions and we've never seen a lion cub, so

$p(\text{Scary} = \text{false} \mid \text{Lion}) = 0$

When we see a cub, its probability of being a lion will be zero by our Naïve Bayes formula, even if it has long teeth and fur.

It's a good idea to smooth our probability estimates to avoid this.


    m-Estimates

    p is our prior estimate of the probability

m is called the equivalent sample size, which determines the importance of p relative to the observations

If the variable has v values, the specific case of m = v, p = 1/v is called Laplace smoothing


$p(X_i = x_i \mid Y = y) = \dfrac{(\#\text{examples with } X_i = x_i \text{ and } Y = y) + mp}{(\#\text{examples with } Y = y) + m}$
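As a small illustration, here is a sketch of the m-estimate as a function; the function name and example counts are hypothetical, but the formula is the one above:

```python
def m_estimate(n_match, n_class, m, p):
    """Smoothed estimate of p(X_i = x_i | Y = y).

    n_match -- #examples with X_i = x_i and Y = y
    n_class -- #examples with Y = y
    m       -- equivalent sample size
    p       -- prior estimate of the probability
    """
    return (n_match + m * p) / (n_class + m)

# Laplace smoothing for a binary variable (v = 2): m = 2, p = 1/2.
# The unseen "Scary = false" lion from the previous slide is no longer zero:
print(m_estimate(0, 1, m=2, p=0.5))  # 0.333... instead of 0.0
```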


    Nominal Attributes

Need to estimate parameters p(X_i = v_k | Y = y)

    Can use maximum likelihood estimates:


$p(X_i = v_k \mid Y = y) = \dfrac{p(X_i = v_k, Y = y)}{p(Y = y)} = \dfrac{\#\text{examples with } X_i = v_k \text{ and } Y = y}{\#\text{examples with } Y = y}$


    Continuous Attributes

If X_i is a continuous attribute, can model p(X_i | y) as a Gaussian distribution (Gaussian naïve Bayes)

MLEs:


$p(X_i \mid y) = N(\mu_{i|y}, \sigma_{i|y})$

$\mu_{i|y} = \dfrac{\sum_{k \in \text{examples}} x_{ik}\, I(y_k = y)}{\sum_{k \in \text{examples}} I(y_k = y)}$

$\sigma^2_{i|y} = \dfrac{\sum_{k \in \text{examples}} (x_{ik} - \mu_{i|y})^2\, I(y_k = y)}{\sum_{k \in \text{examples}} I(y_k = y)}$
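A minimal sketch of these estimates in Python; function names are illustrative, and X is assumed to be a list of feature vectors with y a list of class labels:

```python
import math

def gaussian_mle(X, y, i, label):
    """MLE mean and variance of attribute i over examples of class `label`."""
    vals = [x[i] for x, yk in zip(X, y) if yk == label]
    mu = sum(vals) / len(vals)
    var = sum((v - mu) ** 2 for v in vals) / len(vals)
    return mu, var

def gaussian_density(v, mu, var):
    """N(v; mu, var), used as p(X_i = v | y) in the naive Bayes product."""
    return math.exp(-((v - mu) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)
```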


Naïve Bayes Geometry

What does the decision surface of the naïve Bayes classifier look like?

An example is classified positive iff p(x, y=1) > p(x, y=0):


$\dfrac{p(\mathbf{x}, y=1)}{p(\mathbf{x}, y=0)} > 1$

$\dfrac{\prod_i p(x_i \mid y=1)\, p(y=1)}{\prod_i p(x_i \mid y=0)\, p(y=0)} > 1$


Naïve Bayes Geometry

    Classify an example as positive if


$\dfrac{\prod_i p(x_i \mid y=1)\, p(y=1)}{\prod_i p(x_i \mid y=0)\, p(y=0)} > 1 = e^0$

$\Rightarrow\ \ln\dfrac{p(y=1)}{p(y=0)} + \sum_i \ln\dfrac{p(x_i \mid y=1)}{p(x_i \mid y=0)} > 0$


Naïve Bayes Geometry

$\ln\dfrac{p(y=1)}{p(y=0)} + \sum_i \ln\dfrac{p(x_i \mid y=1)}{p(x_i \mid y=0)} > 0$

Rewriting with the indicator function $I(X_i = v)$:

$\ln\dfrac{p(y=1)}{p(y=0)} + \sum_{i,v} \ln\dfrac{p(X_i = v \mid y=1)}{p(X_i = v \mid y=0)}\, I(X_i = v) > 0$

$b + \sum_{i,v} w_{i,v}\, I(X_i = v) > 0, \quad b = \ln\dfrac{p(y=1)}{p(y=0)}, \quad w_{i,v} = \ln\dfrac{p(X_i = v \mid y=1)}{p(X_i = v \mid y=0)}$

So Naïve Bayes implements a linear decision boundary with specific parameters.
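As a sanity check on the linearity claim, here is an illustrative sketch (not from the lecture) that folds naive Bayes parameters for binary attributes into a weight vector and bias:

```python
import math

def nb_to_linear(p_y1, p1, p0):
    """p1[i] = p(X_i = 1 | y = 1), p0[i] = p(X_i = 1 | y = 0)."""
    b = math.log(p_y1 / (1 - p_y1))            # ln p(y=1)/p(y=0)
    w = []
    for a, c in zip(p1, p0):
        # weight on I(X_i = 1); the I(X_i = 0) term moves into the bias
        w.append(math.log(a / c) - math.log((1 - a) / (1 - c)))
        b += math.log((1 - a) / (1 - c))
    return w, b

# Classify x (a 0/1 vector) as positive iff b + sum(wi * xi) > 0.
```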


Naïve Bayes and Text Classification

Used very successfully to categorize documents

    Is this document about sports or finance?

    Is this email spam or ham?

Given a vocabulary, each attribute X_i is the presence/absence of word i in the document

    Ignores word order

    Bag-of-words approach


Text Classification cont'd.

    Smoothed parameter estimates

    Called Multivariate Bernoulli model


$p(\text{word}_k \text{ present} \mid Y = y) = \dfrac{(\#\text{documents with } \text{word}_k \text{ present and } Y = y) + 1}{(\#\text{documents with } Y = y) + 2}$

Each per-word conditional is a Bernoulli distribution.
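A minimal training sketch for this model; the names and data layout are assumptions (docs is a list of word lists, labels are 0/1), not the lecture's code:

```python
from collections import Counter

def train_bernoulli_nb(docs, labels, vocab):
    """Smoothed p(word present | Y = y) per word and class, as on the slide."""
    params = {}
    for y in (0, 1):
        docs_y = [set(d) for d, l in zip(docs, labels) if l == y]
        present = Counter(w for d in docs_y for w in d if w in vocab)
        # (count + 1) / (#docs + 2) smoothing from the slide
        params[y] = {w: (present[w] + 1) / (len(docs_y) + 2) for w in vocab}
    return params
```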


Tree Augmented Naïve Bayes

Can augment the model so that there is a tree structure over the attributes

In this case, the structure is also unknown

Given a training sample, an algorithm exists to learn the optimal structure

Makes fewer independence assumptions than NB, giving better classification performance


Tree Augmented Naïve Bayes

    Create a complete graph

    Nodes are attributes

Edges are weighted by I(X; Y | C)

Find the maximal weighted spanning tree of this graph

Can show this is the tree structure that maximizes likelihood (see paper on website)


$I(X; Y \mid C) = \sum_{x,y,c} P(x, y, c)\, \log\dfrac{P(x, y \mid c)}{P(x \mid c)\, P(y \mid c)}$

(class-conditional mutual information)
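A sketch of the structure-finding step, using Prim's algorithm for the maximum weighted spanning tree; this is illustrative only, and assumes the edge weights are the class-conditional mutual information values estimated from data:

```python
def max_spanning_tree(n_attrs, weight):
    """weight maps sorted attribute pairs (i, j) to I(X_i; X_j | C)."""
    in_tree, edges = {0}, []
    while len(in_tree) < n_attrs:
        i, j = max(
            ((a, b) for a in in_tree for b in range(n_attrs) if b not in in_tree),
            key=lambda e: weight[tuple(sorted(e))],
        )
        edges.append((i, j))   # orient edges away from the root attribute 0
        in_tree.add(j)
    return edges               # each non-root attribute gets one attribute parent
```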


    Logistic Regression

    Simplest Discriminative model

    Models log odds as a linear function


$\log\dfrac{p(Y=1 \mid \mathbf{x})}{p(Y=0 \mid \mathbf{x})} = \mathbf{w}\cdot\mathbf{x} + b$

$p(Y=1 \mid \mathbf{x}) = \big(1 - p(Y=1 \mid \mathbf{x})\big)\, e^{\mathbf{w}\cdot\mathbf{x} + b}$

$p(Y=1 \mid \mathbf{x})\, \big(1 + e^{\mathbf{w}\cdot\mathbf{x} + b}\big) = e^{\mathbf{w}\cdot\mathbf{x} + b}$

$p(Y=1 \mid \mathbf{x}) = \dfrac{e^{\mathbf{w}\cdot\mathbf{x} + b}}{1 + e^{\mathbf{w}\cdot\mathbf{x} + b}} = \dfrac{1}{1 + e^{-(\mathbf{w}\cdot\mathbf{x} + b)}}$
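The resulting model is just the logistic sigmoid of a linear function, e.g. this minimal sketch with assumed names:

```python
import math

def p_y1_given_x(x, w, b):
    """p(Y = 1 | x) = sigmoid(w . x + b), per the derivation above."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b   # log odds
    return 1.0 / (1.0 + math.exp(-z))
```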


    Estimating parameters

Use MLE: optimize the log conditional likelihood of the data


$(\mathbf{w}, b) = \arg\max_{\mathbf{w}, b} \prod_i p(y_i \mid \mathbf{x}_i)$

$= \arg\max \sum_{i \in pos} \log p(Y = 1 \mid \mathbf{x}_i) + \sum_{i \in neg} \log p(Y = 0 \mid \mathbf{x}_i)$

$= \arg\max \sum_{i \in pos} \log\dfrac{1}{1 + e^{-(\mathbf{w}\cdot\mathbf{x}_i + b)}} + \sum_{i \in neg} \log\left(1 - \dfrac{1}{1 + e^{-(\mathbf{w}\cdot\mathbf{x}_i + b)}}\right)$

(the product $\prod_i p(y_i \mid \mathbf{x}_i)$ is the conditional likelihood)


    Estimating parameters

Can use gradient descent (a sketch follows below), Newton methods, etc.

In practice, also use overfitting control via ||w||

Very robust method, works extremely well in many practical situations, very easy to code

Often a good idea to try this first!

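Here is a minimal gradient ascent sketch for the L2-regularized conditional log-likelihood; the step size, penalty strength, and loop structure are assumptions, not the lecture's code:

```python
import math

def train_logistic_regression(X, y, lr=0.1, lam=0.01, steps=1000):
    """X: list of feature vectors, y: 0/1 labels; returns weights and bias."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(steps):
        for x, yi in zip(X, y):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            err = yi - 1.0 / (1.0 + math.exp(-z))     # y_i - p(Y=1 | x_i)
            w = [wi + lr * (err * xi - lam * wi) for wi, xi in zip(w, x)]
            b += lr * err
    return w, b
```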


    Logistic Regression Geometry

    Classify as positive iff:

$\dfrac{p(Y=1 \mid \mathbf{x})}{p(Y=0 \mid \mathbf{x})} > 1$

or equivalently

$\log\dfrac{p(Y=1 \mid \mathbf{x})}{p(Y=0 \mid \mathbf{x})} > 0$

But

$\log\dfrac{p(Y=1 \mid \mathbf{x})}{p(Y=0 \mid \mathbf{x})} = \mathbf{w}\cdot\mathbf{x} + b$

So classify as positive iff $\mathbf{w}\cdot\mathbf{x} + b > 0$.

So like NB, LR also implements a linear decision boundary, but what's the difference?


Relationship to Naïve Bayes

For certain values of w, b, logistic regression will implement the same decision surface as naïve Bayes

Both are linear discriminants, but LR does not make the independence assumptions of NB

More robust than NB, especially in the presence of irrelevant attributes

Also handles continuous attributes nicely

But (as with all discriminative models) no easy way to handle missing data


    Generative and Discriminative Pairs


[Figure: accuracy vs. training sample size, with curves for a generative model, a discriminative model, and a generative model with the correct model class.]