Winnow vs Perceptron


Transcript of Winnow vs Perceptron


    SIMS 290-2: Applied Natural Language Processing

    Barbara Rosario


    October 4, 2004

    Today

    Algorithms for Classification

    Binary classification

    - Perceptron

    - Winnow

    - Support Vector Machines (SVM)

    - Kernel Methods

    Multi-class classification

    - Decision Trees

    - Naïve Bayes

    - K nearest neighbor

    Binary Classification: examples

    Spam filtering (spam, not spam)

    Customer service message classification (urgent vs. not urgent)

    Sentiment classification (positive, negative)

    Sometimes it can be convenient to treat a multiway problem like a binary one: one class versus all the others, for all classes
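    The one-class-versus-all-the-others reduction just described can be sketched in a few lines. This is a minimal illustration assuming scikit-learn is available; the toy feature vectors, topic labels, and the choice of a Perceptron as the base binary classifier are invented for the example, not taken from the slides.

        # One-vs-all: train one binary classifier per class (class k vs. the rest),
        # then predict the class whose classifier is most confident.
        import numpy as np
        from sklearn.linear_model import Perceptron
        from sklearn.multiclass import OneVsRestClassifier  # built-in one-vs-rest wrapper

        X = np.array([[0.0, 1.0], [1.0, 1.0], [2.0, 0.0], [3.0, 0.5]])   # invented features
        y = np.array(["sports", "sports", "politics", "tech"])           # three classes

        clf = OneVsRestClassifier(Perceptron())   # fits one Perceptron per class
        clf.fit(X, y)
        print(clf.predict([[2.5, 0.2]]))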

    Binary Classification

    Given: some data items that belong to a positive (+1) or a negative (-1) class

    Task: train the classifier and predict the class for a new data item

    Geometrically: find a separator

    Linear versus Non Linear algorithms

    Linearly separable data: if all the data points can be correctly classified by a linear (hyperplanar) decision boundary

    Linearly separable data

    [Figure: two point clouds (Class1, Class2) separated by a linear decision boundary]


    Non linearly separable data

    [Figure: two classes (Class1, Class2) that no linear boundary separates]

    Non linearly separable data

    [Figure: the same two classes (Class1, Class2) separated by a non-linear classifier]

    Linear versus Non Linear algorithms

    Linearly or non linearly separable data?

    - We can find out only empirically

    Linear algorithms (algorithms that find a linear decision boundary)

    - When we think the data is linearly separable

    - Advantages: simpler, fewer parameters

    - Disadvantages: high dimensional data (like for NLP) is usually not linearly separable

    - Examples: Perceptron, Winnow, SVM

    Note: we can use linear algorithms also for non linear problems (see Kernel methods)

    Linear versus Non Linear algorithms

    Non Linear algorithms

    - When the data is non linearly separable

    - Advantages: more accurate

    - Disadvantages: more complicated, more parameters

    - Example: Kernel methods

    Note: the distinction between linear and non linear also applies to multi-class classification (we'll see this later)

    Simple linear algorithms

    Perceptron and Winnow algorithms

    - Linear

    - Binary classification

    - Online (process data sequentially, one data point at a time)

    - Mistake driven

    - Simple single-layer Neural Networks
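    The Perceptron and Winnow slides themselves are not part of this transcript, so as a rough illustration here is a minimal sketch of a standard Winnow variant: online, mistake driven, with multiplicative weight updates. The threshold, promotion factor, and {0, 1} labels are conventional choices, not details taken from the slides.

        # Winnow: weights start at 1 and are multiplied (promoted) or divided
        # (demoted) on each mistake; only active features (x_i = 1) are updated.
        def winnow_train(data, n_features, alpha=2.0):
            w = [1.0] * n_features          # initial weights
            theta = float(n_features)       # a common choice of threshold
            for x, y in data:               # x: binary feature vector, y in {0, 1}
                y_hat = 1 if sum(wi * xi for wi, xi in zip(w, x)) >= theta else 0
                if y_hat != y:              # mistake driven: update only on errors
                    for i in range(n_features):
                        if x[i] == 1:
                            w[i] = w[i] * alpha if y == 1 else w[i] / alpha
            return w, theta

        data = [([1, 0, 1, 0], 1), ([0, 1, 0, 1], 0), ([1, 1, 0, 0], 1)]   # invented toy data
        print(winnow_train(data, n_features=4))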

    Linear binary classification

    Data: {(xi, yi)} for i = 1, ..., n

    - xi in Rd (each x is a vector in d-dimensional space): the feature vector

    - yi in {-1, +1}: the label (class, category)

    Classification rule:

    - if w·x + b > 0 then y = +1

    - if w·x + b < 0 then y = -1

    From Gert Lanckriet, Statistical Learning Theory Tutorial
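    A minimal sketch of the textbook Perceptron in this notation (the algorithm slides themselves are missing from the transcript): predict sign(w·x + b) and, on every mistake, update w <- w + y·x and b <- b + y. NumPy is assumed; the toy data points are invented.

        import numpy as np

        def perceptron_train(X, y, epochs=10):
            w, b = np.zeros(X.shape[1]), 0.0
            for _ in range(epochs):
                for xi, yi in zip(X, y):
                    if yi * (np.dot(w, xi) + b) <= 0:   # mistake: wrong side or on the boundary
                        w += yi * xi                    # additive, mistake-driven update
                        b += yi
            return w, b

        X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, 1.0]])
        y = np.array([1, 1, -1, -1])
        w, b = perceptron_train(X, y)
        print(np.sign(X @ w + b))   # reproduces the labels: this toy data is linearly separable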


    Support Vector Machine (SVM)

    Large Margin Classifier

    Linearly separable case

    Goal: find the hyperplane that maximizes the margin

    [Figure: separating hyperplane w·x + b = 0 with margin M between the planes w·xa + b = 1 and w·xb + b = -1; the training points lying on these two planes are the support vectors]

    From Gert Lanckriet, Statistical Learning Theory Tutorial
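    A minimal large-margin sketch with scikit-learn's SVC (not the tool used in the course); the toy points are invented, and a very large C is used to approximate the hard-margin, linearly separable case.

        import numpy as np
        from sklearn.svm import SVC

        X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, 0.0]])
        y = np.array([1, 1, -1, -1])

        clf = SVC(kernel="linear", C=1e6).fit(X, y)   # hard-margin-like linear SVM
        w, b = clf.coef_[0], clf.intercept_[0]
        print("w =", w, "b =", b)
        print("margin width = 2 / ||w|| =", 2 / np.linalg.norm(w))
        print("support vectors:", clf.support_vectors_)   # the points on the margin planes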

    Support Vector Machine (SVM)

    Text classification

    Handwriting recognition

    Computational biology (e.g., microarray data)

    Face detection

    Face expression recognition

    Time series prediction

    From Gert Lanckriet, Statistical Learning Theory Tutorial

    Non Linear problem

    [Figures: examples of classification problems that are not linearly separable]

    Non Linear problem

    Kernel methods

    - A family of non-linear algorithms

    - Transform the non linear problem into a linear one (in a different feature space)

    - Use linear algorithms to solve the linear problem in the new space

    From Gert Lanckriet, Statistical Learning Theory Tutorial

    Main intuition of kernel methods

    (Copy here from black board)


    Basic principle of kernel methods

    Φ: Rd → RD (D >> d)

    Decision boundary in the new space: w·Φ(x) + b = 0

    Example: X = [x z], Φ(X) = [x², z², xz]

    f(X) = sign(w1·x² + w2·z² + w3·xz + b)

    From Gert Lanckriet, Statistical Learning Theory Tutorial
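    A small sketch of this example map, assuming NumPy: Φ([x, z]) = [x², z², xz] sends 2-D points into 3-D, where the invented points inside the circle x² + z² = 2 (which no straight line can split off from the points outside it in the original plane) become linearly separable, using exactly the rule f(X) = sign(w1·x² + w2·z² + w3·xz + b).

        import numpy as np

        def phi(p):                       # the explicit feature map from the slide
            x, z = p
            return np.array([x * x, z * z, x * z])

        w, b = np.array([1.0, 1.0, 0.0]), -2.0            # a separating hyperplane in the mapped space

        points = np.array([[0.5, 0.0], [0.0, -0.5], [2.0, 0.0], [0.0, 2.0], [-2.0, 0.0]])
        labels = np.array([-1, -1, 1, 1, 1])              # -1: inside the circle, +1: outside

        pred = np.sign([np.dot(w, phi(p)) + b for p in points])
        print(pred)                                       # matches the labels exactly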

    Basic principle of kernel methods

    Linear separability: more likely in high dimensions

    Mapping: Φ maps the input into a high-dimensional feature space

    Motivation: an appropriate choice of Φ leads to linear separability

    We can do this efficiently!

    From Gert Lanckriet, Statistical Learning Theory Tutorial

    Basic principle of kernel methods

    We can use the linear algorithms seen before (Perceptron, SVM) for classification in the higher-dimensional space

    Multi-class classification

    Given: some data items that belong to one of M possible classes

    Task: train the classifier and predict the class for a new data item

    Geometrically: a harder problem, no more simple geometry


    Multi-class classification: Examples

    Author identification

    Language identification

    Text categorization (topics)


    (Some) Algorithms for Multi-class classification

    Linear

    - Parallel class separators: Decision Trees

    - Non parallel class separators: Naïve Bayes

    Non Linear

    - K-nearest neighbors

    Linear, parallel class separators (ex: Decision Trees)

    Linear, NON parallel class separators (ex: Naïve Bayes)

    Non Linear (ex: k Nearest Neighbor)

    Decision Trees

    A decision tree is a classifier in the form of a tree structure, where each node is either:

    - a leaf node, which indicates the value of the target attribute (class) of the examples, or

    - a decision node, which specifies some test to be carried out on a single attribute value, with one branch and subtree for each possible outcome of the test

    A decision tree can be used to classify an example by starting at the root of the tree and moving through it until a leaf node is reached, which provides the classification of the instance

    http://dms.irb.hr/tutorial/tut_dtrees.php

    Training Examples

    Goal: learn when we can play Tennis and when we cannot

    Day   Outlook   Temp  Humidity  Wind    PlayTennis
    D1    Sunny     Hot   High      Weak    No
    D2    Sunny     Hot   High      Strong  No
    D3    Overcast  Hot   High      Weak    Yes
    D4    Rain      Mild  High      Weak    Yes
    D5    Rain      Cool  Normal    Weak    Yes
    D6    Rain      Cool  Normal    Strong  No
    D7    Overcast  Cool  Normal    Weak    Yes
    D8    Sunny     Mild  High      Weak    No
    D9    Sunny     Cool  Normal    Weak    Yes
    D10   Rain      Mild  Normal    Strong  Yes
    D11   Sunny     Mild  Normal    Strong  Yes
    D12   Overcast  Mild  High      Strong  Yes
    D13   Overcast  Hot   Normal    Weak    Yes
    D14   Rain      Mild  High      Strong  No
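    A minimal sketch of fitting a decision tree to the table above with scikit-learn (the slides themselves use Weka). Categorical attributes are one-hot encoded because scikit-learn trees expect numeric inputs; pandas is assumed.

        import pandas as pd
        from sklearn.tree import DecisionTreeClassifier, export_text

        data = pd.DataFrame({
            "Outlook":  ["Sunny", "Sunny", "Overcast", "Rain", "Rain", "Rain", "Overcast",
                         "Sunny", "Sunny", "Rain", "Sunny", "Overcast", "Overcast", "Rain"],
            "Temp":     ["Hot", "Hot", "Hot", "Mild", "Cool", "Cool", "Cool",
                         "Mild", "Cool", "Mild", "Mild", "Mild", "Hot", "Mild"],
            "Humidity": ["High", "High", "High", "High", "Normal", "Normal", "Normal",
                         "High", "Normal", "Normal", "Normal", "High", "Normal", "High"],
            "Wind":     ["Weak", "Strong", "Weak", "Weak", "Weak", "Strong", "Weak",
                         "Weak", "Weak", "Strong", "Strong", "Strong", "Weak", "Strong"],
            "PlayTennis": ["No", "No", "Yes", "Yes", "Yes", "No", "Yes",
                           "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No"],
        })

        X = pd.get_dummies(data.drop(columns="PlayTennis"))   # one-hot encode the attributes
        y = data["PlayTennis"]
        tree = DecisionTreeClassifier(criterion="entropy").fit(X, y)
        print(export_text(tree, feature_names=list(X.columns)))   # show the learned tests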


    Building Decision Trees

    Splitting criterion

    - Finding the features and the values to split on

      - for example, why test first "cts" and not "vs"?

      - why test on "cts < 2" and not on some other value?

    - Split that gives us the maximum information gain (or the maximum reduction of uncertainty)

    When all the elements at one node have the same class, there is no need to split further

    In practice, one first builds a large tree and then prunes it back (to avoid overfitting)

    See Foundations of Statistical Natural Language Processing, Manning and Schütze, for a good introduction
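    Information gain for one candidate split, as described above, is the entropy of the class labels minus the weighted entropy of the labels after splitting on an attribute. A minimal sketch in plain Python, shown on the Outlook column of the PlayTennis table.

        import math
        from collections import Counter

        def entropy(labels):
            n = len(labels)
            return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

        def information_gain(attr_values, labels):
            n = len(labels)
            remainder = sum(
                (cnt / n) * entropy([l for a, l in zip(attr_values, labels) if a == v])
                for v, cnt in Counter(attr_values).items())
            return entropy(labels) - remainder      # reduction of uncertainty

        outlook = ["Sunny", "Sunny", "Overcast", "Rain", "Rain", "Rain", "Overcast",
                   "Sunny", "Sunny", "Rain", "Sunny", "Overcast", "Overcast", "Rain"]
        play    = ["No", "No", "Yes", "Yes", "Yes", "No", "Yes",
                   "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No"]
        print(information_gain(outlook, play))      # about 0.247 bits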

    Decision Trees: Strengths

    Decision trees are able to generate understandable rules

    Decision trees perform classification without requiring much computation

    Decision trees are able to handle both continuous and categorical variables

    Decision trees provide a clear indication of which features are most important for prediction or classification

    http://dms.irb.hr/tutorial/tut_dtrees.php

    Decision Trees: Weaknesses

    Decision trees are prone to errors in classification problems with many classes and a relatively small number of training examples

    Decision trees can be computationally expensive to train (one needs to compare all possible splits); pruning is also expensive

    Most decision-tree algorithms only examine a single field at a time. This leads to rectangular classification boxes that may not correspond well with the actual distribution of records in the decision space

    http://dms.irb.hr/tutorial/tut_dtrees.php

    Decision Trees

    Decision Trees in Weka

    Naïve Bayes

    More powerful than Decision Trees

    [Figure: decision boundaries of Decision Trees vs. Naïve Bayes]

    Naïve Bayes Models

    Graphical Models: graph theory plus probability theory

    Edges are conditional probabilities

    [Figure: a node A with edges to nodes B and C, annotated P(A), P(B|A), P(C|A)]
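    The graph encodes a factorization of the joint distribution: with edges A→B and A→C, P(A, B, C) = P(A)·P(B|A)·P(C|A). A minimal sketch with invented numbers for binary variables.

        # Joint probability factored along the edges of the graph above.
        p_a = {True: 0.3, False: 0.7}                       # P(A)
        p_b_given_a = {True: {True: 0.8, False: 0.2},       # P(B | A)
                       False: {True: 0.1, False: 0.9}}
        p_c_given_a = {True: {True: 0.5, False: 0.5},       # P(C | A)
                       False: {True: 0.4, False: 0.6}}

        def joint(a, b, c):
            return p_a[a] * p_b_given_a[a][b] * p_c_given_a[a][c]

        print(joint(True, True, False))                     # 0.3 * 0.8 * 0.5 = 0.12
        print(sum(joint(a, b, c) for a in (True, False)
                  for b in (True, False) for c in (True, False)))   # sums to 1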


    Naïve Bayes: Strengths

    Very simple model

    - Easy to understand

    - Very easy to implement

    Modest space storage

    Widely used because it works really well for text categorization

    Linear, but non parallel decision boundaries
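    A minimal text-categorization sketch with scikit-learn's multinomial Naïve Bayes (the slides themselves use Weka); the toy documents and topic labels are invented.

        from sklearn.feature_extraction.text import CountVectorizer
        from sklearn.naive_bayes import MultinomialNB
        from sklearn.pipeline import make_pipeline

        docs   = ["the team won the game", "great match and final score",
                  "the election results were announced", "the president gave a speech"]
        labels = ["sports", "sports", "politics", "politics"]

        model = make_pipeline(CountVectorizer(), MultinomialNB())   # bag of words + NB
        model.fit(docs, labels)
        print(model.predict(["the president called an election"]))  # expected: politics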

    Naïve Bayes: Weaknesses

    The Naïve Bayes independence assumption has two consequences:

    - The linear ordering of the words is ignored (bag of words model)

    - The words are assumed independent of each other given the class, which is false:

      - "President" is more likely to occur in a context that contains "election" than in one that does not

    The Naïve Bayes assumption is inappropriate if there are strong conditional dependencies between the variables

    (But even if the model is not "right", Naïve Bayes models do well in a surprisingly large number of cases because often we are interested in classification accuracy and not in accurate probability estimations)

    Naïve Bayes

    Naïve Bayes in Weka

    k Nearest Neighbor Classification

    Nearest Neighbor classification rule: to classify a new object, find the object in the training set that is most similar, then assign the category of this nearest neighbor

    K Nearest Neighbor (KNN): consult the k nearest neighbors; the decision is based on the majority category of these neighbors. More robust than k = 1

    An example of a similarity measure often used in NLP is cosine similarity
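    A minimal sketch of k-nearest-neighbor classification with cosine similarity: find the k training vectors most similar to the query and take a majority vote over their labels. NumPy is assumed; the vectors and class names are invented.

        import numpy as np
        from collections import Counter

        def cosine(u, v):
            return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

        def knn_classify(query, X_train, y_train, k=3):
            sims = [cosine(query, x) for x in X_train]
            top_k = np.argsort(sims)[-k:]                 # indices of the k most similar vectors
            return Counter(y_train[i] for i in top_k).most_common(1)[0][0]

        X_train = np.array([[1.0, 0.1], [0.9, 0.2], [0.8, 0.0],
                            [0.1, 1.0], [0.0, 0.9], [0.2, 0.8]])
        y_train = ["classA", "classA", "classA", "classB", "classB", "classB"]

        print(knn_classify(np.array([0.2, 0.9]), X_train, y_train, k=3))   # classB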

    1-Nearest Neighbor

    [Figures]
