
Transcript

    Classification and Prediction

    Data Mining

    Modified from the slides by Prof. Han


    Chapter 7. Classification and Prediction

    What is classification? What is prediction?

    Issues regarding classification and prediction

    Classification by decision tree induction

    Bayesian Classification

    Classification by backpropagation

    Classification based on concepts from association rule mining

    Other Classification Methods

    Prediction

    Classification accuracy

    Summary


    Classification vs. Prediction

    Classification:

    predicts categorical class labels

    classifies data (constructs a model) based on the training set and the values (class labels) in a classifying attribute, and uses it in classifying new data

    Prediction (regression): models continuous-valued functions, i.e., predicts unknown or missing values

    e.g., expenditure of potential customers given their income and occupation

    Typical applications

    credit approval

    target marketing

    medical diagnosis


    Classification: A Two-Step Process

    Model construction: describing a set of predetermined classes

    Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute

    The set of tuples used for model construction: the training set

    The model is represented as classification rules, decision trees, or mathematical formulae

    Model usage: classifying future or unknown objects

    Estimate accuracy of the model

    The known label of a test sample is compared with the classified result from the model

    Accuracy rate is the percentage of test set samples that are correctly classified by the model

    The test set is independent of the training set; otherwise over-fitting will occur


    Classification Process (1): Model Construction

    Training Data

    NAME    RANK            YEARS  TENURED
    Mike    Assistant Prof  3      no
    Mary    Assistant Prof  7      yes
    Bill    Professor       2      yes
    Jim     Associate Prof  7      yes
    Dave    Assistant Prof  6      no
    Anne    Associate Prof  3      no

    Classification Algorithms

    IF rank = professor OR years > 6
    THEN tenured = yes

    Classifier (Model)


    Classification Process (2): Use the Model in Prediction

    Classifier

    Testing Data

    NAME     RANK            YEARS  TENURED
    Tom      Assistant Prof  2      no
    Merlisa  Associate Prof  7      no
    George   Professor       5      yes
    Joseph   Assistant Prof  7      yes

    Unseen Data

    (Jeff, Professor, 4)

    Tenured?


    Supervised vs. Unsupervised Learning

    Supervised learning (classification)

    Supervision: The training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations

    New data is classified based on the training set

    Unsupervised learning (clustering)

    The class labels of training data are unknown

    Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data


    Chapter 7. Classification and Prediction

    What is classification? What is prediction?

    Issues regarding classification and prediction

    Classification by decision tree induction

    Bayesian Classification

    Classification by backpropagation

    Classification based on concepts from association rule mining

    Other Classification Methods

    Prediction

    Classification accuracy

    Summary


    Issues regarding classification and prediction (1): Data Preparation

    Data cleaning

    Preprocess data in order to reduce noise and handle missing values

    e.g., smoothing techniques for noise and common-value replacement for missing values

    Relevance analysis (feature selection in ML)

    Remove the irrelevant or redundant attributes

    Relevance analysis + learning on the reduced attribute set can cost less than learning on the original set

    Data transformation

    Generalize data (discretize)

    E.g., income: low, medium, high; e.g., street generalized to city

    Normalize data (particularly for NN), e.g., income scaled to [-1..1] or [0..1]

    Without this, what happens?


    Issues regarding classification and prediction (2): Evaluating Classification Methods

    Predictive accuracy

    Speed

    time to construct the model

    time to use the model

    Robustness

    handling noise and missing values

    Scalability

    efficiency in disk-resident databases

    Interpretability:

    understanding and insight provided by the model


    Chapter 7. Classification and Prediction

    What is classification? What is prediction?

    Issues regarding classification and prediction

    Classification by decision tree induction

    Bayesian Classification

    Classification by backpropagation

    Classification based on concepts from association rule mining

    Other Classification Methods

    Prediction

    Classification accuracy

    Summary


    Classification by Decision Tree Induction

    Decision tree

    A flow-chart-like tree structure

    Internal node denotes a test on an attribute

    Branch represents an outcome of the test

    Leaf nodes represent class labels or class distribution

    Decision tree generation consists of two phases

    Tree construction

    At start, all the training examples are at the root

    Partition examples recursively based on selected attributes

    Tree pruning

    Identify and remove branches that reflect noise or outliers

    Use of decision tree: classifying an unknown sample

    Test the attribute values of the sample against the decision tree


    Training Dataset

    age     income  student  credit_rating  buys_computer
    >40     low     yes      fair           yes
    >40     low     yes      excellent      no
    31..40  low     yes      excellent      yes
    ...


    Output: A Decision Tree for buys_computer

    age?
      <=30:   student?
                no  -> buys_computer = no
                yes -> buys_computer = yes
      30..40: buys_computer = yes
      >40:    credit_rating?
                excellent -> buys_computer = no
                fair      -> buys_computer = yes


    Algorithm for Decision Tree Induction

    Basic algorithm (a greedy algorithm)

    Tree is constructed in a top-down recursive divide-and-conquer manner

    At start, all the training examples are at the root

    Attributes are categorical (if continuous-valued, they are discretized in advance)

    Examples are partitioned recursively based on selected attributes

    Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)

    Conditions for stopping partitioning

    All samples for a given node belong to the same class

    There are no remaining attributes for further partitioning (majority voting is employed for classifying the leaf)

    There are no samples left
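    The following is a minimal Python sketch, not from the slides, of the greedy, top-down, divide-and-conquer loop just described; the tuple representation (dicts of categorical attribute values plus a class label) and the pluggable choose_attribute heuristic are illustrative assumptions.

    from collections import Counter

    def build_tree(samples, attributes, label, choose_attribute):
        classes = [s[label] for s in samples]
        if len(set(classes)) == 1:                # all samples in one class
            return classes[0]
        if not attributes:                        # no attributes left: majority vote
            return Counter(classes).most_common(1)[0][0]
        best = choose_attribute(samples, attributes, label)   # heuristic measure
        tree = {best: {}}
        for value in {s[best] for s in samples}:  # partition examples recursively
            subset = [s for s in samples if s[best] == value]
            rest = [a for a in attributes if a != best]
            tree[best][value] = build_tree(subset, rest, label, choose_attribute)
        return tree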


    Attribute Selection Measure

    Information gain (ID3/C4.5)

    All attributes are assumed to be categorical

    Can be modified for continuous-valued attributes

    Gini index (IBM IntelligentMiner)

    All attributes are assumed continuous-valued

    Assume there exist several possible split values for each attribute

    May need other tools, such as clustering, to get the possible split values

    Can be modified for categorical attributes


    Information Gain (ID3/C4.5)

    Select the attribute with the highest information gain

    Assume there are two classes, P and N

    Let the set of examples S contain p elements of class P and n elements of class N

    The amount of information needed to decide whether an arbitrary example in S belongs to P or N is defined as

    I(p, n) = -\frac{p}{p+n}\log_2\frac{p}{p+n} - \frac{n}{p+n}\log_2\frac{n}{p+n}


    Information Gain in Decision Tree Induction

    Assume that, using attribute A, a set S will be partitioned into sets {S_1, S_2, ..., S_v}

    If S_i contains p_i examples of P and n_i examples of N, the entropy, or the expected information needed to classify objects in all subtrees S_i, is

    E(A) = \sum_{i=1}^{v} \frac{p_i + n_i}{p + n}\, I(p_i, n_i)

    The encoding information that would be gained by branching on A is

    Gain(A) = I(p, n) - E(A)


    Attribute Selection by Information Gain Computation

    Class P: buys_computer = "yes"

    Class N: buys_computer = "no"

    I(p, n) = I(9, 5) = 0.940

    Compute the entropy for age:

    age      p_i  n_i  I(p_i, n_i)
    <=30     2    3    0.971
    31..40   4    0    0
    >40      3    2    0.971

    E(age) = \frac{5}{14} I(2,3) + \frac{4}{14} I(4,0) + \frac{5}{14} I(3,2) = 0.694

    Hence

    Gain(age) = I(p, n) - E(age) = 0.246

    Similarly,

    Gain(income) = 0.029

    Gain(student) = 0.151

    Gain(credit_rating) = 0.048
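    As a check on the numbers above, here is a short Python sketch (names are illustrative, not from the slides) that recomputes I(9, 5), E(age), and Gain(age) from the per-branch counts in the table:

    from math import log2

    def info(p, n):
        # I(p, n) = -p/(p+n) log2 p/(p+n) - n/(p+n) log2 n/(p+n)
        total = p + n
        return -sum((c / total) * log2(c / total) for c in (p, n) if c)

    p, n = 9, 5                                   # 9 "yes" and 5 "no" tuples overall
    branches = {"<=30": (2, 3), "31..40": (4, 0), ">40": (3, 2)}

    e_age = sum((pi + ni) / (p + n) * info(pi, ni) for pi, ni in branches.values())
    print(f"{info(p, n):.3f} {e_age:.3f} {info(p, n) - e_age:.3f}")
    # -> 0.940 0.694 0.247 (the slide's 0.246 comes from subtracting the rounded values)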


    Gini Index (IBM IntelligentMiner)

    If a data set T contains examples from n classes, the gini index gini(T) is defined as

    gini(T) = 1 - \sum_{j=1}^{n} p_j^2

    where p_j is the relative frequency of class j in T.

    If a data set T is split into two subsets T_1 and T_2 with sizes N_1 and N_2 respectively, the gini index of the split data is defined as

    gini_{split}(T) = \frac{N_1}{N}\, gini(T_1) + \frac{N_2}{N}\, gini(T_2)

    The attribute that provides the smallest gini_split(T) is chosen to split the node (need to enumerate all possible splitting points for each attribute).
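    A small Python sketch of the two measures just defined, assuming class labels are given as plain lists; it is illustrative only:

    def gini(labels):
        n = len(labels)
        return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

    def gini_split(t1_labels, t2_labels):
        n = len(t1_labels) + len(t2_labels)
        return (len(t1_labels) / n) * gini(t1_labels) + (len(t2_labels) / n) * gini(t2_labels)

    # The candidate split with the smallest gini_split value would be chosen.
    print(gini_split(["yes", "yes", "no"], ["no", "no", "no", "yes"]))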


    Extracting Classification Rules from Trees

    Represent the knowledge in the form of IF-THEN rules

    One rule is created for each path from the root to a leaf

    Each attribute-value pair along a path forms a conjunction

    The leaf node holds the class prediction

    Rules are easier for humans to understand

    Example: IF age = "<=30" AND student = "no" THEN buys_computer = "no"


    Avoid Overfitting in Classification

    The generated tree may overfit the training data

    Too many branches, some may reflect anomalies due to noise or outliers

    The result is poor accuracy for unseen samples

    Two approaches to avoid overfitting

    Prepruning: Halt tree construction early; do not split a node if this would result in the goodness measure falling below a threshold

    Difficult to choose an appropriate threshold

    Postpruning: Remove branches from a fully grown tree, obtaining a sequence of progressively pruned trees

    Use a set of data different from the training data to decide which is the best pruned tree


    Approaches to Determine the Final Tree Size

    Separate training (2/3) and testing (1/3) sets

    Use cross-validation, e.g., 10-fold cross-validation

    Use all the data for training

    but apply a statistical test (e.g., chi-square) to estimate whether expanding or pruning a node may improve the entire distribution

    Use the minimum description length (MDL) principle:

    halting growth of the tree when the encoding is minimized


    Enhancements to basic decision tree induction

    Allow for continuous-valued attributes

    Dynamically define new discrete-valued attributes that partition the continuous attribute values into a discrete set of intervals

    Handle missing attribute values

    Assign the most common value of the attribute

    Assign a probability to each of the possible values

    Attribute construction

    Create new attributes based on existing ones that are sparsely represented

    This reduces fragmentation, repetition, and replication


    Classification in Large Databases

    Classification is a classical problem extensively studied by statisticians and machine learning researchers

    Scalability: classifying data sets with millions of examples and hundreds of attributes with reasonable speed

    Why decision tree induction in data mining?

    relatively faster learning speed (than other classification methods)

    convertible to simple and easy-to-understand classification rules

    can use SQL queries for accessing databases

    comparable classification accuracy with other methods


    Scalable Decision Tree Induction Methods in Data Mining Studies

    SLIQ (EDBT'96, Mehta et al.): builds an index for each attribute; only the class list and the current attribute list reside in memory

    SPRINT (VLDB'96, J. Shafer et al.): constructs an attribute list data structure

    PUBLIC (VLDB'98, Rastogi & Shim): integrates tree splitting and tree pruning: stop growing the tree earlier

    RainForest (VLDB'98, Gehrke, Ramakrishnan & Ganti): separates the scalability aspects from the criteria that determine the quality of the tree

    builds an AVC-list (attribute, value, class label)


    Data Cube-Based Decision-Tree Induction

    Integration of generalization with decision-tree induction (Kamber et al.'97).

    Classification at primitive concept levels

    E.g., precise temperature, humidity, outlook, etc.

    Low-level concepts, scattered classes, bushy classification trees

    Semantic interpretation problems.

    Cube-based multi-level classification

    Relevance analysis at multi-levels.

    Information-gain analysis with dimension + level.


    Presentation of Classification Results


    Chapter 7. Classification and Prediction

    What is classification? What is prediction?

    Issues regarding classification and prediction

    Classification by decision tree induction

    Bayesian Classification

    Classification by backpropagation

    Classification based on concepts from association rule mining

    Other Classification Methods

    Prediction

    Classification accuracy

    Summary


    Bayesian Classification: Why?

    Probabilistic learning: Calculate explicit probabilities for hypotheses; among the most practical approaches to certain types of learning problems

    Incremental: Each training example can incrementally increase/decrease the probability that a hypothesis is correct. Prior knowledge can be combined with observed data.

    Probabilistic prediction: Predict multiple hypotheses, weighted by their probabilities

    Standard: Even when Bayesian methods are computationally intractable, they can provide a standard of optimal decision making against which other methods can be measured


    Bayesian Theorem

    Given training data D, the posterior probability of a hypothesis h, P(h|D), follows Bayes' theorem:

    P(h|D) = \frac{P(D|h)\,P(h)}{P(D)}

    MAP (maximum a posteriori) hypothesis:

    h_{MAP} = \arg\max_{h \in H} P(h|D) = \arg\max_{h \in H} P(D|h)\,P(h)

    Practical difficulty: requires initial knowledge of many probabilities, and significant computational cost


    Bayesian classification

    The classification problem may be formalized using a-posteriori probabilities:

    P(C|X) = probability that the sample tuple X = <x1, ..., xk> is of class C.

    E.g., P(class=N | outlook=sunny, windy=true, ...)

    Idea: assign to sample X the class label C such that P(C|X) is maximal


    Estimating a-posteriori probabilities

    Bayes theorem:

    P(C|X) = P(X|C)P(C) / P(X)

    P(X) is constant for all classes

    P(C) = relative frequency of class C samples

    C such that P(C|X) is maximum = C such that P(X|C)·P(C) is maximum

    Problem: computing P(X|C) is infeasible!


    Naïve Bayesian Classification

    Naïve assumption: attribute independence

    P(x1, ..., xk|C) = P(x1|C) · ... · P(xk|C)

    If the i-th attribute is categorical:

    P(xi|C) is estimated as the relative frequency of samples having value xi as the i-th attribute in class C

    If the i-th attribute is continuous:

    P(xi|C) is estimated through a Gaussian density function

    Computationally easy in both cases


    Play-tennis example: estimating P(xi|C)

    Outlook   Temperature  Humidity  Windy  Class
    sunny     hot          high      false  N
    sunny     hot          high      true   N
    overcast  hot          high      false  P
    rain      mild         high      false  P
    rain      cool         normal    false  P
    rain      cool         normal    true   N
    overcast  cool         normal    true   P
    sunny     mild         high      false  N
    sunny     cool         normal    false  P
    rain      mild         normal    false  P
    sunny     mild         normal    true   P
    overcast  mild         high      true   P
    overcast  hot          normal    false  P
    rain      mild         high      true   N

    P(p) = 9/14, P(n) = 5/14

    outlook:      P(sunny|p) = 2/9     P(sunny|n) = 3/5
                  P(overcast|p) = 4/9  P(overcast|n) = 0
                  P(rain|p) = 3/9      P(rain|n) = 2/5

    temperature:  P(hot|p) = 2/9       P(hot|n) = 2/5
                  P(mild|p) = 4/9      P(mild|n) = 2/5
                  P(cool|p) = 3/9      P(cool|n) = 1/5

    humidity:     P(high|p) = 3/9      P(high|n) = 4/5
                  P(normal|p) = 6/9    P(normal|n) = 2/5

    windy:        P(true|p) = 3/9      P(true|n) = 3/5
                  P(false|p) = 6/9     P(false|n) = 2/5


    Play-tennis example: classifying X

    An unseen sample X = <rain, hot, high, false>

    P(X|p)·P(p) = P(rain|p)·P(hot|p)·P(high|p)·P(false|p)·P(p) = 3/9 · 2/9 · 3/9 · 6/9 · 9/14 = 0.010582

    P(X|n)·P(n) = P(rain|n)·P(hot|n)·P(high|n)·P(false|n)·P(n) = 2/5 · 2/5 · 4/5 · 2/5 · 5/14 = 0.018286

    Sample X is classified in class n (don't play)
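    The same computation can be written as a short naïve Bayes sketch in Python over the play-tennis table above (relative-frequency estimates, no smoothing, exactly as on the slide); the function name is illustrative:

    data = [  # (outlook, temperature, humidity, windy, class) -- the table above
        ("sunny", "hot", "high", "false", "N"), ("sunny", "hot", "high", "true", "N"),
        ("overcast", "hot", "high", "false", "P"), ("rain", "mild", "high", "false", "P"),
        ("rain", "cool", "normal", "false", "P"), ("rain", "cool", "normal", "true", "N"),
        ("overcast", "cool", "normal", "true", "P"), ("sunny", "mild", "high", "false", "N"),
        ("sunny", "cool", "normal", "false", "P"), ("rain", "mild", "normal", "false", "P"),
        ("sunny", "mild", "normal", "true", "P"), ("overcast", "mild", "high", "true", "P"),
        ("overcast", "hot", "normal", "false", "P"), ("rain", "mild", "high", "true", "N"),
    ]

    def nb_scores(x):
        scores = {}
        for c in {row[-1] for row in data}:
            rows = [r for r in data if r[-1] == c]
            score = len(rows) / len(data)                              # P(C)
            for i, value in enumerate(x):                              # naive independence
                score *= sum(r[i] == value for r in rows) / len(rows)  # P(xi|C)
            scores[c] = score
        return scores

    print(nb_scores(("rain", "hot", "high", "false")))
    # -> P: 0.010582..., N: 0.018285...; N wins, so "don't play"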


    The independence hypothesis

    makes computation possible

    yields optimal classifiers when satisfied

    but it is seldom satisfied in practice, as attributes (variables) are often correlated

    Attempts to overcome this limitation:

    Bayesian networks, which combine Bayesian reasoning with causal relationships between attributes

    Decision trees, which reason on one attribute at a time, considering the most important attributes first


    Bayesian Belief Networks (I)

    [Figure: a Bayesian belief network over the variables FamilyHistory, Smoker, LungCancer, Emphysema, PositiveXRay, and Dyspnea]

    The conditional probability table for the variable LungCancer:

             (FH, S)  (FH, ~S)  (~FH, S)  (~FH, ~S)
    LC        0.8      0.5       0.7       0.1
    ~LC       0.2      0.5       0.3       0.9


    Bayesian Belief Networks (II)

    A Bayesian belief network allows a subset of the variables to be conditionally independent

    A graphical model of causal relationships

    Several cases of learning Bayesian belief networks

    Given both the network structure and all the variables: easy

    Given the network structure but only some variables

    When the network structure is not known in advance


    Chapter 7. Classification and Prediction

    What is classification? What is prediction?

    Issues regarding classification and prediction

    Classification by decision tree induction

    Bayesian Classification

    Classification by backpropagation

    Classification based on concepts from association rule mining

    Other Classification Methods

    Prediction

    Classification accuracy

    Summary


    Neural Networks

    Advantages

    prediction accuracy is generally high

    robust, works when training examples contain errors

    output may be discrete, real-valued, or a vector of several discrete or real-valued attributes

    fast evaluation of the learned target function

    Criticism

    long training time

    difficult to understand the learned function (weights)

    not easy to incorporate domain knowledge


    A Neuron

    The n-dimensional input vector x is mapped into the variable y by means of the scalar product and a nonlinear function mapping

    [Figure: a neuron computes a weighted sum of the inputs x_0, x_1, ..., x_n with weight vector w (w_0, w_1, ..., w_n), minus a bias, and passes it through an activation function f to produce the output y]

    In a hidden or output layer, the net input I_j to unit j is

    I_j = \sum_i w_{ij} O_i + \theta_j


    Network Training

    The ultimate objective of training

    obtain a set of weights that makes almost all the tuples in the training data classified correctly

    Steps

    Initialize weights with random values

    Feed the input tuples into the network one by one

    For each unit

    Compute the net input to the unit as a linear combination of all the inputs to the unit

    Compute the output value using the activation function

    Compute the error

    Update the weights and the bias


    Multi-Layer Perceptron

    [Figure: a feed-forward network with an input vector x_i feeding the input nodes, weights w_ij connecting to the hidden nodes, and hidden nodes feeding the output nodes that produce the output vector]

    Net input to unit j (O_i are the inputs to unit j):

    I_j = \sum_i w_{ij} O_i + \theta_j

    Output of unit j (sigmoid function, mapping I_j to [0..1]):

    O_j = \frac{1}{1 + e^{-I_j}}

    Error of the output layer, where O_j(1 - O_j) is the derivative of the sigmoid function:

    Err_j = O_j (1 - O_j)(T_j - O_j)

    Error of the hidden layer:

    Err_j = O_j (1 - O_j) \sum_k Err_k\, w_{jk}

    Weight and bias updates, with learning rate l in [0..1]:

    w_{ij} = w_{ij} + (l)\, Err_j\, O_i

    \theta_j = \theta_j + (l)\, Err_j

    l too small: the learning pace is too slow; too large: oscillation between wrong solutions

    Heuristic: l = 1/t (t: # iterations through the training set so far)
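    Below is a hedged Python sketch of one case update using exactly these formulas, for an assumed tiny topology (one hidden layer, a single sigmoid output unit); the dict-based parameter layout and the learning rate are illustrative choices, not part of the original slides.

    import math

    def sigmoid(x):
        return 1.0 / (1.0 + math.exp(-x))

    def backprop_case_update(x, target, w_in, theta_hidden, w_out, theta_out, l=0.9):
        # x: dict input_name -> value; w_in: dict hidden_name -> dict input_name -> weight;
        # theta_hidden: dict hidden_name -> bias; w_out: dict hidden_name -> weight to output.
        # Forward pass: I_j = sum_i w_ij O_i + theta_j, O_j = sigmoid(I_j)
        o_hidden = {j: sigmoid(sum(w_in[j][i] * x[i] for i in x) + theta_hidden[j])
                    for j in w_in}
        o_out = sigmoid(sum(w_out[j] * o_hidden[j] for j in o_hidden) + theta_out)
        # Backward pass: errors
        err_out = o_out * (1 - o_out) * (target - o_out)                 # output layer
        err_hidden = {j: o_hidden[j] * (1 - o_hidden[j]) * err_out * w_out[j]
                      for j in o_hidden}                                 # hidden layer
        # Updates: w_ij += l * Err_j * O_i, theta_j += l * Err_j
        for j in w_out:
            w_out[j] += l * err_out * o_hidden[j]
            theta_hidden[j] += l * err_hidden[j]
            for i in x:
                w_in[j][i] += l * err_hidden[j] * x[i]
        theta_out += l * err_out
        return o_out, theta_out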


    Multi-Layer Perceptron

    Case updating vs. epoch updating

    Weights and biases are updated after the presentation of each sample, vs.

    Deltas are accumulated over the whole set of training examples and the update is applied once per epoch

    Case updating is more common (more accurate)

    Termination conditions

    Delta is too small (convergence)

    Accuracy of the current epoch is high enough

    Pre-specified number of epochs

    In practice, hundreds of thousands of epochs


    Example

    Class label = 1


    Example

    Using the formulas above:

    I_j = \sum_i w_{ij} O_i + \theta_j,   O_j = \frac{1}{1 + e^{-I_j}}

    Err_j = O_j (1 - O_j)(T_j - O_j)  (output layer),   Err_j = O_j (1 - O_j) \sum_k Err_k\, w_{jk}  (hidden layer)

    [Figure: the forward pass through the example network gives hidden-unit outputs O = 0.332 and O = 0.525, and an output-unit value O = 0.474]


    Example

    Using the update formulas above:

    w_{ij} = w_{ij} + (l)\, Err_j\, O_i,   \theta_j = \theta_j + (l)\, Err_j

    [Figure: the backward pass gives an error E = 0.1311 at the output unit, and errors E = -0.0087 (for the hidden unit with O = 0.332) and E = -0.0065 (for the hidden unit with O = 0.525)]


    Chapter 7. Classification and Prediction

    What is classification? What is prediction?

    Issues regarding classification and prediction

    Classification by decision tree induction

    Bayesian Classification

    Classification by backpropagation

    Classification based on concepts from association rule mining

    Other Classification Methods

    Prediction

    Classification accuracy

    Summary


    Association-Based Classification

    Several methods for association-based classification

    ARCS: Quantitative association mining and clustering of association rules (Lent et al.'97)

    Associative classification (Liu et al.'98)

    CAEP (Classification by aggregating emerging patterns) (Dong et al.'99)


    ARCS: Quantitative Association Rules Revisited

    Example: age(X, "30-34") AND income(X, "24K-48K") => buys(X, "high resolution TV")

    Numeric attributes are dynamically discretized such that the confidence or compactness of the rules mined is maximized

    2-D quantitative association rules: A_quan1 AND A_quan2 => A_cat

    Cluster adjacent association rules to form general rules using a 2-D grid


    ARCS

    The clustered association rules were applied to classification

    Accuracy was compared to C4.5

    ARCS was slightly better than C4.5

    Scalability

    ARCS requires a constant amount of memory regardless of database size

    C4.5 has exponentially higher execution times


    Association-Based Classification

    Associative classification (Liu et al.'98)

    It mines high-support, high-confidence rules of the form cond_set => y, where y is a class label

    cond_set is a set of items

    Regard y as one item in association rule mining

    Support s%: s% of the samples contain cond_set and y

    Rules are accurate if c% of the samples that contain cond_set belong to y

    Possible rule (PR)

    If multiple rules have the same cond_set, choose the one with the highest confidence

    Two steps

    Step 1: find PRs

    Step 2: construct a classifier: sort the rules based on confidence and support

    Classifying a new sample: use the first rule matched

    Default rule: for a new sample with no matched rule

    Empirical study: more accurate than C4.5


    Association-Based Classification

    CAEP (Classification by aggregating emerging patterns) (Dong et al.'99)

    Emerging patterns (EPs): itemsets whose support increases significantly from one class to another

    C1: buys_computer = yes

    C2: buys_computer = no

    EP: {age


    Chapter 7. Classification and Prediction

    What is classification? What is prediction?

    Issues regarding classification and prediction

    Classification by decision tree induction

    Bayesian Classification

    Classification by backpropagation

    Classification based on concepts from association rule mining

    Other Classification Methods

    Prediction

    Classification accuracy

    Summary


    Other Classification Methods

    k-nearest neighbor classifier

    Case-based reasoning

    Genetic algorithm

    Rough set approach

    Fuzzy set approaches


    Instance-Based Methods

    Instance-based learning:

    Store training examples and delay the processing (lazy evaluation) until a new instance must be classified

    Typical approaches

    k-nearest neighbor approach

    Instances represented as points in a Euclidean space

    Locally weighted regression

    Constructs a local approximation

    Case-based reasoning

    Uses symbolic representations and knowledge-based inference


    The k-Nearest Neighbor Algorithm

    All instances correspond to points in the n-D space.

    The nearest neighbors are defined in terms of Euclidean distance.

    The target function could be discrete- or real-valued.

    For discrete-valued target functions, k-NN returns the most common value among the k training examples nearest to x_q.

    Voronoi diagram: the decision surface induced by 1-NN for a typical set of training examples.

    [Figure: a query point x_q surrounded by positive (+) and negative (-) training instances, and the Voronoi-cell decision surface induced by 1-NN]


    Discussion on the k-NN Algorithm

    The k-NN algorithm for continuous-valued target functions

    Calculate the mean value of the k nearest neighbors

    Distance-weighted nearest neighbor algorithm

    Weight the contribution of each of the k neighbors according to their distance to the query point x_q, giving greater weight to closer neighbors:

    w \equiv \frac{1}{d(x_q, x_i)^2}

    Similarly, for real-valued target functions

    Robust to noisy data by averaging over the k nearest neighbors

    Curse of dimensionality: the distance between neighbors can be dominated by irrelevant attributes

    Assigning equal weight to all attributes is a problem (vs. decision trees)

    To overcome it, stretch the axes or eliminate the least relevant attributes
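    A brief Python sketch of k-NN with this inverse-squared-distance weighting; the training-set representation and the small epsilon guarding division by zero are illustrative assumptions:

    import math
    from collections import defaultdict

    def knn_classify(query, training, k=3):
        # training: list of (point, label) pairs; points are numeric tuples
        dist = lambda a, b: math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
        nearest = sorted(training, key=lambda t: dist(query, t[0]))[:k]
        votes = defaultdict(float)
        for point, label in nearest:
            votes[label] += 1.0 / (dist(query, point) ** 2 + 1e-9)   # w = 1 / d(x_q, x_i)^2
        return max(votes, key=votes.get)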


    Case-Based Reasoning

    Also uses lazy evaluation and analysis of similar instances

    Difference: instances are not points in a Euclidean space

    Example: the water faucet problem in CADET (Sycara et al.'92)

    Methodology

    Instances represented by rich symbolic descriptions (e.g., function graphs)

    Multiple retrieved cases may be combined

    Tight coupling between case retrieval, knowledge-based reasoning, and problem solving

    Research issues

    Indexing based on a syntactic similarity measure and, on failure, backtracking and adapting to additional cases


    Remarks on Lazy vs. Eager Learning

    Instance-based learning: lazy evaluation

    Decision-tree and Bayesian classification: eager evaluation

    Key differences

    A lazy method may consider the query instance x_q when deciding how to generalize beyond the training data D

    An eager method cannot, since it has already chosen its global approximation before seeing the query

    Efficiency: lazy methods need less time for training but more time for predicting

    Accuracy

    A lazy method effectively uses a richer hypothesis space, since it uses many local linear functions to form its implicit global approximation to the target function

    An eager method must commit to a single hypothesis that covers the entire instance space


    Genetic Algorithms

    GA: based on an analogy to biological evolution

    Each rule is represented by a string of bits

    An initial population is created consisting of randomly generated rules

    e.g., IF A1 AND NOT A2 THEN C2 can be encoded as 100

    Based on the notion of survival of the fittest, a new population is formed to consist of the fittest rules and their offspring

    The fitness of a rule is represented by its classification accuracy on a set of training examples

    Offspring are generated by crossover and mutation


    Rough Set Approach

    Rough sets are used to approximately or "roughly" define equivalence classes

    A rough set for a given class C is approximated by two sets: a lower approximation (certain to be in C) and an upper approximation (cannot be described as not belonging to C)

    Finding the minimal subsets (reducts) of attributes (for feature reduction) is NP-hard, but a discernibility matrix is used to reduce the computation intensity


    Fuzzy Set Approaches

    Fuzzy logic uses truth values between 0.0 and 1.0 to represent the degree of membership (such as in a fuzzy membership graph)

    Attribute values are converted to fuzzy values

    e.g., income is mapped into the discrete categories {low, medium, high} with fuzzy values calculated

    For a given new sample, more than one fuzzy value may apply

    Each applicable rule contributes a vote for membership in the categories

    Typically, the truth values for each predicted category are summed


    Chapter 7. Classification and Prediction

    What is classification? What is prediction?

    Issues regarding classification and prediction

    Classification by decision tree induction

    Bayesian Classification

    Classification by backpropagation

    Classification based on concepts from association rule mining

    Other Classification Methods

    Prediction

    Classification accuracy

    Summary


    What Is Prediction?

    Prediction is similar to classification

    First, construct a model

    Second, use the model to predict unknown values

    The major method for prediction is regression

    Linear and multiple regression

    Non-linear regression

    Prediction is different from classification

    Classification refers to predicting a categorical class label

    Prediction models continuous-valued functions


    Predictive Modeling in Databases

    Predictive modeling: Predict data values or construct generalized linear models based on the database data.

    One can only predict value ranges or category distributions

    Method outline:

    Minimal generalization

    Attribute relevance analysis

    Generalized linear model construction

    Prediction

    Determine the major factors which influence the prediction

    Data relevance analysis: uncertainty measurement, entropy analysis, expert judgement, etc.

    Multi-level prediction: drill-down and roll-up analysis


    Regression Analysis and Log-Linear Models in Prediction

    Linear regression: Y = α + β X

    The two parameters, α and β, specify the line and are to be estimated using the data at hand,

    applying the least squares criterion to the known values Y1, Y2, ..., X1, X2, ....

    Multiple regression: Y = b0 + b1 X1 + b2 X2

    Many nonlinear functions can be transformed into the above

    e.g., X1 = X, X2 = X^2
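    A minimal Python sketch of fitting Y = α + βX by the least squares criterion (here a and b stand in for α and β, and the sample values are made up for illustration):

    def fit_line(xs, ys):
        n = len(xs)
        mean_x, mean_y = sum(xs) / n, sum(ys) / n
        b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
            sum((x - mean_x) ** 2 for x in xs)
        a = mean_y - b * mean_x
        return a, b                                # predict with a + b * x

    a, b = fit_line([1, 2, 3, 4], [2.1, 3.9, 6.2, 7.8])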


    Chapter 7. Classification and Prediction

    What is classification? What is prediction?

    Issues regarding classification and prediction

    Classification by decision tree induction

    Bayesian Classification

    Classification by backpropagation

    Classification based on concepts from association rule mining

    Other Classification Methods

    Prediction

    Classification accuracy

    Summary


    Classification Accuracy: Estimating Error Rates

    Partition: training-and-testing

    use two independent data sets, e.g., training set (2/3), test set (1/3)

    used for data sets with a large number of samples

    Cross-validation

    divide the data set into k subsamples S1, S2, ..., Sk

    use k-1 subsamples as training data and one subsample as test data --- k-fold cross-validation

    1st iteration: S2, ..., Sk for training, S1 for test

    2nd iteration: S1, S3, ..., Sk for training, S2 for test

    Accuracy = correct classifications from the k iterations / # samples

    for data sets of moderate size

    Stratified cross-validation

    the class distribution in each fold is similar to that of the total sample

    Bootstrapping (leave-one-out)

    for small data sets
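    A compact Python sketch of k-fold cross-validation as described above; it assumes a train function that returns a classifier callable on a single sample:

    def cross_validate(samples, labels, train, k=10):
        folds = [list(range(i, len(samples), k)) for i in range(k)]    # k subsamples
        correct = 0
        for test_idx in folds:
            train_idx = [i for i in range(len(samples)) if i not in test_idx]
            model = train([samples[i] for i in train_idx], [labels[i] for i in train_idx])
            correct += sum(model(samples[i]) == labels[i] for i in test_idx)
        return correct / len(samples)              # accuracy over all k iterations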


    Bagging and Boosting

    Bagging (Fig. 7.17)

    Sample S_t from S with replacement, then build classifier C_t

    Given a symptom (test data), ask multiple doctors (classifiers)

    Voting
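    A minimal Python sketch of bagging as described: draw bootstrap samples S_t, train a classifier C_t on each, and classify new data by voting; the train callback is an assumption for illustration:

    import random
    from collections import Counter

    def bagging(samples, labels, train, t=10):
        models = []
        for _ in range(t):
            idx = [random.randrange(len(samples)) for _ in samples]    # bootstrap sample S_t
            models.append(train([samples[i] for i in idx], [labels[i] for i in idx]))
        def vote(x):                                                   # majority vote of the C_t
            return Counter(m(x) for m in models).most_common(1)[0][0]
        return vote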


    Bagging and Boosting

    Boosting (Fig. 7.17)

    Combining with weights instead of voting

    Boosting increases classification accuracy

    Applicable to decision trees or Bayesian classifiers

    Learn a series of classifiers, where each classifier in the series pays more attention to the examples misclassified by its predecessor

    Boosting requires only linear time and constant space


    Boosting Technique (II): Algorithm

    Assign every example an equal weight 1/N

    For t = 1, 2, ..., T do

    1. Obtain a hypothesis (classifier) h(t) under w(t)

    2. Calculate the error of h(t) and re-weight the examples based on the error

    3. Normalize w(t+1) to sum to 1

    Output a weighted sum of all the hypotheses, with each hypothesis weighted according to its accuracy on the training set

    Here the weights belong to the classifiers, not to the examples
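    A hedged Python sketch of this weighting loop in AdaBoost style; the slide does not fix the exact re-weighting rule, so the exponential update and the classifier weight alpha used here are one common choice, not necessarily the one intended:

    import math

    def boost(samples, labels, train, t_rounds=10):
        # train(samples, labels, weights) is assumed to return a classifier callable on one sample
        n = len(samples)
        w = [1.0 / n] * n                                   # equal initial weights 1/N
        hypotheses = []                                     # (alpha, h) pairs
        for _ in range(t_rounds):
            h = train(samples, labels, w)                   # obtain h(t) under w(t)
            err = sum(wi for wi, x, y in zip(w, samples, labels) if h(x) != y)
            if err <= 0 or err >= 0.5:                      # stop if perfect or too weak
                if err <= 0:
                    hypotheses.append((1.0, h))
                break
            alpha = 0.5 * math.log((1 - err) / err)         # weight of this classifier
            hypotheses.append((alpha, h))
            w = [wi * math.exp(-alpha if h(x) == y else alpha)
                 for wi, x, y in zip(w, samples, labels)]   # re-weight the examples
            total = sum(w)
            w = [wi / total for wi in w]                    # normalize w(t+1) to sum to 1
        def classify(x):                                    # weighted combination of hypotheses
            votes = {}
            for alpha, h in hypotheses:
                votes[h(x)] = votes.get(h(x), 0.0) + alpha
            return max(votes, key=votes.get)
        return classify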


    Chapter 7. Classification and Prediction

    What is classification? What is prediction?

    Issues regarding classification and prediction

    Classification by decision tree induction

    Bayesian Classification

    Classification by backpropagation

    Classification based on concepts from association rule mining

    Other Classification Methods

    Prediction

    Classification accuracy

    Summary


    Summary

    Classification is an extensively studied problem (mainly in statistics, machine learning & neural networks)

    Classification is probably one of the most widely used data mining techniques, with a lot of extensions

    Scalability is still an important issue for database applications; thus combining classification with database techniques should be a promising topic

    Research directions: classification of non-relational data, e.g., text, spatial, multimedia, etc.