
Lecture 2: Classification and Decision Trees

Sanjeev Arora, Elad Hazan

This lecture contains material from the T. Mitchell text "Machine Learning", and slides adapted from David Sontag, Luke Zettlemoyer, Carlos Guestrin, and Andrew Moore

COS 402 – Machine Learning and Artificial Intelligence, Fall 2016

Admin

• Enrolling in the course – priorities
• Prerequisites reminder (calculus, linear algebra, discrete math, probability, data structures & graph algorithms)

• Movie tickets
• Slides
• Exercise 1 – theory – due in one week, in class

Agenda

This course: basic principles of how to design machines/programs that act "intelligently."

• Recall the crow/fox example.
• Start with a simpler task – classification, learning from examples
• Today: still naive AI. Next week: statistical learning theory

Classification

Goal: Find best mapping from domain (features) to output (labels)

• Given a document (email), classify spam or ham. Features = words, labels = {spam, ham}

• Given a picture, classify whether it contains a chair or not. Features = bits in a bitmap image, labels = {chair, no chair}

GOAL: automatic machine that learns from examples

Terminology for learning from examples:
• Set aside a "training set" of examples, train a classification machine
• Test on a "test set" to see how well the machine performs on unseen examples (see the sketch below)
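A minimal sketch of this protocol in Python (the (features, label) record format, the split fraction, and the classifier callable are illustrative assumptions, not part of the lecture):

import random

def train_test_split(records, test_fraction=0.25, seed=0):
    """Shuffle the records and hold out a fraction as the unseen test set."""
    rng = random.Random(seed)
    shuffled = list(records)          # copy so the original order is untouched
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]   # (training set, test set)

def accuracy(classifier, records):
    """Fraction of records whose predicted label matches the true label."""
    return sum(classifier(x) == y for x, y in records) / len(records)

Train only on the training set; report accuracy on the held-out test set to estimate performance on unseen examples.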

Classifying fuel efficiency

From the UCI repository (thanks to Ross Quinlan)

• 40 data points

• Goal: predict MPG

• Need to find: f : X → Y

• Discrete data (for now)

mpg   cylinders  displacement  horsepower  weight  acceleration  modelyear  maker
good  4          low           low         low     high          75to78     asia
bad   6          medium        medium      medium  medium        70to74     america
bad   4          medium        medium      medium  low           75to78     europe
bad   8          high          high        high    low           70to74     america
bad   6          medium        medium      medium  medium        70to74     america
bad   4          low           medium      low     medium        70to74     asia
bad   4          low           medium      low     low           70to74     asia
bad   8          high          high        high    low           75to78     america
:     :          :             :           :       :             :          :
bad   8          high          high        high    low           70to74     america
good  8          high          medium      high    high          79to83     america
bad   8          high          high        high    low           75to78     america
good  4          low           low         low     low           79to83     america
bad   6          medium        medium      medium  high          75to78     america
good  4          medium        low         low     low           79to83     america
good  4          low           low         medium  high          79to83     america
bad   8          high          high        high    low           70to74     america
good  4          low           medium      low     medium        75to78     europe
bad   5          medium        medium      medium  medium        75to78     europe

X = the attribute columns (cylinders, …, maker); Y = the mpg label (good/bad)

Decision trees for classification

• Why use decision trees?
• What is their expressive power?
• Can they be constructed automatically?
• How accurately can they classify?
• What makes a good decision tree besides accuracy on given examples?

Some real examples (from Russell & Norvig, Mitchell)
• BP's GasOIL system for separating gas and oil on offshore platforms – decision trees replaced a hand-designed rule system with 2500 rules. The C4.5-based system outperformed human experts and saved BP millions. (1986)

• Learning to fly a Cessna on a flight simulator by watching human experts fly the simulator (1992)

• Can also learn to play tennis, analyze C-section risk, etc.

Decision trees for classification

• Interpretable/intuitive, popular in medical applications because they mimic the way a doctor thinks
• Model discrete outcomes nicely
• Can be very powerful (expressive)
• C4.5 and CART – from "top 10 data mining methods" – very popular
• This Thursday: we'll see why not…

Decision trees for classification

Decision trees f : X → Y

• Each internal node tests an attribute xi

• One branch for each possible attribute value xi=v

• Each leaf assigns a class y

• To classify an input x: traverse the tree from root to a leaf and output that leaf's label y (see the sketch below)

• Can we construct a tree automatically?
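A minimal sketch of this traversal (the ('leaf'/'node') tuple representation and the example tree's labels are illustrative assumptions):

def classify(tree, x):
    """Walk from the root to a leaf and return the leaf's label.

    A leaf is ('leaf', label); an internal node is ('node', attribute, children),
    where children maps each attribute value to a subtree.
    """
    while tree[0] != 'leaf':
        _, attribute, children = tree
        tree = children[x[attribute]]     # follow the branch matching x's value
    return tree[1]

# Hypothetical two-level tree in the spirit of the MPG example below.
example_tree = ('node', 'cylinders', {
    4: ('node', 'maker', {'america': ('leaf', 'bad'),
                          'asia':    ('leaf', 'good'),
                          'europe':  ('leaf', 'good')}),
    6: ('leaf', 'bad'),
    8: ('leaf', 'bad'),
})
print(classify(example_tree, {'cylinders': 4, 'maker': 'asia'}))   # -> good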

[Figure: a decision tree for the MPG data. The root tests Cylinders (3, 4, 5, 6, 8); some branches end directly in good/bad leaves, while others split further on Maker (america, asia, europe) or Horsepower (low, med, high).]

Human interpretable!

Expressive power of DTs

What kind of functions can they potentially represent?
• Boolean functions?

F : {0,1}³ → {0,1}

X1 X2 X3  F(X1,X2,X3)
0  0  0   0
0  0  1   0
0  1  0   0
0  1  1   0
1  0  0   0
1  0  1   0
1  1  0   0
1  1  1   0

[Figure: a full decision tree over x1, x2, x3, with one leaf per row of the truth table.]

Is simpler better?

What kind of functions can they potentially represent?
• Boolean functions?

F : {0,1}³ → {0,1}  (the same truth table as above: F is 0 on every input)

What is the simplest tree for this function? A single leaf labeled 0.

What is the Simplest Tree?

(The fuel-efficiency table from the earlier slide.)

Is this a good tree?

[22+, 18−] means: correct on 22 examples, incorrect on 18 examples

Predict mpg = bad

Are all decision trees equal?

• Many trees can represent the same concept
• But not all trees will have the same size!
  – e.g., ((A and B) or (not A and C))

[Figure: a compact tree for this concept – the root tests A; the A = t branch tests B and the A = f branch tests C, with +/− leaves.]

• Which tree do we prefer?

[Figure: an equivalent but larger tree that splits on B and C first and must still test A in several subtrees.]

Learning the simplest decision tree is NP-hard

• Formal justification – statistical learning theory (next lecture)

• Learning the simplest (smallest) decision tree is an NP-complete problem [Hyafil & Rivest '76]

• Resort to a greedy heuristic:
  – Start from an empty decision tree
  – Split on the next best attribute (feature)
  – Recurse

Learning Algorithm for Decision Trees

Training set: S = {(x^1, y^1), …, (x^N, y^N)}, where each x = (x_1, …, x_d) and x_j ∈ {0,1}.

What happens if features are not binary? What about regression?

DT algorithms differ on these choices:
• ID3
• C4.5
• CART

A Decision Stump

Key idea: greedily learn trees using recursion

Take the original dataset… and partition it according to the value of the attribute we split on.

[Figure: the dataset partitioned into four groups – records in which cylinders = 4, cylinders = 5, cylinders = 6, and cylinders = 8.]

Recursive Step

[Figure: for each group (cylinders = 4, 5, 6, 8), build a tree from those records.]

Second level of tree

Recursively build a tree from the seven records in which there are four cylinders and the maker was based in Asia

(Similar recursion in the other cases)

A full tree

Splitting: choosing a good attribute

X1 X2 Y
T  T  T
T  F  T
T  T  T
T  F  T
F  T  T
F  F  F
F  T  F
F  F  F

Split on X1: the X1 = t branch has Y = t: 4, Y = f: 0; the X1 = f branch has Y = t: 1, Y = f: 3.
Split on X2: the X2 = t branch has Y = t: 3, Y = f: 1; the X2 = f branch has Y = t: 2, Y = f: 2.

Would we prefer to split on X1 or X2?

Idea: use counts at leaves to define probability distributions, so we can measure uncertainty!
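A minimal sketch of turning each candidate split into per-branch label counts for the table above (the tuple layout of the records is an illustrative assumption):

from collections import Counter

# The eight (X1, X2, Y) records from the table above.
records = [('T','T','T'), ('T','F','T'), ('T','T','T'), ('T','F','T'),
           ('F','T','T'), ('F','F','F'), ('F','T','F'), ('F','F','F')]

def leaf_counts(records, attribute_index):
    """Count the Y labels that land in each branch of a split on one attribute."""
    counts = {}
    for row in records:
        branch = row[attribute_index]            # the attribute value picks the branch
        counts.setdefault(branch, Counter())[row[-1]] += 1
    return counts

print(leaf_counts(records, 0))   # split on X1: t branch Y=t:4; f branch Y=t:1, Y=f:3
print(leaf_counts(records, 1))   # split on X2: t branch Y=t:3, Y=f:1; f branch Y=t:2, Y=f:2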

Measuring uncertainty

• Good split if we are more certain about the classification after the split
  – Deterministic: good (all true or all false)
  – Uniform distribution: bad
  – What about distributions in between?

P(Y=A) = 1/4 P(Y=B) = 1/4 P(Y=C) = 1/4 P(Y=D) = 1/4

P(Y=A) = 1/2 P(Y=B) = 1/4 P(Y=C) = 1/8 P(Y=D) = 1/8

Entropy

Entropy H(Y) of a random variable Y:
H(Y) = − Σ_y P(Y = y) log₂ P(Y = y)

More uncertainty, more entropy!

Information Theory interpretation: H(Y) is the expected number of bits needed to encode a randomly drawn value of Y (under the most efficient code)

[Figure: entropy of a coin flip as a function of the probability of heads, peaking at 1 bit for a fair coin.]

High, Low Entropy

• "High entropy"
  – Y is from a uniform-like distribution
  – Flat histogram
  – Values sampled from it are less predictable

• "Low entropy"
  – Y is from a varied (peaks and valleys) distribution
  – Histogram has many lows and highs
  – Values sampled from it are more predictable

(Slide from Vibhav Gogate)

Entropy Example

X1 X2 Y
T  T  T
T  F  T
T  T  T
T  F  T
F  T  T
F  F  F

P(Y=t) = 5/6, P(Y=f) = 1/6

H(Y) = − 5/6 log₂(5/6) − 1/6 log₂(1/6) ≈ 0.65
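This calculation as a minimal Python sketch (not part of the slides):

import math

def entropy(probs):
    """H = -sum p*log2(p), with the convention 0*log2(0) = 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([5/6, 1/6]))   # ~0.65, as on the slide
print(entropy([1/2, 1/2]))   # 1.0 bit: a fair coin is maximally uncertain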


Conditional Entropy

Conditional entropy H(Y | X) of a random variable Y conditioned on a random variable X:
H(Y | X) = Σ_x P(X = x) H(Y | X = x)

(Same table as in the entropy example.)

Split on X1: the X1 = t branch has Y = t: 4, Y = f: 0; the X1 = f branch has Y = t: 1, Y = f: 1.
P(X1 = t) = 4/6, P(X1 = f) = 2/6

Example:

H(Y | X1) = − 4/6 (1 log₂ 1 + 0 log₂ 0) − 2/6 (1/2 log₂ 1/2 + 1/2 log₂ 1/2) = 2/6 ≈ 0.33

Information gain

• Decrease in entropy (uncertainty) after splitting

(Same table as above.)

In our running example:

IG(X1) = H(Y) − H(Y | X1) = 0.65 − 0.33 = 0.32

IG(X1) > 0 → we prefer the split!
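The same numbers, computed in a minimal Python sketch (record layout as in the earlier sketch; illustrative only):

import math
from collections import Counter

def entropy(labels):
    """H(Y) from a list of labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(records, attribute_index):
    """IG(X) = H(Y) - H(Y | X) for a split on one attribute column."""
    labels = [row[-1] for row in records]
    gain = entropy(labels)
    for value in {row[attribute_index] for row in records}:
        branch = [row[-1] for row in records if row[attribute_index] == value]
        gain -= (len(branch) / len(records)) * entropy(branch)
    return gain

# The six (X1, X2, Y) records of the running example.
records = [('T','T','T'), ('T','F','T'), ('T','T','T'),
           ('T','F','T'), ('F','T','T'), ('F','F','F')]
print(information_gain(records, 0))   # IG(X1) ~ 0.32
print(information_gain(records, 1))   # IG(X2) ~ 0.19, so X1 is the better split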

Learning decision trees

• Start from an empty decision tree
• Split on the next best attribute (feature)
  – Use, for example, information gain to select the attribute
• Recurse

Look at all the information gains…

Suppose we want to predict MPG

When to stop?

First split looks good! But, when do we stop?

Base Case One

Don't split a node if all matching records have the same output value

Base Case Two

Don't split a node if the data points are identical on the remaining attributes

Base Cases: An Idea

• Base Case One: If all records in the current data subset have the same output, then don't recurse

• Base Case Two: If all records have exactly the same set of input attributes, then don't recurse

Proposed Base Case 3: If all attributes have small information gain, then don't recurse

• This is not a good idea

The problem with proposed case 3

a b y
0 0 0
0 1 1
1 0 1
1 1 0

y = a XOR b

The information gains: IG(a) = IG(b) = 0, so proposed case 3 would refuse to split at the root.

If we omit proposed case 3:

(Same XOR table as above.)

y = a XOR b

[Figure: the resulting decision tree splits on one attribute and then the other, classifying every example correctly.]

Instead, perform pruning after building a tree

Non-Boolean Features

• Real-valued features?

Real → threshold

[Figure: a node testing WEIGHT > threshold, with No and Yes branches.]

• Number of thresholds ≤ number of different values in the dataset

• Can choose the threshold based on information gain (see the sketch below)
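A minimal sketch of threshold selection by information gain (the (features-dict, label) records and the made-up weights are illustrative assumptions):

import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def split_gain(records, predicate):
    """Information gain of the binary split that `predicate` induces."""
    labels = [y for _, y in records]
    gain = entropy(labels)
    for side in (True, False):
        branch = [y for x, y in records if predicate(x) == side]
        if branch:
            gain -= (len(branch) / len(records)) * entropy(branch)
    return gain

def best_threshold(records, feature):
    """Try midpoints between consecutive distinct values; keep the most informative one."""
    values = sorted({x[feature] for x, _ in records})
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
    return max(candidates, key=lambda t: split_gain(records, lambda x: x[feature] > t))

# Hypothetical weights with good/bad labels, just to exercise the code.
data = [({'weight': 2100}, 'good'), ({'weight': 2400}, 'good'),
        ({'weight': 3300}, 'bad'),  ({'weight': 4100}, 'bad')]
print(best_threshold(data, 'weight'))   # -> 2850.0, separating the two classes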

Summary: Building Decision Trees

BuildTree(DataSet, Output)
• If all output values are the same in DataSet, return a leaf node that says "predict this unique output"
• If all input values are the same, return a leaf node that says "predict the majority output"
• Else find the attribute X with the highest InfoGain
• Suppose X has nX distinct values (i.e. X has arity nX).
  – Create a non-leaf node with nX children.
  – The i-th child should be built by calling BuildTree(DSi, Output), where DSi contains the records in DataSet with X = the i-th value of X.
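A minimal, self-contained Python sketch of this recursion (the (features-dict, label) record format is an illustrative assumption; this is not the course's reference implementation):

import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(records, attribute):
    labels = [y for _, y in records]
    gain = entropy(labels)
    for value in {x[attribute] for x, _ in records}:
        branch = [y for x, y in records if x[attribute] == value]
        gain -= (len(branch) / len(records)) * entropy(branch)
    return gain

def majority(records):
    return Counter(y for _, y in records).most_common(1)[0][0]

def build_tree(records, attributes):
    labels = {y for _, y in records}
    # Base case one: all outputs are the same -> predict that unique output.
    if len(labels) == 1:
        return ('leaf', labels.pop())
    # Base case two (simplified): nothing left to split on -> predict the majority output.
    if not attributes:
        return ('leaf', majority(records))
    # Greedy step: split on the attribute X with the highest information gain.
    best = max(attributes, key=lambda a: info_gain(records, a))
    children = {}
    for value in {x[best] for x, _ in records}:      # one child per observed value of X
        subset = [(x, y) for x, y in records if x[best] == value]
        children[value] = build_tree(subset, [a for a in attributes if a != best])
    return ('node', best, children)

# Tiny example in the spirit of the MPG table (labels made up for illustration).
data = [({'cylinders': 4, 'maker': 'asia'},    'good'),
        ({'cylinders': 4, 'maker': 'america'}, 'good'),
        ({'cylinders': 6, 'maker': 'america'}, 'bad'),
        ({'cylinders': 8, 'maker': 'america'}, 'bad')]
print(build_tree(data, ['cylinders', 'maker']))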

Machine Space Search

• ID3 / C4.5 / CART search for a succinct tree that perfectly fits the data.

• They are not going to find it in general (NP-hard)

• Entropy-guided splitting is a well-performing heuristic; others exist.

• Why should we search for a small tree at all?

MPG Test set error

The test set error is much worse than the training set error…

…why?

Decision trees will overfit

Fitting a polynomial

t = sin(2πx) + ε

Regression using a polynomial of degree M

[Figures: polynomial fits of increasing degree M to data sampled from t = sin(2πx) + ε; low-degree fits underfit, and high-degree fits overfit the noise. Figures from Pattern Recognition and Machine Learning, Bishop.]

Overfitting

• Precise characterization – statistical learning theory
• Special techniques to prevent overfitting in DT learning

• Pruning the tree, e.g. "reduced-error" pruning. Do until further pruning is harmful:
  1. Evaluate the impact on a validation (test) set of pruning each possible node (and its subtree)
  2. Greedily remove the one that most improves validation (test) error
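This pruning loop can be sketched as follows; it reuses the ('node'/'leaf') tuples of the earlier sketches and assumes every attribute value appearing in the validation set also appears in the tree (all names and the record format are illustrative assumptions):

from collections import Counter

def classify(tree, x):
    """Follow branches of a ('node', attr, children) / ('leaf', label) tree."""
    while tree[0] != 'leaf':
        tree = tree[2][x[tree[1]]]
    return tree[1]

def accuracy(tree, records):
    return sum(classify(tree, x) == y for x, y in records) / len(records)

def internal_paths(tree, path=()):
    """Paths (sequences of branch values) leading to every internal node."""
    if tree[0] == 'node':
        yield path
        for value, child in tree[2].items():
            yield from internal_paths(child, path + (value,))

def node_records(tree, path, records):
    """The training records routed to the node at `path`."""
    for value in path:
        records = [(x, y) for x, y in records if x[tree[1]] == value]
        tree = tree[2][value]
    return records

def replaced(tree, path, new):
    """A copy of `tree` with the subtree at `path` swapped for `new`."""
    if not path:
        return new
    children = dict(tree[2])
    children[path[0]] = replaced(children[path[0]], path[1:], new)
    return ('node', tree[1], children)

def reduced_error_prune(tree, train, validation):
    """Do until further pruning is harmful: greedily replace the internal node
    whose collapse into a majority leaf most improves validation accuracy."""
    while True:
        best_tree, best_acc = None, accuracy(tree, validation)
        for path in internal_paths(tree):
            reached = node_records(tree, path, train)
            if not reached:                       # no training data reaches this node: skip
                continue
            leaf = ('leaf', Counter(y for _, y in reached).most_common(1)[0][0])
            candidate = replaced(tree, path, leaf)
            if accuracy(candidate, validation) > best_acc:
                best_tree, best_acc = candidate, accuracy(candidate, validation)
        if best_tree is None:                     # further pruning is harmful
            return tree
        tree = best_tree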

Concepts we have encountered (and will appear again in the course!)

• Greedy algorithms.

• Trying to find succinct machines.

• Computational efficiency of an algorithm (running time).

• Overfitting.

Questions

• Why are smaller trees preferable to larger trees (in theory)?

• When can a tree generalize to unseen examples, and how well?

• How many examples are needed to build a tree that generalizes well?

• How to identify and prevent overfitting?

• Are there other natural classification machines, and how do they compare?

Need to reason more generally about "what learning means" – TBC on Thursday…