MLlecture1.ppt


Transcript of MLlecture1.ppt

Page 1: MLlecture1.ppt

© Jude Shavlik 2006, David Page 2007
CS 760 – Machine Learning (UW-Madison), Lecture #1, Slide 1

CS760 – Machine Learning

• Course Instructor: David Page
• email: [email protected]
• office: MSC 6743 (University & Charter)
• hours: TBA

• Teaching Assistant: Daniel Wong
• email: [email protected]
• office: TBA
• hours: TBA

Page 2: MLlecture1.ppt

Textbooks & Reading Assignment

• Machine Learning (Tom Mitchell)
• Selected on-line readings

• Read in Mitchell (posted on class web page)
  • Preface
  • Chapter 1
  • Sections 2.1 and 2.2
  • Chapter 8

Page 3: MLlecture1.ppt

Monday, Wednesday, and Friday?

• We’ll meet 30 times this term (may or may not include exam in this count)

• We’ll meet on FRIDAY this and next week, in order to cover material for HW 1 (plus I have some business travel this term)

• Default: we WILL meet on Friday unless I announce otherwise

Page 4: MLlecture1.ppt

Course "Style"

• Primarily algorithmic & experimental
• Some theory, both mathematical & conceptual (much on statistics)
• "Hands on" experience, interactive lectures/discussions
• Broad survey of many ML subfields, including
  • "symbolic" (rules, decision trees, ILP)
  • "connectionist" (neural nets)
  • support vector machines, nearest-neighbors
  • theoretical ("COLT")
  • statistical ("Bayes rule")
  • reinforcement learning, genetic algorithms

Page 5: MLlecture1.ppt

"MS vs. PhD" Aspects

• MS'ish topics
  • mature, ready for practical application
  • first 2/3 – 3/4 of semester
  • Naive Bayes, Nearest-Neighbors, Decision Trees, Neural Nets, Support Vector Machines, ensembles, experimental methodology (10-fold cross validation, t-tests)

• PhD'ish topics
  • inductive logic programming, statistical relational learning, reinforcement learning, SVMs, use of prior knowledge

• Other machine learning material covered in Bioinformatics CS 576/776, Jerry Zhu’s CS 838

Page 6: MLlecture1.ppt

Two Major Goals

• to understand what a learning system should do

• to understand how (and how well) existing systems work
  • Issues in algorithm design
  • Choosing algorithms for applications

Page 7: MLlecture1.ppt

Background Assumed

• Languages
  • Java (see CS 368 tutorial online)

• AI Topics
  • Search
  • FOPC
  • Unification
  • Formal Deduction

• Math
  • Calculus (partial derivatives)
  • Simple prob & stats

• No previous ML experience assumed (so some overlap with CS 540)

Page 8: MLlecture1.ppt

Requirements

• Bi-weekly programming HW's
  • "hands on" experience valuable
  • HW0 – build a dataset
  • HW1 – simple ML algo's and exper. methodology
  • HW2 – decision trees (?)
  • HW3 – neural nets (?)
  • HW4 – reinforcement learning (in a simulated world)

• "Midterm" exam (in class, about 90% through semester)

• Find project of your choosing
  • during last 4-5 weeks of class

Page 9: MLlecture1.ppt

Grading

HW's                 35%
"Midterm"            40%
Project              20%
Quality Discussion    5%

Page 10: MLlecture1.ppt

Late HW's Policy

• HW's due @ 4pm
• you have 5 late days to use over the semester
  • (Fri 4pm → Mon 4pm is 1 late "day")
• SAVE UP late days!
  • extensions only for extreme cases
• Penalty points after late days exhausted
• Can't be more than ONE WEEK late

Page 11: MLlecture1.ppt

Academic Misconduct (also on course homepage)

All examinations, programming assignments, and written homeworks must be done individually. Cheating and plagiarism will be dealt with in accordance with University procedures (see the Academic Misconduct Guide for Students). Hence, for example, code for programming assignments must not be developed in groups, nor should code be shared. You are encouraged to discuss ideas, approaches and techniques broadly with your peers, the TAs or the instructor, but not at a level of detail where specific implementation issues are described by anyone. If you have any questions on this, please ask the instructor before you act.

Page 12: MLlecture1.ppt

What Do You Think Learning Means?

Page 13: MLlecture1.ppt

What is Learning?

“Learning denotes changes in the system that … enable the system to do the same task … more effectively the next time.”
  – Herbert Simon

“Learning is making useful changes in our minds.”
  – Marvin Minsky

Page 14: MLlecture1.ppt

Today’s Topics

• Memorization as Learning
• Feature Space
• Supervised ML
• K-NN (K-Nearest Neighbor)

Page 15: MLlecture1.ppt

Memorization (Rote Learning)

• Employed by first machine learning systems, in 1950s
  • Samuel’s Checkers program
  • Michie’s MENACE: Matchbox Educable Noughts and Crosses Engine

• Prior to these, some people believed computers could not improve at a task with experience

Page 16: MLlecture1.ppt

Rote Learning is Limited

• Memorize I/O pairs and perform exact matching with new inputs

• If computer has not seen precise case before, it cannot apply its experience

• Want computer to “generalize” from prior experience
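The limitation above can be seen in a minimal sketch of rote learning: store I/O pairs verbatim and answer only by exact lookup. The inputs and labels here are invented for illustration.

```python
# Rote learning: memorize (input, output) pairs in a lookup table and
# answer only by exact match -- no generalization at all.

def rote_learn(examples):
    """Memorize the training pairs verbatim."""
    return dict(examples)

def rote_predict(memory, x):
    """Exact matching only: unseen inputs get no answer."""
    return memory.get(x)  # None when this precise case was never seen

memory = rote_learn([((0, 0, 1), "+"), ((0, 0, 0), "-")])
print(rote_predict(memory, (0, 0, 1)))  # seen before -> "+"
print(rote_predict(memory, (0, 1, 0)))  # never seen -> None
```

The second query fails even though it differs from a stored case in a single feature, which is exactly why we want the learner to generalize.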

Page 17: MLlecture1.ppt

Some Settings in Which Learning May Help

• Given an input, what is appropriate response (output/action)?
  • Game playing – board state/move
  • Autonomous robots (e.g., driving a vehicle) – world state/action
  • Video game characters – state/action
  • Medical decision support – symptoms/treatment
  • Scientific discovery – data/hypothesis
  • Data mining – database/regularity

Page 18: MLlecture1.ppt

Broad Paradigms of Machine Learning

• Inducing Functions from I/O Pairs
  • Decision trees (e.g., Quinlan’s C4.5 [1993])
  • Connectionism / neural networks (e.g., backprop)
  • Nearest-neighbor methods
  • Genetic algorithms
  • SVM’s

• Learning without Feedback/Teacher (not in Mitchell’s textbook; covered in CS 776)
  • Conceptual clustering
  • Self-organizing systems
  • Discovery systems

Page 19: MLlecture1.ppt

IID (Completion of Lec #2)

• We are assuming examples are IID: independent and identically distributed

• E.g., we are ignoring temporal dependencies (covered in time-series learning)

• E.g., we assume the learner has no say in which examples it gets (covered in active learning)

Page 20: MLlecture1.ppt

Supervised Learning Task Overview

[Diagram:]
Real World → Feature Space, via Feature Selection (usually done by humans) [HW 0]
Feature Space → Concepts/Classes/Decisions, via Classification Rule Construction (done by learning algorithm) [HW 1-3]

Page 21: MLlecture1.ppt

Supervised Learning Task Overview (cont.)

• Note: mappings on previous slide are not necessarily 1-to-1
  • Bad for first mapping?
  • Good for the second (in fact, it’s the goal!)

Page 22: MLlecture1.ppt

Empirical Learning: Task Definition

• Given
  • A collection of positive examples of some concept/class/category (i.e., members of the class) and, possibly, a collection of the negative examples (i.e., non-members)

• Produce
  • A description that covers (includes) all/most of the positive examples and none/few of the negative examples (and, hopefully, properly categorizes most future examples!) ← The Key Point!

Note: one can easily extend this definition to handle more than two classes
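The "covers" notion above can be made concrete with a toy sketch: a candidate description (here, an arbitrary Boolean predicate over a feature vector, invented for illustration) covers an example when it includes it, and we score it by how many positives and negatives it covers.

```python
# Toy "covers" check for the empirical-learning task definition.
# A good description covers all/most positives and none/few negatives.

positives = [(1, 1), (1, 0)]   # members of the class (made-up data)
negatives = [(0, 0), (0, 1)]   # non-members (made-up data)

def description(x):
    """Hypothetical candidate description: first feature equals 1."""
    return x[0] == 1

covered_pos = sum(description(x) for x in positives)
covered_neg = sum(description(x) for x in negatives)
print(covered_pos, covered_neg)  # this description covers 2 of 2 +, 0 of 2 -
```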

Page 23: MLlecture1.ppt

Example

[Figure: Positive Examples on the left, Negative Examples on the right. How does this symbol classify?]

• Concept
  • Solid Red Circle in a (Regular?) Polygon
• What about?
  • Figures on left side of page
  • Figures drawn before 5pm 2/2/89 <etc>

Page 24: MLlecture1.ppt

Concept Learning

Learning systems differ in how they represent concepts:

[Diagram: Training Examples are fed to different learners, each producing its own representation]
• Backpropagation → Neural Net
• C4.5, CART → Decision Tree
• AQ, FOIL → Rules (Φ ← X ∧ Y; Φ ← Z)
• SVMs → If 5x1 + 9x2 – 3x3 > 12 Then +
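The last representation on this slide is a linear threshold unit, and it is simple enough to sketch directly; the sample inputs below are invented for illustration.

```python
# The slide's linear-threshold concept: classify "+" exactly when
# 5*x1 + 9*x2 - 3*x3 > 12, else "-".

def linear_threshold(x, weights=(5, 9, -3), threshold=12):
    """Weighted sum of the features compared against a threshold."""
    score = sum(w * xi for w, xi in zip(weights, x))
    return "+" if score > threshold else "-"

print(linear_threshold((1, 1, 0)))  # 5 + 9 = 14 > 12  -> "+"
print(linear_threshold((1, 0, 1)))  # 5 - 3 = 2 <= 12  -> "-"
```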

Page 25: MLlecture1.ppt

Feature Space

If examples are described in terms of values of features, they can be plotted as points in an N-dimensional space.

[Figure: axes Size, Color, and Weight; one example plotted at <Big, Gray, 2500>, marked “?”]

A “concept” is then a (possibly disjoint) volume in this space.

Page 26: MLlecture1.ppt

Learning from Labeled Examples

• Most common and successful form of ML

[Venn diagram: + and – labeled points scattered in a feature space]

• Examples – points in a multi-dimensional “feature space”
• Concepts – “function” that labels every point in feature space (as +, –, and possibly ?)

Page 27: MLlecture1.ppt

Brief Review

• Conjunctive (“and”) Concept
  • Color(?obj1, red) ∧ Size(?obj1, large)

• Disjunctive (“or”) Concept
  • Color(?obj2, blue) ∨ Size(?obj2, small)

• More formally, a “concept” over instances is of the form
  • ∀x ∀y ∀z F(x, y, z) → Member(x, Class1)

Page 28: MLlecture1.ppt

Empirical Learning and Venn Diagrams

[Venn diagram: within a Feature Space full of – points, two regions A and B contain the + points]

Concept = A or B (Disjunctive concept)
Examples = labeled points in feature space
Concept = a label for a set of points

Page 29: MLlecture1.ppt

Aspects of an ML System

• “Language” for representing classified examples [HW 0]
• “Language” for representing “Concepts”
• Technique for producing concept “consistent” with the training examples [other HW’s]
• Technique for classifying new instance

Each of these limits the expressiveness/efficiency of the supervised learning algorithm.

Page 30: MLlecture1.ppt

Nearest-Neighbor Algorithms
(aka exemplar models, instance-based learning (IBL), case-based learning)

• Learning ≈ memorize training examples
• Problem solving = find most similar example in memory; output its category

[Venn diagram: + and – points with a ? query point; the induced decision boundaries are “Voronoi Diagrams” (pg 233)]

Page 31: MLlecture1.ppt

Simple Example: 1-NN
(1-NN ≡ one nearest neighbor)

Training Set
1. a=0, b=0, c=1  +
2. a=0, b=0, c=0  –
3. a=1, b=1, c=1  –

Test Example
• a=0, b=1, c=0  ?

“Hamming Distance” to each training example:
• Ex 1 = 2
• Ex 2 = 1
• Ex 3 = 2

So output – (the label of Ex 2, the nearest neighbor)
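The slide’s computation can be sketched directly: Hamming distance from the test example to each training example, then output the label of the closest one.

```python
# 1-NN with Hamming distance, using the slide's three training examples.

def hamming(u, v):
    """Number of positions where the two Boolean vectors differ."""
    return sum(a != b for a, b in zip(u, v))

train = [((0, 0, 1), "+"),   # Ex 1
         ((0, 0, 0), "-"),   # Ex 2
         ((1, 1, 1), "-")]   # Ex 3
test = (0, 1, 0)

dists = [hamming(x, test) for x, _ in train]
print(dists)  # [2, 1, 2], matching the slide

nearest_label = min(train, key=lambda ex: hamming(ex[0], test))[1]
print(nearest_label)  # "-", since Ex 2 is closest
```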

Page 32: MLlecture1.ppt

Sample Experimental Results (see UCI archive for more)

Testbed             Testset Correctness
                    1-NN   D-Trees   Neural Nets
Wisconsin Cancer     98%     95%        96%
Heart Disease        78%     76%         ?
Tumor                37%     38%         ?
Appendicitis         83%     85%        86%

Simple algorithm works quite well!

Page 33: MLlecture1.ppt

K-NN Algorithm

Collect K nearest neighbors, select majority classification (or somehow combine their classes)

• What should K be?
  • It probably is problem dependent
  • Can use tuning sets (later) to select a good setting for K

[Figure: tuning-set error rate plotted against K = 1, 2, 3, 4, 5; one shouldn’t really “connect the dots” (Why?)]
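The majority-vote step described above can be sketched as follows, reusing Hamming distance and the toy dataset from the 1-NN slide.

```python
# K-NN with majority vote over the K closest training examples.

from collections import Counter

def hamming(u, v):
    """Number of positions where the two Boolean vectors differ."""
    return sum(a != b for a, b in zip(u, v))

def knn_predict(train, x, k):
    """Sort by distance to x, take the K closest, return the majority label."""
    neighbors = sorted(train, key=lambda ex: hamming(ex[0], x))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

train = [((0, 0, 1), "+"), ((0, 0, 0), "-"), ((1, 1, 1), "-")]
print(knn_predict(train, (0, 1, 0), k=1))  # "-" (same as 1-NN)
print(knn_predict(train, (0, 1, 0), k=3))  # "-" (two of three neighbors vote -)
```

Selecting k itself would be done by measuring error on a held-out tuning set, as the slide suggests.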

Page 34: MLlecture1.ppt

Data Representation

• Creating a dataset of fixed-length feature vectors

• HW0 out on-line
  • Due next Friday

• Be sure to include – on separate 8x11 sheet – a photo and a brief bio

Page 35: MLlecture1.ppt

HW0 – Create Your Own Dataset (repeated from lecture #1)

• Think about before next class
  • Read HW0 (on-line)

• Google to find:
  • UCI archive (or UCI KDD archive)
  • UCI ML archive (UCI ML repository)
  • More links in HW0’s web page

Page 36: MLlecture1.ppt

HW0 – Your “Personal Concept”

• Step 1: Choose a Boolean (true/false) concept
  • Books I like/dislike; Movies I like/dislike; www pages I like/dislike
    • Subjective judgment (can’t articulate)
  • “time will tell” concepts
    • Stocks to buy
    • Medical treatment
    • at time t, predict outcome at time (t + ∆t)
  • Sensory interpretation
    • Face recognition (see textbook)
    • Handwritten digit recognition
    • Sound recognition
  • Hard-to-Program Functions

Page 37: MLlecture1.ppt

Some Real-World Examples

• Car Steering (Pomerleau, Thrun)
  digitized camera image → Learned Function → steering angle
• Medical Diagnosis (Quinlan)
  medical record (e.g., age=13, sex=M, wgt=18) → Learned Function → sick vs. healthy
• DNA Categorization
• TV-pilot rating
• Chemical-plant control
• Backgammon playing

Page 38: MLlecture1.ppt

HW0 – Your “Personal Concept”

• Step 2: Choosing a feature space (defines a space)
  • We will use fixed-length feature vectors
    • Choose N features
    • Each feature has Vi possible values
    • Each example is represented by a vector of N feature values (i.e., is a point in the feature space)
      e.g.: <red, 50, round> (color, weight, shape)
  • Feature Types (in HW0 we will use a subset – see next slide)
    • Boolean
    • Nominal
    • Ordered
    • Hierarchical

• Step 3: Collect examples (“I/O” pairs)
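The slide’s example point <red, 50, round> can be written out as a fixed-length vector; the feature names and helper function below are invented for illustration.

```python
# A fixed-length feature vector: every example is a length-N tuple,
# one value per feature, i.e., a point in the feature space.

FEATURES = ("color", "weight", "shape")  # N = 3 features (hypothetical names)

def make_example(color, weight, shape):
    """Build one example as a fixed-length tuple of feature values."""
    return (color, weight, shape)

ex = make_example("red", 50, "round")   # the slide's <red, 50, round>
print(len(ex) == len(FEATURES))         # True: one value per feature
```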

Page 39: MLlecture1.ppt

Standard Feature Types
(for representing training examples – a source of “domain knowledge”)

• Nominal
  • No relationship among possible values
    e.g., color є {red, blue, green} (vs. color = 1000 Hertz)

• Linear (or Ordered)
  • Possible values of the feature are totally ordered
    e.g., size є {small, medium, large} ← discrete
          weight є [0…500] ← continuous

• Hierarchical
  • Possible values are partially ordered in an ISA hierarchy
    e.g., for shape:
      closed → polygon, continuous
      polygon → triangle, square
      continuous → circle, ellipse

Page 40: MLlecture1.ppt

Our Feature Types (for CS 760 HW’s)

• Discrete
  • tokens (char strings, w/o quote marks and spaces)

• Continuous
  • numbers (int’s or float’s)
  • If only a few possible values (e.g., 0 & 1), use discrete

• i.e., merge nominal and discrete-ordered (or convert discrete-ordered into 1, 2, …)

• We will ignore hierarchical info and only use the leaf values (common approach)

Page 41: MLlecture1.ppt

Example Hierarchy (KDD* Journal, Vol 5, No. 1-2, 2001, page 17)

[Figure: Product → 99 Product Classes (e.g., Pet Foods) → 2302 Product Subclasses (e.g., Tea, Canned Cat Food, Dried Cat Food) → ~30k Products (e.g., Friskies Liver, 250g)]

• Structure of one feature!
• “the need to be able to incorporate hierarchical (knowledge about data types) is shown in every paper.”
  – From eds.’ intro to special issue (on applications) of KDD journal, Vol 15, 2001

* Officially, “Data Mining and Knowledge Discovery”, Kluwer Publishers

Page 42: MLlecture1.ppt

HW0: Creating Your Dataset

Ex: IMDB has a lot of data that are not discrete or continuous or binary-valued for target function (category)

[Schema diagram:]
• Studio –Made→ Movie; Studio: Name, Country, List of movies
• Director/Producer –Directed/Produced→ Movie; Director/Producer: Name, Year of birth, List of movies
• Actor –Acted in→ Movie; Actor: Name, Year of birth, Gender, Oscar nominations, List of movies
• Movie: Title, Genre, Year, Opening Wkend BO receipts, List of actors/actresses, Release season

Page 43: MLlecture1.ppt

HW0: Sample DB

Choose a Boolean or binary-valued target function (category)

• Opening weekend box-office receipts > $2 million

• Movie is drama? (action, sci-fi, …)

• Movies I like/dislike (e.g. Tivo)

Page 44: MLlecture1.ppt


HW0: Representing as a Fixed-Length Feature Vector

<discuss on chalkboard>

Note: some advanced ML approaches do not require such “feature mashing” (eg, ILP)

Page 45: MLlecture1.ppt


IMDB@umass

David Jensen’s group at UMass uses Naïve Bayes and other ML algo’s on the IMDB

• Opening weekend box-office receipts > $2 million
  • 25 attributes
  • Accuracy = 83.3%
  • Default accuracy = 56% (default algo?)

• Movie is drama?
  • 12 attributes
  • Accuracy = 71.9%
  • Default accuracy = 51%

http://kdl.cs.umass.edu/proximity/about.html

Page 46: MLlecture1.ppt


First Algorithm in Detail

• k-Nearest Neighbors / Instance-Based Learning (k-NN/IBL)
  • Distance functions
  • Kernel functions
  • Feature selection (applies to all ML algo’s)
  • IBL Summary

Chapter 8 of Mitchell

Page 47: MLlecture1.ppt


Some Common Jargon

• Classification: learning a discrete-valued function

• Regression: learning a real-valued function

IBL is easily extended to regression tasks (and to multi-category classification).
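As a concrete illustration (not from the slides), here is a minimal k-NN sketch covering both cases; all function names are hypothetical, and Euclidean distance is an illustrative choice:

```python
import math
from collections import Counter

def euclidean(a, b):
    # straight-line distance between two numeric feature vectors
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def k_nearest(train, query, k):
    # train: list of (feature_vector, output) pairs
    return sorted(train, key=lambda ex: euclidean(ex[0], query))[:k]

def knn_classify(train, query, k=3):
    # classification: majority vote among the k nearest neighbors
    votes = Counter(label for _, label in k_nearest(train, query, k))
    return votes.most_common(1)[0][0]

def knn_regress(train, query, k=3):
    # regression: mean output of the k nearest neighbors
    neighbors = k_nearest(train, query, k)
    return sum(y for _, y in neighbors) / len(neighbors)
```

The only change between the two tasks is how the neighbors’ outputs are combined: a vote for discrete outputs, an average for real ones.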

Page 48: MLlecture1.ppt


Variations on a Theme

• IB1 – keep all examples

• IB2 – keep the next instance if it is incorrectly classified by using the previous instances
  • Uses less storage (good)
  • Order dependent (bad)
  • Sensitive to noisy data (bad)

(From Aha, Kibler and Albert in ML Journal)
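A minimal sketch of the IB2 idea described above, assuming 1-NN classification against the stored set (function names are hypothetical):

```python
def ib2(examples, distance):
    # examples: an ordered stream of (features, label) pairs;
    # order dependence is one of IB2's known weaknesses
    store = [examples[0]]                      # always keep the first example
    for x, y in examples[1:]:
        nearest = min(store, key=lambda ex: distance(ex[0], x))
        if nearest[1] != y:                    # misclassified by the current store
            store.append((x, y))               # keep it; otherwise discard
    return store
```

A correctly classified example is never stored, which is why a noisy example near a class boundary can slip in and stay forever.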

Page 49: MLlecture1.ppt


Variations on a Theme (cont.)

• IB3 – extend IB2 to more intelligently decide which examples to keep (see article)
  • Better handling of noisy data

• Another idea – cluster into groups, keep one example from each (median/centroid)
  • Less storage, faster lookup

Page 50: MLlecture1.ppt


Distance Functions

• Key issue in IBL (instance-based learning)

• One approach: assign weights to each feature

Page 51: MLlecture1.ppt


Distance Functions (sample)

  d(e1, e2) = Σ_{i = 1}^{#features} w_i × d_i(e1, e2)

where
• d(e1, e2) is the distance between examples 1 and 2
• w_i is a numeric weighting factor
• d_i(e1, e2) is the distance between examples 1 and 2 on feature i only
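The weighted sum above can be sketched directly; the mixed numeric/nominal per-feature distance used here is an illustrative assumption, not something the slides prescribe:

```python
def mixed_dist(a, b):
    # per-feature distance d_i: absolute difference for numeric values,
    # 0/1 mismatch for nominal values (an illustrative choice)
    if isinstance(a, (int, float)) and isinstance(b, (int, float)):
        return abs(a - b)
    return 0.0 if a == b else 1.0

def weighted_distance(e1, e2, weights, per_feature_dist):
    # d(e1, e2) = sum over features i of w_i * d_i(e1, e2)
    return sum(w * per_feature_dist(a, b)
               for w, a, b in zip(weights, e1, e2))
```

Raising a feature’s weight makes mismatches on that feature dominate the total distance, which is how feature relevance enters IBL.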

Page 52: MLlecture1.ppt


Kernel Functions and k-NN

• Term “kernel” comes from statistics

• Major topic in support vector machines (SVMs)

• Weights the interaction between pairs of examples

Page 53: MLlecture1.ppt


Kernel Functions and k-NN (continued)

• Assume we have
  • k nearest neighbors e_1, ..., e_k
  • associated output categories O_1, ..., O_k

• Then the output for test case e_t is

  argmax_{c ∈ possible categories} Σ_{i = 1}^{k} K(e_i, e_t) × δ(O_i, c)

where δ is the kernel “delta” function: δ(O_i, c) = 1 if O_i = c, else 0.
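The argmax vote above can be transcribed almost literally (names here are hypothetical):

```python
def kernel_vote(neighbors, outputs, test, kernel, categories):
    # argmax over categories c of: sum_i K(e_i, e_t) * delta(O_i, c)
    def score(c):
        # the delta function is implemented by the `if o == c` filter
        return sum(kernel(e, test)
                   for e, o in zip(neighbors, outputs) if o == c)
    return max(categories, key=score)
```

With K = 1 this reduces to a simple majority vote; with K = 1/dist, closer neighbors count for more, which can flip the decision.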

Page 54: MLlecture1.ppt


Sample Kernel Functions K(e_i, e_t)

• K(e_i, e_t) = 1: simple majority vote (‘?’ classified as ‘-’)

• K(e_i, e_t) = 1 / dist(e_i, e_t): inverse-distance weighting (‘?’ could be classified as ‘+’)

In the slide’s diagram, example ‘?’ has three neighbors, two of which are ‘-’ and one of which is ‘+’.

Page 55: MLlecture1.ppt


Gaussian Kernel

• Heavily used in SVMs

  K(e_i, e_t) = e^(−||e_i − e_t||² / (2σ²))

where e is Euler’s number and σ is a width parameter.
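A direct transcription of the formula (sigma, the bandwidth, is a free parameter):

```python
import math

def gaussian_kernel(e_i, e_t, sigma=1.0):
    # K(e_i, e_t) = exp(-||e_i - e_t||^2 / (2 * sigma^2))
    sq_dist = sum((a - b) ** 2 for a, b in zip(e_i, e_t))
    return math.exp(-sq_dist / (2 * sigma ** 2))
```

The kernel is 1 when the two examples coincide and decays smoothly toward 0 as they move apart; smaller sigma makes the decay sharper.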

Page 56: MLlecture1.ppt


Local Learning

• Collect k nearest neighbors
• Give them to some supervised ML algo
• Apply learned model to test example

(Diagram: among the ‘+’ and ‘-’ training examples, the k examples nearest the test point ‘?’ are marked “Train on these”.)
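As an illustrative sketch (not from the slides), here the “supervised ML algo” is a least-squares line fit to the k nearest points of a 1-D regression problem; it assumes the selected neighbors do not all share the same x value:

```python
def local_linear_fit(train, x_query, k=3):
    # train: list of (x, y) pairs with scalar x
    # 1. collect the k nearest neighbors of the query point
    nearest = sorted(train, key=lambda p: abs(p[0] - x_query))[:k]
    # 2. "some supervised ML algo": least-squares fit of y = a*x + b
    n = len(nearest)
    sx = sum(x for x, _ in nearest)
    sy = sum(y for _, y in nearest)
    sxx = sum(x * x for x, _ in nearest)
    sxy = sum(x * y for x, y in nearest)
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - a * sx) / n
    # 3. apply the locally learned model to the test example
    return a * x_query + b
```

The model is thrown away after each query; only the neighborhood of the test point ever influences the fit.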

Page 57: MLlecture1.ppt


Instance-Based Learning (IBL) and Efficiency

• IBL algorithms postpone work from training to testing
  • Pure k-NN/IBL just memorizes the training data
  • Sometimes called lazy learning

• Computationally intensive
  • Match all features of all training examples

Page 58: MLlecture1.ppt


Instance-Based Learning (IBL) and Efficiency

• Possible speed-ups
  • Use a subset of the training examples (Aha)
  • Use clever data structures (A. Moore)
    • KD trees, hash tables, Voronoi diagrams
  • Use a subset of the features

Page 59: MLlecture1.ppt


Number of Features and Performance

• Too many features can hurt test-set performance

• Too many irrelevant features mean many spurious correlation possibilities for an ML algorithm to detect

• “Curse of dimensionality”

Page 60: MLlecture1.ppt


Feature Selection and ML (general issue for ML)

Filtering-based feature selection:
  all features → FS algorithm → subset of features → ML algorithm → model

Wrapper-based feature selection:
  all features → FS algorithm → model
  (the FS algorithm calls the ML algorithm many times and uses it to help select features)

Page 61: MLlecture1.ppt


Feature Selection as a Search Problem

• State = set of features

• Start state = empty (forward selection) or full (backward selection)

• Goal test = highest-scoring state

• Operators
  • add/subtract features

• Scoring function
  • accuracy on the training (or tuning) set of the ML algorithm using this state’s feature set
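The forward-selection version of this search can be sketched as a greedy hill-climb; the `score` callback, standing in for “train the ML algorithm on this feature subset and measure tuning-set accuracy,” is hypothetical:

```python
def forward_selection(all_features, score):
    # score(frozenset_of_features) -> tuning-set accuracy (hypothetical callback)
    current, best = set(), score(frozenset())  # start state: empty feature set
    improved = True
    while improved:
        improved = False
        for f in all_features - current:       # operators: add one feature
            s = score(frozenset(current | {f}))
            if s > best:                       # hill-climbing: track the best move
                best, choice, improved = s, f, True
        if improved:
            current.add(choice)                # take the single best addition
    return current, best
```

Backward selection is the mirror image: start from the full set and greedily try subtracting one feature at a time.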

Page 62: MLlecture1.ppt


Forward and Backward Selection of Features

• Hill-climbing (“greedy”) search over states (sets of features to use), scored by accuracy on the tuning set (our heuristic function)

Forward: start with {} (50%); adding single features gives, e.g., {F1} (62%), ..., {FN} (71%); continue adding features to the best state.

Backward: start with {F1, F2, ..., FN} (73%); subtracting single features gives, e.g., {F2, ..., FN} (79%) by subtracting F1, and so on; continue subtracting from the best state.

Page 63: MLlecture1.ppt


Forward vs. Backward Feature Selection

Forward
• Faster in early steps because fewer features to test
• Fast for choosing a small subset of the features
• Misses useful features whose usefulness requires other features (feature synergy)

Backward
• Fast for choosing all but a small subset of the features
• Preserves useful features whose usefulness requires other features
  • Example: area is important; the features are length and width

Page 64: MLlecture1.ppt


Some Comments on k-NN

Positive
• Easy to implement
• Good “baseline” algorithm / experimental control
• Incremental learning is easy
• Psychologically plausible model of human memory

Negative
• No insight into domain (no explicit model)
• Choice of distance function is problematic
• Doesn’t exploit/notice structure in examples

Page 65: MLlecture1.ppt


Questions about IBL (Breiman et al. - CART book)

• Computationally expensive to save all examples; slow classification of new examples
  • Addressed by IB2/IB3 of Aha et al. and work of A. Moore (CMU; now Google)

• Is this really a problem?

Page 66: MLlecture1.ppt


Questions about IBL (Breiman et al. - CART book)

• Intolerant of noise
  • Addressed by IB3 of Aha et al.
  • Addressed by the k-NN version
  • Addressed by feature selection - can discard the noisy feature

• Intolerant of irrelevant features
  • Since the algorithm is very fast, one can experimentally choose good feature sets (Kohavi, Ph.D. – now at Amazon)

Page 67: MLlecture1.ppt


More IBL Criticisms

• High sensitivity to the choice of similarity (distance) function
  • Euclidean distance might not be the best choice

• Handling non-numeric features and missing feature values is not natural, but doable
  • How might we do this? (Part of HW1)

• No insight into the task (learned concept not interpretable)

Page 68: MLlecture1.ppt


Summary

• IBL can be a very effective machine learning algorithm

• Good “baseline” for experiments