Transcript of CS760 – Machine Learning lecture slides (UW-Madison)
© Jude Shavlik 2006, David Page 2010

Page 1: CS760 – Machine Learning

• Course Instructor: David Page
  • email: [email protected]
  • office: MSC 6743 (University & Charter)
  • hours: 1pm Tuesdays and Fridays

• Teaching Assistant: Nathanael Fillmore
  • email: [email protected]
  • office: CS 3379
  • hours: 8:50am Mondays

Page 2: Textbooks & Reading Assignment
• Machine Learning (Tom Mitchell)
• Selected on-line readings
• Read in Mitchell:

  • Preface
  • Chapter 1
  • Sections 2.1 and 2.2
  • Chapter 8
  • Chapter 3 (for next lecture)

Page 3: Monday, Wednesday, and Friday?

• We’ll meet 30 times this term (may or may not include exam in this count)
• We’ll meet on FRIDAY this and next week, in order to cover material for HW 1 (plus I have some business travel this term)
• Default: we will NOT meet on Fridays unless I announce it (at least one week’s notice)

Page 4: Course "Style"

• Primarily algorithmic & experimental
• Some theory, both mathematical & conceptual (much on statistics)
• "Hands on" experience, interactive lectures/discussions
• Broad survey of many ML subfields, including
  • "symbolic" (rules, decision trees, ILP)
  • "connectionist" (neural nets)
  • support vector machines, nearest-neighbors
  • theoretical ("COLT")
  • statistical ("Bayes rule")
  • reinforcement learning, genetic algorithms

Page 5: Two Major Goals

• to understand what a learning system should do
• to understand how (and how well) existing systems work
  • Issues in algorithm design
  • Choosing algorithms for applications

Page 6: Background Assumed

• Languages
  • Java (see CS 368 tutorial online); C or C++ are OK
• AI Topics
  • Search
  • FOPC
  • Unification
  • Formal Deduction
• Math
  • Calculus (partial derivatives)
  • Simple prob & stats
• No previous ML experience assumed (so some overlap with CS 540)

Page 7: Requirements

• Some written and programming HWs
  • "hands on" experience valuable
  • HW0 – build a dataset
  • HW1 – experimental methodology
  • I’m updating the website as we go, so please wait for me to assign HWs in class
• "Midterm" exam (in class, about 90% through the semester)
• Find project of your choosing
  • during last 4-5 weeks of class

Page 8: Grading

HW's      35%
Exam      40%
Project   25%

Page 9: Late HW's Policy

• HW's due @ 2:30pm
• You have 5 late days to use over the semester
  • (Fri 4pm → Mon 4pm is 1 late "day")
• SAVE UP late days!
  • extensions only for extreme cases
• Penalty points after late days exhausted
• Can't be more than ONE WEEK late

Page 10: Academic Misconduct (also on course homepage)

All examinations, programming assignments, and written homeworks must be done individually. Cheating and plagiarism will be dealt with in accordance with University procedures (see the Academic Misconduct Guide for Students). Hence, for example, code for programming assignments must not be developed in groups, nor should code be shared. You are encouraged to discuss ideas, approaches, and techniques broadly with your peers, the TAs, or the instructor, but not at a level of detail where specific implementation issues are described by anyone. If you have any questions on this, please ask the instructor before you act.

Page 11: A Few Examples of Machine Learning
• Movie recommender (Netflix prize… ensembles)
• Your spam filter (probably naïve Bayes)
• Google, Microsoft, and Yahoo
• Predictive models for medicine (e.g., see news on Health Discovery Corporation and SVMs)
• Wall Street (e.g., Rebellion Research)
• Speech recognition (hidden Markov models) and natural language translation
• Identifying the proteins of an organism from its genome (also using HMMs… see CS/BMI 576)
• Many examples in scientific data analysis…

Page 12: Some Quotes
(taken from P. Domingos’ ML class notes at U-Washington)
• "A breakthrough in machine learning would be worth ten Microsofts"
  – Bill Gates, Chairman, Microsoft
• "Machine learning is the next Internet"
  – Tony Tether, previous Director, DARPA
• "Machine learning is the hot new thing"
  – John Hennessy, President, Stanford
• "Web rankings today are mostly a matter of machine learning"
  – Prabhakar Raghavan, Director of Research, Yahoo
• "Machine learning is going to result in a real revolution"
  – Greg Papadopoulos, CTO, Sun
• "Machine learning is today’s discontinuity"
  – Jerry Yang, founder and former CEO, Yahoo

Page 13: What Do You Think Learning Means?

Page 14: What is Learning?

"Learning denotes changes in the system that … enable the system to do the same task … more effectively the next time."
  – Herbert Simon

"Learning is making useful changes in our minds."
  – Marvin Minsky

Page 15: Today’s Topics
• Memorization as Learning
• Feature Space
• Supervised ML
• K-NN (K-Nearest Neighbor)

Page 16: Memorization (Rote Learning)
• Employed by the first machine learning systems, in the 1950s
  • Samuel’s Checkers program
  • Michie’s MENACE: Matchbox Educable Noughts and Crosses Engine
• Prior to these, some people believed computers could not improve at a task with experience

Page 17: Rote Learning is Limited
• Memorize I/O pairs and perform exact matching with new inputs
• If the computer has not seen the precise case before, it cannot apply its experience
• Want the computer to "generalize" from prior experience

Page 18: Some Settings in Which Learning May Help
• Given an input, what is the appropriate response (output/action)?
  • Game playing – board state / move
  • Autonomous robots (e.g., driving a vehicle) – world state / action
  • Video game characters – state / action
  • Medical decision support – symptoms / treatment
  • Scientific discovery – data / hypothesis
  • Data mining – database / regularity

Page 19: Broad Paradigms of Machine Learning
• Inducing Functions from I/O Pairs
  • Decision trees (e.g., Quinlan’s C4.5 [1993])
  • Connectionism / neural networks (e.g., backprop)
  • Nearest-neighbor methods
  • Genetic algorithms
  • SVMs
• Learning without Feedback/Teacher
  • Conceptual clustering
  • Self-organizing systems
  • Discovery systems
(Not in Mitchell’s textbook – covered in CS 776)

Page 20: IID
• We are assuming examples are IID: independent and identically distributed
• E.g., we are ignoring temporal dependencies (covered in time-series learning)
• E.g., we assume the learner has no say in which examples it gets (covered in active learning)

Page 21: Supervised Learning Task Overview
[Diagram] Real World → Feature Selection (usually done by humans; HW 0) → Feature Space → Classification Rule Construction (done by the learning algorithm; HW 1-3) → Concepts/Classes/Decisions

Page 22: Empirical Learning: Task Definition
• Given
  • A collection of positive examples of some concept/class/category (i.e., members of the class) and, possibly, a collection of negative examples (i.e., non-members)
• Produce
  • A description that covers (includes) all/most of the positive examples and none/few of the negative examples
  • (and, hopefully, properly categorizes most future examples!)   ← The Key Point!
Note: one can easily extend this definition to handle more than two classes

Page 23: Example
[Figure: a set of positive example figures and a set of negative example figures; "How does this symbol classify?"]
• Concept
  • Solid Red Circle in a (Regular?) Polygon
• What about?
  • Figures on left side of page

Page 24: Concept Learning
Learning systems differ in how they represent concepts:
[Diagram] Training Examples feed into different learners, each producing a different concept representation:
  • Backpropagation → Neural Net
  • C4.5, CART → Decision Tree
  • AQ, FOIL → Rules (e.g., Φ <- X^Y;  Φ <- Z)
  • SVMs → numeric decision function (e.g., If 5x1 + 9x2 – 3x3 > 12 Then +)

Page 25: Feature Space
If examples are described in terms of values of features, they can be plotted as points in an N-dimensional space.
[Figure: a 3-D plot with axes Size, Color, and Weight; an unlabeled example ‘?’ plotted at (Big, Gray, 2500)]
A "concept" is then a (possibly disjoint) volume in this space.

Page 26: Learning from Labeled Examples
• Most common and successful form of ML
[Venn diagram: a cluster of ‘+’ points surrounded by ‘–’ points in feature space]
• Examples – points in a multi-dimensional "feature space"
• Concepts – a "function" that labels every point in feature space (as +, –, and possibly ?)

Page 27: Brief Review
• Conjunctive Concept
  • Color(?obj1, red) ^ Size(?obj1, large)   ("and")
• Disjunctive Concept
  • Color(?obj2, blue) v Size(?obj2, small)   ("or")
• More formally, a "concept" is of the form
  ∀x ∀y ∀z  F(x, y, z) -> Member(x, Class1)
  (x, y, z range over instances)

Page 28: Empirical Learning and Venn Diagrams
[Venn diagram in feature space: two regions A and B containing ‘+’ points, surrounded by ‘–’ points]
Concept = A or B (disjunctive concept)
Examples = labeled points in feature space
Concept = a label for a set of points

Page 29: Aspects of an ML System
• "Language" for representing classified examples   ← HW 0
• "Language" for representing "Concepts"
• Technique for producing a concept "consistent" with the training examples
• Technique for classifying new instances            ← other HW’s
Each of these limits the expressiveness/efficiency of the supervised learning algorithm.

Page 30: Nearest-Neighbor Algorithms
(aka. exemplar models, instance-based learning (IBL), case-based learning)
• Learning ≈ memorize training examples
• Problem solving = find most similar example in memory; output its category
[Venn diagram: a ‘?’ query point among stored ‘+’ and ‘–’ examples … "Voronoi Diagrams" (pg 233)]

Page 31: Simple Example: 1-NN
(1-NN ≡ one nearest neighbor)
Training Set
  1. a=0, b=0, c=1   +
  2. a=0, b=0, c=0   –
  3. a=1, b=1, c=1   –
Test Example
  • a=0, b=1, c=0   ?
"Hamming Distance" to the test example
  • Ex 1 = 2
  • Ex 2 = 1
  • Ex 3 = 2
So output –
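A minimal runnable sketch of the 1-NN classification above, using Hamming distance over the three Boolean features; the function and variable names are my own illustration, not course-provided code.

```python
# Minimal 1-NN with Hamming distance (illustrative sketch).

def hamming(x, y):
    """Number of features on which two examples disagree."""
    return sum(a != b for a, b in zip(x, y))

def one_nn(train, query):
    """Return the label of the training example closest to the query."""
    features, label = min(train, key=lambda ex: hamming(ex[0], query))
    return label

# Training set from the slide: (a, b, c) -> label
train = [((0, 0, 1), '+'),
         ((0, 0, 0), '-'),
         ((1, 1, 1), '-')]
query = (0, 1, 0)

print([hamming(f, query) for f, _ in train])  # [2, 1, 2], as on the slide
print(one_nn(train, query))                   # '-' (example 2 is nearest)
```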

Page 32: Sample Experimental Results (see UCI archive for more)

Testset correctness:
  Testbed            1-NN   D-Trees   Neural Nets
  Wisconsin Cancer   98%    95%       96%
  Heart Disease      78%    76%       ?
  Tumor              37%    38%       ?
  Appendicitis       83%    85%       86%

Simple algorithm works quite well!

Page 33: K-NN Algorithm

Collect the K nearest neighbors, select the majority classification (or somehow combine their classes)
• What should K be?
  • It probably is problem dependent
  • Can use tuning sets (later) to select a good setting for K
[Plot: tuning-set error rate vs. K = 1, 2, 3, 4, 5 – shouldn’t really "connect the dots" (Why?)]
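A small sketch of K-NN with majority voting, plus picking K on a held-out tuning set as the slide suggests; the helper names and the candidate K values are illustrative assumptions, not from the course materials.

```python
# k-NN with majority voting, and tuning-set selection of K (illustrative sketch).
from collections import Counter

def hamming(x, y):
    return sum(a != b for a, b in zip(x, y))

def knn_predict(train, query, k, dist=hamming):
    """Majority vote over the k training examples nearest to the query."""
    neighbors = sorted(train, key=lambda ex: dist(ex[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

def choose_k(train, tune, candidate_ks=(1, 2, 3, 4, 5)):
    """Pick the K with the lowest error rate on a separate tuning set."""
    def error(k):
        wrong = sum(knn_predict(train, x, k) != y for x, y in tune)
        return wrong / len(tune)
    return min(candidate_ks, key=error)
```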

Page 34: HW0 – Create Your Own Dataset (repeated from lecture #1)
• Think about it before next class
• Read HW0 (on-line)
• Google to find:
  • UCI archive (or UCI KDD archive)
  • UCI ML archive (UCI ML repository)
  • More links on HW0’s web page

Page 35: HW0 – Your "Personal Concept"
• Step 1: Choose a Boolean (true/false) concept
  • Books I like/dislike; Movies I like/dislike; www pages I like/dislike
    • Subjective judgment (can’t articulate)
  • "Time will tell" concepts
    • Stocks to buy
    • Medical treatment
      • at time t, predict outcome at time (t + ∆t)
  • Sensory interpretation
    • Face recognition (see textbook)
    • Handwritten digit recognition
    • Sound recognition
  • Hard-to-Program Functions

Page 36: Some Real-World Examples
• Car Steering (Pomerleau, Thrun)
  [Diagram: digitized camera image → Learned Function → steering angle]
• Medical Diagnosis (Quinlan)
  [Diagram: medical record (e.g., age=13, sex=M, wgt=18) → Learned Function → sick vs. healthy]
• DNA Categorization
• TV-pilot rating
• Chemical-plant control
• Backgammon playing

Page 37: HW0 – Your "Personal Concept"
• Step 2: Choosing a feature space
  • We will use fixed-length feature vectors   (this defines a space)
    • Choose N features
    • Each feature has Vi possible values
    • Each example is represented by a vector of N feature values (i.e., is a point in the feature space)
      e.g.: <red, 50, round>
            color  weight  shape
  • Feature Types   (in HW0 we will use a subset – see next slide)
    • Boolean
    • Nominal
    • Ordered
    • Hierarchical
• Step 3: Collect examples ("I/O" pairs)

Page 38: Standard Feature Types
(for representing training examples – a source of "domain knowledge")
• Nominal
  • No relationship among possible values
    e.g., color ∈ {red, blue, green}   (vs. color = 1000 Hertz)
• Linear (or Ordered)
  • Possible values of the feature are totally ordered
    e.g., size ∈ {small, medium, large}   ← discrete
          weight ∈ [0…500]               ← continuous
• Hierarchical
  • Possible values are partially ordered in an ISA hierarchy
    e.g., for shape:  closed → {polygon, continuous};  polygon → {triangle, square};  continuous → {circle, ellipse}

Page 39: Another View of Std Datasets – a Single Table (2-D array)

              Feature 1   Feature 2   . . .   Feature N   Output Category
  Example 1   0.0         small               red         true
  Example 2   9.3         medium              red         false
  Example 3   8.2         small               blue        false
  . . .
  Example M   5.7         medium              green       true

Page 40: Our Feature Types (for CS 760 HW’s)
• Discrete
  • tokens (char strings, w/o quote marks and spaces)
• Continuous
  • numbers (int’s or float’s)
  • If only a few possible values (e.g., 0 & 1), use discrete
• i.e., merge nominal and discrete-ordered (or convert discrete-ordered into 1, 2, …)
• We will ignore hierarchical info and only use the leaf values (common approach)

Page 41: HW0: Creating Your Dataset
Ex: IMDB has a lot of data that are not discrete or continuous or binary-valued for the target function (category)
[Schema diagram:
  Studio (Name, Country, List of movies) –Made→ Movie
  Director/Producer (Name, Year of birth, List of movies) –Directed / Produced→ Movie
  Actor (Name, Year of birth, Gender, Oscar nominations, List of movies) –Acted in→ Movie
  Movie (Title, Genre, Year, Opening Wkend BO receipts, List of actors/actresses, Release season)]

Page 42: HW0: Sample DB
Choose a Boolean or binary-valued target function (category)
• Opening weekend box-office receipts > $2 million
• Movie is drama? (action, sci-fi, …)
• Movies I like/dislike (e.g., TiVo)

Page 43: HW0: Representing as a Fixed-Length Feature Vector
<discuss on chalkboard>
Note: some advanced ML approaches do not require such "feature mashing" (e.g., ILP)

Page 44: IMDB@umass
David Jensen’s group at UMass uses Naïve Bayes and other ML algo’s on the IMDB
• Opening weekend box-office receipts > $2 million
  • 25 attributes
  • Accuracy = 83.3%
  • Default accuracy = 56%   (default algo?)
• Movie is drama?
  • 12 attributes
  • Accuracy = 71.9%
  • Default accuracy = 51%
http://kdl.cs.umass.edu/proximity/about.html

Page 45: From Earlier: Memorization
• Employed by the first machine learning systems, in the 1950s
  • Samuel’s Checkers program
  • Michie’s MENACE: Matchbox Educable Noughts and Crosses Engine
• Prior to these, some people believed computers could not improve at a task with experience

Pages 46–50 repeat the earlier slides "Rote Learning is Limited", "Nearest-Neighbor Algorithms", "Simple Example: 1-NN", "Sample Experimental Results", and "K-NN Algorithm" verbatim (see pages 17 and 30–33), shown again as a recap before the detailed treatment that follows.

Page 51: In More Detail

• K-Nearest Neighbors / Instance-Based Learning (k-NN/IBL)
• Distance functions
• Kernel functions
• Feature selection (applies to all ML algo’s)
• IBL Summary
(Chapter 8 of Mitchell)

Page 52: Some Common Jargon
• Classification
  • Learning a discrete-valued function
• Regression
  • Learning a real-valued function
(discrete vs. real outputs)
IBL is easily extended to regression tasks (and to multi-category classification)

Page 53: Variations on a Theme
(from Aha, Kibler and Albert in the ML Journal)
• IB1 – keep all examples
• IB2 – keep the next instance if it is incorrectly classified by using the previous instances
  • Uses less storage (good)
  • Order dependent (bad)
  • Sensitive to noisy data (bad)

Page 54: Variations on a Theme (cont.)
• IB3 – extend IB2 to more intelligently decide which examples to keep (see article)
  • Better handling of noisy data
• Another idea – cluster groups, keep one example from each (median/centroid)
  • Less storage, faster lookup

Page 55: Distance Functions
• Key issue in IBL (instance-based learning)
• One approach: assign weights to each feature

Page 56: Distance Functions (sample)

$$\mathrm{dist}(e_1, e_2) \;=\; \sum_{i=1}^{\#\text{features}} w_i \cdot d_i(e_1, e_2)$$

where dist(e1, e2) is the distance between examples 1 and 2, w_i is a numeric weighting factor for feature i, and d_i(e1, e2) is the distance between examples 1 and 2 on feature i only.
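A short sketch of this weighted distance; the per-feature distance rule (absolute difference for numbers, 0/1 mismatch otherwise) and the example weights are illustrative assumptions, not from the slides.

```python
# Weighted per-feature distance (illustrative sketch).

def feature_distance(a, b):
    """Per-feature distance: absolute difference for numbers, 0/1 mismatch otherwise."""
    if isinstance(a, (int, float)) and isinstance(b, (int, float)):
        return abs(a - b)
    return 0.0 if a == b else 1.0

def weighted_distance(e1, e2, weights):
    """dist(e1, e2) = sum over features i of w_i * d_i(e1, e2)."""
    return sum(w * feature_distance(a, b) for w, a, b in zip(weights, e1, e2))

# e.g., features = (color, weight, shape) as on the earlier HW0 slide
e1 = ('red', 50, 'round')
e2 = ('blue', 40, 'round')
print(weighted_distance(e1, e2, weights=(1.0, 0.1, 1.0)))  # 1.0 + 0.1*10 + 0.0 = 2.0
```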

Page 57: Kernel Functions and k-NN

• The term "kernel" comes from statistics
• Major topic in support vector machines (SVMs)
• Weights the interaction between pairs of examples

Page 58: Kernel Functions and k-NN (continued)
• Assume we have
  • the k nearest neighbors e1, ..., ek
  • their associated output categories O1, ..., Ok
• Then the output for test case et is

$$\mathrm{output}(e_t) \;=\; \operatorname*{argmax}_{c \,\in\, \text{possible categories}} \;\sum_{i=1}^{k} K(e_i, e_t) \cdot \delta(O_i, c)$$

where δ(Oi, c) is the kernel "delta" function (= 1 if Oi = c, else = 0) and K(ei, et) is the kernel function.
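A compact sketch of this kernel-weighted vote; `kernel_vote` and its arguments are illustrative names, and any kernel K(e_i, e_t) can be plugged in.

```python
# Kernel-weighted k-NN vote: argmax over categories of sum_i K(e_i, e_t) * delta(O_i, c).
from collections import defaultdict

def kernel_vote(neighbors, e_t, kernel):
    """neighbors: list of (example, category) pairs; kernel: function of (e_i, e_t)."""
    scores = defaultdict(float)
    for e_i, o_i in neighbors:
        scores[o_i] += kernel(e_i, e_t)   # the delta picks out o_i's bucket
    return max(scores, key=scores.get)

# With kernel = lambda e_i, e_t: 1.0 this reduces to a simple majority vote.
```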

Page 59: Sample Kernel Functions K(ei, et)
• K(ei, et) = 1
  → simple majority vote (‘?’ classified as –)
• K(ei, et) = 1 / dist(ei, et)
  → inverse distance weight (‘?’ could be classified as +)
[Diagram: example ‘?’ has three neighbors, two of which are ‘–’ and one of which is ‘+’, with the ‘+’ lying closest to ‘?’]
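A tiny worked version of that comparison; the three neighbor distances are made up to match the picture (the lone ‘+’ sits closest to ‘?’), so the numbers are illustrative only.

```python
# Compare simple-majority vs. inverse-distance kernels on the three neighbors of '?'.
# Made-up distances: the '+' neighbor is much closer than the two '-' neighbors.
neighbors = [('-', 4.0), ('-', 5.0), ('+', 1.0)]   # (category, distance to '?')

majority = {c: sum(1.0 for cat, _ in neighbors if cat == c) for c in {'+', '-'}}
inv_dist = {c: sum(1.0 / d for cat, d in neighbors if cat == c) for c in {'+', '-'}}

print(max(majority, key=majority.get))  # '-'  (2 votes vs. 1)
print(max(inv_dist, key=inv_dist.get))  # '+'  (1.0 vs. 0.25 + 0.2 = 0.45)
```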

Page 60: Gaussian Kernel

• Heavily used in SVMs

$$K(e_i, e_t) \;=\; e^{-\frac{\mathrm{dist}(e_i, e_t)^{2}}{2\sigma^{2}}}$$

where e is Euler’s number (the base of the natural logarithm) and σ controls the width of the kernel.
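A one-line implementation of this kernel as a function of an already-computed distance; sigma is a free width parameter you would have to choose or tune (an assumption, not a course-specified value).

```python
import math

def gaussian_kernel(dist, sigma=1.0):
    """K(e_i, e_t) = exp(-dist(e_i, e_t)^2 / (2 * sigma^2))."""
    return math.exp(-dist**2 / (2 * sigma**2))

print(gaussian_kernel(0.0))   # 1.0 for identical examples
print(gaussian_kernel(2.0))   # ~0.135; the weight falls off with distance
```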

Page 61: Local Learning
• Collect the k nearest neighbors
• Give them to some supervised ML algorithm
• Apply the learned model to the test example

[Diagram: a ‘?’ query point among ‘+’ and ‘–’ examples; the nearby examples are circled with the note "Train on these"]
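A compact sketch of local learning; the plug-in learner shown (majority class) is just a stand-in so the example runs, and in practice you would pass in a real learner such as a decision-tree or linear model. All names here are illustrative.

```python
# Local learning (illustrative sketch): fit a fresh model on the k nearest
# neighbors of each query, then classify the query with that local model.
from collections import Counter

def local_learn_predict(train, query, k, dist, fit, predict):
    """train: list of (features, label); fit/predict: any supervised learner."""
    local = sorted(train, key=lambda ex: dist(ex[0], query))[:k]
    model = fit(local)            # train only on the local neighborhood
    return predict(model, query)

# Trivial plug-in learner (majority class) so the sketch runs end to end.
fit = lambda data: Counter(label for _, label in data).most_common(1)[0][0]
predict = lambda model, query: model

train = [((0, 0), '-'), ((0, 1), '-'), ((5, 5), '+'), ((6, 5), '+'), ((5, 6), '+')]
dist = lambda x, y: sum((a - b) ** 2 for a, b in zip(x, y))
print(local_learn_predict(train, (5, 4), k=3, dist=dist, fit=fit, predict=predict))  # '+'
```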

Page 62: Instance-Based Learning (IBL) and Efficiency
• IBL algorithms postpone work from training to testing
  • Pure k-NN/IBL just memorizes the training data
  • Sometimes called lazy learning
• Computationally intensive
  • Match all features of all training examples

Page 63: Instance-Based Learning (IBL) and Efficiency
• Possible Speed-ups
  • Use a subset of the training examples (Aha)
  • Use clever data structures (A. Moore)
    • KD trees, hash tables, Voronoi diagrams
  • Use a subset of the features

Page 64: Where will kNN FAIL?
• Learning "Juntas" (Blum, Langley ’94)
  • Target concept is a function of a small subset of the features – the relevant features
  • Most features are irrelevant (not correlated with the relevant features)
• In this case, nearness for kNN is based mostly on irrelevant features

Page 65: Looking Ahead (Trees)
• The ML method we will discuss next time is Decision Tree learning
• Tree learners focus on choosing the most relevant features, so they address Junta-learning better
• They choose features one at a time, in a greedy fashion

Page 66: Looking Ahead (SVMs)
• Later we will cover support vector machines (SVMs)
• Like kNN, SVMs classify a new instance based on similarity to other instances, and use kernels to capture similarity
• But SVMs also assign intrinsic weights to examples (apart from distance)… "support vectors" have weight > 0

Page 67: Number of Features and Performance for ML
• Too many features can hurt test set performance
• Too many irrelevant features mean many spurious correlation possibilities for an ML algorithm to detect
• "Curse of dimensionality"
• kNN is especially susceptible

Page 68: Feature Selection and ML (general issue for ML)

Filtering-Based Feature Selection:
  all features → FS algorithm → subset of features → ML algorithm → model

Wrapper-Based Feature Selection:
  all features → FS algorithm → model
  (the FS algorithm calls the ML algorithm many times and uses it to help select features)

Page 69: Feature Selection as a Search Problem
• State = set of features
• Start state = empty (forward selection) or full (backward selection)
• Goal test = highest-scoring state
• Operators
  • add/subtract features
• Scoring function
  • accuracy on the training (or tuning) set of the ML algorithm using this state’s feature set

Page 70: Forward and Backward Selection of Features
• Hill-climbing ("greedy") search
[Diagram: each search state is a set of features to use, scored by accuracy on the tuning set (our heuristic function).
  Forward: start from {} (50%); operators add one feature, e.g., add F1 → {F1} (62%), add FN → {FN} (71%), …
  Backward: start from {F1, F2, …, FN} (73%); operators subtract one feature, e.g., subtract F1 → {F2, …, FN} (79%), …]
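A small sketch of the forward (greedy hill-climbing) variant of this search; `score` stands for the tuning-set accuracy of the ML algorithm trained with a given feature set, and all names are illustrative rather than course-provided.

```python
# Greedy forward feature selection (illustrative sketch).
def forward_selection(all_features, score):
    """score(feature_set) = tuning-set accuracy of the ML algorithm on that set."""
    selected = set()
    best = score(selected)
    while True:
        candidates = [(score(selected | {f}), f) for f in all_features - selected]
        if not candidates:
            return selected
        new_best, f = max(candidates, key=lambda sc_f: sc_f[0])
        if new_best <= best:          # no single addition helps: stop climbing
            return selected
        selected, best = selected | {f}, new_best
```

Backward selection is the mirror image: start from the full feature set and greedily remove the feature whose removal most improves the tuning-set score.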

Page 71: Forward vs. Backward Feature Selection

Forward
• Faster in early steps because fewer features to test
• Fast for choosing a small subset of the features
• Misses useful features whose usefulness requires other features (feature synergy)

Backward
• Fast for choosing all but a small subset of the features
• Preserves useful features whose usefulness requires other features
  • Example: area is important; features = length, width

Page 72: Some Comments on k-NN

Positive
• Easy to implement
• Good "baseline" algorithm / experimental control
• Incremental learning is easy
• Psychologically plausible model of human memory

Negative
• Led astray by irrelevant features
• No insight into the domain (no explicit model)
• Choice of distance function is problematic
• Doesn’t exploit/notice structure in examples

Page 73: Questions about IBL (Breiman et al. – CART book)
• Computationally expensive to save all examples; slow classification of new examples
  • Addressed by IB2/IB3 of Aha et al. and the work of A. Moore (CMU; now Google)
  • Is this really a problem?

Page 74: Questions about IBL (Breiman et al. – CART book)
• Intolerant of Noise
  • Addressed by IB3 of Aha et al.
  • Addressed by the k-NN version
  • Addressed by feature selection – can discard the noisy feature
• Intolerant of Irrelevant Features
  • Since the algorithm is very fast, one can experimentally choose good feature sets (Kohavi, Ph.D. – now at Amazon)

Page 75: More IBL Criticisms
• High sensitivity to the choice of similarity (distance) function
  • Euclidean distance might not be the best choice
• Handling non-numeric features and missing feature values is not natural, but doable
• No insight into the task (learned concept is not interpretable)

Page 76: Summary
• IBL can be a very effective machine learning algorithm
• Good "baseline" for experiments