
Introduction to Machine Learning

Brown University CSCI 1950-F, Spring 2012 Prof. Erik Sudderth

Lecture 3: Bayesian Learning, MAP & ML Estimation

Classification: Naïve Bayes

Many figures courtesy Kevin Murphy’s textbook, Machine Learning: A Probabilistic Perspective

Bayes Rule (Bayes Theorem)

θ: unknown parameters (many possible models)
D: observed data available for learning
p(θ): prior distribution (domain knowledge)
p(D | θ): likelihood function (measurement model)
p(θ | D): posterior distribution (learned information)

p(θ, D) = p(θ) p(D | θ) = p(D) p(θ | D)

p(θ | D) = p(θ, D) / p(D) = p(D | θ) p(θ) / ∑_{θ′∈Θ} p(D | θ′) p(θ′) ∝ p(D | θ) p(θ)
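As a minimal illustration of the discrete form of Bayes rule above, the Python sketch below normalizes prior-times-likelihood over a toy two-hypothesis space; the hypothesis names and all numeric values are made up for illustration.

```python
# Minimal sketch of discrete Bayes rule over a toy two-hypothesis space.
# Hypothesis names, prior weights, and likelihood values are illustrative.
prior = {"h1": 0.7, "h2": 0.3}        # p(theta)
likelihood = {"h1": 0.1, "h2": 0.4}   # p(D | theta) for the observed data D

# Unnormalized posterior: p(D | theta) * p(theta)
unnorm = {h: likelihood[h] * prior[h] for h in prior}

# Evidence: p(D) = sum over theta' of p(D | theta') p(theta')
evidence = sum(unnorm.values())

# Normalized posterior: p(theta | D)
posterior = {h: v / evidence for h, v in unnorm.items()}
print(posterior)   # {'h1': 0.368..., 'h2': 0.631...}
```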

The Number Game

• I am thinking of some arithmetical concept, such as:
  • Prime numbers
  • Numbers between 1 and 10
  • Even numbers
  • …
• I give you a series of randomly chosen positive examples from the chosen class
• Question: Are other test digits also in the class?

Tenenbaum 1999

Predictions of 20 Humans

[Figure: human generalization judgments after observing D = {16}, D = {16, 8, 2, 64}, and D = {16, 23, 19, 20}]

A Bayesian Model

Likelihood:
• Assume examples are sampled uniformly at random from all numbers that are consistent with the hypothesis
• Size principle: favors the smallest consistent hypotheses

Prior:
• Based on prior experience, some hypotheses are more probable ("natural") than others
  • Powers of 2
  • Powers of 2 except 32, plus 37
• Subjectivity: may depend on the observer's experience

What does the model predict for D = {5, 34, 2, 89, 1, 13}? (See the sketch below.)
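A short Python sketch of this number-game model, using the size-principle likelihood described above. The hypothesis space, prior weights, and the range 1–100 are illustrative assumptions, not the values used in Tenenbaum (1999).

```python
# Sketch of the number-game model over the integers 1..100.
# Hypotheses and prior weights are illustrative, not Tenenbaum's values.
hypotheses = {
    "even":        {n for n in range(1, 101) if n % 2 == 0},
    "odd":         {n for n in range(1, 101) if n % 2 == 1},
    "powers_of_2": {2 ** k for k in range(1, 7)},           # {2, 4, ..., 64}
    "mult_of_4":   {n for n in range(1, 101) if n % 4 == 0},
}
prior = {"even": 0.35, "odd": 0.35, "powers_of_2": 0.2, "mult_of_4": 0.1}

def likelihood(data, h):
    """Size principle: each example is drawn uniformly from the hypothesis,
    so p(D | h) = (1 / |h|)^N if all examples are consistent, else 0."""
    members = hypotheses[h]
    if not all(x in members for x in data):
        return 0.0
    return (1.0 / len(members)) ** len(data)

def posterior(data):
    unnorm = {h: likelihood(data, h) * prior[h] for h in hypotheses}
    z = sum(unnorm.values())
    return {h: v / z for h, v in unnorm.items()}

print(posterior([16]))            # mass spread over even / powers_of_2 / mult_of_4
print(posterior([16, 8, 2, 64]))  # powers_of_2 dominates: the size principle at work
```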

Posterior Distributions

[Figure: posterior over hypotheses, p(h | D), for D = {16} and for D = {16, 8, 2, 64}]

Posterior Estimation

• As the amount of data becomes large, weak conditions on the hypothesis space and measurement process imply that the posterior concentrates on a single hypothesis:
  p(h | D) → δ_ĥ(h)
• This ĥ is a maximum a posteriori (MAP) estimate:
  ĥ_MAP = argmax_h p(h | D) = argmax_h p(D | h) p(h)
• With a large amount of data, and/or an (almost) uniform prior, we approach the maximum likelihood (ML) estimate:
  ĥ_ML = argmax_h p(D | h)
• More theory to come later!
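A tiny sketch contrasting the two point estimates just defined on a discrete hypothesis space. The hypothesis names, prior weights, and likelihood values are illustrative placeholders, chosen to echo the "unnatural hypothesis" example from the number game.

```python
# Sketch: MAP vs. ML hypothesis selection on a discrete hypothesis space.
# All prior and likelihood numbers are illustrative placeholders.
prior      = {"even": 0.60, "powers_of_2": 0.35, "powers_of_2_plus_37": 0.05}   # p(h)
likelihood = {"even": 1e-7, "powers_of_2": 8e-4, "powers_of_2_plus_37": 1e-3}   # p(D | h)

# ML estimate: maximize the likelihood alone, ignoring the prior.
h_ml = max(likelihood, key=likelihood.get)

# MAP estimate: maximize likelihood * prior (proportional to the posterior).
h_map = max(prior, key=lambda h: likelihood[h] * prior[h])

print(h_ml)   # 'powers_of_2_plus_37': unnatural hypothesis with the largest likelihood
print(h_map)  # 'powers_of_2': the prior penalizes the unnatural hypothesis
```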

Posterior Predictions

• Suppose we want to predict the next number that will be revealed to us. One option is to plug in the MAP estimate ĥ:
  p(x | D) ≈ p(x | ĥ)
• But if we correctly apply Bayes rule to the specified model, we instead obtain the posterior predictive distribution:
  p(x | D) = ∑_{h∈H} p(x | h) p(h | D),   x ∈ {1, 2, 3, . . .}
• This is sometimes called Bayesian model averaging.
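A minimal sketch of the difference between the MAP plug-in prediction and Bayesian model averaging. The two hypotheses, the assumed posterior weights, and the test point are made-up values, and the per-hypothesis predictive is taken to be uniform over each hypothesis, as in the number game.

```python
# Sketch: MAP plug-in prediction vs. Bayesian model averaging.
# Hypotheses, posterior weights, and the test point are illustrative.
hypotheses = {
    "powers_of_2": {2, 4, 8, 16, 32, 64},
    "even":        set(range(2, 101, 2)),
}
posterior = {"powers_of_2": 0.8, "even": 0.2}    # assumed p(h | D)

def p_x_given_h(x, h):
    """Uniform predictive within each hypothesis: p(x | h) = 1/|h| if x is a member."""
    members = hypotheses[h]
    return 1.0 / len(members) if x in members else 0.0

def predictive(x):
    """Posterior predictive: p(x | D) = sum_h p(x | h) p(h | D)."""
    return sum(p_x_given_h(x, h) * posterior[h] for h in hypotheses)

h_map = max(posterior, key=posterior.get)
x = 40                          # even, but not a power of 2
print(p_x_given_h(x, h_map))    # MAP plug-in: 0.0 -- rules x out entirely
print(predictive(x))            # model average: 0.004 -- keeps some mass on x
```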

Posterior Predictive Distribution

Experiment: Human Judgments

Experiment: Bayesian Predictions

Machine Learning Problems

                 Supervised Learning                 Unsupervised Learning
Discrete         classification or categorization    clustering
Continuous       regression                          dimensionality reduction

Generative Classifiers

y: class label in {1, …, C}, observed in training
x ∈ X: observed features to be used for classification
θ: parameters indexing a family of models

p(y, x | θ) = p(y | θ) p(x | y, θ)
  p(y | θ): prior distribution
  p(x | y, θ): likelihood function

• Compute the class posterior distribution via Bayes rule:
  p(y = c | x, θ) = p(y = c | θ) p(x | y = c, θ) / ∑_{c′=1}^{C} p(y = c′ | θ) p(x | y = c′, θ)
• Inference: find the label distribution for some input example
• Classification: make a decision based on the inferred distribution
• Learning: estimate the parameters θ from labeled training data
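A minimal sketch of the inference step above: given a class prior and the per-class likelihoods of one observed feature vector, normalize prior times likelihood over the classes. All numeric values are illustrative.

```python
import numpy as np

# Sketch of inference in a generative classifier. The class prior and the
# per-class likelihoods of one observed x are illustrative numbers.
class_prior = np.array([0.5, 0.3, 0.2])       # p(y = c | theta), c = 1..C
lik_x_given_c = np.array([0.02, 0.10, 0.05])  # p(x | y = c, theta) for the observed x

# Bayes rule: posterior proportional to prior * likelihood, normalized over classes.
unnorm = class_prior * lik_x_given_c
class_posterior = unnorm / unnorm.sum()       # p(y = c | x, theta)

print(class_posterior)                        # [0.2, 0.6, 0.2]
y_hat = class_posterior.argmax()              # classification: pick the most probable class
```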

Specifying a Generative Model

p(y, x | θ) = p(y | θ) p(x | y, θ)
  p(y | θ): prior distribution
  p(x | y, θ): likelihood function

• For generative classification, we take the prior to be some categorical (multinoulli) distribution:
  p(y | θ) = Cat(y | θ)
• The likelihood must be matched to the domain of the data
• Suppose x is a vector of D different features
• The simplest generative model assumes that these features are conditionally independent, given the class label:
  p(x | y = c, θ) = ∏_{j=1}^{D} p(x_j | y = c, θ_jc)
• This is the so-called naïve Bayes model for classification
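As one concrete instance of this factorized likelihood, the sketch below assumes binary features, so each factor is a Bernoulli term θ_jc^{x_j} (1 − θ_jc)^{1 − x_j}; the parameter values and the feature vector are made up.

```python
import numpy as np

# Sketch of the naive Bayes likelihood for binary features: p(x | y = c, theta)
# is a product over the D features. The parameters and feature vector are made up.
theta_c = np.array([0.9, 0.2, 0.5, 0.7])   # theta_jc = p(x_j = 1 | y = c), j = 1..D
x = np.array([1, 0, 1, 1])                 # one observed binary feature vector

# p(x | y = c, theta) = prod_j theta_jc^x_j * (1 - theta_jc)^(1 - x_j)
lik = np.prod(theta_c ** x * (1.0 - theta_c) ** (1 - x))
print(lik)   # 0.9 * 0.8 * 0.5 * 0.7 = 0.252
```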

Learning a Generative Model

• Assume we have N training examples independently sampled from an unknown naïve Bayes model:
  y_i: observed class label for training example i
  x_ij: value of feature j for training example i

p(θ | y, x) ∝ p(θ, y, x) = p(θ) p(y | θ) p(x | y, θ)
                          (model)  (class)   (features)

p(θ | y, x) ∝ p(θ) ∏_{i=1}^{N} p(y_i | θ) p(x_i | y_i, θ)
            ∝ p(θ) ∏_{i=1}^{N} p(y_i | θ) ∏_{j=1}^{D} p(x_ij | y_i, θ)

• Learning: ML estimate, MAP estimate, or posterior prediction

Naïve Bayes: ML Estimation

N_c: number of training examples of class c

• Even if we are doing discrete categorization based on discrete features, this is a continuous optimization problem!
• Bayesian reasoning about models also requires continuous probability distributions, even in this simple case
• Next week we will show that the ML class estimates are the empirical class frequencies:
  π̂_c = p̂(y = c) = N_c / N
• For binary features, with N_jc the number of class-c examples in which feature j is on, we have:
  θ̂_jc = N_jc / N_c
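A small sketch of these count-based ML estimates, assuming binary features; the tiny training set is fabricated purely to show the counting.

```python
import numpy as np

# Sketch of count-based ML estimation for naive Bayes with binary features.
# The tiny training set is fabricated purely to show the counting.
X = np.array([[1, 0, 1],       # N = 4 examples, D = 3 binary features
              [1, 1, 0],
              [0, 0, 1],
              [0, 1, 1]])
y = np.array([0, 0, 1, 1])     # class labels in {0, 1}
N, D = X.shape
C = 2

pi_hat = np.zeros(C)           # ML class priors:  pi_c = N_c / N
theta_hat = np.zeros((D, C))   # ML feature terms: theta_jc = N_jc / N_c
for c in range(C):
    in_c = (y == c)
    N_c = in_c.sum()
    pi_hat[c] = N_c / N
    theta_hat[:, c] = X[in_c].sum(axis=0) / N_c   # fraction of class-c examples with x_j = 1

print(pi_hat)      # [0.5 0.5]
print(theta_hat)   # column c holds theta_jc for class c
```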