More Classifiers


MORE CLASSIFIERS

AGENDA
Key concepts for all classifiers: precision vs. recall, biased sample sets
Linear classifiers
Intro to neural networks

RECAP: DECISION BOUNDARIES
With continuous attributes, a decision boundary is the surface in example space that splits positive from negative examples

[Figure: decision tree with tests x1 >= 20, x2 >= 10, x2 >= 15, and the corresponding axis-aligned decision boundary in (x1, x2) space]

BEYOND ERROR RATES

BEYOND ERROR RATE
Predicting security risk: predicting “low risk” for a terrorist is far worse than predicting “high risk” for an innocent bystander (but maybe not for 5 million of them)
Searching for images: returning irrelevant images is worse than omitting relevant ones

BIASED SAMPLE SETS
Often there are orders of magnitude more negative examples than positive ones
E.g., detecting images of Kris on Facebook: if I classify all images as “not Kris”, I’ll have >99.99% accuracy
Examples of Kris should count much more than non-Kris examples!

FALSE POSITIVES

[Figure: true vs. learned decision boundary in (x1, x2) space]

FALSE POSITIVES

A false positive: an example incorrectly predicted to be positive
[Figure: a new query point falling between the true and learned decision boundaries in (x1, x2) space]

FALSE NEGATIVES

A false negative: an example incorrectly predicted to be negative
[Figure: a new query point falling between the true and learned decision boundaries in (x1, x2) space]

PRECISION VS. RECALL
Precision = # of relevant documents retrieved / # of total documents retrieved
Recall = # of relevant documents retrieved / # of total relevant documents
Both are numbers between 0 and 1

PRECISION VS. RECALL
Precision = # of true positives / (# true positives + # false positives)
Recall = # of true positives / (# true positives + # false negatives)
A precise classifier is selective
A classifier with high recall is inclusive
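As a concrete illustration (not from the slides), here is a minimal Python sketch of these definitions; the function name and the toy labels are made up for the example.

```python
# Minimal sketch: precision and recall from 0/1 true and predicted labels.

def precision_recall(y_true, y_pred):
    """Return (precision, recall) for binary 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) > 0 else 1.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 1.0
    return precision, recall

# A very selective classifier (few predicted positives) tends to have
# high precision but low recall.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0]
print(precision_recall(y_true, y_pred))  # (1.0, 0.25)
```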

REDUCING FALSE POSITIVE RATE

[Figure: learned decision boundary adjusted, relative to the true boundary, so that fewer examples are predicted positive]

REDUCING FALSE NEGATIVE RATE

[Figure: learned decision boundary adjusted, relative to the true boundary, so that fewer examples are predicted negative]

PRECISION-RECALL CURVES

Measure precision vs. recall as the decision boundary is tuned
[Figure: precision-recall curve with recall on the x-axis and precision on the y-axis; a perfect classifier sits at the top-right corner, while actual performance traces a curve below it]

PRECISION-RECALL CURVES

Measure precision vs. recall as the decision boundary is tuned
[Figure: the same curve annotated with operating points that penalize false negatives, penalize false positives, or give them equal weight]


PRECISION-RECALL CURVES
Measure precision vs. recall as the decision boundary is tuned
[Figure: two precision-recall curves; the curve closer to the top-right corner indicates better learning performance]

OPTION 1: CLASSIFICATION THRESHOLDS
Many learning algorithms (e.g., probabilistic models, linear models) give a real-valued output v(x) that needs thresholding for classification
v(x) > t => positive label given to x
v(x) < t => negative label given to x
May want to tune the threshold t to get fewer false positives or false negatives
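A rough sketch of how sweeping the threshold t traces out the precision-recall curve; the scores, labels, and thresholds below are hypothetical.

```python
# Sketch: v(x) is a real-valued score, labels are 0/1. Each threshold t
# yields one (precision, recall) point on the curve.

def pr_curve(scores, y_true, thresholds):
    points = []
    for t in thresholds:
        y_pred = [1 if s > t else 0 for s in scores]
        tp = sum(p and y for p, y in zip(y_pred, y_true))
        fp = sum(p and not y for p, y in zip(y_pred, y_true))
        fn = sum((not p) and y for p, y in zip(y_pred, y_true))
        prec = tp / (tp + fp) if tp + fp else 1.0
        rec = tp / (tp + fn) if tp + fn else 1.0
        points.append((t, prec, rec))
    return points

scores = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]   # hypothetical v(x) values
labels = [1,   1,   0,   1,   0,   0]
for t, p, r in pr_curve(scores, labels, [0.2, 0.5, 0.7]):
    print(f"t={t}: precision={p:.2f}, recall={r:.2f}")
```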

OPTION 2: WEIGHTED DATASETS
Weighted datasets: attach a weight w to each example to indicate how important it is
Instead of counting “# of errors”, count “sum of weights of errors”
Or construct a resampled dataset D’ where each example is duplicated proportionally to its w
As the relative weight of positive vs. negative examples is tuned from 0 to 1, the precision-recall curve is traced out (a sketch of both options follows below)
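A minimal sketch of the two bookkeeping options above; the function names and the scale constant are illustrative, not from the slides.

```python
# Option A: count errors by summing the weights of misclassified examples.
def weighted_error(examples, weights, predict):
    return sum(w for (x, y), w in zip(examples, weights) if predict(x) != y)

# Option B: build a resampled dataset D' in which each example is
# duplicated roughly in proportion to its weight.
def resample(examples, weights, scale=10):
    dataset = []
    for (x, y), w in zip(examples, weights):
        dataset.extend([(x, y)] * max(1, round(w * scale)))
    return dataset
```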

LINEAR CLASSIFIERS: MOTIVATION
Decision trees produce axis-aligned decision boundaries
Can we accurately classify data like this?
[Figure: positive and negative examples plotted in (x1, x2) space]

PLANE GEOMETRY
Any line in 2D can be expressed as the set of solutions (x, y) to the equation ax + by + c = 0 (an implicit surface)
ax + by + c > 0 is one side of the line
ax + by + c < 0 is the other side
ax + by + c = 0 is the line itself
[Figure: a line in the (x, y) plane; the vector (a, b) is normal to it]

PLANE GEOMETRY
In 3D, a plane can be expressed as the set of solutions (x, y, z) to the equation ax + by + cz + d = 0
ax + by + cz + d > 0 is one side of the plane
ax + by + cz + d < 0 is the other side
ax + by + cz + d = 0 is the plane itself
[Figure: a plane in (x, y, z) space; the vector (a, b, c) is normal to it]

LINEAR CLASSIFIER
In d dimensions, c0 + c1*x1 + … + cd*xd = 0 is a hyperplane
Idea:
Use c0 + c1*x1 + … + cd*xd > 0 to denote positive classifications
Use c0 + c1*x1 + … + cd*xd < 0 to denote negative classifications
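A minimal sketch of this sign rule; the coefficients and query points are made-up examples.

```python
# Classify x as positive when c0 + c1*x1 + ... + cd*xd > 0, else negative.

def linear_classify(c, x):
    """c = [c0, c1, ..., cd]; x = [x1, ..., xd]. Returns +1 or -1."""
    value = c[0] + sum(ci * xi for ci, xi in zip(c[1:], x))
    return 1 if value > 0 else -1

# Example: the line x1 + x2 - 25 = 0 in 2D (c0 = -25, c1 = 1, c2 = 1).
print(linear_classify([-25, 1, 1], [20, 10]))  # 20 + 10 - 25 = 5 > 0  -> +1
print(linear_classify([-25, 1, 1], [5, 10]))   # 5 + 10 - 25 = -10 < 0 -> -1
```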

PERCEPTRON
y = f(x, w) = g(Σi=1,…,n wi xi)
[Figure: a perceptron unit with inputs x1,…,xn, weights wi, a summation node, and threshold function g(u); alongside, positive and negative examples in (x1, x2) space separated by the line w1 x1 + w2 x2 = 0]

A SINGLE PERCEPTRON CAN LEARN
A disjunction of boolean literals, e.g., x1 ∨ x2 ∨ x3
The majority function
[Figure: perceptron unit with inputs x1,…,xn, weights wi, and threshold function g]

A SINGLE PERCEPTRON CAN LEARN
A disjunction of boolean literals, e.g., x1 ∨ x2 ∨ x3
The majority function
XOR?
[Figure: perceptron unit with inputs x1,…,xn, weights wi, and threshold function g]
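To make this concrete, here is a small sketch (not from the slides) with hand-chosen weights showing that a single threshold unit can represent a disjunction (OR) and the majority function; no such weights exist for XOR, since it is not linearly separable.

```python
# A single threshold unit: fire (output 1) when the weighted sum plus a
# bias is positive. The weights and biases below are illustrative.

def step_unit(weights, bias, x):
    return 1 if sum(w * xi for w, xi in zip(weights, x)) + bias > 0 else 0

inputs = [(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)]

# OR of three inputs: fire if any input is 1.
assert all(step_unit([1, 1, 1], -0.5, x) == (1 if any(x) else 0) for x in inputs)

# Majority of three inputs: fire if at least two inputs are 1.
assert all(step_unit([1, 1, 1], -1.5, x) == (1 if sum(x) >= 2 else 0) for x in inputs)

# XOR, by contrast, is not linearly separable, so no choice of weights and
# bias for a single unit reproduces it.
```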

PERCEPTRON LEARNING RULE
θ ← θ + x(i) (y(i) − g(θT x(i)))   (g outputs either 0 or 1, y is either 0 or 1)
If the output is correct, the weights are unchanged
If g is 0 but y is 1, the weight on each active attribute is increased
If g is 1 but y is 0, the weight on each active attribute is decreased
Converges if the data is linearly separable, but oscillates otherwise
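A minimal sketch of this update rule on a hypothetical, linearly separable toy dataset; the number of passes and the bias-handling convention (a constant 1 appended to each example) are assumptions.

```python
# Perceptron learning rule: theta <- theta + x * (y - g(theta^T x)).

def g(u):
    return 1 if u > 0 else 0   # threshold output, as on the slide

def train_perceptron(X, y, passes=20):
    theta = [0.0] * len(X[0])
    for _ in range(passes):
        for xi, yi in zip(X, y):
            pred = g(sum(t * x for t, x in zip(theta, xi)))
            # No change when pred == yi; otherwise move theta toward/away from xi.
            theta = [t + x * (yi - pred) for t, x in zip(theta, xi)]
    return theta

# Linearly separable toy data (last component is a constant 1 for the bias).
X = [(2, 1, 1), (3, 3, 1), (0, 1, 1), (1, 0, 1)]
y = [1, 1, 0, 0]
theta = train_perceptron(X, y)
print([g(sum(t * x for t, x in zip(theta, xi))) for xi in X])  # [1, 1, 0, 0]
```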

PERCEPTRON
y = f(x, w) = g(Σi=1,…,n wi xi)
[Figure: positive and negative examples in (x1, x2) space with the learned perceptron boundary and a new query point marked “?”]

UNIT (NEURON)
y = g(Σi=1,…,n wi xi), where g(u) = 1/[1 + exp(−u)]
[Figure: a neuron with inputs x1,…,xn, weights wi, a summation node, and sigmoid activation g]
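A minimal sketch of this unit; the weights and inputs are illustrative.

```python
import math

# Weighted sum passed through the sigmoid g(u) = 1 / (1 + exp(-u)).

def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))

def unit(weights, x):
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)))

print(unit([2.0, -1.0], [1.0, 3.0]))   # g(2 - 3) = g(-1), roughly 0.27
```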

NEURAL NETWORK
Network of interconnected neurons
Acyclic (feed-forward) vs. recurrent networks
[Figure: two neuron units wired together, each computing y = g(Σi wi xi)]

TWO-LAYER FEED-FORWARD NEURAL NETWORK
[Figure: inputs connected to a hidden layer by weights w1j, and the hidden layer connected to the output layer by weights w2k]
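A rough sketch of the forward pass through such a network, assuming sigmoid hidden and output units; the layer sizes and weight values are made up.

```python
import math

def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))

def forward(x, w1, w2):
    # w1[j] holds the weights from all inputs into hidden unit j;
    # w2[k] holds the weights from all hidden units into output unit k.
    hidden = [sigmoid(sum(w * xi for w, xi in zip(w1[j], x)))
              for j in range(len(w1))]
    output = [sigmoid(sum(w * hj for w, hj in zip(w2[k], hidden)))
              for k in range(len(w2))]
    return output

# 2 inputs, 2 hidden units, 1 output unit.
w1 = [[1.0, -1.0], [-1.0, 1.0]]
w2 = [[2.0, 2.0]]
print(forward([0.5, 0.2], w1, w2))
```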

NETWORKS WITH HIDDEN LAYERS
Can represent XORs and other nonlinear functions
Common neuron types: soft perceptron (sigmoid), radial basis functions, linear, …
As the number of hidden units increases, so does the network’s capacity to learn functions with more nonlinear features
How to train hidden layers?

BACKPROPAGATION (PRINCIPLE)
Treat the problem as one of minimizing the error between the example label and the network output, given the example and the network weights as input
Error(xi, yi, w) = (yi − f(xi, w))²
Sum this error term over all examples: E(w) = Σi Error(xi, yi, w) = Σi (yi − f(xi, w))²

Minimize the error using an optimization algorithm; stochastic gradient descent is typically used
The gradient ∇E is orthogonal to the level sets (contours) of E and points in the direction of steepest increase
Gradient descent: iteratively move in the direction of the negative gradient, w ← w − α ∇E(w)
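A minimal sketch of gradient descent on E(w) for a single sigmoid unit f(x, w) = g(w · x); the data, step size α, and iteration count are illustrative assumptions.

```python
import math

def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))

def gradient_descent(X, y, alpha=0.5, iterations=200):
    w = [0.0] * len(X[0])
    for _ in range(iterations):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            out = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)))
            # dE/dw_j for one example: -2 (y - out) out (1 - out) x_j
            for j, xj in enumerate(xi):
                grad[j] += -2.0 * (yi - out) * out * (1.0 - out) * xj
        # Move opposite the gradient, i.e., in the direction of steepest decrease.
        w = [wj - alpha * gj for wj, gj in zip(w, grad)]
    return w

X = [(2, 1, 1), (3, 3, 1), (0, 1, 1), (1, 0, 1)]  # last entry is a bias input
y = [1, 1, 0, 0]
print(gradient_descent(X, y))
```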

STOCHASTIC GRADIENT DESCENT
For each example (xi, yi), take a gradient descent step to reduce the error for (xi, yi) only

STOCHASTIC GRADIENT DESCENT
Objective function values (measured over all examples) settle into a local minimum over time
The step size must be reduced over time, e.g., O(1/t)
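A sketch of the stochastic variant, taking one step per example with a step size decaying like 1/t; the data and constants are again illustrative.

```python
import math

def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))

def sgd(X, y, passes=50):
    w = [0.0] * len(X[0])
    t = 0
    for _ in range(passes):
        for xi, yi in zip(X, y):
            t += 1
            alpha = 1.0 / t                      # step size reduced over time, O(1/t)
            out = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)))
            # Gradient step for the single-example error (y_i - f(x_i, w))^2.
            w = [wj + alpha * 2.0 * (yi - out) * out * (1.0 - out) * xj
                 for wj, xj in zip(w, xi)]
    return w

X = [(2, 1, 1), (3, 3, 1), (0, 1, 1), (1, 0, 1)]
y = [1, 1, 0, 0]
print(sgd(X, y))
```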

NEURAL NETWORKS: PROS AND CONS
Pros:
Bioinspiration is nifty
Can represent a wide variety of decision boundaries
Complexity is easily tunable (number of hidden nodes, topology)
Easily extendable to regression tasks
Cons:
Haven’t gotten close to unlocking the power of the human (or cat) brain
Complex boundaries need lots of data
Slow training
Mostly lukewarm feelings in mainstream ML (although the “deep learning” variant is en vogue now)

NEXT CLASS
Another guest lecture