LING 696B: Graph-based methods and Supervised learning
1
LING 696B: Graph-based methods and Supervised learning
2
Road map
Types of learning problems:
Unsupervised: clustering, dimension reduction -- Generative models
Supervised: classification (today) -- Discriminative models
Methodology:
Parametric: stronger assumptions about the distribution (blobs, mixture models)
Non-parametric: weaker assumptions (neural nets, spectral clustering, Isomap)
3
Puzzle from several weeks ago
How do people learn categories from distributions?
Liberman et al. (1952)
4
Graph-based non-parametric methods
“Learn locally, think globally”
Local learning produces a graph that reveals the underlying structure: learning the neighbors
The graph is used to reveal global structure in the data
Isomap: geodesic distance through shortest paths
Spectral clustering: connected components from the graph spectrum (see demo; a sketch of the neighbor-graph step follows below)
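A minimal sketch of the “learn locally, think globally” recipe, assuming a k-nearest-neighbor graph (the choice k = 5 is illustrative, not from the slides) and shortest paths through that graph for the Isomap-style geodesic distances:

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import shortest_path

def geodesic_distances(X, k=5):
    """Build a k-NN graph, then approximate geodesic distances
    along the data manifold by shortest paths through the graph."""
    # Local step: connect each point to its k nearest neighbors,
    # with edge weights equal to Euclidean distance.
    W = kneighbors_graph(X, n_neighbors=k, mode='distance')
    # Global step: geodesic distance = shortest path through the graph.
    return shortest_path(W, method='D', directed=False)

# Example: points sampled along a curved 1-D manifold in 2-D.
t = np.linspace(0, np.pi, 100)
X = np.c_[np.cos(t), np.sin(t)] + 0.01 * np.random.randn(100, 2)
D = geodesic_distances(X, k=5)
print(D[0, -1])   # roughly the arc length (~pi), not the chord (~2)
```

On data sampled from a curved one-dimensional manifold, the graph distance tracks the arc length rather than the straight-line distance.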
5
Clustering as a graph partitioning problem
Normalized-cut problem: split the graph into two parts A and B so that
each part is not too small
the edges being cut do not carry too much weight
Ncut(A, B) = (weight on edges from A to B)/(weight on edges within A) + (weight on edges from A to B)/(weight on edges within B)
6
Normalized cut through spectral embedding
The exact solution of normalized cut is NP-hard (explodes for large graphs)
A “soft” version is solvable: look for coordinates x1, …, xN for the nodes that minimize sum_ij W_ij (xi - xj)^2, so that strongly connected nodes stay nearby and weakly connected nodes stay far apart
Such coordinates are provided by eigenvectors of the adjacency/Laplacian matrix (recall MDS) -- spectral embedding
W is the neighborhood matrix
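A minimal sketch of the spectral-embedding step, assuming a fully connected Gaussian-affinity neighborhood matrix and the unnormalized graph Laplacian (the slides leave the exact choice of matrix open, so these are common but illustrative choices):

```python
import numpy as np
from scipy.linalg import eigh
from scipy.spatial.distance import cdist

def spectral_bipartition(X, sigma=1.0):
    """Two-way spectral clustering: embed the nodes with an eigenvector
    of the graph Laplacian, then split by sign."""
    # Neighborhood matrix W: Gaussian affinities (a k-NN graph would also work).
    W = np.exp(-cdist(X, X, 'sqeuclidean') / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    # Unnormalized graph Laplacian L = D - W.
    L = np.diag(W.sum(axis=1)) - W
    # Eigenvectors of L, eigenvalues in ascending order. The second
    # eigenvector minimizes sum_ij W_ij (x_i - x_j)^2 under the usual
    # constraints (orthogonal to the constant vector, unit norm).
    vals, vecs = eigh(L)
    return (vecs[:, 1] > 0).astype(int)   # cluster labels 0/1

# Example: two loose clouds of points.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
print(spectral_bipartition(X))
```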
7
Is this relevant to how people learn categories?
Maye & Gerken: learning a bi-modal distribution on a curve (living in an abstract manifold) from /d/ to /(s)t/
Mixture model: transform the signal and approximate it with two “dynamic blobs” (a sketch follows below)
Can people learn categories from arbitrary manifolds following a “local learning” strategy? Simple case: start from a uniform distribution (see demo)
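A minimal sketch of the mixture-model alternative, assuming a one-dimensional cue along the /d/–/t/ continuum and a two-component Gaussian mixture for the “blobs” (the fake data and the VOT-like cue are illustrative, not from the slides):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Fake bimodal data along a 1-D continuum (e.g., a VOT-like cue).
rng = np.random.default_rng(0)
x = np.r_[rng.normal(20, 5, 200), rng.normal(60, 5, 200)].reshape(-1, 1)

# Two-component mixture: each "blob" is a Gaussian with its own mean/variance.
gmm = GaussianMixture(n_components=2, random_state=0).fit(x)
print(gmm.means_.ravel())          # component means, roughly 20 and 60
print(gmm.predict([[25], [70]]))   # category assignments for new tokens
```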
8
Local learning from graphs
Can people learn categories from arbitrary manifolds following a “local learning” strategy? Most likely not
What constrains the kind of manifolds that people can learn?
What are the reasonable metrics people use?
How does neighborhood size affect this type of learning?
Learning from non-uniform distributions?
9
Switching gears
Supervised learning: learning a function from input-output pairs
Arguably, something that people also do
Example: the perceptron, which learns a function f(x) = sign(<w,x> + b)
Also called a “classifier”: a machine with yes/no output (a sketch follows below)
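A minimal sketch of perceptron learning, assuming the classic mistake-driven update rule (the learning rate, epoch count, and toy data are illustrative choices, not from the slides):

```python
import numpy as np

def train_perceptron(X, y, epochs=20, lr=1.0):
    """Learn f(x) = sign(<w, x> + b) from labeled pairs (x, y), y in {-1, +1}."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            # Mistake-driven update: only adjust on misclassified points.
            if yi * (w @ xi + b) <= 0:
                w += lr * yi * xi
                b += lr * yi
    return w, b

# Example: linearly separable toy data.
X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w, b = train_perceptron(X, y)
print(np.sign(X @ w + b))   # should reproduce y
```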
10
Speech perception as a classification problem
Speech perception is viewed as a bottom-up procedure involving many decisions, e.g. sonorant/consonant, voiced/voiceless (see Peter’s presentation)
A long-standing effort to build machines that do the same: Stevens’ view of distinctive features
11
Knowledge-based speech recognition
Mainstream method:
Front end: uniform signal representation
Back end: hidden Markov models
Knowledge-based:
Front end: sound-specific features based on acoustic knowledge
Back end: a series of decisions on how lower-level knowledge is integrated
12
The conceptual framework, from Liu (1996) and others
Each step is hard work
Bypassed in Stevens (2002)
13
Implications of the flow-chart architecture
Requires accurate low-level decisions: mistakes can build up very quickly
Thought experiment: “linguistic” speech recognition through a sequence of distinctive-feature classifiers
Hand-crafted decision rules are often not robust/flexible, hence the need for good statistical classifiers
14
An unlikely marriage
Recent years have seen several sophisticated classification machines
Example: the support vector machine (Vapnik) -- today’s topic
Interest has been moving from neural nets to these new machines
Many have proposed integrating the new classifiers as a back end
Niyogi and Burges paper: building feature detectors with SVMs
15
Generalization in classification
Experiment: you are learning a line that separates two classes
16
Generalization in classification
Question: Where does the yellow dot belong?
18
Margin and linear classifiers
We tend to draw the line that gives the most “room” (margin) between the two clouds
19
Margin
The margin needs to be defined on “border” points
21
Justification for maximum margin
Hopefully, maximum-margin classifiers generalize well
23
Support vectors in the separable case
Data points that reach the maximal margin from the separating line
24
Formalizing maximum margin -- optimization for SVM
Need constrained optimization: f(x) = sign(<w,x> + b) is the same as sign(<Cw,x> + Cb), for any C > 0
Two strategies to choose a constrained optimization problem:
Limit the length of w, and maximize the margin
Fix the margin, and minimize the length of w (see the sketch below)
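A minimal numeric sketch of the second strategy, assuming a general-purpose constrained optimizer (scipy's minimize) rather than a dedicated quadratic-programming solver; the toy data and labels are illustrative:

```python
import numpy as np
from scipy.optimize import minimize

# Toy separable data; labels in {-1, +1}.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -3.0]])
y = np.array([1, 1, -1, -1])

# Strategy 2 from the slide: fix the margin to 1 in the constraints
# y_i (<w, x_i> + b) >= 1 and minimize the length of w.
def objective(p):
    w = p[:-1]
    return 0.5 * np.dot(w, w)

constraints = [
    {'type': 'ineq', 'fun': lambda p, xi=xi, yi=yi: yi * (np.dot(p[:-1], xi) + p[-1]) - 1}
    for xi, yi in zip(X, y)
]

res = minimize(objective, x0=np.zeros(X.shape[1] + 1), constraints=constraints)
w, b = res.x[:-1], res.x[-1]
print(w, b)                   # maximum-margin separating line
print(np.sign(X @ w + b))     # reproduces the labels
```

Because the constraints fix the margin to 1, minimizing the length of w is the same as maximizing the geometric margin 1/||w||.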
25
SVM optimization (see demo)
A constrained quadratic programming problem
It can be shown (through the Lagrange multiplier method) that the solution looks like w = sum_i alpha_i y_i x_i, where y_i is the label and the margin is fixed to 1 in the constraints
A linear combination of training data!
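A minimal sketch checking this property, assuming scikit-learn's SVC with a linear kernel and a very large C to approximate the hard-margin case (in SVC, dual_coef_ stores the products alpha_i * y_i); the toy data are the same illustrative points as above:

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -3.0]])
y = np.array([1, 1, -1, -1])

# Hard-margin behavior approximated with a very large C.
clf = SVC(kernel='linear', C=1e6).fit(X, y)

# The learned w is a linear combination of the training data:
# w = sum_i alpha_i * y_i * x_i, and only support vectors have alpha_i > 0.
w_from_svs = clf.dual_coef_ @ clf.support_vectors_
print(w_from_svs, clf.coef_)    # the two should agree
print(clf.support_vectors_)     # only the "border" points contribute
```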
26
SVM applied to non-separable data
What happens when the data are not separable?
The optimization problem has no solution (recall the XOR problem) -- see demo
27
Extension to non-separable data through new variables
Allow data points to “encroach” on the separating line (see demo)
New objective: the original objective ||w||^2 / 2, plus a tolerance term C * sum_i xi_i, where the slack variable xi_i measures how much point i violates the margin
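A minimal sketch of the tolerance/objective trade-off, assuming scikit-learn's soft-margin SVC, where the parameter C weights the slack term (the overlapping toy data are illustrative):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two overlapping clouds: no separating line exists.
X = np.vstack([rng.normal(0, 1.5, (50, 2)), rng.normal(3, 1.5, (50, 2))])
y = np.r_[np.full(50, -1), np.full(50, 1)]

# Small C: cheap slack, wide margin, many margin violations tolerated.
# Large C: expensive slack, the fit tries hard to classify every point.
for C in (0.01, 100.0):
    clf = SVC(kernel='linear', C=C).fit(X, y)
    print(C, clf.n_support_, clf.score(X, y))
```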
28
When things become wild: non-linear extensions
The majority of “real-world” problems are not linearly separable
This can be due to some deep underlying laws, e.g. the XOR data
Non-linearity in neural nets: hidden layers, non-linear activations
SVMs introduced a trendier way of making non-linear machines -- kernels
29
Kernel methods
Model-fitting problems are ill-posed without constraining the hypothesis space
Avoid committing to a space: non-parametric methods using kernels
Idea: let the space grow with the data
How? Associate each data point with a little function, e.g. a blob, and let the space be the linear combinations of these
Connection to neural nets
30
Kernel extension of SVM
Recall the linear solution: w = sum_i alpha_i y_i x_i
Substituting this into f: f(x) = sign(sum_i alpha_i y_i <x, x_i> + b)
What matters is the dot product: use a general kernel function K(x, x_i) in place of <x, x_i>
31
Kernel extension of SVM
This is very much like replacing linear nodes with non-linear nodes in a neural net
Radial basis function network: each K(x, xi) is a Gaussian centered at xi -- a small blob
“Seeing” the non-linearity: a theorem says the kernel is still a dot product, except that it works in an infinite-dimensional space of “features”
32
This is not a fairy tale
Hopefully, by throwing the data into infinite dimensions, they will become separable
How can things work in infinite dimensions? The infinite-dimensional space is implicit
Only support vectors act as “anchors” for the separating plane in feature space
All the computation is done in finite dimensions, by searching over support vectors and their weights
As a result, we can do lots of things with SVMs by playing with kernels (see demo; a sketch follows below)
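A minimal sketch of the kernel demo, assuming scikit-learn's SVC with a Gaussian (RBF) kernel; the XOR-style data echo the non-separable example from the earlier slide:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# XOR-style data: not separable by any single line.
X = rng.normal(size=(200, 2))
y = np.sign(X[:, 0] * X[:, 1])

linear = SVC(kernel='linear').fit(X, y)
rbf = SVC(kernel='rbf', gamma=1.0).fit(X, y)   # K(x, xi) = exp(-gamma * ||x - xi||^2)

print(linear.score(X, y))   # near chance: no separating line exists
print(rbf.score(X, y))      # near 1.0: separable in the implicit feature space
```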
35
Reflections
How likely is this to be a model of human learning?
Are all learning problems reducible to classification?
What learning models are appropriate for speech?