LING 696B: Graph-based methods and Supervised learning
1
LING 696B: Graph-based methods and Supervised learning
2
Road map
Types of learning problems:
Unsupervised: clustering, dimension reduction -- Generative models
Supervised: classification (today) -- Discriminative models
Methodology:
Parametric: stronger assumptions about the distribution (blobs, mixture models)
Non-parametric: weaker assumptions (neural nets, spectral clustering, Isomap)
3
Puzzle from several weeks ago
How do people learn categories from distributions?
Liberman et al. (1952)
4
Graph-based non-parametric methods
“Learn locally, think globally”
Local learning produces a graph that reveals the underlying structure: learning the neighbors
The graph is used to reveal global structure in the data
Isomap: geodesic distance through shortest paths
Spectral clustering: connected components from the graph spectrum (see demo; a sketch of the neighbor-graph step follows below)
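A minimal sketch of the “learn locally, think globally” recipe, assuming a k-nearest-neighbor graph (the choice k = 5 is illustrative, not from the slides) and shortest paths through that graph for the Isomap-style geodesic distances:

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import shortest_path

def geodesic_distances(X, k=5):
    """Build a k-NN graph, then approximate geodesic distances
    along the data manifold by shortest paths through the graph."""
    # Local step: connect each point to its k nearest neighbors,
    # with edge weights equal to Euclidean distance.
    W = kneighbors_graph(X, n_neighbors=k, mode='distance')
    # Global step: geodesic distance = shortest path through the graph.
    return shortest_path(W, method='D', directed=False)

# Example: points sampled along a curved 1-D manifold in 2-D.
t = np.linspace(0, np.pi, 100)
X = np.c_[np.cos(t), np.sin(t)] + 0.01 * np.random.randn(100, 2)
D = geodesic_distances(X, k=5)
print(D[0, -1])   # roughly the arc length (~pi), not the chord (~2)
```

On data sampled from a curved one-dimensional manifold, the graph distance tracks the arc length rather than the straight-line distance.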
5
Clustering as a graph partitioning problem
Normalized-cut problem: split the graph into two parts A and B so that
each part is not too small
the edges being cut do not carry too much weight
Ncut(A, B) = (weight on edges from A to B)/(weight on edges within A) + (weight on edges from A to B)/(weight on edges within B)
6
Normalized cut through spectral embedding
The exact solution of normalized cut is NP-hard (explodes for large graphs)
A “soft” version is solvable: look for coordinates x1, …, xN for the nodes that minimize sum_ij W_ij (xi - xj)^2, so that strongly connected nodes stay nearby and weakly connected nodes stay far apart
Such coordinates are provided by eigenvectors of the adjacency/Laplacian matrix (recall MDS) -- spectral embedding
W is the neighborhood matrix
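A minimal sketch of the spectral-embedding step, assuming a fully connected Gaussian-affinity neighborhood matrix and the unnormalized graph Laplacian (the slides leave the exact choice of matrix open, so these are common but illustrative choices):

```python
import numpy as np
from scipy.linalg import eigh
from scipy.spatial.distance import cdist

def spectral_bipartition(X, sigma=1.0):
    """Two-way spectral clustering: embed the nodes with an eigenvector
    of the graph Laplacian, then split by sign."""
    # Neighborhood matrix W: Gaussian affinities (a k-NN graph would also work).
    W = np.exp(-cdist(X, X, 'sqeuclidean') / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    # Unnormalized graph Laplacian L = D - W.
    L = np.diag(W.sum(axis=1)) - W
    # Eigenvectors of L, eigenvalues in ascending order. The second
    # eigenvector minimizes sum_ij W_ij (x_i - x_j)^2 under the usual
    # constraints (orthogonal to the constant vector, unit norm).
    vals, vecs = eigh(L)
    return (vecs[:, 1] > 0).astype(int)   # cluster labels 0/1

# Example: two loose clouds of points.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
print(spectral_bipartition(X))
```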
7
Is this relevant to how people learn categories?
Maye & Gerken: learning a bi-modal distribution on a curve (living in an abstract manifold) from /d/ to /(s)t/
Mixture model: transform the signal and approximate it with two “dynamic blobs” (a sketch follows below)
Can people learn categories from arbitrary manifolds following a “local learning” strategy? Simple case: start from a uniform distribution (see demo)
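A minimal sketch of the mixture-model alternative, assuming a one-dimensional cue along the /d/–/t/ continuum and a two-component Gaussian mixture for the “blobs” (the fake data and the VOT-like cue are illustrative, not from the slides):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Fake bimodal data along a 1-D continuum (e.g., a VOT-like cue).
rng = np.random.default_rng(0)
x = np.r_[rng.normal(20, 5, 200), rng.normal(60, 5, 200)].reshape(-1, 1)

# Two-component mixture: each "blob" is a Gaussian with its own mean/variance.
gmm = GaussianMixture(n_components=2, random_state=0).fit(x)
print(gmm.means_.ravel())          # component means, roughly 20 and 60
print(gmm.predict([[25], [70]]))   # category assignments for new tokens
```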
8
Local learning from graphs
Can people learn categories from arbitrary manifolds following a “local learning” strategy? Most likely not
What constrains the kind of manifolds that people can learn?
What are the reasonable metrics people use?
How does neighborhood size affect this type of learning?
Learning from non-uniform distributions?
9
Switching gears
Supervised learning: learning a function from input-output pairs
Arguably, something that people also do
Example: the perceptron, which learns a function f(x) = sign(<w,x> + b)
Also called a “classifier”: a machine with yes/no output (a sketch follows below)
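A minimal sketch of perceptron learning, assuming the classic mistake-driven update rule (the learning rate, epoch count, and toy data are illustrative choices, not from the slides):

```python
import numpy as np

def train_perceptron(X, y, epochs=20, lr=1.0):
    """Learn f(x) = sign(<w, x> + b) from labeled pairs (x, y), y in {-1, +1}."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            # Mistake-driven update: only adjust on misclassified points.
            if yi * (w @ xi + b) <= 0:
                w += lr * yi * xi
                b += lr * yi
    return w, b

# Example: linearly separable toy data.
X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w, b = train_perceptron(X, y)
print(np.sign(X @ w + b))   # should reproduce y
```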
10
Speech perception as a classification problem
Speech perception is viewed as a bottom-up procedure involving many decisions, e.g. sonorant/consonant, voiced/voiceless (see Peter’s presentation)
A long-standing effort to build machines that do the same: Stevens’ view of distinctive features
11
Knowledge-based speech recognition
Mainstream method:
Front end: uniform signal representation
Back end: hidden Markov models
Knowledge-based:
Front end: sound-specific features based on acoustic knowledge
Back end: a series of decisions on how lower-level knowledge is integrated
12
The conceptual framework, from Liu (1996) and others
Each step is hard work
Bypassed in Stevens (2002)
13
Implications of the flow-chart architecture
Requires accurate low-level decisions: mistakes can build up very quickly
Thought experiment: “linguistic” speech recognition through a sequence of distinctive-feature classifiers
Hand-crafted decision rules are often not robust/flexible, hence the need for good statistical classifiers
14
An unlikely marriage
Recent years have seen several sophisticated classification machines
Example: the support vector machine (Vapnik) -- today’s topic
Interest has been moving from neural nets to these new machines
Many have proposed integrating the new classifiers as a back end
Niyogi and Burges paper: building feature detectors with SVMs
15
Generalization in classification
Experiment: you are learning a line that separates two classes
16
Generalization in classification
Question: Where does the yellow dot belong?
18
Margin and linear classifiers
We tend to draw the line that gives the most “room” (margin) between the two clouds
19
Margin
The margin needs to be defined on “border” points
21
Justification for maximum margin
Hopefully, maximum-margin classifiers generalize well
23
Support vectors in the separable case
Data points that reach the maximal margin from the separating line
24
Formalizing maximum margin -- optimization for SVM
Need constrained optimization: f(x) = sign(<w,x> + b) is the same as sign(<Cw,x> + Cb), for any C > 0
Two strategies to choose a constrained optimization problem:
Limit the length of w, and maximize the margin
Fix the margin, and minimize the length of w (see the sketch below)
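A minimal numeric sketch of the second strategy, assuming a general-purpose constrained optimizer (scipy's minimize) rather than a dedicated quadratic-programming solver; the toy data and labels are illustrative:

```python
import numpy as np
from scipy.optimize import minimize

# Toy separable data; labels in {-1, +1}.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -3.0]])
y = np.array([1, 1, -1, -1])

# Strategy 2 from the slide: fix the margin to 1 in the constraints
# y_i (<w, x_i> + b) >= 1 and minimize the length of w.
def objective(p):
    w = p[:-1]
    return 0.5 * np.dot(w, w)

constraints = [
    {'type': 'ineq', 'fun': lambda p, xi=xi, yi=yi: yi * (np.dot(p[:-1], xi) + p[-1]) - 1}
    for xi, yi in zip(X, y)
]

res = minimize(objective, x0=np.zeros(X.shape[1] + 1), constraints=constraints)
w, b = res.x[:-1], res.x[-1]
print(w, b)                   # maximum-margin separating line
print(np.sign(X @ w + b))     # reproduces the labels
```

Because the constraints fix the margin to 1, minimizing the length of w is the same as maximizing the geometric margin 1/||w||.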
25
SVM optimization (see demo)
A constrained quadratic programming problem
It can be shown (through the Lagrange multiplier method) that the solution looks like w = sum_i alpha_i y_i x_i, where y_i is the label and the margin is fixed to 1 in the constraints
A linear combination of training data!
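A minimal sketch checking this property, assuming scikit-learn's SVC with a linear kernel and a very large C to approximate the hard-margin case (in SVC, dual_coef_ stores the products alpha_i * y_i); the toy data are the same illustrative points as above:

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -3.0]])
y = np.array([1, 1, -1, -1])

# Hard-margin behavior approximated with a very large C.
clf = SVC(kernel='linear', C=1e6).fit(X, y)

# The learned w is a linear combination of the training data:
# w = sum_i alpha_i * y_i * x_i, and only support vectors have alpha_i > 0.
w_from_svs = clf.dual_coef_ @ clf.support_vectors_
print(w_from_svs, clf.coef_)    # the two should agree
print(clf.support_vectors_)     # only the "border" points contribute
```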
26
SVM applied to non-separable data
What happens when the data are not separable?
The optimization problem has no solution (recall the XOR problem) -- see demo
27
Extension to non-separable data through new variables
Allow data points to “encroach” on the separating line (see demo)
New objective: the original objective ||w||^2 / 2, plus a tolerance term C * sum_i xi_i, where the slack variable xi_i measures how much point i violates the margin
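A minimal sketch of the tolerance/objective trade-off, assuming scikit-learn's soft-margin SVC, where the parameter C weights the slack term (the overlapping toy data are illustrative):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two overlapping clouds: no separating line exists.
X = np.vstack([rng.normal(0, 1.5, (50, 2)), rng.normal(3, 1.5, (50, 2))])
y = np.r_[np.full(50, -1), np.full(50, 1)]

# Small C: cheap slack, wide margin, many margin violations tolerated.
# Large C: expensive slack, the fit tries hard to classify every point.
for C in (0.01, 100.0):
    clf = SVC(kernel='linear', C=C).fit(X, y)
    print(C, clf.n_support_, clf.score(X, y))
```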
28
When things become wild: non-linear extensions
The majority of “real-world” problems are not linearly separable
This can be due to some deep underlying laws, e.g. the XOR data
Non-linearity in neural nets: hidden layers, non-linear activations
SVMs introduced a trendier way of making non-linear machines -- kernels
29
Kernel methods
Model-fitting problems are ill-posed without constraining the hypothesis space
Avoid committing to a space: non-parametric methods using kernels
Idea: let the space grow with the data
How? Associate each data point with a little function, e.g. a blob, and let the space be the linear combinations of these
Connection to neural nets
30
Kernel extension of SVM
Recall the linear solution: w = sum_i alpha_i y_i x_i
Substituting this into f: f(x) = sign(sum_i alpha_i y_i <x, x_i> + b)
What matters is the dot product: use a general kernel function K(x, x_i) in place of <x, x_i>
31
Kernel extension of SVM
This is very much like replacing linear nodes with non-linear nodes in a neural net
Radial basis function network: each K(x, xi) is a Gaussian centered at xi -- a small blob
“Seeing” the non-linearity: a theorem says the kernel is still a dot product, except that it works in an infinite-dimensional space of “features”
32
This is not a fairy tale
Hopefully, by throwing the data into infinite dimensions, they will become separable
How can things work in infinite dimensions? The infinite-dimensional space is implicit
Only support vectors act as “anchors” for the separating plane in feature space
All the computation is done in finite dimensions, by searching over support vectors and their weights
As a result, we can do lots of things with SVMs by playing with kernels (see demo; a sketch follows below)
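A minimal sketch of the kernel demo, assuming scikit-learn's SVC with a Gaussian (RBF) kernel; the XOR-style data echo the non-separable example from the earlier slide:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# XOR-style data: not separable by any single line.
X = rng.normal(size=(200, 2))
y = np.sign(X[:, 0] * X[:, 1])

linear = SVC(kernel='linear').fit(X, y)
rbf = SVC(kernel='rbf', gamma=1.0).fit(X, y)   # K(x, xi) = exp(-gamma * ||x - xi||^2)

print(linear.score(X, y))   # near chance: no separating line exists
print(rbf.score(X, y))      # near 1.0: separable in the implicit feature space
```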
35
Reflections
How likely is this to be a model of human learning?
Are all learning problems reducible to classification?
What learning models are appropriate for speech?