
7/6/2003 ICME Tutorial, Baltimore 1

Statistical Methods for Learning Multimedia Semantics

Edward Chang
Associate Professor, Electrical Engineering, UC Santa Barbara
CTO, VIMA Technologies

7/6/2003 ICME Tutorial, Baltimore 2

Outline

Statistical Learning
Multimedia Applications' Data Characteristics
Classical Models
Kernel Methods

Linear Model View
Nearest Neighbor View
Geometric View

Dimension Reduction Methods

7/6/2003 ICME Tutorial, Baltimore 3

Statistical Learning

Program the computers to learn!
Computers improve performance with experience at some task.
Example:

Task: classify images
Performance: prediction accuracy
Experience: labeled images

7/6/2003 ICME Tutorial, Baltimore 4

Definition

X: data pool
U: unlabeled pool
L: labeled pool

G: labels
Regression: G → R
Classification: G → {+1, −1}

H: Learning algorithm

7/6/2003 ICME Tutorial, Baltimore 5

Statistical Learning

Experience: characterized by training data L
Training: f = H(L)
Task (e.g., prediction): ŷ = f(u), u ∈ U
Performance: measured by some error function, e.g., maximizing y·f(u)

7/6/2003 ICME Tutorial, Baltimore 6

Learning Algorithms (H)

Linear Regression
K-NN
Bayesian Analysis
Neural Networks
Decision Trees
Kernel Methods
Etc.

7/6/2003 ICME Tutorial, Baltimore 7

H has a hypothesis space; find the "best" hypothesis based on the training data (L) efficiently.

Best solution: fitting L well? Predicting U accurately!

Efficiency: computational complexity and resource requirements

7/6/2003 ICME Tutorial, Baltimore 8

Classical Model [Donoho 2000]

N: number of training instances (N = |L|)

N+, N−: numbers of positive and negative training instances

D: dimensionality
Classical assumptions: N >> D, N → ∞

E.g., PAC learnability
N− ≈ N+

7/6/2003 ICME Tutorial, Baltimore 9

Emerging MM Applications

N < D
N+ << N−

Examples:
Information retrieval with relevance feedback
K-class classification
- Image classification
- Gene profiling

7/6/2003 ICME Tutorial, Baltimore 10

Gene Profiling Example
N = 59 cases, D = 4026 genes

7/6/2003 ICME Tutorial, Baltimore 11

Image Retrieval Demo

N < D (N < 50, D = 150)

N+ << N−

ACM SIGMOD 01; ACM MM 01, 02; IEEE CVPR 03
Also see my Web site

7/6/2003 ICME Tutorial, Baltimore 12

SVMactive

7/6/2003 ICME Tutorial, Baltimore 13

SVMactive

7/6/2003 ICME Tutorial, Baltimore 14

SVMactive

7/6/2003 ICME Tutorial, Baltimore 15

SVMactive

7/6/2003 ICME Tutorial, Baltimore 16

Ranking

7/6/2003 ICME Tutorial, Baltimore 17

Solution Summary

N < D
ACM MM 2001 (SVM Active): make each u in U most informative
PCM 2002, ICIP 2003: increase N− through co-training
ACM MM 2002 (DPF): reduce D

N+ << N−
ACM MM 2003, ICML 2003: conformal transformation, kernel boundary alignment

7/6/2003 ICME Tutorial, Baltimore 18

Outline

Statistical Learning
MM Applications' Data Characteristics
Classical Models (Classification)
Kernel Methods

Linear Model View
Nearest Neighbor View
Geometric View

Dimension Reduction Methods

7/6/2003 ICME Tutorial, Baltimore 19

Classical Methods

Linear Model
Least Square
Maximum Likelihood
Naïve Bayesian
LDA
Maximum Margin Hyperplane

Nearest Neighbor

7/6/2003 ICME Tutorial, Baltimore 20

Linear Regression

7/6/2003 ICME Tutorial, Baltimore 21

Least Square

Y = β0 + Σj βj Xj (j = 1..D)
Y = X^T β
RSS(β) = (Y − Xβ)^T (Y − Xβ)

RSS: residual sum of squares
β = (X^T X)^(-1) X^T Y
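A minimal NumPy sketch of the closed-form solution above; the synthetic data and parameter values are mine, purely for illustration:

```python
import numpy as np

# Synthetic data: N = 100 instances, D = 3 features plus an intercept column.
rng = np.random.default_rng(0)
X = np.hstack([np.ones((100, 1)), rng.normal(size=(100, 3))])
beta_true = np.array([1.0, 2.0, -0.5, 0.3])
Y = X @ beta_true + 0.1 * rng.normal(size=100)

# beta = (X^T X)^(-1) X^T Y -- solve the normal equations directly.
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# RSS(beta) = (Y - X beta)^T (Y - X beta)
rss = (Y - X @ beta_hat) @ (Y - X @ beta_hat)
print(beta_hat, rss)
```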

7/6/2003 ICME Tutorial, Baltimore 22

Maximum Likelihood

Y = β0 + Σj βj Xj (j = 1..p)
Y = X^T β
Y = X^T β + ε

ε (noise signals) are independent, ε ~ N(0, σ²)

P(y | βx) has a normal distribution with
mean at y = βx
variance σ²

7/6/2003 ICME Tutorial, Baltimore 23

Maximum Likelihood

P(y | βx) ~ N(βx, σ²)

Training:
Given (x1, y1), (x2, y2), …, (xn, yn)
Infer P(β | x1, x2, …, xn, y1, y2, …, yn)
by Bayes rule, or
Maximum Likelihood Estimate

7/6/2003 ICME Tutorial, Baltimore 24

Maximum Likelihood

For what β is
P(y1, y2, …, yn | x1, x2, …, xn, β) maximized?
Π P(yi | βxi) maximized?
Π exp(−½ ((yi − βxi)/σ)²) maximized?
Σ −½ ((yi − βxi)/σ)² maximized?
Σ (yi − βxi)² minimized?

7/6/2003 ICME Tutorial, Baltimore 25

Least Square Linear Model

Solution Method #1
RSS(β) = (Y − Xβ)^T (Y − Xβ)
β = (X^T X)^(-1) X^T Y

Solution Method #2 (for D > N)
Gradient descent
Perceptron
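When D > N, X^T X is singular and the closed form above breaks down; here is a hedged sketch of Solution Method #2, plain gradient descent on RSS (the learning rate, iteration count, and random data are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)
N, D = 30, 150                         # D > N, as in the multimedia setting
X = rng.normal(size=(N, D))
Y = rng.normal(size=N)

beta = np.zeros(D)
lr = 1e-3
for _ in range(5000):
    grad = -2 * X.T @ (Y - X @ beta)   # gradient of RSS(beta)
    beta -= lr * grad
print(np.sum((Y - X @ beta) ** 2))     # RSS after descent
```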

7/6/2003 ICME Tutorial, Baltimore 26

Other Linear Models

LDA
Find the projection direction which minimizes the overlap of the two Gaussian distributions

Separating Hyperplane

7/6/2003 ICME Tutorial, Baltimore 27

LDA

7/6/2003 ICME Tutorial, Baltimore 28

7/6/2003 ICME Tutorial, Baltimore 29

Separating Hyperplane

7/6/2003 ICME Tutorial, Baltimore 30

Separating Hyperplane

7/6/2003 ICME Tutorial, Baltimore 31

Maximum Margin Hyperplane

Only support vectors are involved in class prediction!

7/6/2003 ICME Tutorial, Baltimore 32

Linear Models

N ≥ D
Least Square
LDA

D > N
Perceptron (using gradient descent)
Maximum Margin Hyperplane

Generative vs. Discriminative Model

7/6/2003 ICME Tutorial, Baltimore 33

Linear Model Fits All Data?

7/6/2003 ICME Tutorial, Baltimore 34

How about Joining the Dots?

Y(x) = (1/k) Σ yi, xi ∈ Nk(x)
k = 1

7/6/2003 ICME Tutorial, Baltimore 35

Linear Model Fits All?

7/6/2003 ICME Tutorial, Baltimore 36

NN with k = 1

7/6/2003 ICME Tutorial, Baltimore 37

Nearest Neighbor

Four Things Make a Memory Based Learner

A distance function?
K: number of neighbors to consider?
A weighting function (optional)?
How to fit with the local points?
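A compact sketch of the four ingredients above with the simplest choices (Euclidean distance, k neighbors, uniform weights, a constant local fit); the toy data and function name are mine:

```python
import numpy as np

def knn_predict(X_train, y_train, x_query, k=1):
    """Memory-based learner: distance function, k, uniform weights, constant fit."""
    dists = np.linalg.norm(X_train - x_query, axis=1)   # 1. distance function
    nearest = np.argsort(dists)[:k]                     # 2. k neighbors
    return np.sign(np.mean(y_train[nearest]))           # 3-4. average the labels

X_train = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
y_train = np.array([-1, -1, +1])
print(knn_predict(X_train, y_train, np.array([1.8, 1.9]), k=1))
```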

7/6/2003 ICME Tutorial, Baltimore 38

Problems of K=1

Fitting noise
Jagged boundaries

7/6/2003 ICME Tutorial, Baltimore 39

Solutions

Fitting noise: pick a larger K?

7/6/2003 ICME Tutorial, Baltimore 40

NN with k = 15

7/6/2003 ICME Tutorial, Baltimore 41

NN

7/6/2003 ICME Tutorial, Baltimore 42

Solutions

Fitting noise: pick a larger K?

Jagged boundaries: introduce a kernel as a weighting function

7/6/2003 ICME Tutorial, Baltimore 43

Nearest Neighbor → Kernel Method

Four Things Make a Memory Based Learner

A distance function
K: number of neighbors to consider? All
A weighting function: RBF kernels
How to fit with the local points? Predict weights

7/6/2003 ICME Tutorial, Baltimore 44

Kernel Method

RBF Weighted Function
Kernel width holds the key (implying K)
Use cross validation to find the "optimal" width

Fitting with the Local Points
Where NN meets the Linear Model
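A hedged sketch of this RBF-weighted variant: all training points vote, weighted by exp(−‖x − xi‖² / (2σ²)), and the width σ is chosen by simple leave-one-out error. The function names, the candidate widths, and the toy data are illustrative assumptions, not the tutorial's code:

```python
import numpy as np

def rbf_predict(X, y, x_query, sigma):
    w = np.exp(-np.sum((X - x_query) ** 2, axis=1) / (2 * sigma ** 2))
    return np.sign(np.sum(w * y) / np.sum(w))        # weighted vote over ALL points

def loo_error(X, y, sigma):
    errs = 0
    for i in range(len(y)):                          # leave-one-out cross validation
        mask = np.arange(len(y)) != i
        errs += rbf_predict(X[mask], y[mask], X[i], sigma) != y[i]
    return errs / len(y)

rng = np.random.default_rng(2)
X = rng.normal(size=(40, 2))
y = np.sign(X[:, 0] + 0.3 * rng.normal(size=40))
best = min([0.1, 0.3, 1.0, 3.0], key=lambda s: loo_error(X, y, s))
print("chosen kernel width:", best)
```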

7/6/2003 ICME Tutorial, Baltimore 45

LM vs. NN

Linear Model
f(x) is approximated by a global linear function
More stable, less flexible

Nearest Neighbor
K-NN assumes f(x) is well approximated by a locally constant function
Less stable, more flexible

Between LM and NN: the other models…

7/6/2003 ICME Tutorial, Baltimore 46

Decision Theories

Bias & Variance Tradeoff
Bayes Prediction
VC Dimensionality
PAC Learnability

7/6/2003 ICME Tutorial, Baltimore 47

Variance vs. Bias

MSE(x) = E_T[(f(x) − ŷ)²]
       = E_T[(ŷ − E_T(ŷ))²] + (E_T(ŷ) − f(x))²

Error = Var_T(ŷ) + Bias²(ŷ)
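A small simulation of the decomposition above: refit an estimator on many fresh training sets T and check that MSE equals variance plus squared bias at a fixed test point. The setup (a k-NN mean estimator on a sine curve) is just an illustration I chose:

```python
import numpy as np

rng = np.random.default_rng(3)
f = np.sin                       # true function f(x)
x0, k, trials = 1.0, 5, 2000     # fixed test point, neighborhood size, repeats

preds = []
for _ in range(trials):          # each trial draws one training set T
    X = rng.uniform(0, 3, 50)
    Y = f(X) + 0.3 * rng.normal(size=50)
    nearest = np.argsort(np.abs(X - x0))[:k]
    preds.append(Y[nearest].mean())            # y_hat = k-NN estimate at x0

preds = np.array(preds)
mse = np.mean((f(x0) - preds) ** 2)
var = preds.var()
bias2 = (preds.mean() - f(x0)) ** 2
print(mse, var + bias2)          # identical (the decomposition is an identity)
```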

7/6/2003 ICME Tutorial, Baltimore 48

Variance vs. Bias

7/6/2003 ICME Tutorial, Baltimore 49

Outline

Statistical Learning
Emerging Applications' Data Characteristics
Classical Models (Classification)
Kernel Methods
Dimension Reduction Methods

7/6/2003 ICME Tutorial, Baltimore 50

Where Are We and Where Are We Heading?

LM and NN
Kernel Methods: Three Views

LM view
NN view
Geometric view

7/6/2003 ICME Tutorial, Baltimore 51

Linear Model View

Y = β0 + Σ β X
Separating Hyperplane

max_{||β|| = 1} C
subject to yi f(xi) ≥ C, or
yi (β0 + β·xi) ≥ C

7/6/2003 ICME Tutorial, Baltimore 52

Classifier Margin

Margin: defined as the width of the boundary before hitting a data object

Maximum Margin
Tends to minimize classification variance
No formal theory for this yet

7/6/2003 ICME Tutorial, Baltimore 53

Separating Hyperplane

7/6/2003 ICME Tutorial, Baltimore 54

M’s Mathematical Representation

Plus-plane: {x : w·x + b = +1}

Minus-plane: {x : w·x + b = −1}

w ⊥ plus-plane: w·(u − v) = 0 if u and v are on the plus-plane

w ⊥ Minus-plane

7/6/2003 ICME Tutorial, Baltimore 55

Separating Hyperplane

7/6/2003 ICME Tutorial, Baltimore 56

M

Let x− be any point on the minus-plane
Let x+ be the closest plus-plane point to x−

x+ = x− + λw. Why? The line (x+, x−) ⊥ minus-plane

M = |x+ - x-|

7/6/2003 ICME Tutorial, Baltimore 57

M

1. w·x− + b = −1
2. w·x+ + b = +1
3. x+ = x− + λw
4. M = |x+ − x−|
5. w·(x− + λw) + b = 1 (from 2 & 3)
6. w·x− + b + λ w·w = 1
7. λ w·w = 2

7/6/2003 ICME Tutorial, Baltimore 58

M

1. λ w·w = 2
2. λ = 2 / (w·w)
3. M = |x+ − x−| = |λw| = λ|w| = 2/|w|

4. Max M
Gradient descent, simulated annealing, EM, Newton's method…

7/6/2003 ICME Tutorial, Baltimore 59

Max M

Max M = 2/|w|
⇔ Min |w|/2
⇔ Min |w|²/2

subject to yi (xi·w + b) ≥ 1, i = 1, …, N

Quadratic criterion with linear inequality constraints

7/6/2003 ICME Tutorial, Baltimore 60

Max M

Min |w|²/2
subject to yi (xi·w + b) ≥ 1, i = 1, …, N

Lp = min_{w,b} |w|²/2 − Σi=1..N αi [yi (xi·w + b) − 1]

w = Σi=1..N αi yi xi

0 = Σi=1..N αi yi

7/6/2003 ICME Tutorial, Baltimore 61

Wolfe Dual

Ld = Σi=1..N αi − ½ Σi,j=1..N αi αj yi yj xi·xj

Subject to αi ≥ 0
αi [yi (xi·w + b) − 1] = 0
KKT conditions:
- αi > 0 ⇒ yi (xi·w + b) = 1 (support vectors)
- αi = 0 ⇒ yi (xi·w + b) > 1

7/6/2003 ICME Tutorial, Baltimore 62

Class Prediction
yq = w·xq + b

w = Σi=1..N αi yi xi

yq = sign(Σi=1..N αi yi (xi·xq) + b)
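A minimal sketch of the prediction rule above: take the dual coefficients from an off-the-shelf linear SVM (scikit-learn is my stand-in here, not the tutorial's code) and reproduce yq = sign(Σ αi yi (xi·xq) + b) by hand:

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[0., 0.], [0., 1.], [2., 2.], [2., 3.]])
y = np.array([-1, -1, 1, 1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)    # very large C ~ hard margin

xq = np.array([1.5, 2.0])
# dual_coef_ holds alpha_i * y_i for the support vectors
score = clf.dual_coef_ @ (clf.support_vectors_ @ xq) + clf.intercept_
print(np.sign(score), clf.predict([xq]))       # the two predictions match
```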

7/6/2003 ICME Tutorial, Baltimore 63

Non-separable Classes

Soft Margin Hyperplane
Basis Expansion

7/6/2003 ICME Tutorial, Baltimore 64

Non-separable Case

7/6/2003 ICME Tutorial, Baltimore 65

Soft Margin SVMs

Min |w|²/2
subject to yi (xi·w + b) ≥ 1, i = 1, …, N

Min |w|²/2 + C Σ εi

xi·w + b ≥ +1 − εi if yi = +1
xi·w + b ≤ −1 + εi if yi = −1
εi ≥ 0
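A quick illustration of the role of C above, again using scikit-learn as a stand-in (the data, which is deliberately non-separable, is synthetic): small C tolerates margin violations (large Σεi) and gives a wide margin, large C penalizes them and gives a narrow one.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(4)
# Two overlapping Gaussian blobs: not linearly separable.
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(1.5, 1, (50, 2))])
y = np.hstack([-np.ones(50), np.ones(50)])

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margin = 2 / np.linalg.norm(clf.coef_[0])        # M = 2 / |w|
    print(f"C={C:6}: margin={margin:.2f}, support vectors={clf.n_support_.sum()}")
```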

7/6/2003 ICME Tutorial, Baltimore 66

Non-separable Case

7/6/2003 ICME Tutorial, Baltimore 67

Wolfe Dual

Ld = Σi=1..N αi − ½ Σi,j=1..N αi αj yi yj xi·xj

Subject to C ≥ αi ≥ 0
Σ αi yi = 0
KKT conditions

yq = sign(Σi=1..N αi yi (xi·xq) + b)

7/6/2003 ICME Tutorial, Baltimore 68

Basis Function

7/6/2003 ICME Tutorial, Baltimore 69

Harder 1D Example

7/6/2003 ICME Tutorial, Baltimore 70

Basis Function

Φ(x) = (x, x²)

7/6/2003 ICME Tutorial, Baltimore 71

Harder 1D Example

7/6/2003 ICME Tutorial, Baltimore 72

Some Basis Functions

Φ(X) = Σm γm hm(X), where hm: R^p → R

Common functions:
Polynomial
Radial basis functions
Sigmoid functions

7/6/2003 ICME Tutorial, Baltimore 73

Kernel Function

Ld = Σi=1..N αi − ½ Σi,j=1..N αi αj yi yj Φ(xi)·Φ(xj)

Subject to
C ≥ αi ≥ 0
Σ αi yi = 0
KKT conditions

yq = sign(Σi=1..N αi yi (Φ(xi)·Φ(xq)) + b)
K(xi, xj) = Φ(xi)·Φ(xj)

Kernel function!

7/6/2003 ICME Tutorial, Baltimore 74

Quadratic Basis Functions

Φ(a) = {1, ai, ai aj}, i, j = 1..D
(D+1)(D+2)/2 terms ≈ D² terms
O(D²) computational cost

Computing the dot product Φ(a)·Φ(b) is equivalent to computing (a·b + 1)²

O(D) computational cost
Total computational cost: O(N²D)
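A numeric check of this equivalence. I use the usual convention that the linear terms of the quadratic feature map carry a √2 scaling and the quadratic terms range over all D² ordered pairs (an assumption on my part; with that convention the identity Φ(a)·Φ(b) = (a·b + 1)² holds exactly):

```python
import numpy as np

def phi(a):
    """Explicit quadratic basis expansion matching the (a.b + 1)^2 kernel."""
    return np.concatenate(([1.0],
                           np.sqrt(2) * a,              # linear terms
                           np.outer(a, a).ravel()))     # all D^2 quadratic terms

rng = np.random.default_rng(5)
a, b = rng.normal(size=4), rng.normal(size=4)
print(phi(a) @ phi(b), (a @ b + 1) ** 2)   # identical, up to float rounding
```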

7/6/2003 ICME Tutorial, Baltimore 75

Dot Product Saves the Day

Dot product (kernel) evaluation: O(N²D)

Explicit basis expansion:
Quadratic: O(N²D²)
Cubic: O(N²D³)
Quartic: O(N²D⁴)

7/6/2003 ICME Tutorial, Baltimore 76

Quiz

What is the kernel function for a polynomial of degree d?
(a·b + 1)^d

7/6/2003 ICME Tutorial, Baltimore 77

Outline

LM and NN
Kernel Methods: Three Views

LM view
NN view
Geometric view

7/6/2003 ICME Tutorial, Baltimore 78

Nearest Neighbor View

Z: a set of zero-mean, jointly Gaussian random variables

Each Zi corresponds to one example xi

Cov(zi, zj) = K(xi, xj)
yi, the label of zi, is +1 or −1

P(yi | zi) = σ(yi,zi)

7/6/2003 ICME Tutorial, Baltimore 79

Training Data

7/6/2003 ICME Tutorial, Baltimore 80

General Kernel Classifier [Jaakkola et al. 99]

MAP classification for xt:
yt = sign(Σ αi yi K(xt, xi))
K(xi, xj) = Cov(zi, zj) (some similarity function)

Supervised training: compute αi
Given X and y, and an error function such as
J(α) = −½ Σ αi αj yi yj K(xi, xj) + Σ F(αi)

7/6/2003 ICME Tutorial, Baltimore 81

Leave One Out

7/6/2003 ICME Tutorial, Baltimore 82

SVMs
yt = sign(Σ αi yi K(xt, xi))
(yi, xi) training data, αi nonnegative, and kernel K positive definite
αi is obtained by maximizing

J(α) = −½ Σ αi αj yi yj K(xi, xj) + Σ F(αi)
F(αi) = αi

αi ≥ 0, Σ yi αi = 0

7/6/2003 ICME Tutorial, Baltimore 83

Important Insight

K(xi, xj) = Cov(zi, zj)
To design a kernel is to design a similarity function that produces a positive definite covariance matrix on the training instances
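One way to sanity-check this insight in code: build the kernel (covariance) matrix on the training instances and confirm its eigenvalues are non-negative. The RBF kernel below is just one example of a valid choice, and the data is synthetic:

```python
import numpy as np

def rbf_kernel_matrix(X, sigma=1.0):
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2 * sigma ** 2))

rng = np.random.default_rng(6)
X = rng.normal(size=(20, 5))
K = rbf_kernel_matrix(X)             # K[i, j] = K(x_i, x_j) = Cov(z_i, z_j)
eigvals = np.linalg.eigvalsh(K)
print(eigvals.min() >= -1e-10)       # True: K is positive (semi-)definite
```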

7/6/2003 ICME Tutorial, Baltimore 84

Basis Function Selection

Three General Approaches
Restriction methods: limit the class of functions
Selection methods: scan the dictionary adaptively (Boosting)
Regularization methods: use the entire dictionary but restrict the coefficients (Ridge Regression)

7/6/2003 ICME Tutorial, Baltimore 85

Overfitting?

Probably not, because:

N free parameters (not D)
Maximizing the margin

7/6/2003 ICME Tutorial, Baltimore 86

Geometrical View

S = w·x + b, |w| = 1, b = 0
V = {w : yi f(xi) > 0, i = 1..n, |w| = 1}
SVM is the center of the largest sphere contained in V

7/6/2003 ICME Tutorial, Baltimore 87

SVMs

7/6/2003 ICME Tutorial, Baltimore 88

BPMs

Bayes Objective Function
Ŝt = Bayes_Z(xt) = argmin_{Si ∈ S} E_{H|Z=x}[ l(H(x), Si) ]

BPMs [Herbrich et al. 2001]
A_bp = argmin_{h ∈ H} E_x[ E_{H|Z=x}[ l(H(x), h(x)) ] ]

7/6/2003 ICME Tutorial, Baltimore 89

BPMs

Linear classifier
Input X possesses a spherical Gaussian density

The BP is the center of mass of the version space

7/6/2003 ICME Tutorial, Baltimore 90

BPMs vs. SVMs

7/6/2003 ICME Tutorial, Baltimore 91

BPMs

Use SVMs to find a good h in H
Find the BP

Billiard Algorithm [Herbrich et al. 2001]

Perceptron Algorithm [Herbrich et al. 2001]

7/6/2003 ICME Tutorial, Baltimore 92

Billiard Ball Algorithm (R. Herbrich)

7/6/2003 ICME Tutorial, Baltimore 93

Outline

Statistical Learning
Emerging Applications' Data Characteristics
Classical Models (Classification)
Kernel Methods
Dimension Reduction Methods

7/6/2003 ICME Tutorial, Baltimore 94

Similarity Measurement

7/6/2003 ICME Tutorial, Baltimore 95

Perceptual Distance Function
Two Monumental Challenges

Formulating a perceptual feature space
Formulating a perceptual distance function

7/6/2003 ICME Tutorial, Baltimore 96

Dimensionality Curse

D: data dimension
When D increases:
Nearest neighbors are not local
All points are equally distant

7/6/2003 ICME Tutorial, Baltimore 97

Sparse High-D Space [C. Aggarwal et al., ICDT 2001]

Hyper-cube Range Queries

P[s, d] = s^d
(the probability that a hyper-cube range query with side s captures a uniformly distributed point in d dimensions)

7/6/2003 ICME Tutorial, Baltimore 98

7/6/2003 ICME Tutorial, Baltimore 99

Sparse High-D Space

Spherical Range Queries

7/6/2003 ICME Tutorial, Baltimore 100

P[P ∈ sp(Q, 0.5)] = (0.5)^d · π^(d/2) / Γ(d/2 + 1)
(the probability that a spherical range query of radius 0.5 captures a uniformly distributed point in the unit d-cube)
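The two probabilities above are easy to tabulate; the short sketch below evaluates them (taking s = 0.5 for the hyper-cube query, my choice for illustration) to show how quickly a "local" query becomes empty as d grows:

```python
from math import lgamma, log, pi, exp

for d in (2, 10, 50, 100):
    p_cube = 0.5 ** d                  # hyper-cube query: P = s^d with s = 0.5
    # sphere of radius 0.5: volume = pi^(d/2) * 0.5^d / Gamma(d/2 + 1)
    p_sphere = exp((d / 2) * log(pi) + d * log(0.5) - lgamma(d / 2 + 1))
    print(f"d={d:3}: cube query {p_cube:.2e}, sphere query {p_sphere:.2e}")
```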

7/6/2003 ICME Tutorial, Baltimore 101

7/6/2003 ICME Tutorial, Baltimore 102

Dimensionality Curse

7/6/2003 ICME Tutorial, Baltimore 103

So?

Is the nearest-neighbor estimate cursed in high-D spaces?

Yes! When D is large and N is relatively small, the estimate is off!

7/6/2003 ICME Tutorial, Baltimore 104

Are We Doomed?

How does the curse affect classification?
Similar objects tend to cluster together
Classification makes binary predictions

7/6/2003 ICME Tutorial, Baltimore 105

Distribution of Distances

7/6/2003 ICME Tutorial, Baltimore 106

Some Solutions to High-D

Restricted estimators: specify the nature of the local neighborhood

Adaptive feature reduction: PCA, LDA

Dynamic Partial Function

7/6/2003 ICME Tutorial, Baltimore 107

Three Major Paradigms

Preserve data description in a lower-dimensional space: PCA

Maximize discriminability in a lower-dimensional space: LDA

Activate only similar channels: DPF
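Of the three paradigms just listed, the first is the simplest to sketch in code: PCA via the SVD of the centered data (the target dimensionality of 2 and the random data are arbitrary illustrative choices).

```python
import numpy as np

def pca_reduce(X, k=2):
    """Project X onto its top-k principal directions."""
    Xc = X - X.mean(axis=0)                          # center the data
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T                             # N x k representation

rng = np.random.default_rng(7)
X = rng.normal(size=(100, 50))
print(pca_reduce(X, k=2).shape)                      # (100, 2)
```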

7/6/2003 ICME Tutorial, Baltimore 108

Minkowski Distance

Objects P and Q:
D(P, Q) = (Σi=1..M |pi − qi|^n)^(1/n)

Similar images are similar in all M features

7/6/2003 ICME Tutorial, Baltimore 109

[Figure: two histograms of per-feature distances; y-axis: frequency (log scale, 1.0E-06 to 1.0E-01), x-axis: feature distance (0.00 to 0.95).]

7/6/2003 ICME Tutorial, Baltimore 110

Weighted Minkowski Distance

D(P, Q) = (Σi=1..M wi |pi − qi|^n)^(1/n)

Similar images are similar in the same subset of the M features

7/6/2003 ICME Tutorial, Baltimore 111

[Figure: average per-feature distance (y-axis: Average Distance; x-axis: Feature Number, 1–144) between images and their variants under four transformations — GIF re-encoding, scale up/down, cropping, and rotation; the underlying data tables are omitted.]

7/6/2003 ICME Tutorial, Baltimore 112

Similarity Theories

Objects are similar in all respects (Richardson 1928)
Objects are similar in some respects (Tversky 1977)
Similarity is a process of determining respects, rather than using predefined respects (Goldstone 94)

7/6/2003 ICME Tutorial, Baltimore 113

DPF

Which place is similar to DC?
Partial
Dynamic
Dynamic Partial Function
See ACM MM 2002, ICIP 2002, ACM Multimedia Journal
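A hedged sketch of the "activate only similar channels" idea behind DPF: like a Minkowski distance, but summing over only the m feature differences that happen to be smallest for this particular pair, so the activated subset changes dynamically from pair to pair. The parameter names m and r, their values, and the random 144-dimensional features are illustrative assumptions; see the cited papers for the exact formulation.

```python
import numpy as np

def minkowski(p, q, r=2):
    return np.sum(np.abs(p - q) ** r) ** (1 / r)       # all M features

def dpf(p, q, m=100, r=2):
    diffs = np.sort(np.abs(p - q))[:m]                 # activate the m most similar channels
    return np.sum(diffs ** r) ** (1 / r)               # "dynamic": the subset changes per pair

rng = np.random.default_rng(8)
p, q = rng.random(144), rng.random(144)                # e.g., 144 image features
print(minkowski(p, q), dpf(p, q, m=100))
```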

7/6/2003 ICME Tutorial, Baltimore 114

Precision/Recall

7/6/2003 ICME Tutorial, Baltimore 115

Summary

Statistical Learning
Emerging Applications' Data Characteristics
Classical Models (Classification)
Kernel Methods

Linear Model View
Nearest Neighbor View
Geometric View

Dimension Reduction Methods

7/6/2003 ICME Tutorial, Baltimore 116

Advanced Topics

Imbalanced Data Learning
N− >> N+

See our ICML 2003 papers:
Sequence-data Kernel
Kernel Alignment & Boosting

7/6/2003 ICME Tutorial, Baltimore 117

Useful Links

Related Publications
http://www-db.stanford.edu/~echang/

Online Demo
VIMA Technologies

Six deployments as of July 2003
www.vimatech.com

7/6/2003 ICME Tutorial, Baltimore 118

References
1. The Elements of Statistical Learning, T. Hastie, R. Tibshirani, and J. Friedman, Springer, N.Y., 2001
2. Machine Learning, T. Mitchell, 1997
3. High-dimensional Data Analysis, D. Donoho, American Math. Society Lecture, 2000
4. Support Vector Machine Active Learning for Image Retrieval, S. Tong and E. Chang, ACM MM, 2001
5. Dynamic Partial Function, B. Li and E. Chang, ACM Multimedia Journal, 2003
6. Pattern Discovery in Sequences under a Markov Assumption, D. Chudova and P. Smyth, ACM KDD 2002
7. Bayes Point Machines, R. Herbrich, T. Graepel and C. Campbell, Journal of Machine Learning Research, 2001
8. The Nature of Statistical Learning Theory, V. Vapnik, Springer, N.Y., 1995
9. Probabilistic Kernel Regression Models, T. Jaakkola and D. Haussler, Conference on AI and Statistics, 1999
10. Support Vector Machines, Lecture Notes, A. Moore, CMU
11. On the Surprising Behavior of Distance Metrics in High-dimensional Space, C. Aggarwal, A. Hinneburg, and D. Keim, ICDT 2001
12. Adaptive Conformal Transformation for Learning Imbalanced Data, G. Wu and E. Chang, International Conference on Machine Learning, August 2003