Data Complexity Analysis for Classifier Combination
Tin Kam Ho, Bell Laboratories, Lucent Technologies

Transcript of the presentation

Page 1

Data Complexity Analysis for Classifier Combination
Tin Kam Ho
Bell Laboratories, Lucent Technologies

Page 2

Outline

• Classifier Combination
  – Motivation, methods, difficulties

• Data Complexity Analysis
  – Motivation, methods, early results

Page 3
Page 4

Supervised Classification -- Discrimination Problems

Page 5

An Ill-Posed Problem

Page 6

Where Were We in the Late 1990’s?

• Statistical Methods
  – Bayesian classifiers, polynomial discriminators, nearest neighbors, decision trees, neural networks, support vector machines, …

• Syntactic Methods
  – regular grammars, context-free grammars, attributed grammars, stochastic grammars, …

• Structural Methods
  – graph matching, elastic matching, rule-based systems, …

Page 7

Classifiers

• Competition among different …
  – choices of features
  – feature representations
  – classifier designs

• Chosen by heuristic judgements

• No clear winners

Page 8

Classifier Combination Methods

• Decision optimization methods
  – find consensus from a given set of classifiers
  – majority/plurality vote, sum/product rule (sketched below)
  – probability models, Bayesian approaches
  – logistic regression on ranks or scores
  – classifiers trained on confidence scores
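As a concrete illustration of the first two fixed rules, here is a minimal sketch (my own, not code from the talk; the function names and the NumPy dependency are assumptions) of plurality voting on predicted labels and the sum rule on per-class scores.

```python
# A minimal sketch (not from the talk) of two fixed combination rules.
import numpy as np

def plurality_vote(label_votes):
    """label_votes: (n_classifiers, n_samples) array of predicted labels."""
    n_classes = label_votes.max() + 1
    # Per-sample histogram of votes, shape (n_classes, n_samples).
    counts = np.apply_along_axis(np.bincount, 0, label_votes, minlength=n_classes)
    return counts.argmax(axis=0)

def sum_rule(score_stack):
    """score_stack: (n_classifiers, n_samples, n_classes) probability scores."""
    return score_stack.sum(axis=0).argmax(axis=1)

# Three classifiers, four samples, two classes.
votes = np.array([[0, 1, 1, 0],
                  [0, 1, 0, 0],
                  [1, 1, 1, 0]])
print(plurality_vote(votes))   # -> [0 1 1 0]
```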

Page 9

Classifier Combination Methods

• Coverage optimization methods (two of these are sketched below)
  – subsampling methods: stacking, bagging, boosting
  – subspace methods: random subspace projection, localized selection
  – superclass/subclass methods: mixture of experts, error-correcting output codes
  – perturbation in training
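The following is a minimal sketch contrasting two of these coverage-optimization ensembles over decision trees, using scikit-learn class names as an assumption of the sketch; it illustrates the idea, not the talk's own implementation.

```python
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

base = DecisionTreeClassifier()

# Bagging: each tree is trained on a bootstrap subsample of the points.
bagging = BaggingClassifier(base, n_estimators=100, bootstrap=True)

# Random subspace method: each tree sees all points but a random half
# of the features.
subspace = BaggingClassifier(base, n_estimators=100,
                             bootstrap=False, max_features=0.5)

# Usage with some training data X, y (hypothetical here):
# bagging.fit(X, y); subspace.fit(X, y)
```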

Page 10

Layers of Choices

Best Features?

Best Classifier?

Best Combination Method?

Best (combination of)* combination methods?

Page 11

Before We Start …

• Single or multiple classifiers?
  – accuracy vs. efficiency

• Feature or decision combination?
  – compatible units, common metric

• Sequential or parallel classification?
  – efficiency, classifier specialties

Page 12

Difficulties in Decision Optimization

• Reliability versus overall accuracy

• Fixed or trainable combination function

• Simple models or combinatorial estimates

• How to model complementary behavior

Page 13

Difficulties in Coverage Optimization

• What kind of differences to introduce:
  – Subsamples? Subspaces? Subclasses?
  – Training parameters?
  – Model geometry?

• 3-way tradeoff:
  – discrimination + uniformity + generalization

• Effects of the form of component classifiers

Page 14

Dilemmas and Paradoxes

• Weaken individuals for a stronger whole?

• Sacrifice known samples for unseen cases?

• Seek agreements or differences?

Page 15

Model of Complementary Decisions

• Statistical independence of decisions: assumed or observed?

• Collective vs point-wise error estimates

• Related estimates of neighboring samples

Page 16

Stochastic Discrimination

• Set-theoretic abstraction

• Probabilities in model or feature spaces

• Enrichment / Uniformity / Projectability

• Convergence by law of large numbers

Page 17

Stochastic Discrimination

• Algorithm for uniformity enforcement

• Fewer, but more sophisticated classifiers

• Other ways to address the 3-way trade-off

Page 18

Geometry vs Probability

• Geometry of classifiers

• Rule of generalization to unseen samples

• Assumption of representative samples
  – Optimistic error bounds

• Distribution-free arguments
  – Pessimistic error bounds

Page 19

Sampling Density

[Figure: datasets sampled at N = 2, 10, 100, 500, and 1000 points]

Page 20

Summary of Difficulties

• Many theories have inadequate assumptions
• Geometry and probability lack connection
• Combinatorics defies detailed modeling
• Attempts to cover all cases give weak results
• Empirical results are overly specific to problems
• Lack of systematic organization of evidence

Page 21

More Questions

• How do confidence scores differ from feature values?
• Is combination a convenience or a necessity?
• What is common among various combination methods?
• When should the combination hierarchy terminate?

Page 22

Data Dependent Behavior of Classifiers

• Different classifiers excel in different problems

• So do combined systems

• This complicates theories and interpretation of observations

Page 23

Questions to ask:

• Does this method work for all problems?

• Does this method work for this problem?

• Does this method work for this type of problem?

Study the interaction of data and classifiers

Page 24

Characterization of Data and Classifier Behavior in a Common Language

Page 25

Sources of Difficulty in Classification

• Class ambiguity

• Boundary complexity

• Sample size and dimensionality

Page 26

Class Ambiguity

• Is the problem intrinsically ambiguous?

• Are the classes well defined?

• What is the information content of the features?

• Are the features sufficient for discrimination?

Page 27

Boundary Complexity

• Kolmogorov complexity

• Length may be exponential in dimensionality

• Trivial description: list all points, class labels

• Is there a shorter description?

Page 28

Sample Size & Dimensionality

• Problem may appear deceptively simple or complex with small samples

• Large degrees of freedom in high-dimensional spaces

• Representativeness of samples vs. generalization ability of classifiers

Page 29

Mixture of Effects

• Real problems often have mixed effects of
  – class ambiguity
  – boundary complexity
  – sample size & dimensionality

• Geometrical complexity of class manifolds coupled with probabilistic sampling process

Page 30

Easy or Difficult Problems

• Linearly separable problems

Page 31

Easy or Difficult Problems

• Random noise

[Figure: random noise datasets with 1000, 500, 100, and 10 points]

Page 32

Easy or Difficult Problems

• Others

[Figure: nonlinear boundary, spirals, 4x4 checkerboard, 10x10 checkerboard]

Page 33

Description of Complexity

• What are real-world problems like?

• Need a description of complexity to

– set expectation on recognition accuracy

– characterize behavior of classifiers

• Apparent or true complexity?

Page 34

Possible Measures

• Separability of classes
  – linear separability
  – length of class boundary
  – intra/inter class scatter and distances

• Discriminating power of features
  – Fisher’s discriminant ratio
  – overlap of feature values
  – feature efficiency

Page 35

Possible Measures

• Geometry, topology, clustering effects
  – curvature of boundaries
  – overlap of convex hulls
  – packing of points in regular shapes
  – intrinsic dimensionality
  – density variations

Page 36

Linear Separability

• Intensively studied in the early literature
• Many algorithms stop only with positive conclusions
  – Perceptrons, Perceptron Cycling Theorem, 1962
  – Fractional Correction Rule, 1954
  – Widrow-Hoff Delta Rule, 1960
  – Ho-Kashyap algorithm, 1965
  – Linear programming, 1968
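As a hedged sketch of the linear-programming approach listed last above (my own formulation, assuming NumPy and SciPy; the function name is mine), linear separability can be posed as a feasibility problem: the classes are separable iff the margin constraints below admit a solution.

```python
import numpy as np
from scipy.optimize import linprog

def is_linearly_separable(X, y):
    """X: (n, d) features; y: labels in {-1, +1}."""
    n, d = X.shape
    c = np.zeros(d + 1)                 # feasibility only: zero objective
    # y_i (w . x_i + b) >= 1  rewritten as  -y_i [x_i, 1] . [w, b] <= -1
    A_ub = -y[:, None] * np.hstack([X, np.ones((n, 1))])
    b_ub = -np.ones(n)
    res = linprog(c, A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * (d + 1), method="highs")
    return res.status == 0              # 0 = feasible, 2 = infeasible

# XOR-like labels are not linearly separable:
X = np.array([[0., 0.], [1., 1.], [0., 1.], [1., 0.]])
y = np.array([1, 1, -1, -1])
print(is_linearly_separable(X, y))      # -> False
```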

Page 37

Length of Class Boundary

• Friedman & Rafsky 1979– Find MST (minimum spanning tree)

connecting all points regardless of class

– Count edges joining opposite classes

– Sensitive to separability

and clustering effects
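A sketch of this measure as I read it (my own code, assuming NumPy and SciPy; not from the talk): build the MST over the pooled points and report the fraction of its edges that join points of opposite classes.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import minimum_spanning_tree

def boundary_edge_fraction(X, y):
    D = squareform(pdist(X))                  # pairwise Euclidean distances
    mst = minimum_spanning_tree(D).tocoo()    # the n-1 MST edges
    cross = y[mst.row] != y[mst.col]          # edges crossing the class boundary
    return float(cross.mean())

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(4, 1, (100, 2))])
y = np.array([0] * 100 + [1] * 100)
print(boundary_edge_fraction(X, y))           # small for well-separated classes
```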

Page 38

Fisher’s Discriminant Ratio

• Defined for one feature:

$$ f = \frac{(\mu_1 - \mu_2)^2}{\sigma_1^2 + \sigma_2^2} $$

where $\mu_1, \mu_2, \sigma_1^2, \sigma_2^2$ are the means and variances of classes 1 and 2

• One good feature makes a problem easy

• Take maximum over all features
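A small sketch of this measure (my own, assuming NumPy; the function name is mine): compute the per-feature ratio from the class means and variances, then take the maximum over features.

```python
import numpy as np

def max_fisher_ratio(X, y):
    c1, c2 = np.unique(y)
    X1, X2 = X[y == c1], X[y == c2]
    num = (X1.mean(axis=0) - X2.mean(axis=0)) ** 2
    den = X1.var(axis=0) + X2.var(axis=0)
    # Zero variance with distinct means = a perfectly discriminating feature.
    with np.errstate(divide="ignore", invalid="ignore"):
        ratios = np.where(den > 0, num / den, np.where(num > 0, np.inf, 0.0))
    return float(ratios.max())

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 3)), rng.normal([3, 0, 0], 1, (50, 3))])
y = np.array([0] * 50 + [1] * 50)
print(max_fisher_ratio(X, y))   # large: the first feature alone separates well
```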

Page 39

Volume of Overlap Region

• Overlap of class manifolds

• Overlap region of each dimension as a fraction of the range spanned by the two classes

• Multiply fractions to estimate volume

• Zero if no overlap
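A hedged sketch of the overlap-volume computation described above (my own, assuming NumPy): per feature, take the overlap of the two class ranges as a fraction of the joint range, then multiply across features.

```python
import numpy as np

def overlap_volume(X, y):
    """Product over features of each feature's class-range overlap fraction."""
    c1, c2 = np.unique(y)
    X1, X2 = X[y == c1], X[y == c2]
    lo1, hi1 = X1.min(axis=0), X1.max(axis=0)
    lo2, hi2 = X2.min(axis=0), X2.max(axis=0)
    overlap = np.maximum(0.0, np.minimum(hi1, hi2) - np.maximum(lo1, lo2))
    span = np.maximum(hi1, hi2) - np.minimum(lo1, lo2)   # range spanned by both
    with np.errstate(divide="ignore", invalid="ignore"):
        frac = np.where(span > 0, overlap / span, 0.0)
    return float(np.prod(frac))   # zero as soon as any feature has no overlap
```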

Page 40

Convex Hulls & Decision Regions

• Hoekstra & Duin 1996

• Measure nonlinearity of a classifier w.r.t.

a given dataset

• Sensitive to smoothness of decision boundaries
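One common way to estimate such a nonlinearity measure, as I understand the Hoekstra & Duin idea (my own sketch, assuming NumPy and any classifier with a scikit-learn-style predict method), is to score the trained classifier on points interpolated between random same-class training pairs; smooth, convex decision regions give low values.

```python
import numpy as np

def nonlinearity(clf, X, y, n_interp=1000, seed=0):
    rng = np.random.default_rng(seed)
    Xi, yi = [], []
    for c in np.unique(y):
        Xc = X[y == c]
        a = Xc[rng.integers(len(Xc), size=n_interp)]
        b = Xc[rng.integers(len(Xc), size=n_interp)]
        t = rng.random((n_interp, 1))
        Xi.append(a + t * (b - a))       # convex combinations of same-class pairs
        yi.append(np.full(n_interp, c))
    Xi, yi = np.vstack(Xi), np.concatenate(yi)
    return float(np.mean(clf.predict(Xi) != yi))

# Usage (hypothetical): fit a classifier first, then
# print(nonlinearity(fitted_clf, X_train, y_train))
```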

Page 41

Shapes of Class Manifolds

• Lebourgeois & Emptoz 1996

• Packing of same-class points in hyperspheres

• Thick and spherical, or thin and elongated manifolds
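A rough, heavily simplified sketch of the hypersphere-packing idea (my own approximation, assuming NumPy and SciPy; not the published algorithm itself): grow a ball around each point out to its nearest opposite-class neighbor, discard balls contained entirely in another ball, and report the fraction retained. Elongated manifolds keep fewer, larger balls.

```python
import numpy as np
from scipy.spatial.distance import cdist

def retained_ball_fraction(X, y):
    D = cdist(X, X)
    other = y[None, :] != y[:, None]
    radius = np.where(other, D, np.inf).min(axis=1)   # reach of each ball
    # Ball i is redundant if it lies entirely inside some other ball j.
    contained = (D + radius[:, None] <= radius[None, :]) & ~np.eye(len(X), dtype=bool)
    return float(np.mean(~contained.any(axis=1)))
```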

Page 42

Measures of Geometrical Complexity

Page 43

Space of Complexity Measures

• Single measure may not suffice

• Make a measurement space

• See where datasets are in this space

• Look for a continuum of difficulty:

Easiest cases ←→ most difficult cases

Page 44

Data Sets: UCI

• UC-Irvine collection

• 14 datasets (no missing values, > 500 points each)

• 844 two-class problems
  – 452 linearly separable
  – 392 linearly nonseparable

• 2 – 4648 points each

• 8 – 480 dimensional feature spaces

Page 45

Data Sets: Random Noise

• Randomly located and labeled points

• 100 artificial problems

• 1 to 100 dimensional feature spaces

• 2 classes, 1000 points per class
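A tiny sketch (my own, assuming NumPy) of how such a random-noise problem can be generated: uniformly located points with labels assigned at random, so any apparent structure is spurious.

```python
import numpy as np

def random_noise_problem(dim, n_per_class=1000, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.uniform(0.0, 1.0, size=(2 * n_per_class, dim))        # random locations
    y = rng.permutation([0] * n_per_class + [1] * n_per_class)    # random labels
    return X, y

X, y = random_noise_problem(dim=10)
```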

Page 46

Patterns in Measurement Space

Page 47

Correlated or Uncorrelated Measures

Page 48

[Figure: Separation + Scatter; Separation]

Page 49

Observations

• Noise sets and linearly separable sets occupy opposite ends in many dimensions

• In-between positions tell relative difficulty

• Fan-like structure in most plots

• At least 2 independent factors, joint effects

• Noise sets are far from real data

• Ranges of noise sets: apparent complexity

Page 50

Principal Component Analysis

Page 51

Principal Component Analysis

Page 52

A Trajectory of Difficulty

Page 53

What Else Can We Do?

• Find clusters in this space

• Determine intrinsic dimensionality

Study effectiveness of these measures

Interpret problem distributions

Page 54

What Else Can We Do?

• Study specific domains with these measures
• Study alternative formulations: subproblems induced by localization, projection, transformation

Apply these measures to more problems

Page 55

What Else Can We Do?

Relate complexity measures to classifier behavior

Page 56

Bagging vs Random Subspaces for Decision Forests

[Figure: regions where subspaces do better vs. where subsampling does better, plotted over Fisher’s discriminant ratio and length of class boundary]

Page 57

Bagging vs Random Subspaces for Decision Forests

[Figure: nonlinearity of nearest-neighbor vs. linear classifiers, against % retained adherence subsets and intra/inter class NN distances]

Page 58

Error Rates of Individual Classifiers

[Figure: error rate of nearest neighbors vs. linear classifier; error rate of single trees vs. sampling density]

Page 59

Observations

• Both types of forests are good for problems of various degrees of difficulty

• Neither is good for extremely difficult cases:
  – many points on the boundary
  – ratio of intra/inter class NN distances close to 1
  – low Fisher’s discriminant ratio
  – high nonlinearity of NN or LP classifiers

• Subsampling is preferable for sparse samples

• Subspaces are preferable for smooth boundaries

Page 60

Conclusions

• Real-world problems have different types of geometric characteristics

• Relevant measures can be related to classifier accuracies

• Data complexity analysis improves understanding of classifier or combination behavior

• Helpful for combination theory and practice