Data Visualization Data Visualization STAT 890, STAT 442, CM 462 Ali Ghodsi Department of Statistics School of Computer Science University of Waterloo aghodsib @uwaterloo.ca September 2006

Data Visualization

STAT 890, STAT 442, CM 462

Ali Ghodsi

Department of Statistics

School of Computer Science

University of Waterloo

aghodsib @uwaterloo.ca

September 2006

Two Problems

Classical Statistics

• Infer information from small data sets (Not enough data)

Machine Learning

• Infer information from large data sets (Too much data)

Other Names for ML

• Data mining

• Applied statistics

• Adaptive (stochastic) signal processing

• Probabilistic planning or reasoning

All of these are closely related to the second problem: inferring information from large data sets.

Applications

Machine Learning is most useful when the structure of the task is not well understood but can be characterized by a dataset with strong statistical regularity.

• Search and recommendation (e.g. Google, Amazon)

• Automatic speech recognition and speaker verification

• Text parsing

• Face identification

• Tracking objects in video

• Financial prediction, fraud detection (e.g. credit cards)

• Medical diagnosis

Tasks

• Supervised Learning: given examples of inputs and corresponding desired outputs, predict outputs on future inputs, e.g. classification, regression.

• Unsupervised Learning: given only inputs, automatically discover representations, features, structure, etc., e.g. clustering, dimensionality reduction, feature extraction.
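
A minimal sketch of the two tasks, assuming hypothetical one-dimensional data drawn from two groups. The nearest-class-mean rule and the 2-means loop are illustrative choices, not methods named on the slide. The supervised part uses the labels; the unsupervised part never sees them.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 20 inputs near 0.0 and 20 inputs near 5.0.
inputs = np.concatenate([rng.normal(0.0, 0.5, 20), rng.normal(5.0, 0.5, 20)])
labels = np.array([0] * 20 + [1] * 20)

# Supervised: use the given labels to compute one mean per class,
# then predict the class of a future input by the nearest class mean.
class_means = np.array([inputs[labels == c].mean() for c in (0, 1)])

def classify(x):
    """Return the class whose mean is closest to x."""
    return int(np.argmin(np.abs(class_means - x)))

# Unsupervised: 2-means clustering on the inputs alone (labels unused).
# Centers start at the smallest and largest input.
centers = np.array([inputs.min(), inputs.max()])
for _ in range(10):
    # Assign every input to its nearest center, then recompute the centers.
    assignment = np.argmin(np.abs(inputs[:, None] - centers[None, :]), axis=1)
    centers = np.array([inputs[assignment == k].mean() for k in (0, 1)])
```

On data this well separated, the clusters found without labels coincide with the labeled classes; on real data that agreement is not guaranteed.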

Dimensionality Reduction

• Dimensionality: The number of measurements available for each item in a data set.

• The dimensionality of real-world items is very high.

• For example: The dimensionality of a 600 by 600 image is 360,000.

• The key to analyzing data is comparing these measurements to find relationships among this plethora of data points.

• Usually these measurements are highly redundant, and relationships among data points are predictable.

Dimensionality Reduction

• Knowing the value of a pixel in an image, it is easy to predict the value of nearby pixels since they tend to be similar.

• Knowing that the word “corporation” occurs often in articles about economics but rarely in articles about art and poetry, it is easy to predict that it will rarely occur in articles about love.

• Although there are many measurements per item, far fewer are likely to vary. Using a data set that includes only the measurements likely to vary allows humans to quickly and easily recognize changes in high-dimensional data.
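
The redundancy among nearby pixels can be checked numerically. The sketch below is an illustration under stated assumptions: the smooth test image is a hypothetical intensity ramp, and `neighbor_correlation` is a helper introduced here, not something from the slides. It compares the correlation between horizontally adjacent pixels in a smooth image and in pure noise.

```python
import numpy as np

def neighbor_correlation(image):
    """Pearson correlation between each pixel and its right-hand neighbor."""
    left = image[:, :-1].ravel()
    right = image[:, 1:].ravel()
    return float(np.corrcoef(left, right)[0, 1])

rng = np.random.default_rng(0)

# Hypothetical smooth image: a diagonal intensity ramp plus slight noise.
rows, cols = np.mgrid[0:64, 0:64]
smooth = (rows + cols) / 126.0 + 0.01 * rng.standard_normal((64, 64))

# Pure noise has no spatial structure, so a pixel says nothing about its neighbor.
noise = rng.standard_normal((64, 64))

smooth_corr = neighbor_correlation(smooth)  # close to 1
noise_corr = neighbor_correlation(noise)    # close to 0
```

A correlation near 1 means a pixel's value is largely predictable from its neighbor, which is the kind of redundancy that dimensionality reduction exploits.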

Data Representation

Data Representation

Data Representation

1 1 1 1 1

1 0 1 0 1

1 1 1 1 1

1 0.5 0.5 0.5 1

1 1 1 1 1
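
A short sketch of how such a grid becomes a data point, assuming the values on this slide read as a 5 by 5 image with intensities 1, 0, and 0.5. Stacking the rows gives one vector with one measurement per pixel, so the image is a single point in 25-dimensional space.

```python
import numpy as np

# The 5 by 5 grid of pixel intensities, as read from the slide.
image = np.array([
    [1, 1,   1,   1,   1],
    [1, 0,   1,   0,   1],
    [1, 1,   1,   1,   1],
    [1, 0.5, 0.5, 0.5, 1],
    [1, 1,   1,   1,   1],
])

# Stack the rows into one vector: one coordinate per pixel.
vector = image.reshape(-1)

# The image is recoverable from the vector as long as its shape is known.
restored = vector.reshape(image.shape)
```

By the same counting, a 600 by 600 image becomes a vector of 600 × 600 = 360,000 measurements.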

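[Figure: a 644 by 103 data matrix shown alongside a 644 by 2 matrix and a 2 by 103 matrix. Two 23 by 28 images (644 pixels each) are each paired with a 2 by 1 vector; the values shown are -2.19, -0.02, -3.19, and 1.02.]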
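
The matrix sizes in this figure are consistent with a rank-2 factorization of the data. The sketch below assumes that reading and uses a truncated SVD of the centered data, which is one common way to obtain such a factorization; the slide does not name the method. The data are synthetic, since the original images are not available.

```python
import numpy as np

rng = np.random.default_rng(0)
n_pixels, n_images, k = 644, 103, 2  # 644 = 23 * 28 pixels per image

# Synthetic stand-in for the image data: rank-2 structure plus small noise.
X = (rng.standard_normal((n_pixels, k)) @ rng.standard_normal((k, n_images))
     + 0.01 * rng.standard_normal((n_pixels, n_images)))

# Center each pixel across images, then keep the top-k left singular vectors.
mean = X.mean(axis=1, keepdims=True)
U, s, Vt = np.linalg.svd(X - mean, full_matrices=False)

basis = U[:, :k]               # 644 by 2: two basis "images"
codes = basis.T @ (X - mean)   # 2 by 103: one 2 by 1 code per image

# Reconstruct every image from its two coordinates.
X_hat = basis @ codes + mean
rel_error = np.linalg.norm(X - X_hat) / np.linalg.norm(X)
```

Each column of `codes` plays the role of a 2 by 1 vector in the figure: two numbers that stand in for a 644-pixel image.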

Arranging words: Each word was initially represented by a high-dimensional vector that counted the number of times it appeared in different encyclopedia articles. Words with similar contexts are collocated.
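
A toy version of this representation, with made-up counts over four hypothetical articles. Cosine similarity is used here only as a simple way to compare count vectors; the slide does not say which embedding method produced the arrangement.

```python
import numpy as np

# Hypothetical articles; each word is a vector of per-article counts.
articles = ["economics", "finance", "poetry", "art"]
counts = {
    "corporation": np.array([12, 9, 0, 1]),
    "market":      np.array([10, 11, 1, 0]),
    "sonnet":      np.array([0, 0, 8, 3]),
}

def cosine_similarity(a, b):
    """Cosine of the angle between two count vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Words that appear in the same articles point in similar directions.
sim_close = cosine_similarity(counts["corporation"], counts["market"])
sim_far = cosine_similarity(counts["corporation"], counts["sonnet"])
```

With real data the vectors would have one entry per encyclopedia article, which is what makes them high-dimensional.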

Different Features

Glasses vs. No Glasses

Beard vs. No Beard

Beard Distinction

Glasses Distinction

Multiple-Attribute Metric

Embedding of sparse music similarity graph

Platt, 2004

Reinforcement learning

Mahadevan and Maggioni, 2005

Semi-supervised learning

Use graph-based discretization of manifold to infer missing labels.

Build classifiers from bottom eigenvectors of graph Laplacian.

Belkin & Niyogi, 2004; Zien et al, Eds., 2005
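
A minimal sketch of the idea on this slide, under stated assumptions: the data are two hypothetical clusters, the graph uses Gaussian similarity weights with unit bandwidth, and the classifier is a least-squares fit on the two bottom eigenvectors of the unnormalized Laplacian. The cited work may differ in all of these choices. Only one point per cluster is labeled; the rest of the labels are inferred.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: two well-separated clusters of 10 points each.
cluster_a = rng.normal(loc=[0.0, 0.0], scale=0.3, size=(10, 2))
cluster_b = rng.normal(loc=[5.0, 0.0], scale=0.3, size=(10, 2))
points = np.vstack([cluster_a, cluster_b])
true_labels = np.array([-1] * 10 + [1] * 10)

# Similarity graph: Gaussian weights on squared distances, no self-loops.
sq_dists = ((points[:, None, :] - points[None, :, :]) ** 2).sum(axis=-1)
weights = np.exp(-sq_dists / 2.0)
np.fill_diagonal(weights, 0.0)

# Unnormalized graph Laplacian L = D - W.
laplacian = np.diag(weights.sum(axis=1)) - weights

# eigh returns eigenvalues in ascending order, so the first columns
# are the "bottom" eigenvectors (the smoothest functions on the graph).
eigenvalues, eigenvectors = np.linalg.eigh(laplacian)
bottom = eigenvectors[:, :2]

# Fit a linear classifier on the bottom eigenvectors using only the
# two labeled points (index 0 in one cluster, index 10 in the other).
labeled = np.array([0, 10])
coef, *_ = np.linalg.lstsq(
    bottom[labeled], true_labels[labeled].astype(float), rcond=None
)

# Infer the missing labels for every point from the fitted classifier.
predicted = np.sign(bottom @ coef).astype(int)
```

The bottom eigenvectors are nearly constant within each cluster, so two labeled points are enough to label the rest in this easy case.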

Learning correspondences

How can we learn manifold structure that is shared across multiple data sets?

c et al, 2003, 2005

Mapping and robot localization

Bowling, Ghodsi, Wilkinson 2005

Ham, Lin, D.D. 2005

The Big Picture

Manifold and Hidden Variables

Reading

• Journals: Neural Computation, JMLR, ML, IEEE PAMI

• Conferences: NIPS, UAI, ICML, AI-STATS, IJCAI, IJCNN

• Vision: CVPR, ECCV, SIGGRAPH

• Speech: EuroSpeech, ICSLP, ICASSP

• Online: CiteSeer, Google

• Books:

– Elements of Statistical Learning, Hastie, Tibshirani, Friedman

– Learning from Data, Cherkassky, Mulier

– Machine Learning, Mitchell

– Neural Networks for Pattern Recognition, Bishop

– Introduction to Graphical Models, Jordan et al.