Object Orie’d Data Analysis, Last Time Statistical Smoothing –Histograms – Density Estimation...


Page 1:

Object Oriented Data Analysis, Last Time

• Statistical Smoothing
– Histograms – Density Estimation
– Scatterplot Smoothing – Nonparametric Regression

• SiZer Analysis
– Replaces bandwidth selection
– Scale Space
– Statistical Inference: Which bumps are “really there”?
– Visualization
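As a rough illustration of the inference step, the sketch below flags whether the slope of a Gaussian kernel smooth is significantly positive or negative at a given location and bandwidth. It is a simplification and not the published SiZer procedure: it uses a pointwise normal quantile (z = 1.96) where SiZer uses a simultaneous one, and the function names and the toy sample are assumptions made for illustration.

```python
import math
from statistics import NormalDist, mean, stdev

def kernel_derivative(u, h):
    """Derivative in x of the Gaussian kernel K_h(x - X), where u = x - X."""
    return -(u / h ** 3) * math.exp(-0.5 * (u / h) ** 2) / math.sqrt(2 * math.pi)

def slope_flag(data, x, h, z=1.96):
    """Classify the slope of the kernel density estimate at x for bandwidth h."""
    terms = [kernel_derivative(x - xi, h) for xi in data]
    est = mean(terms)                            # estimated derivative at x
    se = stdev(terms) / math.sqrt(len(terms))    # its estimated standard error
    if est > z * se:
        return "increasing"
    if est < -z * se:
        return "decreasing"
    return "not significant"

# Toy sample: 200 evenly spaced standard normal quantiles (not real data).
data = [NormalDist().inv_cdf((i + 0.5) / 200) for i in range(200)]
for x in (-1.0, 0.0, 1.0):
    print(x, slope_flag(data, x, h=0.5))
```

Repeating this over a grid of locations and a range of bandwidths, and coloring each cell by its flag, gives the general shape of a SiZer map.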

Page 2:

Kernel Density Estimation

Choice of bandwidth (window width)?
• Very important to performance

Fundamental Issue: Which modes are “really there”?
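A small sketch of why the bandwidth matters so much: the same sample shows a different number of modes depending on the window width. The Gaussian kernel, the helper names, and the toy bimodal sample are assumptions for illustration; this is not the incomes data.

```python
import math

def gaussian_kde(data, h):
    """Return a function x -> Gaussian kernel density estimate with bandwidth h."""
    norm = 1.0 / (len(data) * h * math.sqrt(2.0 * math.pi))
    def fhat(x):
        return norm * sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in data)
    return fhat

def count_modes(data, h, grid):
    """Count strict local maxima of the estimate evaluated on a grid."""
    f = gaussian_kde(data, h)
    vals = [f(x) for x in grid]
    return sum(1 for i in range(1, len(vals) - 1)
               if vals[i] > vals[i - 1] and vals[i] > vals[i + 1])

# Toy sample with two clusters (made-up values).
data = [-2.2, -2.1, -2.0, -1.9, -1.8, 1.8, 1.9, 2.0, 2.1, 2.2]
grid = [-5 + 0.01 * i for i in range(1001)]
for h in (0.02, 0.5, 5.0):
    print("bandwidth", h, "modes", count_modes(data, h, grid))
```

A very small bandwidth puts a bump on every observation, a very large one merges everything into a single bump, and only an intermediate one shows the two clusters.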

Page 3:

SiZer Background

Fun Scale Space Views (Incomes Data)

Surface View

Page 4:

SiZer Background

SiZer analysis of British Incomes data:

Page 5:

SiZer Background

Finance "tick data":

(time, price) of single stock transactions

Idea: "on line" version of SiZer for viewing and understanding trends

Page 6:

SiZer Background

Finance "tick data":

(time, price) of single stock transactions

Idea: "on line" version of SiZer for viewing and understanding trends

Notes:
• trends depend heavily on scale
• double points and more
• background color transition (flop over at top)

Page 7:

SiZer Background

Internet traffic data analysis:

SiZer analysis of time series of packet times at internet hub (UNC)

Hannig, Marron, and Riedi (2001)

Page 8:

SiZer Background

Internet traffic data analysis:

SiZer analysis of time series of packet times at internet hub (UNC)
• across very wide range of scales
• needs more pixels than screen allows
• thus do zooming view (zoom in over time)
– zoom in to yellow boundary in next frame
– readjust vertical axis

Page 9:

SiZer Background

Internet traffic data analysis (cont.)

Insights from SiZer analysis:
• Coarse scales: amazing amount of significant structure
• Evidence of self-similar fractal type process?
• Fewer significant features at small scales
• But they exist, so not Poisson process
• Poisson approximation OK at small scale???
• Smooths (top part) stable at large scales?

Page 10:

Dependent SiZer

Rondonotti, Marron, and Park (2007)

• SiZer compares data with white noise
• Inappropriate in time series
• Dependent SiZer compares data with an assumed model
• Visual Goodness of Fit test

Page 11:

Dependent SiZer: 2002 Apr 13, Sat, 1 pm – 3 pm

Page 12:

Zoomed view (to red region, i.e. “flat top”)

Page 13:

Further Zoom: finds very periodic behavior!

Page 14:

Possible Physical Explanation

IP “Port Scan”
• Common device of hackers
• Searching for “break in points”
• Send query to every possible (within UNC domain):
– IP address
– Port Number
• Replies can indicate system weaknesses

Internet Traffic is hard to model

Page 15:

SiZer Overview

Would you like to try a SiZer analysis?

• Matlab software: http://www.unc.edu/depts/statistics/postscript/papers/marron/Matlab6Software/Smoothing/

• JAVA version (demo, beta): Follow the SiZer link from the Wagner Associates home page: http://www.wagner.com/www.wagner.com/SiZer/

• More details, examples and discussions: http://www.stat.unc.edu/faculty/marron/DataAnalyses/SiZer_Intro.html

Page 16:

PCA to find clusters

Return to PCA of Mass Flux Data:

Page 17:

PCA to find clusters

SiZer analysis of Mass Flux, PC1

Page 18:

PCA to find clusters

SiZer analysis of Mass Flux, PC1

Conclusion:
• Found 3 significant clusters!
• Correspond to 3 known “cloud types”
• Worth deeper investigation

Page 19:

Recall Yeast Cell Cycle Data

• “Gene Expression” – Micro-array data
• Data (after major preprocessing): Expression “level” of:
– thousands of genes (d ~ 1,000s)
– but only dozens of “cases” (n ~ 10s)
• Interesting statistical issue: High Dimension Low Sample Size (HDLSS) data

Page 20:

Yeast Cell Cycle Data, FDA View

Central question: Which genes are “periodic” over 2 cell cycles?

Page 21:

Yeast Cell Cycle Data, FDA View

Periodic genes?

Naïve approach: Simple PCA

Page 22:

Yeast Cell Cycle Data, FDA View

• Central question: which genes are “periodic” over 2 cell cycles?
• Naïve approach: Simple PCA
• No apparent (2 cycle) periodic structure?
• Eigenvalues suggest large amount of “variation”
• PCA finds “directions of maximal variation”
• Often, but not always, same as “interesting directions”
• Here need better approach to study periodicities
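To see how the direction of maximal variation can differ from the interesting direction, here is a 2-d toy sketch. The points are made up so that two groups separate along the second coordinate while most of the variance lies along the first; the closed-form angle used for the first principal direction holds only for a 2x2 covariance, and the function name is an assumption.

```python
import math

def first_pc_2d(points):
    """Unit vector of maximal variance for 2-d points (closed form, 2x2 covariance)."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / n
    syy = sum((p[1] - my) ** 2 for p in points) / n
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    # Major axis of the covariance ellipse: tan(2 * angle) = 2 sxy / (sxx - syy).
    angle = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    return math.cos(angle), math.sin(angle)

# Two groups (second coordinate +1 or -1) with a large spread in the first coordinate.
points = [(-10, 1), (-10, -1), (0, 1), (0, -1), (10, 1), (10, -1)]
pc1 = first_pc_2d(points)
print("first PC direction:", pc1)
```

Here the first PC points along the first coordinate, so a projection onto it mixes the two groups completely, even though they are cleanly separated in the other direction.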

Page 23:

Yeast Cell Cycles, Freq. 2 Proj.

PCA on Freq. 2 Periodic Component of Data

Page 24:

Frequency 2 Analysis

Page 25:

Frequency 2 Analysis

• Project data onto 2-dim space of sin and cos (freq. 2)
• Useful view: scatterplot
• Angle (in polar coordinates) shows phase
• Colors: Spellman’s cell cycle phase classification
• Black was labeled “not periodic”
• Within class phases approximately same, but notable differences
• Now try to improve “phase classification”
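The projection onto the frequency 2 sine and cosine can be sketched as follows, for a single gene's expression curve sampled at equally spaced times over two cell cycles. The function names, the sample length of 18, and the toy curve are assumptions for illustration.

```python
import math

def freq2_projection(x):
    """Coefficients (a, b) of x on cos and sin with two full periods over the record."""
    T = len(x)
    a = 2.0 / T * sum(xt * math.cos(2 * math.pi * 2 * t / T) for t, xt in enumerate(x))
    b = 2.0 / T * sum(xt * math.sin(2 * math.pi * 2 * t / T) for t, xt in enumerate(x))
    return a, b

def phase_degrees(a, b):
    """Polar angle of the point (a, b), in [0, 360)."""
    return math.degrees(math.atan2(b, a)) % 360.0

# Toy curve: offset 0.5, amplitude 3, phase 60 degrees, 18 time points.
T = 18
x = [0.5 + 3.0 * math.cos(2 * math.pi * 2 * t / T - math.pi / 3) for t in range(T)]
a, b = freq2_projection(x)
print("amplitude", math.hypot(a, b), "phase", phase_degrees(a, b))
```

Each gene becomes one point (a, b) in the scatterplot; its distance from the origin measures how strongly periodic it is, and its angle gives the phase.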

Page 26:

Yeast Cell Cycle

Revisit “phase classification”, approach:
• Use outer 200 genes (other numbers tried, less resolution)
• Study distribution of angles
• Use SiZer analysis (finds significant bumps, etc., in histogram)
• Carefully redrew boundaries
• Check by studying kernel density estimates of the angles
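The first two steps above, selecting the outer genes and collecting their angles, might look like the sketch below. It assumes each gene is already a point (a, b) in the frequency 2 plane; the function name and the toy points are hypothetical.

```python
import math

def outer_angles(points, k):
    """Angles in degrees, in [0, 360), of the k points farthest from the origin."""
    ranked = sorted(points, key=lambda p: math.hypot(p[0], p[1]), reverse=True)
    return [math.degrees(math.atan2(b, a)) % 360.0 for a, b in ranked[:k]]

# Four toy genes; the slide uses the outer 200, here k = 2.
genes = [(1.0, 0.0), (0.0, 2.0), (-3.0, 0.0), (0.1, 0.1)]
angles = outer_angles(genes, k=2)
print(angles)
```

The resulting angles are the data whose distribution is then studied for significant bumps.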

Page 27:

SiZer Study of Dist’n of Angles

Page 28:

Reclassification of Major Genes

Page 29:

Compare to Previous Classif’n

Page 30:

New Subpopulation View

Page 31:

OODA in Image Analysis

First Generation Problems:
• Denoising
• Segmentation (find object boundaries)
• Registration (align objects)

(all about single images)

Page 32:

OODA in Image Analysis

Second Generation Problems:
• Populations of Images
– Understanding Population Variation
– Discrimination (a.k.a. Classification)
• Complex Data Structures (& Spaces)
• HDLSS Statistics

Page 33:

HDLSS Data in Image Analysis

Why HDLSS (High Dim, Low Sample Size)?
• Complex 3-d Objects Hard to Represent
– Often need d = 100’s of parameters
• Complex 3-d Objects Costly to Segment
– Often have n = 10’s of cases

Page 34:

Image Object Representation

Major Approaches for Images:

• Landmark Representations

• Boundary Representations

• Medial Representations

Page 35:

Landmark Representations

Main Idea:

• On each object find important points

• Treat point locations as features

• I.e. represent objects by vectors of point locations (in 2-d or 3-d)

(Fits in OODA framework)

Page 36:

Landmark Representations

Basis of Field of Statistical Shape Analysis:

(important precursor of FDA & OODA)

Main References:
• Kendall (1981, 1984)
• Bookstein (1984)
• Dryden and Mardia (1998) (most readable and comprehensive)

Page 37:

Landmark Representations

Nice Example:
• Fly Wing Data (Drosophila fruit flies)
• From George Gilchrist, W. & M. U. http://gwgilc.people.wm.edu/
• Graphic Illustrating Landmarks (next page)
– Same veins appear in all flies
– And always have same relationship
– I.e. all landmarks always identifiable

Page 38:

Landmark Representations

Landmarks for fly wing data:

Page 39:

Landmark Representations

Important issue for landmark approaches:

Location, i.e. Registration

Illustration with Fly Wing Data (next slide)

Problem:
• coordinates are “locations in photo”
• & unclear where wing is positioned…

Page 40:

Landmark Representations

Illustration of Registration, with Fly Wing Data

Page 41:

Landmark Representations

Standard Approach to Registration Problem:

Procrustes Analysis

Idea: mod out location
• Can also mod out rotation
• Can also mod out size

Recommended reference:

Dryden and Mardia (1998)
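For 2-d landmarks, modding out location, size and rotation can be sketched compactly by storing each landmark as a complex number. This is an illustrative sketch of aligning one configuration to a reference, not the fly wing analysis itself: the function names and toy landmarks are assumptions, and reflections are not handled.

```python
import cmath
import math

def center(z):
    """Mod out location: subtract the centroid."""
    m = sum(z) / len(z)
    return [zi - m for zi in z]

def unit_size(z):
    """Mod out size: scale a centered configuration to unit centroid size."""
    size = math.sqrt(sum(abs(zi) ** 2 for zi in z))
    return [zi / size for zi in z]

def procrustes_align(z, ref):
    """Return (aligned z, standardized ref) after removing location, size, rotation."""
    zc = unit_size(center(z))
    rc = unit_size(center(ref))
    # The rotation minimizing sum |rc_i - rot * zc_i|^2 is the unit complex
    # number with the argument of sum(rc_i * conj(zc_i)).
    inner = sum(r * zi.conjugate() for r, zi in zip(rc, zc))
    rot = inner / abs(inner)
    return [rot * zi for zi in zc], rc

# Toy landmarks: the same shape, rotated 40 degrees, scaled by 3, and shifted.
ref = [0j, 1 + 0j, 1 + 1j, 0.2 + 0.8j]
moved = [3 * cmath.exp(1j * math.radians(40)) * p + (5 - 2j) for p in ref]
aligned, target = procrustes_align(moved, ref)
print(max(abs(a - t) for a, t in zip(aligned, target)))
```

After alignment the two configurations coincide up to rounding, since they differ only by location, size and rotation; real data would leave a residual that reflects shape differences.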

Page 42:

Landmark Representations

Procrustes Results for Fly Wing Data

Page 43:

Landmark Representations

Effect of Procrustes Analysis:

Study Difference Between Continents
• Flies from Europe & South America
• Look for important differences
• Project onto mean difference direction
• Visualize with movie
– Equal time spacing
– Through range of data
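The projection and the movie frames can be sketched as below, treating each object as a plain feature vector. The helper names and the tiny two-group data set are assumptions for illustration; a real use would feed in the landmark vectors for the two continents.

```python
import math

def mean_vec(rows):
    """Coordinate-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def mean_difference_direction(group_a, group_b):
    """Unit vector pointing from the mean of group_a to the mean of group_b."""
    diff = [b - a for a, b in zip(mean_vec(group_a), mean_vec(group_b))]
    norm = math.sqrt(sum(d * d for d in diff))
    return [d / norm for d in diff]

def project(rows, origin, direction):
    """Scores of each row along the direction, measured from origin."""
    return [sum((x - c) * d for x, c, d in zip(row, origin, direction)) for row in rows]

def movie_frames(origin, direction, lo, hi, n_frames):
    """Equally spaced points along the direction, from score lo to score hi."""
    steps = [lo + (hi - lo) * k / (n_frames - 1) for k in range(n_frames)]
    return [[c + s * d for c, d in zip(origin, direction)] for s in steps]

# Toy groups (made-up 2-d feature vectors).
group_a = [[0.0, 0.0], [0.0, 2.0]]
group_b = [[4.0, 0.0], [4.0, 2.0]]
origin = mean_vec(group_a + group_b)
direction = mean_difference_direction(group_a, group_b)
scores = project(group_a + group_b, origin, direction)
frames = movie_frames(origin, direction, min(scores), max(scores), 5)
print(frames)
```

Each frame is a full feature vector, so it can be drawn as an object, and stepping through the frames shows what changes along the direction separating the two groups.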

Page 44:

Landmark Representations

No Procrustes Adjustment:

Movies on Difference Between Continents

Page 45:

Landmark Representations

Effect of Procrustes Analysis:

Movies on Difference Between Continents

• Raw Data
– Driven by location effects
– Strongly feels size
– Hard to understand shape

Page 46:

Landmark Representations

Location, Rotation, Scale Procrustes:

Movies on Difference Between Continents

Page 47:

Landmark Representations

Effect of Procrustes Analysis:

Movies on Difference Between Continents

• Raw Data
– Driven by location effects
– Strongly feels size
– Hard to understand shape

• Full Procrustes
– Mods out location, size, rotation
– Allows clear focus on shape

Page 48:

Landmark Representations

Major Drawback of Landmarks:

• Need to always find each landmark

• Need same relationship

• I.e. Landmarks need to correspond

• Often fails for medical images

• E.g. How many corresponding landmarks on a set of kidneys, livers or brains???

Page 49:

Landmark Representations

Landmarks for brains???

(thanks to Liz Bullitt)

Very hard to identify

Page 50:

Landmark Representations

Look across people:

Some structure in common

But “folds” are different

Consistent Landmarks???

Page 51:

Landmark Representations

Look across people:

Some structure in common

But “folds” are different

Consistent Landmarks???

Page 52:

Boundary Representations

Major sets of ideas:
• Triangular Meshes
– Survey: Owen (1998)
• Active Shape Models
– Cootes, et al. (1993)
• Fourier Boundary Representations
– Kelemen, et al. (1997 & 1999)

Page 53:

Boundary Representations

Example of triangular mesh representation:

From: www.geometry.caltech.edu/pubs.html

Page 54:

Boundary Representations

Example of triangular mesh representation for a brain:

From: meshlab.sourceforge.net/SnapMeshLab.brain.jpg

Page 55:

Boundary Representations

Main Drawback: Correspondence

• For OODA (on vectors of parameters): Need to “match up points”
• Easy to find triangular mesh
– Lots of research on this driven by gamers
• Challenge: match mesh across objects
– There are some interesting ideas…

Page 56:

Medial Representations

Main Idea: Represent Objects as:
• Discretized skeletons (medial atoms)
• Plus spokes from center to edge
• Which imply a boundary

Very accessible early reference:
• Yushkevich, et al. (2001)
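A minimal 2-d sketch of how atoms plus spokes imply boundary points. The parameterization used here (atom position, spoke length r, a bisector direction, and a half-angle between the bisector and each spoke) is an assumption chosen for illustration and may differ in detail from the cited reference; the function name and toy atoms are hypothetical.

```python
import math

def atom_boundary_points(x, y, r, bisector_deg, half_angle_deg):
    """End points of the two spokes of a 2-d medial atom centered at (x, y)."""
    ends = []
    for sign in (+1, -1):
        ang = math.radians(bisector_deg + sign * half_angle_deg)
        ends.append((x + r * math.cos(ang), y + r * math.sin(ang)))
    return ends

# Toy skeleton: three atoms along the x-axis, widest in the middle.
atoms = [(0.0, 0.0, 1.0, 0.0, 90.0),
         (1.0, 0.0, 2.0, 0.0, 90.0),
         (2.0, 0.0, 1.0, 0.0, 90.0)]
boundary = [p for atom in atoms for p in atom_boundary_points(*atom)]
print(boundary)
```

Connecting the spoke ends on each side of the skeleton gives an approximate outline of the object.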

Page 57:

Medial Representations

2-d M-Rep Example: Corpus Callosum (Yushkevich)

Page 58:

Medial Representations

2-d M-Rep Example: Corpus Callosum (Yushkevich)

Atoms - Spokes - Implied Boundary

Page 59:

Medial Representations

3-d M-Rep Example: From Ja-Yeon Jeong

Bladder – Prostate - Rectum

Atoms - Spokes - Implied Boundary

Page 60:

Medial Representations

3-d M-reps: there are several variations

Two choices: From Fletcher (2004)

Page 61:

Medial Representations

Statistical Challenge

• M-rep parameters are:
– Locations
– Radii
– Angles (not comparable)
• Stuffed into a long vector
• I.e. many direct products of these: locations in $\mathbb{R}^3$, radii $> 0$, angles on $S^2$

Page 62:

Medial Representations

Statistical Challenge:
• How to analyze angles as data?
• E.g. what is the average of: 3°, 4°, 358°, 359°
– 181°??? (average of the numbers)
– 1° (of course!)
• Correct View of angular data: Consider as points on the unit circle
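The unit circle view can be checked numerically for the four angles in this example: map each angle to a point on the circle, average the points, and read off the angle of the average. The function name is an assumption; the recipe is the usual circular mean.

```python
import math

def circular_mean_degrees(angles):
    """Angle, in [0, 360), of the average of the unit-circle points for the angles."""
    s = sum(math.sin(math.radians(a)) for a in angles)
    c = sum(math.cos(math.radians(a)) for a in angles)
    return math.degrees(math.atan2(s, c)) % 360.0

angles = [3, 4, 358, 359]
naive = sum(angles) / len(angles)         # average of the numbers
circular = circular_mean_degrees(angles)  # average as points on the circle
print(naive, circular)
```

The naive average lands on the opposite side of the circle from all four observations, while the circular mean sits in the middle of them.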

Page 63:

Medial Representations

What is the average (181°?) or (1°?) of: 3°, 4°, 358°, 359°

Page 64:

Medial Representations

Statistical Analysis of Directional Data:
• Common Examples:
– Wind Directions (0-360)
– Magnetic Fields (0-360)
– Cracks (0-180)
• There is a literature (monographs):
– Mardia (1972, 2000)
– Fisher, et al. (1987, 1993)

Page 65:

Medial Representations

Statistical Challenge
• Many direct products of:
– Locations
– Radii
– Angles (not comparable)
(locations in $\mathbb{R}^3$, radii $> 0$, angles on $S^2$)
• Appropriate View: Data Lie on Curved Manifold Embedded in higher dimensional Euclidean Space

Page 66:

Medial Representations

Data on Curved Manifold Toy Example:

Page 67:

Medial Representations

Data on Curved Manifold Viewpoint:
• Very Simple Toy Example (last movie)
• Data on a Cylinder = $\mathbb{R}^1 \times S^1$
• Notes:
– Simplest non-Euclidean Example
– 2-d data, embedded on manifold in $\mathbb{R}^3$
– Can flatten the cylinder, to a plane
– Have periodic representation
– Movie by: Suman Sen
• Same idea for more complex direct products
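The cylinder example can be sketched in a few lines: each data point is a height together with an angle, it can be embedded on a unit cylinder in 3-d, flattened back to a plane, and compared with a distance that respects the periodic angle. The function names and the unit radius are assumptions made for illustration.

```python
import math

def embed(z, theta):
    """Map (height, angle) to a point on the unit cylinder in 3-d."""
    return (math.cos(theta), math.sin(theta), z)

def flatten(point):
    """Map a point on the unit cylinder back to (height, angle in [0, 2*pi))."""
    x, y, z = point
    return (z, math.atan2(y, x) % (2 * math.pi))

def cylinder_distance(p, q):
    """Distance between two (height, angle) points, wrapping the angle around."""
    (z1, t1), (z2, t2) = p, q
    dt = abs(t1 - t2) % (2 * math.pi)
    dt = min(dt, 2 * math.pi - dt)
    return math.hypot(z1 - z2, dt)

p = (0.0, math.radians(350))
q = (0.0, math.radians(10))
print(flatten(embed(0.5, 2.0)), cylinder_distance(p, q))
```

In the flattened plane the two points look 340 degrees apart, but on the cylinder they are only 20 degrees apart, which is the reason for keeping the periodic representation in mind.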