Post on 05-Feb-2016
UNC, Stat & OR
Isaac Newton Institute - Cambridge
Object Oriented Data Analysis
J. S. Marron
Dept. of Statistics and Operations
Research, University of North Carolina
April 22, 2023
Personal Opinions on Mathematical Statistics
What is Mathematical Statistics?
Validation of existing methods
Asymptotics (n → ∞) & Taylor expansion
Comparison of existing methods
(requires hard math, but really “accounting”???)
Personal Opinions on Mathematical Statistics
What could Mathematical Statistics be?
Basis for invention of new methods
Complicated data → mathematical ideas
Do we value creativity?
Since we don’t do this, others do…
(where are the £££s???)
Personal Opinions on Mathematical Statistics
Since we don’t do this, others do…
Pattern Recognition
Artificial Intelligence
Neural Nets
Data Mining
Machine Learning
???
Personal Opinions on Mathematical Statistics
Possible Litmus Test:
Creative Statistics
Clinical Trials Viewpoint:
Worst Imaginable Idea
Mathematical Statistics Viewpoint:
???
Object Oriented Data Analysis, I
What is the “atom” of a statistical analysis?
1st Course: Numbers
Multivariate Analysis Course: Vectors
Functional Data Analysis: Curves
More generally: Data Objects
Object Oriented Data Analysis, II
Examples:
Medical Image Analysis
Images as Data Objects?
Shape Representations as Objects
Micro-arrays
Just multivariate analysis?
Object Oriented Data Analysis, III
Typical Goals:
Understanding population variation
Visualization
Principal Component Analysis +
Discrimination (a.k.a. Classification)
Time Series of Data Objects
Object Oriented Data Analysis, IV
Major Statistical Challenge, I:
High Dimension Low Sample Size (HDLSS)
Dimension d >> sample size n
“Multivariate Analysis” nearly useless
Can’t “normalize the data”
Land of Opportunity for Statisticians
Need for “creative statisticians”
Object Oriented Data Analysis, V
Major Statistical Challenge, II:
Data may live in non-Euclidean space
Lie Group / Symmetric Spaces (manifold data)
Trees/Graphs as data objects
Interesting Issues:
What is “the mean” (pop’n center)?
How do we quantify “pop’n variation”?
Statistics in Image Analysis, I
First Generation Problems:
Denoising
Segmentation
Registration
(all about single images)
Statistics in Image Analysis, II
Second Generation Problems:
Populations of Images
Understanding Population Variation
Discrimination (a.k.a. Classification)
Complex Data Structures (& Spaces)
HDLSS Statistics
HDLSS Statistics in Imaging
Why HDLSS (High Dim, Low Sample Size)?
Complex 3-d Objects Hard to Represent
Often need d = 100’s of parameters
Complex 3-d Objects Costly to Segment
Often have n = 10’s cases
Medical Imaging – A Challenging Example
Male Pelvis
Bladder – Prostate – Rectum
How do they move over time (days)?
Critical to Radiation Treatment (cancer)
Work with 3-d CT
Very Challenging to Segment
Find boundary of each object?
Represent each Object?
Male Pelvis – Raw Data
One CT Slice (in 3d image)
Coccyx (Tail Bone)
Rectum
Prostate
Male Pelvis – Raw Data
Prostate:
manual segmentation
Slice by slice
Reassembled
Male Pelvis – Raw Data
Prostate:
Slices: Reassembled in 3d
How to represent?
Thanks: Ja-Yeon Jeong
Object Representation
Landmarks (hard to find)
Boundary Rep’ns (no correspondence)
Medial representations
Find “skeleton”
Discretize as “atoms” called M-reps
3-d m-reps
Bladder – Prostate – Rectum (multiple objects, J. Y. Jeong)
• Medial Atoms provide “skeleton”
• Implied Boundary from “spokes” → “surface”
3-d m-reps
M-rep model fitting
• Easy, when starting from binary (blue)
• But very expensive (30 – 40 minutes technician’s time)
• Want automatic approach
• Challenging, because of poor contrast, noise, …
• Need to borrow information across training sample
• Use Bayes approach: prior & likelihood → posterior
• ~Conjugate Gaussians, but there are issues:
• Major HDLSS challenges
• Manifold aspect of data
PCA for m-reps, I
Major issue: m-reps live in $\mathbb{R}^3 \times \mathbb{R}^+ \times SO(3) \times SO(2)$ (locations, radius and angles)
E.g. “average” of: $2^\circ, 3^\circ, 358^\circ, 359^\circ$ = ???
Natural Data Structure is: Lie Groups ~ Symmetric spaces
(smooth, curved manifolds)
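The angle example can be checked numerically. The sketch below is only an illustration: it compares the ordinary average with one common choice of "mean on a circle" (the direction of the averaged unit vectors). The function names are hypothetical, and this extrinsic mean is an assumption made here for simplicity; it is not necessarily the geodesic mean used in the m-rep analysis.

```python
import math

def arithmetic_mean_deg(angles_deg):
    """Ordinary average, which ignores that 0 and 360 degrees coincide."""
    return sum(angles_deg) / len(angles_deg)

def circular_mean_deg(angles_deg):
    """Direction of the averaged unit vectors, reported in [0, 360)."""
    s = sum(math.sin(math.radians(a)) for a in angles_deg)
    c = sum(math.cos(math.radians(a)) for a in angles_deg)
    return math.degrees(math.atan2(s, c)) % 360.0

# The four angles from the slide, in degrees.
angles = [2.0, 3.0, 358.0, 359.0]
naive = arithmetic_mean_deg(angles)
circular = circular_mean_deg(angles)

print(f"arithmetic mean: {naive:.2f} degrees")
print(f"circular mean:   {circular:.2f} degrees")
```

The arithmetic mean lands on the opposite side of the circle from all four data points, while the circular mean stays close to 0 degrees, which is the sense in which a Euclidean average is the wrong summary for angular data.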
PCA for m-reps, II
PCA on non-Euclidean spaces? (i.e. on Lie Groups / Symmetric Spaces)
T. Fletcher: Principal Geodesic Analysis
Idea: replace “linear summary of data” with “geodesic summary of data”…
PGA for m-reps, Bladder-Prostate-Rectum
Bladder – Prostate – Rectum, 1 person, 17 days
PG 1 PG 2 PG 3
(analysis by Ja Yeon Jeong)
HDLSS Classification (i.e. Discrimination)
Background: Two Class (Binary) version:
Using “training data” from Class +1, and from Class -1
Develop a “rule” for assigning new data to a Class
Canonical Example: Disease Diagnosis
New Patients are “Healthy” or “Ill”
Determined based on measurements
HDLSS Classification (Cont.)
Ineffective Methods:
Fisher Linear Discrimination
Gaussian Likelihood Ratio
Less Useful Methods:
Nearest Neighbors
Neural Nets
(“black boxes”, no “directions” or intuition)
HDLSS Classification (Cont.)
Currently Fashionable Methods:
Support Vector Machines
Trees Based Approaches
New High Tech Method:
Distance Weighted Discrimination (DWD)
Specially designed for HDLSS data
Avoids “data piling” problem of SVM
Solves more suitable optimization problem
HDLSS Classification (Cont.)
Currently Fashionable Methods:
Trees Based Approaches
Support Vector Machines:
Distance Weighted Discrimination
Maximal Data Piling
Distance Weighted Discrimination
Based on Optimization Problem:
$$\min_{w,b} \sum_{i=1}^{n} \frac{1}{r_i}$$
More precisely: work in appropriate penalty for violations
Optimization Method (Michael Todd):
Second Order Cone Programming
Still Convex generalization of quadratic programming
Fast greedy solution
Can use existing software
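A small sketch can make the objective concrete. The slide does not define $r_i$ or the violation penalty, so the code assumes the usual published DWD formulation: $r_i = y_i(w^T x_i + b) + \xi_i$ with slack $\xi_i \ge 0$, an added penalty $C \sum_i \xi_i$, and $\|w\| = 1$. For a fixed direction the best slack has a closed form, which is what the code uses. The name `dwd_objective`, the toy data and the value of `C` are illustrative, and the function only evaluates the objective; it does not run a second order cone solver.

```python
import numpy as np

def dwd_objective(X, y, w, b, C=1.0):
    """Evaluate sum_i (1/r_i + C*xi_i) for a fixed direction w and offset b.

    Assumed formulation (not spelled out on the slide):
    r_i = y_i * (x_i . w + b) + xi_i, with xi_i >= 0 and ||w|| = 1.
    """
    w = np.asarray(w, dtype=float)
    w = w / np.linalg.norm(w)                 # enforce the unit-norm constraint
    u = y * (X @ w + b)                       # signed distances to the hyperplane
    threshold = 1.0 / np.sqrt(C)
    xi = np.maximum(0.0, threshold - u)       # optimal slack for this (w, b)
    r = u + xi                                # always >= threshold > 0
    return float(np.sum(1.0 / r + C * xi))

# Toy two-class data, separated along the first coordinate.
X = np.array([[2.0, 0.0], [3.0, 1.0], [-2.0, 0.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

obj_good = dwd_objective(X, y, w=[1.0, 0.0], b=0.0)
obj_poor = dwd_objective(X, y, w=[0.0, 1.0], b=0.0)
print(f"direction (1, 0): objective = {obj_good:.4f}")
print(f"direction (0, 1): objective = {obj_poor:.4f}")
```

Because every point contributes through $1/r_i$, all of the data influence the direction, with nearer points weighted more heavily; this is the contrast with SVM, where only the support vectors matter.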
DWD Bias Adjustment for Microarrays
Microarray data:
Simult. Measur’ts of “gene expression”
Intrinsically HDLSS
Dimension d ~ 1,000s – 10,000s
Sample Sizes n ~ 10s – 100s
My view: Each array is “point in cloud”
DWD Batch and Source Adjustment
For Perou’s Stanford Breast Cancer Data
Analysis in Benito, et al (2004) Bioinformatics
https://genome.unc.edu/pubsup/dwd/
Adjust for Source Effects
Different sources of mRNA
Adjust for Batch Effects
Arrays fabricated at different times
DWD Adj: Raw Breast Cancer data
DWD Adj: Source Colors
DWD Adj: Batch Colors
DWD Adj: Biological Class Colors
DWD Adj: Biological Class Colors & Symbols
DWD Adj: Biological Class Symbols
DWD Adj: Source Colors
DWD Adj: PC 1-2 & DWD direction
DWD Adj: DWD Source Adjustment
DWD Adj: Source Adj’d, PCA view
DWD Adj: Source Adj’d, Class Colored
DWD Adj: Source Adj’d, Batch Colored
DWD Adj: Source Adj’d, 5 PCs
DWD Adj: S. Adj’d, Batch 1,2 vs. 3 DWD
DWD Adj: S. & B1,2 vs. 3 Adjusted
DWD Adj: S. & B1,2 vs. 3 Adj’d, 5 PCs
DWD Adj: S. & B Adj’d, B1 vs. 2 DWD
DWD Adj: S. & B Adj’d, B1 vs. 2 Adj’d
DWD Adj: S. & B Adj’d, 5 PC view
DWD Adj: S. & B Adj’d, 4 PC view
DWD Adj: S. & B Adj’d, Class Colors
DWD Adj: S. & B Adj’d, Adj’d PCA
DWD Bias Adjustment for Microarrays
Effective for Batch and Source Adj.
Also works for cross-platform Adj.
E.g. cDNA & Affy
Despite literature claiming contrary
“Gene by Gene” vs. “Multivariate” views
Funded as part of caBIG
“Cancer BioInformatics Grid”
“Data Combination Effort” of NCI
Interesting Benchmark Data Set
NCI 60 Cell Lines
Interesting benchmark, since same cells
Data available on the Web:
http://discover.nci.nih.gov/datasetsNature2000.jsp
Both cDNA and Affymetrix Platforms
8 Major cancer subtypes
Use DWD now for visualization
NCI 60: Views using DWD Dir’ns (focus on biology)
DWD in Face Recognition, I
Face Images as Data
(with M. Benito & D. Peña)
Registered using landmarks
Male – Female Difference?
Discrimination Rule?
DWD in Face Recognition, II
DWD Direction
Good separation
Images “make sense”
Garbage at ends? (extrapolation effects?)
Blood vessel tree data
Marron’s brain:
Segmented from MRA
Reconstruct trees in 3d
Rotate to view
Blood vessel tree data
Now look over many people (data objects)
Structure of population (understand variation?)
PCA in strongly non-Euclidean Space???
Blood vessel tree data
Possible focus of analysis:
• Connectivity structure only (topology)
• Location, size, orientation of segments
• Structure within each vessel segment
Blood vessel tree data
Present Focus:
Topology only
Already challenging
Later address others
Then add attributes to tree nodes
And extend analysis
Strongly Non-Euclidean Spaces
Statistics on Population of Tree-Structured Data Objects?
• Mean???
• Analog of PCA???
Strongly non-Euclidean, since:
• Space of trees not a linear space
• Not even approximately linear (no tangent plane)
Strongly Non-Euclidean Spaces
PCA on Tree Space?
Key Idea (Jim Ramsay):
• Replace 1-d subspace that best approximates data
• By 1-d representation that best approximates data
Wang and Marron (2007) define notion of Treeline (in structure space)
PCA for blood vessel tree data
Data Analytic Goals: Age, Gender
See these?
No…
Preliminary Tree-Curve Results
First Correlation of Structure to Age!
(Back Trees)
HDLSS Asymptotics
Why study asymptotics?
An interesting (naïve) quote:
“I don’t look at asymptotics, because
I don’t have an infinite sample size”
Suggested perspective:
Asymptotics are a tool for finding simple
structure underlying complex entities
HDLSS Asymptotics
Which asymptotics?
n → ∞ (classical, very widely done)
d → ∞ ???
Sensible?
Follow typical “sampling process”?
Say anything, as noise level increases???
HDLSS Asymptotics
Which asymptotics?
n → ∞ & d → ∞
n >> d: a few results around
(still have classical info in data)
n ~ d: random matrices (Iain J., et al)
(nothing classically estimable)
HDLSS asymptotics: n fixed, d → ∞
HDLSS Asymptotics
HDLSS asymptotics: n fixed, d → ∞
Follow typical “sampling process”?
Microarrays: # genes bounded
Proteomics, SNPs, …
A moot point, from perspective:
Asymptotics are a tool for finding
simple structure underlying complex
entities
HDLSS Asymptotics
HDLSS asymptotics: n fixed, d → ∞
Say anything, as noise level
increases???
Yes, there exists simple, perhaps
surprising, underlying structure
HDLSS Asymptotics: Simple Paradoxes, I
For $d$-dim’al “Standard Normal” dist’n:
$$Z = (Z_1, \ldots, Z_d)^T \sim N_d(0, I_d)$$
Euclidean Distance to Origin (as $d \to \infty$):
$$\|Z\| = \sqrt{d} + O_p(1)$$
- Data lie roughly on surface of sphere of radius $\sqrt{d}$
- Yet origin is point of “highest density”???
- Paradox resolved by: “density w. r. t. Lebesgue Measure”
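The concentration of the norm near $\sqrt{d}$ is easy to see in a quick simulation. This is a minimal sketch: the dimensions, number of draws and seed are arbitrary choices made here, not values from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
n_draws = 20          # number of standard normal vectors per dimension
results = {}

for d in (10, 1_000, 100_000):
    Z = rng.standard_normal((n_draws, d))   # each row is one draw from N_d(0, I_d)
    norms = np.linalg.norm(Z, axis=1)       # Euclidean distance to the origin
    results[d] = (norms.mean(), norms.std())
    print(f"d = {d:>7}: sqrt(d) = {np.sqrt(d):8.2f}, "
          f"mean norm = {norms.mean():8.2f}, sd of norm = {norms.std():.2f}")
```

The spread of the norms stays of order 1 while $\sqrt{d}$ grows, so relative to the radius the draws sit on an ever thinner shell.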
HDLSS Asymptotics: Simple Paradoxes, II
For $d$-dim’al “Standard Normal” dist’n: $Z_1$ indep. of $Z_2 \sim N_d(0, I_d)$
Euclidean Dist. between $Z_1$ and $Z_2$ (as $d \to \infty$):
Distance tends to non-random constant:
$$\|Z_1 - Z_2\| = \sqrt{2d} + O_p(1)$$
Can extend to $Z_1, \ldots, Z_n$
Where do they all go???
(we can only perceive 3 dim’ns)
HDLSS Asymptotics: Simple Paradoxes, III
For $d$-dim’al “Standard Normal” dist’n: $Z_1$ indep. of $Z_2 \sim N_d(0, I_d)$
High dim’al Angles (as $d \to \infty$):
$$\mathrm{Angle}(Z_1, Z_2) = 90^\circ + O_p(d^{-1/2})$$
- “Everything is orthogonal”???
- Where do they all go??? (again our perceptual limitations)
- Again 1st order structure is non-random
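Both the pairwise distance and the pairwise angle statements can be checked with a short simulation. This is a sketch only: the helper name `pairwise_summary`, the sample size, the dimensions and the seed are all illustrative choices.

```python
import numpy as np

def pairwise_summary(d, n=10, seed=0):
    """Return (distance / sqrt(2d), angle in degrees) for all pairs among n draws."""
    rng = np.random.default_rng(seed)
    Z = rng.standard_normal((n, d))                  # rows are draws from N_d(0, I_d)
    i, j = np.triu_indices(n, k=1)                   # index pairs with i < j
    dists = np.linalg.norm(Z[i] - Z[j], axis=1)
    norms = np.linalg.norm(Z, axis=1)
    cosines = np.sum(Z[i] * Z[j], axis=1) / (norms[i] * norms[j])
    angles = np.degrees(np.arccos(np.clip(cosines, -1.0, 1.0)))
    return dists / np.sqrt(2 * d), angles

for d in (10, 1_000, 50_000):
    ratios, angles = pairwise_summary(d)
    print(f"d = {d:>6}: dist/sqrt(2d) in [{ratios.min():.3f}, {ratios.max():.3f}], "
          f"angles in [{angles.min():.1f}, {angles.max():.1f}] degrees")

ratios_big, angles_big = pairwise_summary(50_000)
```

As the dimension grows, the scaled distances collapse toward 1 and the angles toward 90 degrees, for every pair at once.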
HDLSS Asy’s: Geometrical Representation, I
Assume $Z_1, \ldots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$
Study Subspace Generated by Data
a. Hyperplane through 0, of dimension $n$
b. Points are “nearly equidistant to 0”, & dist $\sim \sqrt{d}$
c. Within plane, can “rotate towards $\sqrt{d} \times$ Unit Simplex”
d. All Gaussian data sets are “near Unit Simplex Vertices”!!!
“Randomness” appears only in rotation of simplex
With P. Hall & A. Neeman
HDLSS Asy’s: Geometrical Representation, II
Assume $Z_1, \ldots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$
Study Hyperplane Generated by Data
a. $n - 1$ dimensional hyperplane
b. Points are pairwise equidistant, dist $\sim \sqrt{2d}$
c. Points lie at vertices of “regular $n$-hedron”
d. Again “randomness in data” is only in rotation
e. Surprisingly rigid structure in data?
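One way to check the simplex picture numerically is through the matrix of inner products. A point configuration is determined up to rotation by its Gram matrix, so if $Z Z^T / d$ is close to the identity $I_n$, the data are, up to rotation, close to $\sqrt{d}$ times the unit simplex vertices. The sketch below uses arbitrary illustrative values for `n`, `d` and the seed.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 5, 50_000
Z = rng.standard_normal((n, d))        # n Gaussian data points in d dimensions

# Scaled Gram matrix: entry (i, j) is <Z_i, Z_j> / d.
gram = (Z @ Z.T) / d
max_dev = float(np.max(np.abs(gram - np.eye(n))))

np.set_printoptions(precision=3, suppress=True)
print(gram)
print(f"largest deviation from the identity: {max_dev:.4f}")
```

All of the sample-to-sample variation that remains at this scale is in how the simplex is oriented inside the $d$ dimensional space, which the Gram matrix does not record.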
HDLSS Asy’s: Geometrical Representation, III
Simulation View: shows “rigidity after rotation”
HDLSS Asy’s: Geometrical Representation, III
Straightforward Generalizations:
non-Gaussian data: only need moments
non-independent: use “mixing conditions” (with P. Hall & A. Neeman)
Mild Eigenvalue condition on Theoretical Cov. (with J. Ahn, K. Muller & Y. Chi)
All based on simple “Laws of Large Numbers”
HDLSS Asy’s: Geometrical Representation, IV
Explanation of Observed (Simulation) Behavior:
“everything similar for very high d”
2 popn’s are 2 simplices (i.e. regular n-hedrons)
All are same distance from the other class
i.e. everything is a support vector
i.e. all sensible directions show “data piling”
so “sensible methods are all nearly the same”
Including 1-NN
HDLSS Asy’s: Geometrical Representation, V
Further Consequences of Geometric Representation
1. Inefficiency of DWD for uneven sample size
(motivates “weighted version”, work in progress)
2. DWD more “stable” than SVM
(based on “deeper limiting distributions”)
(reflects intuitive idea “feeling sampling variation”)
(something like “mean vs. median”)
3. 1-NN rule inefficiency is quantified.
2nd Paper on HDLSS Asymptotics
Ahn, Marron, Muller & Chi (2007) Biometrika
Assume 2nd Moments (and Gaussian)
Assume no eigenvalues too large in sense:
For $\epsilon = \frac{\left(\sum_{j=1}^{d} \lambda_j\right)^2}{d \sum_{j=1}^{d} \lambda_j^2}$ assume $\epsilon^{-1} = o(d)$, i.e. $\epsilon \gg \frac{1}{d}$ (min possible)
(much weaker than previous mixing conditions…)
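The eigenvalue condition can be explored with a few lines of code. The sketch below computes $\epsilon$ for some example spectra; the function name `sphericity`, the dimension, and the spectra (including a single large eigenvalue $d^\alpha$ with all others equal to 1) are illustrative assumptions chosen here.

```python
import numpy as np

def sphericity(eigenvalues):
    """epsilon = (sum of lambda_j)^2 / (d * sum of lambda_j^2), between 1/d and 1."""
    lam = np.asarray(eigenvalues, dtype=float)
    d = lam.size
    return float(lam.sum() ** 2 / (d * np.sum(lam ** 2)))

d = 10_000

identity = np.ones(d)                 # all eigenvalues equal: epsilon = 1
one_direction = np.zeros(d)
one_direction[0] = 1.0                # all variation in one direction: epsilon = 1/d

def spiked(alpha):
    """Spectrum with one eigenvalue d**alpha and the rest equal to 1."""
    lam = np.ones(d)
    lam[0] = float(d) ** alpha
    return lam

eps_identity = sphericity(identity)
eps_one = sphericity(one_direction)
d_eps_small_spike = d * sphericity(spiked(0.5))
d_eps_large_spike = d * sphericity(spiked(1.5))

print(f"identity:       epsilon = {eps_identity:.4f}")
print(f"one direction:  epsilon = {eps_one:.6f} (the minimum, 1/d)")
print(f"spike d^0.5:    d * epsilon = {d_eps_small_spike:.1f}")
print(f"spike d^1.5:    d * epsilon = {d_eps_large_spike:.2f}")
```

The condition $\epsilon^{-1} = o(d)$ asks that $d \epsilon$ grow without bound; in these examples it is large for the mild spike and stays near 1 for the dominant one.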
HDLSS Math. Stat. of PCA, I
Consistency & Strong Inconsistency:
Spike Covariance Model (Johnstone & Paul)
For Eigenvalues: $\lambda_{1,d} = d^\alpha, \lambda_{2,d} = 1, \ldots, \lambda_{d,d} = 1$
1st Eigenvector: $u_1$
How good are empirical versions, $\hat\lambda_{1,d}, \ldots, \hat\lambda_{d,d}, \hat u_1$, as estimates?
HDLSS Math. Stat. of PCA, II
Consistency (big enough spike):
For $\alpha > 1$, $\mathrm{Angle}(u_1, \hat u_1) \to 0$
Strong Inconsistency (spike not big enough):
For $\alpha < 1$, $\mathrm{Angle}(u_1, \hat u_1) \to 90^\circ$
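The two regimes can be seen in a small simulation of the spike model. This is a sketch under stated assumptions: Gaussian data with covariance $\mathrm{diag}(d^\alpha, 1, \ldots, 1)$, so that $u_1$ is the first coordinate axis; no mean centering, since the true mean is zero; and illustrative values of `d`, `n`, `alpha` and the seed. The helper name `first_pc_angle` is hypothetical.

```python
import numpy as np

def first_pc_angle(alpha, d=5000, n=10, seed=0):
    """Angle in degrees between u_1 = e_1 and the first sample eigenvector."""
    rng = np.random.default_rng(seed)
    sd = np.ones(d)
    sd[0] = np.sqrt(float(d) ** alpha)            # first eigenvalue is d**alpha
    X = rng.standard_normal((n, d)) * sd          # rows ~ N_d(0, diag(sd**2))
    # Top right singular vector of X = first eigenvector of X^T X / n.
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    u1_hat = vt[0]
    cosine = min(abs(u1_hat[0]), 1.0)             # |<u_1, u1_hat>|, sign is arbitrary
    return float(np.degrees(np.arccos(cosine)))

angle_big_spike = first_pc_angle(alpha=1.7)
angle_small_spike = first_pc_angle(alpha=0.3)
print(f"alpha = 1.7: angle = {angle_big_spike:.1f} degrees")
print(f"alpha = 0.3: angle = {angle_small_spike:.1f} degrees")
```

With only 10 observations in 5000 dimensions, the large spike is recovered almost exactly, while for the small spike the estimated direction is nearly orthogonal to the true one.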
HDLSS Math. Stat. of PCA, III
Consistency of eigenvalues?
$$\frac{\hat\lambda_{1,d}}{\lambda_{1,d}} \xrightarrow{L} \frac{\chi_n^2}{n}$$
Eigenvalues Inconsistent
But known distribution
Unless $n \to \infty$ as well
HDLSS Work in Progress, II
Canonical Correlations: Myung Hee Lee
Results similar to those for PCA
Singular values inconsistent
But directions converge under a much milder spike assumption.
HDLSS Work in Progress, III
Conditions for Geo. Rep’n & PCA Consist.:
John Kent example:
$$X \sim \frac{1}{2} N_d(0, I_d) + \frac{1}{2} N_d(0, 100 \cdot I_d)$$
Can only say:
$$\|X\| = O_p(d^{1/2}) = \begin{cases} d^{1/2} & \text{w.p. } 1/2 \\ 10\, d^{1/2} & \text{w.p. } 1/2 \end{cases}$$
not deterministic
Conclude: need some flavor of mixing
HDLSS Work in Progress, III
Conditions for Geo. Rep’n & PCA Consist.:
Conclude: need some flavor of mixing
Challenge: Classical mixing conditions
require notion of time ordering
Not always clear, e.g. microarrays
HDLSS Work in Progress, III
Conditions for Geo. Rep’n & PCA Consist.:
Sungkyu Jung Condition:
$X_d \sim (0, \Sigma_d)$ where $\Sigma_d = U_d \Lambda_d U_d^t$
Define: $Z_d = \Lambda_d^{-1/2} U_d^t X_d$
Assume: ∃ a permutation $\pi_d$,
So that $\pi_d Z_d$ is ρ-mixing
HDLSS Deep Open Problem
In PCA Consistency:
Strong Inconsistency - spike $\alpha < 1$
Consistency - spike $\alpha > 1$
What happens at boundary ($\alpha = 1$)???
The Future of HDLSS Asymptotics?
1. Address your favorite statistical problem…
2. HDLSS versions of classical optimality results?
3. Contiguity Approach (~Random Matrices)
4. Rates of convergence?
5. Improved Discrimination Methods?
It is early days…
Some Carry Away Lessons
Atoms of the Analysis: Object Oriented
Viewpoint: Object Space ↔ Feature Space
DWD is attractive for HDLSS classification
“Randomness” in HDLSS data is only in rotations
(Modulo rotation, have constant simplex shape)
How to put HDLSS asymptotics to work?