11 Algorithmic Techniques for Massive Data (COMS 6998-9) Alex Andoni.
Scaling Multivariate Statistics to Massive Data Algorithmic problems and approaches
Scaling Multivariate Statistics to Massive Data
Algorithmic problems and approaches
Alexander Gray, Georgia Institute of Technology
www.fast-lab.org
Core methods of statistics / machine learning / mining
1. Querying: spherical range-search O(N), orthogonal range-search O(N), spatial join O(N2), nearest-neighbor O(N), all-nearest-neighbors O(N2)
2. Density estimation: mixture of Gaussians, kernel density estimation O(N2), kernel conditional density estimation O(N3)
3. Regression: linear regression, kernel regression O(N2), Gaussian process regression O(N3)
4. Classification: decision tree, nearest-neighbor classifier O(N2), nonparametric Bayes classifier O(N2), support vector machine O(N3)
5. Dimension reduction: principal component analysis, non-negative matrix factorization, kernel PCA O(N3), maximum variance unfolding O(N3)
6. Outlier detection: by density estimation or dimension reduction
7. Clustering: by density estimation or dimension reduction, k-means, mean-shift segmentation O(N2), hierarchical clustering O(N3)
8. Time series analysis: Kalman filter, hidden Markov model, trajectory tracking O(Nn)
9. Feature selection and causality: LASSO, L1 SVM, Gaussian graphical models, discrete graphical models
10. Fusion and matching: sequence alignment, bipartite matching O(N3), n-point correlation 2-sample testing O(Nn)
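To see where the quadratic costs come from, consider kernel density estimation: every query point must sum a kernel over every reference point. A minimal illustrative sketch (1-d Gaussian kernel, toy data; not code from the talk):

```python
import math

def kde_naive(queries, references, bandwidth):
    """Naive kernel density estimate: O(|queries| * |references|) kernel sums."""
    norm = 1.0 / (len(references) * bandwidth * math.sqrt(2 * math.pi))
    densities = []
    for q in queries:
        # Every query touches every reference point -- the quadratic bottleneck.
        total = sum(math.exp(-0.5 * ((q - r) / bandwidth) ** 2) for r in references)
        densities.append(norm * total)
    return densities

refs = [0.0, 0.5, 1.0, 1.5, 2.0]
print(kde_naive([1.0], refs, bandwidth=0.5))
```

Tree-based and series-expansion methods attack exactly this nested loop, which is how the O(N2) entries above become the O(N) entries on the next slide.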
Now pretty fast (2011)…
1. Querying: spherical range-search O(logN)*, orthogonal range-search O(logN)*, spatial join O(N)*, nearest-neighbor O(logN), all-nearest-neighbors O(N)
2. Density estimation: mixture of Gaussians, kernel density estimation O(N), kernel conditional density estimation O(Nlog3)*
3. Regression: linear regression, kernel regression O(N), Gaussian process regression O(N)*
4. Classification: decision tree, nearest-neighbor classifier O(N), nonparametric Bayes classifier O(N)*, support vector machine
5. Dimension reduction: principal component analysis, non-negative matrix factorization, kernel PCA O(N)*, maximum variance unfolding O(N)*
6. Outlier detection: by density estimation or dimension reduction
7. Clustering: by density estimation or dimension reduction, k-means, mean-shift segmentation O(N), hierarchical clustering O(N log N)
8. Time series analysis: Kalman filter, hidden Markov model, trajectory tracking O(N log n)*
9. Feature selection and causality: LASSO, L1 SVM, Gaussian graphical models, discrete graphical models
10. Fusion and matching: sequence alignment, bipartite matching O(N)**, n-point correlation 2-sample testing O(N log n)*
Things we made fast
(* fastest, ** fastest in some settings)
1. Querying: spherical range-search O(logN)*, orthogonal range-search O(logN)*, spatial join O(N)*, nearest-neighbor O(logN), all-nearest-neighbors O(N)
2. Density estimation: mixture of Gaussians, kernel density estimation O(N), kernel conditional density estimation O(Nlog3)*
3. Regression: linear regression, kernel regression O(N), Gaussian process regression O(N)*
4. Classification: decision tree, nearest-neighbor classifier O(N), nonparametric Bayes classifier O(N)*, support vector machine O(N)/O(N2)
5. Dimension reduction: principal component analysis, non-negative matrix factorization, kernel PCA O(N)*, maximum variance unfolding O(N)*
6. Outlier detection: by density estimation or dimension reduction
7. Clustering: by density estimation or dimension reduction, k-means, mean-shift segmentation O(N), hierarchical (FoF) clustering O(N log N)
8. Time series analysis: Kalman filter, hidden Markov model, trajectory tracking O(N log n)*
9. Feature selection and causality: LASSO, L1 SVM, Gaussian graphical models, discrete graphical models
10. Fusion and matching: sequence alignment, bipartite matching O(N)**, n-point correlation 2-sample testing O(N log n)*
Core computational problems
What are the basic mathematical operations making things hard?
• Alternative to speeding up each of the 1000s of statistical methods: treat common computational bottlenecks
• Divide up the space of problems (and associated algorithmic strategies), so we can examine the unique challenges and possible ways forward within each
The “7 Giants” of data
1. Basic statistics
2. Generalized N-body problems
3. Graph-theoretic problems
4. Linear-algebraic problems
5. Optimizations
6. Integrations
7. Alignment problems
The “7 Giants” of data
1. Basic statistics
• e.g. counts, contingency tables, means, medians, variances, range queries (SQL queries)
2. Generalized N-body problems
• e.g. nearest-nbrs (in NLDR, etc.), kernel summations (in KDE, GP, SVM, etc.), clustering, MST, spatial correlations
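A generalized N-body problem in its simplest form is all-nearest-neighbors: the brute-force version below makes O(N²) distance evaluations, which is precisely what tree-based methods reduce. (Illustrative sketch, not code from the slides.)

```python
def all_nearest_neighbors(points):
    """Brute-force all-NN: compares every pair, hence O(N^2) distance evaluations."""
    nn = []
    for i, p in enumerate(points):
        best_j, best_d = None, float("inf")
        for j, q in enumerate(points):
            if i == j:
                continue
            d = sum((a - b) ** 2 for a, b in zip(p, q))  # squared Euclidean distance
            if d < best_d:
                best_j, best_d = j, d
        nn.append(best_j)
    return nn

print(all_nearest_neighbors([(0, 0), (0, 1), (5, 5)]))  # -> [1, 0, 1]
```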
The “7 Giants” of data
3. Graph-theoretic problems
• e.g. betweenness centrality, commute distance, graphical model inference
4. Linear-algebraic problems
• e.g. linear algebra, PCA, Gaussian process regression, manifold learning
5. Optimizations
• e.g. LP/QP/SDP/SOCP/MINLPs in regularized methods, compressed sensing
The “7 Giants” of data
6. Integrations
• e.g. Bayesian inference
7. Alignment problems
• e.g. BLAST in genomics, string matching, phylogenies, SLAM, cross-match
Back to our list
(basic, N-body, graphs, linear algebra, optimization, integration, alignment)
1. Querying: spherical range-search O(N), orthogonal range-search O(N), spatial join O(N2), nearest-neighbor O(N), all-nearest-neighbors O(N2)
2. Density estimation: mixture of Gaussians, kernel density estimation O(N2), kernel conditional density estimation O(N3)
3. Regression: linear regression, kernel regression O(N2), Gaussian process regression O(N3)
4. Classification: decision tree, nearest-neighbor classifier O(N2), nonparametric Bayes classifier O(N2), support vector machine O(N3)
5. Dimension reduction: principal component analysis, non-negative matrix factorization, kernel PCA O(N3), maximum variance unfolding O(N3)
6. Outlier detection: by density estimation or dimension reduction
7. Clustering: by density estimation or dimension reduction, k-means, mean-shift segmentation O(N2), hierarchical clustering O(N3)
8. Time series analysis: Kalman filter, hidden Markov model, trajectory tracking O(Nn)
9. Feature selection and causality: LASSO, L1 SVM, Gaussian graphical models, discrete graphical models
10. Fusion and matching: sequence alignment, bipartite matching O(N3), n-point correlation 2-sample testing O(Nn)
5 settings
1. “Regular”: batch, in-RAM/core, one CPU
2. Streaming (non-batch)
3. Disk (out-of-core)
4. Distributed: threads/multi-core (shared memory)
5. Distributed: clusters/cloud (distributed memory)
4 common data types
1. Vector data, iid
2. Time series
3. Images
4. Graphs
3 desiderata
1. Fast experimental runtime/performance*
2. Fast theoretic (provable) runtime/performance*
3. Accuracy guarantees
*Performance: runtime, memory, communication, disk accesses; time-constrained, anytime, etc.
7 general solution strategies
1. Divide and conquer (indexing structures)
2. Dynamic programming
3. Function transforms
4. Random sampling (Monte Carlo)
5. Non-random sampling (active learning)
6. Parallelism
7. Problem reduction
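As a toy instance of strategy 1 (divide and conquer via an indexing structure), a sorted array plus binary search answers 1-d nearest-neighbor queries in O(log N) instead of a linear scan. (Illustrative sketch; `nearest_1d` is a hypothetical helper, not from the talk.)

```python
import bisect

def nearest_1d(sorted_vals, q):
    """Divide and conquer via binary search: O(log N) per query vs an O(N) scan."""
    i = bisect.bisect_left(sorted_vals, q)
    # The nearest neighbor is one of the (at most two) values bracketing q.
    candidates = sorted_vals[max(0, i - 1):i + 1]
    return min(candidates, key=lambda v: abs(v - q))

vals = sorted([3.0, 1.0, 4.0, 1.5, 9.0])
print(nearest_1d(vals, 3.6))  # -> 4.0
```

Trees (kd-trees, cover trees, etc.) generalize this bracketing idea to many dimensions.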
1. Summary statistics
• Examples: counts, contingency tables, means, medians, variances, range queries (SQL queries)
• What’s unique/challenges: streaming, new guarantees
• Promising/interesting:
– Sketching approaches
– AD-trees
– MapReduce/Hadoop (Aster, Greenplum, Netezza)
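To make the sketching idea concrete, here is a minimal Count-Min sketch: it approximates stream counts in fixed memory and only ever overestimates. (Illustrative sketch; the hash choice and table sizes are arbitrary assumptions, not from the talk.)

```python
import hashlib

class CountMinSketch:
    """Minimal Count-Min sketch: approximate stream counts in O(width * depth)
    memory, independent of stream length. Estimates never underestimate."""
    def __init__(self, width=256, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # One hash per row, derived by salting with the row number.
        h = hashlib.md5(f"{row}:{item}".encode()).hexdigest()
        return int(h, 16) % self.width

    def add(self, item):
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += 1

    def estimate(self, item):
        # Taking the minimum over rows limits the damage from collisions.
        return min(self.table[row][self._index(item, row)] for row in range(self.depth))

cms = CountMinSketch()
for word in ["a", "b", "a", "a"]:
    cms.add(word)
print(cms.estimate("a"))  # >= 3 (a Count-Min sketch never underestimates)
```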
2. Generalized N-body problems
• Examples: nearest-nbrs (in NLDR, etc), kernel summations (in KDE, GP, SVM, etc), clustering, MST, spatial correlations
• What’s unique/challenges: general dimension, non-Euclidean, new guarantees (e.g. in rank)
• Promising/interesting: – Generalized/higher-order FMM O(N2) O(N)
– Random projections
– GPUs
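The random-projection idea can be sketched in a few lines: multiply each point by a random Gaussian matrix to drop from dimension d to a small target dimension, approximately preserving pairwise distances (Johnson-Lindenstrauss). (Illustrative sketch, not code from the slides.)

```python
import math
import random

def random_projection(points, target_dim, seed=0):
    """Project d-dim points to target_dim via a random Gaussian matrix;
    pairwise distances are approximately preserved (Johnson-Lindenstrauss)."""
    rng = random.Random(seed)
    d = len(points[0])
    scale = 1.0 / math.sqrt(target_dim)  # keeps expected squared norms unchanged
    matrix = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(target_dim)]
    return [[scale * sum(row[j] * p[j] for j in range(d)) for row in matrix]
            for p in points]

high_dim = [[1.0] * 100, [0.0] * 100]
low_dim = random_projection(high_dim, target_dim=10)
print(len(low_dim[0]))  # -> 10
```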
3. Graph-theoretic problems
• Examples: betweenness centrality, commute dist, graphical model inference
• What’s unique/challenges: high interconnectivity (cliques), out-of-core
• Promising/interesting:
– Variational methods
– Stochastic composite likelihood methods
– MapReduce/Hadoop (Facebook, etc.)
4. Linear-algebraic problems
• Examples: linear algebra, PCA, Gaussian process regression, manifold learning
• What’s unique/challenges: probabilistic guarantees, kernel matrices
• Promising/interesting:
– Sampling-based methods
– Online methods
– Approximate matrix-vector multiply via N-body
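Sampling-based linear algebra can be illustrated with an unbiased Monte Carlo matrix-vector product: sample a few columns rather than touching all of them. (Toy sketch with uniform column sampling; practical methods weight columns by their norms.)

```python
import random

def approx_matvec(A, x, num_samples, seed=0):
    """Unbiased Monte Carlo estimate of A @ x: sample columns uniformly and
    reweight, instead of summing over all len(x) columns."""
    rng = random.Random(seed)
    n = len(x)
    result = [0.0] * len(A)
    for _ in range(num_samples):
        j = rng.randrange(n)
        w = n / num_samples  # importance weight for uniform column sampling
        for i in range(len(A)):
            result[i] += w * A[i][j] * x[j]
    return result

print(approx_matvec([[1.0, 1.0, 1.0, 1.0]], [1.0] * 4, num_samples=2))
# -> [4.0] (exact here because all column contributions are identical)
```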
5. Optimizations
• Examples: LP/QP/SDP/SOCP/MINLPs in regularized methods, compressed sensing
• What’s unique/challenges: stochastic programming, streaming
• Promising/interesting:
– Reformulations/relaxations of various ML forms
– Online, mini-batch methods
– Parallel online methods
– Submodular functions
– Global optimization (non-convex)
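Mini-batch methods can be sketched with stochastic gradient descent on 1-d least squares: each update touches only a small random batch rather than the full dataset. (Illustrative sketch; the step size and batch size are arbitrary assumptions.)

```python
import random

def sgd_linear(xs, ys, lr=0.05, epochs=200, batch=2, seed=0):
    """Mini-batch SGD for the 1-d least-squares model y ~ w*x."""
    rng = random.Random(seed)
    w = 0.0
    idx = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(idx)
        for start in range(0, len(idx), batch):
            chunk = idx[start:start + batch]
            # Gradient of sum (w*x - y)^2 over the batch only.
            grad = sum(2 * (w * xs[i] - ys[i]) * xs[i] for i in chunk)
            w -= lr * grad / len(chunk)
    return w

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]  # noiseless y = 2x
print(round(sgd_linear(xs, ys), 2))  # -> 2.0
```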
6. Integrations
• Examples: Bayesian inference
• What’s unique/challenges: general dimension
• Promising/interesting:
– MCMC
– ABC
– Particle filtering
– Adaptive importance sampling, active learning
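The baseline behind these methods is plain Monte Carlo integration: average the integrand over random samples. Adaptive importance sampling then reshapes the sampling distribution to cut the variance of exactly this estimator. (Illustrative baseline sketch, not code from the talk.)

```python
import random

def mc_integral(f, a, b, n, seed=0):
    """Plain Monte Carlo integration: (b - a) times the mean of f over
    n uniform samples on [a, b]. Error shrinks like 1/sqrt(n)."""
    rng = random.Random(seed)
    total = sum(f(rng.uniform(a, b)) for _ in range(n))
    return (b - a) * total / n

# Estimate the integral of x^2 on [0, 1]; the true value is 1/3.
est = mc_integral(lambda x: x * x, 0.0, 1.0, 50000)
print(est)
```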
7. Alignments
• Examples: BLAST in genomics, string matching, phylogenies, SLAM, cross-match
• What’s unique/challenges: greater heterogeneity, measurement errors
• Promising/interesting:
– Probabilistic representations
– Reductions to generalized N-body problems
Reductions/transformations between problems
• Gaussian graphical models → linear algebra
• Bayesian integration → MAP optimization
• Euclidean graphs → N-body problems
• Linear algebra on kernel matrices → N-body inside conjugate gradient
• Can featurize a graph or any other structure → matrix-based ML problem
• Create new ML methods with different computational properties
General conclusions
• Algorithms can dramatically change the runtime order, e.g. O(N2) to O(N)
• High dimensionality is a persistent challenge
• The non-default settings (e.g. streaming, disk…) need more research work
• Systems issues need more work, e.g. the connection to data storage/management
• Hadoop does not solve everything
General conclusions
• No general theory for the tradeoff between statistical quality and computational cost (lower/upper bounds, etc)
• Richer characterizations of hardness (both statistical and computational) are needed