So Much Data, So Little Time

Bernard Chazelle, Princeton University

Page 1:

So Much Data

Bernard Chazelle

Princeton University

So Little Time

Page 2:

So Many Slides

Bernard Chazelle

Princeton University

So Little Time

(before lunch)

Page 3:

computation

math experimentation

algorithms

Page 4:

Computers have two problems

Page 5:

1. They don’t have steering wheels

Page 6:
Page 7:

2. End of Moore’s Law

party’s over!

Page 8:

computation

algorithms experimentation

Page 9:

32 × 17

224 + 320

= 544

This is not me

Page 10:

FFT

RSA

Page 11:
Page 12:
Page 13:

noisy

low entropy

uncertain

unevenly priced

big

Page 14:

noisy

low entropy

uncertain

unevenly priced

big

Page 15:

Biomedical imaging

Sloan Digital Sky Survey

4 petabytes (~1MG)

10 petabytes/yr

150 petabytes/yr

Page 16:

Collected works of Micha Sharir

My A(9,9)-th paper

Page 17:

massive input

output

Sublinear Algorithms

Sample tiny fraction
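As a rough illustration of "sample tiny fraction", here is a minimal Python sketch that estimates what fraction of a massive input satisfies a predicate while reading only a few thousand entries. The function name `estimate_fraction`, the synthetic `lookup` input, and the sample size are assumptions made for this example; they are not taken from the talk.

```python
import random

def estimate_fraction(lookup, n, num_samples, rng):
    """Estimate the fraction of indices i in range(n) where lookup(i) is true,
    reading only num_samples randomly chosen entries."""
    hits = sum(1 for _ in range(num_samples) if lookup(rng.randrange(n)))
    return hits / num_samples

# Hypothetical massive input: 10 million entries, 30% of them "ones".
# It is defined by a function, so it is never materialized in memory.
n = 10_000_000
lookup = lambda i: i % 10 < 3

rng = random.Random(0)
est = estimate_fraction(lookup, n, 2_000, rng)
print(f"estimated fraction: {est:.3f} (true value 0.3)")
```

The number of samples needed depends on the target accuracy, not on n, which is what makes the running time sublinear in the input size.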

Page 18:

Shortest Paths [C-Liu-Magen ’03]

New York

Delphi

Page 19:

Ray Shooting

Volume Intersection Point location

Page 20:

Approximate MST [C-Rubinfeld-Trevisan ’01]

Page 21:

Reduces to counting connected components

Page 22:

E = no. connected components

var << (no. connected components)^2

whp, is a good estimator of # connected components
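One way to see why sampling can count connected components: summing 1/(size of the component containing u) over all vertices u gives exactly the number of components. The Python sketch below samples random vertices, explores each one's component with a BFS that gives up after `cap` vertices, and averages the resulting 1/size values. It is a simplified illustration under assumed names (`component_size_capped`, `estimate_components`, `build_example`), not the exact procedure from the cited paper; ignoring components larger than `cap` undercounts by at most n/cap.

```python
import random
from collections import deque

def component_size_capped(adj, start, cap):
    """BFS from start. Return the size of its component,
    or None as soon as more than cap vertices have been seen."""
    seen = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                if len(seen) > cap:
                    return None
                queue.append(v)
    return len(seen)

def estimate_components(adj, num_samples, cap, rng):
    """Estimate the number of connected components from sampled vertices."""
    n = len(adj)
    total = 0.0
    for _ in range(num_samples):
        u = rng.randrange(n)
        size = component_size_capped(adj, u, cap)
        if size is not None:      # components larger than cap contribute 0
            total += 1.0 / size
    return n * total / num_samples

def build_example():
    """Hypothetical test graph: 500 isolated vertices, 500 single edges,
    and 500 triangles, i.e. 3000 vertices in 1500 components."""
    adj = []
    for _ in range(500):
        for k in (1, 2, 3):
            base = len(adj)
            for i in range(k):
                adj.append([base + j for j in range(k) if j != i])
    return adj

adj = build_example()
est = estimate_components(adj, num_samples=2_000, cap=10, rng=random.Random(1))
print(f"estimated components: {est:.0f} (true value 1500)")
```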

Page 23:

worst case

input space

average case (uniform)

Page 24:

worst case

Page 25:

average case = actuarial view

Page 26:

“OK, if you elect NOT to have the surgery, the insurance company offers 6 days and 7 nights in Barbados.”

Page 27:

arbitrary, unknown random source

Self-Improving Algorithms

Page 28:

Yes! This could be YOU, too!

Page 29:

E[T_k] → optimal expected time for random source

time T1

time T2

time T3

time T4

Page 30:

Clustering [Ailon-C-Liu-Comandur ’05]

K-median over Hamming cube

Page 31:

minimize sum of distances

Page 32:

minimize sum of distances
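To make the objective concrete, the Python sketch below evaluates the k-median cost on the Hamming cube: each point is charged its Hamming distance to the nearest center, and the charges are summed. The helper names and the five sample points are assumptions for illustration. For k = 1 the optimum is the coordinate-wise majority vote, which the example also computes; for general k the problem is much harder and is not solved here.

```python
def hamming(a, b):
    """Number of coordinates where two 0/1 tuples differ."""
    return sum(x != y for x, y in zip(a, b))

def k_median_cost(points, centers):
    """Sum over all points of the Hamming distance to the nearest center."""
    return sum(min(hamming(p, c) for c in centers) for p in points)

def one_median(points):
    """Optimal single center: coordinate-wise majority (ties go to 0)."""
    n = len(points)
    d = len(points[0])
    return tuple(1 if 2 * sum(p[j] for p in points) > n else 0 for j in range(d))

# Hypothetical input: 5 points in {0,1}^3.
points = [(0, 0, 0), (0, 0, 1), (0, 1, 0), (1, 1, 1), (1, 1, 0)]
centers = [(0, 0, 0), (1, 1, 1)]

print(k_median_cost(points, centers))                # cost with k = 2 centers
print(one_median(points))                            # best single center
print(k_median_cost(points, [one_median(points)]))   # cost with that center
```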

Page 33:

[Kumar-Sabharwal-Sen ’04]

COST ≤ (1 + ε) OPT

Page 34:

How to achieve linear limiting time?

Input space {0,1}^{dn}

prob < O(dn)/KSS

Identify core

Tail:

Use KSS

Page 35:

Store sample of precomputed KSS

Nearest neighbor

Incremental algorithm

Page 36:

Main difficulty: How to spot the tail?

Page 37:
Page 38:

encode

Page 39:

decode

Page 40:
Page 41:

Data inaccessible before noise

What makes you think it’s wrong?

Page 42:

Data inaccessible before noise

must satisfy some property

(e.g., convex, bipartite)

but does not quite

Page 43:

f(x) = ?

x

f(x)

data

f = access function

Page 44:

f(x) = ?

x

f(x)

f = access function

Page 45:

f(x) = ?

x

f(x)

But life being what it is…

Page 46:

f(x) = ?

x

f(x)

Page 47:

)(O

Humans

Define distance from any object to data class

Page 48:

f(x) = ?

x

g(x)

x1, x2,…

f(x1), f(x2),…

filter

g is access function for:

Page 49:

Online Data Reconstruction

Page 50:

Monotone function: [n] → R^d

Filter requires polylog(n) lookups

[Ailon-C-Liu-Comandur ’04]
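For the one-dimensional case (d = 1, an assumption made here for simplicity), the "distance from any object to data class" idea has a simple concrete form: the distance of a sequence to monotonicity is the minimum number of entries that must change to make it non-decreasing, which equals n minus the length of a longest non-decreasing subsequence. The Python sketch below computes this distance offline by reading the whole input. It only illustrates the quantity a filter should keep small; it is not the polylog-lookup filter from the cited work.

```python
from bisect import bisect_right

def distance_to_monotone(values):
    """Minimum number of entries to change so that values becomes
    non-decreasing: n minus the longest non-decreasing subsequence."""
    tails = []  # tails[k] = smallest possible last value of a non-decreasing subsequence of length k + 1
    for v in values:
        i = bisect_right(tails, v)
        if i == len(tails):
            tails.append(v)
        else:
            tails[i] = v
    return len(values) - len(tails)

print(distance_to_monotone([1, 2, 3, 4]))      # already monotone
print(distance_to_monotone([1, 5, 2, 3, 4]))   # one corrupted entry
print(distance_to_monotone([5, 4, 3, 2, 1]))   # far from monotone
```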

Page 51:

Convex polygon

Filter requires : lookups

[C-Comandur ’06]

Page 52:

Convex terrain

Filter requires : lookups

Page 53:

Iterated planar separator theorem

Page 54:

Iterated planar separator theorem

Page 55:

Iterated (weak) planar separator theorem

in sublinear time!

Page 56:

Using epsilon-nets in spaces of unbounded VC dimension

reconstruct

Page 57:

bipartite graph

k-connectivity

expander

Page 58:

denoising low-dim attractor sets

Page 59:

Priced computation & accuracy

spectrometry/cloning/gene chip
PCR/hybridization/chromatography
gel electrophoresis/blotting

001100001010001111110011001101011100001100000101111o1o1100001100

Linear programming

Page 60:

Pricing data

Factoring is easy. Here’s why…

Gaussian mixture sample: 00100101001001101010101….

Page 61:

Collaborators: Nir Ailon, Seshadri Comandur, Ding Liu

Avner Magen, Ronitt Rubinfeld, Luca Trevisan