Big Data and Machine Learning - TU Berlin...Some Remarks •Machine Learning •small data...

Big Data and Machine Learning

Klaus-Robert Müller et al.

Some Remarks

• Machine Learning

• small data (expensive!)

• big data

• big data in neuroscience: BCI et al.

• social media data

• physics & materials

Toward Brain Computer Interfacing

Klaus-Robert Müller, Siamac Fazli, Jan Mehnert, Stefan Haufe, Frank Meinecke, Paul von Bünau, Franz Kiraly, Felix Biessmann, Sven Dähne, Johannes Höhne, Michael Tangermann, Carmen Vidaurre, Gabriel Curio, Benjamin Blankertz et al.

Invasive BCI at its best

[From Schwartz]

Remark: 24*1000*3600*30000 ~ 2 TB/day

Noninvasive Brain-Computer Interface

DECODING

BCI for communication

‚Brain Pong‘ with BBCI

Remark: 3*100*3600*1000 ~ 1-2 GB/experiment
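The two data-volume remarks are plain products, so their orders of magnitude can be checked directly. A minimal sketch, assuming each product counts stored values and that one value takes 1 to 2 bytes (the byte size and the variable names are assumptions, not stated on the slides):

```python
# Products copied from the two remarks. The slides do not spell out what each
# factor means, so they are treated as opaque numbers here.
invasive_values_per_day = 24 * 1000 * 3600 * 30000
noninvasive_values_per_experiment = 3 * 100 * 3600 * 1000


def to_bytes(n_values, bytes_per_value):
    """Storage in bytes for n_values at an ASSUMED size per value."""
    return n_values * bytes_per_value


TB = 10**12  # decimal terabyte
GB = 10**9   # decimal gigabyte

# Invasive recording: terabytes per day at 1 byte per value.
print(to_bytes(invasive_values_per_day, 1) / TB)
# Noninvasive experiment: gigabytes at 1 and at 2 bytes per value.
print(to_bytes(noninvasive_values_per_experiment, 1) / GB)
print(to_bytes(noninvasive_values_per_experiment, 2) / GB)
```

Under these assumptions the first product is about 2.6e12 bytes per day and the second about 1.1e9 to 2.2e9 bytes per experiment, in line with the rough figures in the remarks.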

BBCI paradigms

- healthy subjects untrained for BCI

A: training <10min: right/left hand imagined movements

→ infer the respective brain activities (ML & SP)

B: online feedback session

Leitmotiv: ›let the machines learn‹

Machine learning approach to BCI: infer prototypical pattern

Inference by CSP Algorithm
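CSP (common spatial patterns) looks for spatial filters whose output has high variance for one class of trials and low variance for the other. A minimal NumPy sketch of the textbook two-class formulation, assuming trials are stored as arrays of shape (n_trials, n_channels, n_samples); the function names are hypothetical and this is not the BBCI implementation:

```python
import numpy as np


def class_covariance(trials):
    """Average trace-normalized spatial covariance over trials.

    trials: array of shape (n_trials, n_channels, n_samples).
    """
    covs = []
    for x in trials:
        x = x - x.mean(axis=1, keepdims=True)  # remove per-channel mean
        c = x @ x.T
        covs.append(c / np.trace(c))
    return np.mean(covs, axis=0)


def csp_filters(trials_a, trials_b):
    """Return (W, lam): CSP filters as columns of W, eigenvalues ascending.

    lam close to 1: filter output has high variance for class A, low for B.
    lam close to 0: the reverse.
    """
    ca = class_covariance(trials_a)
    cb = class_covariance(trials_b)
    # Whiten the composite covariance: P.T @ (ca + cb) @ P = I.
    evals, evecs = np.linalg.eigh(ca + cb)
    whitener = evecs / np.sqrt(evals)
    # Diagonalize class A's covariance in the whitened space.
    lam, rot = np.linalg.eigh(whitener.T @ ca @ whitener)
    return whitener @ rot, lam


def log_variance_features(trials, filters):
    """Log-variance of each filtered trial, shape (n_trials, n_filters)."""
    projected = np.einsum("cf,tcs->tfs", filters, trials)
    return np.log(projected.var(axis=2))
```

In practice only the few filters with the smallest and largest eigenvalues are usually kept, and their log-variance features are passed to a simple classifier.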

The cerebral cocktail party problem

• use ICA/NGCA projections for artifact and noise removal

• feature extraction and selection

[cf. Ziehe et al. 2000, Blanchard et al. 2006]

BBCI Set-up

Artifact removal

[cf. Müller et al. 2001, 2007, 2008, Dornhege et al. 2003, 2007, Blankertz et al. 2004, 2005, 2006, 2007, 2008]

Shifting distributions within experiment


Correlating apples and oranges

[Biessmann et al. Neuroimage 2012, Machine Learning 2010]

Temporal Dynamics of Web Data

Motivation

[Biessmann et al, 2012, and submitted]

Canonical Trend Analysis for Social Networks

Data Extraction

Data Extraction: Retweet Location

Mean Location of Retweeted News Articles

Downsampling of Geographic Information

Canonical Trend Model

Why project onto the canonical subspace?

Recent development: tkCCA allows optimal, nonlinear correlation over time

[Biessmann et al 2010]

Canonical Trend Analysis

Efficient Computation of Canonical Trends

[Schölkopf, Smola & Müller 98; Boser, Guyon & Vapnik 92]

Efficient Computation of Canonical Trends

Comparisons: Mean, PCA and Canonical Trends

Canonical Convolution

Spatiotemporal Analysis of Retweets of News


And now for something completely different

[Montavon et al 13, Rupp et al 2012 ….]

ML4Physics @ IPAM 2011

Klaus-Robert Müller, Matthias Rupp, Anatole von Lilienfeld and Alexandre Tkatchenko et al.

Machine Learning for chemical compound space

Ansatz:

instead of

[from von Lilienfeld]

Ansatz:

– Provide same information to ML as to SE:

• XYZ-file

• cast data similarly as in the SE:

– Unique and continuous in all of CCS

– Translationally, rotationally, permutationally invariant

– Symmetrical atoms contribute equally

→ "Coulomb" Matrix [energy]

→ fill up with zeros for smaller molecules

→ diagonalize OR sort rows according to their norm

→ measure distance between molecules:

Machine Learning for chemical compound space


Coulomb representation of molecules

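Coulomb Matrix (Rupp12):

$M_{ii} = 0.5\, Z_i^{2.4}$

$M_{ij} = \dfrac{Z_i Z_j}{|R_i - R_j|} \quad (i \neq j)$

[Figure: molecule with atoms {Z1,R1}, {Z2,R2}, {Z3,R3}, {Z4,R4}, ... + phantom atoms {0,R21}, {0,R22}, {0,R23}; matrix entries M_ij up to M_23,23]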
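The construction steps (Coulomb matrix, zero padding, then either diagonalizing or sorting rows by norm) can be sketched in NumPy. This is a minimal illustration using the Coulomb matrix definition published in Rupp et al. 2012 (diagonal 0.5 Z_i^2.4, off-diagonal Z_i Z_j / |R_i - R_j|). The function names, the example geometry, and the Euclidean distance between descriptors are assumptions made for illustration:

```python
import numpy as np


def coulomb_matrix(charges, positions, size):
    """Zero-padded Coulomb matrix of shape (size, size).

    charges: nuclear charges Z_i, shape (n,).
    positions: coordinates R_i, shape (n, 3).
    size: padded dimension (number of atoms in the largest molecule).
    """
    z = np.asarray(charges, dtype=float)
    r = np.asarray(positions, dtype=float)
    n = len(z)
    if size < n:
        raise ValueError("size must be at least the number of atoms")
    dist = np.linalg.norm(r[:, None, :] - r[None, :, :], axis=-1)
    np.fill_diagonal(dist, 1.0)  # placeholder, diagonal is overwritten below
    m = np.outer(z, z) / dist
    np.fill_diagonal(m, 0.5 * z ** 2.4)
    padded = np.zeros((size, size))
    padded[:n, :n] = m  # smaller molecules are filled up with zeros
    return padded


def sort_by_row_norm(m):
    """Reorder rows and columns by descending row norm."""
    order = np.argsort(-np.linalg.norm(m, axis=1), kind="stable")
    return m[order][:, order]


def eigenspectrum(m):
    """Eigenvalues ordered by descending absolute value."""
    ev = np.linalg.eigvalsh(m)
    return ev[np.argsort(-np.abs(ev), kind="stable")]


def descriptor_distance(d1, d2):
    """Euclidean distance between two descriptors (an assumed choice)."""
    return float(np.linalg.norm(np.asarray(d1) - np.asarray(d2)))
```

Both the sorted matrix and the eigenspectrum are unchanged when the atoms of a molecule are listed in a different order, which is the permutational invariance the Ansatz asks for.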

Kernel ridge regression

Distances between M define Gaussian kernel matrix K

Predict energy as sum over weighted Gaussians

using weights that minimize error in training set

Exact solution

As many parameters as molecules, plus 2 global parameters: characteristic length scale or kT of the system (σ) and noise level (λ)
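The kernel ridge regression steps on this slide can be sketched as follows. This is a minimal NumPy illustration, assuming each molecule is represented by a fixed-length descriptor vector and assuming the common kernel form exp(-d^2 / (2 σ^2)); the function names are hypothetical:

```python
import numpy as np


def gaussian_kernel(a, b, sigma):
    """K[i, j] = exp(-||a_i - b_j||^2 / (2 sigma^2)) for row-vector descriptors."""
    sq = (
        np.sum(a ** 2, axis=1)[:, None]
        + np.sum(b ** 2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma ** 2))


def fit_krr(x_train, y_train, sigma, lam):
    """Exact solution: alpha = (K + lam * I)^-1 y, one weight per training molecule."""
    k = gaussian_kernel(x_train, x_train, sigma)
    return np.linalg.solve(k + lam * np.eye(len(x_train)), y_train)


def predict_krr(x_new, x_train, alpha, sigma):
    """Prediction as a weighted sum of Gaussians centered on the training set."""
    return gaussian_kernel(x_new, x_train, sigma) @ alpha
```

The two global parameters σ and λ are not learned by the linear solve; they are typically chosen by cross-validation on held-out molecules.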


GDB-13 database of all organic molecules (within stability & synthetic constraints) of 13 heavy atoms or fewer: 0.9B compounds

Blum & Reymond, JACS (2009)

The data


Results

March 2012

Rupp et al., PRL

9.99 kcal/mol

(kernels + eigenspectrum)

December 2012

Montavon et al., NIPS

3.51 kcal/mol

(deep Neural nets + Coulomb sets)

More fun is yet to come...

Prediction considered chemically accurate when MAE is below 1 kcal/mol

Dataset available at http://quantum-machine.org

Conclusion

• Machine Learning is a versatile and ready-to-use tool for data analysis

• small data vs. big data

• fields of ML & Databases will hit a limit in the near future

• time for a new marriage
