Big Data and Machine Learning
Klaus-Robert Müller et al.
Some Remarks
• Machine Learning
• small data (expensive!)
• big data
• big data in neuroscience: BCI et al.
• social media data
• physics & materials
Toward Brain Computer Interfacing
Klaus-Robert Müller, Siamac Fazli, Jan Mehnert, Stefan Haufe, Frank
Meinecke, Paul von Bünau, Franz Kiraly, Felix Biessmann, Sven Dähne,
Johannes Höhne, Michael Tangermann, Carmen Vidaurre, Gabriel Curio,
Benjamin Blankertz et al.
Invasive BCI at its best
[From Schwartz]
Remark: 24 * 1000 * 3600 * 30000 ~ 2 TB/day
Noninvasive Brain-Computer Interface
DECODING
BCI for communication
'Brain Pong' with BBCI
Remark: 3 * 100 * 3600 * 1000 ~ 1-2 GB/experiment
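The two data-volume remarks can be checked with a few lines of arithmetic. This is a minimal sketch: reading the factors as hours, channels, seconds per hour, and sampling rate in Hz is an assumption (the slides do not label them), and the function name is hypothetical.

```python
def samples_per_recording(hours, channels, sampling_rate_hz):
    """Raw sample count: hours * 3600 s/h * channels * samples per second."""
    return hours * 3600 * channels * sampling_rate_hz


# Invasive remark: 24 * 1000 * 3600 * 30000 (assumed: 24 h, 1000 channels, 30 kHz).
invasive = samples_per_recording(24, 1000, 30000)

# Non-invasive remark: 3 * 100 * 3600 * 1000 (assumed: 3 h, 100 channels, 1 kHz).
noninvasive = samples_per_recording(3, 100, 1000)

print(f"invasive:     {invasive:.3e} samples per day")
print(f"non-invasive: {noninvasive:.3e} samples per experiment")
```

At one to two bytes per sample (also an assumption), these counts land in the terabyte and gigabyte ranges quoted in the remarks.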
BBCI paradigms
- healthy subjects untrained for BCI
A: training <10min: right/left hand imagined movements
→ infer the respective brain activities (ML & SP)
B: online feedback session
Leitmotiv: ›let the machines learn‹
Machine learning approach to BCI: infer prototypical pattern
Inference by CSP Algorithm
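A minimal sketch of Common Spatial Patterns (CSP) for two classes, assuming band-pass filtered trials stored as arrays of shape (trials, channels, samples). The function names, the whitening-based formulation, and the log-variance features are illustrative choices, not taken from the slides or from the BBCI code.

```python
import numpy as np


def csp_filters(trials_a, trials_b, n_pairs=1):
    """Return CSP spatial filters (as columns) and their eigenvalues.

    trials_a, trials_b: arrays of shape (n_trials, n_channels, n_samples).
    """
    def mean_covariance(trials):
        # np.cov treats rows (channels) as variables.
        return np.mean([np.cov(trial) for trial in trials], axis=0)

    cov_a = mean_covariance(trials_a)
    cov_b = mean_covariance(trials_b)

    # Whiten the composite covariance: whitener.T @ (cov_a + cov_b) @ whitener = I.
    evals, evecs = np.linalg.eigh(cov_a + cov_b)
    whitener = evecs @ np.diag(evals ** -0.5)

    # Diagonalize the whitened class-A covariance; eigenvalues lie in [0, 1]
    # and are returned in ascending order.
    d, b = np.linalg.eigh(whitener.T @ cov_a @ whitener)
    filters = whitener @ b

    # Keep the filters from both ends of the spectrum: small d means high
    # variance for class B, large d means high variance for class A.
    picks = list(range(n_pairs)) + list(range(-n_pairs, 0))
    return filters[:, picks], d[picks]


def log_variance_features(trials, filters):
    """Project each trial onto the filters and take the log variance."""
    projected = np.einsum("cf,ncs->nfs", filters, trials)
    return np.log(projected.var(axis=2))
```

The resulting log-variance features are what a simple linear classifier would typically be trained on after the short calibration session.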
The cerebral cocktail party problem
• use ICA/NGCA projections for artifact and noise removal
• feature extraction and selection
[cf. Ziehe et al. 2000, Blanchard et al. 2006]
BBCI Set-up
Artifact removal
[cf. Müller et al. 2001, 2007, 2008, Dornhege et al. 2003, 2007, Blankertz et al. 2004, 2005, 2006, 2007, 2008]
Shifting distributions within experiment
Correlating apples and oranges
[Biessmann et al. Neuroimage 2012, Machine Learning 2010]
Temporal Dynamics of Web Data
Motivation
[Biessmann et al, 2012, and submitted]
Canonical Trend Analysis for Social Networks
Data Extraction
Data Extraction: Retweet Location
Mean Location of Retweeted News Articles
Downsampling of Geographic Information
Canonical Trend Model
Why project onto the canonical subspace?
Recent development: tkCCA allows optimal, nonlinear correlation over time
[Biessmann et al 2010]
Canonical Trend Analysis
Efficient Computation of Canonical Trends
[Schölkopf, Smola & Müller 98; Boser, Guyon & Vapnik 92]
Comparisons: Mean, PCA and Canonical Trends
Canonical Convolution
Spatiotemporal Analysis of Retweets of News
And now for something completely different
[Montavon et al 13, Rupp et al 2012 ….]
ML4Physics @ IPAM 2011
Klaus-Robert Müller, Matthias Rupp
Anatole von Lilienfeld and Alexandre Tkachenko et al
Machine Learning for chemical compound space
Ansatz:
instead of
[from von Lilienfeld]
Ansatz:
– Provide the same information to ML as to the Schrödinger equation (SE):
• XYZ-file
• cast data similarly as in the SE:
– Unique and continuous in all of CCS
– Translationally, rotationally, permutationally invariant
– Symmetrical atoms contribute equally
→ "Coulomb" Matrix [energy]
→ fill up with zeros for smaller molecules
→ diagonalize OR sort rows according to their norm
→ measure distance between molecules:
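Coulomb representation of molecules
Coulomb Matrix (Rupp12):
$$M_{ii} = 0.5\, Z_i^{2.4}, \qquad M_{ij} = \frac{Z_i Z_j}{|R_i - R_j|} \quad (i \neq j)$$
Atoms: {Z_1, R_1}, {Z_2, R_2}, {Z_3, R_3}, {Z_4, R_4}, ...
+ phantom atoms: {0, R_21}, {0, R_22}, {0, R_23}
M is a 23 x 23 matrix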
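The representation steps (build the Coulomb matrix, fill up with zeros, sort rows by norm, measure a distance) can be sketched as follows. This is an illustration under assumptions: the 0.5 prefactor on the diagonal follows Rupp et al. 2012, the Frobenius norm stands in for the distance formula that did not survive in the transcript, and all function names are hypothetical.

```python
import numpy as np


def coulomb_matrix(charges, positions, size):
    """Zero-padded Coulomb matrix for one molecule.

    charges: nuclear charges Z_i, shape (n,).
    positions: Cartesian coordinates R_i, shape (n, 3), in consistent units.
    size: side length of the padded matrix (e.g. 23 for the largest molecule).
    """
    z = np.asarray(charges, dtype=float)
    r = np.asarray(positions, dtype=float)
    n = len(z)
    if n > size:
        raise ValueError("molecule has more atoms than the padded size")

    dist = np.linalg.norm(r[:, None, :] - r[None, :, :], axis=-1)
    np.fill_diagonal(dist, 1.0)            # placeholder, avoids division by zero
    m = np.outer(z, z) / dist              # off-diagonal: Z_i Z_j / |R_i - R_j|
    np.fill_diagonal(m, 0.5 * z ** 2.4)    # diagonal: 0.5 * Z_i^2.4

    padded = np.zeros((size, size))        # phantom atoms contribute zeros
    padded[:n, :n] = m
    return padded


def sort_by_row_norm(m):
    """Reorder rows and columns together by descending row norm."""
    order = np.argsort(-np.linalg.norm(m, axis=1), kind="stable")
    return m[np.ix_(order, order)]


def matrix_distance(m_a, m_b):
    """Frobenius distance between two equally sized matrices (assumed metric)."""
    return np.linalg.norm(m_a - m_b)
```

Sorting makes the descriptor independent of the atom order in the XYZ file, so two listings of the same molecule end up at distance zero.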
Kernel ridge regression
Distances between M define Gaussian kernel matrix K
Predict energy as sum over weighted Gaussians
using weights that minimize error in training set
Exact solution
As many parameters as molecules, plus 2 global parameters: the characteristic length scale or kT of the system (σ) and the noise level (λ)
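A minimal sketch of kernel ridge regression with a Gaussian kernel, assuming each molecule is given as a flat descriptor vector (for example a flattened, sorted Coulomb matrix). The closed-form weights are α = (K + λI)^(-1) y; the kernel form exp(-d² / (2σ²)) and the function names are assumptions made for illustration.

```python
import numpy as np


def pairwise_distances(x_a, x_b):
    """Euclidean distances between rows of x_a and rows of x_b."""
    diff = x_a[:, None, :] - x_b[None, :, :]
    return np.linalg.norm(diff, axis=-1)


def gaussian_kernel(dists, sigma):
    """Gaussian kernel with length scale sigma (assumed form)."""
    return np.exp(-dists ** 2 / (2.0 * sigma ** 2))


def fit_krr(x_train, y_train, sigma, lam):
    """Exact solution: one weight per training molecule."""
    k = gaussian_kernel(pairwise_distances(x_train, x_train), sigma)
    return np.linalg.solve(k + lam * np.eye(len(x_train)), y_train)


def predict_krr(x_train, alpha, x_new, sigma):
    """Prediction as a weighted sum of Gaussians centred on training molecules."""
    k_new = gaussian_kernel(pairwise_distances(x_new, x_train), sigma)
    return k_new @ alpha
```

In practice σ and λ would be chosen by cross-validation on held-out molecules; the weights themselves need no iterative training.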
[from von Lilienfeld]
GDB-13 database of all organic molecules (within stability & synthetic constraints) of 13 heavy atoms or less: 0.9B compounds
Blum & Reymond, JACS (2009)
The data
[from von Lilienfeld]
Results
March 2012
Rupp et al., PRL
9.99 kcal/mol
(kernels + eigenspectrum)
December 2012
Montavon et al., NIPS
3.51 kcal/mol
(deep Neural nets + Coulomb sets)
More fun is yet to come...
Prediction considered chemically accurate when MAE is below 1 kcal/mol
Dataset available at http://quantum-machine.org
Conclusion
• Machine Learning is a versatile and ready-to-use tool for data analysis
• small data vs. big data
• the fields of ML & databases will hit a limit in the near future
• time for a new marriage