How to do Complex Analytics

Michael Stonebraker

2

Big Volume - Little Analytics

• SQL aggregates, group_by

• Find me the average closing price of MSFT on all trading days within the last 3 years

• Find me the average closing price of each stock in the DJIA on trading days in the last 5 years

• High performance on SQL analytics available from the data warehouse crowd
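
As a rough sketch of the "little analytics" case, the per-stock average can be written as a single GROUP BY. The table name `prices`, its columns, the helper `average_close_by_symbol`, and the sample rows are all hypothetical, and SQLite stands in for a data warehouse only because it ships with Python.

```python
import sqlite3

# Hypothetical schema and rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE prices (symbol TEXT, trade_date TEXT, close REAL)")
conn.executemany(
    "INSERT INTO prices VALUES (?, ?, ?)",
    [
        ("MSFT", "2012-01-03", 10.0),
        ("MSFT", "2012-01-04", 20.0),
        ("IBM", "2012-01-03", 100.0),
        ("IBM", "2012-01-04", 110.0),
        ("IBM", "2005-01-03", 900.0),  # outside the date window, filtered out
    ],
)


def average_close_by_symbol(conn, since):
    """Average closing price per symbol on or after the ISO date `since`."""
    rows = conn.execute(
        "SELECT symbol, AVG(close) FROM prices "
        "WHERE trade_date >= ? GROUP BY symbol ORDER BY symbol",
        (since,),
    )
    return dict(rows.fetchall())


averages = average_close_by_symbol(conn, "2010-01-01")
```

The whole computation is a filter plus an aggregate, which is the workload the slide says data warehouses already handle well.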

3

Big Data - Big Analytics

• Complex math operations (machine learning, clustering, trend detection, …)

— The world of the “quants”

— Mostly specified as linear algebra on array data

• A dozen or so common ‘inner loops’

— Matrix multiply

— QR decomposition

— SVD decomposition

— Linear regression

4

Big Data - Big Analytics: An Example

• Consider closing price on all trading days for the last 5 years for two stocks A and B

• What is the covariance between the two time-series?

cov(A, B) = (1/N) * sum_i (A_i - mean(A)) * (B_i - mean(B))
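
A minimal sketch of that formula in plain Python, assuming the 1/N (population) form written on the slide. The function name `covariance` and the two short price series are made up for illustration.

```python
def covariance(a, b):
    """Population covariance: (1/N) * sum((a_i - mean(a)) * (b_i - mean(b)))."""
    if len(a) != len(b) or not a:
        raise ValueError("series must be non-empty and of equal length")
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / n


# Made-up closing prices for two stocks over four trading days.
stock_a = [1.0, 2.0, 3.0, 4.0]
stock_b = [2.0, 4.0, 6.0, 8.0]
cov_ab = covariance(stock_a, stock_b)
```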

5

Now Make It Interesting …

• Do this for all pairs of 4000 stocks

— The data is the following 4000 x 1000 matrix

Stock   t1   t2   t3   t4   t5   t6   t7   …   t1000
S1
S2
…
S4000

Hourly data? All securities?

6

Solution

• Except for the constant (1/N) and subtracting off the means, this is a single matrix multiply:

— Stock * Stock^T (Stock times its transpose)
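
The identity can be sketched in plain Python: remove each row's mean, multiply the centered matrix by its own transpose, and scale by 1/N. The helper names `center_rows` and `all_pairs_covariance` and the tiny 3 x 4 matrix are hypothetical stand-ins for the 4000 x 1000 case.

```python
def center_rows(matrix):
    """Subtract each row's mean from that row (one row per stock)."""
    centered = []
    for row in matrix:
        mean = sum(row) / len(row)
        centered.append([value - mean for value in row])
    return centered


def all_pairs_covariance(stock):
    """Return C = (1/N) * X * X^T, where X is `stock` with row means removed."""
    x = center_rows(stock)
    n = len(stock[0])  # number of time points per stock
    return [
        [sum(p * q for p, q in zip(row_i, row_j)) / n for row_j in x]
        for row_i in x
    ]


# Made-up 3 x 4 matrix standing in for the 4000 x 1000 one.
stock = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],
    [4.0, 3.0, 2.0, 1.0],
]
cov = all_pairs_covariance(stock)
```

Entry `cov[i][j]` is the covariance of stocks i and j. At 4000 stocks the result has 4000 x 4000 = 16 million entries, so a real system would hand this to an optimized matrix multiply. The nested loops here only show the structure.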

7

Big Data - Big Analytics: Requirements

• SQL-style data management

— Filters, joins, …

• Complex array manipulation

8

Big Data - Big Analytics: Solution Options

• Math package

• RDBMS

• RDBMS + math package

• Array database

• Hadoop

9

Solution Options: R, SAS, Matlab, et al.

• Weak or non-existent data management

— Do the correlation only for companies with revenue > $1B?

• File system storage

• R doesn’t scale and is not a parallel system

— Revolution does a bit better

10

Solution Options: RDBMS alone

• SQL simulator (MADlib) is slooooow

— And only does some of the required operations

• Coding operations as UDFs still requires you to simulate arrays on top of tables (sloooow)

— And the current UDF model is not powerful enough to support iteration
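
To make "simulate arrays on top of tables" concrete, here is a minimal sketch of a matrix times its transpose over a table holding one row per array cell. The `cells(i, t, val)` schema is hypothetical, SQLite is used only because it ships with Python, and this is not a description of how MADlib or any particular RDBMS implements the operation. It only shows that the approach turns a matrix multiply into a self-join plus a GROUP BY.

```python
import sqlite3

# Hypothetical layout: one table row per cell (stock index i, time index t, value).
matrix = [
    [1.0, 2.0],
    [3.0, 4.0],
]
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cells (i INTEGER, t INTEGER, val REAL)")
conn.executemany(
    "INSERT INTO cells VALUES (?, ?, ?)",
    [(i, t, v) for i, row in enumerate(matrix) for t, v in enumerate(row)],
)

# M * M^T: join the table to itself on the shared time index, then sum products.
product_rows = conn.execute(
    "SELECT a.i, b.i, SUM(a.val * b.val) "
    "FROM cells AS a JOIN cells AS b ON a.t = b.t "
    "GROUP BY a.i, b.i ORDER BY a.i, b.i"
).fetchall()
```

Each output row is `(i, j, value)` for one cell of the product, so the array indices travel through the query as ordinary columns.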

11

Solution Options: R + RDBMS

• Have to extract and transform the data from RDBMS table to math package data format (e.g. data frames)

• ‘move the world’ nightmare

• Need to learn 2 systems

• And R still doesn’t scale and is not a parallel system

• Some RDBMS vendors are working on these issues

12

Array DBMS (e.g. Paradigm4/SciDB)

• Array SQL data management

• With massively scalable array analytics

• In a single system!

• Open source

• Runs in the cloud or private grid of commodity HW

13

Array Versus Relational Tables

• Math functions run directly on native storage format

• Dramatic storage efficiencies as # of dimensions & attributes grows

• High performance on both sparse and dense data

[Figure: storage comparison, 48 cells versus 16 cells]

14

Hadoop

• Awful performance on data management

— No indexes, no statistics, …

• Low level interface

— 40 years of DBMS research points to high level interfaces

• At the very least move to Pig, Hive, …

— Another moving part to integrate

15

Hadoop

• No math

— Roll your own or

— Use Mahout (yet another moving part to integrate)

• And Hadoop is very inefficient on math that is not “embarrassingly parallel”

16

Summary

• RDBMS good on data management, bad on math

• Math products don’t scale and have no data management

• Hadoop is slow and has too many moving parts that are not well integrated

— Not good at either task!

• Opportunity for a new DBMS?