How to do Complex Analytics
Michael Stonebraker
2
Big Volume - Little Analytics
• SQL aggregates, group_by
• Find me the average closing price of MSFT on all trading days within the last 3 years
• Find me the average closing price of each stock in the DJIA on trading days in the last 5 years
• High performance on SQL analytics available from the data warehouse crowd
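As a rough illustration of the "little analytics" workload above, the second query can be sketched with Python's built-in `sqlite3` module. The `closing_prices` table, its columns, and the sample rows are hypothetical, invented only for this example.

```python
import sqlite3

# Hypothetical schema and toy rows, invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE closing_prices (symbol TEXT, trade_date TEXT, close REAL)"
)
conn.executemany(
    "INSERT INTO closing_prices VALUES (?, ?, ?)",
    [
        ("MSFT", "2012-01-03", 26.0),
        ("MSFT", "2012-01-04", 28.0),
        ("IBM", "2012-01-03", 180.0),
        ("IBM", "2012-01-04", 184.0),
    ],
)

# One aggregate per group: the average closing price of each stock
# on trading days on or after a cutoff date.
rows = conn.execute(
    """
    SELECT symbol, AVG(close)
    FROM closing_prices
    WHERE trade_date >= ?
    GROUP BY symbol
    ORDER BY symbol
    """,
    ("2012-01-01",),
).fetchall()

avg_by_symbol = dict(rows)
print(avg_by_symbol)
```

The whole computation is a filter, a grouping, and a single aggregate, which is the kind of query the slide says data warehouses already handle well.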
3
Big Data - Big Analytics
• Complex math operations (machine learning, clustering, trend detection, …)
— The world of the “quants”
— Mostly specified as linear algebra on array data
• A dozen or so common ‘inner loops’
— Matrix multiply
— QR decomposition
— SVD decomposition
— Linear regression
4
Big Data - Big Analytics: An Example
• Consider closing price on all trading days for the last 5 years for two stocks A and B
• What is the covariance between the two time-series?
cov(A, B) = (1/N) * sum over i of [ (A_i - mean(A)) * (B_i - mean(B)) ]
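A minimal sketch of this formula in plain Python, assuming two equal-length price series. The function name `covariance` and the sample prices are made up for illustration.

```python
def covariance(a, b):
    """Population covariance: (1/N) * sum((a_i - mean(a)) * (b_i - mean(b)))."""
    if len(a) != len(b) or not a:
        raise ValueError("series must be non-empty and of equal length")
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / n


# Toy closing prices for two hypothetical stocks A and B.
prices_a = [10.0, 11.0, 12.0, 13.0]
prices_b = [20.0, 22.0, 24.0, 26.0]
cov_ab = covariance(prices_a, prices_b)
print(cov_ab)
```

Note the 1/N normalization from the slide (population covariance), not the 1/(N-1) sample form that some statistics libraries use by default.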
5
Now Make It Interesting …
• Do this for all pairs of 4000 stocks
— The data is the following 4000 x 1000 matrix, Stock:

         t1  t2  t3  t4  t5  t6  t7  …  t1000
S1
S2
…
S4000

• Hourly data? All securities?
6
Solution
• Except for the constant and subtracting off the means:
— Stock * Stock^T (the Stock matrix times its transpose)
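The matrix form can be sketched as follows: subtract each stock's mean from its row, multiply the centered matrix by its transpose, and divide by the number of time points. This is a dependency-free illustration on a tiny 3 x 4 matrix, not a scalable implementation; the helper names are hypothetical, and a real system would hand the multiply to an optimized linear algebra routine.

```python
def center_rows(matrix):
    """Subtract each row's mean from that row (one row per stock)."""
    centered = []
    for row in matrix:
        mean = sum(row) / len(row)
        centered.append([x - mean for x in row])
    return centered


def times_transpose(matrix):
    """Return matrix * matrix^T; entry (i, j) is the dot product of rows i and j."""
    return [
        [sum(x * y for x, y in zip(row_i, row_j)) for row_j in matrix]
        for row_i in matrix
    ]


def covariance_matrix(stock):
    """All-pairs population covariance of the rows of `stock`."""
    n = len(stock[0])  # number of time points
    product = times_transpose(center_rows(stock))
    return [[value / n for value in row] for row in product]


# Toy data: 3 stocks by 4 time points (the slide's case is 4000 x 1000).
stock = [
    [10.0, 11.0, 12.0, 13.0],
    [20.0, 22.0, 24.0, 26.0],
    [5.0, 4.0, 3.0, 2.0],
]
cov = covariance_matrix(stock)
for row in cov:
    print(row)
```

Entry (i, j) of the result is the covariance between stock i and stock j, so the 4000-stock case yields a 4000 x 4000 matrix from a single matrix multiply.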
7
Big Data - Big Analytics: Requirements
• SQL-style data management
— Filters, joins, …
• Complex array manipulation
8
Big Data - Big Analytics: Solution Options
• Math package
• RDBMS
• RDBMS + math package
• Array database
• Hadoop
9
Solution Options: R, SAS, Matlab, et al.
• Weak or non-existent data management
— Do the correlation only for companies with revenue > $1B?
• File system storage
• R doesn’t scale and is not a parallel system
— Revolution does a bit better
10
Solution Options: RDBMS Alone
• SQL simulator (MADlib) is slooooow
— And only does some of the required operations
• Coding operations as UDFs still requires you to simulate arrays on top of tables, which is also sloooow
— And the current UDF model is not powerful enough to support iteration
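To make "simulating arrays on top of tables" concrete, here is a small sketch using Python's built-in `sqlite3` module. The `stock_cells` table, with one row per matrix cell, is a hypothetical layout chosen for illustration; it is not how MADlib or any particular RDBMS stores matrices.

```python
import sqlite3

# Toy 2 x 3 matrix: 2 stocks by 3 time points.
stock = [
    [1.0, 2.0, 3.0],
    [4.0, 5.0, 6.0],
]

# Hypothetical layout: one table row per cell (stock index, time index, value).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stock_cells (s INTEGER, t INTEGER, val REAL)")
conn.executemany(
    "INSERT INTO stock_cells VALUES (?, ?, ?)",
    [(s, t, v) for s, row in enumerate(stock) for t, v in enumerate(row)],
)

# Stock * Stock^T expressed relationally: self-join on the time index,
# then group by the pair of stock indices and sum the products.
rows = conn.execute(
    """
    SELECT a.s, b.s, SUM(a.val * b.val)
    FROM stock_cells AS a
    JOIN stock_cells AS b ON a.t = b.t
    GROUP BY a.s, b.s
    ORDER BY a.s, b.s
    """
).fetchall()

product = {(i, j): total for i, j, total in rows}
print(product)
```

Every cell carries its own indices and the multiply becomes a join followed by an aggregation, which hints at why the slide expects this route to be slow at scale.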
11
Solution Options: R + RDBMS
• Have to extract and transform the data from RDBMS table to math package data format (e.g. data frames)
• ‘move the world’ nightmare
• Need to learn 2 systems
• And R still doesn’t scale and is not a parallel system
• Some RDBMS vendors are working on these issues
12
Array DBMS (e.g. Paradigm4/SciDB)
• Array SQL data management
• With massively scalable array analytics
• In a single system!
• Open source
• Runs in the cloud or private grid of commodity HW
13
Array Versus Relational Tables
• Math functions run directly on native storage format
• Dramatic storage efficiencies as # of dimensions & attributes grows
• High performance on both sparse and dense data
[Figure: the same data in the two storage formats, 48 cells versus 16 cells]
14
Hadoop
• Awful performance on data management
— No indexes, no statistics, …
• Low level interface
— 40 years of DBMS research points to high level interfaces
• At the very least move to Pig, Hive, …
— Another moving part to integrate
15
Hadoop
• No math
— Roll your own, or
— Use Mahout (yet another moving part to integrate)
• And Hadoop is very inefficient on math that is not “embarrassingly parallel”
16
Summary
• RDBMS good on data management, bad on math
• Math products don’t scale and have no data management
• Hadoop is slow and has too many moving parts that are not well integrated
— Not good at either task!
• Opportunity for a new DBMS?