Big Analytics Without Big Hassles 04/10/14 Webinar
Big Analytics without Big Hassles
Bryan Lewis, Chief Data Scientist
Alex Poliakov, Solutions Architect
Paradigm4’s SciDB
MPP Database
Array data model
Complex analytics
Commodity clusters or cloud
R & Python
Big analytics without big hassles
© Paradigm4
Using WebEx
• Ask questions using the Q&A window
• This webinar is being recorded
• Replays will be available
Agenda
1. Brief Introduction to SciDB
2. Demos
3. Q & A
Paradigm4 develops & supports SciDB
Force behind many major advances in databases:
Postgres, Vertica, Paradigm4, Illustra, VoltDB, Streambase, DataTamer
Mike Stonebraker: CTO & Co-founder; MIT Professor; ISTC Big Data at MIT
Presenters
Bryan Lewis, Chief Data Scientist
• Applied Math Ph.D.
• Founder Rocketcalc; RevolutionAnalytics
• CRAN contributor
Alex Poliakov, Solutions Architect
• Decade developing database internals (Netezza, Paradigm4)
• Solutions: e-commerce, pharma/biotech, insurance, satellite imagery
Three pillars of SciDB
MPP Database
Array data model
Complex analytics
Commodity clusters or cloud
R & Python
Big analytics without big hassles
SciDB Powers NIH NCBI’s 1000 Genomes Project
Running 24 x 7 since Fall 2012
http://www.ncbi.nlm.nih.gov/variation/tools/1000genomes/
Some commercial use cases
Pharma, Biotech, Healthcare
• Join public & private data
• Integrate many data sources
• Scale up & speed up math
Quant Finance
• Fast data window selection & scalable math
Image & Sensor Analytics
• Integrate diverse data with different spatial and temporal resolutions
E-commerce
• SVD on sparse matrices (50M x 50M)
• Powering recommendation engine
SciDB is a Database
• ACID support means multiple users can simultaneously read / write / analyze data
• FAST JOINs
data in files
Arrays are a natural data model
[Figure: two example arrays. The first holds Event cells indexed by sensor / car / phone, time, longitude, latitude, and other dimensions. The second is indexed by Exchange, Stock_ID, Time, and other dimensions.]
Native Array DBs vs. Relational DBs
Data that are close together in the coordinate system are stored close to each other on disk.
This is important for ordered data and analysis.
Array storage supports fast multi-dimensional SELECTs
Illustration credit: Andrei Pandre
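A rough sketch of why chunked array storage helps multi-dimensional selects: a box query only needs to read the chunks whose coordinate ranges intersect the box. The function name `chunks_for_box`, the zero origin, and the 100 x 100 chunk size are assumptions for illustration, not SciDB internals.

```python
from itertools import product


def chunks_for_box(lo, hi, chunk_sizes):
    """Return the chunk coordinates that intersect the inclusive box [lo, hi].

    lo, hi: per-dimension inclusive cell coordinates (origin assumed to be 0).
    chunk_sizes: cells per chunk along each dimension (hypothetical values).
    """
    ranges = [
        range(low // size, high // size + 1)
        for low, high, size in zip(lo, hi, chunk_sizes)
    ]
    return list(product(*ranges))


# A 1000 x 1000 array cut into 100 x 100 chunks has 100 chunks in total.
touched = chunks_for_box(lo=(250, 90), hi=(310, 120), chunk_sizes=(100, 100))
print(touched)  # [(2, 0), (2, 1), (3, 0), (3, 1)]
```

In this toy case the select touches 4 of the 100 chunks, so the rest of the array never has to be read.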
SciDB does scalable complex analytics
• No more ETL hassles to a separate math package
• Data not constrained to fit in memory
Parallel linear algebra, principal component analysis, clustering, GLM, machine learning, and more
Analyst-Friendly Interfaces
• Program SciDB from R or Python
• Naturally reference & manipulate data in SciDB
• Large computations run on SciDB cluster
– Go beyond the scalability limitations of R & Python
We also support AQL and JDBC
Shared-Nothing Cluster Architecture
[Diagram: clients (R + SciDB-R, Python + SciDB-Py, JDBC, Web Browser) and a cluster made up of the SciDB Coordinator and SciDB instances 1, 2, …]
• K-replication for redundancy
• Scale out horizontally
SciDB Arrays
Each cell in a SciDB array consists of a fixed number of typed attributes (variables). Here is an example cell with four attributes:
Price    Volume   Symbol   usec
450.61   150      “AAPL”   36013008713
SciDB Arrays
Dimension i   Price    Volume   Symbol   usec          (attributes)
1             450.61   150      “AAPL”   36013008713
2             450.73   200      “AAPL”   36013008915
3             450.84   10       “AAPL”   36013208113
4             36.57    75       “MSFT”   36019008713
5             36.20    100      “MSFT”   36003200113
A 1-D array looks like an R or Pandas data frame.
This picture shows five cells, each with four attributes.
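As a plain-Python analogy (not the SciDB-Py API), the 1-D array above can be pictured as a mapping from the dimension value i to a cell, where every cell carries the same four typed attributes. The class name `Cell` and the variable `trades` are made up for this sketch.

```python
from typing import NamedTuple


class Cell(NamedTuple):
    """One array cell: a fixed set of typed attributes."""
    price: float
    volume: int
    symbol: str
    usec: int


# Dimension i (1-based, as on the slide) addresses exactly one cell.
trades = {
    1: Cell(450.61, 150, "AAPL", 36013008713),
    2: Cell(450.73, 200, "AAPL", 36013008915),
    3: Cell(450.84, 10, "AAPL", 36013208113),
    4: Cell(36.57, 75, "MSFT", 36019008713),
    5: Cell(36.20, 100, "MSFT", 36003200113),
}

# Attribute access works like a column lookup in a data frame.
aapl_volume = sum(c.volume for c in trades.values() if c.symbol == "AAPL")
print(aapl_volume)  # 360
```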
SciDB Arrays
The same data “redimensioned” into a 2D array
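Rows: dimension usec. Columns: dimension Symbol, each with attributes Price and Volume.
usec          “AAPL” Price   “AAPL” Volume   “MSFT” Price   “MSFT” Volume
36003200113   (empty)        (empty)         36.20          100
36013008713   450.61         150             (empty)        (empty)
36013008915   450.73         200             (empty)        (empty)
36013208113   450.84         10              (empty)        (empty)
36019008713   (empty)        (empty)         36.57          75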
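The idea of redimensioning can be sketched in plain Python: two attributes (usec and Symbol) are promoted to dimensions, and the remaining attributes (Price and Volume) stay in the cell. This is only an analogy using a dictionary as a sparse 2-D array; the function name `redimension` here is a local helper, not a call into SciDB.

```python
# (price, volume, symbol, usec) rows from the 1-D example array.
rows = [
    (450.61, 150, "AAPL", 36013008713),
    (450.73, 200, "AAPL", 36013008915),
    (450.84, 10, "AAPL", 36013208113),
    (36.57, 75, "MSFT", 36019008713),
    (36.20, 100, "MSFT", 36003200113),
]


def redimension(rows):
    """Promote usec and symbol from attributes to dimensions.

    Returns a sparse 2-D array as {(usec, symbol): (price, volume)}.
    """
    array_2d = {}
    for price, volume, symbol, usec in rows:
        key = (usec, symbol)
        if key in array_2d:
            # Two source cells would land on the same coordinates.
            raise ValueError(f"cell collision at {key}")
        array_2d[key] = (price, volume)
    return array_2d


array_2d = redimension(rows)
# Empty cells are simply absent from the sparse representation.
print(array_2d.get((36013008713, "AAPL")))  # (450.61, 150)
print(array_2d.get((36013008713, "MSFT")))  # None
```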
SciDB Array Schema
CREATE ARRAY Simple_Array < v1 : double, v2 : int64, v3 : string > [ I = 0:*, 5, 0, J = 0:9, 5, 0 ];
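Attributes: v1, v2, v3
Dimensions: I, J
Dimension size: * is unbounded
Chunk size: 5 (the second value in each dimension specification)
Chunk overlap: 0 (the third value in each dimension specification)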
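To make the schema fields concrete, here is a small sketch that mirrors the two dimension specifications and computes which chunk a given cell falls into. The `Dimension` class and `chunk_of` helper are hypothetical names for this illustration and are not part of SciDB.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Dimension:
    name: str
    low: int
    high: Optional[int]  # None stands for '*', i.e. unbounded
    chunk_size: int
    overlap: int


# Mirrors [ I = 0:*, 5, 0, J = 0:9, 5, 0 ] from the CREATE ARRAY statement.
SIMPLE_ARRAY_DIMS = (
    Dimension("I", 0, None, 5, 0),
    Dimension("J", 0, 9, 5, 0),
)


def chunk_of(coords, dims=SIMPLE_ARRAY_DIMS):
    """Return the chunk index along each dimension for one cell."""
    index = []
    for value, dim in zip(coords, dims):
        if value < dim.low or (dim.high is not None and value > dim.high):
            raise IndexError(f"{dim.name}={value} is out of bounds")
        index.append((value - dim.low) // dim.chunk_size)
    return tuple(index)


print(chunk_of((12, 7)))  # (2, 1)
```

Because I is unbounded, `chunk_of((1000, 4))` is valid and simply lands in a chunk further along I, while J = 10 would be rejected.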
Arrays are distributed with overlap
Supports constant time moving window aggregates and feature detection …even when data cross node boundaries
[Figure: grid of sample cell values (mostly 0.01 and 0.02, with occasional 0.50 peaks) illustrating array chunks distributed with overlap]
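A 1-D sketch of how chunk overlap lets each chunk compute a moving-window aggregate for its own cells using only local data. It assumes the overlap is at least as large as the window radius; the helper names, chunk size, and sample values are illustrative and do not reflect SciDB's implementation.

```python
def split_with_overlap(values, chunk_size, overlap):
    """Split values into chunks that also carry up to `overlap`
    neighboring cells on each side.

    Returns a list of (start, lo, core_len, data) tuples, where `start` is
    the first core cell, `lo` is the global index of data[0], and `core_len`
    is the number of cells the chunk owns.
    """
    chunks = []
    for start in range(0, len(values), chunk_size):
        lo = max(0, start - overlap)
        hi = min(len(values), start + chunk_size + overlap)
        core_len = min(chunk_size, len(values) - start)
        chunks.append((start, lo, core_len, values[lo:hi]))
    return chunks


def window_sum_local(chunk, radius):
    """Centered window sums for a chunk's core cells, local data only."""
    start, lo, core_len, data = chunk
    sums = []
    for i in range(start, start + core_len):
        a = max(i - radius - lo, 0)  # window start in local coordinates
        b = i + radius - lo + 1      # window end (exclusive); slicing clips it
        sums.append(sum(data[a:b]))
    return sums


def window_sum_global(values, radius):
    """Reference result computed over the whole array at once."""
    return [
        sum(values[max(i - radius, 0): i + radius + 1])
        for i in range(len(values))
    ]


values = [0.02, 0.01, 0.01, 0.01, 0.01, 0.50,
          0.01, 0.02, 0.01, 0.01, 0.01, 0.02]
radius = 1
chunks = split_with_overlap(values, chunk_size=4, overlap=radius)

local = [s for chunk in chunks for s in window_sum_local(chunk, radius)]
print(local == window_sum_global(values, radius))  # True
```

Each chunk could live on a different node: the overlap cells are stored redundantly so that windows near a chunk border never need data from a neighbor.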
Live demonstrations
1) Airline data
• Select
• Aggregate lateness
• Heatmap
2) Netflix-like data
• SVD
3) Zipcode (lat,long) and population by zipcode
• Join
• Compute distance-weighted population by zipcode
• Plot histogram
4) Satellite and point-of-interest data
• Select region
• Regrid and plot
• Overlay another dataset: shopping mall locations
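For the zipcode demonstration (item 3), a distance-weighted population could be computed along the following lines. The webinar does not state the weighting formula, so the kernel 1 / (1 + d / scale_km), the `scale_km` parameter, and the sample zipcodes are all assumptions made for this sketch; only the haversine great-circle distance is standard.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km on a sphere of radius 6371 km."""
    radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))


def distance_weighted_population(zips, scale_km=10.0):
    """For each zipcode, sum all populations weighted by 1 / (1 + d / scale_km).

    zips: {code: (lat, lon, population)}.
    The kernel and scale_km are assumptions for illustration only.
    """
    result = {}
    for code, (lat, lon, _pop) in zips.items():
        total = 0.0
        for lat2, lon2, pop2 in zips.values():
            d = haversine_km(lat, lon, lat2, lon2)
            total += pop2 / (1.0 + d / scale_km)
        result[code] = total
    return result


# Made-up codes, coordinates, and populations (not real data).
zips = {
    "A": (42.0, -71.0, 1000),
    "B": (42.0, -71.0, 500),   # same location as A
    "C": (0.0, 100.0, 2000),   # far from A and B
}
weighted = distance_weighted_population(zips)
for code in sorted(weighted):
    print(code, round(weighted[code], 2))
```

The double loop is an all-pairs computation, which is the kind of join-plus-aggregate workload the demo runs inside the database rather than in client memory.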
Demonstration Cluster
Running on a modest 4-node cluster. Each node has:
• 16 cores
• 128 GB RAM
• 4 x 1TB disks
Nodes are connected by 1Gbit Ethernet
Also runs on public clouds
Registration Poll Results
What mathematical and statistical computing software do you use? (n = 340)
R: 42%
Other: 20%
Python: 17%
Excel: 15%
MATLAB: 6%
Please respond to live poll
Try It
Quick Start
• scidb.org/forum
• Download a VM or EC2 AMI
Community Edition
• Open Source; Active forum
• Unrestricted & fully scalable
Enterprise Edition
• Commercial license
• Unrestricted & fully scalable
• More math functions
• Intel MKL support
• Failover & fault tolerance
• System management tools
Take Away: Less coding, more analysis
• ACID database
• Array data model
• In-database complex math
• Automatic scale-out & speed-up
• Programmable from R and Python
www.paradigm4.com
© Paradigm4 Inc.
Questions?
Tell us about your application
• [email protected]
Try our Quick Start
• scidb.org/forum
• Download a VM or EC2 AMI
www.paradigm4.com
Thanks for your interest!