Scaling up to Big Data: Algorithmic Engineering + HPC · 2015-07-24 · SCALING UP TO BIG DATA:...
![Page 1: Scaling up to Big Data: Algorithmic Engineering + HPC · 2015-07-24 · SCALING UP TO BIG DATA: ALGORITHMIC ENGINEERING + HPC Statistical and Computational Analytics for Big Data](https://reader033.fdocuments.in/reader033/viewer/2022060317/5f0c4b937e708231d434b0b1/html5/thumbnails/1.jpg)
SCALING UP TO BIG DATA: ALGORITHMIC ENGINEERING + HPC
Statistical and Computational Analytics for Big Data
June 12, 2015
Andrew Rau-Chaplin
Faculty of Computer Science
Dalhousie University
www.cs.dal.ca/~arc
THE PROBLEM
STEP 1: SOLVE THE PROBLEM ON SMALL DATA
Typical Process
• start by using small data sets!
• focus on identifying those machine learning techniques that are best suited to the problem
STEP 2: SCALE-UP
SCALING UP
1) Asymptotic Analysis
2) Algorithmic Engineering
3) Parallelism & HPC
HOW TO SCALE TO “BIG DATA”
• Sorry, I don’t have a recipe!
• Talk about our experience scaling up analytical techniques
• Highlight approaches and technology choices which we have found helped in multiple settings
• Perhaps they should have been obvious?
EFFICIENT FRONTIER APPROACHES TO TREATY OPTIMIZATION
www.Risk-Analytics-Lab.ca
Joint work with
• Omar Carmona Cortes
• I. Cook and J. Gaiser-Porter
1) Asymptotic Analysis applied to search and optimization
CATASTROPHE MODELING
[Diagram: Exposure and Event Catalog → Cat Model (Hazard, Vulnerability, Loss) → Event Loss Table (ELT) of Simulated Event Losses]
A Program ≈ multiple layers, over ~15 ELTs, covering ~5 models and ~200K events
A Portfolio ≈ 3-4K Programs, each with multiple layers, with 40K ELTs, over 100 models, covering 1M events
A TREATY OPTIMIZATION PROBLEM
Optimization From a Primary Insurer’s or Broker’s Perspective!
Find a Pareto Frontier
[Figure: Expected Return vs. Risk, with regions labeled Dominated and Infeasible]
MORE FORMALLY
Given: a fixed number of contractual layers and a simulated set of expected loss distributions (one per layer), plus a model of reinsurance market costs
The Task: identify optimal combinations of shares (also called placements) in order to build a Pareto frontier
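To make the notion of a Pareto frontier concrete, here is a minimal Python sketch that filters a set of candidate placements down to the non-dominated ones. It assumes each candidate has already been reduced to a `(risk, expected_return)` pair, with lower risk and higher return being better; the function name and the sample numbers are illustrative and do not come from the talk.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (risk, expected_return), both hypothetical units


def pareto_frontier(points: List[Point]) -> List[Point]:
    """Return the non-dominated points: lower risk and higher return are better."""
    # Sort by risk ascending; for equal risk, put the higher return first.
    ordered = sorted(points, key=lambda p: (p[0], -p[1]))
    frontier: List[Point] = []
    best_return = float("-inf")
    for risk, ret in ordered:
        # A point survives only if it beats every lower-risk point on return.
        if ret > best_return:
            frontier.append((risk, ret))
            best_return = ret
    return frontier


# Illustrative candidates only; real inputs would come from the loss simulation.
candidates = [(1.0, 2.0), (2.0, 3.0), (2.5, 2.5), (3.0, 5.0), (3.0, 4.0)]
print(pareto_frontier(candidates))
```

Here `(2.5, 2.5)` and `(3.0, 4.0)` are dropped because another candidate offers at least as much return for no more risk.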
INPUTS/OUTPUTS
Treaty Optimizer
Discretization = 10%, 5%, or 1%
THE APPROACH
• Aggregate the loss data
• Location → Event → Trial year
• Discretized search parameters
• Calculate results for all combinations of shares
• Use a big parallel machine
THE PROBLEM
• Works for a small number of layers!
• Results in a large number of computations
• Number of computations increases exponentially with the number of dimensions
[Figure axes: Number of Layers; Number of share intervals]
Asymptotic Analysis!
• ((# of trials) * (discretization)^(# of layers)) / (number of processors)
• Example: (1,000,000 * 100^15) / 1000 = 10^33 computations
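The arithmetic on this slide can be checked with a few lines of Python. The function below simply evaluates the slide's formula; the per-processor throughput used to turn the count into a wall-clock estimate is an assumption made for illustration and is not a figure from the talk.

```python
def enumeration_cost(trials: int, share_levels: int, layers: int, processors: int) -> int:
    """Computations per processor for exhaustive enumeration (the slide's formula)."""
    return trials * share_levels ** layers // processors


# The slide's example: 1M trials, 100 share levels (1% steps), 15 layers, 1000 processors.
cost = enumeration_cost(1_000_000, 100, 15, 1000)
print(cost == 10 ** 33)

# Assumed throughput of 10^9 computations per second per processor (hypothetical).
ASSUMED_RATE_PER_SECOND = 10 ** 9
SECONDS_PER_YEAR = 365.25 * 24 * 3600
years = cost / ASSUMED_RATE_PER_SECOND / SECONDS_PER_YEAR
print(f"about {years:.1e} years")
```

Even under that generous assumed rate the estimate is on the order of 10^16 years, which is why adding processors alone cannot rescue the exhaustive approach.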
ROUND 1:
Need a better algorithm!
Use an evolutionary search approach
Population-Based Incremental Learning (Di-PBIL)
Single risk measure (i.e., 2D Pareto Frontier)
Variance
Value At Risk (VaR)
Tail Value at Risk (TVaR)
Prototype in R (with multithreading)
Questions
Quality: How close to the exact method?
Performance: How fast? How big a problem can we now handle?
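For readers unfamiliar with Population-Based Incremental Learning, the following is a minimal sketch of a generic discrete PBIL loop in Python. It is not the Di-PBIL variant from the talk, whose details are not given in the slides: the update rule, the parameter defaults, and the toy cost function (distance to a hidden target placement) are all assumptions chosen only to show the shape of the technique.

```python
import random
from typing import Callable, List, Tuple


def pbil(cost: Callable[[List[int]], float], layers: int, levels: int,
         population: int = 50, iterations: int = 200,
         learning_rate: float = 0.1, seed: int = 0) -> Tuple[List[int], float]:
    """Generic discrete PBIL: learn one categorical distribution per layer."""
    rng = random.Random(seed)
    # One distribution per layer, initially uniform over the share levels.
    probs = [[1.0 / levels] * levels for _ in range(layers)]
    best: List[int] = []
    best_cost = float("inf")
    for _ in range(iterations):
        # Sample a population of candidate placements from the current model.
        samples = [[rng.choices(range(levels), weights=p)[0] for p in probs]
                   for _ in range(population)]
        winner = min(samples, key=cost)
        winner_cost = cost(winner)
        if winner_cost < best_cost:
            best, best_cost = winner, winner_cost
        # Shift each layer's distribution toward the level the winner chose.
        for layer, level in enumerate(winner):
            for k in range(levels):
                target = 1.0 if k == level else 0.0
                probs[layer][k] += learning_rate * (target - probs[layer][k])
    return best, best_cost


# Hypothetical objective: squared distance to a hidden "ideal" placement.
HIDDEN_TARGET = [3, 10, 0, 20, 7]


def toy_cost(shares: List[int]) -> float:
    return float(sum((s - t) ** 2 for s, t in zip(shares, HIDDEN_TARGET)))


# 5 layers, 21 share levels (0% to 100% in 5% steps).
solution, solution_cost = pbil(toy_cost, layers=5, levels=21)
print(solution, solution_cost)
```

The work per iteration is (population size) x (cost of one evaluation), independent of the size of the full search space, which is the property that lets this family of methods sidestep exhaustive enumeration.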
QUALITY: HOW CLOSE TO THE EXACT METHOD?
How often (as a percentage) does Di-PBIL find the same solution as the exact method?
QUALITY: HOW CLOSE TO THE EXACT METHOD?
What is the average error when Di-PBIL does not find the same solution as the exact method?
Error is always less than 6/100ths of a percent.
PERFORMANCE: HOW FAST WHEN COMPARED TO THE EXACT METHOD?
Time on a single core to compute a single point on efficient frontier for 7 layers and 5% discretization
Enumeration: weeks
Di-PBIL: 2-15 minutes
PERFORMANCE: HOW BIG A PROBLEM CAN WE NOW SOLVE?
Time on a single core to compute a single point on efficient frontier at 5% discretization
Solution times are no longer exponential in the number of layers!
ROUND 2:
Single risk metric → Multiple risk metrics (e.g. 1 in 100yr TVaR + 1 in 5yr VaR)
2-d Pareto front → 3-d+ Pareto front
Di-PBIL → Mo-PBIL
Prototype in R → Prototype in C++
Advantages
Search for whole front, not point by point
Multiple Risk Metrics
Performance!
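The risk metrics named on this slide (a 1-in-N-year VaR and TVaR) can be estimated empirically from simulated annual losses. The Python sketch below uses one common convention: VaR is the empirical quantile at 1 - 1/N, and TVaR is the mean of the losses at or above that VaR. Other conventions exist, and the slides do not say which one the authors used; the function names and sample losses are illustrative.

```python
from typing import List


def value_at_risk(losses: List[float], return_period: int) -> float:
    """Empirical 1-in-`return_period`-year VaR: the (1 - 1/N) quantile of losses."""
    ordered = sorted(losses)
    n = len(ordered)
    # Integer ceiling of n * (N - 1) / N, which avoids floating-point rounding.
    rank = -(-(n * (return_period - 1)) // return_period)
    index = min(max(rank - 1, 0), n - 1)
    return ordered[index]


def tail_value_at_risk(losses: List[float], return_period: int) -> float:
    """Mean of the losses at or above the VaR threshold."""
    threshold = value_at_risk(losses, return_period)
    tail = [x for x in losses if x >= threshold]
    return sum(tail) / len(tail)


# Illustrative data: 100 simulated trial years with losses 1, 2, ..., 100.
simulated = [float(x) for x in range(1, 101)]
print(value_at_risk(simulated, 100), tail_value_at_risk(simulated, 100))
print(value_at_risk(simulated, 5), tail_value_at_risk(simulated, 5))
```

Using two such measures at once (for example a 1-in-100-year TVaR and a 1-in-5-year VaR) gives each candidate placement three coordinates, which is what turns the 2-d front into a 3-d one.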
ROUND 2: OPTIMIZED MO-PBIL
Mo-PBIL: Complete frontier (60-70 points) for a 7-layer program at 5% discretization in 16 seconds!
Setup: 500 iterations, population of 128
2 * Xeon E5-2660 processors
SUMMARY
Evolutionary techniques work well for Treaty Optimization!
Can now solve practical problem instances with practical performance.
Compared multiple evolutionary search methods
Single Objective: DE, PSO, GA, PBIL
Multi Objective: VEPSO, MODE, SPEA2, NSGA2
Evaluation Results
All work and can produce high quality solutions
Differences
Ease of use
Performance
Don’t compute exactly what you can compute approximately!
Parallelism is great, but it only buys you a constant factor!
OPTIMIZE LAYER STRUCTURE, NOT JUST SHARES!
[Diagram: Treaty Optimizer 2]
• Components: Aggregate Simulation Engine, Population Evaluation
• Inputs: Set of ELTs, 100K Year Event Table (YET), Discretization = d%, Risk Measure, Premium function, # reinstatements, Aggregate terms (i.e., 3rd event cover)
• Search dimensions: Att, Exh
• Output: Risk vs. Expected Return
ACCELERATING N-GRAM BASED TEXT ANALYSIS
2) Algorithm engineering techniques applied to text analytics problems
DOCUMENT RELATEDNESS
Important task in many text mining applications
Represented by a score between 0 and 1
Unsupervised corpus-based methods: Google Trigram Method, Semantic Text Similarity, etc.
GOOGLE TRIGRAM METHOD (GTM)
Unigram:
apple 6878789
eating 14987879
Trigram:
ceramics collectables fine 130
ceramics collected by 52
ceramics collectible pottery 50
Document Relatedness: abstracted as a function of word relatedness
Word Relatedness
Find frequency of w1 and w2 in Unigram; Find co-occurrence of w1 and w2 in Trigram
GTM Distance Function
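The two lookups that word relatedness relies on can be sketched in Python using the sample counts shown on this slide. The co-occurrence rule used here (sum the counts of every trigram in which both words appear) is an assumption made for illustration; the slides do not spell out GTM's exact definition, and the distance function itself is not reproduced.

```python
from typing import Dict, Tuple

# Sample entries copied from the slide.
UNIGRAMS: Dict[str, int] = {"apple": 6878789, "eating": 14987879}
TRIGRAMS: Dict[Tuple[str, str, str], int] = {
    ("ceramics", "collectables", "fine"): 130,
    ("ceramics", "collected", "by"): 52,
    ("ceramics", "collectible", "pottery"): 50,
}


def unigram_frequency(word: str, unigrams: Dict[str, int]) -> int:
    """Frequency of a single word, 0 if the word is not in the table."""
    return unigrams.get(word, 0)


def cooccurrence(w1: str, w2: str, trigrams: Dict[Tuple[str, str, str], int]) -> int:
    """Assumed rule: total count of trigrams containing both words."""
    return sum(count for gram, count in trigrams.items()
               if w1 in gram and w2 in gram)


print(unigram_frequency("apple", UNIGRAMS))
print(cooccurrence("ceramics", "pottery", TRIGRAMS))
```

A linear scan over a dictionary is fine for three entries, but at the scale of the real trigram data it is exactly the kind of per-lookup cost that precomputation is meant to remove.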
GTM EXAMPLE *
D1: An autograph is the signature of someone famous which is specially written for a fan to keep.
D2: Your signature is your name, written in your own characteristic way, often at the end of a document to indicate that you wrote the document or that you agree with what it says
* Proposed by Islam, Milios, and Keselj, Text Relatedness using Google Tri-grams
CHALLENGES IN SCALING UP GTM
Measuring the relatedness between a pair of documents is too slow in the existing work
The size of Unigram is roughly 200 MB; the size of Trigram is 20 GB.
High complexity of N-to-N pairwise document relatedness computation.
Volume of documents is growing rapidly
WORD RELATEDNESS PRECOMPUTATION
[Figure labels: 9 GB, 3 GB]
Tokenize! - Assign each word a numeric ID
Precompute! - Compute all word relatedness scores in advance for lookups
Build in-memory data structures! - Dictionary structure to store the word relatedness scores in memory
Hashing vs. Arrays (207,761,290 pairs of words)
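The three steps above (tokenize, precompute, store in memory) can be sketched in Python as follows. Once words have dense integer IDs, every unordered word pair maps to a fixed slot in a flat array, so a lookup needs no hashing. The triangular layout and the placeholder scoring function are assumptions for illustration; the slides do not describe the authors' actual data layout.

```python
from array import array
from typing import Callable, Dict, List


def build_vocab(words: List[str]) -> Dict[str, int]:
    """Tokenize: give each distinct word a dense integer ID (first-seen order)."""
    vocab: Dict[str, int] = {}
    for w in words:
        vocab.setdefault(w, len(vocab))
    return vocab


def pair_index(i: int, j: int) -> int:
    """Slot of the unordered pair {i, j} in a flat lower-triangular array."""
    if i > j:
        i, j = j, i
    return j * (j + 1) // 2 + i


def precompute(vocab: Dict[str, int],
               relatedness: Callable[[str, str], float]) -> array:
    """Precompute: score every word pair once and store it by pair index."""
    n = len(vocab)
    table = array("d", [0.0]) * (n * (n + 1) // 2)
    words = sorted(vocab, key=vocab.get)  # words[id] recovers the word for an ID
    for j in range(n):
        for i in range(j + 1):
            table[pair_index(i, j)] = relatedness(words[i], words[j])
    return table


def toy_relatedness(a: str, b: str) -> float:
    # Placeholder score in [0, 1] based on shared letters. Not the GTM function.
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)


vocab = build_vocab("an autograph is the signature of someone famous".split())
table = precompute(vocab, toy_relatedness)
score = table[pair_index(vocab["autograph"], vocab["signature"])]
print(len(table), score)
```

Compared with a hash map keyed on word pairs, the array stores only the scores themselves and turns each lookup into one multiplication, one addition, and one memory access.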
SHARED MEMORY MULTITHREADING
Multithreaded implementation: makes use of the multiple cores of a shared-memory machine.
Amortize I/O costs: Each thread, running on a separate core, fetches documents from shared memory and computes the relatedness between them.
Lots of language- and library-based approaches: OpenMP, …
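The structure of a shared-memory multithreaded implementation can be sketched in Python: documents are loaded once into memory, and worker threads pull document pairs and score them against the shared, read-only data. This is a structural sketch only. In CPython the global interpreter lock limits CPU parallelism for pure-Python code, so a production version would more likely use a compiled language with something like OpenMP, as the slide suggests. The Jaccard score is a stand-in for GTM document relatedness.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations
from typing import FrozenSet, List, Tuple


def doc_relatedness(a: FrozenSet[str], b: FrozenSet[str]) -> float:
    # Stand-in score in [0, 1] (Jaccard overlap). Not the GTM function.
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def all_pairs(docs: List[str], workers: int = 4) -> List[Tuple[int, int, float]]:
    """Score every document pair using a pool of threads over shared data."""
    # Loaded once and shared read-only by all threads: the I/O cost is paid once.
    token_sets = [frozenset(d.lower().split()) for d in docs]
    pairs = list(combinations(range(len(docs)), 2))

    def score(pair: Tuple[int, int]) -> Tuple[int, int, float]:
        i, j = pair
        return (i, j, doc_relatedness(token_sets[i], token_sets[j]))

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in the same order as the input pairs.
        return list(pool.map(score, pairs))


results = all_pairs(["a b c", "b c d", "x y"])
print(results)
```

Because the shared data is never written after loading, the threads need no locking, which is what makes this pattern simple to parallelize.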
MULTITHREADED IMPLEMENTATION PERFORMANCE
The speed-up analysis
Experiments use 2000 documents from the ACM Paper Abstracts collection
HORIZONTAL SCALING: HADOOP
• Scaling for free?
• Data parallelism
• Solves problem partitioning
• Solves task mapping
• Solves fault tolerance
• Challenges
• Shared data structures?
• How to amortize I/O costs?
HADOOP
• Shared data structures?
• Use multi-threaded mappers!
• Example: Each multithreaded mapper constructs the word relatedness dictionary and takes input blocks for document relatedness computation.
• How to amortize I/O costs?
• Map over task definitions, not the raw data
• Example: Map over blocks of similarity computations
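"Map over task definitions" can be illustrated with a short Python sketch that splits the N x N pairwise computation into blocks. Each task is just four integers describing a rectangle of the document-pair matrix, so the mapper input is tiny and each mapper fetches only the documents its block needs. The tuple layout and function names are hypothetical, and this sketch does not use any Hadoop API.

```python
from typing import List, Tuple

Task = Tuple[int, int, int, int]  # (row_start, row_end, col_start, col_end), ends exclusive


def block_tasks(n_docs: int, block_size: int) -> List[Task]:
    """Cover the upper triangle of the pair matrix with rectangular blocks."""
    starts = range(0, n_docs, block_size)
    tasks: List[Task] = []
    for r in starts:
        for c in starts:
            if c >= r:  # blocks below the diagonal would only repeat pairs
                tasks.append((r, min(r + block_size, n_docs),
                              c, min(c + block_size, n_docs)))
    return tasks


def pairs_in_task(task: Task) -> List[Tuple[int, int]]:
    """The document pairs a single mapper would compute for one task."""
    r0, r1, c0, c1 = task
    return [(i, j) for i in range(r0, r1) for j in range(c0, c1) if i < j]


tasks = block_tasks(n_docs=10, block_size=4)
covered = [pair for t in tasks for pair in pairs_in_task(t)]
print(len(tasks), len(covered))
```

For 10 documents and blocks of 4 this yields 6 tasks that together cover all 45 pairs exactly once; the block size controls the trade-off between the number of tasks and the work per task.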
HADOOP IMPLEMENTATION PERFORMANCE
[Chart: Hadoop Size_up Performance. x-axis: Number of Files (2000 to 10000); y-axis: Time in Minutes (0 to 40); series: 20 Instances Performance (mins)]
• Hardware: Amazon EC2 m3.xlarge nodes, each of which has 4 cores and 15 GB of memory.
• Hadoop: AWS EMR
• Dataset: 10,000 documents from the ACM Paper Abstracts collection
HADOOP IMPLEMENTATION PERFORMANCE
[Chart: Hadoop Scale_up Performance. x-axis: Number of Files (ticks 1, 2, 3, 4, 5, 6, 8, 10, 20, 25); y-axis: Time in Minutes (0 to 35); series: Time (mins)]
The scale-up performance shows the running time of the implementation while keeping the ratio between input size and nodes fixed.
2000 texts per node, using between 2 and 10 nodes.
SUMMARY
• Speed-up of ~10,000,000x from initial research prototype
• GPU implementation: ~10x cost/performance
Algorithm Engineering works!
• Compress your data
• Precompute!
• Exploit in-memory data structures
• Multi-thread
• Scale horizontally
SUMMARY
Asymptotic Analysis
• O( (# of Docs)^2 * (# of words per doc)^2 )
[Chart: Runtime as a function of # of documents. x-axis: Documents (0 to 200,000); y-axis: Hours (0 to 4000)]
You can’t outrun the asymptotic complexity!
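A short Python sketch shows why the quadratic term dominates regardless of engineering effort. It counts word-pair lookups for all-pairs document relatedness under the slide's cost model; the figure of 100 words per document is an assumption used only to make the numbers concrete.

```python
def pairwise_work(n_docs: int, words_per_doc: int) -> int:
    """Word-pair lookups needed to compare every pair of documents."""
    doc_pairs = n_docs * (n_docs - 1) // 2
    return doc_pairs * words_per_doc ** 2


ASSUMED_WORDS_PER_DOC = 100  # hypothetical average document length

base = pairwise_work(2_000, ASSUMED_WORDS_PER_DOC)
doubled = pairwise_work(4_000, ASSUMED_WORDS_PER_DOC)
print(base, doubled / base)
```

Doubling the number of documents roughly quadruples the work, so a fixed speed-up from more cores or nodes is consumed by a comparatively small increase in collection size.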
VOLAP: A FULLY SCALABLE CLOUD-BASED SYSTEM FOR REAL-TIME OLAP ON HIGH VELOCITY DATA
3) Applying cloud technologies to accelerate real-time performance
Joint work with
• F. Dehne, D. Robillard
• Q. Kong, N. Burkee
REAL-TIME OLAP ON THE CLOUD
Business Analytics based on OLAP
High dimensional data with hierarchies
Real-time inserts and queries
Solution
Multiple server processors
Dynamic load balancing
Strong session serialization
PERFORMANCE
CLOUD BASED SYSTEM ARCHITECTURE
THE DISTRIBUTED PDCR-TREE
The building block for the CR-OLAP system
DC-Tree: A fully dynamic index structure which explicitly represents dimension hierarchies
“The DC-Tree: A Fully Dynamic Index Structure for Data Warehouses”, [Kriegel2000]
PDC-Tree: A multi-threaded DC-Tree
“Parallel Real-Time OLAP on Multi-core Processors”, [Dehne2012]
Distributed PDCR-Tree: A new distributed in-memory index structure for the cloud
• Adds support for distributed memory
• Array-based implementation to support efficient migration
• Adds support for both ordered and unordered dimensions
DATA INGESTION PERFORMANCE
LOAD BALANCING PERFORMANCE
SCALE-UP QUERY PERFORMANCE
SUMMARY
Cloud
• A model for high velocity data
• There are great tools
New Issues
• Design for elasticity
• Think about fault-tolerance from the start
Old Issues
• Nothing is for free! Still need to worry about:
• Compression
• Pre-computation
• Efficient data structures
• Multi-threading
SCALING UP TO BIG DATA
[Diagram: Big Data; HPC: Clusters, Clouds, & GPU; Algorithm Engineering]
Andrew Rau-Chaplin
Risk Analytics Lab,
Faculty of Computer Science
Dalhousie University,
Halifax, Canada
www.Risk-Analytics-Lab.ca