On the role of Interactivity and Data Placement in Big Data Analytics Srini Parthasarathy OSU.
![Page 1: On the role of Interactivity and Data Placement in Big Data Analytics](https://reader035.fdocuments.in/reader035/viewer/2022062520/5681610c550346895dd05fef/html5/thumbnails/1.jpg)
On the role of Interactivity and Data Placement in Big Data Analytics
Srini Parthasarathy, OSU
![Page 2: On the role of Interactivity and Data Placement in Big Data Analytics](https://reader035.fdocuments.in/reader035/viewer/2022062520/5681610c550346895dd05fef/html5/thumbnails/2.jpg)
The Data Deluge: Data Data Everywhere
![Page 3: On the role of Interactivity and Data Placement in Big Data Analytics](https://reader035.fdocuments.in/reader035/viewer/2022062520/5681610c550346895dd05fef/html5/thumbnails/3.jpg)
Data Storage is Cheap
• $600 buys a disk drive that can store all of the world’s music
[McKinsey Global Institute Special Report, June ’11]
![Page 4: On the role of Interactivity and Data Placement in Big Data Analytics](https://reader035.fdocuments.in/reader035/viewer/2022062520/5681610c550346895dd05fef/html5/thumbnails/4.jpg)
Data does not exist in isolation.
![Page 5: On the role of Interactivity and Data Placement in Big Data Analytics](https://reader035.fdocuments.in/reader035/viewer/2022062520/5681610c550346895dd05fef/html5/thumbnails/5.jpg)
Data almost always exists in connection with other data; those connections are an integral part of the value proposition.
![Page 6: On the role of Interactivity and Data Placement in Big Data Analytics](https://reader035.fdocuments.in/reader035/viewer/2022062520/5681610c550346895dd05fef/html5/thumbnails/6.jpg)
[Figure panels: Social networks, Protein Interactions, Internet, VLSI networks, Data dependencies, Neighborhood graphs]
![Page 7: On the role of Interactivity and Data Placement in Big Data Analytics](https://reader035.fdocuments.in/reader035/viewer/2022062520/5681610c550346895dd05fef/html5/thumbnails/7.jpg)
Big Data Problem: All this data is only useful if we can scalably extract useful knowledge from such complex data
![Page 8: On the role of Interactivity and Data Placement in Big Data Analytics](https://reader035.fdocuments.in/reader035/viewer/2022062520/5681610c550346895dd05fef/html5/thumbnails/8.jpg)
THIS TALK
• THE ROLE OF DATA PLACEMENT IN BIG DATA SYSTEMS
• THE ROLE OF VISUALIZATION AND INTERACTION IN BIG DATA ANALYSIS
![Page 9: On the role of Interactivity and Data Placement in Big Data Analytics](https://reader035.fdocuments.in/reader035/viewer/2022062520/5681610c550346895dd05fef/html5/thumbnails/9.jpg)
GLOBAL GRAPHS
![Page 10: On the role of Interactivity and Data Placement in Big Data Analytics](https://reader035.fdocuments.in/reader035/viewer/2022062520/5681610c550346895dd05fef/html5/thumbnails/10.jpg)
GLOBAL GRAPHS
• What?
– System for deploying applications processing complex data
• Why?
– Seeks balance between high productivity and high performance
• How?
– Built on top of PNL’s GlobalArrays
– Trees (GlobalTrees, GlobalForests)
– Relational Arrays (ArrayDB-GA)
– Graphs (GlobalGraphs)
• Data Placement is key to high performance
![Page 11: On the role of Interactivity and Data Placement in Big Data Analytics](https://reader035.fdocuments.in/reader035/viewer/2022062520/5681610c550346895dd05fef/html5/thumbnails/11.jpg)
Importance of Data Placement
• Locality
– Placing related items close to each other so they may be processed together
• Mitigating Impact of Data Skew
– Reducing load imbalance in a parallel setting
– Reducing variance in partition samples
• Generating Stratified Samples
– Improving interactive performance
![Page 12: On the role of Interactivity and Data Placement in Big Data Analytics](https://reader035.fdocuments.in/reader035/viewer/2022062520/5681610c550346895dd05fef/html5/thumbnails/12.jpg)
Key Ideas
• Pivotization
– Convert data with complex structure into sets
– Each element of set captures features of local topology
• Hashing into Strata: Hash related sets into similar bins
– Can employ a sketch-clustering algorithm
• Partitioning: Place Strata into partitions for
– Locality
– Mitigating Data Skew
– Samples
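The first two ideas can be sketched in a few lines of Python. This is a minimal illustration, not the GlobalGraphs implementation. The choice of pivot (each node label paired with the sorted labels of its neighbors), the salted SHA-1 hash family, and the names `pivot_set`, `minhash_sketch`, and `to_strata` are all assumptions made for the example.

```python
import hashlib
from collections import defaultdict


def pivot_set(adjacency):
    """Turn a small labeled graph into a set of pivots.

    Assumed pivot: (node label, sorted neighbor labels), one simple way
    to capture local topology.
    """
    return {(node, tuple(sorted(nbrs))) for node, nbrs in adjacency.items()}


def _hash(item, seed):
    # Salted hash: each seed simulates one independent hash function.
    digest = hashlib.sha1(f"{seed}:{item!r}".encode()).hexdigest()
    return int(digest[:8], 16)


def minhash_sketch(items, num_hashes=4):
    """Minwise-hash sketch: the minimum hash value under each seeded function."""
    return tuple(min(_hash(item, seed) for item in items)
                 for seed in range(num_hashes))


def to_strata(records, num_hashes=4):
    """Group records with identical sketches (a crude stand-in for
    sketch sort / sketch clustering)."""
    strata = defaultdict(list)
    for name, adjacency in records.items():
        sketch = minhash_sketch(pivot_set(adjacency), num_hashes)
        strata[sketch].append(name)
    return dict(strata)


records = {
    "d1": {"A": ["B", "C"], "B": ["A"], "C": ["A"]},
    "d2": {"A": ["C", "B"], "B": ["A"], "C": ["A"]},  # same structure as d1
    "d3": {"E": ["F", "L"], "F": ["E"], "L": ["E"]},
}
strata = to_strata(records)
```

Here `d1` and `d2` yield the same pivot set and therefore land in the same stratum. Grouping on exact sketch equality only catches near-identical sets; a clustering step over sketches would tolerate partial overlap.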
![Page 13: On the role of Interactivity and Data Placement in Big Data Analytics](https://reader035.fdocuments.in/reader035/viewer/2022062520/5681610c550346895dd05fef/html5/thumbnails/13.jpg)
[Figure: data placement pipeline]
DATA (Δ): Δ1 … Δ25
→ PIVOT TRANSFORMATIONS → PIVOT SETS (PS): PS-1 … PS-25
→ MINWISE HASHING on PIVOT SETS → SKETCHES (SK), e.g. SK-1 = {1050, 2020, 3130, 1800}, SK-25 = {1050, 2020, 7225, 2020}
→ SKETCH SORT or SKETCH CLUSTER → Strata (S): S-1 … S-128, e.g. S-4 = (Δ1, SK-1), (Δ5, SK-5), (Δ12, SK-12), (Δ25, SK-25)
→ PARTITIONING & REPLICATION → partitions P-1 … P-8, e.g. P-2 = S-4, S-7, S-8, S-12, …, S-128 and P-8 = S-3, S-4, S-9, S-12, …, S-127
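The final stage, placing strata into partitions, can serve different goals. The sketch below contrasts two assumed policies: `place_for_locality` keeps every stratum whole and balances load greedily, while `place_for_stratified_samples` deals each stratum's members round-robin so that every partition resembles a stratified sample. Both function names and both policies are illustrative assumptions rather than the algorithm on the slide, and replication is not modeled.

```python
def place_for_locality(strata, num_partitions):
    """Keep each stratum whole; assign the largest strata first to the
    currently least-loaded partition (greedy load balancing)."""
    partitions = [[] for _ in range(num_partitions)]
    for members in sorted(strata.values(), key=len, reverse=True):
        min(partitions, key=len).extend(members)
    return partitions


def place_for_stratified_samples(strata, num_partitions):
    """Deal the members of each stratum round-robin across partitions so
    each partition holds a roughly proportional share of every stratum."""
    partitions = [[] for _ in range(num_partitions)]
    slot = 0
    for members in strata.values():
        for member in members:
            partitions[slot % num_partitions].append(member)
            slot += 1
    return partitions


strata = {
    "S-1": ["d1", "d2", "d3", "d4"],
    "S-2": ["d5", "d6"],
    "S-3": ["d7", "d8"],
}
local = place_for_locality(strata, 2)
mixed = place_for_stratified_samples(strata, 2)
```

With two partitions, `local` keeps S-1 together on one partition and S-2 plus S-3 on the other, whereas `mixed` gives each partition half of every stratum.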
![Page 14: On the role of Interactivity and Data Placement in Big Data Analytics](https://reader035.fdocuments.in/reader035/viewer/2022062520/5681610c550346895dd05fef/html5/thumbnails/14.jpg)
Frequent Tree Mining
• Our proposed approaches show 100X gains
![Page 15: On the role of Interactivity and Data Placement in Big Data Analytics](https://reader035.fdocuments.in/reader035/viewer/2022062520/5681610c550346895dd05fef/html5/thumbnails/15.jpg)
WebGraph Compression
• Linear Scaleup with no loss in compression ratio
![Page 16: On the role of Interactivity and Data Placement in Big Data Analytics](https://reader035.fdocuments.in/reader035/viewer/2022062520/5681610c550346895dd05fef/html5/thumbnails/16.jpg)
PRISM-HD: PRobing the Intrinsic Structure and Makeup of High-dimensional Data
![Page 17: On the role of Interactivity and Data Placement in Big Data Analytics](https://reader035.fdocuments.in/reader035/viewer/2022062520/5681610c550346895dd05fef/html5/thumbnails/17.jpg)
Visualization and Interactivity are key to discovery
![Page 18: On the role of Interactivity and Data Placement in Big Data Analytics](https://reader035.fdocuments.in/reader035/viewer/2022062520/5681610c550346895dd05fef/html5/thumbnails/18.jpg)
PRISM-HD
• What?
– A novel mechanism for exploring complex data
• Why?
– User is often overwhelmed with characteristics of data
– Befuddled on where to start
• How?
– Given a similarity measure of interest
– Compute similarity graph at threshold (t)
• Key: Graphs are dimensionless
– Provide user graph visualization cues
• User determines next threshold and repeats
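One pass of this loop can be sketched as follows. The assumptions are explicit: Jaccard similarity over sets stands in for the similarity measure of interest, the all-pairs comparison is brute force for illustration only, and the "visualization cues" are reduced to edge and connected-component counts. The names `similarity_graph` and `graph_cues` are hypothetical, not part of PRISM-HD.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two sets (assumed measure of interest)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def similarity_graph(items, t, sim=jaccard):
    """Edges between all pairs whose similarity is at least t (brute force)."""
    return [(u, v) for u, v in combinations(sorted(items), 2)
            if sim(items[u], items[v]) >= t]


def graph_cues(items, edges):
    """Simple cues a user might inspect: edge count and component count."""
    parent = {name: name for name in items}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for u, v in edges:
        parent[find(u)] = find(v)
    components = len({find(name) for name in items})
    return {"edges": len(edges), "components": components}


items = {
    "a": {1, 2, 3, 4},
    "b": {1, 2, 3, 5},
    "c": {1, 2, 8, 9},
    "d": {7, 8, 9, 10},
}
for t in (0.9, 0.5, 0.2):  # high, moderate, low threshold
    print(t, graph_cues(items, similarity_graph(items, t)))
```

Lowering the threshold adds edges and merges components, which is the kind of change a user would watch for when choosing the next threshold.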
![Page 19: On the role of Interactivity and Data Placement in Big Data Analytics](https://reader035.fdocuments.in/reader035/viewer/2022062520/5681610c550346895dd05fef/html5/thumbnails/19.jpg)
[Figure panels: HIGH THRESHOLD, MODERATE THRESHOLD, LOW THRESHOLD]
![Page 20: On the role of Interactivity and Data Placement in Big Data Analytics](https://reader035.fdocuments.in/reader035/viewer/2022062520/5681610c550346895dd05fef/html5/thumbnails/20.jpg)
Benefits of Knowledge Caching
![Page 21: On the role of Interactivity and Data Placement in Big Data Analytics](https://reader035.fdocuments.in/reader035/viewer/2022062520/5681610c550346895dd05fef/html5/thumbnails/21.jpg)
Benefits of Incremental Processing on Twitter
Incremental estimates on Twitter (t1 = 0.95)
![Page 22: On the role of Interactivity and Data Placement in Big Data Analytics](https://reader035.fdocuments.in/reader035/viewer/2022062520/5681610c550346895dd05fef/html5/thumbnails/22.jpg)
PRISM-HD and Global Graphs in Context: Leveraging Social Media in Emergency Response
![Page 23: On the role of Interactivity and Data Placement in Big Data Analytics](https://reader035.fdocuments.in/reader035/viewer/2022062520/5681610c550346895dd05fef/html5/thumbnails/23.jpg)
Concluding Remarks
• Data is everywhere
• Data is fraught with complexities
– Dimensionality, dynamics, structure, massive…
• Both data placement and data interactivity have an important role to play in big data analytics
– PRISM-HD and GlobalGraphs can help!
![Page 24: On the role of Interactivity and Data Placement in Big Data Analytics](https://reader035.fdocuments.in/reader035/viewer/2022062520/5681610c550346895dd05fef/html5/thumbnails/24.jpg)
Thanks for your attention
Contact: [email protected]
Mining Simulation Data
Medical Image Analysis
Protein Interaction Network (yeast)
Acknowledgements: Various NSF, NIH, DOE and industry grants