Distributed Data Management: Introduction (Thorsten Papenbrock)
![Page 1: Distributed Data Management Introduction Thorsten …...Super Large Scale Slide 37 Introduction Distributed Data Analytics Thorsten Papenbrock Use cases Weather forecasting Market](https://reader034.fdocuments.in/reader034/viewer/2022042222/5ec95c4d3573ac24be5460ab/html5/thumbnails/1.jpg)
Distributed Data Management
Introduction
Thorsten Papenbrock
F-2.04, Campus II
Hasso Plattner Institut
![Page 2: Distributed Data Management Introduction Thorsten …...Super Large Scale Slide 37 Introduction Distributed Data Analytics Thorsten Papenbrock Use cases Weather forecasting Market](https://reader034.fdocuments.in/reader034/viewer/2022042222/5ec95c4d3573ac24be5460ab/html5/thumbnails/2.jpg)
Supercomputer Minerva (Max Planck Institute in Potsdam-Golm)
![Page 3: Distributed Data Management Introduction Thorsten …...Super Large Scale Slide 37 Introduction Distributed Data Analytics Thorsten Papenbrock Use cases Weather forecasting Market](https://reader034.fdocuments.in/reader034/viewer/2022042222/5ec95c4d3573ac24be5460ab/html5/thumbnails/3.jpg)
LS Naumann Commodity Hardware Cluster (10 Nodes)
![Page 4: Distributed Data Management Introduction Thorsten …...Super Large Scale Slide 37 Introduction Distributed Data Analytics Thorsten Papenbrock Use cases Weather forecasting Market](https://reader034.fdocuments.in/reader034/viewer/2022042222/5ec95c4d3573ac24be5460ab/html5/thumbnails/4.jpg)
Desktop Computer (multiple CPUs and GPUs)
![Page 5: Distributed Data Management Introduction Thorsten …...Super Large Scale Slide 37 Introduction Distributed Data Analytics Thorsten Papenbrock Use cases Weather forecasting Market](https://reader034.fdocuments.in/reader034/viewer/2022042222/5ec95c4d3573ac24be5460ab/html5/thumbnails/5.jpg)
LS Naumann Infrastructure (Server, Cluster, SAN)
![Page 6: Distributed Data Management Introduction Thorsten …...Super Large Scale Slide 37 Introduction Distributed Data Analytics Thorsten Papenbrock Use cases Weather forecasting Market](https://reader034.fdocuments.in/reader034/viewer/2022042222/5ec95c4d3573ac24be5460ab/html5/thumbnails/6.jpg)
LS Naumann PI Cluster (12 Raspberry PI 4)
![Page 7: Distributed Data Management Introduction Thorsten …...Super Large Scale Slide 37 Introduction Distributed Data Analytics Thorsten Papenbrock Use cases Weather forecasting Market](https://reader034.fdocuments.in/reader034/viewer/2022042222/5ec95c4d3573ac24be5460ab/html5/thumbnails/7.jpg)
DreamHack (12,000-computer LAN party)
![Page 8: Distributed Data Management Introduction Thorsten …...Super Large Scale Slide 37 Introduction Distributed Data Analytics Thorsten Papenbrock Use cases Weather forecasting Market](https://reader034.fdocuments.in/reader034/viewer/2022042222/5ec95c4d3573ac24be5460ab/html5/thumbnails/8.jpg)
Boeing 747 (thousands of computers)
![Page 9: Distributed Data Management Introduction Thorsten …...Super Large Scale Slide 37 Introduction Distributed Data Analytics Thorsten Papenbrock Use cases Weather forecasting Market](https://reader034.fdocuments.in/reader034/viewer/2022042222/5ec95c4d3573ac24be5460ab/html5/thumbnails/9.jpg)
Turbine test bench (thousands of sensors)
![Page 10: Distributed Data Management Introduction Thorsten …...Super Large Scale Slide 37 Introduction Distributed Data Analytics Thorsten Papenbrock Use cases Weather forecasting Market](https://reader034.fdocuments.in/reader034/viewer/2022042222/5ec95c4d3573ac24be5460ab/html5/thumbnails/10.jpg)
Startpage (search engine backed by other search engines)
![Page 11: Distributed Data Management Introduction Thorsten …...Super Large Scale Slide 37 Introduction Distributed Data Analytics Thorsten Papenbrock Use cases Weather forecasting Market](https://reader034.fdocuments.in/reader034/viewer/2022042222/5ec95c4d3573ac24be5460ab/html5/thumbnails/11.jpg)
Lost & Invalid Messages
Consensus Termination
![Page 12: Distributed Data Management Introduction Thorsten …...Super Large Scale Slide 37 Introduction Distributed Data Analytics Thorsten Papenbrock Use cases Weather forecasting Market](https://reader034.fdocuments.in/reader034/viewer/2022042222/5ec95c4d3573ac24be5460ab/html5/thumbnails/12.jpg)
Introduction
Examples Distributed Systems
Lecture Organization
Motivation “Distributed”
Motivation “Data”
Motivation “Management”
![Page 13: Distributed Data Management Introduction Thorsten …...Super Large Scale Slide 37 Introduction Distributed Data Analytics Thorsten Papenbrock Use cases Weather forecasting Market](https://reader034.fdocuments.in/reader034/viewer/2022042222/5ec95c4d3573ac24be5460ab/html5/thumbnails/13.jpg)
Distributed Data Management
Information Systems Team
Slide 13
Introduction
Distributed Data Management
Thorsten Papenbrock
Data Fusion
Service-Oriented Systems
Prof. Felix Naumann
Information Integration
Data Profiling
Distributed Computing
Entity Search
Duplicate Detection
RDF Data Mining
ETL Management
project DuDe
project Stratosphere
Data as a Service
Opinion Mining
Data Scrubbing
project DataChEx
Dependency Detection
Linked Open Data
Data Cleansing
Agile Systems
Entity Recognition
Dr. Thorsten Papenbrock
Text Mining
Dr. Ralf Krestel
Phillip Wenig
John Koumarelas
Michael Loster
Hazar Harmouch
Diana Stephan
Tobias Bleifuß
Tim Repke
Lan Jiang
Web Science
Data Change
project Metanome
Julian Risch
Leon Bornemann
Change Exploration
Data Preparation
Web Data
Nitisha Jain
![Page 14: Distributed Data Management Introduction Thorsten …...Super Large Scale Slide 37 Introduction Distributed Data Analytics Thorsten Papenbrock Use cases Weather forecasting Market](https://reader034.fdocuments.in/reader034/viewer/2022042222/5ec95c4d3573ac24be5460ab/html5/thumbnails/14.jpg)
Distributed Data Management
Introduction: Audience
English?
Which semester?
HPI or Guest?
Database knowledge?
Other related lectures?
ITSE, DE, DH?
Distributed experience?
![Page 15: Distributed Data Management Introduction Thorsten …...Super Large Scale Slide 37 Introduction Distributed Data Analytics Thorsten Papenbrock Use cases Weather forecasting Market](https://reader034.fdocuments.in/reader034/viewer/2022042222/5ec95c4d3573ac24be5460ab/html5/thumbnails/15.jpg)
Distributed Data Management
Courses 2019/2020
https://hpi.de/naumann/teaching/current-courses.html
![Page 16: Distributed Data Management Introduction Thorsten …...Super Large Scale Slide 37 Introduction Distributed Data Analytics Thorsten Papenbrock Use cases Weather forecasting Market](https://reader034.fdocuments.in/reader034/viewer/2022042222/5ec95c4d3573ac24be5460ab/html5/thumbnails/16.jpg)
Distributed Data Management
This Lecture
Lecture
For master students
(IT-Systems Engineering,
Digital Health, Data Engineering)
6 credit points, 4 SWS
Mondays 13:30 – 15:00
Tuesdays 15:15 – 16:45
Exercises
Interleaved with lectures
Slides
On website
Website
https://hpi.de/naumann/teaching/teaching/ws-1920/distributed-data-management-vl-master.html
Prerequisites
To participate:
A little background and interest in
databases (e.g., DBS I lecture);
object-oriented programming skills
For exam:
Attending lectures, participation in
exercises, and completion of
exercise homework tasks
Exam
Written exam
Probably first week after lectures
![Page 17: Distributed Data Management Introduction Thorsten …...Super Large Scale Slide 37 Introduction Distributed Data Analytics Thorsten Papenbrock Use cases Weather forecasting Market](https://reader034.fdocuments.in/reader034/viewer/2022042222/5ec95c4d3573ac24be5460ab/html5/thumbnails/17.jpg)
Distributed Data Management
Feedback
Questions any time, please!
During lectures
Visit us: Campus II, Room F-2.04
Email:
Also: Give feedback about …
improving lectures
informational material
organization
Official evaluation
At the end of this semester
… too late for important feedback!
![Page 18: Distributed Data Management Introduction Thorsten …...Super Large Scale Slide 37 Introduction Distributed Data Analytics Thorsten Papenbrock Use cases Weather forecasting Market](https://reader034.fdocuments.in/reader034/viewer/2022042222/5ec95c4d3573ac24be5460ab/html5/thumbnails/18.jpg)
Distributed Data Management
Feedback
See results of seminar
“Reliable Distributed Systems Engineering”
https://hpi.de//naumann/teaching/teaching/ss-19/reliable-distributed-systems-engineering.html
![Page 19: Distributed Data Management Introduction Thorsten …...Super Large Scale Slide 37 Introduction Distributed Data Analytics Thorsten Papenbrock Use cases Weather forecasting Market](https://reader034.fdocuments.in/reader034/viewer/2022042222/5ec95c4d3573ac24be5460ab/html5/thumbnails/19.jpg)
![Page 20: Distributed Data Management Introduction Thorsten …...Super Large Scale Slide 37 Introduction Distributed Data Analytics Thorsten Papenbrock Use cases Weather forecasting Market](https://reader034.fdocuments.in/reader034/viewer/2022042222/5ec95c4d3573ac24be5460ab/html5/thumbnails/20.jpg)
Distributed Data Management
Lecture Outline (2018 !)
1. Introduction
2. Foundations
3. OLAP and OLTP
4. Encoding and Evolution
5. Hands-On: Akka
6. Data Models and Query Languages
7. Storage and Retrieval
8. Replication
9. Partitioning
10. Batch Processing
11. Hands-On: Spark
12. Distributed Systems
13. Consistency and Consensus
14. Transactions
15. Stream Processing
16. Hands-On: Flink
17. Mining Data Streams
18. Distributed Algorithms
19. Services and Containerization
20. Cloud-based Data Systems
21. Lecture Summary and Exam Preparation
![Page 21: Distributed Data Management Introduction Thorsten …...Super Large Scale Slide 37 Introduction Distributed Data Analytics Thorsten Papenbrock Use cases Weather forecasting Market](https://reader034.fdocuments.in/reader034/viewer/2022042222/5ec95c4d3573ac24be5460ab/html5/thumbnails/21.jpg)
Distributed Data Management
Lecture Outline (2018 !) – Homework
1. Introduction
2. Foundations
3. OLAP and OLTP
4. Encoding and Evolution
5. Hands-On: Akka
6. Data Models and Query Languages
7. Storage and Retrieval
8. Replication
9. Partitioning
10. Batch Processing
11. Hands-On: Spark
12. Distributed Systems
13. Consistency and Consensus
14. Transactions
15. Stream Processing
16. Hands-On: Flink
17. Mining Data Streams
18. Distributed Algorithms
19. Services and Containerization
20. Cloud-based Data Systems
21. Lecture Summary and Exam Preparation
![Page 22: Distributed Data Management Introduction Thorsten …...Super Large Scale Slide 37 Introduction Distributed Data Analytics Thorsten Papenbrock Use cases Weather forecasting Market](https://reader034.fdocuments.in/reader034/viewer/2022042222/5ec95c4d3573ac24be5460ab/html5/thumbnails/22.jpg)
Distributed Data Management
Literature: Course Book
Designing Data-Intensive Applications
Author: Martin Kleppmann
Date: March 2017
Publisher: O'Reilly Media, Inc.
ISBN: 978-1-449-37332-0
References:
https://github.com/ept/ddia-references
Scope for this lecture
Distributed and parallel systems
Big data storage
Batch and stream processing
![Page 23: Distributed Data Management Introduction Thorsten …...Super Large Scale Slide 37 Introduction Distributed Data Analytics Thorsten Papenbrock Use cases Weather forecasting Market](https://reader034.fdocuments.in/reader034/viewer/2022042222/5ec95c4d3573ac24be5460ab/html5/thumbnails/23.jpg)
Distributed Data Management
Literature: Further Reading
And web links that are given on the slides during the lecture.
![Page 24: Distributed Data Management Introduction Thorsten …...Super Large Scale Slide 37 Introduction Distributed Data Analytics Thorsten Papenbrock Use cases Weather forecasting Market](https://reader034.fdocuments.in/reader034/viewer/2022042222/5ec95c4d3573ac24be5460ab/html5/thumbnails/24.jpg)
Introduction
Examples Distributed Systems
Lecture Organization
Motivation “Distributed”
Motivation “Data”
Motivation “Management”
![Page 25: Distributed Data Management Introduction Thorsten …...Super Large Scale Slide 37 Introduction Distributed Data Analytics Thorsten Papenbrock Use cases Weather forecasting Market](https://reader034.fdocuments.in/reader034/viewer/2022042222/5ec95c4d3573ac24be5460ab/html5/thumbnails/25.jpg)
Motivation: “Distributed”
Paradigm Shift in Software-Writing
http://www.gotw.ca/publications/concurrency-ddj.htm
The free lunch is over!
Clock speeds stall
Transistor numbers still increase
Cores in CPUs/GPUs
CPUs/GPUs in compute nodes,
compute nodes in clusters
Paradigm Shift:
Earlier: optimize code for a single thread
Now: solve tasks in parallel
Distributed computing
“Distribution of work on (potentially)
physically isolated compute nodes”
Moore’s Law
Power wall
![Page 26: Distributed Data Management Introduction Thorsten …...Super Large Scale Slide 37 Introduction Distributed Data Analytics Thorsten Papenbrock Use cases Weather forecasting Market](https://reader034.fdocuments.in/reader034/viewer/2022042222/5ec95c4d3573ac24be5460ab/html5/thumbnails/26.jpg)
Motivation: “Distributed”
Surpassing Moore’s Law
Moore’s Law (Observation)
“The number of transistors on
integrated circuit chips doubles
approximately every two years”
Hyperscale: With clusters of distributed machines, we can already build systems with any number of transistors!
(we don’t even need to wait for new processors)
![Page 27: Distributed Data Management Introduction Thorsten …...Super Large Scale Slide 37 Introduction Distributed Data Analytics Thorsten Papenbrock Use cases Weather forecasting Market](https://reader034.fdocuments.in/reader034/viewer/2022042222/5ec95c4d3573ac24be5460ab/html5/thumbnails/27.jpg)
Motivation: “Distributed”
High Performance and Hyperscale Computing
High Performance Computing (HPC)
Super computers
Specialized hardware (NUMA systems)
Heterogeneous hardware (FPGAs, GPUs, etc.)
Precision matters
Floating-point operations per second (FLOPS)
Scientific and analytical use cases
OLAP, simulations, forecasts, machine learning, data mining, …
Hyperscale Computing
Standard computers
Fast commodity servers
Response time, availability, and throughput matter
X-percentile response time, queries-per-second, …
Scalable systems (and analytical) use cases
OLTP, web services, application hosting, cloud, data transformation, …
Both use distributed computing!
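The hyperscale metrics named above (X-percentile response time, queries-per-second) can be sketched in a few lines. This is a minimal illustration, not a definition taken from the lecture: it assumes the nearest-rank percentile method, and the function names and latency values are hypothetical.

```python
import math


def percentile(samples, x):
    """Return the x-th percentile (0 < x <= 100) using the nearest-rank method."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < x <= 100:
        raise ValueError("x must be in (0, 100]")
    ordered = sorted(samples)
    # Nearest rank: the smallest value such that at least x% of samples are <= it.
    rank = math.ceil(x * len(ordered) / 100)
    return ordered[rank - 1]


def queries_per_second(num_queries, window_seconds):
    """Average throughput over an observation window."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    return num_queries / window_seconds


if __name__ == "__main__":
    # Hypothetical response times in milliseconds; two slow outliers.
    latencies_ms = [12, 15, 11, 14, 250, 13, 16, 12, 900, 14]
    for x in (50, 90, 99):
        print(f"p{x} response time: {percentile(latencies_ms, x)} ms")
    print(f"throughput: {queries_per_second(1200, 60)} queries/s")
```

High percentiles are dominated by the few slow requests, which is why they are often reported alongside the median rather than an average.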
![Page 28: Distributed Data Management Introduction Thorsten …...Super Large Scale Slide 37 Introduction Distributed Data Analytics Thorsten Papenbrock Use cases Weather forecasting Market](https://reader034.fdocuments.in/reader034/viewer/2022042222/5ec95c4d3573ac24be5460ab/html5/thumbnails/28.jpg)
Motivation: “Distributed”
A Rule to Acknowledge
Amdahl’s Law
“The speedup of a program using
multiple processors for parallel
computing is limited by the
sequential fraction of the program”
s: degree of parallelization (e.g. #cores)
p: percentage of the algorithm that
profits from parallelization
$$\mathit{Speedup}(s) = \frac{1}{(1 - p) + \frac{p}{s}}$$
Even distributed parallelization cannot work around this law!
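Amdahl's Law can be evaluated numerically to see how quickly the sequential fraction caps the speedup. The following is a small sketch; the function names and the example value p = 0.95 are illustrative assumptions, and p is treated as a fraction between 0 and 1.

```python
def amdahl_speedup(p: float, s: float) -> float:
    """Speedup for parallelizable fraction p (0..1) and degree of parallelization s (>= 1)."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    if s < 1:
        raise ValueError("s must be >= 1")
    return 1.0 / ((1.0 - p) + p / s)


def amdahl_limit(p: float) -> float:
    """Upper bound on the speedup as s grows without bound."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    return float("inf") if p == 1.0 else 1.0 / (1.0 - p)


if __name__ == "__main__":
    p = 0.95  # assumed: 95% of the program profits from parallelization
    for s in (1, 2, 8, 64, 1024):
        print(f"s = {s:>4}: speedup = {amdahl_speedup(p, s):.2f}")
    print(f"limit for s -> infinity: {amdahl_limit(p):.2f}")
```

With p = 0.95 the speedup can never exceed roughly 20, no matter how many cores or nodes are added, because the remaining 5% of the program still runs sequentially.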
![Page 29: Distributed Data Management Introduction Thorsten …...Super Large Scale Slide 37 Introduction Distributed Data Analytics Thorsten Papenbrock Use cases Weather forecasting Market](https://reader034.fdocuments.in/reader034/viewer/2022042222/5ec95c4d3573ac24be5460ab/html5/thumbnails/29.jpg)
Motivation: “Distributed”
New Technologies
Distributed Computing
…
Distributed Storage
…
![Page 30: Distributed Data Management Introduction Thorsten …...Super Large Scale Slide 37 Introduction Distributed Data Analytics Thorsten Papenbrock Use cases Weather forecasting Market](https://reader034.fdocuments.in/reader034/viewer/2022042222/5ec95c4d3573ac24be5460ab/html5/thumbnails/30.jpg)
![Page 31: Distributed Data Management Introduction Thorsten …...Super Large Scale Slide 37 Introduction Distributed Data Analytics Thorsten Papenbrock Use cases Weather forecasting Market](https://reader034.fdocuments.in/reader034/viewer/2022042222/5ec95c4d3573ac24be5460ab/html5/thumbnails/31.jpg)
Motivation: “Distributed”
Driving Forces
Data volumes increase:
business data, sensor data, social media data, …
Data analytics gains importance:
downtime-less, real-time, predictive
Parallelization paradigm shifts:
multi-core and network speeds increase while CPU clock speeds stall
Computation resources become more available:
IaaS, PaaS, SaaS
Free and open source software gains popularity:
setting standards, utilizing external development resources, improving
software quality, avoiding vendor lock-in, …
![Page 32: Distributed Data Management Introduction Thorsten …...Super Large Scale Slide 37 Introduction Distributed Data Analytics Thorsten Papenbrock Use cases Weather forecasting Market](https://reader034.fdocuments.in/reader034/viewer/2022042222/5ec95c4d3573ac24be5460ab/html5/thumbnails/32.jpg)
Motivation: “Distributed”
Small and Medium Scale
Low-cost and low-energy cluster of Cubieboards running Hadoop
A cluster of commodity hardware running Hadoop
![Page 33: Distributed Data Management Introduction Thorsten …...Super Large Scale Slide 37 Introduction Distributed Data Analytics Thorsten Papenbrock Use cases Weather forecasting Market](https://reader034.fdocuments.in/reader034/viewer/2022042222/5ec95c4d3573ac24be5460ab/html5/thumbnails/33.jpg)
Motivation: “Distributed”
Large Scale
A cluster of machines running Hadoop at Yahoo!
![Page 34: Distributed Data Management Introduction Thorsten …...Super Large Scale Slide 37 Introduction Distributed Data Analytics Thorsten Papenbrock Use cases Weather forecasting Market](https://reader034.fdocuments.in/reader034/viewer/2022042222/5ec95c4d3573ac24be5460ab/html5/thumbnails/34.jpg)
Motivation: “Distributed”
Super Large Scale
Top 10 Super Computers 2017
All distributed systems!
https://www.top500.org/lists/2017/06/
![Page 35: Distributed Data Management Introduction Thorsten …...Super Large Scale Slide 37 Introduction Distributed Data Analytics Thorsten Papenbrock Use cases Weather forecasting Market](https://reader034.fdocuments.in/reader034/viewer/2022042222/5ec95c4d3573ac24be5460ab/html5/thumbnails/35.jpg)
![Page 36: Distributed Data Management Introduction Thorsten …...Super Large Scale Slide 37 Introduction Distributed Data Analytics Thorsten Papenbrock Use cases Weather forecasting Market](https://reader034.fdocuments.in/reader034/viewer/2022042222/5ec95c4d3573ac24be5460ab/html5/thumbnails/36.jpg)
![Page 37: Distributed Data Management Introduction Thorsten …...Super Large Scale Slide 37 Introduction Distributed Data Analytics Thorsten Papenbrock Use cases Weather forecasting Market](https://reader034.fdocuments.in/reader034/viewer/2022042222/5ec95c4d3573ac24be5460ab/html5/thumbnails/37.jpg)
Motivation: “Distributed”
Super Large Scale
Use cases
Weather forecasting
Market analysis
Crash simulation
Disaster simulation
Brute force decryption
Molecular dynamics modeling
…
Data-intensive analytics tasks!
![Page 38: Distributed Data Management Introduction Thorsten …...Super Large Scale Slide 37 Introduction Distributed Data Analytics Thorsten Papenbrock Use cases Weather forecasting Market](https://reader034.fdocuments.in/reader034/viewer/2022042222/5ec95c4d3573ac24be5460ab/html5/thumbnails/38.jpg)
Introduction
Examples Distributed Systems
Lecture Organization
Motivation “Distributed”
Motivation “Data”
Motivation “Management”
![Page 39: Distributed Data Management Introduction Thorsten …...Super Large Scale Slide 37 Introduction Distributed Data Analytics Thorsten Papenbrock Use cases Weather forecasting Market](https://reader034.fdocuments.in/reader034/viewer/2022042222/5ec95c4d3573ac24be5460ab/html5/thumbnails/39.jpg)
Data Scientist: The Sexiest Job of the 21st Century
https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century
![Page 40: Distributed Data Management Introduction Thorsten …...Super Large Scale Slide 37 Introduction Distributed Data Analytics Thorsten Papenbrock Use cases Weather forecasting Market](https://reader034.fdocuments.in/reader034/viewer/2022042222/5ec95c4d3573ac24be5460ab/html5/thumbnails/40.jpg)
Data Engineer: The ‘real’ Sexiest Job of the 21st Century
https://www.information-age.com/data-engineer-sexiest-job-21st-century-123480578/
![Page 41: Distributed Data Management Introduction Thorsten …...Super Large Scale Slide 37 Introduction Distributed Data Analytics Thorsten Papenbrock Use cases Weather forecasting Market](https://reader034.fdocuments.in/reader034/viewer/2022042222/5ec95c4d3573ac24be5460ab/html5/thumbnails/41.jpg)
https://www.idc.com/getdoc.jsp?containerId=prUS41826116
http://sigmacareer.com/big-data-what-is-it-and-what-are-the-trends
Excellent job opportunities in many companies!
A market worth $122 billion in 2016 with a growth of 11.3% per year!
For a world that created an entire zettabyte (which is exactly 10^12 GB)
of data in 2010 alone!
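The two figures on this slide can be checked with simple arithmetic. The sketch below assumes SI (decimal) units, where 1 GB is 10^9 bytes and 1 ZB is 10^21 bytes. The growth function is only a hypothetical illustration of what a constant 11.3% yearly rate would mean; it is not a forecast from the cited sources.

```python
BYTES_PER_GB = 10**9   # SI gigabyte (assumption: decimal units, not GiB)
BYTES_PER_ZB = 10**21  # SI zettabyte

# One zettabyte expressed in gigabytes.
GB_PER_ZB = BYTES_PER_ZB // BYTES_PER_GB


def compound_growth(start_value: float, yearly_rate: float, years: int) -> float:
    """Value after `years` of constant compound growth (illustrative assumption)."""
    if years < 0:
        raise ValueError("years must be non-negative")
    return start_value * (1.0 + yearly_rate) ** years


if __name__ == "__main__":
    print(f"1 ZB = 10^{len(str(GB_PER_ZB)) - 1} GB")
    # Market size in billion USD, starting from the 2016 figure on the slide.
    for years in (0, 1, 2):
        value = compound_growth(122.0, 0.113, years)
        print(f"after {years} year(s): {value:.1f} billion USD")
```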
![Page 42: Distributed Data Management Introduction Thorsten …...Super Large Scale Slide 37 Introduction Distributed Data Analytics Thorsten Papenbrock Use cases Weather forecasting Market](https://reader034.fdocuments.in/reader034/viewer/2022042222/5ec95c4d3573ac24be5460ab/html5/thumbnails/42.jpg)
VLDB 2017 Program
International conference “Very Large Data Bases”
All of these are data processing and analytics tasks that
increasingly rely on distributed computing.
![Page 43: Distributed Data Management Introduction Thorsten …...Super Large Scale Slide 37 Introduction Distributed Data Analytics Thorsten Papenbrock Use cases Weather forecasting Market](https://reader034.fdocuments.in/reader034/viewer/2022042222/5ec95c4d3573ac24be5460ab/html5/thumbnails/43.jpg)
Motivation: “Data”
Successful IT Startups
Example: Mobile Motion GmbH
Dubsmash
An HPI-Startup of 2013
Founders:
Jonas Drüppel, Roland Grenke, Daniel Taschik
November 19, 2014: Launch of the Dubsmash app
November 26, 2014: Dubsmash became the number one downloaded app in Germany
June 1, 2015: Dubsmash had been downloaded over 50 million times in 192 countries
![Page 44: Distributed Data Management Introduction Thorsten …...Super Large Scale Slide 37 Introduction Distributed Data Analytics Thorsten Papenbrock Use cases Weather forecasting Market](https://reader034.fdocuments.in/reader034/viewer/2022042222/5ec95c4d3573ac24be5460ab/html5/thumbnails/44.jpg)
Motivation: “Data”
Successful IT Startups
Many further HPI Startups!
![Page 45: Distributed Data Management Introduction Thorsten …...Super Large Scale Slide 37 Introduction Distributed Data Analytics Thorsten Papenbrock Use cases Weather forecasting Market](https://reader034.fdocuments.in/reader034/viewer/2022042222/5ec95c4d3573ac24be5460ab/html5/thumbnails/45.jpg)
Motivation: “Data”
Successful IT Startups
Successful IT-Startups in recent years are masters of data:
1. AirBnB
2. Instagram
3. Pinterest
4. Angry Birds
5. LinkedIn
6. Uber
7. Snapchat
8. WhatsApp
9. Twitter
10. Facebook
11. …
Peta- to Exabytes of …
profile data (names, addresses, friends, …)
content data (images, videos, messages, …)
event data (logins, interactions, games, …)
…
Challenged with …
streaming
persistence
analytics
load-balancing
…
![Page 46: Distributed Data Management Introduction Thorsten …...Super Large Scale Slide 37 Introduction Distributed Data Analytics Thorsten Papenbrock Use cases Weather forecasting Market](https://reader034.fdocuments.in/reader034/viewer/2022042222/5ec95c4d3573ac24be5460ab/html5/thumbnails/46.jpg)
Introduction
Examples Distributed Systems
Lecture Organization
Motivation “Distributed”
Motivation “Data”
Motivation “Management”
![Page 47: Distributed Data Management Introduction Thorsten …...Super Large Scale Slide 37 Introduction Distributed Data Analytics Thorsten Papenbrock Use cases Weather forecasting Market](https://reader034.fdocuments.in/reader034/viewer/2022042222/5ec95c4d3573ac24be5460ab/html5/thumbnails/47.jpg)
Motivation: “Management”
Rethinking Data Management
1. Data is distributed and replicated!
Data needs to reach a processor to
be computed.
Processor memory is very small but
data is usually large.
Data is stored distributed and
replicated in memory hierarchies.
Data needs to be fetched, i.e.,
copied to a processor before it can
be computed.
Data needs to be flushed, i.e.,
copied to higher memory levels to
become visible to other processors.
2. Moving data costs orders of magnitude more
time and energy than computing data!
Push computation to the data
![Page 48: Distributed Data Management Introduction Thorsten …...Super Large Scale Slide 37 Introduction Distributed Data Analytics Thorsten Papenbrock Use cases Weather forecasting Market](https://reader034.fdocuments.in/reader034/viewer/2022042222/5ec95c4d3573ac24be5460ab/html5/thumbnails/48.jpg)
Motivation: “Management”
Rethinking Data Management
Moving data costs orders of magnitude more
time and energy than computing data!
Copying data costs time and energy.
Stalled processors during data
copying consume energy.
Push computation to the data, not
data to the computation.
https://hpc.pnl.gov//modsim/2014/Presentations/Kestor.pdf
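The principle of pushing computation to the data can be illustrated with a toy example that computes a global mean in two ways. This is a simplified sketch, not a real distributed system: the `Node` class, its methods, and the "values moved" counter are hypothetical stand-ins for storage nodes and network traffic.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """A hypothetical storage node holding one partition of numeric readings."""
    name: str
    readings: list

    def ship_raw(self):
        """Data to computation: send every stored value to the coordinator."""
        return list(self.readings)

    def local_partial(self):
        """Computation to data: aggregate locally, send only (sum, count)."""
        return (sum(self.readings), len(self.readings))


def mean_by_moving_data(nodes):
    """Collect all raw values centrally; returns (mean, number of values moved)."""
    values = [v for node in nodes for v in node.ship_raw()]
    return sum(values) / len(values), len(values)


def mean_by_moving_computation(nodes):
    """Combine per-node partial results; returns (mean, number of values moved)."""
    partials = [node.local_partial() for node in nodes]
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count, 2 * len(partials)


if __name__ == "__main__":
    cluster = [Node("a", list(range(0, 1000))), Node("b", list(range(1000, 2000)))]
    print("move data:       ", mean_by_moving_data(cluster))
    print("move computation:", mean_by_moving_computation(cluster))
```

Both variants return the same mean, but the second one transfers only two numbers per node instead of the whole partition. The saving grows with the partition size.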
![Page 49: Distributed Data Management Introduction Thorsten …...Super Large Scale Slide 37 Introduction Distributed Data Analytics Thorsten Papenbrock Use cases Weather forecasting Market](https://reader034.fdocuments.in/reader034/viewer/2022042222/5ec95c4d3573ac24be5460ab/html5/thumbnails/49.jpg)
Motivation: “Management”
Rethinking Data Management
Why energy is a concern:
https://hpc.pnl.gov//modsim/2014/Presentations/Kestor.pdf
![Page 50: Distributed Data Management Introduction Thorsten …...Super Large Scale Slide 37 Introduction Distributed Data Analytics Thorsten Papenbrock Use cases Weather forecasting Market](https://reader034.fdocuments.in/reader034/viewer/2022042222/5ec95c4d3573ac24be5460ab/html5/thumbnails/50.jpg)
Motivation: “Management”
Rethinking Data Management
Data engineers and data scientists
need to be good data managers!
Data encoding
Data transmission
Data replication
Data partitioning
Data consistency management
Load scheduling
Load balancing
We do not consider L0-L3 in this lecture, but this is super relevant for High Performance Computing!
I recommend: https://www.youtube.com/watch?v=3PjNgRWmv90&list=LLbLaqsrSDDURdv_ZV75-AMQ&index=6&t=0s
![Page 51: Distributed Data Management Introduction Thorsten …...Super Large Scale Slide 37 Introduction Distributed Data Analytics Thorsten Papenbrock Use cases Weather forecasting Market](https://reader034.fdocuments.in/reader034/viewer/2022042222/5ec95c4d3573ac24be5460ab/html5/thumbnails/51.jpg)
Domain Knowledge
Data Science
Control Flow
Iterative Algorithms
Error Estimation
Active Sampling
Sketches
Curse of Dimensionality
Decoupling
Convergence
Monte Carlo
Mathematical Programming
Linear Algebra
Stochastic Gradient Descent
Statistics
Data Obfuscation
Parallelization
Query Optimization
Visual Analytics
Relational Algebra / SQL
Scalability
Data Analysis Languages
Fault Tolerance
Memory Management
Memory Hierarchy
Data Flow
Information Extraction
Indexing
RDF / SparQL
NF2 / XQuery
Data Warehouse/OLAP
Domain Expertise (e.g., Industry 4.0, Medicine, Physics, Engineering, Energy, Logistics)
Real-Time
Information Integration
Text Mining
Graph Mining
Signal Processing
Business Models
Legal Aspects
Privacy
Security
Regression
Machine Learning
Predictive Analytics
![Page 52: Distributed Data Management Introduction Thorsten …...Super Large Scale Slide 37 Introduction Distributed Data Analytics Thorsten Papenbrock Use cases Weather forecasting Market](https://reader034.fdocuments.in/reader034/viewer/2022042222/5ec95c4d3573ac24be5460ab/html5/thumbnails/52.jpg)
Motivation: “Management”
Data Management
Data Management
“The ability to efficiently
read, transform, and store
large amounts of data!”
Static (block) data
Volatile (streaming) data
Data Analytics
“The ability to effectively
extract and calculate
various kinds of information from data!”
Structural information
Explicit information
Implicit/derived information
![Page 53: Distributed Data Management Introduction Thorsten …...Super Large Scale Slide 37 Introduction Distributed Data Analytics Thorsten Papenbrock Use cases Weather forecasting Market](https://reader034.fdocuments.in/reader034/viewer/2022042222/5ec95c4d3573ac24be5460ab/html5/thumbnails/53.jpg)
Motivation: “Management”
Related Topics
Database Systems
Software Architecture
Data Mining
Parallel Computing
Distributed Data Management
![Page 54: Distributed Data Management Introduction Thorsten …...Super Large Scale Slide 37 Introduction Distributed Data Analytics Thorsten Papenbrock Use cases Weather forecasting Market](https://reader034.fdocuments.in/reader034/viewer/2022042222/5ec95c4d3573ac24be5460ab/html5/thumbnails/54.jpg)
Motivation: “Management”
Related Topics
Software Architecture
Data Mining
Parallel Computing
Database Systems
Distributed Data Management
![Page 55: Distributed Data Management Introduction Thorsten …...Super Large Scale Slide 37 Introduction Distributed Data Analytics Thorsten Papenbrock Use cases Weather forecasting Market](https://reader034.fdocuments.in/reader034/viewer/2022042222/5ec95c4d3573ac24be5460ab/html5/thumbnails/55.jpg)
Motivation: “Management”
Database Systems
Touch points
Data models, query languages, and consistency guarantees
Distributed storage and retrieval of data
Index structures
Not in this lecture
Physical data storage
Foundations on transaction management and logging
Core database technology, e.g., query optimizer
More focused lectures
Database Systems I + II (Prof. Naumann)
Trends and Concepts in Software Industry (Prof. Plattner)
![Page 56: Distributed Data Management Introduction Thorsten …...Super Large Scale Slide 37 Introduction Distributed Data Analytics Thorsten Papenbrock Use cases Weather forecasting Market](https://reader034.fdocuments.in/reader034/viewer/2022042222/5ec95c4d3573ac24be5460ab/html5/thumbnails/56.jpg)
Motivation: “Management”
Related Topics
Slide 56
[Diagram: Distributed Data Management and its related topics: Database Systems, Data Mining, Parallel Computing, Software Architecture]
![Page 57: Distributed Data Management Introduction Thorsten …...Super Large Scale Slide 37 Introduction Distributed Data Analytics Thorsten Papenbrock Use cases Weather forecasting Market](https://reader034.fdocuments.in/reader034/viewer/2022042222/5ec95c4d3573ac24be5460ab/html5/thumbnails/57.jpg)
Motivation: “Management”
Software Architectures
Slide 57
Touch points
- Requirements, design, and architecture of distributed systems
- Pros and cons of different technologies for distributed systems
Not in this lecture
- Non-distributed systems
- Agile software development techniques
- Software patterns
More focused lectures
- Software Architecture (Dr. Uflacker)
- Software Technique (Dr. Uflacker)
![Page 58: Distributed Data Management Introduction Thorsten …...Super Large Scale Slide 37 Introduction Distributed Data Analytics Thorsten Papenbrock Use cases Weather forecasting Market](https://reader034.fdocuments.in/reader034/viewer/2022042222/5ec95c4d3573ac24be5460ab/html5/thumbnails/58.jpg)
Motivation: “Management”
Related Topics
Slide 58
[Diagram: Distributed Data Management and its related topics: Database Systems, Software Architecture, Data Mining, Parallel Computing]
![Page 59: Distributed Data Management Introduction Thorsten …...Super Large Scale Slide 37 Introduction Distributed Data Analytics Thorsten Papenbrock Use cases Weather forecasting Market](https://reader034.fdocuments.in/reader034/viewer/2022042222/5ec95c4d3573ac24be5460ab/html5/thumbnails/59.jpg)
Motivation: “Management”
Parallel Computing
Slide 59
Touch points
- Distributed data storage concepts
- Distributed programming models, e.g., actor programming and MapReduce
Not in this lecture
- Parallel, non-distributed programming languages, e.g., CUDA or OpenMP
- Core parallel computing concepts, e.g., scheduling or shared memory
- Processor architectures, cache hierarchies, GPU programming, …
More focused lectures
- Parallel Programming (Dr. Tröger)
- Programmierung paralleler und verteilter Systeme (Dr. Feinbube)
![Page 60: Distributed Data Management Introduction Thorsten …...Super Large Scale Slide 37 Introduction Distributed Data Analytics Thorsten Papenbrock Use cases Weather forecasting Market](https://reader034.fdocuments.in/reader034/viewer/2022042222/5ec95c4d3573ac24be5460ab/html5/thumbnails/60.jpg)
Motivation: “Management”
Related Topics
Slide 60
[Diagram: Distributed Data Management and its related topics: Database Systems, Software Architecture, Parallel Computing, Data Mining]
![Page 61: Distributed Data Management Introduction Thorsten …...Super Large Scale Slide 37 Introduction Distributed Data Analytics Thorsten Papenbrock Use cases Weather forecasting Market](https://reader034.fdocuments.in/reader034/viewer/2022042222/5ec95c4d3573ac24be5460ab/html5/thumbnails/61.jpg)
Motivation: “Management”
Data Mining
Slide 61
Touch points
- Data analytics: aggregation queries and basic data mining algorithms
Not in this lecture
- Detailed introduction to machine learning, e.g., neural networks, (un)supervised learning, or Bayesian classification
- Statistics, linear algebra, and most sophisticated mining algorithms
More focused lectures/seminars
- Data Analysis in R (Lippert, Konigorski, Schurmann)
- Selected Topics in Data Analytics (Döllner, Hagedorn)
- Machine Learning for Data Streams (Albrecht)
- Neuro Design (Von Thienen)
![Page 62: Distributed Data Management Introduction Thorsten …...Super Large Scale Slide 37 Introduction Distributed Data Analytics Thorsten Papenbrock Use cases Weather forecasting Market](https://reader034.fdocuments.in/reader034/viewer/2022042222/5ec95c4d3573ac24be5460ab/html5/thumbnails/62.jpg)
Motivation: “Management”
Related Topics
Slide 62
[Diagram: Distributed Data Management and its related topics: Database Systems, Software Architecture, Data Mining, Parallel Computing. Also shown: Big Data Systems (Prof. Rabl)]
![Page 63: Distributed Data Management Introduction Thorsten …...Super Large Scale Slide 37 Introduction Distributed Data Analytics Thorsten Papenbrock Use cases Weather forecasting Market](https://reader034.fdocuments.in/reader034/viewer/2022042222/5ec95c4d3573ac24be5460ab/html5/thumbnails/63.jpg)
Motivation: “Management”
Lecture Goals
Slide 63
Sorting the buzzwords
- NoSQL, Big Data, OLAP, Web-scale, ACID, Sharding, MapReduce, Scale-out, …
Understanding distributed systems
- You know how state-of-the-art distributed systems work.
- You know core technologies and techniques to solve distributed challenges.
- You know the advantages and disadvantages of important systems.
- You know how to handle data in distributed settings.
Exercising in distributed data management and analytics
- You can implement distributed algorithms and applications.
- You can solve problems that arise in distributed setups.
- You can write data-parallel and task-parallel jobs.
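To make the distinction between data-parallel and task-parallel jobs concrete, here is a minimal sketch in Python. It is not taken from the lecture: the function names are hypothetical, and a local thread pool only stands in for the workers of a real cluster. Data parallelism applies the same function to different partitions of the data; task parallelism runs different functions concurrently.

```python
from concurrent.futures import ThreadPoolExecutor


def word_count(chunk):
    """Count the words in one partition (chunk) of the input text."""
    return len(chunk.split())


def data_parallel_word_count(chunks, workers=4):
    """Data parallelism: the SAME function runs on DIFFERENT data partitions."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_counts = list(pool.map(word_count, chunks))
    # Combine the partial results, similar in spirit to a reduce step.
    return sum(partial_counts)


def task_parallel_stats(values, workers=3):
    """Task parallelism: DIFFERENT functions run concurrently on the same data."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {
            "min": pool.submit(min, values),
            "max": pool.submit(max, values),
            "sum": pool.submit(sum, values),
        }
        return {name: future.result() for name, future in futures.items()}


if __name__ == "__main__":
    print(data_parallel_word_count(["a b c", "d e", "f"]))
    print(task_parallel_stats([3, 1, 2]))
```

In a distributed setting the chunks would live on different nodes and the partial results would travel over the network, which is where the challenges named in the goals above come in.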
![Page 64: Distributed Data Management Introduction Thorsten …...Super Large Scale Slide 37 Introduction Distributed Data Analytics Thorsten Papenbrock Use cases Weather forecasting Market](https://reader034.fdocuments.in/reader034/viewer/2022042222/5ec95c4d3573ac24be5460ab/html5/thumbnails/64.jpg)
“Dark Magic”
With distributed computing we can utilize incredible amounts of compute power!
At the cost of harder programming (e.g., fault tolerance, testing, and protocols)
At the cost of additional energy (e.g., communication and redundancy)
Efficient, fault-resistant code matters all the more, because inefficiency and failures scale, too!
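A small back-of-the-envelope sketch of why failures scale. It assumes, purely for illustration, that every node fails independently with the same probability during a job; the 1% figure and the function name are hypothetical and not from the slides. Under that assumption, the chance that at least one of n nodes fails is 1 - (1 - p)^n.

```python
def prob_any_failure(p_node, n_nodes):
    """Probability that at least one of n_nodes fails,
    assuming independent failures with probability p_node each."""
    return 1.0 - (1.0 - p_node) ** n_nodes


if __name__ == "__main__":
    # Hypothetical per-node failure probability of 1% per job.
    for n in (1, 10, 100, 1000):
        print(n, round(prob_any_failure(0.01, n), 3))
```

With these assumed numbers, a failure that is rare on a single machine becomes the expected case on a large cluster, which is why fault handling has to be part of the program rather than an afterthought.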
![Page 65: Distributed Data Management Introduction Thorsten …...Super Large Scale Slide 37 Introduction Distributed Data Analytics Thorsten Papenbrock Use cases Weather forecasting Market](https://reader034.fdocuments.in/reader034/viewer/2022042222/5ec95c4d3573ac24be5460ab/html5/thumbnails/65.jpg)
“Dark Magic”
“Around 10% of the world’s total electricity consumption is being used by the internet.”
Swedish KTH, https://www.insidescandinavianbusiness.com/article.php?id=356
https://www.sciencedirect.com/science/article/pii/S2214629618301051
“The Internet’s data centers alone may already have the same CO2 footprint as global air travel.”
Global e-Sustainability Initiative, https://internethealthreport.org/2018/the-internet-uses-more-electricity-than/
“Data centres […] consume about 3% of the global electricity supply […] accounting for about 2% of total greenhouse gas emissions” in 2016.
Independent, https://www.independent.co.uk/environment/global-warming-data-centres-to-consume-three-times-as-much-energy-in-next-decade-experts-warn-a6830086.html
https://www.nature.com/articles/d41586-018-06610-y
![Page 66: Distributed Data Management Introduction Thorsten …...Super Large Scale Slide 37 Introduction Distributed Data Analytics Thorsten Papenbrock Use cases Weather forecasting Market](https://reader034.fdocuments.in/reader034/viewer/2022042222/5ec95c4d3573ac24be5460ab/html5/thumbnails/66.jpg)
Distributed Data Analytics
Introduction Thorsten Papenbrock
G-3.1.09, Campus III
Hasso Plattner Institut