E2E-CLOUDS BIG DATA AND BUSINESS - Vinnova…
Computer Vision
Speech Recognition
Machine Translation
Deep Learning
Climbing the Mountain of Machine Learning and AI
DEEP LEARNING IS THE NEW STEAM ENGINE
• 1765 Water Pump
• 1819 Steamship
• 1825 Locomotive
• 1852 Airship
CONV-NETS FOR MUSIC RECOMMENDATION
[Recommending Music on Spotify with Deep Learning. Sander Dieleman]
CONV‐NETS FOR ART
• DeepDream: reddit.com/r/deepdream
• NeuralStyle: Gatys et al. 2015
• deepart.io, Prisma, etc.
BIGGER DATA MEANS BETTER DNN MODELS
[Figure: Performance (y-axis) against Amount of Labelled Data (x-axis), with curves for Traditional AI, Small DNN, and Large DNN]
AI HIERARCHY OF NEEDS
[Adapted from https://hackernoon.com/the-ai-hierarchy-of-needs-18f111fcc007?gi=7e13a696e469]
Pyramid layers, top to bottom:
• DDL (Distributed Deep Learning)
• Deep Learning, RL, Automated ML
• A/B Testing, Experimentation, ML
• B.I. Analytics, Metrics, Aggregates, Features, Training/Test Data
• Reliable Data Pipelines, ETL, Unstructured and Structured Data Storage, Real-Time Data Ingestion
Analytics
Prediction
LARGE‐SCALE DATA ANALYTICS AND MACHINE LEARNING AT GOOGLE
[Hidden Technical Debt in Machine Learning Systems, Sculley et al., NIPS 2015]
SSF E2E-CLOUDS: THE PROJECT
• September 2011-2017
• Focus
  • Data Intensive Cloud Systems
  • Distributed Algorithms
  • Distributed Storage
E2E-CLOUDS: THE HOPS BIG DATA PLATFORM
E2E-CLOUDS HOPSWORKS HIERARCHY OF NEEDS
[Figure: the AI hierarchy of needs pyramid (DDL at the top, reliable data pipelines and storage at the base), shown with the Apache Flink and Hops logos]
Develop Train Test Deploy
MySQL Cluster
Hive
InfluxDB
ElasticSearch
Kafka
Projects, Datasets, Users
HopsFS / YARN
Spark, Flink, TensorFlow
Jupyter, Zeppelin
Jobs, Kibana, Grafana
REST API
Hopsworks
E2E‐CLOUDS HOPSWORKS ABSTRACTIONS
• A project is a grouping of users and data sources/sinks
• Sensitive data is isolated by project
• Self-service administration, like GitHub
• Data sharing, like Dropbox
• Data: HDFS subtrees, Kafka topics
• CPU/Storage quotas for projects in HOPS YARN
E2E-CLOUDS APACHE FLINK
“A state of the art stream compute engine”
[Figure: Apache Flink stack. Labels: Dataflow Runtime, Core, DataStream, DataSet, SQL, Table, CEP, Graphs, ML, Cluster Metrics, Runner, Setup, Backend, Libraries]
Core Team
• Paris Carbone (KTH)
• Gyula Fóra (SICS/King)
• Theodore Vasiloudis (SICS/KTH)
• Vasiliki Kalavri (KTH/ETH)
• Marius Melzer (SICS/TU Dresden)
Contributions
● State Management & Fault Tolerance
● DataStream API
● Gelly (Graph Programming Model)
● Window Aggregation Sharing
APACHE FLINK: SYSTEM ADOPTION
• 30 billion events daily
• 2 billion events in 10 1Gb machines
• Solution of choice for streaming BI
• “Blink”: Flink on 1000+ nodes
• Stream Processing as a Service project
STREAM PROCESSING
• A stream processing architecture naturally implements what people have been trying to do for many years “in disguise”… and much more
Paradigm shift:
• Data + lots of queries → answers
• Query + lots of data → answers
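The paradigm shift can be illustrated with a small sketch in plain Python; it does not use any stream engine's API. The first function runs a query over data at rest, while the second keeps a fixed query and updates its answer as each event arrives. The names `batch_count` and `StandingCount` are hypothetical and chosen only for this illustration.

```python
from collections import Counter


def batch_count(stored_events):
    """Query over data: scan the stored data each time a question is asked."""
    return Counter(stored_events)


class StandingCount:
    """Data over query: the query stays fixed and events flow through it."""

    def __init__(self):
        self.counts = Counter()

    def on_event(self, event):
        # Update the answer incrementally; past data is never rescanned.
        self.counts[event] += 1
        return self.counts[event]


events = ["click", "view", "click"]

query = StandingCount()
for e in events:
    query.on_event(e)

streaming_answer = dict(query.counts)
batch_answer = dict(batch_count(events))
```

Both approaches give the same counts here; the difference is that the standing query has an up-to-date answer after every event.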
STREAM COMPUTE LANDSCAPE
Proprietary:
• Google Cloud Dataflow
• IBM Streams
• Microsoft Azure
Open Source:
• Flink
• Storm
• Kafka Streams
• Spark
• Beam
Data Stream Processor
http://edge.alluremedia.com.au/m/l/2014/10/CoolingPipes.jpg
Window Word Count (Apache Flink)
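The slide's code is not preserved in this transcript. As a rough idea of what a window word count computes, here is a minimal sketch in plain Python; it is not Flink code. It assumes tumbling (non-overlapping) windows and events given as `(timestamp, text)` pairs; the function name `window_word_count` and the sample data are hypothetical.

```python
from collections import Counter, defaultdict


def window_word_count(events, window_size):
    """Count words per tumbling window.

    events: iterable of (timestamp, text) pairs.
    Returns a dict mapping each window start to a Counter of words.
    """
    windows = defaultdict(Counter)
    for timestamp, text in events:
        # Assign the event to the window that contains its timestamp.
        window_start = (timestamp // window_size) * window_size
        windows[window_start].update(text.lower().split())
    return dict(windows)


events = [(0, "to be or not"), (3, "to be"), (6, "be quick")]
result = window_word_count(events, window_size=5)
```

A stream processor would emit each window's counts as the window closes instead of collecting all windows at the end.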
DECLARATIVE DATA STREAMING
DStream, DataStream, PCollection…
• Direct access to the execution graph / topology
• Dataflow Programming
• Transformations, Stream Windows
• Meta‐programming
SSF E2E-CLOUDS: RESEARCH IMPACT
PIPELINED SNAPSHOTS
• Snapshot: a correct state of all processes of a distributed application.
• We showed how to acquire snapshots without affecting the performance of a system.
• Paris Carbone, Stephan Ewen, Gyula Fóra, Seif Haridi, Stefan Richter, Kostas Tzoumas: State Management in Apache Flink®: Consistent Stateful Distributed Stream Processing. PVLDB 10(12): 1718-1729 (2017)
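One way to picture a pipelined snapshot is a marker that travels through the pipeline together with the records. The sketch below is a toy model in plain Python, not Flink's implementation. It assumes a single chain of operators and a hypothetical `BARRIER` marker, and it leaves out multi-input operators and asynchronous state persistence.

```python
BARRIER = object()  # marker injected into the stream to delimit a snapshot


class CountingOperator:
    """A stateful operator that counts the records it has seen."""

    def __init__(self):
        self.count = 0
        self.snapshots = []

    def process(self, item):
        if item is BARRIER:
            # Copy the state at the barrier, then pass the barrier downstream.
            self.snapshots.append(self.count)
            return BARRIER
        self.count += 1
        return item


def run_pipeline(stream, operators):
    """Push every item through the operators in order (single-input chain)."""
    for item in stream:
        for op in operators:
            item = op.process(item)


first, second = CountingOperator(), CountingOperator()
run_pipeline(["a", "b", BARRIER, "c"], [first, second])
```

Both operators record a state that reflects exactly the records before the marker, and record "c" is still processed without the pipeline stopping.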
SNAPSHOT USAGES
Snapshots:
1. End-to-End Guarantees
2. Reconfiguration
3. Version Control
4. Isolation
ADOPTION OF SNAPSHOTS
• Consistent Snapshots adopted for Storm (by Hortonworks)
  https://community.hortonworks.com/articles/14171/windowing-and-state-checkpointing-in-apache-storm.html
• Consistent Snapshots adopted for Spark
  https://issues.apache.org/jira/browse/SPARK-20928
HOPS‐FS
1. Scale-out Metadata
- Metadata in an in-memory distributed database
- Multiple stateless NameNodes
2. Remove the Global Namespace Lock
- Supports multiple concurrent read and write operations
Salman Niazi, Mahmoud Ismail, Seif Haridi, Jim Dowling, Steffen Grohsschmiedt, Mikael Ronström: HopsFS: Scaling Hierarchical File System Metadata Using NewSQL Databases. USENIX FAST 2017: 89-104
HADOOP DFS
[Figure: an HDFS CLIENT asks the NAMENODE "Where can I save the file?" for File1; the NAMENODE answers with DataNodes Addresses; HDFS DATANODEs are shown alongside]
NameNode file-to-block mappings:
File2    Blk1 DN1, Blk2 DN4
File3    Blk1 DN1, Blk2 DN2, Blk3 DN3
File4    Blk1 DN100
File5    Blk1 DN4, Blk2 DN2, Blk3 DN9
…
FileN    Blk1 DN2, Blk2 DN8
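The mapping kept by the single NameNode can be modelled as an in-memory dictionary. This is a minimal sketch in plain Python, not HDFS code: the `NameNode` class and its `allocate` and `locate` methods are hypothetical, and for brevity the caller supplies the DataNode placement that a real NameNode would choose itself.

```python
class NameNode:
    """Toy NameNode: all file-to-block mappings live in one process's memory."""

    def __init__(self):
        self.block_map = {}

    def allocate(self, filename, datanodes):
        """Answer "Where can I save the file?" by recording one block per DataNode."""
        placements = [(f"Blk{i}", dn) for i, dn in enumerate(datanodes, start=1)]
        self.block_map[filename] = placements
        return placements

    def locate(self, filename):
        """Return the (block, DataNode) pairs recorded for a file."""
        return self.block_map[filename]


namenode = NameNode()
namenode.allocate("File2", ["DN1", "DN4"])
namenode.allocate("File4", ["DN100"])
file2_blocks = namenode.locate("File2")
```

Because every mapping sits in one process, the number of files is bounded by that process's memory, which is the limit the HopsFS slides address.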
HOPS‐FS
[Figure: Distributed Database, drawn as four boxes, each labelled File Blocks Mappings, File3 Metadata, File4 Metadata]
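The two ideas of metadata held in a distributed database and multiple stateless NameNodes can be sketched in plain Python as follows. This is a toy model, not the HopsFS implementation. The class names are hypothetical, and hash partitioning by file name is an assumption made for the example; the slides do not say how the metadata is partitioned.

```python
import zlib


class MetadataStore:
    """Stand-in for the distributed database; rows are hash-partitioned by file name."""

    def __init__(self, num_partitions):
        self.partitions = [{} for _ in range(num_partitions)]

    def _partition(self, filename):
        # crc32 is stable across runs, unlike Python's built-in hash() for strings.
        index = zlib.crc32(filename.encode()) % len(self.partitions)
        return self.partitions[index]

    def put(self, filename, blocks):
        self._partition(filename)[filename] = blocks

    def get(self, filename):
        return self._partition(filename)[filename]


class StatelessNameNode:
    """Holds no metadata of its own; every operation goes to the shared store."""

    def __init__(self, store):
        self.store = store

    def create(self, filename, blocks):
        self.store.put(filename, blocks)

    def locate(self, filename):
        return self.store.get(filename)


store = MetadataStore(num_partitions=4)
nn1, nn2 = StatelessNameNode(store), StatelessNameNode(store)
nn1.create("File3", [("Blk1", "DN1"), ("Blk2", "DN2"), ("Blk3", "DN3")])
file3_blocks = nn2.locate("File3")
```

A file created through one NameNode is visible through the other because neither keeps local state, so NameNodes can be added without copying the namespace.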
HOPS‐FS: NEXT GENERATION HDFS*
• Faster: 16x Throughput
• Bigger: 37x Number of files
• Scale Challenge Winner (2017)
• Small Files**
*https://www.usenix.org/conference/fast17/technical-sessions/presentation/niazi
**https://eurosys2017.github.io/assets/data/posters/poster09-Niazi.pdf
E2E‐CLOUDS RESEARCH INFRASTRUCTURE
SICS ICE
• A national large‐scale datacenter for research and innovation in the area of Big data and clouds
• HOPS has been running for over a year as a service for big data analytics
• Logical Clocks, a spin-off: first complete European Big Data platform
HOPS HEADS
Active: Jim Dowling, Seif Haridi, Gautier Berthou, Salman Niazi, Mahmoud Ismail, Theofilos Kakantousis, Ermias Gebremeskel, Antonios Kouzoupis, Alex Ormenisan, Fabio Buso, Robin Andersson, August Bonds, Filotas Siskos, Mahmoud Hamed.
Alumni: Roberto Bampi, ArunaKumari Yedurupaka, Tobias Johansson, Fanti Machmount Al Samisti, Braulio Grana, Adam Alpire, Zahin Azher Rashid, Vasileios Giannokostas, Johan Svedlund Nordström, Rizvi Hasan, Paul Mälzer, Bram Leenders, Juan Roca, Misganu Dessalegn, K “Sri” Srijeyanthan, Jude D’Souza, Alberto Lorente, Andre Moré, Ali Gholami, Davis Jaunzems, Stig Viaene, Hooman Peiro, Evangelos Savvidis, Steffen Grohsschmiedt, Qi Qi, Gayana Chandrasekara, Nikolaos Stanogias, Daniel Bali, Ioannis Kerkinos, Peter Buechler, Pushparaj Motamari, Hamid Afzali, Wasif Malik, Lalith Suresh, Mariano Valles, Ying Lieu.
Please Follow Us! @hopshadoop
Please Star Us! http://github.com/hopshadoop/hopsworks
SSF E2E-CLOUDS: THE PEOPLE
• Vasiliki Kalavri: PostDoc at ETH Zurich
• Cosmin Arad: Senior Engineer at Google, leading auto-scaling of Google Dataflow
• Fatemeh Rahimian: Machine Learning Scientist, Oxford Martin Programme on Deep Medicine
• Ali Ghodsi: co-founder and CEO of Databricks (Spark, Mesos), Berkeley Professor
• Gyula Fóra: Senior Engineer at King (Flink pipeline), Apache Flink Committer