E2E CLOUDS BIG DATA AND BUSINESS Seif Haridi KTH ‐ SICS

Computer Vision

Speech Recognition

Machine Translation

Deep Learning

Climbing the Mountain of Machine Learning and AI

DEEP LEARNING IS THE NEW STEAM ENGINE

• 1765: Water Pump
• 1819: Steamship
• 1825: Locomotive
• 1852: Airship

AUTOMATED IMAGE CAPTIONING

[Image captioning, Vinyals et al. 2015]

CONV-NETS FOR MUSIC RECOMMENDATION

[Recommending Music on Spotify with Deep Learning. Sander Dieleman]

CONV‐NETS FOR ART

DeepDream: reddit.com/r/deepdream

NeuralStyle: Gatys et al. 2015; deepart.io, Prisma, etc.

DEEP RL FOR PLAYING GAMES

AlphaGo, Silver et al. 2016

Atari game playing, Mnih et al. 2013

SELF‐DRIVING CARS

BIGGER DATA MEANS BETTER DNN MODELS

[Figure: performance (vertical axis) against amount of labelled data (horizontal axis), with curves for Traditional AI, Small DNN, and Large DNN]

AI HIERARCHY OF NEEDS

[Adapted from https://hackernoon.com/the‐ai‐hierarchy‐of‐needs‐18f111fcc007?gi=7e13a696e469 ]

Pyramid layers (top to bottom):

• DDL (Distributed Deep Learning)
• Deep Learning, RL, Automated ML
• A/B Testing, Experimentation, ML
• B.I. Analytics, Metrics, Aggregates, Features, Training/Test Data
• Reliable Data Pipelines, ETL, Unstructured and Structured Data Storage, Real-Time Data Ingestion

AI HIERARCHY OF NEEDS

[The same pyramid as on the previous slide, annotated with the labels Analytics and Prediction]

LARGE‐SCALE DATA ANALYTICS AND MACHINE LEARNING AT GOOGLE

[Hidden Technical Debt in Machine Learning Systems, Sculley et al., NIPS 2015]

SSF E2E‐CLOUDS: THE PROJECT

• September 2011‐2017
• Focus
  • Data Intensive Cloud Systems
  • Distributed Algorithms
  • Distributed Storage

E2E‐CLOUDS: THE BIG DATA ECOSYSTEM

E2E‐CLOUDS: THE HOPS BIG DATA PLATFORM

[Adapted from https://hackernoon.com/the‐ai‐hierarchy‐of‐needs‐18f111fcc007?gi=7e13a696e469 ]

[The same AI hierarchy of needs pyramid, shown alongside the Apache Flink and Hops logos]

E2E‐CLOUDS HOPSWORKS HIERARCHY OF NEEDS

[Architecture diagram. Labels, in extraction order:]

• Develop, Train, Test, Deploy
• MySQL Cluster, Hive, InfluxDB, ElasticSearch, Kafka
• Projects, Datasets, Users
• HopsFS / YARN
• Spark, Flink, TensorFlow
• Jupyter, Zeppelin
• Jobs, Kibana, Grafana
• REST API
• Hopsworks

E2E‐CLOUDS HOPSWORKS ABSTRACTIONS

• A project is a grouping of users and data sources/sinks
• Sensitive data is isolated by project
• Administration is self‐service, like GitHub
• Data sharing works like Dropbox
• Data: HDFS subtrees, Kafka topics
• CPU/storage quotas for projects in HOPS YARN
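The project abstraction listed above can be pictured as a small data structure. The following is a minimal Python sketch, not Hopsworks code: the class name `Project`, its fields, and the `can_read`, `share`, and `charge_storage` helpers are hypothetical names chosen only to illustrate isolation, sharing, and quotas.

```python
from dataclasses import dataclass, field


@dataclass
class Project:
    """Hypothetical model: a project groups users, data sources/sinks, and quotas."""
    name: str
    members: set = field(default_factory=set)
    hdfs_subtrees: set = field(default_factory=set)   # e.g. "/Projects/genomics"
    kafka_topics: set = field(default_factory=set)
    storage_quota_gb: float = 0.0
    storage_used_gb: float = 0.0

    def can_read(self, user: str, path: str) -> bool:
        """Isolation: only members may read, and only inside the project's subtrees."""
        in_subtree = any(
            path == s or path.startswith(s.rstrip("/") + "/")
            for s in self.hdfs_subtrees
        )
        return user in self.members and in_subtree

    def share(self, subtree: str, other: "Project") -> None:
        """Sharing: expose one of this project's subtrees to another project."""
        if subtree not in self.hdfs_subtrees:
            raise KeyError(f"{subtree!r} does not belong to project {self.name!r}")
        other.hdfs_subtrees.add(subtree)

    def charge_storage(self, gigabytes: float) -> None:
        """Quota: reject writes that would exceed the project's storage quota."""
        if self.storage_used_gb + gigabytes > self.storage_quota_gb:
            raise RuntimeError(f"project {self.name!r} would exceed its storage quota")
        self.storage_used_gb += gigabytes
```

In this sketch a user outside `members` is refused even for a path that exists, which is the sense in which sensitive data stays inside its project until it is explicitly shared.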

E2E‐CLOUDS APACHE FLINK

“A state of the art stream compute engine”

[Flink stack diagram. Labels: Dataflow Runtime, Cluster Metrics, DataStream, DataSet, Core, Runner, Setup, SQL, Table, CEP, Graphs, ML, Backend, Libraries]

Core Team

• Paris Carbone (KTH)
• Gyula Fóra (SICS/King)
• Theodore Vasiloudis (SICS/KTH)
• Vasiliki Kalavri (KTH/ETH)
• Marius Melzer (SICS/TU Dresden)

Contributions

● State Management & Fault Tolerance
● DataStream API
● Gelly (Graph Programming Model)
● Window Aggregation Sharing

APACHE FLINK: SYSTEM ADOPTION

• 30 billion events daily
• 2 billion events in 10 1Gb machines
• Solution of choice for streaming BI
• “Blink”: Flink on 1000+ nodes
• Stream Processing as a Service project

STREAM PROCESSING

• A stream processing architecture naturally implements what people have been trying to do “in disguise” for many years… and much more

Paradigm shift:

• Data + lots of queries → answers
• Query + lots of data → answers

STREAM COMPUTE LANDSCAPE

Proprietary:

• Google Cloud Dataflow
• IBM Streams
• Microsoft Azure

Open Source:

• Flink
• Storm
• Kafka Streams
• Spark
• Beam

Data Stream Processor

http://edge.alluremedia.com.au/m/l/2014/10/CoolingPipes.jpg

Window Word Count (Apache Flink)
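The code shown on this slide did not survive extraction. As a stand-in, here is a minimal plain-Python sketch of a tumbling-window word count. It does not use the Flink API; the function name `window_word_count`, the `(timestamp, line)` event shape, and the assumption that events arrive in timestamp order are all illustrative choices.

```python
from collections import Counter


def window_word_count(events, window_size):
    """Yield (window_start, word_counts) for each tumbling window.

    events: iterable of (timestamp_seconds, line) pairs, assumed to be in
    timestamp order. A window is emitted as soon as an event from a later
    window arrives, so results flow out while the input is still streaming.
    """
    current_start = None
    counts = Counter()
    for ts, line in events:
        start = ts - (ts % window_size)          # window this event belongs to
        if current_start is not None and start != current_start:
            yield current_start, dict(counts)    # previous window is complete
            counts = Counter()
        current_start = start
        counts.update(line.lower().split())
    if current_start is not None:
        yield current_start, dict(counts)        # flush the last open window


events = [(0, "to be or not"), (3, "to be"), (5, "be quick"), (9, "quick quick")]
for window_start, word_counts in window_word_count(events, window_size=5):
    print(window_start, word_counts)
```

A real stream processor additionally handles out-of-order events, parallel partitions, and fault tolerance, none of which this sketch attempts.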

DECLARATIVE DATA STREAMING

DStream, DataStream, PCollection…

• Direct access to the execution graph / topology

• Dataflow Programming

• Transformations, Stream Windows

• Meta‐programming

SSF E2E‐CLOUDS: RESEARCH IMPACT

• September 2011‐2017
• Focus
  • Data Intensive Cloud Systems
  • Distributed Algorithms
  • Distributed Storage

PIPELINED SNAPSHOTS

• Snapshot: a correct state of all processes of a distributed application.

• We showed how to acquire snapshots without affecting the performance of a system.

• Paris Carbone, Stephan Ewen, Gyula Fóra, Seif Haridi, Stefan Richter, Kostas Tzoumas: State Management in Apache Flink®: Consistent Stateful Distributed Stream Processing. PVLDB 10(12): 1718-1729 (2017)
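One way to picture snapshots taken without stopping the stream is with markers ("barriers") that travel through the pipeline alongside the data. The following is a simplified Python sketch of that idea, not the algorithm as implemented in Flink: it assumes a linear pipeline of single-input operators (so no barrier alignment is needed), and the names `Barrier`, `RunningSum`, and `run_pipeline` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Barrier:
    """Marker injected into the stream; records before it belong to this snapshot."""
    snapshot_id: int


class RunningSum:
    """Stateful operator: emits the running sum of the numbers it receives."""

    def __init__(self):
        self.total = 0
        self.snapshots = {}   # snapshot_id -> state captured when the barrier passed

    def process(self, item):
        if isinstance(item, Barrier):
            self.snapshots[item.snapshot_id] = self.total   # capture local state
            return item                                     # forward barrier downstream
        self.total += item
        return self.total

    def restore(self, snapshot_id):
        """Roll local state back to what it was when the barrier passed."""
        self.total = self.snapshots[snapshot_id]


def run_pipeline(stream, operators):
    """Push every item through the operators in order; the stream is never paused."""
    for item in stream:
        for op in operators:
            item = op.process(item)


first, second = RunningSum(), RunningSum()
run_pipeline([1, 2, Barrier(1), 3, Barrier(2), 4], [first, second])
print(first.snapshots, second.snapshots)

# Recovery: restore both operators to snapshot 1, then replay the records after it.
first.restore(1)
second.restore(1)
run_pipeline([3, 4], [first, second])
print(first.total, second.total)
```

Because each operator records its state exactly when the barrier passes, the per-operator states of one snapshot all reflect the same prefix of the input, and replaying the records after the barrier reproduces the pre-failure state.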

SNAPSHOT USAGES

Snapshots are used for:

1. End‐to‐End Guarantees
2. Reconfiguration
3. Version Control
4. Isolation

ADOPTION OF SNAPSHOTS

• Consistent Snapshots adopted for Storm (by Hortonworks): https://community.hortonworks.com/articles/14171/windowing‐and‐state‐checkpointing‐in‐apache‐storm.html
• Consistent Snapshots adopted for Spark: https://issues.apache.org/jira/browse/SPARK‐20928
• Consistent Snapshots adopted for Beam (by Google): https://goo.gl/JsmpDP

FLINK AT KING

Gyula Fóra (SICS/King)

HOPS‐FS

1. Scale-out Metadata
   - Metadata in an in-memory distributed database
   - Multiple stateless NameNodes

2. Remove the Global Namespace Lock
   - Supports multiple concurrent read and write operations

Salman Niazi, Mahmoud Ismail, Seif Haridi, Jim Dowling, Steffen Grohsschmiedt, Mikael Ronström: HopsFS: Scaling Hierarchical File System Metadata Using NewSQL Databases. USENIX FAST 2017: 89‐104
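The two design points on this slide can be illustrated with a toy model. The sketch below is an assumption-laden stand-in, not the HopsFS schema or locking protocol: `MetadataStore` plays the role of the distributed database (rows hashed over in-process shards, one lock per row rather than one for the whole namespace), and `NameNode` keeps no metadata of its own, so any instance can serve any request. All class and method names are hypothetical.

```python
import threading
from collections import defaultdict


class MetadataStore:
    """Stand-in for the in-memory distributed database holding file metadata."""

    def __init__(self, num_shards=4):
        self.shards = [dict() for _ in range(num_shards)]
        self._row_locks = defaultdict(threading.Lock)
        self._locks_guard = threading.Lock()

    def _shard(self, path):
        return self.shards[hash(path) % len(self.shards)]

    def _lock_for(self, path):
        with self._locks_guard:        # protects the lock table only, not the metadata
            return self._row_locks[path]

    def update(self, path, fn):
        """Apply fn to one row while holding only that row's lock."""
        with self._lock_for(path):
            shard = self._shard(path)
            shard[path] = fn(shard.get(path))
            return shard[path]

    def read(self, path):
        with self._lock_for(path):
            return self._shard(path).get(path)


class NameNode:
    """Stateless: every request goes to the shared store."""

    def __init__(self, store):
        self.store = store

    def append_block(self, path, block_id):
        return self.store.update(path, lambda blocks: (blocks or []) + [block_id])

    def get_blocks(self, path):
        return self.store.read(path) or []


store = MetadataStore()
nn1, nn2 = NameNode(store), NameNode(store)
nn1.append_block("/data/File3", "Blk1")
nn2.append_block("/data/File3", "Blk2")   # a different NameNode sees the same metadata
print(nn1.get_blocks("/data/File3"))
```

Operations on different paths take different locks, so they can proceed concurrently; only operations on the same row serialize.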

HADOOP DFS

[Diagram: HDFS CLIENT, NAMENODE, HDFS DATANODE]

The client asks the NameNode “Where can I save the file?” (File1) and receives DataNodes addresses.

NameNode block map:

File2: Blk1 on DN1, Blk2 on DN4
File3: Blk1 on DN1, Blk2 on DN2, Blk3 on DN3
File4: Blk1 on DN100
File5: Blk1 on DN4, Blk2 on DN2, Blk3 on DN9
…
FileN: Blk1 on DN2, Blk2 on DN8
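The exchange on this slide, in which the client asks where to save a file and the NameNode answers with DataNode addresses while recording the file-to-block map, can be sketched as follows. This is a toy model, not HDFS code: the class `SimpleNameNode`, its methods, and the round-robin placement are assumptions made for illustration.

```python
from itertools import cycle


class SimpleNameNode:
    """Toy NameNode: keeps the whole file -> block -> DataNode map in one process."""

    def __init__(self, datanodes):
        self._next_datanode = cycle(datanodes)   # round-robin placement (an assumption)
        self.block_map = {}                      # file name -> list of (block, datanode)

    def allocate(self, filename, num_blocks):
        """Answer "Where can I save the file?" with one DataNode address per block."""
        placement = [
            (f"Blk{i}", next(self._next_datanode))
            for i in range(1, num_blocks + 1)
        ]
        self.block_map[filename] = placement
        return placement

    def locate(self, filename):
        """Return the recorded block locations, as a reading client would request."""
        return self.block_map[filename]


namenode = SimpleNameNode(["DN1", "DN2", "DN3"])
print(namenode.allocate("File1", num_blocks=2))
print(namenode.locate("File1"))
```

Every allocation and lookup passes through this single in-memory map, which is the scaling limit a single metadata server imposes.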

HOPS‐FS

[Diagram: Distributed Database, drawn as four boxes, each labelled “File Blocks Mappings”, “File3 Metadata”, and “File4 Metadata”]

HOPS‐FS: NEXT GENERATION HDFS*

• Faster: 16x Throughput
• Bigger: 37x Number of files
• Scale Challenge Winner (2017)
• Small Files**

*https://www.usenix.org/conference/fast17/technical-sessions/presentation/niazi
**https://eurosys2017.github.io/assets/data/posters/poster09-Niazi.pdf

E2E‐CLOUDS RESEARCH INFRASTRUCTURE

SICS ICE

• A national large‐scale datacenter for research and innovation in the area of Big Data and clouds

• HOPS has been running for over a year as a service for big data analytics

• Logical Clocks: a spin‐off, the first complete European Big Data platform

HOPS HEADS

Active: Jim Dowling, Seif Haridi, Gautier Berthou, Salman Niazi, Mahmoud Ismail, Theofilos Kakantousis, Ermias Gebremeskel, Antonios Kouzoupis, Alex Ormenisan, Fabio Buso, Robin Andersson, August Bonds, Filotas Siskos, Mahmoud Hamed.

Alumni: Roberto Bampi, ArunaKumari Yedurupaka, Tobias Johansson, Fanti Machmount Al Samisti, Braulio Grana, Adam Alpire, Zahin Azher Rashid, Vasileios Giannokostas, Johan Svedlund Nordström, Rizvi Hasan, Paul Mälzer, Bram Leenders, Juan Roca, Misganu Dessalegn, K “Sri” Srijeyanthan, Jude D’Souza, Alberto Lorente, Andre Moré, Ali Gholami, Davis Jaunzems, Stig Viaene, Hooman Peiro, Evangelos Savvidis, Steffen Grohsschmiedt, Qi Qi, Gayana Chandrasekara, Nikolaos Stanogias, Daniel Bali, Ioannis Kerkinos, Peter Buechler, Pushparaj Motamari, Hamid Afzali, Wasif Malik, Lalith Suresh, Mariano Valles, Ying Lieu.

Please Follow Us! @hopshadoop

Please Star Us! http://github.com/hopshadoop/hopsworks

SSF E2E‐CLOUDS: THE PEOPLE

• Nine Ph.D. Theses
• Large number of publications

Vasiliki Kalavri: PostDoc at ETH Zurich

Cosmin Arad: Senior Engineer at Google, leading auto‐scaling of Google Dataflow

Fatemeh Rahimian: Machine Learning Scientist, Oxford Martin Programme on Deep Medicine

Ali Ghodsi: co‐founder and CEO of Databricks (Spark, Mesos), Berkeley Professor

Gyula Fóra: Senior Engineer at King (Flink pipeline), Apache Flink Committer

Salman Niazi: Ph.D. student at KTH, HOPS‐FS, SCALE 2017

Mahmoud Ismail: Ph.D. student at KTH, HOPS‐FS, SCALE 2017

Jim Dowling: Assoc. Professor at KTH, HOPS system, Logical Clocks

Paris Carbone: Ph.D. student at KTH, Apache Flink Committer

Amir Payberah: Ph.D. Researcher, Oxford Univ.