
CS 644: Introduction to Big Data

Chapter 8. Enabling Big-data Scientific Workflows in High-performance Networks


Chase Wu

New Jersey Institute of Technology

Outline

• Introduction
• Challenges & Objectives
• A Three-layer Architecture Solution
  - Enabling Technologies: networking and computing
• Networking for Big Data
  - Software-defined Networking (SDN)
  - High-performance Networking (HPN)
• Computing for Big Data
  - Workflow Management and Optimization
• Simulation/Experimental Results
• Conclusion


Introduction
• Supercomputing for big-data science


Application domains: astrophysics, computational biology, climate research, flow dynamics, computational materials, fusion simulation, neutron sciences, nanoscience

Terascale Supernova Initiative (TSI)


[Figure: visualization channel, visualization control channel, and computation steering channel between a supercomputer/cluster and a client]

• Collaborative project
  - Supernova explosion
• TSI simulation
  - 1 terabyte a day with a small portion of parameters
  - From TSI to PSI
• Transfer to remote sites
  - Interactive distributed visualization
  - Collaborative data analysis
  - Computation monitoring
  - Computation steering


Challenges for Extreme-scale Scientific Applications
• A typical way of conducting research
  - Run simulation code on a supercomputer in batch mode
    Ø One day: 1 terabyte of datasets
  - Move datasets to HPSS
    Ø About 8 hours
  - Transfer datasets to remote sites over the Internet
    Ø TCP-based transfer tools: up to one week
  - Filter out data of interest
  - Partition the dataset for parallel processing
  - Extract geometry data
  - Generate images on a rendering engine
  - Display results on a desktop, laptop, powerwall, etc.
• Visualization: several hours or days
• Start over if any parameter values are not set appropriately!

Challenges in Modern Sciences

• BIG DATA: from T to P, to E, to Z, to Y, and beyond…
  - Simulation
    • Astrophysics, climate modeling, combustion research, etc.
  - Experimental
    • Spallation Neutron Source, Large Hadron Collider, etc.
  - Observational
    • Large-scale sensor networks, astronomical image data (Dark Energy Camera), etc.

No matter which type of data is considered, we need an end-to-end workflow solution for data transfer, processing, and analysis!

Big-data Scientific Workflows
• Require massively distributed resources
  - Hardware
    • Computing facilities, storage systems, special rendering engines, display devices (tiled display, powerwall, etc.), network infrastructure, etc.
  - Software
    • Domain-specific data analytics/processing tools, programs, etc.
  - Data type
    • Real-time, archival
• Feature different complexities
  - Simple case: linear pipeline (a special case of DAG)
  - Complex case: DAG-structured graph
• Support different application types
  - Interactive: minimize total end-to-end delay for fast response
  - Streaming: maximize frame rate to achieve smooth data flow

Ultimate Goals
- Support distributed workflows in heterogeneous environments
- Optimize workflow performance to meet various user requirements
  Ø Delay, throughput, reliability, etc.
  Ø Remote visualization, online computational monitoring and steering, etc.
- Make the best use of computing and networking resources

Solution: A Three-layer Architecture


Enabling Technologies


• Three layers
  - Top: abstract scientific workflow
  - Middle: virtual overlay network (grid, cloud)
  - Bottom: physical high-performance network
• Top and bottom layers meet at the middle layer
  - From bottom to middle: resource abstraction
    • Bandwidth scheduling
    • Performance modeling and prediction
  - From top to middle: workflow mapping
    • Optimization: where to execute modules?
• Workflow execution
  - Actual data transfer: transport control
  - Actual module running: job scheduling

Networking Requirements
• Provision dedicated channels to meet different transport objectives
  - High bandwidths
    • Multiples of 10 Gbps, toward terabit networking
    • Support bulk data transfers
  - Stable bandwidths
    • 100s of Mbps
    • Support interactive control operations
• Why not the Internet?
  - Only the backbone has high bandwidths (the last-mile problem)
  - Packet-level resource sharing
  - Best-effort IP routing
  - TCP: hard to sustain 10s of Gbps or to stabilize

An Overview of TCP/IP Stack

Software-Defined Networking

• The Concept of Virtualization
• Virtualization of Computing
• Virtualization of Networking
• Software-Defined Network
• Possible Directions


Concept of Virtualization

• Decoupling HW/SW
• Abstraction and layering
• Using, demanding, but not owning or configuring
• Resource pool: flexible to slice, resize, combine, and distribute
• A degree of automation by software


Benefits of Virtualization

• An analogy: owning a huge house
  - Real estate, immovable property
  - Does not generate cash or income
• How to gain more profit?
  - Divide this huge house into suites, and RENT them to people!
  - Renting suites: using but not owning
  - Transform a static investment into a cash generator!!!


Virtualization of Computing
• Partitioning one physical machine
  - Virtual instances
    • Running concurrently, sharing resources
• Hypervisor: Virtual Machine Monitor (VMM)
  - A software layer that presents an abstraction of the physical resources
  - The key factor of virtualization


Networks are Hard to Manage
• Operating a network is expensive
  - More than half the cost of a network
  - Yet, operator error causes most outages
• Buggy software in the equipment
  - Routers with 20+ million lines of code
  - Cascading failures, vulnerabilities, etc.
• The network is "in the way"
  - Especially a problem in data centers
  - … and home networks


Traditional Computer Networks
• Management plane: collect measurements and configure the equipment

Software-Defined Networking (SDN)
• Logically-centralized control: smart, slow
• Switches: dumb, fast
• API to the data plane (e.g., OpenFlow)

[Figure: an OpenFlow-based control architecture. Networking applications such as bandwidth-on-demand, dynamic optical bypass, unified recovery, traffic engineering, and application-aware QoS run on a network operating system with a unified control plane; a virtualization (slicing) plane and switch abstraction sit above the underlying data-plane switching, which may include packet, wavelength, time-slot, multi-layer, and packet & circuit switches, all driven through the OpenFlow protocol.]

Architecture

[Figure: SDN architecture around a network OS such as Onix. The control plane/applications operate on a logical forwarding plane; a network hypervisor maps logical states and abstractions to real states; the network OS provides an API over a network information base and, as a distributed system, distributes control commands and configures the physical switches via OpenFlow.]

Switch Forwarding Pipeline

• As packets/flows traverse the network, they move in both the logical and the physical forwarding planes, carrying their logical context along.

Data Plane: Simple Packet Handling
• Simple packet-handling rules
  - Pattern: match packet header bits
  - Actions: drop, forward, modify, send to controller
  - Priority: disambiguate overlapping patterns
  - Counters: # bytes and # packets
• Example rules (simulated in the sketch below):
  1. src=1.2.*.*, dest=3.4.5.* → drop
  2. src=*.*.*.*, dest=3.4.*.* → forward(2)
  3. src=10.1.2.3, dest=*.*.*.* → send to controller
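To make the match-action semantics concrete, here is a minimal, self-contained Python sketch of such a rule table; the FlowTable class, its method names, and the octet-wise wildcard format are invented for illustration and are not an actual OpenFlow API.

```python
def wildcard_match(pattern, addr):
    """Octet-by-octet wildcard match, e.g. '3.4.*.*' matches '3.4.7.8'."""
    return all(p == "*" or p == a
               for p, a in zip(pattern.split("."), addr.split(".")))

class FlowTable:
    """Toy OpenFlow-style rule table: pattern -> action, with priorities and counters."""
    def __init__(self):
        self.rules = []  # each rule: [priority, src_pat, dst_pat, action, pkt_count]

    def install(self, priority, src_pat, dst_pat, action):
        self.rules.append([priority, src_pat, dst_pat, action, 0])

    def handle(self, src_ip, dst_ip):
        # Among all matching rules, the highest priority disambiguates overlaps.
        matches = [r for r in self.rules
                   if wildcard_match(r[1], src_ip) and wildcard_match(r[2], dst_ip)]
        if not matches:
            return "send-to-controller"  # table miss: punt to the controller
        best = max(matches, key=lambda r: r[0])
        best[4] += 1                     # per-rule packet counter
        return best[3]

table = FlowTable()
table.install(3, "1.2.*.*", "3.4.5.*", "drop")
table.install(2, "*.*.*.*", "3.4.*.*", "forward(2)")
table.install(1, "10.1.2.3", "*.*.*.*", "send-to-controller")

print(table.handle("1.2.9.9", "3.4.5.6"))  # 'drop': the more specific rule wins
print(table.handle("7.7.7.7", "3.4.1.1"))  # 'forward(2)'
```

Note how the priority field resolves the overlap between rules 1 and 2 for packets destined to 3.4.5.*.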


Unifies Different Kinds of Boxes
• Router
  - Match: longest destination IP prefix
  - Action: forward out a link
• Switch
  - Match: destination MAC address
  - Action: forward or flood
• Firewall
  - Match: IP addresses and TCP/UDP port numbers
  - Action: permit or deny
• NAT
  - Match: IP address and port
  - Action: rewrite address and port

Controller: Programmability
• Controller applications run on a network OS
• Events from switches: topology changes, traffic statistics, arriving packets
• Commands to switches: (un)install rules, query statistics, send packets (a minimal loop is sketched below)
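The event/command loop can be sketched in a few lines; the PacketIn event type, the BLOCKED policy, and the dict-based flow tables below are all hypothetical stand-ins, not a real controller API.

```python
from dataclasses import dataclass

@dataclass
class PacketIn:
    """Event a switch sends to the controller on a table miss (illustrative)."""
    switch_id: int
    src_ip: str
    dst_ip: str

BLOCKED = {"10.0.0.66"}  # hypothetical access-control policy

def on_packet_in(event, flow_tables):
    """Reactive controller logic: consult the policy, then push a rule down.

    Installing a rule keeps subsequent packets of this flow in the switch's
    fast path, so the (slow) controller only ever sees the first packet."""
    action = "drop" if event.src_ip in BLOCKED else "forward"
    flow_tables[event.switch_id][(event.src_ip, event.dst_ip)] = action
    return action

tables = {1: {}}  # flow table of switch 1: (src, dst) -> action
print(on_packet_in(PacketIn(1, "10.0.0.66", "10.0.0.2"), tables))  # 'drop'
print(tables[1])  # the rule is now installed; no further controller round-trips
```

Because the rule is pushed down after the first packet, later packets of the flow never leave the switch's fast path; the same pattern underlies the dynamic access control example below.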

Example OpenFlow Applications
• Dynamic access control
• Seamless mobility/migration
• Server load balancing
• Network virtualization
• Using multiple wireless access points
• Energy-efficient networking
• Adaptive traffic monitoring
• Denial-of-Service attack detection

See http://www.openflow.org/videos/

Example: Dynamic Access Control
• Inspect the first packet of a connection
• Consult the access control policy
• Install rules to block or route traffic


Seamless Mobility/Migration
• See the host send traffic at its new location
• Modify rules to reroute the traffic

Server Load Balancing
• Pre-install a load-balancing policy
• Split traffic based on source IP (e.g., src=0* vs. src=1*), as sketched below
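A two-rule split on the top bit of the source address can be emulated as follows; the replica names and helper function are invented for illustration.

```python
def top_bit(ip):
    """Top bit of the first octet: 0 for src=0* (0.0.0.0/1), 1 for src=1* (128.0.0.0/1)."""
    return (int(ip.split(".")[0]) >> 7) & 1

# Pre-installed policy: one wildcard rule per half of the source-IP space,
# so the split happens entirely in the data plane with just two rules.
REPLICAS = {0: "server-A", 1: "server-B"}

for src in ["9.1.2.3", "200.1.2.3"]:
    print(src, "->", REPLICAS[top_bit(src)])  # server-A, then server-B
```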

Network Virtualization
• Partition the space of packet headers among multiple controllers (e.g., Controller #1, Controller #2, Controller #3)

OpenFlow in the Wild
• Open Networking Foundation
  - Google, Facebook, Microsoft, Yahoo, Verizon, Deutsche Telekom, and many other companies
• Commercial OpenFlow switches
  - HP, NEC, Quanta, Dell, IBM, Juniper, …
• Network operating systems
  - NOX, Beacon, Floodlight, Nettle, ONIX, POX, Frenetic
• Network deployments
  - Eight campuses, and two research backbone networks
  - Commercial deployments (e.g., Google backbone)

Controller Delay and Overhead
• The controller is much slower than the switch
• Processing packets at the controller leads to delay and overhead
• Need to keep most packets in the "fast path"

Distributed Controller
• For scalability and reliability
• Partition and replicate state across multiple controller applications, each running on its own network OS

High-performance Networks
• Production and testbed networks
  - UltraScience Net
  - ESnet OSCARS
    • Offers MPLS tunnels and VLAN virtual circuits
  - Internet2 ION
    • Offers MPLS tunnels and VLAN virtual circuits
  - UCLP
    • User Controlled Light Paths
  - CHEETAH
    • Circuit-switched High-speed End-to-End Transport Architecture
  - DRAGON
    • Dynamic Resource Allocation via GMPLS Optical Networks

UltraScience Net – In a Nutshell
• Experimental network research testbed

Control Plane
Responsible for:
1. Reserving link bandwidths
2. Setting up end-to-end network paths
3. Releasing resources when tasks are completed

Bandwidth Scheduling


Bandwidth Scheduler
• Central component
  - Computes paths and allocates link bandwidths
• Scheduling in USN
  - Fixed slot (start time, end time, target bandwidth)
    • Extension of Dijkstra's algorithm
  - All slots (duration, target bandwidth)
    • Extension of the Bellman-Ford algorithm
  - Both are solvable by polynomial-time algorithms; the fixed-slot case is sketched below
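The fixed-slot case reduces to a path search over links that can sustain the target bandwidth throughout the slot. Below is a minimal sketch; the graph encoding, the TB-segment format, and the use of plain BFS reachability instead of a full Dijkstra extension are simplifying assumptions.

```python
def feasible(tb_list, ts, te, target_bw):
    """True if the link's residual bandwidth is >= target_bw throughout [ts, te).

    tb_list is a list of time-bandwidth segments (t_start, t_end, bw)."""
    return all(bw >= target_bw
               for (t0, t1, bw) in tb_list
               if t0 < te and t1 > ts)  # only segments overlapping the slot

def fixed_slot_path(graph, src, dst, ts, te, target_bw):
    """BFS over links that can sustain target_bw for the whole slot.

    graph: {node: {neighbor: tb_list}}. A Dijkstra variant would additionally
    optimize a path metric; plain reachability is enough to show the pruning."""
    frontier, parent = [src], {src: None}
    while frontier:
        u = frontier.pop(0)
        if u == dst:
            path = []
            while u is not None:
                path.append(u)
                u = parent[u]
            return path[::-1]
        for v, tb in graph[u].items():
            if v not in parent and feasible(tb, ts, te, target_bw):
                parent[v] = u
                frontier.append(v)
    return None  # no path can honor the reservation in this slot

# Two slots of residual bandwidth (in Gbps) per link:
g = {"vs": {"a": [(0, 10, 10), (10, 20, 2)]},
     "a":  {"vd": [(0, 20, 10)]},
     "vd": {}}
print(fixed_slot_path(g, "vs", "vd", 0, 10, 5))   # ['vs', 'a', 'vd']
print(fixed_slot_path(g, "vs", "vd", 10, 20, 5))  # None (vs->a dips to 2 Gbps)
```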


Network Model for Bandwidth Scheduling
• Graph: G(V, E)
• Link bandwidth
  - Segmented constant functions of time
  - Time-bandwidth (TB) segment: (t[i], t[i+1], b[i])
• Aggregated TB (ATB) over all links (aggregation sketched below)

[Figure: residual bandwidth of a link as a step function of time, with segment boundaries t[0], t[1], t[2], t[3]]
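Aggregating TB lists is just a min over links at every instant; a small sketch follows (the list format and the assumption of contiguous segments over a shared horizon are illustrative).

```python
def aggregate_tb(tb_a, tb_b):
    """Combine two links' TB lists into one TB list for the two-link path.

    The path's residual bandwidth at any time is the min over its links,
    so we merge the segment boundaries and take per-segment minima.
    Assumes both lists cover the same horizon with contiguous segments."""
    bounds = sorted({t for seg in tb_a + tb_b for t in seg[:2]})
    def bw_at(tb, t):
        for t0, t1, bw in tb:
            if t0 <= t < t1:
                return bw
        return 0.0
    return [(t0, t1, min(bw_at(tb_a, t0), bw_at(tb_b, t0)))
            for t0, t1 in zip(bounds, bounds[1:])]

link1 = [(0, 5, 10), (5, 20, 4)]
link2 = [(0, 8, 6), (8, 20, 8)]
print(aggregate_tb(link1, link2))
# [(0, 5, 6), (5, 8, 4), (8, 20, 4)]
```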

4 Types of Scheduling Problems (TON'13)
• Given: G = (V, E), ATB, source vs, destination vd, data size δ
• Objective: minimize the data transfer end time
• Fixed Path with Fixed Bandwidth (FPFB)
  - Compute a fixed path from vs to vd with a constant (fixed) bandwidth
• Fixed Path with Variable Bandwidth (FPVB)
  - Compute a fixed path from vs to vd with varying bandwidths across multiple time slots
• Variable Path with Fixed Bandwidth (VPFB)
  - Compute a set of paths from vs to vd at different time slots with the same (fixed) bandwidth
• Variable Path with Variable Bandwidth (VPVB)
  - Compute a set of paths from vs to vd at different time slots with varying bandwidths across multiple time slots

The contrast between the fixed- and variable-bandwidth cases on a fixed path is sketched below.
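Given the ATB of one fixed path, the difference is easy to see numerically; the ATB values and the FPFB search below (which only tries the minimum residual bandwidth from each candidate start slot) are illustrative simplifications, not the optimal algorithms from the paper.

```python
def fpvb_finish(atb, delta):
    """FPVB on a fixed path: start immediately and use the full residual
    bandwidth of every slot until delta units of data have been sent."""
    sent = 0.0
    for t0, t1, bw in atb:
        cap = bw * (t1 - t0)
        if sent + cap >= delta:
            return t0 + (delta - sent) / bw  # finishes inside this slot
        sent += cap
    return None  # horizon too short

def fpfb_finish(atb, delta):
    """FPFB on a fixed path: choose a start slot and a constant bandwidth
    (here, the min residual bandwidth from that slot onward) minimizing
    the transfer end time. A coarse sketch of the optimization."""
    best = None
    for i, (t0, _, _) in enumerate(atb):
        bw = min(seg[2] for seg in atb[i:])
        if bw > 0:
            end = t0 + delta / bw
            best = end if best is None else min(best, end)
    return best

atb = [(0, 10, 2), (10, 20, 8)]  # path bandwidth over time
print(fpvb_finish(atb, 40))  # 12.5: 20 units in [0,10), the rest at 8
print(fpfb_finish(atb, 40))  # 15.0: better to wait until t=10, then 40/8
```

The example shows why it is not always optimal for FPFB to start immediately, while FPVB always starts right away.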


VPFB & VPVB
• Multiple paths are used in sequential order
• Path switching incurs a delay (overhead) τ
  - Path switching delay is negligible (τ = 0): VPFB-0, VPVB-0
  - Path switching delay is not negligible (τ > 0): VPFB-1, VPVB-1
• When τ is large enough, VPFB-1 → FPFB and VPVB-1 → FPVB

Problem Features
• FPFB is the most stringent; VPVB is the most flexible
• FPFB and VPFB restrict the bandwidth
  - Not always optimal to start the data transfer immediately
  - Suited for transport methods with a fixed rate
    • FRTP, PLUT, Tsunami, Hurricane, RBUDP
• FPVB and VPVB use variable bandwidth
  - Always start immediately
  - Suited for transport methods with a dynamically adapted rate
    • SABUL, RAPID, RUNAT, improved TCP

Complexity and Algorithm
• An optimal algorithm for each problem
• A heuristic for each NP-complete problem

Workflow Mapping


Cost Models
• Workflow: a computing workflow consists of m modules $w_0, w_1, \ldots, w_{m-1}$. Module $w_j$ receives data of size $z_{j-1}$ from its predecessor $w_{j-1}$, has computational complexity $\lambda(w_j)$, and sends data of size $z_j$ to its successor $w_{j+1}$.
• Network: a computer network is modeled as an arbitrary directed graph $G = (V, E)$ with $|V| = n$ computing nodes $v_0, v_1, \ldots, v_{n-1}$ interconnected by directed communication links. Link $(v_h, v_k)$ has bandwidth $b_{h,k}$, minimum link delay $d_{h,k}$, and failure rate $f_{h,k}$; node $v_k$ has processing power $p_k$ and failure rate $f_k$.
• Costs: transferring data of size $z$ over link $(v_i, v_j)$ takes $z / b_{i,j} + d_{i,j}$ (message transmission time); executing module $w$ with input size $z$ on node $v_k$ takes $\lambda(w) \cdot z / p_k$ (node computing time).
• Objectives: map modules to nodes to achieve minimum end-to-end delay (MED) or maximum frame rate (MFR).
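A toy evaluation of these two cost terms (all numbers are made up, and the node-sharing factor introduced on the next slide is taken as 1):

```python
# Toy instance of the cost model (values are illustrative only).
lam = {"w1": 2.0}             # computational complexity lambda(w)
z   = {"w0": 5.0, "w1": 3.0}  # output data sizes along the pipeline
p   = {"v2": 4.0}             # node processing power
b   = {("v1", "v2"): 10.0}    # link bandwidth
d   = {("v1", "v2"): 0.05}    # minimum link delay

def t_tran(size, link):
    """Message transmission time over a link: z / b + d."""
    return size / b[link] + d[link]

def t_comp(w, node, z_in):
    """Node computing time of module w: lambda(w) * z_in / p
    (node-sharing factor A(w, v) taken as 1 here)."""
    return lam[w] * z_in / p[node]

# w0 on v1 sends 5.0 units to w1 on v2, where w1 then executes:
print(t_tran(z["w0"], ("v1", "v2")))  # 0.55
print(t_comp("w1", "v2", z["w0"]))    # 2.5
```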

• Workflow mapping and execution conditions
  - Single-node mapping: $x_{w,v} = 1$ if module $w$ is mapped to node $v$, and $x_{w,v} = 0$ otherwise; each module is mapped to exactly one node:
    $\sum_{v \in V_c} x_{w,v} = 1, \quad \forall w \in V_w$
  - Module execution precedence: a module may start only after its incoming transfers finish:
    $t^{s}_{w_j} \ge t^{f}_{e_{i,j}}, \quad \forall w_j \in V_w,\ e_{i,j} \in E_w$
  - Data transfer precedence: a transfer may start only after its source module finishes:
    $t^{s}_{e_{i,j}} \ge t^{f}_{w_i}, \quad \forall w_i \in V_w,\ e_{i,j} \in E_w$

  - Fair node sharing: the computing time of module $w$ on node $v$ is
    $T_{\mathrm{comp}}(w, v) = \dfrac{\lambda(w) \cdot z_w \cdot A(w, v)}{p_v}, \quad \forall w \in V_w,\ v \in V_c$
    where $A(w, v) = \int_{t^s_w}^{t^f_w} \alpha_v(t)\,dt$ and $\alpha_v(t) = \sum_{w' \in V_w:\ t^s_{w'} \le t \le t^f_{w'}} x_{w',v}$ counts the modules concurrently running on node $v$ at time $t$.
  - Fair link sharing: the transfer time of workflow edge $e_{i,j}$ over network link $l_{h,k}$ is
    $T_{\mathrm{tran}}(e_{i,j}, l_{h,k}) = \dfrac{z_{i,j} \cdot B(e_{i,j}, l_{h,k})}{b_{h,k}} + d_{h,k}, \quad \forall e_{i,j} \in E_w,\ l_{h,k} \in E_c$
    where $B(e_{i,j}, l_{h,k}) = \int_{t^s_{e_{i,j}}}^{t^f_{e_{i,j}}} \beta_{l_{h,k}}(t)\,dt$ and $\beta_{l_{h,k}}(t) = \sum_{e_{i',j'} \in E_w:\ t^s_{e_{i',j'}} \le t \le t^f_{e_{i',j'}}} x_{w_{i'},v_h} \cdot x_{w_{j'},v_k}$ counts the transfers concurrently sharing link $l_{h,k}$ at time $t$.

• Performance Metrics
  - End-to-end delay (ED): with the critical path (CP) of the workflow mapped to a path of nodes, the end-to-end delay is the sum of the computing times of the CP modules and the transfer times of the CP edges:
    $T_{\mathrm{ED}} = \sum_{w_j \in CP} T_{\mathrm{comp}}\bigl(w_j, v_{g(j)}\bigr) + \sum_{e_{i,j} \in CP} T_{\mathrm{tran}}\bigl(e_{i,j}, l_{g(i),g(j)}\bigr)$
    where $g(\cdot)$ denotes the mapping of modules to nodes.

  - Frame rate: the inverse of the time on the global bottleneck (BN), i.e., the most heavily loaded node or link under the mapping of workflow $G_w$ onto network $G_c$:
    $T_{\mathrm{BN}}(G_w, G_c) = \max\Bigl(\max_{v \in V_c} \sum_{w \in V_w \text{ mapped on } v} T_{\mathrm{comp}}(w, v),\ \max_{l \in E_c} \sum_{e \in E_w \text{ mapped on } l} T_{\mathrm{tran}}(e, l)\Bigr)$
    and the frame rate is $1 / T_{\mathrm{BN}}$.
  - Overall failure rate: the probability that any used node or link fails:
    $F = 1 - \prod_{l_{h,k} \in E_c \text{ with some } e_{i,j} \text{ mapped on it}} (1 - f_{h,k}) \cdot \prod_{v_h \in V_c \text{ with some } w_i \text{ mapped on it}} (1 - f_h)$
    Both metrics are evaluated for a concrete mapping in the sketch below.
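The toy mapping and failure rates below are invented for illustration; the two computations follow the formulas above directly.

```python
from math import prod

# Hypothetical mapping: (module, node) compute times and (edge, link) transfer times.
comp_time = {("w0", "v0"): 1.0, ("w1", "v1"): 3.0, ("w2", "v1"): 2.0}
tran_time = {(("w0", "w1"), ("v0", "v1")): 0.8,
             (("w1", "w2"), ("v1", "v1")): 0.0}  # co-located: free transfer
node_fail = {"v0": 0.01, "v1": 0.02}
link_fail = {("v0", "v1"): 0.005}

# Bottleneck: the most heavily loaded node or link limits the frame rate.
node_load, link_load = {}, {}
for (w, v), t in comp_time.items():
    node_load[v] = node_load.get(v, 0.0) + t
for (e, l), t in tran_time.items():
    link_load[l] = link_load.get(l, 0.0) + t
t_bn = max(max(node_load.values()), max(link_load.values()))
print("frame rate = 1/T_BN =", 1.0 / t_bn)  # v1 hosts w1 and w2: T_BN = 5.0

# Overall failure rate: one minus the probability that every used node
# and (inter-node) link survives.
used_nodes = {v for (_, v) in comp_time}
used_links = {l for (_, l) in tran_time if l[0] != l[1]}
F = 1 - prod(1 - node_fail[v] for v in used_nodes) * \
        prod(1 - link_fail[l] for l in used_links)
print("overall failure rate =", round(F, 5))  # ~0.03465
```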

• Objective Functions
  - Minimum End-to-end Delay (MED): $\min_{\text{all possible mappings}} T_{\mathrm{ED}}$, such that $F \le \bar{F}$
  - Maximum Frame Rate (MFR): $\min_{\text{all possible mappings}} T_{\mathrm{BN}}$, such that $F \le \bar{F}$

where $\bar{F}$ is the given failure rate constraint.

Pipeline Mapping -- A Special Case of DAG
• Problem categories
  - MED
    Ø No Node Reuse (MED-NNR)
    Ø Contiguous Node Reuse (MED-CNR)
    Ø Arbitrary Node Reuse (MED-ANR)
  - MFR
    Ø No Node Reuse or Share (MFR-NNR)
    Ø Contiguous Node Reuse and Share (MFR-CNR)
    Ø Arbitrary Node Reuse and Share (MFR-ANR)

[Figure: three mappings of a module pipeline M0, M1, …, Mn-1 onto a network from source Vs to destination Vd, illustrating arbitrary node reuse, no node reuse, and contiguous node reuse]

Problem Category and Complexity

Objective Function       | No Node Reuse | Contiguous Node Reuse | Arbitrary Node Reuse
Minimum End-to-end Delay | NP-complete   | NP-complete           | Polynomial (Dyn. Prog.)
Maximum Frame Rate       | NP-complete   | NP-complete           | NP-complete

• MED-ANR is polynomially solvable
  - Dynamic Programming-based solution (sketched below)
• MED/MFR-NNR/CNR are NP-complete
  - Reduce from DISJOINT-CONNECTING-PATH (DCP)
• MFR-ANR is NP-complete
  - Reduce from Widest-path with Linear Capacity Constraints (WLCC)
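A compact sketch of why MED-ANR admits a polynomial dynamic program: sweep the pipeline once, keeping for each node the best delay of placing the current module there. All identifiers and the tiny instance are illustrative, and node sharing is ignored (A = 1).

```python
import math

def med_anr(m, nodes, z, lam, p, b, d, src, dst):
    """DP sketch for MED with Arbitrary Node Reuse on a linear pipeline.

    z[j] is the input size of module j; lam[j] its complexity. After step j,
    T[v] is the minimum delay with module j placed on node v. Modules 0 and
    m-1 are pinned to the source and destination nodes."""
    def tran(size, u, v):
        if u == v:
            return 0.0                     # co-located modules: free transfer
        return size / b[(u, v)] + d[(u, v)] if (u, v) in b else math.inf
    T = {v: math.inf for v in nodes}
    T[src] = lam[0] * z[0] / p[src]
    for j in range(1, m):
        T = {v: min(T[u] + tran(z[j], u, v) + lam[j] * z[j] / p[v]
                    for u in nodes)
             for v in nodes}
    return T[dst]

# Tiny 3-module pipeline over 3 nodes (all values illustrative):
nodes = ["vs", "v1", "vd"]
b = {("vs", "v1"): 10, ("v1", "vd"): 10, ("vs", "vd"): 1}
d = {k: 0.1 for k in b}
print(med_anr(3, nodes, z=[4, 4, 2], lam=[1, 2, 1],
              p={"vs": 2, "v1": 8, "vd": 4}, b=b, d=d, src="vs", dst="vd"))
# ~4.3: w0 on vs, w1 on v1, w2 on vd
```

The table has one entry per (module, node) pair and each step scans all node pairs, which is what keeps arbitrary node reuse polynomial; forbidding reuse couples the placement decisions and breaks this structure.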


• Mapping algorithm for MED: RCP-F
  - Recursive Critical Path for MED with a Failure Rate Constraint

From special to general: DAG Mapping (TC’14)


• Mapping algorithm for MFR: LDP-F
  - Layer-oriented DP algorithm for MFR with a Failure Rate Constraint
  - Sort a DAG-structured workflow in a topological order
  - Map computing modules to network nodes on a layer-by-layer basis, considering:
    • Module dependency in the workflow
    • Network connectivity
    • Node and link failure rates

LDP-F computes a DP table entry $T_{w_j}(v_i)$ for each module $w_j$ ($j = 1, \ldots, m-1$, visited layer by layer) and each candidate node $v_i$ ($0 \le i \le n-1$): the best achievable bottleneck time with $w_j$ placed on $v_i$, taken over the mapped locations $v_h$ of its predecessors $w_u \in pre(w_j)$:

$T_{w_j}(v_i) = \max_{w_u \in pre(w_j),\ v_h = map(w_u)} \begin{cases} \max\left(T_{w_u}(v_h),\ \dfrac{z_{u,j}\, B(e_{u,j}, l_{h,i})}{b_{h,i}} + d_{h,i},\ \dfrac{\lambda(w_j)\, z_j\, A(w_j, v_i)}{p_i}\right), & \text{if } h \ne i \text{ and } 1 - (1-F)(1-f_{h,i})(1-f_i) \le \bar{F} \\ \max\left(T_{w_u}(v_h),\ \dfrac{\lambda(w_j)\, z_j\, A(w_j, v_i)}{p_i}\right), & \text{if } h = i \text{ and } 1 - (1-F)(1-f_i) \le \bar{F} \end{cases}$

where $F$ is the failure rate accumulated by the partial mapping and $\bar{F}$ is the failure rate constraint. A greatly simplified sketch of this sweep follows.
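A much-simplified rendering of the layer-by-layer sweep; unlike the real LDP-F, it resolves each predecessor to its own best entry independently rather than enumerating joint predecessor mappings, so treat it as illustrative only (all parameters and callbacks are assumptions, not the published algorithm).

```python
def ldp_f_sketch(modules, preds, nodes, comp, tran, node_fail, link_fail, f_bar):
    """Simplified layer-oriented DP for MFR with a failure-rate constraint.

    best[(w, v)] holds (bottleneck_time, survival_probability) for placing
    module w on node v. comp(w, v) and tran(u, w, vh, v) are user-supplied
    cost callbacks following the cost model above; tran returns float('inf')
    when no link exists."""
    best = {}
    for w in modules:                              # topological (layer) order
        for v in nodes:
            t_bn = comp(w, v)                      # this module's compute time
            surv = 1.0 - node_fail[v]
            feasible = True
            for u in preds[w]:
                cands = []
                for (m, vh), (bt, sp) in best.items():
                    if m != u:
                        continue
                    link_sp = 1.0 if vh == v else 1.0 - link_fail.get((vh, v), 1.0)
                    cands.append((max(bt, tran(u, w, vh, v)), sp * link_sp))
                if not cands:
                    feasible = False               # predecessor has no feasible entry
                    break
                bt, sp = min(cands)                # smallest bottleneck first
                t_bn, surv = max(t_bn, bt), surv * sp
            if feasible and 1.0 - surv <= f_bar:   # failure-rate constraint
                best[(w, v)] = (t_bn, surv)
    return best  # read off the best entry for the sink module
```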

[Figure: a layered workflow in topological sorting (layers 0 through l-1, from w0 on vs = v0 to wm-1 on vd = vn-1) and the corresponding DP table with separated layers, filled entry by entry from T0,0 to Tn-1,m-1]

A Prototype System: Distributed Remote Intelligent Visualization Environment (DRIVE)

Two examples in the visualization of large-scale scientific applications:
• Jet air flow dynamics (pressure, raycasting)
• TSI explosion (density, raycasting)

A Production System (JOGC'13): Scientific Workflow Automation and Management Platform (SWAMP)


Use Case: Spallation Neutron Source (SNS)

SNS Workflow (abstract)


SNS Workflow (concrete)


SNS Workflow Results
