A Software-Defined Networking based Approach for Performance Management of Analytical Queries on Distributed Data Stores

Pengcheng Xiong (NEC Labs America), Hakan Hacigumus (NEC Labs America), Jeffrey F. Naughton (Univ. of Wisconsin)


Agenda

Why? Motivation and background

How? System architecture and implementation

So what? Real system and benchmark query evaluation

Conclusion

Motivation

Data analytics applications or data scientists query the data from distributed stores.

A huge amount of data traffic on the network: Join

Many applications want to share a cluster: data backup, video streaming, etc.

Response time is critical: deadline-driven reports

Query service differentiation: batch queries, interactive queries

An example query (TPC-H Q14)

Data store site Sl stores lineitem; data store site Sp stores part.

We assume that tables are distributed across relational data stores.

The relational data stores are connected by a network.

Network change implies plan performance change

[Figure: plan performance across Phase 1, Phase 2 and Phase 3 as the network status changes]

(1) Huge gap

(2) The best plan can become the worst one

What if?

What if the query optimizer could dynamically monitor the network bandwidth and adaptively choose a plan?

[Figure: Phase 1, Phase 2, Phase 3]

An adaptive plan is chosen and query execution time is kept short.
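The idea of adaptive plan choice can be illustrated with a minimal Python sketch. It is not the authors' optimizer: the `Plan` fields, the link labels, the byte counts and the simple "local work plus bytes over bandwidth" estimate are all assumptions made for illustration. Given the bandwidth currently measured on each link, it picks the candidate plan with the lowest estimated time.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    name: str
    link: str             # hypothetical label of the link the plan ships data over
    bytes_shipped: float  # data moved across that link (made-up numbers below)
    local_seconds: float  # assumed non-network work


def estimated_seconds(plan, bandwidth):
    """Estimate plan time from the measured bandwidth (bytes/s) of its link."""
    rate = bandwidth.get(plan.link, 0.0)
    if rate <= 0:
        return float("inf")  # an unusable link makes the plan infeasible
    return plan.local_seconds + plan.bytes_shipped / rate


def choose_plan(plans, bandwidth):
    """Return the plan with the smallest estimated time under current bandwidth."""
    return min(plans, key=lambda p: estimated_seconds(p, bandwidth))


# Two illustrative candidates for a lineitem/part join across sites Sl and Sp.
PLANS = [
    Plan("ship_part_to_Sl", "Sp->Sl", 1e9, 10.0),
    Plan("ship_lineitem_to_Sp", "Sl->Sp", 5e9, 5.0),
]
```

With equal bandwidth on both links the sketch prefers shipping the smaller table; once the `Sp->Sl` link becomes congested, the other plan wins, which mirrors the "best plan can become the worst one" observation.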

Network busy implies no good plan

User: Run query right now and right away. I need that ASAP to catch my deadline!

Distributed DBMS: Well… I am sorry. None of the candidate plans can meet your deadline due to current busy network status.

What if?

User: Run query right now and right away. I need that ASAP to catch my deadline!

Distributed DBMS: OK. Although the network is currently busy, I can control it to prioritize the bandwidth for the query.

What if the query optimizer could control the network?

Distributed query optimizer monitors and controls the network?

Sounds like a mission impossible: the database has always treated the underlying network as a black box, unable to monitor it, let alone control it.

With software-defined networking, the database can inquire about the current status of the network, or control the network with directives.

[Figure: Networking without SDN: unable to monitor, let alone to control. Networking with SDN: able to inquire and control.]

Sounds interesting, but how?

Ethernet Switch/Router

Data Path (Hardware)

Control Path (Software)

Data Path (Hardware)

Control Path: OpenFlow

OpenFlow Controller, reached over the OpenFlow Protocol (SSL/TCP)

Dist. Query Optimizer

API: our contribution
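To make the role of such an API more concrete, here is a hedged Python sketch of what a monitor/control interface between a query optimizer and an SDN controller might look like. Every name in it (`NetworkControl`, `available_bandwidth`, `set_priority`, `reserve_bandwidth`, `FakeController`) is hypothetical and does not come from the talk, from Beacon, or from OpenFlow; the in-memory fake exists only so the sketch runs without a switch.

```python
from typing import Protocol


class NetworkControl(Protocol):
    """Hypothetical optimizer-facing interface (names are illustrative)."""

    def available_bandwidth(self, src: str, dst: str) -> float: ...
    def set_priority(self, src: str, dst: str, level: int) -> None: ...
    def reserve_bandwidth(self, src: str, dst: str, bytes_per_s: float) -> None: ...


class FakeController:
    """In-memory stand-in for a controller; not a real OpenFlow client."""

    def __init__(self, capacity, background):
        self.capacity = dict(capacity)      # (src, dst) -> link capacity, bytes/s
        self.background = dict(background)  # (src, dst) -> competing traffic, bytes/s
        self.reserved = {}                  # (src, dst) -> reserved rate, bytes/s
        self.priority = {}                  # (src, dst) -> priority level

    def available_bandwidth(self, src, dst):
        # Monitor: bandwidth left over by other traffic, or the reservation if larger.
        link = (src, dst)
        leftover = max(self.capacity[link] - self.background.get(link, 0.0), 0.0)
        return max(leftover, self.reserved.get(link, 0.0))

    def set_priority(self, src, dst, level):
        # Control: record the requested priority for traffic on this link.
        self.priority[(src, dst)] = level

    def reserve_bandwidth(self, src, dst, bytes_per_s):
        # Control: a reservation cannot exceed the link capacity.
        link = (src, dst)
        self.reserved[link] = min(bytes_per_s, self.capacity[link])
```

Keeping the optimizer behind a narrow interface like this is one possible design choice: the optimizer only asks "how fast is this link now?" and "please favor this flow", while controller-specific details stay on the other side.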


System architecture

System implementation

Beacon

NEC PFS5240

Plan generation

Stores lineitem table

Stores part table

Cost estimation

Cost model for network operator: amount of data transferred / real-time transfer speed

SDN support:

(Monitor) Take any bandwidth left

(Control) Assign the highest priority

(Control) Make a bandwidth reservation
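Read as a formula, the network operator cost could be written as follows. This is a sketch under assumptions: the symbols are introduced here and do not appear in the slides, and the highest-priority case is left out because the slides do not say how its transfer speed is estimated.

```latex
% D: amount of data transferred; B: real-time transfer speed (both symbols assumed)
\hat{T}_{\text{net}} = \frac{D}{B},
\qquad
B =
\begin{cases}
B_{\text{left}} & \text{monitor: take any bandwidth left} \\
B_{\text{res}}  & \text{control: bandwidth reservation}
\end{cases}
```

For example, shipping $D = 10$ GB over a link with $B_{\text{left}} = 100$ MB/s gives an estimate of about 100 seconds, while a reservation of $B_{\text{res}} = 500$ MB/s would bring it down to about 20 seconds.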

Evaluation Setup

TPC-H, scaling factor 100, Q14

Small tables (supplier, nation, region) are replicated. Other tables are placed at a single data store site.

Neighbor traffic generator: iperf

Summary of case studies

Case 1: single user, single-thread, iperf

[Figure: Phase 1, Phase 2, Phase 3, with the bottleneck marked in each phase]

Based on SDN, the query optimizer can dynamically monitor the network bandwidth and adaptively choose the best plan.

Case 3: multiple users, multi-thread, no contention traffic, priority queue

Based on SDN, premium queries run faster than regular ones.

Based on SDN, all queries run faster.

Case 5: single user, multi-thread, iperf, weighted-fair queue

Based on SDN, more reservation makes queries run faster.
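Why a larger reservation shortens transfers can be seen from a small Python sketch of weighted fair sharing. It assumes an idealized weighted-fair queue in which every flow is always backlogged, so each flow receives capacity in proportion to its weight; the function names, flow labels and numbers are illustrative and are not measurements from the talk.

```python
def wfq_shares(capacity, weights):
    """Split link capacity among backlogged flows in proportion to their weights."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return {flow: capacity * w / total for flow, w in weights.items()}


def transfer_seconds(data_bytes, rate):
    """Time to move data_bytes at a steady rate (bytes/s)."""
    return data_bytes / rate
```

For instance, with a capacity of 100 units, weights of 1 for the query and 3 for the competing iperf traffic give the query 25 units, while swapping the weights gives it 75 units and cuts its transfer time to a third.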

Conclusion

SDN can be effectively exploited for performance management of analytical queries on distributed data stores.

Directly monitor the network and adaptively pick the best plan.

Control the priority of network traffic or make network bandwidth reservations to differentiate the query service.

Lots of opportunities

Thanks!