Reconstructing network states in cloud using NIC and system timestamps: A case study of Cloudlab Shiyu Liu, Balaji Prabhakar, Mendel Rosenblum Stanford University Feb 7, 2018


Page 1

Reconstructing network states in cloud using NIC and system timestamps: A case study of Cloudlab

Shiyu Liu, Balaji Prabhakar, Mendel Rosenblum

Stanford University

Feb 7, 2018

Page 2

SIMON: A Simple and Scalable Method for Sensing, Inference and Measurement in Data Center Networks (NSDI’19)

• Using edge-based measurement to reconstruct key network state variables:
  • Packet queuing times at switches
  • Link utilizations
  • Queue and link compositions at the flow level

• SIMON enables:
  • Sensitive A/B tests
  • Network troubleshooting & diagnosis
  • Network performance monitoring

Page 3

HW (NIC) or SW (system) timestamps?

• HW (NIC) timestamps:
  • Accurate inputs for estimating queueing delays. SIMON’s default.
  • Not available in many cases, e.g. in the cloud

• SW (system) timestamps:
  • Widely available
  • Could we use SW timestamps and still get fairly good reconstructions?
  • Will improve the deployability of SIMON

[Figure: sender and receiver host stacks (CPU+RAM running APP, Kernel, and Driver, connected over PCIe to the NIC), marking the Tx SW, Tx HW, Rx HW, and Rx SW timestamp points]

$T(\mathrm{Rx}_{\mathrm{HW}}) - T(\mathrm{Tx}_{\mathrm{HW}})$ vs. $T(\mathrm{Rx}_{\mathrm{SW}}) - T(\mathrm{Tx}_{\mathrm{SW}})$

• Software processing delays in driver, interrupt handling, interrupt coalescing

• PCI-E delays

• NIC queueing & hardware processing delays
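To make the comparison concrete, here is a minimal Python sketch that computes both one-way delays for a single probe and the gap between them. The `Probe` record, its field names, and the timestamp values are hypothetical and chosen only for illustration; the sketch also assumes the clocks on both hosts are already synchronized.

```python
from dataclasses import dataclass


@dataclass
class Probe:
    """Hypothetical record of one probe packet; all timestamps in nanoseconds."""
    tx_sw: int  # system timestamp taken when the sender hands off the packet
    tx_hw: int  # NIC timestamp when the packet leaves the sender NIC
    rx_hw: int  # NIC timestamp when the packet reaches the receiver NIC
    rx_sw: int  # system timestamp when the receiver software sees the packet


def hw_one_way_delay(p: Probe) -> int:
    """One-way delay measured between the two NICs."""
    return p.rx_hw - p.tx_hw


def sw_one_way_delay(p: Probe) -> int:
    """One-way delay measured between the two system clocks."""
    return p.rx_sw - p.tx_sw


def sw_overhead(p: Probe) -> int:
    """Extra delay seen only by SW timestamps (driver, interrupts, PCIe, NIC)."""
    return sw_one_way_delay(p) - hw_one_way_delay(p)


# Made-up example values, not measurements from the talk.
probe = Probe(tx_sw=1_000, tx_hw=3_500, rx_hw=28_500, rx_sw=36_000)
print(hw_one_way_delay(probe), sw_one_way_delay(probe), sw_overhead(probe))
```

The quantity returned by `sw_overhead` is the part of the SW measurement that has to be filtered out before the SW one-way delay can stand in for the HW one.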

Page 4

Contents

• Overview of Cloudlab environment
• Performance of SIMON w/ HW timestamps
• Study of the difference between SW & HW measured one-way delays
• Performance of SIMON w/ SW timestamps

Page 5

A case study of Cloudlab

• Cloudlab: 2-stage switching fabric, 10G links

• Huygens is used to synchronize the SW (system) clocks and, separately, the HW (NIC) clocks across all servers.

[Figure: Topology of Cloudlab experiment]

• OS: Linux v4.15
• NIC: Mellanox ConnectX-4
• ToR: Dell S4048-ON, 12MB shared pkt buffer
• Spine: Mellanox MSN2410, 16MB shared pkt buffer


Page 7

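Estimate queue recon errors without ground truth

• Send two independent probe meshes, i.e. take two independent sets of measurements, and run SIMON on each.

[Diagram: ground truth $Q$ → Measure 1 → SIMON → $Q_1 = Q + e_1$; ground truth $Q$ → Measure 2 → SIMON → $Q_2 = Q + e_2$]

• For independent errors $e_1$ and $e_2$: $\mathrm{Var}(e_1 - e_2) = \mathrm{Var}(e_1) + \mathrm{Var}(e_2)$

• Since $Q_1 - Q_2 = e_1 - e_2$, the difference between the two independent reconstructions ($Q_1$ and $Q_2$) bounds the difference between each reconstruction and the ground truth (i.e. $e_1$ and $e_2$).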
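The bound can be checked numerically. The sketch below writes the two reconstruction errors as `e1` and `e2`, draws them as independent zero-mean Gaussian samples (an assumption made only for this illustration, with a made-up 20 us standard deviation), and compares the variance of their difference with the sum of their variances.

```python
import random
from statistics import pvariance

random.seed(0)  # fixed seed so the run is repeatable
n = 20_000

# Synthetic, independent reconstruction errors in microseconds (illustrative only).
e1 = [random.gauss(0.0, 20.0) for _ in range(n)]
e2 = [random.gauss(0.0, 20.0) for _ in range(n)]

# The difference of the two reconstructions equals e1 - e2,
# because the ground truth cancels out.
diff = [a - b for a, b in zip(e1, e2)]

var_of_diff = pvariance(diff)
sum_of_vars = pvariance(e1) + pvariance(e2)

print(f"Var(e1 - e2)      = {var_of_diff:.1f}")
print(f"Var(e1) + Var(e2) = {sum_of_vars:.1f}")
```

Because the variance of the difference is the sum of two non-negative terms, it is at least as large as either error's own variance, which is what lets the cross-validation difference serve as an upper bound when no ground truth is available.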

Page 8

SIMON w/ HW timestamps in Cloudlab

Cross-validation by 2 independent meshes of probes. Recon interval = 1 ms.

                     All queues    Queues > 100 us
RMS(blue − red)      29.33 us      108.54 us
Relative error       7.2%          6.9%

Relative error = RMS(blue − red) / RMS(…)
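As a rough sketch of how such cross-validation numbers can be computed, the Python below takes two reconstructed queue traces (the "blue" and "red" curves) and returns the RMS of their difference plus a relative error. The normalizer is an assumption: the denominator of the relative error is not legible in the source, so this sketch divides by the RMS of the average of the two traces. The function names and sample values are hypothetical.

```python
import math


def rms(values):
    """Root mean square of a sequence of numbers."""
    return math.sqrt(sum(v * v for v in values) / len(values))


def cross_validation_error(recon_blue, recon_red):
    """Return (RMS difference, relative error) for two reconstructions.

    ASSUMPTION: the relative error is normalized by the RMS of the mean of
    the two reconstructions; the source does not state the normalizer legibly.
    """
    diff = [b - r for b, r in zip(recon_blue, recon_red)]
    mean_recon = [(b + r) / 2 for b, r in zip(recon_blue, recon_red)]
    rms_diff = rms(diff)
    return rms_diff, rms_diff / rms(mean_recon)


# Made-up queueing delays in microseconds, one value per recon interval.
blue = [100.0, 200.0, 300.0, 400.0]
red = [110.0, 190.0, 310.0, 390.0]
rms_diff, rel_err = cross_validation_error(blue, red)
print(f"RMS(blue - red) = {rms_diff:.2f} us, relative error = {rel_err:.1%}")
```

Restricting both inputs to intervals where the queue exceeds 100 us would give the analogue of the "Queues > 100 us" column.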


Page 10

SW & HW one-way delay in Cloudlab

• Our goal is to use the SW one-way delay (red line) to estimate the HW one-way delay (blue line)
• The noise is instantaneous, whereas switch queueing delays persist over time

[Plot annotations: DC bias; high-frequency noise]

Page 11

Remove the noise in SW one-way delays

• DC bias: apply a high-pass filter to remove it
• Peak noise: remove samples that exceed a threshold
• Remaining noise: LASSO will average it out
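The two filtering steps can be sketched in a few lines of Python. The slide does not give the filter design, so everything here is an assumption made for illustration: the high-pass step subtracts a centered moving average, the peak step replaces a sample with the local median when it exceeds that median by more than a threshold, and the function names, window sizes, and sample values are hypothetical.

```python
from statistics import mean, median


def high_pass(samples, window):
    """Remove a slowly varying offset by subtracting a centered moving average.

    ASSUMPTION: a moving-average baseline stands in for the high-pass filter.
    """
    half = window // 2
    out = []
    for i, x in enumerate(samples):
        lo, hi = max(0, i - half), min(len(samples), i + half + 1)
        out.append(x - mean(samples[lo:hi]))
    return out


def remove_peaks(samples, threshold, neighbors=2):
    """Replace isolated spikes with the median of the surrounding samples.

    A spike is a sample exceeding its local median by more than `threshold`.
    Sustained increases survive because they raise the local median too.
    """
    out = list(samples)
    for i, x in enumerate(samples):
        lo, hi = max(0, i - neighbors), min(len(samples), i + neighbors + 1)
        local = median(samples[lo:hi])
        if x - local > threshold:
            out[i] = local
    return out


# Made-up traces in microseconds.
flat_with_bias = [50] * 20                       # constant offset only
spiky = [0, 0, 0, 100, 0, 0, 0]                  # one instantaneous spike
queue_burst = [0, 0, 30, 30, 30, 30, 30, 0, 0]   # prolonged queueing episode

print(high_pass(flat_with_bias, window=5))
print(remove_peaks(spiky, threshold=20))
print(remove_peaks(queue_burst, threshold=20))
```

The peak step relies on the observation that noise is instantaneous while queueing persists: a one-sample spike stands out from its neighbors, whereas a multi-sample burst does not. In practice the moving-average window would need to be much longer than a typical queueing episode so that the high-pass step does not flatten real queues.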


Page 13

Filtering improves the approximation of SW recon results to HW results

[Diagram: SW one-way delays (w/o or w/ filtering) → SIMON → SW recon results; HW one-way delays → SIMON → HW recon results; the SW recon results approximate the HW recon results]

RMS(SW recon - HW recon) (us)

                All queues    Queues > 100us
w/o filter      23.04         29.08
w/ filter       12.67         24.80

Page 14

Recon errors using SW & HW timestamps

RMSE (us)

                  All queues    Queues > 100us
HW                29.33         108.54
SW w/ filter      32.37         109.79

Relative error

                  All queues    Queues > 100us
HW                7.23%         6.89%
SW w/ filter      7.98%         6.97%

• SW recon errors are close to HW, esp. for large queues
• SW (system) timestamps are good replacements for HW (NIC) timestamps for SIMON
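A short calculation shows how close the two sets of errors are. The Python below uses the RMSE values from this slide, reading the chart's data labels as HW = 29.33 / 108.54 us and SW w/ filter = 32.37 / 109.79 us for all queues and queues > 100us respectively (the HW pair matches the earlier cross-validation results). The variable and function names are illustrative.

```python
# RMSE values in microseconds, as read from the slide's bar chart.
rmse_us = {
    "All queues": {"HW": 29.33, "SW w/ filter": 32.37},
    "Queues > 100us": {"HW": 108.54, "SW w/ filter": 109.79},
}


def relative_gap(hw, sw):
    """Fraction by which the SW error exceeds the HW error."""
    return (sw - hw) / hw


gap = {
    group: relative_gap(vals["HW"], vals["SW w/ filter"])
    for group, vals in rmse_us.items()
}

for group, value in gap.items():
    print(f"{group}: SW RMSE is {value:.1%} above HW")
```

The gap is roughly ten percent over all queues and close to one percent for queues above 100 us, which is the sense in which SW errors track HW errors most closely for large queues.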

Page 15

Conclusion

• With proper filtering, SW (system) timestamps become good replacements for HW (NIC) timestamps when reconstructing network states.
• This improves the deployability of SIMON, e.g. in cloud environments.

Welcome to our poster for more details and Q/A