Presentation of Understanding and Surpassing Dropbox Globecom 2015


Transcript of Presentation of Understanding and Surpassing Dropbox Globecom 2015

Understanding and Surpassing Dropbox: Efficient Incremental Synchronization in Cloud Storage Services

Shenglong Li 1 Quanlu Zhang 1 Zhi Yang 1 Yafei Dai 1

1 Peking University

(lishenglong, zql, yangzhi, dyf)@net.pku.edu.cn

June 18, 2016

Presented by: Fajar Purnama (HICC LAB) Kumamoto University, GLOBECOM2015 June 18, 2016 1 / 29

Outline

1 Introduction
  Background
  Objective
  Contribution

2 Related Work
  Measurement of cloud storage services
  Similarity detection technique
  State of The Art

3 Understanding Incremental Sync Of Cloud Storage Services
  Rsync Algorithm
  Sync Mechanism on Dropbox
  Detail Measurement and Analysis

4 System Design and Implementation
  System Architecture
  Delta Sharing
  Chunk-Based Rsync with Similarity Detection
  Efficient conflict resolution

5 Evaluation
  Modification Benchmark
  File Conflict
  Comparison with other cloud services
  Evaluation of Additional Overhead

6 Conclusion


Introduction Background

Cloud Storage Services

With increasing user demand for high data reliability and convenient data access, cloud storage services have become extremely prevalent and highly successful. They are widely used for file sharing.

Seafile


Introduction Objective

Understanding and Surpassing Dropbox

Data synchronization is the heart of cloud storage services, and incremental data synchronization is applied to minimize network traffic. The objectives are to find out:

Whether "modified data = uploaded data" holds for the active client.

Whether "downloaded data (passive client) = uploaded data (active client)" holds.

Whether both active and passive clients remain efficient during a file conflict.

Then create an improved prototype based on the findings.


Introduction Contribution

3

Measurement on Dropbox

Conduct intensive measurements on Dropbox in file sharing scenarios.

Mechanism on Dropbox

Unravel the sync mechanisms employed in Dropbox on both active and passive clients.

Minbox

Design several novel mechanisms that resolve the traffic problems, and apply them in an efficient incremental synchronization system.


Related Work Measurement of cloud storage services

Measurement of cloud storage services

Drago first uncovers the Dropbox system architecture and data sync mechanism through an ISP-level large-scale measurement.

Li reveals the traffic overuse problem in Dropbox when a user frequently modifies files in the synced folder, and proposes an efficient batched sync mechanism to avoid massive metadata interaction.

Li focuses on quantifying and understanding traffic usage effectiveness through measurements of several popular cloud storage services on different devices.


Related Work Similarity detection technique

Similarity detection technique

Xia proposes a new similarity detection algorithm to better exploit similarity with low RAM overhead and high throughput.

Google deploys SimHash to improve space efficiency and query quality for web crawling.

Mark Manasse implements MinHash using a shingle sampling technique to extract features.


Related Work State of The Art

1

While these previous works cover the data sync mechanism as one of the key operations, none of them tries to fully understand the mechanism of the incremental sync technique in the file sharing scenario, or measures the network traffic under different write behaviours. Moreover, we reveal network traffic waste problems that have not been explored before and design several sync mechanisms to solve them.


Related Work State of The Art

2

Our system design and implementation are different from these works. Specifically, we design an efficient chunk-based delta encoding mechanism with an embedded similarity detection technique, which combines locality-sensitive hashing and content-defined chunking to optimize the computation overhead while guaranteeing precision. Moreover, this mechanism can integrate with other deduplication techniques seamlessly.

Understanding Incremental Sync Of Cloud Storage Services Rsync Algorithm

An Incremental Data Sync Algorithm

The whole point of rsync is that when a file is modified, the whole file is not sent to the other host; only the modified part is sent.

When a file is modified, the client retrieves the signature of the old file, which consists of strong checksums (e.g., BLAKE2, MD5) and weak checksums (e.g., Adler-32, a type of rolling checksum).

The client first computes weak checksums of the blocks in the changed file.

If a checksum matches one of the retrieved checksums, the client calculates the block's strong checksum to verify that the two blocks are indeed the same.

If there is no match, the client rolls one byte forward and calculates the weak checksum again to look for a matching block, which makes it possible to find content that has shifted position.

Finally, all the differing parts, called the delta, are found and sent back to the server. The changed file is generated on the server by merging the delta with the original file, which is called patching.
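The steps above can be sketched roughly as follows. This is a minimal illustration, not the rsync or librsync implementation: the function names, the tiny block size, and the `('copy', index)` / `('data', bytes)` delta format are assumptions made for this example, MD5 stands in for the strong checksum, and the weak checksum is recomputed for each window with `zlib.adler32` instead of being updated incrementally as a real rolling checksum would be.

```python
import hashlib
import zlib


def make_signature(old: bytes, block: int) -> dict:
    """Server side: map weak checksum -> [(strong checksum, block index)]."""
    sig = {}
    for start in range(0, len(old), block):
        chunk = old[start:start + block]
        entry = (hashlib.md5(chunk).digest(), start // block)
        sig.setdefault(zlib.adler32(chunk), []).append(entry)
    return sig


def make_delta(new: bytes, sig: dict, block: int) -> list:
    """Client side: describe `new` as copies of old blocks plus literal data."""
    ops, literal, pos = [], bytearray(), 0
    while pos < len(new):
        window = new[pos:pos + block]
        match = None
        # Cheap weak checksum first; strong checksum only on a weak hit.
        for strong, index in sig.get(zlib.adler32(window), []):
            if hashlib.md5(window).digest() == strong:
                match = index
                break
        if match is None:
            # No match: keep this byte as literal data and roll one byte forward.
            literal.append(new[pos])
            pos += 1
        else:
            if literal:
                ops.append(("data", bytes(literal)))
                literal = bytearray()
            ops.append(("copy", match))
            pos += len(window)
    if literal:
        ops.append(("data", bytes(literal)))
    return ops


def patch(old: bytes, ops: list, block: int) -> bytes:
    """Server side: rebuild the new file from the old file and the delta."""
    out = bytearray()
    for kind, arg in ops:
        out += old[arg * block:(arg + 1) * block] if kind == "copy" else arg
    return bytes(out)


old_file = b"abcdefghijklmnop"
new_file = b"abcdefXYghijklmnop"  # two bytes inserted in the middle
delta = make_delta(new_file, make_signature(old_file, 4), 4)
rebuilt = patch(old_file, delta, 4)
```

In this example only the bytes around the insertion travel as literal data; the blocks before and after it are referenced by index, even though the later blocks have shifted position.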


Understanding Incremental Sync Of Cloud Storage Services Rsync Algorithm

Illustration of Rsync

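[Figure: rsync between server and client. The server derives a signature from the old file and sends it to the client. The client combines the signature with the new file to produce the delta (patch) and sends it to the server. The server combines the old file with the delta (patch) to obtain the new file.]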


Understanding Incremental Sync Of Cloud Storage Services Sync Mechanism on Dropbox

Dropbox Index Server and Amazon Data Server

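[Figure: client, index server, and data server]

1. The client requests the file location from the index server.

2. The index server sends the file location.

3. The client syncs the file with the data server using rsync at that file location.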


Understanding Incremental Sync Of Cloud Storage Services Detail Measurement and Analysis

Active and Passive Client

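[Figure: an active client adds or modifies 1 B in a 10 MB file and syncs it to the Dropbox servers, which then sync the passive clients. The question for each transfer is whether it carries the whole 10 MB (or 10 MB + 1 B) or only the 1 B that changed.]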


Understanding Incremental Sync Of Cloud Storage Services Detail Measurement and Analysis

Experiment 1: Replacement at Different Positions

1. Divide both files into 4 MB chunks. For example, for a 10 MB file:

Old file: 4MB | 4MB | 2MB
New file: 4MB | 4MB | 2MB

2. Check whether each pair of chunks is identical; if not, execute rsync on that chunk.

Old file: 4MB  | 4MB  | 2MB
          same | same | rsync
New file: 4MB  | 4MB  | 2MB

For Figure 1, rsync is executed on the first, the second, or the third chunk, depending on where the data is replaced.

The uplink of active client A and the downlink of passive client B both carry the delta, so they should be the same size:

The downlink for client A is the same in every case because the active client already stores the signature data.

Passive clients have to send the signature to the data server, which is why they show uplink traffic.

The uplink when the end of the file is modified is smaller because the signature is computed per 4 KB rsync block and the last chunk is only 2 MB. Librsync uses a 256-bit strong checksum and a 32-bit weak checksum per block, so the signature of a 4 MB chunk is ((256 b + 32 b) / 8) * (4 MB / 4 KB) = 36 KB.
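The signature-size arithmetic on this slide can be checked with a few lines of code. This is only a worked calculation under the figures stated above (4 KB rsync blocks, a 256-bit strong checksum and a 32-bit weak checksum per block); the function name and its parameters are made up for this example.

```python
KB = 1024
MB = 1024 * KB


def signature_bytes(chunk_bytes: int, block_bytes: int = 4 * KB,
                    strong_bits: int = 256, weak_bits: int = 32) -> int:
    """Size of the rsync signature for one chunk, in bytes."""
    blocks = -(-chunk_bytes // block_bytes)  # ceiling division
    return blocks * (strong_bits + weak_bits) // 8


full_chunk = signature_bytes(4 * MB)  # 1024 blocks * 36 bytes
last_chunk = signature_bytes(2 * MB)  # 512 blocks * 36 bytes
print(full_chunk // KB, "KB and", last_chunk // KB, "KB")
```

A 4 MB chunk gives 36 KB and a 2 MB chunk gives 18 KB, which is why modifying the last chunk of the 10 MB file produces a smaller signature.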


Understanding Incremental Sync Of Cloud Storage Services Detail Measurement and Analysis

Experiment 1: Insertion at Different Positions

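The file is again divided into chunks of 4MB | 4MB | 2MB. An insertion shifts all the content after it, so every chunk from the insertion point onward differs from the old one.

Insertion in the first chunk: rsync on every chunk. Signature sent = 36 KB + 36 KB + 18 KB = 90 KB

Insertion in the second chunk: rsync on chunks 2 and 3. Signature sent = 36 KB + 18 KB = 54 KB

Insertion in the last chunk: rsync on the last chunk only. Signature sent = 18 KB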
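Why an insertion touches every later chunk can be seen with a small sketch of fixed-size chunking. This is an illustration only: sizes are scaled down (a 4096-byte chunk stands in for 4 MB and a 10240-byte buffer for the 10 MB file), chunks are compared by SHA-256 at the same index, and the function names are hypothetical. It does not reproduce how Dropbox compares chunks internally.

```python
import hashlib

CHUNK = 4096  # stands in for the 4 MB chunk size


def chunk_hashes(data: bytes, size: int = CHUNK) -> list:
    """Hash of every fixed-size chunk, in order."""
    return [hashlib.sha256(data[i:i + size]).digest()
            for i in range(0, len(data), size)]


def changed_chunks(old: bytes, new: bytes, size: int = CHUNK) -> list:
    """Indices of chunks in `new` that differ from the same chunk in `old`."""
    old_h, new_h = chunk_hashes(old, size), chunk_hashes(new, size)
    return [i for i, h in enumerate(new_h)
            if i >= len(old_h) or h != old_h[i]]


def insert_byte(data: bytes, pos: int) -> bytes:
    """Insert a single byte at `pos`, shifting everything after it."""
    return data[:pos] + b"X" + data[pos:]


old_file = bytes(range(256)) * 40  # 10240 bytes: chunks of 4096, 4096, 2048

at_start = changed_chunks(old_file, insert_byte(old_file, 0))
in_middle = changed_chunks(old_file, insert_byte(old_file, 5000))
at_end = changed_chunks(old_file, insert_byte(old_file, len(old_file)))
```

With these inputs the insertion at the start changes all three chunks, the insertion in the second chunk changes chunks 2 and 3, and the insertion at the end changes only the last chunk, matching the three cases on this slide.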


Understanding Incremental Sync Of Cloud Storage Services Detail Measurement and Analysis

Experiment 2: Modification with different amounts of data

Replace or insert different amounts of data, ranging from 4 KB to 4 MB, in the middle of a 4 MB file and an 8 MB file.

When the replaced content is larger than 100 KB, the amount of data transferred is less than the amount modified, due to data compression in Dropbox.

Data insertion can show a traffic waste problem for larger insertions because the fixed-length chunks are skewed by the inserted data, a case that rsync alone would normally handle.


Understanding Incremental Sync Of Cloud Storage Services Detail Measurement and Analysis

Experiment 3: File conflict

In Figure 3, clients A and B modify the file at the same time and both sync to the server.

B's version reaches the server first, and A syncs it from the server.

But when A's modified data reaches the server and is synced, B treats it as a new file and re-downloads the whole file.

Figure 4 is more complicated, but the case is similar to Figure 3 with three file conflicts.


System Design and Implementation System Architecture

System Architecture


System Design and Implementation Delta Sharing

Delta Sharing

Normally every passive client executes rsync again to sync an update promptly, which is wasteful.

Since passive clients tend to stay online, the delta generated by the active client can be reused.

In other words, passive clients do not have to execute rsync; they retrieve the delta from the delta server.

Passive clients do not have to maintain the online state themselves, since it can be marked through the index server.

If a passive client has been offline for a long time, the previous mechanism is used (execute rsync).
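The decision a passive client makes under delta sharing can be sketched as below. Everything in this example is hypothetical: the `DeltaServer` class, the version numbers, and the rule that a client falls back to rsync when no delta is stored for its base version are assumptions used to illustrate the idea, not the Minbox protocol.

```python
class DeltaServer:
    """Keeps deltas uploaded by active clients, keyed by file and base version."""

    def __init__(self):
        self._deltas = {}

    def publish(self, path: str, base_version: int, delta: bytes) -> None:
        # Called once when the active client uploads its delta.
        self._deltas[(path, base_version)] = delta

    def fetch(self, path: str, base_version: int):
        # Returns None when no delta exists for the client's base version.
        return self._deltas.get((path, base_version))


def sync_passive_client(server: DeltaServer, path: str, local_version: int):
    """Reuse the shared delta if possible, otherwise fall back to rsync."""
    delta = server.fetch(path, local_version)
    if delta is not None:
        return ("apply_shared_delta", delta)
    return ("run_rsync", None)


server = DeltaServer()
server.publish("report.txt", base_version=7, delta=b"<delta 7 -> 8>")

online_client = sync_passive_client(server, "report.txt", local_version=7)
stale_client = sync_passive_client(server, "report.txt", local_version=3)
```

Here the client that is still at version 7 simply downloads the delta the active client already produced, while the client that fell behind gets no usable delta and has to run rsync itself.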


System Design and Implementation Chunk-Based Rsync with Similarity Detection

Similarity Detection Mechanism


System Design and Implementation Chunk-Based Rsync with Similarity Detection

Algorithm


System Design and Implementation Chunk-Based Rsync with Similarity Detection

Algorithm Summary

Use a locality-sensitive hash to detect similar chunks.

To reduce computation overhead while guaranteeing detection precision, the ImpMinHash algorithm is employed.

Non-deduplicated chunks are sliced into sub-blocks using Rabin fingerprints.

Then the smallest cyclic redundancy check (CRC) checksums are selected to identify the chunk. Finally, the Jaccard index is used to compute the similarity between chunks.
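The idea of identifying a chunk by its smallest sub-block checksums and comparing chunks with the Jaccard index can be sketched as follows. This is a generic MinHash-style illustration, not ImpMinHash itself: sub-blocks are cut at a fixed size instead of with Rabin fingerprints, and the sub-block size `sub` and sketch size `k` are arbitrary values chosen for this example.

```python
import zlib


def sketch(chunk: bytes, sub: int = 64, k: int = 8) -> set:
    """Keep the k smallest CRC-32 checksums of the chunk's sub-blocks."""
    crcs = {zlib.crc32(chunk[i:i + sub]) for i in range(0, len(chunk), sub)}
    return set(sorted(crcs)[:k])


def jaccard(a: set, b: set) -> float:
    """Jaccard index |A & B| / |A | B| of two sketches."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# 64 distinct sub-blocks of 64 bytes each.
original = b"".join(bytes([j]) * 64 for j in range(64))
# Overwrite one sub-block; the other 63 stay untouched.
modified = original[:640] + b"\xff" * 64 + original[704:]

same = jaccard(sketch(original), sketch(original))
similar = jaccard(sketch(original), sketch(modified))
```

Changing one sub-block can replace at most one checksum in the sketch, so the two sketches stay mostly identical and the chunk is still recognised as similar. Only a handful of checksums per chunk need to be stored and compared, which is what keeps the overhead low.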


System Design and Implementation Efficient conflict resolution

Efficient conflict resolution


Evaluation Modification Benchmark

Modification Benchmark

Replaying Experiment 1 shows that, unlike Dropbox, Minbox's passive client has no uplink traffic.

Replaying Experiment 2 gives the results in Figure 8 and Figure 9, where Minbox, which implements the similarity detection algorithm, outperforms Dropbox.

MinboxFD (native) uses fixed-length chunking with 4 MB chunks, while MinboxVD uses content-defined chunking (CDC) with an average chunk size of 4 MB.

In most cases, MinboxVD performs best by taking advantage of CDC to avoid the impact of content skewing.

However, for large modification workloads on the 8 MB file, MinboxFD outperforms MinboxVD, because MinboxVD slices out new chunks that are not similar to the original chunks.

After matching these chunks, MinboxVD may generate more redundant delta than MinboxFD.
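How content-defined chunking avoids content skewing can be illustrated with a toy chunker. This is a simplified sketch, not the MinboxVD implementation: a rolling sum over a 4-byte window stands in for a Rabin fingerprint, the window and divisor are tiny values chosen so the example runs on a short string, and there are no minimum or maximum chunk sizes.

```python
def cdc_chunks(data: bytes, window: int = 4, divisor: int = 32) -> list:
    """Cut a chunk after every position where the rolling window sum
    is a multiple of `divisor`, so boundaries depend only on content."""
    chunks, start, rolling = [], 0, 0
    for i, byte in enumerate(data):
        rolling += byte
        if i >= window:
            rolling -= data[i - window]  # drop the byte leaving the window
        if i >= window - 1 and rolling % divisor == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])  # trailing data after the last boundary
    return chunks


MARK = bytes([8, 8, 8, 8])  # window sum 32, so a boundary always follows it
old_file = b"alpha" + MARK + b"bravo-charlie" + MARK + b"delta-echo"
new_file = b"!" + old_file  # one byte inserted at the very start

old_chunks = cdc_chunks(old_file)
new_chunks = cdc_chunks(new_file)
unchanged_tail = len(old_chunks) - 1
```

Because each boundary is decided by the last few bytes only, the insertion disturbs just the beginning of the file: every chunk after the first old boundary reappears unchanged in the new file, whereas fixed-length chunking would shift all of them.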


Evaluation File Conflict

File Conflict

Dropbox downloads the whole file, while Minbox only needs to download the delta.

High network efficiency on Minbox.


Evaluation Comparison with other cloud services

Comparison with other cloud services

Figure 12 shows a comparison between Minbox and Seafile, which also uses CDC.

Seafile has to send the whole modified chunk, and the client downloads the whole file in each case, while Minbox only transfers the part identified by rsync.

Compared with other services such as Google Drive and OneDrive, Minbox takes advantage of the incremental sync mechanism.

On a file conflict, the other services download the whole file, while Minbox uses rsync.


Evaluation Evaluation of Additional Overhead

Evaluation of Additional Overhead

Finally, it is necessary to discuss the overhead of ImpMinHash in Minbox. We generate the ImpMinHash of a 4 MB file and record the signature size and computation time. The result is that the ImpMinHash signature has the same size as a MinHash signature, which takes few bytes compared with the rsync signature. As for the computation time of the signature, ImpMinHash consumes two additional CPU ticks in comparison to rsync.


Conclusion

Efficient Incremental Synchronization in Cloud Storage Services

Understanding Dropbox

This paper conducts comprehensive measurements on Dropbox in the file sharing scenario and unravels the incremental sync mechanism inside Dropbox.

Surpassing Dropbox

The paper also reveals the significant network traffic waste in Dropbox, and then designs and implements an efficient incremental sync system to solve these problems.

In the evaluation, Minbox significantly reduces the network traffic during sync and solves the problem of file conflict with little overhead.


Conclusion

Thank you. Any comments or questions?
