Presentation on similarity and locality based indexing for high performance data deduplication


Transcript of Presentation on similarity and locality based indexing for high performance data deduplication

Page 1: Presentation on similarity and locality based indexing for high performance data deduplication

1/35

Introduction Related Work Method Performance Evaluation Conclusion Supplementary

Similarity and Locality Based Indexing for High Performance Data Deduplication

Wen Xia 1, Hong Jiang 2, Dan Feng 1, Yu Hua 1

1 Huazhong University of Science and Technology

2 University of Nebraska-Lincoln

Presented by: Fajar Purnama 152-D8713

August 24, 2016

Presented by: Fajar Purnama Kumamoto University, Computer Science and Electrical Engineering, HICC LAB

IEEE Transactions on Computers, Vol. 64, No. 4, April 2015

Page 2: Presentation on similarity and locality based indexing for high performance data deduplication

Outline

Introduction

Related Work

Method

Performance Evaluation

Conclusion

Supplementary

Page 3: Presentation on similarity and locality based indexing for high performance data deduplication

Data Deduplication

- Definition: the process of eliminating duplicate data.

- Purpose: to reduce storage usage and save space.

- Applications: disk-to-disk backup, virtual machine storage, WAN replication, and primary storage.

Page 4: Presentation on similarity and locality based indexing for high performance data deduplication

Data Hashing

- Represents data of any size with a fixed-size value.

- Very fast lookup, O(1), compared with a linear search, O(n).
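
A minimal Python sketch of both points. It assumes SHA-1 as the hash function and a plain dict as the index; the helper names (`digest`, `linear_find`, `hashed_find`) are hypothetical and only serve the illustration.

```python
import hashlib

def digest(data: bytes) -> str:
    """Return the SHA-1 digest of data as 40 hex characters."""
    return hashlib.sha1(data).hexdigest()

# Inputs of very different sizes map to digests of the same fixed size.
small = digest(b"a")
large = digest(b"a" * 1_000_000)

# A hash table (dict) finds an entry in O(1) on average,
# while scanning a list of records takes O(n).
records = [f"record-{i}".encode() for i in range(1000)]
index = {digest(r): pos for pos, r in enumerate(records)}

def linear_find(target: bytes) -> int:
    """Compare against every record in turn: O(n)."""
    for pos, r in enumerate(records):
        if r == target:
            return pos
    return -1

def hashed_find(target: bytes) -> int:
    """Hash the target once and look it up directly: O(1) on average."""
    return index.get(digest(target), -1)
```

Both functions return the same position for a stored record; only the amount of work per lookup differs.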

Page 5: Presentation on similarity and locality based indexing for high performance data deduplication

Data Fingerprinting

- When a hash function yields a practically unique hash for each input, that hash is called a fingerprint.

- Block-based deduplication divides files into chunks and assigns each chunk a fingerprint.

- Deduplication removes chunks that share a fingerprint and replaces them with pointers.

source:https://upload.wikimedia.org/wikipedia/commons/0/09/Fingerprint.svg
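
The three bullets above can be sketched as follows. This is a simplified illustration, not the method of the paper: it assumes fixed-size chunks of 4 bytes (real systems use kilobyte-sized chunks) and SHA-1 fingerprints, and the names `deduplicate`, `restore`, and `store` are hypothetical.

```python
import hashlib

CHUNK_SIZE = 4  # bytes; tiny on purpose so the example is easy to follow

def deduplicate(data: bytes, store: dict) -> list:
    """Split data into fixed-size chunks and return a recipe of fingerprints.

    Each unique chunk is stored once in `store`; a repeated chunk only adds
    another pointer (its fingerprint) to the recipe.
    """
    recipe = []
    for i in range(0, len(data), CHUNK_SIZE):
        chunk = data[i:i + CHUNK_SIZE]
        fp = hashlib.sha1(chunk).hexdigest()
        store.setdefault(fp, chunk)  # keep only the first copy
        recipe.append(fp)
    return recipe

def restore(recipe: list, store: dict) -> bytes:
    """Rebuild the original data by following the pointers."""
    return b"".join(store[fp] for fp in recipe)

store = {}
data = b"AAAABBBBAAAACCCCAAAA"
recipe = deduplicate(data, store)
```

Here the recipe holds five pointers while the store holds only three unique chunks, and `restore(recipe, store)` returns the original bytes.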

Page 6: Presentation on similarity and locality based indexing for high performance data deduplication

Problem

The number of fingerprints is too large:

- The fingerprint index does not fit in memory, which limits performance.

- Lookups then depend on disk speed, about 1-6 MB/sec, which is too slow.

- For example, a 1 PB data set needs at least 2.5 TB of SHA-1 fingerprints.
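
The 2.5 TB figure can be reproduced with a short calculation. The slide does not state the chunk size, so the average chunk size of 8 KB below is an assumption; the 20-byte length of a SHA-1 digest is a property of the hash.

```python
DATASET_BYTES = 2**50          # 1 PB of unique data
AVG_CHUNK_BYTES = 8 * 2**10    # assumed average chunk size of 8 KB
SHA1_BYTES = 20                # a SHA-1 digest is 160 bits

num_chunks = DATASET_BYTES // AVG_CHUNK_BYTES   # 2**37 chunks
index_bytes = num_chunks * SHA1_BYTES           # one fingerprint per chunk
index_tb = index_bytes / 2**40                  # 2.5
```

Under that assumption the fingerprints alone occupy 2.5 TB, far more than the RAM of a typical backup server.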

Previous general approaches:

- Locality-based approaches, for example ChunkStash.

- Similarity-based approaches, for example Extreme Binning.

- Neither approach alone suffices for petabyte-scale data.

Page 7: Presentation on similarity and locality based indexing for high performance data deduplication

Objective

Problem Summary:

- Current techniques are too slow for today's large data sets.

- Real deployments demand faster processing.

- Imagine having only a one-day maintenance window for backup.

This work proposes Similarity-Locality (SiLo):

- Combines the similarity-based and locality-based approaches.

- To reduce Random Access Memory (RAM) usage.

- To increase throughput.

- To preserve deduplication accuracy.

Page 8: Presentation on similarity and locality based indexing for high performance data deduplication

Locality

- Normally chunks are looked up one by one, but some backup streams have high locality: across the first, second, and subsequent backups, chunks are very likely to appear in the same order.

- Locality approach: exploit this locality. In the figure, a lookup of fingerprint 4a prefetches the fingerprints "4a, c7, 9e".

- However, this approach is slow on backup streams with weak locality.
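
A toy sketch of the prefetching idea, under stated assumptions: fingerprints from the previous backup are kept on "disk" in containers that preserve stream order, and a miss in the RAM cache loads the whole container. The class `LocalityIndex` and its fields are hypothetical and do not reproduce ChunkStash or any other real system.

```python
class LocalityIndex:
    """Toy locality-based fingerprint index with container prefetching."""

    def __init__(self, containers):
        self.containers = containers          # list of fingerprint lists
        # Stand-in for the slow on-disk index: fingerprint -> container id.
        self.location = {fp: cid
                         for cid, c in enumerate(containers) for fp in c}
        self.cache = set()                    # fingerprints held in RAM
        self.disk_lookups = 0

    def is_duplicate(self, fp):
        if fp in self.cache:
            return True                       # served from RAM
        self.disk_lookups += 1                # one slow on-disk lookup
        cid = self.location.get(fp)
        if cid is None:
            return False                      # new chunk
        self.cache.update(self.containers[cid])  # prefetch the neighbors
        return True

previous_backup = [["4a", "c7", "9e"], ["1b", "2f", "d0"]]
index = LocalityIndex(previous_backup)
stream = ["4a", "c7", "9e", "1b", "2f", "d0"]  # same order as before
results = [index.is_duplicate(fp) for fp in stream]
```

With this stream, six lookups cost only two disk accesses, because looking up 4a also brings c7 and 9e into RAM. A stream whose chunks are scattered over many containers would trigger a disk access for almost every chunk, which is the weak-locality case in the last bullet.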

Page 9: Presentation on similarity and locality based indexing for high performance data deduplication

Similarity

- Instead of looking up each chunk, or each group of neighboring chunks (locality), lookups are done per file.

- Similarity approach: in the figure, file V1 is similar to file V2, and each file is represented in the lookup by its minimal fingerprint, 2f; duplicate chunks are then detected between the two files.

- Although much faster than the locality approach, it can sacrifice deduplication accuracy.
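
A toy sketch of the similarity idea and of its accuracy loss. It assumes that the RAM index holds one entry per file (the minimal fingerprint) pointing to a bin of that file's chunk fingerprints; the class `SimilarityIndex` is hypothetical and only loosely follows the Extreme Binning idea named earlier.

```python
def representative(fingerprints):
    """Minimal fingerprint of a file, used as its similarity signature."""
    return min(fingerprints)

class SimilarityIndex:
    """Toy similarity index: one RAM entry per representative fingerprint."""

    def __init__(self):
        self.bins = {}  # representative fingerprint -> set of fingerprints

    def deduplicate(self, fingerprints):
        """Return the fingerprints judged new, then record the file."""
        rep = representative(fingerprints)
        known = self.bins.setdefault(rep, set())
        new = [fp for fp in fingerprints if fp not in known]
        known.update(fingerprints)
        return new

index = SimilarityIndex()
v1 = ["2f", "4a", "c7", "9e"]
v2 = ["2f", "4a", "c7", "b3"]  # similar to v1: same minimal fingerprint
v3 = ["1b", "4a", "c7", "9e"]  # shares chunks with v1, but its minimum is 1b

new_v1 = index.deduplicate(v1)
new_v2 = index.deduplicate(v2)
new_v3 = index.deduplicate(v3)
```

V2 is matched to V1 through the shared minimum 2f, so only b3 is stored. V3 lands in a different bin, so 4a, c7, and 9e are stored again even though they are duplicates, which illustrates the accuracy loss in the last bullet.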

Page 10: Presentation on similarity and locality based indexing for high performance data deduplication

Motivation

The similarity approach can cover the weaknesses of the locality approach, and vice versa.

Page 11: Presentation on similarity and locality based indexing for high performance data deduplication

Duplicates Eliminated vs. Similarity Degree

Page 12: Presentation on similarity and locality based indexing for high performance data deduplication

Segments and Blocks

Many small files produce many fingerprints, and large files weaken similarity detection, so it is better to:

- For small files: combine them into segments to reduce the number of fingerprints.

- For large files: divide them into segments to improve similarity detection.

- Group contiguous segments, in stream order, into blocks (preserves locality).
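
The three steps above can be sketched as follows. This is a simplification: segments are cut by a fixed number of chunk fingerprints rather than by size in bytes, and the function names and parameters (`build_segments`, `build_blocks`, `segment_size`, `segments_per_block`) are hypothetical.

```python
def build_segments(files, segment_size):
    """Pack per-file fingerprint lists into segments, in stream order.

    Each segment holds at most `segment_size` fingerprints, so small files
    end up sharing a segment and a large file spreads over several.
    """
    segments, current = [], []
    for fingerprints in files:
        for fp in fingerprints:
            current.append(fp)
            if len(current) == segment_size:
                segments.append(current)
                current = []
    if current:
        segments.append(current)  # last, possibly shorter, segment
    return segments

def build_blocks(segments, segments_per_block):
    """Group consecutive segments into blocks to preserve stream locality."""
    return [segments[i:i + segments_per_block]
            for i in range(0, len(segments), segments_per_block)]

files = [
    ["a1", "a2"],                                      # small file
    ["b1", "b2"],                                      # small file
    ["c1", "c2", "c3", "c4", "c5", "c6", "c7", "c8"],  # large file
]
segments = build_segments(files, segment_size=4)
blocks = build_blocks(segments, segments_per_block=2)
```

The two small files share the first segment, the large file becomes two segments, and the three segments form two blocks in their original order.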

Page 13: Presentation on similarity and locality based indexing for high performance data deduplication

Workflow

Page 14: Presentation on similarity and locality based indexing for high performance data deduplication

SiLo System Architecture

Page 15: Presentation on similarity and locality based indexing for high performance data deduplication

Experimental Setup

This experiment evaluates SiLo against ChunkStash, which implements the locality-based approach, and Extreme Binning, which implements the similarity-based approach. The hardware configuration includes a quad-core CPU running at 2.4 GHz, 4 GB of RAM, two gigabit network interface cards, and two 500 GB 7200 rpm hard disks. The data sets used are as follows:

Page 16: Presentation on similarity and locality based indexing for high performance data deduplication

Segment and Block Size vs. Accuracy

Small segments = high similarity exposure; large blocks = high locality.

Page 17: Presentation on similarity and locality based indexing for high performance data deduplication

Segment and Block Size vs. Accuracy

Small segments = high similarity exposure; large blocks = high locality.

Page 18: Presentation on similarity and locality based indexing for high performance data deduplication

Segment and Block Size vs. RAM Usage

Small segments = more fingerprints; large blocks = more unrelated segments.

Page 19: Presentation on similarity and locality based indexing for high performance data deduplication

Segment and Block Size vs. RAM Usage

Small segments = more fingerprints; large blocks = more unrelated segments.

Page 20: Presentation on similarity and locality based indexing for high performance data deduplication

Comparison of Duplicates Eliminated by 4 State-of-the-Art Methods

Page 21: Presentation on similarity and locality based indexing for high performance data deduplication

Locality 100%, SiLo ≈ 100%, Similarity ≈ 75%

Page 22: Presentation on similarity and locality based indexing for high performance data deduplication

Comparison of RAM Usage of 4 State-of-the-Art Methods

Page 23: Presentation on similarity and locality based indexing for high performance data deduplication

SiLo Low, Similarity Medium, Locality High

Page 24: Presentation on similarity and locality based indexing for high performance data deduplication

Comparison of Throughput of 4 State-of-the-Art Methods

Page 25: Presentation on similarity and locality based indexing for high performance data deduplication

SiLo Fast, Similarity Medium, Locality Slow

Page 26: Presentation on similarity and locality based indexing for high performance data deduplication

Conclusion

This work presented SiLo, a deduplication system that exploits both similarity and locality in backup streams to achieve:

High Deduplication Accuracy

Lower RAM Usage

Higher Throughput

Page 27: Presentation on similarity and locality based indexing for high performance data deduplication

Thank you

Any comments or questions?

Page 28: Presentation on similarity and locality based indexing for high performance data deduplication

List of Related Work

- Content-Defined Chunking (CDC) using Rabin fingerprints, from the Low-Bandwidth Network File System (LBFS).

- Many other chunking studies and incremental file synchronization.

- Other studies focus on fingerprint indexing.

- Sparse Indexing, DDFS, ChunkStash, and other locality approaches.

- Extreme Binning, a similarity approach.

- Load distribution, multithreading, pipelining, parallel computation, etc.
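
As an illustration of the first item in the list, content-defined chunking cuts a chunk wherever a hash over a sliding window of bytes matches a pattern, so boundaries follow the content rather than fixed offsets. The sketch below is not LBFS code: a simple polynomial rolling hash stands in for Rabin fingerprinting, and all constants (`WINDOW`, `BASE`, `MOD`, `MASK`) are assumed values chosen for the example.

```python
WINDOW = 16           # bytes covered by the rolling hash (assumed)
BASE = 257            # polynomial base (assumed)
MOD = (1 << 31) - 1   # modulus of the rolling hash (assumed)
MASK = (1 << 6) - 1   # cut when the low 6 bits of the hash are zero

def cdc_chunks(data: bytes) -> list:
    """Split data at content-defined boundaries.

    A boundary is declared after byte i when the hash of the last WINDOW
    bytes has its low bits equal to zero and the chunk holds at least
    WINDOW bytes.
    """
    chunks, start, h = [], 0, 0
    top = pow(BASE, WINDOW - 1, MOD)  # weight of the byte leaving the window
    for i, byte in enumerate(data):
        if i - start >= WINDOW:
            h = (h - data[i - WINDOW] * top) % MOD  # drop the oldest byte
        h = (h * BASE + byte) % MOD                 # add the newest byte
        if i - start + 1 >= WINDOW and (h & MASK) == 0:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0                     # begin a new chunk
    if start < len(data):
        chunks.append(data[start:])                 # trailing chunk
    return chunks

data = bytes((i * 31 + 7) % 251 for i in range(2000))
chunks = cdc_chunks(data)
```

Because a cut depends only on the bytes inside the window, inserting a few bytes early in a file tends to change only the nearby chunks, whereas fixed-size chunking would shift every later boundary.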

Page 29: Presentation on similarity and locality based indexing for high performance data deduplication

Similarity and Locality

The left figure shows the distribution of similarity degree, and the right figure shows that Extreme Binning (a similarity approach) does not eliminate all duplicate data.

Page 30: Presentation on similarity and locality based indexing for high performance data deduplication

Deduplication Server Data Structure

- The Similarity Hash Table (SHTable) provides similarity detection for input segments, and the Locality Hash Table (LHTable) serves to quickly index and filter out duplicate chunks. The write buffer and read cache hold recently accessed blocks to exploit backup stream locality.
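
How these structures might interact can be sketched in a few lines. This is a toy model under stated assumptions, not the implementation from the paper: the representative fingerprint of a segment is taken to be its minimal fingerprint, each block's LHTable is a plain set, the read cache is a small LRU, and the class and method names (`SiLoSketch`, `add_segment`) are hypothetical.

```python
from collections import OrderedDict

class SiLoSketch:
    """Toy model of the SHTable, LHTable, read cache, and write buffer."""

    def __init__(self, segments_per_block=2, cache_blocks=2):
        self.sh_table = {}               # representative fp -> block id
        self.blocks = {}                 # "disk": block id -> LHTable (set)
        self.read_cache = OrderedDict()  # block id -> LHTable, LRU order
        self.write_buffer = set()        # LHTable of the block being filled
        self.pending_reps = []           # representatives of buffered segments
        self.segments_per_block = segments_per_block
        self.cache_blocks = cache_blocks
        self.next_block_id = 0

    def _load_block(self, block_id):
        """Bring a block into the read cache, evicting the oldest if full."""
        if block_id in self.read_cache:
            self.read_cache.move_to_end(block_id)
            return
        self.read_cache[block_id] = self.blocks[block_id]
        if len(self.read_cache) > self.cache_blocks:
            self.read_cache.popitem(last=False)

    def _flush(self):
        """Write the full block to "disk" and index its segments."""
        block_id = self.next_block_id
        self.next_block_id += 1
        self.blocks[block_id] = self.write_buffer
        for rep in self.pending_reps:
            self.sh_table[rep] = block_id
        self.write_buffer, self.pending_reps = set(), []

    def add_segment(self, fingerprints):
        """Return the fingerprints of this segment judged to be new chunks."""
        rep = min(fingerprints)                   # similarity signature
        if rep in self.sh_table:                  # a similar segment exists
            self._load_block(self.sh_table[rep])  # its neighbors come along
        new = []
        for fp in fingerprints:
            known = fp in self.write_buffer or any(
                fp in lh for lh in self.read_cache.values())
            if not known:
                new.append(fp)
            self.write_buffer.add(fp)
        self.pending_reps.append(rep)
        if len(self.pending_reps) == self.segments_per_block:
            self._flush()
        return new

silo = SiLoSketch()
first = [silo.add_segment(["2f", "4a", "c7"]),   # backup 1, segment 1
         silo.add_segment(["1b", "9e", "d0"])]   # backup 1, segment 2
second = [silo.add_segment(["2f", "4a", "e5"]),  # similar to segment 1
          silo.add_segment(["3c", "9e", "d0"])]  # minimum changed to 3c
```

In the second backup, the first segment is found through the SHTable, which loads the whole block into the read cache. The next segment has a new minimal fingerprint, so similarity detection misses it, yet its duplicates 9e and d0 are still caught because the block of its neighbor is already cached. Only e5 and 3c are stored as new chunks.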

Page 31: Presentation on similarity and locality based indexing for high performance data deduplication

Similarity Algorithm Data Structure

Small files are grouped into segments to minimize the number of fingerprints, while large files are divided into segments for more similarity exposure.

Page 32: Presentation on similarity and locality based indexing for high performance data deduplication

Locality-Based Stateless Routing Algorithm

Page 33: Presentation on similarity and locality based indexing for high performance data deduplication

Flowchart of SiLo Deduplication

Page 34: Presentation on similarity and locality based indexing for high performance data deduplication

Throughput due to Blocks in Read Cache

Deduplication throughput increases with the number of blocks in the read cache, at the cost of more RAM overhead. Beyond sixteen blocks the throughput increases only slowly, and it even decreases for two backup sets.

Page 35: Presentation on similarity and locality based indexing for high performance data deduplication

Load Distribution of This System
