Presentation on similarity and locality based indexing for high performance data deduplication

1/35


Similarity and Locality Based Indexing for High Performance Data Deduplication

Wen Xia 1, Hong Jiang 2, Dan Feng 1, Yu Hua 1

1 Huazhong University of Science and Technology

2 University of Nebraska-Lincoln

Presented by: Fajar Purnama 152-D8713

August 24, 2016

Presented by: Fajar Purnama Kumamoto University, Computer Science and Electrical Engineering, HICC LAB

IEEE Transactions on Computers, Vol. 64, No. 4, April 2015

2/35


Outline

Introduction

Related Work

Method

Performance Evaluation

Conclusion

Supplementary



3/35


Data Deduplication

Definition: the process of eliminating duplicate data.

Purpose: to reduce storage usage and save space.

Applications: disk-to-disk backup, virtual machine storage, WAN replication, and primary storage.



4/35


Data Hashing

- Represents data of any size with a fixed-size value.

- Very fast index lookup, O(1), compared with a manual (linear) search, O(n).
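
The two points above can be illustrated with a minimal Python sketch. SHA-1 is chosen here because the slides mention it later; the helper names (`fixed_size_digest`, `lookup_hashed`, `lookup_scan`) are made up for this example and are not from the paper.

```python
import hashlib

def fixed_size_digest(data: bytes) -> str:
    """Return the SHA-1 digest of data as a 40-character hex string."""
    return hashlib.sha1(data).hexdigest()

# Inputs of very different sizes map to digests of the same length.
small = fixed_size_digest(b"a")
large = fixed_size_digest(b"a" * 1_000_000)
digest_lengths = (len(small), len(large))

# A dict keyed by digest gives average O(1) lookups;
# the same data kept in a list needs an O(n) scan.
index = {fixed_size_digest(bytes([i])): i for i in range(256)}
as_list = list(index.items())

def lookup_hashed(digest):
    """Average O(1): one hash-table probe."""
    return index.get(digest)

def lookup_scan(digest):
    """O(n): compare against every entry until a match is found."""
    for key, value in as_list:
        if key == digest:
            return value
    return None
```

Both lookup functions return the same answer; only the amount of work per lookup differs.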



5/35


Data Fingerprinting

- When a hash function generates a practically unique hash for each piece of data, the hash is called a fingerprint.

- Block-based deduplication divides files into chunks and assigns a fingerprint to each chunk.

- Deduplication removes chunks that have the same fingerprint and replaces them with pointers.

source:https://upload.wikimedia.org/wikipedia/commons/0/09/Fingerprint.svg
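
As a rough sketch of the block-based scheme described on this slide, the Python below splits data into fixed-size chunks, fingerprints each chunk with SHA-1, and keeps only one copy per fingerprint. The tiny chunk size, the in-memory `store` dict, and the function names are illustrative assumptions; real systems typically use kilobyte-sized chunks and an on-disk chunk store.

```python
import hashlib

CHUNK_SIZE = 4  # bytes; unrealistically small, for illustration only

def deduplicate(data: bytes, store: dict) -> list:
    """Split data into chunks and store each unique chunk once.

    Returns the "recipe": the ordered list of fingerprints (pointers)
    needed to rebuild the data.
    """
    recipe = []
    for offset in range(0, len(data), CHUNK_SIZE):
        chunk = data[offset:offset + CHUNK_SIZE]
        fingerprint = hashlib.sha1(chunk).hexdigest()
        store.setdefault(fingerprint, chunk)  # keep only the first copy
        recipe.append(fingerprint)
    return recipe

def restore(recipe: list, store: dict) -> bytes:
    """Rebuild the original data by following the pointers."""
    return b"".join(store[fingerprint] for fingerprint in recipe)

store = {}
original = b"AAAABBBBAAAACCCC"
recipe = deduplicate(original, store)
```

Here the recipe has four entries but the store holds only three chunks, because the chunk `AAAA` occurs twice and is stored once.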



6/35


Problem

The number of fingerprints is too large:

- The fingerprint index does not fit into memory (limited performance).

- Lookups must then rely on disk, at a speed of 1-6 MB/sec (too slow).

- For example, a data set of 1 PB needs at least 2.5 TB of SHA-1 fingerprints.
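
The 2.5 TB figure can be reproduced with a short calculation. The slide does not state the chunk size; the sketch below assumes an average chunk size of 8 KB and binary units (1 PB = 2^50 bytes), which are assumptions made here to make the arithmetic work out.

```python
DATASET_BYTES = 2**50          # 1 PB of unique data (binary units, assumed)
AVG_CHUNK_BYTES = 8 * 2**10    # assumed 8 KB average chunk size (not on the slide)
SHA1_BYTES = 20                # a SHA-1 digest is 160 bits = 20 bytes

chunks = DATASET_BYTES // AVG_CHUNK_BYTES   # number of chunks to fingerprint
index_bytes = chunks * SHA1_BYTES           # total size of all fingerprints
index_tb = index_bytes / 2**40              # expressed in TB

print(f"{chunks} chunks -> {index_tb} TB of fingerprints")
```

Under these assumptions the data set has 2^37 chunks, and 2^37 x 20 bytes is exactly 2.5 TB, far more than the RAM of a typical server.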

Previous general approaches:

- Locality approach, for example ChunkStash.

- Similarity approach, for example Extreme Binning.

- But neither of them alone suffices for petabyte-scale data.



7/35


Objective

Problem Summary:

- Current techniques are too slow for today's large data sets.

- Real deployments demand faster processing.

- Imagine having only a one-day maintenance window for backup.

This work proposes Similarity-Locality (SiLo):

- Combines the similarity-based and locality-based approaches.

- To reduce Random Access Memory (RAM) usage.

- To increase throughput.

- To maintain deduplication accuracy.



8/35


Locality

- Normally, chunks are looked up one by one, but some backup streams have high locality: across the first, second, and subsequent backups, chunks are very likely to appear in the same order.

- Locality approach: exploit this locality. In the figure below, a lookup of fingerprint 4a prefetches the fingerprints "4a, c7, 9e".

- However, this approach is slow on backup streams with weak locality.
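
The prefetching idea on this slide can be sketched as follows. This is a toy model, not the ChunkStash or SiLo implementation: the class name `LocalityIndex`, the container size of three, and the `location` dict (which stands in for the on-disk index and would not fit in RAM in a real system) are all assumptions for illustration.

```python
class LocalityIndex:
    """Toy locality-based index: a cache miss prefetches neighboring fingerprints."""

    def __init__(self, stored_stream, container_size=3):
        # "Disk": fingerprints grouped into containers in the order they were stored.
        self.containers = [
            stored_stream[i:i + container_size]
            for i in range(0, len(stored_stream), container_size)
        ]
        # Stand-in for the full on-disk index (fingerprint -> container number).
        self.location = {
            fp: number
            for number, container in enumerate(self.containers)
            for fp in container
        }
        self.cache = set()    # fingerprints currently held in RAM
        self.disk_reads = 0   # counts simulated disk index lookups

    def is_duplicate(self, fp):
        if fp in self.cache:
            return True                       # served from RAM, no disk access
        self.disk_reads += 1                  # cache miss: consult the disk index
        container_id = self.location.get(fp)
        if container_id is None:
            return False                      # a new chunk
        # Prefetch every fingerprint stored next to fp.
        self.cache.update(self.containers[container_id])
        return True

first_backup = ["4a", "c7", "9e", "1b", "2f", "d3"]
index = LocalityIndex(first_backup)
# The second backup repeats the same order and adds one new chunk, "zz".
results = [index.is_duplicate(fp) for fp in first_backup + ["zz"]]
```

In this run the lookup of 4a brings in "4a, c7, 9e", so the seven lookups cost only three simulated disk reads. When the stream order is not preserved, prefetched neighbors are less likely to be the next fingerprints requested, which is the weak-locality case the slide warns about.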



9/35


Similarity

- Instead of looking up each chunk, or each group of neighboring chunks (locality), lookups are done per file.

- Similarity approach: the figure below shows that file V1 is similar to file V2. Each file is represented in the lookup by its minimal fingerprint, 2f, and duplicate chunks are then detected between the two files.

- Although it is much faster than the locality approach, it can sacrifice deduplication accuracy.
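
A minimal sketch of the similarity idea, assuming each file is already reduced to a list of chunk fingerprints. The class `SimilarityIndex` and its `bins` dict are hypothetical names used only here; this is loosely modeled on the minimal-fingerprint lookup described above and is not the Extreme Binning implementation.

```python
def representative(fingerprints):
    """The minimal fingerprint stands for the whole file."""
    return min(fingerprints)

class SimilarityIndex:
    """Toy similarity-based index: one RAM entry per representative fingerprint."""

    def __init__(self):
        self.bins = {}  # representative fingerprint -> set of chunk fingerprints

    def store(self, fingerprints):
        """Deduplicate a file against its similar bin only; return the new chunks."""
        known = self.bins.setdefault(representative(fingerprints), set())
        new_chunks = []
        for fp in fingerprints:
            if fp not in known:
                known.add(fp)
                new_chunks.append(fp)
        return new_chunks

index = SimilarityIndex()
v1 = ["2f", "4a", "c7", "9e"]
v2 = ["2f", "4a", "c7", "b1"]   # shares the representative 2f with V1
v3 = ["4a", "c7", "9e", "d0"]   # overlaps V1, but its representative is 4a

new_v1 = index.store(v1)
new_v2 = index.store(v2)
new_v3 = index.store(v3)
```

V2 is matched to V1 through the shared minimal fingerprint 2f, so only b1 is new. V3 shares three chunks with V1 but has a different minimal fingerprint, so those duplicates go undetected; this is the accuracy loss mentioned in the last bullet.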



10/35


Motivation

The similarity approach can compensate for the weaknesses of the locality approach, and vice versa.



11/35


Duplicates Eliminated vs. Similarity Degree



12/35


Segments and Blocks

Many small files produce many fingerprints, and large files reduce similarity detection, so it is better to:

- For small files: combine them into segments to reduce the number of fingerprints.

- For large files: divide them into segments to increase similarity detection.

- Group segments, in stream order, into blocks (to preserve locality).
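
The grouping described above might look roughly like the sketch below. It works on file sizes only and cuts segments at a fixed byte count, which is a simplification made here; the function names and the sizes used in the example are hypothetical and not taken from the paper.

```python
def build_segments(files, segment_size):
    """Pack small files together and split large files into fixed-size segments.

    files: list of (name, size) pairs in backup-stream order.
    Returns a list of segments; each segment is a list of
    (name, offset, length) pieces.
    """
    segments, current, used = [], [], 0
    for name, size in files:
        offset = 0
        while offset < size:
            take = min(size - offset, segment_size - used)
            current.append((name, offset, take))
            offset += take
            used += take
            if used == segment_size:      # segment is full: close it
                segments.append(current)
                current, used = [], 0
    if current:                           # last, partially filled segment
        segments.append(current)
    return segments

def build_blocks(segments, segments_per_block):
    """Group contiguous segments into blocks, keeping the stream order."""
    return [
        segments[i:i + segments_per_block]
        for i in range(0, len(segments), segments_per_block)
    ]

files = [("a", 2), ("b", 3), ("big", 10), ("c", 1)]
segments = build_segments(files, segment_size=5)
blocks = build_blocks(segments, segments_per_block=2)
```

In this example the small files a and b share one segment, the large file big is split across two segments, and the four resulting segments form two blocks in their original order.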



13/35


Workflow



14/35


SiLo System Architecture



15/35


Experimental Setup

This experiment evaluates SiLo against ChunkStash, which implements the locality-based approach, and Extreme Binning, which implements the similarity-based approach. The hardware configuration includes a quad-core CPU running at 2.4 GHz, 4 GB of RAM, two gigabit network interface cards, and two 500 GB 7200 rpm hard disks. The data sets used in the experiment are as follows:



16/35


Segment and Block Size to Accuracy

Small segments = high similarity exposure, large blocks = high locality.






18/35


Segment and Block Size to RAM

Small segments = more fingerprints, large blocks = more unrelated segments.






20/35


Comparison of Duplicates Eliminated by 4 State-of-the-Art Methods



21/35


Locality 100%, SiLo ≈ 100%, Similarity ≈ 75%



22/35


Comparison of RAM Usage of 4 State-of-the-Art Methods



23/35


SiLo Low, Similarity Medium, Locality High



24/35


Comparison of Throughput of 4 State-of-the-Art Methods



25/35


SiLo Fast, Similarity Medium, Locality Slow



26/35


Conclusion

This work presented SiLo, a deduplication system that exploits both similarity and locality in backup streams to achieve:

High Deduplication Accuracy

Lower RAM Usage

Higher Throughput



27/35


Thank you

Any comments or questions?



28/35


List of Related Work

- Content-Defined Chunking (CDC), adopting Rabin fingerprints, used by the Low-Bandwidth Network File System (LBFS).

- Many other chunking studies and incremental file synchronization.

- Other studies focus on fingerprint indexing.

- Sparse Indexing, DDFS, ChunkStash, and other locality approaches.

- Extreme Binning, a similarity approach.

- Load distribution, multithreading, pipelining, parallel computation, etc.



29/35


Similarity and Locality

The left figure shows the distribution of similarity degree, and the right figure shows that not all duplicate data is eliminated by Extreme Binning (a similarity approach).



30/35


Deduplication Server Data Structure

- The Similarity Hash Table (SHTable) provides similarity detection for input segments, and the Locality Hash Table (LHTable) serves to quickly index and filter out duplicate chunks. The write buffer and read cache contain the recently accessed blocks to exploit the backup stream locality.
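
How these structures could interact is sketched below in heavily simplified form. Everything here is an assumption made for illustration: the class `SiLoSketch`, the use of the minimal fingerprint as the segment representative, the `disk_blocks` dict standing in for on-disk blocks, the LRU eviction policy, and the plain sets playing the role of the LHTable of each cached block. It is not the authors' implementation.

```python
from collections import OrderedDict

class SiLoSketch:
    """Toy model of an SHTable, read cache, and write buffer working together."""

    def __init__(self, segments_per_block=2, cache_blocks=2):
        self.segments_per_block = segments_per_block
        self.cache_blocks = cache_blocks
        self.sh_table = {}               # representative fingerprint -> block id
        self.disk_blocks = {}            # block id -> chunk fingerprints ("disk")
        self.read_cache = OrderedDict()  # block id -> chunk fingerprints (LRU)
        self.write_buffer = set()        # fingerprints of the block being built
        self.buffered_segments = 0
        self.block_id = 0

    def _load_block(self, block_id):
        """Bring a whole block into the read cache, evicting the oldest if full."""
        if block_id in self.read_cache:
            self.read_cache.move_to_end(block_id)
            return
        self.read_cache[block_id] = self.disk_blocks[block_id]
        if len(self.read_cache) > self.cache_blocks:
            self.read_cache.popitem(last=False)

    def _is_duplicate(self, fp):
        """Check the write buffer and every cached block (the locality filter)."""
        if fp in self.write_buffer:
            return True
        return any(fp in fps for fps in self.read_cache.values())

    def dedup_segment(self, fingerprints):
        """Deduplicate one segment and return the chunks that are new."""
        rep = min(fingerprints)
        similar_block = self.sh_table.get(rep)      # similarity detection
        if similar_block in self.disk_blocks:
            self._load_block(similar_block)         # prefetch the whole block
        new_chunks = []
        for fp in fingerprints:
            if not self._is_duplicate(fp):
                new_chunks.append(fp)
            self.write_buffer.add(fp)
        self.sh_table[rep] = self.block_id
        self.buffered_segments += 1
        if self.buffered_segments == self.segments_per_block:
            # The block is full: flush it to "disk" and start a new one.
            self.disk_blocks[self.block_id] = self.write_buffer
            self.write_buffer, self.buffered_segments = set(), 0
            self.block_id += 1
        return new_chunks

silo = SiLoSketch()
# First backup: two segments, stored together as block 0.
first = [silo.dedup_segment(["2f", "4a", "c7"]),
         silo.dedup_segment(["9e", "b1", "d0"])]
# Second backup: the second segment's representative changed from 9e to a0.
second = [silo.dedup_segment(["2f", "4a", "e5"]),
          silo.dedup_segment(["a0", "b1", "d0"])]
```

In the second backup, the first segment matches block 0 through its representative 2f, which loads the whole block into the read cache. The second segment has no SHTable match, yet its duplicates b1 and d0 are still found because the block is already cached. That is the sense in which locality covers for a similarity miss in this sketch.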



31/35


Similarity Algorithm Data Structure

Small files are grouped into segments to minimize the number of fingerprints, while large files are divided into segments for more similarity exposure.



32/35


Locality-Based Stateless Routing Algorithm



33/35


Flowchart of SiLo Deduplication



34/35


Throughput due to blocks in read cache

The deduplication throughput increases with the number of blocks in the read cache, but this results in more RAM overhead. Beyond sixteen blocks, the throughput increases only slowly, and it even decreases for two of the backup sets.



35/35


Load Distribution of This System


