Deduplicationjannen/teaching/s19/cs333/meetings/...Geoff Kuenning Gala Yadgar Sources of Duplicates...

29
Deduplication CSCI 333 Spring 2019

Transcript of Deduplicationjannen/teaching/s19/cs333/meetings/...Geoff Kuenning Gala Yadgar Sources of Duplicates...

Page 1: Deduplicationjannen/teaching/s19/cs333/meetings/...Geoff Kuenning Gala Yadgar Sources of Duplicates •Different people store the same files –Shared documents, code development –Popular

Deduplication

CSCI 333Spring 2019

Page 2: Deduplicationjannen/teaching/s19/cs333/meetings/...Geoff Kuenning Gala Yadgar Sources of Duplicates •Different people store the same files –Shared documents, code development –Popular

Logistics

• Lab 2a/b• Final Project• Final Exam• Grades

2

Page 3: Deduplicationjannen/teaching/s19/cs333/meetings/...Geoff Kuenning Gala Yadgar Sources of Duplicates •Different people store the same files –Shared documents, code development –Popular

Last Class

• BetrFS [FAST ‘15]

– Linux file system using Be-trees• Metadata Be-tree: path -> struct stat• Data in Be-tree: path|{block#} -> 4KiB block

– Schema maps VFS operations to efficient Be-tree operations• Upserts, Range queries

– Next iteration [FAST ‘16] : fixed slowest operations• Rangecast delete messages• “Zones”• Late-binding journal

3

Page 4: Deduplicationjannen/teaching/s19/cs333/meetings/...Geoff Kuenning Gala Yadgar Sources of Duplicates •Different people store the same files –Shared documents, code development –Popular

This Class

• Introduction to Deduplication– Big picture idea– Design choices and tradeoffs– Open questions

• Slides from Gala Yadgar & Geoff Kuenning, presented at Dagstuhl

• I’ve added new slides (slides without borders) for extra context

4

Page 5: Deduplicationjannen/teaching/s19/cs333/meetings/...Geoff Kuenning Gala Yadgar Sources of Duplicates •Different people store the same files –Shared documents, code development –Popular

Deduplication

Geoff Kuenning Gala Yadgar

Page 6: Deduplicationjannen/teaching/s19/cs333/meetings/...Geoff Kuenning Gala Yadgar Sources of Duplicates •Different people store the same files –Shared documents, code development –Popular

Sources of Duplicates• Different people store the same files– Shared documents, code development– Popular photos, videos, etc.

• May also share blocks– Attachments– Configuration files– Company logo and other headers

à Deduplication!6

Page 7: Deduplicationjannen/teaching/s19/cs333/meetings/...Geoff Kuenning Gala Yadgar Sources of Duplicates •Different people store the same files –Shared documents, code development –Popular

Deduplication

• Dedup(e) is one form of compression• High-level goal: identify duplicate objects and

eliminate redundant copies– How should we define a duplicate object?– What makes a copy “redundant”?

• Answers are application-dependent and some of the more interesting research questions!

7

Page 8: Deduplicationjannen/teaching/s19/cs333/meetings/...Geoff Kuenning Gala Yadgar Sources of Duplicates •Different people store the same files –Shared documents, code development –Popular

857 Desktops at Microsoft

D. Meyer, W. Bolosky. A Study of Practical Deduplication. FAST 20118

Page 9: Deduplicationjannen/teaching/s19/cs333/meetings/...Geoff Kuenning Gala Yadgar Sources of Duplicates •Different people store the same files –Shared documents, code development –Popular

“Naïve” DeduplicationFor each new file

Compare each block to all existing blocksIf new, write block and add pointerIf duplicate, add pointer to existing copy

9

File1 File3File2

Are we done?

Page 10: Deduplicationjannen/teaching/s19/cs333/meetings/...Geoff Kuenning Gala Yadgar Sources of Duplicates •Different people store the same files –Shared documents, code development –Popular

Identifying Duplicates

• It’s unreasonable to “Compare each block to all existing blocks”

àFingerprintsCryptographic hash of block contentLow collision probability

10

RAM

Page 11: Deduplicationjannen/teaching/s19/cs333/meetings/...Geoff Kuenning Gala Yadgar Sources of Duplicates •Different people store the same files –Shared documents, code development –Popular

Dedup Fingerprints• Goal: uniquely identify an object’s contents• How big should a fingerprint be?– Ideally, large enough that the probability of a collision is

lower than the probability of a hardware error• MD5: 16-byte hash• SHA-1: 20-byte hash

• Technique: system stores a map (index) between each object’s fingerprint and each object’s location– Compare a new object’s fingerprint against all existing

fingerprints, looking for a match– Scales with number of unique objects, not size of objects

11

Page 12: Deduplicationjannen/teaching/s19/cs333/meetings/...Geoff Kuenning Gala Yadgar Sources of Duplicates •Different people store the same files –Shared documents, code development –Popular

Identifying Duplicates

• It’s unreasonable to “Compare each block to all existing blocks”

àFingerprintsCryptographic hash of block contentLow collision probability

• It’s also unreasonable to compare to all fingerprints…àFingerprint cache

12

RAM

RAM

Page 13: Deduplicationjannen/teaching/s19/cs333/meetings/...Geoff Kuenning Gala Yadgar Sources of Duplicates •Different people store the same files –Shared documents, code development –Popular

Fingerprint Lookup• How should we store the fingerprints?• Every unique block is a miss à miss rate ≥ 40%• One solution: Bloom filter

• Challenge: 2% false positive rate à 1TB for 4PB of data13

RAM

Insert Insert

Lookup(negative)

Lookup(false positive)

lookup

Page 14: Deduplicationjannen/teaching/s19/cs333/meetings/...Geoff Kuenning Gala Yadgar Sources of Duplicates •Different people store the same files –Shared documents, code development –Popular

How To Implement a Cache?

• (Bloom) Filters help us determine if a fingerprint exists– We still need to do an I/O to find the mapping

• Locality in fingerprints?– If we sort our index by fingerprint: cryptographic

hash destroys all notions of locality– What if we grouped fingerprints by temporal

locality of writes?

14

Page 15: Deduplicationjannen/teaching/s19/cs333/meetings/...Geoff Kuenning Gala Yadgar Sources of Duplicates •Different people store the same files –Shared documents, code development –Popular

Reading and Restoring

• How long does it take to read File1?• How long does it take to read File3?

• Challenge: when is it better to store the duplicates?

15

File1 File3File2

Page 16: Deduplicationjannen/teaching/s19/cs333/meetings/...Geoff Kuenning Gala Yadgar Sources of Duplicates •Different people store the same files –Shared documents, code development –Popular

Write Path

16

File3

File recipe

Fingerprintindex

Chunk store

lookupSurprise

Many writes become faster!

Page 17: Deduplicationjannen/teaching/s19/cs333/meetings/...Geoff Kuenning Gala Yadgar Sources of Duplicates •Different people store the same files –Shared documents, code development –Popular

Read Path

17

File3

File recipe

Fingerprintindex

Chunk store

lookup

Page 18: Deduplicationjannen/teaching/s19/cs333/meetings/...Geoff Kuenning Gala Yadgar Sources of Duplicates •Different people store the same files –Shared documents, code development –Popular

Delete Path

18

File3

File recipe

Fingerprintindex

Chunk store

lookup

Reference counters: 1 2 1 2 11 2

• Challenge: storing reference counts– Physically separate from the chunks

Page 19: Deduplicationjannen/teaching/s19/cs333/meetings/...Geoff Kuenning Gala Yadgar Sources of Duplicates •Different people store the same files –Shared documents, code development –Popular

Chunking

• Chunking: splitting files into blocks• Fixed-size chunks: usually aligned to device blocks • What is the best chunk size?

19

File1 File2

File1 File2

Page 20: Deduplicationjannen/teaching/s19/cs333/meetings/...Geoff Kuenning Gala Yadgar Sources of Duplicates •Different people store the same files –Shared documents, code development –Popular

Updates and Versions

• Best case:aabbccdd

àaAbbccdd• Worst case:

aabbccddàaAabbccdd

20

File1 File1a

File1b

Ideally…

File1b

Page 21: Deduplicationjannen/teaching/s19/cs333/meetings/...Geoff Kuenning Gala Yadgar Sources of Duplicates •Different people store the same files –Shared documents, code development –Popular

à aAa010bb010cc010dd

Variable-Size Chunks

• Basic idea: chunk boundary is triggered by a random string

• For example: 010• aa010bb010cc010dd

• Triggers should be:– Not too short/long– Not too popular (000000…)– Easy to identify

21

Page 22: Deduplicationjannen/teaching/s19/cs333/meetings/...Geoff Kuenning Gala Yadgar Sources of Duplicates •Different people store the same files –Shared documents, code development –Popular

Identifying Chunk Boundaries

• 48-byte triggers (empirically, this works)

• Define a set of possible triggersàK highest bits of the hash are == 0

àRabin fingerprints do this efficiently

à “systems” solutions for corner cases

• Challenge: parallelize this process

22

…010110010011001110100100100110011001001001100110000…

00100010010000000101

Fingerprint

Boundary!K=5

Page 23: Deduplicationjannen/teaching/s19/cs333/meetings/...Geoff Kuenning Gala Yadgar Sources of Duplicates •Different people store the same files –Shared documents, code development –Popular

Rabin Fingerprints

• “The polynomial representation of the data modulo a predetermined irreducible polynomial” [LBFS sosp01]

• What/why Rabin fingerprints?– Calculates a rolling hash– “Slide the window” in a constant number of

operations (intuition: we “add” a new byte and “subtract” an old byte to slide the window by one)

– Define a “chunk” once our window’s hash matches our target value (i.e., we hit a trigger)

23

Page 24: Deduplicationjannen/teaching/s19/cs333/meetings/...Geoff Kuenning Gala Yadgar Sources of Duplicates •Different people store the same files –Shared documents, code development –Popular

Defining chunk boundaries• Tradeoff between small and large chunks?– Finer granularity of sharing vs. metadata overhead

• With process just described, how might we:– Produce a very small chunk?– Produce a very large chunk?

• How might we modify our chunking algorithm to give us “reasonable” chunk sizes?– To avoid small chunks: don’t consider boundaries until

minimum size threshold– To avoid large chunks: as soon as we reach a

maximum threshold, insert a chunk boundary

24

Page 25: Deduplicationjannen/teaching/s19/cs333/meetings/...Geoff Kuenning Gala Yadgar Sources of Duplicates •Different people store the same files –Shared documents, code development –Popular

Distributed Storage Increase storage capacity and performance withmultiple storage servers• Each server is a separate machine

(CPU,RAM,HDD/SSD)• Data access is distributed between serversG Scalability

Increase capacity with data growthG Load balancing

Independent of workload G Failure handling

Network, nodes and devices always fail25

Page 26: Deduplicationjannen/teaching/s19/cs333/meetings/...Geoff Kuenning Gala Yadgar Sources of Duplicates •Different people store the same files –Shared documents, code development –Popular

Distributed Deduplication

• Where/when should we look for duplicates?• Where should we store each file?

26

File1 File3File2

Page 27: Deduplicationjannen/teaching/s19/cs333/meetings/...Geoff Kuenning Gala Yadgar Sources of Duplicates •Different people store the same files –Shared documents, code development –Popular

Challenges (aka Summary)

à Wonderful theory problems!

27

Approximate membership query structures (AMQ)

…010110010011001110100100100110011001001001100110000…Parallelizing chunking

Size of fingerprintdictionary

1 2 1 2 11 2

Bidirectional indexing of chunks

Page 28: Deduplicationjannen/teaching/s19/cs333/meetings/...Geoff Kuenning Gala Yadgar Sources of Duplicates •Different people store the same files –Shared documents, code development –Popular

Next Class?

• Specific dedup system(s) (4)• Mapreduce (+ write-optimized) (2)• Google file system (1)• RAID (3)

28

Page 29: Deduplicationjannen/teaching/s19/cs333/meetings/...Geoff Kuenning Gala Yadgar Sources of Duplicates •Different people store the same files –Shared documents, code development –Popular

Final Project Discussion

• Get with your group• Find another group• Pitch your project / show them your proposal– React/revise

29