Privacy Preserving Record Linkage with PPJoin

Post on 27-Jan-2017

Ziad Sehili, Lars Kolb, Christian Borgs, Rainer Schnell, Erhard Rahm

Datenbanksysteme für Business, Technologie und Web (BTW), 2015

September 15, 2016
Presentation by Mateus Cruz

Introduction Preliminaries Proposal Experiments Conclusion

OUTLINE

1 Introduction

2 Preliminaries

3 Proposal

4 Experiments

5 Conclusion

OVERVIEW

Find pairs of similar records
Quadratic complexity
  - Scalability problems
Adapt PPJoin [1] to encrypted data
  - Filtering reduces search space
Parallelize to improve performance
  - GPUs

[1] Chuan Xiao, Wei Wang, Xuemin Lin, and Jeffrey Xu Yu: “Efficient Similarity Joins for Near Duplicate Detection”, WWW 2008

DATA REPRESENTATION

Create Bloom filters using MD5 and SHA-1
  - Deterministic
  - Similarity preserving
  - Allows length filtering
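As a rough illustration of this encoding, the sketch below maps the bigrams of a string to bit positions. The slides only name the two hash functions, so the double-hashing rule `(MD5 + i * SHA-1) mod m`, the lowercasing step, and the names `bigrams`, `bloom_encode` and `jaccard` are assumptions made for this example. The defaults m = 1000 and k = 20 are taken from the SETUP slide.

```python
import hashlib


def bigrams(text):
    """Split a string into its set of overlapping 2-character tokens."""
    return {text[i:i + 2] for i in range(len(text) - 1)}


def bloom_encode(text, m=1000, k=20):
    """Return the set of bit positions set to 1 for `text`.

    Assumption: positions come from double hashing,
    (MD5 + i * SHA-1) mod m; the slides do not spell this rule out.
    """
    positions = set()
    for gram in bigrams(text.lower()):
        data = gram.encode("utf-8")
        h1 = int.from_bytes(hashlib.md5(data).digest(), "big")
        h2 = int.from_bytes(hashlib.sha1(data).digest(), "big")
        for i in range(k):
            positions.add((h1 + i * h2) % m)
    return positions


def jaccard(a, b):
    """Jaccard similarity of two sets of bit positions."""
    return len(a & b) / len(a | b) if (a or b) else 1.0
```

Because the hash functions are deterministic, equal bigrams always set the same bits, so records that share many bigrams also share many 1 bits. The number of 1 bits plays the role of the record length in the filters that follow.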

PPJOIN [2]

Position Prefix Join
Signature-based algorithm
Filtering techniques
  - Length filter
  - Prefix filter
  - Position filter

[2] Chuan Xiao, Wei Wang, Xuemin Lin, and Jeffrey Xu Yu: “Efficient Similarity Joins for Near Duplicate Detection”, WWW 2008

LENGTH FILTER

“If two records are similar, the difference between their lengths cannot be large”

Sort records by length
Using Jaccard similarity: δ·|s| ≤ |r| ≤ |s| / δ
  - |s|: Length of s
  - δ: Similarity threshold
Group records according to their lengths
  - Prune pairs of records from different groups
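The length condition can be sketched in a few lines. This is a minimal illustration under the assumption that records are Python sets of tokens; the function names are made up for the example. The symmetric form `shorter >= delta * longer` is equivalent to δ·|s| ≤ |r| ≤ |s| / δ.

```python
def passes_length_filter(len_r, len_s, delta):
    """True if a pair with these lengths can still reach Jaccard >= delta."""
    shorter, longer = sorted((len_r, len_s))
    return shorter >= delta * longer


def length_candidates(records, delta):
    """Yield index pairs (i, j) of records that survive the length filter.

    `records` is a list of token sets; it is scanned in order of length so
    the inner loop can stop at the first record that is too long.
    """
    order = sorted(range(len(records)), key=lambda i: len(records[i]))
    for a, i in enumerate(order):
        for j in order[a + 1:]:
            if len(records[i]) < delta * len(records[j]):
                break  # every later record is at least as long
            yield (i, j)
```

Sorting by length is what makes the early `break` safe: once one record is too long for the current one, all later records are too long as well.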

PREFIX FILTER

“If two records are similar, they must share some tokens”

Sort tokens in each record
  - Alphabetical order, IDF order, etc.
Select the first p tokens
  - For Jaccard similarity (JS), p = ⌊(1 − δ)·|s|⌋ + 1
Prune pairs for which sp ∩ rp = ∅
  - sp: prefix of s (containing the first p tokens)
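A small sketch of the prefix test follows, assuming alphabetical token order and records given as sets of strings; the function names are illustrative. The prefix length is computed as |s| − ⌈δ·|s|⌉ + 1, which is algebraically the same as ⌊(1 − δ)·|s|⌋ + 1 but less sensitive to floating-point error (for example, `(1 - 0.8) * 5` evaluates to slightly less than 1 in floating point).

```python
import math


def prefix(tokens, delta):
    """First p tokens of a sorted record, p = floor((1 - delta) * |s|) + 1."""
    ordered = sorted(tokens)  # assumption: alphabetical token order
    n = len(ordered)
    # n - ceil(delta * n) + 1 equals floor((1 - delta) * n) + 1;
    # the rounding guards against floating-point noise in delta * n.
    p = n - math.ceil(round(delta * n, 9)) + 1
    return ordered[:p]


def passes_prefix_filter(r, s, delta):
    """True if the prefixes of r and s share at least one token."""
    return bool(set(prefix(r, delta)) & set(prefix(s, delta)))
```

With δ = 0.8 and five tokens per record, each prefix holds two tokens, so a pair is kept only if one of those two tokens is common to both records.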

POSITION FILTER

“If two records are similar, their maximal possible overlap cannot be smaller than the minimally needed overlap”

Minimal overlap
  - α = ⌈ t / (1 + t) · (|r| + |s|) ⌉, where t is the similarity threshold
Divide each record into left and right parts
  - lp: tokens already seen
  - rp: unseen tokens
Prune if |lp(r) ∩ lp(s)| + min(|rp(r)|, |rp(s)|) < α
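The pruning rule can be sketched as below. The sketch assumes both records are token lists in the same global order and that the check is made at a token shared by both records (index `i` in r, index `j` in s), so the left parts end at that token. Function and parameter names are illustrative.

```python
import math


def min_overlap(len_r, len_s, t):
    """alpha = ceil(t / (1 + t) * (|r| + |s|)) for Jaccard threshold t."""
    return math.ceil(round(t / (1 + t) * (len_r + len_s), 9))


def passes_position_filter(r, s, i, j, t):
    """Check the overlap bound at a shared token r[i] == s[j].

    r and s are token lists sorted in the same global order.
    """
    if r[i] != s[j]:
        raise ValueError("the filter is evaluated at a common token")
    seen = len(set(r[:i + 1]) & set(s[:j + 1]))        # |lp(r) ∩ lp(s)|
    unseen = min(len(r) - (i + 1), len(s) - (j + 1))   # min(|rp(r)|, |rp(s)|)
    return seen + unseen >= min_overlap(len(r), len(s), t)
```

For r = [B, C, D, E, F], s = [A, B, C, D, F] and t = 0.8, the required overlap is α = 5, while the bound at the shared token B is 1 + 3 = 4, so the pair is pruned.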

PPJOIN PREPROCESSING

PPJOIN INDEX

Pair (r1,r4) filtered by length filter
  - |r1| < δ·|r4| (4 < 0.8·6)
Pair (r3,r2) filtered by position filter

P4JOIN

“PPJoin for Encrypted Data” (P4Join)
Records are Bloom filters (BFs) of fixed size
Consider bit positions as tokens
Length is the number of 1 bits
  - Called cardinality
Does not need an inverted index
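Treating bit positions as tokens means set operations become bitwise operations. The following sketch assumes a Bloom filter is stored as a Python integer; `cardinality` and `jaccard_bits` are names chosen for this example.

```python
def cardinality(bits):
    """Number of 1 bits in a Bloom filter stored as a Python int."""
    return bin(bits).count("1")


def jaccard_bits(a, b):
    """Jaccard similarity of two bit arrays: |a AND b| / |a OR b|."""
    union = cardinality(a | b)
    return cardinality(a & b) / union if union else 1.0
```

For example, `jaccard_bits(0b1100, 0b0110)` compares two filters that share one set bit out of three distinct set bits.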

P4JOIN PREPROCESSING

Length is the number of 1 bits (cardinality)
  - Prefixes with the same length (number of 1 bits) can have different sizes (number of bit positions spanned)

P4JOIN PROCESSING

High cost to maintain inverted index
Original position filter reduces performance
lmap
  - Lists relevant records based on length filter

P4JOIN LENGTH FILTER

r1 does not satisfy the length filter
  - 7 < 0.8·11

P4JOIN PREFIX FILTER

Check overlap by AND operation
  - Prune pair (r4,r2)
    - 000011011 AND 1111 = 000000000
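A possible way to express this check in code is sketched below. It assumes bit arrays are written as strings of '0' and '1' read left to right, and that a prefix consists of the bits up to the p-th 1 bit, with p computed from the cardinality as in the PPJoin prefix filter. The bit strings in the usage note are made up and are not the records from the figure.

```python
import math


def prefix_bits(bits, delta):
    """Keep the bits up to the p-th 1 bit and zero out the rest.

    `bits` is a string of '0'/'1' read left to right; p follows the
    prefix-length formula with the cardinality as record length.
    """
    card = bits.count("1")
    p = card - math.ceil(round(delta * card, 9)) + 1
    out, seen = [], 0
    for b in bits:
        out.append(b if seen < p else "0")
        seen += b == "1"
    return "".join(out)


def prefixes_overlap(r, s, delta):
    """True if the two prefixes share a 1 bit (bitwise AND is non-zero)."""
    pr = int(prefix_bits(r, delta), 2)
    ps = int(prefix_bits(s, delta), 2)
    return (pr & ps) != 0
```

With δ = 0.8, `prefixes_overlap("000011011", "111100000", 0.8)` is false because each prefix keeps only its first 1 bit and those bits sit at different positions, so the pair would be pruned.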

P4JOIN POSITION FILTER

Prune pair (r4,r3)

P4JOIN WITH GPUS

Bit arrays of type long (64 bits)
Divide R and S into partitions
  - To fit in the GPU’s memory
Sort records in partitions
Check if partitions have candidate pairs
  - Using length filter
  - If no candidates, do not even send to GPU
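The partition-level check can be sketched on the host side as follows. The sketch assumes each partition is non-empty and is described only by the list of its record cardinalities; the function names are hypothetical. The test is conservative: it only skips partition pairs in which no record pair can satisfy the length filter.

```python
def partitions_may_match(card_r, card_s, delta):
    """Conservative test on two partitions given their cardinality lists.

    A pair (r, s) needs delta * |s| <= |r| <= |s| / delta, so the partitions
    can be skipped when even their closest cardinalities violate this.
    """
    r_min, r_max = min(card_r), max(card_r)
    s_min, s_max = min(card_s), max(card_s)
    return r_max >= delta * s_min and s_max >= delta * r_min


def plan_gpu_work(r_parts, s_parts, delta):
    """List the (i, j) partition pairs worth sending to the GPU."""
    return [(i, j)
            for i, cr in enumerate(r_parts)
            for j, cs in enumerate(s_parts)
            if partitions_may_match(cr, cs, delta)]
```

Since records are sorted within partitions, the minimum and maximum cardinalities are simply the first and last entries, so in practice the check costs a constant amount of work per partition pair.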

P4JOIN GPU PROCESSING

One kernel per record of Ri
  - Comparing with all records from Sj
  - Prune using length and prefix filters
  - Matches are saved in the global memory

SETUP

Hardware
  - CPU: 4-core 2.67 GHz, 4 GB memory
  - GPUs
    - NVIDIA GeForce GT 610 (1 GB memory)
    - NVIDIA GeForce GT 540M (1 GB memory)

Parameters
  - Bigrams as tokens
  - Bit vector length: 1000
  - JS threshold: 0.8
  - Number of hash functions: k = 20
  - Maximum partition size: 2000

CPU PERFORMANCE

Most gains from length filter
Large overhead for prefix filter

GPU PERFORMANCE

Speedups of 20%
  - Compared to sequential CPU approach

SUMMARY

Adaptation of PPJoin to privacy-preserving record linkage (PPRL)
  - Records are encrypted bit arrays
Parallelization using GPUs
Bit arrays reduce effectiveness of filters
  - Due to overheads

Algorithms Detailed Filters

EXTRA SLIDES

P4JOIN ALGORITHM

POSITION FILTER

“If two records are similar, the upper bound of their JS cannot be smaller than the threshold δ”

Compute prefixes sp and rp
Calculate the upper bound of their JS (Θ):
  - Θ = (|sp ∩ rp| + min(|s| − |sp|, |r| − |rp|)) / (|sp ∪ rp| + max(|s| − |sp|, |r| − |rp|))
Prune the pair if Θ < δ

Example [a]
  - r = {B,C,D,E,F}, s = {A,B,C,D,F}, δ = 0.8
  - Θ = (1 + 3) / (3 + 3) = 4/6 ≈ 0.67 → prune pair (r,s)

[a] Example from Jiang et al.: “String similarity joins: An experimental evaluation”, VLDB (2014)
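The example can be reproduced with a short sketch of the Θ formula. It assumes non-empty records given as token lists in sorted order, with the prefix lengths passed in explicitly; the function name is illustrative.

```python
def jaccard_upper_bound(r, s, p_r, p_s):
    """Theta for sorted token lists r, s with prefix lengths p_r, p_s."""
    rp, sp = set(r[:p_r]), set(s[:p_s])
    rest_r, rest_s = len(r) - p_r, len(s) - p_s
    numerator = len(sp & rp) + min(rest_s, rest_r)
    denominator = len(sp | rp) + max(rest_s, rest_r)
    return numerator / denominator


# Records from the example; with delta = 0.8 each prefix holds 2 tokens.
theta = jaccard_upper_bound(list("BCDEF"), list("ABCDF"), 2, 2)
prune = theta < 0.8
```

Here the prefixes {B, C} and {A, B} share one token and each record has three tokens left, which gives Θ = 4/6 and therefore `prune` is true.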