Record Linkage in a Distributed Environment
Huang Yipeng
Wing group meeting, 11 March 2011
2 Introduction
Record Linkage
Determining whether pairs of personal records refer to the same entity
E.g. distinguishing between data belonging to…
<Yipeng, author of this presentation> and <Yipeng, son of PM Lee>
3 Introduction
The Distributed Environment
Why?
◦ Dealing with large data
◦ Limitation of blocking
Advantages
◦ Parallel computation
◦ Data source flexibility
◦ Complementary to blocking methods
O(nC2) pairwise comparisons
[Figure: example records named Amanda, Beverley and Katherine, with Amanda repeated several times]
4 Introduction
The Distributed Environment
MapReduce
◦ Distributed environment for large data sets
Hadoop
◦ Open source implementation
◦ Convenient model for scaling Record Linkage
◦ Protects users from system level concerns
5 Introduction
Research Problem
Disconnect between the generic parallel framework and the specific Record Linkage problem
The goal
Tailor Hadoop for Record Linkage tasks
6
Outline
◦ Introduction
◦ Related Work
◦ Methodology
◦ Evaluation
◦ Conclusion
7 Related Work
Record Linkage Literature
◦ Blocking techniques
Parallel Record Linkage Literature
◦ P-Febrl (P Christen 2003)
◦ P-Swoosh (H Kawai 2006)
◦ Parallel Linkage (H Kim 2007)
Hadoop Literature
◦ Evaluation Metrics
◦ Pairwise comparisons (T Elsayed 2008)
9 Methodology
MapReduce Workflow
[Figure: MapReduce workflow diagram with the Partitioner stage labelled]
10 Methodology
Implementation
Map
Purpose:
◦ Parallelism
◦ Data manipulation
◦ Blocking
Reads lines of input and outputs <key, value> pairs.
Reduce
Purpose:
◦ Parallelism
◦ Record Linkage ops
Records with the same <key> are processed in the same Reduce().
Outputs linkage results.
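The Map and Reduce roles on this slide can be sketched in plain Python, without Hadoop. This is only an illustration under stated assumptions: the blocking key is taken to be the given name, a toy rule (equal surname) stands in for the real linkage operations, and `map_record`, `reduce_block`, `run` and the sample lines are hypothetical names and data that do not come from the slides.

```python
from collections import defaultdict
from itertools import combinations

def map_record(line):
    """Map: parse one input line and emit a <key, value> pair.
    Assumption: the blocking key is the first field (given name)."""
    fields = [f.strip() for f in line.split(",")]
    return fields[0], fields

def reduce_block(key, records):
    """Reduce: compare every pair of records that share the same key.
    Assumption: equal surname is a toy stand-in for real linkage ops."""
    return [(a, b) for a, b in combinations(records, 2) if a[1] == b[1]]

def run(lines):
    # Shuffle step: group values by key, as the framework would.
    groups = defaultdict(list)
    for line in lines:
        key, value = map_record(line)
        groups[key].append(value)
    links = []
    for key, records in groups.items():
        links.extend(reduce_block(key, records))
    return links

lines = [
    "amanda, smith, 4122",
    "amanda, smith, 4123",
    "amanda, jones, 4122",
    "beverley, smith, 4122",
]
links = run(lines)
print(links)
```

Because only records with the same key meet in one reduce call, "beverley, smith" is never compared with the "amanda" records: blocking trades some recall for far fewer comparisons.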
11 Methodology
Hash Partitioner
Default implementation: Hash(Key) mod N
Good for uniform data but not for skewed distributions
[Figure: Reduce task list for Job x, showing tasks per node]
Name       Distribution   Comparisons
joshua     5000           12497500
emiily     48             1128
jack       35             595
thomas     33             528
lachlan    32             496
benjamin   31             465
5416986 comparisons
210 comparisons
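The effect of skew under Hash(Key) mod N can be sketched with the name counts from the table above. The Comparisons column is n(n-1)/2 for a block of n records. In this sketch CRC32 is an assumed stand-in for the framework's real key hash, so the node each name lands on is illustrative only; `hash_partition` and `node_loads` are hypothetical helper names.

```python
import zlib

def comparisons(n):
    """Pairwise comparisons inside a block of n records: n choose 2."""
    return n * (n - 1) // 2

def hash_partition(key, num_nodes):
    """Hash(Key) mod N. Assumption: CRC32 stands in for the real key hash."""
    return zlib.crc32(key.encode("utf-8")) % num_nodes

def node_loads(name_counts, num_nodes):
    """Total comparison workload that each node receives."""
    loads = [0] * num_nodes
    for name, count in name_counts.items():
        loads[hash_partition(name, num_nodes)] += comparisons(count)
    return loads

# Name counts taken from the table on this slide.
name_counts = {"joshua": 5000, "emiily": 48, "jack": 35,
               "thomas": 33, "lachlan": 32, "benjamin": 31}
loads = node_loads(name_counts, 2)
print(loads)
```

Whichever node receives "joshua" carries at least 12497500 comparisons, while all other names together add up to only 3212, so the nodes cannot finish at the same time.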
13 Methodology
Record Linkage Partitioner
Goal: Have all nodes finish the reduce phase at the same time
Attain a better runtime while retaining the same level of accuracy
14 Methodology
Domain principles
◦ Counting pairwise comparisons gives a more accurate picture of the true computational workload
◦ The distribution of names tends to follow a power law in many countries (D Zanette 2001), (S Miyazima 2000)
15 Methodology
Record Linkage Workflow
Round 1: Range partition based on comparison workload
Round 2: Merge lost comparisons from Round 1
Round 3: Remove cross duplicates
16 Methodology
Round 1
[Figure labels: Input, Map Phase, Distribution]
1. Calculate the average comparison workload over N nodes.
2. Check whether a record will exceed the average. If yes, divide it by the minimum number of nodes needed to drop below the average.
3. Assign records to nodes and update the average comparison workload to reflect lost comparisons, if any.
4. Recurse until the comparison load can be evenly distributed among nodes.
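One possible reading of the four steps above is sketched below. It is an interpretation, not the presenter's implementation: a "record" is taken to mean the block of records sharing one key, splitting a block drops the comparisons between its sub-blocks (the "lost comparisons"), and a greedy least-loaded assignment is swapped in for the range partitioning named in the workflow. `split_block` and `plan_round1` are hypothetical names.

```python
from math import ceil

def comparisons(n):
    """Pairwise comparisons inside a block of n records."""
    return n * (n - 1) // 2

def split_block(size, parts):
    """Split a block into `parts` sub-blocks of near-equal size."""
    base, extra = divmod(size, parts)
    return [base + 1] * extra + [base] * (parts - extra)

def plan_round1(key_counts, num_nodes):
    blocks = list(key_counts.items())
    while True:
        # Step 1: average comparison workload over N nodes.
        avg = sum(comparisons(n) for _, n in blocks) / num_nodes
        # Step 4: stop once no block exceeds the average.
        if not any(comparisons(n) > avg for _, n in blocks):
            break
        next_blocks = []
        for key, n in blocks:
            if comparisons(n) > avg:
                # Step 2: minimum number of parts that drops below the average.
                parts = 2
                while comparisons(ceil(n / parts)) > avg:
                    parts += 1
                next_blocks.extend((key, s) for s in split_block(n, parts))
            else:
                next_blocks.append((key, n))
        # Step 3: cross sub-block comparisons are now lost, so the
        # average is recomputed on the next pass.
        blocks = next_blocks
    # Assumption: greedy assignment, largest block to least-loaded node.
    loads = [0] * num_nodes
    assignment = [[] for _ in range(num_nodes)]
    for key, n in sorted(blocks, key=lambda b: -b[1]):
        node = loads.index(min(loads))
        assignment[node].append((key, n))
        loads[node] += comparisons(n)
    return assignment, loads

name_counts = {"joshua": 5000, "emiily": 48, "jack": 35,
               "thomas": 33, "lachlan": 32, "benjamin": 31}
assignment, loads = plan_round1(name_counts, 2)
print(loads)
```

With two nodes the "joshua" block is split once into two sub-blocks of 2500 records, and the per-node workloads end up within a fraction of a percent of each other. The 2500 x 2500 comparisons between the two sub-blocks are not performed in this round.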
17 Methodology
Round 2
[Figure: diagram with blocks A and B, List X, and labels R1 and R2]
18 Methodology
Round 2
      A       B
A
B     Job 1

      A       B       C
A
B     Job 1
C     Job 2   Job 3

1. Only acts on lost comparisons
2. Because the input is indistinct, a 3rd round of deduplication may be needed.
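The job matrices above suggest one Round 2 job per unordered pair of sub-blocks. A small sketch under that assumption follows; the sub-block sizes are hypothetical (a 5000-record block split in two), and `round2_jobs` is an illustrative name rather than anything from the slides.

```python
from itertools import combinations

def comparisons(n):
    """Pairwise comparisons inside a block of n records."""
    return n * (n - 1) // 2

def round2_jobs(sub_blocks):
    """One job per unordered pair of sub-blocks. Each job compares every
    record of one sub-block with every record of the other."""
    return [((a, b), sub_blocks[a] * sub_blocks[b])
            for a, b in combinations(sorted(sub_blocks), 2)]

# Hypothetical split of one 5000-record block into sub-blocks A and B.
sub_blocks = {"A": 2500, "B": 2500}
jobs = round2_jobs(sub_blocks)
within = sum(comparisons(n) for n in sub_blocks.values())  # done in Round 1
lost = sum(cost for _, cost in jobs)                       # done in Round 2
print(jobs, within + lost == comparisons(5000))
```

Two sub-blocks give one job and three give three, matching the matrices on this slide. The within-block work from Round 1 plus the cross-block jobs adds back up to the full n(n-1)/2 comparisons of the original block.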
20 Evaluation
Performance Metrics
Performance evaluation in absolute runtime, speedup & scaleup on a shared cluster.
◦ “It’s what users care about”
◦ Representative of real operations
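The slides do not define speedup and scaleup, so the sketch below assumes the common definitions: speedup holds the problem size fixed while nodes are added, and scaleup grows the problem size in proportion to the number of nodes. All timings are made-up illustrative values, not measurements from this work.

```python
def speedup(runtime_one_node, runtime_n_nodes):
    """Same problem size on 1 node vs. n nodes; the ideal value is n."""
    return runtime_one_node / runtime_n_nodes

def scaleup(runtime_base, runtime_scaled):
    """Size s on 1 node vs. size n*s on n nodes; the ideal value is 1.0."""
    return runtime_base / runtime_scaled

# Hypothetical timings in seconds, for illustration only.
s = speedup(120.0, 40.0)    # 1 node vs. 4 nodes on the same data
u = scaleup(120.0, 150.0)   # 1 node on s records vs. 4 nodes on 4*s records
print(s, u)
```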
21 Methodology
Input Records
◦ 10 million records
◦ 0.9 million original, 0.1 million duplicate
◦ Up to 9 duplicates per record
◦ 1 modification per field, 1 modification per record
◦ Duplicates follow a Poisson distribution
<rec-359705-org, talyor, swift, 5, canterbury crescent, , cooks hill, 4122, , 19090518, 38, 07 34366927, 6174819, 9>
22 Methodology
Data sets
Synthetic data produced with the Febrl data generator
◦ Artificially skewed distribution
[Figure: chart of Comparisons (vertical axis, 0 to 1400) across the name distribution]
Name       Distribution   Comparisons
joshua     50             1225
emiily     48             1128
jack       35             595
thomas     33             528
lachlan    32             496
benjamin   31             465
23 Evaluation
Utilization
[Figure: bar chart of Idle vs. Computation time for Node 1 and Node 2, axis 0 to 12]
24 Evaluation
Utilization
[Figure: bar chart of Idle vs. Computation time for Node 1, Node 2 and Node 3, axis 0 to 12]
25 Evaluation
Utilization
[Figure: bar chart of Idle vs. Computation time for Node 1, Node 2 and Node 3, axis 0 to 12, with segments labelled A, B and C]
26 Evaluation
Utilization
[Figure: bar chart of Idle, Redistributed Computation and Original Computation time for Node 1, Node 2 and Node 3, axis 0 to 12, with segments labelled A, B and C]
27
Round 2
[Figure: job matrix over blocks A, B and C with jobs J1 to J6 and one open slot marked "?"]
Node Utilization 50-100%
28 Evaluation
Results so far…
Configuration                             Default Workflow   RL Workflow
2 nodes, 5000 records, 2433 duplicates    71.5 secs          75 secs
2 nodes, 7000 records, 4814 duplicates    >10 mins           196.8 secs
29 Evaluation
Results so far…
RL Workflow runtime
◦ Similar to Hash-based runtime on small datasets
◦ Better as the size of the dataset grows
30
Conclusion
Parallelism is a step in the right direction for record linkage
◦ Complementary to existing approaches
Hadoop can be tailored for Record Linkage tasks
◦ “Record Linkage” Partitioner / Workflow is just one example of possible improvements