Database Entity Resolution using Program Synthesis
Rohit Singh, Vamsi Meduri, Samuel Madden, Paolo Papotti, Jorge-Arnulfo Quiané-Ruiz, Armando Solar-Lezama, Nan Tang
MIT (Cambridge, MA), QCRI (Doha, Qatar), ASU (Tempe, AZ)
ExCAPE Review Meeting, May 10, 2016
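Problem Overview
Input: two tables, R and S

R:

| Name | Position | Institution | Gender |
| --- | --- | --- | --- |
| Wei Wang | Associate Prof | UNSW | M |
| Wei Wang | Prof | UCLA | F |
| Sam Madden | Prof | MIT | M |
| Patrick Valduriez | Director | INRIA | M |

S:

| Person | Title | University | Sex |
| --- | --- | --- | --- |
| Wei Wang | Professor | UCLA | F |
| Samuel Madden | Prof | MIT | M |
| P. Valduriez | Director | INRIA | M |
| P. Papotti | AP | ASU | M |

Output (for r ∈ R, s ∈ S):
if sim(Name, Person) and sim(Institution, University) and =(Gender, Sex) then r and s match
Logical Structure: the way the attribute comparisons are combined in the rule (here, a conjunction)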
Similarity Predicates: sim_function(Name, Person) >= threshold
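As a minimal sketch of how such a rule could be evaluated on the example tables, the snippet below uses Python's `difflib.SequenceMatcher` as a stand-in similarity function. The choice of similarity function, the helper names, and the thresholds (0.7 and 0.9) are assumptions for illustration, not the ones used in the talk.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Stand-in similarity function (assumption): character-level ratio in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def sim(a: str, b: str, threshold: float) -> bool:
    """Similarity predicate: sim_function(a, b) >= threshold."""
    return similarity(a, b) >= threshold

R = [
    {"Name": "Wei Wang", "Position": "Associate Prof", "Institution": "UNSW", "Gender": "M"},
    {"Name": "Wei Wang", "Position": "Prof", "Institution": "UCLA", "Gender": "F"},
    {"Name": "Sam Madden", "Position": "Prof", "Institution": "MIT", "Gender": "M"},
    {"Name": "Patrick Valduriez", "Position": "Director", "Institution": "INRIA", "Gender": "M"},
]
S = [
    {"Person": "Wei Wang", "Title": "Professor", "University": "UCLA", "Sex": "F"},
    {"Person": "Samuel Madden", "Title": "Prof", "University": "MIT", "Sex": "M"},
    {"Person": "P. Valduriez", "Title": "Director", "University": "INRIA", "Sex": "M"},
    {"Person": "P. Papotti", "Title": "AP", "University": "ASU", "Sex": "M"},
]

def rule(r, s, name_threshold=0.7, institution_threshold=0.9):
    """if sim(Name, Person) and sim(Institution, University) and =(Gender, Sex)."""
    return (sim(r["Name"], s["Person"], name_threshold)
            and sim(r["Institution"], s["University"], institution_threshold)
            and r["Gender"] == s["Sex"])

# Compare every r in R against every s in S and keep the pairs the rule accepts.
matches = [(r["Name"], s["Person"]) for r in R for s in S if rule(r, s)]
print(matches)
```

With these assumed thresholds the rule pairs the UCLA Wei Wang, Sam Madden, and Patrick Valduriez with their counterparts in S, and leaves the UNSW Wei Wang and P. Papotti unmatched.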
Why does it matter?
• Linking Census Data over the years
• Duplicate Contacts Detection
• Better Web Search (Knowledge Graph)
• Comparison Shopping
• Spam Detection
• Machine Reading
• Public Health
• Counter-terrorism
• … [1]
[1] Lise Getoor, Ashwin Machanavajjhala: Entity Resolution: Theory, Practice & Open Challenges. PVLDB 5(12): 2018-2019 (2012)
Challenges
• Schema Matching
  Schema R: Purchase Order, Product, Billing Name, Billing Address, Shipping Name, Shipping Address
  Schema S: POrder, Payee Info, Article, Recipient Info
• Reasoning about Similarity Functions
• Noisy data, Slow convergence
• Structure of the Matching Rules
  if sim(A,A’) and sim(B,B’) and sim(C,C’) then r and s match
  if sim(C,C’) and sim(D,D’) then r and s match
  if sim(A,A’) and sim(B,B’) then r and s match
  if sim(B,B’) and sim(C,C’) and sim(D,D’) then r and s match
  ???
Approach
• Schema Matching → Leverage Prior Work
• Structure of the Matching Rules → SyGuS
• Reasoning about Similarity Functions → model functions with tables
• Noisy data, Slow convergence → CEGIS + RANSAC
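One plausible reading of "model functions with tables" is that similarity scores are precomputed for every example pair, so that the synthesizer only compares table entries against thresholds and never reasons about the similarity functions themselves. The sketch below illustrates that reading only; the function set, the attribute pairs, and the helper names are hypothetical.

```python
from difflib import SequenceMatcher

def ratio(a: str, b: str) -> float:
    """Stand-in string similarity (assumption), in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def exact(a: str, b: str) -> float:
    """Equality expressed as a 0/1 similarity score."""
    return 1.0 if a == b else 0.0

# Hypothetical library of similarity functions and aligned attribute pairs.
SIM_FUNCTIONS = {"ratio": ratio, "exact": exact}
ATTRIBUTE_PAIRS = [("Name", "Person"), ("Institution", "University"), ("Gender", "Sex")]

def build_table(pairs):
    """One row per example pair; one column per (attribute pair, function)."""
    table = []
    for r, s in pairs:
        row = {}
        for a, b in ATTRIBUTE_PAIRS:
            for fname, f in SIM_FUNCTIONS.items():
                row[(a, b, fname)] = f(r[a], s[b])
        table.append(row)
    return table

def holds(row, predicate):
    """predicate = (attr_r, attr_s, function_name, threshold): a table lookup."""
    a, b, fname, threshold = predicate
    return row[(a, b, fname)] >= threshold

def rule_matches(row, predicates):
    """A conjunctive rule holds when every predicate holds on the row."""
    return all(holds(row, p) for p in predicates)

pairs = [(
    {"Name": "Sam Madden", "Institution": "MIT", "Gender": "M"},
    {"Person": "Samuel Madden", "University": "MIT", "Sex": "M"},
)]
table = build_table(pairs)
candidate = [("Name", "Person", "ratio", 0.7), ("Gender", "Sex", "exact", 1.0)]
print(rule_matches(table[0], candidate))
```

Once the table is built, checking a candidate rule costs a few lookups per example, which keeps the search loop independent of how expensive the underlying similarity functions are.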
Motivation
• User-provided Example pairs
• Some of these are wrong (✕)
• Challenges
  • Too many examples
  • Don’t know which ones are wrong
• Hypotheses
  • Matching Rules are easy to learn
  • Small fraction of wrong examples
[Figure: a scatter of example pairs (•), a few of them wrong (✕)]
How does it work?
• Counter-Example Guided Inductive Synthesis
  1. Pick an example pair randomly
  2. Synthesize a matching function
  3. Include an incorrectly matched example pair (○)
  4. Go to 2 (repeat)
[Figure: the example pairs (•), some of them wrong (✕), each marked as matched correctly (□) or incorrectly (○) by the current function, next to the loop Synthesize Matching Function → Check → Counterexample]
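The four steps can be sketched as a small loop. This is an illustration under strong assumptions, not the synthesizer from the talk: each example is a tuple of precomputed similarity scores plus a match label, a "matching function" is one threshold per score, and `synthesize` is a brute-force search over a hypothetical threshold grid.

```python
import itertools
import random

# Hypothetical threshold grid; a real synthesizer would search a richer space.
GRID = [0.0, 0.25, 0.5, 0.75, 1.0]

def predict(rule, scores):
    """A rule is one threshold per similarity score; all of them must be met."""
    return all(v >= t for v, t in zip(scores, rule))

def synthesize(working_set):
    """Return any rule consistent with the working set, or None (UNSAT)."""
    n = len(working_set[0][0])
    for rule in itertools.product(GRID, repeat=n):
        if all(predict(rule, x) == y for x, y in working_set):
            return rule
    return None

def cegis(examples, rng):
    working_set = [rng.choice(examples)]        # 1. pick an example pair randomly
    while True:
        rule = synthesize(working_set)          # 2. synthesize a matching function
        if rule is None:
            return None, working_set            # no rule fits the working set
        wrong = [e for e in examples if predict(rule, e[0]) != e[1]]
        if not wrong:
            return rule, working_set            # every example is matched correctly
        working_set.append(rng.choice(wrong))   # 3. include a counterexample, 4. repeat

clean = [((0.9, 0.8), True), ((0.8, 0.6), True), ((0.3, 0.9), False),
         ((0.9, 0.2), False), ((0.1, 0.1), False)]
# One wrong label: this pair scores higher than a true match but is marked False.
noisy = clean + [((0.95, 0.9), False)]

clean_rule, _ = cegis(clean, random.Random(0))
noisy_rule, _ = cegis(noisy, random.Random(0))
print(clean_rule, noisy_rule)
```

On the clean examples the loop ends with a rule that fits all of them. With the single wrong label no conjunction of thresholds can fit every example, so the loop eventually reaches a working set for which synthesis fails.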
How does it work?
• Counter-Example Guided Inductive Synthesis
  • Record and Restart
  • Best over K runs (RANSAC)
[Figure: example pairs (•), some of them wrong (✕), matched correctly (□) or incorrectly (○); the loop Synthesize Matching Function → Check → Counterexample, labeled UNSAT]
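A rough sketch of "Record and Restart" with "Best over K runs" follows: each run keeps the last function it managed to synthesize before hitting UNSAT, and the outer loop keeps the function that matches the most examples across K runs. The threshold-grid rule space, the scoring by number of correctly matched examples, and all names are assumptions for illustration.

```python
import itertools
import random

GRID = [0.0, 0.25, 0.5, 0.75, 1.0]  # hypothetical threshold grid

def predict(rule, scores):
    return all(v >= t for v, t in zip(scores, rule))

def synthesize(working_set):
    """Brute-force stand-in for the synthesizer; None means UNSAT."""
    n = len(working_set[0][0])
    for rule in itertools.product(GRID, repeat=n):
        if all(predict(rule, x) == y for x, y in working_set):
            return rule
    return None

def score(rule, examples):
    """Number of examples the rule matches correctly."""
    return sum(predict(rule, x) == y for x, y in examples)

def one_run(examples, rng):
    """One CEGIS run; records the last rule found before UNSAT or convergence."""
    working_set, last_rule = [rng.choice(examples)], None
    while True:
        rule = synthesize(working_set)
        if rule is None:
            return last_rule                      # UNSAT: hand back what was recorded
        last_rule = rule
        wrong = [e for e in examples if predict(rule, e[0]) != e[1]]
        if not wrong:
            return rule
        working_set.append(rng.choice(wrong))

def best_over_k(examples, k, seed=0):
    """Restart k times and keep the best-scoring recorded rule."""
    rng = random.Random(seed)
    best_rule, best_score = None, -1
    for _ in range(k):
        rule = one_run(examples, rng)
        if rule is not None and score(rule, examples) > best_score:
            best_rule, best_score = rule, score(rule, examples)
    return best_rule, best_score

clean = [((0.9, 0.8), True), ((0.8, 0.6), True), ((0.3, 0.9), False),
         ((0.9, 0.2), False), ((0.1, 0.1), False)]
noisy = clean + [((0.95, 0.9), False)]  # one wrong label

best_clean_rule, best_clean_score = best_over_k(clean, k=5)
best_noisy_rule, best_noisy_score = best_over_k(noisy, k=5)
print(best_clean_score, best_noisy_score)
```

Because a run that happens to avoid the wrong examples ends with a good function, repeating the run and keeping the best result is what gives the procedure its RANSAC flavor.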
Improvements
• Non-uniform sampling across restarts
  • Penalize examples resulting in UNSAT
  • More penalty for examples picked at a later CEGIS iteration
• Resilient synthesizer
  • Maximize number of matches
  • Threshold on number of allowed mismatches
[Figure: example pairs (•) and wrong examples (✕), with penalized examples annotated "Less likely to be picked again"]
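Both improvements can be sketched in a few lines. The penalty formula (a multiplier that shrinks with the position at which an example was picked) and the brute-force search are assumptions chosen for illustration; the slides do not state the exact weighting or the synthesizer encoding.

```python
import itertools
import random

GRID = [0.0, 0.25, 0.5, 0.75, 1.0]  # hypothetical threshold grid

def predict(rule, scores):
    return all(v >= t for v, t in zip(scores, rule))

def penalize(weights, picked, base=0.5):
    """Shrink sampling weights of examples from a run that ended in UNSAT.

    `picked` lists example indices in the order they were added; later
    picks get a smaller multiplier, i.e. more penalty (assumed formula).
    """
    n = len(picked)
    for position, idx in enumerate(picked):
        weights[idx] *= base ** ((position + 1) / n)
    return weights

def sample(weights, rng):
    """Non-uniform pick: penalized examples are less likely to be chosen."""
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]

def resilient_synthesize(working_set, max_mismatches):
    """Maximize matches while allowing at most `max_mismatches` mismatches."""
    best_rule, best_matches = None, -1
    n = len(working_set[0][0])
    for rule in itertools.product(GRID, repeat=n):
        matches = sum(predict(rule, x) == y for x, y in working_set)
        if len(working_set) - matches <= max_mismatches and matches > best_matches:
            best_rule, best_matches = rule, matches
    return best_rule

weights = penalize([1.0] * 5, picked=[3, 0, 4])
next_index = sample(weights, random.Random(0))

# The second example contradicts the first, so no rule fits all four.
working_set = [((0.9, 0.8), True), ((0.95, 0.9), False),
               ((0.3, 0.9), False), ((0.9, 0.2), False)]
strict_rule = resilient_synthesize(working_set, max_mismatches=0)
tolerant_rule = resilient_synthesize(working_set, max_mismatches=1)
print(weights, strict_rule, tolerant_rule)
```

With zero allowed mismatches the contradictory working set yields no rule, while allowing one mismatch returns a rule that fits three of the four examples instead of forcing a restart.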
Experiments: Effectiveness Metric
• $\text{Precision} = \dfrac{\text{True Positive}}{\text{True Positive} + \text{False Positive}}$
• $\text{Recall} = \dfrac{\text{True Positive}}{\text{True Positive} + \text{False Negative}}$
• $\text{Accuracy} = \dfrac{\text{True Positive} + \text{True Negative}}{\text{Total Population}}$
• $\text{F-Measure} = 2\left(\dfrac{1}{\text{Precision}} + \dfrac{1}{\text{Recall}}\right)^{-1}$
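As a worked example of the four metrics, the helper below computes them from confusion-matrix counts. The function name and the sample counts are made up for illustration; returning 0.0 when a denominator is zero is a convention chosen here, not something the slides specify.

```python
def effectiveness(tp: int, fp: int, fn: int, tn: int):
    """Precision, recall, accuracy, and F-measure from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    # F-measure is the harmonic mean: 2 * (1/precision + 1/recall)^-1.
    if precision > 0 and recall > 0:
        f_measure = 2 / (1 / precision + 1 / recall)
    else:
        f_measure = 0.0
    return precision, recall, accuracy, f_measure

# Hypothetical counts: 8 true matches found, 2 false alarms, 2 missed, 88 true non-matches.
precision, recall, accuracy, f_measure = effectiveness(tp=8, fp=2, fn=2, tn=88)
print(precision, recall, accuracy, f_measure)
```

In entity resolution, non-matching pairs usually far outnumber matching ones, so accuracy tends to be dominated by true negatives; that is one reason the F-measure is the figure reported in the comparisons.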
Experiments: Comparison with SIFI [2]
Expert-provided fixed rule structure used for both
5000 example pairs

10-fold, 10% training examples (Avg F-measure)

| Dataset | Best SIFI | Best Synth |
| --- | --- | --- |
| Restaurant | 0.86 | 0.85 |
| Cora | 0.72 | 0.75 |
| Dbgen | 0.957 | 0.964 |

100-fold, 1% training examples (Avg F-measure)

| Dataset | Best SIFI | Best Synth |
| --- | --- | --- |
| Restaurant | 0.63 | 0.76 |
| Cora | 0.37 | 0.52 |
| Dbgen | 0.87 | 0.93 |

[2] Jiannan Wang, Guoliang Li, Jeffrey Xu Yu, Jianhua Feng: Entity Matching: How Similar Is Similar. PVLDB 4(10): 622-633 (2011)
Experiments: Comparison with SIFI [2]
Expert-provided structure for SIFI, bounded DNF grammar for Synthesis
5000 example pairs

10-fold, 10% training examples (Avg F-measure)

| Dataset | Best SIFI | Best Synth |
| --- | --- | --- |
| Cora | 0.72 | 0.87 |
| Dbgen | 0.957 | 0.98 |
Summary
• Industry need for better Entity Resolution methods
• Synthesis-enabled solution can
  • Outperform or match state of the art with “small” sized training sets
  • Potentially save effort and $$ by reducing expert involvement
• Work in Progress:
  • Evaluating on industry datasets