DATA PARTITIONING FOR DISTRIBUTED ENTITY MATCHING

Toralf Kirsten, Lars Kolb, Michael Hartung, Anika Groß, Hanna Köpcke, Erhard Rahm

Database Group Leipzig
http://dbs.uni-leipzig.de

Singapore, 13th September 2010

ENTITY MATCHING

• Detection of entities in one or more sources that refer to the same real-world object

• Entity comparisons
  • Comparisons based on string similarity
  • Two sources: m² comparisons
  • Combination of several matchers
  • Aggregation of individual results

• Runtime
  • Execution times of up to several hours for a single attribute matcher
  • Worse for machine learning approaches

• Memory requirements
  • Source and intermediate results do not fit in memory
  • Chunk-wise processing
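The combination of several matchers can be illustrated with a small sketch. It aggregates two string-similarity matchers (normalized Levenshtein and trigram overlap) by a weighted average. The function names, the padding scheme, the Dice coefficient for trigrams and the equal weights are assumptions chosen for illustration, not details taken from the slides.

```python
from __future__ import annotations


def trigrams(s: str) -> set[str]:
    """Character trigrams of the lower-cased string, padded with blanks."""
    padded = f"  {s.lower()} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def trigram_sim(a: str, b: str) -> float:
    """Dice coefficient over the two trigram sets (assumed variant)."""
    ta, tb = trigrams(a), trigrams(b)
    return 2 * len(ta & tb) / (len(ta) + len(tb))


def levenshtein(a: str, b: str) -> int:
    """Edit distance via the standard row-by-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def levenshtein_sim(a: str, b: str) -> float:
    """Edit distance normalized to a similarity in [0, 1]."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a.lower(), b.lower()) / longest


def weighted_average(a: str, b: str, weights=(0.5, 0.5)) -> float:
    """Aggregate the individual matcher results (weights are illustrative)."""
    sims = (levenshtein_sim(a, b), trigram_sim(a, b))
    return sum(w * s for w, s in zip(weights, sims)) / sum(weights)
```

A pair would then count as a match if `weighted_average(a, b)` reaches some threshold; the threshold is a tuning parameter that the slides do not specify.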

HOW TO SPEED UP ENTITY MATCHING?

• Blocking
  • Group similar entities within blocks
  • Restrict entity matching to entities from the same block
  • Supported by entity matching frameworks

• Parallelization
  • Split match computation into sub-tasks
  • Execute them in parallel on multiple multi-core nodes
  • Currently utilized by only a few frameworks

CONTRIBUTIONS

• Generic data partitioning strategies for parallel matching
  • Size-based partitioning for evaluating the Cartesian product of input entities
  • Blocking-based matching
  • Applicable to arbitrary matchers

• Load balancing regarding available resources and match strategy

• Service-based infrastructure for parallel entity matching

• Evaluation of the strategies for different types of matchers and datasets

OUTLINE

• Motivation
• Overview
• Partitioning Strategies
  • Size-based Partitioning
  • Blocking-based Partitioning
• Match Infrastructure
• Evaluation
• Conclusion & Future Work

OVERVIEW

• Set of entities, described via attributes

• Partitioning strategy
  • Partition input data
  • Match task generation

• Parallel execution of a match strategy
  • One or several matchers
  • Combination of individual match results (manually, training-based)
  • Treated as black box

[Figure: workflow overview. Instance data from the input sources is integrated into one source. Two partitioning strategies follow: size-based partitioning (parameter: max. partition size) and blocking with partition tuning (parameters: source-specific blocking key, min./max. partition size). Each strategy ends in match task generation, which produces a task list. Parallel matching and result aggregation yield the match result.]

SIZE-BASED PARTITIONING

• Applicable to Cartesian product of input entities

• Split n input entities into p partitions of fairly equal size m
  • p = n/m
  • Range partitioning, Round Robin

• Match task compares two of these partitions
  • Match partitions Pi and Pj if i ≤ j
  • p + p(p-1)/2 match tasks

• Promises good load balancing and scalability to many nodes

[Figure: triangular matrix over the partitions P1, P2, …, Pp-1, Pp of the input set; an x marks every pair of partitions that forms a match task.]
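The task count can be checked with a short sketch. It uses range partitioning and enumerates all index pairs (i, j) with i ≤ j. The function names are illustrative, and the figures 20,000 entities and m = 1,000 are borrowed from the evaluation slides purely as an example.

```python
from itertools import combinations_with_replacement


def size_based_partitions(entities, m):
    """Range partitioning: consecutive chunks of at most m entities."""
    return [entities[i:i + m] for i in range(0, len(entities), m)]


def match_tasks(p):
    """All pairs (i, j) of partition indices with i <= j."""
    return list(combinations_with_replacement(range(p), 2))


entities = list(range(20_000))          # stand-in for 20,000 product offers
partitions = size_based_partitions(entities, 1_000)
tasks = match_tasks(len(partitions))

p = len(partitions)
print(p, len(tasks), p + p * (p - 1) // 2)   # 20 210 210
```

Round robin partitioning would instead use `entities[i::p]` for each i in `range(p)`; the set of match tasks stays the same.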

SIZE-BASED PARTITIONING – SUITABLE PARTITION SIZE M?

• m determines number of partitions and match tasks
  • m too small → many small match tasks, high communication overhead
  • m too large → memory bottlenecks

• m restricted by computing environment
  • Memory requirements of a match task in $O(m^2)$
  • Average memory requirement per match task: $c_{ms} \cdot m^2$

• Multiple cores share available memory
  • $m \le \sqrt{\dfrac{\text{available memory}}{\text{number of cores} \cdot c_{ms}}}$
  • 2 GB RAM, 4 cores, $c_{ms} = 20\,\text{B}$: max. partition size $m = \sqrt{\dfrac{2 \cdot 10^9\,\text{B}}{4 \cdot 20\,\text{B}}} = 5{,}000$
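The example numbers can be reproduced with a few lines of code. The sketch assumes that each core may use an equal share of the memory and that one match task needs c_ms ∙ m² bytes, so that c_ms ∙ m² ≤ memory / cores. This form of the bound is inferred from the slide's figures (2 GB taken as 2 ∙ 10⁹ bytes); the function name is hypothetical.

```python
import math


def max_partition_size(memory_bytes: int, cores: int, c_ms: int) -> int:
    """Largest m with c_ms * m^2 <= memory_bytes / cores (assumed bound)."""
    return math.isqrt(memory_bytes // (cores * c_ms))


print(max_partition_size(2 * 10**9, cores=4, c_ms=20))   # 5000
```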

BLOCKING-BASED PARTITIONING

• Blocking – logical clustering of possibly matching entities
  • Blocks of largely varying size
  • Entities with missing attribute values
    • Assigned to dedicated misc block
    • Have to be compared with entities of all blocks

• Simple approach – one match task per block
  • Poor load balancing and/or high communication overhead
    • Large blocks dominate execution time and consume much memory
    • Small blocks slow down parallel matching due to high communication overhead compared to time for matching
  • Partition tuning to split or aggregate blocks
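Blocking with a dedicated misc block can be sketched as a simple group-by on the blocking key. The record layout, the attribute name `type` and the sample offers are made up for illustration.

```python
from collections import defaultdict

MISC = "misc"   # block for entities without a blocking-key value


def build_blocks(entities, blocking_key):
    """Group entities by blocking key; missing or empty values go to misc."""
    blocks = defaultdict(list)
    for entity in entities:
        value = entity.get(blocking_key)
        blocks[value if value else MISC].append(entity)
    return dict(blocks)


# Hypothetical product offers; offers 4 and 5 lack a usable key value.
offers = [
    {"id": 1, "title": "drive A", "type": "3½"},
    {"id": 2, "title": "drive B", "type": "3½"},
    {"id": 3, "title": "drive C", "type": "Blu-ray"},
    {"id": 4, "title": "drive D", "type": None},
    {"id": 5, "title": "drive E"},
]
blocks = build_blocks(offers, "type")
print({name: len(members) for name, members in blocks.items()})
# {'3½': 2, 'Blu-ray': 1, 'misc': 2}
```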

BLOCKING-BASED PARTITIONING – PARTITION TUNING

• Large blocks for which matching would consume too much memory are split into equally-sized partitions
• Max. partition size m chosen according to memory requirement estimation
• All sub-partitions of a split block have to be matched with each other

[Figure: blocks of the product category Drives & Storage (3,250): 3½ (1,300), 2½ (600), Blu-ray (60), HD-DVD (40), CD-RW (50), DVD-RW (600), Misc (600). With max. partition size = 700, block 3½ is split into the sub-partitions 3½ 1 (700) and 3½ 2 (600).]
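A minimal sketch of the split step, working on block sizes only. It fills sub-partitions up to the max. partition size and puts the remainder into the last one, which reproduces the 700/600 split of the figure; whether the original implementation balances sub-partitions differently is not stated on the slide. The function name and naming scheme are assumptions.

```python
def split_block(name, size, max_size):
    """Split a block into sub-partitions of at most max_size entities."""
    if size <= max_size:
        return [(name, size)]            # small enough, keep as is
    parts = []
    remaining = size
    while remaining > 0:
        part = min(max_size, remaining)
        parts.append((f"{name} {len(parts) + 1}", part))
        remaining -= part
    return parts


print(split_block("3½", 1300, 700))   # [('3½ 1', 700), ('3½ 2', 600)]
print(split_block("2½", 600, 700))    # [('2½', 600)]
```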

BLOCKING-BASED PARTITIONING – PARTITION TUNING

• Small blocks with sizes below min. partition size are aggregated into larger ones
• Fewer partitions → fewer match tasks → reduced communication and scheduling overhead
• Aggregation introduces unnecessary comparisons and may lead to false positives

[Figure: partitions 3½ 1 (700), 3½ 2 (600), 2½ (600), Blu-ray (60), HD-DVD (40), CD-RW (50), DVD-RW (600), misc (600) with min. partition size = 70 and max. partition size = 700. The small blocks Blu-ray, HD-DVD and CD-RW are aggregated step by step (100, then 150 entities) into one partition.]
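The aggregation step might look as follows. This sketch greedily merges all blocks below the min. partition size, in input order, as long as the merged partition stays within the max. partition size. The greedy rule is an assumption that happens to reproduce the figure; the misc block is left out because it is handled separately.

```python
def aggregate_small_blocks(blocks, min_size, max_size):
    """blocks: list of (name, size). Merge blocks smaller than min_size."""
    result = []
    names, total = [], 0                     # partition currently being filled
    for name, size in blocks:
        if size >= min_size:
            result.append((name, size))      # large enough, keep unchanged
            continue
        if names and total + size > max_size:
            result.append(("/".join(names), total))
            names, total = [], 0
        names.append(name)
        total += size
    if names:
        result.append(("/".join(names), total))
    return result


blocks = [("3½ 1", 700), ("3½ 2", 600), ("2½", 600), ("Blu-ray", 60),
          ("HD-DVD", 40), ("CD-RW", 50), ("DVD-RW", 600)]
tuned = aggregate_small_blocks(blocks, min_size=70, max_size=700)
print(tuned[-1])   # ('Blu-ray/HD-DVD/CD-RW', 150)
```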

BLOCKING-BASED PARTITIONING – MATCH TASK GENERATION

• One match task per normal (non-misc) block that has not been split
• Blocks that have been split into k sub-partitions result in k+k(k-1)/2 match tasks
• The misc block (or its sub-partitions) has to be matched with all blocks (sub-partitions)

[Figure: match task matrix for the tuned Drives & Storage partitions 3½ 1, 3½ 2, 2½, Blu-ray/HD-DVD/CD-RW (150), DVD-RW and misc, built up in four steps. Unsplit blocks are matched with themselves only; 3½ 1 and 3½ 2 are matched with themselves and with each other; the misc row carries an X in every column.]
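The three rules can be combined into a small task generator. The data layout (a dict from block name to its sub-partition names) and the function name are assumptions. Matching misc with itself is included because the misc row of the figure shows six X marks.

```python
from itertools import combinations_with_replacement


def generate_match_tasks(blocks, misc_parts):
    """blocks: {block name: [sub-partition names]}, misc_parts: [names]."""
    tasks = []
    for parts in blocks.values():
        # 1 task for an unsplit block, k + k(k-1)/2 for k sub-partitions
        tasks.extend(combinations_with_replacement(parts, 2))
    normal = [part for parts in blocks.values() for part in parts]
    for misc in misc_parts:
        # misc has to be matched with every normal (sub-)partition
        tasks.extend((part, misc) for part in normal)
    # misc entities also have to be compared with each other
    tasks.extend(combinations_with_replacement(misc_parts, 2))
    return tasks


blocks = {
    "3½": ["3½ 1", "3½ 2"],
    "2½": ["2½"],
    "Blu-ray/HD-DVD/CD-RW": ["Blu-ray/HD-DVD/CD-RW"],
    "DVD-RW": ["DVD-RW"],
}
tasks = generate_match_tasks(blocks, misc_parts=["misc"])
print(len(tasks))   # 12
```

Under these assumptions the example yields 12 match tasks, compared with 6 + 6∙5/2 = 21 tasks if all six partitions were matched pairwise as in size-based partitioning.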

MATCH INFRASTRUCTURE – SIZE-BASED PARTITIONING EXAMPLE

[Figure: infrastructure with a data service (source data with entities …, ek-1, ek, and attribute histograms), a workflow service (equally sized partitions p1 … pn, described logically, and the match task list t1, t2, …, tn+1, …, tn(n-1)/2) and two match services (Match Service 1, Match Service 2) with CPU1 and CPU2.]

INFRASTRUCTURE – SIZE-BASED PARTITIONING EXAMPLE

[Figure: data service, workflow service with the match task list t1 … tn(n-1)/2, and match services 1 and 2 (CPU1, CPU2). Final step: unify partial match results.]
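The interplay of task list, match services and result unification can be imitated on a single machine. In this sketch a thread pool stands in for the match services and a plain function for the workflow service; both are simplifications, since the slides describe separate services on several nodes. All names and the toy matcher are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations_with_replacement


def run_match_task(task, partitions, matcher, threshold):
    """Compare two partitions and return the matching entity pairs."""
    i, j = task
    left, right = partitions[i], partitions[j]
    result = set()
    for idx, a in enumerate(left):
        # within one partition, compare each unordered pair only once
        start = idx + 1 if i == j else 0
        for b in right[start:]:
            if matcher(a, b) >= threshold:
                result.add((a, b))
    return result


def parallel_match(partitions, matcher, threshold, workers=4):
    """Generate the task list, run the tasks in parallel, unify the results."""
    tasks = list(combinations_with_replacement(range(len(partitions)), 2))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial = pool.map(
            lambda t: run_match_task(t, partitions, matcher, threshold), tasks)
        return set().union(*partial)


def toy_matcher(a, b):
    """Illustrative matcher: 1.0 if equal ignoring case, else 0.0."""
    return 1.0 if a.lower() == b.lower() else 0.0


matches = parallel_match([["a1", "a2"], ["b1", "A1"]], toy_matcher, 1.0)
print(matches)   # {('a1', 'A1')}
```

For CPU-bound matchers in Python, `ProcessPoolExecutor` would be the closer analogue of independent match services, at the cost of shipping partitions between processes.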

EVALUATION

• Datasets
  • 114,000 electronic product offers
  • Small subset of 20,000 offers

• Two matchers
  • WAM – Levenshtein, Trigram (weighted average)
  • LRM – Jaccard, Trigram, Cosine (machine learning)

• Computing environment
  • Up to 16 cores (4 nodes with 4x2.66GHz and 4GB RAM)
  • 3GB RAM heap size

EVALUATION – INFLUENCE OF THE MAX. PARTITION SIZE

• 20,000 electronic product offers
• Cartesian product
• Single node, 4 match threads

[Figure annotations: WAM: mmax = 1,000; LRM: mmax = 500]

EVALUATION – PARALLEL MATCHING ON MULTIPLE NODES

• 20,000 electronic product offers
• Cartesian product (mmax = 1000/500)
• 4 nodes, 4 match threads per node

EVALUATION – PARALLEL MATCHING ON MULTIPLE NODES

• 114,000 electronic product offers
• Blocking (mmax = 1000/500, mmin = 200/100)
• 4 nodes, 4 match threads per node

CONCLUSIONS & FUTURE WORK

• Two generic data partitioning strategies for parallel matching with any matchers
  • Size-based partitioning
  • Blocking-based partitioning
• Partition tuning to achieve evenly loaded nodes
• Evaluation on newly developed service-based infrastructure

• Adapt approaches to cloud architectures
• Parallelize Blocking
• Investigate optimizations within match strategies

Please also note our contribution to the VLDB experiments and analysis track:

Evaluation of entity resolution approaches on real-world match problems
Hanna Köpcke, Andreas Thor, Erhard Rahm

Date: 14 September 2010, Tuesday
Time: 17:30 hours
Room: Swallow

THANK YOU FOR YOUR ATTENTION
