Transcript of Amit Shvarchenberg and Rafi Sayag. Based on a paper by: Robin Dhamankar, Yoonkyong Lee, AnHai Doan...
- Slide 1
- Amit Shvarchenberg and Rafi Sayag
- Slide 2
- Based on a paper by:
  Robin Dhamankar, Yoonkyong Lee, AnHai Doan
  Department of Computer Science, University of Illinois, Urbana-Champaign, IL, USA
  {dhamanka,ylee11,anhai}@cs.uiuc.edu
  Alon Halevy, Pedro Domingos
  Department of Computer Science and Engineering, University of Washington, Seattle, WA, USA
  {alon,pedrod}@cs.washington.edu
- Slide 3
- Introduction
  Today there are many databases around the world, and it is often
  necessary to combine two or more similar databases into a single
  database. In the past, many of these integrations were done manually.
  The iMAP system offers a semi-automatic method for matching
  information from different sources.
- Slide 4
- The Real-Estate-Agents Example
  Schema S, table HOUSES:
    location    | price   | agent-id
    Raleigh, NC | 360,000 | 32
    Atlanta, GA | 430,000 | 15
  Schema S, table AGENTS:
    id | name       | city    | state | fee-rate
    32 | Mike Brown | Athens  | GA    | 0.03
    15 | Jean Laup  | Raleigh | NC    | 0.04
  Schema T, table LISTING:
    area        | list-price | agent-address | agent-name
    Denver, CO  | 550000     | Boulder, CO   | Laura Smith
    Atlanta, GA | 370800     | Athens, GA    | Mike Brown
- Slide 5
- The Big Merge
- Slide 6
- Making Tuples Using SQL
  area = SELECT location FROM HOUSES
  agent-address = SELECT concat(city, state) FROM AGENTS
  list-price = SELECT price * (1 + fee-rate) FROM HOUSES, AGENTS WHERE agent-id = id
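The three mappings above can be tried out on the example rows from Slide 4. The following is a minimal sketch using Python's built-in sqlite3 module, not part of the original slides. It assumes underscore column names (agent_id, fee_rate) because hyphens are not valid in unquoted SQL identifiers, uses SQLite's `||` operator in place of concat, inserts a ", " separator, and combines the three mappings into one join query.

```python
import sqlite3

# In-memory database holding the source tables from the example.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE houses (location TEXT, price REAL, agent_id INTEGER);
    CREATE TABLE agents (id INTEGER, name TEXT, city TEXT, state TEXT, fee_rate REAL);
    INSERT INTO houses VALUES ('Raleigh, NC', 360000, 32);
    INSERT INTO houses VALUES ('Atlanta, GA', 430000, 15);
    INSERT INTO agents VALUES (32, 'Mike Brown', 'Athens', 'GA', 0.03);
    INSERT INTO agents VALUES (15, 'Jean Laup', 'Raleigh', 'NC', 0.04);
""")

# One query that produces the target attributes area, list-price and agent-address.
listing = conn.execute("""
    SELECT h.location                 AS area,
           h.price * (1 + a.fee_rate) AS list_price,
           a.city || ', ' || a.state  AS agent_address
    FROM houses AS h
    JOIN agents AS a ON h.agent_id = a.id
    ORDER BY h.location
""").fetchall()

for row in listing:
    print(row)
```

The Raleigh house (price 360,000, agent 32 with fee rate 0.03) yields a list price of 370,800, which is the value that appears in the LISTING table of the example.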
- Slide 7
- How Do We Match?
  The process of creating mappings typically proceeds in two steps.
  First step: schema matching, where we find matches between elements
  of the two schemas.
  Second step: we elaborate the matches to create query expressions
  that enable automated data translation or exchange.
- Slide 8
- Schema Matches
  There are two kinds of schema matches. The first kind is 1-1
  matches, where one attribute corresponds to one attribute, for
  example area in LISTING and location in HOUSES.
  (The slide shows the example tables from Slide 4.)
- Slide 10
- Complex Matches
  Complex matches specify that some combination of attributes in one
  schema corresponds to a combination of attributes in the other, for
  example agent-address = concat(city, state) and
  list-price = price * (1 + fee-rate).
  (The slide shows the example tables from Slide 4.)
- Slide 14
- The Solution: The iMAP System
  We will describe the iMAP system, which semi-automatically discovers
  complex matches for relational data in a single table. In some cases
  iMAP is able to find matches that combine attributes from multiple
  tables.
- Slide 15
- The iMAP Architecture
- Slide 16
- Match Generator
  Input: target schema and source schema.
  Output: match candidates.
- Slide 17
- How the Match Generator Works
  The match generator searches the space of possible match candidates.
  Its searchers use prior knowledge of possible match types, together
  with heuristic methods.
- Slide 18
- The Internals of a Searcher
  Applying search to candidate generation involves three major issues:
  search strategy, evaluation of candidate matches, and the
  termination condition.
- Slide 19
- Search Strategy
  The search space can be very large or even unbounded, so we need to
  search it efficiently. iMAP addresses this problem with a search
  technique called beam search.
- Slide 20
- Beam Search
  Beam search uses a scoring function to evaluate each match
  candidate. At each level of the search tree, it keeps only the k
  highest-scoring match candidates. This lets the searcher conduct a
  very efficient search in any type of search space.
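The idea of keeping only the k best candidates per level can be sketched as follows. This is an illustrative toy, not iMAP's implementation: the candidate representation (tuples of attribute names), the `expand` function, and the string-similarity `score` built on difflib are all assumptions made for the example.

```python
import heapq
from difflib import SequenceMatcher


def beam_search(initial, expand, score, k, max_depth):
    """Explore the search tree level by level, keeping the k best candidates per level."""
    beam = heapq.nlargest(k, initial, key=score)
    best = list(beam)
    for _ in range(max_depth):
        children = [child for cand in beam for child in expand(cand)]
        if not children:
            break
        # Prune: only the k highest-scoring children survive to the next level.
        beam = heapq.nlargest(k, children, key=score)
        best = heapq.nlargest(k, best + beam, key=score)
    return best


# Toy problem (hypothetical): which concatenation of source attributes
# reproduces the target value "Athens, GA"?
row = {"name": "Mike Brown", "city": "Athens", "state": "GA"}
target_value = "Athens, GA"


def expand(candidate):
    # Extend a candidate by appending one attribute it does not use yet.
    return [candidate + (attr,) for attr in row if attr not in candidate]


def score(candidate):
    # Similarity between the concatenated values and the target value.
    text = ", ".join(row[attr] for attr in candidate)
    return SequenceMatcher(None, text, target_value).ratio()


best = beam_search([(attr,) for attr in row], expand, score, k=2, max_depth=2)
print(best[0], score(best[0]))
```

With k=2 the search never evaluates extensions of the weakest single-attribute candidate, which is the pruning that keeps the search tractable in large spaces.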
- Slide 21
- Implemented Searchers on iMAP
- Slide 22
- Example: Unit Conversion Searcher
  The unit conversion searcher can identify a conversion between two
  different types of measurement unit. It does so by looking at the
  names and data of the attributes (e.g., "hours", "kg", "$", etc.).
- Slide 23
- Example: Unit Conversion Searcher (cont.)
  The searcher finds the best conversion from a set of conversion
  functions between the units. In this case, since 1 kg is about
  2.2 pounds, weight_kg = weight_pounds / 2.2.
  Table with weights in pounds:
    product | pounds
    apple   | 10
  Table with weights in kg:
    Fruits and vegetables | kg
    banana                | 5
  Merged table (kg):
    Fruits and vegetables | kg
    banana                | 5
    apple                 | 4.5
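A unit conversion search of this kind can be sketched in a few lines. The sketch below is an assumption-laden illustration rather than iMAP's actual searcher: the `CONVERSIONS` table, the name hints, and the scoring rule (pick the conversion whose converted values have a mean closest to the target column's mean) are all hypothetical choices.

```python
from statistics import mean

# Hypothetical set of known conversion functions, keyed by (source unit, target unit).
CONVERSIONS = {
    ("pounds", "kg"): lambda x: x * 0.45359237,
    ("kg", "pounds"): lambda x: x / 0.45359237,
    ("hours", "minutes"): lambda x: x * 60,
    ("minutes", "hours"): lambda x: x / 60,
}
UNIT_HINTS = ("pounds", "kg", "hours", "minutes")


def detect_unit(attribute_name):
    """Look for a unit keyword in the attribute name; None if there is no hint."""
    name = attribute_name.lower()
    for unit in UNIT_HINTS:
        if unit in name:
            return unit
    return None


def find_unit_conversion(src_name, src_values, tgt_name, tgt_values):
    """Use names to narrow the candidates, then data to choose the best one."""
    src_unit, tgt_unit = detect_unit(src_name), detect_unit(tgt_name)
    candidates = [
        units for units in CONVERSIONS
        if src_unit in (None, units[0]) and tgt_unit in (None, units[1])
    ]
    if not candidates:
        return None
    target_mean = mean(tgt_values)

    def error(units):
        converted = [CONVERSIONS[units](v) for v in src_values]
        return abs(mean(converted) - target_mean)

    return min(candidates, key=error)


source_weights = [10.0, 22.0, 150.0]   # made-up values in pounds
target_weights = [5.0, 9.5, 70.0]      # made-up values in kg
by_name = find_unit_conversion("pounds", source_weights, "kg", target_weights)
by_data = find_unit_conversion("w1", source_weights, "w2", target_weights)
print(by_name, by_data)
```

When the attribute names carry no unit hint, the sketch falls back on the data alone, which mirrors the slide's point that both names and data are consulted.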
- Slide 24
- Similarity Estimator
  Input: match candidates.
  Output: similarity matrix.
  The similarity matrix stores the similarity score of
  (target attribute, match candidate) pairs.
- Slide 25
- Similarity Estimator
  The similarity estimator gets the results from all the searchers.
  Then it gathers the data and calculates a final score for each
  match.
- Slide 26
- Similarity Estimator (cont.)
  The similarity estimator uses two methods to score match pairs:
  a name-based evaluator and a Naive Bayes evaluator.
- Slide 27
- Match Selector
  Input: similarity matrix.
  Output: 1-1 and complex matches.
- Slide 28
- Match Selector
  The match selector examines the score matrix and outputs the best
  matches under certain conditions.
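One simple selection rule is sketched below. The slides do not say which conditions iMAP applies, so the rule here (take the highest-scoring candidate per target attribute, and only if it reaches a threshold) is an assumption, as are the dictionary layout of the similarity matrix and the scores themselves.

```python
def select_matches(similarity, threshold=0.5):
    """Pick, for each target attribute, its best candidate if the score is high enough.

    `similarity` maps target attribute -> {match candidate: score}.
    """
    selected = {}
    for target_attr, scores in similarity.items():
        if not scores:
            continue
        candidate, score = max(scores.items(), key=lambda item: item[1])
        if score >= threshold:
            selected[target_attr] = (candidate, score)
    return selected


# Made-up scores, for illustration only.
similarity = {
    "area": {"location": 0.9, "city": 0.4},
    "list-price": {"price": 0.6, "price * (1 + fee-rate)": 0.8},
    "agent-name": {"location": 0.2},
}
matches = select_matches(similarity)
for target_attr, (candidate, score) in matches.items():
    print(f"{target_attr} = {candidate}  (score {score})")
```

In this toy matrix, agent-name gets no match because its only candidate scores below the threshold.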
- Slide 29
- Exploiting Domain Knowledge
  Exploiting domain knowledge has been shown to be beneficial for 1-1
  matching. For complex matching it can be even more crucial, since
  early detection of unlikely matches saves valuable processing.
- Slide 30
- Domain Constraints
  Constraints are either present in the schema or provided by an
  expert or the user. iMAP considers three kinds of constraints:
  two attributes are unrelated; a constraint on a single attribute;
  multiple schema attributes are unrelated.
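Constraints like these can be used to discard match candidates early. The sketch below shows one way to do that and rests on several assumptions that are not in the slides: "unrelated" is read as "must not be combined in one candidate", the single-attribute constraint is modeled as a limit on the mean of the values a candidate produces, and the `Candidate` type and all example values are hypothetical.

```python
from statistics import mean
from typing import FrozenSet, NamedTuple, Tuple


class Candidate(NamedTuple):
    target: str                   # target attribute the candidate is proposed for
    source_attrs: FrozenSet[str]  # source attributes the candidate combines
    values: Tuple[float, ...]     # sample values the candidate produces


def unrelated(*attrs):
    """Constraint: the given attributes may not all appear in one candidate.

    Covers both the two-attribute and the multiple-attribute case.
    """
    group = frozenset(attrs)
    return lambda cand: not (group <= cand.source_attrs)


def mean_at_most(target, limit):
    """Constraint on a single target attribute: mean of produced values <= limit."""
    return lambda cand: cand.target != target or mean(cand.values) <= limit


def prune(candidates, constraints):
    """Keep only candidates that satisfy every constraint."""
    return [c for c in candidates if all(ok(c) for ok in constraints)]


candidates = [
    Candidate("list-price", frozenset({"price", "fee-rate"}), (370800.0, 447200.0)),
    Candidate("list-price", frozenset({"price", "agent-id"}), (360032.0, 430015.0)),
    Candidate("list-price", frozenset({"price", "id"}), (11520000.0, 6450000.0)),
]
constraints = [unrelated("price", "agent-id"), mean_at_most("list-price", 1_000_000)]
kept = prune(candidates, constraints)
print(kept)
```

Only the first candidate survives: the second combines two attributes declared unrelated, and the third produces values that violate the limit on the target attribute.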
- Slide 31
- Sources for Domain Constraints
  Past complex matches, overlap data, external data.
- Slide 32
- Past Complex Matches
  We often find that we map the same or similar schemas repeatedly.
  iMAP can extract a template expression from such matches.
  Example: given the past match price = pr * (1+0.6), iMAP will
  extract VAR * (1 + CONST) and ask the numeric searcher to look for
  matches for that template.
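Template extraction of this kind can be illustrated with a small tokenizer. This is a sketch under stated assumptions, not iMAP's method: it replaces every identifier with VAR and every decimal literal with CONST while leaving integer literals such as 1 untouched, a rule chosen only because it reproduces the example above. The function name `extract_template` is hypothetical.

```python
import re

# Decimal literals, integer literals, identifiers, and arithmetic symbols.
TOKEN = re.compile(r"\d+\.\d+|\d+|[A-Za-z_]\w*|[()+\-*/]")


def extract_template(past_match):
    """Turn the right-hand side of a past match into a reusable template."""
    left, sep, right = past_match.partition("=")
    expression = right if sep else left
    parts = []
    for token in TOKEN.findall(expression):
        if re.fullmatch(r"\d+\.\d+", token):
            parts.append("CONST")      # fitted later by the numeric searcher
        elif re.fullmatch(r"[A-Za-z_]\w*", token):
            parts.append("VAR")        # placeholder for a source attribute
        else:
            parts.append(token)        # operators, parentheses, integers
    return " ".join(parts).replace("( ", "(").replace(" )", ")")


template = extract_template("price = pr * (1+0.6)")
print(template)
```

A searcher could then instantiate VAR with each numeric source attribute and fit CONST from the data.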
- Slide 33
- Overlap Data
  In some cases, the source and the target share some of the same
  data. This can be used as information for the matching process.
  Searchers that exploit overlap data: the overlap text searcher, the
  overlap numeric searcher, and the overlap category and schema
  mismatch searcher.
- Slide 34
- External Data
  External data is used as additional constraints on the attributes
  of a schema. It is usually provided by experts and can be very
  useful in schema matching.
- Slide 35
- Why do we need it?
- Slide 36
- Generating Explanations in iMAP
  iMAP's goal is to provide a design environment where a human user
  can quickly generate a mapping between a pair of schemas. For a
  user to know which match to choose, it is necessary to supply an
  explanation for each of the matches.
- Slide 37
- User Questions
  iMAP considers three questions that a user might ask:
  Why does a match exist? Why does a match not exist? Why is one
  match better than another?
- Slide 38
- Explanation Generation
  iMAP keeps track of the decision-making process as a dependency
  graph. Each node is a schema attribute, an assumption, a candidate
  match, or a piece of domain knowledge. An edge between two nodes
  means that one node led to the other.
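A dependency graph like this can answer "why does this match exist?" by walking the edges backwards from the match. The sketch below is a minimal illustration: the `DependencyGraph` class, its method names, and the node labels are hypothetical and are not taken from iMAP.

```python
from collections import defaultdict


class DependencyGraph:
    """Directed graph in which an edge cause -> effect means 'cause led to effect'."""

    def __init__(self):
        self._causes = defaultdict(list)  # effect -> list of direct causes

    def add_edge(self, cause, effect):
        self._causes[effect].append(cause)

    def explain(self, node):
        """Return every node that directly or indirectly led to `node`."""
        seen, order, stack = set(), [], [node]
        while stack:
            current = stack.pop()
            for cause in self._causes.get(current, []):
                if cause not in seen:
                    seen.add(cause)
                    order.append(cause)
                    stack.append(cause)
        return order


graph = DependencyGraph()
candidate = "candidate: price * (1 + fee-rate)"
match = "match: list-price = price * (1 + fee-rate)"
graph.add_edge("attribute: price", candidate)
graph.add_edge("attribute: fee-rate", candidate)
graph.add_edge("domain knowledge: template VAR * (1 + CONST)", candidate)
graph.add_edge(candidate, match)

explanation = graph.explain(match)
for reason in explanation:
    print("because of", reason)
```

The same structure supports the other user questions: a missing match can be traced to the constraint or assumption node that eliminated its candidate, provided such nodes are recorded.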
- Slide 39
- Explanation Generation Example