Amit Shvarchenberg and Rafi Sayag. Based on a paper by: Robin Dhamankar, Yoonkyong Lee, AnHai Doan...
39
Methods for Data Integration Amit Shvarchenberg and Rafi Sayag
-
Upload
landen-philips -
Category
Documents
-
view
215 -
download
1
Transcript of Amit Shvarchenberg and Rafi Sayag. Based on a paper by: Robin Dhamankar, Yoonkyong Lee, AnHai Doan...
- Slide 1
- Amit Shvarchenberg and Rafi Sayag
- Slide 2
- Based on a paper by: Robin Dhamankar, Yoonkyong Lee, AnHai Doan Department of Computer Science University of Illinois, Urbana-Champaign, IL, USA fdhamanka,ylee11,[email protected] Alon Halevy, Pedro Domingos Department of Computer Science and Engineering University of Washington, Seattle, WA, USA falon,[email protected]
- Slide 3
- Introduction Today there are a lot of databases around the world, and many times it is required to combine two or more similar databases into a single database In the past, many of this integrations were made manually The iMAP system offers a semi-automatic method of matching information from different sources
- Slide 4
- The Real-Estate-Agents Example locationpriceAgent-id Raleigh, NC360,00032 Atlanta, GA430,00015 areaList-priceAgent- address Agent- name Denver,CO550000Boulder, COLaura Smith Atlanta,GA370800Athens, GAMike Brown IdNamecityStateFee-rate 32Mike brownAthensGA0.03 15Jean LaupRaleighNC0.04 Schema T Schema S HOUSES AGENTS LISTING
- Slide 5
- The Big Merge
- Slide 6
- Making Tuples Using SQL area= SELECT location from HOUSES agent-address= SELECT concat(city, state) FROM AGENTS list-price= SELECT price * (1 + fee-rate) FROM HOUSES, AGENTS WHERE agent-id = id
- Slide 7
- How Do We Match ? The process of creating mappings typically proceeds in two steps. first step: schema matching, we find matches between elements of the two schemas. second step :we elaborate the matches to create query expressions that enable automated data translation or exchange.
- Slide 8
- Schema Matches There are two kinds of schema matches. 1-1 matches. locationpriceAgent-id Raleigh, NC360,00032 Atlanta, GA430,00015 IdNamecityStateFee-rate 32Mike brownAthensGA0.03 15Jean LaupRaleighNC0.04 areaList-priceAgent- address Agent- name Denver,CO550000Boulder, COLaura Smith Atlanta,GA370800Athens, GAMike Brown
- Slide 9
- Schema Matches There are two kinds of schema matches. 1-1 matches. locationpriceAgent-id Raleigh, NC360,00032 Atlanta, GA430,00015 IdNamecityStateFee-rate 32Mike brownAthensGA0.03 15Jean LaupRaleighNC0.04 areaList-priceAgent- address Agent- name Denver,CO550000Boulder, COLaura Smith Atlanta,GA370800Athens, GAMike Brown
- Slide 10
- Complex Matches specify that some combination of attributes in one schema corresponds to a combination in the other. locationpriceAgent-id Raleigh, NC360,00032 Atlanta, GA430,00015 IdNamecityStateFee-rate 32Mike brownAthensGA0.03 15Jean LaupRaleighNC0.04 areaList-priceAgent- address Agent- name Denver,CO550000Boulder, COLaura Smith Atlanta,GA370800Athens, GAMike Brown
- Slide 11
- Complex Matches specify that some combination of attributes in one schema corresponds to a combination in the other. locationpriceAgent-id Raleigh, NC360,00032 Atlanta, GA430,00015 IdNamecityStateFee-rate 32Mike brownAthensGA0.03 15Jean LaupRaleighNC0.04 areaList-priceAgent- address Agent- name Denver,CO550000Boulder, COLaura Smith Atlanta,GA370800Athens, GAMike Brown
- Slide 12
- Complex Matches specify that some combination of attributes in one schema corresponds to a combination in the other. locationpriceAgent-id Raleigh, NC360,00032 Atlanta, GA430,00015 IdNamecityStateFee-rate 32Mike brownAthensGA0.03 15Jean LaupRaleighNC0.04 areaList-priceAgent- address Agent- name Denver,CO550000Boulder, COLaura Smith Atlanta,GA370800Athens, GAMike Brown
- Slide 13
- Complex Matches specify that some combination of attributes in one schema corresponds to a combination in the other. locationpriceAgent-id Raleigh, NC360,00032 Atlanta, GA430,00015 IdNamecityStateFee-rate 32Mike brownAthensGA0.03 15Jean LaupRaleighNC0.04 areaList-priceAgent- address Agent- name Denver,CO550000Boulder, COLaura Smith Atlanta,GA370800Athens, GAMike Brown
- Slide 14
- The Solution The iMAP System We will describe the iMAP system which semi- automatically discovers complex matches for relational data in a single table. In some cases iMAP able to find matches that combine attributes from multiple tables.
- Slide 15
- The iMAP Architecture
- Slide 16
- Match Generator Input: target schema and source schema. Output: match candidates.
- Slide 17
- How Match Generator Works Match generator uses a searching method that goes through all possible match candidates. The searchers uses a prior knowledge of possible match types and heuristic methods.
- Slide 18
- The Internals of a Searcher Applying search to candidate generation involve three major issues: Search strategy Evaluation of candidate matches Termination condition
- Slide 19
- Search Strategy The space search can be very large or even unbounded. We need to efficiently search such spaces. iMAP address this problem using a search technique called beam search.
- Slide 20
- Beam Search Beam search uses a scoring function to evaluate each match candidate At each level of the search tree, it keeps only k highest- scoring match. By that the searcher can conduct a very efficient search in any type of search space.
- Slide 21
- Implemented Searchers on iMAP
- Slide 22
- Example: Unit Conversion Searcher The unit conversion searcher can identify a conversion between two different types of measurement unit. It can do so By looking in the name and data of the attributes. (e.g., hours", kg", $", etc.)
- Slide 23
- The searcher finds the best conversion from a set of conversion functions between the units. In this case weight_kg = 2.2 * weight_pounds. productpounds apple10 Fruits and vegetableskg banna5 Fruits and vegetableskg banna5 apple22 Example: Unit Conversion Searcher (cont.)
- Slide 24
- Similarity Estimator Input: Match candidates. Output: Similarity matrix. Similarity matrix stores the similarity score of pairs
- Slide 25
- Similarity Estimator The similarity estimator gets the results from all the searchers. Then it gathers the data and calculates a final score for each match
- Slide 26
- Similarity Estimator (cont.) The similarity estimator uses two methods to score match pairs: Name based evaluator Nave Bayese evaluator
- Slide 27
- Match Selector Input: Similarity matrix. Output: 1-1 and complex matches.
- Slide 28
- Match Selector Match Selector examines the score matrix and outputs the best matches under certain conditions.
- Slide 29
- Exploiting Domain Knowledge Exploiting domain knowledge was shown to be beneficial on 1-1 matching On complex matching, it can be even more crucial, since it can save valuable processing by early detection of unlikely matches
- Slide 30
- Domain Constraints Constraints are either present in the schema, or provided by an expert or the user iMAP considers 3 kinds of constraints: Two attributes are un-related Constraint on a single attribute Multiple schema attributes are un-related
- Slide 31
- Sources For Domain Constraints Past Complex Matches Overlap data External Data
- Slide 32
- Past Complex Matches We often find that we map the same or similar schemas repeatedly iMAP can extract a template expression from such matches Example Given the past match: price = pr * (1+0.6) iMAP will extract: VAR * (1 + CONST) and ask the numeric searcher to look for matches for that template
- Slide 33
- Overlap Data In some cases, both the source and the target share the same data This can be used as information for the matching process Searchers that exploit overlap data: Overlap text searcher Overlap numeric searcher Overlap category and schema mismatch searcher
- Slide 34
- External Data External data is used as additional constraints on the attributes of a schema Usually provided by experts Can be very useful in schema matching
- Slide 35
- Why do we need it?
- Slide 36
- Generating Explanations in iMAP iMAPs goal is to provide a design environment where a human user can quickly generate a mapping between a pair of schemas For a user to know what match to choose, it is necessary to supply an explanation for each of the matches
- Slide 37
- User Questions iMAP considers 3 questions that might be asked by a user: Why the match exist? Why the match doesnt exist? Why is one match better than the other?
- Slide 38
- Explanation Generation iMAP keeps track of the decision making progress as a dependency graph: Each node is either a schema attribute, an assumption, candidate matches or domain knowledge An edge between two nodes means that one node lead to another
- Slide 39
- Explanation Generation Example