Amit Shvarchenberg and Rafi Sayag. Based on a paper by: Robin Dhamankar, Yoonkyong Lee, AnHai Doan...

Methods for Data Integration Amit Shvarchenberg and Rafi Sayag


  • Slide 1
  • Amit Shvarchenberg and Rafi Sayag
  • Slide 2
  • Based on a paper by: Robin Dhamankar, Yoonkyong Lee, AnHai Doan (Department of Computer Science, University of Illinois, Urbana-Champaign, IL, USA) and Alon Halevy, Pedro Domingos (Department of Computer Science and Engineering, University of Washington, Seattle, WA, USA)
  • Slide 3
  • Introduction Today there are many databases around the world, and it is often necessary to combine two or more similar databases into a single database. In the past, many of these integrations were done manually. The iMAP system offers a semi-automatic method for matching information from different sources.
  • Slide 4
  • The Real-Estate-Agents Example
      Schema S, table HOUSES:
        location    | price   | agent-id
        Raleigh, NC | 360,000 | 32
        Atlanta, GA | 430,000 | 15
      Schema S, table AGENTS:
        id | name       | city    | state | fee-rate
        32 | Mike Brown | Athens  | GA    | 0.03
        15 | Jean Laup  | Raleigh | NC    | 0.04
      Schema T, table LISTING:
        area        | list-price | agent-address | agent-name
        Denver, CO  | 550000     | Boulder, CO   | Laura Smith
        Atlanta, GA | 370800     | Athens, GA    | Mike Brown
  • Slide 5
  • The Big Merge
  • Slide 6
  • Making Tuples Using SQL
      area = SELECT location FROM HOUSES
      agent-address = SELECT concat(city, state) FROM AGENTS
      list-price = SELECT price * (1 + fee-rate) FROM HOUSES, AGENTS WHERE agent-id = id
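The three mappings on this slide can be tried out on the example data. The following is a minimal sketch using Python's built-in sqlite3 module, not the presenters' code. It assumes underscore column names (agent_id, fee_rate) because hyphens are not valid in unquoted SQL identifiers, and it assumes a ", " separator when concatenating city and state.

```python
import sqlite3

# In-memory copy of the HOUSES and AGENTS example tables.
# Column names with underscores are an assumption of this sketch.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE houses (location TEXT, price REAL, agent_id INTEGER);
    CREATE TABLE agents (id INTEGER, name TEXT, city TEXT, state TEXT, fee_rate REAL);
    INSERT INTO houses VALUES ('Raleigh, NC', 360000, 32), ('Atlanta, GA', 430000, 15);
    INSERT INTO agents VALUES (32, 'Mike Brown', 'Athens', 'GA', 0.03),
                              (15, 'Jean Laup', 'Raleigh', 'NC', 0.04);
""")


def build_listing(connection):
    """Return (area, list_price, agent_address) rows, one per house."""
    query = """
        SELECT h.location                 AS area,
               h.price * (1 + a.fee_rate) AS list_price,
               a.city || ', ' || a.state  AS agent_address
        FROM houses AS h
        JOIN agents AS a ON h.agent_id = a.id
        ORDER BY a.id
    """
    return connection.execute(query).fetchall()


listing = build_listing(conn)
for row in listing:
    print(row)
```

The house handled by agent 32 gets a list price of about 370800, which agrees with the value shown in the LISTING table of the example.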
  • Slide 7
  • How Do We Match? The process of creating mappings typically proceeds in two steps. First step, schema matching: we find matches between elements of the two schemas. Second step: we elaborate the matches to create query expressions that enable automated data translation or exchange.
  • Slide 8
  • Schema Matches There are two kinds of schema matches. The first kind is 1-1 matches. [Figure: the HOUSES, AGENTS and LISTING tables of the Real-Estate-Agents example]
  • Slide 10
  • Complex Matches specify that some combination of attributes in one schema corresponds to a combination in the other. [Figure: the HOUSES, AGENTS and LISTING tables of the Real-Estate-Agents example]
  • Slide 14
  • The Solution The iMAP System We describe the iMAP system, which semi-automatically discovers complex matches for relational data in a single table. In some cases iMAP is able to find matches that combine attributes from multiple tables.
  • Slide 15
  • The iMAP Architecture
  • Slide 16
  • Match Generator Input: target schema and source schema. Output: match candidates.
  • Slide 17
  • How Match Generator Works The match generator uses a search method that goes through the space of possible match candidates. The searchers use prior knowledge of possible match types together with heuristic methods.
  • Slide 18
  • The Internals of a Searcher Applying search to candidate generation involves three major issues: the search strategy, the evaluation of candidate matches, and the termination condition.
  • Slide 19
  • Search Strategy The search space can be very large or even unbounded, so we need to search it efficiently. iMAP addresses this problem using a search technique called beam search.
  • Slide 20
  • Beam Search Beam search uses a scoring function to evaluate each match candidate. At each level of the search tree, it keeps only the k highest-scoring match candidates. This lets the searcher conduct an efficient search in any type of search space.
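To make the beam search idea concrete, here is a small sketch that looks for a concatenation of AGENTS attributes matching the agent-address column. It is an illustration under assumptions, not iMAP's implementation: the string-similarity scoring function, the ", " separator, and the fixed depth used as a termination condition are all choices made for this example.

```python
import heapq
from difflib import SequenceMatcher


def beam_search(start, expand, score, k, max_levels):
    """Return the best-scoring candidate seen within max_levels expansions."""
    beam = heapq.nlargest(k, start, key=score)
    best = beam[0]
    for _ in range(max_levels):
        # Expand every candidate kept at the previous level.
        children = [child for cand in beam for child in expand(cand)]
        if not children:
            break
        # Keep only the k highest-scoring candidates of this level.
        beam = heapq.nlargest(k, children, key=score)
        if score(beam[0]) > score(best):
            best = beam[0]
    return best


# Toy data taken from the real-estate example.
agents = [
    {"name": "Mike Brown", "city": "Athens", "state": "GA"},
    {"name": "Jean Laup", "city": "Raleigh", "state": "NC"},
]
agent_address = ["Athens, GA", "Raleigh, NC"]


def score(candidate):
    """Average similarity between the concatenated attributes and the target column."""
    sims = [
        SequenceMatcher(None, ", ".join(row[a] for a in candidate), target).ratio()
        for row, target in zip(agents, agent_address)
    ]
    return sum(sims) / len(sims)


def expand(candidate):
    """Extend a candidate with one attribute it does not use yet."""
    return [candidate + (a,) for a in agents[0] if a not in candidate]


best = beam_search([(a,) for a in agents[0]], expand, score, k=2, max_levels=2)
print(best, score(best))
```

With k=2 only two candidates survive each level, so most of the possible attribute combinations are never scored; the search still reaches the concatenation of city and state.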
  • Slide 21
  • Implemented Searchers on iMAP
  • Slide 22
  • Example: Unit Conversion Searcher The unit conversion searcher can identify a conversion between two different types of measurement unit. It does so by looking at the names and data of the attributes (e.g., "hours", "kg", "$", etc.).
  • Slide 23
  • Example: Unit Conversion Searcher (cont.) The searcher finds the best conversion from a set of conversion functions between the units. In this case weight_kg = weight_pounds / 2.2, since 1 kg is about 2.2 pounds.
      Source table (product | pounds): apple | 10
      Target table (Fruits and vegetables | kg): banana | 5
      Target table after the merge (Fruits and vegetables | kg): banana | 5, apple | 4.5
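A unit conversion lookup of this kind could be sketched as follows. The conversion table, the function names, and the rule of reading the unit from the attribute name are assumptions made for illustration; this sketch looks only at attribute names, whereas the slides say the searcher also looks at the data. The only fact relied on is that 1 kg is about 2.2046 pounds.

```python
# Hypothetical table of conversion functions: (source unit, target unit) -> function.
CONVERSIONS = {
    ("pounds", "kg"): lambda x: x / 2.2046,
    ("kg", "pounds"): lambda x: x * 2.2046,
    ("hours", "minutes"): lambda x: x * 60,
    ("minutes", "hours"): lambda x: x / 60,
}


def detect_unit(attribute_name):
    """Return the first known unit mentioned in an attribute name, or None."""
    units = {unit for pair in CONVERSIONS for unit in pair}
    for token in attribute_name.lower().replace("-", "_").split("_"):
        if token in units:
            return token
    return None


def find_conversion(source_attr, target_attr):
    """Return a conversion function between the two attributes' units, or None."""
    return CONVERSIONS.get((detect_unit(source_attr), detect_unit(target_attr)))


convert = find_conversion("weight_pounds", "weight_kg")
apple_kg = round(convert(10), 1)
print(apple_kg)
```

When no unit is recognized in one of the names, find_conversion returns None, so the caller can skip the pair instead of guessing.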
  • Slide 24
  • Similarity Estimator Input: match candidates. Output: similarity matrix. The similarity matrix stores the similarity score of (target attribute, match candidate) pairs.
  • Slide 25
  • Similarity Estimator The similarity estimator gets the results from all the searchers. It then gathers the data and calculates a final score for each match.
  • Slide 26
  • Similarity Estimator (cont.) The similarity estimator uses two methods to score match pairs: a name-based evaluator and a Naive Bayes evaluator.
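As a rough illustration of what a name-based evaluator might compute, the sketch below scores a candidate by comparing name tokens of the target attribute with name tokens of the attributes used in the candidate. The tokenization and the use of difflib's SequenceMatcher are assumptions of this example; the slides do not say how iMAP's evaluator measures name similarity.

```python
import re
from difflib import SequenceMatcher


def tokens(name):
    """Split an attribute name into lowercase alphanumeric tokens."""
    return [t for t in re.split(r"[^a-z0-9]+", name.lower()) if t]


def name_similarity(target_attr, candidate_attrs):
    """Average, over target tokens, of the best similarity to any candidate token."""
    target_tokens = tokens(target_attr)
    cand_tokens = [t for attr in candidate_attrs for t in tokens(attr)]
    if not target_tokens or not cand_tokens:
        return 0.0
    best = [
        max(SequenceMatcher(None, t, c).ratio() for c in cand_tokens)
        for t in target_tokens
    ]
    return sum(best) / len(best)


# Compare two candidates for the target attribute list-price.
s_price = name_similarity("list-price", ["price", "fee-rate"])
s_address = name_similarity("list-price", ["city", "state"])
print(s_price, s_address)
```

The candidate built from price and fee-rate scores higher for list-price than the one built from city and state, because the token "price" appears in both names.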
  • Slide 27
  • Match Selector Input: Similarity matrix. Output: 1-1 and complex matches.
  • Slide 28
  • Match Selector The match selector examines the similarity matrix and outputs the best matches under certain conditions.
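One simple reading of "best matches under certain conditions" is to take the highest-scoring candidate for each target attribute and keep it only above a threshold. The sketch below assumes exactly that; the threshold value and the similarity scores are invented for illustration and do not come from the slides.

```python
def select_matches(similarity, threshold=0.5):
    """For each target attribute, keep the top candidate if its score reaches the threshold."""
    selected = {}
    for target_attr, candidates in similarity.items():
        if not candidates:
            continue
        best_candidate, best_score = max(candidates.items(), key=lambda kv: kv[1])
        if best_score >= threshold:
            selected[target_attr] = best_candidate
    return selected


# Similarity matrix as nested dicts: target attribute -> {candidate: score}.
# All scores here are made up.
similarity = {
    "area": {"location": 0.9, "city": 0.6},
    "list-price": {"price": 0.7, "price * (1 + fee-rate)": 0.95},
    "agent-name": {"location": 0.2},
}
matches = select_matches(similarity)
print(matches)
```

In this toy matrix, area gets a 1-1 match, list-price gets a complex match, and agent-name gets no match because its only candidate falls below the threshold.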
  • Slide 29
  • Exploiting Domain Knowledge Exploiting domain knowledge was shown to be beneficial for 1-1 matching. For complex matching it can be even more crucial, since it can save valuable processing by detecting unlikely matches early.
  • Slide 30
  • Domain Constraints Constraints are either present in the schema or provided by an expert or the user. iMAP considers three kinds of constraints: two attributes are unrelated; a constraint on a single attribute; multiple schema attributes are unrelated.
  • Slide 31
  • Sources For Domain Constraints Past complex matches, overlap data, and external data.
  • Slide 32
  • Past Complex Matches We often find that we map the same or similar schemas repeatedly. iMAP can extract a template expression from such past matches. Example: given the past match price = pr * (1 + 0.6), iMAP will extract the template VAR * (1 + CONST) and ask the numeric searcher to look for matches for that template.
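The template extraction in this example can be imitated with a small regular-expression pass. This is only a sketch under assumptions: it treats every identifier as VAR and every decimal literal as CONST, and it leaves integer literals such as 1 untouched so that the result agrees with the slide's template. How iMAP actually decides what to generalize is not stated in the slides.

```python
import re

# Match a decimal literal (e.g. 0.6) or an identifier (e.g. pr).
TOKEN = re.compile(r"\d+\.\d+|[A-Za-z_][A-Za-z0-9_]*")


def extract_template(expression):
    """Replace identifiers with VAR and decimal literals with CONST."""
    def generalize(match):
        token = match.group(0)
        return "CONST" if token[0].isdigit() else "VAR"

    return TOKEN.sub(generalize, expression)


template = extract_template("pr * (1 + 0.6)")
print(template)
```

Attribute names containing hyphens, such as fee-rate, would need to be renamed or tokenized differently, since this pattern would read them as two identifiers separated by a minus sign.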
  • Slide 33
  • Overlap Data In some cases the source and the target share some of the same data, which can be used as information for the matching process. Searchers that exploit overlap data: the overlap text searcher, the overlap numeric searcher, and the overlap category and schema mismatch searcher.
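To illustrate how overlap data can help a numeric searcher, the sketch below checks a handful of candidate formulas against tuples that are assumed to describe the same houses in both databases. The candidate list, the assumption that the overlapping rows are already aligned, and the second target value (447200, computed here rather than taken from the slides) are all illustrative.

```python
import math

# Overlapping tuples, assumed to be already aligned row by row.
source_rows = [
    {"price": 360000, "fee_rate": 0.03},
    {"price": 430000, "fee_rate": 0.04},
]
target_list_price = [370800, 447200]

# Hypothetical candidate formulas for the target attribute list-price.
CANDIDATES = {
    "price": lambda r: r["price"],
    "price + fee_rate": lambda r: r["price"] + r["fee_rate"],
    "price * fee_rate": lambda r: r["price"] * r["fee_rate"],
    "price * (1 + fee_rate)": lambda r: r["price"] * (1 + r["fee_rate"]),
}


def overlap_score(formula):
    """Fraction of overlapping rows whose computed value equals the target value."""
    hits = sum(
        math.isclose(formula(row), target, rel_tol=1e-9)
        for row, target in zip(source_rows, target_list_price)
    )
    return hits / len(source_rows)


best_formula = max(CANDIDATES, key=lambda name: overlap_score(CANDIDATES[name]))
print(best_formula, overlap_score(CANDIDATES[best_formula]))
```

Because the candidate is checked against actual shared values, a formula that merely has similar-looking numbers is rejected, while the correct one reproduces every overlapping row.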
  • Slide 34
  • External Data External data is used as additional constraints on the attributes of a schema. It is usually provided by experts and can be very useful in schema matching.
  • Slide 35
  • Why do we need it?
  • Slide 36
  • Generating Explanations in iMAP iMAP's goal is to provide a design environment where a human user can quickly generate a mapping between a pair of schemas. For the user to know which match to choose, it is necessary to supply an explanation for each of the matches.
  • Slide 37
  • User Questions iMAP considers three questions that a user might ask: Why does a match exist? Why doesn't a match exist? Why is one match better than another?
  • Slide 38
  • Explanation Generation iMAP keeps track of the decision-making process as a dependency graph. Each node is either a schema attribute, an assumption, a candidate match, or a piece of domain knowledge. An edge between two nodes means that one node led to the other.
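A dependency graph like this can answer "why does this match exist?" by walking the edges backwards from the match. The sketch below uses a plain dictionary as the graph; the node labels and the traversal are assumptions for illustration, not iMAP's data structures.

```python
# Edge u -> v means "u led to v". Node names are hypothetical.
edges = {
    "attribute: price": ["candidate: price * (1 + fee-rate)"],
    "attribute: fee-rate": ["candidate: price * (1 + fee-rate)"],
    "past match: VAR * (1 + CONST)": ["candidate: price * (1 + fee-rate)"],
    "candidate: price * (1 + fee-rate)": ["match: list-price = price * (1 + fee-rate)"],
}


def explain(node, graph):
    """Return every node that directly or indirectly led to the given node."""
    # Invert the edges so each node knows what led to it.
    parents = {}
    for src, dsts in graph.items():
        for dst in dsts:
            parents.setdefault(dst, []).append(src)

    seen, stack = [], [node]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, []):
            if parent not in seen:
                seen.append(parent)
                stack.append(parent)
    return seen


reasons = explain("match: list-price = price * (1 + fee-rate)", edges)
for reason in reasons:
    print(reason)
```

The returned list starts with the candidate that produced the match and then names the attributes and the past-match template that produced the candidate, which is the raw material for an explanation shown to the user.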
  • Slide 39
  • Explanation Generation Example