ETL Quality Stage blocking and matching
![Page 1: ETL Quality Stage blocking and matching](https://reader033.fdocuments.in/reader033/viewer/2022060204/559f7ffc1a28ab78488b482e/html5/thumbnails/1.jpg)
ETL QualityStage: Matching Stage
A simplified explanation of the Matching Stage
![Page 2: ETL Quality Stage blocking and matching](https://reader033.fdocuments.in/reader033/viewer/2022060204/559f7ffc1a28ab78488b482e/html5/thumbnails/2.jpg)
Data matching in ETL Quality Stage
Data matching is used to find records in a single data source or independent data sources that refer to the same entity (such as a person, organization,
location, product, or material) regardless of the availability of a predetermined key.
Let’s take a look at a simplified example and examine the process.
![Page 3: ETL Quality Stage blocking and matching](https://reader033.fdocuments.in/reader033/viewer/2022060204/559f7ffc1a28ab78488b482e/html5/thumbnails/3.jpg)
Let’s say our neighborhood club decided to display our pictures on the clubhouse bulletin board.
… But we only want to post one picture per club member.
![Page 4: ETL Quality Stage blocking and matching](https://reader033.fdocuments.in/reader033/viewer/2022060204/559f7ffc1a28ab78488b482e/html5/thumbnails/4.jpg)
Neighbors submit their pictures, but some
neighbors submit more than one picture.
![Page 5: ETL Quality Stage blocking and matching](https://reader033.fdocuments.in/reader033/viewer/2022060204/559f7ffc1a28ab78488b482e/html5/thumbnails/5.jpg)
Since we agreed to post only ONE picture, we’ll
have to weed out “duplicates”
(pictures of the same person).
![Page 6: ETL Quality Stage blocking and matching](https://reader033.fdocuments.in/reader033/viewer/2022060204/559f7ffc1a28ab78488b482e/html5/thumbnails/6.jpg)
How do we find pictures of the same person?
Well, traditionally, we’d compare them one by one to determine if they
match certain criteria (same eyes, nose, etc.)
![Page 7: ETL Quality Stage blocking and matching](https://reader033.fdocuments.in/reader033/viewer/2022060204/559f7ffc1a28ab78488b482e/html5/thumbnails/7.jpg)
Same person?
No.
![Page 8: ETL Quality Stage blocking and matching](https://reader033.fdocuments.in/reader033/viewer/2022060204/559f7ffc1a28ab78488b482e/html5/thumbnails/8.jpg)
Same person?
No.
![Page 9: ETL Quality Stage blocking and matching](https://reader033.fdocuments.in/reader033/viewer/2022060204/559f7ffc1a28ab78488b482e/html5/thumbnails/9.jpg)
Same person?
No.
![Page 10: ETL Quality Stage blocking and matching](https://reader033.fdocuments.in/reader033/viewer/2022060204/559f7ffc1a28ab78488b482e/html5/thumbnails/10.jpg)
Same person?
No.
![Page 11: ETL Quality Stage blocking and matching](https://reader033.fdocuments.in/reader033/viewer/2022060204/559f7ffc1a28ab78488b482e/html5/thumbnails/11.jpg)
Same person?
No.
![Page 12: ETL Quality Stage blocking and matching](https://reader033.fdocuments.in/reader033/viewer/2022060204/559f7ffc1a28ab78488b482e/html5/thumbnails/12.jpg)
You get the idea.
We have 12 pictures, so we’ll have to compare 12 pictures 12 times.
That’s 144 times!
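As a quick check on that arithmetic, here is a minimal Python sketch. The picture names are placeholders invented for illustration. It reproduces the slide's count (every picture held up against all 12) and, for comparison, the number of distinct pairs you would get if each pair were checked only once.

```python
from itertools import combinations

# 12 hypothetical submissions; the names are placeholders.
pictures = [f"picture_{i}" for i in range(1, 13)]

# The slide's count: each of the 12 pictures is compared against all 12.
slide_count = len(pictures) * len(pictures)

# Distinct unordered pairs, if each pair were only checked once.
unique_pairs = len(list(combinations(pictures, 2)))

print(slide_count)   # 144
print(unique_pairs)  # 66
```

Either way you count it, the work grows roughly with the square of the number of pictures, which is why comparing everything to everything gets expensive fast.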
![Page 13: ETL Quality Stage blocking and matching](https://reader033.fdocuments.in/reader033/viewer/2022060204/559f7ffc1a28ab78488b482e/html5/thumbnails/13.jpg)
The Matching Stage in QualityStage simplifies the work.
Matching is a two-step process: first you block records
and then you match them.
![Page 14: ETL Quality Stage blocking and matching](https://reader033.fdocuments.in/reader033/viewer/2022060204/559f7ffc1a28ab78488b482e/html5/thumbnails/14.jpg)
Blocking identifies subsets of data so that
matches can be more efficiently performed.
These subsets are called blocks.
![Page 15: ETL Quality Stage blocking and matching](https://reader033.fdocuments.in/reader033/viewer/2022060204/559f7ffc1a28ab78488b482e/html5/thumbnails/15.jpg)
Blocking
Let’s say we decide to block the data.
We decide to form four subsets:
• Females < 18 years old
• Females > 18 years old
• Males < 18 years old
• Males > 18 years old
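To make the idea concrete, here is a small Python sketch of blocking by gender and age band. It is only an illustration, not how QualityStage is configured: the member records, the `block_key` function and the choice to put someone who is exactly 18 in the older band are all assumptions of this sketch.

```python
from collections import defaultdict

# Hypothetical club members; names, genders and ages are invented.
members = [
    {"name": "Ana", "gender": "F", "age": 12},
    {"name": "Bea", "gender": "F", "age": 34},
    {"name": "Cara", "gender": "F", "age": 16},
    {"name": "Dan", "gender": "M", "age": 9},
    {"name": "Ed", "gender": "M", "age": 45},
    {"name": "Fay", "gender": "F", "age": 52},
]

def block_key(record):
    """Return the block for a record: gender plus an age band."""
    # Assumption: exactly 18 falls in the older band; the slides leave it open.
    band = "under 18" if record["age"] < 18 else "18 and over"
    return (record["gender"], band)

# Group the records so that later comparisons stay inside one block.
blocks = defaultdict(list)
for record in members:
    blocks[block_key(record)].append(record)

for key, group in sorted(blocks.items()):
    print(key, [r["name"] for r in group])
```

Each record lands in exactly one block, so a picture is only ever compared with the other pictures that share its key.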
![Page 16: ETL Quality Stage blocking and matching](https://reader033.fdocuments.in/reader033/viewer/2022060204/559f7ffc1a28ab78488b482e/html5/thumbnails/16.jpg)
Blocking
Females < 18 years old
Females > 18 years old
Males < 18 years old
Males > 18 years old
![Page 17: ETL Quality Stage blocking and matching](https://reader033.fdocuments.in/reader033/viewer/2022060204/559f7ffc1a28ab78488b482e/html5/thumbnails/17.jpg)
Blocking
Females < 18 years old
Females > 18 years old
Males < 18 years old
Males > 18 years old
Making comparisons is easier now.
![Page 18: ETL Quality Stage blocking and matching](https://reader033.fdocuments.in/reader033/viewer/2022060204/559f7ffc1a28ab78488b482e/html5/thumbnails/18.jpg)
Blocking
Females < 18 years old
Females > 18 years old
Males < 18 years old
Males > 18 years old
Compare 5 pictures 5 times = 25 comparisons
Compare 3 pictures 3 times = 9 comparisons
Compare 2 pictures 2 times = 4 comparisons
Compare 2 pictures 2 times = 4 comparisons
![Page 19: ETL Quality Stage blocking and matching](https://reader033.fdocuments.in/reader033/viewer/2022060204/559f7ffc1a28ab78488b482e/html5/thumbnails/19.jpg)
Blocking
Females < 18 years old
Females > 18 years old
Males < 18 years old
Males > 18 years old
25 comparisons
9 comparisons
4 comparisons
4 comparisons
![Page 20: ETL Quality Stage blocking and matching](https://reader033.fdocuments.in/reader033/viewer/2022060204/559f7ffc1a28ab78488b482e/html5/thumbnails/20.jpg)
Blocking
Females < 18 years old
Females > 18 years old
Males < 18 years old
Males > 18 years old
25 comparisons
9 comparisons
4 comparisons
4 comparisons
25 + 9 + 4 + 4 = 42 comparisons
![Page 21: ETL Quality Stage blocking and matching](https://reader033.fdocuments.in/reader033/viewer/2022060204/559f7ffc1a28ab78488b482e/html5/thumbnails/21.jpg)
Blocking
Females < 18 years old
Females > 18 years old
Males < 18 years old
Males > 18 years old
42 comparisons.
That’s much more efficient than the 144 comparisons we had earlier, when we were doing one-on-one matching.
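The saving can be worked through in a few lines of Python. This sketch assumes the block sizes read off the slides (5, 3, 2 and 2 pictures, 12 in all) and uses the same counting convention as the slides, where each picture in a group is compared against every picture in that group. Summing the four per-block counts gives 25 + 9 + 4 + 4 = 42.

```python
# Block sizes taken from the slides: 5, 3, 2 and 2 pictures (12 in total).
block_sizes = [5, 3, 2, 2]

def comparisons(n):
    """Slide-style count: each of n pictures is compared against all n."""
    return n * n

# Without blocking, all 12 pictures form one big group.
unblocked = comparisons(sum(block_sizes))

# With blocking, comparisons happen only inside each block.
blocked = sum(comparisons(n) for n in block_sizes)

print(unblocked)  # 144
print(blocked)    # 42
```

The trade-off is that two pictures of the same person can never be matched if they end up in different blocks, so the blocking criteria need to be ones you trust.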
![Page 22: ETL Quality Stage blocking and matching](https://reader033.fdocuments.in/reader033/viewer/2022060204/559f7ffc1a28ab78488b482e/html5/thumbnails/22.jpg)
To review:
Matching is a two-step process: first you block records and then you match them.
Blocking identifies subsets of data within which matches can be more efficiently performed.
Females < 18 years old
Females > 18 years old
Males < 18 years old
Males > 18 years old
![Page 23: ETL Quality Stage blocking and matching](https://reader033.fdocuments.in/reader033/viewer/2022060204/559f7ffc1a28ab78488b482e/html5/thumbnails/23.jpg)
Females < 18 years old
Females >18 years old
Males < 18 years old
Males > 18 years old
Matching identifies relationships among records.
Matching is a 2-step process:
- First you block the records.
- Then you match them.
![Page 24: ETL Quality Stage blocking and matching](https://reader033.fdocuments.in/reader033/viewer/2022060204/559f7ffc1a28ab78488b482e/html5/thumbnails/24.jpg)
Matching
Females >18 years old
Let’s pause for a minute to examine the matching process more closely.
![Page 25: ETL Quality Stage blocking and matching](https://reader033.fdocuments.in/reader033/viewer/2022060204/559f7ffc1a28ab78488b482e/html5/thumbnails/25.jpg)
Matching
First, we have to make certain decisions to set up rules.
Will all of the criteria have to match exactly?
(If NO) Will some criteria be more important than other criteria?
(If YES) Can we use some of QualityStage’s “fuzzy logic”?
Which criteria will be more important? We will have to assign weights.
![Page 26: ETL Quality Stage blocking and matching](https://reader033.fdocuments.in/reader033/viewer/2022060204/559f7ffc1a28ab78488b482e/html5/thumbnails/26.jpg)
Matching
Let’s see what could happen if we were to apply the strict rule:
All the criteria have to match exactly.
In our example, the people in the pictures will need to have the same shape and color of eyes, same length and
color of hair, same hairstyle, etc.
If someone had different hair styles in the pictures, for example, we would have to say that it is a different
person, if we were to apply this strict rule.
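In code, the strict rule amounts to an all-or-nothing comparison. The sketch below is only an illustration: the two photo descriptions and the `exact_match` function are hypothetical and are not part of QualityStage.

```python
# Two hypothetical descriptions of the same person on different days.
photo_1 = {"eyes": "brown, oval", "hair_color": "dark brown", "hair_style": "ponytail"}
photo_2 = {"eyes": "brown, oval", "hair_color": "dark brown", "hair_style": "loose"}

def exact_match(a, b):
    """Strict rule: every criterion must agree exactly."""
    return all(a[criterion] == b[criterion] for criterion in a)

print(exact_match(photo_1, photo_1))  # True
print(exact_match(photo_1, photo_2))  # False: only the hairstyle differs
```

A single differing criterion is enough to reject the pair, which is exactly the weakness the strict rule has with real-world data.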
![Page 27: ETL Quality Stage blocking and matching](https://reader033.fdocuments.in/reader033/viewer/2022060204/559f7ffc1a28ab78488b482e/html5/thumbnails/27.jpg)
If the rule were “All the criteria have to match exactly”:
We would have to conclude that these are not pictures of the same person.
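| Criterion | First picture | Second picture | Result |
| --- | --- | --- | --- |
| Eyes | Large, oval, brown, long eye lashes | Large, oval, brown, long eye lashes | Match |
| Nose | Not visible | Not visible | Match |
| Mouth | Small, petite | Large, open, tongue visible | No Match |
| Hair | Dark brown, long, straight | Light brown, long, straight | No Match |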
![Page 28: ETL Quality Stage blocking and matching](https://reader033.fdocuments.in/reader033/viewer/2022060204/559f7ffc1a28ab78488b482e/html5/thumbnails/28.jpg)
If the rule were “All the criteria have to match exactly”:
We would have to conclude that these are not pictures of the same person.
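| Criterion | First picture | Second picture | Result |
| --- | --- | --- | --- |
| Eyes | Large, oval, brown, long eye lashes | Large, oval, brown, long eye lashes | Match |
| Nose | Not visible | Not visible | Match |
| Mouth | Small, petite | Large, closed, tongue visible | No Match |
| Hair | Dark brown, long, straight | Dark brown, long, straight | Match |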
![Page 29: ETL Quality Stage blocking and matching](https://reader033.fdocuments.in/reader033/viewer/2022060204/559f7ffc1a28ab78488b482e/html5/thumbnails/29.jpg)
If the rule were “All the criteria have to match exactly”:
We would have to conclude that these are not pictures of the same person.
| Criterion | First picture | Second picture | Result |
| --- | --- | --- | --- |
| Eyes | Large, oval, brown, long eye lashes | Large, oval, brown, long eye lashes | Match |
| Nose | Not visible | Not visible | Match |
| Mouth | Large, open, tongue visible | Large, closed, tongue visible | No Match |
| Hair | Dark brown, long, straight | Light brown, long, straight | No Match |
![Page 30: ETL Quality Stage blocking and matching](https://reader033.fdocuments.in/reader033/viewer/2022060204/559f7ffc1a28ab78488b482e/html5/thumbnails/30.jpg)
Matching
As an alternative, we can use some of QualityStage’s “fuzzy logic” and assign “weights” to the criteria.
We will have to decide: Which criteria are more important?
![Page 31: ETL Quality Stage blocking and matching](https://reader033.fdocuments.in/reader033/viewer/2022060204/559f7ffc1a28ab78488b482e/html5/thumbnails/31.jpg)
We could assign weights to the criteria.
| Criterion | Picture 1 | Picture 2 | Picture 3 |
| --- | --- | --- | --- |
| Eyes | Large, oval, brown, long eye lashes | Large, oval, brown, long eye lashes | Large, oval, brown, long eye lashes |
| Nose | Not visible | Not visible | Not visible |
| Mouth | Small, petite | Large, closed, tongue visible | Large, open, tongue visible |
| Hair | Dark brown, long, straight | Dark brown, long, straight | Light brown, long, straight |

For example, we could assign higher weights to “nose” and “eyes,” a lower weight to “mouth,” and the lowest weight to “hair.”
![Page 32: ETL Quality Stage blocking and matching](https://reader033.fdocuments.in/reader033/viewer/2022060204/559f7ffc1a28ab78488b482e/html5/thumbnails/32.jpg)
We could assign weights to the criteria.
Using these assigned weights, ETL can help us conclude that these are pictures of the same person.
| Criterion | Picture 1 | Picture 2 | Picture 3 | Result |
| --- | --- | --- | --- | --- |
| Eyes | Large, oval, brown, long eye lashes | Large, oval, brown, long eye lashes | Large, oval, brown, long eye lashes | Match |
| Nose | Not visible | Not visible | Not visible | Match |
| Mouth | Small, petite | Large, closed, tongue visible | Large, open, tongue visible | Match |
| Hair | Dark brown, long, straight | Dark brown, long, straight | Light brown, long, straight | Match |
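One simple way to picture weighted matching is a score that adds up the weights of the criteria that agree and compares the total with a cutoff. The Python sketch below does that with the three picture descriptions from the slides. It is a simplified illustration and not how QualityStage actually computes its weights: the numeric weights, the `CUTOFF` value and the `score` and `is_match` functions are assumptions made up for this example, chosen only to follow the ordering suggested earlier (nose and eyes highest, mouth lower, hair lowest).

```python
from itertools import combinations

# Descriptions of the three pictures, transcribed from the slides.
pictures = {
    "picture_1": {"eyes": "large, oval, brown, long eye lashes", "nose": "not visible",
                  "mouth": "small, petite", "hair": "dark brown, long, straight"},
    "picture_2": {"eyes": "large, oval, brown, long eye lashes", "nose": "not visible",
                  "mouth": "large, closed, tongue visible", "hair": "dark brown, long, straight"},
    "picture_3": {"eyes": "large, oval, brown, long eye lashes", "nose": "not visible",
                  "mouth": "large, open, tongue visible", "hair": "light brown, long, straight"},
}

# Hypothetical weights: eyes and nose count most, mouth less, hair least.
WEIGHTS = {"eyes": 4, "nose": 4, "mouth": 2, "hair": 1}
CUTOFF = 7  # hypothetical: minimum total weight needed to call a pair a match

def score(a, b):
    """Add up the weights of the criteria on which the two pictures agree."""
    return sum(w for criterion, w in WEIGHTS.items() if a[criterion] == b[criterion])

def is_match(a, b):
    """Weighted rule: a pair matches when its score reaches the cutoff."""
    return score(a, b) >= CUTOFF

for left, right in combinations(pictures, 2):
    s = score(pictures[left], pictures[right])
    print(left, right, s, is_match(pictures[left], pictures[right]))
```

With these assumed numbers, all three pairs clear the cutoff even though the mouth or hair descriptions differ, whereas the strict rule would have rejected every pair.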