EFFICIENT PROFILING FOR ESTIMATION OF QUERY RESULT QUALITY


Page 1: EFFICIENT PROFILING FOR  ESTIMATION OF QUERY RESULT QUALITY

16th International Conference on Information Quality, 2011

EFFICIENT PROFILING FOR ESTIMATION OF QUERY RESULT QUALITY

Naiem K. Yeganeh, University of Queensland

Mohamed A. Sharaf, University of Queensland

Shazia Sadiq, University of Queensland

Ke Deng, University of Queensland

Executive Summary/Abstract: Data quality (DQ) profiles consist of statistical measurements about the quality of data sets. Query systems can use DQ profiles as a form of metadata to estimate the quality of a query result set. Traditional DQ profiling provides an estimate of the overall quality of a data set or data source, but the quality of a query result can be remarkably different from the overall quality of the data set, because conditions within the query typically select a subset of the data. In this paper we propose an efficient conditional DQ profiling method which can estimate the quality of a result set for a given query with a guaranteed, user-definable level of accuracy.

Page 2: EFFICIENT PROFILING FOR  ESTIMATION OF QUERY RESULT QUALITY

Objectives of this presentation

• Study the need to profile the quality of data sets.

• Discuss Data Quality Profiling and its position in the literature.

• Propose Conditional Data Quality Profiling as a tool for estimating the quality of query results.

• Propose methods for improving the efficiency of generating a Conditional Data Quality Profile.

Page 3: EFFICIENT PROFILING FOR  ESTIMATION OF QUERY RESULT QUALITY

What is the quality of the data below?

For the above question to be answered in a consistent way, we need to be more specific:

• The quality of which attribute do we want to measure?

• What do we mean by quality, i.e., which aspect of quality do we want to measure?

Scenario

Data quality metrics are measurements of a specific data quality dimension for a specific part of the data (e.g., a specific attribute such as Price).

Page 4: EFFICIENT PROFILING FOR  ESTIMATION OF QUERY RESULT QUALITY

Data Quality (DQ) metric and Data Quality Profile

• Data Quality Metric:

  • Statistical measurement of a data quality dimension for a specific attribute over a data set.

  • For example: completeness of Price.

• Data Quality Profile:

  • A data set (metadata) that contains statistical information about the quality of another data set.

  • Usually contains values (or aggregated values) for different metrics.

Definition

[Diagram labels: Data, Data Quality Service, DQ Metric]

Assumption

Data Quality Services

Page 5: EFFICIENT PROFILING FOR  ESTIMATION OF QUERY RESULT QUALITY

What is the quality of the data below?

A simple definition of the completeness metric:

Completeness(x) = 0 if x is null, 1 otherwise

Scenario

What is the completeness of all attributes for the data below?

[Tables: the data set “ShoppingItems” and the DQ profile for the data set “ShoppingItems”]
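As a minimal sketch of the completeness definition, the following Python scores each value as 0 or 1 and averages the scores per attribute, which gives a traditional DQ profile for the whole data set. The rows are hypothetical stand-ins for the “ShoppingItems” table, and the function names are illustrative, not taken from the presentation.

```python
from typing import Any, Dict, List


def completeness(x: Any) -> int:
    """Completeness(x): 0 if x is null (None here), otherwise 1."""
    return 0 if x is None else 1


def traditional_profile(rows: List[Dict[str, Any]]) -> Dict[str, float]:
    """Average completeness of every attribute over the whole data set."""
    attributes = rows[0].keys()
    return {a: sum(completeness(r[a]) for r in rows) / len(rows)
            for a in attributes}


# Hypothetical rows; None marks a missing value.
shopping_items = [
    {"Brand": "Canon", "Model": "SLR",    "Price": 900,  "Image": "a.jpg"},
    {"Brand": "Canon", "Model": "Normal", "Price": None, "Image": None},
    {"Brand": "Sony",  "Model": "SLR",    "Price": 850,  "Image": "b.jpg"},
    {"Brand": "Sony",  "Model": "Normal", "Price": 200,  "Image": None},
]

profile = traditional_profile(shopping_items)
print(profile)
```

The result is one number per attribute for the entire data set, with no information about how completeness varies between subsets.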

Page 6: EFFICIENT PROFILING FOR  ESTIMATION OF QUERY RESULT QUALITY

What is the completeness of image for Sony Cameras?

Be more specific, or: how to estimate the quality of query results

A DQ profile for the whole data set cannot estimate the quality of query results, because a query typically selects only a subset of the data.

Let us call this type of DQ profile a traditional DQ profile.

In contrast, we propose the conditional DQ profile.

Page 7: EFFICIENT PROFILING FOR  ESTIMATION OF QUERY RESULT QUALITY

Conditional DQ Profile

- A DQ profile that can be used to estimate metric results for every query with conjunctive selection conditions.

- One conditional DQ profile for each metric.

Scenario

Attributes: B = Brand, M = Model, P = Price, I = Image

Brand values: C = Canon, S = Sony

Model values: S = SLR, N = Normal

Price values: H = High, L = Low

[Tables: the data set (T) and one conditional DQ profile for the completeness of I (Image) for T]

Page 8: EFFICIENT PROFILING FOR  ESTIMATION OF QUERY RESULT QUALITY

Sample queries:

• What is the completeness of I where B=C and M=S?

• What is the completeness of I where P=H?
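Both sample queries can be answered exactly by scanning the data, and that exact answer is what a conditional DQ profile is meant to approximate without touching the data. A minimal sketch, assuming a hypothetical eight-row data set T (the values are invented for illustration and are not the table from the slide):

```python
# Hypothetical data set T: (Brand, Model, Price, Image); None = missing image.
T = [
    ("Canon", "SLR",    "High", "c1.jpg"),
    ("Canon", "SLR",    "High", "c2.jpg"),
    ("Canon", "SLR",    "Low",  None),
    ("Canon", "Normal", "Low",  None),
    ("Sony",  "SLR",    "High", "s1.jpg"),
    ("Sony",  "Normal", "Low",  "s2.jpg"),
    ("Sony",  "Normal", "Low",  "s3.jpg"),
    ("Sony",  "Normal", "High", None),
]
COLUMNS = {"B": 0, "M": 1, "P": 2, "I": 3}


def completeness_where(rows, attribute, **conditions):
    """Completeness of `attribute` over rows matching all equality conditions."""
    selected = [r for r in rows
                if all(r[COLUMNS[k]] == v for k, v in conditions.items())]
    if not selected:
        return None  # empty result set: completeness is undefined
    complete = sum(r[COLUMNS[attribute]] is not None for r in selected)
    return complete / len(selected)


overall = completeness_where(T, "I")                         # whole data set
canon_slr = completeness_where(T, "I", B="Canon", M="SLR")   # B=C and M=S
high_price = completeness_where(T, "I", P="High")            # P=H
print(overall, canon_slr, high_price)
```

On this toy data the overall completeness of I is 0.625, while the two conditional values are about 0.67 and 0.75, so a single overall figure would misestimate both queries.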

Page 9: EFFICIENT PROFILING FOR  ESTIMATION OF QUERY RESULT QUALITY

Create Conditional DQ Profile

Method

- Brute force: search every possible conjunctive selection condition.
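A minimal brute-force sketch, assuming the profile stores, for every conjunctive condition with a non-empty result, the number of matching rows and the completeness of I. The wildcard “_” for an unconstrained attribute follows the query translation given later in the presentation; the data set and the function name are hypothetical.

```python
from itertools import product

WILDCARD = "_"

# Hypothetical data set T: (Brand, Model, Price, Image); None = missing image.
T = [
    ("Canon", "SLR",    "High", "c1.jpg"),
    ("Canon", "SLR",    "High", "c2.jpg"),
    ("Canon", "SLR",    "Low",  None),
    ("Canon", "Normal", "Low",  None),
    ("Sony",  "SLR",    "High", "s1.jpg"),
    ("Sony",  "Normal", "Low",  "s2.jpg"),
    ("Sony",  "Normal", "Low",  "s3.jpg"),
    ("Sony",  "Normal", "High", None),
]


def brute_force_profile(rows, n_attrs=3, target=3):
    """Map each conjunctive condition to (row count, completeness of target)."""
    # Each attribute is either fixed to one of its values or left as a wildcard.
    domains = [sorted({r[i] for r in rows}) + [WILDCARD] for i in range(n_attrs)]
    profile = {}
    for cond in product(*domains):
        matching = [r for r in rows
                    if all(c == WILDCARD or r[i] == c for i, c in enumerate(cond))]
        if matching:  # skip conditions that select no rows
            complete = sum(r[target] is not None for r in matching)
            profile[cond] = (len(matching), complete / len(matching))
    return profile


profile = brute_force_profile(T)
print(len(T), len(profile))
```

Here 8 rows yield 25 profile records, since the number of candidate conditions grows with the product of the attribute domain sizes.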

Page 10: EFFICIENT PROFILING FOR  ESTIMATION OF QUERY RESULT QUALITY

The conditional DQ profile may become bigger than the data set!

Issue

Hence, its size should be reduced.

Analogy: an image reduced to only 16 colors and a lower resolution is much smaller, yet still conveys enough correct information.

Error distribution is not always random: different subsets of the data set may have different error distributions.

Remove a record from the DQ profile if the value of its DQ metric lies within a specific range (called the certainty threshold) of the value for its superset.

Example:

• B=C and M=S -> completeness of I = 0.66

• B=C -> completeness of I = 0.50

SLR Canon cameras have about the same completeness of image as all Canon cameras (a difference of 0.16), so for any certainty threshold above 0.16 the record for B=C and M=S can be removed.

Page 11: EFFICIENT PROFILING FOR  ESTIMATION OF QUERY RESULT QUALITY

Epsilon and Tau

[Table: reduced conditional DQ profile with minimum-set threshold τ=2 and certainty threshold ε=0.2]

The reduced conditional DQ profile may lose some information, i.e., it may not be able to estimate some queries correctly.
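One possible reading of the reduction, sketched under stated assumptions: a record is dropped when it covers fewer than τ rows, or when its metric value lies within ε of every immediate superset (the conditions obtained by relaxing one attribute to “_”). Both interpretations, the profile values, and the function names are assumptions for illustration; the slides do not spell out these details.

```python
WILDCARD = "_"

# Hypothetical full profile over (Brand, Model):
# condition -> (number of rows, completeness of I)
full_profile = {
    ("_", "_"):          (12, 7 / 12),
    ("Canon", "_"):      (8, 4 / 8),
    ("Sony", "_"):       (4, 3 / 4),
    ("_", "SLR"):        (7, 6 / 7),
    ("_", "Normal"):     (5, 1 / 5),
    ("Canon", "SLR"):    (4, 4 / 4),
    ("Canon", "Normal"): (4, 0 / 4),
    ("Sony", "SLR"):     (3, 2 / 3),
    ("Sony", "Normal"):  (1, 1 / 1),
}


def immediate_supersets(cond):
    """Conditions obtained by relaxing exactly one specified attribute."""
    return [cond[:i] + (WILDCARD,) + cond[i + 1:]
            for i, c in enumerate(cond) if c != WILDCARD]


def reduce_profile(profile, eps, tau):
    """Drop records that are too small (tau) or too similar to supersets (eps)."""
    reduced = {}
    for cond, (count, q) in profile.items():
        parents = immediate_supersets(cond)
        if not parents:  # the all-wildcard record is always kept
            reduced[cond] = (count, q)
            continue
        if count < tau:  # assumed meaning of the minimum-set threshold
            continue
        if all(abs(q - profile[p][1]) < eps for p in parents):
            continue     # assumed meaning of the certainty threshold
        reduced[cond] = (count, q)
    return reduced


reduced = reduce_profile(full_profile, eps=0.2, tau=2)
for cond, record in reduced.items():
    print(cond, record)
```

With ε=0.2 and τ=2 this toy profile shrinks from nine records to five: the brand-only records and the Sony SLR record are close enough to their supersets, and the single Sony Normal row falls below τ.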

Page 12: EFFICIENT PROFILING FOR  ESTIMATION OF QUERY RESULT QUALITY

Generation and Optimization of Conditional DQ Profile

Method

1. Create a conditional DQ profile.

2. Reduce the size of the profile using the certainty threshold (ε) and the minimum-set threshold (τ).

3. Return to the DQ profile the records that cannot be estimated from the reduced conditional DQ profile.
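Step 3 can be sketched as follows, assuming that a record “cannot be estimated” when the value looked up in the reduced profile differs from its true value by more than ε. The lookup prefers records whose earlier attributes are specified, which is one reading of the ORDER BY clause in the query translation. The profile excerpt and all names are hypothetical.

```python
WILDCARD = "_"


def lookup(profile, cond):
    """Metric value of the best record that generalizes `cond`."""
    candidates = [k for k in profile
                  if all(c == WILDCARD or c == w for c, w in zip(k, cond))]
    # Prefer records whose earlier attributes are specified (False sorts first).
    best = min(candidates, key=lambda k: tuple(c == WILDCARD for c in k))
    return profile[best][1]


def restore(full, reduced, eps):
    """Return to the profile every record whose estimate is off by more than eps."""
    restored = dict(reduced)
    # General records first, so later lookups see earlier restorations.
    by_generality = sorted(full, key=lambda k: sum(c != WILDCARD for c in k))
    for cond in by_generality:
        count, q = full[cond]
        if cond not in restored and abs(lookup(restored, cond) - q) > eps:
            restored[cond] = (count, q)
    return restored


# Hypothetical excerpt of a full profile over (Brand, Model).
full = {
    ("_", "_"):       (40, 0.50),
    ("Canon", "_"):   (20, 0.65),
    ("_", "SLR"):     (20, 0.65),
    ("Canon", "SLR"): (10, 0.80),
}
# Every record except the root was within eps = 0.2 of its supersets.
reduced = {("_", "_"): (40, 0.50)}

final = restore(full, reduced, eps=0.2)
print(final)
```

In this excerpt each removed record is within 0.15 of its immediate supersets, but the Canon SLR record ends up 0.30 away from the only remaining record, so it is returned to the profile.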

Page 13: EFFICIENT PROFILING FOR  ESTIMATION OF QUERY RESULT QUALITY

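Querying Conditional DQ Profile

Method

The query

SELECT * FROM D WHERE Brand = 'Canon' AND Model = 'SLR'

translates to the following query over the conditional DQ profile T, where '_' marks an attribute that is left unconstrained:

SELECT TOP 1 #, Q FROM T WHERE (Brand = 'Canon' OR Brand = '_') AND (Model = 'SLR' OR Model = '_') AND (Price = '_') ORDER BY Brand, Model, Price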
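The same lookup can be sketched in Python. It assumes that “_” sorts after every real value, so that TOP 1 with ORDER BY returns the matching record whose earlier attributes are specified. The reduced profile below and the function name are hypothetical.

```python
WILDCARD = "_"
ATTRS = ("Brand", "Model", "Price")

# Hypothetical reduced profile: condition -> (number of rows, completeness of I)
reduced_profile = {
    ("_", "_", "_"):     (8, 0.625),
    ("Canon", "_", "_"): (4, 0.50),
    ("_", "_", "High"):  (4, 0.75),
}


def estimate(profile, query):
    """Return (count, Q) of the first matching record, or None if none match."""
    wanted = [query.get(a, WILDCARD) for a in ATTRS]
    # WHERE clause: each attribute equals the queried value or the wildcard;
    # attributes absent from the query must be wildcards in the record.
    candidates = [cond for cond in profile
                  if all(c == WILDCARD or c == w for c, w in zip(cond, wanted))]
    if not candidates:
        return None
    # ORDER BY Brand, Model, Price with wildcards last, then TOP 1.
    best = min(candidates, key=lambda cond: tuple(c == WILDCARD for c in cond))
    return profile[best]


canon_slr = estimate(reduced_profile, {"Brand": "Canon", "Model": "SLR"})
high_price = estimate(reduced_profile, {"Price": "High"})
sony = estimate(reduced_profile, {"Brand": "Sony"})
print(canon_slr, high_price, sony)
```

Because no record for Canon SLR cameras exists in this reduced profile, the query falls back to the record for all Canon cameras, and the Sony query falls back to the all-wildcard record.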

Page 14: EFFICIENT PROFILING FOR  ESTIMATION OF QUERY RESULT QUALITY

Effectiveness of DQ Estimation

Evaluation

[Figure: comparison of the average estimation error for the traditional DQ profile (DQP) and the conditional DQ profile with different certainty thresholds (ε=0.05 to ε=0.45) and for different variations in the distribution of dirty data d.]

Page 15: EFFICIENT PROFILING FOR  ESTIMATION OF QUERY RESULT QUALITY

Effect of the distribution of dirty data

Evaluation

[Figure: (a) effect of variation in the distribution of dirty data d on profile size; (b) effect of error distribution on profile generation time.]

Page 16: EFFICIENT PROFILING FOR  ESTIMATION OF QUERY RESULT QUALITY

Effect of certainty and minimum-set thresholds

Evaluation

[Figure: (a) effect of the certainty threshold ε on the size of the DQ profile for different minimum-set thresholds τ; (b) effect of τ on the percentage of correct estimations made using the DQ profile.]

Page 17: EFFICIENT PROFILING FOR  ESTIMATION OF QUERY RESULT QUALITY

Effect of input data size

Evaluation

[Figure: scalability graphs. (a) generated profile size versus number of records in the data set; (b) profile generation time versus database size.]

Page 18: EFFICIENT PROFILING FOR  ESTIMATION OF QUERY RESULT QUALITY

Next Steps

Maintenance of the Conditional Data Quality Profile

- Data quality changes over time, and the DQ profile should remain valid.

Scalability

- The scalability of conditional DQ profiles can be improved further. For example, sampling techniques can help make conditional DQ profiles more scalable.

Further Work