K-relevance: Measuring source relevance in data integration query

Posted: 21-Dec-2015

Queries, relations and sources

K-relevance is defined for queries over one or more relations.

Every relation is based on data extracted from one or more external sources.

The data in a relation may be out of date: the data from some sources may have been extracted from earlier versions of those sources.

Relations and sources

Every tuple in a relation is based on exactly one source, and has a column that contains a reference to that source.

Example:

Number  source
0       Even_nums.html
1       Odd_nums.html

Relations and sources

One source may be used by more than one relation.

Example:

positiveNums

number  source
0       evenNums.html
1       oddNums.html

negativeNums

number  source
-2      evenNums.html
-1      oddNums.html

Relations and sources - example

Source information is needed

If a user thinks there is a mistake in the query results, knowing which sources the results are based on may help in finding the origin of the mistake.

If a source can never contribute to the query results, there is no need to extract data from it.

If a source can contribute to the query results regardless of the other sources, it may be worth extracting data from it more frequently.

Query results and sources

Every tuple in the query results is a join of tuples, one tuple from each relation.

The sources of a result tuple are the union of the sources of the tuples that were joined to create it.

0-relevance – the actual data sources

The union of the sources of all the tuples in the query results is called the 0-relevant sources.

If the query result is empty, there are no tuples in the results, so there are no 0-relevant sources.

0-relevance - example

Sources:

source      contents
Nums1.html  1
Nums2.html  2

allNums

n  src
1  nums1
2  nums2

evenNums

n  src
2  nums2

0-relevance - example

SELECT allNums.n FROM allNums, evenNums WHERE allNums.n <= evenNums.n

Result  allNums.src ∪ evenNums.src = sources
1       {nums1} ∪ {nums2} = {nums1, nums2}
2       {nums2} ∪ {nums2} = {nums2}

The 0-relevant sources: {nums1, nums2} ∪ {nums2} = {nums1, nums2}
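The computation in this example can be reproduced with a short Python sketch. It is only an illustration: the function name `zero_relevant_sources` and the representation of each tuple as an `(n, src)` pair are assumptions made here, not part of the original slides.

```python
from itertools import product

# Relations from the example; each tuple is (n, src).
all_nums = [(1, "nums1"), (2, "nums2")]
even_nums = [(2, "nums2")]

def zero_relevant_sources(left, right, predicate):
    """Join two relations and track the sources behind every result tuple.

    Returns (rows, relevant): rows is a list of (selected value, source set),
    relevant is the union of all those source sets.
    """
    rows = []
    for (left_n, left_src), (right_n, right_src) in product(left, right):
        if predicate(left_n, right_n):
            # A result tuple's sources are the union of its joined tuples' sources.
            rows.append((left_n, {left_src, right_src}))
    relevant = set().union(*(sources for _, sources in rows))
    return rows, relevant

# WHERE allNums.n <= evenNums.n
rows, relevant = zero_relevant_sources(all_nums, even_nums, lambda a, e: a <= e)
print(rows)
print(relevant)
```

With an empty result, `rows` is empty and the union is the empty set, which matches the statement that an empty query result has no 0-relevant sources.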

0-relevance via relation

For a relation R, if one of its tuples with source S has been joined to create a result tuple, then S is 0-relevant via R.

Example:

Result  allNums.src ∪ evenNums.src = sources
1       {nums1} ∪ {nums2} = {nums1, nums2}
2       {nums2} ∪ {nums2} = {nums2}

{nums1, nums2} are 0-relevant via allNums.

{nums2} is 0-relevant via evenNums.
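To make the "via relation" distinction concrete, the following Python sketch records, for each relation separately, which sources contributed a tuple to some result. The function name `zero_relevant_via` and the `(n, src)` tuple layout are illustrative assumptions.

```python
from itertools import product

# Relations from the example; each tuple is (n, src).
all_nums = [(1, "nums1"), (2, "nums2")]
even_nums = [(2, "nums2")]

def zero_relevant_via(left, right, predicate):
    """Return the sources that are 0-relevant via the left and via the right relation."""
    via_left, via_right = set(), set()
    for (left_n, left_src), (right_n, right_src) in product(left, right):
        if predicate(left_n, right_n):
            via_left.add(left_src)    # this tuple of the left relation was joined
            via_right.add(right_src)  # this tuple of the right relation was joined
    return via_left, via_right

via_all_nums, via_even_nums = zero_relevant_via(all_nums, even_nums, lambda a, e: a <= e)
print(via_all_nums, via_even_nums)
```

The union of the two returned sets is the set of 0-relevant sources of the query.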

Definition: Potential tuple

A “potential tuple” for a relation is any tuple that fits the schema of the relation (it may or may not actually exist in the relation).

For example, for the relation R(string, int), every tuple of the form (string s, int i) is a potential tuple.

For a relation that contains a source column, every potential tuple that has S in this column is called a potential tuple from S.

Note: every “real” tuple in R is also a potential tuple, because it fits the schema of R.
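As a rough illustration of the definition, a potential-tuple check can be sketched in Python by comparing a candidate against a schema given as a tuple of types. The helper names `is_potential_tuple` and `is_potential_tuple_from`, and the encoding of a schema as Python types, are assumptions made for this sketch.

```python
def is_potential_tuple(candidate, schema):
    """True if `candidate` has exactly one value of the right type per schema column."""
    return len(candidate) == len(schema) and all(
        isinstance(value, column_type) for value, column_type in zip(candidate, schema)
    )

def is_potential_tuple_from(candidate, schema, source, source_column):
    """True if `candidate` is a potential tuple whose source column holds `source`."""
    return is_potential_tuple(candidate, schema) and candidate[source_column] == source

R_SCHEMA = (str, int)        # the relation R(string, int) from the text
NUMS_SCHEMA = (int, str)     # a (number, source) relation; source is column 1

print(is_potential_tuple(("abc", 7), R_SCHEMA))
print(is_potential_tuple_from((4, "evenNums.html"), NUMS_SCHEMA, "evenNums.html", 1))
```

Note that the check never looks at the relation's contents: whether a tuple is potential depends only on the schema.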

∞-relevance via relation

If there are a potential tuple from the source S for the relation R, and potential tuples for the other relations in the query, which can join to satisfy the query and create a result tuple, then S is called an ∞-relevant source via R.

∞-relevance

The union of the ∞-relevant sources via the relations in the query is the set of ∞-relevant sources of the query.

Note: the ∞-relevant sources are independent of the data in the relations, and depend only on the query and the sources of the queried relations.

∞-relevance

Every source of the relations is ∞-relevant, unless the query has constraints on the source column.

Note: the data sources of the relations are shared: if S is a source of R1, it is also a source of R2. Therefore, if there are no constraints on the source column of at least one of the relations, all of the sources are ∞-relevant.

∞-relevance - example

For example, if the data sources are {src1.html, src2.html}, then in the query

SELECT A.x FROM A, B WHERE A.source != 'src1.html' AND A.x < B.x

there is no potential tuple for A from src1 that satisfies the query.

There is a potential tuple for A from src2 (for example, {x=1, src=src2}) and a potential tuple for B (for example, {x=2, src=src1}) which satisfy the query and create the result tuple (1). Therefore, src2 is ∞-relevant via A.

∞-relevance - example

The data sources are {src1.html, src2.html}.

SELECT A.x FROM A, B WHERE A.source != 'src1.html' AND A.x < B.x

There is a potential tuple for B from src1 (for example, {x=2, src=src1}) and a potential tuple for A (for example, {x=1, src=src2}) which satisfy the query and create the result tuple (1). Therefore, src1 is ∞-relevant via B.

There is a potential tuple for B from src2 (for example, {x=3, src=src2}) and a potential tuple for A (for example, {x=2, src=src2}) which satisfy the query and create the result tuple (2). Therefore, src2 is ∞-relevant via B.

∞-relevance - example

{src2} is ∞-relevant via A.

{src1, src2} are ∞-relevant via B.

{src2} ∪ {src1, src2} = {src1, src2} are the ∞-relevant sources of the query.
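The example can be checked by brute force over a small sample of potential tuples. The Python sketch below is only an illustration under stated assumptions: potential tuples are modelled as `(x, source)` pairs, and `X_VALUES` is an arbitrary finite sample of the integer domain chosen here. A witness found in the sample shows that a source is ∞-relevant; in general, finding no witness in a finite sample does not by itself prove the opposite.

```python
from itertools import product

SOURCES = ["src1.html", "src2.html"]
X_VALUES = range(4)  # assumption: a small finite sample of possible x values

def satisfies(a, b):
    """WHERE A.source != 'src1.html' AND A.x < B.x; tuples are (x, source)."""
    return a[1] != "src1.html" and a[0] < b[0]

# A and B share the same schema and sources, so they share the same potential tuples.
potential = list(product(X_VALUES, SOURCES))

# A source is infinity-relevant via a relation if some satisfying pair of
# potential tuples takes that relation's tuple from the source.
via_a = {a[1] for a, b in product(potential, potential) if satisfies(a, b)}
via_b = {b[1] for a, b in product(potential, potential) if satisfies(a, b)}

print(via_a)
print(via_b)
print(via_a | via_b)
```

Here src1.html is excluded via A by the constraint on A.source itself, so no choice of sample could produce a witness for it.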

k-relevance

Assume the query is over m relations. If there are a potential tuple from the source S for the relation R, at most k-1 other potential tuples for at most k-1 of the other relations (one tuple per relation), and real tuples for each of the remaining relations in the query, which can join to create a result tuple of the query, then S is called a k-relevant source via R.

K-relevance

The union of the k-relevant sources via all relations in the query is called the k-relevant sources of the query.

Note: if k is greater than or equal to m (the number of queried relations), k-relevance is by definition the same as ∞-relevance, because all of the joining tuples may be potential tuples, and there is no need to join with real tuples.

K-relevance - notes

If S is k-relevant, then k potential tuples (one of them from S) can join with m-k real tuples to satisfy the query.

In that case k+1 potential tuples can also join with m-k-1 real tuples, because every real tuple is also a potential tuple by definition.

Therefore, k-relevance is monotone: every k-relevant source is also a (k+1)-relevant source.

K-relevance - example

The sources are {sigcomm.html, sigmetrics.html}.

The query is:

SELECT Papers.title FROM Authors, Papers
WHERE Papers.author = Authors.name
AND Authors.org = 'MIT'
AND Papers.title LIKE '%Ubiquitous%'
AND Papers.src = Authors.src

K-relevance - example

The relations are:

The query result is empty, because there is no tuple in Authors with org='MIT'. Therefore, there are no 0-relevant sources.

Moreover, even if a source adds a tuple to Papers, the result will still be empty, because the new tuple cannot join with any tuple in Authors. Therefore, there are no 1-relevant sources via Papers.

K-relevance - example

If sigcomm.html adds the tuple (sigcomm.html, John, MIT, [email protected]) to Authors, it can join with the first tuple from Papers. Therefore, sigcomm.html is 1-relevant via Authors.

However, no tuple from sigmetrics.html, not even (sigmetrics.html, John, MIT, [email protected]), can join with any tuple from Papers, because all the tuples in Papers have ‘sigcomm’ in the source column.

Therefore, the 1-relevant sources for the query are {sigcomm.html}.

K-relevance - example

The potential tuples (sigmetrics.html, Todd, MIT, [email protected]) from sigmetrics.html in Authors and (sigmetrics.html, Todd, Boost Ubiquitous Access) in Papers can join to create the result tuple (Boost Ubiquitous Access).

Therefore, sigmetrics.html is a 2-relevant source via Authors.

K-relevance - example

sigmetrics.html is also a 2-relevant source via Papers: the potential tuples (sigmetrics.html, Todd, Boost Ubiquitous Access) from sigmetrics.html in Papers and (sigmetrics.html, Todd, MIT, [email protected]) in Authors can join to create the result tuple (Boost Ubiquitous Access).

Therefore, sigmetrics.html is a 2-relevant source of the query.

sigcomm.html is also a 2-relevant source of the query, because it is a 1-relevant source and k-relevance is monotone.

K-relevance – example - conclusion

There are no 0-relevant sources.

The only 1-relevant source is sigcomm.html.

The 2-relevant sources are {sigcomm.html, sigmetrics.html}.

The query is over only 2 relations, therefore the ∞-relevant sources are also {sigcomm.html, sigmetrics.html}.
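The conclusion above can be reproduced with a brute-force Python sketch of the k-relevance definition. Everything concrete in it is an assumption made for illustration. The contents of Authors and Papers are not available in this transcript, so the `real` tuples below are hypothetical stand-ins that merely respect what the text states (no author from MIT, all papers from sigcomm.html, and a first Papers tuple that can join with John). The email column is omitted, and potential tuples are drawn from a small finite sample.

```python
from itertools import combinations, product

SOURCES = ["sigcomm.html", "sigmetrics.html"]

# HYPOTHETICAL real tuples; the original tables are not in the transcript.
# Authors: (src, name, org); Papers: (src, author, title).
real = [
    [("sigcomm.html", "John", "Stanford")],               # Authors: nobody from MIT
    [("sigcomm.html", "John", "Ubiquitous Networking")],  # Papers: only sigcomm.html
]

# Finite sample of potential tuples; real tuples are potential tuples too.
potential = [
    real[0] + list(product(SOURCES, ["John", "Todd"], ["MIT"])),
    real[1] + list(product(SOURCES, ["John", "Todd"], ["Boost Ubiquitous Access"])),
]

def satisfies(author, paper):
    """The WHERE clause of the example query."""
    return (paper[1] == author[1] and author[2] == "MIT"
            and "Ubiquitous" in paper[2] and paper[0] == author[0])

def k_relevant_via(k, r, source):
    """True if `source` is k-relevant via relation index r (0 = Authors, 1 = Papers)."""
    others = [i for i in range(len(real)) if i != r]
    # For k = 0 only real tuples count; for k >= 1 the tuple from `source` may be potential.
    own_pool = potential[r] if k >= 1 else real[r]
    own = [t for t in own_pool if t[0] == source]
    # Up to k-1 of the other relations may contribute potential tuples.
    n_free = max(0, min(k - 1, len(others)))
    for free in combinations(others, n_free):
        pools = [own if i == r else potential[i] if i in free else real[i]
                 for i in range(len(real))]
        if any(satisfies(*combo) for combo in product(*pools)):
            return True
    return False

def k_relevant_sources(k):
    """Union of the k-relevant sources via every relation in the query."""
    return {s for s in SOURCES for r in range(len(real)) if k_relevant_via(k, r, s)}

for k in (0, 1, 2):
    print(k, sorted(k_relevant_sources(k)))
```

Because the potential pools include the real tuples, the sketch is monotone in k, and any k of at least 2 (the number of relations) gives the same answer as k = 2, in line with the note that k-relevance coincides with ∞-relevance once k reaches m.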

K-relevance - summary

A source is 0-relevant if a tuple extracted from it into one or more of the queried relations has been joined to create a tuple in the query results.

A source is ∞-relevant if a potential tuple from it, in one of the relations, can join with potential tuples in the other relations to satisfy the query and create a tuple in the results.

A source is k-relevant if a potential tuple from it, in one of the relations, can join with potential tuples in at most k-1 of the other relations, and with real tuples in the remaining relations, to satisfy the query and create a tuple in the results.