
Transcript of probabilistic ranking

Page 1: probabilistic ranking

Ranking of Database Query Results

Nitesh Maan, Arujn Saraswat, Nishant Kapoor

Page 2: probabilistic ranking

Introduction

As the name suggests, 'ranking' is the process of ordering a set of values (or data items) based on some parameter that is of high relevance to the user of the ranking process.

Ranking and returning the most relevant results of a user's query is a popular paradigm in information retrieval.

Page 3: probabilistic ranking

Ranking and Databases

Not much work has been done on ranking the results of queries in database systems.

We have all seen examples of ranked results on the Internet. The most common example is an Internet search engine (such as Google): a set of web pages satisfying the user's search criteria is returned, with the most relevant results featured at the top of the list.

Page 4: probabilistic ranking

In contrast to the WWW, databases support only a Boolean query model. For example, a selection query on a SQL database returns exactly those tuples that satisfy the conditions specified in the query. Depending on how selective those conditions are, two situations may arise:

Page 5: probabilistic ranking

Empty Answers: when the query is too selective, the answer may be empty.

Many Answers: when the query is not selective enough, the answer may contain a very large number of tuples.

We next consider these two scenarios in detail and look at various mechanisms for producing ranked results in each case.

Page 6: probabilistic ranking

The Empty Answers Problem

The empty answers problem is the consequence of a very selective query in a database system.

In this case it would be desirable to return a ranked list of 'approximately' matching tuples without requiring the user to specify any additional conditions; in other words, an automated approach for ranking and returning approximately matching tuples.

Page 7: probabilistic ranking

Automated Ranking Functions

Automated ranking of query results is the process of taking a user query and mapping it to a Top-K query with a ranking function that depends on the conditions specified in the user query.

A ranking function should work well even for large databases and have minimal side effects on query processing.

Page 8: probabilistic ranking

Automated Ranking Functions for the 'Empty Answers Problem'

IDF Similarity

QF Similarity

QFIDF Similarity

Page 9: probabilistic ranking

IDF Similarity

IDF (inverse document frequency) is an adaptation of a popular IR technique based on the philosophy that frequently occurring words convey less information about a user's needs than rarely occurring words, and thus should be weighted less.

Page 10: probabilistic ranking

IDF Similarity: Formal Definition

For every value 't' in the domain of attribute 'A', IDF(t) is defined as log(n / F(t)), where 'n' is the number of tuples in the database and F(t) is the number of tuples in which A = t.

The similarity between a tuple 'T' and a query 'Q' is simply the sum of the corresponding per-attribute similarity coefficients:

\mathrm{SIM}(T, Q) = \sum_{k=1}^{m} S_k(t_k, q_k)

where S_k(t_k, q_k) is the similarity coefficient between the query value q_k and the tuple value t_k on the k-th attribute.
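To make this concrete, here is a minimal Python sketch of IDF similarity for categorical attributes. It assumes, as in common adaptations of the IR technique, that S_k(t_k, q_k) equals IDF_k(t_k) when the tuple value matches the query value and 0 otherwise; all function and variable names are illustrative rather than taken from the slides.

```python
import math
from collections import Counter

def idf_weights(column_values):
    """IDF(t) = log(n / F(t)) for every distinct value t of one attribute."""
    n = len(column_values)
    freq = Counter(column_values)
    return {t: math.log(n / f) for t, f in freq.items()}

def idf_similarity(row, query, idf_by_attr):
    """SIM(T, Q): sum of S_k(t_k, q_k) over the query attributes, taking
    S_k = IDF_k(t_k) on an exact match and 0 otherwise (assumed handling
    for categorical attributes)."""
    score = 0.0
    for attr, q_val in query.items():
        if row.get(attr) == q_val:
            score += idf_by_attr[attr].get(q_val, 0.0)
    return score
```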

Page 11: probabilistic ranking

QF Similarity: Leveraging Workloads

There may be instances where the relevance of an attribute value is due to factors other than the frequency of its occurrence in the data.

QF similarity is based on this very philosophy. According to QF similarity, the importance of an attribute value is directly related to the frequency of its occurrence in the queries of the workload.
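A minimal sketch of how QF weights might be derived from a workload, assuming QF(q) is the raw frequency of value q in the workload's queries on an attribute, normalized by the maximum raw frequency; the exact normalization is an assumption, and the names are illustrative.

```python
from collections import Counter

def qf_weights(workload_values):
    """QF(q) = RQF(q) / RQFMax: how often value q appears in workload queries
    on this attribute, divided by the largest such count (assumed scheme)."""
    rqf = Counter(workload_values)
    rqf_max = max(rqf.values(), default=1)
    return {q: f / rqf_max for q, f in rqf.items()}
```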

Page 12: probabilistic ranking

QFIDF Similarity

QF similarity is purely workload-based, i.e., it does not use the data at all. This may be a disadvantage when the workload is insufficient or unreliable.

QFIDF similarity is a remedy in such situations: it combines the QF and IDF weights so that even a value that is never referenced in the workload receives a small non-zero QF weight.
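A sketch of one way the two weights might be combined. The additive smoothing of QF is an assumption, chosen to match the point above that values never referenced in the workload still get a small non-zero QF; the names are illustrative.

```python
import math
from collections import Counter

def qfidf_weights(column_values, workload_values):
    """QFIDF(t) = QF(t) * IDF(t), with QF smoothed so that values unseen in
    the workload still receive a small non-zero weight (assumed smoothing)."""
    n = len(column_values)
    data_freq = Counter(column_values)
    rqf = Counter(workload_values)
    rqf_max = max(rqf.values(), default=0)
    weights = {}
    for t, f in data_freq.items():
        qf = (rqf.get(t, 0) + 1) / (rqf_max + 1)  # small non-zero QF for unseen values
        idf = math.log(n / f)
        weights[t] = qf * idf
    return weights
```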

Page 13: probabilistic ranking

Breaking Ties

In the case of the many answers problem, the ranking functions discussed so far may fail to perform well.

This is because many tuples may tie for the same similarity score. Such a scenario can arise for the empty answers problem as well.

Breaking these ties requires looking beyond the attributes specified in the query, i.e., at the missing (unspecified) attributes.

Page 14: probabilistic ranking

The Many Answers Problem

We know by now that the many answers problem in database systems is the consequence of queries that are not very selective.

Such a query produces a large number of tuples that satisfy the conditions specified in the query.

Let us see how ranking of results is accomplished in this scenario.

Page 15: probabilistic ranking

Basic Approach

Any ranking function for the many answers problem has to look beyond the attributes specified in the query, since all (or a large number of) tuples satisfy the specified conditions.

Determining precisely which unspecified attributes matter is a challenging task. We show an adaptation of Probabilistic Information Retrieval (PIR) ranking methods.

Page 16: probabilistic ranking

Ranking Function for the Many Answers Problem

The ranking function for the many answers problem is developed by adapting PIR models that best capture data dependencies and correlations.

The ranking score of a tuple depends on two factors: (a) a global score, and (b) a conditional score.

These scores can be computed through workload analysis as well as data analysis.

Page 17: probabilistic ranking

Ranking Function: Adapting PIR Models to Structured Data

The basic philosophy of PIR models is that, given a document collection D, the set of relevant documents R, and the set of irrelevant documents R̄ (= D − R), any document t in D can be ranked by computing score(t), the probability that t belongs to the relevant set R.
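As background (standard PIR reasoning, not stated explicitly on this slide), ranking by score(t) = p(R | t) is equivalent to ranking by the relevance odds, since p(R) and p(R̄) do not depend on t:

```latex
\frac{p(R \mid t)}{p(\bar{R} \mid t)}
  = \frac{p(t \mid R)\, p(R)}{p(t \mid \bar{R})\, p(\bar{R})}
  \;\propto\; \frac{p(t \mid R)}{p(t \mid \bar{R})}
```

The final formula on Page 19 can then be read as approximating the unknown relevant set R by the workload W and the irrelevant set R̄ by the database D.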

Page 18: probabilistic ranking

Problems in Adapting This Approach

The problem with computing score(t) using a PIR model for databases is that the relevant set R is unknown at query time.

The approach is well suited to the IR domain, where R is usually estimated through user feedback.

Feedback-based estimation of R might be attempted in databases as well, but we propose an automated approach instead.

Page 19: probabilistic ranking

The Ranking Formula

This is the final ranking formula that we will use to compute the scores of tuples in order to rank them. It is composed of two factors:

Global part of the score: measures the global importance of the unspecified attribute values.

Conditional part of the score: measures the dependencies between the specified and unspecified attribute values.

\mathrm{Score}(t) \;\propto\; \underbrace{\prod_{y \in Y} \frac{p(y \mid W)}{p(y \mid D)}}_{\text{global part}} \;\cdot\; \underbrace{\prod_{x \in X} \prod_{y \in Y} \frac{p(x \mid y, W)}{p(x \mid y, D)}}_{\text{conditional part}}

where X is the set of attribute values specified in the query, Y is the set of unspecified attribute values of tuple t, W is the workload, and D is the database.
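A minimal Python sketch of evaluating this score for one tuple, assuming the atomic probabilities have already been estimated and are passed in as dictionaries keyed by attribute–value pairs; the dictionary layout and all names are illustrative, and the computation is done in log space purely for numerical stability.

```python
from math import log

def score(tuple_row, specified, p_y_W, p_y_D, p_xy_W, p_xy_D):
    """Score(t) as the product of global and conditional ratios, summed in
    log space. `specified` maps specified attributes to their query values;
    the p_* dictionaries hold pre-computed atomic probabilities (assumed
    structure, not from the slides)."""
    unspecified = {a: v for a, v in tuple_row.items() if a not in specified}
    s = 0.0
    for a_y, y in unspecified.items():
        # global part: importance of the unspecified value y
        s += log(p_y_W[(a_y, y)]) - log(p_y_D[(a_y, y)])
        # conditional part: dependence of each specified value x on y
        for a_x, x in specified.items():
            s += log(p_xy_W[(a_x, x, a_y, y)]) - log(p_xy_D[(a_x, x, a_y, y)])
    return s
```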

Page 20: probabilistic ranking

Architecture of the Ranking System

[Figure: detailed architecture of the ranking system]

Page 21: probabilistic ranking

Implementation: Pre-Processing

The pre-processing component is composed of the 'Atomic Probabilities Module' and the 'Index Module'.

Atomic Probabilities Module – responsible for computing the atomic probabilities needed to evaluate the ranking function Score(t).

Index Module – responsible for pre-computing the ranked lists needed to make the query processing module efficient.
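A rough sketch of what the Atomic Probabilities Module might compute, assuming simple relative-frequency estimates over a collection of rows (either the database or the workload's queries); smoothing and the association-rule machinery a real implementation would rely on are omitted, and the names are illustrative.

```python
from collections import Counter

def atomic_probabilities(rows):
    """Estimate p(v) for single attribute-value pairs and p(x | y) for ordered
    pairs of attribute-value pairs, by relative frequency over `rows`
    (a list of attribute -> value dictionaries)."""
    n = len(rows)
    single = Counter()
    pair = Counter()
    for row in rows:
        items = list(row.items())
        for av in items:
            single[av] += 1
        for x in items:
            for y in items:
                if x != y:
                    pair[(x, y)] += 1
    p_single = {av: c / n for av, c in single.items()}
    p_cond = {(x, y): c / single[y] for (x, y), c in pair.items()}
    return p_single, p_cond
```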

Page 22: probabilistic ranking

Implementation: Intermediate Layer

The atomic probabilities and the lists computed by the Index Module are all stored as database tables in the intermediate layer. All tables in the intermediate layer are indexed on appropriate attributes for fast access during the later stages.

The primary purpose of the intermediate layer is to avoid computing the score from scratch each time a query is received, by storing the pre-computed results of all atomic computations.
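One possible way to materialize the intermediate layer as indexed tables, sketched here with SQLite from Python; the schema is illustrative, not the system's actual layout.

```python
import sqlite3

def create_intermediate_layer(db_path="intermediate.db"):
    """Store atomic probabilities as an indexed table so that query-time
    lookups do not recompute anything (assumed, simplified schema)."""
    con = sqlite3.connect(db_path)
    con.execute("""CREATE TABLE IF NOT EXISTS atomic_prob (
                       attr   TEXT,
                       value  TEXT,
                       source TEXT,   -- 'W' (workload) or 'D' (database)
                       prob   REAL)""")
    con.execute("""CREATE INDEX IF NOT EXISTS idx_atomic
                   ON atomic_prob(attr, value, source)""")
    con.commit()
    return con
```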

Page 23: probabilistic ranking

The Index Module

The Index Module pre-computes ranked lists of tuples for each possible atomic query. Its purpose is to take the run-time load off the query processing component.

To assist the query processing component in returning the Top-K tuples, ranked lists of tuples for all possible "atomic" queries are pre-computed.

Taking the association rules and the database as input, a Conditional List and a Global List are created for each distinct value 'x' in the database.
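A sketch of how the two lists might be built per distinct value, assuming `global_score(t)` and `cond_score(t, x)` are derived from the pre-computed atomic probabilities; the exact list layout is an assumption based on the description above, and the names are illustrative.

```python
def build_lists(tuples, global_score, cond_score):
    """For each distinct attribute-value pair x, build:
      - a Conditional List: matching tuple IDs sorted by conditional score given x
      - a Global List: the same IDs sorted by global score."""
    cond_lists, glob_lists = {}, {}
    for attr in tuples[0]:
        for x in {t[attr] for t in tuples}:
            matching = [i for i, t in enumerate(tuples) if t[attr] == x]
            cond_lists[(attr, x)] = sorted(
                matching, key=lambda i: cond_score(tuples[i], (attr, x)), reverse=True)
            glob_lists[(attr, x)] = sorted(
                matching, key=lambda i: global_score(tuples[i]), reverse=True)
    return cond_lists, glob_lists
```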

Page 24: probabilistic ranking

Query Processing Component

The list merge algorithm is the key player in the query processing component.

Its function is to take the user query, compute scores for the tuples that satisfy the conditions specified in the query, rank the tuples in decreasing order of score, and return the Top-K tuples.
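A deliberately simplified stand-in for this step, which just scores every qualifying tuple and keeps the best K; the actual list merge algorithm works over the pre-computed Conditional and Global lists and can stop early instead of scoring everything.

```python
import heapq

def top_k(candidate_ids, score_of, k):
    """Return the K candidate tuple IDs with the highest scores
    (simplified; no early termination over pre-computed lists)."""
    return heapq.nlargest(k, candidate_ids, key=score_of)
```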

Page 25: probabilistic ranking

Space Requirements

Building the conditional and global lists consumes O(mn) space, where 'm' is the number of attributes and 'n' is the number of tuples in the database table.

There may be applications where space is an expensive resource.

In such cases, only a subset of the lists may be stored at pre-processing time, at the expense of an increase in query processing time.
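For a rough sense of scale (illustrative numbers, not from the slides): with m = 10 attributes, n = 1,000,000 tuples, and roughly 8 bytes per list entry, the lists would occupy on the order of 10 × 1,000,000 × 8 bytes ≈ 80 MB.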

Page 26: probabilistic ranking

What's Next

The ranking function presented here works on single-table databases and does not allow NULL values.

A very interesting but challenging extension of this work would be to develop ranking functions that work on multi-table databases and allow NULLs as well as non-text data in database columns.