7/28/2019 Data Mining Query Languages (2)
Data Mining Query Languages
Kristen LeFevre
April 19, 2004
With Thanks to Zheng Huang and Lei Chen
Problem Description
You guys are armed with two powerful tools
Database management systems
Efficient and effective data mining algorithms and frameworks
Generally, this work asks:
How can we merge the two?
How can we integrate data mining more closely with traditional database systems, particularly querying?
Three Different Answers
DMQL: A Data Mining Query Language for Relational Databases (Han et al., Simon Fraser University)
Integrating Data Mining with SQL Databases: OLE DB for Data Mining (Netz et al., Microsoft)
MSQL: A Query Language for Database Mining (Imielinski & Virmani, Rutgers University)
Some Common Ground
Create and manipulate data mining models through a SQL-based interface (command-driven data mining)
Abstract away the data mining particulars
Data mining should be performed on data in the database (should not need to export to a special-purpose environment)
Approaches differ on what kinds of models should be created, and what operations we should be able to perform
DMQL
Commands specify the following:
The set of data relevant to the data mining task (the training set)
The kinds of knowledge to be discovered
Generalized relation
Characteristic rules
Discriminant rules
Classification rules
Association rules
DMQL
Commands Specify the following:
Background knowledge
Concept hierarchies based on attribute relationships, etc.
Various thresholds
Minimum support, confidence, etc.
DMQL
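Syntax
use database <database_name>
{use hierarchy <hierarchy_name> for <attribute>}
<rule_spec>
related to <attr_or_agg_list>
from <relation(s)>
[where <condition>]
[order by <order_list>]
{with [<kinds_of>] threshold = <threshold_value> [for <attribute(s)>]}
use hierarchy: specify background knowledge
<rule_spec>: specify rules to be discovered
related to: relevant attributes or aggregations
from, where: collect the set of relevant data to mine
with threshold: specify threshold parameters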
DMQL
Syntax
find classification rules [as <rule_name>] [according to <attributes>]
find association rules [as <rule_name>]
generalize data [into <relation_name>]
others
DMQL
use database Hospital
find association rules as Heart_Health
related to Salary, Age, Smoker, Heart_Disease
from Patient_Financial f, Patient_Medical m
where f.ID = m.ID and m.age >= 18
with support threshold = .05
with confidence threshold = .7
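The two thresholds in the query above can be illustrated with a minimal Python sketch. It assumes the usual definitions (support is the fraction of rows containing every item of the rule; confidence is the rule's support divided by the support of its body). The function names and the toy rows are hypothetical and are not part of DMQL.

```python
from typing import FrozenSet, List

Row = FrozenSet[str]  # one patient, encoded as a set of "Attribute=value" strings


def support(itemset: Row, rows: List[Row]) -> float:
    """Fraction of rows that contain every item in itemset."""
    if not rows:
        return 0.0
    return sum(1 for r in rows if itemset <= r) / len(rows)


def confidence(body: Row, consequent: Row, rows: List[Row]) -> float:
    """Support of the whole rule divided by the support of its body."""
    body_support = support(body, rows)
    if body_support == 0:
        return 0.0
    return support(body | consequent, rows) / body_support


def passes_thresholds(body: Row, consequent: Row, rows: List[Row],
                      min_support: float = 0.05,
                      min_confidence: float = 0.7) -> bool:
    """Keep a rule only if it meets both thresholds (defaults mirror the query)."""
    return (support(body | consequent, rows) >= min_support
            and confidence(body, consequent, rows) >= min_confidence)


# Hypothetical toy data, for illustration only.
patients = [
    frozenset({"Smoker=yes", "Heart_Disease=yes"}),
    frozenset({"Smoker=yes", "Heart_Disease=yes"}),
    frozenset({"Smoker=yes", "Heart_Disease=no"}),
    frozenset({"Smoker=no", "Heart_Disease=no"}),
]
body = frozenset({"Smoker=yes"})
consequent = frozenset({"Heart_Disease=yes"})
```

On this toy data the rule Smoker=yes => Heart_Disease=yes has support 2/4 and confidence 2/3, so it would be rejected by the .7 confidence threshold.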
DMQL
DMQL provides a "display in" command to view resulting rules, but no advanced way to query them
Suggests that a GUI might aid in the presentation of these results in different forms (charts, graphs, etc.)
MSQL
Focus on Association Rules
Seeks to provide a language both to selectively generate rules, and separately to query the rule base
Expressive rule generation language, and techniques for optimizing some commands
MSQL
Get-Rules and Select-Rules Queries
The Get-Rules operator generates rules over elements of argument class C that satisfy the conditions described in the where clause
[Project Body, Consequent, confidence, support]
GetRules(C) [as R1]
[into <rulebase_name>]
[where <conditions>]
[sql-group-by clause]
[using-clause]
MSQL
The where clause may contain a number of conditions, including:
Restrictions on the attributes in the body or consequent
rule.body HAS {(Job = Doctor)}
rule1.consequent IN rule2.body
rule.consequent IS {Age = *}
(IN, HAS, and IS test for subset, superset, and equality, respectively)
Pruning conditions (restrict by support, confidence, or size)
Stratified or correlated subqueries
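One way to read the three operators is as set comparisons over descriptors. The Python sketch below is an illustration under stated assumptions, not MSQL's actual implementation: descriptors are modeled as (attribute, value) pairs, and a value of "*" is assumed to match any value of that attribute. All function names are hypothetical.

```python
from typing import Iterable, Tuple

Descriptor = Tuple[str, str]  # (attribute, value); "*" is treated as a wildcard


def matches(pattern: Descriptor, descriptor: Descriptor) -> bool:
    """True if the attributes agree and the pattern value is '*' or equal."""
    return pattern[0] == descriptor[0] and pattern[1] in ("*", descriptor[1])


def has(part: Iterable[Descriptor], patterns: Iterable[Descriptor]) -> bool:
    """part HAS patterns: every pattern is matched by some descriptor (superset)."""
    part = list(part)
    return all(any(matches(p, d) for d in part) for p in patterns)


def is_in(part: Iterable[Descriptor], patterns: Iterable[Descriptor]) -> bool:
    """part IN patterns: every descriptor matches some pattern (subset)."""
    patterns = list(patterns)
    return all(any(matches(p, d) for p in patterns) for d in part)


def is_(part: Iterable[Descriptor], patterns: Iterable[Descriptor]) -> bool:
    """part IS patterns: subset and superset at once (equality)."""
    part, patterns = list(part), list(patterns)
    return has(part, patterns) and is_in(part, patterns)


# Hypothetical rule body used for illustration.
body = frozenset({("Job", "Doctor"), ("Age", "40")})
```

With this body, `has(body, {("Job", "Doctor")})` holds, while `is_(body, {("Age", "*")})` does not, because the Job descriptor has no matching pattern.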
MSQL
GetRules(Patients) R1
where Body has {Age = *}
and Support > .05 and Confidence > .7
and not exists ( GetRules(Patients) R2
where Support > .05 and Confidence > .7
and R2.Body HAS R1.Body)
Retrieve all rules with a descriptor of the form Age = x in the body, except when another rule that meets the same support and confidence thresholds has a body containing a superset of those descriptors
MSQL
Correlated
GetRules(C) R1
where <conditions>
and not exists ( GetRules(C) R2
where <conditions>
and R2.Body HAS R1.Body)
Stratified
GetRules(C) R1
where <conditions>
and consequent is {(X=*)}
and consequent in (SelectRules(R2)
where consequent is {(X=*)})
MSQL
Nested Get-Rules Queries and their optimization
Stratified (non-correlated) queries are evaluated bottom-up. The subquery is evaluated first, and replaced with its results in the outer query.
Correlated queries are evaluated either top-down or bottom-up (like loop-unfolding), and there are rules for choosing between the two options
MSQL
GetRules(Patients) R1
where Body has {Age = *}
and Support > .05 and Confidence > .7
and not exists ( GetRules(Patients) R2
where Support > .05 and Confidence > .7
and R2.Body HAS R1.Body)
MSQL
Top-Down Evaluation
For each rule produced by the outer query, evaluate the inner query
Outer:
GetRules(Patients) R1
where Body has {Age = *}
and Support > .05 and Confidence > .7
Inner:
not exists ( GetRules(Patients) R2
where Support > .05 and Confidence > .7
and R2.Body HAS R1.Body)
MSQL
Bottom-Up Evaluation
For each rule produced by the inner query, evaluate the outer query
Inner:
not exists ( GetRules(Patients) R2
where Support > .05 and Confidence > .7
and R2.Body HAS R1.Body)
Outer:
GetRules(Patients) R1
where Body has {Age = *}
and Support > .05 and Confidence > .7
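The two evaluation orders can be compared with a small Python sketch. It is a simplification under stated assumptions: the rule base is a hypothetical in-memory list rather than rules generated on demand, and "R2.Body HAS R1.Body" is read as a proper superset so that a rule does not exclude itself. Both strategies should return the same answer; they differ only in which query drives the loop.

```python
from typing import FrozenSet, List, NamedTuple


class Rule(NamedTuple):
    body: FrozenSet[str]
    consequent: str
    support: float
    confidence: float


def outer_ok(r: Rule) -> bool:
    """Outer query: an Age descriptor in the body, plus both thresholds."""
    return (any(d.startswith("Age=") for d in r.body)
            and r.support > 0.05 and r.confidence > 0.7)


def inner_ok(r: Rule) -> bool:
    """Inner query: thresholds only."""
    return r.support > 0.05 and r.confidence > 0.7


def correlated(r2: Rule, r1: Rule) -> bool:
    """R2.Body HAS R1.Body, assumed here to mean a proper superset."""
    return r2.body > r1.body


def top_down(rules: List[Rule]) -> List[Rule]:
    """For each rule produced by the outer query, evaluate the inner query."""
    return [r1 for r1 in rules
            if outer_ok(r1)
            and not any(inner_ok(r2) and correlated(r2, r1) for r2 in rules)]


def bottom_up(rules: List[Rule]) -> List[Rule]:
    """For each rule produced by the inner query, find the outer rules it excludes."""
    excluded = set()
    for r2 in rules:
        if not inner_ok(r2):
            continue
        for r1 in rules:
            if outer_ok(r1) and correlated(r2, r1):
                excluded.add(r1)
    return [r1 for r1 in rules if outer_ok(r1) and r1 not in excluded]


# Hypothetical rule base, for illustration only.
rules = [
    Rule(frozenset({"Age=40"}), "Heart_Disease=yes", 0.10, 0.80),
    Rule(frozenset({"Age=40", "Smoker=yes"}), "Heart_Disease=yes", 0.08, 0.90),
    Rule(frozenset({"Age=60"}), "Heart_Disease=yes", 0.20, 0.75),
    Rule(frozenset({"Age=60", "Smoker=no"}), "Heart_Disease=no", 0.02, 0.90),
    Rule(frozenset({"Smoker=yes"}), "Heart_Disease=yes", 0.30, 0.80),
]
```

Here the first rule is dropped because the second rule has a superset body and meets the thresholds; the third rule survives because its only superset rule fails the support threshold.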
MSQL
Choosing between the two
In general, evaluate the expression with more restrictive conditions first
Heuristic rules
Evaluate the query with higher support threshold first
Next consider confidence threshold
A (length = x) expression is in general more restrictive than (length > x), which is more restrictive than (length < x)
Body IS (constant expression) is more restrictive than Body HAS, which is more restrictive than Body IN
Next consider Consequent IN expressions
Descriptors of the form (A = a) are more restrictive than wildcards such as (A = *)
Meant to prevent unconstrained queries from being evaluated first
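The first two heuristics amount to a lexicographic comparison, which the following Python sketch illustrates. It covers only the support and confidence thresholds and ignores the length, Body, and Consequent heuristics; `QuerySpec` and `evaluate_first` are hypothetical names chosen for this example.

```python
from typing import NamedTuple


class QuerySpec(NamedTuple):
    name: str
    min_support: float
    min_confidence: float


def evaluate_first(a: QuerySpec, b: QuerySpec) -> QuerySpec:
    """Pick the query to evaluate first.

    The higher support threshold wins; the confidence threshold breaks ties.
    If both are equal, the first argument is returned.
    """
    key_a = (a.min_support, a.min_confidence)
    key_b = (b.min_support, b.min_confidence)
    return b if key_b > key_a else a


# Hypothetical sub-queries, for illustration only.
outer = QuerySpec("outer", 0.05, 0.7)
inner = QuerySpec("inner", 0.10, 0.5)
```

In this example the inner query would be evaluated first, because its support threshold is higher even though its confidence threshold is lower.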
OLE DB for DM
An extension to the OLE DB interface for Microsoft SQL Server
Seeks to support the following ideas:
Define a model by specifying the set of attributes to be predicted, the attributes used for the prediction, and the algorithm
Populate the model using the training data
Predict attributes for new data using the populated model
Browse the mining model (not fully addressed because it varies a lot by model type)
None of the others seemed to support this
OLE DB for DM
Defining a Mining Model
Identify the set of data attributes to be predicted, the set of attributes to be used for prediction, and the algorithm to be used for building the model
Populating the Model
Pull the information into a single rowset using views, and train the model using the data and algorithm specified
Supports complex objects, so the rowset may be hierarchical (see paper for more complex examples)
OLE DB for DM
Using the mining model to predict
Defines a new operator: prediction join.
A model may be used to make predictions on datasets by taking the prediction join of the mining model and the data set.
OLE DB for DM
CREATE MINING MODEL [Heart_Health Prediction]
(
[ID] Int Key,
[Age] Int,
[Smoker] Int,
[Salary] Double discretized,
[HeartAttack] Int PREDICT -- prediction column
)
USING [Decision_Trees_101]
Identifies the source columns for the training data, the column to be predicted, and the data mining algorithm.
OLE DB for DM
INSERT INTO [Heart_Health Prediction]
([ID], [Age], [Smoker], [Salary])
SELECT [ID], [Age], [Smoker], [Salary]
FROM Patient_Medical M, Patient_Financial F
WHERE M.ID = F.ID
The INSERT represents using a tuple for training the model (not actually inserting it into the rowset).
OLE DB for DM
SELECT t.[ID],
[Heart_Health Prediction].[HeartAttack]
FROM [Heart_Health Prediction]
PREDICTION JOIN (
SELECT [ID], [Age], [Smoker], [Salary]
FROM Patient_Medical M, Patient_Financial F
WHERE M.ID = F.ID) as t
ON [Heart_Health Prediction].Age = t.Age
AND [Heart_Health Prediction].Smoker = t.Smoker
AND [Heart_Health Prediction].Salary = t.Salary
Prediction join connects the model and an actual data table to make predictions
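The define, populate, and predict steps can be mimicked in a few lines of Python. This is only a sketch of the workflow, not OLE DB for DM itself: the `MiningModel` class and its methods are hypothetical, and a majority-vote lookup table stands in for the Decision_Trees_101 algorithm.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Sequence, Tuple


class MiningModel:
    """Toy stand-in for a mining model (assumption: majority vote per input key)."""

    def __init__(self, input_columns: Sequence[str], predict_column: str):
        # Define step: which columns drive the prediction and which is predicted.
        self.input_columns = list(input_columns)
        self.predict_column = predict_column
        self._votes: Dict[tuple, Counter] = defaultdict(Counter)
        self._overall: Counter = Counter()

    def insert(self, rows: Iterable[dict]) -> None:
        """Populate step: rows train the model and are not stored as tuples."""
        for row in rows:
            key = tuple(row[c] for c in self.input_columns)
            label = row[self.predict_column]
            self._votes[key][label] += 1
            self._overall[label] += 1

    def prediction_join(self, rows: Iterable[dict],
                        key_column: str) -> List[Tuple[object, object]]:
        """Predict step: pair each new row's key with the model's prediction."""
        if not self._overall:
            raise ValueError("model has not been trained")
        default = self._overall.most_common(1)[0][0]
        result = []
        for row in rows:
            key = tuple(row[c] for c in self.input_columns)
            votes = self._votes.get(key)
            label = votes.most_common(1)[0][0] if votes else default
            result.append((row[key_column], label))
        return result


# Hypothetical training and scoring rows, for illustration only.
model = MiningModel(["Age", "Smoker"], "HeartAttack")
model.insert([
    {"ID": 1, "Age": 60, "Smoker": 1, "HeartAttack": 1},
    {"ID": 2, "Age": 60, "Smoker": 1, "HeartAttack": 1},
    {"ID": 3, "Age": 60, "Smoker": 1, "HeartAttack": 0},
    {"ID": 4, "Age": 30, "Smoker": 0, "HeartAttack": 0},
    {"ID": 5, "Age": 30, "Smoker": 0, "HeartAttack": 0},
])
new_patients = [
    {"ID": 10, "Age": 60, "Smoker": 1},
    {"ID": 11, "Age": 30, "Smoker": 0},
    {"ID": 12, "Age": 45, "Smoker": 1},
]
predictions = model.prediction_join(new_patients, key_column="ID")
```

As in the prediction join, the new rows carry only the input columns; the predicted column comes from the model. Rows with an unseen input combination fall back to the most common training label, which is a choice made for this sketch.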
Key Ideas
Important to have an API for creating and manipulating data mining models
The data is already in the DBMS, so it makes sense to do the data mining where the data is
Applications already use SQL, so a SQL extension seems logical
Key Ideas
Need a method for defining data mining models, including algorithm specification, specification of various parameters, and training set specification (DMQL, MSQL, OLE DB for DM)
Need a method of querying the models (MSQL)
Need a way of using the data mining model to interact with other data in the database, for purposes such as prediction (OLE DB for DM)
Discussion Topic:
What functionality would an ideal solution support?