Patterns for Next-Generation Database Systems
IST-2001-33058
2002-03 / PANDA Consortium -1- http://dke.cti.gr/panda/
PANDA Technical Report Series
TR Number: PANDA-TR-2003-01
Title: A Survey on Pattern Application Domains and Pattern
Management Approaches
Author(s): M. Vazirgiannis, M. Halkidi, G. Tsatsaronis, E. Vrachnos
(AUEB),
D. Keim (KONSTANZ),
P. Xeros, Y. Theodoridis (CTI),
A. Pikrakis, S. Theodoridis (UoA)
Date: 7 February 2003
Research supported by the Commission of the European Communities under the Information
Society Technologies (IST) Programme – Future and Emerging Technologies (FET)
A Survey on Pattern Application Domains and
Pattern Management Approaches
M. Vazirgiannis 1, M. Halkidi 1, G. Tsatsaronis 1, E. Vrachnos 1,
D. Keim 2, P. Xeros 3, Y. Theodoridis 3, A. Pikrakis 4, S. Theodoridis 4
1 Dept of Informatics, Athens University of Economics and Business, Athens, Greece [www.db-net.aueb.gr]
2 Institute of Computer Science, University of Konstanz, Konstanz, Germany [www.inf.uni-konstanz.de]
3 Data and Knowledge Engineering Group, Computer Technology Institute, Patras, Greece [http://dke.cti.gr]
4 Dept of Informatics, University of Athens, Athens, Greece [www.di.uoa.gr]
ABSTRACT:
Data-intensive applications produce complex information that poses new requirements on Database Management Systems (DBMSs). Such information is characterized by its huge data volume and by its diversity and complexity, since data processing methods such as pattern recognition, data mining and knowledge extraction yield knowledge artifacts like clusters, association rules and decision trees. These artifacts, which we call patterns, need to be stored and retrieved efficiently; to accomplish this, they must be expressed within a formalism and a language.
In this report we review the concept of patterns and their applicability in several research domains related to the proposed work, and we define the knowledge domain of the PANDA project. Interrelating these domains is important in order to define the problem in a comprehensive and complete way and to derive requirements for a pattern management system.
We examine the different types of patterns that can be extracted from a data set, in order to gather the requirements for the definition of a pattern model. This model constitutes the heart of the Pattern Base Management System to be designed.
KEYWORDS: patterns, data mining, pattern modeling, pattern databases, information retrieval,
Pattern Base Management Systems
Patterns for Next-Generation Database Systems
IST-2001-33058
2002-03 / PANDA Consortium -3- http://dke.cti.gr/panda/
1. INTRODUCTION - MOTIVATION
2. APPLICATION FIELDS AND PATTERN USAGE
   2.1. DATA MINING
        2.1.1 Data Mining Patterns
   2.2. SIGNAL PROCESSING: CONTENT-BASED MUSIC RETRIEVAL
        2.2.1 Problem definition
        2.2.2 A survey of existing research efforts
        2.2.3 Patterns in music retrieval
   2.3. PATTERNS IN INFORMATION RETRIEVAL
   2.4. MATHEMATICS
        2.4.1 Number Patterns
        2.4.2 Patterns in graphs
        2.4.3 Patterns in shapes
        2.4.4 Patterns in infinite sequences
        2.4.5 Patterns in algebra
        2.4.6 Patterns in Geometry
        2.4.7 Patterns in Cryptography
   2.5. VISUALIZATION
3. CURRENT ISSUES IN MODELING DATA MINING PROCESSES AND RESULTS
   3.1. DATA MINING GROUP / PREDICTIVE MODEL MARKUP LANGUAGE [DMG]
        3.1.1 Overview
        3.1.2 General Structure of a PMML Document
        3.1.3 Header
        3.1.4 Settings
        3.1.5 Data Dictionary
        3.1.6 Transformation Dictionary (Derived Values)
        3.1.7 PMML mining models
        3.1.8 Example: DTD of Association Rules Model
   3.2. SQL/MM
        3.2.1 Overview
        3.2.2 Part 6: Data Mining
        3.2.3 Example: Rule Model
   3.3. COMMON WAREHOUSE MODEL (CWM)
        3.3.1 Overview
        3.3.2 Data Mining Metamodel
   3.4. JAVA DM API
   3.5. ORACLE9I DATA MINING
   3.6. INFORMATION DISCOVERY DATAMINING SUITE
4. CONCLUSION
1. Introduction - Motivation
With recent hardware advances, complex information resulting from data-intensive
applications is posing requirements for novel Database Management Systems (DBMSs). Such
information possesses a number of key features:
• Huge volume of data: for example, images collected daily from various sources
(satellites, etc.) need to be stored and retrieved efficiently; in this case the data can be
represented concisely by a set of patterns (for instance, a mathematical formula might
represent the trajectory of a satellite). Moreover, traditional databases are growing huge
due to high transaction rates in various domains (banking, stock exchange,
telecommunications, etc.).
• Diversity and complexity: data processing methods (pattern recognition, data mining,
knowledge extraction) result in knowledge artifacts (i.e., clusters, rules, patterns in
general) that likewise need to be managed by a DBMS-like environment.
Knowledge artifacts thus arise as significant representational primitives in recently
computerized application domains and therefore call for integrated and efficient DBMS
support. Unfortunately, patterns have not so far been treated as persistent objects that can be
stored, retrieved and queried. It is now time to tackle the challenge of integrating the two
worlds (patterns and data) by designing fundamental approaches to providing database
support for patterns.
Various application domains dealing with patterns (telecom, medical, environmental
information systems, etc.) will directly benefit from a system that integrates data and pattern
management. This will be due to the fact that database support will enhance the maintenance
and manipulation of both their data collections and artifacts produced in the form of patterns.
Another field of advance results from the fundamentally novel paradigm arising from patterns
and affecting various database research areas, such as data models, query languages, query
processing and indexing techniques, and visual user interfaces.
In this report we review the concept of patterns and their applicability in several research
domains related to the proposed work, and we define the knowledge domain of the PANDA
project. Interrelating these domains is important in order to define the problem in a
comprehensive and complete way and to derive requirements for a pattern management
system.
The remainder of the report is organized as follows. Section 2 reviews pattern application
fields and existing research results on patterns; there is a rich body of fundamental research
related to patterns in mathematics, data mining and pattern recognition, as well as in several
application domains. Section 3 discusses current issues in modeling Data Mining processes
and results. We conclude in Section 4 with a set of requirements for designing a system that
can handle the modeling, storage, visualization and retrieval of patterns.
2. Application fields and pattern usage
Various application domains related to data management (storage, processing, retrieval,
data analysis) produce different forms of patterns representing insights into the data. In the
sequel we present some representative pattern application domains and the corresponding
types of patterns they produce.
2.1. Data Mining
The last decade has brought explosive growth in our capability to both generate and collect
data. Advances in database technology have provided the basic tools and methods for
efficient data collection, storage and retrieval. The result is a flood of data and a growing
data-glut problem in the worlds of science and business. The volume of data has outpaced
our ability to analyze and interpret it and to extract "useful" knowledge, creating the need
for a new generation of tools and techniques for intelligent database analysis. This need has
been recognized by researchers in different areas (artificial intelligence, statistics, data
warehousing, on-line analytical processing, expert systems and data visualization), and a
new research area has emerged, known as Data Mining.
Data Mining is mainly concerned with methodologies for extracting patterns from large data
repositories. Data mining is one step in the knowledge discovery process, although others
treat the term as a synonym for the more general Knowledge Discovery in Databases.
The data mining process may interact with users or with a knowledge base. The extracted
patterns are evaluated against interestingness measures to identify those representing
knowledge, i.e., interesting patterns. These patterns are presented to the user and may be
stored as new knowledge in the knowledge base. We therefore adopt a broad view of data
mining functionality: data mining is the process of discovering interesting knowledge from
large amounts of data stored in databases, data warehouses or other information
repositories.
2.1.1 Data Mining Patterns
Many data mining algorithms, each accomplishing a limited set of tasks, produce a particular
enumeration of patterns over data sets. According to the well-established data mining
process [BL96], the main tasks are: i) the definition/extraction of clusters that provide a
classification scheme, ii) the classification of database values into the categories defined,
iii) the extraction of association rules or other knowledge artefacts, and iv) the discovery
and analysis of sequences.
Since data mining tasks can be performed on various types of data stores and database
systems, different kinds of data patterns can be mined. In some cases users have no idea
which kinds of patterns in their data could be interesting; it is therefore important that a data
mining system mine and store multiple kinds of patterns so as to accommodate different user
expectations and applications. Another requirement in data mining is the granularity of the
results: in some cases it is important to have different levels of abstraction for the patterns
mined from a data repository, depending on the application or user requirements.
In the sequel, we discuss the data mining functionalities and the kinds of patterns that can be
mined from a body of data.
2.1.1.1 Clustering
Clustering is one of the most useful tasks in the data mining process for discovering groups
and identifying interesting distributions and patterns in the underlying data. The clustering
problem is about partitioning a given data set into groups (clusters) such that the data points
in a cluster are more similar to each other than to points in different clusters [GRS98]. For
example, consider a retail database whose records contain the items purchased by customers.
A clustering procedure could group the customers in such a way that customers with similar
buying patterns are in the same cluster. Thus, the main concern in the clustering process is to
reveal the organization of patterns into "sensible" groups, which allow us to discover
similarities and differences, as well as to derive useful conclusions about them. This idea is
applicable in many fields, such as the life sciences, medical sciences and engineering.
Clustering may be found under different names in different contexts, such as unsupervised
learning (in pattern recognition), numerical taxonomy (in biology, ecology), typology (in the
social sciences) and partitioning (in graph theory) [TK98].
In the clustering process, there are no predefined classes and no examples showing what
kind of relations should hold among the data, which is why clustering is perceived as an
unsupervised process [BL96]. On the other hand, classification is a procedure of assigning
a data item to a predefined set of categories [FSSU96]. Clustering produces initial categories
in which values of a data set are classified during the classification process.
The clustering process may result in different partitionings of a data set, depending on the
specific criterion used for clustering. Thus, preprocessing is needed before a clustering task
is applied to a data set. The basic steps of the clustering process are presented in Figure 1
and can be summarized as follows [FSSU96]:
• Feature selection. The goal is to select the features on which clustering is to be
performed so as to encode as much information as possible concerning the task of
interest. Thus, preprocessing of the data may be necessary prior to its use in the
clustering task.
• Clustering algorithm. This step refers to the choice of an algorithm that results in a good
clustering scheme for the data set. A clustering algorithm is mainly characterized by a
proximity measure and a clustering criterion, which also determine its ability to define a
clustering scheme that fits the data set.
  i) Proximity measure: a measure that quantifies how "similar" two data points (i.e.
  feature vectors) are. In most cases we have to ensure that all selected features
  contribute equally to the computation of the proximity measure and that no features
  dominate others.
  ii) Clustering criterion: the criterion, expressed via a cost function or some other type of
  rule, that defines a "good" clustering. We should take into account the type of clusters
  expected to occur in the data set, so that the chosen criterion leads to a partitioning
  that fits the data set well.
• Validation of the results. The correctness of the clustering results is verified using
appropriate criteria and techniques. Since clustering algorithms define clusters that are
not known a priori, irrespective of the clustering method, the final partition of the data
requires some kind of evaluation in most applications [RLR98].
• Interpretation of the results. In many cases, experts in the application area have to
integrate the clustering results with other experimental evidence and analysis in order to
draw the right conclusions.
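The steps above can be illustrated with a minimal k-means sketch (the data, the deterministic "first k points" initialization, and names such as `kmeans` are ours, chosen for illustration; a real pipeline would add feature selection, validation indices and interpretation):

```python
def kmeans(points, k, iters=10):
    """Plain k-means sketch: Euclidean proximity measure, mean-update criterion."""
    # Deterministic initialization for reproducibility: first k points as centers.
    centers = list(points[:k])
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # Assignment step: each point joins its nearest center (proximity measure).
        clusters = [[] for _ in range(k)]
        for p in points:
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
            clusters[dists.index(min(dists))].append(p)
        # Update step: move each center to the mean of its cluster.
        for i, cl in enumerate(clusters):
            if cl:
                centers[i] = tuple(sum(xs) / len(cl) for xs in zip(*cl))
    return centers, clusters

# Hypothetical 2-D feature vectors forming two obvious groups.
data = [(0.0, 0.1), (10.0, 10.1), (0.2, -0.1), (9.8, 10.0), (-0.1, 0.0), (10.1, 9.9)]
centers, clusters = kmeans(data, k=2)
print([len(c) for c in clusters])  # -> [3, 3]
```

The proximity measure (squared Euclidean distance) and the clustering criterion (minimize within-cluster distance to the mean) are exactly the two choices that, as noted above, characterize the algorithm.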
Clustering Applications. Cluster analysis is a major tool in a number of applications in many
fields of business and science. Hereby, we summarize the basic directions in which clustering
is used [TK98]:
• Data reduction. Cluster analysis can contribute in compression of the information
included in data. In several cases, the amount of available data is very large and its
processing becomes very demanding. Clustering can be used to partition the data set into
a number of “interesting” clusters. Then, instead of processing the data set as an entity,
we adopt the representatives of the defined clusters in our process. Thus, data
compression is achieved.
• Hypothesis generation. Cluster analysis is used here in order to infer some hypotheses
concerning the data. For instance we may find in a retail database that there are two
significant groups of customers based on their age and the time of purchases. Then, we
[Figure 1. Steps of the Clustering Process: Data → Feature Selection → Clustering Algorithm Selection → Validation of results → Interpretation → Final Clusters / Knowledge]
may infer some hypotheses about the data, that is: "young people go shopping in the
evening", "old people go shopping in the morning".
• Hypothesis testing. In this case, cluster analysis is used to verify the validity of a specific
hypothesis. For example, consider the hypothesis: "Young people go shopping in the
evening". One way to verify it is to apply cluster analysis to a representative set of stores.
Suppose that each store is represented by its customers' details (age, job, etc.) and the
times of transactions. If, after applying cluster analysis, a cluster corresponding to
"young people buy in the evening" is formed, then the hypothesis is supported.
• Prediction based on groups. Cluster analysis is applied to the data set, and the resulting
clusters are characterized by the features of the patterns that belong to them. Unknown
patterns can then be classified into these clusters based on their similarity to the clusters'
features, and useful knowledge related to our data can be extracted. Assume, for
example, that cluster analysis is applied to a data set concerning patients infected by the
same disease. The result is a number of clusters of patients, grouped according to their
reaction to specific drugs. For a new patient, we then identify the cluster to which he/she
can be assigned and, based on this, decide on his/her medication.
More specifically, some typical applications of clustering are found in the following fields
[HK01]:
• Business. In business, clustering may help marketers discover significant groups in their
customers’ database and characterize them based on purchasing patterns.
• Biology. In biology, it can be used to define taxonomies, categorize genes with similar
functionality and gain insights into structures inherent in populations.
• Spatial data analysis. Due to the huge amounts of spatial data that may be obtained from
satellite images, medical equipment, Geographical Information Systems (GIS), image
database exploration etc., it is expensive and difficult for the users to examine spatial data
in detail. Clustering may help to automate the process of analysing and understanding
spatial data. It is used to identify and extract interesting characteristics and patterns that
may exist in large spatial databases.
• Web mining. In this case, clustering is used to discover significant groups of documents
on the Web, a huge collection of semi-structured documents. This classification of Web
documents assists information discovery.
In general terms, clustering may serve as a pre-processing step for other algorithms, such as
classification, which would then operate on the detected clusters.
2.1.1.2 Classification – Decision Making
The classification problem has been studied extensively in statistics, pattern recognition and
machine learning community as a possible solution to the knowledge acquisition or
knowledge extraction problem [RS98]. A number of classification techniques have been
developed and are available in bibliography. Among these, the most popular are: Bayesian
classification, Neural Networks and Decision Trees.
Bayesian classification is based on Bayesian statistical decision theory. The aim is to
classify a sample x into one of the given classes c1, c2, …, cN using a probability model
defined according to Bayes' theory [CS96]. Each category ci is characterized by a prior
probability P(ci) of observing that category. Also, we assume that a given sample x belongs
to category ci with conditional probability density p(x|ci). Using these definitions and Bayes'
formula, we define the posterior probability q(ci|x), proportional to p(x|ci)·P(ci). An input
pattern is classified into the category with the highest posterior probability. As this brief
description shows, complete knowledge of the underlying probability laws is necessary in
order to perform Bayesian classification [Hor+98].
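The rule above can be sketched in a few lines. The two-class model below (equal priors, Gaussian class-conditional densities with the stated means and standard deviations) is entirely hypothetical, chosen only to make the posterior comparison concrete:

```python
import math

def gaussian_pdf(x, mean, std):
    """Class-conditional density p(x|ci), assumed Gaussian for this sketch."""
    return math.exp(-((x - mean) ** 2) / (2 * std ** 2)) / (std * math.sqrt(2 * math.pi))

# Hypothetical two-class model: prior probabilities P(ci) and per-class (mean, std).
priors = {"c1": 0.5, "c2": 0.5}
params = {"c1": (0.0, 1.0), "c2": (4.0, 1.0)}

def classify(x):
    """Bayes rule: the posterior q(ci|x) is proportional to p(x|ci) * P(ci);
    the input pattern is assigned to the class with the highest posterior."""
    posteriors = {ci: gaussian_pdf(x, *params[ci]) * priors[ci] for ci in priors}
    return max(posteriors, key=posteriors.get)

print(classify(0.5))  # -> c1 (the sample lies near the mean of c1)
```

Note that the sketch presumes the densities and priors are known exactly, which is precisely the "complete knowledge of probability laws" requirement mentioned above.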
Decision trees are one of the widely used techniques for classification and prediction. A
number of popular classifiers construct decision trees to generate classification models.
A decision tree is constructed from a training set of pre-classified data. Each internal node
of the tree specifies a test of an attribute of the instance, and each branch descending from
that node corresponds to one of the possible values of this attribute. Each leaf corresponds
to one of the defined classes. To classify a new instance using a decision tree, we start at the
root and test the attribute specified by that node; successive internal nodes are visited until a
leaf is reached. At each internal node, the node's test is applied to the instance, and the
outcome determines the branch traversed and the next node visited [Mit+97]. The class of the
instance is the class of the final leaf node.
A number of algorithms for constructing decision trees have been developed over the years.
Some of the most widely known in the literature are ID3 [Mit+97], C4.5 [Quin+93], SPRINT
[SAM96], SLIQ [MAR96] and CART. In general terms, most of these algorithms have two
distinct phases, a building phase and a pruning phase [Mit+97]. In the building phase, the
training data set is recursively partitioned until all the instances in a partition have the same
class. The result is a tree that classifies every record from the training set. However, the tree
constructed may be sensitive to statistical
irregularities of the training set. Thus, most algorithms perform a pruning phase after the
building phase, in which nodes are pruned to prevent overfitting and to obtain a tree with
higher accuracy.
The various decision tree generation algorithms use different methods for selecting the test
criterion for partitioning a set of records [RS98]. One of the earliest algorithms, CLS,
examines the solution space of all possible decision trees to some fixed depth [RS98], then
selects a test that minimizes the computational cost of classifying a record; this cost
comprises the cost of determining the feature values to be tested as well as the cost of
misclassification. The algorithms ID3 [Mit+97] and C4.5 [Quin+93] rely on a statistical
property called information gain to select the attribute to be tested at each node in the tree.
The measure is based on entropy as used in information theory, which characterizes the
purity of an arbitrary collection of examples. Alternatively, algorithms like SLIQ [MAR96]
and SPRINT [SAM96] select the attribute to test based on the GINI index rather than the
entropy measure; the best attribute for testing (i.e. the attribute that gives the
best partitioning) gives the lowest value for the GINI index.
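The two splitting measures can be sketched directly from their definitions. The 10-record label column and its three-way split by a candidate attribute are hypothetical, chosen so that two partitions are pure and one is mixed:

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a set of class labels, the measure underlying ID3 and C4.5."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """GINI index of a set of class labels, the measure used by SLIQ and SPRINT."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def information_gain(labels, splits):
    """Entropy reduction achieved by partitioning `labels` into `splits`."""
    n = len(labels)
    return entropy(labels) - sum(len(s) / n * entropy(s) for s in splits)

# Hypothetical 10-record class column, split three ways by a candidate attribute.
labels = ["down"] * 5 + ["up"] * 5
splits = [["down", "down", "down"],        # attribute value A: pure partition
          ["down", "down", "up", "up"],    # attribute value B: mixed partition
          ["up", "up", "up"]]              # attribute value C: pure partition
print(round(information_gain(labels, splits), 3))  # -> 0.6
print(round(gini(labels), 3))                      # -> 0.5
```

The attribute maximizing `information_gain` (equivalently, minimizing the weighted GINI of its partitions) is the one chosen for the test at the current node.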
Another classification approach used in many data mining applications for prediction and
classification is based on neural networks. More specifically, methods of this kind use a
neural network to build a model for classification or prediction. The main steps of this
process (i.e. building a classification model) are [BL97]:
• Identification of the input and output features.
• Setting up a network with an appropriate topology.
• Selection of a suitable training set.
• Training the network on a representative data set. The data have to be represented in
such a way as to maximize the ability of the network to recognize patterns in it.
• Testing the network using a test set that is independent of the training set.
• Applying the model generated by the network to predict the classes (outcomes)
of unknown instances (inputs).
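The training step above can be sketched for the simplest possible topology, a single logistic unit trained by gradient descent (a hypothetical toy setup: the AND-function data, learning rate and epoch count are ours; a real application would use a richer topology and a test set independent of the training set):

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_network(train_set, lr=0.5, epochs=500, seed=1):
    """Train one logistic unit (two inputs + bias) by stochastic gradient descent."""
    rng = random.Random(seed)
    w = [rng.uniform(-0.5, 0.5) for _ in range(3)]  # [w1, w2, bias]
    for _ in range(epochs):
        for x1, x2, target in train_set:
            out = sigmoid(w[0] * x1 + w[1] * x2 + w[2])
            err = target - out  # gradient term of the cross-entropy loss
            w[0] += lr * err * x1
            w[1] += lr * err * x2
            w[2] += lr * err
    return w

def predict(w, x1, x2):
    """Apply the trained model to an unknown instance."""
    return 1 if sigmoid(w[0] * x1 + w[1] * x2 + w[2]) > 0.5 else 0

# Toy training set: the (linearly separable) AND function of two binary inputs.
train = [(0, 0, 0), (0, 1, 0), (1, 0, 0), (1, 1, 1)]
model = train_network(train)
print([predict(model, x1, x2) for x1, x2, _ in train])  # -> [0, 0, 0, 1]
```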
Among the above-described classification techniques, the most commonly used are decision
trees. Compared to a neural network or a Bayesian classifier, decision trees are more easily
interpreted and comprehended by humans [RS98]. Training a neural network can take a long
time and thousands of iterations, making it unsuitable for large data sets. Moreover, decision
tree generation is based only on the information already contained in the training data set, in
contrast to other methods that require additional information (e.g. prior probabilities in the
Bayesian approach).
The above review of some of the most widely known classical classification methods shows
that relatively few efforts have been devoted to making data analysis techniques (i.e.
classification) handle uncertainty. These approaches produce a crisp classification decision:
an object either belongs to a class or does not, so all objects assigned to a class are
considered to belong to it equally, and the classes are considered non-overlapping. There is
clearly no notion of uncertainty representation in these methods, although exploiting and
revealing uncertainty is recognized as an important issue in the data mining research area
[GMPS96]. For this reason, the interest of the research community has turned to this
problem, and new classification approaches handling uncertainty have recently been
proposed in the literature.
An approach to pattern classification based on fuzzy logic is presented in [Chiu97]. The
main idea is the extraction of fuzzy rules for identifying each class of data. The rule
extraction methods are based on estimating clusters in the data; each cluster obtained
corresponds to a fuzzy rule relating a region of the input space to an output class. Thus, for
each class ci a cluster center xi is defined, providing the rule: if {input is near xi} then class
is ci. For a given input vector x, the system computes the degree of fulfilment of each rule,
and the consequent of the rule with the highest degree of fulfilment is selected as the output
of the fuzzy system. Consequently, the approach uses fuzzy logic to find the best class for a
data value, but the final result is still the assignment of each data item to one of the classes.
Moreover, it is possible to compute the classification error for a data sample x belonging to
some class c, using the degree of fulfilment among all rules that assign x to class c and the
degree of fulfilment among all rules that do not. This error measure indicates how well the
defined rules classify the data.
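The "if input is near xi then class is ci" scheme can be sketched as follows. The cluster centres and the Gaussian form of the "near" membership are hypothetical illustration choices, not taken from [Chiu97]:

```python
import math

# Hypothetical cluster centres extracted per class: each centre yields one
# fuzzy rule of the form "if input is near xi then class is ci".
rule_centres = {"c1": (0.0, 0.0), "c2": (5.0, 5.0)}

def fulfilment(x, centre, sigma=1.0):
    """Degree of fulfilment of a rule: Gaussian membership in 'near centre'."""
    d2 = sum((a - b) ** 2 for a, b in zip(x, centre))
    return math.exp(-d2 / (2 * sigma ** 2))

def classify(x):
    """Fire all rules; the consequent of the rule with the highest degree of
    fulfilment becomes the crisp output class."""
    return max(rule_centres, key=lambda ci: fulfilment(x, rule_centres[ci]))

print(classify((0.4, 0.2)))  # -> c1
```

Although membership degrees are fuzzy, the final `max` step makes the output crisp, exactly as noted above.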
In [Jan+98], an approach based on fuzzy decision trees is presented, aiming at uncertainty
handling. It combines symbolic decision trees with fuzzy logic concepts so as to enhance
decision trees with the additional flexibility offered by fuzzy representations. More
specifically, the authors propose a procedure that builds a fuzzy decision tree using the
classical decision tree algorithm (ID3) while adapting the norms used in fuzzy logic to
represent uncertainty [Jan+98]. The tree-building procedure is therefore the same as that of
ID3; the difference is that a training example can be partially classified into several tree
nodes. Thus, each data instance can belong to one or more nodes with a different membership
degree, calculated from the restrictions along the path from the root to the specific node.
However, following decision tree methodology, the classification inferences are crisp: to
determine the classification assigned to a sample, we find the leaves whose restrictions are
satisfied by the sample and combine their decisions into a single crisp response.
Induction of classification rules. The knowledge produced during the classification process
can be extracted and represented in the form of rules. As discussed in the previous section,
there are classification approaches, such as [Jan+98], that result in a set of rules describing
the classification patterns in a data set. Another common classification approach is decision
trees, in which the extracted patterns of knowledge are described in the form of a tree.
However, rules are easier for humans to understand, particularly when the tree is very large.
In the sequel we briefly describe an approach for converting a decision tree into a set of
classification rules.
Given a decision tree, a rule can be created for each path from the root to a leaf node; that
is, each leaf generates one rule. The conditions along the path to the leaf form the
conjunctive antecedent, and the class prediction held by the leaf node forms the consequent.
Example
Table 1. Training dataset

Example  AGE      COMPETITION  TYPE      PROFIT
1        old      yes          software  down
2        old      no           software  down
3        old      no           hardware  down
4        midlife  yes          software  down
5        midlife  yes          hardware  down
6        midlife  no           hardware  up
7        midlife  no           software  up
8        young    yes          software  up
9        young    no           hardware  up
10       young    no           software  up
Figure 2. The decision tree defined for the dataset of Table 1

    Age
    |- Young  -> Profit_up
    |- Old    -> Profit_down
    |- Middle -> Competition
                 |- Yes -> Profit_down
                 |- No  -> Profit_up
Table 1 presents the data set used for training a decision tree. The decision tree that classifies
our data into the classes Profit_up and Profit_down is presented in Figure 2.
The knowledge represented in the decision tree of Figure 2 can be described in the form of
rules as follows:
IF Age="Young" THEN Profit="Up"
IF Age="Old" THEN Profit="Down"
IF Age="Middle" AND Competition="Yes" THEN Profit="Down"
IF Age="Middle" AND Competition="No" THEN Profit="Up"
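The conversion described above can be sketched in a few lines. The tree of Figure 2 is encoded here as a nested dictionary, which is an illustrative representation chosen for this sketch, not a standard format.

```python
# A sketch of the tree-to-rules conversion: walk every root-to-leaf path,
# collecting the branch conditions as the conjunctive antecedent; the leaf's
# class prediction becomes the consequent.

tree = {"Age": {"Young": "Up",
                "Old": "Down",
                "Middle": {"Competition": {"Yes": "Down", "No": "Up"}}}}

def tree_to_rules(node, conditions=()):
    if isinstance(node, str):                      # leaf: emit one rule
        antecedent = " AND ".join('%s="%s"' % c for c in conditions)
        return ['IF %s THEN Profit="%s"' % (antecedent, node)]
    rules = []
    (attribute, branches), = node.items()
    for value, child in branches.items():          # one path per branch
        rules += tree_to_rules(child, conditions + ((attribute, value),))
    return rules

for rule in tree_to_rules(tree):
    print(rule)
```

Running the sketch on the Figure 2 tree reproduces the four rules listed above, one per leaf.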
Classification application fields. As we have already discussed, classification is a form of
data analysis that can be used to define models describing data classes or to predict data
trends. It can be used for making intelligent decisions in business and science. For example, a
classification model may be built to categorize customers of computer equipment based on
their income and occupation, or, given a database of patients' diagnostic results, a set of
classification rules can be extracted to identify patients as having excellent or fair health
progress.
2.1.1.3 Association Rules
Mining rules is one of the main tasks in the data mining process. It has attracted considerable
interest because a rule provides a concise statement of potentially useful information that is
easily understood by end-users.
Association rules reveal underlying interactions between the attributes in a data set. These
interactions can be presented in the following form: A → B, where A, B refer to sets of
attributes' values in the underlying data. More specifically, A and B are selected so as to be
frequent itemsets. The following is a formal statement of the problem as given in [AS94]:
Let I = {i1, i2, …, im} be a set of literals, called items. Let D be a set of transactions, where
each transaction T is a set of items such that T ⊆ I. We say that a transaction T contains
X, a set of some items in I, if X ⊆ T. An association rule is an implication of the form
A ⇒ B, where A ⊂ I, B ⊂ I and A ∩ B = ∅.
The rule A ⇒ B has confidence c in the transaction set D if c% of the transactions in D
that contain A also contain B, and support s if s% of the transactions in D contain A ∪ B.
Given a set of transactions D, generate all association rules that have support and
confidence greater than the corresponding user-specified thresholds.
The intuitive meaning of such a rule is that records in the dataset which contain the attributes
in A tend also to contain the attributes in B [SA95]. We note also that the extracted rules
have to satisfy some user-defined thresholds related to association rule measures (such as
support, confidence, leverage, lift).
These measures give an indication of an association rule's importance and confidence. They
represent the predictive advantage of a rule and thus help to identify interesting patterns of
knowledge in the data and to make decisions.
Association rule interestingness measures. Consider an association rule, denoted as
LHS → RHS. In what follows we refer to the left-hand side and the right-hand side of the rule
as LHS and RHS, respectively. The measures related to the rule are [HK01, BL97]:
- Strength
The strength of an association rule is the proportion of the cases covered by the LHS of
the rule that are also covered by the RHS:
strength = n(RHS ∩ LHS) / n(LHS), taking values in [0,1],
where n(LHS) denotes the number of cases covered by the LHS.
The rule strength is also referred to as confidence.
A value of strength near 1 is an indication of an important association rule.
- Coverage
The coverage of an association rule is the proportion of cases in the data that have the
attribute values or items specified in the LHS of the rule:
coverage = n(LHS) / N = P(LHS), taking values in [0,1],
where N is the total number of cases under consideration.
An association rule with a coverage value near 1 can be considered an important
association rule.
- Support
The support of an association rule is the proportion of all cases in the dataset that satisfy
the rule, i.e., both the LHS and the RHS of the rule. More specifically, support is defined as
support = n(LHS ∩ RHS) / N,
where N is the total number of cases under consideration and n(LHS ∩ RHS) denotes the
number of cases covered by LHS ∩ RHS.
Support can be considered an indication of how often a rule occurs in a data set and, as a
consequence, of how significant the rule is.
The above-discussed interestingness measures, support and confidence, are widely used in
the association rule extraction process and are also known as Agrawal and Srikant's
itemset measures. From their definitions, we could say that confidence corresponds to the
strength of a rule, while support corresponds to its statistical significance.
- Leverage
The leverage of an association rule is the proportion of additional cases covered by both
the LHS and RHS above those expected if the LHS and RHS were independent of each
other. This is a measure of the importance of the association that includes both the
strength and the coverage of the rule. More specifically, it is defined as
leverage = P(RHS ∩ LHS) − P(LHS) · P(RHS), taking values in [-1,1].
Values of leverage equal to or below 0 indicate that LHS and RHS co-occur no more often
than expected under independence. On the other hand, values of leverage near 1 are an
indication of an important association rule.
- Lift
The lift of an association rule is the strength divided by the proportion of all cases that are
covered by the RHS. This is a measure of the importance of the association and it is
independent of coverage:
lift = strength / P(RHS),
taking values in ℜ+ (the space of the positive real numbers).
As for the values of lift, there are some conditions to be considered:
1. Lift values close to 1 mean that RHS and LHS are independent, which indicates
that the rule is not important.
2. Lift values close to +∞. Here, we have the following sub-cases:
• RHS ⊆ LHS or LHS ⊆ RHS. If either of these cases holds, we may
conclude that the rule is not important.
• P(RHS) is close to 0 or P(RHS|LHS) is close to 1. The first case
indicates that the rule is not important. On the other hand, the second case
is a good indication that the rule is an interesting one.
3. lift = 0 means that P(RHS|LHS) = 0 ⇔ P(RHS ∩ LHS) = 0, which indicates that
the rule is not important.
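All the measures above can be computed directly from transaction counts. A minimal sketch follows; the rule representation (a pair of item sets) and the toy market-basket transactions are hypothetical.

```python
# Computes strength (confidence), coverage, support, leverage and lift for a
# rule LHS -> RHS over a list of transactions (each transaction is a set).

def measures(transactions, lhs, rhs):
    N = len(transactions)
    n_lhs = sum(1 for t in transactions if lhs <= t)
    n_rhs = sum(1 for t in transactions if rhs <= t)
    n_both = sum(1 for t in transactions if lhs <= t and rhs <= t)
    strength = n_both / n_lhs          # a.k.a. confidence
    coverage = n_lhs / N               # P(LHS)
    support = n_both / N               # P(LHS and RHS)
    leverage = support - coverage * (n_rhs / N)
    lift = strength / (n_rhs / N)
    return strength, coverage, support, leverage, lift

# Hypothetical market-basket data.
T = [{"bread", "milk"}, {"bread", "milk", "beer"}, {"bread"}, {"milk"}]
print(measures(T, {"bread"}, {"milk"}))
```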
Mining association rules. The process of mining association rules is based on two steps
[HK01]:
• Find all frequent itemsets. An itemset (a set of items) that occurs at least as frequently as
a pre-determined minimum support is a frequent itemset.
• Generate strong association rules from the frequent itemsets. These rules must satisfy
minimum support and minimum confidence.
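The two steps can be sketched with a naive level-wise algorithm in the spirit of Apriori [AS94]. This is a didactic sketch rather than an efficient implementation, and the transactions are hypothetical.

```python
from itertools import combinations

# Step 1: find frequent itemsets by level-wise candidate generation.
# Step 2: derive rules meeting the minimum confidence from each frequent set.

def apriori(transactions, min_support, min_confidence):
    N = len(transactions)
    support = {}                                   # itemset -> relative support
    items = {i for t in transactions for i in t}
    level = [frozenset([i]) for i in items]
    while level:
        next_level = set()
        for candidate in level:
            s = sum(1 for t in transactions if candidate <= t) / N
            if s >= min_support:
                support[candidate] = s
                for i in items - candidate:        # grow frequent sets by one item
                    next_level.add(candidate | {i})
        level = next_level
    rules = []
    for itemset, s in support.items():
        for r in range(1, len(itemset)):
            for lhs in map(frozenset, combinations(itemset, r)):
                conf = s / support[lhs]            # subsets of frequent sets are frequent
                if conf >= min_confidence:
                    rules.append((set(lhs), set(itemset - lhs), s, conf))
    return rules

T = [{"bread", "milk"}, {"bread", "milk", "beer"}, {"bread", "milk"}, {"beer"}]
for lhs, rhs, s, conf in apriori(T, min_support=0.5, min_confidence=0.8):
    print(lhs, "->", rhs, "support=%.2f confidence=%.2f" % (s, conf))
```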
Association Rules Applications. A typical application of association rule mining is market
basket analysis. This process analyses customers' buying habits by finding associations
between the different items that customers place in their "shopping baskets". The discovery of
such associations can help retailers develop marketing strategies by finding which items are
frequently purchased together by customers. In general terms, the discovery of interesting
association relationships among huge amounts of business or scientific database records can
support decision-making processes by providing insight into the relationships among
database items.
2.1.1.4 Sequential Patterns - Time Series Analysis
Sequential pattern mining is the mining of frequently occurring patterns related to time or
other sequences. Most studies of sequential pattern mining concentrate on symbolic patterns.
The problem of mining sequential patterns can be stated as follows:
Given a potentially large pattern (string) S, we are interested in sequential patterns of the
form a → b, where a, b are substrings of S, such that the frequency of ab is not less
than some minimum support and the probability that a is immediately followed by b is
not less than a minimum confidence.
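For a character string S, the statement above can be sketched directly: the support of a → b is the frequency of the substring ab, and the confidence is the probability that an occurrence of a is immediately followed by b. The string and substrings below are hypothetical.

```python
# A sketch of the problem statement for a character string S.

def sequential_rule(S, a, b):
    n_a = S.count(a)           # occurrences of a (non-overlapping)
    n_ab = S.count(a + b)      # occurrences of a immediately followed by b
    confidence = n_ab / n_a if n_a else 0.0
    return n_ab, confidence

S = "abcabdab"
freq, conf = sequential_rule(S, "ab", "c")
print(freq, conf)   # "ab" occurs 3 times and is followed by "c" once
```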
Also, the user can specify constraints on the kinds of sequential patterns to be mined by
providing pattern templates in the form of serial episodes, parallel episodes, or regular
expressions [HK01]. A serial episode is a set of events that occur in a total order, whereas a
parallel episode is a set of events whose occurrence ordering is not important. For instance,
the sequence A → B is a serial episode implying that event B follows event A, while
A&B is a parallel episode indicating that events A and B occur in our data but their
ordering is not important.
The user can also specify constraints in the form of regular expressions. For example, the
template (A|B)C*(D|E) indicates that the user would like to find patterns where events A and
B occur first but their relative ordering is not important, followed by the event C, followed
by the events D and E (D can occur before or after E).
Sequential patterns applications. In daily and scientific life, sequential data are available and
used everywhere. Some representative examples are text, music notes, weather data, satellite
data streams, business transactions, telecommunications records, experimental runs, DNA
sequences and histories of medical records. Discovering sequential patterns can benefit the
user or scientist by predicting coming activities, interpreting recurring phenomena or
extracting similarities.
2.2. Signal Processing: Content-based Music Retrieval
2.2.1 Problem definition
Large-scale storage of sound and music has only become possible in the last decade. In
addition, the new possibility for wide-area distribution of multimedia over the Internet has
given rise to new requirements for flexible and powerful databases for musical and audio data.
Earlier systems relied on a single kind of data representation (e.g. MIDI scores or sampled
sound effects) and were mainly oriented towards the needs of music librarians. However,
recent technologies, such as progress in networking transmission, compression of audio and
protection of digital data have made possible the quick and safe delivery of music through
networks (Internet or digital-audio broadcasting services). These techniques have given users
access to huge catalogues of annotated music. Although the above technologies have
addressed the distribution problem, they have also raised the issue of choosing from massive
amounts of data [PAC00]. In order to estimate the complexity of music
selection it is useful to consider a variety of queries that users are expected to form while
interacting with a music delivery system. The following examples of user queries are likely
to be encountered:
− “Show me a list of Irish Folk songs available by content provider X”
− “I want to browse through Bach Fugues recorded in C minor and performed with a
clavichord”
− “I have recorded 5 seconds of audio from a radio transmission and saved in file yyy.wav,
but I cannot tell what the title of the song is. Could you give me a hint?”
− “Are there any harpsichord recordings available?”
− “Can I listen to the chorus line of song X?”
It is clear that the aforementioned queries address information hidden in the content of the
music signal and raise the following challenges related to content-based music retrieval:
instrument recognition, melody spotting, musical key extraction, musical pattern recognition,
composer recognition, music structure extraction and music segmentation, to name but a few.
In an attempt to imitate the human cognitive system, a variety of signal processing solutions
have been developed to address the challenges imposed by music recognition and
understanding.
2.2.2 A survey of existing research efforts
− Several researchers have addressed the problem of feature extraction from stored pieces
of music. Feature selection is a topic of ongoing research and various solutions have been
proposed: [HER02] has performed a comparative study of features including Mel-
Frequency Cepstrum Coefficients (MFCC’s), attack-related descriptors, decay-related
descriptors and relative energy descriptors for the classification of drum sounds. [ERR01]
has compared features for musical instrument recognition. [AGO01] has performed a
study of spectral features to classify musical instrument timbres. It turns out that for a
system to be able to support several kinds of queries, it should be able to extract a wide
range of different kinds of features from the data loaded in the database.
− Stochastic modelling and dynamic time warping techniques have also been proposed for
the comparison of music signals. For example, Hidden Markov Models have been used to
build statistical representations of musical pieces. Music similarity can then be reduced to
probability extraction based on these models [PIK01], [PIK02].
− Various researchers have also tried to achieve automatic extraction of the structure of
music recordings in an attempt to prove the assumption that music similarity may be
considered, in part, as a comparison of musical structure. Music is, indeed, often described
(at a high level of abstraction) in terms of the structure of repeated patterns. Discovering
musical structure in audio recordings has been addressed by [DAN02], [CON02],
[MER01], [MAR01], etc.
The above list, although not exhaustive, suggests that music retrieval should account for a
variety of music representations and therefore makes it necessary to enhance database
functionality to account for the automated extraction and comparison of such representations.
2.2.3 Patterns in music retrieval
Following the terminology introduced in Section 2.1, content-based music retrieval can be
considered as a data mining process that extracts patterns of interest from a music corpus and
stores these patterns as new knowledge in the knowledge base. Patterns stemming from raw
audio signals can be summarized into the following types:
2.2.3.1 Musical structure
Music is often described in terms of the structure of repeated phrases. For example, many
songs have the form AABA, where each letter represents an instance of a phrase. In this case,
the sequence AABA is the pattern of interest. Constructing descriptions of music in this form
requires the automated discovery of recurrent patterns by means of pattern extraction
algorithms. Recurrent pattern finding can be thought of as a case of data mining. In addition,
the discovery of patterns that occur across many pieces of the same musical style or composer
can yield signature motifs (structure classes) that can be used for the classification of new
pieces of music in a specific musical style or composer [TZA01]. Following the terminology
introduced in Section 2.1.1.2, identifying the musical style and/or composer of a recording
may be considered as a classification problem, i.e., the structure of the recording in question
is classified to one of the available signature motifs (classes) using an appropriate method
(e.g., Bayesian classification, neural networks or decision trees). Alternatively, Hidden
Markov Models (HMMs) may be employed in order to determine structure similarity. To this
end, an HMM is built per class and is trained with structures belonging to that class.
Classification is subsequently achieved by presenting the musical structure of the recording in
question to the set of HMMs. Each HMM generates a recognition probability and the
unknown recording is classified according to the highest probability [PIK02]. Building an
HMM is preferably a supervised procedure [TK98]. Training the HMMs should be repeated
occasionally to account for new instances of the class.
2.2.3.2 Single feature vectors representing entire musical recordings
In this case the feature selection process follows the guidelines given in Section 2.1.1.1. Once
appropriate features have been selected, each recording can be represented by a feature vector
(pattern). Such feature vectors are multi-dimensional and usually consist of a combination of
Mel-frequency cepstrum coefficients, average energy descriptors, zero-crossing rate envelopes,
etc. Clustering methods may be subsequently applied depending on the user queries that need
to be addressed. Attention must be given to the fact that music signals have a time-varying
nature and as a result, a single feature vector can only estimate the average behaviour of the
signal. However, this form of representation can be useful when dealing with short music
segments, e.g., classification of drum sounds [HER02].
2.2.3.3 Sequences of possibly multi-dimensional feature vectors
The limitations inherent in the “single feature vector approach” can be overcome if a
sequence of feature vectors is extracted from the music signal by means of a moving window
technique [REI01]. Moving window techniques break the signal into overlapping frames and
a possibly multi-dimensional feature vector is extracted from each frame. In this case, the
sequence of feature vectors can be considered as a pattern. Single-dimensional feature vectors
may also be extracted. Such is the case, for example, when a sequence of music intervals is
generated from a piece of monophonic music. When comparing feature sequences,
conventional classification methods like Bayesian classification are not adequate. Therefore,
one has to resort to other techniques, such as Dynamic Time Warping (DTW) and Hidden
Markov Models (HMMs), in order to determine similarity [TK98], [DEL87]. The following
two sections describe DTW and HMMs in brief.
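Before turning to those methods, the moving-window extraction itself can be sketched. The per-frame features shown (average energy and zero-crossing rate) are two of the descriptors mentioned above; the frame parameters and the toy signal are hypothetical.

```python
# A sketch of the moving-window technique: the signal is cut into overlapping
# frames and a small feature vector is extracted from each frame, yielding a
# sequence of feature vectors.

def feature_sequence(signal, frame_len, hop):
    features = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        energy = sum(x * x for x in frame) / frame_len      # average energy
        zcr = sum(1 for i in range(1, frame_len)
                  if frame[i - 1] * frame[i] < 0) / frame_len  # zero crossings
        features.append((energy, zcr))
    return features

# Hypothetical toy signal (a real system would use sampled audio).
signal = [0.0, 0.5, -0.5, 0.5, -0.5, 0.0, 0.1, 0.1]
print(feature_sequence(signal, frame_len=4, hop=2))
```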
2.2.3.4 Dynamic Time Warping (DTW)
Dynamic Time Warping (DTW) is fundamentally a feature-matching scheme that inherently
accomplishes “time alignment” of feature sequences through a Dynamic Programming (DP)
procedure. By time alignment we mean the process by which temporal regions of one feature
sequence are matched with appropriate regions of another feature sequence. Because one
feature sequence is “warped” (stretched or compressed in time) to fit the other and because
dynamic programming is used to accomplish the task, this approach is referred to as dynamic
time warping. The feature sequence to be classified is matched against a set of reference
patterns, one reference pattern per class [PIK01]. The choice of reference patterns is usually
user-driven, i.e. the user decides which pattern is likely to serve as a reference pattern.
Exhaustive methods have also been investigated in order to automatically extract reference
patterns from a class [DEL87]. DTW has its origins in Speech Recognition and has been
employed with success by early isolated word recognition systems.
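The DP recurrence underlying DTW can be sketched as follows. The one-dimensional features and reference patterns in the example are hypothetical; real systems compare multi-dimensional feature vectors with a suitable distance.

```python
# A minimal DTW sketch: the classic dynamic-programming recurrence computes
# the cost of the best time alignment between two feature sequences; the
# sequence is then assigned to the class of the nearest reference pattern.

def dtw(a, b, dist=lambda x, y: abs(x - y)):
    INF = float("inf")
    D = [[INF] * (len(b) + 1) for _ in range(len(a) + 1)]
    D[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            # Local cost plus the cheapest of the three allowed predecessors
            # (match, stretch a, stretch b).
            D[i][j] = dist(a[i - 1], b[j - 1]) + min(D[i - 1][j - 1],
                                                     D[i - 1][j],
                                                     D[i][j - 1])
    return D[len(a)][len(b)]

# Hypothetical one-dimensional feature sequences: one reference per class.
references = {"class1": [1, 2, 3, 2, 1], "class2": [5, 5, 5, 5]}
query = [1, 2, 2, 3, 2, 1]                     # a time-stretched class1
best = min(references, key=lambda c: dtw(query, references[c]))
print(best)  # class1
```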
2.2.3.5 Stochastic modeling: Hidden Markov Models
An alternative to DTW is stochastic modelling by means of Hidden Markov Models. HMMs
are “stochastic finite state automata”. Each class of patterns is associated with a
corresponding HMM. Two fundamental problems arise when dealing with HMMs:
− Given a series of patterns belonging to the same class, how do we train an HMM to
represent the class? This is the HMM training problem.
− Given a trained HMM, how do we find the likelihood with which it generates an
incoming pattern? This is the recognition problem.
To solve both problems, various approaches have been proposed, the most popular being
Baum-Welch and Viterbi training [BW70], [VIT67].
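The recognition problem can be illustrated with the forward algorithm. The two-state model below is a hypothetical toy with discrete emissions, not a trained music model.

```python
# Given a (hypothetical) trained HMM with initial probabilities pi, transition
# matrix A and discrete emission matrix B, the forward algorithm computes the
# likelihood that the model generates an observation sequence.

def forward_likelihood(pi, A, B, observations):
    n_states = len(pi)
    # Initialization with the first observation.
    alpha = [pi[s] * B[s][observations[0]] for s in range(n_states)]
    # Induction: propagate the forward probabilities through the chain.
    for obs in observations[1:]:
        alpha = [B[s][obs] * sum(alpha[r] * A[r][s] for r in range(n_states))
                 for s in range(n_states)]
    return sum(alpha)

# Toy two-state model over a two-symbol alphabet (0 and 1).
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
print(forward_likelihood(pi, A, B, [0, 0, 1]))
```

In a classification setting, one such likelihood would be computed per class HMM and the highest would decide the class, as described above.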
Depending on the selected features, an HMM is expected to act as a means of music content
modelling. For example, if feature sequences correspond to sequences of music intervals
extracted from monophonic melodies of the same clarinet player, an HMM can be trained
with these sequences and can serve as a signature for that particular clarinet player.
Subsequently, it becomes feasible to determine the probability with which a monophonic
melody has its origins in a particular instrument player. Following this line of thinking,
HMMs may also be trained to serve as signatures of the identity of composers, musical
traditions, etc.
2.3. Patterns in Information Retrieval
Another research field where patterns are apparent is that of Information Retrieval. In a
retrieval setting we have a collection of discourse material, also called a corpus, and users
submit queries to the system in order to retrieve information that suits their interests. Queries
are often vaguely defined, in contrast to traditional database systems, due to the lack of a
query language or algebra. A query consisting of only a few words does not always reflect the
user's actual interest; therefore users often experience frustration with a retrieval system.
Latent Semantic Indexing (LSI) [DDF+90, BDO95], a retrieval model that unveils patterns in
term usage, seems to produce more effective information retrieval.
Assuming a corpus of n documents, each of which is indexed by m suitably chosen index
terms, the entire corpus can be represented as an m×n matrix A, each column of A
representing the respective text. The entry A(i,j) of the matrix corresponds to a measure of the
frequency of occurrence of the i-th index term in the j-th text of the collection.
In the traditional Vector-Space model [SL68], each index term is considered as a base vector
in an m-dimensional space U. Hence, each text can be considered as a linear combination of
index terms and as such is represented by a vector in U. Queries are also treated as linear
combinations of index terms and projected into U. The objective of the retrieval system is to
return the documents most closely related, i.e. most similar, to the query. In the vector space
approach the measure of similarity between a query q and a document d is simply the cosine
cos(q, d) of their vector representations. Thus, the documents of interest form the set
D = {d | cos(q, d) ≥ t}, where t is a given threshold.
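The cosine-threshold retrieval just described can be sketched as follows. The three-term index and the term-frequency vectors are hypothetical.

```python
import math

# Vector-space retrieval sketch: documents and the query are term-frequency
# vectors over the same index terms; documents whose cosine similarity to the
# query reaches the threshold t are returned.

def cos(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def retrieve(docs, query, t):
    return [name for name, vec in docs.items() if cos(query, vec) >= t]

# Hypothetical 3-term index (term frequencies per document).
docs = {"d1": [2, 1, 0], "d2": [0, 0, 3], "d3": [1, 1, 0]}
query = [1, 1, 0]
print(retrieve(docs, query, t=0.9))  # d1 and d3 share the query's terms
```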
The key assumption in the vector space model is the orthogonality of index terms, i.e. no two
index terms are correlated in their appearance within a text. This is often not the case when
human-generated discourse is involved. People frequently use distinct words to describe the
same concept, giving rise to a phenomenon called “synonymy”. An example of synonymy is
the pair of terms “car” and “automobile”, which both describe a passenger vehicle. In
response to a query containing just the term “car”, texts containing “automobile” may be
overlooked. Another issue is that a term can have several interpretations, according to the
context of its use. For example, when referring to the word “jaguar”, it is not clear without
knowledge of the context whether the car manufacturer or the animal is meant. As an
undesired side effect of polysemy, documents about cars may appear after a query that
actually seeks to retrieve information about jaguars.
Identifying “patterns” in term usage seems to remedy the aforementioned problems. In
essence, one seeks an efficient mechanism to identify text clusters. In such a cluster,
synonymous terms, although distinct, share similar usage characteristics, i.e. they are used in
similar contexts. On the other hand, a polysemous term is expected to have distinct usage
patterns for each distinct meaning, so a text referring to a jaguar in a zoo will appear in a
different cluster than a text about the Jaguar X-type. This partition into clusters can be
achieved if the documents are represented as linear combinations of entities richer in
semantics than the plain index terms. These entities are the topics addressed by texts; hence a
text is a combination of such topics. This methodology is called Semantic Indexing.
The LSI method is a type of Semantic Indexing that uses a well-known algebraic
decomposition, the Singular Value Decomposition (SVD) [GL96], applied to the term-
document matrix A. At a high level, LSI projects the original vector space spanned by the
columns of A to a low-dimensional “semantic” space. This semantic space results from
keeping only the important correlations of indexing terms, as those appear in the spectral
structure of the matrix. In this low-dimensional space, semantically correlated texts appear
close together, whereas this is not true for unrelated texts sharing polysemous terms. The
ability to project queries, too, into the induced space, and then retrieve all documents
projected close to the query vector in terms of the cosine measure, yields increased
effectiveness in Information Retrieval.
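The LSI projection can be sketched with a few lines of linear algebra. The 4×3 term-document matrix below is hypothetical, and the choice k = 2 is arbitrary.

```python
import numpy as np

# LSI sketch: decompose the m x n term-document matrix A with the SVD, keep
# only the k largest singular values, and compare documents and queries in
# the resulting k-dimensional "semantic" space.

def lsi(A, k):
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]          # rank-k spectral structure

def project_query(q, Uk, sk):
    return (q @ Uk) / sk                        # fold the query into the space

# Hypothetical 4-term x 3-document matrix.
A = np.array([[1., 1., 0.],
              [1., 0., 0.],
              [0., 1., 1.],
              [0., 0., 1.]])
Uk, sk, Vtk = lsi(A, k=2)
q = np.array([1., 1., 0., 0.])                 # query on the first two terms
q_hat = project_query(q, Uk, sk)
doc_vectors = Vtk.T                             # documents in the semantic space
sims = doc_vectors @ q_hat / (np.linalg.norm(doc_vectors, axis=1)
                              * np.linalg.norm(q_hat))
print(np.argmax(sims))  # index of the document most similar to the query
```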
Some of the reported applications of LSI support the claimed ability of the method to
discover important patterns in word usage. Dumais et al. [DLLL97] tested LSI on parallel
bilingual corpora (English and French). They found that they were able to retrieve relevant
texts in both languages, even though the query was formulated in only one of them.
Nielsen et al. [NPD94] used texts with spelling errors, produced by OCR software. In the
experiments they conducted, they showed that the retrieval of misspelled documents was
almost as effective as the retrieval of error-free documents, in both cases using LSI. This is a
consequence of the fact that LSI exploits correlations among terms, without caring much
about the actual content.
2.4. Mathematics
Mathematics is the science of patterns. Not only do patterns take many forms over the range
of school mathematics, they are also a unifying theme. Number patterns, such as 3, 6, 9, 12,
etc., are familiar to us, since they are among the patterns we first learn as young students. As
we advance, we experience number patterns again through the central concept of functions in
mathematics. But patterns are much broader. They can be sequential [AS95], spatial [ST99],
temporal [DLMRS98] [SB98], and even linguistic [FFKLLRSZ98] [LAS97].
The various mathematical patterns can be summarized to the following categories:
2.4.1 Number Patterns
We frequently come upon numbers with special characteristics. In the science of mathematics
a pattern is usually the rule or the constraint which several items satisfy. Thus we can clearly
distinguish the pattern, which is a constraint, from the pattern instance, which is the set of
items (in this case numbers) that share the common repeated characteristic, i.e., that satisfy
the constraint. Sometimes it is very useful to collect number patterns so as to be able to learn
the behaviour of numbers collected, for example, from telecommunications data [BCH00] or
from equation solving [SB99]. A variety of number patterns exists, a brief survey of which
follows:
Prime Numbers: 2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, …
Composite numbers: prime factorization
Figurate numbers:
I. Triangular numbers
II. Square numbers (for example 25 = 5², 36 = 6², 49 = 7², etc.)
Fibonacci numbers: 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, ...
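The number patterns listed above can also be generated programmatically; a small sketch producing the first members of each family:

```python
# Generators for three of the number patterns listed above.

def primes(n):
    found = []
    candidate = 2
    while len(found) < n:
        # Trial division by the primes found so far.
        if all(candidate % p for p in found):
            found.append(candidate)
        candidate += 1
    return found

def triangular(n):                 # 1, 3, 6, 10, ... (k-th term = k(k+1)/2)
    return [k * (k + 1) // 2 for k in range(1, n + 1)]

def fibonacci(n):
    seq, a, b = [], 1, 1
    for _ in range(n):
        seq.append(a)
        a, b = b, a + b
    return seq

print(primes(8))       # [2, 3, 5, 7, 11, 13, 17, 19]
print(triangular(5))   # [1, 3, 6, 10, 15]
print(fibonacci(8))    # [1, 1, 2, 3, 5, 8, 13, 21]
```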
2.4.2 Patterns in graphs
Every graph results from the graphical depiction of a function or, more generally, of an
equation solved for one of its variables. Many equations, for example linear ones, exhibit
similar behaviour, which leads us to conclude that this similarity carries over to their
graphical depictions as well. Almost every graph belongs to a graph pattern, meaning that
each graph has some attributes that force it to obey a specific behaviour as one or more of its
variables grow or shrink in value. For example, we can easily distinguish some graph
patterns, like the following:
linear graphs
i. x+y=7
ii. 2x + 3 = y
non-linear (parabola, circle [Sch97], ellipse)
For example, Figure 3 depicts non-linear graphs, each presenting a pattern of its own.
2.4.3 Patterns in shapes
We can also find patterns in shapes. The concept in this case is that the majority of shapes
follow a similarity, for example in the number of vertices of which a shape consists. So,
patterns could be triangles (which might be similar), squares, polygons with n angles, etc.
The existence of such patterns can help us in recognizing familiar shapes in image processing
[NPP00] or even in reconstructing polygonal images [CN95].
2.4.4 Patterns in infinite sequences
Infinite sequences can be the pattern instance of a mathematical constraint, which in this case
is the pattern itself; for example, the infinite Fibonacci series, Taylor series, prime numbers,
etc. In this case we can also add the infinite series of digits of numbers like e or π. Such
patterns can prove really helpful in characterizing the behaviour of an individual, system or
network in terms of temporal sequences or discrete data [LB98].

Figure 3. Non-linear graphs
2.4.5 Patterns in algebra
What is the difference between the arithmetic 3+5 = 5+3 and the algebraic a+b = b+a? One is
a specific fact; the other is a pattern valid in a multitude of situations. While arithmetic may
hint at some regularities, algebra, as a language, gives expression to our acknowledgement of
patterns as such. How did people express general ideas before the advent of algebra in the
15th-17th centuries? Diophantus of Alexandria (c. 250 A.D.) is credited with the invention of
syncopated (shorthand) notations. Before that, it was geometry.
Algebra unites patterns and quantities in patterns with the means of describing change
through the use of variables and functions. Its concepts and analytical methods allow people
to consider general solutions to problems with common characteristics and develop related
formulas. Algebra provides verbal, symbolic and graphical formats for discussing and
representing settings as diverse as the pricing patterns of merchandise in a store, the behavior
of a car as it accelerates or slows down, the changes in two chemicals as they react with one
another, or the type of variation existing in a comparison of two factors in the economy. In
algebra, we could think of many patterns, like the way equations are solved, iterating
functions, etc. These patterns apply to fields like music [LK94], moving-object applications [G99], etc.
2.4.6 Patterns in Geometry
In geometry we come across geometric loci: sets of points in space (2D, 3D, etc.) that satisfy a specific premise or constraint. For example, all the points in 2D space that have the same distance (let's say r) from a certain point (x,y) form a circle with center C(x,y) and radius r; this circle constitutes a pattern instance of the above pattern. Applications of such patterns appear in computational geometry, pattern recognition and multimedia [VH99].
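As a minimal sketch of this locus-as-pattern view (the function name is ours, for illustration only), a point is a pattern instance of a circle if it satisfies the distance constraint:

```python
import math

def on_circle(p, center, r, tol=1e-9):
    """A point is an instance of the circle pattern if its distance
    from the center equals the radius r (the locus constraint)."""
    return math.isclose(math.dist(p, center), r, abs_tol=tol)

center, r = (0.0, 0.0), 5.0
assert on_circle((3.0, 4.0), center, r)      # 3-4-5 triangle: on the locus
assert not on_circle((3.0, 3.0), center, r)  # does not satisfy the constraint
```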
2.4.7 Patterns in Cryptography
Cryptography is one of the main subjects in which mathematics reigns supreme. In cryptography, every cryptographic system can be considered a pattern itself. For example, if we have many sets of raw data and to each set we apply a specific encryption, then these sets share the similarity of their encryption attribute. A very common example of a code which constitutes a pattern is the Morse code. If we were given a stream of characters, it would be interesting to find out whether this stream constitutes Morse code, i.e., whether it is a pattern instance of the code. We can find many such cases in cryptography, such as the Vigenère cipher, the Caesar cipher, the Gronsfeld cipher, etc. Such cryptographic patterns are widely used in word processors, electronic commerce systems, spreadsheets, databases and security systems [BRD99].
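As an illustrative sketch (ours, not from the report) of the Caesar cipher mentioned above, note how the fixed shift key is the shared "encryption attribute" of every ciphertext it produces:

```python
def caesar(text, shift):
    """Apply a Caesar cipher: every letter is shifted by the same key,
    so all ciphertexts produced with a given key share the same
    'encryption attribute' -- the pattern described above."""
    out = []
    for ch in text:
        if ch.isalpha():
            base = ord('A') if ch.isupper() else ord('a')
            out.append(chr((ord(ch) - base + shift) % 26 + base))
        else:
            out.append(ch)  # non-letters pass through unchanged
    return ''.join(out)

cipher = caesar("Pattern", 3)
print(cipher)  # Sdwwhuq
assert caesar(cipher, -3) == "Pattern"  # shifting back recovers the plaintext
```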
2.5. Visualization
Scientific, engineering, and environmental databases can contain very large amounts of raw
data collected automatically and continuously via sensors and monitoring systems. Even
simple transactions of everyday life, such as paying by credit card, result in large
multidimensional data sets. Information visualization and visual data mining techniques can
help to deal with the flood of data [Kei01a]. Finding the relevant information hidden in the
data is difficult without user interaction. For an effective data analysis, it is important to include
the user in the exploration process and combine the flexibility, creativity, and general
knowledge of the user with the enormous storage capacity and the computational power of the
computer. Visualization techniques integrate the human perceptual abilities into the data
exploration process for the analysis of large data sets.
The basic idea of visual data exploration is to present the data in some visual form which
allows the user to gain insight into the data [CMS99, Che99, Spe00, War00, Kei01b]. Visual
data exploration usually follows a three step process [Shn96]: (1) Overview, (2) zoom and
filter, and (3) details on demand. First, the user needs to get an overview of the raw data to
identify interesting patterns in the raw data. Visualization techniques are useful for showing
an overview of the raw data and detecting patterns [Kei00]. Patterns are groups of data points
in the visualization that represent potentially valuable information and provide new insights
for the user. A second step is the analysis of the discovered patterns. In this step, the user
needs to be able to zoom and filter the data in order to focus on one or more patterns
[AW95]. Finally, the user needs to drill-down to access the details of the data points
belonging to a discovered pattern [BSC96]. Visualization technology is used for interpreting
the patterns of interest. It is important to retain an overview of the data while focusing on the
interesting patterns using another visualization. A common technique is to distort the
overview visualization in order to focus on the interesting patterns [LA94].
Visual data exploration can be seen as a hypothesis generation process. The visualization
allows the user to identify patterns of interest or groups of related data points and gain insight
into the raw data. Visualization can also be used to analyse the patterns on different levels of
abstraction, which may result in adapting existing hypotheses or generating new hypotheses.
The verification of the hypothesis can also be done with the help of visualization techniques.
The advantage of visual data exploration is that the process is interactive: the user is directly involved in the exploration process and is repeatedly asked to make the important decisions that steer it.
3. Current issues in modeling Data Mining processes and results
In this section we address the current and evolving efforts on modeling data mining processes and their results. The survey is organized as follows. Section 3.1 introduces the work of the Data Mining Group and the specification of the Predictive Model Markup Language, a set of XML DTDs that can be used for describing common data mining techniques. Section 3.2 gives an overview of SQL/MM Part 6, a standard that has been developed under ISO. Section 3.3 gives an overview of the Common Warehouse Model and the Data Mining Metamodel, standards supported by the OMG. Section 3.4 highlights the efforts in developing the Java Data Mining API, a forthcoming standard developed by individual vendors. Section 3.5 gives an overview of the Oracle 9i Data Mining components, and Section 3.6 highlights PQL, a pattern query language developed by Information Discovery, Inc. Finally, Section 3.7 gives a summary and concludes the survey.
3.1. Data Mining Group / Predictive Model Markup Language [DMG]
3.1.1 Overview
The Data Mining Group (DMG) is an independent, vendor-led group which develops data mining standards, such as the Predictive Model Markup Language (PMML). PMML is a collection of XML Document Type Definitions (DTDs) that provide a uniform way for modeling data mining processes and results. In the next sections we will present the main
features of the PMML DTDs. The current members of DMG are:
• Angoss Software Corp. Toronto, CAN
• IBM Corp. Somers, NY
• NCR Corp. Dayton, OH
• Magnify Inc. Chicago, IL
• Oracle Corporation Redwood Shore, CA
• National Center for Data Mining, University of Illinois at Chicago
• SPSS Inc. Chicago, IL
• Xchange, Inc. Boston, MA
• MINEit Software Ltd. Bracknell, UK
PMML defines a variety of specific mining models such as tree classification, neural
networks, regression, etc. Equally important are definitions which are common to all models,
in order to describe the input data itself, and generic transformations which can be applied to
the input data before the model itself is evaluated. The following schema shows the basic blocks of a mining model as well as the data flow of such an operation:
The DataDictionary describes the data 'as is', that is the raw input data. The DataDictionary
refers to the original data and defines how the mining model interprets the data, e.g., as
categorical or numerical, and the range of valid values may be restricted. The raw data are
not included in a PMML document and they are hosted in external sources. The
DataDictionary only defines the mappings between the source attributes and the model’s local
field names.
The MiningSchema defines an interface to the user of PMML models. It lists all fields that are
used as input to the computations in the mining model. The mining model may internally
require further derived values that depend on the input values, but these derived values are not
part of the MiningSchema. The derived values are defined in the transformations block. The
MiningSchema also defines which values are regarded as outliers and which weighting is applied to a field, e.g., for clustering. Input fields as specified in the MiningSchema refer to fields in
the data dictionary but not to derived fields because a user of a model is not required to
perform the normalizations.
Various types of transformations are defined such as normalization of numbers to a range [0..1] or discretization of continuous fields. These transformations convert the original values to internal values as they are required by the mining model, such as an input neuron of a network model.
Figure 4. Basic blocks of a mining model
If a PMML model contains transformations, a user is not required to take care of these normalizations. The MiningSchema lists the input fields that refer to the non-normalized original values; the user presents these fields as input to the model.
The output of a model always depends on the specific kind of model, e.g. it may be defined by a leaf node in a tree or by output neurons in a neural network. The final result, such as a predicted class and a probability, is computed from the output of the model. If a neural
network is used for predicting numeric values then the output value of the network usually
needs to be denormalized into the original domain of values. Fortunately, this denormalization
can use the same kind of transformation types. The PMML consumer system will
automatically compute the inverse mapping.
3.1.2 General Structure of a PMML Document
PMML uses XML to represent mining models. The structure of the models is described by a
DTD which is called the PMML DTD. The DTD to which all PMML documents must conform is:
<!ELEMENT PMML (Header, Settings?, DataDictionary,
TransformationDictionary, (%A-PMML-MODEL;)+, Extension* )>
<!ATTLIST PMML version CDATA #REQUIRED>
<!ELEMENT Settings (Extension*) >
<!ELEMENT TransformationDictionary (DerivedValues*, Extension* ) >
For PMML version 2.0 the attribute version must have the value "2.0" as shown in the next
small example of a PMML instance.
<?xml version="1.0"?>
<!DOCTYPE PMML PUBLIC "PMML 2.0" "http://www.dmg.org/PMML2.0/pmml-2-0.dtd">
<PMML version="2.0">
...
</PMML>
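As a quick illustration of this structure, the sketch below (ours, not part of the standard; the instance is minimal and not DTD-validated) builds a skeletal PMML document and reads its version attribute with Python's ElementTree:

```python
import xml.etree.ElementTree as ET

# A skeletal PMML instance assembled from the elements described above:
# a Header with its required copyright attribute and an (empty,
# placeholder-only) DataDictionary. Not validated against the DTD.
doc = """<?xml version="1.0"?>
<PMML version="2.0">
  <Header copyright="www.dmg.org" description="skeleton"/>
  <DataDictionary numberOfFields="0"/>
</PMML>"""

root = ET.fromstring(doc)
assert root.tag == "PMML" and root.get("version") == "2.0"
print(root.find("Header").get("copyright"))  # www.dmg.org
```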
In the following we give a description of the base elements of the generic PMML DTD.
3.1.3 Header
The Header DTD is:
<!ELEMENT Header (Application?, Annotation*, Timestamp?)>
<!ATTLIST Header
copyright CDATA #REQUIRED
description CDATA #IMPLIED
>
<!ELEMENT Application EMPTY>
<!ATTLIST Application
name CDATA #REQUIRED
version CDATA #IMPLIED
>
<!ELEMENT Annotation (#PCDATA)>
<!ELEMENT Timestamp (#PCDATA)>
• Header: The top level tag that marks the beginning of the header information.
• Copyright: This head attribute contains the copyright information for this model.
• Description: This header attribute contains a general description of the model. It should contain information necessary for using this model in further applications, but not information that is better placed in the application element, the annotations, or the data dictionary section. This attribute should only contain human-readable information; models mentioned in this DTD file should not be expected to utilize the information contained in this attribute.
• Application: This head element describes the software application that generated the
PMML. Though these models are created to be portable, different mechanisms may
create different models from the same data set. It is of interest to the user from which
application these models were generated.
• Name: The name of the application that generated the model.
• Version: The version of the application that generated this model.
• Annotation: Document modification history is embedded here. Each annotation is
free text and, like the description attribute in the head element, makes sense to the
human eye only.
• Timestamp: This element allows a model creation timestamp in the format YYYY-
MM-DD hh:mm:ss GMT +/- xx:xx.
3.1.4 Settings
The element Settings can contain any XML value describing the configuration of the training
run that produced the model instance. This information is not directly needed in a PMML
consumer, but in many cases it is helpful for maintenance and for visualization of the model.
The content of Settings is not defined in PMML 2.0.
3.1.5 Data Dictionary
The data dictionary contains definitions for fields as used in mining models. It specifies the
types and value ranges. These definitions are assumed to be independent of specific data sets
as used for training or building a specific model.
A data dictionary can be shared by multiple models; statistics and other information related to the training set are stored within a model.
<!ENTITY %FIELD-NAME "CDATA" >
<!ELEMENT DataDictionary (Extension*, DataField+, Taxonomy+ ) >
<!ATTLIST DataDictionary
numberOfFields %INT-NUMBER; #IMPLIED
>
<!ELEMENT DataField ( Extension*, (Interval*| Value*) ) >
<!ATTLIST DataField
name %FIELD-NAME; #REQUIRED
displayName CDATA #IMPLIED
optype (categorical|ordinal|continuous) #REQUIRED
taxonomy CDATA #IMPLIED
isCyclic ( 0 | 1 ) "0"
>
The value of numberOfFields is the number of fields defined in the content of DataDictionary; this number can be added for consistency checks. The name of a data field
must be unique in the data dictionary. The displayName is a string which may be used by
applications to refer to that field. Within the XML document only the value of name is
significant. If displayName is not given, then name is the default value.
The fields are separated into different types depending on which operations are defined on the
values; this is defined by the attribute optype. Categorical fields have the operator "=", ordinal
fields have an additional "<", and continuous fields also have arithmetic operators. Cyclic
fields have a distance measure which takes into account that the maximal value and minimal
value are close together. The optional attribute taxonomy refers to a taxonomy of values; it is only applicable to categorical fields, and its value is the name of a taxonomy.
3.1.6 Transformation Dictionary (Derived Values)
At various places the mining models use simple functions in order to map user data to values
that are easier to use in the specific model. For example, neural networks internally work with
numbers, usually in the range [0..1]. Numeric input data are mapped to the range [0..1], and
categorical fields are mapped to series of 0/1 indicators. Similarly, Naive Bayes models
internally map all input data to categorical values. For PMML to be able to handle such
mappings it defines various kinds of simple data transformations:
Normalization: map values to numbers, input can be continuous or discrete.
Discretization: map continuous values to discrete values.
Value mapping: map discrete values to discrete values; mapping missing values is a special case of value mapping.
Aggregation: summarize or collect groups of values, e.g. compute average.
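The four transformation kinds above can be sketched as simple functions. This is an illustrative sketch only; the function names are ours, not PMML element names:

```python
# Illustrative sketches of the four PMML transformation kinds.

def normalize(x, lo, hi):
    """Normalization: map a continuous value into the range [0..1]."""
    return (x - lo) / (hi - lo)

def discretize(x, boundaries, labels):
    """Discretization: map a continuous value to a discrete bin label.
    len(labels) == len(boundaries) + 1."""
    for b, label in zip(boundaries, labels):
        if x < b:
            return label
    return labels[-1]

def map_value(v, mapping, missing="unknown"):
    """Value mapping: discrete -> discrete; a missing value is handled
    as a special case with a default label."""
    return mapping.get(v, missing)

def aggregate(values):
    """Aggregation: summarize a group of values, e.g. compute the average."""
    return sum(values) / len(values)

assert normalize(5.0, 0.0, 10.0) == 0.5
assert discretize(42, [18, 65], ["minor", "adult", "senior"]) == "adult"
assert map_value("M", {"M": "male", "F": "female"}) == "male"
assert aggregate([1.0, 2.0, 3.0]) == 2.0
```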
The corresponding XML elements appear as content of a surrounding DerivedValues element, which provides a common wrapper for the various mappings. DerivedValues elements can appear in the data dictionary within DataField elements. They can also appear at several places in the definition of specific models such as neural network or Naive Bayes models. The names of the transformed fields are defined such that the model can refer to these fields.
The transformations in PMML do not cover the full set of preprocessing functions which may
be needed to collect and prepare the data for mining. There are too many variations of
preprocessing expressions. Instead, the PMML transformations represent expressions that are created automatically by a mining system, and the corresponding expressions are often machine-generated. For example, a discretization might be constructed by a mining system that computes quantile ranges in order to transform skewed data.
3.1.7 PMML mining models
A PMML document can contain more than one model. PMML supports the following data
mining models:
TreeModel: The tree modeling framework allows for defining either a classification or
prediction structure. Each Node holds a rule, called PREDICATE, that determines the
reason for choosing the Node or any of the branching Nodes.
NeuralNetwork: A neural network has one or more input nodes and one or more neurons.
Some neuron's outputs are the output of the network. The network is defined by the
neurons, their connections and the corresponding weights. All neurons are organized into
layers; the sequence of layers defines the order in which the activations are computed. All
output activations for neurons in some layer L are evaluated before computation proceeds
to the next layer L+1. Note that this allows for recurrent networks, where outputs of neurons in layer L+i (i > 0) can be used as input in layer L. The model does not define a specific evaluation order for neurons within a layer.
ClusteringModel: PMML models for Clustering are defined in two different classes.
These are center-based and distribution-based cluster models. Both models have the DTD
element ClusteringModel as the top-level type, and they share many other element types. A
cluster model basically consists of a set of clusters. For each cluster a center vector can be
given. In center-based models a cluster is defined by a vector of center coordinates. Some
distance measure is used to determine the nearest center, that is the nearest cluster for a
given input record. For distribution-based models (e.g. in demographic clustering) the
clusters are defined by their statistics. Some similarity measure is used to determine the
best matching cluster for a given record. The center vectors then only approximate the
clusters. The model must contain information on the distance or similarity measure used
for clustering. It may also contain information on overall data distribution, such as
covariance matrix, or other statistics.
RegressionModel: The regression functions are used to determine the relationship
between the dependent variable (target field) and one or more independent variables. The
dependent variable is the one whose values you want to predict, whereas the independent
variables are the variables that you base your prediction on. PMML defines three types of
regression models: linear, polynomial, and logistic regression.
NaiveBayesModel: Naive Bayes uses Bayes' Theorem, combined with a ("naive") presumption of conditional independence, to predict, for each record (a set of values, one for each field), the value of a target (output) field from the evidence given by one or more predictor (input) fields.
AssociationModel: The Association Rule model represents rules where some set of items
is associated to another set of items. For example a rule can express that a certain product
is often bought in combination with a certain set of other products.
SequenceMiningModel: The basic data model consists of an Object, identified by the
“Primary Key” that has a number of events attributed to it, defined by the “Secondary
Key”. Each event consists of a set of ordered items. An “Order Field” defines the order of
the items within an event, with an optional qualifier in the form of an attribute name.
For every model there is a corresponding DTD that describes the metadata and the processes
of each model. In the following we give a description of the generic framework of PMML that is used by every model. For all PMML models the structure of the top-level model
element is similar to:
<!ELEMENT XModel (Extension*, MiningSchema, ModelStats?, ..., Extension* ) >
<!ATTLIST XModel
modelName CDATA #IMPLIED
functionName %MINING-FUNCTION; #REQUIRED
algorithmName CDATA #IMPLIED
>
<!ELEMENT MiningSchema (MiningField+) >
<!ELEMENT ModelStats (UnivariateStats+) >
The non-empty list of mining fields defines a so-called mining schema. The statistics contain global statistics on (a subset of) the mining fields. Other model-specific elements follow after
ModelStats in the content of XModel. For a list of models that have been defined in PMML
2.0 see the entity A-PMML-MODEL above.
The naming conventions for PMML are: ElementNames in mixed case, first letter uppercase; attributeNames in mixed case, first letter lowercase; enumConstants in mixed case, first letter lowercase; ENTITY-NAMES all uppercase. The character '-' is used less often in order to avoid confusion with mathematical notation.
3.1.8 Example: DTD of Association Rules Model
The Association Rule model represents rules where some set of items is associated to another
set of items. For example a rule can express that a certain product is often bought in
combination with a certain set of other products.
An Association Rule model consists of four major parts:
<!ELEMENT AssociationModel (Extension*, AssocInputStats,
AssocItem+, AssocItemset+, AssocRule+)>
<!ATTLIST AssociationModel
modelName CDATA #IMPLIED
>
AssocInputStats describes the basic information of the input data through a set of
attributes:
<!ELEMENT AssocInputStats EMPTY>
<!ATTLIST AssocInputStats
numberOfTransactions %INT-NUMBER; #REQUIRED
maxNumberOfItemsPerTA %INT-NUMBER; #IMPLIED
avgNumberOfItemsPerTA %REAL-NUMBER; #IMPLIED
minimumSupport %PROB-NUMBER; #REQUIRED
minimumConfidence %PROB-NUMBER; #REQUIRED
lengthLimit %INT-NUMBER; #IMPLIED
numberOfItems %INT-NUMBER; #REQUIRED
numberOfItemsets %INT-NUMBER; #REQUIRED
numberOfRules %INT-NUMBER; #REQUIRED
>
Attribute description:
numberOfTransactions: The number of transactions (baskets of items) contained in the
input data.
maxNumberOfItemsPerTA: The number of items contained in the largest transaction.
avgNumberOfItemsPerTA: The average number of items contained in a transaction.
minimumSupport: The minimum relative support value (#supporting transactions / #total
transactions) satisfied by all rules.
minimumConfidence: The minimum confidence value satisfied by all rules. Confidence is
calculated as (support (rule) / support(antecedent)).
lengthLimit: The maximum number of items contained in a rule which was used to limit
the number of rules.
numberOfItems: The number of different items contained in the input data.
numberOfItemsets: The number of itemsets contained in the model.
numberOfRules: The number of rules contained in the model.
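The support and confidence definitions above can be computed directly. The sketch below is illustrative only, using a small hypothetical four-transaction data set:

```python
# Relative support and confidence as defined above, computed over a
# small hypothetical transaction set (baskets of items).
transactions = [
    {"Cracker", "Coke", "Water"},
    {"Cracker", "Water"},
    {"Cracker", "Water"},
    {"Cracker", "Coke", "Water"},
]

def support(itemset):
    """#supporting transactions / #total transactions."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """support(rule) / support(antecedent)."""
    return support(antecedent | consequent) / support(antecedent)

assert support({"Cracker", "Water"}) == 1.0  # appears in all 4 transactions
assert support({"Coke"}) == 0.5              # appears in 2 of 4
assert confidence({"Water"}, {"Coke"}) == 0.5
```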
AssocItem describes the items contained in itemsets:
<!ELEMENT AssocItem EMPTY>
<!ATTLIST AssocItem
id %ELEMENT-ID; #REQUIRED
value CDATA #REQUIRED
mappedValue CDATA #IMPLIED
weight %REAL-NUMBER; #IMPLIED
>
Attribute description:
id: An identification to uniquely identify an item.
value: The value of the item as in the input data.
mappedValue: Optional; a value to which the original item value is mapped. For instance, this could be a product name if the original value is an EAN code.
weight : The weight of the item. For example, the price or value of an item.
AssocItemset describes the itemsets which are contained in rules:
<!ELEMENT AssocItemset (Extension*, AssocItemRef+)>
<!ATTLIST AssocItemset
id %ELEMENT-ID; #REQUIRED
support %PROB-NUMBER; #REQUIRED
numberOfItems %INT-NUMBER; #REQUIRED
>
Attribute description:
id : An identification to uniquely identify an itemset
support : The relative support of the itemset
numberOfItems : The number of items contained in this itemset
AssocItemRef elements are item references that point to elements of type item:
<!ELEMENT AssocItemRef EMPTY>
<!ATTLIST AssocItemRef
itemRef %ELEMENT-ID; #REQUIRED
>
Attribute description:
itemRef : The id value of an item element
AssocRule elements, of the form <antecedent itemset> => <consequent itemset>, contain the actual association rules derived by the model instance:
<!ELEMENT AssocRule ( Extension* )>
<!ATTLIST AssocRule
support %PROB-NUMBER; #REQUIRED
confidence %PROB-NUMBER; #REQUIRED
antecedent %ELEMENT-ID; #REQUIRED
consequent %ELEMENT-ID; #REQUIRED
>
Attribute definitions:
support : The relative support of the rule
confidence : The confidence of the rule
antecedent : The id value of the itemset which is the antecedent of the rule
consequent : The id value of the itemset which is the consequent of the rule
An instance of an example Association Rules model is given below. In this example we can see all the aforementioned elements that construct an Association Rules model.
Let's assume we have four transactions with the following data:
• t1: Cracker, Coke, Water
• t2: Cracker, Water
• t3: Cracker, Water
• t4: Cracker, Coke, Water
<?xml version="1.0" ?>
<PMML version="1.1" >
<Header copyright="www.dmg.org"
description="example model for association rules"/>
<DataDictionary numberOfFields="1" >
<DataField name="item" optype="categorical" />
</DataDictionary>
<AssociationModel>
<AssocInputStats numberOfTransactions="4" numberOfItems="3"
minimumSupport="0.6" minimumConfidence="0.5"
numberOfItemsets="3" numberOfRules="2"/>
<!-- We have three items in our input data -->
<AssocItem id="1" value="Cracker" />
<AssocItem id="2" value="Coke" />
<AssocItem id="3" value="Water" />
<!-- and two frequent itemsets with a single item -->
<AssocItemset id="1" support="1.0" numberOfItems="1">
<AssocItemRef itemRef="1" />
</AssocItemset>
<AssocItemset id="2" support="1.0" numberOfItems="1">
<AssocItemRef itemRef="3" />
</AssocItemset>
<!-- and one frequent itemset with two items. -->
<AssocItemset id="3" support="1.0" numberOfItems="2">
<AssocItemRef itemRef="1" />
<AssocItemRef itemRef="3" />
</AssocItemset>
<!-- Two rules satisfy the requirements -->
<AssocRule support="1.0" confidence="1.0"
antecedent="1" consequent="2" />
<AssocRule support="1.0" confidence="1.0"
antecedent="2" consequent="1" />
</AssociationModel>
</PMML>
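To illustrate how the itemset references in such a document resolve back to item values, the following sketch (ours, using Python's ElementTree on an abbreviated copy of the example above) decodes one rule:

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the example Association Rules document above.
pmml_doc = """<PMML version="1.1">
 <AssociationModel>
  <AssocItem id="1" value="Cracker"/>
  <AssocItem id="2" value="Coke"/>
  <AssocItem id="3" value="Water"/>
  <AssocItemset id="1" support="1.0" numberOfItems="1">
   <AssocItemRef itemRef="1"/>
  </AssocItemset>
  <AssocItemset id="2" support="1.0" numberOfItems="1">
   <AssocItemRef itemRef="3"/>
  </AssocItemset>
  <AssocRule support="1.0" confidence="1.0" antecedent="1" consequent="2"/>
 </AssociationModel>
</PMML>"""

root = ET.fromstring(pmml_doc)
# Map item ids to values, then itemset ids to lists of item values.
items = {i.get("id"): i.get("value") for i in root.iter("AssocItem")}
itemsets = {s.get("id"): [items[r.get("itemRef")] for r in s.iter("AssocItemRef")]
            for s in root.iter("AssocItemset")}
for rule in root.iter("AssocRule"):
    print(itemsets[rule.get("antecedent")], "=>", itemsets[rule.get("consequent")],
          "support", rule.get("support"), "confidence", rule.get("confidence"))
# ['Cracker'] => ['Water'] support 1.0 confidence 1.0
```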
3.2. SQL/MM
3.2.1 Overview
SQL/MM (MM for MultiMedia) [SQL/MM] is a standard based on SQL that has been developed by the International Organization for Standardization (ISO). It is divided into
multiple parts which are the following:
Part 1 : Framework
Part 2: Full Text
Part 3: Spatial
Part 5: Still Image
Part 6: Data Mining
The structured types defined in SQL/MM are first-class SQL types that can be accessed
through SQL:1999 base syntax. These accesses also include invocation of the routines
(methods) associated with the structured types. In the following section we shall review the
basic features of Part 6, Data Mining.
3.2.2 Part 6: Data Mining
The standard supports four different data mining techniques. The term model is used throughout the standard and actually stands for a data mining technique. The four models are:
Rule model
Clustering model
Regression model
Classification model
Every model has a corresponding SQL structured user-defined type. A set of predefined types
completes the full definition of each model. The basic type is named DM_*Model where “*”
is replaced by ‘Class’ for a classification model, ‘Rule’ for a rule model, ‘Clustering’ for a
clustering model and ‘Regression’ for a regression model. The predefined types are the
following (the same naming schema is used as in the case of the DM_*Model):
DM_*Settings type: Instances of that type are used for storing various parameters of the
data mining model, such as the maximum number of clusters or the depth of a decision
tree.
DM_*TestResult: Instances of that type hold the results of the tests performed during the training phase of the data mining models.
DM_*Result: The running of a data mining model against real data creates instances of
that type.
DM_*Task: Instances of that type store metadata that describe the process and control of the tests and of the actual runs.
3.2.3 Example: Rule Model
The DM_RuleModel type represents models which are the result of the search for association
rules. Next, we give the definition of the Rule Model type as described in the standard:
CREATE TYPE DM_RuleModel
AS (
DM_content CHARACTER LARGE OBJECT(DM_MaxContentLength)
)
INSTANTIABLE
NOT FINAL
STATIC METHOD DM_impRuleModel
(input CHARACTER LARGE OBJECT(DM_MaxContentLength))
RETURNS DM_RuleModel
LANGUAGE SQL
DETERMINISTIC
CONTAINS SQL
RETURNS NULL ON NULL INPUT,
METHOD DM_expRuleModel ()
RETURNS CHARACTER LARGE OBJECT(DM_MaxContentLength)
LANGUAGE SQL
DETERMINISTIC
CONTAINS SQL
CALLED ON NULL INPUT,
METHOD DM_getNORules ()
RETURNS INTEGER
LANGUAGE SQL
DETERMINISTIC
CONTAINS SQL
RETURNS NULL ON NULL INPUT,
METHOD DM_getRuleTask ()
RETURNS DM_RuleTask
LANGUAGE SQL
DETERMINISTIC
CONTAINS SQL
CALLED ON NULL INPUT
The Rule Model type has only one member variable, DM_content. In that variable, which is a CHARACTER LARGE OBJECT (CLOB), the complete information about one instance of the model is stored.
Method DM_impRuleModel takes a CLOB as input parameter; if the CLOB is a proper representation of a DM_RuleModel, a new value of type DM_RuleModel is created. A CLOB is a proper representation of a DM_RuleModel if it is a valid XML instance of the PMML Association Rules DTD.
The result of the invocation DM_expRuleModel () is a CHARACTER LARGE OBJECT
representing the association rule model contained in SELF.
Method DM_getNORules returns the number of rules in DM_content. As mentioned before, DM_content is a PMML document, so the implementation of that method must include XML processing techniques in order to find the value of the attribute "numberOfRules" in the element "AssocInputStats" (see Section 3.1.8). For example, this could be an XQuery statement.
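The standard leaves the XML processing mechanism open; as an illustrative sketch (ours, not part of SQL/MM), the same lookup could be done in Python with ElementTree, assuming DM_content holds a PMML document like the earlier examples:

```python
import xml.etree.ElementTree as ET

def dm_get_no_rules(dm_content):
    """Sketch of the DM_getNORules lookup: parse the PMML document stored
    in the DM_content CLOB and read numberOfRules from AssocInputStats."""
    root = ET.fromstring(dm_content)
    stats = root.find(".//AssocInputStats")
    return int(stats.get("numberOfRules"))

# A minimal DM_content value, mirroring the AssocInputStats example above.
clob = """<PMML version="2.0">
 <AssociationModel>
  <AssocInputStats numberOfTransactions="4" numberOfItems="3"
      minimumSupport="0.6" minimumConfidence="0.5"
      numberOfItemsets="3" numberOfRules="2"/>
 </AssociationModel>
</PMML>"""
assert dm_get_no_rules(clob) == 2
```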
3.3. Common Warehouse Model (CWM)
3.3.1 Overview
The main purpose of CWM [CWM] is to enable easy interchange of warehouse and business
intelligence metadata between warehouse tools, warehouse platforms and warehouse metadata
repositories in distributed heterogeneous environments. CWM is based on three key industry
standards:
UML - Unified Modeling Language, an OMG modeling standard
MOF - Meta Object Facility, an OMG metamodeling and metadata repository standard
XMI - XML Metadata Interchange, an OMG metadata interchange standard
The CWM provides a framework for representing metadata about data sources, data targets,
transformations and analysis, and the processes and operations that create and manage
warehouse data and provide lineage information about its use. The CWM Metamodel consists
of a number of sub-metamodels which represent common warehouse metadata in the major areas of interest to data warehousing and business intelligence. In this section we give an overview of the Data Mining sub-metamodel.
3.3.2 Data Mining Metamodel
The CWM Data Mining metamodel represents three conceptual areas:
The overall Model description
Settings
Attributes
The Model conceptual area consists of a generic representation of a data mining model (that
is, a mathematical model produced or generated by the execution of a data mining algorithm).
This consists of MiningModel, a representation of the mining model itself, MiningSettings,
which drive the construction of the model, ApplicationInputSpecification, which specifies the
set of input attributes for the model, and MiningModelResult, which represents the result set
produced by the testing or application of a generated model.
The Settings conceptual area elaborates further on the mining settings and their usage
relationships to the attributes of the input specification. MiningSettings has four
subclasses: StatisticsSettings, ClusteringSettings, SupervisedMiningSettings and
AssociationRulesSettings. SupervisedMiningSettings is further subclassed into
ClassificationSettings and RegressionSettings, and a CostMatrix is defined for representing
cost values associated with misclassifications.
AttributeUsageRelation consists of attributes that further classify the usage of
MiningAttributes by MiningSettings (e.g., relative weight). Several associations are also used
to explicitly define requirements placed on attributes by certain subclasses of settings (e.g.,
target, transactionId and itemId).
The Attributes conceptual area defines two subclasses of MiningAttribute: NumericAttribute
and CategoricalAttribute. Category represents the category properties and values that either a
CategoricalAttribute or an OrdinalAttribute might possess, while CategoryHierarchy represents
any taxonomy with which a CategoricalAttribute might be associated.
All of the above classes are expressed in UML and are described in more detail in [CWM].
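As a rough sketch, the inheritance structure of the Settings and Attributes areas can be rendered in code. The class names follow the metamodel text; the empty bodies are placeholders, since CWM defines the metamodel in UML, not as executable classes:

```python
# Sketch of the CWM Data Mining settings/attributes hierarchy (names from
# the metamodel description; bodies intentionally empty).

class MiningAttribute:
    pass

class NumericAttribute(MiningAttribute):
    pass

class CategoricalAttribute(MiningAttribute):
    pass

class MiningSettings:
    pass

class StatisticsSettings(MiningSettings):
    pass

class ClusteringSettings(MiningSettings):
    pass

class AssociationRulesSettings(MiningSettings):
    pass

class SupervisedMiningSettings(MiningSettings):
    pass

class ClassificationSettings(SupervisedMiningSettings):
    pass

class RegressionSettings(SupervisedMiningSettings):
    pass

# Both supervised variants reach MiningSettings through
# SupervisedMiningSettings, mirroring the two-level subclassing above.
assert issubclass(ClassificationSettings, MiningSettings)
assert issubclass(RegressionSettings, SupervisedMiningSettings)
```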
3.4. Java DM API
Java DM API (JDMAPI) [JDM] follows Sun's Java Community Process as a Java
Specification Request (JSR). It addresses the need for an API that will give "procedural"
support to all the existing and evolving data mining standards. A final proposal of the API has
not been issued yet, but one is expected in early 2003. The group participating in the
specification of the API consists of the following members:
BEA Systems, Blue Martini Software, Werner Dubitzky, Hyperion Solutions Corp., IBM,
Kana Communications Inc., Magnify Research Inc., MINEit Software Ltd., Oracle,
Quadstone, SAP AG, SAS Institute, SPSS, Strategic Analytics, and Sun Microsystems, Inc.
The JDMAPI specification will draw on design concepts addressed in PMML, CWM and
SQL/MM Data Mining. It will support the building of data mining models; the scoring of
data using models; the creation, storage, access and maintenance of data and metadata
supporting data mining results; and selected data transformations. JDMAPI will be based on
the Java 2 Platform, Enterprise Edition.
The goal of JDMAPI is to provide for data mining systems functionality similar to what
JDBC provides for relational databases.
3.5. Oracle9i Data Mining
Oracle has embedded data mining within the Oracle9i database with Oracle9i Data Mining
(ODM) [Oracle9i]. All the functionality for data mining operations, such as model building,
scoring, and testing, is provided via a Java API. Oracle9i Data Mining consists of
the following components:
Oracle9i Data Mining (ODM) API
Data Mining Server (DMS)
The ODM API allows users to write programs that perform data mining operations. It is based
on the proposed concepts of the Java Data Mining API (see section 3.4).
The Data Mining Server (DMS) resides on the database server side and accepts requests from
programs written using the ODM API. DMS performs the processing of these requests and
delivers results to the client applications. The DMS also provides a metadata repository
consisting of mining input objects and result objects, along with the namespaces within which
these objects are stored and retrieved. Oracle9i Data Mining supports two data mining
functions: classification for supervised learning and association rules for unsupervised
learning. The mining functions use two algorithms: Naive Bayes and Association Rules.
Every model has characteristics similar to those addressed in PMML and SQL/MM Data
Mining, and models are persisted in the DMS using relational tables. The model instances are
stored in the DMS, and users can refer to them using a user-specified unique name. Every
step in the model-building procedure creates persistent objects that can be used by multiple
applications.
The most important object, in the context of storing data mining results, is the mining result
object. A mining result object contains the end products of test or build-model mining
operations. ODM persists mining results in relational tables in the DMS. A mining result
object contains the operation start time and end time, the name of the model used, the input
data location, and the output data location for the data mining operation.
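The contents of a mining result object listed above can be sketched as a simple record. The field names below are illustrative assumptions, not Oracle's actual table schema:

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class MiningResult:
    """Sketch of the fields the report lists for an ODM mining result
    object; names are illustrative, not Oracle's actual schema."""
    start_time: datetime
    end_time: datetime
    model_name: str            # user-specified unique model name
    input_data_location: str   # e.g. a schema-qualified table name
    output_data_location: str

# Hypothetical result of a build/test operation.
result = MiningResult(
    start_time=datetime(2003, 1, 1, 10, 0),
    end_time=datetime(2003, 1, 1, 10, 5),
    model_name="nb_churn_model",
    input_data_location="MINING_DATA.CUSTOMERS",
    output_data_location="MINING_DATA.CHURN_SCORES",
)
print(result.model_name)  # prints nb_churn_model
```

In ODM itself such records are rows in relational tables inside the DMS, retrieved through the Java API rather than constructed by client code.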
Oracle intends to support other models in future versions, such as Decision Trees and
Classification models. It also intends to provide full support of SQL/MM and PMML.
3.6. Information Discovery DataMining Suite
Information Discovery [IDMS] has developed a set of tools and systems for data mining and
knowledge extraction. These products make use of the Pattern Query Language (PQL). PQL
is a pattern-oriented query language specifically designed to provide business users access to
refined information. No other information is available (at least through their web site) about
the syntax or the semantics of the language. Its concept is very close to the motivation of the
PANDA project, which is the reason for mentioning it in this survey.
4. Conclusion
The database research community claims that technology has reached a critical point since
there are certain requirements and information types that are either partially supported or not
supported at all [B+98]. The innovation of the PANDA project lies in the vision for a new
approach aiming at the definition of a system architecture that efficiently represents,
maintains and manages patterns. This approach applies to a variety of domains, including,
knowledge discovery in (traditional or non-traditional) databases, time-involving applications
(time series or moving objects databases), multimedia systems (image or video databases),
scientific data, and the WWW (as a huge repository of unstructured information). The
cornerstone of this new approach will be the pattern concept, aiming at representing huge
volumes of information in an effective way.
Moreover, the PANDA project aims to integrate existing approaches towards a novel
logical incorporation of patterns into a data model, a language, and base management system
support.
With regard to PMML and the other modelling approaches, the proposed system architecture
takes a vertical approach, defining an extensible type system. As has already been discussed
in the previous sections, patterns arise from different scientific fields (e.g., data mining,
mathematics, information retrieval). PANDA aims at supporting any pattern type regardless
of the application, while other approaches are mainly oriented to data mining patterns.
The majority of users, in both scientific and business fields, do not want massive volumes of
data; they are interested in the patterns and trends hidden within the data. Since these patterns
need to be accessed, manipulated and managed just as data elements are managed, the concept
of "pattern management" is introduced. Pattern management systems deal with patterns just
as data management systems deal with data. Moreover, they require distinct repositories and
query languages, corresponding to the languages that have been developed for data management.
In this report we reviewed different types of patterns from many areas and addressed the
current efforts on modeling data mining operations, along with the corresponding results.
Furthermore, we examined different procedures of pattern extraction, which give a brief idea
of the diversity among the various pattern types. After a close observation of the various
pattern types, and given the informal definition of a pattern as a compact and semantically
rich representation of raw data, we draw the conclusion that they share many common
characteristics.
One of the challenges in the field of pattern mining (or pattern recognition) is the
development of a framework capable of representing and dealing with every kind of pattern,
independently of the application and/or the method used to extract the patterns. This framework
has to serve as a precise conceptual foundation for the representation and behavior of
patterns. It will be the basis for the design and development of a system that handles (stores /
processes / retrieves) patterns and supports pattern-related operations.
To proceed with the definition of the considered framework, we have to define the structure
and the requirements for representing each kind of pattern, while the relationships among
patterns also have to be identified. It is also important to identify the behavior of patterns and
the functions that a pattern-related system has to support.
References
[AGO01] G. Agostini, M. Longari and E. Pollastri, “Musical instrument timbres
classification with spectral features”, Proceedings of IEEE Multimedia Signal
Processing Conference (2001).
[AS94] Rakesh Agrawal, Ramakrishnan Srikant. “Fast Algorithms for Mining
Association Rules”. Proc. of the 20th VLDB Conference, 1994.
[AS95] Rakesh Agrawal and Ramakrishnan Srikant: Mining Sequential Patterns, IBM
Almaden Research Center, 1995.
[AW95] C. Ahlberg and E. Wistrand. IVEE: An information visualization and exploration
environment. In Proc. Information Visualization, Atlanta, GA, pages 66–73, 1995.
[BAU70] L.E. Baum, T. Petrie, G. Soules, N. Weiss, "A maximization technique occurring in
the statistical analysis of probabilistic functions of Markov chains", Annals of
Mathematical Statistics, vol. 41, pp. 164-171, 1970.
[B+98] P. A. Bernstein et al., “The Asilomar Report on Database Research”, SIGMOD
Record, 27(4):74-80, December 1998.
[BCH00] Andi Baritchi, Diane J. Cook and Lawrence B. Holder: Discovering Structural
Patterns in Telecommunications Data, Texas, 2000.
[BDO95] M. W. Berry, S. T. Dumais, G. W. O'Brien, Using linear algebra for intelligent
information retrieval, SIAM Review, 37(4): 573-595, 1995.
[BL96] Michael Berry, Gordon Linoff. "Data Mining Techniques: For Marketing, Sales,
and Customer Support". John Wiley, 1996.
[BRD99] Alexandre M. Braga, Cecilia M.F. Rubira and Ricardo Dahab: Tropyc: A Pattern
Language for Cryptographic Software, Brazil, 1999.
[BSC96] A. Buja, D.F. Swayne, and D. Cook. Interactive high-dimensional data
visualization. Journal of Computational and Graphical Statistics, 5(1): 78–99,
1996.
[Che99] C.Chen. Information Visualisation and Virtual Environments. Springer-Verlag,
London, 1999.
[Chiu97] S. Chiu. "Extracting Fuzzy Rules from Data for Function Approximation and
Pattern Classification". Fuzzy Information Engineering: A Guided Tour of
Applications (Eds.: D. Dubois, H. Prade, R. Yager), 1997.
[CMS 99] Card S., Mackinlay J., Shneiderman B.: ‘Readings in Information Visualization:
Using Vision to Think’, Morgan Kaufmann, 1999.
[CN95] Peter Clifford and Geoff Nicholls: A Metropolis Sampler for Polygonal Image
Reconstruction, UK, 1995.
[CS96] P. Cheeseman, J. Stutz. "Bayesian Classification (AutoClass): Theory and
Results". Advances in Knowledge Discovery and Data Mining (Eds.: U.
Fayyad et al.), AAAI Press, 1996.
[CON02] D. Conklin, “Representation and Discovery of vertical patterns in Music”,
Lecture Notes in Artificial Intelligence (LNAI) 2445, Springer Verlag,
2002.
[CWM] Common Warehouse Metamodel (CWM), available at http://www.omg.org/cwm
[DAN02] R. Dannenberg and N. Hu, “Discovering Musical Structure in Audio
Recordings”, Lecture Notes in Artificial Intelligence (LNAI) 2445,
Springer Verlag, 2002.
[DDF+90] S. Deerwester, S. T. Dumais, G. Furnas, Th. K. Landauer, R. Harshman, Indexing
by Latent Semantic Analysis, Journal of the Society for Information Science,
41(6): 391-407, 1990.
[DEL87] J. Deller, J. Proakis, J. Hansen, "Discrete-Time Processing of Speech Signals",
Prentice-Hall, 1987.
[DLLL97] S. T. Dumais, T. A. Letsche, M. L. Littman, T. K. Landauer, Automatic
cross-language retrieval using Latent Semantic Indexing, In AAAI Spring
Symposium on Cross-Language Text and Speech Retrieval, March 1997.
[DLMRS98] Gautam Das, King-Ip Lin, Heikki Mannila, Gopal Renganathan and
Padhraic Smyth: Rule Discovery from Time Series, USA, 1998.
[DMG] DMG, Predictive Model Markup Language (PMML), available at
http://www.dmg.org/pmmlspecs_v2/pmml_v2_0.html
[ERR01] A. Eronen, "Comparison of features for musical instrument recognition",
Proceedings of the 2001 IEEE Workshop on Applications of Signal Processing to
Audio and Acoustics (WASPAA), 2001.
[FFKLLRSZ98] Ronen Feldman, Moshe Fresko, Yakov Kinar, Yehuda Lindell, Orly
Liphstat, Martin Rajman, Yonatan Schler, Oren Zamir: Text Mining at the Term
Level, 1998.
[FPSU96] U. Fayyad, G. Piatetsky-Shapiro, P. Smyth and R. Uthurusamy (editors). "From
Data Mining to Knowledge Discovery: An Overview". Advances in Knowledge
Discovery and Data Mining. AAAI Press, 1996.
[G99] D. M. Gavrila: The Visual Analysis of Human Movement: A Survey, Germany,
1999.
[GL96] G.H. Golub, C.F. Van Loan. Matrix Computations, Third Edition, Johns
Hopkins University Press, Baltimore-London, 1996
[GMPS96] Glymour C., Madigan D., Pregibon D., Smyth P., "Statistical Inference and Data
Mining", CACM, vol. 39 (11), 1996, pp. 35-42.
[GMV96] Guyon, I., Matic, N. and Vapnik, V.: Discovering informative patterns and data
cleaning. In Fayyad U., Piatetsky-Shapiro G., Smyth P. and Uthurusamy, R. (ed.)
Advances in Knowledge Discovery and Data Mining, AAAI Press/The MIT
Press, Menlo Park, California, (1996)
[GRS98] Sudipto Guha, Rajeev Rastogi, Kyuseok Shim. "CURE: An Efficient Clustering
Algorithm for Large Databases". In Proceedings of the ACM SIGMOD
Conference, 1998.
[HH99] R. J. Hilderman, H. J. Hamilton. "Knowledge Discovery and Interestingness
Measures: A Survey", Technical Report CS 99-04, Dept of Computer Science,
University of Regina, October 1999.
[HER02] P. Herrera, A. Yeterian, F. Gouyon, “Automatic Classification of Drum Sounds:
A Comparison of Feature Selection Methods and Classification Techniques”,
Lecture Notes in Artificial Intelligence (LNAI) 2445, Springer Verlag, 2002.
[HK01] Jiawei Han, Micheline Kamber. “Data Mining: Concepts and Techniques”.
Academic Press, 2001.
[Hor+98] T. Horiuchi. "Decision Rule for Pattern Classification by Integrating Interval
Feature Values". IEEE Transactions on Pattern Analysis and Machine
Intelligence, Vol.20, No.4, April 1998, pp.440-448.
[IDMS] Information Discovery DataMining Suite, available at
http://www.patternwarehouse.com/dmsuite.htm
[Jan+98] Cezary Z. Janikow, "Fuzzy Decision Trees: Issues and Methods", IEEE
Transactions on Systems, Man, and Cybernetics, Vol. 28, Issue 1, pp 1-14, 1998.
[JDM] Java Data Mining API, available at http://www.jcp.org/jsr/detail/73.prt
[Kei 00] Keim D. A.: ‘Designing Pixel-oriented Visualization Techniques: Theory and
Applications’, Transactions of Visualization and Computer Graphics, 2000.
[Kei 01a] D. Keim. Visual exploration of large databases. Communications of the ACM,
44(8): 38–44, 2001.
[Kei 01b] Keim D. A.: ‘An Introduction to Information Visualization Techniques for
Exploring Very Large Databases’, Tutorial Notes, Visualization 2001, San Diego,
CA, 2001.
[LA94] Leung Y., Apperley M.: A Review and Taxonomy of Distortion-oriented
Presentation Techniques, Proc. Human Factors in Computing Systems CHI '94
Conf., Boston, MA, pp. 126-160, 1994.
[LAS97] Lent, B., Agrawal, R., Srikant, R.: Discovering Trends in Text Databases. In:
Proceedings of the 3rd International Conference on Knowledge Discovery (KDD),
1997.
[LB98] Terran Lane and Carla E. Brodley: Temporal Sequence Learning and Data
Reduction for Anomaly Detection, School of Electrical and Computer
Engineering and the COAST Laboratory, Purdue University, West Lafayette,
1998.
[LK94] Edward W. Large and John F. Kolen: Resonance and the Perception of Musical
Meter, The Ohio State University, 1994.
[MAR96] M. Mehta, R. Agrawal, J. Rissanen. "SLIQ: A fast scalable classifier for data
mining". In EDBT'96, Avignon, France, March 1996.
[MAR01] A. Marsden, “Representing melodic patterns as networks of elaborations”, in
Computers and the Humanities, 35:37-54, 2001.
[MER01] D. Meredith, G. Wiggins and K. Lemstrom, "Pattern Induction and Matching in
Music and Other Multidimensional Datasets", Proceedings of the Conference on
Systemics, Cybernetics and Informatics, volume X, 2001.
[Mit+97] T. Mitchell. Machine Learning. McGraw-Hill, 1997
[NPD94] J. Nielsen, V.L. Phillips, S.T. Dumais, Information Retrieval of Imperfectly
Recognized Handwriting, available at
http://www.useit.com/papers/handwriting_retrieval.html
[NPP00] Chikahito Nakajima, Massimiliano Pontil and Tomaso Poggio: People Recognition
and Pose Estimation in Image Sequences, Japan, 2000.
[Oracle9i] Oracle9i Data Mining Concepts, available at
http://otn.oracle.com/docs/products/oracle9i/doc_library/release2/datamine.920/a95961/1concept.htm#923516
[PAC00] F. Pachet, P. Roy, D. Cazaly, “A Combinatorial approach to content-based music
selection”, IEEE Multimedia, Vol 1, 2000.
[PIK01] A. Pikrakis, S. Theodoridis and D. Kamarotos, “Recognition of Isolated Musical
Patterns using Context Dependent Dynamic Time Warping”, IEEE Transactions
on Speech and Audio Processing (to appear).
[PIK02] A. Pikrakis, S. Theodoridis and D. Kamarotos, “Recognition of Isolated Musical
Patterns using Hidden Markov Models”, Lecture Notes in Artificial Intelligence
(LNAI) 2445, Springer Verlag, 2002.
[Quin+93] J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
[REI01] J. Reiss, J. Aucouturier, M. Sandler, "Efficient Multi-dimensional searching
routines for music information retrieval", Proceedings of ISMIR 2001.
[RS98] R. Rastogi, K. Shim. "PUBLIC: A Decision Tree Classifier that Integrates
Building and Pruning". Proceedings of the 24th VLDB Conference, New York,
USA, 1998.
[SA95] Ramakrishnan Srikant, Rakesh Agrawal. “Mining Generalized Association
Rules”. Proc. of the 21st VLDB Conference, 1995.
[SAM96] J. Shafer, R. Agrawal, M. Mehta. "SPRINT: A scalable parallel classifier for data
mining". In Proc. of the VLDB Conference, Bombay, India, September 1996.
[SB98] Neil Sumpter and Andrew J. Bulpitt: Learning Spatio-Temporal Patterns for
predicting Object behaviour, UK, 1998.
[SB99] Stephan Schulz and Felix Brandt: Using Term Space Maps to Capture Search
Control Knowledge in Equational Theorem Proving, Germany, 1999.
[Sch97] Oded Schramm: Circle Patterns with the Combinatorics of the Square Grid, The
Weizmann Institute, 1997.
[SL68] G. Salton, M.E. Lesk, Computer Evaluation of Indexing and Text Processing.
Journal of the ACM, 15(1):8-36, 1968
[Shn96] Shneiderman B.: ‘The Eyes Have It: A Task by Data Type Taxonomy for
Information Visualizations’, Proc. Visual Languages, 1996.
[Spe00] B. Spence. Information Visualization. Pearson Education Higher Education
publishers, UK, 2000.
[ST99] Ayman A. Abdel-Samad and Ahmed H. Tewfik: Search Strategies for Radar
Target Localization, University of Minnesota, Minneapolis, 1999.
[SQL/MM] ISO SQL/MM Part 6, available at
http://www.sql-99.org/SC32/WG4/Progression_Documents/FCD/fcd-datamining-2001-05.pdf
[TK98] S. Theodoridis, K. Koutroumbas, “Pattern Recognition”, Academic Press, 1998.
[TZA01] G. Tzanetakis, G. Essl, P. Cook, "Automatic music genre classification of audio
signals", Proceedings of ISMIR 2001.
[VIT67] A.J. Viterbi, “Error bounds for convolutional codes and an asymptotically
optimal decoding algorithm”, IEEE Transactions on Information Theory, vol. 13,
pp. 260-269, Apr. 1967.
[VH99] Remco C. Veltkamp and Michiel Hagedoorn: State-of-the-Art in Shape Matching,
The Netherlands, 1999.
[War 00] Ware C.: ‘Information Visualization: Perception for Design’. Academic Press,
San Diego, 2000.