Patterns for Next-Generation Database Systems
IST-2001-33058
2002-03 / PANDA Consortium -1- http://dke.cti.gr/panda/
PANDA Technical Report Series
TR Number: PANDA-TR-2003-01
Title: A Survey on Pattern Application Domains and Pattern
Management Approaches
Author(s): M. Vazirgiannis, M. Halkidi, G. Tsatsaronis, E. Vrachnos
(AUEB),
D. Keim (KONSTANZ),
P. Xeros, Y. Theodoridis (CTI),
A. Pikrakis, S. Theodoridis (UoA)
Date: 7 February 2003
Research supported by the Commission of the European Communities under the Information
Society Technologies (IST) Programme – Future and Emerging Technologies (FET)
A Survey on Pattern Application Domains and
Pattern Management Approaches
M. Vazirgiannis 1, M. Halkidi 1, G. Tsatsaronis 1, E. Vrachnos 1,
D. Keim 2, P. Xeros 3, Y. Theodoridis 3, A. Pikrakis 4, S. Theodoridis 4
1 Dept of Informatics, Athens University of Economics and Business, Athens, Greece [www.db-net.aueb.gr]
2 Institute of Computer Science, University of Konstanz, Konstanz, Germany [www.inf.uni-konstanz.de]
3 Data and Knowledge Engineering Group, Computer Technology Institute, Patras, Greece [http://dke.cti.gr]
4 Dept of Informatics, University of Athens, Athens, Greece [www.di.uoa.gr]
ABSTRACT:
Data-intensive applications produce complex information that poses new requirements on Database Management Systems (DBMSs). Such information is characterized by its huge data volume and by its diversity and complexity, since data processing methods such as pattern recognition, data mining and knowledge extraction yield knowledge artifacts like clusters, association rules and decision trees. These artifacts, which we call patterns, need to be stored and retrieved efficiently; to accomplish this, they must be expressed within a formalism and a language.
In this report we review the concept of patterns and their applicability in several research domains related to the proposed work, and we define the knowledge domain of the PANDA project. Interrelating these domains is important in order to define the problem in a comprehensive and complete way and to derive requirements for a pattern management system.
We examine the different types of patterns that can be extracted from a data set, in order to gather the requirements for the definition of a pattern model. This model constitutes the heart of the Pattern Base Management System to be designed.
KEYWORDS: patterns, data mining, pattern modeling, pattern databases, information retrieval,
Pattern Base Management Systems
Patterns for Next-Generation Database Systems
IST-2001-33058
2002-03 / PANDA Consortium -3- http://dke.cti.gr/panda/
1. INTRODUCTION - MOTIVATION
2. APPLICATION FIELDS AND PATTERN USAGE
   2.1. DATA MINING
        2.1.1 Data Mining Patterns
   2.2. SIGNAL PROCESSING: CONTENT-BASED MUSIC RETRIEVAL
        2.2.1 Problem definition
        2.2.2 A survey of existing research efforts
        2.2.3 Patterns in music retrieval
   2.3. PATTERNS IN INFORMATION RETRIEVAL
   2.4. MATHEMATICS
        2.4.1 Number Patterns
        2.4.2 Patterns in graphs
        2.4.3 Patterns in shapes
        2.4.4 Patterns in infinite sequences
        2.4.5 Patterns in algebra
        2.4.6 Patterns in Geometry
        2.4.7 Patterns in Cryptography
   2.5. VISUALIZATION
3. CURRENT ISSUES IN MODELING DATA MINING PROCESSES AND RESULTS
   3.1. DATA MINING GROUP / PREDICTIVE MODEL MARKUP LANGUAGE [DMG]
        3.1.1 Overview
        3.1.2 General Structure of a PMML Document
        3.1.3 Header
        3.1.4 Settings
        3.1.5 Data Dictionary
        3.1.6 Transformation Dictionary (Derived Values)
        3.1.7 PMML mining models
        3.1.8 Example: DTD of Association Rules Model
   3.2. SQL/MM
        3.2.1 Overview
        3.2.2 Part 6: Data Mining
        3.2.3 Example: Rule Model
   3.3. COMMON WAREHOUSE MODEL (CWM)
        3.3.1 Overview
        3.3.2 Data Mining Metamodel
   3.4. JAVA DM API
   3.5. ORACLE9I DATA MINING
   3.6. INFORMATION DISCOVERY DATAMINING SUITE
4. CONCLUSION
1. Introduction - Motivation
With recent hardware advances, complex information resulting from data-intensive
applications is posing requirements for novel Database Management Systems (DBMSs). Such
information possesses a number of key features:
• Huge volume of data: for example, images collected daily from various sources
(satellites, etc.) need to be stored and retrieved efficiently; in this case the data can be
represented concisely by a set of patterns (for instance, a mathematical formula might
represent the trajectory of a satellite). Moreover, traditional databases are growing huge
due to high transaction rates in various domains (banking, stock exchange,
telecommunications, etc.).
• Diversity and complexity: data processing methods (pattern recognition, data mining,
knowledge extraction) result in knowledge artifacts (i.e., clusters, rules, patterns in
general) that likewise need to be managed by a DBMS-like environment.
Knowledge artifacts thus arise as significant representational primitives in recently
computerized application domains and therefore call for integrated and efficient DBMS
support. Unfortunately, patterns have not so far been treated as persistent objects that can be
stored, retrieved and queried. It is now time to tackle the challenge of integrating the two
worlds (patterns and data) by designing fundamental approaches to providing database
support for patterns.
Various application domains dealing with patterns (telecom, medical, environmental
information systems, etc.) will directly benefit from a system that integrates data and pattern
management. This will be due to the fact that database support will enhance the maintenance
and manipulation of both their data collections and artifacts produced in the form of patterns.
Another field of advance results from the fundamentally novel paradigm arising from patterns
and affecting various database research areas, such as data models, query languages, query
processing and indexing techniques, and visual user interfaces.
In this report we review the concept of patterns and their applicability in several research
domains related to the proposed work, and we define the knowledge domain of the PANDA
project. Interrelating these domains is important in order to define the problem in a
comprehensive and complete way and to derive requirements for a pattern management
system.
The remainder of the report is organized as follows. Section 2 reviews pattern application
fields and existing research results on patterns; there is a rich body of fundamental research
related to patterns in mathematics, data mining and pattern recognition, as well as in several
application domains. Section 3 discusses current issues in modeling Data Mining processes
and results. We conclude in Section 4 with a set of requirements for designing a system that
can handle the modeling, storage, visualization and retrieval of patterns.
2. Application fields and pattern usage
Various application domains related to data management (storage, processing, retrieval,
data analysis) produce different forms of patterns representing insights into the data. In the
sequel we present some representative pattern application domains and the corresponding
types of patterns they produce.
2.1. Data Mining
The last decade has brought explosive growth in our capability to both generate and collect
data. Advances in database technology have provided the basic tools and methods for
efficient data collection, storage and retrieval. The result is a flood of data and a growing
data-glut problem in the worlds of science and business. The volume of data has outpaced
our ability to analyze and interpret it and to extract "useful" knowledge, creating the need
for a new generation of tools and techniques for intelligent database analysis. This need has
been recognized by researchers in different areas (artificial intelligence, statistics, data
warehousing, on-line analytical processing, expert systems and data visualization), and a
new research area has emerged, known as Data Mining.
Data Mining is mainly concerned with methodologies for extracting patterns from large data
repositories. Data mining is one step in the knowledge discovery process, although others
treat the term as a synonym for the more general Knowledge Discovery in Databases.
The data mining process may interact with users or with a knowledge base. The extracted
patterns are evaluated against interestingness measures to identify those representing
knowledge, i.e., interesting patterns. These patterns are presented to the user and may be
stored as new knowledge in the knowledge base. We therefore adopt a broad view of data
mining functionality: data mining is the process of discovering interesting knowledge from
large amounts of data stored in databases, data warehouses or other information
repositories.
2.1.1 Data Mining Patterns
Many data mining algorithms, each accomplishing a limited set of tasks, produce a particular
enumeration of patterns over data sets. According to the well-established data mining
process [BL96], the main tasks are: i) the definition/extraction of clusters that provide a
classification scheme, ii) the classification of database values into the categories defined,
iii) the extraction of association rules or other knowledge artefacts, and iv) the discovery
and analysis of sequences.
Since data mining tasks can be performed on various types of data stores and database
systems, different kinds of data patterns can be mined. In some cases users have no idea
which kinds of patterns in their data could be interesting; it is therefore important that a data
mining system mine and store multiple kinds of patterns so as to accommodate different user
expectations and applications. Another requirement in data mining is the granularity of the
results: in some cases it is important to have different levels of abstraction for the patterns
mined from a data repository, depending on the application or user requirements.
In the sequel, we discuss the data mining functionalities and the kinds of patterns that can be
mined from a body of data.
2.1.1.1 Clustering
Clustering is one of the most useful tasks in the data mining process for discovering groups
and identifying interesting distributions and patterns in the underlying data. The clustering
problem is about partitioning a given data set into groups (clusters) such that the data points
in a cluster are more similar to each other than to points in different clusters [GRS98]. For
example, consider a retail database whose records contain the items purchased by customers.
A clustering procedure could group the customers in such a way that customers with similar
buying patterns are in the same cluster. Thus, the main concern in the clustering process is to
reveal the organization of patterns into "sensible" groups, which allow us to discover
similarities and differences, as well as to derive useful conclusions about them. This idea is
applicable in many fields, such as the life sciences, medical sciences and engineering.
Clustering may be found under different names in different contexts, such as unsupervised
learning (in pattern recognition), numerical taxonomy (in biology, ecology), typology (in the
social sciences) and partitioning (in graph theory) [TK98].
In the clustering process, there are no predefined classes and no examples showing what
kind of relations should hold among the data, which is why clustering is perceived as an
unsupervised process [BL96]. On the other hand, classification is a procedure of assigning
a data item to a predefined set of categories [FSSU96]. Clustering produces initial categories
in which values of a data set are classified during the classification process.
The clustering process may result in different partitionings of a data set, depending on the
specific criterion used for clustering. Thus, preprocessing is needed before a clustering task
is applied to a data set. The basic steps of the clustering process are presented in Figure 1
and can be summarized as follows [FSSU96]:
• Feature selection. The goal is to select the features on which clustering is to be
performed so as to encode as much information as possible concerning the task of
interest. Thus, preprocessing of the data may be necessary prior to its use in the
clustering task.
• Clustering algorithm. This step refers to the choice of an algorithm that results in a good
clustering scheme for the data set. A clustering algorithm is mainly characterized by a
proximity measure and a clustering criterion, which also determine its ability to define a
clustering scheme that fits the data set.
  i) Proximity measure: a measure that quantifies how "similar" two data points (i.e.
  feature vectors) are. In most cases we have to ensure that all selected features
  contribute equally to the computation of the proximity measure and that no features
  dominate others.
  ii) Clustering criterion: the criterion, expressed via a cost function or some other type of
  rule, that defines a "good" clustering. We should take into account the type of clusters
  expected to occur in the data set, so that the chosen criterion leads to a partitioning
  that fits the data set well.
• Validation of the results. The correctness of the clustering results is verified using
appropriate criteria and techniques. Since clustering algorithms define clusters that are
not known a priori, irrespective of the clustering method, the final partition of the data
requires some kind of evaluation in most applications [RLR98].
• Interpretation of the results. In many cases, experts in the application area have to
integrate the clustering results with other experimental evidence and analysis in order to
draw the right conclusions.
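The steps above can be illustrated with a minimal k-means sketch (the data, the deterministic "first k points" initialization, and names such as `kmeans` are ours, chosen for illustration; a real pipeline would add feature selection, validation indices and interpretation):

```python
def kmeans(points, k, iters=10):
    """Plain k-means sketch: Euclidean proximity measure, mean-update criterion."""
    # Deterministic initialization for reproducibility: first k points as centers.
    centers = list(points[:k])
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # Assignment step: each point joins its nearest center (proximity measure).
        clusters = [[] for _ in range(k)]
        for p in points:
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
            clusters[dists.index(min(dists))].append(p)
        # Update step: move each center to the mean of its cluster.
        for i, cl in enumerate(clusters):
            if cl:
                centers[i] = tuple(sum(xs) / len(cl) for xs in zip(*cl))
    return centers, clusters

# Hypothetical 2-D feature vectors forming two obvious groups.
data = [(0.0, 0.1), (10.0, 10.1), (0.2, -0.1), (9.8, 10.0), (-0.1, 0.0), (10.1, 9.9)]
centers, clusters = kmeans(data, k=2)
print([len(c) for c in clusters])  # -> [3, 3]
```

The proximity measure (squared Euclidean distance) and the clustering criterion (minimize within-cluster distance to the mean) are exactly the two choices that, as noted above, characterize the algorithm.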
Clustering Applications. Cluster analysis is a major tool in a number of applications in many
fields of business and science. Hereby, we summarize the basic directions in which clustering
is used [TK98]:
• Data reduction. Cluster analysis can contribute in compression of the information
included in data. In several cases, the amount of available data is very large and its
processing becomes very demanding. Clustering can be used to partition the data set into
a number of “interesting” clusters. Then, instead of processing the data set as an entity,
we adopt the representatives of the defined clusters in our process. Thus, data
compression is achieved.
• Hypothesis generation. Cluster analysis is used here in order to infer some hypotheses
concerning the data. For instance we may find in a retail database that there are two
significant groups of customers based on their age and the time of purchases. Then, we
[Figure 1. Steps of the Clustering Process: Data → Feature Selection → Clustering Algorithm Selection → Validation of results → Interpretation → Final Clusters / Knowledge]
may infer some hypotheses about the data, that is: "young people go shopping in the
evening", "old people go shopping in the morning".
• Hypothesis testing. In this case, cluster analysis is used to verify the validity of a specific
hypothesis. For example, consider the hypothesis: "Young people go shopping in the
evening". One way to verify it is to apply cluster analysis to a representative set of stores.
Suppose that each store is represented by its customers' details (age, job, etc.) and the
times of transactions. If, after applying cluster analysis, a cluster corresponding to
"young people buy in the evening" is formed, then the hypothesis is supported.
• Prediction based on groups. Cluster analysis is applied to the data set, and the resulting
clusters are characterized by the features of the patterns that belong to them. Unknown
patterns can then be classified into these clusters based on their similarity to the clusters'
features, and useful knowledge related to our data can be extracted. Assume, for
example, that cluster analysis is applied to a data set concerning patients infected by the
same disease. The result is a number of clusters of patients, grouped according to their
reaction to specific drugs. For a new patient, we then identify the cluster to which he/she
can be assigned and, based on this, decide on his/her medication.
More specifically, some typical applications of clustering are found in the following fields
[HK01]:
• Business. In business, clustering may help marketers discover significant groups in their
customers’ database and characterize them based on purchasing patterns.
• Biology. In biology, it can be used to define taxonomies, categorize genes with similar
functionality and gain insights into structures inherent in populations.
• Spatial data analysis. Due to the huge amounts of spatial data that may be obtained from
satellite images, medical equipment, Geographical Information Systems (GIS), image
database exploration etc., it is expensive and difficult for the users to examine spatial data
in detail. Clustering may help to automate the process of analysing and understanding
spatial data. It is used to identify and extract interesting characteristics and patterns that
may exist in large spatial databases.
• Web mining. In this case, clustering is used to discover significant groups of documents
on the Web, a huge collection of semi-structured documents. This classification of Web
documents assists information discovery.
In general terms, clustering may serve as a pre-processing step for other algorithms, such as
classification, which would then operate on the detected clusters.
2.1.1.2 Classification – Decision Making
The classification problem has been studied extensively in statistics, pattern recognition and
machine learning community as a possible solution to the knowledge acquisition or
knowledge extraction problem [RS98]. A number of classification techniques have been
developed and are available in bibliography. Among these, the most popular are: Bayesian
classification, Neural Networks and Decision Trees.
Bayesian classification is based on Bayesian statistical decision theory. The aim is to
classify a sample x into one of the given classes c1, c2, …, cN using a probability model
defined according to Bayes' theory [CS96]. Each category ci is characterized by a prior
probability P(ci) of observing that category. Also, we assume that a given sample x belongs
to category ci with conditional probability density p(x|ci). Using these definitions and Bayes'
formula, we define the posterior probability q(ci|x), proportional to p(x|ci)·P(ci). An input
pattern is classified into the category with the highest posterior probability. As this brief
description shows, complete knowledge of the underlying probability laws is necessary in
order to perform Bayesian classification [Hor+98].
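The rule above can be sketched in a few lines. The two-class model below (equal priors, Gaussian class-conditional densities with the stated means and standard deviations) is entirely hypothetical, chosen only to make the posterior comparison concrete:

```python
import math

def gaussian_pdf(x, mean, std):
    """Class-conditional density p(x|ci), assumed Gaussian for this sketch."""
    return math.exp(-((x - mean) ** 2) / (2 * std ** 2)) / (std * math.sqrt(2 * math.pi))

# Hypothetical two-class model: prior probabilities P(ci) and per-class (mean, std).
priors = {"c1": 0.5, "c2": 0.5}
params = {"c1": (0.0, 1.0), "c2": (4.0, 1.0)}

def classify(x):
    """Bayes rule: the posterior q(ci|x) is proportional to p(x|ci) * P(ci);
    the input pattern is assigned to the class with the highest posterior."""
    posteriors = {ci: gaussian_pdf(x, *params[ci]) * priors[ci] for ci in priors}
    return max(posteriors, key=posteriors.get)

print(classify(0.5))  # -> c1 (the sample lies near the mean of c1)
```

Note that the sketch presumes the densities and priors are known exactly, which is precisely the "complete knowledge of probability laws" requirement mentioned above.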
Decision trees are one of the widely used techniques for classification and prediction. A
number of popular classifiers construct decision trees to generate classification models.
A decision tree is constructed from a training set of pre-classified data. Each internal node
of the tree specifies a test of an attribute of the instance, and each branch descending from
that node corresponds to one of the possible values of this attribute. Each leaf corresponds
to one of the defined classes. To classify a new instance using a decision tree, we start at the
root and test the attribute specified by that node; successive internal nodes are visited until a
leaf is reached. At each internal node, the node's test is applied to the instance, and the
outcome determines the branch traversed and the next node visited [Mit+97]. The class of the
instance is the class of the final leaf node.
A number of algorithms for constructing decision trees have been developed over the years.
Some of the most widely known in the literature are ID3 [Mit+97], C4.5 [Quin+93], SPRINT
[SAM96], SLIQ [MAR96] and CART. In general terms, most of these algorithms have two
distinct phases, a building phase and a pruning phase [Mit+97]. In the building phase, the
training data set is recursively partitioned until all the instances in a partition have the same
class. The result is a tree that classifies every record from the training set. However, the tree
constructed may be sensitive to statistical
irregularities of the training set. Thus, most algorithms perform a pruning phase after the
building phase, in which nodes are pruned to prevent overfitting and to obtain a tree with
higher accuracy.
The various decision tree generation algorithms use different methods for selecting the test
criterion for partitioning a set of records [RS98]. One of the earliest algorithms, CLS,
examines the solution space of all possible decision trees to some fixed depth [RS98], then
selects a test that minimizes the computational cost of classifying a record; this cost
comprises the cost of determining the feature values to be tested as well as the cost of
misclassification. The algorithms ID3 [Mit+97] and C4.5 [Quin+93] rely on a statistical
property called information gain to select the attribute to be tested at each node in the tree.
The measure is based on entropy as used in information theory, which characterizes the
purity of an arbitrary collection of examples. Alternatively, algorithms like SLIQ [MAR96]
and SPRINT [SAM96] select the attribute to test based on the GINI index rather than the
entropy measure; the best attribute for testing (i.e. the attribute that gives the
best partitioning) gives the lowest value for the GINI index.
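The two splitting measures can be sketched directly from their definitions. The 10-record label column and its three-way split by a candidate attribute are hypothetical, chosen so that two partitions are pure and one is mixed:

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a set of class labels, the measure underlying ID3 and C4.5."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """GINI index of a set of class labels, the measure used by SLIQ and SPRINT."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def information_gain(labels, splits):
    """Entropy reduction achieved by partitioning `labels` into `splits`."""
    n = len(labels)
    return entropy(labels) - sum(len(s) / n * entropy(s) for s in splits)

# Hypothetical 10-record class column, split three ways by a candidate attribute.
labels = ["down"] * 5 + ["up"] * 5
splits = [["down", "down", "down"],        # attribute value A: pure partition
          ["down", "down", "up", "up"],    # attribute value B: mixed partition
          ["up", "up", "up"]]              # attribute value C: pure partition
print(round(information_gain(labels, splits), 3))  # -> 0.6
print(round(gini(labels), 3))                      # -> 0.5
```

The attribute maximizing `information_gain` (equivalently, minimizing the weighted GINI of its partitions) is the one chosen for the test at the current node.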
Another classification approach used in many data mining applications for prediction and
classification is based on neural networks. More specifically, methods of this kind use a
neural network to build a model for classification or prediction. The main steps of this
process (i.e. building a classification model) are [BL97]:
• Identification of the input and output features.
• Setting up a network with an appropriate topology.
• Selection of a suitable training set.
• Training the network on a representative data set. The data have to be represented in
such a way as to maximize the ability of the network to recognize patterns in it.
• Testing the network using a test set that is independent of the training set.
• Applying the model generated by the network to predict the classes (outcomes)
of unknown instances (inputs).
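The training step above can be sketched for the simplest possible topology, a single logistic unit trained by gradient descent (a hypothetical toy setup: the AND-function data, learning rate and epoch count are ours; a real application would use a richer topology and a test set independent of the training set):

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_network(train_set, lr=0.5, epochs=500, seed=1):
    """Train one logistic unit (two inputs + bias) by stochastic gradient descent."""
    rng = random.Random(seed)
    w = [rng.uniform(-0.5, 0.5) for _ in range(3)]  # [w1, w2, bias]
    for _ in range(epochs):
        for x1, x2, target in train_set:
            out = sigmoid(w[0] * x1 + w[1] * x2 + w[2])
            err = target - out  # gradient term of the cross-entropy loss
            w[0] += lr * err * x1
            w[1] += lr * err * x2
            w[2] += lr * err
    return w

def predict(w, x1, x2):
    """Apply the trained model to an unknown instance."""
    return 1 if sigmoid(w[0] * x1 + w[1] * x2 + w[2]) > 0.5 else 0

# Toy training set: the (linearly separable) AND function of two binary inputs.
train = [(0, 0, 0), (0, 1, 0), (1, 0, 0), (1, 1, 1)]
model = train_network(train)
print([predict(model, x1, x2) for x1, x2, _ in train])  # -> [0, 0, 0, 1]
```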
Among the above-described classification techniques, the most commonly used are decision
trees. Compared to a neural network or a Bayesian classifier, decision trees are more easily
interpreted and comprehended by humans [RS98]. Training a neural network can take a long
time and thousands of iterations, making it unsuitable for large data sets. Moreover, decision
tree generation is based only on the information already contained in the training data set, in
contrast to other methods that require additional information (e.g. prior probabilities in the
Bayesian approach).
The above review of some of the most widely known classical classification methods shows
that relatively few efforts have been devoted to making data analysis techniques (i.e.
classification) handle uncertainty. These approaches produce a crisp classification decision:
an object either belongs to a class or does not, so all objects assigned to a class are
considered to belong to it equally, and the classes are considered non-overlapping. There is
clearly no notion of uncertainty representation in these methods, although exploiting and
revealing uncertainty is recognized as an important issue in the data mining research area
[GMPS96]. For this reason, the interest of the research community has turned to this
problem, and new classification approaches handling uncertainty have recently been
proposed in the literature.
An approach to pattern classification based on fuzzy logic is presented in [Chiu97]. The
main idea is the extraction of fuzzy rules for identifying each class of data. The rule
extraction methods are based on estimating clusters in the data; each cluster obtained
corresponds to a fuzzy rule relating a region of the input space to an output class. Thus, for
each class ci a cluster center xi is defined, providing the rule: if {input is near xi} then class
is ci. For a given input vector x, the system computes the degree of fulfilment of each rule,
and the consequent of the rule with the highest degree of fulfilment is selected as the output
of the fuzzy system. Consequently, the approach uses fuzzy logic to find the best class for a
data value, but the final result is still the assignment of each data item to one of the classes.
Moreover, it is possible to compute the classification error for a data sample x belonging to
some class c, using the degree of fulfilment among all rules that assign x to class c and the
degree of fulfilment among all rules that do not. This error measure indicates how well the
defined rules classify the data.
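The "if input is near xi then class is ci" scheme can be sketched as follows. The cluster centres and the Gaussian form of the "near" membership are hypothetical illustration choices, not taken from [Chiu97]:

```python
import math

# Hypothetical cluster centres extracted per class: each centre yields one
# fuzzy rule of the form "if input is near xi then class is ci".
rule_centres = {"c1": (0.0, 0.0), "c2": (5.0, 5.0)}

def fulfilment(x, centre, sigma=1.0):
    """Degree of fulfilment of a rule: Gaussian membership in 'near centre'."""
    d2 = sum((a - b) ** 2 for a, b in zip(x, centre))
    return math.exp(-d2 / (2 * sigma ** 2))

def classify(x):
    """Fire all rules; the consequent of the rule with the highest degree of
    fulfilment becomes the crisp output class."""
    return max(rule_centres, key=lambda ci: fulfilment(x, rule_centres[ci]))

print(classify((0.4, 0.2)))  # -> c1
```

Although membership degrees are fuzzy, the final `max` step makes the output crisp, exactly as noted above.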
In [Jan+98], an approach based on fuzzy decision trees is presented, aiming at uncertainty
handling. It combines symbolic decision trees with fuzzy logic concepts so as to enhance
decision trees with the additional flexibility offered by fuzzy representations. More
specifically, the authors propose a procedure that builds a fuzzy decision tree using the
classical decision tree algorithm (ID3) while adapting the norms used in fuzzy logic to
represent uncertainty [Jan+98]. The tree-building procedure is therefore the same as that of
ID3; the difference is that a training example can be partially classified into several tree
nodes. Thus, each data instance can belong to one or more nodes with a different membership
degree, calculated from the restrictions along the path from the root to the specific node.
However, following decision tree methodology, the classification inferences are crisp: to
determine the classification assigned to a sample, we find the leaves whose restrictions are
satisfied by the sample and combine their decisions into a single crisp response.
Induction of classification rules. The knowledge produced during the classification process
can be extracted and represented in the form of rules. As discussed in the previous section,
there are classification approaches, such as [Jan+98], that result in a set of rules describing
the classification patterns in a data set. Another common classification approach is decision
trees, in which the extracted patterns of knowledge are described in the form of a tree.
However, rules are easier for humans to understand, particularly when the tree is very large.
In the sequel we briefly describe an approach for converting a decision tree into a set of
classification rules.
Given a decision tree, a rule can be created for each path from the root to a leaf node; that
is, each leaf generates one rule. The conditions along the path to the leaf form the
conjunctive antecedent, and the class prediction held by the leaf node forms the consequent.
Example
Table 1. Training dataset

Example  AGE      COMPETITION  TYPE      PROFIT
1        old      yes          software  down
2        old      no           software  down
3        old      no           hardware  down
4        midlife  yes          software  down
5        midlife  yes          hardware  down
6        midlife  no           hardware  up
7        midlife  no           software  up
8        young    yes          software  up
9        young    no           hardware  up
10       young    no           software  up
Figure 2. The decision tree defined for the dataset of Table 1

    Age
    |- Young  -> Profit_up
    |- Old    -> Profit_down
    |- Middle -> Competition
                 |- Yes -> Profit_down
                 |- No  -> Profit_up
Table 1 presents the data set used for training a decision tree. The decision tree that classifies
our data into the classes Profit_up and Profit_down is presented in Figure 2.
The knowledge represented in the decision tree of Figure 2 can be described in the form of
rules as follows:
IF Age="Young" THEN Profit="Up"
IF Age="Old" THEN Profit="Down"
IF Age="Middle" AND Competition="Yes" THEN Profit="Down"
IF Age="Middle" AND Competition="No" THEN Profit="Up"
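The conversion described above can be sketched in a few lines. The tree of Figure 2 is encoded here as a nested dictionary, which is an illustrative representation chosen for this sketch, not a standard format.

```python
# A sketch of the tree-to-rules conversion: walk every root-to-leaf path,
# collecting the branch conditions as the conjunctive antecedent; the leaf's
# class prediction becomes the consequent.

tree = {"Age": {"Young": "Up",
                "Old": "Down",
                "Middle": {"Competition": {"Yes": "Down", "No": "Up"}}}}

def tree_to_rules(node, conditions=()):
    if isinstance(node, str):                      # leaf: emit one rule
        antecedent = " AND ".join('%s="%s"' % c for c in conditions)
        return ['IF %s THEN Profit="%s"' % (antecedent, node)]
    rules = []
    (attribute, branches), = node.items()
    for value, child in branches.items():          # one path per branch
        rules += tree_to_rules(child, conditions + ((attribute, value),))
    return rules

for rule in tree_to_rules(tree):
    print(rule)
```

Running the sketch on the Figure 2 tree reproduces the four rules listed above, one per leaf.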
Classification application fields. As we have already discussed, classification is a form of
data analysis that can be used to define models describing data classes or to predict data
trends. It can be used for making intelligent decisions in business and science. For example, a
classification model may be built to categorize customers of computer equipment based on
their income and occupation, or, given a database of patients' diagnostic results, a set of
classification rules can be extracted to identify patients as having excellent or fair health
progress.
2.1.1.3 Association Rules
Mining rules is one of the main tasks in the data mining process. It has attracted considerable
interest because a rule provides a concise statement of potentially useful information that is
easily understood by end-users.
Association rules reveal underlying interactions between the attributes in a data set. These
interactions can be presented in the following form: A → B, where A, B refer to sets of
attributes' values in the underlying data. More specifically, A and B are selected so as to be
frequent itemsets. The following is a formal statement of the problem as given in [AS94]:
Let I = {i1, i2, …, im} be a set of literals, called items. Let D be a set of transactions, where
each transaction T is a set of items such that T ⊆ I. We say that a transaction T contains
X, a set of some items in I, if X ⊆ T. An association rule is an implication of the form
A ⇒ B, where A ⊂ I, B ⊂ I and A ∩ B = ∅.
The rule A ⇒ B has confidence c in the transaction set D if c% of the transactions in D
that contain A also contain B, and support s if s% of the transactions in D contain A ∪ B.
Given a set of transactions D, generate all association rules that have support and
confidence greater than the corresponding user-specified thresholds.
The intuitive meaning of such a rule is that records in the dataset which contain the attributes
in A tend also to contain the attributes in B [SA95]. We note also that the extracted rules
have to satisfy some user-defined thresholds related to association rule measures (such as
support, confidence, leverage, lift).
These measures give an indication of an association rule's importance and confidence. They
represent the predictive advantage of a rule and thus help to identify interesting patterns of
knowledge in the data and to make decisions.
Association rule interestingness measures. Consider an association rule, denoted as
LHS → RHS. In what follows we refer to the left-hand side and the right-hand side of the rule
as LHS and RHS, respectively. The measures related to the rule are [HK01, BL97]:
- Strength
The strength of an association rule is the proportion of the cases covered by the LHS of
the rule that are also covered by the RHS:
strength = n(RHS ∩ LHS) / n(LHS), taking values in [0,1],
where n(LHS) denotes the number of cases covered by the LHS.
The rule strength is also referred to as confidence.
A value of strength near 1 is an indication of an important association rule.
- Coverage
The coverage of an association rule is the proportion of cases in the data that have the
attribute values or items specified in the LHS of the rule:
coverage = n(LHS) / N = P(LHS), taking values in [0,1],
where N is the total number of cases under consideration.
An association rule with a coverage value near 1 can be considered an important
association rule.
- Support
The support of an association rule is the proportion of all cases in the dataset that satisfy
the rule, i.e., both the LHS and the RHS of the rule. More specifically, support is defined as
support = n(LHS ∩ RHS) / N,
where N is the total number of cases under consideration and n(LHS ∩ RHS) denotes the
number of cases covered by LHS ∩ RHS.
Support can be considered an indication of how often a rule occurs in a data set and, as a
consequence, of how significant the rule is.
The above-discussed interestingness measures, support and confidence, are widely used in
the association rule extraction process and are also known as Agrawal and Srikant's
itemset measures. From their definitions, we could say that confidence corresponds to the
strength of a rule, while support corresponds to its statistical significance.
- Leverage
The leverage of an association rule is the proportion of additional cases covered by both
the LHS and RHS above those expected if the LHS and RHS were independent of each
other. This is a measure of the importance of the association that includes both the
strength and the coverage of the rule. More specifically, it is defined as
leverage = P(RHS ∩ LHS) − P(LHS) · P(RHS), taking values in [-1,1].
Values of leverage equal to or below 0 indicate that LHS and RHS co-occur no more often
than expected under independence. On the other hand, values of leverage near 1 are an
indication of an important association rule.
- Lift
The lift of an association rule is the strength divided by the proportion of all cases that are
covered by the RHS. This is a measure of the importance of the association and it is
independent of coverage:
lift = strength / P(RHS),
taking values in ℜ+ (the space of the positive real numbers).
As for the values of lift, there are some conditions to be considered:
1. Lift values close to 1 mean that RHS and LHS are independent, which indicates
that the rule is not important.
2. Lift values close to +∞. Here, we have the following sub-cases:
• RHS ⊆ LHS or LHS ⊆ RHS. If either of these cases holds, we may
conclude that the rule is not important.
• P(RHS) is close to 0 or P(RHS|LHS) is close to 1. The first case
indicates that the rule is not important. On the other hand, the second case
is a good indication that the rule is an interesting one.
3. lift = 0 means that P(RHS|LHS) = 0 ⇔ P(RHS ∩ LHS) = 0, which indicates that
the rule is not important.
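All the measures above can be computed directly from transaction counts. A minimal sketch follows; the rule representation (a pair of item sets) and the toy market-basket transactions are hypothetical.

```python
# Computes strength (confidence), coverage, support, leverage and lift for a
# rule LHS -> RHS over a list of transactions (each transaction is a set).

def measures(transactions, lhs, rhs):
    N = len(transactions)
    n_lhs = sum(1 for t in transactions if lhs <= t)
    n_rhs = sum(1 for t in transactions if rhs <= t)
    n_both = sum(1 for t in transactions if lhs <= t and rhs <= t)
    strength = n_both / n_lhs          # a.k.a. confidence
    coverage = n_lhs / N               # P(LHS)
    support = n_both / N               # P(LHS and RHS)
    leverage = support - coverage * (n_rhs / N)
    lift = strength / (n_rhs / N)
    return strength, coverage, support, leverage, lift

# Hypothetical market-basket data.
T = [{"bread", "milk"}, {"bread", "milk", "beer"}, {"bread"}, {"milk"}]
print(measures(T, {"bread"}, {"milk"}))
```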
Mining association rules. The process of mining association rules is based on two steps
[HK01]:
• Find all frequent itemsets. An itemset (a set of items) that occurs at least as frequently as
a pre-determined minimum support is a frequent itemset.
• Generate strong association rules from the frequent itemsets. These rules must satisfy
minimum support and minimum confidence.
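The two steps can be sketched with a naive level-wise algorithm in the spirit of Apriori [AS94]. This is a didactic sketch rather than an efficient implementation, and the transactions are hypothetical.

```python
from itertools import combinations

# Step 1: find frequent itemsets by level-wise candidate generation.
# Step 2: derive rules meeting the minimum confidence from each frequent set.

def apriori(transactions, min_support, min_confidence):
    N = len(transactions)
    support = {}                                   # itemset -> relative support
    items = {i for t in transactions for i in t}
    level = [frozenset([i]) for i in items]
    while level:
        next_level = set()
        for candidate in level:
            s = sum(1 for t in transactions if candidate <= t) / N
            if s >= min_support:
                support[candidate] = s
                for i in items - candidate:        # grow frequent sets by one item
                    next_level.add(candidate | {i})
        level = next_level
    rules = []
    for itemset, s in support.items():
        for r in range(1, len(itemset)):
            for lhs in map(frozenset, combinations(itemset, r)):
                conf = s / support[lhs]            # subsets of frequent sets are frequent
                if conf >= min_confidence:
                    rules.append((set(lhs), set(itemset - lhs), s, conf))
    return rules

T = [{"bread", "milk"}, {"bread", "milk", "beer"}, {"bread", "milk"}, {"beer"}]
for lhs, rhs, s, conf in apriori(T, min_support=0.5, min_confidence=0.8):
    print(lhs, "->", rhs, "support=%.2f confidence=%.2f" % (s, conf))
```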
Association Rules Applications. A typical application of association rule mining is market
basket analysis. This process analyses customers' buying habits by finding associations
between the different items that customers place in their "shopping baskets". The discovery of
such associations can help retailers develop marketing strategies by finding which items are
frequently purchased together by customers. In general terms, the discovery of interesting
association relationships among huge amounts of business or scientific database records can
support decision-making processes by providing insight into the relationships among
database items.
2.1.1.4 Sequential Patterns - Time Series Analysis
Sequential pattern mining is the mining of frequently occurring patterns related to time or
other sequences. Most studies of sequential pattern mining concentrate on symbolic patterns.
The problem of mining sequential patterns can be stated as follows:
Given a potentially large pattern (string) S, we are interested in sequential patterns of the
form a → b, where a, b are substrings of S, such that the frequency of ab is not less
than some minimum support and the probability that a is immediately followed by b is
not less than a minimum confidence.
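For a character string S, the statement above can be sketched directly: the support of a → b is the frequency of the substring ab, and the confidence is the probability that an occurrence of a is immediately followed by b. The string and substrings below are hypothetical.

```python
# A sketch of the problem statement for a character string S.

def sequential_rule(S, a, b):
    n_a = S.count(a)           # occurrences of a (non-overlapping)
    n_ab = S.count(a + b)      # occurrences of a immediately followed by b
    confidence = n_ab / n_a if n_a else 0.0
    return n_ab, confidence

S = "abcabdab"
freq, conf = sequential_rule(S, "ab", "c")
print(freq, conf)   # "ab" occurs 3 times and is followed by "c" once
```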
Also, the user can specify constraints on the kinds of sequential patterns to be mined by
providing pattern templates in the form of serial episodes, parallel episodes, or regular
expressions [HK01]. A serial episode is a set of events that occur in a total order, whereas a
parallel episode is a set of events whose occurrence ordering is not important. For instance,
the sequence A → B is a serial episode implying that event B follows event A, while
A&B is a parallel episode indicating that events A and B occur in our data but their
ordering is not important.
The user can also specify constraints in the form of regular expressions. For example, the
template (A|B)C*(D|E) indicates that the user would like to find patterns where events A and
B occur first but their relative ordering is not important, followed by the event C, followed
by the events D and E (D can occur before or after E).
Sequential patterns applications. In daily and scientific life, sequential data are available and
used everywhere. Some representative examples are text, music notes, weather data, satellite
data streams, business transactions, telecommunications records, experimental runs, DNA
sequences and histories of medical records. Discovering sequential patterns can benefit the
user or scientist by predicting coming activities, interpreting recurring phenomena or
extracting similarities.
2.2. Signal Processing: Content-based Music Retrieval
2.2.1 Problem definition
Large-scale storage of sound and music has only become possible in the last decade. In
addition, the new possibility for wide-area distribution of multimedia over the Internet has
given rise to new requirements for flexible and powerful databases for musical and audio data.
Earlier systems relied on a single kind of data representation (e.g. MIDI scores or sampled
sound effects) and were mainly oriented towards the needs of music librarians. However,
recent technologies, such as progress in networking transmission, compression of audio and
protection of digital data have made possible the quick and safe delivery of music through
networks (Internet or digital-audio broadcasting services). These techniques have given users
access to huge catalogues of annotated music. Although the above technologies have
addressed the distribution problem, they have also raised the issue of choosing from massive
amounts of data [PAC00]. In order to estimate the complexity of music
selection it is useful to consider a variety of queries that users are expected to form while
interacting with a music delivery system. The following examples of user queries are likely
to be encountered:
− “Show me a list of Irish Folk songs available by content provider X”
− “I want to browse through Bach Fugues recorded in C minor and performed with a
clavichord”
− “I have recorded 5 seconds of audio from a radio transmission and saved in file yyy.wav,
but I cannot tell what the title of the song is. Could you give me a hint?”
− “Are there any harpsichord recordings available?”
− “Can I listen to the chorus line of song X?”
It is clear that the aforementioned queries address information hidden in the content of the
music signal and raise the following challenges related to content-based music retrieval:
instrument recognition, melody spotting, musical key extraction, musical pattern recognition,
composer recognition, music structure extraction and music segmentation, to name but a few.
In an attempt to imitate the human cognitive system, a variety of signal processing solutions
have been developed to address the challenges imposed by music recognition and
understanding.
2.2.2 A survey of existing research efforts
− Several researchers have addressed the problem of feature extraction from stored pieces
of music. Feature selection is a topic of ongoing research and various solutions have been
proposed: [HER02] has performed a comparative study of features including Mel-
Frequency Cepstrum Coefficients (MFCC’s), attack-related descriptors, decay-related
descriptors and relative energy descriptors for the classification of drum sounds. [ERR01]
has compared features for musical instrument recognition. [AGO01] has performed a
study of spectral features to classify musical instrument timbres. It turns out that for a
system to be able to support several kinds of queries, it should be able to extract a wide
range of different kinds of features from the data loaded in the database.
− Stochastic modelling and dynamic time warping techniques have also been proposed for
the comparison of music signals. For example, Hidden Markov Models have been used to
build statistical representations of musical pieces. Music similarity can then be reduced to
probability extraction based on these models [PIK01], [PIK02].
− Various researchers have also tried to achieve automatic extraction of the structure of
music recordings in an attempt to prove the assumption that music similarity may be
considered, in part, as a comparison of musical structure. Music is, indeed, often described
(at a high level of abstraction) in terms of the structure of repeated patterns. Discovering
musical structure in audio recordings has been addressed by [DAN02], [CON02],
[MER01], [MAR01], etc.
The above list, although not exhaustive, suggests that music retrieval should account for a
variety of music representations and therefore makes it necessary to enhance database
functionality to account for the automated extraction and comparison of such representations.
2.2.3 Patterns in music retrieval
Following the terminology introduced in Section 2.1, content-based music retrieval can be
considered as a data mining process that extracts patterns of interest from a music corpus and
stores these patterns as new knowledge in the knowledge base. Patterns stemming from raw
audio signals can be summarized into the following types:
2.2.3.1 Musical structure
Music is often described in terms of the structure of repeated phrases. For example, many
songs have the form AABA, where each letter represents an instance of a phrase. In this case,
the sequence AABA is the pattern of interest. Constructing descriptions of music in this form
requires the automated discovery of recurrent patterns by means of pattern extraction
algorithms. Recurrent pattern finding can be thought of as a case of data mining. In addition,
the discovery of patterns that occur across many pieces of the same musical style or composer
can yield signature motifs (structure classes) that can be used for the classification of new
pieces of music in a specific musical style or composer [TZA01]. Following the terminology
introduced in Section 2.1.1.2, identifying the musical style and/or composer of a recording
may be considered as a classification problem, i.e., the structure of the recording in question
is classified to one of the available signature motifs (classes) using an appropriate method
(e.g., Bayesian classification, neural networks or decision trees). Alternatively, Hidden
Markov Models (HMMs) may be employed in order to determine structure similarity. To this
end, an HMM is built per class and is trained with structures belonging to that class.
Classification is subsequently achieved by presenting the musical structure of the recording in
question to the set of HMMs. Each HMM generates a recognition probability and the
unknown recording is classified according to the highest probability [PIK02]. Building an
HMM is preferably a supervised procedure [TK98]. Training the HMMs should be repeated
occasionally to account for new instances of the class.
2.2.3.2 Single feature vectors representing entire musical recordings
In this case the feature selection process follows the guidelines given in Section 2.1.1.1. Once
appropriate features have been selected, each recording can be represented by a feature vector
(pattern). Such feature vectors are multi-dimensional and usually consist of a combination of
Mel-frequency cepstrum coefficients, average energy descriptors, zero-crossing rate envelopes,
etc. Clustering methods may be subsequently applied depending on the user queries that need
to be addressed. Attention must be given to the fact that music signals have a time-varying
nature and as a result, a single feature vector can only estimate the average behaviour of the
signal. However, this form of representation can be useful when dealing with short music
segments, e.g., classification of drum sounds [HER02].
2.2.3.3 Sequences of possibly multi-dimensional feature vectors
The limitations inherent in the “single feature vector approach” can be overcome if a
sequence of feature vectors is extracted from the music signal by means of a moving window
technique [REI01]. Moving window techniques break the signal into overlapping frames and
a possibly multi-dimensional feature vector is extracted from each frame. In this case, the
sequence of feature vectors can be considered as a pattern. Single-dimensional feature vectors
may also be extracted. Such is the case, for example, when a sequence of music intervals is
generated from a piece of monophonic music. When comparing feature sequences,
conventional classification methods like Bayesian classification are not adequate. Therefore,
one has to resort to other techniques, such as Dynamic Time Warping (DTW) and Hidden
Markov Models (HMMs), in order to determine similarity [TK98], [DEL87]. The following
two sections describe DTW and HMMs in brief.
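Before turning to those methods, the moving-window extraction itself can be sketched. The per-frame features shown (average energy and zero-crossing rate) are two of the descriptors mentioned above; the frame parameters and the toy signal are hypothetical.

```python
# A sketch of the moving-window technique: the signal is cut into overlapping
# frames and a small feature vector is extracted from each frame, yielding a
# sequence of feature vectors.

def feature_sequence(signal, frame_len, hop):
    features = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        energy = sum(x * x for x in frame) / frame_len      # average energy
        zcr = sum(1 for i in range(1, frame_len)
                  if frame[i - 1] * frame[i] < 0) / frame_len  # zero crossings
        features.append((energy, zcr))
    return features

# Hypothetical toy signal (a real system would use sampled audio).
signal = [0.0, 0.5, -0.5, 0.5, -0.5, 0.0, 0.1, 0.1]
print(feature_sequence(signal, frame_len=4, hop=2))
```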
2.2.3.4 Dynamic Time Warping (DTW)
Dynamic Time Warping (DTW) is fundamentally a feature-matching scheme that inherently
accomplishes “time alignment” of feature sequences through a Dynamic Programming (DP)
procedure. By time alignment we mean the process by which temporal regions of one feature
sequence are matched with appropriate regions of another feature sequence. Because one
feature sequence is “warped” (stretched or compressed in time) to fit the other and because
dynamic programming is used to accomplish the task, this approach is referred to as dynamic
time warping. The feature sequence to be classified is matched against a set of reference
patterns, one reference pattern per class [PIK01]. The choice of reference patterns is usually
user-driven, i.e. the user decides which pattern is likely to serve as a reference pattern.
Exhaustive methods have also been investigated in order to automatically extract reference
patterns from a class [DEL87]. DTW has its origins in Speech Recognition and has been
employed with success by early isolated word recognition systems.
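The DP recurrence underlying DTW can be sketched as follows. The one-dimensional features and reference patterns in the example are hypothetical; real systems compare multi-dimensional feature vectors with a suitable distance.

```python
# A minimal DTW sketch: the classic dynamic-programming recurrence computes
# the cost of the best time alignment between two feature sequences; the
# sequence is then assigned to the class of the nearest reference pattern.

def dtw(a, b, dist=lambda x, y: abs(x - y)):
    INF = float("inf")
    D = [[INF] * (len(b) + 1) for _ in range(len(a) + 1)]
    D[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            # Local cost plus the cheapest of the three allowed predecessors
            # (match, stretch a, stretch b).
            D[i][j] = dist(a[i - 1], b[j - 1]) + min(D[i - 1][j - 1],
                                                     D[i - 1][j],
                                                     D[i][j - 1])
    return D[len(a)][len(b)]

# Hypothetical one-dimensional feature sequences: one reference per class.
references = {"class1": [1, 2, 3, 2, 1], "class2": [5, 5, 5, 5]}
query = [1, 2, 2, 3, 2, 1]                     # a time-stretched class1
best = min(references, key=lambda c: dtw(query, references[c]))
print(best)  # class1
```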
2.2.3.5 Stochastic modeling: Hidden Markov Models
An alternative to DTW is stochastic modelling by means of Hidden Markov Models. HMMs
are “stochastic finite state automata”. Each class of patterns is associated with a
corresponding HMM. Two fundamental problems arise when dealing with HMMs:
− Given a series of patterns belonging to the same class, how do we train an HMM to
represent the class? This is the HMM training problem.
− Given a trained HMM, how do we find the likelihood with which it generates an
incoming pattern? This is the recognition problem.
To solve both problems, various approaches have been proposed, the most popular being
Baum-Welch and Viterbi training [BW70], [VIT67].
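The recognition problem can be illustrated with the forward algorithm. The two-state model below is a hypothetical toy with discrete emissions, not a trained music model.

```python
# Given a (hypothetical) trained HMM with initial probabilities pi, transition
# matrix A and discrete emission matrix B, the forward algorithm computes the
# likelihood that the model generates an observation sequence.

def forward_likelihood(pi, A, B, observations):
    n_states = len(pi)
    # Initialization with the first observation.
    alpha = [pi[s] * B[s][observations[0]] for s in range(n_states)]
    # Induction: propagate the forward probabilities through the chain.
    for obs in observations[1:]:
        alpha = [B[s][obs] * sum(alpha[r] * A[r][s] for r in range(n_states))
                 for s in range(n_states)]
    return sum(alpha)

# Toy two-state model over a two-symbol alphabet (0 and 1).
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
print(forward_likelihood(pi, A, B, [0, 0, 1]))
```

In a classification setting, one such likelihood would be computed per class HMM and the highest would decide the class, as described above.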
Depending on the selected features, an HMM is expected to act as a means of music content
modelling. For example, if feature sequences correspond to sequences of music intervals
extracted from monophonic melodies of the same clarinet player, an HMM can be trained
with these sequences and can serve as a signature for that particular clarinet player.
Subsequently, it becomes feasible to determine the probability with which a monophonic
melody has its origins in a particular instrument player. Following this line of thinking,
HMMs may also be trained to serve as signatures of the identity of composers, musical
traditions, etc.
2.3. Patterns in Information Retrieval
Another research field where patterns are apparent is that of Information Retrieval. In a
retrieval setting we have a collection of discourse material, also called a corpus, and users
submit queries to the system in order to retrieve information that suits their interests. Queries
are often vaguely defined, in contrast to traditional database systems, due to the lack of a
query language or algebra. A query consisting of only a few words does not always reflect the
user's actual interest; therefore users often experience frustration with a retrieval system.
Latent Semantic Indexing (LSI) [DDF+90, BDO95], a retrieval model that unveils patterns in
term usage, seems to produce more effective information retrieval.
Assuming a corpus of n documents, each of which is indexed by m suitably chosen index
terms, the entire corpus can be represented as an m×n matrix A, each column of A
representing the respective text. The entry A(i,j) of the matrix corresponds to a measure of the
frequency of occurrence of the i-th index term in the j-th text of the collection.
In the traditional Vector-Space model [SL68], each index term is considered as a base vector
in an m-dimensional space U. Hence, each text can be considered as a linear combination of
index terms and as such is represented by a vector in U. Queries are also treated as linear
combinations of index terms and projected into U. The objective of the retrieval system is to
return the documents most closely related, i.e. most similar, to the query. In the vector space
approach the measure of similarity between a query q and a document d is simply the cosine
cos(q, d) of their vector representations. Thus, the documents of interest form the set
D = {d | cos(q, d) ≥ t}, where t is a given threshold.
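The cosine-threshold retrieval just described can be sketched as follows. The three-term index and the term-frequency vectors are hypothetical.

```python
import math

# Vector-space retrieval sketch: documents and the query are term-frequency
# vectors over the same index terms; documents whose cosine similarity to the
# query reaches the threshold t are returned.

def cos(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def retrieve(docs, query, t):
    return [name for name, vec in docs.items() if cos(query, vec) >= t]

# Hypothetical 3-term index (term frequencies per document).
docs = {"d1": [2, 1, 0], "d2": [0, 0, 3], "d3": [1, 1, 0]}
query = [1, 1, 0]
print(retrieve(docs, query, t=0.9))  # d1 and d3 share the query's terms
```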
The key assumption in the vector space model is the orthogonality of index terms, i.e. no two
index terms are correlated in their appearance within a text. This is often not the case when
human-generated discourse is involved. People frequently use distinct words to describe the
same concept, giving rise to a phenomenon called “synonymy”. An example of synonymy is
the pair of terms “car” and “automobile”, which both describe a passenger vehicle. In
response to a query containing just the term “car”, texts containing “automobile” may be
overlooked. Another issue is that a term can have several interpretations, according to the
context of its use. For example, when referring to the word “jaguar”, it is not clear without
knowledge of the context whether the car manufacturer or the animal is meant. As an
undesired side effect of polysemy, documents about cars may appear after a query that
actually seeks to retrieve information about jaguars.
Identifying “patterns” in term usage seems to remedy the aforementioned problems. In
essence, one seeks an efficient mechanism to identify text clusters. In such a cluster,
synonymous terms, although distinct, share similar usage characteristics, i.e. they are used in
similar contexts. On the other hand, a polysemous term is expected to have distinct usage
patterns for each distinct meaning, so a text referring to a jaguar in a zoo will appear in a
different cluster than a text about the Jaguar X-type. This partition into clusters can be
achieved if the documents are represented as linear combinations of entities richer in
semantics than the plain index terms. These entities are the topics addressed by texts; hence a
text is a combination of such topics. This methodology is called Semantic Indexing.
The LSI method is a type of Semantic Indexing that uses a well-known algebraic
decomposition, the Singular Value Decomposition (SVD) [GL96], applied to the term-
document matrix A. At a high level, LSI projects the original vector space spanned by the
columns of A to a low-dimensional “semantic” space. This semantic space results from
keeping only the important correlations of indexing terms, as those appear in the spectral
structure of the matrix. In this low-dimensional space, semantically correlated texts appear
close together, whereas this is not true for unrelated texts sharing polysemous terms. The
ability to project queries, too, into the induced space, and then retrieve all documents
projected close to the query vector in terms of the cosine measure, yields increased
effectiveness in Information Retrieval.
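The LSI projection can be sketched with a few lines of linear algebra. The 4×3 term-document matrix below is hypothetical, and the choice k = 2 is arbitrary.

```python
import numpy as np

# LSI sketch: decompose the m x n term-document matrix A with the SVD, keep
# only the k largest singular values, and compare documents and queries in
# the resulting k-dimensional "semantic" space.

def lsi(A, k):
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]          # rank-k spectral structure

def project_query(q, Uk, sk):
    return (q @ Uk) / sk                        # fold the query into the space

# Hypothetical 4-term x 3-document matrix.
A = np.array([[1., 1., 0.],
              [1., 0., 0.],
              [0., 1., 1.],
              [0., 0., 1.]])
Uk, sk, Vtk = lsi(A, k=2)
q = np.array([1., 1., 0., 0.])                 # query on the first two terms
q_hat = project_query(q, Uk, sk)
doc_vectors = Vtk.T                             # documents in the semantic space
sims = doc_vectors @ q_hat / (np.linalg.norm(doc_vectors, axis=1)
                              * np.linalg.norm(q_hat))
print(np.argmax(sims))  # index of the document most similar to the query
```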
Some of the reported applications of LSI support the claimed ability of the method to
discover important patterns in word usage. Dumais et al. [DLLL97] tested LSI on parallel
bilingual corpora (English and French). They found that they were able to retrieve relevant
texts in both languages, even though the query was formulated in only one of them.
Nielsen et al. [NPD94] used texts with spelling errors, produced by OCR software. In the
experiments they conducted, they showed that the retrieval of misspelled documents was
almost as effective as the retrieval of error-free documents, in both cases using LSI. This is a
consequence of the fact that LSI exploits correlations among terms, without caring much
about the actual content.
2.4. Mathematics
Mathematics is the science of patterns. Not only do patterns take many forms over the range
of school mathematics, they are also a unifying theme. Number patterns, such as 3, 6, 9, 12,
etc., are familiar to us, since they are among the patterns we first learn as young students. As
we advance, we experience number patterns again through the central concept of functions in
mathematics. But patterns are much broader. They can be sequential [AS95], spatial [ST99],
temporal [DLMRS98] [SB98], and even linguistic [FFKLLRSZ98] [LAS97].
The various mathematical patterns can be summarized to the following categories:
2.4.1 Number Patterns
We frequently come upon numbers with special characteristics. In the science of mathematics
a pattern is usually the rule or the constraint which several items satisfy. Thus we can clearly
distinguish the pattern, which is a constraint, from the pattern instance, which is the set of
items (in this case numbers) that share the common repeated characteristic, i.e., that satisfy
the constraint. Sometimes it is very useful to collect number patterns so as to be able to learn
the behaviour of numbers collected, for example, from telecommunications data [BCH00] or
from equation solving [SB99]. A variety of number patterns exists, a brief survey of which
follows:
Prime Numbers: 2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, …
Composite numbers: prime factorization
Figurate numbers:
I. Triangular numbers
II. Square numbers (for example 25 = 5², 36 = 6², 49 = 7², etc.)
Fibonacci numbers: 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, ...
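The number patterns listed above can also be generated programmatically; a small sketch producing the first members of each family:

```python
# Generators for three of the number patterns listed above.

def primes(n):
    found = []
    candidate = 2
    while len(found) < n:
        # Trial division by the primes found so far.
        if all(candidate % p for p in found):
            found.append(candidate)
        candidate += 1
    return found

def triangular(n):                 # 1, 3, 6, 10, ... (k-th term = k(k+1)/2)
    return [k * (k + 1) // 2 for k in range(1, n + 1)]

def fibonacci(n):
    seq, a, b = [], 1, 1
    for _ in range(n):
        seq.append(a)
        a, b = b, a + b
    return seq

print(primes(8))       # [2, 3, 5, 7, 11, 13, 17, 19]
print(triangular(5))   # [1, 3, 6, 10, 15]
print(fibonacci(8))    # [1, 1, 2, 3, 5, 8, 13, 21]
```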
2.4.2 Patterns in graphs
Every graph results from the graphical depiction of a function or, more generally, of an
equation solved for one of its variables. Many equations, for example linear ones, exhibit
similar behaviour, which leads us to conclude that this similarity carries over to their
graphical depictions as well. Almost every graph belongs to a graph pattern, meaning that
each graph has some attributes that force it to obey a specific behaviour as one or more of its
variables grow or shrink in value. For example, we can easily distinguish some graph
patterns, like the following:
linear graphs
i. x+y=7
ii. 2x + 3 = y
non-linear (parabola, circle [Sch97], ellipse)
For example, Figure 3 depicts non-linear graphs, each presenting a pattern of its own.
2.4.3 Patterns in shapes
We can also find patterns in shapes. The concept in this case is that the majority of shapes
follow a similarity, for example in the number of vertices of which a shape consists. So,
patterns could be triangles (which might be similar), squares, polygons with n angles, etc.
The existence of such patterns can help us in recognizing familiar shapes in image processing
[NPP00] or even in reconstructing polygonal images [CN95].
2.4.4 Patterns in infinite sequences
Infinite sequences can be the pattern instance of a mathematical constraint, which in this case
is the pattern itself; for example, the infinite Fibonacci series, Taylor series, prime numbers,
etc. In this case we can also add the infinite series of digits of numbers like e or π. Such
patterns can prove really helpful in characterizing the behaviour of an individual, system or
network in terms of temporal sequences or discrete data [LB98].

Figure 3. Non-linear graphs
2.4.5 Patterns in algebra
What is the difference between the arithmetic 3+5 = 5+3 and the algebraic a+b = b+a? One is
a specific fact; the other is a pattern valid in a multitude of situations. While arithmetic may
hint at some regularities, algebra, as a language, gives expression to our acknowledgement of
patterns as such. How did people express general ideas before the advent of algebra in the
15th-17th centuries? Diophantus of Alexandria (c. 250 A.D.) is credited with the invention of
syncopated (shorthand) notations. Before that, it was geometry.
Algebra unites patterns and quantities in patterns with the means of describing change
through the use of variables and functions. Its concepts and analytical methods allow people
to consider general solutions to problems with common characteristics and develop related
formulas. Algebra provides verbal, symbolic and graphical formats for discussing and
representing settings as diverse as the pricing patterns of merchandise in a store, the behavior
of a car as it accelerates or slows down, the changes in two chemicals as they react with one
another, or the type of variation existing in a comparison of two factors in the economy. In
algebra, we could think of many patterns, like the way equations are solved, iterating
functions, etc. These patterns apply to fields like music [LK94], moving-object applications [G99], etc.
2.4.6 Patterns in Geometry
In geometry we come across geometric loci: sets of points in space (2D, 3D, etc.) that satisfy a specific premise or constraint. For example, all the points in 2D space that have the same distance (let's say r) from a certain point (x,y) form a circle with center C(x,y) and radius r; this circle constitutes a pattern instance of the above pattern. Applications of such patterns appear in computational geometry, pattern recognition and multimedia [VH99].
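As a minimal sketch of this locus-as-pattern view (the function name is ours, for illustration only), a point is a pattern instance of a circle if it satisfies the distance constraint:

```python
import math

def on_circle(p, center, r, tol=1e-9):
    """A point is an instance of the circle pattern if its distance
    from the center equals the radius r (the locus constraint)."""
    return math.isclose(math.dist(p, center), r, abs_tol=tol)

center, r = (0.0, 0.0), 5.0
assert on_circle((3.0, 4.0), center, r)      # 3-4-5 triangle: on the locus
assert not on_circle((3.0, 3.0), center, r)  # does not satisfy the constraint
```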
2.4.7 Patterns in Cryptography
Cryptography is one of the main subjects in which mathematics reigns supreme. In cryptography, every cryptographic system can be considered a pattern itself. For example, if we have many sets of raw data and to each set we apply a specific encryption, then these sets share the similarity of their encryption attribute. A very common example of a code which constitutes a pattern is the Morse code. If we were given a stream of characters, it would be interesting to find out whether this stream constitutes Morse code, i.e., whether it is a pattern instance of the code. We can find many such cases in cryptography, such as the Vigenère cipher, the Caesar cipher, the Gronsfeld cipher, etc. Such cryptographic patterns are widely used in word processors, electronic commerce systems, spreadsheets, databases and security systems [BRD99].
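As an illustrative sketch (ours, not from the report) of the Caesar cipher mentioned above, note how the fixed shift key is the shared "encryption attribute" of every ciphertext it produces:

```python
def caesar(text, shift):
    """Apply a Caesar cipher: every letter is shifted by the same key,
    so all ciphertexts produced with a given key share the same
    'encryption attribute' -- the pattern described above."""
    out = []
    for ch in text:
        if ch.isalpha():
            base = ord('A') if ch.isupper() else ord('a')
            out.append(chr((ord(ch) - base + shift) % 26 + base))
        else:
            out.append(ch)  # non-letters pass through unchanged
    return ''.join(out)

cipher = caesar("Pattern", 3)
print(cipher)  # Sdwwhuq
assert caesar(cipher, -3) == "Pattern"  # shifting back recovers the plaintext
```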
2.5. Visualization
Scientific, engineering, and environmental databases can contain very large amounts of raw
data collected automatically and continuously via sensors and monitoring systems. Even
simple transactions of everyday life, such as paying by credit card, result in large
multidimensional data sets. Information visualization and visual data mining techniques can
help to deal with the flood of data [Kei01a]. Finding the relevant information hidden in the
data is difficult without user interaction. For an effective data analysis, it is important to include
the user in the exploration process and combine the flexibility, creativity, and general
knowledge of the user with the enormous storage capacity and the computational power of the
computer. Visualization techniques integrate the human perceptual abilities into the data
exploration process for the analysis of large data sets.
The basic idea of visual data exploration is to present the data in some visual form which
allows the user to gain insight into the data [CMS99, Che99, Spe00, War00, Kei01b]. Visual
data exploration usually follows a three step process [Shn96]: (1) Overview, (2) zoom and
filter, and (3) details on demand. First, the user needs to get an overview of the raw data to
identify interesting patterns in the raw data. Visualization techniques are useful for showing
an overview of the raw data and detecting patterns [Kei00]. Patterns are groups of data points
in the visualization that represent potentially valuable information and provide new insights
for the user. A second step is the analysis of the discovered patterns. In this step, the user
needs to be able to zoom and filter the data in order to focus on one or more patterns
[AW95]. Finally, the user needs to drill-down to access the details of the data points
belonging to a discovered pattern [BSC96]. Visualization technology is used for interpreting
the patterns of interest. It is important to retain an overview of the data while focusing on the
interesting patterns using another visualization. A common technique is to distort the
overview visualization in order to focus on the interesting patterns [LA94].
Visual data exploration can be seen as a hypothesis generation process. The visualization
allows the user to identify patterns of interest or groups of related data points and gain insight
into the raw data. Visualization can also be used to analyse the patterns on different levels of
abstraction, which may result in adapting existing hypotheses or generating new hypotheses.
The verification of the hypothesis can also be done with the help of visualization techniques.
The advantage of visual data exploration is that the process is interactive: the user is directly involved in the exploration process and is repeatedly asked to make the important decisions that steer it.
3. Current issues in modeling Data Mining processes and results
In this section we address the current and evolving efforts on modeling data mining processes and their results. The survey is organized as follows. Section 3.1 introduces the work of the Data Mining Group and the specification of the Predictive Model Markup Language, a set of XML DTDs that can be used for describing common data mining techniques. Section 3.2 gives an overview of SQL/MM Part 6, a standard that has been developed under ISO. Section 3.3 gives an overview of the Common Warehouse Model and the Data Mining Metamodel, standards supported by the OMG. Section 3.4 highlights the efforts in developing the Java Data Mining API, a forthcoming standard developed by individual vendors. Section 3.5 gives an overview of the Oracle 9i Data Mining components, and Section 3.6 highlights PQL, a pattern query language developed by Information Discovery, Inc. Finally, Section 3.7 gives a summary and concludes the survey.
3.1. Data Mining Group / Predictive Model Markup Language [DMG]
3.1.1 Overview
The Data Mining Group (DMG) is an independent, vendor-led group which develops data mining standards, such as the Predictive Model Markup Language (PMML). PMML is a collection of XML Document Type Definitions (DTDs) that provide a uniform way for modeling data mining processes and results. In the next sections we will present the main
features of the PMML DTDs. The current members of DMG are:
• Angoss Software Corp. Toronto, CAN
• IBM Corp. Somers, NY
• NCR Corp. Dayton, OH
• Magnify Inc. Chicago, IL
• Oracle Corporation Redwood Shore, CA
• National Center for Data Mining, University of Illinois at Chicago
• SPSS Inc. Chicago, IL
• Xchange, Inc. Boston, MA
• MINEit Software Ltd. Bracknell, UK
PMML defines a variety of specific mining models such as tree classification, neural
networks, regression, etc. Equally important are definitions which are common to all models,
in order to describe the input data itself, and generic transformations which can be applied to
the input data before the model itself is evaluated. The following schema shows the basic blocks of a mining model as well as the data flow of such an operation:
The DataDictionary describes the data 'as is', that is the raw input data. The DataDictionary
refers to the original data and defines how the mining model interprets the data, e.g., as
categorical or numerical, and the range of valid values may be restricted. The raw data are
not included in a PMML document and they are hosted in external sources. The
DataDictionary only defines the mappings between the source attributes and the model’s local
field names.
The MiningSchema defines an interface to the user of PMML models. It lists all fields that are
used as input to the computations in the mining model. The mining model may internally
require further derived values that depend on the input values, but these derived values are not
part of the MiningSchema. The derived values are defined in the transformations block. The
MiningSchema also defines which values are regarded as outliers and which weighting is applied to a field, e.g., for clustering. Input fields as specified in the MiningSchema refer to fields in
the data dictionary but not to derived fields because a user of a model is not required to
perform the normalizations.
Various types of transformations are defined such as normalization of numbers to a range [0..1] or discretization of continuous fields. These transformations convert the original values to internal values as they are required by the mining model, such as an input neuron of a network model.
Figure 4. Basic blocks of a mining model
If a PMML model contains transformations, a user is not required to take care of these normalizations. The MiningSchema lists the input fields that refer to the non-normalized original values; the user presents these fields as input to the model.
The output of a model always depends on the specific kind of model, e.g. it may be defined by a leaf node in a tree or by output neurons in a neural network. The final result, such as a predicted class and a probability, is computed from the output of the model. If a neural
network is used for predicting numeric values then the output value of the network usually
needs to be denormalized into the original domain of values. Fortunately, this denormalization
can use the same kind of transformation types. The PMML consumer system will
automatically compute the inverse mapping.
3.1.2 General Structure of a PMML Document
PMML uses XML to represent mining models. The structure of the models is described by a
DTD which is called the PMML DTD. The DTD to which all PMML documents must conform is:
<!ELEMENT PMML (Header, Settings?, DataDictionary,
TransformationDictionary, (%A-PMML-MODEL;)+, Extension* )>
<!ATTLIST PMML version CDATA #REQUIRED>
<!ELEMENT Settings (Extension*) >
<!ELEMENT TransformationDictionary (DerivedValues*, Extension* ) >
For PMML version 2.0 the attribute version must have the value "2.0" as shown in the next
small example of a PMML instance.
<?xml version="1.0"?>
<!DOCTYPE PMML PUBLIC "PMML 2.0" "http://www.dmg.org/PMML2.0/pmml-2-0.dtd">
<PMML version="2.0">
...
</PMML>
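As a quick illustration of this structure, the sketch below (ours, not part of the standard; the instance is minimal and not DTD-validated) builds a skeletal PMML document and reads its version attribute with Python's ElementTree:

```python
import xml.etree.ElementTree as ET

# A skeletal PMML instance assembled from the elements described above:
# a Header with its required copyright attribute and an (empty,
# placeholder-only) DataDictionary. Not validated against the DTD.
doc = """<?xml version="1.0"?>
<PMML version="2.0">
  <Header copyright="www.dmg.org" description="skeleton"/>
  <DataDictionary numberOfFields="0"/>
</PMML>"""

root = ET.fromstring(doc)
assert root.tag == "PMML" and root.get("version") == "2.0"
print(root.find("Header").get("copyright"))  # www.dmg.org
```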
In the following we give a description of the base elements of the generic PMML DTD.
3.1.3 Header
The Header DTD is:
<!ELEMENT Header (Application?, Annotation*, Timestamp?)>
<!ATTLIST Header
copyright CDATA #REQUIRED
description CDATA #IMPLIED
>
<!ELEMENT Application EMPTY>
<!ATTLIST Application
name CDATA #REQUIRED
version CDATA #IMPLIED
>
<!ELEMENT Annotation (#PCDATA)>
<!ELEMENT Timestamp (#PCDATA)>
• Header: The top level tag that marks the beginning of the header information.
• Copyright: This head attribute contains the copyright information for this model.
• Description: This header attribute contains a general description of the model. It should contain information necessary for using this model in further applications, but not information that is better placed in the application element, the annotations, or the data dictionary section. This attribute should only contain human-readable information; models mentioned in this DTD file should not be expected to utilize the information contained in this attribute.
• Application: This head element describes the software application that generated the
PMML. Though these models are created to be portable, different mechanisms may
create different models from the same data set. It is of interest to the user from which
application these models were generated.
• Name: The name of the application that generated the model.
• Version: The version of the application that generated this model.
• Annotation: Document modification history is embedded here. Each annotation is
free text and, like the description attribute in the head element, makes sense to the
human eye only.
• Timestamp: This element allows a model creation timestamp in the format YYYY-
MM-DD hh:mm:ss GMT +/- xx:xx.
3.1.4 Settings
The element Settings can contain any XML value describing the configuration of the training
run that produced the model instance. This information is not directly needed in a PMML
consumer, but in many cases it is helpful for maintenance and for visualization of the model.
The content of Settings is not defined in PMML 2.0.
3.1.5 Data Dictionary
The data dictionary contains definitions for fields as used in mining models. It specifies the
types and value ranges. These definitions are assumed to be independent of specific data sets
as used for training or building a specific model.
A data dictionary can be shared by multiple models; statistics and other information related to the training set are stored within a model.
<!ENTITY %FIELD-NAME "CDATA" >
<!ELEMENT DataDictionary (Extension*, DataField+, Taxonomy+ ) >
<!ATTLIST DataDictionary
numberOfFields %INT-NUMBER; #IMPLIED
>
<!ELEMENT DataField ( Extension*, (Interval*| Value*) ) >
<!ATTLIST DataField
name %FIELD-NAME; #REQUIRED
displayName CDATA #IMPLIED
optype (categorical|ordinal|continuous) #REQUIRED
taxonomy CDATA #IMPLIED
isCyclic ( 0 | 1 ) "0"
>
The value of numberOfFields is the number of fields defined in the content of DataDictionary; this number can be added for consistency checks. The name of a data field
must be unique in the data dictionary. The displayName is a string which may be used by
applications to refer to that field. Within the XML document only the value of name is
significant. If displayName is not given, then name is the default value.
The fields are separated into different types depending on which operations are defined on the
values; this is defined by the attribute optype. Categorical fields have the operator "=", ordinal
fields have an additional "<", and continuous fields also have arithmetic operators. Cyclic
fields have a distance measure which takes into account that the maximal value and minimal
value are close together. The optional attribute taxonomy refers to a taxonomy of values; it is only applicable to categorical fields, and its value is the name of a taxonomy.
3.1.6 Transformation Dictionary (Derived Values)
At various places the mining models use simple functions in order to map user data to values
that are easier to use in the specific model. For example, neural networks internally work with
numbers, usually in the range [0..1]. Numeric input data are mapped to the range [0..1], and
categorical fields are mapped to series of 0/1 indicators. Similarly, Naive Bayes models
internally map all input data to categorical values. For PMML to be able to handle such
mappings it defines various kinds of simple data transformations:
Normalization: map values to numbers, input can be continuous or discrete.
Discretization: map continuous values to discrete values.
Value mapping: map discrete values to discrete values; mapping missing values is a special case of value mapping.
Aggregation: summarize or collect groups of values, e.g. compute average.
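The four transformation kinds above can be sketched as simple functions. This is an illustrative sketch only; the function names are ours, not PMML element names:

```python
# Illustrative sketches of the four PMML transformation kinds.

def normalize(x, lo, hi):
    """Normalization: map a continuous value into the range [0..1]."""
    return (x - lo) / (hi - lo)

def discretize(x, boundaries, labels):
    """Discretization: map a continuous value to a discrete bin label.
    len(labels) == len(boundaries) + 1."""
    for b, label in zip(boundaries, labels):
        if x < b:
            return label
    return labels[-1]

def map_value(v, mapping, missing="unknown"):
    """Value mapping: discrete -> discrete; a missing value is handled
    as a special case with a default label."""
    return mapping.get(v, missing)

def aggregate(values):
    """Aggregation: summarize a group of values, e.g. compute the average."""
    return sum(values) / len(values)

assert normalize(5.0, 0.0, 10.0) == 0.5
assert discretize(42, [18, 65], ["minor", "adult", "senior"]) == "adult"
assert map_value("M", {"M": "male", "F": "female"}) == "male"
assert aggregate([1.0, 2.0, 3.0]) == 2.0
```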
The corresponding XML elements appear as content of a surrounding DerivedValues element, which provides a common wrapper for the various mappings. DerivedValues elements can appear in the data dictionary within DataField elements. They can also appear at several places in the definition of specific models such as neural network or Naive Bayes models. The names of the transformed fields are defined such that the model can refer to these fields.
The transformations in PMML do not cover the full set of preprocessing functions which may
be needed to collect and prepare the data for mining. There are too many variations of
preprocessing expressions. Instead, the PMML transformations represent expressions that are created automatically by a mining system, and the corresponding expressions are often machine-generated. For example, a discretization might be constructed by a mining system that computes quantile ranges in order to transform skewed data.
3.1.7 PMML mining models
A PMML document can contain more than one model. PMML supports the following data
mining models:
TreeModel: The tree modeling framework allows for defining either a classification or
prediction structure. Each Node holds a rule, called PREDICATE, that determines the
reason for choosing the Node or any of the branching Nodes.
NeuralNetwork: A neural network has one or more input nodes and one or more neurons.
Some neuron's outputs are the output of the network. The network is defined by the
neurons, their connections and the corresponding weights. All neurons are organized into
layers; the sequence of layers defines the order in which the activations are computed. All
output activations for neurons in some layer L are evaluated before computation proceeds
to the next layer L+1. Note that this allows for recurrent networks, where outputs of neurons in layer L+i (i > 0) can be used as input in layer L. The model does not define a specific evaluation order for neurons within a layer.
ClusteringModel: PMML models for Clustering are defined in two different classes.
These are center-based and distribution-based cluster models. Both models have the DTD
element ClusteringModel as the top-level type, and they share many other element types. A
cluster model basically consists of a set of clusters. For each cluster a center vector can be
given. In center-based models a cluster is defined by a vector of center coordinates. Some
distance measure is used to determine the nearest center, that is the nearest cluster for a
given input record. For distribution-based models (e.g. in demographic clustering) the
clusters are defined by their statistics. Some similarity measure is used to determine the
best matching cluster for a given record. The center vectors then only approximate the
clusters. The model must contain information on the distance or similarity measure used
for clustering. It may also contain information on overall data distribution, such as
covariance matrix, or other statistics.
RegressionModel: The regression functions are used to determine the relationship
between the dependent variable (target field) and one or more independent variables. The
dependent variable is the one whose values you want to predict, whereas the independent
variables are the variables that you base your prediction on. PMML defines three types of
regression models: linear, polynomial, and logistic regression.
NaiveBayesModel: Naive Bayes uses Bayes' Theorem, combined with a ("naive") presumption of conditional independence, to predict, for each record (a set of values, one for each field), the value of a target (output) field from the evidence given by one or more predictor (input) fields.
AssociationModel: The Association Rule model represents rules where some set of items
is associated to another set of items. For example a rule can express that a certain product
is often bought in combination with a certain set of other products.
SequenceMiningModel: The basic data model consists of an Object, identified by the
“Primary Key” that has a number of events attributed to it, defined by the “Secondary
Key”. Each event consists of a set of ordered items. An “Order Field” defines the order of
the items within an event, with an optional qualifier in the form of an attribute name.
For every model there is a corresponding DTD that describes the metadata and the processes
of each model. In the following we give a description of the generic framework of PMML that is used by every model. For all PMML models the structure of the top-level model
element is similar to:
<!ELEMENT XModel (Extension*, MiningSchema, ModelStats?, ..., Extension* ) >
<!ATTLIST XModel
modelName CDATA #IMPLIED
functionName %MINING-FUNCTION; #REQUIRED
algorithmName CDATA #IMPLIED
>
<!ELEMENT MiningSchema (MiningField+) >
<!ELEMENT ModelStats (UnivariateStats+) >
The non-empty list of mining fields defines a so-called mining schema. The statistics contain global statistics on (a subset of) the mining fields. Other model-specific elements follow after
ModelStats in the content of XModel. For a list of models that have been defined in PMML
2.0 see the entity A-PMML-MODEL above.
The naming conventions for PMML are: ElementNames in mixed case, first letter uppercase; attributeNames in mixed case, first letter lowercase; enumConstants in mixed case, first letter lowercase; ENTITY-NAMES all uppercase. The character '-' is used less often in order to avoid confusion with mathematical notation.
3.1.8 Example: DTD of Association Rules Model
The Association Rule model represents rules where some set of items is associated to another
set of items. For example a rule can express that a certain product is often bought in
combination with a certain set of other products.
An Association Rule model consists of four major parts:
<!ELEMENT AssociationModel (Extension*, AssocInputStats,
AssocItem+, AssocItemset+, AssocRule+)>
<!ATTLIST AssociationModel
modelName CDATA #IMPLIED
>
AssocInputStats describes the basic information of the input data through a set of
attributes:
<!ELEMENT AssocInputStats EMPTY>
<!ATTLIST AssocInputStats
numberOfTransactions %INT-NUMBER; #REQUIRED
maxNumberOfItemsPerTA %INT-NUMBER; #IMPLIED
avgNumberOfItemsPerTA %REAL-NUMBER; #IMPLIED
minimumSupport %PROB-NUMBER; #REQUIRED
minimumConfidence %PROB-NUMBER; #REQUIRED
lengthLimit %INT-NUMBER; #IMPLIED
numberOfItems %INT-NUMBER; #REQUIRED
numberOfItemsets %INT-NUMBER; #REQUIRED
numberOfRules %INT-NUMBER; #REQUIRED
>
Attribute description:
numberOfTransactions: The number of transactions (baskets of items) contained in the
input data.
maxNumberOfItemsPerTA: The number of items contained in the largest transaction.
avgNumberOfItemsPerTA: The average number of items contained in a transaction.
minimumSupport: The minimum relative support value (#supporting transactions / #total
transactions) satisfied by all rules.
minimumConfidence: The minimum confidence value satisfied by all rules. Confidence is
calculated as (support (rule) / support(antecedent)).
lengthLimit: The maximum number of items contained in a rule which was used to limit
the number of rules.
numberOfItems: The number of different items contained in the input data.
numberOfItemsets: The number of itemsets contained in the model.
numberOfRules: The number of rules contained in the model.
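The support and confidence definitions above can be computed directly. The sketch below is illustrative only, using a small hypothetical four-transaction data set:

```python
# Relative support and confidence as defined above, computed over a
# small hypothetical transaction set (baskets of items).
transactions = [
    {"Cracker", "Coke", "Water"},
    {"Cracker", "Water"},
    {"Cracker", "Water"},
    {"Cracker", "Coke", "Water"},
]

def support(itemset):
    """#supporting transactions / #total transactions."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """support(rule) / support(antecedent)."""
    return support(antecedent | consequent) / support(antecedent)

assert support({"Cracker", "Water"}) == 1.0  # appears in all 4 transactions
assert support({"Coke"}) == 0.5              # appears in 2 of 4
assert confidence({"Water"}, {"Coke"}) == 0.5
```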
AssocItem describes the items contained in itemsets:
<!ELEMENT AssocItem EMPTY>
<!ATTLIST AssocItem
id %ELEMENT-ID; #REQUIRED
value CDATA #REQUIRED
mappedValue CDATA #IMPLIED
weight %REAL-NUMBER; #IMPLIED
>
Attribute description:
id: An identification to uniquely identify an item.
value: The value of the item as in the input data.
mappedValue: Optional; a value to which the original item value is mapped. For instance, this could be a product name if the original value is an EAN code.
weight : The weight of the item. For example, the price or value of an item.
AssocItemset describes the itemsets which are contained in rules:
<!ELEMENT AssocItemset (Extension*, AssocItemRef+)>
<!ATTLIST AssocItemset
id %ELEMENT-ID; #REQUIRED
support %PROB-NUMBER; #REQUIRED
numberOfItems %INT-NUMBER; #REQUIRED
>
Attribute description:
id : An identification to uniquely identify an itemset
support : The relative support of the itemset
numberOfItems : The number of items contained in this itemset
AssocItemRef elements are item references that point to elements of type item:
<!ELEMENT AssocItemRef EMPTY>
<!ATTLIST AssocItemRef
itemRef %ELEMENT-ID; #REQUIRED
>
Attribute description:
itemRef : The id value of an item element
AssocRule elements, of the form <antecedent itemset> => <consequent itemset>, contain the actual association rules derived by the model instance:
<!ELEMENT AssocRule ( Extension* )>
<!ATTLIST AssocRule
support %PROB-NUMBER; #REQUIRED
confidence %PROB-NUMBER; #REQUIRED
antecedent %ELEMENT-ID; #REQUIRED
consequent %ELEMENT-ID; #REQUIRED
>
Attribute definitions:
support : The relative support of the rule
confidence : The confidence of the rule
antecedent : The id value of the itemset which is the antecedent of the rule
consequent : The id value of the itemset which is the consequent of the rule
An instance of an example Association Rules model is given below. In this example we can see all the aforementioned elements that construct an Association Rules model.
Let's assume we have four transactions with the following data:
• t1: Cracker, Coke, Water
• t2: Cracker, Water
• t3: Cracker, Water
• t4: Cracker, Coke, Water
<?xml version="1.0" ?>
<PMML version="1.1" >
<Header copyright="www.dmg.org"
description="example model for association rules"/>
<DataDictionary numberOfFields="1" >
<DataField name="item" optype="categorical" />
</DataDictionary>
<AssociationModel>
<AssocInputStats numberOfTransactions="4" numberOfItems="3"
minimumSupport="0.6" minimumConfidence="0.5"
numberOfItemsets="3" numberOfRules="2"/>
<!-- We have three items in our input data -->
<AssocItem id="1" value="Cracker" />
<AssocItem id="2" value="Coke" />
<AssocItem id="3" value="Water" />
<!-- and two frequent itemsets with a single item -->
<AssocItemset id="1" support="1.0" numberOfItems="1">
<AssocItemRef itemRef="1" />
</AssocItemset>
<AssocItemset id="2" support="1.0" numberOfItems="1">
<AssocItemRef itemRef="3" />
</AssocItemset>
<!-- and one frequent itemset with two items. -->
<AssocItemset id="3" support="1.0" numberOfItems="2">
<AssocItemRef itemRef="1" />
<AssocItemRef itemRef="3" />
</AssocItemset>
<!-- Two rules satisfy the requirements -->
<AssocRule support="1.0" confidence="1.0"
antecedent="1" consequent="2" />
<AssocRule support="1.0" confidence="1.0"
antecedent="2" consequent="1" />
</AssociationModel>
</PMML>
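To illustrate how the itemset references in such a document resolve back to item values, the following sketch (ours, using Python's ElementTree on an abbreviated copy of the example above) decodes one rule:

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the example Association Rules document above.
pmml_doc = """<PMML version="1.1">
 <AssociationModel>
  <AssocItem id="1" value="Cracker"/>
  <AssocItem id="2" value="Coke"/>
  <AssocItem id="3" value="Water"/>
  <AssocItemset id="1" support="1.0" numberOfItems="1">
   <AssocItemRef itemRef="1"/>
  </AssocItemset>
  <AssocItemset id="2" support="1.0" numberOfItems="1">
   <AssocItemRef itemRef="3"/>
  </AssocItemset>
  <AssocRule support="1.0" confidence="1.0" antecedent="1" consequent="2"/>
 </AssociationModel>
</PMML>"""

root = ET.fromstring(pmml_doc)
# Map item ids to values, then itemset ids to lists of item values.
items = {i.get("id"): i.get("value") for i in root.iter("AssocItem")}
itemsets = {s.get("id"): [items[r.get("itemRef")] for r in s.iter("AssocItemRef")]
            for s in root.iter("AssocItemset")}
for rule in root.iter("AssocRule"):
    print(itemsets[rule.get("antecedent")], "=>", itemsets[rule.get("consequent")],
          "support", rule.get("support"), "confidence", rule.get("confidence"))
# ['Cracker'] => ['Water'] support 1.0 confidence 1.0
```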
3.2. SQL/MM
3.2.1 Overview
SQL/MM (MM for MultiMedia) [SQL/MM] is a standard based on SQL that has been developed by the International Organization for Standardization (ISO). It is divided into
multiple parts which are the following:
Part 1 : Framework
Part 2: Full Text
Part 3: Spatial
Part 5: Still Image
Part 6: Data Mining
The structured types defined in SQL/MM are first-class SQL types that can be accessed
through SQL:1999 base syntax. These accesses also include invocation of the routines
(methods) associated with the structured types. In the following section we shall review the
basic features of Part 6, Data Mining.
3.2.2 Part 6: Data Mining
The standard supports four different data mining techniques. The term model is used throughout the standard and actually stands for a data mining technique. The four models are:
Rule model
Clustering model
Regression model
Classification model
Every model has a corresponding SQL structured user-defined type. A set of predefined types
completes the full definition of each model. The basic type is named DM_*Model where “*”
is replaced by ‘Class’ for a classification model, ‘Rule’ for a rule model, ‘Clustering’ for a
clustering model and ‘Regression’ for a regression model. The predefined types are the
following (the same naming schema is used as in the case of the DM_*Model):
DM_*Settings type: Instances of that type are used for storing various parameters of the
data mining model, such as the maximum number of clusters or the depth of a decision
tree.
DM_*TestResult: Instances of that type hold the results of the tests performed during the training phase of the data mining models.
DM_*Result: The running of a data mining model against real data creates instances of
that type.
DM_*Task: Instances of that type store metadata that describe the process and control of the tests and of the actual runs.
3.2.3 Example: Rule Model
The DM_RuleModel type represents models which are the result of the search for association
rules. Next, we give the definition of the Rule Model type as described in the standard:
CREATE TYPE DM_RuleModel
AS (
DM_content CHARACTER LARGE OBJECT(DM_MaxContentLength)
)
INSTANTIABLE
NOT FINAL
STATIC METHOD DM_impRuleModel
(input CHARACTER LARGE OBJECT(DM_MaxContentLength))
RETURNS DM_RuleModel
LANGUAGE SQL
DETERMINISTIC
CONTAINS SQL
RETURNS NULL ON NULL INPUT,
METHOD DM_expRuleModel ()
RETURNS CHARACTER LARGE OBJECT(DM_MaxContentLength)
LANGUAGE SQL
DETERMINISTIC
CONTAINS SQL
CALLED ON NULL INPUT,
METHOD DM_getNORules ()
RETURNS INTEGER
LANGUAGE SQL
DETERMINISTIC
CONTAINS SQL
RETURNS NULL ON NULL INPUT,
METHOD DM_getRuleTask ()
RETURNS DM_RuleTask
LANGUAGE SQL
DETERMINISTIC
CONTAINS SQL
CALLED ON NULL INPUT
The Rule Model type has only one member variable, DM_content. In that variable, which is a CHARACTER LARGE OBJECT (CLOB), the complete information about one instance of the model is stored.
Method DM_impRuleModel takes a CLOB as input parameter; if the CLOB is a proper representation of a DM_RuleModel, a new value of type DM_RuleModel is created. A CLOB is a proper representation of a DM_RuleModel if it is a valid XML instance of the PMML Association Rules DTD.
The result of the invocation DM_expRuleModel () is a CHARACTER LARGE OBJECT
representing the association rule model contained in SELF.
Method DM_getNORules returns the number of rules in DM_content. As mentioned before, DM_content is a PMML document, so the implementation of that method must include XML processing techniques in order to find the value of the attribute "numberOfRules" in the element "AssocInputStats" (see Section 3.1.8). For example, this could be an XQuery statement.
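The standard leaves the XML processing mechanism open; as an illustrative sketch (ours, not part of SQL/MM), the same lookup could be done in Python with ElementTree, assuming DM_content holds a PMML document like the earlier examples:

```python
import xml.etree.ElementTree as ET

def dm_get_no_rules(dm_content):
    """Sketch of the DM_getNORules lookup: parse the PMML document stored
    in the DM_content CLOB and read numberOfRules from AssocInputStats."""
    root = ET.fromstring(dm_content)
    stats = root.find(".//AssocInputStats")
    return int(stats.get("numberOfRules"))

# A minimal DM_content value, mirroring the AssocInputStats example above.
clob = """<PMML version="2.0">
 <AssociationModel>
  <AssocInputStats numberOfTransactions="4" numberOfItems="3"
      minimumSupport="0.6" minimumConfidence="0.5"
      numberOfItemsets="3" numberOfRules="2"/>
 </AssociationModel>
</PMML>"""
assert dm_get_no_rules(clob) == 2
```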
3.3. Common Warehouse Model (CWM)
3.3.1 Overview
The main purpose of CWM [CWM] is to enable easy interchange of warehouse and business
intelligence metadata between warehouse tools, warehouse platforms and warehouse metadata
repositories in distributed heterogeneous environments. CWM is based on three key industry
standards:
UML - Unified Modeling Language, an OMG modeling standard
MOF - Meta Object Facility, an OMG metamodeling and metadata repository standard
XMI - XML Metadata Interchange, an OMG metadata interchange standard
The CWM provides a framework for representing metadata about data sources, data targets,
transformations and analysis, and the processes and operations that create and manage
warehouse data and provide lineage information about its use. The CWM Metamodel consists
of a number of sub-metamodels which represent common warehouse metadata in the major areas of interest to data warehousing and business intelligence. In this section we give an overview of the Data Mining sub-metamodel.
3.3.2 Data Mining Metamodel
The CWM Data Mining metamodel represents three conceptual areas:
The overall Model description
Settings
Attributes
The Model conceptual area consists of a generic representation of a data mining model (that
is, a mathematical model produced or generated by the execution of a data mining algorithm).
This consists of MiningModel, a representation of the mining model itself, MiningSettings,
which drive the construction of the model, ApplicationInputSpecification, which specifies the
set of input attributes for the model, and MiningModelResult, which represents the result set
produced by the testing or application of a generated model.
The Settings conceptual area elaborates further on the mining settings and their usage
relationships to the attributes of the input specification. MiningSettings has four
subclasses: StatisticsSettings, ClusteringSettings, SupervisedMiningSettings and
AssociationRulesSettings. SupervisedMiningSettings is further subclassed into
ClassificationSettings and RegressionSettings, and a CostMatrix is defined for representing
cost values associated with misclassifications.
AttributeUsageRelation consists of attributes that further classify the usage of
MiningAttributes by MiningSettings (e.g., relative weight). Several associations are also used
to explicitly define requirements placed on attributes by certain subclasses of settings (e.g.,
target, transactionId and itemId).
The Attributes conceptual area defines two subclasses of MiningAttribute: NumericAttribute
and CategoricalAttribute. Category represents the category properties and values that either a
CategoricalAttribute or an OrdinalAttribute might possess, while CategoryHierarchy represents
any taxonomy with which a CategoricalAttribute might be associated.
All of the above classes are expressed in UML and are described in more detail in [CWM].
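As a rough sketch, the inheritance structure of the Settings and Attributes areas can be rendered in code. The class names follow the metamodel text; the empty bodies are placeholders, since CWM defines the metamodel in UML, not as executable classes:

```python
# Sketch of the CWM Data Mining settings/attributes hierarchy (names from
# the metamodel description; bodies intentionally empty).

class MiningAttribute:
    pass

class NumericAttribute(MiningAttribute):
    pass

class CategoricalAttribute(MiningAttribute):
    pass

class MiningSettings:
    pass

class StatisticsSettings(MiningSettings):
    pass

class ClusteringSettings(MiningSettings):
    pass

class AssociationRulesSettings(MiningSettings):
    pass

class SupervisedMiningSettings(MiningSettings):
    pass

class ClassificationSettings(SupervisedMiningSettings):
    pass

class RegressionSettings(SupervisedMiningSettings):
    pass

# Both supervised variants reach MiningSettings through
# SupervisedMiningSettings, mirroring the two-level subclassing above.
assert issubclass(ClassificationSettings, MiningSettings)
assert issubclass(RegressionSettings, SupervisedMiningSettings)
```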
3.4. Java DM API
Java DM API (JDMAPI) [JDM] follows Sun's Java Community Process as a Java
Specification Request (JSR). It addresses the need for an API that will give "procedural"
support to all the existing and evolving data mining standards. A final proposal of the API has
not been issued yet, but one is expected in early 2003. The group participating in the
specification of the API consists of the following members:
BEA Systems, Blue Martini Software, Werner Dubitzky, Hyperion Solutions Corp., IBM,
Kana Communications Inc., Magnify Research Inc., MINEit Software Ltd., Oracle,
Quadstone, SAP AG, SAS Institute, SPSS, Strategic Analytics, and Sun Microsystems, Inc.
The JDMAPI specification will draw on design concepts addressed in PMML, CWM and
SQL/MM Data Mining. It will support the building of data mining models; the scoring of
data using models; the creation, storage, access and maintenance of data and metadata
supporting data mining results; and selected data transformations. JDMAPI will be based on
the Java 2 Platform, Enterprise Edition.
The goal of JDMAPI is to provide for data mining systems functionality similar to what
JDBC provides for relational databases.
3.5. Oracle9i Data Mining
Oracle has embedded data mining within the Oracle9i database with Oracle9i Data Mining
(ODM) [Oracle9i]. All the functionality for data mining operations, such as model building,
scoring, and testing, is provided via a Java API. Oracle9i Data Mining consists of
the following components:
Oracle9i Data Mining (ODM) API
Data Mining Server (DMS)
The ODM API allows users to write programs that perform data mining operations. It is based
on the proposed concepts of the Java Data Mining API (see section 3.4).
The Data Mining Server (DMS) resides on the database server side and accepts requests from
programs written using the ODM API. DMS performs the processing of these requests and
delivers results to the client applications. The DMS also provides a metadata repository
consisting of mining input objects and result objects, along with the namespaces within which
these objects are stored and retrieved. Oracle9i Data Mining supports two data mining
functions: classification for supervised learning and association rules for unsupervised
learning. The mining functions use two algorithms: Naive Bayes and Association Rules.
Every model has characteristics similar to those addressed in PMML and SQL/MM Data
Mining, and models are persisted in the DMS using relational tables. The model instances are
stored in the DMS, and users can refer to them using a user-specified unique name. Every
step in the model-building procedure creates persistent objects that can be used by multiple
applications.
The most important object, in the context of storing data mining results, is the mining result
object. A mining result object contains the end products of test or build-model mining
operations. ODM persists mining results in relational tables in the DMS. A mining result
object contains the operation start time and end time, the name of the model used, the input
data location, and the output data location for the data mining operation.
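The contents of a mining result object listed above can be sketched as a simple record. The field names below are illustrative assumptions, not Oracle's actual table schema:

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class MiningResult:
    """Sketch of the fields the report lists for an ODM mining result
    object; names are illustrative, not Oracle's actual schema."""
    start_time: datetime
    end_time: datetime
    model_name: str            # user-specified unique model name
    input_data_location: str   # e.g. a schema-qualified table name
    output_data_location: str

# Hypothetical result of a build/test operation.
result = MiningResult(
    start_time=datetime(2003, 1, 1, 10, 0),
    end_time=datetime(2003, 1, 1, 10, 5),
    model_name="nb_churn_model",
    input_data_location="MINING_DATA.CUSTOMERS",
    output_data_location="MINING_DATA.CHURN_SCORES",
)
print(result.model_name)  # prints nb_churn_model
```

In ODM itself such records are rows in relational tables inside the DMS, retrieved through the Java API rather than constructed by client code.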
Oracle intends to support other models in future versions, such as Decision Trees and
Classification models. It also intends to provide full support of SQL/MM and PMML.
3.6. Information Discovery DataMining Suite
Information Discovery [IDMS] has developed a set of tools and systems for data mining and
knowledge extraction. These products make use of the Pattern Query Language (PQL). PQL
is a pattern-oriented query language specifically designed to provide business users access to
refined information. No other information is available (at least through their web site) about
the syntax or the semantics of the language. Its concept is very close to the motivation of the
PANDA project, which is the reason for mentioning it in this survey.
4. Conclusion
The database research community claims that technology has reached a critical point since
there are certain requirements and information types that are either partially supported or not
supported at all [B+98]. The innovation of the PANDA project lies in the vision for a new
approach aiming at the definition of a system architecture that efficiently represents,
maintains and manages patterns. This approach applies to a variety of domains, including,
knowledge discovery in (traditional or non-traditional) databases, time-involving applications
(time series or moving objects databases), multimedia systems (image or video databases),
scientific data, and the WWW (as a huge repository of unstructured information). The
cornerstone of this new approach will be the pattern concept, aiming at representing huge
volumes of information in an effective way.
Moreover, the PANDA project aims to integrate existing approaches towards a novel
logical incorporation of patterns into a data model, a language, and base management system
support.
With regard to PMML and the other modelling approaches, the proposed system architecture
takes a vertical approach, defining an extensible type system. As has already been discussed
in the previous sections, patterns arise from different scientific fields (e.g., data mining,
mathematics, information retrieval). PANDA aims at supporting any pattern type regardless
of the application, while other approaches are mainly oriented to data mining patterns.
The majority of users, in both scientific and business fields, do not want massive volumes of
data; they are interested in the patterns and trends hidden within the data. Since these patterns
need to be accessed, manipulated and managed just as data elements are managed, the concept
of "pattern management" is introduced. Pattern management systems deal with patterns just
as data management systems deal with data. Moreover, they require distinct repositories and
query languages, corresponding to the languages that have been developed for data management.
In this report we reviewed different types of patterns from many areas and addressed the
current efforts on modeling data mining operations, along with the corresponding results.
Furthermore, we examined different procedures of pattern extraction, which give a brief idea
of the diversity among the various pattern types. After a close observation of the various
pattern types, and given the informal definition of a pattern as a compact and semantically
rich representation of raw data, we draw the conclusion that they share many common
characteristics.
One of the challenges in the field of pattern mining (or pattern recognition) is the
development of a framework capable of representing and dealing with every kind of pattern,
independently of the application and/or the method used to extract the patterns. This framework
has to serve as a precise conceptual foundation for the representation and behavior of
patterns. It will be the basis for the design and development of a system that handles (stores /
processes / retrieves) patterns and supports pattern-related operations.
To proceed with the definition of the considered framework, we have to define the structure
and the requirements for representing each kind of pattern, while the relationships among
patterns also have to be identified. It is also important to identify the behavior of patterns and
the functions that a pattern-related system has to support.
References
[AGO01] G. Agostini, M. Longari and E. Pollastri, “Musical instrument timbres
classification with spectral features”, Proceedings of IEEE Multimedia Signal
Processing Conference (2001).
[AS94] Rakesh Agrawal, Ramakrishnan Srikant. “Fast Algorithms for Mining
Association Rules”. Proc. of the 20th VLDB Conference, 1994.
[AS95] Rakesh Agrawal and Ramakrishnan Srikant: Mining Sequential Patterns, IBM
Almaden Research Center, 1995.
[AW95] C. Ahlberg and E. Wistrand. IVEE: An information visualization and exploration
environment. In Proc. Information Visualization, Atlanta, GA, pages 66–73, 1995.
[BAU70] L.E. Baum, T. Petrie, G. Soules, N. Weiss, "A maximization technique occurring in
the statistical analysis of probabilistic functions of Markov chains", Annals of
Mathematical Statistics, vol. 41, pp. 164-171, 1970.
[B+98] P. A. Bernstein et al., “The Asilomar Report on Database Research”, SIGMOD
Record, 27(4):74-80, December 1998.
[BCH00] Andi Baritchi, Diane J. Cook and Lawrence B. Holder: Discovering Structural
Patterns in Telecommunications Data, Texas, 2000.
[BDO95] M. W. Berry, S. T. Dumais, G. W. O'Brien, Using linear algebra for intelligent
information retrieval, SIAM Review, 37(4): 573-595, 1995.
[BL96] Michael Berry, Gordon Linoff. "Data Mining Techniques: For Marketing, Sales,
and Customer Support". John Wiley, 1996.
[BRD99] Alexandre M. Braga, Cecilia M.F. Rubira and Ricardo Dahab: Tropyc: A Pattern
Language for Cryptographic Software, Brazil, 1999.
[BSC96] A. Buja, D.F. Swayne, and D. Cook. Interactive high-dimensional data
visualization. Journal of Computational and Graphical Statistics, 5(1): 78–99,
1996.
[Che99] C.Chen. Information Visualisation and Virtual Environments. Springer-Verlag,
London, 1999.
[Chiu97] S. Chiu. "Extracting Fuzzy Rules from Data for Function Approximation and
Pattern Classification". Fuzzy Information Engineering: A Guided Tour of
Applications (Eds.: D. Dubois, H. Prade, R. Yager), 1997.
[CMS 99] Card S., Mackinlay J., Shneiderman B.: ‘Readings in Information Visualization:
Using Vision to Think’, Morgan Kaufmann, 1999.
[CN95] Peter Clifford and Geoff Nicholls: A Metropolis Sampler for Polygonal Image
Reconstruction, UK, 1995.
[CS96] P. Cheeseman, J. Stutz. "Bayesian Classification (AutoClass): Theory and
Results". Advances in Knowledge Discovery and Data Mining (Eds.: U.
Fayyad et al.), AAAI Press, 1996.
[CON02] D. Conklin, “Representation and Discovery of vertical patterns in Music”,
Lecture Notes in Artificial Intelligence (LNAI) 2445, Springer Verlag,
2002.
[CWM] Common Warehouse Metamodel (CWM), available at http://www.omg.org/cwm
[DAN02] R. Dannenberg and N. Hu, “Discovering Musical Structure in Audio
Recordings”, Lecture Notes in Artificial Intelligence (LNAI) 2445,
Springer Verlag, 2002.
[DDF+90] S. Deerwester, S. T. Dumais, G. Furnas, Th. K. Landauer, R. Harshman, Indexing
by Latent Semantic Analysis, Journal of the Society for Information Science,
41(6): 391-407, 1990.
[DEL87] J. Deller, J. Proakis, J. Hansen, "Discrete-Time Processing of Speech Signals",
Prentice-Hall, 1987.
[DLLL97] S. T. Dumais, T. A. Letsche, M. L. Littman, T. K. Landauer, Automatic
cross-language retrieval using Latent Semantic Indexing, In AAAI Spring
Symposium on Cross-Language Text and Speech Retrieval, March 1997.
[DLMRS98] Gautam Das, King-Ip Lin, Heikki Mannila, Gopal Renganathan and
Padhraic Smyth: Rule Discovery from Time Series, USA, 1998.
[DMG] DMG, Predictive Model Markup Language (PMML), available at
http://www.dmg.org/pmmlspecs_v2/pmml_v2_0.html
[ERR01] A. Eronen, "Comparison of features for musical instrument recognition",
Proceedings of the 2001 IEEE Workshop on Applications of Signal Processing to
Audio and Acoustics (WASPAA), 2001.
[FFKLLRSZ98] Ronen Feldman, Moshe Fresko, Yakov Kinar, Yehuda Lindell, Orly
Liphstat, Martin Rajman, Yonatan Schler, Oren Zamir: Text Mining at the Term
Level, 1998.
[FPSU96] U. Fayyad, G. Piatetsky-Shapiro, P. Smyth and R. Uthurusamy (editors). "From
Data Mining to Knowledge Discovery: An Overview". Advances in Knowledge
Discovery and Data Mining. AAAI Press, 1996.
[G99] D. M. Gavrila: The Visual Analysis of Human Movement: A Survey, Germany,
1999.
[GL96] G.H. Golub, C.F. Van Loan. Matrix Computations, Third Edition, Johns
Hopkins University Press, Baltimore-London, 1996
[GMPS96] Glymour C., Madigan D., Pregibon D., Smyth P., "Statistical Inference and Data
Mining", CACM, vol. 39 (11), 1996, pp. 35-42.
[GMV96] Guyon, I., Matic, N. and Vapnik, V.: Discovering informative patterns and data
cleaning. In Fayyad U., Piatetsky-Shapiro G., Smyth P. and Uthurusamy, R. (ed.)
Advances in Knowledge Discovery and Data Mining, AAAI Press/The MIT
Press, Menlo Park, California, (1996)
[GRS98] Sudipto Guha, Rajeev Rastogi, Kyuseok Shim. "CURE: An Efficient Clustering
Algorithm for Large Databases". In Proceedings of the ACM SIGMOD
Conference, 1998.
[HH99] R. J. Hilderman, H. J. Hamilton. "Knowledge Discovery and Interestingness
Measures: A Survey", Technical Report CS 99-04, Dept of Computer Science,
University of Regina, October 1999.
[HER02] P. Herrera, A. Yeterian, F. Gouyon, “Automatic Classification of Drum Sounds:
A Comparison of Feature Selection Methods and Classification Techniques”,
Lecture Notes in Artificial Intelligence (LNAI) 2445, Springer Verlag, 2002.
[HK01] Jiawei Han, Micheline Kamber. “Data Mining: Concepts and Techniques”.
Academic Press, 2001.
[Hor+98] T. Horiuchi. "Decision Rule for Pattern Classification by Integrating Interval
Feature Values". IEEE Transactions on Pattern Analysis and Machine
Intelligence, Vol.20, No.4, April 1998, pp.440-448.
[IDMS] Information Discovery DataMining Suite, available at
http://www.patternwarehouse.com/dmsuite.htm
[Jan+98] Cezary Z. Janikow, "Fuzzy Decision Trees: Issues and Methods", IEEE
Transactions on Systems, Man, and Cybernetics, Vol. 28, Issue 1, pp 1-14, 1998.
[JDM] Java Data Mining API, available at http://www.jcp.org/jsr/detail/73.prt
[Kei 00] Keim D. A.: ‘Designing Pixel-oriented Visualization Techniques: Theory and
Applications’, Transactions of Visualization and Computer Graphics, 2000.
[Kei 01a] D. Keim. Visual exploration of large databases. Communications of the ACM,
44(8): 38–44, 2001.
[Kei 01b] Keim D. A.: ‘An Introduction to Information Visualization Techniques for
Exploring Very Large Databases’, Tutorial Notes, Visualization 2001, San Diego,
CA, 2001.
[LA94] Leung Y., Apperley M.: A Review and Taxonomy of Distortion-oriented
Presentation Techniques, Proc. Human Factors in Computing Systems CHI '94
Conf., Boston, MA, pp. 126-160, 1994.
[LAS97] Lent, B., Agrawal, R., Srikant, R.: Discovering Trends in Text Databases. In:
Proceedings of the 3rd International Conference on Knowledge Discovery (KDD),
1997.
[LB98] Terran Lane and Carla E. Brodley: Temporal Sequence Learning and Data
Reduction for Anomaly Detection, School of Electrical and Computer
Engineering and the COAST Laboratory, Purdue University, West Lafayette,
1998.
[LK94] Edward W. Large and John F. Kolen: Resonance and the Perception of Musical
Meter, The Ohio State University, 1994.
[MAR96] M. Mehta, R. Agrawal, J. Rissanen. "SLIQ: A fast scalable classifier for data
mining". In EDBT'96, Avignon, France, March 1996.
[MAR01] A. Marsden, “Representing melodic patterns as networks of elaborations”, in
Computers and the Humanities, 35:37-54, 2001.
[MER01] D. Meredith, G. Wiggins and K. Lemstrom, "Pattern Induction and Matching in
Music and Other Multidimensional Datasets", Proceedings of the Conference on
Systemics, Cybernetics and Informatics, volume X, 2001.
[Mit+97] T. Mitchell. Machine Learning. McGraw-Hill, 1997
[NPD94] J. Nielsen, V.L. Phillips, S.T. Dumais, Information Retrieval of Imperfectly
Recognized Handwriting, available at
http://www.useit.com/papers/handwriting_retrieval.html
[NPP00] Chikahito Nakajima, Massimiliano Pontil and Tomaso Poggio: People Recognition
and Pose Estimation in Image Sequences, Japan, 2000.
[Oracle9i] Oracle9i Data Mining Concepts, available at
http://otn.oracle.com/docs/products/oracle9i/doc_library/release2/datamine.920/a95961/1concept.htm#923516
[PAC00] F. Pachet, P. Roy, D. Cazaly, “A Combinatorial approach to content-based music
selection”, IEEE Multimedia, Vol 1, 2000.
[PIK01] A. Pikrakis, S. Theodoridis and D. Kamarotos, “Recognition of Isolated Musical
Patterns using Context Dependent Dynamic Time Warping”, IEEE Transactions
on Speech and Audio Processing (to appear).
[PIK02] A. Pikrakis, S. Theodoridis and D. Kamarotos, “Recognition of Isolated Musical
Patterns using Hidden Markov Models”, Lecture Notes in Artificial Intelligence
(LNAI) 2445, Springer Verlag, 2002.
[Quin+93] J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
[REI01] J. Reiss, J. Aucouturier, M. Sandler, "Efficient Multi-dimensional searching
routines for music information retrieval", Proceedings of ISMIR 2001.
[RS98] R. Rastogi, K. Shim. "PUBLIC: A Decision Tree Classifier that Integrates
Building and Pruning". Proceedings of the 24th VLDB Conference, New York,
USA, 1998.
[SA95] Ramakrishnan Srikant, Rakesh Agrawal. “Mining Generalized Association
Rules”. Proc. of the 21st VLDB Conference, 1995.
[SAM96] J. Shafer, R. Agrawal, M. Mehta. "SPRINT: A scalable parallel classifier for data
mining". In Proc. of the VLDB Conference, Bombay, India, September 1996.
[SB98] Neil Sumpter and Andrew J. Bulpitt: Learning Spatio-Temporal Patterns for
predicting Object behaviour, UK, 1998.
[SB99] Stephan Schulz and Felix Brandt: Using Term Space Maps to Capture Search
Control Knowledge in Equational Theorem Proving, Germany, 1999.
[Sch97] Oded Schramm: Circle Patterns with the Combinatorics of the Square Grid, The
Weizmann Institute, 1997.
[SL68] G. Salton, M.E. Lesk, Computer Evaluation of Indexing and Text Processing.
Journal of the ACM, 15(1):8-36, 1968
[Shn96] Shneiderman B.: ‘The Eyes Have It: A Task by Data Type Taxonomy for
Information Visualizations’, Proc. Visual Languages, 1996.
[Spe00] B. Spence. Information Visualization. Pearson Education Higher Education
publishers, UK, 2000.
[ST99] Ayman A. Abdel-Samad and Ahmed H. Tewfik: Search Strategies for Radar
Target Localization, University of Minnesota, Minneapolis, 1999.
[SQL/MM] ISO SQL/MM Part 6, available at
http://www.sql-99.org/SC32/WG4/Progression_Documents/FCD/fcd-datamining-2001-05.pdf
[TK98] S. Theodoridis, K. Koutroumbas, “Pattern Recognition”, Academic Press, 1998.
[TZA01] G. Tzanetakis, G. Essl, P. Cook, "Automatic music genre classification of audio
signals", Proceedings of ISMIR 2001.
[VIT67] A.J. Viterbi, “Error bounds for convolutional codes and an asymptotically
optimal decoding algorithm”, IEEE Transactions on Information Theory, vol. 13,
pp. 260-269, Apr. 1967.
[VH99] Remco C. Veltkamp and Michiel Hagedoorn: State-of-the-Art in Shape Matching,
The Netherlands, 1999.
[War 00] Ware C.: ‘Information Visualization: Perception for Design’. Academic Press,
San Diego, 2000.