Knowledge Discovery in Databases (sorry.vse.cz/~berka/docs/4iz451/sl01-kdd-en.pdf)

Knowledge Discovery in Databases T1: introduction

P. Berka, 2012 1/19

Knowledge Discovery in Databases

(Information Harvesting, Data Archeology, Data Mining, Knowledge Distillery, ...)

Non-trivial process of identifying valid, novel, potentially useful and ultimately understandable patterns from data (Fayyad et al., 1996)

Data mining involves the use of sophisticated data analysis tools to discover previously unknown, valid patterns and relationships in large data sets (Adriaans, Zantinge, 1999)

Analysis of observational data sets to find unsuspected relationships and summarize data in novel ways that are both understandable and useful to the data owner (Hand, Mannila, Smyth, 2001)

Data mining is the process of analyzing hidden patterns of data from different perspectives and categorizing them into useful information (techopedia.org, 2011)

Three sources

databases (query languages, OLAP)

statistics (data analysis)

artificial intelligence (machine learning)



KDD Tasks

(Klosgen, Zytkow, 1997)

classification/prediction: the task is to find knowledge that can be applied to automatically process new examples

description: the task is to find the dominant structure or relationships

search for "nuggets": the task is to find partial, novel and surprising knowledge

(Chapman et al., 2000)

data description and summarisation: concise description of characteristics of the data, typically in elementary and aggregated form

segmentation: separation of the data into interesting and meaningful subgroups or classes

concept description: understandable description of concepts or classes to gain insight

classification: build classification models (sometimes called classifiers) which assign the correct class label to previously unseen and unlabeled objects

prediction: similar to classification, but the target attribute (class) is not a qualitative discrete attribute but a continuous one. Prediction also often deals with time-dependent concepts

dependency analysis: describe significant dependencies (or associations) between data items or events
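The difference between classification (discrete target) and prediction (continuous target) can be illustrated with a minimal sketch. It assumes a k-nearest-neighbour approach on a single numeric attribute, which is only one of many possible techniques and is not named in the slides; the training values and the names `nearest`, `classify` and `predict` are hypothetical.

```python
from collections import Counter
from statistics import mean

def nearest(train, x, k):
    """Return the targets of the k training examples closest to x."""
    ranked = sorted(train, key=lambda pair: abs(pair[0] - x))
    return [target for _, target in ranked[:k]]

def classify(train, x, k=3):
    """Classification: discrete target, majority vote among the neighbours."""
    return Counter(nearest(train, x, k)).most_common(1)[0][0]

def predict(train, x, k=3):
    """Prediction: continuous target, average of the neighbours' values."""
    return mean(nearest(train, x, k))

# Hypothetical data: (income, loan class) and (income, loan amount).
labeled = [(20, "n"), (25, "n"), (40, "A"), (45, "A"), (50, "A")]
numeric = [(20, 1.0), (25, 1.5), (40, 3.0), (45, 3.5), (50, 4.0)]

print(classify(labeled, 42))  # A
print(predict(numeric, 42))   # 3.5
```

The same neighbour search serves both tasks; only the way the neighbours' targets are combined changes with the type of the target attribute.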



Managerial viewpoint

Managerial problem

1. Project team

2. Problem specification

3. Data acquisition

4. Method selection

5. Data preprocessing

6. Data mining

7. Interpretation

Knowledge for the solution

Data processing viewpoint



Application areas of KDD

Segmentation and classification (clients of a bank or insurance company),

Credit risk assessment,

Fraud detection,

Prediction of stock market prices,

Prediction of energy consumption,

Intrusion detection,

Churn analysis (telecom service providers, internet providers),

Microarray data analysis (molecular biology),

Targeted marketing,

Medical diagnosis,

Market basket analysis.



Market basket analysis: data exploration

Collected data: content of market baskets in transactional form

Basket_id  Item_id
10011      152
10011      37
10012      1
10012      152
10012      785
10012      6
10013      10
10014      15
10014      811
...        ...
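A first exploration step on transactional data is to group the rows by basket and count how often each item occurs. The following is a minimal sketch using only the rows visible on the slide (the slide indicates that more rows follow); the names `ROWS` and `to_baskets` are hypothetical.

```python
from collections import Counter, defaultdict

# (Basket_id, Item_id) rows as shown on the slide; further rows are elided there.
ROWS = [
    (10011, 152), (10011, 37),
    (10012, 1), (10012, 152), (10012, 785), (10012, 6),
    (10013, 10),
    (10014, 15), (10014, 811),
]

def to_baskets(rows):
    """Group transactional rows into one set of items per basket."""
    baskets = defaultdict(set)
    for basket_id, item_id in rows:
        baskets[basket_id].add(item_id)
    return dict(baskets)

baskets = to_baskets(ROWS)

# Number of baskets in which each item occurs.
item_counts = Counter(item for items in baskets.values() for item in items)

print(len(baskets))               # 4
print(sorted(baskets[10012]))     # [1, 6, 152, 785]
print(item_counts.most_common(1)) # [(152, 2)]
```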



Market basket analysis: dependency analysis
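The slide's figure is not preserved in this transcript. As one common way to express dependencies between items, the sketch below computes the support and confidence of an association rule; that the slide used this particular measure is an assumption. The baskets are rebuilt from the rows visible in the data exploration slide, and the names `BASKETS`, `support` and `confidence` are hypothetical.

```python
# Baskets rebuilt from the visible transactional rows; more rows are elided
# in the slides, so the numbers below describe this small excerpt only.
BASKETS = {
    10011: {152, 37},
    10012: {1, 152, 785, 6},
    10013: {10},
    10014: {15, 811},
}

def support(itemset, baskets):
    """Share of baskets that contain every item of the itemset."""
    itemset = set(itemset)
    hits = sum(1 for items in baskets.values() if itemset <= items)
    return hits / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Share of baskets containing the antecedent that also contain the consequent."""
    base = support(antecedent, baskets)
    joint = support(set(antecedent) | set(consequent), baskets)
    return joint / base if base else 0.0

print(support({152}, BASKETS))           # 0.5
print(confidence({152}, {37}, BASKETS))  # 0.5
```

Here the rule "item 152 implies item 37" holds in one of the two baskets that contain item 152.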



Market basket analysis: classification



KDD Standards

1. Methodologies

(Marban et al., 2009)

5A: developed in the mid-1990s by SPSS. The name is an acronym for the performed steps:

Assess: assess the requirements of the project,

Access: access the available data,

Analyze: perform the analyses,

Act: turn knowledge into actions,

Automate: deploy the models in an automatic way.



SEMMA: developed in the mid-1990s by SAS:

Sample the data by creating one or more data tables,

Explore the data by searching for relationships, trends or anomalies,

Modify the data by creating, selecting, and transforming the variables,

Model the relationships between input and output variables by using various data mining techniques,

Assess the quality of the models.
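The five SEMMA steps can be sketched as a chain of small functions. This is only a toy illustration under stated assumptions: the records are made up, the "model" is a simple majority-class rule per attribute value, and all function names are hypothetical; it does not reflect how SAS Enterprise Miner implements the steps.

```python
from collections import Counter, defaultdict

# Hypothetical toy records; values are invented for illustration.
RECORDS = [
    {"income": "high", "loan": "A"},
    {"income": "low", "loan": "n"},
    {"income": "high", "loan": "A"},
    {"income": "low", "loan": "n"},
    {"income": "high", "loan": "A"},
    {"income": "low", "loan": "A"},
    {"income": None, "loan": "n"},
    {"income": "high", "loan": "A"},
]

def sample(rows):
    """Sample: create two data tables (even positions train, odd positions test)."""
    return rows[0::2], rows[1::2]

def explore(rows, attr):
    """Explore: count the values of one attribute, including missing ones."""
    return Counter(r[attr] for r in rows)

def modify(rows, attr, fill):
    """Modify: replace missing values of attr with a fill value."""
    return [{**r, attr: fill if r[attr] is None else r[attr]} for r in rows]

def model(rows, attr, target):
    """Model: majority target class for each value of attr."""
    by_value = defaultdict(Counter)
    for r in rows:
        by_value[r[attr]][r[target]] += 1
    return {v: c.most_common(1)[0][0] for v, c in by_value.items()}

def assess(rule, rows, attr, target):
    """Assess: share of rows whose target the rule predicts correctly."""
    hits = sum(rule.get(r[attr]) == r[target] for r in rows)
    return hits / len(rows)

train, test = sample(RECORDS)
print(explore(train, "income"))          # Counter({'high': 3, None: 1})
train = modify(train, "income", "low")
test = modify(test, "income", "low")
rule = model(train, "income", "loan")
accuracy = assess(rule, test, "income", "loan")
print(rule)      # {'high': 'A', 'low': 'n'}
print(accuracy)  # 0.75
```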



CRISP-DM: currently a de facto standard supported by most data mining systems



2. Standards to describe models

Predictive Modeling Markup Language

XML-based standard developed by the Data Mining Group (www.dmg.org) that makes it possible to describe data, data transformations and the created models. Main parts of a PMML document:

Header

Data Dictionary

Data Transformations

Model

http://www.dmg.org/



<?xml version="1.0" ?>
<PMML version="4.0">
  <Header copyright="P.B." description="An example decision tree model."/>
  <DataDictionary numberOfFields="5">
    <DataField name="income" optype="categorical">
      <Value value="low"/>
      <Value value="high"/>
    </DataField>
    <DataField name="account" optype="categorical">
      <Value value="low"/>
      <Value value="medium"/>
      <Value value="high"/>
    </DataField>
    <DataField name="sex" optype="categorical">
      <Value value="male"/>
      <Value value="female"/>
    </DataField>
    <DataField name="unemployed" optype="categorical">
      <Value value="yes"/>
      <Value value="no"/>
    </DataField>
    <DataField name="loan" optype="categorical">
      <Value value="A"/>
      <Value value="n"/>
    </DataField>
  </DataDictionary>
  <TreeModel modelName="loan approval decision tree">
    <MiningSchema>
      <MiningField name="income"/>
      <MiningField name="account"/>
      <MiningField name="sex"/>
      <MiningField name="unemployed"/>
      <MiningField name="loan" usageType="predicted"/>
    </MiningSchema>
    <Node score="A">
      <True/>
      <Node score="A">
        <SimplePredicate field="income" operator="equal" value="high"/>
      </Node>
      <Node score="n">
        <SimplePredicate field="income" operator="equal" value="low"/>
        <Node score="A">
          <SimplePredicate field="account" operator="equal" value="high"/>
        </Node>
        <Node score="n">
          <SimplePredicate field="account" operator="equal" value="low"/>
          <Node score="n">
            <SimplePredicate field="unemployed" operator="equal" value="yes"/>
          </Node>
          <Node score="A">
            <SimplePredicate field="unemployed" operator="equal" value="no"/>
          </Node>
        </Node>
      </Node>
    </Node>
  </TreeModel>
</PMML>
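Because PMML is plain XML, a tree model like the one above can be read and applied with a generic XML parser. The following is a minimal sketch using Python's standard `xml.etree.ElementTree`; it is not a full PMML engine. It assumes a document without an XML namespace, handles only `True` and `SimplePredicate` with `operator="equal"`, and returns the score of the last matching node when no child matches (similar to PMML's `noTrueChildStrategy="returnLastPrediction"`, which the example model does not set). The names `TREE_PMML`, `node_matches` and `score` are hypothetical, and only the tree part of the model is repeated here.

```python
import xml.etree.ElementTree as ET

# Tree part of the loan approval example, repeated so the sketch is self-contained.
TREE_PMML = """
<PMML version="4.0">
  <TreeModel modelName="loan approval decision tree">
    <Node score="A">
      <True/>
      <Node score="A">
        <SimplePredicate field="income" operator="equal" value="high"/>
      </Node>
      <Node score="n">
        <SimplePredicate field="income" operator="equal" value="low"/>
        <Node score="A">
          <SimplePredicate field="account" operator="equal" value="high"/>
        </Node>
        <Node score="n">
          <SimplePredicate field="account" operator="equal" value="low"/>
          <Node score="n">
            <SimplePredicate field="unemployed" operator="equal" value="yes"/>
          </Node>
          <Node score="A">
            <SimplePredicate field="unemployed" operator="equal" value="no"/>
          </Node>
        </Node>
      </Node>
    </Node>
  </TreeModel>
</PMML>
"""

def node_matches(node, record):
    """Evaluate the predicate of a Node against a record (dict of field values)."""
    if node.find("True") is not None:
        return True
    pred = node.find("SimplePredicate")
    if pred is None or pred.get("operator") != "equal":
        return False  # other predicate types are not handled in this sketch
    return record.get(pred.get("field")) == pred.get("value")

def score(node, record):
    """Descend into the first matching child; return the last matching node's score."""
    for child in node.findall("Node"):
        if node_matches(child, record):
            return score(child, record)
    return node.get("score")

root = ET.fromstring(TREE_PMML).find("TreeModel/Node")

print(score(root, {"income": "high"}))                                        # A
print(score(root, {"income": "low", "account": "low", "unemployed": "yes"}))  # n
print(score(root, {"income": "low", "account": "low", "unemployed": "no"}))   # A
```

A record with income "low" and account "medium" matches no child of the income node, so under the assumption above it receives that node's score "n".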



3. Programming standards (API)

SQL/MM Data Mining

Standard interface that gives access to data mining algorithms from relational databases

OLE DB for Data Mining

API developed by Microsoft

Java Data Mining

Example of an OLE DB for Data Mining statement:

CREATE MINING MODEL CreditRisk
(
    CustomerId  LONG KEY,
    Income      TEXT DISCRETE,
    Account     TEXT DISCRETE,
    Sex         TEXT DISCRETE,
    Unemployed  BOOLEAN DISCRETE,
    Loan        TEXT DISCRETE PREDICT
)
USING Microsoft_Decision_Trees



Data Mining Systems

cover the whole KDD process (from data preprocessing to model evaluation),

offer more data mining algorithms than single-purpose machine learning systems,

focus on visualization (both in how the system is used and in how data and results are presented and interpreted).

System                 Vendor                 URL
SPM                    Salford Systems        www.salford-systems.com
Clementine             SPSS                   www-01.ibm.com/software/analytics/spss/products/modeler/
Enterprise Miner       SAS Institute          www.sas.com/technologies/analytics/datamining/miner/
GhostMiner             Fujitsu                www.fqs.pl/business_intelligence/products/ghostminer
Intelligent Miner      IBM                    www-01.ibm.com/software/data/infosphere/warehouse/enterprise.html
KnowledgeStudio        Angoss                 www.angoss.com
Oracle Data Mining     Oracle                 www.oracle.com/us/products/database/options/data-mining/index.html
PolyAnalyst            Megaputer              www.megaputer.com/
Statistica Data Miner  StatSoft               www.statsoft.com/products/data-mining-solutions/
LISp Miner             VŠE                    lispminer.vse.cz
RapidMiner             Rapid-I                rapid-i.com/
Weka                   University of Waikato  www.cs.waikato.ac.nz/ml/weka/index.html


Screenshots: Weka, RapidMiner, SAS Enterprise Miner, IBM SPSS Modeler (Clementine)
