Knowledge Discovery and Data Mining Toni Bollinger … Software Group ©2004 IBM Corporation...

® IBM Software Group © 2004 IBM Corporation Knowledge Discovery and Data Mining Toni Bollinger IBM Development Lab, Böblingen, Germany

Page 1

IBM Software Group

© 2004 IBM Corporation

Knowledge Discovery and Data Mining

Toni Bollinger, IBM Development Lab, Böblingen, Germany

Page 2

IBM Software Group | DB2 information management software

IBM Development Lab Böblingen

Page 3

IBM Development Lab Böblingen

(Organization chart; labels recovered from the slide, in extraction order:)

Linux for zSeries, Systems Management, z/OS Components, eLiza Initiative, SAP Solutions, VSE, Competence Center, Linux CoC, Executive Briefing Center, Server Group, Data Management, Content Management / Information Integration, CommonStore, Text Mining, Text Search, DB2 UDB, DB2 Extenders, Business Intelligence, Data Mining, DB2 Performance Tools, SAP DB2 Multiplatform, DM Services, Technical Marketing, Business Partner & Sales Support, Software Group, Hardware, Systems Microcode, Embedded Service Controllers, Server System Nest, Simulation of ESG systems, Server Micropro's, Software, Platform Technology, Platform Strategy, Application Integration & Pervasive, Life Sciences, WebSphere Business Integrator, Financial Networks, MERVA, CoreBank, Industry Solutions, IGS Development & Services, SW Solutions & Services, WebSphere Workflow, WebSphere Portal Server, Pervasive Computing, Speech, REXX, Office Solutions, ASIC Design Center, PRIZMA Switch, OEM Micropro's, Technology Group

Page 4

IBM Research and Development – world wide

(World map; legend: Research, Hardware Development, Software Development, Hardware and Software Development)

Locations: Greenock, Dublin, Hursley, Havant, Böblingen, Zürich, La Gaude, Rome, Rochester, Boulder, Almaden, San Jose, Santa Teresa, Tucson, Austin, Toronto, Burlington, Endicott, East Fishkill, Poughkeepsie, Yorktown Heights, Raleigh, Yasu, Tokyo, Fujisawa, Yamato, Delhi, Haifa, Beijing

2001: 3411 US Patents

Page 5

Outline of the talk

§ Introduction
§ Data Mining Techniques
  - Association Rules
  - Clustering
  - Tree Classification
§ Data Mining Methodology
  - Data Mining Process
§ KDD & Data Mining - Where are we now?
  - Data Mining Standards
  - Data Mining Solutions

Page 6

Introduction

§ IJCAI 1989, Workshop on Knowledge Discovery in Databases
  - Introduction of the term "Knowledge Discovery"
  - Brought together researchers from different disciplines
    - Artificial Intelligence
      - Machine learning
      - Neural networks
    - Statistics
    - Database research
    - Visualization
  - Motivation
    - The amount of data is growing exponentially
    - The ability to understand these data is lagging behind
    - Growing need for intelligent analysis techniques

Page 7

Introduction

§ D. Michie, 1990
  - "the area that is going to explode is the use of machine learning tools as a component of large-scale data analysis"
§ Silberschatz, Stonebraker, Ullman: Database Systems: Achievements and Opportunities
  - Data mining ranked as one of the most promising research topics for the 1990s

Page 8

Definition of the term "knowledge discovery"

§ W. Frawley, G. Piatetsky-Shapiro, C. Matheus:

Knowledge discovery is
  - the nontrivial extraction
  - of previously unknown and
  - potentially useful information
from data.

Page 9

Challenges for Knowledge Discovery

§ Real data
  - Different data types:
    - Numeric
    - Categorical
  - Structured/unstructured information
  - High/low number of different values
  - Missing/invalid values
  - Inconsistencies, noise in data
    - Different encodings of the same information
§ High volumes of data
  - Learning sets in machine learning: up to 1000 tuples
  - Number of records in real data: several millions

Page 10

Business Trends in the 1990s

§ Customer Relationship Management
  - Knowing more about the customer to enhance the customer lifetime value
    - Cross- and up-selling
    - Target marketing
    - Customer acquisition and churn prevention
§ Supply Chain Management
§ E-business, Internet
  - All interactions with the customers are through a computer
  - Enormous amounts of data available
  - Application areas
    - Personalization
    - Click stream analysis
    - Product recommendations

Page 11

Data Warehousing, Business Intelligence

§ Data is extracted from the operational systems and stored in a data warehouse in a systematic and unified way.
§ Data in the warehouse represents the "truth" for a company.
§ Data is used for decision support.
§ Analysis techniques
  - SQL queries, standard reporting
    - Revenue of the month, the 5 most and 5 least frequently sold products, ...
  - Online Analytical Processing (OLAP)
    - Revenue per month per product per region at different hierarchy levels
    - An OLAP cube represents a set of different queries
  - Data Mining

Page 12

Data Mining <-> OLAP, Standard Reporting

SQL, OLAP: problem -> hypotheses? -> verification of the hypotheses -> known correlations

Data Mining: problem -> generation of hypotheses -> known correlations + unknown correlations

Page 13

Data Mining Techniques

§ Discovery – unsupervised learning
  - Clustering
  - Association Rules
  - Sequential Patterns
§ Prediction – supervised learning
  - Classification
  - Regression

Page 14

Association Rules Discovery

§ Discovery of rules in data
  - Which combinations/events occur simultaneously with a high frequency?
  - Which combinations are unusual?
§ Application areas
  - Market basket analysis
    - If someone buys low-fat margarine, then s/he buys brie cheese.
  - Web log analysis
    - If someone visits page x, s/he visits page y as well.
  - Quality management
    - If defect x occurs, then defect y occurs as well.

Page 15

Association rules attributes

100 transactions
  - 50 transactions with low-fat margarine
  - 30 transactions with brie cheese
  - 20 transactions with both

support = 20% = 20/100
confidence = 67% = 20/30
lift = 1.3 = confidence/50%

low-fat margarine → brie cheese
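The three rule attributes can be checked with a few lines of Python. This is a minimal sketch; the function name `rule_metrics` and its parameter names are illustrative and not part of any product API. Note that the slide's confidence of 20/30 corresponds to taking the item that occurs in 30 transactions as the rule body; with the 50-transaction item as the body, the same formula gives 20/50 = 40%. The sketch computes both directions, and the lift is the same in either case.

```python
def rule_metrics(n_total, n_body, n_head, n_both):
    """Return (support, confidence, lift) of the rule body -> head from raw counts."""
    support = n_both / n_total              # share of all transactions containing body and head
    confidence = n_both / n_body            # share of body transactions that also contain the head
    lift = confidence / (n_head / n_total)  # confidence relative to the head's overall frequency
    return support, confidence, lift


# Counts from the slide: 100 transactions, items occurring in 50 and 30
# transactions, 20 transactions containing both.
body_30 = rule_metrics(100, n_body=30, n_head=50, n_both=20)
body_50 = rule_metrics(100, n_body=50, n_head=30, n_both=20)

for label, (s, c, l) in (("body = 30-transaction item", body_30),
                         ("body = 50-transaction item", body_50)):
    print(f"{label}: support={s:.0%}, confidence={c:.0%}, lift={l:.2f}")
```

A lift above 1 means the head occurs more often in transactions containing the body than in the data overall.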

Page 16

Association rules discovery

§ Discovery problem:
  - Given
    - a set of transactions T,
    - values for minimum support and confidence,
  - find all rules A → B that satisfy these constraints.
§ Discovery algorithm
  1. Find all item sets with minimum support. These are called frequent item sets.
  2. For each frequent item set IF
     1. compute all partitions into two disjoint subsets IF1 and IF2,
     2. compute the confidence of the rule IF1 → IF2,
     3. and keep those rules with a confidence greater than or equal to the minimum confidence.
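Step 2 of the discovery algorithm can be sketched as follows. This is an illustration, not the Intelligent Miner implementation: the function name `generate_rules` and the input format (a dictionary from frozen item sets to support counts) are assumptions, and the counts in the usage lines are made up. The sketch relies on every subset of a frequent item set being present in the dictionary.

```python
from itertools import combinations


def generate_rules(support, min_confidence):
    """Derive rules body -> head from frequent item sets.

    support maps each frequent item set (a frozenset) to its support count.
    """
    rules = []
    for itemset, count in support.items():
        if len(itemset) < 2:
            continue  # a rule needs a non-empty body and a non-empty head
        # Enumerate all partitions of the item set into body and head.
        for size in range(1, len(itemset)):
            for body_items in combinations(sorted(itemset), size):
                body = frozenset(body_items)
                head = itemset - body
                confidence = count / support[body]
                if confidence >= min_confidence:
                    rules.append((body, head, confidence))
    return rules


# Hypothetical support counts for two items a and b.
demo_support = {frozenset("a"): 4, frozenset("b"): 3, frozenset("ab"): 3}
demo_rules = generate_rules(demo_support, min_confidence=0.8)
for body, head, conf in demo_rules:
    print(sorted(body), "->", sorted(head), f"{conf:.0%}")
```

With these counts only b → a passes the 80% threshold; a → b reaches 3/4 = 75%.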

Page 17

Apriori algorithm for frequent item sets (R. Agrawal, R. Srikant)

§ Property of frequent item sets:
  - If an item set is frequent, every subset is frequent as well.

1. N = 1; determine all frequent item sets with 1 element: IS(1).
2. From the frequent item sets with N elements, IS(N), build the candidate set for frequent item sets with N+1 elements.
3. Determine the support for the candidate item sets and retain those with at least minimum support: IS(N+1).
4. If IS(N+1) == {}, return IS(1) ∪ IS(2) ∪ ... ∪ IS(N).
5. N = N+1; go to 2.
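The five steps above can be written as a short loop. The following is a minimal, unoptimized sketch: the function name `apriori`, the use of absolute support counts instead of percentages, and the sample baskets are all assumptions made for illustration. Candidate generation joins pairs of frequent N-element sets and then uses the subset property to discard candidates that contain an infrequent N-element subset.

```python
from itertools import combinations


def apriori(transactions, min_count):
    """Return {item set: support count} for all item sets with support >= min_count."""
    baskets = [frozenset(t) for t in transactions]

    def count(candidates):
        return {c: sum(1 for b in baskets if c <= b) for c in candidates}

    # Step 1: frequent item sets with one element.
    singles = {frozenset([item]) for b in baskets for item in b}
    current = {c: n for c, n in count(singles).items() if n >= min_count}
    frequent = dict(current)
    size = 1
    while current:  # step 4: stop when no new frequent item sets were found
        # Step 2: build candidates with size + 1 elements and prune those
        # that have an infrequent subset of the current size.
        candidates = {a | b for a in current for b in current if len(a | b) == size + 1}
        candidates = {c for c in candidates
                      if all(frozenset(s) in current for s in combinations(c, size))}
        # Step 3: count the support and keep the frequent candidates.
        current = {c: n for c, n in count(candidates).items() if n >= min_count}
        frequent.update(current)
        size += 1  # step 5
    return frequent


# Hypothetical baskets, minimum support of 2 transactions.
frequent = apriori([{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b"}], min_count=2)
for itemset, n in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), n)
```

In the sample data the candidate {a, b, c} is pruned without counting, because its subset {b, c} is not frequent.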

Page 18

Example:

§ transaction table

transaction  item
T1           juice
T1           softdrink
T1           beer
T2           juice
T2           softdrink
T2           wine
T3           juice
T3           water
T3           beer
T4           juice
T4           beer
T4           softdrink
T5           water

§ minimum support 40 % (2 transactions)
§ minimum confidence 70 %

Page 19

Frequent 1-element item sets

item       support
juice      4
softdrink  3
beer       3
wine       1
water      2

Page 20

Frequent 2-element item sets

2-element candidates  support
juice, softdrink      3
juice, beer           3
juice, water          1
softdrink, beer       2
softdrink, water      0
beer, water           1

Page 21

Frequent 3-element item sets

3-element candidates    support
juice, softdrink, beer  2

Page 22

Determining the confidence

rule                        support  rule body support  confidence
juice → softdrink           3        4                  75 %
softdrink → juice           3        3                  100 %
juice → beer                3        4                  75 %
beer → juice                3        3                  100 %
softdrink → beer            2        3                  67 %
juice & softdrink → beer    2        3                  67 %
softdrink & beer → juice    2        2                  100 %
beer & juice → softdrink    2        3                  67 %
beer → softdrink            2        3                  67 %
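The table can be reproduced by counting directly in the example's transaction data. This sketch takes the five transactions and the nine rules from the example; the helper names `support` and `confidence` are illustrative.

```python
# Transactions of the example (T1 to T5).
transactions = {
    "T1": {"juice", "softdrink", "beer"},
    "T2": {"juice", "softdrink", "wine"},
    "T3": {"juice", "water", "beer"},
    "T4": {"juice", "beer", "softdrink"},
    "T5": {"water"},
}


def support(items):
    """Number of transactions that contain all the given items."""
    return sum(1 for basket in transactions.values() if set(items) <= basket)


def confidence(body, head):
    """Support of body and head together, divided by the support of the body."""
    return support(set(body) | set(head)) / support(body)


# The nine rules listed in the table, as (body, head) pairs.
rules = [
    (("juice",), ("softdrink",)),
    (("softdrink",), ("juice",)),
    (("juice",), ("beer",)),
    (("beer",), ("juice",)),
    (("softdrink",), ("beer",)),
    (("juice", "softdrink"), ("beer",)),
    (("softdrink", "beer"), ("juice",)),
    (("beer", "juice"), ("softdrink",)),
    (("beer",), ("softdrink",)),
]

MIN_CONFIDENCE = 0.70
kept = [(b, h) for b, h in rules if confidence(b, h) >= MIN_CONFIDENCE]
for body, head in rules:
    status = "kept" if (body, head) in kept else "dropped"
    print(" & ".join(body), "->", " & ".join(head), f"{confidence(body, head):.0%}", status)
```

With the minimum confidence of 70 %, five of the nine rules remain: the two 75 % rules and the three 100 % rules.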

Page 23

Association sample with Intelligent Miner Visualization

Page 24

Association sample with Intelligent Miner Visualization – graphical view

Page 25

Clustering - Task

§ Given a set of data records (a relational table)
§ Find a partitioning of this set into disjoint subsets (clusters, segments) such that the elements within a subset have a high similarity and the elements of different subsets have a high dissimilarity.

Page 26

Clustering Methods

§ Neural networks
  - Kohonen Feature Maps
    - for numeric data
    - categorical data has to be transformed to numeric data
    - number of clusters given by the size of the network
§ Statistical methods
  - K-means clustering
    - for numeric data
    - categorical data has to be transformed to numeric data
    - number of clusters has to be specified by the user
§ Demographic clustering (IBM DB2 Intelligent Miner)
  - Initially for categorical data only
  - Extended to deal with numeric data as well
  - Number of clusters detected by the clustering algorithm
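Of the methods listed, k-means is the easiest to show in a few lines. The sketch below is a plain textbook version (Lloyd's algorithm) for numeric tuples. It is not the demographic clustering of Intelligent Miner; the function name, the simplistic initialisation with the first k points, and the sample points are assumptions made for illustration. It shows why the user has to supply the number of clusters: k fixes the number of centroids from the start.

```python
def kmeans(points, k, iterations=20):
    """Plain k-means on tuples of numbers; returns (assignment, centroids)."""
    centroids = [points[i] for i in range(k)]  # simplistic init: the first k points
    assignment = []
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid
        # (squared Euclidean distance).
        assignment = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                new_centroids.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # leave an empty cluster where it is
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return assignment, centroids


# Hypothetical two-dimensional records forming two obvious groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
assignment, centroids = kmeans(points, k=2)
print(assignment)
print(centroids)
```

Categorical attributes would first have to be encoded as numbers, as the slide notes, because the distance and the mean are only defined for numeric values.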

Page 27

Application areas

§ Customer segmentation based on shopping behaviour and demographic data
  - Enables targeted marketing actions
§ Store segmentation/profiling
  - Product offering can be adapted to the characteristics of the segment the store belongs to
§ Fraud detection
  - Outliers and unusual behaviour can be contained in small clusters (niches)

Page 28

Clustering example – online banking

Input records:

   revenue    gender  domicile  age
A  high       M       urban     < 30
B  low        M       rural     30-40
C  middle     F       urban     < 30
D  very high  M       urban     > 40
E  high       F       urban     < 30
F  low        M       rural     > 40

Similarity matrix (number of matching attribute values):

   A  B  C  D  E  F
A  4  1  2  2  3  1
B  1  4  0  1  0  3
C  2  0  4  1  3  0
D  2  1  1  4  1  2
E  3  0  3  1  4  0
F  1  3  0  2  0  4

Similarity matrix after reordering by the segmentation algorithm:

   A  C  E  B  F  D
A  4  2  3  1  1  2
C  2  4  3  0  0  1
E  3  3  4  0  0  1
B  1  0  0  4  3  1
F  1  0  0  3  4  2
D  2  1  1  1  2  4

Result: 3 segments

   revenue    gender  domicile  age
A  high       M       urban     < 30
C  middle     F       urban     < 30
E  high       F       urban     < 30

   revenue    gender  domicile  age
B  low        M       rural     30-40
F  low        M       rural     > 40

   revenue    gender  domicile  age
D  very high  M       urban     > 40
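The similarity matrix of this example can be recomputed by counting, for each pair of records, the attributes on which they agree. The grouping step in the sketch below is an assumption: the slide does not say how the segmentation algorithm forms the segments, so the sketch simply links records whose similarity is at least 3 and takes the connected groups. With the six records of the example this happens to give the same three segments.

```python
# Records A to F from the example: (revenue, gender, domicile, age).
records = {
    "A": ("high", "M", "urban", "< 30"),
    "B": ("low", "M", "rural", "30-40"),
    "C": ("middle", "F", "urban", "< 30"),
    "D": ("very high", "M", "urban", "> 40"),
    "E": ("high", "F", "urban", "< 30"),
    "F": ("low", "M", "rural", "> 40"),
}
names = sorted(records)


def similarity(r, s):
    """Number of attributes with identical values in both records."""
    return sum(1 for x, y in zip(r, s) if x == y)


matrix = {a: {b: similarity(records[a], records[b]) for b in names} for a in names}


def group(threshold):
    """Connected groups of records linked by a similarity >= threshold (assumed rule)."""
    segments, seen = [], set()
    for start in names:
        if start in seen:
            continue
        segment, stack = [], [start]
        while stack:
            current = stack.pop()
            if current in seen:
                continue
            seen.add(current)
            segment.append(current)
            stack.extend(n for n in names
                         if n not in seen and matrix[current][n] >= threshold)
        segments.append(sorted(segment))
    return segments


segments = group(threshold=3)
for name in names:
    print(name, [matrix[name][other] for other in names])
print(segments)
```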

Page 29

Clustering Example with IM Visualization

Page 30

Prediction

Training mode ("historical data"):

independent variables                 dependent variable
paintype  angina  num vessels  thal   diseased
5         0       3            3      Y
2         0       0            7      N

Test mode ("historical data"):

paintype  angina  num vessels  thal   diseased
3         1       2            2      N
1         0       2            4      N

Comparison: actual - predicted value

diseased (predicted)
Y
N

Page 31

Prediction

Application mode ("new data"):

paintype  angina  num vessels  thal   diseased (predicted values)
4         1       1            5      Y
3         0       0            7      N

Page 32

Some prediction techniques

§ Prediction of categorical values – classification
  - neural networks: back propagation
  - decision trees
  - rule induction
§ Prediction of numeric values – regression
  - neural networks: back propagation
  - linear, polynomial, logistic regression
  - radial basis functions
  - decision trees
  - support vector machines
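The training, test and application modes of a prediction model can be sketched with a deliberately simple classifier. A 1-nearest-neighbour rule is used here only as a stand-in; it is not one of the techniques listed above, and all records below are hypothetical (they merely reuse the column layout paintype, angina, num vessels, thal). Test mode compares actual and predicted values on labelled data and tabulates the outcome as a confusion matrix; application mode scores new, unlabelled records.

```python
from collections import Counter

# Training mode: labelled historical data (hypothetical values).
train = [
    ((5, 0, 3, 3), "Y"),
    ((4, 1, 2, 3), "Y"),
    ((2, 0, 0, 7), "N"),
    ((1, 0, 0, 6), "N"),
]


def predict(record):
    """Label of the training record closest to the given record (squared distance)."""
    _, label = min(train, key=lambda row: sum((a - b) ** 2 for a, b in zip(record, row[0])))
    return label


# Test mode: labelled data that was not used for training.
test = [((5, 1, 3, 2), "Y"), ((2, 0, 1, 7), "N"), ((3, 0, 1, 5), "Y")]
confusion = Counter((actual, predict(features)) for features, actual in test)
accuracy = sum(n for (actual, predicted), n in confusion.items() if actual == predicted) / len(test)

# Application mode: new data without a known label.
new_data = [(4, 1, 3, 3), (1, 0, 0, 7)]
predictions = [predict(features) for features in new_data]

print(dict(confusion))
print(f"accuracy: {accuracy:.2f}")
print(predictions)
```

The keys of `confusion` are (actual, predicted) pairs, so the off-diagonal entries are the misclassifications.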

Page 33

Application areas

§ Churn prevention, in particular of profitable customers

§ Prediction of credit worthiness

§ Prediction of interest in marketing campaigns

§ Analysis of quality problems in manufacturing

§ ...

Page 34

Decision trees

Page 35

Confusion Matrix

Page 36

Gains chart

Page 37

Gains chart – comparison between two models

Page 38

Knowledge Discovery Methodology

§ Challenges in Knowledge Discovery
  - Real data
  - Huge volumes of data
  - Completely automatic discovery is not realistic
§ Discovered knowledge should be useful
  - You have to know what kind of information you are interested in.
  - The purpose is important.

Page 39

KDD Process

§ According to CRISP-DM, the "CRoss Industry Standard Process for Data Mining"

Page 40

Distribution of the effort

Modeling                        5%
Data Acquisition               40%
Data Pre-Processing            30%
Model Deployment                5%
Data Cleansing/Transformation  10%
Data Discovery/Modeling        10%

Page 41

Data Mining Products - Workbenches

§ SPSS Clementine

§ IBM Intelligent Miner for Data

§ SAS Enterprise Miner

§ ...

Page 42

Data Mining – Myth and Reality (1)

§ Arno Penzias, Nobel laureate and former chief of Bell Labs (January 1999, in an interview with ComputerWorld): "Data Mining will become much more important. Your bank will know everything you've bought. Companies will throw away nothing they know about their customers, because it will be so valuable. If you're not doing this, you're out of business."

Page 43

Data Mining – Myth and Reality (2)

§ Gartner Group Hype Cycle for BI (December 2002)

Page 44

Data Mining – Myth and Reality (3)

§ My assessment:
  - Data mining is exciting.
  - Data mining is difficult: "You need a PhD in statistics to do data mining"
    - Understand the business problem and the data.
    - Map the business problem to a data mining problem.
    - Prepare the data and run the mining techniques.
    - Evaluate the results.
  - The results are in most cases not spectacular, but they are valuable.
    - Where are the nuggets?
    - Most of the results are known already.
  - Deployment of the mining models in operational processes is difficult.
  - Privacy is an issue.

Page 45

What can be done?

§ Standardization

§ Closing the loop – making model deployment easier

§ Hiding the complexity of data mining through data mining solutions

§ Integrated BI Platforms instead of single data mining workbenches

Page 46

Standardization

§ Is a sign of the maturity of a field.
§ Makes the field more interesting for those beyond the "early adopters" in the technology adoption cycle (G.A. Moore, Crossing the Chasm)
§ Is the basis for further progress of the field
  - Enables reuse of third-party components.
  - Facilitates the development of mining solutions.

Technology adoption cycle: Innovators, Early Adopters, Early Majority, Late Majority, Laggards

Page 47

PMML standard for mining models

§ PMML – "Predictive Model Markup Language"
  - Driven by the Data Mining Group (www.dmg.org)
  - Supported by almost all major data mining vendors
  - Allows the interchange of models
    - for deployment
    - for visualization
  - Based on XML
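Because PMML is XML, a consuming tool can read a model file with any XML parser. The sketch below parses a small hand-written fragment with Python's standard library and lists the fields of its data dictionary. The fragment is illustrative only: it contains no model element, the field names are made up, and the PMML 4.4 namespace is an assumption (the talk predates that version).

```python
import xml.etree.ElementTree as ET

# Hand-written, illustrative fragment; a real PMML file would also contain
# a model element such as a clustering or tree model.
PMML_FRAGMENT = """\
<PMML version="4.4" xmlns="http://www.dmg.org/PMML-4_4">
  <Header description="illustrative fragment"/>
  <DataDictionary numberOfFields="2">
    <DataField name="age" optype="continuous" dataType="double"/>
    <DataField name="salary" optype="continuous" dataType="double"/>
  </DataDictionary>
</PMML>
"""

# Elements live in the PMML namespace, so lookups need a prefix mapping.
NS = {"pmml": "http://www.dmg.org/PMML-4_4"}

root = ET.fromstring(PMML_FRAGMENT)
fields = [
    (field.get("name"), field.get("dataType"))
    for field in root.findall("pmml:DataDictionary/pmml:DataField", NS)
]
print(root.get("version"), fields)
```

The data dictionary is what lets a deployment or visualization tool check that the input it is given matches the fields the model was built on.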

Page 48

Other standards

§ SQL/MM Data Mining
  - SQL extension for data mining
  - Oracle, IBM
§ JSR 73
  - Java standard for data mining
  - Oracle, SAS, SPSS, SAP, IBM, KXEN, ...

Page 49

Closing the loop – making model deployment easier

§ Model deployment
  - Use of mining models in operational applications
    - For instance, in campaign management: selection of the target group according to the scores of a prediction model
§ Specific components for model deployment
  - IBM DB2 Intelligent Miner Scoring
    - allows models to be applied inside the database

    -- Import a clustering model from a PMML file
    INSERT INTO IDMMX.ClusterModels VALUES (
      'DemoBanking',
      IDMMX.DM_impClusFile('/tmp/demoBanking.pmml'));

    -- Apply the model to every row of MyData and return the cluster id
    SELECT d.name, d.age,
           IDMMX.DM_getClusterId(
             IDMMX.DM_applyClusModel(
               cm.model,
               IDMMX.DM_applData(
                 IDMMX.DM_applData('age', d.age),
                 'salary', d.salary)))
    FROM IDMMX.ClusterModels cm, MyData d
    WHERE cm.modelname = 'DemoBanking';

Page 50

Building Mining Solutions

§ Automated mining solutions
  - Build applications that hide the complexity of data mining
  - The users do not have data mining skills
  - They only have to be able to understand the results

Page 51

Example: Detection of Insider Trading in Stock Transactions

§ German "Bundesanstalt für Finanzdienstleistungsaufsicht" (BaFin)
§ Mining solution:
  - The BaFin analysts have to enter only two parameters:
    - the stock id
    - date and time of the Ad hoc announcement
  - No data mining skills are required of the BaFin analysts.
  - The pre-processing, mining and post-processing steps are executed automatically.
  - Based on the mining results, scores for each transaction are computed. They can be interpreted as a measure of how untypical the transactions are.
  - The BaFin analysts can inspect the stock transactions with these scores using a front-end tool of their choice (like Business Objects).

Page 52

BaFin Mining Solution

§ BaFin - Bundesanstalt für Finanzdienstleistungsaufsicht (Federal Financial Supervisory Authority)
§ One of its tasks: detection of insider trading
  - Every stock transaction is reported to the BaFin
  - The triggers for insider investigations are "Ad hoc Publications" or other important events
  - "Ad hoc Publications" are statements of companies listed at the stock exchange that may have an influence on the stock value, e.g. quarterly reports, earnings warnings, mergers with other companies, change of CEO
§ Some figures:
  - 400 000 stocks and derivatives
  - 5400 companies listed at German stock exchanges
  - 5600 Ad hoc publications in 2000
  - 525 million stock transactions in 2000
§ Challenge:
  - How can we efficiently and effectively detect information relevant to insider trading in this huge amount of data?

Page 53

Adhoc Mining Scenario

§ Mining scenario that has been developed by IBM in a pilot project:
  - It consists of a sequence of
    - preprocessing steps (discretization, removal of outliers, pivoting, ...)
    - mining steps (associations, clustering)
    - post-processing steps
  - The goal is to find transactions that are untypical
§ The pilot project was successful:
  - Insider transactions hidden in the data have been found
§ However, the scenario was too complex to be applied regularly by the BaFin analysts.

Page 54

Adhoc Mining Web Application

§ The BaFin analysts have to enter only a few parameters in a browser interface:
  - the stock id
  - date and time of the Ad hoc publication
  - ...
§ No data mining skills are required of the BaFin analysts.
§ The pre-processing, mining and post-processing steps are executed automatically.
§ Based on the mining results, scores for each transaction are computed. They can be interpreted as a measure of how untypical the transactions are.
§ The BaFin analysts can inspect the stock transactions with these scores in the browser interface as well.
§ This scenario has been extended to find untypical trading behavior of brokers, banks and stock owners.

Page 55

The Adhoc Browser Interface

Page 56

The result page

Page 57

Generalization of such a solution

§ For every customer there are only a few relevant business problems to be solved with data mining
  - Each of these business problems is handled in slight variations
§ Examples:
  - Detection of Insider Trading
    - Variation: stock
  - Market Basket Analysis
    - Variations: stores, time period

What we need is one solution for one business problem that covers all variations

Page 58

Intelligent Miner Solution Framework

§ Framework for generating automated mining web applications based on Intelligent Miner
§ Characteristics of such applications
  - No mining skills are required of the user.
  - The user enters only a few parameters.
  - The mining requests are executed automatically.
  - Results are stored in database tables and can be inspected with the browser interface.
  - The "mining knowledge" is contained in the mining scenarios.
  - A mining expert is needed only for the development and maintenance of these mining scenarios.

Page 59

Business Intelligence Platforms

§ BI Platforms – integrated products that contain the major components of a BI solution
  - Database
  - ETL operations (Extract/Transform/Load)
  - OLAP
  - Data Mining
§ Offered by major database vendors
  - Microsoft
  - Oracle
  - IBM
§ SAP BW

Page 60

Summary

§ Knowledge Discovery and Data Mining has achieved a certain maturity.
§ It is still a very popular topic in the academic world.
§ The software market for data mining is consolidating.
§ The major challenges still persist.

Page 61

Thank you for your attention!