Download - Knowledge discovery & data mining Tools, methods, and experiences Fosca Giannotti and Dino Pedreschi Pisa KDD Lab CNUCE-CNR & Univ. Pisa

Knowledge discovery & data mining

Tools, methods, and experiences

Fosca Giannotti and Dino PedreschiPisa KDD Lab

CNUCE-CNR & Univ. Pisahttp://www-kdd.di.unipi.it/

A tutorial @ EDBT2000

Konstanz, 27-28.3.2000 EDBT2000 tutorial - Intro 2

Contributors and acknowledgements

The people @ Pisa KDD Lab: Francesco BONCHI, Giuseppe MANCO, Mirco NANNI, Chiara RENSO, Salvatore RUGGIERI, Franco TURINI and many students

The many KDD tutorialists and teachers which made their slides available on the web (all of them listed in bibliography) ;-)

In particular: Jiawei HAN, Simon Fraser University, whose forthcoming book Data

mining: concepts and techniques has influenced the whole tutorial Rajeev RASTOGI and Kyuseok SHIM, Lucent Bell Labs Daniel A. KEIM, University of Halle Daniel Silver, CogNova Technologies

The EDBT2000 board who accepted our tutorial proposal


Tutorial goals

Introduce you to major aspects of the Knowledge Discovery Process, and theory and applications of Data Mining technology

Provide a systematization to the many many concepts around this area, according the following lines the process the methods applied to paradigmatic cases the support environment the research challenges

Important issues that will be not covered in this tutorial: methods: time series, exception detection, neural nets systems: parallel implementations


Tutorial Outline

1. Introduction and basic concepts1. Motivations, applications, the KDD process, the techniques

2. Deeper into DM technology1. Decision Trees and Fraud Detection

2. Association Rules and Market Basket Analysis3. Clustering and Customer Segmentation

3. Trends in technology1. Knowledge Discovery Support Environment2. Tools, Languages and Systems

4. Research challenges


Introduction - module outline

Motivations Application Areas KDD Decisional Context KDD Process Architecture of a KDD system The KDD steps in short


Evolution of Database Technology:from data management to data analysis

1960s: Data collection, database creation, IMS and network DBMS.

1970s: Relational data model, relational DBMS implementation.

1980s: RDBMS, advanced data models (extended-relational, OO,

deductive, etc.) and application-oriented DBMS (spatial, scientific, engineering, etc.).

1990s: Data mining and data warehousing, multimedia databases,

and Web technology.


Motivations “Necessity is the Mother of Invention”

Data explosion problem: Automated data collection tools, mature database

technology and internet lead to tremendous amounts of data stored in databases, data warehouses and other information repositories.

We are drowning in information, but starving for knowledge! (John Naisbett)

Data warehousing and data mining : On-line analytical processing

Extraction of interesting knowledge (rules, regularities, patterns, constraints) from data in large databases.


Also referred to as: Data dredging, Data harvesting, Data

archeology

A multidisciplinary field: Database Statistics Artificial intelligence

Machine learning, Expert systems and Knowledge Acquisition

Visualization methods

A rapidly emerging fieldA rapidly emerging field


Motivations for DM

Abundance of business and industry dataCompetitive focus - Knowledge

Management Inexpensive, powerful computing enginesStrong theoretical/mathematical

foundations machine learning & logic statistics database management systems


What is DM useful for?

Marketing

DatabaseMarketing

DataWarehousing

KDD &Data Mining

Increase knowledge to base decision upon.

E.g., impact on marketing


The Value Chain

DataData• Customer data• Store data• Demographical Data• Geographical data

InformationInformation• X lives in Z• S is Y years old• X and S moved• W has money in Z

KnowledgeKnowledge• A quantity Y of product A is used in region Z• Customers of class Y use x% of C during period D

DecisionDecision• Promote product A in region Z.• Mail ads to families of profile P• Cross-sell service B to clients C


Application Areas and Opportunities

Marketing: segmentation, customer targeting, ... Finance: investment support, portfolio management Banking & Insurance: credit and policy approval Security: fraud detection Science and medicine: hypothesis discovery,

prediction, classification, diagnosis Manufacturing: process modeling, quality control,

resource allocation Engineering: simulation and analysis, pattern

recognition, signal processing Internet: smart search engines, web marketing


Classes of applications

Market analysis target marketing, customer relation management, market

basket analysis, cross selling, market segmentation.

Risk analysis Forecasting, customer retention, improved underwriting,

quality control, competitive analysis.

Fraud detection

Text (news group, email, documents) and Web analysis.

Konstanz, 27-28.3.2000 EDBT2000 tutorial - Intro

Market Analysis

Where are the data sources for analysis? Credit card transactions, loyalty cards, discount coupons,

customer complaint calls, plus (public) lifestyle studies.

Target marketing Find clusters of “model” customers who share the same

characteristics: interest, income level, spending habits, etc.

Determine customer purchasing patterns over time Conversion of single to a joint bank account: marriage,

etc.

Cross-market analysis Associations/co-relations between product sales Prediction based on the association information.


Customer profiling data mining can tell you what types of customers

buy what products (clustering or classification).

Identifying customer requirements identifying the best products for different

customers use prediction to find what factors will attract

new customers

Provides summary information various multidimensional summary reports; statistical summary information (data central

tendency and variation)

Market Analysis and ManagementMarket Analysis (2)


Risk Analysis

Finance planning and asset evaluation: cash flow analysis and prediction contingent claim analysis to evaluate assets cross-sectional and time series analysis (financial-

ratio, trend analysis, etc.)

Resource planning: summarize and compare the resources and

spending

Competition: monitor competitors and market directions (CI:

competitive intelligence). group customers into classes and class-based

pricing procedures set pricing strategy in a highly competitive market


Fraud Detection

Applications: widely used in health care, retail, credit card services,

telecommunications (phone card fraud), etc.

Approach: use historical data to build models of fraudulent

behavior and use data mining to help identify similar instances.

Examples: auto insurance: detect a group of people who stage

accidents to collect on insurance money laundering: detect suspicious money

transactions (US Treasury's Financial Crimes Enforcement Network)

medical insurance: detect professional patients and ring of doctors and ring of references


More examples: Detecting inappropriate medical treatment:

Australian Health Insurance Commission identifies that in many cases blanket screening tests were requested (save Australian $1m/yr).

Detecting telephone fraud: Telephone call model: destination of the call, duration,

time of day or week. Analyze patterns that deviate from an expected norm.

British Telecom identified discrete groups of callers with frequent intra-group calls, especially mobile phones, and broke a multimillion dollar fraud.

Retail: Analysts estimate that 38% of retail shrink is due to dishonest employees.

Fraud Detection (2)


Sports IBM Advanced Scout analyzed NBA game statistics

(shots blocked, assists, and fouls) to gain competitive advantage for New York Knicks and Miami Heat.

Astronomy JPL and the Palomar Observatory discovered 22

quasars with the help of data mining

Internet Web Surf-Aid IBM Surf-Aid applies data mining algorithms to Web

access logs for market-related pages to discover customer preference and behavior pages, analyzing effectiveness of Web marketing, improving Web site organization, etc.

Watch for the PRIVACY pitfall!

Other applications


The selection and processing of data for: the identification of novel, accurate,

and useful patterns, and the modeling of real-world

phenomena.

Data mining is a major component of the KDD process - automated discovery of patterns and the development of predictive and explanatory models.

What is KDD? A process!


Selection and Preprocessing

Data Mining

Interpretation and Evaluation

Data Consolidation

Knowledge

p(x)=0.02

Warehouse

Data Sources

Patterns & Models

Prepared Data

ConsolidatedData

The KDD process


The KDD Process

Core Problems & Approaches Problems:

identification of relevant data representation of data search for valid pattern or model

Approaches: top-down deduction by expert interactive visualization of data/models * bottom-up induction from data *

DataMining

OLAP


Learning the application domain: relevant prior knowledge and goals of application

Data consolidation: Creating a target data set Selection and Preprocessing

Data cleaning : (may take 60% of effort!) Data reduction and projection:

find useful features, dimensionality/variable reduction, invariant representation.

Choosing functions of data mining summarization, classification, regression, association,

clustering. Choosing the mining algorithm(s) Data mining: search for patterns of interest Interpretation and evaluation: analysis of results.

visualization, transformation, removing redundant patterns, …

Use of discovered knowledge

The steps of the KDD process


CogNovaTechnologies

9

The KDD ProcessThe KDD Process


Data Mining


Data Consolidation

Knowledge

p(x)=0.02

Warehouse

Data Sources

Patterns & Models

Prepared Data

ConsolidatedData

IdentifyProblem or Opportunity

Measure effectof Action

Act onKnowledge

Knowledge

ResultsStrategy

Problem

The virtuous cycle


Applications, operations, techniques


Roles in the KDD process


Increasing potentialto supportbusiness decisions End User

Business Analyst

DataAnalyst

DBA

MakingDecisions

Data Presentation

Visualization Techniques

Data MiningInformation Discovery

Data Exploration

OLAP, MDA

Statistical Analysis, Querying and Reporting

Data Warehouses / Data Marts

Data SourcesPaper, Files, Information Providers, Database Systems, OLTP

Data mining and business intelligence


Graphical User Interface

DataConsolidation

Selectionand

Preprocessing

DataMining

Interpretationand Evaluation

Warehouse KnowledgeData Sources

Architecture of a KDD system


A business intelligence environment



Data Mining


Data Consolidation

Knowledge

p(x)=0.02

Warehouse

Data Sources

Patterns & Models

Prepared Data

ConsolidatedData

The KDD process


Garbage in Garbage out The quality of results relates directly to

quality of the data 50%-70% of KDD process effort is spent

on data consolidation and preparation Major justification for a corporate data

warehouse

Data consolidation and preparation


From data sources to consolidated data repository

RDBMS

Legacy DBMS

Flat Files

DataConsolidationand Cleansing

Warehouse

Object/Relation DBMS Object/Relation DBMS

Multidimensional DBMS Multidimensional DBMS

Deductive Database Deductive Database

Flat files Flat files External

Data consolidation


Determine preliminary list of attributes Consolidate data into working database

Internal and External sources

Eliminate or estimate missing values Remove outliers (obvious exceptions) Determine prior probabilities of

categories and deal with volume bias

Data consolidation



Data Mining


Data Consolidation

Knowledge

p(x)=0.02

Warehouse

The KDD process


Generate a set of examples choose sampling method consider sample complexity deal with volume bias issues

Reduce attribute dimensionality remove redundant and/or correlating attributes combine attributes (sum, multiply, difference)

Reduce attribute value ranges group symbolic discrete values quantize continuous numeric values

Transform data de-correlate and normalize values map time-series data to static representation

OLAP and visualization tools play key role

Data selection and preprocessing



Data Mining


Data Consolidation

Knowledge

p(x)=0.02

Warehouse

The KDD process


Data mining tasks and methods

Automated Exploration/Discovery e.g.. discovering new market segments clustering analysis

Prediction/Classification e.g.. forecasting gross sales given current factors regression, neural networks, genetic algorithms,

decision trees

Explanation/Description e.g.. characterizing customers by demographics

and purchase history decision trees, association rules

x1

x2

f(x)

x

if age > 35 and income < $35k then ...


Clustering: partitioning a set of data into a set

of classes, called clusters, whose members share some interesting common properties.

Distance-based numerical clustering metric grouping of examples (K-NN) graphical visualization can be used

Bayesian clustering search for the number of classes which result in

best fit of a probability distribution to the data AutoClass (NASA) one of best examples

Automated exploration and discovery


Learning a predictive modelClassification of a new

case/sample Many methods:

Artificial neural networks Inductive decision tree and rule systems Genetic algorithms Nearest neighbor clustering algorithms Statistical (parametric, and non-

parametric)

Prediction and classification


The objective of learning is to achieve good generalization to new unseen cases.

Generalization can be defined as a mathematical interpolation or regression over a set of training points

Models can be validated with a previously unseen test set or using cross-validation methods

f(x)

x

Generalization and regression


Classification and prediction

Classify data based on the values of a target attribute, e.g., classify countries based on climate, or classify cars based on gas mileage.

Use obtained model to predict some unknown or missing attribute values based on other information.


Objective: Develop a general model or hypothesis from specific

examples

Function approximation (curve fitting)

Classification (concept learning, pattern recognition)

x1

x2

AB

f(x)

x

Summarizing: inductive modeling = learning


Learn a generalized hypothesis (model) from selected data

Description/Interpretation of model provides new knowledge

Methods: Inductive decision tree and rule systems Association rule systems Link Analysis …

Explanation and description


Generate a model of normal activity Deviation from model causes alert Methods:

Artificial neural networks Inductive decision tree and rule systems Statistical methods Visualization tools

Exception/deviation detection


Outlier and exception data analysis

Time-series analysis (trend and deviation): Trend and deviation analysis: regression,

sequential pattern, similar sequences, trend and deviation, e.g., stock analysis.

Similarity-based pattern-directed analysis

Full vs. partial periodicity analysis Other pattern-directed or statistical

analysis



Data Mining


Data Consolidationand Warehousing

Knowledge

p(x)=0.02

Warehouse

The KDD process


A data mining system/query may generate thousands of patterns, not all of them are interesting.

Interestingness measures: easily understood by humans valid on new or test data with some degree of certainty. potentially useful novel, or validates some hypothesis that a user seeks to

confirm

Objective vs. subjective interestingness measures Objective: based on statistics and structures of patterns,

e.g., support, confidence, etc. Subjective: based on user’s beliefs in the data, e.g.,

unexpectedness, novelty, etc.

Are all the discovered pattern interesting?


Find all the interesting patterns: Completeness. Can a data mining system find all the interesting

patterns?

Search for only interesting patterns: Optimization. Can a data mining system find only the interesting

patterns? Approaches

First generate all the patterns and then filter out the uninteresting ones.

Generate only the interesting patterns - mining query optimization.

Completeness vs. optimization


Evaluation Statistical validation and significance testing Qualitative review by experts in the field Pilot surveys to evaluate model accuracy

Interpretation Inductive tree and rule models can be read

directly Clustering results can be graphed and tabled Code can be automatically generated by

some systems (IDTs, Regression models)

Interpretation and evaluation


Visualization tools can be very helpful sensitivity analysis (I/O relationship) histograms of value distribution time-series plots and animation requires training and practice

Response

Velocity

Temp

Interpretation and evaluation


1989 IJCAI Workshop on KDD Knowledge Discovery in Databases (G. Piatetsky-

Shapiro and W. Frawley, eds., 1991)

1991-1994 Workshops on KDD Advances in Knowledge Discovery and Data Mining

(U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, eds., 1996)

1995-1998 AAAI Int. Conf. on KDD and DM (KDD’95-98) Journal of Data Mining and Knowledge Discovery

(1997)

1998 ACM SIGKDD 1999 SIGKDD’99 Conf.

Important dates of data mining


References - general P. Adriaans and D. Zantinge. Data Mining. Addison-Wesley: Harlow, England, 1996. M. S. Chen, J. Han, and P. S. Yu. Data mining: An overview from a database perspective. IEEE

Trans. Knowledge and Data Engineering, 8:866-883, 1996. U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Advances in Knowledge

Discovery and Data Mining. AAAI/MIT Press, 1996. J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2000. To

appear. T. Imielinski and H. Mannila. A database perspective on knowledge discovery. Communications

of ACM, 39:58-64, 1996. G. Piatetsky-Shapiro, U. Fayyad, and P. Smith. From data mining to knowledge discovery: An

overview. In U.M. Fayyad, et al. (eds.), Advances in Knowledge Discovery and Data Mining, 1-35. AAAI/MIT Press, 1996.

G. Piatetsky-Shapiro and W. J. Frawley. Knowledge Discovery in Databases. AAAI/MIT Press, 1991.

Michael Berry & Gordon Linoff. Data Mining Techniques for Marketing, Sales and Customer Support. John Wiley & Sons, 1997.

Sholom M. Weiss and Nitin Indurkhya. Predictive Data Mining: A Practical Guide. Morgan Kaufmann, 1997.

W.H. Inmon, J.D. Welch, Katherine L. Glassey. Managing the data warehouse. Wiley, 1997. T. Mitchell. Machine Learning. McGraw-Hill, 1997.


Main Web resources

http://www.kdnuggets.com KDD Newsletter and comprehensive website

http://www.acm.org/sigkdd ACM SIGKDD

http://www.research.microsoft.com/datamine/ Journal of Data Mining and Knowledge Discovery


Tutorial Outline

Introduction and basic concepts Motivations, applications, the KDD process, the techniques

Deeper into DM technology Decision Trees and Fraud Detection Association Rules and Market Basket Analysis Clustering and Customer Segmentation

Trends in technology Knowledge Discovery Support Environment Tools, Languages and Systems

Research challenges