Knowledge discovery & data mining
Tools, methods, and experiences
Fosca Giannotti and Dino PedreschiPisa KDD Lab
CNUCE-CNR & Univ. Pisahttp://www-kdd.di.unipi.it/
A tutorial @ EDBT2000
Konstanz, 27-28.3.2000 EDBT2000 tutorial - Intro 2
Contributors and acknowledgements
The people @ Pisa KDD Lab: Francesco BONCHI, Giuseppe MANCO, Mirco NANNI, Chiara RENSO, Salvatore RUGGIERI, Franco TURINI and many students
The many KDD tutorialists and teachers which made their slides available on the web (all of them listed in bibliography) ;-)
In particular: Jiawei HAN, Simon Fraser University, whose forthcoming book Data
mining: concepts and techniques has influenced the whole tutorial Rajeev RASTOGI and Kyuseok SHIM, Lucent Bell Labs Daniel A. KEIM, University of Halle Daniel Silver, CogNova Technologies
The EDBT2000 board who accepted our tutorial proposal
Konstanz, 27-28.3.2000 EDBT2000 tutorial - Intro 3
Tutorial goals
Introduce you to major aspects of the Knowledge Discovery Process, and theory and applications of Data Mining technology
Provide a systematization to the many many concepts around this area, according the following lines the process the methods applied to paradigmatic cases the support environment the research challenges
Important issues that will be not covered in this tutorial: methods: time series, exception detection, neural nets systems: parallel implementations
Konstanz, 27-28.3.2000 EDBT2000 tutorial - Intro 4
Tutorial Outline
1. Introduction and basic concepts1. Motivations, applications, the KDD process, the techniques
2. Deeper into DM technology1. Decision Trees and Fraud Detection
2. Association Rules and Market Basket Analysis3. Clustering and Customer Segmentation
3. Trends in technology1. Knowledge Discovery Support Environment2. Tools, Languages and Systems
4. Research challenges
Konstanz, 27-28.3.2000 EDBT2000 tutorial - Intro 5
Introduction - module outline
Motivations Application Areas KDD Decisional Context KDD Process Architecture of a KDD system The KDD steps in short
Konstanz, 27-28.3.2000 EDBT2000 tutorial - Intro 6
Evolution of Database Technology:from data management to data analysis
1960s: Data collection, database creation, IMS and network DBMS.
1970s: Relational data model, relational DBMS implementation.
1980s: RDBMS, advanced data models (extended-relational, OO,
deductive, etc.) and application-oriented DBMS (spatial, scientific, engineering, etc.).
1990s: Data mining and data warehousing, multimedia databases,
and Web technology.
Konstanz, 27-28.3.2000 EDBT2000 tutorial - Intro 7
Motivations “Necessity is the Mother of Invention”
Data explosion problem: Automated data collection tools, mature database
technology and internet lead to tremendous amounts of data stored in databases, data warehouses and other information repositories.
We are drowning in information, but starving for knowledge! (John Naisbett)
Data warehousing and data mining : On-line analytical processing
Extraction of interesting knowledge (rules, regularities, patterns, constraints) from data in large databases.
Konstanz, 27-28.3.2000 EDBT2000 tutorial - Intro 8
Also referred to as: Data dredging, Data harvesting, Data
archeology
A multidisciplinary field: Database Statistics Artificial intelligence
Machine learning, Expert systems and Knowledge Acquisition
Visualization methods
A rapidly emerging fieldA rapidly emerging field
Konstanz, 27-28.3.2000 EDBT2000 tutorial - Intro 9
Motivations for DM
Abundance of business and industry dataCompetitive focus - Knowledge
Management Inexpensive, powerful computing enginesStrong theoretical/mathematical
foundations machine learning & logic statistics database management systems
Konstanz, 27-28.3.2000 EDBT2000 tutorial - Intro 10
What is DM useful for?
Marketing
DatabaseMarketing
DataWarehousing
KDD &Data Mining
Increase knowledge to base decision upon.
E.g., impact on marketing
Konstanz, 27-28.3.2000 EDBT2000 tutorial - Intro 11
The Value Chain
DataData• Customer data• Store data• Demographical Data• Geographical data
InformationInformation• X lives in Z• S is Y years old• X and S moved• W has money in Z
KnowledgeKnowledge• A quantity Y of product A is used in region Z• Customers of class Y use x% of C during period D
DecisionDecision• Promote product A in region Z.• Mail ads to families of profile P• Cross-sell service B to clients C
Konstanz, 27-28.3.2000 EDBT2000 tutorial - Intro 12
Application Areas and Opportunities
Marketing: segmentation, customer targeting, ... Finance: investment support, portfolio management Banking & Insurance: credit and policy approval Security: fraud detection Science and medicine: hypothesis discovery,
prediction, classification, diagnosis Manufacturing: process modeling, quality control,
resource allocation Engineering: simulation and analysis, pattern
recognition, signal processing Internet: smart search engines, web marketing
Konstanz, 27-28.3.2000 EDBT2000 tutorial - Intro 13
Classes of applications
Market analysis target marketing, customer relation management, market
basket analysis, cross selling, market segmentation.
Risk analysis Forecasting, customer retention, improved underwriting,
quality control, competitive analysis.
Fraud detection
Text (news group, email, documents) and Web analysis.
Konstanz, 27-28.3.2000 EDBT2000 tutorial - Intro
Market Analysis
Where are the data sources for analysis? Credit card transactions, loyalty cards, discount coupons,
customer complaint calls, plus (public) lifestyle studies.
Target marketing Find clusters of “model” customers who share the same
characteristics: interest, income level, spending habits, etc.
Determine customer purchasing patterns over time Conversion of single to a joint bank account: marriage,
etc.
Cross-market analysis Associations/co-relations between product sales Prediction based on the association information.
Konstanz, 27-28.3.2000 EDBT2000 tutorial - Intro
Customer profiling data mining can tell you what types of customers
buy what products (clustering or classification).
Identifying customer requirements identifying the best products for different
customers use prediction to find what factors will attract
new customers
Provides summary information various multidimensional summary reports; statistical summary information (data central
tendency and variation)
Market Analysis and ManagementMarket Analysis (2)
Konstanz, 27-28.3.2000 EDBT2000 tutorial - Intro
Risk Analysis
Finance planning and asset evaluation: cash flow analysis and prediction contingent claim analysis to evaluate assets cross-sectional and time series analysis (financial-
ratio, trend analysis, etc.)
Resource planning: summarize and compare the resources and
spending
Competition: monitor competitors and market directions (CI:
competitive intelligence). group customers into classes and class-based
pricing procedures set pricing strategy in a highly competitive market
Konstanz, 27-28.3.2000 EDBT2000 tutorial - Intro
Fraud Detection
Applications: widely used in health care, retail, credit card services,
telecommunications (phone card fraud), etc.
Approach: use historical data to build models of fraudulent
behavior and use data mining to help identify similar instances.
Examples: auto insurance: detect a group of people who stage
accidents to collect on insurance money laundering: detect suspicious money
transactions (US Treasury's Financial Crimes Enforcement Network)
medical insurance: detect professional patients and ring of doctors and ring of references
Konstanz, 27-28.3.2000 EDBT2000 tutorial - Intro
More examples: Detecting inappropriate medical treatment:
Australian Health Insurance Commission identifies that in many cases blanket screening tests were requested (save Australian $1m/yr).
Detecting telephone fraud: Telephone call model: destination of the call, duration,
time of day or week. Analyze patterns that deviate from an expected norm.
British Telecom identified discrete groups of callers with frequent intra-group calls, especially mobile phones, and broke a multimillion dollar fraud.
Retail: Analysts estimate that 38% of retail shrink is due to dishonest employees.
Fraud Detection (2)
Konstanz, 27-28.3.2000 EDBT2000 tutorial - Intro
Sports IBM Advanced Scout analyzed NBA game statistics
(shots blocked, assists, and fouls) to gain competitive advantage for New York Knicks and Miami Heat.
Astronomy JPL and the Palomar Observatory discovered 22
quasars with the help of data mining
Internet Web Surf-Aid IBM Surf-Aid applies data mining algorithms to Web
access logs for market-related pages to discover customer preference and behavior pages, analyzing effectiveness of Web marketing, improving Web site organization, etc.
Watch for the PRIVACY pitfall!
Other applications
Konstanz, 27-28.3.2000 EDBT2000 tutorial - Intro 20
The selection and processing of data for: the identification of novel, accurate,
and useful patterns, and the modeling of real-world
phenomena.
Data mining is a major component of the KDD process - automated discovery of patterns and the development of predictive and explanatory models.
What is KDD? A process!
Konstanz, 27-28.3.2000 EDBT2000 tutorial - Intro 21
Selection and Preprocessing
Data Mining
Interpretation and Evaluation
Data Consolidation
Knowledge
p(x)=0.02
Warehouse
Data Sources
Patterns & Models
Prepared Data
ConsolidatedData
The KDD process
Konstanz, 27-28.3.2000 EDBT2000 tutorial - Intro 22
The KDD Process
Core Problems & Approaches Problems:
identification of relevant data representation of data search for valid pattern or model
Approaches: top-down deduction by expert interactive visualization of data/models * bottom-up induction from data *
DataMining
OLAP
Konstanz, 27-28.3.2000 EDBT2000 tutorial - Intro
Learning the application domain: relevant prior knowledge and goals of application
Data consolidation: Creating a target data set Selection and Preprocessing
Data cleaning : (may take 60% of effort!) Data reduction and projection:
find useful features, dimensionality/variable reduction, invariant representation.
Choosing functions of data mining summarization, classification, regression, association,
clustering. Choosing the mining algorithm(s) Data mining: search for patterns of interest Interpretation and evaluation: analysis of results.
visualization, transformation, removing redundant patterns, …
Use of discovered knowledge
The steps of the KDD process
Konstanz, 27-28.3.2000 EDBT2000 tutorial - Intro 24
CogNovaTechnologies
9
The KDD ProcessThe KDD Process
Selection and Preprocessing
Data Mining
Interpretation and Evaluation
Data Consolidation
Knowledge
p(x)=0.02
Warehouse
Data Sources
Patterns & Models
Prepared Data
ConsolidatedData
IdentifyProblem or Opportunity
Measure effectof Action
Act onKnowledge
Knowledge
ResultsStrategy
Problem
The virtuous cycle
Konstanz, 27-28.3.2000 EDBT2000 tutorial - Intro 25
Applications, operations, techniques
Konstanz, 27-28.3.2000 EDBT2000 tutorial - Intro 26
Roles in the KDD process
Konstanz, 27-28.3.2000 EDBT2000 tutorial - Intro 27
Increasing potentialto supportbusiness decisions End User
Business Analyst
DataAnalyst
DBA
MakingDecisions
Data Presentation
Visualization Techniques
Data MiningInformation Discovery
Data Exploration
OLAP, MDA
Statistical Analysis, Querying and Reporting
Data Warehouses / Data Marts
Data SourcesPaper, Files, Information Providers, Database Systems, OLTP
Data mining and business intelligence
Konstanz, 27-28.3.2000 EDBT2000 tutorial - Intro 28
Graphical User Interface
DataConsolidation
Selectionand
Preprocessing
DataMining
Interpretationand Evaluation
Warehouse KnowledgeData Sources
Architecture of a KDD system
Konstanz, 27-28.3.2000 EDBT2000 tutorial - Intro 29
A business intelligence environment
Konstanz, 27-28.3.2000 EDBT2000 tutorial - Intro 30
Selection and Preprocessing
Data Mining
Interpretation and Evaluation
Data Consolidation
Knowledge
p(x)=0.02
Warehouse
Data Sources
Patterns & Models
Prepared Data
ConsolidatedData
The KDD process
Konstanz, 27-28.3.2000 EDBT2000 tutorial - Intro 31
Garbage in Garbage out The quality of results relates directly to
quality of the data 50%-70% of KDD process effort is spent
on data consolidation and preparation Major justification for a corporate data
warehouse
Data consolidation and preparation
Konstanz, 27-28.3.2000 EDBT2000 tutorial - Intro 32
From data sources to consolidated data repository
RDBMS
Legacy DBMS
Flat Files
DataConsolidationand Cleansing
Warehouse
Object/Relation DBMS Object/Relation DBMS
Multidimensional DBMS Multidimensional DBMS
Deductive Database Deductive Database
Flat files Flat files External
Data consolidation
Konstanz, 27-28.3.2000 EDBT2000 tutorial - Intro 33
Determine preliminary list of attributes Consolidate data into working database
Internal and External sources
Eliminate or estimate missing values Remove outliers (obvious exceptions) Determine prior probabilities of
categories and deal with volume bias
Data consolidation
Konstanz, 27-28.3.2000 EDBT2000 tutorial - Intro 34
Selection and Preprocessing
Data Mining
Interpretation and Evaluation
Data Consolidation
Knowledge
p(x)=0.02
Warehouse
The KDD process
Konstanz, 27-28.3.2000 EDBT2000 tutorial - Intro 35
Generate a set of examples choose sampling method consider sample complexity deal with volume bias issues
Reduce attribute dimensionality remove redundant and/or correlating attributes combine attributes (sum, multiply, difference)
Reduce attribute value ranges group symbolic discrete values quantize continuous numeric values
Transform data de-correlate and normalize values map time-series data to static representation
OLAP and visualization tools play key role
Data selection and preprocessing
Konstanz, 27-28.3.2000 EDBT2000 tutorial - Intro 36
Selection and Preprocessing
Data Mining
Interpretation and Evaluation
Data Consolidation
Knowledge
p(x)=0.02
Warehouse
The KDD process
Konstanz, 27-28.3.2000 EDBT2000 tutorial - Intro 37
Data mining tasks and methods
Automated Exploration/Discovery e.g.. discovering new market segments clustering analysis
Prediction/Classification e.g.. forecasting gross sales given current factors regression, neural networks, genetic algorithms,
decision trees
Explanation/Description e.g.. characterizing customers by demographics
and purchase history decision trees, association rules
x1
x2
f(x)
x
if age > 35 and income < $35k then ...
Konstanz, 27-28.3.2000 EDBT2000 tutorial - Intro 38
Clustering: partitioning a set of data into a set
of classes, called clusters, whose members share some interesting common properties.
Distance-based numerical clustering metric grouping of examples (K-NN) graphical visualization can be used
Bayesian clustering search for the number of classes which result in
best fit of a probability distribution to the data AutoClass (NASA) one of best examples
Automated exploration and discovery
Konstanz, 27-28.3.2000 EDBT2000 tutorial - Intro 39
Learning a predictive modelClassification of a new
case/sample Many methods:
Artificial neural networks Inductive decision tree and rule systems Genetic algorithms Nearest neighbor clustering algorithms Statistical (parametric, and non-
parametric)
Prediction and classification
Konstanz, 27-28.3.2000 EDBT2000 tutorial - Intro 40
The objective of learning is to achieve good generalization to new unseen cases.
Generalization can be defined as a mathematical interpolation or regression over a set of training points
Models can be validated with a previously unseen test set or using cross-validation methods
f(x)
x
Generalization and regression
Konstanz, 27-28.3.2000 EDBT2000 tutorial - Intro 41
Classification and prediction
Classify data based on the values of a target attribute, e.g., classify countries based on climate, or classify cars based on gas mileage.
Use obtained model to predict some unknown or missing attribute values based on other information.
Konstanz, 27-28.3.2000 EDBT2000 tutorial - Intro 42
Objective: Develop a general model or hypothesis from specific
examples
Function approximation (curve fitting)
Classification (concept learning, pattern recognition)
x1
x2
AB
f(x)
x
Summarizing: inductive modeling = learning
Konstanz, 27-28.3.2000 EDBT2000 tutorial - Intro 43
Learn a generalized hypothesis (model) from selected data
Description/Interpretation of model provides new knowledge
Methods: Inductive decision tree and rule systems Association rule systems Link Analysis …
Explanation and description
Konstanz, 27-28.3.2000 EDBT2000 tutorial - Intro 44
Generate a model of normal activity Deviation from model causes alert Methods:
Artificial neural networks Inductive decision tree and rule systems Statistical methods Visualization tools
Exception/deviation detection
Konstanz, 27-28.3.2000 EDBT2000 tutorial - Intro 45
Outlier and exception data analysis
Time-series analysis (trend and deviation): Trend and deviation analysis: regression,
sequential pattern, similar sequences, trend and deviation, e.g., stock analysis.
Similarity-based pattern-directed analysis
Full vs. partial periodicity analysis Other pattern-directed or statistical
analysis
Konstanz, 27-28.3.2000 EDBT2000 tutorial - Intro 46
Selection and Preprocessing
Data Mining
Interpretation and Evaluation
Data Consolidationand Warehousing
Knowledge
p(x)=0.02
Warehouse
The KDD process
Konstanz, 27-28.3.2000 EDBT2000 tutorial - Intro
A data mining system/query may generate thousands of patterns, not all of them are interesting.
Interestingness measures: easily understood by humans valid on new or test data with some degree of certainty. potentially useful novel, or validates some hypothesis that a user seeks to
confirm
Objective vs. subjective interestingness measures Objective: based on statistics and structures of patterns,
e.g., support, confidence, etc. Subjective: based on user’s beliefs in the data, e.g.,
unexpectedness, novelty, etc.
Are all the discovered pattern interesting?
Konstanz, 27-28.3.2000 EDBT2000 tutorial - Intro
Find all the interesting patterns: Completeness. Can a data mining system find all the interesting
patterns?
Search for only interesting patterns: Optimization. Can a data mining system find only the interesting
patterns? Approaches
First generate all the patterns and then filter out the uninteresting ones.
Generate only the interesting patterns - mining query optimization.
Completeness vs. optimization
Konstanz, 27-28.3.2000 EDBT2000 tutorial - Intro 49
Evaluation Statistical validation and significance testing Qualitative review by experts in the field Pilot surveys to evaluate model accuracy
Interpretation Inductive tree and rule models can be read
directly Clustering results can be graphed and tabled Code can be automatically generated by
some systems (IDTs, Regression models)
Interpretation and evaluation
Konstanz, 27-28.3.2000 EDBT2000 tutorial - Intro 50
Visualization tools can be very helpful sensitivity analysis (I/O relationship) histograms of value distribution time-series plots and animation requires training and practice
Response
Velocity
Temp
Interpretation and evaluation
Konstanz, 27-28.3.2000 EDBT2000 tutorial - Intro
1989 IJCAI Workshop on KDD Knowledge Discovery in Databases (G. Piatetsky-
Shapiro and W. Frawley, eds., 1991)
1991-1994 Workshops on KDD Advances in Knowledge Discovery and Data Mining
(U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, eds., 1996)
1995-1998 AAAI Int. Conf. on KDD and DM (KDD’95-98) Journal of Data Mining and Knowledge Discovery
(1997)
1998 ACM SIGKDD 1999 SIGKDD’99 Conf.
Important dates of data mining
Konstanz, 27-28.3.2000 EDBT2000 tutorial - Intro 52
References - general P. Adriaans and D. Zantinge. Data Mining. Addison-Wesley: Harlow, England, 1996. M. S. Chen, J. Han, and P. S. Yu. Data mining: An overview from a database perspective. IEEE
Trans. Knowledge and Data Engineering, 8:866-883, 1996. U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Advances in Knowledge
Discovery and Data Mining. AAAI/MIT Press, 1996. J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2000. To
appear. T. Imielinski and H. Mannila. A database perspective on knowledge discovery. Communications
of ACM, 39:58-64, 1996. G. Piatetsky-Shapiro, U. Fayyad, and P. Smith. From data mining to knowledge discovery: An
overview. In U.M. Fayyad, et al. (eds.), Advances in Knowledge Discovery and Data Mining, 1-35. AAAI/MIT Press, 1996.
G. Piatetsky-Shapiro and W. J. Frawley. Knowledge Discovery in Databases. AAAI/MIT Press, 1991.
Michael Berry & Gordon Linoff. Data Mining Techniques for Marketing, Sales and Customer Support. John Wiley & Sons, 1997.
Sholom M. Weiss and Nitin Indurkhya. Predictive Data Mining: A Practical Guide. Morgan Kaufmann, 1997.
W.H. Inmon, J.D. Welch, Katherine L. Glassey. Managing the data warehouse. Wiley, 1997. T. Mitchell. Machine Learning. McGraw-Hill, 1997.
Konstanz, 27-28.3.2000 EDBT2000 tutorial - Intro 53
Main Web resources
http://www.kdnuggets.com KDD Newsletter and comprehensive website
http://www.acm.org/sigkdd ACM SIGKDD
http://www.research.microsoft.com/datamine/ Journal of Data Mining and Knowledge Discovery
Konstanz, 27-28.3.2000 EDBT2000 tutorial - Intro 54
Tutorial Outline
Introduction and basic concepts Motivations, applications, the KDD process, the techniques
Deeper into DM technology Decision Trees and Fraud Detection Association Rules and Market Basket Analysis Clustering and Customer Segmentation
Trends in technology Knowledge Discovery Support Environment Tools, Languages and Systems
Research challenges
Top Related