Private Sector Program Workshop on Data Mining

Michael Welge, Automated Learning Group, National Center for Supercomputing Applications, University of Illinois. [email protected], 217.244.1999. April 28, 2003.

Page 1: Private Sector Program Workshop on Data Mining

Michael Welge

Automated Learning Group
National Center for Supercomputing Applications
University of Illinois

[email protected]

217.244.1999

April 28, 2003

Private Sector Program Workshop on Data Mining

Page 2: Private Sector Program Workshop on Data Mining

alg | Automated Learning Group

Workshop Overview

• Data Mining Concepts and Techniques

• Break

• Data Mining Frameworks D2K/D2KSL

• Lunch – Center Atrium

• Data Mining Applications
  • Text mining
  • Image Mining

Page 3: Private Sector Program Workshop on Data Mining

Data Mining Concepts and Techniques Overview

• Automated Learning Group Background

• Introduction to Knowledge Discovery in Databases and Data Mining

• Applications of Data Mining

• Knowledge Discovery in Database Process

• Data Mining Paradigms

• Knowledge Discovery in Databases Framework

• Current and Future Research Activities

• Major Challenges in Data Mining

• Summary/References

Page 4: Private Sector Program Workshop on Data Mining

Goals

• Understanding of the Knowledge Discovery in Databases Processes

• Gain Knowledge of Basic Data Mining Operations and Techniques

• Key Issues in Application Deployment

• Understanding the Role of Information Visualization in Data Mining

• Understanding the Role of the Knowledge Discovery Framework

Page 5: Private Sector Program Workshop on Data Mining

ALG Background

• A brief history of the NCSA Automated Learning Group (ALG)
  • NCSA Industrial program foundation
  • State and Federal program support
  • Evolving framework to support KDD

• ALG’s Participation in Related Campus Activities
  • OVCR Faculty Fellows Program
  • REU Data Mining
  • Disability Research Institute (DRI)
  • Mid-America Earthquake Center (MAE)
  • Multi-Sector Crisis Management Consortium (MSCMC)
  • Technology Research Education Collaboration Center (TRECC)

Page 6: Private Sector Program Workshop on Data Mining

ALG Mission

The specific mission of the Automated Learning Group is: 

• To collaborate with researchers to develop novel computer methods and the scientific foundation for using historical data to improve future decision making

• To work closely with industrial, government, and academic partners to explore new application areas for such methods, and

 

• To transfer the resulting software technology into real world applications

Page 7: Private Sector Program Workshop on Data Mining

ALG Research, Development, & Technology Transfer Model

Page 8: Private Sector Program Workshop on Data Mining

Motivation: “Necessity Is the Mother of Invention”

• Data Explosion Problem
  • Automated Data Collection Tools And Mature Database Technology Lead To Tremendous Amounts Of Data Stored In Databases, Data Warehouses, And Other Information Repositories.

• We Are Drowning In Data, But Starving For Knowledge

• Solution: Data Management Environments and Data Mining
  • Data Warehousing and On-Line Analytical Processing
  • Extraction Of Interesting Knowledge (Rules, Regularities, Patterns) From Large Data And Large Databases

Page 9: Private Sector Program Workshop on Data Mining

Why Do We Need Data Mining?

• Data volumes are too large for classical analysis approaches:
  • Large number of records (10^8 – 10^12 bytes)
  • High dimensional data (10^2 – 10^4 attributes)

How do you explore millions of records, tens or hundreds of fields, and find patterns?

Page 10: Private Sector Program Workshop on Data Mining

Why Do We Need Data Mining?

• As databases grow, supporting the decision-making process with traditional query languages becomes infeasible

• Many queries of interest are difficult to state in a query language (the query formulation problem)
  • “Find all cases of fraud”
  • “Find all individuals likely to need Education Credit Assistance”
  • “Find all documents that are similar to this customer’s problem”

Page 11: Private Sector Program Workshop on Data Mining

What is Data Mining? (Knowledge Discovery in Databases)

Knowledge Discovery in Databases is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data.

• The understandable patterns are used to:
  • Make predictions or classifications about new data
  • Discover new business rules
  • Summarize the contents of a large database to support decision making
  • Visualize information to aid humans in discovering deeper patterns

Page 12: Private Sector Program Workshop on Data Mining

Why Data Mining? – Potential Applications

• Database analysis and decision support
  • Market analysis and management
    – Target marketing, customer relationship management, market basket analysis, cross selling, market segmentation
  • Risk analysis and management
    – Forecasting, customer retention, improved underwriting, quality control, competitive analysis
  • Fraud detection and management

• Other Applications
  • Text mining (newsgroups, email, documents) and Web analysis
  • Many, many others

Page 13: Private Sector Program Workshop on Data Mining

Market Analysis and Management

• Where are the data sources for analysis?
  • Credit card transactions, loyalty cards, discount coupons, customer complaint calls, plus (public) lifestyle studies

• Target marketing
  • Find clusters of “model” customers who share the same characteristics: interest, income level, spending habits, etc.

• Determine customer purchasing patterns over time
  • Conversion of a single to a joint bank account: marriage, etc.

• Cross-market analysis
  • Associations/correlations between product sales
  • Prediction based on the association information

Page 14: Private Sector Program Workshop on Data Mining

Market Analysis and Management

• Customer profiling
  • Data mining can tell you what types of customers buy what products (clustering or classification)

• Identifying customer requirements
  • Identifying the best products for different customers
  • Use prediction to find what factors will attract new customers

• Provides summary information
  • Various multidimensional summary reports
  • Statistical summary information (data central tendency and variation)

Page 15: Private Sector Program Workshop on Data Mining

Fraud and Inappropriate Behavior Management

• Applications
  • Widely used in health care, retail, credit card services, telecommunications (phone card fraud), etc.

• Approach
  • Use historical data to build models of fraudulent behavior and use data mining to help identify similar instances

• Examples
  • Tax claims: detect groups of people who file false tax claims
  • Money laundering: detect suspicious money transactions (US Treasury's Financial Crimes Enforcement Network)
  • Medical insurance: detect professional patients, rings of doctors, and rings of references

Page 16: Private Sector Program Workshop on Data Mining

Fraud and Inappropriate Behavior Management

• Detecting inappropriate medical treatment
  • The Australian Health Insurance Commission found that in many cases blanket screening tests were requested.

• Detecting telephone fraud
  • Telephone call model: destination of the call, duration, time of day or week. Analyze patterns that deviate from an expected norm.
  • British Telecom identified discrete groups of callers with frequent intra-group calls, especially mobile phones, and broke a multimillion dollar fraud.

• Retail
  • Analysts estimate that 38% of retail shrink is due to dishonest employees.

Page 17: Private Sector Program Workshop on Data Mining

Corporate Analysis and Risk Management

• Finance planning and asset evaluation
  • Cash flow analysis and prediction
  • Contingent claim analysis to evaluate assets
  • Cross-sectional and time series analysis (financial-ratio, trend analysis, etc.)

• Resource planning:
  • Summarize and compare the resources and spending

• Competition:
  • Monitor competitors and market directions
  • Group customers into classes and apply a class-based pricing procedure
  • Set pricing strategy in a highly competitive market

Page 18: Private Sector Program Workshop on Data Mining

Many, Many Others

• Description of Land Uses

• Precision Farming

• Peer Group Study

• Real-time Diagnosis of Mechanical Systems

• National Crime Incident Reporting System (Homeland Security)

• Student/Teacher Performance System

• Making Human Resource Decisions

• Automated Completion of Repetitive Forms

• Predicting the Function of a Gene Complex

• Auditing Tool

• Systems for Intrusion Detection

Page 19: Private Sector Program Workshop on Data Mining

Data Management Environments and Data Mining

Page 20: Private Sector Program Workshop on Data Mining

KDD Process

• Develop an Understanding of the Application Domain
  • Relevant prior knowledge, problem objectives, success criteria, current solution, inventory resources, constraints, terminology, cost and benefits

• Create Target Data Set
  • Collect initial data, describe, focus on a subset of variables, verify data quality

• Data Cleaning and Preprocessing
  • Remove noise and outliers, handle missing fields, account for time sequence information and known trends, integrate data

• Data Reduction and Projection
  • Feature subset selection, feature construction, discretizations, aggregations

• Selection of Data Mining Task
  • Classification, segmentation, deviation detection, link analysis

• Select Data Mining Approach(es)

• Data Mining to Extract Patterns or Models

• Interpretation and Evaluation of Patterns/Models

• Consolidating Discovered Knowledge
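
The early KDD steps listed on this slide can be sketched as a chain of small functions. This is only an illustrative sketch, not part of the workshop materials: the records, field names, and the per-band purchase-rate "model" are all hypothetical, and real projects would use far richer cleaning and mining methods.

```python
from statistics import mean

# Hypothetical raw records; None marks a missing field.
RAW = [
    {"age": 25, "income": 28_000, "bought": True},
    {"age": 41, "income": None, "bought": False},
    {"age": 33, "income": 52_000, "bought": False},
    {"age": 27, "income": 24_000, "bought": True},
    {"age": 58, "income": 61_000, "bought": False},
]


def create_target_set(records, fields):
    """Create target data set: focus on a subset of variables."""
    return [{f: r[f] for f in fields} for r in records]


def clean(records):
    """Data cleaning: drop records that have any missing field."""
    return [r for r in records if all(v is not None for v in r.values())]


def reduce_and_project(records):
    """Data reduction: discretize age into decade bands (feature construction)."""
    return [{**r, "age_band": f"{r['age'] // 10 * 10}s"} for r in records]


def mine(records):
    """Data mining: purchase rate per age band (a deliberately simple summary)."""
    bands = {}
    for r in records:
        bands.setdefault(r["age_band"], []).append(1 if r["bought"] else 0)
    return {band: mean(flags) for band, flags in bands.items()}


def run_kdd(records):
    """Run the steps in order; interpretation is left to the analyst."""
    target = create_target_set(records, ["age", "income", "bought"])
    cleaned = clean(target)
    reduced = reduce_and_project(cleaned)
    return mine(reduced)


model = run_kdd(RAW)
print(model)
```

Keeping each step as its own function mirrors the slide's point that the steps are distinct and can be revisited independently as the analysis iterates.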

Page 21: Private Sector Program Workshop on Data Mining

Knowledge Discovery In Databases Process

Page 22: Private Sector Program Workshop on Data Mining

Required Effort for Each KDD Step

[Bar chart: Effort (%) on a 0–60 scale for four steps: Business Objectives Determination, Data Preparation, Data Mining, Analysis & Assimilation]

Page 23: Private Sector Program Workshop on Data Mining

Data Mining and Business Intelligence

[Figure: pyramid of layers; the potential to support business decisions increases toward the top]

Layers, top to bottom:

• Making Decisions
• Data Presentation: Visualization Techniques
• Data Mining: Information Discovery
• Data Exploration: OLAP, MDA; Statistical Analysis, Querying and Reporting
• Data Warehouses / Data Marts
• Data Sources: Paper, Files, Information Providers, Database Systems, OLTP

Roles, top to bottom: End User, Business Analyst, Data Analyst, DBA

Page 24: Private Sector Program Workshop on Data Mining

Data Mining: On What Kind of Data?

• Relational Databases

• Data Warehouses

• Transactional Databases

• Advanced Database Systems
  • Object-Relational
  • Spatial
  • Temporal
  • Text
  • Heterogeneous, Legacy, and Distributed
  • WWW

Page 25: Private Sector Program Workshop on Data Mining

Data Mining Paradigms

• Concept description: Characterization and discrimination
  • Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet regions

• Discovery - Association (correlation and causality)
  • age(“20..29”) ∧ income(“20..29K”) → buys(“PC”) [support = 2%, confidence = 60%]
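
To make the support and confidence figures in the association rule concrete, here is a minimal sketch of how the two measures are computed. The transactions are made up for illustration (so the support differs from the slide's 2%), and the `attribute:value` item encoding is an assumption, not something taken from the slides.

```python
def support_confidence(transactions, antecedent, consequent):
    """Return (support, confidence) for the rule antecedent -> consequent.

    support    = fraction of all transactions containing both sides
    confidence = fraction of antecedent-matching transactions that
                 also contain the consequent
    """
    antecedent, consequent = set(antecedent), set(consequent)
    n_total = len(transactions)
    n_antecedent = sum(1 for t in transactions if antecedent <= t)
    n_both = sum(1 for t in transactions if (antecedent | consequent) <= t)
    support = n_both / n_total if n_total else 0.0
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence


# Hypothetical data: 10 customers, 5 match the antecedent, 3 of those buy a PC.
transactions = (
    [{"age:20..29", "income:20..29K", "buys:PC"}] * 3
    + [{"age:20..29", "income:20..29K"}] * 2
    + [{"age:30..39", "income:30..39K"}] * 5
)

support, confidence = support_confidence(
    transactions,
    antecedent={"age:20..29", "income:20..29K"},
    consequent={"buys:PC"},
)
print(f"support = {support:.0%}, confidence = {confidence:.0%}")
```

Support measures how often the whole rule applies across the database, while confidence measures how reliable the rule is once its left-hand side holds.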

Page 26: Private Sector Program Workshop on Data Mining

Data Mining Paradigms

• Classification and Prediction
  • Finding models (functions) that describe and distinguish classes or concepts for future prediction
  • E.g., classify countries based on climate, or classify cars based on gas mileage
  • Presentation: decision-tree, classification rule, neural network
  • Prediction: Predict some unknown or missing numerical values

• Cluster analysis
  • Class label is unknown: Group data to form new classes, e.g., cluster houses to find distribution patterns
  • Clustering based on the principle: maximizing the intra-class similarity and minimizing the interclass similarity
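
One common way to act on the clustering principle above is k-means, shown here as a minimal sketch. The slides do not name a specific algorithm, so k-means is a swapped-in example; the house coordinates are hypothetical, and seeding the centroids with the first k points is a simplifying assumption made so the run is deterministic.

```python
import math


def kmeans(points, k, iterations=20):
    """Plain k-means on tuples of floats.

    Each point joins its nearest centroid (high intra-class similarity),
    then each centroid moves to the mean of its members.
    """
    centroids = [points[i] for i in range(k)]  # assumption: seed with first k points
    assignment = [0] * len(points)
    for _ in range(iterations):
        assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(
                    sum(coords) / len(members) for coords in zip(*members)
                )
    return centroids, assignment


# Hypothetical house locations forming two obvious groups.
houses = [(1.0, 1.0), (8.0, 8.0), (1.5, 2.0), (9.0, 8.5), (1.0, 1.5), (8.5, 9.0)]
centroids, labels = kmeans(houses, k=2)
print(labels)
print(centroids)
```

No class labels are supplied; the two groups emerge only from the distances between points, which is what distinguishes clustering from classification.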

Page 27: Private Sector Program Workshop on Data Mining

Data Mining Paradigms

• Outlier analysis
  • Outlier: a data object that does not comply with the general behavior of the data
  • It can be considered as noise or exception but is quite useful in fraud detection, rare events analysis

• Other pattern-directed or statistical analyses
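
As a rough illustration of outlier analysis, the sketch below flags values that lie far from the mean in units of standard deviation (a z-score test). This is just one simple technique chosen for the example; the call durations and the threshold of 2.0 are hypothetical, and production fraud detection typically uses more robust models.

```python
from statistics import mean, pstdev


def find_outliers(values, threshold=2.0):
    """Return the values whose z-score magnitude exceeds the threshold."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:  # all values identical: nothing deviates
        return []
    return [v for v in values if abs(v - mu) / sigma > threshold]


# Hypothetical call durations in minutes; one call is far longer than the rest.
durations = [3, 4, 5, 4, 3, 5, 4, 120]
suspicious = find_outliers(durations)
print(suspicious)
```

Whether a flagged value is noise to discard or an exception worth investigating depends on the application, which is the point the slide makes about fraud and rare events.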

Page 28: Private Sector Program Workshop on Data Mining

Origins of Data Mining

• Draws ideas from database systems, machine learning, statistics, mathematical programming, information visualization, and high performance computing.

• Traditional techniques may be unsuitable
  • Enormity of data
  • High dimensionality of data
  • Heterogeneous, distributed nature of data

Page 29: Private Sector Program Workshop on Data Mining

Data Mining in Action

Page 30: Private Sector Program Workshop on Data Mining

Requirements For a Successful Data Mining Effort

• There is a sponsor for the application.

• The business case for the application is clearly understood and measurable, and the objectives are likely to be achievable given the resources being applied.

• The application has a high likelihood of having a significant impact on the business.

• Business domain knowledge is available.

• Good quality relevant data in sufficient quantities is available.

• The right people (domain, data management, and data mining experts) are available.

For a first-time project, the following criteria could be added:

• The scope of the application is limited - try to show results in 6-9 months

• The data sources should be limited to those that are well known, relatively clean, and freely accessible

Page 31: Private Sector Program Workshop on Data Mining

Need for Data Mining Framework

• Human analysis breaks down with volume and dimensionality.
  • How quickly can you digest 10 million records with 100 fields each?
  • High data growth rate, changing underlying source

• What is typically done by non-statisticians?
  • Select a few fields (usually 2-3 out of 50-100), attempt to visualize or fit to a simple model

• What about traditional statistical approaches?
  • In general, they do not scale to large databases

Page 32: Private Sector Program Workshop on Data Mining

D2K - Data To Knowledge

D2K is a rapid, flexible data mining system that integrates effective analytical data mining methods for prediction, discovery, and anomaly detection with data management and information visualization.

• Visual Programming Environment

• Robust Computational Infrastructure

• Flexible And Extensible Architecture

• Rapid Application Development Environment

• Integrated Environment For Models And Visualization

• Workflow and Group Use Interface

Page 33: Private Sector Program Workshop on Data Mining

D2K – Infrastructure, Toolkit, Modules, and Applications

• Data Selection
  • Distributed Knowledge Sources

• Data Transformation
  • Feature Selection/Construction
  • Example Selection

• Data Modeling
  • Scalable Algorithms
    – Predictive
    – Discovery
    – Anomaly Detection
  • Bias Optimization
  • Layer Learning

• Model Evaluation
  • Information Visualization

Page 34: Private Sector Program Workshop on Data Mining

D2K – Infrastructure, Toolkit, Modules, and Applications

Page 35: Private Sector Program Workshop on Data Mining

D2K/T2K/I2K - Data, Text, and Image Analysis

Page 36: Private Sector Program Workshop on Data Mining

D2K – SL

• Intuitive interfaces into D2K functionality for non-data mining professionals.

• Transparent access to mine data stored in databases.

• Extensible from desktop to cluster to grid.

• Visualization support at all stages of the data mining process.

• Support for very large data sets.

Page 37: Private Sector Program Workshop on Data Mining

REVEAL

• Mines and archives information from the web, Usenet, news-feeds, mailing lists, intranets, and databases

• Provides cost effective, efficient, easy to use solutions for searching multiple government/military web sites

• Automated information clustering, classification, and association discovery

• Visualization of search and data organization

• Learns from users; leverages the power of large user communities

• Provides the means to share information and alerts others with similar interests

Page 38: Private Sector Program Workshop on Data Mining

Decision Making in Uncertain Settings

• Evolutionary Multi-Objective Optimization

• DISCUS
  • Computer -> Computer
    – Genetic Algorithms
  • Computer -> Human
    – Interactive Genetic Algorithms
  • Human -> Human
    – Human-based Genetic Algorithms

Page 39: Private Sector Program Workshop on Data Mining

Data Spaces - Publish, Query, and Discover Data

Page 40: Private Sector Program Workshop on Data Mining

Mining Alarming Incidents in Data Streams - MAIDS

MAIDS aims to:

• Discover changes, trends and evolution characteristics in data streams.
• Construct clusters and classification models from data streams.
• Explore frequent patterns and similarities among data streams.

MAIDS can be applied to:

• Network intrusion detection
• Remote sensor data
• Telecommunication data flow analysis
• Financial data trend prediction
• Web click streams analysis

Page 41: Private Sector Program Workshop on Data Mining

D2K Infrastructure – Grid Powered

Page 42: Private Sector Program Workshop on Data Mining

Major Challenges in Data Mining

• Mining methodology and user interaction
  • Mining different kinds of knowledge in databases
  • Interactive mining of knowledge at multiple levels of abstraction
  • Incorporation of background knowledge
  • Data mining query languages and ad-hoc data mining
  • Expression and visualization of data mining results
  • Handling noise and incomplete data
  • Pattern evaluation: the interestingness problem

• Performance and scalability
  • Efficiency and scalability of data mining algorithms
  • Parallel, distributed and incremental mining methods

Page 43: Private Sector Program Workshop on Data Mining

Major Challenges in Data Mining

• Issues relating to the diversity of data types
  • Handling relational and complex types of data
  • Mining information from heterogeneous databases and global information systems (WWW)

• Issues related to applications and social impacts
  • Application of discovered knowledge
    – Domain-specific data mining tools
    – Intelligent query answering
    – Process control and decision making
  • Integration of the discovered knowledge with existing knowledge: A knowledge fusion problem
  • Protection of data security, integrity, and privacy

Page 44: Private Sector Program Workshop on Data Mining

Summary

• Data mining: discovering interesting patterns from large amounts of data

• A natural evolution of database technology, in great demand, with wide applications

• A KDD process includes data cleaning, data integration, data selection, transformation, data mining, pattern evaluation, and knowledge presentation

• Mining can be performed in a variety of information repositories

• Data mining paradigms: characterization, discrimination, association, classification, clustering, outlier and trend analysis, etc.

• Data mining framework

• Major issues in data mining

Page 45: Private Sector Program Workshop on Data Mining

References

• J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2000. (A Very Special Thanks to Jiawei Han for Slide Use)

• U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press, 1996.

• T. Imielinski and H. Mannila. A database perspective on knowledge discovery. Communications of ACM, 39:58-64, 1996.

• G. Piatetsky-Shapiro, U. Fayyad, and P. Smyth. From data mining to knowledge discovery: An overview. In U.M. Fayyad, et al. (eds.), Advances in Knowledge Discovery and Data Mining, 1-35. AAAI/MIT Press, 1996.

• G. Piatetsky-Shapiro and W. J. Frawley. Knowledge Discovery in Databases. AAAI/MIT Press, 1991.