CSE 8392 SPRING 1999 DATA MINING: PART I Professor Margaret H. Dunham Department of Computer Science and Engineering Southern Methodist University Dallas, Texas 75275 (214) 768-3087 fax: (214) 768-3085 email: [email protected] www: http://www.seas.smu.edu/~mhd January 1999


CSE 8392 SPRING 1999 DATA MINING:

PART I

Professor Margaret H. Dunham

Department of Computer Science and Engineering

Southern Methodist University

Dallas, Texas 75275

(214) 768-3087

fax: (214) 768-3085

email: [email protected]

www: http://www.seas.smu.edu/~mhd

January 1999


CSE 8392 SPRING 1999 OUTLINE

• Course Objective: To examine Data Mining concepts. A database perspective (rather than AI or statistics) is taken.

• I. Introduction and Related Topics

• II. Core Topics

• III. Advanced Topics

• IV. Case Studies

• V. Student Presentations

• VI. Summary and Future Trends


INTRODUCTION AND RELATED TOPICS

• Section Objective: Provide an introduction to data mining concepts. Briefly examine related concepts and background topics.

• Historical Perspective

– Gleaning Knowledge from the Data

– User Expectations increase as amount/sophistication of collected data increases.

– Reality vs Extracted Data

[Figure: Reality vs. extracted data. Labels: Reality, Query, Information Need, Data, Physical View, Database View]


Related Topics (to be covered)

– Knowledge Discovery

– Information Retrieval

– Fuzzy Sets

– Data Warehousing and OLAP

– Dimensional Modeling


Data Mining Overview

• What is Data Mining?

– Definition: Fayyad, p. 9

– A.k.a.

• Exploratory data analysis

• Unsupervised pattern recognition

• Data driven discovery

• Deductive learning

• Data Mining determines patterns in the data

– Non-trivial

– Valid

– Novel

– Potentially useful

– Interesting

– General and simple

– Understandable


DM Techniques (R[1])

• DM involves many different algorithms to accomplish different tasks. All share the following components.

– Model (Must fit a model to the data.)

• Function/Purpose

• Representation

– Preference Criteria (How to choose one model over another?)

– Search Algorithm (How to search the data?)

• Example (Loan Data, fig 1.1 p6 in Fayyad):

– Model: Classification, Linear Function

– Preference: What best fits data? (Fig 1.2 or 1.4)

– Search Algorithm: Linear search of database
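The three components in the loan example can be made concrete with a small sketch. The applicant records, the candidate grid, and all function names below are illustrative assumptions, not data or code from Fayyad. The model is a linear function of income and debt, the preference criterion is the number of misclassified records, and the search tries each candidate function with one linear pass over the data.

```python
# Hypothetical loan records: (income, debt, repaid_loan). Values are made up.
LOANS = [
    (20, 15, False),
    (25, 5, False),
    (30, 25, False),
    (35, 30, False),
    (45, 10, True),
    (50, 30, True),
    (55, 5, True),
    (60, 20, True),
]


def classify(income, debt, weights):
    """Model: a linear function of income and debt; a positive score means 'good loan'."""
    w_income, w_debt, bias = weights
    return w_income * income + w_debt * debt + bias > 0


def errors(weights, data):
    """Preference criterion: count of misclassified records (lower is better).
    Each call makes one linear pass over the data."""
    return sum(classify(inc, debt, weights) != label for inc, debt, label in data)


def search(data):
    """Search: exhaustively try a small, assumed grid of candidate linear functions."""
    candidates = [
        (1.0, w_debt, -threshold)
        for w_debt in (0.0, -0.5, -1.0)
        for threshold in range(10, 61, 5)
    ]
    return min(candidates, key=lambda w: errors(w, data))


best = search(LOANS)
print("best weights:", best, "errors:", errors(best, LOANS))
```

A different preference criterion (for example, penalizing bad loans accepted more than good loans rejected) would select a different model from the same candidates, which is the point of separating the three components.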


DM Model Functions (R[1])

• Classification - Map data into predefined groups

• Regression - Map data to real valued prediction variable

• Clustering - Map data into groups defined by data itself

• Summarization - Map subsets of data into simple description

• Dependency Modeling - Identify dependencies among data items

• Link Analysis - Identify other relationships among data (association rules)

• Sequence Analysis - Identify sequential patterns in data


DM Historical Perspective

• Late 70’s: Spreadsheet analysis

• 80’s: Transactional databases support data storage and retrieval

• Early 90’s: Growing interest in end user support (a.k.a. decision support)

– Issue: transactional databases are not designed for decision support

• Mid 90’s: Dedicated data warehouses for decision support and multidimensional analysis

• Late 90’s: Proliferation; new concepts (data marts)

• DM Tools: Neovista, Red Brick


Data Mining Metrics

• Berson, Tables 17-1,17-2,17-3, p 347

• Accuracy

• Clarity

• Dirty Data

• Dimensionality

• Raw Data (Preprocessing)

• RDBMS embedding

• Scalability

• Speed

• Validation


DM Issues

• Overfitting

• Outliers

• Closed World Assumption

• Database schemas and database models

• Algorithms for data mining

• Interpretation and visualization of results

• Size of databases

• Multimedia data, Spatio-Temporal Data

• Changing data

• Integration

• DM Applications

– Market basket analysis

– Stock analysis and selection

– Fraud detection and prevention

– Crisis prediction and prevention


KNOWLEDGE DISCOVERY IN DATABASES (KDD)

• “Overall process of discovering useful knowledge from data.” (p28 in R[1])

• Defn: R[1] p 30

• Steps Fig 1, p29 R[1] (Fig 1.3 in Fayyad)

• Data Mining is one step in KDD process

• KDD objective is not usually clear or exact. May require time spent with the customer to understand needs.

• Data usually has problems - needs cleaning

– Incorrect/missing data

– Extract from multiple sources and compare

– Delete anomalous data and sources

– Different data types/metrics
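The cleaning steps above can be sketched on a toy example. The two sources, the field names, and the plausible-height range are all assumptions made for illustration; real cleaning rules depend on the data and the customer's needs.

```python
# Hypothetical raw records from two sources; field names and values are made up.
SOURCE_A = [{"id": 1, "height_in": 70.0}, {"id": 2, "height_in": None}]
SOURCE_B = [{"id": 2, "height_cm": 180.0}, {"id": 3, "height_cm": 999.0}]

CM_PER_INCH = 2.54


def to_cm(record):
    """Different metrics: convert either source's record to centimetres."""
    if record.get("height_cm") is not None:
        return record["height_cm"]
    if record.get("height_in") is not None:
        return record["height_in"] * CM_PER_INCH
    return None  # value is missing in this source


def clean(*sources, low=50.0, high=250.0):
    """Merge sources by id, fill gaps from other sources, and drop anomalies.
    The plausible range [low, high] is an assumption for this sketch."""
    merged = {}
    for source in sources:
        for record in source:
            value = to_cm(record)
            if value is None:
                continue  # missing here; another source may supply it
            if not (low <= value <= high):
                continue  # anomalous value is discarded
            merged.setdefault(record["id"], value)  # first valid value wins
    return merged


cleaned = clean(SOURCE_A, SOURCE_B)
print(cleaned)
```

Here record 2 is missing in the first source and recovered from the second, while record 3 is dropped as anomalous.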


FUZZY SETS and LOGIC

• Set membership described by a real valued membership function with range [0,1]

• Ex: Set of all tall people

• Set membership function: f(x)=x is tall iff height(x)>6 ft.

• Note that this is a simple classification problem. Just as in the loan example, the results are not exact.

• Basis of many classification and clustering approaches

• In a conventional DB how do you retrieve all tall people?

– Three valued logic: True, False, Maybe

– Multi-valued logic: More than 2 values
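The difference between the conventional (crisp) test and a fuzzy membership function for "tall" can be sketched as follows. The linear ramp and its breakpoints of 5.5 ft and 6.5 ft are assumptions chosen for illustration; the slides do not specify a particular fuzzy membership function.

```python
def tall_crisp(height_ft):
    """Conventional set: a person is either tall (1.0) or not tall (0.0)."""
    return 1.0 if height_ft > 6.0 else 0.0


def tall_fuzzy(height_ft, low=5.5, high=6.5):
    """Fuzzy set: membership rises linearly from 0 at `low` to 1 at `high`.
    The breakpoints are illustrative assumptions."""
    if height_ft <= low:
        return 0.0
    if height_ft >= high:
        return 1.0
    return (height_ft - low) / (high - low)


def retrieve_tall(people, threshold):
    """Return the names whose degree of tallness is at least `threshold`."""
    return [name for name, height in people if tall_fuzzy(height) >= threshold]


# Hypothetical people: (name, height in feet).
PEOPLE = [("Ann", 5.4), ("Bob", 5.9), ("Cy", 6.0), ("Dee", 6.6)]

for name, height in PEOPLE:
    print(name, tall_crisp(height), round(tall_fuzzy(height), 2))
```

With the crisp function a person of exactly 6.0 ft is not tall at all, while the fuzzy function reports a degree of 0.5, so a query can choose how strict "tall" should be through the threshold.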


Fuzzy Logic

• Reasoning with uncertainty

• Extends multivalued logic; allows user to communicate using imprecise concepts, e.g.

– “good” and “bad”

– “close to” and “far away”

• Avoids brittleness of rule based reasoning by introducing a degree of set membership

– Allows for smoother transition between classification sets in the domain

– Example

• Berson figure 16.2, page 325


INFORMATION RETRIEVAL

• Store and retrieve documents based on fuzzy queries

• Predecessor of web based access

• Ex: Store information about all articles in all IEEE Transactions journals and retrieve all documents dealing with heaps.

• Overview

– Conventional IR Systems

– Query Structures (Keywords)

– Matching (Multivalued logic)

– Measures

– Text Analysis Techniques

– IR Related Topics


Conventional IR Systems

• Library card catalogs

• Documents (Library Science)

– Formatted

– Unformatted (Text)

– Mixed

• Document Surrogates

– Identifiers

– Titles, names, and dates

– Abstracts, extracts, reviews

– Summaries of Numerical Data

– Image Descriptions


IR Queries

• Query Structures

– Matching Criteria

– Boolean Queries

– Vector

– Fuzzy

– Natural Language

• Logical combination of keywords

• Weight associated with keywords

• Similarity measures


Similarity Measures

– Document Vector: $D_i = (d_{i1}, d_{i2}, \ldots, d_{in})$

– Different Measures: $Sim(D_i, D_j) = \sum_{k=1}^{n} d_{ik} d_{jk}$

– Salton and McGill, Introduction to Modern Information Retrieval, 1984, McGraw-Hill, pp201-204.

– Similarity uses:

• Document-Document

• Query-Query

• Document-Query
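The inner-product measure on this slide, the sum over k of d_ik times d_jk for two term vectors D_i and D_j, can be sketched directly. The term vocabulary and weights below are made up for illustration, and the cosine variant is added as one common normalized measure; it is an assumption that it suits a given collection.

```python
import math


def inner_product(d_i, d_j):
    """Sim(D_i, D_j) = sum over k of d_ik * d_jk."""
    return sum(a * b for a, b in zip(d_i, d_j))


def cosine(d_i, d_j):
    """Inner product divided by the vector lengths; lies in [0, 1] for non-negative weights."""
    norm = math.sqrt(inner_product(d_i, d_i)) * math.sqrt(inner_product(d_j, d_j))
    return inner_product(d_i, d_j) / norm if norm else 0.0


# Hypothetical term weights over the terms ("heap", "tree", "sort", "graph").
doc1 = [2, 1, 0, 0]
doc2 = [1, 0, 3, 0]
query = [1, 0, 0, 0]

print("document-document:", inner_product(doc1, doc2), cosine(doc1, doc2))
print("document-query:   ", inner_product(doc1, query), cosine(doc1, query))
```

The same functions serve all three uses listed on the slide, since a query is represented as a term vector just like a document.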


IR Document/Query Matching

• Matching Process

– Relevance and Similarity Measures

– Boolean based matching

• Logical match

– Vector based matching

• Threshold match

– Probabilistic Match

• P(relevant) = n / N, where n is the number of relevant documents and N is the total number of documents

– Fuzzy Matching

– Proximity Matching

– Weighting

– Relative Importance of Items


IR Matching

• Scaling

– Impact of Sample Size

– Clustering

– Centroids

• Measures

– Precision

– Recall
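Precision and recall follow their standard definitions: precision is the fraction of retrieved documents that are relevant, and recall is the fraction of relevant documents that were retrieved. A minimal sketch, using made-up document identifiers:

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = |retrieved and relevant| / |retrieved|
    recall    = |retrieved and relevant| / |relevant|
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query result and relevance judgments.
p, r = precision_recall(
    retrieved=["d1", "d2", "d3", "d4"],
    relevant=["d2", "d4", "d7"],
)
print(p, r)
```

Retrieving every document drives recall to 1 at the cost of precision, which is why the two measures are reported together.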


IR Indexing

• Text Analysis

– Indexing is the assignment of keywords or terms that represent document content

• Originally a library science problem that has grown with the advent of web based searches

– Indexing types

• Automated vs. manual

• Controlled vs. uncontrolled

• Single term vs. terms in context

• Deep vs. shallow


IR Indexing

• General Steps

– 1. Assignment of terms or concepts capable of representing content

– 2. Assignment to each term a weight or value

• Indexing

– Vector based

• Start with excerpts, remove high frequency words

– Stop list

– Thesaurus

• Compute discrimination values of terms
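The first indexing steps (drop high frequency words with a stop list, then map variants to a preferred term with a thesaurus) can be sketched as below. The stop list, the thesaurus entries, and the sample text are small hypothetical stand-ins; the sketch stops at raw term counts and does not compute discrimination values.

```python
from collections import Counter

# Assumed, deliberately tiny stop list and thesaurus for illustration only.
STOP_LIST = {"a", "an", "and", "of", "the", "in", "is", "to"}
THESAURUS = {"heaps": "heap", "trees": "tree", "sorting": "sort"}


def index_terms(text):
    """Return term counts for a text excerpt after stop-list and thesaurus steps."""
    words = [w.strip(".,;:()").lower() for w in text.split()]
    terms = [THESAURUS.get(w, w) for w in words if w and w not in STOP_LIST]
    return Counter(terms)


index = index_terms("Sorting in heaps: the heap is a tree.")
print(index)
```

The resulting counts are the raw material for step 2 on the slide, assigning each term a weight or value.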


IR Retrieval

• Retrieval or Classification

– Vector based

• Same starting point as with indexing

• Compute weighting factors

• Assign to each document a weighted term vector

– Similarity Measures

• Measure similarity between document/query

• Results normalized to range between 0 and 1


IR Retrieval

– Inverse Document Frequency

• Assumes a term's importance is proportional to its occurrence frequency within a document, and inversely proportional to the number of documents in which it occurs.

• Also used for similarity measurement

– Inverted Indexing of Document

– Concept Hierarchy

• DAG of concepts

• Follow nodes from general to more specific

• Tag articles with low level concepts so that each may be distinguished from ancestors
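Inverted indexing and inverse document frequency weighting can be sketched together, since the inverted index supplies the document counts that the weighting needs. The documents are made up, and the weight tf * log(N / df) is one common formulation chosen here as an assumption; the slides do not give a specific formula.

```python
import math
from collections import Counter, defaultdict

# Hypothetical documents, already reduced to index terms.
DOCS = {
    "d1": ["heap", "sort", "heap"],
    "d2": ["tree", "sort"],
    "d3": ["heap", "tree", "graph"],
}


def inverted_index(docs):
    """Map each term to the set of documents that contain it."""
    index = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in terms:
            index[term].add(doc_id)
    return index


def tf_idf(docs):
    """Weight = term frequency * log(N / number of documents containing the term)."""
    index = inverted_index(docs)
    n_docs = len(docs)
    weights = {}
    for doc_id, terms in docs.items():
        counts = Counter(terms)
        weights[doc_id] = {
            term: tf * math.log(n_docs / len(index[term]))
            for term, tf in counts.items()
        }
    return weights


INDEX = inverted_index(DOCS)
WEIGHTS = tf_idf(DOCS)
print(sorted(INDEX["heap"]))
print(WEIGHTS["d3"])
```

A term that appears in only one document ("graph") receives a higher weight than one spread across several ("heap"), and a query for a term is answered by a single lookup in the inverted index rather than a scan of all documents.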


IR Related Topics

• Information Retrieval Related Topics

– Text Analysis

– Fuzzy Sets

– Extending Databases

– Hypertext

– Digital Libraries

– Data Mining

• Web based browsers


DATA WAREHOUSING AND OLAP

– Preparations for Mining: Data Warehousing

• Extracting the data (from RDBMS)

• Storing the data

– Data warehouse or data mart

• Cleansing the data

• Mining the data

– Often with multidimensional queries

• Definition

– Blend of technologies

– Integration

– Enables Strategic Use of Data

• Architecture

– Figure 6.1, page 116


DW Migration

• Migration from Relational Database to Data Warehouse

– Differences (Relational vs. Data Warehouse)

– Procedure for Migration

• Extraction

• Cleanup

• Transformation

• Migration

• Issues

– Multiple sources

– Database Heterogeneity

– Data Heterogeneity


DW Design

• Data Warehouse Design Considerations - Nine Step Method:

– Subject Matter

– Fact Table contents

– Dimensioning

– Fact Selection

– Precalculations

– Rounding out dimension table

– Duration selection

– What about change?

– Query priorities

• Technical Considerations

– Hardware

– Communications Infrastructure

– Data Structures


More on DW

• Benefits

– Development of strategic information and resources

– Hypothesis testing

– Knowledge discovery

• Data Marts

– Definition: a mini data warehouse for data mining

– Directed at a partition of data

– Dedicated user group

– May be physically separate

– Drivers

• Urgent user requirements

• Small budget

• Absence of sponsor

• Decentralization

• Smaller project size


DIMENSIONAL MODELING

• Dimensional Modeling

– Describes relationships in the data that will be mined

– Relatively new concept, still developing

– A technique for visualizing data models

– Schema (Star and Snowflake)

– Facts - A collection of related data items, consisting of measures and context data

– Dimensions - A collection of members or units of the same type of view. Axis for modeling. Sets the context for the facts.

– Measures - Numeric attribute of fact (What is stored about sales data)

• Focus - Tends to be on numeric data

• MD Analysis vs. DM - Figure 4, R[3]


Data Cube

• Way to visualize facts and dimensions

• Hypercube (more than 3 dimensions)

• May be nested

• Figure 13.1, p249, Berson

• Figure 15, R[3]


Star Schema

[Figure: star schema. A central Sales Facts table surrounded by Part No., Customer, Time, Salesperson, and Product dimension tables]

– Contains large fact table and a surrounding set of dimension tables

– A.k.a. constellation or multistar model

– Figure 9.1, p171, Berson

– Following from Figure 18, R[3]
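A star schema can be sketched with plain dictionaries: one fact table whose rows hold foreign keys into each dimension table plus a numeric measure. The table contents, key names, and the `total_by` helper are hypothetical and only loosely follow the sales example in the figure.

```python
# Hypothetical dimension tables, keyed by surrogate id.
PRODUCT = {1: {"name": "Widget"}, 2: {"name": "Gadget"}}
CUSTOMER = {
    10: {"name": "Acme", "city": "Dallas"},
    11: {"name": "Zenith", "city": "Austin"},
}
TIME = {100: {"month": "1999-01"}, 101: {"month": "1999-02"}}

# Fact table: foreign keys into each dimension plus one measure (amount).
SALES_FACTS = [
    {"product_id": 1, "customer_id": 10, "time_id": 100, "amount": 250.0},
    {"product_id": 2, "customer_id": 10, "time_id": 100, "amount": 100.0},
    {"product_id": 1, "customer_id": 11, "time_id": 101, "amount": 300.0},
    {"product_id": 2, "customer_id": 11, "time_id": 101, "amount": 50.0},
]


def total_by(dimension, key, attribute, facts=SALES_FACTS):
    """Join facts to one dimension table and sum the measure per attribute value."""
    totals = {}
    for fact in facts:
        label = dimension[fact[key]][attribute]
        totals[label] = totals.get(label, 0.0) + fact["amount"]
    return totals


print(total_by(PRODUCT, "product_id", "name"))
print(total_by(CUSTOMER, "customer_id", "city"))
print(total_by(TIME, "time_id", "month"))
```

Every query has the same shape, a join from the large fact table to a small dimension table followed by an aggregate, which is what makes the layout convenient for decision support.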


Snowflake Schema

[Figure: snowflake schema. A central Sales Facts table with Part No., Customer, Time, Salesperson, and Product dimension tables, plus further Location, Manager, Month, and Week dimension tables]

• Sometimes dimensions have hierarchies among themselves

• N:1 relationships among members of a dimension may be subdivided

• Decomposition yields a snowflake like schema


OLAP (On Line Analytic Processing)

• Multidimensional database

• Allows user to analyze data using elaborate, multidimensional, complex views

• MOLAP - Multidimensional OLAP. Supported by specialized DBMS/software systems. (Data structures, temporal)

– May not be general enough for other uses

– Access limited and optimized for OLAP processing

– Fig 13.3 p 253, Berson

• ROLAP - Underlying data stored in traditional (relational) DBMS and accessed by traditional query language (SQL).

– Layer on top of DBMS. Middleware.

– May have poor performance for OLAP applications

– Fig 13.4 p 254, Berson


OLAP Operations

• Move view of facts down/up dimensions

– Drill Down

– Roll Up

– Figure 3, R[3]

– Figure 16, R[3]

• Look at data by partitioning the cube

– Slice - Look at subcube to get more specific data

– Dice - Rotate cube to look at another dimension

– Figure 17, R[3]
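Roll up, drill down, and slice can be sketched on a small cube stored as a dictionary of cells. The cell values, the week-to-month hierarchy, and the function names are assumptions made for illustration. The `view_by` helper totals the cube along one chosen dimension, which is one simple way to look at the data from another dimension as described for dice above.

```python
from collections import defaultdict

# Hypothetical cube cells: (product, city, week) -> sales amount.
CUBE = {
    ("Widget", "Dallas", "W01"): 100.0,
    ("Widget", "Dallas", "W02"): 50.0,
    ("Widget", "Dallas", "W05"): 150.0,
    ("Widget", "Austin", "W01"): 80.0,
    ("Gadget", "Dallas", "W01"): 40.0,
    ("Gadget", "Austin", "W05"): 60.0,
}
WEEK_TO_MONTH = {"W01": "Jan", "W02": "Jan", "W05": "Feb"}  # assumed time hierarchy


def roll_up_time(cube):
    """Roll up: aggregate the time dimension from week level to month level."""
    out = defaultdict(float)
    for (product, city, week), amount in cube.items():
        out[(product, city, WEEK_TO_MONTH[week])] += amount
    return dict(out)


def slice_cube(cube, axis, value):
    """Slice: keep only the cells where the dimension at position `axis` equals `value`."""
    return {cell: amount for cell, amount in cube.items() if cell[axis] == value}


def view_by(cube, axis):
    """Total the measure along one dimension (0 = product, 1 = city, 2 = week)."""
    out = defaultdict(float)
    for cell, amount in cube.items():
        out[cell[axis]] += amount
    return dict(out)


monthly = roll_up_time(CUBE)           # roll up: weeks -> months
austin = slice_cube(CUBE, 1, "Austin")  # slice: fix city = Austin
by_product = view_by(CUBE, 0)
print(monthly)
print(austin)
print(by_product)
```

Drill down is the reverse of the roll up shown here: moving from the monthly totals back to the weekly cells in `CUBE`, which is only possible because the detailed cells are retained.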