
9/03 Data Mining – Introduction

G Dong (WSU) 1

CS499/699-10 Data Mining

Fall 2003

Professor Guozhu Dong

Computer Science & Engineering, WSU


Introduction
  Introduction to this Course
  Introduction to Data Mining


Introduction to the Course
  First, about you - why take this course?
    Your background and strengths: AI, DBMS, Statistics, Biology, Business, …
    Your interests and requests
  What is this course about?
    Problem solving
    Handling data
      transform data into workable data
    Mining data
      turn data into knowledge
      validation and presentation of knowledge


This course
  What can you expect from this course?
    Knowledge and experience about DM
    Problem solving skills
  How is this course conducted?
    Homework, projects, exams, classes
  Course Format
    Individual Projects: 30%
    Exams and/or quizzes: 60%
    Homework: 10%


Course Web Site
  cs.wright.edu/~gdong/mining03/WSUCS499DataMining.htm
My office and office hours
  RC 430, 4:30-5:30, T Th
My email: [email protected]
Slides and relevant information will be made available at the course web site


Any questions and suggestions?

  Your feedback is most welcome!
    I need it to adapt the course to your needs.
    Please feel free to provide yours anytime.
  Share your questions and concerns with the class; very likely others have the same ones.
  No pain, no gain: there is no magic for data mining.
    The more you put in, the more you get.
    Your grades are proportional to your efforts.


Introduction to Data Mining
  Definitions
  Motivations of DM
  Interdisciplinary Links of DM


What is DM?
  Or more precisely KDD (knowledge discovery from databases)?
  Many definitions
  An iterative process, not plug-and-play
    raw data -> transformed data -> preprocessed data -> data mining -> post-processing -> knowledge
  One definition is
    A non-trivial process of identifying valid, novel, useful and ultimately understandable patterns in data
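
The chain of stages on this slide can be sketched as a handful of small functions. This is only an illustration under stated assumptions: the function names, the toy shopping-basket data, and the 0.5 support threshold are all hypothetical and do not come from the course materials. The stage order follows the slide.

```python
from collections import Counter

def transform(raw_rows):
    """Raw data -> transformed data: split comma-separated text into records."""
    return [row.split(",") for row in raw_rows]

def preprocess(records):
    """Transformed -> preprocessed: trim, lowercase, drop empty items and records."""
    cleaned = []
    for record in records:
        items = sorted({item.strip().lower() for item in record if item.strip()})
        if items:
            cleaned.append(items)
    return cleaned

def mine(baskets):
    """Data mining: count how many baskets contain each item."""
    counts = Counter()
    for basket in baskets:
        counts.update(basket)
    return counts

def postprocess(counts, n_baskets, min_support=0.5):
    """Post-processing: keep only items whose support meets the threshold."""
    return {
        item: count / n_baskets
        for item, count in counts.items()
        if count / n_baskets >= min_support
    }

# Hypothetical raw data: one line of text per shopping basket.
raw = ["milk, bread", "Bread,butter ", "", "milk,bread,eggs", "bread"]

baskets = preprocess(transform(raw))
knowledge = postprocess(mine(baskets), len(baskets))
print(knowledge)
```

In practice the process is iterative, as the slide stresses: a look at `knowledge` often sends you back to adjust `preprocess` or the threshold and run the chain again.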


Need for Data Mining
  Data accumulate and double every 9 months
  There is a big gap from stored data to knowledge; and the transition won’t occur automatically.
  Manual data analysis is not new but a bottleneck
  Fast developing Computer Science and Engineering generates new demands
    Seeking knowledge from massive data
  Any personal experience?
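
To get a feel for what "double every 9 months" implies, the short sketch below takes the slide's figure at face value and computes the growth factor over a few time spans. The function name and the chosen spans are illustrative assumptions.

```python
DOUBLING_MONTHS = 9  # the doubling period quoted on the slide

def growth_factor(months, doubling_months=DOUBLING_MONTHS):
    """Factor by which the data volume grows after `months` months."""
    return 2 ** (months / doubling_months)

for months in (9, 12, 36, 60):
    print(f"{months:>2} months: x{growth_factor(months):.1f}")
```

Under this assumption the volume grows about 2.5 times per year and 16 times in three years, which is why manual analysis becomes a bottleneck.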


When is DM useful?
  Data rich world
  Large data (dimensionality and size)
    Image data (size)
    Gene chip data (dimensionality)
  Little knowledge about data (exploratory data analysis)
    What if we have some knowledge?


DM perspectives
  KDD “goals”: Prediction, description, explanation, optimization, and exploration
  Knowledge forms: patterns vs. models
  Understandability and representation of knowledge
  Some applications
    Business intelligence (CRM)
    Security (Info, Comp Systems, Networks, Data, Privacy)
    Scientific discovery (bioinformatics, medicine)


Challenges
  Increasing data dimensionality and data size
  Various data forms
  New data types
    Streaming data, multimedia data
  Efficient search and access to data/knowledge
  Intelligent update and integration


Interdisciplinary Links of DM
  Statistics
  Databases
  AI
  Machine Learning
  Visualization
  High Performance Computing
    supercomputers, distributed/parallel/cluster computing


Statistics
  Discovery of structures or patterns in data sets
    hypothesis testing, parameter estimation
  Optimal strategies for collecting data
    efficient search of large databases
  Static data
    constantly evolving data
  Models play a central role
    algorithms are of a major concern
    patterns are sought
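
As one concrete instance of the hypothesis testing mentioned on this slide, the sketch below runs an exact two-sample permutation test on the difference of means. The choice of a permutation test, the function name, and the toy samples are assumptions made for illustration; the slide does not name a specific test.

```python
from itertools import combinations
from statistics import mean

def permutation_test(a, b):
    """Exact two-sided permutation test on the difference of group means.

    Returns the fraction of all ways to split the pooled values into groups
    of the original sizes whose mean difference is at least as large as the
    observed one (the p-value under the null of no group effect).
    """
    pooled = list(a) + list(b)
    observed = abs(mean(a) - mean(b))
    extreme = total = 0
    for idx in combinations(range(len(pooled)), len(a)):
        chosen = set(idx)
        group_a = [pooled[i] for i in chosen]
        group_b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        total += 1
        # Small tolerance guards against floating-point rounding in the means.
        if abs(mean(group_a) - mean(group_b)) >= observed - 1e-12:
            extreme += 1
    return extreme / total

# Hypothetical samples: two small groups with clearly separated values.
p_value = permutation_test([1, 2, 3], [7, 8, 9])
print(p_value)
```

Enumerating every split is only feasible for tiny samples, which echoes the slide's point that algorithms become a major concern once the data are large.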


Relational Databases
  A relational database can contain several tables
    Tables and schemas
  The goal in data organization is to maintain data and quickly locate the requested data
    Queries and index structures
  Query execution and optimization
    Query optimization is to find the “best” possible evaluation method for a given query
  Providing fast, reliable access to data for data mining
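
The effect of an index on how a query is evaluated can be observed with Python's built-in sqlite3 module. This is a minimal sketch: the `sales` table, its columns, and the index name are hypothetical, and SQLite merely stands in for whatever DBMS is actually used.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales (customer, amount) VALUES (?, ?)",
    [(f"c{i % 100}", float(i)) for i in range(1000)],
)

query = "SELECT SUM(amount) FROM sales WHERE customer = ?"

def plan(sql, params):
    """Return the optimizer's chosen plan as text (last column of each row)."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[-1] for row in rows)

plan_before = plan(query, ("c7",))   # no index yet: a full table scan
conn.execute("CREATE INDEX idx_sales_customer ON sales (customer)")
plan_after = plan(query, ("c7",))    # the optimizer can now search the index

total = conn.execute(query, ("c7",)).fetchone()[0]
print(plan_before)
print(plan_after)
print(total)
```

The query text and its result are unchanged; only the evaluation method differs, which is the sense of "best" evaluation method on the slide.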


AI
  Intelligent agents
    Perception-Action-Goal-Environment
  Search
    Uniform cost and informed search algorithms
  Knowledge representation
    FOL, production rules, frames with semantic networks
  Knowledge acquisition
  Knowledge maintenance and application
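
Uniform cost search, named on this slide, expands the cheapest frontier node first. The following is a minimal sketch using a priority queue; the graph, its edge costs, and the function name are made up for illustration.

```python
import heapq

def uniform_cost_search(graph, start, goal):
    """Return (cost, path) of a cheapest path from start to goal, or None.

    `graph` maps each node to a dict {neighbor: non-negative step cost}.
    """
    frontier = [(0, start, [start])]   # priority queue ordered by path cost
    best_cost = {start: 0}
    while frontier:
        cost, node, path = heapq.heappop(frontier)
        if node == goal:
            return cost, path
        if cost > best_cost.get(node, float("inf")):
            continue  # stale entry: a cheaper route to this node was found
        for neighbor, step in graph.get(node, {}).items():
            new_cost = cost + step
            if new_cost < best_cost.get(neighbor, float("inf")):
                best_cost[neighbor] = new_cost
                heapq.heappush(frontier, (new_cost, neighbor, path + [neighbor]))
    return None

# Hypothetical weighted graph.
graph = {"A": {"B": 1, "C": 5}, "B": {"C": 1, "D": 6}, "C": {"D": 2}, "D": {}}
result = uniform_cost_search(graph, "A", "D")
print(result)  # (4, ['A', 'B', 'C', 'D'])
```

An informed search such as A* would keep the same structure but order the queue by path cost plus a heuristic estimate of the remaining cost.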


Machine Learning
  Focusing on complex representations, data-intensive problems, and search-based methods
  Flexibility with prior knowledge and collected data
  Generalization from data and empirical validation
    statistical soundness and computational efficiency
    constrained by finite computing & data resources
  Challenges from KDD
    scaling up, cost info, auto data preprocessing, more knowledge types
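
Empirical validation of generalization is commonly done by holding out part of the data. The sketch below uses a one-nearest-neighbor classifier on a tiny one-dimensional data set. The classifier, the data, and the even/odd split are all assumptions chosen to keep the example short.

```python
def nearest_neighbor_predict(train, x):
    """Predict the label of the training point closest to x (1-NN)."""
    _, label = min(train, key=lambda point: abs(point[0] - x))
    return label

def accuracy(train, test):
    """Fraction of test points whose predicted label matches the true label."""
    hits = sum(nearest_neighbor_predict(train, x) == y for x, y in test)
    return hits / len(test)

# Hypothetical labeled data: (feature value, class label).
data = [(1.0, "low"), (1.5, "low"), (2.0, "low"), (2.4, "low"),
        (2.6, "high"), (3.0, "high"), (3.5, "high"), (4.0, "high")]

train = data[0::2]  # points at even positions are used for training
test = data[1::2]   # points at odd positions are held out

train_accuracy = accuracy(train, train)
test_accuracy = accuracy(train, test)
print(train_accuracy, test_accuracy)
```

Accuracy on the training points is perfect by construction, while the held-out accuracy is lower; only the latter says anything about generalization.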


Visualization
  Producing a visual display with insights into the structure of the data with interactive means
    zoom in/out, rotating, displaying detailed info
  Various types of visualization methods
    show summary properties and explore relationships between variables
    investigate large DBs and convey lots of information
    analyze data with geographic/spatial location
  A pre- and post-processing tool for KDD


Bibliography
  J. Han and M. Kamber. Data Mining: Concepts and Techniques. 2001. Morgan Kaufmann.
  D. Hand, H. Mannila, P. Smyth. Principles of Data Mining. 2001. MIT.
  W. Klosgen and J.M. Zytkow (eds.). Handbook of Data Mining and Knowledge Discovery. 2001.