1 SHIM 413 Database Applications for Healthcare Fall 2006 Slides by H. T. Bao.


Page 1: 1 SHIM 413 Database Applications for Healthcare Fall 2006 Slides by H. T. Bao.

1

SHIM 413: Database Applications for Healthcare

Fall 2006

Slides by H. T. Bao

Page 2: 1 SHIM 413 Database Applications for Healthcare Fall 2006 Slides by H. T. Bao.

2

Outline of the presentation

• Objectives, Prerequisite and Content
• Brief Introduction to Lectures
• Discussion and Conclusion

This presentation summarizes the content and organization of lectures in the module “Knowledge Discovery and Data Mining”.

Page 3: 1 SHIM 413 Database Applications for Healthcare Fall 2006 Slides by H. T. Bao.

3

Objectives

This course provides:

• fundamental techniques of knowledge discovery and data mining (KDD)

• issues in the practical use of KDD and its tools

• case studies of KDD applications in the medical domain (healthcare)

Page 4: 1 SHIM 413 Database Applications for Healthcare Fall 2006 Slides by H. T. Bao.

4

Prerequisite for the course

Nothing special, but the following are expected:

• experience with computers

• basics of databases and statistics

• programming skills at an advanced level

Page 5: 1 SHIM 413 Database Applications for Healthcare Fall 2006 Slides by H. T. Bao.

5

Content of the course

Lecture 1: Overview of KDD

Lecture 2: Preparing data

Lecture 3: Decision tree induction

Lecture 4: Mining association rules

Lecture 5: Automatic cluster detection

Lecture 6: Artificial neural networks

Lecture 7: Evaluation of discovered knowledge

Page 6: 1 SHIM 413 Database Applications for Healthcare Fall 2006 Slides by H. T. Bao.

6

Outline of the presentation

• Objectives, Prerequisite and Content
• Brief Introduction to Lectures
• Discussion and Conclusion

This presentation summarizes the content and organization of lectures on the “Knowledge Discovery and Data Mining” topic.

Page 7: 1 SHIM 413 Database Applications for Healthcare Fall 2006 Slides by H. T. Bao.

7

Brief introduction to lectures

Lecture 1: Overview of KDD

Lecture 2: Preparing data

Lecture 3: Decision tree induction

Lecture 4: Mining association rules

Lecture 5: Automatic cluster detection

Lecture 6: Artificial neural networks

Lecture 7: Evaluation of discovered knowledge

Page 8: 1 SHIM 413 Database Applications for Healthcare Fall 2006 Slides by H. T. Bao.

8

Lecture 1: Overview of KDD

1. What is KDD and Why?

2. The KDD Process

3. KDD Applications

4. Data Mining Methods

5. Challenges for KDD

Page 9: 1 SHIM 413 Database Applications for Healthcare Fall 2006 Slides by H. T. Bao.

9

KDD: A Definition

KDD is the automatic extraction of non-obvious, hidden knowledge from large volumes of data.

• 10^6 to 10^12 bytes: we never see the whole data set or put it in the memory of computers

• What knowledge? How to represent and use it?

• Data mining algorithms?

Page 10: 1 SHIM 413 Database Applications for Healthcare Fall 2006 Slides by H. T. Bao.

10

Data, Information, Knowledge

We often see data as a string of bits, or numbers and symbols, or “objects” which we collect daily.

Information is data stripped of redundancy, and reduced to the minimum necessary to characterize the data.

Knowledge is integrated information, including facts and their relations, which have been perceived, discovered, or learned as our “mental pictures”.

Knowledge can be considered data at a high level of abstraction and generalization.

Page 11: 1 SHIM 413 Database Applications for Healthcare Fall 2006 Slides by H. T. Bao.

11

From Data to Knowledge

...10, M, 0, 10, 10, 0, 0, 0, SUBACUTE, 37, 2, 1, 0,15,-,-, 6000, 2, 0, abnormal, abnormal,-, 2852, 2148, 712, 97, 49, F,-,multiple,,2137, negative, n, n, ABSCESS,VIRUS

12, M, 0, 5, 5, 0, 0, 0, ACUTE, 38.5, 2, 1, 0,15, -,-, 10700,4,0,normal, abnormal, +, 1080, 680, 400, 71, 59, F,-,ABPC+CZX,, 70, negative, n, n, n, BACTERIA, BACTERIA

15, M, 0, 3, 2, 3, 0, 0, ACUTE, 39.3, 3, 1, 0,15, -, -, 6000, 0,0, normal, abnormal, +, 1124, 622, 502, 47, 63, F, -,FMOX+AMK, , 48, negative, n, n, n, BACTE(E), BACTERIA

16, M, 0, 32, 32, 0, 0, 0, SUBACUTE, 38, 2, 0,   0, 15, -, +, 12600, 4, 0,abnormal, abnormal, +, 41, 39, 2, 44, 57, F, -, ABPC+CZX, ?, ? ,negative, ?, n, n, ABSCESS,   VIRUS...

Medical Data by Dr. Tsumoto, Tokyo Med. & Dent. Univ., 38 attributes

Legend: numerical attribute, categorical attribute, missing values, class labels

IF cell_poly <= 220 AND Risk = n AND Loc_dat = + AND Nausea > 15 THEN Prediction = VIRUS [87.5%]

The bracketed value is the rule's confidence (predictive accuracy).
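As a rough illustration of how a rule of this form could be checked against records, the sketch below applies the rule's four conditions and reports the fraction of covered records that carry the predicted class. The attribute names come from the rule on the slide; the records, the `label` field, and the function names are invented for illustration and are not taken from Dr. Tsumoto's data set. Treating "confidence" as this fraction is an assumption.

```python
# Hypothetical records: attribute names follow the slide's rule,
# but every value below is made up for illustration.
records = [
    {"cell_poly": 150, "Risk": "n", "Loc_dat": "+", "Nausea": 20, "label": "VIRUS"},
    {"cell_poly": 200, "Risk": "n", "Loc_dat": "+", "Nausea": 16, "label": "VIRUS"},
    {"cell_poly": 90,  "Risk": "n", "Loc_dat": "+", "Nausea": 30, "label": "VIRUS"},
    {"cell_poly": 210, "Risk": "n", "Loc_dat": "+", "Nausea": 18, "label": "BACTERIA"},
    {"cell_poly": 500, "Risk": "n", "Loc_dat": "+", "Nausea": 20, "label": "BACTERIA"},
    {"cell_poly": 100, "Risk": "n", "Loc_dat": "-", "Nausea": 5,  "label": "VIRUS"},
]


def rule_fires(record):
    """True when all four conditions of the IF part hold for the record."""
    return (
        record["cell_poly"] <= 220
        and record["Risk"] == "n"
        and record["Loc_dat"] == "+"
        and record["Nausea"] > 15
    )


def rule_confidence(data, predicted="VIRUS"):
    """Fraction of covered records whose label equals the predicted class."""
    covered = [r for r in data if rule_fires(r)]
    if not covered:
        return None  # the rule covers nothing, so confidence is undefined
    correct = sum(1 for r in covered if r["label"] == predicted)
    return correct / len(covered)


confidence = rule_confidence(records)
print(f"covered records: {sum(rule_fires(r) for r in records)}, confidence: {confidence}")
```

With these invented records the rule covers four cases, three of which are labeled VIRUS, so the reported confidence is 0.75.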

Page 12: 1 SHIM 413 Database Applications for Healthcare Fall 2006 Slides by H. T. Bao.

12

Data Rich, Knowledge Poor

People gather and store so much data because they think some valuable assets are implicitly coded within it.

Raw data is rarely of direct benefit. Its true value depends on the ability to extract information useful for decision support.

Impractical Manual Data Analysis

How to acquire knowledge for knowledge-based systems remains the main difficult and crucial problem.

[Figure: knowledge base and inference engine]

• Tradition: via knowledge engineers

• New trend: via automatic programs

Page 13: 1 SHIM 413 Database Applications for Healthcare Fall 2006 Slides by H. T. Bao.

13

Benefits of Knowledge Discovery

[Figure: Value versus Volume for EDP, MIS, and DSS; labels: Generate, Rapid Response, Disseminate]

EDP: Electronic Data Processing

MIS: Management Information Systems

DSS: Decision Support Systems

Page 14: 1 SHIM 413 Database Applications for Healthcare Fall 2006 Slides by H. T. Bao.

14

Lecture 1: Overview of KDD

1. What is KDD and Why?

2. The KDD Process

3. KDD Applications

4. Data Mining Methods

5. Challenges for KDD

Page 15: 1 SHIM 413 Database Applications for Healthcare Fall 2006 Slides by H. T. Bao.

15

The KDD process

“The non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data” (Fayyad, Piatetsky-Shapiro, Smyth, 1996)

• non-trivial process: a multi-step process

• valid: justified patterns/models

• novel: previously unknown

• useful: can be used

• understandable: by human and machine

Page 16: 1 SHIM 413 Database Applications for Healthcare Fall 2006 Slides by H. T. Bao.

16

The Knowledge Discovery Process

KDD is inherently interactive and iterative.

1. Understand the domain and define problems

2. Collect and preprocess data

3. Data mining: extract patterns/models

4. Interpret and evaluate discovered knowledge

5. Put the results to practical use

Data mining is a step in the KDD process consisting of methods that produce useful patterns or models from the data, under some acceptable computational efficiency limitations.

Page 17: 1 SHIM 413 Database Applications for Healthcare Fall 2006 Slides by H. T. Bao.

17

The KDD Process

[Flowchart; its boxes are numbered 1 to 5 in the original figure. Box labels:]

• Data organized by function
• Data warehousing
• Create/select target database
• Select sampling technique and sample data
• Supply missing values
• Eliminate noisy data
• Normalize values
• Transform values
• Create derived attributes
• Find important attributes & value ranges
• Select DM task(s)
• Select DM method(s)
• Extract knowledge
• Test knowledge
• Refine knowledge
• Transform to different representation
• Query & report generation
• Aggregation & sequences
• Advanced methods
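Two of the preprocessing boxes in this flowchart, "Supply missing values" and "Normalize values", can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the course's method: it assumes that "?" marks a missing entry, fills gaps with the mean of the observed values, and uses min-max scaling to [0, 1]. The column of temperatures and the function names are hypothetical.

```python
def fill_missing(values, missing_markers=("?", None)):
    """Replace missing entries with the mean of the observed values.

    Mean imputation is only one possible choice; it is assumed here
    for illustration.
    """
    observed = [float(v) for v in values if v not in missing_markers]
    if not observed:
        raise ValueError("cannot impute: no observed values")
    mean = sum(observed) / len(observed)
    return [mean if v in missing_markers else float(v) for v in values]


def min_max_normalize(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column: no spread to rescale
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical numeric column with one missing entry.
temperatures = [37, 38.5, "?", 39.3, 38]

filled = fill_missing(temperatures)
scaled = min_max_normalize(filled)
print(filled)
print(scaled)
```

Mean imputation and min-max scaling are the simplest options; other choices (median, most frequent value, z-scores) fit the same two-step shape.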

Page 18: 1 SHIM 413 Database Applications for Healthcare Fall 2006 Slides by H. T. Bao.

18

Main Contributing Areas of KDD

• Databases: store, access, search, update data (deduction)

• Statistics: infer info from data (deduction & induction, mainly numeric data)

• Machine Learning: computer algorithms that improve automatically through experience (mainly induction, symbolic data)

[data warehouses: integrated data]

[OLAP: On-Line Analytical Processing]

Page 19: 1 SHIM 413 Database Applications for Healthcare Fall 2006 Slides by H. T. Bao.

19

Lecture 1: Overview of KDD

1. What is KDD and Why?

2. The KDD Process

3. KDD Applications

4. Data Mining Methods

5. Challenges for KDD

Page 20: 1 SHIM 413 Database Applications for Healthcare Fall 2006 Slides by H. T. Bao.

20

Potential Applications

Business information
- Marketing and sales data analysis
- Investment analysis
- Loan approval
- Fraud detection
- etc.

Manufacturing information
- Controlling and scheduling
- Network management
- Experiment result analysis
- etc.

Scientific information
- Sky survey cataloging
- Biosequence databases
- Geosciences: Quakefinder
- etc.

Personal information

Page 21: 1 SHIM 413 Database Applications for Healthcare Fall 2006 Slides by H. T. Bao.

21

KDD: Opportunity and Challenges

• Data Rich, Knowledge Poor (the resource)

• Enabling Technology (Interactive MIS, OLAP, parallel computing, Web, etc.)

• Competitive Pressure

• Data Mining Technology Mature

[Figure: the four factors above point to KDD]

Page 22: 1 SHIM 413 Database Applications for Healthcare Fall 2006 Slides by H. T. Bao.

22

KDD: A New and Fast Growing Area

• KDD workshops: since 1989.

• International conferences: KDD (USA), first in 1995; PAKDD (Asia), first in 1997; PKDD (Europe), first in 1997; ECML’04/PKDD’04 (in Pisa, Italy).

• Industry interest and competition: IBM, Microsoft, Silicon Graphics, Sun, Boeing, NASA, SAS, SPSS, …

• About 80% of the Fortune 500 companies are involved in data mining projects or using data mining systems.

• Japan: FGCS Project (logic programming and reasoning).

“Knowledge Discovery is the most desirable end-product of computing.” (Wiederhold, Stanford Univ.)

Page 23: 1 SHIM 413 Database Applications for Healthcare Fall 2006 Slides by H. T. Bao.

23

Lecture 1: Overview of KDD

1. What is KDD and Why?

2. The KDD Process

3. KDD Applications

4. Data Mining Methods

5. Challenges for KDD

Page 24: 1 SHIM 413 Database Applications for Healthcare Fall 2006 Slides by H. T. Bao.

24

Primary Tasks of Data Mining

• Classification: finding the description of several predefined classes and classifying a data item into one of them.

• Regression: mapping a data item to a real-valued prediction variable.

• Clustering: identifying a finite set of categories or clusters to describe the data.

• Summarization: finding a compact description for a subset of data.

• Dependency Modeling: finding a model which describes significant dependencies between variables.

• Deviation and change detection: discovering the most significant changes in the data.

Page 25: 1 SHIM 413 Database Applications for Healthcare Fall 2006 Slides by H. T. Bao.

25

Classification: “What factors determine cancerous cells?”

[Figure: Data (examples: cancerous cell data) → Mining Algorithm (classification algorithm) → General patterns]

Classification algorithms:
- Rule Induction
- Decision tree
- Neural Network

Page 26: 1 SHIM 413 Database Applications for Healthcare Fall 2006 Slides by H. T. Bao.

26

Classification: Rule Induction

“What factors determine a cell is cancerous?”

If Color = light and Tails = 1 and Nuclei = 2 Then Healthy Cell (certainty = 92%)

If Color = dark and Tails = 2 and Nuclei = 2 Then Cancerous Cell (certainty = 87%)
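The two induced rules on this slide can be applied to a new cell with a simple first-match lookup. The sketch below is illustrative only: the conditions and certainty values are copied from the slide, while the data layout, the first-match strategy, and the name `classify` are assumptions.

```python
# Each rule: (conditions, predicted class, certainty), copied from the slide.
RULES = [
    ({"Color": "light", "Tails": 1, "Nuclei": 2}, "Healthy", 0.92),
    ({"Color": "dark", "Tails": 2, "Nuclei": 2}, "Cancerous", 0.87),
]


def classify(cell):
    """Return (label, certainty) of the first rule whose conditions all match.

    Returns None when no rule covers the cell; first-match ordering is an
    assumption made for this sketch.
    """
    for conditions, label, certainty in RULES:
        if all(cell.get(attr) == value for attr, value in conditions.items()):
            return label, certainty
    return None


print(classify({"Color": "dark", "Tails": 2, "Nuclei": 2}))
print(classify({"Color": "light", "Tails": 1, "Nuclei": 2}))
print(classify({"Color": "dark", "Tails": 1, "Nuclei": 1}))
```

The last call matches neither rule and returns None, which shows that a small rule set can leave part of the input space uncovered.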

Page 27: 1 SHIM 413 Database Applications for Healthcare Fall 2006 Slides by H. T. Bao.

27

Classification: Decision Trees

[Figure: decision tree with tests on Color (dark / light), #nuclei (1 / 2), and #tails (1 / 2); leaves are labeled healthy or cancerous]

Page 28: 1 SHIM 413 Database Applications for Healthcare Fall 2006 Slides by H. T. Bao.

28

Classification: Neural Networks

“What factors determine a cell is cancerous?”

[Figure: neural network with inputs Color = dark, # nuclei = 1, # tails = 2 and outputs Healthy and Cancerous]
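To make the picture concrete, the inputs named on this slide (Color = dark, # nuclei = 1, # tails = 2) can be fed through a tiny feed-forward network with two outputs, Healthy and Cancerous. This is only a sketch: the layer sizes, the weights, the biases, and the encoding of Color as 1 for dark are all invented for illustration, and a real network would learn its weights from training data.

```python
import math


def sigmoid(x):
    """Logistic activation, squashing any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def forward(inputs, w_hidden, b_hidden, w_out, b_out):
    """One forward pass: sigmoid hidden layer, then softmax over the outputs."""
    hidden = [
        sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(w_hidden, b_hidden)
    ]
    scores = [
        sum(w * h for w, h in zip(row, hidden)) + b
        for row, b in zip(w_out, b_out)
    ]
    # Softmax, shifted by the maximum score for numerical stability.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical weights and biases (not learned, not from the slides).
W_HIDDEN = [[0.8, -0.5, 0.6], [-0.4, 0.9, 0.3]]
B_HIDDEN = [0.1, -0.2]
W_OUT = [[-1.0, 0.7], [1.2, -0.6]]
B_OUT = [0.0, 0.0]

# Assumed encoding: Color = dark -> 1.0, # nuclei = 1, # tails = 2.
cell = [1.0, 1.0, 2.0]
probs = forward(cell, W_HIDDEN, B_HIDDEN, W_OUT, B_OUT)
for label, p in zip(["Healthy", "Cancerous"], probs):
    print(f"{label}: {p:.3f}")
```

The two printed values are probabilities that sum to 1; with untrained weights they carry no diagnostic meaning.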