19 November 2005
TuBao Ho (HieuChi Dam)School of Knowledge ScienceJapan Advanced Institute of Science and Technology
Introduction to Knowledge Discovery and Data Mining
The lecture aims to …
Provide basic concepts and techniques of knowledge discovery and data mining (KDD).
Emphasize the different kinds of data, the different tasks performed on the data, and the different methods for performing those tasks.
Emphasize the KDD process and the important issues that arise when applying data mining methods.
Outline
1. Why knowledge discovery and data mining?
2. Basic concepts of KDD
3. KDD techniques: classification, association, clustering, text and Web mining
4. Challenges and trends in KDD
5. Case study in medicine data mining
Much more data around us than before
We are living in exciting times: computers and computer networks surround us.
Astronomical data
Astronomy is facing a major data avalanche: multi-terabyte sky surveys and archives (soon multi-petabyte), billions of detected sources, hundreds of measured attributes per source …
Multi-wavelength data paint a more complete(and a more complex!) picture of the universe
Infrared emission from interstellar dust
Smoothed galaxy density map
Astronomical data
Earthquake data
[Maps: earthquakes 1932-1996; the 04/25/92 Cape Mendocino, CA event; Japanese earthquakes 1961-1994]
9 August 2004 (AP): Swedish geologists may have found a way to predict earthquakes weeks before they happen (current accurate warnings only come seconds before a quake).
Water samples taken 4,900 feet beneath the ground in northern Iceland show the content of several metals increased dramatically a few weeks before a magnitude 5.8 earthquake struck.
"We need a database over other earthquakes." The bedrock at the test site is basalt, which is also found in other earthquake-prone areas like Hawaii and Japan.
Predict earthquakes
Finance: the market data
Data on price fluctuation throughout the day in the market
Ishikawa’s monthly industrial data
[Bar chart: monthly sales of large retail stores (大型小売店売上高), months 1-7, vertical axis 0 to 250,000.]
10,267,507,282 bases in 9,092,760 records.
Explosion of biological data
The full DNA sequence consists of 1.6 million characters; the portion shown below is about 350 characters (roughly 4,570 times smaller):
What does biological data look like?
…TACATTAGTTATTACATTGAGAAACTTTATAATTAAAAAAGATTCATGTAAATTTCTTATTTGTTTATTTAGAGGTTTTAAATTTAATTTCTAAGGGTTTGCTGGTTTCATTGTTAGAATATTTAACTTAATCAAATTATTTGAATTTTTGAAAATTAGGATTAATTAGGTAAGTAAATAAAATTTCTCTAACAAATAAGTTAAATTTTTAAATTTAAGGAGATAAAAATACTACTCTGTTTTATTATGGAAAGAAAGATTTAAATACTAAAGGGTTTATATATATGAAGTAGTTACCCTTAGAAAAATATGGTATAGAAAGCTTAAATATTAAGAGTGATGAAGTATATTATGT…
Approximately 80% of the world’s data is held in unstructured formats (source: Oracle Corporation)
Web sites, digital libraries, … increase the volume of textual data
Example: the JAIST library
Number of journals accessible online: 4,700 (280,000 articles/year)
Reading even 1% of them (2,820 articles) would require 8 articles per day.
Keeping up with the literature: 1960s: easy; 1980s: time-consuming; 2000s: difficult; soon: impossible.
Text: huge sources of knowledge
MEDLINE: a medical text database
36003: Biomed Pharmacother. 1999 Jun;53(5-6):255-63. Pathogenesis of autoimmune hepatitis.Institute of Liver Studies, King's College Hospital, London, United Kingdom.
Autoimmune hepatitis (AIH) is an idiopathic disorder affecting the hepatic parenchyma. There are no morphological features that are pathognomonic of the condition but the characteristic histological picture is that of an interface hepatitis without other changes that are more typical of other liver diseases. It is associated with hypergammaglobulinaemia, high titres of a wide range of circulating auto-antibodies, often a family history of other disorders that are thought to have an autoimmune basis, and a striking response to immunosuppressive therapy. The pathogenetic mechanisms are not yet fully understood but there is now considerable circumstantial evidence suggesting that: (a) there is an underlying genetic predisposition to the disease; (b) this may relate to several defects in immunological control of autoreactivity, with consequent loss of self-tolerance to liver auto-antigens; (c) it is likely that an initiating factor, such as a hepatotropic viral infection or an idiosyncratic reaction to a drug or other hepatotoxin, is required to induce the disease in susceptible individuals; and, (d) the final effector mechanism of tissue damage probably involves auto-antibodies reacting with liver-specific antigens expressed on hepatocyte surfaces, rather than direct T-cell cytotoxicity against hepatocytes.
The world's most comprehensive source of life sciences and biomedical bibliographic information, with nearly eleven million records (http://medline.cos.com)
About 40,000 abstracts concern hepatitis (the subject of our research project); the abstract above is one of them.
looney.cs.umn.edu han - [09/Aug/1996:09:53:52 -0500] "GET mobasher/courses/cs5106/cs5106l1.html HTTP/1.0" 200
mega.cs.umn.edu njain - [09/Aug/1996:09:53:52 -0500] "GET / HTTP/1.0" 200 3291
mega.cs.umn.edu njain - [09/Aug/1996:09:53:53 -0500] "GET /images/backgnds/paper.gif HTTP/1.0" 200 3014
mega.cs.umn.edu njain - [09/Aug/1996:09:54:12 -0500] "GET /cgi-bin/Count.cgi?df=CS home.dat\&dd=C\&ft=1 HTTP
mega.cs.umn.edu njain - [09/Aug/1996:09:54:18 -0500] "GET advisor HTTP/1.0" 302
mega.cs.umn.edu njain - [09/Aug/1996:09:54:19 -0500] "GET advisor/ HTTP/1.0" 200 487
looney.cs.umn.edu han - [09/Aug/1996:09:54:28 -0500] "GET mobasher/courses/cs5106/cs5106l2.html HTTP/1.0" 200. . . . . . . . .
Typical data in a server access log
Web server access logs data
Web link data
Internet Map [lumeta.com]
Food Web [Martinez ’91]
Friendship Network [Moody ’01]
What do we want from the data?
Much more data, of many different kinds, is collected than ever before.
We want to exploit the data, extracting new and useful information/knowledge from it, such as:
- Which phenomena can be seen in the data when a disease occurs?
- What are the properties of several metals 4,900 feet beneath the ground?
- Is the Japan stock market rising this week?
- How have other researchers discussed the "interferon effect"?
- etc.
Want to draw valid conclusions from data.
Statistics provides principles and methodology for designing the processes of:
What does statistics usually do?
- Data collection
- Summarizing and interpreting the data
- Drawing conclusions or generalities
Evolution of data processing
Evolutionary steps, each with its business question and enabling technology:
- Data Collection (1960s): "What was my total revenue in the last five years?" Enabled by computers, tapes, and disks.
- Data Access (1980s): "What were unit sales in Korea last March?" Enabled by faster and cheaper computers with more storage, relational databases, structured query language (SQL), etc.
- Data Warehousing and Decision Support: "What were unit sales in Korea last March? Drill down to Hokkaido." Enabled by faster and cheaper computers with more storage, on-line analytical processing (OLAP), multidimensional databases, and data warehouses.
- Data Mining: "What's likely to happen to Hokkaido unit sales next month? Why?" Enabled by faster and cheaper computers with more storage, advanced algorithms, and massive databases.
Drill down: to move from summary information to the detailed data that created it. For example, adding totals from all the orders for a year creates gross sales for the year; drilling down would identify the types of products that were most popular.
KDD: Convergence of three technologies
Increasing computing power
Statistical and learning algorithms
Improved data collection and
management
KDD
Increasing computing power
[Photo: a 30 MB, 1.6-meter-tall disk drive from 1966.]
Lab PC cluster: 16 nodes, dual Intel Xeon 2.4 GHz CPUs, 512 KB cache.
JAIST's CRAY XT3: compute nodes with AMD Opteron 150 2.4 GHz CPUs (4 × 90), memory 32 GB × 90 = 2.88 TB, CPUs connected in a 3D torus with 7.68 GB/s bidirectional bandwidth between CPUs.
[Scale of stored information, from kilo to yotta: kilo: a book; mega: a photo; giga: a movie; tera: all books (as words); peta: all books (multimedia); exa/zetta/yotta: everything recorded. 20 TB contains the 20 million books in the Library of Congress.]
How much information is there?
Soon everything can be recorded and indexed
Most bytes will never be seen by humans
What will be key technologies to deal with huge volumes of information sources?
[This page adapted from the invited talk of Jim Gray (Microsoft) at KDD’2003]
customer (cust-ID, name, address, age, income, credit-info, …):
  C1 | Smith, Sandy | 5463 E Hasting, Burnaby, BC V5A 459, Canada | 21 | $27000 | 1 | …
item (item-ID, name, brand, category, type, price, place-made, supplier, cost):
  I3 | high-res-TV | Toshiba | high resolution | TV | $988.00 | Japan | NikoX | $600.00
  I8 | multidisc-CDplayer | Sanyo | multidisc | CD player | $369.00 | Japan | MusicFont | $120.00
employee (emp-ID, name, category, group, salary, commission):
  E35 | Jones, Jane | home entertainment | manager | $18,000 | 2%
branch (branch-ID, name, address):
  B1 | City square | 369 Cambie St., Vancouver, BC V5L 3A2, Canada
purchases (trans-ID, cust-ID, empl-ID, date, time, method-paid, amount):
  T100 | C1 | B55 | 01/21/98 | 15:45 | Visa | $1357.00
items-sold (trans-ID, item-ID, qty):
  T100 | I3 | 1
  T100 | I8 | 2
works-at (empl-ID, branch-ID):
  E55 | B1
Relational databases
A relational database is a collection of tables, each of which is assigned a unique name, and consists of a set of attributes * and a set of tuples**.
(*: data fields such as customer ID; **: data records, i.e., rows)
A data warehouse is a repository of information collected from multiple sources, stored under a unified schema, and usually residing at a single site.
[Diagram: data sources in Hokkaido, Kanazawa, Hongkong, and Busan are cleaned, transformed, integrated, and loaded into the data warehouse, which clients access through query and analysis tools.]
Data warehouses
A transactional database consists of a file where each record represents a transaction.
A transaction typically includes a unique transaction identity number (trans_ID) and a list of the items making up the transaction.
Transactional databases
Trans_ID | list of item_IDs
T100 | beer, cake, onigiri
T200 | beer, cake
T300 | beer, onigiri
T400 | beer, onigiri
T500 | cake
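As an illustrative sketch (not part of the lecture), such a transactional table maps naturally onto a dictionary of item sets in Python, and itemset support, the basic quantity behind association mining, becomes a simple count. The transaction IDs and items below are taken from the table above:

```python
# Transactional database: transaction ID -> set of items bought together.
transactions = {
    "T100": {"beer", "cake", "onigiri"},
    "T200": {"beer", "cake"},
    "T300": {"beer", "onigiri"},
    "T400": {"beer", "onigiri"},
    "T500": {"cake"},
}

def support(itemset, db=transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    items = set(itemset)
    hits = sum(1 for t in db.values() if items <= t)
    return hits / len(db)
```

For instance, support({"beer"}) is 4/5 = 0.8 and support({"beer", "onigiri"}) is 3/5 = 0.6.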
Object-Oriented Databases
Object-Relational Databases
Spatial Databases
Temporal Databases and Time-Series Databases
Text Databases and Multimedia Databases
Heterogeneous Databases and Legacy Databases
The World Wide Web
Advanced database systems
Spatial databases contain spatial-related information: geographic databases, VLSI chip design databases, medical and satellite image databases etc.
Data mining may uncover patterns describing the content of several metals at specific locations when earthquakes happen, the climate of mountainous areas at various altitudes, etc.
[Map: Japanese earthquakes, 1961-1994]
Spatial databases
These databases store time-related data. A temporal database stores relational data that include time-related attributes (timestamps with different semantics). A time-series database stores sequences of values that change with time (e.g., stock exchange data).
Data mining can find the characteristics of object evolution and trends of change for objects: e.g., stock exchange data can be mined to uncover trends in investment strategies.
Spatial-temporal databases
Temporal and time-series databases
Text databases contain documents, usually highly unstructured or semi-structured. Mining aims to uncover general descriptions of object classes, keywords, content associations, the clustering behavior of text objects, etc.
Multimedia databases store image, audio, and video data: picture content-based retrieval, voice-email systems, video-on-demand-systems, speech-based user interface, etc.
Text and multimedia databases
The Web provides an enormous source of explicit and implicit knowledge that people can navigate and search for what they need.
Example: When examining the data collected from Internet Mart, heavily trodden paths gave BT hints to regions of the site which were of key interest to its visitors.
The world wide web
Statistical and learning algorithms
Techniques have often been waiting for computing technology to catch up. Recent decades have seen the development and improvement of statistical and learning algorithms: support vector machines and kernel methods, multi-relational data mining, graph-based learning, finite state machines, etc.
[Diagrams of three sequence-labeling models over hidden states St-1, St, St+1 and observations Ot-1, Ot, Ot+1, with transition and observation edges:
- HMMs (Hidden Markov Models): directed graph, joint, generative
- MEMMs (Maximum Entropy Markov Models): directed graph, conditional, discriminative
- CRFs (Conditional Random Fields): undirected graph, conditional, discriminative]
Independent Component Analysis (ICA) vs. Principal Component Analysis (PCA)
Principal Component Analysis (PCA) finds directions of maximal variance in Gaussian data (second-order statistics).
Independent Component Analysis (ICA) finds directions of maximal independence in non-Gaussian data (higher-order statistics).
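To make the PCA half concrete, here is a minimal pure-Python sketch (an illustration, not the lecture's material): it finds the direction of maximal variance of a small 2-D dataset by eigendecomposing its 2×2 covariance matrix. The data points are hypothetical, constructed to lie mainly along the direction (2, 1):

```python
import math

# Hypothetical 2-D points: t runs along the direction (2, 1), with a small
# perpendicular offset n * (-0.1, 0.2). Not data from the lecture.
ts = [-0.75, -0.25, 0.25, 0.75]
ns = [1, -1, -1, 1]
points = [(2 * t - 0.1 * n, t + 0.2 * n) for t, n in zip(ts, ns)]

def principal_direction(pts):
    """First principal component of 2-D data via its 2x2 covariance matrix."""
    n = len(pts)
    mx = sum(x for x, _ in pts) / n
    my = sum(y for _, y in pts) / n
    a = sum((x - mx) ** 2 for x, _ in pts) / n        # var(x)
    c = sum((y - my) ** 2 for _, y in pts) / n        # var(y)
    b = sum((x - mx) * (y - my) for x, y in pts) / n  # cov(x, y)
    # Largest eigenvalue of [[a, b], [b, c]], then a matching eigenvector.
    lam = (a + c + math.sqrt((a - c) ** 2 + 4 * b * b)) / 2
    vx, vy = (b, lam - a) if b else (1.0, 0.0)
    norm = math.hypot(vx, vy)
    return vx / norm, vy / norm
```

For these points the recovered direction is proportional to (2, 1), the axis of maximal variance. ICA would instead exploit higher-order statistics and non-Gaussianity, which this sketch does not attempt.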
[Demo: four microphones (Mic 1-4) record mixtures of four speakers (Terry, Scott, Te-Won, Tzyy-Ping); playing the mixtures, performing ICA, then playing the separated components.]
ICA: separating signal data acquired by multiple sensors
People have gathered and stored so much data because they believe valuable assets are implicitly coded within it. The data's true value depends on the ability to extract useful information.
How to acquire knowledge for knowledge-based systems remains a central and difficult problem of artificial intelligence.
The need for powerful tools to analyze data
Outline
1. Why knowledge discovery and data mining?
2. Basic concepts of KDD
3. KDD techniques: classification, association, clustering, text and Web mining
4. Challenges and trends in KDD
5. Case study in medicine data mining
Data: uninterpreted signals, e.g., 25.1, 27.3, 21.6, …
Information: data with meaning, e.g., an object's mass (a measure of an object's resistance to changes in either the speed or direction of its motion)
Knowledge: integrated information consisting of facts and relationships, "verified truths", e.g., E = mc²
Metaphor: data is the rock and knowledge is the ore. Who is the miner?
Data, information, and knowledge
[Scatter plot of 29 customers by (income, debt) in US$K: (5.6, 8.5), (6.0, 13.0), (11.0, 12.0), …, (63.0, 18.0). The means, income 34.5 and debt 18.4, summarize the data (information). Customers with income below 33 have defaulted on the loan; the others are in good status with the bank, giving the rule "if income < $33K, then the person has defaulted on the loan" (knowledge).]
Data, information, and knowledge
Knowledge Discovery and Data Mining (KDD)
10^6-10^12 bytes: databases so large that grasping the whole dataset, or loading it into computer memory, is difficult. What knowledge do we want, and how should it be represented?
Which data mining algorithm should be applied?
The automatic extraction of non-obvious, hidden knowledge from large volumes of data.
...10, M, 0, 10, 10, 0, 0, 0, SUBACUTE, 37, 2, 1, 0,15,-,-, 6000, 2, 0, abnormal, abnormal,-, 2852, 2148, 712, 97, 49, F,-,multiple,,2137, negative, n, n, ABSCESS, VIRUS
12, M, 0, 5, 5, 0, 0, 0, ACUTE, 38.5, 2, 1, 0,15, -,-, 10700,4,0,normal, abnormal, +, 1080, 680, 400, 71, 59, F,-,ABPC+CZX,, 70, negative, n, n, n, BACTERIA, BACTERIA
15, M, 0, 3, 2, 3, 0, 0, ACUTE, 39.3, 3, 1, 0,15, -, -, 6000, 0,0, normal, abnormal, +, 1124, 622, 502, 47, 63, F, -,FMOX+AMK, , 48, negative, n, n, n, BACTE(E), BACTERIA
16, M, 0, 32, 32, 0, 0, 0, SUBACUTE, 38, 2, 0, 0, 15, -, +, 12600, 4, 0,abnormal, abnormal, +, 41, 39, 2, 44, 57, F, -, ABPC+CZX, ?, ?, negative, ?, n, n, ABSCESS, VIRUS...
IF cell_poly <= 220 AND Risk = n AND Loc_dat = + AND Nausea > 15 THEN Prediction = VIRUS [confidence = 87.5%]
Meningitis data, Tokyo Med. & Dental Univ., 38 attributes
(the attributes mix numerical and categorical values, with missing values and a class attribute)
From data to knowledge
Databases: store, access, search, and update data (deduction)
Statistics: infer information from data (deduction and induction, mainly numeric data)
Machine Learning: computer algorithms that improve automatically through experience (mainly induction, symbolic data)
KDD
also Algorithmics, Visualization, Data warehouses, OLAP, etc.
KDD: An interdisciplinary field
KDD: New and fast growing area
KDD’95, 96, 97, 98, …, 04, 05 (ACM, America); PAKDD’97, 98, 99, 00, …, 04, 05 (Pacific & Asia, http://www.jaist.ac.jp/PAKDD-05); PKDD’97, 98, 99, 00, …, 04, 05 (Europe); ICDM’01, 02, …, 04, 05 (IEEE); SDM’01, …, 04, 05 (SIAM)
Industrial Interest: IBM, Microsoft, Silicon Graphics, Sun, Boeing, NASA, SAS, SPSS, …
Japan: the FGCS Project focused on logic programming and reasoning; attention has been paid to knowledge acquisition and machine learning. Projects "Knowledge Science", "Discovery Science", and "Active Mining" (2001-2004).
KDD is inherently interactive and iterative. The process:
1. Understand the domain and define problems
2. Collect and preprocess data (maybe 70-90% of effort and cost in KDD)
3. Data mining: extract patterns/models (a step in the KDD process consisting of methods that produce useful patterns or models from the data)
4. Interpret and evaluate discovered knowledge
5. Put the results to practical use
The KDD process
[The common tasks arranged along the five KDD steps:
1. Create/select target database; data organized by function; data warehousing.
2. Select sampling technique and sample data; supply missing values; eliminate noisy data; normalize values; transform values; create derived attributes; transform to a different representation; find important attributes and value ranges.
3. Select DM task(s); select DM method(s); extract knowledge.
4. Test knowledge; refine knowledge.
5. Query & report generation; aggregation & sequences; advanced methods.]
Common tasks in the KDD process
Different data schemas call for different methods.
Types of data: flat data tables, relational databases, temporal & spatial data, transactional databases, multimedia data, genome databases, materials science data, textual data, Web data, etc.
Mining tasks and methods:
- Classification/Prediction: decision trees, neural networks, rule induction, support vector machines, Hidden Markov Models, etc.
- Description: association analysis, clustering, summarization, etc.
Data schemas vs. mining methods
color #nuclei #tails class
H1 light 1 1 healthy
H2 dark 1 1 healthy
H3 light 1 2 healthy
H4 light 2 1 healthy
C1 dark 1 2 cancerous
C2 dark 2 1 cancerous
C3 light 2 2 cancerous
C4 dark 2 2 cancerous
[Diagram: the same eight cells plotted with their class labels (supervised data) and without labels (unsupervised data).]
Dataset: cancerous and healthy cells
Predictive mining tasks perform inference on the current data in order to make prediction or classification. (予測的マイニングの課題は,未知のデータの予測を目的として,現在のデータに関する推論を行うことである)
Ex. If “color = dark” and “#nuclei =2”then cancerous
Descriptive mining tasks characterize the properties of the data in the database. (記述的マイニングの課題は,データベース中のデータの全般的特性を特徴付ける記述を与えることである)
Ex. "Healthy cells mostly have one nucleus while cancerous ones have two"
What to do? Primary tasks of KDD
Patterns: local summaries of relationships.
Models: global descriptions of a data set.
A model is a global description of a data set: a high-level, population or large-sample perspective. A model tells us about correlations between variables (regression), hierarchies of clusters (clustering), a neural network, etc.
A pattern is a low-level summary of a relationship, perhaps one that holds only for a few records or only a few variables (local).
Ex. If “color = dark” and “#nuclei =2” then cancerous
What to find? Patterns and models
Classification/prediction is the process of finding a set of models (or patterns) that describe and distinguish data classes or concepts, for the purpose of being able to use the model to predict the class of objects whose class label is unknown
(*classification for categorical data; **prediction for numerical data)
Decision trees, IF-THEN rules, neural networks, support vector machines, etc.
Classification/Prediction
[Diagram, model construction: a classification algorithm learns a classifier (model) from the training data {H1, H2, H3, H4, C1, C2}, e.g., "If color = dark and #tails = 2 then cancerous cell". Model usage: the classifier is applied to an unknown case to decide: cancerous?]
Classification—A two-step process
Predictive accuracy: the ability of the classifier to correctly predict unseen data
Speed: the computation cost
Robustness: the ability of the classifier to make correct predictions given noisy data or data with missing values
Scalability: the ability to construct the classifier efficiently given large amounts of data
Interpretability: the level of understanding and insight provided by the classifier
Criteria for classification methods
Description is the process of finding a set of patterns or models that describe properties of the data (essentially a summary of the data), for the purpose of understanding the data.
Clustering, association mining, summarization, trend detection, etc.
Description
Interestingness: an overall measure combining the novelty, utility, simplicity, reliability, and validity of discovered patterns/models
Speed: the computation cost
Scalability: the ability to construct the patterns/models efficiently given large amounts of data
Interpretability: the level of understanding and insight provided by the patterns/models
Criteria for description methods
Outline
1. Why knowledge discovery and data mining?
2. Basic concepts of KDD
3. KDD techniques: classification, association, clustering, text and Web mining
4. Challenges and trends in KDD
5. Case study in medicine data mining
Decision tree learning
Learn to generate classifiers in the form of decision trees from supervised data.
Mining with decision trees
A decision tree is a flow-chart-like tree structure:
- each internal node denotes a test on an attribute
- each branch represents an outcome of the test
- leaf nodes represent classes or class distributions
- the top-most node in a tree is the root node
[Decision tree over the eight cells, root {H1, H2, H3, H4, C1, C2, C3, C4}, split on #nuclei:
- #nuclei = 1 → {H1, H2, H3, C1}: split on color: light → {H1, H3}: H; dark → split on #tails: 1 → {H2}: H, 2 → C
- #nuclei = 2 → split on #tails: 1 → {H4}: H, 2 → C]
Decision tree induction (DTI)
Decision tree generation consists of two phases:
- Tree construction: at the start, all the training objects are at the root; the examples are then partitioned recursively based on selected attributes.
- Tree pruning: identify and remove branches that reflect noise or outliers.
Use of a decision tree: to classify an unknown object, test its attribute values against the decision tree.
1. At each node, choose the “best”attribute by a given measure for attribute selection
2. Extend tree by adding new branch for each value of the attribute
3. Sort training examples to leaf nodes
4. If examples in a node belong to one class Then Stop Else Repeat steps 1-4 for leaf nodes
5. Prune the tree to avoid over-fitting
Two steps: recursively generate the tree (1-4), and prune the tree (5)
DTI general algorithm
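The five steps above can be sketched directly in Python. The following compact ID3-style implementation (an illustration; it is not the lecture's D2MS tool and it omits the pruning step 5) is run on the cancerous/healthy cell table shown earlier:

```python
import math
from collections import Counter

# The cell table from the earlier slide: (color, #nuclei, #tails, class).
CELLS = [
    ("light", 1, 1, "healthy"), ("dark", 1, 1, "healthy"),
    ("light", 1, 2, "healthy"), ("light", 2, 1, "healthy"),
    ("dark", 1, 2, "cancerous"), ("dark", 2, 1, "cancerous"),
    ("light", 2, 2, "cancerous"), ("dark", 2, 2, "cancerous"),
]
ATTRS = {"color": 0, "nuclei": 1, "tails": 2}

def entropy(rows):
    counts = Counter(r[-1] for r in rows)
    n = len(rows)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def gain(rows, idx):
    n = len(rows)
    remainder = 0.0
    for v in {r[idx] for r in rows}:
        sub = [r for r in rows if r[idx] == v]
        remainder += len(sub) / n * entropy(sub)
    return entropy(rows) - remainder

def id3(rows, attrs):
    classes = {r[-1] for r in rows}
    if len(classes) == 1:                 # pure node -> leaf (stop, step 4)
        return classes.pop()
    if not attrs:                         # attributes exhausted -> majority class
        return Counter(r[-1] for r in rows).most_common(1)[0][0]
    best = max(attrs, key=lambda a: gain(rows, attrs[a]))  # step 1
    idx = attrs[best]
    rest = {a: i for a, i in attrs.items() if a != best}
    node = {"attr": best, "branches": {}}
    for v in {r[idx] for r in rows}:      # steps 2-3: one branch per value
        node["branches"][v] = id3([r for r in rows if r[idx] == v], rest)
    return node

def classify(node, row):
    while isinstance(node, dict):
        node = node["branches"][row[ATTRS[node["attr"]]]]
    return node

tree = id3(CELLS, ATTRS)
```

On this toy table the induced tree classifies all eight training cells correctly; a real implementation would add pruning (step 5) and handle attribute values unseen during training.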
Measures for attribute selection
A typical dataset in machine learning
14 objects belonging to two classes {Y, N} are observed on 4 properties.
Dom(Outlook) = {sunny, overcast, rain}
Dom(Temperature) = {hot, mild, cool}
Dom(Humidity) = {high, normal}
Dom(Wind) = {weak, strong}
Day | Outlook | Temperature | Humidity | Wind | Class
D1 | sunny | hot | high | weak | N
D2 | sunny | hot | high | strong | N
D3 | overcast | hot | high | weak | Y
D4 | rain | mild | high | weak | Y
D5 | rain | cool | normal | weak | Y
D6 | rain | cool | normal | strong | N
D7 | overcast | cool | normal | strong | Y
D8 | sunny | mild | high | weak | N
D9 | sunny | cool | normal | weak | Y
D10 | rain | mild | normal | weak | Y
D11 | sunny | mild | normal | strong | Y
D12 | overcast | mild | high | strong | Y
D13 | overcast | hot | normal | weak | Y
D14 | rain | mild | high | strong | N
Training data for concept “play-tennis”
[A complex decision tree for playing tennis with "temperature" at the root: temperature splits the examples into cool {D5, D6, D7, D9}, hot {D1, D2, D3, D13}, and mild {D4, D8, D10, D11, D12, D14}; each branch is then split further on outlook, wind, and humidity (one outlook branch is even null) until the leaves are pure.]
outlook?
- sunny {D1, D2, D8, D9, D11} → humidity?
  - high {D1, D2, D8} → no
  - normal {D9, D11} → yes
- overcast {D3, D7, D12, D13} → yes
- rain {D4, D5, D6, D10, D14} → wind?
  - strong {D6, D14} → no
  - weak {D4, D5, D10} → yes
This tree is much simpler as “outlook” is selected at the root.How to select good attribute to split a decision node?
A simple decision tree for playing tennis
Which attribute is the best?
The "playing-tennis" set S contains 9 positive objects (+) and 5 negative objects (-), denoted [9+, 5-].
If attributes “humidity” and “wind” split S into sub-nodes with proportions of positive and negative objects as below, which attribute is better?
A1 = humidity: [9+, 5-] splits into normal [6+, 1-] and high [3+, 4-].
A2 = wind: [9+, 5-] splits into weak [6+, 2-] and strong [3+, 3-].
Entropy
Entropy characterizes the impurity (purity) of an arbitrary collection of objects.
S is the collection of positive and negative objects.
p⊕ is the proportion of positive objects in S.
p⊖ is the proportion of negative objects in S.
In the play-tennis example, these numbers are 14, 9/14, and 5/14, respectively.
Entropy is defined as follows:
Entropy(S) = − p⊕ log2 p⊕ − p⊖ log2 p⊖
Entropy
The entropy function relative to a Boolean classification, as the proportion p⊕ of positive objects varies between 0 and 1.
In general, for c classes: Entropy(S) ≡ − Σ (i = 1..c) p_i log2 p_i
From the 14 examples of play-tennis, 9 positive and 5 negative objects (denoted [9+, 5-]):
Entropy([9+, 5-]) = − (9/14) log2(9/14) − (5/14) log2(5/14) = 0.940
Notice:
1. Entropy is 0 if all members of S belong to the same class. For example, if all members are positive (p⊕ = 1), then p⊖ = 0, and Entropy(S) = − 1·log2(1) − 0·log2(0) = 0 (taking 0·log2 0 = 0).
2. Entropy is 1 if the collection contains an equal number of positive and negative examples; if the numbers are unequal, the entropy is between 0 and 1.
Example
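The computation above is easy to check in code. This small helper (an illustration, not from the lecture) implements the two-class entropy formula, with the 0·log2(0) terms taken as 0:

```python
import math

def entropy(pos, neg):
    """Entropy of a collection with pos positive and neg negative objects."""
    total = pos + neg
    e = 0.0
    for count in (pos, neg):
        if count:  # 0 * log2(0) is taken to be 0
            p = count / total
            e -= p * math.log2(p)
    return e
```

entropy(9, 5) evaluates to about 0.940, entropy(7, 7) to exactly 1.0, and entropy(14, 0) to 0.0, matching the three cases discussed above.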
We define a measure, called information gain, of the effectiveness of an attribute in classifying data: the expected reduction in entropy caused by partitioning the objects according to this attribute.
Gain(S, A) ≡ Entropy(S) − Σ (v ∈ Values(A)) (|Sv| / |S|) Entropy(Sv)
where Values(A) is the set of all possible values for attribute A, and Sv is the subset of S for which A has value v.
Information gain measures the expected reduction in entropy
Values(Wind) = {Weak, Strong}, S = [9+, 5-]
Sweak, the subnode with value "weak", is [6+, 2-]; Sstrong, the subnode with value "strong", is [3+, 3-].
Gain(S, Wind) = Entropy(S) − Σ (v ∈ {Weak, Strong}) (|Sv| / |S|) Entropy(Sv)
= Entropy(S) − (8/14) Entropy(Sweak) − (6/14) Entropy(Sstrong)
= 0.940 − (8/14)·0.811 − (6/14)·1.00 = 0.048
Information gain measures the expected reduction in entropy
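The same calculation can be written as a short sketch (illustrative, not the lecture's code): each child node is given as a (positives, negatives) pair, and the gain is the parent entropy minus the size-weighted child entropies:

```python
import math

def entropy(pos, neg):
    total = pos + neg
    return -sum((c / total) * math.log2(c / total) for c in (pos, neg) if c)

def info_gain(parent, children):
    """parent and children are (pos, neg) pairs; children partition parent."""
    total = sum(p + n for p, n in children)
    remainder = sum((p + n) / total * entropy(p, n) for p, n in children)
    return entropy(*parent) - remainder

# Gain(S, Wind): S = [9+, 5-], Weak = [6+, 2-], Strong = [3+, 3-]
gain_wind = info_gain((9, 5), [(6, 2), (3, 3)])
```

gain_wind evaluates to about 0.048, and info_gain((9, 5), [(3, 4), (6, 1)]) gives about 0.151 for Humidity, matching the slide values.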
Humidity: S = [9+, 5-], E = 0.940; High → [3+, 4-], E = 0.985; Normal → [6+, 1-], E = 0.592.
Gain(S, Humidity) = 0.940 − (7/14)·0.985 − (7/14)·0.592 = 0.151
Wind: S = [9+, 5-], E = 0.940; Weak → [6+, 2-], E = 0.811; Strong → [3+, 3-], E = 1.00.
Gain(S, Wind) = 0.940 − (8/14)·0.811 − (6/14)·1.00 = 0.048
Which attribute is the best classifier?
Information gain of all attributes
Gain (S, Outlook) = 0.246
Gain (S, Humidity) = 0.151
Gain (S, Wind) = 0.048
Gain (S, Temperature) = 0.029
{D1, D2, …, D14} [9+, 5-], split on Outlook:
- Sunny: {D1, D2, D8, D9, D11} [2+, 3-] → ? (which attribute should be tested here?)
- Overcast: {D3, D7, D12, D13} [4+, 0-] → Yes
- Rain: {D4, D5, D6, D10, D14} [3+, 2-] → ?
Ssunny = {D1, D2, D8, D9, D11}
Gain(Ssunny, Humidity) = 0.970 − (3/5)·0.0 − (2/5)·0.0 = 0.970
Gain(Ssunny, Temperature) = 0.970 − (2/5)·0.0 − (2/5)·1.0 − (1/5)·0.0 = 0.570
Gain(Ssunny, Wind) = 0.970 − (2/5)·1.0 − (3/5)·0.918 = 0.019
Next step in growing the decision tree
Stopping condition
1. Every attribute has already been included along this path through the tree
2. The training objects associated with each leaf node all have the same target attribute value (i.e., their entropy is zero)
Notice: Algorithm ID3 uses Information Gain and C4.5, its successor, uses Gain Ratio (a variant of Information Gain)
Over-fitting in decision trees
The generated tree may overfit the training dataToo many branches, some may reflect anomalies due to noise or outliers
Result is in poor accuracy for unseen objects
Two approaches to avoid overfittingPrepruning: Halt tree construction early—do not split a node if this would result in the goodness measure falling below a threshold
Difficult to choose an appropriate threshold
Postpruning: Remove branches from a “fully grown” tree—get a sequence of progressively pruned trees
Use a set of data different from the training data to decide which is the “best pruned tree”
[The tree being converted: outlook? sunny → humidity? (high → no; normal → yes); overcast → yes; rain → wind? (strong → no; weak → yes)]
IF (Outlook = Sunny) and (Humidity = High)THEN PlayTennis = No
IF (Outlook = Sunny) and (Humidity = Normal)THEN PlayTennis = Yes
Converting a tree to rules
Attributes with many values
If an attribute has many values (e.g., days of the month), ID3 will tend to select it.
C4.5 uses GainRatio instead
GainRatio(S, A) ≡ Gain(S, A) / SplitInformation(S, A)
SplitInformation(S, A) ≡ − Σ (i = 1..c) (|Si| / |S|) log2 (|Si| / |S|)
where Si is the subset of S for which A has value vi.
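A quick illustrative check of the formula (not the lecture's code): SplitInformation is just the entropy of the partition sizes, so a many-valued attribute that shatters S into many tiny subsets gets a large denominator and hence a small GainRatio:

```python
import math

def entropy_of(fractions):
    return -sum(f * math.log2(f) for f in fractions if f)

def split_information(sizes):
    """Entropy of the partition sizes |Si|/|S| induced by an attribute."""
    total = sum(sizes)
    return entropy_of([s / total for s in sizes])

def gain_ratio(gain, sizes):
    return gain / split_information(sizes)

# Wind splits the 14 play-tennis objects into subsets of size 8 (weak)
# and 6 (strong); Gain(S, Wind) = 0.048 from the earlier slide.
gr_wind = gain_ratio(0.048, [8, 6])
```

An attribute with a distinct value per object would have SplitInformation of log2(14) ≈ 3.81, sharply reducing its ratio compared with Wind's ≈ 0.99.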
Bayesian classification
Learning statistical classifiers from supervised data based on Bayes theorem and assumptions about the independence/dependence of the data.
What is Bayesian classification?
Bayesian classifiers are statistical classifiers. Bayesian classification is based on Bayes theorem.
Naïve Bayesian classifiers assume that the effect of an attribute value on a given class is independent of the values of the other attributes.
Bayesian belief networks are graphical models that allow the representation of dependencies among subsets of attributes.
Bayes theorem
Let X be an object whose class label is unknown
Let H be some hypothesis, such as that X belongs to class C.
For classification, we want to determine the posterior probability P(H|X), of H conditioned on X.
Example: Data object consists of fruits, described by color and shape
Suppose X is red and round, and H is the hypothesis that X is an apple.
P(H|X) reflects our confidence that X is an apple given that we have seen that X is red and round.
P(apple | red and round) = ?
Bayes theorem
In contrast, P(H) is the prior probability of H.
In our example, P(H) is the probability that any given data object is an apple, regardless of how the data sample looks (independent of X).
P(X|H) is the likelihood of X given H, that is, the probability that X is red and round given that we know X is an apple.
P(X), P(H), and P(X|H) may be estimated from the given data. Bayes theorem allows us to calculate P(H|X)
P(H|X) = P(X|H) · P(H) / P(X)

P(apple | red ∧ round) = P(red ∧ round | apple) · P(apple) / P(red ∧ round)
Naïve Bayesian classification
Suppose X = (x1, x2, …, xn), attributes A1, A2, …, An
There are m classes C1, C2, …, Cm
P(Ci|X) denotes probability that X is classified to class Ci.
Example:P(class = N | outlook=sunny, temperature=hot, humidity=high, wind=strong)
Idea: assign to object X the class label Ci such that P(Ci|X) is maximal, i.e., P(Ci|X) > P(Cj|X), ∀j, i≠j.
Ci is called the maximum posterior hypothesis.
Estimating a posteriori probabilities
Bayes theorem: P(Ci|X) = P(X|Ci) · P(Ci) / P(X)
P(X) is constant. Only need maximize P(X|Ci) P(Ci)
Ci such that P(Ci |X) is maximum =
Ci such that P(X| Ci)·P(Ci) is maximum
If prior probability is unknown, commonly assumed that P(C1) = P(C2) = … = P(Cm), and we would maximize P(X|Ci)
Otherwise, P(Ci) = relative frequency of class Ci = Si/S
Problem: computing P(X|Ci) directly is infeasible!
Naïve Bayesian classification
Naïve assumption: We have P(X|Ci) = P(x1,…,xn|Ci), if attributes are independent then P(X|Ci) = P(x1|Ci) x … x P(xn|Ci)
If Ak is categorical then P(xk|Ci) = Sik/Si where Sik is the number of training objects of class Ci having the value xk for Ak, and Si is the number of training objects belonging to Ci.
If Ak is continuous then the attribute is typically assumed to have a Gaussian distribution, so that
P(xk|Ci) = (1 / (√(2π) · σCi)) · exp( − (xk − μCi)² / (2 · σCi²) )

where μCi and σCi are the mean and standard deviation of the values of attribute Ak for training objects of class Ci
To classify an unknown object X, P(X|Ci)P(Ci) is evaluated for each class Ci. X is then assigned to the class Ci if and only if P(X|Ci)P(Ci) > P(X|Cj)P(Cj), for 1 ≤ j ≤ m, j ≠ i.
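The Gaussian class-conditional density above can be coded directly; a minimal sketch (the function name and numbers are illustrative, not from the lecture):

```python
import math

def gaussian_likelihood(x, mu, sigma):
    """P(xk|Ci) under the Gaussian assumption, with mu and sigma the mean
    and standard deviation of the attribute within class Ci."""
    coeff = 1.0 / (math.sqrt(2 * math.pi) * sigma)
    return coeff * math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))

# The density peaks at x = mu with value 1 / (sqrt(2*pi) * sigma).
peak = gaussian_likelihood(25.0, 25.0, 5.0)
```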
P(Y) = 9/14, P(N) = 5/14

outlook:
P(sunny|Y) = 2/9, P(sunny|N) = 3/5
P(overcast|Y) = 4/9, P(overcast|N) = 0
P(rain|Y) = 3/9, P(rain|N) = 2/5
temperature:
P(hot|Y) = 2/9, P(hot|N) = 2/5
P(mild|Y) = 4/9, P(mild|N) = 2/5
P(cool|Y) = 3/9, P(cool|N) = 1/5
humidity:
P(high|Y) = 3/9, P(high|N) = 4/5
P(normal|Y) = 6/9, P(normal|N) = 2/5
wind:
P(strong|Y) = 3/9, P(strong|N) = 3/5
P(weak|Y) = 6/9, P(weak|N) = 2/5

Day  Outlook   Temperature  Humidity  Wind    Class
D1   sunny     hot          high      weak    N
D2   sunny     hot          high      strong  N
D3   overcast  hot          high      weak    Y
D4   rain      mild         high      weak    Y
D5   rain      cool         normal    weak    Y
D6   rain      cool         normal    strong  N
D7   overcast  cool         normal    strong  Y
D8   sunny     mild         high      weak    N
D9   sunny     cool         normal    weak    Y
D10  rain      mild         normal    weak    Y
D11  sunny     mild         normal    strong  Y
D12  overcast  mild         high      strong  Y
D13  overcast  hot          normal    weak    Y
D14  rain      mild         high      strong  N

Play-tennis example: estimating P(xk|Ci)
An unseen object X = <rain, hot, high, weak>
P(X|Y)×P(Y) = P(rain|Y)×P(hot|Y)×P(high|Y)×P(weak|Y)×P(Y) = 3/9 × 2/9 × 3/9 × 6/9 × 9/14 = 0.010582
P(X|N)×P(N) = P(rain|N) ×P(hot|N) ×P(high|N) × P(weak|N) × P(N) = 2/5 × 2/5 × 4/5 × 2/5 × 5/14 = 0.018286
Object X is classified in class N (don’t play)
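The whole calculation can be reproduced with a few lines of code; this is a minimal sketch over the slide's play-tennis table, not an implementation from the lecture:

```python
# Play-tennis training data from the slide:
# (outlook, temperature, humidity, wind, class)
data = [
    ("sunny", "hot", "high", "weak", "N"),
    ("sunny", "hot", "high", "strong", "N"),
    ("overcast", "hot", "high", "weak", "Y"),
    ("rain", "mild", "high", "weak", "Y"),
    ("rain", "cool", "normal", "weak", "Y"),
    ("rain", "cool", "normal", "strong", "N"),
    ("overcast", "cool", "normal", "strong", "Y"),
    ("sunny", "mild", "high", "weak", "N"),
    ("sunny", "cool", "normal", "weak", "Y"),
    ("rain", "mild", "normal", "weak", "Y"),
    ("sunny", "mild", "normal", "strong", "Y"),
    ("overcast", "mild", "high", "strong", "Y"),
    ("overcast", "hot", "normal", "weak", "Y"),
    ("rain", "mild", "high", "strong", "N"),
]

def naive_bayes_score(x, label):
    """P(X|Ci) * P(Ci), with P(xk|Ci) estimated as Sik/Si (categorical)."""
    rows = [r for r in data if r[-1] == label]
    score = len(rows) / len(data)               # prior P(Ci)
    for k, value in enumerate(x):
        score *= sum(r[k] == value for r in rows) / len(rows)
    return score

x = ("rain", "hot", "high", "weak")
score_y = naive_bayes_score(x, "Y")   # 3/9 * 2/9 * 3/9 * 6/9 * 9/14 ≈ 0.0106
score_n = naive_bayes_score(x, "N")   # 2/5 * 2/5 * 4/5 * 2/5 * 5/14 ≈ 0.0183
```

It reproduces the two scores above, so X is assigned to class N.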
Play-tennis example: classifying X
It makes computation possible
It yields optimal classifiers when the independence assumption is satisfied
But the assumption is seldom satisfied in practice, as attributes (variables) are often correlated
Attempts to overcome this limitation, among others:
Bayesian networks, that combine Bayesian reasoning with causal relationships between attributes
Decision trees, that reason on one attribute at a time, considering most important attributes first
The independence hypothesis
Other classification methods
Neural Networks
Instance-based Classification
Genetic Algorithms
Rough Set Approach
Statistical Approaches
Support Vector Machines
etc.
(Figure: a neural network whose inputs are cell features such as color = dark, # nuclei = 1, # tails = 2, with hidden units H1-H4 and units C1-C4, classifying cells as Healthy or Cancerous.)
Mining with neural networks
Advantages
prediction accuracy is generally high
robust: works when training examples contain errors
output may be discrete, real-valued, or a vector of several discrete or real-valued attributes
fast evaluation of the learned target function
Criticism
long training time
difficult to understand the learned function (weights)
not easy to incorporate domain knowledge
Mining with neural networks
Uses the most similar individual instances seen in the past to classify a new instance
Typical approaches:
k-nearest neighbor approach
Instances as points in a Euclidean space
Locally weighted regression: constructs local approximations
Case-based reasoning: uses symbolic representations and knowledge-based inference
Instance-based classification
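A minimal k-nearest-neighbor sketch; the 2-D points are hypothetical, chosen only to illustrate distance-based majority voting:

```python
import math
from collections import Counter

def knn_classify(train, x, k=3):
    """Label x by majority vote among its k nearest training instances,
    treating instances as points in a Euclidean space."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], x))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical labeled 2-D instances.
train = [((0, 0), "a"), ((0, 1), "a"), ((1, 0), "a"),
         ((5, 5), "b"), ((5, 6), "b"), ((6, 5), "b")]
label = knn_classify(train, (0.5, 0.5))   # its 3 nearest neighbors are "a"
```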
Genetic algorithms (GA)
Generate initial population
do
    Calculate the fitness of each member
    // simulate another generation
    do
        1. Select parents from current population
        2. Perform crossover to add offspring to the new population
    while new population is not full
    1. Merge new population into the current population
    2. Mutate current population
while not converged
EVOLUTION            PROBLEM SOLVING
Environment          Problem
Individual           Candidate Solution
Fitness              Quality

Fitness → chances for survival and reproduction
Quality → chance for seeding new solutions
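The loop above can be sketched in code; the fitness function, crossover, and mutation below are toy choices (maximizing the number of 1-bits in a string), not from the lecture:

```python
import random

def fitness(bits):
    """Quality of a candidate solution: here, the number of 1-bits."""
    return sum(bits)

def crossover(a, b):
    """One-point crossover of two parent bit strings."""
    cut = random.randrange(1, len(a))
    return a[:cut] + b[cut:]

def mutate(bits, rate=0.05):
    """Flip each bit with a small probability."""
    return [1 - b if random.random() < rate else b for b in bits]

def evolve(pop_size=20, length=16, generations=40):
    # generate initial population
    pop = [[random.randint(0, 1) for _ in range(length)]
           for _ in range(pop_size)]
    for _ in range(generations):
        # select parents from the current population (fitter half)
        parents = sorted(pop, key=fitness, reverse=True)[:pop_size // 2]
        # perform crossover to fill the new population, then mutate
        children = [mutate(crossover(random.choice(parents),
                                     random.choice(parents)))
                    for _ in range(pop_size)]
        # merge the new population into the current one, keep the best
        pop = sorted(pop + children, key=fitness, reverse=True)[:pop_size]
    return max(pop, key=fitness)

random.seed(0)        # reproducible toy run
best = evolve()
```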
Rough sets are used to approximately or “roughly” define equivalence classes
A rough set for a class C is approximated by two sets:
a lower approximation (objects certain to be in C)
an upper approximation (objects possibly in C)
Finding the minimal subsets (reducts) of attributes, dependencies in data, rules, etc.
(Figure: equivalence classes induced by R = {color, shape} on a set U of objects, with a set X approximated from below by its lower approximation and from above by its upper approximation.)

Rough Sets and Data Mining, T.Y. Lin, N. Cercone (eds.), Kluwer Academic Publishers, 1997.
Rough Sets in Knowledge Discovery, L. Polkowski, A. Skowron (eds.), Physica-Verlag, 1998.

Rough set approach
Association rule mining
Description learning that aims to find all possible associations from data.
Analyzes customer buying habits by finding associations between the different items that customers place in their “shopping baskets”
Helps develop marketing strategies by gaining insight into which items are frequently purchased together by customers
How often do people buy onigiri and beer together?
Market basket analysis
Association rule X ⇒Y
support s = probability that a transaction contains X and Y
confidence c = conditional probability that a transaction having X also contains Y
If minimum support 50%, minimum confidence 50%:
A ⇒ C (s=50%, c=66.6%)
C ⇒ A (s=50%, c=100%)
Transaction ID   Items Bought
2000             A, B, C
1000             A, C
4000             A, D
5000             B, E, F
(Venn diagram: customers who buy onigiri, customers who buy beer, and their overlap: customers who buy both.)
Rule measures: support and confidence
The rule X ⇒ Y holds in the transaction set D with confidence c if c% of transactions in D that contain X also contain Y.
The rule X ⇒ Y has support s in the transaction set D if s% of transactions in D contain X ∪ Y.
Confidence denotes the strength of implication and support indicates the frequencies of the occurring patterns in the rule
support(X ⇒ Y) = P(X and Y) = (# trans. containing both X and Y) / (# trans. in the database)

confidence(X ⇒ Y) = P(Y|X) = P(X and Y) / P(X) = (# trans. containing X and Y) / (# trans. containing X)
Basic concepts
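With the four example transactions from the previous slide, support and confidence can be computed directly; a sketch, not code from the lecture:

```python
# The four transactions from the slide.
transactions = [{"A", "B", "C"},   # TID 2000
                {"A", "C"},        # TID 1000
                {"A", "D"},        # TID 4000
                {"B", "E", "F"}]   # TID 5000

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(x, y):
    """P(Y|X) = support(X ∪ Y) / support(X)."""
    return support(x | y) / support(x)

s = support({"A", "C"})         # 2 of 4 transactions: 50%
c = confidence({"A"}, {"C"})    # 2 of the 3 transactions with A: 66.6%
```

This confirms A ⇒ C (s = 50%, c = 66.6%) and C ⇒ A (s = 50%, c = 100%).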
It is composed of two steps:
1. Find all frequent itemsets: By definition, each of these itemsets will occur at least as frequently as a pre-determined minimum support count
2. Generate strong association rules from the frequent itemsets: By definition, these rules must satisfy minimum support and minimum confidence
Association mining: Apriori algorithm
Min. support 50%, min. confidence 50%

Transaction ID   Items Bought
2000             A, B, C
1000             A, C
4000             A, D
5000             B, E, F

Frequent Itemset   Support
{A}                75%
{B}                50%
{C}                50%
{A,C}              50%

For rule A ⇒ C:
support = support({A, C}) = 50%
confidence = support({A, C}) / support({A}) = 66.6%
Association mining: Apriori principle
The Apriori principle:Any subset of a frequent itemset must be frequent
(if an itemset is not frequent, its supersets are not)
1. Find the frequent itemsets: the sets of items that have support higher than the minimum support
A subset of a frequent itemset must also be a frequent itemset
i.e., if {AB} is a frequent itemset, both {A} and {B} should be a frequent itemset
Iteratively find frequent itemsets Lk with cardinality from 1 to k (k-itemsets) from candidate itemsets Ck (Lk ⊆ Ck)
2. Use the frequent itemsets to generate association rules.
C1 … Li-1 Ci Li Ci+1 … Lk
The Apriori algorithm
Mining frequent itemsets for Boolean association rules
It employs an iterative, level-wise search, where k-itemsets are used to explore (k+1)-itemsets
First, the set of frequent 1-itemsets L1 is found.
L1 is used to find the set of 2-itemsets L2,
then L2 is used to find L3,
and so on, until no more frequent k-itemsets can be found.
To improve the efficiency of the level-wise generation of frequent itemsets, the important Apriori property is used to reduce the search space.
Apriori algorithm: Finding frequent itemsetsusing candidate generation
If an itemset l does not satisfy the minimum support threshold min_sup, then l is not frequent, that is, P(l)<min_sup
If an item A is added to the itemset l, then the resulting itemset (i.e., l ∪ A) cannot occur more frequently than l. Therefore, l ∪ A is not frequent either, that is, P(l ∪ A) < min_sup
This is the anti-monotone property:
if a set cannot pass a test, all of its supersets will fail the same test as well
Apriori algorithm: Finding frequent itemsetsusing candidate generation
Join Lk-1 with itself to generate a set Ck of candidate k-itemsets, and use it to find Lk
Given l1, l2 ∈ Lk-1, the notation li[j] refers to the jth item in li
Apriori assumes that items within a transaction or itemset are sorted in lexicographic order
The join, Lk-1 ⋈ Lk-1, is performed where members of Lk-1 are joinable if their first (k-2) items are in common. That is, l1, l2 ∈ Lk-1 are joined if
(l1[1] = l2[1]) ∧ (l1[2]= l2[2])∧… ∧ (l1[k-2]= l2[k-2]) ∧ (l1[k-1]<l2[k-1])
The resulting itemset by joining l1 and l2 is
l1[1] l1[2] … l1[k-2] l1[k-1] l2[k-1]
Apriori algorithm: join step to find Ck
Ck is a superset of Lk, i.e. its members may or may not be frequent, but all frequent k-itemsets are included in Ck (Lk ⊆ Ck)
A scan of the database to determine the count of each candidate in Ck would result in the determination of Lk (all candidates having a count no less than the min_sup_count are frequent and therefore belong to Lk)
Ck can be huge. To reduce the size of Ck: use apriori property
Any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset
Hence, if any (k-1)-subset of a candidate k-itemset is not in Lk-1, then the candidate cannot be frequent either and so can be removed from Ck (can be tested quickly by a hash tree of frequent itemsets)
Apriori algorithm: The prune step
Transactional data D:
TID    List of item IDs
T100   I1, I2, I5
T200   I2, I4
T300   I2, I3
T400   I1, I2, I4
T500   I1, I3
T600   I2, I3
T700   I1, I3
T800   I1, I2, I3, I5
T900   I1, I2, I3

Scan D for the count of each candidate, giving C1:
Itemset   Sup. count
{I1}      6
{I2}      7
{I3}      6
{I4}      2
{I5}      2

Compare each candidate's support count with the minimum support count; all five candidates qualify, so L1 = C1.

Example (min_sup_count = 2)
Generate candidates C2 from L1 using the Apriori principle:
C2 = { {I1,I2}, {I1,I3}, {I1,I4}, {I1,I5}, {I2,I3}, {I2,I4}, {I2,I5}, {I3,I4}, {I3,I5}, {I4,I5} }

Scan D for the count of each candidate:
Itemset   S.count
{I1,I2}   4
{I1,I3}   4
{I1,I4}   1
{I1,I5}   2
{I2,I3}   4
{I2,I4}   2
{I2,I5}   2
{I3,I4}   0
{I3,I5}   1
{I4,I5}   0

Compare candidate support counts with the minimum support count, giving L2:
Itemset   S.count
{I1,I2}   4
{I1,I3}   4
{I1,I5}   2
{I2,I3}   4
{I2,I4}   2
{I2,I5}   2

Generate candidates C3 from L2 using the Apriori principle, scan D for counts, and compare with the minimum support count:
C3: {I1,I2,I3} (count 2), {I1,I2,I5} (count 2); both qualify, so L3 = C3.

Example (min_sup_count = 2)
1. Join: C3 = L2 ⋈ L2 = {{I1, I2}, {I1, I3}, {I1, I5}, {I2, I3}, {I2, I4}, {I2, I5}} ⋈ {{I1, I2}, {I1, I3}, {I1, I5}, {I2, I3}, {I2, I4}, {I2, I5}}
= {{I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4}, {I2, I3, I5}, {I2, I4, I5}}
2. Prune using the Apriori property: all nonempty subsets of a frequent itemset must also be frequent. Does any candidate have a subset that is not frequent?
2-item subsets of {I1, I2, I3} are {I1, I2}, {I1, I3}, {I2, I3} which are all members of L2. Therefore, keep {I1, I2, I3} in C3
2-item subsets of {I1, I2, I5} are {I1, I2}, {I1, I5}, {I2, I5} which are all members of L2. Therefore, keep {I1, I2, I5} in C3
2-item subsets of {I1, I3, I5} are {I1, I3}, {I1, I5}, {I3, I5}. Subset {I3, I5} is not a member of L2. Therefore, remove {I1, I3, I5} from C3, and so on
3. Therefore, C3 = {{I1, I2, I3}, {I1, I2, I5}}
Example (min_sup_count = 2)
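The whole level-wise search condenses to a short sketch (helper names are my own) that reproduces L1, L2, and L3 on the nine example transactions:

```python
from itertools import combinations

# The nine transactions from the slides (items I1..I5).
db = [{"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
      {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
      {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"}]

def apriori(transactions, min_sup_count=2):
    """Level-wise search: join L(k-1) with itself to form Ck, prune
    candidates having an infrequent (k-1)-subset, then scan the database."""
    count = lambda c: sum(1 for t in transactions if set(c) <= t)
    items = sorted({i for t in transactions for i in t})
    frequent = {}
    level = [(i,) for i in items if count((i,)) >= min_sup_count]
    k = 1
    while level:
        frequent[k] = set(level)
        # join step: tuples sharing their first k-1 items, last items ordered
        candidates = {a + (b[-1],) for a in level for b in level
                      if a[:-1] == b[:-1] and a[-1] < b[-1]}
        # prune step: every k-subset of a (k+1)-candidate must be frequent
        candidates = {c for c in candidates
                      if all(s in frequent[k] for s in combinations(c, k))}
        level = sorted(c for c in candidates if count(c) >= min_sup_count)
        k += 1
    return frequent

L = apriori(db)
```

The prune step drops {I1, I3, I5} and the other three joined candidates exactly as traced above.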
Cluster analysis
Description learning that aims to detect groups of similar objects in unsupervised data.
A cluster is a collection of data objects satisfying
Objects in this cluster are similar to one another
Objects in this cluster are dissimilar to the objects in other clusters
The process of grouping objects into clusters is called clustering
What is cluster analysis?
Clustering analyzes data objects without consulting a known class label.
The objects are clustered or grouped based on the principle of maximizing the within-class similarity and minimizing the between-class similarity
Partition-based clustering is suited to large sets of numerical data.
Hierarchical clustering, with at least O(n²) time complexity, is not suitable for very large datasets.
Mining with clustering
A good clustering method will produce high quality clusters with
high intra-class similarity (within a class)
low inter-class similarity (between classes)
The quality of clustering basically depends on the similarity measure and the cluster representative used by the method
New forms of clustering require different criteria of quality.
What is good clustering?
Statistics: for many years, the focus has been on distance-based clustering (S-Plus, SPSS, SAS)
Machine learning: unsupervised learning. In conceptual clustering, a group of objects forms a class only if it is described by a concept
KDD: Efficient and effective clustering of large databases: scalability, complex shapes and types of data, high dimensional clustering, mixed numerical and categorical data
Clustering in different fields
Scalability
Ability to deal with different types of attributes
Discovery of clusters with arbitrary shape
Minimal requirements for domain knowledge to determine input parameters
Ability to deal with noisy data
Insensitivity to the order of input records
High dimensionality
Constraint-based clustering
Interpretability and usability
Typical requirements of clustering
Partitioning Methods
Hierarchical Methods
Density-Based Methods
Grid-Based Methods
Model-Based Methods
Clustering methods in KDD
Given n objects and k as number of clusters to form. A partitioning algorithm organizes the objects into a partition of k clusters
The clusters are formed to optimize an objective partitioning criterion so that the objects within a cluster are “similar” , whereas the objects of different classes are “dissimilar”
Partitioning methods
1. Select two centers randomly from the n objects
2. Form two clusters by assigning each object to its nearest center
3. Calculate two new centers
4. Reform two new clusters; repeat steps 2 and 3 until the stopping conditions hold
K-means algorithm (K=2)
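The four steps can be sketched as follows (the 2-D points are hypothetical, and the code generalizes beyond K = 2):

```python
import math
import random

def kmeans(points, k=2, iters=20, seed=1):
    """Steps from the slide: pick k centers at random, assign each object
    to its nearest center, recompute the centers, and repeat."""
    centers = random.Random(seed).sample(points, k)   # step 1
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                              # step 2: nearest center
            nearest = min(range(k), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        centers = [tuple(sum(v) / len(cl) for v in zip(*cl)) if cl
                   else centers[i]                    # step 3: new centers
                   for i, cl in enumerate(clusters)]
    return centers, clusters

# Two well-separated hypothetical groups of 2-D points.
points = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)]
centers, clusters = kmeans(points)
```

A fixed iteration count stands in for the stopping condition; in practice one stops when the assignments no longer change.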
Partitioning methods
The k-means algorithm is sensitive to outliers
The k-medoids method uses medoid (the most centrally located object in a cluster)
The EM (Expectation Maximization) algorithm: assigns to a cluster according to a weight representing the probability of membership.
PAM (Partitioning Around Medoids)
From k-Medoids to CLARA (Clustering LARge Applications)
From CLARA to CLARANS (Clustering LARge Applications based on RANdomized Search)
A hierarchical clustering is a sequence of partitions in which each partition is nested into the next (previous) partition in the sequence.
Partition Q is nested into partition P if every component of Q is a subset of a component of P.
(Example on slide: two partitions P and Q of objects x1, …, x10, where every component of Q is contained in a component of P.)
Hierarchical methods
Chameleon: A Hierarchical Clustering Algorithm Using Dynamic Modeling.
Clusters are merged if the interconnectivity and closeness between them are highly related to the internal interconnectivity and closeness of objects within the clusters.
Hierarchical clustering: chameleon
Typically regards clusters as dense regions of objects in the data space that are separated by regions of low density
DBSCAN: Based on Connected Regions with Sufficiently High Density (Nearest Neighbor Estimation)
DENCLUE: Based on Density Distribution Functions (Kernel Estimation)
DBSCAN result for DS2 with MinPts at 4 and Eps at (a) 5.0, (b) 3.5 and (c) 3.0
Density-based methods
Text mining
Finding unknown useful information from huge collection of textual data.
What is text mining?
“The non-trivial extraction of implicit, previously unknown, and potentially useful information from (large amounts of) textual data.”
“An exploration and analysis of textual (natural language) data by automatic and semi-automatic means to discover new knowledge.”
A branch of data mining that targets discovering and extracting knowledge from text documents
Extracting scientific evidence from biomedical literature titles (Swanson & Smalheiser, 1997)

“stress is associated with migraines”
“stress can lead to loss of magnesium”
“calcium channel blockers prevent some migraines”
“magnesium is a natural calcium channel blocker”

Combining the extracted sentence fragments with human medical expertise derives a new hypothesis not found in the literature:

Magnesium deficiency may play a role in some kinds of migraine headache

Text mining: a research example
(Chart: structured numerical or coded information, about 20%; unstructured or semi-structured information, about 80%.)
Motivation for text mining
Approximately 80% of the world’s data is held in unstructured formats (source: Oracle Corporation)
Information intensive business processes demand that we transcend from simple document retrieval to “knowledge” discovery.
Disciplines that influence text mining
Computational linguistics (NLP)
Information extraction
Information retrieval
Web mining
Regular data mining
Text Mining = Data Mining (applied to text data) + Language Engineering
Computational linguistics
Goal: automated language understanding
full automated understanding isn’t currently possible
instead, go for subgoals of text analysis, e.g.,
word sense disambiguation
phrase recognition
semantic associations
Common current approach (trend): statistical analyses over very large text collections
Consider a word like "string" or "rope." No computer today has any way to understand what those things mean. For example, you can pull something with a string, but you cannot push anything. You can tie a package with string, or fly a kite, but you cannot eat a string or make it into a balloon. In a few minutes, any young child could tell you a hundred ways to use a string − or not to use a string − but no computer knows any of this.
Computational linguistics: from text to meaning
Levels of analysis: Lexical / Morphological Analysis, Syntactic Analysis, Semantic Analysis, Discourse Analysis
Tasks: Tagging, Chunking, Word Sense Disambiguation, Grammatical Relation Finding, Named Entity Recognition, Reference Resolution

Shallow parsing of “The woman will give Mary a book”:
POS tagging: The/Det woman/NN will/MD give/VB Mary/NNP a/Det book/NN
Chunking: [The/Det woman/NN]NP [will/MD give/VB]VP [Mary/NNP]NP [a/Det book/NN]NP
Relation finding: [The woman]subject [will give] [Mary]i-object [a book]object
1990s–2000s: Statistical learning: algorithms, evaluation, corpora; trainable FSMs, trainable parsers
1980s: Standard resources and tasks: Penn Treebank, WordNet, MUC
1970s: Kernel (vector) spaces: clustering, information retrieval (IR)
1960s: Representation transformation: finite state machines (FSM) and augmented transition networks (ATNs)
1960s: Representation beyond the word level: lexical features, tree structures, networks
Archeology of computational linguistics
Information retrieval

Given:
A source of textual documents
A user query (text based), e.g., “migraines causes”

Find:
A (ranked) set of documents that are relevant to the query

(Diagram: the documents source and the query feed an IR system, which returns ranked documents.)
Evaluation measures
precision = #(retrieved ∩ relevant) / #retrieved
recall = #(retrieved ∩ relevant) / #relevant
F_β = (β² + 1) · P · R / (β² · P + R), where P = precision and R = recall
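The three measures encode directly; a sketch in which the document IDs are hypothetical:

```python
def precision_recall_f(retrieved, relevant, beta=1.0):
    """precision = |retrieved ∩ relevant| / |retrieved|,
    recall = |retrieved ∩ relevant| / |relevant|,
    F = (beta^2 + 1) * P * R / (beta^2 * P + R)."""
    hits = len(retrieved & relevant)
    p = hits / len(retrieved)
    r = hits / len(relevant)
    f = (beta ** 2 + 1) * p * r / (beta ** 2 * p + r)
    return p, r, f

# Hypothetical document IDs: 4 retrieved, 6 relevant, 3 in common.
p, r, f = precision_recall_f({1, 2, 3, 4}, {2, 3, 4, 5, 6, 7})
```

With β = 1 this reduces to the harmonic mean of precision and recall.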
foodscience.com-Job2
JobTitle: Ice Cream Guru
Employer: foodscience.com
JobCategory: Travel/Hospitality
JobFunction: Food Services
JobLocation: Upper Midwest
Contact Phone: 800-488-2611
DateExtracted: January 8, 2004
Source: www.foodscience.com/jobs_midwest.html
OtherCompanyJobs: foodscience.com-Job1
Information extraction: the process of extracting text segments from semi-structured or free text to fill data slots in a predefined template
Information extraction vs. information retrieval: finding “things”, not “pages”
What is information extraction?
Given:
A source of textual documents
A well-defined, limited query (text based)
Find:
Sentences with relevant information (i.e., identify specific semantic elements such as entities, properties, relations)
Extract the relevant information and ignore non-relevant information (important!)
Link related information and output in a predetermined format
Example: template extraction
Salvadoran President-elect Alfredo Cristiani condemned the terrorist killing of Attorney General Roberto Garcia Alvarado and accused the Farabundo Marti National Liberation Front (FMLN) of the crime …. Garcia Alvarado, 56, was killed when a bomb placed by urban guerillas on his vehicle exploded as it came to a halt at an intersection in downtown San Salvador … According to the police and Garcia Alvarado’s driver, who escaped unscathed, the attorney general was traveling with two bodyguards. One of them was injured.
Incident: Date                19 Apr 89
Incident: Location            El Salvador: San Salvador
Incident: Type                Bombing
Perpetrator: Individual ID    “urban guerillas”
Perpetrator: Organization ID  “FMLN”
Human target: Name            “Roberto Garcia Alvarado”
Zipf’s “law” and its consequence
One of curiosities in text analysis and mining (1935)
The product of the frequency of words (f) and their rank (r) is approximately constant
Rank = order of words by frequency of occurrence
f = C · (1/r), with C ≈ N/10
Always a few very frequent tokens that are not good discriminators.
Called “stop words” in Information Retrieval
Usually correspond to linguistic notion of “closed-class” words
English examples: to, from, on, and, the, ...
Grammatical classes that don’t take on new members.
Typically:
A few very common words
A middling number of medium frequency words
A large number of very infrequent words
Medium frequency words most descriptive
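A tiny numeric illustration of f = C/r with the slide's rule of thumb C ≈ N/10 (the corpus size N here is hypothetical):

```python
# Zipf's law: the frequency f of the word of rank r is f = C / r, C ≈ N/10.
N = 100_000                 # hypothetical corpus size (number of tokens)
C = N / 10
frequencies = [C / rank for rank in range(1, 11)]   # top-10 word frequencies
products = [f * rank for rank, f in zip(range(1, 11), frequencies)]
# By construction f * r stays constant at C for every rank.
```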
Text mining process
Text preprocessing: syntactic/semantic text analysis
Feature generation: bag of words
Feature selection: simple counting; statistics such as term frequency, document frequency, term proximity, document length, etc.
Text/data mining: classification, clustering, summarization, etc.
Analyzing results
Typical issues and techniques
Text Categorization (text classification)
Text Clustering
Text Summarization
Trend Detection
Relationship Analysis
Information Extraction
Question-Answering
Text Visualization
etc.
Text categorization (classification)
Task: Assignment of one or more labels from a pre-defined set to a document
Example category set
MeSH medical hierarchy
JAIST’s library, Library of Congress subject headings
Idea: Content vs. External Meta-Data
Techniques: Supervised classification
Decision trees
Naïve Bayesian classification
Support Vector Machines
etc.
Text clustering
Task: Detecting topics within a document collection, assigning documents to those topics, and labeling these topic clusters
Scatter/Gather clustering:
Cluster sets of documents into general “themes”, like a table of contents
Display the contents of the clusters by showing typical terms and typical titles
User chooses subsets of the clusters and re-clusters the documents within
Resulting new groups have different “themes”
Techniques: Different clustering techniques (similarity between texts, texts and word densities, etc.)
Text summarization
A text is entered into the computer and a summarized text is returned: a non-redundant extract from the original text.
A process of text summarization:
Sentence extraction: find a set of important sentences that covers the gist of the text document
Sentence reduction: convert a long sentence to a short one without losing the meaning
Sentence combination: combine sentences to make a text.
Emerging trend detection
Task: Detecting topic areas that are growing in interest and utility over time (emerging trends)

Example: number of documents per year matching the query:
Year   Number of documents
1994   3
1995   1
1996   8
1997   10
1998   170
1999   371
COE project: Can we detect emerging trends in materials science, information science, and biology?
INSPEC®[INS] database search on keyword “XML”
KDD’03 challenges: from arXiv (since 1991), 500,000 articles on High Energy Particle Physics; predict, say, the number of citations of a given article in a given period, etc.
“Find sales trends by product and correlate with occurrences of company name in business news articles”
Question Answering
Task: Give an answer to a question
(document retrieval: find documents relevant to query)
Example:
Who invented the telephone?Alexander Graham Bell
When was the telephone invented?1876
(Buchholz & Daelemans, 2001)
Imagine how to automatically answer such a question?
Text visualization
Network Maps
Landscapeshttp://www.lexiquest.com
http://www.aurigin.com
Web mining
Finding unknown useful information from the World Wide Web.
Data mining turns data into knowledge.
Web mining is to apply data mining techniques to extract and uncover knowledge from web documents and services.
Data Mining and Web Mining
Web:
A huge, widely-distributed
Highly heterogeneous
Semi-structured
Hypertext/hypermedia
Interconnected information repository
Web is a huge collection of documents plus
Hyper-link information
Access and usage information
WWW specifics
Web user tasks
Finding relevant information:
The user usually issues a simple keyword query and receives a list of ranked pages
Current problems: low precision (irrelevance of search results) and low recall (inability to index all the information on the Web)
Creating new knowledge over the existing data:
We want to extract knowledge from Web data (assuming we have it)
Personalizing the information:
People differ in the contents and presentations they prefer while interacting with the Web
Learning about consumers and individual users:
Knowing what customers do and want
Web Content Mining
Discovery of information from Web contents (various types of data such as textual, image, audio, video, hyperlinks, etc.)
Web Structure Mining
Discovery of the model underlying the link structures of the Web. The model is based on the topology of the hyperlinks, with or without descriptions of the links.
Web Usage Mining
Discovery of information from web users’ sessions and behaviors (secondary data derived from the interactions of the users while interacting with the Web).
Web mining taxonomy
Web Mining
  Web Content Mining
    Web Page Content Mining
    Search Result Mining
  Web Structure Mining
  Web Usage Mining
    General Access Pattern Tracking
    Customized Usage Tracking

Web mining taxonomy
Web Page Content Mining (web page summarization):
WebLog (Lakshmanan et al. 1996), WebSQL (Mendelzon et al. 1998), …: Web structuring query languages; can identify information within given web pages
Ahoy! (Etzioni et al. 1997): uses heuristics to distinguish personal home pages from other web pages
ShopBot (Etzioni et al. 1997): looks for product prices within web pages
Search Result Mining (search engine result summarization):
Clustering Search Results (Leouski and Croft, 1996; Zamir and Etzioni, 1997): categorizes documents using phrases in titles and snippets
Web Structure Mining:
Using links: PageRank (Brin and Page, 1996), HITS (Kleinberg, 1996), CLEVER (Chakrabarti et al., 1998) use interconnections between web pages to give weight to pages
Web communities: communities crawling (Kumar et al., 1999), etc.
Using generalization: MLDB (1994), VWV (1998) use a multi-level database representation of the Web; counters (popularity) and link lists are used for capturing structure
General Access Pattern Tracking:
Web Log Mining (Zaïane, Xin and Han, 1998): uses KDD techniques to understand general access patterns and trends
Customized Usage Tracking:
Adaptive Sites (Perkowitz and Etzioni, 1997): analyzes access patterns of one user at a time; the web site restructures itself automatically by learning from user access patterns
1. Resource finding: The task of retrieving intended Web documents
2. Information selection and pre-processing: Automatically selecting and preprocessing specific information from retrieved Web resources
3. Generalization: automatically discovering general patterns at individual Web sites as well as across multiple sites
4. Analysis: validation and/or interpretation of the mined patterns
Web mining process
Tree map
Cone tree
Fisheye view
Hyperbolic tree
MagicLens
Data and knowledge visualization
SPSS
IBM
Silicon Graphics
SAS
Salford Systems
RuleQuest Research (C4.5)
KDD products and tools
Outline
Why knowledge discovery and data mining?
Basic concepts of KDD
KDD techniques: classification, association, clustering, text and Web mining
Challenges and trends in KDD
Case study in medicine data mining
Different types of data in different forms (mixed numeric, symbolic, text, image, voice, …) [Problems: quality, effectiveness?]
Large data sets (10^6-10^12 bytes) and high dimensionality (10^2-10^3 attributes) [Problems: efficiency, scalability?]
Data and knowledge are changing
Human-Computer Interaction and Visualization
Challenges of KDD
3 attributes, each with 2 values: #instances = 2^3 = 8, #patterns = 3^3 = 27
What if #attributes increases?
The sizes of the instance space and pattern space increase exponentially.
With p attributes, each having d values, the size of the instance space is d^p.
38 attributes, each with 10 values: #instances = 10^38
Large datasets and high dimensionality
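The counts above can be checked with a few lines; the only assumption made here is that a "pattern" lets each attribute take one of its d values or a don't-care, which reproduces the 27 patterns for 3 binary attributes.

```python
# Size of the instance space and pattern space for p attributes with d values each.
# Assumption: a "pattern" allows each attribute to take one of its d values or a
# wildcard ("don't care"), giving (d + 1) ** p patterns.

def instance_space_size(p: int, d: int) -> int:
    return d ** p

def pattern_space_size(p: int, d: int) -> int:
    return (d + 1) ** p

print(instance_space_size(3, 2))    # 8 instances for 3 binary attributes
print(pattern_space_size(3, 2))     # 27 patterns
print(instance_space_size(38, 10))  # 10**38 instances
```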
Scalable and efficient algorithms (scalable: given an amount of main memory, its runtime increases linearly with the number of input instances)
Sampling (instance selection)
Dimensionality reduction (feature selection)
Approximation methods
Massively parallel processing
Integration of machine learning and database management
Possible solutions
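As a sketch of the sampling idea listed above (not a method from the lecture), reservoir sampling draws a uniform fixed-size sample from a dataset in a single pass with constant memory:

```python
import random

def reservoir_sample(stream, k, seed=0):
    """Uniformly sample k instances from a data stream in one pass, O(k) memory."""
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)
        else:
            j = rng.randint(0, i)  # new item survives with probability k/(i+1)
            if j < k:
                sample[j] = item
    return sample

print(reservoir_sample(range(10 ** 6), 5, seed=42))
```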
Attribute types (symbolic vs. numerical):
Nominal (categorical): no structure; operations =, ≠; examples: Places, Color
Ordinal: ordinal structure; operations =, ≠, ≥; examples: Rank, Resemblance
Measurable: ring structure; operations =, ≠, ≥, +, ×; examples: Age, Temperature, Taste, Income, Length
Symbolic data: combinatorial search in hypothesis spaces (machine learning)
Numerical data: often matrix-based computation (multivariate data analysis)
Numerical vs. symbolic data
Attribute selection
Pruning trees
From trees to rules (high cost of pruning)
Visualization
Data access: recent development on very large training sets, fast, efficient and scalable (in-memory and secondary storage)
(well-known systems: C4.5 and CART)
Mining with decision trees
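Attribute selection in decision tree learners such as C4.5 is typically driven by information gain; here is a minimal sketch (the toy data and attribute names are illustrative, not from the lecture):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr_index):
    """Entropy reduction from splitting (rows, labels) on one attribute."""
    n = len(rows)
    by_value = {}
    for row, label in zip(rows, labels):
        by_value.setdefault(row[attr_index], []).append(label)
    remainder = sum(len(s) / n * entropy(s) for s in by_value.values())
    return entropy(labels) - remainder

# Toy data: (outlook, windy) -> class
rows = [("sunny", "no"), ("sunny", "yes"), ("rain", "no"), ("rain", "yes")]
labels = ["yes", "yes", "no", "no"]
print(information_gain(rows, labels, 0), information_gain(rows, labels, 1))  # 1.0 0.0
```

Outlook perfectly separates the classes (gain 1.0 bit), while windy tells us nothing (gain 0.0), so a tree builder would split on outlook first.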
SLIQ (Mehta et al., 1996): builds an index for each attribute; only the class list and the current attribute list reside in memory
SPRINT (J. Shafer et al., 1996): constructs an attribute-list data structure
PUBLIC (Rastogi & Shim, 1998): integrates tree splitting and tree pruning, stopping tree growth earlier
RainForest (Gehrke, Ramakrishnan & Ganti, 1998): separates the scalability aspects from the criteria that determine the quality of the tree; builds an AVC-list (attribute, value, class label)
Scalable decision tree induction methods
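The AVC-list idea of RainForest can be sketched as follows: one scan over the data collects, per attribute, the (value, class) counts, which is all a split criterion needs. This is a simplified illustration under that assumption, not the original implementation:

```python
from collections import defaultdict

def build_avc_sets(rows, labels, n_attrs):
    """One data scan builds, per attribute, the (value -> class -> count) table
    (an AVC-set); split criteria such as gini or information gain need only
    these counts, never the rows themselves."""
    avc = [defaultdict(lambda: defaultdict(int)) for _ in range(n_attrs)]
    for row, label in zip(rows, labels):
        for a in range(n_attrs):
            avc[a][row[a]][label] += 1
    return avc

# Toy data: (outlook, windy) -> class
rows = [("sunny", "no"), ("sunny", "yes"), ("rain", "no"), ("rain", "yes")]
labels = ["play", "play", "stay", "stay"]
avc = build_avc_sets(rows, labels, 2)
print(dict(avc[0]["sunny"]))  # {'play': 2}
```

The AVC-sets are tiny compared to the data (one entry per distinct attribute value and class), which is what makes the approach scale.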
Effectively address the weakness of the symbolic AI approach in knowledge discovery (growth of the hypothesis space)
Extracting or making sense of the numeric weights associated with the interconnections of neurons, to arrive at a higher level of knowledge, has been and will continue to be a challenging problem
Mining with neural networks
Mining with association rules
Improving the efficiency
Database scan reduction: partitioning (Savasere 95), hashing (Park 95), sampling (Toivonen 96), dynamic itemset counting (Brin 97), finding non-redundant rules (3,000 times fewer rules; Zaki, KDD 2000)
Parallel mining of association rules
New measures of association
Interestingness and exceptional rules
Generalized and multiple-level rules
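To make support and confidence concrete, here is a brute-force frequent-itemset counter; real Apriori-style miners prune candidates level by level, and the transactions below are made up for illustration:

```python
from itertools import combinations
from collections import Counter

def frequent_itemsets(transactions, min_support, max_size=3):
    """Naive frequent-itemset counting: support(X) = fraction of transactions
    containing X. (Apriori would prune candidates level-wise; this brute-force
    version only illustrates the definitions.)"""
    n = len(transactions)
    counts = Counter()
    for t in transactions:
        for size in range(1, max_size + 1):
            for itemset in combinations(sorted(t), size):
                counts[itemset] += 1
    return {s: c / n for s, c in counts.items() if c / n >= min_support}

transactions = [{"bread", "milk"}, {"bread", "beer"},
                {"bread", "milk", "beer"}, {"milk"}]
freq = frequent_itemsets(transactions, min_support=0.5)
print(freq[("bread",)], freq[("bread", "milk")])  # 0.75 0.5
# Confidence of the rule bread -> milk = supp(bread, milk) / supp(bread):
print(freq[("bread", "milk")] / freq[("bread",)])
```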
Mining scientific data
Data Mining in Bioinformatics
Data Mining in Astronomy and Earth Sciences
Mining Physics and Chemistry data
Mining Large Image Databases
etc.
Scalable and efficient algorithms (scalable: given an amount of main memory, runtime increases linearly with the number of input instances)
Massively parallel processing: data-parallel vs. control-parallel data mining
Client/server frameworks for parallel data mining
(See: Mining Very Large Databases with Parallel Processing, Alex A. Freitas & Simon H. Lavington, Kluwer Academic Publishers, 1998)
Solutions to mining huge datasets
Mixed Similarity Measures (MSM):
Goodall (1966): time O(n^3); Diday and Gowda (1992); Ichino and Yaguchi (1994)
Li & Biswas (1997): time O(n^2 log n^2), space O(n^2)
New and efficient MSM (Binh & Bao, 2000): time and space O(n), using the relation P̂*_ij = 1 − P̂_ij
Example of a scalable algorithm
US Census database: 33 symbolic + 8 numeric attributes; Alpha 21264, 500 MHz, 2 GB RAM, Solaris OS (Nguyen N.B. & Ho T.B., PKDD 2000)

#cases (size)            500 (0.2M)  1,000 (0.5M)  1,500 (0.9M)  2,000 (1.1M)  5,000 (2.6M)  10,000 (5.2M)  199,523 (102M)
#values                  497         992           1,486         1,973         4,858         9,651          97,799
Time, LiBis O(n^2logn^2) 67.3s       26m6.2s       1h46m31s      6h59m45s      >60h          not app        not app
Time, ours O(n)          0.1s        0.2s          0.3s          0.5s          2.8s          9.2s           36m26s
Memory, LiBis O(n^2)     5.3M        20.0M         44.0M         77.0M         455.0M        not app        not app
Memory, ours O(n)        0.5M        0.7M          0.9M          1.1M          2.1M          3.4M           64.0M
Preprocessing            0.1s        0.1s          0.2s          0.5s          0.9s          6.2s           127.2s
Comparative results
High performance computing for NLP
Experimental environments
Massively parallel computer (Cray XT3): 90 nodes, each with four 2.4 GHz processors and 32 GB RAM (total: 360 processors, 2.88 TB RAM)
Linux OS, using C/C++ and MPI library
Experiments for POS tagging and chunking
24 sections of WSJ (Penn TreeBank): about 1,000,000 words (more than 40,000 English sentences)
Application: part-of-speech tagging. Highest F1 score: 96.96% (using only first-order Markov CRFs)
Number of parallel processes | Training time (minutes) | Speed-up ratio
1                            | 6125                    | 1.00
2                            | 3278                    | 1.87
10                           | 678                     | 9.03
30                           | 224                     | 27.34
50                           | 140                     | 43.75
70                           | 102                     | 60.05
90                           | 80                      | 76.56
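The speed-up ratios in the table are simply the single-process training time divided by the p-process time, which can be verified directly:

```python
# Speed-up ratio = training time with 1 process / training time with p processes.
times = {1: 6125, 2: 3278, 10: 678, 30: 224, 50: 140, 70: 102, 90: 80}  # minutes

for p, t in times.items():
    print(p, round(times[1] / t, 2))  # e.g. 90 processes -> 76.56
```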
Outline
Why knowledge discovery and data mining?
Basic concepts of KDD
KDD techniques: classification, association, clustering, text and Web mining
Challenges and trends in KDD
Case study in medicine data mining
Background
HBV and HCV are viruses that both cause continuous inflammation of the liver (chronic hepatitis).
The inflammation results in liver fibrosis and finally, after 20 to 30 years, liver cirrhosis (LC), which is diagnosed by liver biopsy.
In addition, cirrhosis patients have a high potential risk of hepatocellular carcinoma (HCC).
Physicians can treat viral hepatitis with interferon (IFN); however, IFN is not always effective and can have severe side effects.
[Figure: the natural course of hepatitis: the fibrosis stage progresses F0, F1, F2, F3, F4 over time, 20-30 years from the onset of infection to LC and HCC]
The natural course of hepatitis
The course of HCC?
[Figure: IFN applied at some fibrosis stage (F0-F4) alters the course over time from the onset of infection toward LC and HCC]
The effect of interferon therapy
Effectiveness of interferon?
Source: First Department of Internal Medicine, Chiba University School of Medicine
Data on about 800 patients collected over 20 years
Characteristics of the data:
Large-scale, uncleaned time-series data
Very many examination items
Values and precision of each examination item differ with the examination period; many missing values
Bias introduced by physicians
Time-series data
Example of the hepatitis dataset
Sequences of length 179 for MID1 and 88 for MID2, and they are irregular
P1. Differences in temporal patterns between hepatitis B and C? (HBV, HCV)
P2. Evaluate whether laboratory examinations can be used to estimate the stage of liver fibrosis (F0, F1, F2, F3, F4)
P3. Evaluate whether the interferon therapy is effective or not (Response, Partial response, Aggravation, No response)
Problems under consideration
To perform any kind of medical problem solving, patient data have to be “matched” against medical knowledge.
Patient data mostly comprise numeric measurements of various parameters at different points in time.
Medical knowledge is usually expressed in the form of symbolic statements that are as general as possible.
Why data abstraction?
What is temporal abstraction?
ZTT was first increasingly high, then changed to the normal region and remained stable
ZTT
normal region
Idea: convert time-stamped points to a symbolic, interval-based representation of the data. Characteristic: no detail, but the essence of the trend and state changes of patients.
ZTT: H>N−S
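A minimal sketch of this kind of abstraction, assuming a single normal range and collapsing consecutive identical states into a change sequence (the actual method in the study is richer than this):

```python
def abstract_states(values, normal_low, normal_high):
    """Map time-stamped values to symbolic states L (low), N (normal), H (high),
    then collapse runs into a change sequence, e.g. H,H,H,N,N -> "H>N"."""
    def state(v):
        if v < normal_low:
            return "L"
        return "N" if v <= normal_high else "H"

    states = [state(v) for v in values]
    collapsed = [states[0]]
    for s in states[1:]:
        if s != collapsed[-1]:
            collapsed.append(s)
    return ">".join(collapsed)

# A ZTT-like sequence: high at first, then back inside the normal region:
print(abstract_states([30, 28, 25, 10, 9, 11], normal_low=4, normal_high=12))  # H>N
```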
1. Develop temporal abstraction methods for describing hepatitis data appropriately for each problem.
2. Use data mining methods to solve the problems from the abstracted data.
Research objectives
Key issues in temporal abstraction
1. Define a description language for abstraction patterns (requirement: simple but rich enough to describe abstract patterns)
2. Determine the basic abstraction patterns (requirement: typical and significant primitives needed for the analysis purpose)
3. Transform each sequence into temporal abstraction patterns (requirement: efficiently characterize the trends and changes in the temporal data)
Example: ZTT: H>N−S
[Figure: example laboratory test sequences plotted from 2/19/1981 to 2/19/2001, values 0-600]
Typical tests by physicians
Short-term changed tests: concerning inflammation, changed quickly in days or weeks; can be much higher (even 40 times) than the normal range, with many peaks. Up: GPT, GOT, TTT, ZTT.
Long-term changed tests: concerning liver status, changed slowly in months or years; do not much exceed the normal range. Down: T-CHO, CHE, ALB, TP (liver products), PLT, WBC, HGB. Up: T-BIL, D-BIL, I-BIL, AMONIA, ICG-15.
Observation of temporal sequences
Make a tool in Matlab to visualize temporal sequences
Observe a large number of temporal sequences for different patients and tests
[Figure: example laboratory test sequences plotted from 2/19/1981 to 2/19/2001, values 0-600]
Ideas of basic patterns
Short-term changed tests (Up: GPT, GOT, TTT, ZTT). Idea: base state and peaks.
Long-term changed tests (Down: T-CHO, CHE, ALB, TP (liver products), PLT, WBC, HGB; Up: T-BIL, D-BIL, I-BIL, AMONIA, ICG-15). Idea: change of states (compactly capture both the state and the trend of the sequence).
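The "base state and peaks" idea for short-term tests can be sketched as follows, taking the median as the base state and points above a multiple of the upper normal bound as peaks; the threshold choice here is an assumption for illustration, not the study's definition:

```python
from statistics import median

def base_state_and_peaks(values, normal_high, peak_factor=2.0):
    """Summarize a short-term test sequence by its base (median) value and the
    indices of its peaks, taken here as points above peak_factor * normal_high."""
    base = median(values)
    peaks = [i for i, v in enumerate(values) if v > peak_factor * normal_high]
    return base, peaks

values = [30, 35, 400, 32, 28, 250, 31]
base, peaks = base_state_and_peaks(values, normal_high=40)
print(base, peaks)  # 32 [2, 5]
```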
Temporal abstraction approach
8 main patterns for short-term changed test items; 21 patterns for long-term changed test items
Using a discovery algorithm, the change characteristics of each test item are classified into "long-term changed" and "short-term changed" items
Two temporal abstraction methods
Abstraction pattern extraction (APE)
Mapping each given temporal sequence of fixed length into one of pre-defined temporal patterns
(2001~)
Temporal relation extraction (TRE)
Detect temporal relations between basic patterns, and extract rules using temporal relations
(2004~)
Data and knowledge visualization
Simultaneously view the data in different forms: top-left is original data; top-right is histogram of attributes; lower-left is view by parallel coordinates; and lower-right is relations between a conjunction of attribute-value pairs and the class labels.
View an individual rule in D2MS: top-left window shows the list of discovered rules, the middle-left and the top-right windows show a rule under inspection, and bottom window displays the instances covered by that rule.
LC vs. non-LC
Effectiveness of interferon (LUPC rules)
GOT & GPT occurred as VH or EH in no_response rules
CHE occurred as N/L or L-D in partial_response rules
D-BIL occurred as N/H, H>N in no_response and partial_response rules
Temporal relations
A is equal to B / B is equal to A
A is before B / B is after A
A meets B / B is met by A
A overlaps B / B is overlapped by A
A starts B / B is started by A
A finishes B / B is finished by A
A is during B / B contains A
(Allen’s Temporal Logic, 1984)
Relations between two basic patterns, each happening in a period of time; for example, "ALB went down right after heavy inflammation finished".
Updating a graph of temporal relations (Allen, 1983) and temporal logic (Allen, 1984)
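Allen's relations between two closed intervals can be computed by comparing endpoints; here is a sketch covering all thirteen relations (the seven above plus their inverses):

```python
def allen_relation(a, b):
    """Classify the Allen relation between closed intervals a = (a1, a2) and
    b = (b1, b2), from a's point of view; the inverse holds from b's side."""
    (a1, a2), (b1, b2) = a, b
    if (a1, a2) == (b1, b2):
        return "equal"
    if a2 < b1:
        return "before"
    if b2 < a1:
        return "after"
    if a2 == b1:
        return "meets"
    if b2 == a1:
        return "met-by"
    if a1 == b1:
        return "starts" if a2 < b2 else "started-by"
    if a2 == b2:
        return "finishes" if a1 > b1 else "finished-by"
    if b1 < a1 and a2 < b2:
        return "during"
    if a1 < b1 and b2 < a2:
        return "contains"
    return "overlaps" if a1 < b1 else "overlapped-by"

print(allen_relation((1, 3), (3, 6)))  # meets
print(allen_relation((1, 4), (2, 6)))  # overlaps
print(allen_relation((2, 3), (1, 6)))  # during
```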
Findings for HBV and HCV
Findings are different from general medical observations(no clear distinction between type B and C)
R#13 (HBV): “ALP changed from low to normal state” AFTER “LDH changed from low to normal state” (supp. count = 21, conf. = 0.71)
R#5 (HCV): “ALP changed from normal to high state” AFTER “LDH changed from normal to low state” (supp. count = 60, conf. = 0.80)
“Quantitatively” confirm findings in medicine (Medline abstracts)
R#53 (HCV): “ALB changed from normal to low state” BEFORE “TTT in high state with peaks” AND “ALP from normal to high state” BEFORE “TTT in high state with peaks” (supp. count = 10, conf. = 1.00)
Murawaki et al. (2001): the main difference between HBV and HCV is that the base state of TTT in HBV is normal, while that of HCV is high.
Findings for LC and non-LC
Typical relations in non-LC rules“GOT in high or very high states with peaks”BEFORE “TTT in high state with peaks” (20 rules contain this temporal relation)
Typical relations in LC rules“GOT in high or very high states with peaks”AFTER “TTT in high or very high states with peaks” (10 rules contain this temporal relation).
(Project period: 2004-2007)
A combined approach to hepatitis patient history data:
Data mining
Text mining from Medline
Evaluation and suggestions by domain experts
Medical data mining from multiple information sources
Conclusion
Temporal abstraction has been shown to be a good alternative approach to the hepatitis study.
The results are comprehensible to physicians, and significant rules were found.
Much work remains to be done in solving the three problems, including:
Integrating qualitative and quantitative information
Combining with text mining techniques
Incorporating domain expert knowledge
Summary
KDD is motivated by the wide availability of huge amounts of data and the imminent need for turning such data into useful information and knowledge.
There are different methods to find predictive or descriptive models that strongly depend on the data schemes.
The KDD process is necessarily interactive and iterative, and requires human participation in all steps of the process.
http://www.kdnuggets.com
David J. Hand, Heikki Mannila and Padhraic Smyth, Principles of Data Mining, MIT Press, 2001
Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2000
U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, R. Uthurusamy, editors, Advances in Knowledge Discovery and Data Mining, AAAI/MIT Press, 1996.
Recommended references
Profs. S. Ohsuga, M. Kimura, H. Motoda, H. Shimodaira, Y. Nakamori, S. Horiguchi, T. Mitani, T. Tsuji, K. Satou, among others
Nguyen N.B., Nguyen T.D., Kawasaki S., Huynh V.N., Dam H.C., Tran T.N., Pham T.H., Nguyen D.D., Nguyen L.M., Phan X.H., Le M.H., Hassine B.A., Le S.Q., Zhang H., Nguyen C.H., Nagai K., Nguyen C.H., Nguyen T.P., Tran D.H.
Acknowledgments