19 November 2005
TuBao Ho (HieuChi Dam)School of Knowledge ScienceJapan Advanced Institute of Science and Technology
Introduction to Knowledge Discovery and Data Mining
The lecture aims to …
Provide basic concepts and techniques of knowledge discovery and data mining (KDD).
Emphasize the different kinds of data, the different tasks performed on the data, and the different methods for performing those tasks.
Emphasize the KDD process and the important issues that arise when applying data mining methods.
Outline
1. Why knowledge discovery and data mining?
2. Basic concepts of KDD
3. KDD techniques: classification, association, clustering, text and Web mining
4. Challenges and trends in KDD
5. Case study in medicine data mining
Much more data around us than before
We are living in exciting times: computers and computer networks surround us.
Astronomical data
Astronomy is facing a major data avalanche: multi-terabyte sky surveys and archives (soon multi-petabyte), billions of detected sources, hundreds of measured attributes per source …
Multi-wavelength data paint a more complete(and a more complex!) picture of the universe
Infrared emission from interstellar dust
Smoothed galaxy density map
Astronomical data
Earthquake data
[Maps: earthquakes 1932-1996; the 04/25/92 Cape Mendocino, CA event; Japanese earthquakes 1961-1994]
9 August 2004 (AP): Swedish geologists may have found a way to predict earthquakes weeks before they happen (current accurate warnings only come seconds before a quake).
Water samples taken 4,900 feet beneath the ground in northern Iceland show the content of several metals increased dramatically a few weeks before a magnitude 5.8 earthquake struck.
"We need a database over other earthquakes." The bedrock at the test site is basalt, which is also found in other earthquake-prone areas like Hawaii and Japan.
Predict earthquakes
Finance: the market data
Data on price fluctuation throughout the day in the market
Ishikawa’s monthly industrial data
[Bar chart: monthly sales of large retail stores (大型小売店売上高), months 1-7, vertical axis 0 to 250,000.]
10,267,507,282 bases in 9,092,760 records.
Explosion of biological data
The full DNA sequence consists of 1.6 million characters; the portion shown below is about 350 characters (roughly 4,570 times smaller):
What does biological data look like?
…TACATTAGTTATTACATTGAGAAACTTTATAATTAAAAAAGATTCATGTAAATTTCTTATTTGTTTATTTAGAGGTTTTAAATTTAATTTCTAAGGGTTTGCTGGTTTCATTGTTAGAATATTTAACTTAATCAAATTATTTGAATTTTTGAAAATTAGGATTAATTAGGTAAGTAAATAAAATTTCTCTAACAAATAAGTTAAATTTTTAAATTTAAGGAGATAAAAATACTACTCTGTTTTATTATGGAAAGAAAGATTTAAATACTAAAGGGTTTATATATATGAAGTAGTTACCCTTAGAAAAATATGGTATAGAAAGCTTAAATATTAAGAGTGATGAAGTATATTATGT…
Approximately 80% of the world’s data is held in unstructured formats (source: Oracle Corporation)
Web sites, digital libraries, … increase the volume of textual data
Example: the JAIST library
Number of journals accessible online: 4,700 (280,000 articles/year)
Reading even 1% of them (2,820 articles) would require 8 articles per day.
Keeping up with the literature: 1960s: easy; 1980s: time-consuming; 2000s: difficult; soon: impossible.
Text: huge sources of knowledge
MEDLINE: a medical text database
36003: Biomed Pharmacother. 1999 Jun;53(5-6):255-63. Pathogenesis of autoimmune hepatitis.Institute of Liver Studies, King's College Hospital, London, United Kingdom.
Autoimmune hepatitis (AIH) is an idiopathic disorder affecting the hepatic parenchyma. There are no morphological features that are pathognomonic of the condition but the characteristic histological picture is that of an interface hepatitis without other changes that are more typical of other liver diseases. It is associated with hypergammaglobulinaemia, high titres of a wide range of circulating auto-antibodies, often a family history of other disorders that are thought to have an autoimmune basis, and a striking response to immunosuppressive therapy. The pathogenetic mechanisms are not yet fully understood but there is now considerable circumstantial evidence suggesting that: (a) there is an underlying genetic predisposition to the disease; (b) this may relate to several defects in immunological control of autoreactivity, with consequent loss of self-tolerance to liver auto-antigens; (c) it is likely that an initiating factor, such as a hepatotropic viral infection or an idiosyncratic reaction to a drug or other hepatotoxin, is required to induce the disease in susceptible individuals; and, (d) the final effector mechanism of tissue damage probably involves auto-antibodies reacting with liver-specific antigens expressed on hepatocyte surfaces, rather than direct T-cell cytotoxicity against hepatocytes.
The world's most comprehensive source of life sciences and biomedical bibliographic information, with nearly eleven million records (http://medline.cos.com)
About 40,000 abstracts concern hepatitis (the subject of our research project); the abstract above is one of them.
looney.cs.umn.edu han - [09/Aug/1996:09:53:52 -0500] "GET mobasher/courses/cs5106/cs5106l1.html HTTP/1.0" 200
mega.cs.umn.edu njain - [09/Aug/1996:09:53:52 -0500] "GET / HTTP/1.0" 200 3291
mega.cs.umn.edu njain - [09/Aug/1996:09:53:53 -0500] "GET /images/backgnds/paper.gif HTTP/1.0" 200 3014
mega.cs.umn.edu njain - [09/Aug/1996:09:54:12 -0500] "GET /cgi-bin/Count.cgi?df=CS home.dat\&dd=C\&ft=1 HTTP
mega.cs.umn.edu njain - [09/Aug/1996:09:54:18 -0500] "GET advisor HTTP/1.0" 302
mega.cs.umn.edu njain - [09/Aug/1996:09:54:19 -0500] "GET advisor/ HTTP/1.0" 200 487
looney.cs.umn.edu han - [09/Aug/1996:09:54:28 -0500] "GET mobasher/courses/cs5106/cs5106l2.html HTTP/1.0" 200. . . . . . . . .
Typical data in a server access log
Web server access logs data
Web link data
Internet Map [lumeta.com]
Food Web [Martinez ’91]
Friendship Network [Moody ’01]
What do we want from the data?
Much more data, of many different kinds, is collected than ever before.
We want to exploit the data, extracting new and useful information/knowledge from it, such as:
- Which phenomena can be seen in the data when a disease occurs?
- What are the properties of several metals 4,900 feet beneath the ground?
- Is the Japan stock market rising this week?
- How have other researchers discussed the "interferon effect"?
- etc.
Want to draw valid conclusions from data.
Statistics provides principles and methodology for designing the processes of:
What does statistics usually do?
- Data collection
- Summarizing and interpreting the data
- Drawing conclusions or generalities
Evolution of data processing
Evolutionary steps, each with its business question and enabling technology:
- Data Collection (1960s): "What was my total revenue in the last five years?" Enabled by computers, tapes, and disks.
- Data Access (1980s): "What were unit sales in Korea last March?" Enabled by faster and cheaper computers with more storage, relational databases, structured query language (SQL), etc.
- Data Warehousing and Decision Support: "What were unit sales in Korea last March? Drill down to Hokkaido." Enabled by faster and cheaper computers with more storage, on-line analytical processing (OLAP), multidimensional databases, and data warehouses.
- Data Mining: "What's likely to happen to Hokkaido unit sales next month? Why?" Enabled by faster and cheaper computers with more storage, advanced algorithms, and massive databases.
Drill down: to move from summary information to the detailed data that created it. For example, adding totals from all the orders for a year creates gross sales for the year; drilling down would identify the types of products that were most popular.
KDD: Convergence of three technologies
Increasing computing power
Statistical and learning algorithms
Improved data collection and
management
KDD
Increasing computing power
[Photo: a 30 MB, 1.6-meter-tall disk drive from 1966.]
Lab PC cluster: 16 nodes, dual Intel Xeon 2.4 GHz CPUs, 512 KB cache.
JAIST's CRAY XT3: compute nodes with AMD Opteron 150 2.4 GHz CPUs (4 × 90), memory 32 GB × 90 = 2.88 TB, CPUs connected in a 3D torus with 7.68 GB/s bidirectional bandwidth between CPUs.
[Scale of stored information, from kilo to yotta: kilo: a book; mega: a photo; giga: a movie; tera: all books (as words); peta: all books (multimedia); exa/zetta/yotta: everything recorded. 20 TB contains the 20 million books in the Library of Congress.]
How much information is there?
Soon everything can be recorded and indexed
Most bytes will never be seen by humans
What will be key technologies to deal with huge volumes of information sources?
[This page adapted from the invited talk of Jim Gray (Microsoft) at KDD’2003]
customer (cust-ID, name, address, age, income, credit-info, …):
  C1 | Smith, Sandy | 5463 E Hasting, Burnaby, BC V5A 459, Canada | 21 | $27000 | 1 | …
item (item-ID, name, brand, category, type, price, place-made, supplier, cost):
  I3 | high-res-TV | Toshiba | high resolution | TV | $988.00 | Japan | NikoX | $600.00
  I8 | multidisc-CDplayer | Sanyo | multidisc | CD player | $369.00 | Japan | MusicFont | $120.00
employee (emp-ID, name, category, group, salary, commission):
  E35 | Jones, Jane | home entertainment | manager | $18,000 | 2%
branch (branch-ID, name, address):
  B1 | City square | 369 Cambie St., Vancouver, BC V5L 3A2, Canada
purchases (trans-ID, cust-ID, empl-ID, date, time, method-paid, amount):
  T100 | C1 | B55 | 01/21/98 | 15:45 | Visa | $1357.00
items-sold (trans-ID, item-ID, qty):
  T100 | I3 | 1
  T100 | I8 | 2
works-at (empl-ID, branch-ID):
  E55 | B1
Relational databases
A relational database is a collection of tables, each of which is assigned a unique name, and consists of a set of attributes * and a set of tuples**.
(*: data fields such as customer ID; **: data records, i.e., rows)
A data warehouse is a repository of information collected from multiple sources, stored under a unified schema, and usually residing at a single site.
[Diagram: data sources in Hokkaido, Kanazawa, Hongkong, and Busan are cleaned, transformed, integrated, and loaded into the data warehouse, which clients access through query and analysis tools.]
Data warehouses
A transactional database consists of a file where each record represents a transaction.
A transaction typically includes a unique transaction identity number (trans_ID) and a list of the items making up the transaction.
Transactional databases
Trans_ID | list of item_IDs
T100 | beer, cake, onigiri
T200 | beer, cake
T300 | beer, onigiri
T400 | beer, onigiri
T500 | cake
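As an illustrative sketch (not part of the lecture), such a transactional table maps naturally onto a dictionary of item sets in Python, and itemset support, the basic quantity behind association mining, becomes a simple count. The transaction IDs and items below are taken from the table above:

```python
# Transactional database: transaction ID -> set of items bought together.
transactions = {
    "T100": {"beer", "cake", "onigiri"},
    "T200": {"beer", "cake"},
    "T300": {"beer", "onigiri"},
    "T400": {"beer", "onigiri"},
    "T500": {"cake"},
}

def support(itemset, db=transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    items = set(itemset)
    hits = sum(1 for t in db.values() if items <= t)
    return hits / len(db)
```

For instance, support({"beer"}) is 4/5 = 0.8 and support({"beer", "onigiri"}) is 3/5 = 0.6.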
Object-Oriented Databases
Object-Relational Databases
Spatial Databases
Temporal Databases and Time-Series Databases
Text Databases and Multimedia Databases
Heterogeneous Databases and Legacy Databases
The World Wide Web
Advanced database systems
Spatial databases contain spatial-related information: geographic databases, VLSI chip design databases, medical and satellite image databases etc.
Data mining may uncover patterns describing the content of several metals at specific locations when earthquakes happen, the climate of mountainous areas at various altitudes, etc.
[Map: Japanese earthquakes, 1961-1994]
Spatial databases
These databases store time-related data. A temporal database stores relational data that include time-related attributes (timestamps with different semantics). A time-series database stores sequences of values that change with time (e.g., stock exchange data).
Data mining can find the characteristics of object evolution and trends of change for objects: e.g., stock exchange data can be mined to uncover trends in investment strategies.
Spatial-temporal databases
Temporal and time-series databases
Text databases contain documents, usually highly unstructured or semi-structured. Mining aims to uncover general descriptions of object classes, keywords, content associations, the clustering behavior of text objects, etc.
Multimedia databases store image, audio, and video data: picture content-based retrieval, voice-email systems, video-on-demand-systems, speech-based user interface, etc.
Text and multimedia databases
The Web provides an enormous source of explicit and implicit knowledge that people can navigate and search for what they need.
Example: When examining the data collected from Internet Mart, heavily trodden paths gave BT hints to regions of the site which were of key interest to its visitors.
The world wide web
Statistical and learning algorithms
Techniques have often been waiting for computing technology to catch up. Recent decades have seen the development and improvement of statistical and learning algorithms: support vector machines and kernel methods, multi-relational data mining, graph-based learning, finite state machines, etc.
[Diagrams of three sequence-labeling models over hidden states St-1, St, St+1 and observations Ot-1, Ot, Ot+1, with transition and observation edges:
- HMMs (Hidden Markov Models): directed graph, joint, generative
- MEMMs (Maximum Entropy Markov Models): directed graph, conditional, discriminative
- CRFs (Conditional Random Fields): undirected graph, conditional, discriminative]
Independent Component Analysis (ICA) vs. Principal Component Analysis (PCA)
Principal Component Analysis (PCA) finds directions of maximal variance in Gaussian data (second-order statistics).
Independent Component Analysis (ICA) finds directions of maximal independence in non-Gaussian data (higher-order statistics).
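To make the PCA half concrete, here is a minimal pure-Python sketch (an illustration, not the lecture's material): it finds the direction of maximal variance of a small 2-D dataset by eigendecomposing its 2×2 covariance matrix. The data points are hypothetical, constructed to lie mainly along the direction (2, 1):

```python
import math

# Hypothetical 2-D points: t runs along the direction (2, 1), with a small
# perpendicular offset n * (-0.1, 0.2). Not data from the lecture.
ts = [-0.75, -0.25, 0.25, 0.75]
ns = [1, -1, -1, 1]
points = [(2 * t - 0.1 * n, t + 0.2 * n) for t, n in zip(ts, ns)]

def principal_direction(pts):
    """First principal component of 2-D data via its 2x2 covariance matrix."""
    n = len(pts)
    mx = sum(x for x, _ in pts) / n
    my = sum(y for _, y in pts) / n
    a = sum((x - mx) ** 2 for x, _ in pts) / n        # var(x)
    c = sum((y - my) ** 2 for _, y in pts) / n        # var(y)
    b = sum((x - mx) * (y - my) for x, y in pts) / n  # cov(x, y)
    # Largest eigenvalue of [[a, b], [b, c]], then a matching eigenvector.
    lam = (a + c + math.sqrt((a - c) ** 2 + 4 * b * b)) / 2
    vx, vy = (b, lam - a) if b else (1.0, 0.0)
    norm = math.hypot(vx, vy)
    return vx / norm, vy / norm
```

For these points the recovered direction is proportional to (2, 1), the axis of maximal variance. ICA would instead exploit higher-order statistics and non-Gaussianity, which this sketch does not attempt.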
[Demo: four microphones (Mic 1-4) record mixtures of four speakers (Terry, Scott, Te-Won, Tzyy-Ping); playing the mixtures, performing ICA, then playing the separated components.]
ICA: separating signal data acquired by multiple sensors
People have gathered and stored so much data because they believe valuable assets are implicitly coded within it. The data's true value depends on the ability to extract useful information.
How to acquire knowledge for knowledge-based systems remains a central and difficult problem of artificial intelligence.
The need for powerful tools to analyze data
Outline
1. Why knowledge discovery and data mining?
2. Basic concepts of KDD
3. KDD techniques: classification, association, clustering, text and Web mining
4. Challenges and trends in KDD
5. Case study in medicine data mining
Data: uninterpreted signals, e.g., 25.1, 27.3, 21.6, …
Information: data with meaning, e.g., an object's mass (a measure of an object's resistance to changes in either the speed or direction of its motion)
Knowledge: integrated information consisting of facts and relationships, "verified truths", e.g., E = mc²
Metaphor: data is the rock and knowledge is the ore. Who is the miner?
Data, information, and knowledge
[Scatter plot of 29 customers by (income, debt) in US$K: (5.6, 8.5), (6.0, 13.0), (11.0, 12.0), …, (63.0, 18.0). The means, income 34.5 and debt 18.4, summarize the data (information). Customers with income below 33 have defaulted on the loan; the others are in good status with the bank, giving the rule "if income < $33K, then the person has defaulted on the loan" (knowledge).]
Data, information, and knowledge
Knowledge Discovery and Data Mining (KDD)
10^6-10^12 bytes: databases so large that grasping the whole dataset, or loading it into computer memory, is difficult. What knowledge do we want, and how should it be represented?
Which data mining algorithm should be applied?
The automatic extraction of non-obvious, hidden knowledge from large volumes of data.
...10, M, 0, 10, 10, 0, 0, 0, SUBACUTE, 37, 2, 1, 0,15,-,-, 6000, 2, 0, abnormal, abnormal,-, 2852, 2148, 712, 97, 49, F,-,multiple,,2137, negative, n, n, ABSCESS, VIRUS
12, M, 0, 5, 5, 0, 0, 0, ACUTE, 38.5, 2, 1, 0,15, -,-, 10700,4,0,normal, abnormal, +, 1080, 680, 400, 71, 59, F,-,ABPC+CZX,, 70, negative, n, n, n, BACTERIA, BACTERIA
15, M, 0, 3, 2, 3, 0, 0, ACUTE, 39.3, 3, 1, 0,15, -, -, 6000, 0,0, normal, abnormal, +, 1124, 622, 502, 47, 63, F, -,FMOX+AMK, , 48, negative, n, n, n, BACTE(E), BACTERIA
16, M, 0, 32, 32, 0, 0, 0, SUBACUTE, 38, 2, 0, 0, 15, -, +, 12600, 4, 0,abnormal, abnormal, +, 41, 39, 2, 44, 57, F, -, ABPC+CZX, ?, ?, negative, ?, n, n, ABSCESS, VIRUS...
IF cell_poly <= 220 AND Risk = n AND Loc_dat = + AND Nausea > 15 THEN Prediction = VIRUS [confidence = 87.5%]
Meningitis data, Tokyo Med. & Dental Univ., 38 attributes
(the attributes mix numerical and categorical values, with missing values and a class attribute)
From data to knowledge
Databases: store, access, search, and update data (deduction)
Statistics: infer information from data (deduction and induction, mainly numeric data)
Machine Learning: computer algorithms that improve automatically through experience (mainly induction, symbolic data)
KDD
also Algorithmics, Visualization, Data warehouses, OLAP, etc.
KDD: An interdisciplinary field
KDD: New and fast growing area
KDD’95, 96, 97, 98, …, 04, 05 (ACM, America); PAKDD’97, 98, 99, 00, …, 04, 05 (Pacific & Asia, http://www.jaist.ac.jp/PAKDD-05); PKDD’97, 98, 99, 00, …, 04, 05 (Europe); ICDM’01, 02, …, 04, 05 (IEEE); SDM’01, …, 04, 05 (SIAM)
Industrial Interest: IBM, Microsoft, Silicon Graphics, Sun, Boeing, NASA, SAS, SPSS, …
Japan: the FGCS Project focused on logic programming and reasoning; attention has been paid to knowledge acquisition and machine learning. Projects "Knowledge Science", "Discovery Science", and "Active Mining" (2001-2004).
KDD is inherently interactive and iterative. The process:
1. Understand the domain and define problems
2. Collect and preprocess data (maybe 70-90% of effort and cost in KDD)
3. Data mining: extract patterns/models (a step in the KDD process consisting of methods that produce useful patterns or models from the data)
4. Interpret and evaluate discovered knowledge
5. Put the results to practical use
The KDD process
[The common tasks arranged along the five KDD steps:
1. Create/select target database; data organized by function; data warehousing.
2. Select sampling technique and sample data; supply missing values; eliminate noisy data; normalize values; transform values; create derived attributes; transform to a different representation; find important attributes and value ranges.
3. Select DM task(s); select DM method(s); extract knowledge.
4. Test knowledge; refine knowledge.
5. Query & report generation; aggregation & sequences; advanced methods.]
Common tasks in the KDD process
Different data schemas call for different methods.
Types of data: flat data tables, relational databases, temporal & spatial data, transactional databases, multimedia data, genome databases, materials science data, textual data, Web data, etc.
Mining tasks and methods:
- Classification/Prediction: decision trees, neural networks, rule induction, support vector machines, Hidden Markov Models, etc.
- Description: association analysis, clustering, summarization, etc.
Data schemas vs. mining methods
color #nuclei #tails class
H1 light 1 1 healthy
H2 dark 1 1 healthy
H3 light 1 2 healthy
H4 light 2 1 healthy
C1 dark 1 2 cancerous
C2 dark 2 1 cancerous
C3 light 2 2 cancerous
C4 dark 2 2 cancerous
[Diagram: the same eight cells plotted with their class labels (supervised data) and without labels (unsupervised data).]
Dataset: cancerous and healthy cells
Predictive mining tasks perform inference on the current data in order to make prediction or classification. (予測的マイニングの課題は,未知のデータの予測を目的として,現在のデータに関する推論を行うことである)
Ex. If “color = dark” and “#nuclei =2”then cancerous
Descriptive mining tasks characterize the properties of the data in the database. (記述的マイニングの課題は,データベース中のデータの全般的特性を特徴付ける記述を与えることである)
Ex. "Healthy cells mostly have one nucleus while cancerous ones have two"
What to do? Primary tasks of KDD
Patterns: local summaries of relationships.
Models: global descriptions of a data set.
A model is a global description of a data set: a high-level, population or large-sample perspective. A model tells us about correlations between variables (regression), hierarchies of clusters (clustering), a neural network, etc.
A pattern is a low-level summary of a relationship, perhaps one that holds only for a few records or only a few variables (local).
Ex. If “color = dark” and “#nuclei =2” then cancerous
What to find? Patterns and models
Classification/prediction is the process of finding a set of models (or patterns) that describe and distinguish data classes or concepts, for the purpose of being able to use the model to predict the class of objects whose class label is unknown
(*classification for categorical data; **prediction for numerical data)
Decision trees, IF-THEN rules, neural networks, support vector machines, etc.
Classification/Prediction
[Diagram, model construction: a classification algorithm learns a classifier (model) from the training data {H1, H2, H3, H4, C1, C2}, e.g., "If color = dark and #tails = 2 then cancerous cell". Model usage: the classifier is applied to an unknown case to decide: cancerous?]
Classification—A two-step process
Predictive accuracy: the ability of the classifier to correctly predict unseen data
Speed: the computation cost
Robustness: the ability of the classifier to make correct predictions given noisy data or data with missing values
Scalability: the ability to construct the classifier efficiently given large amounts of data
Interpretability: the level of understanding and insight provided by the classifier
Criteria for classification methods
Description is the process of finding a set of patterns or models that describe properties of the data (essentially a summary of the data), for the purpose of understanding the data.
Clustering, association mining, summarization, trend detection, etc.
Description
Interestingness: an overall measure combining the novelty, utility, simplicity, reliability, and validity of discovered patterns/models
Speed: the computation cost
Scalability: the ability to construct the patterns/models efficiently given large amounts of data
Interpretability: the level of understanding and insight provided by the patterns/models
Criteria for description methods
Outline
1. Why knowledge discovery and data mining?
2. Basic concepts of KDD
3. KDD techniques: classification, association, clustering, text and Web mining
4. Challenges and trends in KDD
5. Case study in medicine data mining
Decision tree learning
Learn to generate classifiers in the form of decision trees from supervised data.
Mining with decision trees
A decision tree is a flow-chart-like tree structure:
- each internal node denotes a test on an attribute
- each branch represents an outcome of the test
- leaf nodes represent classes or class distributions
- the top-most node in a tree is the root node
[Decision tree over the eight cells, root {H1, H2, H3, H4, C1, C2, C3, C4}, split on #nuclei:
- #nuclei = 1 → {H1, H2, H3, C1}: split on color: light → {H1, H3}: H; dark → split on #tails: 1 → {H2}: H, 2 → C
- #nuclei = 2 → split on #tails: 1 → {H4}: H, 2 → C]
Decision tree induction (DTI)
Decision tree generation consists of two phases:
- Tree construction: at the start, all the training objects are at the root; the examples are then partitioned recursively based on selected attributes.
- Tree pruning: identify and remove branches that reflect noise or outliers.
Use of a decision tree: to classify an unknown object, test its attribute values against the decision tree.
1. At each node, choose the “best”attribute by a given measure for attribute selection
2. Extend tree by adding new branch for each value of the attribute
3. Sort training examples to leaf nodes
4. If examples in a node belong to one class Then Stop Else Repeat steps 1-4 for leaf nodes
5. Prune the tree to avoid over-fitting
Two steps: recursively generate the tree (1-4), and prune the tree (5)
DTI general algorithm
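The five steps above can be sketched directly in Python. The following compact ID3-style implementation (an illustration; it is not the lecture's D2MS tool and it omits the pruning step 5) is run on the cancerous/healthy cell table shown earlier:

```python
import math
from collections import Counter

# The cell table from the earlier slide: (color, #nuclei, #tails, class).
CELLS = [
    ("light", 1, 1, "healthy"), ("dark", 1, 1, "healthy"),
    ("light", 1, 2, "healthy"), ("light", 2, 1, "healthy"),
    ("dark", 1, 2, "cancerous"), ("dark", 2, 1, "cancerous"),
    ("light", 2, 2, "cancerous"), ("dark", 2, 2, "cancerous"),
]
ATTRS = {"color": 0, "nuclei": 1, "tails": 2}

def entropy(rows):
    counts = Counter(r[-1] for r in rows)
    n = len(rows)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def gain(rows, idx):
    n = len(rows)
    remainder = 0.0
    for v in {r[idx] for r in rows}:
        sub = [r for r in rows if r[idx] == v]
        remainder += len(sub) / n * entropy(sub)
    return entropy(rows) - remainder

def id3(rows, attrs):
    classes = {r[-1] for r in rows}
    if len(classes) == 1:                 # pure node -> leaf (stop, step 4)
        return classes.pop()
    if not attrs:                         # attributes exhausted -> majority class
        return Counter(r[-1] for r in rows).most_common(1)[0][0]
    best = max(attrs, key=lambda a: gain(rows, attrs[a]))  # step 1
    idx = attrs[best]
    rest = {a: i for a, i in attrs.items() if a != best}
    node = {"attr": best, "branches": {}}
    for v in {r[idx] for r in rows}:      # steps 2-3: one branch per value
        node["branches"][v] = id3([r for r in rows if r[idx] == v], rest)
    return node

def classify(node, row):
    while isinstance(node, dict):
        node = node["branches"][row[ATTRS[node["attr"]]]]
    return node

tree = id3(CELLS, ATTRS)
```

On this toy table the induced tree classifies all eight training cells correctly; a real implementation would add pruning (step 5) and handle attribute values unseen during training.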
Measures for attribute selection
A typical dataset in machine learning
14 objects belonging to two classes {Y, N} are observed on 4 properties.
Dom(Outlook) = {sunny, overcast, rain}
Dom(Temperature) = {hot, mild, cool}
Dom(Humidity) = {high, normal}
Dom(Wind) = {weak, strong}
Day | Outlook | Temperature | Humidity | Wind | Class
D1 | sunny | hot | high | weak | N
D2 | sunny | hot | high | strong | N
D3 | overcast | hot | high | weak | Y
D4 | rain | mild | high | weak | Y
D5 | rain | cool | normal | weak | Y
D6 | rain | cool | normal | strong | N
D7 | overcast | cool | normal | strong | Y
D8 | sunny | mild | high | weak | N
D9 | sunny | cool | normal | weak | Y
D10 | rain | mild | normal | weak | Y
D11 | sunny | mild | normal | strong | Y
D12 | overcast | mild | high | strong | Y
D13 | overcast | hot | normal | weak | Y
D14 | rain | mild | high | strong | N
Training data for concept “play-tennis”
[A complex decision tree for playing tennis with "temperature" at the root: temperature splits the examples into cool {D5, D6, D7, D9}, hot {D1, D2, D3, D13}, and mild {D4, D8, D10, D11, D12, D14}; each branch is then split further on outlook, wind, and humidity (one outlook branch is even null) until the leaves are pure.]
outlook?
- sunny {D1, D2, D8, D9, D11} → humidity?
  - high {D1, D2, D8} → no
  - normal {D9, D11} → yes
- overcast {D3, D7, D12, D13} → yes
- rain {D4, D5, D6, D10, D14} → wind?
  - strong {D6, D14} → no
  - weak {D4, D5, D10} → yes
This tree is much simpler as “outlook” is selected at the root.How to select good attribute to split a decision node?
A simple decision tree for playing tennis
Which attribute is the best?
The "playing-tennis" set S contains 9 positive objects (+) and 5 negative objects (-), denoted [9+, 5-].
If attributes “humidity” and “wind” split S into sub-nodes with proportions of positive and negative objects as below, which attribute is better?
A1 = humidity: [9+, 5-] splits into normal [6+, 1-] and high [3+, 4-].
A2 = wind: [9+, 5-] splits into weak [6+, 2-] and strong [3+, 3-].
Entropy
Entropy characterizes the impurity (purity) of an arbitrary collection of objects.
S is the collection of positive and negative objects.
p⊕ is the proportion of positive objects in S.
p⊖ is the proportion of negative objects in S.
In the play-tennis example, these numbers are 14, 9/14, and 5/14, respectively.
Entropy is defined as follows:
Entropy(S) = − p⊕ log2 p⊕ − p⊖ log2 p⊖
Entropy
The entropy function relative to a Boolean classification, as the proportion p⊕ of positive objects varies between 0 and 1.
In general, for c classes: Entropy(S) ≡ − Σ (i = 1..c) p_i log2 p_i
From the 14 examples of play-tennis, 9 positive and 5 negative objects (denoted [9+, 5-]):
Entropy([9+, 5-]) = − (9/14) log2(9/14) − (5/14) log2(5/14) = 0.940
Notice:
1. Entropy is 0 if all members of S belong to the same class. For example, if all members are positive (p⊕ = 1), then p⊖ = 0, and Entropy(S) = − 1·log2(1) − 0·log2(0) = 0 (taking 0·log2 0 = 0).
2. Entropy is 1 if the collection contains an equal number of positive and negative examples; if the numbers are unequal, the entropy is between 0 and 1.
Example
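The computation above is easy to check in code. This small helper (an illustration, not from the lecture) implements the two-class entropy formula, with the 0·log2(0) terms taken as 0:

```python
import math

def entropy(pos, neg):
    """Entropy of a collection with pos positive and neg negative objects."""
    total = pos + neg
    e = 0.0
    for count in (pos, neg):
        if count:  # 0 * log2(0) is taken to be 0
            p = count / total
            e -= p * math.log2(p)
    return e
```

entropy(9, 5) evaluates to about 0.940, entropy(7, 7) to exactly 1.0, and entropy(14, 0) to 0.0, matching the three cases discussed above.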
We define a measure, called information gain, of the effectiveness of an attribute in classifying data: the expected reduction in entropy caused by partitioning the objects according to this attribute.
Gain(S, A) ≡ Entropy(S) − Σ (v ∈ Values(A)) (|Sv| / |S|) Entropy(Sv)
where Values(A) is the set of all possible values for attribute A, and Sv is the subset of S for which A has value v.
Information gain measures the expected reduction in entropy
Values(Wind) = {Weak, Strong}, S = [9+, 5-]
Sweak, the subnode with value "weak", is [6+, 2-]; Sstrong, the subnode with value "strong", is [3+, 3-].
Gain(S, Wind) = Entropy(S) − Σ (v ∈ {Weak, Strong}) (|Sv| / |S|) Entropy(Sv)
= Entropy(S) − (8/14) Entropy(Sweak) − (6/14) Entropy(Sstrong)
= 0.940 − (8/14)·0.811 − (6/14)·1.00 = 0.048
Information gain measures the expected reduction in entropy
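The same calculation can be written as a short sketch (illustrative, not the lecture's code): each child node is given as a (positives, negatives) pair, and the gain is the parent entropy minus the size-weighted child entropies:

```python
import math

def entropy(pos, neg):
    total = pos + neg
    return -sum((c / total) * math.log2(c / total) for c in (pos, neg) if c)

def info_gain(parent, children):
    """parent and children are (pos, neg) pairs; children partition parent."""
    total = sum(p + n for p, n in children)
    remainder = sum((p + n) / total * entropy(p, n) for p, n in children)
    return entropy(*parent) - remainder

# Gain(S, Wind): S = [9+, 5-], Weak = [6+, 2-], Strong = [3+, 3-]
gain_wind = info_gain((9, 5), [(6, 2), (3, 3)])
```

gain_wind evaluates to about 0.048, and info_gain((9, 5), [(3, 4), (6, 1)]) gives about 0.151 for Humidity, matching the slide values.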
Humidity: S = [9+, 5-], E = 0.940; High → [3+, 4-], E = 0.985; Normal → [6+, 1-], E = 0.592.
Gain(S, Humidity) = 0.940 − (7/14)·0.985 − (7/14)·0.592 = 0.151
Wind: S = [9+, 5-], E = 0.940; Weak → [6+, 2-], E = 0.811; Strong → [3+, 3-], E = 1.00.
Gain(S, Wind) = 0.940 − (8/14)·0.811 − (6/14)·1.00 = 0.048
Which attribute is the best classifier?
Information gain of all attributes
Gain (S, Outlook) = 0.246
Gain (S, Humidity) = 0.151
Gain (S, Wind) = 0.048
Gain (S, Temperature) = 0.029
{D1, D2, …, D14} [9+, 5-], split on Outlook:
- Sunny: {D1, D2, D8, D9, D11} [2+, 3-] → ? (which attribute should be tested here?)
- Overcast: {D3, D7, D12, D13} [4+, 0-] → Yes
- Rain: {D4, D5, D6, D10, D14} [3+, 2-] → ?
Ssunny = {D1, D2, D8, D9, D11}
Gain(Ssunny, Humidity) = 0.970 − (3/5)·0.0 − (2/5)·0.0 = 0.970
Gain(Ssunny, Temperature) = 0.970 − (2/5)·0.0 − (2/5)·1.0 − (1/5)·0.0 = 0.570
Gain(Ssunny, Wind) = 0.970 − (2/5)·1.0 − (3/5)·0.918 = 0.019
Next step in growing the decision tree
Stopping condition
1. Every attribute has already been included along this path through the tree
2. The training objects associated with each leaf node all have the same target attribute value (i.e., their entropy is zero)
Notice: Algorithm ID3 uses Information Gain and C4.5, its successor, uses Gain Ratio (a variant of Information Gain)
Over-fitting in decision trees
The generated tree may overfit the training dataToo many branches, some may reflect anomalies due to noise or outliers
Result is in poor accuracy for unseen objects
Two approaches to avoid overfittingPrepruning: Halt tree construction early—do not split a node if this would result in the goodness measure falling below a threshold
Difficult to choose an appropriate threshold
Postpruning: Remove branches from a “fully grown” tree—get a sequence of progressively pruned trees
Use a set of data different from the training data to decide which is the “best pruned tree”
[The tree being converted: outlook? sunny → humidity? (high → no; normal → yes); overcast → yes; rain → wind? (strong → no; weak → yes)]
IF (Outlook = Sunny) and (Humidity = High)THEN PlayTennis = No
IF (Outlook = Sunny) and (Humidity = Normal)THEN PlayTennis = Yes
Converting a tree to rules
Attributes with many values
If an attribute has many values (e.g., days of the month), ID3 will tend to select it.
C4.5 uses GainRatio instead
GainRatio(S, A) ≡ Gain(S, A) / SplitInformation(S, A)
SplitInformation(S, A) ≡ − Σ (i = 1..c) (|Si| / |S|) log2 (|Si| / |S|)
where Si is the subset of S for which A has value vi.
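A quick illustrative check of the formula (not the lecture's code): SplitInformation is just the entropy of the partition sizes, so a many-valued attribute that shatters S into many tiny subsets gets a large denominator and hence a small GainRatio:

```python
import math

def entropy_of(fractions):
    return -sum(f * math.log2(f) for f in fractions if f)

def split_information(sizes):
    """Entropy of the partition sizes |Si|/|S| induced by an attribute."""
    total = sum(sizes)
    return entropy_of([s / total for s in sizes])

def gain_ratio(gain, sizes):
    return gain / split_information(sizes)

# Wind splits the 14 play-tennis objects into subsets of size 8 (weak)
# and 6 (strong); Gain(S, Wind) = 0.048 from the earlier slide.
gr_wind = gain_ratio(0.048, [8, 6])
```

An attribute with a distinct value per object would have SplitInformation of log2(14) ≈ 3.81, sharply reducing its ratio compared with Wind's ≈ 0.99.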
Bayesian classification
Learning statistical classifiers from supervised data based on Bayes theorem and assumptions about the independence/dependence of the data.
What is Bayesian classification?
Bayesian classifiers are statistical classifiers. Bayesian classification is based on Bayes theorem.
Naïve Bayesian classifiers assume that the effect of an attribute value on a given class is independent of the values of the other attributes.
Bayesian belief networks are graphical models that allow the representation of dependencies among subsets of attributes.
Bayes theorem
Let X be an object whose class label is unknown
Let H be some hypothesis, such as that X belongs to class C.
For classification, we want to determine the posterior probability P(H|X), of H conditioned on X.
Example: Data object consists of fruits, described by color and shape
Suppose X is red and round, and H is the hypothesis that X is an apple.
P(H|X) reflects our confidence that X is an apple given that we have seen that X is red and round.
P(apple | red and round) = ?
Bayes theorem
In contrast, P(H) is the prior probability of H.
In our example, P(H) is the probability that any given data object is an apple, regardless of how the data sample looks (independent of X).
P(X|H) is the likelihood of X given H, that is, the probability that X is red and round given that we know X is an apple.
P(X), P(H), and P(X|H) may be estimated from the given data. Bayes theorem allows us to calculate P(H|X)
P(H|X) = P(X|H) · P(H) / P(X)

P(apple | red ∧ round) = P(red ∧ round | apple) · P(apple) / P(red ∧ round)
Naïve Bayesian classification
Suppose X = (x1, x2, …, xn), attributes A1, A2, …, An
There are m classes C1, C2, …, Cm
P(Ci|X) denotes probability that X is classified to class Ci.
Example:P(class = N | outlook=sunny, temperature=hot, humidity=high, wind=strong)
Idea: assign to object X the class label Ci such that P(Ci|X) is maximal, i.e., P(Ci|X) > P(Cj|X), ∀j, i≠j.
Ci is called the maximum posterior hypothesis.
Estimating a posteriori probabilities
Bayes theorem: P(Ci|X) = P(X|Ci) · P(Ci) / P(X)
P(X) is constant. Only need maximize P(X|Ci) P(Ci)
Ci such that P(Ci |X) is maximum =
Ci such that P(X| Ci)·P(Ci) is maximum
If prior probability is unknown, commonly assumed that P(C1) = P(C2) = … = P(Cm), and we would maximize P(X|Ci)
Otherwise, P(Ci) = relative frequency of class Ci = Si/S
Problem: computing P(X|Ci) directly is infeasible!
Naïve Bayesian classification
Naïve assumption: We have P(X|Ci) = P(x1,…,xn|Ci), if attributes are independent then P(X|Ci) = P(x1|Ci) x … x P(xn|Ci)
If Ak is categorical then P(xk|Ci) = Sik/Si where Sik is the number of training objects of class Ci having the value xk for Ak, and Si is the number of training objects belonging to Ci.
If Ak is continuous then the attribute is typically assumed to have a Gaussian distribution, so that
P(xk|Ci) = (1 / (√(2π) · σCi)) · exp( − (xk − μCi)² / (2 · σCi²) )

where μCi and σCi are the mean and standard deviation of the values of attribute Ak for training objects of class Ci
To classify an unknown object X, P(X|Ci)P(Ci) is evaluated for each class Ci. X is then assigned to the class Ci if and only if P(X|Ci)P(Ci) > P(X|Cj)P(Cj), for 1 ≤ j ≤ m, j ≠ i.
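The Gaussian class-conditional density above can be coded directly; a minimal sketch (the function name and numbers are illustrative, not from the lecture):

```python
import math

def gaussian_likelihood(x, mu, sigma):
    """P(xk|Ci) under the Gaussian assumption, with mu and sigma the mean
    and standard deviation of the attribute within class Ci."""
    coeff = 1.0 / (math.sqrt(2 * math.pi) * sigma)
    return coeff * math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))

# The density peaks at x = mu with value 1 / (sqrt(2*pi) * sigma).
peak = gaussian_likelihood(25.0, 25.0, 5.0)
```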
P(Y) = 9/14, P(N) = 5/14

outlook:
P(sunny|Y) = 2/9, P(sunny|N) = 3/5
P(overcast|Y) = 4/9, P(overcast|N) = 0
P(rain|Y) = 3/9, P(rain|N) = 2/5
temperature:
P(hot|Y) = 2/9, P(hot|N) = 2/5
P(mild|Y) = 4/9, P(mild|N) = 2/5
P(cool|Y) = 3/9, P(cool|N) = 1/5
humidity:
P(high|Y) = 3/9, P(high|N) = 4/5
P(normal|Y) = 6/9, P(normal|N) = 2/5
wind:
P(strong|Y) = 3/9, P(strong|N) = 3/5
P(weak|Y) = 6/9, P(weak|N) = 2/5

Day  Outlook   Temperature  Humidity  Wind    Class
D1   sunny     hot          high      weak    N
D2   sunny     hot          high      strong  N
D3   overcast  hot          high      weak    Y
D4   rain      mild         high      weak    Y
D5   rain      cool         normal    weak    Y
D6   rain      cool         normal    strong  N
D7   overcast  cool         normal    strong  Y
D8   sunny     mild         high      weak    N
D9   sunny     cool         normal    weak    Y
D10  rain      mild         normal    weak    Y
D11  sunny     mild         normal    strong  Y
D12  overcast  mild         high      strong  Y
D13  overcast  hot          normal    weak    Y
D14  rain      mild         high      strong  N

Play-tennis example: estimating P(xk|Ci)
An unseen object X = <rain, hot, high, weak>
P(X|Y)×P(Y) = P(rain|Y)×P(hot|Y)×P(high|Y)×P(weak|Y)×P(Y) = 3/9 × 2/9 × 3/9 × 6/9 × 9/14 = 0.010582
P(X|N)×P(N) = P(rain|N) ×P(hot|N) ×P(high|N) × P(weak|N) × P(N) = 2/5 × 2/5 × 4/5 × 2/5 × 5/14 = 0.018286
Object X is classified in class N (don’t play)
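The whole calculation can be reproduced with a few lines of code; this is a minimal sketch over the slide's play-tennis table, not an implementation from the lecture:

```python
# Play-tennis training data from the slide:
# (outlook, temperature, humidity, wind, class)
data = [
    ("sunny", "hot", "high", "weak", "N"),
    ("sunny", "hot", "high", "strong", "N"),
    ("overcast", "hot", "high", "weak", "Y"),
    ("rain", "mild", "high", "weak", "Y"),
    ("rain", "cool", "normal", "weak", "Y"),
    ("rain", "cool", "normal", "strong", "N"),
    ("overcast", "cool", "normal", "strong", "Y"),
    ("sunny", "mild", "high", "weak", "N"),
    ("sunny", "cool", "normal", "weak", "Y"),
    ("rain", "mild", "normal", "weak", "Y"),
    ("sunny", "mild", "normal", "strong", "Y"),
    ("overcast", "mild", "high", "strong", "Y"),
    ("overcast", "hot", "normal", "weak", "Y"),
    ("rain", "mild", "high", "strong", "N"),
]

def naive_bayes_score(x, label):
    """P(X|Ci) * P(Ci), with P(xk|Ci) estimated as Sik/Si (categorical)."""
    rows = [r for r in data if r[-1] == label]
    score = len(rows) / len(data)               # prior P(Ci)
    for k, value in enumerate(x):
        score *= sum(r[k] == value for r in rows) / len(rows)
    return score

x = ("rain", "hot", "high", "weak")
score_y = naive_bayes_score(x, "Y")   # 3/9 * 2/9 * 3/9 * 6/9 * 9/14 ≈ 0.0106
score_n = naive_bayes_score(x, "N")   # 2/5 * 2/5 * 4/5 * 2/5 * 5/14 ≈ 0.0183
```

It reproduces the two scores above, so X is assigned to class N.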
Play-tennis example: classifying X
It makes computation possible
It yields optimal classifiers when the independence assumption is satisfied
But the assumption is seldom satisfied in practice, as attributes (variables) are often correlated
Attempts to overcome this limitation, among others:
Bayesian networks, that combine Bayesian reasoning with causal relationships between attributes
Decision trees, that reason on one attribute at a time, considering most important attributes first
The independence hypothesis
Other classification methods
Neural Networks
Instance-based Classification
Genetic Algorithms
Rough Set Approach
Statistical Approaches
Support Vector Machines
etc.
(Figure: a neural network whose inputs are cell features such as color = dark, # nuclei = 1, # tails = 2, with hidden units H1-H4 and units C1-C4, classifying cells as Healthy or Cancerous.)
Mining with neural networks
Advantages
prediction accuracy is generally high
robust: works when training examples contain errors
output may be discrete, real-valued, or a vector of several discrete or real-valued attributes
fast evaluation of the learned target function
Criticism
long training time
difficult to understand the learned function (weights)
not easy to incorporate domain knowledge
Mining with neural networks
Uses the most similar individual instances seen in the past to classify a new instance
Typical approaches:
k-nearest neighbor approach
Instances as points in a Euclidean space
Locally weighted regression: constructs local approximations
Case-based reasoning: uses symbolic representations and knowledge-based inference
Instance-based classification
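A minimal k-nearest-neighbor sketch; the 2-D points are hypothetical, chosen only to illustrate distance-based majority voting:

```python
import math
from collections import Counter

def knn_classify(train, x, k=3):
    """Label x by majority vote among its k nearest training instances,
    treating instances as points in a Euclidean space."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], x))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical labeled 2-D instances.
train = [((0, 0), "a"), ((0, 1), "a"), ((1, 0), "a"),
         ((5, 5), "b"), ((5, 6), "b"), ((6, 5), "b")]
label = knn_classify(train, (0.5, 0.5))   # its 3 nearest neighbors are "a"
```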
Genetic algorithms (GA)
Generate initial population
do
    Calculate the fitness of each member
    // simulate another generation
    do
        1. Select parents from current population
        2. Perform crossover to add offspring to the new population
    while new population is not full
    1. Merge new population into the current population
    2. Mutate current population
while not converged
EVOLUTION            PROBLEM SOLVING
Environment          Problem
Individual           Candidate Solution
Fitness              Quality

Fitness → chances for survival and reproduction
Quality → chance for seeding new solutions
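The loop above can be sketched in code; the fitness function, crossover, and mutation below are toy choices (maximizing the number of 1-bits in a string), not from the lecture:

```python
import random

def fitness(bits):
    """Quality of a candidate solution: here, the number of 1-bits."""
    return sum(bits)

def crossover(a, b):
    """One-point crossover of two parent bit strings."""
    cut = random.randrange(1, len(a))
    return a[:cut] + b[cut:]

def mutate(bits, rate=0.05):
    """Flip each bit with a small probability."""
    return [1 - b if random.random() < rate else b for b in bits]

def evolve(pop_size=20, length=16, generations=40):
    # generate initial population
    pop = [[random.randint(0, 1) for _ in range(length)]
           for _ in range(pop_size)]
    for _ in range(generations):
        # select parents from the current population (fitter half)
        parents = sorted(pop, key=fitness, reverse=True)[:pop_size // 2]
        # perform crossover to fill the new population, then mutate
        children = [mutate(crossover(random.choice(parents),
                                     random.choice(parents)))
                    for _ in range(pop_size)]
        # merge the new population into the current one, keep the best
        pop = sorted(pop + children, key=fitness, reverse=True)[:pop_size]
    return max(pop, key=fitness)

random.seed(0)        # reproducible toy run
best = evolve()
```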
Rough sets are used to approximately or “roughly” define equivalence classes
A rough set for a class C is approximated by two sets:
a lower approximation (objects certain to be in C)
an upper approximation (objects possibly in C)
Finding the minimal subsets (reducts) of attributes, dependencies in data, rules, etc.
(Figure: equivalence classes induced by R = {color, shape} on a set U of objects, with a set X approximated from below by its lower approximation and from above by its upper approximation.)

Rough Sets and Data Mining, T.Y. Lin, N. Cercone (eds.), Kluwer Academic Publishers, 1997.
Rough Sets in Knowledge Discovery, L. Polkowski, A. Skowron (eds.), Physica-Verlag, 1998.

Rough set approach
Association rule mining
Description learning that aims to find all possible associations from data.
Analyzes customer buying habits by finding associations between the different items that customers place in their “shopping baskets”
Helps develop marketing strategies by gaining insight into which items are frequently purchased together by customers
How often do people buy onigiri and beer together?
Market basket analysis
Association rule X ⇒Y
support s = probability that a transaction contains X and Y
confidence c = conditional probability that a transaction having X also contains Y
If minimum support 50%, minimum confidence 50%:
A ⇒ C (s=50%, c=66.6%)
C ⇒ A (s=50%, c=100%)
Transaction ID   Items Bought
2000             A, B, C
1000             A, C
4000             A, D
5000             B, E, F
(Venn diagram: customers who buy onigiri, customers who buy beer, and their overlap: customers who buy both.)
Rule measures: support and confidence
The rule X ⇒ Y holds in the transaction set D with confidence c if c% of transactions in D that contain X also contain Y.
The rule X ⇒ Y has support s in the transaction set D if s% of transactions in D contain X ∪ Y.
Confidence denotes the strength of implication and support indicates the frequencies of the occurring patterns in the rule
support(X ⇒ Y) = P(X and Y) = (# trans. containing both X and Y) / (# trans. in the database)

confidence(X ⇒ Y) = P(Y|X) = P(X and Y) / P(X) = (# trans. containing X and Y) / (# trans. containing X)
Basic concepts
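With the four example transactions from the previous slide, support and confidence can be computed directly; a sketch, not code from the lecture:

```python
# The four transactions from the slide.
transactions = [{"A", "B", "C"},   # TID 2000
                {"A", "C"},        # TID 1000
                {"A", "D"},        # TID 4000
                {"B", "E", "F"}]   # TID 5000

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(x, y):
    """P(Y|X) = support(X ∪ Y) / support(X)."""
    return support(x | y) / support(x)

s = support({"A", "C"})         # 2 of 4 transactions: 50%
c = confidence({"A"}, {"C"})    # 2 of the 3 transactions with A: 66.6%
```

This confirms A ⇒ C (s = 50%, c = 66.6%) and C ⇒ A (s = 50%, c = 100%).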
It is composed of two steps:
1. Find all frequent itemsets: By definition, each of these itemsets will occur at least as frequently as a pre-determined minimum support count
2. Generate strong association rules from the frequent itemsets: By definition, these rules must satisfy minimum support and minimum confidence
Association mining: Apriori algorithm
Min. support 50%, min. confidence 50%

Transaction ID   Items Bought
2000             A, B, C
1000             A, C
4000             A, D
5000             B, E, F

Frequent Itemset   Support
{A}                75%
{B}                50%
{C}                50%
{A,C}              50%

For rule A ⇒ C:
support = support({A, C}) = 50%
confidence = support({A, C}) / support({A}) = 66.6%
Association mining: Apriori principle
The Apriori principle:Any subset of a frequent itemset must be frequent
(if an itemset is not frequent, its supersets are not)
1. Find the frequent itemsets: the sets of items that have support higher than the minimum support
A subset of a frequent itemset must also be a frequent itemset
i.e., if {AB} is a frequent itemset, both {A} and {B} should be a frequent itemset
Iteratively find frequent itemsets Lk with cardinality from 1 to k (k-itemsets) from candidate itemsets Ck (Lk ⊆ Ck)
2. Use the frequent itemsets to generate association rules.
C1 … Li-1 Ci Li Ci+1 … Lk
The Apriori algorithm
Mining frequent itemsets for Boolean association rules
It employs an iterative, level-wise search, where k-itemsets are used to explore (k+1)-itemsets
First, the set of frequent 1-itemsets L1 is found.
L1 is used to find the set of 2-itemsets L2,
then L2 is used to find L3,
and so on, until no more frequent k-itemsets can be found.
To improve the efficiency of the level-wise generation of frequent itemsets, the important Apriori property is used to reduce the search space.
Apriori algorithm: Finding frequent itemsetsusing candidate generation
If an itemset l does not satisfy the minimum support threshold min_sup, then l is not frequent, that is, P(l)<min_sup
If an item A is added to the itemset l, then the resulting itemset (i.e., l ∪ A) cannot occur more frequently than l. Therefore, l ∪ A is not frequent either, that is, P(l ∪ A) < min_sup
This is the anti-monotone property:
if a set cannot pass a test, all of its supersets will fail the same test as well
Apriori algorithm: Finding frequent itemsetsusing candidate generation
Join Lk-1 with itself to generate a set Ck of candidate k-itemsets, and use it to find Lk
Given l1, l2 ∈ Lk-1, the notation li[j] refers to the jth item in li
Apriori assumes that items within a transaction or itemset are sorted in lexicographic order
The join, Lk-1 ⋈ Lk-1, is performed where members of Lk-1 are joinable if their first (k-2) items are in common. That is, l1, l2 ∈ Lk-1 are joined if
(l1[1] = l2[1]) ∧ (l1[2]= l2[2])∧… ∧ (l1[k-2]= l2[k-2]) ∧ (l1[k-1]<l2[k-1])
The resulting itemset by joining l1 and l2 is
l1[1] l1[2] … l1[k-2] l1[k-1] l2[k-1]
Apriori algorithm: join step to find Ck
Ck is a superset of Lk, i.e. its members may or may not be frequent, but all frequent k-itemsets are included in Ck (Lk ⊆ Ck)
A scan of the database to determine the count of each candidate in Ck would result in the determination of Lk (all candidates having a count no less than the min_sup_count are frequent and therefore belong to Lk)
Ck can be huge. To reduce the size of Ck: use apriori property
Any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset
Hence, if any (k-1)-subset of a candidate k-itemset is not in Lk-1, then the candidate cannot be frequent either and so can be removed from Ck (can be tested quickly by a hash tree of frequent itemsets)
Apriori algorithm: The prune step
Transactional data D:
TID    List of item IDs
T100   I1, I2, I5
T200   I2, I4
T300   I2, I3
T400   I1, I2, I4
T500   I1, I3
T600   I2, I3
T700   I1, I3
T800   I1, I2, I3, I5
T900   I1, I2, I3

Scan D for the count of each candidate, giving C1:
Itemset   Sup. count
{I1}      6
{I2}      7
{I3}      6
{I4}      2
{I5}      2

Compare each candidate's support count with the minimum support count; all five candidates qualify, so L1 = C1.

Example (min_sup_count = 2)
Generate candidates C2 from L1 using the Apriori principle:
C2 = { {I1,I2}, {I1,I3}, {I1,I4}, {I1,I5}, {I2,I3}, {I2,I4}, {I2,I5}, {I3,I4}, {I3,I5}, {I4,I5} }

Scan D for the count of each candidate:
Itemset   S.count
{I1,I2}   4
{I1,I3}   4
{I1,I4}   1
{I1,I5}   2
{I2,I3}   4
{I2,I4}   2
{I2,I5}   2
{I3,I4}   0
{I3,I5}   1
{I4,I5}   0

Compare candidate support counts with the minimum support count, giving L2:
Itemset   S.count
{I1,I2}   4
{I1,I3}   4
{I1,I5}   2
{I2,I3}   4
{I2,I4}   2
{I2,I5}   2

Generate candidates C3 from L2 using the Apriori principle, scan D for counts, and compare with the minimum support count:
C3: {I1,I2,I3} (count 2), {I1,I2,I5} (count 2); both qualify, so L3 = C3.

Example (min_sup_count = 2)
1. Join: C3 = L2 ⋈ L2 = {{I1, I2}, {I1, I3}, {I1, I5}, {I2, I3}, {I2, I4}, {I2, I5}} ⋈ {{I1, I2}, {I1, I3}, {I1, I5}, {I2, I3}, {I2, I4}, {I2, I5}}
= {{I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4}, {I2, I3, I5}, {I2, I4, I5}}
2. Prune using the Apriori property: all nonempty subsets of a frequent itemset must also be frequent. Does any candidate have a subset that is not frequent?
2-item subsets of {I1, I2, I3} are {I1, I2}, {I1, I3}, {I2, I3} which are all members of L2. Therefore, keep {I1, I2, I3} in C3
2-item subsets of {I1, I2, I5} are {I1, I2}, {I1, I5}, {I2, I5} which are all members of L2. Therefore, keep {I1, I2, I5} in C3
2-item subsets of {I1, I3, I5} are {I1, I3}, {I1, I5}, {I3, I5}. Subset {I3, I5} is not a member of L2. Therefore, remove {I1, I3, I5} from C3, and so on
3. Therefore, C3 = {{I1, I2, I3}, {I1, I2, I5}}
Example (min_sup_count = 2)
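The whole level-wise search condenses to a short sketch (helper names are my own) that reproduces L1, L2, and L3 on the nine example transactions:

```python
from itertools import combinations

# The nine transactions from the slides (items I1..I5).
db = [{"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
      {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
      {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"}]

def apriori(transactions, min_sup_count=2):
    """Level-wise search: join L(k-1) with itself to form Ck, prune
    candidates having an infrequent (k-1)-subset, then scan the database."""
    count = lambda c: sum(1 for t in transactions if set(c) <= t)
    items = sorted({i for t in transactions for i in t})
    frequent = {}
    level = [(i,) for i in items if count((i,)) >= min_sup_count]
    k = 1
    while level:
        frequent[k] = set(level)
        # join step: tuples sharing their first k-1 items, last items ordered
        candidates = {a + (b[-1],) for a in level for b in level
                      if a[:-1] == b[:-1] and a[-1] < b[-1]}
        # prune step: every k-subset of a (k+1)-candidate must be frequent
        candidates = {c for c in candidates
                      if all(s in frequent[k] for s in combinations(c, k))}
        level = sorted(c for c in candidates if count(c) >= min_sup_count)
        k += 1
    return frequent

L = apriori(db)
```

The prune step drops {I1, I3, I5} and the other three joined candidates exactly as traced above.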
Cluster analysis
Description learning that aims to detect groups of similar objects in unsupervised data.
A cluster is a collection of data objects satisfying
Objects in this cluster are similar to one another
Objects in this cluster are dissimilar to the objects in other clusters
The process of grouping objects into clusters is called clustering
What is cluster analysis?
Clustering analyzes data objects without consulting a known class label.
The objects are clustered or grouped based on the principle of maximizing the within-class similarity and minimizing the between-class similarity
Partition-based clustering is suited to large sets of numerical data.
Hierarchical clustering, with at least O(n²) time complexity, is not suitable for very large datasets.
Mining with clustering
A good clustering method will produce high quality clusters with
high intra-class similarity (within a class)
low inter-class similarity (between classes)
The quality of clustering basically depends on the similarity measure and the cluster representative used by the method
New forms of clustering require different criteria of quality.
What is good clustering?
Statistics: for many years, the focus has been on distance-based clustering (S-Plus, SPSS, SAS)
Machine learning: unsupervised learning. In conceptual clustering, a group of objects forms a class only if it is described by a concept
KDD: Efficient and effective clustering of large databases: scalability, complex shapes and types of data, high dimensional clustering, mixed numerical and categorical data
Clustering in different fields
Scalability
Ability to deal with different types of attributes
Discovery of clusters with arbitrary shape
Minimal requirements for domain knowledge to determine input parameters
Ability to deal with noisy data
Insensitivity to the order of input records
High dimensionality
Constraint-based clustering
Interpretability and usability
Typical requirements of clustering
Partitioning Methods
Hierarchical Methods
Density-Based Methods
Grid-Based Methods
Model-Based Methods
Clustering methods in KDD
Given n objects and k as number of clusters to form. A partitioning algorithm organizes the objects into a partition of k clusters
The clusters are formed to optimize an objective partitioning criterion so that the objects within a cluster are “similar” , whereas the objects of different classes are “dissimilar”
Partitioning methods
1. Select two centers randomly from the n objects
2. Form two clusters by assigning each object to its nearest center
3. Calculate two new centers
4. Reform two new clusters; repeat steps 2 and 3 until the stopping conditions hold
K-means algorithm (K=2)
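The four steps can be sketched as follows (the 2-D points are hypothetical, and the code generalizes beyond K = 2):

```python
import math
import random

def kmeans(points, k=2, iters=20, seed=1):
    """Steps from the slide: pick k centers at random, assign each object
    to its nearest center, recompute the centers, and repeat."""
    centers = random.Random(seed).sample(points, k)   # step 1
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                              # step 2: nearest center
            nearest = min(range(k), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        centers = [tuple(sum(v) / len(cl) for v in zip(*cl)) if cl
                   else centers[i]                    # step 3: new centers
                   for i, cl in enumerate(clusters)]
    return centers, clusters

# Two well-separated hypothetical groups of 2-D points.
points = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)]
centers, clusters = kmeans(points)
```

A fixed iteration count stands in for the stopping condition; in practice one stops when the assignments no longer change.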
Partitioning methods
The k-means algorithm is sensitive to outliers
The k-medoids method uses medoid (the most centrally located object in a cluster)
The EM (Expectation Maximization) algorithm: assigns to a cluster according to a weight representing the probability of membership.
PAM (Partitioning Around Medoids)
From k-Medoids to CLARA (Clustering LARge Applications)
From CLARA to CLARANS (Clustering LARge Applications based on RANdomized Search)
A hierarchical clustering is a sequence of partitions in which each partition is nested into the next (previous) partition in the sequence.
Partition Q is nested into partition P if every component of Q is a subset of a component of P.
(Example on slide: two partitions P and Q of objects x1, …, x10, where every component of Q is contained in a component of P.)
Hierarchical methods
Chameleon: A Hierarchical Clustering Algorithm Using Dynamic Modeling.
Clusters are merged if the interconnectivity and closeness between them are highly related to the internal interconnectivity and closeness of objects within the clusters.
Hierarchical clustering: chameleon
Typically regards clusters as dense regions of objects in the data space that are separated by regions of low density
DBSCAN: Based on Connected Regions with Sufficiently High Density (Nearest Neighbor Estimation)
DENCLUE: Based on Density Distribution Functions (Kernel Estimation)
DBSCAN result for DS2 with MinPts at 4 and Eps at (a) 5.0, (b) 3.5 and (c) 3.0
Density-based methods
Text mining
Finding unknown useful information from huge collection of textual data.
What is text mining?
“The non-trivial extraction of implicit, previously unknown, and potentially useful information from (large amounts of) textual data.”
“An exploration and analysis of textual (natural language) data by automatic and semi-automatic means to discover new knowledge.”
A branch of data mining that targets discovering and extracting knowledge from text documents
Extracting scientific evidence from biomedical literature titles (Swanson & Smalheiser, 1997)

“stress is associated with migraines”
“stress can lead to loss of magnesium”
“calcium channel blockers prevent some migraines”
“magnesium is a natural calcium channel blocker”

Combining the extracted sentence fragments with human medical expertise derives a new hypothesis not found in the literature:

Magnesium deficiency may play a role in some kinds of migraine headache

Text mining: a research example
(Chart: structured numerical or coded information, about 20%; unstructured or semi-structured information, about 80%.)
Motivation for text mining
Approximately 80% of the world’s data is held in unstructured formats (source: Oracle Corporation)
Information intensive business processes demand that we transcend from simple document retrieval to “knowledge” discovery.
Disciplines that influence text mining
Computational linguistics (NLP)
Information extraction
Information retrieval
Web mining
Regular data mining
Text Mining = Data Mining (applied to text data) + Language Engineering
Computational linguistics
Goal: automated language understanding
full automated understanding isn’t currently possible
instead, go for subgoals of text analysis, e.g.,
word sense disambiguation
phrase recognition
semantic associations
Common current approach (trend): statistical analyses over very large text collections
Consider a word like "string" or "rope." No computer today has any way to understand what those things mean. For example, you can pull something with a string, but you cannot push anything. You can tie a package with string, or fly a kite, but you cannot eat a string or make it into a balloon. In a few minutes, any young child could tell you a hundred ways to use a string − or not to use a string − but no computer knows any of this.
Computational linguistics: from text to meaning
Levels of analysis: Lexical / Morphological Analysis, Syntactic Analysis, Semantic Analysis, Discourse Analysis
Tasks: Tagging, Chunking, Word Sense Disambiguation, Grammatical Relation Finding, Named Entity Recognition, Reference Resolution

Shallow parsing of “The woman will give Mary a book”:
POS tagging: The/Det woman/NN will/MD give/VB Mary/NNP a/Det book/NN
Chunking: [The/Det woman/NN]NP [will/MD give/VB]VP [Mary/NNP]NP [a/Det book/NN]NP
Relation finding: [The woman]subject [will give] [Mary]i-object [a book]object
1990s–2000s: Statistical learning: algorithms, evaluation, corpora; trainable FSMs, trainable parsers
1980s: Standard resources and tasks: Penn Treebank, WordNet, MUC
1970s: Kernel (vector) spaces: clustering, information retrieval (IR)
1960s: Representation transformation: finite state machines (FSM) and augmented transition networks (ATNs)
1960s: Representation beyond the word level: lexical features, tree structures, networks
Archeology of computational linguistics
Information retrieval

Given:
A source of textual documents
A user query (text based), e.g., “migraines causes”

Find:
A (ranked) set of documents that are relevant to the query

(Diagram: the documents source and the query feed an IR system, which returns ranked documents.)
Evaluation measures
precision = #(retrieved ∩ relevant) / #retrieved
recall = #(retrieved ∩ relevant) / #relevant
F_β = (β² + 1) · P · R / (β² · P + R), where P = precision and R = recall
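The three measures encode directly; a sketch in which the document IDs are hypothetical:

```python
def precision_recall_f(retrieved, relevant, beta=1.0):
    """precision = |retrieved ∩ relevant| / |retrieved|,
    recall = |retrieved ∩ relevant| / |relevant|,
    F = (beta^2 + 1) * P * R / (beta^2 * P + R)."""
    hits = len(retrieved & relevant)
    p = hits / len(retrieved)
    r = hits / len(relevant)
    f = (beta ** 2 + 1) * p * r / (beta ** 2 * p + r)
    return p, r, f

# Hypothetical document IDs: 4 retrieved, 6 relevant, 3 in common.
p, r, f = precision_recall_f({1, 2, 3, 4}, {2, 3, 4, 5, 6, 7})
```

With β = 1 this reduces to the harmonic mean of precision and recall.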
foodscience.com-Job2
JobTitle: Ice Cream Guru
Employer: foodscience.com
JobCategory: Travel/Hospitality
JobFunction: Food Services
JobLocation: Upper Midwest
Contact Phone: 800-488-2611
DateExtracted: January 8, 2004
Source: www.foodscience.com/jobs_midwest.html
OtherCompanyJobs: foodscience.com-Job1
Information extraction: the process of extracting text segments from semi-structured or free text to fill data slots in a predefined template
Information extraction vs. information retrieval: finding “things”, not “pages”
What is information extraction?
Given:
A source of textual documents
A well-defined, limited query (text based)
Find:
Sentences with relevant information (i.e., identify specific semantic elements such as entities, properties, relations)
Extract the relevant information and ignore non-relevant information (important!)
Link related information and output in a predetermined format
Example: template extraction
Salvadoran President-elect Alfredo Cristiani condemned the terrorist killing of Attorney General Roberto Garcia Alvarado and accused the Farabundo Marti National Liberation Front (FMLN) of the crime …. Garcia Alvarado, 56, was killed when a bomb placed by urban guerillas on his vehicle exploded as it came to a halt at an intersection in downtown San Salvador … According to the police and Garcia Alvarado’s driver, who escaped unscathed, the attorney general was traveling with two bodyguards. One of them was injured.
Incident: Date                19 Apr 89
Incident: Location            El Salvador: San Salvador
Incident: Type                Bombing
Perpetrator: Individual ID    “urban guerillas”
Perpetrator: Organization ID  “FMLN”
Human target: Name            “Roberto Garcia Alvarado”
Zipf’s “law” and its consequence
One of curiosities in text analysis and mining (1935)
The product of the frequency of words (f) and their rank (r) is approximately constant
Rank = order of words by frequency of occurrence
f = C · (1/r), with C ≈ N/10
Always a few very frequent tokens that are not good discriminators.
Called “stop words” in Information Retrieval
Usually correspond to linguistic notion of “closed-class” words
English examples: to, from, on, and, the, ...
Grammatical classes that don’t take on new members.
Typically:
A few very common words
A middling number of medium frequency words
A large number of very infrequent words
Medium frequency words most descriptive
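A tiny numeric illustration of f = C/r with the slide's rule of thumb C ≈ N/10 (the corpus size N here is hypothetical):

```python
# Zipf's law: the frequency f of the word of rank r is f = C / r, C ≈ N/10.
N = 100_000                 # hypothetical corpus size (number of tokens)
C = N / 10
frequencies = [C / rank for rank in range(1, 11)]   # top-10 word frequencies
products = [f * rank for rank, f in zip(range(1, 11), frequencies)]
# By construction f * r stays constant at C for every rank.
```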
Text mining process
Text preprocessing: syntactic/semantic text analysis
Feature generation: bag of words
Feature selection: simple counting; statistics such as term frequency, document frequency, term proximity, document length, etc.
Text/data mining: classification, clustering, summarization, etc.
Analyzing results
Typical issues and techniques
Text Categorization (text classification)
Text Clustering
Text Summarization
Trend Detection
Relationship Analysis
Information Extraction
Question-Answering
Text Visualization
etc.
Text categorization (classification)
Task: Assignment of one or more labels from a pre-defined set to a document
Example category set
MeSH medical hierarchy
JAIST’s library, Library of Congress subject headings
Idea: Content vs. External Meta-Data
Techniques: Supervised classification
Decision trees
Naïve Bayesian classification
Support Vector Machines
etc.
Text clustering
Task: Detecting topics within a document collection, assigning documents to those topics, and labeling these topic clusters
Scatter/Gather clustering:
Cluster sets of documents into general “themes”, like a table of contents
Display the contents of the clusters by showing typical terms and typical titles
User chooses subsets of the clusters and re-clusters the documents within
Resulting new groups have different “themes”
Techniques: Different clustering techniques (similarity between texts, texts and word densities, etc.)
Text summarization
A text is entered into the computer and a summarized text is returned: a non-redundant extract from the original text.
A process of text summarization:
Sentence extraction: find a set of important sentences that covers the gist of the text document
Sentence reduction: convert a long sentence to a short one without losing the meaning
Sentence combination: combine sentences to make a text.
Emerging trend detection
Task: Detecting topic areas that are growing in interest and utility over time (emerging trends)

Example: number of documents per year matching the query:
Year   Number of documents
1994   3
1995   1
1996   8
1997   10
1998   170
1999   371
COE project: Can we detect emerging trends in materials science, information science, and biology?
INSPEC®[INS] database search on keyword “XML”
KDD’03 challenges: from arXiv (since 1991), 500,000 articles on High Energy Particle Physics; predict, say, the number of citations of a given article in a given period, etc.
“Find sales trends by product and correlate with occurrences of company name in business news articles”
Question Answering
Task: Give an answer to a question
(document retrieval: find documents relevant to query)
Example:
Who invented the telephone?Alexander Graham Bell
When was the telephone invented?1876
(Buchholz & Daelemans, 2001)
Imagine how to automatically answer such a question?
Text visualization
Network Maps
Landscapeshttp://www.lexiquest.com
http://www.aurigin.com
Web mining
Finding unknown useful information from the World Wide Web.
Data mining turns data into knowledge.
Web mining is to apply data mining techniques to extract and uncover knowledge from web documents and services.
Data Mining and Web Mining
Web:
A huge, widely-distributed
Highly heterogeneous
Semi-structured
Hypertext/hypermedia
Interconnected information repository
Web is a huge collection of documents plus
Hyper-link information
Access and usage information
WWW specifics
Web user tasks
Finding relevant information:
The user usually issues a simple keyword query and receives a list of ranked pages
Current problems: low precision (irrelevance of search results) and low recall (inability to index all the information on the Web)
Creating new knowledge over the existing data:
We want to extract knowledge from Web data (assuming we have it)
Personalizing the information:
People differ in the contents and presentations they prefer while interacting with the Web
Learning about consumers and individual users:
Knowing what customers do and want
Web Content Mining
Discovery of information from Web contents (various types of data such as textual, image, audio, video, hyperlinks, etc.)
Web Structure Mining
Discovery of the model underlying the link structures of the Web. The model is based on the topology of the hyperlinks, with or without descriptions of the links.
Web Usage Mining
Discovery of information from web users’ sessions and behaviors (secondary data derived from the interactions of the users while interacting with the Web).
Web mining taxonomy
Web Mining
  Web Content Mining
    Web Page Content Mining
    Search Result Mining
  Web Structure Mining
  Web Usage Mining
    General Access Pattern Tracking
    Customized Usage Tracking

Web mining taxonomy
Web Page Content Mining (web page summarization):
WebLog (Lakshmanan et al. 1996), WebSQL (Mendelzon et al. 1998), …: Web structuring query languages; can identify information within given web pages
Ahoy! (Etzioni et al. 1997): uses heuristics to distinguish personal home pages from other web pages
ShopBot (Etzioni et al. 1997): looks for product prices within web pages
Search Result Mining (search engine result summarization):
Clustering Search Results (Leouski and Croft, 1996; Zamir and Etzioni, 1997): categorizes documents using phrases in titles and snippets
Web Structure Mining:
Using links: PageRank (Brin and Page, 1996), HITS (Kleinberg, 1996), CLEVER (Chakrabarti et al., 1998) use interconnections between web pages to give weight to pages
Web communities: communities crawling (Kumar et al., 1999), etc.
Using generalization: MLDB (1994), VWV (1998) use a multi-level database representation of the Web; counters (popularity) and link lists are used for capturing structure
General Access Pattern Tracking:
Web Log Mining (Zaïane, Xin and Han, 1998): uses KDD techniques to understand general access patterns and trends
Customized Usage Tracking:
Adaptive Sites (Perkowitz and Etzioni, 1997): analyzes access patterns of one user at a time; the web site restructures itself automatically by learning from user access patterns
1. Resource finding: The task of retrieving intended Web documents
2. Information selection and pre-processing: Automatically selecting and preprocessing specific information from retrieved Web resources
3. Generalization: automatically discovering general patterns at individual Web sites as well as across multiple sites
4. Analysis: validation and/or interpretation of the mined patterns
Web mining process
Tree map
Cone tree
Fisheye view
Hyperbolic tree
MagicLens
Data and knowledge visualization
SPSS
IBM
Silicon Graphics
SAS
Salford Systems
RuleQuest Research (C4.5)
KDD products and tools
Outline
Why knowledge discovery and data mining?
Basic concepts of KDD
KDD techniques: classification, association, clustering, text and Web mining
Challenges and trends in KDD
Case study in medicine data mining
Different types of data in different forms (mixed numeric, symbolic, text, image, voice, …) [Problems: quality, effectiveness?]
Large data sets (10^6-10^12 bytes) and high dimensionality (10^2-10^3 attributes) [Problems: efficiency, scalability?]
Data and knowledge are changing
Human-Computer Interaction and Visualization
Challenges of KDD
3 attributes, each with 2 values: #instances = 2^3 = 8, #patterns = 3^3 = 27
What if #attributes increases?
The sizes of the instance space and pattern space increase exponentially.
With p attributes, each having d values, the size of the instance space is d^p.
38 attributes, each with 10 values: #instances = 10^38
Large datasets and high dimensionality
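The counts above can be checked with a few lines; the only assumption made here is that a "pattern" lets each attribute take one of its d values or a don't-care, which reproduces the 27 patterns for 3 binary attributes.

```python
# Size of the instance space and pattern space for p attributes with d values each.
# Assumption: a "pattern" allows each attribute to take one of its d values or a
# wildcard ("don't care"), giving (d + 1) ** p patterns.

def instance_space_size(p: int, d: int) -> int:
    return d ** p

def pattern_space_size(p: int, d: int) -> int:
    return (d + 1) ** p

print(instance_space_size(3, 2))    # 8 instances for 3 binary attributes
print(pattern_space_size(3, 2))     # 27 patterns
print(instance_space_size(38, 10))  # 10**38 instances
```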
Scalable and efficient algorithms (scalable: given an amount of main memory, its runtime increases linearly with the number of input instances)
Sampling (instance selection)
Dimensionality reduction (feature selection)
Approximation methods
Massively parallel processing
Integration of machine learning and database management
Possible solutions
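As a sketch of the sampling idea listed above (not a method from the lecture), reservoir sampling draws a uniform fixed-size sample from a dataset in a single pass with constant memory:

```python
import random

def reservoir_sample(stream, k, seed=0):
    """Uniformly sample k instances from a data stream in one pass, O(k) memory."""
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)
        else:
            j = rng.randint(0, i)  # new item survives with probability k/(i+1)
            if j < k:
                sample[j] = item
    return sample

print(reservoir_sample(range(10 ** 6), 5, seed=42))
```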
Attribute types (symbolic vs. numerical):
Nominal (categorical): no structure; operations =, ≠; examples: Places, Color
Ordinal: ordinal structure; operations =, ≠, ≥; examples: Rank, Resemblance
Measurable: ring structure; operations =, ≠, ≥, +, ×; examples: Age, Temperature, Taste, Income, Length
Symbolic data: combinatorial search in hypothesis spaces (machine learning)
Numerical data: often matrix-based computation (multivariate data analysis)
Numerical vs. symbolic data
Attribute selection
Pruning trees
From trees to rules (high cost of pruning)
Visualization
Data access: recent development on very large training sets, fast, efficient and scalable (in-memory and secondary storage)
(well-known systems: C4.5 and CART)
Mining with decision trees
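Attribute selection in decision tree learners such as C4.5 is typically driven by information gain; here is a minimal sketch (the toy data and attribute names are illustrative, not from the lecture):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr_index):
    """Entropy reduction from splitting (rows, labels) on one attribute."""
    n = len(rows)
    by_value = {}
    for row, label in zip(rows, labels):
        by_value.setdefault(row[attr_index], []).append(label)
    remainder = sum(len(s) / n * entropy(s) for s in by_value.values())
    return entropy(labels) - remainder

# Toy data: (outlook, windy) -> class
rows = [("sunny", "no"), ("sunny", "yes"), ("rain", "no"), ("rain", "yes")]
labels = ["yes", "yes", "no", "no"]
print(information_gain(rows, labels, 0), information_gain(rows, labels, 1))  # 1.0 0.0
```

Outlook perfectly separates the classes (gain 1.0 bit), while windy tells us nothing (gain 0.0), so a tree builder would split on outlook first.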
SLIQ (Mehta et al., 1996): builds an index for each attribute; only the class list and the current attribute list reside in memory
SPRINT (J. Shafer et al., 1996): constructs an attribute-list data structure
PUBLIC (Rastogi & Shim, 1998): integrates tree splitting and tree pruning, stopping tree growth earlier
RainForest (Gehrke, Ramakrishnan & Ganti, 1998): separates the scalability aspects from the criteria that determine the quality of the tree; builds an AVC-list (attribute, value, class label)
Scalable decision tree induction methods
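The AVC-list idea of RainForest can be sketched as follows: one scan over the data collects, per attribute, the (value, class) counts, which is all a split criterion needs. This is a simplified illustration under that assumption, not the original implementation:

```python
from collections import defaultdict

def build_avc_sets(rows, labels, n_attrs):
    """One data scan builds, per attribute, the (value -> class -> count) table
    (an AVC-set); split criteria such as gini or information gain need only
    these counts, never the rows themselves."""
    avc = [defaultdict(lambda: defaultdict(int)) for _ in range(n_attrs)]
    for row, label in zip(rows, labels):
        for a in range(n_attrs):
            avc[a][row[a]][label] += 1
    return avc

# Toy data: (outlook, windy) -> class
rows = [("sunny", "no"), ("sunny", "yes"), ("rain", "no"), ("rain", "yes")]
labels = ["play", "play", "stay", "stay"]
avc = build_avc_sets(rows, labels, 2)
print(dict(avc[0]["sunny"]))  # {'play': 2}
```

The AVC-sets are tiny compared to the data (one entry per distinct attribute value and class), which is what makes the approach scale.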
Effectively address the weakness of the symbolic AI approach in knowledge discovery (growth of the hypothesis space)
Extracting or making sense of the numeric weights associated with the interconnections of neurons, to arrive at a higher level of knowledge, has been and will continue to be a challenging problem
Mining with neural networks
Mining with association rules
Improving the efficiency
Database scan reduction: partitioning (Savasere 95), hashing (Park 95), sampling (Toivonen 96), dynamic itemset counting (Brin 97), finding non-redundant rules (3,000 times fewer rules; Zaki, KDD 2000)
Parallel mining of association rules
New measures of association
Interestingness and exceptional rules
Generalized and multiple-level rules
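To make support and confidence concrete, here is a brute-force frequent-itemset counter; real Apriori-style miners prune candidates level by level, and the transactions below are made up for illustration:

```python
from itertools import combinations
from collections import Counter

def frequent_itemsets(transactions, min_support, max_size=3):
    """Naive frequent-itemset counting: support(X) = fraction of transactions
    containing X. (Apriori would prune candidates level-wise; this brute-force
    version only illustrates the definitions.)"""
    n = len(transactions)
    counts = Counter()
    for t in transactions:
        for size in range(1, max_size + 1):
            for itemset in combinations(sorted(t), size):
                counts[itemset] += 1
    return {s: c / n for s, c in counts.items() if c / n >= min_support}

transactions = [{"bread", "milk"}, {"bread", "beer"},
                {"bread", "milk", "beer"}, {"milk"}]
freq = frequent_itemsets(transactions, min_support=0.5)
print(freq[("bread",)], freq[("bread", "milk")])  # 0.75 0.5
# Confidence of the rule bread -> milk = supp(bread, milk) / supp(bread):
print(freq[("bread", "milk")] / freq[("bread",)])
```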
Mining scientific data
Data Mining in Bioinformatics
Data Mining in Astronomy and Earth Sciences
Mining Physics and Chemistry data
Mining Large Image Databases
etc.
Scalable and efficient algorithms (scalable: given an amount of main memory, runtime increases linearly with the number of input instances)
Massively parallel processing: data-parallel vs. control-parallel data mining
Client/server frameworks for parallel data mining
(See: Mining Very Large Databases with Parallel Processing, Alex A. Freitas & Simon H. Lavington, Kluwer Academic Publishers, 1998)
Solutions to mining huge datasets
Mixed Similarity Measures (MSM):
Goodall (1966): time O(n^3); Diday and Gowda (1992); Ichino and Yaguchi (1994)
Li & Biswas (1997): time O(n^2 log n^2), space O(n^2)
New and efficient MSM (Binh & Bao, 2000): time and space O(n), using the relation P̂*_ij = 1 − P̂_ij
Example of a scalable algorithm
US Census database: 33 symbolic + 8 numeric attributes; Alpha 21264, 500 MHz, 2 GB RAM, Solaris OS (Nguyen N.B. & Ho T.B., PKDD 2000)

#cases (size)            500 (0.2M)  1,000 (0.5M)  1,500 (0.9M)  2,000 (1.1M)  5,000 (2.6M)  10,000 (5.2M)  199,523 (102M)
#values                  497         992           1,486         1,973         4,858         9,651          97,799
Time, LiBis O(n^2logn^2) 67.3s       26m6.2s       1h46m31s      6h59m45s      >60h          not app        not app
Time, ours O(n)          0.1s        0.2s          0.3s          0.5s          2.8s          9.2s           36m26s
Memory, LiBis O(n^2)     5.3M        20.0M         44.0M         77.0M         455.0M        not app        not app
Memory, ours O(n)        0.5M        0.7M          0.9M          1.1M          2.1M          3.4M           64.0M
Preprocessing            0.1s        0.1s          0.2s          0.5s          0.9s          6.2s           127.2s
Comparative results
High performance computing for NLP
Experimental environments
Massively parallel computer (Cray XT3): 90 nodes, each with four 2.4 GHz processors and 32 GB RAM (total: 360 processors, 2.88 TB RAM)
Linux OS, using C/C++ and MPI library
Experiments for POS tagging and chunking
24 sections of WSJ (Penn TreeBank): about 1,000,000 words (more than 40,000 English sentences)
Application: part-of-speech tagging. Highest F1 score: 96.96% (using only first-order Markov CRFs)
Number of parallel processes | Training time (minutes) | Speed-up ratio
1                            | 6125                    | 1.00
2                            | 3278                    | 1.87
10                           | 678                     | 9.03
30                           | 224                     | 27.34
50                           | 140                     | 43.75
70                           | 102                     | 60.05
90                           | 80                      | 76.56
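The speed-up ratios in the table are simply the single-process training time divided by the p-process time, which can be verified directly:

```python
# Speed-up ratio = training time with 1 process / training time with p processes.
times = {1: 6125, 2: 3278, 10: 678, 30: 224, 50: 140, 70: 102, 90: 80}  # minutes

for p, t in times.items():
    print(p, round(times[1] / t, 2))  # e.g. 90 processes -> 76.56
```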
Outline
Why knowledge discovery and data mining?
Basic concepts of KDD
KDD techniques: classification, association, clustering, text and Web mining
Challenges and trends in KDD
Case study in medicine data mining
Background
HBV and HCV are viruses that both cause continuous inflammation of the liver (chronic hepatitis).
The inflammation results in liver fibrosis and finally, after 20 to 30 years, liver cirrhosis (LC), which is diagnosed by liver biopsy.
In addition, cirrhosis patients have a high potential risk of hepatocellular carcinoma (HCC).
Physicians can treat viral hepatitis with interferon (IFN); however, IFN is not always effective and can have severe side effects.
[Figure: the natural course of hepatitis: the fibrosis stage progresses F0, F1, F2, F3, F4 over time, 20-30 years from the onset of infection to LC and HCC]
The natural course of hepatitis
The course of HCC?
[Figure: IFN applied at some fibrosis stage (F0-F4) alters the course over time from the onset of infection toward LC and HCC]
The effect of interferon therapy
Effectiveness of interferon?
Source: First Department of Internal Medicine, Chiba University School of Medicine
Data on about 800 patients collected over 20 years
Characteristics of the data:
Large-scale, uncleaned time-series data
Very many examination items
Values and precision of each examination item differ with the examination period; many missing values
Bias introduced by physicians
Time-series data
Example of the hepatitis dataset
Sequences of length 179 for MID1 and 88 for MID2, and they are irregular
P1. Differences in temporal patterns between hepatitis B and C? (HBV, HCV)
P2. Evaluate whether laboratory examinations can be used to estimate the stage of liver fibrosis (F0, F1, F2, F3, F4)
P3. Evaluate whether the interferon therapy is effective or not (Response, Partial response, Aggravation, No response)
Problems under consideration
To perform any kind of medical problem solving, patient data have to be “matched” against medical knowledge.
Patient data mostly comprise numeric measurements of various parameters at different points in time.
Medical knowledge is usually expressed in the form of symbolic statements that are as general as possible.
Why data abstraction?
What is temporal abstraction?
ZTT was first increasingly high, then changed to the normal region and remained stable
ZTT
normal region
Idea: convert time-stamped points to a symbolic, interval-based representation of the data. Characteristic: no detail, but the essence of the trend and state changes of patients.
ZTT: H>N−S
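A minimal sketch of this kind of abstraction, assuming a single normal range and collapsing consecutive identical states into a change sequence (the actual method in the study is richer than this):

```python
def abstract_states(values, normal_low, normal_high):
    """Map time-stamped values to symbolic states L (low), N (normal), H (high),
    then collapse runs into a change sequence, e.g. H,H,H,N,N -> "H>N"."""
    def state(v):
        if v < normal_low:
            return "L"
        return "N" if v <= normal_high else "H"

    states = [state(v) for v in values]
    collapsed = [states[0]]
    for s in states[1:]:
        if s != collapsed[-1]:
            collapsed.append(s)
    return ">".join(collapsed)

# A ZTT-like sequence: high at first, then back inside the normal region:
print(abstract_states([30, 28, 25, 10, 9, 11], normal_low=4, normal_high=12))  # H>N
```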
1. Develop temporal abstraction methods for describing hepatitis data appropriately for each problem.
2. Use data mining methods to solve the problems from the abstracted data.
Research objectives
Key issues in temporal abstraction
1. Define a description language for abstraction patterns (requirement: simple but rich enough to describe abstract patterns)
2. Determine the basic abstraction patterns (requirement: typical and significant primitives needed for the analysis purpose)
3. Transform each sequence into temporal abstraction patterns (requirement: efficiently characterize the trends and changes in the temporal data)
Example: ZTT: H>N−S
[Figure: example laboratory test sequences plotted from 2/19/1981 to 2/19/2001, values 0-600]
Typical tests by physicians
Short-term changed tests: concerning inflammation, changed quickly in days or weeks; can be much higher (even 40 times) than the normal range, with many peaks. Up: GPT, GOT, TTT, ZTT.
Long-term changed tests: concerning liver status, changed slowly in months or years; do not much exceed the normal range. Down: T-CHO, CHE, ALB, TP (liver products), PLT, WBC, HGB. Up: T-BIL, D-BIL, I-BIL, AMONIA, ICG-15.
Observation of temporal sequences
Make a tool in Matlab to visualize temporal sequences
Observe a large number of temporal sequences for different patients and tests
[Figure: example laboratory test sequences plotted from 2/19/1981 to 2/19/2001, values 0-600]
Ideas of basic patterns
Short-term changed tests (Up: GPT, GOT, TTT, ZTT). Idea: base state and peaks.
Long-term changed tests (Down: T-CHO, CHE, ALB, TP (liver products), PLT, WBC, HGB; Up: T-BIL, D-BIL, I-BIL, AMONIA, ICG-15). Idea: change of states (compactly capture both the state and the trend of the sequence).
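The "base state and peaks" idea for short-term tests can be sketched as follows, taking the median as the base state and points above a multiple of the upper normal bound as peaks; the threshold choice here is an assumption for illustration, not the study's definition:

```python
from statistics import median

def base_state_and_peaks(values, normal_high, peak_factor=2.0):
    """Summarize a short-term test sequence by its base (median) value and the
    indices of its peaks, taken here as points above peak_factor * normal_high."""
    base = median(values)
    peaks = [i for i, v in enumerate(values) if v > peak_factor * normal_high]
    return base, peaks

values = [30, 35, 400, 32, 28, 250, 31]
base, peaks = base_state_and_peaks(values, normal_high=40)
print(base, peaks)  # 32 [2, 5]
```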
Temporal abstraction approach
8 main patterns for short-term changed test items; 21 patterns for long-term changed test items
Using a discovery algorithm, the change characteristics of each test item are classified into "long-term changed" and "short-term changed" items
Two temporal abstraction methods
Abstraction pattern extraction (APE)
Mapping each given temporal sequence of fixed length into one of pre-defined temporal patterns
(2001~)
Temporal relation extraction (TRE)
Detect temporal relations between basic patterns, and extract rules using temporal relations
(2004~)
Data and knowledge visualization
Simultaneously view the data in different forms: top-left is original data; top-right is histogram of attributes; lower-left is view by parallel coordinates; and lower-right is relations between a conjunction of attribute-value pairs and the class labels.
View an individual rule in D2MS: top-left window shows the list of discovered rules, the middle-left and the top-right windows show a rule under inspection, and bottom window displays the instances covered by that rule.
LC vs. non-LC
Effectiveness of interferon (LUPC rules)
GOT & GPT occurred as VH or EH in no_response rules
CHE occurred as N/L or L-D in partial_response rules
D-BIL occurred as N/H, H>N in no_response and partial_response rules
Temporal relations
A is equal to B / B is equal to A
A is before B / B is after A
A meets B / B is met by A
A overlaps B / B is overlapped by A
A starts B / B is started by A
A finishes B / B is finished by A
A is during B / B contains A
(Allen’s Temporal Logic, 1984)
Relations between two basic patterns, each happening in a period of time; for example, "ALB went down right after heavy inflammation finished".
Updating a graph of temporal relations (Allen, 1983) and temporal logic (Allen, 1984)
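Allen's relations between two closed intervals can be computed by comparing endpoints; here is a sketch covering all thirteen relations (the seven above plus their inverses):

```python
def allen_relation(a, b):
    """Classify the Allen relation between closed intervals a = (a1, a2) and
    b = (b1, b2), from a's point of view; the inverse holds from b's side."""
    (a1, a2), (b1, b2) = a, b
    if (a1, a2) == (b1, b2):
        return "equal"
    if a2 < b1:
        return "before"
    if b2 < a1:
        return "after"
    if a2 == b1:
        return "meets"
    if b2 == a1:
        return "met-by"
    if a1 == b1:
        return "starts" if a2 < b2 else "started-by"
    if a2 == b2:
        return "finishes" if a1 > b1 else "finished-by"
    if b1 < a1 and a2 < b2:
        return "during"
    if a1 < b1 and b2 < a2:
        return "contains"
    return "overlaps" if a1 < b1 else "overlapped-by"

print(allen_relation((1, 3), (3, 6)))  # meets
print(allen_relation((1, 4), (2, 6)))  # overlaps
print(allen_relation((2, 3), (1, 6)))  # during
```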
Findings for HBV and HCV
Findings are different from general medical observations(no clear distinction between type B and C)
R#13 (HBV): “ALP changed from low to normal state” AFTER “LDH changed from low to normal state” (supp. count = 21, conf. = 0.71)
R#5 (HCV): “ALP changed from normal to high state” AFTER “LDH changed from normal to low state” (supp. count = 60, conf. = 0.80)
“Quantitatively” confirm findings in medicine (Medline abstracts)
R#53 (HCV): “ALB changed from normal to low state” BEFORE “TTT in high state with peaks” AND “ALP from normal to high state” BEFORE “TTT in high state with peaks” (supp. count = 10, conf. = 1.00)
Murawaki et al. (2001): the main difference between HBV and HCV is that the base state of TTT in HBV is normal, while that of HCV is high.
Findings for LC and non-LC
Typical relations in non-LC rules“GOT in high or very high states with peaks”BEFORE “TTT in high state with peaks” (20 rules contain this temporal relation)
Typical relations in LC rules“GOT in high or very high states with peaks”AFTER “TTT in high or very high states with peaks” (10 rules contain this temporal relation).
(Project period: 2004-2007)
A combined approach to hepatitis patient history data:
Data mining
Text mining from Medline
Evaluation and suggestions by domain experts
Medical data mining from multiple information sources
Conclusion
Temporal abstraction has been shown to be a good alternative approach to the hepatitis study.
The results are comprehensible to physicians, and significant rules were found.
Much work remains to be done in solving the three problems, including:
Integrating qualitative and quantitative information
Combining with text mining techniques
Incorporating domain expert knowledge
Summary
KDD is motivated by the wide availability of huge amounts of data and the imminent need for turning such data into useful information and knowledge.
There are different methods to find predictive or descriptive models that strongly depend on the data schemes.
The KDD process is necessarily interactive and iterative, and requires human participation in all steps of the process.
http://www.kdnuggets.com
David J. Hand, Heikki Mannila and Padhraic Smyth, Principles of Data Mining, MIT Press, 2001
Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2000
U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, R. Uthurusamy, editors, Advances in Knowledge Discovery and Data Mining, AAAI/MIT Press, 1996.
Recommended references
Profs. S. Ohsuga, M. Kimura, H. Motoda, H. Shimodaira, Y. Nakamori, S. Horiguchi, T. Mitani, T. Tsuji, K. Satou, among others
Nguyen N.B., Nguyen T.D., Kawasaki S., Huynh V.N., Dam H.C., Tran T.N., Pham T.H., Nguyen D.D., Nguyen L.M., Phan X.H., Le M.H., Hassine B.A., Le S.Q., Zhang H., Nguyen C.H., Nagai K., Nguyen C.H., Nguyen T.P., Tran D.H.
Acknowledgments