Efficient Text and Semi-structured Data Mining: Towards Knowledge Discovery in the Cyberspace

Date posted: 18-Dec-2015

Page 1:

Efficient Text and Semi-structured Data Mining: Towards Knowledge Discovery in the Cyberspace

Hiroki Arimura
Department of Informatics, Kyushu University, Japan

Joint work with Tatsuya Asai, Shinji Kawasoe, Kenji Abe, Junichiro Abe, Ryoichi Fujino, Hiroshi Sakamoto, Setsuo Arikawa (Kyushu Univ.), Shinichi Shimozono (Kyutech)

Supported by Grant-in-aid for Scientific Research on Priority Area "Discovery Science" and "Informatics"; Japan Science & Tech Co., PRESTO

Page 2:

Outline

Efficient Text Data Mining
• Fast and Robust Text Mining Algorithm (ALT'98, ISAAC'98, DS'98)
• Efficient Text Index for Data Mining (CPM'01, CPM'02)
• Text Mining on External Storage (PAKDD'00)
• Applications
  – Interactive document browsing
  – Keyword discovery from the Web

Towards Semi-structured Data Mining
• Efficient Frequent Tree Miner (SDM'02, PKDD'02)
• Mining Semi-structured Data Streams (ICDM'02)
• Information Extraction from Web (GI'00, FLAIRS'01)

Conclusion

Page 3:

Efficient Text Data Mining with Optimized Pattern Discovery

Joint work with Junichiro Abe, Ryoichi Fujino, Hiroshi Sakamoto, Setsuo Arikawa (Kyushu Univ), Shinichi Shimozono (Kyutech)

Page 4:

Large Text Databases

Large text databases emerged in the early 1990s with the rapid progress of computer and network technologies.

– Web pages (OPENTEXT Index, GBs to TBs)
– Collections of XML / SGML documents
– Genome databases (GenBank, PIR)
– Bibliographic databases (INSPEC, ResearchIndex)
– E-mails or plain texts on a file system

The data are huge, heterogeneous, and unstructured: traditional database and data mining technologies cannot work!

Page 5:

Our Research Goal

Text mining as an inverse of IR
• Develop a new access tool for text data that interactively supports human discovery from large text collections
• Key: fast and robust text mining methods

[Figure: a user interacts with a text data mining system built on text databases (Web pages, XML/SGML archives, genome databases, e-mails, text files); sample strings such as AGGAGGTCACA, CCAAA, and AACACTGTGTGACA illustrate the raw text.]

Page 6:

Browsing a large collection of documents with unknown vocabulary and structure

Reuters-21578: 21578 articles from Reuters newswires from Feb. to Oct. 1987 on economy and international affairs

[Figure: phrases found in the collection, including <vessels >, <ships>, <gulf >, <shipping >, <iranian >, <port >, <iran >, <the gulf >, <strike >, <attack >, <silk worm missile>, <us.>, <wheat>, <dallers>, <sea men>. Figure labels: text data mining for survey and browsing; information retrieval; direct browsing.]

Page 7:

Proximity Word Association Patterns

Association rules over arbitrary subwords.
• Ordered: the subwords must occur in a fixed order.
• Proximity: the distance between consecutive subwords is within a constant k (the proximity).

Pattern: CACA * AGGAT * CCAA

Sample texts:
GTGTCACATGTTTCTGTAGAAAGAGGTCCACACA AGGAT CCAA
GTGTCACAAATTCTGTAGTATCA

Parameters: the maximum number of substrings d and the proximity k
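Note (illustrative sketch, not from the talk): one way to test whether a k-proximity d-word pattern matches a text. It assumes that the distance is measured between the start positions of consecutive subwords, which the later "position space" slide suggests but this slide does not state. The names find_all and matches are hypothetical.

```python
def find_all(text, word):
    """Return every start position of `word` in `text` (overlaps allowed)."""
    positions = []
    start = text.find(word)
    while start != -1:
        positions.append(start)
        start = text.find(word, start + 1)
    return positions


def matches(text, words, k):
    """True if `words` occur in order and consecutive start positions
    p, q satisfy 0 < q - p <= k (assumed meaning of proximity)."""
    # `reachable` holds start positions of the current word that extend
    # some valid chain of occurrences of the previous words.
    reachable = find_all(text, words[0])
    for word in words[1:]:
        reachable = [q for q in find_all(text, word)
                     if any(0 < q - p <= k for p in reachable)]
        if not reachable:
            return False
    return bool(reachable)
```

With the first sample text of this slide, the pattern (CACA, AGGAT, CCAA) matches for k = 8 but not for k = 4 under this definition; the second sample text contains no AGGAT and never matches.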

Page 8:

Related Research

• Feldman and Dagan (KDD'96)
  – Association rules over keywords: "Arab", "Egypt", "Iran" => "Oil"
  – Using the Apriori-style algorithm of Agrawal et al. (1994)
• Motwani (SIGMOD'97)
  – Correlations over keywords
• Mannila and Toivonen (KDD'96)
  – Episode patterns (partially ordered sets of events)
• Wang, Chirn, Marr, Shapiro, Shasha, Zhang (SIGMOD'94)
  – Word association patterns without proximity, e.g. AGAG * TATA * AGAT
  – A generate-and-test algorithm + heuristics
  – Implementation for d = 2, or d = 1 + approximate matching
• Iliopoulos, Makris, Sioutas, Tsakalidis, Tsichlas (CPM'02, this conference!)
  – Model Identification Problem for maximal pairs of strings (2-dim)
  – Common or frequent pattern discovery for d = 2 and proximity

Page 9:

How to find?

What properties separate the target data from the rest of the data?

Goal: to find those patterns that characterize the target collection

Page 10:

Optimized Rule/Pattern Discovery

• Data mining: optimized data mining (IBM DM group, 1996 - 2000)
• Learning theory: agnostic PAC learning (1990s)
• Statistics: Vapnik & Chervonenkis theory (1970s)

Minimization of
• Prediction error
• Information entropy
• Gini index

p: ratio of positives that a pattern matches

[Figure: a good rectangle (8 positives, 2 negatives) and a bad rectangle (9 positives, 9 negatives); plot of the impurity function f(p) against p from 0% to 100%, with 50% marked.]

Page 11:

Optimized Rule/Pattern Discovery

Goodness of a pattern
= goodness of the split by the pattern
= weighted average of the values of the impurity function at the matched and unmatched sets

[Figure: a pattern applied to a population of size N; plot of the impurity function f(p) against p from 0% to 100%, with 50% marked.]

Page 12:

Optimized Rule/Pattern Discovery

Split! A pattern π splits the population S of size N into the matched set S1 (population N1) and the unmatched set S0 (population N0).

Evaluation function for pattern π:

$$G_{S,\xi}(\pi) = \frac{N_1}{N}\,\psi\!\left(\frac{M_1}{N_1}\right) + \frac{N_0}{N}\,\psi\!\left(\frac{M_0}{N_0}\right)$$

Here ψ is the impurity function, applied to the ratios M1/N1 and M0/N0 in the matched and unmatched sets.

Page 13:

Optimal Pattern Discovery Problem

Given: a set S of documents and an objective function ξ: S → {0, 1}.

Problem: find an optimal pattern π that minimizes the evaluation function

$$G_{S,\xi}(\pi) = \frac{N_1}{N}\,\psi\!\left(\frac{M_1}{N_1}\right) + \frac{N_0}{N}\,\psi\!\left(\frac{M_0}{N_0}\right)$$

where ψ is the impurity function.
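Note (illustrative sketch, not from the talk): the evaluation function G above, written as a weighted average of an impurity function over the matched and unmatched sets. It assumes that M1 and M0 count the positive documents inside the matched and unmatched sets, and that the entropy uses natural logarithms. All function names are hypothetical.

```python
import math


def prediction_error(p):
    """Error of the majority vote on a set with positive ratio p."""
    return min(p, 1.0 - p)


def entropy(p):
    """Binary entropy with natural logarithms; 0 at p = 0 and p = 1."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log(p) - (1.0 - p) * math.log(1.0 - p)


def gini(p):
    """Gini index of a two-class set with positive ratio p."""
    return 2.0 * p * (1.0 - p)


def evaluate(m1, n1, m0, n0, impurity=entropy):
    """G = (N1/N) * impurity(M1/N1) + (N0/N) * impurity(M0/N0).

    n1, n0: sizes of the matched and unmatched sets.
    m1, m0: positives inside each set (assumed meaning of M1, M0).
    An empty side contributes nothing to the weighted average.
    """
    n = n1 + n0
    score = 0.0
    if n1:
        score += (n1 / n) * impurity(m1 / n1)
    if n0:
        score += (n0 / n) * impurity(m0 / n0)
    return score


# Counts for <gulf > from the Reuters listing later in the talk:
# 209 positives and 23 negatives matched, out of 2970 and 7887.
gulf = evaluate(209, 209 + 23, 2970 - 209, (2970 - 209) + (7887 - 23))
```

Under these assumptions the value for <gulf > comes out close to the eval figure 0.568 printed in the Reuters listing, which suggests, but does not prove, that the prototype computes the same quantity.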

Page 14:

Relation to Robust Probabilistic Learning

Statistical decision theory in the 1970s
• Vapnik & Chervonenkis theory (1970s)

Computational learning theory in the 1990s
• Agnostic PAC learning / robust trainability (Kearns et al. '92)
• An algorithm that efficiently solves the classification error minimization problem is an efficient robust learner; that is, from the viewpoint of classification it can approximate an arbitrary distribution that generates the examples (Haussler 1990).
• Intractable in general (Kearns et al. 1992)

Empirical machine learning in the 1990s
• The power of simple rules + rigorous optimization (Weiss; Holte)

Data mining & COLT in the mid-1990s
• Efficient algorithms for simple geometric patterns

Page 15:

Application to Text Mining

Traditional method (frequency-based)
• Finds the most frequent patterns in the target set T.
• Many trivial patterns (stop words) may hide less frequent but interesting patterns.
• Traditional stop-word elimination in IR may not work.

[Figure: frequency against vocabulary for the target dataset. Labelled items: the stop words "the", "a", "an", "that", "of", "with", and the phrases "iranian oil", "kuwaiti tanker", "attack", "Iranian oil platform", "Silkworm missile".]

Page 16:

Application to Text Mining

Optimized data mining
• Finds optimal patterns.
• Uses an average dataset B of documents as a control dataset for canceling trivial patterns.
• Finds those patterns that appear more frequently in the target set T and less frequently in the control set B.

[Figure: frequency against vocabulary for the target dataset and the background dataset. Labelled items: the stop words "the", "a", "an", "that", "of", "with", and the phrases "iranian oil", "kuwaiti tanker", "attack", "Iranian oil platform", "Silkworm missile".]


Page 18:

Straightforward algorithm
Case: the number d of substrings is bounded

Procedure:
– Enumerate all $O(n^{2d})$ proximity patterns built from the $O(n^2)$ subwords of the text.
– For each pattern p, compute the score in linear time.

The straightforward algorithm requires $O(n^{2d+1})$ time and is too slow to apply to real-world databases.

We require more efficient algorithms that run in time, say, O(n) to O(n log n) on real datasets.
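Note (illustrative sketch, not from the talk): the straightforward procedure for the simplest case d = 1, where a pattern is a single substring. Every distinct substring is enumerated and scored with one scan over the documents. The entropy impurity, the tie-breaking rule, and the names score and best_substring are assumptions made for this example.

```python
import math


def entropy(p):
    """Binary entropy with natural logarithms; 0 at p = 0 and p = 1."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log(p) - (1.0 - p) * math.log(1.0 - p)


def score(pattern, docs, labels):
    """Weighted impurity of the split induced by `pattern`
    (one linear scan over the documents)."""
    n1 = m1 = 0
    for doc, label in zip(docs, labels):
        if pattern in doc:
            n1 += 1
            m1 += label
    n = len(docs)
    n0, m0 = n - n1, sum(labels) - m1
    total = 0.0
    if n1:
        total += n1 / n * entropy(m1 / n1)
    if n0:
        total += n0 / n * entropy(m0 / n0)
    return total


def best_substring(docs, labels):
    """Generate and test every distinct substring of every document."""
    candidates = {doc[i:j] for doc in docs
                  for i in range(len(doc))
                  for j in range(i + 1, len(doc) + 1)}
    # Ties: prefer shorter, then lexicographically smaller patterns.
    return min(candidates, key=lambda s: (score(s, docs, labels), len(s), s))
```

Because the impurity is symmetric, the best pattern may characterize either the positive or the negative documents. The quadratic number of candidates per document is what makes this approach impractical beyond toy inputs.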

Page 19:

Theoretical result: Positive
Case: the number d of substrings is bounded

Algorithm details

Theorem: For a set of random texts of total size N, the Split-Merge algorithm finds all the k-proximity d-word association patterns that minimize the prediction error in average time $O(k^{d} (\log N)^{d+1} N)$ and space $O(\max(k, d)\, N)$.

Proc. ISAAC'98, LNCS 1533, 1998; New Generation Computing, 2000

Typical values: d = 2 ~ 4, k = 2 ~ 8 (words), log N = 10 ~ 20 (Reuters-21578 collection of 15.6MB). The constant is large in practice.

Page 20:

Theoretical result: Negative
Case: the number d of substrings is unbounded

Details of Algorithm

Theorem: If the number d of subwords is unbounded, then, assuming P ≠ NP, there is no polynomial-time approximation algorithm that solves the optimal pattern problem above within an arbitrarily small approximation ratio (the problem is MAXSNP-hard).

Proc. ISAAC'98, LNCS 1533, 1998; New Generation Computing, 2000

Page 21:

Suffix tree & array

Data structures for efficiently storing all of the $O(n^2)$ substrings in O(n) space

Suffix tree (McCreight, 1976)
– Represents all the substrings in O(n) space.
– Problems:
  • Not space efficient.
  • Dynamic reconstruction is not easy.
  • Not suitable for implementation on secondary storage.

Suffix array (Manber et al., 1990)
– Compactly represents all the substrings with a one-dimensional integer array.

Example: text abcabbca$ (positions 1 to 9)

Rank  Position  Suffix
1     4         abbca$
2     1         abcabbca$
3     8         a$
4     5         bbca$
5     2         bcabbca$
6     6         bca$
7     3         cabbca$
8     7         ca$
9     9         $

[Figure: suffix tree of abcabbca$; its leaves, read in order, are 4 1 8 5 2 6 3 7 9, the same sequence as the suffix array.]
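Note (illustrative sketch, not from the talk): a naive construction of the suffix array for the example text abcabbca$. The leaf order 4 1 8 5 2 6 3 7 9 shown on this slide is reproduced when the end marker $ is ranked above every letter, so the sketch makes that assumption explicit. Sorting whole suffixes is far slower than the linear-space constructions the talk relies on; it is used here only for clarity. The name suffix_array is hypothetical.

```python
def suffix_array(text, sentinel="$"):
    """Return the 1-based start positions of the suffixes of `text`
    in lexicographic order, ranking `sentinel` above every other symbol.

    Naive construction: sorting n suffixes of length up to n costs
    O(n^2 log n) character comparisons in the worst case.
    """
    top = chr(0x10FFFF)  # largest code point, stands in for the sentinel
    order = sorted(range(len(text)),
                   key=lambda i: text[i:].replace(sentinel, top))
    return [i + 1 for i in order]


example = suffix_array("abcabbca$")
```

The array stores only one integer per text position, and any substring can still be located as a prefix of a run of adjacent suffixes.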

Page 22:

Basic Idea

Reduce the problem of finding the best k-proximity d-word association pattern to that of finding the best d-dimensional box over the rank space.

• The position space: consists of all possible pairs of positions that differ by at most k.
• The rank space: consists of all pairs of suffixes of the text, ordered lexicographically.

[Figure: points of the position space, lying within distance k, are translated into the rank space by the suffix array.]
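Note (illustrative sketch, not from the talk): the reduction for d = 2. Each pair of positions (p, q) with 0 < q - p <= k becomes the point (rank of suffix p, rank of suffix q). The suffixes that start with a given word are contiguous in lexicographic order, so a two-word pattern corresponds to a rectangle in the rank space and its occurrences are the points inside it. The brute-force counting and all names are assumptions made for this example; the talk uses range trees for this step.

```python
def build_rank(text):
    """Suffix array `order` (0-based, naive sort) and its inverse `rank`."""
    order = sorted(range(len(text)), key=lambda i: text[i:])
    rank = [0] * len(text)
    for r, start in enumerate(order):
        rank[start] = r
    return order, rank


def rank_interval(text, order, word):
    """Half-open range [lo, hi) of ranks whose suffixes start with `word`.
    Such suffixes are contiguous in lexicographic order."""
    hits = [r for r, start in enumerate(order) if text.startswith(word, start)]
    return (hits[0], hits[-1] + 1) if hits else (0, 0)


def count_in_box(text, first, second, k):
    """Count position pairs (p, q), 0 < q - p <= k, whose rank-space point
    lies in the box rank_interval(first) x rank_interval(second)."""
    order, rank = build_rank(text)
    lo1, hi1 = rank_interval(text, order, first)
    lo2, hi2 = rank_interval(text, order, second)
    n = len(text)
    points = ((rank[p], rank[q])
              for p in range(n)
              for q in range(p + 1, min(p + k, n - 1) + 1))
    return sum(lo1 <= x < hi1 and lo2 <= y < hi2 for x, y in points)
```

For the text abcabbca, the pattern ("a", "ca") with k = 3 corresponds to a box that contains two points, one for each pair of occurrences within distance 3.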

Page 23:

An $O(k^{d} (\log N)^{d+1} N)$-time Algorithm

• An improvement of the generate-and-test algorithm
• Uses a d-dimensional orthogonal range tree structure
• Mean height O(log N)

[Figure: the two-dimensional case.]

Page 24:

Implementation: From Trees to Arrays

Efficient full-text index for text mining
– Replaces the tree with one-dimensional arrays.
– Most operations in the Split-Merge algorithm can be efficiently implemented with the suffix and height arrays.
  • Enumeration of substrings and their occurrences is done in linear time by simulating the DFS of the "virtual" suffix tree while scanning the height array.
  • Reconstruction (restriction) of the suffix and height arrays can be done with O(n log n) integer sorting and O(1)-time LCA/range-minima computation (Farach-Colton, Ferragina, Muthukrishnan '00).

T. Kasai, G. Lee, H. Arimura, S. Arikawa, K.S. Park, "Linear-time Longest-common-prefix computation in suffix arrays and its applications", CPM'01; H. Arimura, CPM'01 talk.
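Note (illustrative sketch, not from the talk): a linear-time computation of the height array (longest common prefixes of lexicographically adjacent suffixes), following the well-known idea of the paper cited above: when moving from suffix i to suffix i + 1 in text order, the common prefix length drops by at most one. The suffix array itself is built by naive sorting to keep the example short, and the function names are hypothetical.

```python
def suffix_array(text):
    """0-based suffix array by naive sorting (fine for a small demo)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def height_array(text, sa):
    """height[r] = length of the longest common prefix of the suffixes
    ranked r - 1 and r in `sa` (height[0] = 0), computed in O(n) time."""
    n = len(text)
    rank = [0] * n
    for r, start in enumerate(sa):
        rank[start] = r
    height = [0] * n
    h = 0
    for i in range(n):                 # visit suffixes in text order
        if rank[i] > 0:
            j = sa[rank[i] - 1]        # suffix just before suffix i in sa
            while i + h < n and j + h < n and text[i + h] == text[j + h]:
                h += 1
            height[rank[i]] = h
            if h > 0:
                h -= 1                 # next suffix loses at most one char
        else:
            h = 0
    return height
```

A usage example: for the text banana, suffix_array gives [5, 3, 1, 0, 4, 2] and height_array gives [0, 1, 3, 0, 0, 2]. Scanning the height array left to right reveals where branches of the virtual suffix tree open and close.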

Page 25:

Implementation: More Practical Algorithm

Split-Merge-with-Array algorithm (SMA)
– Re-implementation of SMT with the suffix and height arrays.
– Has the same time complexity as SMT and a slightly improved space complexity on average.
– Easy to implement and scalable, thanks to a simple data structure that makes extensive use of one-dimensional arrays and of sorting and mapping operations over them.

Theorem: For a set of random texts of total size N, the SMA algorithm finds all the k-proximity d-word association patterns that minimize the prediction error in average time $O(N (\log N)^{d+1})$ and space $O(\max(k, d)\, N)$.

Page 26:

Prototype system

• Based on computational geometry techniques
• Built on a full-text index called the suffix array
• Virtual traversal technique over the suffix array
• Space requirement is reduced to O(dn) with small constants by the extensive use of the suffix array and ternary quicksort.

g++ on Solaris 2.6, Sun UltraSPARC IIi, 250MHz.

Page 27:

Running time

Summary for various values of the parameters d and k

d   k   Time (s)        d   k    Time (s)
1   -       0.64        2   2       2.30
2   2       7.00        2   4       3.81
3   2      33.60        2   8       6.65
4   2     170.38        2   16     10.59
5   2     934.00        2   32     14.81
6   2    1405.82        2   64     14.67

• Data: 15.2MB (SHIP data from the Reuters-21578 data)
• Sun Microsystems UltraSPARC II 300MHz, 512MB, g++ on Solaris 2.6
• Best 200 patterns with entropy minimization

Page 28:

Experiments on Document Browsing

Data
– Reuters-21578 text collection (Lewis, 1997)
– 21578 articles on economy and international affairs for 8 months from Feb. to Oct. 1987
– Each article is tagged with dates and topics (ship, grain, wheat, gold, ...).
– ASCII data of total size 27.6MB (15.2MB after removing tags)

Task
– To find the optimized patterns that distinguish the sentences appearing in the articles of category ship from those in other categories

Arimura, Abe, Fujino, Sakamoto, Shimozono, Arikawa, Proc. IEEE Kyoto Int'l Conf. Digital Library, 2000.

Page 29:

Application to document browsing

Finding the phrases that distinguish the articles of category ship from the articles of other categories.

1. Best ten phrases by entropy value

500 patterns found (time 0.860000 sec)
#pos and #neg: 2970 7887
 id  eval   P1   N1
-------------------------------------
  1  0.568  209  23  <gulf >
  2  0.572  136   3  <ships >
  3  0.572  142   6  <shipping >
  4  0.572  132   3  <iranian >
  5  0.573  134   9  <iran >
  6  0.576  108   6  <port >
  7  0.576  111   8  <the gulf >
  8  0.577  111  14  <strike >
  9  0.577   81   0  <vessels >
 10  0.578   86   4  <attack >
next (11-20)?

2. Phrases with middle entropy values (rank 261-)

#pos and #neg: 2970 7887
 id  eval   P1   N1
-------------------------------------
261  0.585   12   0  <mhi >
262  0.585   12   0  <mclean >
263  0.585   12   0  <lloyds shipping intelligence >
264  0.585   12   0  <iranian oil platform >
265  0.585   12   0  <herald of free >
266  0.585   12   0  <began on >
267  0.585   12   0  <bagged >
268  0.585   12   0  <24 - >
269  0.585   12   0  <18 - >
270  0.585   12   0  <120 pct >
next (271-280)?

3. Best k-proximity d-word patterns by entropy value, where k = 1, d = 5 and the first word is "attack"

%time 0.120000
138 patterns found (time 0.120000 sec)
#pos and #neg: 2970 7887
 id  eval   P1   N1
-------------------------------------
  1  0.586    5   0  <attack> <on > <an > <iranian > <oil >
  2  0.586    4   0  <attack> <an > <iranian> <oil > <platform >
  3  0.586    3   0  <attack> <on > <u.s.-flagged > <in kuwaiti> <waters >
  4  0.586    3   0  <attack> <on > <iranian > <oil > <platform >
 11  0.586    3   0  <attack> <on > <a > <ship > <kuwaiti >
 12  0.586    3   0  <attack> <on> <a > <ship > <in kuwaiti >

Arimura, Abe, Fujino, Sakamoto, Shimozono, Arikawa, Proc. IEEE Kyoto Int'l Conf. Digital Library, 2000.

Page 30:

Optimization-based data mining vs. frequency-based data mining

• Title words: from the <TITLE> section of Reuters newswires
• Stop words: from the standard stop-word lists for the Brown corpus
• Measured: the ratio of title words and stop words in each phrase found

[Figure, two panels, each plotted over rank ranges of 100 patterns (1 ~ 100, 101 ~ 200, ...). Legend: # of title words, # of stop words, title-word ratio, stop-word ratio, average number of words. Panel 1: frequency maximization with positive examples alone (ranks 1 ~ 1000). Panel 2 (Reuters ship data, entropy): entropy minimization with positive and negative examples (ranks 1 ~ 900).]

Arimura, Abe, Fujino, Sakamoto, Shimozono, Arikawa, Proc. IEEE Kyoto Int'l Conf. Digital Library, 2000.

Page 31:

Discovery of Important Keywords in the Cyberspace

Target dataset: the base set for the query "HONDA"
• Root set S: the best 200 hits by AltaVista(TM)
• Forward links from S
• Back links pointing to S (randomly selected 50 pages)
• Base set T: all pages of distance one from pages in S (1,000 ~ 5,000 pages)

Control dataset: the base set for the query "Softbank"

Arimura, Abe, Fujino, Sakamoto, Shimozono, Arikawa, Proc. IEEE Kyoto Int'l Conf. Digital Library, 2000.
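Note (illustrative sketch, not from the talk): building a base set from a root set over a toy link graph. The slide does not say whether the limit of 50 randomly selected back links applies per root page or overall; the sketch assumes a per-page limit. The link dictionaries, the seed parameter, and the name base_set are hypothetical.

```python
import random


def base_set(root, out_links, in_links, max_back=50, seed=0):
    """Expand the root set by one link step in both directions.

    out_links / in_links map a page to the pages it links to / is
    linked from. At most `max_back` back links are sampled per root
    page (an assumption about how the limit is applied).
    """
    rng = random.Random(seed)
    base = set(root)
    for page in root:
        base.update(out_links.get(page, ()))      # forward links from S
        back = sorted(in_links.get(page, ()))     # back links pointing to S
        if len(back) > max_back:
            back = rng.sample(back, max_back)
        base.update(back)
    return base
```

The target and control datasets of this slide would then be two such base sets, one per query, with the page texts fetched separately.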

Page 32:

Frequency-based vs. Optimization-based

Mining patterns in the target/positive dataset (HONDA data) using the background/negative dataset (SOFTBANK): an automobile company and an internet business

Frequency Maximization      Entropy Minimization
Rank  Pattern               Rank  Pattern
0     <the>                 0     <honda >
1     <and>                 1     <prelude >
2     <to>                  2     <i >
3     <a>                   3     <car >
4     <of>                  4     <parts >
5     <for>                 5     <engine >
6     <in>                  6     <99 >
7     <is>                  7     <rear >
8     <I>                   8     <vtec >
9     <honda>               9     <exhaust >
10    <on>                  10    <miles >
11    <s>                   11    <bike >
12    <with>                12    <motorcycle >
13    <you>                 13    <racing >
14    <it>                  14    <black >
15    <or>                  15    <si >
16    <this>                16    <me >
17    <that>                17    <tires >
18    <are>                 18    <fuel >
19    <99>                  19    <my >

Arimura, Abe, Fujino, Sakamoto, Shimozono, Arikawa, Proc. IEEE Kyoto Int'l Conf. Digital Library, 2000.

Page 33:

Dependence on Background/Negative Data

Mining patterns in the target/positive dataset (HONDA data) while varying the background/negative dataset (SOFTBANK and TOYOTA)

HONDA vs. SOFTBANK                    HONDA vs. TOYOTA
(automobile co. and internet business)  (both are automobile companies)
Rank  Other  Pattern                  Rank  Other  Pattern
0     0      <honda >                 0     0      <honda >
1     1      <prelude >               1     1      <prelude >
2     33     <i >                     2     8      <vtec >
3     >350   <car >                   3     15     <si >
4     >350   <parts >                 4     11     <bike >
5     >350   <engine >                5     6      <99 >
6     5      <99 >                    6     12     <motorcycle >
7     >350   <rear >                  7     25     <the honda >
8     2      <vtec >                  8     41     <prelude si >
9     24     <exhaust >               9     20     <civic >
10    108    <miles >                 10    35     <honda prelude >
11    4      <bike >                  11    48     <98 >
12    6      <motorcycle >            12    53     <valkyrie >
13    >350   <racing >                13    60     <99 time >
14    19     <black >                 14    37     <honda s >
15    3      <si >                    15    36     <rims >
16    28     <me >                    16    40     <looking >
17    >350   <tires >                 17    30     <looking for >
18    >350   <fuel >                  18    67     <scooters >
19    >350   <my >                    19    14     <black >

Arimura, Abe, Fujino, Sakamoto, Shimozono, Arikawa, Proc. IEEE Kyoto Int'l Conf. Digital Library, 2000.

Page 34:

Conclusion

Text databases
• Optimized pattern discovery
  – Proximity phrase association patterns
• Fast and robust text mining algorithms
  – Split-Merge algorithm for finding the optimal patterns
  – Levelwise-Scan algorithm for large disk-resident data
• Applications
  – Interactive document browsing
  – Web mining

Please visit: http://www.i.kyushu-u.ac.jp/~arim/

Page 35:

Semi-structured Data

Web & XML data

<ARTICLE status="draft">
  <TITLE> Fast Text Data Mining with optimal pattern discovery </TITLE>
  <AUTHOR> H. Arimura </AUTHOR>
  <AUTHOR> T. KASAI </AUTHOR>
  <AUTHOR> A. WATAKI </AUTHOR>
  <AUTHOR> S. Arikawa </AUTHOR>
  <ABSTRACT> This paper considers the efficient discovery of a simple class of patterns from large text databases. </ABSTRACT>
  <SECTION>
    <TITLE> Introduction </TITLE>
    <BODY> Recent progress of network and storage technology enables us to collect and accumulate ... </BODY>
  </SECTION>
  <SECTION>
    <TITLE> Preliminaries </TITLE>
    <BODY> In this section, we give basic definitions and results on ... </BODY>
  </SECTION>
  ...
</ARTICLE>

[Figure: the document drawn as a labeled tree. Node labels shown: TITLE, AUTHOR, AUTHOR, AUTHOR, ABSTRACT, SECTION, SECTION, SECTION (with TITLE and BODY children), LINK, FIGURE, FIGURE.]

Page 36:

Theoretical results

Theorem: If the maximum size k of subwords is unbounded, then for any ε > 0 there exists no polynomial-time (770/767 - ε)-approximation algorithm for the maximum agreement problem for labeled ordered trees of unbounded size on an unbounded label alphabet, if P ≠ NP.

Algorithm details

Theorem: The algorithm OPTT solves the maximum agreement problem for labeled ordered trees in average time $O(k^{k} b^{k} N)$.

(Note: A straightforward algorithm has super-linear time complexity when the number of labels grows with N.)

Proc. SIAM Data Mining 02 (2001), and Proc. PKDD'02 (2002)

Page 37:

Enumeration graph of ordered trees

[Figure: the enumeration graph, with nodes labelled ⊥, T1, T2, T3, T4.]

• The root is the empty tree.

• Each node is an ordered tree and has its rightmost expansions as its children.
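Note (illustrative sketch, not from the talk): enumerating ordered trees by rightmost expansion. For brevity the trees are unlabeled and encoded as the sequence of node depths in preorder; the talk's algorithm works on labeled trees and also counts occurrences, which this sketch omits. The names rightmost_expansions and enumerate_trees are hypothetical.

```python
def rightmost_expansions(tree):
    """Children of `tree` in the enumeration graph.

    A tree is a tuple of node depths in preorder (root depth 0); the
    empty tuple is the empty tree. A new node may be attached as the
    rightmost child of any node on the rightmost branch, so its depth
    ranges from 1 to (depth of the last node) + 1.
    """
    if not tree:
        return [(0,)]                      # the empty tree grows a root
    return [tree + (d,) for d in range(1, tree[-1] + 2)]


def enumerate_trees(max_size):
    """All unlabeled ordered trees with at most max_size nodes, by size."""
    by_size = {0: [()]}
    for size in range(1, max_size + 1):
        by_size[size] = [child
                         for parent in by_size[size - 1]
                         for child in rightmost_expansions(parent)]
    return by_size
```

Every ordered tree is reached from exactly one parent (the tree obtained by removing its last preorder node), so the enumeration produces no duplicates and needs no isomorphism test.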


Page 39:

Research results: semi-structured data mining

A fast algorithm for discovering frequent ordered tree patterns
– Efficient Substructure Discovery from Large Semi-structured Data
– Asai, Abe, Kawasoe, Arimura, Sakamoto, Arikawa
– Proc. 2nd SIAM International Conference on Data Mining (SDM'02), Arlington, April 2002. (To appear)
– IEICE Data Engineering workshop (October); JSAI technical report SIG-FAI/KBS (November); DEWS '02.

An extension of association rule discovery to semi-structured data

Page 40:

Correctness and complexity of the FREQT algorithm

Theorem: For any frequency threshold 0 < σ ≤ 1, the proposed algorithm Find-Freq-Trees enumerates exactly all the frequent ordered tree patterns.

Recently, we were able to show rigorous upper and lower bounds on the complexity (PKDD'02, Aug. 2002).

– Upper bound: linear time for patterns of constant size

– Lower bound: for patterns of arbitrary size, even approximation with an error ratio close to 1 is NP-complete
