An Efficient Concept-Based Mining Model for Enhancing Text Clustering


Intelligent Database Systems Lab

National Yunlin University of Science and Technology

Shady Shehata, Fakhri Karray, and Mohamed S. Kamel, TKDE, 2010

Presented by Wen-Chung Liao, 2010/11/03


Outline

Motivation
Objectives
THEMATIC ROLES BACKGROUND
CONCEPT-BASED MINING MODEL
Experiments
Conclusions
Comments


Motivation

Vector Space Model (VSM)
─ Represents each document as a feature vector of the terms (words or phrases) in the document.
─ Each feature vector contains term weights (usually term frequencies) of the terms in the document.
─ Term frequency captures the importance of a term within a document only.

However, two terms can have the same frequency in their documents while one contributes more to the meaning of its sentences than the other.

Thus, the underlying text mining model should indicate terms that capture the semantics of the text.


Objectives

A new concept-based mining model is introduced.
─ It captures the semantic structure of each term within a sentence and document, rather than only the frequency of the term within a document.
─ It effectively discriminates between unimportant terms and terms that hold the concepts representing the sentence meaning.
─ Three measures for analyzing concepts at the sentence, document, and corpus levels are computed.
─ A new concept-based similarity measure is proposed, based on a combination of sentence-based, document-based, and corpus-based concept analysis.
─ It has a more significant effect on clustering quality due to the similarity’s insensitivity to noisy terms.


THEMATIC ROLES BACKGROUND

Verb argument structure (e.g., “John hits the ball”):
─ “hits” is the verb.
─ “John” and “the ball” are the arguments of the verb “hits”.

Label: a label is assigned to an argument.
─ E.g., “John” has the subject (or agent) label; “the ball” has the object (or theme) label.

Term: either an argument or a verb.
─ Either a word or a phrase.

Concept: a labeled term.

Generally, the semantic structure of a sentence can be characterized by a form of verb argument structure.
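The definitions above can be sketched as a small data structure. This is only an illustration: the class names `Concept` and `VerbArgumentStructure`, and the label string `"verb"` for the verb itself, are assumptions and do not come from the slides.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Concept:
    """A labeled term: a word or phrase plus its role label."""
    text: str
    label: str


@dataclass(frozen=True)
class VerbArgumentStructure:
    """One verb together with its labeled arguments."""
    verb: str
    arguments: Tuple[Concept, ...]

    def concepts(self) -> List[Concept]:
        # The verb is also a term; "verb" is a label name assumed for this sketch.
        return [Concept(self.verb, "verb"), *self.arguments]


# "John hits the ball": "John" is the agent, "the ball" is the theme.
structure = VerbArgumentStructure(
    verb="hits",
    arguments=(Concept("John", "agent"), Concept("the ball", "theme")),
)
concept_texts = [c.text for c in structure.concepts()]
print(concept_texts)
```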


CONCEPT-BASED MINING MODEL


CONCEPT-BASED MINING MODEL

Sentence-Based Concept Analysis
─ Calculating ctf of concept c in sentence s: the conceptual term frequency, ctf, is the number of occurrences of concept c in the verb argument structures of sentence s. A concept that occurs often in these structures has the principal role of contributing to the meaning of s. It is a local measure on the sentence level.
─ Calculating ctf of concept c in document d: it gives the overall importance of concept c to the meaning of its sentences in document d.
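The transcript does not preserve the document-level formula. As a minimal sketch, the code below assumes the document-level ctf is the average of the sentence-level ctf values over the sentences that contain the concept. Both this aggregation and the function name `document_ctf` are assumptions made for illustration.

```python
def document_ctf(sentence_ctfs):
    """Average the sentence-level ctf of one concept over the sentences containing it.

    Assumption: averaging over containing sentences; the slide's formula is not preserved.
    """
    containing = [value for value in sentence_ctfs if value > 0]
    if not containing:
        return 0.0
    return sum(containing) / len(containing)


# Hypothetical sentence-level ctf values of one concept in a four-sentence document.
ctf_per_sentence = [3, 0, 1, 2]
doc_ctf = document_ctf(ctf_per_sentence)
print(doc_ctf)  # 2.0
```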


CONCEPT-BASED MINING MODEL

Document-Based Concept Analysis
─ The concept-based term frequency, tf: the number of occurrences of a concept (word or phrase) c in the original document. It is a local measure on the document level.

Corpus-Based Concept Analysis
─ The concept-based document frequency, df: the number of documents containing concept c. It is used to reward concepts that appear in only a small number of documents.
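A short sketch of the two counts on a toy corpus may help. The corpus contents are invented, and the `log(N / df)` reward is the usual TF-IDF form, assumed here because the slide does not give the exact expression.

```python
import math
from collections import Counter

# Hypothetical toy corpus: each document is a list of already extracted concepts.
corpus = {
    "d1": ["nanotubes", "artificial muscles", "nanotubes"],
    "d2": ["nanotubes", "text clustering"],
    "d3": ["text clustering", "similarity measure"],
}

# tf: occurrences of each concept within one document (document-level measure).
tf = {doc_id: Counter(concepts) for doc_id, concepts in corpus.items()}

# df: number of documents containing each concept (corpus-level measure).
df = Counter(c for concepts in corpus.values() for c in set(concepts))


def idf(concept):
    # Assumed TF-IDF-style reward: rarer concepts get a larger value.
    return math.log(len(corpus) / df[concept])


print(tf["d1"]["nanotubes"], df["nanotubes"], df["artificial muscles"])
```

With these counts, "artificial muscles" (one document) is rewarded more than "nanotubes" (two documents).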


Example of Calculating ctf Measure

Texas and Australia researchers have created industry-ready sheets of materials made from nanotubes that could lead to the development of artificial muscles.

Three verbs (created, made, and lead; shown in red on the original slide) represent the semantic structure of the meaning of the sentence. Each has its own arguments:
─ [ARG0 Texas and Australia researchers] have [TARGET created] [ARG1 industry-ready sheets of materials made from nanotubes that could lead to the development of artificial muscles].
─ Texas and Australia researchers have created industry-ready sheets of [ARG1 materials] [TARGET made] [ARG2 from nanotubes that could lead to the development of artificial muscles].
─ Texas and Australia researchers have created industry-ready sheets of materials made from [ARG1 nanotubes] [R-ARG1 that] [ARGM-MOD could] [TARGET lead] [ARG2 to the development of artificial muscles].
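The three labeled structures of this sentence can be used for a small worked count. The sketch reads "occurrences of concept c in verb argument structures" as the number of structures in which c appears inside some labeled term. That reading, and the helper names `contains_phrase` and `ctf`, are assumptions rather than the authors' exact procedure. Matching is done on raw words, before any cleaning.

```python
# Labeled terms (arguments and target verb) of each verb argument structure.
structures = [
    ["Texas and Australia researchers", "created",
     "industry-ready sheets of materials made from nanotubes that could lead "
     "to the development of artificial muscles"],
    ["materials", "made",
     "from nanotubes that could lead to the development of artificial muscles"],
    ["nanotubes", "that", "could", "lead",
     "to the development of artificial muscles"],
]


def contains_phrase(term, concept):
    """True if the words of `concept` occur contiguously in `term`."""
    words, target = term.lower().split(), concept.lower().split()
    return any(
        words[i:i + len(target)] == target
        for i in range(len(words) - len(target) + 1)
    )


def ctf(concept, structures):
    """Count the structures in which the concept appears in some labeled term."""
    return sum(
        any(contains_phrase(term, concept) for term in structure)
        for structure in structures
    )


counts = {c: ctf(c, structures) for c in ["created", "materials", "nanotubes", "lead"]}
print(counts)
```

Under this reading, "nanotubes" and "lead" occur in all three structures, "materials" in two, and "created" in one.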


A cleaning step
─ Remove stop words.
─ Stem the words.
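The two cleaning operations can be sketched as follows. The stop-word list is a tiny illustrative set, and `naive_stem` is a crude suffix stripper standing in for a real stemmer such as Porter's. Neither is taken from the slides.

```python
# Tiny illustrative stop-word list (an assumption, not the list used in the paper).
STOP_WORDS = {"and", "have", "of", "from", "that", "could", "to", "the"}


def naive_stem(word):
    """Strip one common suffix; a crude stand-in for a real stemming algorithm."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def clean(term):
    """Lowercase a term, drop stop words, and stem the remaining words."""
    words = [w for w in term.lower().split() if w not in STOP_WORDS]
    return [naive_stem(w) for w in words]


cleaned = clean("to the development of artificial muscles")
print(cleaned)
```

The sketch splits on whitespace only, so punctuation attached to a word is left in place.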


A Concept-Based Similarity Measure

• The single-term similarity measure (using the TF-IDF weighting scheme) is:

[Equation image not preserved]

• The concept-based similarity between two documents, d1 and d2, is calculated over their m matching concepts by:

[Equation image not preserved]
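For the single-term baseline, a minimal sketch of cosine similarity over TF-IDF vectors is given below. The later sensitivity slide refers to the cosine similarity, but the exact weighting variant is an assumption. The toy documents and function names are invented, and the concept-based formula itself is not reproduced because the transcript does not preserve it.

```python
import math
from collections import Counter


def tfidf_vector(terms, doc_freq, n_docs):
    """Map each term to tf * log(N / df) (a common TF-IDF variant, assumed here)."""
    counts = Counter(terms)
    return {t: tf * math.log(n_docs / doc_freq[t]) for t, tf in counts.items()}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v[t] for t, weight in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy documents, already cleaned and tokenized.
docs = {
    "d1": ["nanotubes", "muscles", "sheets"],
    "d2": ["nanotubes", "muscles", "clustering"],
    "d3": ["clustering", "similarity", "documents"],
}
doc_freq = Counter(t for terms in docs.values() for t in set(terms))
vectors = {name: tfidf_vector(terms, doc_freq, len(docs)) for name, terms in docs.items()}

sim_12 = cosine(vectors["d1"], vectors["d2"])
sim_13 = cosine(vectors["d1"], vectors["d3"])
print(sim_12, sim_13)
```

Here d1 and d2 share two terms and get a positive similarity, while d1 and d3 share none and get zero.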


Mathematical Framework

Assume that the content of document d2 is changed by Δd2.

Sensitivity analysis:

• Assume that each concept consists of one word.
• In this case, each concept is a word and A = 1. (?)
• By approximation, the d1c value is bigger than d1w, and the Δd2c value is bigger than the Δd2w value.

• Hence, the sensitivity of the concept-based similarity is higher than that of the cosine similarity.

• This means that the concept-based model analyzes the similarity between two documents more deeply than the traditional approaches.


Concept-Based Analysis Algorithm

[Figure: algorithm illustration over documents d1, d2, d3, d4 with lists L; image not preserved]


EXPERIMENTAL RESULTS

Four data sets
─ 23,115 ACM abstract articles collected from the ACM digital library; five main categories.
─ 12,902 documents from the Reuters-21578 data set; five category sets.
─ 361 samples from the Brown corpus; the main categories were press: reportage; press: reviews, religion, skills and hobbies, popular lore, belles-lettres, and learned; fiction: science; fiction: romance and humor.
─ 20,000 messages collected from 20 Usenet newsgroups.

Three standard document clustering techniques
─ Hierarchical Agglomerative Clustering (HAC)
─ Single-Pass Clustering
─ k-Nearest Neighbor (k-NN)
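To make one of the listed techniques concrete, here is a minimal sketch of single-pass clustering. It is a generic variant, not the experimental setup of the paper. It compares each document with the first member (seed) of every existing cluster, uses Jaccard similarity on term sets as a stand-in for the similarity measure, and takes a hypothetical threshold of 0.4.

```python
def jaccard(a, b):
    """Stand-in similarity on term sets (not the paper's measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def single_pass(docs, similarity, threshold):
    """Assign each document to the most similar cluster seed, or start a new cluster."""
    clusters = []  # each cluster is a list of document ids; the first id is the seed
    for doc_id, features in docs.items():
        best, best_sim = None, threshold
        for cluster in clusters:
            sim = similarity(features, docs[cluster[0]])
            if sim >= best_sim:
                best, best_sim = cluster, sim
        if best is None:
            clusters.append([doc_id])
        else:
            best.append(doc_id)
    return clusters


# Hypothetical toy documents represented as sets of terms.
docs = {
    "d1": {"nanotubes", "muscles", "sheets"},
    "d2": {"nanotubes", "muscles", "materials"},
    "d3": {"clustering", "similarity", "documents"},
    "d4": {"clustering", "documents", "concepts"},
}
clusters = single_pass(docs, jaccard, threshold=0.4)
print(clusters)
```

Because documents are processed in one pass, the result depends on their order and on the chosen threshold.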

Evaluation methods


Four different concept-based weighting schemes:


Conclusions

The work bridges the gap between the natural language processing and text mining disciplines. (?)

By exploiting the semantic structure of the sentences in documents, a better text clustering result is achieved.

There are a number of possibilities for extending this paper:
─ Link this work to Web document clustering.
─ Apply the same model to text classification.


Comments

Advantages
─ Better similarity by considering the semantic structure of sentences in documents.

Shortcomings
─ Ambiguous algorithm

Applications
─ Text clustering
─ Text classification