Comparison of Centrality Indexes in Network Japanese Text ...ijeeee.org/Papers/189-CZ406.pdf ·...

Abstract—There is the research in fashion that expresses the

text analysis result with network structure between words in

recent years. Clarifying the relation of the words is important

for the future legacy text mining technologies such as a

derivation of the conceptual meaning between words and its

relationship, new evaluation indexes for degree of similarity

between documents, and a visualization of the relationship. In

such text network analyses domain, a method of node

evaluation is not defined clearly, so far. For this background,

the intensive comparative evaluation has been made with three

typical indexes in network analysis which are degree centrality,

closeness centrality, and betweenness centrality. We have made

a conclusion that the betweenness centrality marked the best

result.

Index Terms—Centrality, graph, meaning, network analysis,

network text analysis, ontology, text analysis.

I. INTRODUCTION

Today, huge amount of data are available for analysis

because SNS and other Internet services have become very

popular. Also progress in the computing power makes it

possible to calculate such big data analysis within a short

period. In this paper, we define the network data as data in a

form of graph which express human, things, or machines as

nodes and their connections based on their relationship as the

edge. Having these backgrounds, network analysis is widely

used for various fields such as Social network analytics.

Semantic analytics is also one of such fields. When we

mention semantic analytics, it is applied to combining text

analytics and semantic web technologies like RDF.

For example, it is used to extract the semantic information

from web pages by applying present text analysis, and then is

used to extract and analyze the semantic network structures

by referring to network relationships between web pages.

On the other hand, some studies are focusing on visual

expressions by directly drawing network structures obtained

from text analysis. They are methods to express

co-occurrence and association between words by network

structures, and to calculate the ontology (conceptual meaning

Manuscript received December 14, 2012; revised February 16, 2013.

Koji Tanaka is with the Graduate School of Business Sciences, The

University of TSUKUBA, Tokyo, 3-29-1 Otsuka, Bunkyo-ku, Tokyo

112-0012 Japan and with Hitachi Government & Public Corporation System

Engineering, Ltd, 2-4-18 Toyo, Koto-ku, Tokyo, Japan 135-8633 (e-mail:

[email protected]).

Kazuhiko Tsuda is with the Graduate School of Business Sciences, The

University of Tsukuba, Tokyo, 3-29-1 Otsuka, Bunkyo-ku, Tokyo, Japan

112-0012 (e-mail: [email protected]).

Masakazu Takahashi is from Graduate School of Innovation and

Technology Management, Yamaguchi University, 2-16-1, Tokiwadai, Ube,

Yamaguchi, Japan 755-8611 (e-mail: [email protected]).

and its relationship) by employing network clustering and

visualizing technique. We call these approaches as network

text analysis.

In the field of network analytics, there are several indexes

to evaluate nodes in the network according to the contents

and objective of analysis. However, it remains ambiguous

how nodes should be evaluated in the network text analysis.

In this paper, we have compared the typical evaluation

indexes, such as degree centrality, closeness centrality, and

betweenness centrality for network text analysis.

II. PREVIOUS STUDIES

Concerning the primary research field of network text

analysis, we have drawn the graph roughly with the original

text data and obtain the meaning of the text and better

understandings of the text structure. Bruce indicated the

social interaction structure from the semantic structure of the

text using visual graphic analysis [1]. Popping (2003)

described the extraction of the knowledge from the text data

with the schema theory, especially by using the knowledge

graphs [2].

These papers were mainly focused on relations of the

semantic and described the text as a network. In addition,

these analyses of the connectivity of the network were based

on the conventional text mining methods such as

co-occurrence, causal relationship [1] and the relationship of

the feelings [3].

Meanwhile, there is an approach of extracting semantics

from the word network structure. Akama et al. indicated the

derivation of ontology (conceptual meaning and its

relationship) with the graph clustering technique to the

associative word network [4]. Moreover, Akama et. al. also

indicated clustering on the graph from the network which

generated from the text group with two different concepts [5].

These clustering techniques were used with the connectivity

between the nodes based on the Markov Clustering (MCL)

[6], and were not focused on the feature of the node.

On the other hand, in the social network analytics domain,

indexes that indicate the feature value of the node in the

network has been proposed [7]. The primary centrality

indexes are degree centrality, closeness centrality, and

betweenness centrality.

From the results of the past studies, we have figured out

the methods to extract the centrality words in the network

with the evaluation of each node by above centrality indexes.

III. CENTRALITY

For these backgrounds, we have made an intensive

Comparison of Centrality Indexes in Network Japanese

Text Analysis

Koji Tanaka, Masakazu Takahashi, and Kazuhiko Tsuda

International Journal of e-Education, e-Business, e-Management and e-Learning, Vol. 3, No. 1, February 2013

DOI: 10.7763/IJEEEE.2013.V3.189 37

research on the comparison evaluation of the three centrality

indexes (degree centrality, closeness centrality, and

betweenness centrality) to extract the conceptual meaning

from the text network.

A. Degree Centrality

Historically first and conceptually the simplest is degree

centrality, which is defined as the number of links incident

upon a node (i.e., the number of ties that a node has). The

degree can be interpreted in terms of the immediate risk of a

node for catching whatever is flowing through the network

such as a virus or some information. In the case of a directed

network (where ties have direction), we usually define two

separate measures of degree centrality, namely indegree and

outdegree. Accordingly, indegree is a count of the number of

ties directed to the node and outdegree is the number of ties

that the node directs to others. When ties are associated to

some positive aspects such as friendship or collaboration,

indegree is often interpreted as a form of popularity, and

outdegree as gregariousness.

The degree centrality of a node, for a given graph

),(: EVG with ||V as nodes and || E as edges, is defined

as

)deg()( vvCD (1)

B. Closeness Centrality

In connected graphs, there is a natural metric distance

between all pairs of nodes, defined by the length of their

shortest paths. The farness of a node s is defined as the sum of

its distances to all other nodes, and its closeness is defined as

the inverse of the farness [8]. Thus, the more central a node is

the lower its total distance to all other nodes. Closeness can

be regarded as a measure of how fast it will take to spread

information from the node s to all other nodes sequentially

[9].

In the classic definition of the closeness centrality, the

spread of information is modeled by the use of shortest paths.

This model might not be the most realistic for all types of

communication scenarios. Thus, related definitions have

been discussed to measure closeness, like the random walk

closeness centrality introduced by Noh and Rieger (2004). It

measures the speed with which randomly walking messages

reach a node from elsewhere in the network—a sort of

random-walk version of closeness centrality [10].

The information centrality of Stephenson and Zelen (1989)

is another closeness measure, which bears some similarity to

that of Noh and Rieger. In essence it measures the harmonic

mean length of paths ending at a node i, which is smaller if i

has many short paths connecting it to other nodes [11].

Note that by definition of graph theoretic distances, the

classic closeness centrality of all nodes in an unconnected

graph would be 0. In a work by Dangalchev (2006) relating

network vulnerability, the definition for closeness is

modified such that it can be calculated more easily and can be

also applied to graphs which lack connectivity [12]:

(2)

C. Betweenness Centrality

Betweenness is a centrality measure of a node within a

graph (there is also edge betweenness, which is not discussed

here). Betweenness centrality quantifies the number of times

a node acts as a bridge along the shortest path between two

other nodes. It was introduced as a measure for quantifying

the control of a human on the communication between other

humans in a social network by Linton Freeman [13]. In his

conception, nodes that have a high probability to occur on a

randomly chosen shortest path between two randomly chosen

nodes have a high betweenness.

The betweenness of a node v in a graph ),(: EVG with

V nodes is computed as follows:

1) For each pair of nodes (s, t), the shortest paths between

them are computed.

2) For each pair of nodes (s, t), the fraction of shortest paths

that pass through the node in question is determined

(here, node v).

3) This fraction is summed up over all pairs of nodes (s, t).

More compactly the betweenness can be represented as

[14] :

(3)

where is total number of shortest paths from node s to

node t and is the number of those paths that pass

through v.

IV. GRAPH CREATING

There are various compound nouns that are consisted of

simply connecting multiple nouns in Japanese language. For

example, there are some compound nouns simply

abbreviating the conjunction such as ―Kenkyuu Kaihatsu‖ in

Japanese that means Research (―Kenkyuu‖) and

Development (―Kaihatsu‖). Moreover, there are also

compound nouns which in English are consisted of adjective

and noun such as ―Environmental Technology‖ but in

Japanese are consisted of multiple nouns .

For example, compound nouns using ―Kenkyu‖ that means

research in English, are as follows: Kenkyu Kaihatsu

(Research Activities), Kenkyu Shisetsu (Research Faculties),

Kenkyu Keikaku (Research Plan), Kenkyu Hiyou (Research

Cost) and so on. However, if the word ―Research Cost‖ is

used in the document, it might be hard to simply understand

whether the topic of the document is on research or on cost, in

Japanese. If the document was about research, then a lot of

compound nouns with the word ―research‖ should appear

frequently, and if the document was about cost, then a lot of

compound nouns with the word ―cost‖ such as the building

cost and the transportation cost should appear frequently

even in Japanese.

Therefore, the topics of documents can be presumed by

extracting the compound nouns from the document and by

focusing on nouns that consist them. We have analyzed

relations of the document, the compound noun and the noun

within the groups of documents with the same topic in the

CB (v) =σ st (v)σ sts≠v≠t∈V

∑

Cc (v) = 2−dG (v,t )t∈V \v∑

σ st

σ st (v)


38

form of graph. Then we have evaluated indexes by

comparing their extraction rate of the specialized vocabulary

within the groups of documents focusing on nodes indicating

the noun. As for evaluation, we have used the text data

classified in either technology category or culture category

from Wikipedia in Japanese [15], [16]. Attributions of data

are shown in Table I.

TABLE I: DATA ATTRIBUTES

Number of

Categories Total pages

Technology 69 4167

Culture 44 1847

Total 133 6014

We have created graph for each text data according to the

flowchart in Fig.1 and steps following it.

Step 3:Obtain a page text, and perform Japanese

language morphological analyses

Step 4:Extract all compound nouns in a page

Step 5:Does a node for target compound

noun exist?

Step 6:Create a compound noun node CN

and add edge (P, CN)

Step 8:Does a node for target

noun exist?

Step 2:Create a page node P

No

Yes

Step 7:Extract nouns in a compound noun

Have all compound nounsin a page been processed?

No

Have all nouns been processed?

Step 9:Create a noun node Nand add edge (CN,N)

Yes

No

Yes

Yes

Have all pages in a category been

processed?

Yes

No

No

Start

End

Step 1:Obtain pages in a category

Fig. 1. Flowchart of graph creation.

1) Obtain pages for the particular category, and take the

following steps for each page.

2) Create a page node P.

3) Obtain a page text, and perform Japanese language

morphological analysis.

4) Extract all compound nouns in a page, and take the

following steps for each compound noun.

5) If the node for the target compound noun already exists in the graph, move to the next compound noun. If not,

take the next step.

6) Create a compound noun node CN, and add edge (P,

CN).

7) Extract all nouns in a compound noun, and take the

following steps for all these nouns.

8) If the node for the target noun already exists in the graph,

move to the next noun. If not, take the next step.

9) Create a noun node N, and add edge (CN, N)

V. CENTRALITY COMPARISON

A. Creating Graph Outline

An example of the created graph is shown in Fig. 2.

P1

N1

CN1

N2

P2

CN2

CN3

N5

N3

P3P4

CN4

Page

Compound

noun

Noun

legend

Fig. 2. Example of created graph.

Table II indicates the outline of the created graph. From

the result, there are no remarkable differences in the

following categories: Clustering Coefficient, Density, and

Slope of Power Law Distribution.

TABLE II: THE ARRANGEMENT OF GRAPH

Avg.

Nodes

P 52.8

CN 2929.2

N 2548.6

Total 5573.6

Edges

(P, CN) 3345.6

(CN, N) 6221.8

Total 9567.4

Clustering Coefficient 0.00019

Density 0.0019

Slope of Power Law

Distribution 1.63

B. Index of the Centrality

Degree centrality, closeness centrality, and betweenness

centrality are calculated from the noun nodes of each graph

respectively. We have measured the appearance ratio of

category name at top n words with high centrality. In the case


39

of the category name composed of multiple nouns, we have

accepted it if the part of category name matches. The result is

shown in Fig. 3.

As a result, we have obtained the high ratio of agreement

for degree centrality and betweenness centrality more than

for closeness centrality.

Comparing the result of degree centrality with the result of

betweenness centrality, there is no remarkable difference

between them. However, comparing them at the upper rank

where n<30, betweenness centrality had slightly higher

agreement rate.

Fig. 3. Cumulative sum of rate of agreement to the category name.

Next, we have measured agreement rates with the parent

category in the same procedure. Fig. 4 indicates the results of

the comparison.

From the graph, we have remarkable result as follows;

Both closeness centrality and betweenness centrality have

scored high agreement rateat the upper rank where n<30. On

the other hand, betweenness centrality alone scored the

highest agreement rate at the lower rank where n<30

Fig. 4. Cumulative sum of rate of agreement to the parent category name.

The agreement rate to category name was about 60% for

the top 3 words, was about 70% for the top 6 words,, and was

about 80% for the top 15 words, respectively.

Moreover, the agreement rate with upper category was

about 40% for the top 5 words, about 60% for the top 12

words,, and about 70% for the top 20 words.

From the result of the comparison, the betweenness

centrality is suitable in extraction of conceptual meanings.

C. Evaluation in Consideration of Synonymous Words

We have made evaluations in accordance with the noun

node and the category name in the previous section, but in

actual, we have also observed many synonymous words. For

example, although the category name is ―production‖, its

node name is ―manufacture ―.

For this reason, we have made a new evaluation in

consideration with the synonymous words. In particular, list

of synonymous words of the particular noun node has been

made using the Japanese word net [17]. Then among those

words, if there was a match entirely or partially with the

category name, we assumed that the noun node and the

category name are in accord. Results are shown in Fig. 5 and

Fig. 6.

Fig. 5. Cumulative sum of rate of agreement with the category name

(Synonymous Words Consideration).

Fig. 6. Cumulative sum of rate of agreement with the upper category name

(Synonymous Words Consideration).

We have obtained the similar results for the synonymous

words consideration as well. In order to clarify the effect of

synonymous word consideration, the difference of the

numerical value for Fig. 3 and Fig. 5 is shown in Fig. 7, and


40

that of Fig. 4 and Fig. 6 is shown in Fig. 8. respectively.

Fig. 7. Difference of the rate of agreement with the category name in

presence and absence of the synonym word consideration.

Fig. 8. Difference of the rate of agreement with the parent category name in

presence and absence of the synonym word consideration.

For the purpose of extracting the word indicating the

feature of a certain category, it is desirable for its

synonymous words to appear also in the higher rank. Both

Fig. 7 and Fig. 8 indicate the appearance ratio of the

synonymous words per rank.

As shown in Fig. 7, in the case of considering synonymous

words, we have obtained the high agreement rate with the

category name, in the higher rank(n=1, 2, 3) for all three

centrality indexes. That is to say that the synonymous words

of the category name would also appear in the higher rank

(n= 1, 2, 3).

However, we also have observed the repetition of rise and

fall of agreement rate for the degree centrality at n< 30, and

as well as for the closeness centrality at n< 70; therefore, it is

hard to say that the synonymous words would appear

frequently at the higher rank..

In contrast, the betweenness centrality has indicated the

slight rise-and-fall motion, but it is stable around 0.05 at n< 3,

around 0.02 at 3< n<20, and around 0.01 at beyond. So for

the betweenness centrality, we have observed the tendency of

frequent appearance of the synonymous words at the higher

rank.

Furthermore, although betweenness centrality has the

slight rise-and-fall motion for the agreement rate of the

parent category as shown in Fig. 8, it has remained stable

around 0.02 appearance rate for the synonymous words.

On the other hand, both the degree centrality and the

closeness centrality have the tendency of high agreement rate

in the lower rank, so both of them do not agree for the

purpose of extracting the word showing the feature of

category (the synonymous words of the category name).

From these analyses, the betweenness centrality is suitable

for extracting the conceptual meanings even in the case of the

synonymous words being taken into consideration. To make

use of betweenness centrality, it is effective for the

synonymous words with the category name to appear at the

higher rank.

VI. CONCLUSION

In network text analysis domain which expresses a result

of the text analysis as a network structure and analyzes it, a

method for node evaluation is still unclear.

Therefore, we have first analyzed relations in the form of

graph among the document, the compound noun, and the

noun within the group of documents with the same topic.

Then three typical centralities (degree centrality, closeness

centrality, and betweenness centrality) are calculated from

the noun nodes of each graph respectively. Lastly, we have

measured the appearance ratio of category name by top n

words with high centrality. From the result of the comparison,

betweenness centrality was the most suitable in extraction of

conceptual meaning.

The agreement rate with category names was about 60%

for top 3, about 70% in top 6, and about 80% in top 15,

respectively.

Moreover, the agreement rate for the parent category was

about 40% for top 5, about 60% for top 12, and about 70% for

top 20.

In addition, the betweenness centrality is suitable for

extracting the conceptual meanings even in the case of the

synonymous words were taken into consideration. To make

use of betweenness centrality, it was effective for the

synonymous words with the category name to appear at

higher rank.

REFERENCES

[1] B. Bruce and D. Newman, ―Interacting plans,‖ Cognitive Science, vol.

2, no. 3, pp. 195–233, Sep. 1978.

[2] R. Popping, ―Knowledge Graphs and Network Text Analysis,‖ Social

Science Information, vol. 42, no. 1, pp. 91–106, Mar. 2003.

[3] W. Lehnert, ―Plot units and narrative summarization,‖ Cognitive

Science, vol. 5, no. 4, pp. 293–331, Dec. 1981.

[4] J. Jung, M. Miyake, and H. Akama, "Associative Language Data

Processing by Recurrent Graph Clustering Methods," JSAI, 2006.

[5] H. Akama, M. Miyake, and J. Jung, ―Graph-based Linguistic Analysis

on the Ideological Similarity between the Mesmerism and the Modern

Stoicism,‖ IPSJ SIG-CH, vol. 2007, no. 49, pp. 49–56, May 2007.

[6] S. Van Dongen, ―Performance criteria for graph clustering and Markov

cluster experiments,‖ Report - Information systems, no. 12, pp. 1–36,

2000.

[7] L. C. Freeman, ―Centrality in social networks conceptual

clarification,‖ Social Networks, vol. 1, no. 3, pp. 215–239, Jan. 1978.

[8] G. Sabidussi, ―The centrality index of a graph,‖ Psychometrika, pp.

581–603, 1966.

[9] M. E. J. Newman, ―A measure of betweenness centrality based on

random walks,‖ Social Networks, vol. 27, no. 1, pp. 39–54, Jan. 2005.

[10] J. D. Noh, ―Random Walks on Complex Networks,‖ Phys. Rev. Lett.,

vol. 92, no. 11, pp. 118701, Mar. 2004.

[11] K. Stephenson and M. Zelen, ―Rethinking centrality: Methods and

examples,‖ Social Networks, vol. 11, no. 1, pp. 1–37, Mar. 1989.


41

[12] C. Dangalchev, ―Residual closeness in networks,‖ Physica A:

Statistical Mechanics and its Applications, vol. 365, no. 2, pp.

556–564, Jun. 2006.

[13] L. C. Freeman, ―A set of measures of centrality based on betweenness,‖

Sociometry, pp. 35–41, 1977.

[14] U. Brandes, ―A Faster Algorithm for Betweenness Centrality,‖ Journal

of Mathematical Sociology, vol. 25, pp. 163–177, 2001.

Koji Tanaka received his M.B.A. degree in 2012 from

The University of TSUKUBA, Japan. He is with

Government & Public Corporation System Engineering,

Ltd and is a lead engineer of Research & Development

Center. His research interests include natural language

processing, information retrieval and human-computer

interaction. He is a member of The Information Processing

Society of Japan.

Masakazu Takahashi received his M.B.A. degree in

1996 and Ph.D. in 2010 both from the University of

Tsukuba, Japan. He started his business carrier from

Nikkei Research, a subsidiary of Nikkei: Japanese

newspaper corporation, in 1992, and started his

educational carrier from Gunma University in 2010. He

is presently an associate professor at the Graduate School

of Innovation and Technology Management, Yamaguchi University from

2012. His research interests include computational organization theory,

cooperative agents, case-based reasoning, service science and knowledge

system development methodology. He is a member of IEEJ, IPSJ, JSAI, and

ACM

Kazuhiko Tsuda received the B.S. and Ph.D. degrees in

Engineering from Tokushima University in 1986 and

1994, respectively. He was with Mitsubishi Electric

Corporation during 1986-1990, and with Sumitomo Metal

Industries Ltd. during 1991-1998. He is presently with the

Graduate School of Business Science, University of

Tsukuba, Tokyo, Japan as an associate professor during

1998-2005, and as a professor since 2005. His research interests include

natural language processing, database, information retrieval and

human-computer interaction. He is a member of The Information Processing

Society of Japan and The Institute of Electronics, Information and

Communication Engineers.


42

[15] Wikipedia. [Online]. Available: http://ja.wikipedia.org

[16] Jawiki dump progress on 20120826. [Online]. Available:

http://dumps.wikimedia.org/jawiki/20120826/

[17] Japanese Wordnet. [Online]. Available:

http://nlpwww.nict.go.jp/wn-ja/

[18] Centrality. le:

http://en.wikipedia.org/wiki/Closeness_centrality/ .

[19] H. Isahara, F. Bond, K. Uchimoto, M. Utiyama, and K. Kanzaki,

―Development of the Japanese WordNet,‖ LREC 2008, 2008.

[Online]. Availab

Comparison of Centrality Indexes in Network Japanese Text ...ijeeee.org/Papers/189-CZ406.pdf ·...

Documents

Transcript of Comparison of Centrality Indexes in Network Japanese Text ...ijeeee.org/Papers/189-CZ406.pdf ·...