Comparison of Centrality Indexes in Network Japanese Text Analysis
Abstract—In recent years, research that expresses the results of
text analysis as a network structure between words has become
popular. Clarifying the relations between words is important
for future text mining technologies such as the derivation of
the conceptual meaning of words and their relationships, new
evaluation indexes for the degree of similarity between
documents, and the visualization of these relationships. In the
domain of such network text analysis, however, no method of
node evaluation has been clearly defined so far. Against this
background, we carried out an intensive comparative evaluation
of three typical indexes in network analysis: degree
centrality, closeness centrality, and betweenness centrality.
We conclude that betweenness centrality gave the best result.
Index Terms—Centrality, graph, meaning, network analysis,
network text analysis, ontology, text analysis.
I. INTRODUCTION
Today, a huge amount of data is available for analysis
because SNS and other Internet services have become very
popular. Progress in computing power also makes it possible to
analyze such big data within a short period. In this paper, we
define network data as data in the form of a graph that
expresses humans, things, or machines as nodes and the
connections based on their relationships as edges. Against this
background, network analysis is widely used in various fields
such as social network analytics.
Semantic analytics is also one of these fields. Semantic
analytics here refers to the combination of text analytics and
semantic web technologies such as RDF.
For example, it is used to extract semantic information from
web pages by applying existing text analysis, and then to
extract and analyze semantic network structures by referring to
the network relationships between web pages.
On the other hand, some studies focus on visual expression by
directly drawing the network structures obtained from text
analysis. These methods express co-occurrence and association
between words as network structures, and derive the ontology
(conceptual meaning and its relationship) by employing network
clustering and visualization techniques. We call these
approaches network text analysis.
Manuscript received December 14, 2012; revised February 16, 2013.
Koji Tanaka is with the Graduate School of Business Sciences, The
University of Tsukuba, Tokyo, 3-29-1 Otsuka, Bunkyo-ku, Tokyo
112-0012 Japan and with Hitachi Government & Public Corporation System
Engineering, Ltd, 2-4-18 Toyo, Koto-ku, Tokyo, Japan 135-8633 (e-mail:
Kazuhiko Tsuda is with the Graduate School of Business Sciences, The
University of Tsukuba, Tokyo, 3-29-1 Otsuka, Bunkyo-ku, Tokyo, Japan
112-0012 (e-mail: [email protected]).
Masakazu Takahashi is with the Graduate School of Innovation and
Technology Management, Yamaguchi University, 2-16-1, Tokiwadai, Ube,
Yamaguchi, Japan 755-8611 (e-mail: [email protected]).
In the field of network analytics, there are several indexes
for evaluating nodes in a network according to the content and
objective of the analysis. However, it remains unclear how
nodes should be evaluated in network text analysis.
In this paper, we compare three typical evaluation indexes,
degree centrality, closeness centrality, and betweenness
centrality, for network text analysis.
II. PREVIOUS STUDIES
Early research in network text analysis drew graphs directly
from the original text data in order to obtain the meaning of
the text and a better understanding of its structure. Bruce
showed the social interaction structure derived from the
semantic structure of a text using visual graph analysis [1].
Popping (2003) described the extraction of knowledge from text
data with schema theory, in particular by using knowledge
graphs [2].
These papers mainly focused on semantic relations and
described the text as a network. In addition, their analyses
of network connectivity were based on conventional text mining
methods such as co-occurrence, causal relationships [1], and
relationships of feelings [3].
Meanwhile, there is an approach that extracts semantics from
the word network structure. Akama et al. derived an ontology
(conceptual meaning and its relationship) by applying a graph
clustering technique to an associative word network [4].
Moreover, Akama et al. also performed clustering on a graph
generated from a group of texts with two different concepts
[5]. These clustering techniques used the connectivity between
nodes based on Markov Clustering (MCL) [6], and did not focus
on the features of the nodes themselves.
On the other hand, in the social network analytics domain,
indexes that indicate the feature value of a node in the
network have been proposed [7]. The primary centrality
indexes are degree centrality, closeness centrality, and
betweenness centrality.
From the results of these past studies, we devised a method to
extract the central words in the network by evaluating each
node with the above centrality indexes.
III. CENTRALITY
Against this background, we carried out an intensive
comparative evaluation of the three centrality indexes (degree
centrality, closeness centrality, and betweenness centrality)
for extracting conceptual meaning from the text network.
Comparison of Centrality Indexes in Network Japanese
Text Analysis
Koji Tanaka, Masakazu Takahashi, and Kazuhiko Tsuda
International Journal of e-Education, e-Business, e-Management and e-Learning, Vol. 3, No. 1, February 2013
DOI: 10.7763/IJEEEE.2013.V3.189
A. Degree Centrality
Historically first and conceptually the simplest is degree
centrality, which is defined as the number of links incident
upon a node (i.e., the number of ties that a node has). The
degree can be interpreted in terms of the immediate risk of a
node for catching whatever is flowing through the network
such as a virus or some information. In the case of a directed
network (where ties have direction), we usually define two
separate measures of degree centrality, namely indegree and
outdegree. Accordingly, indegree is a count of the number of
ties directed to the node and outdegree is the number of ties
that the node directs to others. When ties are associated to
some positive aspects such as friendship or collaboration,
indegree is often interpreted as a form of popularity, and
outdegree as gregariousness.
The degree centrality of a node v, for a given graph
$G := (V, E)$ with $|V|$ nodes and $|E|$ edges, is defined
as
$$C_D(v) = \deg(v) \qquad (1)$$
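As an illustration, the degree, indegree, and outdegree described above can be computed directly from an edge list. The following is a minimal Python sketch, not the implementation used in this study; the function names and the toy edge list are hypothetical.

```python
from collections import defaultdict


def degree_centrality(edges):
    """Return C_D(v) = deg(v) for every node of an undirected graph."""
    degree = defaultdict(int)
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return dict(degree)


def in_out_degree(arcs):
    """Return (indegree, outdegree) for a directed graph of (source, target) arcs."""
    indegree, outdegree = defaultdict(int), defaultdict(int)
    for source, target in arcs:
        outdegree[source] += 1
        indegree[target] += 1
    return dict(indegree), dict(outdegree)


# Hypothetical toy graph: one compound noun node tied to a page and two nouns.
toy_edges = [("P1", "CN1"), ("CN1", "N1"), ("CN1", "N2")]
toy_degree = degree_centrality(toy_edges)
```

In this toy graph the node CN1 has three ties, so its degree centrality is 3, while every other node has degree 1.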
B. Closeness Centrality
In connected graphs, there is a natural metric distance
between all pairs of nodes, defined by the length of their
shortest paths. The farness of a node s is defined as the sum of
its distances to all other nodes, and its closeness is defined as
the inverse of the farness [8]. Thus, the more central a node is
the lower its total distance to all other nodes. Closeness can
be regarded as a measure of how long it will take to spread
information from the node s to all other nodes sequentially
[9].
In the classic definition of the closeness centrality, the
spread of information is modeled by the use of shortest paths.
This model might not be the most realistic for all types of
communication scenarios. Thus, related definitions have
been discussed to measure closeness, like the random walk
closeness centrality introduced by Noh and Rieger (2004). It
measures the speed with which randomly walking messages
reach a node from elsewhere in the network—a sort of
random-walk version of closeness centrality [10].
The information centrality of Stephenson and Zelen (1989)
is another closeness measure, which bears some similarity to
that of Noh and Rieger. In essence it measures the harmonic
mean length of paths ending at a node i, which is smaller if i
has many short paths connecting it to other nodes [11].
Note that by the definition of graph-theoretic distances, the
classic closeness centrality of all nodes in an unconnected
graph would be 0. In a work by Dangalchev (2006) on network
vulnerability, the definition of closeness is modified such
that it can be calculated more easily and can also be applied
to graphs that lack connectivity [12]:
$$C_C(v) = \sum_{t \in V \setminus v} 2^{-d_G(v,t)} \qquad (2)$$
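The classic closeness and the modified closeness of Dangalchev can be sketched as follows for an unweighted graph. This is a minimal illustration under our own assumptions: the graph is given as an adjacency dictionary, distances are hop counts obtained by breadth-first search, and unreachable nodes simply contribute nothing to the modified sum. The function names and the toy graph are hypothetical.

```python
from collections import deque


def bfs_distances(adj, source):
    """Shortest-path lengths (in hops) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def classic_closeness(adj, v):
    """Inverse of the farness of v; 0 if some node cannot be reached from v."""
    dist = bfs_distances(adj, v)
    if len(dist) < len(adj):
        return 0.0
    farness = sum(dist.values())
    return 1.0 / farness if farness else 0.0


def dangalchev_closeness(adj, v):
    """Sum of 2^(-d(v, t)) over every reachable node t other than v."""
    dist = bfs_distances(adj, v)
    return sum(2.0 ** -d for t, d in dist.items() if t != v)


# Hypothetical toy graph: the path a - b - c plus an isolated node d.
toy_adj = {"a": ["b"], "b": ["a", "c"], "c": ["b"], "d": []}
```

In this toy graph the node d is isolated, so the classic closeness of every node is 0, whereas the modified closeness of b is still 2^-1 + 2^-1 = 1.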
C. Betweenness Centrality
Betweenness is a centrality measure of a node within a
graph (there is also edge betweenness, which is not discussed
here). Betweenness centrality quantifies the number of times
a node acts as a bridge along the shortest path between two
other nodes. It was introduced as a measure for quantifying
the control of a human on the communication between other
humans in a social network by Linton Freeman [13]. In his
conception, nodes that have a high probability to occur on a
randomly chosen shortest path between two randomly chosen
nodes have a high betweenness.
The betweenness of a node v in a graph $G := (V, E)$ with
$|V|$ nodes is computed as follows:
1) For each pair of nodes (s, t), the shortest paths between
them are computed.
2) For each pair of nodes (s, t), the fraction of shortest paths
that pass through the node in question is determined
(here, node v).
3) This fraction is summed up over all pairs of nodes (s, t).
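More compactly, the betweenness can be represented as
[14]:
$$C_B(v) = \sum_{s \neq v \neq t \in V} \frac{\sigma_{st}(v)}{\sigma_{st}} \qquad (3)$$
where $\sigma_{st}$ is the total number of shortest paths from node s to
node t and $\sigma_{st}(v)$ is the number of those paths that pass
through v.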
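One way to accumulate these fractions without enumerating every shortest path is the algorithm of Brandes [14]. The sketch below is a minimal unweighted version written for illustration only; it is not necessarily the code used in this study, and the function name and toy graph are hypothetical. It sums over ordered pairs (s, t), so in an undirected graph each unordered pair is counted twice.

```python
from collections import deque


def betweenness_centrality(adj):
    """Betweenness of every node of an unweighted graph (Brandes' algorithm).

    adj maps each node to an iterable of its neighbours.
    """
    centrality = {v: 0.0 for v in adj}
    for s in adj:
        # Breadth-first search from s, counting shortest paths.
        order = []                      # nodes in non-decreasing distance from s
        preds = {v: [] for v in adj}    # predecessors on shortest paths from s
        sigma = {v: 0 for v in adj}     # number of shortest paths from s to v
        dist = {v: -1 for v in adj}
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:         # w is reached for the first time
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Back-propagate the pair dependencies, farthest nodes first.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                centrality[w] += delta[w]
    return centrality


# Hypothetical toy graph: the path a - b - c.
toy_adj = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
toy_bc = betweenness_centrality(toy_adj)
```

In the toy path, b lies on the only shortest path from a to c and from c to a, so its betweenness is 2, while a and c have betweenness 0.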
IV. GRAPH CREATING
Japanese has many compound nouns that are formed simply by
connecting multiple nouns. For example, some compound nouns
simply omit the conjunction, such as “Kenkyu Kaihatsu”,
which means Research (“Kenkyu”) and Development (“Kaihatsu”).
Moreover, there are also compound nouns that consist of an
adjective and a noun in English, such as “Environmental
Technology”, but consist of multiple nouns in Japanese.
For example, compound nouns using “Kenkyu”, which means
research in English, include the following: Kenkyu Kaihatsu
(Research and Development), Kenkyu Shisetsu (Research
Facilities), Kenkyu Keikaku (Research Plan), Kenkyu Hiyou
(Research Cost), and so on. However, if the term “Research
Cost” is used in a document, it may be hard to tell from that
term alone whether the topic of the document is research or
cost. If the document is about research, many compound nouns
containing the word “research” should appear frequently, and
if the document is about cost, many compound nouns containing
the word “cost”, such as building cost and transportation
cost, should appear frequently, even in Japanese.
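The reasoning above can be illustrated by counting, for each noun, the number of distinct compound nouns that contain it. This is only a minimal sketch: the romanized compound nouns are hypothetical, already-segmented examples, and the function name is our own.

```python
from collections import Counter


def constituent_noun_counts(compound_nouns):
    """Count the distinct compound nouns in which each noun appears.

    Each compound noun is given as a tuple of its constituent nouns.
    """
    counts = Counter()
    for parts in set(compound_nouns):   # ignore repeated compound nouns
        counts.update(set(parts))
    return counts


# Hypothetical compound nouns extracted from one document.
toy_compounds = [
    ("kenkyu", "kaihatsu"),
    ("kenkyu", "shisetsu"),
    ("kenkyu", "keikaku"),
    ("kenkyu", "hiyou"),
    ("kensetsu", "hiyou"),
]
toy_counts = constituent_noun_counts(toy_compounds)
likely_topic = toy_counts.most_common(1)[0][0]
```

Here "kenkyu" occurs in four compound nouns and "hiyou" in two, which would suggest that the toy document is about research rather than cost.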
Therefore, the topic of a document can be inferred by
extracting the compound nouns from the document and focusing
on the nouns that compose them. We analyzed the relations among
the document, the compound noun, and the noun, within groups
of documents with the same topic, in the
form of a graph. We then evaluated the indexes by comparing
their extraction rates of the specialized vocabulary within
each group of documents, focusing on the nodes that represent
nouns. For the evaluation, we used text data classified in
either the technology category or the culture category of
the Japanese Wikipedia [15], [16]. The attributes of the data
are shown in Table I.
TABLE I: DATA ATTRIBUTES
              Number of Categories    Total pages
Technology    69                      4167
Culture       44                      1847
Total         133                     6014
We created a graph for each set of text data according to the
flowchart in Fig. 1 and the steps that follow it.
[Figure: flowchart of Steps 1 to 9 of graph creation, from Start to End]
Fig. 1. Flowchart of graph creation.
1) Obtain the pages of a particular category, and take the
following steps for each page.
2) Create a page node P.
3) Obtain the page text, and perform Japanese morphological
analysis.
4) Extract all compound nouns in the page, and take the
following steps for each compound noun.
5) If a node for the target compound noun already exists in the
graph, move to the next compound noun. If not, take the next
step.
6) Create a compound noun node CN, and add edge (P, CN).
7) Extract all nouns in the compound noun, and take the
following steps for each of these nouns.
8) If a node for the target noun already exists in the graph,
move to the next noun. If not, take the next step.
9) Create a noun node N, and add edge (CN, N).
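The steps above can be sketched in Python as follows. This is a minimal illustration, not the implementation used in this study. It assumes that morphological analysis has already been done, so each page is given as a list of compound nouns and each compound noun as a tuple of its nouns. One further assumption deserves mention: when a compound noun node or noun node already exists, the sketch still adds the edge from the current page or compound noun to it, so that a node shared by several pages is linked to each of them. The function name and toy pages are hypothetical.

```python
def build_graph(pages):
    """Build the page / compound noun / noun graph.

    pages maps a page title to a list of compound nouns, each given as a
    tuple of its constituent nouns. Returns (nodes, edges) as sets.
    """
    nodes = set()
    edges = set()
    for title, compound_nouns in pages.items():      # Step 1
        page = ("P", title)
        nodes.add(page)                              # Step 2
        for parts in compound_nouns:                 # Steps 3-4 (pre-analyzed)
            compound = ("CN", parts)
            nodes.add(compound)                      # Steps 5-6: sets ignore repeats
            edges.add((page, compound))              # assumption: always link
            for noun in parts:                       # Step 7
                noun_node = ("N", noun)
                nodes.add(noun_node)                 # Steps 8-9
                edges.add((compound, noun_node))
    return nodes, edges


# Hypothetical toy category with two pages that share one compound noun.
toy_pages = {
    "page1": [("kenkyu", "kaihatsu"), ("kenkyu", "hiyou")],
    "page2": [("kenkyu", "hiyou")],
}
toy_nodes, toy_edges = build_graph(toy_pages)
```

For the toy input, the graph has two page nodes, two compound noun nodes, and three noun nodes, with three (P, CN) edges and four (CN, N) edges.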
V. CENTRALITY COMPARISON
A. Creating Graph Outline
An example of the created graph is shown in Fig. 2.
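[Figure: example graph with page nodes (P1 to P4), compound noun nodes (CN1 to CN4), and noun nodes (N1, N2, N3, N5); the legend distinguishes Page, Compound noun, and Noun nodes]
Fig. 2. Example of created graph.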
Table II gives an outline of the created graphs. Across the
created graphs, there are no remarkable differences in the
clustering coefficient, the density, or the slope of the
power-law distribution.
TABLE II: THE ARRANGEMENT OF GRAPH
                                      Avg.
Nodes    P                            52.8
         CN                           2929.2
         N                            2548.6
         Total                        5573.6
Edges    (P, CN)                      3345.6
         (CN, N)                      6221.8
         Total                        9567.4
Clustering Coefficient                0.00019
Density                               0.0019
Slope of Power Law Distribution       1.63
B. Index of the Centrality
Degree centrality, closeness centrality, and betweenness
centrality were each calculated for the noun nodes of each
graph. We measured the rate at which the category name appears
among the top n words with the highest centrality. When the
category name is composed of multiple nouns, we accepted a word
if it matches part of the category name. The result is
shown in Fig. 3.
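One possible reading of this measurement is sketched below: for each n, count the share of category graphs in which at least one of the top n nouns matches the category name. This is our interpretation for illustration, not a confirmed description of the procedure. In particular, treating a partial match as equality with one of the nouns that make up the category name is an assumption, and all function names and toy values are hypothetical.

```python
def top_n_nouns(centrality, n):
    """Nouns ranked by descending centrality (ties broken alphabetically)."""
    ranked = sorted(centrality, key=lambda noun: (-centrality[noun], noun))
    return ranked[:n]


def matches_category(noun, category_nouns):
    """Accept a noun if it equals any noun that makes up the category name."""
    return noun in category_nouns


def cumulative_agreement(graphs, max_n):
    """For n = 1..max_n, the share of graphs whose top n nouns contain a match.

    graphs is a list of (centrality, category_nouns) pairs, one per category.
    """
    rates = []
    for n in range(1, max_n + 1):
        hits = sum(
            any(matches_category(noun, cat) for noun in top_n_nouns(cent, n))
            for cent, cat in graphs
        )
        rates.append(hits / len(graphs))
    return rates


# Hypothetical centrality scores for the noun nodes of two category graphs.
toy_graphs = [
    ({"research": 5.0, "cost": 3.0, "plan": 1.0}, {"research"}),
    ({"film": 2.0, "music": 4.0, "art": 1.0}, {"film"}),
]
toy_rates = cumulative_agreement(toy_graphs, 3)
```

In the toy data, only the first category is matched by its top-ranked noun, and both categories are matched within the top two nouns.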
As a result, degree centrality and betweenness centrality
obtained a higher agreement rate than closeness centrality.
Comparing the result of degree centrality with that of
betweenness centrality, there is no remarkable difference
between them. However, at the upper ranks where n<30,
betweenness centrality had a slightly higher agreement rate.
Fig. 3. Cumulative sum of rate of agreement to the category name.
Next, we measured the agreement rate with the parent category
by the same procedure. Fig. 4 shows the results of the
comparison.
The graph shows the following remarkable result: both closeness
centrality and betweenness centrality scored a high agreement
rate at the upper ranks where n<30. On the other hand,
betweenness centrality alone scored the highest agreement rate
at the lower ranks where n>30.
Fig. 4. Cumulative sum of rate of agreement to the parent category name.
The agreement rate with the category name was about 60% for
the top 3 words, about 70% for the top 6 words, and about 80%
for the top 15 words.
Moreover, the agreement rate with the parent category was
about 40% for the top 5 words, about 60% for the top 12 words,
and about 70% for the top 20 words.
From the result of the comparison, betweenness centrality is
suitable for the extraction of conceptual meanings.
C. Evaluation in Consideration of Synonymous Words
In the previous section, we evaluated agreement between the
noun node and the category name, but in practice we also
observed many synonymous words. For example, the category name
is “production” while the node name is “manufacture”.
For this reason, we made a new evaluation that takes
synonymous words into consideration. Specifically, a list of
synonymous words for each noun node was made using the
Japanese WordNet [17]. If any of those words matched the
category name entirely or partially, we regarded the noun node
and the category name as being in agreement. Results are shown
in Fig. 5 and
Fig. 6.
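The matching rule with synonymous words can be sketched as follows. This is a minimal illustration under our own assumptions: the synonym list is passed in as a plain dictionary rather than looked up in the Japanese WordNet, and a partial match is treated as a substring match in either direction. The function name and the toy synonym list are hypothetical.

```python
def matches_with_synonyms(noun, category_name, synonyms):
    """True if the noun or one of its synonyms matches the category name.

    A match is either an exact match or a substring match in either direction.
    """
    candidates = {noun} | set(synonyms.get(noun, ()))
    return any(c in category_name or category_name in c for c in candidates)


# Hypothetical synonym list for one noun node.
toy_synonyms = {"manufacture": ["production", "fabrication"]}
```

With this toy list, the node "manufacture" agrees with the category name "production" only when synonymous words are considered.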
Fig. 5. Cumulative sum of rate of agreement with the category name
(Synonymous Words Consideration).
Fig. 6. Cumulative sum of rate of agreement with the upper category name
(Synonymous Words Consideration).
We obtained similar results when synonymous words were
considered as well. In order to clarify the effect of
considering synonymous words, the difference between the values
in Fig. 3 and Fig. 5 is shown in Fig. 7, and that between
Fig. 4 and Fig. 6 is shown in Fig. 8.
Fig. 7. Difference of the rate of agreement with the category name in
presence and absence of the synonym word consideration.
Fig. 8. Difference of the rate of agreement with the parent category name in
presence and absence of the synonym word consideration.
For the purpose of extracting the words that indicate the
feature of a certain category, it is desirable that synonymous
words also appear in the higher ranks. Fig. 7 and Fig. 8 both
indicate the appearance ratio of synonymous words per rank.
As shown in Fig. 7, when synonymous words are considered, we
obtained a high agreement rate with the category name in the
higher ranks (n = 1, 2, 3) for all three centrality indexes.
That is, synonymous words of the category name also appear in
the higher ranks (n = 1, 2, 3).
However, we also observed repeated rises and falls of the
agreement rate for degree centrality at n<30 and for closeness
centrality at n<70; therefore, it is hard to say that
synonymous words appear frequently at the higher ranks for
these two indexes.
In contrast, betweenness centrality shows slight rises and
falls, but it is stable at around 0.05 for n<3, around 0.02
for 3<n<20, and around 0.01 beyond that. Thus, for betweenness
centrality, we observed a tendency for synonymous words to
appear frequently at the higher ranks.
Furthermore, although betweenness centrality shows slight
rises and falls in the agreement rate with the parent
category, as shown in Fig. 8, the appearance rate of
synonymous words remained stable at around 0.02.
On the other hand, both degree centrality and closeness
centrality tend to have a high agreement rate in the lower
ranks, so neither is suited to the purpose of extracting the
words that show the feature of a category (the synonymous
words of the category name).
From these analyses, betweenness centrality is suitable for
extracting conceptual meanings even when synonymous words are
taken into consideration. Using betweenness centrality is
effective for making synonymous words of the category name
appear at the higher ranks.
VI. CONCLUSION
In the domain of network text analysis, which expresses the
result of text analysis as a network structure and analyzes it,
the method for node evaluation is still unclear.
Therefore, we first expressed, in the form of a graph, the
relations among the document, the compound noun, and the noun
within groups of documents with the same topic. Then three
typical centralities (degree centrality, closeness centrality,
and betweenness centrality) were each calculated for the noun
nodes of each graph. Lastly, we measured the appearance ratio
of the category name among the top n words with the highest
centrality. From the result of the comparison, betweenness
centrality was the most suitable for the extraction of
conceptual meaning.
The agreement rate with category names was about 60% for the
top 3, about 70% for the top 6, and about 80% for the top 15.
Moreover, the agreement rate with the parent category was
about 40% for the top 5, about 60% for the top 12, and about
70% for the top 20.
In addition, betweenness centrality is suitable for extracting
conceptual meanings even when synonymous words are taken into
consideration. Using betweenness centrality was effective for
making synonymous words of the category name appear at the
higher ranks.
REFERENCES
[1] B. Bruce and D. Newman, “Interacting plans,” Cognitive Science, vol.
2, no. 3, pp. 195–233, Sep. 1978.
[2] R. Popping, “Knowledge Graphs and Network Text Analysis,” Social
Science Information, vol. 42, no. 1, pp. 91–106, Mar. 2003.
[3] W. Lehnert, “Plot units and narrative summarization,” Cognitive
Science, vol. 5, no. 4, pp. 293–331, Dec. 1981.
[4] J. Jung, M. Miyake, and H. Akama, “Associative Language Data
Processing by Recurrent Graph Clustering Methods,” JSAI, 2006.
[5] H. Akama, M. Miyake, and J. Jung, “Graph-based Linguistic Analysis
on the Ideological Similarity between the Mesmerism and the Modern
Stoicism,” IPSJ SIG-CH, vol. 2007, no. 49, pp. 49–56, May 2007.
[6] S. Van Dongen, “Performance criteria for graph clustering and Markov
cluster experiments,” Report - Information systems, no. 12, pp. 1–36,
2000.
[7] L. C. Freeman, “Centrality in social networks conceptual
clarification,” Social Networks, vol. 1, no. 3, pp. 215–239, Jan. 1978.
[8] G. Sabidussi, “The centrality index of a graph,” Psychometrika, pp.
581–603, 1966.
[9] M. E. J. Newman, “A measure of betweenness centrality based on
random walks,” Social Networks, vol. 27, no. 1, pp. 39–54, Jan. 2005.
[10] J. D. Noh, “Random Walks on Complex Networks,” Phys. Rev. Lett.,
vol. 92, no. 11, pp. 118701, Mar. 2004.
[11] K. Stephenson and M. Zelen, “Rethinking centrality: Methods and
examples,” Social Networks, vol. 11, no. 1, pp. 1–37, Mar. 1989.
[12] C. Dangalchev, “Residual closeness in networks,” Physica A:
Statistical Mechanics and its Applications, vol. 365, no. 2, pp.
556–564, Jun. 2006.
[13] L. C. Freeman, “A set of measures of centrality based on betweenness,”
Sociometry, pp. 35–41, 1977.
[14] U. Brandes, “A Faster Algorithm for Betweenness Centrality,” Journal
of Mathematical Sociology, vol. 25, pp. 163–177, 2001.
Koji Tanaka received his M.B.A. degree in 2012 from
the University of Tsukuba, Japan. He is with
Government & Public Corporation System Engineering,
Ltd and is a lead engineer of Research & Development
Center. His research interests include natural language
processing, information retrieval and human-computer
interaction. He is a member of The Information Processing
Society of Japan.
Masakazu Takahashi received his M.B.A. degree in
1996 and Ph.D. in 2010, both from the University of
Tsukuba, Japan. He started his business career at
Nikkei Research, a subsidiary of Nikkei, the Japanese
newspaper corporation, in 1992, and started his
educational career at Gunma University in 2010. He
has been an associate professor at the Graduate School
of Innovation and Technology Management, Yamaguchi University, since
2012. His research interests include computational organization theory,
cooperative agents, case-based reasoning, service science and knowledge
system development methodology. He is a member of IEEJ, IPSJ, JSAI, and
ACM.
Kazuhiko Tsuda received the B.S. and Ph.D. degrees in
Engineering from Tokushima University in 1986 and
1994, respectively. He was with Mitsubishi Electric
Corporation during 1986-1990, and with Sumitomo Metal
Industries Ltd. during 1991-1998. He has been with the
Graduate School of Business Science, University of
Tsukuba, Tokyo, Japan, as an associate professor during
1998-2005, and as a professor since 2005. His research interests include
natural language processing, database, information retrieval and
human-computer interaction. He is a member of The Information Processing
Society of Japan and The Institute of Electronics, Information and
Communication Engineers.
[15] Wikipedia. [Online]. Available: http://ja.wikipedia.org
[16] Jawiki dump progress on 20120826. [Online]. Available:
http://dumps.wikimedia.org/jawiki/20120826/
[17] Japanese Wordnet. [Online]. Available:
http://nlpwww.nict.go.jp/wn-ja/
[18] Centrality. [Online]. Available:
http://en.wikipedia.org/wiki/Closeness_centrality/
[19] H. Isahara, F. Bond, K. Uchimoto, M. Utiyama, and K. Kanzaki,
“Development of the Japanese WordNet,” LREC 2008, 2008.