Language networks
LANGUAGE NETWORKS
SI/EECS 767, Yang Liu, January 29, 2010
INTRODUCTION
Human language described as a complex network (Sole et al., 2005)
INCENTIVES
Analyzing statistical properties
Building models to explain the patterns
Studying the origins and evolution of human language
Statistical approaches to natural language processing
CATEGORIZATION
Words as vertices
  Co-occurrence networks (Dorogovtsev & Mendes, 2001; Masucci & Rodgers, 2008)
  Semantic networks (Steyvers & Tenenbaum, 2005)
  Syntactic networks (Ferrer i Cancho et al., 2004)
Sentences as vertices (Erkan & Radev, 2004)
Documents as vertices (Menczer, 2004)
CO-OCCURRENCE AND SYNTACTIC NETWORKS
SEMANTIC NETWORKS
Language as an evolving word web
(Dorogovtsev & Mendes, 2001)
INTRODUCTION
Propose a theory of how language evolves
Treat human language as a complex network of distinct words
Words are connected with their nearest neighbors (co-occurrence networks)
Papers of Ferrer & Sole (2001, 2002): the degree distribution consists of two power-law parts with different exponents
THE MODEL
Preferential attachment
  Provides a power-law degree distribution
  Average degree does not change
Here, the total number of connections increases more rapidly than the number of vertices, so the average degree grows
THE MODEL
At each time step:
  a new vertex (word) is added; t, the total number of vertices, plays the role of time;
  the new vertex is connected to some old vertex i with probability proportional to its degree k_i;
  ct new edges emerge between old words (c is a constant coefficient);
  these new edges emerge between old vertices i and j with probability p ∝ k_i k_j
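The growth rule above can be sketched as a short simulation. This is a minimal sketch, not the authors' code: the two-word seed network, the rounding of ct to an integer number of edges, the skipping of self-pairs, and the name `simulate_word_web` are all assumptions made for illustration. Repeated edges between the same pair are allowed.

```python
import random
from collections import Counter


def simulate_word_web(steps, c, seed=0):
    """Grow a word web with one new word per step plus about c*t edges among old words.

    Returns (degree, edges): a Counter mapping vertex -> degree, and the edge list.
    """
    rng = random.Random(seed)
    edges = [(0, 1)]  # assumed seed network: two linked words
    # Every vertex appears in `stubs` once per edge end, so a uniform draw
    # from `stubs` selects a vertex with probability proportional to its degree.
    stubs = [0, 1]
    for t in range(2, steps):
        # The new word t attaches to one old word i with p proportional to k_i.
        new_edges = [(t, rng.choice(stubs))]
        # About c*t edges between old words (fractional part rounded at random).
        expected = c * t
        n_extra = int(expected) + (rng.random() < expected - int(expected))
        for _ in range(n_extra):
            # Independent degree-weighted draws give p proportional to k_i * k_j.
            i, j = rng.choice(stubs), rng.choice(stubs)
            if i != j:  # skip self-pairs (an assumption)
                new_edges.append((i, j))
        # Degrees are updated only after all edges of this step are chosen.
        for a, b in new_edges:
            edges.append((a, b))
            stubs.extend((a, b))
    return Counter(stubs), edges
```

For example, `degree, edges = simulate_word_web(2000, 0.01)` grows a small web whose mean degree, `2 * len(edges) / len(degree)`, increases with the number of steps rather than staying constant.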
DATA
Two word webs by Ferrer and Sole (2001, 2002)
Obtained from ¾ of a million words of the British National Corpus
470 000 vertices
Average degree = 72
SOLVING THE MODEL
Continuum approximation
k(s,t): the average degree of the vertices born at time s and observed at time t
ct ≈ 70 >> 1
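The estimate ct ≈ 70 can be recovered by counting edges under the model's rules. The following is a sketch, assuming exactly one edge per new word and ct edges between old words at time t, and ignoring the initial condition; E(t) (total number of edges) and k̄(t) (average degree) are symbols introduced here for illustration.

```latex
E(t) \approx \int_0^t (1 + c\,u)\,du = t + \frac{c\,t^2}{2},
\qquad
\bar{k}(t) = \frac{2E(t)}{t} \approx 2 + c\,t .
```

With the measured average degree of 72, this gives ct ≈ 70.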
SOLVING THE MODEL
The degree distribution has two regions separated by the crossover point
SOLVING THE MODEL
Below this point, stationary degree distribution
Above this point, non-stationary degree distribution
Empty and filled circles show the degree distributions for two word webs by Ferrer and Sole (2001, 2002)
DISCUSSION
Interested only in the degree distribution
Clustering coefficients do not match
The total number of words of degree greater than k_cross does not change
The size of the kernel lexicon does not depend on the total number of distinct words in the language
Network properties of written human language
(Masucci & Rodgers, 2008)
TOPOLOGY OF THE NETWORK
The words (including punctuation marks) are vertices, and two vertices are linked if they are neighbors.
Directed network
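A network of this kind can be built from raw text in a few lines. This is a minimal sketch: the regular-expression tokenizer, the lowercasing, and the name `cooccurrence_network` are assumptions, not the preprocessing used by Masucci & Rodgers.

```python
import re
from collections import defaultdict


def cooccurrence_network(text):
    """Directed graph with an edge u -> v whenever token v immediately follows u.

    Returns a dict mapping each token to the set of its successors.
    """
    # Words and single punctuation marks both count as tokens (assumed tokenization).
    tokens = re.findall(r"\w+|[^\w\s]", text.lower())
    successors = defaultdict(set)
    for u, v in zip(tokens, tokens[1:]):
        successors[u].add(v)
    # Make sure tokens with no successor still appear as vertices.
    for tok in tokens:
        successors[tok]
    return dict(successors)
```

Calling `cooccurrence_network("The cat sat. The dog sat.")` yields five vertices, with `"the"` pointing to both `"cat"` and `"dog"`.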
NETWORK STATISTICS
8992 vertices, 117687 edges, mean degree <k> = 13.1
P(k) ∝ k^-1.9
Zipf’s law slope -1.2
GROWTH PROPERTIES
The number of edges between words grows faster than the number of vertices.
N(t) ∝ t^1.8
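A growth exponent such as the 1.8 reported here is commonly estimated as the slope of a straight line fitted in log-log coordinates. The sketch below uses ordinary least squares; whether the authors used this estimator is not stated on the slide, and `loglog_slope` is a hypothetical helper name.

```python
import math


def loglog_slope(xs, ys):
    """Least-squares slope of log(y) against log(x); all values must be positive."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mean_x, mean_y = sum(lx) / n, sum(ly) / n
    sxx = sum((a - mean_x) ** 2 for a in lx)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    return sxy / sxx
```

Here `xs` would be the successive values of t and `ys` the corresponding edge counts N(t); on exact power-law data the function returns the exponent itself.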
NEAREST NEIGHBOR’S PROPERTIES
The mean clustering coefficient <c> = 0.19
REPEATED BINARY STRUCTURES OF WORDS
Reproduced by local preferential attachment (PA)
THE MODELS (D-M MODEL)
Start with a chain of 20 connected vertices
At each time step, add a new vertex and connect it to some vertex i with p ∝ k_i
m(t) - 1 new edges emerge between old words with p ∝ k_i k_j
D-M MODEL
<c(k)> = 0.16
Catches the average clustering and the global growth behavior
Misses the internal structure
MODEL 2
Include local PA
p(t) ≈ 0.1 t^0.16
Start with a chain of 20 connected vertices
At each time step, add a new vertex and connect it to some vertex i (not a nearest neighbor) with p ∝ k_i
m(t) - 1 times: with probability p(t), link the last linked vertex to an old vertex i in its nearest neighborhood through local PA (p ∝ k_i); with probability 1 - p(t), link it to an old vertex i outside its nearest neighborhood through global PA
MODEL 2
<c> = 0.08
Catches the global and nearest-neighbor behavior but not the average clustering coefficient
MODEL 3
Different words in written human language display different statistical distributions, according to their functions
MODEL 3
Start with a chain of 20 connected vertices
At each time step, add a new vertex and connect it to some vertex i (not a nearest neighbor) with p ∝ k_i
m(t) - 1 times: with probability q = 0.05, link the last linked vertex to one of the three fixed vertices; with probability p(t), link the last linked vertex to an old vertex i in its nearest neighborhood through local PA (p ∝ k_i); with probability 1 - p(t) - 3q, link it to an old vertex i outside its nearest neighborhood through global PA.
MODEL 3
<c> = 0.20
CONCLUSIONS
New growth mechanisms:
1. local PA
2. the allocation of a set of preselected vertices
The large-scale structure of semantic networks: Statistical analyses and a model of semantic growth
(Steyvers & Tenenbaum, 2005)
INTRODUCTION
There are general principles governing the structure of network representations for natural language semantics
The small-world structure arises from a scale-free organization
MODEL
Concepts that enter the network early are expected to show higher connectivity
One aspect of semantic development: growth of semantic networks by differentiation of existing nodes
The model grows through a process of differentiation analogous to mechanisms of semantic development, which allows it to produce both small-world and scale-free structure.
ANALYSIS OF SEMANTIC NETWORKS
Free association norms
WordNet
Roget’s thesaurus
METHODS
Associative networks
Created two networks: directed and undirected
ROGET’S THESAURUS
Bipartite graph
  Word nodes and semantic category nodes
  A connection is made between a word node and a category node when the word falls into the semantic category
Converted to a simple graph for calculating the clustering coefficient (one-mode projection)
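The one-mode projection can be illustrated with a toy word-to-category mapping. This is a sketch under assumptions: the function names and the example words are hypothetical, and two words are linked whenever they share at least one category.

```python
from collections import defaultdict
from itertools import combinations


def one_mode_projection(word_categories):
    """Project a word/category bipartite graph onto words.

    word_categories maps each word to the set of categories it falls into.
    Returns a dict mapping each word to the set of words sharing a category with it.
    """
    members = defaultdict(set)
    for word, categories in word_categories.items():
        for category in categories:
            members[category].add(word)
    neighbors = {word: set() for word in word_categories}
    for words in members.values():
        for a, b in combinations(sorted(words), 2):
            neighbors[a].add(b)
            neighbors[b].add(a)
    return neighbors


def local_clustering(neighbors, node):
    """Fraction of pairs of a node's neighbors that are themselves linked."""
    nbrs = sorted(neighbors[node])
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for a, b in combinations(nbrs, 2) if b in neighbors[a])
    return 2.0 * links / (k * (k - 1))
```

Averaging `local_clustering` over all words gives the clustering coefficient of the projected graph.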
WORDNET
120,000+ word forms
99,000+ word meanings
Links between forms and forms, meanings and meanings, forms and meanings
Treated as an undirected graph
RESULTS
ZIPF’S “LAW OF MEANING”
GROWING NETWORK MODEL
Previous models
  BA model: low clustering coefficient
  WS model: no scale-free structure
MODEL A: UNDIRECTED
At each time step, a new node with M links is added to the network by randomly choosing some existing node i for differentiation,
and then connecting the new node to M randomly chosen nodes in the semantic neighborhood of node i.
TWO PROBABILITY DISTRIBUTIONS
Set n equal to the size of the target network
Set M equal to ½ <k>
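Model A can be sketched as follows. The two probability distributions are not reproduced in this transcript, so the choices below are assumptions: node i is picked for differentiation with probability proportional to its degree, the M targets are drawn uniformly from the neighborhood of i (taken to include i itself), and the seed network is M fully connected nodes. The name `grow_semantic_network` is hypothetical.

```python
import random


def grow_semantic_network(n, M, seed=0):
    """Grow an undirected network of n nodes by differentiating existing nodes.

    Returns a dict mapping each node to the set of its neighbors.
    """
    if M < 2 or n < M:
        raise ValueError("need M >= 2 and n >= M")
    rng = random.Random(seed)
    # Assumed seed network: M fully connected nodes.
    neighbors = {i: set(range(M)) - {i} for i in range(M)}
    for new in range(M, n):
        nodes = list(neighbors)
        # Assumption: pick node i with probability proportional to its degree.
        weights = [len(neighbors[v]) for v in nodes]
        i = rng.choices(nodes, weights=weights, k=1)[0]
        # Neighborhood of i: its neighbors plus i itself (always at least M nodes).
        neighborhood = sorted(neighbors[i] | {i})
        # Assumption: the M targets are drawn uniformly without replacement.
        targets = rng.sample(neighborhood, M)
        neighbors[new] = set(targets)
        for v in targets:
            neighbors[v].add(new)
    return neighbors
```

Following the slide, n would be set to the size of the target network and M to half its mean degree, for example `grow_semantic_network(5000, 11)`; the numbers in this call are illustrative only.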
MODEL B: DIRECTED
Assume the direction of each arc is chosen randomly and independently of the other arcs
Point toward the old node with probability α, toward the new node with probability 1-α
RESULTS
Tested only on the association networks, with Models A and B
Set α = 0.95
Average of 50 simulations
Patterns in syntactic dependency networks
(Ferrer i Cancho et al., 2004)
INTRODUCTION
Co-occurrence networks fail to capture the characteristic long-distance correlations of words in sentences
The proportion of incorrect syntactic dependency links is high
A precise definition of syntactic link is required
THE SYNTACTIC DEPENDENCY NETWORK
Defined according to the dependency grammar formalism
Vertices are words; links go from the modifier to its head
CORPORA
1. Czech corpus
   Proportion of links is about 0.65 (missing links between function words)
   Annotation performed by hand
2. Romanian corpus
   Annotation performed by hand
3. German corpus
   Proportion of links is about 0.16 (missing links obey no regularity)
   Annotation performed automatically
NETWORK PROPERTIES
Small-world structure
  Small average path length D and high clustering coefficient C
Heterogeneity
  Power-law degree distribution
Hierarchical organization
  C(k) ~ k^-θ
Betweenness centrality
  P(g) ~ g^-η
Assortativeness
RESULTS
GLOBAL VS SENTENCE-LEVEL PATTERNS
DISCUSSIONS
(1) Disassortative mixing tells us that labor is divided in human language. Linking words tend to avoid connections among themselves.
(2) Hierarchical organization tells us that syntactic dependency networks define not only the syntactically correct links but also a top-down hierarchical organization that is the basis of phrase structure formalisms.
(3) Small-worldness is a necessary condition for recursion.
LexRank: Graph-based lexical centrality as salience in text summarization
(Erkan & Radev, 2004)
INTRODUCTION
Graph-based methods in NLP
Random walks on sentence-based graphs help in text summarization (TS)
Extractive summarization vs. abstractive summarization
Assess the centrality of each sentence in a cluster and extract the most important ones to include in the summary
Centrality measures: degree, LexRank with threshold, and continuous LexRank
Vertices represent sentences, and edges are defined in terms of the similarity relation between pairs of sentences
Toolkit: MEAD
Test data: DUC 2003, 2004
CENTROID-BASED SUMMARIZATION
Centroid of the document cluster in a vector space
The centroid is a pseudo-document which consists of words that have tf*idf scores above a predefined threshold
The sentences that contain more words from the centroid are considered central
CENTRALITY-BASED SENTENCE SALIENCE
Hypothesis: the sentences that are similar to many of the other sentences are more central/salient to the topic
Cosine similarity between two sentences:
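The formula itself is not reproduced in this transcript. A common form for sentence graphs is a cosine over tf*idf-weighted bag-of-words vectors, sketched below. The details are assumptions: sentences are pre-tokenized lists of words, idf is computed as log(N / n_w) over the sentence collection itself, and the function names are hypothetical.

```python
import math
from collections import Counter


def idf_table(sentences):
    """idf(w) = log(N / n_w), with N sentences and n_w sentences containing w."""
    n = len(sentences)
    doc_freq = Counter(w for sentence in sentences for w in set(sentence))
    return {w: math.log(n / doc_freq[w]) for w in doc_freq}


def idf_cosine(x, y, idf):
    """Cosine similarity between two token lists under tf*idf weighting."""
    tf_x, tf_y = Counter(x), Counter(y)
    shared = tf_x.keys() & tf_y.keys()
    numerator = sum(tf_x[w] * tf_y[w] * idf.get(w, 0.0) ** 2 for w in shared)
    norm_x = math.sqrt(sum((tf_x[w] * idf.get(w, 0.0)) ** 2 for w in tf_x))
    norm_y = math.sqrt(sum((tf_y[w] * idf.get(w, 0.0)) ** 2 for w in tf_y))
    if norm_x == 0.0 or norm_y == 0.0:
        return 0.0  # a sentence made only of zero-idf words has no direction
    return numerator / (norm_x * norm_y)
```

Computing `idf_cosine` for every pair of sentences gives the similarity matrix on which the centrality measures operate.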
DEGREE CENTRALITY
Significantly similar sentences are connected to each other
Choice of cosine threshold
EIGENVECTOR CENTRALITY AND LEXRANK
PageRank
Sum of each neighbor’s prestige divided by its degree
d: damping factor; set to 0.85
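The thresholded variant can be sketched with a plain power iteration. This sketch assumes the standard PageRank convention, p(u) = (1-d)/N + d · Σ p(v)/deg(v) over the neighbors v of u, so that d = 0.85 is the weight of the link-following term; the slide does not spell out which convention it uses. It also assumes a symmetric similarity matrix and gives every sentence a self-link. The name `lexrank` and the default threshold are illustrative.

```python
def lexrank(sim, threshold=0.1, d=0.85, tol=1e-8, max_iter=1000):
    """Score sentences by PageRank on the thresholded similarity graph.

    sim is a symmetric N x N list of lists of pairwise similarities.
    Returns a list of N scores that sum to 1.
    """
    n = len(sim)
    # Sentences are neighbors if their similarity exceeds the threshold;
    # each sentence is also its own neighbor (an assumption), so deg >= 1.
    adj = [[j for j in range(n) if j == i or sim[i][j] > threshold]
           for i in range(n)]
    deg = [len(a) for a in adj]
    p = [1.0 / n] * n
    for _ in range(max_iter):
        new = [(1.0 - d) / n + d * sum(p[v] / deg[v] for v in adj[u])
               for u in range(n)]
        delta = sum(abs(a - b) for a, b in zip(new, p))
        p = new
        if delta < tol:
            break
    return p
```

Sentences are then ranked by score, and the top-ranked ones are candidates for the extractive summary.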
CONTINUOUS LEXRANK
Improve by using the strength of the similarity links
CENTRALITY VS CENTROID
1. Centrality accounts for information subsumption among sentences
2. It prevents unnaturally high idf scores from boosting the score of a sentence that is unrelated to the topic
EXPERIMENT
Data set: DUC 2003 and 2004
Evaluation method: ROUGE
MEAD toolkit
  Feature extraction: centroid, position, and length
  Relative weight
  Combiner
  Reranker
RESULTS AND DISCUSSION
Effects of threshold
COMPARISON OF CENTRALITY MEASURES
EXPERIMENT ON NOISY DATA
Evolution of document networks
(Menczer, 2004)
BACKGROUND
Content similarity
Link probability approximated by a link similarity metric (Jaccard coefficient)
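The Jaccard coefficient of two link neighborhoods is the size of their intersection divided by the size of their union. A minimal sketch, assuming each page's neighborhood is already available as a set of page identifiers; the identifiers in the usage example are hypothetical, and returning 0.0 for two empty neighborhoods is a choice made here.

```python
def jaccard(neighborhood_a, neighborhood_b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of two link neighborhoods."""
    a, b = set(neighborhood_a), set(neighborhood_b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty neighborhoods
    return len(a & b) / len(union)
```

For example, `jaccard({"p1", "p2", "p3"}, {"p2", "p3", "p4"})` compares two pages that share two of the four pages in their combined neighborhoods.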
JOINT DISTRIBUTION MAPS
DEPENDENCY OF THE WEB’S LINK TOPOLOGY ON CONTENT
Conditional probability that the link neighborhood between two web pages is above some threshold λ, given that the two pages have some content similarity κ, as a function of κ:
Phase transition around κ*
For κ > κ*, the probability that two pages are neighbors does not seem to depend on their content similarity; for κ < κ*, the probability decreases according to a power law Pr(λ|κ) ~ κ^γ
MODEL
At each step t, one new page t is added, and m new links are created from t to m existing pages, each selected from {i, i<t} with probability:
(m, κ*, and γ are constants and c is a normalization factor)
VALIDATING PRIOR MODELS
Look for a model capable of predicting both the degree distribution and the similarity distributions among linked documents
DEGREE-SIMILARITY MIXTURE MODEL
At each step, one new document is added, and m new links or references are created from it to existing documents.
At time t, the probability that the i-th document is selected and linked from the t-th document is:
α is a preferential attachment parameter
VALIDATE THE MODEL
CONCLUSION
Page content cannot be neglected when we try to understand the evolution of document networks.
The tension between referring to popular versus related documents
Questions?