Language networks

LANGUAGE NETWORKS

SI/EECS 767Yang LiuJanuary 29, 2010

INTRODUCTION Human language described as a

complex network

(Sole et al, 05)

INCENTIVES Analyzing statistical properties Building models to explain the patterns Studying the origins and evolution of

human language Statistical approaches to natural

language processing

CATEGORIZATION Words as vertices

Co-occurrence networks (Dorogovtsev & Mendes, 2001; Masucci & Rodgers, 2008) Semantic networks (Steyvers, Tenenbaum, 2005) Syntactic networks (Cancho et al., 2004)

Sentences as vertices (Erkan & Radev, 2004)

Documents as vertices (Menzer, 2004)

CO-OCCURENCE AND SYNTACTIC NETWORKS

SEMANTIC NETWORKS

Language as an evolving word web

(Dorogovtsev & Mendes, 2001)

INTRODUCTION Propose a theory of how language

evolves Treat human language as a complex

network of distinct words Words are connected with nearest

neighbors (co-occurrence networks) Papers of Ferrer & Sole (2001, 2002) degree distribution consists of two power-

law parts with different exponent

THE MODEL Preferential attachment

Provide power-law degree distribution Average degree does not change

The total number of connections increases more rapidly than the number of vertices and the average degree grows

THE MODEL At each time step,

a new vertex (word) is added; t is the total number of vertices, plays the role

of time; connect it with some old one i with the

probability proportional to its degree ki; ct new edges emerge between old words (c

is a constant coefficient) These new edges emerge between vertices i

and j with the p ~ ki kj

DATA Two word webs by Ferrer and Sole

(2001, 2002) Obatain ¾ of a million words of the

British National Corpus 470 000 vertices Average degree = 72

SOLVING THE MODEL Continuum approximation k(s,t) : the average degree of the vertices

born at time s and observed at time t

Ct ≈ 70 >>1

SOLVING THE MODEL

The degree distribution has two regions separated by the crossover point

SOLVING THE MODEL Below this point, stationary degreedistribution Above this point,Non-stationanry degree distribution

Empty and filled circles show the degree distributions for two word webs by Ferrer and Sole (2001, 2002)

DISCUSSION Interested only in degree distribution Clustering coefficients not match The total number of words of degree

greater than kcross does not change The size of kernel lexicon does not

depend on the total number of distinct words in language

Network properties of written human language

(Masucci & Rodgers, 2008)

TOPOLOGY OF THE NETWORK The words (include punctuations) are

vertices and two vertices are linked if they are neighbors.

Directed network

NETWORK STATISTICS 8992 vertices, 117687 edges, mean

degree <k> = 13.1 P(k) ∝k -1.9

Zipf’s law slope -1.2

GROWTH PROPERTIES The number of edges between words

grows faster than the number of vertices.

N(t) ∝ t 1.8

NEAREST NEIGHBOR’S PROPERTIES The mean cluster coefficient <c> =

0.19

REPEATED BINARY STRUCTURES OF WORDS

Reproduce by local PA

THE MODELS (D-M MODEL) Starts with a chain of 20 connected

vertices At each time add a new vertex and

connect it to some vertex i with p ∝ ki

m(t) -1 new edges emerge between old words with p ∝ ki kj

D-M MODEL <c(k)> = 0.16 Catches the average clustering and the global

growth behavior Misses the internal structure

MODEL 2 Include local PA P(t) ≈ 0.1t0.16 Start with a chain of 20 connected vertices At each time add a new vertex and connect it to

some vertex i (not nearest neighbors) with p ∝ ki

m(t) -1 times, with probability p(t) link the last vertex to an old vertex i (in its nearest neighborhood) through local PA (p ∝ ki); with 1 – p(t), link an old vertex i (not part of its nearest neighborhood) with global PA

MODEL 2 <c> = 0.08 Catches the global and nearest

neighbor behavior but not the average cluster coeffient

MODEL 3 Different words in written human

language display different statistical distributions, according to their functions

MODEL 3 Start with a chain of 20 connected vertices At each time add a new vertex and connect it to

some vertex i (not nearest neighbors) with p ∝ ki

m(t) -1 times, with probability q= 0.05, link the last linked vertex to one of the three fixed vertices; with probability p(t) link the last vertex to an old vertex i (in its nearest neighborhood) through local PA (p ∝ ki); with 1 – p(t) – 3q, link an old vertex i (not part of its nearest neighborhood) with global PA.

MODEL 3 <c> = 0.20

CONCLUSIONS New growth mechanisms: 1.local PA 2.the allocation of a set of preselected

vertices

The large-scale structure of semantic networks: Statistical

analyses and a model of semantic growth

(Steyvers & Tenebaum, 2005)

INTRODUCTION There are general principles governing

the structure of network representations for natural language semantics

The small-world structure arise from a scale-free organization

MODEL Concepts enter the network early are

expected to show higher connectivity One aspect of semantic development –

growth of semantic networks by differentiations of existing nodes

The model grows through a process of differentiation analogous to mechanisms of mechanic development which allows it to produce both small-world and scale-free structure.

ANALYSIS OF SEMANTIC NETWORKS Free association norms WordNet Roget’s thesaurus

METHODS Associative networks

Created two networks: directed, undirected

ROGET’S THESAURUS Bipartite graph

Word nodes and semantic category nodes A connection is made between a word and

category node when the word falls into the semantic category

Convert to a simple graph for calculating cc( one-mode projection)

WORDNET 120,000+ word forms 99,000+ word meanings Links between forms and forms,

meanings and meaning, forms and meanings

Treat as an undirected graph

RESULTS

ZIPF’S “LAW OF MEANING”

GROWING NETWORK MODEL Previous models

BA model: low cc WS model: no scale-free structure

MODEL A: UNDIRECTED

At each time step, a new node with M links is added to the network by randomly choosing some existing node i for differentiation,

and then connecting the new node to M randomly chosen nodes in the semantic neighborhood of node i.

TWO PROBABILITY DISTRIBUTION

Set n equal to the size of the target network

Set M equal to ½ <k>

MODEL B: DIRECTED Assume the direction of each arc is

chosen randomly and independently of the other arcs

Point toward old node with probability α, point toward new node with probability 1-α

RESULTS Only test on association networks with

Model A and B set α = 0.95 Average of 50 simulations

Patterns in syntactic dependency networks

(Ferrer et al., 2004)

INTRODUCTION Co-occurrence networks fail in

capturing the characteristic long-distance correlations of words in sentences

The proportion of incorrect syntactic dependency links is high

Require a precise definition of syntactic link

THE SYNTACTIC DEPENDENCY NETWORK Defined according to the dependency

grammar formalism Vertices are words, links go from the

modifier to its head

CORPORA 1. Czech corpus

Proportion of links is about 0.65 (missing links between function words)

Performed by hand 2. Romanian corpus

Performed by hand 3. German corpus

Proportion of links is about 0.16 (obey no regularity)

Performed automatically

NETWORK PROPERTIES Small world structure

Small average path length D and high cluster coefficient C Heterogeneity

Power-law degree distribution Hierarchical organization

C(k) ~ k -θ

Betweenness centrality P(g) ~ g –η

Assortativeness

RESULTS

RESUTS

GLOBAL VS SENTENCE-LEVEL PATTERNS

DISCUSSIONS(1) Disassortative mixing tells us that labor is

divided in human language. Linking words tend to avoid connections among them.

(2) Hierarchical organization tells us that syntactic dependency networks not only define the syntactically correct links but also a top down hierarchical organization that is the basis of phrase structure formalisms.

(3) Small worldness is a necessary condition for recursion.

Lexrank: graph-based lexical centrality as salience in text

summarization

(Erkan & Radev, 2004)

INTRODUCTION Graph-based methods in NLP Random walks on sentence-based

graphs help in Text Summarization (TS) Extractive summarization VS

abstractive summarization Assess the centrality of each sentence

in a cluster and extract the most important ones to include the summary

Centrality measures: Degree, LexRank with threshold, and continuous Lexrank

Vertices represent sentences and edges are defined in terms of the similarity relation between pairs of sentences

Toolkit MEAD Test data DUC 2003, 2004

CENTROID-BASED SUMMARIZATION Centroid of the document cluster in a

vector space The centroid is a pseudo-document

which consists of words that have tf*idf scores above a predefined threshold

The sentences that contain more words from the centroid are considered as central

CENTRALITY-BASED SENTENCE SALIENCE Hypothesis: the sentences that are

similar to many of the other sentences are more central/salient to the topic

Cosine similarity between two sentences:

DEGREE CENTRALITY Significantly similar sentences are

connected to each other Choice of cosine threshold

EIGENVECTOR CENTRALITY AND LEXRANK PageRank

Sum of neighbor’s divided prestige d: dumping factor; set to 0.85

CONTINUOUS LEXRANK Improve by using the strength of the

similarity links

CENTRALITY VS CENTROID 1. centrality accounts for information

subsumption among sentences 2. it prevents unnaturally high idf

scores from boosting up the score of a sentence that is unrelated to the topic

EXPERIMENT Data set: DUC 2003 and 2004 Evaluation method: ROGUE MEAD toolkit

The feature extraction Centroid, position and length Relative weight

Combiner reranker

RESULTS AND DISCUSSION Effects of threshold

COMPARISON OF CENTRALITY MEASURES

EXPERIMENT ON NOISY DATA

Evolution of document networks

(Menczer, 2004)

BACKGROUD Content similarity

Link probability approximated by link similarity metric (Jaccard coefficient)

JOINT DISTRIBUTION MAPS

DEPENDENCY OF THE WEB’S LINK TOPOLOGY ON CONTENT Conditional probability that the link

neighborhood between two web pages is above some threshold λ, given that the two pages have some content similarity κ, as a function of κ :

Phase transition around κ* For κ>κ*, the probability that two pages are

neighbors does not seem to depend on their content similarity; for κ<κ *, the probability decreases according to a power-law Pr(λ |κ) ~ κγ

MODEL At each step t one new page t is added, and m new

links are created from t to m existing pages, each selected from {i, i<t} with probability:

(m, κ*, and γ are constants and c is a nomorlization factor)

VALIDATING PRIOR MODELS

Look for a model capable of predicting both the degree distribution and the similarity distributions among linked documents

DEGREE-SIMILARITY MIXTURE MODEL At each step, one new document is added,

and m new links or references are created from it to existing documents.

At time t the probability that the i th document is selected and linked from the tth document is:

α is a preferential attachment parameter

VALIDATE THE MODEL

CONCLUSION Page content cannot be neglected

when we try to understand the evolution of document networks.

The tension between referring to popular

versus related documents

Questions?

Language networks

Documents

Transcript of Language networks