arXiv:1907.12293v7 [cs.CL] 12 Jul 2021
A mathematical model for universal semantics
Weinan E and Yajun Zhou
Abstract—We characterize the meaning of words with language-independent numerical fingerprints, through a mathematical analysis
of recurring patterns in texts. Approximating texts by Markov processes on a long-range time scale, we are able to extract topics,
discover synonyms, and sketch semantic fields from a particular document of moderate length, without consulting an external knowledge base or thesaurus. Our Markov semantic model allows us to represent each topical concept by a low-dimensional vector,
interpretable as algebraic invariants in succinct statistical operations on the document, targeting local environments of individual words.
These language-independent semantic representations enable a robot reader to both understand short texts in a given language
(automated question-answering) and match medium-length texts across different languages (automated word translation). Our
semantic fingerprints quantify local meaning of words in 14 representative languages across 5 major language families, suggesting a
universal and cost-effective mechanism by which human languages are processed at the semantic level. Our protocols and source
codes are publicly available at https://github.com/yajun-zhou/linguae-naturalis-principia-mathematica
Index Terms—recurring patterns in texts, semantic model, recurrence time, hitting time, word translation, question answering
1 INTRODUCTION
A QUANTITATIVE model for the meaning of words not only helps us understand how we transmit information and absorb knowledge, but also provides a foothold for algorithms in the machine processing of natural language texts. Ideally, a universal mechanism of semantics should be based on numerical characteristics of human languages, transcending concrete written and spoken forms of verbal messages. In this work, we demonstrate, in both theory and practice, that the time structure of recurring language patterns is a good candidate for such a universal semantic mechanism. Through statistical analysis of recurrence times and hitting times, we numerically characterize the connectivity and association of individual concepts, thereby devising language-independent semantic fingerprints (LISF).

Concretely speaking, we define semantics through algebraic invariants of a stochastic text model that approximately governs the empirical hopping rates on a web of word patterns. Such a stochastic model explains the distribution of recurrence times and outputs recurrence eigenvalues as semantic fingerprints. Statistics of recurrence times allow machines to tell non-topical words from topical ones. A comparison of hitting and recurrence times further generates quantitative fingerprints for topics, enabling machines to overcome language barriers in translation tasks and perform associative reasoning in comprehension tasks, like humans.
Akin to the physical world, there is a hierarchy of length scales in languages. On short scales such as syllables, words, and phrases, human languages do not exhibit a universal pattern related to semantics. Except for a few onomatopoeias, the sounds of words do not affect their meaning [1]. Neither do morphological parameters [2] (say, singular/plural, present/past) or syntactic rôles [3] (say, subject/object, active/passive). In short, there are no universal semantic mechanisms at the phonological, lexical or syntactical levels [4]. Grammatical "rules and principles" [2], [3], however typologically diverse, play no definitive rôle in determining the inherent meaning of a word.

• Weinan E is with the Department of Mathematics and the Program in Applied and Computational Mathematics, Princeton University, and the Beijing Institute of Big Data Research. E-mail: [email protected]
• Yajun Zhou is with the Beijing Institute of Big Data Research. E-mail: [email protected]

Motivated by the observations above, we will build our quantitative semantic model on long-range and language-independent textual features. Specifically, we will measure the lengths of text fragments flanked by word patterns of interest (Fig. 1). Here, a word pattern is a collection of content words that are identical up to morphological parameters and syntactic rôles. A content word signifies definitive concepts (like apple, eat, red), instead of serving purely grammatical or logical functions (like but, of, the). Fragment length statistics will tell us how tightly/loosely one concept is connected to another. This, in turn, will provide us with quantitative criteria for inclusion/exclusion of different concepts within the same (computationally constructed) semantic field. Such statistical semantic mining will then pave the way for machine comprehension and machine translation.
2 METHODOLOGY
We quantify the time structure of an individual word pattern Wi through the statistics of its recurrence times τii. We characterize the dynamic impact of a word pattern Wi on another word pattern Wj by the statistics of their hitting times τij. In what follows, we will describe the statistical analyses of τii and τij, on which we build a language-independent Markov model for semantics.
2.1 Recurrence times and topicality
Assuming uniform reading speed,1 we measure the recurrence times τii for a word pattern Wi through nii samples

1. On the scale of words (rather than phonemes), this assumption works fine in most languages that are written alphabetically. However, this working hypothesis does not extend to Japanese texts, which interlace Japanese syllabograms (lasting one mora per written unit) with Chinese ideograms (lasting one or more morae per written unit).
[Figure 1: sample text with occurrences of the word patterns Wi := happ(ier|ily|iness|y) ≡ {happier, happily, happiness, happy} and Wj := marr(iage|ied|y) ≡ {marriage, married, marry} highlighted, and the effective fragment lengths Lii and Lij marked on the intervening text fragments.]

Fig. 1. Counting long-range transitions between word patterns. A transition from Wi to Wj counts towards long-range statistics if the underlined text fragment in between contains no occurrences of Wi, and lasts strictly longer than the longest word in Wi ∪ Wj. For each long-range transition, the effective fragment length Lij discounts the length of the longest word in Wi ∪ Wj.
[Figure 2, panels (a)–(e).]

Fig. 2. Statistical analysis of recurrence times and topicality. (a) Barcode representations (adapted from [5, Fig. 2]) for the coverage of Wi = Jane(∅|'s) (291 occurrences) and Wj = than (282 occurrences) in the whole text of Pride and Prejudice. The horizontal axis scales linearly with respect to the text length, measured in the number of constituting letters, spaces and punctuation marks. (b) Counts of the word than within a consecutive block of 1217 words (spanning about 1% of the entire text), drawn from 1000 randomly chosen blocks, fitted to a Poisson distribution with mean 2.776 (blue curve). (c) Histogram of effective fragment lengths Lii (see Fig. 1 for the definition) for the topical pattern Wi = Jane(∅|'s), fitted to an exponential distribution (blue line in the semi-log plot) and a weighted mixture of two exponential distributions $c_1 k_1 e^{-k_1 t} + c_2 k_2 e^{-k_2 t}$ (red curve, with c1 : c2 ≈ 1 : 3, k1 : k2 ≈ 1 : 7). (d) Histogram of Ljj for the function word Wj = than, fitted to an exponential distribution (blue line in the semi-log plot). All the parameter estimators in panels b–d are based on maximum likelihood. (c′)–(d′) Reinterpretations of panels c–d, with logarithmic binning on the horizontal axes, to give fuller coverage of the dynamic ranges of the statistics. (e) Recurrence statistics for word patterns in Jane Austen's Pride and Prejudice, where ⟨· · ·⟩ denotes averages over nii samples of long-range transitions. Data points in gray, green and red have radii $\frac14\sqrt{n_{ii}}$. Labels for proper names and some literary motifs are attached next to the corresponding colored dots. Jensen's bound (green dashed line) has unit slope and zero intercept. Exponentially distributed recurrence statistics reside on the line of Poissonian banality (blue line), with unit slope and negative intercept. Red (resp. green) dots mark significant downward (resp. upward) departures from the blue line.
of the effective fragment lengths Lii (Figs. 1, 2a). Here, while counting as in Fig. 1, we ignore contacts between short-range neighbors, which may involve language-dependent redundancies.2
2.1.1 Recurrence of non-topical patterns
In a memoryless (hence banal) Poisson process (Fig. 2b), recurrence times are exponentially distributed (Fig. 2d,d′). The same is also true for word recurrence in a randomly reshuffled text [5]. If we have nii independent samples of exponentially distributed random variables Lii, then the statistic $\delta_i := \log\langle L_{ii}\rangle - \langle\log L_{ii}\rangle - \gamma_0 + \frac{1}{2n_{ii}}$ satisfies the inequality

$$|\delta_i| < \frac{2}{\sqrt{n_{ii}}}\sqrt{\frac{\pi^2}{6} - 1 - \frac{1}{2n_{ii}}} \qquad (1)$$

with probability 95% (see Theorem 1 in Appendix A for a two-sigma rule). Here, $\gamma_0 := \lim_{n\to\infty}\bigl(-\log n + \sum_{m=1}^{n}\frac{1}{m}\bigr)$ is the Euler–Mascheroni constant.

As a working definition, we consider a word pattern Wi non-topical if its nii counts of effective fragment lengths Lii are exponentially distributed, P(Lii > t) ∼ e^{−kt}, within 95% margins of error [that is, satisfying (1) above].

2. For example, a German phrase liebe Studentinnen und Studenten with short-range recurrence is the gender-inclusive equivalent of the English expression dear students. Some Austronesian languages (such as Malay and Hawaiian) use reduplication for plurality or emphasis.
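For concreteness, the banality test in (1) can be carried out in a few lines. The sketch below is our own illustration (not taken from the released repository; variable names are ours): it takes the nii effective fragment lengths of one word pattern and reports whether the pattern passes the two-sigma test for Poissonian banality.

```python
import numpy as np

EULER_GAMMA = 0.5772156649015329  # gamma_0

def is_non_topical(L_ii):
    """Two-sigma test of Poissonian banality, following Eq. (1).

    L_ii : array of effective fragment lengths (recurrence times) of one pattern.
    Returns (non_topical, delta_i, two_sigma_bound)."""
    L = np.asarray(L_ii, dtype=float)
    n = L.size
    # delta_i := log<L_ii> - <log L_ii> - gamma_0 + 1/(2 n_ii)
    delta = np.log(L.mean()) - np.log(L).mean() - EULER_GAMMA + 1.0 / (2 * n)
    # two-sigma error margin from Eq. (1)
    bound = (2.0 / np.sqrt(n)) * np.sqrt(np.pi**2 / 6 - 1 - 1.0 / (2 * n))
    return abs(delta) < bound, delta, bound

rng = np.random.default_rng(0)
# Exponentially distributed recurrence times (a non-topical pattern) pass the test,
print(is_non_topical(rng.exponential(scale=500.0, size=300)))
# while a 1:3 mixture of two widely separated decay rates (cf. Fig. 2c) fails it.
mixture = np.concatenate([rng.exponential(7000.0, 75), rng.exponential(1000.0, 225)])
print(is_non_topical(mixture))
```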
2.1.2 Recurrence of topical patterns
In contrast, we consider a word pattern Wi topical if its diagonal statistics nii, Lii constitute a significant departure from the Poissonian line ⟨log Lii⟩ − log⟨Lii⟩ + γ0 = 0 (Fig. 2e, blue line), violating the bound in (1).

Notably, most data points for topics (colored dots in Fig. 2e) in Jane Austen's Pride and Prejudice mark systematic downward departures from the Poissonian line. This suggests that the topical recurrence times τ = Lii follow
[Figure 3, panels (a), (b), (b′), (c).]

Fig. 3. Automated topic extraction and raw alignment across bilingual corpora. (a) Schematic diagram illustrating our graphical representation of morphologically related words (identified by supervised algorithms in Supplementary Materials) in a word pattern. To avoid unprintably small characters, rarely occurring forms (less than 5% of the total sum of all the words ranked above) are ignored in the graphical display. To enhance the visibility of word stems, we print shared letters only once, and compress other letters vertically, with heights proportional to their corresponding word counts. (b) Word patterns Wi in Jane Austen's Pride and Prejudice, sorted by descending nii, with font size proportional to the square root of $e^{-\langle\log L_{ii}\rangle}$ (a better indicator of a reader's impression than the number of recurrences $n_{ii}\propto e^{-\log\langle L_{ii}\rangle}$). Topical (that is, significantly non-Poissonian) patterns painted in red (resp. green) reside below (resp. above) the critical line of Poissonian banality (blue line in Fig. 2e), where the deviations exceed the error margin prescribed in (1) of §2.1. (b′) A similar service on a French version of Pride and Prejudice (tr. Valentine Leconte & Charlotte Pressoir). (c) A low-cost and low-yield word translation, based on chapter-wise word counts $\mathbf b^{\mathrm{en}}_i$ and $\mathbf b^{\mathrm{fr}}_j$. Ružicka similarities $s_R(\mathbf b^{\mathrm{en}}_i, \mathbf b^{\mathrm{fr}}_j)$ between selected topics (sorted by descending nii ≥ 20) in English and French versions of Pride and Prejudice are shown as a heat map. Rows and columns with maximal $s_R(\mathbf b^{\mathrm{en}}_i, \mathbf b^{\mathrm{fr}}_j)$ less than 0.7 are not shown. Correct matchings are indicated by green cross-hairs.
weighted mixtures of exponential distributions (Fig. 2c,c′):

$$\mathbb P(\tau > t) \sim \sum_m c_m e^{-k_m t}, \qquad (2)$$

(where $c_m, k_m > 0$ and $\sum_m c_m = 1$), which impose an inequality constraint on the recurrence time τ = Lii:

$$\langle\log L_{ii}\rangle - \log\langle L_{ii}\rangle + \gamma_0 = \sum_m c_m \log\frac{1}{k_m} - \log\sum_m \frac{c_m}{k_m} \le 0, \qquad (3)$$

where the last inequality follows from Jensen's inequality, because the logarithm is concave.
2.1.3 Raw alignment of topical patterns
If a word pattern Wi qualifies as a topic by our definition (Fig. 3b,b′), then the signals in its coarse-grained timecourse (say, a vector bi = (bi,1, . . . , bi,61) representing word counts in each chapter of Pride and Prejudice) are not overwhelmed by Poisson noise.

This vectorization scheme, together with the Ružicka similarity [6]

$$s_R(\mathbf b^A_i, \mathbf b^B_j) := \frac{\|\mathbf b^A_i \wedge \mathbf b^B_j\|_1}{\|\mathbf b^A_i \vee \mathbf b^B_j\|_1} \qquad (4)$$

between two vectors with non-negative entries, allows us to align some topics found in parallel versions of the same document, in languages A and B (Fig. 3c). Here, in the definition of the Ružicka similarity, ∧ (resp. ∨) denotes the component-wise minimum (resp. maximum) of vectors; ‖b‖1 sums over all the components of b.
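As an illustration (our own sketch, not the authors' released code; the chapter texts and word forms below are placeholders), the chapter-count vectorization and the Ružicka similarity in (4) amount to:

```python
import numpy as np

def chapter_counts(chapter_texts, word_forms):
    """b_i: per-chapter occurrence counts of a word pattern W_i,
    given as the set of its inflected forms (cf. Fig. 3a)."""
    forms = {w.lower() for w in word_forms}
    return np.array([sum(token in forms for token in chapter.lower().split())
                     for chapter in chapter_texts], dtype=float)

def ruzicka(b_a, b_b):
    """Ruzicka similarity s_R = ||min(b_a, b_b)||_1 / ||max(b_a, b_b)||_1, Eq. (4)."""
    den = np.maximum(b_a, b_b).sum()
    return np.minimum(b_a, b_b).sum() / den if den > 0 else 0.0

# Toy usage on two aligned two-"chapter" texts (placeholders, not the novel).
en_chapters = ["jane was happy , and jane smiled", "bingley spoke to jane"]
fr_chapters = ["jane etait heureuse , et jane souriait", "bingley parla a jane"]
b_en = chapter_counts(en_chapters, {"jane", "jane's"})
b_fr = chapter_counts(fr_chapters, {"jane"})
print(b_en, b_fr, ruzicka(b_en, b_fr))  # [2. 1.] [2. 1.] 1.0
```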
2.2 Markov text model
2.2.1 Transition probabilities via pattern analysis
The diagonal statistics nii, Lii (Fig. 1) have enabled us to extract topics automatically through recurrence time analysis (Figs. 2e and 3b,b′). The off-diagonal statistics nij, Lij (Fig. 1) will allow us to determine how strongly one word pattern Wi binds to another word pattern Wj, through hitting time analysis. In an empirical Markov matrix P = (pij), the long-range transition rate pij is estimated by

$$p_{ij} := \frac{n_{ij}\, e^{-\langle\log L_{ij}\rangle}}{\sum_{k=1}^{N} n_{ik}\, e^{-\langle\log L_{ik}\rangle}}, \qquad (5)$$
[Figure 4, panels (a)–(c); color legend for the four language versions: English, French, Russian, Finnish.]

Fig. 4. Quantitative properties of the Markov text model. (a) Dominant eigenvector π of a 100 × 100 Markov matrix P, computed from one of the four versions of Pride and Prejudice, in comparison with π∗, the list of normalized frequencies for the top 100 word patterns. (b) Precipitous decays of $r_n := \frac12\sum_{1\le i,j\le 100}\bigl|\pi_i p^{(n)}_{ij} - \pi_j p^{(n)}_{ji}\bigr|$ from the initial value r1 ≈ 0.07, for matrix powers $\mathbf P^n = (p^{(n)}_{ij})_{1\le i,j\le 100}$ constructed from four versions of Pride and Prejudice. (In contrast, one has r1 ≈ 0.33 for a random 100 × 100 Markov matrix.) Such quick relaxations support our working hypothesis about detailed balance $\pi^*_i p^*_{ij} = \pi^*_j p^*_{ji}$. (c) Distributions of eigenvalues λ of empirical Markov matrices P, with nearly language-independent modulus |λ(P)| and phase angle arg λ(P).
where nij counts the number of long-range transitions from Wi to Wj, and Lij is a statistic that measures the effective fragment lengths of such transitions (Fig. 1).
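Assuming the counts nij and the averages ⟨log Lij⟩ have already been tallied as in Fig. 1, the estimator (5) is a one-line row normalization. The following sketch is our own illustration, with illustrative variable names, rather than the released implementation:

```python
import numpy as np

def empirical_markov_matrix(n, mean_log_L):
    """Eq. (5): p_ij = n_ij * exp(-<log L_ij>), normalized along each row.

    n          : (N, N) array of long-range transition counts n_ij
    mean_log_L : (N, N) array of <log L_ij>; entries with n_ij == 0 are ignored
    """
    w = np.where(n > 0, n * np.exp(-mean_log_L), 0.0)  # unnormalized hopping rates
    row_sums = w.sum(axis=1, keepdims=True)
    return np.divide(w, row_sums, out=np.zeros_like(w), where=row_sums > 0)

# Toy usage with 3 word patterns.
n = np.array([[5, 3, 1], [2, 6, 2], [1, 1, 4]], dtype=float)
mean_log_L = np.log(np.array([[800.0, 1500.0, 4000.0],
                              [1200.0, 700.0, 3000.0],
                              [5000.0, 2500.0, 900.0]]))
P = empirical_markov_matrix(n, mean_log_L)
print(P.round(3), P.sum(axis=1))  # each row sums to 1
```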
2.2.2 Equilibrium state and detailed balance
Numerically, we find that our empirical Markov matrix P = (pij) defined in (5) is a fair approximation to an ergodic3 matrix P∗ = (p∗ij), which in turn governs the stochastic hoppings between content word patterns during text generation.

Each ergodic Markov matrix $\mathbf P^* = (p^*_{ij})_{1\le i,j\le N}$ possesses a unique equilibrium state $\boldsymbol\pi^* = (\pi^*_i)_{1\le i\le N}$. The equilibrium state π∗ represents a probability distribution (that is, $\pi^*_i \ge 0$ for $1\le i\le N$ and $\sum_{i=1}^{N}\pi^*_i = 1$) that satisfies π∗P∗ = π∗ (that is, $\sum_{1\le i\le N}\pi^*_i p^*_{ij} = \pi^*_j$ for $1\le j\le N$). In our numerical experiments, the dominant eigenvector π (satisfying πP = π) consistently reproduces word frequency statistics that are proportional to the ideal equilibrium state π∗ (Fig. 4a).

Furthermore, through numerical experimentation, we find that our empirical Markov matrix P ≈ P∗ approximately honors the detailed balance condition $\pi^*_i p^*_{ij} = \pi^*_j p^*_{ji}$ for $1\le i,j\le N$. The approximation $\pi_i p^{(n)}_{ij} \approx \pi_j p^{(n)}_{ji}$ becomes closer as we go to higher iterates $\mathbf P^n = (p^{(n)}_{ij})$, where n is a small positive integer (Fig. 4b).

On an ergodic Markov chain with detailed balance, one can show that recurrence times are distributed as weighted mixtures of exponential decays (see Theorem 3 in Appendix B.3), thus offering a theoretical explanation for (2).

3. If a Markov chain is ergodic, then there is a strictly positive probability to transition from any Markov state (that is, any individual word pattern in our model) to any other state, after finitely many steps.
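The relaxation diagnostic of Fig. 4b can be reproduced for any empirical matrix with a few lines of linear algebra. The sketch below is ours (a power iteration stands in for whatever eigensolver one prefers); it computes the dominant left eigenvector π and the defect r_n:

```python
import numpy as np

def stationary_distribution(P, n_iter=5_000):
    """Power iteration for the dominant left eigenvector, pi P = pi."""
    pi = np.full(P.shape[0], 1.0 / P.shape[0])
    for _ in range(n_iter):
        pi = pi @ P
        pi /= pi.sum()
    return pi

def detailed_balance_defect(P, n=1):
    """r_n = (1/2) * sum_ij |pi_i p^(n)_ij - pi_j p^(n)_ji| (cf. Fig. 4b)."""
    pi = stationary_distribution(P)
    Pn = np.linalg.matrix_power(P, n)
    flux = pi[:, None] * Pn          # probability flux pi_i p^(n)_ij
    return 0.5 * np.abs(flux - flux.T).sum()

# A reversible chain has r_n = 0; a random Markov matrix does not.
rng = np.random.default_rng(1)
Q = rng.random((100, 100))
Q /= Q.sum(axis=1, keepdims=True)
print([round(detailed_balance_defect(Q, n), 3) for n in (1, 2, 3)])
```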
2.2.3 Spectral invariance under translation
The spectrum σ(P) (the collection of eigenvalues) is approximately invariant against translations of texts (Fig. 4c), which can be explained by a matrix equation

$$\mathbf P_A\, \mathbf T_{A\to B} = \mathbf T_{A\to B}\, \mathbf P_B. \qquad (6)$$

Here, both sides of the identity above quantify the transition probabilities from words in language A to words in language B, from the impressions of Alice and Bob, two monolingual readers in a thought experiment. On the left-hand side, Alice first processes the input in her native language A by a Markov matrix PA, and then translates into language B, using a dictionary matrix TA→B; on the right-hand side, Bob needs to first translate the input into language B, using the same dictionary TA→B, before brainstorming in his own native language, using PB. Putatively, the matrix equation holds because semantic content is shared by native speakers of different languages. In the ideal scenario where translation is lossless (with invertible TA→B), the Markov matrices PA and PB are indeed linked to each other by a similarity transformation that leaves their spectrum intact.
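A quick numerical check of this similarity-transformation argument (our own illustration, under the idealized assumption that the dictionary is a lossless word-to-word map, i.e. a permutation matrix):

```python
import numpy as np

rng = np.random.default_rng(5)
N = 6
# A Markov matrix for "language B" and an invertible "dictionary" T (illustrative).
P_B = rng.random((N, N))
P_B /= P_B.sum(axis=1, keepdims=True)
T = np.eye(N)[rng.permutation(N)]            # lossless word-to-word dictionary
# Ideal lossless translation: P_A = T P_B T^{-1} satisfies P_A T = T P_B, Eq. (6).
P_A = T @ P_B @ np.linalg.inv(T)
print(np.allclose(P_A @ T, T @ P_B))          # True
# The two spectra then coincide up to ordering.
spectrum = lambda M: np.sort_complex(np.linalg.eigvals(M))
print(np.allclose(spectrum(P_A), spectrum(P_B)))  # True
```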
2.3 Localized Markov matrices and semantic cliques
2.3.1 Semantic contexts for recurrent topics
Specializing spectral invariance to individual topical patterns, we will be able to generate semantic fingerprints through a list of topic-specific and language-independent eigenvalues. Here, we will be particularly interested in recurrence eigenvalues of individual topical patterns, which correspond to multiple decay rates in the weighted mixtures of exponential distributions.

Unlike the single exponential decays associated to non-topical recurrence patterns, the multiple exponential decay modes will enable our robot reader to easily discern one topic from another. In general, it is numerically challenging to recover multiple exponential decay modes from a limited amount of recurrence time measurements [7]. However, in text processing, we can circumvent such difficulties by off-diagonal statistics nij and Lij that provide semantic contexts for individual topical patterns.
To quantitatively define the semantic content of a topical pattern Wi, we specify a local, directed, and weighted graph, corresponding to a localized Markov transition matrix P[i].
2.3.2 Localized Markov contexts of topical patterns
To localize, we need to remove edges between two vertices Wi and Wj when the hitting times Lij and Lji are "long enough" relative to what one could naïvely expect from the recurrence time statistics nij, nji and Lii, Ljj. Here, for the naïve expectation, we approximate the probability
[Figure 5, panels (a)–(c). Panel (a) plots P(⟨log Lij⟩ > ℓ) against ℓ for the word patterns W1 = Eliza(∅|beth|beth's), W2 = Darcy(∅|'s), W3 = pr(ide|ided|oud|oudly|oudest), W4 = Jane(∅|'s). Panel (c) reports, for 13 languages in 5 families, namely Indo-European (Danish, German, Dutch, Spanish, French, Latin, Polish, Russian), Koreanic (Korean), Turkic (Turkish), Uralic (Finnish, Hungarian) and Vasconic (Basque), the number of topics (on a 0–120 scale) translated correctly, closely, or incorrectly.]

Fig. 5. Semantic cliques and their applications to word translation. (a) Empirical distributions of ⟨log Lij⟩ in Pride and Prejudice, as gray and colored dots with radii $\frac14\sqrt{n_{ij}}$, compared to the Gaussian model αij(ℓ) (colored curves parametrized by (7) and (8)). The numerical samplings of Wj's exhaust all the textual patterns available in the novel, including topical word patterns, non-topical word patterns and function words. Only those textual patterns with over 40 occurrences are displayed as data points. The inset of each frame shows the semantic clique Si surrounding topic Wi (painted in black), color-coded by the αij(⟨log Lij⟩) score. The areas of the bounding boxes for individual word patterns are proportional to the components of π[i] (the equilibrium state of P[i]). (b) Distributions for the magnitudes of eigenvalues (LISF) in the recurrence matrices R[i], for three concepts from four versions of Pride and Prejudice. The color encoding for languages follows Fig. 4. The largest ⌊e^{ηi}⌋ magnitudes of eigenvalues are displayed as solid lines, while the remaining terms are shown as dashed lines. The inset of each frame shows the semantic clique Si, counterclockwise from top-left, in French, Russian and Finnish. (c) Yields from bipartite matching of LISF (see Fig. 6 for English–French) for topical words between the English original of Pride and Prejudice and its translations into 13 languages out of 5 language families.
P(⟨log Lij⟩ > ℓ) by a Gaussian model αij(ℓ) (colored curves in Fig. 5a):

$$\mathbb P(\langle\log L_{ij}\rangle > \ell) \approx \alpha_{ij}(\ell) := \sqrt{\frac{n_{ij}}{2\pi\beta_i}} \int_{\ell}^{\infty} e^{-\frac{n_{ij}(x-\ell_i)^2}{2\beta_i}}\,\mathrm d x, \qquad (7)$$

whose mean and variance are deducible from nij and Lii (see Theorem 4 in Appendix B.4):

$$\ell_i := \frac{\langle L_{ii}\log L_{ii}\rangle}{\langle L_{ii}\rangle} - 1, \qquad \beta_i := \frac{\langle L_{ii}(\ell_i - \log L_{ii})^2\rangle}{\langle L_{ii}\rangle}. \qquad (8)$$

The parameters in the Gaussian model are justified by the relation between hitting and recurrence times [8] on an ergodic Markov chain with detailed balance, and become asymptotically exact if distinct word patterns are statistically independent (such as α13, α24, α31, α34 in Fig. 5a). Here, statistical independence justifies the additivity of variances, hence the √nij factor in (7); sums of independent samples of log Lij become asymptotically Gaussian, thanks to the central limit theorem. Failing that, the actual ranking of ⟨log Lij⟩ may deviate from the Gaussian model prediction in (7), as for the intimately related pairs of words Elizabeth/Darcy, Elizabeth/Jane, Darcy/Elizabeth, Darcy/pride and pride/Darcy.
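A sketch of this scoring step (ours, with illustrative names), assuming the recurrence samples Lii and the off-diagonal summaries (nij, ⟨log Lij⟩) are already available: the parameters ℓi and βi come from (8), and αij evaluates the Gaussian tail in (7) at the observed ⟨log Lij⟩.

```python
import numpy as np
from math import erf, sqrt

def gaussian_params(L_ii):
    """ell_i and beta_i from Eq. (8), computed from recurrence samples L_ii."""
    L = np.asarray(L_ii, dtype=float)
    ell = (L * np.log(L)).sum() / L.sum() - 1.0
    beta = (L * (ell - np.log(L)) ** 2).sum() / L.sum()
    return ell, beta

def alpha_score(mean_log_L_ij, n_ij, ell_i, beta_i):
    """alpha_ij(<log L_ij>) from Eq. (7): tail probability of N(ell_i, beta_i/n_ij)."""
    z = (mean_log_L_ij - ell_i) / sqrt(beta_i / n_ij)
    return 0.5 * (1.0 - erf(z / sqrt(2.0)))

# Toy usage: an unusually short observed <log L_ij> (tight binding) scores high.
rng = np.random.default_rng(2)
L_ii = rng.exponential(2000.0, size=200)
ell_i, beta_i = gaussian_params(L_ii)
print(alpha_score(ell_i - 0.8, n_ij=40, ell_i=ell_i, beta_i=beta_i))  # close to 1
print(alpha_score(ell_i + 0.8, n_ij=40, ell_i=ell_i, beta_i=beta_i))  # close to 0
```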
2.3.3 Markov criteria for semantic cliques
Empirically, we find that higher αij(ℓ) scores point to closer affinities between word patterns (Fig. 5a), attributable to kinship (Elizabeth, Jane), courtship (Darcy, Elizabeth), disposition (Darcy, pride) and so on. Our robot reader automatically detects such affinities, without references other than the novel itself. Therefore, we can use the αij(ℓ) scores as guides to numerical approximations of semantic fields, hereafter referred to as semantic cliques.

We invite a topical pattern Wj to the semantic clique Si (Figs. 5a and b, insets) surrounding Wi if $\min\{\alpha_{ij}(\langle\log L_{ij}\rangle),\, \alpha_{ji}(\langle\log L_{ji}\rangle)\} > \alpha^*$ for a standard Gaussian threshold $\alpha^* := \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{1} e^{-x^2/2}\,\mathrm d x \approx 0.8413$. This operation emulates the brainstorming procedure of a human reader, who associates one word with another only when they stay much closer than two randomly picked words, according to his/her impression.
Indeed, by numerical brainstorming from Wi, our semantic cliques Si (Figs. 5a and b, insets) inform us about their center word Wi, through several types of semantic relations, including, but not limited to:

• Synonyms (pride and vanity in English, orgueil and fierté in French, etc.);
• Temperaments (Elizabeth, a delightful girl, often laughs, corresponding to the French verbs sourire and rire);
• Co-references (e.g. Darcy as a personification of pride);
• Causalities (such as pride based on fortune).
On a local graph with vertices $S_i = \{W_{i_1} = W_i, W_{i_2}, \ldots, W_{i_{N_i}}\}$, we specify the connectivity of each directed edge by a localized Markov matrix $\mathbf P^{[i]} = (p^{[i]}_{jk})_{1\le j,k\le N_i}$. This localized Markov matrix is the row-wise normalization of an Ni × Ni subblock of P with the same set of vertices as Si. Resetting the entries $p^{[i]}_{1k}$ and $p^{[i]}_{j1}$ to zero, one arrives at the localized recurrence matrix R[i]. We call R[i] a recurrence matrix because one can use it to compute the distribution of recurrence times to the Markov state Wi in Si. As we will see soon in the applications below, the eigenvalues of R[i], when properly arranged, become language-independent semantic fingerprints.
[Figure 6: heat map of semantic similarities between English word patterns (rows) and French word patterns (columns) of Pride and Prejudice, with a color scale running from 0 to 1.]

Fig. 6. Automated alignments of vectorized topics via bipartite matching of semantic similarities. The semantic similarities $s(W^{\mathrm{en}}_i, W^{\mathrm{fr}}_j)$ are computed for selected topics (sorted by descending nii ≥ 20) in two versions of Pride and Prejudice. Rows and columns filled with zeros are not shown. Cross-hairs meet at optimal nodes that solve the bipartite matching problem. The thickness of each horizontal (resp. vertical) cross-hair is inversely proportional to the row-wise (resp. column-wise) ranking of the similarity score for the optimal node. A green (resp. amber) cross-hair indicates an exact (resp. a close but non-exact) match. At the same confidence level (0.7) for similarities, this experiment has better recall than Fig. 3c, without much cost in precision.
3 APPLICATIONS
3.1 Automated word translations from bilingual documents
Experimentally, we resolve the connectivity of an individual pattern Wi through the recurrence spectrum σ(R[i]) (Fig. 5b). The dominant eigenvalues of R[i] are concept-specific while remaining nearly language-independent (a localized version of the invariance in Fig. 4c). Such empirical evidence motivates us to define the language-independent semantic fingerprint (LISF) of a word pattern Wi by a descending list of the magnitudes of eigenvalues

$$\mathbf v_i = \bigl(|\lambda_1(\mathbf R^{[i]})|,\, |\lambda_2(\mathbf R^{[i]})|,\, \ldots\bigr), \qquad (9)$$

computable from its semantic clique Si. We zero-pad this vector from the (⌊e^{ηi}⌋ + 1)st component onwards, where ηi is the Kolmogorov–Sinai entropy production rate of the Markov matrix P[i], measured in nats per word.4
4. The entropy production rate $\eta(\mathbf P) := -\sum_{i,j}\pi_i p_{ij}\log p_{ij}$ [9, (4.27)] of a Markov matrix P represents the weighted average (assigning probability mass πi to the ith Markov state) of Boltzmann's partition entropies $-\sum_j p_{ij}\log p_{ij}$ [10, §8.2]. We have η(P) ≤ log N for an N × N Markov matrix P with strictly positive entries [10, Theorem 14.1].
Via bipartite matching (Fig. 6) of the word vectors vi across languages, our algorithm translates words from parallel texts at very high precision (Fig. 5c), being competitive with state-of-the-art algorithms for bilingual word mapping [11], [12].

Unlike the vector bi (Fig. 3c), which captures only chapter-scale features of Wi, the semantic fingerprint vi in (9) characterizes the kinetic behavior of Wi on all the long-range time scales.
Given a topical pattern $W^A_i$ in language A, its semantic fingerprint $\mathbf v^A_i$ (a descending list of recurrence eigenvalues, as in Fig. 5b) allows us to numerically locate a semantically close pattern in a parallel text written in another language B, in two steps:

(1) Divide the document into K chapters, and define the semantic similarity function as $s(W^A_i, W^B_j) := s_R(\mathbf v^A_i, \mathbf v^B_j)$ if

$$s_R(\mathbf b^A_i, \mathbf b^B_j) \ge \max\left\{1 - 0.07\sqrt{K},\ 1 - \sqrt{\frac{\|\mathbf b^A_i \wedge \mathbf b^B_j\|_0}{\|\mathbf b^A_i \vee \mathbf b^B_j\|_1}}\right\} \qquad (10)$$

(which is a ballpark screening more robust than Fig. 3c, with ‖b‖0 counting the number of non-zero components of b) and $s_R(\mathbf v^A_i, \mathbf v^B_j) \ge 0.7$; $s(W^A_i, W^B_j) := 0$ otherwise.
[Figure 7(a): Query WikiQA-Q26: "How did Anne Frank die?", with the Wikipedia article "Anne Frank" as the reference. The thought bubble Q ∪ Q′ built around {Anne, Frank, die} contains, among others, the patterns die(d), typhus, centre, survived, Frankfurt, early(-ier), Goslar, 1945, camps, restored, films, Holocaust, 1933, existed, Bergen-Belsen, concentration, documentary, private, action. Top 5 candidate answers (punctuation and spacing as given by WikiQA):
(1) Anne Frank and her sister, Margot, were eventually transferred to the Bergen-Belsen concentration camp, where they died of typhus in March 1945.
(2) Annelies "Anne" Marie Frank (, ?, ; 12 June 1929 – early March 1945) is one of the most discussed Jewish victims of the Holocaust.
(3) Otto Frank, the only survivor of the family, returned to Amsterdam after the war to find that Anne's diary had been saved, and his efforts led to its publication in 1947.
(4) As persecutions of the Jewish population increased in July 1942, the family went into hiding in the hidden rooms of Anne's father, Otto Frank's, office building.
(5) The Frank family moved from Germany to Amsterdam in 1933, the year the Nazis gained control over Germany.]

[Figure 7(b): Evaluation results on WikiQA.]

    MAP     Method       MRR
    0.6190  CNN          0.6281
    0.6091  LISF∗        0.6268
    0.5993  LCLR         0.6086
    0.5899  LISF         0.6060
    0.5110  PV           0.5160
    0.4891  Word Count   0.4924
    0.3913  Random Sort  0.3990

Fig. 7. Applications of semantic cliques to question-answering. (a) A construction of the semantic clique Q ∪ Q′ (based on Q = {Anne, Frank, die}) weighted by the PageRank equilibrium state $\overline{\boldsymbol\pi}$, and subsequent question-answering. The top 5 candidate answers, with punctuation and spacing as given by WikiQA, are shown with font sizes proportional to the entropy production score in (11). Here, the top-scoring sentence with highlighted background is the same as the official answer chosen by the WikiQA team. Like a human reader, our algorithm automatically detects the place "Bergen-Belsen concentration camp", cause "typhus", and year "1945" of Anne Frank's death. (b) Evaluations of our model (LISF and LISF∗) on the WikiQA data set, in comparison with established algorithms.
(2) Solve a bipartite matching problem (Fig. 6) that maximizes $\sum_{i,j} s(W^A_i, W^B_j)$, using the Hungarian Method [13] attributed to Jacobi–König–Egerváry–Kuhn [14].
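Steps (1) and (2) can be prototyped with SciPy's implementation of the Hungarian method. The sketch below is our own simplification (it assumes SciPy is available and replaces the chapter-count screening of (10) by a plain similarity threshold): it matches LISF vectors across two languages by maximizing the summed Ružicka similarities.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def ruzicka(u, v):
    """Ruzicka similarity of two non-negative vectors (zero-padded to equal length)."""
    m = max(len(u), len(v))
    u = np.pad(np.asarray(u, float), (0, m - len(u)))
    v = np.pad(np.asarray(v, float), (0, m - len(v)))
    den = np.maximum(u, v).sum()
    return np.minimum(u, v).sum() / den if den > 0 else 0.0

def match_fingerprints(v_A, v_B, threshold=0.7):
    """Bipartite matching of LISF vectors across languages A and B.

    v_A, v_B : lists of descending eigenvalue-magnitude vectors, Eq. (9).
    Returns triples (i, j, score) whose similarity passes the threshold."""
    S = np.array([[ruzicka(a, b) for b in v_B] for a in v_A])
    S = np.where(S >= threshold, S, 0.0)      # screen out weak candidates
    rows, cols = linear_sum_assignment(-S)    # Hungarian method, maximizing sum
    return [(i, j, round(S[i, j], 3)) for i, j in zip(rows, cols) if S[i, j] > 0]

# Toy usage with three English and three French fingerprints (made-up numbers).
v_en = [[0.9, 0.5, 0.1], [0.8, 0.2], [0.7, 0.6, 0.3]]
v_fr = [[0.7, 0.55, 0.3], [0.88, 0.52, 0.12], [0.81, 0.22]]
print(match_fingerprints(v_en, v_fr))  # pairs (0,1), (1,2), (2,0)
```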
3.2 Machine-assisted text comprehension on the WikiQA data set
By automatically discovering related words through numerical brainstorming (Figs. 5a and b, insets), our semantic cliques Si are useful in text comprehension and question answering. We can expand a set of question words $Q = \{W_{q_1}, \ldots, W_{q_K}\}$ into $Q\cup Q' = \bigcup_{k=1}^{K} S_{q_k}$, by bringing together the semantic cliques $S_{q_k}$ generated from a reference text by each and every question word $W_{q_k}$.
As before, we construct a localized Markov matrix $\mathbf P = (p_{ij})_{1\le i,j\le N}$ on this subset of word patterns Q ∪ Q′. We further use the Brin–Page damping [15] to derive an ergodic Markov matrix $\overline{\mathbf P} = (\overline p_{ij})_{1\le i,j\le N}$, where $\overline p_{ij} = 0.85\, p_{ij} + \frac{0.15}{N}$.

By analogy with the behavior of internet surfing [15], [16], we model the process of associative reasoning [17] as a navigation through the nodes Q ∪ Q′ according to $\overline{\mathbf P}$, which quantifies the click-through rate from one idea to another. The PageRank recursion [16] ensures a unique equilibrium state $\overline{\boldsymbol\pi}$ attached to $\overline{\mathbf P}$. If our question Q and a candidate answer A contain, respectively, words from $W_{Q_1}, \ldots, W_{Q_m} \in Q$ and $W_{A_1}, \ldots, W_{A_n} \in Q\cup Q'$ (counting multiplicities, but excluding function words and patterns with fewer than 3 occurrences in the reference document), then we assign the following entropy production score

$$F[Q, A] := -\sum_{i=1}^{m}\sum_{j=1}^{n} \overline\pi_{Q_i}\,\overline p_{Q_i A_j}\log \overline p_{Q_i A_j} \qquad (11)$$

to this question–answer pair.5
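A condensed sketch of this scoring pipeline (ours, with hypothetical variable names and a toy graph): Brin–Page damping produces an ergodic matrix, power iteration gives the PageRank weights, and (11) scores a question-answer pair.

```python
import numpy as np

def damped_matrix(P, damping=0.85):
    """Brin-Page damping: p_bar_ij = d * p_ij + (1 - d)/N, making the chain ergodic."""
    N = P.shape[0]
    return damping * P + (1.0 - damping) / N

def pagerank(P_bar, n_iter=1_000):
    """Equilibrium state pi_bar of the damped matrix (dominant left eigenvector)."""
    pi = np.full(P_bar.shape[0], 1.0 / P_bar.shape[0])
    for _ in range(n_iter):
        pi = pi @ P_bar
        pi /= pi.sum()
    return pi

def qa_score(P_bar, pi_bar, question_idx, answer_idx):
    """Entropy-production score F[Q, A] of Eq. (11); indices are repeated
    according to the multiplicities of the words in question and answer."""
    return -sum(pi_bar[q] * P_bar[q, a] * np.log(P_bar[q, a])
                for q in question_idx for a in answer_idx)

# Toy usage: a 4-node graph on Q ∪ Q'; the question uses node 0, answers differ.
rng = np.random.default_rng(4)
P = rng.random((4, 4))
P /= P.sum(axis=1, keepdims=True)
P_bar = damped_matrix(P)
pi_bar = pagerank(P_bar)
print(qa_score(P_bar, pi_bar, [0], [1, 2]), qa_score(P_bar, pi_bar, [0], [3]))
```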
A sample workflow is shown in Fig. 7a, to illustrate how our rudimentary question-answering machine handles a query. To answer a question, we use a single Wikipedia page (without infoboxes and other structural data) as the only reference document and training source. Like a typical human reader of Wikipedia, our numerical associative reasoning generates a weighted set of nodes Q ∪ Q′ (presented graphically as a thought bubble in Fig. 7a), without the help of external stimuli or knowledge feeds. Here, the relative weights in the nodes of Q ∪ Q′ are computed from the equilibrium state $\overline{\boldsymbol\pi}$ of $\overline{\mathbf P}$, via the PageRank algorithm.

We then test our semantic model (LISF in Fig. 7b) on all the 1242 questions in the WikiQA data set, each of which is accompanied by at least one correct answer located in a designated Wikipedia page. Our algorithm's performance is roughly on par with the LCLR and CNN benchmarks [18], improving upon the baseline by a significant margin. This is perhaps remarkable, considering the relatively scant data at our disposal. Unlike the LCLR approach, our numerical discovery of synonyms does not draw on the WordNet database [19] or pre-existing corpora of question-answer pairs. Unlike the CNN method, we do not need pre-trained word2vec embeddings [20] as semantic input.

Moreover, our algorithm (LISF∗ in Fig. 7b) performs slightly better on a subset of 990 questions that do not require quantitative cues (How big? How long? How many? How old? What became of? What happened to? What year? and so on). This indicates that, with a Markov chain description of two-body interactions between topics, our structural model fits associative reasoning better than rule-based reasoning [17], while imitating human behavior in the presence of limited data. To enhance the reasoning capabilities of our algorithm, it is perhaps appropriate to apply a Markov random field [21, §4.1.3] to graphs of word patterns, to capture many-body interactions among different topics.

5. One may compare the score F[Q,A] to the Kolmogorov–Sinai entropy production rate [9, (4.27)] $\eta(\overline{\mathbf P}) = -\sum_{i=1}^{N}\sum_{j=1}^{N}\overline\pi_i\,\overline p_{ij}\log\overline p_{ij}$ of a Markov matrix $\overline{\mathbf P} = (\overline p_{ij})_{1\le i,j\le N}$. The score F[Q,A] is modeled after Boltzmann's partition entropies, weighted by words in the question, and sifted by topics in the answer. Such a weighting and sifting method is analogous to the definition of scattering cross-sections in particle physics.
4 CONCLUSION
In our current work, we define semantics through algebraic invariants that are concept-specific and language-independent. To construct such invariants, we develop a stochastic model that assigns a semantic fingerprint (a list of recurrence eigenvalues) to each concept via its long-range contexts. Consistently using a single Markov framework, we are able to extract topics (Figs. 2e, 3b,b′), translate topics (Figs. 3c, 4c, 5b,c, 6) and understand topics (Figs. 5a,b, 7a,b), through statistical mining of short and medium-length texts. In view of these three successful applications, we are probably close to a complete set of semantic invariants, after demystifying the long-range behavior of human languages.

Notably, our algorithms apply to documents of moderate lengths, similar to the experience of human readers. This contrasts with data-hungry algorithms in machine learning [18], [22], which utilize high-dimensional numerical representations of words and phrases [11], [12], [20], [23] from large corpora. Our semantic mechanism exhibits universality on long-range linguistic scales. This adds to our quantitative understanding of diversity on shorter-range linguistic scales, such as phonology [24], [25], [26], morphology [27], [28], [29], [30] and syntax [3], [30], [31], [32], [33].
Thanks to the independence between semantics and syntax [3], our current model conveniently ignores the non-Markovian syntactic structures which are essential to fluent speech. In the near future, we hope to extend our framework further, to incorporate both Markovian and non-Markovian features across different ranges. The Mathematical Principles of Natural Languages, as we envision, must and will combine the statistical analysis of a Markov model with linguistic properties on shorter time scales that convey morphological [27], [28], [29], [30] and syntactical [3], [30], [31], [32], [33] information.
APPENDIX A
POISSONIAN BANALITY AND NON-TOPICALITY
Here, we first present a proof of our statistical criterion for Poissonian banality (which we identify with non-topicality of word patterns), as stated in (1).
Theorem 1 (Sums and products of exponentially distributed random variables). Let X1, . . . , XN be independent random variables, each obeying an exponential distribution with mean 1. The probability distribution of

$$Y_N := \log\left(\frac{1}{N}\sum_{i=1}^{N}X_i\right) - \frac{1}{N}\sum_{i=1}^{N}\log X_i \qquad (12)$$

is asymptotic to a Gaussian law

$$\mathbb P\left(\frac{Y_N - \bigl(\gamma_0 - \frac{1}{2N}\bigr)}{\frac{1}{\sqrt N}\sqrt{\frac{\pi^2}{6} - 1 - \frac{1}{2N}}} < a\right) \sim \int_{-\infty}^{a} e^{-x^2/2}\,\frac{\mathrm d x}{\sqrt{2\pi}} \qquad (13)$$

as N → ∞.
Proof: By definition, we have a moment-generating function

$$\mathbb E\, e^{-tY_N} = \int_{(0,+\infty)^N} \frac{\prod_{i=1}^{N} x_i^{t/N}}{\bigl(\frac1N\sum_{i=1}^{N} x_i\bigr)^t}\, e^{-\sum_{i=1}^{N} x_i}\,\mathrm d x_1\cdots\mathrm d x_N = 2^N \int_{(0,+\infty)^N} \frac{\prod_{i=1}^{N} \xi_i^{2t/N+1}}{\bigl(\frac1N\sum_{i=1}^{N} \xi_i^2\bigr)^t}\, e^{-\sum_{i=1}^{N} \xi_i^2}\,\mathrm d\xi_1\cdots\mathrm d\xi_N. \qquad (14)$$

To evaluate the last multiple integral, we use the spherical coordinates $\xi_1 = r\cos\theta_1$, $\xi_2 = r\sin\theta_1\cos\theta_2$, $\xi_3 = r\sin\theta_1\sin\theta_2\cos\theta_3$, . . . , $\xi_N = r\prod_{j=1}^{N-1}\sin\theta_j$ (where $r > 0$, and $0 < \theta_j < \pi/2$ for all $j\in\{1,\ldots,N-1\}$), with volume element

$$\mathrm d\xi_1\cdots\mathrm d\xi_N = r^{N-1}\,\mathrm d r \prod_{j=1}^{N-1}\sin^{N-1-j}\theta_j\,\mathrm d\theta_j. \qquad (15)$$

The result reads

$$\mathbb E\, e^{-tY_N} = N^t\,\Gamma(N)\prod_{j=1}^{N-1}\frac{\Gamma\bigl(\frac{N+t}{N}\bigr)\,\Gamma\bigl(\frac{(N-j)(N+t)}{N}\bigr)}{\Gamma\bigl(\frac{(N-j+1)(N+t)}{N}\bigr)} = \frac{N^t\,\Gamma(N)}{\Gamma(N+t)}\left[\Gamma\Bigl(1+\frac{t}{N}\Bigr)\right]^N, \qquad (16)$$

where $\Gamma(s) := \int_0^\infty x^{s-1}e^{-x}\,\mathrm d x$ is Euler's gamma function. Consequently, the proof of (13) builds on a cumulant expansion of (16), that is, a development of $\log\mathbb E\, e^{-tY_N}$ up to $O(t^2)$ terms.
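For completeness, here is a sketch of that cumulant expansion (our reconstruction of the routine computation, using the standard expansions of log Γ and the digamma function ψ):

$$\begin{aligned}
\log\mathbb E\, e^{-tY_N} &= t\log N + \log\Gamma(N) - \log\Gamma(N+t) + N\log\Gamma\Bigl(1+\tfrac{t}{N}\Bigr)\\
&= t\log N - t\,\psi(N) - \tfrac{t^2}{2}\,\psi'(N) - \gamma_0 t + \tfrac{\pi^2 t^2}{12N} + O(t^3)\\
&= -t\Bigl(\gamma_0 - \tfrac{1}{2N}\Bigr) + \tfrac{t^2}{2}\cdot\tfrac{1}{N}\Bigl(\tfrac{\pi^2}{6} - 1 - \tfrac{1}{2N}\Bigr) + O(t^3) + O\Bigl(\tfrac{t}{N^2}\Bigr),
\end{aligned}$$

using $\log\Gamma(N+t) = \log\Gamma(N) + t\psi(N) + \tfrac{t^2}{2}\psi'(N) + O(t^3)$, $N\log\Gamma(1+\tfrac tN) = -\gamma_0 t + \tfrac{\pi^2 t^2}{12N} + O(t^3/N^2)$, $\psi(N) = \log N - \tfrac{1}{2N} + O(N^{-2})$ and $\psi'(N) = \tfrac1N + \tfrac{1}{2N^2} + O(N^{-3})$. Hence $Y_N$ has mean $\gamma_0 - \tfrac{1}{2N}$ and variance $\tfrac1N\bigl(\tfrac{\pi^2}{6} - 1 - \tfrac{1}{2N}\bigr)$ up to the stated corrections, which yields the Gaussian law (13).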
If we have N samples of recurrence times from a Poisson process, then the statistic YN satisfies the inequality in (1) with probability $\int_{-2}^{2} e^{-x^2/2}\,\frac{\mathrm d x}{\sqrt{2\pi}} \approx 0.95$, in view of the theorem above.
APPENDIX B
RECURRENCE TIMES ON AN ERGODIC MARKOV CHAIN WITH DETAILED BALANCE
B.1 Background in probability theory
Based on numerical evidence (Fig. 4), we postulate that at the discourse level (the longest time scale in Friederici's [4] neurobiological hierarchy), the production of natural language texts can be caricatured by the stochastic transitions on a stationary and ergodic Markov chain $\mathscr M = (\mathscr S, \mathbf P)$.6 Here, the state space $\mathscr S = \{W_1, \ldots, W_N\}$ runs over finitely many word patterns occurring in the text, which in turn is representable as a discrete-time stochastic process $(X(0), X(1), \ldots, X(n), \ldots)$; the transition matrix $\mathbf P = (p_{ij})$ describes the conditional hopping probability on the web of word patterns: $p_{ij} = \mathbb P(X(n+1) = W_j \mid X(n) = W_i)$, for $n\in\mathbb Z_{\ge0}$. Depending on context, we also model a document by a localized Markov chain $\mathscr M' = (\mathscr S', \mathbf P')$, where certain word patterns Wi are removed from the state space $\mathscr S$ to form a proper subset $\mathscr S' \subsetneq \mathscr S$.

In a more formal setting, our notation $\mathscr M = (\mathscr S, \mathbf P)$ for the Markov chain should be expanded into $\mathscr M = (\Omega, \mathscr F, \{\mathbb P_{\mathsf W} : \mathsf W\in\mathscr S\}, \{X(t) : t\in\mathbb Z_{\ge0}\}, \{\theta_n : n\in\mathbb Z_{\ge0}\})$, whose components are explained below.

• The sample space Ω consists of all stochastic trajectories $\omega = (X(0), X(1), \ldots, X(t), X(t+1), \ldots) = (X(t))_{t\in\mathbb Z_{\ge0}}$ on the state space $\mathscr S$, namely, all possible texts that can be analyzed by our particular model.
• Each member in the field of events $\mathscr F = 2^\Omega$ (the totality of all subsets of the sample space Ω) is a set containing zero or more stochastic processes that can be regarded as caricatures of text productions.
• The family of probability measures $\{\mathbb P_{\mathsf W} : \mathsf W\in\mathscr S\}$ is related to the transition matrix $\mathbf P = (p_{ij})_{1\le i,j\le N}$ by the identity $p^{(t)}_{ij} = \mathbb P_{W_i}(X(t) = W_j) = \mathbb P(X(t) = W_j \mid X(0) = W_i)$, $\forall W_i, W_j\in\mathscr S$, $\forall t\in\mathbb Z_{>0}$.
• The shift operator $\theta_n : \Omega\to\Omega$ acts on an arbitrary trajectory $(X(0), X(1), \ldots, X(t), X(t+1), \ldots) = \omega\in\Omega$ in the following manner: $\theta_n\,\omega = (X(n), X(n+1), \ldots, X(t+n), X(t+n+1), \ldots)$, $\forall t, n\in\mathbb Z_{\ge0}$.

For discussions of the stopping times (a class of random variables on Markov chains) as well as the Markov property, it is convenient to further introduce the notation $\mathscr F_n$ for $n\in\mathbb Z_{\ge0}$. Here, $\mathscr F_n$ is the smallest σ-algebra containing all the events of the form $\{X(0) = W_{i_0}, X(1) = W_{i_1}, \ldots, X(n) = W_{i_n}\}$. The family of σ-algebras $\{\mathscr F_n : n\in\mathbb Z_{\ge0}\}$ forms a filtration: $\mathscr F_n\subseteq\mathscr F_m$ if $n\le m$. For any $n\in\mathbb Z_{\ge0}$, $B\in\mathscr F$, $\mathsf W\in\mathscr S$, we have the following relation concerning conditional expectations:

$$\mathbb E_{\mathsf W}(\mathbf 1_B\circ\theta_n \mid \mathscr F_n) = \mathbb E_{X(n)}(\mathbf 1_B) := \mathbb P_{X(n)}(B) \qquad (17)$$

(for every time-step $n\in\mathbb Z_{\ge0}$ and Markov state $\mathsf W\in\mathscr S$), which is merely a reformulation of the Markov condition $\mathbb P_{\mathsf W}(X(n+1) = W_{i_{n+1}} \mid X(n) = W_{i_n}, X(n-1) = W_{i_{n-1}}, \ldots, X(0) = W_{i_0}) = \mathbb P_{W_{i_n}}(X(1) = W_{i_{n+1}}) = \mathbb P_{\mathsf W}(X(n+1) = W_{i_{n+1}} \mid X(n) = W_{i_n})$. In other words, for any $A\in\mathscr F_n$, $B\in\mathscr F$, $\mathsf W\in\mathscr S$, we have the following statement of the Markov property:

$$\mathbb E_{\mathsf W}(\mathbf 1_B\circ\theta_n; A) = \mathbb E_{\mathsf W}\bigl(\mathbb E_{X(n)}(\mathbf 1_B); A\bigr). \qquad (17')$$

In both (17) and (17′), one can replace the indicator function $\mathbf 1_B$ for the event $B\in\mathscr F$ by any random variable with finite expectation.

6. To reduce notational burden, we will not use superscripted asterisks to mark Markov matrices beyond this point.
B.2 Hitting time and return time on a Markov chain
We need to first precisely define the probability distributions for the hitting time and the return time (the latter also known as the "recurrence time") on our Markov chain $\mathscr M = (\mathscr S, \mathbf P)$.

For each state $W_i\in\mathscr S$ and trajectory $\omega = (X(t))_{t\in\mathbb Z_{\ge0}}\in\Omega$, we define

$$\tau_i(\omega) := \inf\{n\in\mathbb Z_{>0} : X(n) = W_i\}; \qquad (18)$$

then $\tau_i : \Omega\to\mathbb Z_{>0}\cup\{+\infty\}$ is a stopping time. Suppose that our pattern of interest corresponds to a subset of states $\mathscr W\subseteq\mathscr S$ on the web of words. We define another stopping time $\tau_{\mathscr W} : \Omega\to\mathbb Z_{>0}\cup\{+\infty\}$ as

$$\tau_{\mathscr W}(\omega) := \inf\{n\in\mathbb Z_{>0} : X(n)\in\mathscr W\} = \inf\{\tau_i(\omega) : W_i\in\mathscr W\}. \qquad (19)$$

Clearly, $\tau_{\mathscr W}(\omega)$ is equal to the first time when a forward stepwise search lands on the set of interest $\mathscr W$. Recalling the invariant measure $\boldsymbol\pi = (\pi_1, \ldots, \pi_N)$, we define the cumulative distribution function (CDF) for the hitting time to the set of patterns $\mathscr W$ as

$$H_{\mathscr W}(t) := \sum_{W_i\in\mathscr S}\mathbb P_{W_i}(\tau_{\mathscr W} < t)\,\pi_i, \qquad t\in\mathbb Z_{>0}. \qquad (20)$$

Similarly, the CDF for the return time to the set of patterns $\mathscr W$ is defined as

$$R_{\mathscr W}(t) := \frac{\sum_{W_i\in\mathscr W}\mathbb P_{W_i}(\tau_{\mathscr W} < t)\,\pi_i}{\sum_{W_j\in\mathscr W}\pi_j}, \qquad t\in\mathbb Z_{>0}. \qquad (21)$$
In the next theorem, we present an identity that connects hitting and return times of Markov states, which in turn is a discrete analog of the Haydn–Lacroix–Vaienti relation for continuous-time ergodic dynamical systems [8].
Theorem 2 (Relation between hitting time and return time distributions). For a stationary and ergodic Markov chain $\mathscr M = (\mathscr S, \mathbf P)$, and a subset of states $\mathscr W\subseteq\mathscr S$, we have the following identity regarding the probability distributions of hitting and return times to $\mathscr W$:

$$H_{\mathscr W}(1) = R_{\mathscr W}(1) = 0; \qquad \frac{H_{\mathscr W}(t+1)}{\sum_{W_j\in\mathscr W}\pi_j} = \sum_{n=1}^{t}\bigl[1 - R_{\mathscr W}(n)\bigr] \qquad (22)$$

for all $t\in\mathbb Z_{>0}$.
Proof: Clearly, our task is equivalent to the verification of the following formula:

$$\sum_{W_i\in\mathscr S}\mathbb P_{W_i}(\tau_{\mathscr W}\le t)\,\pi_i = \sum_{n=1}^{t}\sum_{W_j\in\mathscr W}\bigl[1 - \mathbb P_{W_j}(\tau_{\mathscr W} < n)\bigr]\pi_j \qquad (22')$$

for all $t\in\mathbb Z_{>0}$.

For t = 1, the left-hand side of (22′) can be computed as follows:

$$\sum_{W_i\in\mathscr S}\mathbb P_{W_i}(\tau_{\mathscr W} = 1)\,\pi_i = \sum_{W_i\in\mathscr S}\sum_{W_j\in\mathscr W}\mathbb P_{W_i}(X(1) = W_j)\,\pi_i = \sum_{W_i\in\mathscr S}\sum_{W_j\in\mathscr W}\pi_i p_{ij} = \sum_{W_j\in\mathscr W}\pi_j. \qquad (23)$$

This is equal to the right-hand side of (22′), because $\mathbb P_{W_j}(\tau_{\mathscr W} < 1) = 0$.

For $t\in\mathbb Z_{>0}$, we can compute

$$\sum_{W_i\in\mathscr S}\mathbb P_{W_i}(\tau_{\mathscr W}\le t+1)\,\pi_i = \sum_{W_i\in\mathscr S}\mathbb E_{W_i}\bigl((\mathbf 1_{(\tau_{\mathscr W}\le t)}\circ\theta_1)\,\mathbf 1_{(X(1)\notin\mathscr W)} + \mathbf 1_{(X(1)\in\mathscr W)}\bigr)\pi_i = \sum_{W_i\in\mathscr S}\bigl[\mathbb E_{W_i}\bigl(\mathbb E_{X(1)}(\mathbf 1_{(\tau_{\mathscr W}\le t)}); X(1)\notin\mathscr W\bigr) + \mathbb E_{W_i}(\mathbf 1_{(X(1)\in\mathscr W)})\bigr]\pi_i \qquad (24)$$

from the Markov property (17′). Here, we have $\sum_{W_i\in\mathscr S}\mathbb E_{W_i}(\mathbf 1_{(X(1)\in\mathscr W)})\pi_i = \sum_{W_i\in\mathscr S}\mathbb P_{W_i}(\tau_{\mathscr W} = 1)\pi_i = \sum_{W_j\in\mathscr W}\pi_j$ according to (23), and

$$\sum_{W_i\in\mathscr S}\mathbb E_{W_i}\bigl(\mathbb E_{X(1)}(\mathbf 1_{(\tau_{\mathscr W}\le t)}); X(1)\notin\mathscr W\bigr)\pi_i = \sum_{W_i\in\mathscr S}\sum_{W_k\in\mathscr S\smallsetminus\mathscr W}\mathbb P_{W_i}(X(1) = W_k)\,\mathbb E_{W_k}(\mathbf 1_{(\tau_{\mathscr W}\le t)})\,\pi_i = \sum_{W_k\in\mathscr S\smallsetminus\mathscr W}\mathbb E_{W_k}(\mathbf 1_{(\tau_{\mathscr W}\le t)})\,\pi_k = \sum_{W_i\in\mathscr S}\mathbb P_{W_i}(\tau_{\mathscr W}\le t)\,\pi_i - \sum_{W_j\in\mathscr W}\mathbb P_{W_j}(\tau_{\mathscr W}\le t)\,\pi_j. \qquad (25)$$

Therefore, the recursion

$$\sum_{W_i\in\mathscr S}\mathbb P_{W_i}(\tau_{\mathscr W}\le t+1)\,\pi_i = \sum_{W_i\in\mathscr S}\mathbb P_{W_i}(\tau_{\mathscr W}\le t)\,\pi_i + \sum_{W_j\in\mathscr W}\bigl[1 - \mathbb P_{W_j}(\tau_{\mathscr W}\le t)\bigr]\pi_j \qquad (26)$$

for all $t\in\mathbb Z_{>0}$ allows us to build (22′) inductively on the t = 1 case (23).
B.3 Positive definite return time distributions on a Markov chain with detailed balance
Thanks to Theorem 2, our analysis of a return time distribution can be built on the analysis of the corresponding hitting time, the latter of which is usually easier to compute (in theoretical analysis).

For a stationary and ergodic Markov chain $(\mathscr S, \mathbf P = (p_{ij})_{1\le i,j\le N})$, and a pattern of interest $\mathscr W\subsetneq\mathscr S$, one can use the Markov property to show that the hitting time distribution satisfies

$$\sum_{W_i\in\mathscr S}\mathbb P_{W_i}(\tau_{\mathscr W} > 1)\,\pi_i = \sum_{W_i\in\mathscr S}\mathbb P_{W_i}(X(1)\notin\mathscr W)\,\pi_i = \sum_{W_i\in\mathscr S}\sum_{W_m\in\mathscr S\smallsetminus\mathscr W}\pi_i p_{im} = \sum_{W_m\in\mathscr S\smallsetminus\mathscr W}\pi_m \qquad (27)$$

and

$$\sum_{W_i\in\mathscr S}\mathbb P_{W_i}(\tau_{\mathscr W} > t)\,\pi_i = \sum_{W_i\in\mathscr S}\mathbb P_{W_i}\bigl(X(n)\notin\mathscr W,\ \forall n\in\mathbb Z\cap[1,t]\bigr)\,\pi_i = \sum_{W_i\in\mathscr S}\ \sum_{W_{m_n}\in\mathscr S\smallsetminus\mathscr W,\ n\in\mathbb Z\cap[1,t]} \pi_i\, p_{i m_1}\prod_{n\in\mathbb Z\cap[1,t-1]} p_{m_n m_{n+1}} = \sum_{W_{m_n}\in\mathscr S\smallsetminus\mathscr W,\ n\in\mathbb Z\cap[1,t]} \pi_{m_1}\prod_{n\in\mathbb Z\cap[1,t-1]} p_{m_n m_{n+1}}, \qquad (28)$$

for all $t\in\mathbb Z_{>1}$.
Consider the abelian semigroup $\Sigma = \mathbb Z_{\ge0}$. A semicharacter [34, p. 92, Definition 2.1] $\rho : \mathbb Z_{\ge0}\to\mathbb C$ must assume the following form: $\rho(0) = 1$; $\rho(s) = [\rho(1)]^s$ for $s\in\mathbb Z_{>0}$. Since the spectral radius of our recurrence matrix is strictly less than 1, we will only concern ourselves with $\rho_0(s) := \mathbf 1_{\{0\}}(s)$ and $\rho_\lambda(s) := \lambda^s$, $s\in\mathbb Z_{>0}$, for $\lambda\in(-1,0)\cup(0,1)$.

According to the harmonic analysis on semigroups, a bounded function $f : \mathbb Z_{\ge0}\to\mathbb C$ can be represented as a weighted mixture of semicharacters $f(s) = \int_{-1<\lambda<1}\rho_\lambda(s)\,\mathrm d\mu(\lambda)$ if and only if it is positive definite [34, p. 93, Theorem 2.5]: the inequality

$$\sum_{r=1}^{m}\sum_{s=1}^{m} c_r\,\overline{c_s}\, f(t_r + t_s) \ge 0 \qquad (29)$$

holds for arbitrary positive integers $m\in\mathbb Z_{>0}$, and arbitrarily chosen m-dimensional "vectors" $(t_1, \ldots, t_m)\in\Sigma^m$, $(c_1, \ldots, c_m)\in\mathbb C^m$.

In the following theorem, we check the positive definiteness criterion (29) for the hitting time distribution $1 - H_{\mathscr W}(s+2)$, $s\in\mathbb Z_{\ge0}$, on a Markov chain satisfying the detailed balance condition. This would imply that the return time distribution $1 - R_{\mathscr W}(s+2)$, $s\in\mathbb Z_{\ge0}$, is also a weighted mixture of semicharacters $\rho_\lambda(s)$ for $-1<\lambda<1$, according to the finite difference relation (22) in Theorem 2.
Theorem 3 (Positive definite return time distributions). Suppose that a non-void pattern $\mathscr W\subsetneq\mathscr S$ on a stationary and ergodic Markov chain $\mathscr M = (\Omega, \mathscr F, \{\mathbb P_{\mathsf W} : \mathsf W\in\mathscr S\}, \{X(t) : t\in\mathbb Z_{\ge0}\}, \{\theta_n : n\in\mathbb Z_{\ge0}\})$ satisfies $\pi_i p_{ij} = \pi_j p_{ji}$ for all $W_i, W_j\in\mathscr S\smallsetminus\mathscr W$. Then the cumulative distribution for the return time to $\mathscr W$ has the following integral representation

$$\frac{1 - R_{\mathscr W}(s+2)}{1 - R_{\mathscr W}(2)} = \frac{\sum_{W_i\in\mathscr W}\mathbb P_{W_i}(\tau_{\mathscr W} > s+1)\,\pi_i}{\sum_{W_i\in\mathscr W}\mathbb P_{W_i}(\tau_{\mathscr W} > 1)\,\pi_i} = \int\rho_\lambda(s)\,\mathrm d P_{\mathscr W}(\lambda), \qquad s\in\mathbb Z_{\ge0}, \qquad (30)$$

for a certain probability measure $P_{\mathscr W}(\lambda)$ supported on $\lambda\in(-1,1)$.
Proof: By (28), we have

$$\sum_{W_i\in\mathscr S}\mathbb P_{W_i}(\tau_{\mathscr W} > r+s+1)\,\pi_i = \sum_{W_{m_n}\in\mathscr S\smallsetminus\mathscr W,\ n\in\mathbb Z\cap[1,r+s+1]} \pi_{m_1}\prod_{n\in\mathbb Z\cap[1,r+s]} p_{m_n m_{n+1}} = \sum_{W_{m_n}\in\mathscr S\smallsetminus\mathscr W,\ n\in\mathbb Z\cap[1,r+s+1]} \pi_{m_1}\times\prod_{n\in\mathbb Z\cap[1,r]} p_{m_n m_{n+1}}\prod_{n\in\mathbb Z\cap[r+1,r+s]} p_{m_n m_{n+1}}, \qquad (31)$$

for $r,s\in\mathbb Z_{>0}$. We then apply the identity $\pi_i p_{ij} = \pi_j p_{ji}$, $\forall W_i, W_j\in\mathscr S\smallsetminus\mathscr W$, to the product over $n\in\mathbb Z\cap[1,r]$, which brings us

$$\sum_{W_i\in\mathscr S}\mathbb P_{W_i}(\tau_{\mathscr W} > r+s+1)\,\pi_i = \sum_{W_{m_n}\in\mathscr S\smallsetminus\mathscr W,\ n\in\mathbb Z\cap[1,r+s+1]} \pi_{m_1}\times\prod_{n\in\mathbb Z\cap[1,r]}\frac{\pi_{m_{n+1}}}{\pi_{m_n}}\, p_{m_{n+1} m_n}\prod_{n\in\mathbb Z\cap[r+1,r+s]} p_{m_n m_{n+1}} = \sum_{W_{m_{r+1}}\in\mathscr S\smallsetminus\mathscr W}\mathbb P_{W_{m_{r+1}}}(\tau_{\mathscr W} > r)\,\mathbb P_{W_{m_{r+1}}}(\tau_{\mathscr W} > s)\,\pi_{m_{r+1}} \qquad (32)$$

for $r,s\in\mathbb Z_{>0}$. In other words, we have verified

$$\sum_{W_i\in\mathscr S}\mathbb P_{W_i}(\tau_{\mathscr W} > r+s+1)\,\pi_i = \sum_{W_i\in\mathscr S\smallsetminus\mathscr W}\mathbb E_{W_i}(\mathbf 1_{(\tau_{\mathscr W}>r)})\,\mathbb E_{W_i}(\mathbf 1_{(\tau_{\mathscr W}>s)})\,\pi_i \qquad (33)$$

for $r,s\in\mathbb Z_{>0}$. Using the Markov property, one can readily generalize the identity above to $r,s\in\mathbb Z_{\ge0}$. Thus, we have

$$\sum_{r=1}^{m}\sum_{s=1}^{m}\sum_{W_i\in\mathscr S} c_r\,\overline{c_s}\,\mathbb P_{W_i}(\tau_{\mathscr W} > t_r + t_s + 1)\,\pi_i = \sum_{W_i\in\mathscr S\smallsetminus\mathscr W}\left|\mathbb E_{W_i}\!\left(\sum_{r=1}^{m} c_r\,\mathbf 1_{(\tau_{\mathscr W}>t_r)}\right)\right|^2\pi_i \ge 0 \qquad (34)$$

for $t_r, t_s\in\mathbb Z_{\ge0}$. By the positive definiteness criterion in (29), we see that $1 - H_{\mathscr W}(s+2)$, $s\in\mathbb Z_{\ge0}$, is a weighted mixture of semicharacters. In view of the recursion identity relating hitting and return time distributions (26), we have confirmed the statement in (30).
With bin sizes far larger than 2 time units on the Markov chain, our Lii is nearly a continuous random variable, and all the period-2 oscillatory decays $\rho_\lambda(s)$ for $-1<\lambda<0$ behave just like noise terms in the histogram. Then Theorem 3 tells us that the probability distribution for Lii can be approximated by a weighted mixture7 of exponential decays $e^{-kL_{ii}}$, in the continuum limit.

7. We note that there exist exceptions to stationarity and ergodicity in realistic documents. A novel may involve the birth and/or death of leading/supporting characters. An academic treatise may place particular emphasis on certain concepts in specific chapters. In such scenarios, the overall recurrence kinetics may still follow the generic law given in (2), as a consequence of both detailed balance (in locally applied Markov models) and a heterogeneous mixture of several stationary and ergodic Markov models that patch together as a whole.
B.4 A statistical criterion for numerical independence between different word patterns
If we define L_ij as the effective length of a text fragment free from word pattern W_j that is simultaneously flanked by W_i to the left and by W_j to the right, then ∑_{W_i∈S} π_i P(L_ij > t) ∼ ∑_{W_i∈S} π_i P_{W_i}(τ_j > t) represents the hitting time distribution to the Markov state W_j. According to Theorem 2, the hitting time distribution can be obtained by integrating the return time distribution P(L_jj > t) ∼ P_{W_j}(τ_j > t) ∼ ∑_n c_n e^{−k_n t}. Here, the positive weights c_n > 0 are normalized to one: ∑_n c_n = 1. In other words, for fixed W_j, we can predict some statistical properties of L_ij by analyzing the recurrence statistic L_jj, as described in the theorem below.
Theorem 4 (Mean and variance for the logarithm of hitting time). Using the continuous time approximation, we have the following relations for a fixed W_j:
\[
E\log L_{ij}=\frac{\langle L_{jj}\log L_{jj}\rangle}{\langle L_{jj}\rangle}-1, \tag{35}
\]
\[
E\bigl(\log L_{ij}-E\log L_{ij}\bigr)^2=\frac{\bigl\langle L_{jj}(\ell_*-\log L_{jj})^2\bigr\rangle}{\langle L_{jj}\rangle}, \tag{36}
\]
where ℓ_* equals the right-hand side of (35), and the expectation E denotes an average over all the states W_i, weighted by π_i.
Proof: In the continuous time approximation, we can assume that the probability density function for L_jj is ∑_n c_n k_n e^{−k_n t} (where c_n, k_n > 0), so that the probability density function for L_ij (weighted average over all the states W_i) is
\[
\frac{\sum_n c_n e^{-k_n t}}{\sum_n c_n/k_n}=\frac{\sum_n c_n e^{-k_n t}}{\langle L_{jj}\rangle}. \tag{37}
\]
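For the reader's convenience, the evaluations (38)–(41) below rely on the following elementary integrals, obtained by differentiating ∫_0^∞ t^{a−1} e^{−kt} dt = Γ(a)/k^a with respect to a at a = 1 and a = 2 (γ_0 denotes the Euler–Mascheroni constant):
\[
\begin{aligned}
\int_0^\infty e^{-kt}\log t\,\mathrm{d}t &= -\frac{\gamma_0+\log k}{k}, &
\int_0^\infty e^{-kt}\log^2 t\,\mathrm{d}t &= \frac{(\gamma_0+\log k)^2+\pi^2/6}{k},\\
\int_0^\infty k\,t\,e^{-kt}\log t\,\mathrm{d}t &= \frac{1-\gamma_0-\log k}{k}, &
\int_0^\infty k\,t\,e^{-kt}\log^2 t\,\mathrm{d}t &= \frac{(1-\gamma_0-\log k)^2+\pi^2/6-1}{k}.
\end{aligned}
\]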
In view of (37), we can compute the left-hand side of (35) as
\[
\int_0^\infty\sum_n c_n e^{-k_n t}\log t\,\frac{\mathrm{d}t}{\langle L_{jj}\rangle}=-\sum_n\frac{c_n(\gamma_0+\log k_n)}{k_n\langle L_{jj}\rangle}. \tag{38}
\]
Meanwhile, we may evaluate the right-hand side of (35) as follows:
\[
\int_0^\infty\sum_n c_n k_n e^{-k_n t}\,t\log t\,\frac{\mathrm{d}t}{\langle L_{jj}\rangle}-1=-\sum_n\frac{c_n(\gamma_0-1+\log k_n)}{k_n\langle L_{jj}\rangle}-1. \tag{39}
\]
Since ∑_n c_n/k_n = ⟨L_jj⟩, the right-hand sides of (38) and (39) coincide, and (35) is confirmed.
In a similar vein, we can argue that the left-hand side of (36) equals
\[
\int_0^\infty\sum_n c_n e^{-k_n t}\log^2 t\,\frac{\mathrm{d}t}{\langle L_{jj}\rangle}-\ell_*^2=\sum_n\frac{c_n\bigl[6\log k_n(2\gamma_0+\log k_n)+6\gamma_0^2+\pi^2\bigr]}{6k_n\langle L_{jj}\rangle}-\ell_*^2, \tag{40}
\]
while the right-hand side of (36) amounts to
\[
\begin{aligned}
\int_0^\infty\sum_n c_n k_n e^{-k_n t}\,t\,(\ell_*-\log t)^2\,\frac{\mathrm{d}t}{\langle L_{jj}\rangle}
&=\int_0^\infty\sum_n c_n k_n e^{-k_n t}\,t\log^2 t\,\frac{\mathrm{d}t}{\langle L_{jj}\rangle}-2\ell_*-\ell_*^2\\
&=\sum_n\frac{c_n\bigl[6\log k_n(2\gamma_0-2+\log k_n)+6\gamma_0(\gamma_0-2)+\pi^2\bigr]}{6k_n\langle L_{jj}\rangle}+\sum_n\frac{2c_n(\gamma_0+\log k_n)}{k_n\langle L_{jj}\rangle}-\ell_*^2. \tag{41}
\end{aligned}
\]
Expanding the numerators shows that (40) and (41) agree, which completes the proof of (36).
The detailed balance condition π_i p_ij = π_j p_ji means that a Markov chain is reversible, and the reverse chain has the same transition matrix P [35, §4.7, p. 203].
By the reversibility of our Markov chain, we may extend the results of the theorem above to a dual situation, namely, the statistical properties of L_ij, the effective length of a text fragment free from word W_i that is simultaneously flanked by W_i to the left and by W_j to the right (as considered in Fig. 1).
Reading (35)–(36) with these dual quantities (that is, replacing L_jj by L_ii and interpreting L_ij in the sense of Fig. 1), we recover (8): for a given i, we have used (35) as an estimate for ⟨log L_ij⟩, and 1/n_ij times (36) as an estimate for the variance of ⟨log L_ij⟩, the average being taken over n_ij independent samples.
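To illustrate how (35)–(36) operate as estimators (a minimal simulation sketch of our own, assuming the continuous-time mixture model used in the proof; the mixture weights, rates, and sample size are invented for the example), one can draw L_jj from a mixture of exponentials, draw the dual lengths from the corresponding density (37), and compare the plug-in predictions with direct sample statistics:

```python
import numpy as np

# Sanity check of (35)-(36): sample L_jj from a mixture of exponentials,
# sample L_ij from the density (37), and compare the plug-in predictions
# with direct sample moments of log L_ij. Parameters are illustrative.
rng = np.random.default_rng(42)
c = np.array([0.6, 0.3, 0.1])          # mixture weights, summing to one
k = np.array([0.02, 0.2, 1.0])         # decay rates
N = 1_000_000

comp = rng.choice(len(c), size=N, p=c)
L_jj = rng.exponential(1.0 / k[comp])  # density sum_n c_n k_n exp(-k_n t)

# Density (37) for L_ij is sum_n c_n exp(-k_n t) / <L_jj>: each component
# is drawn with probability proportional to c_n / k_n.
p_ij = (c / k) / (c / k).sum()
comp_ij = rng.choice(len(c), size=N, p=p_ij)
L_ij = rng.exponential(1.0 / k[comp_ij])

ell_star = np.mean(L_jj * np.log(L_jj)) / np.mean(L_jj) - 1                 # RHS of (35)
var_pred = np.mean(L_jj * (ell_star - np.log(L_jj)) ** 2) / np.mean(L_jj)   # RHS of (36)

print("E log L_ij:   predicted %.4f, sampled %.4f" % (ell_star, np.log(L_ij).mean()))
print("Var log L_ij: predicted %.4f, sampled %.4f" % (var_pred, np.log(L_ij).var()))
```

Under this mixture model, the predicted and sampled values should agree up to Monte Carlo error.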
ACKNOWLEDGMENTS
We thank N. Chomsky and S. Pinker for their input on several problems of linguistics. We thank X. Sun for discussions on neural networks. We thank X. Wan, R. Yan and D. Zhao for their suggestions on experimental design during the early stages of this work. We thank two anonymous reviewers, whose thoughtful comments helped us improve the presentation of this work.
REFERENCES
[1] F. de Saussure, Cours de linguistique générale, 5th ed. Paris, France: Payot, 1949.
[2] S. Pinker and A. Prince, “On language and connectionism: Analysis of a parallel distributed processing model of language acquisition,” Cognition, vol. 28, no. 1-2, pp. 73–193, 1988.
[3] N. Chomsky, Syntactic Structures, 2nd ed. Berlin, Germany: Mouton de Gruyter, 2002.
[4] A. D. Friederici, “The neurobiology of language comprehension,” in Language Comprehension: A Biological Perspective, A. D. Friederici, Ed. Berlin, Germany: Springer, 1999, ch. 9, pp. 265–304.
[5] J. P. Herrera and P. A. Pury, “Statistical keyword detection in literary corpora,” Eur. Phys. J. B, vol. 63, pp. 135–146, 2008.
[6] M. Ružička, “Anwendung mathematisch-statistischer Methoden in der Geobotanik (Synthetische Bearbeitung von Aufnahmen),” Biológia (Bratislava), vol. 13, pp. 647–661, 1958.
[7] Y. Zhou and X. Zhuang, “Robust reconstruction of the rate constant distribution using the phase function method,” Biophys. J., vol. 91, no. 11, pp. 4045–4053, 2006.
[8] N. Haydn, Y. Lacroix, and S. Vaienti, “Hitting and return times in ergodic dynamical systems,” Ann. Probab., vol. 33, pp. 2043–2050, 2005.
[9] T. M. Cover and J. A. Thomas, Elements of Information Theory, 2nd ed. Hoboken, NJ: Wiley-Interscience, 2006.
[10] M. Pollicott and M. Yuri, Dynamical Systems and Ergodic Theory, ser. London Mathematical Society Student Texts. Cambridge, UK: Cambridge University Press, 1998, vol. 40.
[11] A. Joulin, P. Bojanowski, T. Mikolov, H. Jégou, and E. Grave, “Loss in translation: Learning bilingual word mapping with a retrieval criterion,” in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Brussels, Belgium: Association for Computational Linguistics, Oct.-Nov. 2018, pp. 2979–2984.
[12] X. Chen and C. Cardie, “Unsupervised multilingual word embeddings,” in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Brussels, Belgium: Association for Computational Linguistics, Oct.-Nov. 2018, pp. 261–270.
[13] H. W. Kuhn, “The Hungarian Method for the assignment problem,” Nav. Res. Logist. Q., vol. 2, pp. 83–97, 1955.
[14] ——, “A tale of three eras: The discovery and rediscovery of the Hungarian Method,” Eur. J. Oper. Res., vol. 219, pp. 641–651, 2012.
[15] S. Brin and L. Page, “The anatomy of a large-scale hypertextual web search engine,” Comput. Networks ISDN, vol. 30, no. 1-7, pp. 107–117, 1998.
[16] L. Page, S. Brin, R. Motwani, and T. Winograd, “The PageRank citation ranking: Bringing order to the web,” Stanford InfoLab, Tech. Rep., 1999, http://ilpubs.stanford.edu:8090/422/.
[17] S. A. Sloman, “The empirical case for two systems of reasoning,” Psychol. Bull., vol. 119, no. 1, pp. 3–22, 1996.
[18] Y. Yang, W.-t. Yih, and C. Meek, “WikiQA: A challenge dataset for open-domain question answering,” in Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP). Lisbon, Portugal: Association for Computational Linguistics, 2015.
[19] C. Fellbaum, Ed., WordNet: An Electronic Lexical Database (Language, Speech, and Communication). Cambridge, MA: MIT Press, 1998.
[20] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, “Distributed representations of words and phrases and their compositionality,” in Advances in Neural Information Processing Systems 26. La Jolla, CA: NIPS, 2013, pp. 3111–3119.
[21] D. Mumford and A. Desolneux, Pattern Theory: The Stochastic Analysis of Real-World Signals. Natick, MA: A K Peters, 2010.
[22] V. Tshitoyan, J. Dagdelen, L. Weston, A. Dunn, Z. Rong, O. Kononova, K. A. Persson, G. Ceder, and A. Jain, “Unsupervised word embeddings capture latent knowledge from materials science literature,” Nature, vol. 571, no. 7763, pp. 95–98, 2019.
[23] S. Arora, Y. Li, Y. Liang, T. Ma, and A. Risteski, “A latent variable model approach to PMI-based word embeddings,” Transactions of the Association for Computational Linguistics, vol. 4, pp. 385–399, 2016.
[24] M. A. Nowak and D. C. Krakauer, “The evolution of language,” Proc. Natl. Acad. Sci. USA, vol. 96, no. 14, pp. 8028–8033, 1999.
[25] C. Everett, D. E. Blasi, and S. G. Roberts, “Climate, vocal folds, and tonal languages: Connecting the physiological and geographic dots,” Proc. Natl. Acad. Sci. USA, vol. 112, pp. 1322–1327, 2015.
[26] C. Everett, “Languages in drier climates use fewer vowels,” Frontiers in Psychology, vol. 8, Article 1285, 2017.
[27] S. Pinker, “Words and rules in the human brain,” Nature, vol. 387, no. 6633, pp. 547–548, 1997.
[28] W. D. Marslen-Wilson and L. K. Tyler, “Dissociating types of mental computation,” Nature, vol. 387, no. 6633, pp. 592–594, 1997.
[29] E. Lieberman, J.-B. Michel, J. Jackson, T. Tang, and M. A. Nowak, “Quantifying the evolutionary dynamics of language,” Nature, vol. 449, no. 7163, pp. 713–716, 2007.
[30] M. G. Newberry, C. A. Ahern, R. Clark, and J. B. Plotkin, “Detecting evolutionary forces in language change,” Nature, vol. 551, no. 7679, pp. 223–226, 2017.
[31] S. Pinker, “Survival of the clearest,” Nature, vol. 404, no. 6777, pp. 441–442, 2000.
[32] M. A. Nowak, J. B. Plotkin, and V. A. A. Jansen, “The evolution of syntactic communication,” Nature, vol. 404, no. 6777, pp. 495–498, 2000.
[33] M. Dunn, S. J. Greenhill, S. C. Levinson, and R. D. Gray, “Evolved structure of language shows lineage-specific trends in word-order universals,” Nature, vol. 473, no. 7345, pp. 79–82, 2011.
[34] C. Berg, J. P. R. Christensen, and P. Ressel, Harmonic Analysis on Semigroups: Theory of Positive Definite and Related Functions, ser. Graduate Texts in Mathematics. New York, NY: Springer-Verlag, 1984, vol. 100.
[35] S. M. Ross, Stochastic Processes, 2nd ed. Hoboken, NJ: John Wiley & Sons, 1995.
Weinan E received his B.S. degree in mathematics from the University of Science and Technology of China in 1982, and his Ph.D. degree in mathematics from the University of California, Los Angeles in 1989. He has been a full professor at the Department of Mathematics and the Program in Applied and Computational Mathematics, Princeton University, since 1999. Prof. E has made significant contributions to numerical analysis, fluid mechanics, partial differential equations, multiscale modeling, and stochastic processes.
His monograph Principles of Multiscale Modeling (Cambridge University Press, 2011) is a standard reference for mathematical modeling of physical systems on multiple length scales. He was awarded the Peter Henrici Prize in 2019 for his recent contributions to machine learning.
Yajun Zhou received his B.S. degree in Physics from Fudan University (Shanghai, China) in 2004, and his Ph.D. degree in Chemistry from Harvard University in 2010. Having undertaken postdoctoral training in applied and computational mathematics at Princeton University, he now works in the Laboratory for Natural Language Processing & Cognitive Intelligence at Beijing Institute of Big Data Research. Dr. Zhou has published in numerical analysis, special functions, number theory, electromagnetic theory, quantum theory, probability theory and stochastic processes, solving several open problems in the related fields.
A book chapter in Elliptic Integrals, Elliptic Functions and Modular Forms in Quantum Field Theory (Springer, 2019) surveys his proofs of various mathematical conjectures. Dr. Zhou is a polyglot, with a working knowledge of 24 languages spanning 7 language families.