arXiv:1907.12293v7 [cs.CL] 12 Jul 2021
A mathematical model for universal semantics
Weinan E and Yajun Zhou
Abstract—We characterize the meaning of words with language-independent numerical fingerprints, through a mathematical analysis
of recurring patterns in texts. Approximating texts by Markov processes on a long-range time scale, we are able to extract topics,
discover synonyms, and sketch semantic fields from a particular document of moderate length, without consulting an external knowledge base or thesaurus. Our Markov semantic model allows us to represent each topical concept by a low-dimensional vector,
interpretable as algebraic invariants in succinct statistical operations on the document, targeting local environments of individual words.
These language-independent semantic representations enable a robot reader to both understand short texts in a given language
(automated question-answering) and match medium-length texts across different languages (automated word translation). Our
semantic fingerprints quantify local meaning of words in 14 representative languages across 5 major language families, suggesting a
universal and cost-effective mechanism by which human languages are processed at the semantic level. Our protocols and source
codes are publicly available at https://github.com/yajun-zhou/linguae-naturalis-principia-mathematica
Index Terms—recurring patterns in texts, semantic model, recurrence time, hitting time, word translation, question answering
1 INTRODUCTION
A QUANTITATIVE model for the meaning of words not only helps us understand how we transmit information and absorb knowledge, but also provides a foothold for algorithms in the machine processing of natural language texts. Ideally, a universal mechanism of semantics should be based on numerical characteristics of human languages, transcending concrete written and spoken forms of verbal messages. In this work, we demonstrate, in both theory and practice, that the time structure of recurring language patterns is a good candidate for such a universal semantic mechanism. Through statistical analysis of recurrence times and hitting times, we numerically characterize the connectivity and association of individual concepts, thereby devising language-independent semantic fingerprints (LISF).

Concretely speaking, we define semantics through algebraic invariants of a stochastic text model that approximately governs the empirical hopping rates on a web of word patterns. Such a stochastic model explains the distribution of recurrence times and outputs recurrence eigenvalues as semantic fingerprints. Statistics of recurrence times allow machines to tell non-topical words from topical ones. A comparison of hitting and recurrence times further generates quantitative fingerprints for topics, enabling machines to overcome language barriers in translation tasks and perform associative reasoning in comprehension tasks, like humans.
Akin to the physical world, there is a hierarchy of length scales in languages. On short scales such as syllables, words, and phrases, human languages do not exhibit a universal pattern related to semantics. Except for a few onomatopoeias, the sounds of words do not affect their meaning [1]. Neither do morphological parameters [2] (say, singular/plural, present/past) or syntactic rôles [3] (say, subject/object, active/passive). In short, there are no universal semantic mechanisms at the phonological, lexical or syntactical levels [4]. Grammatical "rules and principles" [2], [3], however typologically diverse, play no definitive rôle in determining the inherent meaning of a word.

• Weinan E is with the Department of Mathematics and the Program in Applied and Computational Mathematics, Princeton University, and the Beijing Institute of Big Data Research. E-mail: [email protected]
• Yajun Zhou is with the Beijing Institute of Big Data Research. E-mail: [email protected]

Motivated by the observations above, we will build our quantitative semantic model on long-range and language-independent textual features. Specifically, we will measure the lengths of text fragments flanked by word patterns of interest (Fig. 1). Here, a word pattern is a collection of content words that are identical up to morphological parameters and syntactic rôles. A content word signifies definitive concepts (like apple, eat, red), instead of serving purely grammatical or logical functions (like but, of, the). Fragment length statistics will tell us how tightly/loosely one concept is connected to another. This, in turn, will provide us with quantitative criteria for inclusion/exclusion of different concepts within the same (computationally constructed) semantic field. Such statistical semantic mining will then pave the way for machine comprehension and machine translation.
2 METHODOLOGY
We quantify the time structure of an individual word pattern Wi through the statistics of its recurrence times τii. We characterize the dynamic impact of a word pattern Wi on another word pattern Wj by the statistics of their hitting times τij. In what follows, we will describe the statistical analyses of τii and τij, on which we build a language-independent Markov model for semantics.
2.1 Recurrence times and topicality
Assuming uniform reading speed,1 we measure the recurrence times τii for a word pattern Wi through nii samples

1. On the scale of words (rather than phonemes), this assumption works fine in most languages that are written alphabetically. However, this working hypothesis does not extend to Japanese texts, which interlace Japanese syllabograms (lasting one mora per written unit) with Chinese ideograms (lasting one or more morae per written unit).
[Figure 1: sample text with occurrences of the word patterns Wi := happ(ier|ily|iness|y) ≡ {happier, happily, happiness, happy} and Wj := marr(iage|ied|y) ≡ {marriage, married, marry} highlighted, and the effective fragment lengths Lii and Lij marked on the intervening text fragments.]

Fig. 1. Counting long-range transitions between word patterns. A transition from Wi to Wj counts towards long-range statistics if the underlined text fragment in between contains no occurrences of Wi, and lasts strictly longer than the longest word in Wi ∪ Wj. For each long-range transition, the effective fragment length Lij discounts the length of the longest word in Wi ∪ Wj.
[Figure 2, panels (a)–(e).]

Fig. 2. Statistical analysis of recurrence times and topicality. (a) Barcode representations (adapted from [5, Fig. 2]) for the coverage of Wi = Jane(∅|'s) (291 occurrences) and Wj = than (282 occurrences) in the whole text of Pride and Prejudice. The horizontal axis scales linearly with respect to the text length, measured in the number of constituting letters, spaces and punctuation marks. (b) Counts of the word than within a consecutive block of 1217 words (spanning about 1% of the entire text), drawn from 1000 randomly chosen blocks, fitted to a Poisson distribution with mean 2.776 (blue curve). (c) Histogram of effective fragment lengths Lii (see Fig. 1 for the definition) for the topical pattern Wi = Jane(∅|'s), fitted to an exponential distribution (blue line in the semi-log plot) and a weighted mixture of two exponential distributions $c_1 k_1 e^{-k_1 t} + c_2 k_2 e^{-k_2 t}$ (red curve, with c1 : c2 ≈ 1 : 3, k1 : k2 ≈ 1 : 7). (d) Histogram of Ljj for the function word Wj = than, fitted to an exponential distribution (blue line in the semi-log plot). All the parameter estimators in panels b–d are based on maximum likelihood. (c′)–(d′) Reinterpretations of panels c–d, with logarithmic binning on the horizontal axes, to give fuller coverage of the dynamic ranges of the statistics. (e) Recurrence statistics for word patterns in Jane Austen's Pride and Prejudice, where ⟨· · ·⟩ denotes averages over nii samples of long-range transitions. Data points in gray, green and red have radii $\frac14\sqrt{n_{ii}}$. Labels for proper names and some literary motifs are attached next to the corresponding colored dots. Jensen's bound (green dashed line) has unit slope and zero intercept. Exponentially distributed recurrence statistics reside on the line of Poissonian banality (blue line), with unit slope and negative intercept. Red (resp. green) dots mark significant downward (resp. upward) departures from the blue line.
of the effective fragment lengths Lii (Figs. 1, 2a). Here, while counting as in Fig. 1, we ignore contacts between short-range neighbors, which may involve language-dependent redundancies.2
2.1.1 Recurrence of non-topical patterns
In a memoryless (hence banal) Poisson process (Fig. 2b), recurrence times are exponentially distributed (Fig. 2d,d′). The same is also true for word recurrence in a randomly reshuffled text [5]. If we have nii independent samples of exponentially distributed random variables Lii, then the statistic $\delta_i := \log\langle L_{ii}\rangle - \langle\log L_{ii}\rangle - \gamma_0 + \frac{1}{2n_{ii}}$ satisfies the inequality

$$|\delta_i| < \frac{2}{\sqrt{n_{ii}}}\sqrt{\frac{\pi^2}{6} - 1 - \frac{1}{2n_{ii}}} \qquad (1)$$

with probability 95% (see Theorem 1 in Appendix A for a two-sigma rule). Here, $\gamma_0 := \lim_{n\to\infty}\bigl(-\log n + \sum_{m=1}^{n}\frac{1}{m}\bigr)$ is the Euler–Mascheroni constant.

As a working definition, we consider a word pattern Wi non-topical if its nii counts of effective fragment lengths Lii are exponentially distributed, P(Lii > t) ∼ e^{−kt}, within 95% margins of error [that is, satisfying (1) above].

2. For example, a German phrase liebe Studentinnen und Studenten with short-range recurrence is the gender-inclusive equivalent of the English expression dear students. Some Austronesian languages (such as Malay and Hawaiian) use reduplication for plurality or emphasis.
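For concreteness, the banality test in (1) can be carried out in a few lines. The sketch below is our own illustration (not taken from the released repository; variable names are ours): it takes the nii effective fragment lengths of one word pattern and reports whether the pattern passes the two-sigma test for Poissonian banality.

```python
import numpy as np

EULER_GAMMA = 0.5772156649015329  # gamma_0

def is_non_topical(L_ii):
    """Two-sigma test of Poissonian banality, following Eq. (1).

    L_ii : array of effective fragment lengths (recurrence times) of one pattern.
    Returns (non_topical, delta_i, two_sigma_bound)."""
    L = np.asarray(L_ii, dtype=float)
    n = L.size
    # delta_i := log<L_ii> - <log L_ii> - gamma_0 + 1/(2 n_ii)
    delta = np.log(L.mean()) - np.log(L).mean() - EULER_GAMMA + 1.0 / (2 * n)
    # two-sigma error margin from Eq. (1)
    bound = (2.0 / np.sqrt(n)) * np.sqrt(np.pi**2 / 6 - 1 - 1.0 / (2 * n))
    return abs(delta) < bound, delta, bound

rng = np.random.default_rng(0)
# Exponentially distributed recurrence times (a non-topical pattern) pass the test,
print(is_non_topical(rng.exponential(scale=500.0, size=300)))
# while a 1:3 mixture of two widely separated decay rates (cf. Fig. 2c) fails it.
mixture = np.concatenate([rng.exponential(7000.0, 75), rng.exponential(1000.0, 225)])
print(is_non_topical(mixture))
```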
2.1.2 Recurrence of topical patterns
In contrast, we consider a word pattern Wi topical if its diagonal statistics nii, Lii constitute a significant departure from the Poissonian line ⟨log Lii⟩ − log⟨Lii⟩ + γ0 = 0 (Fig. 2e, blue line), violating the bound in (1).

Notably, most data points for topics (colored dots in Fig. 2e) in Jane Austen's Pride and Prejudice mark systematic downward departures from the Poissonian line. This suggests that the topical recurrence times τ = Lii follow
[Figure 3, panels (a), (b), (b′), (c).]

Fig. 3. Automated topic extraction and raw alignment across bilingual corpora. (a) Schematic diagram illustrating our graphical representation of morphologically related words (identified by supervised algorithms in Supplementary Materials) in a word pattern. To avoid unprintably small characters, rarely occurring forms (less than 5% of the total sum of all the words ranked above) are ignored in the graphical display. To enhance the visibility of word stems, we print shared letters only once, and compress other letters vertically, with heights proportional to their corresponding word counts. (b) Word patterns Wi in Jane Austen's Pride and Prejudice, sorted by descending nii, with font size proportional to the square root of $e^{-\langle\log L_{ii}\rangle}$ (a better indicator of a reader's impression than the number of recurrences $n_{ii}\propto e^{-\log\langle L_{ii}\rangle}$). Topical (that is, significantly non-Poissonian) patterns painted in red (resp. green) reside below (resp. above) the critical line of Poissonian banality (blue line in Fig. 2e), where the deviations exceed the error margin prescribed in (1) of §2.1. (b′) A similar service on a French version of Pride and Prejudice (tr. Valentine Leconte & Charlotte Pressoir). (c) A low-cost and low-yield word translation, based on chapter-wise word counts $\mathbf b^{\mathrm{en}}_i$ and $\mathbf b^{\mathrm{fr}}_j$. Ružicka similarities $s_R(\mathbf b^{\mathrm{en}}_i, \mathbf b^{\mathrm{fr}}_j)$ between selected topics (sorted by descending nii ≥ 20) in English and French versions of Pride and Prejudice are shown as a heat map. Rows and columns with maximal $s_R(\mathbf b^{\mathrm{en}}_i, \mathbf b^{\mathrm{fr}}_j)$ less than 0.7 are not shown. Correct matchings are indicated by green cross-hairs.
weighted mixtures of exponential distributions (Fig. 2c,c′):

$$\mathbb P(\tau > t) \sim \sum_m c_m e^{-k_m t}, \qquad (2)$$

(where $c_m, k_m > 0$ and $\sum_m c_m = 1$), which impose an inequality constraint on the recurrence time τ = Lii:

$$\langle\log L_{ii}\rangle - \log\langle L_{ii}\rangle + \gamma_0 = \sum_m c_m \log\frac{1}{k_m} - \log\sum_m \frac{c_m}{k_m} \le 0, \qquad (3)$$

where the last inequality follows from Jensen's inequality, because the logarithm is concave.
2.1.3 Raw alignment of topical patterns
If a word pattern Wi qualifies as a topic by our definition (Fig. 3b,b′), then the signals in its coarse-grained timecourse (say, a vector bi = (bi,1, . . . , bi,61) representing word counts in each chapter of Pride and Prejudice) are not overwhelmed by Poisson noise.

This vectorization scheme, together with the Ružicka similarity [6]

$$s_R(\mathbf b^A_i, \mathbf b^B_j) := \frac{\|\mathbf b^A_i \wedge \mathbf b^B_j\|_1}{\|\mathbf b^A_i \vee \mathbf b^B_j\|_1} \qquad (4)$$

between two vectors with non-negative entries, allows us to align some topics found in parallel versions of the same document, in languages A and B (Fig. 3c). Here, in the definition of the Ružicka similarity, ∧ (resp. ∨) denotes the component-wise minimum (resp. maximum) of vectors; ‖b‖1 sums over all the components of b.
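As an illustration (our own sketch, not the authors' released code; the chapter texts and word forms below are placeholders), the chapter-count vectorization and the Ružicka similarity in (4) amount to:

```python
import numpy as np

def chapter_counts(chapter_texts, word_forms):
    """b_i: per-chapter occurrence counts of a word pattern W_i,
    given as the set of its inflected forms (cf. Fig. 3a)."""
    forms = {w.lower() for w in word_forms}
    return np.array([sum(token in forms for token in chapter.lower().split())
                     for chapter in chapter_texts], dtype=float)

def ruzicka(b_a, b_b):
    """Ruzicka similarity s_R = ||min(b_a, b_b)||_1 / ||max(b_a, b_b)||_1, Eq. (4)."""
    den = np.maximum(b_a, b_b).sum()
    return np.minimum(b_a, b_b).sum() / den if den > 0 else 0.0

# Toy usage on two aligned two-"chapter" texts (placeholders, not the novel).
en_chapters = ["jane was happy , and jane smiled", "bingley spoke to jane"]
fr_chapters = ["jane etait heureuse , et jane souriait", "bingley parla a jane"]
b_en = chapter_counts(en_chapters, {"jane", "jane's"})
b_fr = chapter_counts(fr_chapters, {"jane"})
print(b_en, b_fr, ruzicka(b_en, b_fr))  # [2. 1.] [2. 1.] 1.0
```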
2.2 Markov text model
2.2.1 Transition probabilities via pattern analysis
The diagonal statistics nii, Lii (Fig. 1) have enabled us to extract topics automatically through recurrence time analysis (Figs. 2e and 3b,b′). The off-diagonal statistics nij, Lij (Fig. 1) will allow us to determine how strongly one word pattern Wi binds to another word pattern Wj, through hitting time analysis. In an empirical Markov matrix P = (pij), the long-range transition rate pij is estimated by

$$p_{ij} := \frac{n_{ij}\, e^{-\langle\log L_{ij}\rangle}}{\sum_{k=1}^{N} n_{ik}\, e^{-\langle\log L_{ik}\rangle}}, \qquad (5)$$
[Figure 4, panels (a)–(c); color legend for the four language versions: English, French, Russian, Finnish.]

Fig. 4. Quantitative properties of the Markov text model. (a) Dominant eigenvector π of a 100 × 100 Markov matrix P, computed from one of the four versions of Pride and Prejudice, in comparison with π∗, the list of normalized frequencies for the top 100 word patterns. (b) Precipitous decays of $r_n := \frac12\sum_{1\le i,j\le 100}\bigl|\pi_i p^{(n)}_{ij} - \pi_j p^{(n)}_{ji}\bigr|$ from the initial value r1 ≈ 0.07, for matrix powers $\mathbf P^n = (p^{(n)}_{ij})_{1\le i,j\le 100}$ constructed from four versions of Pride and Prejudice. (In contrast, one has r1 ≈ 0.33 for a random 100 × 100 Markov matrix.) Such quick relaxations support our working hypothesis about detailed balance $\pi^*_i p^*_{ij} = \pi^*_j p^*_{ji}$. (c) Distributions of eigenvalues λ of empirical Markov matrices P, with nearly language-independent modulus |λ(P)| and phase angle arg λ(P).
where nij counts the number of long-range transitions from Wi to Wj, and Lij is a statistic that measures the effective fragment lengths of such transitions (Fig. 1).
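Assuming the counts nij and the averages ⟨log Lij⟩ have already been tallied as in Fig. 1, the estimator (5) is a one-line row normalization. The following sketch is our own illustration, with illustrative variable names, rather than the released implementation:

```python
import numpy as np

def empirical_markov_matrix(n, mean_log_L):
    """Eq. (5): p_ij = n_ij * exp(-<log L_ij>), normalized along each row.

    n          : (N, N) array of long-range transition counts n_ij
    mean_log_L : (N, N) array of <log L_ij>; entries with n_ij == 0 are ignored
    """
    w = np.where(n > 0, n * np.exp(-mean_log_L), 0.0)  # unnormalized hopping rates
    row_sums = w.sum(axis=1, keepdims=True)
    return np.divide(w, row_sums, out=np.zeros_like(w), where=row_sums > 0)

# Toy usage with 3 word patterns.
n = np.array([[5, 3, 1], [2, 6, 2], [1, 1, 4]], dtype=float)
mean_log_L = np.log(np.array([[800.0, 1500.0, 4000.0],
                              [1200.0, 700.0, 3000.0],
                              [5000.0, 2500.0, 900.0]]))
P = empirical_markov_matrix(n, mean_log_L)
print(P.round(3), P.sum(axis=1))  # each row sums to 1
```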
2.2.2 Equilibrium state and detailed balance
Numerically, we find that our empirical Markov matrix P = (pij) defined in (5) is a fair approximation to an ergodic3 matrix P∗ = (p∗ij), which in turn governs the stochastic hoppings between content word patterns during text generation.

Each ergodic Markov matrix $\mathbf P^* = (p^*_{ij})_{1\le i,j\le N}$ possesses a unique equilibrium state $\boldsymbol\pi^* = (\pi^*_i)_{1\le i\le N}$. The equilibrium state π∗ represents a probability distribution (that is, $\pi^*_i \ge 0$ for $1\le i\le N$ and $\sum_{i=1}^{N}\pi^*_i = 1$) that satisfies π∗P∗ = π∗ (that is, $\sum_{1\le i\le N}\pi^*_i p^*_{ij} = \pi^*_j$ for $1\le j\le N$). In our numerical experiments, the dominant eigenvector π (satisfying πP = π) consistently reproduces word frequency statistics that are proportional to the ideal equilibrium state π∗ (Fig. 4a).

Furthermore, through numerical experimentation, we find that our empirical Markov matrix P ≈ P∗ approximately honors the detailed balance condition $\pi^*_i p^*_{ij} = \pi^*_j p^*_{ji}$ for $1\le i,j\le N$. The approximation $\pi_i p^{(n)}_{ij} \approx \pi_j p^{(n)}_{ji}$ becomes closer as we go to higher iterates $\mathbf P^n = (p^{(n)}_{ij})$, where n is a small positive integer (Fig. 4b).

On an ergodic Markov chain with detailed balance, one can show that recurrence times are distributed as weighted mixtures of exponential decays (see Theorem 3 in Appendix B.3), thus offering a theoretical explanation for (2).

3. If a Markov chain is ergodic, then there is a strictly positive probability to transition from any Markov state (that is, any individual word pattern in our model) to any other state, after finitely many steps.
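The relaxation diagnostic of Fig. 4b can be reproduced for any empirical matrix with a few lines of linear algebra. The sketch below is ours (a power iteration stands in for whatever eigensolver one prefers); it computes the dominant left eigenvector π and the defect r_n:

```python
import numpy as np

def stationary_distribution(P, n_iter=5_000):
    """Power iteration for the dominant left eigenvector, pi P = pi."""
    pi = np.full(P.shape[0], 1.0 / P.shape[0])
    for _ in range(n_iter):
        pi = pi @ P
        pi /= pi.sum()
    return pi

def detailed_balance_defect(P, n=1):
    """r_n = (1/2) * sum_ij |pi_i p^(n)_ij - pi_j p^(n)_ji| (cf. Fig. 4b)."""
    pi = stationary_distribution(P)
    Pn = np.linalg.matrix_power(P, n)
    flux = pi[:, None] * Pn          # probability flux pi_i p^(n)_ij
    return 0.5 * np.abs(flux - flux.T).sum()

# A reversible chain has r_n = 0; a random Markov matrix does not.
rng = np.random.default_rng(1)
Q = rng.random((100, 100))
Q /= Q.sum(axis=1, keepdims=True)
print([round(detailed_balance_defect(Q, n), 3) for n in (1, 2, 3)])
```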
2.2.3 Spectral invariance under translation
The spectrum σ(P) (the collection of eigenvalues) is approximately invariant against translations of texts (Fig. 4c), which can be explained by a matrix equation

$$\mathbf P_A\, \mathbf T_{A\to B} = \mathbf T_{A\to B}\, \mathbf P_B. \qquad (6)$$

Here, both sides of the identity above quantify the transition probabilities from words in language A to words in language B, from the impressions of Alice and Bob, two monolingual readers in a thought experiment. On the left-hand side, Alice first processes the input in her native language A by a Markov matrix PA, and then translates into language B, using a dictionary matrix TA→B; on the right-hand side, Bob needs to first translate the input into language B, using the same dictionary TA→B, before brainstorming in his own native language, using PB. Putatively, the matrix equation holds because semantic content is shared by native speakers of different languages. In the ideal scenario where translation is lossless (with invertible TA→B), the Markov matrices PA and PB are indeed linked to each other by a similarity transformation that leaves their spectrum intact.
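A quick numerical check of this similarity-transformation argument (our own illustration, under the idealized assumption that the dictionary is a lossless word-to-word map, i.e. a permutation matrix):

```python
import numpy as np

rng = np.random.default_rng(5)
N = 6
# A Markov matrix for "language B" and an invertible "dictionary" T (illustrative).
P_B = rng.random((N, N))
P_B /= P_B.sum(axis=1, keepdims=True)
T = np.eye(N)[rng.permutation(N)]            # lossless word-to-word dictionary
# Ideal lossless translation: P_A = T P_B T^{-1} satisfies P_A T = T P_B, Eq. (6).
P_A = T @ P_B @ np.linalg.inv(T)
print(np.allclose(P_A @ T, T @ P_B))          # True
# The two spectra then coincide up to ordering.
spectrum = lambda M: np.sort_complex(np.linalg.eigvals(M))
print(np.allclose(spectrum(P_A), spectrum(P_B)))  # True
```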
2.3 Localized Markov matrices and semantic cliques
2.3.1 Semantic contexts for recurrent topics
Specializing spectral invariance to individual topical patterns, we will be able to generate semantic fingerprints through a list of topic-specific and language-independent eigenvalues. Here, we will be particularly interested in recurrence eigenvalues of individual topical patterns, which correspond to multiple decay rates in the weighted mixtures of exponential distributions.

Unlike the single exponential decays associated to non-topical recurrence patterns, the multiple exponential decay modes will enable our robot reader to easily discern one topic from another. In general, it is numerically challenging to recover multiple exponential decay modes from a limited amount of recurrence time measurements [7]. However, in text processing, we can circumvent such difficulties by off-diagonal statistics nij and Lij that provide semantic contexts for individual topical patterns.
To quantitatively define the semantic content of a topical pattern Wi, we specify a local, directed, and weighted graph, corresponding to a localized Markov transition matrix P[i].
2.3.2 Localized Markov contexts of topical patterns
To localize, we need to remove edges between two vertices Wi and Wj when the hitting times Lij and Lji are "long enough" relative to what one could naïvely expect from the recurrence time statistics nij, nji and Lii, Ljj. Here, for the naïve expectation, we approximate the probability
[Figure 5, panels (a)–(c). Panel (a) plots P(⟨log Lij⟩ > ℓ) against ℓ for the word patterns W1 = Eliza(∅|beth|beth's), W2 = Darcy(∅|'s), W3 = pr(ide|ided|oud|oudly|oudest), W4 = Jane(∅|'s). Panel (c) reports, for 13 languages in 5 families, namely Indo-European (Danish, German, Dutch, Spanish, French, Latin, Polish, Russian), Koreanic (Korean), Turkic (Turkish), Uralic (Finnish, Hungarian) and Vasconic (Basque), the number of topics (on a 0–120 scale) translated correctly, closely, or incorrectly.]

Fig. 5. Semantic cliques and their applications to word translation. (a) Empirical distributions of ⟨log Lij⟩ in Pride and Prejudice, as gray and colored dots with radii $\frac14\sqrt{n_{ij}}$, compared to the Gaussian model αij(ℓ) (colored curves parametrized by (7) and (8)). The numerical samplings of Wj's exhaust all the textual patterns available in the novel, including topical word patterns, non-topical word patterns and function words. Only those textual patterns with over 40 occurrences are displayed as data points. The inset of each frame shows the semantic clique Si surrounding topic Wi (painted in black), color-coded by the αij(⟨log Lij⟩) score. The areas of the bounding boxes for individual word patterns are proportional to the components of π[i] (the equilibrium state of P[i]). (b) Distributions for the magnitudes of eigenvalues (LISF) in the recurrence matrices R[i], for three concepts from four versions of Pride and Prejudice. The color encoding for languages follows Fig. 4. The largest ⌊e^{ηi}⌋ magnitudes of eigenvalues are displayed as solid lines, while the remaining terms are shown as dashed lines. The inset of each frame shows the semantic clique Si, counterclockwise from top-left, in French, Russian and Finnish. (c) Yields from bipartite matching of LISF (see Fig. 6 for English–French) for topical words between the English original of Pride and Prejudice and its translations into 13 languages out of 5 language families.
P(⟨log Lij⟩ > ℓ) by a Gaussian model αij(ℓ) (colored curves in Fig. 5a):

$$\mathbb P(\langle\log L_{ij}\rangle > \ell) \approx \alpha_{ij}(\ell) := \sqrt{\frac{n_{ij}}{2\pi\beta_i}} \int_{\ell}^{\infty} e^{-\frac{n_{ij}(x-\ell_i)^2}{2\beta_i}}\,\mathrm d x, \qquad (7)$$

whose mean and variance are deducible from nij and Lii (see Theorem 4 in Appendix B.4):

$$\ell_i := \frac{\langle L_{ii}\log L_{ii}\rangle}{\langle L_{ii}\rangle} - 1, \qquad \beta_i := \frac{\langle L_{ii}(\ell_i - \log L_{ii})^2\rangle}{\langle L_{ii}\rangle}. \qquad (8)$$

The parameters in the Gaussian model are justified by the relation between hitting and recurrence times [8] on an ergodic Markov chain with detailed balance, and become asymptotically exact if distinct word patterns are statistically independent (such as α13, α24, α31, α34 in Fig. 5a). Here, statistical independence justifies the additivity of variances, hence the √nij factor in (7); sums of independent samples of log Lij become asymptotically Gaussian, thanks to the central limit theorem. Failing that, the actual ranking of ⟨log Lij⟩ may deviate from the Gaussian model prediction in (7), as for the intimately related pairs of words Elizabeth/Darcy, Elizabeth/Jane, Darcy/Elizabeth, Darcy/pride and pride/Darcy.
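A sketch of this scoring step (ours, with illustrative names), assuming the recurrence samples Lii and the off-diagonal summaries (nij, ⟨log Lij⟩) are already available: the parameters ℓi and βi come from (8), and αij evaluates the Gaussian tail in (7) at the observed ⟨log Lij⟩.

```python
import numpy as np
from math import erf, sqrt

def gaussian_params(L_ii):
    """ell_i and beta_i from Eq. (8), computed from recurrence samples L_ii."""
    L = np.asarray(L_ii, dtype=float)
    ell = (L * np.log(L)).sum() / L.sum() - 1.0
    beta = (L * (ell - np.log(L)) ** 2).sum() / L.sum()
    return ell, beta

def alpha_score(mean_log_L_ij, n_ij, ell_i, beta_i):
    """alpha_ij(<log L_ij>) from Eq. (7): tail probability of N(ell_i, beta_i/n_ij)."""
    z = (mean_log_L_ij - ell_i) / sqrt(beta_i / n_ij)
    return 0.5 * (1.0 - erf(z / sqrt(2.0)))

# Toy usage: an unusually short observed <log L_ij> (tight binding) scores high.
rng = np.random.default_rng(2)
L_ii = rng.exponential(2000.0, size=200)
ell_i, beta_i = gaussian_params(L_ii)
print(alpha_score(ell_i - 0.8, n_ij=40, ell_i=ell_i, beta_i=beta_i))  # close to 1
print(alpha_score(ell_i + 0.8, n_ij=40, ell_i=ell_i, beta_i=beta_i))  # close to 0
```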
2.3.3 Markov criteria for semantic cliques
Empirically, we find that higher αij(ℓ) scores point to closer affinities between word patterns (Fig. 5a), attributable to kinship (Elizabeth, Jane), courtship (Darcy, Elizabeth), disposition (Darcy, pride) and so on. Our robot reader automatically detects such affinities, without references other than the novel itself. Therefore, we can use the αij(ℓ) scores as guides to numerical approximations of semantic fields, hereafter referred to as semantic cliques.

We invite a topical pattern Wj to the semantic clique Si (Figs. 5a and b, insets) surrounding Wi if $\min\{\alpha_{ij}(\langle\log L_{ij}\rangle),\, \alpha_{ji}(\langle\log L_{ji}\rangle)\} > \alpha^*$ for a standard Gaussian threshold $\alpha^* := \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{1} e^{-x^2/2}\,\mathrm d x \approx 0.8413$. This operation emulates the brainstorming procedure of a human reader, who associates one word with another only when they stay much closer than two randomly picked words, according to his/her impression.
Indeed, by numerical brainstorming from Wi, our semantic cliques Si (Figs. 5a and b, insets) inform us about their center word Wi, through several types of semantic relations, including, but not limited to:

• Synonyms (pride and vanity in English, orgueil and fierté in French, etc.);
• Temperaments (Elizabeth, a delightful girl, often laughs, corresponding to the French verbs sourire and rire);
• Co-references (e.g. Darcy as a personification of pride);
• Causalities (such as pride based on fortune).
On a local graph with vertices $S_i = \{W_{i_1} = W_i, W_{i_2}, \ldots, W_{i_{N_i}}\}$, we specify the connectivity of each directed edge by a localized Markov matrix $\mathbf P^{[i]} = (p^{[i]}_{jk})_{1\le j,k\le N_i}$. This localized Markov matrix is the row-wise normalization of an Ni × Ni subblock of P with the same set of vertices as Si. Resetting the entries $p^{[i]}_{1k}$ and $p^{[i]}_{j1}$ to zero, one arrives at the localized recurrence matrix R[i]. We call R[i] a recurrence matrix because one can use it to compute the distribution of recurrence times to the Markov state Wi in Si. As we will see soon in the applications below, the eigenvalues of R[i], when properly arranged, become language-independent semantic fingerprints.
[Figure 6: heat map of semantic similarities between English word patterns (rows) and French word patterns (columns) of Pride and Prejudice, with a color scale running from 0 to 1.]

Fig. 6. Automated alignments of vectorized topics via bipartite matching of semantic similarities. The semantic similarities $s(W^{\mathrm{en}}_i, W^{\mathrm{fr}}_j)$ are computed for selected topics (sorted by descending nii ≥ 20) in two versions of Pride and Prejudice. Rows and columns filled with zeros are not shown. Cross-hairs meet at optimal nodes that solve the bipartite matching problem. The thickness of each horizontal (resp. vertical) cross-hair is inversely proportional to the row-wise (resp. column-wise) ranking of the similarity score for the optimal node. A green (resp. amber) cross-hair indicates an exact (resp. a close but non-exact) match. At the same confidence level (0.7) for similarities, this experiment has better recall than Fig. 3c, without much cost in precision.
3 APPLICATIONS
3.1 Automated word translations from bilingual documents
Experimentally, we resolve the connectivity of an individual pattern Wi through the recurrence spectrum σ(R[i]) (Fig. 5b). The dominant eigenvalues of R[i] are concept-specific while remaining nearly language-independent (a localized version of the invariance in Fig. 4c). Such empirical evidence motivates us to define the language-independent semantic fingerprint (LISF) of a word pattern Wi by a descending list of the magnitudes of eigenvalues

$$\mathbf v_i = \bigl(|\lambda_1(\mathbf R^{[i]})|,\, |\lambda_2(\mathbf R^{[i]})|,\, \ldots\bigr), \qquad (9)$$

computable from its semantic clique Si. We zero-pad this vector from the (⌊e^{ηi}⌋ + 1)st component onwards, where ηi is the Kolmogorov–Sinai entropy production rate of the Markov matrix P[i], measured in nats per word.4
4. The entropy production rate $\eta(\mathbf P) := -\sum_{i,j}\pi_i p_{ij}\log p_{ij}$ [9, (4.27)] of a Markov matrix P represents the weighted average (assigning probability mass πi to the ith Markov state) of Boltzmann's partition entropies $-\sum_j p_{ij}\log p_{ij}$ [10, §8.2]. We have η(P) ≤ log N for an N × N Markov matrix P with strictly positive entries [10, Theorem 14.1].
Via bipartite matching (Fig. 6) of the word vectors vi across languages, our algorithm translates words from parallel texts at very high precision (Fig. 5c), being competitive with state-of-the-art algorithms for bilingual word mapping [11], [12].

Unlike the vector bi (Fig. 3c), which captures only chapter-scale features of Wi, the semantic fingerprint vi in (9) characterizes the kinetic behavior of Wi on all the long-range time scales.
Given a topical pattern $W^A_i$ in language A, its semantic fingerprint $\mathbf v^A_i$ (a descending list of recurrence eigenvalues, as in Fig. 5b) allows us to numerically locate a semantically close pattern in a parallel text written in another language B, in two steps:

(1) Divide the document into K chapters, and define the semantic similarity function as $s(W^A_i, W^B_j) := s_R(\mathbf v^A_i, \mathbf v^B_j)$ if

$$s_R(\mathbf b^A_i, \mathbf b^B_j) \ge \max\left\{1 - 0.07\sqrt{K},\ 1 - \sqrt{\frac{\|\mathbf b^A_i \wedge \mathbf b^B_j\|_0}{\|\mathbf b^A_i \vee \mathbf b^B_j\|_1}}\right\} \qquad (10)$$

(which is a ballpark screening more robust than Fig. 3c, with ‖b‖0 counting the number of non-zero components of b) and $s_R(\mathbf v^A_i, \mathbf v^B_j) \ge 0.7$; $s(W^A_i, W^B_j) := 0$ otherwise.
[Figure 7(a): Query WikiQA-Q26: "How did Anne Frank die?", with the Wikipedia article "Anne Frank" as the reference. The thought bubble Q ∪ Q′ built around {Anne, Frank, die} contains, among others, the patterns die(d), typhus, centre, survived, Frankfurt, early(-ier), Goslar, 1945, camps, restored, films, Holocaust, 1933, existed, Bergen-Belsen, concentration, documentary, private, action. Top 5 candidate answers (punctuation and spacing as given by WikiQA):
(1) Anne Frank and her sister, Margot, were eventually transferred to the Bergen-Belsen concentration camp, where they died of typhus in March 1945.
(2) Annelies "Anne" Marie Frank (, ?, ; 12 June 1929 – early March 1945) is one of the most discussed Jewish victims of the Holocaust.
(3) Otto Frank, the only survivor of the family, returned to Amsterdam after the war to find that Anne's diary had been saved, and his efforts led to its publication in 1947.
(4) As persecutions of the Jewish population increased in July 1942, the family went into hiding in the hidden rooms of Anne's father, Otto Frank's, office building.
(5) The Frank family moved from Germany to Amsterdam in 1933, the year the Nazis gained control over Germany.]

[Figure 7(b): Evaluation results on WikiQA.]

    MAP     Method       MRR
    0.6190  CNN          0.6281
    0.6091  LISF∗        0.6268
    0.5993  LCLR         0.6086
    0.5899  LISF         0.6060
    0.5110  PV           0.5160
    0.4891  Word Count   0.4924
    0.3913  Random Sort  0.3990

Fig. 7. Applications of semantic cliques to question-answering. (a) A construction of the semantic clique Q ∪ Q′ (based on Q = {Anne, Frank, die}) weighted by the PageRank equilibrium state $\overline{\boldsymbol\pi}$, and subsequent question-answering. The top 5 candidate answers, with punctuation and spacing as given by WikiQA, are shown with font sizes proportional to the entropy production score in (11). Here, the top-scoring sentence with highlighted background is the same as the official answer chosen by the WikiQA team. Like a human reader, our algorithm automatically detects the place "Bergen-Belsen concentration camp", cause "typhus", and year "1945" of Anne Frank's death. (b) Evaluations of our model (LISF and LISF∗) on the WikiQA data set, in comparison with established algorithms.
(2) Solve a bipartite matching problem (Fig. 6) that maximizes $\sum_{i,j} s(W^A_i, W^B_j)$, using the Hungarian Method [13] attributed to Jacobi–König–Egerváry–Kuhn [14].
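Steps (1) and (2) can be prototyped with SciPy's implementation of the Hungarian method. The sketch below is our own simplification (it assumes SciPy is available and replaces the chapter-count screening of (10) by a plain similarity threshold): it matches LISF vectors across two languages by maximizing the summed Ružicka similarities.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def ruzicka(u, v):
    """Ruzicka similarity of two non-negative vectors (zero-padded to equal length)."""
    m = max(len(u), len(v))
    u = np.pad(np.asarray(u, float), (0, m - len(u)))
    v = np.pad(np.asarray(v, float), (0, m - len(v)))
    den = np.maximum(u, v).sum()
    return np.minimum(u, v).sum() / den if den > 0 else 0.0

def match_fingerprints(v_A, v_B, threshold=0.7):
    """Bipartite matching of LISF vectors across languages A and B.

    v_A, v_B : lists of descending eigenvalue-magnitude vectors, Eq. (9).
    Returns triples (i, j, score) whose similarity passes the threshold."""
    S = np.array([[ruzicka(a, b) for b in v_B] for a in v_A])
    S = np.where(S >= threshold, S, 0.0)      # screen out weak candidates
    rows, cols = linear_sum_assignment(-S)    # Hungarian method, maximizing sum
    return [(i, j, round(S[i, j], 3)) for i, j in zip(rows, cols) if S[i, j] > 0]

# Toy usage with three English and three French fingerprints (made-up numbers).
v_en = [[0.9, 0.5, 0.1], [0.8, 0.2], [0.7, 0.6, 0.3]]
v_fr = [[0.7, 0.55, 0.3], [0.88, 0.52, 0.12], [0.81, 0.22]]
print(match_fingerprints(v_en, v_fr))  # pairs (0,1), (1,2), (2,0)
```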
3.2 Machine-assisted text comprehension on the WikiQA data set
By automatically discovering related words through numerical brainstorming (Figs. 5a and b, insets), our semantic cliques Si are useful in text comprehension and question answering. We can expand a set of question words $Q = \{W_{q_1}, \ldots, W_{q_K}\}$ into $Q\cup Q' = \bigcup_{k=1}^{K} S_{q_k}$, by bringing together the semantic cliques $S_{q_k}$ generated from a reference text by each and every question word $W_{q_k}$.
As before, we construct a localized Markov matrix $\mathbf P = (p_{ij})_{1\le i,j\le N}$ on this subset of word patterns Q ∪ Q′. We further use the Brin–Page damping [15] to derive an ergodic Markov matrix $\overline{\mathbf P} = (\overline p_{ij})_{1\le i,j\le N}$, where $\overline p_{ij} = 0.85\, p_{ij} + \frac{0.15}{N}$.

By analogy with the behavior of internet surfing [15], [16], we model the process of associative reasoning [17] as a navigation through the nodes Q ∪ Q′ according to $\overline{\mathbf P}$, which quantifies the click-through rate from one idea to another. The PageRank recursion [16] ensures a unique equilibrium state $\overline{\boldsymbol\pi}$ attached to $\overline{\mathbf P}$. If our question Q and a candidate answer A contain, respectively, words from $W_{Q_1}, \ldots, W_{Q_m} \in Q$ and $W_{A_1}, \ldots, W_{A_n} \in Q\cup Q'$ (counting multiplicities, but excluding function words and patterns with fewer than 3 occurrences in the reference document), then we assign the following entropy production score

$$F[Q, A] := -\sum_{i=1}^{m}\sum_{j=1}^{n} \overline\pi_{Q_i}\,\overline p_{Q_i A_j}\log \overline p_{Q_i A_j} \qquad (11)$$

to this question–answer pair.5
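A condensed sketch of this scoring pipeline (ours, with hypothetical variable names and a toy graph): Brin–Page damping produces an ergodic matrix, power iteration gives the PageRank weights, and (11) scores a question-answer pair.

```python
import numpy as np

def damped_matrix(P, damping=0.85):
    """Brin-Page damping: p_bar_ij = d * p_ij + (1 - d)/N, making the chain ergodic."""
    N = P.shape[0]
    return damping * P + (1.0 - damping) / N

def pagerank(P_bar, n_iter=1_000):
    """Equilibrium state pi_bar of the damped matrix (dominant left eigenvector)."""
    pi = np.full(P_bar.shape[0], 1.0 / P_bar.shape[0])
    for _ in range(n_iter):
        pi = pi @ P_bar
        pi /= pi.sum()
    return pi

def qa_score(P_bar, pi_bar, question_idx, answer_idx):
    """Entropy-production score F[Q, A] of Eq. (11); indices are repeated
    according to the multiplicities of the words in question and answer."""
    return -sum(pi_bar[q] * P_bar[q, a] * np.log(P_bar[q, a])
                for q in question_idx for a in answer_idx)

# Toy usage: a 4-node graph on Q ∪ Q'; the question uses node 0, answers differ.
rng = np.random.default_rng(4)
P = rng.random((4, 4))
P /= P.sum(axis=1, keepdims=True)
P_bar = damped_matrix(P)
pi_bar = pagerank(P_bar)
print(qa_score(P_bar, pi_bar, [0], [1, 2]), qa_score(P_bar, pi_bar, [0], [3]))
```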
A sample workflow is shown in Fig. 7a, to illustrate how our rudimentary question-answering machine handles a query. To answer a question, we use a single Wikipedia page (without infoboxes and other structural data) as the only reference document and training source. Like a typical human reader of Wikipedia, our numerical associative reasoning generates a weighted set of nodes Q ∪ Q′ (presented graphically as a thought bubble in Fig. 7a), without the help of external stimuli or knowledge feeds. Here, the relative weights in the nodes of Q ∪ Q′ are computed from the equilibrium state $\overline{\boldsymbol\pi}$ of $\overline{\mathbf P}$, via the PageRank algorithm.

We then test our semantic model (LISF in Fig. 7b) on all the 1242 questions in the WikiQA data set, each of which is accompanied by at least one correct answer located in a designated Wikipedia page. Our algorithm's performance is roughly on par with the LCLR and CNN benchmarks [18], improving upon the baseline by a significant margin. This is perhaps remarkable, considering the relatively scant data at our disposal. Unlike the LCLR approach, our numerical discovery of synonyms does not draw on the WordNet database [19] or pre-existing corpora of question-answer pairs. Unlike the CNN method, we do not need pre-trained word2vec embeddings [20] as semantic input.

Moreover, our algorithm (LISF∗ in Fig. 7b) performs slightly better on a subset of 990 questions that do not require quantitative cues (How big? How long? How many? How old? What became of? What happened to? What year? and so on). This indicates that, with a Markov chain description of two-body interactions between topics, our structural model fits associative reasoning better than rule-based reasoning [17], while imitating human behavior in the presence of limited data. To enhance the reasoning capabilities of our algorithm, it is perhaps appropriate to apply a Markov random field [21, §4.1.3] to graphs of word patterns, to capture many-body interactions among different topics.

5. One may compare the score F[Q,A] to the Kolmogorov–Sinai entropy production rate [9, (4.27)] $\eta(\overline{\mathbf P}) = -\sum_{i=1}^{N}\sum_{j=1}^{N}\overline\pi_i\,\overline p_{ij}\log\overline p_{ij}$ of a Markov matrix $\overline{\mathbf P} = (\overline p_{ij})_{1\le i,j\le N}$. The score F[Q,A] is modeled after Boltzmann's partition entropies, weighted by words in the question, and sifted by topics in the answer. Such a weighting and sifting method is analogous to the definition of scattering cross-sections in particle physics.
4 CONCLUSION
In our current work, we define semantics through algebraic invariants that are concept-specific and language-independent. To construct such invariants, we develop a stochastic model that assigns a semantic fingerprint (a list of recurrence eigenvalues) to each concept via its long-range contexts. Consistently using a single Markov framework, we are able to extract topics (Figs. 2e, 3b,b′), translate topics (Figs. 3c, 4c, 5b,c, 6) and understand topics (Figs. 5a,b, 7a,b), through statistical mining of short and medium-length texts. In view of these three successful applications, we are probably close to a complete set of semantic invariants, after demystifying the long-range behavior of human languages.

Notably, our algorithms apply to documents of moderate lengths, similar to the experience of human readers. This contrasts with data-hungry algorithms in machine learning [18], [22], which utilize high-dimensional numerical representations of words and phrases [11], [12], [20], [23] from large corpora. Our semantic mechanism exhibits universality on long-range linguistic scales. This adds to our quantitative understanding of diversity on shorter-range linguistic scales, such as phonology [24], [25], [26], morphology [27], [28], [29], [30] and syntax [3], [30], [31], [32], [33].
Thanks to the independence between semantics and syntax [3], our current model conveniently ignores the non-Markovian syntactic structures which are essential to fluent speech. In the near future, we hope to extend our framework further, to incorporate both Markovian and non-Markovian features across different ranges. The Mathematical Principles of Natural Languages, as we envision, must and will combine the statistical analysis of a Markov model with linguistic properties on shorter time scales that convey morphological [27], [28], [29], [30] and syntactical [3], [30], [31], [32], [33] information.
APPENDIX A
POISSONIAN BANALITY AND NON-TOPICALITY
Here, we first present a proof of our statistical criterion for Poissonian banality (which we identify with non-topicality of word patterns), as stated in (1).
Theorem 1 (Sums and products of exponentially distributed random variables). Let X1, . . . , XN be independent random variables, each obeying an exponential distribution with mean 1. The probability distribution of

$$Y_N := \log\left(\frac{1}{N}\sum_{i=1}^{N}X_i\right) - \frac{1}{N}\sum_{i=1}^{N}\log X_i \qquad (12)$$

is asymptotic to a Gaussian law

$$\mathbb P\left(\frac{Y_N - \bigl(\gamma_0 - \frac{1}{2N}\bigr)}{\frac{1}{\sqrt N}\sqrt{\frac{\pi^2}{6} - 1 - \frac{1}{2N}}} < a\right) \sim \int_{-\infty}^{a} e^{-x^2/2}\,\frac{\mathrm d x}{\sqrt{2\pi}} \qquad (13)$$

as N → ∞.
Proof: By definition, we have a moment-generating function

$$\mathbb E\, e^{-tY_N} = \int_{(0,+\infty)^N} \frac{\prod_{i=1}^{N} x_i^{t/N}}{\bigl(\frac1N\sum_{i=1}^{N} x_i\bigr)^t}\, e^{-\sum_{i=1}^{N} x_i}\,\mathrm d x_1\cdots\mathrm d x_N = 2^N \int_{(0,+\infty)^N} \frac{\prod_{i=1}^{N} \xi_i^{2t/N+1}}{\bigl(\frac1N\sum_{i=1}^{N} \xi_i^2\bigr)^t}\, e^{-\sum_{i=1}^{N} \xi_i^2}\,\mathrm d\xi_1\cdots\mathrm d\xi_N. \qquad (14)$$

To evaluate the last multiple integral, we use the spherical coordinates $\xi_1 = r\cos\theta_1$, $\xi_2 = r\sin\theta_1\cos\theta_2$, $\xi_3 = r\sin\theta_1\sin\theta_2\cos\theta_3$, . . . , $\xi_N = r\prod_{j=1}^{N-1}\sin\theta_j$ (where $r > 0$, and $0 < \theta_j < \pi/2$ for all $j\in\{1,\ldots,N-1\}$), with volume element

$$\mathrm d\xi_1\cdots\mathrm d\xi_N = r^{N-1}\,\mathrm d r \prod_{j=1}^{N-1}\sin^{N-1-j}\theta_j\,\mathrm d\theta_j. \qquad (15)$$

The result reads

$$\mathbb E\, e^{-tY_N} = N^t\,\Gamma(N)\prod_{j=1}^{N-1}\frac{\Gamma\bigl(\frac{N+t}{N}\bigr)\,\Gamma\bigl(\frac{(N-j)(N+t)}{N}\bigr)}{\Gamma\bigl(\frac{(N-j+1)(N+t)}{N}\bigr)} = \frac{N^t\,\Gamma(N)}{\Gamma(N+t)}\left[\Gamma\Bigl(1+\frac{t}{N}\Bigr)\right]^N, \qquad (16)$$

where $\Gamma(s) := \int_0^\infty x^{s-1}e^{-x}\,\mathrm d x$ is Euler's gamma function. Consequently, the proof of (13) builds on a cumulant expansion of (16), that is, a development of $\log\mathbb E\, e^{-tY_N}$ up to $O(t^2)$ terms.
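For completeness, here is a sketch of that cumulant expansion (our reconstruction of the routine computation, using the standard expansions of log Γ and the digamma function ψ):

$$\begin{aligned}
\log\mathbb E\, e^{-tY_N} &= t\log N + \log\Gamma(N) - \log\Gamma(N+t) + N\log\Gamma\Bigl(1+\tfrac{t}{N}\Bigr)\\
&= t\log N - t\,\psi(N) - \tfrac{t^2}{2}\,\psi'(N) - \gamma_0 t + \tfrac{\pi^2 t^2}{12N} + O(t^3)\\
&= -t\Bigl(\gamma_0 - \tfrac{1}{2N}\Bigr) + \tfrac{t^2}{2}\cdot\tfrac{1}{N}\Bigl(\tfrac{\pi^2}{6} - 1 - \tfrac{1}{2N}\Bigr) + O(t^3) + O\Bigl(\tfrac{t}{N^2}\Bigr),
\end{aligned}$$

using $\log\Gamma(N+t) = \log\Gamma(N) + t\psi(N) + \tfrac{t^2}{2}\psi'(N) + O(t^3)$, $N\log\Gamma(1+\tfrac tN) = -\gamma_0 t + \tfrac{\pi^2 t^2}{12N} + O(t^3/N^2)$, $\psi(N) = \log N - \tfrac{1}{2N} + O(N^{-2})$ and $\psi'(N) = \tfrac1N + \tfrac{1}{2N^2} + O(N^{-3})$. Hence $Y_N$ has mean $\gamma_0 - \tfrac{1}{2N}$ and variance $\tfrac1N\bigl(\tfrac{\pi^2}{6} - 1 - \tfrac{1}{2N}\bigr)$ up to the stated corrections, which yields the Gaussian law (13).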
If we have N samples of recurrence times from a Poisson process, then the statistic YN satisfies the inequality in (1) with probability $\int_{-2}^{2} e^{-x^2/2}\,\frac{\mathrm d x}{\sqrt{2\pi}} \approx 0.95$, in view of the theorem above.
APPENDIX B
RECURRENCE TIMES ON AN ERGODIC MARKOV CHAIN WITH DETAILED BALANCE
B.1 Background in probability theory
Based on numerical evidence (Fig. 4), we postulate that at the discourse level (the longest time scale in Friederici's [4] neurobiological hierarchy), the production of natural language texts can be caricatured by the stochastic transitions on a stationary and ergodic Markov chain $\mathscr M = (\mathscr S, \mathbf P)$.6 Here, the state space $\mathscr S = \{W_1, \ldots, W_N\}$ runs over finitely many word patterns occurring in the text, which in turn is representable as a discrete-time stochastic process $(X(0), X(1), \ldots, X(n), \ldots)$; the transition matrix $\mathbf P = (p_{ij})$ describes the conditional hopping probability on the web of word patterns: $p_{ij} = \mathbb P(X(n+1) = W_j \mid X(n) = W_i)$, for $n\in\mathbb Z_{\ge0}$. Depending on context, we also model a document by a localized Markov chain $\mathscr M' = (\mathscr S', \mathbf P')$, where certain word patterns Wi are removed from the state space $\mathscr S$ to form a proper subset $\mathscr S' \subsetneq \mathscr S$.

In a more formal setting, our notation $\mathscr M = (\mathscr S, \mathbf P)$ for the Markov chain should be expanded into $\mathscr M = (\Omega, \mathscr F, \{\mathbb P_{\mathsf W} : \mathsf W\in\mathscr S\}, \{X(t) : t\in\mathbb Z_{\ge0}\}, \{\theta_n : n\in\mathbb Z_{\ge0}\})$, whose components are explained below.

• The sample space Ω consists of all stochastic trajectories $\omega = (X(0), X(1), \ldots, X(t), X(t+1), \ldots) = (X(t))_{t\in\mathbb Z_{\ge0}}$ on the state space $\mathscr S$, namely, all possible texts that can be analyzed by our particular model.
• Each member in the field of events $\mathscr F = 2^\Omega$ (the totality of all subsets of the sample space Ω) is a set containing zero or more stochastic processes that can be regarded as caricatures of text productions.
• The family of probability measures $\{\mathbb P_{\mathsf W} : \mathsf W\in\mathscr S\}$ is related to the transition matrix $\mathbf P = (p_{ij})_{1\le i,j\le N}$ by the identity $p^{(t)}_{ij} = \mathbb P_{W_i}(X(t) = W_j) = \mathbb P(X(t) = W_j \mid X(0) = W_i)$, $\forall W_i, W_j\in\mathscr S$, $\forall t\in\mathbb Z_{>0}$.
• The shift operator $\theta_n : \Omega\to\Omega$ acts on an arbitrary trajectory $(X(0), X(1), \ldots, X(t), X(t+1), \ldots) = \omega\in\Omega$ in the following manner: $\theta_n\,\omega = (X(n), X(n+1), \ldots, X(t+n), X(t+n+1), \ldots)$, $\forall t, n\in\mathbb Z_{\ge0}$.

For discussions of the stopping times (a class of random variables on Markov chains) as well as the Markov property, it is convenient to further introduce the notation $\mathscr F_n$ for $n\in\mathbb Z_{\ge0}$. Here, $\mathscr F_n$ is the smallest σ-algebra containing all the events of the form $\{X(0) = W_{i_0}, X(1) = W_{i_1}, \ldots, X(n) = W_{i_n}\}$. The family of σ-algebras $\{\mathscr F_n : n\in\mathbb Z_{\ge0}\}$ forms a filtration: $\mathscr F_n\subseteq\mathscr F_m$ if $n\le m$. For any $n\in\mathbb Z_{\ge0}$, $B\in\mathscr F$, $\mathsf W\in\mathscr S$, we have the following relation concerning conditional expectations:

$$\mathbb E_{\mathsf W}(\mathbf 1_B\circ\theta_n \mid \mathscr F_n) = \mathbb E_{X(n)}(\mathbf 1_B) := \mathbb P_{X(n)}(B) \qquad (17)$$

(for every time-step $n\in\mathbb Z_{\ge0}$ and Markov state $\mathsf W\in\mathscr S$), which is merely a reformulation of the Markov condition $\mathbb P_{\mathsf W}(X(n+1) = W_{i_{n+1}} \mid X(n) = W_{i_n}, X(n-1) = W_{i_{n-1}}, \ldots, X(0) = W_{i_0}) = \mathbb P_{W_{i_n}}(X(1) = W_{i_{n+1}}) = \mathbb P_{\mathsf W}(X(n+1) = W_{i_{n+1}} \mid X(n) = W_{i_n})$. In other words, for any $A\in\mathscr F_n$, $B\in\mathscr F$, $\mathsf W\in\mathscr S$, we have the following statement of the Markov property:

$$\mathbb E_{\mathsf W}(\mathbf 1_B\circ\theta_n; A) = \mathbb E_{\mathsf W}\bigl(\mathbb E_{X(n)}(\mathbf 1_B); A\bigr). \qquad (17')$$

In both (17) and (17′), one can replace the indicator function $\mathbf 1_B$ for the event $B\in\mathscr F$ by any random variable with finite expectation.

6. To reduce notational burden, we will not use superscripted asterisks to mark Markov matrices beyond this point.
B.2 Hitting time and return time on a Markov chain
We need to first precisely define the probability distributions for the hitting time and the return time (the latter also known as the "recurrence time") on our Markov chain $\mathscr M = (\mathscr S, \mathbf P)$.

For each state $W_i\in\mathscr S$ and trajectory $\omega = (X(t))_{t\in\mathbb Z_{\ge0}}\in\Omega$, we define

$$\tau_i(\omega) := \inf\{n\in\mathbb Z_{>0} : X(n) = W_i\}; \qquad (18)$$

then $\tau_i : \Omega\to\mathbb Z_{>0}\cup\{+\infty\}$ is a stopping time. Suppose that our pattern of interest corresponds to a subset of states $\mathscr W\subseteq\mathscr S$ on the web of words. We define another stopping time $\tau_{\mathscr W} : \Omega\to\mathbb Z_{>0}\cup\{+\infty\}$ as

$$\tau_{\mathscr W}(\omega) := \inf\{n\in\mathbb Z_{>0} : X(n)\in\mathscr W\} = \inf\{\tau_i(\omega) : W_i\in\mathscr W\}. \qquad (19)$$

Clearly, $\tau_{\mathscr W}(\omega)$ is equal to the first time when a forward stepwise search lands on the set of interest $\mathscr W$. Recalling the invariant measure $\boldsymbol\pi = (\pi_1, \ldots, \pi_N)$, we define the cumulative distribution function (CDF) for the hitting time to the set of patterns $\mathscr W$ as

$$H_{\mathscr W}(t) := \sum_{W_i\in\mathscr S}\mathbb P_{W_i}(\tau_{\mathscr W} < t)\,\pi_i, \qquad t\in\mathbb Z_{>0}. \qquad (20)$$

Similarly, the CDF for the return time to the set of patterns $\mathscr W$ is defined as

$$R_{\mathscr W}(t) := \frac{\sum_{W_i\in\mathscr W}\mathbb P_{W_i}(\tau_{\mathscr W} < t)\,\pi_i}{\sum_{W_j\in\mathscr W}\pi_j}, \qquad t\in\mathbb Z_{>0}. \qquad (21)$$
In the next theorem, we present an identity that connects hitting and return times of Markov states, which in turn is a discrete analog of the Haydn–Lacroix–Vaienti relation for continuous-time ergodic dynamical systems [8].
Theorem 2 (Relation between hitting time and return time distributions). For a stationary and ergodic Markov chain $\mathscr M = (\mathscr S, \mathbf P)$, and a subset of states $\mathscr W\subseteq\mathscr S$, we have the following identity regarding the probability distributions of hitting and return times to $\mathscr W$:

$$H_{\mathscr W}(1) = R_{\mathscr W}(1) = 0; \qquad \frac{H_{\mathscr W}(t+1)}{\sum_{W_j\in\mathscr W}\pi_j} = \sum_{n=1}^{t}\bigl[1 - R_{\mathscr W}(n)\bigr] \qquad (22)$$

for all $t\in\mathbb Z_{>0}$.
Proof: Clearly, our task is equivalent to the verification of the following formula:

$$\sum_{W_i\in\mathscr S}\mathbb P_{W_i}(\tau_{\mathscr W}\le t)\,\pi_i = \sum_{n=1}^{t}\sum_{W_j\in\mathscr W}\bigl[1 - \mathbb P_{W_j}(\tau_{\mathscr W} < n)\bigr]\pi_j \qquad (22')$$

for all $t\in\mathbb Z_{>0}$.

For t = 1, the left-hand side of (22′) can be computed as follows:

$$\sum_{W_i\in\mathscr S}\mathbb P_{W_i}(\tau_{\mathscr W} = 1)\,\pi_i = \sum_{W_i\in\mathscr S}\sum_{W_j\in\mathscr W}\mathbb P_{W_i}(X(1) = W_j)\,\pi_i = \sum_{W_i\in\mathscr S}\sum_{W_j\in\mathscr W}\pi_i p_{ij} = \sum_{W_j\in\mathscr W}\pi_j. \qquad (23)$$

This is equal to the right-hand side of (22′), because $\mathbb P_{W_j}(\tau_{\mathscr W} < 1) = 0$.

For $t\in\mathbb Z_{>0}$, we can compute

$$\sum_{W_i\in\mathscr S}\mathbb P_{W_i}(\tau_{\mathscr W}\le t+1)\,\pi_i = \sum_{W_i\in\mathscr S}\mathbb E_{W_i}\bigl((\mathbf 1_{(\tau_{\mathscr W}\le t)}\circ\theta_1)\,\mathbf 1_{(X(1)\notin\mathscr W)} + \mathbf 1_{(X(1)\in\mathscr W)}\bigr)\pi_i = \sum_{W_i\in\mathscr S}\bigl[\mathbb E_{W_i}\bigl(\mathbb E_{X(1)}(\mathbf 1_{(\tau_{\mathscr W}\le t)}); X(1)\notin\mathscr W\bigr) + \mathbb E_{W_i}(\mathbf 1_{(X(1)\in\mathscr W)})\bigr]\pi_i \qquad (24)$$

from the Markov property (17′). Here, we have $\sum_{W_i\in\mathscr S}\mathbb E_{W_i}(\mathbf 1_{(X(1)\in\mathscr W)})\pi_i = \sum_{W_i\in\mathscr S}\mathbb P_{W_i}(\tau_{\mathscr W} = 1)\pi_i = \sum_{W_j\in\mathscr W}\pi_j$ according to (23), and

$$\sum_{W_i\in\mathscr S}\mathbb E_{W_i}\bigl(\mathbb E_{X(1)}(\mathbf 1_{(\tau_{\mathscr W}\le t)}); X(1)\notin\mathscr W\bigr)\pi_i = \sum_{W_i\in\mathscr S}\sum_{W_k\in\mathscr S\smallsetminus\mathscr W}\mathbb P_{W_i}(X(1) = W_k)\,\mathbb E_{W_k}(\mathbf 1_{(\tau_{\mathscr W}\le t)})\,\pi_i = \sum_{W_k\in\mathscr S\smallsetminus\mathscr W}\mathbb E_{W_k}(\mathbf 1_{(\tau_{\mathscr W}\le t)})\,\pi_k = \sum_{W_i\in\mathscr S}\mathbb P_{W_i}(\tau_{\mathscr W}\le t)\,\pi_i - \sum_{W_j\in\mathscr W}\mathbb P_{W_j}(\tau_{\mathscr W}\le t)\,\pi_j. \qquad (25)$$

Therefore, the recursion

$$\sum_{W_i\in\mathscr S}\mathbb P_{W_i}(\tau_{\mathscr W}\le t+1)\,\pi_i = \sum_{W_i\in\mathscr S}\mathbb P_{W_i}(\tau_{\mathscr W}\le t)\,\pi_i + \sum_{W_j\in\mathscr W}\bigl[1 - \mathbb P_{W_j}(\tau_{\mathscr W}\le t)\bigr]\pi_j \qquad (26)$$

for all $t\in\mathbb Z_{>0}$ allows us to build (22′) inductively on the t = 1 case (23).
B.3 Positive definite return time distributions on a Markov chain with detailed balance
Thanks to Theorem 2, our analysis of a return time distribution can be built on the analysis of the corresponding hitting time, the latter of which is usually easier to compute (in theoretical analysis).

For a stationary and ergodic Markov chain $(\mathscr S, \mathbf P = (p_{ij})_{1\le i,j\le N})$, and a pattern of interest $\mathscr W\subsetneq\mathscr S$, one can use the Markov property to show that the hitting time distribution satisfies

$$\sum_{W_i\in\mathscr S}\mathbb P_{W_i}(\tau_{\mathscr W} > 1)\,\pi_i = \sum_{W_i\in\mathscr S}\mathbb P_{W_i}(X(1)\notin\mathscr W)\,\pi_i = \sum_{W_i\in\mathscr S}\sum_{W_m\in\mathscr S\smallsetminus\mathscr W}\pi_i p_{im} = \sum_{W_m\in\mathscr S\smallsetminus\mathscr W}\pi_m \qquad (27)$$

and

$$\sum_{W_i\in\mathscr S}\mathbb P_{W_i}(\tau_{\mathscr W} > t)\,\pi_i = \sum_{W_i\in\mathscr S}\mathbb P_{W_i}\bigl(X(n)\notin\mathscr W,\ \forall n\in\mathbb Z\cap[1,t]\bigr)\,\pi_i = \sum_{W_i\in\mathscr S}\ \sum_{W_{m_n}\in\mathscr S\smallsetminus\mathscr W,\ n\in\mathbb Z\cap[1,t]} \pi_i\, p_{i m_1}\prod_{n\in\mathbb Z\cap[1,t-1]} p_{m_n m_{n+1}} = \sum_{W_{m_n}\in\mathscr S\smallsetminus\mathscr W,\ n\in\mathbb Z\cap[1,t]} \pi_{m_1}\prod_{n\in\mathbb Z\cap[1,t-1]} p_{m_n m_{n+1}}, \qquad (28)$$

for all $t\in\mathbb Z_{>1}$.
Consider the abelian semigroup $\Sigma = \mathbb Z_{\ge0}$. A semicharacter [34, p. 92, Definition 2.1] $\rho : \mathbb Z_{\ge0}\to\mathbb C$ must assume the following form: $\rho(0) = 1$; $\rho(s) = [\rho(1)]^s$ for $s\in\mathbb Z_{>0}$. Since the spectral radius of our recurrence matrix is strictly less than 1, we will only concern ourselves with $\rho_0(s) := \mathbf 1_{\{0\}}(s)$ and $\rho_\lambda(s) := \lambda^s$, $s\in\mathbb Z_{>0}$, for $\lambda\in(-1,0)\cup(0,1)$.

According to the harmonic analysis on semigroups, a bounded function $f : \mathbb Z_{\ge0}\to\mathbb C$ can be represented as a weighted mixture of semicharacters $f(s) = \int_{-1<\lambda<1}\rho_\lambda(s)\,\mathrm d\mu(\lambda)$ if and only if it is positive definite [34, p. 93, Theorem 2.5]: the inequality

$$\sum_{r=1}^{m}\sum_{s=1}^{m} c_r\,\overline{c_s}\, f(t_r + t_s) \ge 0 \qquad (29)$$

holds for arbitrary positive integers $m\in\mathbb Z_{>0}$, and arbitrarily chosen m-dimensional "vectors" $(t_1, \ldots, t_m)\in\Sigma^m$, $(c_1, \ldots, c_m)\in\mathbb C^m$.

In the following theorem, we check the positive definiteness criterion (29) for the hitting time distribution $1 - H_{\mathscr W}(s+2)$, $s\in\mathbb Z_{\ge0}$, on a Markov chain satisfying the detailed balance condition. This would imply that the return time distribution $1 - R_{\mathscr W}(s+2)$, $s\in\mathbb Z_{\ge0}$, is also a weighted mixture of semicharacters $\rho_\lambda(s)$ for $-1<\lambda<1$, according to the finite difference relation (22) in Theorem 2.
Theorem 3 (Positive definite return time distributions). Suppose that a non-void pattern $\mathscr W\subsetneq\mathscr S$ on a stationary and ergodic Markov chain $\mathscr M = (\Omega, \mathscr F, \{\mathbb P_{\mathsf W} : \mathsf W\in\mathscr S\}, \{X(t) : t\in\mathbb Z_{\ge0}\}, \{\theta_n : n\in\mathbb Z_{\ge0}\})$ satisfies $\pi_i p_{ij} = \pi_j p_{ji}$ for all $W_i, W_j\in\mathscr S\smallsetminus\mathscr W$. Then the cumulative distribution for the return time to $\mathscr W$ has the following integral representation

$$\frac{1 - R_{\mathscr W}(s+2)}{1 - R_{\mathscr W}(2)} = \frac{\sum_{W_i\in\mathscr W}\mathbb P_{W_i}(\tau_{\mathscr W} > s+1)\,\pi_i}{\sum_{W_i\in\mathscr W}\mathbb P_{W_i}(\tau_{\mathscr W} > 1)\,\pi_i} = \int\rho_\lambda(s)\,\mathrm d P_{\mathscr W}(\lambda), \qquad s\in\mathbb Z_{\ge0}, \qquad (30)$$

for a certain probability measure $P_{\mathscr W}(\lambda)$ supported on $\lambda\in(-1,1)$.
Proof: By (28), we have

$$\sum_{W_i\in\mathscr S}\mathbb P_{W_i}(\tau_{\mathscr W} > r+s+1)\,\pi_i = \sum_{W_{m_n}\in\mathscr S\smallsetminus\mathscr W,\ n\in\mathbb Z\cap[1,r+s+1]} \pi_{m_1}\prod_{n\in\mathbb Z\cap[1,r+s]} p_{m_n m_{n+1}} = \sum_{W_{m_n}\in\mathscr S\smallsetminus\mathscr W,\ n\in\mathbb Z\cap[1,r+s+1]} \pi_{m_1}\times\prod_{n\in\mathbb Z\cap[1,r]} p_{m_n m_{n+1}}\prod_{n\in\mathbb Z\cap[r+1,r+s]} p_{m_n m_{n+1}}, \qquad (31)$$

for $r,s\in\mathbb Z_{>0}$. We then apply the identity $\pi_i p_{ij} = \pi_j p_{ji}$, $\forall W_i, W_j\in\mathscr S\smallsetminus\mathscr W$, to the product over $n\in\mathbb Z\cap[1,r]$, which brings us

$$\sum_{W_i\in\mathscr S}\mathbb P_{W_i}(\tau_{\mathscr W} > r+s+1)\,\pi_i = \sum_{W_{m_n}\in\mathscr S\smallsetminus\mathscr W,\ n\in\mathbb Z\cap[1,r+s+1]} \pi_{m_1}\times\prod_{n\in\mathbb Z\cap[1,r]}\frac{\pi_{m_{n+1}}}{\pi_{m_n}}\, p_{m_{n+1} m_n}\prod_{n\in\mathbb Z\cap[r+1,r+s]} p_{m_n m_{n+1}} = \sum_{W_{m_{r+1}}\in\mathscr S\smallsetminus\mathscr W}\mathbb P_{W_{m_{r+1}}}(\tau_{\mathscr W} > r)\,\mathbb P_{W_{m_{r+1}}}(\tau_{\mathscr W} > s)\,\pi_{m_{r+1}} \qquad (32)$$

for $r,s\in\mathbb Z_{>0}$. In other words, we have verified

$$\sum_{W_i\in\mathscr S}\mathbb P_{W_i}(\tau_{\mathscr W} > r+s+1)\,\pi_i = \sum_{W_i\in\mathscr S\smallsetminus\mathscr W}\mathbb E_{W_i}(\mathbf 1_{(\tau_{\mathscr W}>r)})\,\mathbb E_{W_i}(\mathbf 1_{(\tau_{\mathscr W}>s)})\,\pi_i \qquad (33)$$

for $r,s\in\mathbb Z_{>0}$. Using the Markov property, one can readily generalize the identity above to $r,s\in\mathbb Z_{\ge0}$. Thus, we have

$$\sum_{r=1}^{m}\sum_{s=1}^{m}\sum_{W_i\in\mathscr S} c_r\,\overline{c_s}\,\mathbb P_{W_i}(\tau_{\mathscr W} > t_r + t_s + 1)\,\pi_i = \sum_{W_i\in\mathscr S\smallsetminus\mathscr W}\left|\mathbb E_{W_i}\!\left(\sum_{r=1}^{m} c_r\,\mathbf 1_{(\tau_{\mathscr W}>t_r)}\right)\right|^2\pi_i \ge 0 \qquad (34)$$

for $t_r, t_s\in\mathbb Z_{\ge0}$. By the positive definiteness criterion in (29), we see that $1 - H_{\mathscr W}(s+2)$, $s\in\mathbb Z_{\ge0}$, is a weighted mixture of semicharacters. In view of the recursion identity relating hitting and return time distributions (26), we have confirmed the statement in (30).
With bin sizes far larger than 2 time units on the Markov chain, our Lii is nearly a continuous random variable, and all the period-2 oscillatory decays $\rho_\lambda(s)$ for $-1<\lambda<0$ behave just like noise terms in the histogram. Then Theorem 3 tells us that the probability distribution for Lii can be approximated by a weighted mixture7 of exponential decays $e^{-kL_{ii}}$, in the continuum limit.

7. We note that there exist exceptions to stationarity and ergodicity in realistic documents. A novel may involve the birth and/or death of leading/supporting characters. An academic treatise may place particular emphasis on certain concepts in specific chapters. In such scenarios, the overall recurrence kinetics may still follow the generic law given in (2), as a consequence of both detailed balance (in locally applied Markov models) and a heterogeneous mixture of several stationary and ergodic Markov models that patch together as a whole.
B.4 A statistical criterion for numerical independence between different word patterns
If we define L_ij as the effective length of a text fragment free from word pattern W_j that is simultaneously flanked by W_i to the left and by W_j to the right, then ∑_{W_i∈S} π_i P(L_ij > t) ∼ ∑_{W_i∈S} π_i P_{W_i}(τ_j > t) represents the hitting time distribution to the Markov state W_j. According to Theorem 2, the hitting time distribution can be obtained by integrating the return time distribution P(L_jj > t) ∼ P_{W_j}(τ_j > t) ∼ ∑_n c_n e^{−k_n t}. Here, the positive weights c_n > 0 are normalized to one: ∑_n c_n = 1. In other words, for fixed W_j, we can predict some statistical properties of L_ij by analyzing the recurrence statistic L_jj, as described in the theorem below.
Theorem 4 (Mean and variance for the logarithm of hitting time). Using the continuous time approximation, we have the following relations for a fixed W_j:
\[
E\log L_{ij}=\frac{\langle L_{jj}\log L_{jj}\rangle}{\langle L_{jj}\rangle}-1, \tag{35}
\]
\[
E\bigl(\log L_{ij}-E\log L_{ij}\bigr)^2=\frac{\bigl\langle L_{jj}(\ell_*-\log L_{jj})^2\bigr\rangle}{\langle L_{jj}\rangle}, \tag{36}
\]
where ℓ_* equals the right-hand side of (35), and the expectation E denotes an average over all the states W_i, weighted by π_i.
Proof: In the continuous time approximation, we can assume that the probability density function for L_jj is ∑_n c_n k_n e^{−k_n t} (where c_n, k_n > 0), so that the probability density function for L_ij (weighted average over all the states W_i) is
\[
\frac{\sum_n c_n e^{-k_n t}}{\sum_n c_n/k_n}=\frac{\sum_n c_n e^{-k_n t}}{\langle L_{jj}\rangle}. \tag{37}
\]
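For the reader's convenience, the evaluations (38)–(41) below rely on the following elementary integrals, obtained by differentiating ∫_0^∞ t^{a−1} e^{−kt} dt = Γ(a)/k^a with respect to a at a = 1 and a = 2 (γ_0 denotes the Euler–Mascheroni constant):
\[
\begin{aligned}
\int_0^\infty e^{-kt}\log t\,\mathrm{d}t &= -\frac{\gamma_0+\log k}{k}, &
\int_0^\infty e^{-kt}\log^2 t\,\mathrm{d}t &= \frac{(\gamma_0+\log k)^2+\pi^2/6}{k},\\
\int_0^\infty k\,t\,e^{-kt}\log t\,\mathrm{d}t &= \frac{1-\gamma_0-\log k}{k}, &
\int_0^\infty k\,t\,e^{-kt}\log^2 t\,\mathrm{d}t &= \frac{(1-\gamma_0-\log k)^2+\pi^2/6-1}{k}.
\end{aligned}
\]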
In view of (37), we can compute the left-hand side of (35) as
\[
\int_0^\infty\sum_n c_n e^{-k_n t}\log t\,\frac{\mathrm{d}t}{\langle L_{jj}\rangle}=-\sum_n\frac{c_n(\gamma_0+\log k_n)}{k_n\langle L_{jj}\rangle}. \tag{38}
\]
Meanwhile, we may evaluate the right-hand side of (35) as follows:
\[
\int_0^\infty\sum_n c_n k_n e^{-k_n t}\,t\log t\,\frac{\mathrm{d}t}{\langle L_{jj}\rangle}-1=-\sum_n\frac{c_n(\gamma_0-1+\log k_n)}{k_n\langle L_{jj}\rangle}-1. \tag{39}
\]
Since ∑_n c_n/k_n = ⟨L_jj⟩, the right-hand sides of (38) and (39) coincide, and (35) is confirmed.
In a similar vein, we can argue that the left-hand side of (36) equals
\[
\int_0^\infty\sum_n c_n e^{-k_n t}\log^2 t\,\frac{\mathrm{d}t}{\langle L_{jj}\rangle}-\ell_*^2=\sum_n\frac{c_n\bigl[6\log k_n(2\gamma_0+\log k_n)+6\gamma_0^2+\pi^2\bigr]}{6k_n\langle L_{jj}\rangle}-\ell_*^2, \tag{40}
\]
while the right-hand side of (36) amounts to
\[
\begin{aligned}
\int_0^\infty\sum_n c_n k_n e^{-k_n t}\,t\,(\ell_*-\log t)^2\,\frac{\mathrm{d}t}{\langle L_{jj}\rangle}
&=\int_0^\infty\sum_n c_n k_n e^{-k_n t}\,t\log^2 t\,\frac{\mathrm{d}t}{\langle L_{jj}\rangle}-2\ell_*-\ell_*^2\\
&=\sum_n\frac{c_n\bigl[6\log k_n(2\gamma_0-2+\log k_n)+6\gamma_0(\gamma_0-2)+\pi^2\bigr]}{6k_n\langle L_{jj}\rangle}+\sum_n\frac{2c_n(\gamma_0+\log k_n)}{k_n\langle L_{jj}\rangle}-\ell_*^2. \tag{41}
\end{aligned}
\]
Expanding the numerators shows that (40) and (41) agree, which completes the proof of (36).
The detailed balance condition π_i p_ij = π_j p_ji means that a Markov chain is reversible, and the reverse chain has the same transition matrix P [35, §4.7, p. 203].
By the reversibility of our Markov chain, we may extend the results of the theorem above to a dual situation, namely, the statistical properties of L_ij, the effective length of a text fragment free from word W_i that is simultaneously flanked by W_i to the left and by W_j to the right (as considered in Fig. 1).
Reading (35)–(36) with these dual quantities (that is, replacing L_jj by L_ii and interpreting L_ij in the sense of Fig. 1), we recover (8): for a given i, we have used (35) as an estimate for ⟨log L_ij⟩, and 1/n_ij times (36) as an estimate for the variance of ⟨log L_ij⟩, the average being taken over n_ij independent samples.
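To illustrate how (35)–(36) operate as estimators (a minimal simulation sketch of our own, assuming the continuous-time mixture model used in the proof; the mixture weights, rates, and sample size are invented for the example), one can draw L_jj from a mixture of exponentials, draw the dual lengths from the corresponding density (37), and compare the plug-in predictions with direct sample statistics:

```python
import numpy as np

# Sanity check of (35)-(36): sample L_jj from a mixture of exponentials,
# sample L_ij from the density (37), and compare the plug-in predictions
# with direct sample moments of log L_ij. Parameters are illustrative.
rng = np.random.default_rng(42)
c = np.array([0.6, 0.3, 0.1])          # mixture weights, summing to one
k = np.array([0.02, 0.2, 1.0])         # decay rates
N = 1_000_000

comp = rng.choice(len(c), size=N, p=c)
L_jj = rng.exponential(1.0 / k[comp])  # density sum_n c_n k_n exp(-k_n t)

# Density (37) for L_ij is sum_n c_n exp(-k_n t) / <L_jj>: each component
# is drawn with probability proportional to c_n / k_n.
p_ij = (c / k) / (c / k).sum()
comp_ij = rng.choice(len(c), size=N, p=p_ij)
L_ij = rng.exponential(1.0 / k[comp_ij])

ell_star = np.mean(L_jj * np.log(L_jj)) / np.mean(L_jj) - 1                 # RHS of (35)
var_pred = np.mean(L_jj * (ell_star - np.log(L_jj)) ** 2) / np.mean(L_jj)   # RHS of (36)

print("E log L_ij:   predicted %.4f, sampled %.4f" % (ell_star, np.log(L_ij).mean()))
print("Var log L_ij: predicted %.4f, sampled %.4f" % (var_pred, np.log(L_ij).var()))
```

Under this mixture model, the predicted and sampled values should agree up to Monte Carlo error.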
ACKNOWLEDGMENTS
We thank N. Chomsky and S. Pinker for their input on several problems of linguistics. We thank X. Sun for discussions on neural networks. We thank X. Wan, R. Yan and D. Zhao for their suggestions on experimental design during the early stages of this work. We thank two anonymous reviewers, whose thoughtful comments helped us improve the presentation of this work.
REFERENCES
[1] F. de Saussure, Cours de linguistique générale, 5th ed. Paris, France: Payot, 1949.
[2] S. Pinker and A. Prince, “On language and connectionism: Analysis of a parallel distributed processing model of language acquisition,” Cognition, vol. 28, no. 1-2, pp. 73–193, 1988.
[3] N. Chomsky, Syntactic Structures, 2nd ed. Berlin, Germany: Mouton de Gruyter, 2002.
[4] A. D. Friederici, “The neurobiology of language comprehension,” in Language Comprehension: A Biological Perspective, A. D. Friederici, Ed. Berlin, Germany: Springer, 1999, ch. 9, pp. 265–304.
[5] J. P. Herrera and P. A. Pury, “Statistical keyword detection in literary corpora,” Eur. Phys. J. B, vol. 63, pp. 135–146, 2008.
[6] M. Ružička, “Anwendung mathematisch-statistischer Methoden in der Geobotanik (Synthetische Bearbeitung von Aufnahmen),” Biológia (Bratislava), vol. 13, pp. 647–661, 1958.
[7] Y. Zhou and X. Zhuang, “Robust reconstruction of the rate constant distribution using the phase function method,” Biophys. J., vol. 91, no. 11, pp. 4045–4053, 2006.
[8] N. Haydn, Y. Lacroix, and S. Vaienti, “Hitting and return times in ergodic dynamical systems,” Ann. Probab., vol. 33, pp. 2043–2050, 2005.
[9] T. M. Cover and J. A. Thomas, Elements of Information Theory, 2nd ed. Hoboken, NJ: Wiley-Interscience, 2006.
[10] M. Pollicott and M. Yuri, Dynamical Systems and Ergodic Theory, ser. London Mathematical Society Student Texts. Cambridge, UK: Cambridge University Press, 1998, vol. 40.
[11] A. Joulin, P. Bojanowski, T. Mikolov, H. Jégou, and E. Grave, “Loss in translation: Learning bilingual word mapping with a retrieval criterion,” in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Brussels, Belgium: Association for Computational Linguistics, Oct.-Nov. 2018, pp. 2979–2984.
[12] X. Chen and C. Cardie, “Unsupervised multilingual word embeddings,” in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Brussels, Belgium: Association for Computational Linguistics, Oct.-Nov. 2018, pp. 261–270.
[13] H. W. Kuhn, “The Hungarian Method for the assignment problem,” Nav. Res. Logist. Q., vol. 2, pp. 83–97, 1955.
[14] ——, “A tale of three eras: The discovery and rediscovery of the Hungarian Method,” Eur. J. Oper. Res., vol. 219, pp. 641–651, 2012.
[15] S. Brin and L. Page, “The anatomy of a large-scale hypertextual web search engine,” Comput. Networks ISDN, vol. 30, no. 1-7, pp. 107–117, 1998.
[16] L. Page, S. Brin, R. Motwani, and T. Winograd, “The PageRank citation ranking: Bringing order to the web,” Stanford InfoLab, Tech. Rep., 1999, http://ilpubs.stanford.edu:8090/422/.
[17] S. A. Sloman, “The empirical case for two systems of reasoning,” Psychol. Bull., vol. 119, no. 1, pp. 3–22, 1996.
[18] Y. Yang, W.-t. Yih, and C. Meek, “WikiQA: A challenge dataset for open-domain question answering,” in Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP). Lisbon, Portugal: Association for Computational Linguistics, 2015.
[19] C. Fellbaum, Ed., WordNet: An Electronic Lexical Database (Language, Speech, and Communication). Cambridge, MA: MIT Press, 1998.
[20] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, “Distributed representations of words and phrases and their compositionality,” in Advances in Neural Information Processing Systems 26. La Jolla, CA: NIPS, 2013, pp. 3111–3119.
[21] D. Mumford and A. Desolneux, Pattern Theory: The Stochastic Analysis of Real-World Signals. Natick, MA: A K Peters, 2010.
[22] V. Tshitoyan, J. Dagdelen, L. Weston, A. Dunn, Z. Rong, O. Kononova, K. A. Persson, G. Ceder, and A. Jain, “Unsupervised word embeddings capture latent knowledge from materials science literature,” Nature, vol. 571, no. 7763, pp. 95–98, 2019.
[23] S. Arora, Y. Li, Y. Liang, T. Ma, and A. Risteski, “A latent variable model approach to PMI-based word embeddings,” Transactions of the Association for Computational Linguistics, vol. 4, pp. 385–399, 2016.
[24] M. A. Nowak and D. C. Krakauer, “The evolution of language,” Proc. Natl. Acad. Sci. USA, vol. 96, no. 14, pp. 8028–8033, 1999.
[25] C. Everett, D. E. Blasi, and S. G. Roberts, “Climate, vocal folds, and tonal languages: Connecting the physiological and geographic dots,” Proc. Natl. Acad. Sci. USA, vol. 112, pp. 1322–1327, 2015.
[26] C. Everett, “Languages in drier climates use fewer vowels,” Frontiers in Psychology, vol. 8, Article 1285, 2017.
[27] S. Pinker, “Words and rules in the human brain,” Nature, vol. 387, no. 6633, pp. 547–548, 1997.
[28] W. D. Marslen-Wilson and L. K. Tyler, “Dissociating types of mental computation,” Nature, vol. 387, no. 6633, pp. 592–594, 1997.
[29] E. Lieberman, J.-B. Michel, J. Jackson, T. Tang, and M. A. Nowak, “Quantifying the evolutionary dynamics of language,” Nature, vol. 449, no. 7163, pp. 713–716, 2007.
[30] M. G. Newberry, C. A. Ahern, R. Clark, and J. B. Plotkin, “Detecting evolutionary forces in language change,” Nature, vol. 551, no. 7679, pp. 223–226, 2017.
[31] S. Pinker, “Survival of the clearest,” Nature, vol. 404, no. 6777, pp. 441–442, 2000.
[32] M. A. Nowak, J. B. Plotkin, and V. A. A. Jansen, “The evolution of syntactic communication,” Nature, vol. 404, no. 6777, pp. 495–498, 2000.
[33] M. Dunn, S. J. Greenhill, S. C. Levinson, and R. D. Gray, “Evolved structure of language shows lineage-specific trends in word-order universals,” Nature, vol. 473, no. 7345, pp. 79–82, 2011.
[34] C. Berg, J. P. R. Christensen, and P. Ressel, Harmonic Analysis on Semigroups: Theory of Positive Definite and Related Functions, ser. Graduate Texts in Mathematics. New York, NY: Springer-Verlag, 1984, vol. 100.
[35] S. M. Ross, Stochastic Processes, 2nd ed. Hoboken, NJ: John Wiley & Sons, 1995.
Weinan E received his B.S. degree in mathematics from the University of Science and Technology of China in 1982, and his Ph.D. degree in mathematics from the University of California, Los Angeles in 1989. He has been a full professor at the Department of Mathematics and the Program in Applied and Computational Mathematics, Princeton University, since 1999. Prof. E has made significant contributions to numerical analysis, fluid mechanics, partial differential equations, multiscale modeling, and stochastic processes.
His monograph Principles of Multiscale Modeling (Cambridge University Press, 2011) is a standard reference for mathematical modeling of physical systems on multiple length scales. He was awarded the Peter Henrici Prize in 2019 for his recent contributions to machine learning.
Yajun Zhou received his B.S. degree in Physics from Fudan University (Shanghai, China) in 2004, and his Ph.D. degree in Chemistry from Harvard University in 2010. Having undertaken postdoctoral training in applied and computational mathematics at Princeton University, he now works in the Laboratory for Natural Language Processing & Cognitive Intelligence at Beijing Institute of Big Data Research. Dr. Zhou has published in numerical analysis, special functions, number theory, electromagnetic theory, quantum theory, probability theory and stochastic processes, solving several open problems in the related fields.
A book chapter in Elliptic Integrals, Elliptic Functions and Modular Forms in Quantum Field Theory (Springer, 2019) surveys his proofs of various mathematical conjectures. Dr. Zhou is a polyglot, with a working knowledge of 24 languages spanning 7 language families.