
Inverse statistical problems: from the inverse Ising problem to data science

H. Chau Nguyen, Max-Planck-Institut für Physik komplexer Systeme, Nöthnitzer Str. 38, D-01187 Dresden, Germany

Riccardo Zecchina, Bocconi University, via Roentgen 1, 20136 Milano, Italy and Politecnico di Torino, Corso Duca degli Abruzzi 24, 10129 Torino

Johannes Berg, Institute for Theoretical Physics, University of Cologne, Zülpicher Straße 77, 50937 Cologne, Germany

Inverse problems in statistical physics are motivated by the challenges of ‘big data’ in different fields, in particular high-throughput experiments in biology. In inverse problems, the usual procedure of statistical physics needs to be reversed: Instead of calculating observables on the basis of model parameters, we seek to infer parameters of a model based on observations. In this review, we focus on the inverse Ising problem and closely related problems, namely how to infer the coupling strengths between spins given observed spin correlations, magnetisations, or other data. We review applications of the inverse Ising problem, including the reconstruction of neural connections, protein structure determination, and the inference of gene regulatory networks. For the inverse Ising problem in equilibrium, a number of controlled and uncontrolled approximate solutions have been developed in the statistical mechanics community. A particularly strong method, pseudolikelihood, stems from statistics. We also review the inverse Ising problem in the non-equilibrium case, where the model parameters must be reconstructed based on non-equilibrium statistics.

CONTENTS

I. Introduction and applications 2

A. Modelling neural firing patterns and the reconstruction of neural connections 4

B. Reconstruction of gene regulatory networks 5

C. Protein structure determination 6

D. Fitness landscape inference 8

E. Combinatorial antibiotic treatment 9

F. Interactions between species and between individuals 9

G. Financial markets 10

II. Equilibrium reconstruction 10

1. Definition of the problem 10

A. Maximum likelihood 11

1. Exact maximization of the likelihood 12

2. Uniqueness of the solution 13

3. Maximum entropy modelling 13

4. Information theoretic bounds on graphical model reconstruction 15

5. Thermodynamics of the inverse Ising problem 15

6. Variational principles 16

7. Mean-field theory 17

8. The Onsager term and TAP reconstruction 18

Citation: Inverse statistical problems: from the inverse Ising problem to data science, H.C. Nguyen, R. Zecchina and J. Berg, Advances in Physics 66 (3), 197-261 (2017)

9. Couplings without a loop: mapping to the minimum spanning tree problem 18

10. The Bethe–Peierls ansatz 19

11. Belief propagation and susceptibility propagation 20

12. The independent-pair approximation and the Cocco–Monasson adaptive-cluster expansion 21

13. The Plefka expansion 22

14. The Sessak–Monasson small-correlation expansion 23

B. Logistic regression and pseudolikelihood 24

C. Comparison of the different approaches 26

D. Reconstruction and non-ergodicity 29

E. Parameter reconstruction and criticality 31

III. Non-equilibrium reconstruction 32

A. Dynamics of the Ising model 32

1. Sequential Glauber dynamics 32

2. Parallel Glauber dynamics 33

B. Reconstruction from time series data 33

1. Maximisation of the likelihood 33

2. Mean-field theory of the non-equilibrium steady state 34

3. The Gaussian approximation 35

4. Method comparison 35

C. Outlook: Reconstruction from the steady state 35

IV. Conclusions 36

Acknowledgements 38

References 39


I. INTRODUCTION AND APPLICATIONS

The primary goal of statistical physics is to derive observable quantities from microscopic laws governing the constituents of a system. In the example of the Ising model, the starting point is a model describing interactions between elementary magnets (spins); the goal is to derive observables such as spin magnetisations and correlations.

In an inverse problem, the starting point is observations of some system whose microscopic parameters are unknown and to be discovered. In the inverse Ising problem, the interactions between spins are not known to us, but we want to learn them from measurements of magnetisations, correlations, or other observables. In general, the goal is to infer the parameters describing a system (for instance, its Hamiltonian) from extant data. To this end, the relationship between microscopic laws and observables needs to be inverted.

In the last two decades, inverse statistical problems have arisen in different contexts, sparking interest in the statistical physics community in taking the path from model parameters to observables in reverse. The areas where inverse statistical problems have arisen are characterized by (i) microscopic scales becoming experimentally accessible and (ii) sufficient data storage capabilities being available. In particular, the biological sciences have generated several inverse statistical problems, including the reconstruction of neural and gene regulatory networks and the determination of the three-dimensional structure of proteins. Technological progress is likely to open up further fields of research to inverse statistical analysis, a development that is currently described by the label ‘big data’.

In physics, inverse statistical problems also arise when we need to design a many-body system with particular desired properties. Examples are finding the potentials that result in a particular single-particle distribution [48, 122], interaction parameters in a binary alloy that yield the observed correlations [142], the potentials between atoms that lead to specific crystal lattices [253], or the parameters of a Hamiltonian that lead to a particular density matrix [47]. In the context of soft matter, a question is how to design a many-body system that will self-assemble into a particular spatial configuration or has particular bulk properties [185, 227]. In biophysics, we may want to design a protein that folds into a specified three-dimensional shape [120]. For RNA, even molecules with more than one stable target structure are possible [78]. As a model of such design problems, [66, 136] study how to find the parameters of an Ising Hamiltonian with a prescribed ground state.

In all these examples, ‘spin’ variables describe microscopic degrees of freedom particular to a given system, for instance, the states of neurons in a neural network. The simplest description of these degrees of freedom in terms of random binary variables then leads to Ising-type spins. In the simplest non-trivial scenario, correlations between the ‘spins’ are generated by pairwise couplings between the spins, leading to an Ising model with unknown parameters (couplings between the spins and magnetic fields acting on the spins). In many cases of interest, the couplings between spins will not all be positive, as is the case in a model of a ferromagnet. Nor will the couplings conform to a regular lattice embedded in some finite-dimensional space.

For a concrete example, we look at a system of N binary variables (Ising spins) si, i = 1, . . . , N with si = ±1. These spins are coupled by pairwise couplings Jij and are subject to external magnetic fields hi.

P({si}) = (1/Z) exp( ∑_i hi si + ∑_{i<j} Jij si sj )    (1)

is the Boltzmann equilibrium distribution P({si}) = e^{−H({si})}/Z, where we have subsumed temperature into the couplings and fields. (We will discuss this choice in section II 1.) The Hamiltonian

H({si}) = − ∑_i hi si − ∑_{i<j} Jij si sj    (2)

specifies the energy of the spin system as a function of the microscopic spin variables, local fields, and pairwise couplings. The inverse Ising problem is the determination of the couplings Jij and local fields hi, given a set of M observed spin configurations. Depending on the particular nature of the system at hand, the restriction to binary variables or pairwise interactions may need to be lifted, or the functional form of the Hamiltonian may be altogether different from the Ising Hamiltonian with pairwise interactions (2). For non-equilibrium systems, the steady state is not even described by a Boltzmann distribution with a known Hamiltonian. However, the basic idea remains the same across different types of inverse statistical problems: even when the frequencies of spin configurations may be under-sampled, the data may be sufficient to infer at least some parameters of a model.
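To make the two directions of the problem concrete, the following minimal sketch (in Python, not part of the original review; system size, couplings and fields are arbitrary illustrative choices) solves the forward problem of (1)-(2) by exhaustive enumeration for a small system and then draws a synthetic data set of configurations, which is exactly the kind of input an inverse method would start from.

    # Minimal sketch: forward problem of Eqs. (1)-(2) by exhaustive enumeration.
    # Couplings J, fields h and the system size are illustrative choices.
    import itertools
    import numpy as np

    rng = np.random.default_rng(0)
    N = 5
    J = np.triu(rng.normal(0, 0.5, size=(N, N)), 1)   # only i<j entries, as in Eq. (2)
    h = rng.normal(0, 0.1, size=N)

    configs = np.array(list(itertools.product([-1, 1], repeat=N)))      # all 2^N states
    energies = -configs @ h - np.einsum('ki,ij,kj->k', configs, J, configs)
    p = np.exp(-energies)
    p /= p.sum()                                                        # Boltzmann distribution (1)

    m = p @ configs                                        # magnetisations <s_i>
    chi = np.einsum('k,ki,kj->ij', p, configs, configs)    # pair correlations <s_i s_j>
    D = configs[rng.choice(len(p), size=1000, p=p)]        # synthetic data set for the inverse problem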

The distribution (1) is well known not only as the equilibrium distribution of the Ising model. It is also the form of the distribution which maximizes the (Gibbs) entropy

S[P] = − ∑_{{si}} P({si}) ln P({si})    (3)

under the constraint that P({si}) is normalized and has particular first and second moments, that is, magnetisations and correlations. We will discuss in section II A 3 how this distribution emerges as the ‘least biased distribution’ of a binary random variable with prescribed first and second moments [107]. The practical problem is then again an inverse one: to find the couplings Jij and local fields hi such that the first and second moments observed under the Boltzmann distribution (1) match the mean values of si and sisj in the data. In settings where third moments differ significantly from the prediction of (1) based on the first two moments, one may need to construct the distribution of maximum entropy given the first three moments, leading to three-spin interactions in the exponent of (1).

Determining the parameters of a distribution such as (1) is always a many-body problem: changing a single coupling Jij generally affects correlations between many spin variables, and conversely a change in the correlation between two variables can change the values of many inferred couplings. The interplay between model parameters and observables is captured by a statistical mechanics of inverse problems, where the phase space consists of quantities normally considered as fixed model parameters (couplings, fields). The observables, such as spin correlations and magnetisations, on the other hand, are taken to be fixed, as they are specified by empirical observations. Such a perspective is not new to statistical physics; the analysis of neural networks in the seventies and eighties of the last century led to a statistical mechanics of learning [72, 95, 238], where the phase space is defined by the set of possible rules linking the input into a machine with an output. The set of all rules compatible with a given set of input/output relations then defines a statistical ensemble. In inverse statistical problems, however, there are generally no explicit rules linking input and output, but data with different types of correlations or other observations, which are to be accounted for in a statistical model.

Inverse statistical problems fall into the broader realm of statistical inference [31, 131], which seeks to determine the properties of a probability distribution underlying some observed data. The problem of inferring the parameters of a distribution such as (1) is known under different names in different communities; also emphasis and language differ subtly across communities.

• In statistics, an inverse problem is the inference of model parameters from data. In our case, the problem is the inference of the parameters of the Ising model from observed spin configurations. A particular subproblem is the inference of the graph formed by the non-zero couplings of the Ising model, termed graphical model selection or reconstruction. In the specific context of statistical models on graphs (graphical models), the term inference describes the calculation of the marginal distribution of one or several variables. (A marginal distribution describes the statistics of one or several particular variables in a many-variable distribution, for example, P(x1) = ∑_{x2,x3} P(x1, x2, x3).)

• In machine learning, a frequent task is to train an artificial neural network with symmetric couplings such that magnetisations and correlations of the artificial neurons match the corresponding values in the data. This is a special case of what is called Boltzmann machine learning; the general case also considers so-called hidden units, whose values are unobserved [1].

• In statistical physics, much effort has been directed towards estimating the parameters of the Ising model given observed values of the magnetisation and two-point correlations. As we will see in section II A, this is a hard problem from an algorithmic point of view. Recently, threshold phenomena arising in inference problems have attracted much interest from the statistical physics community, due to the link between phase transitions and the boundaries separating different regimes of inference problems, for instance solvable and unsolvable problems, or easy and hard ones [144, 248].

A common theme across different applications and communities is the inference of model parameters given observed data or desired properties. In this review we will focus on a prototype inverse statistical problem: the inverse Ising problem and its close relatives. Many of the approaches developed for this problem are also readily extended to more general scenarios. We will start with a discussion of applications of the inverse Ising problem and related approaches in biology, specifically the reconstruction of neural and genetic networks, the determination of three-dimensional protein structures, the inference of fitness landscapes, the bacterial responses to combinations of antibiotics, and flocking dynamics. We will find that these applications define two distinct settings of the inverse Ising problem: equilibrium and non-equilibrium. Part II of this review treats the inverse Ising problem in an equilibrium setting, where the couplings between spins are symmetric. Detailed balance holds and results from equilibrium statistical physics can be used. This setting arises naturally within the context of maximum entropy models, which seek to describe the observed statistics of configurations with a simplified effective model capturing, for instance, collective effects. We introduce the basics of statistical inference and maximum entropy modelling, discuss the thermodynamics of the inverse Ising problem, and review different approaches to solve the inverse Ising problem, pointing out their connections and comparing the resulting parameter reconstructions. Part III of this review considers asymmetric coupling matrices, where in the absence of detailed balance couplings can be reconstructed from time series, from data on perturbations of the system, or from detailed knowledge of the non-equilibrium steady state.

We now turn to applications of the inverse Ising problem, which mostly lie in the analysis of high-throughput data from biology. One aim of inverse statistical modelling is to find the parameters of a microscopic model to describe this data. A more ambitious aim is achieved when the parameters of the model are informative about the processes which produced the data, that is, when some of the mechanisms underlying the data can be inferred. The data is large-scale measurements of the degrees of freedom of some system. In the language of statistical physics these describe the micro-states of a system: states of neurons, particular sequences of DNA or proteins, or the concentration levels of RNA. We briefly introduce some of the experimental background of these measurements, so their potential and limitations can be appreciated. The models are simple models of the microscopic degrees of freedom. In the spirit of statistical physics, these models are simple enough so the parameters can be computed given the data, yet sufficiently complex to reproduce some of the statistical interdependences of the observed microscopic degrees of freedom. The simplest case, consisting of binary degrees of freedom with unknown pairwise couplings between them, leads to the inverse Ising problem, although we will also discuss several extensions.

A. Modelling neural firing patterns and the reconstruction of neural connections

Neurons can exchange information by generating discrete electrical pulses, termed spikes, that travel down nerve fibres. Neurons can emit these spikes at different rates: a neuron emitting spikes at a high rate is said to be ‘active’ or ‘firing’, a neuron emitting spikes at a low rate or not at all is said to be ‘inactive’ or ‘silent’. The measurement of the activity of single neurons has a long history starting in 1953 with the development of micro-electrodes for recording [68]. Multi-electrodes were developed, allowing multiple neural signals to be recorded simultaneously and independently over long time periods [163, 209]. Such data presents the intriguing possibility of seeing elementary brain function emerge from the interplay of a large number of neurons.

However, even when vast quantities of data are available, the different configurations of a system are still under-sampled in most cases. For instance, consider N neurons, each of which can be either active (firing) or inactive (silent). Given that the firing patterns of thousands of neurons can be recorded simultaneously [198], the number of observations M will generally be far less than the total number of possible neural configurations, M ≪ 2^N. For this reason, a straightforward statistical description that seeks to determine directly the frequency with which each configuration appears will likely fail.

On the other hand, a feasible starting point is a simple distribution, whose parameters can be determined from the data. For a set of N binary variables, this might be a distribution with pairwise interactions between the variables. In [196], Bialek and collaborators applied such a statistical model to neural recordings. Dividing time into small intervals of duration ∆τ = 20 ms induces a binary representation of neural data, where each neuron i either spikes during a given interval (si = 1) or it does not (si = 0). The joint statistics observed in 40 neurons in the retina of a salamander was modelled by an Ising model (1) with magnetic fields and pairwise symmetric couplings. Rather than describing the dynamics of neural spikes, this model describes the correlated firing of different neurons over the course of the experiment. The symmetric couplings Jij in (1) describe statistical dependencies, not physical connections. The synaptic connections between neurons, on the other hand, are generally not symmetric.
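As a concrete illustration of this binarization step, the sketch below (hypothetical input data; the 0/1 convention and the 20 ms bin width follow the description above) converts spike times into binary configurations from which the one- and two-point statistics entering the pairwise model can be computed.

    # Illustrative sketch (hypothetical data): binning spike trains into the binary
    # representation used in [196]; s_i = 1 if neuron i spikes in a bin, 0 otherwise.
    # spike_times[i] holds the spike times (seconds) of neuron i, T the recording length.
    import numpy as np

    def binarize(spike_times, T, dt=0.02):
        n_bins = int(T / dt)
        s = np.zeros((n_bins, len(spike_times)), dtype=int)
        for i, times in enumerate(spike_times):
            idx = (np.asarray(times) / dt).astype(int)
            s[idx[idx < n_bins], i] = 1
        return s

    # one- and two-point statistics entering the pairwise maximum entropy model:
    # m = s.mean(axis=0); chi = (s.T @ s) / s.shape[0]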

In this context, the distribution (1) can be viewed as the form of the maximum entropy distribution over neural states, given the observed one- and two-point correlations [196]. In [52, 204], a good match was found between the statistics of three neurons predicted by (1) and the firing patterns of the same neurons in the data. This means that the model with pairwise couplings provides a statistical description of the empirical data, one that can even be used to make predictions. Similar results were obtained also from cortical cells in cell cultures [217].

The mapping from the neural data to a spin model rests on dividing time into discrete bins of duration ∆τ. A different choice of this interval would lead to different spin configurations; in particular, changing ∆τ affects the magnetisation of all spins by altering the number of intervals in which a neuron fires. In [191], Roudi, Nirenberg and Latham show that the pairwise model (1) provides a good description of the underlying spin statistics (generated by neural spike trains), provided N∆τν ≪ 1, where ν is the average firing rate of neurons. Increasing the bin size beyond this regime leads to an increase in bins where multiple neurons fire; as a result, couplings beyond the pairwise couplings in (1) can become important.

As a minimal model of neural correlations, the statistics (1) has been extended in several ways. Tkacik et al. [225] and Granot-Atedgi et al. [88] consider stimulus-dependent magnetic fields, that is, fields which depend on the stimulus presented to the experimental subject at a particular time of the experiment. Ohiorhenuan et al. look at stimulus-dependent couplings [164]. When the number of neurons increases to ∼ 100, limitations of the pairwise model (1) become apparent, which has been addressed by adding additional terms coupling triplets, etc. of spins in the exponent of the Boltzmann measure (1) [82].

The statistics (1) serves as a description of the empirical data: the couplings between spins in the Hamiltonian (2) do not describe physical connections between the neurons. The determination of the network of neural connections from the observed neural activities is thus a different question. Simoncelli and collaborators [169, 179] and Cocco, Leibler, and Monasson [52] use an integrate-and-fire model [40] to infer how the neurons are interconnected on the basis of time series of spikes in all neurons. In such a model, the membrane potential of neuron i obeys the dynamics

C dVi/dt = ∑_{j≠i} Jij ∑_l K(t − t_j^l) + Ii − g Vi + ξi(t) ,    (4)

where the first term on the right-hand side encodes the synaptic connections Jij and a memory kernel K; t_j^l specifies the time at which neuron j emitted its lth spike. The remaining terms describe a background current, voltage leakage, and white noise. Finding the synaptic connections Jij that best describe a large set of neural spike trains is a computational challenge; [52, 168] develop an approximation based on maximum likelihood, see section II A. A related approach based on point processes and generalized linear models (GLM) is presented in [229]. We will discuss this problem of inferring the network of connections in the context of the non-equilibrium models in section III.
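The following sketch simulates the integrate-and-fire dynamics (4) forward in time with a simple Euler scheme; the exponential synaptic kernel, the threshold-and-reset rule and all parameter values are illustrative assumptions rather than the choices made in [52, 169, 179]. In the inverse direction only the resulting spike times would be available, and the synaptic couplings Jij would have to be inferred from them by maximum likelihood.

    # Minimal sketch of the dynamics (4), forward-simulated with an Euler scheme.
    # Kernel shape, threshold/reset rule and parameters are illustrative assumptions.
    import numpy as np

    def simulate(J, I, T=1.0, dt=1e-3, C=1.0, g=1.0, tau_s=0.01, noise=0.1,
                 v_thresh=1.0, v_reset=0.0, seed=0):
        rng = np.random.default_rng(seed)
        N = len(I)
        V = np.zeros(N)
        syn = np.zeros(N)          # synaptic drive sum_j J_ij sum_l K(t - t_j^l)
        spikes = []
        for step in range(int(T / dt)):
            xi = noise * rng.normal(size=N) / np.sqrt(dt)   # white noise term
            V += dt / C * (syn + I - g * V + xi)
            syn -= dt / tau_s * syn                         # exponential kernel decay
            fired = V >= v_thresh
            if fired.any():
                spikes.append((step * dt, np.where(fired)[0]))
                syn += J[:, fired].sum(axis=1)              # each spike of j adds J_ij
                V[fired] = v_reset
        return spikes    # the inverse problem starts from these spike times alone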

Neural recordings give the firing patterns of several neurons over time. These neurons may have connections between them, but they also receive signals from neural cells whose activity is not recorded [133]. In [230], the effect of connections between neurons is disentangled from correlations arising from shared non-stationary input. This raises the possibility that the correlations described by the pairwise model (1) in [196] and related works originate from a confounding factor (connections to a neuron other than those whose signal is measured), rather than from connections between recorded neurons [121].

B. Reconstruction of gene regulatory networks

The central dogma of molecular biology is this: Proteins are macromolecules consisting of long chains of amino acids. The particular sequence of a protein is encoded in DNA, a double-stranded helix of complementary nucleotides. Specific parts of DNA, the genes, are transcribed by polymerases, producing single-stranded copies called m(essenger)RNAs, which are translated by ribosomes, usually multiple times, to produce proteins.

The process of producing protein molecules from the DNA template by transcription and translation is called gene expression. The expression of a gene is tightly controlled to ensure that the right amounts of proteins are produced at the right time. One important control mechanism is transcription factors, proteins which affect the expression of a gene (or several) by binding to DNA near the transcription start site of that gene. This part of DNA is called the regulatory region of a gene. A target gene of a transcription factor may in turn encode another transcription factor, leading to a cascade of regulatory events. To add further complications, the binding of multiple transcription factors in the regulatory region of a gene leads to combinatorial control exerted by several transcription factors on the expression of a gene [37]. Can the regulatory connections between genes be inferred from data on gene expression, that is, can we learn the identity of transcription factors and their targets?

Over the last decades, the simultaneous measurement of expression levels of all genes has become routine. At the centre of this development are two distinct technological advances to measure mRNA levels. The first is microarrays, consisting of thousands of short DNA sequences, called probes, grafted to the surface of a small chip. After converting the mRNA in a sample to DNA by reverse transcription, cleaving that DNA into short segments, and fluorescently labelling the resulting DNA segments, fluorescent DNA can bind to its complementary sequence on the chip. (Reverse transcription converts mRNA to DNA, a process which requires a so-called reverse transcriptase as an enzyme.) The amount of fluorescent DNA bound to a particular probe depends on the amount of mRNA originally present in the sample. The relative amount of mRNA from a particular gene can then be inferred from the fluorescence signal at the corresponding probes [98]. A limitation of microarrays is the large amount of mRNA required: The mRNA sample is taken from a population of cells. As a result, cell-to-cell fluctuations of mRNA concentrations are averaged over. To obtain time series, populations of cells synchronized to approximately the same stage in the cell cycle are used [84].

The second way to measure gene expression levels is also based on reverse transcription of mRNA, followed by high-throughput sequencing of the resulting DNA segments. Then the relative mRNA levels follow directly from counts of sequence reads [237]. Recently, expression levels even in single cells have been measured in this way [242]. In combination with barcoding (adding short DNA markers to identify individual cells), 10^4 cells can have their expression profiled individually in a single sequencing run [134]. Such data may allow, for instance, the analysis of the response of target genes to fluctuations in the concentration of transcription factors. However, due to the destructive nature of single-cell sequencing, there may never be single-cell data that give time series of genome-wide expression levels.

Unfortunately, cellular concentrations of proteins are much harder to measure than mRNA levels. As a result, much of the literature focuses on mRNA levels, neglecting the regulation of translation. Advances in protein mass-spectrometry [178] may lead to data on both mRNA and protein concentrations. This data would pose the additional challenge of inferring two separate levels of gene regulation: gene transcription from DNA to mRNA and translation from mRNA to proteins.

As in the case of neural data discussed in the preceding section, gene expression data presents two distinct challenges: (i) finding a statistical description of the data in terms of suitable observables and (ii) inferring the underlying regulatory connections. Both these problems have been addressed extensively in the machine learning and quantitative biology communities. Clustering of gene expression data to detect sets of genes with correlated expression levels has been used to detect regulatory relationships. A model-based approach to the reconstruction of regulatory connections is Boolean networks. Boolean networks assign binary states to each gene (gene expression on/off), and the state of a gene at a given time depends on the state of all genes at a previous time through a set of logical functions assigned to each gene. See [64] for a review of clustering and [96] for a review of Boolean network inference.

A statistical description that has also yielded insight into regulatory connections is Bayesian networks. A Bayesian network is a probabilistic model describing a set of random variables (expression levels) through conditional dependencies described by a directed acyclic graph. Learning both the structure of the graph and the statistical dependencies is a hard computational problem, but can capture strong signals in the data that are often associated with a regulatory connection. In principle, causal relationships (like the regulatory connections) can be inferred, in particular if the regulatory network contains no cycles. For reviews, see [79, 116]. Both Boolean and Bayesian networks have been applied to measurements of the response of expression levels to external perturbations of the regulatory network or of expression levels, see [103, 174]. A full review of these methods is beyond the scope of this article; instead we focus on approaches related to the inverse Ising problem.

For a statistical description of gene expression levels, [126] applied a model with pairwise couplings

P({xi}) = exp[ ∑_{i≤j} Jij xi xj + ∑_i hi xi ] / Z ,    (5)

fitted to gene expression levels. The standard definition of expression levels xi is log2-values of fluorescence signals with the mean value for each gene subtracted. Since (5) is a multi-variate Gaussian distribution, the matrix of couplings Jij must be negative definite. These couplings can be inferred simply by inverting the matrix of variances and covariances of expression levels. In [126], the resulting couplings Jij were then used to identify hub genes which regulate many targets. The same approach was used in [128] to analyse the cellular signalling networks mediated by the phosphorylation of specific sites on different proteins. Again, the distribution (5) can be viewed as a maximum entropy distribution for continuous variables with prescribed first and second moments. This approach is also linked to the concept of partial correlations in statistics [9, 119].
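Because (5) is a Gaussian, the inference step mentioned above amounts to a matrix inversion. The sketch below (assuming an M-by-N array of measured expression levels as input, not a routine from [126]) makes the convention explicit: with the parameterisation exp(∑_{i≤j} Jij xi xj + ∑_i hi xi)/Z, the off-diagonal couplings are the negative off-diagonal entries of the inverse covariance (precision) matrix. Genes with many large couplings would then be candidate hubs.

    # Sketch: couplings of the Gaussian model (5) from the inverse covariance matrix.
    # x is an M-by-N array of expression levels (assumed input).
    import numpy as np

    def gaussian_couplings(x):
        x = x - x.mean(axis=0)              # subtract the mean for each gene
        C = np.cov(x, rowvar=False)         # covariance matrix of expression levels
        P = np.linalg.inv(C)                # precision matrix
        J = -P.copy()                       # J_ij = -(C^{-1})_ij for i != j
        np.fill_diagonal(J, -0.5 * np.diag(P))   # J_ii = -(1/2)(C^{-1})_ii in this convention
        return J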

Again, the maximum-entropy distribution (5) has symmetric couplings between expression levels, whereas the network of regulatory interactions is intrinsically asymmetric. One way to infer the regulatory connections is to use time series [205]. [12] uses expression levels measured at different times to infer the regulatory connections, based on a minimal model of expression dynamics with asymmetric regulatory connections between pairs of genes. In this model, expression levels x_i^t at successive time intervals t obey

sign(x_i^{t+1}) = +1 if ∑_j Jij x_j^t > κ,   −1 if ∑_j Jij x_j^t < κ ,    (6)

where κ is a threshold. The regulatory connections Jij are taken to be discrete, with the values −1, 1, 0 denoting repression, activation and no regulation of gene i by the product of gene j. The matrix of connections is then inferred based on Bayes theorem (see section II A) and an iterative algorithm for estimating marginal probabilities (message passing, see section II A 11).
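For orientation, the sketch below runs the threshold dynamics (6) forward in time for given connections J; it propagates only the signs of the expression levels, which is one natural reading of (6), and the threshold and initial state are illustrative choices. In [12] the model is used in the opposite direction, with J inferred from observed trajectories.

    # Sketch of the threshold dynamics (6), run forward for a given matrix J
    # (entries -1, 0, 1). Only the signs of the expression levels are propagated.
    import numpy as np

    def run(J, x0, kappa=0.0, steps=10):
        x = np.where(np.asarray(x0, dtype=float) > 0, 1.0, -1.0)
        traj = [x.copy()]
        for _ in range(steps):
            x = np.where(J @ x > kappa, 1.0, -1.0)   # Eq. (6); the case J @ x == kappa is not specified
            traj.append(x.copy())
        return np.array(traj)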

A second line of approach that can provide information on regulatory connections is perturbations [218]. An example is gene knockdowns, where the expression of one or more genes is reduced, by introducing small interfering RNA (siRNA) molecules into the cell [67] or by other techniques. siRNA molecules can be introduced into cells from the outside; after various processing steps they lead to the cleavage of mRNA with a complementary sequence, which is then no longer available for translation. If that mRNA translates to a transcription factor, all targets of that transcription factor will be up-regulated or down-regulated (depending on whether the transcription factor acted as a repressor or an activator, respectively). Knowing the responses of gene expression levels to a sufficient number of such perturbations allows the inference of regulatory connections. [148] considers a model of gene expression dynamics based on continuous variables xi evolving deterministically as ∂t xi = ai tanh(∑_j Jij xj) − ci xi. The first term describes how the expression level of gene j affects the rate of gene expression of gene i via the regulatory connection Jij; the second term describes mRNA degradation. The stationary points of this model shift in response to perturbations of expression levels of particular genes (for instance through knockdowns), and these changes depend on regulatory connections. In [148], the regulatory connections are inferred from perturbation data, again using belief propagation.
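The following sketch (parameter values and the zero-clamping of the knocked-down gene are illustrative assumptions, not those of [148]) integrates this deterministic dynamics to its stationary point with and without a knockdown; the pattern of steady-state shifts across many such perturbations is the data from which the connections Jij are inferred.

    # Sketch: steady state of dx_i/dt = a_i tanh(sum_j J_ij x_j) - c_i x_i,
    # with an optional knockdown implemented by clamping one gene to zero.
    import numpy as np

    def steady_state(J, a, c, clamp=None, dt=0.01, steps=20000):
        x = np.full(len(a), 0.1)
        for _ in range(steps):
            x += dt * (a * np.tanh(J @ x) - c * x)
            if clamp is not None:
                x[clamp] = 0.0          # knocked-down gene held at a reduced (here zero) level
        return x

    # shift = steady_state(J, a, c, clamp=k) - steady_state(J, a, c)
    # the shifts across perturbations carry the information about J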

C. Protein structure determination

Tremendous efforts have been made to determine the three-dimensional structure of proteins. A linear amino acid chain folds into a convoluted shape, the folded protein, thus bringing amino acids into close physical proximity that are separated by a long distance along the linear sequence.

Due to the number of proteins (several thousand per organism) and the length of individual proteins (hundreds of amino acid residues), protein structure determination is a vast undertaking. However, the rewards are also substantial. The three-dimensional structure of a protein determines its physical and chemical properties, and how it interacts with other cellular components: broadly, the shape of a protein determines many aspects of its function. Protein structure determination relies on crystallizing proteins and analysing the X-ray diffraction pattern of the resulting solid. Given the experimental effort required, the determination of a protein's structure from its sequence alone has been a key challenge to computational biology for several decades [60, 65]. The computational approach models the forces between amino acids in order to find the low-energy structure a protein in solution will fold into. Depending on the level of detail, this approach requires extensive computational resources.

An attractive alternative enlists evolutionary information: Suppose that we have at our disposal amino acid sequences of a protein as it appears in different related species (so-called orthologs). While the sequences are not identical across species, they preserve to some degree the three-dimensional shape of the protein. Suppose a specific pair of amino acids interacts strongly with each other and brings together parts of the protein that are distal on the linear sequence. Replacing this pair with another, equally strongly interacting pair of amino acids would change the sequence, but leave the structure unchanged. For this reason, we expect sequence differences across species to reflect the structure of the protein. Specifically, we expect correlations of amino acids in positions that are proximal to each other in the three-dimensional structure. In turn, the correlations observed between amino acids at different positions might allow us to infer which pairs of amino acids are proximal to each other in three dimensions (the so-called contact map). The use of such genomic information has recently led to predictions of the three-dimensional structure of many protein families inaccessible to other methods [167]; for a review see [51].

Early work looked at the correlations as a measure of proximity [87, 92, 129, 207]. However, correlations are transitive; if amino acids at sequence sites i and j are correlated due to proximity in the folded state, and j and k are correlated for some reason, i and k will also exhibit correlations, which need not stem from proximity. This problem is addressed by an inverse approach aimed at finding the set of pairwise couplings that lead to the observed correlations or sequences [39, 55, 58, 71, 99, 138, 154, 213, 239]. Since each sequence position can be taken up by one of 20 amino acids or a gap in the sequence alignment, there are 21² correlations at each pair of sequence positions. In [154, 239] a statistical model with pairwise interactions is formulated, based on the Hamiltonian

H = − ∑_{i<j} Jij(si, sj) − ∑_i hi(si) .    (7)

This Hamiltonian depends on spin variables si, one for each sequence position i = 1, . . . , N. Each spin variable can take on one of 21 values, describing the 20 possible amino acids at that sequence position as well as the possibility of a gap (corresponding to an extra amino acid inserted in a particular position in the sequences of other organisms). Each pair of amino acids si, sj in sequence positions i, j contributes Jij(si, sj) to the energy. The inverse problem is to find the couplings Jij(A, B) for each pair of sequence positions i, j and pair of amino acids A, B, as well as fields hi(A), such that the amino acid frequencies and correlations observed across species are reproduced.

FIG. 1. Correlations and couplings in protein structure determination. Both figures show the three-dimensional structure of a particular part (region 2) of the protein SigmaE of E. coli, as determined by X-ray diffraction. This protein, or rather a protein of similar sequence and presumably similar structure, occurs in many other bacterial species as well. In figure B, lines indicate pairs of sequence positions whose amino acids are highly correlated across different bacteria: for each pair of sequence positions at least 5 amino acids apart, the mutual information of pairwise frequency counts of amino acids was calculated, and the 20 most correlated pairs are shown here. Such pairs that also turn out to be close in the three-dimensional structure are shown in red, those whose distance exceeds 8 Å are shown in green. We see about as many highly correlated sequence pairs that are proximal to one another as correlated pairs that are further apart. By contrast, in figure A, lines show sequence pairs that are strongly coupled in the Potts model (7), whose model parameters are inferred from the correlations. The fraction of false contact predictions (green lines) is reduced considerably. The figures are taken from [154].


FIG. 2. Protein contact maps predicted from evolutionary correlations. The two figures show contact maps for the ELAV4 protein (left) and the RAS protein (right). x- and y-axes correspond to sequence positions along the linear chain of amino acids. Pairs of sequence positions whose amino acids are in close proximity in the folded protein are indicated in grey (experimental data). Pairs of sequence positions with highly correlated amino acids are shown in blue (mutual information, bottom triangle). Pairs of sequence positions with high direct information (8) calculated from (7) are shown in red. The coincidence of red and grey points shows excellent agreement of the predictions from direct information with the experimentally determined structure of the protein. The figure is taken from [138].

The sequence positions with strong pairwise couplings are then predicted to be proximal in the protein structure. A simple measure of the coupling between sequence positions is the matrix norm (Frobenius norm) √( ∑_{si,sj} (Jij(si, sj))² ). The so-called direct information [239] is an alternative measure based on information theory. A two-site model is defined with pij(si, sj) = exp{Jij(si, sj) + hi(si) + hj(sj)}/Zij. Direct information is the mutual information between the two-site model and a model without correlations between the amino acids,

DIij = ∑_{si,sj} pij(si, sj) ln( pij(si, sj) / (pi(si) pj(sj)) ) .    (8)
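Both scores can be computed directly from an inferred coupling tensor. The sketch below assumes couplings Jij(a, b) and fields hi(a) are already given (for instance by the methods of section II); for the direct information it reads pi, pj in (8) as the marginals of the two-site model, whereas [239] additionally adjusts the two-site fields so that these marginals match the empirical amino acid frequencies.

    # Sketch of the two coupling scores above. J has shape (N, N, q, q) with q = 21,
    # h has shape (N, q); both are assumed inputs from a previous inference step.
    import numpy as np

    def frobenius_score(J, i, j):
        return np.sqrt(np.sum(J[i, j] ** 2))      # matrix (Frobenius) norm of J_ij(., .)

    def direct_information(J, h, i, j):
        # two-site model p_ij(a, b) ~ exp( J_ij(a,b) + h_i(a) + h_j(b) )
        p = np.exp(J[i, j] + h[i][:, None] + h[j][None, :])
        p /= p.sum()
        pi, pj = p.sum(axis=1), p.sum(axis=0)     # marginals of the two-site model
        return np.sum(p * np.log(p / np.outer(pi, pj)))   # Eq. (8)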

The Boltzmann distribution resulting from (7) can be viewed as the maximum entropy distribution with one- and two-point correlations between amino acids in different sequence positions determined by the data. There is no reason to exclude higher-order terms in the Hamiltonian (7) describing interactions between triplets of sequence positions, although the introduction of such terms may lead to overfitting. Also, fitting the Boltzmann distribution (7) to sequence data uses no prior information on protein structures; for this reason it is called an unsupervised method. Recently, neural network models trained on sequence data and protein structures (supervised learning) have been very successful in predicting new structures [109, 236].

The maximum entropy approach to structure analysis is not limited to evolutionary data. In [252] Zhang and Wolynes analyse chromosome conformation capture experiments and use the observed frequency of contacts between different parts of a chromosome in a maximum entropy approach to predict the structure and topology of the chromosomes.

D. Fitness landscape inference

The concept of fitness lies at the core of evolutionary biology. Fitness quantifies the average reproductive success (number of offspring) of an organism with a particular genotype, i.e., a particular DNA sequence. The dependence of fitness on the genotype can be visualized as a fitness landscape in a high-dimensional space, where fitness specifies the height of the landscape. As the number of possible sequences grows exponentially with their length, the fitness landscape requires in principle an exponentially large number of parameters to specify, and in turn those parameters need an exponentially growing amount of data to infer.

A suitable model system for the inference of a fitness landscape is HIV proteins, due to the large number of sequences stored in clinical databases and the relative ease of generating mutants and measuring the resulting fitness. In a series of papers, Chakraborty and co-workers proposed a fitness model for the so-called Gag protein family (group-specific antigen) of the HIV virus [59, 74, 135, 202]. The model is based on pairwise interactions between amino acids. Retaining only the information whether the amino acid at sequence position i was mutated (si = 1) with respect to a reference sequence or not (si = 0), Chakraborty and co-workers suggest a minimal model for the fitness landscape given by the Ising Hamiltonian (2). Again, one can view the landscape (2) as generating the maximum entropy distribution constrained by the observed one- and two-point correlations.

Adding a constant to (2) in order to make fitness (expected number of offspring) non-negative does not alter the resulting statistics. The inverse problem is to infer the couplings Jij and fields hi from frequencies of amino acids and pairs of amino acids in particular sequence positions observed in HIV sequence data. Of course it is not clear from the outset that a model with only pairwise interactions can describe the empirical fitness landscape. As a test of this approach, [74] compares the prediction of (2) for specific mutants to the results of independent experimental measurements of fitness.

Statistical models of sequences described by pairwise interactions may be useful to model a wide range of protein families with different functions [100], and have been used in other contexts as well.


FIG. 3. Three-point correlations in amino acid sequences and their prediction from a model with pairwise interactions. Mora et al. look at the so-called D-region in the IgM protein (maximum length N = 8) [153]. The D-region plays an important role in immune response. The frequencies at which given triplets of consecutive amino acids occur were compiled (x-axis, normalized with respect to the prediction of a model with independent sites). The results are compared to the prediction from a model with pairwise interactions like (2) on the y-axis. The figure is taken from [153].

Santolini, Mora, and Hakim model the statistics of sequences binding transcription factors using (7), with each spin taking one of four states to characterize the nucleotides A, C, G, T [195]. A similar model is used in [153] to model the sequence diversity of the so-called IgM protein, an antibody which plays a key role in the early immune response. The model with pairwise interactions predicts non-trivial three-point correlations which compare well with those found in the data, see figure 3.

E. Combinatorial antibiotic treatment

Antibiotics are chemical compounds which kill specific bacteria or inhibit their growth [115, 234]. Mutations in the bacterial DNA can lead to resistance against a particular antibiotic, which is a major hazard to public health [127, 243]. One strategy to slow down or eliminate the emergence of resistance is to use a combination of antibiotics either simultaneously or in rotation [115, 234]. The key problem of this approach is to find combinations of compounds which are particularly effective against a particular strain of bacteria. Trying out all combinations experimentally is prohibitively expensive. Wood et al. use an inverse statistical approach to predict the effect of combinations of several antibiotics from data on the effect of pairs of antibiotics [244]. The available antibiotics are labelled i = 1, . . . , N; in [244] a distribution over continuous variables xi is constructed, such that 〈xi〉 gives the bacterial growth rate when antibiotic i is administered, 〈xi xj〉 gives the growth rate when both i and j are given, etc. for higher moments. Choosing this distribution to be a multi-variate Gaussian P({xi}) = exp[ ∑_{i≤j} Jij xi xj + ∑_i hi xi ] / Z results in simple relationships between the different moments, which lead to predictions of the response to drug combinations that are borne out well by experiment [244].

F. Interactions between species and between individuals

Species exist in various ecological relationships. For instance, individuals of one species hunt and eat individuals of another species. Another example is microorganisms whose growth can be influenced, both positively and negatively, by the metabolic output of other microorganisms. Such relationships form a dense web of ecological interactions between species. Co-culturing and perturbation experiments (for instance species removal) lead to data which may allow the inference of these networks [73, 94].

Interactions between organisms exist also at the level of individuals, for instance when birds form a flock, or fish form a school. This emergent collective behaviour is thought to have evolved to minimize the exposure of individuals to predators. In [28, 29], a model with pairwise interactions between the velocities of birds in a flock is constructed. Individual birds labelled i = 1, . . . , N move with velocity vector vi in a direction specified by the unit vector si = vi/|vi|. The statistics of these directions is modelled by a distribution

P({si}) = exp[ ∑_{i,j} Jij si · sj ] / Z ,    (9)

where the couplings between the spins si need to be inferred from the experimentally observed correlations between normalized velocities. This model can be viewed as the maximum-entropy distribution constrained by pairwise correlations between normalized velocities. From the point of view of statistical physics it describes a disordered system of Heisenberg spins. As birds frequently change their neighbours in flight, the couplings are not constant in time and it makes sense to consider couplings that depend on the distance between two individuals [29]. An alternative is to apply the maximum entropy principle to entire trajectories [46].
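The input to this inference is simply the matrix of correlations between normalized flight directions, averaged over snapshots of the flock. A minimal sketch (assuming measured velocity vectors for one snapshot, not code from [28, 29]):

    # Sketch: observables entering the flock model (9) from measured velocities.
    # v is an array of shape (n_birds, 3) holding velocity vectors (assumed input).
    import numpy as np

    def direction_correlations(v):
        s = v / np.linalg.norm(v, axis=1, keepdims=True)   # s_i = v_i / |v_i|
        return s @ s.T                                     # C_ij = s_i . s_j for one snapshot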

G. Financial markets

Market participants exchange commodities, shares in companies, currencies, or other goods and services, usually for money. The changes in prices of such goods are often correlated, as the demand for different goods can be influenced by the same events. In [41, 42], Bury uses a spin model with pairwise interactions to analyse stock market data. Shares in N different companies are described by binary spin variables, where spin si = 1 indicates ‘bullish’ conditions for shares in company i with prices going up at a particular time, and si = −1 implies ‘bearish’ conditions with decreasing prices. Couplings Jij describe how price changes in shares i affect changes in the price of j, or how prices are affected jointly by external events. Bury fitted stock market data to this spin model, and found clusters in the resulting matrix of couplings [41]. These clusters correspond to different industries whose companies are traded on the market. In [34], a similar analysis finds that heavy tails in the distribution of inferred couplings are linked to such clusters. Slonim et al. identified clusters in stocks using an information-based metric of stock prices [206].

II. EQUILIBRIUM RECONSTRUCTION

The applications discussed above can be classified according to the symmetry of pairwise couplings: In network reconstruction, couplings between spins are generally asymmetric; in maximum entropy models they are symmetric. A stochastic dynamics based on symmetric couplings entails detailed balance, leading to a steady state described by the Boltzmann distribution [118], whereas asymmetric couplings lead to a non-equilibrium steady state. This distinction shapes the structure of this review: In this section, we discuss the inverse Ising problem in equilibrium; in section III we turn to non-equilibrium scenarios.

1. Definition of the problem

We consider the Ising model with N binary spin variables si = ±1, i = 1, . . . , N. Pairwise couplings (or coupling strengths) Jij encode pairwise interactions between the spin variables, and local magnetic fields hi act on individual spins. The energy of a spin configuration s ≡ {si} is specified by the Hamiltonian

H_{J,h}(s) = − ∑_{i<j} Jij si sj − ∑_i hi si .    (10)

The equilibrium statistics of the Ising model is described by the Boltzmann distribution

p(s) = (1/Z) e^{−H_{J,h}(s)} ,    (11)

where we have subsumed the temperature into couplings and fields such that kB T = 1: The statistics of spins under the Boltzmann distribution exp(−βH)/Z depends on couplings, magnetic fields, and temperature only through the products βJij and βhi. As a result, only the products βJij and βhi can be inferred, and we set β to 1 without loss of generality. The energy specified by the Hamiltonian (2) or its generalisation (7) is thus a dimensionless quantity. Z denotes the partition function

Z(J, h) = ∑_s e^{−H_{J,h}(s)} .    (12)

In such a statistical description of the Ising model, each spin is represented by a random variable. Throughout, we denote a random spin variable by σ, and a particular realisation of that random variable by s. This distinction will become particularly useful in the context of non-equilibrium reconstruction in section III. The expectation values of spin variables and their functions are then denoted

〈Q(σ)〉 ≡ ∑_s p(s) Q(s) ,    (13)

where Q(s) is some function mapping a spin configuration to a number. Examples are the equilibrium magnetizations mi ≡ 〈σi〉 = ∑_s p(s) si or the pair correlations χij ≡ 〈σi σj〉 = ∑_s p(s) si sj. In statistics, the latter observable is called the pair average. We are also interested in the connected correlation Cij = χij − mi mj, which in statistics is known as the covariance.
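In practice these quantities are estimated from the data set. A minimal sketch, assuming the data are stored as an M-by-N array of spins ±1:

    # Sketch: sample magnetisations, pair correlations and connected correlations
    # from a data set D of M spin configurations (rows) of N spins (columns).
    import numpy as np

    def data_statistics(D):
        M = D.shape[0]
        m = D.mean(axis=0)            # magnetisations m_i
        chi = (D.T @ D) / M           # pair correlations chi_ij = <s_i s_j>
        C = chi - np.outer(m, m)      # connected correlations (covariances) C_ij
        return m, chi, C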

The equilibrium statistics of the Ising problem (11) is fully determined by the couplings between spins and the magnetic fields acting on the spins. Collectively, couplings and magnetic fields are the parameters of the Ising problem. The forward Ising problem is to compute statistical observables such as the magnetizations and correlations under the Boltzmann distribution (11); the couplings and fields are taken as given. The inverse Ising problem works in the reverse direction: The couplings and fields are unknown and are to be determined from observations of the spins. The equilibrium inverse Ising problem is to infer these parameters from spin configurations sampled independently from the Boltzmann distribution. We denote such a data set of M samples by D = {s^µ} for µ = 1, 2, . . . , M. (This usage of the term ‘sample’ appears to differ from how it is used in the statistical mechanics of disordered systems, where a sample often refers to a random choice of the model parameters, not spin configurations. However, it is in line with the inverse nature of the problem: From the point of view of the statistical mechanics of disordered systems, in an inverse statistical problem the ‘phase space variables’ are couplings and magnetic fields to be inferred, and the ‘quenched disorder’ is spin configurations sampled from the Boltzmann distribution.)

Generally, neither the values of couplings nor the graphstructure formed by non-zero couplings is known. Unlikein many instances of the forward problem, the couplingsoften do not conform to a regular, finite-dimensional lat-tice; there is no sense of spatial distance between spins.Instead, the couplings might be described by a fully con-nected graph, with all pairs of spins coupling to eachother, generally all with different values of the Jij . Al-ternatively, most of the couplings might be zero, and thenon-zero entries of the coupling matrix might define astructure that is (at least locally) treelike. The graphformed by the couplings might also be highly heteroge-neous with few highly connected nodes with many non-zero couplings and many spins coupling only to a fewother spins. These distinctions can affect how well spe-cific inference methods perform, a point we will revisitin section II C, which compares the quality of differentmethods in different situations.

A. Maximum likelihood

The inverse Ising problem is a problem of statistical inference [31, 131]. At the heart of many methods to reconstruct the parameters of the Ising model is the maximum likelihood framework, which we discuss here.

Suppose a set of observations x_1, x_2, ..., x_M is drawn from a statistical model p(x_1, x_2, ..., x_M|θ). In the case of the Ising model, each observation would be a spin configuration \bm{s}. While the functional form of this model may be known a priori, the parameter θ is unknown to us and needs to be inferred from the observed data. Of course, with a finite amount of data, one cannot hope to determine the parameter θ exactly. The so-called maximum likelihood estimator

\theta_{\rm ML} = \mathrm{argmax}_\theta \, p(x_1, x_2, \dots, x_M|\theta)   (14)

has a number of attractive properties [57]: In the limit of a large number of samples, θ_ML converges in probability to the value θ being estimated. This property is termed consistency. Also for large sample sizes, there is no consistent estimator with a smaller mean-squared error. For a finite number of samples, the maximum likelihood estimator may however be biased, that is, the mean of θ_ML over many realisations of the samples does not equal θ (although the difference vanishes with the sample size). The term likelihood refers to p(x_1, x_2, ..., x_M|θ) viewed as a function of the parameter θ at constant values of the data x_1, x_2, ..., x_M. The same function at constant θ gives the probability of observing the data x_1, x_2, ..., x_M.

The maximum likelihood estimator (14) can also be derived using Bayes' theorem [31, 131]. In Bayesian inference, one introduces a probability distribution p(θ) over the unknown parameter θ. This prior distribution describes our knowledge prior to receiving the data. Upon accounting for the additional information from the data, our knowledge is described by the posterior distribution given by Bayes' theorem

p(\theta|x_1, x_2, \dots, x_M) = \frac{p(\theta, x_1, x_2, \dots, x_M)}{p(x_1, x_2, \dots, x_M)} = \frac{p(x_1, x_2, \dots, x_M|\theta)\, p(\theta)}{p(x_1, x_2, \dots, x_M)} .   (15)

For the case where θ is a priori uniformly distributed (describing a scenario where we have no prior knowledge of the parameter value), the posterior probability distribution of the parameter conditioned on the observations p(θ|x_1, x_2, ..., x_M) is proportional to p(x_1, x_2, ..., x_M|θ) [162]. Then the parameter value maximizing the probability density p(θ|x_1, x_2, ..., x_M) is given by the maximum likelihood estimator (14). Maximizing the logarithm of the likelihood function, termed the log-likelihood function, leads to the same parameter estimate, because the logarithm is a strictly monotonic function. As the likelihood scales exponentially with the number of samples, the log-likelihood is more convenient to use. (This is simply the convenience of not having to deal with very small numbers: the logarithm is not linked to the quenched average considered in the statistical mechanics of disordered systems; there is no average involved, and the likelihood depends on both the model parameters and the data.)

We now apply the principle of maximum likelihood to the inverse Ising problem. Assuming that the configurations in the dataset were sampled independently from the Boltzmann distribution (1), the log-likelihood of the model parameters given the observed configurations D = {\bm{s}^µ} is derived easily:

L_D(\bm{J},\bm{h}) = \frac{1}{M} \ln p(D|\bm{J},\bm{h})   (16)
             = \sum_{i<j} J_{ij} \frac{1}{M}\sum_\mu s_i^\mu s_j^\mu + \sum_i h_i \frac{1}{M}\sum_\mu s_i^\mu - \ln Z(\bm{J},\bm{h})
             = \sum_{i<j} J_{ij} \langle \sigma_i \sigma_j \rangle_D + \sum_i h_i \langle \sigma_i \rangle_D - \ln Z(\bm{J},\bm{h}) .

Equation (16) gives the log-likelihood per sample, a quantity of order zero in M, since the likelihood scales exponentially with the number of samples. The sample averages of spin variables and their functions are defined by

\langle Q \rangle_D = \frac{1}{M} \sum_\mu Q(\bm{s}^\mu) .   (17)

Beyond the parameters of the Ising model, the log-likelihood (16) depends only on the correlations between pairs of spins observed in the data 〈σ_i σ_j〉_D and the magnetizations 〈σ_i〉_D. To determine the maximum-likelihood estimates of the model parameters we thus only need the pair correlations and magnetizations observed in the sample (sample averages); at least in principle, further observables are superfluous. In the language of statistics, these sets of sample averages provide sufficient statistics to determine the model parameters.
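As a concrete illustration (not part of the original text), the sufficient statistics can be computed directly from a data matrix of sampled configurations. The following Python sketch (using numpy; the variable names are ours) estimates the magnetisations, pair averages, and connected correlations:

import numpy as np

def sample_statistics(samples):
    """Estimate the sufficient statistics from spin configurations.

    samples: array of shape (M, N) with entries +1/-1, one row per
    sampled configuration s^mu.
    Returns (m, chi, C): magnetisations <sigma_i>_D, pair averages
    <sigma_i sigma_j>_D, and connected correlations C_ij.
    """
    samples = np.asarray(samples, dtype=float)
    M = samples.shape[0]
    m = samples.mean(axis=0)                  # <sigma_i>_D
    chi = samples.T @ samples / M             # <sigma_i sigma_j>_D
    C = chi - np.outer(m, m)                  # C_ij = chi_ij - m_i m_j
    return m, chi, C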

The log-likelihood (16) has a physical interpretation: The first two terms are the sample average of the (negative of the) energy, and the last term adds the free energy F = −ln Z. Thus the log-likelihood is the (negative of the) entropy of the Ising system, based on the sample estimate of the energy. We will further discuss this connection in section II A 5.

A second interpretation of the log-likelihood is based on the difference between the Boltzmann distribution (11) and the empirical distribution of the data in the sample D, denoted p_D(\bm{s}) ≡ \frac{1}{M}\sum_\mu \delta_{\bm{s}^\mu,\bm{s}}. The difference between two probability distributions p(\bm{s}) and q(\bm{s}) can be quantified by the Kullback–Leibler (KL) divergence

KL(p|q) = \sum_{\bm{s}} p(\bm{s}) \ln \frac{p(\bm{s})}{q(\bm{s})} ,   (18)

which is non-negative and reaches zero only when the two distributions are identical [56]. The KL divergence between the empirical distribution and the Boltzmann distribution is

KL(p_D|p) = \sum_{\bm{s}} p_D(\bm{s}) \ln \frac{p_D(\bm{s})}{p(\bm{s})}   (19)
          = -L_D(\bm{J},\bm{h}) + \sum_{\bm{s}} p_D(\bm{s}) \ln p_D(\bm{s}) .

The second term (the negative empirical entropy) is independent of the model parameters; the best match between the Boltzmann distribution and the empirical distribution (minimal KL divergence) is thus achieved when the likelihood (16) is maximal.

Above, we derived the principle of maximum likelihood (14) from Bayes' theorem under the assumption that the model parameter θ is sampled from a uniform prior distribution. Suppose we had the prior information that the parameter θ was taken from some non-uniform distribution; the posterior distribution would then acquire an additional dependence on the parameter. In the case of the inverse Ising problem, prior information might for example describe the sparsity of the coupling matrix, with a suitable prior p(\bm{J}) ∼ \exp[-\gamma \sum_{i<j} |J_{ij}|] that assigns small probabilities to large entries in the coupling matrix. The resulting (log) posterior is

\ln p(\bm{J},\bm{h}|D) = M L_D(\bm{J},\bm{h}) - \gamma \sum_{i<j} |J_{ij}|   (20)

up to terms that do not depend on the model parameters. The maximum of the posterior is now no longer achieved by maximizing the likelihood, but involves a second term that penalizes coupling matrices with large entries. Maximizing the posterior with respect to the parameters no longer makes the Boltzmann distribution as similar to the empirical distribution as possible, but strikes a balance between making these distributions similar while avoiding large values of the couplings. In the context of inference, the second term is called a regularisation term. Different regularisation terms have been used, including the absolute-value term in (20) as well as a penalty on the squared values of the couplings, \sum_{i<j} J_{ij}^2 (called ℓ_1- and ℓ_2-regularisers, respectively). One standard way to determine the value of the regularisation coefficient γ is to cross-validate with a part of the data that is initially withheld, that is, to probe (as a function of γ) how well the model can predict aspects of the data not yet used to infer the model parameters [93].
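Schematically (and not part of the original text), this cross-validation procedure can be sketched as follows in Python; fit_penalized and log_likelihood are hypothetical placeholders for any penalized inference routine and likelihood evaluator discussed in this review:

import numpy as np

def choose_gamma(samples, gammas, fit_penalized, log_likelihood, train_frac=0.8):
    """Pick the regularisation strength gamma by cross-validation.

    fit_penalized(train, gamma) -> (J, h): maximises M*L_D - gamma*sum|J_ij|
    log_likelihood(data, J, h) -> float: per-sample log-likelihood L_D
    """
    rng = np.random.default_rng(0)
    samples = rng.permutation(samples)            # shuffle before splitting
    M_train = int(train_frac * len(samples))
    train, test = samples[:M_train], samples[M_train:]
    scores = []
    for gamma in gammas:
        J, h = fit_penalized(train, gamma)        # fit on the training part
        scores.append(log_likelihood(test, J, h)) # score on withheld data
    return gammas[int(np.argmax(scores))]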

1. Exact maximization of the likelihood

The maximum likelihood estimate of couplings and magnetic fields

\bm{J}^{\rm ML}, \bm{h}^{\rm ML} = \mathrm{argmax}_{\bm{J},\bm{h}}\, L_D(\bm{J},\bm{h})   (21)

has a simple interpretation. Since ln Z(\bm{J},\bm{h}) serves as a generating function for expectation values under the Boltzmann distribution, we have

\frac{\partial L_D}{\partial h_i}(\bm{J},\bm{h}) = \langle \sigma_i \rangle_D - \langle \sigma_i \rangle   (22)
\frac{\partial L_D}{\partial J_{ij}}(\bm{J},\bm{h}) = \langle \sigma_i \sigma_j \rangle_D - \langle \sigma_i \sigma_j \rangle .

At the maximum of the log-likelihood these derivatives are zero; the maximum-likelihood estimate of the parameters is reached when the expectation values of pair correlations and magnetizations under the Boltzmann statistics match their sample averages

\langle \sigma_i \rangle = \langle \sigma_i \rangle_D   (23)
\langle \sigma_i \sigma_j \rangle = \langle \sigma_i \sigma_j \rangle_D .

The log-likelihood (16) turns out to be a concave function of the model parameters, see II A 2. Thus, in principle, it can be maximized by a convex optimization algorithm. One particular way to reach the maximum of the likelihood is a gradient-descent algorithm called Boltzmann machine learning [1]. At each step of the algorithm, fields and couplings are updated according to

h_i^{n+1} = h_i^n + \eta \frac{\partial L_D}{\partial h_i}(\bm{J}^n,\bm{h}^n)   (24)
J_{ij}^{n+1} = J_{ij}^n + \eta \frac{\partial L_D}{\partial J_{ij}}(\bm{J}^n,\bm{h}^n) .   (25)

The parameter η is the learning rate of the algorithm, which has (23) as its fixed point.
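To make the update rules (24) and (25) concrete, the following Python sketch (not from the original text) performs Boltzmann machine learning with the model expectation values computed by brute-force enumeration of all 2^N configurations; it is therefore only feasible for small N and serves purely as a didactic illustration:

import numpy as np
from itertools import product

def boltzmann_learning(m_data, chi_data, eta=0.1, n_steps=2000):
    """Gradient ascent on the log-likelihood, updates (24) and (25).

    m_data, chi_data: sample magnetisations <sigma_i>_D and pair
    averages <sigma_i sigma_j>_D (e.g. from sample_statistics above).
    """
    N = len(m_data)
    configs = np.array(list(product([-1, 1], repeat=N)), dtype=float)
    J, h = np.zeros((N, N)), np.zeros(N)
    for _ in range(n_steps):
        # Boltzmann weights for all configurations
        energies = -0.5 * np.einsum('ki,ij,kj->k', configs, J, configs) - configs @ h
        p = np.exp(-energies)
        p /= p.sum()
        m_model = p @ configs                            # <sigma_i>
        chi_model = configs.T @ (p[:, None] * configs)   # <sigma_i sigma_j>
        h += eta * (m_data - m_model)                    # eq. (24)
        dJ = eta * (chi_data - chi_model)                # eq. (25)
        np.fill_diagonal(dJ, 0.0)                        # no self-couplings
        J += dJ
    return J, h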


In order to calculate the expectation values 〈σ_i〉 and 〈σ_i σ_j〉 on the left-hand side of these equations, one needs to perform thermal averages of the form (13) over all 2^N configurations, which is generally infeasible for all but the smallest system sizes. Analogously, when maximizing the log-likelihood (16) directly, the partition function is a sum over 2^N terms. Moreover, the expectation values or the partition function need to be evaluated many times during an iterative search for the solution of (23) or the maximum of the likelihood. As a result, also numerical sampling techniques such as Monte Carlo sampling are cumbersome, but have been used for moderate system sizes [36]. Habeck proposes a Monte Carlo sampler that draws model parameters from the posterior distribution [91]. A recent algorithm uses information contained in the shape of the likelihood maximum to speed up the convergence [75]. An important development in machine learning has led to the so-called restricted Boltzmann machines, where couplings form a symmetric and bipartite graph. Variables fall into two classes, termed 'visible' and 'hidden', with couplings never linking variables of the same class. This allows fast learning algorithms [76] at the expense of additional hidden variables.

We stress that the difficulty of maximizing the likelihood is associated with the restriction of our input to the first two moments (magnetisations and correlations) of the data. On the one hand, this restriction is natural, as the likelihood only depends on these two moments. On the other hand, computationally efficient methods have been developed that effectively use correlations in the data beyond the first two moments. An important example is pseudolikelihood, which we will discuss in section II B. Other learning techniques that sidestep the computation of the partition function include score matching [101] and minimum probability flow [208]. Also, when the number of samples is small (compared to the number of spins), the likelihood need no longer be the best quantity to optimize.

2. Uniqueness of the solution

We will show that the log-likelihood (16) is a strictly concave function of the model parameters (couplings and magnetic fields). As the space of parameters is convex, the maximum of the log-likelihood is unique.

We use the shorthands \bm{\lambda} = (\bm{J},\bm{h}) and Q_k(\bm{s}) ∈ {s_i s_j, s_i} for the model parameters and the functions coupling to them, and write the Boltzmann distribution as

p(\bm{s}) = \frac{1}{Z(\bm{\lambda})} e^{\sum_k \lambda_k Q_k(\bm{s})} .   (26)

For such a general class of exponential distributions [93], the second derivatives of the log-likelihood L_D with respect to the parameters obey

-\frac{\partial^2 L_D}{\partial \lambda_i \partial \lambda_j}(\bm{\lambda}) = \langle Q_i Q_j \rangle - \langle Q_i \rangle \langle Q_j \rangle .   (27)

This matrix of second derivatives is non-negative (has no negative eigenvalues), since

\sum_{ij} \big( \langle Q_i Q_j \rangle - \langle Q_i \rangle \langle Q_j \rangle \big) x_i x_j = \Big\langle \Big[ \sum_k \big( x_k Q_k - \langle x_k Q_k \rangle \big) \Big]^2 \Big\rangle \ge 0

for all x_i. If no non-trivial linear combination of the observables Q_k has vanishing fluctuations, the Hessian matrix is even positive-definite. For the inverse Ising problem, there are indeed no non-trivial linear combinations of the spin variables σ_i and pairs of spin variables σ_iσ_j that do not fluctuate under the Boltzmann measure, unless some of the couplings or fields are infinite. As a result, the maximum of the likelihood, if it exists, is unique. However, it can happen that the maximum lies at infinite values of some of the parameters (for instance, when the samples contain only positive values of a particular spin, the maximum-likelihood value of the corresponding magnetic field is infinite). These divergences can be avoided with the introduction of a regularisation term, see section II A.

3. Maximum entropy modelling

The Boltzmann distribution in general and the Ising Hamiltonian (2) in particular can be derived from information theory and the principle of maximum entropy. This principle has been invoked in neural modelling [196], protein structure determination [239], and DNA sequence analysis [153]. In this section, we discuss the statistical basis of Shannon's entropy, the principle of maximum entropy, and their application to inverse statistical modelling.

Consider M distinguishable balls, each to be placed in a box with R compartments. The number of ways of placing the balls such that n_r balls are in the rth compartment (r ∈ {1, ..., R}) is

W = \frac{M!}{\prod_{r=1}^R n_r!}   (28)

with \sum_{r=1}^R n_r = M. For large M, we write n_r = M q_r and exploit Stirling's formula n_r! ≈ n_r^{n_r} e^{-n_r}, yielding the Gibbs entropy

\frac{\ln W}{M} \approx -\sum_{r=1}^R q_r \ln q_r .   (29)

This combinatorial result forms the basis of equilibrium statistical physics in the classic treatment due to Gibbs and can be found in standard textbooks. In the context of statistical physics, each of the R compartments corresponds to a microstate of a system, and each microstate r is associated with energy E_r. The M balls in the compartments describe a set of copies of the system, a so-called ensemble of replicas. The replicas may exchange energy with each other, while the ensemble of replicas itself is isolated and has a fixed total energy ME (and possibly other conserved quantities). In this way, the replicas can be thought of as providing a heat-bath for each other. If we assume that each state of the ensemble of replicas with a given total energy is equally likely, the statistics of q_r is dominated by a sharp maximum of W as a function of the q_r, subject to the constraints \sum_r q_r = 1 and \sum_r E_r q_r = E. Using Lagrange multipliers to maximize (29) subject to these constraints yields the Boltzmann distribution [107].

This seminal line of argument can also be used to derive Shannon's information entropy (3). The argument is due to Wallis and is recounted in [108]. Suppose we want to find a probability distribution p_r compatible with a certain constraint, for instance a specific expectation value \sum_r p_r E_r = E to within some small margin of error. Consider M independent casts of a fair die with R faces. We denote the number of times outcome r is realized in these throws as n_r. The probability of a particular set {n_r} is

\frac{M!}{\prod_{r=1}^R n_r!} \prod_{r=1}^R (1/R)^{n_r} .   (30)

In the limit of large M, the logarithm of this probability is -M \sum_r q_r \ln q_r - M \ln R with q_r = n_r/M.

Each set of M casts defines one instance of the {n_r}. In most instances, the constraint will not be realized. For those (potentially rare) instances obeying the constraint, we can ask what are the most likely values of n_r, and correspondingly q_r. Maximising Shannon's information entropy -\sum_r q_r \ln q_r subject to the constraint and the normalisation \sum_r q_r = 1 gives the so-called maximum-entropy estimate of p_r. If the underlying set of probabilities (the die with R faces) differs from the uniform distribution, so outcome r occurs with probability q^0_r, it is not the entropy but the relative entropy -\sum_r q_r \ln(q_r/q^0_r) that is to be maximised. Up to a sign, this is the Kullback–Leibler divergence (18) between q_r and q^0_r.

The maximum-entropy estimate can be used to approximate an unknown probability distribution q_r that is under-sampled. Suppose data is sampled several times from some unknown probability distribution. With a sufficient number of samples M, the distribution q_r can easily be determined from frequency counts q_r = n_r/M. Often this is not feasible; if q_r M ≪ 1, n_r fluctuates strongly from one set of samples to the next. This situation appears naturally when the number of possible outcomes R grows exponentially with the size of the system, see e.g. section I A. Nevertheless, the data may be sufficient to pin down one or several expectation values. The maximum-entropy estimate has been proposed as the most unbiased estimate of the unknown probability distribution compatible with the observed expectation values [108]. For a discussion of the different ways to justify the maximum entropy principle, and derivations based on robust estimates, see [220].

Many applications of the maximum-entropy estimate are in image analysis and spectral analysis [90]; for reviews in physics and biology see [14, 30, 151], and for critical discussion see [5, 231].

The connection between maximum entropy and the inverse Ising problem is simple: For a set of N binary variables, the distribution with given first and second moments maximizing the information entropy is the Boltzmann distribution (1) with the Ising Hamiltonian (2). We use Lagrange multipliers to maximize the information entropy (3) subject to the normalization condition and the constraints on the first and second moments (magnetisations and pair correlations) of p(\bm{s}) to be \bm{m} and \bm{\chi}. Setting the derivatives of

\sum_{\bm{s}} \big[ -p(\bm{s}) \ln p(\bm{s}) \big] + \eta \Big[ 1 - \sum_{\bm{s}} p(\bm{s}) \Big] + \sum_i h_i \Big[ m_i - \sum_{\bm{s}} p(\bm{s}) s_i \Big] + \sum_{i<j} J_{ij} \Big[ \chi_{ij} - \sum_{\bm{s}} p(\bm{s}) s_i s_j \Big]   (31)

with respect to p(\bm{s}) to zero yields the Ising model (1). The Lagrange multipliers \bm{h} and \bm{J} need to be chosen to reproduce the first and second moments (magnetisations and correlations) of the data and can be interpreted as couplings between spins and magnetic fields.

While this principle appears to provide a statistical foundation to the model (1), there is no a priori reason to disregard empirical data beyond the first two moments. Instead, the pairwise couplings result from the particular choice of making the probability distribution match the first two moments of the data. The reasons for this step may be different in different applications.

• Moments beyond the first and second may be poorly determined by the data. Conversely, with an increasing number of samples, the determination of higher-order correlations and hence interactions between triplets of spin variables etc. becomes viable.

• The data may actually be generated by an equilibrium model with (at most) pairwise interactions between spin variables. This need not be obvious from observed correlations of any order, but can be tested by comparing three-point correlations predicted by a model with pairwise couplings to the corresponding correlations in the data. Examples are found in sequence analysis, where population dynamics leads to an equilibrium steady state [24, 200] and the energy can often be approximated by pairwise couplings [153, 195]. For a review see [211]. Surprisingly, also in neural data (not generated by an equilibrium model), three-point correlations are predicted well by a model with pairwise interactions [222, 226].

• A model of binary variables interacting via high-order coupling terms J_{ijk...} s_i s_j s_k ... can sometimes be approximated surprisingly well by pairwise interactions. This seems to be the case when the couplings are dense, so that each variable appears in several coupling terms [143].

• Often one seeks to describe a subset of n variables s_1, s_2, ..., s_n from a larger set of N variables, for instance when only the variables in the subset can be observed. The subset of variables is characterized by effective interactions which stem from interactions between variables in the subset, and from interactions with the other variables. If the subset is sufficiently small, the resulting statistics is often described by a model with pairwise couplings [191].

• The true probability distribution underlying some data may be too complicated to calculate in practice. A more modest goal then is to describe the data using an effective statistical model such as (1), which is tractable and allows the derivation of bounds on the entropy or the free energy. Examples are the description of neural data and gene expression data using the Ising model with symmetric couplings (see I A and I B).

• There are also useful models which are computationally tractable but do not maximize the entropy. An example is Gaussian models used to generate artificial spike trains with prescribed pair correlations [3, 132].

4. Information theoretic bounds on graphical model reconstruction

A particular facet of the inverse Ising problem is graphical model selection. Consider the Ising problem on a graph. A graph is a set of nodes and edges connecting these nodes, with each node associated with a spin variable. Couplings between node pairs connected by an edge are non-zero, couplings between unconnected node pairs are zero. The graphical model selection problem is to recover the underlying graph (and usually also the values of the couplings) from data sampled independently from the Boltzmann distribution. Given a particular number of samples, one can ask with which probability a given method can reconstruct the graph correctly (the reconstruction fluctuates between different realisations of the samples). Notably, there are also universal limits on graphical model selection that are independent of a particular method.

In [194], Santhanam and Wainwright derive information-theoretic limits to graphical model selection. The key result is the dependence of the required number of samples on the smallest and on the largest coupling,

\alpha = \min_{i<j} |J_{ij}| , \qquad \beta = \max_{i<j} |J_{ij}| ,   (32)

and on the maximum node connectivity (number of neighbours on the graph) d. Reconstruction of the graph, by any method, is impossible if fewer than

\max\left\{ \frac{\ln N}{2\alpha \tanh\alpha},\; \frac{e^{\beta d} \ln(Nd/4 - 1)}{4 d \alpha e^{\alpha}},\; \frac{d}{8} \ln\frac{N}{8d} \right\}   (33)

samples are available (the precise statement is of a probabilistic nature, see [194]). If the maximum connectivity d grows with the system size, this result implies that at least c \max\{d^2, \alpha^{-2}\} \ln N samples are required (with some constant c) [194]. The derivation of this and other results is based on Fano's inequality (Fano's lemma) [56], which gives a lower bound for the probability of error of a classification function (such as the mapping from samples to the graph underlying these samples).
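For orientation (and not part of the original text), the lower bound (33), as reconstructed above, can be evaluated numerically; the following Python helper simply plugs in N, d, α and β:

import numpy as np

def sample_lower_bound(N, d, alpha, beta):
    """Evaluate the three terms of the lower bound (33) on the number of
    samples needed for graph reconstruction (see [194] for the precise
    probabilistic statement; this is only an illustrative evaluation).
    """
    term1 = np.log(N) / (2.0 * alpha * np.tanh(alpha))
    term2 = np.exp(beta * d) * np.log(N * d / 4.0 - 1.0) / (4.0 * d * alpha * np.exp(alpha))
    term3 = (d / 8.0) * np.log(N / (8.0 * d))
    return max(term1, term2, term3)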

5. Thermodynamics of the inverse Ising problem

Calculations in statistical physics are greatly simplified by introducing thermodynamic potentials. In this section, we will discuss the method of thermodynamic potentials for the inverse Ising problem. It turns out that the maximum likelihood estimation of the fields and couplings is simply a transformation of the thermodynamic potentials.

Recall that the thermodynamic potential most useful for the forward problem, where couplings and magnetic fields are given, is the Helmholtz free energy F(\bm{J},\bm{h}) = -\ln Z(\bm{J},\bm{h}). Derivatives of this free energy give the magnetizations, correlations, and other observables. The thermodynamic potential most useful for the inverse problem, where the pair correlations \bm{\chi} and magnetizations \bm{m} are given, is the Legendre transform of the Helmholtz free energy with respect to both couplings and fields [53, 54, 201],

S(\bm{\chi},\bm{m}) = \min_{\bm{J},\bm{h}} \Big[ -\sum_i h_i m_i - \sum_{i<j} J_{ij} \chi_{ij} - F(\bm{J},\bm{h}) \Big] .   (34)

This thermodynamic potential is readily recognised as the entropy function; up to a sign, it gives the maximum likelihood (16) of the model parameters. The transformation (34) thus provides a link between inference via maximum likelihood and the statistical physics of the Ising model as described by its Helmholtz free energy.


The couplings and the fields are found by differentiation,

J_{ij} = -\frac{\partial S}{\partial \chi_{ij}}(\bm{\chi},\bm{m})   (35)
h_i = -\frac{\partial S}{\partial m_i}(\bm{\chi},\bm{m}) ,

where the derivatives are evaluated at the sample correlations and magnetizations. These relationships follow from the inverse transformation of (34),

F(\bm{J},\bm{h}) = \min_{\bm{\chi},\bm{m}} \Big[ -\sum_i h_i m_i - \sum_{i<j} J_{ij} \chi_{ij} - S(\bm{\chi},\bm{m}) \Big] ,   (36)

by setting the derivatives of the term in square brackets with respect to \bm{\chi} and \bm{m} to zero.

In practice, performing the Legendre transformation of both \bm{h} and \bm{J} is often not necessary; derivatives of the Helmholtz free energy F(\bm{J},\bm{h}) with respect to \bm{J} can also be generated by differentiating with respect to \bm{h}, e.g.,

\frac{\partial F}{\partial J_{ij}}(\bm{J},\bm{h}) = \frac{\partial^2 F}{\partial h_i \partial h_j}(\bm{J},\bm{h}) - \frac{\partial F}{\partial h_i}(\bm{J},\bm{h}) \frac{\partial F}{\partial h_j}(\bm{J},\bm{h}) .   (37)

The thermodynamics of the inverse problem can thus be reduced to a single Legendre transform of the Helmholtz free energy, yielding the Gibbs free energy

G(\bm{J},\bm{m}) = \max_{\bm{h}} \Big[ \sum_i h_i m_i + F(\bm{J},\bm{h}) \Big] .   (38)

The magnetic fields are given by the first derivative of the Gibbs free energy,

h_i = \frac{\partial G}{\partial m_i}(\bm{J},\bm{m}) .   (39)

To infer the couplings, we consider the second derivatives of the Gibbs potential, which give

\frac{\partial^2 G}{\partial m_j \partial m_i}(\bm{J},\bm{m}) = (\bm{C}^{-1})_{ij} ,   (40)

where \bm{C} is the matrix of connected correlations C_{ij} ≡ χ_{ij} − m_i m_j. Equation (40) follows from the inverse function theorem,

\left[ \frac{\partial(h_1, \dots, h_N)}{\partial(m_1, \dots, m_N)} \right]_{ij} = \left[ \left( \frac{\partial(m_1, \dots, m_N)}{\partial(h_1, \dots, h_N)} \right)^{-1} \right]_{ij} ,   (41)

and linear response theory,

C_{ij} = \frac{\partial m_j}{\partial h_i}(\bm{J},\bm{h}) = -\frac{\partial^2 F}{\partial h_i \partial h_j}(\bm{J},\bm{h}) ,   (42)

which links the susceptibility of the magnetization to a small change in the magnetic field with the connected correlation [210].

The result (40) turns out to be central to many methods for the inverse Ising problem. The left-hand side of this expression is a function of J_{ij}. If the Gibbs free energy (38) can be evaluated or approximated, (40) can be solved to yield the couplings. Similarly, (39) with the estimated couplings and the sample magnetisations gives the magnetic fields, completing the reconstruction of the parameters of the Ising model.

6. Variational principles

For most systems, neither the free energy F(\bm{J},\bm{h}) nor other thermodynamic potentials can be evaluated. However, there are many approximation schemes for F(\bm{J},\bm{h}) [165], which lead to approximations for the entropy and the Gibbs free energy. Direct approximation schemes for S(\bm{\chi},\bm{m}) and G(\bm{J},\bm{m}) have also been formulated within the context of the inverse Ising problem. The key idea behind most of these approximations is the variational principle.

The variational principle for the free energy is

F(\bm{J},\bm{h}) = \min_q \big\{ U[q] - S[q] \big\} \equiv \min_q \mathcal{F}[q] ,   (43)

where q denotes a probability distribution over spin configurations, U[q] ≡ 〈H〉_q and S[q] ≡ −〈ln q〉_q, and the minimisation is taken over all distributions q. This principle finds its origin in information theory. Take an arbitrary trial distribution q(\bm{s}); the Kullback–Leibler divergence (18) between q and the Boltzmann distribution p is non-negative and vanishes if and only if q = p [56]. One then arrives directly at (43) by rewriting KL(q|p) = U[q] − S[q] − F(\bm{J},\bm{h}).

We will refer to \mathcal{F}[q] ≡ U[q] − S[q] as the functional Helmholtz free energy, also called the non-equilibrium free energy in the context of non-equilibrium statistical physics [171]. Another term in use is 'Gibbs free energy' [165], which we have reserved for the thermodynamic potential (38).

So far, nothing has been gained, as the minimum is over all possible distributions q, including the Boltzmann distribution itself. A practical approximation arises when a constraint is put on q, leading to a family of trial distributions q. Often the minimisation can then be carried out over that family, yielding an upper bound to the Helmholtz free energy [165].

In the context of the inverse problem, it is useful to derive the variational principles for other thermodynamic potentials as well. Using the definition of the Gibbs potential (38) and the variational principle for the Helmholtz potential (43), we obtain

G(\bm{J},\bm{m}) = \max_{\bm{h}} \Big[ \sum_i h_i m_i + \min_q \big\{ U[q] - S[q] \big\} \Big] .   (44)


By means of Lagrange multipliers it is easy to show that this double extremum can be obtained by a single conditional minimisation,

G(\bm{J},\bm{m}) = \min_{q \in \mathcal{G}} \Big[ -\sum_{i<j} J_{ij} \langle \sigma_i \sigma_j \rangle_q - S[q] \Big] ,   (45)

where the set \mathcal{G} denotes all distributions q with a given 〈σ_i〉_q = m_i [165]. We will refer to the functional G[q] = -\sum_{i<j} J_{ij} \langle \sigma_i \sigma_j \rangle_q - S[q] as the functional Gibbs free energy defined on \mathcal{G}.

Similarly, the variational principle can be applied to the entropy function S(\bm{\chi},\bm{m}), leading once again to a close relationship between statistical modelling and thermodynamics. The entropy (34) is found to be

S(\bm{\chi},\bm{m}) = \max_{q \in \mathcal{S}} S[q] ,   (46)

where \mathcal{S} denotes distributions with 〈σ_i〉_q = m_i and 〈σ_i σ_j〉_q = χ_{ij}. This is nothing but the maximum entropy principle [108]: the variational principle identifies the distribution with the maximum information entropy subject to the constraints on magnetisations and spin correlations, which are set equal to their sample averages (see the section on maximum entropy modelling above).

7. Mean-field theory

As a first demonstration of the variational principle, we derive the mean-field theory for the inverse Ising problem. The starting point is an ansatz for the Boltzmann distribution (11) which factorises in the sites [165, 210, 216],

p_{\rm MF}(\bm{s}) = \prod_i \frac{1 + \tilde{m}_i s_i}{2} ,   (47)

thus making the different spin variables statistically independent of one another. The parameters \tilde{m}_i of this ansatz describe the spin magnetizations; each spin has a magnetisation resulting from the effective magnetic field acting on that spin. This effective field arises from its local magnetic field h_i, as well as from couplings with other spins. The mean field giving its name to mean-field theory is the average over typical configurations of the effective field.

Using the mean-field ansatz, we now estimate the Gibbs free energy. Within the mean-field ansatz, the minimisation of the variational Gibbs potential (45) is trivial: there is only a single mean-field distribution (47) that satisfies the constraint \mathcal{G} that spins have magnetisations \bm{m}, namely \tilde{\bm{m}} = \bm{m}. We can thus directly write down the mean-field Gibbs free energy

G_{\rm MF}(\bm{m},\bm{J}) = -\sum_{i<j} J_{ij} m_i m_j + \sum_i \Big[ \frac{1+m_i}{2} \ln\frac{1+m_i}{2} + \frac{1-m_i}{2} \ln\frac{1-m_i}{2} \Big] .   (48)

The equation for the couplings \bm{J} follows from the second-order derivative of G(\bm{m},\bm{J}), cf. equation (40),

(\bm{C}^{-1})_{ij} = -J^{\rm MF}_{ij} , \qquad (i \neq j) .   (49)

Similarly, the reconstruction of the magnetic field follows from the derivative of G(\bm{m},\bm{J}) with respect to m_i, cf. equation (39),

h^{\rm MF}_i = -\sum_{j \neq i} J^{\rm MF}_{ij} m_j + \mathrm{artanh}\, m_i .   (50)

This result establishes a simple relationship between the observed connected correlations and the couplings between spins in terms of the inverse of the correlation matrix. The matrix inverse in (49) is of course much simpler to compute than the maximum of the likelihood (16) and takes only a polynomial number of steps: Gauß–Jordan elimination for inverting an N × N matrix requires O(N^3) operations, compared to the exponentially large number of steps needed to compute the partition function or its derivatives.
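The mean-field reconstruction (49)–(50) amounts to a single matrix inversion; a minimal Python sketch (not part of the original text, assuming the sample correlation matrix is invertible) reads:

import numpy as np

def mean_field_reconstruction(samples):
    """Mean-field reconstruction of couplings and fields, eqs. (49) and (50).

    samples: (M, N) array of +/-1 spin configurations.
    """
    samples = np.asarray(samples, dtype=float)
    m = samples.mean(axis=0)
    C = np.cov(samples, rowvar=False, bias=True)   # connected correlations C_ij
    J_mf = -np.linalg.inv(C)                       # eq. (49)
    np.fill_diagonal(J_mf, 0.0)                    # discard diagonal entries
    h_mf = np.arctanh(m) - J_mf @ m                # eq. (50)
    return J_mf, h_mf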

The standard route to mean-field reconstruction proceeds somewhat differently, namely via the Helmholtz free energy rather than the Gibbs free energy. Early work to address the inverse Ising problem using the mean-field approximation was performed by Peterson and Anderson [177]. In [112], Kappen and Rodríguez construct the Helmholtz functional free energy F^{\rm MF}_{\tilde{\bm{m}}}(\bm{J},\bm{h}) given by (43) under the mean-field ansatz (47). F^{\rm MF}_{\tilde{\bm{m}}}(\bm{J},\bm{h}) is then minimised with respect to the parameters \tilde{\bm{m}} of the mean-field ansatz by setting its derivatives with respect to \tilde{\bm{m}} to zero. This yields equations which determine the values m^{\rm MF}_i of the magnetization parameters that minimize the KL divergence between the mean-field ansatz and the Boltzmann distribution, namely the well-known self-consistent equations

m^{\rm MF}_i = \tanh\Big( h_i + \sum_{j \neq i} J_{ij} m^{\rm MF}_j \Big) .   (51)

Using m^{\rm MF}_i as an approximation for the equilibrium magnetisations m_i, one can derive the so-called linear-response approximation for the connected correlation function, C^{\rm MF-LR}_{ij} \equiv \frac{\partial m^{\rm MF}_i}{\partial h_j}(\bm{h}). Taking derivatives of the self-consistent equations (51) with respect to the local fields gives

\sum_j \Big( \frac{\delta_{ij}}{1 - m_i^2} - J_{ij} \Big) C^{\rm MF-LR}_{jk} = \delta_{ik} ,   (52)


where we have used the fact that the diagonal terms of the coupling matrix are zero. Identifying the result for the connected correlations C^{\rm MF-LR}_{jk} with the sample correlations C_{jk} leads to a system of linear equations to be solved for the couplings [112]. However, to obtain (52), we have used that the diagonal elements J_{ii} are zero. With these constraints, the system of equations (52) becomes over-determined and in general there is no solution. Although different procedures have been suggested to overcome this problem [105, 112, 186], there seems to be no canonical way out of this dilemma. The most common approach is to ignore the constraints on the diagonal elements altogether and invert equation (52) to get

J^{\rm MF-LR}_{ij} = \frac{\delta_{ij}}{1 - m_i^2} - (\bm{C}^{-1})_{ij} .   (53)

This result agrees with the reconstruction via the Gibbs free energy except for the non-zero diagonal couplings, which bear no physical meaning and are to be ignored. No diagonal couplings arise in the approach based on the Gibbs free energy, since equation (40) with j = i does not involve any unknown couplings J_{ij}.

8. The Onsager term and TAP reconstruction

The variational estimate of the Gibbs free energy (48) can be improved further. In 1977, Thouless, Anderson, and Palmer (TAP) advocated adding a term to the Gibbs free energy,

G_{\rm TAP}(\bm{J},\bm{m}) = G_{\rm MF}(\bm{J},\bm{m}) - \frac{1}{2} \sum_{i<j} J_{ij}^2 (1 - m_i^2)(1 - m_j^2) .   (54)

This term can be interpreted as describing the effect of fluctuations of a spin variable on the magnetisation of that spin via their impact on neighbouring spins [219]. It is called the Onsager term, which we will derive in section II A 13 in the context of a systematic expansion around the mean-field ansatz. For the forward problem, adding this term modifies the self-consistent equation (51) to the so-called TAP equation

m^{\rm TAP}_i = \tanh\Big( h_i + \sum_{j \neq i} J_{ij} m^{\rm TAP}_j - m^{\rm TAP}_i \sum_j J_{ij}^2 \big(1 - (m^{\rm TAP}_j)^2\big) \Big) .   (55)

In the inverse problem, the TAP free energy (54) gives the equation for the couplings based on (40),

(\bm{C}^{-1})_{ij} = -J^{\rm TAP}_{ij} - 2 (J^{\rm TAP}_{ij})^2 m_i m_j .   (56)

Solving this quadratic equation gives the TAP reconstruction [112, 215]

J^{\rm TAP}_{ij} = \frac{-2 (\bm{C}^{-1})_{ij}}{1 + \sqrt{1 - 8 (\bm{C}^{-1})_{ij} m_i m_j}} ,   (57)

where we have chosen the solution that coincides with the mean-field reconstruction when the magnetisations are zero. The magnetic fields can again be found by differentiating the Gibbs free energy,

h_i = \mathrm{artanh}(m_i) - \sum_{j \neq i} J^{\rm TAP}_{ij} m_j + m_i \sum_{j \neq i} (J^{\rm TAP}_{ij})^2 (1 - m_j^2) .   (58)
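Like the mean-field formula, the TAP reconstruction (57)–(58) requires only the sample statistics and one matrix inversion; a minimal Python sketch (not part of the original text, assuming the argument of the square root in (57) stays non-negative) reads:

import numpy as np

def tap_reconstruction(samples):
    """TAP reconstruction of couplings and fields, eqs. (57) and (58)."""
    samples = np.asarray(samples, dtype=float)
    m = samples.mean(axis=0)
    C = np.cov(samples, rowvar=False, bias=True)
    Cinv = np.linalg.inv(C)
    J_tap = -2.0 * Cinv / (1.0 + np.sqrt(1.0 - 8.0 * Cinv * np.outer(m, m)))  # eq. (57)
    np.fill_diagonal(J_tap, 0.0)
    h_tap = np.arctanh(m) - J_tap @ m + m * (J_tap**2 @ (1.0 - m**2))         # eq. (58)
    return J_tap, h_tap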

9. Couplings without a loop: mapping to the minimum spanning tree problem

The computational hardness of implementing Boltzmann machine learning (24) comes from the difficulty of computing correlations under the Boltzmann measure, which can require a computational time that scales exponentially with the system size. This scaling originates from the presence of loops in the graph of couplings between the spins. Graphs for which correlations can be computed efficiently are the acyclic graphs or trees, so it comes as no surprise that the first efficient method to solve the inverse Ising problem was developed for trees, already in 1968. This was done by Chow and Liu [50] in the context of a product approximation to a multivariate probability distribution. While the method itself can be used as a crude approximation for models with loops or as a reference point for more advanced methods, the exact result by Chow and Liu is of basic interest in itself, as it provides a mapping of the inverse Ising problem for couplings forming a tree onto a minimum spanning tree (MST) problem. The MST is a core problem in computational complexity theory, for which there are many efficient algorithms. This section on Chow and Liu's result also provides some of the background needed in section II A 10 on the Bethe ansatz and section II A 11 on belief propagation.

We consider an Ising model whose pairwise couplings form a tree. The graph of couplings may consist of several parts that are not connected to each other (in any number of steps along connected node pairs), or it may form one single connected tree, but it contains no loops. We denote the set of nodes (vertices) associated with a spin of the tree T by V_T and the set of edges (couplings between nodes) by E_T. It is straightforward to show that in this case, the Boltzmann distribution of the Ising model can be written in a pairwise factorised form

p_T(\bm{s}) = \prod_{i \in V_T} p_i(s_i) \prod_{(i,j) \in E_T} \frac{p_{ij}(s_i,s_j)}{p_i(s_i)\, p_j(s_j)}   (59)
        = \prod_{(i,j) \in E_T} p_{ij}(s_i,s_j) \prod_{i \in V_T} p_i(s_i)^{1 - |\partial i|} .

Here ∂i denotes the set of neighbours of node i, so |∂i| is the number of nodes i couples to. The distributions p_i and p_{ij} denote the one-point and two-point marginals of p_T.


The KL divergence (19) between the empirical distribution p_D(\bm{s}) and p_T is given by

KL(p_D|p_T) = \sum_{\bm{s}} p_D(\bm{s}) \ln \frac{p_D(\bm{s})}{\prod_{i \in V_T} p_i(s_i) \prod_{(i,j) \in E_T} \frac{p_{ij}(s_i,s_j)}{p_i(s_i)\, p_j(s_j)}} .   (60)

For a given tree, it is straightforward to show that the KL divergence is minimized when the marginals p_i and p_{ij} equal the empirical marginals p^D_i and p^D_{ij}. This gives

\min_{p_i, p_{ij}} KL(p_D|p_T) = -H + \sum_{i \in V_T} H_i - \sum_{(i,j) \in E_T} I_{ij} ,   (61)

where H = -\sum_{\bm{s}} p_D(\bm{s}) \ln p_D(\bm{s}) is the entropy of the empirical distribution p_D, H_i = -\sum_{s_i} p^D_i(s_i) \ln p^D_i(s_i) is the single-site entropy, and I_{ij} is the mutual information between a pair of spins,

I_{ij} = \sum_{s_i, s_j} p^D_{ij}(s_i,s_j) \ln \frac{p^D_{ij}(s_i,s_j)}{p^D_i(s_i)\, p^D_j(s_j)} .   (62)

Assuming the graph of couplings is an (unknown) tree and that empirical estimates of all pairwise mutual informations are available, the inverse Ising problem can then be solved by minimizing the KL divergence (60) over all possible N^{N-2} trees T. The first two terms in eq. (61) do not depend on the choice of the tree, so only the last term needs to be minimized over, which is a sum over local terms on the graph. The optimal tree topology T_{\rm opt} is thus given by

E_{T_{\rm opt}} = \mathrm{argmin}_{E_T} \Big[ -\sum_{(i,j) \in E_T} I_{ij} \Big] .

This is where the mapping onto the MST problem emerges: T_{\rm opt} connects all vertices of the graph and its edges are such that the total sum of their weights is minimal. In our case, each edge weight is the (negative) pairwise mutual information I_{ij} between spin variables. Finding the MST does not require an infeasible exploration of the space of all possible trees. On the contrary, it can be found in a number of steps bounded by O(|V|^2 \ln |V|) by greedy iterative procedures which identify the optimal edges to be added at each step (V is the set of nodes in the data). The most famous algorithms for the MST problem date back to the 1950s and are known under the names of Prim's algorithm, Kruskal's algorithm and Boruvka's algorithm (see e.g. [150]). In practice, one has to compute the empirical estimates of the mutual information from samples and then proceed with one of the above algorithms, as sketched below. An interesting observation which makes the Chow–Liu approach even easier to apply is that one may also use the connected correlations between spins as edge weights [50].

Once the optimal tree T_{\rm opt} has been identified, we still need to find the optimal values of the couplings J_{ij} of the Ising model. This is, however, an easy task: the factorised form of the probability measure over the tree (59) allows one to compute the couplings using the independent-pair approximation, see subsection II A 12.
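As an illustration of this procedure (not part of the original text), the following Python sketch estimates the pairwise mutual informations from binary samples and grows the optimal tree with a simple Prim-style greedy step; it is meant as a didactic outline rather than an efficient implementation:

import numpy as np

def chow_liu_tree(samples):
    """Chow-Liu tree: empirical mutual informations, eq. (62), plus a greedy
    (Prim-style) construction of the maximum-mutual-information spanning tree.
    Returns the list of selected edges (i, j)."""
    samples = np.asarray(samples)
    M, N = samples.shape
    I = np.zeros((N, N))
    for i in range(N):
        for j in range(i + 1, N):
            for si in (-1, 1):
                for sj in (-1, 1):
                    pij = np.mean((samples[:, i] == si) & (samples[:, j] == sj)) + 1e-12
                    pi = np.mean(samples[:, i] == si) + 1e-12
                    pj = np.mean(samples[:, j] == sj) + 1e-12
                    I[i, j] += pij * np.log(pij / (pi * pj))
            I[j, i] = I[i, j]
    in_tree, edges = {0}, []
    while len(in_tree) < N:
        best = max(((i, j) for i in in_tree for j in range(N) if j not in in_tree),
                   key=lambda e: I[e])
        edges.append(best)
        in_tree.add(best[1])
    return edges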

10. The Bethe–Peierls ansatz

The factorising probability distribution (59) can also be used as an ansatz in situations where the graph of couplings is not a tree. In this context,

p_{\rm BP}(\bm{s}) = \prod_i p_i(s_i) \prod_{(i,j) \in E} \frac{p_{ij}(s_i,s_j)}{p_i(s_i)\, p_j(s_j)}   (63)

is called the Bethe–Peierls ansatz [27, 175]. Here (i, j) ∈ E runs over pairs of interacting spins, or equivalently over edges in the graph of couplings. One can parameterize the marginal distributions p_i and p_{ij} using magnetisation parameters \tilde{m}_i and (connected) correlation parameters \tilde{C}_{ij},

p_i(s_i) = \frac{1 + \tilde{m}_i s_i}{2}   (64)
p_{ij}(s_i,s_j) = \frac{(1 + \tilde{m}_i s_i)(1 + \tilde{m}_j s_j) + \tilde{C}_{ij} s_i s_j}{4} ,

subject to the constraints

-1 \le \tilde{m}_i \le 1   (65)
-1 + |\tilde{m}_i + \tilde{m}_j| \le \tilde{C}_{ij} + \tilde{m}_i \tilde{m}_j \le 1 - |\tilde{m}_i - \tilde{m}_j| .

The Bethe–Peierls ansatz can be compared to the mean-field ansatz (47), which assigns a magnetisation (or an effective field) to each spin. The Bethe–Peierls ansatz goes one step further; it assigns to each coupled pair of spins correlations as well as magnetisations. These correlations and magnetisations are then determined self-consistently. An important feature of the Bethe–Peierls ansatz (63) is that its Shannon entropy (3) can be decomposed into spin pairs,

S[p_{\rm BP}] = \sum_i S[p_i] + \sum_{(i,j)} \big( S[p_{ij}] - S[p_i] - S[p_j] \big) .   (66)

For graphs containing loops, the entropy generally contains terms involving larger sets of spins than pairs, a situation we will discuss in section II A 12.

The Bethe–Peierls ansatz is well defined, and indeed exact, when the couplings form a tree. When the graph of couplings contains loops, the probability distribution (63) is not normalized and the ansatz is not well defined. In that case, the Bethe–Peierls ansatz is an uncontrolled approximation, although recently progress has been made in controlling the resulting error [49, 170]. We start with the assumption that there are no loops.


To address the inverse Ising problem, we use the Bethe–Peierls ansatz (63) as a variational ansatz to minimise the functional Gibbs free energy (45). Again the constraint \mathcal{G} in (45) implies \tilde{\bm{m}} = \bm{m}. The remaining minimisation is over the correlation parameters \tilde{\bm{C}},

G_{\rm BP}(\bm{J},\bm{m}) = \min_{\tilde{\bm{C}}} \Big[ -\sum_{(i,j)} J_{ij} \langle \sigma_i \sigma_j \rangle_{p_{\rm BP}} - S[p_{\rm BP}] \Big] ,   (67)

and yields

G_{\rm BP}(\bm{J},\bm{m}) = -\sum_{(i,j)} J_{ij} (C^{\rm BP}_{ij} + m_i m_j)   (68)
 + \sum_i (1 - z_i) \sum_{s_i} \frac{1 + m_i s_i}{2} \ln \frac{1 + m_i s_i}{2}
 + \sum_{(i,j)} \sum_{s_i,s_j} \frac{(1 + m_i s_i)(1 + m_j s_j) + C^{\rm BP}_{ij} s_i s_j}{4} \ln \frac{(1 + m_i s_i)(1 + m_j s_j) + C^{\rm BP}_{ij} s_i s_j}{4} ,

where \bm{C}^{\rm BP} is the optimal value of \tilde{\bm{C}} and satisfies

J_{ij} = \sum_{s_i,s_j} \frac{s_i s_j}{4} \ln \frac{(1 + m_i s_i)(1 + m_j s_j) + C^{\rm BP}_{ij} s_i s_j}{4} .   (69)

Here z_i denotes the number of neighbours (interaction partners with non-zero couplings) of node i. From (40), the equation for the couplings can be found again by equating the second derivative of the Gibbs free energy with the inverse of the correlation matrix, that is,

(\bm{C}^{-1})_{ij} = \frac{C^{\rm BP}_{ij}}{(C^{\rm BP}_{ij})^2 - (1 - m_i^2)(1 - m_j^2)} \qquad (j \neq i) .   (70)

This quadratic equation can be solved for C^{\rm BP}_{ij}; inserting the solution

C^{\rm BP}_{ij} = \frac{1}{2 (\bm{C}^{-1})_{ij}} \left[ 1 - \sqrt{1 + 4\, (\bm{C}^{-1})_{ij}^2\, (1 - m_i^2)(1 - m_j^2)} \right]   (71)

for (\bm{C}^{-1})_{ij} \neq 0, and C^{\rm BP}_{ij} = 0 for (\bm{C}^{-1})_{ij} = 0, into (69) gives the couplings of the Bethe–Peierls reconstruction [160, 186]. For the special case m_i = 0, one obtains the particularly simple result

J^{\rm BP}_{ij} = -\frac{1}{2}\, \mathrm{arsinh}\big[ 2 (\bm{C}^{-1})_{ij} \big] , \qquad (j \neq i) .   (72)

In graph theory, this formula can be related to the expression for the distance in a tree whose links carry weights specified by pair correlations between spin pairs [15]. Correspondingly, the magnetic fields follow from the first derivative of the Gibbs free energy as in (39), giving

h^{\rm BP}_i = (1 - z_i)\, \mathrm{artanh}\, m_i - \sum_{j \in \partial i} J^{\rm BP}_{ij} m_j + \sum_{j \in \partial i} \sum_{s_i,s_j} \frac{s_i + m_j s_i s_j}{4} \ln \frac{(1 + m_i s_i)(1 + m_j s_j) + C^{\rm BP}_{ij} s_i s_j}{4} .   (73)

In section II A 13, we will show that the Bethe–Peierls ansatz, and hence the resulting reconstruction, is exact for couplings forming a tree. However, the reconstruction of couplings and magnetic fields based on the Bethe–Peierls ansatz can also be applied to cases where the couplings do not form a tree. Although the results then arise from an uncontrolled approximation, the quality of the reconstruction can still be rather good. For a comparison of the different approaches, see section II C.
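As an illustration (not part of the original text), the coupling reconstruction of eqs. (71) and (69) can be sketched in a few lines of Python; the sketch assumes the sample correlation matrix is invertible and that the resulting pair marginals remain positive (the fields of eq. (73) could be obtained analogously):

import numpy as np

def bethe_peierls_couplings(samples):
    """Bethe-Peierls reconstruction of the couplings via eqs. (71) and (69)."""
    samples = np.asarray(samples, dtype=float)
    m = samples.mean(axis=0)
    A = np.linalg.inv(np.cov(samples, rowvar=False, bias=True))   # (C^{-1})_{ij}
    N = len(m)
    J = np.zeros((N, N))
    for i in range(N):
        for j in range(i + 1, N):
            a, b = A[i, j], (1 - m[i]**2) * (1 - m[j]**2)
            if a == 0.0:
                continue                                          # C^BP_ij = 0
            c_bp = (1.0 - np.sqrt(1.0 + 4.0 * a * a * b)) / (2.0 * a)   # eq. (71)
            Jij = 0.0
            for si in (-1, 1):
                for sj in (-1, 1):
                    pij = ((1 + m[i]*si) * (1 + m[j]*sj) + c_bp*si*sj) / 4.0
                    Jij += si * sj / 4.0 * np.log(pij)            # eq. (69)
            J[i, j] = J[j, i] = Jij
    return J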

11. Belief propagation and susceptibility propagation

Belief propagation is a distributed algorithm to compute marginal distributions of statistical models, such as p_i(s_i) and p_{ij}(s_i, s_j) of the preceding sections. Again, it is exact on trees, but it also gives a good approximation when the graph of couplings is only locally treelike (so any loops present are long). The term belief propagation is used in the machine learning [172] and statistical physics communities; in coding theory the approach is known as the sum-product algorithm [81]. Belief propagation shares a deep conceptual link with the Bethe–Peierls ansatz; Yedidia et al. [246] showed that belief propagation is a numerical scheme to efficiently compute the parameters of the Bethe–Peierls ansatz.

A detailed exposition and many applications of belief propagation can be found in the textbook by Mezard and Montanari [144]. Here, we briefly introduce the basics and discuss applications to the inverse Ising problem. We start by considering the ferromagnetic Ising model in 1D, that is, a linear chain of N spins. The textbook solution of this problem considers the partition functions of the system when the last spin is constrained to take on the values s_N = ±1, Z_N(+1) and Z_N(-1). The corresponding partition functions for the linear chain with N+1 spins are linked to the former via the so-called transfer matrix [20]. The partition function for a system of any size can be computed iteratively, starting from a single spin and extending the 1D lattice with each multiplication of the transfer matrix. In fact, the transfer matrix can also be used to solve the Ising model on a tree [70]. Belief propagation is similar in spirit, but can be extended (as an approximation) also to graphs which are not trees. The best book-keeping device is again a restricted partition function, namely Z_{i→j}(s_i). It is defined as the partition function of the part of the system containing i when the coupling present between spins i and j has been deleted from the tree and spin i is constrained to take on the value s_i. (Deleting the edge (i, j) splits the tree containing i and j into two disconnected parts.) On a tree we obtain the recursion relation

Z_{i\to j}(s_i) = e^{h_i s_i} \prod_{k \in \partial i \setminus j} \Big( \sum_{s_k} Z_{k\to i}(s_k)\, e^{J_{ik} s_i s_k} \Big) ,   (74)

which can be computed recursively starting from the leaves of the tree (nodes connected to a single edge only). In statistical physics, (74) is called the cavity recursion for partition functions, since deleting a link can be thought of as leaving a 'cavity' in the original system. The partition function for the entire tree with spin i constrained to s_i is then

Z_i(s_i) = e^{h_i s_i} \prod_{k \in \partial i} \Big( \sum_{s_k} Z_{k\to i}(s_k)\, e^{J_{ik} s_i s_k} \Big) .   (75)

The marginal distribution p_i(s_i) can be calculated by normalizing p_i(s_i) ∝ Z_i(s_i), and the marginal distribution p_{ij}(s_i, s_j) can be calculated by normalizing p_{ij}(s_i, s_j) ∝ Z_{i→j}(s_i)\, e^{J_{ij} s_i s_j}\, Z_{j→i}(s_j).

The power of the cavity recursion (76) lies in its application to graphs which are not trees. It is particularly effective when the graph of couplings is at least locally treelike, so any loops present are long. To extend the cavity recursion as an approximation to graphs which are not trees, it makes sense to consider normalized quantities and define π_{i→j}(s_i) = Z_{i→j}(s_i)/\sum_s Z_{i→j}(s), with the recursion

\pi_{i\to j}(s_i) \propto e^{h_i s_i} \prod_{k \in \partial i \setminus j} \Big( \sum_{s_k} \pi_{k\to i}(s_k)\, e^{J_{ik} s_i s_k} \Big) ,   (76)

leading to the estimate of the two-point marginals π_{ij}(s_i, s_j) ∝ e^{J_{ij} s_i s_j} π_{i→j}(s_i) π_{j→i}(s_j). (This step is necessary, as deleting the link (i, j) on a tree leads to two disjoint subtrees with separate partition functions and zero connected correlations. On a locally treelike graph, correlations between s_i and s_j can still be small once the link (i, j) has been cut, but the graph does not split into disjoint parts with separate partition functions.) Associating each edge (i, j) in the graph with particular values of π_{i→j}(s_i) and π_{j→i}(s_j), one can update these values in an iterative scheme which replaces them with the right-hand side of (76) at each step. The fixed point of this procedure obeys (76). In this context, π_{i→j}(s_i) is termed a 'message' that is being 'passed' between nodes. Belief propagation is an example of a message-passing algorithm.
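The update (76) can be implemented compactly by parametrising each message by a cavity field, π_{i→j}(s_i) ∝ exp(g_{i→j} s_i). The following Python sketch (not part of the original text) iterates this update to estimate the magnetisations for the forward problem; such estimates can then feed into Boltzmann machine learning as discussed below:

import numpy as np

def belief_propagation_magnetisations(J, h, n_iter=200, damping=0.5):
    """Iterate the message update (76) in cavity-field form; exact on trees,
    an approximation on loopy graphs. Returns estimated magnetisations."""
    N = len(h)
    neighbours = [np.flatnonzero(J[i]) for i in range(N)]
    g = np.zeros((N, N))                                   # cavity fields g_{i->j}
    for _ in range(n_iter):
        g_new = np.zeros_like(g)
        for i in range(N):
            for j in neighbours[i]:
                g_new[i, j] = h[i] + sum(
                    np.arctanh(np.tanh(J[i, k]) * np.tanh(g[k, i]))
                    for k in neighbours[i] if k != j)
        g = damping * g + (1.0 - damping) * g_new          # damped update
    return np.array([np.tanh(h[i] + sum(np.arctanh(np.tanh(J[i, k]) * np.tanh(g[k, i]))
                                        for k in neighbours[i]))
                     for i in range(N)])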

Belief propagation has been used to solve the inverse Ising problem in two different ways. In the first, belief propagation is used to approximately calculate the magnetisations and correlations given some parameters of the Ising model, and then fields and couplings are updated according to the Boltzmann learning rule (24). To estimate the correlations, one has to define additional messages also for the susceptibilities of each spin, together with their update rules. This approach, termed susceptibility propagation, was developed by Welling and Teh [240, 241] for the forward problem. Susceptibility propagation thus solves the 'forward' problem, that is, it offers a computationally efficient approximation to the correlations and magnetisations, which are then used for Boltzmann machine learning.

A sophisticated variant of susceptibility propagation was developed by Mezard and Mora [8, 145] specifically for the inverse Ising problem, where also the couplings are updated at each step. This approach gives the same reconstruction as the Bethe–Peierls reconstruction from section II A 10. However, the iterative equations can fail to converge, even when the analytical approach (71)–(73) gives valid approximate solutions [160].

12. The independent-pair approximation and the Cocco–Monasson adaptive-cluster expansion

We expect the Bethe–Peierls ansatz to work well when the couplings are locally treelike and loops are long. Conversely, we expect the ansatz to break down when the couplings generate many short loops. Cocco and Monasson developed an iterative procedure to identify clusters of spins whose couplings form short loops and to evaluate their contribution to the entropy (34) [53, 54]. In the context of disordered systems, the expansion of the entropy in terms of clusters of connected spins is known as Kikuchi's cluster variational method [114, 247].

The Cocco–Monasson adaptive-cluster expansion directly approximates the entropy potential (34). We start by considering the statistics of a single spin variable described by its magnetisation m_i, with the entropy

S^{(1)}(m_i) = -\sum_{s_i} \frac{1 + m_i s_i}{2} \ln \frac{1 + m_i s_i}{2} .   (77)

The simplest entropy involving coupled spins is the two-spin entropy

S^{(2)}(m_i, m_j, \chi_{ij}) = -\sum_{s_i,s_j} \frac{1 + m_i s_i + m_j s_j + \chi_{ij} s_i s_j}{4} \ln \frac{1 + m_i s_i + m_j s_j + \chi_{ij} s_i s_j}{4} .   (78)

When the two spins are statistically independent, we obtain S^{(2)}(m_i, m_j, \chi_{ij}) = S^{(1)}(m_i) + S^{(1)}(m_j). Hence the residual entropy which accounts for the correlation between the two spins is \Delta S^{(2)}(m_i, m_j, \chi_{ij}) \equiv S^{(2)}(m_i, m_j, \chi_{ij}) - S^{(1)}(m_i) - S^{(1)}(m_j). To make the notation uniform, we also define \Delta S^{(1)}(m_i) \equiv S^{(1)}(m_i). A very simple approximation is based on the assumption that the N-spin entropy (34) is described by pairwise terms,

S(\bm{\chi},\bm{m}) \approx \sum_i \Delta S^{(1)}(m_i) + \sum_{(i,j)} \Delta S^{(2)}(m_i, m_j, \chi_{ij}) ,   (79)

where the pair (i, j) denotes a cluster of two distinct spins. The couplings and fields can then be obtained via differentiation as in (35),

J_{ij} = \sum_{s_i,s_j} \frac{s_i s_j}{4} \ln \frac{1 + m_i s_i + m_j s_j + \chi_{ij} s_i s_j}{4} ,   (80)
h_i = \frac{2-N}{2} \ln \frac{1 + m_i}{1 - m_i} + \sum_{j \neq i} \sum_{s_i,s_j} \frac{s_i}{4} \ln \frac{1 + m_i s_i + m_j s_j + \chi_{ij} s_i s_j}{4} .

This result is called the independent-pair approximation for the couplings, see [193] and [188]. (Expressions (8a) and (8b) in [188] differ slightly from (80) due to a typo.) When the topology of the couplings is known and forms a tree, equation (79) gives the exact entropy of the system when the second sum is restricted to pairs of interacting spins (see sections II A 9 and II A 10). In this case, (80) gives the exact couplings.
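The independent-pair couplings of eq. (80) depend only on the pair marginals and can be evaluated directly from the data; a minimal Python sketch (not part of the original text, assuming all four pair states occur in the sample so the logarithms stay finite) reads:

import numpy as np

def independent_pair_couplings(samples):
    """Independent-pair approximation for the couplings, first line of eq. (80)."""
    samples = np.asarray(samples, dtype=float)
    M, N = samples.shape
    m = samples.mean(axis=0)
    chi = samples.T @ samples / M
    J = np.zeros((N, N))
    for i in range(N):
        for j in range(i + 1, N):
            Jij = 0.0
            for si in (-1, 1):
                for sj in (-1, 1):
                    pij = (1 + m[i]*si + m[j]*sj + chi[i, j]*si*sj) / 4.0
                    Jij += si * sj / 4.0 * np.log(pij)
            J[i, j] = J[j, i] = Jij
    return J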

However, in most cases the topology is not known, so the second sum in (79) runs over all pairs of spins. In this case, equation (79) is only a (rather bad) approximation to the entropy. However, the independent-pair approximation (79) can serve as a starting point for an expansion that includes clusters of spins of increasing size,

S(\bm{\chi},\bm{m}) = \sum_i \Delta S^{(1)}(m_i) + \sum_{(i,j)} \Delta S^{(2)}(\chi_{ij}, m_i, m_j) + \sum_{(i,j,k)} \Delta S^{(3)}(\chi_{ij}, \chi_{jk}, \chi_{ki}, m_i, m_j, m_k) + \cdots .   (81)

In this expansion, the contribution from clusters consisting of 3 spins is

\Delta S^{(3)} = S^{(3)}(\chi_{ij}, \chi_{jk}, \chi_{ki}, m_i, m_j, m_k) - S^{(2)}(\chi_{ij}, m_i, m_j) - S^{(2)}(\chi_{jk}, m_j, m_k) - S^{(2)}(\chi_{ki}, m_k, m_i) + S^{(1)}(m_i) + S^{(1)}(m_j) + S^{(1)}(m_k) ,   (82)

where we have dropped the arguments of \Delta S^{(3)} to simplify the notation. The residual entropies of higher clusters are defined analogously.

The evaluation of S^{(3)} is not as straightforward as that of S^{(2)} and S^{(1)}, but still tractable. In general, the evaluation of \Delta S^{(k)} requires a computation of order 2^k steps and therefore quickly becomes intractable. Cocco and Monasson argue that the contribution of \Delta S^{(k)} decreases with k and can be neglected from a certain order on [53]. This inspires an adaptive procedure to build up a library of all clusters which contribute significantly to the total entropy of the system. Starting from 1-clusters, one constructs all 2-clusters by merging pairs of 1-clusters. If the entropy contribution of the new cluster is larger than some threshold, the new cluster is added to a list of clusters. One then constructs the 3-clusters, 4-clusters in the same way, until no new cluster gives a contribution to the entropy exceeding the threshold. The total entropy and the reconstructed couplings and fields are updated each time a new cluster is added. This procedure fares well in situations where there are many short loops, as in a regular lattice in more than one dimension, or when the graph of couplings is highly heterogeneous and some subsets of spins are strongly connected among each other.

The approaches introduced so far are each built on an ansatz for the Boltzmann distribution, such as the mean-field ansatz (47) or the Bethe–Peierls ansatz (63): They are all uncontrolled approximations. In the next sections, we discuss controlled approximations based on an expansion in either small couplings or small correlations.

13. The Plefka expansion

In 1982, Plefka gave a systematic expansion of the Gibbs free energy of the Sherrington–Kirkpatrick model [203] in the couplings between spins [180]. Already 8 years earlier, Bogolyubov Jr. et al. had used similar ideas in the context of the ferromagnetic Ising model on a lattice [33]. The resulting estimate of the partition function can also be used to derive new solutions of the inverse Ising problem. Zeroth- and first-order terms of Plefka's expansion turn out to yield mean-field theory (48), the second-order term gives the TAP free energy (54), and correspondingly the reconstructions of couplings (49) and (57).

Plefka's expansion is a Legendre transformation of the cumulant expansion of the Helmholtz free energy F(\bm{J},\bm{h}). Plefka introduced a perturbation parameter λ into the interacting part of the Hamiltonian,

H(\bm{s}) = H_0(\bm{s}) + \lambda V(\bm{s}) ,   (83)

where H_0(\bm{s}) = -\sum_i h_i s_i and V(\bm{s}) = -\sum_{i<j} J_{ij} s_i s_j.

The perturbation parameter λ serves to distinguish thedifferent orders in the strength of couplings and will beset to one at the end of the calculation. The standard cu-mulant expansion of the Helmholtz free energy Fλ(JJJ,hhh)then reads

Fλ(JJJ,hhh) = F (0)(hhh)−λ 〈V 〉0 +λ2

2

[⟨V 2⟩0− 〈V 〉20

]+ · · · ,

(84)where F (0)(hhh) =

∑i 2 coshhi and 〈〉0 denotes the av-

erage with respect to the Boltzmann distribution corre-sponding to the non-interacting part H0 of the Hamil-tonian. Next we perform the Legendre transformation


of F_\lambda(J,h) with respect to h to obtain the perturbative series for the Gibbs free energy

G_\lambda = G^{(0)} + \lambda G^{(1)} + \lambda^2 G^{(2)} + \cdots, \qquad (85)

where we suppressed the dependence of G on J and m to simplify the notation. Using the definition of the Gibbs free energy (38) we solve

G_\lambda(J,m) = \sum_i h_i m_i + F_\lambda(J,h) \qquad (86)

with the local fields satisfying

m_i = -\frac{\partial F_\lambda(J,h)}{\partial h_i}. \qquad (87)

Plugging the perturbative series

h_\lambda = h^{(0)} + \lambda h^{(1)} + \lambda^2 h^{(2)} + \cdots \qquad (88)

into (87) and re-expanding the right-hand side in λ, one finds successive orders of h. In particular, the lowest two orders are found easily as

h_i^{(0)} = \operatorname{artanh} m_i, \qquad h_i^{(1)} = -\sum_j J_{ij} m_j. \qquad (89)

This gives the Gibbs free energy up to first order in λ,

G^{(0)}(m) = \sum_i \left[\frac{1+m_i}{2}\ln\frac{1+m_i}{2} + \frac{1-m_i}{2}\ln\frac{1-m_i}{2}\right],
G^{(1)}(m) = -\sum_{i<j} J_{ij} m_i m_j. \qquad (90)

Continuing to the second order, one obtains the Onsager term (54)

G^{(2)}(m) = -\frac{1}{2}\sum_{i<j} J_{ij}^2 (1-m_i^2)(1-m_j^2). \qquad (91)

A systematic way to perform this expansion has been developed by Georges and Yedidia [85], leading to

G^{(3)}(m) = -\frac{2}{3}\sum_{(i,j)} J_{ij}^3\, m_i(1-m_i^2)\, m_j(1-m_j^2) - \sum_{(i,j,k)} J_{ij}J_{jk}J_{ki}\,(1-m_i^2)(1-m_j^2)(1-m_k^2), \qquad (92)

G^{(4)}(m) = \frac{1}{12}\sum_{(i,j)} J_{ij}^4\,(1-m_i^2)(1-m_j^2)\,(1+3m_i^2+3m_j^2-15m_i^2 m_j^2)
- \sum_{(i,j,k,l)} J_{ij}J_{jk}J_{kl}J_{li}\,(1-m_i^2)(1-m_j^2)(1-m_k^2)(1-m_l^2)
- 2\sum_{(i,j,k)} J_{ij}^2 J_{jk}J_{ki}\, m_i(1-m_i^2)\, m_j(1-m_j^2)\,(1-m_k^2).

The mean-field and TAP reconstructions of couplings (53) and (57) then follow from (40), and the first derivatives of G with respect to the magnetisations give the corresponding reconstructions of the magnetic fields (50) and (58). Higher-order terms of the Plefka expansion are discussed by Georges and Yedidia [85, 245], but have not yet been applied to the inverse Ising problem. For a fully connected ferromagnetic model with all couplings set to J/N (to ensure an extensive Hamiltonian) one finds that already the second-order term G^{(2)} vanishes relative to the first in the thermodynamic limit. For the Sherrington–Kirkpatrick model of a spin glass [203], couplings are of the order of N^{-1/2}, again to make the energy extensive. In that case, G^{(0)}, G^{(1)}, and G^{(2)} turn out to be extensive, but the third order vanishes in the thermodynamic limit. In specific instances (not independently and identically distributed couplings), higher-order terms of the Plefka expansion may turn out to be important.
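For reference, the inversions implied by these low orders of the expansion can be written in closed form. The sketch below uses the standard expressions usually quoted for the mean-field and TAP reconstructions (the equation numbers (53) and (57) refer to the earlier part of this review; treat the exact form here as an assumption rather than a transcription):

import numpy as np

def mf_and_tap_couplings(m, chi):
    # mean-field and TAP couplings from sampled magnetisations and correlations
    C = chi - np.outer(m, m)                   # connected correlations
    Cinv = np.linalg.inv(C)
    J_mf = -Cinv.copy()                        # naive mean-field inversion
    np.fill_diagonal(J_mf, 0.0)
    # TAP: solve (C^-1)_ij = -J_ij - 2 J_ij^2 m_i m_j for each pair i != j
    # (the square-root argument can turn negative at strong couplings, signalling breakdown)
    mm = np.outer(m, m)
    with np.errstate(divide="ignore", invalid="ignore"):
        J_tap = (np.sqrt(1.0 - 8.0 * mm * Cinv) - 1.0) / (4.0 * mm)
    J_tap = np.where(np.abs(mm) < 1e-8, J_mf, J_tap)   # reduces to mean field for m_i m_j -> 0
    np.fill_diagonal(J_tap, 0.0)
    return J_mf, J_tap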

The terms in J_ij, J_ij^2, J_ij^3, etc. appearing in (90)–(92) resum to the results of the Bethe–Peierls ansatz [85]. If the couplings form a tree, these terms are the only ones contributing to the Gibbs free energy; terms such as J_ij J_jk J_ki in (92) quantify the coupling strengths around a closed loop of spins and are zero if couplings form a tree. This finally shows that the Bethe–Peierls ansatz is exact on a tree. However, when the couplings contain loops, these terms contribute to the Gibbs free energy. Beyond evaluating the Plefka expansion at different orders, one can also resum the contributions from specific types of loops to infinite order. We will discuss such an approach in the next section.

14. The Sessak–Monasson small-correlation expansion

The Sessak–Monasson expansion is a perturbative expansion of the entropy S(χ,m) in terms of the connected correlations C_ij ≡ χ_ij − m_i m_j [201]. For zero correlations C_ij, the couplings J_ij should also be zero. This motivates an expansion of the free energy F(J,h) in terms of the connected correlations around a non-interacting system; the Legendre transformation (34) then yields the perturbative series of the entropy function S(χ,m). Equivalently, this is an expansion of equations (23) for the fields and the couplings in the connected correlations C. To make the perturbation explicit, we replace C by λC, where the perturbation parameter λ is to be set to one at the end. The couplings and fields are the solutions of

m_i = -\frac{\partial F(J,h)}{\partial h_i}, \qquad \lambda C_{ij} + m_i m_j = -\frac{\partial F(J,h)}{\partial J_{ij}}, \qquad (93)


where the latter can be replaced by

\lambda C_{ij} = -\frac{\partial^2 F(J,h)}{\partial h_i \partial h_j}. \qquad (94)

We then expand the solutions h and J in a power series around the uncorrelated case λ = 0,

h_\lambda = h^{(0)} + \lambda h^{(1)} + \frac{\lambda^2}{2} h^{(2)} + \cdots, \qquad J_\lambda = J^{(0)} + \lambda J^{(1)} + \frac{\lambda^2}{2} J^{(2)} + \cdots. \qquad (95)

At the zeroth order in the connected correlations, spins are uncorrelated, so J^{(0)} = 0 and h_i^{(0)} = \operatorname{artanh}(m_i). The expansion of the model parameters induces an expansion of the Hamiltonian

H_\lambda = H^{(0)} + \lambda H^{(1)} + \frac{\lambda^2}{2} H^{(2)} + \cdots, \qquad (96)

where

H^{(k)}(s) = -\sum_{i<j} J_{ij}^{(k)} s_i s_j - \sum_i h_i^{(k)} s_i. \qquad (97)

Likewise, the free energy F(J,h) = -\ln Z(J,h) (or the cumulant generator) can be expanded in λ,

F_\lambda = F^{(0)} + \lambda F^{(1)} + \lambda^2 F^{(2)} + \cdots
= F^{(0)} - \lambda \langle H^{(1)} \rangle_0 + \frac{\lambda^2}{2}\left[-\langle H^{(2)} \rangle_0 + \langle (H^{(1)})^2 \rangle_0 - \langle H^{(1)} \rangle_0^2\right] + \cdots, \qquad (98)

where the subscript \langle\cdot\rangle_0 refers to the thermal average under the zeroth-order Hamiltonian H^{(0)}(s). For example, the first-order term is

F^{(1)} = -\langle H^{(1)} \rangle_0 = \sum_{i<j} J_{ij}^{(1)} \operatorname{th}(h_i^{(0)})\operatorname{th}(h_j^{(0)}) + \sum_i h_i^{(1)} \operatorname{th}(h_i^{(0)}). \qquad (99)

Equations (93) and (94) then read

m_i = -\frac{\partial F_\lambda}{\partial h_i^{(0)}} = -\frac{\partial F^{(0)}}{\partial h_i^{(0)}} - \lambda \frac{\partial F^{(1)}}{\partial h_i^{(0)}} - \lambda^2 \frac{\partial F^{(2)}}{\partial h_i^{(0)}} - \cdots, \qquad (100)

\lambda C_{ij} = -\frac{\partial^2 F_\lambda}{\partial h_i^{(0)} \partial h_j^{(0)}} = -\frac{\partial^2 F^{(0)}}{\partial h_i^{(0)} \partial h_j^{(0)}} - \lambda \frac{\partial^2 F^{(1)}}{\partial h_i^{(0)} \partial h_j^{(0)}} - \cdots.

Evaluating these equations successively gives solutions for the different orders of h_i^{(k)} and J_{ij}^{(k)}, which in turn yield expressions for reconstructed fields and couplings when the magnetisations m and the connected correlations C are identified with their empirical values. Recalling the zeroth-order solution h_i^{(0)} = \operatorname{artanh}(m_i), the first-order contribution of the free energy (99) leads to

J_{ij}^{(1)} = \frac{C_{ij}}{[1-m_i^2][1-m_j^2]}, \qquad h_i^{(1)} = -\sum_{j\neq i} J_{ij}^{(1)} m_j. \qquad (101)

Higher-order terms are given in [201]. Also in [201], Sessak and Monasson give a diagrammatic framework suitable for the derivation of higher-order terms in the couplings and sum specific terms to infinite order. Roudi et al. [193] simplified the results, yielding the reconstruction

J_{ij} = -(C^{-1})_{ij} + J^{\rm IP}_{ij} - \frac{C_{ij}}{(1-m_i^2)(1-m_j^2) - C_{ij}^2} \qquad (102)

for couplings with i ≠ j, where J^{\rm IP}_{ij} is the independent-pair approximation (80). This result can be interpreted as follows [193]: The first term is the mean-field reconstruction, which corresponds to one sub-series of the Sessak–Monasson expansion. One then adds the sub-series that constitutes the independent-pair approximation (80). The last term is the overlap of the two series: the mean-field reconstruction of two spins considered independently, which needs to be subtracted. The resummation of other sub-series for special cases is also possible [201]; in [105] loop diagrams are resummed to obtain a reconstruction that is particularly robust against sampling noise.
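Evaluating (102) requires nothing beyond the sampled magnetisations, the correlation matrix, and the independent-pair couplings; a minimal numpy sketch (reusing an independent-pair routine such as the one sketched after (80)):

import numpy as np

def combined_reconstruction(m, chi, J_ip):
    # combined reconstruction (102): mean-field term + independent-pair term - overlap
    C = chi - np.outer(m, m)                        # connected correlations C_ij
    Cinv = np.linalg.inv(C)
    overlap = C / (np.outer(1 - m**2, 1 - m**2) - C**2)
    J = -Cinv + J_ip - overlap
    np.fill_diagonal(J, 0.0)                        # (102) applies to i != j only
    return J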

B. Logistic regression and pseudolikelihood

Pseudolikelihood is an alternative to the likelihood function (16) and leads to the exact inference of model parameters in the limit of an infinite number of samples [4, 102, 156]. The computational complexity of this approach scales polynomially with the number of spin variables N, but also with the number of samples M. In practice, this is usually much faster than exact maximisation of the likelihood function, whose computational complexity is exponential in the system size. Pseudolikelihood inference was developed by Julian Besag in 1974 in the context of statistical inference of data with spatial dependencies [25]. It is closely related to logistic regression. While pseudolikelihood and regression have been used widely in statistical inference [26, 110, 183], this approach was not well known in the physics community until quite recently [7, 62, 71, 156].

Our derivation focuses on the physical character of regression analysis, which we then link to Besag's pseudolikelihood. The key observation is that although the likelihood function (16) depends on the model parameters in a complicated way, one can simplify this dependence by separating different groups of parameters. Let us consider how the statistics of a particular spin variable σ_i depends on the configuration of all other spins. We split the Hamiltonian into two parts

H(s) = H_i(s) + H_{\setminus i}(s \setminus s_i) = -h_i s_i - s_i \sum_{j\neq i} J_{ij} s_j + H_{\setminus i}(s \setminus s_i), \qquad (103)

such that the first part H_i(s) depends on the magnetic field h_i and the couplings J_{ij} of spin i to the other spins,


while the part H_{\setminus i}(s \setminus s_i) does not. Here s \setminus s_i denotes all spin variables except spin s_i. This splitting of variables is possible since the statistics of σ_i conditioned on the remaining spins \{s_j\}_{j\neq i} is fully captured by h_i and J_{ij}, j ∈ \{1, …, N\}.

The expectation values of σ_i can be computed based on the partition function

Z(J,h) = \sum_{s\setminus s_i} 2\cosh\Big(h_i + \sum_{j\neq i} J_{ij} s_j\Big)\, e^{-H_{\setminus i}(s\setminus s_i)}, \qquad (104)

where only spin i has been summed over. Differentiating the partition function in this form with respect to h_i and J_{ij} yields the expectation values

\langle \sigma_i \rangle = \Big\langle \operatorname{th}\Big(h_i + \sum_{k\neq i} J_{ik}\sigma_k\Big) \Big\rangle, \qquad (105)

\langle \sigma_i \sigma_j \rangle = \Big\langle \sigma_j\, \operatorname{th}\Big(h_i + \sum_{k\neq i} J_{ik}\sigma_k\Big) \Big\rangle, \qquad (106)

where on both sides the thermal average is over the Boltzmann measure e^{-H(s)}/Z. The first equation follows from

\langle \sigma_i \rangle = \frac{1}{Z}\sum_{s\setminus s_i} e^{-H_{\setminus i}} \sum_{s_i} s_i e^{-H_i} = \frac{1}{Z}\sum_{s\setminus s_i} e^{-H_{\setminus i}}\, 2\sinh\Theta_i = \frac{1}{Z}\sum_{s\setminus s_i} e^{-H_{\setminus i}} \tanh\Theta_i\, 2\cosh\Theta_i = \frac{1}{Z}\sum_{s\setminus s_i} e^{-H_{\setminus i}} \tanh\Theta_i \sum_{s_i} e^{-H_i} = \langle \tanh\Theta_i \rangle \qquad (107)

with the shorthand \Theta_i = h_i + \sum_{k\neq i} J_{ik}\sigma_k. The second equation follows analogously.

In statistical physics, equations (105) and (106) are known as Callen's identities [44] and have been used to compute coupling parameters from observables in Monte Carlo simulations for the numerical calculation of renormalisation-group trajectories [214]. While these equations are exact, the expectation values on the right-hand sides still contain the average over the N − 1 spins other than i.

The crucial step and the only approximation involved is to replace the remaining averages in (106) with the sample means

\langle \sigma_i \rangle_D = \Big\langle \operatorname{th}\Big(h_i^{\rm PL} + \sum_{k\neq i} J_{ik}^{\rm PL}\sigma_k\Big) \Big\rangle_D, \qquad (108)

\langle \sigma_i \sigma_j \rangle_D = \Big\langle \sigma_j\, \operatorname{th}\Big(h_i^{\rm PL} + \sum_{k\neq i} J_{ik}^{\rm PL}\sigma_k\Big) \Big\rangle_D.

We now have a system of non-linear equations to be solved for h_i and J_{ij} for fixed i and various j, j ≠ i. Standard methods to solve such equations are Newton–Raphson or conjugate gradient descent [93, 181].

With these steps, we have broken down the problem of estimating the magnetic fields and the full coupling matrix into N separate problems of estimating one magnetic field and a single row of the coupling matrix for a specific spin i. Crucially, the Boltzmann average over 2^{N-1} states has been replaced with an average over all configurations of the samples. As a result, the computation of (108) uses not only the sample magnetizations and correlations (sufficient statistics), but all spin configurations that have been sampled. The time to evaluate (108) is thus linear in the number of samples M. In general, the coupling matrix inferred in this way will be asymmetric, J_{ij} ≠ J_{ji}, due to sampling noise. A practical solution is to use the average \frac{1}{2}(J_{ij} + J_{ji}) as an estimate of the (symmetric) coupling matrix.

The set of equations (108) can be viewed as solving a gradient-descent problem of a logistic regression [183]. The statistics of σ_i conditioned on the values of the remaining spins \{s_j\}_{j\neq i} can be written as

p(s_i|\{s_j\}_{j\neq i}) = \frac{1}{2}\Big[1 + s_i\, \operatorname{th}\Big(h_i + \sum_{j\neq i} J_{ij}s_j\Big)\Big] = \frac{1}{1 + e^{-2 s_i (h_i + \sum_{j\neq i} J_{ij}s_j)}}. \qquad (109)

From this expression for the conditional probability, the i-th row of couplings J_{i*} and the field h_i are simply the coefficients and the intercept in the multivariate logistic regression of the response variable σ_i on the variables \{s_j\}_{j\neq i}. The log-likelihood for the i-th row of couplings J_{i*} and the magnetic field h_i of this regression problem is

L^i_D(J_{i*}, h_i) = \frac{1}{M}\sum_\mu \ln p(s^\mu_i|\{s^\mu_j\}_{j\neq i}) = \frac{1}{M}\sum_\mu \ln \frac{1}{2}\Big[1 + s^\mu_i\, \operatorname{th}\Big(h_i + \sum_{j\neq i} J_{ij}s^\mu_j\Big)\Big]. \qquad (110)

Setting the derivatives of this likelihood function with respect to the magnetic field h_i and the entries of J_{i*} to zero recovers (108).

Considering all couplings and fields together, one can sum (110) over all rows of couplings to obtain the so-called (log-)pseudolikelihood [26]

L^{\rm PL}(J,h) = \sum_i L^i_D(J_{i*}, h_i), \qquad (111)

which can be maximized with respect to all rows of the coupling matrix, yielding an asymmetric matrix as discussed above. The pseudolikelihood can also be maximized within the space of symmetric matrices, although


the maximisation problem is harder [7]; rather than solving N independent gradient-descent problems in N variables, we have a single problem in N(N + 1)/2 variables. In practice, maximising the pseudolikelihood without the symmetry constraint on the coupling matrix is preferred because of its simplicity and efficiency.
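In practice the unconstrained maximisation amounts to one logistic regression per spin, for which standard library routines can be used. A minimal sketch with scikit-learn (samples is assumed to be an M×N array of ±1 spin configurations; the factor 2 in the exponent of (109) is absorbed by halving the fitted coefficients, and the weak ℓ2 penalty stands in for plain maximum likelihood):

import numpy as np
from sklearn.linear_model import LogisticRegression

def pseudolikelihood_fit(samples):
    # one logistic regression (110) per spin, then symmetrise the couplings
    M, N = samples.shape
    J = np.zeros((N, N))
    h = np.zeros(N)
    for i in range(N):
        X = np.delete(samples, i, axis=1)            # all spins except spin i
        y = (samples[:, i] + 1) / 2                  # map {-1,+1} to {0,1}
        reg = LogisticRegression(C=1e6, max_iter=1000).fit(X, y)
        row = np.insert(reg.coef_[0], i, 0.0)        # no self-coupling J_ii
        J[i] = row / 2.0                             # p(s_i|rest) = logistic(2(h_i + sum_j J_ij s_j))
        h[i] = reg.intercept_[0] / 2.0
    return (J + J.T) / 2.0, h                        # average J_ij and J_ji as discussed in the text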

The reconstruction based on maximizing the pseudolikelihood is of a different nature than the previous methods, which approximated the Boltzmann measure. The Boltzmann measure specifies the probability of observing a particular spin configuration s in equilibrium. On the other hand, the conditional probability p(s^\mu_i|\{s^\mu_j\}_{j\neq i}) only gives the probability of observing a given spin s_i conditioned on the remaining spin variables and cannot be used to generate the configurations of all spins. Yet, as a function of the model parameters, the log-pseudolikelihood (110) has the same maximum as the likelihood (16) in the limit M → ∞, when the sample average in (108) coincides with the corresponding expectation values under the Boltzmann distribution.

Curiously, the pseudolikelihood can also be used to obtain a variant of the mean-field and TAP reconstructions. We follow Roudi and Hertz [190] and replace expressions such as \langle \operatorname{th}(h_i + \sum_{j\neq i} J_{ij}\sigma_j)\rangle by \operatorname{th}(h_i + \sum_{j\neq i} J_{ij}\langle\sigma_j\rangle), thus replacing the effective local field by its mean. The resulting approximation of Callen's identity (105) is

m^{\rm MF}_i = \operatorname{th}\Big(h_i + \sum_{j\neq i} J_{ij} m^{\rm MF}_j\Big), \qquad (112)

which is the mean-field equation (51) for the magnetizations. Then, replacing σ_i by \langle\sigma_i\rangle + (σ_i − \langle\sigma_i\rangle) in (106) and expanding the equation to the lowest orders in (σ_i − \langle\sigma_i\rangle) gives

C^{\rm PL-MF}_{ij} = [1-(m^{\rm MF}_i)^2] \sum_{k\neq i} J_{ik}\, C^{\rm PL-MF}_{kj}. \qquad (113)

Identifying the pseudolikelihood mean-field magnetisations m^{\rm MF}_i and connected correlations C^{\rm PL-MF}_{ij} with the sample magnetisations m_i and sampled correlations C_{ij}, this linear equation can be solved for J^{\rm PL-MF}_{ik} for fixed i, giving

J^{\rm PL-MF}_{ik} = [1-m_i^2]^{-1} \sum_{j\neq i} C_{ij}\, [(C_{\setminus i})^{-1}]_{jk}, \qquad (114)

where C_{\setminus i} is the submatrix of the correlation matrix with row and column i removed. This reconstruction is closely related, but not identical, to the mean-field reconstruction (53). In particular, the diagonal terms are naturally excluded. Numerical experiments show that this reconstruction gives a good correlation between the reconstructed and true couplings, but the magnitude of the couplings is systematically underestimated. Continuing the expansion to second order in (σ_i − \langle\sigma_i\rangle) leads to a variant of the TAP reconstruction (57).

associated potential | exact | variational approximation | perturbative expansion
F(J,h) | convex nonlinear optimisation | |
G(J,m) | | mean-field, Bethe–Peierls | Plefka: mean-field, TAP, Bethe–Peierls
S(χ,m) | | independent-pair approximation, Cocco–Monasson | Sessak–Monasson

TABLE I. Classification of reconstruction methods based on the thermodynamic potentials used and the approximations employed to evaluate them. Pseudolikelihood maximisation falls outside this classification scheme, as it is not based on an approximation of the Boltzmann measure.

C. Comparison of the different approaches

Table I gives a classification of the different reconstruction methods based on the thermodynamic potentials they employ and the approximations used to evaluate them. Some of these approximations are exact in particular limits, and fail in others. This has consequences for how well a reconstruction method works in a given regime. For instance, the TAP equations become exact for fully connected systems (with couplings between all spin pairs drawn from a distribution with variance 1/N) in the thermodynamic limit. Hence, we expect the TAP reconstruction to perform well when couplings are uniformly distributed across spin pairs, and couplings are sufficiently weak so there is no ergodicity breaking (see subsection II D). Similarly, the Bethe–Peierls approximation is exact when the non-zero couplings between spin pairs form a tree. Hence, we expect the Bethe–Peierls reconstruction to work perfectly in this case, and to work well when the graph of couplings is locally treelike (so there are no short loops). The adaptive cluster expansion, on the other hand, is expected to work well even when there are short loops.

In this section, we compare the reconstruction methods discussed so far. We consider the reconstruction problem of an Ising model (2) with couplings J^0_{ij} between pairs


FIG. 4. Reconstructing a fully connected Ising model with different methods. The scatter plots are generated from a particular realization of the Sherrington–Kirkpatrick model with N = 20, β = 1.3 and M = 15000 samples (configurations drawn from the Boltzmann measure, see text). The colour legend indicates the different reconstruction methods: mean-field (MF), TAP, Bethe–Peierls (BP), Sessak–Monasson (SM), the adaptive cluster expansion (ACE), maximum pseudolikelihood (MPL), and maximum likelihood (ML). The top plots show the reconstructed couplings/fields against the couplings/fields of the original model. The bottom plots show the connected correlations and magnetisations calculated from M samples generated using the underlying model parameters (on the x-axis) and the same quantities generated using the inferred parameters (y-axis). The Sessak–Monasson expansion was computed as described in [201] with the series in the magnetic fields truncated at the third order, excluding the loop terms. The adaptive cluster expansion was carried out using the ACE package [17, 53] with a maximum cluster size k = 6 and default parameters. The numerical maximisation of the likelihood and pseudolikelihood was done using the Eigen 3 wrappers [89] for Minpack [155].

of spins defined on a certain graph. We explore two aspects of the model parameters: the type of the interaction graph and the coupling strength (temperature). We consider three different graphs: the fully connected graph, a random graph with fixed degree as a representative of graphs with long loops, and the 2D square lattice as a representative of graphs with short loops. In this section, we denote the couplings of the model underlying the data by the superscript '0' to distinguish them from the inferred couplings. For each graph, every edge is assigned a coupling J^0_{ij} drawn from a certain distribution. We use the Gaussian distribution with zero mean and standard deviation β/√N for couplings on the fully connected graph, leading to the Sherrington–Kirkpatrick (SK) model. For the tree and square lattice, the uniform distribution on the interval [−β, β] is used for the couplings.


FIG. 5. Reconstruction of a fully connected Ising model as a function of the coupling strength β and the number of samples M. The top panel shows the reconstruction error γ_J defined by (115) as a function of β at a constant M = 15000 samples. The smaller panels show γ_J as a function of the number of samples M at different coupling strengths β and for different methods (mean-field, TAP, Bethe–Peierls, Sessak–Monasson, ACE, and maximum pseudolikelihood). For these plots a larger system size N = 64 was used, so evaluating the likelihood is not feasible; at lower system sizes we found pseudolikelihood performs as well as the exact maximization of the likelihood in the parameter range considered.

By tuning β, we effectively change the coupling strength. For the SK model, external magnetic fields h^0_i are drawn uniformly from the interval [−0.3β, 0.3β].


At each value of β, we generate M samples (spin configurations) drawn from the Boltzmann distribution. Each sample is obtained by simulating 10^4 N Monte Carlo steps using the Metropolis transition rule, starting from a random configuration. Although this is not always sufficient to guarantee that the system has reached equilibrium, the results are not sensitive to increasing this burn-in time.
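For concreteness, a minimal sampler of this kind is sketched below (names and the number of steps are illustrative; any standard Metropolis implementation for model (2) will do):

import numpy as np

def metropolis_sample(J, h, n_steps, rng):
    # one configuration after n_steps single-spin Metropolis updates from a random start
    N = len(h)
    s = rng.choice([-1, 1], size=N)
    for _ in range(n_steps):
        i = rng.integers(N)
        dE = 2 * s[i] * (J[i] @ s + h[i])       # energy change of flipping spin i (J_ii = 0)
        if dE <= 0 or rng.random() < np.exp(-dE):
            s[i] = -s[i]
    return s

# usage: M independent samples, each from a fresh random initial configuration
# rng = np.random.default_rng(0)
# samples = np.array([metropolis_sample(J0, h0, 10**4 * N, rng) for _ in range(M)])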

Figure 4, top left, compares the reconstructed couplings with the couplings of the original model at β = 1.3. We chose this value as it results in couplings which are sufficiently large so that the different methods perform differently (we will see that at low β all methods perform similarly). On the fully connected graph, the adaptive cluster expansion performs poorly as it sets too many couplings to zero. Mean-field reconstruction somewhat overestimates large couplings [16]. This error also affects the estimate of the magnetic fields. The TAP and Bethe–Peierls reconstructions correct this overestimate quite effectively. The reconstruction by pseudolikelihood stands out by providing an accurate reconstruction of the model parameters. One consequence is that in figure 4 the symbols indicating the results from pseudolikelihood are largely obscured by those from the exact maximization of the likelihood (16). We also compare the correlations and magnetizations based on the reconstructed parameters with those of the original model. The correlations and magnetisations are sampled as described above. We find significant bias in the results from all methods except for likelihood maximization and pseudolikelihood maximization (bottom panels of figure 4).

Next, we explore how the reconstruction quality depends on the coupling strength β and the number of samples M. To this end, we define the relative reconstruction error

\gamma_J = \sqrt{\frac{\sum_{i<j}(J_{ij} - J^0_{ij})^2}{\sum_{i<j}(J^0_{ij})^2}}, \qquad (115)

which compares the reconstructed couplings J to the couplings of the original model underlying the data, J^0. Figure 5, large panel, shows the relative reconstruction error γ_J as a function of coupling strength β. At low β, the connected correlations are small, so all reconstruction methods are equally limited by sampling noise: The relative reconstruction error increases with decreasing β as the couplings become small relative to the sampling error. This is also compatible with the result that at weak couplings the relative errors of all methods decrease with an increasing number of samples, as seen in the 6 small panels of figure 5.

We find that the pseudolikelihood reconstruction outperforms the other (non-exact) methods over the entire range of β, and correctly reconstructs the model parameters even in the glassy phase at strong couplings [7]. This is because the conditional statistics of a single spin (conditioned on the other spins) is correctly described by (109) at any coupling strength. Although at strong couplings the error of the pseudolikelihood reconstruction grows with β, this can be compensated by increasing the number of samples (Figure 5, bottom right). This is not so for all other approximate methods (Figure 5, small panels): At high β, the approximation each method is based on breaks down, which cannot be compensated by a larger number of samples.

Figure 6 shows the reconstruction errors of different methods for a random graph of fixed degree (column A) and the square lattice (column B). Again, pseudolikelihood performs well in these cases, as does the Bethe–Peierls reconstruction. The adaptive cluster expansion shows a remarkable behaviour: For both graphs, it has a very small reconstruction error at weak coupling strength, but breaks down at strong couplings. At weak couplings, where all other methods have similar reconstruction errors arising from sampling noise, the adaptive cluster expansion appears to avoid this source of error. The adaptive cluster expansion explicitly assumes some couplings are exactly zero, so the reconstruction is biased towards graphs which are not fully connected. It thus uses extra information beyond the data.

Our comparison of the different methods is based on the knowledge of the underlying couplings, so the reconstructed couplings can be compared to the true underlying ones. In practice, the underlying couplings are not available. However, the likelihood (16) can be evaluated for different reconstructions, with the better reconstruction resulting in a higher value of the likelihood. Alternatively, the correlations and magnetizations of the reconstructed model can be compared with those observed in the data as in figure 4. Whether regularizing terms improve the reconstruction can be decided based on statistical tests such as the Bayesian information criterion [199] or the Akaike information criterion [2].

An aspect not discussed so far is the inference of the interaction graph, that is, the distinction between non-zero and zero couplings. In practice, there is always some ambiguity between small non-zero couplings and couplings which are exactly zero, particularly at high sampling noise. A standard practice is to set a threshold and cut off small couplings below the threshold. Two exceptions are the adaptive cluster expansion and the pseudolikelihood reconstruction. The adaptive cluster expansion has a built-in procedure to set some couplings to zero. For the pseudolikelihood reconstruction, the problem is related to feature selection. Many methods for feature


FIG. 6. Reconstruction of the Ising model on a random graph of fixed degree z = 3 (column A) and on the square lattice (column B) as a function of coupling strength β and number of samples M. The underlying couplings J^0_{ij} are drawn from the uniform distribution on the interval [−β, +β], and the external magnetic fields are set to zero for simplicity. The system size is N = 64. The top panel shows the reconstruction error γ_J defined by (115) as a function of β at constant M = 15000 samples. The smaller panels show γ_J as a function of the number of samples M at different coupling strengths β and for different methods (mean-field, TAP, Bethe–Peierls, Sessak–Monasson, ACE, and maximum pseudolikelihood).

selection are available from statistics [93]; however, there is no consensus on the best method for the inverse Ising problem. One possibility is adding an ℓ1-regularization term (known as Lasso, see section II A) to the pseudolikelihood [183]. However, there is a critical coupling strength, below which the ℓ1 regularization of the pseudolikelihood fails to recover the interaction graph [149].

D. Reconstruction and non-ergodicity

At strong couplings, the dynamics of a system can become non-ergodic, which affects the sampling of configurations, and hence the reconstruction of parameters. Reconstruction on the basis of non-ergodic sampling is a fundamentally difficult subject where little is known to date.

At low temperatures (or strong couplings), a disordered spin system can undergo a phase transition to a state where spins are 'frozen' in different directions. This is the famous transition to the spin glass phase [146]. At the spin glass transition, the Gibbs free energy develops multiple valleys with extensive barriers between them (thermodynamic states). These free energy barriers constrain the dynamics of the system, and the system can remain confined to one particular valley for long times. A signature of this ergodicity breaking is the emergence of non-trivial spin magnetisations: At the phase transition, the self-consistent equations (51) and (55) develop multiple solutions for the magnetisations m_i, corresponding to the different orientations the spins freeze into. The appearance of multiple free energy minima and the resulting ergodicity breaking has long been recognized as an obstacle to mean-field reconstruction [137]. In fact, all methods based on self-consistent equations (mean-field, TAP, Bethe–Peierls reconstruction) fail in the glassy phase [145].

The mean-field equations (51) describe individual thermodynamic states, not the mixture of many states that characterizes the Boltzmann measure. At large couplings (or low temperatures), mean-field and related approaches turn out not to be limited by the validity of self-consistent equations like (51) or (55), but by the identification of observed magnetisations and correlations with the corresponding quantities calculated under one of the self-consistent equations. The former may involve averages over multiple thermodynamic states, whereas each solution of the self-consistent equations describes a separate thermodynamic state. A solution to this problem is thus to consider correlations and magnetisations within a single thermodynamic state, where the mean-field result (52) is valid within the limitations of mean-field theory. The different thermodynamic states (free energy minima) can be identified from the data by searching for clusters in the sampled spin configurations. Collecting self-consistent equations from different minima and jointly solving these equations using the Moore–Penrose pseudo-inverse allows the reconstruction of couplings and fields [161].

The emergence of multiple states can lead to another problem: the samples need not come from the Gibbs measure in the first place, but might be taken only from a single thermodynamic state. In [35], the reconstruction from samples from a single thermodynamic state is studied for the concrete case of the Hopfield model. In the so-called memory regime, the Hopfield model has a number of thermodynamic states (attractor states) which scales linearly with the system size. Samples generated from a single run of the model's dynamics will generally come from a single state only. (The particular state the system is 'attracted' to depends on the initial conditions.) In this regime, the cavity-recursion equations (76) typically have multiple solutions, just like the TAP equations (55) in the low-temperature phase. These solutions can be identified as fixed points of belief propagation, see II A 10. For a system where couplings do not form a tree, each individual fixed point only approximates the marginal probability distributions inside a single non-ergodic component of the full Gibbs measure. However, this approximation can become exact in the thermodynamic limit. (A necessary condition is that loops are sufficiently long, so connected correlations measured in a particular state decay sufficiently quickly with distance, where distance is measured along the graph of non-zero couplings.)

Assuming one is able to find a fixed point of the cavity-recursion equations, it is a straightforward computation to express the Boltzmann weight (1) in terms of the full set of the belief propagation marginals and of the ratio between the cavity-recursion partition function and the true partition function of the model. For any fixed point, labelled by the index α, the Boltzmann weights are

p(s) = \frac{Z^\alpha_{\rm BP}}{Z(J,h)} \prod_i p^\alpha_i(s_i) \prod_{i<j} \frac{p^\alpha_{ij}(s_i,s_j)}{p^\alpha_i(s_i)\, p^\alpha_j(s_j)}. \qquad (116)

This result also applies to more general forms of the energy function [35]. It was first derived for the zero-temperature case in the context of the so-called tree-reweighted approximation to the partition function [117].

The probability distribution (116) can be used as the starting point for reconstruction in at least two ways. The simpler one consists in replacing p^\alpha_{ij}(s_i,s_j) and p^\alpha_i(s_i) by their sample estimates inside state α and then solving the identity between (116) and (1) with respect to J and h: Z(J,h) cancels out and one is left with the independent-pair approximation of subsection II A 12. Solving this identity can be done using belief propagation, see section II A 11 and [35]. This approach suffers from the fact that in general there is no belief propagation fixed point for real data sets.

A more promising approach is to guide the belief propagation equations to converge to a fixed point corresponding to an appropriate ergodic component close to the empirical data. In fact, ignoring the information that the data come from a single ergodic component results in a large reconstruction error, as one is effectively maximizing the wrong likelihood: the (reduced) free energy −ln Z appearing in the likelihood (16) needs to be replaced by a free energy restricted to a single thermodynamic state. An algorithmic implementation of this idea consists in restricting the spin configurations to a subset of configurations Ω_d(ξ), e.g., formed by a hypersphere of diameter d centered around the centroid ξ of the data samples. The system is forced to follow the measure

p_d(s) \propto \mathbb{I}[s \in \Omega_d(\xi)]\, e^{\beta(\sum_{i<j} J_{ij}s_i s_j + \sum_i h_i s_i)}, \qquad (117)

where the indicator function \mathbb{I}[s \in \Omega_d(\xi)] for the set of configurations Ω_d(ξ) is one for s ∈ Ω_d(ξ) and zero otherwise. The BP approach can be used to enforce that both magnetizations m^{(d)}_i and correlations c^{(d)}_{ij} are computed within the subspace Ω_d(ξ), see [35] for algorithmic details. In this way, the reconstruction of the couplings and local fields can be done by maximising the correct likelihood, i.e., replacing the partition function in the log-likelihood (16) with the partition function computed over the subset of configurations Ω_d(ξ). The parameters describing all thermodynamic states can thus be inferred from configurations of the system sampled from a single state.

Inverse problems in non-ergodic systems remain a challenging topic with many open questions, for instance whether the approach applied above to the Hopfield model can be extended to infer generic couplings from samples from a single thermodynamic state.

E. Parameter reconstruction and criticality

Empirical evidence for critical behaviour has been reported in systems as diverse as neural networks [21, 22] and financial markets [41, 80], leading to the intriguing hypothesis that some information-processing systems may be operating at a critical point [151]. A key signature of criticality is broad tails in some quantities, for instance the distribution of returns in a financial market, or avalanches of neural activity, whose sizes are distributed according to a power law. A second sign of criticality involves an inverse statistical problem: A statistical model like the Ising system with parameters chosen to match empirical data (such as neural firing patterns or financial data) shows signs of a phase transition [141, 151, 152, 212, 222, 224]. Specifically, an Ising model with parameters J_ij and h_i reconstructed from data shows a peak in the heat capacity as a function of the temperature. Temperature is introduced by changing the couplings to βJ_ij and βh_i. Varying the inverse temperature β, the heat capacity C_h ≡ −β^2 ∂⟨H⟩/∂β shows a pronounced maximum near or at β = 1, that is, at the model parameter values inferred from the data. The implication is that the parameters of the reconstructed model occupy a very particular area in the space of all model parameters, namely one resulting in critical behaviour.
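Such a temperature scan is easy to carry out numerically: C_h follows from the energy fluctuations, C_h = β^2(⟨H^2⟩ − ⟨H^2⟩... more precisely β^2(⟨H^2⟩ − ⟨H⟩^2), of configurations sampled at the rescaled parameters. A minimal sketch (the sampler argument is an assumed function returning an array of ±1 configurations drawn from the model with the given couplings and fields, e.g. by Metropolis updates as sketched in section II C):

import numpy as np

def heat_capacity_scan(J, h, betas, sampler, n_samples=2000):
    # heat capacity per spin, C/N = beta^2 Var(H) / N, for the rescaled model (beta*J, beta*h)
    N = len(h)
    result = []
    for beta in betas:
        samples = sampler(beta * J, beta * h, n_samples)    # (n_samples, N) array of +/-1 spins
        E = -0.5 * np.einsum("ti,ij,tj->t", samples, J, samples) - samples @ h
        result.append(beta**2 * np.var(E) / N)
    return np.array(result)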

An example is shown in figure 7. It is based on recordings of 160 neurons in the retina of a salamander taken by the Berry lab [222, 224]. As described in section I A, time is discretised into intervals of 20 ms duration, and the spin variable s_i takes on the value +1 if neuron i fires during a particular interval, and −1 if it does not fire. Correlations and magnetisations are then computed by averaging over 297 retinal stimulation time courses, where during each time course the retina is subjected to the same visual stimulation of 953×20 ms duration. During most time intervals, a neuron is typically silent.


FIG. 7. The Ising model with parameters matching the statistics of neural firing patterns. Couplings and external fields are generated from neural data [222, 224] as described in the text. The system size here is N = 30. Left: Fields and couplings turn out not to be independent, but obey a linear-affine relationship, which is due to neural firing rates, and hence magnetisations, being approximately constant across neurons. Right: We simulate the model with reconstructed couplings at different temperatures using Monte Carlo simulations. The heat capacity (blue) shows a peak near β = 1, and the magnetisation m = (1/N)\sum_i m_i (red, axis on the right) goes from zero to minus one as the inverse temperature is increased.

As a result, the spin variables (each representing a different neuron) have negative magnetisations, with mean and standard deviation −0.93 ± 0.06 over all spins. We note that the magnetisations are fairly homogeneous across spins. Connected correlations between spins are small, with a slight bias towards positive correlations; mean and standard deviation over all spin pairs are 0.006 ± 0.017. Both points turn out to be important.

We use the pseudolikelihood method of section II B to reconstruct the model parameters from the firing patterns of the first N = 30 neurons. Similar results are found for different system sizes. Figure 7A shows that the reconstructed model parameters indeed occupy a particular region of the space of model parameters: External fields h_i are linked to the sum over couplings \sum_j J_{ij} by a linear-affine relationship. If the magnetisations m_i are all equal to some m, such a relationship follows directly from the mean-field equation h_i = \operatorname{artanh} m_i − \sum_j J_{ij} m_j, with slope −m and offset \operatorname{artanh} m.

Now we simulate the Ising model with the reconstructed parameters, but rescale both couplings and fields by β. At inverse temperature β = 1, magnetisations and correlations are close to the magnetisations and correlations found in the original data. However, away from β = 1, different fields h_i = \frac{1}{\beta}\operatorname{artanh} m_i − \frac{1}{\beta}\sum_j J_{ij} m_j (in the mean-field approximation) would be needed to retain the same magnetisation. However, as the fields remain fixed, the magnetisations change instead, reaching −1 in the limit β → ∞ (and 0 in the limit β → 0). Between these limits the magnetisations, and hence the energy, change rapidly, leading to a peak in the heat capacity. For small, largely positive connected correlations C_{ij} we expect couplings to scale as 1/N, resulting in a phase transition described by the Curie–Weiss model. Indeed, a finite-size analysis shows the peak getting sharper and moving closer to β = 1 as the system size is increased [141, 151, 152, 212, 224]. From this, we posit that the peak in the heat capacity indeed signals a ferromagnetic phase transition, however not necessarily one underlying the original system. Instead, we generically expect such a ferromagnetic transition whenever a system shows sufficient correlations and magnetisations that are neither plus nor minus one: The corresponding couplings and fields then lie near (not at) a critical point, which is characterized by the emergence of non-zero magnetisations. Changing the temperature will drive the system to either higher or lower magnetisations, and away from the critical point.

This effect may not be limited to a ferromagnetic transition induced by the empirical magnetisation. In [141] Mastromatteo and Marsili point out a link between the criticality of inferred models and information geometry [13, 157]: susceptibilities such as the magnetic susceptibility can be interpreted as the entries of the so-called Fisher information matrix used to calculate the covariances of maximum-likelihood estimates [56]. A high susceptibility implies that different parameter values can be distinguished on the basis of limited data, whereas low susceptibilities mean that the likelihood does not differ sufficiently between different parameter values to significantly favour one parameter value over another. Thus, a critical point corresponds to a high density of models whose parameters are distinguishable on the basis of the data [141].

III. NON-EQUILIBRIUM RECONSTRUCTION

In a model of interacting magnetic spins such as (2), each pair of spins i < j contributes a term J_{ij}s_i s_j to the Hamiltonian. The resulting effective field on a spin i, \sum_j J_{ij}s_j + h_i, involves symmetric couplings J_{ij} = J_{ji}; spin i influences spin j as much as j influences i. A stochastic dynamics based on such local fields, such as Monte Carlo dynamics, obeys detailed balance, which allows one to determine the steady-state distribution [83].

In applications such as neural networks or gene regulatory networks discussed in sections I A and I B we have no reason to expect symmetric connections between neurons or between genes; a synaptic connection from neuron i to neuron j does not imply a link in the reverse direction. Stochastic systems generally relax to a steady state at long times, where observables no longer change with time. However, in systems with asymmetric couplings, the resulting non-equilibrium steady state (NESS) violates detailed balance and differs from the Boltzmann distribution. As a result, none of the results of section II on equilibrium reconstruction apply to systems with asymmetric couplings.

In this section we consider the inverse Ising problem in a non-equilibrium setting. We first review different types of spin dynamics, and then turn to the problem of reconstructing the parameters of a spin system from either time-series data or from samples of the non-equilibrium steady state.

A. Dynamics of the Ising model

The Ising model lacks a prescribed dynamics, and there are many different dynamical rules that lead to a particular steady-state distribution. One particularly simple dynamics is the so-called Glauber dynamics [86], which allows the derivation of a number of analytical results. Other dynamical rules can be used for parameter reconstruction in the same way, at least in principle. What dynamical rule is suitable for particular systems is, however, an open question. An approach which sidesteps this issue is to apply the maximum entropy principle to stochastic trajectories, known as the principle of maximum caliber [182]. In [139, 158] such an approach is applied to analyse neural dynamics.

1. Sequential Glauber dynamics

Glauber dynamics can be based on discrete time steps, and at each step either one or all spins have new values assigned to them according to a stochastic rule. We first consider a sequential dynamics: In the transition from time t to t + 1, the label of a spin variable is picked randomly, say i. The value of the spin variable σ_i is then updated, with s_i = ±1 sampled from the probability distribution

p(s_i(t+1)|s(t)) = \frac{\exp[s_i(t+1)\,\theta_i(t)]}{2\cosh(\theta_i(t))}, \qquad (118)

where the effective local field is denoted

\theta_i(t) = \sum_j J_{ij}s_j(t) + h_i. \qquad (119)

One way to implement this dynamics is to set

s_i(t+1) = \operatorname{sign}(\theta_i(t) + \xi_i(t)), \qquad (120)

where ξ_i(t) is drawn independently at each step from the distribution p(ξ) = \frac{1}{2}[1 − \operatorname{th}^2(ξ)].
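Equivalently, s_i(t+1) can be set to +1 with probability (1 + th θ_i(t))/2, which is how the dynamics is usually coded. A minimal sketch:

import numpy as np

def glauber_sequential(J, h, n_steps, rng):
    # n_steps sequential Glauber updates (118); returns the trajectory of +/-1 configurations
    N = len(h)
    s = rng.choice([-1, 1], size=N)
    trajectory = [s.copy()]
    for _ in range(n_steps):
        i = rng.integers(N)                          # pick a spin at random
        theta = J[i] @ s + h[i]                      # effective local field (119)
        s[i] = 1 if rng.random() < 0.5 * (1 + np.tanh(theta)) else -1
        trajectory.append(s.copy())
    return np.array(trajectory)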

If the couplings J_{ij} between spins are symmetric and there are no self-couplings J_{ii}, the sequential Glauber dynamics (118) obeys detailed balance, and the distribution of spin configurations at long times relaxes to


a steady state described by the Boltzmann distribution (11). However, the NESS arising at long times from Glauber dynamics with non-symmetric couplings is generally not known. Nevertheless, there are several exact relations that follow from (118), and those can be exploited for inference.

Suppose that spin i is picked for updating in the step from time t to t + 1. Averaging over the two possible spin configurations at time t + 1, we obtain the magnetisation

m_i(t+1)_{s(t)} \equiv \langle \sigma_i(t+1) \rangle_{s(t)} = \operatorname{th}(\theta_i(t)), \qquad (121)

where the average is conditioned on the configuration of all other spins at time t through the effective local field θ_i. If spin i is not updated in this interval, the conditioned magnetisation is trivially m_i(t+1)_{s(t)} = s_i(t). In the steady state, effective fields \Theta_i = \sum_j J_{ij}\sigma_j(t) + h_i and spin configurations σ are random variables whose distribution no longer depends on time; their averages give

m_i \equiv \langle \sigma_i \rangle = \langle \operatorname{th}(\Theta_i) \rangle = \Big\langle \operatorname{th}\Big(\sum_j J_{ij}\sigma_j + h_i\Big) \Big\rangle. \qquad (122)

Similarly, the pair correlation between spins at sites i ≠ j at equal times obeys in the NESS

\chi_{ij} \equiv \langle \sigma_i \sigma_j \rangle = \frac{1}{2}\langle \sigma_i\, \operatorname{th}(\Theta_j) \rangle + \frac{1}{2}\langle \sigma_j\, \operatorname{th}(\Theta_i) \rangle, \qquad (123)

with the two terms arising from instances where the last update of i was before the last update of j, and vice versa. Likewise, the pair correlation of spin configurations at consecutive time intervals is in the steady state

\phi_{ij} \equiv \langle \sigma_i(t+1)\sigma_j(t) \rangle = \frac{1}{N}\langle \operatorname{th}(\Theta_i)\,\sigma_j \rangle + \frac{N-1}{N}\chi_{ij}. \qquad (124)

These relationships are exact, but are hard to evaluate since the averages on the right-hand sides are over the statistics of spins in the NESS, which is generally unknown. Below, these equations will be used in different ways for the reconstruction of the model parameters.

The dynamics (118) defines a Markov chain of transitions between spin configurations differing by at most one spin flip. Equivalently, one can also define an asynchronous dynamics described by a Master equation, where time is continuous and the time between successive spin flips is a continuous random variable.

2. Parallel Glauber dynamics

Glauber dynamics can also be defined with parallel updates, where all spin variables can change their configurations in the time interval between t and t + 1 according to the stochastic update rule

p(s(t+1)|s(t)) = \frac{\exp\sum_i s_i(t+1)\,\theta_i(t)}{\prod_i 2\cosh(\theta_i(t))}. \qquad (125)

This update rule defines a Markov chain consisting of stochastic transitions between spin configurations. The resulting dynamics is not realistic for biological networks, as the synchronous update requires a central clock. It is however implemented easily in technical networks, and is widely used for its simplicity. For a symmetric coupling matrix, the steady state can still be specified in closed form using Peretto's pseudo-Hamiltonian [176].

Magnetisations and correlations in the NESS obey simpler relationships for parallel updates than for sequential updates; with the same arguments as above, one obtains

m_i = \langle \operatorname{th}(\Theta_i) \rangle, \qquad \chi_{ij} = \langle \operatorname{th}(\Theta_i)\,\operatorname{th}(\Theta_j) \rangle, \qquad \phi_{ij} = \langle \operatorname{th}(\Theta_i)\,\sigma_j \rangle. \qquad (126)

B. Reconstruction from time series data

The reconstruction of system parameters is surprisingly easy on the basis of time series data, which specifies the state of each spin variable at M successive time points. An application where such data is widely available is the reconstruction of neural networks from temporal recordings of neural activity. Given a stochastic update rule such as (118) or (125), the (log-)likelihood given a time series D = \{s(t)\} is

L_D(J,h) = \frac{1}{M}\sum_{t=1}^{M-1} \ln p(s(t+1)|s(t)). \qquad (127)

The likelihood can be evaluated over any time interval, in the NESS or even before the steady state has been reached. This approach is not limited to non-equilibrium systems; whenever time series data are available, it can be used equally well to reconstruct symmetric couplings.

1. Maximisation of the likelihood

For Glauber dynamics with parallel updates, the likelihood is

L_D(J,h) = \frac{1}{M}\sum_{t=1}^{M-1}\sum_i \big[s_i(t+1)\,\theta_i(t) - \ln 2\cosh(\theta_i(t))\big]. \qquad (128)

Unlike the likelihood (16) arising in equilibrium statistics, the non-equilibrium likelihood can be evaluated easily, as the normalisation is already contained in the term cosh(θ). Derivatives of the likelihood (128) are

\frac{\partial L_D}{\partial h_i}(J,h) = \frac{1}{M}\sum_{t=1}^{M-1} \big[s_i(t+1) - \operatorname{th}\theta_i(t)\big], \qquad (129)

\frac{\partial L_D}{\partial J_{ij}}(J,h) = \frac{1}{M}\sum_{t=1}^{M-1} \big[s_i(t+1)s_j(t) - \operatorname{th}(\theta_i(t))\,s_j(t)\big].


These derivatives can be evaluated in MN^2 computational steps and can be used to maximize the likelihood by gradient ascent [190].
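A direct implementation of this gradient ascent takes only a few lines; the sketch below assumes the data S is an M×N array of ±1 spins ordered in time, and the learning rate and iteration count are illustrative choices:

import numpy as np

def fit_kinetic_ising(S, learning_rate=0.1, n_iter=2000):
    # maximise the parallel-update likelihood (128) by gradient ascent on (129)
    M, N = S.shape
    J = np.zeros((N, N))
    h = np.zeros(N)
    X, Y = S[:-1], S[1:]                    # s(t) and s(t+1)
    for _ in range(n_iter):
        theta = X @ J.T + h                 # theta_i(t) = sum_j J_ij s_j(t) + h_i
        resid = Y - np.tanh(theta)          # s_i(t+1) - th(theta_i(t))
        h += learning_rate * resid.mean(axis=0)
        J += learning_rate * (resid.T @ X) / (M - 1)
    return J, h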

The derivatives of the likelihood can also be written as time averages over the data, which makes a conceptual connection with logistic regression apparent [190]. The derivative with respect to the fields gives ∂L_D/∂h_i = \langle \sigma_i(t+1) \rangle^D_T − \langle \operatorname{th}(\sum_j J_{ij}\sigma_j(t) + h_i) \rangle^D_T, and similarly for the derivative with respect to the couplings (the subscript refers to the temporal average, the superscript to configurations taken from the data). Parallel Glauber dynamics (125) defines a logistic regression giving the statistics of σ_i(t+1) as a function of s_j(t). In a hypothetical data set of P trajectories of the system, D = \{s^p(t)\}_{t=1}^M, each realization can be considered as M − 1 realisations (s^p_i(t+1), s^p_j(t)) of such regression pairs. The discussion in section II B then applies directly, giving rise to regression equations for fields and couplings.

For Glauber dynamics with sequential updates, the situation is not quite so simple. In each time interval one spin is picked for updating, but if this spin is assigned the configuration it had before, it is impossible to tell which spin was actually chosen [250]. The solution is to sum the likelihood (127) over all spins (each of which might have been the one picked for updating with probability 1/N). It turns out that both the intervals when a spin was actually flipped and the intervals when no spin was flipped are required for the inference of couplings and fields.

2. Mean-field theory of the non-equilibrium steady state

As in the case of equilibrium reconstruction, mean-field theory offers an approximation to the maximum likelihood reconstruction that can be evaluated quickly. The speed-up is not quite as significant as it is in equilibrium, because the likelihood (128) can already be computed in polynomial time in N and M.

In II A 13, the mean-field equation (51) and the TAP equation (55) were derived in an expansion around a factorising ansatz for the equilibrium distribution. Kappen and Spanjers showed that, remarkably, exactly the same equations emerge as first- and second-order expansions around the same ansatz for the NESS as well [113]. Suppose that the (unknown) steady-state distribution of configurations p(s) in the NESS is 'close' to another distribution with the same magnetisations, q(s) = \prod_i \frac{1+m_i s_i}{2}. According to (122), this distribution describes the NESS of a different model with fields h^{(q)}_i = \operatorname{artanh} m_i and couplings J^{(q)}_{ij} = 0. Next, we consider a small change in these fields and couplings and ask how the magnetisations change. To first order this change is given by the derivatives of (122) evaluated at h_i = h^{(q)}_i and J_{ij} = 0,

\Delta m_i = \sum_j \frac{\partial\langle s_i\rangle}{\partial h_j}\Big|_q \Delta h_j + \sum_{k,j} \frac{\partial\langle s_i\rangle}{\partial J_{kj}}\Big|_q \Delta J_{kj} + \cdots = (1-m_i^2)\Delta h_i + (1-m_i^2)\sum_j \Delta J_{ij} m_j + \cdots. \qquad (130)

Setting \Delta h_i = h_i − h^{(q)}_i and \Delta J_{ij} = J_{ij} − J^{(q)}_{ij} = J_{ij}, and demanding that the magnetisations remain unchanged under this change of fields and couplings, gives to first order h^{(q)}_i = h_i + \sum_j J_{ij} m_j and thus

m_i = \operatorname{th}\Big(h_i + \sum_j J_{ij} m_j\Big). \qquad (131)

Carrying the expansion (130) to second order in \Delta h_j and \Delta J_{jk} yields the TAP equations (55) [113]. As the equations for the magnetisations (122) and correlations (126) are identical for sequential and parallel dynamics, the mean-field and TAP equations apply equally to both types of dynamics. Roudi and Hertz [189] extended these results to time scales before the NESS is reached using a generating functional approach.

The next step is to apply the mean-field approximation to the correlations (126) for parallel Glauber updates [189, 251]. For sequential updates, analogous results can be derived from (123) and (124). We expand the effective local field \Theta_i = h_i + \sum_j J_{ij} m_j + \sum_j J_{ij}\delta\sigma_j around the mean field \theta^{\rm MF}_i = h_i + \sum_j J_{ij} m_j, writing \delta\sigma_i \equiv \sigma_i − m_i. Expanding the \operatorname{th}\Theta_i term of the two-time pair correlation (126) in a formal expansion in powers of δσ gives

\phi_{ij} = \langle \operatorname{th}(\Theta_i)\,\sigma_j \rangle = \langle \operatorname{th}\theta^{\rm MF}_i\,\sigma_j \rangle + (1 − \operatorname{th}^2\theta^{\rm MF}_i)\sum_l J_{il}\langle \delta\sigma_l\,\sigma_j \rangle + \cdots = m_i m_j + (1−m_i^2)\sum_l J_{il}(\chi_{lj} − m_l m_j) + \cdots. \qquad (132)

To first order, this equation can be read as a matrix equation in the connected correlation functions, D_{ij} \equiv \phi_{ij} − m_i m_j = \sum_{m,l} A_{im} J_{ml} C_{lj}, with A_{im} = \delta_{im}(1−m_i^2) and C_{ij} = \chi_{ij} − m_i m_j. Inverting this relationship leads to the mean-field reconstruction

J^{\rm MF} = A^{-1} D C^{-1}, \qquad (133)

based on sample averages of magnetisations and connected correlations in the NESS. The reconstruction based on the TAP equation can be derived analogously [189, 251]; the result is of the same form as (133), with A_{im} = \delta_{im}(1−m_i^2)(1−F_i), where F_i is the smallest root of

F_i(1−F_i)^2 = (1−m_i^2)\sum_j (J^{\rm MF}_{ij})^2 (1−m_j^2). \qquad (134)


3. The Gaussian approximation

For an asymmetric coupling matrix, with no correlation between J_{ij} and J_{ji}, the statistics of the effective local fields in the NESS is remarkably simple [147]: In the thermodynamic limit, \Theta_i turns out to follow a Gaussian distribution, at least in some regimes. This distribution is characterized by a mean \theta^{\rm MF}_i, standard deviation \Delta_i and covariances \varepsilon_{ij}. Using the definition of the effective local field (119), these parameters are linked to the spin observables by

\theta^{\rm MF}_i = h_i + \sum_j J_{ij} m_j, \qquad \Delta_i^2 = \sum_{lk} J_{il} C_{lk} J_{ik}, \qquad \varepsilon_{ij} = \sum_{lk} J_{il} C_{lk} J_{jk}. \qquad (135)

The key idea is that one can transform back and forth between σ_i(t) and \Theta_i(t) via

\Theta(t) = J\sigma(t) + h, \qquad \sigma(t) = J^{-1}(\Theta(t) − h), \qquad (136)

and evaluate the correlation functions within the Gaussian theory. With \theta_i = \theta^{\rm MF}_i + \Delta_i x and \theta_j = \theta^{\rm MF}_j + \Delta_j y, where x and y are univariate Gaussian random variables, one obtains from D_{ik} = \langle \operatorname{th}(\Theta_i)\sigma_k \rangle − \langle \operatorname{th}(\Theta_i) \rangle\langle \sigma_k \rangle

\sum_k J_{jk} D_{ik} = \langle \operatorname{th}(\Theta_i)(\Theta_j − h_j) \rangle − \langle \operatorname{th}(\Theta_i) \rangle\langle \Theta_j − h_j \rangle = \langle \operatorname{th}(\theta^{\rm MF}_i + \Delta_i x)\,\Delta_j y \rangle = \varepsilon_{ij}\,\langle 1 − \operatorname{th}^2(\theta^{\rm MF}_i + \Delta_i x) \rangle. \qquad (137)

In the last step we have used the fact that covariances between spin variables are small [147]. Inserting the result for the covariance (135) gives again an equation of the same form as (133), however with

$$A_{im} = \delta_{im} \int \mathrm{d}x\, \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}\, \Big[1 - \tanh^2\big(\theta_i^{\mathrm{MF}} + \sqrt{\Delta_i}\,x\big)\Big] .$$

Mean-field theory neglects the fluctuations here and renders this term as $1-m_i^2$, whereas the fluctuations are captured more accurately by the Gaussian theory. However, the $A_{im}$ cannot be determined directly from the data alone, but also require the couplings; [147] gives an iterative scheme to infer the parameters of the effective local fields (135) as well as the coupling matrix and magnetic fields. The typical-case performance of the Gaussian theory in the thermodynamic limit has been analysed within the framework of statistical learning [10], finding that the Gaussian theory breaks down at strong couplings and a small number of samples.
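The Gaussian average appearing in $A_{im}$ above is a one-dimensional integral and can be evaluated numerically; the sketch below does so by Gauss–Hermite quadrature, taking the mean and standard deviation of each effective local field as inputs. The function and argument names are hypothetical, and the number of quadrature nodes is an arbitrary choice.

    import numpy as np

    def gaussian_A_diagonal(theta_mf, sigma, n_nodes=40):
        """A_ii = E[1 - tanh^2(theta)] with theta ~ N(theta_mf_i, sigma_i^2),
        evaluated by Gauss-Hermite quadrature."""
        t, w = np.polynomial.hermite.hermgauss(n_nodes)  # nodes/weights for weight e^{-t^2}
        x = np.sqrt(2.0) * t                             # change of variables to a standard normal
        field = theta_mf[:, None] + sigma[:, None] * x[None, :]
        return (w * (1.0 - np.tanh(field) ** 2)).sum(axis=1) / np.sqrt(np.pi)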

The Gaussian distribution of local fields is not limited to the asymmetric Ising model. In fact, the asymmetric Ising model is one particular example from a class of models called generalized linear models. In this model class, the Gaussian approximation has been used in the context of neural network reconstruction [228] already prior to its application to the asymmetric Ising model.

4. Method comparison

We compare the results of the mean-field approximation, the Gaussian approximation, and the maximization of the exact likelihood (128). As in section II C, we draw couplings from a Gaussian distribution with mean zero and standard deviation $\beta/\sqrt{N}$, but now the couplings $J^0_{ij}$ are statistically independent of $J^0_{ji}$, so the matrix of couplings is in general asymmetric. Fields are drawn from a uniform distribution on the interval $[-0.3\beta, 0.3\beta]$. We then sample a time series of $M = 15000$ steps by parallel updates under Glauber dynamics (125). The scatter plots in the top row of figure 8 compare the couplings and fields reconstructed by different methods with the couplings and fields of the original model. They show that couplings are significantly underestimated by the mean-field reconstruction, a bias which is avoided by the Gaussian theory. The bottom plot shows the relative reconstruction error (115) against $\beta$. As in the equilibrium case shown in figure 5, reconstruction by any method is limited at small $\beta$ by sampling noise. At strong couplings $\beta$, mean-field theory breaks down. Also at strong couplings, the iterative algorithm for the Gaussian-approximation reconstruction converges very slowly and stops when the maximum number of iterations is reached (here set to 50000 steps).
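The data-generating step just described can be sketched as follows, assuming the standard parallel Glauber rule in which each spin is redrawn independently with probability $[1 + \tanh\Theta_i(t)]/2$ of being $+1$; the zero diagonal of the coupling matrix and all names are assumptions of this sketch, and the resulting series can be passed directly to a reconstruction routine such as the mean-field sketch above.

    import numpy as np

    def sample_parallel_glauber(J, h, n_steps, rng):
        """Time series of +/-1 spins under parallel Glauber updates:
        P(s_i(t+1) = +1) = [1 + tanh(h_i + sum_j J_ij s_j(t))] / 2, all spins updated at once."""
        N = len(h)
        s = np.empty((n_steps, N))
        s[0] = rng.choice([-1.0, 1.0], size=N)
        for t in range(n_steps - 1):
            p_up = 0.5 * (1.0 + np.tanh(h + J @ s[t]))
            s[t + 1] = np.where(rng.random(N) < p_up, 1.0, -1.0)
        return s

    # setup mirroring the comparison in the text: N = 64, beta = 1.5, M = 15000 steps
    rng = np.random.default_rng(1)
    N, beta, M = 64, 1.5, 15000
    J0 = rng.normal(0.0, beta / np.sqrt(N), size=(N, N))
    np.fill_diagonal(J0, 0.0)                 # zero self-couplings (an assumption of the sketch)
    h0 = rng.uniform(-0.3 * beta, 0.3 * beta, size=N)
    s = sample_parallel_glauber(J0, h0, M, rng)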

C. Outlook: Reconstruction from the steady state

Time series data is not available in all applications. This prompts the question of how to reconstruct the parameters of a non-equilibrium model from independent samples of the steady state. It is clear that, unlike for equilibrium systems, pairwise correlations are insufficient to infer the couplings: The matrix of correlations is symmetric, whereas the matrix of couplings is asymmetric for non-equilibrium systems. Hence, there are twice as many free parameters as there are observables. Similarly, using an equilibrium model like (1) (a maximum-entropy model) on data generated by a non-equilibrium model would give parameters matching the two-spin observables, but entirely different from the true parameters underlying the data.

One solution uses three-spin correlations to infer the couplings [63]. A problem of this approach is that connected three-spin correlations are small, since the effective local fields are well described by a multivariate Gaussian distribution (see section III B 3).



FIG. 8. Reconstructing the asymmetric Ising model from time series. The scatter plots in the upper panels compare the reconstructed couplings $J$ and fields $h$ with the couplings $J^0$ and fields $h^0$ underlying the original model, for the mean-field, Gaussian, and maximum-likelihood reconstructions. The system size is $N = 64$, the standard deviation of the original couplings is $\beta/\sqrt{N}$ with $\beta = 1.5$, the original fields are uniformly distributed on $[-0.3\beta, 0.3\beta]$, and $M = 15000$ time steps are sampled. The bottom plot gives the reconstruction error $\gamma_J$ as a function of $\beta$, see text.

A second approach uses perturbations of the non-equilibrium steady state: We measure one set of pair correlations at certain (unknown) model parameters, and then a second set of pair correlations of a perturbed version of the system. Possible perturbations include changing one or several of the couplings by a known amount, changing the magnetic fields, or fixing particular spins to a constant value. This generates two sets of coupled equations specifying two symmetric correlation matrices, which can be solved for one asymmetric coupling matrix. Conceptually, such an approach is well known in biology, where altering parts of a system and checking the consequences is a standard mode of scientific inquiry. Neural stimulation can lead to a rewiring of neural connections, and the effects of this neural plasticity can be tracked in neural recordings [184]. An exciting development is optogenetic tools, which allow one to stimulate and monitor the activity of individual neurons in vivo [221]. In the context of inferring gene regulatory networks from gene expression data, perturbation-based approaches have been used both with linear [218] and non-linear models of gene regulation [148, 159], see also section I B. In this context, it is also fruitful to consider the genetic variation occurring in a population of cells as a source of perturbations. Such an approach has already been used in gene regulatory networks [187, 254], and with the expansion of tools to analyze large numbers of single cells [43, 249], it may soon spread to other types of networks.

IV. CONCLUSIONS

At the end of this overview, we step back and summarize the aims and motivations behind the inverse Ising problem, discuss the efficacy of different approaches, and outline different areas of research that may involve the statistical mechanics of inverse problems in the future.

Motivation. The inverse Ising problem arises in the context of very different types of questions connected with the inference of model parameters. The first and most straightforward question appears when data actually is generated by a process obeying detailed balance and has pairwise couplings between binary spins. The problem of inferring the parameters of such a model can then be phrased in terms of the equilibrium statistics (11), the maximisation of the likelihood (16), and the use of approximation schemes discussed in section II.

The second question arises when data is generated by a different, and possibly entirely unknown, type of process, and we seek a statistical description of the data in terms of a simpler model matching only particular aspects of the observed data. An example is models with maximum entropy given pairwise correlations (section II A 3), used to describe neural data in section I A.

The assumptions behind the Ising model (pairwise couplings, binary spins, . . .) can be relaxed. For instance, multi-valued Potts spin variables have been used extensively in models of biological sequences [195, 239]. The extension of the Ising model to models with three- and four-spin couplings has not been used yet in the context of inference. Nevertheless, for many of the methods of section II, the extension beyond pairwise couplings would be straightforward. A much larger rift lies between equilibrium and non-equilibrium models. Parameter inference of a non-equilibrium model is often based on data beyond independent samples of the steady state, specifically time-series data.

Methods. The practical question of how to infer the couplings and fields that parametrise an Ising model has been answered using many different approximations with different regimes of validity. Section II C gives an overview. For data sampled from the equilibrium distribution (1), the pseudolikelihood approach of section II B gives a reconstruction close to the optimal one (in fact, asymptotically close in the number of samples). However, this is paid for by a computational effort that scales with the number of samples. For the non-equilibrium


regime and a time series of configurations sampled from the stochastic dynamics, the likelihood of a model can be evaluated exactly and comparatively easily.

Both the pseudolikelihood and the likelihood of a time series can be understood in the language of regression. Thus a single framework links two of the most successful approaches, in both the equilibrium and the non-equilibrium setting. Regression singles out one spin variable as a dependent variable and treats the remainder as independent variables. It then characterizes the statistics of the dependent variable given configurations of the independent variables.
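As a concrete illustration of this regression view, the sketch below maximises the conditional log-likelihood of a single spin given all others (the pseudolikelihood contribution of spin $i$) by plain gradient ascent on independent ±1 samples; the learning rate, step count, and absence of regularisation are simplifications of this sketch, and in practice one would use a quasi-Newton optimiser with an l1 or l2 penalty.

    import numpy as np

    def pseudolikelihood_row(s, i, lr=0.05, n_steps=2000):
        """Fit J_i. and h_i of spin i by gradient ascent on
        (1/M) sum_mu [ s_i H_i - log 2 cosh H_i ],  H_i = h_i + sum_{j != i} J_ij s_j."""
        M, N = s.shape
        others = np.delete(np.arange(N), i)
        X, y = s[:, others], s[:, i]
        J_row, h_i = np.zeros(N - 1), 0.0
        for _ in range(n_steps):
            H = h_i + X @ J_row
            resid = y - np.tanh(H)                        # gradient of log 2 cosh H is tanh H
            J_row += lr * (X * resid[:, None]).mean(axis=0)
            h_i += lr * resid.mean()
        J_full = np.zeros(N)
        J_full[others] = J_row                            # couplings of spin i to all other spins
        return J_full, h_i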

The application of novel concepts to the inverse Ising problem and the development of new algorithms continues. An exciting new direction is interaction screening [130, 233]. Vuffray, Misra, Lokhov, and Chertkov introduce the objective function

$$\frac{1}{M}\sum_{\mu=1}^{M} e^{H_i(\boldsymbol{s}^\mu)} = \frac{1}{M}\sum_{\mu=1}^{M} \exp\Big(-\sum_{j\neq i} J_{ij}\, s_i^\mu s_j^\mu - h_i\, s_i^\mu\Big) \qquad (138)$$

to be minimized with respect to the $i$th column of the coupling matrix and the magnetic field on spin $i$ by convex optimisation. $H_i(\boldsymbol{s})$ denotes the part of the Hamiltonian containing all terms in $s_i$, see section II B. Sparsity or other properties of the model parameters can be effected by appropriate regularisation terms. This objective function aims to find those parameters which 'screen the interactions' (and the magnetic field) in the data, making the sum over samples in (138) as balanced as possible. To illustrate the appeal of this objective function, we look at the infinite-sampling limit, where the summation over samples in the objective function (138) can be replaced by a summation over the configurations, reweighted by the Boltzmann distribution with the true couplings $\boldsymbol{J}^*$ and fields $\boldsymbol{h}^*$,

$$\frac{1}{M}\sum_{\mu=1}^{M} e^{H_i(\boldsymbol{s}^\mu)} \approx \sum_{\boldsymbol{s}\setminus s_i} \frac{1}{Z(\boldsymbol{J}^*,\boldsymbol{h}^*)}\, e^{\sum_{k>j;\,k,j\neq i} J^*_{kj} s_k s_j + \sum_{k\neq i} h^*_k s_k} \times \sum_{s_i} e^{-\sum_j (J_{ij}-J^*_{ij}) s_i s_j - (h_i - h^*_i) s_i} . \qquad (139)$$

Since the last sum, as a function of $J_{ij}$ and $h_i$ (for fixed $i$), is convex and symmetric under reflection around $J^*_{ij}$ and $h^*_i$, so is the whole objective function. It follows immediately that this objective function has a unique global minimum at $J_{ij} = J^*_{ij}$ and $h_i = h^*_i$.
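A minimal sketch of minimising the objective (138) for a single spin is given below, assuming independent samples of ±1 configurations. The use of SciPy's L-BFGS-B optimiser and the simple l2 penalty are implementation choices of this sketch, not prescriptions of [233], which discusses l1 regularisation for sparse models.

    import numpy as np
    from scipy.optimize import minimize

    def interaction_screening_row(s, i, reg=0.0):
        """Minimise S(J_i., h_i) = (1/M) sum_mu exp(-s_i^mu (sum_{j!=i} J_ij s_j^mu + h_i))
        over the couplings and field of spin i; the objective is convex."""
        M, N = s.shape
        others = np.delete(np.arange(N), i)
        X, y = s[:, others], s[:, i]

        def objective(theta):
            J_row, h_i = theta[:-1], theta[-1]
            z = np.exp(-y * (X @ J_row + h_i))            # one screening term per sample
            value = z.mean() + reg * np.sum(theta ** 2)
            grad_J = -(X * (y * z)[:, None]).mean(axis=0) + 2.0 * reg * J_row
            grad_h = -(y * z).mean() + 2.0 * reg * h_i
            return value, np.append(grad_J, grad_h)

        res = minimize(objective, np.zeros(N), jac=True, method="L-BFGS-B")
        J_full = np.zeros(N)
        J_full[others] = res.x[:-1]
        return J_full, res.x[-1]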

Of course, the sample average of $e^{H_i(\boldsymbol{s})}$ is affected by sampling fluctuations when the number of samples is small. Nevertheless, the minimum of this objective function nearly saturates the information-theoretic bounds on the reconstruction of sparse Ising models of Santhanam and Wainwright [194] (see section II A 4).

The flat-histogram method in Monte Carlo simulations uses a similar rebalancing with respect to the Boltzmann measure [235]. Also, there may be conceptual links between interaction screening and fluctuation theorems such as the Jarzynski equality [106], which also take sample averages over exponentials of different thermodynamic quantities.

Another recent development concerning the sparse inverse Ising problem is the use of Bayesian model selection techniques by Bulso, Marsili and Roudi [38].

More to come. Over the last two decades, interest in inverse statistical problems has been driven by technological progress, and this progress is likely to continue opening up new applications. Beyond the extrapolation of technological developments, there are several broad areas at the interface of statistical mechanics, statistics, and machine learning where inverse statistical problems such as the inverse Ising problem might play a role in the future.

• Stochastic control theory. Stochastic control theory seeks to steer a stochastic system towards certain desired states [111]. An inverse problem arises when the parameters describing the stochastic system are only partially known. As a result, its response to changes in external control parameters ('steering') must be predicted on the basis of its past dynamics. A recent application is the control of cell populations, specifically populations of cancer cells [77]. Such cell populations evolve stochastically due to random fluctuations of cell duplications and cell deaths. Birth and death rates of at least part of the population can be controlled by therapeutic drugs.

• Network inference. Like the regulatory connections between genes discussed in section I B, metabolic and signalling interactions also form intricate networks. An example where such a network optimizes a specific global quantity is flux-balance analysis [166], which models the flow of metabolites through a metabolic network. Inverting the relationship between metabolic rates and metabolite concentrations allows, in principle, the metabolic network to be inferred on the basis of observed metabolite concentrations [61].

• Causal analysis. The do-calculus developed by Judea Pearl seeks to establish causal relationships behind statistical dependencies [173]. do-calculus is based on interventions such as fixing a variable to a particular value, and then observing the resulting statistics of other variables. Recently, Aurell and Del Ferraro have found a link between do-calculus and the dynamic cavity method from statistical physics [6].


• Maximum entropy and dynamics. The notion of a steady-state distribution with maximum entropy has been generalized to a maximum-entropy distribution over trajectories of a dynamical system [182]. Like the maximum entropy models of sections I A–I C, this approach can be used to derive simple effective models of the dynamics of a system, whose parameters can be inferred from time series. In fact, Glauber dynamics with parallel updates gives the maximum entropy distribution of $\boldsymbol{\sigma}(t+1)$ given $\boldsymbol{\sigma}(t)$. So far, applications have been in neural modelling [139, 158], in the effective dynamics of quantitative traits in genetics [18, 32], and in flocking dynamics [46].

• Hidden variables. Even in large-scale data sets, there will be variables that are unobserved. Yet those hidden variables affect the statistics of other variables, and hence the inference we make about interactions between the observed variables [19, 69, 192, 223]. This can lead to a signature of critical behaviour, even when the original system is not critical [140, 197].

• High-dimensional statistics and inference. In many applications, the number of system parameters to be inferred is of the same order of magnitude as, or exceeds, the sample size. The field of high-dimensional statistics deals with this regime, and a fruitful interaction with statistical physics has emerged over the last decade [248], spurred by applications such as compressed sensing. Vuffray et al. [233] propose the objective function (138) for the inverse Ising problem in the high-dimensional regime. Other objective functions might perform even better, and the optimal objective function may depend on the number of samples and the statistics of the underlying couplings. [11, 23] use the statistical mechanics of disordered systems to find the objective function which minimizes the difference between the reconstructed and the underlying couplings for Gaussian-distributed couplings.

• Restricted Boltzmann machines and deep learning. Recently, (deep) feed-forward neural networks have re-established themselves as powerful learning architectures, leading to spectacular applications in the areas of computer vision, speech recognition, data visualisation, and game playing [123]. This progress has also demonstrated that the most challenging step in data analysis is the extraction of features from unlabelled data. There is a wide range of methods for feature extraction, the most well known being convolutional networks for images [123, 124], as well as auto-encoders [97, 232] and restricted Boltzmann machines (RBMs) [97] for less structured data. RBMs are possibly one of the most general methods, although the algorithms used for finding the optimal parameters are heuristic and approximate in many respects. For instance, in the case of multiple layers, RBMs correspond to generic Boltzmann machines (with feedback loops), for which learning is a hard computational problem. Usually, the sub-optimal solution which is adopted is to train each layer independently. These observations are not surprising, since RBMs are nothing but an inverse Ising problem with a layer of visible spins connected to a layer of hidden spins. Empirical data is available only for the visible spins. The role of the hidden variables is to compress information and identify structural features in the data. Learning consists of finding the visible-to-hidden couplings $J_{ij} = J_{ji}$ and the local fields such that summing over the hidden variables gives back a (marginalized) probability distribution over the visible variables which is maximally consistent with the data, i.e. has minimal KL divergence (18) from the empirical distribution of the data; a minimal sketch of a standard training update is given after this list.

• Learning phases of matter. One aspect of the emerging applications of machine learning in quantum physics is the identification of non-trivial quantum states from data. This is of particular interest for phases of matter where the order parameter is either unknown or hard to compute (such as the so-called entanglement entropy [104, 125]). Techniques from machine learning, specifically deep learning with neural networks, have recently been used to classify quantum states without knowledge of the underlying Hamiltonian [45].
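The sketch below illustrates the RBM learning step mentioned in the list above for ±1-valued visible and hidden spins, using a single contrastive-divergence (CD-1) update with energy $E(v, h) = -v\cdot W h - a\cdot v - b\cdot h$. This is a generic heuristic, not an algorithm taken from this review; the energy convention, learning rate, and names are assumptions of the sketch.

    import numpy as np

    def cd1_step(v_data, W, a, b, lr=0.01, rng=None):
        """One contrastive-divergence (CD-1) update for an RBM with +/-1 units.
        v_data: (M, Nv) batch of visible configurations; W: (Nv, Nh); a: (Nv,); b: (Nh,)."""
        rng = np.random.default_rng() if rng is None else rng
        # positive phase: expected hidden activity given the data, <h_j|v> = tanh(v.W + b)
        h_prob = np.tanh(v_data @ W + b)
        h_sample = np.where(rng.random(h_prob.shape) < 0.5 * (1 + h_prob), 1.0, -1.0)
        # negative phase: one reconstruction step
        v_prob = np.tanh(h_sample @ W.T + a)
        v_sample = np.where(rng.random(v_prob.shape) < 0.5 * (1 + v_prob), 1.0, -1.0)
        h_recon = np.tanh(v_sample @ W + b)
        M = v_data.shape[0]
        # approximate likelihood gradient: data statistics minus reconstruction statistics
        W += lr * (v_data.T @ h_prob - v_sample.T @ h_recon) / M
        a += lr * (v_data.mean(axis=0) - v_sample.mean(axis=0))
        b += lr * (h_prob.mean(axis=0) - h_recon.mean(axis=0))
        return W, a, b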

ACKNOWLEDGEMENTS

Discussions with our colleagues and students have inspired and shaped this work. In particular we would like to thank Erik Aurell, Michael Berry, Andreas Beyer, Simona Cocco, Simon Dettmer, David Gross, Jiang Yijing, David R. Jones, Bert Kappen, Alessia Marruzzo, Matteo Marsili, Marc Mezard, Remi Monasson, Thierry Mora, Manfred Opper, Andrea Pagnani, Federico Ricci-Tersenghi, Yasser Roudi, Gasper Tkacik, Aleksandra Walczak, Martin Weigt and Pieter Rein ten Wolde. Many thanks to Gasper Tkacik and Michael Berry for making the neural recordings from [222, 224] available. This work was supported by the DFG under Grant SFB 680; BMBF under Grants emed:SMOOSE and SYBACOL.


REFERENCES

[1] D. H. Ackley, G. E. Hinton, and T. J. Sejnowski. Alearning algorithm for Boltzmann machines. CognitiveScience, 9(1):147–169, 1985.

[2] H. Akaike. A new look at the statistical model iden-tification. IEEE transactions on automatic control,19(6):716–723, 1974.

[3] S.-i. Amari, H. Nakahara, S. Wu, and Y. Sakai. Syn-chronous firing and higher-order interactions in neuronpool. Neural Computation, 15(1):127–142, 2003.

[4] B. C. Arnold and D. Strauss. Pseudolikelihood estima-tion: some examples. Sankhya: The Indian Journal ofStatistics, Series B, pages 233–243, 1991.

[5] E. Aurell. The maximum entropy fallacy redux? PLoSComput Biol, 12(5):e1004777, 2016.

[6] E. Aurell and G. Del Ferraro. Causal analysis,correlation-response, and dynamic cavity. In Journal ofPhysics: Conference Series, volume 699, page 012002.IOP Publishing, 2016.

[7] E. Aurell and M. Ekeberg. Inverse Ising inference usingall the data. Phys. Rev. Lett., 108(9):090201, 2012.

[8] E. Aurell, C. Ollion, and Y. Roudi. Dynamics andperformance of susceptibility propagation on syntheticdata. Eur. Phys. J. B, 77(4):587–595, 2010.

[9] K. Baba, R. Shibata, and M. Sibuya. Partial correlationand conditional correlation as measures of conditionalindependence. Australian & New Zealand Journal ofStatistics, 46(4):657–664, 2004.

[10] L. Bachschmid-Romano and M. Opper. Learning of cou-plings for random asymmetric kinetic ising models revis-ited: random correlation matrices and learning curves.Journal of Statistical Mechanics: Theory and Experi-ment, 2015(9):P09016, 2015.

[11] L. Bachschmid-Romano and M. Opper. A statisticalphysics approach to learning curves for the inverse isingproblem. arXiv preprint arXiv:1705.05403, 2017.

[12] M. Bailly-Bechet, A. Braunstein, A. Pagnani, M. Weigt,and R. Zecchina. Inference of sparse combinatorial-control networks from gene-expression data: a mes-sage passing approach. BMC Bioinformatics, 11(1):355,2010.

[13] V. Balasubramanian. Statistical inference, Occam’s ra-zor, and statistical mechanics on the space of probabilitydistributions. Neural Computation, 9(2):349–368, 1997.

[14] J. R. Banavar, A. Maritan, and I. Volkov. Applica-tions of the principle of maximum entropy: from physicsto ecology. Journal of Physics: Condensed Matter,22(6):063101, 2010.

[15] R. Bapat, S. J. Kirkland, and M. Neumann. On dis-tance matrices and Laplacians. Linear Algebra Appl.,401(0):193–209, 2005.

[16] J. P. Barton, S. Cocco, E. De Leonardis, and R. Monas-son. Large pseudocounts and l2-norm penalties are nec-essary for the mean-field inference of Ising and Pottsmodels. Physical Review E, 90(1):012132, 2014.

[17] J. P. Barton, E. De Leonardis, A. Coucke, and S. Cocco.ACE: adaptive cluster expansion for maximum entropygraphical model inference. Bioinformatics, 32(20):3089–3097, 2016.

[18] N. H. Barton and H. P. de Vladar. Statistical mechan-ics and the evolution of polygenic quantitative traits.Genetics, 181(3):997–1011, 2009.

[19] C. Battistin, J. Hertz, J. Tyrcha, and Y. Roudi. Be-lief propagation and replicas for inference and learn-ing in a kinetic Ising model with hidden spins. Jour-nal of Statistical Mechanics: Theory and Experiment,2015(5):P05021, 2015.

[20] R. J. Baxter. Exactly solvable models in statistical me-chanics. Academic Press London, 1982.

[21] C. Bedard, H. Kroeger, and A. Destexhe. Does the 1/ffrequency scaling of brain signals reflect self-organizedcritical states? Phys. Rev. Lett., 97(11):118102, 2006.

[22] J. M. Beggs and N. Timme. Being critical of criticalityin the brain. Front Physiol, 3:163, 2012.

[23] J. Berg. Statistical mechanics of the inverseIsing problem and the optimal objective function.http://arxiv.org/abs/1611.04281, 2016.

[24] J. Berg, S. Willmann, and M. Lassig. Adaptive evolu-tion of transcription factor binding sites. BMC Evolu-tionary Biology, 4(1):42, 2004.

[25] J. Besag. Spatial interaction and the statistical analysisof lattice systems. J. R. Stat. Soc. B, 36(2):192–236,1974.

[26] J. Besag. On the statistical analysis of dirty pictures.J. R. Stat. Soc. B, 48(3):259–302, 1986.

[27] H. Bethe. Statistical theory of superlattices. In Proc.Roy. Soc. London A, volume 150, pages 552–575, 1935.

[28] W. Bialek, A. Cavagna, I. Giardina, T. Mora, O. Pohl,E. Silvestri, M. Viale, and A. M. Walczak. Social in-teractions dominate speed control in poising naturalflocks near criticality. Proc. Natl. Acad. Sci. USA,111(20):7212–7217, 2014.

[29] W. Bialek, A. Cavagna, I. Giardina, T. Mora, E. Sil-vestri, M. Viale, and A. M. Walczak. Statistical me-chanics for natural flocks of birds. Proc. Natl. Acad.Sci. USA, 109(13):4786–4791, 2012.

[30] W. Bialek and R. Ranganathan. Rediscoveringthe power of pairwise interactions. arXiv preprintarXiv:0712.4397, 2007.

[31] C. M. Bishop. Pattern recognition and machine learn-ing. Springer, 2006.

[32] K. Bodova, G. Tkacik, and N. H. Barton. A generalapproximation for the dynamics of quantitative traits.Genetics, 202(4):1523–1548, 2016.

[33] N. M. Bogolyubov, V. Brattsev, A. N. Vasil’ev, A. Ko-rzhenevskii, and R. Radzhabov. High-temperature ex-pansions at an arbitrary magnetization in the Isingmodel. Theoret. Math. Phys., 26(3):230–237, 1976.

[34] S. S. Borysov, Y. Roudi, and A. V. Balatsky. US stockmarket interaction network as learned by the Boltzmannmachine. The European Physical Journal B, 88(12):1–14, 2015.

[35] A. Braunstein, A. Ramezanpour, R. Zecchina, andP. Zhang. Inference and learning in sparse systems withmultiple states. Phys. Rev. E, 83(5):056114, 2011.

[36] T. Broderick, M. Dudik, G. Tkacik, R. E. Schapire, andW. Bialek. Faster solutions of the inverse pairwise Isingproblem. arXiv preprint arXiv:0712.2437, 2007.

[37] N. E. Buchler, U. Gerland, and T. Hwa. On schemesof combinatorial transcription logic. Proc. Natl. Acad.Sci. USA, 100(9):5136–5141, 2003.

[38] N. Bulso, M. Marsili, and Y. Roudi. Sparse modelselection in the highly under-sampled regime. Jour-nal of Statistical Mechanics: Theory and Experiment,2016(9):093404, 2016.


[39] L. Burger and E. Van Nimwegen. Disentangling directfrom indirect co-evolution of residues in protein align-ments. PLoS Comput. Biol., 6(1):e1000633, 2010.

[40] A. N. Burkitt. A review of the integrate-and-fire neu-ron model: I. Homogeneous synaptic input. BiologicalCybernetics, 95(1):1–19, 2006.

[41] T. Bury. Market structure explained by pairwise inter-actions. Physica A: Statistical Mechanics and its Appli-cations, 392(6):1375–1385, 2013.

[42] T. Bury. A statistical physics perspective on criticalityin financial markets. Journal of Statistical Mechanics:Theory and Experiment, 2013(11):P11004, 2013.

[43] C. Cadwell, A. Palasantza, X. Jiang, P. Berens,Q. Deng, M. Yilmaz, J. Reimer, S. Shen, M. Bethge,K. Tolias, R. Rickard Sandberg, and T. Andreas. Elec-trophysiological, transcriptomic and morphologic profil-ing of single neurons using Patch-seq. Nature Biotech-nology, 34:199–203, 2015.

[44] H. B. Callen. A note on Green functions and the Isingmodel. Phys. Lett., 4(3):161, 1963.

[45] J. Carrasquilla and R. G. Melko. Machine learningphases of matter. Nature Physics, 13:431434, 2017.

[46] A. Cavagna, I. Giardina, F. Ginelli, T. Mora, D. Pi-ovani, R. Tavarone, and A. M. Walczak. Dynamicalmaximum entropy approach to flocking. Phys. Rev. E,89(4):042707, 2014.

[47] H. J. Changlani, H. Zheng, and L. K. Wagner. Density-matrix based determination of low-energy model Hamil-tonians from ab initio wavefunctions. The Journal ofchemical physics, 143(10):102814, 2015.

[48] J. Chayes, L. Chayes, and E. H. Lieb. The inverse prob-lem in classical statistical mechanics. Communicationsin Mathematical Physics, 93(1):57–121, 1984.

[49] M. Chertkov and V. Y. Chernyak. Loop series for dis-crete statistical models on graphs. J. Stat. Mech., pageP06009, 2006.

[50] C. Chow and C. Liu. Approximating discrete probabil-ity distributions with dependence trees. IEEE Transac-tions on Information Theory, 14:462–467, 1968.

[51] S. Cocco, C. Feinauer, M. Figliuzzi, R. Monas-son, and M. Weigt. Inverse statistical physicsof protein sequences: A key issues review.http://arxiv.org/abs/1703.01222, 2017.

[52] S. Cocco, S. Leibler, and R. Monasson. Neuronal cou-plings between retinal ganglion cells inferred by efficientinverse statistical physics methods. Proc. Natl. Acad.Sci. USA, 106(33):14058–14062, 2009.

[53] S. Cocco and R. Monasson. Adaptive cluster expansionfor inferring Boltzmann machines with noisy data. Phys.Rev. Lett., 106(9):090601, 2011.

[54] S. Cocco and R. Monasson. Adaptive cluster expansionfor the inverse Ising problem: convergence, algorithmand tests. J. Stat. Phys., 147(2):252–314, 2012.

[55] S. Cocco, R. Monasson, and M. Weigt. Inference ofHopfield-Potts patterns from covariation in protein fam-ilies: calculation and statistical error bars. In Journal ofPhysics: Conference Series, volume 473, page 012010.IOP Publishing, 2013.

[56] T. M. Cover and J. A. Thomas. Elements of InformationTheory. John Wiley & Sons, 2006.

[57] H. Cramer. Mathematical methods of statistics. Prince-ton University Press, 1961.

[58] A. E. Dago, A. Schug, A. Procaccini, J. A. Hoch, M. Weigt, and H. Szurmant. Structural basis of histidine kinase autophosphorylation deduced by integrating genomics, molecular dynamics, and mutagenesis. Proc. Natl. Acad. Sci. USA, 109(26):E1733–E1742, 2012.

[59] V. Dahirel, K. Shekhar, F. Pereyra, T. Miura, M. Arty-omov, S. Talsania, T. M. Allen, M. Altfeld, M. Car-rington, D. J. Irvine, et al. Coordinate linkage of HIVevolution reveals regions of immunological vulnerability.Proc. Natl. Acad. Sci. USA, 108(28):11530–11535, 2011.

[60] D. de Juan, F. Pazos, and A. Valencia. Emerging meth-ods in protein co-evolution. Nature Reviews Genetics,14(4):249–261, 2013.

[61] D. De Martino, F. Capuani, and A. De Martino. Growthagainst entropy in bacterial metabolism: the phenotypictrade-off behind empirical growth rate distributions inE. coli. Physical Biology, 13(3):036005, 2016.

[62] A. Decelle and F. Ricci-Tersenghi. Pseudolikelihooddecimation algorithm improving the inference of the in-teraction network in a general class of Ising models.Phys. Rev. Lett., 112(7):070603, 2014.

[63] S. Dettmer, H. C. Nguyen, and J. Berg. Network infer-ence in the non-equilibrium steady state. Phys. Rev.,5(E 94):052116, 2016.

[64] P. D’haeseleer, S. Liang, and R. Somogyi. Genetic net-work inference: from co-expression clustering to reverseengineering. Bioinformatics, 16(8):707–726, 2000.

[65] K. A. Dill and J. L. MacCallum. The protein-foldingproblem, 50 years on. Science, 338(6110):1042–1046,2012.

[66] R. A. DiStasio Jr., E. Marcotte, R. Car, F. H. Stillinger,and S. Torquato. Designer spin systems via inverse sta-tistical mechanics. Phys. Rev. B, 88(13):134104, 2013.

[67] Y. Dorsett and T. Tuschl. siRNAs: applications in func-tional genomics and potential as therapeutics. NatureReviews Drug Discovery, 3(4):318–329, 2004.

[68] R. M. Dowben and J. E. Rose. A metal-filled microelec-trode. Science, 118(3053):22–24, 1953.

[69] B. Dunn and Y. Roudi. Learning and inference in anonequilibrium Ising model with hidden nodes. PhysicalReview E, 87(2):022127, 2013.

[70] T. P. Eggarter. Cayley trees, the Ising problem, and thethermodynamic limit. Phys. Rev. B, 9:2989, 1974.

[71] M. Ekeberg, C. Lovkvist, Y. Lan, M. Weigt, and E. Au-rell. Improved contact prediction in proteins: Usingpseudolikelihoods to infer Potts models. Phys. Rev. E,87(1):012707, 2013.

[72] A. Engel and C. Van den Broeck. Statistical mechanicsof learning. Cambridge University Press, 2001.

[73] K. Faust, J. F. Sathirapongsasuti, J. Izard, N. Segata,D. Gevers, J. Raes, and C. Huttenhower. Microbialco-occurrence relationships in the human microbiome.PLoS Comput Biol, 8(7):e1002606–e1002606, 2012.

[74] A. L. Ferguson, J. K. Mann, S. Omarjee, T. Ndung’u,B. D. Walker, and A. K. Chakraborty. Translating HIVsequences into quantitative fitness landscapes predictsviral vulnerabilities for rational immunogen design. Im-munity, 38(3):606–617, 2013.

[75] U. Ferrari. Learning maximum entropy models fromfinite-size data sets: A fast data-driven algorithm al-lows sampling from the posterior distribution. PhysicalReview E, 94(2):023301, 2016.

[76] A. Fischer and C. Igel. An introduction to restrictedBoltzmann machines. In Iberoamerican Congress onPattern Recognition, pages 14–36. Springer, 2012.


[77] A. Fischer, I. Vazquez-Garcıa, and V. Mustonen. Thevalue of monitoring to control evolving populations.Proc. Natl. Acad. Sci. USA, 112(4):1007–1012, 2015.

[78] C. Flamm, I. L. Hofacker, S. Maurer-Stroh, P. F.Stadler, and M. Zehl. Design of multistable RNAmolecules. RNA, 7(02):254–265, 2001.

[79] N. Friedman. Inferring cellular networks using proba-bilistic graphical models. Science, 303(5659):799–805,2004.

[80] X. Gabaix, P. Gopikrishnan, V. Plerou, and H. E. Stan-ley. A theory of power-law distributions in financialmarket fluctuations. Nature, 423(6937):267–270, 2003.

[81] R. G. Gallager. Low-density parity-check codes. IRETransactions on Information Theory, 8(1):21–28, 1962.

[82] E. Ganmor, R. Segev, and E. Schneidman. Sparse low-order interaction network underlies a highly correlatedand learnable neural population code. Proc. Natl. Acad.Sci. USA, 108(23):9679–9684, 2011.

[83] C. Gardiner. Handbook of stochastic methods forphysics, chemistry and the natural sciences. Springer,1985.

[84] A. P. Gasch, P. T. Spellman, C. M. Kao, O. Carmel-Harel, M. B. Eisen, G. Storz, D. Botstein, and P. O.Brown. Genomic expression programs in the response ofyeast cells to environmental changes. Molecular Biologyof the Cell, 11(12):4241–4257, 2000.

[85] A. Georges and J. S. Yedidia. How to expand aroundmean-field theory using high-temperature expansions.J. Phys. A: Math. Gen., 24:2173, 1991.

[86] R. J. Glauber. Time-dependent statistics of the Isingmodel. Journal of Mathematical Physics, 4(2):294–307,1963.

[87] U. Gobel, C. Sander, R. Schneider, and A. Valen-cia. Correlated mutations and residue contacts in pro-teins. Proteins: Structure, Function, and Bioinformat-ics, 18(4):309–317, 1994.

[88] E. Granot-Atedgi, G. Tkacik, R. Segev, and E. Schnei-dman. Stimulus-dependent maximum entropy mod-els of neural population codes. PLoS Comput. Biol,9(3):e1002922, 2013.

[89] G. Guennebaud, B. Jacob, et al. Eigen v3.http://eigen.tuxfamily.org, 2010.

[90] S. Gull and J. Skilling. Maximum entropy method inimage processing. Communications, Radar and SignalProcessing, IEE Proceedings F, 131(6):646–659, 1984.

[91] M. Habeck. Bayesian approach to inverse statistical me-chanics. Phys. Rev. E, 89:052113, May 2014.

[92] N. Halabi, O. Rivoire, S. Leibler, and R. Ranganathan.Protein sectors: evolutionary units of three-dimensionalstructure. Cell, 138(4):774–786, 2009.

[93] T. Hastie, R. Tibshirani, and J. Friedman. The Ele-ments of Statistical Learning. Springer, 2009.

[94] D. R. Hekstra, S. Cocco, R. Monasson, and S. Leibler.Trend and fluctuations: Analysis and design of popu-lation dynamics measurements in replicate ecosystems.Phys. Rev. E, 88(6):062714, 2013.

[95] J. Hertz, A. Krogh, and R. G. Palmer. Introductionto the theory of neural computation, volume 1. BasicBooks, 1991.

[96] G. J. Hickman and T. C. Hodgman. Inference of generegulatory networks using Boolean-network inferencemethods. Journal of Bioinformatics and ComputationalBiology, 7(06):1013–1029, 2009.

[97] G. E. Hinton and R. R. Salakhutdinov. Reducing thedimensionality of data with neural networks. Science,313(5786):504–507, 2006.

[98] J. D. Hoheisel. Microarray technology: beyond tran-script profiling and genotype analysis. Nature ReviewsGenetics, 7(3):200–210, 2006.

[99] T. A. Hopf, L. J. Colwell, R. Sheridan, B. Rost,C. Sander, and D. S. Marks. Three-dimensional struc-tures of membrane proteins from genomic sequencing.Cell, 149(7):1607–1621, 2012.

[100] T. A. Hopf, J. B. Ingraham, F. J. Poelwijk, M. Springer,C. Sander, and D. S. Marks. Quantification of the effectof mutations using a global probability model of naturalsequence variation. arXiv preprint arXiv:1510.04612,2015.

[101] A. Hyvarinen. Estimation of non-normalized statisticalmodels by score matching. Journal of Machine LearningResearch, 6(Apr):695–709, 2005.

[102] A. Hyvarinen. Consistency of pseudolikelihood estima-tion of fully visible Boltzmann machines. Neural Com-putation, 18(10):2283–2292, 2006.

[103] T. E. Ideker, V. Thorsson, and R. M. Karp. Discoveryof regulatory interactions through perturbation: infer-ence and experimental design. In Pacific Symposium onBiocomputing, volume 5, pages 302–313, 2000.

[104] R. Islam, R. Ma, P. M. Preiss, M. E. Tai, A. Lukin,M. Rispoli, and M. Greiner. Measuring entanglemententropy in a quantum many-body system. Nature,528(7580):77–83, 2015.

[105] H. Jacquin and A. Rancon. Resummed mean-field infer-ence for strongly coupled data. Phys. Rev. E, 94:042118,Oct 2016.

[106] C. Jarzynski. Nonequilibrium equality for free energydifferences. Physical Review Letters, 78(14):2690, 1997.

[107] E. T. Jaynes. Information theory and statistical me-chanics. Phys. Rev., 106(4):620, 1957.

[108] E. T. Jaynes. E.T. Jaynes: Papers on Probability,Statistics, and Statistical Physics. Springer Science &Business Media, 1989.

[109] D. T. Jones, T. Singh, T. Kosciolek, and S. Tetchner.Metapsicov: combining coevolution methods for accu-rate prediction of contacts and long range hydrogenbonding in proteins. Bioinformatics, 31(7):999–1006,2015.

[110] J. D. Kalbfleisch. Pseudo-likelihood. John Wiley & Sons,Ltd, 2005.

[111] H. J. Kappen, J. Marro, P. L. Garrido, and J. J. Tor-res. An introduction to stochastic control theory, pathintegrals and reinforcement learning. In AIP conferenceproceedings, volume 887, pages 149–181. AIP, 2007.

[112] H. J. Kappen and F. Rodrıguez. Boltzmann machinelearning using mean field theory and linear responsecorrection. Advances in Neural Information ProcessingSystems, pages 280–286, 1998.

[113] H. J. Kappen and J. J. Spanjers. Mean-field theory forasymmetric neural networks. Phys. Rev. E, 61(5):5658,2000.

[114] R. Kikuchi. A theory of cooperative phenomena. Phys.Rev., 81:988–1003, 1951.

[115] M. A. Kohanski, D. J. Dwyer, and J. J. Collins. How an-tibiotics kill bacteria: from targets to networks. NatureReviews Microbiology, 8(6):423–435, 2010.

[116] D. Koller and N. Friedman. Probabilistic graphical mod-els: principles and techniques. MIT press, 2009.


[117] V. Kolmogorov. Convergent tree-reweighted messagepassing for energy minimization. IEEE transactions onpattern analysis and machine intelligence, 28(10):1568–1583, 2006.

[118] P. L. Krapivsky, S. Redner, and E. Ben-Naim. A kineticview of statistical physics. Cambridge University Press,2010.

[119] J. Krumsiek, K. Suhre, T. Illig, J. Adamski, and F. J.Theis. Gaussian graphical modeling reconstructs path-way reactions from high-throughput metabolomics data.BMC Systems Biology, 5(1):21, 2011.

[120] B. Kuhlman, G. Dantas, G. C. Ireton, G. Varani, B. L.Stoddard, and D. Baker. Design of a novel globu-lar protein fold with atomic-level accuracy. Science,302(5649):1364–1368, 2003.

[121] J. E. Kulkarni and L. Paninski. Common-input modelsfor multiple neural spike-train data. Network: Compu-tation in Neural Systems, 18(4):375–407, 2007.

[122] W. Kunkin and H. Frisch. Inverse problem in classi-cal statistical mechanics. Physical Review, 177(1):282,1969.

[123] Y. LeCun, Y. Bengio, and G. Hinton. Deep learning.Nature, 521(7553):436–444, 2015.

[124] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner.Gradient-based learning applied to document recogni-tion. Proceedings of the IEEE, 86(11):2278–2324, 1998.

[125] M. Levin and X.-G. Wen. Detecting topological or-der in a ground state wave function. Phys. Rev. Lett.,96:110405, Mar 2006.

[126] T. R. Lezon, J. R. Banavar, M. Cieplak, A. Maritan,and N. V. Fedoroff. Using the principle of entropymaximization to infer genetic interaction networks fromgene expression patterns. Proc. Natl. Acad. Sci. USA,103(50):19033–19038, 2006.

[127] Y.-Y. Liu, Y. Wang, T. R. Walsh, L.-X. Yi, R. Zhang,J. Spencer, Y. Doi, G. Tian, B. Dong, X. Huang,et al. Emergence of plasmid-mediated colistin resis-tance mechanism mcr-1 in animals and human beings inChina: a microbiological and molecular biological study.The Lancet Infectious Diseases, 16(2):161168, 2016.

[128] J. W. Locasale and A. Wolf-Yadlin. Maximum entropyreconstructions of dynamic signaling networks fromquantitative proteomics data. PLoS ONE, 4(8):e6522,2009.

[129] S. W. Lockless and R. Ranganathan. Evolutionarilyconserved pathways of energetic connectivity in proteinfamilies. Science, 286(5438):295–299, 1999.

[130] A. Y. Lokhov, M. Vuffray, S. Misra, and M. Chertkov.Optimal structure and parameter learning of Ising mod-els. arXiv preprint arXiv:1612.05024, 2016.

[131] D. J. MacKay. Information theory, inference and learn-ing algorithms. Cambridge University Press, 2003.

[132] J. H. Macke, P. Berens, A. S. Ecker, A. S. Tolias, andM. Bethge. Generating spike trains with specified corre-lation coefficients. Neural Computation, 21(2):397–423,2009.

[133] J. H. Macke, M. Opper, and M. Bethge. Common inputexplains higher-order correlations and entropy in a sim-ple model of neural population activity. Physical ReviewLetters, 106(20):208102, 2011.

[134] E. Z. Macosko, A. Basu, R. Satija, J. Nemesh, K. Shekhar, M. Goldman, I. Tirosh, A. R. Bialas, N. Kamitaki, E. M. Martersteck, et al. Highly parallel genome-wide expression profiling of individual cells using nanoliter droplets. Cell, 161(5):1202–1214, 2015.

[135] J. K. Mann, J. P. Barton, A. L. Ferguson, S. Omarjee, B. D. Walker, A. Chakraborty, and T. Ndung'u. The fitness landscape of HIV-1 gag: advanced modeling approaches and validation of model predictions by in vitro testing. PloS One, 10(8):e1003776, 2014.

[136] E. Marcotte, R. A. DiStasio Jr, F. H. Stillinger, andS. Torquato. Designer spin systems via inverse statisti-cal mechanics. II. Ground-state enumeration and clas-sification. Phys. Rev. B, 88(18):184432, 2013.

[137] E. Marinari and V. Van Kerrebroeck. Intrinsic limita-tions of the susceptibility propagation inverse inferencefor the mean field Ising spin glass. Journal of StatisticalMechanics: Theory and Experiment, 2010(02):P02008,2010.

[138] D. S. Marks, L. J. Colwell, R. Sheridan, T. A. Hopf,A. Pagnani, R. Zecchina, and C. Sander. Protein 3Dstructure computed from evolutionary sequence varia-tion. PloS One, 6(12):e28766, 2011.

[139] O. Marre, S. El Boustani, Y. Fregnac, and A. Des-texhe. Prediction of spatiotemporal patterns of neuralactivity from pairwise correlations. Phys. Rev. Lett.,102(13):138101, 2009.

[140] M. Marsili, I. Mastromatteo, and Y. Roudi. Onsampling and modeling complex systems. Journalof Statistical Mechanics: Theory and Experiment,2013(09):P09003, 2013.

[141] I. Mastromatteo and M. Marsili. On the criticality of in-ferred models. Journal of Statistical Mechanics: Theoryand Experiment, 2011(10):P10012, 2011.

[142] W. Maysenholder. On the determination of interactionparameters from correlations in binary alloys. PhysicaStatus solidi, B 139:399–408, 1987.

[143] L. Merchan and I. Nemenman. On the sufficiency ofpairwise interactions in maximum entropy models ofnetworks. Journal of Statistical Physics, 162(5):1294–1308, 2016.

[144] M. Mezard and A. Montanari. Information, physics,and computation. Oxford University Press, 2009.

[145] M. Mezard and T. Mora. Constraint satisfaction prob-lems and neural networks: A statistical physics perspec-tive. J. Physiol. Paris, 103(1-2):107–113, 2009.

[146] M. Mezard, G. Parisi, and M. A. Virasoro. Spin GlassTheory and Beyond. World Scientific, Singapore, 1987.

[147] M. Mezard and J. Sakellariou. Exact mean field infer-ence by in asymmetric kinetic Ising systems. J. Stat.Mech., page L07001, 2011.

[148] E. J. Molinelli, A. Korkut, W. Wang, M. L. Miller, N. P.Gauthier, X. Jing, P. Kaushik, Q. He, G. Mills, D. B.Solit, C. A. Pratilas, M. Weigt, A. Braunstein, A. Pag-nani, R. Zecchina, and C. Sander. Perturbation biology:Inferring signaling networks in cellular systems. PLoSComput. Biol., 9(12):e1003290, 2013.

[149] A. Montanari and J. A. Pereira. Which graphical modelsare difficult to learn? In Advances in Neural Informa-tion Processing Systems, pages 1303–1311, 2009.

[150] C. Moore and S. Mertens. The Nature of Computation.Oxford University Press, 2011.

[151] T. Mora and W. Bialek. Are biological systems poisedat criticality? J. Stat. Phys., 144(2):268–302, 2011.

[152] T. Mora, S. Deny, and O. Marre. Dynamical critical-ity in the collective activity of a population of retinalneurons. Phys. Rev. Lett., 114(7):078105, 2015.


[153] T. Mora, A. M. Walczak, W. Bialek, and C. G. Callan.Maximum entropy models for antibody diversity. Proc.Natl. Acad. Sci. USA, 107(12):5405–5410, 2010.

[154] F. Morcos, A. Pagnani, B. Lunt, A. Bertolino, D. S.Marks, C. Sander, R. Zecchina, J. Onuchic, T. Hwa,and M. Weigt. Direct-coupling analysis of residue co-evolution captures native contacts across many proteinfamilies. Proc. Natl. Acad. Sci. USA, 108(49):E1293–E1301, 2011.

[155] J. J. More, B. S. Garbow, and K. E. Hillstrom. UserGuide for MINPACK-1. ANL-80-74, Argonne NationalLaboratory, 1980.

[156] A. Mozeika, O. Dikmen, and J. Piili. Consistent in-ference of a general model using the pseudolikelihoodmethod. Phys. Rev. E, 90(1):010101, 2014.

[157] I. J. Myung, V. Balasubramanian, and M. A. Pitt.Counting probability distributions: Differential geom-etry and model selection. Proc. Natl. Acad. Sci. USA,97(21):11170–11175, 2000.

[158] H. Nasser and B. Cessac. Parameter estimation forspatio-temporal maximum entropy distributions: Ap-plication to neural spike trains. Entropy, 16(4):2244–2277, 2014.

[159] S. Nelander, W. Wang, B. Nilsson, Q.-B. She, C. Prati-las, N. Rosen, P. Gennemark, and C. Sander. Modelsfrom experiments: Combinatorial drug perturbations ofcancer cells. Molecular Systems Biology, 4(1), 2008.

[160] H. C. Nguyen and J. Berg. Bethe–Peierls approximationand the inverse Ising problem. J. Stat. Mech., pageP03004, 2012.

[161] H. C. Nguyen and J. Berg. Mean-field theory for theinverse Ising problem at low temperatures. Phys. Rev.Lett., 109(5):050602, 2012.

[162] This line of argument only works if the parameter spaceis bounded, so a uniform prior can be defined.

[163] M. E. J. Obien, K. Deligkaris, T. Bullmann, D. J.Bakkum, and U. Frey. Revealing neuronal functionthrough microelectrode array recordings. Frontiers inNeuroscience, 8, 2014.

[164] I. E. Ohiorhenuan, F. Mechler, K. P. Purpura, A. M.Schmid, Q. Hu, and J. D. Victor. Sparse coding andhigh-order correlations in fine-scale cortical networks.Nature, 466(7306):617–621, 2010.

[165] M. Opper and D. Saad, editors. Advanced Mean-fieldMethods: Theory and Practice. The MIT Press, 2001.

[166] J. D. Orth, I. Thiele, and B. Ø. Palsson. What is fluxbalance analysis? Nature Biotechnology, 28(3):245–248,2010.

[167] S. Ovchinnikov, H. Park, N. Varghese, P.-S. Huang,G. A. Pavlopoulos, D. E. Kim, H. Kamisetty, N. C.Kyrpides, and D. Baker. Protein structure deter-mination using metagenome sequence data. Science,355(6322):294–298, 2017.

[168] L. Paninski. The most likely voltage path and large de-viations approximations for integrate-and-fire neurons.Journal of Computational Neuroscience, 21(1):71–87,2006.

[169] L. Paninski, J. W. Pillow, and E. P. Simoncelli. Max-imum likelihood estimation of a stochastic integrate-and-fire neural encoding model. Neural Computation,16(12):2533–2561, 2004.

[170] G. Parisi and F. Slanina. Loop expansion around the Bethe–Peierls approximation for lattice models. Journal of Statistical Mechanics: Theory and Experiment, page L02003, 2006.

[171] J. M. Parrondo, J. M. Horowitz, and T. Sagawa. Thermodynamics of information. Nature Physics, 11(2):131–139, 2015.

[172] J. Pearl. Probabilistic reasoning in intelligent systems:networks of plausible inference. Morgan Kaufmann,1988.

[173] J. Pearl. Causal diagrams for empirical research.Biometrika, 82(4):669–688, 1995.

[174] D. Pe’er, A. Regev, G. Elidan, and N. Friedman. In-ferring subnetworks from perturbed expression profiles.Bioinformatics, 17(suppl 1):S215–S224, 2001.

[175] R. Peierls. On Ising’s model of ferromagnetism. InMathematical Proceedings of the Cambridge Philosophi-cal Society, volume 32, pages 477–481. Cambridge UnivPress, 1936.

[176] P. Peretto. Collective properties of neural networks:a statistical physics approach. Biological Cybernetics,50(1):51–62, 1984.

[177] C. Peterson and J. Anderson. A mean field theory learn-ing algorithm for neural networks. Complex Systems,1:995–1019, 1987.

[178] P. Picotti, M. Clement-Ziza, H. Lam, D. S. Campbell,A. Schmidt, E. W. Deutsch, H. Rost, Z. Sun, O. Rinner,L. Reiter, M. J. Shen Q, A. Frei, S. Alberti, U. Kuse-bauch, B. Wollscheid, M. RL, A. Beyer, and R. Aeber-sold. A complete mass-spectrometric map of the yeastproteome applied to quantitative trait analysis. Nature,494(7436):266–270, 2013.

[179] J. W. Pillow, J. Shlens, L. Paninski, A. Sher, A. M.Litke, E. Chichilnisky, and E. P. Simoncelli. Spatio-temporal correlations and visual signalling in a completeneuronal population. Nature, 454(7207):995–999, 2008.

[180] T. Plefka. Convergence condition of the TAP equationsfor the infinite-ranged Ising spin glass model. J. Phys.A: Math. Gen., 15:1971, 1982.

[181] W. H. Press, S. A. Teukolsky, W. T. Vetterling, andB. P. Flannery. Numerical recipies in C++. CambridgeUniversity Press, 2002.

[182] S. Presse, K. Ghosh, J. Lee, and K. A. Dill. Principlesof maximum entropy and maximum caliber in statisticalphysics. Rev. Mod. Phys., 85(3):1115, 2013.

[183] P. Ravikumar, M. J. Wainwright, and J. D. Laf-ferty. High-dimensional Ising model selection using l1-regularised logistic regression. Ann. Stat., 38(3):1287,2010.

[184] J. M. Rebesco, I. H. Stevenson, K. Koerding, S. A. Solla,and L. E. Miller. Rewiring neural interactions by micro-stimulation. Frontiers in systems neuroscience, 4:39,2010.

[185] M. Rechtsman, F. Stillinger, and S. Torquato. De-signed interaction potentials via inverse methods forself-assembly. Phys. Rev. E, 73(1):011406, 2006.

[186] F. Ricci-Tersenghi. On mean-field approximations forestimating correlations and solving the inverse Isingproblem. J. Stat. Mech., page P08015, 2012.

[187] M. V. Rockman. Reverse engineering the genotype–phenotype map with natural genetic variation. Nature,456(7223):738–744, 2008.

[188] Y. Roudi, E. Aurell, and J. A. Hertz. Statistical physicsof pairwise probability models. Front. Comput. Neu-rosci., 3:22, 2009.

[189] Y. Roudi and J. Hertz. Dynamical TAP equations fornon-equilibrium Ising spin glasses. Journal of Statistical


Mechanics: Theory and Experiment, 2011(03):P03031,2011.

[190] Y. Roudi and J. Hertz. Mean-field theory for nonequi-librium network reconstruction. Phys. Rev. Lett.,106(4):048702, 2011.

[191] Y. Roudi, S. Nirenberg, and P. E. Latham. Pairwisemaximum entropy models for studying large biologi-cal systems: when they can work and when they can’t.PLoS Comput Biol, 5(5):e1000380, 2009.

[192] Y. Roudi and G. Taylor. Learning with hidden variables.Current opinion in neurobiology, 35:110–118, 2015.

[193] Y. Roudi, J. Tyrcha, and J. Hertz. Ising model forneural data: model quality and approximate methodsfor extracting functional connectivity. Physical ReviewE, 79(5):051915, 2009.

[194] N. P. Santhanam and M. J. Wainwright. Information-theoretic limits of selecting binary graphical models inhigh dimensions. IEEE Transactions on InformationTheory, 58(7):4117–4134, 2012.

[195] M. Santolini, T. Mora, and V. Hakim. A general pair-wise interaction model provides an accurate descriptionof in vivo transcription factor binding sites. PLoS ONE,9(6):e99015, 2014.

[196] E. Schneidman, M. J. Berry, R. Segev, and W. Bialek.Weak pairwise correlations imply strongly correlatednetwork states in a neural population. Nature,440(7087):1007–12, 2006.

[197] D. J. Schwab, I. Nemenman, and P. Mehta. Zipf’s lawand criticality in multivariate data without fine-tuning.Phys. Rev. Lett., 113(6):068102, 2014.

[198] D. A. Schwarz, M. A. Lebedev, T. L. Hanson, D. F.Dimitrov, G. Lehew, J. Meloy, S. Rajangam, V. Subra-manian, P. J. Ifft, Z. Li, et al. Chronic, wireless record-ings of large-scale brain activity in freely moving rhesusmonkeys. Nature Methods, 11(6):670–676, 2014.

[199] G. Schwarz et al. Estimating the dimension of a model.The annals of statistics, 6(2):461–464, 1978.

[200] G. Sella and A. E. Hirsh. The application of statisticalphysics to evolutionary biology. Proc. Natl. Acad. Sci.USA, 102(27):9541–9546, 2005.

[201] V. Sessak and R. Monasson. Small-correlation expan-sions for the inverse Ising problem. J. Phys. A: Math.Theor., 42(5):055001, 2009.

[202] K. Shekhar, C. F. Ruberman, A. L. Ferguson, J. P. Bar-ton, M. Kardar, and A. K. Chakraborty. Spin modelsinferred from patient-derived viral sequence data faith-fully describe HIV fitness landscapes. Phys. Rev. E,88(6):062705, 2013.

[203] D. Sherrington and S. Kirkpatrick. Solvable model of aspin-glass. Phys. Rev. Lett., 35(26):1792–1796, 1975.

[204] J. Shlens, G. D. Field, J. L. Gauthier, M. I. Grivich,D. Petrusca, A. Sher, A. M. Litke, and E. Chichilnisky.The structure of multi-neuron firing patterns in primateretina. The Journal of Neuroscience, 26(32):8254–8266,2006.

[205] C. Sima, J. Hua, and S. Jung. Inference of gene regula-tory networks using time-series data: a survey. CurrentGenomics, 10(6):416, 2009.

[206] N. Slonim, G. S. Atwal, G. Tkacik, and W. Bialek.Information-based clustering. Proc. Natl. Acad. Sci.USA, 102(51):18297–18302, 2005.

[207] M. Socolich, S. W. Lockless, W. P. Russ, H. Lee, K. H. Gardner, and R. Ranganathan. Evolutionary information for specifying a protein fold. Nature, 437(7058):512–518, 2005.

[208] J. Sohl-Dickstein, P. B. Battaglino, and M. R. DeWeese. New method for parameter estimation in probabilistic models: minimum probability flow. Physical Review Letters, 107(22):220601, 2011.

[209] M. E. Spira and A. Hai. Multi-electrode array technolo-gies for neuroscience and cardiology. Nature Nanotech-nology, 8(2):83–94, 2013.

[210] H. E. Stanley. Introduction to phase transitions andcritical phenomena. Oxford University Press, 1987.

[211] R. R. Stein, D. S. Marks, and C. Sander. Infer-ring pairwise interactions from biological data usingmaximum-entropy probability models. PLoS Comput.Biol., 11(7):e1004182, 2015.

[212] G. J. Stephens, T. Mora, G. Tkacik, and W. Bialek.Statistical thermodynamics of natural images. Phys.Rev. Lett., 110(1):018701, 2013.

[213] J. I. Su lkowska, F. Morcos, M. Weigt, T. Hwa, and J. N.Onuchic. Genomics-aided structure prediction. Proc.Natl. Acad. Sci. USA, 109(26):10340–10345, 2012.

[214] R. H. Swendsen. Monte Carlo calculation of renormal-ized coupling parameters. Phys. Rev. Lett., 52:1165,1984.

[215] T. Tanaka. Mean-field theory of Boltzmann machinelearning. Physical Review E, 58(2):2302, 1998.

[216] T. Tanaka. Information geometry of mean-field approx-imation. Neural Computation, 12(8):1951–1968, 2000.

[217] A. Tang, D. Jackson, J. Hobbs, W. Chen, J. L. Smith,H. Patel, A. Prieto, D. Petrusca, M. I. Grivich, A. Sher,P. Hottowy, W. Dabrowski, A. M. Litke, and J. M.Beggs. A maximum entropy model applied to spa-tial and temporal correlations from cortical networksin vitro. The Journal of Neuroscience, 28(2):505–518,2008.

[218] J. Tegner, M. S. Yeung, J. Hasty, and J. J. Collins.Reverse engineering gene networks: integrating geneticperturbations with dynamical modeling. Proc. Natl.Acad. Sci. USA, 100(10):5944–5949, 2003.

[219] D. J. Thouless, P. W. Anderson, and R. G. Palmer.Solution of a ‘solvable model of a spin glass’. Phil. Mag.,35:593, 1977.

[220] Y. Tikochinsky, N. Tishby, and R. D. Levine. Alterna-tive approach to maximum-entropy inference. PhysicalReview A, 30(5):2638, 1984.

[221] D. Tischer and O. D. Weiner. Illuminating cell signallingwith optogenetic tools. Nature reviews Molecular cellbiology, 15(8):551–558, 2014.

[222] G. Tkacik, O. Marre, D. Amodei, E. Schneidman,W. Bialek, and M. J. Berry II. Searching for collectivebehavior in a large network of sensory neurons. PLoSComput. Biol., 10(1):e1003408, 2014.

[223] G. Tkacik, O. Marre, T. Mora, D. Amodei, M. J.Berry II, and W. Bialek. The simplest maximum en-tropy model for collective behavior in a neural network.Journal of Statistical Mechanics: Theory and Experi-ment, 2013(03):P03011, 2013.

[224] G. Tkacik, T. Mora, O. Marre, D. Amodei, S. E. Palmer,M. J. Berry, and W. Bialek. Thermodynamics and sig-natures of criticality in a network of neurons. Proc. Natl.Acad. Sci. USA, page 201514188, 2015.

[225] G. Tkacik, J. S. Prentice, V. Balasubramanian, andE. Schneidman. Optimal population coding bynoisy spiking neurons. Proc. Natl. Acad. Sci. USA,107(32):14419–14424, 2010.


[226] G. Tkacik, E. Schneidman, I. Berry, J. Michael, andW. Bialek. Spin glass models for a network of real neu-rons. arXiv preprint arXiv:0912.5409, 2009.

[227] S. Torquato. Inverse optimization techniques for targeted self-assembly. Soft Matter, 5(6):1157–1173, 2009.

[228] T. Toyoizumi, K. R. Rad, and L. Paninski. Mean-field approximations for coupled populations of generalized linear model spiking neurons with Markov refractoriness. Neural Computation, 21(5):1203–1243, 2009.

[229] W. Truccolo, U. T. Eden, M. R. Fellows, J. P. Donoghue, and E. N. Brown. A point process framework for relating neural spiking activity to spiking history, neural ensemble, and extrinsic covariate effects. Journal of Neurophysiology, 93(2):1074–1089, 2005.

[230] J. Tyrcha, Y. Roudi, M. Marsili, and J. Hertz. The effect of nonstationarity on models inferred from neural data. Journal of Statistical Mechanics: Theory and Experiment, 2013(03):P03005, 2013.

[231] E. van Nimwegen. Inferring contacting residues within and between proteins: What do the probabilities mean? PLoS Comput. Biol., 12(5):e1004726, 2016.

[232] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research, 11(Dec):3371–3408, 2010.

[233] M. Vuffray, S. Misra, A. Y. Lokhov, and M. Chertkov. Interaction screening: Efficient and sample-optimal learning of Ising models. Advances in Neural Information Processing Systems, pages 2595–2603, 2016.

[234] C. Walsh. Molecular mechanisms that confer antibacterial drug resistance. Nature, 406(6797):775–781, 2000.

[235] J.-S. Wang. Flat histogram Monte Carlo method. Physica A: Statistical Mechanics and its Applications, 281(1):147–150, 2000.

[236] S. Wang, S. Sun, Z. Li, R. Zhang, and J. Xu. Accurate de novo prediction of protein contact map by ultra-deep learning model. PLoS Computational Biology, 13(1):e1005324, 2017.

[237] Z. Wang, M. Gerstein, and M. Snyder. RNA-Seq: a revolutionary tool for transcriptomics. Nature Reviews Genetics, 10(1):57–63, 2009.

[238] T. L. Watkin, A. Rau, and M. Biehl. The statistical mechanics of learning a rule. Reviews of Modern Physics, 65(2):499, 1993.

[239] M. Weigt, R. A. White, H. Szurmant, J. A. Hoch, and T. Hwa. Identification of direct residue contacts in protein-protein interaction by message passing. Proc. Natl. Acad. Sci. USA, 106(1):67–72, 2009.

[240] M. Welling and Y. W. Teh. Approximate inference in Boltzmann machines. Artif. Intell., 143(1):19–50, 2003.

[241] M. Welling and Y. W. Teh. Linear response algorithms for approximate inference in graphical models. Neural Comput., 16:197–221, 2004.

[242] Q. F. Wills, K. J. Livak, A. J. Tipping, T. Enver, A. J. Goldson, D. W. Sexton, and C. Holmes. Single-cell gene expression analysis reveals genetic associations masked in whole-tissue experiments. Nature Biotechnology, 31(8):748–752, 2013.

[243] R. Wise, T. Hart, O. Cars, M. Streulens, R. Helmuth, P. Huovinen, and M. Sprenger. Antimicrobial resistance is a major threat to public health. British Medical Journal, 317(7159):609–611, 1998.

[244] K. Wood, S. Nishida, E. D. Sontag, and P. Cluzel. Mechanism-independent method for predicting response to multidrug combinations in bacteria. Proc. Natl. Acad. Sci. USA, 109(30):12254–12259, 2012.

[245] J. S. Yedidia. An idiosyncratic journey beyond mean field theory. In M. Opper and D. Saad, editors, Advanced Mean-field Methods: Theory and Practice, page 21. The MIT Press, 2001.

[246] J. S. Yedidia, W. Freeman, and Y. Weiss. Constructing free-energy approximations and generalized belief propagation algorithms. IEEE Transactions on Information Theory, 51(7):2282–2312, July 2005.

[247] J. S. Yedidia, W. T. Freeman, and Y. Weiss. Bethe free energy, Kikuchi approximations, and belief propagation algorithms. Advances in Neural Information Processing Systems, 13, 2001.

[248] L. Zdeborova and F. Krzakala. Statistical physics of inference: Thresholds and algorithms. Advances in Physics, 65(5):453–552, 2016.

[249] A. Zeisel, A. B. Munoz-Manchado, S. Codeluppi, P. Lonnerberg, G. La Manno, A. Jureus, S. Marques, H. Munguba, L. He, C. Betsholtz, C. Rolny, G. Castelo-Branco, J. Hjerling-Leffler, and S. Linnarsson. Cell types in the mouse cortex and hippocampus revealed by single-cell RNA-seq. Science, 347(6226):1138–1142, 2015.

[250] H.-L. Zeng, M. Alava, E. Aurell, J. Hertz, and Y. Roudi. Maximum likelihood reconstruction for Ising models with asynchronous updates. Phys. Rev. Lett., 110(21):210601, 2013.

[251] H.-L. Zeng, E. Aurell, M. Alava, and H. Mahmoudi. Network inference using asynchronously updated kinetic Ising model. Phys. Rev. E, 83(4):041135, 2011.

[252] B. Zhang and P. G. Wolynes. Topology, structures, and energy landscapes of human chromosomes. Proc. Natl. Acad. Sci. USA, 112(19):6062–6067, 2015.

[253] G. Zhang, F. Stillinger, and S. Torquato. Probing the limitations of isotropic pair potentials to produce ground-state structural extremes via inverse statistical mechanics. Phys. Rev. E, 88(4):042309, 2013.

[254] J. Zhu, B. Zhang, E. N. Smith, B. Drees, R. B. Brem, L. Kruglyak, R. E. Bumgarner, and E. E. Schadt. Integrating large-scale functional genomic data to dissect the complexity of yeast regulatory networks. Nature Genetics, 40(7):854–861, 2008.