Markovian (and conceivably causal) representations of stochastic processes
Cosma Shalizi
Statistics Dept., Carnegie Mellon University & Santa Fe Institute
10 July 2010, UAI
Outline: Set-Up · Optimality Properties · Reconstruction · Causality · References

Notation and setting
“State”
In classical physics or dynamical systems, "state" is the present variable which fixes all future observables
Back away from determinism: state determines the distribution of future observables
Would like a small state
Would like the state to be well-behaved, e.g. homogeneous Markov
Try to construct states by constructing predictions
Cosma Shalizi Markovian Representations
Notation etc.
Upper-case letters are random variables, lower-case their realizations
Stochastic process . . . , X_{-1}, X_0, X_1, X_2, . . .
X_s^t = (X_s, X_{s+1}, . . . , X_{t-1}, X_t)
Past up to and including t is X_{-∞}^t, future is X_{t+1}^∞
Discrete time is not required but cuts down on measure theory
Making a Prediction
Look at X_{-∞}^t, make a guess about X_{t+1}^∞
Most general guess is a probability distribution
Only ever attend to selected aspects of X_{-∞}^t: mean, variance, phase of 1st three Fourier modes, . . .
∴ guess is a function or statistic of X_{-∞}^t
What’s a good statistic to use?
Predictive Sufficiency
For any statistic σ,
I[X_{t+1}^∞; X_{-∞}^t] ≥ I[X_{t+1}^∞; σ(X_{-∞}^t)]
σ is predictively sufficient iff
I[X_{t+1}^∞; X_{-∞}^t] = I[X_{t+1}^∞; σ(X_{-∞}^t)]
Sufficient statistics retain all predictive information in the data
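A toy illustration of predictive sufficiency (hypothetical code, not from the slides): for a binary first-order Markov chain, the last symbol is a sufficient statistic of the past, so plug-in mutual-information estimates show the inequality above holding with equality for σ = last symbol, and strictly for a lossier statistic. All names here are mine.

```python
import random
from collections import Counter
from math import log2

def mutual_info(pairs):
    """Plug-in estimate of I[A; B] in bits from a list of (a, b) samples."""
    n = len(pairs)
    pab = Counter(pairs)
    pa = Counter(a for a, _ in pairs)
    pb = Counter(b for _, b in pairs)
    return sum((c / n) * log2((c / n) / ((pa[a] / n) * (pb[b] / n)))
               for (a, b), c in pab.items())

random.seed(0)
# Binary Markov chain with Pr(X_{t+1} = X_t) = 0.9
xs = [0]
for _ in range(200_000):
    xs.append(xs[-1] if random.random() < 0.9 else 1 - xs[-1])

# Truncate the "past" to a two-symbol window; sigma = last symbol alone
ts = range(1, len(xs) - 1)
i_window = mutual_info([((xs[t - 1], xs[t]), xs[t + 1]) for t in ts])
i_last = mutual_info([(xs[t], xs[t + 1]) for t in ts])
i_older = mutual_info([(xs[t - 1], xs[t + 1]) for t in ts])

# The last symbol loses nothing; the older symbol alone loses information
print(i_window, i_last, i_older)
```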
Why Care About Sufficiency?
Optimal strategy, under any loss function, only needs a sufficient statistic (Blackwell & Girshick)
Strategies using insufficient statistics can generally be improved (Blackwell & Rao)
Excuse for not worrying about particular loss functions
"Causal" States
(Crutchfield and Young, 1989)
Histories a and b are equivalent iff
Pr(X_{t+1}^∞ | X_{-∞}^t = a) = Pr(X_{t+1}^∞ | X_{-∞}^t = b)
[a] ≡ all histories equivalent to a
The statistic of interest, the causal state, is
ε(x_{-∞}^t) = [x_{-∞}^t]
Set S_t = ε(x_{-∞}^{t-1})
A state is an equivalence class of histories and a distribution over future events
IID = 1 state, periodic = p states
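A minimal sketch of the equivalence-classing (hypothetical code, not from the talk): for a period-2 alternating process, group the length-3 histories the process can produce by their conditional distribution of the next symbol; two classes emerge, matching "periodic = p states".

```python
from collections import defaultdict
from itertools import product

# Period-2 process with random phase: the sequence strictly alternates 0, 1, 0, 1, ...
def next_symbol_dist(history):
    """Conditional distribution (Pr[next = 0], Pr[next = 1]) given a history."""
    return (1.0, 0.0) if history[-1] == 1 else (0.0, 1.0)

# Histories the process can actually produce: strictly alternating strings
histories = [h for h in product((0, 1), repeat=3)
             if all(h[i] != h[i + 1] for i in range(len(h) - 1))]

causal_states = defaultdict(list)  # equivalence classes of histories
for h in histories:
    causal_states[next_symbol_dist(h)].append(h)

print(len(causal_states))  # 2 causal states for a period-2 process
```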
[Figure: set of histories, color-coded by conditional distribution of futures]
[Figure: partitioning histories into causal states]
Optimality Properties: Sufficiency · Markov Properties · Minimalities
Sufficiency
(Shalizi and Crutchfield, 2001)
I[X_{t+1}^∞; X_{-∞}^t] = I[X_{t+1}^∞; ε(X_{-∞}^t)]
because
Pr(X_{t+1}^∞ | S_t = ε(x_{-∞}^t))
= ∫_{y ∈ [x_{-∞}^t]} Pr(X_{t+1}^∞ | X_{-∞}^t = y) Pr(X_{-∞}^t = y | S_t = ε(x_{-∞}^t)) dy
= Pr(X_{t+1}^∞ | X_{-∞}^t = x_{-∞}^t)
[Figure: a non-sufficient partition of histories]
[Figure: effect of insufficiency on predictive distributions]
Markov Properties
Future observations are independent of the past given the causal state:
X_{t+1}^∞ ⊥⊥ X_{-∞}^t | S_{t+1}
by sufficiency:
Pr(X_{t+1}^∞ | X_{-∞}^t = x_{-∞}^t, S_{t+1} = ε(x_{-∞}^t))
= Pr(X_{t+1}^∞ | X_{-∞}^t = x_{-∞}^t)
= Pr(X_{t+1}^∞ | S_{t+1} = ε(x_{-∞}^t))
Recursive Updating/Deterministic Transitions
Recursive transitions for states:
ε(x_{-∞}^{t+1}) = T(ε(x_{-∞}^t), x_{t+1})
Automata theory: "deterministic transitions" (even though there are probabilities)
In continuous time:
ε(x_{-∞}^{t+h}) = T(ε(x_{-∞}^t), x_t^{t+h})
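The point of deterministic transitions is that tracking the state is a table lookup: no re-reading of the whole history. A hypothetical sketch for a period-2 alternating process (state names and transition table are mine, for illustration):

```python
# T maps (current state, newest observation) -> next state
T = {("just_saw_1", 0): "just_saw_0",
     ("just_saw_0", 1): "just_saw_1"}

def filter_states(state, observations):
    """Recursively update the causal state: s_{t+1} = T(s_t, x_{t+1})."""
    trajectory = [state]
    for x in observations:
        state = T[(state, x)]  # constant-time update, history never re-scanned
        trajectory.append(state)
    return trajectory

traj = filter_states("just_saw_1", [0, 1, 0, 1])
print(traj[-1])  # each new state depends only on the old state and the new symbol
```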
If u ∼ v, then for any future event F and single observation a,
Pr(X_{t+1}^∞ ∈ aF | X_{-∞}^t = u) = Pr(X_{t+1}^∞ ∈ aF | X_{-∞}^t = v)
Pr(X_{t+1} = a, X_{t+2}^∞ ∈ F | X_{-∞}^t = u) = Pr(X_{t+1} = a, X_{t+2}^∞ ∈ F | X_{-∞}^t = v)
Pr(X_{t+2}^∞ ∈ F | X_{-∞}^{t+1} = ua) Pr(X_{t+1} = a | X_{-∞}^t = u) = Pr(X_{t+2}^∞ ∈ F | X_{-∞}^{t+1} = va) Pr(X_{t+1} = a | X_{-∞}^t = v)
Pr(X_{t+2}^∞ ∈ F | X_{-∞}^{t+1} = ua) = Pr(X_{t+2}^∞ ∈ F | X_{-∞}^{t+1} = va)
∴ ua ∼ va
(same for continuous values or time, but need more measure theory)
Causal States are Markovian
S_{t+1}^∞ ⊥⊥ S_{-∞}^{t-1} | S_t
because
S_{t+1}^∞ = T(S_t, X_t^∞)
and
X_t^∞ ⊥⊥ {X_{-∞}^{t-1}, S_{-∞}^{t-1}} | S_t
Also, the transitions are homogeneous
Minimality
ε is minimal sufficient
= can be computed from any other sufficient statistic
= for any sufficient η, there exists a function g such that
ε(X_{-∞}^t) = g(η(X_{-∞}^t))
Therefore, if η is sufficient,
I[ε(X_{-∞}^t); X_{-∞}^t] ≤ I[η(X_{-∞}^t); X_{-∞}^t]
[Figure: a sufficient, but not minimal, partition of histories]
[Figure: a partition coarser than the causal states, but not sufficient]
Uniqueness
There is really no other minimal sufficient statistic
If η is minimal, there is an h such that
η = h(ε) a.s.
but ε = g(η) (a.s.), so
g(h(ε)) = ε
h(g(η)) = η
∴ ε and η partition histories in the same way (a.s.)
Minimal stochasticity
If R_t = η(X_{-∞}^{t-1}) is also sufficient, then
H[R_{t+1} | R_t] ≥ H[S_{t+1} | S_t]
∴ the predictive states are the closest we get to a deterministic model without losing predictive power
Entropy Rate
h_1 ≡ lim_{n→∞} H[X_n | X_1^{n-1}] = lim_{n→∞} H[X_n | S_n] = H[X_1 | S_1]
so the predictive states let us calculate the entropy rate
and do source coding
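A sketch of the computation (hypothetical two-state machine, not from the slides): given the states' stationary distribution π and per-state emission distributions, h_1 = H[X_1 | S_1] = Σ_s π_s H[X | S = s].

```python
from math import log2

def entropy(dist):
    """Shannon entropy in bits of a distribution given as {value: prob}."""
    return -sum(p * log2(p) for p in dist.values() if p > 0)

# Hypothetical machine: stationary state distribution and emission distributions
pi = {"a": 0.5, "b": 0.5}
emit = {"a": {0: 0.8, 1: 0.2}, "b": {0: 0.3, 1: 0.7}}

h1 = sum(pi[s] * entropy(emit[s]) for s in pi)  # H[X_1 | S_1], bits per symbol
print(round(h1, 4))
```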
Minimal Markovian Representation
The observed process (X_t) is non-Markovian and ugly
But it is generated from a homogeneous Markov process (S_t)
After minimization, this representation is (essentially) unique
Smaller Markovian representations can exist, but then we always have distributions over those states . . .
. . . and those distributions correspond to predictive states
What Sort of Markov Model?
Common-or-garden HMM:
S_{t+1} ⊥⊥ X_t | S_t
But here
S_{t+1} = T(S_t, X_t)
This is a chain with complete connections (Onicescu and Mihoc, 1935; Iosifescu and Grigorescu, 1990)
[Figure: graphical models for an HMM and for a chain with complete connections (CCC), each over states S_1, S_2, S_3 and observations X_1, X_2, X_3]
Example of a CCC: Even Process
[State diagram: state 1 emits A with probability 0.5 and stays in 1, or emits B with probability 0.5 and moves to 2; state 2 emits B with probability 1.0 and returns to 1]
Blocks of As of any length, separated by even-length blocks of Bs
Not Markov at any order
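A quick empirical check of "not Markov at any order" at order 1 (a sketch; the function names are mine): sample from the two-state machine and compare Pr(next = B | AB) with Pr(next = B | BB). If the observations were first-order Markov these would be equal; here the first is exactly 1 and the second is about 2/3.

```python
import random

random.seed(42)

def sample_even_process(n):
    """State 1: emit A (p=0.5, stay) or B (p=0.5, go to 2); state 2: emit B, go to 1."""
    state, out = 1, []
    for _ in range(n):
        if state == 2 or random.random() >= 0.5:
            out.append("B")
            state = 1 if state == 2 else 2
        else:
            out.append("A")
    return "".join(out)

s = sample_even_process(400_000)

def p_next_is_b(context):
    """Empirical Pr(next symbol = B | preceding symbols = context)."""
    spots = [i for i in range(len(context), len(s)) if s[i - len(context):i] == context]
    return sum(s[i] == "B" for i in spots) / len(spots)

p_ab, p_bb = p_next_is_b("AB"), p_next_is_b("BB")
print(p_ab, p_bb)  # unequal, so the last symbol alone does not determine the next
```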
Inventions
Statistical relevance basis (Salmon, 1971, 1984)
Measure-theoretic prediction process (Knight, 1975, 1992)
Forecasting/true measure complexity (Grassberger, 1986)
Causal states, ε-machine (Crutchfield and Young, 1989)
Observable operator model (Jaeger, 2000)
Predictive state representations (Littman et al., 2002)
Sufficient posterior representation (Langford et al., 2009)
How Broad Are These Results?
Knight (1975, 1992) gave the most general constructions:
Non-stationary X_t, continuous time (but discrete works as a special case)
X_t with values in a Lusin space (= image of a complete separable metrizable space under a measurable bijection)
S_t is a homogeneous strong Markov process with deterministic updating
S_t has cadlag sample paths (in appropriate topologies on infinite-dimensional distributions)
A Cousin: The Information Bottleneck
(Tishby et al., 1999)
For inputs X and outputs Y, fix β > 0 and find η(X), the bottleneck variable, minimizing
I[η(X); X] − β I[η(X); Y]
i.e., each bit of predictive information is worth β bits of memory
Predictive sufficiency comes as β → ∞, when we are unwilling to lose any predictive power
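A brute-force sketch of the bottleneck trade-off on a tiny example (entirely hypothetical, not from the talk): X is uniform on {0, 1, 2, 3}, Y is the parity of X passed through 10% binary noise; searching all labelings of X for the minimizer of I[η(X); X] − β I[η(X); Y] at large β recovers the parity partition, which is exactly the sufficient statistic for Y.

```python
from itertools import product
from math import log2

# Joint distribution: X uniform on {0,1,2,3}; Y = parity of X, flipped w.p. 0.1
p_xy = {(x, y): 0.25 * (0.9 if y == x % 2 else 0.1)
        for x in range(4) for y in (0, 1)}

def mutual_info(joint):
    """I[A; B] in bits from a joint distribution {(a, b): prob}."""
    pa, pb = {}, {}
    for (a, b), p in joint.items():
        pa[a] = pa.get(a, 0.0) + p
        pb[b] = pb.get(b, 0.0) + p
    return sum(p * log2(p / (pa[a] * pb[b])) for (a, b), p in joint.items() if p > 0)

def ib_objective(labels, beta):
    """labels[x] is the bottleneck cluster of x; returns I[T;X] - beta * I[T;Y]."""
    p_tx, p_ty = {}, {}
    for (x, y), p in p_xy.items():
        t = labels[x]
        p_tx[(t, x)] = p_tx.get((t, x), 0.0) + p
        p_ty[(t, y)] = p_ty.get((t, y), 0.0) + p
    return mutual_info(p_tx) - beta * mutual_info(p_ty)

beta = 10.0
best = min(product(range(4), repeat=4), key=lambda lab: ib_objective(lab, beta))
print(best)  # groups {0, 2} vs {1, 3}: keep the parity, forget everything else
```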
Extension 1: Input-Output
(Littman et al., 2002; Shalizi, 2001, ch. 7)
System output (X_t), input (Y_t)
Histories x_{-∞}^t, y_{-∞}^t have distributions of output x_{t+1} for each further input y_{t+1}
Equivalence-class these distributions and enforce recursive updating
Internal states of the system; not trying to predict future inputs
Extension 2: Space and Time
(Shalizi, 2003; Shalizi et al., 2004, 2006; Jänicke et al., 2007)
Dynamic random field X(r, t)
Past cone: points in space-time which could matter to X(r, t)
Future cone: points in space-time for which X(r, t) could matter
[Figure: past and future light cones]
Equivalence-class past-cone configurations by conditional distributions over future cones
S(r, t) is a Markov field
Minimal sufficiency, recursive updating, etc., all go through
Statistical Complexity
Definition (Grassberger, 1986; Crutchfield and Young, 1989)
C ≡ I[ε(X_{-∞}^t); X_{-∞}^t] is the statistical forecasting complexity of the process
= amount of information about the past needed for optimal prediction
= H[ε(X_{-∞}^t)] for discrete causal states
= expected algorithmic sophistication (Gács et al., 2001)
= log(period) for periodic processes
= log(geometric mean(recurrence time)) for stationary processes
A property of the process, not of the learning problem
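Back-of-the-envelope for the even process of the earlier slide (my arithmetic, not the slides'): its two causal states have stationary distribution (2/3, 1/3), which follows from the machine's transition probabilities, so C = H[2/3, 1/3] ≈ 0.918 bits while the entropy rate is only 2/3 bit per symbol: the process needs more memory than the randomness it emits.

```python
from math import log2

# Stationary distribution of the two causal states: solves pi = pi P for the
# even-process machine (state 1 goes to 1 or 2 with prob 1/2 each; 2 goes to 1)
pi = (2 / 3, 1 / 3)

C = -sum(p * log2(p) for p in pi)  # statistical complexity H[S], in bits
h = pi[0] * 1.0 + pi[1] * 0.0      # entropy rate: state 1 tosses a fair coin
print(round(C, 3), round(h, 3))
```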
Connecting to Data

Everything so far has been math/probability
(The Oracle tells us the infinite-dimensional distribution of X)
Can we do some statistics and find the states?
Two senses of “find”: learn in a fixed model vs. discover the right model
Learning

Given states and transitions (ε, T) and a realization x^n_1
Estimate Pr(X_{t+1} = x | S_t = s)
Just estimation for stochastic processes
Easier than ordinary HMMs because S_t is a function of the trajectory
Exponential families in the all-discrete case, very tractable
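Because ε assigns each history to a known state, fitting the transition probabilities is just counting. A sketch of the all-discrete case (the toy `eps` below, which uses only the last symbol, is my stand-in for a given causal-state map):

```python
from collections import Counter, defaultdict

def estimate_emissions(xs, eps):
    """MLE of Pr(X_{t+1} = x | S_t = s), where eps maps the observed
    history x_1..x_t to the state s_t (all-discrete case)."""
    counts = defaultdict(Counter)
    for t in range(len(xs) - 1):
        s = eps(xs[: t + 1])          # state after seeing x_1..x_t
        counts[s][xs[t + 1]] += 1     # symbol emitted next from s
    return {s: {x: n / sum(c.values()) for x, n in c.items()}
            for s, c in counts.items()}

# Toy state map: the state is just the last symbol (order-1 Markov chain).
probs = estimate_emissions("ABABBABAB", eps=lambda h: h[-1])
print(probs["A"]["B"])  # 1.0: in this sample, A is always followed by B
```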
Discovery

Given x^n_1
Estimate ε, T, and Pr(X_{t+1} = x | S_t = s)
Inspiration: “geometry from a time series” in nonlinear dynamics
Conditional independence testing reminiscent of the PC algorithm
Function learning approach (Langford et al., 2009)
Nobody seems to have tried non-parametric Bayes
“Geometry from a Time Series”

Deterministic dynamical system with state z_t on a smooth manifold of dimension m, z_{t+1} = f(z_t)
Only identified up to a smooth, invertible change of coordinates (a diffeomorphism)
Observe a time series of a single smooth, instantaneous function of the state, x_t = g(z_t)
Set s_t = (x_t, x_{t−1}, ..., x_{t−k+1})
Generically, if k ≥ 2m + 1, then z_t = φ(s_t)
φ is smooth and invertible
φ commutes with time evolution, φ(s_{t+1}) = f(φ(s_t))
Regressing s_{t+1} on s_t gives φ^{−1} ◦ f ◦ φ
Idea due to Packard et al. (1980) and Takens (1981); modern review in Kantz and Schreiber (2004)
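The delay-coordinate construction above is easy to sketch; here the logistic map stands in for the deterministic system, and all names are mine:

```python
def delay_embed(x, k):
    """Delay vectors s_t = (x_t, x_{t-1}, ..., x_{t-k+1}),
    one per t = k-1, ..., len(x)-1."""
    return [tuple(x[t - i] for i in range(k)) for t in range(k - 1, len(x))]

# Observable from a deterministic system: the logistic map z_{t+1} = 4z(1-z),
# observed directly (g = identity), so m = 1 and k = 3 >= 2m + 1 suffices.
x = [0.3]
for _ in range(99):
    x.append(4 * x[-1] * (1 - x[-1]))

S = delay_embed(x, k=3)
print(len(S), S[0])  # 98 delay vectors; S[0] = (x_2, x_1, x_0)
```

Regressing the next delay vector on the current one would then recover the conjugated dynamics described above.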
CSSR: Causal State Splitting Reconstruction

Key observation: recursion + one-step-ahead predictive sufficiency ⇒ general predictive sufficiency
Get the next-step distribution right by independence testing
Then make the states recursive
Assumes discrete observations, discrete time, finite causal states
Paper: Shalizi and Klinkner (2004); C++ code, http://bactra.org/CSSR/
One-Step Ahead Prediction

Start with all histories in the same state
Given the current partition of histories into states, test whether going one step further back into the past changes the next-step conditional distribution
Use a hypothesis test to control the false-positive rate
If yes, split that cell of the partition, but see if it matches an existing distribution
Must allow this merging or else lose minimality
If no match, add a new cell to the partition
Stop when no more divisions can be made or a maximum history length Λ is reached
For consistency, Λ < (log n)/h^{1+ι} for some ι
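One refinement pass can be sketched as follows; for brevity a fixed total-variation threshold stands in for the hypothesis test the real algorithm uses, and every name here is mine, not CSSR's:

```python
def split_pass(states, next_dist, threshold=0.1):
    """One CSSR-style splitting/merging pass (sketch).
    states: current partition of histories (list of sets).
    next_dist(h): empirical next-symbol distribution after history h.
    Histories whose distributions agree within `threshold` in total
    variation share a state; anything else founds a new state."""
    def tv(p, q):  # total-variation distance between two finite dists
        keys = set(p) | set(q)
        return 0.5 * sum(abs(p.get(k, 0) - q.get(k, 0)) for k in keys)

    new_states, reps = [], []
    for cell in states:
        for h in cell:
            d = next_dist(h)
            for i, r in enumerate(reps):
                if tv(d, r) <= threshold:   # matches an existing state
                    new_states[i].add(h)
                    break
            else:                           # no match: open a new state
                new_states.append({h})
                reps.append(d)
    return new_states

# Toy distributions: "B" predicts differently, so it gets its own state.
dists = {"A": {"A": 0.5, "B": 0.5}, "BB": {"A": 0.5, "B": 0.5}, "B": {"B": 1.0}}
parts = split_pass([set(dists)], dists.get)
print(len(parts))  # 2
```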
Ensuring Recursive Transitions

Need to determinize a probabilistic automaton
Several ways of doing this; technical and not worth going into here
Trickiest part of the algorithm, and it can influence the finite-sample behavior
Convergence

S = true causal state structure
S_n = structure reconstructed from n data points
Assume: finite # of states, every state has a finite history, long enough histories used, plus technicalities:

Pr(S_n ≠ S) → 0

D = true distribution, D_n = inferred distribution
Error scales like independent samples:

E[‖D_n − D‖_TV] = O(n^{−1/2})
Handwaving

Empirical conditional distributions for histories converge (large deviations principle for Markov chains)
Histories in the same state become harder to accidentally separate
Histories in different states become harder to confuse
Each state’s predictive distribution converges at rate O(n^{−1/2}) (from the LDP again, taking a mixture)
Example: The Even Process

[State diagram: two states, 1 and 2, with transitions labeled A | 0.5, B | 0.5, and B | 1.0]
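Assuming the standard even-process machine that the diagram depicts (state 1 emits A or B with probability 1/2, where emitting B forces a visit to state 2, which emits B with certainty), a sampler is a few lines; the names are mine:

```python
import random

def sample_even_process(n, seed=0):
    """Draw n symbols from the even process: maximal blocks of B's
    between A's have even length."""
    rng = random.Random(seed)
    state, out = 1, []
    for _ in range(n):
        if state == 1 and rng.random() < 0.5:
            out.append("A")                 # A | 0.5, stay in state 1
        elif state == 1:
            out.append("B"); state = 2      # B | 0.5, go to state 2
        else:
            out.append("B"); state = 1      # B | 1.0, back to state 1
    return "".join(out)

xs = sample_even_process(1000)
# Every completed block of B's has even length:
print(all(len(b) % 2 == 0 for b in xs.split("A")[:-1]))  # True
```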
[Reconstructed state diagram: two states, 1 and 2, with transitions labeled A | 0.497, B | 0.503, and B | 1.000]

reconstruction with Λ = 3, n = 1000, α = 0.005
Some Uses

Geomagnetic fluctuations (Clarke et al., 2003)
Natural language processing (Padró and Padró, 2005a,b,c, 2007a,b)
Anomaly detection (Friedlander et al., 2003a,b; Ray, 2004)
Information sharing in networks (Klinkner et al., 2006; Shalizi et al., 2007)
Social media propagation (Cointet et al., 2007)
Neural spike train analysis (Haslinger et al., 2010)
About “Causal”

The term “causal states” was introduced by Crutchfield and Young (1989), without too much precision
All about probabilistic prediction, not counterfactuals
(selecting sub-ensembles of naturally-occurring trajectories, not enforcing certain trajectories)
Still, those screening-off properties are really suggestive
Back to Physics

(Shalizi and Moore, 2003)
Assume: microscopic state Z_t ∈ Z, with an evolution operator f
Assume: micro-states support counterfactuals
Assume: we never get to see Z_t; instead we deal with X_t = γ(Z_t)
The X_t are coarse-grained, macroscopic variables
Each macrovariable gives a partition Γ of Z
Sequences of X_t values refine Γ:

Γ^(T) = ⋀_{t=1}^{T} f^{−t} Γ

ε partitions the histories of X
∴ ε joins cells of Γ^(∞)
∴ ε induces a partition ∆ of Z
This is a new, Markovian coarse-grained variable
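On a finite micro-state space the refinement Γ^(T) = ⋀_{t=1}^{T} f^{−t} Γ is just the common refinement of the preimage partitions; a toy sketch (the four-point space, the map f, and the partition Γ are invented for illustration):

```python
def refine(Z, f, Gamma, T):
    """Common refinement of f^{-t}(Gamma), t = 1..T: micro-states z, z'
    share a cell iff f^t(z) and f^t(z') land in the same cell of Gamma
    for every t <= T."""
    def cell(z):                      # index of z's cell in Gamma
        return next(i for i, c in enumerate(Gamma) if z in c)
    def f_t(z, t):                    # t-fold iterate of f
        for _ in range(t):
            z = f(z)
        return z
    signature = {z: tuple(cell(f_t(z, t)) for t in range(1, T + 1)) for z in Z}
    cells = {}
    for z, sig in signature.items():
        cells.setdefault(sig, set()).add(z)
    return list(cells.values())

Z = [0, 1, 2, 3]
f = lambda z: (z + 1) % 4             # rotation on four micro-states
Gamma = [{0, 1}, {2, 3}]              # a 2-cell coarse-graining
print(refine(Z, f, Gamma, T=1))       # 2 cells: {0, 3} and {1, 2}
print(len(refine(Z, f, Gamma, T=2)))  # 4: refined down to singletons
```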
Connecting to Causality

Interventions moving z from one cell of ∆ to another change the distribution of X^∞_{t+1}
Changing z inside a cell of ∆ might still make a difference
“There must be at least this much structure”
Summary

Your stochastic process has a unique, minimal Markovian representation
This representation has nice predictive properties
Can reconstruct from sample data in some cases . . .
. . . and a lot more could be done in this line
The Markov states have the right screening-off properties for causal models
Clarke, Richard W., Mervyn P. Freeman and Nicholas W. Watkins (2003). “Application of Computational Mechanics to the Analysis of Natural Data: An Example in Geomagnetism.” Physical Review E, 67: 0126203. URL http://arxiv.org/abs/cond-mat/0110228.
Cointet, Jean-Philippe, Emmanuel Faure and Camille Roth (2007). “Intertemporal topic correlations in online media.” In Proceedings of the International Conference on Weblogs and Social Media [ICWSM]. Boulder, CO, USA. URL http://camille.roth.free.fr/travaux/cointetfaureroth-icwsm-cr4p.pdf.
Crutchfield, James P. and Karl Young (1989). “Inferring Statistical Complexity.” Physical Review Letters, 63: 105–108. URL http://www.santafe.edu/~cmg/compmech/pubs/ISCTitlePage.htm.
Friedlander, David S., Shashi Phoha and Richard Brooks (2003a). “Determination of Vehicle Behavior based on Distributed Sensor Network Data.” In Advanced Signal Processing Algorithms, Architectures, and Implementations XIII (Franklin T. Luk, ed.), vol. 5205 of Proceedings of the SPIE. Bellingham, WA: SPIE. Presented at SPIE’s 48th Annual Meeting, 3–8 August 2003, San Diego, CA.
Friedlander, David S., Isanu Chattopadhayay, Asok Ray, Shashi Phoha and Noah Jacobson (2003b). “Anomaly Prediction in Mechanical System Using Symbolic Dynamics.” In Proceedings of the 2003 American Control Conference, Denver, CO, 4–6 June 2003.
Gács, Péter, John T. Tromp and Paul M. B. Vitanyi (2001). “Algorithmic Statistics.” IEEE Transactions on Information Theory, 47: 2443–2463. URL http://arxiv.org/abs/math.PR/0006233.
Grassberger, Peter (1986). “Toward a Quantitative Theory of Self-Generated Complexity.” International Journal of Theoretical Physics, 25: 907–938.
Haslinger, Robert, Kristina Lisa Klinkner and Cosma Rohilla Shalizi (2010). “The Computational Structure of Spike Trains.” Neural Computation, 22: 121–157. URL http://arxiv.org/abs/1001.0036. doi:10.1162/neco.2009.12-07-678.
Iosifescu, Marius and Serban Grigorescu (1990). Dependence with Complete Connections and Its Applications. Cambridge, England: Cambridge University Press. Revised paperback printing, 2009.
Jaeger, Herbert (2000). “Observable Operator Models for Discrete Stochastic Time Series.” Neural Computation, 12: 1371–1398. URL http://www.faculty.iu-bremen.de/hjaeger/pubs/oom_neco00.pdf.
Jänicke, Heike, Alexander Wiebel, Gerik Scheuermann and Wolfgang Kollmann (2007). “Multifield Visualization Using Local Statistical Complexity.” IEEE Transactions on Visualization and Computer Graphics, 13: 1384–1391. URL http://www.informatik.uni-leipzig.de/bsv/Jaenicke/Papers/vis07.pdf. doi:10.1109/TVCG.2007.70615.
Kantz, Holger and Thomas Schreiber (2004). Nonlinear Time Series Analysis. Cambridge, England: Cambridge University Press, 2nd edn.
Klinkner, Kristina Lisa, Cosma Rohilla Shalizi and Marcelo F. Camperi (2006). “Measuring Shared Information and Coordinated Activity in Neuronal Networks.” In Advances in Neural Information Processing Systems 18 (NIPS 2005) (Yair Weiss and Bernhard Schölkopf and John C. Platt, eds.), pp. 667–674. Cambridge, Massachusetts: MIT Press. URL http://arxiv.org/abs/q-bio.NC/0506009.
Knight, Frank B. (1975). “A Predictive View of Continuous Time Processes.” Annals of Probability, 3: 573–596. URL http://projecteuclid.org/euclid.aop/1176996302.
— (1992). Foundations of the Prediction Process. Oxford: Clarendon Press.
Langford, John, Ruslan Salakhutdinov and Tong Zhang (2009). “Learning Nonlinear Dynamic Models.” Electronic preprint. URL http://arxiv.org/abs/0905.3369.
Littman, Michael L., Richard S. Sutton and Satinder Singh (2002). “Predictive Representations of State.” In Advances in Neural Information Processing Systems 14 (NIPS 2001) (Thomas G. Dietterich and Suzanna Becker and Zoubin Ghahramani, eds.), pp. 1555–1561. Cambridge, Massachusetts: MIT Press. URL http://www.eecs.umich.edu/~baveja/Papers/psr.pdf.
Onicescu, Octav and Gheorghe Mihoc (1935). “Sur les chaînes de variables statistiques.” Comptes Rendus de l’Académie des Sciences de Paris, 200: 511–512.
Packard, Norman H., James P. Crutchfield, J. Doyne Farmer and Robert S. Shaw (1980). “Geometry from a Time Series.” Physical Review Letters, 45: 712–716.
Padró, Muntsa and Lluís Padró (2005a). “Applying a Finite Automata Acquisition Algorithm to Named Entity Recognition.” In Proceedings of the 5th International Workshop on Finite-State Methods and Natural Language Processing (FSMNLP’05). URL http://www.lsi.upc.edu/~nlp/papers/2005/fsmnlp05-pp.pdf.
— (2005b). “Approaching Sequential NLP Tasks with an Automata Acquisition Algorithm.” In Proceedings of the International Conference on Recent Advances in NLP (RANLP’05). URL http://www.lsi.upc.edu/~nlp/papers/2005/ranlp05-pp.pdf.
— (2005c). “A Named Entity Recognition System Based on a Finite Automata Acquisition Algorithm.” Procesamiento del Lenguaje Natural, 35: 319–326. URL http://www.lsi.upc.edu/~nlp/papers/2005/sepln05-pp.pdf.
— (2007a). “ME-CSSR: an Extension of CSSR using Maximum Entropy Models.” In Proceedings of Finite State Methods for Natural Language Processing (FSMNLP) 2007. URL http://www.lsi.upc.edu/%7Enlp/papers/2007/fsmnlp07-pp.pdf.
— (2007b). “Studying CSSR Algorithm Applicability on NLP Tasks.” Procesamiento del Lenguaje Natural, 39: 89–96. URL http://www.lsi.upc.edu/%7Enlp/papers/2007/sepln07-pp.pdf.
Ray, Asok (2004). “Symbolic dynamic analysis of complex systems for anomaly detection.” Signal Processing, 84: 1115–1130.
Salmon, Wesley C. (1971). Statistical Explanation and Statistical Relevance. Pittsburgh: University of Pittsburgh Press. With contributions by Richard C. Jeffrey and James G. Greeno.
— (1984). Scientific Explanation and the Causal Structure of the World. Princeton: Princeton University Press.
Shalizi, Cosma Rohilla (2001). Causal Architecture, Complexity and Self-Organization in Time Series and Cellular Automata. Ph.D. thesis, University of Wisconsin-Madison. URL http://bactra.org/thesis/.
— (2003). “Optimal Nonlinear Prediction of Random Fields on Networks.” Discrete Mathematics and Theoretical Computer Science, AB(DMCS): 11–30. URL http://arxiv.org/abs/math.PR/0305160.
Shalizi, Cosma Rohilla, Marcelo F. Camperi and Kristina Lisa Klinkner (2007). “Discovering Functional Communities in Dynamical Networks.” In Statistical Network Analysis: Models, Issues, and New Directions (Edo Airoldi and David M. Blei and Stephen E. Fienberg and Anna Goldenberg and Eric P. Xing and Alice X. Zheng, eds.), vol. 4503 of Lecture Notes in Computer Science, pp. 140–157. New York: Springer-Verlag. URL http://arxiv.org/abs/q-bio.NC/0609008.
Shalizi, Cosma Rohilla and James P. Crutchfield (2001). “Computational Mechanics: Pattern and Prediction, Structure and Simplicity.” Journal of Statistical Physics, 104: 817–879. URL http://arxiv.org/abs/cond-mat/9907176.
Shalizi, Cosma Rohilla, Robert Haslinger, Jean-Baptiste Rouquier, Kristina Lisa Klinkner and Cristopher Moore (2006). “Automatic Filters for the Detection of Coherent Structure in Spatiotemporal Systems.” Physical Review E, 73: 036104. URL http://arxiv.org/abs/nlin.CG/0508001.
Shalizi, Cosma Rohilla and Kristina Lisa Klinkner (2004). “Blind Construction of Optimal Nonlinear Recursive Predictors for Discrete Sequences.” In Uncertainty in Artificial Intelligence: Proceedings of the Twentieth Conference (UAI 2004) (Max Chickering and Joseph Y. Halpern, eds.), pp. 504–511. Arlington, Virginia: AUAI Press. URL http://arxiv.org/abs/cs.LG/0406011.
Shalizi, Cosma Rohilla, Kristina Lisa Klinkner and Robert Haslinger (2004). “Quantifying Self-Organization with Optimal Predictors.” Physical Review Letters, 93: 118701. URL http://arxiv.org/abs/nlin.AO/0409024.
Shalizi, Cosma Rohilla and Cristopher Moore (2003). “What Is a Macrostate? From Subjective Measurements to Objective Dynamics.” Electronic pre-print. URL http://arxiv.org/abs/cond-mat/0303625.
Takens, Floris (1981). “Detecting Strange Attractors in Fluid Turbulence.” In Symposium on Dynamical Systems and Turbulence (D. A. Rand and L. S. Young, eds.), pp. 366–381. Berlin: Springer-Verlag.
Tishby, Naftali, Fernando C. Pereira and William Bialek (1999). “The Information Bottleneck Method.” In Proceedings of the 37th Annual Allerton Conference on Communication, Control and Computing (B. Hajek and R. S. Sreenivas, eds.), pp. 368–377. Urbana, Illinois: University of Illinois Press. URL http://arxiv.org/abs/physics/0004057.