
Determinism, Complexity, and Predictability in Computer Performance

Joshua Garland∗, Ryan G. James‡ and Elizabeth Bradley∗†
∗Dept. of Computer Science, University of Colorado, Boulder, Colorado 80309-0430 USA

Email: [email protected]
†Santa Fe Institute, 1399 Hyde Park Road, Santa Fe, New Mexico 87501 USA

Email: [email protected]
‡Complexity Sciences Center & Physics Dept., University of California, Davis, California 95616 USA

Email: [email protected]

Abstract—Computers are deterministic dynamical systems [1]. Among other things, that implies that one should be able to use deterministic forecast rules to predict their behavior. That statement is sometimes—but not always—true. The memory and processor loads of some simple programs are easy to predict, for example, but those of more-complex programs like gcc are not. The goal of this paper is to determine why that is the case. We conjecture that, in practice, complexity can effectively overwhelm the predictive power of deterministic forecast models. To explore that, we build models of a number of performance traces from different programs running on different Intel-based computers. We then calculate the permutation entropy—a temporal entropy metric that uses ordinal analysis—of those traces and correlate those values against the prediction success.

I. INTRODUCTION

Computers are among the most complex engineered artifacts in current use. Modern microprocessor chips contain multiple processing units and multi-layer memories, for instance, and they use complicated hardware/software strategies to move data and threads of computation across those resources. These features—along with all the others that go into the design of these chips—make the patterns of their processor loads and memory accesses highly complex and hard to predict. Accurate forecasts of these quantities, if one could construct them, could be used to improve computer design. If one could predict that a particular computational thread would be bogged down for the next 0.6 seconds waiting for data from main memory, for instance, one could save power by putting that thread on hold for that time period (e.g., by migrating it to a processing unit whose clock speed is scaled back). Computer performance traces are, however, very complex. Even a simple “microkernel,” like a three-line loop that repeatedly initializes a matrix in column-major order, can produce chaotic performance traces [1], as shown in Figure 1, and chaos places fundamental limits on predictability.

[Figure 1: cache misses vs. time (instructions x 100,000).]

Fig. 1. A small snippet of the L2 cache miss rate of col_major, a three-line C program that repeatedly initializes a matrix in column-major order, running on an Intel Core Duo®-based machine. Even this simple program exhibits chaotic performance dynamics.

The computer systems community has applied a variety of prediction strategies to traces like this, most of which employ regression. An appealing alternative builds on the recently established fact that computers can be effectively modeled as deterministic nonlinear dynamical systems [1]. This result implies the existence of a deterministic forecast rule for those dynamics. In particular, one can use delay-coordinate embedding to reconstruct the underlying dynamics of computer performance, then use the resulting model to forecast the future values of computer performance metrics such as memory or processor loads [2]. In the case of simple microkernels like the one that produced the trace in Figure 1, this deterministic modeling and forecast strategy works very well. In more-complicated programs, however, such as speech recognition software or compilers, this forecast strategy—as well as the traditional methods—breaks down quickly.

This paper is a first step in understanding when, why, and how deterministic forecast strategies fail when they are applied to deterministic systems. We focus here on the specific example of computer performance. We conjecture that the complexity of traces from these systems—which results from the inherent dimension, nonlinearity, and nonstationarity of the dynamics, as well as from measurement issues like noise, aggregation, and finite data length—can make those deterministic signals effectively unpredictable. We argue that permutation entropy [3], a method for measuring the entropy of a real-valued, finite-length time series through ordinal analysis, is an effective way to explore that conjecture. We study four examples—two simple microkernels and two complex programs from the SPEC benchmark suite—running on different Intel-based machines. For each program, we calculate the permutation entropy of the processor load (instructions per cycle) and memory-use efficiency (cache-miss rates), then compare that to the prediction accuracy attainable for that trace using a simple deterministic model.

II. MODELING COMPUTER PERFORMANCE

Delay-coordinate embedding allows one to reconstruct a system’s full state-space dynamics from a single scalar time-series measurement—provided that some conditions hold regarding that data. Specifically, if the underlying dynamics and the measurement function—the mapping from the unknown state vector X to the scalar value x that one is measuring—are both smooth and generic, Takens [4] formally proves that the delay-coordinate map

F_(τ,m)(x) = [x(t), x(t + τ), . . . , x(t + mτ)]

[Figure 2: axes cache misses (t), cache misses (t + τ), cache misses (t + 2τ).]

Fig. 2. A 3D projection of a delay-coordinate embedding of the trace from Figure 1 with a delay (τ) of 100,000 instructions.

from a d-dimensional smooth compact manifold M to ℝ^(2d+1), where t is time, is a diffeomorphism on M—in other words, that the reconstructed dynamics and the true (hidden) dynamics have the same topology.

This is an extremely powerful result: among other things, it means that one can build a formal model of the full system dynamics without measuring (or even knowing) every one of its state variables. This is the foundation of the modeling approach that is used in this paper. The first step in the process is to estimate values for the two free parameters in the delay-coordinate map: the delay τ and the dimension m. We follow standard procedures for this, choosing the first minimum in the average mutual information as an estimate of τ [5] and using the false nearest neighbor method of [6], with a threshold of 10%, to estimate m. A plot of the data from Figure 1, embedded following this procedure, is shown in Figure 2. The coordinates of each point on this plot are differently delayed elements of the col_major L2 cache miss rate time series y(t): that is, y(t) on the first axis, y(t + τ) on the second, y(t + 2τ) on the third, and so on. Structure in these kinds of plots—clearly visible in Figure 2—is an indication of determinism¹. That structure can also be used to build a forecast model.
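As a concrete, hypothetical illustration of this reconstruction step, the following Python sketch builds delay vectors [x(t), x(t + τ), . . . , x(t + mτ)] from a scalar trace. It is not the authors' tooling; the trace, τ (in samples), and m are assumed to have been chosen already, e.g., via average mutual information and false nearest neighbors as described above.

```python
import numpy as np

def delay_embed(x, tau, m):
    """Delay-coordinate reconstruction: each row is [x(t), x(t+tau), ..., x(t+m*tau)]."""
    x = np.asarray(x, dtype=float)
    n_points = len(x) - m * tau
    if n_points <= 0:
        raise ValueError("time series too short for this tau and m")
    # Column j of the result is the series shifted by j*tau samples.
    return np.column_stack([x[j * tau : j * tau + n_points] for j in range(m + 1)])

# Hypothetical usage: a stand-in signal embedded in a 12-dimensional space (m = 11),
# with tau = 1 sample, where one sample = 100,000 instructions in the paper's traces.
trace = np.sin(0.01 * np.arange(10_000))
points = delay_embed(trace, tau=1, m=11)
print(points.shape)  # (9989, 12)
```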

Given a nonlinear model of a deterministic dynamical system in the form of a delay-coordinate embedding like Figure 2, one can build deterministic forecast algorithms by capturing and exploiting the geometry of the embedding.

¹A deeper analysis of Figure 2—as alluded to on the previous page—supports that diagnosis, confirming the presence of a chaotic attractor in these cache-miss dynamics, with largest Lyapunov exponent λ1 = 8000 ± 200 instructions, embedded in a 12-dimensional reconstruction space [1].


Many techniques have been developed by the dynamical systems community for this purpose (e.g., [7], [8]). Perhaps the most straightforward is the “Lorenz method of analogues” (LMA), which is essentially nearest-neighbor prediction in the embedded state space [9]. Even this simple algorithm—which builds predictions by finding the nearest neighbor in the embedded space of the given point, then taking that neighbor’s path as the forecast—works quite well on the trace in Figure 1, as shown in Figure 3.
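The following is a minimal sketch of the method-of-analogues idea: embed the training data, then forecast by repeatedly jumping to the nearest stored state and reading off its successor. It assumes, for simplicity, that the delay equals one sampling step so the forecast state can be rolled forward coordinate-wise; the actual experiments may differ in neighbor selection, averaging, and parameter choices.

```python
import numpy as np

def lma_forecast(embedded, start_state, steps):
    """Lorenz method of analogues: nearest-neighbor forecasting in a
    delay-coordinate embedding (a simplified, single-neighbor sketch).

    embedded    -- array of shape (N, m+1): reconstructed states from training data
    start_state -- length-(m+1) state vector to forecast from
    steps       -- number of future scalar values to predict
    """
    state = np.asarray(start_state, dtype=float)
    forecast = []
    for _ in range(steps):
        # Find the nearest training state that has a known successor.
        distances = np.linalg.norm(embedded[:-1] - state, axis=1)
        nearest = np.argmin(distances)
        next_value = embedded[nearest + 1, -1]  # successor's newest coordinate
        forecast.append(next_value)
        # Roll the forecast state forward one step (assumes tau = 1 sample).
        state = np.append(state[1:], next_value)
    return np.array(forecast)
```

In the held-back evaluation described later, such a routine would be built from the first N − k points of a trace and asked for the next k values.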

[Figure 3: col_major cache misses vs. time (instructions x 100,000).]

Fig. 3. A forecast of the last 4,000 points of the signal in Figure 1 using an LMA-based strategy on the embedding in Figure 2. Red circles and blue ×s are the true and predicted values, respectively; vertical bars show where these values differ.

On the other hand, if we use the same approach to forecast the processor load² of the 482.sphinx3 program from the SPEC cpu2006 benchmark suite, running on an Intel i7®-based machine, the prediction is far less accurate; see Figure 4.

[Figure 4: 482.sphinx3 IPC vs. time (instructions x 100,000).]

Fig. 4. An LMA-based forecast of the last 4,000 points of a processor-load performance trace from the 482.sphinx3 benchmark. Red circles and blue ×s are the true and predicted values, respectively; vertical bars show where these values differ.

²Instructions per cycle, or IPC.

Table I presents detailed results about the prediction accuracy of this algorithm on four different examples: the col_major and 482.sphinx3 programs in Figures 3 and 4, as well as another simple microkernel that initializes the same matrix as col_major, but in row-major order, and another complex program (403.gcc) from the SPEC cpu2006 benchmark suite. Both microkernels were run on the Intel Core Duo® machine; both SPEC benchmarks were run on the Intel i7® machine. We calculated a figure of merit for each prediction as follows. We held back the last k elements³ of the N points in each measured time series, built the forecast model by embedding the first N − k points, used that embedding and the LMA method to predict the next k points, then computed the Root Mean Squared Error (RMSE) between the true and predicted signals:

RMSE = √( Σ_{i=1}^{k} (c_i − p_i)² / k )

where c_i and p_i are the i-th true and predicted values, respectively.

To compare the success of predictions across signals with different units, we normalized RMSE as follows:

nRMSE = RMSE / (X_{max,obs} − X_{min,obs})
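These two error measures translate directly into a few lines of code. This sketch is only illustrative; in particular, it assumes the observed range in the denominator is taken over the held-back true values, which the text does not spell out.

```python
import numpy as np

def nrmse(true_values, predicted_values):
    """RMSE between true and predicted signals, normalized by the observed range."""
    c = np.asarray(true_values, dtype=float)
    p = np.asarray(predicted_values, dtype=float)
    rmse = np.sqrt(np.mean((c - p) ** 2))
    return rmse / (c.max() - c.min())

# Hypothetical hold-back protocol from the text: keep the last k points for testing.
# k = 4000
# train, test = x[:-k], x[-k:]
# score = nrmse(test, forecast_of_test)
```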

TABLE I
NORMALIZED ROOT MEAN SQUARED ERROR (NRMSE) OF 4000-POINT PREDICTIONS OF MEMORY & PROCESSOR PERFORMANCE FROM DIFFERENT PROGRAMS

              cache miss rate   instrs per cycle
row_major          0.0324            0.0778
col_major          0.0080            0.0161
403.gcc            0.1416            0.2033
482.sphinx3        0.2032            0.3670

The results in Table I show a clear distinction between the two microkernels, whose future behavior can be predicted effectively using this simple deterministic modeling strategy, and the more-complex SPEC benchmarks, for which this prediction strategy does not work nearly as well. This raises the question: if these traces all come from deterministic systems—computers—then why are they not equally predictable?

³Several different prediction horizons were analyzed in our experiment; the results reported in this paper are for k = 4000.


Our conjecture is that the sheer complexity of the dynamics of the SPEC benchmarks running on the Intel i7® machine makes them effectively impossible to predict.

III. MEASURING COMPLEXITY

For the purposes of this paper, one can view entropy as a measure of complexity and predictability in a time series. A high-entropy time series is almost completely unpredictable—and conversely. This can be made more rigorous: Pesin’s relation [10] states that in chaotic dynamical systems, the Shannon entropy rate is equal to the sum of the positive Lyapunov exponents, λ_i. The Lyapunov exponents directly quantify the rate at which nearby states of the system will diverge with time: |∆x(t)| ≈ e^{λt} |∆x(0)|. The faster the divergence, the more difficult prediction becomes.

Utilizing entropy as a measure of temporal complexity is by no means a new idea [11], [12]. Its effective usage requires categorical data: x_t ∈ S for some finite or countably infinite alphabet S, and data taken from real-world systems is effectively real-valued. To get around this, one must discretize the data—typically by binning. Unfortunately, this is rarely a good solution to the problem, as the binning of the values introduces an additional dynamic on top of the intrinsic dynamics whose entropy is desired. The field of symbolic dynamics studies how to discretize a time series in such a way that the intrinsic behavior is not perverted, but these methods are fragile in the face of noise and require further understanding of the underlying system, which defeats the purpose of measuring the entropy in the first place.

Bandt and Pompe introduced permutation entropy (PE) as a “natural complexity measure for time series” [3]. Permutation entropy employs a method of discretizing real-valued time series that follows the intrinsic behavior of the system under examination. Rather than looking at the statistics of sequences of values, as is done when computing the Shannon entropy, permutation entropy looks at the statistics of the orderings of sequences of values using ordinal analysis. Ordinal analysis of a time series is the process of mapping successive time-ordered elements of a time series to their value-ordered permutation of the same size. By way of example, if (x1, x2, x3) = (9, 1, 7) then its ordinal pattern, φ(x1, x2, x3), is 231 since x2 ≤ x3 ≤ x1. This method has many features; among other things, it is robust to noise and requires no knowledge of the underlying mechanisms.
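To make the ordinal mapping concrete, here is a small sketch of one common convention: list the positions (1-indexed) of the window's elements from smallest to largest value, so that (9, 1, 7) maps to the pattern 231. Tie-breaking and indexing conventions vary across the literature.

```python
import numpy as np

def ordinal_pattern(window):
    """Ordinal pattern of a window: 1-based positions listed from smallest to largest value."""
    order = np.argsort(window, kind="stable")
    return tuple(int(i) + 1 for i in order)

print(ordinal_pattern([9, 1, 7]))  # (2, 3, 1), i.e. the pattern "231"
```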

Definition (Permutation Entropy). Given a time series {x_t}_{t=1,...,T}, define S_n as the set of all n! permutations π of order n. For each π ∈ S_n, we determine the relative frequency of that permutation occurring in {x_t}_{t=1,...,T}:

p(π) = |{t | t ≤ T − n, φ(x_{t+1}, . . . , x_{t+n}) = π}| / (T − n + 1)

where | · | denotes set cardinality. The permutation entropy of order n ≥ 2 is defined as

H(n) = − Σ_{π ∈ S_n} p(π) log₂ p(π)

Notice that 0 ≤ H(n) ≤ log₂(n!) [3]. With this in mind, it is common in the literature to normalize permutation entropy as follows: H(n)/log₂(n!). With this convention, “low” entropy is close to 0 and “high” entropy is close to 1. Finally, it should be noted that the permutation entropy has been shown to be identical to the Shannon entropy for many large classes of systems [13].
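Putting the definition together, the sketch below computes the normalized permutation entropy H(n)/log₂(n!) of a series for a given wordlength n. It repeats the ordinal-pattern helper so that it is self-contained, and it is meant to illustrate the definition rather than reproduce the paper's exact numbers.

```python
from collections import Counter
import math

import numpy as np

def ordinal_pattern(window):
    """Positions of the window's elements, listed from smallest to largest value."""
    return tuple(int(i) for i in np.argsort(window, kind="stable"))

def permutation_entropy(series, n):
    """Normalized permutation entropy H(n) / log2(n!) of a real-valued series."""
    x = np.asarray(series, dtype=float)
    patterns = (ordinal_pattern(x[t : t + n]) for t in range(len(x) - n + 1))
    counts = Counter(patterns)
    total = sum(counts.values())  # = T - n + 1
    h = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return h / math.log2(math.factorial(n))
```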

In practice, calculating permutation entropy involves choosing a good value for the wordlength n. The key consideration here is that the value be large enough that forbidden ordinals are discovered, yet small enough that reasonable statistics over the ordinals are gathered: e.g., n = argmax_ℓ {T > 100 · ℓ!}, assuming an average of 100 counts per ordinal. In the literature, 3 ≤ n ≤ 6 is a standard choice—generally without any formal justification. In theory, the permutation entropy should reach an asymptote with increasing n, but that requires an arbitrarily long time series. In practice, what one should do is calculate the persistent permutation entropy by increasing n until the result converges, but data length issues can intrude before that convergence is reached.

Table II shows the permutation entropy results for the examples considered in this paper, with the nRMSE prediction accuracies from the previous section included alongside for easy comparison. The relationship between prediction accuracy and the permutation entropy (PE) is as we conjectured.


TABLE II
PREDICTION ERROR (IN NRMSE) AND PERMUTATION ENTROPY (FOR DIFFERENT WORDLENGTHS n)

cache misses     error    n = 4    n = 5    n = 6
row_major        0.0324   0.6751   0.5458   0.4491
col_major        0.0080   0.5029   0.4515   0.3955
403.gcc          0.1416   0.9916   0.9880   0.9835
482.sphinx3      0.2032   0.9913   0.9866   0.9802

insts per cyc    error    n = 4    n = 5    n = 6
row_major        0.0778   0.9723   0.9354   0.8876
col_major        0.0161   0.8356   0.7601   0.6880
403.gcc          0.2033   0.9862   0.9814   0.9764
482.sphinx3      0.3670   0.9951   0.9914   0.9849

Performance traces with high PE—those whose temporal complexity is high, in the sense that little information is being propagated forward in time—are indeed harder to predict using the simple deterministic forecast model described in the previous section. The effects of changing n are also interesting: using a longer wordlength generally lowers the PE—a natural consequence of finite-length data—but the falloff is less rapid in some traces than in others, suggesting that those values are closer to the theoretical asymptote that exists for perfect data. The persistent PE values of 0.5–0.6 for the row_major and col_major cache-miss traces are consistent with dynamical chaos, further corroborating the results of [1]. (PE values above 0.97 are consistent with white noise.) Interestingly, the processor-load traces for these two microkernels exhibit more temporal complexity than the cache-miss traces. This may be a consequence of the lower baseline value of this time series.

IV. CONCLUSIONS & FUTURE WORK

The results presented here suggest that permutation entropy—an ordinal calculation of forward information transfer in a time series—is an effective metric for the predictability of computer performance traces. Experimentally, traces with a persistent PE ≳ 0.97 have a natural level of complexity that may overshadow the inherent determinism in the system dynamics, whereas traces with PE ≲ 0.7 seem to be highly predictable (viz., at least an order of magnitude improvement in nRMSE).

If information is the limit, then gathering and using more information is an obvious next step. There is an equally obvious tension here between data length and prediction speed: a forecast that requires half a second to compute is not useful for the purposes of real-time control of a computer system with a MHz clock rate. Another alternative is to sample several system variables simultaneously and build multivariate delay-coordinate embeddings. Existing approaches to that are computationally prohibitive [14]. We are working on alternative methods that sidestep that complexity.

ACKNOWLEDGMENT

This work was partially supported by NSF grant #CMMI-1245947 and ARO grant #W911NF-12-1-0288.

REFERENCES

[1] T. Mytkowicz, A. Diwan, and E. Bradley, “Computers are dynamical systems,” Chaos, vol. 19, p. 033124, 2009, doi:10.1063/1.3187791.

[2] J. Garland and E. Bradley, “Predicting computer performance dynamics,” in Proceedings of the 10th International Conference on Advances in Intelligent Data Analysis X, Berlin, Heidelberg, 2011, pp. 173–184.

[3] C. Bandt and B. Pompe, “Permutation entropy: A natural complexity measure for time series,” Physical Review Letters, vol. 88, no. 17, p. 174102, 2002.

[4] F. Takens, “Detecting strange attractors in fluid turbulence,” in Dynamical Systems and Turbulence, D. Rand and L.-S. Young, Eds. Berlin: Springer, 1981, pp. 366–381.

[5] A. Fraser and H. Swinney, “Independent coordinates for strange attractors from mutual information,” Physical Review A, vol. 33, no. 2, pp. 1134–1140, 1986.

[6] M. B. Kennel, R. Brown, and H. D. I. Abarbanel, “Determining minimum embedding dimension using a geometrical construction,” Physical Review A, vol. 45, pp. 3403–3411, 1992.

[7] M. Casdagli and S. Eubank, Eds., Nonlinear Modeling and Forecasting. Addison-Wesley, 1992.

[8] A. Weigend and N. Gershenfeld, Eds., Time Series Prediction: Forecasting the Future and Understanding the Past. Santa Fe Institute, 1993.

[9] E. N. Lorenz, “Atmospheric predictability as revealed by naturally occurring analogues,” Journal of the Atmospheric Sciences, vol. 26, pp. 636–646, 1969.

[10] Y. B. Pesin, “Characteristic Lyapunov exponents and smooth ergodic theory,” Russian Mathematical Surveys, vol. 32, no. 4, p. 55, 1977.

[11] C. E. Shannon, “Prediction and entropy of printed English,” Bell System Technical Journal, vol. 30, pp. 50–64, 1951.

[12] R. Mantegna, S. Buldyrev, A. Goldberger, S. Havlin, C. Peng, M. Simons, and H. Stanley, “Linguistic features of noncoding DNA sequences,” Physical Review Letters, vol. 73, no. 23, pp. 3169–3172, 1994.

[13] J. Amigó, Permutation Complexity in Dynamical Systems: Ordinal Patterns, Permutation Entropy and All That. Springer, 2012.

[14] L. Cao, A. Mees, and K. Judd, “Dynamics from multivariate time series,” Physica D, vol. 121, pp. 75–88, 1998.