Optimizing array reference checking in Java programs

Optimizing arrayreference checkingin Java programs

by S. P. MidkiffJ. E. MoreiraM. Snir

The Java™ language specification requires thatall array references be checked for validity. If areference is invalid, an exception must bethrown. Furthermore, the environment at thetime of the exception must be preserved andmade available to whatever code handles theexception. Performing the checks at run timeincurs a large penalty in execution time. In thispaper we describe a collection of trans-formations that can dramatically reduce thisoverhead in the common case (when the accessis valid) while preserving the program state atthe time of an exception. The transformationsallow trade-offs to be made in the efficiency andsize of the resulting code, and are fully compliantwith the Java language semantics. Preliminaryevaluation of the effectiveness of thesetransformations shows that performanceimprovements of 10 times and more can beachieved for array-intensive Java programs.

T hree goals ofthe Java** programming language I

and execution environment are bit-for-bit repro-ducibility of results on different platforms, safety ofexecution, and ease of programming and testing. Acrucial component of the Java language specifica-tion for achieving the latter two goals is that a pro-gram only be allowed to access objects via a validpointer, and that programs only be allowed to ac-cess elements of an array that are part of the definedextent of the array. A naive implementation of thisspecification component requires every access to ev-ery element of an array to check the validity of allbase pointers and indices involved in the access.

Figure 1 shows a typical representation of a four-dimensional array in Java. A naive checking of a ref-

erence to element A [1][ 2][3][4], indicated in thefigure, would require four base pointer tests and fourrange tests:A, A[l], A [1][2], and A [1][2][3] mustall be tested as valid pointers; 1,2,3, and 4 must allbe tested as valid indices along the correspondingaxes of the array. If the entire array is accessed, atotal of 4N base pointer tests and 4N range tests willbe necessary, where N is the total number of ele-ments in the array. In this paper, we present a suiteof techniques to greatly reduce the number of run-time tests. In many cases, the run-time tests can becompletely eliminated.

The above-mentioned goal of bit-for-bit reproduc-ibility of results, combined with the ability to catchexceptions and examine the state of the program atthe time the exception was thrown, imply that it isnot sufficient to determine that an invalid referenceoccurred. Rather, any optimizations of valid refer-ence checking must cause all exceptions that wouldhave been thrown in the original program to still bethrown. The exceptions must be thrown in preciselythe same order, with respect to the rest of the com-putation, dictated by the original program seman-tics. The techniques we describe fulfill this require-ment. Our methods also handle loops that catchexceptions within their bodies (i.e., with nested tryblocks).

©Copyright 1998 by International Business Machines Corpora-tion. Copying in printed form for private use is permitted with-out payment of royalty provided that (1) each reproduction isdonewithout alteration and (2) the Journal reference and IBM copy-right notice are included on the first page. The title and abstract,but no other portions, of this paper may be copied or distributedroyalty free without further permission by computer-based andother information-service systems. Permission to republish anyother portion of this paper must be obtained from the Editor.

IBM SYSTEMS JOURNAL, VOL 37, NO 3, 1998 0018·8670/98/$5.00 © 1998 IBM MIDKIFF, MOREIRA, AND SNIR 409

Figure 1 Access of an element in a four-dimensional array

We note that the techniques and methods describedhere are not restricted to Java. They can be appliedto any language in which the bounds of an array canbe determined at run time. They could be used, forexample, in C and FORTRAN compilers that want toprovide an option to check array references.

The rest of this paper is organized as follows. Thenext section presents an informal overview of ouroptimizations and is followed by an introduction tothe concept of safe bounds, used throughout the pa-per. The fourth section presents the first of our op-timization techniques, the exact method. The suc-ceeding sections present other techniques, thegeneral, compact, and restricted methods, respec-tively.The eighth section discusses an inspector-ex-ecutor variant of our methods, to handle morecomplex cases. The ninth section explains our op-timizations in the context of multithreaded execu-tion. The tenth section presents some experimentalresults, followed by a discussion of related work. Fi-nally, we present our conclusions.

Overview of the optimizations

The main goal of our work is to develop techniquesthat minimize the number of run-time tests that mustbe performed for array reference operations in Java.An array A is defined by a lower bound lo(A) andan upper bound up(A). We use a Java-like nota-tion for array declarations:

double A[] = new double[lo(A) :up(A)]

declares a one-dimensional arrayA of doubles. Notethat we allow an explicit declaration of the lower

410 MIDKIFF, MOREIRA, AND SNIR

bound lo(A) of array A. In Java, the lower boundis alwayszero. Also, Java array declarations specifythe number of elements of the array. Therefore, theJava declaration

double A[] = new double[n]

would correspond to the declaration

doubleA[] = new double[O:n - 1]

in our notation.

An array element reference is denoted by A [ (T],

where (T is an index into the array. For the referenceto be valid, A must not be null, and (T must be withinthe valid range of indices for A: lo(A), lo(A) +1, ... , up( A). ItA is null, then the reference causesa null pointer violation. In Java, this corresponds tothe throwing of a NullPointerException. ItAis a valid pointer and (T is less than the lower boundlo( A) of the array, then the reference causes a lowerbound violation. ItA is a validpointer and (T isgreaterthan the upper bound up( A) of the array, then thereference causes an upper bound violation. Java doesnot distinguish between lower and upper bound vi-olations. A bound violation (lower or upper) in Javacauses an ArraylndexOutOfBoundsExceptionto be thrown.

Multidimensional arrays are treated as arrays ofarrays:

IBM SYSTEMS JOURNAL, VOL 37, NO 3, 1998

declares an array M with lower bound II and upperbound u I' Each element of M is an array of doubles,with lower bound 12 and upper bound U2. M is anexample of a rectangulararray. Rectangular arrayshave uncoupled bounds along each axis. As in Java,we also allow ragged arrays, where the bounds for oneaxis depend on the coordinate along another axis.A triangular array is an example of a ragged array:

double T[][] = new double[l :n ][]

T[i] = new double[1 : i], i = 1, ... , n

Even though ragged arrays are allowed and treatedby our techniques, we do have optimizations that takeadvantage of rectangular arrays.

Arbitrary array lower bounds, the distinction be-tween lower bound and upper bound violations, andtreatment for rectangular arrays are features of ourwork that are not strictly necessary for Java appli-cations. Our motivation to include them is twofold:First, we want our methods to be applicable to otherprogramming languages, in particular C, C+ +, andFORTRAN 90. Second, we want to be prepared to han-dle extensions to Java that are being considered toefficiently support numerical computing, as describedin Reference 2.

Array accesses typically occur in the body of loops.Our optimizations operate on do-loops, which arefor-loops of the form:

for (i = I; i s: u; i++){B(i)

}

where i is the index variable of the loop, I definesthe lower bound of the loop, and u defines the up-per bound of the loop. The iteration space of theloop is defined by the range of values that i takes:I, I + 1, ... , u. All loops have a unit step, and there-fore a loop with I > u has an empty iteration space.B(i) is the body of the loop, which typically containsreferences to the loop index variable i. As a short-hand, we represent the above loop structure by thenotation L(i, I, u, B(i)). Note that loops with pos-itive nonunity strides can be normalized by the trans-formation

for (i = I; i :5 u; i = i + s){B(i)

}


becomes

for(i=O;i:5lU;lj;i++){

B(l + is)}

A loop with negative stride can be first transformedinto a loop with a positive stride:

for (i = u; i 2: I; i = i - s){B(i)

}

becomes

for (i = I; i s: u; i = i + s){B(u + I - i)

}

Loops are often nested within other loops. Standardcontrol-flow and data-flow techniques' can be usedto recognize many for, while, and do-while loops,which occur in Java and C, as do-loops. Many gotoloops, occurring in C and FORTRAN, can be recog-nized as do-loops as well.

Four different tests can be applied to an array ref-erence A [i]. A null test verifies whether A is a nullpointer. An Ib test verifies whether i 2: 10(A), anda ub test verifies whether i :5 up(A). Finally, a testcalled all tests verifies whether lo( A) :5 i :5 up(A).Tests for bounds subsume null tests, since a null ar-ray can be set to have an empty extent. Furthermore,we show in the next section that for do-loops onlyone of three cases can occur for each iteration andeach array reference: (1) an Ib test is required, (2)a ub test is required, or (3) no test is required. Inmany situations it is possible to precisely determinewhich case occurs. In other situations, one has to beconservative and adopt a stronger test than abso-lutely required. In particular, an all tests can be usedto detect any possible violations.

We introduce later a method for deriving the min-imum set of tests required at each iteration of a sim-ple loop with no nesting. This method will be referredto as the exactmethod. It provides the general frame-work that is used by the subsequent methods, whichhandle nested loops. The exact method specifies, foreach iteration and each array reference, which test

MIDKIFF, MOREIRA, AND SNIR 411

of the three named above is required, if any. A loopwith p array references in its body is split at compiletime into up to 3P consecutive regions, each with aspecialized version of the loop body. At run time,no more than 2p + 1 regions are actually executed.

This level of code replication is not practical in gen-eral. However, one can reduce the number of codeversions by merging together regions, at the expenseof performing superfluous tests. This is shown in thesection describing the general method. In practice,this does not increase execution time. The regionswhere tests are required will usually be found to beempty at run time so that the tests in these regionsare never normally executed. Reducing the numberof distinct regions will not only decrease code size,but may also decrease execution time. A practicalversion of the exact method splits an iteration spacei = I, I + 1, ... , u into three regions:

1. A region of the iteration space where no tests areneeded. This region is defined by a safe lowerbound IS and a safe upper bound US. The rangeof values of i in this region is ZS, Is + 1, ... , US.

2. A region of the iteration space where the indexvariable has values smaller than the safe lowerbound ZS. For this region i = I, I + 1, ... , ZS - 1.

3. A region ofthe iteration space where the indexvari-able has values greater than the safe upper boundu", For this region i = US + 1, US + 2, ... , u.

The loop is split into three loops, each associatedwith a different code version. As a simpler alterna-tive, the number of code versions can be further re-duced to two: one with no tests, and one with all testsenabled.

We extend this method to nested loops in the latersection, "The general method," recursively splittingeach iteration space into three regions. When ap-plied to a perfect d-dimensional loop nest, thismethod leads to 3 d distinct loop nests, each poten-tially executing a different code version. One can re-duce the number of distinct code versions by merg-ing different versions. In the extreme case, it ispossible to use only two versions. This method willbe referred to as the general method.

In the sixth section we present a method that avoidsthis exponential blow-up in static code size. Thismethod is referred to as the compact method. Whenapplied to a perfect d-dimensionalloop nest, it re-sults in 2d + 1 loop nests. Depending on the sim-


plifications adopted, it can use from 2 up to 2d +1 different code versions.

Finally, in the seventh section we describe a tilingmethod that can be applied to certain types of loopnests. It effectively implements the compact methodthrough generated code of a very simple form. Themethod uses two versions for each loop body (no testsand all tests enabled). This method is referred to asthe restricted method.

These four methods apply to Java as well as toFORTRAN or C and work for all machine architec-tures. Additional optimizations are possible in im-portant special cases. We briefly discuss two of theseoptimizations now: how to determine a valid indexwith a single test and how to test for null pointers.We do not discuss special optimization techniquesany further in this paper.

In the particular case of Java an all tests test can beimplemented with a single comparison. Because Javadoes not distinguish between lower bound and up-per bound violations, and because lo(A) = 0 always,it suffices to test if i < up( A) + 1 using unsignedarithmetic. A negative number appears, in unsignedarithmetic, as a very large positive number which,by the Java language specification, is always largerthan the largest possible array extent. This techniqueis used, for example, in the IBM HPCJ (High-Perfor-mance Compiler for Java) project."

The optimization of checks for null pointer viola-tions in array references is a direct consequence ofthe optimization of checks for bound violations, asdiscussed in the next section. There are also manyad hoc solutions to the optimization of null pointerviolations. For machines with segmented memory ar-chitecture, such as the IBM POWER2 Architecture*,null pointers can be represented by a pointer to aprotected segment. For machines with linear, butpaged, memory architecture, a protected segmentcan be simulated with a region of protected pages.Any attempts at access through a null pointer willimmediately cause a hardware exception that canthen be translated into the appropriate software ex-ception. This implementation completely eliminatesthe need for any explicit checks for null pointer vi-olations in the code. It isalso applicable to all pointerdereferences, not just array accesses. Despite thecomplications in properly translating a hardware ex-ception, this technique has been successfully used,for example, in the CACAO project. 5


~R[I] : i = I, ... , min(u + 1, 10(A» - 1 (7)

~R[3] : i = max(l- 1, up(A» + 1, ... , u (8)

Similarly, we use Equation 1 and Equation 3 to com-pute the lower and upper bounds of regions ~J( [1] and~R[3], respectively:

Computing safe bounds

The basic idea of our methods is to partition the it-eration space of a loop into regions. A region is acontiguous subset of the iteration space. It is char-acterized by a collection oftests that need to be per-formed for array references in that subset. We areparticularly interested in findingsaje regions, wherewe know there can be no violations in array refer-ences. Ifan array reference is guaranteed not to gen-erate a violation, explicit tests are unnecessary.

"11 = min(u, up(A»

~I( [2] : i = .'i.', . , . , ''II

(5)

(6)

}

Using Equation 2, we can compute the lower andupper bounds .~) and '-11 of the safe region:

for (i = I; i s: u; i++){B(i)

We illustrate the computation of safe regions witha simple example. We then proceed to formalize itto more general cases. Consider the simple loop:

(9)

(10)

(11)

(12)

(13)

~R [1] : i = I, ... , X' - 1

~R [2] : i = .r, ... , '11

~J( [3] : i = "11 + 1, ... , u

Note that min(u + 1, 10(A» = ,r, except whenlo( A) < I or lo( A) > u + 1. In the former case,~J( [1] is empty; in the latter case, ~R [2] is empty. Tohandle these cases, and the symmetric upper boundcases, we redefine

In the general case, we have an array reference ofthe form A [f(i)] in the bodyB(i) of a loop. Depend-ing on the behavior of j(i), we can compute safebounds S'and "11 that partition the iteration space intothree regions, as defined by Equations 11-13. Re-gion ~R[2] is safe, and no test is defined in it. We de-fine tests T[I] and T[3] that are sufficient for regions

In Figure 2 we illustrate the values of x' and "11 fordifferent relative positions of iteration bounds andarray bounds. Figure 2A has empty regions ~R [1] and~R [3], whereas region ~R [2] comprises the entire it-eration space. Region ~R [1] is empty in Figure 2B,whereas region U?,[3] is empty in Figure 2C. All threeregions are nonempty in Figure 2D. Regions ~R[2]

and ~R[3] are empty in Figure 2E, because all valuesof i fall below the lower bound ofA . Finally, regions~R[I] and fJl[2] are empty in Figure 2F, since all val-ues of i fall above the upper bound of A .

Equations 11-13 define the same regions as Equa-tions 6-8. The values of region bounds are differentonly for empty regions, but they are still empty.

We can now express the bounds of each of the re-gions just in terms of I, u, .r, and "11:

.~) = min(u + 1, max(l, 10(A»)

'11 = max(l- 1, min(u, up(A)))

(4)

Let there be a single array reference A [i] in the loopbodyB(i) . We denote the lower bound ofA by lo(A)and the upper bound of A by up( A). Therefore, forthe array reference A [i] to be valid we must havelo(A) :S i :S up( A). If A is null, we define lo(A)= 0 and up( A) = - 1. This guarantees that A [i] isnever valid if A is null. We can split the iterationspace of the loop into three regions ~J( [1], ~J( [2], and~R [3], defined as follows:

~R[I] : (l :S i :s u) 1\ (i < 10(A» (1)

~R[2] : (l :S i :S u) 1\ (lo(A) :S i :S up(A» (2)

~R[3] : (l :S i :S u) 1\ (i > up(A» (3)

which we represent as L(i, I, u, B(i». B(i) is thebody of the loop and, in general, contains referencesto the loop index variable i. The iteration space ofthis loop consists of the range of values i = I, I +1, ... , u.

Region ~j( [1] corresponds to those iterations of theloop for which the index i into array A is too small.Therefore, an Ib test is required before each arrayreference in this region. No tests are required in re-gion ~J( [2], since the index i falls within the boundsof A. Finally, a ub test is required in region ~R [3],because the values of i are too large to index A.

..\:' = max(l, 10(A»

IBM SYSTEMS JOURNAL, VOL 37, NO 3, 1998 MIDKIFF, MOREIRA, AND SNIR 413

Figure 2 Relationship between loop bounds and array bounds

~R[l] and ~JI [3], respectively. Expressions for ~,"11, andT for various forms ofj(i) are described in Appen-dix A. The safe bounds are computed so that we al-ways have "11 ~ .S! - 1 and .cr :'S "11 + 1. These prop-erties are important for the correct functioning ofour methods.


An exact method

In this section we consider the case of a simple (depthone) loop, with multiple references. The exactmethod described here performs only the tests strictlynecessary (as specified in Appendix A) to detect the


Figure 3 Regions generated by the intersection of simple regions

array reference violations that occur during the ex-ecution of a loop. It is, in general, of limited prac-ticality because up to Y versions of the loop bodymust be instantiated in the code, where p is the num-ber of array references in the loop body. In somesituations, when enough information is available tothe compiler, it is possible to generate efficient codeusing this technique. Our main interest in studyingthis method is that the other, more practical, trans-formations are derived from this technique.

In this section we treat references ofthe form A [f(i)]that allow the computation of safe bounds as de-scribed in the previous section. The eighth sectiondiscusses how to handle more complex subscripts.We first describe how the method generates opti-mized code, and then we illustrate the method withan example.

Code generation. The method works as follows. LetL(i, I, u, B(i)) be a loop on index variable i andbody B(i). Let there be p references of the formA) [fJCi)], j = 1, ... , p, in body B(i). Each arrayreferenceA )[jj (i)] partitions the iteration space intothree (each possibly empty) regions:

'J')[1] . /' - I (,») - 1'\ • - , ••• , ,..J...

performed for referenceA )[f)(i)] in that region. Thetest T) [2] for region ~Rj [2] always specifies no test,because ~Ri[2] is a safe region with respect to thisreference. The tests T)[I] and T)[3], for regions ~Rj[l]

and ~J~)[3], respectively, can be anyone of ub test(an upper bound test), lb test (a lower bound test),or all tests (both a lower and an upper bound test).The exact choice depends on characteristics of thearray reference A) [f) (i)]. This issue is discussed indetail in Appendix A. For each region ~J~) [k] we de-note its lower bound by ~J~) [k].1 and its upper boundby ~R/[k].u.

Two references A)[jj(i)] and Adfk(i)] can bethought of as partitioning the iteration space intofive regions defined by the four points !f), G1{1, !f k ,

and "1lk • We illustrate this in Figure 3 for two ref-erences, At[ft(i)] andA 2[fz(i)]. Note that someof the resulting regions may be empty, in general.The resulting regions are a subset of the 3 . 3 = 9regions formed by all combinations of intersectionsof two simple regions, one from each reference.

In general, given the p references A) [f) (i)], we cancreate a vector ofallY regions formed by intersect-ing the simple regions from different array refer-ences:

Each indexx, is 1, 2, or 3. The lower bound of a re-gion ~R [x 1 • Xz •...• x p] is the maximum of the lowerbounds of the forming simple regions. Correspond-

~JU[2] : i = .\:,), ... ,"11)

as defined by Equations 11-13. We call these regionsdefined by a single array reference simple regions.Also, each region ~R)[k] has a test T)[k] associatedwith it, which describes what kind of test has to be

Hl[xt • X z ....• xp] =

~lll[Xl] n ~R2[XzJ n ... n ~R P[xp] (14)


ingly, its upper bound is the minimum of the upperbounds of the forming simple regions:

for (i = ~g[1 ·2· 3].1;i ::::; ~g[1 ·2· 3].u; i ++ ){B]2 . .3(i)}

~g[XI • X2 •.•.• xp].u =

min(Hl'[x,].u, ~Jl2[X2].U, ... , ~JlP[xp].u) (16)

Finally, the tests that characterize region Hl[ x I . X 2.. . . . x p ] can be described by:

That is, the combination of the tests for each simpleregion ~Jlj [x.l-

for (i = ~R[3' 3· 1].1;i s; ~g[3' 3· ·1].u; i++){B}3 .1(i)}

for (i = ~Jl[3 ·3· 2].1;i ::::; ~R [3 . 3 2].u; i ++ ){B33. .2(i)}

for (i = ~g[3 ·3· 3].1;

i::::; Hl[3' 3· ' 3].u; i++ ){Bn .. 3(i)} (19)

Note that the order of the resulting loops is impor-tant, although there are many correct orders. Therequirement is that, for any value of j, region~R[Xl' ••. 'Xj_1 'l'x j + I ' ., • • x p ] has to precede~Jl[Xl ..... Xj-l ·2· x j + 1 •••.• x p ] which in turnhas to precede ~j([XI ..... Xj-l ·3· x j + 1 •..•• x p ] .

Using this partitioning of the iteration space into Yregions, the loop

for (i = Hl[1 ·2· 1].1;

i s: ~Jl[I' 2· 'l]'u; i++){B12 ../i)}

for (i = Hl[1 ·2· 2].1;

i z: ~Jl[I' 2· ' 2].u; i++ ){B]2 .2(i)}

can be transformed into 3P loops, each one imple-menting one of the regions st]», ·X2·" ,·xp ] . Thebody of each region can be specialized to performexactly the tests described by T[Xl' X 2..... x p]. Weuse Bx,x, .xp(i) to denote the body B(i) specializedwith the tests described by T[X I • X2 •...• x,]:

(22)

(20)

(21)

if .~j 2: PHI

if c'l lJ < Pk

if(.~j <PHI) 1\ (ell j2:Pk)

}

can be transformed to

Each region ~P[k] corresponds to a region ~g[x I • X 2 •

. . . . x p ] defined by

Out of the 3P possible regions formed by the inter-section of the simple regions, no more than 2p + 1are nonempty and are actually executed at run time.Which of the 3p regions are nonempty depends onthe relative positions of the 2p safe bounds Y: I, ... ,

.~ P, "11 I, ... ,"11 '', Let P], ... ,P 2p be the list obtainedby sortings ', ... ,.~P,"lll + 1, ... ,"lIP + 1 in as-cending order. Define Po = I and P2p+1 = u + 1.The safe bounds .~ 1, ... , .~ P, ell I, ... , "11 p partitionthe iteration space into 2p + 1 (each possibly empty)regions ~P[k], k = 0, 1, ... , 2p defined by

(It can occur that both .'f'j 2: PHI and oW < P k- Inthat case, region ~I)[k] is necessarily empty, and thechoice ofx, is irrelevant.) LetM(k) be the functionthat defines the correspondence ~Jl[M(k)] == ~P[k]

for k = 0, ... , 2p. Then the loop

for (i = I; i::::; u; i++){B(i)

H)[k]:i = Pk"" ,PHI - 1

(18)

for (i = ~R[1 . 1 ..... 1].1;

i z: ~Jl[I'I"" ·1].u; i++){B ll . . l(i)}

for (i = ~R.[1 . 1 ..... 2].1;

i s: ~R[l'I" .. ' 2].u; i++ ){B I l . 2(i)}

for (i = ~g[1 . 1 ..... 3].1;

i z: ~R[I'I"'" 3].u; i++){Bll.ii)}

for (i = I; i::::; u; i++){B(i)

}

416 MIDKIFF. MOREIRA. AND SNIR IBM SYSTEMS JOURNAL. VOL 37, NO 3, 1998

Figure 4 A program optimized using the exact method: (A) original code, (B) transformed code

If the order of the safe bounds X I, ... , ,\;, P, l11 I, ... ,

011 P is known at compile time, the mapping functionM(k) can also be determined. In this case, only theloop body versions that are actually executed needto be generated. In general, the order of the safebounds is not known, and B M(k) (i) has to be selected

for (k = 0; k ::; 2p; k + +){for (i = Pk; i <Pk+I; i++){

BM(kl(i)}


(23)

at run time from the set of all possible 3P body ver-sions.

An example. In the interest of conciseness, we usea verysimple code fragment to illustrate this method.Figure 4A illustrates the original loop to be trans-formed. It has two array references, A 1 [i - 2] andA 2 [i + 2], each generating three simple regions:~Jll [1: 3] and ~R 2[1: 3], respectively. The straightfor-wardly transformed code, using the exact method,is shown in Figure 4B. We note that:


Figure 5 A program optimized using the exact method: (A) original code, (8) transformed code

Now let n = 10, as shown in Figure SA. In this case,the expressions for the safe bounds become:

which does not provide us with enough informationto order the safe bounds c't I, c't 2, ''II 1, and ell 2. We im-~lement the code in Figure 4B according to Equa-tion 19. We could have used the method in Equa-tion 23, but it would still have required all 3P bodyversions to be generated.

S' 1= min(u + 1, max(l, 3» (24)

"~l2 = min(u + 1, max(l, -1)) (25)

°11' = max(l- 1, min(u, n + 2)) (26)

ell 2 = max(l - 1, min(u, n - 2)) (27)

7 1[1] = 72[1] = lb test (28)

7'[3] = 72[3] = ub test (29)

and we can order c't 2 S c's! I sell 2 S ell 1. In particular,this is the ordering shown in Figure 3. We only needto generate code for five loops, as shown in Figure5B. In some cases, a compiler with appropriate sym-bolic analysis can even eliminate some of these fiveloops. For example, if the compiler could prove thatl > 2, neither of the first two loops (body versionsB 'I (i) and B n(i)) would be necessary.

(32)

(33)

Summary of the exact method. The exact methodtransforms code so that, for each iteration of the orig-inal loop, only those tests that cannot be shown tobe unnecessary are performed. In the straightforwardapplication of the method to a loop with p array ref-erences in its body, 3P new loops are generated, eachwith a slightly different body. No more than 2p + 1of these loops are actually executed at run time. Insome situations, compile-time analysis can show that

'11' = max(l- 1, min(u, 12))

ell 2 = max(l- 1, min(u, 8))

(30)

(31)

c't 1 = min(u + 1, max(l, 3))

"~' 2 = min(u + 1, max(l, -1»

418 MIDKIFF, MOREIRA, AND SNIR IBM SYSTEMS JOURNAL, VOL 37, NO 3, 1998

Figure 6 Partitioning of an iteration space into three regions

This partitioning is not very different from our veryfirst example in the third section, except that nowit applies to a loop with an arbitrary number of ar-ray references in its body. We illustrate this for tworeferences: A 1[fJ (i)] and A 2[fz(i)] in Figure 6. Re-gion ~R[2] is the intersection of the safe simple re-gions ~R 1[2] and ~R 2[2]. Correspondingly, region ~R[I]

is the union of the unsafe simple regions ~R 1[1] and~R 2[1]. Region Ul[3] is the uni.onof the unsafe si~ple

regions ~RI[3] and ~R 2[3]. With reference to Flg1!re3, ~Jl[l] is formed by merging all regions pre~edmg

~R[2 • 2], and region ~R[3] is formed by mergmg allregions succeeding region ~Jl [2 . 2].

Region ~Jl[2] is the safe region, and its body ~an beimplemented without any tests. We denote this ver-sion of the loop body by B(notest(i)). Conversely,the bodies of regions ~Jl [1] and ~R.[3] need at leastsome tests. For simplicity, we can implement bothwith a version B(test(i)) of body B(i). This versionperforms an all tests test before each and every ar-ray reference.

The implementation of the general method consistsof the transformation:

some ofthese loops implement empty regions of theiteration space and, therefore, can be discarded.

The general method

The general method works by partitioning ~he it~r

ation space of a loop always into three regions, in-dependent of the number of array references in.thebody of the loop. One of the regions is a safe regl~n:

no array reference in that region can cause a VIO-

lation. Another region precedes the safe region, initeration order. Finally, the third region succeeds thesafe region, in iteration order. This met~od can beapplied to each and every loop of an arbitrary loopnest individually. The general method does not spe-cialize the tests as much as the exact method, but itdoes identify the same safe regions that can be ex-ecuted without any tests.

Transforming a single loop. Consider the loop L(i,I, u, B(i)), with p references of the form Ai [fi(i)]in its body. Using the concepts developed in the pre-vious two sections, we can compute its safe regionas the intersection of all simple safe regions. Thissafe region is defined by the range of values of i =1', [I + 1, ... , US where:

l" = maxts' I, S,2, ... , S'I') (34)

u' = max(l' - 1, min("11', "11 2, ••• , '111')) (35)

~R [1] : i = I, . . . , I' - 1

~Jl[2] : i = l', ... , US

~R[3] : i = US + 1, ... , u

(36)

(37)

(38)

The 1s - 1 term in the expression for u S is necessaryto handle cases where the safe region is empty. Wecan then partition the iteration space of the loop intothree (each possibly empty) regions:

IBM SYSTEMS JOURNAL. VOL 37. NO 3. 1998

for(i = I; i:s u; i++){B(i)

}

MIDKIFF. MOREIRA. AND SNIR 419

Figure 7 Applying the general method to a loop nest

becomes

for (i = l; i:s l" - 1; i++){B(test(i»

}for (i = l'; i z: us; i++){

B(notest(i»}for (i = US + 1; i:s u; i++){

B(test(i»

in B. The transformation can be applied individu-ally to each and every loop in a general loop nest.In fact, the final result is independent of the orderin which the individual loops are transformed.

Consider the case of the two-dimensional loop nestui, ii' u ; L'(j, ii' u j , B(i, j)). By first applyingthe transformation to the L loop, we generate a re-gion without any tests on array references indexedby i:

}

or, in shorthand:

(39)

becomes

(

L / i' i, ZS - 1, B(test(i»)

L(i, t, u, B(i» ~ L 2(i, is, us, B(notest(i»)

L 3(i, US + 1, u, B(test(i»)(40)

The code expansion in this case is only twofold, sincethe same code for B(test(i» can be used twice.Methods to realize all three resulting loops using onlytwo instances of B(i) are discussed in Appendix B.

Recursive application of the transformation. If thebody B of a loop contains other loops, then the sametransformation can be applied to each of the loops


Lj(i, 1;, If - 1, L'(j, ii' ui' B(test(i),j»)

L 2(i, it, u;" L'(j, i; ui' B(notest(i),j)))

L 3(i, u; + 1, u., L'(j, ii' ui' B(test(i),j»)

We can now apply the transformation to each of thethree L' loops. This will generate two regions with-out any tests on array references indexed by j andone region without any tests on array references in-dexed by either i orj as seen in Figure 7. This trans-formation partitions the iteration space into nine re-gions, and requires four different versions of the loopbody B(i).


Figure 8 Program fragment to be optimized

Figure 9 A program with explicit violation tests

An example of the general method. For simplicity,we givean example of single-threaded code with rect-angular arrays. In the ninth section is a discussionof the application of these optimizations to multi-threaded code. The program of Figure 8will be usedin this example. It implements a step of a two-di-mensional Jacobi relaxation.i We chose it becauseit is a well-known operation that illustrates a loopnest with various array references in its body. Also,


the array references all use slightly different indices,making our example more interesting.

Figure 9 shows the loop implemented with naivetests. The statement

ifAccessViolation( A, i) throwException

throws the appropriate exception if A is a nullpointer, i is below the lower bound of A, or i is above


Figure 10 An example of the general method

the upper bound of A. In the actual executable code,the tests would be interweaved with the individualarray references, and the order of tests would beslightly different, but this example gives an idea ofthe cost of naive testing. Figure 10 shows the loop


nest optimized for testing using the general method.In each of the resulting loops, we list the violationsthat can occur in its body. The code with explicit testsfor the different versions of the loop body are shownin Figure 11. The minimal tests needed for the i 1


Figure 11 The four different body versions used in Figure 10

and i 3 loops differ in that the i I loop only needs totest for lower bounds violations, and the i 3 loop onlyneeds to test for upper bounds violations. By usingthe transformation of Equation 39, the versionfor both loops can be the same, as indicated inFigure 10.

Transforming the loops. The optimized code is theresult of the following transformations. First, theouter i loop is split into three loops (i 1, iz, and i3),


as shown in Figure 10. The middle loop, i z, corre-sponds to those iterations of the original i loop thatcannot cause bound violations on array referencesindexed by i. Within this loop, these array referencesneed not be tested. This is the safe region for loop i.

The first loop, iI, executes those iterations of theoriginal i loop whose values of i precede the safe re-gion. Within this loop, all array references indexedby i need to be tested. Because the subscript expres-


Figure 12 Lower and upper safe bounds for each reference

sions on i are monotonically increasing across theiteration space of the i loop, only lower bound vi-olations can occur in the i 1 region.

The third loop, i 3 , executes those iterations of theoriginal i loop whose values of i succeed the safe re-gion. Again, within this loop, all array references in-dexed by i need to be tested. Because the subscriptexpressions on i are monotonically increasing acrossthe iteration space of the i loop, only upper boundviolations can occur in the i 3 region.

Within each of the resulting loops i.. i 2 , and i 3 thej loop is similarly split. For example, the body of re-sulting loop j 1,3 executes all iterations of the nest thatattempt to reference elements of a or b that are be-low the corresponding lower bound, and elementsof b[i], aU], aU + 1], and a[i - 1] that are abovethe corresponding upper bound. In a typical correctnumerical code, where all references are in bounds,only the body of loop j 2,2 will execute. This body is


represented byB(notest(i), notest(j)), and becauseit has no tests, it executes faster.

Computing the bounds ofthe split loops. To computethe bounds for the resulting i.. i 2, and i 3 loops, wefirst have to compute the safe bounds for loop i: Itand u t. In the body of the i loop there are five arrayreferences indexed by i: b [i], a [i] (appearing twice),a[i + 1], and a[i - 1]. The values of crt and "litare shown in Figure 12A. They are computed usingEquation 64 of Appendix A. The values of l;' and u;can be computed according to Equations 34 and 35:

The bounds for the resultingj loops are computedsimilarly. In the body of the j loop there are fivearray references indexed by i: (a[i])[j + 1],(a[i])[j - 1], (aU + 1])[j], (aU - 1])[j], and


(b[i])[j]. Note that b[i], ali], a[i + 1], and ali -1] are the arrays being accessed. The values of IJ anduJ can be computed by Equations 34 and 35, usingvalues of }"f and 011f obtained through Equation 64and shown in Figure 12B. In this example, the ar-rays are rectangular, and the lower and upper boundsfor a [i], a [i + 1], a [i - 1], and b [i] do not dependon the value of i. (Note that, in general, IJ and uJwould be functions of i.)

Checks that must still be done. The transformationjust discussed creates regions of the loop that arefree from violations on array references indexed byeither i, j, or both. Figure 10 lists the specific vio-lations that can occur in each region. Note that thisparticular behavior is specific to this loop. We choseto implement the loop using versions of the body thatperform tests covering more than the strictly nec-essary violations. This allowed us to use only fourversions of the loop body, as shown in Figure 10anddetailed in Figure 11. We perform an all tests teston any i reference when outside the safe region fori. Correspondingly, we perform an all tests on anyj reference when outside the safe region for j.

Summary of the general method. The exact methodof the previous section formed a different region ofthe iteration space of a loop for each possible com-bination of necessary array reference tests. The gen-eral method always partitions the iteration space ofa loop into three regions: (1) a region for those it-erations that need no test (the safe region), (2) a re-gion for those iterations that occur prior to the saferegion, and (3) a region for those iterations that oc-cur after the safe region. In each of the nonsafe loopregions, any test that might be needed for a refer-ence in any iteration of that region is performed inall iterations of that region. This means that someadditional tests might be executed compared to theexact method. For a nest of loops, the generalmethod can be applied to each and every loop re-cursively and independently. Thus, for a loop nestof depth d, 3" loop regions are created.

When generating code for the regions, several ap-proaches can be taken. One approach generates adifferent version of the loop body for each region.When applied to the example of Figure 10,this wouldgenerate only lower bound tests on arrays indexedby i in the i I loop and only upper bound tests in thei 3 loop. This approach leads to 3" versions of theloop body for a loop nest of depth d. A second ap-proach, actually used in our example of Figures 10and 11, uses only two versions of the loop body for


each loop index: one with no tests on that index andone with all tests. This leads to 2" versions of theloop body for a loop nest of depth d. A third ap-proach would be to generate only two versions ofthe loop body, independent of the number of nestedloops: one with all tests on all indices and one withno tests on any index. Obviously, the version withno tests can only be used in that region where noviolations in any array reference can occur.

The compact method

The exponential expansion on the number of loopscaused by the application of the general method canbe highly undesirable in some cases. In this sectionwe propose an alternative method that results in onlya linear increase in the number of regions. The dif-ference resides in how the transformation is appliedto loop nests. In the compact method, the partition-ing into three regions is always applied in outermostto innermost loop order. Furthermore, it is only ap-plied to inner loops in the untested version of anouter loop body.

Again, consider the case of the two-dimensional loopnest L(i, l., u i, L'(j, Ii' ui' B(i, j))). By first ap-plying the transformation to the outer i loop, we gen-erate a region without tests on array references in-dexed by i:

becomes

L1(i, Ii' If - 1, L'(j, Ii' u., B(test(i),j»))

L 2(i, If, uf, L'(j, t; ui' B(notest(i),j)))

L 3(i, uf + 1, u., L'(j, i. "» B(test(i),j»)) (41)

We now apply the transformation to the instance ofL' in the L 2 region. This results in a region withoutany tests on array references indexed by either i orj as seen in Figure 13.

This transformation partitions the two-dimensionaliteration space into five regions, and requires threedifferent versions of the loop body B (i). It still gen-erates the same region safe on i and j as the generalmethod.

An example ofthe compact method. As an example,we apply the compact method to the two-dimensionalloop nest of Figure 8. The resulting code is shown


Figure 13 Applying the compact method to a loop nest

in Figure 14. Note that only loop i 2 has its j looppartitioned into three regions. When applied to ad-dimensionalloop nest, the compact method gen-erates 2d + 1 regions of the loop iteration space.Because the two regions preceding and succeedinga safe region are similar, only d + 1 versions of theloop body must be generated.

Summary of the compact method. Like the generalmethod of the previous section, the compact methoddivides each loop into one safe and two nonsafe re-gions. The differences between the two methods aremanifested only when applying the transformationto a loop nest. The general method splits all nestedloops into three regions, whereas the compactmethod splits only loops nested within the versionwithout tests. This implies that within the nonsafeversions of an outer loop the compact method mayperform more tests than the general method. Thebenefit is that only a linear number of loop regions(2d + 1 for a loop nest of depth d) are necessarywith the compact method, as opposed to an expo-nential number with the general method. The num-ber of versions of the loop needed is a linear func-tion of the nesting depth of the loop (d + 1 for aloop nest of depth d).

The number of versions of code can be further re-duced to two by using a fully tested version on anyregion that needs tests. Also, if the loop nesting isperfect, the structure of the generated code can begreatly simplified when only two versions are used.These observations provide the motivation for therestricted method of the next section.


The restricted method

We now consider a transformation that can be ap-plied to perfect loop nests, or any loop nest that canbe transformed to a perfect loop nest via compilertechniques. The restricted method partitions the loopinto multiple regions. Each region executes eithera test (all array references are fullytested) or a notest(no array references are tested) version of the loopbody. Having a perfect loop nest allows us to mapevery iteration of the loop nest onto a single pointin a d-dimensional space, where d is the depth ofthe loop nest. Partitioning the loop into test andnotest regions is equivalent to tiling this iterationspace.

The advantage of the restricted method, when ap-plied to a perfect d-dimensionalloop nest, is that itcan be implemented with only one instance of eachversion of the loop body. Also, the instances can begenerated in place in a fixed loop structure. The pre-viousmethods, when implemented with only twover-sions of code, require a specialized loop structurethat invokes those versions from more than one pointin the program.

We start by considering a d-dimensional rectangu-lar loop nest:

That is, the bounds ii, and Ui, of loop ik do not de-pend on the values of i j, ••• , ik - 1 • The iteration


Figure 14 Structure of generated code for the compact method

}for(i3=ut+ 1;i3 $ Uj; i3++) {

for(h,l =lj;h,l s Uj;h,l++) {B(test(i3), test ( h,I»

}

space of this loop can be partitioned into consecu-tive regions described by a vector ~R [1 :U Ii]' Each en-try in this vector has the form:

The vectors I and u define the lower and upperbounds for each loop in the region, respectively. Theelements [i

l(0) and U i

ldenote the lower and upper

bounds, respectively, of loop i j in the region u\[o].The definition of a region also contains a flag T thathas the value test if the region requires any arrayreference to be tested and notest otherwise.

~J\[O] = (1(0), 0(0), T)

1(0) = ((,(0), [i,(O), , [jo))

0(0) = (ui,(o), ui,(o), , ui}o))

T = {testjnotest}

(43)

(44)

(45)

(46)

With partitioning, the original loop can be executedby a driver loop that iterates over the regions:

for (0 = 1; o:s u8; O++){L1UI, [iJO), uiJo), LzUz, [i,(O), ui,(o), ... ,

Ld(id, [i,,(O), ul}o), BUI' iz, ... , id)) ... )).}

Note that in previous sections single-dimensional re-gions were described by a vector containing the up-per and lower bounds for the regions. Here a mul-tidimensional region is described by vectors withupper and lower bounds for every index variable inthe loop nest. The techniques of the previous sec-tions formed regions loop by loop, whereas the tech-niques of this section form regions for the entire loopnest.


Figure 15 Iteration space for a perfectly nested two-dimensional loop

Our goal is to partition the iteration space into re-gions that we can identify as either requiring testson array references or not. Those regions that donot require tests can then be executed with a notestversion of the loop body. We use the same kind ofpartitioning described in the compact method. First,an outer loop is divided into three regions. Then,the inner loop in the region without tests is dividedagain. This partitioning is illustrated in Figure 15,for a two-dimensional iteration space.

Computing vector ~R. Vector ~Jl can be computed byprocedure regions in Figure 16. Procedure regionstakes seven parameters. The first five are input pa-rameters and describe the loop nest being optimized.They are:

L The indexj indicating that region extents alongindex variable i, are being computed


2. The vector (ai, O'z, ... , aj-l), where O'k is thelower bound for loop index i, in the regions tobe computed

3. The vector (WI> Wz, ... , wj-d, where Wk is theupper bound for loop index ik in the regions tobe computed

4. The dimensionality d of the loop nest5. The vector ~g[1 :d], where ~jj[k] = (Ii" v.; I;:,

uO contains the full and safe bounds for loop ik

The next two parameters are the output of the pro-cedure. The first output parameter is the vector ofregions, ~Ji, described in Equations 43 through 46,for the loop. The second is Us, the count of thoseregions. Note that Us is also used as an input, whichgets incremented each time a new region is created.

An invocation of procedure regions for a given valueof j partitions the loop nest formed by loops ij ,


Figure 16 Procedure to compute the regions for a loop nest

ij+1> , i d • It is only performed for safe values ofi I> i z, , i j _ i. Procedure regions partitions loop i jinto three parts. The first part consists of iterationsofij that precede the safe region ofij • The inner loopsof i, are not partitioned. These regions necessarilyrequire testing of the array references, since they areunsafe on references indexed by if' Regions corre-sponding to this part of the iteration space are com-puted in statements SI and S2, and correspond tothe regions u~[I] (for j = 1), fJi[2], UI[5], fR[8], and~Jt[l1] (for j = 2) in the two-dimensional iterationspace of Figure 15. The third part consists of the it-erations of if that succeed the safe region of i.. Again,the inner loops are not partitioned, and testing is re-quired. Regions corresponding to this part of the it-eration space are computed in statements Sll andS12, and correspond to the regions fR [4], fR[7], fR [10],H~[13] (forj = 2), and fR[14] (forj = 1) in Figure15. The middle (second) part consists of the itera-tions of ij that are within its safe region. If ij is theinnermost loop, this is computed by statements S4and S5, and the partitioning of loop ij is complete.No tests are required. If i j is not the innermost loop,the partitioning is applied recursively to loop i j + 1 foreach iteration of ij in its safe region, as shown in linesS7 and S8.


To compute the entire vector of regions, the invo-cation regions(l, 0, 0, d, fl.\, fR, Us = 0) shouldbe performed. At the end of the computation, vec-tor ~K contains the description of the regions, andthe value of Us is the total number of regions in Ui.

Although the example in Figure 15 and the notationimply that the iteration space passed to regions in ~jj

is rectangular, the algorithm is not restricted to rect-angular loop nests. In particular, if the expressionsfor ti' U i , tt, and u t. are functions of ib 1 ::; k <• J } J • J • •

J, the computation performed by regions IS correct.

We can optimize the execution of the loop nest byusing two versions of code inside the driver loop: (1)a version B (notest(i i ), notest( i z), ... , notest( id))that does not perform anyof the array reference tests,and (2) a versionB(test(i j ) , test(i z) , '" ,test(id»that performs tests on all array references. Version1 is used only for regions where ~JI [oj.T is notest, whileversion 2 is used for all other regions. This corre-sponds to forcing all regions with any potential ref-erence violation to perform all reference tests. Theoptimized loop nest can be implemented by the fol-lowing construct:


for (8 = 1; 8:s us; 8++){if UJl[8].T ==test){

Lj(i l , li,(8), u i,(8),Lz(iz, Ii,(8), ui,( 8),

. . . ,

each iteration, either the version of the loop withrun-time tests or the version with no run-time testsis executed. The code in Figure 18 instantiates theregions dynamically. This contrasts with the staticinstantiation of Figure 14.

Ld(id, [i/8), ui/ 8) , B(test(i j ) ,

test(iz), ... , test(id)))

))

}else{Lj(ij, [i,(8), ui,(8),

Lz(iz, [i2

( 8), ui,(8),

Summary of the restricted method. The restrictedmethod works on perfect loop nests. It partitions theiteration space into multidimensional regions andgenerates two versions of the loop body. Each re-gion is executed with either the test version of thebody or the notest version. Only regions that are freefrom any possible access violation can use the notestversion.

... ,

))

Liid, li/8), uJ8), B(notest(i1) ,

notestu.), ... , notest(id)))

Iterative computation of loop bounds

In the third section and in Appendix A we discusshow to partition an iteration space into regions fora variety of common forms of subscript functions.In this section, we discuss how regions-based testingoptimization can often be performed even when thesubscript expressions are not one of the forms dis-cussed in those sections.(47)

}}

An important optimization. If, for a particular valueofj, Ii' = I, and uj' = tc., then the safe region alongaxis i/ of the iteration space corresponds to the en-tire extent of the axis. If I,' = l, and u ,' = u, for

k k k II;

k = j, ... ,d, then axis i j - 1 can be partitioned intoonly three regions: (1) one region from Ij - 1 to 1;:_,- ~, (2) one region from [;;_, to uL; and (3) oneregion from u ;;_, + 1 to U i

j. , ' Each of these regions

spans the entire iteration space along axes i.,i.; l i .•• , id . This situation for a two-dimensional it-eration space is illustrated in Figure 17. Note thatthe new partitioning results in only three regions.Collapsing multiple regions into a single region re-duces the cost of computing the regions, the totalnumber of iterations in the driver loop (47), and, con-sequently, reduces the run-time overhead of the re-stricted method. To incorporate this optimization inthe computation of regions, procedure regions ismodified as shown in Figure 37 in Appendix F.

An example of the restricted method. We apply therestricted method to the two-dimensional loop nestof Figure 8. The logical structure of the generatedcode is the same as that generated by the compactmethod in Figure 14. The actual implementation withthe driver loop and the two versions of code is shownin Figure 18. The driver loop (with index 8) of Fig-ure 18 executes Us iterations, where Us is the num-ber of regions computed by procedure regions. On

The technique is similar to that of the inspector/executor method for parallelizing loops. 7 We decom-pose the computation and execution of the regionsinto an inspector phase and an executor phase. Theinspector phase examines the references within theiteration space of a loop and computes a list of it-eration subspaces (regions). Some regions need run-time tests, whereas others do not. The inspector phaseis analogous to procedure regions of the previous sec-tion. The executor phase traverses the list of itera-tion subspaces and executes them using different ver-sions of the loop body. Thus the executor phasecorresponds to the driver loop of the last section.

Construction of the inspector phase. We show howto construct an inspector for a singly nested loop.This method can then be recursively applied to a loopnest. Let L(i, I, u, B(i» be a loop on i, where B(i)contains a series of preferences A I[O'd,A z[O'z], ... ,Ap[O'p], where O'j is a function defined in the iter-ation space of L. A reference can be of the formAj[AdO'k]]' where zl , is also an array. In general,A k[0'k] may be present as a term in O'j' We label thearray references so that A j [ O'j] executes beforeA j +j [O'J+d·

The inspector constructed takes the form shown inFigure 19A. The first argument to procedure regions

430 MIDKIFF. MOREIRA, AND SNIR IBM SYSTEMS JOURNAL, VOL 37, NO 3, 1998

Figure 17 Iteration space for a perfectly nested two-dimensional loop

is the output ~R, a vector of regions. An element ~R[0]of the region vector consists of:

~R[o] = (l(o), u(o), T)

where I( 0) and U (8) are the bounds of region ~Jl[0],and T is its test flag. The next argument, U 8, is alsoan output: the number of regions in vector ~I\. Theother arguments for procedure regions are the inputof the procedure. They are: (1) the lower and upperbounds, I and u, of the loop, and (2) the list of arrayreferences (A ,[0"1], Az[O"z]' ... , Ap[O"p]).

The inspector enumerates all iterations from theloop, by executing a for (i = I; i ::; u; i + +) loop.For each iteration i, it checks all array referencesAj[O"j] and determines if a test is necessary. If theevaluation of O"j itself causes a violation, we define


O"j = - 00. If O"j is invariant with respect to i, opti-mizations can be performed to reduce the numberof evaluations in the inspector. The inspector marksin a flag check if a test is required for iteration i, Ifcheck is the same as oldcheck, this iteration i belongsto the same region as the previous iterations, andwe update the upper bound of the current region.Otherwise, it is the first iteration of a new region.

Construction of the executor phase. The executor,shown in Figure 19B, consists of a driver loop (in-dicated by an index variable 0), that iterates over theregions. Each region is executed with one of the twodifferent versions of the loop body, one with testsand one without.

Optimizations and alternate constructions. Theinspector/executor technique can be applied to each


Figure 18 Generated code for the restricted method

loop in a loop nest independently, as we did for thegeneral method discussed earlier. Alternatively, itcan be directly applied to a d-dimensional perfectloop nest. In this case, the inspector has to enumer-ate all iterations in thed-dimensional iteration space,and the executor consists of a driver loop around dif-ferent versions of the loop nest.

For the inspector/executor method to be effective,the inspector phase has to be hoisted out of a loopnest. When that is possible, the same vector of re-gions Ht[l :ua] can be used in multiple instances ofthe executor. The number of checks is reduced bythe number of iterations in the loops from which theinspector was hoisted. We show such an example inthe following subsection.

Finally, it is also possible to build an inspector/executor pair that uses more refined versions of the


loop body. For example, one could use all 3P ver-sions generated in the exact method.

An example. In this example, the loop of Figure 20Ais transformed. The inspector for this loop is shownin Figure 20B and the executor in Figure 20C. Notethat we have optimized the loop by moving the in-spectorphase out of the j loop. This is possible be-cause the array reference patterns are not depen-dent on j. Thus, even though there are O(u;uJcomputations being performed, only O( u;) tests arenecessary.

Summary of inspector-based methods. The inspec-tor-based methods differ from the exact, general,compact, and restricted methods in how regions areformed. In the former methods, regions are com-puted analytically using the formulas of the third sec-tion and Appendix A. In the inspector-based meth-


Figure 19 Template for constructing an inspector/executor

(A) Inspector

. uo= ua+ 1·/(uo)=u(ua) =i

9t£ua]. t =checkoldcheck :;: check

-...., oldcheckH,".,.. _,_",= i

proce;duretegions(1(, ua, I, u, V'il,viJ"4·r'i.'i~"

Ua:;:Ooldcheck » undefinedfor(i =I; i $ U; i++){

check= notestfor (j =l;j $ p;j++H

if ((OJ< lo(A}) v (OJ >up(A;»)){i' i//check=test

ods the regions are computed by executing the codethat generates the subscripts for the references be-ing optimized. This allows references whose sub-scripts preclude an analytic computation of regionsto be optimized. There are two caveats, however.First, because the inspector overhead is proportionalwithin a constant factor to the cost of executing aninstance of the loop, the methods are most effectivewhen the inspector for a loop can be hoisted out ofthat loop. Second, there are still some loops whichcannot be optimized with inspector-based methods.Figure 21 shows such a loop. Because the executionof an inspector for this loop would have side effects(other than region computation) that livebeyond thelife of the inspector, forming an inspector with themethods we have discussed would alter the outcomeof the program. This loop can be optimized usingspeculative methods. Because they are not region-

based, they are beyond the scope ofthis paper. Theyare discussed in Reference 8.

Multithreading considerations

Up to now, the problem of other threads of execu-tion and optimized code being active at the same timehas been ignored. If arrays had static shape, that is,if it were not possible for an array to change shapeduring the course of execution of a routine, mul-tithreading would not be a problem. Multithread-ing is beyond the scope of the current FORTRAN, C,and C+ + language definitions. Java, however, hasmade multithreading an integral part of the language.Furthermore, it is possible for one thread of a Javaprogram to alter the shape of a multidimensional ar-ray that is being accessed by other threads.


Figure 20 A program optimized using inspector/executor

The Java memory model enables a solution to thisproblem. The coherent copy of a shared variable re-sides in main memory. A local copy of any sharedvariable can be copied by a thread from main mem-ory to working memory. The working memory is ac-cessible only by the thread, and therefore is not al-terable by any other thread. Java requires that:


1. No synchronization can occur within a thread be-tween time tm' when a variable is copied frommain to local memory, and time ta' when the vari-able is accessed by the thread. This can be en-forced by refreshing the shared variable in localmemory after a synchronization and before theaccess.


Figure 21 A loop that cannot be optimized by inspector-based methods

2. Before a synchronization within a thread is fin-ished, all shared variables that have been mod-ified since the last synchronization point in thatthread must be written back to the main mem-ory.

3. If a shared variable has been written by a thread,it must be written to main memory before its valuecan be read from main memory by some thread.This prevents locallywritten values to shared var-iables from being lost.

4. As long as item 3 is obeyed, the value of a sharedvariable can be updated from global memory atany time prior to an access by a thread.

The consequences ofthe Java memory model rulesto our work are threefold. First, it is legal and validwithin a thread to cache in local memory the valuesof a variable- even if the thread isnot synchronized.Second, if the thread isnot synchronized, the cachedvalues need not be written back. Third, if a synchro-nization point exists within a loop nest being opti-mized, the shapes of shared arrays have to beconservatively assumed to have changed at thesynchronization point. To prove that the shape of ashared array does not change across a synchroniza-tion requires proving that none of the threads thathave access to the array at that time can change itsshape.

We now describe how to handle the case in whicha body of code without synchronization accesses a(potentially) shared array. We can divide arrays intotwo distinct parts. The first part is comprised of de-scriptors, those parts of an array that contain point-ers to other descriptors or to rows of data. The sec-ond part is comprised of data, the one-dimensionalrows that contain the actual elements. In Figure 22,the dashed boxes indicate descriptors, and the solid


boxes indicate data. In general, a d -dimensional rect-angular array A [1 :nIJ[1 :nz] .. , [1 :nd] consists ofone level-O descriptor that points to a vector of n 1

level-l descriptors. Each level-l descriptor pointsto a vector of n : level-2 descriptors, for a total ofn Inz level-2 descriptors. There are a total ofO(n Inz ... nd-d descriptors. The last axisof the ar-ray contains the actual data, in the form of vectorsof n ; elements, pointed to by the level-(d - 1) de-scriptors.

The following then ensures the thread safety of ourtransformations: before every optimized body ofcode, a copy of the descriptors part of each sharedarray A is cached in working memory. Note that ifa cached descriptor for A that is valid by the Javathread semantics is already present, it may be usedfor the purposes we describe. This action insulatesthe thread executing the optimized code from anychanges to the shared array shape caused by otherthreads. Also note that only the array shape is rel-evant in computing the safe and unsafe regions forour transformations. The data part of the array canbe modified by other threads without any effects onthose regions. By using this caching of descriptors,the only places that shape changes become visibleare at synchronization points. The shape of a sharedarray should be marked as variant across thosepoints. This, in turn, can prevent many optimizations.

At first glance, the caching of descriptors may seemexpensive. In general, given a d-dimensional rect-angular array A[1 :n l][1 :nz] ... [1 :nd], onlyo (n 1n z ... nd-I) storage needs to be cached. Thus,a one-dimensional array needs only a constantamount of storage, and an n I X n z array needs onlyn 1 words of storage. Therefore, if most of the arrayis accessed within the loop, the amount of storage


Figure 22 Multidimensional array organization, showing the separation between descriptors and data

Figure 23 Java code that implements OOOT

cached is approximately a factor of n d less than thenumber of references.

Experimental results

To test the effectiveness of our compiler transfor-mations, we developed a prototype framework. Thisprototype framework currently implements only therestricted method and only handles perfectly nestedrectangular loops. The framework consists of a Java-to-Java translator that produces the two versions ofcode necessary for the restricted method. These ver-sions are then compiled into executable object codeusing the IBM High Performance Compiler for Java(HPCJ).4 HPCJ has a switch to generate code withoutany run-time checks, on a class basis.

Benchmarks. Using the prototype framework, we ap-plied the transformations in the restricted method


to Java programs in a benchmark suite. This suiteconsists of two numerical kernels and three appli-cation kernels, all with array-intensive computations.The two numerical kernels are a vector dot-productoperation (DDOT), and a matrix-multiply operation(MATMUL). The three application kernels are a shal-low-water simulation kernel (SHALLOW), a data-min-ing kernel (BSOM), and a cryptography kernel (CBC).We compare the performance of a Java implemen-tation of each benchmark with a corresponding ver-sion written in either C (DDOT, MATMUL, and CBC)orC++ (SHALLOW and BSOM).TheCandC++ ver-sions were written and compiled according to whatwe call Javarules: (1) two-dimensional arrays are rep-resented as vectors of pointers to one-dimensionalarrays, and (2) any compiler optimizations that vi-olate the Java IEEE 754 floating-point semantics (ex-act reproducibility of results) are disabled. The im-


Figure 24 Source code for C =C + A x B in Java

pact of these rules on the performance of Cor C++programs is beyond the scope of this paper. We dis-cuss each of the benchmarks in more detail below.

DDOT. The DDOT benchmark computes the dot-product L['=1 x.y, of two vectors X andy ofn double-precision floating-point numbers. The Java code thatimplements DDOT is straightforward and shown inFigure 23. The C code is very similar. The compu-tation of DDOT requires 2n floating-point loads and2n floating-point operations (n multiplies and nadds). For our measurements we use n = 10 6 • Wereport the performance of DDOT in Mflops.

MATMUL. The MATMUL benchmark computes thematrix operation C = C + A X B, where C is anm X p matrix, A is an m X n matrix, and B is ann X p matrix. The elements are double-precisionfloating-point numbers. The Java implementation ofthis matrix operation is illustrated in Figure 24. TheC implementation is very similar, with the matricesimplemented as vectors of pointers to rows of ele-ments. This is in accordance with the previouslymentioned Java rules for C. The computation ofMATMUL requires 2mnp floating-point operations(mnp adds and mnp multiplies). For our measure-


ments, we used m = n = p = 64. We also reportthe performance of MATMUL in Mflops.

SHALLOW The SHALLOW benchmark is a compu-tational kernel from a shallow water simulation codefrom the National Center for Atmospheric Research(NCAR). It consists of approximately 200 lines ofcode, in either the C+ + or Java version. The datastructures in SHALLOW consist of 14 matrices (two-dimensional arrays) of size n X m each. The com-putational part of the code is organized as a time-step loop, with several array operations executed ineach time step (iteration), as shown in Figure 25. Inthat figure we indicate the number of occurrencesfor each kind of array operation. The notations A +A, A *A, and A fA denote addition, multiplication,and division of all corresponding arrayelements. Thenotation s * A denotes multiplication of a scalarvalue by each of the elements of an array. There-fore, each iteration of the time step loop executes65mn floating-point operations. The matrix oper-ations do not appear explicitly. Instead, they arefused into three double-nested loops.

Just as in the MATMUL benchmark, the matrices inthe C+ + code are implemented as vectors of point-


Figure 25 General structure of the SHALLOW benchmark

ers to rows of elements. For our measurements wefix the number of time steps T = 20 and use n =m = 256. Once again, performance for SHALLOWis reported in Mflops.

BSOM. BSOM (Batch Self-Organizing Map) is adata-mining kernel being incorporated into Version2 ofthe IBM Intelligent Miner". It implements a neu-ral-network-based algorithm to determine clustersof input records that exhibit similar attributes of be-havior. The simplified kernel used for this study con-sists of approximately 300 lines of code, in either theC++ or Java version.

We time the execution of the training phase of thisalgorithm, which actually builds the neural network.The training is performed in multiple passes overthe training data. Each pass is called an epoch. Lete be the number of epochs, let n be the number ofneural nodes, let r be the number of records in thetraining data, and let m be the number of fields ineach input record. For each epoch and each inputrecord, the training algorithm performs nm connec-tion updates in the neural network. Each update re-quires five floating-point operations.

For our measurements, we use e = 25, n = 16, andr = m = 256. We report the performance of BSOMin millions of connection updates per second, orMCUP/s, as is usually done in the literature for neu-ral-network training.

CBC. Our last benchmark is an implementation ofeBe, or cipher block chaining:" eBe is a block ci-pher mode in which a feedback mechanism is usedto encrypt a vector of n blocks of data. Each blockis encrypted in sequence. The result of encrypting


a block is XORed with the next block before encrypt-ing this next block. Let P = (P" ... , P n) be thevector of plaintext blocks and let C = (C" ... , Cn)be the vector of ciphertext blocks. Then

where EKO is the encrypting function with key K,and Co is the initial chaining value. The eBe bench-mark uses the data encryption standard (DES) algo-rithm 9 to encrypt the blocks. DES transforms each64-bit plaintext block into a 64-bit ciphertext block,using a 56-bit key.

We use our eBebenchmark to encrypt a vector of128k blocks (1 MB) and we report its performancein millions of bytes encrypted per second (MB/s). Wemeasure both C and Java versions of this benchmark.The eBe benchmark performs integer and logicaloperations on array elements. This contrasts with themostly floating-point operations that the other fourbenchmarks perform.

Execution environment. We performed our exper-iments on an IBM RS/6000* Model 590 workstation,running Advanced Interactive Executive (AIX*) 4.2.This workstation has a 66 MHz POWER2 processor,and is configured with 256 kB of level-1 cache and512 MB of main memory. The C and C+ + programswere compiled with C Set+ + version 3 for AIX andwe used version {3A9 of HPCJ to compile the Javaprograms.

Results. Table 1summarizes the results from our ex-periments. For each benchmark and code version,we list the measured performance in the appropri-


Table 1 Summary of results from applying our transformations

DDOT(Mflopsl'}

MATMUL(M~ops)

Table 2 Comparing the performance of C and Java with transformations

l!!~ormance on RS/6OClO 590MATMl:Ic~ SHALLOW(Mflops) (Mflops)

34.339.1114%

ate units. The first row (C or C+ + with Java rules)of the table lists the results for the Cor C+ + ver-sion of the benchmarks. The second row (HPCJ nochecks) lists the results for the Java version compiledwith HPCJ, with all run-time checks disabled. Note thatthis version is not Java-compliant, since it will notdetect any indexing violations that may occur. Thethird row (Transformations) lists the results for theJava version with our transformations applied. Thisversion is Java-compliant, since all necessary testsare performed. When it is necessary to make privatecopies of the descriptors of an array (as describedin the previous section), the time for copying is in-cluded in computing the performance. Finally, thelast row (HPCJ with all checks) of the table lists theperformance measured for Java compiled with HPCJ,with the run-time checks enabled. Again, this ver-sion is Java-compliant.

From Table 1,we observe that HPCJ can produce ex-ecutable code for Java that is very competitive withcode produced for Cor C++, ifthe run-time checksare disabled. However, when the checks are present,the performance of Java code is degraded by as muchas 15 times (MATMUL). This indicates that HPCJ al-ready implements many of the optimizations neces-sary to make Java competitive with Cor C+ + in per-formance. In fact, the Java version of SHALLOWachieves higher performance than the C++ versionbecause of better pointer disambiguation in Java thanin C++. The executable code produced by HPCJ ishampered only by the need for run-time tests. When


our transformations are applied to the code, part ofthe execution can be performed without any run-timetests. That is the reason why the performance of thetransformed code isso much better than that of HPCJwith all tests.

We compare the performance achieved for Java withour transformations to the performance of the cor-responding C or C+ + code. That comparison isshown in Table 2. The "%" row in that table indi-cates the fraction of C or C+ + performance thatthe Java code with transformation achieves. We ob-serve that we can achieve between 55 percent and114 percent of C or C+ + code performance withcode that is entirely legal Java.

Finally, we compare the performance of code pro-duced with our transformations to the performanceof code generated by HPCJ with all run-time checksenabled. Those results are shown in Table 3. The"improvement" row lists the performance ratio be-tween the version with transformations and the ver-sion without. We observe that we achieve perfor-mance improvements of up to 14 times when weapply our transformations.

Related work

A great deal of work has been done in the area ofoptimizing bounds checking for arrays. Most of thiswork has been done in the context of programminglanguages with no requirements as to the execution


Table 3 Comparing the performance of Java with and without transformations

state at the time the bounds violation occurs. In Java,a violation must generate an exception at a very pre-cise point in the execution of the program. Thebounds checking work for Ada 10 confronts a prob-lem similar to the one we face, but takes a less ag-gressive approach than we do.

There are two main approaches in the literature tooptimizing array bounds checks: (1) the use of staticdata-flow analysis information to determine that atest is unnecessary, 10-14 and (2) the use of data-flowinformation and symbolic analysis at compile timeto reduce the dynamic number of tests remaining inthe program. 15-19

Work in the first group uses data-flow informationto prove at compile time that an array bounds vi-olation cannot occur at run time and, therefore, thatthe test for the violation is unnecessary. Using theterms of our discussion, the goal of this work is toidentify loops that are safe regions. In contrast, thegoal of our work is to transform loops to place themaximum possible number of iterations in safe re-gions (with constraints on the number of loop ver-sions or regions). Methods of the first group use in-formation about loop and array bounds, and aboutsubscript expressions, to prove that either no accessviolation will occur in a loop execution or that anaccess violation might occur. In the former case, notests are generated. In the latter case, tests are gen-erated as needed. The difficulty of this approach isthat the information needed to prove that no vio-lation will occur might not be available at compiletime. These techniques would have better results withjust-In-time compilation, but the analysis overheadthen becomes problematic.

Work in the second group attempts to reduce thedynamic and static number of bounds tests. It alsoattempts to reduce the overhead induced by a testeven if it cannot be eliminated. This is done (1) byhoisting tests out of loops when possible 19 and (2)


by also determining that a test is covered by anothertest 15-18 and can be eliminated.

The optimization method of Markstein, Cocke, andMarkstein 19 is closely related to our computation of[S and US for a single linear reference. It is imple-mented by three changes to the loop. First, beforeentering the loop a test is made to determine if abounds violation occurs on the first iteration. If so,the loop is exited immediately. Second, the loop exitconditions are modified so that the loop terminatesbefore the access violation occurs. Finally, at loopexit, the final iterate value is examined. If it is lessthan or equal to the upper loop bound in the orig-inal program, an access violation occurs, and an ex-ception (or trap, in their terminology) is thrown. Theapplication of the method to the four-point stencilproblem of Figure 8 is shown in Figure 26.

To see how the removal of redundant (covered) testsworks, consider the naive sequence of tests in Fig-ure 9. If a[i - 1] and a[i + 1] are both legal ref-erences, then a [I] is also legal; therefore, the explicittest ifAccessViolation(a, i) is redundant. Similarly,considering references indexed by j, if a [I] [j + 1]and a [I] [j - 1] are both legal, then a [i + 1] [j] anda [i - 1] [j] are also legal. (Note that we need to knowthat the array a is rectangular in order to make thisstatement.) Therefore, the explicit tests ifAccessVio-lation(a[i + 1],j) and ifAccessViolation(a[i - 1],j) are redundant. Finally, because b and a are ar-rays with the same shape, the tests ifAccessViola-tion(b, i) and ifAccessViolation(b[i],j) are also re-dundant. After these optimizations we are left withthe four explicit tests shown in Figure 27. Note thatthe remaining tests must be ordered appropriately.

Neither of these optimizations is usable with Javain general because the Java semantics requiring pre-cise exceptions make the hoisting and reordering oftests illegal in many situations. Also, when an accessviolation exception is caught by a try block in the


Figure 26 Optimization of the code of Figure 8 by the Markstein, Cocke, and Markstein method

Figure 27 Optimization of the tests in Figure 9 by test coverage

loop body, the loop should not be terminated (aswould occur when hoisting tests). Nevertheless, thetechniques for eliminating redundant tests can beused to optimize the performance of inspectors.

Some instruction set architectures provide specialsupport for index checking, which have been used


to speed up bounds checking in various implemen-tations of Java. This is exemplified by the bound in-struction in the Intel x86 architecture.i" used in thejx virtual machine, and the trap instructions of theIBM POWER2 Architecture, 2l used in the IBM Java JlTCompiler. 22 These instructions generate interruptsin the case of violations, which inhibit a wide range


of compiler optimizations." Optimizations that areinhibited include code motion, reordering, and pre-scient stores. 1

Conclusions

We have developed a set of optimizations to reducethe number of run-time tests that need to be per-formed in Java programs. All of the optimizationswork by transforming a loop nest, which implementsan iteration space, into code that partitions this it-eration space into multiple regions. In some regions,run-time tests are performed, whereas in other re-gions they are not necessary. The optimizations dif-fer on their level of refinement and practicality. Themore-refined methods generate code versions thatare more specialized for the different characteristicsof regions. They can cause an unacceptable code ex-pansion. The less-refined methods use fewer codeversions and merge regions with similar character-istics. They cause a smaller code expansion and aremore practical. All methods can create one or moreregions that are completely free of run-time tests.In correct array-intensive programs, these regionsare expected to perform all or almost all of the com-putations of a loop nest. In these cases, the less-re-fined methods can deliver most of the improvementsto be gained.

We want to emphasize the importance of creatingregions without any possible array bounds and nullpointer violations. The benefits of creating these re-gions are threefold: First, run-time tests can be elim-inated from the regions, which results in a direct per-formance improvement. Second, try blocks that catchbounds exceptions can be removed from the loopversions implementing the safe regions. Third andmost importantly, as a consequence of the first two,the resulting regions have simpler code, which en-ables optimizations that would otherwise be ham-pered by the explicit run-time tests. In general, theability to perform optimizations is enhanced by max-imizing the extent of violation-free regions. Manyforms of operation reordering and parallelization canbe performed in these regions.

We implemented a prototype framework for per-forming our more practical optimization, which onlyrequires two versions of code to be generated. Wehave measured the effectiveness of this optimizationon a suite of array-intensive benchmarks. Our re-sults indicate that Java code can achieve performancethat is within 55 to 100 percent or more of the per-formance of C or C+ + code, when the Cor C+ +


code follows Java rules. C or C++ code that followsJava rules implements two-dimensional arrays as avector of pointers to one-dimensional arrays, anddoes not perform optimizations that violate Javafloating-point semantics.

Our benchmark suite currently consists of variousarray-intensive programs, including numerical, data-mining, and cryptography kernels. The array oper-ations in these benchmarks are representative of thebehavior of many applications. We intend to extendour suite to include graphics and image-processingapplications.

In all our discussion we have ignored the use of con-trol-flow analysis.This analysiscan be a powerful toolto prove that array references only occur in certaincombinations (because of, for example, differentpaths through the body of a loop), thus reducingsome of our code replication. We intend to inves-tigate the usefulness of control-flow analysis as partof our future work.

Finally, our next step is to implement and integrateour transformations into IBM Java-related products.We will work with technical and data-mining appli-cation groups to help ensure the success of large-scale, array-intensive Java applications that dependon high performance.

Acknowledgments

The authors wish to thank George Almasi and RickLawrence for providing and assisting with the B80Mbenchmark, David Edelsohn for providing and as-sisting with the CBC benchmark, and Yurij Baran-sky and Murthy Devarakonda for leading the Javaperformance efforts that initially shed light on thisproblem.

Appendix A: Computing safe bounds

We describe how to compute the safe bounds v and''11 of an iteration space i = I, ... , u with respectto an array reference. The safe region of the iter-ation space of i with respect to an array referenceA [f(i)] is the set of values of i that make f(i) fallwithin the bounds of A . The safe region for this op-eration is defined by

(Io(A) :Sf(i):S up(A)) /\ (t s: i:s u) (48)

We first consider the case off(i) monotonically in-creasing. The iteration space is partitioned into threeregions defined as follows:


~R[1] : (l:s i s: u) 1\ (f(i) < 10(A»

~R[2] : (l :s i :s u) 1\ (lo(A) :s f(i) :s up(A»

~R[3] : (l:s i :s u) 1\ (f(i) > up(A»

Using the fact that

I rl(lo(A»l:s i:::} 10(A) :sf(i)

Lrl(up(A»J 2: i:::} up(A) 2:f(i)

the safe region ~R [2] can be defined by

(49)

(50)

(51)

(52)

(53)

Hi [1], ~R [2], and ~R [3] are defined as in Equations 59-61. In this case, 7[1], the test for ~R [1], is ub test and7[3], the test for ~Ji [3], is lb test.

Note that it is always legal to set either 7[1] or 7[3](or both) to all tests, so as to reduce the number ofdistinct code versions. We actually use this simpli-fication in some of our methods.

Linear subscripts. In the particular case of a linearsubscript function of the form f(i) = ai + b, theinverse function t I (i) can be easily computed:

~R[2] : i = max(l, it-I(lo(A»l), ... ,

min(u, Lrl(up(A»J) (54)

No test is required in this region.

Correspondingly, we can define region ~R [1], whichrequires an lb test, by

~R[l] : i = I, ... , min(u + 1, Irl(lo(A» l) - 1(55)

and region Hi[3], which requires a ub test, by

~Ji[3] : i = max(l - 1, Lrl(up(A»J) + 1, ... , u(56)

To combine all three definitions, we can compute

(64)

Also, the monotonicity of f(i) is determined fromthe value of a: if a > 0, thenf(i) is monotonicallyincreasing, and if a < 0, thenf(i) is monotonicallydecreasing. Note that the values ofa and b need notbe known at compile time, since ,~) and ell can be ef-ficiently computed at run time.

Affine subscripts. Consider the d-dimensionalloopnest

for (il = Ii,; i :s ui,; i, + +){for (i2= Ii,; i2 :S ui, ; i2+ +){

for (id = Ii,,; id :s ui,,; id + +){Bti.; i2, ••• , id )

Similarly, iff(i) is monotonically decreasing we cancompute

As previously discussed, 7[1], the test for ~R [1], is Ibtest and 7[3], the test for Hi [3], is ub test.

S' = min(u + 1, max(l, Irl(up(A»l» (62)

ell = max(l- 1, min(u, Lrl(lo(A»J» (63)

S' = min(u + 1, max(l, Irl(lo(A»l»

ell = max(l- 1, min(u, Lrl(up(A»J»

and write

~ji [1] : i = I, , X - 1

~ji [2] : i = ,~), , "11

~ji[3] : i = L>1r + 1, ... , U

(57)

(58)

(59)

(60)

(61 )

}

}}

and let there be an array reference A [f(i I, i 2 , ••• ,

id)] in the body B(i l, i2, ... , id). Let the subscriptbe an affine function of the form f( i I> i 2, , i d)= alii + a2i2 + ... + adid + b, where il> , idare the loop index variables and a I, ... , ad, bareloop invariants. At the innermost (id) loop thevalues of i.. ... , i d - l are fixed, andf can be treatedas linear on id • Determination of safe bounds forthe iI, ... , id-I loops can be done using theinspector/executor method described earlier in thispaper. Alternatively, these safe bounds can be ap-proximated. Replacing true safe bounds ,\:, and °11 byapproximated safe bounds ,0 and ql does not intro-duce any hazards as long as ,0 2: ,\:, and ''II :s ell. Tech-niques for approximating the iteration subspace ofa loop that accesses some range of an affinely sub-


scripted array axis are described in References 23and 24.

Constant subscripts. For an array referenceA [f(i)]where f(i) = k (a constant),f(i) is neither mono-tonically increasing nor monotonically decreasing.Nevertheless, we can treat this special case by de-fining

if = I and bTl = u if 10(A):'S k :'S up(A)

E=/and cTI=I-1 if k>up(A)

E = u + 1 and bTL = u if k < 10(A)

(Again, this is necessary to handle empty loops.) Thatis, the safe region is the whole iteration space if wecan guarantee that g(i) mod m + ai + b is alwayswithin the bounds of A, or empty otherwise. Region~J( [3] is always empty, and we make 7[1] = all teststo catch all violations when region ~J~[l] is not empty.

If the subscript function is neither one of the de-scribed cases, then the more general inspector/executor method described earlier should be used,if possible. This method also lifts the restriction onfunctionf(i) being monotonically increasing or de-creasing.

(This last step is necessary to handle empty loops.)The safe region for referenceA [k] is either the wholeiteration space, if k falls within the bounds of A , orempty otherwise. Only region ~K[1] is nonempty if kis too small, and only region ~J~[3] is nonempty ifk is too large. We also define 7[1] = lb test and7[3] = ub test.

Modulo-function subscripts. Another common formof array reference isA [f(i)] wheref(i) = g(i) modm + ai + b. In general, this is not a monotonic func-tion. However, we know that the values off(i) arewithin the range described by ai + b + j, for i =I, 1+ 1, ... ,u andj = 0, 1, ... ,m - 1. We definea function h(i, j) = ai + b + j. Let h max be themaximum value of h(i, j) in the domain i = I, I +1, ... , u andj = 0, 1, ... , m - 1. Let h min be theminimum value ofh(i,j) in the same domain. Theseextreme values of h (i, j) can be computed using thetechniques described in Reference 25. Then we candefine

becomes

}

(69)}

for (i = I; i z: I' - 1; i++){B(test(i))

}for (i = 1"; i z: us; i++){

B(notest(i))}for (i = US + 1; i s: u; i++){

B( test(i))

to partition the iteration space of a loop into threeregions, using two different versions of the loop body:one with tests (B(test(i)), which appears twice) andone without tests (B(notest(i)), which appearsonce).

for (i = I; i :'S u; i++){B(i)

Appendix B: Auxiliary transformations

In the fifth section of this paper we used the trans-formation(65)

(66)

.~ = min(u + 1, max(l, .~))

bl1 = max(l - 1, min(u, b1l))

and then computing

otherwise,

The following equivalent construct contains only oneinstance of each loop body version:

(70)}

for (8 = 1; 8:'S n; 8++){for (i = 1(8); i:'S u(o); i++){

B( test(i))}for (i = 15(8); i s: US(o); i++){

B(notest(i))

}

(67)

(68)

and compute

oS:' = min(u + 1, max(/, E))

ell = max(l- 1, min(u, e11))

444 MIDKIFF, MOREIRA, AND SNIR IBM SYSTEMS JOURNAL, VOL 37, NO 3, 1998

(A)

Figure 28 Bounds for the two loops nested within the 0driver loop

Figure 29 Example of a loop nest to be transformed

A driver loop, on index variable a, iterates overthe two instances of the loop, one with the tested


body version and the other with the untested ver-sion. The bounds [(a), u(a), ['(a), and u'(a) for


Figure 30 The four innermost loops transformed

Figure 31 The two middle loops transformed

each iteration of the driver loop must be properlycomputed to guarantee that the semantics of the orig-inalloop are preserved. The specifications of thesebounds as a function of I, u, ZS, and US for the mostpractical choices of n (n = 2 and n = 3) are shownin Figure 28.


Appendix C: Applying the general method toarbitrary loop nests

As a more complex example of the general method,consider the loop nest of Figure 29. It consists of anoutermost loop i. The body of loop i contains two


Figure 32 Result of applying the transformation to all loops

inner loops (j 1 and j 2) and three segments ofstraight-line code (5 1,5 3 , and 55)' The body ofloopj I contains two innermost loops (k 1 and k 2 ) and asegment of straight-line code (52)' The body of loopl : also contains two innermost loops (k 3 andk 4 ) anda segment of straight-line code (54)' The pseudocodefor the loop nest is shown in Figure 29A, and a sche-matic representation is shown in Figure 29B. We usethe notation i(B (curved braces) to represent a loopon index variable i and body B that needs boundstesting on array references indexed by i . We use thenotation i[B (square braces) to represent a loop onindex variable i and body B that does not need test-ing on array references indexed by i. In the originalloop, all array accesses have to be tested for validindices, and this is represented in the diagram withcurved braces for all the loops.

We first apply the transformation to the four inner-most loops (kj, k 2 , k 3 , and e .). Each ofthese loopsis transformed into a sequence of three loops, with


the middle one not needing tests on the loop index.These transformations are illustrated in Figure 30.

In the next step, we apply the transformation to themiddle loops j I and j 2' Again, each of these loopsis transformed into a sequence of three loops. Theresulting loop nest is shown in Figure 31. For clar-ity, we represent each of the transformed (expand-e.d) loops, j 1 and j2, by I(~(j 1)Iand I(~(j2) I, respec-tively.

Finally, we complete the operation by applying thetransformation to the outermost i loop. This gen-erates three versions of the i loop, with the middleone needing no tests on array references indexed byi. The final result is shown in Figure 32. The originalloop nest of Figure 29B is transformed into a se-quence of three i loops. Inside the middle i loop thereare regions that do not need any tests. The trans-formed code will execute efficiently if all or almostall the iterations are executed in these regions.


Figure 33 Applying the transformation to the outermost loop

Appendix D: Applying the compact methodto arbitrary loop nests

As a more complex example of the compact method,consider the loop nest of Figure 29. The applicationof the transformation to the outermost i loop is il-lustrated in Figure 33. It generates a driver looparound two instances of the loop nest (as describedin Appendix B). In our schematic notation, driverloops are represented by curly braces ({). One of theinstances (with the square bracket for i) does notneed any tests on references indexed by i.

The transformation can then be applied to the mid-dle loops j 1 and j 2 in the unchecked version of thei loop. This is illustrated in Figure 34. It resultsin the creation of a loop nest without any testsfor i or j 1 and another loop nest without any testsfor i or j2'

Finally, we apply the transformation to the inner-most k 1, k 2, k 3, and k 4 loops, in the regions alreadywithout i,j 1, andj 2 tests. As illustrated in Figure 35,this creates four regions of code (the bodies of thekj, k 2 , k 3 , and k , loops) that do not need any testson the array references.


Appendix E: Computing the number ofregions when using the compact method

In this appendix we derive an expression for the num-ber of regions a loop nest is partitioned into whenusing the compact method discussed earlier in thispaper. If we apply the compact method to loop L 1

of the loop nest shown in Equation 42, we partitionthe iteration space into three regions, according toEquation 41:

(At this point, we are not concerned with the run-time tests necessary in each region.) Applying themethod recursively to the L 2 loop in the safe regionof L 1 generates 2 + 3n 1regions, where nJ = u: -


Figure 34 Applying the transformation to the two middle loops

i

I;' + 1 is the number of iterations in the safe regionof loop L j as seen in Figure 36.

In general, applying the compact method to the per-fect loop nest of Equation 42 results in

variables), then the expression for the number of re-gions simplifies to:

n = 2 + n~(2 + n~( ... (2 + nd - 13) ... ))

= 2 + 2n~ + 2n~n~ + ... + 2n~n~ ... n~-z

+ 3n;n~ ... n~-l (72)(71)

regions. The summations over ij add the number ofregions for each value of ij in its safe region. Notethat the values of 1/ and u/ can depend on the valuesof i.. iz, ... .c.; If the loops are rectangular (i.e., I;' and

Ju/ do not depend on the values of other indexJ

or, in more compact form:

n~ 2(1+ :~ j~ n:) + 3:~ n; (73)


Figure 35 Applying the transformation to the four inner loops

To make Equation 73 correct for d = 1, we defineIIP~l ti] = ~. The partitioning of a two-dimensionaliteration space into regions is shown in Figure 15.Note that the regions have different characteristicsas to which run-time checks have to be performed.

Appendix F: Algorithm to optimize thenumber of regions computed for therestricted methodFigure 37 shows the algorithm for performing theoptimization described previously that reduces the


number of regions. The main difference,with respectto procedure regions of Figure 16, is when buildingthe safe region for loop i..

In Figure 37, instead of directly applying procedureregions to each iteration of the safe region i., a testis performed using function nochecks. The test ver-ifieswhether the safe lower and upper bounds of allloops inside Ioopz, (loopsij+j,ij+ 2 , ••• ,id ) are equalto the corresponding full bounds. If that is the case(function nochecks returns true), then a single mul-


Figure 36 Applying the compact method to the two outermost loops generates 2 + 3nf regions.

tidimensional safe region can be created at this point.If function nochecks returns false, then procedureregions is applied recursively as in Figure 16. We alsoput guards to generate the regions preceding and suc-ceeding the safe region of loop ij only if they are non-empty.

'Trademark or registered trademark of International BusinessMachines Corporation.

"Trademark or registered trademark of Sun Microsystems, Inc.

Cited references

1. J. Gosling, B. Joy, and G. Steele, The Java Language Spec-ification, Addison-Wesley Publishing Co., Reading, MA(1996).

2. J. Gosling, The Evolution of Numerical Computing in Java,document available at URL http://java.sun.com/people/jaglFP.html, Sun Microsystems, Inc.

3. A. V. Aho, R. Sethi, and J. D. Ullman, Compilers: Principles,Techniques, and Tools, Addison-Wesley Publishing Co., Read-ing, MA (1985).

4. V. Seshadri, "IBM High Performance Compiler for Java,"AIXpert Magazine, http://www.developer.ibm.com/library/aixpert/ (September 1997).

5. A. Krall and R. Graft, "CACAO-a 64-bit JavaVM Just inTime Compiler," Concurrency, Practiceand Experience 9, No.11, 1017-30 (November 1997). Java for Computational Sci-ence and Engineering-Simulation and Modeling II, Las Ve-gas, NV (June 21, 1997).

6. W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flan-nery, Numerical Recipes in FORTRAN: The Art of ScientificComputing, Cambridge University Press, Cambridge, UK(1992).

7. D. Baxter, R. Mirchandaney, and J. H. Saltz, "Run-Time Par-allelization and Scheduling of Loops," Proceedingsofthe 1989ACM Symposium on Parallel Algorithms and Architectures(1989), pp. 303-312.

8. S. P. Midkiff, J. E. Moreira, and M. Gupta, Method for Op-timizingArray Bounds Checks in Programs, IBM Docket #YO-998-052, patent filed April 24, 1998.

IBM SYSTEMS JOURNAL. VOL 37. NO 3. 1998

9. B. Schneier, Applied Cryptography:Protocols, Algorithms, andSource Code in C, John Wiley & Sons, Inc., New York (1994).

10. B. Schwarz, W. Kirchgassner, and R. Landwehr, "An Opti-mizer for Ada-Design, Experience and Results," Proceed-ings of the ACM SIGPLAN '88 Conference on ProgrammingLanguage Design and Implementation (June 1988), pp. 175-185.

11. P. Cousot and R. Cousot, "Abstract Interpretation: A Uni-fied Lattice Model for Static Analysis of Programs by Con-struction or Approximation of Fixpoints," Conference Recordofthe 4thACM Symposium on PrinciplesofProgramming Lan-guages (January 1977), pp. 238-252.

12. P. Cousot and N. Halbwachs, "Automatic Discovery of Lin-ear Restraints Among Variables of a Program," ConferenceRecord ofthe 5thACM Symposium on Principles ofProgram-ming Languages (January 1978), pp. 84-96.

13. P. Cousot and N. Halbwachs, "Automatic Proofs of the Ab-sence of Common Runtime Errors," Conference Record ofthe 5th ACM Symposium on Principles ofProgramming Lan-guages (January 1978), pp. 105-118.

14. W. H. Harrison, "Compiler Analysis for the Value Rangesfor Variables," IEEE Transactions on Software EngineeringSE3, No.3, 243-250 (May 1977).

15. J. M. Asuru, "Optimization of Array Subscript RangeChecks," ACM Letters on Programming Languages and Sys-tems 1, No.2, 109-118 (June 1992).

16. R. Gupta, "A Fresh Look at Optimizing Array Bounds Check-ing," Proceedings of the ACM SIGPLAN '90 Conference onProgramming Language Design and Implementation (June1990), pp. 272-282.

17. R. Gupta, "Optimizing Array Bound Checks Using FlowAnalysis," ACM Letters on Programming Languages and Sys-tems 2, Nos. 1-4, 135-150 (March-December, 1993).

18. P. Kolte and M. Wolfe, "Elimination of Redundant ArraySubscript Range Checks," ProceedingsoftheACMSIGPLAN'95 Conference on Programming Language Design and Imple-mentation (June 1995), pp. 270-278.

19. V. Markstein, J. Cocke, and P. Markstein, "Elimination ofRedundant Array Subscript Range Checks," Proceedings oftheA CM SIGPLAN '82 Conference on ProgrammingLanguageDesign and Implementation (June 1982), pp. 114-119.

MIDKIFF. MOREIRA. AND SNIR 451

Figure 37 Optimized procedure to compute the regions for a loop nest

20. Pentium Processor Family Developer's Manual, Volume 3: Ar-chitectureand Programming Manual, Intel Corporation, SantaClara, CA (1995).

21. AIX Version 3.2 Assembler Language Reference, Third Edi-tion, IBM Corporation (October 1993); available throughIBM branch offices.

22. Java JIT Compiler Project Home Page, http://www.trl.ibm.com.jp/projects/s721 O/javajit/index_e.htm, IBM Corporation.

23. S. P. Midkiff, "Computing the Local Iteration Set of a Block-Cyclically Distributed Reference with Affine Subscripts," SixthWorkshop on Compilers for Parallel Computing (1996).

24. K. van Reeuwijk, W. Denissen, H. J. Sips, and E. M. R. M.Paalvasr, "An Implementation Framework for HPF Distrib-uted Arrays on Message-Passing Parallel Computer Systems,"IEEE Transactions on Parallel and Distributed Systems 7, No.9, 897-914 (September 1996).


25. U. Banerjee, "Loop Transformations for Restructuring Com-pilers," Chapter 3, Dependence Analysis, Kluwer AcademicPublishers, Boston (1997).

Accepted for publication March 25, 1998.

Samuel P. Midkiff IBM Research Division, T.1. Watson ResearchCenter, P.G. Box 218, Yorktown Heights, New York 10598 (elec-tronic mail: [email protected]).Dr.MidkiffreceivedaB.S.de-gree in computer science in 1983 from the University of Kentucky,and M.S. and Ph.D. degrees in computer science from the Uni-versity of Illinois at Urbana-Champaign in 1986 and 1992, respec-tively. While at the University of Illinois, he spent two years inthe design and development ofthe Cedar FORTRAN compiler.


He is a research staff member in the Scalable Parallel SystemsDepartment at the Watson Research Center, and an adjunct as-sistant professor at the University of Illinois, Urbana-Champaign.Since joining the Research Center in 1992, Dr. Midkiffhas workedon the design and development of the IBM XL HPF compiler,the integration of the Distributed Resource Management Sys-tem (DRMS) with various high-level languages, and the ASCIBlue compiler project. His current research areas are optimizingcomputationally intensive Java programs, compilation for sharedmemory multiprocessors, and the analysis of explicitly parallelprograms. He has authored or coauthored several refereed jour-nal and conference papers on these subjects.

Jose E. Moreira IBM Research Division, T 1. Watson ResearchCenter, P.o. Box 218, Yorktown Heights, New York 10598 (elec-tronic mail:[email protected]).Dr.MoreirareceivedB.S.de-grees in physics and electrical engineering in 1987 and an M.S.degree in electrical engineering in 1990, all from the Universityof Sao Paulo, Brazil. He received his Ph.D. degree in electricalengineering from the University of Illinois at Urbana-Champaignin 1995. Dr. Moreira is a research staff member in the ScalableParallel Systems Department at the Watson Research Center.Since joining IBM in 1995, he has worked on various topics re-lated to the design and execution of parallel applications. His cur-rent research activities include performance evaluation and op-timization of Java programs, and scheduling mechanisms for theASCI Blue project. He iscoauthor of several papers on task sched-uling, performance evaluation, programming languages, and ap-plication reconfiguration.

Marc Snir IBM Research Division, T 1. Watson Research Center,Po. Box 218, Yorktown Heights, New York 10598 (electronic mail:[email protected]). Dr. Snir is a senior manager at the WatsonResearch Center, where he leads research on scalable parallelsystems. He and his gro~ developed many of the technologiesthat led to the IBM SpT product, and continue to work on fu-ture SP generations. He received a Ph.D. in mathematics fromthe Hebrew University of Jerusalem in 1979. He worked at NewYork University on the NYU Ultracomputer project in 1980-1982, and worked at the Hebrew University of Jerusalem from1982 to 1986, when he joined IBM. Dr. Snir has published closeto 100journal and conference papers on computational complex-ity, parallel algorithms, parallel architectures, and parallel pro-gramming. He has recently coauthored the High PerformanceFORTRAN and the Message Passing Interface standards. He ison the editorial board of Transactions on Computer Systems andParallel Processing Letters. He is a member ofthe IBM Academyof Technology and an IEEE Fellow.

Reprint Order No. G321-5685.

IBM SYSTEMS JOURNAL. VOL 37. NO 3, 1998 MIDKIFF, MOREIRA, AND SNIR 453

Optimizing array reference checking in Java programs

Documents

Transcript of Optimizing array reference checking in Java programs