STUnit-6

8/6/2019 STUnit-6

1/22

6.5 Data Flow Analysis with Arrays and Pointers

The models and flow analyses described in the preceding section

have been limited to simple scalar variables in individual

procedures. Arrays and pointers (including object references and

procedure arguments) introduce additional issues, because it is not

possible in general to determine whether two accesses refer to the

same storage location . For example, consider the following code

fragment:

1 a[i] = 13;

2 k = a[j];

Are these two lines a definition-use pair? They are if the values of i

and j are equal, which might be true on some executions and not on

others. A static analysis cannot, in general, determine whether they

are always, sometimes, or never equal, so a source of imprecision

is necessarily introduced into data flow analysis.

Pointers and object references introduce the same issue, often in

less obvious ways. Consider the following snippet:

1 a[2] = 42;

2 i = b[2];

8/6/2019 STUnit-6

2/22

I t seems that there cannot possibly be a definition-use pair

involving these two lines, since they involve none of the same

variables. However, arrays in Java are dynamically allocated

objects accessed through pointers. Pointers of any kind introduce

the possibility of aliasing , that is, of two different names referring

to the same storage location. For example, the two lines above

might have been part of the following program fragment:

1 int []a= new int [3];

2 int []b=a;

3 a[2] = 42;

4 i = b[2];

Here a and b are aliases , two different names for the same

dynamically allocated array object, and an assignment to part of a

is also an assignment to part of b.

The same phenomenon, and worse, appears in languages with

lower-level pointer manipulation. Perhaps the most egregious

example is pointer arithmetic in C:

1 p=&b;

2 *(p+i)=k;

8/6/2019 STUnit-6

3/22

I t is impossible to know which variable is defined by the second

line. Even if we know the value of i, the result is dependent on how

a particular compiler arranges variables in memory.

Dynamic references and the potential for aliasing introduceuncertainty into data flow analysis. I n place of a definition or use

of a single variable, we may have a potential definition or use of a

whole set of variables or locations that could be aliases of each

other. The proper treatment of this uncertainty depends on the use

to which the analysis will be put. For example, if we seek strong

assurance that v is always initialized before it is used, we may not

wish to treat an assignment to a potential alias of v as initialization,

but we may wish to treat a use of a potential alias of v as a use of

v.

A useful mental trick for thinking about treatment of aliases is to

translate the uncertainty introduced by aliasing into uncertainty

introduced by control flow. After all, data flow analysis already

copes with uncertainty about which potential execution paths willactually be taken; an infeasible path in the control flow graph may

add elements to an any-paths analysis or remove results from an

all-paths analysis. I t is usually appropriate to treat uncertainty

8/6/2019 STUnit-6

4/22

about aliasing consistently with uncertainty about control flow. For

example, considering again the first example of an ambiguous

reference:

1 a[i] = 13;2 k = a[j];

We can imagine replacing this by the equivalent code:

1 a[i] = 13;

2 if (i == j) {

3 k = a[i];

4 } else {

5 k = a[j];

6 }

I n the (imaginary) transformed code, we could treat all array

references as distinct, because the possibility of aliasing is fully

expressed in control flow. Now, if we are using an any-path

analysis like reaching definitions, the potential aliasing will resultin creating a definition-use pair. On the other hand, an assignment

to a[j] would not kill a previous assignment to a[i]. This suggests

that, for an any-path analysis, gen sets should include everything

8/6/2019 STUnit-6

5/22

8/6/2019 STUnit-6

6/22

We cannot determine whether the two arguments fromCust and

toCust are references to the same object without looking at the

context in which this method is called. Moreover, we cannot

determine whether fromHome and fromWork are (or could be)

references to the same object without more information about how

Cust I nfo objects are treated elsewhere in the program.

Sometimes it is sufficient to treat all nonlocal information as

unknown. For example, we could treat the two Cust I nfo objects as

potential aliases of each other, and similarly treat the four

PhoneNum objects as potential aliases. Sometimes, though, large

sets of aliases will result in analysis results that are so imprecise as

to be useless. Therefore data flow analysis is often preceded by an

interprocedural analysis to calculate sets of aliases or the locations

that each pointer or reference can refer to.

6.6 Interprocedural AnalysisMost important program properties involve more than one

procedure, and as mentioned earlier, some interprocedural analysis

(e.g., to detect potential aliases) is often required as a prelude even

to intraprocedural analysis. One might expect the interprocedural

analysis and models to be a natural extension of the intraprocedural

8/6/2019 STUnit-6

7/22

analysis, following procedure calls and returns like intraprocedural

control flow. Unfortunately, this is seldom a practical option.

I f we were to extend data flow models by following control flow

paths through procedure calls and returns, using the control flowgraph model and the call graph model together in the obvious way,

we would observe many spurious paths. Figure 6.13 illustrates the

problem: Procedure foo and procedure bar each make a call on

procedure sub. When procedure call and return are treated as if

they were normal control flow, in addition to the execution

sequences ( A,X,Y,B ) and ( C,X,Y,D ), the combined graph contains

the impossible paths ( A,X,Y,D ) and ( C,X,Y,B ).

Figure 6.13: Spurious execution paths result when procedure calls

and returns are treated as normal edges in the control flow graph.

The path ( A,X,Y,D ) appears in the combined graph, but it does not

correspond to an actual execution order.

8/6/2019 STUnit-6

8/22

I t is possible to represent procedure calls and returns precisely, for

example by making a copy of the called procedure for each point at

which it is called. This would result in a co ntext-sensitive analysis.

The shortcoming of context sensitive analysis was already

mentioned in the previous chapter : The number of different

contexts in which a procedure must be considered could be

exponentially larger than the number of procedures. I n practice, a

context-sensitive analysis can be practical for a small group of

closely related procedures (e.g., a single Java class), but is almost

never a practical option for a whole program.

Some interprocedural properties are quite independent of context

and lend themselves naturally to analysis in a hierarchical,

piecemeal fashion. Such a hierarchical analysis can be both precise

and efficient. The analyses that are provided as part of normal

compilation are often of this sort. The unhandled exception

analysis of Java is a good example : Each procedure (method) is

required to declare the exceptions that it may throw without

handling. I f method M calls method N in the same or another class,

and if N can throw some exception, then M must either handle that

exception or declare that it, too, can throw the exception. This

analysis is simple and efficient because, when analyzing method

8/6/2019 STUnit-6

9/22

M, the internal structure of N is irrelevant; only the results of the

analysis at N (which, in Java, is also part of the signature of N) are

needed.

Two conditions are necessary to obtain an efficient, hierarchicalanalysis like the exception analysis routinely carried out by Java

compilers. First, the information needed to analyze a calling

procedure must be small: I t must not be proportional either to the

size of the called procedure, or to the number of procedures that

are directly or indirectly called. Second, it is essential that

information about the called procedure be independent of the

caller; that is, it must be context-independent. When these two

conditions are true, it is straightforward to develop an efficient

analysis that works upward from leaves of the call graph. (When

there are cycles in the call graph from recursive or mutually

recursive procedures, an iterative approach similar to data flow

analysis algorithms can usually be devised.)

Unfortunately, not all important properties are amenable tohierarchical analysis. Potential aliasing information, which is

essential to data flow analysis even within individual procedures, is

one of those that are not. We have seen that potential aliasing can

8/6/2019 STUnit-6

10/22

depend in part on the arguments passed to a procedure, so it does

not have the context-independence property required for an

efficient hierarchical analysis. For such an analysis, additional

sacrifices of precision must be made for the sake of efficiency.

Even when a property is context-dependent, an analysis for that

property may be context-insensitive, although the context-

insensitive analysis will necessarily be less precise as a

consequence of discarding context information. At the extreme, a

linear time analysis can be obtained by discarding both context and

control flow information.

Context- and flow-insensitive algorithms for pointer analysis

typically treat each statement of a program as a constraint. For

example, on encountering an assignment

1 x=y;

where y is a pointer, such an algorithm simply notes that x may

refer to any of the same objects that y may refer to. R eferen c es( x) R eferen c es( y) is a constraint that is completely independent of

the order in which statements are executed. A procedure call, in

such an analysis, is just an assignment of values to arguments.

8/6/2019 STUnit-6

11/22

Using efficient data structures for merging sets, some analyzers

can process hundreds of thousands of lines of source code in a few

seconds. The results are imprecise, but still much better than the

worst-case assumption that any two compatible pointers might

refer to the same object.

The best approach to interprocedural pointer analysis will often lie

somewhere between the astronomical expense of a precise,

context- and flow-sensitive pointer analysis and the imprecision of

the fastest context- and flow-insensitive analyses. Unfortunately,

there is not one best algorithm or tool for all uses. I n addition to

context and flow sensitivity, important design trade-offs include

the granularity of modeling references (e.g., whether individual

fields of an object are distinguished) and the granularity of

modeling the program heap (that is, which allocated objects are

distinguished from each other).

8/6/2019 STUnit-6

12/22

Chapter 13: Data Flow Testing12 /* External file hex values.h defines Hex Values[128] 3 * with value 0 to 15 for the legal hex digits (case-insensitive) 4 * and value -1 for each illegal digit including special characters 5 */ 67 # incl u de "hex val u es.h"8 /** Translate a string from the CGI encoding to plain ascii text. 9 * '+' becomes space, %xx becomes byte with hex value xx, 10 * other alphanumeric characters map to themselves. 11 * Returns 0 for success, positive for erroneous input 12 * 1 = bad hexadecimal digit 13 */ 14 int cgi decode(char *encoded, char *decoded) { 15 char *eptr = encoded;16 char *dptr = decoded;17 int ok=0;18 while (*e p tr) { 19 char c;20 c = *eptr;2122 if (c == '+') { /* Case 1: '+' maps to blank */ 23 *dptr = '';24 } else if (c == '%') { /* Case 2: '%xx' is hex for character xx */ 25 int digit high = Hex Values[*(++eptr)];26 int digit low = Hex Values[*(++eptr)];27 if ( digit high == -1 || digit low==-1) {2 8 /* *dptr='?'; */ 29 ok=1; /* Bad return code */ 30 } else {31 *dptr = 16* digit high + digit low;32 }33 } else { /* Case 3: All other characters map to themselves */ 34 *dptr = *eptr;35 }36 ++dptr;37 ++eptr;

3 8 } 39 *dptr = ' \ 0'; /* Null terminator for string */ 40 ret u rn ok;41 }

Figure 13.1: The C function cgi_decode , which translates a cgi-encoded string to a plain ASCII string (reversingthe encoding applied by the common gateway interface of most Web servers). This program is also used inChapter 12 and also presented in Figure 12.1 of Chapter 12 .

8/6/2019 STUnit-6

13/22

13 .2 Definition-Use Associations

8/6/2019 STUnit-6

14/22

13 .3 Data Flow Testing CriteriaV arious data flow testing criteria can be defined by requiring

coverage of DU pairs in various ways.

The All DU pairs adequa c y c riteri o n requires each DU pair

to be exercised in at least one program execution, following

the idea that an erroneous value produced by one statement

(the definition) might be revealed only by its use in another

statement.

A test suite T for a program P satisfies the all DU pairs

adequacy criterion iff, for each DU pair du of P , at least one

test case in T exercises du.

The corresponding coverage measure is the proportion of

covered DU pairs:

The all DU pairs coverage CDU pairs of T for P is the

fraction of DU pairs of program P exercised by at least one

test case in T .

8/6/2019 STUnit-6

15/22

The all DU pairs adequacy criterion assures a finer grain

coverage than statement and branch adequacy criteria.

If we consider, for example, function cgi decode, we can

easily see that statement and branch coverage can be

obtained by traversing the while loop no more than once, for

example, with the test suite T bran ch = {"+","%3D","%FG","t"}

while several DU pairs cannot be covered without executing

the while loop at least twice.

The pairs that may remain uncovered after statement and

branch coverage correspond to occurrences of different

characters within the source string, and not only at the

beginning of the string.

For example, the DU pair 37, 25 for variable *eptr can

be covered with a test case T C DU pairs "test%3D" where the

hexadecimal escape sequence occurs inside the input string, but not with "%3D." The test suite T DU pairs obtained by

adding the test case T C DU pairs to the test suite T bran ch satisfies

the all DU pairs adequacy criterion, since it adds both the

8/6/2019 STUnit-6

16/22

cases of a hexadecimal escape sequence and an ASC II

character occurring inside the input string.

One DU pair might belong to many different execution

paths. The All DU pat h s adequa c y c riteri o n extends the all

DU pairs criterion by requiring each simple (non looping)

DU path to be traversed at least once, thus including the

different ways of pairing definitions and uses. This can

reveal a fault by exercising a path on which a definition of a

variable should have appeared but was omitted.

A test suite T for a program P satisfies the all DU paths

adequacy criterion iff, for each simple DU path dp of P ,

there exists at least one test case in T that exercises a path

that includes dp.

The corresponding coverage measure is the fraction of

covered simple DU paths:

The all DU pair coverage C DU pat h s of T for P is the fraction

of simple DU paths of program P executed by at least one

test case in T .

8/6/2019 STUnit-6

17/22

The test suite T DU pairs does not satisfy the all DU paths

adequacy criterion, since both DU pairs 37,37 for variable eptr and 36,23 for variable dptr correspond

each to two simple DU paths, and in both cases one of the

two paths is not covered by test cases in T DU pairs . The

uncovered paths correspond to a test case that includescharacter + occurring within the input string (e.g., test case

T C DU pat h s = "test+case").

Although the number of simple DU paths is often quite

reasonable, in the worst case it can be exponential in the size

of the program unit. This can occur when the code between

the definition and use of a particular variable is essentially

irrelevant to that variable, but contains many control paths,

as illustrated by the example in Figure 13.2 : The code

between the definition of ch in line 2 and its use in line 12

does not modify ch, but the all DU paths coverage criterion

would require that each of the 256 paths be exercised.

8/6/2019 STUnit-6

18/22

1

2 void countBits( char ch) {

3 int count = 0;

4 if (ch & 1) ++count;



7 if (ch & 8 ) ++count;

8 if (ch & 16) ++count;

9 if (ch & 32) ++count;

10 if (ch & 64) ++count;

11 if (ch & 12 8 ) ++count;

12 printf("'%c' (0X%02x) has %d bitsset to 1 \ n",

13 ch, ch, count);

14 }

Figure 13.2: A C procedure with a large number of DU paths.

The number of DU paths for variable ch is exponential in the

number of if statements, because the use in each increment

8/6/2019 STUnit-6

19/22

8/6/2019 STUnit-6

20/22

there exists at least one test case in T that exercises a DU

pair that includes def .

The corresponding coverage measure is the proportion of

covered definitions, where we say a definition is covered

only if the value is used before being killed:

The all definitions coverage C Def of T for P is the fraction of

definitions of program P covered by at least one test case in

T .

8/6/2019 STUnit-6

21/22

13 .4 Data Flow Coverage with Complex Structures

L ike all static analysis techniques, data flow analysis approximates the effects of program executions. I t suffersimprecision in modeling dynamic constructs, in particular dynamic access to storage, such as indexed access toarray elements or pointer access to dynamically allocated storage. We have seen in Chapter 6 (page 94) that the

proper treatment of potential aliases involving indexes and po inters depends on the use to which analysis results

will be put. For the purpose of choosing test cases, some risk of underestimating alias sets may be preferable togross overestimation or very expensive analysis.

The precision of data flow analysis depends on the precision of alias information used in the analysis. Aliasanalysis requires a trade-off between precision and computational expense, with significant overestimation of alias sets for approaches that can be pract ically applied to real programs. I n the case of data flow testing,imprecision can be mitigated by specializing the alias analysis to identification of definition-clear paths betweena potentially matched definition and use. We do not need to compute aliases for all possible behaviors, but onlyalong particular control flow paths. The risk of underestimating alias sets in a local analysis is acceptableconsidering the application in choosing good test cases rather than offering hard guarantees of a property.

I n the cgi decode example we have made use of information that would require either extra guidance from thetest designer or a sophisticated tool for data flow and alias analysis. We may know, from a global analysis, thatthe parameters encoded and decoded never refer to the same or overlapping memory regions, and we may infer that initially eptr and dptr likewise refer to disjoint regions of memory, over the whole range of values that thetwo pointers take. L acking this information, a simple static data flow analysis might consider *dptr a potentialalias of *eptr and might therefore consider the assignment *dptr = *eptr a potential definition of both *dptr and*eptr. These spurious definitions would give rise to infeasible DU pairs, which produce test obligations that cannever be satisfied. A local analysis that instead assumes (without verification) that *eptr and *dptr are distinctcould fail to require an important test case if they can be aliases. Such underestimation may be preferable tocreating too many infeasible test obligations.

A good alias analysis can greatly improve the applicability of data flow testing but cannot eliminate all problems. Undisciplined use of dynamic access to storage can make precise alias analysis extremely hard or impossible. For example, the use of pointer arithmetic in the program fragment of Figure 13.3 results in aliasesthat depend on the way the compiler arranges variables in memory.

1 void pointer abuse() {2 int i=5, j=10, k=20;3 int *p, *q;4 p=&j+1;5 q=&k;6 *p = 30;7 *q=*q+55;8 printf("p=%d, q=%d \ n", *p, *q);9 }

Figure 13.3: Pointers to objects in the program stack can create essentially arbitrary definition-use associations,particularly when combined with pointer arithmetic as in this example.

13 .5 The Infeasibility Problem

8/6/2019 STUnit-6

22/22

Not all elements of a program are executable, as discussed in Section 12.8 of Chapter 12 . The path-orientednature of data flow testing criteria aggravates the problem since infeasibility creates test obligations not only for isolated unexecutable elements, but also for infeasible combinations of feasible elements.

Complex data structures may amplify the infeasibility problem by adding infeasible paths as a result of aliascomputation. For example, while we can determine that x[i] is an alias of x[j] exactly when i = j, we may not beable to determine whether i can be equal to j in any possible program execution.

Fortunately, the problem of infeasible paths is usually modest for the all definitions and all DU pairs adequacy

criteria, and one can typically achieve levels of coverage comparable to those achievable with simpler criterialike statement and branch adequacy during unit testing. The all DU paths adequacy criterion, on the other hand,often requires much larger numbers of control flow paths. I t presents a greater problem in distinguishingfeasible from infeasible paths and should therefore be used with discretion.

O pen Research Issues

Data flow test adequacy criteria are close to the border between techniques that can be applied at low cost withsimple tools and techniques that offer more power but at much higher cost. While in principle data flow testcoverage can be applied at modest cost (at least up to the all DU adequacy criterion), it demands moresophisticated tool support than test coverage monitoring tools based on control flow alone.

Fortunately, data flow analysis and alias analysis have other important applications. I mproved support for dataflow testing may come at least partly as a side benefit of research in the programming languages and compilerscommunity. I n particular, finding a good balance of cost and precision in data flow and alias analysis across

procedure boundaries (interprocedural or "whole program" analysis) is an active area of research.

The problems presented by pointers and complex data structures cannot be ignored. I n particular, modernobject-oriented languages like Java use reference semantics - an object reference is essentially a pointer - and soalias analysis (preferably interprocedural) is a prerequisite for applying data flow analysis to object-oriented

programs.

STUnit-6

Documents

Transcript of STUnit-6