STUnit-6

  • 8/6/2019 STUnit-6

    6.5 Data Flow Analysis with Arrays and Pointers

    The models and flow analyses described in the preceding section

    have been limited to simple scalar variables in individual

    procedures. Arrays and pointers (including object references and

    procedure arguments) introduce additional issues, because it is not

    possible in general to determine whether two accesses refer to the

    same storage location. For example, consider the following code

    fragment:

        a[i] = 13;
        k = a[j];

    Are these two lines a definition-use pair? They are if the values of i

    and j are equal, which might be true on some executions and not on

    others. A static analysis cannot, in general, determine whether they

    are always, sometimes, or never equal, so a source of imprecision

    is necessarily introduced into data flow analysis.
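
Whether the pair exists can be observed only at run time. The following is a minimal C sketch of that dependence; the function name read_after_write and the array size of 4 are illustrative assumptions, not part of the original fragment.

```c
#include <assert.h>

/* Hypothetical helper: performs the definition a[i] = 13 and the use
 * k = a[j] on a zero-initialized local array, then returns k. */
int read_after_write(int i, int j) {
    int a[4] = {0, 0, 0, 0};
    assert(i >= 0 && i < 4 && j >= 0 && j < 4); /* stay inside the array */
    a[i] = 13;      /* definition of a[i] */
    int k = a[j];   /* use of a[j]: sees the definition only when i == j */
    return k;
}
```

When i and j are equal, the use reads the value just defined; otherwise it reads the initial zero, so the definition-use pair is exercised on some executions and not on others.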

    Pointers and object references introduce the same issue, often in

    less obvious ways. Consider the following snippet:

        a[2] = 42;
        i = b[2];

    It seems that there cannot possibly be a definition-use pair

    involving these two lines, since they involve none of the same

    variables. However, arrays in Java are dynamically allocated

    objects accessed through pointers. Pointers of any kind introduce

    the possibility of aliasing, that is, of two different names referring

    to the same storage location. For example, the two lines above

    might have been part of the following program fragment:

        int[] a = new int[3];
        int[] b = a;
        a[2] = 42;
        i = b[2];

    Here a and b are aliases, two different names for the same

    dynamically allocated array object, and an assignment to part of a

    is also an assignment to part of b.
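
The same effect can be reproduced in C with a heap-allocated array. This is only a sketch of a C analogue of the Java fragment; the function name alias_demo is an assumption made for illustration.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical C analogue of the Java fragment: a and b name the same
 * heap-allocated array, so a write through a is visible through b. */
int alias_demo(void) {
    int *a = calloc(3, sizeof *a);  /* zero-initialized array of 3 ints */
    assert(a != NULL);              /* sketch: treat allocation failure as fatal */
    int *b = a;                     /* b is now an alias of a */
    a[2] = 42;                      /* definition through a ... */
    int i = b[2];                   /* ... reaches this use through b */
    free(a);
    return i;
}
```

The definition and the use mention different names, yet they form a definition-use pair because both names refer to the same storage.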

    The same phenomenon, and worse, appears in languages with

    lower-level pointer manipulation. Perhaps the most egregious

    example is pointer arithmetic in C:

        p = &b;
        *(p + i) = k;

    It is impossible to know which variable is defined by the second

    line. Even if we know the value of i, the result is dependent on how

    a particular compiler arranges variables in memory.

    Dynamic references and the potential for aliasing introduce uncertainty into data flow analysis. In place of a definition or use

    of a single variable, we may have a potential definition or use of a

    whole set of variables or locations that could be aliases of each

    other. The proper treatment of this uncertainty depends on the use

    to which the analysis will be put. For example, if we seek strong

    assurance that v is always initialized before it is used, we may not

    wish to treat an assignment to a potential alias of v as initialization,

    but we may wish to treat a use of a potential alias of v as a use of

    v.

    A useful mental trick for thinking about treatment of aliases is to

    translate the uncertainty introduced by aliasing into uncertainty

    introduced by control flow. After all, data flow analysis already

    copes with uncertainty about which potential execution paths will actually be taken; an infeasible path in the control flow graph may

    add elements to an any-paths analysis or remove results from an

    all-paths analysis. It is usually appropriate to treat uncertainty

    about aliasing consistently with uncertainty about control flow. For

    example, considering again the first example of an ambiguous

    reference:

        a[i] = 13;
        k = a[j];

    We can imagine replacing this by the equivalent code:

        a[i] = 13;
        if (i == j) {
            k = a[i];
        } else {
            k = a[j];
        }
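
That the two fragments behave identically can be checked by brute force over a small index range. The sketch below assumes C arrays of at most 8 elements; the names original, transformed, and fragments_agree are hypothetical.

```c
#include <assert.h>

/* Original fragment: one ambiguous use of a[j]. */
int original(int a[], int i, int j) {
    a[i] = 13;
    return a[j];
}

/* Transformed fragment: the possibility of aliasing becomes a branch. */
int transformed(int a[], int i, int j) {
    int k;
    a[i] = 13;
    if (i == j) {
        k = a[i];   /* aliased case: reads the value just defined */
    } else {
        k = a[j];   /* non-aliased case: reads a different element */
    }
    return k;
}

/* Hypothetical check: returns 1 if both versions compute the same k
 * for every index pair in [0, n), and 0 otherwise. */
int fragments_agree(int n) {
    assert(n > 0 && n <= 8);
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            int a1[8] = {0}, a2[8] = {0};
            if (original(a1, i, j) != transformed(a2, i, j)) {
                return 0;
            }
        }
    }
    return 1;
}
```

In the transformed version each array reference can be analyzed as a distinct variable, because the branch on i == j carries all the uncertainty.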

    In the (imaginary) transformed code, we could treat all array

    references as distinct, because the possibility of aliasing is fully

    expressed in control flow. Now, if we are using an any-path

    analysis like reaching definitions, the potential aliasing will result in creating a definition-use pair. On the other hand, an assignment

    to a[j] would not kill a previous assignment to a[i]. This suggests

    that, for an any-path analysis, gen sets should include everything

    We cannot determine whether the two arguments fromCust and

    toCust are references to the same object without looking at the

    context in which this method is called. Moreover, we cannot

    determine whether fromHome and fromWork are (or could be)

    references to the same object without more information about how

    CustInfo objects are treated elsewhere in the program.

    Sometimes it is sufficient to treat all nonlocal information as

    unknown. For example, we could treat the two CustInfo objects as

    potential aliases of each other, and similarly treat the four

    PhoneNum objects as potential aliases. Sometimes, though, large

    sets of aliases will result in analysis results that are so imprecise as

    to be useless. Therefore data flow analysis is often preceded by an

    interprocedural analysis to calculate sets of aliases or the locations

    that each pointer or reference can refer to.

    6.6 Interprocedural Analysis

    Most important program properties involve more than one

    procedure, and as mentioned earlier, some interprocedural analysis

    (e.g., to detect potential aliases) is often required as a prelude even

    to intraprocedural analysis. One might expect the interprocedural

    analysis and models to be a natural extension of the intraprocedural

    analysis, following procedure calls and returns like intraprocedural

    control flow. Unfortunately, this is seldom a practical option.

    If we were to extend data flow models by following control flow

    paths through procedure calls and returns, using the control flow graph model and the call graph model together in the obvious way,

    we would observe many spurious paths. Figure 6.13 illustrates the

    problem: Procedure foo and procedure bar each make a call on

    procedure sub. When procedure call and return are treated as if

    they were normal control flow, in addition to the execution

    sequences ( A,X,Y,B ) and ( C,X,Y,D ), the combined graph contains

    the impossible paths ( A,X,Y,D ) and ( C,X,Y,B ).

    Figure 6.13: Spurious execution paths result when procedure calls

    and returns are treated as normal edges in the control flow graph.

    The path ( A,X,Y,D ) appears in the combined graph, but it does not

    correspond to an actual execution order.
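
The spurious paths can be reproduced with a small model of the combined graph. The sketch below encodes only the six nodes named in the text; the enum, the edge table, and both function names are assumptions made for illustration, and the feasibility check handles only four-node paths through sub.

```c
#include <assert.h>

enum node { A, B, C, D, X, Y, N_NODES };

/* Combined graph with calls and returns treated as ordinary edges:
 * foo is A -> (call sub) -> B, bar is C -> (call sub) -> D,
 * and the body of sub is X -> Y. */
static const int edge[N_NODES][N_NODES] = {
    [A][X] = 1, [C][X] = 1,   /* call edges */
    [X][Y] = 1,               /* body of sub */
    [Y][B] = 1, [Y][D] = 1    /* return edges */
};

/* Returns 1 if (n0, n1, n2, n3) is a path in the combined graph. */
int in_combined_graph(enum node n0, enum node n1, enum node n2, enum node n3) {
    assert(n0 < N_NODES && n1 < N_NODES && n2 < N_NODES && n3 < N_NODES);
    return edge[n0][n1] && edge[n1][n2] && edge[n2][n3];
}

/* Returns 1 only if the path is in the graph and the return goes back
 * to the procedure that made the call (A pairs with B, C pairs with D). */
int is_feasible(enum node n0, enum node n1, enum node n2, enum node n3) {
    int context_ok = (n0 == A && n3 == B) || (n0 == C && n3 == D);
    return in_combined_graph(n0, n1, n2, n3) && context_ok;
}
```

Here in_combined_graph(A, X, Y, D) holds although is_feasible(A, X, Y, D) does not: the graph alone cannot remember which caller entered sub.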

    It is possible to represent procedure calls and returns precisely, for

    example by making a copy of the called procedure for each point at

    which it is called. This would result in a context-sensitive analysis.

    The shortcoming of context-sensitive analysis was already

    mentioned in the previous chapter: The number of different

    contexts in which a procedure must be considered could be

    exponentially larger than the number of procedures. In practice, a

    context-sensitive analysis can be practical for a small group of

    closely related procedures (e.g., a single Java class), but is almost

    never a practical option for a whole program.

    Some interprocedural properties are quite independent of context

    and lend themselves naturally to analysis in a hierarchical,

    piecemeal fashion. Such a hierarchical analysis can be both precise

    and efficient. The analyses that are provided as part of normal

    compilation are often of this sort. The unhandled exception

    analysis of Java is a good example: Each procedure (method) is

    required to declare the exceptions that it may throw without

    handling. If method M calls method N in the same or another class,

    and if N can throw some exception, then M must either handle that

    exception or declare that it, too, can throw the exception. This

    analysis is simple and efficient because, when analyzing method

    M, the internal structure of N is irrelevant; only the results of the

    analysis at N (which, in Java, is also part of the signature of N) are

    needed.

    Two conditions are necessary to obtain an efficient, hierarchical analysis like the exception analysis routinely carried out by Java

    compilers. First, the information needed to analyze a calling

    procedure must be small: It must not be proportional either to the

    size of the called procedure, or to the number of procedures that

    are directly or indirectly called. Second, it is essential that

    information about the called procedure be independent of the

    caller; that is, it must be context-independent. When these two

    conditions are true, it is straightforward to develop an efficient

    analysis that works upward from leaves of the call graph. (When

    there are cycles in the call graph from recursive or mutually

    recursive procedures, an iterative approach similar to data flow

    analysis algorithms can usually be devised.)
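
A hierarchical analysis of this kind can be sketched as a fixed-point computation over per-procedure summaries. The code below is a simplified illustration, not the algorithm used by any Java compiler: exceptions are modeled as bits in a mask, subtyping is ignored, a handler is assumed to cover the whole procedure body, and all names (proc_summary, escaping_exceptions, example_escapes_from_m) are hypothetical.

```c
#include <assert.h>

#define MAX_PROCS 8

typedef struct {
    unsigned raises;          /* exceptions thrown directly in the body */
    unsigned handles;         /* exceptions caught in the body */
    int callees[MAX_PROCS];   /* indices of the procedures it calls */
    int n_callees;
} proc_summary;

/* Computes escapes[p], the exceptions that may propagate out of p.
 * Only the summary escapes[callee] is consulted, never a callee's body.
 * Repeating until nothing changes also handles recursive call graphs. */
void escaping_exceptions(const proc_summary procs[], int n, unsigned escapes[]) {
    assert(n >= 0 && n <= MAX_PROCS);
    for (int p = 0; p < n; p++) {
        escapes[p] = 0;
    }
    int changed = 1;
    while (changed) {
        changed = 0;
        for (int p = 0; p < n; p++) {
            unsigned may_throw = procs[p].raises;
            for (int c = 0; c < procs[p].n_callees; c++) {
                may_throw |= escapes[procs[p].callees[c]];
            }
            unsigned result = may_throw & ~procs[p].handles;
            if (result != escapes[p]) {
                escapes[p] = result;
                changed = 1;
            }
        }
    }
}

/* Example: M (index 0) and N (index 1) call each other. N raises the
 * exceptions 0x1 and 0x2; M handles only 0x2. */
unsigned example_escapes_from_m(void) {
    const proc_summary procs[2] = {
        { .raises = 0,         .handles = 0x2, .callees = {1}, .n_callees = 1 },
        { .raises = 0x1 | 0x2, .handles = 0,   .callees = {0}, .n_callees = 1 }
    };
    unsigned escapes[2];
    escaping_exceptions(procs, 2, escapes);
    return escapes[0];
}
```

Each summary is a single mask, independent of the size of the called procedure and of the caller, which are the two conditions named above.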

    Unfortunately, not all important properties are amenable to hierarchical analysis. Potential aliasing information, which is

    essential to data flow analysis even within individual procedures, is

    one of those that are not. We have seen that potential aliasing can

    depend in part on the arguments passed to a procedure, so it does

    not have the context-independence property required for an

    efficient hierarchical analysis. For such an analysis, additional

    sacrifices of precision must be made for the sake of efficiency.

    Even when a property is context-dependent, an analysis for that

    property may be context-insensitive, although the context-

    insensitive analysis will necessarily be less precise as a

    consequence of discarding context information. At the extreme, a

    linear time analysis can be obtained by discarding both context and

    control flow information.

    Context- and flow-insensitive algorithms for pointer analysis

    typically treat each statement of a program as a constraint. For

    example, on encountering an assignment

    1 x=y;

    where y is a pointer, such an algorithm simply notes that x may

    refer to any of the same objects that y may refer to. References(x) ⊇ References(y) is a constraint that is completely independent of

    the order in which statements are executed. A procedure call, in

    such an analysis, is just an assignment of values to arguments.
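
A constraint-based analysis of this kind can be sketched for the simplest case, copy assignments between pointer variables. The code is an illustrative simplification: it ignores loads and stores through pointers, represents each References set as a bitmask of abstract objects, and uses hypothetical names (copy_constraint, solve_points_to, example_refs_of_z).

```c
#include <assert.h>

/* The statement "dst = src" yields References(dst) ⊇ References(src). */
typedef struct { int dst, src; } copy_constraint;

/* refs[v] is a bitmask of the abstract objects that variable v may refer
 * to; the caller seeds it with directly assigned objects. Constraints are
 * applied in arbitrary order until no set grows, so neither statement
 * order nor control flow influences the result. */
void solve_points_to(unsigned refs[], int n_vars,
                     const copy_constraint cs[], int n_cs) {
    int changed = 1;
    while (changed) {
        changed = 0;
        for (int k = 0; k < n_cs; k++) {
            assert(cs[k].dst >= 0 && cs[k].dst < n_vars);
            assert(cs[k].src >= 0 && cs[k].src < n_vars);
            unsigned merged = refs[cs[k].dst] | refs[cs[k].src];
            if (merged != refs[cs[k].dst]) {
                refs[cs[k].dst] = merged;
                changed = 1;
            }
        }
    }
}

/* Example: y refers to object 0; the program contains z = x and x = y. */
unsigned example_refs_of_z(void) {
    enum { Y_VAR, X_VAR, Z_VAR, N_VARS };
    unsigned refs[N_VARS] = {0};
    refs[Y_VAR] = 1u << 0;             /* seed: y refers to object 0 */
    const copy_constraint cs[] = {
        { Z_VAR, X_VAR },              /* z = x; */
        { X_VAR, Y_VAR }               /* x = y; */
    };
    solve_points_to(refs, N_VARS, cs, 2);
    return refs[Z_VAR];
}
```

In the example z is found to refer possibly to object 0 even though z = x is listed before x = y, which is the imprecision a flow-insensitive analysis accepts.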

    Using efficient data structures for merging sets, some analyzers

    can process hundreds of thousands of lines of source code in a few

    seconds. The results are imprecise, but still much better than the

    worst-case assumption that any two compatible pointers might

    refer to the same object.
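
One data structure commonly used for merging sets is union-find (disjoint sets). The text does not say which structure or algorithm the analyzers use, so the following is only an assumed illustration: each assignment x = y merges the alias classes of x and y, which is coarser than the subset constraint but very cheap. The names and the fixed variable count are hypothetical, and union by rank is omitted for brevity.

```c
#include <assert.h>

#define N_VARS 8

static int parent[N_VARS];

/* Puts every variable in its own alias class. */
void alias_classes_init(void) {
    for (int v = 0; v < N_VARS; v++) {
        parent[v] = v;
    }
}

/* Returns the representative of v's class, shortening the chain as it goes. */
int find_class(int v) {
    assert(v >= 0 && v < N_VARS);
    while (parent[v] != v) {
        parent[v] = parent[parent[v]];  /* path halving */
        v = parent[v];
    }
    return v;
}

/* Models the assignment x = y by merging the two alias classes. */
void unify(int x, int y) {
    int rx = find_class(x);
    int ry = find_class(y);
    if (rx != ry) {
        parent[rx] = ry;
    }
}

/* Returns 1 if x and y have been placed in the same alias class. */
int may_alias(int x, int y) {
    return find_class(x) == find_class(y);
}
```

After alias_classes_init(), unify(0, 1), and unify(1, 2), variables 0 and 2 are reported as potential aliases while variable 3 remains separate.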

    The best approach to interprocedural pointer analysis will often lie

    somewhere between the astronomical expense of a precise,

    context- and flow-sensitive pointer analysis and the imprecision of

    the fastest context- and flow-insensitive analyses. Unfortunately,

    there is not one best algorithm or tool for all uses. In addition to

    context and flow sensitivity, important design trade-offs include

    the granularity of modeling references (e.g., whether individual

    fields of an object are distinguished) and the granularity of

    modeling the program heap (that is, which allocated objects are

    distinguished from each other).

    Chapter 13: Data Flow Testing

     1
     2  /* External file hex_values.h defines Hex_Values[128]
     3   * with value 0 to 15 for the legal hex digits (case-insensitive)
     4   * and value -1 for each illegal digit including special characters
     5   */
     6
     7  #include "hex_values.h"
     8  /** Translate a string from the CGI encoding to plain ascii text.
     9   * '+' becomes space, %xx becomes byte with hex value xx,
    10   * other alphanumeric characters map to themselves.
    11   * Returns 0 for success, positive for erroneous input
    12   * 1 = bad hexadecimal digit
    13   */
    14  int cgi_decode(char *encoded, char *decoded) {
    15    char *eptr = encoded;
    16    char *dptr = decoded;
    17    int ok = 0;
    18    while (*eptr) {
    19      char c;
    20      c = *eptr;
    21
    22      if (c == '+') { /* Case 1: '+' maps to blank */
    23        *dptr = ' ';
    24      } else if (c == '%') { /* Case 2: '%xx' is hex for character xx */
    25        int digit_high = Hex_Values[*(++eptr)];
    26        int digit_low = Hex_Values[*(++eptr)];
    27        if (digit_high == -1 || digit_low == -1) {
    28          /* *dptr='?'; */
    29          ok = 1; /* Bad return code */
    30        } else {
    31          *dptr = 16 * digit_high + digit_low;
    32        }
    33      } else { /* Case 3: All other characters map to themselves */
    34        *dptr = *eptr;
    35      }
    36      ++dptr;
    37      ++eptr;
    38    }
    39    *dptr = '\0'; /* Null terminator for string */
    40    return ok;
    41  }

    Figure 13.1: The C function cgi_decode, which translates a cgi-encoded string to a plain ASCII string (reversing the encoding applied by the common gateway interface of most Web servers). This program is also used in Chapter 12 and presented in Figure 12.1 of Chapter 12.

    13.2 Definition-Use Associations

    13.3 Data Flow Testing Criteria

    Various data flow testing criteria can be defined by requiring

    coverage of DU pairs in various ways.

    The all DU pairs adequacy criterion requires each DU pair

    to be exercised in at least one program execution, following

    the idea that an erroneous value produced by one statement

    (the definition) might be revealed only by its use in another

    statement.

    A test suite T for a program P satisfies the all DU pairs

    adequacy criterion iff, for each DU pair du of P, at least one

    test case in T exercises du.

    The corresponding coverage measure is the proportion of

    covered DU pairs:

    The all DU pairs coverage C_DUpairs of T for P is the

    fraction of DU pairs of program P exercised by at least one

    test case in T.
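
The coverage measure is a simple ratio once we know which DU pairs each test case exercises. The sketch below assumes that a tool has already recorded this information as one bitmask per test case and that the program has at most 32 DU pairs; the function name du_pairs_coverage is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* covered[t] has bit p set when test case t exercises DU pair p.
 * Returns the fraction of the n_pairs DU pairs exercised by at least
 * one of the n_tests test cases. */
double du_pairs_coverage(const uint32_t covered[], int n_tests, int n_pairs) {
    assert(n_tests >= 0);
    assert(n_pairs > 0 && n_pairs <= 32);
    uint32_t any = 0;
    for (int t = 0; t < n_tests; t++) {
        any |= covered[t];              /* union over all test cases */
    }
    int count = 0;
    for (int p = 0; p < n_pairs; p++) {
        if (any & (UINT32_C(1) << p)) {
            count++;
        }
    }
    return (double)count / n_pairs;
}
```

A test suite satisfies the all DU pairs adequacy criterion exactly when this function returns 1.0.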

    The all DU pairs adequacy criterion assures a finer grain

    coverage than statement and branch adequacy criteria.

    If we consider, for example, function cgi_decode, we can

    easily see that statement and branch coverage can be

    obtained by traversing the while loop no more than once, for

    example, with the test suite T_branch = {"+", "%3D", "%FG", "t"},

    while several DU pairs cannot be covered without executing

    the while loop at least twice.

    The pairs that may remain uncovered after statement and

    branch coverage correspond to occurrences of different

    characters within the source string, and not only at the

    beginning of the string.

    For example, the DU pair ⟨37, 25⟩ for variable *eptr can

    be covered with a test case TC_DUpairs = "test%3D" where the

    hexadecimal escape sequence occurs inside the input string, but not with "%3D". The test suite T_DUpairs obtained by

    adding the test case TC_DUpairs to the test suite T_branch satisfies

    the all DU pairs adequacy criterion, since it adds both the

    cases of a hexadecimal escape sequence and an ASCII

    character occurring inside the input string.

    One DU pair might belong to many different execution

    paths. The all DU paths adequacy criterion extends the all

    DU pairs criterion by requiring each simple (non-looping)

    DU path to be traversed at least once, thus including the

    different ways of pairing definitions and uses. This can

    reveal a fault by exercising a path on which a definition of a

    variable should have appeared but was omitted.

    A test suite T for a program P satisfies the all DU paths

    adequacy criterion iff, for each simple DU path dp of P,

    there exists at least one test case in T that exercises a path

    that includes dp.

    The corresponding coverage measure is the fraction of

    covered simple DU paths:

    The all DU paths coverage C_DUpaths of T for P is the fraction

    of simple DU paths of program P executed by at least one

    test case in T.

    The test suite T_DUpairs does not satisfy the all DU paths

    adequacy criterion, since the DU pairs ⟨37, 37⟩ for variable eptr and ⟨36, 23⟩ for variable dptr

    each correspond to two simple DU paths, and in both cases one of the

    two paths is not covered by test cases in T_DUpairs. The

    uncovered paths correspond to a test case that includes character + occurring within the input string (e.g., test case

    TC_DUpaths = "test+case").

    Although the number of simple DU paths is often quite

    reasonable, in the worst case it can be exponential in the size

    of the program unit. This can occur when the code between

    the definition and use of a particular variable is essentially

    irrelevant to that variable, but contains many control paths,

    as illustrated by the example in Figure 13.2: The code

    between the definition of ch in line 2 and its use in line 12

    does not modify ch, but the all DU paths coverage criterion

    would require that each of the 256 paths be exercised.
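
The figure of 256 paths can be confirmed by enumeration. The sketch below assumes eight independent if statements, one per bit as in the procedure of Figure 13.2, and records which of them are taken for every possible 8-bit value; branch_signature and count_distinct_paths are hypothetical names.

```c
#include <assert.h>

/* Returns a bit vector in which bit k is set when the k-th if statement
 * (the one testing ch & (1 << k)) would be taken for this value of ch. */
unsigned branch_signature(unsigned char ch) {
    unsigned sig = 0;
    for (int k = 0; k < 8; k++) {
        if (ch & (1u << k)) {
            sig |= 1u << k;
        }
    }
    return sig;
}

/* Counts how many different combinations of branch outcomes occur
 * over all 256 possible 8-bit inputs. */
int count_distinct_paths(void) {
    unsigned char seen[256] = {0};
    int distinct = 0;
    for (int v = 0; v < 256; v++) {
        unsigned sig = branch_signature((unsigned char)v);
        assert(sig < 256);
        if (!seen[sig]) {
            seen[sig] = 1;
            distinct++;
        }
    }
    return distinct;
}
```

Every input value follows its own combination of branches, so covering all of the paths would take one test case per value, that is, 2^8 = 256 test cases.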

     1  #include <stdio.h>
     2  void countBits(char ch) {
     3    int count = 0;
     4    if (ch & 1) ++count;
     5    if (ch & 2) ++count;
     6    if (ch & 4) ++count;
     7    if (ch & 8) ++count;
     8    if (ch & 16) ++count;
     9    if (ch & 32) ++count;
    10    if (ch & 64) ++count;
    11    if (ch & 128) ++count;
    12    printf("'%c' (0X%02x) has %d bits set to 1\n",
    13           ch, ch, count);
    14  }

    Figure 13.2: A C procedure with a large number of DU paths.

    The number of DU paths for variable ch is exponential in the

    number of if statements, because the use in each increment

    there exists at least one test case in T that exercises a DU

    pair that includes def .

    The corresponding coverage measure is the proportion of

    covered definitions, where we say a definition is covered

    only if the value is used before being killed:

    The all definitions coverage C_Def of T for P is the fraction of

    definitions of program P covered by at least one test case in

    T.

    13.4 Data Flow Coverage with Complex Structures

    Like all static analysis techniques, data flow analysis approximates the effects of program executions. It suffers imprecision in modeling dynamic constructs, in particular dynamic access to storage, such as indexed access to array elements or pointer access to dynamically allocated storage. We have seen in Chapter 6 (page 94) that the proper treatment of potential aliases involving indexes and pointers depends on the use to which analysis results will be put. For the purpose of choosing test cases, some risk of underestimating alias sets may be preferable to gross overestimation or very expensive analysis.

    The precision of data flow analysis depends on the precision of alias information used in the analysis. Alias analysis requires a trade-off between precision and computational expense, with significant overestimation of alias sets for approaches that can be practically applied to real programs. In the case of data flow testing, imprecision can be mitigated by specializing the alias analysis to identification of definition-clear paths between a potentially matched definition and use. We do not need to compute aliases for all possible behaviors, but only along particular control flow paths. The risk of underestimating alias sets in a local analysis is acceptable considering the application in choosing good test cases rather than offering hard guarantees of a property.

    In the cgi_decode example we have made use of information that would require either extra guidance from the test designer or a sophisticated tool for data flow and alias analysis. We may know, from a global analysis, that the parameters encoded and decoded never refer to the same or overlapping memory regions, and we may infer that initially eptr and dptr likewise refer to disjoint regions of memory, over the whole range of values that the two pointers take. Lacking this information, a simple static data flow analysis might consider *dptr a potential alias of *eptr and might therefore consider the assignment *dptr = *eptr a potential definition of both *dptr and *eptr. These spurious definitions would give rise to infeasible DU pairs, which produce test obligations that can never be satisfied. A local analysis that instead assumes (without verification) that *eptr and *dptr are distinct could fail to require an important test case if they can be aliases. Such underestimation may be preferable to creating too many infeasible test obligations.

    A good alias analysis can greatly improve the applicability of data flow testing but cannot eliminate all problems. Undisciplined use of dynamic access to storage can make precise alias analysis extremely hard or impossible. For example, the use of pointer arithmetic in the program fragment of Figure 13.3 results in aliases that depend on the way the compiler arranges variables in memory.

        #include <stdio.h>

        void pointer_abuse() {
            int i = 5, j = 10, k = 20;
            int *p, *q;
            p = &j + 1;   /* points just past j; what lives there is up to the compiler */
            q = &k;
            *p = 30;
            *q = *q + 55;
            printf("p=%d, q=%d\n", *p, *q);
        }

    Figure 13.3: Pointers to objects in the program stack can create essentially arbitrary definition-use associations, particularly when combined with pointer arithmetic as in this example.

    13.5 The Infeasibility Problem

    Not all elements of a program are executable, as discussed in Section 12.8 of Chapter 12. The path-oriented nature of data flow testing criteria aggravates the problem since infeasibility creates test obligations not only for isolated unexecutable elements, but also for infeasible combinations of feasible elements.

    Complex data structures may amplify the infeasibility problem by adding infeasible paths as a result of alias computation. For example, while we can determine that x[i] is an alias of x[j] exactly when i = j, we may not be able to determine whether i can be equal to j in any possible program execution.

    Fortunately, the problem of infeasible paths is usually modest for the all definitions and all DU pairs adequacy criteria, and one can typically achieve levels of coverage comparable to those achievable with simpler criteria like statement and branch adequacy during unit testing. The all DU paths adequacy criterion, on the other hand, often requires much larger numbers of control flow paths. It presents a greater problem in distinguishing feasible from infeasible paths and should therefore be used with discretion.

    Open Research Issues

    Data flow test adequacy criteria are close to the border between techniques that can be applied at low cost with simple tools and techniques that offer more power but at much higher cost. While in principle data flow test coverage can be applied at modest cost (at least up to the all DU adequacy criterion), it demands more sophisticated tool support than test coverage monitoring tools based on control flow alone.

    Fortunately, data flow analysis and alias analysis have other important applications. Improved support for data flow testing may come at least partly as a side benefit of research in the programming languages and compilers community. In particular, finding a good balance of cost and precision in data flow and alias analysis across procedure boundaries (interprocedural or "whole program" analysis) is an active area of research.

    The problems presented by pointers and complex data structures cannot be ignored. In particular, modern object-oriented languages like Java use reference semantics (an object reference is essentially a pointer), and so alias analysis (preferably interprocedural) is a prerequisite for applying data flow analysis to object-oriented programs.