
On the Application of Program Evaluation Designs: Sorting Out Their Use and Abuse

Ray C. Rist

Increasingly complex methodological options, as well as the growing sophistication of users, mean that the formulation of a research design prior to conducting an evaluation study is likely to be more demanding and time-consuming than previously. In fact, one of the most difficult problems in the entire evaluation endeavor is the development of an appropriate design. But the issue is not only one of complexity; it is also one of the appropriateness of the designs to the questions at hand. The concern of this article is with tightening the linkage between questions asked and answers given, making sure that the design organizes and directs the evaluation efforts to provide relevant information germane to the needs of the policymakers. By tightening this linkage, it is presumed that the findings from evaluation studies can gain increased legitimacy and use. Appropriate uses and abuses of seven program evaluation designs are analyzed, stressing designs that are most appropriate to the types of informational questions asked by policymakers.

"The most difficult evaluation problem is the development of an appro- priate evaluation design." So wrote Wholey in 1979. A decade later, that statement still rings true. Developing an evaluation design that is both technically adequate and useful is a challenge that has not diminished with years. Indeed, increasingly complex methodological oplions, as well as increasingly sophisticated users, mean that the formulation of a design is likely to be more demanding and more time consuming than previously. The evaluation community is not where it was ten years ago. What might well have passed as an appropriate and acceptable design strategy in the late 1970s cannot be presumed to have the same credibility in the late 1980s.

Ray C. Rist is director of operations in the general government division of the United States General Accounting Office. He was previously a professor at Cornell University and has authored or edited sixteen books and written nearly one hundred articles. He is chair of the Working Group on Policy Evaluation whose members prepared the articles for this special symposium.


But even with the elaborate development of present-day design alternatives, there remains a fundamental issue. Matching the appropriate design to the appropriate question has to be attended to in each and every evaluation effort. And while there are numerous texts, articles, and monographs on the ways to develop one or another evaluation design, there has been much less attention to the matter of appropriate application. Even as the evaluation community has progressed in its attention to matters of technical adequacy, including more multimethod, multisite evaluation designs, there has remained the matter of whether the design in place is appropriate to the questions that should be answered.

Stated succinctly, the concern of this article is with one aspect of evaluation utilization: that of tightening the linkage between questions and answers. It is my presumption, and the rationale for this article, that developing designs appropriate to the questions will increase the likelihood of user acceptance of the resultant information. I understand that such designs do not necessarily generate well-executed studies and trustworthy answers. But I do take it that appropriate designs are necessary, if not sufficient, for the production of relevant answers. And I take it further that policymakers are likely to trust and use relevant answers more readily than they will use suspect answers.

While it seems self-evident that the design should be responsive to the questions, it has been my experience that this linkage is less well understood than it should be. Attempting to generalize from individual case studies, positing cause-and-effect relations from process evaluations, and attempting to answer questions for which data in a meta-evaluation are not adequate are but three of the "abuses" I have noted time and again. I would stress here that the issue is not whether the methods employed in the studies were technically correct. They frequently were (and even if they were not, that is not germane to the point). The issue, rather, is whether sufficient thought and discernment went into choosing a design that was most appropriate to the questions. Choosing the wrong design and then doing it well happens more often than we care to admit. The end result is inappropriate information that is fair to neither the policymaker receiving the information (and who probably paid for it as well) nor to the program and all those who work within it.

In light of a continuing disparity between framing questions and developing the correct evaluation design to answer them, this article proposes a modest agenda. What follows is an effort to discuss appropriate use and potential abuse for each of the six general categories of evaluation design noted in the Evaluation Research Society Standards of 1982 (pp. 9-10). These six categories include: front-end analysis, evaluability assessment, formative evaluation, impact evaluation, program monitoring, and evaluation of evaluation. In addition, a seventh category of design, the case study, also will be included here. The end result should be a "first cut" at pulling together these seven design categories and examining each to determine what kinds of questions they are most appropriate to answer.


The discussion will not focus on the relative merits or strengths and weaknesses of each of these seven designs. Each has its appropriate contributions. The attention here is on when that contribution can be made correctly, based on the questions being asked.

Front-End Analysis

In the Evaluation Research Society Standards (1982, p. 9), front-end analysis is described as follows:

This includes evaluation activities that take place prior to the installation of a program: to confirm, ascertain, or estimate needs (needs assessments), adequacy of conception, operational feasibility, sources of financial support, and availability of other necessary kinds of support (for example, organizational). The result should provide useful guidance for refining program plans, determining the appropriate level of implementation, or deciding whether to install the program at all.

This design strategy (also known as pre-installation, context, or feasibility analysis) focuses essentially on whether there is a justification for putting a program in place. It asks a set of key questions about financial support, operational feasibility, how well the program has been conceptualized, and whether the implementation plan is appropriate. Having the information gathered by this approach, the policymaker is in a better position when making the "go-no go" decision on the new program. There are too many factors involved in this final decision (political support, media attention) to presume that the evaluation information should play the pivotal role for the decisionmaker. But it is also the case that carefully doing this up-front analysis can enhance the possibilities of the policymaker choosing wisely.

One additional data collection strategy that enhances efforts at front-end analysis is that of collecting and analyzing data from previous evaluations on the same or nearly similar topics. Since seldom does a problem or policy issue exist sui generis, the analysis task becomes one of learning what has been documented and analyzed in earlier evaluations. A successful application of this review of previous evaluations was conducted by the U.S. General Accounting Office in 1985. The subject was teenage pregnancy and the issue was what two senators had proposed as legislative strategies for addressing the issue. By examining previous evaluations of programs that had used (in approximate terms) one or another of the two strategies outlined in the proposed legislation, the GAO was able to provide relevant information to the two senators on the strengths and weaknesses of their respective legislative proposals.

Appropriate uses of front-end or feasibility analysis are of several types. One is to use the information to answer questions conceptually about the basic logic of the program. The focus on logic goes first to the question of whether the problem itself is being correctly analyzed. Front-end analysis can play an important and provocative role in problem definition and assessment of the causes of that same problem. For it is only by first defining the problem and its causes that one can move to the subsequent stage of beginning to think through appropriate responses. Identifying gaps in logic early on can contribute to the overall success of a program initiative. Addressing these matters before a program is operational is surely preferable to waiting.

A second set of questions concerning the logic of the program can also be addressed using front-end analysis. These questions focus on determining whether the various operating components of the program are clearly defined and whether they are sequenced correctly. By probing to learn if each sector of a program is sufficiently articulated (and logically so), as well as if the sectors or components are integrated in the appropriate sequence, the evaluator can test for overall program coherence. The focus is on the presumed pattern of implementation. If this is not clear in its logical progression, how much less the possibility that it will be clear in its execution.

A third appropriate use of front-end analysis is to juxtapose resource projections for a proposed initiative against actual usage on previous initiatives. Comparing what a proposed school-to-work transition program might cost in one state or district to what it actually did cost in another can give policymakers some important "reality checks." This effort to realistically assess the resource demands is critical for several obvious reasons. If the resources are not sufficient, then it is not likely that the program can deliver on its promises. If the resources are adequate in total, but misallocated within the program, again, the possibilities of success are diminished. If the resources are adequate for a particular period, but the research suggests that generating an "effect" from the program takes twice as long as the resources will permit the program to operate, sustained political support becomes an issue. And finally, the resource mix that goes beyond funds to include staff, facilities, transportation, or materials also needs to be carefully assessed in light of projected operational activities, anticipated effects, and what has been learned from other efforts in the same area. Working these assessments through carefully before the program is brought on line is an appropriate and useful strategy to enhance success.
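A simple way to see the kind of "reality check" described above is to put projected and actual figures on a common per-participant basis. The sketch below is only illustrative; the budget figures, participant counts, and the 20 percent warning threshold are hypothetical assumptions, not data from the article.

```python
# Hypothetical "reality check": compare the projected per-participant cost of a
# proposed program against actual costs observed in a comparable prior initiative.
# All figures below are illustrative assumptions.

proposed = {"total_budget": 1_200_000, "expected_participants": 800}
prior_actual = {"total_cost": 1_650_000, "participants": 750}

projected_unit_cost = proposed["total_budget"] / proposed["expected_participants"]
actual_unit_cost = prior_actual["total_cost"] / prior_actual["participants"]

print(f"Projected cost per participant: ${projected_unit_cost:,.0f}")
print(f"Actual cost per participant (prior program): ${actual_unit_cost:,.0f}")

if actual_unit_cost / projected_unit_cost > 1.2:
    print("Warning: projection is more than 20% below costs observed elsewhere.")
```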

The fourth contribution that front-end analysis can offer to policymakers is to tell them when there is no research base that allows comparison to other initiatives or programs. It is information such as this that can clarify the "risk factor" for the policymaker. It is one thing to say that previous efforts were or were not successful for one or more reasons and that, with careful attention, previous limitations can be systematically addressed. It is something else to say that no previous analysis is available and that the decision on one or more components of the proposed program is essentially a "judgment call." Alerting the policymaker to this reality can heighten attention to the risk in the proposal and at least increase awareness that choosing to go ahead is to do so with a sketchy or nonexistent road map. (It is in an instance such as this that the rationale for a pilot or demonstration program is especially strong, and one where the policymaker can make a move to address the issue without an "all or nothing" approach.)

The potential abuses of front-end analysis are also numerous. Perhaps the most frequent one occurs when comparisons to previous programs or policies are undertaken without recognition of the contextual differences between past efforts and present circumstances. While there is much to be learned from previous efforts, it is risky to assume that the past and present are identical. That it did not work when tried before does not mean that it will not work at present, if the populations are different, if the political support is greater, if the funding is longer and more stable. The rejection of present initiatives because they are similar (or appear to be) to previous unsuccessful efforts should take place only after careful comparisons. Otherwise, the logic of front-end analysis ceases to be useful in the policy arena. The issue is really one of appropriate comparisons and carefully drawn conclusions from those comparisons.

A second abuse comes when greater extrapolation takes place than is justified by previous evaluations. Policymakers and program officials can all rush to attribute causal consequences to programs when it is not clear that causes were isolated and identified. Programs and policies sometimes take on a political aura whereby careful examination of the actual consequences and actual impacts does not occur. On the presumption that "everyone knows the last program was successful," front-end analysis is omitted. Political acceptability and program effectiveness ought not be confused.

Front-end analysis cannot be of much assistance when positions are staked out prior to any questioning of data or of previous studies. Careful examination of the logic and rationale of a proposed program or policy is worthwhile when the end result of that assessment at least has the opportunity to be considered. But if the situation is one where either a "go" or a "no go" decision already has been made, this effort is for naught. What front-end analysis can provide to the policymaker who is open to considering the findings is an early indication of logical feasibility, implementation opportunities, needed resource levels, and the mine fields found in previous studies. Those are no small considerations. Well-crafted information at this stage of the process can be pivotal in the decision-making process.

Evaluability Assessment

Once the decision is made to begin or continue a program, a subsequent question arises of whether that program or policy as presently constructed can be evaluated. The focus on evaluability assessment has grown precisely because of the questions it asks about the plausibility and feasibility of program performance and whether that performance is amenable to change based on new information. If a policy is the result of the political process conducting a "ready, fire, aim" exercise, the possibilities for careful evaluation and subsequent program modifications cannot be taken for granted. The task of an evaluability assessment is to ask whether "program evaluation is likely to be useful in improving performance" (Wholey, 1979, p. 18).

The Evaluation Research Society Standards (1982, p. 9) define the category of evaluability assessment studies as follows:

This includes activities undertaken to assess whether other kinds of program evaluation efforts (especially impact evaluation) should be initiated. The emergence of evaluability assessment as a legitimate and distinctive enterprise represents a growing professional concern with the costs of evaluation in relation to their benefits, as well as with identifying the general characteristics of programs (significance, scope, execution, and so forth) that facilitate or hinder formal evaluation efforts. Evaluability assessment may encompass inquiries into technical feasibility (i.e., Can valid performance indicators be devised?), policy matters (i.e., Do program directors understand what kinds of information the proposed evaluation would produce? Is the funding agency's interest in the program likely to be short lived?), and, of course, the characteristics of the program itself (i.e., Has it in fact been installed?).

One scholar closely linked to the development of the evaluability assessment strategy is Joseph Wholey. He defines the strategy as one that undertakes a preliminary evaluation of a program's design in order to assure that three key standards are met (Wholey, 1979, p. 17). The three are:

• Program objectives are well-defined (i.e., those in charge of the program have defined program objectives in terms of specific measures of program performance, and data on those measures are obtainable at reasonable cost).

• Program assumptions/objectives are plausible (i.e., there is evidence that program activities have some likelihood of causing progress toward program objectives).

• Intended uses of evaluation information are well-defined (i.e., those in charge of the program have defined the intended uses of evaluation information).

It is possible to reformulate these three objectives into the appropriate questions that an evaluator or policymaker might ask of an evaluability assessment. The first appropriate question would be to ask that the design be used to test the logic of the causal assumptions that tie together in a coherent and sequential fashion the program inputs, implementation strategies, and the presumed outcomes or consequences. It is this test of the logic of the policy that gives the evaluability assessment one of its strongest and most appropriate uses. The degree to which an evaluability assessment is not able to trace the orderly sequence of assumptions and work through the underlying rationale of what the policy or program is supposed to accomplish and how, is the degree to which a policymaker or program manager ought to take pause. Every program or policy operates with a "model" of how the resources are to be used, what impacts are anticipated, and what the trade-offs are between anticipated outcomes and the costs to achieve them. This evaluation strategy questions present or proposed program frameworks as to their assumptions and expectations (i.e., is the model coherent?).

The second appropriate use of this strategy is to provide a "formal definition of that portion of the program for which realistic, measurable objectives and management uses of information have been defined . . ." (Wholey, 1979, p. 83). By again focusing on the logic of what is proposed, evaluators using this strategy can work to specify precisely what portions of a program or policy can be defined in measurable terms. If objectives are vague or deliberately couched in political cliches, then the likelihood of evaluating the outcomes is extremely low.

It is true that when programs or policies are first proposed, they are sometimes kept deliberately vague so as to attract support and avoid mobilizing counterforces. But there does come the time when funds run out and new authorizations and appropriations are needed. It is then that the absence of a coherent framework can be a significant handicap in gaining further support. The short-run gains of an umbrella policy that lets everyone come in out of the rain, but tosses aside the long-term solution of more permanent shelter, may soon be seen as no gain at all. What an evaluability assessment can appropriately do is carefully separate those portions that are clearly conceived and for which objective measures can be developed from those portions where this is not possible. The political and programmatic implications of such a distinction should be immediately evident.

The third appropriate use would be to build on these previous two and work to redesign or redraft proposals for the program manager or policymaker so as to tighten the linkage between components in such a way that there is, at least logically, a greater likelihood of achieving program objectives. Identifying changes that would enhance organizational communication, allow better targeting of services, build better controls over budgeting and resource allocation procedures, and better track short, intermediate, or long term outcomes are but four broad ways in which reanalysis can help programs get on track. By testing the logic of the causal assumptions (appropriate use #1), defining which portions of a program can or cannot be measured (appropriate use #2), and using this information to assess where changes are needed and strategies for doing so (appropriate use #3), the evaluability assessment can provide to managers that information base necessary to the constant testing of program intentions and direction. Indeed, the use of an evaluability assessment for a policy or program should not be thought to be a one-time-only event. Its constant reapplication provides the intellectual framework by which to consider mid-course corrections.


Among the abuses in the application of the evaluability assessment, several stand out as key. First, there is the effort to generalize and define program objectives from sampling or interviewing only a select tier of program managers and officials. One commonly accepted precept of program evaluation is that perceptions and objectives will vary depending upon the location in the organizational hierarchy. Those higher in the system may well look at the program and its objectives quite differently than will those closer to the actual delivery of services, for example. Thus, attempting to develop objectives and measures for a program from interviewing only those at the very top of the system is both incorrect and risky. The issue is not that consensus be reached by persons at all levels in the program, but that it be understood precisely what objectives and assumptions persons at all levels do, in fact, hold. The abuse in this area can come in one of two ways: the evaluator not taking the time to cover all tiers of the program, or those responsible for policy and program management presuming all views in the program are much like their own and thus the measurable objectives they favor are those favored by others. This is an instance where generalizing from the part to the whole is exactly what should not happen.

The second abuse of the evaluability assessment is to presume the logical model developed to predict how the program should operate can be translated into reality without modifications. "Rational systems" models tend to forget the dimensions of personal status, organizational morale, managerial preferences, political power, stakeholder groups both within and outside the organization, or patterns of discrimination that all impinge upon how "ideal" systems actually get implemented. Trying to force-fit a rational model onto a messy system of local preferences, entrenched groups, vested interests, different skills and abilities among staff, and varying levels of trust within the organization is not likely to succeed. Senior managers who push this strategy frequently create "lose-lose" situations. And when it happens, the search for whom to blame can soon catch almost everyone in its net: recalcitrant employees, "out of touch" managers, or middle managers fearful for their positions.

Finally, there is the situation when the answer to the question about the usefulness of further evaluation is based on incomplete or erroneous impressions and data. The evaluability assessment essentially necessitates a cost-benefit analysis on whether additional information would be helpful to program management and at what cost. Conclusions that further evaluation is not needed (when it is) or that it is needed (when it is not) are both wide of the mark and harmful to the program. These incorrect judgments may come because the evaluator is being too hasty in coming to closure about what program activities are taking place. Without taking the necessary time, the judgment on what could be gained from a full-scale evaluation is increasingly precarious. The same end result can come when the program manager believes that he or she already "knows" what is happening in the program and that the collection efforts can be shortcutted to save time and money. Either way, the abuse of failing to take the time to ensure adequate data is one that is both prevalent and preventable.

Formative Evaluation

Also known as process or developmental evaluation, formative evaluation focuses on identifying and understanding the internal dynamics of programs as they actually operate. As Patton notes (1980, p. 60), evaluations of this type focus on the following kinds of questions: "What are the factors that come together to make this program what it is? What are the strengths and weaknesses of the program? How are clients brought into the program and how do they move through the program once they are participants? What is the nature of staff-client interactions?"

The Evaluation Research Society defines this category of evaluation as follows (1982, p. 9):

This includes testing or appraising the processes of an ongoing program in order to make modifications and improvements. Activities may include analysis of management strategies and of interactions among persons involved in the program, personnel appraisal, surveys of attitudes toward the program, and observation. In some cases, formative evaluation means field-testing a program on a small scale before installing it more widely. The formative evaluator is likely to work closely together with program designers or administrators and to participate directly in decisions to make program modifications.

Formative evaluations focus not on the products of a program, per se, but on the internal dynamics of all that contributes to (or hinders) the production of that product or outcome. It is an evaluation strategy that examines how a program comes up with the results it does. Studying a program from this perspective implies careful attention to program implementation: how it is actually being shaped, defined, and operationalized in the day-to-day realities of the local setting. The interest is in anticipated as well as unanticipated consequences, formal as well as informal adaptations, public as well as private perceptions of program efficiency, the varying modes of decision-making, and what institutional memory exists in the various sectors of the program. Conducting such an evaluation benefits from close and continuous interaction with the program, its staff, and participants. It also suggests a flexibility in data collection strategies, all aimed at developing a clear and sensitive description of the processes by which the program operates (cf. Hargrove, 1985; Pressman & Wildavsky, 1984; Rist, 1981; Williams, 1982).

The first of the appropriate uses of this evaluation strategy allows those responsible for operating the program to understand the program in a detailed manner and thus make decisions on whether the program is operating as it ought. Note that this approach does not support estimates about the magnitude of program effects, though it can be used to support judgments about the direction of program effects and help explain why (Trend, 1978). What it also allows are judgments about whether the processes of the program can be improved, streamlined, or reformulated via an assessment of the strengths and weaknesses in the present procedures. Formative evaluation is (or at least can be) an important management tool for those responsible for the program. Gaining detailed information on just how the implementation is progressing, as well as the views of persons at all levels of the program on its current functioning, is invaluable to the manager attuned to proactive leadership.

A second proper use of formative evaluations comes when one or more programs are being considered for replication elsewhere. What this type provides to those responsible for the transporting of the program to other settings is a detailed understanding of the functioning of the original effort and which contextual (in contrast to conceptual) factors are more or less critical to overall program implementation. Formative evaluations address precisely those issues that can enhance the possibility of a model project or demonstration program being successfully transplanted.

A third appropriate use comes when the information generated by this approach can be used by managers who are developing and implementing a program in an area where there is little previous experience or information. Formative evaluations can be particularly helpful at present, for example, in developing community-based hospice care for AIDS patients, in developing strategies to deal with homelessness, or in formulating retraining strategies for workers who have lost jobs in the manufacturing sector. Continuous feedback loops are critical to formative or process evaluations for two reasons. First, they can allow program managers information about program implementation in almost real time, the gap being only the time it takes to go from data collection to data analysis to data reporting. Second, the fact that data are being generated in a steady stream can change the dynamic of how one manages. It is not only that data are known, but that they can be anticipated as a recurrent event in managing. Evaluations of this type can be a pivotal management tool when the issue being addressed by the program is complex and little understood, when the political sensitivities are high, or when the problem itself is a moving target. In this latter instance, careful tracking of the flexibility and adaptability of the program is necessary to ensure that the efforts of the program are in sync with the conditions of the problem.

When considering abuses, the primary error comes when efforts are made to answer questions about outcomes or results from the process data. Attributing causal impacts or consequences from exclusively descriptive data on program operations is incorrect. Such data do not allow by themselves for an assessment of outcomes nor for ruling out competing explanations as to how a program did or did not affect a target population. The description of processes is not to be equated with the measurement of products. But the temptation to do so is strong. Good data on how a program is working should be sufficient basis, one might assume, for making judgments about where all those program efforts are leading. Yet good implementation procedures (and well documented by a formative evaluation) do not ensure clear program impacts.

There are any number of reasons why a program may have little or no effect on the problem or population in question, even as it is well implemented, clearly managed, and organizationally crisp. Possible reasons might be: the program treatment was too weak to produce any discernible result; the program was not in place long enough; the problem changed in ways undetected by the program; or the target population changed in ways undetected by the program. But for whatever the reason, to assume that a description of processes can isolate cause and effect relations is simply wrong.

A second abuse, and nearly as frequent as the first, happens when the evaluation is short-circuited and conclusions are drawn on the basis of insufficient and inadequate contact. Avoiding the effort to become acquainted with and involved in the activities and culture of the program can have the detrimental consequence of not knowing the program well, of missing important changes as they occur, never establishing the necessary rapport with the staff and clients so as to understand their perceptions and judgments about the implementation of the program, and on and on. The urge to do a process or formative evaluation "on the cheap" is strong because of the frequent presumptions that programs do not change much once they are in place, staff and clients tend to do pretty much the same thing day after day, good managers can sense changes they need to make and need not wait for data, or the cost savings by doing less can be put to better use elsewhere. While each of these assumptions may be true at some time and for some programs, it nonetheless needs to be recognized that weakening the evaluation can only result in weakening the credibility and robustness of the information. As the formative evaluation erodes, the point is reached where the data are suspect and any inferences from them need to be avoided.

Impact Evaluation

The Evaluation Research Society provides a detailed discussion of this evaluation category (1982, pp. 9-10). It reads as follows:

This evaluation category corresponds to one of the most common definitions of evaluation, that is, finding out how well an entire program works. The results of impact evaluation (or of program results review, or similar terms used in some governmental settings) are intended to provide information useful in major decisions about program continuation, expansion, or reduction. The challenges for the evaluator are to find or devise appropriate indicators of impact and to be able to attribute types and amounts of impact to the program rather than to other influences. Some knowledge or estimate of conditions before the program was applied, or of conditions in the absence of the program, is usually required. Impact evaluations differ in the degree to which the search for appropriate indicators goes beyond the stated objectives or expectations of the program formulators, directors, funders, or other sponsors of the evaluation. However, there is rather substantial agreement that the more independent the evaluator is, the more credible the results of the impact evaluation will be, so long as the people who manage, oversee, or influence the program are reflected in the evaluation. Achieving a balance among potentially conflicting criteria will be a continuing challenge.

As noted in the ERS statement, the impact evaluation category is what most frequently comes to mind when the suggestion of conducting an evaluation is made. Evaluations of this type go to what many persons interested in the program most want to know: the bottom line of whether it has or has not made a difference. Impact evaluations tend to be held up as the "ideal type" of evaluation to conduct because of the emphasis on cause-and-effect analysis. Whether the actual design uses a true experimental design or a quasi-experimental design (e.g., nonequivalent control group design, interrupted time-series design, recurrent institutional cycle design, or a regression-discontinuity design), these designs focus on isolating in as rigorous a fashion as is possible the causal factors that determine program impact (cf. Rezmovic, 1979, pp. 165-166). These designs, as a group, tend to be those most attuned to addressing issues of validity, randomization, careful measurement, attrition in either the experimental or control group, and other such quantitative methodological considerations (cf. Cook & Campbell, 1979; Rieken & Boruch, 1974; Rossi & Freeman, 1982; U.S. General Accounting Office, 1982, 1984).

Given the relatively explicit canons for conducting studies employing experimental or quasi-experimental designs, the agreement on uses and abuses is probably greater for this category of evaluation than for any of the seven discussed in this article.

First among appropriate uses is to employ this design to study program outcomes if the objectives of the program are sufficiently clear. It makes no sense to use this approach if the goals and objectives are ill-defined, obtuse, or deliberately couched in political rhetoric. To assess an impact necessitates the ability to specify precisely just what is supposed to happen. If a program has no clear objectives, there is no possibility of determining effects, for effects are only relevant in light of the objectives. This is not to argue that the program will have no effects, even with ill-formulated objectives. It is only that there will be no systematic logic as to what effects emerge. Distinctions between anticipated and unanticipated effects will be meaningless. Likewise, the researcher will not be able to measure them, for they could not be anticipated and thus no baseline data could be collected.

It is also necessary that sufficient time have passed before measurements are made to ascertain possible effects. Starting a program one week and measuring for effects the next day may sound ludicrous (and it is), but the point is made: proper use of the design necessitates a sufficient time interval before measuring for impacts. (How long that proper time will be has to vary from program to program and should be determined in conjunction with the program designers.) The stronger the presumed impact of the program, the easier it should be to detect and measure. Alternately, weak program efforts may produce no effects or else effects too modest to measure. The design will not be able to capture what is not there or is too faint to detect.

This last point leads to a second appropriate use. Impact evaluations are useful if the design has built into it sufficient statistical power to detect effects. That is, sample size, statistical significance level, and control over the experimental variables must be sufficient to discern program effects, given that the treatment, as noted above, is sufficient to generate effects. When program effects are anticipated to be large, the situation is a bit easier for the evaluator in that the choice of statistical measures is larger. But when the effect is known to be (or at least thought to be) weak, then the choice of statistical measures becomes extremely critical. Failure to select appropriately can determine whether or not the program is thought a success, for example. Using this design when impacts cannot be specified, or when they are so weak that detecting them necessitates large samples and highly reliable measures at a time and cost out of proportion to the importance of the question, generates two instances when it should be avoided.
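To make concrete why weak anticipated effects drive up the demands on sample size, here is a minimal sketch of a conventional two-group power calculation under a normal approximation with equal group sizes. The effect sizes, significance level, and power target are illustrative assumptions, not figures from the article.

```python
# Approximate sample size per group needed to detect a standardized mean
# difference d with a two-sided test (normal approximation, equal group sizes).
# The effect sizes, alpha, and power below are illustrative assumptions.
from scipy.stats import norm

def required_n_per_group(d, alpha=0.05, power=0.80):
    z_alpha = norm.ppf(1 - alpha / 2)  # critical value for the two-sided test
    z_beta = norm.ppf(power)           # value corresponding to the desired power
    return 2 * ((z_alpha + z_beta) / d) ** 2

for d in (0.8, 0.5, 0.2):  # strong, moderate, and weak presumed program effects
    print(f"effect size {d}: roughly {required_n_per_group(d):.0f} per group")
```

Under these assumptions, the weak effect requires roughly sixteen times the sample of the strong one, which is the point made above about time and cost growing out of proportion to the importance of the question.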

A third appropriate use of an impact design would be when a randomized assignment strategy can be used to distribute participants between experimental and control groups. Use of this design can help detect effects from the program, if a sufficient spread of variables is measured. The randomization becomes, then, an important consideration in the effort to wash out differences between groups, save for whatever impacts the program might have. Thus the experimental or impact design can strive to factor out, via randomization, differences among individuals or groups prior to the program and focus instead on whatever measurable impacts appear after the program is in place.
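As a small illustration of the mechanics of random assignment described above, the sketch below shuffles a pool of eligible participants and splits it evenly into experimental and control groups; the participant identifiers and pool size are hypothetical.

```python
# Minimal sketch of randomly assigning eligible participants to experimental
# and control groups. Participant IDs and the pool size are hypothetical.
import random

participants = [f"P{i:03d}" for i in range(1, 101)]  # 100 eligible applicants
random.seed(42)                                      # reproducible assignment
random.shuffle(participants)

midpoint = len(participants) // 2
experimental_group = participants[:midpoint]   # receive the program
control_group = participants[midpoint:]        # do not receive the program

print(len(experimental_group), "assigned to the program;",
      len(control_group), "held as controls")
```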

In formulating the inappropriate uses of effectiveness or impact designs, one that has appeared many times is to incorrectly infer that the findings of an effectiveness evaluation can be taken at face value to determine what caused the noted program effects. Measuring a program as to its effectiveness, and finding it, is not the same as determining what variable or variables caused the effect. Determining what caused the effects is not the same as measuring those effects. The causes may be multiple for any one noted effect. What this design cannot accomplish on its own is to sort out from among the competing explanations which aspect(s) of the program causes the measured effects. Competing explanations of the findings must be examined and ruled out before cause-effect conclusions can be drawn.

A second inappropriate use involves the extrapolation of the findings from one population or one setting, however well measured, to a larger universe of populations or settings. The potential for variation or differences between what was measured at a single site and other unstudied sites is simply too great to allow for such generalizability. The issue here is not whether the study at the single site was or was not well conceived and executed. Rather, it is whether one can presume to generalize the findings from this site to any other site or sites. And without knowing the characteristics of those other sites as to their similarities and differences to the study site, such extrapolation is not warranted.

The third abuse noted here involves attempting to develop a measurement strategy that does not involve a good comparison group. Frequently, it is assumed that establishing good baseline data and then measuring that same group again at the end of the program will be sufficient to determine cause-and-effect relations. It is not, for the design does not account for any of the multiple other influences besides the program intervention that could have caused the measured impact. The absence of a control group makes any such conclusion extremely tenuous and open to vigorous challenge. The before-after comparison of outcomes for only the experimental group is simply not a strong enough design to allow the inference of cause. A partial solution (and it is just that) comes when the evaluation design can incorporate a nonequivalent comparison group, so long as pre-existing differences are statistically adjusted for prior to the beginning of the program. If not, the pre-existing differences and the program effects are confounded and no statement of cause and effect is then possible.
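One simple way to see the difference a comparison group makes is to contrast a naive before-after estimate with a difference-in-differences adjustment that nets out the comparison group's own trend. This is only one form of the kind of statistical adjustment mentioned above (covariance adjustment of pre-existing differences is another), and all of the scores in the sketch are hypothetical.

```python
# Contrast a naive before-after estimate with a simple difference-in-differences
# adjustment using a nonequivalent comparison group. All scores are hypothetical.

program_pre, program_post = 48.0, 60.0        # mean outcome, program group
comparison_pre, comparison_post = 45.0, 52.0  # mean outcome, comparison group

naive_effect = program_post - program_pre     # attributes the whole change to the program
adjusted_effect = (program_post - program_pre) - (comparison_post - comparison_pre)

print(f"Before-after change in the program group: {naive_effect:+.1f}")
print(f"Change net of the comparison group's trend: {adjusted_effect:+.1f}")
```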

Program Monitoring

This evaluation strategy is available to program managers and policymakers once there is, in fact, a program in place. The ability to track the on-going performance of a program is essential for any number of rather obvious reasons. Budgetary considerations, consequences for clients, efficiency of operation, controls over funds, quality of services, and operating in compliance with policy are but six that can be immediately noted. Program monitoring can be more or less systematic, more or less continuous, more or less comprehensive, and more or less formal. Intuitively, program managers probably do this all the time. The checklist may not be written, but there is some set of concerns or measures that managers believe they need to be on top of in order to "know" how their program is doing. What program monitoring offers as a strategy of evaluation to the manager or policymaker is a way of tracking the program in accord with his or her objectives, using a set of measures that are accepted as reflective of program direction and consequence.

The Evaluation Research Society Standards define program monitoring as follows:

This is the least acknowledged but probably most practiced category of evaluation, putting to rest the notion that the evaluator necessarily comes in, does the job, and then gets out. From the GAO to human service agencies in states and provinces to military training installations, there are substantial requirements to monitor programs that have already been installed, sometimes long ago. These programs may or may not once have been the subject of front-end analysis, process evaluation, impact evaluation, and perhaps even secondary evaluation. The kinds of activities involved in these evaluations vary widely, ranging from periodic checks of compliance with policy to relatively straightforward tracking of services delivered and counting of clients. Program monitoring may include purposes or results found also under other evaluation categories; for example, it may involve serious re-examination of whether the needs the program was originally designed to serve still exist, or it may suggest system modification, updating, or revitalization.

Wholey (1979, p. 117) defines program monitoring in much the same way:

Performance monitoring is the periodic measurement of progress toward program objectives. Performance monitoring systems clarify program objectives and important side-effects, provide current information that compares program performance with prior or expected performance, and focus program activities on achieving progress toward program objectives. As with other types of evaluation, those in charge can use the resulting information to maintain or change program activities or objectives.

In the context of these two definitions, a number of appropriate uses can be made of program monitoring as an evaluation strategy. First, program monitoring can be used correctly to answer the fundamental question of whether the program is still needed in its current form. While I suspect this is not a frequently posed question, it is nonetheless one that good monitoring can directly address. Programs ought not to assume nor are they promised immortality. They began at some point and will end, sooner or later. Purposefully assessing programs to ask if that end is in sight may not be a popular task, but it is one that should not be ignored. Stated differently, program monitoring can also be used to help justify the continuation of a program because of its documented impacts on achievement of stated objectives.

A second question (and it really is a cluster of questions) appropriate to program monitoring focuses on program efficiency. In this area, program managers can gain information on the degree to which program recipients are receiving the intended services and the extent of those services. Asking questions about staffing levels, staff expertise and experience, staff commitment, and staff retention are but four aspects of staff efficiency that program monitoring can address. In this same area, questions about resource allocation can be posed (e.g., What levels of resources are being devoted to the program in terms of direct services versus overhead, what resources are actually transferred to clients versus those that go into services, and have variations in resource levels impacted upon program services?). The GAO has recently issued a report (GAO, 1988) that addresses these questions for a number of children's programs funded by the federal government. Still other questions in this area can focus on whether there are deleterious side effects that hinder or negate positive aspects of the program. The classic instance comes to mind here of increased drug education programs having the positive impact of increased knowledge and the negative impact of heightening the interest and willingness of youth to experiment with the drugs they now know about.

A third set of appropriate questions to pose with program monitoring focuses on the matter of program management. Here there are questions about efficiency, adequate controls, compliance with policy, and management guidance. This cluster of "internal" program issues is vital, but often overlooked in assessing the status of a program.1

One key inappropriate use of program monitoring occurs when the analyst takes a few, relatively minor deficiencies found in a program and magnifies them out of proportion to their actual significance. This may occur because of the general assumption that all programs have deficiencies and any report that minimizes deficiencies has "gone soft" on the program. Such inappropriate use of the findings means that more attention is focused on what is wrong with the program, minor as these "wrongs" might be, than on what is right and correct with the program. This issue, in the end, is a matter of balance, and when analysts tilt too far in either direction, the program and the policymaker are not well served.

A second abuse comes when the evaluator responsible for the program monitoring does not get a priori agreement on the key program characteristics and how they are to be measured. This lack of agreement on what is most important can result in: the evaluator trying to "guess" the important factors; taking vague guidance from the program managers and then struggling to operationalize what cannot be operationalized; settling for a set of indicators that are noncontroversial but weak in terms of measuring program impacts; or continually changing the indicators whenever a new goal for the program is announced by the manager. Failure to achieve agreement on the front end means that the entire effort is open to different interpretations, assumptions, and uses when the monitoring data start to arrive.

A third area of abuse comes when the evaluator fails to distinguish between reliable information that indicates a program is or is not working and the absence of adequate information, which means one cannot say one way or the other as to program performance. The evaluator is under pressure to be able to say something about the program, once the monitoring processes are in place. The managers and the policymakers are interested in whether or not the program is implemented as intended, operating as it should, and impacting the appropriate target audience.

Answering these concerns means that the evaluator should be relying on a set of reliable measures that are tracking those aspects of the program in question. The operant word here is "reliable." It is the task of the evaluator to establish data collection systems that will produce reliable data. If data are not reliable, they should not be reported, or should be strictly caveated. If the data are reliable and say that a program is not working as it should, those are important data and can be reported. But not having the data means that nothing can be said. Lacking adequate data means that the question remains open and unanswered. The abuse comes when the evaluator does not have adequate data, feels the pressure to say something, and stretches what data are available beyond what can be correctly inferred.

The final broad area of abuse in program monitoring comes because of a lack of sufficient attention to the quality of the data bases necessary for adequate monitoring. The monitoring can be no better than the data bases on which it relies. Failure to attend to the data bases can lead to their deterioration and thus provide increasingly suspect or unreliable data. This evaluation strategy, more than any other, depends upon the continual generation of quality data over time. Sustaining data bases takes time, attention, and resources. Failure to address all three of these needs increases the vulnerability of the evaluator in his or her capacity to speak to the direction and impacts of the program.

Evaluation of Evaluations

Situations arise in evaluation work when it is time to pose the question, "What have others found from looking at this same policy or program?" In short, the question becomes one of pushing the evaluator to synthesize and summarize what information has been generated over time by other evaluators. The General Accounting Office faced this situation in 1984 when the Senate asked GAO to summarize sixty-one identified evaluations of the Women, Infants, and Children (WIC) nutrition program. In contrast to conducting yet another impact evaluation, the GAO pulled together and assessed all existing evaluations on the WIC program. GAO then summarized for the Congress, in the aggregate, what these evaluations suggested about the effectiveness of the program.

This strategy, of evaluating evaluations, has been categorized by the Evaluation Research Society as the sixth broad class of evaluation designs. The ERS wrote the following about this category (1982, p. 10):

These activities are applied most frequently to impact evaluations and are stimulated by various interests, such as scholarly investigation, requirements of agencies in coordination or oversight roles, unwillingness of the evaluator to accept the original evaluation results, or interest in the after-effects of the evaluation on the program. Evaluations of evaluations may take a variety of forms, ranging from professional critiques of evaluation reports and procedures to reanalysis of original data (sometimes with different hypotheses in mind) to collection of new information. In the case of programs that generate wide public interest (for example, Headstart and veterans' programs), secondary evaluators may examine the results of a number of different evaluations (including evaluations of program units and components) in order to estimate overall impact.

This strategy, which is also known as evaluation synthesis, meta-evaluation, and secondary evaluation, can be used appropriately in a number of ways to answer different evaluation questions. First, by pulling together the known studies on a common topic, the evaluator can address relatively generic questions such as: "Does the program work?" "For whom does the program work best?" "Under what conditions does the program work best?" Answers to these questions go beyond what could be ascertained from an evaluation of an individual program. The evaluation of evaluations allows greater scope in the analysis and also greater generalizability than can be achieved with one study. If the studies are sufficiently rigorous, then the cross-study evaluation can also produce greater confidence that the direction of the findings is correct.

A second appropriate use for this strategy is to ask questions about where research findings are known and also where there are gaps in the information. This ability to identify gaps, based on the cross-study assessment of areas of inquiry and findings, can provide the decisionmaker with important information about those objectives where data are available and those where they are not. Identifying the gaps can have several repercussions, including: calls for new research in these areas; reducing funding in a specific area until information is gathered; restructuring objectives and stated goals in the legislation and in the statements of program mission; or choosing to continue, recognizing the risks in not knowing what impacts, if any, the effort is having. Knowing what is not known is important information to program managers and policymakers.

A third appropriate use of this approach is to "level the playing field" among studies so that disparate findings can be compared against a common metric or set of criteria. It is frequently the case that decisionmakers confront conflicting results from different studies. Some studies may show very positive results from the program, and other studies may show equally strong negative results. The question becomes what to make of such research. (Here the decisionmaker may put a pox on both houses and trust personal judgment.) But the evaluator has, via the evaluation of evaluations strategy, an option to present to the decisionmaker: take all the studies, work to find common denominators among them, and then compare those measures or criteria that are applicable across studies. Such an approach necessitates a set of strategies that account across studies for such variations as different sample sizes, different locales, different ways in which variables were operationalized, and different statistical or qualitative treatment of the data. But there are a number of texts that address these procedures and can help the evaluator to find the common ground for comparison (cf. Boruch et al., 1981; Glass et al., 1981; Green & Hall, 1984; Hedges, 1984; Light & Pillemer, 1984; and the U.S. General Accounting Office, 1983).
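To make the idea of a common metric concrete, the brief sketch below, which is purely illustrative and not drawn from any of the studies cited, shows one widely used procedure from the meta-analysis literature: converting each study's reported means, standard deviations, and sample sizes into a standardized mean difference and then pooling the study-level estimates with weights that favor the larger, more precise studies. The three sets of summary statistics are hypothetical.

    import math

    # Hypothetical summary statistics from three evaluations of the same program:
    # (treatment mean, control mean, pooled standard deviation, treatment n, control n)
    studies = [
        (52.0, 48.0, 10.0, 120, 115),
        (50.5, 49.0, 12.0,  60,  58),
        (47.0, 48.5,  9.0, 200, 190),
    ]

    weighted_sum = 0.0
    total_weight = 0.0
    for m_t, m_c, sd, n_t, n_c in studies:
        d = (m_t - m_c) / sd                # standardized mean difference for this study
        # Approximate sampling variance of d; smaller for larger studies.
        var_d = (n_t + n_c) / (n_t * n_c) + d ** 2 / (2 * (n_t + n_c))
        weight = 1.0 / var_d                # inverse-variance weight
        weighted_sum += weight * d
        total_weight += weight

    pooled_d = weighted_sum / total_weight  # fixed-effect pooled estimate
    pooled_se = math.sqrt(1.0 / total_weight)
    print("Pooled effect size: %.3f (standard error %.3f)" % (pooled_d, pooled_se))

The inverse-variance weights are what account for the different sample sizes: the larger, more precise studies pull the pooled estimate toward their own results more strongly than the smaller ones do.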

The inappropriate uses of the evaluation of evaluations strategy also cluster around several general errors. First is the matter of generalizability. Here the abuses are of several types. It is incorrect to use the findings based on a synthesis of studies addressing one unit of analysis (e.g., the performance of individual students) to generalize to another unit of analysis (e.g., the individual school or the school system). Another incorrect application of generalization comes when one works in the reverse direction from that just noted (i.e., trying to specify findings for a particular program site based on a synthesis of the aggregate results). In either instance, the shifting of the focus from one unit of analysis to another should be avoided. Another abuse of the generalizability issue occurs when the effort is made to go beyond the scope and quality of the evaluations used in the study. This is a classic problem and not unique to this evaluation strategy. When a number of studies done by different researchers in various locales are brought together, the tendency is to speak broadly about "the program" while the data allow comments about only portions of that same program.

It is also inappropriate for those conducting the evaluation of evaluations to use only published material. The "publication bias" of looking only at material that has been printed in journals or books skews the selection toward studies with statistically significant findings, the kind of findings journals are more likely to publish. Failure to also track down those reports or papers that have not been published limits the credibility of the subsequent analysis.
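The size of this bias is easy to demonstrate with a small, entirely hypothetical simulation, sketched below: if journals print mainly the statistically significant results, then averaging only the published estimates can make a modest program effect look several times larger than it actually is. The true effect, standard error, and significance cutoff used here are assumptions chosen only for illustration.

    import random
    import statistics

    random.seed(1)

    TRUE_EFFECT = 0.10   # assumed true program effect
    SE = 0.15            # assumed standard error of each study's estimate
    N_STUDIES = 500

    # Simulate the estimates that 500 independent evaluations would report.
    all_estimates = [random.gauss(TRUE_EFFECT, SE) for _ in range(N_STUDIES)]

    # "Published" studies: only those reporting a statistically significant
    # positive effect (estimate more than 1.96 standard errors above zero).
    published = [e for e in all_estimates if e > 1.96 * SE]

    print("Mean of all studies:       %+.3f" % statistics.mean(all_estimates))
    print("Mean of published studies: %+.3f  (n = %d)"
          % (statistics.mean(published), len(published)))

Under these assumed numbers, the handful of studies that clear the significance threshold report, on average, an effect several times larger than the true one, which is precisely why the unpublished reports and papers need to be tracked down.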

Two other inappropriate uses merit attention. It is not correct to apply nonuniform procedures or criteria across studies. The strength of the approach comes in finding those metrics that allow for a common analysis across studies. Diluting that approach with nonuniform criteria essentially eliminates the logic of the synthesis and results in a number of smaller, nonparallel assessments of subsamples. Yet another incorrect approach is to perform a synthesis based on studies with primarily weak designs. Clustering a group of studies with weak designs will not make for strong findings, even if the number of studies is large. Weak designs simply mean too many caveats and too many unknowns to allow for robust judgments about the impacts of the program in question.

Case Study

It is in this seventh and last category of evaluation strategies that this paper moves beyond the framework provided by the Evaluation Research Society. The ERS standards of 1982 did not create a separate category for case studies. Indeed, there is little mention of them. But in the intervening years, the case study strategy of evaluation has gained wide acceptance and application. It thus merits attention and inclusion here.

In what is perhaps the most definitive treatment of the case study as an evaluation strategy, the GAO issued a Transfer Paper in 1987 entitled Case Study Evaluation. The definition of "case study" used in this GAO paper reads as follows (GAO, 1987, p. 9):

A case study is a method for learning about a complex instance, based on comprehensive understanding of that instance obtained by extensive description and analysis of that instance taken as a whole and in its context.

Having established a generic definition, the GAO paper then proceeds to develop various specific applications of the case study method in research and in evaluation. Noting that the traditional case study comes from the academic disciplines of sociology and anthropology, those who would adapt the strategy to evaluation have been faced with reframing the questions, developing new data collection strategies and modes of analysis, and working to hone new procedures for reporting the findings. Consequently, the evaluation community since the early 1970s has fostered growth in the uses of the case study as an evaluation strategy, all the while recognizing that the development is iterative.

The GAO paper represents an important consolidation and assessment of where the case study approach has come since the differentiation between its application in research and in evaluation began to take root. Indeed, the development of the evaluation branch is such that the GAO paper was able to distinguish six different applications of the case study method to evaluation. The six are (GAO, 1987, p. 1):

1. Illustrative. This case study is descriptive in character and intended to add realism and in-depth examples to other information about a program or policy.

2. Exploratory. This is also a descriptive case study but is aimed at generating hypotheses for later investigation rather than illustration.

3. Critical instance. This examines a single instance of unique interest or serves as a critical test of an assertion about a program, problem, or strategy.

4. Program implementation. This case study investigates operations, often at several sites, and often normatively.

5. Program effects. This application uses the case study to examine causality and usually involves multisite, multimethod assessments.

6. Cumulative. This brings together findings from many case studies to answer an evaluation question, whether descriptive, normative, or cause-and-effect.


Appropriate use of the case study comes when the evaluator decides just what type of evaluation is requested and then examines each of the six alternatives to find the best match. This assessment includes weighing the strengths and weaknesses of each approach as well as making sure that the methodological requirements necessary to adequately complete the study can be met. This latter point is worth stressing. It is one thing to have correctly identified the type of case study evaluation that matches the question. It is something else to then make sure that site selection, the number of sites needed, access to key persons and settings, and the necessary records are all in place. The selection of sites has to be done with care and deliberation, for a pivotal contribution of the case study comes from the settings that are studied. Judgmental sampling, as it were, is critical in the application of this strategy to evaluation.

A second appropriate use arises when any one of the six approaches listed can be combined with other evaluation strategies to build a better understanding of the issue at hand. Linking case studies with survey research data, for example, becomes an important way in which each of these two strategies can enhance and expand the application of the other (cf. Sieber, 1973). In so doing, the combination gives the evaluator a more complete data base with which to analyze the question. Finding ways of creating such linkage has not been an easy task. But for the evaluator who is willing to explore the possibilities of multimethod approaches to the question, the inclusion of a case study component can be a worthy contribution.

Abuses of the correct application of the case study evaluation strategy come in several ways. First, there is the matter of case selection alluded to earlier. If the case selection is inappropriate for the purpose of the study, the credibility of the study is severely eroded. The credibility of the study is also compromised when too few sites are selected to adequately answer the question. Conversely, problems arise when too many sites are selected and the particular focus of the study becomes blurred by the addition of data from unnecessary sites.

Second, there is the abuse that arises when evaluators either over- or under-generalize from the data collected, given the questions to be addressed and the basis for site selection. Stated differently, evaluators face an important challenge in making sure that the level of generalization they offer from their analysis is appropriate to the question they are answering, the type and number of sites selected for the study, and the level of inference required by the design.

Third, there is the concern with whether the appropriate types and modes of evidence were all employed as they should have been in the study. Attempting to answer the study question with less evidence than is available or necessary results in conclusions that simply cannot be considered as strong as they might be. If the effort is undertaken with the clear intent of shortcutting the necessary data collection, it is an abuse of the method. For it is precisely the in-depth and intensive data collection effort at only one or a few sites that so distinguishes the case study approach and gives it its strengths in comparison to other evaluation designs.

Postscript

In looking across the seven evaluation strategies discussed in this article, it is apparent that considerable trust has to be placed in the evaluator to conduct the work in an honest and ethical fashion. Whether a study ends in appropriate use or in abuse is determined in large part by the evaluator. It is probably fair to say that the pressures toward abuse are just as great as those toward doing things correctly. Pressures of funding, short time lines, keeping good relations with respondents, and the inevitable problems of data collection and analysis can all lead to suboptimizing, if not outright distortion, of the evaluation effort.

This paper has addressed one aspect of evaluation work where evaluators can enhance the quality of their individual studies and of the craft in general. By making sure they take the time to work through their understanding of the question they must answer, they will be in a much better position to select the appropriate evaluation design. Superficial attention to the matter of the evaluation design can undercut all good intentions of conducting a careful study. There are some right ways of answering evaluation questions and there are some wrong ways. This article has sought to point these out.

Notes

The views expressed here are those of the author and no endorsement by the United States General Accounting Office is intended or should be inferred. Special thanks and acknowledgment must go to the following persons at the GAO who provided important materials and analysis: Scott Crosse, Lois-ellin Datta, Harriet Ganson, Eleanor Johnson, Douglas Longshore, and Eva Rezmovic.

1. Program evaluators generally are not accustomed to focusing on matters of internal controls, for these are not typically the areas discussed in behavioral or social science training. They also are not the areas in which one can find publication opportunities in the program evaluation journals. Consequently, there is a continuing void in much of the training and reporting on internal controls of ongoing programs when the work is being conducted by social scientists. Alternatively, when such examinations are conducted by persons with training in auditing or financial management, these areas are front and center in what they would assess as dimensions of program management.

References

Boruch, R., Wortman, P., Cordray, D., et al. (1981). Reanalyzing program evaluations. San Francisco: Jossey-Bass.

Cook, T.D., & Campbell, D.T. (1979). Quasi-experimentation: Design & analysis issues for field settings. Chicago: Rand McNally.


Evaluation Research Society. (1982). Standards for evaluation practice. San Francisco: Jossey-Bass.

Glass, G.V., McGaw, B., & Smith, M.L. (1981). Meta-analysis in social research. Beverly Hills, CA: Sage.

Green, B., & Hall, J. (1984). Quantitative methods for meta-analysis. Annual Review of Psychology, 35, 37-53.

Hargrove, E.C. (1985). The missing link: The study of the implementation of social policy. Washington, D.C.: The Urban Institute.

Hedges, L. (1984). Advances in statistical methods for meta-analysis. New Directions for Program Evaluation, 24, 25-42.

Light, R., & Pillemer, D. (1984). Summing up: The science of reviewing research. Cambridge, MA: Harvard University Press.

Patton, M.Q. (1980). Qualitative evaluation methods. Beverly Hills, CA: Sage.

Pressman, J.L., & Wildavsky, A. (1984). Implementation (3d Ed.). Berkeley, CA: University of California Press.

Rezmovic, E.L. (1979). Methodological considerations in evaluating correctional effectiveness: Issues and chronic problems. In L. Sechrest, S.O. White, & E.D. Brown (Eds.), The rehabilitation of criminal offenders: Problems and prospects. Washington, D.C.: National Academy of Sciences.

Riecken, H.W., & Boruch, R.F. (1974). Social experimentation: A method for planning and evaluating social intervention. New York: Academic Press.

Rist, R.C. (1981). Earning and learning: Youth employment policies and programs. Beverly Hills, CA: Sage.

Rossi, P.H., & Freeman, H.E. (1982). Evaluation: A systematic approach (2d Ed.). Beverly Hills, CA: Sage.

Sieber, S. (1973). The integration of fieldwork and survey methods. American Journal of Sociology, 78, 1335-1359.

Trend, M.G. (1978). On the reconciliation of qualitative and quantitative analyses: A case study. Human Organization, 37, 345-354.

United States General Accounting Office. (1982). Causal analysis (Transfer Paper #1). Washington, D.C.: PEMD/USGAO.

United States General Accounting Office. (1983). The evaluation synthesis (Methods Paper #1). Washington, D.C.: PEMD/USGAO.

United States General Accounting Office. (1984a). Designing evaluations (Transfer Paper #4). Washington, D.C.: PEMD/USGAO.

United States General Accounting Office. (1984b). WIC evaluations provide some favorable but no conclusive evidence on the effects expected for the special supplemental program for women, infants, and children (PEMD-84-4). Washington, D.C.: USGAO.

United States General Accounting Office. (1986). Teenage pregnancy: 500,000 births a year but few tested programs (PEMD-86-16BR). Washington, D.C.: PEMD/USGAO.

United States General Accounting Office. (1987). Case study evaluation (Transfer Paper #9). Washington, D.C.: PEMD/USGAO.

United States General Accounting Office. (1988). Children's programs: A comparative evaluation framework and five illustrations (PEMD-88-28BR). Washington, D.C.: USGAO.

Wholey, J. (1979). Evaluation: Promise and performance. Washington, D.C.: The Urban Institute.

Williams, J. (Ed.). (1982). Studying implementation: Methodological and administrative issues. Chatham, N.J.: Chatham House.