
JOURNAL OF SOCIAL ISSUES VOLUME 32, NUMBER 4, 1976

Methodological Problems in Assessing the Impact of Television Programs

Samuel Ball

Educational Testing Service

The impact of television in general cannot be accurately assessed, but it is argued that reasonable assessments can be made of the impact of specific television programs. The components of three areas that pose special problems to the evaluation of television programs are discussed: design, sampling, and measurement. Methods for ameliorating difficulties in these problem areas are suggested.

The problems involved in assessing accurately the overall impact that television has on our daily lives are so vast as to defy solution. We could look at changes in our lives that have occurred since the advent of television, but television could hardly be credited or blamed for all these changes. Alternatively, we could conduct research in places where television is only just now arriving (outback Australia, South Africa) by obtaining baseline measures, introducing television, and then looking for subsequent changes. But the problem would be that even if we could then carefully control the introduction of television, the results obtained from our research could hardly be generalized to the United States or indeed to any Western urbanized society. Clear and unequivocal evidence of a causal relationship between television in general and social change will probably remain forever a fugitive from our scientific eyes.

Even though it is not possible to assess accurately the impact of television in general, it is both possible and highly desirable to evaluate the effects of specific television programs. Such evaluations, though few, have been carried out and include evaluations of “Sesame Street” (Ball & Bogatz, Note 1; Bogatz & Ball, Note 2), “Plaza Sesamo” (Diaz-Guerrero & Holtzman, 1974), “The Electric Company” (Ball & Bogatz, Note 3; Ball, Bogatz, Kazarow, & Rubin, Note 4), “Mr. Rogers’ Neighborhood” (Stein & Friedrich, Note 5), “Carrascolendas” (Williams, McRae, & Van Wart, Note 6), and “Fat Albert” (Office of Social Research, CBS, Note 7).

The author wishes to thank G. Bogatz, K. Kazarow, and D. Rubin for their helpful comments. Correspondence regarding this article may be addressed to S. Ball, Educational Testing Service, Princeton, NJ 08540.

It is noteworthy that major assessments of the impact of specific television programs have been mainly of programs broadcast by educational television stations. One should not be cynical and doubt that commercial programs have impact; but, by and large, evaluative research on specific programs broadcast by commercial networks has been rather narrowly focused on audience size and marketing implications, unless, as in the case of “Fat Albert,” the specific program had intended educational outcomes. The point is that most commercial programs aim to entertain and to develop an audience of sufficient size or type that an advertised product will ultimately be sold. Thus, audience and marketing surveys are central to the impact assessment usually desired by commercial television. These kinds of survey research have methodological problems of their own (Haskins, Note 8), but they are outside the scope of this article.

The assessment of the educational impact of specific television programs has many problem areas. These include choosing an appropriate evaluation model, selecting an evaluation design, choosing the variables to be assessed, developing appropriate measures, developing a sampling plan, ensuring quality control of data collection and data management, analyzing the data, and interpreting the results. Of course, these problem areas exist in the evaluation of virtually any new program, and they have been discussed elsewhere in the evaluation literature (Anderson, Ball, & Murphy, 1974; Suchman, 1967). However, the assessment of the impact of a television program has special if not unique problems of its own, and many of these problems fall into three categories: design, sampling, and measurement.

DESIGN

Designing research plans for an impact study of a television program is particularly difficult because control of the experimental treatment (viewing) is not usually within the capability of the evaluator. Merely designating subjects as experimentals does not ensure that they will view, and designating subjects as controls does not ensure that they will not view.

The choice of a feasible design to assess impact should depend on the kinds of questions to be answered. Generally, an evaluator wants to find out what effects the television program causes, and because of this interest in causal relationships he has to decide whether to use a true experimental design, in which the treatment (usually viewing the program) is randomly assigned to some proportion of the sampled subjects, or to use one of a variety of quasi-experimental designs (Campbell & Stanley, 1963).

The major advantage of a true experiment is that it allows the evaluator to interpret causal relationships with greater certainty than would be possible with a quasi-experiment. But the required random assignment of subjects to treatments may be neither possible nor practical, and, more importantly, even when randomization occurs it is difficult to ensure the fidelity of the treatment groups. If the experiment on the effects of television viewing of a particular show or series is taking place with some sort of captive audience (as in a school or jail), then it is relatively simple to ensure fidelity; that is, if a teacher is asked to let her class view a show (or not to), she usually cooperates. On the other hand, if the experiment is taking place in a “natural” setting (for example, in the home) it is by no means simple to ensure that those randomly assigned to the experimental treatment do view or that those randomly assigned to the control treatment do not view. And the problem is greater if a series rather than a single show is being assessed.

In an attempt to overcome the fidelity problem one can verbally encourage the experimental subjects to view the designated show and not encourage the control group. Experience indicates this to be at best minimally effective (Ball & Bogatz, Note 1). In our evaluation of the first year of “Sesame Street” we implemented this kind of true experiment. Our fear was that it would fail because too few of the experimental subjects would view, thereby making it rather difficult to assess the impact of viewing. In fact the reverse problem occurred. Too many of the control children viewed despite our decision not to encourage them to do so. Thus, the differences in amount of viewing between the experimental and control groups were quite small. Therefore, we were forced to use a number of different logical and statistical techniques to try to sort out the effects of viewing, effects that a simple comparison of experimental and control children could not provide.
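The article does not detail those techniques at this point. As one illustrative possibility only (a sketch with invented data, not necessarily the analysis used in the “Sesame Street” evaluation), an evaluator might set aside the diluted experimental and control labels and relate each child's gain to the amount of viewing actually reported:

```python
# Illustrative sketch only: relating gains to amount of viewing when the
# control group also viewed. Data are invented; this is not the analysis
# actually reported in the "Sesame Street" evaluation.
from statistics import mean

# Hypothetical records: (hours of viewing per week, pretest score, posttest score)
children = [
    (0.5, 42, 45), (1.0, 40, 46), (2.0, 44, 52), (3.5, 38, 50),
    (4.0, 45, 58), (5.0, 41, 57), (0.0, 43, 44), (2.5, 39, 49),
]

viewing = [v for v, _, _ in children]
gain = [post - pre for _, pre, post in children]

# Ordinary least-squares slope of gain on viewing: cov(x, y) / var(x).
vx, gy = mean(viewing), mean(gain)
slope = sum((x - vx) * (y - gy) for x, y in zip(viewing, gain)) / \
        sum((x - vx) ** 2 for x in viewing)
intercept = gy - slope * vx

print(f"Estimated gain per additional hour of weekly viewing: {slope:.2f}")
print(f"Expected gain for a nonviewer: {intercept:.2f}")
```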

Another possible method of implementing the true experiment is to put the series to be evaluated on cable television or a UHF station and to provide cable or UHF reception capability only to the randomly selected experimental group (Bogatz & Ball, Note 2). Cable television companies that are developing their franchises may cooperate in such a venture because it extends the scope of their network as well as indicating to the government regulating agency their sensitivity to educational needs. In our second year evaluation of “Sesame Street,” cable television as the medium for broadcasting the television show ensured the fidelity of the treatment groups; but it carries with it the criticism that one can generalize clearly only to such audiences as potential cable subscribers. It is also expensive to install and maintain cable in the experimental subjects’ homes.

A third possibility is to assign control subjects to view a different television program broadcast at the same time as the experimental program. Reminder telephone calls placed to all subjects immediately before and after each show attempt to ensure appropriate viewing and to establish whether the subjects indeed viewed the shows assigned to them. Differences between the two groups measured at posttest can be ascribed to differences in the shows they viewed (Ball, Kazarow, & Rubin, Note 9). In this design, however, the effort to keep the groups faithful to their treatment may be reactive with the dependent variables. Being called before and after a television show may itself affect what the viewer learns from the show, making the evaluation somewhat cloudy: Was the impact due to the treatment alone, to the processes used to keep the treatment groups intact, or to some interaction of these? If reactivity can be shown to be inconsequential, this kind of study merits strong consideration. Reactivity could be assessed by “keeping back” some of the subjects in the experimental treatment group: not calling them, but simply encouraging them to view at the outset of the series. Differences in impact between the two treatment groups would provide some evidence of the effects of the extra phone calls on both amount of viewing and, if viewing took place, on impact.
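The reactivity check described here amounts to comparing two experimental subgroups at posttest. A minimal sketch follows, with invented scores; the “called” subjects are assumed to have received the reminder telephone calls, while the “encouraged-only” subjects were merely urged to view at the outset.

```python
# Hedged sketch of the reactivity check: all scores are invented for illustration.
from statistics import mean, variance
from math import sqrt

called          = [61, 58, 64, 59, 62, 60, 63]   # posttest scores (hypothetical)
encouraged_only = [60, 57, 62, 58, 61, 59, 60]

def welch_t(a, b):
    """Welch's t statistic for the difference between two group means."""
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se

print(f"Mean difference (called - encouraged only): {mean(called) - mean(encouraged_only):.2f}")
print(f"Welch t statistic: {welch_t(called, encouraged_only):.2f}")
# A difference near zero would suggest the extra calls were not reactive.
```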

This by no means exhausts the design possibilities within the context of a true experiment. The point is that true experiments can be implemented, even when the purpose of the evaluation is to assess the impact of a television program viewed at home. However, the implementation of a true experiment in assessing the impact of a television program may be considered too onerous, expensive, or impractical given the specific circumstances attending a particular evaluation. In such circumstances, as Rubin (1974) points out, one can also legitimately use a weaker, quasi-experimental design. Belson (1956) used a posttest-only design to evaluate a television series teaching French to English speakers. Perhaps unwisely, the design used covariates, presumably unaffected by viewing, to adjust for initial differences between viewers and nonviewers.

Campbell and Stanley (1963) initially described many kinds of quasi-experimental designs, which Anderson et al. (1974) have grouped into three major categories. One of these, the regression-discontinuity design, is used in such situations as when resources (the curriculum) can only be given to the most deserving and when it would be unfair or impossible to have a control group comparable to the treatment group. With television programs, where virtually everyone can view once the program is broadcast, the regression-discontinuity design still deserves consideration. Suppose a television series is aimed at three- to five-year-olds and the evidence is that these are the bulk of the viewing audience. If the series is effective, the regression of performance on the dependent variables against age (say, for three- to eight-year-olds) should show a discontinuity, with some elevation of the regression line around the three- to five-year-olds.
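A hedged sketch of that elevation check follows. It is a simplified stand-in for a full regression-discontinuity analysis: the age trend is fitted on invented scores for older, largely non-viewing children, and the question is whether target-age children sit above its extrapolation.

```python
# Illustrative elevation check; scores are invented. The series is aimed at
# three- to five-year-olds, so we fit the age-score regression on older
# children and compare target-age children to the extrapolated line.
from statistics import mean

# (age in years, score on the dependent measure) -- hypothetical data
older  = [(6, 30), (6, 32), (7, 35), (7, 36), (8, 40), (8, 39)]
target = [(3, 22), (4, 28), (4, 27), (5, 33), (5, 34)]

ages   = [a for a, _ in older]
scores = [s for _, s in older]
ma, ms = mean(ages), mean(scores)
slope  = sum((a - ma) * (s - ms) for a, s in older) / sum((a - ma) ** 2 for a in ages)
intercept = ms - slope * ma

# Average amount by which target-age children exceed the extrapolated age trend.
elevation = mean(s - (intercept + slope * a) for a, s in target)
print(f"Mean elevation of 3-5 year olds above the age trend: {elevation:.2f} points")
```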

A second quasi-experimental design has considerable use in the assessment of all types of media-delivered programs. This type, the pretest-posttest nonequivalent group design, is used where it is not practicable to assign subjects at random to treatment groups. In the case of television impact studies, it is quite likely that, from an initial group of subjects who are pretested, some will subsequently view a program and others will not. This self-selection of treatment virtually ensures noncomparability of the treatment groups at pretest. If the program is suspected to be educational, the likelihood is that the viewers will have greater achievement and motivation at pretest than will the nonviewers. Adjustments could be made by using pretest scores as covariates in analyses of gain or posttest scores. But the wisdom of this depends heavily on the initial bias being small and on pretests and posttests being highly correlated. Otherwise, it is akin to asking whether the ants could beat the elephants at basketball if they were both the same size and both had trunks. They are not the same size and they do not both have trunks. It is a perilous task to assess the impact of a television program on an audience when a self-selected viewing audience is compared with a self-selected and noncomparable group of nonviewers.
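To make the kind of adjustment at issue concrete (and not to reproduce any analysis from the studies cited), the sketch below computes a covariance-adjusted posttest difference using a pooled within-group regression of posttest on pretest. The data are invented, and the adjustment is only as trustworthy as the assumptions just described.

```python
# Hedged sketch of a covariance (ANCOVA-style) adjustment with invented data.
# Viewers and nonviewers are self-selected, so their pretest means differ.
from statistics import mean

viewers    = [(50, 68), (55, 72), (48, 66), (60, 78), (52, 70)]   # (pretest, posttest)
nonviewers = [(40, 52), (45, 58), (38, 50), (47, 60), (42, 55)]

def within_slope(groups):
    """Pooled within-group regression slope of posttest on pretest."""
    num = den = 0.0
    for g in groups:
        mx = mean(p for p, _ in g)
        my = mean(q for _, q in g)
        num += sum((p - mx) * (q - my) for p, q in g)
        den += sum((p - mx) ** 2 for p, _ in g)
    return num / den

b = within_slope([viewers, nonviewers])
raw_diff = mean(q for _, q in viewers) - mean(q for _, q in nonviewers)
pretest_gap = mean(p for p, _ in viewers) - mean(p for p, _ in nonviewers)
adjusted_diff = raw_diff - b * pretest_gap

print(f"Unadjusted posttest difference: {raw_diff:.2f}")
print(f"Pretest gap between groups:     {pretest_gap:.2f}")
print(f"Covariance-adjusted difference: {adjusted_diff:.2f}")
```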

The problem of nonequivalency of viewing and nonviewing groups at pretest can be avoided by using a third set of quasi-experimental designs: the time series design. The basic elements can be schematized as follows, where the Os represent comparable measurements made from time to time and X represents the beginning of the broadcasting of the television program being studied:

O1 . . O2 . . O3 . . . . X . . . . O4 . . O5 . . O6

Impact is shown if there is a larger change between O3 and O4 (in the series diagrammed above; the exact number of measurements will depend on the particulars of the study) than between the other Os. A control group is not essential, although it improves the quality of the time series design.

As an example of the time series design, a sample of the intended audience would be measured on several occasions before a program begins. Self-selection then occurs, and both viewers and nonviewers would subsequently be measured on several additional occasions. The trends in the status of the viewers and nonviewers could then be compared to see whether differential changes occurred after the program began to be broadcast, presumably as a result of the broadcasting.
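For concreteness, a minimal sketch of that comparison follows, using invented group means at the six measurement occasions (O1 through O6) and treating the O3-to-O4 change as the one that should stand out if the program has impact.

```python
# Illustrative time series comparison with invented group means.
# Broadcasting begins between occasions O3 and O4.
viewers_means    = [30.1, 31.0, 31.8, 36.5, 37.2, 38.0]
nonviewers_means = [29.8, 30.6, 31.5, 32.2, 33.0, 33.9]

def occasion_changes(means):
    """Change between each pair of adjacent measurement occasions."""
    return [round(b - a, 2) for a, b in zip(means, means[1:])]

for label, series in [("viewers", viewers_means), ("nonviewers", nonviewers_means)]:
    changes = occasion_changes(series)
    # Impact is suggested if the O3 -> O4 change (index 2) for viewers is
    # clearly larger than their other changes and than the nonviewers' change.
    print(label, "changes:", changes, "| O3->O4:", changes[2])
```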

Still another useful design is the age cohorts design (Ball & Bogatz, Note 1). In this design a cohort of children (Cohort A) in a given age period (say 4 years, 4 months to 4 years, 10 months) is tested just before the series begins. At the end of the series (say six months later) a second cohort of children (Cohort B, who are now the same age as Cohort A had been, i.e., 4 years, 4 months to 4 years, 10 months) from the same community with the same socioeconomic characteristics is also administered the same measures. If the series is ineffective, Cohort B should score at about the same level as Cohort A. If the series is effective, Cohort B should score higher than Cohort A, with the preponderance of the higher scores coming from those members of Cohort B who had viewed the series.
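A minimal sketch of the age cohorts comparison, with invented scores, might look like the following; Cohort B's overall advantage and the viewer/nonviewer breakdown within Cohort B are the quantities of interest.

```python
# Hedged sketch of the age cohorts comparison with invented scores.
# Both cohorts are 4 years, 4 months to 4 years, 10 months old at testing;
# Cohort A is tested before the series begins, Cohort B after it ends.
from statistics import mean

cohort_a            = [41, 44, 39, 46, 42, 43]   # tested before broadcast
cohort_b_viewers    = [52, 55, 49, 57, 53]       # tested after broadcast
cohort_b_nonviewers = [42, 45, 40, 44]

cohort_b_all = cohort_b_viewers + cohort_b_nonviewers
print(f"Cohort A mean:            {mean(cohort_a):.1f}")
print(f"Cohort B mean:            {mean(cohort_b_all):.1f}")
print(f"Cohort B viewers mean:    {mean(cohort_b_viewers):.1f}")
print(f"Cohort B nonviewers mean: {mean(cohort_b_nonviewers):.1f}")
# An effective series should show Cohort B above Cohort A, with the advantage
# concentrated among the Cohort B children who actually viewed.
```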

The quasi-experimental and the age cohorts designs have the characteristic of allowing the evaluator to concentrate attention on an audience of self-selected viewers. They tell us what the program’s impact is on those who are “real world” viewers. They do not tell us the impact on potential viewers who, in the real world outside a laboratory study, may choose not to view. They provide, therefore, a relatively realistic assessment of the program’s impact. On the other hand, the true experiment probably provides a conservative estimate of the impact of the program because the experimental group may include subjects who view with little enthusiasm and who view only because they have been asked to cooperate in the experiment.

Each of the practicable designs has problems of implementation and of subsequent interpretation. But these problems are minor compared with the problems of interpreting impact studies based on case study designs. Case study designs usually involve simply having a small number of subjects view a show or series of shows; then the evaluator reports findings, sometimes using such techniques as Wolf (1969) humorously refers to as the “cardiac method” or the “cosmetic method.” In the cardiac method the evaluator indicates that he feels sure in his heart that the subjects benefitted. In the cosmetic method the evaluator provides such evidence as that the students looked interested and involved. Such evidence can be useful, but it lacks credibility if causal relationships are of major concern.

SAMPLING

Television programs in the entertainment field typically reach a massive audience, even if their share of the market is relatively small. Typically, too, they are broadcast in more than one geographic area; and further, they are viewed by a heterogeneous set of viewers. Thus the formal definition of the population of actual viewers may be synonymous with the national population. But this is not the target population for an educational television program.

It is important first to define a target subpopulation such as “disadvantaged six-year-olds” or “mothers of preschoolers.” Then, if it is desirable to make estimates of program effects for subpopulations, a reasonable national probability sample would have to be drawn. However, this would have to include far more sampling units (geographic areas) than the evaluator could work with unless inordinately large funding was available.

Almost inevitably assessors of television impact either develop a sample from an incomplete frame consisting of only one or two geographic units, or subjectively sample elements of the target population with which it is convenient to work. These sampling procedures can produce useful data. However, the limitations should be made explicit in the evaluation report and the evaluator should be careful not to generalize too widely.

Along these lines, Yin (1973) cautioned Children’s Television Workshop that the evaluations of “Sesame Street” and “The Electric Company” were “limited to a few sites from which any generalizations about national populations cannot be accurately made” (p. 35). This, of course, is true; but the problem remains how to sample in order that generalizations about impact can be made to national populations. The best compromise seems to be to carry out the impact study in at least a few sites and with subjects that are as representative as possible of the target audience. One can then obtain some notion of site × treatment interactions and, if these effects are negligible, make national impact projections. That is, if it can be shown that a television series works about as well for disadvantaged young adolescents in Atlanta, rural Illinois, New York City, and Los Angeles, then the national impact of the series can reasonably be projected using the knowledge of the observed impact multiplied by the national audience size estimated from a national audience survey.
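As a rough illustration of this projection logic (not a procedure drawn from the evaluations cited), the sketch below first checks that per-site effects are reasonably homogeneous before multiplying the average per-viewer effect by a national audience estimate; all figures are invented.

```python
# Hedged sketch of projecting national impact from a few sites.
# Effect sizes and audience figures are invented for illustration.
per_site_effect = {            # observed mean gain attributable to viewing, by site
    "Atlanta": 6.2,
    "rural Illinois": 5.8,
    "New York City": 6.5,
    "Los Angeles": 6.0,
}

effects = list(per_site_effect.values())
spread = max(effects) - min(effects)
average_effect = sum(effects) / len(effects)

# Only project nationally if site-to-site differences (a rough proxy for a
# site x treatment interaction) are negligible relative to the average effect.
if spread < 0.25 * average_effect:
    national_audience = 4_000_000       # hypothetical figure from an audience survey
    total_projected_gain = average_effect * national_audience
    print(f"Average per-viewer gain: {average_effect:.2f}")
    print(f"Projected aggregate national gain: {total_projected_gain:,.0f} score points")
else:
    print("Site effects differ too much to justify a single national projection.")
```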

MEASUREMENT

In any impact assessment, the dependent variables and measurement procedures must be specified. When the impact of a television program is being assessed, one problem is that the television medium itself, as well as the program content, is part of the treatment. There are affective and attitudinal impacts that television adds to the intended outcomes; an evaluator planning an impact study should consider the assessment of these overtones. As well, the program itself may contain a number of important but unintended messages, and assessments should be made of their impact too.

A related problem is that the treatment itself is usually neglected as a variable. It is assumed that the program is the program. There is usually a false expectation that a program on television must contain what it is supposed to contain. But the program can and should be subject to content analysis, for it may contain more or less of the intended message. If, as we found in some of our work (Ball & Bogatz, Note 3), a particular subgoal is taught only about 1% of the time, this sparse treatment should be documented.
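The sort of content analysis suggested here can be made concrete with a small tally of coded program time per curriculum subgoal. The sketch below is illustrative only; the subgoals, durations, and flagging threshold are invented rather than taken from the evaluations cited.

```python
# Illustrative content-analysis tally; segment codes and durations are invented.
coded_segments = [             # (curriculum subgoal, seconds of coded program time)
    ("letter recognition", 540),
    ("blending sounds", 380),
    ("sight words", 290),
    ("punctuation", 18),       # a sparsely treated subgoal
]

total = sum(seconds for _, seconds in coded_segments)
for subgoal, seconds in coded_segments:
    share = 100 * seconds / total
    flag = "  <-- sparse treatment worth documenting" if share < 2 else ""
    print(f"{subgoal:20s} {share:5.1f}% of coded time{flag}")
```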

A third measurement problem is that amount of viewing, although hard to assess, is an important variable. In general, the more one tries to assess this variable, the more reactive the assessment procedures are to the final results. In the absurd extreme, one could obtain a reliable measure of amount of viewing if one were to sit with viewers monitoring the amount of time they viewed the series. However, this would undoubtedly be reactive with the later results. Or, at the other extreme, one could simply ask the experimental subjects how much they viewed the program, a procedure unlikely to produce valid data.

A television program impact study that does not measure affective and attitudinal dependent variables is probably incomplete; an impact study that does not measure the program content variable is probably misleading; and an impact study that has no measure of amount of viewing is probably worthless. However, the measurement of these variables has problems whatever the type of program. None of the ways of measuring affective and attitudinal variables imbues the intelligent researcher with confidence, and the measurement of amount of viewing is fraught with difficulty. Overlying this is the fact that television impact studies take place in the real world, and measurement techniques suitable for a captive classroom or a psychological laboratory may need to be adapted. Further erosions of confidence occur when the measurement has to take place, as it often does, in the subjects’ homes under trying nonexperimental conditions. This is a depressing picture.

What should the evaluator do? In general, multiple measurement techniques should be used to supplement each other. When measuring attitudes, rating scales, paper-and-pencil tests, and some unobtrusive measures might all be used. When measuring amount of viewing, interviews, viewing diaries, and properly timed telephone calls might all be used. The net result might be confusing if some but not all indicators are in agreement. But better a degree of confusion than a sense of security with invalid results.
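A small sketch of how agreement among several viewing indicators might be checked is given below; the three columns of invented data stand for the interview, diary, and telephone-call procedures mentioned above.

```python
# Hedged sketch of checking agreement among multiple viewing indicators.
# All values are invented; each list would come from a different procedure.
from statistics import mean, pstdev

interview = [4, 2, 5, 1, 3, 4, 0, 2]   # shows per week, as reported in interview
diary     = [3, 2, 5, 1, 2, 4, 0, 1]   # shows per week, from viewing diaries
calls     = [4, 1, 5, 0, 3, 3, 0, 2]   # shows confirmed by timed telephone calls

def correlation(x, y):
    """Pearson correlation between two equal-length lists."""
    mx, my = mean(x), mean(y)
    cov = mean((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / (pstdev(x) * pstdev(y))

print(f"interview vs diary: r = {correlation(interview, diary):.2f}")
print(f"interview vs calls: r = {correlation(interview, calls):.2f}")
print(f"diary vs calls:     r = {correlation(diary, calls):.2f}")
# High agreement supports a composite viewing index; serious disagreement is a
# warning that at least one indicator is not measuring viewing validly.
```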

CONCLUSION

In this paper the point has been made that problems of design, sampling, and measurement are especially egregious in the assessment of the impact of a television program. There is no claim that complete solutions to these problems are presented here or that they even exist. Rather, the intended message is that the problems can be ameliorated by some of the suggested procedures. Besides, awareness of a problem is an early step in its solution.

Television is an important part of our lives. It would be foolish to assume it is easy to assess its impact. But it would be even more foolish to use the difficulties as an excuse to forego the assessment process.

REFERENCE NOTES

1. Ball, S., & Bogatz, G. A. The first year of “Sesame Street”: An evaluation. Princeton, N.J.: Educational Testing Service, 1970.
2. Bogatz, G. A., & Ball, S. The second year of “Sesame Street”: A continuing evaluation. Princeton, N.J.: Educational Testing Service, 1971.
3. Ball, S., & Bogatz, G. A. Reading with television: An evaluation of “The Electric Company.” Princeton, N.J.: Educational Testing Service, 1973.
4. Ball, S., Bogatz, G. A., Kazarow, K., & Rubin, D. B. Reading with television: A follow-up evaluation of “The Electric Company.” Princeton, N.J.: Educational Testing Service, 1974.
5. Stein, A., & Friedrich, L. The effects of aggressive and pro-social television programs on the naturalistic social behavior of preschool children. Unpublished manuscript, Pennsylvania State University, 1971.
6. Williams, F., McRae, S., & Van Wart, G. “Carrascolendas”: Effects of a Spanish/English television series for primary school children. Austin, Texas: Center for Communication Research, University of Texas, 1972.
7. Office of Social Research, Columbia Broadcasting System. A study of messages received by children who viewed an episode of “Fat Albert and the Cosby Kids.” New York: CBS Broadcast Group, 1974.
8. Haskins, J. B. How to evaluate mass communications: The controlled field experiment. New York: Advertising Research Foundation, 1968.
9. Ball, S., Kazarow, K., & Rubin, D. B. An evaluation of “To Reach a Child,” a television series for parents. Princeton, N.J.: Educational Testing Service, 1974.

REFERENCES

Anderson, S., Ball, S., & Murphy, R. Encyclopedia of educational evaluation. San Francisco: Jossey-Bass, 1974.
Belson, W. A. A technique for studying the effects of a television broadcast. Applied Statistics, 1956, 5, 195-202.
Campbell, D. T., & Stanley, J. C. Experimental and quasi-experimental designs for research on teaching. Chicago: Rand McNally, 1963.
Diaz-Guerrero, R., & Holtzman, W. Learning by televised “Plaza Sesamo” in Mexico. Journal of Educational Psychology, 1974, 66, 632-643.
Rubin, D. B. Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology, 1974, 66, 688-701.
Suchman, E. A. Evaluative research: Principles and practice in public service and social action programs. New York: Russell Sage Foundation, 1967.
Wolf, R. A model for curriculum evaluation. Psychology in the Schools, 1969, 6, 107-109.
Yin, R. K. The workshop and the world: Toward an assessment of the Children’s Television Workshop. Santa Monica: The Rand Corporation, 1973.