
Motivation and Emotion, Vol. 21, No. 4, 1997

Arousal and Valence in the Direct Scaling of Emotional Response to Film Clips1

Nancy Alvarado2

University of California, San Francisco

Contributions of differential attention to valence versus arousal (Feldman, 1995) in self-reported emotional response may be difficult to observe due to (1) confounding of valence and arousal in the labeling of rating scales, and (2) the assumption of an interval scale type. Ratings of emotional response to film clips (Ekman, Friesen, & Ancoli, 1980) were reanalyzed as categorical (nominal) in scale type using consensus analysis. Consensus emerged for valence-related scales but not for arousal scales. Scales labeled Interest and Arousal produced a distribution of idiosyncratic responses across the scale, whereas scales labeled Happiness, Anger, Sadness, Fear, Disgust, Surprise, and Pain produced consensual response. Magnitude of valenced response varied with both stimulus properties and self-reported arousal.

Feldman (1995) presented evidence that individuals differ in their attention to two orthogonal dimensions of emotion: valence (evaluation) and arousal. These differences were noted when subjects were asked to make periodic mood ratings using scales that confound these two aspects of affective experience. Feldman analyzed these ratings in the context of Russell's (1980) circumplex model and Watson and Tellegen's (1985) dimensions of positive affect (PA) and negative affect (NA) and suggested that the structure of affect changes with the focus of attention.

1Preparation of this article was supported in part by National Institute of Mental Health (NIMH) grant MH18931 to Paul Ekman and Robert Levenson for the NIMH Postdoctoral Training Program in Emotion Research. I thank Paul Ekman for permitting access to the data analyzed here. I also thank Jerome Kagan and several anonymous reviewers for their helpful comments on this manuscript.

2Address all correspondence concerning this article to Nancy Alvarado, who is now at the Department of Psychology (0109), University of California at San Diego, 9500 Gilman Drive, La Jolla, California 92093-0109.



She speculated that valence focus "may be associated with the tendency to attend to environmental, particularly social cues" (p. 163) whereas arousal focus may be related to internal (somesthetic) cues, citing Blascovich (1990; Blascovich et al., 1992). This paper presents support for Feldman's views, in a direct-scaling self-report context where valence and arousal are reported independently and the environmental cues are held constant, using data originally collected by Ekman, Friesen, and Ancoli (1980).

Direct Scaling Assumptions

Direct scaling of emotional response occurs when a subject is exposed to an affect-inducing stimulus, then asked to introspect and rate the amount of some affect using a rating scale, often labeled with the name of an emotion to be reported, and typically numbered in intervals, such as from 1 to 7. Researchers frequently anchor the endpoints of such scales with descriptive phrases such as not at all angry, extremely angry, or most anger ever felt in my life. These ratings are treated as judgments on an interval, continuous scale. They are then averaged to produce means which are compared using analysis of variance (ANOVA) or t-test.

There is some evidence that self-report judgments of emotional response are consistent across time for the same individual (Larsen & Diener, 1985, 1987), that self-report varies systematically with certain physiological changes associated with emotion and thus may be a valid indicator of emotional response (Levenson, 1992), and that higher ratings on a scale do correspond to greater emotional experience for the same individual (monotonicity). These findings justify assumption of an ordinal scale type during data analysis. On the other hand, there is no evidence that the subjective distances between adjacent numbers on every portion of the scale are equal, as would be necessary in order to assume that the data are interval in nature. Further, aggregation of data and interrater comparisons are problematic because it is unclear how individual differences in emotional response are related to individual differences in the use of rating scales. Nor have the distances between numbers been shown to correspond to the same subjective differences in response for each individual in a study.

Consider temperature as an analogy. We can use an objective scale, such as the Fahrenheit scale, to evaluate the accuracy of subjective judgments. However, if we had no such scale, but instead asked subjects to rate temperature based upon the hottest or coldest temperatures they had ever experienced, their subjective experience would be confounded with variations in their devised scales. Unless we know the anchor points and scale intervals, we cannot know whether two subjects reporting different temperature ratings for the same stimulus are using the same scale but experiencing the temperature differently, or experiencing the temperature as the same but using different scales. If we ignore these difficulties and average their ratings, we obtain a measure that is useful in certain experimental contexts but insensitive to individual variations in subjective experience. Rather, we have a scale that assumes that individual differences are unimportant or nonexistent.

No objective physical unit of measurement exists to compare against self-reported emotional experience. Even when we supply a 7-point scale anchored by descriptive phrases, we have no way of knowing how the individual interprets such phrases, e.g., how much anger one person has ever felt in his or her lifetime, compared to the maximum experienced by another. Further, anchoring using descriptive phrases such as most emotion ever felt in your life invites subjects to apply a scale with unequal distances between intervals, such that the most emotion ever felt on a 10-point scale is not 10 times the amount felt when 1 is reported, but probably far greater. Use of a scale with 100 rather than 10 divisions does not remedy this problem.

Use of rating scales to describe emotion is further complicated if magnitude is part of the meaning of the label used to identify the scale itself. For example, it is unclear how the difference in meaning between scale labels such as anxiety and fear, or annoyance and fury, would affect the judgments of magnitude made using that scale. Would an experience rated in the middle of an annoyance scale be rated lower if the scale were labeled frustration, anger, or rage?

Given these difficulties, the direct scaling of emotional response appears to be, at best, ordinal. As Townsend and Ashby (1984) noted, ". . . if the strength of one's data is only ordinal, as much of that in the social sciences seems to be, then even a comparison of group mean differences via the standard Z or t test or by analysis of variance is illegitimate. Only those statements and computations that are invariant under monotone (order is preserved) transformations are permissible" (p. 395). When the purpose of a study is merely to demonstrate a difference using self-report as a dependent variable, then the measurement concerns described above are unlikely to affect the validity of the findings. However, when these means tests are used to assert the equality of stimuli presented to evoke emotional response, or the efficacy of such stimuli as an elicitor of a specific emotion, then the concerns raised above become crucial to the findings. Everything that follows in such a study rests upon an initial assumption that mean self-report values are an accurate index of emotional response.
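Townsend and Ashby's point can be illustrated with a small numerical sketch (hypothetical ratings invented for illustration, not data from this study): a comparison of group means can reverse direction under a strictly increasing relabeling of an ordinal scale, even though the ordering of every individual rating is preserved.

```python
# Hypothetical illustration: group means are not invariant under monotone
# (order-preserving) transformations of an ordinal rating scale.
group_a = [1, 5, 5]                 # ratings on a 1-7 scale
group_b = [4, 4, 4]

# A strictly increasing relabeling of the scale values (order is preserved).
relabel = {1: 1, 2: 2, 3: 3, 4: 4, 5: 10, 6: 11, 7: 12}

def mean(xs):
    return sum(xs) / len(xs)

print(mean(group_a), mean(group_b))                     # 3.67 < 4.0
print(mean([relabel[x] for x in group_a]),
      mean([relabel[x] for x in group_b]))              # 7.0 > 4.0
# The direction of the mean difference flips, so t tests or ANOVA on the raw
# numbers depend on interval-scale assumptions that ordinal data do not license.
```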


This problem is relevant to several recent studies investigating the congruence between facial activity and self-reported emotional response, as noted by Ruch (1995). In an ongoing controversy over whether smiling is an indicator of expressed feeling, Fridlund (1991) reported that happiness ratings did not parallel electromyographic (EMG) monitoring of smiling among subjects viewing film clips, but seemed related instead to the sociality of the viewing condition. Hess, Banse, and Kappas (1995) improved the measurement of facial activity by monitoring Duchenne versus non-Duchenne smiling and varied the amusement level of the film stimuli presented as well as the viewing context. They found a more complex relationship between social context and smiling. In both studies, the crucial comparison between facial activity and emotional response rested on the accuracy of the self-report ratings, analyzed using an ANOVA across viewing conditions, and assumed to be a valid measure of emotional response.

Use of Direct Rating to Norm Film Clips

This study reanalyzes self-report ratings of emotional response to film clips, originally collected by Ekman et al. (1980). These data have been frequently cited by Fridlund (1994) because they contain anomalies that he considers support for his view that smiling is related to social context rather than emotional response. Fridlund's larger issue of the sociality of smiling was addressed by Hess et al. (1995) and will not be discussed further here. This discussion instead will focus upon the complexity involved in demonstrating congruence between self-report ratings and facial activity (or other behavior), and the need to improve methods of collecting and analyzing self-report data. The stimulus set used by Ekman et al. (1980) provides a useful illustration of the methodological and theoretical issues discussed earlier because, unlike many similar studies, it includes both baseline self-report ratings and concurrent ratings using multiple, separately labeled rating scales.

Ekman et al. (1980) compared self-report judgments for 35 subjects with their measured facial expressions when viewing pleasant and unpleasant film clips selected for their ability to evoke emotion. Fridlund (1994) noted that facial expression and direct ratings agreed only for the film stimuli with social content, but not for a third film for which the mean rated happiness was the same. At issue were three pleasant film clips: (1) a gorilla playing in a zoo, (2) ocean waves, and (3) a puppy playing with a flower. All three films evoked the same mean ratings when subjects were asked to rate their response on a scale labeled Happiness. However, as Fridlund noted, the film clips evoked differential amounts of facial activity, with the gorilla film evoking the greatest duration and intensity of facial activity, the puppy film showing the greatest frequency of facial activity, and only seven subjects showing any facial response to the ocean film. From this, Fridlund argued that the gorilla and puppy films were somehow more social in nature, evoking more facial expression because such expressions only arise from social antecedents. However, this is only true if the films did in fact evoke the same emotional responses. As will be argued later, I believe they did not.

Consensus Modeling

The assumptions of the random-effects ANOVA model are that responses are drawn from a normal distribution and that they are made using an interval scale. The model further assumes that all individuals use the same scale in the same manner (implicit to the assumption of equal variance).3 The point here is not whether analysis of variance has been correctly applied in psychological research, but rather whether a model that assumes minimal individual differences is suitable for exploring whether such individual differences in fact exist. The analysis below applies consensus modeling to explore (1) whether the averaging of ratings produced misleading norms for the various film clips, (2) whether subject ratings were idiosyncratic or consensual (as is implicitly assumed by the averaging of data), and (3) whether subjects used all scales in an equivalent manner across the rating contexts. Consensus analysis is a formal computational model which uses the pattern of responses within a data set to predict the likelihood of correct response for each subject (called the competence rating), provide an estimate of the homogeneity of response among subjects (the mean competence), and provide confidence intervals for the correctness of each potential response to a set of questions. While this model also makes certain assumptions, discussed in greater detail below, it incorporates goodness-of-fit measures that permit an analysis of the extent to which those assumptions have been met. Thus the model can be used to investigate the nature of response using rating scales, and thereby to address the issues raised above. A formal description of the model has been provided by Batchelder and Romney (1988, 1989). Equations are provided in the Appendix.

3According to Hays (1988), these assumptions can be violated without greatly affecting results when a fixed-effects model is used to test inferences about specific means. Violating assumptions of normality and equal variance has serious consequences for a random-effects model used to test inferences about the variance of the population effects.


Consensus modeling assumes that subjects draw upon shared latent knowledge when making their responses. The source of this shared knowledge may be cultural or may be derived from shared physiology or common humanity. The model cannot distinguish between these sources of homogeneous responding. It assumes that intercorrelation of subject responses across a data set occurs because subjects are drawing upon the same latent answer key when making their responses. Therefore, the latent answer key can be recreated using the pattern of intercorrelation. The model assumes that subjects vary in their performance and in their access to shared knowledge, but that subjects with higher correlation to the group are more expert because they have greater access to shared knowledge. The answer key confidence intervals are estimated using Bayes' theorem. Each subject's competence score is used as a probability of correctness. Subjects who are more expert because they agree more with the group are given greater weight in producing the estimated answer key. Thus, consensus emerges not from majority response to a particular question, but from patterns of agreement across the entire data set.
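The Bayes step described above can be sketched as follows. This is an illustrative reimplementation under the model's stated assumptions (uniform prior over options, unbiased guessing), with hypothetical inputs and variable names of my own choosing rather than code from the original analysis.

```python
import numpy as np

def answer_key_posterior(responses, competence, L):
    """Posterior probability that each of the L response options is the
    consensual ('correct') answer to one question, given each subject's
    response and an estimated competence (probability of knowing the
    answer). Assumes a uniform prior over options and unbiased guessing."""
    responses = np.asarray(responses)
    competence = np.clip(np.asarray(competence, dtype=float), 0.0, 0.999)
    log_post = np.zeros(L)
    for k in range(L):
        p_match = competence + (1 - competence) / L   # P(response = k | truth = k)
        p_other = (1 - competence) / L                # P(response = k | truth != k)
        log_post[k] = np.sum(np.log(np.where(responses == k, p_match, p_other)))
    post = np.exp(log_post - log_post.max())
    return post / post.sum()

# Hypothetical example: five subjects rate one film on the 0-8 scale (L = 9);
# higher-competence subjects pull the estimated answer toward their response.
post = answer_key_posterior(responses=[4, 4, 0, 4, 5],
                            competence=[0.8, 0.7, 0.2, 0.9, 0.6], L=9)
print(post.argmax(), round(float(post.max()), 3))
```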

For purposes of this study, the question was: "What number on this rating scale best describes the emotional response to this film clip?" This analysis assumes that there is a single correct number on each scale, for each rating context, that characterizes the group. This is the same assumption made when a group mean is used as a normative rating. Use of such a mean implies that one number (e.g., 4.5) best predicts the potential response of any individual selected at random from the population.

Using consensus analysis, we can test whether subjects assign the same stimulus the same number on their internal subjective scales, or whether their scales are calibrated such that the same stimulus may produce widely varying response. This is important because it tells us something about the consistency of emotional response across individuals. Previous studies have also assumed that individual scales are calibrated in a similar enough manner to justify the aggregation of data across subjects and the use of ANOVA models. This approach tests whether that assumption is justified. In the study that follows, consensus analysis results are supplemented by analysis of the normality of the distribution of responses, and of the patterns of correlation among the scales.

METHOD

This analysis was performed upon the original self-report data collected by Ekman et al. (1980), rather than the summaries provided by the resulting article. Additional details about the data collection procedures were provided in that article and are omitted here, except where relevant to the arguments presented.

Subjects

Subjects were 35 female volunteers, ages 18 to 35 years, recruited through advertisements to participate in a study of psychophysiology.

Stimuli

Stimuli consisted of five films of 1-min duration, three intended to be pleasant and two intended to be unpleasant. The three pleasant films (described above) were created by Ekman and Friesen and were always shown in the same order: gorilla, ocean, puppy. The two unpleasant films were edited versions of a workshop accident film designed to evoke fear and disgust. The first film depicts a man sawing off the tip of his finger. The second shows a man dying when a plank of wood is thrust through his chest by a circular saw. These films were always shown in this same order.

Procedure

Subjects rated their emotional responses for two baseline periods and five film-viewing periods using a series of nine unipolar 9-point scales, labeled with the following terms: Interest, Anger, Disgust, Fear, Happiness, Pain, Sadness, Surprise, and Arousal. Pain was defined for subjects as "the experience of empathetic pain" and Arousal was explained as applying to the total emotional state rather than to any one of the other scales presented. The other terms were not explained to subjects. Scales ranged from 0 (no emotion) to 8 (strongest feeling). Instructions explained how the ratings were to be made (Ekman et al., 1980): ". . . strength of a feeling should be viewed as a combination of (a) the number of times you felt the emotion—its frequency; (b) the length of time you felt the emotion—its duration; and (c) how intense or extreme the emotions [sic] was—its intensity" (p. 1127).

The first baseline occurred during a 20-min period in which the subject was instructed to relax. The presentation of pleasant or unpleasant films first was counterbalanced. Ratings for all three pleasant films were made after viewing all three films. Similarly, ratings for the two unpleasant films were made after viewing both films. A second baseline rating was made after rating of the first set of films, during a 5-min interval before starting the second series of films.

RESULTS

Consensus Analysis

The following discussion is adapted from the description of consensus modeling provided by Weller and Romney (1988). Consensus analysis provides a measure of reliability in situations where correct responses to items are not already known. Mathematically, it closely parallels item response theory or reliability theory, except that data are coded as given by subjects rather than as "correct" or "incorrect," and the reliability of the subjects is measured instead of the reliability of the items. The formal model is described in Batchelder and Romney (1988, 1989). Additional description of the model is provided in the Appendix. The main idea of the model is that when correct answers exist, the answers given by subjects are likely to be positively correlated with that correct answer key. Thus, in situations where correct answers are unknown but assumed to exist, the pattern of intercorrelations or agreement among subjects (called consensus) can be used to reconstruct the latent answer key. This is similar to the idea in reliability theory that correlations among items reflect their independent correlation with an underlying trait or ability. Similarly, high agreement among subjects about the answers to a set of items measuring a coherent domain suggests the likelihood that shared knowledge exists and provides information about what that knowledge is. In the words of Weller and Romney (1988), "A consensus analysis is a kind of reliability analysis performed on people instead of items" (p. 75). This reliability analysis is used to make inferences about the nature of the domain or to determine the correct answers. When a correct answer key does not exist, as when subjects belong to subcultures drawing upon different sources of shared knowledge, or when subjects draw upon idiosyncratic knowledge, that violation of the model's assumptions is readily apparent in the measures provided by the model.

Ratings for each of the nine emotion-labeled scales were analyzed separately; thus the data consisted of seven numerical ratings (one for each rating period) for each of the 35 subjects, for each labeled scale (nine scales). The data were treated as multiple-choice responses to the implied question "Which number corresponds to the correct emotional response rating for this particular film segment or baseline period?" Given the preceding discussion about scale types, it would have been preferable to analyze the data using an ordinal consensus model, but such a model has not yet been developed. The categorical, multiple-choice model used here assumes an equal probability of guessing the alternatives in its correction for guessing. The analysis of normality (presented later) suggests that this assumption is appropriate for some but not all of the rating scales. With ordinal data, it is more likely that guessing biases differ among the rating alternatives (e.g., the probability of guessing 5 may be different than the probability of guessing 0). A model incorporating such biases had not been developed at the time this analysis was performed, but now exists (see Klauer and Batchelder, 1996). In general, the application of a categorical model to what we suspect is ordinal data tends to work against a finding of consensus because subjects must agree on the exact rating number given to each stimulus out of nine alternatives (0 to 8).

The measures used to evaluate results are (1) individual competence scores, (2) mean competence, (3) eigenvalues produced during the principal component analysis used to estimate the solution to the model's equations, and (4) answer key confidence estimates. Competence scores range from -1.00 to 1.00 and are maximum-likelihood parameter estimates. They are best understood as estimated probabilities rather than correlation coefficients. A negative competence score indicates extreme and consistent disagreement with the group across rating periods.

Batchelder and Romney (1988, 1989) established three criteria for judging whether consensus exists in subject responses to questions about a domain: (1) eigenvalues showing a single dominant factor (a ratio greater than 3:1 between the first and second factors), (2) a mean competence greater than .500, and (3) absence of negative competence scores in the group of subjects. While failure to meet these criteria does not necessarily rule out consensus, it can indicate a poor fit between the data and the model.

Consensus analysis results for the nine scales across the seven rating periods are summarized in Table I. All scales except those labeled Interest and Arousal met the criteria for consensus. In contrast, the scales for Interest and Arousal showed nearly half the group with negative consensus scores, indicating severe disagreement about the correct responses on those scales. The scales for Anger, Disgust, and Pain showed the greatest consensus, with the highest mean consensus scores and with eigenvalue ratios indicating a single dominant factor in the data. While the scales for Sadness and Surprise each showed a single negative consensus score, the otherwise high mean consensus scores and ratios between the eigenvalues suggest that consensus also existed for those scales.

This finding of consensus for seven of the nine scales suggests that subjects agreed strongly in their emotional responses to the stimuli presented, particularly with respect to the scales labeled Anger, Disgust, and Pain.


Table I. Consensus Analysis of Nine Rating Scales Across Seven Rating Periods

Scale label   Consensus mean    SD     Ratio of eigenvalues (1st/2nd)   Negative scores    N    Confidence level
Anger              .831        .179           13.348/1.702                     0           35        1.0000
Disgust            .795        .082           13.620/1.180                     0           35         .9478
Fear               .699        .155            8.926/1.332                     0           35         .9392
Happy              .580        .131            5.998/1.200                     0           35         .9943
Pain               .793        .106           14.511/1.223                     0           35         .9841
Sadness            .674        .290            7.136/1.584                     1           35        1.0000
Surprise           .657        .230            8.959/1.096                     1           35         .9838
Interest           .101        .288            1.382/1.144                    16           35         .9363
Arousal            .150        .230            1.087/1.720                    17           35         .8486

Lesser agreement existed for Surprise and Fear, and for Happiness and Sadness. Based upon the measures provided by this model, consensual emotional response did not exist for the two scales labeled Arousal and Interest. The importance of this finding will be discussed later.

Answer key confidence levels were high (M = .95), even when emotional response was reported, but consensus appeared to be largely governed by agreement about the absence of negative emotion during the pleasant film clips, and the absence of positive emotion during the unpleasant film clips.4 The scales showing lower consensus (but nevertheless meeting the criteria for consensus), Sadness, Happiness, and Surprise, showed minor violations of this pattern. Because the presentation of films was counterbalanced, half of the subjects saw pleasant films and half saw unpleasant films before the second baseline. From the ratings, several subjects appeared to have carried residual negative emotional response into this second baseline period, producing mixed ratings. They may also have carried such response into the pleasant film ratings, as Ekman et al. (1980) noted in their discussion.

4This is far from a trivial finding, as several emotion theorists have hypothesized that complex emotional responses may be blends of basic emotions and thus have insisted that multiple scales be provided to permit subjects to express such complexity. A lack of response is thus as meaningful as positive response on each single scale with respect to each rating context.


Table II. Predicted Emotional Responses for Nine Rating Scales

Label        Baseline 1   Gorilla   Ocean   Puppy   Baseline 2   Cut finger   Death
Anger             0           0        0       0         0            0          0
Disgust           0           0        0       0         0            0          5
Fear              0           0        0       0         0            1          8
Happiness         0           4        0       0         0            0          0
Pain              0           0        0       0         0            8          8
Sadness           0           0        0       0         0            0          0
Surprise          0           0        0       0         0            8          6
Interest          0           1        1       1         0            3          5
  Low             0           1        1       1         0            3          5
  High            4           6        7       6         2            5          7
Arousal           0           1        2       1         0            1          3
  Low             0           1        2       1         0            1          3
  Medium          1           4        1       4         1            6          5
  High            2           3        6       3         4            8          8

Nor were the pleasant films unambiguously pleasant. Five subjects responded to the gorilla film with mild anger, and four responded to the puppy film with even stronger anger (e.g., 6, 7, or 8). Similarly, several subjects reported sadness when watching the gorilla film, and several reported disgust while watching the puppy film. These responses may be partly explained by the content of the films. The puppy ultimately chewed up and spit out the flower with which it was playing, evoking disgust in some subjects. The gorilla may have aroused sadness because it resided in a zoo. The lower consensus for the Fear and Surprise ratings results from several subjects who claimed to have felt no surprise or fear in response to the second workshop accident.

Model-predicted answer key responses for each of the scales during each of the viewing periods are shown in Table II. Examination of the answer key for the Happiness rating scale shows a clear difference in the level of enjoyment among subjects for the three film clips. The gorilla film was rated as 4, the ocean film as 0, and the puppy film as 0. The consensus model makes these predictions by weighting each subject's response by that subject's overall agreement with the group (the estimated probability of correctness). Even without the model's weighting, these responses were the modal responses among subjects for these films. It is only when all responses are averaged that higher numbers emerge for the ocean and puppy films. To see why this occurs, consider a group in which equal numbers of subjects give ratings of 0 and 8 and no other ratings. When these are averaged to obtain a mean of 4.0, it should be evident that this rating is an accurate portrayal of emotional response for no single subject in that group. Nor will it be a good predictor of the response of the next subject who views the film. The actual distribution of scores generally raises an alarm about using the mean as an indicator of central tendency (see the analysis of normality below).

During subsequent research, Ekman and Friesen edited the puppy film to remove the portion where the puppy eats the flower, and thereby obtained higher enjoyment ratings. Examination of the disgust and anger scales provided important clues to the differing emotions evoked in individual subjects by this particular film. The difference in content may account for the puppy film's higher frequency of smiling but lower duration and intensity of smiling, compared to the gorilla film (Ekman et al., 1980). Differing emotions were not reported across the nine scales for the ocean film. This analysis shows that the ocean film was simply not as enjoyable as the gorilla film. The finding that few subjects smiled while viewing it is entirely consistent with the self-report ratings obtained for the ocean film.

Although responses are typically distributed across a range of response options in any data set, even one showing strong consensus, the process of consensus modeling permits identification of those subjects with consistently divergent response patterns across the set of questions. These divergent subjects obtain negative consensus scores during analysis.

Table III. Consensus Analysis of Interest and Arousal Subgroups Partitioned by Sign of Score

Scale label            Consensus mean    SD     Ratio of eigenvalues (1st/2nd)   Negative scores    N
Interest                    .101        .288            1.382/1.144                    16           35
  Positive (low)            .334        .215            2.352/1.267                     1           19
  Negative (high)           .243        .256            1.285/1.045                     3           16
Arousal                     .150        .230            1.087/1.720                    17           35
  Positive (low)            .402        .164            2.556/1.720                     0           19
  Negative (high)           .232        .322            1.680/1.358                     5           16
  Neg-pos (high)a           .409        .226            2.005/1.304                     0           11
  Neg-neg (medium)a         .430        .273            2.108/–                         0            5

aThe arousal negative subgroup was repartitioned and reanalyzed based on the sign of the score. Neg = negative; Pos = positive.


Fig. 1. Examples of used and unused valenced rating scales: Ratings of Surprise for the cut finger film clip (top) and ratings of Anger for the puppy film clip (bottom). Std. Dev. = standard deviation.

By partitioning the data set based upon the sign of the consensus score (negative or positive) and reanalyzing the data, it can be determined whether response is idiosyncratic, or whether several divergent subjects form a coherent subgroup (perhaps because they are members of a subculture). Partitioning and reanalysis of the data for the Arousal and Interest scales yielded no coherent subgroups because the resulting partitioned data sets also failed to meet the criteria for consensus (see Table III). Instead, responses seemed to be distributed across the range of possible responses. However, subjects with negative consensus scores on the arousal scale tended to obtain negative consensus scores on the Interest scale as well (Goodman and Kruskal's gamma = .54). This suggests several conclusions: (1) the scales for Arousal and Interest do not lend themselves to this type of categorical analysis; (2) subjects are idiosyncratic but consistent in their response using these two scales; and (3) there is no single correct (i.e., consensual) rating response for arousal or interest with respect to these stimuli. This suggests a qualitative difference in behavior among subjects when using the Arousal and Interest scales compared to the remaining scales.
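Goodman and Kruskal's gamma, used above to describe the association between consensus scores on the two scales, is an ordinal measure based on concordant and discordant pairs. A minimal sketch follows; the inputs are hypothetical stand-ins, not the actual consensus scores.

```python
def goodman_kruskal_gamma(x, y):
    """Gamma = (C - D) / (C + D), where C and D are the numbers of concordant
    and discordant pairs; tied pairs are ignored, per the standard definition."""
    concordant = discordant = 0
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            s = (x[i] - x[j]) * (y[i] - y[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    return (concordant - discordant) / (concordant + discordant)

# Hypothetical consensus scores on the Arousal and Interest scales for five
# subjects; a positive gamma indicates that low (or negative) scores on one
# scale tend to go with low (or negative) scores on the other.
print(goodman_kruskal_gamma([-.30, .40, -.10, .50, .20],
                            [-.20, .30, -.40, .60, .10]))
```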

Analysis of Normality

Frequency histograms were produced for each of the nine rating scales, by stimulus rating period. In general, a given scale was either used or unused (mostly 0 ratings) for a given stimulus, consistent with the consensus analysis results described above and shown in Table II. When a scale was used, the distribution was frequently bimodal and generally included a substantial minority reporting no affect (0 ratings), as shown in Fig. 1. In contrast, ratings of arousal and interest were distributed across the entire range of scores for each rating period, as shown in Fig. 2.

Happiness ratings were spread across the entire scale for all three pleasant films, as shown in Fig. 3. However, none of the distributions was normal. A representative comparison of observed versus expected scores, and detrended deviation from an expected normal distribution, are plotted in Fig. 4. Consistent with consensus analysis, the modal response for both the puppy and ocean films was 0. Note that although the means for the three pleasant film clips were equal, the distributions were clearly different. These differences, especially with respect to those reporting no affect (0 ratings), are fully consistent with the differences in smiling noted by Ekman et al. (1980) and do not support Fridlund's interpretation that little smiling occurs because the ocean film evoked equal happiness but was asocial in content.
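A detrended normal Q-Q plot of the kind shown in Fig. 4 can be produced along the following lines. This is a sketch using SciPy and Matplotlib with placeholder ratings; it is not the software used to generate the original figure.

```python
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats

ratings = np.array([0, 0, 0, 0, 1, 2, 4, 4, 5, 6, 7, 8])  # placeholder Happiness ratings

# probplot returns theoretical normal quantiles, the ordered data,
# and a least-squares line fitted to the Q-Q points.
(osm, osr), (slope, intercept, _) = stats.probplot(ratings, dist="norm")

# "Detrended" view: deviation of each observed value from the fitted line;
# systematic curvature or clustering signals non-normality (e.g., bimodality).
deviation = osr - (slope * osm + intercept)

plt.axhline(0, color="grey")
plt.scatter(osm, deviation)
plt.xlabel("Expected normal quantile")
plt.ylabel("Deviation from normal")
plt.title("Detrended normal Q-Q plot (placeholder data)")
plt.show()
```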

Patterns of Correlation

For each scale, baseline periods were significantly correlated, ratings for pleasant films were significantly correlated, and ratings for unpleasant films were significantly correlated.


Fig. 2. Ratings of Arousal for the gorilla and puppy film clips. Std. Dev. = standard deviation.

As would be expected, arousal showed a low correlation with scales that were unused (those with mostly 0 ratings). For scales that were used, significant correlations were found between arousal and valence.


Fig. 3. Ratings of Happiness for the three pleasant film clips. Std. Dev. = standard deviation.

Correlations between happiness and arousal for the seven rating periods are shown in Table IV. A significant Spearman rank order correlation (p < .01) was found between Arousal and Happiness for each of the three pleasant films: gorilla (r = .68), ocean (r = .68), puppy (r = .44).


Fig. 4. Detrended normal Q-Q plot of Happiness ratings for the gorilla film clip.

However, the ocean film's Arousal and Happiness ratings were most strongly correlated with the second baseline period (r = .56) rather than with the other pleasant films. In contrast, the Arousal ratings for the gorilla and puppy films were not significantly correlated (p < .05) with the second baseline. This supports the interpretation that the ocean film clip had less emotional content and served as a less affective interval between the other two pleasant stimuli.

That most subjects reported arousal even when they reported no valenced emotion (e.g., during baseline periods) supports the consensus analysis evidence that valence is experienced differently than arousal, that it varies with the stimulus, and that it is only related to arousal when the magnitude of the rating is considered. In other words, arousal appears to be related to the selection of a particular value on the Happiness rating scale, but unrelated to whether that scale was used. The strong correlation between arousal and valence scales when a valenced emotion was reported suggests that subjects were using the arousal and valence scales in a consistent manner, on an individual basis. They were clearly using the arousal and valence scales inconsistently as a group because consensus emerged for valence but not for arousal, and because no consensus for arousal existed despite consensus for valence.

DISCUSSION

This reanalysis suggests that (1) the mean ratings used as norms were a misleading assessment of the happiness evoked by the film clips; (2) subject ratings were consensual, varying with stimulus properties for the rating scales labeled using valenced emotion terms, but were idiosyncratic, varying from a personal baseline, for the scales labeled using arousal terms; (3) subjects appeared to use the valence-related scales differently than the arousal-related scales across the rating contexts; and (4) the magnitude of ratings of valence appeared related to the magnitude of arousal reported when valenced emotion was reported (but not vice versa).

When emotional response ratings were treated as discrete, categorical data rather than as interval-scaled continuous data, results showed strong agreement among subjects with respect to scales labeled using emotion terms, including those labeled with the terms Anger, Disgust, Sadness, Happiness, Fear, and Surprise. Strong agreement was also found with respect to the scale labeled Pain. Strong disagreement among subjects was shown with respect to the scales labeled Interest and Arousal, across the spectrum of rating contexts. Further, stimuli considered equal in their ability to evoke Happiness ratings when responses were analyzed as interval-scaled data were found to be quite different in their enjoyment potential when analyzed discretely. This may account for the previously reported failure to find equal facial expressivity in response to equally rated film clips.

The analysis of normality suggests that averaged means do not accurately characterize group response for this data set. Further, substantial minorities report no affective response to stimuli, even where consensus suggests that such response is normative. When a group includes such subjects, attempts to correlate scale values with objectively measurable continuous variables such as facial movement are likely to underreport any relationship between the two (Hays, 1988). This difficulty is compounded when data are aggregated across individuals. That any statistical relationship between facial activity and self-reported emotional response exists in the literature suggests that a strong link between the two exists in reality, given the difficulties of measurement that must be overcome. Due to what appears to be a stimulus-related, all-or-nothing quality to emotional response, investigators may be justified in eliminating subjects who report no affect until the sources of such response are better understood.

Ruch (1995) provided evidence that correlations between self-reported affect and facial activity can be increased when methodology is improved. Some researchers have attempted to compensate for the difficulties inherent in using self-report data by standardizing ratings; by adjusting self-report ratings based upon some other variable, such as measured autonomic arousal; or by performing their correlations on a within-subject or individual-by-individual basis. However, these techniques still assume that self-report ratings are interval in nature when they are likely to be ordinal, at best. For example, standardizing ratings again assumes both a normal distribution and an interval scale and does not eliminate difficulties of interrater agreement. The relevance of this difficulty to the questions at hand should be carefully evaluated.

Even arousal ratings using the arousal and interest scales are not normally distributed in this data set (see Fig. 2). Adjusting self-report scores for an arousal baseline would correct for the effects of arousal upon the use of the remaining valenced scales. However, the relationship between arousal focus and valence focus must be better understood before such corrections can be confidently made. That arousal and valence ratings are correlated does not mean that they necessarily report aspects of the same experience.

Larsen and Diener (1985, 1987) have noted individual differences in the use of self-report scales similar to those reported here. However, they attributed such differences to variation in the subjective experience of emotion. Because we have no objective measure of emotion, we do not know whether individual differences in self-reported emotional response arise from differences in internal experience or from consistent and stable differences in the use of self-report rating scales. Further, we do not know whether arousal is correlated with emotion because it is an essential part of emotional experience, or because level of arousal has a global effect on rating behavior, independent of what is being rated.

Feldman's hypothesis that subjects differentially attend to arousal and valence, two dimensions of emotional experience, can be tested by asking subjects explicitly to differentially focus on these dimensions and noting any resulting changes in their behavior. In a sense, that is what has been done by Ekman et al. (1980), when subjects were asked to report arousal separately from the other labeled scales. Because this manipulation was not the objective of the study, no control condition was provided in which arousal and valence were confounded (i.e., when no separate Arousal scale was provided). It may be that the results analyzed above combine the behavior patterns of three groups: (1) those with an exclusive arousal focus, (2) those with a mixed focus that they did not dissociate, and (3) those with an exclusive valence focus. Group 1 above may have produced the unexpectedly large number of 0 ratings on the valence scales together with non-0 ratings of arousal for all stimuli. Group 2 may have produced the strongly correlated arousal and valence scores, albeit from differing individual baselines. Group 3 may have produced the consensual response on the valenced scales, based largely upon stimulus properties, with 0 or low ratings of arousal. This speculative explanation can be confirmed by studies that more deliberately manipulate instructions to subjects, or that apply cognitive approaches to studying attention.

Feldman's concept of an attentional focus provides a more complete explanation, capable of resolving difficulties encountered by theories that consider emotion to be synonymous with arousal. For example, Mandler (1984) viewed the intensity of an experienced emotion as a function of autonomic nervous system arousal, and Thayer (1989, p. 134) considered energetic arousal to be synonymous with positive affect and tense arousal to be synonymous with negative affect. Thayer (1986) demonstrated that the two dimensions of self-reported arousal, energetic arousal and tense arousal, both correlate with psychophysiological measures of autonomic arousal. Neither of these definitions is wholly consistent with the results produced here because they neglect instances in which self-reports of valence and arousal diverge.

Thayer (1989) mapped the items of his self-report adjective checklist for arousal onto the dimensions of positive and negative affect suggested by Watson and Tellegen (1985), suggesting that they are interchangeable labels for the same phenomenon. Watson and Tellegen's self-report space was further analyzed by Larsen and Diener (1992), who suggested a revised labeling and interpretation of the relevant dimensions as unpleasantness/pleasantness and activation. The findings reported here support observations by Larsen and Diener (1992) that the practice of labeling scales using adjectives from different octants of the emotion circumplex will produce different rating behavior, and that the dimensions of pleasantness or unpleasantness versus activation seem to vary independently of each other. To support this, Larsen and Diener (1992) described findings that the Velten mood induction techniques tend to change hedonic tone (evaluation) without affecting activation. My reanalysis confirms this.

While self-reported arousal does vary with external circumstances, and appears to have the characteristics of a state rather than a trait measurement (Matthews, Davies, & Lees, 1990), Matthews et al. noted the following:

. . . Revelle (personal communication [to Matthews et al.], July 11, 1988) pointed out that individuals' self-ratings of arousal may be affected by individual differences in characteristic baseline levels of arousal, so that arousal ratings are not directly comparable across subjects . . . . Thus, only a part of the interindividual variance in arousal scores will reflect absolute arousal values; a second part will reflect interindividual variation in baseline. (pp. 151-152)

The .54 gamma correlation between Arousal and Interest scores for the same subject may exist because both scales vary from the same baseline, not because they both measure the same construct.

The scales analyzed in this study drew their terms from different quadrants of Larsen and Diener's (1992) two-dimensional self-report space. The Interest, Arousal, and Surprise scales were labeled with terms from the activation dimension. The Happiness, Sadness, Anger, and Disgust scales were labeled with terms from the hedonic (pleasant/unpleasant) dimension. Pain did not appear in the circumplex because it is not usually considered an affect term, but it seems closest to terms like miserable or distressed in the hedonic dimension. Fear appears midway between the activation and hedonic dimensions, in a quadrant for activated unpleasant affect.

The analysis reported here supports Larsen and Diener's (1992) contention that the dimensions of activation and pleasantness/unpleasantness are orthogonal, at least with respect to introspective monitoring and self-report. In these results, the activation reported on the Arousal and Interest scales appears to vary differently than the remaining rating scales for the stimuli presented. Even the scales combining hedonic affect and activation, i.e., the Surprise and Fear scales, show considerable consensual response with strong ratings by subjects in response to the second unpleasant film clip (where an accidental death is shown). Although considered to be located in the high activation quadrant of Larsen and Diener's self-report affect circumplex, these scales nevertheless show consensual response. Surprise and fear typically involve strong autonomic arousal, correlated with self-report ratings of arousal. Here, there is no greater correlation between surprise and arousal than exists between happiness and arousal. However, it may be that subjects experienced less arousal when viewing a film than they might when fear involves personal threat. Nevertheless, this is problematic for Larsen and Diener's theory.

The qualitative differences in the use of rating scales noted in this study invite speculation about the effects of confounding valence and arousal in previous studies. For example, the distinction between valence and arousal parallels the distinction traditionally made between states that are emotional in nature and those that are not (Clore, 1992; Clore, Ortony, & Foss, 1987). It appears that a continuous distribution of idiosyncratic response, directly related to autonomic or reticular arousal and monitored with respect to a personal baseline, is more typical of the less emotional subjective states, including those recruited by Russell (1980) to fill out the activation quadrants of the emotion circumplex, labeled using terms such as dull, drowsy, relaxed, content, lively, peppy, and so on. My reanalysis suggests that these states are subjective in nature and accessible to introspection and thus to self-report, but that they vary due to factors specific and internal to the individual, and secondarily due to the characteristics of the stimulus or the appraisal of that stimulus, as hypothesized by Feldman (1995) and Blascovich (1990; Blascovich et al., 1992). Theorizing a dichotomy between shared, consensual emotional response to a particular stimulus and a generalized, personal, idiosyncratic level of engagement with the environment makes intuitive sense and seems to be important to the folk definition of emotion (Clore, 1992; Clore et al., 1987).

Whatever the reader might feel about the consensus modeling technique applied in this study, it should be clear from the distributions and from the differing modal responses that subject use of the rating scales was quite different for the Interest and Arousal scales than for the remaining rating scales. Although emotion may be expressed and experienced in combination with personal activation, at least some subjects appeared to be able to rate them independently. Further, maintaining the distinction between arousal and valenced emotion appears to be useful. When dependent variables behave differently and show a different relationship to an eliciting stimulus, it makes little methodological sense to include them in a single encompassing construct. Systematic investigation of the contribution of attentional focus may clarify the relationship between self-report, valenced emotion, and arousal. In the meantime, theorizing based upon a seeming incongruity between self-report and emotional behavior seems premature.


APPENDIX

The following description of the consensus model is adapted from Romney, Weller, and Batchelder (1986). The model uses the following notation:

d_i        the probability that a subject i knows the right answer to a given question
1 - d_i    the probability that the subject does not know the answer
L          the number of response options to a given question
1/L        the probability that the subject will guess the correct answer
1 - 1/L    the probability of guessing an incorrect answer
m_ij       the probability that two subjects i and j give the same answer to a given question

The parameter d_i is the subject's competence score. It is readily calculated if the answer key is known, because it is the proportion of questions answered correctly (T_i) adjusted by a correction for guessing:

    d_i = (T_i - 1/L) / (1 - 1/L)                                (1)

If the answer key is not known, the parameters are estimated using the following equations:

    m_ij = d_i d_j + (1 - d_i d_j)(1/L)                          (2)

    m*_ij = (m_ij - 1/L) / (1 - 1/L) = d_i d_j                   (3)

where m*_ij is an empirical point estimate of the proportion of matches between two subjects, corrected for guessing (on the assumption of no bias). Equation (3) is solved for d via minimum residual factor analysis to yield a least squares estimate of the d parameter (competence score) for each subject. Bayes' theorem is then used to estimate the answer key confidence levels, given the estimated values of d. Consensus analysis was implemented in software by Borgatti (1993).

The model provides three measures for evaluating the extent of consensus within a group: (1) eigenvalues showing a single dominant factor, (2) mean competence rating over .500, and (3) number of negative or low competence ratings in a group of subjects (more than one or two present in a data set suggests a lack of homogeneity even if the mean exceeds .500). Together, these criteria function similarly to a significance level, in the sense that they are (1) established based on experience with the domain of knowledge in question, (2) related to acceptable levels of error, and (3) preestablished when used for hypothesis testing.
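As a concrete illustration of the estimation just described, the sketch below recomputes competences from a small set of hypothetical categorical ratings and checks the three criteria. It is not the Anthropac implementation: a simple alternating least-squares fit stands in for minimum residual factor analysis, and the eigenvalue check plugs the implied communalities (d_i squared) into the diagonal before decomposition. The Bayes step for the answer key was sketched earlier in the text.

```python
import numpy as np

def estimate_competence(responses, L, iters=200):
    """Estimate competences d_i from categorical responses (subjects x items)
    by fitting m*_ij ~ d_i * d_j for i != j, assuming unbiased guessing
    among L alternatives, as in Eqs. (1)-(3)."""
    responses = np.asarray(responses)
    n = responses.shape[0]
    match = (responses[:, None, :] == responses[None, :, :]).mean(axis=2)
    mstar = (L * match - 1.0) / (L - 1.0)          # correction for guessing
    off = ~np.eye(n, dtype=bool)
    d = np.full(n, 0.5)
    for _ in range(iters):                          # crude least-squares fit of d_i d_j
        for i in range(n):
            denom = np.sum(d[off[i]] ** 2)
            if denom > 0:
                d[i] = np.sum(mstar[i, off[i]] * d[off[i]]) / denom
    return np.clip(d, -1.0, 1.0), mstar

def consensus_criteria(d, mstar):
    """The three rough criteria from the text: a dominant first factor,
    mean competence above .500, and no negative competence scores."""
    m = mstar.copy()
    np.fill_diagonal(m, d ** 2)                     # implied communalities on the diagonal
    eig = np.sort(np.linalg.eigvalsh(m))[::-1]
    ratio = eig[0] / max(eig[1], 1e-9)
    return (ratio > 3.0) and (d.mean() > 0.5) and bool((d >= 0).all()), ratio

# Hypothetical data: six subjects answer seven items on a 0-8 scale (L = 9),
# each giving the "true" answer with probability .8 and guessing otherwise.
rng = np.random.default_rng(0)
key = rng.integers(0, 9, size=7)
data = np.where(rng.random((6, 7)) < 0.8, key, rng.integers(0, 9, size=(6, 7)))
d, mstar = estimate_competence(data, L=9)
print(d.round(2), consensus_criteria(d, mstar))
```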

No experience using this model in this domain has been reported previously in the literature, except by this author. However, the model has been widely used in anthropology (Romney, Batchelder, & Weller, 1987) and in other domains within psychology. It can be used as either a formal model investigating the nature of knowledge in a domain, or as a simple measure of the properties of a particular data set. In this application, the broader assumptions of the model about culture are not claimed and the technique is used primarily to evaluate the nature of the response patterns among a set of subjects.

The criteria listed above are those considered by Romney et al. (1986), the developers of the model, to be indicative of consensus in other domains of cultural knowledge (e.g., classification of disease, parenting practices). Thus, they seem to be a reasonable standard for judging existence of consensus in this context. Behavior of the model has been tested using Monte Carlo simulation (as described by Batchelder & Romney, 1989). Competence ratings have been found to be normally distributed and differences between them can be tested using methods like normal curve tests and ANOVAs (Batchelder & Romney, 1989).

Answer key confidence levels depend upon the number of subjects and the extent of consensus within a group of subjects. The number of subjects needed to estimate an answer key with a specified level of confidence depends upon the mean competence of the group and can be estimated using the formal model (see Romney et al., 1986).

REFERENCES

Batchelder, W., & Romney, A. (1988). Test theory without an answer key. Psychometrika, 53, 71-92.

Batchelder, W., & Romney, A. (1989). New results in test theory without an answer key. In E. Roskam (Ed.), Mathematical psychology in progress (pp. 229-248). Heidelberg, Germany: Springer-Verlag.

Blascovich, J. (1990). Individual differences in physiological arousal and perception of arousal: Missing links in Jamesian notions of arousal-based behaviors. Personality and Social Psychology Bulletin, 16, 665-675.

Blascovich, J., Tomaka, J., Brennan, K., Kelsey, R., Hughes, P., Coad, M. L., & Adlin, R. (1992). Affect intensity and cardiac arousal. Journal of Personality and Social Psychology, 63, 164-174.

Borgatti, S. (1993). Anthropac 4.0. Columbia, SC: Analytic Technologies, Inc.

Clore, G. (1992). Cognitive phenomenology: Feelings and the construction of judgment. In L. Martin & A. Tesser (Eds.), The construction of social judgments (pp. 133-163). Hillsdale, NJ: Erlbaum.

Clore, G., Ortony, A., & Foss, M. (1987). The psychological foundations of the affective lexicon. Journal of Personality and Social Psychology, 53, 751-766.

Ekman, P., Friesen, W., & Ancoli, S. (1980). Facial signs of emotional experience. Journal of Personality and Social Psychology, 39, 1125-1134.

Feldman, L. (1995). Valence focus and arousal focus: Individual differences in the structure of affective experience. Journal of Personality and Social Psychology, 69, 153-166.

Fridlund, A. (1991). Sociality of solitary smiling: Potentiation by an implicit audience. Journal of Personality and Social Psychology, 60, 229-240.

Fridlund, A. (1994). Human facial expression: An evolutionary view. San Diego, CA: Academic Press.

Hays, W. (1988). Statistics (4th ed.). Austin, TX: Harcourt Brace College Publishers.

Hess, U., Banse, R., & Kappas, A. (1995). The intensity of facial expression is determined by underlying affective state and social situation. Journal of Personality and Social Psychology, 69, 280-288.

Klauer, K., & Batchelder, W. (1996). Structural analysis of subjective categorical data. Psychometrika, 61, 199-240.

Larsen, R., & Diener, E. (1985). A multitrait-multimethod examination of affect structure: Hedonic level and emotional intensity. Personality and Individual Differences, 6, 631-636.

Larsen, R., & Diener, E. (1987). Affect intensity as an individual difference characteristic: A review. Journal of Research in Personality, 21, 1-39.

Larsen, R., & Diener, E. (1992). Promises and problems with the circumplex model of emotion. In M. Clark (Ed.), Review of personality and social psychology (pp. 25-59). Newbury Park, CA: Sage.

Levenson, R. (1992). Autonomic nervous system differences among emotions. Psychological Science, 3, 23-27.

Mandler, G. (1984). Mind and body: Psychology of emotion and stress. New York: Norton.

Matthews, G., Davies, D. R., & Lees, J. (1990). Arousal, extraversion, and individual differences in resource availability. Journal of Personality and Social Psychology, 59, 150-168.

Romney, A. K., Weller, S., & Batchelder, W. (1986). Culture as consensus: A theory of cultural and informant accuracy. American Anthropologist, 88, 313-338.

Romney, A. K., Batchelder, W., & Weller, S. (1987). Recent applications of cultural consensus theory. American Behavioral Scientist, 31, 163-177.

Ruch, W. (1995). Will the real relationship between facial expression and affective experience please stand up: The case of exhilaration. Cognition and Emotion, 9, 33-58.

Russell, J. (1980). A circumplex model of affect. Journal of Personality and Social Psychology, 39, 1161-1178.

Thayer, R. (1986). Activation-Deactivation Adjective Checklist: Current overview and structural analysis. Psychological Reports, 58, 607-614.

Thayer, R. (1989). The biopsychology of mood and arousal. New York: Oxford University Press.

Townsend, J., & Ashby, F. G. (1984). Measurement scales and statistics: The misconception misconceived. Psychological Bulletin, 96, 394-401.

Watson, D., & Tellegen, A. (1985). Toward a consensual structure of mood. Psychological Bulletin, 98, 219-235.

Weller, S., & Romney, A. K. (1988). Systematic data collection. Newbury Park, CA: Sage.