Paired Comparisons of Color Differences



JOURNAL OF THE OPTICAL SOCIETY OF AMERICA

Paired Comparisons of Color Differences

Y. SUGIYAMA* AND HILTON WRIGHT

Division of Applied Physics, National Research Council, Ottawa, Canada

(Received 26 December 1962)

A group of 11 observers judged total color differences between a number of pairs of Munsell colors. The analysis of the judgments made for each observer and the group was carried out by three statistical methods: the Scheffé method, the Morrissey-Gulliksen matrix method, and the modified Thurstone-Mosteller method. The three methods give essentially the same results, but the Scheffé method is preferred because it allows a more detailed treatment of the observational data. Some order and orientation effects arising from the manner of presentation of the color pairs were found to be significant. The relationship between the judgments and their statistical estimates for different rating scales is investigated, and a nonlinear relation is found to hold for the case where the rating scale is too coarse for the given set of color samples.

1. INTRODUCTION

The method of paired comparisons¹ has important applications in the fields of sociology,² biology,³ and physics⁴ as a technique for the scaling of stimuli. An extensive investigation on the judgment of color differences by the method of paired comparisons is being carried out by the OSA Committee on Uniform Color Scales. The purpose of this paper is to examine a few of the fundamental problems relating to the judgment of color differences, and also to introduce and compare three different statistical procedures which may be used for the scaling of color differences.

2. EXPERIMENT

Eleven observers made comparisons between different pairs of Munsell papers; the colors of a pair had the same Munsell hue but differed in Munsell chroma, and each pair represented a different hue. All samples had approximately the same luminous reflectance. The samples were measured on a spectrophotometer before and after the experiment to verify that the colors had undergone no serious changes during the course of the experiment. The specifications of the samples in terms of CIE (x,y,Y) coordinates computed with respect to the actual source used in the experiment are given in Table I. Each sample was hexagonal in shape with a maximum diameter of 2 in. Pairs of colors were mounted on a 94-in.X3R-in. sheet of Munsell value 5.6/. There was no separation between the members of a pair. The samples were observed in a booth with diffuse Macbeth artificial daylight providing an illuminance of about 650 lm/m². The array of two pairs of samples subtended an angle of approximately 5° at the eye position. All observers viewed the samples normally by making use of a fixed headrest.

The relative spectral distribution of irradiance provided by the Macbeth source at the position of the samples was measured and employed in the evaluation of the CIE (x,y,Y) coordinates of the samples. The chromaticity coordinates of the source were x=0.318 and y=0.333.

* Postdoctorate Fellow from Osaka Industrial Research Institute, Ikeda Branch, Japan.

1 J. P. Guilford, Psychometric Methods (McGraw-Hill Book Company, Inc., New York, 1954), 2nd ed.

2 L. L. Thurstone, The Measurement of Values (The University of Chicago Press, Chicago, Illinois, 1959).

3 N. T. Gridgeman, Biometrics 11, 335-344 (1955).

4 S. S. Stevens, J. Acoust. Soc. Am. 27, 815-829 (1955).

Two paired samples were arranged in juxtaposition so that one pair, i, say, was on the observer's left and the other pair, j, was to his right. A mask of Munsell gray, value 5.6/, 11 in. × 7 in., with cutouts in the center served as a guidesheet to locate the various pair combinations as shown in Fig. 1. There was no separation between the nearest points of members of the different pairs. Pair combinations were judged four different ways to minimize systematic errors due to both the left-right and up-down positioning of the samples. The modes of presentation are shown in Fig. 1. Case A shows the high chromas as the top colors, represented by the shaded hexagons; Case B shows the high chromas occupying corner positions diametrically opposite each other. The other two modes of presentation were obtained by interchanging the left pair with the right pair for both Cases A and B. The paired combinations were presented in a random sequence to every observer. The instructions issued for judging the color differences were as follows:

"It is required to estimate the total color difference between pairs of colors. In making the judgments, if the difference between the pair of colors on the right is

TABLE I. Specification of color samples used in the experiment.

Color pair    CIE (x,y,Y) coordinates
              x        y        Y (%)
1             0.417    0.419    27.5
              0.383    0.394    27.4
2             0.277    0.382    28.8
              0.294    0.362    28.4
3             0.258    0.336    29.1
              0.277    0.345    28.8
4             0.244    0.266    27.6
              0.266    0.287    28.4
5             0.307    0.284    28.0
              0.312    0.297    27.0
6             0.364    0.315    28.4
              0.350    0.325    29.0

1214

VOLUME 53, NUMBER 10, OCTOBER 1963


[Figure 1: two panels, CASE A and CASE B, each showing PAIR i on the left and PAIR j on the right.]

FIG. 1. Modes of orientation of the color pairs. In Case A the high chromas (shaded hexagons) are the top colors. In Case B the high chromas are diametrically opposite each other.

larger than the difference between the pair of colors on the left, then the answer will be plus (+) some integer. However, if the right difference is smaller than the left difference then the answer will be minus (-) some integer. In the case of a tie the answer is zero. Larger integers indicate larger differences."

Judgments were made on all possible pair combinations for all four modes of presentation. A complete trial consisted of 60 judgments. One practice trial was made first by each observer. The results of the first trial were recorded but were not employed in the final analysis for the group. Half of the judgments were made at one sitting and the usual time lapse between sittings was about 2 days. Each observer ran through the experiment four times, the average time between repeat runs being about one week.
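The figure of 60 judgments per trial follows from the design: six color pairs give 15 pair combinations, and each combination is shown in four modes of presentation. A minimal sketch of that bookkeeping is given below; the mode labels are illustrative names, not terms used in the paper.

```python
from itertools import combinations

N_PAIRS = 6  # color pairs listed in Table I

# Hypothetical labels: orientation Case A or B, each shown as first arranged
# and again with the left and right pairs interchanged.
MODES = ["A", "A-interchanged", "B", "B-interchanged"]


def trial_presentations(n_pairs=N_PAIRS, modes=MODES):
    """Return every (i, j, mode) presentation in one complete trial."""
    pair_combinations = list(combinations(range(1, n_pairs + 1), 2))
    return [(i, j, mode) for (i, j) in pair_combinations for mode in modes]


presentations = trial_presentations()
print(len(presentations))  # 15 combinations x 4 modes
```

In the experiment the presentations were shown in a random sequence; shuffling the returned list (for example with `random.shuffle`) would model that step.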

3. ANALYSIS

Paired comparison data can be analyzed by various methods. Jackson and Fleckenstein⁵ have given an evaluation of a number of statistical methods commonly used in psychometrics. The observational data of the present experiment were analyzed in accordance with the Scheffé method,⁶ the Morrissey-Gulliksen matrix method,⁷ and the modified Thurstone-Mosteller method.⁸

3.1 Scheffé Method

The Scheffé method is designed to give a comprehensive treatment of paired comparison data. Samples (i.e., differences between pairs of colors, in this case) are judged by means of a rating scale, and judgments must be made on all pair combinations including the reverse order (sometimes spatial, sometimes temporal) of every combination. The selection of a suitable scale for rating the samples plays an important part in the outcome of the results, and Sec. 4 of this paper deals with the effects of employing different rating scales. At this point it suffices to mention that the results of the Scheffé analysis are based on scales originally chosen by the observers themselves. As is shown later, the original scales proved to be of approximately optimum fineness.

The analysis of the data for the individual observers and the group is based on the judgments made in the three repeat runs only. Case A and Case B judgments were analyzed separately. For the present, however, the discussion is concerned with the over-all results, called Case C, which consists of the combination of both Case A and Case B judgments.

A brief account of the main formulas in Scheffé's method is given. When the samples are arranged so that Pair i is on the observer's left and Pair j is on his right, the mean rating, $\hat{\mu}_{ij}$, of Pair i over Pair j for the pair combination (i,j) in the order (i,j) is given by

$$\hat{\mu}_{ij} = \sum_{k=1}^{r} x_{ijk} / r, \qquad (1)$$

where $x_{ijk}$ refers to the kth judgment made on ordered pair combination (i,j) and r is the number of judgments.⁹

The mean rating of Pair i over Pair j in the order (j,i) is denoted by $-\hat{\mu}_{ji}$. Normally it would be expected that the mean ratings, $\hat{\mu}_{ij}$ and $-\hat{\mu}_{ji}$, would be equal; however, in an actual experiment real differences may occur, signifying an order effect. The order effect, $\hat{\delta}_{ij}$, is given by

$$\hat{\delta}_{ij} = (\hat{\mu}_{ij} + \hat{\mu}_{ji})/2. \qquad (2)$$

The average order effect, $\hat{\delta}$, for all the pair combinations is given by

$$\hat{\delta} = \frac{1}{M} {\sum_{i<j}}' \hat{\delta}_{ij}, \qquad (3)$$

where the primed summation sign denotes the sum over all i and j with i<j, and M is the total number of paired combinations, that is, M = m(m-1)/2, with m being the number of pairs.

The average rating, $\hat{\pi}_{ij}$, of Pair i over Pair j for the two orders (i,j) and (j,i) is

$$\hat{\pi}_{ij} = (\hat{\mu}_{ij} - \hat{\mu}_{ji})/2. \qquad (4)$$

From a set of values, $\hat{\pi}_{ij}$, scale values, $\hat{\alpha}_i$, characterizing the comparative color differences of the pairs can be computed.¹⁰ With the condition that

$$\sum_{i=1}^{m} \hat{\alpha}_i = 0,$$

the scale value, $\hat{\alpha}_i$, for Pair i is simply the average of all $\hat{\pi}_{ij}$ for j = 1 to m, that is

$$\hat{\alpha}_i = \frac{1}{m} \sum_{j=1}^{m} \hat{\pi}_{ij}. \qquad (5)$$

The above condition

$$\sum_{i=1}^{m} \hat{\alpha}_i = 0$$

means that the mean of the scale values is set equal to zero. Contrary to the usual situation with this method, in the present experiment positive values of $\hat{\alpha}_i$ indicate comparative color differences which are smaller than the mean, zero, and negative values of $\hat{\alpha}_i$ indicate comparative color differences which are larger than the mean. This is a consequence of assigning the index i to the left difference when the observers have been instructed to rate the right difference relative to the left difference.

5 J. E. Jackson and M. Fleckenstein, Biometrics 13, 51 (1957).

6 H. Scheffé, J. Am. Statist. Assoc. 47, 381 (1952).

7 H. Gulliksen, Psychometrika 21, 125 (1956); J. H. Morrissey, J. Opt. Soc. Am. 45, 373 (1955).

8 W. A. Glenn and H. A. David, Biometrics 16, 86 (1960).

9 A circumflex over a symbol denotes a sample estimate of that parameter. Symbols without circumflexes denote population parameters.

10 In the past, the term relative color differences has been used for nonabsolute color differences of any sort; that is, for example, either for color differences forming a ratio scale or for those forming an interval (difference) scale. At the suggestion of G. L. Howett, we are introducing here the term comparative color differences to refer specifically to values forming an interval scale that differs from the absolute scale by an additive constant.

[Figure 2: one row for each of Observers 1 to 11 and one for the GROUP; horizontal axis SCALE VALUES $\hat{\alpha}_i$ running from +2.0 on the left to -2.0 on the right; separate plotting symbols for $\hat{\alpha}_6$, $\hat{\alpha}_5$, $\hat{\alpha}_4$, $\hat{\alpha}_3$, $\hat{\alpha}_2$, $\hat{\alpha}_1$; an arrow marks the DIRECTION OF LARGER COLOR DIFFERENCE.]

FIG. 2. Scale values, $\hat{\alpha}_i$, computed by the Scheffé method for Case C (combination of Case A and Case B) data for each observer and the group. $\hat{\alpha}_i$ represents the comparative color difference of Pair i. Pairs judged as having larger color differences than the mean color difference lie to the right of zero.
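Equations (1) to (5) amount to a few array operations. The sketch below is an illustration only: it assumes the ratings are held in an array `x[i, j, k]` (Pair i on the left, Pair j on the right, kth judgment), and the function name, the array layout, and the synthetic ratings are assumptions rather than anything specified in the paper.

```python
import numpy as np


def scheffe_estimates(x):
    """Return (mu, delta, delta_bar, pi, alpha) for ratings x[i, j, k].

    x[i, j, k] is the k-th rating given with Pair i on the left and Pair j on
    the right (0-based indices). Entries with i == j are never presented and
    are ignored.
    """
    x = np.asarray(x, dtype=float)
    m = x.shape[0]
    mu = x.mean(axis=2)              # Eq. (1): mean rating per ordered combination
    np.fill_diagonal(mu, 0.0)        # no pair is compared with itself
    delta = (mu + mu.T) / 2.0        # Eq. (2): order effect
    pi = (mu - mu.T) / 2.0           # Eq. (4): average rating over both orders
    upper = np.triu_indices(m, k=1)  # the M = m(m-1)/2 combinations with i < j
    delta_bar = delta[upper].mean()  # Eq. (3): average order effect
    alpha = pi.sum(axis=1) / m       # Eq. (5): scale values, which sum to zero
    return mu, delta, delta_bar, pi, alpha


# Synthetic, noise-free ratings (NOT data from the paper): subtractive scale
# values plus a constant order effect of 0.25.
true_alpha = np.array([-1.0, -0.5, 0.0, 0.25, 0.5, 0.75])
m, r = true_alpha.size, 3
mean_rating = true_alpha[:, None] - true_alpha[None, :] + 0.25
x = np.repeat(mean_rating[:, :, None], r, axis=2)

mu, delta, delta_bar, pi, alpha = scheffe_estimates(x)
print(np.round(alpha, 3), round(float(delta_bar), 3))
```

With noise-free subtractive input the function returns the scale values that were put in, which is a convenient check before applying it to real ratings.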

Figure 2 shows the scale values $\hat{\alpha}_i$ obtained for Case C data for each observer and the group. The $\hat{\alpha}_i$ for a single observer are based on 3 × 60 judgments, and the $\hat{\alpha}_i$ for the group are based on 3 × 60 × 11 judgments. It is evident that there is a general consensus as to the ranking of the color differences by the individual observers.

Intervals of 95% confidence were established for each $\hat{\alpha}_i$ according to Scheffé's S test.¹¹ The confidence intervals for the $\hat{\alpha}_i$ of a typical observer and the group are illustrated in Fig. 3. These intervals can be interpreted to mean that if the observer or the group were to repeat the experiment, there would be a 95% chance that the $\hat{\alpha}_i$ obtained would lie within the regions shown. Furthermore, an overlapping of two or more confidence intervals, such as occurs for $\hat{\alpha}_2$, $\hat{\alpha}_3$, and $\hat{\alpha}_4$ of Observer 8, indicates that the comparative color differences of Pairs 2, 3, and 4 were not found to be significantly different.

[Figure 3: two rows, OBSERVER (8) and GROUP; horizontal axis SCALE VALUES running from +2.0 on the left to -2.0 on the right; each $\hat{\alpha}_i$ is drawn with its confidence interval.]

FIG. 3. Intervals of 95% confidence for the $\hat{\alpha}_i$ of a typical observer (8) and the group. These intervals represent the regions in which the $\hat{\alpha}_i$ would be expected to fall if the experiment were repeated.

A complete and more formal treatment of the experimental data can be made by carrying out an analysis of variance on the judgments. Although the details of such an analysis are described in Scheffé's paper, some of the essential formulas are quoted here to assist in the understanding of the statistical inferences to be deduced from the analysis.

(i) A test can be made to determine if the $\hat{\alpha}_i$ are significantly different from zero. This is done by calculating the ratio

$$\frac{S_\alpha/(m-1)}{S_e/2M(r-1)}, \qquad (6)$$

where $S_\alpha$ and $S_e$ are the sum of squares of the scale values $\hat{\alpha}_i$ and the sum of squares of the judgments $x_{ijk}$, respectively. These quantities are computed from

$$S_\alpha = 2rm \sum_{i=1}^{m} \hat{\alpha}_i^2 \qquad (7)$$

and

$$S_e = \sum_{i=1}^{m} \sum_{j=1}^{m} \sum_{k=1}^{r} x_{ijk}^2 - r \sum_{i=1}^{m} \sum_{j=1}^{m} \hat{\mu}_{ij}^2. \qquad (8)$$

The Ratio (6) is distributed approximately as the F statistic with (m-1) degrees of freedom in the numerator and 2M(r-1) degrees of freedom in the denominator. From statistical tables the upper 5% F point for (m-1) and 2M(r-1) degrees of freedom, respectively, can be found. If the calculated Ratio (6) is greater than the upper 5% F point then the $\hat{\alpha}_i$ are different from zero, that is, the pairs have different comparative color differences.

(ii) A test can be made to determine if the statistical estimates, $\hat{\alpha}_i - \hat{\alpha}_j$, adequately predict the judgments, $\hat{\pi}_{ij}$. This tests the relation that

$$\pi_{ij} = \alpha_i - \alpha_j, \qquad (9)$$

which is referred to as the "hypothesis of subtractivity." The testing of this hypothesis constitutes one of the most important items of interest in Scheffé's paper. To do this the ratio

$$\frac{S_\gamma/(M-m+1)}{S_e/2M(r-1)} \qquad (10)$$

is evaluated, where $S_\gamma$ is the sum of squares of the deviations from subtractivity, and is given by

$$S_\gamma = 2r {\sum_{i<j}}' \{\hat{\pi}_{ij} - (\hat{\alpha}_i - \hat{\alpha}_j)\}^2. \qquad (11)$$

The significance of Ratio (10) is determined in a manner similar to that of Ratio (6). If Ratio (10) is not significant then the observed color-difference ratings $\hat{\pi}_{ij}$ are directly proportional to their corresponding statistical estimates, $\hat{\alpha}_i - \hat{\alpha}_j$.

11 H. Scheffé, The Analysis of Variance (John Wiley & Sons, Inc., New York, 1958).

(iii) A test of significance for the order effects can be made by evaluating the significance of the ratio

$$\frac{S_\delta/M}{S_e/2M(r-1)}, \qquad (12)$$

where $S_\delta$ is the sum of squares of the order effects $\hat{\delta}_{ij}$ and is given by

$$S_\delta = 2r {\sum_{i<j}}' \hat{\delta}_{ij}^2. \qquad (13)$$

(iv) The significance of the average order effect, $\hat{\delta}$, is found by determining the significance of the ratio

$$\frac{2rM\hat{\delta}^2/1}{S_e/2M(r-1)}. \qquad (14)$$

Confidence intervals for $\hat{\delta}$ can be established by use of the t-test.⁶

(v) The significance of the differences among order effects, $\hat{\delta}_{ij} - \hat{\delta}$, computed from Eqs. (2) and (3), is determined by evaluating the significance of the ratio

$$\frac{S_{\delta'}/(M-1)}{S_e/2M(r-1)}, \qquad (15)$$

where $S_{\delta'}$ is the sum of squares of the differences among order effects, and is given by

$$S_{\delta'} = 2r {\sum_{i<j}}' (\hat{\delta}_{ij} - \hat{\delta})^2. \qquad (16)$$

TABLE II. Results of analysis of variance for Case C data for the group.

Test     Sum of squares                   Degrees of freedom   Ratio               Upper 5% F point   Significant   Not significant
(i)      $S_\alpha$ = 2429.99             5                    369.3 [Eq. (6)]     2.21               X
         $S_e$ = 2567.00                  1950
(ii)     $S_\gamma$ = 22.48               10                   1.71 [Eq. (10)]     1.83                             X
         $S_e$ = 2567.00                  1950
(iii)    $S_\delta$ = 144.53              15                   7.32 [Eq. (12)]     1.67               X
         $S_e$ = 2567.00                  1950
(iv)     $2rM\hat{\delta}^2$ = 129.35     1                    98.3 [Eq. (14)]     3.84               X
         $S_e$ = 2567.00                  1950
(v)      $S_{\delta'}$ = 15.18            14                   <1.0 [Eq. (15)]     1.70                             X
         $S_e$ = 2567.00                  1950
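The sums of squares in Eqs. (7), (8), (11), (13), and (16), and the five ratios built from them, can be computed directly from an array of ratings. The following is a sketch under the same assumed layout `x[i, j, k]` (Pair i on the left, Pair j on the right, kth judgment); the function name, dictionary keys, and randomly generated ratings are illustrative assumptions, not the authors' procedure or data.

```python
import numpy as np


def scheffe_anova(x):
    """Sums of squares and F ratios for ratings x[i, j, k] (i != j used)."""
    x = np.asarray(x, dtype=float)
    m, _, r = x.shape
    M = m * (m - 1) // 2
    off = ~np.eye(m, dtype=bool)          # mask of ordered combinations i != j
    upper = np.triu_indices(m, k=1)       # combinations with i < j

    mu = np.where(off, x.mean(axis=2), 0.0)
    pi = (mu - mu.T) / 2.0
    delta = (mu + mu.T) / 2.0
    alpha = pi.sum(axis=1) / m
    delta_bar = delta[upper].mean()
    gamma = pi - (alpha[:, None] - alpha[None, :])  # deviations from subtractivity

    ss = {
        "alpha": 2 * r * m * np.sum(alpha**2),                           # Eq. (7)
        "e": np.sum(x[off] ** 2) - r * np.sum(mu**2),                    # Eq. (8)
        "gamma": 2 * r * np.sum(gamma[upper] ** 2),                      # Eq. (11)
        "delta": 2 * r * np.sum(delta[upper] ** 2),                      # Eq. (13)
        "delta_mean": 2 * r * M * delta_bar**2,                          # numerator of (14)
        "delta_prime": 2 * r * np.sum((delta[upper] - delta_bar) ** 2),  # Eq. (16)
    }
    error_mean_square = ss["e"] / (2 * M * (r - 1))
    ratios = {
        "(i)": ss["alpha"] / (m - 1) / error_mean_square,        # Ratio (6)
        "(ii)": ss["gamma"] / (M - m + 1) / error_mean_square,   # Ratio (10)
        "(iii)": ss["delta"] / M / error_mean_square,            # Ratio (12)
        "(iv)": ss["delta_mean"] / 1 / error_mean_square,        # Ratio (14)
        "(v)": ss["delta_prime"] / (M - 1) / error_mean_square,  # Ratio (15)
    }
    return ss, ratios


# Synthetic ratings (NOT data from the paper): subtractive scale values, a
# constant order effect of 0.25, and unit-variance noise.
rng = np.random.default_rng(0)
true_alpha = np.array([-1.0, -0.5, 0.0, 0.25, 0.5, 0.75])
m, r = true_alpha.size, 4
x = (true_alpha[:, None] - true_alpha[None, :] + 0.25)[:, :, None]
x = x + rng.normal(scale=1.0, size=(m, m, r))

ss, ratios = scheffe_anova(x)
for test, value in ratios.items():
    print(test, round(float(value), 2))
```

Two identities are useful checks on any implementation: $S_\delta = 2rM\hat{\delta}^2 + S_{\delta'}$, and the total sum of squared ratings equals $S_\alpha + S_\gamma + S_\delta + S_e$.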

The above five tests were applied to Case C data for the group with results given in Table II. Comparing the experimental F ratios listed in Col. 4 with the corresponding upper 5% F points given in Col. 5, the following inferences can be drawn:

(i) One or more scale values $\hat{\alpha}_i$ are significantly different from the mean (which was set equal to zero). As expected, this confirms the plot given in Fig. 2.

(ii) The deviations from subtractivity are not significant. Therefore the statistical estimates, $\hat{\alpha}_i - \hat{\alpha}_j$, adequately predict the observed color differences, $\hat{\pi}_{ij}$; that is, the hypothesis of subtractivity, $\pi_{ij} = \alpha_i - \alpha_j$, holds.

(iii) At least one order effect is significant. This means that some of the observed color differences depend upon the position of the color pairs.

(iv) The average order effect, $\hat{\delta}$, is significant. The 95% confidence interval for $\hat{\delta}$ (= 0.26) is ±0.06. This means that, on the average, the comparative color difference of a pair placed on the observer's left appears to be smaller than when the same pair is placed on the observer's right. Such an effect is interesting but no attempt is made here to give a psychological reason for it.

(v) The differences among order effects are not significant. This means that the order effects $\hat{\delta}_{ij}$ for all paired combinations are essentially the same and equal to the average order effect, $\hat{\delta}$.
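The ratios in Col. 4 of Table II can be recomputed from the tabulated sums of squares and degrees of freedom, and the significance calls follow from comparing each ratio with its upper 5% F point. The short sketch below uses only numbers printed in Table II; the variable and function names are illustrative.

```python
# Error sum of squares and its degrees of freedom, from Table II.
S_E, DF_E = 2567.00, 1950

# test: (sum of squares, degrees of freedom, upper 5% F point), from Table II.
TABLE_II = {
    "(i)": (2429.99, 5, 2.21),
    "(ii)": (22.48, 10, 1.83),
    "(iii)": (144.53, 15, 1.67),
    "(iv)": (129.35, 1, 3.84),
    "(v)": (15.18, 14, 1.70),
}


def f_ratio(sum_of_squares, degrees_of_freedom):
    """Mean square of the effect divided by the error mean square."""
    return (sum_of_squares / degrees_of_freedom) / (S_E / DF_E)


results = {}
for test, (ss, df, f_point) in TABLE_II.items():
    ratio = f_ratio(ss, df)
    results[test] = (ratio, ratio > f_point)
    print(f"{test:6s} ratio = {ratio:8.2f}  significant = {ratio > f_point}")
```

The recomputed ratios agree closely with Col. 4; small differences, such as in the first row, can arise from rounding in the tabulated sums of squares.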

A similar analysis was carried out on Case C data for each observer. The statistical conclusions deduced from the analysis may be summarized as follows:

(i) The scale values $\hat{\alpha}_i$ were found to be significantly different from zero for each observer, as was already indicated in Fig. 2.

(ii) The hypothesis of subtractivity was found to hold for six of the 11 observers.

(iii) Order effects were found to be significant for seven of the observers.

(iv) The average order effect for different observers ranged between -0.04 and +1.21, which seems to be a large variation.

(v) The differences among order effects were not significant for five of the seven observers who showed order effects.

A separate analysis of Case A and Case B data was also carried out for each observer and the group. The results may be summarized as follows:

(i) The scale values, $\hat{\alpha}_i$, were significant. Figure 4 illustrates the results for both Case A and Case B data for the group. Scale values for Case C data have also been included for comparison purposes. The 95% confidence intervals associated with each $\hat{\alpha}_i$, established by means of the S test, are also given. It is evident that the scaling is not influenced by either mode of orientation, A or B. Very similar results (not shown here) were obtained for each observer.

(ii) The hypothesis of subtractivity was acceptable for the group for both Cases, A and B. Seven observers satisfied the hypothesis for Case A, eight observers for Case B.

(iii) Order effects for the group for Case A data were barely significant but highly significant for Case B data. For individual observers it can be stated that the order effects for Case B were generally of higher significance than those for Case A.

(iv) Values of the average order effect for the group, with 95% confidence intervals, were as follows: $\hat{\delta}$ = 0.10 ± 0.07 for Case A, $\hat{\delta}$ = 0.41 ± 0.07 for Case B. It is perhaps reasonable to assume that $\hat{\delta}$ is negligible for Case A. The large order effect for Case B is possibly the result of some real, but unknown, psychological phenomenon. For individual observers $\hat{\delta}$ ranged between -0.28 and +0.80 for Case A, and between -0.06 and +1.61 for Case B. The positive $\hat{\delta}$'s greatly outnumbered the negative $\hat{\delta}$'s.

(v) For both Cases, A and B, order effects for the group, as generally for the individual observers, were found to be the same for all pair combinations.

[Figure 4: GROUP DATA; three rows, CASE A, CASE B, and CASE C; horizontal axis SCALE VALUES $\hat{\alpha}_i$ running from +1.5 on the left to -0.5 on the right; separate plotting symbols for $\hat{\alpha}_6$ through $\hat{\alpha}_1$.]

FIG. 4. Scale values, $\hat{\alpha}_i$, with 95% confidence intervals, resulting for Case A, B, and C data for the group.

3.2 Morrissey-Gulliksen Matrix Method

The Morrissey-Gulliksen matrix method, as compared to the Scheffé method, provides a relatively simpler treatment of the observational data. As a particular feature, the Morrissey-Gulliksen method does not require judgments on all pair combinations. This may prove to be of advantage at times, since the inclusion of many pairs will, in practice, prohibit the use of the Scheffé method.

The Morrissey-Gulliksen method employs a matrix procedure for computing the scale values, $\hat{\alpha}_i$. The least-squares fit between the judgments $\hat{\pi}_{ij}$ and their statistical estimates, $\hat{\alpha}_i - \hat{\alpha}_j$, requires the minimizing of

$$Q_1 = {\sum_{i<j}}' [\hat{\pi}_{ij} - (\hat{\alpha}_i - \hat{\alpha}_j)]^2, \qquad (17)$$

where $\hat{\pi}_{ij}$ may represent average score (as in the Scheffé method), total score, or proportions in terms of normal deviates.

Equation (17) includes the assumption of subtractivity. It can be shown that if the $\hat{\pi}_{ij}$ represent average scores, obtained from Eq. (4) for all pair combinations (i,j), then the resulting scale values, $\hat{\alpha}_i$, for the group and for each observer will be identical to those calculated by the Scheffé method (see Fig. 2). Scheffé's S test can be applied to establish confidence intervals for the $\hat{\alpha}_i$.¹¹
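The minimization in Eq. (17) can be illustrated with an ordinary least-squares solve. The sketch below is not the matrix procedure of Morrissey and Gulliksen itself; it is a generic formulation using `numpy.linalg.lstsq`, with the sum-to-zero condition added as an extra row. The function name, the 0-based pair indices, and the example values are assumptions made for illustration.

```python
import numpy as np


def least_squares_scale(pairs, pi_hat, m):
    """Minimize Eq. (17) over the judged combinations, with sum(alpha) = 0.

    pairs  : list of (i, j) with i < j (0-based) that were actually judged
    pi_hat : average rating of Pair i over Pair j for each listed combination
    m      : number of pairs
    """
    design = np.zeros((len(pairs) + 1, m))
    target = np.zeros(len(pairs) + 1)
    for row, ((i, j), value) in enumerate(zip(pairs, pi_hat)):
        design[row, i] = 1.0   # +alpha_i
        design[row, j] = -1.0  # -alpha_j
        target[row] = value
    design[-1, :] = 1.0        # extra row fixing the mean of the scale values at zero
    alpha, *_ = np.linalg.lstsq(design, target, rcond=None)
    return alpha


# Hypothetical example with one combination, (1, 3), left unjudged.
true_alpha = np.array([-0.6, -0.2, 0.3, 0.5])
pairs = [(0, 1), (0, 2), (0, 3), (1, 2), (2, 3)]
pi_hat = [true_alpha[i] - true_alpha[j] for i, j in pairs]
print(np.round(least_squares_scale(pairs, pi_hat, m=4), 3))
```

When every combination is judged, this solve returns the row averages of Eq. (5), in line with the statement that the two methods then give identical scale values. The solution is unique only if the judged combinations link all pairs together.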

3.3 Modified Thurstone-Mosteller Method

The object of the modified Thurstone-Mosteller method⁸ is to predict the proportion of observers who find the difference of Pair j "larger than," "equal to," or "smaller than" the difference of Pair i; thus the method is equivalent to a three-point system of judgment. The method also incorporates the feature that judgments are not required on all pair combinations. Scale values $S_i$, analogous to the $\hat{\alpha}_i$ in the Scheffé and Morrissey-Gulliksen methods, are assigned to the samples by an entirely different procedure as compared to the other two methods. There are two procedures for computing the scale values, $S_i$: one involves an unweighted analysis and the other a weighted analysis. Generally, as was the case in the present experiment, it will be found that the results computed according to the unweighted analysis differ very slightly from those computed by the weighted procedure. Consequently, all values referred to in this paper are obtained through the unweighted analysis. The scale values, $S_i$, originate from three proportions, $P_{i \cdot ij}$, $P_{j \cdot ij}$, and $P_{0 \cdot ij}$, which relate to the 3-point system of judgment. $P_{i \cdot ij}$ represents the proportion of observers who judge the color difference of Pair i larger than the color difference of Pair j; $P_{j \cdot ij}$ represents the proportion of observers who judge the color difference of Pair j larger than the color difference of Pair i; and $P_{0 \cdot ij}$ represents the proportion of observers who find the color difference of Pair i equal to the color difference of Pair j.

Two quantities, $a_{ij}$ and $a_{ji}$, are defined from the proportions $P_{i \cdot ij}$, $P_{j \cdot ij}$, and $P_{0 \cdot ij}$, that is

$$a_{ij} = P_{i \cdot ij} + P_{0 \cdot ij} \qquad (18)$$

and

$$a_{ji} = P_{j \cdot ij} + P_{0 \cdot ij}. \qquad (19)$$

These quantities are used in the calculation of

$$H_{ij} = \tfrac{1}{2}[\sin^{-1}(2a_{ij} - 1) - \sin^{-1}(2a_{ji} - 1)]. \qquad (20)$$

The inverse sine (or arcsine) function in Eq. (20) originates from the fact that the cumulative normal function, in the mathematical model, is approximated by the sine function. It may be noted that the $H_{ij}$ in Eq. (20) correspond to the $\hat{\pi}_{ij}$ given by Eq. (4) in the Scheffé model. The least-squares estimate of the scale values, $S_i$, is obtained by minimizing

$$Q_2 = {\sum_{i<j}}' (S_i - S_j - H_{ij})^2, \qquad (21)$$

with the condition that $S_1 = 0$. In the present experiment, however, the authors substituted the condition

$$\sum_{i=1}^{m} S_i = 0$$

for the condition $S_1 = 0$, to make the results (that is, scale values, $S_i$) more comparable with those obtained in the Scheffé analysis. As long as the same number of judgments are made on all pair combinations, the scale values $S_i$ can be computed from

$$S_i = \frac{1}{m} \sum_{j=1}^{m} H_{ij}, \qquad (22)$$

where m is the number of pairs. Equation (22) should be compared with Eq. (5), which is used for computing the scale values, $\hat{\alpha}_i$, in the Scheffé method.

Another parameter $\tau$ is defined such that if the difference between two scale values $S_i - S_j$ lies within an interval from $-\tau$ to $+\tau$, then the two scale values are not considered to be significantly different. The parameter $\tau$ is symbolic of a psychological threshold such that when the total color difference between two pairs is sufficiently small the observer declares a tie. The least-squares estimate of $\tau$ is given by

$$\hat{\tau} = \frac{2}{m(m-1)} {\sum_{i<j}}' G_{ij}, \qquad (23)$$

where

$$G_{ij} = \tfrac{1}{2}[\sin^{-1}(2a_{ij} - 1) + \sin^{-1}(2a_{ji} - 1)]. \qquad (24)$$
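The unweighted computation of Eqs. (18) to (24) can be sketched as follows. The input layout is an assumption made for illustration: `larger[i, j]` holds the proportion judging the difference of Pair i larger than that of Pair j, and `tie[i, j]` the proportion declaring a tie. The check data are generated from the sine model itself with made-up scale values and threshold, so the function should return exactly what was put in; none of the numbers come from the paper.

```python
import numpy as np


def thurstone_mosteller_scale(larger, tie):
    """Unweighted scale values S_i and threshold tau from three-category proportions.

    larger[i, j] : proportion judging Pair i's difference larger than Pair j's
    tie[i, j]    : proportion judging the two differences equal (symmetric)
    Diagonal entries are ignored.
    """
    larger = np.asarray(larger, dtype=float)
    tie = np.asarray(tie, dtype=float)
    m = larger.shape[0]
    a = larger + tie                           # Eqs. (18), (19): a[i, j] and a[j, i]
    np.fill_diagonal(a, 0.5)                   # makes the diagonal arcsine term zero
    t = np.arcsin(np.clip(2.0 * a - 1.0, -1.0, 1.0))
    H = (t - t.T) / 2.0                        # Eq. (20)
    G = (t + t.T) / 2.0                        # Eq. (24)
    S = H.sum(axis=1) / m                      # Eq. (22), so that the S_i sum to zero
    upper = np.triu_indices(m, k=1)
    tau = G[upper].mean()                      # Eq. (23): mean of G over i < j
    return S, tau


# Hypothetical check: proportions built from the sine model with known values.
true_S = np.array([-0.4, 0.1, 0.3])
true_tau = 0.2
d = true_S[:, None] - true_S[None, :]
a_model = (1.0 + np.sin(d + true_tau)) / 2.0   # proportion "i larger or tie"
tie = a_model + a_model.T - 1.0
larger = a_model - tie

S, tau = thurstone_mosteller_scale(larger, tie)
print(np.round(S, 3), round(float(tau), 3))
```

Here the difference of the two arcsine terms isolates $S_i - S_j$ and their sum isolates the threshold, which is why $H_{ij}$ feeds the scale values and $G_{ij}$ feeds $\hat{\tau}$.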

To analyze the data of the present experiment, the judgments, originally based on a rating scale, were divided into three categories: all negative judgments, regardless of their ratings, were assumed to mean "less than," that is j<i for the combination (i,j); positive judgments were assumed to mean "larger than," that is j>i; and zero ratings were understood to mean j=i. Thus it was possible to convert all judgments to proportions, $P_{i \cdot ij}$, $P_{0 \cdot ij}$, and $P_{j \cdot ij}$, and calculate the scale values $S_i$ according to Eqs. (18), (19), (20), and (22). The computations were carried out first on Case C data for each observer and the group. The derived scale values are plotted in Fig. 5, where the $S_i$ represent comparative color differences similar to the $\hat{\alpha}_i$ in the Scheffé method. For the group, the ranking of the $S_i$ in Fig. 5 agrees with the ranking of the $\hat{\alpha}_i$ shown in Fig. 2. Values of $\hat{\tau}$ ranged between 0.12 and 0.77 for the individual observers and was 0.25 for the group.

In addition, separate computations were performed on Case A and Case B judgments for each observer and the group. The scale values, $S_i$, and the threshold, $\hat{\tau}$, resulting for Case A and Case B judgments for the group are plotted in Fig. 6. Scale values for Case C data have also been included for comparison purposes. It is evident that only small differences result in the scaling for the two modes of orientation A and B.

For Case A judgments of the individual observers $\hat{\tau}$ ranged between 0.15 and 0.90; for Case B judgments $\hat{\tau}$ ranged between 0.10 and 0.74.

[Figure 6: GROUP DATA; three rows, CASE A, CASE B, and CASE C, each with its THRESHOLD; horizontal axis SCALE VALUES $S_i$ running from -1.0 to 1.0; separate plotting symbols for $S_6$ through $S_1$.]

FIG. 6. Scale values $S_i$ resulting for Case A, B, and C data for the group. The thresholds, $\hat{\tau}$, for the three cases are also shown.
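The reduction of rating-scale judgments to the three categories can be sketched as below. It assumes the ratings for a combination (i,j) are all expressed with Pair i on the left and Pair j on the right, so that a negative rating means j<i; the function name, dictionary keys, and sample ratings are hypothetical.

```python
import numpy as np


def three_category_proportions(ratings):
    """Collapse integer ratings for one combination (i, j) into proportions.

    Negative ratings count as "Pair i larger", zeros as ties, and positive
    ratings as "Pair j larger", regardless of their magnitude.
    """
    ratings = np.asarray(ratings)
    n = ratings.size
    return {
        "P_i": np.count_nonzero(ratings < 0) / n,
        "P_0": np.count_nonzero(ratings == 0) / n,
        "P_j": np.count_nonzero(ratings > 0) / n,
    }


# Hypothetical ratings for a single combination (not data from the paper).
example = three_category_proportions([-2, -1, 0, 1, 3, 2, 0, 1])
print(example)
```

The three proportions always sum to one, and only the sign of each rating survives the conversion, which is what makes the method equivalent to a three-point system of judgment.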

4. SCALE CONSIDERATIONS

[Figure 5: one row for each of Observers 1 to 11 and one for the GROUP, each with its THRESHOLD $\hat{\tau}$; horizontal axis SCALE VALUES $S_i$; separate plotting symbols for $S_6$ through $S_1$; an arrow marks the DIRECTION OF LARGER COLOR DIFFERENCE.]

FIG. 5. Scale values, $S_i$, computed by the modified Thurstone-Mosteller method for Case C data for each observer and the group. The threshold $\hat{\tau}$ for each observer and the group is also shown.

As pointed out earlier, the results in Scheffé's analysis depend to a large extent upon the particular rating scale being used in the experiment. It is difficult to decide beforehand how fine a rating scale should be for a given experimental situation. A practice trial is considered necessary to establish the most suitable scale. In such a practice trial the observer should be given sufficient freedom in choosing his own scale, that is, he should be free to determine the end points of the scale. In the present experiment the instructions given to the observer were formulated with this purpose in mind. The result was that observers chose rating scales ranging from a 5-point system (-2, -1, 0, 1, 2) to an 11-point system (-5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5), although the majority, seven observers, chose a 7-point system (-3, -2, -1, 0, 1, 2, 3). Before commencing the experiment two possibilities arose: (i) either repeat observations using the same instructions, that is, again have the observers make judgments according to their own rating scales, or (ii) use the same scale for all observers, this scale being the most commonly used in the practice trial. It was decided to adopt the first possibility, mainly because the rating scales used by the individual observers were not considered to differ too much and, furthermore, the analysis would be carried out for the judgments based on each individual's scale. These results have already been discussed in Sec. 3.1.

The data were also analyzed on the basis of a common scale so that each observer's judgments would carry approximately equal weight in the group analysis. Each observer's judgments, $\hat{\pi}_{ij}$, were normalized to the common scale (seven-point system) by means of a factor, found in the following manner:

$$\text{normalizing factor} = \frac{\text{average number of points used by group}}{\text{number of points used by observer}}$$

For example, the average number of points in the rating scale used by the group was 7; the number of points in the rating scale used by Observer 9 was 11. Consequently the normalizing factor used for multiplying the $\hat{\pi}_{ij}$ of Observer 9 was 7/11. When the judgments of each observer are normalized in this fashion it is evident that the scale values, $\hat{\alpha}_i$, computed from Eq. (5), change only by their respective normalizing factors. The resultant $\hat{\alpha}_i$ for Case C data for the group, based on the normalized judgments of the 11 observers, are shown in Fig. 7 (Group-Normalized). The results of the scaling for the original Case C data for the group are also plotted in the figure (Group-Original). It may be seen that the Group-Normalized results are practically identical to the Group-Original results. Intervals of 95% confidence for each $\hat{\alpha}_i$ of the Group-Original data are shown on the right side of Fig. 7 (the intervals are the same length for each $\hat{\alpha}_i$, by the S test). Although confidence intervals for the $\hat{\alpha}_i$ of the Group-Normalized data were not computed, it is expected that these intervals would have been nearly the same length as those shown for the Group-Original data.
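The normalization can be written out as a short sketch. The function name and the three example judgments are hypothetical; only the values 7 and 11 for Observer 9 come from the text.

```python
def normalizing_factor(group_average_points, observer_points):
    """Ratio of the group's average scale length to the observer's scale length."""
    return group_average_points / observer_points


# Observer 9 used an 11-point scale; the group average was 7 points.
factor = normalizing_factor(7, 11)

# Hypothetical judgments of Observer 9 on three pair combinations.
judgments = [5.0, -3.0, 2.0]
normalized = [factor * x for x in judgments]
print(round(factor, 4), [round(x, 3) for x in normalized])
```

Because every judgment of one observer is multiplied by the same constant, any estimate that is linear in the judgments is rescaled by that constant as well, which is the property the text relies on.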

FIG. 7. Comparison of scale values, $\hat{\alpha}_i$, for Case C data for the group based on five rating scales: normalized, original, 5 point, 3 point, and 2 point. Rows: Group-Normalized, Group-Original, Group-5 Point ($\hat{\alpha}_i \times 1.413$), Group-3 Point ($\hat{\alpha}_i \times 1.882$), and Group-2 Point ($\hat{\alpha}_i \times 1.656$). Horizontal axis: relative scale values; 95 percent confidence intervals are shown at the right.

Other possibilities in the selection of a scale would have been to have all the observers use some fixed scale such as a 5-point (-2, -1, 0, 1, 2), 3-point (-1, 0, 1), or even a 2-point (-1, 1) system. Without performing any additional comparisons it was decided to analyze the results as if all the judgments had been based on such a reduced system. The reduction of judgments to a 5-point system was accomplished by means of a simple graphical method. The linear transformation illustrated in Fig. 8 shows a particular example of how judgments originally based on an 11-point system were reduced to judgments based on a 5-point system. In this example, all judgments which were originally +5 (or -5) would now be recorded as +2 (or -2), and those judgments in the original system not corresponding to whole numbers in the 5-point system were averaged out proportionally between appropriate values in the 5-point system. As an illustration, assume that an observer made five judgments of "+4"; then according to Fig. 8 three of the five original +4 judgments would now be recorded as "+2's" and the two remaining +4 judgments would now be recorded as "+1's." This procedure was followed wherever necessary in reducing the different observers' judgments to judgments based on a 5-point system.

FIG. 8. Graph used for reducing judgments originally based on an 11-point system to judgments based on a 5-point system. Horizontal axis: original scale (11-point system).

The resultant scale values $\hat{\alpha}_i$, multiplied by the factor 1.413, for the group for Case C data are plotted in Fig. 7 (Group-5 Point). Intervals of 95% confidence associated with each $\hat{\alpha}_i$, established by means of the S test, are shown on the right side of the figure. As compared to the Group-Original results, the separations between the $\hat{\alpha}_i$, as well as the confidence intervals, based on the 5-point system are practically the same. It appears that the difference between the two sets of scale values and confidence intervals is only a relative difference; that is, the scale values $\hat{\alpha}_i$ and the lengths of the confidence intervals in the 5-point system differ only by the factor 1.413 from the corresponding scale values and confidence intervals in the original system.
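The proportional reduction from the 11-point to the 5-point system described above was done graphically in the paper; the following is only a numerical sketch of the same idea. The function name is hypothetical, and the rounding used when a count does not split evenly is an assumption, since the text does not say how such cases were handled.

```python
from fractions import Fraction
import math


def reduce_judgments(value, count, old_max=5, new_max=2):
    """Split `count` judgments of rating `value` between the two nearest
    ratings of the coarser scale, in proportion to a linear rescaling.

    Returns a dict mapping each reduced rating to a number of judgments.
    """
    target = Fraction(value * new_max, old_max)  # e.g. +4 maps to 8/5
    lower, upper = math.floor(target), math.ceil(target)
    if lower == upper:                           # lands on a whole rating
        return {lower: count}
    share_upper = target - lower                 # fraction assigned to `upper`
    n_upper = round(count * share_upper)         # assumption: round to nearest
    return {lower: count - n_upper, upper: n_upper}


# The example from the text: five judgments of +4 on the 11-point scale.
print(reduce_judgments(4, 5))   # three become +2, two become +1
print(reduce_judgments(-4, 5))  # mirror image for negative ratings
print(reduce_judgments(5, 7))   # +5 always becomes +2
```

Exact fractions are used so that the split for +4 (8/5 on the reduced scale, hence 3/5 of the judgments to +2) is not affected by floating-point error.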

The original judgments for Case C data for each observer, and consequently the group, were also reduced to judgments based on a 3-point system. To do this all judgments which were originally negative, regardless of their ratings, were recorded as -1; positive judgments were recorded as +1, and zero judgments were left unchanged. The resultant scale values $\hat{\alpha}_i$, multiplied by the factor 1.882, with 95% confidence intervals, are shown in Fig. 7 (Group-3 Point). The separations between the $\hat{\alpha}_i$, as well as the confidence intervals, appear to be almost the same as for either of the 5-point or original systems. The ranking of the $\hat{\alpha}_i$ remains consistent.

The original judgments were further reduced to judgments based on a 2-point system. All original negative judgments were recorded as -1, positive judgments as +1, and ties were split proportionally between -1 and +1 in accordance with the distribution of the judgments for each pair combination. The resultant scale values $\hat{\alpha}_i$, multiplied by the factor 1.656, for Case C data for the group, including 95% confidence intervals, are shown in Fig. 7 (Group-2 Point). The separations between the $\hat{\alpha}_i$, as well as the confidence intervals, for the 2-point system appear to be practically the same as the rest when compared on a relative basis. The ranking of the $\hat{\alpha}_i$ for the 2-point system is identical to the ranking in the other systems.
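The 3-point and 2-point reductions can be sketched as follows. The function names and the example ratings are hypothetical. Two details are assumptions not stated in the text: fractional tie counts are rounded to the nearest whole judgment, and ties are split evenly when a pair combination has no non-tied judgments.

```python
def to_three_point(ratings):
    """Keep only the sign of each rating: -1, 0, or +1."""
    return [(r > 0) - (r < 0) for r in ratings]


def to_two_point(ratings):
    """Reduce ratings to -1/+1 counts for one pair combination.

    Ties are divided between -1 and +1 in proportion to the non-tied
    judgments of the same pair combination.
    Returns (number recorded as -1, number recorded as +1).
    """
    neg = sum(1 for r in ratings if r < 0)
    pos = sum(1 for r in ratings if r > 0)
    ties = len(ratings) - neg - pos
    decided = neg + pos
    share_neg = neg / decided if decided else 0.5   # assumption for all-tie case
    ties_to_neg = round(ties * share_neg)           # assumption: round to nearest
    return neg + ties_to_neg, pos + (ties - ties_to_neg)


# Hypothetical ratings for one pair combination; not data from the paper.
ratings = [-3, -1, 0, 2, -2, 0, 1, -1, 0]
print(to_three_point(ratings))
print(to_two_point(ratings))
```

In this example four judgments are negative and two are positive, so two of the three ties are assigned to -1 and one to +1.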

In view of Fig. 7 it may be said that, in general, reduction of the data based on the original rating scale to a coarser system produces scale values $\hat{\alpha}_i$ which differ only by a factor from those obtained in the original system. The ranking of the scale values remains unchanged.

TABLE III. Sum of squares for deviations from subtractivity, $S_\gamma$, and experimental F ratios calculated for four scoring systems for Case C data for the group.

Scoring system   $S_\gamma$, Eq. (11)   Experimental F ratio, Eq. (10)   $F_{0.95}(10,1950)$
Original         22.48                  1.71                             1.83
5 point          12.86                  2.08                             1.83
3 point          32.47                  7.55                             1.83
2 point          72.39                  14.90                            1.83
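The test of subtractivity applied to Table III compares each experimental F ratio with the tabulated value $F_{0.95}(10,1950)=1.83$. The sketch below reproduces that comparison and the degrees of freedom. The function names are hypothetical; the relation $M=m(m-1)/2$ and the value $r=66$ are inferred from the figures $M=15$, $m=6$, and $r-1=65$ quoted in the text, not stated there.

```python
F_CRITICAL = 1.83  # F0.95(10, 1950), the tabulated value quoted in the text

# Experimental F ratios from Table III (Case C data, group).
F_RATIOS = {"original": 1.71, "5 point": 2.08, "3 point": 7.55, "2 point": 14.90}


def degrees_of_freedom(m, r):
    """Numerator and denominator degrees of freedom for the F ratio."""
    M = m * (m - 1) // 2              # assumed: number of pair combinations
    return M - m + 1, 2 * M * (r - 1)


def subtractivity_accepted(f_ratio, f_critical=F_CRITICAL):
    """True when the F ratio does not exceed the critical value."""
    return f_ratio <= f_critical


print(degrees_of_freedom(6, 66))
for system, f_ratio in F_RATIOS.items():
    verdict = "accepted" if subtractivity_accepted(f_ratio) else "rejected"
    print(f"{system}: F = {f_ratio:.2f}, subtractivity {verdict} at the 0.05 level")
```

Only the original scoring system passes; the 5-point ratio of 2.08 exceeds 1.83 by a small margin, and the 3-point and 2-point ratios exceed it by a wide one.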

It can be demonstrated that the hypothesis of subtractivity, $\pi_{ij}=\alpha_i-\alpha_j$, is affected by the choice of rating scale. The sums of squares for deviations from subtractivity, $S_\gamma$, have been computed according to Eq. (11) for Case C data for the group for the original scoring system, the 5-point system, the 3-point system, and the 2-point system. These results are listed in Col. 2 of Table III. In addition the experimental F ratio given by Eq. (10) was computed in order to test the significance of the hypothesis of subtractivity for the different scoring systems; the ratios (10) are given in the third column of Table III. For each scoring system the degrees of freedom associated with the numerator [in Ratio (10)] are $M-m+1=15-6+1=10$, and the degrees of freedom associated with the denominator are $2M(r-1)=2\times15\times65=1950$. To test the hypothesis of subtractivity at the 0.05 level of significance for the different scoring systems, each ratio in Col. 3 of Table III should be compared with the value $F_{0.95}(10,1950)=1.83$ listed in Column 4. In Column 3 those values which are greater than 1.83 indicate significant departures from subtractivity. Only the original scoring system satisfies the hypothesis of subtractivity, although the 5-point system comes close to meeting the requirement. Both the 3-point and 2-point systems show large departures from subtractivity, which means that there is poor agreement between the $\hat{\pi}_{ij}$ and the $\hat{\alpha}_i-\hat{\alpha}_j$.

FIG. 9. Plot of the statistical estimates, $|\hat{\alpha}_i-\hat{\alpha}_j|+K$, versus the actual judgments, $|\hat{\pi}_{ij}|$, for Case C data for the group for four different rating scales: original scale (crosses, $K=0.0$), 5-point scale (solid dots, $K=1.0$), 3-point scale (open triangles, $K=2.0$), 2-point scale (open circles, $K=3.0$).

The actual comparison between the $\hat{\pi}_{ij}$ and the $\hat{\alpha}_i-\hat{\alpha}_j$ for the four scoring systems is shown in Fig. 9. For convenience $|\hat{\alpha}_i-\hat{\alpha}_j|+K$, where $K$ is an arbitrary additive constant, has been plotted against $|\hat{\pi}_{ij}|$ for each scoring system. The crosses refer to the original scoring system, solid dots refer to the 5-point system, open triangles refer to the 3-point system, and open circles refer to the 2-point system. It is at once obvious from the figure that the crosses and solid dots follow straight lines. The open triangles and open circles show a systematic departure from linearity, as indicated by broken curves. Two conclusions can be deduced from Fig. 9:

(i) In the present experiment the rating scale should consist of at least 5 points (-2, -1, 0, 1, 2) if the scale values $\hat{\alpha}_i$ are to satisfy the linear model $\pi_{ij}=\alpha_i-\alpha_j$.

(ii) The curved lines, relating to the 3-point and 2-point systems, indicate a distortion of the mathematical model and suggest that the judgments, $\hat{\pi}_{ij}$, may be related to the statistical estimates, $\hat{\alpha}_i-\hat{\alpha}_j$, in some nonlinear manner, for example

$$\hat{\pi}_{ij} = (\hat{\alpha}_i-\hat{\alpha}_j)^{p} + c, \qquad (25)$$

where $p$ and $c$ are constants characterizing the experimental conditions.

A test was performed on Case C data for the group to determine whether the judgments and the statistical estimates could be expressed in the form of Eq. (25). By using graphical methods the constants $p$ and $c$ were derived for the different rating scales. The results are summarized below:

Rating scale   p     c
Original       1.0   0.00
5 point        1.0   0.00
3 point        0.6   0.07
2 point        0.5   0.07

The above values of $p$ and $c$ were substituted in Eq. (25), and $(\hat{\alpha}_i-\hat{\alpha}_j)^{p}+c$ was plotted against $\hat{\pi}_{ij}$ for all four rating scales. The result was a straight line in each case. The interesting conclusion is that the judgments obey a power law, with $p$ less than unity, when the rating scale is too coarse for the given set of color samples.
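A small numerical sketch may help show what Eq. (25) implies for the constants reported above. The function name and the scale-value differences are hypothetical and chosen only for illustration; the restriction to non-negative differences is an assumption that follows the absolute values plotted in Fig. 9.

```python
# (p, c) for each rating scale, as reported in the text.
CONSTANTS = {
    "original": (1.0, 0.00),
    "5 point": (1.0, 0.00),
    "3 point": (0.6, 0.07),
    "2 point": (0.5, 0.07),
}


def predicted_judgment(difference, p, c):
    """Right-hand side of Eq. (25) for a non-negative scale-value difference."""
    if difference < 0:
        raise ValueError("use the absolute difference between scale values")
    return difference ** p + c


# Hypothetical scale-value differences; not data from the paper.
differences = [0.25, 0.5, 1.0]
for scale, (p, c) in CONSTANTS.items():
    row = [round(predicted_judgment(d, p, c), 3) for d in differences]
    print(scale, row)
```

With $p=1$ and $c=0$ the expression reduces to the linear hypothesis of subtractivity; with $p<1$ the predicted judgment is a concave function of the scale-value difference.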

5. SUMMARY

The method of paired comparisons was used by a group of 11 observers to judge color differences between pairs of Munsell colors. The experimental data were analyzed by three statistical methods, and the experience gained by using the methods supports the evaluation given by Jackson and Fleckenstein. The Scheffé method⁶ provides the most comprehensive analysis of the three methods used. The computation procedure by the Morrissey-Gulliksen⁷ matrix method is relatively simple, and this method has the advantage that judgments are not required on all pair combinations. The modified Thurstone-Mosteller method is the most efficient method of analysis when a 3-point system of judgment is to be employed. All three methods give comparable scale values and statistical estimates.

The effects of orientation and order of presentation of the color pairs were studied. Scale values, that is, comparative color differences, are not altered by different modes of orientation. Order effects are significantly larger when the high chromas occupy corner positions diametrically opposite each other than in the arrangement in which the high chromas occupy the top positions in the pair combinations. For the latter case the order effects are almost negligible.

There is reason to believe that the judgments and the statistical estimates can be expressed in the form of a linear hypothesis such as Scheffé's hypothesis of subtractivity, provided a sufficiently fine rating scale is chosen. The rating scale should be found by experimentation. Allowing the observers to choose their own rating scales appears to be a good procedure.

If the rating scale is too coarse the linear hypothesis no longer holds, and should be replaced by a nonlinear hypothesis expressed in the form of a power law with an exponent smaller than unity.

ACKNOWLEDGMENTS

The authors express their thanks to Dr. Günter Wyszecki for his helpful suggestions throughout the course of the experiment and also for his assistance in the preparation of the manuscript. Thanks are also due to Dr. G. L. Howlett and Dr. D. L. MacAdam for their valuable comments on the original manuscript. Many of these comments have been incorporated in the final version of the paper.

1222 Vol. 53