Rank correlation- some features and an application

On some interesting features and an application of rank correlation

Kushal Kr. Dey

Indian Statistical Institute, D. Basu Memorial Award Talk 2011

List of contents

1 Historical overview of rank correlation.

2 Some properties of rank correlation.

3 A practical example of rank correlation.

Historical overview: correlation

In 1886, Sir Francis Galton introduced the term correlation, writing:

"length of a human arm is said to be correlated with that of the leg, because a person with long arm has usually long legs and conversely."

Galton wanted a measure of correlation that takes the value +1 for perfect correspondence, 0 for independence, and -1 for perfect inverse correspondence.

Historical overview contd.

Karl Pearson, a student of Galton, worked on his idea and formulated his "product moment" measure of correlation in 1896.

\[ r = \frac{S_{xy}}{\sqrt{S_{xx}}\,\sqrt{S_{yy}}} \tag{1} \]

Spearman observed that for characteristics that are not quantitatively measurable, the Pearsonian measure fails to measure the association. This motivated him to use rank-based methods for association and to develop his rank correlation coefficient in 1904. ["The proof and measurement of association between two things" by C. Spearman in The American Journal of Psychology (1904)].

Historical overview contd.

In 1938, two years after the death of Pearson, Maurice Kendall, a British scientist, while working on psychological experiments, came up with a new measure of correlation popularly known as Kendall's τ. ["A new measure of rank correlation", M. Kendall, Biometrika (1938)].

The next few years saw extensive research in this area by Kendall, Daniels, Hoeffding and others.

In 1954, Goodman and Kruskal proposed a modification of Kendall's coefficient for the case of ties. ["Measures of association for cross classifications", Part I, L.A. Goodman and W.H. Kruskal, J. Amer. Statist. Assoc. (1954)]

Daniels' generalized correlation coefficient

H.E. Daniels of Cambridge University, a close associate of Kendall, proposed a measure in 1944 to unify Pearson's r, Spearman's ρ and Kendall's τ ["The relation between measures of correlation in the universe of sample permutations", H.E. Daniels, Biometrika (1944)].

Consider n data points $(X_i, Y_i)$, $i = 1, \ldots, n$. To each pair of X values $(X_i, X_j)$ we allot a score $a_{ij}$ with $a_{ij} = -a_{ji}$ and $a_{ii} = 0$; similarly, we allot a score $b_{ij}$ to the pair $(Y_i, Y_j)$. Daniels' generalized coefficient D is then given by

\[ D = \frac{\sum_{i=1}^{n}\sum_{j=1}^{n} a_{ij} b_{ij}}{\sqrt{\sum_{i=1}^{n}\sum_{j=1}^{n} a_{ij}^{2} \;\cdot\; \sum_{i=1}^{n}\sum_{j=1}^{n} b_{ij}^{2}}} \tag{2} \]

Daniels' generalized coefficient contd.

Special cases

Put $a_{ij} = X_j - X_i$ and $b_{ij} = Y_j - Y_i$ to get Pearson's r.

Put $a_{ij} = \mathrm{Rank}(X_j) - \mathrm{Rank}(X_i)$ and $b_{ij} = \mathrm{Rank}(Y_j) - \mathrm{Rank}(Y_i)$ to get Spearman's ρ.

Put $a_{ij} = \mathrm{sgn}(X_j - X_i)$ and $b_{ij} = \mathrm{sgn}(Y_j - Y_i)$ to get Kendall's τ.
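The three special cases can be checked numerically. The following is a minimal sketch, not part of the original talk: the function names (`daniels_d`, `difference_scores`, `sign_scores`) and the simulated data are illustrative assumptions, and the data are assumed to have no ties.

```python
import math
import random


def daniels_d(a, b):
    """Daniels' generalized coefficient from score matrices a[i][j], b[i][j]."""
    n = len(a)
    pairs = [(i, j) for i in range(n) for j in range(n)]
    numerator = sum(a[i][j] * b[i][j] for i, j in pairs)
    sum_a2 = sum(a[i][j] ** 2 for i, j in pairs)
    sum_b2 = sum(b[i][j] ** 2 for i, j in pairs)
    return numerator / math.sqrt(sum_a2 * sum_b2)


def ranks(values):
    """Ranks 1..n of the values (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for position, index in enumerate(order, start=1):
        result[index] = position
    return result


def difference_scores(values):
    """Scores a_ij = value_j - value_i."""
    return [[vj - vi for vj in values] for vi in values]


def sign_scores(values):
    """Scores a_ij = sgn(value_j - value_i)."""
    return [[(vj > vi) - (vj < vi) for vj in values] for vi in values]


def pearson_r(x, y):
    """Textbook product-moment correlation, used as an independent check."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(0)  # illustrative simulated data, not from the talk
x = [rng.gauss(0, 1) for _ in range(30)]
y = [0.5 * xi + rng.gauss(0, 1) for xi in x]

d_pearson = daniels_d(difference_scores(x), difference_scores(y))
d_spearman = daniels_d(difference_scores(ranks(x)), difference_scores(ranks(y)))
d_kendall = daniels_d(sign_scores(x), sign_scores(y))

print(d_pearson, pearson_r(x, y))                 # the two values agree
print(d_spearman, pearson_r(ranks(x), ranks(y)))  # rho is Pearson's r of the ranks
print(d_kendall)                                  # Kendall's tau
```

Only the choice of pairwise score changes between the three coefficients; the normalisation in D is the same throughout.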

Alternative expression for τ and ρ

First, we define $d_{ij}$ to be +1 when the rank j (j > i) precedes the rank i in the second ranking, and zero otherwise.

We can write Kendall's τ as

\[ \tau = 1 - \frac{4Q}{n(n-1)} \tag{3} \]

where $Q = \sum_{i<j} d_{ij}$ is the total score and n is the total number of elements in the sample.

Similarly, we can write Spearman's ρ as

\[ \rho = 1 - \frac{12V}{n(n^{2}-1)} \tag{4} \]

where $V = \sum_{i<j} (j-i)\, d_{ij}$ is the sum of inversions weighted by the numerical difference between the ranks inverted. This difference is called the weight of the inversion.
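As an illustration of the inversion-based expressions, the sketch below computes Q and V for a small ranking and compares the resulting ρ with the classical formula based on squared rank differences. The function names and the example ranking are assumptions made for this illustration; the second ranking is assumed to be listed in the order of the first ranking, with no ties.

```python
from itertools import combinations


def inversion_scores(second):
    """Return (Q, V) for a second ranking listed in the order of the first.

    second[k] is the rank that the second ranking gives to the object
    ranked k + 1 by the first ranking. Ranks i < j are inverted when j
    appears before i in this list.
    """
    q = 0
    v = 0
    for earlier, later in combinations(second, 2):
        if earlier > later:       # a larger rank precedes a smaller one
            q += 1                # one more inversion
            v += earlier - later  # weight of this inversion
    return q, v


def tau_from_q(second):
    n = len(second)
    q, _ = inversion_scores(second)
    return 1 - 4 * q / (n * (n - 1))


def rho_from_v(second):
    n = len(second)
    _, v = inversion_scores(second)
    return 1 - 12 * v / (n * (n * n - 1))


def rho_classical(second):
    """Spearman's rho from the sum of squared rank differences."""
    n = len(second)
    d2 = sum((rank - position) ** 2
             for position, rank in enumerate(second, start=1))
    return 1 - 6 * d2 / (n * (n * n - 1))


example = [3, 1, 4, 2, 5]
print(inversion_scores(example))  # (3, 5)
print(tau_from_q(example))        # 0.4
print(rho_from_v(example), rho_classical(example))  # 0.5 0.5
```

In the example the inverted pairs are (3, 1), (3, 2) and (4, 2), with weights 2, 1 and 2, so Q = 3 and V = 5.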

An interesting result

We simulated large samples from a bivariate normal distribution and plotted the mean values of Spearman's ρ and Kendall's τ against Pearson's r. We obtained the following graph.

The graph

Figure: Relation of Spearman's ρ and Kendall's τ with Pearson's r
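A simulation of this kind can be sketched as follows. The sample size, number of replications, seed and function names are assumptions chosen for illustration and are not the settings used for the figure; the code prints the mean rank correlations rather than plotting them.

```python
import math
import random


def ranks(values):
    """Ranks 1..n of the values (continuous data, so ties are ignored)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for position, index in enumerate(order, start=1):
        result[index] = position
    return result


def spearman_rho(x, y):
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n * n - 1))


def kendall_tau(x, y):
    n = len(x)
    score = 0
    for i in range(n):
        for j in range(i + 1, n):
            product = (x[j] - x[i]) * (y[j] - y[i])
            score += (product > 0) - (product < 0)  # +1 concordant, -1 discordant
    return 2 * score / (n * (n - 1))


def mean_rank_correlations(r, n=200, replications=20, seed=0):
    """Average rho and tau over samples from a BVN with correlation r."""
    rng = random.Random(seed)
    rho_total = tau_total = 0.0
    for _ in range(replications):
        x, y = [], []
        for _ in range(n):
            v, w = rng.gauss(0, 1), rng.gauss(0, 1)
            x.append(w)
            y.append(r * w + math.sqrt(1 - r * r) * v)  # corr(x, y) = r
        rho_total += spearman_rho(x, y)
        tau_total += kendall_tau(x, y)
    return rho_total / replications, tau_total / replications


for r in (0.0, 0.3, 0.6, 0.9):
    rho_bar, tau_bar = mean_rank_correlations(r)
    print(f"r = {r:.1f}   mean rho = {rho_bar:.3f}   mean tau = {tau_bar:.3f}")
```

Plotting the printed means against r over a finer grid of r values would give curves of the kind shown in the figure.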

Relation of τ and ρ with r for BVN

In 1907, Pearson, in his book ["On Further Methods of Determining Correlation", Karl Pearson, Biometric Series IV (1907)], established the following relation between Spearman's ρ and his r for the bivariate normal distribution.

\[ r = 2 \sin\left(\frac{\pi \rho}{6}\right) \tag{5} \]

Cramér, in 1946, also established a relation between Kendall's τ and Pearson's r for the bivariate normal.

\[ r = \sin\left(\frac{\pi \tau}{2}\right) \tag{6} \]

The relation for Kendall's τ in fact holds for any elliptical distribution; the relation for Spearman's ρ, however, does not extend to elliptical distributions in general.

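Relation between Kendall's τ and r for bivariate normal

Let $(X_1,Y_1), (X_2,Y_2), \ldots, (X_n,Y_n)$ be a sample drawn from BVN(0, 0, 1, 1, r). Then Kendall's τ computed from the data is an unbiased estimator of

\[ 2P\big((X_1 - X_2)(Y_1 - Y_2) > 0\big) - 1 = 2P(Z_1 Z_2 > 0) - 1 \tag{7} \]

where $(Z_1, Z_2) = (X_1 - X_2,\, Y_1 - Y_2)$ is bivariate normal with zero means, variances 2 and covariance 2r, written BVN(0, 0, 2, 2, 2r).

Note that $(Z_1, Z_2) \overset{d}{=} \sqrt{2}\,\big(V\sqrt{1-r^{2}} + Wr,\; W\big)$, where V and W are independent standard normal variables. Since $(Z_1, Z_2)$ is symmetric about (0, 0), the quantity in (7) equals

\[ 4P(Z_1 > 0, Z_2 > 0) - 1 = 4P\big(V\sqrt{1-r^{2}} + Wr > 0,\; W > 0\big) - 1 \tag{8} \]

Use the polar transformation on (V, W) and evaluate this probability; the quantity in (8) then equals $\frac{2}{\pi}\sin^{-1} r$.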
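The polar-coordinate step can be spelled out as follows. This is a worked sketch rather than part of the original slides; the symbols R, θ and α are introduced here only for the illustration, with V and W taken to be independent standard normal variables.

```latex
\begin{align*}
(V, W) &= (R\cos\theta,\; R\sin\theta),
  \qquad \theta \sim \mathrm{Uniform}(0, 2\pi) \text{ independent of } R, \\
V\sqrt{1-r^{2}} + Wr &= R\,\sin(\theta + \alpha),
  \qquad \alpha = \cos^{-1} r, \\
P\big(V\sqrt{1-r^{2}} + Wr > 0,\; W > 0\big)
  &= P(0 < \theta < \pi - \alpha)
   = \frac{\pi - \cos^{-1} r}{2\pi}, \\
4 \cdot \frac{\pi - \cos^{-1} r}{2\pi} - 1
  &= 1 - \frac{2}{\pi}\cos^{-1} r
   = \frac{2}{\pi}\sin^{-1} r .
\end{align*}
```

The event W > 0 restricts θ to (0, π), and sin(θ + α) > 0 restricts it further to (0, π − α), which gives the third line.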

Relation between Spearman's ρ and r for bivariate normal

We now sketch a proof of the relationship between Pearson's r and Spearman's ρ for the bivariate normal distribution.

Let $R(X_i)$ and $R(Y_i)$ be the ranks of $X_i$ and $Y_i$. Define $H(t) = I\{t > 0\}$. Then observe that

\[ R(X_i) = \sum_{j=1}^{n} H(X_i - X_j) + 1 \tag{9} \]

Note that Spearman's ρ is Pearson's correlation coefficient between $R(X_i)$ and $R(Y_i)$, which is

\[ \rho = \frac{h - \frac{1}{4}\, n (n-1)^{2}}{\frac{1}{12}\, n (n^{2} - 1)} \]

where $h = \sum_{i=1}^{n}\sum_{j=1}^{n}\sum_{k=1}^{n} H(X_i - X_j)\, H(Y_i - Y_k)$.
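The expression of ρ through h can be verified by brute force on a small sample. In this sketch the function names and the simulated data are illustrative assumptions; ties are assumed absent, so that ρ also equals the classical formula based on squared rank differences.

```python
import random


def step(t):
    """H(t) = 1 if t > 0, else 0."""
    return 1 if t > 0 else 0


def rho_via_h(x, y):
    """Spearman's rho from the triple sum h."""
    n = len(x)
    h = sum(step(x[i] - x[j]) * step(y[i] - y[k])
            for i in range(n) for j in range(n) for k in range(n))
    return (h - n * (n - 1) ** 2 / 4) / (n * (n * n - 1) / 12)


def rho_via_ranks(x, y):
    """Spearman's rho from ranks built with equation (9)."""
    n = len(x)
    rank_x = [sum(step(xi - xj) for xj in x) + 1 for xi in x]
    rank_y = [sum(step(yi - yj) for yj in y) + 1 for yi in y]
    d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * d2 / (n * (n * n - 1))


rng = random.Random(2)  # illustrative simulated data
x = [rng.gauss(0, 1) for _ in range(15)]
y = [0.4 * xi + rng.gauss(0, 1) for xi in x]
print(rho_via_h(x, y), rho_via_ranks(x, y))  # the two values agree
```

The triple sum costs O(n³) operations, so it is useful for checking the algebra rather than for computation.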

Proof continued

Case 1

If i, j, k are distinct, then $(X_i - X_j,\, Y_i - Y_k)$ is bivariate normal with zero means, variances 2 and correlation r/2.

$E\{H(X_i - X_j)\, H(Y_i - Y_k)\}$ reduces to the integral of the probability density over the positive quadrant.

We can check, following a technique similar to the one used for τ, that this integral is $\frac{1}{2}\left(1 - \frac{1}{\pi}\cos^{-1}\frac{r}{2}\right)$.

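Proof continued

Case 2

If i ≠ j = k, then $(X_i - X_j,\, Y_i - Y_k)$ is bivariate normal with zero means, variances 2 and correlation r, and the above expectation reduces to $\frac{1}{2}\left(1 - \frac{1}{\pi}\cos^{-1} r\right)$. Then,

\[ E\left(\frac{h - \frac{1}{4}\, n (n-1)^{2}}{\frac{1}{12}\, n (n^{2} - 1)}\right) = \frac{6}{\pi}\left\{\frac{n-2}{n+1}\sin^{-1}\frac{r}{2} + \frac{1}{n+1}\sin^{-1} r\right\} \]

As n goes to infinity, the R.H.S. reduces to $\frac{6}{\pi}\sin^{-1}\frac{r}{2}$.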

Reason for approximate linear relationship between Spearman's ρ and Pearson's r for BVN

As observed from the graph, Spearman's ρ for the bivariate normal is almost linearly related to Pearson's r. This may be attributed to the fact that

\[ \begin{aligned}
\rho &= \frac{6}{\pi}\sin^{-1}\frac{r}{2} \\
     &= \frac{6}{\pi}\left(\frac{r}{2} + \frac{1}{6}\cdot\frac{r^{3}}{8} + \ldots\right) \\
     &= \frac{3}{\pi}\, r + \text{terms very small compared to the first-order term} \\
     &\approx \frac{3}{\pi}\, r
\end{aligned} \]

For Kendall's τ, using a similar expansion, we can also show that τ is a convex function of r on the interval [0, 1].
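A quick numerical check of both statements, assuming the bivariate normal relations ρ = (6/π) sin⁻¹(r/2) and τ = (2/π) sin⁻¹ r; the grid spacing and variable names are choices made for this sketch.

```python
import math

grid = [i / 100 for i in range(101)]  # r from 0 to 1 in steps of 0.01

# Spearman's rho against its first-order approximation (3/pi) r
rho = [6 / math.pi * math.asin(r / 2) for r in grid]
linear = [3 / math.pi * r for r in grid]
max_gap = max(abs(a - b) for a, b in zip(rho, linear))

# Kendall's tau: second differences are positive for a convex function
tau = [2 / math.pi * math.asin(r) for r in grid]
second_differences = [tau[i - 1] - 2 * tau[i] + tau[i + 1] for i in range(1, 100)]

print(f"largest gap between rho and (3/pi) r on [0, 1]: {max_gap:.4f}")
print("tau convex on the grid:", all(d > 0 for d in second_differences))
```

The gap is largest at r = 1, where ρ = 1 and (3/π) r is about 0.955, so the linear approximation is off by less than 0.05 over the whole interval.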

Kendall's comparative assessment of τ and ρ

Kendall in his paper admitted that ρ can take $(n^{3} - n)/6$ values between −1 and +1, whereas τ can take only $(n^{2} - n)/2$ values in that range, but according to him this does not seriously affect the sensitivity of τ.

Both Kendall's τ and Spearman's ρ computed from the sample have asymptotically normal distributions.

But Kendall showed using simulation experiments that the distribution of his correlation coefficient is surprisingly close to normal even for small values of n, which is not the case for Spearman's correlation.

Bias properties of Kendall's τ and Spearman's ρ

Consider a finite population. Let ρ* and τ* be Spearman's and Kendall's rank correlation coefficients computed from the entire population.

Suppose that we have a simple random sample without replacement from that population, and that we compute Spearman's ρ and Kendall's τ from the sample.

Then τ is an unbiased estimator of τ*, but ρ is a biased estimator of ρ*.

If the population size N tends to infinity, the expected value of Spearman's ρ tends to

\[ \frac{1}{n+1}\left\{3\tau^{*} + (n-2)\rho^{*}\right\} \]

where n is the size of the sample.
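To get a feel for the size of this bias, the sketch below evaluates the limiting expectation for a bivariate normal population, using the relations τ* = (2/π) sin⁻¹ r and ρ* = (6/π) sin⁻¹(r/2) from the earlier slides. The function names and the chosen value r = 0.6 are assumptions made for illustration.

```python
import math


def expected_sample_rho(n, tau_pop, rho_pop):
    """Large-population limit of E[rho] for a sample of size n."""
    return (3 * tau_pop + (n - 2) * rho_pop) / (n + 1)


def bvn_population_values(r):
    """Population tau and rho for a bivariate normal with correlation r."""
    return 2 / math.pi * math.asin(r), 6 / math.pi * math.asin(r / 2)


tau_star, rho_star = bvn_population_values(0.6)
for n in (5, 20, 100, 1000):
    bias = expected_sample_rho(n, tau_star, rho_star) - rho_star
    print(f"n = {n:5d}   E[rho] - rho* = {bias:+.4f}")
```

The bias equals 3(τ* − ρ*)/(n + 1), so it vanishes at rate 1/n and its sign is that of τ* − ρ*.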

Small sample distribution of τ, ρ and r

It is well known that for a simple random sample of size n drawn from a bivariate normal distribution, under the assumption of zero correlation, Pearson's r satisfies

\[ \frac{r\sqrt{n-2}}{\sqrt{1-r^{2}}} \sim t_{n-2} \tag{11} \]

But the distribution of r for small samples from a normal distribution with non-zero correlation, and from non-normal distributions, is not tractable.

τ and ρ are distribution-free statistics in the sense that their distributions do not depend on the distribution of the data so long as X and Y are independent. Consequently, their distributions under the hypothesis of independence of X and Y can be tabulated.
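Such a table can be produced by direct enumeration: for continuous data and independent X and Y, every ordering of the Y ranks relative to the X ranks is equally likely. The sketch below tabulates the exact null distribution of τ for a small n; the function name is an assumption, and the enumeration is only practical for small samples.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations, permutations


def null_distribution_of_tau(n):
    """Exact distribution of Kendall's tau under independence (no ties)."""
    counts = Counter()
    for perm in permutations(range(n)):
        # number of inversions Q in this ordering
        q = sum(1 for a, b in combinations(perm, 2) if a > b)
        counts[1 - Fraction(4 * q, n * (n - 1))] += 1
    total = sum(counts.values())
    return {tau: Fraction(c, total) for tau, c in sorted(counts.items())}


for tau, prob in null_distribution_of_tau(4).items():
    print(f"tau = {float(tau):+.3f}   probability = {prob}")
```

For n = 4 the 24 orderings give seven possible values of τ, symmetric about zero, with probabilities 1/24, 1/8, 5/24, 1/4, 5/24, 1/8 and 1/24.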

Asymptotic normality of r, ρ and τ

Note that each of Pearson's r, Spearman's ρ and Kendall's τ computed from bivariate data is asymptotically normally distributed.

Asymptotic normality of Pearson's r can be derived using the Central Limit Theorem applied to various bivariate sample moments.

Asymptotic normality of Spearman's ρ follows from the asymptotic normality of linear rank statistics.

Asymptotic normality of Kendall's τ follows from the asymptotic normality of U-statistics.

A practical application of rank correlation

Recently, the Ministry of Human Resource Development (MHRD) considered giving weightage to the marks scored in the 10+2 Board exams for admission to engineering colleges in India.

The raw scores across the Boards are not comparable. So, they wanted help in this regard from the Indian Statistical Institute.

The use of percentile ranks of students based on their aggregate scores was recommended by the Indian Statistical Institute.

The Data

The Indian Statistical Institute was provided data from 4 boards (namely ICSE, CBSE, West Bengal Board and Tamil Nadu Board) for two consecutive years, 2008 and 2009.

Though the recommendation from the Indian Statistical Institute was to use the aggregate score of a student for computing the percentile rank of the student (and that recommendation was favorably accepted by MHRD), a statistically interesting question is what happens if we consider the various subject scores separately instead of the aggregate score.

We intend to investigate this issue under some appropriate assumptions.

The Model

For convenience, let us consider only two subjects, namely Mathematics and Physics.

Let us denote the observed scores of a student in Mathematics and Physics by $X_M$ and $X_P$. Assume the existence of unobserved merit variables $W_M$ and $W_P$ such that the scores in the two subjects are related as

\[ X_M \approx g_M(W_M), \qquad X_P \approx g_P(W_P) \tag{12} \]

$W_M$ and $W_P$ may be treated as attributes of the student which depend on the knowledge and understanding of Maths and Physics respectively, and also on other factors like schooling, intelligence etc.

$g_M$ and $g_P$ relate to the examination procedure corresponding to the two subjects. They may vary across the boards.

The Model

Formulation of the model

Two students may obtain different scores in Mathematics and Physics because of differences in their merit variables $W_M$ and $W_P$, or because of differences in the examination procedures $g_M$ and $g_P$ across the boards.

We now lay down our assumptions about $W_M$, $W_P$, $g_M$ and $g_P$.

Assumptions of the model

Assumption 1

The functions $g_P$ and $g_M$ are monotonically increasing. This implies that the scores of the students are expected to increase from less meritorious to more meritorious students in each of the two subjects.

Assumption 2

The joint distribution of $(W_P, W_M)$ for the students is the same in different boards.

How assumptions can be checked

Imagine a common test in Mathematics and Physics taken by students of all the boards.

The Mathematics score in the common test would be a monotone function of the Mathematics score in the board examination, as both are monotone functions of the same merit variable. (The same holds for Physics scores.) This can be tested using Spearman's ρ and Kendall's τ statistics.

Mathematics and Physics scores in the common test would have the same distribution in the subpopulations corresponding to different boards. This can be tested using any non-parametric test for equality of bivariate distributions.

Is there a way to check the validity of these assumptions using currently available data?

How assumptions can be checked without a common test

According to Assumption 2, the dependence between merits in Physics and Mathematics should be similar in all the boards.

The rank correlation between Physics and Mathematics scores in a particular board should not depend on the board-specific monotone functions $g_M$ and $g_P$.

Therefore, the rank correlation between Physics and Mathematics scores should be the same across the boards.
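The invariance argument can be illustrated with simulated merit variables. Everything in this sketch is hypothetical: the merit distribution, the two sets of increasing transformations standing in for board-specific functions, and the function names; none of it is taken from the board data.

```python
import math
import random


def ranks(values):
    """Ranks 1..n of the values (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for position, index in enumerate(order, start=1):
        result[index] = position
    return result


def spearman_rho(x, y):
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n * n - 1))


rng = random.Random(1)
# Hypothetical merit variables for Mathematics and Physics
merit_m = [rng.gauss(0, 1) for _ in range(200)]
merit_p = [0.7 * m + rng.gauss(0, 1) for m in merit_m]

# Hypothetical board A: logistic scaling for Maths, linear scaling for Physics
maths_a = [100 / (1 + math.exp(-w)) for w in merit_m]
physics_a = [50 + 10 * w for w in merit_p]

# Hypothetical board B: different, but still strictly increasing, functions
maths_b = [math.exp(w) for w in merit_m]
physics_b = [w ** 3 + w for w in merit_p]

rho_merit = spearman_rho(merit_m, merit_p)
rho_a = spearman_rho(maths_a, physics_a)
rho_b = spearman_rho(maths_b, physics_b)
print(rho_merit, rho_a, rho_b)  # all three coincide
```

Because strictly increasing functions leave the ranks unchanged, the three values are identical, whatever the shape of the transformations.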

Rank correlation between Physics & Maths for different boards and years

Rank correlation Physics & Chemistry

Figure: Rank correlation between Physics and Chemistry marks over years

Bar chart of rank correlation Chemistry & Maths

Figure: Rank correlation between Chemistry and Maths marks over years

Subject percentile graph WBHS 2008

Variation of a subject across a board same year

Inference from the data analysis

Between-board variation is significantly higher than within-board variation across the two years.

Visibly, there is high correlation in the Tamil Nadu Board, whereas low correlation is observed in the CBSE Board.

If we interpret the available data as a large sample from a larger hypothetical population, the rank correlation computed for a board in a particular year will have an approximately normal distribution.

So, we can use these rank correlation values to carry out an ANOVA-type statistical analysis to see whether there are significant differences across different boards and across different years. When this is done, the differences in rank correlation across boards appear to be significant.

This essentially implies a breakdown of Assumption 2.

Study of the rank correlation brings out this fact even without scores from a common test.
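One way such an ANOVA-type analysis could be set up is a two-way layout with boards as rows, years as columns and one rank correlation per cell. The sketch below is an assumption about the form of the analysis, not the computation used in the talk, and the numbers in the table are made-up placeholders, not the observed correlations.

```python
def two_way_anova(table):
    """F statistics (rows, columns) for a two-way layout, one value per cell."""
    rows, cols = len(table), len(table[0])
    grand = sum(map(sum, table)) / (rows * cols)
    row_means = [sum(row) / cols for row in table]
    col_means = [sum(table[i][j] for i in range(rows)) / rows for j in range(cols)]

    ss_rows = cols * sum((m - grand) ** 2 for m in row_means)
    ss_cols = rows * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((v - grand) ** 2 for row in table for v in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_error = ss_error / ((rows - 1) * (cols - 1))
    f_rows = (ss_rows / (rows - 1)) / ms_error
    f_cols = (ss_cols / (cols - 1)) / ms_error
    return f_rows, f_cols


# PLACEHOLDER values only: rows are four hypothetical boards, columns two years
placeholder_table = [
    [0.70, 0.72],
    [0.55, 0.58],
    [0.66, 0.65],
    [0.84, 0.86],
]
f_boards, f_years = two_way_anova(placeholder_table)
print(f"F for boards (3 and 3 d.f.): {f_boards:.1f}")
print(f"F for years  (1 and 3 d.f.): {f_years:.1f}")
```

Each F statistic would then be compared with the corresponding F distribution; a large value for boards alongside a small one for years would match the pattern described above.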

Acknowledgement

I would like to express my gratitude towards my mentors for this project, Prof. Probal Chaudhuri and Prof. Debasis Sengupta, for their immense co-operation. I would also like to thank all those who have been associated with this work in some way or the other.

Thank You
