Non-parametric procedure for knockout tournaments

Christopher Todd Edwards. Published online: 02 Aug 2010.

To cite this article: Christopher Todd Edwards (1998) Non-parametric procedure for knockout tournaments, Journal of Applied Statistics, 25:3, 375-385, DOI: 10.1080/02664769823106

To link to this article: http://dx.doi.org/10.1080/02664769823106


Journal of Applied Statistics, Vol. 25, No. 3, 1998, 375-385

Non-parametric procedure for knockout tournaments

CHRISTOPHER TODD EDWARDS, Mathematics Department, University of

Wisconsin Oshkosh, USA

SUMMARY In a seeded knockout tournament, where teams have some preassigned strength, do we have any assurances that the best team in fact has won? Is there some insight to be gained by considering which teams beat which other teams, solely by examining the seeds? We pose an answer to these questions by using the difference in the seeds of the two players as the basis for a test statistic. We offer several models for the underlying probability structure to examine the null distribution and power functions and determine these for small tournaments (less than five teams). One structure each for 8 teams and 16 teams is examined, and we conjecture an asymptotic normal distribution for the test statistic.

1 Introduction

Many sports organizations use tournaments to select their champions. A well-known instance of this is the annual 'March Madness' National Collegiate Athletic Association (NCAA) men's and women's college basketball tournament, where the champions are selected using a 64-team single-elimination tournament. This tournament essentially comprises four regional 16-team tournaments, where the participants are initially ranked, or 'seeded', from 1 (the best team) to 16 (the worst team). Other well-known applications of single-elimination tournaments are those in professional tennis, basketball and football.

One natural question for observers is whether or not the best team has won. One might suspect that, if the number 16 seed (supposedly the weakest team in the regional tournament) wins its regional tournament, then one of two things has occurred: the seedings were incorrect or some major upsets occurred.

This paper explores a method of testing the hypothesis that the initial seedings are meaningful, i.e. that the teams that won were the favorites. We contrast this with the hypothesis that all the teams are of equal strength, so that the seedings are meaningless. We make a comparison with round-robin tournaments, where it is possible to test for the strongest team, based on its winning percentage. (We assume that the reader has had some exposure to statistical inference at the undergraduate level.)

Correspondence: C. T. Edwards, Mathematics Department, University of Wisconsin Oshkosh, 800 Algoma Boulevard, Oshkosh, WI 54901, USA. Tel: 920 424 1333.

0266-4763/98/030375-11 $7.00 © 1998 Carfax Publishing Ltd

2 General hypothesis testing

Before we describe our particular hypothesis test procedure, we describe statistical hypothesis testing in general. In a typical setting, we test whether or not a population parameter is equal to some prespecified value. The test statistic is formed by some theoretical (or intuitive) criteria, and the value of the test statistic is calculated from the data. In some non-parametric settings, the null hypothesis is a statement about a situation, e.g. that the data are independent, and, as such, the hypothesis does not actually involve parameters.

There are two types of error that can be made in hypothesis testing. We can

falsely reject a true hypothesis (a type I error), or we can fail to reject an incorrect

hypothesis (a type II error). Clearly, we would like to make as few mistakes as

possible. However, in practice, it is usually not possible to have both error rates

simultaneously small without large samples. Typically, we (arbitrarily) select an

acceptable error rate for the type I error. To derive the rejection region or decision

rule, we select an acceptable test statistic and a cut-off which gives us our

preselected acceptable error rate under the assumption that the null hypothesis

is true.

There are still many possible test statistics that one could choose, so we need additional criteria to select a 'best' test. With the type I error rate fixed, a typical approach is to try to maximize the power, or one minus the type II error rate. In practice, type II errors are not as carefully studied, mainly because of the complexity of composite hypotheses. Another strategy is to minimize the sum of the type I and type II error rates (for an appropriately chosen alternative hypothesis). One could also try the 'minimax' approach, i.e. try to minimize the maximum of the type I and type II error rates. We will explore the maximum power and minimum sum techniques later in the paper.

3 Round-robin tournaments

Before we discuss the single-elimination tournament, we digress to a discussion about the more well-known round-robin tournament. The following discussion centers around the fact that, in a round-robin tournament, one is able to test whether or not there is a 'best' team. This feature does not exist in single-elimination tournaments, as we will see later. A round-robin tournament is one where each team plays each other team exactly once. Using graph theory, these tournaments can be represented with directed graphs, where teams are the vertices and the directed line segments indicate the winner of each pairing. One feature of the round-robin tournament is that there is only one possible arrangement of games to be played: each team plays each other team exactly once. Of course, this means that, with $n$ teams, each team plays $n - 1$ games and, in total, there are $\binom{n}{2}$ games.

Although there is only one possible plan (all teams play each other team), there are many possible outcomes, or score sequences. For example, in a three-player tournament, there are two possible score sequences, i.e. 210 and 111, as displayed in Fig. 1. With 210, we have one team which has beaten both the other teams, one team which has lost to both the other teams and one team which has won once and lost once. With 111, all three teams have won once, in an intransitive cycle.

FIG. 1. The two score sequences for three teams.

Because of the possibility of different score sequences in round-robin tournaments, it is easy to construct a test of the 'equal-strength' null hypothesis. The basic method of construction of the distribution of the test statistic is to enumerate all possible score sequences and tabulate the test statistic in each case.

There are a number of appropriate test statistics that one might use to test the null hypothesis that all teams have equal strength, i.e. the probability of team $i$ beating team $j$ is 0.5 for all $i$ and $j$. One statistic is the number of wins by the best team and another statistic is the difference in the number of wins between the best and worst teams. In both these cases, we would reject the hypothesis if the test statistic were large. David (1959) discusses at length the 'equal-strength' hypothesis in round-robin tournaments.

To calculate the power of a particular test, we need to assume some alternative hypothesis. Unfortunately, the alternative hypothesis is quite complicated, because there are many ways for a set of teams to be of unequal strengths, and only one way is that none of the teams is equal in strength to any other. To make the problem more tractable, we will assume some specific structure on the probabilities of teams beating each other. Several attractive models are listed in the next section. By using these models, we do not want to imply that other models are not appropriate but, rather, that these models are simple to calculate and, in the absence of any other compelling models, these should demonstrate our technique.

4 Models

(1) All teams of equal strength: $p_{ij} = 0.5$, where $p_{ij}$ is the probability of team $i$ beating team $j$.

(2) The model of Bradley and Terry (1952):

$$p_{ij} = \frac{j}{i + j}$$

Note that $p_{ji} = 1 - p_{ij}$.

(3) The model of Schwertman et al. (1991):

$$p_{ij} = 0.5 + \frac{1}{2n}(j - i)$$

Again, note that $p_{ji} = 1 - p_{ij}$.
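As an illustrative sketch (not taken from the paper), the three models can be written as small Python functions. The function names and the use of exact fractions are choices made here; for model (3), the quantity $n$ is passed in explicitly rather than assumed.

```python
from fractions import Fraction


def p_equal(i, j):
    """Model (1): every pairing is a fair coin flip."""
    return Fraction(1, 2)


def p_bradley_terry(i, j):
    """Model (2): seed i beats seed j with probability j / (i + j)."""
    return Fraction(j, i + j)


def p_linear(i, j, n):
    """Model (3): 0.5 + (j - i) / (2n), with n supplied by the caller."""
    return Fraction(1, 2) + Fraction(j - i, 2 * n)


# The complement property p_ji = 1 - p_ij, checked for seeds 1..4.
for i in range(1, 5):
    for j in range(1, 5):
        if i != j:
            assert p_bradley_terry(i, j) + p_bradley_terry(j, i) == 1
            assert p_linear(i, j, 4) + p_linear(j, i, 4) == 1

print(p_bradley_terry(1, 4), p_bradley_terry(2, 3), p_bradley_terry(1, 2))
```

For instance, `p_bradley_terry(1, 4)` evaluates to 4/5, i.e. the value 0.8 used for seed 1 against seed 4 in the four-team power calculation.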


Each of these models preserves transitivity, i.e. if $p_{ij} > 0.5$ and $p_{jk} > 0.5$, then $p_{ik} > 0.5$. For each model, we can calculate the chance of observing a particular outcome. For example, using the 'equal-strength' model, $\Pr(210) = 6/8$. (This can be calculated by noting that there are three choices for the team that wins two games, two remaining choices for the team that wins one game, and one choice for the last team, or $3 \times 2 \times 1 = 6$. There are three games, so there are $2^3 = 8$ total possibilities.) Similarly, $\Pr(111) = 2/8$. David (1959) has counted the score sequences for round-robin tournaments for up to eight teams.
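The count above can be checked by brute force. The following sketch, assuming fair and independent games (the 'equal-strength' model), enumerates every outcome of a small round robin and tallies the sorted win counts. The function name is hypothetical and the approach is only practical for a handful of teams.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations, product


def score_sequence_distribution(n_teams):
    """Null distribution of round-robin score sequences when all games are fair."""
    games = list(combinations(range(n_teams), 2))
    counts = Counter()
    # Each game has two possible winners, so there are 2**len(games) outcomes.
    for results in product((0, 1), repeat=len(games)):
        wins = [0] * n_teams
        for (a, b), r in zip(games, results):
            wins[a if r == 0 else b] += 1
        counts[tuple(sorted(wins, reverse=True))] += 1
    total = 2 ** len(games)
    return {seq: Fraction(c, total) for seq, c in counts.items()}


dist = score_sequence_distribution(3)
for seq, prob in sorted(dist.items(), reverse=True):
    print(seq, prob)  # probabilities print in lowest terms (6/8 appears as 3/4)
```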

5 Single-elimination tournaments

A single-elimination tournament is a knockout tournament where teams are eliminated after losing one game, and game winners continue to play until all but one of the teams have been eliminated. This single remaining team is the winner of the tournament. The single-elimination tournament has a number of differences from the round-robin tournament. First, of course, is the fact that fewer games are played. Second, whereas each team in the round-robin tournament plays exactly $t - 1$ games, in the single-elimination tournament, each team does not play a prespecified number of games. Further, there are different schemes by which we can pair up future game opponents. These different 'structures' are listed in Fig. 2. Also, for the balanced structures (such as structure C), there is only one possible score sequence. For example, with $t = 4$ and using structure C, the winner will have two wins and zero losses; one team will have one win and one loss; and two teams will have one loss each. With $t = 8$, the winner will have three wins and zero losses; one team will have two wins and a loss; two teams will have one win and one loss; and four teams will have one loss each. As a result of these unique score sequences, it is not possible to use the score sequence itself as the basis for a test statistic for all the tournament structures. Also, it seems that the hypothesis of equal-strength teams would be difficult to detect, because each tournament determines a winner, even if no differences exist.

To illustrate, suppose that a penny, a nickel, a dime and a quarter are 'competing' in a coin-flipping contest. It is decided to use structure C and the first round is with the penny versus the nickel and the dime versus the quarter. The coins are flipped until one coin in each pair shows a head and the other shows a tail. The coin with heads is declared the 'winner' of the match. Under such a scheme, one coin will be determined to be the best heads-showing coin, even though all the coins have the same chance of showing heads.
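A short enumeration makes the point concrete. This is a sketch under the assumption that every match is a fair 50/50 contest and that matches are independent; the function name is hypothetical. It lists all eight ways the three matches can go and accumulates each coin's chance of being declared champion.

```python
from fractions import Fraction
from itertools import product

coins = ["penny", "nickel", "dime", "quarter"]


def champion_distribution(entrants):
    """Chance that each entrant wins a four-entrant knockout with fair matches."""
    a, b, c, d = entrants
    chances = {name: Fraction(0) for name in entrants}
    # One binary choice per match: two first-round matches and the final.
    for first, second, final in product((0, 1), repeat=3):
        finalist_1 = (a, b)[first]
        finalist_2 = (c, d)[second]
        champion = (finalist_1, finalist_2)[final]
        chances[champion] += Fraction(1, 8)
    return chances


chances = champion_distribution(coins)
for coin, chance in chances.items():
    print(coin, chance)
```

Every coin ends up with the same chance, yet any single run of the contest still crowns exactly one of them.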

Despite this discouraging example, tournaments do give us some information. In many tournaments, teams are 'seeded' into the structure. The seeding is supposed to reward better teams, by giving them either a bye (a round where they play no game) or easier opponents as they proceed through the tournament. The most common example is the plan using structure C, in which the highest seeded team plays the lowest seeded team in one of the first-round games and the other two teams meet in the other first-round game. Horen and Riezman (1985) showed that, under such a scheme and assuming stochastic transitivity among the four teams, the 'best' team has the greatest chance of winning the tournament. (Stochastic transitivity states that the probabilities of teams beating each other are transitive, so that we can identify a 'best' team.) Thus, if we are willing to assume that the seeding has been carried out correctly, so that the number one seed is really the best team, then the results of the tournament give us some evidence as to whether or not the seedings seem plausible. This forms the basis for a hypothesis test.

FIG. 2. Single-elimination structures for 2-6 teams.

As our null hypothesis, we will assume that all the teams are of equal strength and that the seedings assigned to them have been attributed at random, i.e. that they are meaningless. The alternative hypothesis is that the seedings are meaningful and that lower seeding numbers denote better teams. To calculate the null distributions of our test statistics, we must list all possible equally likely outcomes and tabulate the probability of each possible value of the test statistic. The test statistic that we explore is the sum, over all $n - 1$ games, of the differences between the winner's seeding and the loser's seeding. There are clearly other choices for a test statistic; we have pursued ours as a matter of convenience.
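In symbols, the statistic just described can be written as below. The names $T$, $w_g$ and $\ell_g$ are introduced here for illustration and are not the author's notation: $w_g$ is the seed of the winner of game $g$ and $\ell_g$ the seed of the loser.

```latex
T = \sum_{g=1}^{n-1} \left( w_g - \ell_g \right)
```

For example, in a four-team tournament where seed 1 beats seed 4, seed 2 beats seed 3, and seed 1 then beats seed 2, we obtain $T = (1 - 4) + (2 - 3) + (1 - 2) = -5$. Each game won by the favorite contributes a negative term, and each upset contributes a positive one.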

For structure C (with four teams) and the seeding already mentioned (1432, shown in Fig. 3), there are $2^3 = 8$ possible outcomes. These outcomes are listed in Table 1, including the calculation of the test statistic. By examining the first row, we see that, when the favorite wins each game, the test statistic is small (more negative); thus, for small values, we would conclude the alternative hypothesis that the seedings are correct. What we have described is a one-sided test procedure: if the test statistic is small, then we reject the null hypothesis and, if the test statistic is large, we 'accept' the null hypothesis. Similar calculations follow for the other tournament structures.

FIG. 3. Structure for teams considered in Table 1.

TABLE 1. Sample space and calculation of test statistic

Winners of the three games   Winner seeding - loser seeding   Test statistic
121                          -3 - 1 - 1                       -5
122                          -3 - 1 + 1                       -3
131                          -3 + 1 - 2                       -4
133                          -3 + 1 + 2                        0
422                          +3 - 1 - 2                        0
424                          +3 - 1 + 2                        4
433                          +3 + 1 - 1                        3
434                          +3 + 1 + 1                        5

If we reject our null hypothesis, then we might wish to consider various power calculations. These calculations are quite complicated, because the 'non-equal-strengths' hypothesis has many possible elements. The approach that we take is to specify a particular alternative structure on the probabilities of teams beating each other. Again, we assume the three models: the equal-strength model (the null hypothesis), the Bradley-Terry model, and the model of Schwertman et al. We perform the power calculations as follows: for each particular model and structure being investigated, we calculate the probability of obtaining each value of the test statistic. Tables 2 and 3 display the calculations for $t = 4$ teams. For all the structures, we give the better teams more byes; thus, for structure D, the team seeded as number one will play one game.

To clarify how the numbers in Tables 2 and 3 were obtained, we detail the calculation for structure C using a type I error of 0.125 and the Bradley-Terry model. To obtain a type I error of 0.125, we must have the test statistic equal to $-5$. Thus, team 1 beat team 4, team 2 beat team 3, and team 1 beat team 2. Under the Bradley-Terry model, $p_{14} = 0.8$, $p_{23} = 0.6$ and $p_{12} = 2/3 \approx 0.667$. The product of these three probabilities is 0.32, as displayed in Table 2.
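The same calculation can be carried through for every critical value. The sketch below enumerates the eight outcomes of structure C, then computes the type I error under the equal-strength model and the power under an alternative for the rule 'reject if the statistic is at most c'. All function names are hypothetical. The linear model is included with its slope as a parameter; the default `n=5` is an assumption made here because it reproduces the structure C column of Table 3, and it is not a value stated in the text.

```python
from fractions import Fraction
from itertools import product


def bradley_terry(i, j):
    return Fraction(j, i + j)


def linear_model(i, j, n=5):
    # n=5 is an assumed value (see the lead-in), not taken from the text.
    return Fraction(1, 2) + Fraction(j - i, 2 * n)


def outcomes_structure_c(p):
    """Yield (statistic, probability) for the bracket 1 v 4, 2 v 3, winners meet."""
    for w1, w2 in product((1, 4), (2, 3)):
        l1, l2 = 5 - w1, 5 - w2  # the losers of the two first-round games
        for champ, runner_up in ((w1, w2), (w2, w1)):
            stat = (w1 - l1) + (w2 - l2) + (champ - runner_up)
            prob = p(w1, l1) * p(w2, l2) * p(champ, runner_up)
            yield stat, prob


def size_and_power(p_alt):
    """Map each critical value c to (type I error, power) for 'reject if T <= c'."""
    null = list(outcomes_structure_c(lambda i, j: Fraction(1, 2)))
    alt = list(outcomes_structure_c(p_alt))
    table = {}
    for c in sorted({s for s, _ in null}):
        size = sum(pr for s, pr in null if s <= c)
        power = sum(pr for s, pr in alt if s <= c)
        table[c] = (size, power)
    return table


table = size_and_power(bradley_terry)
for c, (size, power) in table.items():
    print(f"c = {c:+d}  size = {float(size):.3f}  power = {float(power):.3f}")

# Critical value minimising type I error + type II error = size + (1 - power).
best_c = min(table, key=lambda c: table[c][0] + 1 - table[c][1])
print("minimum-sum critical value:", best_c)
```

Under the Bradley-Terry model this reproduces the structure C column of Table 2, and the minimum-sum rule selects the critical value $-3$ for that column.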

From Tables 2 and 3, we can conclude that, for small values of the type I error rate, structure C gives a better tournament, at least when using the models of Bradley and Terry and Schwertman et al. Further, Figs 4 and 5 display this information graphically. We continue the comparisons of structures for five teams in Tables 4 and 5. Again, for all the structures, we give the better teams more byes; thus, for structure D, the team seeded as number one will only play one game.

From the tables and figures (see also Figs 6 and 7), we can make several observations. First, we note that structure G, which is the 'king-of-the-hill' style of tournament, is less powerful than the other structures under both models. Second, there seems to be no obvious preference between the other two structures, although the Bradley-Terry model shows structure F to be slightly more powerful. Note that this conclusion is consistent with that of the four-team tournaments, i.e. that the 'king-of-the-hill' style of tournaments (structures D and G) are less powerful for detecting a false hypothesis. One explanation for this phenomenon is that these tournaments heavily favor the number one seed, and major upsets are less likely to occur in these tournaments, because the low-seeded teams rarely play the top-seeded teams. (Note that major upsets are the way that we reject the null hypothesis.)

TABLE 2. Four-team power calculations, assuming Bradley-Terry model

Probability of     Power under model
type I error       Structure C     Structure D
0.125              0.320 (-5)      0.229 (-3)
0.250              0.560 (-4)      -
0.375              0.720 (-3)      0.590 (-2)
0.500              -               0.705 (-1)
0.625              0.880 (0)       -
0.750              0.926 (3)       0.914 (0)
0.875              0.966 (4)       0.971 (2)
1.000              1.000 (5)       1.000 (6)

Notes: Numbers in parentheses are critical values. A dash marks a type I error level that the structure does not attain. Underlined values are those that minimize the sum of the type I and type II error rates.

TABLE 3. Four-team power calculations, assuming model of Schwertman et al.

Probability of     Power under model
type I error       Structure C     Structure D
0.125              0.288 (-5)      0.216 (-3)
0.250              0.512 (-4)      -
0.375              0.704 (-3)      0.552 (-2)
0.500              -               0.696 (-1)
0.625              0.884 (0)       -
0.750              0.932 (3)       0.904 (0)
0.875              0.968 (4)       0.976 (2)
1.000              1.000 (5)       1.000 (6)

Notes: Numbers in parentheses are critical values. A dash marks a type I error level that the structure does not attain. Underlined values are those that minimize the sum of the type I and type II error rates.

FIG. 4. Power versus size curve for the Bradley-Terry model.

FIG. 5. Power versus size curve for the model of Schwertman et al.

TABLE 4. Five-team power calculations, assuming Bradley-Terry model

Probability of     Power under model
type I error       Structure F     Structure G     Structure E
0.0625             0.163 (-6)      0.127 (-4)      0.178 (-6)
0.1875             0.422 (-5)      -               0.459 (-5)
0.2500             0.503 (-4)      0.439 (-3)      -
0.3125             -               -               0.659 (-4)
0.3750             0.653 (-3)      0.586 (-2)      0.733 (-3)
0.4375             0.730 (-2)      -               -
0.5000             0.770 (-1)      -               0.822 (-1)
0.5625             -               -               0.859 (0)
0.6250             0.876 (0)       0.837 (-1)      0.891 (1)
0.6875             0.921 (1)       -               0.916 (2)
0.7500             0.947 (2)       0.909 (1)       0.939 (3)
0.8125             0.967 (3)       -               -
0.8750             0.981 (6)       0.976 (2)       0.976 (4)
0.9375             0.991 (8)       0.992 (5)       0.989 (7)
1.0000             1.000 (9)       1.000 (10)      1.000 (8)

Notes: Numbers in parentheses are critical values. A dash marks a type I error level that the structure does not attain. Underlined values are those that minimize the sum of the type I and type II error rates.

TABLE 5. Five-team power calculations, assuming model of Schwertman et al.

Probability of     Power under model
type I error       Structure F     Structure G     Structure E
0.0625             0.173 (-6)      0.130 (-4)      0.173 (-6)
0.1875             0.422 (-5)      -               0.437 (-5)
0.2500             0.557 (-4)      0.432 (-3)      -
0.3125             -               -               0.653 (-4)
0.3750             0.723 (-3)      0.597 (-2)      0.739 (-3)
0.4375             0.782 (-2)      -               -
0.5000             0.840 (-1)      -               0.847 (-1)
0.5625             -               -               0.890 (0)
0.6250             0.910 (0)       0.846 (-1)      0.910 (1)
0.6875             0.939 (1)       -               0.938 (2)
0.7500             0.964 (2)       0.928 (1)       0.960 (3)
0.8125             0.984 (3)       -               -
0.8750             0.993 (6)       0.983 (2)       0.990 (4)
0.9375             0.997 (8)       0.998 (5)       0.995 (7)
1.0000             1.000 (9)       1.000 (10)      1.000 (8)

Notes: Numbers in parentheses are critical values. A dash marks a type I error level that the structure does not attain. Underlined values are those that minimize the sum of the type I and type II error rates.

FIG. 6. Power versus size curve for the Bradley-Terry model.

After tabulating the results for six teams, we constructed Table 6, which summarizes the null distributions of the test statistic. If these distributions are shown graphically, then one notices that they are reasonably symmetric and that, as $t$ becomes larger, the curves become more symmetric. In addition, we produced (by 'brute force') the distributions for $t = 8$ and $t = 16$ teams. The graphs of these distributions are shown in Figs 8 and 9, and suggest that the asymptotic distribution might be normal.
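A brute-force enumeration of this kind can be sketched as follows. Two assumptions are made here that the text does not spell out: the balanced structure is taken to use the standard seeded bracket (1 v 8, 4 v 5, 2 v 7, 3 v 6 for eight teams), and the variance is computed with the $N - 1$ divisor over the enumerated outcomes, which is the convention that gives 14.3 for structure C. The function names are hypothetical.

```python
from itertools import product
from statistics import mean, variance


def balanced_order(t):
    """Standard seeded bracket order, e.g. [1, 4, 2, 3] for t = 4 (t a power of two)."""
    order = [1]
    while len(order) < t:
        m = 2 * len(order) + 1
        order = [s for seed in order for s in (seed, m - seed)]
    return order


def null_statistics(t):
    """Test statistic for each of the 2**(t - 1) equally likely win/loss patterns."""
    stats = []
    for pattern in product((0, 1), repeat=t - 1):
        flips = iter(pattern)
        alive, total = balanced_order(t), 0
        while len(alive) > 1:
            survivors = []
            # Adjacent entries in the bracket order meet in each round.
            for a, b in zip(alive[0::2], alive[1::2]):
                winner, loser = (a, b) if next(flips) == 0 else (b, a)
                total += winner - loser
                survivors.append(winner)
            alive = survivors
        stats.append(total)
    return stats


for t in (4, 8):
    s = null_statistics(t)
    print(t, min(s), max(s), mean(s), round(variance(s), 1))
```

The same function runs for $t = 16$ (32 768 outcomes), although the variance it produces depends on the bracket and divisor assumptions noted above.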

6 Conclusions and remaining questions

FIG. 7. Power versus size curve for the model of Schwertman et al.

TABLE 6. Summary of test statistic distributions

t     Structure             Mean    Variance    Minimum    Maximum
4     C                     0       14.3        -5         5
4     D                     0       8.3         -3         6
5     E                     0       19.2        -6         8
5     F                     0       21.3        -6         9
5     G                     0       12.7        -4         10
6     H                     0       24.3        -7         12
6     I                     0       38.3        -9         13
6     J                     0       27.5        -7         13
6     K                     0       24.8        -8         10
6     L                     0       28.1        -7         14
6     M                     0       17.6        -5         15
8     Balanced structure    0       116.4       -21        21
16    Balanced structure    0       977.7       -85        85

Our test statistic can answer several questions. First, we can decide if, in a given tournament, the favorite teams have won in a probabilistically acceptable way. Second, we can assess the power of various tournament structures in deciding if the favorites have won. We conclude by raising the unanswered questions that this work raises. Although some of these questions are well-defined and perhaps easy to answer, others may have more elusive solutions.

(1) Are other alternative probability models reasonable? Of course, the answer is

yes, but perhaps there are some particularly appealing alternative probability

models that we have overlooked. Again, it should be mentioned that the

choice of the models examined in this paper was arbitrary and they were

chosen only to demonstrate the techniques involved.

(2) Can we show for all (transitive) preference matrices that one structure 'dominates' any other under some alternative model? In other words, is the king-of-the-hill tournament always less powerful? This question points the way to new research on which tournament structures are 'best'.

(3) Does the test statistic have an asymptotically normal distribution? If so, is there a closed-form formula for the variance? We desire asymptotic normality, because it would make the construction of a type I error level exceedingly simple; one would just look up the desired area on a normal table, and use the variance to convert that normal score to a critical value for the test statistic.

(4) Is there a 'better' test statistic, i.e. is there a more powerful test? We could employ techniques to develop a 'most powerful test', but we have not yet attempted this.

FIG. 8. Cumulative distribution functions for eight teams.

FIG. 9. Cumulative distribution functions for 16 teams.
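The look-up described in question (3) can be sketched in a few lines. This assumes the conjectured normal approximation actually holds, which the paper leaves open, and it uses the null mean of 0 and the variance reported in Table 6; the function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def approx_critical_value(alpha, null_variance, null_mean=0.0):
    """Lower-tail cut-off c with P(T <= c) roughly alpha, if T were normal."""
    z = NormalDist().inv_cdf(alpha)  # standard normal score for the chosen level
    return null_mean + z * sqrt(null_variance)


# 16-team balanced structure, variance 977.7 taken from Table 6.
c16 = approx_critical_value(0.05, 977.7)
print(round(c16, 1))
```

Because the test statistic takes only integer values, one would in practice round such a cut-off and check the attained type I error against the exact null distribution where that is available.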

REFERENCES

BRADLEY, R. A. & TERRY, M. E. (1952) Rank analysis of incomplete block designs, Biometrika, 39, pp. 324-345.

DAVID, H. A. (1959) Tournaments and paired comparisons, Biometrika, 46, pp. 139-149.

HOREN, J. & RIEZMAN, R. (1985) Comparing draws for single elimination tournaments, Operations Research, 33, pp. 249-262.

SCHWERTMAN, N. C., MCCREADY, T. A. & HOWARD, L. (1991) Probability models for the NCAA regional basketball tournaments, American Statistician, 45, pp. 35-38.
