Non-parametric procedure for knockout tournaments
This article was downloaded by: [Universitat Politècnica de València]
On: 25 October 2014, At: 16:25
Publisher: Taylor & Francis
Informa Ltd Registered in England and Wales Registered Number: 1072954
Registered office: Mortimer House, 37-41 Mortimer Street, London W1T 3JH, UK
Journal of Applied Statistics
Publication details, including instructions for authors and subscription information: http://www.tandfonline.com/loi/cjas20
Non-parametric procedure for knockout tournaments
Christopher Todd Edwards
Published online: 02 Aug 2010.
To cite this article: Christopher Todd Edwards (1998) Non-parametric procedure for knockout tournaments, Journal of Applied Statistics, 25:3, 375-385, DOI: 10.1080/02664769823106
To link to this article: http://dx.doi.org/10.1080/02664769823106
Journal of Applied Statistics, Vol. 25, No. 3, 1998, 375-385
Non-parametric procedure for knockout tournaments
CHRISTOPHER TODD EDWARDS, Mathematics Department, University of
Wisconsin Oshkosh, USA
SUMMARY In a seeded knockout tournament, where teams have some preassigned
strength, do we have any assurances that the best team in fact has won? Is there some
insight to be gained by considering which teams beat which other teams, examining only
the seeds? We pose an answer to these questions by using the difference in the seeds of the
two players as the basis for a test statistic. We offer several models for the underlying
probability structure to examine the null distribution and power functions and determine
these for small tournaments (fewer than five teams). One structure each for 8 teams and
16 teams is examined, and we conjecture an asymptotic normal distribution for the test
statistic.
1 Introduction
Many sports organizations use tournaments to select their champions. A well-known
instance of this is the annual 'March Madness' National Collegiate Athletic
Association (NCAA) men's and women's college basketball tournament, where the
champions are selected using a 64-team single-elimination tournament. This
tournament essentially comprises four regional 16-team tournaments, where the
participants are initially ranked, or 'seeded', from 1 (the best team) to 16 (the
worst team). Other well-known applications of single-elimination tournaments are
those in professional tennis, basketball and football.
One natural question for observers is whether or not the best team has won.
One might suspect that, if the number 16 seed (supposedly the weakest team in
the regional tournament) wins its regional tournament, then one of two things has
occurred: the seedings were incorrect or some major upsets occurred.
This paper explores a method of testing the hypothesis that the initial seedings
are meaningful, i.e. that the teams that won were the favorites. We contrast this
with the hypothesis that all the teams are of equal strength, so that the seedings
are meaningless. We make a comparison with round-robin tournaments, where it
is possible to test for the strongest team, based on its winning percentage. (We
assume that the reader has had some exposure to statistical inference at the
undergraduate level.)
Correspondence: C. T. Edwards, Mathematics Department, University of Wisconsin Oshkosh, 800
Algoma Boulevard, Oshkosh, WI 54901, USA. Tel: 920 424 1333.
0266-4763/98/030375-11 $7.00 © 1998 Carfax Publishing Ltd
2 General hypothesis testing
Before we describe our particular hypothesis test procedure, we describe statistical
hypothesis testing in general. In a typical setting, we test whether or not a
population parameter is equal to some prespecified value. The test statistic is
formed by some theoretical (or intuitive) criteria, and the value of the test statistic
is calculated from the data. In some non-parametric settings, the null hypothesis is
a statement about a situation, e.g. that the data are independent, and, as such, the
hypothesis does not actually involve parameters.
There are two types of error that can be made in hypothesis testing. We can
falsely reject a true hypothesis (a type I error), or we can fail to reject an incorrect
hypothesis (a type II error). Clearly, we would like to make as few mistakes as
possible. However, in practice, it is usually not possible to have both error rates
simultaneously small without large samples. Typically, we (arbitrarily) select an
acceptable error rate for the type I error. To derive the rejection region or decision
rule, we select an acceptable test statistic and a cut-off which gives us our
preselected acceptable error rate under the assumption that the null hypothesis
is true.
There are still many possible test statistics that one could choose, so we need
additional criteria to select a 'best' test. With the type I error rate fixed, a typical
approach is to try to maximize the power, i.e. one minus the type II error rate. In
practice, type II errors are not as carefully studied, mainly because of the complexity
of composite hypotheses. Another strategy is to minimize the sum of the type I
and type II error rates (for an appropriately chosen alternative hypothesis). One
could also try the 'minimax' approach, i.e. try to minimize the maximum of the
type I and type II error rates. We will explore the maximum power and minimum
sum techniques later in the paper.
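As a minimal sketch of the cut-off selection just described, assume a discrete test statistic whose null values are equally likely and a one-sided test that rejects for small values. The function name `critical_value` and the six example values are hypothetical and serve only as illustration.

```python
from fractions import Fraction

def critical_value(null_values, alpha):
    """Return (c, size) for the largest cut-off c with P(T <= c) <= alpha.

    null_values lists the equally likely values of T under the null
    hypothesis. Returns None when no cut-off attains the requested level.
    """
    n = len(null_values)
    best = None
    for c in sorted(set(null_values)):
        size = Fraction(sum(v <= c for v in null_values), n)
        if size <= alpha:
            best = (c, size)
    return best

# Hypothetical null distribution, for illustration only.
example = [-2, -1, 0, 0, 1, 2]
print(critical_value(example, Fraction(1, 5)))
```

Because the statistic is discrete, the attained type I error rate (here 1/6) can fall below the requested level (1/5).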
3 Round-robin tournaments
Before we discuss the single-elimination tournament, we digress to a discussion
about the more well-known round-robin tournament. The following discussion
centers around the fact that, in a round-robin tournament, one is able to test
whether or not there is a 'best' team. This feature does not exist in
single-elimination tournaments, as we will see later. A round-robin tournament is one
where each team plays each other team exactly once. Using graph theory, these
tournaments can be represented with directed graphs, where teams are the vertices
and the directed line segments indicate the winner of each pairing. One feature of
the round-robin tournament is that there is only one possible arrangement of
games to be played. Of course, this
means that each team plays $n - 1$ games and, in total, there are $\binom{n}{2}$ games.
Although there is only one possible plan (all teams play each other team), there
are many possible outcomes, or score sequences. For example, in a three-player
tournament, there are two possible score sequences, i.e. 210 and 111, as displayed
in Fig. 1. With 210, we have one team which has beaten both the other teams, one
team which has lost to both the other teams and one team which has won once
and lost once. With 111, all three teams have won once, in an intransitive cycle.
FIG. 1. The two score sequences for three teams.
Because of the possibility of different score sequences in round-robin tournaments,
it is easy to construct a test of the 'equal-strength' null hypothesis. The
basic method of construction of the distribution of the test statistic is to enumerate
all possible score sequences and tabulate the test statistic in each case.
There are a number of appropriate test statistics that one might use to test the
null hypothesis that all teams have equal strength, i.e. the probability of team $i$
beating team $j$ is 0.5 for all $i$ and $j$. One statistic is the number of wins by the best
team and another statistic is the difference in the number of wins between the best
and worst teams. In both these cases, we would reject the hypothesis if the test
statistic were large. David (1959) discusses at length the 'equal-strength' hypothesis
in round-robin tournaments.
To calculate the power of a particular test, we need to assume some alternative
hypothesis. Unfortunately, the alternative hypothesis is quite complicated, because
there are many ways for a set of teams to be of unequal strengths, and only one
way is that none of the teams is equal in strength to any other. To make the
problem more tractable, we will assume some specific structure on the probabilities
of teams beating each other. Several attractive models are listed in the next section.
By using these models, we do not want to imply that other models are not
appropriate but, rather, that these models are simple to calculate and, in the
absence of any other compelling models, these should demonstrate our technique.
4 Models
(1) All teams of equal strength: $p_{ij} = 0.5$, where $p_{ij}$ is the probability of team $i$
beating team $j$.
(2) The model of Bradley and Terry (1952):
$$p_{ij} = \frac{j}{i + j}$$
Note that $p_{ji} = 1 - p_{ij}$.
(3) The model of Schwertman et al. (1991):
$$p_{ij} = 0.5 + \frac{1}{2n}(j - i)$$
Again, note that $p_{ji} = 1 - p_{ij}$.
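The three models can be written as small functions of the seeds, which makes the complement property easy to check. This is only a sketch: the function names are hypothetical, exact fractions are used to avoid rounding, and for the third model `n` is passed in explicitly exactly as it appears in the formula, so the value supplied for it is the caller's assumption.

```python
from fractions import Fraction

def equal_strength(i, j):
    """Model (1): every team beats every other with probability 1/2."""
    return Fraction(1, 2)

def bradley_terry(i, j):
    """Model (2): probability that seed i beats seed j is j / (i + j)."""
    return Fraction(j, i + j)

def schwertman(i, j, n):
    """Model (3): 0.5 + (j - i) / (2n), with n supplied by the caller."""
    return Fraction(1, 2) + Fraction(j - i, 2 * n)

# Complement check: p_ji = 1 - p_ij for the seeds 1..4.
for i in range(1, 5):
    for j in range(1, 5):
        if i != j:
            assert bradley_terry(j, i) == 1 - bradley_terry(i, j)
            assert schwertman(j, i, 4) == 1 - schwertman(i, j, 4)

print(bradley_terry(1, 4), bradley_terry(2, 3), bradley_terry(1, 2))
```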
Each of these models preserves transitivity, i.e. if $p_{ij} > 0.5$ and $p_{jk} > 0.5$, then
$p_{ik} > 0.5$. For each model, we can calculate the chance of observing a particular
outcome. For example, using the 'equal-strength' model, $\Pr(210) = 6/8$. (This can
be calculated by noting that there are three choices for the team that wins two
games, two remaining choices for the team that wins one game, and one choice for
the last team, or $3 \times 2 \times 1 = 6$. There are three games, so there are $2^3 = 8$ total
possibilities.) Similarly, $\Pr(111) = 2/8$. David (1959) has counted the score
sequences for round-robin tournaments for up to eight teams.
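The counts 6 and 2 can be confirmed by listing all eight outcomes of a three-team round robin. The sketch below is one way to do this; the function name `score_sequence_counts` is hypothetical, and teams are labelled 0, 1, 2 purely for indexing.

```python
from collections import Counter
from itertools import combinations, product

def score_sequence_counts(n_teams):
    """Count how many round-robin outcomes produce each score sequence."""
    games = list(combinations(range(n_teams), 2))  # every pair meets once
    counts = Counter()
    for results in product((0, 1), repeat=len(games)):
        wins = [0] * n_teams
        for (a, b), r in zip(games, results):
            wins[a if r == 0 else b] += 1  # r == 0 means the first team won
        counts[tuple(sorted(wins, reverse=True))] += 1
    return counts

counts = score_sequence_counts(3)
print(counts[(2, 1, 0)], counts[(1, 1, 1)])  # 6 2
```

Under the equal-strength model every outcome has probability 1/8, so these counts give Pr(210) = 6/8 and Pr(111) = 2/8.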
5 Single-elimination tournaments
A single-elimination tournament is a knockout tournament where teams are
eliminated after losing one game, and game winners continue to play until all but
one of the teams have been eliminated. This single remaining team is the winner
of the tournament. The single-elimination tournament has a number of differences
from the round-robin tournament. First, of course, is the fact that fewer games are
played. Second, whereas each team in the round-robin tournament plays exactly
$t - 1$ games, in the single-elimination tournament, each team does not play a
prespecified number of games. Further, there are different schemes by which we
can pair up future game opponents. These different 'structures' are listed in Fig. 2.
Also, for the balanced structures (such as structure C), there is only one possible
score sequence. For example, with $t = 4$ and using structure C, the winner will
have two wins and zero losses; one team will have one win and one loss; and two
teams will have one loss each. With $t = 8$, the winner will have three wins and zero
losses; one team will have two wins and a loss; two teams will have one win and
one loss; and four teams will have one loss each. As a result of these unique score
sequences, it is not possible to use the score sequence itself as the basis for a test
statistic for all the tournament structures. Also, it seems that the hypothesis of
equal-strength teams would be difficult to detect, because each tournament
determines a winner, even if no differences exist.
To illustrate, suppose that a penny, a nickel, a dime and a quarter are 'competing'
in a coin-flipping contest. It is decided to use structure C and the first round is
with the penny versus the nickel and the dime versus the quarter. The coins are
flipped until one coin in each pair shows a head and the other shows a tail. The
coin with heads is declared the 'winner' of the match. Under such a scheme, one
coin will be determined to be the best heads-showing coin, even though all the
coins have the same chance of showing heads.
Despite this discouraging coin example, tournaments do give us some
information. In many tournaments, teams are 'seeded' into the structure. The
seeding is supposed to reward better teams, by giving them either a bye (a round
where they play no game) or easier opponents as they proceed through the
tournament. The most common example is the plan using structure C, in which
the highest seeded team plays the lowest seeded team in one of the first-round
games and the other two teams meet in the other first-round game. Horen and
Riezman (1985) showed that, under such a scheme and assuming stochastic
transitivity among the four teams, the 'best' team has the greatest chance of
winning the tournament. (Stochastic transitivity states that the probabilities of
teams beating each other are transitive, so that we can identify a 'best' team.)
Thus, if we are willing to assume that the seeding has been carried out correctly,
so that the number one seed is really the best team, then the results of the
tournament give us some evidence as to whether or not the seedings seem plausible.
This forms the basis for a hypothesis test.
FIG. 2. Single-elimination structures for 2-6 teams.
As our null hypothesis, we will assume that all the teams are of equal strength
and that the seedings assigned to them have been attributed at random, i.e. that
they are meaningless. The alternative hypothesis is that the seedings are meaningful
and that lower seeding numbers denote better teams. To calculate the null
distributions of our test statistics, we must list all possible equally likely outcomes
and tabulate the chance of each possible test statistic. The test statistic that we
explore is the sum, over all $n - 1$ games, of the differences between the winner's
seeding and the loser's seeding. There are clearly other choices for a test statistic;
we have pursued ours as a matter of convenience.
For structure C (with four teams) and the seeding already mentioned (1432;
shown in Fig. 3), there are $2^3 = 8$ possible outcomes. These outcomes are
listed in Table 1, including the calculation of the test statistic. By examining the
first row, we see that, when the favorite wins each game, the test statistic is
small (more negative); thus, for small values, we would conclude in favor of the
alternative hypothesis that the seedings are correct. What we have described is a
one-sided test procedure: if the test statistic is small, then we reject the null
hypothesis and, if the test statistic is large, we 'accept' the null hypothesis.
Similar calculations follow for the other tournament structures.
FIG. 3. Structure for teams considered in Table 1.
TABLE 1. Sample space and calculation of test statistic
Winners of the    Winner seeding
three games       - loser seeding    Test statistic
121               -3 -1 -1           -5
122               -3 -1 +1           -3
131               -3 +1 -2           -4
133               -3 +1 +2            0
422               +3 -1 -2            0
424               +3 -1 +2            4
433               +3 +1 -1            3
434               +3 +1 +1            5
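The sample space in Table 1 is small enough to generate directly. The following sketch assumes structure C pairs seed 1 with seed 4 and seed 2 with seed 3 in the first round, with the two winners meeting in the final; the names `FIRST_ROUND` and `structure_c_outcomes` are hypothetical.

```python
from itertools import product

# Assumed first-round pairings for structure C: (1 v 4) and (2 v 3).
FIRST_ROUND = [(1, 4), (2, 3)]

def structure_c_outcomes():
    """List (winners, per-game differences, test statistic) for all 8 outcomes."""
    rows = []
    for w1, w2, pick in product(FIRST_ROUND[0], FIRST_ROUND[1], (0, 1)):
        l1 = sum(FIRST_ROUND[0]) - w1   # loser of the first game
        l2 = sum(FIRST_ROUND[1]) - w2   # loser of the second game
        champion = (w1, w2)[pick]
        runner_up = (w1, w2)[1 - pick]
        # Each difference is winner seeding minus loser seeding.
        diffs = (w1 - l1, w2 - l2, champion - runner_up)
        rows.append(((w1, w2, champion), diffs, sum(diffs)))
    return rows

for winners, diffs, stat in structure_c_outcomes():
    print(winners, diffs, stat)
```

The rows appear in a slightly different order from Table 1, but the eight test statistics are the same.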
If we reject our null hypothesis, then we might wish to consider various power
calculations. These calculations are quite complicated, because the
'non-equal-strengths' hypothesis has many possible elements. The approach that we take is to
specify a particular alternative structure on the probabilities of teams beating each
other. Again, we assume the three models: the equal-strength model (the null
hypothesis), Bradley-Terry model, and the model of Schwertman et al. We perform
the power calculations as follows: for each particular model and structure being
investigated, we calculate the probability of obtaining each value of the test statistic.
Tables 2 and 3 display the calculations for $t = 4$ teams. For all the structures, we
give the better teams more byes; thus, for structure D, the team seeded as number
one will play one game.
To clarify how the numbers in Tables 2 and 3 were obtained, we detail the
calculation for structure C using a type I error of 0.125 and the Bradley-Terry
model. To obtain a type I error of 0.125, we must have the test statistic equal to
$-5$. Thus, team 1 beat team 4, team 2 beat team 3, and team 1 beat team 2.
Under the Bradley-Terry model, $p_{14} = 0.8$, $p_{23} = 0.6$ and $p_{12} = 2/3 \approx 0.667$. The product
of these three probabilities is 0.32, as displayed in Table 2.
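The same calculation can be extended to every critical value by accumulating outcome probabilities. This sketch assumes the structure C pairings (1 v 4, 2 v 3, winners meet) and the Bradley-Terry formula $p_{ij} = j/(i+j)$; the function names are hypothetical.

```python
from itertools import product

def bradley_terry(i, j):
    """Probability that seed i beats seed j under the Bradley-Terry model."""
    return j / (i + j)

def structure_c_distribution(p):
    """Return (test statistic, probability) for each of the 8 outcomes."""
    dist = []
    for w1, w2, pick in product((1, 4), (2, 3), (0, 1)):
        l1, l2 = 5 - w1, 5 - w2                      # first-round losers
        champ, runner = ((w1, w2), (w2, w1))[pick]   # result of the final
        stat = (w1 - l1) + (w2 - l2) + (champ - runner)
        prob = p(w1, l1) * p(w2, l2) * p(champ, runner)
        dist.append((stat, prob))
    return dist

def power(dist, critical_value):
    """Probability of rejecting, i.e. of a statistic at or below the cut-off."""
    return sum(prob for stat, prob in dist if stat <= critical_value)

dist = structure_c_distribution(bradley_terry)
for c in (-5, -4, -3, 0, 3, 4, 5):
    print(c, round(power(dist, c), 3))
```

The printed values can be compared with the structure C column of Table 2.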
From Tables 2 and 3, we can conclude that, for small values of the type I error
rate, structure C gives a better tournament, at least when using the models of
Bradley and Terry and Schwertman et al. Further, Figs 4 and 5 display this
information graphically. We continue the comparisons of structures for five teams
in Tables 4 and 5. Again, for all the structures, we give the better teams more byes;
thus, for structure D, the team seeded as number one will only play one game.
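The minimum-sum and minimax criteria from Section 2 can be applied mechanically to a column of such a power table. The sketch below uses the structure C values from Table 2 (Bradley-Terry model) as input; the function names `min_sum` and `minimax` are hypothetical.

```python
# (type I error rate, power) pairs for structure C, copied from Table 2.
table2_c = [
    (0.125, 0.320), (0.250, 0.560), (0.375, 0.720), (0.625, 0.880),
    (0.750, 0.926), (0.875, 0.966), (1.000, 1.000),
]

def min_sum(rows):
    """Row minimizing type I error + type II error, where type II = 1 - power."""
    return min(rows, key=lambda r: r[0] + (1 - r[1]))

def minimax(rows):
    """Row minimizing the larger of the type I and type II error rates."""
    return min(rows, key=lambda r: max(r[0], 1 - r[1]))

print(min_sum(table2_c), minimax(table2_c))
```

For this column both criteria select the same row, the one with a type I error rate of 0.375.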
TABLE 2. Four-team power calculations, assuming Bradley-Terry model
Probability of    Power under model
type I error      Structure C     Structure D
0.125             0.320 (-5)      0.229 (-3)
0.250             0.560 (-4)
0.375             0.720 (-3)      0.590 (-2)
0.500                             0.705 (-1)
0.625             0.880 (0)
0.750             0.926 (3)       0.914 (0)
0.875             0.966 (4)       0.971 (2)
1.000             1.000 (5)       1.000 (6)
Notes: Numbers in parentheses are critical values. Underlined
values are those that minimize the sum of the type I and type
II error rates.
TABLE 3. Four-team power calculations, assuming model of Schwertman et al.
Probability of    Power under model
type I error      Structure C     Structure D
0.125             0.288 (-5)      0.216 (-3)
0.250             0.512 (-4)
0.375             0.704 (-3)      0.552 (-2)
0.500                             0.696 (-1)
0.625             0.884 (0)
0.750             0.932 (3)       0.904 (0)
0.875             0.968 (4)       0.976 (2)
1.000             1.000 (5)       1.000 (6)
Notes: Numbers in parentheses are critical values. Underlined
values are those that minimize the sum of the type I and type
II error rates.
FIG. 4. Power versus size curve for the Bradley-Terry model.
FIG. 5. Power versus size curve for the model of Schwertman et al.
TABLE 4. Five-team power calculations, assuming Bradley-Terry model
Probability of    Power under model
type I error      Structure F     Structure G     Structure E
0.0625            0.163 (-6)      0.127 (-4)      0.178 (-6)
0.1875            0.422 (-5)                      0.459 (-5)
0.2500            0.503 (-4)      0.439 (-3)
0.3125                                            0.659 (-4)
0.3750            0.653 (-3)      0.586 (-2)      0.733 (-3)
0.4375            0.730 (-2)
0.5000            0.770 (-1)                      0.822 (-1)
0.5625                                            0.859 (0)
0.6250            0.876 (0)       0.837 (-1)      0.891 (1)
0.6875            0.921 (1)                       0.916 (2)
0.7500            0.947 (2)       0.909 (1)       0.939 (3)
0.8125            0.967 (3)
0.8750            0.981 (6)       0.976 (2)       0.976 (4)
0.9375            0.991 (8)       0.992 (5)       0.989 (7)
1.0000            1.000 (9)       1.000 (10)      1.000 (8)
Notes: Numbers in parentheses are critical values. Underlined values are those
that minimize the sum of the type I and type II error rates.
From the tables and figures (see also Figs 6 and 7), we can make several
observations. First, we note that structure G, which is the 'king-of-the-hill' style of
tournament, is less powerful than the other structures under both models. Second,
there seems to be no obvious preference between the other two structures, although
the Bradley-Terry model shows structure F to be slightly more powerful. Note
that this conclusion is consistent with that of the four-team tournaments, i.e. that
the 'king-of-the-hill' style of tournament (structures D and G) is less powerful
for detecting a false hypothesis. One explanation for this phenomenon is that these
tournaments heavily favor the number one seed, and major upsets are less likely to
occur in these tournaments, because the low-seeded teams rarely play the top-seeded
teams. (Note that major upsets are the way that we reject the null
hypothesis.)
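To illustrate the king-of-the-hill comparison, the sketch below enumerates such a tournament for $t$ teams and computes the power at the most extreme critical value under the Bradley-Terry model. The layout is an assumption, since Fig. 2 is not reproduced here: the two lowest seeds meet first and each winner then faces the next-better seed, so that seed 1 plays only one game. The function names are hypothetical.

```python
from itertools import product

def bradley_terry(i, j):
    """Probability that seed i beats seed j under the Bradley-Terry model."""
    return j / (i + j)

def king_of_the_hill(t, p):
    """Return (test statistic, probability) for all 2**(t - 1) outcomes.

    Assumed layout: seed t plays seed t - 1, and the current winner then
    meets seeds t - 2, ..., 1 in turn.
    """
    rows = []
    for results in product((0, 1), repeat=t - 1):
        holder, stat, prob = t, 0, 1.0
        for challenger, r in zip(range(t - 1, 0, -1), results):
            # r == 0 means the better-seeded challenger wins the game.
            winner, loser = (challenger, holder) if r == 0 else (holder, challenger)
            stat += winner - loser
            prob *= p(winner, loser)
            holder = winner
        rows.append((stat, prob))
    return rows

for t in (4, 5):
    rows = king_of_the_hill(t, bradley_terry)
    lowest = min(stat for stat, _ in rows)
    print(t, lowest, round(sum(pr for stat, pr in rows if stat <= lowest), 3))
```

For four teams the smallest attainable type I error rate is 0.125 for both structures, and the resulting power can be set against the 0.320 reported for structure C in Table 2.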
TABLE 5. Five-team power calculations, assuming model of Schwertman et al.
Probability of    Power under model
type I error      Structure F     Structure G     Structure E
0.0625            0.173 (-6)      0.130 (-4)      0.173 (-6)
0.1875            0.422 (-5)                      0.437 (-5)
0.2500            0.557 (-4)      0.432 (-3)
0.3125                                            0.653 (-4)
0.3750            0.723 (-3)      0.597 (-2)      0.739 (-3)
0.4375            0.782 (-2)
0.5000            0.840 (-1)                      0.847 (-1)
0.5625                                            0.890 (0)
0.6250            0.910 (0)       0.846 (-1)      0.910 (1)
0.6875            0.939 (1)                       0.938 (2)
0.7500            0.964 (2)       0.928 (1)       0.960 (3)
0.8125            0.984 (3)
0.8750            0.993 (6)       0.983 (2)       0.990 (4)
0.9375            0.997 (8)       0.998 (5)       0.995 (7)
1.0000            1.000 (9)       1.000 (10)      1.000 (8)
Notes: Numbers in parentheses are critical values. Underlined values are those
that minimize the sum of the type I and type II error rates.
FIG. 6. Power versus size curve for the Bradley-Terry model.
After tabulating the results for six teams, we constructed Table 6, which
summarizes the null distributions of the test statistic. If these distributions are
shown graphically, then one notices that they are reasonably symmetric and that,
as $t$ becomes larger, the curves become more symmetric. In addition, we produced
(by 'brute force') the distributions for $t = 8$ and $t = 16$ teams. The graphs of these
distributions are shown in Figs 8 and 9, and suggest that the asymptotic distribution
might be normal.
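A brute-force enumeration of this kind can be sketched as follows for a balanced bracket. The draw order `EIGHT` (1 v 8, 4 v 5, 2 v 7, 3 v 6) is an assumption about the eight-team structure, and `bracket_statistics` is a hypothetical name. With this draw, the enumeration gives a minimum of -21, a maximum of 21, a mean of 0 and a sample variance (divisor N - 1) of about 116.4, which agrees with the eight-team row of Table 6.

```python
from itertools import product
from statistics import mean, variance

def bracket_statistics(bracket):
    """Test statistic for every outcome of a balanced single-elimination bracket.

    bracket lists the seeds in draw order, so adjacent entries meet in the
    first round and adjacent winners meet in each later round.
    """
    n_games = len(bracket) - 1
    stats = []
    for results in product((0, 1), repeat=n_games):
        picks = iter(results)
        alive, total = list(bracket), 0
        while len(alive) > 1:
            survivors = []
            for a, b in zip(alive[::2], alive[1::2]):
                winner, loser = (a, b) if next(picks) == 0 else (b, a)
                total += winner - loser   # winner seeding minus loser seeding
                survivors.append(winner)
            alive = survivors
        stats.append(total)
    return stats

EIGHT = [1, 8, 4, 5, 2, 7, 3, 6]   # assumed draw order
stats = bracket_statistics(EIGHT)
print(len(stats), min(stats), max(stats), mean(stats), round(variance(stats), 1))
```

The same function accepts a 16-team draw, at the cost of enumerating $2^{15}$ outcomes.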
FIG. 7. Power versus size curve for the model of Schwertman et al.
TABLE 6. Summary of test statistic distributions
t     Structure             Mean    Variance    Minimum    Maximum
4     C                     0       14.3        -5         5
      D                     0       8.3         -3         6
5     E                     0       19.2        -6         8
      F                     0       21.3        -6         9
      G                     0       12.7        -4         10
6     H                     0       24.3        -7         12
      I                     0       38.3        -9         13
      J                     0       27.5        -7         13
      K                     0       24.8        -8         10
      L                     0       28.1        -7         14
      M                     0       17.6        -5         15
8     Balanced structure    0       116.4       -21        21
16    Balanced structure    0       977.7       -85        85
6 Conclusions and remaining questions
Our test statistic can answer several questions. First, we can decide if, in a given
tournament, the favorite teams have won in a probabilistically acceptable way.
Second, we can assess the power of various tournament structures in deciding if
the favorites have won. We conclude by listing the unanswered questions that this
work raises. Although some of these questions are well defined and perhaps easy
to answer, others may have more elusive solutions.
(1) Are other alternative probability models reasonable? Of course, the answer is
yes, but perhaps there are some particularly appealing alternative probability
models that we have overlooked. Again, it should be mentioned that the
choice of the models examined in this paper was arbitrary and they were
chosen only to demonstrate the techniques involved.
(2) Can we show for all (transitive) preference matrices that one structure
'dominates' any other under some alternative model? In other words, is the
king-of-the-hill tournament always less powerful? This question points the
way to new research on which tournament structures are 'best'.
(3) Does the test statistic have an asymptotically normal distribution? If so, is
there a closed-form formula for the variance? We desire asymptotic normality,
because it would make the construction of a type I error level exceedingly
simple; one would just look up the desired area on a normal table, and use
the variance to convert that normal score to a critical value for the test
statistic.
(4) Is there a 'better' test statistic, i.e. is there a more powerful test? We could
employ techniques to develop a 'most powerful test', but we have not yet
attempted this.
FIG. 8. Cumulative distribution functions for eight teams.
FIG. 9. Cumulative distribution functions for 16 teams.
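If the normality conjectured in question (3) were to hold, the conversion from a normal score to a critical value could look like the sketch below. This rests entirely on that unproven assumption; the function name `normal_critical_value` is hypothetical, and the variance 977.7 is the 16-team value from Table 6.

```python
from math import sqrt
from statistics import NormalDist

def normal_critical_value(alpha, variance, mean=0.0):
    """Approximate lower-tail cut-off, assuming the statistic is roughly normal."""
    z = NormalDist().inv_cdf(alpha)   # standard normal score for area alpha
    return mean + z * sqrt(variance)

# 16-team balanced structure, one-sided test at the 5% level.
print(round(normal_critical_value(0.05, 977.7), 1))
```

Because the statistic only takes integer values, the exact attained level of such a cut-off would still need to be checked against the enumerated distribution.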
REFERENCES
BRADLEY, R. A. & TERRY, M. E. (1952) Rank analysis of incomplete block designs, Biometrika, 39,
pp. 324-345.
DAVID, H. A. (1959) Tournaments and paired comparisons, Biometrika, 46, pp. 139-149.
HOREN, J. & RIEZMAN, R. (1985) Comparing draws for single elimination tournaments, Operations
Research, 33, pp. 249-262.
SCHWERTMAN, N. C., MCCREADY, T. A. & HOWARD, L. (1991) Probability models for the NCAA
regional basketball tournaments, American Statistician, 45, pp. 35-38.