A Reduction of Label Ranking to Multiclass Classi cation · 2020-01-10 · A Reduction of Label...

A Reduction of Label Rankingto Multiclass Classification

Klaus Brinker1 and Eyke Hullermeier2

1 Hamm-Lippstadt University of Applied Sciences, [email protected]

2 Paderborn University, [email protected]

Abstract. Label ranking considers the problem of learning a mappingfrom instances to strict total orders over a predefined set of labels. In thispaper, we present a framework for label ranking using a decompositioninto a set of multiclass problems. Conceptually, our approach can be seenas a generalization of pairwise preference learning. In contrast to the lat-ter, it allows for controlling the granularity of the decomposition, varyingbetween binary preferences and complete rankings as extreme cases. It isspecifically motivated by limitations of pairwise learning with regard tothe minimization of certain loss functions. We discuss theoretical prop-erties of the proposed method in terms of accuracy, error correction, andcomputational complexity. Experimental results are promising and in-dicate that improvements upon the special case of pairwise preferencedecomposition are indeed possible.

Keywords: Label Ranking · Multiclass Classification · Structured Out-put Prediction

1 Introduction

In the recent past, various types of ranking problems emerged in the field ofpreference learning [12]. A well-known example is “learning to rank” or objectranking [19], where the task is to rank any subset of objects (typically describedas feature vectors) from a given domain (e.g. documents). In this paper, ourfocus is on a related but conceptually different problem called label ranking (LR).The task in LR is to learn a mapping from an input space X to a total orderover a predefined set L = {λ1, . . . , λm} of labels (e.g. political parties, musicgenres, or social emotions). Like in multiclass classification, these labels are onlydistinguished by their name but not described in terms of any properties.

Previous approaches to label ranking can be grouped into four main cate-gories with respect to model representation [23]: Algorithms learning real-valuedscoring functions for each label [15, 8, 21], instance-based methods [2, 6], tree-based methods [5], and binary decomposition methods [18]. In terms of predictiveaccuracy, especially the latter turned out to be highly competitive.

2 K. Brinker and E. Hullermeier

This paper introduces a generalization of binary decomposition in the sensethat LR problems are transformed into multiple multiclass classification prob-lems. Thus, the atomic elements of our framework are partial rankings of afixed length k, including pairwise preferences (k = 2) and complete rankings(k = m) as special cases. Encoding rankings over fixed subsets of k labels as(meta-)classes, each such subset gives rise to a multiclass classification problem.At prediction time, the corresponding multiclass classifiers (either all of themor a suitable subset) are queried with a new instance, and their predictions arecombined into a ranking on the entire label set L.

Intuitively, a decomposition into partial rankings instead of label pairs in-creases the degree of overlap between the various subproblems, and thereby theability to correct individual prediction errors. Formally, it can indeed be shownthat, to minimize specific loss functions on rankings, knowledge about binary re-lationships between labels is principally insufficient; an extreme example is the0/1-loss, which simply checks whether the entire ranking is correct or not.

Our framework for LR is analyzed from a theoretical perspective with respectto both accuracy, error correcting properties, and computational complexity.More precisely, for the aforementioned 0/1-loss, we present a bound in termsof the average classification error of the underlying multiclass models on thetraining data, hence, providing a justification for the general consistency of ourapproach in the sense that accurate multiclass models imply accurate overall LRpredictions on the training data. With respect to error correction, we present atheoretical result on the number of multiclass errors which our framework cancompensate while still recovering the correct overall ranking.

Empirically, we also analyze the influence of the decomposition granularity k.Our results are promising and suggest that improvements upon the special caseof binary decomposition are indeed possible. In the experiments, we observed aconsistent relationship between a suitable choice of k and the overall number oflabels m.

Section 2 recalls the problem setting of LR, along with notational conven-tions. Our novel LR framework is presented in Section 3. Sections 4 and 5 analyzetheoretical properties in terms of accuracy, error correction, and computationalcomplexity, respectively. Section 6 is devoted to an experimental evaluation ofour approach. We conclude with a few remarks and open research directions inSection 7.

2 Label Ranking

In label ranking, the goal is to learn a predictive model in the form of a mappingf : X → L∗, where X is an instance space and L∗ the set of all rankings(strict total orders) over a set of labels L = {λ1, . . . , λm}. Formally, we representrankings in terms of permutations π of [m] = {1, . . . ,m}, such that π(i) is the(index of the) label at position i. Thus, LR assumes instances x ∈ X to beassociated with rankings π ∈ L∗. More specifically, we assume this dependency

A Reduction of Label Ranking to Multiclass Classification 3

to be probabilistic: Given x as input, there is a certain probability P(π |x) toobserve the ranking π ∈ L∗ as output.

We suppose training data to be given in the form of a set of observationsD = {(xi, πi)}ni=1 ⊂ X × L∗. Thus, for convenience, we assume the observedrankings to be complete. In practice, this assumption is of course not alwaysfulfilled. Instead, rankings are often incomplete, with some of the labels λi ∈ Lbeing missing. Predictive performance is evaluated in terms of a loss function` : L∗ × L∗ → R, which compares observed rankings π with predictions π.Common choices include rank correlation measures such as Kendall’s tau andSpearman’s rho (which are actually similarity measures).

In the pairwise approach to LR [18], called ranking by pairwise comparison(RPC), one binary model fi,j is trained for each pair of labels (λi, λj). The taskof fi,j is to predict, for a given instance x, whether λi precedes or succeeds λjin the ranking associated with x. A prediction π is produced by aggregating the(possibly conflicting) m(m− 1)/2 pairwise predictions, using techniques such as(weighted) voting.

An important question for any reduction technique is the following: Is itpossible to combine the solutions of the individual subproblems into an opti-mal solution for the original problem? In the case of RPC, this question canbe asked more concretely as follows: Is it possible to train and combine the bi-nary predictors fi,j so as to obtain an optimal predictor f for the original LRproblem, i.e., a predictor that minimizes the loss ` in expectation? Interestingly,this question can be answered affirmatively for several performance measures,including Kendall’s tau and Spearman’s rho. However, there are also measuresfor which this is provably impossible. These include the 0/1-loss, the Hamming,the Cayley, and the Ulam distance on rankings, as well as Spearman’s footrule[16]. Roughly speaking, for these measures, the loss of information due to pro-jecting the distribution P(· |x) on L∗ to its pairwise marginals Pi,j(· |x) is toohigh. That is, even knowledge about all pairwise marginals does not allow forreconstructing the risk-minimizing prediction π.

Indeed, pairwise relations may easily lead to ambiguities, such as preferentialcycles. Knowledge about the distribution of rankings on larger label subsets, suchas triplets or quadruples, may then help to disambiguate. This is a key motivationof the approach put forward in this paper, very much in line with the so-calledlistwise approaches to learning-to-rank problems [3]. In this regard, LR is alsoquite comparable to multi-label classification (MLC): There are loss functions(such as Hamming) that allow for reducing the original MLC problem to binaryclassification, but also others (such as the subset 0/1-loss) for which reductiontechniques cannot produce optimal results, and which require information aboutthe distribution on larger subsets of labels [9].


3 Reduction to Multiclass Classification

In this section, we present a novel framework for solving LR problems using adecomposition into multiple multiclass classification problems. We refer to thisapproach as LR2MC (Label Ranking to Multiclass Classification).

3.1 Decomposition

For a given k ∈ {2, . . . ,m}, consider all label subsets L ⊂ L of cardinality k.Each such L gives rise to a label ranking problem on the reduced set of labels.For the special case k = 2, the same decomposition into

(m2

)binary classification

problems is produced as for RPC.

To illustrate the decomposition process, consider a setting with m = 4 labelsL = {λ1, λ2, λ3, λ4} and subsets of cardinality k = 3. Each of the

(43

)subsets

{λ1, λ2, λ3}, {λ1, λ2, λ4}, {λ1, λ3, λ4}, and {λ2, λ3, λ4} gives rise to a separateLR problem. In the following, we denote by π(L) the restriction of a rankingπ ∈ L∗ to the label subset L ⊂ L, i.e., π(L) = (λ3, λ4, λ2) for π = (λ3, λ4, λ1, λ2)and L = {λ2, λ3, λ4}. For each ranking problem on a label subset L of size3, there are 3! = 6 elements in the output space L∗, for example (λ1, λ2, λ3),(λ1, λ3, λ2), (λ2, λ1, λ3), (λ2, λ3, λ1), (λ3, λ1, λ2), (λ3, λ2, λ1) in the case of thelabel subset L = {λ1, λ2, λ3}.

In general, for a label ranking problem with |L| = m labels, we constructa set of

(mk

)ranking problems on label subsets of size |L| = k. Each of these

problems is then converted into a multiclass problem. To this end, each of the k!rankings π(L) ∈ L∗ is associated with a class c ∈ C = {c1, . . . , ck!}. We denoteby fL the multiclass classifier on subset L. For the sake of simplicity, we assumethe decoding of class labels to the associated rankings to be done directly, i.e.,fL is viewed as a mapping X → L∗ (instead of a mapping X → C), and fL(x)is the ranking of labels in L predicted for x.

3.2 Aggregation

We will now discuss our approach to combining partial ranking predictions onlabel subsets into a prediction on the complete label set L. The idea is to find aconsensus ranking that disagrees with the minimum number of subset rankingpredictions. In Section 4, we further elaborate on theoretical properties of thisaggregation method.

For a new instance x ∈ X , we first compute the predicted rankings {fL(x) :L ⊂ L, |L| = k} on the

(mk

)label subsets. These predictions are combined into a

complete ranking f(x) ∈ L∗ by minimizing the sum of (partial) 0/1-loss values:

f(x)def= argmin

π∈L∗

∑L⊂L|L|=k

`01

(π(L), fL(x)

), (1)


where

`01(ρ, ρ′) =

{0 for ρ = ρ′

1 for ρ 6= ρ′(2)

for ρ, ρ′ ∈ L∗ denotes the 0/1-loss for two label rankings. Ties are broken arbi-trarily.

This loss-based aggregation is similar to Hamming decoding for computing amulticlass label from binary classifications [1]. Moreover, for k = 2, it is identicalto the Slater-optimal aggregation in pairwise LR [17]. For k = m, there is onlyone label subset L = L. In this case, the risk-minimizing ranking is obviouslythe multiclass prediction for this set, mapped to the associated label ranking.

3.3 Discussion

What is an optimal choice of k? This question is not easy to answer, and differentarguments support different choices:

– Loss of information: The decomposition of a ranking necessarily comes witha loss of information, and the smaller the components, the larger the loss.This argument suggests large values k (close or equal to m).

– Redundancy: The more redundant the predictions, the better mistakes ofindividual predictors fL can be corrected. This argument suggests midsizevalues k ≈ m/2, for which the number

(mk

)of predictions combined in the

aggregation step is largest.– Simplicity: The difficulty of a multiclass problem increases with the number

of classes. This argument suggests small values k, specifically k = 2. Prac-tically, it will indeed be difficult to go beyond k = 4, since the number ofclasses k! will otherwise be prohibitive—issues such as class-imbalance andempty classes will then additionally complicate the problem.

As for the last point, also note that for k > 2, our reduction to classificationcomes with a loss of structural information: By mapping partial rankings π(L) toclass labels c, i.e., replacing the space L∗ by the set of classes C, any informationabout the structure of the former is lost (since C is not equipped with anystructure except the discrete metric). This information is only reconsidered in theaggregation step later on. Interestingly, exactly the opposite direction, namelyexploiting structure on the label space by turning a multiclass classification intoa ranking problem, has recently been considered in [14].

LR2MC is conceptually related to the Rakel method for multilabel classifi-cation [22]. Rakel decomposes an MLC problem with m labels into (randomlychosen) MLC problems for subsets of size k, and tackles each of these problems asa multiclass problem with 2k classes (corresponding to the label subsets). Bothapproaches share the underlying motivation of taking dependencies between la-bels and partial rankings, respectively, into account. Moreover, both approachesalso resemble error correcting output encodings [10], which aim at improving theoverall multiclass classification accuracy by combining multiple binary classifiersand provide a means to distribute the output representation.


4 Theoretical Analysis

In this section, we upper bound the average training 0/1-loss of LR2MC (as de-fined in (2)) in terms of the average classification error of the multiclass classifiers{fL}L⊂L,|L|=k and analyze error correcting properties. Our analysis is similar to[1], where multiclass classification is reduced to learning multiple binary classi-fiers. In that paper, the authors provide a bound on the empirical multiclass lossin terms of the loss of the binary classifiers and properties and the decompositionscheme.

Theorem 1 (Training loss bound). Let {fL}L⊂L,|L|=k with 2 ≤ k ≤ m de-note a set of multiclass models and let ε denote their average classification erroron the training sets decomposed from the LR training data D = {(xi, πi)}ni=1.Then, we can upper bound the average ranking 0/1-loss on the training data asfollows:

1

n

n∑i=1

`01(πi, f(xi)) ≤2m(m− 1)

k(k − 1)· ε

Proof. Let f(x) = π denote the predicted ranking for the training example(x, π) ∈ D. Assume that the predicted and the ground truth rankings disagree,π 6= π, hence `01(π, π) = 1. Moreover, as π is a minimizer of (1), it holds that∑

L⊂L|L|=k

`01(π(L), fL(x)) ≥∑L⊂L|L|=k

`01(π(L), fL(x)).

Let S∆def= {L ⊂ L∗ : |L| = k ∧ π(L) 6= π(L)} denote the label sets of cardinality

k, where π and π disagree. Then,∑L∈S∆

`01(π(L), fL(x)) ≥∑L∈S∆

`01(π(L), fL(x)).

Therefore, ∑L⊂L|L|=k

`01(π(L), fL(x)) ≥∑L∈S∆

`01(π(L), fL(x))

=1

2

( ∑L∈S∆

(`01(π(L), fL(x)) + `01(π(L), fL(x)))

)

≥ 1

2

∑L∈S∆

(`01(π(L), fL(x)) + `01(π(L), fL(x))︸︷︷︸≥1

)

≥ 1

2|S∆| ≥

1

2

(m− 2

k − 2

).


The last step follows from the fact that for the non-equal rankings π and π atleast two labels are in reverse order. Hence, all

(m−2k−2

)restrictions to subsets of k

labels which contain the inversed label pair are not equal as well, and thereforeelements of S∆. In summary, a prediction mistake, i.e., `01(π, f(x)) = 1 impliesthat

2(m−2k−2

) ∑L⊂L|L|=k

`01(π(L), fL(x)) ≥ 1 .

Moreover, as2(

m−2k−2

) ∑L⊂L|L|=k

`01(π(L), fL(x)) ≥ 0

for `01(π, f(x)) = 0, it holds for all (x, π) that

`01(π, f(x)) ≤ 2(m−2k−2

) ∑L⊂L|L|=k

`01(π(L), fL(x)) .

Therefore, for the average 0/1-loss, it holds that

1

n

n∑i=1

`01(πi, f(xi)) ≤1

n

n∑i=1

2(m−2k−2

) ∑L⊂L|L|=k

`01(π(L)i , fL(xi))

=2(

m−2k−2

) 1

n

n∑i=1

∑L⊂L|L|=k

`01(π(L)i , fL(xi))

=2(mk

)(m−2k−2

) · 1

n(mk

) n∑i=1

∑L⊂L|L|=k

`01(π(L)i , fL(xi))

︸︷︷︸=ε

=2m(m− 1)

k(k − 1)· ε

For the special case of perfect multiclass models on the training data (ε = 0),the 0/1-loss for LR2MC is zero. Moreover, when entirely ignoring the structureof the output space (m = k) and approaching label ranking as a multiclassproblem with m! classes, the 0/1-loss for label ranking is bounded by twice theaverage multiclass error.

Corollary 1 (Error correction). Let {fL}L⊂L,|L|=k with 2 ≤ k ≤ m denotea set of multiclass models. For any observation (x, π) ∈ X × L∗ with∑

L⊂L|L|=k

`01

(π(L), fL(x)

)<

1

2

(m− 2

k − 2

)

it holds that f(x) = π.


Proof. As shown in the proof of Theorem 1,∑

L⊂L|L|=k

`01(π(L), fL(x)) ≥ 12

(m−2k−2

)for f(x) 6= π.

Corollary 1 provides some interesting insights into the error correction prop-erties of our framework:

– For k = 2, the right-hand side is 12 , and hence the inequality is satisfied

only if all pairwise predictions are correct. Indeed, it is obvious that a singlepairwise error may result in an incorrect overall prediction.

– For k = m, there exists only a single multiclass model which is mapped toan LR prediction. In this case, a correct overall prediction is guaranteed ifthis multiclass prediction is correct.

– For k = bm2 + 1c and k = dm2 + 1e, the right-hand side evaluates to themaximum possible value(s) and Corollary 1 indicates that our framework iscapable of correcting a substantial number of multiclass errors.

The last point suggests that values k ≈ m2 + 1 should lead to optimal perfor-

mance, because the ability to correct errors of individual classifiers is highest forthese values. One should keep in mind, however, that the error probability itself,which corresponds to the performance of a (k!)-class classifier3, will (strongly)increase with k. Depending on how quickly this error increases, and whether ornot it can be over-compensated through more redundancy, the optimal valueof k is supposed to lie somewhere between 2 and m

2 + 1. This is confirmed byexperimental results for a simplified synthetic setting that we present in thesupplementary material (Section A.8).

Corollary 2 (Training bound for Kendall’s tau). Rescaling Kendall’s τrank correlation into a loss function by 1−τ

2 ∈ [0, 1], it holds under the assump-tions of Theorem 1 that

1

n

n∑i=1

1− τ(πi, f(xi))

2≤ 2m(m− 1)

k(k − 1)· ε

Proof. Direct consequence of Theorem 1 and the fact that for all rankings π andπ′ it holds that

1− τ(π, π′)

2≤ `01(π, π′).

Corollary 2 shows that the average multiclass error on the training data notonly bounds the 0/1-loss but also guarantees a certain level of overall rankingaccuracy as measured by Kendall’s tau.

3 The effective number of classes a classifier is trained on corresponds to the distinctnumber of permutations of the k labels in the training data, which is normally < k!.


5 Computational Complexity

The set of multiclass models is at the core of LR2MC. The training data for eachmulticlass model consists of the original instances xi and the appropriate classlabels which represent rankings on label subsets. Hence, we need to train andevaluate

(mk

)multiclass models using n training examples with up to k! different

class labels.

Depending on the influence of the number of classes on the complexity of theclassification learning algorithm and the choice of k, the computational complex-ity can increase substantially compared to the pairwise ranking approach, where(m2

)= m(m−1)

2 binary models are required. More precisely, the maximum num-ber of multiclass models for a given m is required for k = bm2 c as a consequenceof basic properties of binomial coefficients. For this case, it holds that(

m

bm2 c

)≥(

m

bm2 c

)bm2 c≥ 2b

m2 c ≥ 2

m−12 = (

√2)m−1.

Hence, the maximum number of multiclass problems is lower bounded by a termwhich grows exponentially in the number of labels m.

For testing, we need to evaluate all(mk

)multiclass models and solve the

optimization problem (1). Even when ignoring the multiclass model evaluationcomplexity, the computational complexity for this combinatorial optimizationproblem increases substantially in m as there are m! possible rankings to beconsidered. Moreover, as stated above, for k = 2, this aggregation method isidentical to computing the Slater-optimal label ranking which is known to beNP-complete [17].

Overall, even considering that the number of labels m in LR is normally small(mostly < 10, comparable to multiclass classification), computational complexityis clearly an issue for LR2MC. Of course, there are various directions one couldthink of to improve efficiency. For example, instead of decomposing into all labelsubsets of size k, one may (randomly) chose only some of them, like in Rakel.Besides, aggregation techniques less expensive than Hamming decoding could beused, such as (generalized) Borda count. While these are interesting directions forfuture work, the main goal of this paper is to elaborate on the usefulness of theapproach in principle, i.e., to answer the question whether or not a generalizationfrom pairwise to multiclass decomposition is worthwhile at all.

6 Experimental Evaluation

6.1 Setup

This section presents an experimental evaluation of the accuracy of LR2MC,which essentially consists of two studies. For the first study, we replicated a set-ting previously proposed in [4], where a suite of benchmark datasets for label


Table 1. Dataset characteristics

Dataset Type Instances Attributes Labels

authorship A 841 70 4bodyfat B 252 7 7calhousing B 20640 4 4cpu-small B 8192 6 5fried B 40769 9 5glass A 214 9 6housing B 506 6 6iris A 150 4 3segment A 2310 18 7stock B 950 5 5vehicle A 846 18 4wine A 178 13 3

ranking was used4; see Table 1 for an overview of the dataset properties. Thesecond study is based on an artificial setting from [13], where the underlyingproblem is to learn the ranking function induced by an expected utility maxi-mizing agent. This setting allows for varying dataset properties, i.e., the numberof instances, features, and labels, to study specific aspects of our novel approachin a more controlled setting.

We consider the following evaluation measures: Kendall’s tau and Spear-man’s rho rank correlation coefficients, 0/1 loss, and Hamming distance (numberof items for which the predicted rank deviates from the true rank). For coher-ence, we turn the last two into [0, 1]-valued similarity scores: Match is definedas 1 minus 0/1 loss, and Hamming similarity is 1 minus normalized Hammingdistance.

For the first study, the empirical results are computed as in [4] using five rep-etitions of ten-fold cross-validation. In the second study, we averaged the resultsover 100 repetitions of the synthetic data generation process for each datasetproperty configuration with separate test sets consisting of 1000 examples.

As base learners, we use the decision tree (J48) and random forest (Ran-domForest) implementations from the Weka machine learning suite [11]. Bothmethods combine fast training and testing with a natural means of handlingmulticlass problems without using any further decomposition techniques. Thisis in contrast to learning methods that make use of reduction techniques them-selves (multiclass SVM, for example, uses a decomposition into one-vs-rest or

4 These are classification and regression datasets from the UCI and Statlog repository,which were turned into label ranking problems. The datasets are publicly availableat https://cs.uni-paderborn.de/is/research/research-projects/software/. Due to thecomputational demands of LR2MC, we restricted our evaluation to datasets withm ≤ 7 labels.


Table 2. Experimental results of the label ranking techniques with decision trees (J48)as base learner in terms of Kendall’s tau (in brackets the ranks; best average rank inboldface).

Dataset RPC LR2MC-2 LR2MC-3 LR2MC-4 LR2MC-5 LR2MC-6 LR2MC-7

authorship 0.787 (2) 0.789 (1) 0.782 (3) 0.771 (4)bodyfat 0.153 (1) 0.143 (2) 0.106 (3) 0.097 (4) 0.073 (5) 0.046 (7) 0.051 (6)calhousing 0.299 (1) 0.297 (2) 0.206 (3) 0.191 (4)cpu-small 0.311 (2) 0.311 (1) 0.265 (3) 0.242 (4) 0.224 (5)fried 0.808 (4) 0.814 (3) 0.862 (1) 0.836 (2) 0.749 (5)glass 0.801 (5) 0.818 (3) 0.830 (1) 0.826 (2) 0.812 (4) 0.790 (6)housing 0.310 (6) 0.333 (4) 0.402 (1) 0.375 (2) 0.341 (3) 0.313 (5)iris 0.780 (1) 0.765 (3) 0.777 (2)segment 0.830 (7) 0.839 (5) 0.887 (2) 0.888 (1) 0.869 (3) 0.854 (4) 0.832 (6)stock 0.729 (4) 0.727 (5) 0.781 (1) 0.768 (2) 0.755 (3)vehicle 0.815 (2) 0.819 (1) 0.815 (3) 0.795 (4)wine 0.862 (2) 0.863 (1) 0.838 (3)

average rank 3.08 2.58 2.17 2.90 4.00 5.50 6.00

all pairs). By excluding such methods, we try to avoid blending multiclass tobinary and label ranking to multiclass decomposition together, which may yieldempirical results that are difficult to interpret.

6.2 First Study (Real Data)

Results for the first study are shown in Table 2 for Kendall as performancemetric and decision trees as base learner—corresponding results for other metricsand learners can be found in the supplementary material. Note that the rankstatistic shown in that table (ranking of methods per dataset, and average ranksper method) is arguably biased, because not all decompositions are applicableto all datasets (for example, LR2MC-5 cannot be used for the 4-label authorshipdataset).

Therefore, Table 3 shows average normalized ranks, where normalizationmeans that, for each dataset, the ranks are linearly rescaled to have unit sum.As can be seen, none of the approaches consistently outperforms the others.However, a more fine-grained analysis of the results demonstrates that the pair-wise approach outperforms all other methods for datasets with up to 4 labels(middle entry in each triplet), while the LR2MC-3 approach outperforms allother methods for datasets with more than 4 labels (right entry).

The results for RPC [18] and LR2MC-2 are similar, as expected, since bothmethods use the same decomposition and only differ in the aggregation step.RPC uses a voting approach that is specifically tailored for Spearman’s rho,which in turn bounds Kendall’s tau [7]. For these metrics, RPC is indeed verystrong.


In agreement with our theoretical arguments, the situation looks differentwhen measuring performance in terms of Match and Hamming. Here, the pair-wise approach performs worse, and best results are obtained for LR2MC withk = 4 or k = 5. As already explained, pairwise information is not enough to op-timize these measures. Theoretically, one may even expect larger values for theoptimal k, but practically, classifiers are of course difficult to train on problemswith too many classes (and limited data).

6.3 Second Study (Synthetic Data)

In a second study, we carried out a set of experiments using the synthetic set-ting of [13], which allows for controlling dataset properties. More precisely, in afirst scenario, we considered the following setup: 250, 500, 1000, . . . , 16000 train-ing instances with 10 features and 7 labels, Match and Hamming as evaluationmeasures.

Table 3. Experimental results in terms of average normalized ranks (best result foreach metric/learner combination in boldface) for different combinations of measure(Kendall, Spearman, Match, Hamming) and base learner (J48 and Random Forest):all datasets / datasets with m ≤ 4 / datasets with m > 4.

RPC LR2MC-2 LR2MC-3 LR2MC-4 LR2MC-5 LR2MC-6 LR2MC-7

K J48 .21/.20/.21 .19/.21/.17 .20/.35/.09 .21/.40/.13 .21/—/.21 .23/—/.23 .21/—/.21K RF .20/.20/.19 .22/.31/.15 .17/.27/.09 .20/.37/.13 .23/—/.23 .23/—/.23 .25/—/.25S J48 .15/.15/.16 .22/.27/.18 .19/.35/.08 .22/.40/.15 .22/—/.22 .25/—/.25 .23/—/.23S RF .15/.18/.13 .22/.29/.18 .17/.29/.07 .22/.40/.15 .26/—/.26 .24/—/.24 .25/—/.25

M J48 .31/.33/.29 .28/.35/.23 .20/.24/.18 .11/.13/.10 .10/—/.10 .13/—/.13 .11/—/.11M RF .33/.40/.29 .29/.35/.24 .16/.17/.16 .11/.13/.10 .09/—/.09 .13/—/.13 .17/—/.17H J48 .29/.32/.27 .29/.37/.23 .19/.23/.16 .11/.13/.09 .10/—/.10 .16/—/.16 .16/—/.16H RF .29/.32/.27 .32/.40/.27 .19/.22/.17 .09/.10/.09 .10/—/.10 .13/—/.13 .14/—/.14

The experimental results with decision tree base learners are given in Table4 (detailed results for random forests are available as supplementary material).Here, the ranks are computed without normalization in compliance with [4]. Sinceonly the overall number of training examples varies, all considered methods areapplicable, for which reason a normalization is not needed.

As expected, the absolute accuracy increases with the number of trainingexamples for all methods. Moreover, the relative performance (ranks) of themethods is identical, regardless of the training set size, except for RPC andLR2MC-2 which achieve very similar results. With decision trees as base learners,LR2MC-4 consistently outperforms all other approaches, including LR2MC-7.This observation again supports the hypothesis that pure classification methodsthat do not leverage the structure of the output space are more difficult to train.For random forests as base learner, the experimental results are in line with theseobservations, with the slight difference that LR2MC-3 is the overall winner and


Table 4. Experimental results using decision trees (J48) as base learners in terms ofMatch (top) and Hamming (bottom), in parentheses the ranks.

Examples RPC LR2MC-2 LR2MC-3 LR2MC-4 LR2MC-5 LR2MC-6 LR2MC-7

250 0.128 (4) 0.127 (5) 0.157 (2) 0.165 (1) 0.151 (3) 0.111 (6) 0.081 (7)500 0.151 (4) 0.149 (5) 0.188 (2) 0.199 (1) 0.178 (3) 0.134 (6) 0.096 (7)

1000 0.176 (4) 0.174 (5) 0.216 (2) 0.231 (1) 0.209 (3) 0.155 (6) 0.111 (7)2000 0.197 (5) 0.197 (4) 0.246 (2) 0.263 (1) 0.239 (3) 0.178 (6) 0.128 (7)4000 0.219 (4) 0.216 (5) 0.277 (2) 0.294 (1) 0.269 (3) 0.203 (6) 0.147 (7)8000 0.243 (5) 0.244 (4) 0.306 (2) 0.323 (1) 0.298 (3) 0.230 (6) 0.166 (7)

16000 0.264 (5) 0.264 (4) 0.337 (2) 0.353 (1) 0.328 (3) 0.256 (6) 0.186 (7)

average rank 4.43 4.57 2.00 1.00 3.00 6.00 7.00

250 0.567 (4) 0.566 (5) 0.595 (2) 0.599 (1) 0.573 (3) 0.514 (6) 0.468 (7)500 0.591 (5) 0.592 (4) 0.627 (2) 0.629 (1) 0.604 (3) 0.541 (6) 0.492 (7)

1000 0.614 (4) 0.613 (5) 0.652 (2) 0.655 (1) 0.630 (3) 0.568 (6) 0.518 (7)2000 0.634 (5) 0.635 (4) 0.674 (2) 0.679 (1) 0.653 (3) 0.592 (6) 0.538 (7)4000 0.655 (5) 0.655 (4) 0.696 (2) 0.700 (1) 0.675 (3) 0.615 (6) 0.560 (7)8000 0.672 (4) 0.672 (5) 0.715 (2) 0.721 (1) 0.697 (3) 0.635 (6) 0.580 (7)

16000 0.690 (4) 0.689 (5) 0.733 (2) 0.740 (1) 0.717 (3) 0.658 (6) 0.601 (7)

average rank 4.43 4.57 2.00 1.00 3.00 6.00 7.00

Table 5. Experimental results for the second controlled scenario in terms of normalizedranks (best result for each setting in boldface): datasets with m ≤ 4 / datasets withm > 4.

Measure Base Learner RPC LR2MC-2 LR2MC-3 LR2MC-4 LR2MC-5 LR2MC-6 LR2MC-7

Kendall J48 .18/.18 .22/.15 .40/.05 .40/.14 —/.23 —/.25 —/.25Kendall RandomForest .13/.12 .27/.14 .40/.05 .40/.19 —/.25 —/.25 —/.25Spearman J48 .13/.13 .27/.18 .40/.05 .40/.14 —/.25 —/.25 —/.25Spearman RandomForest .22/.08 .18/.15 .40/.07 .40/.20 —/.25 —/.25 —/.25


a bit better than LR2MC-2 and LR2MC-4, which yield very similar absoluteresults.

In a second scenario, we used the following setting: 2000 training instanceswith 10 features and 3, 4, . . . , 7 labels, Kendall and Spearman as evaluation mea-sures. The experimental results are given in Table 5. This set of experimentscompletely supports the finding from the first study: For datasets with m ≤ 4labels, the pairwise approaches RPC and LR2MC-2 outperform all other de-compositions. However, for datasets with m > 4 labels, LR2MC-3 consistentlyachieves the highest accuracy in terms of both Kendall’s tau and Spearman’srho.

6.4 Discussion

Our experimental results suggest that the number of labels is an essential prop-erty for choosing a suitable decomposition approach in label ranking. Moreprecisely, while standard pairwise decomposition appears to be favorable fordatasets with m ≤ 4 labels, our novel multiclass decomposition with k = 3(LR2MC-3) seems to provide a promising alternative for m > 4 labels and rankcorrelation (Kendall’s tau or Spearman’s rho) as performance measure. If othermeasures are used, such as 0/1 loss (Match) or Hamming, higher order decom-positions are even more advantageous. Interestingly, in the context of multilabellearning, the experimental conclusions for the conceptually related Rakel method[22] are similar, since label set sizes around k = 3 often yield the best accuracy.

As our experimental evaluation suggests that multiclass decomposition withk = 3 is a particularly promising alternative to the well-known pairwise labelranking approach, we will add some remarks regarding its computational com-plexity in terms of the number of classification problems: For a dataset with mlabels, pairwise label ranking considers an overall number of

(m2

)binary clas-

sification problems, while LR2MC-3 makes use of a decomposition into(m3

)multiclass problems. Hence, ignoring special cases, there is an increase in thenumber of classification problems by a factor of m−2

3 . For example, for m = 7this amounts to roughly 67% more classification problems with the training setsize being identical for both approaches. Depending on the application field, theincrease in accuracy may justify these computational requirements.

One may hypothesize that the increase in accuracy may be attributed to theincreased number of underlying classifiers, and hence may be a straightforwardensemble size consequence. We have conducted some preliminary experiments tofurther investigate this hypothesis. In these experiments, a maximum number of(m2

)classifiers was subsampled from the overall set of multiclass classifiers to re-

move any potential ensemble advantage over the pairwise setting. Interestingly,the experimental results when using the limited number of underlying classi-fiers are comparable to those of the overall set of classifiers (see supplementarymaterial for some more detailed results), suggesting that ensemble size cannotexplain the observed difference in accuracy.


7 Conclusion and Future Research

Accepting pairwise learning as a state-of-the-art approach to label ranking, themajor objective of this work was to address the question whether, in principle,going beyond pairwise comparisons and decomposing preference information intopartial rankings of length k > 2 could be useful. Technically, this questionscomes down to comparing a reduction to binary classification with a reductionto multiclass classification.

The answer we can give is clearly affirmative: Our method, called LR2MC,tends to be superior to pairwise learning as soon as the number of labels exceedsfour. In agreement with our theoretical arguments, this superiority is especiallypronounced for performance metrics that are difficult to optimize based on pair-wise preference information. Practically, k = 3 or k = 4 seem to be reasonablechoices, which optimally balance various factors responsible for the success ofthe learning process. Such factors include the inherent loss of information causedby decomposition, the redundancy of the reduction, and the practical difficultyof the individual classification problems created.

While this is an important insight into the nature of label ranking, whichpaves the way for new methods beyond pairwise preference learning, the in-creased computational complexity of LR2MC is clearly an issue. In future work,we will therefore elaborate on various ways to reduce complexity and increaseefficiency, both in the decomposition and aggregation step. One may think, forexample, of incomplete decompositions that do not comprise projections to alllabel subsets, or mixed decompositions including rankings of different length (in-stead of using a fixed k). Moreover, the efficiency of the aggregation step couldbe increased by computationally less complex aggregation techniques, such asthe well-known Borda-count. Preliminary experiments suggest that it is indeedpossible to substantially reduce the computational complexity for aggregationwhile still benefit from superior ranking accuracy for datasets with m > 4 labels.

Currently, our approach is limited to complete label rankings as trainingdata. Therefore, another direction of future research is to develop a generaliza-tion that allows for incorporating partial ranking data. For example, we maygeneralize LR2MC by inserting a preprocessing step for training examples whichare associated with a partial ranking only, and consider all compatible rankingextensions as (weighted) virtual training examples in the decomposition process.

References

1. Allwein, E.L., Schapire, R.E., Singer, Y.: Reducing multiclass to binary: A unifyingapproach for margin classifiers. J. of Machine Learning Res. 1, 113–141 (2000)

2. Brinker, K., Hullermeier, E.: Case-based label ranking. In: Proceedings of the Eu-ropean Conference on Machine Learning (ECML 2006). Berlin, Germany (2006)

3. Cao, Z., Qin, T., Liu, T.Y., Tsai, M.F., Li, H.: Learning to rank: from pairwise ap-proach to listwise approach. In: Proceedings ICML, 24th International Conferenceon Machine learning. pp. 129–136 (2007)


4. Cheng, W., Dembczynski, K., Hullermeier, E.: Label ranking methods based onthe plackett-luce model. In: Furnkranz, J., Joachims, T. (eds.) Proceedings of the27th International Conference on Machine Learning (ICML-10). pp. 215–222. Om-nipress, Haifa, Israel (June 2010)

5. Cheng, W., Huhn, J., Hullermeier, E.: Decision tree and instance-based learningfor label ranking. In: Bottou, L., Littman, M. (eds.) Proceedings of the 26th In-ternational Conference on Machine Learning (ICML-09). pp. 161–168. Omnipress,Montreal, Canada (June 2009)

6. Cheng, W., Hullermeier, E.: A new instance-based label ranking approach usingthe Mallows model. In: Yu, W., He, H., Zhang, N. (eds.) LNCS 5551 Advances inNeural Networks: The 6th International Symposium on Neural Networks (ISNN-09). pp. 707–716. Springer, Wuhan, China (May 2009)

7. Coppersmith, D., Fleischer, L.K., Rurda, A.: Ordering by weighted number of winsgives a good ranking for weighted tournaments. ACM Trans. Algorithms 6(3),55:1–55:13 (2010)

8. Dekel, O., Manning, C.D., Singer, Y.: Log-linear models for label ranking. In:Thrun, S., Saul, L.K., Scholkopf, B. (eds.) Advances in Neural Information Pro-cessing Systems 16 (NIPS 2003). pp. 497–504. MIT Press, Cambridge, MA (2004)

9. Dembczynski, K., Waegeman, W., Cheng, W., Hullermeier, E.: On label depen-dence and loss minimization in multi-label classification. Machine Learning 88(1–2), 5–45 (2012)

10. Dietterich, T.G., Bakiri, G.: Solving multiclass learning problems via error-correcting output codes. J. of Artif. Intell. Res. 2, 263–286 (1995)

11. Frank, E., Hall, M.A., Witten, I.H.: The WEKA Workbench. Online Appendix forData Mining: Practical Machine Learning Tools and Techniques. Morgan Kauf-mann, fourth edition edn. (2016)

12. Furnkranz, J., Hullermeier, E.: Preference learning: An introduction. In:Furnkranz, J., Hullermeier, E. (eds.) Preference Learning, pp. 1–18. Springer-Verlag (2010)

13. Furnkranz, J., Hullermeier, E.: Pairwise preference learning and ranking. In: Proc.ECML 2003. vol. 2837. Springer, Berlin, Heidelberg (2003)

14. Hamm, J., Belkin, M.: Probabilistic zero-shot classification with semantic rankings.CoRR abs/1502.08039 (2015), http://arxiv.org/abs/1502.08039

15. Har-Peled, S., Roth, D., Zimak, D.: Constraint classification: A new approach tomulticlass classification and ranking. In: Advances in Neural Information Process-ing Systems 15 (NIPS 2002) (2002)

16. Hullermeier, E., Furnkranz, J.: On predictive accuracy and risk minimization inpairwise label ranking. J. of Computer and System Sciences 76(1), 49–62 (2010)

17. Hullermeier, E., Furnkranz, J.: Comparison of ranking procedures in pairwise pref-erence learning. In: 10th Int. Conf. on Information Processing and Management ofUncertainty in Knowledge-Based Systems (IPMU–04). pp. 535–542 (2004)

18. Hullermeier, E., Furnkranz, J., Cheng, W., Brinker, K.: Label ranking by learningpairwise preferences. Artificial Intelligence 172(16-17), 1897 – 1916 (2008)

19. Liu, T.Y.: Learning to Rank for Information Retrieval. Springer (2011)20. Marden, J.: Analyzing and Modeling Rank Data. London, New York (1995)21. Shalev-Shwartz, S., Singer, Y.: Efficient learning of label ranking by soft projections

onto polyhedra. Journal of Machine Learning Research 7, 1567–1599 (2006)22. Tsoumakas, G., Katakis, I., Ioannis, V.: Random k-labelsets for multilabel classi-

fication. IEEE Transactions on Knowledge and Data Engineering 23(7) (2011)23. Zhou, Y., Lui, Y., Yang, J., He, X., Liu, L.: A taxonomy of label ranking algorithms.

Journal of Computers 9(3) (2014)


A Supplementary Material

A.1 Experimental Results I — Kendall’s tau rank correlationcoefficient

Table 6. Experimental results of the label ranking techniques with decision trees (J48)as base learner in terms of Kendall’s tau (in brackets the ranks; best average rank inboldface)



average rank 3.08 2.58 2.17 2.90 4.00 5.50 6.00


Table 7. Experimental results of the label ranking techniques with random forests(RandomForest) as base learner in terms of Kendall’s tau (in brackets the ranks; bestaverage rank in boldface)



average rank 2.92 2.75 1.92 2.80 4.29 5.50 7.00


A.2 Experimental Results I — Spearman’s rho rank correlationcoefficient

Table 8. Experimental results of the label ranking techniques with decision trees (J48)as base learner in terms of Spearman’s rank correlation (in brackets the ranks; bestaverage rank in boldface)



average rank 2.25 2.92 2.08 3.20 4.14 6,00 6.50


Table 9. Experimental results of the label ranking techniques with random forests(RandomForest) as base learner in terms of Spearman’s rank correlation (in bracketsthe ranks; best average rank in boldface)



average rank 2.00 2.92 1.83 3.20 5.00 5.75 7.00


A.3 Experimental Results I — Match

Table 10. Experimental results of the label ranking techniques with decision trees(J48) as base learner in terms of Match (in brackets the ranks; best average rank inboldface)



average rank 4.33 3.67 2.75 1.70 1.86 3.00 3.00


Table 11. Experimental results of the label ranking techniques with random forests(RandomForest) as base learner in terms of Match (in brackets the ranks; best averagerank in boldface)



average rank 4.50 3.92 2.42 1.70 1.57 3.00 4.00


A.4 Experimental Results I — Hamming similarity

Table 12. Experimental results of the label ranking techniques with decision trees(J48) as base learner in terms of Hamming similarity (in brackets the ranks; bestaverage rank in boldface)



average rank 4.25 3.83 2.58 1.80 2.00 4.00 4.50


Table 13. Experimental results of the label ranking techniques with random forests(RandomForest) as base learner in terms of Hamming similarity (in brackets the ranks;best average rank in boldface)



average rank 4.25 4.42 2.67 1.50 2.00 3.00 4.00


A.5 Experimental Results II — Controlled setup I

Table 14. Experimental results of the label ranking techniques with with randomforests (RandomForest) as base learner in terms of Match (in brackets the ranks)


250 0.185 (4) 0.185 (3) 0.203 (1) 0.195 (2) 0.176 (5) 0.138 (6) 0.105 (7)500 0.218 (4) 0.219 (3) 0.236 (1) 0.230 (2) 0.207 (5) 0.166 (6) 0.129 (7)

1000 0.248 (4) 0.249 (3) 0.269 (1) 0.266 (2) 0.241 (5) 0.198 (6) 0.150 (7)2000 0.279 (4) 0.282 (3) 0.304 (1) 0.297 (2) 0.272 (5) 0.224 (6) 0.173 (7)4000 0.311 (3) 0.310 (4) 0.336 (1) 0.330 (2) 0.304 (5) 0.254 (6) 0.198 (7)8000 0.340 (3) 0.339 (4) 0.367 (1) 0.357 (2) 0.335 (5) 0.287 (6) 0.224 (7)

16000 0.368 (3) 0.367 (4) 0.395 (1) 0.391 (2) 0.366 (5) 0.316 (6) 0.254 (7)

average rank 3.57 3.43 1.00 2.00 5.00 6.00 7.00

Table 15. Experimental results of the label ranking techniques with random forests(RandomForest) as base learner in terms of Hamming similarity (in brackets the ranks)


250 0.622 (3) 0.620 (4) 0.634 (1) 0.623 (2) 0.599 (5) 0.553 (6) 0.504 (7)500 0.651 (3) 0.649 (4) 0.663 (1) 0.653 (2) 0.628 (5) 0.583 (6) 0.531 (7)

1000 0.675 (3) 0.674 (4) 0.687 (1) 0.679 (2) 0.656 (5) 0.611 (6) 0.557 (7)2000 0.697 (3) 0.696 (4) 0.710 (1) 0.701 (2) 0.679 (5) 0.634 (6) 0.584 (7)4000 0.715 (3) 0.715 (4) 0.730 (1) 0.723 (2) 0.700 (5) 0.657 (6) 0.605 (7)8000 0.734 (3) 0.732 (4) 0.747 (1) 0.741 (2) 0.719 (5) 0.679 (6) 0.627 (7)

16000 0.750 (3) 0.749 (4) 0.764 (1) 0.758 (2) 0.738 (5) 0.701 (6) 0.650 (7)

average rank 3.00 4.00 1.00 2.00 5.00 6.00 7.00


A.6 Experimental Results II — Controlled setup II

Table 16. Experimental results of the label ranking techniques with decision trees(J48) as base learner in terms of Kendall’s tau (in brackets the ranks)

Labels RPC LR2MC-2 LR2MC-3 LR2MC-4 LR2MC-5 LR2MC-6 LR2MC-7

3 0.837 (1) 0.836 (2) 0.822 (3)4 0.833 (2) 0.835 (1) 0.823 (3) 0.802 (4)5 0.836 (3) 0.836 (2) 0.849 (1) 0.823 (4) 0.788 (5)6 0.837 (4) 0.837 (3) 0.854 (1) 0.848 (2) 0.810 (5) 0.771 (6)7 0.838 (4) 0.836 (5) 0.859 (1) 0.857 (2) 0.840 (3) 0.793 (6) 0.755 (7)

average rank 2.80 2.60 1.80 3.00 4.33 6.00 7.00

Table 17. Experimental results of the label ranking techniques with random forests(RandomForest) as base learner in terms of Kendall’s tau (in brackets the ranks)


3 0.874 (1) 0.874 (2) 0.864 (3)4 0.872 (1) 0.871 (2) 0.859 (3) 0.845 (4)5 0.873 (3) 0.873 (2) 0.875 (1) 0.857 (4) 0.830 (5)6 0.872 (2) 0.871 (3) 0.875 (1) 0.866 (4) 0.841 (5) 0.811 (6)7 0.870 (2) 0.869 (4) 0.877 (1) 0.869 (3) 0.854 (5) 0.825 (6) 0.787 (7)

average rank 1.80 2.60 1.80 3.75 5.00 6.00 7.00


Table 18. Experimental results of the label ranking techniques with decision trees(J48) as base learner in terms of Spearman’s rank correlation (in brackets the ranks)


3 0.867 (1) 0.866 (2) 0.852 (3)4 0.885 (1) 0.883 (2) 0.872 (3) 0.852 (4)5 0.900 (2) 0.900 (3) 0.905 (1) 0.885 (4) 0.857 (5)6 0.909 (3) 0.908 (4) 0.920 (1) 0.913 (2) 0.883 (5) 0.855 (6)7 0.916 (3) 0.915 (4) 0.929 (1) 0.927 (2) 0.914 (5) 0.878 (6) 0.851 (7)

average rank 2.00 3.00 1.80 3.00 5.00 6.00 7.00

Table 19. Experimental results of the label ranking techniques with random forests(RandomForest) as base learner in terms of Spearman’s rank correlation (in bracketsthe ranks)


3 0.900 (2) 0.900 (1) 0.889 (3)4 0.913 (1) 0.913 (2) 0.902 (3) 0.891 (4)5 0.924 (1) 0.924 (3) 0.924 (2) 0.911 (4) 0.891 (5)6 0.931 (2) 0.930 (3) 0.932 (1) 0.926 (4) 0.908 (5) 0.885 (6)7 0.936 (2) 0.935 (3) 0.939 (1) 0.933 (4) 0.923 (5) 0.901 (6) 0.875 (7)

average rank 1.60 2.40 2.00 4.00 5.00 6.00 7.00


A.7 Experimental Results III — Classifier Subsampling

Table 20. Experimental results of the label ranking techniques with classifier subsam-pling and decision trees (J48) as base learner in terms of Kendall’s tau (in brackets theranks; best average rank in boldface)



average rank 3.17 2.67 2.08 2.80 3.86 5.25 7.00

A.8 Synthetic Setting

In a complementary experimental study, we considered a synthetic setting thataims at simulating the reconstruction of a target ranking (of length m) from“noisy fragments” (of length k) under simplified assumptions:

– We proceed from a probability distribution P on the set of all rankings of mlabels, i.e., all rankings π of a fixed length m. The mode of this distribution,i.e., the ranking with the highest probability, is the target ranking π∗ we seekto reconstruct. We instantiate P with the well-known Mallows and Plackett-Luce (PL) distributions [20].

– For each subset L of k labels, the distribution P induces a distribution PLon all rankings of that subset (the latter is a marginal of the former).

– To simulate the prediction of a label ranker on L, we randomly sample πLfrom PL.

– The rankings thus produced (for each k-subset of labels) are finally combinedinto an overall prediction π according to

π = argminπ

∑L

`(π, πL) , (3)


Table 21. Experimental results of the label ranking techniques with classifier subsam-pling and random forests (RandomForest) as base learner in terms of Kendall’s tau (inbrackets the ranks; best average rank in boldface)



average rank 3.00 2.75 2.08 2.90 4.00 5.50 7.00

where ` denotes a loss function that measures how well π matches the k-ranking πL; we instantiate ` with the Kendall loss (number of pairwise in-versions) and the 0/1 loss (which is 0 if the order of the k labels is the samein π and πL, and 1 otherwise).

We are interested in the expected quality of the reconstruction π, i.e., the ex-pected loss of π as a prediction of π∗. This quality is approximated through theaverage reconstruction error on many repetitions of the above experiment. Asloss functions, we again consider Kendall and 0/1 loss.

In the following tables, we present results for different configurations of prob-ability distributions P and loss functions for reconstruction (3) and evaluationof the prediction π. For the Mallows model, different values of the spread pa-rameter θ are used. The PL distribution has real-valued parameters v1, . . . , vm,one for each label. Here, we use two types of parametrizations, one with linearweights (vi = i) and another one with exponential weights (vi = 2i).


Table 22. Mallows, 0/1, 0/1.

N = 5, 10 000 iterations

k = 2 k = 3 k = 4 k = 5

θ = 0.3 .9675 .9539 .9646 .9684θ = 0.5 .9326 .9134 .9277 .9372θ = 0.8 .8694 .7895 .8296 .8699θ = 2.0 .4314 .2077 .2575 .4306


k = 2 k = 3 k = 4 k = 5 k = 6

θ = 0.3 .9845 .9825 .9790 .9865 .9885θ = 0.5 .9745 .945 .9465 .968 .9765θ = 0.8 .9265 .806 .8085 .89 .927θ = 2.0 .5 .191 .103 .2565 .5025

N = 7, 100 iterations

k = 2 k = 3 k = 4 k = 5 k = 6 k = 7

θ = 0.3 1 .99 .99 .99 1 1θ = 0.5 .99 .96 .95 .97 .99 .99θ = 0.8 .91 .86 .76 .78 .96 .93θ = 2.0 .59 .13 .03 .08 .33 .6


k = 2 k = 3 k = 4 k = 5 k = 6 k = 7 k = 8

θ = 0.8 .9735 .83 .653 .7155 .886 .958 .9765


k = 2 k = 3 k = 4 k = 5 k = 6 k = 7 k = 8 k = 9

θ = 0.8 .98 .74 .58 .56 .79 .91 .96 .98

Table 23. Mallows, Kendall, 0/1.


k = 2 k = 3 k = 4 k = 5

θ = 0.3 .9688 .9724 .9698 .9687θ = 0.5 .9367 .9125 .9183 .9369θ = 0.8 .8652 .7641 .7749 .8652θ = 2.0 .4216 .1641 .1859 .4285


k = 2 k = 3 k = 4 k = 5 k = 6

θ = 0.3 .9875 .986 .9825 .995 .992θ = 0.5 .9735 .935 .917 .9555 .9715θ = 0.8 .9205 .789 .6695 .8135 .92θ = 2.0 .5105 .1845 .0585 .1825 .502


k = 2 k = 3 k = 4 k = 5 k = 6 k = 7

θ = 0.3 .99 .99 .94 .99 .99 1θ = 0.5 .99 .96 .83 .85 .95 .99θ = 0.8 .96 .73 .49 .47 .76 .92θ = 2.0 .63 .09 .02 .01 .05 .44


Table 24. Mallows, Kendall, Kendall.


k = 2 k = 3 k = 4 k = 5

θ = 0.3 3.5941 3.3489 3.4765 3.7761θ = 0.5 2.8623 2.4469 2.6776 3.0295θ = 0.8 1.9906 1.4529 1.6165 2.2024θ = 2.0 0.5713 0.1822 0.1772 0.5652


k = 2 k = 3 k = 4 k = 5 k = 6

θ = 0.3 5.016 4.406 4.1195 4.7895 5.515θ = 0.5 3.7505 2.8215 2.506 3.3085 4.2555θ = 0.8 2.6005 1.4805 1.1295 1.842 2.9385θ = 2.0 0.6985 0.202 0.0675 0.2 0.723


k = 2 k = 3 k = 4 k = 5 k = 6 k = 7

θ = 0.3 6.44 5.19 4.17 4.61 6.16 7.32θ = 0.5 4.64 3.33 1.85 2.23 3.53 5.82θ = 0.8 3.21 1.38 0.72 0.91 1.39 3.75θ = 2.0 0.84 0.07 0 0.01 0.06 1.04

Table 25. Mallow, 0/1, Kendall.


k = 2 k = 3 k = 4 k = 5

θ = 0.3 3.6422 3.3469 3.6178 3.7927θ = 0.5 2.8616 2.5142 2.8318 3.0521θ = 0.8 1.9795 1.577 1.9154 2.2242θ = 2.0 0.5585 0.2299 0.3168 0.5881


k = 2 k = 3 k = 4 k = 5 k = 6

θ = 0.3 5.1285 4.285 4.5475 5.3885 5.41θ = 0.5 3.7515 2.9715 3.1485 3.933 4.249θ = 0.8 2.571 1.6505 1.71 2.57 2.971θ = 2.0 0.673 0.1885 0.0995 0.351 0.7465


k = 2 k = 3 k = 4 k = 5 k = 6 k = 7

θ = 0.3 6.66 5.28 5.22 6.07 7.32 6.92θ = 0.5 5.07 3.27 3.08 3.5 5.8 5.39θ = 0.8 3.19 1.48 1.54 1.91 3.17 3.83θ = 2.0 0.92 0.09 0.05 0.06 0.45 0.87


Table 26. PL, 0/1, 0/1.


k = 2 k = 3 k = 4 k = 5

vi = i .9465 .9336 .9438 .9567vi = 2i .8764 .8219 .8674 .8983


k = 2 k = 3 k = 4 k = 5 k = 6

vi = i .985 .971 .9715 .9845 .9855vi = 2i .9325 .8465 .8575 .937 .94


k = 2 k = 3 k = 4 k = 5 k = 6 k = 7vi = i 1 .98 1 1 .99 1vi = 2i .97 .92 .83 .85 .99 .96

Table 27. PL, Kendall, 0/1.


k = 2 k = 3 k = 4 k = 5

vi = i .9457 .939 .949 .9542vi = 2i .8767 .8109 .8171 .896


k = 2 k = 3 k = 4 k = 5 k = 6

vi = i .978 .9775 .973 .9845 .9845vi = 2i .937 .8225 .7365 .845 .95


k = 2 k = 3 k = 4 k = 5 k = 6 k = 7

vi = i 1 .99 .98 1 1 1vi = 2i .97 .73 .57 .63 .93 .96


Table 28. PL, Kendall, Kendall.


k = 2 k = 3 k = 4 k = 5

vi = i 2.9772 2.6318 2.8197 3.222vi = 2i 1.9854 1.5071 1.6464 2.2453


k = 2 k = 3 k = 4 k = 5 k = 6

vi = i 4.274 3.416 3.3245 4.0525 4.8105vi = 2i 2.49 1.6265 1.263 1.91 2.9635


k = 2 k = 3 k = 4 k = 5 k = 6 k = 7

vi = i 5.66 4.8 3.61 4.2 5.09 6.17vi = 2i 2.95 1.45 0.94 0.85 2.01 3.55

Table 29. PL, 0/1, Kendall.


k = 2 k = 3 k = 4 k = 5

vi = i 2.9781 2.5988 2.8278 3.2vi = 2i 1.9614 1.5177 1.6526 2.2269


k = 2 k = 3 k = 4 k = 5 k = 6

vi = i 4.207 3.51 3.321 3.949 4.721vi = 2i 2.489 1.568 1.2995 1.911 2.9655


k = 2 k = 3 k = 4 k = 5 k = 6 k = 7

vi = i 5.95 4.21 3.75 4 5.38 6.48vi = 2i 3.06 1.71 0.86 1.07 1.86 3.83

A Reduction of Label Ranking to Multiclass Classi cation · 2020-01-10 · A Reduction of Label...

Documents

Transcript of A Reduction of Label Ranking to Multiclass Classi cation · 2020-01-10 · A Reduction of Label...