arXiv:2110.05368v3 [stat.ME] 13 Oct 2021

33
October 14, 2021 TRUNCATED RANK-BASED TESTS FOR TWO-PART MODELS WITH EXCESSIVE ZEROS AND APPLICATIONS TO MICROBIOME DATA BY WANJIE WANG 1 ,ERIC Z. CHEN 2 AND HONGZHE LI 2,* 1 National University of Singapore, [email protected] 2 University of Pennsylvania, * [email protected] High-throughput sequencing technology allows us to test the composi- tional difference of bacteria in different populations. One important feature of human microbiome data is that it often includes a large number of zeros. Such data can be treated as being generated from a two-part model that includes a zero point-mass. Motivated by analysis of such non-negative data with exces- sive zeros, we introduce several truncated rank-based two-group and multi- group tests for such data, including a truncated rank-based Wilcoxon rank- sum test for two-group comparison and two truncated Kruskal-Wallis tests for multi-group comparison. We show both analytically through asymptotic relative efficiency analysis and by simulations that the proposed tests have higher power than the standard rank-based tests, especially when the propor- tion of zeros in the data is high. The tests can also be applied to repeated measurements of compositional data via simple within-subject permutations. We apply the tests to the analysis of a gut microbiome data set to compare the microbiome compositions of healthy and pediatric Crohn’s disease patients and to assess the treatment effects on microbiome compositions. We identify several bacterial genera that are missed by the standard rank-based tests. 1. Introduction. The human microbiome includes all microorganisms in various human body sites such as gut, skin and mouth and has been shown to be associated with many human diseases such as obesity, diabetes and inflammatory bowel disease (Turnbaugh et al., 2006; Qin et al., 2012; Manichanh et al., 2012). Two high-throughput sequencing based approaches, including 16S ribosomal RNA (rRNA) sequencing and shotgun metagenomic sequencing are commonly used in microbiome studies (Turnbaugh et al., 2007; Qin et al., 2010). Bioin- formatics methods are available for quantifying the microbial relative abundances based on such sequencing data, which typically involve aligning the reads to some known database or marker genes (Huson et al., 2007; Segata et al., 2012). Since the DNA yielding materials are different across different samples, the resulting numbers of sequencing reads vary greatly from sample to sample. In order to make the microbial abundance comparable across sam- ples, the abundance in read counts is usually normalized to the relative abundance of the bacteria observed, which results in high dimensional compositional data. Some of the most widely used metagenomic processing software such as MEGAN (Huson et al., 2007) and MetaPhlAn (Segata et al., 2012) only outputs the relative abundances of the bacterial taxa. In microbiome studies, one is often interested in identifying the bacterial taxa such as genera or species that show different distributions between two or more conditions. One im- portant feature of the microbiome compositional data is that the data include large clumps of zeros that represent the absence of the bacterial taxa in the samples, especially for those arXiv: math.PR/0000000 * This research was supported by NIH grants GM123056 and GM129781. Hongzhe Li is the corresponding author. Keywords and phrases: Asymptotic relative efficiency; Differential abundance analysis; Two-part model; Truncation;. 1 arXiv:2110.05368v3 [stat.ME] 13 Oct 2021

Transcript of arXiv:2110.05368v3 [stat.ME] 13 Oct 2021

Page 1: arXiv:2110.05368v3 [stat.ME] 13 Oct 2021

October 14, 2021

TRUNCATED RANK-BASED TESTS FOR TWO-PART MODELS WITHEXCESSIVE ZEROS AND APPLICATIONS TO MICROBIOME DATA

BY WANJIE WANG1, ERIC Z. CHEN2 AND HONGZHE LI2,*

1National University of Singapore, [email protected]

2University of Pennsylvania, *[email protected]

High-throughput sequencing technology allows us to test the composi-tional difference of bacteria in different populations. One important feature ofhuman microbiome data is that it often includes a large number of zeros. Suchdata can be treated as being generated from a two-part model that includes azero point-mass. Motivated by analysis of such non-negative data with exces-sive zeros, we introduce several truncated rank-based two-group and multi-group tests for such data, including a truncated rank-based Wilcoxon rank-sum test for two-group comparison and two truncated Kruskal-Wallis testsfor multi-group comparison. We show both analytically through asymptoticrelative efficiency analysis and by simulations that the proposed tests havehigher power than the standard rank-based tests, especially when the propor-tion of zeros in the data is high. The tests can also be applied to repeatedmeasurements of compositional data via simple within-subject permutations.We apply the tests to the analysis of a gut microbiome data set to compare themicrobiome compositions of healthy and pediatric Crohn’s disease patientsand to assess the treatment effects on microbiome compositions. We identifyseveral bacterial genera that are missed by the standard rank-based tests.

1. Introduction. The human microbiome includes all microorganisms in various humanbody sites such as gut, skin and mouth and has been shown to be associated with many humandiseases such as obesity, diabetes and inflammatory bowel disease (Turnbaugh et al., 2006;Qin et al., 2012; Manichanh et al., 2012). Two high-throughput sequencing based approaches,including 16S ribosomal RNA (rRNA) sequencing and shotgun metagenomic sequencing arecommonly used in microbiome studies (Turnbaugh et al., 2007; Qin et al., 2010). Bioin-formatics methods are available for quantifying the microbial relative abundances based onsuch sequencing data, which typically involve aligning the reads to some known databaseor marker genes (Huson et al., 2007; Segata et al., 2012). Since the DNA yielding materialsare different across different samples, the resulting numbers of sequencing reads vary greatlyfrom sample to sample. In order to make the microbial abundance comparable across sam-ples, the abundance in read counts is usually normalized to the relative abundance of thebacteria observed, which results in high dimensional compositional data. Some of the mostwidely used metagenomic processing software such as MEGAN (Huson et al., 2007) andMetaPhlAn (Segata et al., 2012) only outputs the relative abundances of the bacterial taxa.

In microbiome studies, one is often interested in identifying the bacterial taxa such asgenera or species that show different distributions between two or more conditions. One im-portant feature of the microbiome compositional data is that the data include large clumpsof zeros that represent the absence of the bacterial taxa in the samples, especially for those

arXiv: math.PR/0000000*This research was supported by NIH grants GM123056 and GM129781. Hongzhe Li is the corresponding

author.Keywords and phrases: Asymptotic relative efficiency; Differential abundance analysis; Two-part model;

Truncation;.

1

arX

iv:2

110.

0536

8v3

[st

at.M

E]

13

Oct

202

1

Page 2: arXiv:2110.05368v3 [stat.ME] 13 Oct 2021

2

relatively rare taxa. Zero observations can also result from under-sampling of the sequencingreads for rare taxa. As an example, Figure 1 shows the heatmap of zeros and the relative abun-dances for 26 healthy and normal samples and 85 samples in Crohn’s disease group for eachof the 60 bacterial genera. We observed over 62.5% of the observations are zero. In addition,the compositional data are often skewed, which makes parametric modeling of such data dif-ficult and parametric tests problematic. Non-parametric tests, such as the Wilcoxon rank-sumtest and Kruskal-Wallis test, can be applied to such data and are commonly used in analysisof microbiome data. However, such rank-based tests tend to have low power because of thelarge number of ties from zero observations (Lachenbruch, 1976, 2001; Hallstrom, 2010).A two-part test combining the square of a test statistic for comparison of the proportion ofzeros and the square of an appropriate normal test such as the Wilcoxon rank-sum to comparethe non-zero scores was proposed and evaluated by Lachenbruch (2001). This two-part testwas recently applied to analysis of microbiome data (Wagner, Robertson and Harris, 2011).However, the theory developed for Lachenbruch’s 2 degree of freedoms χ2 test assumes thatthe non-zero compositions are independent of the proportion of zeros.

g__B

acter

oides

g_

_Prev

otella

g_

_Esc

heric

hia

g__L

actob

acillu

s g_

_Stre

ptoco

ccus

g_

_Bifid

obac

terium

g_

_Rum

inoco

ccus

g_

_Eub

acter

ium

g__F

aeca

libac

terium

g_

_Alist

ipes

g__E

nteroc

occu

s g_

_Kleb

siella

g_

_Ros

eburi

a g_

_Clos

tridium

g_

_Dial

ister

g__H

aemo

philu

s g_

_Sutt

erella

g_

_Veil

lonell

a g_

_Para

bacte

roide

s g_

_Coll

insell

a g_

_Acti

nomy

ces

g__M

egam

onas

g_

_Lac

tococ

cus

g__O

dorib

acter

g_

_Blau

tia

g__C

oproc

occu

s g_

_Dore

a g_

_Akke

rman

sia

g__P

edioc

occu

s g_

_Pep

toniph

ilus

__Me

thano

brevib

acter

g_

_Cop

robac

illus

g__R

othia

hasc

olarct

obac

terium

g_

_Hold

eman

ia g_

_Bilo

phila

g_

_Turi

cibac

ter

g__E

nterob

acter

g_

_Ana

erosti

pes

g__G

ranuli

catel

la g_

_Gem

ella

g__E

ggert

hella

g_

_Ana

erotru

ncus

g_

_Cate

nibac

terium

g_

_Des

ulfov

ibrio

g__A

naero

cocc

us

g__P

orphy

romon

as

g__V

ictiva

llis

g__A

biotro

phia

g__S

oloba

cteriu

m g_

_Atop

obium

g_

_Sub

dolig

ranulu

m g_

_Parv

imon

as

g__G

ordon

ibacte

r g_

_Slac

kia

_Can

didatu

s_Zin

deria

g_

_Ana

erofus

tis

g__O

ribac

terium

g_

_Buty

rivibr

io _P

seud

oflav

onifra

ctor

0

20

40

60

80

Croh

n’s D

iseas

e No

rmal

Fig 1: Heatmap of zeros and the relative abundances for 26 normal samples and 85 samplesin Crohn’s disease group for each of the 60 bacterial genera, showing over 62.5% of theobservations being zero.

To account for excessive zeros in non-negative distributions, Hallstrom (2010) introduceda truncated Wilcoxon rank-sum test, where the Wilcoxon rank-sum test is performed afterremoving an equal and maximal number of zeros from each sample. He showed that thistest recovers much of the power loss from the standard application of the Wilcoxon test.Compared with a directional modification of a two-part test proposed by Lachenbruch, thetruncated Wilcoxon test has similar power when the non-zero scores are independent of theproportion of zeros. In addition, the truncated Wilcoxon test is relatively unaffected whenthese processes are dependent. Hallstrom (2010) however only considered the two-sampletest when the sample sizes are the same in both groups.

In this paper, we assume that the data are generated from two-part models with point-massat zero as one of the components. We develop several rank-based tests for the general two-group comparison with possible unequal sample sizes, and for the multiple-group comparison

Page 3: arXiv:2110.05368v3 [stat.ME] 13 Oct 2021

3

with equal or unequal sample sizes. These new tests are based on the idea of data truncationand asymptotic calculations and can effectively deal with the clumps of zeros in the data.The asymptotic null distributions of the tests are given. The key difficulty of defining suchtruncated rank-based tests is to calculate the variance of the test statistics under the null,which does not have close-form expressions. We instead use asymtotic analysis to obtainapproximations of the variance estimates.

In order to demonstrate the advantages of the proposed truncated rank-based tests, wealso derive the asymptotic relative efficiency of the proposed tests compared to commonlyused Wilcoxon rank-sum test and the Kruskal-Wallis test when the data are generated fromtwo-part models (Lachenbruch, 2001). We show large gain in efficiency, especially when theproportions of zeros in the data are high. These tests are rank-based, easy to calculate andprovide new tools for identifying the bacterial taxa with different distributions among differ-ent groups in human microbiome studies. We illustrate and compare the proposed tests usinga microbiome study comparing healthy and pediatric Crohn’s disease patients and assessingthe effects of treatment over time.

The rest of the paper is organized as follows. In Section 2, we extend the truncatedWilcoxon rank-sum test of Hallstrom (2010) to data with unequal sample sizes. We thenpresent in Section 3 and Section 4 a modified Kruskal-Wallis test to account for clumps ofzeros for multiple-sample setting with equal sample sizes and unequal sample sizes, respec-tively. For all these extensions, we derive the asymptotic relatively efficiency compared to thestandard rank-based tests. In Section 5, we present simulations to evaluate the proposed testsand to compare their power. In Section 6, we present results from an analysis of pediatricCrohn’s disease microbiome study at the University of Pennsylvania to identify the bacterialgenera that are associated with pediatric Crohn’s disease and to identify the bacterial generathat change their abundances over time during the anti-TNF treatment. Finally, we give abrief summary and discussion in Section 7. Details of the mathematical derivations and proofof relevant lemmas are provided as online Supplemental Materials.

2. A truncated Wilcoxon rank-sum test for data with excessive zero.

2.1. A truncated Wilcoxon rank-sum test. Consider the two-sample setting where wehave N1 non-negative independent observations from population 1, x = (x(1), · · · , x(N1)),and N2 independent observations from population 2, y = (y(1), · · · , y(N2)) where x(i) (ory(i)) represents the relative abundance of a bacterium in the ith sample of group 1 (or 2). Formost of the bacteria, x and y include many zeros, which represent absence of the bacteriumin these samples. We are interested in testing whether these observations x and y are fromthe same distribution, i.e., the hypothesis testing problem that

H0 : x∼ F,y∼ F vs H1 : x∼ F,y∼G,F 6=G,

where F andG are both probability density functions with a proportion of zeros. We considerthe two-part model of Lachenbruch (1976), which assumes that the data are generated fromthe following distributions,

(1) x(i)∼ (1− θ1)δ0 + θ1f, y(j)∼ (1− θ2)δ0 + θ2g, i= 1, · · · ,N1, j = 1, · · · ,N2,

where θk is the probability of being nonzero for population k, δ0 is point mass at 0, and f (org) is the distribution for nonzero elements for the population 1 (or 2).

Because of the excessive zeros, the standard non-parametric Wilcoxon rank-sum test statis-tic is less effective. A truncated Wilcoxon rank-sum test statistic has been proposed by Hall-strom (2010) for the case N1 =N2 =N . We first extend the test to the general setting where

Page 4: arXiv:2110.05368v3 [stat.ME] 13 Oct 2021

4

N1 6=N2, and examine the asymptotic relative efficiency compared to the standard Wilcoxontest statistic.

Given x and y, denote n1 (or n2) as the number of non-zero observations in x (or y) and letp1 (or p2) be the proportion of non-zero observations in x (or y), where p1 = n1/N1 (or p2 =n2/N2). Let p= max{p1, p2}. We rank the combined observations x∪ y from the largest tothe smallest so that zeros have the highest ranks, where the tied measurements are given theaverage rank. For x (or y), we keep only bpN1c (or bpN2c) observations with the smallestranks, which implies that the observations removed are all zeros. The truncated samples aredenoted as x and y, respectively. Let R denote the sum of the ranks of all observations in x.Hence, the Wilcoxon test statistic can be written as

(2) TW =S2

Var(S), where S =R− N1 +N2 + 1

2N1,

and Var(S) is the variance of S under the null hypothesis.The same procedure can be applied to the truncated data x and y. Rank the combined

observations x ∪ y, from the largest to the smallest, and let r denote the sum of ranks of allobservations in x. First define the counterpart of S as

s= r− bp(N1 +N2)c+ 1

2bpN1c −

1

4

p1 + p2

2

(1− p1 + p2

2

)(N2 −N1),

and define the truncated Wilcoxon test statistic as

TtW =s2

Var(s).

The statistic s is very similar to S, except an extra term

1

4

p1 + p2

2

(1− p1 + p2

2

)(N2 −N1),

which is caused by the difference between the variances due to different sample sizes. Thisterm disappears when N1 =N2. Under the alternative hypothesis, this term is a small orderterm compared to the other part of s. Under the null hypothesis, this extra term is used toeliminate the effect of different sample sizes, leading to an expectation close to 0.

To calculate Var(s) and to derive the asymptotic distribution of s, we show that under thenull that both x and y follow the same distribution with a point mass at 0, when N1→∞,N2→∞, we have

E[S] = 0, E[s] =O(max{N1,N2}),

where the remainder O(max{N1,N2}) is caused by the difference between bpN1c (orbpN1c) and pN1 (or pN2). In addition, we have

Var(S) =N2

1N22

4E{(p2 − p1)2}+

N1N2

12E(N1p

21p2 +N2p1p

22 + p1p2),

Var(s) =N2

1N22

4Var{(p2 − p1)p}+

N1N2

12E(N1p

21p2 +N2p1p

22 + p1p2).

(see Supplemental Materials). With these the results, under the null hypothesis, when N1→∞, N2→∞, and c≤N1/N2 ≤C for some constants c and C , we have

Var(S) =N1N2(N1 +N2)θ(1− θ+ θ2/3)/4 +O(N2.5),

Var(s) =N1N2(N1 +N2)θ3(1− θ+ 1/3)/4 +O(N2.5).

Page 5: arXiv:2110.05368v3 [stat.ME] 13 Oct 2021

5

where θ is the expected value of the proportion non-zero observations in the data. Hence,√Var(s) = O(

√N1N2 max{N1,N2}), and E[s] = O(max{N1,N2}) is relatively small

when N1 and N2 are large. Asymptotically, s/√V ar(s) follows a standard normal distri-

bution, and the truncated Wilcoxon test statistic TtW = s2/Var(s) has a χ21 null distribution.

The asymptotic analysis above provides a way of approximating Var(s) using the mainterm of Var(s), which leads to the final test statistic

TtW =s2

N1N2(N1 +N2)(p1+p22 )3(4/3− p1+p2

2 )/4.

This is used in our simulation and real data analysis.

2.2. Asymptotic relative efficiency of TW and TtW for two-sample test. To compare thetest statistics TW and TtW , we evaluate the asymptotic property of the relative efficiency.Since we reject the null hypothesis when the statistic is large, the relative efficiency is definedas

RE(TtW , TW ) =E1(TtW )

E1(TW )=

E1[s2]/Var(s)

E1[S2]/Var(S).

Here, E1 denotes the expectation of the statistic under alternative hypothesis. A value largerthan 1 implies power gain using the truncated Wilcoxon test statistic compared to the standardWilcoxon statistic.

For the two-part model (1), we need some terms to quantify the difference between thetwo distributions. Let θ = (θ1 + θ2)/2, θm = max{θ1, θ2}, and4θ = θ2− θ1. Define ∆f,g =P (X < Y ) + P (X = Y )/2− 1/2, X ∼ f and Y ∼ g. The term ∆f,g is used to measure theeffect size of the non-zero part. The following theorem provides an explicit expression forARE.

Theorem 1 Under model (1) and that N1→∞, N2→∞, and c ≤ N1/N2 ≤ C holds forsome constants c and C , then the ARE can be derived as

ARE(TtW , TW ) =1

θ2

1− θ+ θ2/3

1− θ+ 1/3

(θ1θ2∆f,g +4θθm/2θ1θ2∆f,g +4θ/2

)2

+O(N−1/2).

Especially, we have that E[s|n1, n2

]= n1n2∆f,g.

To illustrate the gain in efficiency of using the truncated Wilcoxon test, we present theARE(Wm,W ) for the following four different parameter settings to assess the effect of ∆f,g ,θ,4θ, and N1/N2:

(a) Effect of ∆f,g: N1 = 40, and N2 = 50 and θ1 = 0.3, θ2 = 0.8. Let ∆f,g range between−1/2 and 1/2, with a step size 0.01.

(b) Effect of θ: N1 = 40, N2 = 50, and ∆f,g = 0.1. Let θ range between 0.1 and 0.9, with astep size 0.01. With θ, take θ1 = θ− 0.1 and θ2 = θ+ 0.1.

(c) Effect of 4θ: N1 = 40, N2 = 50 and ∆f,g = 0.1. Let 4θ range between 0.2 and 1, witha step size 0.02. With4θ, take θ1 = 0.5−4θ/2 and θ2 = 0.5 +4θ/2.

(d) Effect of sample size: θ1 = 0.3, θ2 = 0.8, ∆f,g = 0.1, and N2 = 50. Let N1 range be-tween 20 and 120, with a step size 5.

Figure 2 shows the ARE for these four different settings. We observe that AREs are largerthan 1 for all four settings, indicating that the truncated Wilcoxon statistic has greater powerthan the standard Wilcoxon statistic. In settings (a) and (b), ARE(TtW , TW ) increases when∆f,g increases or the proportion of zeros increases. Hence, when the nonzero part are awayfrom each other, the truncated test statistic gains more power compared to the standard

Page 6: arXiv:2110.05368v3 [stat.ME] 13 Oct 2021

6

Wilcoxon statistics. In setting (c), the ARE(TtW , TW ) function is not monotone due to thecancellation of difference between non-zero part and the non-zero proportions. In setting (d),obviously ARE(TtW , TW ) does not depend on the value of N1 or N2, which can be seen inthe formula.

Fig 2: Asymptotic relative efficiency ARE(TtW , TW ) comparing the truncated Wilcoxon testand the standard Wilcoxon test as a function of (a) ∆f,g , (b) θ, (c)4θ and (d) N1.

3. A truncated Kruskal-Wallis test for K-group comparisons with equal samplesizes.

3.1. A truncated Kruskal-Wallis test. Consider the setting where we have data from Kgroups x1,x2, · · · ,xK , all containing non-negative independent observations from popula-tion 1,2, · · · ,K , respectively. Each sample xi = (x1(1), · · · , x1(Ni)) contains Ni i.i.d ob-servations. We assume that the samples from the K group are generated from the followingtwo-part model,

(3) Xk(i)∼ (1− θk)δ0 + θkfk, k = 1, · · · ,K.

Page 7: arXiv:2110.05368v3 [stat.ME] 13 Oct 2021

7

The hypothesis of interest is

H0 : θ1 = · · ·= θK = θ, f1 = · · ·= fK = f vs H1 : not all θis and fis are equal.

Kruskal-Wallis test is a standard nonparametric test for K-group comparison based on theranks. We propose to develop a similar rank-based test that account for excessive zeros in thetwo-part model. We first consider the case when the samples sizes from all K samples arethe same, i.e., N1 =N2 = · · ·=NK =N . Let ri(j) be the rank of the jth observation in theith sample among all KN observations. The Kruskal-Wallis statistic can be written as

TKW =12

KN2(KN + 1)

K∑i=1

S2i ,

where Si =∑N

j=1 ri(j)− (KN + 1)N/2. To derive the distribution of TKW under the nullhypothesis, we rewrite TKW in terms of Yi, where Yi =

∑ij=1 Sj − iSi+1, 1≤ i≤K − 1,

TKW =12

KN2(KN + 1)

K−1∑i=1

Y 2i

i(i+ 1)=

K−1∑i=1

Y 2i

Var(Yi).

With this transformation, Yis are asymptotically independent with each other, and under thenull hypothesis, the test statistic is the summation of K − 1 chi-square random variableswith degree of freedom of 1 and therefore TKW is asymptotically distributed as a χ2

K−1distribution.

In order to account for excessive of zeros in the data sets, we propose to modify the TKWstatistic using the idea of truncation. Let ni denote the number of nonzero observations insample i, so the proportion of nonzero entries is pi = ni/N . Let p= max1≤i≤K pi. For eachsample i, keep the pN observations with smallest ranks (or largest values), so that all theremoved entries are zeros. Let n= pN = max1≤i≤K piN = max1≤i≤K ni, then the truncateddata set has Kn observations in total. Rank all Kn observations from smallest to largest, andthen let ri denote the rank-sum of the observations in sample i. Define

si = ri − n(Kn+ 1)/2, Ui =

i∑j=1

sj − isi+1, 1≤ i≤K − 1,

where sK =−∑K−1

i=1 si. The following Lemma show that Ui’s are independent.

Lemma 1 Under the model that all xi(j) are independently and identically distributed,

E[Ui] = 0, Cov(Ui1 ,Ui2) = 0.

This leads to our definition of the truncated Kruskal-Wallis test statistic as

TtKW =

K−1∑i=1

U2i

Var(Ui),

which has a χ2 distribution with degree of freedoms K − 1 under the null hypothesis. WhenK = 2, this statistic becomes the truncated Wilcoxon rank-sum statistic given N1 = N2.The following Lemma 2 provides an approximation of the variance of Yi and Ui under nullhypothesis and alternative hypothesis.

Lemma 2 Consider the two-part model (3). Under null hypothesis and suppose that N →∞, we have

Var(Yi) =i(i+ 1)K2N3θ

4

{θ2

3+ (1− θ)

}+O(N5/2), 1≤ i≤K − 1;

Page 8: arXiv:2110.05368v3 [stat.ME] 13 Oct 2021

8

Var(Ui) =i(i+ 1)K2N3θ3

4

{1

3+ (1− θ)

}+O(N5/2), 1≤ i≤K − 1.

Under alternative hypothesis, we have the upper bound of the variances,

Var1(Yi)≤K3N3, Var1(Ui)≤K3N3, 1≤ i≤K.

Based on this lemma, the unknown variance Var(Ui) can be approximated by

i(i+ 1)K2N3(∑K

k=1 pk/K)3

4

(4

3−

K∑k=1

pk/K

).

3.2. Asymptotic relative efficiency of TtKW and TKW under a two-part model. We eval-uate the relative efficiency of the truncated statistic TtKW with the standard statistic TKWby

RE(TtKW , TKW ) =E1[TtKW ]

E1[TKW ]=

∑K−1i=1 E1[U2

i ]/Var(Ui)∑K−1i=1 E1[Y 2

i ]/Var(Yi).

We assume that K samples have the same sample size N and the data are generated from thetwo-part model (3). To find RE(TtKW , TKW ), there are four terms to calculate: the expecta-tion of U2

i and Y 2i under the alternative hypothesis, and the variance of Ui and Yi under the

null hypothesis. To find E1[U2i ] and E1[Y 2

i ], we can calculate Var(Ui) and Var(Yi), and theexpectation of Ui and Yi separately, where the variance are needed under either hypothesis.Lemma 2 provides an approximation of the variance of Yi and Ui under null hypothesis andalternative hypothesis.

In order to calculate the expectation of Yi and Ui under alternative hypothesis, we notethat the basic terms in Yi and Ui are the rank-sum of the nonzero observations. Let r0

i denotethe rank-sum of the non-zero observations in sample i, 1≤ i≤K , and that s0

i = r0i − (1 +∑K

j=1 nj)ni/2. Define ∆i,k = P (Xi <Xk) + P (Xi =Xk)/2− 1/2, Xi ∼ fi and Xk ∼ fk,for 1≤ i, k ≤K− 1, where ∆i,k can be used to measure the effect sizes. In addition, one caneasily check that

E[ i∑j=1

s0j − is0

i+1

∣∣n1, · · · , nK]

=

i∑j=1

∑k 6=j

njnk∆j,k − ini+1

∑k 6=i+1

nk∆i+1,k,(4)

for 1≤ i≤K − 1. For model (3) and the effect sizes specified above, we have the results onE1[Yi] and E1[Ui], as shown in the following Lemma.

Lemma 3 Under alternative hypothesis, for 1≤ i≤K − 1, the expectation of Yi is

E1[Yi] =N2

( i∑j=1

∑k 6=j

θjθk∆j,k − iθi+1

∑k 6=i+1

θk∆i+1,k

)+KN2

2

(iθi+1 −

i∑j=1

θj

),

and the expectation of Ui is

E1[Ui] =N2

( i∑j=1

∑k 6=j

θjθk∆j,k−iθi+1

∑k 6=i+1

θk∆i+1,k

)+KN2

2θ(K)

(iθi+1−

i∑j=1

θj

)+o(N2).

Plugging E1[Ui], Var1(Ui) and Var(Ui) into the definition of ARE(TtKW , TKW ) and usingLemma 2 and Lemma 3, we obtain the ARE of TtKW versus TKW given by the followingtheorem.

Page 9: arXiv:2110.05368v3 [stat.ME] 13 Oct 2021

9

Theorem 2 Under model (3) and that N →∞, then the ARE can be derived as

(5) ARE(TtKW , TKW ) =

K−1∑i=1

{(∑i

j=1 θj+iθi+1)2∆i+Kθ(K)(iθi+1−∑i

j=1 θj)/2}2

θ2i(i+1)(1/3+1−θ)

K−1∑i=1

{(∑i

j=1 θj+iθi+1)2∆i+K(iθi+1−∑i

j=1 θj)/2}2

i(i+1)(θ2/3+1−θ)

+ o(1).

3.3. Asymptotic relative efficiency for zero-Beta distributions. We consider the casewhere the nonzero functions fks are Beta distributions with parameters αk and β = 1, inwhich case the effect size ∆j,k can be calculated and the ARE can be expressed by αk’sand θk’s. Since compositional data are always between 0 and 1, Beta distribution provides areasonable parametric model for such data.

Given the Beta distribution for each sample, note that P (Xi = Xk) = 0 as they are con-tinuous random variables. Hence ∆i,k = P (X(i) < X(k)) − 1/2. If X ∼ Beta(α,1), thenXα ∼ Unif(0,1), therefore

P (Xi <Xk) = P (Xαi

i <Xαi

k ) = P (U < (Xαk

k )αk/αi), U ∼ Unif(0,1).

Note that Xαk

k ∼ Unif(0,1), so (Xαk

k )αi/αk ∼ Beta(αk/αi,1). Further, for any continuousrandom variable X with range [0,1], there is

P (U <X) =

∫ 1

0(1− FX(u))du=E[X].

This leads to

∆i,k = P (Xi <Xk)− 1/2 =αk/αi

αk/αi + 1=

αkαi + αk

− 1/2, 1≤ i, k ≤K.

When combining with (4), we have

E1[r0i |n1, · · · , nK ] = ni

{(ni + 1)/2 +

∑k 6=i

nkαk

αi + αk

},

E1[s0i |n1, · · · , nK ] = E1[r0

i ]−1 +

∑Kk=1 nk

2ni = ni

∑k 6=i

nk

(αk

αi + αk− 1/2

).

Plugging these equalities to (5) gives the closed-form expression for ARE(TtKW , TKW ).To demonstrate the gain in efficiency, we calculate ARE(TtKW , TKW ) for 5 group com-

parisons (K = 5) in two scenarios: (a) αi = i, 1≤ i≤ 5. θ1 = · · ·= θK = θ, where θ rangesfrom 0.1 to 1 with a step size 0.01; (b) αi = i, 1≤ i≤ 5. θ = (0.2,0.15,0.3,0.1,0.25) + d,where d ranges from 0 to 0.7 with a step size 0.01. The results are shown in Figure 3. In bothcases, we observe a high asymptotic relative efficiency using the truncated Kruskal-Wallistest statistic when compared to the original Kruskal-Wallis test.

4. A truncated Kruskal-Wallis multi-group test with unequal sample sizes. We nowconsider the setting where we have multiple samples x1,x2, · · · ,xK , all containing non-negative independent observations from populations 1,2, · · · ,K , respectively. Each samplexi = (x1(1), · · · , x1(Ni)) contains Ni observations sampled from the two-part model reft-wopart. We consider the case that N1, N2, · · · , NK are not necessarily equal.

Consider the standard Kruskal-Wallis statistic first. Let ri(j) be the rank of the jth obser-vation in the ith sample among all the observations. Define Si =

∑Ni

j=1 ri(j)− (∑K

i=1Ni +

Page 10: arXiv:2110.05368v3 [stat.ME] 13 Oct 2021

10

Fig 3: Asymptotic relative efficiency. ARE(TtKW , TKW ) as a function of (a) θ when θ1 =· · ·= θK = θ, and (b) θ(K) when not all θis are equal.

1)Ni/2 and Yi =∑i

j=1(Ni+1Sj −NjSi+1), 1≤ i≤K − 1, where Yi is also related to thesample size. Then, Kruskal-Wallis statistic is defined as

TKW =12∑K

j=1Nj + 1

K−1∑i=1

Y 2i

Ni+1(∑i

j=1Nj)(∑i+1

j=1Nj)(∑K

j=1Nj), 1≤ i≤K − 1.

Each term follows a χ21 distribution asymptotically and is independent with each other, so the

statistic follows χ2K−1 distribution under null hypothesis.

To account for zeros, for sample i, there are ni non-zero elements, and the correspondingnon-zero ratio is pi = ni/Ni. Let p= max

1≤i≤Kpi, and keep the bpNic largest observations only

for sample i as the truncated data. All the removed entries are zeros. Let ri be the rank-sumfor the truncated sample i, and

si = ri −bp∑K

i=1Nic+ 1

2bpNic.

We define the statistic Ui as the counterpart of Yi in standard Kruskal-Wallis statistic,

Ui =

i∑j=1

(Ni+1sj −Njsi+1

), 1≤ i≤K − 1.

Then, a natural test statistic based on truncation is

(6) TtKW =

K−1∑i=1

U2i

Var(Ui),

where Var(Ui) is the variance of Ui under the null. This is calculated by noting thatVar(Ui) = Var(E[Ui|p1, · · · , pK ]) + E[Var(Ui|p1, · · · , pK)]. Lemma 4 in Supplemental Ma-terials shows that

Var[Ui|p1, · · · , pK ] =1

12E

{(

K∑k=1

nk + 1)

[ K∑k=1

nk[N2i+1

i∑j=1

nj + ni+1(

i∑j=1

Nj)2]

Page 11: arXiv:2110.05368v3 [stat.ME] 13 Oct 2021

11

−(ni+1

i∑j=1

Nj −Ni+1

i∑j=1

nj)2

]},

E(Var[Ui|p1, · · · , pK ]) =θ2

12Ni+1(

i∑j=1

Nj)(

i+1∑j=1

Nj)(

K∑j=1

Nj)[(

K∑j=1

Nj)θ+ 3− 2θ].

However, when the sample sizes are not equal, it is difficult to evaluate Var[Ui|p1, · · · , pK ].Instead we approximate this under the null hypothesis by assuming that the non-zero proba-bility θ is the average of the empirical non-zero probability among all samples and by simu-lations since Var[Ui|p1, · · · , pK ] only depends on the non-zero proportions, but not of f .

Finally, in order to prove that TtKW has an asymptotic distribution of χ2K−1, we show that

(E[Ui])2 is much smaller than Var(Ui) so that E[Ui]/

√Var(Ui) is approximately 0, and the

correlation between either two terms is asymptotically 0. Combining upper bound of E[Ui]given in Lemma 3 and the lower bound for the variance term given in Lemma 4, we can showthat E[Ui]/

√Var(Ui)→ 0. Using the upper bound for the covariance in Lemma 5, we see

that when max1≤i≤KNi/N2(1)→ 0 and N(1)→∞, we have

|Cor(Ui,Uj)|= |Cov(Ui,Uj)√

Var(Ui)Var(Uj)| ≤C

√√√√(

K∑k=1

1

Nk)≤ C

√K√

N(1)

→ 0,1≤ i, j ≤K − 1.

Therefore, as long as the sample sizes are at the same order and go to infinity, (Ui/Var(Ui), i=1, · · · ,K) asymptotically has a multivariate normal distribution with zero mean and identitycovariance matrix. We leave the details of these lemmas in the Supplemental Materials.

5. Simulations. To further verify the gain in efficiency in using the proposed truncatedrank-based tests, we present simulation studies to evaluate the proposed tests and to com-pare with the standard Wilcoxon rank-sum and Kruskal-Wallis test statistics. In each of thesimulation setups, we perform the following steps.

1. Given parameters (K,N,θ,α,β, s), where N , θ, α and β are all K × 1 arrays, generateX from the distribution

Xi(j)i.i.d∼ (1− θi)δ0 + θiBeta(αi, βi), 1≤ i≤K,1≤ j ≤Ni.

2. Calculate the p-value from usual rank tests, including the Wilcoxon rank-sum test fortwo-sample test, and Kruskal-Wallis test for more groups, and our proposed tests.

3. Repeat steps 1-2 for M times, and calculate the power or type I error for a given signifi-cance level α.

5.1. Simulation 1 - evaluation of Type I error. We first evaluate the type I errors of var-ious test statistics proposed in this paper and compare them to Wilcoxon rank-sum test andKruskal-Wallis test. To simulate data from the null distribution, for each sample i, we simu-late data from a two-part model,

Xi(j)i.i.d∼ (1− θ)δ0 + θBeta(a, b),

where a= b= 2, θ = 0.5. We consider three different scenarios :

(a) K = 2, sample sizes = (0.65,1)×N ;(b) K = 3, equal sample sizes N ;(c) K = 3, sample sizes= (0.7,1,1.5)×N .

Page 12: arXiv:2110.05368v3 [stat.ME] 13 Oct 2021

12

For each scenario, take N = {30,60,100,300,600,900}, and simulate 100,000 test statisticsfor each choice of N . The empirical Type I errors are summarized in Table 1 for significancelevel α ∈ {0.05,0.01,0.001}. In general, we observe that the proposed tests have correct TypeI errors, especially when the sample sizes are large. For small sample sizes, the proposed testsare slightly anti-conservative, indicating the asymptotic approximation of the test statisticsmay require relatively large sample sizes. In practice, when the sample sizes are small, onecan obtain more accurate p-values based on permutations.

TABLE 1Comparison of empirical Type I errors for two-group test with unequal sample sizes and three-group tests with

equal and unequal sample sizes. TW : Wilcoxon rank-sum test; TtW : truncated Wilcoxon rank-sum test; TKW :Kruskal-Wallis test; TtKW : truncated Kruskal-Wallis test.

Two-group - unequal sample sizesα-level (20, 30) (39, 60) (65, 100) (195, 300) (390, 600)

0.05TW .049 .049 .050 .049 .049TtW .054 .051 .051 .051 .050

0.01TW .009 .010 .010 .010 .010TtW .014 .013 .012 .011 .011

0.001TW 6.8× 10−4 9.3× 10−4 9.1× 10−4 1.2× 10−3 9.9× 10−4

TtW 2.2× 10−3 1.9× 10−3 1.8× 10−3 1.6× 10−3 1.2× 10−3

Three-group - equal sample sizes30 60 100 300 600

0.05TKW .049 .048 .049 .050 .050KMm .056 .053 .053 .052 .051

0.01TKW .009 .009 .010 .010 .010TtKW .015 .013 .013 .012 .011

0.001TKW 6.7× 10−4 8.7× 10−4 1.1× 10−3 9.9× 10−4 1.0× 10−3

TtKW 2.5× 10−3 2.0× 10−3 1.9× 10−3 1.6× 10−3 1.2× 10−3

Three-group - unequal sample sizes(21, 30, 45) (42, 60, 90) (70, 100, 150) (210, 300, 450) (420, 600, 900)

0.05TKW .047 .050 .049 .049 .049TtKW .055 .056 .054 .051 .051

0.01TKW .009 .010 .010 .010 .010TtKW .014 .014 .013 .012 .011

0.001TKW 6.4× 10−4 8.1× 10−4 1.0× 10−3 8.9× 10−4 8.6× 10−4

TtKW 2.5× 10−3 2.1× 10−3 1.9× 10−3 1.4× 10−3 1.3× 10−3

5.2. Simulation 2 - power comparisons. We next evaluate the power of the proposedtests. We consider three different models with K = 3 groups and repeat the simulationM = 10000 times. For the first model, we assume that the proportions of zeros are thesame across different groups and examine how the distribution of the non-zero observa-tion affects the power of the proposed tests. We set the θ = (0.5,0.5,0.5) and equal sam-ple sizes for the three groups, ranging from 100 to 2000 with a step size 100. We setα ∈ {(1.5,2,2.5), (1.7,2,2.3)}, and β = (2,2,2) and calculate the power function for dif-ferent sample sizes N . The resulting power curves are shown in the top row of Figure 4. Weobserve a substantial gain in power from the truncated Kruskal-Wallis test.

For the second model, we study how different proportions of zeros affect the test poweras the sample size N changes. We set (α,β) = {(1/2,1/2,1/2), (2,2,2)}, which assumesthat the nonzero distributions are the same across different groups. We again assume that thesample sizes are the same for all three groups, ranging from 20 to 300 with a step size 20. Weset θ ∈ {(0.1,0.15,0.22), (0.15,0.20,0.27)}, and calculate the power function for different

Page 13: arXiv:2110.05368v3 [stat.ME] 13 Oct 2021

13

500 1000 1500 2000

0.0

0.4

0.8

Sample Size

Pow

er

500 1000 1500 2000

0.0

0.4

0.8

Sample Size

Pow

er

50 100 150 200 250 300

0.0

0.4

0.8

Sample Size

Pow

er

50 100 150 200 250 300

0.0

0.4

0.8

Sample Size

Pow

er

500 1000 1500 2000

0.0

0.4

0.8

Sample Size

Pow

er

500 1000 1500 2000

0.0

0.4

0.8

Sample Size

Pow

er

(a) (b)

(c) (d)

(f)(e)

Fig 4: Power curves for the truncated Kruskal-Wallis (solid line) and Kruskal-Wallistest (dashed line) as a function of the sample size N . (a)-(b ): a two-part modelwith θ = (0.5,0.5,0.5), α = (1.5,2,2.5), β = (2,2,2) and (b )α = (1.7,2,2.3),β = (2,2,2). (c)-(d): a two-part model with (α,β) = {(1/2,1/2,1/2), (2,2,2)}, θ ∈{(0.10,0.15,0.22), (0.15,0.20,0.27)}. (e)-(f): a two-part model with θ = (.4, .5, .6) andα ∈ {(1.5,2,2.5), (1.7,2,2.3)}, and β = (2,2,2).

sample sizes N . The resulting power curves are shown in the middle row of Figure 4. Ourtest has some improvement over the original test. However, in this case, the improvement isnot as large as when fis are different.

Page 14: arXiv:2110.05368v3 [stat.ME] 13 Oct 2021

14

For the last model, we evaluate the proposed tests when the sample sizes are different indifferent groups. We set θ = (0.4,0.5,0.6) and the sample size as N × (0.8,1,1.5), whereN ranges from 100 to 2000 with a step size 100. We choose α ∈ {(1.5,2,2.5), (1.7,2,2.3)}and β = (2,2,2), and calculate the power function for different sample sizes N . The powercurves are shown in the bottom row of Figure 4, showing that the truncated Kruskal-Wallistest has much higher power than the Kruskal-Wallis test when there are ties.

6. Identifying the Crohn’s disease-associated bacterial genera and the effects oftreatment. Crohn’s disease, a chronic inflammatory bowel disease, is characterized by al-tered composition of the gut microbiota or dysbiosis. The etiology and clinical significanceof the dysbiosis is unknown. In a recent study at the University of Pennsylvania, the compo-sition of the gut microbiota among a cohort of 85 children with Crohn’s disease who wereinitiating therapy with either a defined formula diet (n=33) or an anti-tumor necrosis fac-tor α (anti-TNF) drug (n=52) was examined in order to better understand the cause of thedysbiosis in Crohn’s disease. Fecal samples were collected at baseline, 1, 4, and 8 weeksand DNA content characterized by shot-gun genomic sequencing with >0.5 tetra-bytes oftotal sequences (Lewis et al., 2015). The gut microbiota data of 26 normal children with noknown gastrointestinal disorders were similarly collected. The MetaPhlAn program (Segataet al., 2012) was applied to first obtain the relative proportions of the bacteria genera for eachof the normal and Crohn’s disease samples. The number of bacterial genera called by theMetaPhlAn program varies from sample to sample. Altogether, 60 bacterial genera were ob-served in both normal samples and the Crohn’s disease samples. Figure 1 shows the relativeabundances of the 60 genera in all the samples, where large proportions of zeros are observedfor many of the genera.

6.1. Comparison of gut microbiome between normal and Crohn’s disease patients. Weare interested in identifying the bacterial genera that show different distribution between nor-mal and Crohn’s disease children. At an α level of 0.05, the truncated Wilcoxon rank-sumtest identified 23 genera that showed different distributions between normal and Crohn’s dis-eases, while the standard Wilcoxon test identified 20, among these, 19 were identified by bothmethods. At FDR of 10%, the truncated Wilcoxon rank-sum test identified 21 genera and thestandard Wilcoxon test identified the same 20 genera. Four genera, including Eggerthella,Lactobacillus, Gemella, and Rothia were identified only by the truncated Wilcoxon rank-sum test. Figure 5 shows the the proportions of zeros and boxplots of these four genera thatclearly show difference in abundances in normal and Crohn’s disease children, both in termsof proportion of zeros and the median of non-zero abundances. All four genera had higherabundances in Crohn’s patients than normal controls. Among these genera, Eggerthella, Lac-tobacillus, both are anaerobic, non-sporulating, Gram-positive bacilli, have been reported tobe associated with clinically significant bacteraemia and Crohn’s disease (Lau et al., 2004).In contrast, for genus Methanobrevibacter, the standard Wilcoxon test had a sightly smallerp-value. However, 98% of the disease individuals did not carry this genus.

6.2. Comparison of gut microbiome across time after treatment. We next aim to identifythe bacterial genera that show changes of relative abundances across the four time pointsduring the anti-TNF treatment. We applied our truncated Kruskal-Wallis test statistic withwithin-subject permutations (100,000 permutations) and identified 7 genera with changesof abundances during the treatment with p < 0.05. As a comparison, the Friedman’s testidentified only three. Figure 6 shows the boxplots of the abundances of the genera across 4time points for the four genera identified only by our truncated Kruskal-Wallis test, includingBacteroides, Roseburia, Eubacterium and Bilophila. Among these, Bacteroides, Eubacterium

Page 15: arXiv:2110.05368v3 [stat.ME] 13 Oct 2021

15

Fig 5: Proportions of zeros and the box plots of the non-zero relative abundances for thefour bacterial genera that were identified by the truncated Wilcoxon rank-sum test only for anominal α level of 0.05, where p-values from TW and TtW tests are shown as pS and ps. Thegenus Methanobrevibacterm was identified only by the standard Wilcoxon rank-sum test.

and Bilophila showed increased abundances at 8 weeks after the anti-TNF treatment. Interest-ingly, all three genera have been shown to have reduced abundance in Crohn’s disease patients(Gevers et al., 2014). Reduction of Roseburia, a well-known butyrate-producing bacteriumof the Firmicutes phylum, has been consistently demonstrated to be associated with Crohn’sdisease (Machiels et al., 2014). There has been evidence that the gut bacteria in patients withinflammatory bowel disease do not make butyrate, and that they have low levels of the fattyacid in their gut (Sartor and Mazmanian, 2012). The decreased abundance of Bilophila, maytranslate into a reduction of commensal bacteria-mediated, anti-inflammatory activities in themucosa, which are relevant to the pathophysiology of Crohn’s disease. The result shows theeffect of anti-TNF treatment in increasing the relative abundances of Roseburia, Bacteroidesand Bilophila, therefore potentially increase level of fatty acid butyrate and anti-inflammatoryactivities. This partially explains that 50% of the patients showed clinical improvement, asreflected by reduction of fecal calprotection below 250 mcg/g.

7. Discussion. Motivated by comparing the distributions of taxa composition in differ-ent groups in microbiome studies, we have developed several extensions of the popular rank-based tests to account for clumps of zeros, including truncated rank-based Wilcoxon andKruskal-Wallis tests for two- or multiple-group comparisons. These tests are nonparametricand are easy to implement. By using within-sample permutations, such tests can also be ap-plied to paired samples or repeated measurements analysis. We have shown that the proposedtests have better power than the standard rank-based tests, both by asymptotic relative effi-ciency analysis and by simulations. We observed a large gain in power when the proportionsof zeros in the data sets are high as compared to the standard rank-based tests due the highnumber of tied ranks. Hallstrom (2010) showed that the truncated Wilcoxon rank-sum test

Page 16: arXiv:2110.05368v3 [stat.ME] 13 Oct 2021

16

0 1 4 8Bacteroides

0.00

0.02

0.04

Per

cent

age

of Z

eroe

s

0 4

040

80

pkw =.036, pKW =.21

Rel

ativ

e A

bund

ance

0 1 4 8Eubacterium

0.00

0.10

0.20

0 4

020

4060

pkw =.047, pKW =.142

0 1 4 8Roseburia

0.00

0.15

0.30

●●

●●●

0 4

020

40

pkw =.034, pKW =.141

0 1 4 8Bilophila

0.0

0.2

0.4

0.6

0 4

0.0

1.0

2.0

3.0

pkw =.01, pKW =.079

Fig 6: Bar plots of proportions of zeros (top row) and box plots of the relative non-zeroabundances (bottom row) over four time points during anti-TNF treatment for four bacteriagenera that were identified only by the truncated Kruskal-Wallis test.

has equal or better power than the test of two-degree of freedoms based on two-part modelwhen the sample sizes are equal. Although such tests can be extended to two-part modelfor multiple group comparisons, it has not been studied in literature. It would be interestingto compare the performance of our proposed truncated Kruskal-Wallis tests with other testsbased on the two-part models.

We have demonstrated the applications of the proposed tests to analysis of real metage-nomic data sets. Our results have shown that the truncated rank-based tests are effective andidentify more bacterial genera that are associated with the clinical phenotypes and treatment.As observed in other studies (Wagner, Robertson and Harris, 2011), tests that account forclumps of zeros can be more powerful in testing abundance difference in microbiome stud-ies. We have demonstrated that some well-known Crohn’s disease-associated bacterial generacan be missed by using the standard rank-based tests without adjusting for excessive zeros.We expect to see more applications of the proposed tests in microbiome studies.

Acknowledgment. This research was supported by NIH grants GM129781 and GM123056.

SUPPLEMENTARY MATERIALS. The online Supplemental Materials include proofsof Theorem 1, Theorem 2 and all the lemmas. It also includes detailed derivations ofthe proposed test statistics. An R package of the proposed method is available on github,https://rdrr.io/github/chvlyl/ZIR/. The source codes for these methods canalso be found in the Supplemental Materials.

REFERENCES

GEVERS, D., KUGATHASAN, S., DENSON, L. A. and ET AL (2014). The treatment-naive microbiome in new-onset Crohn’s disease. Cell Host & Microbe 15 382–392.

HALLSTROM, A. (2010). A modified Wilcoxon test for non-negative distributions with a clump of zeros. Statisticsin Medicine 29 391–400.

HUSON, D. H., AUCH, A. F., QI, J. and SCHUSTER, S. C. (2007). MEGAN analysis of metagenomic data.Genome research 17 377–386.

Page 17: arXiv:2110.05368v3 [stat.ME] 13 Oct 2021

17

LACHENBRUCH, P. (1976). Analysis of Data with Clumping at Zero. Biometrische Zeitschrift 18 351–356.LACHENBRUCH, P. (2001). Comparison of two-part models with competitors. Statistics in Medicine 20 1215–

1234.LAU, S., WOO, P., FUNG, A., CHAN, K., WOO, G. and YUEN, K. (2004). Journal of Medical Microbiology 53

1247–1253.LEWIS, J., CHEN, E. Z., BALDASSANO, R. N., OTLEY, A. R., GRIFFITHS, A. M., LEE, D., BITTINGER, K.,

BAILEY, A., FRIEDMAN, E. S., HOFFMANN, C., ALBENBERG, L., SINHA, R., COMPHER, C., GILROY, E.,NESSEL, L., GRANT, A., CHEHOUD, C., LI, H., WU, G. D. and BUSHMAN, F. D. (2015). Inflammation,Antibiotics, and Diet as Environmental Stressors of the Gut Microbiome in Pediatric Crohn’s Disease. CellHost & Microbe 18 489-500.

MACHIELS, K., JOOSSENS, M., SABINO, J., DE PRETER, V., ARIJS, I., EECKHAUT, V., BALLET, V.,CLAES, K., VAN IMMERSEEL, F., VERBEKE, K., FERRANTE, M., VERHAEGEN, J., RUTGEERTS, P. andVERMEIRE, S. (2014). A decrease of the butyrate-producing species Roseburia hominis and Faecalibacteriumprausnitzii defines dysbiosis in patients with ulcerative colitis. Gut 63(8) 1275–1283.

MANICHANH, C., BORRUEL, N., CASELLAS, F. and GUARNER, F. (2012). The gut microbiota in IBD. NatureReviews Gastroenterology and Hepatology 9 599–608.

QIN, J., LI, R., RAES, J., ARUMUGAM, M., BURGDORF, K. S., MANICHANH, C., NIELSEN, T., PONS, N.,LEVENEZ, F., YAMADA, T. et al. (2010). A human gut microbial gene catalogue established by metagenomicsequencing. Nature 464 59–65.

QIN, J., LI, Y., CAI, Z., LI, S., ZHU, J., ZHANG, F., LIANG, S., ZHANG, W., GUAN, Y., SHEN, D. et al.(2012). A metagenome-wide association study of gut microbiota in type 2 diabetes. Nature 490 55–60.

SARTOR, R. and MAZMANIAN, S. (2012). Intestinal Microbes in Inflammatory Bowel Diseases. American Jour-nal of Gastroenterology (Supplements) 1 15-21.

SEGATA, N., WALDRON, L., BALLARINI, A., NARASIMHAN, V., JOUSSON, O. and HUTTENHOWER, C.(2012). Metagenomic microbial community profiling using unique clade-specific marker genes. Nature Meth-ods 9 811–814.

TURNBAUGH, P. J., LEY, R. E., MAHOWALD, M. A., MAGRINI, V., MARDIS, E. R. and GORDON, J. I. (2006).An obesity-associated gut microbiome with increased capacity for energy harvest. Nature 444 1027–131.

TURNBAUGH, P. J., LEY, R. E., HAMADY, M., FRASER-LIGGETT, C. M., KNIGHT, R. and GORDON, J. I.(2007). The human microbiome project. Nature 449 804–810.

WAGNER, B., ROBERTSON, C. and HARRIS, J. (2011). Application of Two–Part Statistics for Comparison ofSequence Variant Counts. PloS One 6(5) e20296.

Page 18: arXiv:2110.05368v3 [stat.ME] 13 Oct 2021

18

SUPPLEMENTARY MATERIALS

APPENDIX A: BASIC PROPERTIES

Suppose there is a sample with size n. Rank all the observations so that the smallest ob-servations have largest ranks. Let ri and rj denote the ranks of two randomly selected ob-servations. We have the results for the expectation, variance and covariance as follows. Thecalculations are also presented here although its basic in statistics.

Property 1. E[ri] = n+12 .

Property 2. E[r2i ] = (n+1)(2n+1)

6 .Proof. E[r2

i ] = 1n

∑ni=1 i

2 = (n+1)(2n+1)6 .

Property 3. Var[ri] = n2−112 .

Proof. Note that Var(ri) = E[r2i ]− (E[ri])

2. According to Properties 1–2, the variance is

Var(ri) =1

n

n∑i=1

i2 − (n+ 1

2)2 =

n2 − 1

12.

Property 4. Cov[ri, rj ] =−n+112 .

Proof. When i 6= j, we have

E[rirj ] =1

n(n− 1)

∑i 6=j

ij =(n+ 1)(3n+ 2)

12.

Therefore,

Cov(ri, rj) = E[rirj ]− [E(ri)]2 =

(n+ 1)(3n+ 2)

12− (

n+ 1

2)2 =−n+ 1

12.

Property 5. Consider a randomly selected group of observations S with size n1. The rank-sum of this group is denoted by rg . Then,

E[rg] =1 + n

2n1, Var(rg) =

n1(n− n1)(n+ 1)

12.

Proof. For the expectation part, it simply introduces in E[ri] for each observation in thisgroup. Next we check the variance.

Var(rg) =∑i∈S

Var(ri) +∑

i,j∈S,i6=jCov(ri, rj)

=∑i∈S

n2 − 1

12+

∑i,j∈S,i6=j

−n+ 1

12

=n1(n− n1)(n+ 1)

12.

APPENDIX B: DERIVATIONS OF THE MODIFIED WILCOXON RANK SUMSTATISTIC

For a given statistic S, we use E[S] and Var(S) to represent its mean and variance calcu-lated under the null hypothesis and use E1[S] to represent its mean calculated under the alter-native hypothesis. In this section, given that W = S2/Var(S) where S =R− N1+N2+1

2 N1,and s= r− bp(N1+N2)c+1

2 bpN1c − 14p1+p2

2 (1− p1+p22 )(N2 −N1), we want to verify the fol-

lowing results:

Page 19: arXiv:2110.05368v3 [stat.ME] 13 Oct 2021

19

• E[S] = 0, E[s] =O(max{N1,N2});• the results for Var(S) and Var(s):

Var(S) =N1N2(N1 +N2)θ(1− θ+ θ2/3)/4 +O(N2.5),

Var(s) =N1N2(N1 +N2)θ3(1− θ+ 1/3)/4 +O(N2.5);

Proof. Now we prove the results one by one.

• We show the expectation first. The result for E[S] is direct, so we focus on E[s].Recall that p = max{p1, p2}, where p1 = n1/N1 and p2 = n2/N2. Let r = r −

bp(N1+N2)c+12 bpN1c, and so s= r− 1

4p1+p2

2 (1− p1+p22 )(N2 −N1). To find E[s], we need

to calculate E[r] and E[14p1+p2

2 (1− p1+p22 )(N2 −N1)].

With independent samples, p1 and p2 are independent, E[pi] = θi and Var(pi) = θi(1−θi)/Ni), i= 1,2. Under null hypothesis, θ1 = θ2 = θ. We then have

E

[1

4

p1 + p2

2(1− p1 + p2

2)(N2 −N1)

]=N2 −N1

4

[(θ1 + θ2

2)− (

θ1 + θ2

2)2 − θ1(1− θ1)

4N1− θ2(1− θ2)

4N2

](Let θ =

θ1 + θ2

2) =

N2 −N1

4(θ(1− θ) +O(

1

N1+

1

N2)).(7)

Here, θ is the probability of zero under null and θ = (θ1 + θ2)/2 under alternative hypoth-esis.

Next we calculate E[r]. Recall that r is the sum-rank of the truncated sample 1. Let r0

be the sum-rank of all the non-zero elements in the truncated sample 1. With r0, we rewriter as

r = r0 −n1 + n2 + 1

2n1 +

n1 + n2 + 1 + bp(N1 +N2)c2

(bpN1c − n1)

−1 + bp(N1 +N2)c2

bpN1c+n1 + n2 + 1

2n1

= (r0 −n1 + n2 + 1

2n1) +

1

2pN1N2(p2 − p1) +O(max{N1,N2}),

where O(N2) comes from the difference between bpN1c (bpN1c) and pN1 (pN2).Since all the observations are non-negative, so the non-zero elements always enjoy the

ranks from 1 to n1 +n2. Hence, under null hypothesis that f = g, E[r0− n1+n2+12 n1] = 0.

Therefore,

(8) E[r|p1, p2] =pN1N2

2(p2 − p1) +O(max{N1,N2}).

We still need to remove the condition that p1 and p2 are given. The following lemma isused to describe the distribution of p(p2 − p1).

Lemma 1 Define p1 = n1/N1, p2 = n2/N2, and p= max{p1, p2}. Under null hypothesisthat θ1 = θ2 = θ, F =G, and let N1,N2→∞ at the same order, there is

E[p(p2 − p1)/2] =θ(1− θ)

4

N1 −N2

N1N2,

E[p2(p2 − p1)2] = θ3(1− θ)(1/N1 + 1/N2) +O(N−3/21 ).

Page 20: arXiv:2110.05368v3 [stat.ME] 13 Oct 2021

20

Combining Lemma 1 with (8) leads to

E[r] =θ(1− θ)

4(N1 −N2) +O(max{N1,N2}).

Combining this with (7), we have that

E[s] =O(max{N1,N2}).

The first conclusion is proved.• Next we want to check Var(S) and Var(s).

Note that s= r− 14p1+p2

2 (1− p1+p22 )(N2 −N1) = r− I . Therefore,

(9) Var(s) = Var(r) + Var(I)− 2 Cov (() r, I).

Check the terms one by one.The first term is Var(r). Note that we already find E[r|p1, p2] = pN1N2(p2 − p1)/2 +

O(max{N1,N2}). Therefore,

Var(r) = Var(E[r|p1, p2]) + E[Var(r|p1, p2)]

=N2

1N22

4{E[(p2 − p1)2p2]− (E[(p2 − p1)p])2}+

N1N2

12E[N1p

21p2 +N2p1p

22 + p1p2].

With Lemma 1 we have that

E[(p2 − p1)2p2] = θ3(1− θ)(1/N1 + 1/N2) +O(N−3/21 ),

(E[(p2 − p1)p])2 = θ2(1− θ)2(N1 −N2)2/(4N21N

22 ) =O(N−2

1 ).

Combining withE[N1p21p2 +N2p1p

22 +p1p2] = (N1 +N2)θ3 +O(1) from basic statistics,

we have that

(10) Var(r) =N1N2(N1 +N2)θ3(1− θ+ 1/3)/4 +O(√N1N

22 ).

For the second term, note that

Var(I)≤E[I2]≤ (N2 −N1)2E[(1

4

p1 + p2

2(1− p1 + p2

2))2]≤ (N2 −N1)2 =O(N2),

and

Cov (() r, I)≤√

Var(r),Var(I)≤ (N2 −N1)√N1N2(N1 +N2) =O(N5/2).

Introduce the result and (10) into (9), we have(11)Var(s) = Var(r)+Var(I)−2 Cov (() r, I) =N1N2(N1 +N2)θ3(1−θ+1/3)/4+O(N2.5).

Similarly, we have that Var(S) =N1N2(N1 +N2)θ(1− θ+ θ2/3)/4 +O(N2.5).

APPENDIX C: PROOF OF THEOREM 1 IN MAIN PAPER

Recall that the definition of ARE(Wm,W ) is

ARE(Wm,W ) =E[s2]/Var(s)

E1[S2]/Var(S),

where E1 denote the expectation under alternative hypothesis, and Var(s) means the vari-ance of s under null hypothesis. With previous analysis, Var(s) and Var(S) are known. Theremaining problem is E1[s2], for which we can derive E1[s] and Var1(s) separately, and thecombine them together. Here, Var1(s) denote the variance of s under alternative hypothesis.

Page 21: arXiv:2110.05368v3 [stat.ME] 13 Oct 2021

21

Recall that we rewrite s= r− I , where r can be rewritten as

r = [r0 −n1 + n2 + 1

2n1] +

1

2pN1N2(p2 − p1) +O(N2).

Since r0 depend on the non-zero part only, so we only consider the non-zero elements of xand y. Let x(j) and y(j) denote the jth non-zero observation in x and y, respectively. Therank of x(j) equals to

rj =1

2+

n1∑m=1

1{x(j)< x(m)}+n2∑m=1

1{x(j)< y(m)}+ 1

2

n1∑m=1

1{x(j) = x(m)}+ 1

2

n2∑m=1

1{x(j) = x(m)}.

Note that E[1{x(j) < x(m)}] = P (X1 < X2) when X1,X2i.i.d.∼ f , and E[1{x(j) <

y(m)}] = P (X < Y ) when X ∼ f,Y ∼ g independently. Therefore, we have

E1[rj |n1, n2] = 1/2 + n1P (X1 <X2) + n2P (X < Y ) +n1

2P (X1 =X2) +

n2

2P (X = Y ).

Note that P (X1 =X2)/2 + P (X1 <X2) = 1/2 since they follow the same distribution andP (X = Y )/2 + P (X < Y ) = ∆f,g + 1/2 according to the definition, plug them in and wehave that

E1[r0|n1, n2] = E1

[ n1∑j=1

r1j

]= n1

[(n1 + 1)/2 + n2[∆f,g + 1/2]

].

Introduce it into the definition of r, we have

E1[r|n1, n2] = E1[r0|n1, n2]− n1 + n2 + 1

2n1 +

1

2pN1N2(p2−p1) = n1n2∆f,g+

1

2pN1N2(p2−p1).

Further, let N1,N2→∞, and note that n1 ∼ Binomial(N1, θ1), n2 ∼ Binomial(N2, θ2),we have that

E1[r] = θ1θ2N1N2∆f,g +1

2θ(m)N1N2∆θ,

For the last term I , note that |E1[I]|<N2 −N1 =O(N). Combining all these terms, wehave that

(12) E1[s] = E1[r]−E[I] = θ1θ2N1N2∆f,g +1

2θ(m)N1N2∆θ+O(N).

Similarly, we have the result for S as

E1[S] = E1

[E1[r0|n1, n2] +

n1 + n2 + 1 +N1 +N2

2(N1 − n1)− N1 +N2 + 1

2N1

]= E1

[n1n2∆f,g +

N1n2 − n1N2

2

]= θ1θ2N1N2∆f,g +

1

2N1N2∆θ.(13)

Now we consider the variance under alternative hypothesis. We deal with S first.Note that S is a function of all the observations x and y. To find the variance, we try to

bound P (|S − µ| ≥ t) for any t with concentration inequality. When one observation of xchanges, the largest change of S caused by it is N2. When one observation of y changes, thelargest change of S is N1. Mathematically, we have

|S(x(1), · · · , x(k), · · · , x(N1),y)− S(x(1), · · · , x′(k), · · · , x(N1),y)| ≤N2,

|S(x, y(1), · · · , y(k), · · · , y(N2))− S(x, y(1), · · · , y′(k), · · · , y(N2))| ≤N1.

Page 22: arXiv:2110.05368v3 [stat.ME] 13 Oct 2021

22

According to McDiarmid’s Inequality (?, Page 206), there is

P (|S −E[S]|> t)≤ 2e−2t2

N1N22+N2N2

1 .

Therefore,

Var1(S) = E[|S −E[S]|2] =

∫ ∞0

P (|S −E(S)|2 ≥ t)dt

=

∫ ∞0

P (|S −E(S)| ≥√t)dt

≤∫ ∞

02e

−2t

N1N22+N2N2

1 dt

=N1N2(N1 +N2).

Therefore,

(14) Var1(S)≤N1N2(N1 +N2) =O(N3).

Similarly, we derive the result for s. When one observation of x changes, the change of scannot be larger than the change of S, and so

(15) Var1(s)≤N1N2(N1 +N2) =O(N3).

Combine (12), (13), (14) and (15), with the results about variance in the second bullet, wehave

ARE(Wm,W ) =E1[s2]/Var(s)

E1[S2]/Var(S)=

(E1[s])2 + Var1(s)

(E1[S])2 + Var1(S)

Var(S)

Var(s)

=

(θ1θ2N1N2∆f,g + 1

2θ(m)N1N2∆θ)2

+O(N3)(θ1θ2N1N2∆f,g + 1

2N1N2∆θ)2

+O(N3)· N1N2(N1 +N2)θ(1− θ+ θ2/3)/4 +O(N2.5)

N1N2(N1 +N2)θ3(1− θ+ 1/3)/4 +O(N2.5)

=

(θ1θ2∆f,g + 1

2θ(m)∆θ)2(

θ1θ2∆f,g + 12∆θ

)2 · (1− θ+ θ2/3)

θ2(1− θ+ 1/3)+O(N−1/2).

The result is proved.

APPENDIX D: PROOF OF LEMMA 1 IN MAIN PAPER

Proof. It is easy to verify that E[si] = 0, 1≤ i≤K . Since Ui is a linear combination of si,there is

E[Ui] =

i∑j=1

E[sj ]− iE[si+1] = 0, 1≤ i≤K.

To check the variance, we begin with the variances and covariances of si. Given n1, · · · , nK ,Var(si|n1, · · · , nK) equals to the variance of the sum-rank of non-zero elements, which isalready derived in the basic property section in Supplementary materials. Therefore, with thelaw of total variance, we find that

Var(si) = E

[ni(

K∑j=1,j 6=i

nj)(

K∑j=1

nj + 1)

]/12 + E

[n2(

K∑j=1

nj −Kni)2

]/4,

Cov(si, sj) =−E

[ninj(

K∑k=1

nk + 1)

]/12 + E

[n2(

K∑k=1

nk −Kni)(K∑k=1

nk −Knj)]/4,

Page 23: arXiv:2110.05368v3 [stat.ME] 13 Oct 2021

23

where 1≤ i, j ≤K − 1, i 6= j. From this, we obtain

Var(Ui) =E

{∑Kj=1 nj + 1

12

[(

i∑j=1

nj)ni+1(i+ 1)2 + (

K∑k=i+2

nk)(

i∑j=1

nj + i2ni+1)]}

+E

[K2n2

4(

i∑j=1

nj − ini+1)2

],

Cov(Ui1 ,Ui2) = E

[K2n2(

i1∑k=1

nk − i1ni1+1)(

i2∑j=1

nj − i2ni2+1)

]/4,

where 1≤ i≤K − 1, 1≤ i1 < i2 ≤K − 1 and∑K

k=i+2 nk = 0 when i+ 2>K .Now the remaining term is E

[K2n2(

∑i1k=1 nk− i1ni1+1)(

∑i2j=1 nj− i2ni2+1)

]. We prove

the following lemma.

Lemma 2 Let n1, · · · , nKi.i.d∼ Binomial(N,θ), and n= max1≤i≤K ni. Given n, we have

(16) E

[(

i1∑k=1

nk − i1ni1+1)(

i2∑j=1

nj − i2ni2+1)|n]

= 0, 1≤ i1 < i2 ≤K − 1.

With the lemma, we have that

Cov(Ui1 ,Ui2) = E

[K2n2E[(

i1∑k=1

nk− i1ni1+1)(

i2∑j=1

nj − i2ni2+1)|n]

]/4 = E[K2n2 · 0] = 0.

The result is proved. 2

Proof of Lemma 2:To show (16), note that

E

[(

i1∑k=1

nk − i1ni1+1)(

i2∑j=1

nj − i2ni2+1)|n]

= E

[ i1∑k=1

(nk − ni1+1)

i2∑j=1

(nj − ni2+1)|n]

=

i1∑k=1

i2∑j=1

E

[(nk − ni1+1)(nj − ni2+1)|n

],

for 1 ≤ i1 < i2 ≤K − 1. Therefore, to show the left-hand side is 0, we check the value ofeach single term

E[(nk − ni1)(nj − ni2)|n].

As k ≤ i1 < i2, and 1≤ j ≤ i2− 1, there are 3 cases: 1. j = k; 2. j = i1; 3. j 6= k and j 6= i1.In case 1, we have that

E[(nk − ni1)(nk − ni2)|n] = E[n2k − ni1nk − ni2nk + ni1ni2 |n]

Given n, ninj has the same joint distribution for any i 6= j, and so they share the sameexpectation. What’s more, ni and nj has the same conditional variance given n. Therefore,

E[n2k − ni1nk − ni2nk + ni1ni2 |n]

= (E[nk|n])2 + Var(nk|n)−E[ni1 |n]E[nk|n]−Cov (()ni1 |n,nk|n)

= Var(n1|n)−Cov (()n1|n,n2|n).(17)

Page 24: arXiv:2110.05368v3 [stat.ME] 13 Oct 2021

24

The last equation stands since nk|n and ni1 |n follow the same distribution.In case 2, we have that

E[(nk − ni1)(ni1 − ni2)|n] = E[nkni1 − n2i1 − ni2nk + ni1ni2 |n].

With the same analysis of (17), we have that

(18) E[nkni1 − n2i1 − ni2nk + ni1ni2 |n] = Cov (()n1|n,n2|n)−Var(nk|n).

In case 3, there is

(19) E[(nk − ni1)(nj − ni2)|n] = E[nknj − ni1nj − ni2nk + ni1ni2 |n] = 0.

Combining (17)–(19), and summation over 1≤ k ≤ i1 and 1≤ j ≤ i2, we can find that themean is 0, and (16) follows. 2

APPENDIX E: PROOF OF LEMMA 2 IN MAIN PAPER

In this proof, we use the calculation of Var(UK−1) as an example for simplicity. Thecalculation of Var(Ui) is similar.

We first consider the term under null hypothesis. According to (??), there is an expressionfor Var(UK−1). Decompose Var(UK−1) into two parts,

Var(UK−1)

K2=

E[nK(∑K−1

j=1 nj)(∑K

j=1 nj + 1)]

12+

E[n2(∑K

j=1 nj −KnK)2]

4= I + II.

Part I can be easily calculated by the distribution of nj , and with basic calculations we obtain

(20) I =N3θ3K(K − 1)/12 +O(N2).

For II , we take n = m+Nθ, where m = max{n1 −Nθ,n2 −Nθ, · · · , nK −Nθ}. Thenwe have

4 · II = E[(m+Nθ)2(

K∑j=1

nj −KnK)2]

=N2θ2E[(

K∑j=1

nj −KnK)2] + 2NθE[m(

K∑j=1

nj −KnK)2] + E[m2(

K∑j=1

nj −KnK)2]

= IIa+ IIb+ IIc.

With basic calculations and that (∑K

j=1 nj − KnK)/N is asymptotically distributed as anormal distribution with mean 0 and variance K(K − 1)θ(1− θ), we have that

(21) IIa=N3θ3K(K − 1)(1− θ) +O(N2).

For IIb and IIc, according to Cauchy-Schwarz inequality, we have that

(22) IIb≤ 2Nθ

√√√√E[m2]E[(

K∑j=1

nj −KnK)4] = 2Nθ√

E[m2]√

3K(K − 1)Nθ(1− θ),

and

(23) IIc≤

√√√√E[m4]E[(

K∑j=1

nj −KnK)4] =√

E[m4]√

3K(K − 1)Nθ(1− θ).

Page 25: arXiv:2110.05368v3 [stat.ME] 13 Oct 2021

25

The unknown part here is E[m2] and E[m4]. For these two terms, we can bound them by

E[m4]≤KE[(n1 −Nθ)4] =O(N2),

E[m2]≤KE[(n1 −Nθ)2] =O(N).

Plugging these into (22) and (23), we have that

(24) IIb.O(N5/2), IIc.O(N2).

Combining (21) and (24), we have that

II =N3θ3K(K − 1)(1− θ)/4 +O(N5/2).

Combining this with (20), and we have that

Var(UK−1) =N3θ3K3(K − 1)(1/3 + (1− θ))/4 +O(N5/2).

The result under null hypothesis is proved.Next, we consider the variances under alternative hypothesis. The analysis is similar with

what we did for the modified Wilcoxon statistics under alternative.Consider Yi first. Yi is a function of x1,x2, · · · ,xK . When one observation change, the

change of Yi is no larger than KN . For Ui, the same thing happens that the change is nolarger than KN . Therefore,

KN∑i=1

(KN)2 =K3N3.

According to McDiarmid’s Inequality (?, Page 206), there is

P (|Yi −E[Yi]|> t)≤ 2e−2t2

K3N3 .

Therefore,

Var1(Yi) = E[|Yi −E[Yi]|2] =

∫ ∞0

P (|Yi −E(Yi)|2 ≥ t)dt

=

∫ ∞0

P (|Yi −E(Yi)| ≥√t)dt

≤∫ ∞

02e

−2t

K3N3 dt

=K3N3.

Therefore,

(25) Var1(Yi)≤K3N3, Var1(Ui)≤K3N3, 1≤ i≤K.

The lemma is proved. 2

APPENDIX F: PROOF OF LEMMA 3 IN MAIN PAPER

In this section, all the proofs are for the expectations under alternative hypothesis. So wedrop the subscriptsof E1 for simplicity in this section only.

With basic calculations, it can be shown that, for 1≤ i≤K − 1,

Yi = (

i∑j=1

s0j − is0

i+1) +NK

2[ini+1−

i∑j=1

nj ], Ui = (

i∑j=1

s0j − is0

i+1) +nK

2[ini+1−

i∑j=1

nj ].

Page 26: arXiv:2110.05368v3 [stat.ME] 13 Oct 2021

26

So, for Yi terms, the problem reduces to find E[NK2 (ini+1 −∑i

j=1 nj)]. With basic calcu-lations, this expectation turns to be N2K

2 (iθi+1 −∑i

j=1 θj). Combining with E[∑i

j=1 s0j −

is0i+1

∣∣n1, · · · , nK]

=∑i

j=1

∑k 6=j njnk∆j,k − ini+1

∑k 6=i+1 nk∆i+1,k, we have that

E[Yi] =N2

[ i∑j=1

∑k 6=j

θjθk∆j,k−iθi+1

∑k 6=i+1

θk∆i+1,k

]+KN2

2

(iθi+1−

i∑j=1

θj

), 1≤ i≤K−1.

So, the first part of the conclusion is proved.To find E[Ui], it also reduces to find E1[nK2 [ini+1−

∑ij=1 nj ]]. Decompose n=Nθ(K) +

m, then we have that

E[Ui] =N2

[ i∑j=1

∑k 6=j

θjθk∆j,k − iθi+1

∑k 6=i+1

θk∆i+1,k

]

+KN2

2θ(K)(iθi+1 −

i∑j=1

θj) + E[mK

2[ini+1 −

i∑j=1

nj ]].

As long as we can show that E[mK2 [ini+1 −∑i

j=1 nj ]] = o(N2), the lemma is proved.With Cauchy-Schwarz Inequality, we have that

(1) E[mK

2[ini+1 −

i∑j=1

nj ]]≤K

2

√√√√E[m2]E[(ini+1 −i∑

j=1

nj)2].

If θ1 = θ2 = · · ·= θK = θ, then θ(K) = θ. What’s more, E[m2]≤∑K

i=1 E[(ni −Nθ)2] =

KNθ(1 − θ) = O(N). The second term that E[(ini+1 −∑i

j=1 nj)2] = i(i + 1)Nθ(1 −

θ) = O(N). Combining the two terms with (1), it can be concluded that E[mK2 [ini+1 −∑ij=1 nj ]]≤O(N). So, the result is proved.On the other hand, if there is some i and j, such that θi 6= θj . Then,

(2) E[(ini+1 −i∑

j=1

nj)2] .O(N2).

For any given N , we decompose the index set S = {1,2, · · · ,K} into S1 = {j : θj − θ(K) ≥−2√

log(N)/N} and S2 = S\S1. Correspondingly, assume that the index that achieves themaximum is h, then the expectation can also be decomposed as following

E[m2] = E[m21{h ∈ S1}] + E[m21{h ∈ S2}] = I + II.

For part I , we have that

(3) I ≤ E[∑j∈S1

n2j ]≤ 4 log(N)N +N/4 =O(N log(N)).

To bound II , assume that θ1 = θ(K) without loss of generality. Then we have that

(4) II ≤∑j∈S2

E[(nj −Nθ(K))21{nj ≥ n1}]≤

∑j∈S2

√E[(nj −Nθ(K))4]P (nj ≥ n1).

The last inequality comes from Cauchy-Schwartz Inequality. Recall that we have θj−θ(K) <

−2√

log(N)/N} as j ∈ S2. According to Hoeffding’s Inequality, we have that

P (nj−n1 ≥ 0)≤ P ((nj−n1)−N(θj−θ(K))> 2√

log(N)N})≤ exp(−2(2√

log(N)N)2

4N),

Page 27: arXiv:2110.05368v3 [stat.ME] 13 Oct 2021

27

and it reduces to P (nj − n1 ≥ 0)≤ 1/N2. Combining with (4), we have that

II ≤O(√

1/N2 ∗N4) =O(N).

Combining this result about II with (3), there is E[m2] =O(N log(N)). Plugging this resultand (2) into (1), we have

E[mK

2[ini+1 −

i∑j=1

nj ]] =O(N3/2√

log(N)) = o(N2).

So, the lemma is proved. 2

APPENDIX G: PROOFS OF LEMMA 1 IN SUPPLEMENTAL MATERIALS

When N1,N2→∞, according to Central Limit Theorem, we have that(p1

p2

)(d)→N

((θθ

)(θ(1− θ)/N1 0

0 θ(1− θ)/N2

)).

Let α= p1− p2, β = (N1p1 +N2p2)/(N1 +N2), inversely there are p1 = β + N2

N1+N2α and

p2 = β − N1

N1+N2α. What’s more, the distribution of (α,β)T is that(

αβ

)=N

((0θ

)(θ(1− θ)(1/N1 + 1/N2) 0

0 θ(1− θ)/(N1 +N2)

)).

With α, β, we rewrite the term p(p2 − p1) as

p(p2 − p1) =

{ 12(β + N2

N1+N2α)(−α) α≥ 0

12(β − N1

N1+N2α)(−α) α< 0

As α⊥ β asymptotically, we have

E[αβ] =E[α]E[β] = 0, E[α21{α≥ 0}] =E[α21{α< 0}] =1

2θ(1− θ)(1/N1 + 1/N2).

Combining with these and basic calculations, we have that

E[p(p2 − p1)] =θ(1− θ)

4

N1 −N2

N1N2.

On the other hand, assume N1 ≤N2 without loss of generality, we have that

E[α2β2] = θ3(1− θ)(1/N1 + 1/N2) +O(N−22 ),

E[α31{α≥ 0}] =E[α31{α< 0}] =O(N−3/21 ),

E[α41{α≥ 0}] =E[α41{α< 0}] =O(N−21 ).

So, we have that E[p2(p2−p1)2] =E[α2β2] +O(E[α3β]) +O(E[α4]) = θ3(1− θ)(1/N1 +

1/N2) +O(N−3/21 )

APPENDIX H: RELEVANT LEMMAS FOR TRUNCATED KRUSKAL-WALLIS TESTFOR UNEQUAL SAMPLE SIZES

We analyze E[Ui] under null hypothesis first. It is a linear function of si’s. For si, giventhe number of non-zero entries in each sample (equivalently, p1, · · · , pK are given), there is

E[si|p1, p2, · · · , pK ] =Ni

2

[p

K∑j=1

(pj − pi)Nj

]+O

(max{Ni,

K∑i=1

ni}),

Page 28: arXiv:2110.05368v3 [stat.ME] 13 Oct 2021

28

where the remainder comes from the fact that we take the largest integer smaller than pNi

instead of pNi for each sample. Therefore,(1)

E[Ui] =1

2Ni+1

K∑k=1

Nk

i∑j=1

NjE[p(pi+1 − pj)] +O

( K∑k=1

Nk

i+1∑j=1

Nj

), 1≤ i≤K − 1.

The expectation in above equation is hard to calculate. However, we cn show it is compar-atively small, hence we only need to find an upper bound, which is given as Lemma 3 (seeSupplementary Materials).

Lemma 3 Let pi ∼N(θ, θ(1− θ)/Ni) independently, i = 1, · · · ,K , and N(1) = min1≤i≤K

Ni,

then

|E[p(pj − pi)]| ≤ (2√K + 1)θ(1− θ)/N(1), i 6= j,1≤ i, j ≤K,

where p= max1≤i≤K

pi.

Based on Lemma 3, let N(1) = min1≤i≤K

Ni,

(2) E[Ui] =O

(Ni+1

K∑k=1

Nk

i∑j=1

Nj/N(1)

),1≤ i≤K − 1.

Next, we derive the order of Var(Ui). It is given by

Var(Ui) = Var(E[Ui|p1, · · · , pK ]) + E[Var(Ui|p1, · · · , pK)](3)

≥ E[Var(Ui|p1, · · · , pK)], 1≤ i≤K − 1

Lemma 4 Under the null hypothesis that f1 = f2 = · · ·= fK = f , there is

Var[Ui|p1, · · · , pK ] =

1

12E

{(

K∑k=1

nk + 1)[ K∑k=1

nk[N2i+1

i∑j=1

nj + ni+1(

i∑j=1

Nj)2]− (ni+1

i∑j=1

Nj −Ni+1

i∑j=1

nj)2]}.

Further, the expectation follows that

E(Var[Ui|p1, · · · , pK ]) =θ2

12Ni+1(

i∑j=1

Nj)(

i+1∑j=1

Nj)(

K∑j=1

Nj)[(

K∑j=1

Nj)θ+ 3− 2θ]

≥ θ3

12Ni+1(

i∑j=1

Nj)2(

K∑j=1

Nj)2.

Combine (3) with Lemma 4, the lower bound for Var(Ui) is

Var(Ui)≥θ3

12Ni+1(

i∑j=1

Nj)2(

K∑j=1

Nj)2 =O

(N2i+1(

K∑j=1

Nj)2

( i∑j=1

Nj

)2

θ3/Ni+1

),(4)

when 1≤ i≤K . Compare it to theE[Ui], where (E[Ui])2 =O

(N2i+1(

∑Kj=1Nj)

2(∑i

j=1Nj)2/N2

(1)

).

E[Ui]/Var(Ui) = o(1) when max1≤i≤KNi/N2(1)→ 0 and N(1)→∞.

For the covariance term, we apply the following lemma.

Page 29: arXiv:2110.05368v3 [stat.ME] 13 Oct 2021

29

Lemma 5 Under the null hypothesis where f1 = f2 = · · ·= fK = f , and all θi’s are equal,there is

|Cov(Ui,Uj)| ≤ θ2(

K∑k=1

Nk)2Ni+1Nj+1(1 +

K∑k=1

Nk/N(1))

√√√√(

i∑l=1

Nl)(

i∑l=j

Nl)

√√√√(

K∑k=1

1

Nk),

where N(1) = min1≤k≤KNk and N(1)→∞.

APPENDIX I: PROOF OF LEMMA 3

With Cauchy-Schwarz inequality and Jensen’s inequality, note that f(x) =√x is a con-

cave function, we have that∣∣E[p(pj − pi)]∣∣=∣∣E[(p− pi)(pj − pi)] +E[pi(pj − pi)]

∣∣≤√E[(p− pi)2]E[(pj − pi)2] +

∣∣E[pi(pj − pi)]∣∣.(1)

For the first part, note that pjNj ∼Binomial(Nj , θ), piNi ∼Binomial(Ni, θ), Ni and Nj

are constants and pi and pi are independent, so we have

(2) E[(pj − pi)2] = θ(1− θ)(1/Nj + 1/Ni).

To bound E[(p− pi)2], note that we have (p− pi)2 ≤∑K

k=1(pk − pi)2. Combining with (2),there is

(3) E[(p− pi)2]≤K∑k=1

E[(pk − pi)2] =∑k 6=i

θ(1− θ)(1/Nk + 1/Ni)≤ 2Kθ(1− θ)/N(1).

Introduce (2) and (3) into (1), and we have

|E[p(pj − pi)]| ≤ 2θ(1− θ)√K/N(1) + θ(1− θ)/Ni.

So the result follows. 2

APPENDIX J: PROOF OF LEMMA 4

There are two results to prove in this lemma, the conditional variance, and the expectationof the conditional variance.

• Conditional on the number of nonzeros in each sample, the variance is

Var[Ui|p1, · · · , pK ] =

1

12(

K∑k=1

nk + 1)[ K∑k=1

nk[N2i+1

i∑j=1

nj + ni+1(

i∑j=1

Nj)2]− (ni+1

i∑j=1

Nj −Ni+1

i∑j=1

nj)2].

• The expectation of the variance is

E(Var[Ui|p1, · · · , pK ]) =θ2

12Ni+1(

i∑j=1

Nj)(

i+1∑j=1

Nj)(

K∑j=1

Nj)[(

K∑j=1

Nj)θ+ 3− 2θ]

≥ θ3

12Ni+1(

i∑j=1

Nj)2(

K∑j=1

Nj)2.

Page 30: arXiv:2110.05368v3 [stat.ME] 13 Oct 2021

30

We will show them one by one.First we find the conditional variance. Recall that Ui =

∑ij=1(Ni+1sj −Njsi+1). Define

r0j as the sum-rank of non-zero elements in Sample j. Hence, sj = r0

j +∑K

i=1bpNic+1+∑K

i=1 ni

2 (bpNic−ni)−

∑Ki=1bpNic+1

2 (bpNic). Given p1, p2, · · · , pK , equivalently n1, n2, · · · , nK , the only ran-dom part is r0

j . Therefore,

Var(Ui|p1, · · · , pK) = Var(

i∑j=1

(Ni+1sj−Njsi+1)|p1, · · · , pK) = Var(

i∑j=1

(Ni+1r0j−Njr

0i+1)|p1, · · · , pK).

Given p1, · · · , pK , then n1, n2, · · · , nK are also given. It means we have∑K

i=1 ni indepen-dently and identically distributed observations, and we are interested in the sum-rank of onesample j, denoted as r0

j . Definitely, we can use the properties in Section A.

Var(

i∑j=1

(Ni+1r0j −Njr

0i+1)|p1, · · · , pK)

= Var(Ni+1

i∑j=1

r0j − r0

i+1

i∑j=1

Nj)

=N2i+1

i∑j=1

Var(r0j ) + (

i∑j=1

Nj)2Var(r0

i+1)− 2Ni+1(

i∑j=1

Nj)

i∑j=1

Cov (() r0i+1, r

0j ).

According to the properties in Section A, Var(r0i ) =

(∑K

k=1 nk+1)12 ni(

∑Kk 6=i,k=1 nk), Cov (() r0

i , r0j ) =

− (∑K

k=1 nk+1)12 ninj . Introduce the terms into the equation, and we have that

Var[Ui|p1, · · · , pK ]

=

∑Kk=1 nk + 1

12

[N2i+1(

i∑j=1

nj)(

K∑j=i+1

nj) + (

i∑j=1

Nj)2ni+1(

K∑j 6=i+1,j=1

nj) + 2Ni+1(

i∑j=1

Nj)ni+1(

i∑j=1

nj)]

=1

12(

K∑k=1

nk + 1)[ K∑k=1

nk[N2i+1

i∑j=1

nj + ni+1(

i∑j=1

Nj)2]− (ni+1

i∑j=1

Nj −Ni+1

i∑j=1

nj)2].

The first conclusion is proved.To derive the second conclusion, we simply calculate the expectation of the results. Note

that in the equation, there is no p term, only products of ni, nj and nk. Under null hypothesis,ni ∼Binomial(Ni, θ) independently, therefore, we have

E[ni] =Niθ, Var(ni) =Niθ(1− θ), E[n2i ] =Niθ(Niθ+ 1− θ),

E[ninj ] =NiNjθ2, E[n2

inj ] =NiNjθ2(Niθ+ 1− θ).

Introduce these terms into EVar(Ui|p1, · · · , pK), we have the final result

E(Var[Ui|p1, · · · , pK ]) =θ2

12Ni+1(

i∑j=1

Nj)(

i+1∑j=1

Nj)(

K∑j=1

Nj)[(

K∑j=1

Nj)θ+ 3− 2θ]

≥ θ3

12Ni+1(

i∑j=1

Nj)2(

K∑j=1

Nj)2.

The final inequality comes from 3− 2θ > 0 and∑i+1

j=1Nj >∑i

j=1Nj .Therefore, the lemma is proved.

Page 31: arXiv:2110.05368v3 [stat.ME] 13 Oct 2021

31

APPENDIX K: PROOF OF LEMMA 5

We are interested in the case that f1 = f2 = · · ·= fK = f , and all θi’s are equal. For thiscase, we have that(1)Cov (()Ui,Uj) = E[Cov (()Ui,Uj |p1, · · · , pK)]+Cov (()E[Ui|p1, · · · , pK ],E[Uj |p1, · · · , pK ]) = I+II.

We want to find the upper bound for I and II separately.Consider I first. Given p1, · · · , pK , the number of non-zero elements are given, so the

random part is the rank of non-zero elements in every sample. Let r0i denote the sum-rank of

the non-zero elements in sample i. Therefore, we calculate the conditional covariance as

Cov (()Ui,Uj |p1, · · · , pK) = Cov (()Ni+1

i∑l1=1

r0l1 − r

0i+1

i∑l1=1

Nl1 ,Nj+1

j∑l2=1

r0l2 − r

0j+1

j∑l2=1

Nl2)

=1 +

∑Kk=1 nk

12

[Ni+1Nj+1(

i∑l1=1

nl1)(

K∑l2=j+1

nl2) +Ni+1nj+1(

i∑l1=1

nl1)(

j∑l2=1

nl2)

−ni+1Nj+1(

i∑l1=1

Nl1)(

K∑l2=j+1

nl2)− ni+1nj+1(

i∑l1=1

Nl1)(

j∑l2=1

Nl2)

].

Note that ni ∼Binomial(Ni, θ) independently, so that we have

(2) I = E[Cov (()Ui,Uj |p1, · · · , pK)] = 0.

Next consider II . According to (1), we have the result that

E[Ui] =1

2Ni+1

K∑k=1

Nk

i∑j=1

NjE[p(pi+1 − pj)] +O

( K∑k=1

Nk

i+1∑j=1

Nj

), 1≤ i≤K − 1,

where the remainder comes from the difference between bpNic and pNi, which is a smallerorder term. Therefore, the covariance between two terms is

II = Cov (()1

2Ni+1

K∑k=1

Nk

i∑l1=1

Nl1p(pi+1 − pl1),1

2Nj+1

K∑k=1

Nk

j∑l2=1

Nl2p(pj+1 − pl2))

=1

4Ni+1Nj+1

( K∑k=1

Nk

)2Cov (()

i∑l1=1

Nl1p(pi+1 − pl1),j∑

l2=1

Nl2p(pj+1 − pl2)).

It’s hard to find an exact value, so we work on an upper bound only. We rewrite p= θ+ (p−θ), and so the covariance part is the summation of four parts

Cov (()

i∑l1=1

Nl1p(pi+1 − pl1),j∑

l2=1

Nl2p(pj+1 − pl2))

= Cov (()

i∑l1=1

Nl1θ(pi+1 − pl1),j∑

l2=1

Nl2θ(pj+1 − pl2)) + Cov (()

i∑l1=1

Nl1(p− θ)(pi+1 − pl1),j∑

l2=1

Nl2θ(pj+1 − pl2))

+ Cov (()

i∑l1=1

Nl1θ(pi+1 − pl1),j∑

l2=1

Nl2(p− θ)(pj+1 − pl2)) + Cov (()

i∑l1=1

Nl1(p− θ)(pi+1 − pl1),j∑

l2=1

Nl2(p− θ)(pj+1 − pl2))

= IIa+ IIb+ IIc+ IId.

Page 32: arXiv:2110.05368v3 [stat.ME] 13 Oct 2021

32

Check the terms one by one. For IIa, note that Cov (()pi, pj) = 0 when i 6= j because ofindependence, so we have

IIa= θ2 Cov (()

i∑l1=1

Nl1(pi+1 − pl1),j∑

l2=1

Nl2(pi+1 − pl2))

= θ2

[ i∑l=1

[N2l Var(pl)]− (

i∑l=1

Nl)Ni+1Var(pi+1)

]

= θ2

[ i∑l=1

Nlθ(1− θ)− (

i∑l=1

Nl)θ(1− θ)]

= 0.(3)

The parts IIb and IIc are symmetric. We analyze one and the result for the other one canalso be achieved.

IIb= θCov (()

i∑l1=1

Nl1(p− θ)(pi+1 − pl1),j∑

l2=1

Nl2(pi+1 − pl2))

≤ θ

√√√√Var[ i∑l1=1

Nl1(p− θ)(pi+1 − pl1)]Var[ j∑l2=1

Nl2(pi+1 − pl2)].(4)

Then we check the two variances. Consider the first one. Note that p = max1≤i≤K pi, so(p− θ)2 ≤

∑Ki=1(pi − θ)2. So we have

Var[ i∑l1=1

Nl1(p− θ)(pi+1 − pl1)]≤ E[(p− θ)2]E[(

i∑l1=1

Nl1(pi+1 − pl1))2]

≤[ K∑k=1

E[(pk − θ)2]

]E[(

i∑l1=1

Nl1(pi+1 − pl1))2]

(Note E[

i∑l1=1

Nl1(pi+1 − pl1)] = 0) =

[ K∑k=1

Var(pk − θ)]Var[

i∑l1=1

Nl1(pi+1 − pl1)]

= θ2(1− θ)2(

K∑k=1

1/Nk)(

i∑l=1

Nl)[1 +

i∑l=1

Nl/Ni+1](5)

For the second one, the calculation is straigtforward.

(6) Var[ j∑l2=1

Nl2(pi+1 − pl2)]

= θ(1− θ)(j∑l=1

Nl)[1 +

j∑l=1

Nl/Nj+1].

Combine (5) – (6) with (4), we have that

IIb≤ θ

√√√√θ2(1− θ)2(

K∑k=1

1/Nk)(

i∑l=1

Nl)[1 +

i∑l=1

Nl/Ni+1]θ(1− θ)(j∑l=1

Nl)[1 +

j∑l=1

Nl/Nj+1]

≤ θ2[1 +

K∑l=1

Nl/N(1)]

√√√√(

K∑k=1

1/Nk)(

i∑l=1

Nl)(

j∑l=1

Nl),(7)

Page 33: arXiv:2110.05368v3 [stat.ME] 13 Oct 2021

33

where N(1) is the minimum of N1, · · · ,NK .Similarly, for IIc, we have

(8) IIc≤ θ2[1 +

K∑l=1

Nl/N(1)]

√√√√(

K∑k=1

1/Nk)(

i∑l=1

Nl)(

j∑l=1

Nl).

Finally, we check IId. With the results above, we have the upper bound for IId as

IId≤

√√√√Var[ i∑l1=1

Nl1(p− θ)(pi+1 − pl1)]Var[ j∑l2=1

(p− θ)Nl2(pi+1 − pl2)]

√√√√θ2(1− θ)2(

K∑k=1

1/Nk)(

i∑l=1

Nl)[1 +

i∑l=1

Nl/Ni+1]θ2(1− θ)2(

K∑k=1

1/Nk)(

i∑l=1

Nl)[1 +

i∑l=1

Nl/Ni+1]

≤ θ2[1 +

K∑l=1

Nl/N(1)](

K∑k=1

1/Nk)

√√√√(

i∑l=1

Nl)(

j∑l=1

Nl).(9)

Finally, combine (3), (7), (8) and (9), we have that

|Cov (()

i∑l1=1

Nl1p(pi+1 − pl1),j∑

l2=1

Nl2p(pj+1 − pl2))|= |IIa+ IIb+ IIc+ IId|

≤ 0 + θ2[1 +

K∑l=1

Nl/N(1)]

√√√√(

K∑k=1

1/Nk)(

i∑l=1

Nl)(

j∑l=1

Nl) + θ2[1 +

K∑l=1

Nl/N(1)]

√√√√(

K∑k=1

1/Nk)(

i∑l=1

Nl)(

j∑l=1

Nl)

+θ2[1 +

K∑l=1

Nl/N(1)](

K∑k=1

1/Nk)

√√√√(

i∑l=1

Nl)(

j∑l=1

Nl)

≤ 3θ2[1 +

K∑l=1

Nl/N(1)]

√√√√(

K∑k=1

1/Nk)(

i∑l=1

Nl)(

j∑l=1

Nl),(10)

when N1, · · · ,NK →∞.Finally, combine (2) and (10) with (1), there is

|Cov (()Ui,Uj)|= |I + II|= |II| ≤ 1

4Ni+1Nj+1

( K∑k=1

Nk

)2 · 3θ2[1 +

K∑l=1

Nl/N(1)]

√√√√(

K∑k=1

1/Nk)(

i∑l=1

Nl)(

j∑l=1

Nl)

≤ θ2Ni+1Nj+1

( K∑k=1

Nk

)2[1 +

K∑l=1

Nl/N(1)]

√√√√(

K∑k=1

1/Nk)(

i∑l=1

Nl)(

j∑l=1

Nl).

Therefore, the result is proved.