ADVANCED AND NOVEL PARAMETRIC AND
NONPARAMETRIC LIKELIHOOD STATISTICAL
TECHNIQUES WITH APPLICATIONS IN
EPIDEMIOLOGY
by
Wan-Min Tsai
December 17, 2012
A dissertation submitted to the
Faculty of the Graduate School of
the University at Buffalo, State University of New York
in partial fulfillment of the requirements for the
degree of
Doctor of Philosophy
Department of Biostatistics
All rights reserved
INFORMATION TO ALL USERS: The quality of this reproduction is dependent upon the quality of the copy submitted. In the unlikely event that the author did not send a complete manuscript and there are missing pages, these will be noted. Also, if material had to be removed, a note will indicate the deletion.
Microform Edition © ProQuest LLC. All rights reserved. This work is protected against unauthorized copying under Title 17, United States Code.
ProQuest LLC, 789 East Eisenhower Parkway, P.O. Box 1346, Ann Arbor, MI 48106-1346
UMI Number: 3554514. Published by ProQuest LLC (2013). Copyright in the Dissertation held by the Author.
Dissertation Committee
Albert Vexler, Ph.D. (Advisor)
Alan Hutson, Ph.D.
Chang-Xing Ma, Ph.D.
Jihnhee Yu, Ph.D.
ACKNOWLEDGEMENTS
I would like to express my gratitude to all the people who have helped me along the journey
of my Ph.D. study.
Firstly, I would like to express my deepest appreciation to my advisor, Professor
Albert Vexler, for his guidance and encouragement. This dissertation would not have
been possible without him. I am also deeply thankful to Professors Alan Hutson, Chang-
Xing Ma, and Jihnhee Yu, who kindly agreed to serve on my dissertation committee, for
their interest and valuable comments on my work.
I would like to thank all of my professors, fellow students, and department staff for
their kind help and support during my graduate study at the University at Buffalo. My
sincere thanks go also to Dr. Gregory Gurevich and Dr. Yaakov Malinovsky, as well as the
reviewers of the manuscripts submitted to journals, for their constructive suggestions
that improved the quality of the manuscripts and this dissertation.
I would also like to thank all of my friends for always being there with a word of
encouragement or a listening ear.
Last but not least, I would like to thank my beloved parents, brother and sisters.
Without their strong love and support, I would not have had a chance to go abroad to
pursue my graduate study. I love my family beyond expression for their support and
encouragement along the way. I would like to dedicate this work to them.
TABLE OF CONTENTS
Acknowledgements ....................................................................................................... iii
List of Tables .................................................................................................................... x
List of Figures ............................................................................................................... xiii
Abstract ........................................................................................................................ xiv
Chapters ........................................................................................................................... 1
Chapter 1. Introduction. ............................................................................................ 1
Chapter 2. Estimation and testing based on data subject to
measurement errors: from parametric to non-parametric likelihood
methods ............................................................................................................................ 9
2.1 Introduction .............................................................................................................................. 9
2.2 Parametric inferences............................................................................................................11
2.2.1 Parametric likelihood functions .......................................................................... 12
2.2.1.1 Parametric likelihood based on repeated data..12
2.2.1.2 Parametric likelihood based on pooled and unpooled data......13
2.2.2 Normal case......................................................................................................... 15
2.2.2.1 Maximum likelihood estimators based on repeated measures.15
2.2.2.2 Maximum likelihood estimators following the hybrid design......17
2.2.2.3 Remarks on the normal case ...................18
2.3 Empirical likelihood method ...............................................................................................19
2.3.1 The empirical likelihood ratio test ...................................................................... 19
2.3.2 The empirical likelihood method based on repeated measures data ................... 21
2.3.3 The empirical likelihood method based on pooled-unpooled data ..................... 22
2.4 Monte Carlo experiments .....................................................................................................25
2.4.1 Simulation settings .............................................................................................. 25
2.4.2 Monte Carlo outputs ........................................................................................... 27
2.5 A real data example ..............................................................................................................32
2.6 Conclusions ............................................................................................................................35
Chapter 3. An empirical likelihood ratio based goodness-of-fit test for
inverse Gaussian distributions .............................................................................. 36
3.1 Introduction ............................................................................................................................36
3.2 Method ....................................................................................................................................39
3.2.1 Density-based empirical likelihood ratio goodness-of-fit test ............................ 40
3.2.2 Null distribution .................................................................................................. 46
3.3 Power properties ....................................................................................................................50
3.4 Data examples ........................................................................................................................52
3.5 Conclusions ............................................................................................................................54
Chapter 4. An extensive power evaluation of a novel two-sample
density-based empirical likelihood ratio test for paired data. ................... 56
4.1 Introduction ............................................................................................................................56
4.2 Method ....................................................................................................................................61
4.3 Simulation study ....................................................................................................................67
4.3.1 Null distribution .................................................................................................. 67
4.3.2 Power study ......................................................................................................... 71
4.4 A real data example ..............................................................................................................75
4.5 Conclusions ............................................................................................................................78
Chapter 5. Two-sample density-based empirical likelihood ratio tests
based on paired data, with application to a treatment study of
Attention-Deficit/Hyperactivity Disorder and Severe Mood
Dysregulation............................................................................................................... 88
5.1 Introduction ............................................................................................................................88
5.2 Statement of problems and Methods ..................................................................................93
5.2.1 Hypotheses setting .............................................................................................. 93
5.2.2 Test statistics ....................................................................................................... 94
5.2.2.1 Test 1 .....94
5.2.2.2 Test 2 .........99
5.2.2.3 Test 3 .......101
5.2.3 Asymptotic consistency of the tests .................................................... 102
5.2.4 Null distributions of the proposed test statistics ................................................ 103
5.3 Simulation study ..................................................................................................................105
5.3.1 Power comparison with the parametric method................................................. 105
5.3.2 Power comparison with classic nonparametric methods ................................... 106
5.4 Data analysis ........................................................................................................................107
5.5 Conclusions ..........................................................................................................................110
Chapter 6. Optimal properties of parametric Shiryaev-Roberts
statistical control procedures ............................................................................... 117
6.1 Introduction ..........................................................................................................................117
6.2 Retrospective change point detection ...............................................................................119
6.3 Retrospective detection of two change points .................................................................124
6.3.1 Non-asymptotic optimal properties of the Shiryaev-Roberts test (6.10) ......... 125
6.3.2 A real data example .......................................................................................... 129
6.3.2.1 The methods for p-value approximation related to test (6.16)...132
6.3.2.2 Additional study.............134
6.4 Sequential change point detection.....................................................................................134
6.5 Conclusions ..........................................................................................................................138
Chapter 7. Future works ....................................................................................... 140
7.1 Testing normality based on independent samples and errors in regression models
......................................................................................................................................................140
7.2 A sample density-based empirical likelihood ratio test for independence
......................................................................................................................................................141
Appendices ................................................................................................................. 143
A.1.1 Parametric likelihood based on the hybrid design ......................................................143
A.1.2 The asymptotic distribution of the log empirical likelihood ratio test .....................145
A.1.2.1 The repeated measurements .......................................................................... 145
A.1.2.2 The hybrid design ......................................................................................... 147
A.2.1 Proof of Lemma 3.1 ........................................................................................................149
A.2.2 To show the proposed test statistic is identical to the test statistic presented by
Mudholkar and Tian (2002)..................................................................................................... 150
A.2.3 Proof of Proposition 3.1 .................................................................................................152
A.2.4 Proof of Proposition 3.2 ................................................................................................ 155
A.3.1 Computing the empirical constraint (5.11) utilized to develop test 2 ......................156
A.3.2 Proof of Proposition 5.1 ................................................................................................ 157
A.3.2.1 Proposed test 1 .............................................................................................. 157
A.3.2.2 Proposed test 2 .............................................................................................. 165
A.3.3 Mathematical derivation of maximum likelihood ratio tests.................................... 168
A.3.3.1 Maximum likelihood ratio test statistic for test 1 ......................................... 168
A.3.3.2 Maximum likelihood ratio test statistic for test 2 ......................................... 169
A.3.3.3 Maximum likelihood ratio test statistic for test 3 ......................................... 170
Bibliography .............................................................................................................. 171
LIST OF TABLES
2.1 The Monte Carlo evaluations of the maximum likelihood estimates based on
repeated measurements and the hybrid design ......................................................... 28
2.2 Coverage probabilities and confidence intervals based on repeated measurements
and the hybrid design ............................................................................................... 29
2.3 The Monte Carlo type I errors and powers of the empirical likelihood ratio test
statistics (2.3) and (2.7) for testing based on data following the hybrid design ....... 31
2.4 The Monte Carlo type I errors and powers of the empirical likelihood ratio test
statistics (2.3) and (2.7) for testing based on data following the hybrid design ....... 32
2.5 Bootstrap evaluations of the confidence interval estimators based on the
parametric likelihood ratio test and the empirical likelihood ratio test.................... 34
3.1 The critical values of the proposed test at the significance level ............................. 47
3.2 Type I error control of the proposed test at the 0.05 level ....................................... 48
3.3 The Monte Carlo type I errors of the proposed test with critical values obtained
by Proposition 2.2 to guarantee the nominal level ................................................... 50
3.4 An empirical power comparison for the density-based empirical likelihood ratio
test ............................................................................................................................ 51
3.5 Test for the IG based on the data introduced in Folks and Chhikara (1978) ........... 52
3.6 Bootstrap type proportion of rejection of the tests for the IG at a 5% level of
significance .............................................................................................................. 54
4.1 The critical values of the proposed test (4.7) with various parameter values for
different sample sizes at the significance levels 0.01, 0.05, and 0.1,
respectively .............................................................................................................. 68
4.2 Type I error control of the proposed test statistic (4.6) at the considered
significance levels .................................................................................................... 70
4.3 Designs of the alternative hypothesis to be applied to the following Monte Carlo
evaluations of the powers of the proposed test (4.6) ................................................ 72
4.4 Proportion of rejection based on the bootstrap method for each considered test ..... 78
5.1 Hypotheses of interest to be tested based on paired data ......................................... 94
5.2 The critical values for Test 1 by (5.9) (Test 2 by (5.15)) [Test 3 by (5.17)] with
parameter value 0.1 for different sample sizes and significance levels ................... 104
5.3 The Monte Carlo powers of Test 1 by (5.9) vs. the MLR test with
different sample sizes at the considered significance level ..................................... 111
5.4 The Monte Carlo powers of Test 2 by (5.15) vs. the MLR test with
different sample sizes at the considered significance level ..................................... 112
5.5 The Monte Carlo powers of Test 3 by (5.17) vs. the MLR test with
different sample sizes at the considered significance level ..................................... 112
5.6 The Monte Carlo type I errors of the MLR tests ................................................... 113
5.7 The Monte Carlo powers of the proposed test (5.9) vs. the combined
nonparametric test (two Wilcoxon signed rank tests and one Kolmogorov-
Smirnov test) ........................................................................................................... 113
5.8 The Monte Carlo powers of the proposed test (5.15) vs. the combined
nonparametric test (one Wilcoxon signed rank test and one Kolmogorov-
Smirnov test) ........................................................................................................... 114
5.9 The Monte Carlo powers of the proposed test (5.17) vs. the Wilcoxon signed
rank test ................................................................................................................... 115
5.10 The proportions of rejections based on the bootstrap method for each considered
test .......................................................................................................................... 115
6.1 Means and standard deviations of the %TS data and the TIBC data in each group
............................................................................................................................... 132
LIST OF FIGURES
2.1 The histogram and the normal Q-Q plot of cholesterol data ................................... 33
4.1 3-D plots of powers of the considered tests via all 28 designs (K1-K28) with
different sample sizes at the considered significance level ....................................... 80
4.2 Histograms of the differences in CDRS-Rts at baseline and endpoint in group 1
and group 2, respectively .......................................................................................... 87
4.3 Plot of sample sizes vs. p-value using a bootstrap method ...................................... 87
5.1 Histograms of CDRS-Rts related to the baseline and endpoint in group 1 and in
group 2 .................................................................................................................... 116
5.2 Histograms of the paired observations based on the CDRS-Rts data, with
sample sizes (9, 6), that were sampled from the original data set .......................... 116
6.1 The left-hand side shows plots and histograms of %TS data; the right-hand
side shows plots and histograms of TIBC data ...................................................... 130
6.2 Histograms of %TS data in each group ................................................................. 131
6.3 Histograms of TIBC data in each group ................................................................ 131
ABSTRACT
The likelihood approach provides a basis for many important procedures and methods in
statistical inference. When data distributions are completely known, the parametric
likelihood approach is unarguably a powerful statistical tool that can provide optimal
statistical inference. In such cases, by virtue of the Neyman-Pearson lemma, the likelihood
ratio tests are the most powerful decision rules. However, parametric likelihood methods
cannot be applied properly if assumptions about the forms of the data distributions do not
hold. Often, in the context of likelihood applications, the use of misspecified parametric
forms of data distributions may result in inaccurate statistical conclusions. The empirical
likelihood (EL) methodology has been well addressed in the literature as a nonparametric
counterpart of the powerful parametric likelihood approach. The objective of this dissertation is to
develop several powerful parametric likelihood methods and nonparametric approaches
using the EL concept. Measurement error (ME) problems can cause bias or inconsistency
of statistical inferences. When investigators are unable to obtain correct measurements of
biological assays, special techniques to quantify MEs need to be applied. In this
dissertation, we present both parametric likelihood and EL methods for dealing with data
subject to MEs based on repeated measures sampling strategies and hybrid sampling
designs (a mixture of pooled and unpooled data). Utilizing the density-based EL
methodology, we also propose different efficient nonparametric tests that approximate
most powerful Neyman-Pearson test statistics. We first introduce the EL ratio based
goodness-of-fit test for the inverse Gaussian model. Then we extend and adapt the density-
based EL approach to compare two samples based on paired data. We present exact
nonparametric tests for composite hypotheses to detect various differences related to
treatment effects in study groups based on paired measurements. Next, we review and
extend parametric retrospective and sequential Shiryaev-Roberts based policies, examining
the non-asymptotic optimal properties of the procedures in different contexts. We propose
techniques to construct novel and efficient retrospective tests for multiple change-point
detection problems. Finally, directions for future work are discussed.
CHAPTER 1
INTRODUCTION
Likelihood methods are powerful statistical tools for parametric statistical inference. The
parametric likelihood methodology provides optimal statistical inferential procedures
when data distributions are known. However, parametric forms of data distributions are
oftentimes unknown. When the key assumptions regarding the underlying distribution of
data are not met, the parametric likelihood approaches may be extremely biased and
inefficient when compared to their robust nonparametric counterparts. In the
nonparametric context, the classical empirical likelihood (EL) approach is often applied
in order to efficiently approximate properties of parametric likelihoods, using an
approach based on substituting empirical distribution functions for their population
counterparts. Thus, the objective of this work is to develop efficient parametric likelihood
approaches as well as nonparametric methods using the EL concept to deal with problems
arising from epidemiological and medical studies.
The EL ratios were first used by Thomas & Grunkemeier (1975) in the context of
estimating survival probabilities. Owen (1988, 1990), building on earlier work of Thomas
& Grunkemeier (1975), introduced the EL approach for constructing confidence regions
in nonparametric problems. The approach has since been extended to a variety of settings. For
example, Owen (1991) and Chen (1993, 1994) extended the method to regression
problems, Chen and Hall (1993) considered the case of quantiles, and Kolaczyk (1994)
made the extension to generalized linear models. Qin and Lawless (1994) have shown
that the EL methodology can be utilized to make inference on parameters of interest
under a semiparametric model. The EL method has several advantages compared to
contemporary methods such as the bootstrap (Hall and La Scala, 1990). The EL approach
not only provides confidence regions whose shapes and orientations are determined
entirely by the data, but it can also easily incorporate known constraints on parameters
(Chen, 1994). Moreover, DiCiccio et al. (1989) showed that in a very general setting, the
EL method for constructing confidence regions is Bartlett correctable, which yields very
low coverage error and thereby improves the accuracy of inferences. For a more comprehensive
literature review, see e.g., Owen (2001).
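To make the EL construction concrete, the following minimal sketch (illustrative only, not code from this dissertation) computes an Owen-type EL ratio statistic for a population mean: the weights p_i = 1/(n(1 + λ(x_i − μ))) come from the Lagrange-multiplier solution, the multiplier λ is found by bisection, and the statistic is calibrated against its asymptotic chi-square(1) law. The data and tolerances are arbitrary choices.

```python
import math
import random

def el_ratio_mean(x, mu):
    """-2 log empirical likelihood ratio for H0: E[X] = mu (Owen-type).

    Maximizes prod(p_i) subject to sum(p_i) = 1 and sum(p_i * (x_i - mu)) = 0;
    the optimal weights are p_i = 1 / (n * (1 + lam * (x_i - mu))).
    """
    d = [xi - mu for xi in x]
    if min(d) >= 0 or max(d) <= 0:
        return float("inf")  # mu lies outside the convex hull of the data
    n = len(d)
    # lam must keep every weight in (0, 1]:  1 + lam * d_i >= 1/n for all i
    lo = (1.0 / n - 1.0) / max(d) + 1e-10
    hi = (1.0 / n - 1.0) / min(d) - 1e-10
    g = lambda lam: sum(di / (1.0 + lam * di) for di in d)  # strictly decreasing
    for _ in range(200):  # bisection for the Lagrange multiplier
        mid = 0.5 * (lo + hi)
        if g(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    lam = 0.5 * (lo + hi)
    return 2.0 * sum(math.log1p(lam * di) for di in d)

random.seed(1)
x = [random.expovariate(0.5) for _ in range(50)]  # true mean is 2
stat = el_ratio_mean(x, mu=2.0)
# Wilks-type theory: stat is asymptotically chi-square(1) under H0,
# so a 5%-level test rejects when stat exceeds 3.84.
reject = stat > 3.84
```

Note that the statistic is exactly zero at the sample mean and infinite outside the convex hull of the data, which is the well-known boundary behavior of the EL ratio.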
In this dissertation, we mainly focus on the developments of novel and powerful
distribution-free approaches to approximate optimal parametric likelihood ratio test
statistics via the EL methodology. In Chapter 2, we introduce both parametric and
empirical likelihood methods for dealing with measurement error problems. In Chapters 3-
5, we develop several robust EL-based methods motivated by important biomedical studies.
In Chapter 6, we propose techniques to construct efficient parametric retrospective tests for
multiple change-points detection problems. In the last chapter, we discuss directions for
future works. In this introductory chapter, we begin with a brief description of background
and motivation of each study and review of related literature, which are essential for
comprehending the following chapters covered in this dissertation.
In Chapter 2, we propose and examine parametric and nonparametric tests for data
subject to measurement errors based on two sampling schemes---a common sampling
strategy based on repeated measures and a hybrid design (a mixture of pooled and
unpooled data). Biomarkers are distinctive biochemical indicators of biological processes
or events that help measure the progress of disease or the effects of treatment. The
imprecision of measurement of exposures can cause bias or inconsistency of estimation
or testing of unknown parameters of the biomarker distribution. The sampling based on
repeated measurements is a common method to deal with measurement error problems
(e.g., Dunn, 1989; Freedman et al., 2004). However, because of financial cost and other
resource constraints, obtaining and assaying the desired number of replications can be
infeasible. For example, the cost of a single assay to measure polychlorinated biphenyl
(PCB) is between $500 and $1000. The high assay expense restricts the number of
biological samples that can be measured in the study related to examination of the
association between the marker PCB and cancer or endometriosis (Louis et al., 2005;
Mumford et al., 2006). Take the Seveso cohort as another example. In 1976, due to a
severe industrial accident, serious dioxin contamination of a large resident population and
of farm animals occurred in Seveso, Italy. Investigators needed to assay frozen stored
blood specimens from residents of the exposed area in order to examine whether the
dioxin contamination adversely affects human health. However, each determination of
blood dioxin currently costs more than $1000, limiting the number of subjects that can
be considered in this study
(Weinberg and Umbach, 1999).
Pooling and simple random sampling designs have been suggested as cost-efficient
sampling approaches (e.g., Dorfman, 1943; Weinberg and Umbach, 1999; Faraggi et al.,
2003; Vexler et al., 2006; Schisterman and Vexler, 2008). Simple random sampling
designs are conducted by selecting and assaying a random subset of available specimens.
In this case, several individual biospecimens are ignored; however, estimation of the
operating characteristics of these individual biospecimens is simple. Pooling, on the other
hand, involves randomly grouping and physically mixing individual biological samples.
Assays are then performed on the small number of pooled samples. Therefore, the
pooling design reduces the number of measurements without ignoring any individual
biospecimens. Schisterman et al. (2010) proposed a pooled-unpooled hybrid design that
combines the advantages of both pooling and random sampling strategies. The hybrid
designs are implemented by assaying some randomly selected individual biospecimens
and one pooled set that is constituted by the remaining individual biospecimens. The
pooled sample in the hybrid design is used for estimation of the mean of the biomarker
distribution, whereas the random sample proportion in the hybrid design (unpooled data)
is utilized for estimation of the variance. One advantage of using the hybrid design is that
by utilizing a pooled sample, one can estimate not only mean and variance of the
biomarker distribution, but also measurement error, without requiring repeated measures.
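The mechanics of the hybrid design can be sketched with a small simulation. The sketch below assumes additive, homoscedastic normal measurement error and equal-volume pooling (so one assay of a pooled group measures the average of the pooled specimens plus one error term); under those assumptions a simple method-of-moments step separates the biomarker variance from the assay-error variance. All names and parameter values are illustrative assumptions, not the design of Schisterman et al. (2010).

```python
import random
import statistics

random.seed(7)
MU_X, SD_X, SD_ERR = 10.0, 2.0, 1.0        # true biomarker mean/SD, assay-error SD
N_UNPOOLED, N_GROUPS, POOL_SIZE = 200, 200, 5

def assay(true_value):
    """One laboratory measurement: the true value plus additive error."""
    return true_value + random.gauss(0.0, SD_ERR)

# Unpooled arm: assay individual specimens; Var = SD_X^2 + SD_ERR^2
unpooled = [assay(random.gauss(MU_X, SD_X)) for _ in range(N_UNPOOLED)]

# Pooled arm: physically mix POOL_SIZE specimens, assay the mixture once;
# the assay sees the average of the pooled specimens, so
# Var = SD_X^2 / POOL_SIZE + SD_ERR^2
pooled = []
for _ in range(N_GROUPS):
    specimens = [random.gauss(MU_X, SD_X) for _ in range(POOL_SIZE)]
    pooled.append(assay(statistics.fmean(specimens)))

mu_hat = statistics.fmean(pooled)          # pooled assays estimate the mean
v_unpooled = statistics.variance(unpooled)
v_pooled = statistics.variance(pooled)
# Method of moments: the two variance equations above form two equations in
# two unknowns, separating biomarker variance from measurement-error variance.
var_x_hat = (v_unpooled - v_pooled) * POOL_SIZE / (POOL_SIZE - 1)
var_err_hat = v_unpooled - var_x_hat
```

With the values above, mu_hat recovers the biomarker mean from the pooled assays alone, while contrasting the two observed variances identifies both variance components without any repeated measures.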
The repeated measurements sampling strategy has been well addressed in the literature
under parametric assumptions. In the context of the hybrid sampling design, Schisterman
et al. (2010) have proposed and evaluated a parametric approach for normally distributed
data in the presence of measurement errors. However, nonparametric methods have not
been well investigated to analyze repeated measures data or pooled data subject to
measurement errors. In Chapter 2, we consider general cases of parametric and
nonparametric assumptions, comparing efficiency of pooled-unpooled samples and data
consisting of repeated measures. The applications of the proposed methods are illustrated
using the cholesterol biomarker data from a study of myocardial infarction.
The classical EL approach reflects a distribution-based interpretation of the notion of
likelihood (Owen, 2001). Note that the Neyman-Pearson concept to show the optimality
of statistical decision rules utilizes the density-function structure of likelihood-type tests.
Recently, Vexler and Gurevich (2010) proposed the density-based EL methodology to
construct efficient nonparametric tests that approximate Neyman-Pearson test statistics.
In the following chapters, we extend and adapt the density-based EL approach for
different testing problems, including tests of goodness-of-fit (Chapters 3 and 7.1) and
two-sample comparisons based on paired data (Chapters 4 and 5).
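To fix ideas about the benchmark being approximated, the toy example below computes the Neyman-Pearson log likelihood ratio for two fully specified normal densities. In practice the densities are unknown, which is precisely the gap the density-based EL methodology fills, so this sketch only illustrates the target statistic; the means, variance, and sample sizes are arbitrary illustrative choices.

```python
import random

def log_lr_normal(xs, mu0=0.0, mu1=1.0, sigma=1.0):
    """Log likelihood ratio log(f1/f0) for N(mu0, sigma^2) vs N(mu1, sigma^2).

    The quadratic terms cancel, leaving a linear function of each observation:
    log f1(x) - log f0(x) = ((mu1 - mu0)*x - (mu1**2 - mu0**2)/2) / sigma**2
    """
    return sum(((mu1 - mu0) * x - (mu1**2 - mu0**2) / 2.0) / sigma**2 for x in xs)

random.seed(3)
under_h0 = [random.gauss(0.0, 1.0) for _ in range(100)]
under_h1 = [random.gauss(1.0, 1.0) for _ in range(100)]
# By the Neyman-Pearson lemma, the most powerful test rejects for large
# values of this log-LR; for equal variances it reduces to thresholding
# the sample mean.
lr_h0 = log_lr_normal(under_h0)
lr_h1 = log_lr_normal(under_h1)
```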
Testing for distributional assumptions has been a major area of continuing statistical
research. Chapter 3 introduces a new and powerful density-based EL goodness-of-fit test
for the inverse Gaussian distribution. The inverse Gaussian distribution is commonly
used to model right-skewed data with positive support. It originates as the distribution
of the first passage time of Brownian motion with drift (Henze and Klar, 2012). Its
useful applications include reliability problems (Padgett and Tsoi,
1986), lifetime models (Chhikara and Folks, 1977), and accelerated life testing
(Bhattacharyya and Fries, 1982). Folks and Chhikara (1978, 1989) have presented several
properties and applications of this distribution. When applying the inverse Gaussian
model, it is critical to develop efficient goodness-of-fit tests. Mudholkar and Tian (2002)
presented the entropy-based goodness-of-fit test for the inverse Gaussian distribution that
strongly depends on the value of an integer parameter m. The difficulty of selecting the
optimal m complicates efficient implementation of the test in practice. In this chapter,
we propose an EL based goodness-of-fit test for the inverse Gaussian distribution that
improves on Mudholkar and Tian's (2002) test by eliminating the dependence on the
integer parameter m.
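The dependence on m can be seen directly in a Vasicek-type spacing estimator of entropy, the basic building block of such entropy-based tests: different window widths m yield different estimates, and hence differently behaved test statistics. The sketch below is illustrative only, uses one common boundary convention, and is not the Mudholkar-Tian statistic itself.

```python
import math
import random

def vasicek_entropy(xs, m):
    """Vasicek-type spacing estimator of differential entropy with window m.

    H_{m,n} = (1/n) * sum_i log( n * (X_(i+m) - X_(i-m)) / (2m) ),
    with order-statistic indices clipped to the sample range at the boundaries.
    """
    n = len(xs)
    s = sorted(xs)
    clip = lambda j: min(max(j, 0), n - 1)  # boundary convention
    total = 0.0
    for i in range(n):
        spacing = s[clip(i + m)] - s[clip(i - m)]
        total += math.log(n * spacing / (2.0 * m))
    return total / n

random.seed(5)
x = [random.gauss(0.0, 1.0) for _ in range(100)]
# The estimate -- and any test statistic built on it -- shifts with m:
estimates = {m: vasicek_entropy(x, m) for m in (2, 5, 10)}
# For reference, the true entropy of N(0,1) is 0.5 * log(2*pi*e), about 1.419.
```

Because no single m is best across alternatives, eliminating this tuning parameter is the practical motivation stated above.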
It is a common practice to conduct medical trials in order to compare a new therapy
with a standard of care based on paired data consisting of pre- and post-treatment
measurements. In the subsequent chapters (Chapters 4 and 5), we focus on developments
of efficient nonparametric likelihood methods for comparing two study groups based on
paired observations. In Chapter 4, we present an efficient distribution-free test to compare
two independent samples, where each is based on paired data. The statistical literature
has pointed out several issues related to the well-known classical tests for paired data
such as the t-test and Wilcoxon's test. For instance, the t-test is not appropriate for
data from strongly skewed distributions (see, e.g., Vexler et al., 2009); Wilcoxon's
test may break down completely when a nonconstant change is in effect
under the alternative hypothesis (see, e.g. Albers et al., 2001). In addition, the classical
Kolmogorov-Smirnov test is not designed to be based on paired data directly; therefore, it
may not be well suited to compare two samples based on paired observations. Due to the
drawbacks of these classical statistical procedures, Chapter 4 aims to provide an
alternative powerful nonparametric test based on paired data to compare treatment effects
between two study groups. We extend and modify the density-based EL ratio test
presented by Gurevich and Vexler (2011) to formulate an appropriate parametric
likelihood ratio test statistic corresponding to the hypothesis of our interest and then to
approximate the test statistic nonparametrically.
In many case-control studies, a great interest often lies in identifying treatment effects
within each therapy group as well as detecting between-group differences. To this end, in
a nonparametric setting, one can consider combining relevant nonparametric standard
procedures, for example, the Kolmogorov-Smirnov test and the Wilcoxon signed rank
test. The former test is a known procedure to compare distributions of two study groups,
whereas the latter one can be applied to detect treatment effects within each study group.
However, the use of the classical procedures commonly requires complex considerations
to combine the known nonparametric tests, e.g., considerations of combined p-values. In
Chapter 5, we propose simple nonparametric tests for three composite hypotheses related
to treatment effects to provide efficient tools that compare study groups utilizing paired
data. We adapt and extend the density-based EL methodology to deal with various testing
scenarios involved in the two-sample comparisons based on paired data. The proposed
technique is applied to compare two therapy strategies for treating children's attention deficit/hyperactivity disorder and severe mood dysregulation.
Parametric change point detection schemes based on the Shiryaev-Roberts approach
have been well addressed in the statistics and engineering literature. High efficiency of
such procedures can be partially explained by their known asymptotic optimal properties.
Recently, Shiryaev-Roberts based procedures were proposed and examined in applications
to the standard AMOC (at most one change) retrospective change point detection problems.
In Chapter 6, we review and extend parametric retrospective and sequential Shiryaev-Roberts based policies, examining different aspects of the non-asymptotic optimality of these procedures. We utilize the general principle of the Neyman-Pearson fundamental lemma to show that the Shiryaev-Roberts approach yields average most powerful procedures. We also introduce techniques to construct efficient retrospective tests
for multiple change-points detection. A real data example based on biomarker
measurements is provided to demonstrate implementation and effectiveness of new tests in
practice.
The last chapter of this dissertation outlines two possible directions for future work: (1) developing an EL ratio based goodness-of-fit test of normality based on several independent samples and errors in regression models; and (2) developing a simple likelihood ratio type test for independence between two random variables that requires neither the specification of any kind of dependence nor any assumptions on the forms of the data distributions.
CHAPTER 2
ESTIMATION AND TESTING BASED ON DATA
SUBJECT TO MEASUREMENT ERRORS: FROM
PARAMETRIC TO NON-PARAMETRIC
LIKELIHOOD METHODS
2.1 INTRODUCTION
Commonly, many biological and epidemiological studies deal with data subject to
measurement errors (MEs) attributed to instrumentation inaccuracies, within-subject
variation resulting from random fluctuations over time, etc. Ignoring the presence of ME
in data can result in biased or inconsistent estimation or testing. The statistical
literature proposed different methods for ME bias correction (e.g., Carroll et al., 1984,
1999; Carroll and Wand, 1991; Fuller, 1987; Liu and Liang, 1992; Schafer, 2001;
Stefanski, 1985; Stefanski and Carroll, 1987, 1990). Among others, one of the common
methods is to consider repeated measurements of biospecimens collecting sufficient
information for statistical inferences adjusted for ME effects (e.g., Hasabelnaby et al.,
1989). In practice, measurement processes based on bioassays can be costly and time-
consuming and can restrict the number of replicates of each individual available for
analysis or the number of individual biospecimens that can be used. It can follow that
investigators may not have enough observations to achieve the desired power or
efficiency in statistical inferences.
Dorfman (1943), Faraggi et al. (2003), Liu and Schisterman (2003), Liu et al. (2004),
Mumford et al. (2006), Schisterman et al. (2008, 2010), and Vexler et al. (2006, 2008,
2010, 2011) addressed pooling sampling strategies as an efficient approach to reduce the
overall cost of epidemiological studies. The basic idea of the pooling design is to pool
together individual biological samples (e.g., blood, plasma, serum or urine) and then
measure the pooled samples instead of each individual biospecimen. Since the pooling
design reduces the number of measurements without ignoring individual biospecimens,
the cost of the measurement process is reduced, but relevant information can still be
derived. Recently, it has been found that we can utilize a hybrid design that takes a
sample of both pooled and unpooled biospecimens to efficiently estimate unknown
parameters, allowing for MEs presence in the data without requiring repeated measures
(Schisterman et al., 2010).
In the context of the hybrid strategy, Schisterman et al. (2010) evaluated data that
follow normal distribution functions. In this chapter, we consider general cases of
parametric and nonparametric assumptions, comparing efficiency of pooled-unpooled
samples and data consisting of repeated measures. It should be noted that the repeated
measurement technique proposes to collect a large amount of information regarding just
nuisance parameters related to distribution functions of ME, whereas the pooled-
unpooled design provides observations that are informative regarding target variables
allowing for ME. Therefore, we show that the pooled-unpooled sampling strategy is more
efficient than the repeated measurement sampling procedure. We construct parametric
likelihoods based on both sampling methods. Additionally, in order to preserve
efficiencies of both strategies without parametric assumptions, we consider a
nonparametric approach using the empirical likelihood (EL) methodology (e.g., DiCiccio et al., 1989; Owen, 1988, 1991, 2001; Vexler et al., 2009, 2010; Vexler and Gurevich,
2010; Yu et al., 2010). We develop and apply novel EL ratio test statistics to create confidence interval estimators based on pooled-unpooled data and repeated measures
data. Despite the fact that many statistical inference procedures have been developed to
operate with data subject to ME, to our knowledge, relevant nonparametric likelihood
techniques and parametric likelihood methods have not been well addressed in the
literature.
The rest of this chapter is organized as follows. In Section 2.2, we present a general
form of the likelihood function based on repeated measures data and pooled-unpooled
data. We propose the EL methodology to make nonparametric inferences based on
repeated measures data and pooled-unpooled data in Section 2.3. We claim that the EL
technique based on the hybrid design provides a valuable technique to construct statistical
tests and estimators of parameters when MEs are present. To evaluate the proposed
approaches, we utilize Monte Carlo simulations in Section 2.4. We present an application
to cholesterol biomarker data from a study of coronary heart disease in Section 2.5. In
Section 2.6, we provide some concluding remarks.
2.2 PARAMETRIC INFERENCES
In this section, we derive general forms of the relevant likelihood functions. In each case,
we assume that the total number of biomarker measurements is fixed, say N; for example, N is the total number of measurements that a study budget allows us to execute.
2.2.1 PARAMETRIC LIKELIHOOD FUNCTIONS
2.2.1.1 PARAMETRIC LIKELIHOOD BASED ON REPEATED MEASURES
DATA
Suppose that we measure a biospecimen observing the score $Z_{ij} = X_i + \varepsilon_{ij}$, where the true values of biomarker measurements $X_i$ are independent identically distributed (i.i.d.) and $\varepsilon_{ij}$ are i.i.d. values of ME, $i = 1, \ldots, t$; $j = 1, \ldots, n_i$; $\sum_{i=1}^{t} n_i = N$. Thus, we assume that there is a subset of $t$ distinct bioassays, and each of them is repeatedly measured $n_i$ times. In this case, we can define the total number of available individual bioassays to be $T$, $T > t$, when we can consider obtaining a large number of individual biospecimens to have a low cost with respect to the high cost of measurement processes. We assume that $X_i$ and $\varepsilon_{ij}$ are independent. Firstly, we consider the simple normal case, say $X_i \sim N(\mu_x, \sigma_x^2)$ and $\varepsilon_{ij} \sim N(0, \sigma_\varepsilon^2)$. Accordingly, we observe $Z_{ij} \sim N(\mu_x, \sigma_x^2 + \sigma_\varepsilon^2)$. In this case, one can show that if $n_i = 1$ for all $i$, there are no unique solutions of the estimation of $\sigma_x^2$ and $\sigma_\varepsilon^2$ (non-identifiability). The observations $Z_{ij}$'s in each group $i$ are dependent because they are measured using the same bioassay. Note that if we fix the value of $X_i$, the $Z_{ij}$'s conditioned on $X_i$ are independent of each other; for example, in the case of $\varepsilon_{ij} \sim N(0, \sigma_\varepsilon^2)$, we have $Z_{ij} \mid X_i \sim N(X_i, \sigma_\varepsilon^2)$. In a general case, the likelihood function based on the repeated measures data has the general form of
$$L_R(\mu_x, \sigma_x^2, \sigma_\varepsilon^2 \mid Z) = \prod_{i=1}^{t} \int \Big[ f_X(u) \prod_{j=1}^{n_i} f_{Z \mid X}(Z_{ij} \mid u) \Big] du.$$
When the distributions of $X$ and $\varepsilon$ are known, we can obtain the specific likelihood functions, and further, we can also derive the maximum likelihood estimators of $\mu_x$, $\sigma_x^2$, and $\sigma_\varepsilon^2$. Well-known asymptotic results related to the maximum likelihood estimation give evaluations of properties of estimators based on the likelihood $L_R(\mu_x, \sigma_x^2, \sigma_\varepsilon^2 \mid Z)$.
2.2.1.2 PARAMETRIC LIKELIHOOD BASED ON POOLED AND
UNPOOLED DATA
We briefly address the basic concept of the pooling design. Let $T$ be the number of individual biospecimens available and $N$ be the total number of measurements that we can obtain due to a limited study budget. We obtain the pooled samples by randomly grouping individual samples into groups of size $p$, where $N = [T/p]$ is the number of pooled measurements, $p$ is the number of individual samples in a pooling group, and $[u]$ denotes the integer part of $u$. The pooling design requires a physical combination of specimens of the same group and a test of each pooled specimen, obtaining a single observation when the pooled sample is measured. Since the measurements are generally per unit of volume, we assume that the true measurement for a pooled set is the average of the true individual marker values in that group. In this case, taking into account that instruments applied to the measurement process can be sensitive and subject to some random exposure ME, we define a single observation to be the sum of the average of individual marker values and a value of ME. Note that, in accordance with the pooling literature, we assume that analysis of the biomarkers is restricted by the high cost of the measurement process, whereas access to a large number of individual biospecimens can be considered to have a relatively low cost.

In the hybrid design, we assume that $T$ distinct individual bioassays are available, but still we can provide just $N$ measurements ($N < T$). The proportion of pooled samples is $q \in [0, 1]$, and the pooling group size is $p$; namely, $T = pqN + (1-q)N$. Specifically, we can obtain pooled data by mixing $p$ individual bioassays together, and we therefore divide $pqN$ bioassays into $n_P$ groups, where $n_P = qN$. We measure the grouped biospecimens as single observations. Let $Z^P_j$, $j = 1, \ldots, n_P$, denote measurements of pooled bioassays. In accordance with the literature, we have
$$Z^P_j = \frac{1}{p} \sum_{i=(j-1)p+1}^{jp} X_i + \varepsilon_j$$
(see, e.g., Faraggi et al., 2003; Liu and Schisterman, 2003; Liu et al., 2004; Schisterman et al., 2008, 2010; Vexler et al., 2006, 2008, 2010, 2011). Hence, we can obtain that $Z^P_j$ are i.i.d. with the mean $\mu_x$ and the variance $\sigma_x^2/p + \sigma_\varepsilon^2$, namely,
$$Z^P_j \ \text{i.i.d.} \ \Big(\mu_x, \ \frac{\sigma_x^2}{p} + \sigma_\varepsilon^2 \Big).$$
The unpooled samples are based on $n_U = (1-q)N$ independent observations
$$Z^U_i = X_i + \varepsilon_i, \quad i = 1, \ldots, n_U.$$
In this case, we have $Z^U_i$ i.i.d. $(\mu_x, \ \sigma_x^2 + \sigma_\varepsilon^2)$.

Note that the pooled and unpooled samples are independent of each other. As a result, the likelihood function based on the combination of pooled and unpooled data has the form of
$$L_H(\mu_x, \sigma_x^2, \sigma_\varepsilon^2 \mid Z^P, Z^U) = \prod_{j=1}^{n_P} f_P(Z^P_j) \prod_{i=1}^{n_U} f_U(Z^U_i).$$
If we know the distribution functions of $X$ and $\varepsilon$, we can derive the likelihood functions according to the distributions of $Z^P_j$ and $Z^U_i$. Therefore, we can also obtain the corresponding theoretical maximum likelihood estimators of $(\mu_x, \sigma_x^2, \sigma_\varepsilon^2)$. Since the estimators follow the maximum likelihood methodology, we can easily show the asymptotic properties of the estimators.
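As a small illustration of the hybrid sampling scheme described above, the following sketch (our own, with hypothetical parameter values, not code from this work) generates pooled and unpooled measurements under the additive ME model:

```python
import numpy as np

rng = np.random.default_rng(1)
N, p, q = 100, 2, 0.5                 # measurement budget, pool size, pooling proportion
n_p = int(q * N)                      # number of pooled measurements
n_u = N - n_p                         # number of unpooled measurements
T = p * n_p + n_u                     # individual biospecimens required: pqN + (1-q)N
mu_x, var_x, var_e = 1.0, 1.0, 0.4    # hypothetical values matching the simulations below

x = rng.normal(mu_x, np.sqrt(var_x), T)           # true biomarker values
# pooled measurements: per-group averages plus one ME per measured pool
zp = x[:p * n_p].reshape(n_p, p).mean(axis=1) + rng.normal(0.0, np.sqrt(var_e), n_p)
# unpooled measurements: individual values plus ME
zu = x[p * n_p:] + rng.normal(0.0, np.sqrt(var_e), n_u)
```

Here var(zp) targets $\sigma_x^2/p + \sigma_\varepsilon^2$ and var(zu) targets $\sigma_x^2 + \sigma_\varepsilon^2$, so a budget of $N = 100$ measurements draws on $T = 150$ specimens.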
2.2.2 NORMAL CASE
In this subsection, we assume $X_i \sim N(\mu_x, \sigma_x^2)$ and $\varepsilon_{ij} \sim N(0, \sigma_\varepsilon^2)$. Then we obtain closed-form analytical solutions for the maximum likelihood estimators of the unknown parameters, $\mu_x$, $\sigma_x^2$, and $\sigma_\varepsilon^2$.
2.2.2.1 MAXIMUM LIKELIHOOD ESTIMATORS BASED ON REPEATED
MEASURES
Assume that $Z_{ij} = X_i + \varepsilon_{ij}$; $i = 1, \ldots, t$; $j = 1, \ldots, n_i$. By the additive property of the normal distributions, we have $Z_{ij} \sim N(\mu_x, \sigma_x^2 + \sigma_\varepsilon^2)$. Referring to Searle et al. (1992), the likelihood function is a well-known result that can be expressed as
$$L_R(\mu_x, \sigma_x^2, \sigma_\varepsilon^2 \mid Z) = \prod_{i=1}^{t} (2\pi)^{-n_i/2} (\sigma_\varepsilon^2)^{-(n_i-1)/2} (\sigma_\varepsilon^2 + n_i \sigma_x^2)^{-1/2} \exp\Big\{ -\frac{\sum_{j=1}^{n_i}(Z_{ij} - \bar{Z}_{i\cdot})^2}{2\sigma_\varepsilon^2} - \frac{n_i(\bar{Z}_{i\cdot} - \mu_x)^2}{2(\sigma_\varepsilon^2 + n_i \sigma_x^2)} \Big\},$$
where $\bar{Z}_{i\cdot} = n_i^{-1} \sum_{j=1}^{n_i} Z_{ij}$.

Under the assumption that the $n_i$'s are equal to $n$ (i.e., assuming balanced data), the log likelihood function is in the form of
$$\log L_R = -\frac{tn}{2}\log(2\pi) - \frac{t(n-1)}{2}\log \sigma_\varepsilon^2 - \frac{t}{2}\log(\sigma_\varepsilon^2 + n\sigma_x^2) - \frac{SSE}{2\sigma_\varepsilon^2} - \frac{SSA}{2(\sigma_\varepsilon^2 + n\sigma_x^2)} - \frac{tn(\bar{Z}_{\cdot\cdot} - \mu_x)^2}{2(\sigma_\varepsilon^2 + n\sigma_x^2)},$$
where $\bar{Z}_{\cdot\cdot} = (tn)^{-1}\sum_{i=1}^{t}\sum_{j=1}^{n} Z_{ij}$, $SSE = \sum_{i=1}^{t}\sum_{j=1}^{n}(Z_{ij} - \bar{Z}_{i\cdot})^2$, and $SSA = n\sum_{i=1}^{t}(\bar{Z}_{i\cdot} - \bar{Z}_{\cdot\cdot})^2$.
Let $\lambda = \sigma_\varepsilon^2 + n\sigma_x^2$. By taking the partial derivatives of $\log L_R$ with respect to $\mu_x$, $\sigma_x^2$, and $\sigma_\varepsilon^2$ and setting the equations equal to zero, we obtain the maximum likelihood equations with the roots
$$\hat{\mu}_x = \bar{Z}_{\cdot\cdot}, \quad \hat{\sigma}_\varepsilon^2 = \frac{SSE}{t(n-1)}, \quad \text{and} \quad \hat{\lambda} = \frac{SSA}{t}.$$
Thus, the maximum likelihood estimator of $\mu_x$ is $\bar{Z}_{\cdot\cdot}$, and the maximum likelihood estimators of $\sigma_x^2$ and $\sigma_\varepsilon^2$ are $(\hat{\lambda} - \hat{\sigma}_\varepsilon^2)/n$ and $SSE/(t(n-1))$, respectively, when $\hat{\lambda} \ge \hat{\sigma}_\varepsilon^2$; and $0$ and $(SSA + SSE)/(tn)$, respectively, when $\hat{\lambda} < \hat{\sigma}_\varepsilon^2$.

Also, the large-sample variances and covariance of $\hat{\sigma}_x^2$ and $\hat{\sigma}_\varepsilon^2$ are given by
$$\operatorname{var}(\hat{\sigma}_x^2) = \frac{2}{tn^2}\Big[(\sigma_\varepsilon^2 + n\sigma_x^2)^2 + \frac{\sigma_\varepsilon^4}{n-1}\Big], \quad \operatorname{var}(\hat{\sigma}_\varepsilon^2) = \frac{2\sigma_\varepsilon^4}{t(n-1)}, \quad \operatorname{cov}(\hat{\sigma}_x^2, \hat{\sigma}_\varepsilon^2) = -\frac{2\sigma_\varepsilon^4}{tn(n-1)}$$
(for details, see Searle et al., 1992).

By the property of the maximum likelihood estimators, it is clear that asymptotically these estimators follow a multivariate normal distribution:
$$\sqrt{t}\,\big[\hat{\mu}_x - \mu_x, \ \hat{\sigma}_x^2 - \sigma_x^2, \ \hat{\sigma}_\varepsilon^2 - \sigma_\varepsilon^2\big]^{\mathsf T} \to N(0, \Sigma) \quad \text{as } t \to \infty,$$
where
$$\Sigma = \begin{bmatrix} \dfrac{\sigma_\varepsilon^2 + n\sigma_x^2}{n} & 0 & 0 \\ 0 & \dfrac{2}{n^2}\Big[(\sigma_\varepsilon^2 + n\sigma_x^2)^2 + \dfrac{\sigma_\varepsilon^4}{n-1}\Big] & -\dfrac{2\sigma_\varepsilon^4}{n(n-1)} \\ 0 & -\dfrac{2\sigma_\varepsilon^4}{n(n-1)} & \dfrac{2\sigma_\varepsilon^4}{n-1} \end{bmatrix}.$$
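Under the balanced normal model, the closed-form ML computations of Section 2.2.2.1 can be sketched as follows (a minimal illustration of our own with hypothetical parameter values; the truncation at zero handles the non-negativity of the variance component):

```python
import numpy as np

rng = np.random.default_rng(0)
t, n = 2000, 2                              # t bioassays, n replicates each
mu_x, var_x, var_e = 1.0, 1.0, 0.4          # hypothetical true parameter values

x = rng.normal(mu_x, np.sqrt(var_x), t)     # true biomarker values
z = x[:, None] + rng.normal(0.0, np.sqrt(var_e), (t, n))   # repeated measures Z_ij

zbar_i = z.mean(axis=1)                     # per-assay means
zbar = z.mean()                             # grand mean
sse = ((z - zbar_i[:, None]) ** 2).sum()    # within-assay sum of squares
ssa = n * ((zbar_i - zbar) ** 2).sum()      # between-assay sum of squares

mu_hat = zbar                               # MLE of mu_x
var_e_hat = sse / (t * (n - 1))             # MLE of sigma_e^2 (interior case)
var_x_hat = max((ssa / t - var_e_hat) / n, 0.0)   # MLE of sigma_x^2, truncated at 0
```

With a budget of $N = tn$ measurements, increasing $n$ sharpens the estimate of $\sigma_\varepsilon^2$ while leaving fewer distinct assays to inform $\mu_x$ and $\sigma_x^2$, which is the trade-off examined in Section 2.4.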
2.2.2.2 MAXIMUM LIKELIHOOD ESTIMATORS UNDER THE HYBRID DESIGN

Since we assume that $X_i \sim N(\mu_x, \sigma_x^2)$ and $\varepsilon \sim N(0, \sigma_\varepsilon^2)$, we can write
$$Z^P_j \sim N\Big(\mu_x, \ \frac{\sigma_x^2}{p} + \sigma_\varepsilon^2\Big), \ j = 1, \ldots, n_P, \quad \text{and} \quad Z^U_i \sim N(\mu_x, \ \sigma_x^2 + \sigma_\varepsilon^2), \ i = 1, \ldots, n_U.$$
The likelihood function based on pooled-unpooled data then takes the form
$$L_H(\mu_x, \sigma_x^2, \sigma_\varepsilon^2 \mid Z^P, Z^U) = \prod_{j=1}^{n_P} \Big\{2\pi\Big(\frac{\sigma_x^2}{p} + \sigma_\varepsilon^2\Big)\Big\}^{-1/2} \exp\Big\{-\frac{(Z^P_j - \mu_x)^2}{2(\sigma_x^2/p + \sigma_\varepsilon^2)}\Big\} \prod_{i=1}^{n_U} \big\{2\pi(\sigma_x^2 + \sigma_\varepsilon^2)\big\}^{-1/2} \exp\Big\{-\frac{(Z^U_i - \mu_x)^2}{2(\sigma_x^2 + \sigma_\varepsilon^2)}\Big\}.$$
Differentiating the log likelihood function, $\log L_H(\mu_x, \sigma_x^2, \sigma_\varepsilon^2 \mid Z^P, Z^U)$, with respect to $\mu_x$, $\sigma_x^2$, and $\sigma_\varepsilon^2$, respectively, we obtain the maximum likelihood estimators of $\mu_x$, $\sigma_x^2$, and $\sigma_\varepsilon^2$ given by
$$\hat{\mu}_x = \Big(\frac{n_P \bar{Z}^P}{\hat{\sigma}_x^2/p + \hat{\sigma}_\varepsilon^2} + \frac{n_U \bar{Z}^U}{\hat{\sigma}_x^2 + \hat{\sigma}_\varepsilon^2}\Big) \Big/ \Big(\frac{n_P}{\hat{\sigma}_x^2/p + \hat{\sigma}_\varepsilon^2} + \frac{n_U}{\hat{\sigma}_x^2 + \hat{\sigma}_\varepsilon^2}\Big),$$
$$\hat{\sigma}_x^2 = \frac{p}{p-1}\Big[\frac{1}{n_U}\sum_{i=1}^{n_U}(Z^U_i - \hat{\mu}_x)^2 - \frac{1}{n_P}\sum_{j=1}^{n_P}(Z^P_j - \hat{\mu}_x)^2\Big], \quad \hat{\sigma}_\varepsilon^2 = \frac{1}{p-1}\Big[\frac{p}{n_P}\sum_{j=1}^{n_P}(Z^P_j - \hat{\mu}_x)^2 - \frac{1}{n_U}\sum_{i=1}^{n_U}(Z^U_i - \hat{\mu}_x)^2\Big]. \quad (2.1)$$
Note that the estimator of $\mu_x$ has a structure that weighs estimations based on pooled and unpooled data in a similar manner to a Bayes point estimator used in normal-normal models (see Carlin and Louis, 2008). In this case, we show that we can obtain inference regarding the parameters by using this hybrid approach without repeating measures on the same individual, which is the most common strategy to solve ME problems.

By virtue of the properties of the maximum likelihood estimators, the distribution of the estimators (2.1) is asymptotically
$$\sqrt{N}\,\big[\hat{\mu}_x - \mu_x, \ \hat{\sigma}_x^2 - \sigma_x^2, \ \hat{\sigma}_\varepsilon^2 - \sigma_\varepsilon^2\big]^{\mathsf T} \to N(0, I^{-1}),$$
where $I^{-1}$ is the inverse of the Fisher information matrix,
$$I = -E \begin{bmatrix} \partial^2 l/\partial \mu_x^2 & \partial^2 l/\partial \mu_x \partial \sigma_x^2 & \partial^2 l/\partial \mu_x \partial \sigma_\varepsilon^2 \\ \partial^2 l/\partial \sigma_x^2 \partial \mu_x & \partial^2 l/\partial (\sigma_x^2)^2 & \partial^2 l/\partial \sigma_x^2 \partial \sigma_\varepsilon^2 \\ \partial^2 l/\partial \sigma_\varepsilon^2 \partial \mu_x & \partial^2 l/\partial \sigma_\varepsilon^2 \partial \sigma_x^2 & \partial^2 l/\partial (\sigma_\varepsilon^2)^2 \end{bmatrix},$$
where $l = \log L_H(\mu_x, \sigma_x^2, \sigma_\varepsilon^2 \mid Z^P, Z^U)$ is the corresponding log likelihood function (for details, see Appendix A.1.1).
2.2.2.3 REMARKS ON THE NORMAL CASE
As shown above, when biomarker values and MEs are normally distributed, the maximum likelihood estimators exist and can be easily obtained. It is also clear that we can consider these estimators as least squares estimators in a nonparametric context.
However, when data are not from normal distributions, it may be very complicated or
even be infeasible to extract the distributions of repeated measures data or pooled and
unpooled data (e.g., Vexler et al., 2011). For example, in various situations, closed
analytical forms of the likelihood functions cannot be found based on pooled data
because the density function of the pooled biospecimen values involves complex
convolutions of p-individual biospecimen values. Consequently, it is reasonable to
consider efficient nonparametric inference methodologies based on the repeated measures
data or pooled-unpooled data.
2.3 EMPIRICAL LIKELIHOOD METHOD
In this section, we apply the EL methodology to the statement of the problem in this
chapter. The EL technique has been extensively developed as a nonparametric
approximation of the parametric likelihood approach (e.g., DiCiccio et al., 1989; Owen,
1988, 1991, 2001; Vexler et al. 2009, 2010; Vexler and Gurevich, 2010; Yu et al., 2010).
We begin by outlining the EL ratio method and then modify the EL ratio test to construct confidence interval estimators and tests based on repeated measures data and pooled-unpooled data.
2.3.1 THE EMPIRICAL LIKELIHOOD RATIO TEST
Consider the following simple testing problem that is stated nonparametrically. Suppose i.i.d. random variables $X_1, \ldots, X_n$ with $E(X_1) = \mu$ and $E|X_1|^3 < \infty$ are observable. The problem of interest, for example, is to test the hypothesis
$$H_0: \mu = \mu_0 \quad \text{versus} \quad H_1: \mu \neq \mu_0, \quad (2.2)$$
where $\mu_0$ is fixed and known. To test the hypothesis in equation (2.2), we can write the EL function as $L(\mu) = \prod_{i=1}^{n} p_i$, where we assume the $p_i$'s to have values that maximize $\prod_{i=1}^{n} p_i$ given empirical constraints. The empirical constraints correspond to the hypotheses' settings. Then, under the null hypothesis in equation (2.2), we maximize $\prod_{i=1}^{n} p_i$ subject to $\sum_{i=1}^{n} p_i = 1$ and $\sum_{i=1}^{n} p_i X_i = \mu_0$. Here the condition $\sum_{i=1}^{n} p_i X_i = \mu_0$ is an empirical form of $E(X_1) = \mu_0$. Using the Lagrange multipliers, one can show that the maximum EL function has the form of
$$L(\mu_0) = \prod_{i=1}^{n} \frac{1}{n\{1 + \lambda(X_i - \mu_0)\}},$$
where $\lambda$ is a root of
$$\sum_{i=1}^{n} \frac{X_i - \mu_0}{1 + \lambda(X_i - \mu_0)} = 0.$$
Similarly, under the alternative hypothesis, the maximum EL function has the simple form $\prod_{i=1}^{n} n^{-1} = n^{-n}$. As a consequence, the 2log EL ratio test for equation (2.2) is based on
$$\ell(\mu_0) = -2\log\big[L(\mu_0)\big/n^{-n}\big] = 2\sum_{i=1}^{n} \log\{1 + \lambda(X_i - \mu_0)\}.$$
It is proven in Owen (2001) that the 2log EL ratio, $\ell(\mu_0)$, asymptotically follows a $\chi_1^2$ distribution as $n \to \infty$. Thus, we reject the null hypothesis at a significance level $\alpha$ if $\ell(\mu_0) > \chi^2_{1, 1-\alpha}$. Furthermore, we can construct the confidence interval estimator of $\mu$ as
$$\big\{\mu: \ell(\mu) \le \chi^2_{1, 1-\alpha}\big\}.$$
(Here, $\chi^2_{1, 1-\alpha}$ is the $100(1-\alpha)$th percentile of a $\chi^2$ distribution with one degree of freedom.)
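The construction above takes only a few lines to implement. The sketch below (our own illustration, not code from this work) solves the Lagrange equation for $\lambda$ by bisection, exploiting that the score function is strictly decreasing on the admissible range of $\lambda$:

```python
import numpy as np

def el_ratio(x, mu0):
    """2*log empirical likelihood ratio for H0: E[X] = mu0 (Owen-type EL for a mean)."""
    d = np.asarray(x, float) - mu0
    if d.min() >= 0 or d.max() <= 0:
        return np.inf                        # mu0 outside the data's convex hull
    # admissible lambda keeps all weights positive: 1 + lam * d_i > 0 for every i
    lo = (-1 + 1e-10) / d.max()
    hi = (-1 + 1e-10) / d.min()
    score = lambda lam: np.sum(d / (1.0 + lam * d))   # strictly decreasing in lam
    for _ in range(200):                     # bisection for the root of the score
        mid = 0.5 * (lo + hi)
        if score(mid) > 0:
            lo = mid
        else:
            hi = mid
    lam = 0.5 * (lo + hi)
    return 2.0 * np.sum(np.log1p(lam * d))

rng = np.random.default_rng(7)
x = rng.normal(1.0, 1.0, 200)               # hypothetical sample with true mean 1
stat = el_ratio(x, 1.0)                     # compare with the chi2(1) 95% point, 3.841
```

A confidence interval for the mean collects the values $\mu$ with `el_ratio(x, mu)` at most 3.841; the statistic is zero at the sample mean and grows as $\mu_0$ moves away from it.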
2.3.2 THE EMPIRICAL LIKELIHOOD METHOD BASED
ON REPEATED MEASURES DATA
Following the statement mentioned in Section 2.2, we have correlated data with repeated measures. In order to obtain an i.i.d. sample, we utilize the fact that $Z_{i_1 j}$ is independent of $Z_{i_2 k}$ when $i_1 \neq i_2$. Therefore, we give an EL function for the block sample means $\bar{Z}_{i\cdot} = n_i^{-1} \sum_{j=1}^{n_i} Z_{ij}$, $i = 1, \ldots, t$, in a similar manner to the blockwise EL method given in Kitamura (1997). Then, the random variables become $\bar{Z}_{1\cdot}, \ldots, \bar{Z}_{t\cdot}$, and the corresponding EL function for $\mu_x$ is given by
$$L_R(\mu_x) = \prod_{i=1}^{t} \frac{1}{t\{1 + \lambda(\bar{Z}_{i\cdot} - \mu_x)\}},$$
where $\lambda$ is a root of
$$\sum_{i=1}^{t} \frac{\bar{Z}_{i\cdot} - \mu_x}{1 + \lambda(\bar{Z}_{i\cdot} - \mu_x)} = 0.$$
In this case, the 2log EL ratio test statistic is in the form of
$$\ell_R(\mu_x) = 2\sum_{i=1}^{t} \log\{1 + \lambda(\bar{Z}_{i\cdot} - \mu_x)\}.$$

Proposition 2.1. Assume $E|\bar{Z}_{1\cdot}|^3 < \infty$. Then the 2log EL ratio, $\ell_R(\mu_x)$, follows a $\chi_1^2$ distribution, where $n_i \ge 1$, $1 \le i \le t$, as $t \to \infty$.

Proof. See Appendix A.1.2.1.

The associated confidence interval estimator is then given by $\{\mu_x: \ell_R(\mu_x) \le \chi^2_{1, 1-\alpha}\}$, where $\chi^2_{1, 1-\alpha}$ is the $100(1-\alpha)$th percentile of a $\chi^2$ distribution with one degree of freedom.
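A quick numerical check of this blockwise reduction (a sketch of our own with hypothetical parameter values): the block means are i.i.d. with mean $\mu_x$ and variance $\sigma_x^2 + \sigma_\varepsilon^2/n$, so the standard one-sample EL machinery of Section 2.3.1 applies to them directly.

```python
import numpy as np

rng = np.random.default_rng(2)
t, n = 5000, 4                      # t independent bioassays, n replicates each
mu_x, var_x, var_e = 1.0, 1.0, 0.4  # hypothetical true parameter values

x = rng.normal(mu_x, np.sqrt(var_x), t)
z = x[:, None] + rng.normal(0.0, np.sqrt(var_e), (t, n))   # correlated within rows
zbar = z.mean(axis=1)   # i.i.d. block means: mean mu_x, variance var_x + var_e / n
```

Averaging within blocks removes the within-assay dependence at the price of reducing the effective sample size from $tn$ observations to $t$ block means.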
2.3.3 THE EMPIRICAL LIKELIHOOD METHOD BASED
ON POOLED-UNPOOLED DATA
In this section, we consider two distribution-free alternatives to the parametric likelihood method mentioned in Section 2.2.1.2. To this end, we apply the EL technique. Note that, in contrast to data that consist of repeated measures, in this section we use data that are based on independent observations. Consequently, we can introduce a combined EL function for the mean based on two independent samples, that is, $Z^P_1, \ldots, Z^P_{n_P}$ i.i.d. $(\mu_x, \sigma_x^2/p + \sigma_\varepsilon^2)$ and $Z^U_1, \ldots, Z^U_{n_U}$ i.i.d. $(\mu_x, \sigma_x^2 + \sigma_\varepsilon^2)$, representing measurements that correspond to pooled and unpooled biospecimens, respectively. Under the null hypothesis $H_0: \mu_x = \mu_0$, the EL function for $\mu_x$ can be presented in the form of
$$L(\mu_0) = \prod_{j=1}^{n_P} \frac{1}{n_P\{1 + \lambda_P(Z^P_j - \mu_0)\}} \prod_{i=1}^{n_U} \frac{1}{n_U\{1 + \lambda_U(Z^U_i - \mu_0)\}},$$
where $\lambda_P$ and $\lambda_U$ are roots of the equations
$$\sum_{j=1}^{n_P} \frac{Z^P_j - \mu_0}{1 + \lambda_P(Z^P_j - \mu_0)} = 0 \quad \text{and} \quad \sum_{i=1}^{n_U} \frac{Z^U_i - \mu_0}{1 + \lambda_U(Z^U_i - \mu_0)} = 0.$$
Finally, the 2log EL ratio test statistic can be given in the form of
$$\ell_1(\mu_0) = 2\sum_{j=1}^{n_P} \log\{1 + \lambda_P(Z^P_j - \mu_0)\} + 2\sum_{i=1}^{n_U} \log\{1 + \lambda_U(Z^U_i - \mu_0)\}. \quad (2.3)$$
In a similar manner to common EL considerations, one can show that the statistics $2\sum_{j=1}^{n_P} \log\{1 + \lambda_P(Z^P_j - \mu_0)\}$ and $2\sum_{i=1}^{n_U} \log\{1 + \lambda_U(Z^U_i - \mu_0)\}$ each asymptotically follow a $\chi_1^2$ distribution. By virtue of the additive property of $\chi^2$ distributions, the 2log EL ratio, $\ell_1(\mu_0)$, has an asymptotic $\chi^2$ distribution with two degrees of freedom. Thus, we formulate the next proposition.

Proposition 2.2. Let $E|Z^P_1|^3 < \infty$ and $E|Z^U_1|^3 < \infty$. Then the 2log EL ratio, $\ell_1(\mu_0)$, has a $\chi_2^2$ distribution as $n_P, n_U \to \infty$.

Proof. See Appendix A.1.2.2.

The corresponding confidence interval estimator is
$$\{\mu_x: \ell_1(\mu_x) \le \chi^2_{2, 1-\alpha}\}, \quad (2.4)$$
where $\chi^2_{2, 1-\alpha}$ is the $100(1-\alpha)$th percentile of a $\chi^2$ distribution with two degrees of freedom.

In practice, to execute the procedure above, we can directly use standard programs related to the classical EL ratio tests; for example, the code el.test of the R software can be utilized to conduct the EL confidence interval estimation (2.4).
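A minimal sketch of the simple combined statistic in equation (2.3) (our own illustration with hypothetical parameter values, not code from this work): compute a one-sample 2log EL ratio for each of the pooled and unpooled samples and sum them, comparing against the chi-squared critical value with two degrees of freedom.

```python
import numpy as np

def el2log(x, mu0):
    # one-sample 2*log EL ratio for H0: E[X] = mu0, via bisection on lambda
    d = np.asarray(x, float) - mu0
    if d.min() >= 0 or d.max() <= 0:
        return np.inf                      # mu0 outside the data's convex hull
    lo, hi = (-1 + 1e-10) / d.max(), (-1 + 1e-10) / d.min()
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if np.sum(d / (1.0 + mid * d)) > 0:
            lo = mid
        else:
            hi = mid
    return 2.0 * np.sum(np.log1p(0.5 * (lo + hi) * d))

rng = np.random.default_rng(4)
p, n_p, n_u, mu0 = 2, 50, 50, 1.0
zp = rng.normal(1.0, np.sqrt(1.0 / p + 0.4), n_p)   # pooled sample
zu = rng.normal(1.0, np.sqrt(1.0 + 0.4), n_u)       # unpooled sample
l1 = el2log(zp, mu0) + el2log(zu, mu0)              # combined statistic, two terms
reject = l1 > 5.991                                 # chi2(2) 95% critical value
```

Because the two samples are independent, the two one-degree-of-freedom terms add to an asymptotic two-degree-of-freedom reference distribution, as stated in Proposition 2.2.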
The EL technique mentioned above does not use an empirical version of the rule
$$E(Z^P_1 - \mu_x)^2 = \frac{1}{p} E(Z^U_1 - \mu_x)^2 + \Big(1 - \frac{1}{p}\Big)\sigma_\varepsilon^2 \quad (2.5)$$
that connects the second moments derived from pooled and unpooled observations. Intuitively, using a constraint related to equation (2.5), one can increase the power of the EL approach. Consider the EL function for $\mu_x$ under the null hypothesis,
$$L_C(\mu_0) = \max\Big\{ \prod_{j=1}^{n_P} \omega_j \prod_{i=1}^{n_U} \delta_i : \ \sum_j \omega_j = 1, \ \sum_i \delta_i = 1, \ \sum_j \omega_j Z^P_j = \mu_0, \ \sum_i \delta_i Z^U_i = \mu_0, \ \sum_j \omega_j (Z^P_j - \mu_0)^2 = \frac{\hat{\sigma}_{x0}^2}{p} + \hat{\sigma}_{\varepsilon 0}^2, \ \sum_i \delta_i (Z^U_i - \mu_0)^2 = \hat{\sigma}_{x0}^2 + \hat{\sigma}_{\varepsilon 0}^2 \Big\},$$
as an alternative to the simple EL function $L(\mu_0)$. Here, $\hat{\sigma}_{x0}^2$ and $\hat{\sigma}_{\varepsilon 0}^2$ are the estimators from equation (2.1) that are defined under the null hypothesis, i.e., with $\hat{\mu}_x$ replaced by $\mu_0$. The maximizing weights $\omega_j$ and $\delta_i$ follow from Lagrange multipliers that are roots of the equations mentioned under the max operator in the definition of $L_C(\mu_0)$.

Likewise, under the alternative hypothesis, we maximize the EL function $\prod_{j=1}^{n_P} \omega_j \prod_{i=1}^{n_U} \delta_i$ subject to
$$\sum_j \omega_j = 1, \ \sum_i \delta_i = 1, \ \sum_j \omega_j Z^P_j = \sum_i \delta_i Z^U_i = \hat{\mu}_x, \ \sum_j \omega_j (Z^P_j - \hat{\mu}_x)^2 = \frac{\hat{\sigma}_x^2}{p} + \hat{\sigma}_\varepsilon^2, \ \sum_i \delta_i (Z^U_i - \hat{\mu}_x)^2 = \hat{\sigma}_x^2 + \hat{\sigma}_\varepsilon^2, \quad (2.6)$$
where $\hat{\mu}_x$, $\hat{\sigma}_x^2$, and $\hat{\sigma}_\varepsilon^2$ are the maximum likelihood estimators from equation (2.1). Thus, the EL under the alternative hypothesis, $L_C(\hat{\mu}_x)$, is obtained by numerically deriving the corresponding Lagrange multipliers and weights using the equations (2.6). As a result, the corresponding 2log EL ratio test statistic is
$$\ell_2(\mu_0) = 2\big[\log L_C(\hat{\mu}_x) - \log L_C(\mu_0)\big]. \quad (2.7)$$
Note that, following Qin and Lawless (1994), $\ell_2(\mu_0)$ is asymptotically equivalent to the maximum log EL ratio test statistic. By virtue of results mentioned in Qin and Lawless (1994), $\ell_2(\mu_0)$ asymptotically follows a $\chi_1^2$ distribution. Then we reject the null hypothesis at a significance level $\alpha$ when
$$\ell_2(\mu_0) > \chi^2_{1, 1-\alpha},$$
where $\chi^2_{1, 1-\alpha}$ is the $100(1-\alpha)$th percentile of a $\chi_1^2$ distribution. Moreover, the corresponding confidence interval is $\{\mu_x: \ell_2(\mu_x) \le \chi^2_{1, 1-\alpha}\}$, where $\chi^2_{1, 1-\alpha}$ is the $100(1-\alpha)$th percentile of a $\chi_1^2$ distribution.
The Monte Carlo simulation study presented in the next section examines the
performance of each EL method mentioned above.
2.4 MONTE CARLO EXPERIMENTS
In this section, we conducted an extensive Monte Carlo study to evaluate the performance
of the parametric and nonparametric likelihood methods proposed in Sections 2.2 and 2.3.
2.4.1 SIMULATION SETTINGS
Examining the repeated measures sampling method, we randomly generated samples of values $X_i$ from a normal distribution with mean $E(X_1) = \mu_x$ and variance $\operatorname{var}(X_1) = \sigma_x^2$. We let $n$ denote the number of replicates for each subject. For simplicity, we assumed that each subject had the same number of replicates (i.e., assuming balanced data). Then, in a similar manner, we randomly generated normally distributed MEs, $\varepsilon_{ij}$'s, having $E(\varepsilon_{11}) = 0$ and $\operatorname{var}(\varepsilon_{11}) = \sigma_\varepsilon^2$, $i = 1, \ldots, t$; $j = 1, \ldots, n$. Therefore, we constructed samples of $Z_{ij} = X_i + \varepsilon_{ij}$. Each sample had $N = tn$ observations.

To obtain the hybrid samples, we first generated a sample of size $T$, where $T = pqN + (1-q)N$, $X_i \sim N(\mu_x, \sigma_x^2)$, $i = 1, \ldots, T$, to represent available individual bioassays. Then we proceeded to generate pooled data. To this end, we pooled $pqN$, $q \in [0, 1]$, samples of $X$'s to constitute pooled data, where we assumed $qN$ to be an integer and the $X_i$'s, $i = 1, \ldots, T$, were i.i.d. random samples from a normal distribution with mean $E(X_1) = \mu_x$ and $\operatorname{var}(X_1) = \sigma_x^2$. Following the pooling literature, if there were no MEs, we assumed the average values of the pooled biospecimens, $\bar{X}^P_j = p^{-1}\sum_{i=(j-1)p+1}^{jp} X_i$, to be observed, and we could represent them as the measurements of pooled bioassays. We took the remaining $(1-q)N$ observations $X_i$ as individual measurements. For each observation, we randomly generated a ME from a normal distribution. Combining the pooled samples, $Z^P_j = \bar{X}^P_j + \varepsilon_j$, $j = 1, \ldots, qN$, with the unpooled samples, $Z^U_i = X_i + \varepsilon_i$, we obtained pooled-unpooled data with the total sample size $N$ equal to that in the Monte Carlo evaluations related to the repeated measures approaches.

To evaluate the performance of the proposed methods, we applied the following simulation settings: the fixed significance level was 0.05; $\mu_x = 1$ and $\sigma_x^2 = 1$; $\sigma_\varepsilon^2 = 0.4, 1$; $n = 2, 5, 10$; the pooling group size $p = 2, 5, 10$; the pooling proportion $q = 0.5$; the total sample size $N = 100, 300$. For each set of parameters, there were 10,000 (Monte Carlo) data generations. In this section, following the pooling literature, we assumed that the simulated analysis of biomarkers was restricted to execute just $N$ measurements and that $T = pqN + (1-q)N$ individual biospecimens are available when the hybrid design was compared with the repeated measures sampling method. The Monte Carlo simulation results are presented in the next subsection.
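The Monte Carlo logic of this subsection can be sketched as follows (a scaled-down illustration of our own, with fewer repetitions than the study above; it checks the coverage of a simple normal-theory interval for $\mu_x$ built from the i.i.d. block means):

```python
import numpy as np

rng = np.random.default_rng(3)
mu_x, var_x, var_e = 1.0, 1.0, 0.4
t, n, reps = 50, 2, 2000                   # N = t*n = 100 measurements per sample
cover = 0
for _ in range(reps):
    x = rng.normal(mu_x, np.sqrt(var_x), t)
    z = x[:, None] + rng.normal(0.0, np.sqrt(var_e), (t, n))
    zbar_i = z.mean(axis=1)                # i.i.d. block means
    mu_hat = zbar_i.mean()
    se = zbar_i.std(ddof=1) / np.sqrt(t)   # standard error from the block means
    cover += abs(mu_hat - mu_x) <= 1.96 * se
print(cover / reps)                        # close to the nominal 0.95
```

The full study replaces this simple interval with the parametric and EL intervals of Sections 2.2 and 2.3 and uses 10,000 repetitions per parameter configuration.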
2.4.2 MONTE CARLO OUTPUTS
Table 2.1 shows the estimated parameters based on the repeated measures data using the parametric likelihood method. The results show that as the number of replicates increases, the standard errors of the estimates of $\sigma_\varepsilon^2$ decrease, indicating that the estimation of $\sigma_\varepsilon^2$ improves as the number of replicates increases. Apparently, the Monte Carlo standard errors of the estimators of $\mu_x$ and $\sigma_x^2$ increase when we increase the number of replicates.

To accomplish the efficiency comparison between the repeated measures strategy and the hybrid design strategy, we provide the Monte Carlo properties of the maximum likelihood estimates based on pooled-unpooled data in Table 2.1. Table 2.1 shows that the Monte Carlo standard errors of the estimates for $\mu_x$ based on pooled-unpooled data are clearly less than those of the corresponding estimates that utilize repeated measures. One observed advantage is that the estimation for $\mu_x$ based on pooled-unpooled data is very accurate when the total number of measurements is fixed at the same level. Another advantage is that the standard errors of the estimates for $\mu_x$ based on pooled-unpooled data are much smaller than those of the corresponding estimates using repeated measures data, as shown in Table 2.1.
Table 2.1: The Monte Carlo evaluations of the maximum likelihood estimates based on repeated
measurements and the hybrid design
Sample Size  |  Replicates n / Pooling Size p  |  Parameters $(\mu_x, \sigma_x^2, \sigma_\varepsilon^2)$  |  Estimates $\hat{\mu}_x$, $\hat{\sigma}_x^2$, $\hat{\sigma}_\varepsilon^2$  |  Standard Errors SE$(\hat{\mu}_x)$, SE$(\hat{\sigma}_x^2)$, SE$(\hat{\sigma}_\varepsilon^2)$
Repeated Measurements:
N=100 n=2 (1, 1, 0.4) 1.0021 0.9781 0.3997 0.1553 0.2410 0.0790
(1, 1, 1.0) 1.0006 0.9688 0.9984 0.1726 0.3106 0.1994
n=5 (1, 1, 0.4) 0.9966 0.9462 0.3990 0.2328 0.3305 0.0623
(1, 1, 1.0) 1.0015 0.9362 0.9998 0.2442 0.3688 0.1570
n=10 (1, 1, 0.4) 1.0026 0.8951 0.3999 0.3209 0.4346 0.0597
(1, 1, 1.0) 1.0044 0.8917 0.9995 0.3299 0.4690 0.1501
N=300 n=2 (1, 1, 0.4) 0.9987 0.9921 0.3999 0.0889 0.1405 0.0455
(1, 1, 1.0) 1.0005 0.9883 0.9999 0.0995 0.1803 0.1162
n=5 (1, 1, 0.4) 0.9995 0.9797 0.3998 0.1356 0.1950 0.0365
(1, 1, 1.0) 0.9990 0.9766 0.9990 0.1409 0.2181 0.0906
n=10 (1, 1, 0.4) 0.9985 0.9682 0.3997 0.1864 0.2633 0.0344
(1, 1, 1.0) 0.9985 0.9660 1.0002 0.1914 0.2782 0.0861
Hybrid Design:
N=100 p=2 (1, 1, 0.4) 1.0015 1.0160 0.4365 0.1048 0.6712 0.4579
(1, 1, 1.0) 1.0007 1.0754 1.0098 0.1327 1.0058 0.7275
p=5 (1, 1, 0.4) 0.9994 1.0045 0.3889 0.0924 0.3857 0.1662
(1, 1, 1.0) 1.0008 1.0053 0.9880 0.1240 0.5932 0.3217
p=10 (1, 1, 0.4) 0.9993 1.0049 0.3918 0.0871 0.3341 0.1164
(1, 1, 1.0) 0.9996 1.0050 0.9836 0.1197 0.5082 0.2486
N=300 p=2 (1, 1, 0.4) 0.9999 0.9974 0.4066 0.0608 0.3868 0.2652
(1, 1, 1.0) 1.0002 1.0069 0.9982 0.0758 0.5788 0.4179
p=5 (1, 1, 0.4) 0.9995 1.0013 0.3969 0.0534 0.2197 0.0954
(1, 1, 1.0) 0.9993 1.0076 0.9910 0.0711 0.3386 0.1819
p=10 (1, 1, 0.4) 0.9995 0.9995 0.3972 0.0497 0.1935 0.0671
(1, 1, 1.0) 0.9992 1.0059 0.9922 0.0688 0.2928 0.1436
Table 2.2 displays the coverage probabilities of the confidence interval estimators
constructed by the parametric likelihood and EL method based on repeated measures data
and the mixed data, respectively.
Table 2.2: Coverage probabilities and confidence intervals based on repeated measurements and
the hybrid design
Sample Size  |  Replicates n / Pooling Size p  |  Parameters $(\mu_x, \sigma_x^2, \sigma_\varepsilon^2)$  |  Parametric Likelihood: Coverage, CI  |  Empirical Likelihood: Coverage, CI
Repeated Measurements:
N=100 n=2 (1, 1, 0.4) 0.9420 (0.7028, 1.3014) 0.9496 (0.6980, 1.3049)
(1, 1, 1.0) 0.9423 (0.6665, 1.3347) 0.9466 (0.6613, 1.3394)
n=5 (1, 1, 0.4) 0.9305 (0.5584, 1.4348) 0.9327 (0.5519, 1.4466)
(1, 1, 1.0) 0.9289 (0.5404, 1.4626) 0.9353 (0.5298, 1.4752)
n=10 (1, 1, 0.4) 0.9044 (0.4193, 1.5859) 0.8985 (0.4158, 1.5876)
(1, 1, 1.0) 0.9042 (0.4040, 1.6047) 0.9030 (0.4054, 1.6065)
N=300 n=2 (1, 1, 0.4) 0.9477 (0.8243, 1.1731) 0.9517 (0.8240, 1.1753)
(1, 1, 1.0) 0.9469 (0.8056, 1.1955) 0.9479 (0.8034, 1.1962)
n=5 (1, 1, 0.4) 0.9400 (0.7401, 1.2588) 0.9467 (0.7360, 1.2628)
(1, 1, 1.0) 0.9448 (0.7257, 1.2722) 0.9467 (0.7210, 1.2763)
n=10 (1, 1, 0.4) 0.9396 (0.6422, 1.3547) 0.9379 (0.6318, 1.3582)
(1, 1, 1.0) 0.9336 (0.6321, 1.3648) 0.9417 (0.6245, 1.3712)
Hybrid Design:
N=100 p=2 (1, 1, 0.4) 0.9512 (0.7939, 1.2090) 0.9492 (0.7725, 1.2303)
(1, 1, 1.0) 0.9463 (0.7422, 1.2592) 0.9421 (0.7146, 1.2869)
p=5 (1, 1, 0.4) 0.9424 (0.8230, 1.1757) 0.9490 (0.7978, 1.2010)
(1, 1, 1.0) 0.9439 (0.7644, 1.2372) 0.9509 (0.7314, 1.2703)
p=10 (1, 1, 0.4) 0.9393 (0.8339, 1.1646) 0.9498 (0.8099, 1.1887)
(1, 1, 1.0) 0.9431 (0.7701, 1.2290) 0.9478 (0.7376, 1.2614)
N=300 p=2 (1, 1, 0.4) 0.9482 (0.8817, 1.1182) 0.9551 (0.8660, 1.1337)
(1, 1, 1.0) 0.9469 (0.8525, 1.1479) 0.9520 (0.8334, 1.1672)
p=5 (1, 1, 0.4) 0.9478 (0.8963, 1.1026) 0.9532 (0.8822, 1.1166)
(1, 1, 1.0) 0.9463 (0.8616, 1.1371) 0.9506 (0.8433, 1.1556)
p=10 (1, 1, 0.4) 0.9462 (0.9030, 1.0961) 0.9584 (0.8896, 1.1095)
(1, 1, 1.0) 0.9484 (0.8652, 1.1332) 0.9532 (0.8475, 1.5080)
Table 2.2 shows that the EL ratio test statistic is as efficient as the traditional
parametric likelihood approach in the context of constructing confidence intervals
because the coverage probabilities and the interval widths of the two methods are very
close.
It is clearly shown that when sample sizes are greater than 100, the coverage
probabilities obtained via the pooled-unpooled design are closer to the expected 0.95
value than those based on repeated measurements. This, again, demonstrates that mixed
data are more efficient than repeated measures data.
To compare the Monte Carlo type I errors and powers of the tests based on the statistics $\ell_1(\mu_0)$ and $\ell_2(\mu_0)$ given by equations (2.3) and (2.7), we performed 10,000 simulations for each parametric setting and sample size, testing the null hypothesis $H_0: \mu_x = \mu_0$. Table 2.3 depicts results that correspond to the case of normally distributed data. The outputs show that the Monte Carlo type I errors and powers of the test statistic $\ell_1(\mu_0)$ are slightly better than those corresponding to the test statistic $\ell_2(\mu_0)$. This indicates that the test based on the simple statistic $\ell_1(\mu_0)$ performs better than that based on the statistic $\ell_2(\mu_0)$ in the considered cases.
Table 2.4 displays the Monte Carlo simulation results of testing the null hypothesis when the data follow a shifted $\chi_2^2$ distribution, where $\chi_2^2$ denotes a chi-squared distribution with two degrees of freedom and $a$ is an effect size. Again, in this case, it is obvious that the type I errors (when $a = 0$) of the test statistic $\ell_1(\mu_0)$ are much better controlled around 0.05 than those based on the test statistic $\ell_2(\mu_0)$. Moreover, the Monte Carlo powers of the test based on the test statistic $\ell_1(\mu_0)$ are higher than those based on the statistic $\ell_2(\mu_0)$ when the effect size $a$ is 0.5 or larger. On the contrary, when the effect size $a$ is small, such as 0.1 or 0.2, the Monte Carlo powers of the tests based on the test statistic $\ell_2(\mu_0)$ appear to be higher than those based on the statistic $\ell_1(\mu_0)$. This shows that when the effect size $a$ is large, the test based on the simple statistic $\ell_1(\mu_0)$ is preferable to that based on the statistic $\ell_2(\mu_0)$.
Table 2.3: The Monte Carlo type I errors and powers of the EL ratio test statistics (2.3) and (2.7) for testing $H_0: \mu_x = \mu_0$ based on data following the hybrid design ($X \sim N(\mu_x, \sigma_x^2)$, $\varepsilon \sim N(0, \sigma_\varepsilon^2)$; the pooling proportion $q$ = 0.5; the expected significance level was 0.05; $\Delta$ denotes the shift from $\mu_0$ under the alternative).

Sample Size (N)  |  Pooling Group Size (p)  |  $\sigma_\varepsilon^2$  |  ELR Test Statistic (2.3): Type I Error ($\Delta$=0), Power ($\Delta$=0.5, $\Delta$=1.0)  |  ELR Test Statistic (2.7): Type I Error ($\Delta$=0), Power ($\Delta$=0.5, $\Delta$=1.0)

N=100 p=2 0.4 0.0587 0.9919 1.0000 0.0580 0.9718 0.9990
1.0 0.0558 0.9373 1.0000 0.0611 0.9230 0.9986
p=5 0.4 0.0555 0.9989 1.0000 0.0530 0.9512 0.9992
1.0 0.0587 0.9626 1.0000 0.0604 0.9446 0.9976
p=10 0.4 0.0588 0.9995 1.0000 0.0595 0.9531 0.9999
1.0 0.0556 0.9684 1.0000 0.0621 0.9680 0.9992
N=200 p=2 0.4 0.0495 0.9999 1.0000 0.0594 0.9990 0.9985
1.0 0.0536 0.9991 1.0000 0.0593 0.9983 0.9995
p=5 0.4 0.0540 1.0000 1.0000 0.0524 0.9952 0.9996
1.0 0.0511 0.9997 1.0000 0.0543 0.9981 0.9999
p=10 0.4 0.0549 1.0000 1.0000 0.0546 0.9950 0.9996
1.0 0.0536 0.9999 1.0000 0.0551 0.9979 1.0000
Table 2.4: The Monte Carlo type I errors and powers of the EL ratio test statistics (2.3) and (2.7) for testing the null hypothesis based on data following the hybrid design, with chi-squared(2) data shifted by the effect size a (pooling proportion 0.5; expected significance level 0.05).

Sample    Pooling    Para-    ELR Test Statistic (2.3)                   ELR Test Statistic (2.7)
Size (N)  Size (p)   meters   Type I   Power                             Type I   Power
                              a=0      a=0.1    a=0.2    a=0.5           a=0      a=0.1    a=0.2    a=0.5
N=100     p=2        0.4      0.0687   0.0862   0.2021   0.8567          0.0724   0.0989   0.2144   0.7446
                     1.0      0.0654   0.0792   0.1579   0.6978          0.0690   0.0936   0.1696   0.6417
          p=5        0.4      0.0649   0.1123   0.3133   0.9808          0.0985   0.1583   0.3552   0.9084
                     1.0      0.0670   0.0906   0.1901   0.8141          0.0862   0.1131   0.2338   0.8171
          p=10       0.4      0.0646   0.1454   0.4460   0.9990          0.1016   0.1724   0.4416   0.9269
                     1.0      0.0623   0.0916   0.2253   0.8776          0.0933   0.1241   0.2693   0.8744
N=200     p=2        0.4      0.0587   0.1137   0.3381   0.9907          0.0555   0.1210   0.3294   0.8227
                     1.0      0.0544   0.0940   0.2583   0.9427          0.0534   0.1044   0.2612   0.8147
          p=5        0.4      0.0559   0.1699   0.5557   0.9998          0.0857   0.1903   0.5215   0.8944
                     1.0      0.0538   0.1134   0.3366   0.9860          0.0770   0.1468   0.3687   0.9436
          p=10       0.4      0.0572   0.2279   0.7539   1.0000          0.0801   0.2027   0.6158   0.8875
                     1.0      0.0568   0.1233   0.3988   0.9935          0.0815   0.1629   0.4219   0.9527
2.5 A REAL DATA EXAMPLE
In this section, we illustrate the proposed methods via data from the Cedars-Sinai
Medical Center. This study on coronary heart disease investigated the discriminatory
ability of a cholesterol biomarker for myocardial infarction (MI). We had 80 individual measurements of the cholesterol biomarker in total. Half of these were collected on patients who had recently survived an MI (cases), and the other half on controls who had normal resting electrocardiograms, were free of symptoms, and had no previous cardiovascular procedures or MIs. Additionally, the blood specimens were randomly pooled in groups, keeping cases and controls separate, and then re-measured. Consequently, we had measurements for 20 samples of pooled cases and 20 samples of pooled controls, allowing us to form the hybrid design.
The p-value of 0.8662 for the Shapiro-Wilk test indicates that the cholesterol biomarker can reasonably be assumed to follow a normal distribution. The histogram and normal Q-Q plot in Figure 2.1 confirm that the normality assumption for the data is reasonable.
Figure 2.1: The histogram and the normal Q-Q plot of cholesterol data
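The normality check can be reproduced in outline as follows. Since the cholesterol measurements themselves are not available here, simulated normal data with a comparable mean and spread stand in (an assumption), and `scipy.stats.probplot` supplies the Q-Q plot coordinates without requiring a plotting backend.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
# Placeholder for the 80 cholesterol assays (values are assumptions).
chol = rng.normal(loc=216.0, scale=40.0, size=80)

w_stat, p_value = stats.shapiro(chol)        # H0: data are normal
# A large p-value (such as the 0.8662 reported in the text) supports normality.

# Coordinates for a normal Q-Q plot, plus the correlation r of the fit:
(theo_q, ordered), (slope, intercept, r) = stats.probplot(chol, dist="norm")
```

For genuinely normal data the Q-Q correlation r is close to 1, matching the near-linear Q-Q plot of Figure 2.1.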
We formed hybrid samples by taking combinations of 20 unpooled samples and 10
pooled samples from different individuals for cases and controls, separately. In this
example, we focused on the means of cholesterol measurements and therefore, we
calculated these means based on 40 individual samples for cases and controls, separately.
The obtained means were 226.7877 and 205.5290, respectively. Using a bootstrap
strategy, we compared the confidence interval estimators and the coverage probabilities
of the EL method with those of the parametric method. The bootstrap study proceeded as follows. We randomly selected 10 pooled assays with replacement. We then randomly sampled 20 assays from the individual assays, excluding those performed on individual biospecimens that contributed to the 10 chosen pooled assays. Using the 20 sampled individual assays and the 10 pooled assays, we applied a parametric likelihood method under a normal distributional assumption, as well as the EL ratio test (2.3), to calculate the 95% confidence interval of the mean of the cholesterol biomarker. We repeated this sampling and interval calculation 5,000 times, obtaining 5,000 confidence intervals of the mean cholesterol value for cases and for controls, and then averaged the interval endpoints. Table 2.5 depicts the outputs of the bootstrap evaluation.
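A minimal sketch of the resample-and-average step is given below. It covers only the parametric (normal-theory) interval on simulated individual assays; the actual study also drew pooled assays and applied the EL ratio test (2.3). The data values and resample count are assumptions for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
# Simulated stand-in for the 40 individual case assays (assumption).
individual = rng.normal(loc=226.8, scale=40.0, size=40)

def normal_ci(x, level=0.95):
    """Normal-theory confidence interval for the mean of x."""
    m, se = np.mean(x), stats.sem(x)
    half = stats.t.ppf(0.5 + level / 2.0, df=len(x) - 1) * se
    return m - half, m + half

B = 1000  # the dissertation used 5,000 bootstrap repetitions
lows, highs = [], []
for _ in range(B):
    resample = rng.choice(individual, size=20, replace=True)
    lo, hi = normal_ci(resample)
    lows.append(lo)
    highs.append(hi)

# Average the endpoints over the B repetitions, as done for Table 2.5.
avg_ci = (float(np.mean(lows)), float(np.mean(highs)))
```

The averaged interval straddles the observed sample mean, which is why the parametric and EL rows of Table 2.5 can be compared directly by length and location.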
Table 2.5: Bootstrap evaluations of the confidence interval estimators based on the parametric likelihood ratio test and the EL ratio test

Method                 Healthy                           MI
                       CI                     Length     CI                     Length
Parametric (Normal)    (192.5738, 220.8708)   28.29704   (210.0585, 239.4560)   29.39748
Empirical (EL)         (192.9715, 221.1471)   28.17561   (210.4337, 240.5975)   30.16376
In accordance with the results above, the confidence intervals for the cholesterol mean obtained via the EL ratio method are close to those obtained via the parametric approach; we therefore observe no meaningful difference between the intervals produced by the two approaches. This result shows that, in this example, the proposed EL approach is as efficient as the traditional parametric likelihood approach for constructing confidence intervals.
2.6 CONCLUSIONS
In this chapter, we proposed and examined different parametric and distribution-free
likelihood methods to evaluate data subject to MEs. We evaluated the common sampling
strategy based on repeated measures and the novel hybrid sampling procedure. We pointed out that, when MEs are present, the repeated-measurements strategy might not perform well. The proposed hybrid design utilizes the cost-efficient pooling approach and combines pooled and unpooled samples.
The study in this chapter has confirmed that the repeated-measures strategy devotes much of its information to the ME distribution itself, reducing the efficiency of this procedure compared with the hybrid design in the context of evaluating biomarker characteristics. We proposed applying EL techniques, which are very efficient nonparametric methods, to data subject to ME.
To verify the efficiency of the hybrid design and the EL methodology, we provided
theoretical propositions as well as the Monte Carlo simulation results.
The numerical studies support our argument that likelihoods based on pooled-unpooled data are more efficient than those based on repeated-measures data. We showed that the EL method can serve as a very powerful tool in statistical inference involving MEs.
CHAPTER 3
AN EMPIRICAL LIKELIHOOD RATIO BASED
GOODNESS-OF-FIT TEST FOR INVERSE
GAUSSIAN DISTRIBUTIONS
3.1 INTRODUCTION
The Inverse Gaussian (IG) distribution has a probability density function of the form

f(x | μ, λ) = ( λ / (2πx³) )^{1/2} exp{ −λ(x − μ)² / (2μ²x) },   x > 0,

where μ > 0 and λ > 0 are parameters. The IG(μ, λ) distribution is extensively used for modeling and analyzing right-skewed data with positive support across several different fields of science, e.g., demography, electrical networks, meteorology, hydrology, ecology, entomology, physiology, and cardiology (see, for example, Chhikara and Folks, 1977; Bardsley, 1980; Seshadri, 1993, 1999; Johnson et al., 1994; Barndorff-Nielsen, 1994).
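For reference, the density above can be coded directly and cross-checked against SciPy's parameterization, in which IG(μ, λ) corresponds to `invgauss(mu=μ/λ, scale=λ)`; the parameter values below are illustrative.

```python
import numpy as np
from scipy import stats

def ig_pdf(x, mu, lam):
    """IG(mu, lambda) density: sqrt(lam / (2 pi x^3)) *
    exp(-lam (x - mu)^2 / (2 mu^2 x)), for x > 0."""
    x = np.asarray(x, dtype=float)
    return np.sqrt(lam / (2.0 * np.pi * x**3)) * np.exp(
        -lam * (x - mu) ** 2 / (2.0 * mu**2 * x)
    )

mu, lam = 2.0, 3.0
x = np.linspace(0.1, 10.0, 200)
# SciPy's invgauss uses shape mu_scipy = mu / lam with scale = lam.
max_abs_diff = float(np.max(np.abs(
    ig_pdf(x, mu, lam) - stats.invgauss.pdf(x, mu / lam, scale=lam)
)))
```

The two evaluations agree to floating-point precision, confirming the parameter mapping.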
Given the utility of this distribution, it is meaningful to develop a corresponding goodness-of-fit test with satisfactory statistical properties. Towards this end, we propose constructing a distribution-free goodness-of-fit test for the IG distribution based on approximating the appropriate parametric likelihood ratio test statistic.
The parametric likelihood approach is a powerful statistical tool, which provides
optimal statistical tests under well-known conditions; e.g. see Lehmann and Romano
(2005); Vexler and Wu (2009); Vexler et al. (2010); and Vexler and Tarima (2010). By virtue of the Neyman-Pearson lemma, the likelihood ratio L₁/L₀, where L₀ and L₁ correspond to the likelihoods under hypotheses H₀ and H₁, respectively, provides the most powerful test statistic when L₀ and L₁ are completely known. In the nonparametric context considered in this chapter, the forms of L₀ and L₁ are unknown but estimable. In this case, the Neyman-Pearson lemma motivates us to approximate the optimal likelihood ratio using an empirical approach.
In this chapter, we develop a simple approach that approximates the most powerful
likelihood ratio goodness-of-fit test for the IG distribution. The method is efficient and
utilizes adapted principles of the empirical likelihood (EL) methodology. The EL
approach allows investigators to employ methods with properties that are close, in the
asymptotic sense, to those of parametric techniques without having to assume a
parametric form for the likelihood functions under either hypothesis, H₀ or H₁ (e.g., Lazar,
2003; Qin and Lawless, 1994; Owen, 2001; Vexler et al., 2009; Vexler et al., 2010; Yu et
al., 2010). The main advantage of the EL approach is that it is based on the maximum
likelihood methodology when given a set of well-defined empirical constraints. The EL function of n i.i.d. observations X₁, …, Xₙ has the form L = ∏ᵢ₌₁ⁿ pᵢ, where the probability weights pᵢ, 0 < pᵢ < 1, maximize L and satisfy empirical constraints corresponding to hypotheses of interest. This approach is based on the likelihood methodology involving cumulative distribution functions F(x). In this case, the likelihood has the form ∏ᵢ₌₁ⁿ {F(Xᵢ) − F(Xᵢ−)} (see, for details, Owen, 2001). For example, if the null hypothesis is E(X) = μ, then the values of the pᵢ's should be chosen to maximize ∏ᵢ₌₁ⁿ pᵢ given ∑ᵢ₌₁ⁿ pᵢ = 1 and ∑ᵢ₌₁ⁿ pᵢXᵢ = μ, where the latter constraint is an empirical version of E(X) = μ. The components pᵢ of the EL can be obtained using Lagrange multipliers.
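The Lagrange-multiplier computation can be sketched concretely: with constraints ∑pᵢ = 1 and ∑pᵢxᵢ = μ, the maximizing weights take the standard form pᵢ = 1/(n(1 + t(xᵢ − μ))), where t solves ∑(xᵢ − μ)/(1 + t(xᵢ − μ)) = 0. The data values below are illustrative only.

```python
import numpy as np
from scipy.optimize import brentq

def el_weights(x, mu):
    """EL probabilities maximizing prod(p_i) subject to sum(p_i) = 1
    and sum(p_i * x_i) = mu.  Lagrange multipliers give
    p_i = 1 / (n * (1 + t*(x_i - mu))) with t solving g(t) = 0."""
    x = np.asarray(x, dtype=float)
    n, d = len(x), x - mu
    if not (d.min() < 0.0 < d.max()):
        raise ValueError("mu must lie strictly inside the data range")
    eps = 1e-10
    g = lambda t: np.sum(d / (1.0 + t * d))
    # t is confined to the interval keeping every 1 + t*d_i positive,
    # and g changes sign across that interval, so brentq applies.
    t_hat = brentq(g, -1.0 / d.max() + eps, -1.0 / d.min() - eps)
    return 1.0 / (n * (1.0 + t_hat * d))

x = np.array([1.2, 0.7, 2.5, 1.9, 0.4, 3.1, 1.1, 2.2])  # illustrative data
p = el_weights(x, mu=1.5)
# p is a valid distribution and satisfies the mean constraint sum(p*x) = 1.5.
```

When μ equals the sample mean, the weights reduce to the uniform 1/n, recovering the unconstrained empirical distribution.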
Vexler and Gurevich (2010), as well as Gurevich and Vexler (2011), utilized the main idea of the classical EL methodology to approximate parametric likelihood ratios. The authors proposed a nonparametric approach based on approximate density functions. In this chapter, we derive the EL ratio test for the IG distribution using the density-based distribution-free likelihood approach of Vexler and Gurevich (2010). Following Mudholkar and Tian (2002), we transform the observations and present the likelihood function under the alternative hypothesis in the form ∏ᵢ₌₁ⁿ fᵢ, where fᵢ = f(Y₍ᵢ₎), f is a density function, and Y₍ᵢ₎ is the i-th order statistic based on the transformed observations. Values of fᵢ can be estimated by maximizing ∏ᵢ₌₁ⁿ fᵢ given an empirical version of the constraint ∫ f(u) du = 1. The nonparametric likelihood method mentioned above is then utilized to develop our goodness-of-fit test statistic. Since the proposed test statistic approximates the most powerful parametric likelihood ratio, the density-based EL ratio test is shown to have very efficient characteristics. We also demonstrate that the proposed test statistic improves upon the decision rule of Mudholkar and Tian (2002) and maintains good type I error control for finite samples.
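A rough sketch of such a density-based statistic follows: m-spacings of the order statistics approximate the density values under the alternative, while the fitted IG likelihood, using the standard MLEs μ̂ = X̄ and λ̂⁻¹ = n⁻¹∑(1/Xᵢ − 1/X̄), plays the role of the null likelihood. The exact statistic derived in Section 3.2 may differ in its details, and critical values would require Monte Carlo calibration; this is illustration only.

```python
import numpy as np

def ig_mle(x):
    """Standard maximum likelihood estimates for IG(mu, lambda)."""
    mu = float(np.mean(x))
    lam = len(x) / float(np.sum(1.0 / x - 1.0 / mu))
    return mu, lam

def ig_logpdf(x, mu, lam):
    return (0.5 * np.log(lam / (2.0 * np.pi * x**3))
            - lam * (x - mu) ** 2 / (2.0 * mu**2 * x))

def density_el_ratio(x, delta=0.1):
    """Density-based EL ratio statistic in the spirit of Vexler and
    Gurevich (2010): m-spacings estimate the density under H1; the
    fitted IG likelihood serves under H0.  Large values speak against
    the IG model.  Sketch only -- not the exact statistic of Section 3.2.
    """
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    mu, lam = ig_mle(x)
    log_l0 = float(np.sum(ig_logpdf(x, mu, lam)))
    idx = np.arange(n)
    best = np.inf
    for m in range(1, max(2, int(n ** (1.0 - delta)))):
        spacings = x[np.minimum(idx + m, n - 1)] - x[np.maximum(idx - m, 0)]
        if np.any(spacings <= 0):
            continue  # ties would make the spacing estimate degenerate
        log_l1 = float(np.sum(np.log(2.0 * m / (n * spacings))))
        best = min(best, log_l1 - log_l0)
    return best

rng = np.random.default_rng(4)
sample = rng.wald(2.0, 3.0, size=200)  # numpy's wald() draws IG(mean, lambda)
t_stat = density_el_ratio(sample)
```

Minimizing over the spacing order m avoids fixing the unknown tuning parameter that complicates the entropy test of Mudholkar and Tian (2002).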
In Section 3.2, we construct the density-based EL test for the IG distribution; theoretical propositions there describe properties of the test. A Monte Carlo study of the power of the test is presented in Section 3.3. Section 3.4 presents real data examples employing the proposed test. Section 3.5 consists of concluding remarks.
3.2 METHOD
In this section, we derive the EL ratio goodness-of-fit test based on approximating
densities for the IG distribution under the null and alternative hypotheses. The proposed
test is shown to be consistent as n → ∞. Consider the problem of testing the composite hypothesis that observations X₁, …, Xₙ are distributed as IG with unknown parameters μ and λ.
Mudholkar and Tian (2002) presented an entropy characterization property of the IG
family. The authors constructed a goodness-of-fit test for the IG distribution based on a
test statistic involving the sample entropy. Their test statistic contains an integer
parameter with unknown optimal values. The power of the entropy-based test statistic is
strongly dependent on values of the unknown parameter. Mudholkar et al. (2001) also
developed a goodness-of-fit test for the IG model by using a characterization theorem for
the IG distribution for which it is known