ADVANCED AND NOVEL PARAMETRIC AND
NONPARAMETRIC LIKELIHOOD STATISTICAL
TECHNIQUES WITH APPLICATIONS IN
EPIDEMIOLOGY
by
Wan-Min Tsai
December 17, 2012
A dissertation submitted to the
Faculty of the Graduate School of
the University at Buffalo, State University of New York
in partial fulfillment of the requirements for the
degree of
Doctor of Philosophy
Department of Biostatistics
All rights reserved
INFORMATION TO ALL USERS: The quality of this reproduction is dependent upon the quality of the copy submitted. In the unlikely event that the author did not send a complete manuscript and there are missing pages, these will be noted. Also, if material had to be removed, a note will indicate the deletion.
Microform Edition © ProQuest LLC. All rights reserved. This work is protected against unauthorized copying under Title 17, United States Code.
ProQuest LLC, 789 East Eisenhower Parkway, P.O. Box 1346, Ann Arbor, MI 48106-1346
UMI Number: 3554514. Published by ProQuest LLC (2013). Copyright in the Dissertation held by the Author.
Dissertation Committee
Albert Vexler, Ph.D. (Advisor)
Alan Hutson, Ph.D.
Chang-Xing Ma, Ph.D.
Jihnhee Yu, Ph.D.
ACKNOWLEDGEMENTS
I would like to express my gratitude to all the people who have helped me along the journey
of my Ph.D. study.
Firstly, I would like to express my deepest appreciation to my advisor, Professor
Albert Vexler, for his guidance and encouragement. This dissertation would not have
been possible without him. I am also deeply thankful to Professors Alan Hutson, Chang-
Xing Ma, and Jihnhee Yu, who kindly agreed to serve on my dissertation committee, for
their interest and valuable comments on my work.
I would like to thank all of my professors, fellow students, and department staff for
their kind help and support during my graduate study at the University at Buffalo. My
sincere thanks go also to Dr. Gregory Gurevich and Dr. Yaakov Malinovsky, as well as the
reviewers of the manuscripts submitted to journals, for their constructive suggestions
that improved the quality of the manuscripts and this dissertation.
I would also like to thank all of my friends for always being there with a word of
encouragement or a listening ear.
Last but not least, I would like to thank my beloved parents, brother and sisters.
Without their strong love and support, I would not have had a chance to go abroad to
pursue my graduate study. I love my family beyond expression for their support and
encouragement along the way. I would like to dedicate this work to them.
TABLE OF CONTENTS
Acknowledgements ....................................................................................................... iii
List of Tables .................................................................................................................... x
List of Figures ............................................................................................................... xiii
Abstract ........................................................................................................................ xiv
Chapters ........................................................................................................................... 1
Chapter 1. Introduction. ............................................................................................ 1
Chapter 2. Estimation and testing based on data subject to
measurement errors: from parametric to non-parametric likelihood
methods ............................................................................................................................ 9
2.1 Introduction .............................................................................................................................. 9
2.2 Parametric inferences............................................................................................................11
2.2.1 Parametric likelihood functions .......................................................................... 12
2.2.1.1 Parametric likelihood based on repeated data..12
2.2.1.2 Parametric likelihood based on pooled and unpooled data......13
2.2.2 Normal case......................................................................................................... 15
2.2.2.1 Maximum likelihood estimators based on repeated measures.15
2.2.2.2 Maximum likelihood estimators following the hybrid design......17
2.2.2.3 Remarks on the normal case ...................18
2.3 Empirical likelihood method ...............................................................................................19
2.3.1 The empirical likelihood ratio test ...................................................................... 19
2.3.2 The empirical likelihood method based on repeated measures data ................... 21
2.3.3 The empirical likelihood method based on pooled-unpooled data ..................... 22
2.4 Monte Carlo experiments .....................................................................................................25
2.4.1 Simulation settings .............................................................................................. 25
2.4.2 Monte Carlo outputs ........................................................................................... 27
2.5 A real data example ..............................................................................................................32
2.6 Conclusions ............................................................................................................................35
Chapter 3. An empirical likelihood ratio based goodness-of-fit test for
inverse Gaussian distributions .............................................................................. 36
3.1 Introduction ............................................................................................................................36
3.2 Method ....................................................................................................................................39
3.2.1 Density-based empirical likelihood ratio goodness-of-fit test ............................ 40
3.2.2 Null distribution .................................................................................................. 46
3.3 Power properties ....................................................................................................................50
3.4 Data examples ........................................................................................................................52
3.5 Conclusions ............................................................................................................................54
Chapter 4. An extensive power evaluation of a novel two-sample
density-based empirical likelihood ratio test for paired data. ................... 56
4.1 Introduction ............................................................................................................................56
4.2 Method ....................................................................................................................................61
4.3 Simulation study ....................................................................................................................67
4.3.1 Null distribution .................................................................................................. 67
4.3.2 Power study ......................................................................................................... 71
4.4 A real data example ..............................................................................................................75
4.5 Conclusions ............................................................................................................................78
Chapter 5. Two-sample density-based empirical likelihood ratio tests
based on paired data, with application to a treatment study of
Attention-Deficit/Hyperactivity Disorder and Severe Mood
Dysregulation............................................................................................................... 88
5.1 Introduction ............................................................................................................................88
5.2 Statement of problems and Methods ..................................................................................93
5.2.1 Hypotheses setting .............................................................................................. 93
5.2.2 Test statistics ....................................................................................................... 94
5.2.2.1 Test 1 .....94
5.2.2.2 Test 2 .........99
5.2.2.3 Test 3 .......101
5.2.3 Asymptotic consistency of the tests .................................................... 102
5.2.4 Null distributions of the proposed test statistics ................................................ 103
5.3 Simulation study ..................................................................................................................105
5.3.1 Power comparison with the parametric method................................................. 105
5.3.2 Power comparison with classic nonparametric methods ................................... 106
5.4 Data analysis ........................................................................................................................107
5.5 Conclusions ..........................................................................................................................110
Chapter 6. Optimal properties of parametric Shiryaev-Roberts
statistical control procedures ............................................................................... 117
6.1 Introduction ..........................................................................................................................117
6.2 Retrospective change point detection ...............................................................................119
6.3 Retrospective detection of two change points .................................................................124
6.3.1 Non-asymptotic optimal properties of the Shiryaev-Roberts test (6.10) ......... 125
6.3.2 A real data example .......................................................................................... 129
6.3.2.1 The methods for p-value approximation related to test (6.16)...132
6.3.2.2 Additional study.............134
6.4 Sequential change point detection.....................................................................................134
6.5 Conclusions ..........................................................................................................................138
Chapter 7. Future works ....................................................................................... 140
7.1 Testing normality based on independent samples and errors in regression models
......................................................................................................................................................140
7.2 A sample density-based empirical likelihood ratio test for independence
......................................................................................................................................................141
Appendices ................................................................................................................. 143
A.1.1 Parametric likelihood based on the hybrid design ......................................................143
A.1.2 The asymptotic distribution of the log empirical likelihood ratio test .....................145
A.1.2.1 The repeated measurements .......................................................................... 145
A.1.2.2 The hybrid design ......................................................................................... 147
A.2.1 Proof of Lemma 3.1 ........................................................................................................149
A.2.2 To show the proposed test statistic is identical to the test statistic presented by
Mudholkar and Tian (2002)..................................................................................................... 150
A.2.3 Proof of Proposition 3.1 .................................................................................................152
A.2.4 Proof of Proposition 3.2 ................................................................................................ 155
A.3.1 Computing the empirical constraint (5.11) utilized to develop test 2 ......................156
A.3.2 Proof of Proposition 5.1 ................................................................................................ 157
A.3.2.1 Proposed test 1 .............................................................................................. 157
A.3.2.2 Proposed test 2 .............................................................................................. 165
A.3.3 Mathematical derivation of maximum likelihood ratio tests.................................... 168
A.3.3.1 Maximum likelihood ratio test statistic for test 1 ......................................... 168
A.3.3.2 Maximum likelihood ratio test statistic for test 2 ......................................... 169
A.3.3.3 Maximum likelihood ratio test statistic for test 3 ......................................... 170
Bibliography .............................................................................................................. 171
LIST OF TABLES
2.1 The Monte Carlo evaluations of the maximum likelihood estimates based on
repeated measurements and the hybrid design ......................................................... 28
2.2 Coverage probabilities and confidence intervals based on repeated measurements
and the hybrid design ............................................................................................... 29
2.3 The Monte Carlo type I errors and powers of the empirical likelihood ratio test
statistics (2.3) and (2.7) for testing based on data following the hybrid design ....... 31
2.4 The Monte Carlo type I errors and powers of the empirical likelihood ratio test
statistics (2.3) and (2.7) for testing based on data following the hybrid design ....... 32
2.5 Bootstrap evaluations of the confidence interval estimators based on the
parametric likelihood ratio test and the empirical likelihood ratio test.................... 34
3.1 The critical values of the proposed test at the significance level ............................. 47
3.2 Type I error control of the proposed test at the 0.05 level ....................................... 48
3.3 The Monte Carlo type I errors of the proposed test with critical values obtained
by Proposition 2.2 to guarantee the nominal level ................................................... 50
3.4 An empirical power comparison for the density-based empirical likelihood ratio
test ............................................................................................................................ 51
3.5 Test for the IG based on the data introduced in Folks and Chhikara (1978) ........... 52
3.6 Bootstrap type proportion of rejection of the tests for the IG at a 5% level of
significance .............................................................................................................. 54
4.1 The critical values of the proposed test (4.7) with various parameter values for
different sample sizes at the significance levels 0.01, 0.05, and 0.1,
respectively .............................................................................................................. 68
4.2 Type I error control of the proposed test statistic (4.6) at the considered
significance levels .................................................................................................... 70
4.3 Designs of the alternative hypothesis to be applied to the following Monte Carlo
evaluations of the powers of the proposed test (4.6) ................................................ 72
4.4 Proportion of rejection based on the bootstrap method for each considered test ..... 78
5.1 Hypotheses of interest to be tested based on paired data ......................................... 94
5.2 The critical values for Test 1 by (5.9) (Test 2 by (5.15)) [Test 3 by (5.17)] with
parameter value 0.1 for different sample sizes and significance levels ................... 104
5.3 The Monte Carlo powers of Test 1 by (5.9) vs. the MLR test with
different sample sizes at the considered significance level ..................................... 111
5.4 The Monte Carlo powers of Test 2 by (5.15) vs. the MLR test with
different sample sizes at the considered significance level ..................................... 112
5.5 The Monte Carlo powers of Test 3 by (5.17) vs. the MLR test with
different sample sizes at the considered significance level ..................................... 112
5.6 The Monte Carlo type I errors of the MLR tests ................................................... 113
5.7 The Monte Carlo powers of the proposed test (5.9) vs. the combined
nonparametric test (two Wilcoxon signed rank tests and one Kolmogorov-
Smirnov test) ........................................................................................................... 113
5.8 The Monte Carlo powers of the proposed test (5.15) vs. the combined
nonparametric test (one Wilcoxon signed rank test and one Kolmogorov-
Smirnov test) ........................................................................................................... 114
5.9 The Monte Carlo powers of the proposed test (5.17) vs. the Wilcoxon signed
rank test ................................................................................................................... 115
5.10 The proportions of rejections based on the bootstrap method for each considered
test .......................................................................................................................... 115
6.1 Means and standard deviations of the %TS data and the TIBC data in each group
............................................................................................................................... 132
LIST OF FIGURES
2.1 The histogram and the normal Q-Q plot of cholesterol data ................................... 33
4.1 3-D plots of powers of the considered tests via all 28 designs (K1-K28) with
different sample sizes at the considered significance level ....................................... 80
4.2 Histograms of the differences in CDRS-Rts at baseline and endpoint in group 1
and group 2, respectively .......................................................................................... 87
4.3 Plot of sample sizes vs. p-value using a bootstrap method ...................................... 87
5.1 Histograms of CDRS-Rts related to the baseline and endpoint in group 1 and in
group 2 .................................................................................................................... 116
5.2 Histograms of the paired observations based on the CDRS-Rts data, with
sample sizes (9, 6), that were sampled from the original data set .......................... 116
6.1 The left-hand side shows plots and histograms of %TS data; the right-hand
side shows plots and histograms of TIBC data ...................................................... 130
6.2 Histograms of %TS data in each group ................................................................. 131
6.3 Histograms of TIBC data in each group ................................................................ 131
ABSTRACT
The likelihood approach provides a basis for many important procedures and methods in
statistical inference. When data distributions are completely known, the parametric
likelihood approach is unarguably a powerful statistical tool that can provide optimal
statistical inference. In such cases, by virtue of the Neyman-Pearson lemma, the likelihood
ratio tests are the most powerful decision rules. However, parametric likelihood methods
cannot be applied properly if assumptions about the forms of the data distributions do not
hold. Often, in the context of likelihood applications, the use of misspecified parametric
forms of data distributions may result in inaccurate statistical conclusions. The empirical
likelihood (EL) methodology has been well addressed in the literature as a nonparametric
counterpart of the powerful parametric likelihood approach. The objective of this dissertation is to
develop several powerful parametric likelihood methods and nonparametric approaches
using the EL concept. Measurement error (ME) problems can cause bias or inconsistency
of statistical inferences. When investigators are unable to obtain correct measurements of
biological assays, special techniques to quantify MEs need to be applied. In this
dissertation, we present both parametric likelihood and EL methods for dealing with data
subject to MEs based on repeated measures sampling strategies and hybrid sampling
designs (a mixture of pooled and unpooled data). Utilizing the density-based EL
methodology, we also propose different efficient nonparametric tests that approximate
most powerful Neyman-Pearson test statistics. We first introduce the EL ratio based
goodness-of-fit test for the inverse Gaussian model. Then we extend and adapt the density-
based EL approach to compare two samples based on paired data. We present exact
nonparametric tests for composite hypotheses to detect various differences related to
treatment effects in study groups based on paired measurements. Next, we review and
extend parametric retrospective and sequential Shiryaev-Roberts based policies, examining
the non-asymptotic optimal properties of the procedures in different contexts. We propose
techniques to construct novel and efficient retrospective tests for multiple change-point
detection problems. Finally, directions for future work are discussed.
CHAPTER 1
INTRODUCTION
Likelihood methods are powerful statistical tools for parametric statistical inference. The
parametric likelihood methodology provides optimal statistical inferential procedures
when data distributions are known. However, parametric forms of data distributions are
oftentimes unknown. When the key assumptions regarding the underlying distribution of
data are not met, the parametric likelihood approaches may be extremely biased and
inefficient when compared to their robust nonparametric counterparts. In the
nonparametric context, the classical empirical likelihood (EL) approach is often applied
in order to efficiently approximate properties of parametric likelihoods, using an
approach based on substituting empirical distribution functions for their population
counterparts. Thus, the objective of this work is to develop efficient parametric likelihood
approaches as well as nonparametric methods using the EL concept to deal with problems
arising from epidemiological and medical studies.
The EL ratios were first used by Thomas & Grunkemeier (1975) in the context of
estimating survival probabilities. Owen (1988, 1990), building on earlier work of Thomas
& Grunkemeier (1975), introduced the EL approach for constructing confidence regions
in nonparametric problems. The approach has since been extended to a variety of settings. For
example, Owen (1991) and Chen (1993, 1994) extended the method to regression
problems, Chen and Hall (1993) considered the case of quantiles, and Kolaczyk (1994)
made the extension to generalized linear models. Qin and Lawless (1994) have shown
that the EL methodology can be utilized to make inference on parameters of interest
under a semiparametric model. The EL method has several advantages compared to
contemporary methods such as the bootstrap (Hall and La Scala, 1990). The EL approach
not only provides confidence regions whose shapes and orientations are determined
entirely by the data, but it can also easily incorporate known constraints on parameters
(Chen, 1994). Moreover, DiCiccio et al. (1989) showed that in a very general setting, the
EL method for constructing confidence regions is Bartlett correctable, which yields very
low coverage error and thereby improves the accuracy of inferences. For a more comprehensive
literature review, see e.g., Owen (2001).
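To make the EL construction concrete, the following minimal sketch (illustrative only, not code from this dissertation) computes an Owen-type EL ratio statistic for a population mean: the weights p_i = 1/(n(1 + λ(x_i − μ))) come from the Lagrange-multiplier solution, the multiplier λ is found by bisection, and the statistic is calibrated against its asymptotic chi-square(1) law. The data and tolerances are arbitrary choices.

```python
import math
import random

def el_ratio_mean(x, mu):
    """-2 log empirical likelihood ratio for H0: E[X] = mu (Owen-type).

    Maximizes prod(p_i) subject to sum(p_i) = 1 and sum(p_i * (x_i - mu)) = 0;
    the optimal weights are p_i = 1 / (n * (1 + lam * (x_i - mu))).
    """
    d = [xi - mu for xi in x]
    if min(d) >= 0 or max(d) <= 0:
        return float("inf")  # mu lies outside the convex hull of the data
    n = len(d)
    # lam must keep every weight in (0, 1]:  1 + lam * d_i >= 1/n for all i
    lo = (1.0 / n - 1.0) / max(d) + 1e-10
    hi = (1.0 / n - 1.0) / min(d) - 1e-10
    g = lambda lam: sum(di / (1.0 + lam * di) for di in d)  # strictly decreasing
    for _ in range(200):  # bisection for the Lagrange multiplier
        mid = 0.5 * (lo + hi)
        if g(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    lam = 0.5 * (lo + hi)
    return 2.0 * sum(math.log1p(lam * di) for di in d)

random.seed(1)
x = [random.expovariate(0.5) for _ in range(50)]  # true mean is 2
stat = el_ratio_mean(x, mu=2.0)
# Wilks-type theory: stat is asymptotically chi-square(1) under H0,
# so a 5%-level test rejects when stat exceeds 3.84.
reject = stat > 3.84
```

Note that the statistic is exactly zero at the sample mean and infinite outside the convex hull of the data, which is the well-known boundary behavior of the EL ratio.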
In this dissertation, we mainly focus on the developments of novel and powerful
distribution-free approaches to approximate optimal parametric likelihood ratio test
statistics via the EL methodology. In Chapter 2, we introduce both parametric and
empirical likelihood methods for dealing with measurement error problems. In Chapters 3-
5, we develop several robust EL-based methods motivated by important biomedical studies.
In Chapter 6, we propose techniques to construct efficient parametric retrospective tests for
multiple change-points detection problems. In the last chapter, we discuss directions for
future works. In this introductory chapter, we begin with a brief description of background
and motivation of each study and review of related literature, which are essential for
comprehending the following chapters covered in this dissertation.
In Chapter 2, we propose and examine parametric and nonparametric tests for data
subject to measurement errors based on two sampling schemes---a common sampling
strategy based on repeated measures and a hybrid design (a mixture of pooled and
unpooled data). Biomarkers are distinctive biochemical indicators of biological processes
or events that help measure the progress of disease or the effects of treatment. The
imprecision of measurement of exposures can cause bias or inconsistency of estimation
or testing of unknown parameters of the biomarker distribution. The sampling based on
repeated measurements is a common method to deal with measurement error problems
(e.g., Dunn, 1989; Freedman et al., 2004). However, because of financial cost and other
resource constraints, obtaining and assaying the desired number of replications can be
infeasible. For example, the cost of a single assay to measure polychlorinated biphenyl
(PCB) is between $500 and $1000. The high assay expense restricts the number of
biological samples that can be measured in the study related to examination of the
association between the marker PCB and cancer or endometriosis (Louis et al., 2005;
Mumford et al., 2006). Take the Seveso cohort as another example. In 1976, due to a
severe industrial accident, serious dioxin contamination of a large resident population and
of farm animals occurred in Seveso, Italy. Investigators needed to assay frozen stored
blood specimens from residents of the exposed area in order to examine whether the
dioxin contamination adversely affects human health. However, each determination of
blood dioxin currently costs more than $1000, limiting the number of subjects that can
be considered in this study
(Weinberg and Umbach, 1999).
Pooling and simple random sampling designs have been suggested as cost-efficient
sampling approaches (e.g., Dorfman, 1943; Weinberg and Umbach, 1999; Faraggi et al.,
2003; Vexler et al., 2006; Schisterman and Vexler, 2008). Simple random sampling
designs are conducted by selecting and assaying a random subset of available specimens.
In this case, several individual biospecimens are ignored; however, estimation of the
operating characteristics of these individual biospecimens is simple. Pooling, on the other
hand, involves randomly grouping and physically mixing individual biological samples.
Assays are then performed on the small number of pooled samples. Therefore, the
pooling design reduces the number of measurements without ignoring any individual
biospecimens. Schisterman et al. (2010) proposed a pooled-unpooled hybrid design that
combines the advantages of both pooling and random sampling strategies. The hybrid
designs are implemented by assaying some randomly selected individual biospecimens
and one pooled set that is constituted by the remaining individual biospecimens. The
pooled sample in the hybrid design is used for estimation of the mean of the biomarker
distribution, whereas the random sample proportion in the hybrid design (unpooled data)
is utilized for estimation of the variance. One advantage of using the hybrid design is that
by utilizing a pooled sample, one can estimate not only mean and variance of the
biomarker distribution, but also measurement error, without requiring repeated measures.
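The mechanics of the hybrid design can be sketched with a small simulation. The sketch below assumes additive, homoscedastic normal measurement error and equal-volume pooling (so one assay of a pooled group measures the average of the pooled specimens plus one error term); under those assumptions a simple method-of-moments step separates the biomarker variance from the assay-error variance. All names and parameter values are illustrative assumptions, not the design of Schisterman et al. (2010).

```python
import random
import statistics

random.seed(7)
MU_X, SD_X, SD_ERR = 10.0, 2.0, 1.0        # true biomarker mean/SD, assay-error SD
N_UNPOOLED, N_GROUPS, POOL_SIZE = 200, 200, 5

def assay(true_value):
    """One laboratory measurement: the true value plus additive error."""
    return true_value + random.gauss(0.0, SD_ERR)

# Unpooled arm: assay individual specimens; Var = SD_X^2 + SD_ERR^2
unpooled = [assay(random.gauss(MU_X, SD_X)) for _ in range(N_UNPOOLED)]

# Pooled arm: physically mix POOL_SIZE specimens, assay the mixture once;
# the assay sees the average of the pooled specimens, so
# Var = SD_X^2 / POOL_SIZE + SD_ERR^2
pooled = []
for _ in range(N_GROUPS):
    specimens = [random.gauss(MU_X, SD_X) for _ in range(POOL_SIZE)]
    pooled.append(assay(statistics.fmean(specimens)))

mu_hat = statistics.fmean(pooled)          # pooled assays estimate the mean
v_unpooled = statistics.variance(unpooled)
v_pooled = statistics.variance(pooled)
# Method of moments: the two variance equations above form two equations in
# two unknowns, separating biomarker variance from measurement-error variance.
var_x_hat = (v_unpooled - v_pooled) * POOL_SIZE / (POOL_SIZE - 1)
var_err_hat = v_unpooled - var_x_hat
```

With the values above, mu_hat recovers the biomarker mean from the pooled assays alone, while contrasting the two observed variances identifies both variance components without any repeated measures.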
The repeated measurements sampling strategy has been well addressed in the literature
under parametric assumptions. In the context of the hybrid sampling design, Schisterman
et al. (2010) have proposed and evaluated a parametric approach for normally distributed
data in the presence of measurement errors. However, nonparametric methods have not
been well investigated to analyze repeated measures data or pooled data subject to
measurement errors. In Chapter 2, we consider general cases of parametric and
nonparametric assumptions, comparing efficiency of pooled-unpooled samples and data
consisting of repeated measures. The applications of the proposed methods are illustrated
using the cholesterol biomarker data from a study of myocardial infarction.
The classical EL approach reflects a distribution-based interpretation of the notion of
likelihood (Owen, 2001). Note that the Neyman-Pearson concept to show the optimality
of statistical decision rules utilizes the density-function structure of likelihood-type tests.
Recently, Vexler and Gurevich (2010) proposed the density-based EL methodology to
construct efficient nonparametric tests that approximate Neyman-Pearson test statistics.
In the following chapters, we extend and adapt the density-based EL approach for
different testing problems, including tests of goodness-of-fit (Chapters 3 and 7.1) and
two-sample comparisons based on paired data (Chapters 4 and 5).
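To fix ideas about the benchmark being approximated, the toy example below computes the Neyman-Pearson log likelihood ratio for two fully specified normal densities. In practice the densities are unknown, which is precisely the gap the density-based EL methodology fills, so this sketch only illustrates the target statistic; the means, variance, and sample sizes are arbitrary illustrative choices.

```python
import random

def log_lr_normal(xs, mu0=0.0, mu1=1.0, sigma=1.0):
    """Log likelihood ratio log(f1/f0) for N(mu0, sigma^2) vs N(mu1, sigma^2).

    The quadratic terms cancel, leaving a linear function of each observation:
    log f1(x) - log f0(x) = ((mu1 - mu0)*x - (mu1**2 - mu0**2)/2) / sigma**2
    """
    return sum(((mu1 - mu0) * x - (mu1**2 - mu0**2) / 2.0) / sigma**2 for x in xs)

random.seed(3)
under_h0 = [random.gauss(0.0, 1.0) for _ in range(100)]
under_h1 = [random.gauss(1.0, 1.0) for _ in range(100)]
# By the Neyman-Pearson lemma, the most powerful test rejects for large
# values of this log-LR; for equal variances it reduces to thresholding
# the sample mean.
lr_h0 = log_lr_normal(under_h0)
lr_h1 = log_lr_normal(under_h1)
```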
Testing for distributional assumptions has been a major area of continuing statistical
research. Chapter 3 introduces a new and powerful density-based EL goodness-of-fit test
for the inverse Gaussian distribution. The inverse Gaussian distribution is commonly
used to model right-skewed data with positive support. It originates as the distribution
of the first passage time of Brownian motion with drift (Henze and Klar, 2012). Its
useful applications include reliability problems (Padgett and Tsoi,
1986), lifetime models (Chhikara and Folks, 1977), and accelerated life testing
(Bhattacharyya and Fries, 1982). Folks and Chhikara (1978, 1989) have presented several
properties and applications of this distribution. When applying the inverse Gaussian
model, it is critical to develop efficient goodness-of-fit tests. Mudholkar and Tian (2002)
presented the entropy-based goodness-of-fit test for the inverse Gaussian distribution that
strongly depends on the value of an integer parameter m. The difficulty of selecting the
optimal m complicates efficient implementation of the test in practice. In this chapter,
we propose an EL based goodness-of-fit test for the inverse Gaussian distribution that
improves on Mudholkar and Tian's (2002) test by eliminating the dependence on the
integer parameter m.
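The dependence on m can be seen directly in a Vasicek-type spacing estimator of entropy, the basic building block of such entropy-based tests: different window widths m yield different estimates, and hence differently behaved test statistics. The sketch below is illustrative only, uses one common boundary convention, and is not the Mudholkar-Tian statistic itself.

```python
import math
import random

def vasicek_entropy(xs, m):
    """Vasicek-type spacing estimator of differential entropy with window m.

    H_{m,n} = (1/n) * sum_i log( n * (X_(i+m) - X_(i-m)) / (2m) ),
    with order-statistic indices clipped to the sample range at the boundaries.
    """
    n = len(xs)
    s = sorted(xs)
    clip = lambda j: min(max(j, 0), n - 1)  # boundary convention
    total = 0.0
    for i in range(n):
        spacing = s[clip(i + m)] - s[clip(i - m)]
        total += math.log(n * spacing / (2.0 * m))
    return total / n

random.seed(5)
x = [random.gauss(0.0, 1.0) for _ in range(100)]
# The estimate -- and any test statistic built on it -- shifts with m:
estimates = {m: vasicek_entropy(x, m) for m in (2, 5, 10)}
# For reference, the true entropy of N(0,1) is 0.5 * log(2*pi*e), about 1.419.
```

Because no single m is best across alternatives, eliminating this tuning parameter is the practical motivation stated above.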
It is a common practice to conduct medical trials in order to compare a new therapy
with a standard of care based on paired data consisting of pre- and post-treatment
measurements. In the subsequent chapters (Chapters 4 and 5), we focus on developments
of efficient nonparametric likelihood methods for comparing two study groups based on
paired observations. In Chapter 4, we present an efficient distribution-free test to compare
two independent samples, where each is based on paired data. The statistical literature
has pointed out several issues related to the well-known classical tests for paired data
such as the t-test and Wilcoxon's test. For instance, the t-test is not appropriate for
data from strongly skewed distributions (see, e.g., Vexler et al., 2009); Wilcoxon's
test may break down completely when a nonconstant change is in effect
under the alternative hypothesis (see, e.g. Albers et al., 2001). In addition, the classical
Kolmogorov-Smirnov test is not designed to be based on paired data directly; therefore, it
may not be well suited to compare two samples based on paired observations. Due to the
drawbacks of these classical statistical procedures, Chapter 4 aims to provide an
alternative powerful nonparametric test based on paired data to compare treatment effects
between two study groups. We extend and modify the density-based EL ratio test
presented by Gurevich and Vexler (2011) to formulate an appropriate parametric
likelihood ratio test statistic corresponding to the hypothesis of our interest and then to
approximate the test statistic nonparametrically.
In many case-control studies, a great interest often lies in identifying treatment effects
within each therapy group as well as detecting between-group differences. To this end, in
a nonparametric setting, one can consider combining relevant nonparametric standard
procedures, for example, the Kolmogorov-Smirnov test and the Wilcoxon signed rank
test. The former test is a known procedure to compare distributions of two study groups,
whereas the latter one can be applied to detect treatment effects within each study group.
However, the use of the classical procedures commonly requires complex considerations
to combine the known nonparametric tests, e.g., considerations of combined p-values. In
Chapter 5, we propose simple nonparametric tests for three composite hypotheses related
to treatment effects to provide efficient tools that compare study groups utilizing paired
data. We adapt and extend the density-based EL methodology to deal with various testing
scenarios involved in the two-sample comparisons based on paired data. The proposed
technique is applied to compare two therapy strategies for treating children's attention deficit/hyperactivity disorder and severe mood dysregulation.
Parametric change point detection schemes based on the Shiryaev-Roberts approach
have been well addressed in the statistics and engineering literature. High efficiency of
such procedures can be partially explained by their known asymptotic optimal properties.
Recently, Shiryaev-Roberts based procedures were proposed and examined in applications
to the standard AMOC (at most one change) retrospective change point detection problems.
In Chapter 6, we review and extend parametric retrospective and sequential Shiryaev-Roberts based policies, examining different aspects of the non-asymptotic optimality of these procedures. We utilize the general principle of the Neyman-Pearson fundamental lemma to show that the Shiryaev-Roberts approach yields average most powerful procedures. We also introduce techniques to construct efficient retrospective tests
for multiple change-points detection. A real data example based on biomarker
measurements is provided to demonstrate implementation and effectiveness of new tests in
practice.
The last chapter of this dissertation outlines two possible directions for future work: (1) developing an EL ratio based goodness-of-fit test of normality based on several independent samples and errors in regression models; and (2) developing a simple likelihood ratio type test for independence between two random variables that requires neither the specification of any kind of dependence nor any assumptions on the forms of the data distributions.
CHAPTER 2
ESTIMATION AND TESTING BASED ON DATA
SUBJECT TO MEASUREMENT ERRORS: FROM
PARAMETRIC TO NON-PARAMETRIC
LIKELIHOOD METHODS
2.1 INTRODUCTION
Commonly, many biological and epidemiological studies deal with data subject to
measurement errors (MEs) attributed to instrumentation inaccuracies, within-subject
variation resulting from random fluctuations over time, etc. Ignoring the presence of ME
in data can result in biased or inconsistent estimation or testing. The statistical
literature proposed different methods for ME bias correction (e.g., Carroll et al., 1984,
1999; Carroll and Wand, 1991; Fuller, 1987; Liu and Liang, 1992; Schafer, 2001;
Stefanski, 1985; Stefanski and Carroll, 1987, 1990). Among others, one of the common
methods is to consider repeated measurements of biospecimens collecting sufficient
information for statistical inferences adjusted for ME effects (e.g., Hasabelnaby et al.,
1989). In practice, measurement processes based on bioassays can be costly and time-
consuming and can restrict the number of replicates of each individual available for
analysis or the number of individual biospecimens that can be used. It can follow that
investigators may not have enough observations to achieve the desired power or
efficiency in statistical inferences.
Dorfman (1943), Faraggi et al. (2003), Liu and Schisterman (2003), Liu et al. (2004),
Mumford et al. (2006), Schisterman et al. (2008, 2010), and Vexler et al. (2006, 2008,
2010, 2011) addressed pooling sampling strategies as an efficient approach to reduce the
overall cost of epidemiological studies. The basic idea of the pooling design is to pool
together individual biological samples (e.g., blood, plasma, serum or urine) and then
measure the pooled samples instead of each individual biospecimen. Since the pooling
design reduces the number of measurements without ignoring individual biospecimens,
the cost of the measurement process is reduced, but relevant information can still be
derived. Recently, it has been found that we can utilize a hybrid design that takes a
sample of both pooled and unpooled biospecimens to efficiently estimate unknown
parameters, allowing for MEs presence in the data without requiring repeated measures
(Schisterman et al., 2010).
In the context of the hybrid strategy, Schisterman et al. (2010) evaluated data that
follow normal distribution functions. In this chapter, we consider general cases of
parametric and nonparametric assumptions, comparing efficiency of pooled-unpooled
samples and data consisting of repeated measures. It should be noted that the repeated
measurement technique proposes to collect a large amount of information regarding just
nuisance parameters related to distribution functions of ME, whereas the pooled-
unpooled design provides observations that are informative regarding target variables
allowing for ME. Therefore, we show that the pooled-unpooled sampling strategy is more
efficient than the repeated measurement sampling procedure. We construct parametric
likelihoods based on both sampling methods. Additionally, in order to preserve
efficiencies of both strategies without parametric assumptions, we consider a
nonparametric approach using the empirical likelihood (EL) methodology (e.g., DiCiccio et al., 1989; Owen, 1988, 1991, 2001; Vexler et al., 2009, 2010; Vexler and Gurevich,
2010; Yu et al., 2010). We develop and apply novel EL ratio test statistics to create confidence interval estimators based on pooled-unpooled data and repeated measures
data. Despite the fact that many statistical inference procedures have been developed to
operate with data subject to ME, to our knowledge, relevant nonparametric likelihood
techniques and parametric likelihood methods have not been well addressed in the
literature.
The rest of this chapter is organized as follows. In Section 2.2, we present a general
form of the likelihood function based on repeated measures data and pooled-unpooled
data. We propose the EL methodology to make nonparametric inferences based on
repeated measures data and pooled-unpooled data in Section 2.3. We claim that the EL
technique based on the hybrid design provides a valuable technique to construct statistical
tests and estimators of parameters when MEs are present. To evaluate the proposed
approaches, we utilize Monte Carlo simulations in Section 2.4. We present an application
to cholesterol biomarker data from a study of coronary heart disease in Section 2.5. In
Section 2.6, we provide some concluding remarks.
2.2 PARAMETRIC INFERENCES
In this section, we derive general forms of the relevant likelihood functions. In each case,
we assume that the total number of biomarker measurements is fixed, say N; for example, N is the total number of measurements that a study budget allows us to execute.
2.2.1 PARAMETRIC LIKELIHOOD FUNCTIONS
2.2.1.1 PARAMETRIC LIKELIHOOD BASED ON REPEATED MEASURES
DATA
Suppose that we measure a biospecimen observing the score $Z_{ij} = X_i + \varepsilon_{ij}$, where the true values of biomarker measurements $X_i$ are independent identically distributed (i.i.d.) and $\varepsilon_{ij}$ are i.i.d. values of ME, $i = 1, \ldots, t$; $j = 1, \ldots, n_i$; $\sum_{i=1}^{t} n_i = N$. Thus, we assume that there is a subset of $t$ distinct bioassays, and each of them is repeatedly measured $n_i$ times. In this case, we can define the total number of available individual bioassays to be $T$, $T > t$, when we can consider obtaining a large number of individual biospecimens to have a low cost with respect to the high cost of measurement processes. We assume that $X_i$ and $\varepsilon_{ij}$ are independent. Firstly, we consider the simple normal case, say $X_i \sim N(\mu_x, \sigma_x^2)$ and $\varepsilon_{ij} \sim N(0, \sigma_\varepsilon^2)$. Accordingly, we observe $Z_{ij} \sim N(\mu_x, \sigma_x^2 + \sigma_\varepsilon^2)$. In this case, one can show that if $n_i = 1$ for all $i$, there are no unique solutions of the estimation of $\sigma_x^2$ and $\sigma_\varepsilon^2$ (non-identifiability). The observations $Z_{ij}$'s in each group $i$ are dependent because they are measured using the same bioassay. Note that if we fix the value of $X_i$, the $Z_{ij}$'s conditioned on $X_i$ are independent of each other; for example, in the case of $\varepsilon_{ij} \sim N(0, \sigma_\varepsilon^2)$, we have $Z_{ij} \mid X_i \sim N(X_i, \sigma_\varepsilon^2)$. In a general case, the likelihood function based on the repeated measures data has the general form of
$$L_R(\mu_x, \sigma_x^2, \sigma_\varepsilon^2 \mid Z) = \prod_{i=1}^{t} \int \Big[ f_X(u) \prod_{j=1}^{n_i} f_{Z \mid X}(Z_{ij} \mid u) \Big] du.$$
When the distributions of $X$ and $\varepsilon$ are known, we can obtain the specific likelihood functions, and further, we can also derive the maximum likelihood estimators of $\mu_x$, $\sigma_x^2$, and $\sigma_\varepsilon^2$. Well-known asymptotic results related to the maximum likelihood estimation give evaluations of properties of estimators based on the likelihood $L_R(\mu_x, \sigma_x^2, \sigma_\varepsilon^2 \mid Z)$.
2.2.1.2 PARAMETRIC LIKELIHOOD BASED ON POOLED AND
UNPOOLED DATA
We briefly address the basic concept of the pooling design. Let $T$ be the number of individual biospecimens available and $N$ be the total number of measurements that we can obtain due to a limited study budget. We obtain the pooled samples by randomly grouping individual samples into groups of size $p$, where $N = [T/p]$ is the number of pooled measurements, $p$ is the number of individual samples in a pooling group, and $[u]$ denotes the integer part of $u$. The pooling design requires a physical combination of specimens of the same group and a test of each pooled specimen, obtaining a single observation when the pooled sample is measured. Since the measurements are generally per unit of volume, we assume that the true measurement for a pooled set is the average of the true individual marker values in that group. In this case, taking into account that instruments applied to the measurement process can be sensitive and subject to some random exposure ME, we define a single observation to be the sum of the average of individual marker values and a value of ME. Note that, in accordance with the pooling literature, we assume that analysis of the biomarkers is restricted by the high cost of the measurement process, whereas access to a large number of individual biospecimens can be considered to have a relatively low cost.

In the hybrid design, we assume that $T$ distinct individual bioassays are available, but still we can provide just $N$ measurements ($N < T$). The proportion of pooled samples is $q \in [0, 1]$, and the pooling group size is $p$; namely, $T = pqN + (1-q)N$. Specifically, we can obtain pooled data by mixing $p$ individual bioassays together, and we therefore divide $pqN$ bioassays into $n_P$ groups, where $n_P = qN$. We measure the grouped biospecimens as single observations. Let $Z^P_j$, $j = 1, \ldots, n_P$, denote measurements of pooled bioassays. In accordance with the literature, we have
$$Z^P_j = \frac{1}{p} \sum_{i=(j-1)p+1}^{jp} X_i + \varepsilon_j$$
(see, e.g., Faraggi et al., 2003; Liu and Schisterman, 2003; Liu et al., 2004; Schisterman et al., 2008, 2010; Vexler et al., 2006, 2008, 2010, 2011). Hence, we can obtain that $Z^P_j$ are i.i.d. with the mean $\mu_x$ and the variance $\sigma_x^2/p + \sigma_\varepsilon^2$, namely,
$$Z^P_j \ \text{i.i.d.} \ \Big(\mu_x, \ \frac{\sigma_x^2}{p} + \sigma_\varepsilon^2 \Big).$$
The unpooled samples are based on $n_U = (1-q)N$ independent observations
$$Z^U_i = X_i + \varepsilon_i, \quad i = 1, \ldots, n_U.$$
In this case, we have $Z^U_i$ i.i.d. $(\mu_x, \ \sigma_x^2 + \sigma_\varepsilon^2)$.

Note that the pooled and unpooled samples are independent of each other. As a result, the likelihood function based on the combination of pooled and unpooled data has the form of
$$L_H(\mu_x, \sigma_x^2, \sigma_\varepsilon^2 \mid Z^P, Z^U) = \prod_{j=1}^{n_P} f_P(Z^P_j) \prod_{i=1}^{n_U} f_U(Z^U_i).$$
If we know the distribution functions of $X$ and $\varepsilon$, we can derive the likelihood functions according to the distributions of $Z^P_j$ and $Z^U_i$. Therefore, we can also obtain the corresponding theoretical maximum likelihood estimators of $(\mu_x, \sigma_x^2, \sigma_\varepsilon^2)$. Since the estimators follow the maximum likelihood methodology, we can easily show the asymptotic properties of the estimators.
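As a small illustration of the hybrid sampling scheme described above, the following sketch (our own, with hypothetical parameter values, not code from this work) generates pooled and unpooled measurements under the additive ME model:

```python
import numpy as np

rng = np.random.default_rng(1)
N, p, q = 100, 2, 0.5                 # measurement budget, pool size, pooling proportion
n_p = int(q * N)                      # number of pooled measurements
n_u = N - n_p                         # number of unpooled measurements
T = p * n_p + n_u                     # individual biospecimens required: pqN + (1-q)N
mu_x, var_x, var_e = 1.0, 1.0, 0.4    # hypothetical values matching the simulations below

x = rng.normal(mu_x, np.sqrt(var_x), T)           # true biomarker values
# pooled measurements: per-group averages plus one ME per measured pool
zp = x[:p * n_p].reshape(n_p, p).mean(axis=1) + rng.normal(0.0, np.sqrt(var_e), n_p)
# unpooled measurements: individual values plus ME
zu = x[p * n_p:] + rng.normal(0.0, np.sqrt(var_e), n_u)
```

Here var(zp) targets $\sigma_x^2/p + \sigma_\varepsilon^2$ and var(zu) targets $\sigma_x^2 + \sigma_\varepsilon^2$, so a budget of $N = 100$ measurements draws on $T = 150$ specimens.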
2.2.2 NORMAL CASE
In this subsection, we assume $X_i \sim N(\mu_x, \sigma_x^2)$ and $\varepsilon_{ij} \sim N(0, \sigma_\varepsilon^2)$. Then we obtain closed-form analytical solutions for the maximum likelihood estimators of the unknown parameters, $\mu_x$, $\sigma_x^2$, and $\sigma_\varepsilon^2$.
2.2.2.1 MAXIMUM LIKELIHOOD ESTIMATORS BASED ON REPEATED
MEASURES
Assume that $Z_{ij} = X_i + \varepsilon_{ij}$; $i = 1, \ldots, t$; $j = 1, \ldots, n_i$. By the additive property of the normal distributions, we have $Z_{ij} \sim N(\mu_x, \sigma_x^2 + \sigma_\varepsilon^2)$. Referring to Searle et al. (1992), the likelihood function is a well-known result that can be expressed as
$$L_R(\mu_x, \sigma_x^2, \sigma_\varepsilon^2 \mid Z) = \prod_{i=1}^{t} (2\pi)^{-n_i/2} (\sigma_\varepsilon^2)^{-(n_i-1)/2} (\sigma_\varepsilon^2 + n_i \sigma_x^2)^{-1/2} \exp\Big\{ -\frac{\sum_{j=1}^{n_i}(Z_{ij} - \bar{Z}_{i\cdot})^2}{2\sigma_\varepsilon^2} - \frac{n_i(\bar{Z}_{i\cdot} - \mu_x)^2}{2(\sigma_\varepsilon^2 + n_i \sigma_x^2)} \Big\},$$
where $\bar{Z}_{i\cdot} = n_i^{-1} \sum_{j=1}^{n_i} Z_{ij}$.

Under the assumption that the $n_i$'s are equal to $n$ (i.e., assuming balanced data), the log likelihood function is in the form of
$$\log L_R = -\frac{tn}{2}\log(2\pi) - \frac{t(n-1)}{2}\log \sigma_\varepsilon^2 - \frac{t}{2}\log(\sigma_\varepsilon^2 + n\sigma_x^2) - \frac{SSE}{2\sigma_\varepsilon^2} - \frac{SSA}{2(\sigma_\varepsilon^2 + n\sigma_x^2)} - \frac{tn(\bar{Z}_{\cdot\cdot} - \mu_x)^2}{2(\sigma_\varepsilon^2 + n\sigma_x^2)},$$
where $\bar{Z}_{\cdot\cdot} = (tn)^{-1}\sum_{i=1}^{t}\sum_{j=1}^{n} Z_{ij}$, $SSE = \sum_{i=1}^{t}\sum_{j=1}^{n}(Z_{ij} - \bar{Z}_{i\cdot})^2$, and $SSA = n\sum_{i=1}^{t}(\bar{Z}_{i\cdot} - \bar{Z}_{\cdot\cdot})^2$.
Let $\lambda = \sigma_\varepsilon^2 + n\sigma_x^2$. By taking the partial derivatives of $\log L_R$ with respect to $\mu_x$, $\sigma_x^2$, and $\sigma_\varepsilon^2$ and setting the equations equal to zero, we obtain the maximum likelihood equations with the roots
$$\hat{\mu}_x = \bar{Z}_{\cdot\cdot}, \quad \hat{\sigma}_\varepsilon^2 = \frac{SSE}{t(n-1)}, \quad \text{and} \quad \hat{\lambda} = \frac{SSA}{t}.$$
Thus, the maximum likelihood estimator of $\mu_x$ is $\bar{Z}_{\cdot\cdot}$, and the maximum likelihood estimators of $\sigma_x^2$ and $\sigma_\varepsilon^2$ are $(\hat{\lambda} - \hat{\sigma}_\varepsilon^2)/n$ and $SSE/(t(n-1))$, respectively, when $\hat{\lambda} \ge \hat{\sigma}_\varepsilon^2$; and $0$ and $(SSA + SSE)/(tn)$, respectively, when $\hat{\lambda} < \hat{\sigma}_\varepsilon^2$.

Also, the large-sample variances and covariance of $\hat{\sigma}_x^2$ and $\hat{\sigma}_\varepsilon^2$ are given by
$$\operatorname{var}(\hat{\sigma}_x^2) = \frac{2}{tn^2}\Big[(\sigma_\varepsilon^2 + n\sigma_x^2)^2 + \frac{\sigma_\varepsilon^4}{n-1}\Big], \quad \operatorname{var}(\hat{\sigma}_\varepsilon^2) = \frac{2\sigma_\varepsilon^4}{t(n-1)}, \quad \operatorname{cov}(\hat{\sigma}_x^2, \hat{\sigma}_\varepsilon^2) = -\frac{2\sigma_\varepsilon^4}{tn(n-1)}$$
(for details, see Searle et al., 1992).

By the property of the maximum likelihood estimators, it is clear that asymptotically these estimators follow a multivariate normal distribution:
$$\sqrt{t}\,\big[\hat{\mu}_x - \mu_x, \ \hat{\sigma}_x^2 - \sigma_x^2, \ \hat{\sigma}_\varepsilon^2 - \sigma_\varepsilon^2\big]^{\mathsf T} \to N(0, \Sigma) \quad \text{as } t \to \infty,$$
where
$$\Sigma = \begin{bmatrix} \dfrac{\sigma_\varepsilon^2 + n\sigma_x^2}{n} & 0 & 0 \\ 0 & \dfrac{2}{n^2}\Big[(\sigma_\varepsilon^2 + n\sigma_x^2)^2 + \dfrac{\sigma_\varepsilon^4}{n-1}\Big] & -\dfrac{2\sigma_\varepsilon^4}{n(n-1)} \\ 0 & -\dfrac{2\sigma_\varepsilon^4}{n(n-1)} & \dfrac{2\sigma_\varepsilon^4}{n-1} \end{bmatrix}.$$
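Under the balanced normal model, the closed-form ML computations of Section 2.2.2.1 can be sketched as follows (a minimal illustration of our own with hypothetical parameter values; the truncation at zero handles the non-negativity of the variance component):

```python
import numpy as np

rng = np.random.default_rng(0)
t, n = 2000, 2                              # t bioassays, n replicates each
mu_x, var_x, var_e = 1.0, 1.0, 0.4          # hypothetical true parameter values

x = rng.normal(mu_x, np.sqrt(var_x), t)     # true biomarker values
z = x[:, None] + rng.normal(0.0, np.sqrt(var_e), (t, n))   # repeated measures Z_ij

zbar_i = z.mean(axis=1)                     # per-assay means
zbar = z.mean()                             # grand mean
sse = ((z - zbar_i[:, None]) ** 2).sum()    # within-assay sum of squares
ssa = n * ((zbar_i - zbar) ** 2).sum()      # between-assay sum of squares

mu_hat = zbar                               # MLE of mu_x
var_e_hat = sse / (t * (n - 1))             # MLE of sigma_e^2 (interior case)
var_x_hat = max((ssa / t - var_e_hat) / n, 0.0)   # MLE of sigma_x^2, truncated at 0
```

With a budget of $N = tn$ measurements, increasing $n$ sharpens the estimate of $\sigma_\varepsilon^2$ while leaving fewer distinct assays to inform $\mu_x$ and $\sigma_x^2$, which is the trade-off examined in Section 2.4.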
2.2.2.2 MAXIMUM LIKELIHOOD ESTIMATORS UNDER THE HYBRID DESIGN

Since we assume that $X_i \sim N(\mu_x, \sigma_x^2)$ and $\varepsilon \sim N(0, \sigma_\varepsilon^2)$, we can write
$$Z^P_j \sim N\Big(\mu_x, \ \frac{\sigma_x^2}{p} + \sigma_\varepsilon^2\Big), \ j = 1, \ldots, n_P, \quad \text{and} \quad Z^U_i \sim N(\mu_x, \ \sigma_x^2 + \sigma_\varepsilon^2), \ i = 1, \ldots, n_U.$$
The likelihood function based on pooled-unpooled data then takes the form
$$L_H(\mu_x, \sigma_x^2, \sigma_\varepsilon^2 \mid Z^P, Z^U) = \prod_{j=1}^{n_P} \Big\{2\pi\Big(\frac{\sigma_x^2}{p} + \sigma_\varepsilon^2\Big)\Big\}^{-1/2} \exp\Big\{-\frac{(Z^P_j - \mu_x)^2}{2(\sigma_x^2/p + \sigma_\varepsilon^2)}\Big\} \prod_{i=1}^{n_U} \big\{2\pi(\sigma_x^2 + \sigma_\varepsilon^2)\big\}^{-1/2} \exp\Big\{-\frac{(Z^U_i - \mu_x)^2}{2(\sigma_x^2 + \sigma_\varepsilon^2)}\Big\}.$$
Differentiating the log likelihood function, $\log L_H(\mu_x, \sigma_x^2, \sigma_\varepsilon^2 \mid Z^P, Z^U)$, with respect to $\mu_x$, $\sigma_x^2$, and $\sigma_\varepsilon^2$, respectively, we obtain the maximum likelihood estimators of $\mu_x$, $\sigma_x^2$, and $\sigma_\varepsilon^2$ given by
$$\hat{\mu}_x = \Big(\frac{n_P \bar{Z}^P}{\hat{\sigma}_x^2/p + \hat{\sigma}_\varepsilon^2} + \frac{n_U \bar{Z}^U}{\hat{\sigma}_x^2 + \hat{\sigma}_\varepsilon^2}\Big) \Big/ \Big(\frac{n_P}{\hat{\sigma}_x^2/p + \hat{\sigma}_\varepsilon^2} + \frac{n_U}{\hat{\sigma}_x^2 + \hat{\sigma}_\varepsilon^2}\Big),$$
$$\hat{\sigma}_x^2 = \frac{p}{p-1}\Big[\frac{1}{n_U}\sum_{i=1}^{n_U}(Z^U_i - \hat{\mu}_x)^2 - \frac{1}{n_P}\sum_{j=1}^{n_P}(Z^P_j - \hat{\mu}_x)^2\Big], \quad \hat{\sigma}_\varepsilon^2 = \frac{1}{p-1}\Big[\frac{p}{n_P}\sum_{j=1}^{n_P}(Z^P_j - \hat{\mu}_x)^2 - \frac{1}{n_U}\sum_{i=1}^{n_U}(Z^U_i - \hat{\mu}_x)^2\Big]. \quad (2.1)$$
Note that the estimator of $\mu_x$ has a structure that weighs estimations based on pooled and unpooled data in a similar manner to a Bayes point estimator used in normal-normal models (see Carlin and Louis, 2008). In this case, we show that we can obtain inference regarding the parameters by using this hybrid approach without repeating measures on the same individual, which is the most common strategy to solve ME problems.

By virtue of the properties of the maximum likelihood estimators, the distribution of the estimators (2.1) is asymptotically
$$\sqrt{N}\,\big[\hat{\mu}_x - \mu_x, \ \hat{\sigma}_x^2 - \sigma_x^2, \ \hat{\sigma}_\varepsilon^2 - \sigma_\varepsilon^2\big]^{\mathsf T} \to N(0, I^{-1}),$$
where $I^{-1}$ is the inverse of the Fisher information matrix,
$$I = -E \begin{bmatrix} \partial^2 l/\partial \mu_x^2 & \partial^2 l/\partial \mu_x \partial \sigma_x^2 & \partial^2 l/\partial \mu_x \partial \sigma_\varepsilon^2 \\ \partial^2 l/\partial \sigma_x^2 \partial \mu_x & \partial^2 l/\partial (\sigma_x^2)^2 & \partial^2 l/\partial \sigma_x^2 \partial \sigma_\varepsilon^2 \\ \partial^2 l/\partial \sigma_\varepsilon^2 \partial \mu_x & \partial^2 l/\partial \sigma_\varepsilon^2 \partial \sigma_x^2 & \partial^2 l/\partial (\sigma_\varepsilon^2)^2 \end{bmatrix},$$
where $l = \log L_H(\mu_x, \sigma_x^2, \sigma_\varepsilon^2 \mid Z^P, Z^U)$ is the corresponding log likelihood function (for details, see Appendix A.1.1).
2.2.2.3 REMARKS ON THE NORMAL CASE
As shown above, when biomarker values and MEs are normally distributed, the maximum likelihood estimators exist and can be easily obtained. It is also clear that we can consider these estimators as least squares estimators in a nonparametric context.
However, when data are not from normal distributions, it may be very complicated or
even be infeasible to extract the distributions of repeated measures data or pooled and
unpooled data (e.g., Vexler et al., 2011). For example, in various situations, closed
analytical forms of the likelihood functions cannot be found based on pooled data
because the density function of the pooled biospecimen values involves complex
convolutions of p-individual biospecimen values. Consequently, it is reasonable to
consider efficient nonparametric inference methodologies based on the repeated measures
data or pooled-unpooled data.
2.3 EMPIRICAL LIKELIHOOD METHOD
In this section, we apply the EL methodology to the statement of the problem in this
chapter. The EL technique has been extensively developed as a nonparametric
approximation of the parametric likelihood approach (e.g., DiCiccio et al., 1989; Owen,
1988, 1991, 2001; Vexler et al. 2009, 2010; Vexler and Gurevich, 2010; Yu et al., 2010).
We begin by outlining the EL ratio method and then modify the EL ratio test to construct confidence interval estimators and tests based on repeated measures data and pooled-unpooled data.
2.3.1 THE EMPIRICAL LIKELIHOOD RATIO TEST
Consider the following simple testing problem that is stated nonparametrically. Suppose i.i.d. random variables $X_1, \ldots, X_n$ with $E(X_1) = \mu$ and $E|X_1|^3 < \infty$ are observable. The problem of interest, for example, is to test the hypothesis
$$H_0: \mu = \mu_0 \quad \text{versus} \quad H_1: \mu \neq \mu_0, \quad (2.2)$$
where $\mu_0$ is fixed and known. To test the hypothesis in equation (2.2), we can write the EL function as $L(\mu) = \prod_{i=1}^{n} p_i$, where we assume the $p_i$'s to have values that maximize $\prod_{i=1}^{n} p_i$ given empirical constraints. The empirical constraints correspond to the hypotheses' settings. Then, under the null hypothesis in equation (2.2), we maximize $\prod_{i=1}^{n} p_i$ subject to $\sum_{i=1}^{n} p_i = 1$ and $\sum_{i=1}^{n} p_i X_i = \mu_0$. Here the condition $\sum_{i=1}^{n} p_i X_i = \mu_0$ is an empirical form of $E(X_1) = \mu_0$. Using the Lagrange multipliers, one can show that the maximum EL function has the form of
$$L(\mu_0) = \prod_{i=1}^{n} \frac{1}{n\{1 + \lambda(X_i - \mu_0)\}},$$
where $\lambda$ is a root of
$$\sum_{i=1}^{n} \frac{X_i - \mu_0}{1 + \lambda(X_i - \mu_0)} = 0.$$
Similarly, under the alternative hypothesis, the maximum EL function has the simple form $\prod_{i=1}^{n} n^{-1} = n^{-n}$. As a consequence, the 2log EL ratio test for equation (2.2) is based on
$$\ell(\mu_0) = -2\log\big[L(\mu_0)\big/n^{-n}\big] = 2\sum_{i=1}^{n} \log\{1 + \lambda(X_i - \mu_0)\}.$$
It is proven in Owen (2001) that the 2log EL ratio, $\ell(\mu_0)$, asymptotically follows a $\chi_1^2$ distribution as $n \to \infty$. Thus, we reject the null hypothesis at a significance level $\alpha$ if $\ell(\mu_0) > \chi^2_{1, 1-\alpha}$. Furthermore, we can construct the confidence interval estimator of $\mu$ as
$$\big\{\mu: \ell(\mu) \le \chi^2_{1, 1-\alpha}\big\}.$$
(Here, $\chi^2_{1, 1-\alpha}$ is the $100(1-\alpha)$th percentile of a $\chi^2$ distribution with one degree of freedom.)
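The construction above takes only a few lines to implement. The sketch below (our own illustration, not code from this work) solves the Lagrange equation for $\lambda$ by bisection, exploiting that the score function is strictly decreasing on the admissible range of $\lambda$:

```python
import numpy as np

def el_ratio(x, mu0):
    """2*log empirical likelihood ratio for H0: E[X] = mu0 (Owen-type EL for a mean)."""
    d = np.asarray(x, float) - mu0
    if d.min() >= 0 or d.max() <= 0:
        return np.inf                        # mu0 outside the data's convex hull
    # admissible lambda keeps all weights positive: 1 + lam * d_i > 0 for every i
    lo = (-1 + 1e-10) / d.max()
    hi = (-1 + 1e-10) / d.min()
    score = lambda lam: np.sum(d / (1.0 + lam * d))   # strictly decreasing in lam
    for _ in range(200):                     # bisection for the root of the score
        mid = 0.5 * (lo + hi)
        if score(mid) > 0:
            lo = mid
        else:
            hi = mid
    lam = 0.5 * (lo + hi)
    return 2.0 * np.sum(np.log1p(lam * d))

rng = np.random.default_rng(7)
x = rng.normal(1.0, 1.0, 200)               # hypothetical sample with true mean 1
stat = el_ratio(x, 1.0)                     # compare with the chi2(1) 95% point, 3.841
```

A confidence interval for the mean collects the values $\mu$ with `el_ratio(x, mu)` at most 3.841; the statistic is zero at the sample mean and grows as $\mu_0$ moves away from it.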
2.3.2 THE EMPIRICAL LIKELIHOOD METHOD BASED
ON REPEATED MEASURES DATA
Following the statement mentioned in Section 2.2, we have correlated data with repeated measures. In order to obtain an i.i.d. sample, we utilize the fact that $Z_{i_1 j}$ is independent of $Z_{i_2 k}$ when $i_1 \neq i_2$. Therefore, we give an EL function for the block sample means $\bar{Z}_{i\cdot} = n_i^{-1} \sum_{j=1}^{n_i} Z_{ij}$, $i = 1, \ldots, t$, in a similar manner to the blockwise EL method given in Kitamura (1997). Then, the random variables become $\bar{Z}_{1\cdot}, \ldots, \bar{Z}_{t\cdot}$, and the corresponding EL function for $\mu_x$ is given by
$$L_R(\mu_x) = \prod_{i=1}^{t} \frac{1}{t\{1 + \lambda(\bar{Z}_{i\cdot} - \mu_x)\}},$$
where $\lambda$ is a root of
$$\sum_{i=1}^{t} \frac{\bar{Z}_{i\cdot} - \mu_x}{1 + \lambda(\bar{Z}_{i\cdot} - \mu_x)} = 0.$$
In this case, the 2log EL ratio test statistic is in the form of
$$\ell_R(\mu_x) = 2\sum_{i=1}^{t} \log\{1 + \lambda(\bar{Z}_{i\cdot} - \mu_x)\}.$$

Proposition 2.1. Assume $E|\bar{Z}_{1\cdot}|^3 < \infty$. Then the 2log EL ratio, $\ell_R(\mu_x)$, follows a $\chi_1^2$ distribution, where $n_i \ge 1$, $1 \le i \le t$, as $t \to \infty$.

Proof. See Appendix A.1.2.1.

The associated confidence interval estimator is then given by $\{\mu_x: \ell_R(\mu_x) \le \chi^2_{1, 1-\alpha}\}$, where $\chi^2_{1, 1-\alpha}$ is the $100(1-\alpha)$th percentile of a $\chi^2$ distribution with one degree of freedom.
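A quick numerical check of this blockwise reduction (a sketch of our own with hypothetical parameter values): the block means are i.i.d. with mean $\mu_x$ and variance $\sigma_x^2 + \sigma_\varepsilon^2/n$, so the standard one-sample EL machinery of Section 2.3.1 applies to them directly.

```python
import numpy as np

rng = np.random.default_rng(2)
t, n = 5000, 4                      # t independent bioassays, n replicates each
mu_x, var_x, var_e = 1.0, 1.0, 0.4  # hypothetical true parameter values

x = rng.normal(mu_x, np.sqrt(var_x), t)
z = x[:, None] + rng.normal(0.0, np.sqrt(var_e), (t, n))   # correlated within rows
zbar = z.mean(axis=1)   # i.i.d. block means: mean mu_x, variance var_x + var_e / n
```

Averaging within blocks removes the within-assay dependence at the price of reducing the effective sample size from $tn$ observations to $t$ block means.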
2.3.3 THE EMPIRICAL LIKELIHOOD METHOD BASED
ON POOLED-UNPOOLED DATA
In this section, we consider two distribution-free alternatives to the parametric likelihood method mentioned in Section 2.2.1.2. To this end, we apply the EL technique. Note that, in contrast to data that consist of repeated measures, in this section we use data that are based on independent observations. Consequently, we can introduce a combined EL function for the mean based on two independent samples, that is, $Z^P_1, \ldots, Z^P_{n_P}$ i.i.d. $(\mu_x, \sigma_x^2/p + \sigma_\varepsilon^2)$ and $Z^U_1, \ldots, Z^U_{n_U}$ i.i.d. $(\mu_x, \sigma_x^2 + \sigma_\varepsilon^2)$, representing measurements that correspond to pooled and unpooled biospecimens, respectively. Under the null hypothesis $H_0: \mu_x = \mu_0$, the EL function for $\mu_x$ can be presented in the form of
$$L(\mu_0) = \prod_{j=1}^{n_P} \frac{1}{n_P\{1 + \lambda_P(Z^P_j - \mu_0)\}} \prod_{i=1}^{n_U} \frac{1}{n_U\{1 + \lambda_U(Z^U_i - \mu_0)\}},$$
where $\lambda_P$ and $\lambda_U$ are roots of the equations
$$\sum_{j=1}^{n_P} \frac{Z^P_j - \mu_0}{1 + \lambda_P(Z^P_j - \mu_0)} = 0 \quad \text{and} \quad \sum_{i=1}^{n_U} \frac{Z^U_i - \mu_0}{1 + \lambda_U(Z^U_i - \mu_0)} = 0.$$
Finally, the 2log EL ratio test statistic can be given in the form of
$$\ell_1(\mu_0) = 2\sum_{j=1}^{n_P} \log\{1 + \lambda_P(Z^P_j - \mu_0)\} + 2\sum_{i=1}^{n_U} \log\{1 + \lambda_U(Z^U_i - \mu_0)\}. \quad (2.3)$$
In a similar manner to common EL considerations, one can show that the statistics $2\sum_{j=1}^{n_P} \log\{1 + \lambda_P(Z^P_j - \mu_0)\}$ and $2\sum_{i=1}^{n_U} \log\{1 + \lambda_U(Z^U_i - \mu_0)\}$ each asymptotically follow a $\chi_1^2$ distribution. By virtue of the additive property of $\chi^2$ distributions, the 2log EL ratio, $\ell_1(\mu_0)$, has an asymptotic $\chi^2$ distribution with two degrees of freedom. Thus, we formulate the next proposition.

Proposition 2.2. Let $E|Z^P_1|^3 < \infty$ and $E|Z^U_1|^3 < \infty$. Then the 2log EL ratio, $\ell_1(\mu_0)$, has a $\chi_2^2$ distribution as $n_P, n_U \to \infty$.

Proof. See Appendix A.1.2.2.

The corresponding confidence interval estimator is
$$\{\mu_x: \ell_1(\mu_x) \le \chi^2_{2, 1-\alpha}\}, \quad (2.4)$$
where $\chi^2_{2, 1-\alpha}$ is the $100(1-\alpha)$th percentile of a $\chi^2$ distribution with two degrees of freedom.

In practice, to execute the procedure above, we can directly use standard programs related to the classical EL ratio tests; for example, the code el.test of the R software can be utilized to conduct the EL confidence interval estimation (2.4).
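A minimal sketch of the simple combined statistic in equation (2.3) (our own illustration with hypothetical parameter values, not code from this work): compute a one-sample 2log EL ratio for each of the pooled and unpooled samples and sum them, comparing against the chi-squared critical value with two degrees of freedom.

```python
import numpy as np

def el2log(x, mu0):
    # one-sample 2*log EL ratio for H0: E[X] = mu0, via bisection on lambda
    d = np.asarray(x, float) - mu0
    if d.min() >= 0 or d.max() <= 0:
        return np.inf                      # mu0 outside the data's convex hull
    lo, hi = (-1 + 1e-10) / d.max(), (-1 + 1e-10) / d.min()
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if np.sum(d / (1.0 + mid * d)) > 0:
            lo = mid
        else:
            hi = mid
    return 2.0 * np.sum(np.log1p(0.5 * (lo + hi) * d))

rng = np.random.default_rng(4)
p, n_p, n_u, mu0 = 2, 50, 50, 1.0
zp = rng.normal(1.0, np.sqrt(1.0 / p + 0.4), n_p)   # pooled sample
zu = rng.normal(1.0, np.sqrt(1.0 + 0.4), n_u)       # unpooled sample
l1 = el2log(zp, mu0) + el2log(zu, mu0)              # combined statistic, two terms
reject = l1 > 5.991                                 # chi2(2) 95% critical value
```

Because the two samples are independent, the two one-degree-of-freedom terms add to an asymptotic two-degree-of-freedom reference distribution, as stated in Proposition 2.2.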
The EL technique mentioned above does not use an empirical version of the rule
$$E(Z^P_1 - \mu_x)^2 = \frac{1}{p} E(Z^U_1 - \mu_x)^2 + \Big(1 - \frac{1}{p}\Big)\sigma_\varepsilon^2 \quad (2.5)$$
that connects the second moments derived from pooled and unpooled observations. Intuitively, using a constraint related to equation (2.5), one can increase the power of the EL approach. Consider the EL function for $\mu_x$ under the null hypothesis,
$$L_C(\mu_0) = \max\Big\{ \prod_{j=1}^{n_P} \omega_j \prod_{i=1}^{n_U} \delta_i : \ \sum_j \omega_j = 1, \ \sum_i \delta_i = 1, \ \sum_j \omega_j Z^P_j = \mu_0, \ \sum_i \delta_i Z^U_i = \mu_0, \ \sum_j \omega_j (Z^P_j - \mu_0)^2 = \frac{\hat{\sigma}_{x0}^2}{p} + \hat{\sigma}_{\varepsilon 0}^2, \ \sum_i \delta_i (Z^U_i - \mu_0)^2 = \hat{\sigma}_{x0}^2 + \hat{\sigma}_{\varepsilon 0}^2 \Big\},$$
as an alternative to the simple EL function $L(\mu_0)$. Here, $\hat{\sigma}_{x0}^2$ and $\hat{\sigma}_{\varepsilon 0}^2$ are the estimators from equation (2.1) that are defined under the null hypothesis, i.e., with $\hat{\mu}_x$ replaced by $\mu_0$. The maximizing weights $\omega_j$ and $\delta_i$ follow from Lagrange multipliers that are roots of the equations mentioned under the max operator in the definition of $L_C(\mu_0)$.

Likewise, under the alternative hypothesis, we maximize the EL function $\prod_{j=1}^{n_P} \omega_j \prod_{i=1}^{n_U} \delta_i$ subject to
$$\sum_j \omega_j = 1, \ \sum_i \delta_i = 1, \ \sum_j \omega_j Z^P_j = \sum_i \delta_i Z^U_i = \hat{\mu}_x, \ \sum_j \omega_j (Z^P_j - \hat{\mu}_x)^2 = \frac{\hat{\sigma}_x^2}{p} + \hat{\sigma}_\varepsilon^2, \ \sum_i \delta_i (Z^U_i - \hat{\mu}_x)^2 = \hat{\sigma}_x^2 + \hat{\sigma}_\varepsilon^2, \quad (2.6)$$
where $\hat{\mu}_x$, $\hat{\sigma}_x^2$, and $\hat{\sigma}_\varepsilon^2$ are the maximum likelihood estimators from equation (2.1). Thus, the EL under the alternative hypothesis, $L_C(\hat{\mu}_x)$, is obtained by numerically deriving the corresponding Lagrange multipliers and weights using the equations (2.6). As a result, the corresponding 2log EL ratio test statistic is
$$\ell_2(\mu_0) = 2\big[\log L_C(\hat{\mu}_x) - \log L_C(\mu_0)\big]. \quad (2.7)$$
Note that, following Qin and Lawless (1994), $\ell_2(\mu_0)$ is asymptotically equivalent to the maximum log EL ratio test statistic. By virtue of results mentioned in Qin and Lawless (1994), $\ell_2(\mu_0)$ asymptotically follows a $\chi_1^2$ distribution. Then we reject the null hypothesis at a significance level $\alpha$ when
$$\ell_2(\mu_0) > \chi^2_{1, 1-\alpha},$$
where $\chi^2_{1, 1-\alpha}$ is the $100(1-\alpha)$th percentile of a $\chi_1^2$ distribution. Moreover, the corresponding confidence interval is $\{\mu_x: \ell_2(\mu_x) \le \chi^2_{1, 1-\alpha}\}$, where $\chi^2_{1, 1-\alpha}$ is the $100(1-\alpha)$th percentile of a $\chi_1^2$ distribution.
The Monte Carlo simulation study presented in the next section examines the
performance of each EL method mentioned above.
2.4 MONTE CARLO EXPERIMENTS
In this section, we conducted an extensive Monte Carlo study to evaluate the performance
of the parametric and nonparametric likelihood methods proposed in Sections 2.2 and 2.3.
2.4.1 SIMULATION SETTINGS
Examining the repeated measures sampling method, we randomly generated samples of values $X_i$ from a normal distribution with mean $E(X_1) = \mu_x$ and variance $\operatorname{var}(X_1) = \sigma_x^2$. We let $n$ denote the number of replicates for each subject. For simplicity, we assumed that each subject had the same number of replicates (i.e., assuming balanced data). Then, in a similar manner, we randomly generated normally distributed MEs, $\varepsilon_{ij}$'s, having $E(\varepsilon_{11}) = 0$ and $\operatorname{var}(\varepsilon_{11}) = \sigma_\varepsilon^2$, $i = 1, \ldots, t$; $j = 1, \ldots, n$. Therefore, we constructed samples of $Z_{ij} = X_i + \varepsilon_{ij}$. Each sample had $N = tn$ observations.

To obtain the hybrid samples, we first generated a sample of size $T$, where $T = pqN + (1-q)N$, $X_i \sim N(\mu_x, \sigma_x^2)$, $i = 1, \ldots, T$, to represent available individual bioassays. Then we proceeded to generate pooled data. To this end, we pooled $pqN$, $q \in [0, 1]$, samples of $X$'s to constitute pooled data, where we assumed $qN$ to be an integer and the $X_i$'s, $i = 1, \ldots, T$, were i.i.d. random samples from a normal distribution with mean $E(X_1) = \mu_x$ and $\operatorname{var}(X_1) = \sigma_x^2$. Following the pooling literature, if there were no MEs, we assumed the average values of the pooled biospecimens, $\bar{X}^P_j = p^{-1}\sum_{i=(j-1)p+1}^{jp} X_i$, to be observed, and we could represent them as the measurements of pooled bioassays. We took the remaining $(1-q)N$ observations $X_i$ as individual measurements. For each observation, we randomly generated a ME from a normal distribution. Combining the pooled samples, $Z^P_j = \bar{X}^P_j + \varepsilon_j$, $j = 1, \ldots, qN$, with the unpooled samples, $Z^U_i = X_i + \varepsilon_i$, we obtained pooled-unpooled data with the total sample size $N$ equal to that in the Monte Carlo evaluations related to the repeated measures approaches.

To evaluate the performance of the proposed methods, we applied the following simulation settings: the fixed significance level was 0.05; $\mu_x = 1$ and $\sigma_x^2 = 1$; $\sigma_\varepsilon^2 = 0.4, 1$; $n = 2, 5, 10$; the pooling group size $p = 2, 5, 10$; the pooling proportion $q = 0.5$; the total sample size $N = 100, 300$. For each set of parameters, there were 10,000 (Monte Carlo) data generations. In this section, following the pooling literature, we assumed that the simulated analysis of biomarkers was restricted to execute just $N$ measurements and that $T = pqN + (1-q)N$ individual biospecimens are available when the hybrid design was compared with the repeated measures sampling method. The Monte Carlo simulation results are presented in the next subsection.
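The Monte Carlo logic of this subsection can be sketched as follows (a scaled-down illustration of our own, with fewer repetitions than the study above; it checks the coverage of a simple normal-theory interval for $\mu_x$ built from the i.i.d. block means):

```python
import numpy as np

rng = np.random.default_rng(3)
mu_x, var_x, var_e = 1.0, 1.0, 0.4
t, n, reps = 50, 2, 2000                   # N = t*n = 100 measurements per sample
cover = 0
for _ in range(reps):
    x = rng.normal(mu_x, np.sqrt(var_x), t)
    z = x[:, None] + rng.normal(0.0, np.sqrt(var_e), (t, n))
    zbar_i = z.mean(axis=1)                # i.i.d. block means
    mu_hat = zbar_i.mean()
    se = zbar_i.std(ddof=1) / np.sqrt(t)   # standard error from the block means
    cover += abs(mu_hat - mu_x) <= 1.96 * se
print(cover / reps)                        # close to the nominal 0.95
```

The full study replaces this simple interval with the parametric and EL intervals of Sections 2.2 and 2.3 and uses 10,000 repetitions per parameter configuration.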
2.4.2 MONTE CARLO OUTPUTS
Table 2.1 shows the estimated parameters based on the repeated measures data using the parametric likelihood method. The results show that as the number of replicates increases, the standard errors of the estimates of $\sigma_\varepsilon^2$ decrease, indicating that the estimation of $\sigma_\varepsilon^2$ improves as the number of replicates increases. Apparently, the Monte Carlo standard errors of the estimators of $\mu_x$ and $\sigma_x^2$ increase when we increase the number of replicates.

To accomplish the efficiency comparison between the repeated measures strategy and the hybrid design strategy, we provide the Monte Carlo properties of the maximum likelihood estimates based on pooled-unpooled data in Table 2.1. Table 2.1 shows that the Monte Carlo standard errors of the estimates for $\mu_x$ based on pooled-unpooled data are clearly less than those of the corresponding estimates that utilize repeated measures. One observed advantage is that the estimation for $\mu_x$ based on pooled-unpooled data is very accurate when the total number of measurements is fixed at the same level. Another advantage is that the standard errors of the estimates for $\mu_x$ based on pooled-unpooled data are much smaller than those of the corresponding estimates using repeated measures data, as shown in Table 2.1.
Table 2.1: The Monte Carlo evaluations of the maximum likelihood estimates based on repeated
measurements and the hybrid design
Sample Size  |  Replicates n / Pooling Size p  |  Parameters $(\mu_x, \sigma_x^2, \sigma_\varepsilon^2)$  |  Estimates $\hat{\mu}_x$, $\hat{\sigma}_x^2$, $\hat{\sigma}_\varepsilon^2$  |  Standard Errors SE$(\hat{\mu}_x)$, SE$(\hat{\sigma}_x^2)$, SE$(\hat{\sigma}_\varepsilon^2)$
Repeated Measurements:
N=100 n=2 (1, 1, 0.4) 1.0021 0.9781 0.3997 0.1553 0.2410 0.0790
(1, 1, 1.0) 1.0006 0.9688 0.9984 0.1726 0.3106 0.1994
n=5 (1, 1, 0.4) 0.9966 0.9462 0.3990 0.2328 0.3305 0.0623
(1, 1, 1.0) 1.0015 0.9362 0.9998 0.2442 0.3688 0.1570
n=10 (1, 1, 0.4) 1.0026 0.8951 0.3999 0.3209 0.4346 0.0597
(1, 1, 1.0) 1.0044 0.8917 0.9995 0.3299 0.4690 0.1501
N=300 n=2 (1, 1, 0.4) 0.9987 0.9921 0.3999 0.0889 0.1405 0.0455
(1, 1, 1.0) 1.0005 0.9883 0.9999 0.0995 0.1803 0.1162
n=5 (1, 1, 0.4) 0.9995 0.9797 0.3998 0.1356 0.1950 0.0365
(1, 1, 1.0) 0.9990 0.9766 0.9990 0.1409 0.2181 0.0906
n=10 (1, 1, 0.4) 0.9985 0.9682 0.3997 0.1864 0.2633 0.0344
(1, 1, 1.0) 0.9985 0.9660 1.0002 0.1914 0.2782 0.0861
Hybrid Design:
N=100 p=2 (1, 1, 0.4) 1.0015 1.0160 0.4365 0.1048 0.6712 0.4579
(1, 1, 1.0) 1.0007 1.0754 1.0098 0.1327 1.0058 0.7275
p=5 (1, 1, 0.4) 0.9994 1.0045 0.3889 0.0924 0.3857 0.1662
(1, 1, 1.0) 1.0008 1.0053 0.9880 0.1240 0.5932 0.3217
p=10 (1, 1, 0.4) 0.9993 1.0049 0.3918 0.0871 0.3341 0.1164
(1, 1, 1.0) 0.9996 1.0050 0.9836 0.1197 0.5082 0.2486
N=300 p=2 (1, 1, 0.4) 0.9999 0.9974 0.4066 0.0608 0.3868 0.2652
(1, 1, 1.0) 1.0002 1.0069 0.9982 0.0758 0.5788 0.4179
p=5 (1, 1, 0.4) 0.9995 1.0013 0.3969 0.0534 0.2197 0.0954
(1, 1, 1.0) 0.9993 1.0076 0.9910 0.0711 0.3386 0.1819
p=10 (1, 1, 0.4) 0.9995 0.9995 0.3972 0.0497 0.1935 0.0671
(1, 1, 1.0) 0.9992 1.0059 0.9922 0.0688 0.2928 0.1436
Table 2.2 displays the coverage probabilities of the confidence interval estimators
constructed by the parametric likelihood and EL method based on repeated measures data
and the mixed data, respectively.
Table 2.2: Coverage probabilities and confidence intervals based on repeated measurements and
the hybrid design
Sample Size  |  Replicates n / Pooling Size p  |  Parameters $(\mu_x, \sigma_x^2, \sigma_\varepsilon^2)$  |  Parametric Likelihood: Coverage, CI  |  Empirical Likelihood: Coverage, CI
Repeated Measurements:
N=100 n=2 (1, 1, 0.4) 0.9420 (0.7028, 1.3014) 0.9496 (0.6980, 1.3049)
(1, 1, 1.0) 0.9423 (0.6665, 1.3347) 0.9466 (0.6613, 1.3394)
n=5 (1, 1, 0.4) 0.9305 (0.5584, 1.4348) 0.9327 (0.5519, 1.4466)
(1, 1, 1.0) 0.9289 (0.5404, 1.4626) 0.9353 (0.5298, 1.4752)
n=10 (1, 1, 0.4) 0.9044 (0.4193, 1.5859) 0.8985 (0.4158, 1.5876)
(1, 1, 1.0) 0.9042 (0.4040, 1.6047) 0.9030 (0.4054, 1.6065)
N=300 n=2 (1, 1, 0.4) 0.9477 (0.8243, 1.1731) 0.9517 (0.8240, 1.1753)
(1, 1, 1.0) 0.9469 (0.8056, 1.1955) 0.9479 (0.8034, 1.1962)
n=5 (1, 1, 0.4) 0.9400 (0.7401, 1.2588) 0.9467 (0.7360, 1.2628)
(1, 1, 1.0) 0.9448 (0.7257, 1.2722) 0.9467 (0.7210, 1.2763)
n=10 (1, 1, 0.4) 0.9396 (0.6422, 1.3547) 0.9379 (0.6318, 1.3582)
(1, 1, 1.0) 0.9336 (0.6321, 1.3648) 0.9417 (0.6245, 1.3712)
Hybrid Design:
N=100 p=2 (1, 1, 0.4) 0.9512 (0.7939, 1.2090) 0.9492 (0.7725, 1.2303)
(1, 1, 1.0) 0.9463 (0.7422, 1.2592) 0.9421 (0.7146, 1.2869)
p=5 (1, 1, 0.4) 0.9424 (0.8230, 1.1757) 0.9490 (0.7978, 1.2010)
(1, 1, 1.0) 0.9439 (0.7644, 1.2372) 0.9509 (0.7314, 1.2703)
p=10 (1, 1, 0.4) 0.9393 (0.8339, 1.1646) 0.9498 (0.8099, 1.1887)
(1, 1, 1.0) 0.9431 (0.7701, 1.2290) 0.9478 (0.7376, 1.2614)
N=300 p=2 (1, 1, 0.4) 0.9482 (0.8817, 1.1182) 0.9551 (0.8660, 1.1337)
(1, 1, 1.0) 0.9469 (0.8525, 1.1479) 0.9520 (0.8334, 1.1672)
p=5 (1, 1, 0.4) 0.9478 (0.8963, 1.1026) 0.9532 (0.8822, 1.1166)
(1, 1, 1.0) 0.9463 (0.8616, 1.1371) 0.9506 (0.8433, 1.1556)
p=10 (1, 1, 0.4) 0.9462 (0.9030, 1.0961) 0.9584 (0.8896, 1.1095)
(1, 1, 1.0) 0.9484 (0.8652, 1.1332) 0.9532 (0.8475, 1.5080)
Table 2.2 shows that the EL ratio test statistic is as efficient as the traditional
parametric likelihood approach in the context of constructing confidence intervals
because the coverage probabilities and the interval widths of the two methods are very
close.
It is clearly shown that when sample sizes are greater than 100, the coverage
probabilities obtained via the pooled-unpooled design are closer to the expected 0.95
value than those based on repeated measurements. This, again, demonstrates that mixed
data are more efficient than repeated measures data.
To compare the Monte Carlo type I errors and powers of the tests based on the statistics $\ell_1(\mu_0)$ and $\ell_2(\mu_0)$ given by equations (2.3) and (2.7), we performed 10,000 simulations for each parametric setting and sample size, testing the null hypothesis $H_0: \mu_x = \mu_0$. Table 2.3 depicts results that correspond to the case of normally distributed data. The outputs show that the Monte Carlo type I errors and powers of the test statistic $\ell_1(\mu_0)$ are slightly better than those corresponding to the test statistic $\ell_2(\mu_0)$. This indicates that the test based on the simple statistic $\ell_1(\mu_0)$ performs better than that based on the statistic $\ell_2(\mu_0)$ in the considered cases.
Table 2.4 displays the Monte Carlo simulation results of testing the null hypothesis when the data follow a shifted $\chi_2^2$ distribution, where $\chi_2^2$ denotes a chi-squared distribution with two degrees of freedom and $a$ is an effect size. Again, in this case, it is obvious that the type I errors (when $a = 0$) of the test statistic $\ell_1(\mu_0)$ are much better controlled around 0.05 than those based on the test statistic $\ell_2(\mu_0)$. Moreover, the Monte Carlo powers of the test based on the test statistic $\ell_1(\mu_0)$ are higher than those based on the statistic $\ell_2(\mu_0)$ when the effect size $a$ is 0.5 or larger. On the contrary, when the effect size $a$ is small, such as 0.1 or 0.2, the Monte Carlo powers of the tests based on the test statistic $\ell_2(\mu_0)$ appear to be higher than those based on the statistic $\ell_1(\mu_0)$. This shows that when the effect size $a$ is large, the test based on the simple statistic $\ell_1(\mu_0)$ is preferable to that based on the statistic $\ell_2(\mu_0)$.
Table 2.3: The Monte Carlo type I errors and powers of the EL ratio test statistics (2.3) and (2.7) for testing $H_0: \mu_x = \mu_0$ based on data following the hybrid design ($X \sim N(\mu_x, \sigma_x^2)$, $\varepsilon \sim N(0, \sigma_\varepsilon^2)$; the pooling proportion $q$ = 0.5; the expected significance level was 0.05; $\Delta$ denotes the shift from $\mu_0$ under the alternative).

Sample Size (N)  |  Pooling Group Size (p)  |  $\sigma_\varepsilon^2$  |  ELR Test Statistic (2.3): Type I Error ($\Delta$=0), Power ($\Delta$=0.5, $\Delta$=1.0)  |  ELR Test Statistic (2.7): Type I Error ($\Delta$=0), Power ($\Delta$=0.5, $\Delta$=1.0)

N=100 p=2 0.4 0.0587 0.9919 1.0000 0.0580 0.9718 0.9990
1.0 0.0558 0.9373 1.0000 0.0611 0.9230 0.9986
p=5 0.4 0.0555 0.9989 1.0000 0.0530 0.9512 0.9992
1.0 0.0587 0.9626 1.0000 0.0604 0.9446 0.9976
p=10 0.4 0.0588 0.9995 1.0000 0.0595 0.9531 0.9999
1.0 0.0556 0.9684 1.0000 0.0621 0.9680 0.9992
N=200 p=2 0.4 0.0495 0.9999 1.0000 0.0594 0.9990 0.9985
1.0 0.0536 0.9991 1.0000 0.0593 0.9983 0.9995
p=5 0.4 0.0540 1.0000 1.0000 0.0524 0.9952 0.9996
1.0 0.0511 0.9997 1.0000 0.0543 0.9981 0.9999
p=10 0.4 0.0549 1.0000 1.0000 0.0546 0.9950 0.9996
1.0 0.0536 0.9999 1.0000 0.0551 0.9979 1.0000
Table 2.4: The Monte Carlo type I errors and powers of the EL ratio test statistics (2.3) and (2.7) for testing the null hypothesis based on data following the hybrid design, with chi-squared(2) data shifted by the effect size a (pooling proportion 0.5; expected significance level 0.05).

Sample    Pooling    Para-    ELR Test Statistic (2.3)                   ELR Test Statistic (2.7)
Size (N)  Size (p)   meters   Type I   Power                             Type I   Power
                              a=0      a=0.1    a=0.2    a=0.5           a=0      a=0.1    a=0.2    a=0.5
N=100     p=2        0.4      0.0687   0.0862   0.2021   0.8567          0.0724   0.0989   0.2144   0.7446
                     1.0      0.0654   0.0792   0.1579   0.6978          0.0690   0.0936   0.1696   0.6417
          p=5        0.4      0.0649   0.1123   0.3133   0.9808          0.0985   0.1583   0.3552   0.9084
                     1.0      0.0670   0.0906   0.1901   0.8141          0.0862   0.1131   0.2338   0.8171
          p=10       0.4      0.0646   0.1454   0.4460   0.9990          0.1016   0.1724   0.4416   0.9269
                     1.0      0.0623   0.0916   0.2253   0.8776          0.0933   0.1241   0.2693   0.8744
N=200     p=2        0.4      0.0587   0.1137   0.3381   0.9907          0.0555   0.1210   0.3294   0.8227
                     1.0      0.0544   0.0940   0.2583   0.9427          0.0534   0.1044   0.2612   0.8147
          p=5        0.4      0.0559   0.1699   0.5557   0.9998          0.0857   0.1903   0.5215   0.8944
                     1.0      0.0538   0.1134   0.3366   0.9860          0.0770   0.1468   0.3687   0.9436
          p=10       0.4      0.0572   0.2279   0.7539   1.0000          0.0801   0.2027   0.6158   0.8875
                     1.0      0.0568   0.1233   0.3988   0.9935          0.0815   0.1629   0.4219   0.9527
2.5 A REAL DATA EXAMPLE
In this section, we illustrate the proposed methods via data from the Cedars-Sinai
Medical Center. This study on coronary heart disease investigated the discriminatory
ability of a cholesterol biomarker for myocardial infarction (MI). We had 80 individual measurements of the cholesterol biomarker in total. Half of these were collected on patients who had recently survived an MI (cases), and the other half on controls who had normal resting electrocardiograms, were free of symptoms, and had no previous cardiovascular procedures or MIs. Additionally, the blood specimens were randomly pooled in groups, keeping cases and controls separate, and then re-measured. Consequently, we had measurements for 20 samples of pooled cases and 20 samples of pooled controls, allowing us to form the hybrid design.
The p-value of 0.8662 for the Shapiro-Wilk test indicates that the cholesterol biomarker can reasonably be assumed to follow a normal distribution. The histogram and normal Q-Q plot in Figure 2.1 confirm that the normality assumption for the data is reasonable.
Figure 2.1: The histogram and the normal Q-Q plot of cholesterol data
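The normality check can be reproduced in outline as follows. Since the cholesterol measurements themselves are not available here, simulated normal data with a comparable mean and spread stand in (an assumption), and `scipy.stats.probplot` supplies the Q-Q plot coordinates without requiring a plotting backend.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
# Placeholder for the 80 cholesterol assays (values are assumptions).
chol = rng.normal(loc=216.0, scale=40.0, size=80)

w_stat, p_value = stats.shapiro(chol)        # H0: data are normal
# A large p-value (such as the 0.8662 reported in the text) supports normality.

# Coordinates for a normal Q-Q plot, plus the correlation r of the fit:
(theo_q, ordered), (slope, intercept, r) = stats.probplot(chol, dist="norm")
```

For genuinely normal data the Q-Q correlation r is close to 1, matching the near-linear Q-Q plot of Figure 2.1.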
We formed hybrid samples by taking combinations of 20 unpooled samples and 10
pooled samples from different individuals for cases and controls, separately. In this
example, we focused on the means of cholesterol measurements and therefore, we
calculated these means based on 40 individual samples for cases and controls, separately.
The obtained means were 226.7877 and 205.5290, respectively. Using a bootstrap
strategy, we compared the confidence interval estimators and the coverage probabilities
of the EL method with those of the parametric method. The bootstrap study proceeded as follows. We randomly selected 10 pooled assays with replacement. We then randomly sampled 20 assays from the individual assays, excluding those performed on individual biospecimens that contributed to the 10 chosen pooled assays. Using the 20 sampled individual assays and the 10 pooled assays, we applied a parametric likelihood method under a normal distributional assumption, as well as the EL ratio test (2.3), to calculate the 95% confidence interval of the mean of the cholesterol biomarker. We repeated this sampling and interval calculation 5,000 times, obtaining 5,000 confidence intervals of the mean cholesterol value for cases and for controls, and then averaged the interval endpoints. Table 2.5 depicts the outputs of the bootstrap evaluation.
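A minimal sketch of the resample-and-average step is given below. It covers only the parametric (normal-theory) interval on simulated individual assays; the actual study also drew pooled assays and applied the EL ratio test (2.3). The data values and resample count are assumptions for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
# Simulated stand-in for the 40 individual case assays (assumption).
individual = rng.normal(loc=226.8, scale=40.0, size=40)

def normal_ci(x, level=0.95):
    """Normal-theory confidence interval for the mean of x."""
    m, se = np.mean(x), stats.sem(x)
    half = stats.t.ppf(0.5 + level / 2.0, df=len(x) - 1) * se
    return m - half, m + half

B = 1000  # the dissertation used 5,000 bootstrap repetitions
lows, highs = [], []
for _ in range(B):
    resample = rng.choice(individual, size=20, replace=True)
    lo, hi = normal_ci(resample)
    lows.append(lo)
    highs.append(hi)

# Average the endpoints over the B repetitions, as done for Table 2.5.
avg_ci = (float(np.mean(lows)), float(np.mean(highs)))
```

The averaged interval straddles the observed sample mean, which is why the parametric and EL rows of Table 2.5 can be compared directly by length and location.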
Table 2.5: Bootstrap evaluations of the confidence interval estimators based on the parametric likelihood ratio test and the EL ratio test

Method                 Healthy                           MI
                       CI                     Length     CI                     Length
Parametric (Normal)    (192.5738, 220.8708)   28.29704   (210.0585, 239.4560)   29.39748
Empirical (EL)         (192.9715, 221.1471)   28.17561   (210.4337, 240.5975)   30.16376
In accordance with the results above, the confidence intervals for the cholesterol mean obtained via the EL ratio method are close to those obtained via the parametric approach; we therefore observe no meaningful difference between the intervals produced by the two approaches. This result shows that, in this example, the proposed EL approach is as efficient as the traditional parametric likelihood approach for constructing confidence intervals.
2.6 CONCLUSIONS
In this chapter, we proposed and examined different parametric and distribution-free
likelihood methods to evaluate data subject to MEs. We evaluated the common sampling
strategy based on repeated measures and the novel hybrid sampling procedure. We pointed out that, when MEs are present, the repeated-measurements strategy might not perform well. The proposed hybrid design utilizes the cost-efficient pooling approach and combines pooled and unpooled samples.
The study in this chapter has confirmed that the repeated-measures strategy devotes much of its information to the ME distribution itself, reducing the efficiency of this procedure compared with the hybrid design in the context of evaluating biomarker characteristics. We proposed applying EL techniques, which are very efficient nonparametric methods, to data subject to ME.
To verify the efficiency of the hybrid design and the EL methodology, we provided
theoretical propositions as well as the Monte Carlo simulation results.
The numerical studies support our argument that likelihoods based on pooled-unpooled data are more efficient than those based on repeated-measures data. We showed that the EL method can serve as a very powerful tool in statistical inference involving MEs.
CHAPTER 3
AN EMPIRICAL LIKELIHOOD RATIO BASED
GOODNESS-OF-FIT TEST FOR INVERSE
GAUSSIAN DISTRIBUTIONS
3.1 INTRODUCTION
The Inverse Gaussian (IG) distribution has a probability density function of the form

f(x | μ, λ) = ( λ / (2πx³) )^{1/2} exp{ −λ(x − μ)² / (2μ²x) },   x > 0,

where μ > 0 and λ > 0 are parameters. The IG(μ, λ) distribution is extensively used for modeling and analyzing right-skewed data with positive support across several different fields of science, e.g., demography, electrical networks, meteorology, hydrology, ecology, entomology, physiology, and cardiology (see, for example, Chhikara and Folks, 1977; Bardsley, 1980; Seshadri, 1993, 1999; Johnson et al., 1994; Barndorff-Nielsen, 1994).
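For reference, the density above can be coded directly and cross-checked against SciPy's parameterization, in which IG(μ, λ) corresponds to `invgauss(mu=μ/λ, scale=λ)`; the parameter values below are illustrative.

```python
import numpy as np
from scipy import stats

def ig_pdf(x, mu, lam):
    """IG(mu, lambda) density: sqrt(lam / (2 pi x^3)) *
    exp(-lam (x - mu)^2 / (2 mu^2 x)), for x > 0."""
    x = np.asarray(x, dtype=float)
    return np.sqrt(lam / (2.0 * np.pi * x**3)) * np.exp(
        -lam * (x - mu) ** 2 / (2.0 * mu**2 * x)
    )

mu, lam = 2.0, 3.0
x = np.linspace(0.1, 10.0, 200)
# SciPy's invgauss uses shape mu_scipy = mu / lam with scale = lam.
max_abs_diff = float(np.max(np.abs(
    ig_pdf(x, mu, lam) - stats.invgauss.pdf(x, mu / lam, scale=lam)
)))
```

The two evaluations agree to floating-point precision, confirming the parameter mapping.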
Given the utility of this distribution, it is meaningful to develop a corresponding goodness-of-fit test with satisfactory statistical properties. Towards this end, we propose constructing a distribution-free goodness-of-fit test for the IG distribution based on approximating the appropriate parametric likelihood ratio test statistic.
The parametric likelihood approach is a powerful statistical tool, which provides
optimal statistical tests under well-known conditions; e.g. see Lehmann and Romano
(2005); Vexler and Wu (2009); Vexler et al. (2010); and Vexler and Tarima (2010). By virtue of the Neyman-Pearson lemma, the likelihood ratio L₁/L₀, where L₀ and L₁ correspond to the likelihoods under hypotheses H₀ and H₁, respectively, provides the most powerful test statistic when L₀ and L₁ are completely known. In the nonparametric context considered in this chapter, the forms of L₀ and L₁ are unknown but estimable. In this case, the Neyman-Pearson lemma motivates us to approximate the optimal likelihood ratio using an empirical approach.
In this chapter, we develop a simple approach that approximates the most powerful
likelihood ratio goodness-of-fit test for the IG distribution. The method is efficient and
utilizes adapted principles of the empirical likelihood (EL) methodology. The EL
approach allows investigators to employ methods with properties that are close, in the
asymptotic sense, to those of parametric techniques without having to assume a
parametric form for the likelihood functions under either hypothesis, H₀ or H₁ (e.g., Lazar,
2003; Qin and Lawless, 1994; Owen, 2001; Vexler et al., 2009; Vexler et al., 2010; Yu et
al., 2010). The main advantage of the EL approach is that it is based on the maximum
likelihood methodology when given a set of well-defined empirical constraints. The EL function of n i.i.d. observations X₁, …, Xₙ has the form L = ∏ᵢ₌₁ⁿ pᵢ, where the probability weights pᵢ, 0 < pᵢ < 1, maximize L and satisfy empirical constraints corresponding to hypotheses of interest. This approach is based on the likelihood methodology involving cumulative distribution functions F(x). In this case, the likelihood has the form ∏ᵢ₌₁ⁿ {F(Xᵢ) − F(Xᵢ−)} (see, for details, Owen, 2001). For example, if the null hypothesis is E(X) = μ, then the values of the pᵢ's should be chosen to maximize ∏ᵢ₌₁ⁿ pᵢ given ∑ᵢ₌₁ⁿ pᵢ = 1 and ∑ᵢ₌₁ⁿ pᵢXᵢ = μ, where the latter constraint is an empirical version of E(X) = μ. The components pᵢ of the EL can be obtained using Lagrange multipliers.
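The Lagrange-multiplier computation can be sketched concretely: with constraints ∑pᵢ = 1 and ∑pᵢxᵢ = μ, the maximizing weights take the standard form pᵢ = 1/(n(1 + t(xᵢ − μ))), where t solves ∑(xᵢ − μ)/(1 + t(xᵢ − μ)) = 0. The data values below are illustrative only.

```python
import numpy as np
from scipy.optimize import brentq

def el_weights(x, mu):
    """EL probabilities maximizing prod(p_i) subject to sum(p_i) = 1
    and sum(p_i * x_i) = mu.  Lagrange multipliers give
    p_i = 1 / (n * (1 + t*(x_i - mu))) with t solving g(t) = 0."""
    x = np.asarray(x, dtype=float)
    n, d = len(x), x - mu
    if not (d.min() < 0.0 < d.max()):
        raise ValueError("mu must lie strictly inside the data range")
    eps = 1e-10
    g = lambda t: np.sum(d / (1.0 + t * d))
    # t is confined to the interval keeping every 1 + t*d_i positive,
    # and g changes sign across that interval, so brentq applies.
    t_hat = brentq(g, -1.0 / d.max() + eps, -1.0 / d.min() - eps)
    return 1.0 / (n * (1.0 + t_hat * d))

x = np.array([1.2, 0.7, 2.5, 1.9, 0.4, 3.1, 1.1, 2.2])  # illustrative data
p = el_weights(x, mu=1.5)
# p is a valid distribution and satisfies the mean constraint sum(p*x) = 1.5.
```

When μ equals the sample mean, the weights reduce to the uniform 1/n, recovering the unconstrained empirical distribution.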
Vexler and Gurevich (2010), as well as Gurevich and Vexler (2011), utilized the main idea of the classical EL methodology to approximate parametric likelihood ratios. The authors proposed a nonparametric approach based on approximate density functions. In this chapter, we derive the EL ratio test for the IG distribution using the density-based distribution-free likelihood approach of Vexler and Gurevich (2010). Following Mudholkar and Tian (2002), we transform the observations and present the likelihood function under the alternative hypothesis in the form ∏ᵢ₌₁ⁿ fᵢ, where fᵢ = f(Y₍ᵢ₎), f is a density function, and Y₍ᵢ₎ is the i-th order statistic based on the transformed observations. Values of fᵢ can be estimated by maximizing ∏ᵢ₌₁ⁿ fᵢ given an empirical version of the constraint ∫ f(u) du = 1. The nonparametric likelihood method mentioned above is then utilized to develop our goodness-of-fit test statistic. Since the proposed test statistic approximates the most powerful parametric likelihood ratio, the density-based EL ratio test is shown to have very efficient characteristics. We also demonstrate that the proposed test statistic improves upon the decision rule of Mudholkar and Tian (2002) and maintains good type I error control for finite samples.
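A rough sketch of such a density-based statistic follows: m-spacings of the order statistics approximate the density values under the alternative, while the fitted IG likelihood, using the standard MLEs μ̂ = X̄ and λ̂⁻¹ = n⁻¹∑(1/Xᵢ − 1/X̄), plays the role of the null likelihood. The exact statistic derived in Section 3.2 may differ in its details, and critical values would require Monte Carlo calibration; this is illustration only.

```python
import numpy as np

def ig_mle(x):
    """Standard maximum likelihood estimates for IG(mu, lambda)."""
    mu = float(np.mean(x))
    lam = len(x) / float(np.sum(1.0 / x - 1.0 / mu))
    return mu, lam

def ig_logpdf(x, mu, lam):
    return (0.5 * np.log(lam / (2.0 * np.pi * x**3))
            - lam * (x - mu) ** 2 / (2.0 * mu**2 * x))

def density_el_ratio(x, delta=0.1):
    """Density-based EL ratio statistic in the spirit of Vexler and
    Gurevich (2010): m-spacings estimate the density under H1; the
    fitted IG likelihood serves under H0.  Large values speak against
    the IG model.  Sketch only -- not the exact statistic of Section 3.2.
    """
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    mu, lam = ig_mle(x)
    log_l0 = float(np.sum(ig_logpdf(x, mu, lam)))
    idx = np.arange(n)
    best = np.inf
    for m in range(1, max(2, int(n ** (1.0 - delta)))):
        spacings = x[np.minimum(idx + m, n - 1)] - x[np.maximum(idx - m, 0)]
        if np.any(spacings <= 0):
            continue  # ties would make the spacing estimate degenerate
        log_l1 = float(np.sum(np.log(2.0 * m / (n * spacings))))
        best = min(best, log_l1 - log_l0)
    return best

rng = np.random.default_rng(4)
sample = rng.wald(2.0, 3.0, size=200)  # numpy's wald() draws IG(mean, lambda)
t_stat = density_el_ratio(sample)
```

Minimizing over the spacing order m avoids fixing the unknown tuning parameter that complicates the entropy test of Mudholkar and Tian (2002).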
In Section 3.2, we construct the density-based EL test for the IG distribution; theoretical propositions there describe properties of the test. A Monte Carlo study of the power of the test is presented in Section 3.3. Section 3.4 presents real data examples employing the proposed test. Section 3.5 consists of concluding remarks.
3.2 METHOD
In this section, we derive the EL ratio goodness-of-fit test based on approximating
densities for the IG distribution under the null and alternative hypotheses. The proposed
test is shown to be consistent as n → ∞. Consider the problem of testing the composite hypothesis that observations X₁, …, Xₙ are distributed as IG with unknown parameters μ and λ.
Mudholkar and Tian (2002) presented an entropy characterization property of the IG
family. The authors constructed a goodness-of-fit test for the IG distribution based on a
test statistic involving the sample entropy. Their test statistic contains an integer
parameter with unknown optimal values. The power of the entropy-based test statistic is
strongly dependent on values of the unknown parameter. Mudholkar et al. (2001) also
developed a goodness-of-fit test for the IG model by using a characterization theorem for
the IG distribution for which it is known