  • ADVANCED AND NOVEL PARAMETRIC AND

    NONPARAMETRIC LIKELIHOOD STATISTICAL

    TECHNIQUES WITH APPLICATIONS IN

    EPIDEMIOLOGY

    by

    Wan-Min Tsai

    December 17, 2012

    A dissertation submitted to the

    Faculty of the Graduate School of

    the University at Buffalo, State University of New York

    in partial fulfillment of the requirements for the

    degree of

    Doctor of Philosophy

    Department of Biostatistics

  • All rights reserved

INFORMATION TO ALL USERS: The quality of this reproduction is dependent upon the quality of the copy submitted.

In the unlikely event that the author did not send a complete manuscript and there are missing pages, these will be noted. Also, if material had to be removed, a note will indicate the deletion.

Microform Edition © ProQuest LLC. All rights reserved. This work is protected against unauthorized copying under Title 17, United States Code.

ProQuest LLC, 789 East Eisenhower Parkway, P.O. Box 1346, Ann Arbor, MI 48106-1346

UMI Number: 3554514. Published by ProQuest LLC (2013). Copyright in the Dissertation held by the Author.

  • ii

    Dissertation Committee

    Albert Vexler, Ph.D. (Advisor)

    Alan Hutson, Ph.D.

    Chang-Xing Ma, Ph.D.

    Jihnhee Yu, Ph.D.

  • iii

    ACKNOWLEDGEMENTS

    I would like to express my gratitude to all the people who have helped along the journey

    of my Ph.D. study.

    Firstly, I would like to express my deepest appreciation to my advisor, Professor

    Albert Vexler, for his guidance and encouragement. This dissertation would not have

    been possible without him. I am also deeply thankful to Professors Alan Hutson, Chang-

    Xing Ma, and Jihnhee Yu, who kindly agreed to serve on my dissertation committee, for

    their interest and valuable comments on my work.

    I would like to thank all of my professors, fellow students, and department staff for

    their kind help and support during my graduate study at the University at Buffalo. My

    sincere thanks also to Dr. Gregory Gurevich and Dr. Yaakov Malinovsky as well as the

    reviewers of the manuscripts submitted to journals for the constructive suggestions

    improving the quality of the manuscripts and this dissertation.

    I would also like to thank all of my friends for always being there with a word of

    encouragement or listening ear.

    Last but not least, I would like to thank my beloved parents, brother and sisters.

    Without their strong love and support, I would not have had a chance to go abroad to

    pursue my graduate study. I love my family beyond expression for their support and

    encouragement along the way. I would like to dedicate this work to them.

  • iv

    TABLE OF CONTENTS

    Acknowledgements ....................................................................................................... iii

    List of Tables .................................................................................................................... x

    List of Figures ............................................................................................................... xiii

    Abstract. ........................................................................................................................ . xiv

    Chapters. ........................................................................................................................... 1

    Chapter 1. Introduction. ............................................................................................ 1

    Chapter 2. Estimation and testing based on data subject to

    measurement errors: from parametric to non-parametric likelihood

    methods ............................................................................................................................ 9

    2.1 Introduction .............................................................................................................................. 9

    2.2 Parametric inferences............................................................................................................11

    2.2.1 Parametric likelihood functions .......................................................................... 12

    2.2.1.1 Parametric likelihood based on repeated data..12

    2.2.1.2 Parametric likelihood based on pooled and unpooled data......13

    2.2.2 Normal case......................................................................................................... 15

    2.2.2.1 Maximum likelihood estimators based on repeated measures.15

2.2.2.2 Maximum likelihood estimators following the hybrid design......17

  • v

    2.2.2.3 Remarks on the normal case.. ...................18

    2.3 Empirical likelihood method ...............................................................................................19

    2.3.1 The empirical likelihood ratio test ...................................................................... 19

    2.3.2 The empirical likelihood method based on repeated measures data ................... 21

    2.3.3 The empirical likelihood method based on pooled-unpooled data ..................... 22

    2.4 Monte Carlo experiments .....................................................................................................25

    2.4.1 Simulation settings .............................................................................................. 25

    2.4.2 Monte Carlo outputs ........................................................................................... 27

    2.5 A real data example ..............................................................................................................32

    2.6 Conclusions ............................................................................................................................35

    Chapter 3. An empirical likelihood ratio based goodness-of-fit test for

    inverse Gaussian distributions .............................................................................. 36

    3.1 Introduction ............................................................................................................................36

    3.2 Method ....................................................................................................................................39

    3.2.1 Density-based empirical likelihood ratio goodness-of-fit test ............................ 40

    3.2.2 Null distribution .................................................................................................. 46

    3.3 Power properties ....................................................................................................................50

    3.4 Data examples ........................................................................................................................52

    3.5 Conclusions ............................................................................................................................54

  • vi

    Chapter 4. An extensive power evaluation of a novel two-sample

    density-based empirical likelihood ratio test for paired data. ................... 56

    4.1 Introduction ............................................................................................................................56

    4.2 Method ....................................................................................................................................61

    4.3 Simulation study ....................................................................................................................67

    4.3.1 Null distribution .................................................................................................. 67

    4.3.2 Power study ......................................................................................................... 71

    4.4 A real data example ..............................................................................................................75

    4.5 Conclusions ............................................................................................................................78

    Chapter 5. Two-sample density-based empirical likelihood ratio tests

    based on paired data, with application to a treatment study of

    Attention-Deficit/Hyperactivity Disorder and Severe Mood

    Dysregulation............................................................................................................... 88

    5.1 Introduction ............................................................................................................................88

    5.2 Statement of problems and Methods ..................................................................................93

    5.2.1 Hypotheses setting .............................................................................................. 93

    5.2.2 Test statistics ....................................................................................................... 94

    5.2.2.1 Test 1: vs. .....94

    5.2.2.2 Test 2: vs. .........99

    5.2.2.3 Test 3: vs. .......101

  • vii

5.2.3 Asymptotic consistency of the tests .................................................... 102

    5.2.4 Null distributions of the proposed test statistics ................................................ 103

    5.3 Simulation study ..................................................................................................................105

    5.3.1 Power comparison with the parametric method................................................. 105

    5.3.2 Power comparison with classic nonparametric methods ................................... 106

    5.4 Data analysis ........................................................................................................................107

    5.5 Conclusions ..........................................................................................................................110

    Chapter 6. Optimal properties of parametric Shiryaev-Roberts

    statistical control procedures ............................................................................... 117

    6.1 Introduction ..........................................................................................................................117

    6.2 Retrospective change point detection ...............................................................................119

    6.3 Retrospective detection of two change points .................................................................124

    6.3.1 Non-asymptotic optimal properties of the Shiryaev-Roberts test (6.10) ......... 125

    6.3.2 A real data example .......................................................................................... 129

    6.3.2.1 The methods for p-value approximation related to test (6.16)...132

    6.3.2.2 Additional study.............134

    6.4 Sequential change point detection.....................................................................................134

    6.5 Conclusions ..........................................................................................................................138

  • viii

    Chapter 7. Future works ....................................................................................... 140

    7.1 Testing normality based on independent samples and errors in regression models

    ......................................................................................................................................................140

    7.2 A sample density-based empirical likelihood ratio test for independence

    ......................................................................................................................................................141

    Appendices ................................................................................................................. 143

    A.1.1 Parametric likelihood based on the hybrid design ......................................................143

    A.1.2 The asymptotic distribution of the log empirical likelihood ratio test .....................145

    A.1.2.1 The repeated measurements .......................................................................... 145

    A.1.2.2 The hybrid design ......................................................................................... 147

    A.2.1 Proof of Lemma 3.1 ........................................................................................................149

A.2.2 To show that the proposed test statistic is identical to the test statistic presented by

Mudholkar and Tian (2002) .................................................................................................... 150

    A.2.3 Proof of Proposition 3.1 .................................................................................................152

    A.2.4 Proof of Proposition 3.2 ................................................................................................ 155

    A.3.1 Computing the empirical constraint (5.11) utilized to develop test 2 ......................156

    A.3.2 Proof of Proposition 5.1 ................................................................................................ 157

    A.3.2.1 Proposed test 1 .............................................................................................. 157

    A.3.2.2 Proposed test 2 .............................................................................................. 165

    A.3.3 Mathematical derivation of maximum likelihood ratio tests.................................... 168

  • ix

    A.3.3.1 Maximum likelihood ratio test statistic for test 1 ......................................... 168

    A.3.3.2 Maximum likelihood ratio test statistic for test 2 ......................................... 169

    A.3.3.3 Maximum likelihood ratio test statistic for test 3 ......................................... 170

    Bibliography .............................................................................................................. 171

  • x

    LIST OF TABLES

    2.1 The Monte Carlo evaluations of the maximum likelihood estimates based on

    repeated measurements and the hybrid design ......................................................... 28

    2.2 Coverage probabilities and confidence intervals based on repeated measurements

    and the hybrid design ............................................................................................... 29

    2.3 The Monte Carlo type I errors and powers of the empirical likelihood ratio test

    statistics (2.3) and (2.7) for testing based on data following the

    hybrid design ( ( ) ( )) ............................................................. 31

    2.4 The Monte Carlo type I errors and powers of the empirical likelihood ratio test

    statistics (2.3) and (2.7) for testing based on data following the

    hybrid design ( (

    )).. ............................................................. 32

    2.5 Bootstrap evaluations of the confidence interval estimators based on the

    parametric likelihood ratio test and the empirical likelihood ratio test.................... 34

    3.1 The critical values, ( ), of the proposed test at the significance level .... 47

    3.2 Type I error control of the proposed ( ) test: =0.05 .................................... 48

    3.3 The Monte Carlo Type I errors of the proposed test with critical values obtained

    by Proposition 2.2 to guarantee ................................................................ 50

    3.4 An empirical power comparison for the density-based empirical likelihood ratio

    test at ........................................................................................................ 51

  • xi

    3.5 Test for the IG based on the data introduced in Folks and Chhikara (1978) ........... 52

    3.6 Bootstrap type proportion of rejection of the tests for the IG at a 5% level of

    significance .............................................................................................................. 54

    4.1 The critical values of the proposed test at (4.7) with various ( ) values for

sample sizes ( ) at the significance level 0.01, 0.05, and 0.1,

    respectively .............................................................................................................. 68

    4.2 Type I error control of the proposed test statistic (4.6) with ( ) ( ) at

the significance level .................................................................................. 70

    4.3 Designs of the alternative hypothesis to be applied to the following Monte Carlo

    evaluations of the powers of the proposed test (4.6) ................................................ 72

    4.4 Proportion of rejection based on the bootstrap method for each considered test.. .. 78

    5.1 Hypotheses of interest to be tested based on paired data ......................................... 94

    5.2 The critical values for Test 1 by (5.9) (Test 2 by (5.15)) [Test 3 by (5.17)] with

    =0.1 for different sample sizes ( , ) and significance levels ..................... 104

    5.3 The Monte Carlo powers of Test 1 by (5.9) vs. the MLR test for vs. with

    different sample sizes ( ) at the significance level ........................ 111

    5.4 The Monte Carlo powers of Test 2 by (5.15) vs. the MLR test for vs. with

    different sample sizes ( ) at the significance level ........................ 112

    5.5 The Monte Carlo powers of Test 3 by (5.17) vs. the MLR test for vs. with

    different sample sizes ( ) at the significance level ........................ 112

  • xii

    5.6 The Monte Carlo type I errors of the MLR tests ................................................... 113

    5.7 The Monte Carlo powers of the proposed test (5.9) vs. the combined

    nonparametric test (the two Wilcoxon signed rank tests and one Kolmogorov-

    Smirnov test) at ...................................................................................... 113

    5.8 The Monte Carlo powers of the proposed test (5.15) vs. the combined

    nonparametric test (the one Wilcoxon signed rank test and one Kolmogorov-

    Smirnov test) at ..................................................................................... 114

    5.9 The Monte Carlo powers of the proposed test (5.17) vs. the Wilcoxon signed

    rank test at the significance level ........................................................... 115

    5.10 The proportions of rejections based on the bootstrap method for each considered

    test .......................................................................................................................... 115

    6.1 Means and standard deviations of the %TS data and the TIBC data in each group

    ............................................................................................................................... 132

  • xiii

    LIST OF FIGURES

    2.1 The histogram and the normal Q-Q plot of cholesterol data ................................... 33

    4.1 3-D plots of powers of the considered tests via all 28 designs (K1-K28) with

different sample sizes at the significance level .......................................... 80

    4.2 Histograms of the differences in CDRS-Rts at baseline and endpoint in the group

    1 ( samples) and group 2 ( samples), respectively ................................... 87

    4.3 Plot of sample sizes vs. p-value using a bootstrap method ...................................... 87

    5.1 Histograms of CDRS-Rts related to the baseline and endpoint in group 1: ( ,

    ), and in group 2: ( , )... ............................................................................... 116

    5.2 Histograms of the paired observations and based on the CDRS-Rts data,

    with sample sizes ( )=(9, 6), that were sampled from the original data set .. 116

6.1 The left-hand side of Figure 6.1 shows plots and histograms of %TS data; the

right-hand side of Figure 6.1 shows plots and histograms of TIBC data ................ 130

    6.2 Histograms of %TS data in each group ................................................................. 131

    6.3 Histograms of TIBC data in each group ................................................................ 131

  • xiv

    ABSTRACT

    The likelihood approach provides a basis for many important procedures and methods in

    statistical inference. When data distributions are completely known, the parametric

    likelihood approach is unarguably a powerful statistical tool that can provide optimal

    statistical inference. In such cases, by virtue of the Neyman-Pearson lemma, the likelihood

ratio tests are the most powerful decision rules. However, the parametric likelihood methods cannot

be applied properly if assumptions regarding the forms of data distributions do not hold. Often,

    in the context of likelihood applications, the use of the misspecified parametric forms of

    data distributions may result in inaccurate statistical conclusions. The empirical likelihood

    (EL) methodology has been well addressed in the literature as a nonparametric counterpart

of the powerful parametric likelihood approach. The objective of this dissertation is to

    develop several powerful parametric likelihood methods and nonparametric approaches

    using the EL concept. Measurement error (ME) problems can cause bias or inconsistency

    of statistical inferences. When investigators are unable to obtain correct measurements of

    biological assays, special techniques to quantify MEs need to be applied. In this

    dissertation, we present both parametric likelihood and EL methods for dealing with data

    subject to MEs based on repeated measures sampling strategies and hybrid sampling

    designs (a mixture of pooled and unpooled data). Utilizing the density-based EL

    methodology, we also propose different efficient nonparametric tests that approximate

    most powerful Neyman-Pearson test statistics. We first introduce the EL ratio based

    goodness-of-fit test for the inverse Gaussian model. Then we extend and adapt the density-

    based EL approach to compare two samples based on paired data. We present exact

    nonparametric tests for composite hypotheses to detect various differences related to

  • xv

    treatment effects in study groups based on paired measurements. Next, we review and

extend parametric retrospective and sequential Shiryaev-Roberts based policies, establishing

non-asymptotic optimal properties of the procedures in different contexts. We propose

    techniques to construct novel and efficient retrospective tests for multiple change-points

    detection problems. Finally, future works will be discussed.

  • 1

    CHAPTER 1

    INTRODUCTION

    Likelihood methods are powerful statistical tools for parametric statistical inference. The

    parametric likelihood methodology provides optimal statistical inferential procedures

    when data distributions are known. However, parametric forms of data distributions are

    oftentimes unknown. When the key assumptions regarding the underlying distribution of

    data are not met, the parametric likelihood approaches may be extremely biased and

    inefficient when compared to their robust nonparametric counterparts. In the

    nonparametric context, the classical empirical likelihood (EL) approach is often applied

in order to efficiently approximate properties of parametric likelihoods by substituting

empirical distribution functions for their population counterparts. Thus, the objective of

this work is to develop efficient parametric likelihood

    approaches as well as nonparametric methods using the EL concept to deal with problems

    arising from epidemiological and medical studies.

    The EL ratios were first used by Thomas & Grunkemeier (1975) in the context of

    estimating survival probabilities. Owen (1988, 1990), building on earlier work of Thomas

    & Grunkemeier (1975), introduced the EL approach for constructing confidence regions

in nonparametric problems. The approach has since been extended to various settings. For

    example, Owen (1991) and Chen (1993, 1994) extended the method to regression

    problems, Chen and Hall (1993) considered the case of quantiles, and Kolaczyk (1994)

    made the extension to generalized linear models. Qin and Lawless (1994) have shown

    that the EL methodology can be utilized to make inference on parameters of interest

  • 2

    under a semiparametric model. The EL method has several advantages compared to

contemporary methods such as the bootstrap (Hall and La Scala, 1990). The EL approach

    not only provides confidence regions whose shapes and orientations are determined

    entirely by the data, but it can also easily incorporate known constraints on parameters

    (Chen, 1994). Moreover, DiCiccio et al. (1989) showed that in a very general setting, the

EL method for constructing confidence regions is Bartlett correctable, which yields very

low coverage error and improves the accuracy of inferences. For a more comprehensive

    literature review, see e.g., Owen (2001).

    In this dissertation, we mainly focus on the developments of novel and powerful

    distribution-free approaches to approximate optimal parametric likelihood ratio test

    statistics via the EL methodology. In Chapter 2, we introduce both parametric and

    empirical likelihood methods for dealing with measurement error problems. In Chapters 3-

    5, we develop several robust EL-based methods motivated by important biomedical studies.

    In Chapter 6, we propose techniques to construct efficient parametric retrospective tests for

    multiple change-points detection problems. In the last chapter, we discuss directions for

future work. In this introductory chapter, we begin with a brief description of the background

and motivation of each study and a review of related literature, which are essential for

comprehending the chapters that follow.

    In Chapter 2, we propose and examine parametric and nonparametric tests for data

    subject to measurement errors based on two sampling schemes---a common sampling

    strategy based on repeated measures and a hybrid design (a mixture of pooled and

    unpooled data). Biomarkers are distinctive biochemical indicators of biological processes

    or events that help measure the progress of disease or the effects of treatment. The

  • 3

    imprecision of measurement of exposures can cause bias or inconsistency of estimation

    or testing of unknown parameters of the biomarker distribution. The sampling based on

    repeated measurements is a common method to deal with measurement error problems

    (e.g., Dunn, 1989; Freedman et al., 2004). However, because of financial cost and other

    resources constraints, obtaining and assaying the desired number of replications can be

    infeasible. For example, the cost of a single assay to measure polychlorinated biphenyl

    (PCB) is between $500 and $1000. The high assay expense restricts the number of

    biological samples that can be measured in the study related to examination of the

    association between the marker PCB and cancer or endometriosis (Louis et al., 2005;

    Mumford et al., 2006). Take the Seveso cohort as another example. In 1976, due to a

    severe industrial accident, serious dioxin contamination of a large resident population of

people and farm animals occurred in Seveso, Italy. Investigators need to assay the

frozen stored blood specimens from the resident population in the dioxin-exposed area

in order to examine whether dioxin contamination adversely affects human health.

However, each determination of blood dioxin currently costs more than $1000, limiting

the number of subjects that can be considered in this study project

    (Weinberg and Umbach, 1999).

    Pooling and simple random sampling designs have been suggested as cost-efficient

    sampling approaches (e.g., Dorfman, 1943; Weinberg and Umbach, 1999; Faraggi et al.,

    2003, Vexler et al., 2006, Schisterman and Vexler, 2008). Simple random sampling

    designs are conducted by selecting and assaying a random subset of available specimens.

    In this case, several individual biospecimens are ignored; however, estimation of the

    operating characteristics of these individual biospecimens is simple. Pooling, on the other

  • 4

    hand, involves randomly grouping and physically mixing individual biological samples.

    Assays are then performed on the small number of pooled samples. Therefore, pooling

    sampling design reduces the number of measurements without ignoring any individual

    biospecimens. Schisterman et al. (2010) proposed a pooled-unpooled hybrid design that

    combines the advantages of both pooling and random sampling strategies. The hybrid

    designs are implemented by assaying some randomly selected individual biospecimens

    and one pooled set that is constituted by the remaining individual biospecimens. The

    pooled sample in the hybrid design is used for estimation of the mean of the biomarker

    distribution, whereas the random sample proportion in the hybrid design (unpooled data)

    is utilized for estimation of the variance. One advantage of using the hybrid design is that

    by utilizing a pooled sample, one can estimate not only mean and variance of the

    biomarker distribution, but also measurement error, without requiring repeated measures.

    The repeated measurements sampling strategy has been well addressed in the literature

    under parametric assumptions. In the context of the hybrid sampling design, Schisterman

    et al. (2010) have proposed and evaluated a parametric approach for normally distributed

    data in the presence of measurement errors. However, nonparametric methods have not

    been well investigated to analyze repeated measures data or pooled data subject to

    measurement errors. In Chapter 2, we consider general cases of parametric and

    nonparametric assumptions, comparing efficiency of pooled-unpooled samples and data

    consisting of repeated measures. The applications of the proposed methods are illustrated

    using the cholesterol biomarker data from a study of myocardial infarction.

    The classical EL approach reflects distribution-based interpretation of the notion of

    likelihood (Owen, 2001). Note that the Neyman-Pearson concept to show the optimality

  • 5

    of statistical decision rules utilizes density functions structures of likelihood type tests.

    Recently, Vexler and Gurevich (2010) proposed the density-based EL methodology to

    construct efficient nonparametric tests that approximate Neyman-Pearson test statistics.

    In the following chapters, we extend and adapt the density-based EL approach for

    different testing problems, including tests of goodness-of-fit (Chapters 3 and 7.1) and

    two-sample comparisons based on paired data (Chapters 4 and 5).

    Testing for distributional assumptions has been a major area of continuing statistical

    research. Chapter 3 introduces a new and powerful density-based EL goodness-of-fit test

    for the inverse Gaussian distribution. The inverse Gaussian distribution is commonly

    introduced to model and examine right skewed data having positive support. It originates

    as the distribution of the first passage time of Brownian motion with drift (Henze and

Klar, 2012). Its useful applications include reliability problems (Padgett and Tsoi,

    1986), lifetime models (Chhikara and Folks, 1977), and accelerated life testing

    (Bhattacharyya and Fries, 1982). Folks and Chhikara (1978, 1989) have presented several

    properties and applications of this distribution. When applying the inverse Gaussian

    model, it is critical to develop efficient goodness-of-fit tests. Mudholkar and Tian (2002)

    presented the entropy-based goodness-of-fit test for the inverse Gaussian distribution that

    strongly depends on values of the integer parameter m. The problem of selecting the

optimal parameter m makes it difficult to implement the test efficiently in practice.

    In this chapter, we propose an EL based goodness-of-fit test for the inverse Gaussian

distribution that improves Mudholkar and Tian's (2002) test in the context of eliminating

    dependence on the integer parameter m.

  • 6

    It is a common practice to conduct medical trials in order to compare a new therapy

with a standard-of-care based on paired data consisting of pre- and post-treatment

    measurements. In the subsequent chapters (Chapters 4 and 5), we focus on developments

    of efficient nonparametric likelihood methods for comparing two study groups based on

    paired observations. In Chapter 4, we present an efficient distribution-free test to compare

    two independent samples, where each is based on paired data. The statistical literature

    has pointed out several issues related to the well-known classical tests for paired data

such as the t-test and the Wilcoxon test. For instance, the t-test is not appropriate to be

applied to data from strongly skewed distributions (see, e.g., Vexler et al., 2009); the

Wilcoxon test may break down completely when a nonconstant change is in effect

    under the alternative hypothesis (see, e.g. Albers et al., 2001). In addition, the classical

    Kolmogorov-Smirnov test is not designed to be based on paired data directly; therefore, it

    may not be well suited to compare two samples based on paired observations. Due to the

    drawbacks of these classical statistical procedures, Chapter 4 aims to provide an

    alternative powerful nonparametric test based on paired data to compare treatment effects

    between two study groups. We extend and modify the density-based EL ratio test

    presented by Gurevich and Vexler (2011) to formulate an appropriate parametric

    likelihood ratio test statistic corresponding to the hypothesis of our interest and then to

    approximate the test statistic nonparametrically.

    In many case-control studies, a great interest often lies in identifying treatment effects

    within each therapy group as well as detecting between-group differences. To this end, in

    a nonparametric setting, one can consider combining relevant nonparametric standard

    procedures, for example, the Kolmogorov-Smirnov test and the Wilcoxon signed rank

  • 7

    test. The former test is a known procedure to compare distributions of two study groups,

    whereas the latter one can be applied to detect treatment effects within each study group.

    However, the use of the classical procedures commonly requires complex considerations

    to combine the known nonparametric tests, e.g., considerations of combined p-values. In

    Chapter 5, we propose simple nonparametric tests for three composite hypotheses related

    to treatment effects to provide efficient tools that compare study groups utilizing paired

    data. We adapt and extend the density-based EL methodology to deal with various testing

    scenarios involved in the two-sample comparisons based on paired data. The proposed

technique is applied for comparing two therapy strategies to treat children's attention-

    deficit/hyperactivity disorder and severe mood dysregulation.

    Parametric change point detection schemes based on the Shiryaev-Roberts approach

    have been well addressed in the statistics and engineering literature. High efficiency of

    such procedures can be partially explained by their known asymptotic optimal properties.

    Recently, Shiryaev-Roberts based procedures were proposed and examined in applications

    to the standard AMOC (at most one change) retrospective change point detection problems.

    In Chapter 6, we review and extend parametric retrospective and sequential Shiryaev-

Roberts based policies, establishing non-asymptotic optimal properties of the procedures

in different contexts. We utilize the general principle of the Neyman-Pearson

    fundamental lemma to show that the Shiryaev-Roberts approach implies the average most

    powerful procedures. We also introduce techniques to construct efficient retrospective tests

    for multiple change-points detection. A real data example based on biomarker

    measurements is provided to demonstrate implementation and effectiveness of new tests in

    practice.

  • 8

    The last chapter of this dissertation addresses directions for future work and outlines

two possible topics: 1) develop an EL ratio based goodness-of-fit test of

normality based on several independent samples and on errors in regression models; 2)

develop a simple likelihood ratio type test for independence between two random

variables, without requiring the specification of any kind of dependence or any

    assumptions on the forms of data distributions.

  • 9

    CHAPTER 2

    ESTIMATION AND TESTING BASED ON DATA

    SUBJECT TO MEASUREMENT ERRORS: FROM

    PARAMETRIC TO NON-PARAMETRIC

    LIKELIHOOD METHODS

    2.1 INTRODUCTION

Many biological and epidemiological studies deal with data subject to

    measurement errors (MEs) attributed to instrumentation inaccuracies, within-subject

    variation resulting from random fluctuations over time, etc. Ignoring the presence of ME

in data can result in bias or inconsistency of estimation or testing. The statistical

literature has proposed different methods for ME bias correction (e.g., Carroll et al., 1984,

    1999; Carroll and Wand, 1991; Fuller, 1987; Liu and Liang, 1992; Schafer, 2001;

    Stefanski, 1985; Stefanski and Carroll, 1987, 1990). Among others, one of the common

methods is to consider repeated measurements of biospecimens, collecting sufficient

    information for statistical inferences adjusted for ME effects (e.g., Hasabelnaby et al.,

    1989). In practice, measurement processes based on bioassays can be costly and time-

    consuming and can restrict the number of replicates of each individual available for

    analysis or the number of individual biospecimens that can be used. It can follow that

  • 10

    investigators may not have enough observations to achieve the desired power or

    efficiency in statistical inferences.

    Dorfman (1943), Faraggi et al. (2003), Liu and Schisterman (2003), Liu et al. (2004),

    Mumford et al. (2006), Schisterman et al. (2008, 2010), and Vexler et al. (2006, 2008,

    2010, 2011) addressed pooling sampling strategies as an efficient approach to reduce the

    overall cost of epidemiological studies. The basic idea of the pooling design is to pool

    together individual biological samples (e.g., blood, plasma, serum or urine) and then

    measure the pooled samples instead of each individual biospecimen. Since the pooling

    design reduces the number of measurements without ignoring individual biospecimens,

    the cost of the measurement process is reduced, but relevant information can still be

    derived. Recently, it has been found that we can utilize a hybrid design that takes a

    sample of both pooled and unpooled biospecimens to efficiently estimate unknown

    parameters, allowing for MEs presence in the data without requiring repeated measures

    (Schisterman et al., 2010).

    In the context of the hybrid strategy, Schisterman et al. (2010) evaluated data that

    follow normal distribution functions. In this chapter, we consider general cases of

    parametric and nonparametric assumptions, comparing efficiency of pooled-unpooled

    samples and data consisting of repeated measures. It should be noted that the repeated

    measurement technique proposes to collect a large amount of information regarding just

    nuisance parameters related to distribution functions of ME, whereas the pooled-

    unpooled design provides observations that are informative regarding target variables

    allowing for ME. Therefore, we show that the pooled-unpooled sampling strategy is more

    efficient than the repeated measurement sampling procedure. We construct parametric

  • 11

    likelihoods based on both sampling methods. Additionally, in order to preserve

    efficiencies of both strategies without parametric assumptions, we consider a

    nonparametric approach using the empirical likelihood (EL) methodology (e.g., DiCicco

    et al., 1989; Owen, 1988, 1991, 2001; Vexler et al., 2009, 2010; Vexler and Gurevich,

2010; Yu et al., 2010). We develop and apply novel EL ratio test statistics, creating

confidence interval estimators based on pooled-unpooled data and repeated measures

    data. Despite the fact that many statistical inference procedures have been developed to

    operate with data subject to ME, to our knowledge, relevant nonparametric likelihood

    techniques and parametric likelihood methods have not been well addressed in the

    literature.

    The rest of this chapter is organized as follows. In Section 2.2, we present a general

    form of the likelihood function based on repeated measures data and pooled-unpooled

    data. We propose the EL methodology to make nonparametric inferences based on

    repeated measures data and pooled-unpooled data in Section 2.3. We claim that the EL

    technique based on the hybrid design provides a valuable technique to construct statistical

    tests and estimators of parameters when MEs are present. To evaluate the proposed

    approaches, we utilize Monte Carlo simulations in Section 2.4. We present an application

    to cholesterol biomarker data from a study of coronary heart disease in Section 2.5. In

    Section 2.6, we provide some concluding remarks.

    2.2 PARAMETRIC INFERENCES

    In this section, we derive general forms of the relevant likelihood functions. In each case,

we assume that the total number of biomarker measurements is fixed, say N; for example, N

    is the total number of measurements that a study budget allows us to execute.

  • 12

2.2.1 PARAMETRIC LIKELIHOOD FUNCTIONS

    2.2.1.1 PARAMETRIC LIKELIHOOD BASED ON REPEATED MEASURES

    DATA

Suppose that we measure a biospecimen observing the score $Z_{ij} = X_i + \varepsilon_{ij}$, where the true values of biomarker measurements $X_i$ are independent identically distributed (i.i.d.) and $\varepsilon_{ij}$ are i.i.d. values of ME, $i = 1, \ldots, t$; $j = 1, \ldots, n_i$. Thus, we assume that there is a subset of $t$ distinct bioassays, and each of them is $n_i$ times repeatedly measured. In this case, we can define the total number of available individual bioassays to be $T$, $T > t$, when we can consider obtaining a large number of individual biospecimens to have a low cost with respect to the high cost of measurement processes. We assume that $X_i$ and $\varepsilon_{ij}$ are independent. Firstly, we consider the simple normal case, say $X_i \sim N(\mu, \sigma_X^2)$ and $\varepsilon_{ij} \sim N(0, \sigma_\varepsilon^2)$. Accordingly, we observe $Z_{ij} \sim N(\mu, \sigma_X^2 + \sigma_\varepsilon^2)$. In this case, one can show that if $n_i = 1$ for all $i$, there are no unique solutions of the estimation of $\sigma_X^2$ and $\sigma_\varepsilon^2$ (non-identifiability). The observations $Z_{ij}$'s in each group $i$ are dependent because they are measured using the same bioassay. Note that, if we fix the value of $X_i$, the observations conditioned on $X_i$ are independent of each other; for example, in the case of $X_i = x$, we have $Z_{ij} \mid X_i = x \sim N(x, \sigma_\varepsilon^2)$. In a general case, the likelihood function based on the repeated measures data has the general form of

$$L_R(Z \mid \mu, \sigma_X^2, \sigma_\varepsilon^2) = \prod_{i=1}^{t} \left[ \int f_X(x) \prod_{j=1}^{n_i} f_{Z \mid X}(Z_{ij} \mid x) \, dx \right],$$

where $f_X$ denotes the density function of $X$ and $f_{Z \mid X}$ denotes the conditional density of an observation given the true biomarker value. When the distributions of $X$ and $\varepsilon$ are known, we can obtain the specific likelihood functions, and further, we can also derive the maximum likelihood estimators of $\mu$, $\sigma_X^2$,

  • 13

and $\sigma_\varepsilon^2$. Well-known asymptotic results related to maximum likelihood estimation give evaluations of properties of estimators based on the likelihood $L_R(Z \mid \mu, \sigma_X^2, \sigma_\varepsilon^2)$.

    2.2.1.2 PARAMETRIC LIKELIHOOD BASED ON POOLED AND

    UNPOOLED DATA

We briefly address the basic concept of the pooling design. Let $T$ be the number of individual biospecimens available and $N$ be the total number of measurements that we can obtain due to a limited study budget. We obtain the pooling samples by randomly grouping individual samples into $N$ groups of size $p$, where $p = [T/N]$ is the number of individual samples in a pooling group and $[\cdot]$ denotes the integer part. The pooling design

    requires a physical combination of specimens of the same group and a test of each pooled

    specimen, obtaining a single observation, when the pooled sample is measured. Since the

    measurements are generally per unit of volume, we assume that the true measurement for

    a pooled set is the average of the true individual marker values in that group. In this case,

    taking into account that instruments applied to the measurement process can be sensitive

    and subject to some random exposure ME, we define a single observation to be a sum of

    the average of individual marker values and a value of ME. Note that, in accordance with

    the pooling literature, we assume that analysis of the biomarkers is restricted by the high

    cost of the measurement process, whereas access to a large number of individual

    biospecimens can be considered to have a relatively low cost.

In this subsection on the hybrid design, we assume that $T$ distinct individual bioassays are available, but still we can provide just $N$ measurements ($N < T$). The proportion of pooled samples among the $N$ measurements is $q$, $q \in [0, 1]$, and the pooling group size is $p$. Namely, $T = qNp + (1 - q)N$. Specifically, we can obtain $qN$ pooled data by mixing $p$ individual

  • 14

bioassays together, and we therefore divide the corresponding $qNp$ bioassays into $qN$ groups. We measure the grouped biospecimens as single observations. Let $Z^{(p)}_j$, $j = 1, \ldots, qN$, denote measurements of pooled bioassays. In accordance with the literature, we have

$$Z^{(p)}_j = \frac{1}{p} \sum_{i=1}^{p} X_{ji} + \varepsilon_j$$

(see, e.g., Faraggi et al., 2003; Liu and Schisterman, 2003; Liu et al. 2004; Schisterman et al. 2008, 2010; Vexler et al., 2006, 2008, 2010, 2011). Hence, we can obtain that the $Z^{(p)}_j$ are i.i.d. with the mean $\mu$ and the variance $\sigma_X^2 / p + \sigma_\varepsilon^2$, namely,

$$Z^{(p)}_j \ \text{i.i.d.} \ \left( \mu, \ \sigma_X^2 / p + \sigma_\varepsilon^2 \right).$$

The unpooled samples are based on $(1 - q)N$ independent observations

$$Z_i = X_i + \varepsilon_i, \quad i = 1, \ldots, (1 - q)N.$$

In this case, we have $Z_i$ i.i.d. with the mean $\mu$ and the variance $\sigma_X^2 + \sigma_\varepsilon^2$.

Note that the pooled and unpooled samples are independent of each other. As a result, the likelihood function based on the combination of pooled and unpooled data has the form of

$$L_H(Z \mid \mu, \sigma_X^2, \sigma_\varepsilon^2) = \prod_{j=1}^{qN} f_{Z^{(p)}}\!\left(Z^{(p)}_j\right) \prod_{i=1}^{(1-q)N} f_{Z}\!\left(Z_i\right).$$

If we know the distribution functions of $X$ and $\varepsilon$, we can derive the likelihood functions according to the distributions of $Z^{(p)}$ and $Z$. Therefore, we can also obtain the corresponding theoretical maximum likelihood estimators of $(\mu, \sigma_X^2, \sigma_\varepsilon^2)$. Since the

  • 15

    estimators follow the maximum likelihood methodology, we can easily show the

    asymptotic properties of the estimators.

    2.2.2 NORMAL CASE

In this subsection, we assume $X \sim N(\mu, \sigma_X^2)$ and $\varepsilon \sim N(0, \sigma_\varepsilon^2)$. Then we obtain closed-form analytical solutions for the maximum likelihood estimators of the unknown parameters $\mu$, $\sigma_X^2$, and $\sigma_\varepsilon^2$.

    2.2.2.1 MAXIMUM LIKELIHOOD ESTIMATORS BASED ON REPEATED

    MEASURES

Assume that $Z_{ij} = X_i + \varepsilon_{ij}$, $i = 1, \ldots, t$; $j = 1, \ldots, n_i$. By the additive property of the normal distributions, we have $Z_{ij} \sim N(\mu, \sigma_X^2 + \sigma_\varepsilon^2)$.

Referring to Searle et al. (1992), the likelihood function is a well-known result obtained from the one-way random-effects model with the variance components $\sigma_X^2$ and $\sigma_\varepsilon^2$.

Under the assumption that the $n_i$'s are equal to a common value $n$ (i.e., assuming balanced data), the log likelihood function is in the form of

$$\log L_R = -\frac{tn}{2}\log(2\pi) - \frac{t(n-1)}{2}\log \sigma_\varepsilon^2 - \frac{t}{2}\log\left(\sigma_\varepsilon^2 + n\sigma_X^2\right) - \frac{SSE}{2\sigma_\varepsilon^2} - \frac{SSA + tn(\bar{Z}_{..} - \mu)^2}{2\left(\sigma_\varepsilon^2 + n\sigma_X^2\right)},$$

where $\bar{Z}_{i.} = n^{-1}\sum_{j=1}^{n} Z_{ij}$, $\bar{Z}_{..} = (tn)^{-1}\sum_{i=1}^{t}\sum_{j=1}^{n} Z_{ij}$, $SSE = \sum_{i=1}^{t}\sum_{j=1}^{n} (Z_{ij} - \bar{Z}_{i.})^2$, and $SSA = n\sum_{i=1}^{t} (\bar{Z}_{i.} - \bar{Z}_{..})^2$.

  • 16

Let $\lambda = \sigma_\varepsilon^2 + n\sigma_X^2$. By taking the partial derivatives of the log likelihood with respect to $\mu$, $\sigma_\varepsilon^2$, and $\lambda$ and setting the equations equal to zero, we obtain the maximum likelihood equations. Thus, the maximum likelihood estimator of $\mu$ is $\hat{\mu} = \bar{Z}_{..}$, and the maximum likelihood estimators of $\sigma_\varepsilon^2$ and $\sigma_X^2$ are

$$\hat{\sigma}_\varepsilon^2 = \frac{SSE}{t(n-1)} \quad \text{and} \quad \hat{\sigma}_X^2 = \frac{1}{n}\left(\frac{SSA}{t} - \frac{SSE}{t(n-1)}\right),$$

respectively, when $SSA/t \ge SSE/\{t(n-1)\}$; and $\hat{\sigma}_\varepsilon^2 = (SSE + SSA)/(tn)$ and $\hat{\sigma}_X^2 = 0$, respectively, otherwise.

Also, the large-sample variances and covariances of $\hat{\mu}$, $\hat{\sigma}_X^2$, and $\hat{\sigma}_\varepsilon^2$ follow from the inverse of the corresponding Fisher information matrix (for details, see Searle et al., 1992).

By the property of the maximum likelihood estimators, it is clear that asymptotically those estimators follow a multivariate normal distribution,

$$\left(\hat{\mu}, \hat{\sigma}_X^2, \hat{\sigma}_\varepsilon^2\right)^{\mathrm{T}} \approx N\!\left(\left(\mu, \sigma_X^2, \sigma_\varepsilon^2\right)^{\mathrm{T}}, \ \Sigma\right) \quad \text{as } t \to \infty,$$

where $\Sigma$ denotes the corresponding asymptotic covariance matrix given in Searle et al. (1992).
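To make the balanced-case estimators concrete, the following Python sketch simulates repeated-measures data under the normal model above and computes the estimates from the ANOVA-type sums of squares. It is a minimal illustration assuming the standard balanced one-way random-effects ML expressions summarized above (Searle et al., 1992); all sample sizes and parameter values are chosen arbitrarily for illustration.

```python
import numpy as np

def ml_estimates_repeated(z):
    """ML estimates (mu, sigma2_X, sigma2_eps) for balanced repeated measures,
    z[i, j] = X_i + eps_{ij}, using the balanced one-way random-effects formulas."""
    t, n = z.shape
    grand_mean = z.mean()
    subject_means = z.mean(axis=1)
    sse = ((z - subject_means[:, None]) ** 2).sum()      # within-subject sum of squares
    ssa = n * ((subject_means - grand_mean) ** 2).sum()  # between-subject sum of squares
    mse = sse / (t * (n - 1))
    if ssa / t >= mse:                                   # interior solution
        return grand_mean, (ssa / t - mse) / n, mse
    return grand_mean, 0.0, (sse + ssa) / (t * n)        # boundary: sigma2_X estimated as 0

rng = np.random.default_rng(0)
t, n = 200, 3                          # numbers of subjects and replicates (illustrative)
mu, s2x, s2e = 5.0, 2.0, 1.0           # illustrative true parameter values
x = rng.normal(mu, np.sqrt(s2x), size=(t, 1))
z = x + rng.normal(0.0, np.sqrt(s2e), size=(t, n))
print(ml_estimates_repeated(z))        # estimates should be close to (5.0, 2.0, 1.0)
```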

  • 17

2.2.2.2 MAXIMUM LIKELIHOOD ESTIMATORS FOLLOWING THE

    HYBRID DESIGN

Since we assume that $X \sim N(\mu, \sigma_X^2)$ and $\varepsilon \sim N(0, \sigma_\varepsilon^2)$, we can write $Z^{(p)}_j \sim N(\mu, \sigma_X^2/p + \sigma_\varepsilon^2)$, $j = 1, \ldots, qN$, and $Z_i \sim N(\mu, \sigma_X^2 + \sigma_\varepsilon^2)$, $i = 1, \ldots, (1-q)N$.

The likelihood function based on pooled-unpooled data then takes the form

$$L_H(Z \mid \mu, \sigma_X^2, \sigma_\varepsilon^2) = \prod_{j=1}^{qN} \frac{\exp\!\left\{-\dfrac{(Z^{(p)}_j - \mu)^2}{2(\sigma_X^2/p + \sigma_\varepsilon^2)}\right\}}{\sqrt{2\pi(\sigma_X^2/p + \sigma_\varepsilon^2)}} \ \prod_{i=1}^{(1-q)N} \frac{\exp\!\left\{-\dfrac{(Z_i - \mu)^2}{2(\sigma_X^2 + \sigma_\varepsilon^2)}\right\}}{\sqrt{2\pi(\sigma_X^2 + \sigma_\varepsilon^2)}}.$$

Differentiating the log likelihood function, $\log L_H(Z \mid \mu, \sigma_X^2, \sigma_\varepsilon^2)$, with respect to $\mu$, $\sigma_X^2$, and $\sigma_\varepsilon^2$, respectively, we obtain the maximum likelihood estimators of $\mu$, $\sigma_X^2$, and $\sigma_\varepsilon^2$ in closed forms, denoted by (2.1).

Note that the estimator of $\mu$ has a structure that weighs estimations based on pooled and unpooled data in a similar manner to a Bayes point estimator used in normal-normal models (see Carlin and Louis, 2008). In this case, we show that we can obtain inference regarding the parameters by using this hybrid approach without repeating measures on the same individual, which is the most common strategy to solve ME problems.
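Because the closed-form expressions (2.1) are specific to the hybrid design and are not reproduced here, a simple way to obtain such estimates in practice is to maximize the pooled-unpooled normal log likelihood numerically. The Python sketch below does this with a general-purpose optimizer; the data-generating values, the group size, and the numbers of pooled and unpooled measurements are illustrative assumptions, and the routine is a numerical stand-in for, not a transcription of, the analytical estimators (2.1).

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def neg_loglik(theta, z_pool, z_unpool, p):
    """Negative log likelihood for the hybrid design under normality:
    pooled observations have variance sigma2_X/p + sigma2_eps,
    unpooled observations have variance sigma2_X + sigma2_eps."""
    mu, s2x, s2e = theta
    if s2x <= 0.0 or s2e <= 0.0:
        return np.inf                                   # keep the variances positive
    ll = norm.logpdf(z_pool, mu, np.sqrt(s2x / p + s2e)).sum()
    ll += norm.logpdf(z_unpool, mu, np.sqrt(s2x + s2e)).sum()
    return -ll

rng = np.random.default_rng(1)
mu, s2x, s2e, p = 5.0, 2.0, 1.0, 4                      # illustrative values, group size p
n_pool, n_unpool = 100, 100                             # pooled and unpooled measurements
x_bar = rng.normal(mu, np.sqrt(s2x), size=(n_pool, p)).mean(axis=1)
z_pool = x_bar + rng.normal(0.0, np.sqrt(s2e), n_pool)  # averages of p true values plus ME
z_unpool = rng.normal(mu, np.sqrt(s2x), n_unpool) + rng.normal(0.0, np.sqrt(s2e), n_unpool)

fit = minimize(neg_loglik, x0=np.array([z_unpool.mean(), 1.0, 1.0]),
               args=(z_pool, z_unpool, p), method="Nelder-Mead")
print(fit.x)          # numerical MLEs of (mu, sigma2_X, sigma2_eps)
```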

  • 18

By virtue of the properties of the maximum likelihood estimators, the estimators (2.1) are asymptotically multivariate normal, centered at $(\mu, \sigma_X^2, \sigma_\varepsilon^2)^{\mathrm{T}}$ with covariance matrix $I^{-1}$, the inverse of the Fisher information matrix $I$ whose entries are the expected negative second derivatives of the log likelihood $\log L_H(Z \mid \mu, \sigma_X^2, \sigma_\varepsilon^2)$ with respect to the parameters (for details, see Appendix A.1.1).

    2.2.2.3 REMARKS ON THE NORMAL CASE

As shown above, when biomarker values and MEs are normally distributed, the

maximum likelihood estimators exist and can be easily obtained. It is also clear that we

can consider these estimators as the least squares estimators in a nonparametric context.

However, when data are not from normal distributions, it may be very complicated or

even infeasible to extract the distributions of repeated measures data or pooled and

    unpooled data (e.g., Vexler et al., 2011). For example, in various situations, closed

    analytical forms of the likelihood functions cannot be found based on pooled data

  • 19

    because the density function of the pooled biospecimen values involves complex

    convolutions of p-individual biospecimen values. Consequently, it is reasonable to

    consider efficient nonparametric inference methodologies based on the repeated measures

    data or pooled-unpooled data.

    2.3 EMPIRICAL LIKELIHOOD METHOD

    In this section, we apply the EL methodology to the statement of the problem in this

    chapter. The EL technique has been extensively proposed as a nonparametric

    approximation of the parametric likelihood approach (e.g., DiCiccio et al., 1989; Owen,

    1988, 1991, 2001; Vexler et al. 2009, 2010; Vexler and Gurevich, 2010; Yu et al., 2010).

We begin by outlining the EL ratio method and then modify the EL ratio test to

construct confidence interval estimators and tests based on data with repeated

measures and pooled-unpooled data.

    2.3.1 THE EMPIRICAL LIKELIHOOD RATIO TEST

Consider the following simple testing problem that is stated nonparametrically. Suppose i.i.d. random variables $Z_1, \ldots, Z_n$ with $E(Z_1) = \mu$ and $E|Z_1|^3 < \infty$ are observable. The problem of interest, for example, is to test the hypothesis

$$H_0: \mu = \mu_0 \quad \text{vs.} \quad H_1: \mu \neq \mu_0, \qquad (2.2)$$

where $\mu_0$ is fixed and known. To test for the hypothesis in equation (2.2), we can write the EL function as

$$L_E = \prod_{i=1}^{n} p_i,$$

where we assume the $p_i$'s to have values that maximize $L_E$ given empirical constraints. The empirical constraints correspond to hypotheses settings. Then, under the null hypothesis

  • 20

in equation (2.2), we maximize $L_E$ subject to $\sum_{i=1}^{n} p_i = 1$ and $\sum_{i=1}^{n} p_i Z_i = \mu_0$. Here the condition $\sum_{i=1}^{n} p_i Z_i = \mu_0$ is an empirical form of $E(Z_1) = \mu_0$. Using the Lagrange multipliers, one can show that the maximum EL function has the form of

$$L_E(\mu_0) = \prod_{i=1}^{n} \frac{1}{n\{1 + \lambda(Z_i - \mu_0)\}},$$

where $\lambda$ is a root of

$$\sum_{i=1}^{n} \frac{Z_i - \mu_0}{1 + \lambda(Z_i - \mu_0)} = 0.$$

Similarly, under the alternative hypothesis, the maximum EL function has the simple form $\prod_{i=1}^{n} n^{-1} = n^{-n}$. As a consequence, the 2log EL ratio test for equation (2.2) is

$$l(\mu_0) = -2\log\!\left[L_E(\mu_0) \big/ n^{-n}\right] = 2\sum_{i=1}^{n}\log\!\left[1 + \lambda(Z_i - \mu_0)\right].$$

It is proven in Owen (2001) that the 2log EL ratio, $l(\mu_0)$, follows asymptotically a $\chi^2_1$ distribution as $n \to \infty$. Thus, we reject the null hypothesis at a significance level $\alpha$ if $l(\mu_0) > \chi^2_{1, 1-\alpha}$. Furthermore, we can construct the confidence interval estimator of $\mu$ as

$$\left\{\mu : l(\mu) \le \chi^2_{1, 1-\alpha}\right\}.$$

(Here, $\chi^2_{1, 1-\alpha}$ is the $100(1-\alpha)\%$ percentile of a $\chi^2$ distribution with one degree of freedom.)
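The construction above translates directly into code. The following Python sketch solves the Lagrange-multiplier equation for $\lambda$ by bracketed root-finding and returns the 2log EL ratio for a hypothesized mean; it is a minimal illustration of the one-sample EL mean test (Owen, 2001) with an arbitrary simulated sample, not the exact routine used for the results reported in this chapter.

```python
import numpy as np
from scipy.optimize import brentq

def el_ratio_mean(z, mu0):
    """2*log empirical likelihood ratio for H0: E(Z) = mu0 (one-sample mean EL)."""
    d = np.asarray(z, dtype=float) - mu0
    if d.min() >= 0.0 or d.max() <= 0.0:
        return np.inf                       # mu0 lies outside the convex hull of the data
    eps = 1e-10                             # keep all weights 1 + lambda*(z_i - mu0) positive
    lam = brentq(lambda l: np.sum(d / (1.0 + l * d)),
                 (-1.0 + eps) / d.max(), (-1.0 + eps) / d.min())
    return 2.0 * np.sum(np.log1p(lam * d))

rng = np.random.default_rng(2)
z = rng.exponential(scale=2.0, size=50)     # illustrative skewed sample with true mean 2
stat = el_ratio_mean(z, mu0=2.0)
print(stat, stat <= 3.841)                  # compare with the 95th percentile of chi^2_1
```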

  • 21

    2.3.2 THE EMPIRICAL LIKELIHOOD METHOD BASED

    ON REPEATED MEASURES DATA

Following the statement mentioned in Section 2.2, we have correlated data with repeated measures. In order to obtain an i.i.d. sample, we utilize the fact that observations from different subjects are independent, that is, $\bar{Z}_i$ is independent of $\bar{Z}_k$ when $i \neq k$. Therefore, we give an EL function for the block sample means $\bar{Z}_i = n_i^{-1}\sum_{j=1}^{n_i} Z_{ij}$, $i = 1, \ldots, t$, in a similar manner to the blockwise EL method given in Kitamura (1997). Then, the relevant random variables become $\bar{Z}_1, \ldots, \bar{Z}_t$ and the corresponding EL function for $\mu$ is given by

$$\prod_{i=1}^{t} \frac{1}{t\{1 + \lambda(\bar{Z}_i - \mu_0)\}},$$

where $\lambda$ is a root of

$$\sum_{i=1}^{t} \frac{\bar{Z}_i - \mu_0}{1 + \lambda(\bar{Z}_i - \mu_0)} = 0.$$

In this case, the 2log EL ratio test statistic is in the form of

$$l_R(\mu_0) = 2\sum_{i=1}^{t}\log\!\left[1 + \lambda(\bar{Z}_i - \mu_0)\right].$$

Proposition 2.1. Assume $E|Z_{11}|^3 < \infty$. Then the 2log EL ratio, $l_R(\mu_0)$, asymptotically follows a $\chi^2_1$ distribution under the null hypothesis, provided that the block sizes $n_i$ satisfy a mild regularity condition, as $t \to \infty$.

Proof. See Appendix A.1.2.1.

  • 22

The associated confidence interval estimator is then given by $\{\mu : l_R(\mu) \le \chi^2_{1, 1-\alpha}\}$, where $\chi^2_{1, 1-\alpha}$ is the $100(1-\alpha)\%$ percentile of a $\chi^2$ distribution with one degree of freedom.

    2.3.3 THE EMPIRICAL LIKELIHOOD METHOD BASED

    ON POOLED-UNPOOLED DATA

In this section, we consider two distribution-free alternatives to the parametric likelihood method mentioned in Section 2.2.1.2. To this end, we apply the EL technique. Note that, in contrast to data that consist of repeated measures, in this section we use data that are based on independent observations. Consequently, we can introduce a combined EL function for the mean based on two independent samples, that is, $Z^{(p)}_j$, $j = 1, \ldots, n_1$, i.i.d. and $Z_i$, $i = 1, \ldots, n_2$, i.i.d., with $n_1 = qN$, $n_2 = (1-q)N$, and common mean $\mu$, representing measurements that correspond to pooled and unpooled biospecimens, respectively. Under the null hypothesis, the EL function for $\mu$ can be presented in the form of

$$\prod_{j=1}^{n_1} \frac{1}{n_1\{1 + \lambda_1(Z^{(p)}_j - \mu_0)\}} \ \prod_{i=1}^{n_2} \frac{1}{n_2\{1 + \lambda_2(Z_i - \mu_0)\}},$$

where $\lambda_1$ and $\lambda_2$ are roots of the equations

$$\sum_{j=1}^{n_1} \frac{Z^{(p)}_j - \mu_0}{1 + \lambda_1(Z^{(p)}_j - \mu_0)} = 0 \quad \text{and} \quad \sum_{i=1}^{n_2} \frac{Z_i - \mu_0}{1 + \lambda_2(Z_i - \mu_0)} = 0.$$

Finally, the 2log EL ratio test statistic can be given in the form of

  • 23

    ( ) log[ ( )]

    log[ ( )]

    ( )

    In a similar manner to common EL considerations, one can show that the statistics

    [ ( )]

    and [ ( )]

    follow asymptotically a

    distribution, respectively. By virtue of the additive property of distributions, the

    2log EL ratio, ( ), has an asymptotic distribution with two degrees of freedom.

    Thus, we formulate the next proposition.

    Proposition 2.2. Let | | . Then the 2log EL ratio, ( ), has a

    distribution

    as .

    Proof. See Appendix A.1.2.2.

    The corresponding confidence interval estimator is

    { ( ) } ( )

    where is the ( ) percentile of a distribution with two degrees of

    freedom.

    In practice, to execute the procedure above, we can directly use standard programs

    related to the classical EL ratio tests, for example, the code el.test of the R software can

    be utilized to conduct the EL confidence interval estimator (2.4).
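
    For instance, a minimal R sketch of the statistic in equation (2.3), built from el.test applied separately to each sample, may look as follows (the emplik package is assumed; the simulated vectors zp and zu are illustrative).

        library(emplik)

        set.seed(3)
        p  <- 2                                             # pooling group size
        zp <- rnorm(50, mean = 1, sd = sqrt(1 / p + 0.4))   # pooled measurements
        zu <- rnorm(50, mean = 1, sd = sqrt(1 + 0.4))       # unpooled measurements

        ell.P <- function(mu) {                             # statistic (2.3): sum of the two EL ratios
          el.test(zp, mu = mu)$"-2LLR" + el.test(zu, mu = mu)$"-2LLR"
        }

        ell.P(1) > qchisq(0.95, df = 2)                     # reject H0: mu = 1 at the 5% level?

        # Confidence interval (2.4): all mu with the statistic below the chi-square(2) cutoff
        mu.grid <- seq(0.5, 1.5, length.out = 200)
        range(mu.grid[sapply(mu.grid, ell.P) <= qchisq(0.95, df = 2)])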

    The EL technique mentioned above does not use an empirical version of the rule

    $\mathrm{var}(Z^{u}_1) - \sigma^2_{ME} = p\,\{\mathrm{var}(Z^{p}_1) - \sigma^2_{ME}\} \qquad (2.5)$

    that connects the second moments derived from pooled and unpooled observations.
    Intuitively, using a constraint related to equation (2.5), one can increase the power of the
    EL approach. Consider the EL function for $\mu_0$ under the null hypothesis,


    $L^{*}(\mu_0) = \max\Big\{ \prod_{i=1}^{n_p} p_i \prod_{j=1}^{n_u} q_j : \sum_{i=1}^{n_p} p_i = 1, \ \sum_{j=1}^{n_u} q_j = 1, \ \sum_{i=1}^{n_p} p_i Z^{p}_i = \mu_0, \ \sum_{j=1}^{n_u} q_j Z^{u}_j = \mu_0, \ \text{and the empirical version of rule (2.5)} \Big\},$

    as an alternative to the simple EL function $L_P(\mu_0)$. Here, $\hat{\sigma}^2_{ME}$, which appears in the
    empirical version of rule (2.5), is the estimator from equation (2.1) that is defined under the null hypothesis,
    and $\lambda_1$, $\lambda_2$, $\lambda_3$, and $\lambda_4$, the Lagrange multipliers that determine the maximizing
    weights $p_i$ and $q_j$, are roots of the equations mentioned under the $\max$ operator in the
    definition of $L^{*}(\mu_0)$.

    Likewise, under the alternative hypothesis, we maximize the EL function
    $\prod_{i=1}^{n_p} p_i \prod_{j=1}^{n_u} q_j$, subject to

    $\sum_{i=1}^{n_p} p_i = 1, \ \sum_{j=1}^{n_u} q_j = 1, \ \sum_{i=1}^{n_p} p_i Z^{p}_i = \mu, \ \sum_{j=1}^{n_u} q_j Z^{u}_j = \mu, \ \text{and the empirical version of rule (2.5),} \qquad (2.6)$

    where $\tilde{\sigma}^2_{ME}$, the corresponding estimator defined under the alternative hypothesis, is used in
    the empirical version of rule (2.5).

    Thus, the EL under the alternative hypothesis that depends on $(\tilde{\mu}, \tilde{\sigma}^2_{ME})$ is given by the
    maximized value $L^{*}_{A}$ of $\prod_{i=1}^{n_p} p_i \prod_{j=1}^{n_u} q_j$ under the constraints in (2.6),
    where we should numerically derive the Lagrange multipliers $\tilde{\lambda}_1$, $\tilde{\lambda}_2$, $\tilde{\lambda}_3$, $\tilde{\lambda}_4$ as
    well as $\tilde{\mu}$ and $\tilde{\sigma}^2_{ME}$ using the equations (2.6). As a result, the corresponding 2log
    EL ratio test statistic is

    $\ell^{*}(\mu_0) = 2\,[\log(L^{*}_{A}) - \log\{L^{*}(\mu_0)\}]. \qquad (2.7)$

    Note that, following Qin and Lawless (1994), $\ell^{*}(\mu_0)$ is asymptotically equivalent to
    the maximum log EL ratio test statistic. By virtue of results mentioned in Qin and
    Lawless (1994), $\ell^{*}(\mu_0)$ asymptotically follows a $\chi^2$ distribution. Then we reject the null
    hypothesis at a significance level $\alpha$ when

    $\ell^{*}(\mu_0) > C_{1-\alpha},$

    where $C_{1-\alpha}$ is the $(1-\alpha)$ percentile of the corresponding $\chi^2$ distribution. Moreover, the
    corresponding confidence interval is $\{\mu : \ell^{*}(\mu) \le C_{1-\alpha}\}$, where $C_{1-\alpha}$ is again the
    $(1-\alpha)$ percentile of that $\chi^2$ distribution.
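
    A rough R sketch of one way to compute such a constrained EL ratio is given below. It is only an illustration: it assumes that rule (2.5) takes the variance form displayed above, treats the ME variance as a plug-in value, represents the constraints through estimating functions passed to el.test, and profiles the free second moment and the mean numerically (the function names and the search intervals are illustrative).

        library(emplik)

        # 2log EL, for fixed mu, with the extra second-moment constraint implied by rule (2.5):
        # var(unpooled) - s2me = p * (var(pooled) - s2me), with a plug-in ME variance s2me.
        neg2ELR.constrained <- function(mu, zp, zu, p, s2me) {
          prof <- function(m2p) {                          # m2p = E(Zp^2), profiled numerically
            m2u <- p * m2p - (p - 1) * (mu^2 + s2me)       # implied E(Zu^2) under rule (2.5)
            gp  <- cbind(zp - mu, zp^2 - m2p)              # estimating functions, pooled sample
            gu  <- cbind(zu - mu, zu^2 - m2u)              # estimating functions, unpooled sample
            el.test(gp, mu = c(0, 0))$"-2LLR" + el.test(gu, mu = c(0, 0))$"-2LLR"
          }
          optimize(prof, interval = mean(zp^2) * c(0.5, 1.5))$objective
        }

        # Statistic (2.7): constrained value at mu0 minus its minimum over the unrestricted mean
        ell.star <- function(mu0, zp, zu, p, s2me) {
          at.H1 <- optimize(function(m) neg2ELR.constrained(m, zp, zu, p, s2me),
                            interval = range(c(zp, zu)))$objective
          neg2ELR.constrained(mu0, zp, zu, p, s2me) - at.H1
        }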

    The Monte Carlo simulation study presented in the next section examines the

    performance of each EL method mentioned above.

    2.4 MONTE CARLO EXPERIMENTS

    In this section, we conducted an extensive Monte Carlo study to evaluate the performance

    of the parametric and nonparametric likelihood methods proposed in Sections 2.2 and 2.3.

    2.4.1 SIMULATION SETTINGS

    Examining the repeated measures sampling method, we randomly generated samples of
    true biomarker values $X_i$, $i = 1, \dots, N/n$, from a normal distribution with mean $E(X_i) = \mu$ and variance
    $\mathrm{var}(X_i) = \sigma^2$. We let $n$ denote the number of replicates for each subject.
    For simplicity, we assumed that each subject had the same number of replicates
    (i.e., assuming balanced data). Then, in a similar manner, we randomly generated
    normally distributed MEs, $\varepsilon_{it}$'s, having $E(\varepsilon_{it}) = 0$ and $\mathrm{var}(\varepsilon_{it}) = \sigma^2_{ME}$, $i = 1, \dots, N/n$;
    $t = 1, \dots, n$. Therefore, we constructed samples of $Z_{it} = X_i + \varepsilon_{it}$. Each sample had $N$
    observations.

    To obtain the hybrid samples, we first generated a sample $X_1, \dots, X_T$ of size $T$, where
    $E(X_j) = \mu$, $\mathrm{var}(X_j) = \sigma^2$, $j = 1, \dots, T$, to represent available individual
    bioassays. Then we proceeded to generate pooled data. To this end, we pooled $p n_p$ of the
    $X_j$'s, where $n_p = 0.5N$ denotes the number of pooled assays and is assumed to be an
    integer, to constitute the pooled data; the $X_j$'s, $j = 1, \dots, T$, were i.i.d. random samples from a normal distribution with
    mean $E(X_j) = \mu$ and $\mathrm{var}(X_j) = \sigma^2$. Following the pooling literature, if there were no
    MEs, we assumed the average values of the pooled biospecimens,
    $(X_{(i-1)p+1} + \cdots + X_{ip})/p$, $i = 1, \dots, n_p$, to be observed and we could represent them as the
    measurements of pooled bioassays. We took the remaining $T - p n_p$ observations
    $X_j$, $j = p n_p + 1, \dots, T$, as individual measurements. For each observation, we randomly
    generated a ME from a normal distribution. Combining the pooled samples,
    $Z^{p}_i = (X_{(i-1)p+1} + \cdots + X_{ip})/p + \varepsilon_i$, $i = 1, \dots, n_p$, with the unpooled samples,
    $Z^{u}_j = X_{p n_p + j} + \varepsilon_{n_p + j}$, $j = 1, \dots, T - p n_p$, we obtained pooled-unpooled data with the total sample size
    equal to that in the Monte Carlo evaluations related to the repeated measures
    approaches.
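
    The data generation just described can be sketched in R as follows (one illustrative configuration; all object names are placeholders).

        set.seed(5)
        N <- 100; n <- 2; p <- 2                     # total measurements, replicates, pooling size
        mu <- 1; s2 <- 1; s2me <- 0.4                # mean, biomarker variance, ME variance

        ## Repeated-measures design: N/n subjects, each measured n times
        X <- rnorm(N / n, mu, sqrt(s2))                         # true biomarker values
        Z <- X + matrix(rnorm(N, 0, sqrt(s2me)), N / n, n)      # N observed measurements

        ## Hybrid (pooled-unpooled) design with pooling proportion 0.5
        T.spec <- p * N / 2 + N / 2                             # individual biospecimens required
        Xh <- rnorm(T.spec, mu, sqrt(s2))
        pooled.true <- colMeans(matrix(Xh[1:(p * N / 2)], nrow = p))   # averages of p specimens
        zp <- pooled.true + rnorm(N / 2, 0, sqrt(s2me))                # N/2 pooled measurements
        zu <- Xh[(p * N / 2 + 1):T.spec] + rnorm(N / 2, 0, sqrt(s2me)) # N/2 unpooled measurements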

    To evaluate the performance of the proposed methods, we applied the following
    simulation settings: the fixed significance level was 0.05; $\mu = 1$ and $\sigma^2 = 1$;
    $\sigma^2_{ME} = 0.4, 1$;
    the number of replicates $n = 2, 5, 10$; the pooling group size $p = 2, 5, 10$; the pooling proportion was 0.5; the total
    sample size $N = 100, 300$. For each set of parameters, there were 10,000 data generations
    (Monte Carlo). In this section, following the pooling literature, we assumed that the
    simulated analysis of biomarkers was restricted to execute just $N$ measurements and
    $N(p+1)/2$ individual biospecimens are available, when the hybrid design was
    compared with the repeated measures sampling method. The Monte Carlo simulation
    results are presented in the next subsection.

    2.4.2 MONTE CARLO OUTPUTS

    Table 2.1 shows the estimated parameters based on the repeated measures data using the

    parametric likelihood method. The results show that as the replicates increase, the

    standard errors of the estimates of $\sigma^2_{ME}$ decrease, indicating that the estimations of $\sigma^2_{ME}$
    appear to be better as the number of replicates increases. In contrast, the Monte Carlo
    standard errors of the estimators of $\mu$ and $\sigma^2$ increase when we increase the number of
    replicates.

    To accomplish the efficiency comparison between the repeated measures strategy and

    the hybrid design strategy, we provide the Monte Carlo properties of the maximum

    likelihood estimates based on pooled-unpooled data in Table 2.1. Table 2.1 shows that

    the Monte Carlo standard errors of the estimates for $\mu$ based on pooled-unpooled data
    are clearly less than those of the corresponding estimates that utilize repeated measures,
    when $p = n$ (respectively, $p, n = 2, 5, 10$). One observed advantage is that the estimation for $\sigma^2$
    based on pooled-unpooled data is very accurate when the total number of measurements
    is fixed at the same level. Another advantage is that the standard errors of the estimates
    for $\mu$ based on pooled-unpooled data are much smaller than those of the corresponding
    estimates using repeated measures data, as shown in Table 2.1.


    Table 2.1: The Monte Carlo evaluations of the maximum likelihood estimates based on repeated
    measurements and the hybrid design

    Sample    Replicates n;     Parameters                         Estimates                                              Standard Errors
    Size      Pooling Size p    $(\mu, \sigma^2, \sigma^2_{ME})$   $\hat{\mu}$   $\hat{\sigma}^2$   $\hat{\sigma}^2_{ME}$   SE$(\hat{\mu})$   SE$(\hat{\sigma}^2)$   SE$(\hat{\sigma}^2_{ME})$

    Repeated Measurements:

    N=100 n=2 (1, 1, 0.4) 1.0021 0.9781 0.3997 0.1553 0.2410 0.0790

    (1, 1, 1.0) 1.0006 0.9688 0.9984 0.1726 0.3106 0.1994

    n=5 (1, 1, 0.4) 0.9966 0.9462 0.3990 0.2328 0.3305 0.0623

    (1, 1, 1.0) 1.0015 0.9362 0.9998 0.2442 0.3688 0.1570

    n=10 (1, 1, 0.4) 1.0026 0.8951 0.3999 0.3209 0.4346 0.0597

    (1, 1, 1.0) 1.0044 0.8917 0.9995 0.3299 0.4690 0.1501

    N=300 n=2 (1, 1, 0.4) 0.9987 0.9921 0.3999 0.0889 0.1405 0.0455

    (1, 1, 1.0) 1.0005 0.9883 0.9999 0.0995 0.1803 0.1162

    n=5 (1, 1, 0.4) 0.9995 0.9797 0.3998 0.1356 0.1950 0.0365

    (1, 1, 1.0) 0.9990 0.9766 0.9990 0.1409 0.2181 0.0906

    n=10 (1, 1, 0.4) 0.9985 0.9682 0.3997 0.1864 0.2633 0.0344

    (1, 1, 1.0) 0.9985 0.9660 1.0002 0.1914 0.2782 0.0861

    Hybrid Design:

    N=100 p=2 (1, 1, 0.4) 1.0015 1.0160 0.4365 0.1048 0.6712 0.4579

    (1, 1, 1.0) 1.0007 1.0754 1.0098 0.1327 1.0058 0.7275

    p=5 (1, 1, 0.4) 0.9994 1.0045 0.3889 0.0924 0.3857 0.1662

    (1, 1, 1.0) 1.0008 1.0053 0.9880 0.1240 0.5932 0.3217

    p=10 (1, 1, 0.4) 0.9993 1.0049 0.3918 0.0871 0.3341 0.1164

    (1, 1, 1.0) 0.9996 1.0050 0.9836 0.1197 0.5082 0.2486

    N=300 p=2 (1, 1, 0.4) 0.9999 0.9974 0.4066 0.0608 0.3868 0.2652

    (1, 1, 1.0) 1.0002 1.0069 0.9982 0.0758 0.5788 0.4179

    p=5 (1, 1, 0.4) 0.9995 1.0013 0.3969 0.0534 0.2197 0.0954

    (1, 1, 1.0) 0.9993 1.0076 0.9910 0.0711 0.3386 0.1819

    p=10 (1, 1, 0.4) 0.9995 0.9995 0.3972 0.0497 0.1935 0.0671

    (1, 1, 1.0) 0.9992 1.0059 0.9922 0.0688 0.2928 0.1436


    Table 2.2 displays the coverage probabilities of the confidence interval estimators

    constructed by the parametric likelihood and EL method based on repeated measures data

    and the mixed data, respectively.

    Table 2.2: Coverage probabilities and confidence intervals based on repeated measurements and
    the hybrid design

    Sample    Replicates n;     Parameters                         Parametric Likelihood          Empirical Likelihood
    Size      Pooling Size p    $(\mu, \sigma^2, \sigma^2_{ME})$   Coverage   CI                  Coverage   CI

    Repeated Measurements:

    N=100 n=2 (1, 1, 0.4) 0.9420 (0.7028, 1.3014) 0.9496 (0.6980, 1.3049)

    (1, 1, 1.0) 0.9423 (0.6665, 1.3347) 0.9466 (0.6613, 1.3394)

    n=5 (1, 1, 0.4) 0.9305 (0.5584, 1.4348) 0.9327 (0.5519, 1.4466)

    (1, 1, 1.0) 0.9289 (0.5404, 1.4626) 0.9353 (0.5298, 1.4752)

    n=10 (1, 1, 0.4) 0.9044 (0.4193, 1.5859) 0.8985 (0.4158, 1.5876)

    (1, 1, 1.0) 0.9042 (0.4040, 1.6047) 0.9030 (0.4054, 1.6065)

    N=300 n=2 (1, 1, 0.4) 0.9477 (0.8243, 1.1731) 0.9517 (0.8240, 1.1753)

    (1, 1, 1.0) 0.9469 (0.8056, 1.1955) 0.9479 (0.8034, 1.1962)

    n=5 (1, 1, 0.4) 0.9400 (0.7401, 1.2588) 0.9467 (0.7360, 1.2628)

    (1, 1, 1.0) 0.9448 (0.7257, 1.2722) 0.9467 (0.7210, 1.2763)

    n=10 (1, 1, 0.4) 0.9396 (0.6422, 1.3547) 0.9379 (0.6318, 1.3582)

    (1, 1, 1.0) 0.9336 (0.6321, 1.3648) 0.9417 (0.6245, 1.3712)

    Hybrid Design:

    N=100 p=2 (1, 1, 0.4) 0.9512 (0.7939, 1.2090) 0.9492 (0.7725, 1.2303)

    (1, 1, 1.0) 0.9463 (0.7422, 1.2592) 0.9421 (0.7146, 1.2869)

    p=5 (1, 1, 0.4) 0.9424 (0.8230, 1.1757) 0.9490 (0.7978, 1.2010)

    (1, 1, 1.0) 0.9439 (0.7644, 1.2372) 0.9509 (0.7314, 1.2703)

    p=10 (1, 1, 0.4) 0.9393 (0.8339, 1.1646) 0.9498 (0.8099, 1.1887)

    (1, 1, 1.0) 0.9431 (0.7701, 1.2290) 0.9478 (0.7376, 1.2614)

    N=300 p=2 (1, 1, 0.4) 0.9482 (0.8817, 1.1182) 0.9551 (0.8660, 1.1337)

    (1, 1, 1.0) 0.9469 (0.8525, 1.1479) 0.9520 (0.8334, 1.1672)

    p=5 (1, 1, 0.4) 0.9478 (0.8963, 1.1026) 0.9532 (0.8822, 1.1166)

    (1, 1, 1.0) 0.9463 (0.8616, 1.1371) 0.9506 (0.8433, 1.1556)

    p=10 (1, 1, 0.4) 0.9462 (0.9030, 1.0961) 0.9584 (0.8896, 1.1095)

    (1, 1, 1.0) 0.9484 (0.8652, 1.1332) 0.9532 (0.8475, 1.5080)


    Table 2.2 shows that the EL ratio test statistic is as efficient as the traditional

    parametric likelihood approach in the context of constructing confidence intervals

    because the coverage probabilities and the interval widths of the two methods are very

    close.

    It is clearly shown that when sample sizes are greater than or equal to 100, the coverage

    probabilities obtained via the pooled-unpooled design are closer to the expected 0.95

    value than those based on repeated measurements. This, again, demonstrates that mixed

    data are more efficient than repeated measures data.

    To compare the Monte Carlo type I errors and powers of the tests based on the test
    statistics $\ell_P(\mu_0)$ and $\ell^{*}(\mu_0)$ given by equations (2.3) and (2.7), we performed 10,000
    simulations for each parametric setting and sample size. To test the null hypothesis
    $H_0: \mu = \mu_0$, we use the statistics $\ell_P(\mu_0)$ and $\ell^{*}(\mu_0)$ by equations (2.3) and (2.7). Table
    2.3 depicts results that correspond to the case when the data are normally distributed. The outputs show that
    the Monte Carlo type I errors and powers of the test statistic (2.3) are slightly better
    than those corresponding to the test statistic (2.7). This indicates that the test based on
    the simple statistic (2.3) performs better than that based on the statistic (2.7), in the
    considered cases.

    Table 2.4 displays the Monte Carlo simulation results of testing the null hypothesis
    $H_0: \mu = 2$ when the biomarker values follow a $\chi^2_2 + a$ distribution, where $\chi^2_2$ is a chi-squared distribution with two degrees of
    freedom and $a$ is an effect size. Again, in this case, it is obvious that the type I errors
    (when $a = 0$) of the test statistic (2.3) are much better controlled around 0.05 than those
    based on the test statistic (2.7). Moreover, the Monte Carlo powers of the test based on
    the test statistic (2.3) are higher than those based on the statistic (2.7) when the effect
    size $a$ is 0.5 or larger. On the contrary, when the effect size $a$ is small, such as 0.1 and 0.2,
    the Monte Carlo powers of the tests based on the test statistic (2.7) seem higher than
    those based on the statistic (2.3). This shows that when the effect size $a$ is large, the test
    based on the simple statistic (2.3) is preferable to that based on the statistic (2.7).

    Table 2.3: The Monte Carlo type I errors and powers of the EL ratio test statistics (2.3)
    and (2.7) for testing $H_0: \mu = 1$ based on data following the hybrid design (normally
    distributed biomarker values with mean $1 + \delta$ and variance 1, where $\delta$ denotes the shift
    under the alternative; the pooling proportion was 0.5; the expected significance level was 0.05).

    Sample     Pooling Group    Parameters          ELR Test Statistic (2.3)                                  ELR Test Statistic (2.7)
    Size (N)   Size (p)         $\sigma^2_{ME}$     Type I Error       Power                                  Type I Error       Power
                                                    $\delta$=0         $\delta$=0.5     $\delta$=1.0          $\delta$=0         $\delta$=0.5     $\delta$=1.0
    N=100 p=2 0.4 0.0587 0.9919 1.0000 0.0580 0.9718 0.9990

    1.0 0.0558 0.9373 1.0000 0.0611 0.9230 0.9986

    p=5 0.4 0.0555 0.9989 1.0000 0.0530 0.9512 0.9992

    1.0 0.0587 0.9626 1.0000 0.0604 0.9446 0.9976

    p=10 0.4 0.0588 0.9995 1.0000 0.0595 0.9531 0.9999

    1.0 0.0556 0.9684 1.0000 0.0621 0.9680 0.9992

    N=200 p=2 0.4 0.0495 0.9999 1.0000 0.0594 0.9990 0.9985

    1.0 0.0536 0.9991 1.0000 0.0593 0.9983 0.9995

    p=5 0.4 0.0540 1.0000 1.0000 0.0524 0.9952 0.9996

    1.0 0.0511 0.9997 1.0000 0.0543 0.9981 0.9999

    p=10 0.4 0.0549 1.0000 1.0000 0.0546 0.9950 0.9996

    1.0 0.0536 0.9999 1.0000 0.0551 0.9979 1.0000


    Table 2.4: The Monte Carlo type I errors and powers of the EL ratio test statistics (2.3)
    and (2.7) for testing $H_0: \mu = 2$ based on data following the hybrid design (biomarker
    values generated as $\chi^2_2 + a$, where $\chi^2_2$ denotes a chi-squared distribution with two degrees
    of freedom and $a$ is an effect size; the pooling proportion was 0.5; the expected
    significance level was 0.05).

    Sample     Pooling Group    Parameters          ELR Test Statistic (2.3)                        ELR Test Statistic (2.7)
    Size (N)   Size (p)         $\sigma^2_{ME}$     Type I Error   Power                            Type I Error   Power
                                                    a = 0    a = 0.1    a = 0.2    a = 0.5          a = 0    a = 0.1    a = 0.2    a = 0.5

    N=100 p=2 0.4 0.0687 0.0862 0.2021 0.8567 0.0724 0.0989 0.2144 0.7446

    1.0 0.0654 0.0792 0.1579 0.6978 0.0690 0.0936 0.1696 0.6417

    p=5 0.4 0.0649 0.1123 0.3133 0.9808 0.0985 0.1583 0.3552 0.9084

    1.0 0.0670 0.0906 0.1901 0.8141 0.0862 0.1131 0.2338 0.8171

    p=10 0.4 0.0646 0.1454 0.4460 0.9990 0.1016 0.1724 0.4416 0.9269

    1.0 0.0623 0.0916 0.2253 0.8776 0.0933 0.1241 0.2693 0.8744

    N=200 p=2 0.4 0.0587 0.1137 0.3381 0.9907 0.0555 0.1210 0.3294 0.8227

    1.0 0.0544 0.0940 0.2583 0.9427 0.0534 0.1044 0.2612 0.8147

    p=5 0.4 0.0559 0.1699 0.5557 0.9998 0.0857 0.1903 0.5215 0.8944

    1.0 0.0538 0.1134 0.3366 0.9860 0.0770 0.1468 0.3687 0.9436

    p=10 0.4 0.0572 0.2279 0.7539 1.0000 0.0801 0.2027 0.6158 0.8875

    1.0 0.0568 0.1233 0.3988 0.9935 0.0815 0.1629 0.4219 0.9527

    2.5 A REAL DATA EXAMPLE

    In this section, we illustrate the proposed methods via data from the Cedars-Sinai

    Medical Center. This study on coronary heart disease investigated the discriminatory

    ability of a cholesterol biomarker for myocardial infarction (MI). We had 80 individual
    measurements of the cholesterol biomarker in total. Half of these were collected on patients
    who had recently survived an MI (cases), and the other half on controls who had normal resting
    electrocardiograms and were free of symptoms, having no previous cardiovascular

    procedures or MIs. Additionally, the blood specimens were randomly pooled in groups of
    2, keeping cases and controls separate, and then re-measured. Consequently, we had
    measurements for 20 samples of pooled cases and 20 samples of pooled controls,

    allowing us to form the hybrid design.

    The p-value of 0.8662 for the Shapiro-Wilk test indicates that we can assume that the
    cholesterol biomarker follows a normal distribution. A histogram and a normal Q-Q plot

    in Figure 2.1 confirm that the normal distributional assumption for the data is reasonable.

    Figure 2.1: The histogram and the normal Q-Q plot of cholesterol data

    We formed hybrid samples by taking combinations of 20 unpooled samples and 10

    pooled samples from different individuals for cases and controls, separately. In this

    example, we focused on the means of cholesterol measurements and therefore, we

    calculated these means based on 40 individual samples for cases and controls, separately.

    The obtained means were 226.7877 and 205.5290, respectively. Using a bootstrap

    strategy, we compared the confidence interval estimators and the coverage probabilities


    of the EL method with those of the parametric method. To execute the bootstrap study,
    we proceeded as follows. We randomly selected 10 pooled assays of group size 2
    with replacement. We then randomly sampled 20 assays from the individual assays,
    excluding those performed on individual biospecimens that contributed to the 10 chosen
    pooled assays. With our 20 sampled individuals and 10 pooled assays, we applied the
    parametric likelihood method under the normal distributional assumption and the EL
    ratio test (2.3) to calculate the 95% confidence interval of the mean of the cholesterol
    biomarker. We repeatedly sampled and calculated the confidence interval of the
    cholesterol mean 5,000 times, obtaining 5,000 values of the confidence interval of the
    mean cholesterol measurement for both cases and controls. Then we took the
    average of these obtained 5,000 values.
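
    A schematic version of this resampling scheme for one group (cases or controls) could look like the following R sketch, in which pooled.assays, individual.assays, and members are placeholders for the study data (the 20 pooled measurements, the 40 individual measurements, and the list recording which individuals form each pooled assay, respectively); the EL interval inverts statistic (2.3) over a grid, and the parametric interval would be computed analogously with the method of Section 2.2.

        library(emplik)

        el.ci <- function(zp, zu, grid) {                 # invert statistic (2.3) on a grid of means
          stat <- sapply(grid, function(m) el.test(zp, m)$"-2LLR" + el.test(zu, m)$"-2LLR")
          range(grid[stat <= qchisq(0.95, df = 2)])
        }

        B  <- 5000
        ci <- matrix(NA, B, 2)
        for (b in 1:B) {
          idx  <- sample(seq_along(pooled.assays), 10, replace = TRUE)
          zp   <- pooled.assays[idx]                      # 10 resampled pooled assays
          used <- unique(unlist(members[idx]))            # individuals contributing to those pools
          zu   <- sample(individual.assays[-used], 20)    # 20 individual assays, excluding 'used'
          ci[b, ] <- el.ci(zp, zu, grid = seq(150, 300, by = 0.5))
        }
        colMeans(ci)                                      # averaged 95% EL confidence interval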

    Table 2.5 depicts the outputs of the bootstrap evaluation.

    Table 2.5: Bootstrap evaluations of the confidence interval estimators based on the

    parametric likelihood ratio test and the EL ratio test

    Method                  Healthy controls                      MI cases
                            CI                       Length       CI                       Length
    Parametric (Normal)     (192.5738, 220.8708)     28.29704     (210.0585, 239.4560)     29.39748
    Empirical               (192.9715, 221.1471)     28.17561     (210.4337, 240.5975)     30.16376

    In accordance with the results above, the confidence interval estimators of the
    cholesterol mean obtained via the EL ratio method are close to those corresponding to the
    parametric approach; therefore, we cannot observe a substantial difference between the
    confidence intervals related to the two approaches. This result shows that, in this example, the

    proposed EL approach is as efficient as the traditional parametric likelihood approach in

    the context of constructing confidence intervals.


    2.6 CONCLUSIONS

    In this chapter, we proposed and examined different parametric and distribution-free

    likelihood methods to evaluate data subject to MEs. We evaluated the common sampling

    strategy based on repeated measures and the novel hybrid sample procedure. When the

    ME problem is in effect, we pointed out that the repeated measurements strategy might

    not perform well. The proposed hybrid design utilizes the cost-efficient pooling approach

    and combines the pooled and unpooled samples.

    The study conducted in this chapter has confirmed that the repeated-measures strategy
    provides information mostly about the ME distribution, which reduces the efficiency of this
    procedure compared with the hybrid design in the context of evaluating
    biomarker characteristics. We also proposed the application of the EL techniques, which are very
    efficient nonparametric methods, to data subject to ME.

    To verify the efficiency of the hybrid design and the EL methodology, we provided

    theoretical propositions as well as the Monte Carlo simulation results.

    The numerical studies have supported our arguments that the likelihoods based on

    pooled-unpooled data are more efficient than those based on the repeated measures data.

    We showed that the EL method can be utilized as a very powerful tool in statistical

    inference involving MEs.


    CHAPTER 3

    AN EMPIRICAL LIKELIHOOD RATIO BASED

    GOODNESS-OF-FIT TEST FOR INVERSE

    GAUSSIAN DISTRIBUTIONS

    3.1 INTRODUCTION

    The Inverse Gaussian (IG) distribution has a probability density function of the form

    $f(x \mid \mu, \lambda) = \left(\frac{\lambda}{2\pi x^{3}}\right)^{1/2} \exp\left\{-\frac{\lambda (x - \mu)^{2}}{2 \mu^{2} x}\right\}, \quad x > 0,$

    where $\mu > 0$ and $\lambda > 0$ are parameters. The IG$(\mu, \lambda)$ distribution is extensively known for

    modeling and analyzing right skewed data with positive support across several different

    fields of science, e.g., demography, electrical networks, meteorology, hydrology, ecology,

    entomology, physiology, cardiology (see, for example, Chhikara and Folks, 1977;

    Bardsley, 1980; Seshadri, 1993, 1999; Johnson et al., 1994; Barndorff-Nielsen, 1994).
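
    For reference, this density can be coded directly in R as in the short sketch below (the statmod package, if available, provides equivalent dinvgauss and rinvgauss functions).

        # Inverse Gaussian density f(x | mu, lambda) written from the formula above
        dIG <- function(x, mu, lambda) {
          sqrt(lambda / (2 * pi * x^3)) * exp(-lambda * (x - mu)^2 / (2 * mu^2 * x))
        }

        curve(dIG(x, mu = 1, lambda = 2), from = 0.01, to = 4,
              xlab = "x", ylab = "density")              # quick look at the IG(1, 2) density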

    Given the utility of this distribution, it is meaningful to develop a corresponding

    goodness-of-fit test, which has satisfactory statistical properties. Towards this end, we

    propose constructing a distribution-free goodness-of-fit test for the IG distribution,

    which is based on approximating the appropriate parametric likelihood ratio test statistic.

    The parametric likelihood approach is a powerful statistical tool, which provides

    optimal statistical tests under well-known conditions; e.g. see Lehmann and Romano


    (2005); Vexler and Wu (2009); Vexler et al. (2010) and Vexler and Tarima (2010). By

    virtue of the Neyman-Pearson lemma, the likelihood ratio $L_{H_1}/L_{H_0}$, where $L_{H_0}$ and $L_{H_1}$
    correspond to the likelihoods under the hypotheses $H_0$ and $H_1$, respectively, is the most
    powerful test statistic when $L_{H_0}$ and $L_{H_1}$ are completely known. In the nonparametric
    context considered in this chapter, the forms of $L_{H_0}$ and $L_{H_1}$ are unknown, but are estimable.

    In this case, the Neyman-Pearson lemma motivates us to approximate the optimal

    likelihood ratios using an empirical approach.

    In this chapter, we develop a simple approach that approximates the most powerful

    likelihood ratio goodness-of-fit test for the IG distribution. The method is efficient and

    utilizes adapted principles of the empirical likelihood (EL) methodology. The EL

    approach allows investigators to employ methods with properties that are close, in the

    asymptotic sense, to those of parametric techniques without having to assume a

    parametric form for the likelihood functions under either hypothesis $H_0$ or $H_1$ (e.g., Lazar,

    2003; Qin and Lawless, 1994; Owen, 2001; Vexler et al., 2009; Vexler et al., 2010; Yu et

    al., 2010). The main advantage of the EL approach is that it is based on the maximum

    likelihood methodology when given a set of well-defined empirical constraints. The EL

    function of $n$ i.i.d. observations $X_1, \dots, X_n$ has the form of $\prod_{i=1}^{n} p_i$, where the
    probability weights $p_i$'s, $i = 1, \dots, n$, maximize $\prod_{i=1}^{n} p_i$ and satisfy empirical constraints corresponding to
    hypotheses of interest. This approach is based on the likelihood methodology involving
    cumulative distribution functions $F(x)$. In this case, the likelihood has the form of
    $\prod_{i=1}^{n} \{F(X_i) - F(X_i -)\}$ (see, for details, Owen, 2001). For example, if the null hypothesis
    is $F(x_0) = q_0$, then the values of $p_i$'s in $\prod_{i=1}^{n} p_i$ should be chosen to maximize $\prod_{i=1}^{n} p_i$ given
    $\sum_{i=1}^{n} p_i = 1$ and
    $\sum_{i=1}^{n} p_i I(X_i \le x_0) = q_0$, where the constraint $\sum_{i=1}^{n} p_i I(X_i \le x_0) = q_0$ is an empirical
    version of $F(x_0) = q_0$. The components $p_i$'s of the EL can be obtained using Lagrange
    multipliers.
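
    As a small illustration of this constraint, the EL ratio for $H_0: F(x_0) = q_0$ can be computed as an EL test for the mean of the indicator variables $I(X_i \le x_0)$, for example as in the R sketch below (the emplik package is assumed; the sample, $x_0$, and $q_0$ are illustrative).

        library(emplik)

        set.seed(8)
        x  <- rexp(100)                        # illustrative sample
        x0 <- 1; q0 <- 1 - exp(-1)             # hypothesized value of F(x0) under an Exp(1) model
        out <- el.test(as.numeric(x <= x0), mu = q0)
        out$"-2LLR"                            # 2log EL ratio, asymptotically chi-square with 1 df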

    Vexler and Gurevich (2010), as well as Gurevich and Vexler (2011), utilized the main

    idea of the classical EL methodology to approximate parametric likelihood ratios. The

    authors proposed a nonparametric approach based on approximate density functions. In

    this chapter, we derive the EL ratio test for the IG distribution, using the density-based

    distribution-free likelihood approach by Vexler and Gurevich (2010). Following

    Mudholkar and Tian (2002), we transform the observations $X_1, \dots, X_n$ to $Y_1, \dots, Y_n$,
    and present the likelihood function under the alternative hypothesis in the form of
    $\prod_{i=1}^{n} f_i$, where $f_i = f(Y_{(i)})$, $f$ is a density function, and $Y_{(i)}$ is the $i$th order statistic
    based on $Y_1, \dots, Y_n$. Values of $f_i$ can be estimated by maximizing $\prod_{i=1}^{n} f_i$ given an empirical
    version of the constraint $\int f(u)\,du = 1$. The nonparametric likelihood method mentioned
    above is then utilized for the purpose of developing our goodness-of-fit test statistic.

    Since the proposed test statistic approximates the most powerful parametric likelihood

    ratio, the density-based EL ratio test is shown to have very efficient characteristics. We

    also demonstrate that the proposed test statistic improves upon the decision rule of Mudholkar

    and Tian (2002) and maintains good type I error control for finite samples.

    In Section 3.2, we create the EL test based on densities for the IG. Here, theoretical

    propositions depict properties of the test. A Monte Carlo study of the power of the test is

    presented in Section 3.3. Section 3.4 is devoted to real data examples employing the

    proposed test. Section 3.5 consists of concluding remarks.


    3.2 METHOD

    In this section, we derive the EL ratio goodness-of-fit test based on approximating

    densities for the IG distribution under the null and alternative hypotheses. The proposed

    test is shown to be consistent as $n \to \infty$. Consider the problem of testing the composite
    hypothesis that $X_1, \dots, X_n$ are distributed as IG with unknown parameters $\mu$ and $\lambda$.

    Mudholkar and Tian (2002) presented an entropy characterization property of the IG

    family. The authors constructed a goodness-of-fit test for the IG distribution based on a

    test statistic involving the sample entropy. Their test statistic contains an integer

    parameter with unknown optimal values. The power of the entropy-based test statistic is

    strongly dependent on values of the unknown parameter. Mudholkar et al. (2001) also

    developed a goodness-of-fit test for the IG model by using a characterization theorem for

    the IG distribution for which it is known