Multivariable modeling of continuous covariates with a spike at zero

Dissertation for the attainment of the doctoral degree

submitted by Carolin Jenkner

to the Faculty of Mathematics and Physics of the Albert-Ludwigs-Universität Freiburg

March 2018

Dean: Prof. Dr. Gregor Herten, Physikalisches Institut, Albert-Ludwigs-Universität Freiburg, Hermann-Herder-Straße 3, 79104 Freiburg, Germany

First referee: Prof. Dr. Martin Schumacher, Institut für Medizinische Biometrie und Statistik, Universitätsklinikum Freiburg, Albert-Ludwigs-Universität Freiburg, Stefan-Meier-Straße 26, 79104 Freiburg, Germany

Second referee: Prof. Dr. Dankmar Böhning, Mathematical Sciences, University of Southampton, Building 39, Southampton SO17 1BJ, United Kingdom

Date of doctorate: 14 June 2018

Acknowledgements

My special thanks go to Prof. Dr. Martin Schumacher for his excellent supervision, his valuable support in all matters and his patience. I also thank the entire Sp@z project team, consisting of Prof. Dr. Willi Sauerbrei, Prof. Dr. Heiko Becher and Dr. Eva Lorenz, for many joint discussions that provided new ideas and input.

In addition, I would like to thank all colleagues at the Institut für Medizinische Biometrie und Statistik and the Studienzentrum Freiburg for their valuable support, their advice and practical help, and many helpful conversations.

Many thanks also go to my family and friends, who always gave me moral support, especially during strenuous phases.

Summary

In many clinical and epidemiological studies, covariates occur for which a large number of observations take the value zero. A simple example is the covariate “cigarettes smoked per day”. Non-smokers take the value zero, so in studies investigating the effect of this covariate the proportion of zeros will be relatively high. Covariates with a spike at zero (SAZ) are continuous covariates with this special distribution: the proportion of observations with value zero is higher than for continuous covariates with standard distributions (e.g. the normal distribution). This thesis investigates the properties of these covariates as predictor variables in statistical models. How can their effect on different outcomes be modelled appropriately? Depending on the purpose of a model, e.g. explanation or prediction, or on additional assumptions about the underlying true relationship, the favoured strategy can be very different. Due to the nonstandard distribution of the covariates, common assumptions of regression techniques are not fulfilled.

For analysing the effect of two covariates with SAZ on an outcome variable, four modelling strategies that include different binary indicator covariates are proposed in this thesis. All four methods are extensions of a non-linear modelling technique for one covariate with SAZ, which is based on fractional polynomials (FPs) and includes a binary variable indicating whether the covariate is zero or not (FP-spike). In the situation of one covariate with SAZ, a model with both the indicator and the original covariate is estimated first, and it is then tested whether both parts add information to the model. If this is not the case, the part with the larger p-value is dropped. The simplest extension to two covariates with SAZ is to apply the univariate FP-spike procedure separately to both SAZ covariates (Bi-Sep). However, it might be reasonable to also consider the proportions of zeros in both covariates simultaneously in the binary indicators. Different solutions are conceivable, for example, if the two covariates are both zero in most observations/patients and only one of the covariates is positive in just a few observations/patients, or if the effects of the two covariates are not additive. For the first setting, the inclusion of only one indicator covariate that marks whether both covariates are zero is proposed (Bi-D1); for non-additive effects, the inclusion of three indicator covariates separating the four different kinds of observations (both zero, only first covariate zero, only second covariate zero, both positive) is proposed (Bi-D3). If, in addition, one can assume that the functional relationship between the positive values of one or both covariates and the outcome is different in the subgroup of observations for which the respective other covariate is zero, estimation of additional functional relationships is recommended (Bi-Sub). In a simulation study, the four methods are investigated and compared with standard linear regression and with FP as a non-linear modelling technique with respect to the selected functional forms, R², the mean squared error (MSE) and an MSE in zero-related subgroups proposed in this thesis. In addition to the bivariate methods, a possible extension to three or more SAZ covariates is presented: a preassessment of the relationship between the covariates with SAZ using log-linear models is combined with the bivariate approaches for the two covariates with the strongest relationship. All modelling strategies are illustrated using real data sets from clinical and epidemiological studies.

The main finding in all analyses and simulation settings is that the proposed methods lead to improved models with respect to overall measures of fit such as MSE or R², compared with standard linear regression, if the true effect of observations with value zero on the outcome is large. In that case, general assumptions of linear regression such as a linear relationship and homoscedasticity are violated. The larger the difference between the effect on the outcome at zero and close to zero, the more important is the inclusion of one or more binary indicators, as the standard estimator will be biased due to the underlying non-linear relationship, which possibly even contains a “jump” close to zero. Because the methods were constructed for different settings as described above, e.g. for potential relationships or interactions between the two covariates and the outcome, a general recommendation as to which of them performs best cannot be given. Detailed recommendations for specific situations are given in this thesis. The proposed methods are important in explanatory model building when many observations with value zero are present. In this case, it should be assessed whether basic assumptions of standard linear regression and of non-linear modelling techniques such as FPs are violated, so that one of the presented methods should be used instead. Standard non-linear modelling techniques such as FPs perform similarly well with respect to overall measures of fit. However, the functional forms they select can have strong non-linearities that are difficult to interpret in clinical and epidemiological applications, which makes the simpler models estimated by the SAZ methods favourable.

Contents

Acknowledgements
Summary
List of Figures
List of Tables
Abbreviations
1 Introduction
2 Definitions and methodological background
2.1 Statistical model
2.2 Regression modeling
2.2.1 Linear regression
2.2.2 Logistic regression
2.2.3 Cox regression
2.3 Fractional polynomials
2.4 Correlation and dependence between covariates (with or without SAZ)
2.4.1 Correlation coefficients
2.4.2 Odds ratio
2.4.3 Log Linear Models for independence and interaction in three-way tables
2.5 Model validation - measures of comparison for normal error models
2.5.1 Accuracy of the data fit: Mean squared error
2.5.2 Predictive performance: PMSE
2.5.3 R²
2.5.4 Information criteria: AIC or BIC
2.5.5 Bias and Variance
2.5.6 Comparison and interpretation of regression coefficients
2.5.7 Graphical illustration
3 Areas of application of covariates with a spike at zero and current approaches
3.1 Predictor covariates with SAZ
3.2 Zero-inflated outcome variables
4 One covariate with a spike at zero
4.1 Initial problem
4.1.1 One SAZ covariate - theoretical derivation of the influence of the binary indicator
4.1.2 Interpretation of the coefficient of the binary indicator
4.2 Modeling one covariate with a spike at zero
4.3 When is a binary indicator required?
4.4 Case Study I - Ozone data
4.4.1 Data and methods
4.4.2 Results
4.5 Case Study II - German Breast Cancer Study Group
4.5.1 Data and methods
4.5.2 Results
5 Two covariates with a spike at zero
5.1 Initial problem
5.2 Modeling two covariates with SAZ: proposed methods
5.2.1 Bi-sep: distinct consideration of covariates with SAZ
5.2.2 Bi-D3: combination of dummy variables
5.2.3 Bi-D1: dummy indicating if both variables are zero or not
5.2.4 Bi-sub: submodels in each category
5.3 First ideas for comparison
5.4 Case study III - Study on Laryngeal Cancer
5.4.1 Data and Methods
5.4.2 Results
5.5 Simulation Study - Assessment of the proposed procedures
5.5.1 Basic Specifications and Objectives of the Simulation Study
5.5.2 Results
5.5.3 Summary of simulation results
6 More than two covariates with a spike at zero
6.1 Using Log-linear models for the analysis of three or more covariates with SAZ
6.2 Case Study IV - Study on Lung Cancer
6.2.1 Univariate and Bivariate Analyses
6.2.2 Analysis: more than two SAZ covariates
7 Discussion and further research
7.1 Methods for the analysis of covariates with SAZ
7.2 Evaluation of methods proposed in this thesis
7.3 Limitations
7.4 Conclusion
7.5 Summary of the findings of the DFG project
7.6 Further research
8 Bibliography
Appendix

List of Figures

2.1 Function Selection Procedure of the FP Procedure (FSP)
4.1 Influence of the distribution of the outcome y for observations with x-value 0 on regression
4.2 Residual-versus-fitted plots for different effects with 3 types of regression
4.3 Ozone study: results
4.4 Breast Cancer Study: FP-Spike vs. FP for Estrogen receptor, univariate
5.1 Visualization of an example result of the Bi-sep procedure
5.2 Visualization of an example model of the Bi-sub approach
5.3 Datasets 2a and 2b: Visualization
5.4 Laryngeal cancer study: continuous distribution with SAZ
5.5 Laryngeal cancer study: Comparison of methods
5.6 Simulation Study - two SAZ: simulation procedure
5.7 Simulation Study - two SAZ: S1 - scatter plot of first three runs
5.8 Simulation Study - two SAZ: S1 - first 20 fitted functions
5.9 Simulation Study - MSE comparison of scenarios S1-S5
5.10 Simulation Study - two SAZ: S7 - first 20 fitted functions
5.11 Simulation Study - MSE comparison in scenarios S6-S10
5.12 Simulation Study - two SAZ: S14 - first 20 fitted functions
5.13 Simulation Study - MSE comparison of scenarios S11-S14
5.14 Simulation Study - two SAZ: S19 - scatter plot of first three runs
5.20 Simulation Study - MSE comparison in scenarios S20-S24
6.1 Lung Cancer Study: Distribution of smoking, years in high risk occupation and life time working days in asbestos exposure separated for cases and controls

List of Tables

2.1 Loglinear Models for Three-Dimensional Tables
3.1 Details of the search strategy of literature search performed in Web of Science on 1 Nov 2016
4.1 Characteristics of simulated datasets 1a and 1b
4.2 Comparison of univariate methods for true functional relationship y1a
4.3 Comparison of univariate methods for true functional relationship y1c
4.4 Ozone Study: Details to derive the FP-Spike models for “time spent outside”
4.5 Ozone study: comparison of methods
4.6 Breast Cancer Study: Details to derive the FP-Spike models for estrogen receptor
4.7 Breast Cancer Study: selected FPs, explicit functional forms for the predictor of the Cox model for the log hazard ratio and deviances for the final model
5.1 Subsets of observations of covariates with SAZ
5.2 Definition of the 7 variables used in Bi-Sub
5.3 Summary of FP-spike approaches for one or more spike variables
5.4 Characteristics of bivariate simulated data (datasets 2)
5.5 Comparison of bivariate linear methods for y2a and y2b
5.6 Laryngeal Cancer Study: distribution of alcohol and pack years
5.7 Risk estimation for smoking and alcohol consumption from Ramroth et al. (2004) [49]
5.8 Laryngeal Cancer Study: Results of Bi-Sep, Bi-D3, Bi-D1 and Bi-Sub
5.9 Distribution of covariates x1 and x2 distinguishing observations with value zero and positive observations
5.10 Simulation Study - two SAZ: standard settings
5.11 Overview of all scenarios of the simulation study
5.12 Simulation Study - two SAZ: specifications of scenarios S1-S5
5.13 Simulation Study - two SAZ: S1 - selected functions
5.14 Simulation Study - two SAZ: S1 - R², MSE and MSE in categories
5.15 Simulation Study - two SAZ: specifications of scenarios S6-S10
5.17 Simulation Study - two SAZ: S7 - R², MSE and MSE in categories
5.18 Simulation Study - two SAZ: specifications of scenarios S11-S14
5.20 Simulation Study - two SAZ: S14 - R², MSE and MSE in categories
5.27 Specifications of scenarios S25-S29
6.1 Lung Cancer Study: Fitted values for loglinear models
6.2 Lung Cancer Study: results of loglinear models
6.3 Lung Cancer Study: Results of Bi-Sep, Bi-D3, Bi-D1 and Bi-Sub with add. SAZ covariate
7.1 Overview of work packages that were addressed in DFG research project “BE 2056/10; SA 580/7”

Abbreviations

The following list contains relevant general notation and may not be complete.

AIC Akaike Information Criterion
BIC Bayesian Information Criterion
CI Confidence interval
d.f. degrees of freedom
er estrogen receptor
FP Fractional polynomials
FSP Function selection procedure
MFP Multivariable fractional polynomials
MSE Mean squared error
MSE_A Mean squared error in category A (= {x1 = 0, x2 = 0})
MSE_B Mean squared error in category B (= {x1 = 0, x2 > 0})
MSE_C Mean squared error in category C (= {x1 > 0, x2 = 0})
MSE_D Mean squared error in category D (= {x1 > 0, x2 > 0})
n Number of observations
OR Odds ratio
PMSE Predictive mean squared error
pr progesterone receptor
py pack-year
RFS Recurrence-free survival
SAZ spike at zero

1 Introduction

One of the basic desires of mankind is to find explanations for relationships in nature, to make predictions and to gain a deeper understanding of natural and artificial relationships. Statistical models, for example, can be used to describe relationships between an outcome variable of interest and one or several influencing covariates in a structured way. The questions that can be answered by statistical models are very diverse. One field of application is the natural sciences, specifically medical science. Model building in clinical applications, but also in many other fields, faces several types of covariates which can have different distributions. The main focus of this dissertation is a special type of covariate called a variable with a spike at zero (SAZ) and how such covariates can be modeled adequately in regression models. Variables of this type are continuous variables with the extra characteristic that their distribution has a high probability mass at zero. There are several different labels for this type of variable in the literature, among others clump at zero (Hallstrom (2010) [27]), zero inflation [66], mass at zero (Aitchison (1955) [3]) and semi-continuous variable (Olsen and Schafer (2001) [47]). A typical example of a variable with a SAZ is a measure of the risk factor smoking. Smoking can be measured, for example, as the number of cigarettes smoked per day or in pack-years. In a clinical or observational study, there will usually be a large number of non-smokers. These take the value zero in both measures. The distribution of a variable such as “number of cigarettes smoked per day” in a random population will probably have a high probability mass at zero. For values greater than zero, the distribution of “number of cigarettes smoked per day” might be close to normal or follow one of many other distributions.

This dissertation deals with independent (predictor) variables with a large number of observations with value zero in a regression model and does not consider zero-inflated outcome variables. Only some basic issues of zero-inflated outcome variables are included for completeness in section 3.2.

Robertson et al. [50] were among the first to specifically address independent covariates with a SAZ in the estimation of dose-response models in case-control studies. For them, the distinction between zero and non-zero values was the distinction between individuals unexposed and exposed to a specific substance. They included a binary indicator variable in their model which indicated whether the original measurement is zero or not (1 if the covariate with a SAZ equals 0, and 0 otherwise) in order to distinguish the effect of, e.g., being a smoker or not. The additional information on how the amount of smoking influences, e.g., the risk of lung cancer is still kept. In their approach, the relationship between the non-zero values and the outcome was assumed to be linear. The primary assumption when including a binary indicator is that, e.g., the risk of a certain disease is substantially different between exposed and unexposed individuals or, in terms of the example above, between smokers and non-smokers. Robertson et al. showed that inclusion of the binary indicator variable leads to two different odds ratios (OR) for the two types of observations. The OR for the non-zero observations of the covariate is estimated using the information of these non-zero observations only, conditional on modeling the effect of the “zero” observations separately. This approach was criticized by Greenland and Poole (1995) [26], who were not convinced that excluding unexposed subjects from the estimation of the OR necessarily leads to better and unbiased results. Details can be found in section 3.1.
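The separation of the two types of observations can be illustrated with a minimal sketch. It is not the logistic dose-response model of Robertson et al.; for brevity it swaps in ordinary least squares with a continuous outcome. The function name `fit_spike_linear` and the simulated data are assumptions made for illustration only. With the indicator in the model, the fitted slope coincides with the slope of a straight line fitted to the positive observations alone, and the fitted value at zero is the mean outcome of the zero observations.

```python
import numpy as np

def fit_spike_linear(x, y):
    """Least-squares fit of y = b0 + b_z * z + b_x * x, where z = 1 if x == 0 and 0 otherwise."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    z = (x == 0).astype(float)  # binary indicator, coded as described in the text
    design = np.column_stack([np.ones_like(x), z, x])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef  # intercept, indicator coefficient, slope

# Hypothetical data: about 40% zeros, a linear effect among the positive
# values and a shifted mean outcome at zero.
rng = np.random.default_rng(0)
n = 400
is_zero = rng.random(n) < 0.4
x = np.where(is_zero, 0.0, rng.uniform(1.0, 40.0, n))
y = 2.0 + 0.1 * x - 1.5 * is_zero + rng.normal(0.0, 0.5, n)

b0, b_z, b_x = fit_spike_linear(x, y)

# Straight-line fit that uses the positive observations only.
pos = x > 0
slope_pos, intercept_pos = np.polyfit(x[pos], y[pos], 1)

print("slope with indicator:", b_x, "slope on positives only:", slope_pos)
print("fitted value at zero:", b0 + b_z, "mean outcome at zero:", y[~pos].mean())
```

The two printed pairs agree up to numerical precision, because the indicator gives the zero observations their own free parameter and leaves intercept and slope to be determined by the positive observations.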

In 2008, Royston and Sauerbrei [54] extended the approach of Robertson et al. They combined the idea of Robertson et al., the inclusion of a binary indicator variable, with their fractional polynomial (FP) approach and thus allowed modeling of non-linear relationships between the non-zero observations of the covariate and the outcome. This method was called FP-Spike. Fractional polynomial regression is a variant of polynomial regression with the additional possibility of estimating logarithmic and root functional relationships. Details can be found in Sauerbrei et al. (2007) [56]. This extra option (logarithmic and root functions) requires that all observations of covariates used in the model have a positive value. Zero or negative values needed to be shifted before model estimation was possible. In 2012, Becher et al. [5] proposed a slight modification of this procedure which avoids the initial shift of the data prior to the analysis with the FP-Spike method. Becher et al. showed that omitting the pre-transformation leads to less biased results. This version is the basis for further developments in this dissertation. In short, it can be described as a two-stage procedure: first, the most suitable FP function is chosen for the non-zero observations, with an additional binary indicator in the model. Then a significance test is used to decide whether this chosen FP function (which may be linear) and the binary indicator are both needed to describe a suitable functional relationship, or whether one of them does not add information to the model and can thus be deleted from the final model. A detailed description of the procedure is given in section 4.2.
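The two stages can be sketched roughly as follows. This is a simplified illustration under several assumptions, not the procedure of the thesis: only first-degree FPs are searched, the conventional FP power set (-2, -1, -0.5, 0, 0.5, 1, 2, 3) with power 0 read as the logarithm is used, the FP term is set to 0 for zero observations instead of shifting the data, and the significance tests of the second stage are replaced by simply reporting the residual sums of squares of the three candidate models. The names `fp1_term`, `rss` and `fp_spike_sketch` are hypothetical.

```python
import numpy as np

FP_POWERS = (-2, -1, -0.5, 0, 0.5, 1, 2, 3)  # conventional FP power set; 0 stands for log

def fp1_term(x, power):
    """FP1 transform of the positive values; zero observations stay 0 (assumed coding, no shift)."""
    x = np.asarray(x, dtype=float)
    out = np.zeros_like(x)
    pos = x > 0
    out[pos] = np.log(x[pos]) if power == 0 else x[pos] ** power
    return out

def rss(columns, y):
    """Residual sum of squares of a least-squares fit with intercept."""
    design = np.column_stack([np.ones(len(y))] + list(columns))
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    return float(resid @ resid)

def fp_spike_sketch(x, y):
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    z = (x == 0).astype(float)  # 1 if the covariate is zero, 0 otherwise
    # Stage 1: best FP1 power for the positive values, indicator included.
    best_power = min(FP_POWERS, key=lambda p: rss([z, fp1_term(x, p)], y))
    fp = fp1_term(x, best_power)
    # Stage 2 (tests omitted): fits of the full model and the two reduced models.
    return {
        "power": best_power,
        "rss_both": rss([z, fp], y),
        "rss_indicator_only": rss([z], y),
        "rss_fp_only": rss([fp], y),
    }

# Hypothetical data with a logarithmic effect among the positive values.
rng = np.random.default_rng(1)
n = 500
is_zero = rng.random(n) < 0.3
x = np.where(is_zero, 0.0, rng.uniform(1.0, 50.0, n))
y = 1.0 + 0.8 * fp1_term(x, 0) + 1.0 * is_zero + rng.normal(0.0, 0.1, n)

result = fp_spike_sketch(x, y)
print(result)
```

In an actual analysis the comparison of the three candidate models would be based on formal tests with appropriate degrees of freedom; the residual sums of squares only indicate how much fit is lost when one component is dropped.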

The occurrence of a covariate whose distribution has a spike at zero often goes undetected. Most current analyses do not account for this special type of covariate distribution and use either only a binary variable or only the original continuous variable. They ignore a possible change in the relation to an outcome variable due to this non-standard covariate distribution.

The decision whether or not to use strategies which are capable of modeling SAZ covariates is not a purely statistical one. It depends on how the zero values are generated, e.g. whether they occur for technical reasons such as a limit of detection or a measurement error (cf. Gleiss et al. (2015) [23]) or whether they are true biological zeros with certain properties. The main aim of this thesis is to investigate how two covariates with a SAZ can be modeled adequately in regression models. This will be done in several analyses and examples using linear, logistic and Cox regression. As an extension of the method of Robertson et al. [50] for dose-response modeling with binary responses, Lorenz et al. (2015) [41] derived theoretical OR functions for two covariates with SAZ. These lead to the conclusion that FP-Spike seems capable of modeling the relationship between a “dose” with a SAZ and a binary outcome. The situation with two covariates with a SAZ cannot be seen as a straightforward extension of the univariate case. Several further influencing factors can and have to be considered. The probability masses at zero of the two covariates could be interrelated. Compared with the zero/non-zero distinction in the univariate case, there are now three different types of zero values which can be distinguished: those for which both covariates are zero, and two categories in which one covariate is zero while the respective other is positive. Several different strategies to handle two covariates with a SAZ will be proposed and compared with each other and with standard existing techniques such as the straightforward extension of the approach by Robertson et al. [50] and multivariable fractional polynomials (MFP), a regression technique for non-linear model building. Different assumptions about the joint distribution of the two covariates and their relationship with the outcome lead to different options for modeling.

The simplest approach is to extend the univariate FP-spike procedure to the bivariate setting, treating both variables in exactly the same way. However, if a correlation or interaction between the two variables is assumed, further modeling strategies are possible. As there are three categories of zero observations, it is also possible to separate them using three binary indicator covariates (a fourth category, the “both non-zero” observations, is then included as well). Depending on the size of the probability mass of the zero values, it might be reasonable to use only one indicator separating the observations in which both covariates are zero from all other observations. A further aspect can be observed in a recent study by Fehringer et al. (2017) [22] on alcohol and lung cancer risk among never smokers. Here, it is assumed that the dose-response relationship in the subgroup of observations where a second covariate is zero differs from that in smokers. Considering these aspects leads to different strategies. These will be presented, explained and compared in detail. A short outlook will also be given on how a setting with three or more variables with SAZ could be handled.
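The different sets of binary indicators can be made concrete with a small sketch. The category labels A to D follow the list of abbreviations (A: both zero, B: only x1 zero, C: only x2 zero, D: both positive). The function names and the choice of D as reference category for the three-indicator coding are assumptions for illustration; the exact coding used in the thesis is defined in chapter 5.

```python
import numpy as np

def saz_categories(x1, x2):
    """Label observations: A (both zero), B (x1 = 0, x2 > 0), C (x1 > 0, x2 = 0), D (both positive)."""
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    return np.where(x1 == 0,
                    np.where(x2 == 0, "A", "B"),
                    np.where(x2 == 0, "C", "D"))

def indicator_sets(x1, x2):
    """Indicator columns for three of the strategies (coding assumed for illustration)."""
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    cat = saz_categories(x1, x2)
    return {
        # one zero/non-zero indicator per covariate
        "Bi-Sep": np.column_stack([(x1 == 0).astype(int), (x2 == 0).astype(int)]),
        # a single indicator for "both covariates are zero"
        "Bi-D1": (cat == "A").astype(int).reshape(-1, 1),
        # three indicators for A, B and C; D serves as reference (assumption)
        "Bi-D3": np.column_stack([(cat == c).astype(int) for c in "ABC"]),
    }

# One hypothetical observation per category.
x1 = [0, 0, 3, 5]
x2 = [0, 2, 0, 4]
cat = saz_categories(x1, x2)
ind = indicator_sets(x1, x2)
print(cat.tolist())             # ['A', 'B', 'C', 'D']
print(ind["Bi-Sep"].tolist())   # [[1, 1], [1, 0], [0, 1], [0, 0]]
print(ind["Bi-D1"].tolist())    # [[1], [0], [0], [0]]
print(ind["Bi-D3"].tolist())    # [[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 0, 0]]
```

The sketch covers only the indicator part of each strategy; the continuous parts for the positive values would be modeled in addition, for example with FP functions.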

The aim of this thesis, to model two covariates with a SAZ adequately in regression models, needs to be specified further and can address several different aspects. Depending on the aim of model building, the term “adequately” can have several different definitions.

Therefore, the subsequent chapters are structured as follows. Chapter 2 explains basic concepts and potential aims in the field of model building. The purpose of the statistical model might be relevant for deciding whether a model is “adequate” or not. Further basic methodology is also explained in this chapter. Fractional polynomial models as a non-linear model building technique are presented, as they are the basis for the methods which will be proposed and used for covariates with a SAZ. The question “what is the best model?” is a very complex one, and many philosophical theories exist on it. In this thesis, a small selection of model evaluation criteria is presented and explained. These criteria will be used in several sections for the comparison of modeling techniques and of the models themselves, in single datasets and in a simulation study. For the evaluation of the proposed methods, a combination of these criteria will be used.

Chapter 3 describes approaches which have been used so far in the literature and presents different research areas in which covariates with a SAZ are found. Basic concepts developed by Aitchison (1955) [3] and Robertson et al. [50] are presented. A short section on zero-inflated outcome variables is included, as these are closely related to and easily confused with independent predictor covariates with a SAZ.

In order to convey the basic challenges posed by covariates with a SAZ, chapter 4 describes some data settings in which ignoring the specific univariate distribution with a probability mass at zero leads to different results in the fitted models. It will be shown why it is important to account for the distribution. The univariate FP-spike procedure, which was first proposed by Royston and Sauerbrei (2008) [54], further explained in Royston et al. (2010) [55] and slightly modified by Becher et al. (2012) [5], is explained in detail. Simulated data examples and a dataset from a breast cancer survival study will be used for comparisons and explanations.

The concepts developed in chapter 4 provide the basis for the extensions proposed in chapter 5. The general problem will be described using simulated datasets. Four different strategies to handle two variables with a SAZ will be proposed. Their construction is based on assumptions about the joint distribution of the covariates. The methods will be compared using data from a case-control study on laryngeal cancer. The four procedures will also be compared systematically in a simulation study that varies different distributional properties of the data to be analyzed in normal-errors regression. Different aspects such as the influence of the effect size of the binary indicators, the influence of the correlation between the two covariates and the influence of the number of observations in the four categories will be investigated.

There might be cases with more than two covariates with SAZ. Chapter 6 briefly describes first thoughts on how to extend the concepts proposed in chapter 5 to more than two covariates. Chapter 7 briefly summarizes and discusses all findings.

Along the way, the thesis will investigate model building with covariates with a SAZ in several different
situations. In order to draw a bigger picture, modeling techniques that do and do not account for a SAZ
will be compared with respect to several different aspects, and their advantages will be shown. Furthermore,
it will be evaluated in which settings model building can be improved by techniques that account for
covariates with a SAZ, and in which situations simpler methods might already be sufficient.

The dissertation was written as a part of the DFG research project “BE 2056/10; SA 580/7”. In this

research project a second thesis was written by Eva Lorenz at the University of Heidelberg. The project

was divided into several work packages which were investigated individually. For reasons of completeness,

some of the results of the thesis of Eva Lorenz will also be presented in the following chapters. These parts

will be indicated. In the discussion, a general summary will be given of the major results of the project

to provide an overview and a broader picture of the research on regression modeling with covariates with

a spike at zero.


2 Definitions and methodological background

2.1 Statistical model

There are several definitions of a statistical model and its purpose. A theoretical definition, found in
several theoretical statistics textbooks, is that a statistical model is a set of probability distributions
on the sample space S (cf. McCullagh (2002) [44]). In most applied statistics textbooks, less mathematical
definitions are found. The definition used in this thesis is that of Dunkler et al. (2014) [17].
In their view, "[s]tatistical models are simple mathematical rules derived from empirical data describing
the association between an outcome and several explanatory variables." ([17] p.1) They furthermore state
that statistical models should fulfill two requirements: they should be valid and practically useful.
Being practically useful can be interpreted in several ways. One purpose of a statistical model can be to
predict outcomes of interest. Furthermore, a model can also be used to explain a relationship between one
or several variables and an outcome of interest (Royston and Sauerbrei (2008) [54]).

What is the aim of statistical modeling?

This subsection will present the major aims in statistical modeling and give reasons why these aims are

important in the evaluation of the model. To evaluate the methods presented later in this thesis,
it is necessary to distinguish these aims, because the methods might not be useful for all of them.
Many more purposes might be found, but this thesis will focus on the distinction between two major aims.

Breiman (2001) [10] describes two goals in analyzing data: prediction (“to be able to predict what the
responses are going to be in the future” ([10] p. 199)) and information (“To extract some information
about how nature is associating the response variables to the input variables” ([10] p. 199)). Shmueli
(2010) [60] distinguishes prediction and explanation as two different aims, which are very similar to
the concepts distinguished by Breiman [10]. “In many disciplines there is near-exclusive use of
statistical modeling for causal explanation and the assumption that models with high explanatory power
are inherently of high predictive power. Conflation between explanation and prediction is common, yet
the distinction must be understood for progressing scientific knowledge. While this distinction has been
recognized in the philosophy of science, the statistical literature lacks a thorough discussion of the many
differences that arise in the process of modeling for an explanatory versus a predictive goal.” (Shmueli
(2010) [60])

According to Shmueli, explanatory models in most cases involve analyzing data from observational studies
in regression models. Shmueli defines “explanatory modeling as the use of statistical models for testing causal
explanations” ([60], p.2) and compares this to predictive modeling, which is defined as “the process of
applying a statistical model or data mining algorithm to data for the purpose of predicting new or future
observations” ([60], p.3). For reasons of completeness, Shmueli also describes a third category of models,
descriptive models, with the aim of “summarizing the data structure in a compact manner” ([60], p.3).

Many more definitions of this kind can be found in the literature, but most research questions fit into one of
these three categories defined by Shmueli. Therefore, they will be used in the following. These different
aims will be important later in motivating the construction of the methods used in this thesis and
in the evaluation and interpretation of their results.

The next sections will introduce the basic concepts of the regression models used in the following, of
nonlinear modeling using fractional polynomials and of model validation.

2.2 Regression modeling

The aim of regression analyses is to find a relationship between an outcome y and explanatory variables.

In general, this relationship cannot be given exactly; there is an additional error term ε. That means y
is a random variable whose distribution depends on the variables x1, ..., xk. The expected value of y is
a function of these variables:

E(y|x1, ..., xk) = f(x1, ..., xk).

The outcome can therefore be decomposed into

y = E(y|x1, ..., xk) + ε = f(x1, ..., xk) + ε,

where ε is the random error.

The basic notation is taken from Fahrmeir et al. (2007) [20], Bland (2015) [7], Hosmer and Lemeshow (2000) [30]
and Cox (1972) [13], where more detailed descriptions can be found.



2.2.1 Linear regression

Probably the best-known type of regression is the linear regression model. For a metric continuous
outcome y, the model is

y = β0 + β1x1 + ... + βkxk + ε,

where it is assumed that f is a linear function, thus

E(y|x1, ..., xk) = f(x1, ..., xk) = β0 + β1x1 + ...+ βkxk.

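For data $(y_i, x_i)$, $i = 1, \dots, n$, of random variables y and x, the model with a single covariate reads

$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \quad i = 1, \dots, n,$$

where $\varepsilon_1, \dots, \varepsilon_n$ are independent and identically distributed with $E(\varepsilon_i) = 0$ and $\mathrm{Var}(\varepsilon_i) = \sigma^2$.

The principle of ordinary least squares estimation leads to the estimated regression coefficients

$$\hat{\beta} = (X'X)^{-1}X'y$$

with the design matrix $X = (x_{ip})$, $i = 1, \dots, n$, $p = 1, \dots, k$ (augmented by a column of ones for the intercept $\beta_0$), and $X'$ its transpose.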
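As a minimal sketch, the least squares principle can be illustrated for the special case of a single covariate, where the matrix formula above reduces to two closed-form expressions for slope and intercept. The function name `ols_simple` and the toy data are assumptions made for illustration and are not part of the analyses in this thesis.

```python
def ols_simple(x, y):
    """Closed-form least squares estimates for y_i = b0 + b1 * x_i + e_i."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Centered sums of squares and cross-products
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b1 = sxy / sxx                # slope
    b0 = mean_y - b1 * mean_x     # intercept
    return b0, b1


# Hypothetical data that lie exactly on the line y = 1 + 2x
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.0, 3.0, 5.0, 7.0, 9.0]
b0, b1 = ols_simple(x, y)
```

With error-free data on a straight line, the estimates recover the intercept and the slope of that line.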

An alternative to least squares estimation is maximum likelihood estimation. It is equivalent to least
squares estimation in the case of normally distributed errors. This principle is based on the joint
probability

P(Y1 = y1, Y2 = y2, ..., Yn = yn | β),

which depends on an unknown parameter vector β = (β0, ..., βk) that has to be estimated. The
probability or the density can be written as a function of the unknown parameter β. This function is
called the likelihood and is defined as

L(β) = P(Y1 = y1, Y2 = y2, ..., Yn = yn | β).

The maximum likelihood estimate $\hat{\beta}$ is the value of the parameter β that maximizes the function L
(cf. Fahrmeir et al. (2007) [20], p.467).

The linear regression model is sometimes also called the normal-errors model (cf. Royston and Sauerbrei
(2008) [54], p.10). This model can also be extended to non-normally distributed outcomes, e.g. binary
outcomes. These extensions are called generalized linear models. Further types of models will be described
in the next subsections.

The linear model is a good basis for investigating properties of extensions of regression models because

a lot of these extensions can then be transferred to the class of generalized linear models. All simulation

scenarios in this thesis will therefore be performed in the linear regression model.

2.2.2 Logistic regression

In logistic regression the outcome y is binary (yi ∈ {0, 1}). The aim of a logistic regression model is to

analyze the probability

P (y = 1) = P (y = 1|x1, ..., xk) = π.

The expected value is

$$E(y_i) = P(y_i = 1) = \pi_i = \frac{\exp(\eta_i)}{1 + \exp(\eta_i)}$$

with the predictor

$$\eta_i = \beta_0 + \beta_1 x_{i1} + \dots + \beta_k x_{ik}.$$

A transformation of π is used for reasons of mathematical properties:

$$g(x_i) = \ln\left(\frac{\pi(x_i)}{1 - \pi(x_i)}\right) = \eta_i.$$
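The relationship between the predictor and the probability can be sketched in a few lines. The function names `inv_logit`, `logit` and `linear_predictor` and the coefficient values are chosen here for illustration only.

```python
import math


def linear_predictor(beta, x):
    """eta = beta_0 + beta_1 * x_1 + ... + beta_k * x_k."""
    return beta[0] + sum(b * xi for b, xi in zip(beta[1:], x))


def inv_logit(eta):
    """pi = exp(eta) / (1 + exp(eta)), written in the equivalent form 1 / (1 + exp(-eta))."""
    return 1.0 / (1.0 + math.exp(-eta))


def logit(p):
    """g = ln(pi / (1 - pi)), the inverse of inv_logit."""
    return math.log(p / (1.0 - p))


# Hypothetical coefficients (beta_0, beta_1) and a single covariate value
eta = linear_predictor([-1.0, 0.5], [2.0])
pi = inv_logit(eta)
```

In this example the predictor is zero, so the modeled probability is 0.5; applying `logit` to the probability returns the predictor.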

Several medical questions have binary outcomes such as the existence of a certain characteristic (yes/no).

It is often of interest to predict the proportion of individuals having a certain characteristic or not or more

specifically the influence of further factors on this characteristic. In the following, example data sets with

binary outcomes such as the laryngeal cancer dataset with the binary outcome “laryngeal cancer yes/no”

will be used to illustrate the methods presented in this thesis. The basis here is the logistic regression

model. For more details see Hosmer and Lemeshow (2000) [30].



2.2.3 Cox regression

Cox regression models are used to analyze time-to-event data. Logistic regression is not sufficient here,
because in addition to a binary variable indicating the occurrence of an event, a second piece of information,
the time until the occurrence of the event, is modeled. A problem with this type of data is that not all
observable events might have taken place at the time of analysis (right-censoring).

The Cox model, also called the proportional hazards model, models the hazard λ(t), i.e. the instantaneous
risk of experiencing the event at time t given that no event has occurred until shortly before t. This
function λ(t) is unknown.

The standard model for the analysis is the Cox model [13]

$$\lambda(t \mid x) = \lambda_0(t) \exp\left(\sum_{i=1}^{k} x_i \beta_i\right)$$

with influencing covariates $x_i$, $i = 1, \dots, k$, covariate vector $x = (x_1, \dots, x_k)$, effects $\beta_i$ of the respective
covariates and an unspecified baseline hazard $\lambda_0(t)$.

Many medical questions can be addressed by the analysis of time-to-event data,
especially in the field of cancer research. The methods proposed in this thesis can
be used with all three model types (linear, logistic, Cox) and can also be extended to further generalized
linear models. One example for illustration uses data from a study on breast cancer with a time-to-event
endpoint.

2.3 Fractional polynomials

In all of the modeling strategies mentioned above, it is assumed that each covariate enters the linear
predictor linearly. In some data situations this assumption might not be correct.
There are different techniques which avoid this assumption, such as splines and fractional polynomials
(FPs). The methods proposed in this thesis are an extension of the FP functions proposed by Royston and
Altman (1994) [51]. FPs are extensions of power transformations of a covariate. One, two or more power
transformations of the form $x^p$ are fitted, with the exponent(s) p being chosen from a small, preselected
set S = {−2, −1, −0.5, 0, 0.5, 1, 2, 3}, where $x^0$ denotes log x. An FP function with two terms (FP2) is a
polynomial $\beta_1 x^{p_1} + \beta_2 x^{p_2}$ (and with one term (FP1) $\beta_1 x^{p_1}$), with exponents $p_1$ and $p_2$. For $p_1 = p_2 = p$
(repeated powers), FP2 is defined as $\beta_1 x^p + \beta_2 x^p \log(x)$. This leads to eight FP1 functions and 36
different FP2 functions (28 combinations of two distinct powers plus 8 repeated powers). In Royston and
Sauerbrei (2008) [54] a closed test procedure for function selection is described.
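The counts of eight FP1 and 36 FP2 functions can be checked by enumerating the power set S. The following is a minimal sketch; the names `POWERS`, `fp_term` and `fp2_basis` are hypothetical, and the covariate is assumed to be strictly positive.

```python
import math
from itertools import combinations_with_replacement

# The preselected set S of FP powers; 0 stands for log(x)
POWERS = (-2, -1, -0.5, 0, 0.5, 1, 2, 3)


def fp_term(x, p):
    """x**p with the FP convention that p = 0 denotes log(x); requires x > 0."""
    return math.log(x) if p == 0 else x ** p


def fp2_basis(x, p1, p2):
    """The two transformed values of an FP2 function.

    For repeated powers (p1 == p2) the second term is the first term times log(x).
    """
    first = fp_term(x, p1)
    second = first * math.log(x) if p1 == p2 else fp_term(x, p2)
    return first, second


fp1_models = list(POWERS)
# Unordered pairs of powers, including the repeated powers (p, p)
fp2_models = list(combinations_with_replacement(POWERS, 2))
```

Each candidate model is then a linear model in the transformed values, so the deviances of all candidates can be compared directly.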


In a function selection procedure (FSP), the best FP function is then chosen. This procedure is
illustrated in figure 2.1. First, the best FP1 and the best FP2 function are determined according to a
deviance criterion (lowest deviance among all 8 FP1 or 36 FP2 models). Then, in a closed test procedure,
it is assessed whether the most complex function, the best FP2 function, is significantly better than the
null model. If this test is not significant, the procedure stops with the conclusion that the covariate does
not have an influence on the outcome. Otherwise, testing continues. The best FP2 is tested against the
linear model. If the FP2 is not significantly better, the procedure stops with the conclusion that the best
model is the linear model. Otherwise, a further test compares the best FP2 against the best FP1. If this test
is not significant, it is concluded that FP1 is the best model. Otherwise the final model is FP2. In some
cases, it may be sensible to allow only FP1 functions. Possible reasons are a restriction to monotonic
functions and an increased power if the effect is linear. For further details see Royston and Sauerbrei
(2008) [54] (p.74-83). The usual Multivariable Fractional Polynomial (MFP) procedure combines variable
selection using backward elimination with the function selection procedure described above.

[Flow chart; all tests at significance level α]
Best FP1 and FP2 according to deviance criterion
Best FP2 model vs. null model, 4 df: if not significant, stop (effect of x not significant); if significant, continue
Best FP2 model vs. linear, 3 df: if not significant, stop (final model: linear); if significant, continue
Best FP2 model vs. best FP1, 2 df: if not significant, stop (final model: FP1); if significant, final model: FP2

Figure 2.1: Function Selection Procedure of the FP Procedure (FSP)
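The sequence of tests in figure 2.1 could be sketched as follows, assuming that each deviance difference is compared with a chi-square distribution with the stated degrees of freedom. The function names `chi2_sf` and `fsp` and the deviance values in the usage note are hypothetical; this is an illustration, not the implementation described by Royston and Sauerbrei.

```python
import math


def chi2_sf(x, df):
    """Upper tail probability of the chi-square distribution for df in {2, 3, 4}."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df == 2:
        return math.exp(-half)
    if df == 3:
        return math.erfc(math.sqrt(half)) + math.sqrt(2.0 * x / math.pi) * math.exp(-half)
    if df == 4:
        return math.exp(-half) * (1.0 + half)
    raise ValueError("only df = 2, 3, 4 are needed for the FSP")


def fsp(dev_null, dev_linear, dev_fp1, dev_fp2, alpha=0.05):
    """Closed test procedure on model deviances; returns the selected function."""
    # Test 1: best FP2 against the null model (4 df)
    if chi2_sf(dev_null - dev_fp2, 4) >= alpha:
        return "null"
    # Test 2: best FP2 against the linear model (3 df)
    if chi2_sf(dev_linear - dev_fp2, 3) >= alpha:
        return "linear"
    # Test 3: best FP2 against the best FP1 (2 df)
    if chi2_sf(dev_fp1 - dev_fp2, 2) >= alpha:
        return "FP1"
    return "FP2"
```

For example, `fsp(100.0, 80.0, 79.0, 78.5)` selects the linear function in this sketch, because the best FP2 improves significantly on the null model but not on the linear model.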

Fractional polynomials are a simple approach for nonlinear modeling resulting in global functions which

are overall interpretable. The most obvious drawback of fractional polynomials is one with regard to



sample size. In small samples, FPs have "insufficient power to detect a nonlinear function and the possible

sensitivity to extreme values at either end of the distribution of a covariate" (Royston and Sauerbrei (2008)

[54], p.9). In the case of insufficient sample size a linear function might be chosen as default.

For modeling continuous covariates with non-linear relationships to outcome variables, several different
techniques are available, such as FPs (Royston and Altman (1994) [51]) and spline-based techniques such as
restricted cubic splines (de Boor (2001) [14]), penalized regression splines (Eilers and Marx (1996) [18])
or smoothing splines (Green and Silverman (1994) [24]). There is extensive literature on the comparison
of splines and fractional polynomials, e.g. Strasak et al. (2010) [61] and Binder et al. (2013) [6].

The methods proposed in the following will be based on fractional polynomials.

2.4 Correlation and dependence between covariates (with or without SAZ)

As two or more covariates with a SAZ will be considered in the following, the relationship between those
covariates could also influence the outcome. Different concepts are available to describe this relationship,
depending on the type of covariates (categorical, continuous, etc.).

2.4.1 Correlation coefficients

Correlation coefficients are measures of the association between two continuous covariates. Two specific
types of correlation coefficients will be used in this thesis, Pearson’s and Spearman’s rank correlation
coefficient. The Pearson correlation is calculated as follows:

$$\rho_{X,Y} = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \sigma_Y}.$$

It is a measure of the degree of linear correlation between two covariates. If the relationship between
the two covariates is assumed to be nonlinear, the Spearman rank correlation coefficient can be used as a
measure of association. It uses the ranks of the observations instead of the original values. The definition
is as follows:

$$r = \frac{\sum_i (rg(x_i) - \overline{rg}_X)(rg(y_i) - \overline{rg}_Y)}{\sqrt{\sum_i (rg(x_i) - \overline{rg}_X)^2 \sum_i (rg(y_i) - \overline{rg}_Y)^2}}$$

where $rg(x_i)$ is the rank of $x_i$ and $\overline{rg}_X$ is the mean of the ranks of X (analogously for Y). It measures the
monotonic relationship between the two covariates.
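The difference between the two coefficients can be illustrated with a small sketch. The function names and the toy data are assumptions made for illustration; tied values, which are frequent for covariates with a SAZ, are assumed here to receive the mean of their ranks.

```python
import math


def pearson(x, y):
    """Sample Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the block while the next sorted value is tied with the current one
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2.0 + 1.0
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical data with a monotonic but nonlinear relationship
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [v ** 3 for v in x]
r_pearson = pearson(x, y)
r_spearman = spearman(x, y)
```

For this strictly increasing relationship the Spearman coefficient equals 1, whereas the Pearson coefficient is smaller than 1 because the relationship is not linear.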

Huson (2007) [31] investigated the performance of correlation measures in zero-inflated data. He
found that for two binomial-lognormal distributed covariates, the Pearson and Spearman correlations on
average slightly underestimate the true correlation. “However, the bias is relatively small” (Huson (2007) [31],
p.533). The correlation coefficients were investigated in simulation scenarios. The overall result stated
that “the Pearson estimate is in fact, for most practical purposes, an adequate choice from the coefficients
studied [(Pearson correlation, Spearman correlation, weighted rank correlation)]” ([31], p.536).

In the situation with two covariates with SAZ X1 and X2, four different types of observations can be

distinguished. There are observations with both X1 = 0 and X2 = 0 (category A), there are observations

with X1 = 0 and X2 > 0 (category B) or X1 > 0 and X2 = 0 (category C), and there are observations

with both X1 > 0 and X2 > 0 (category D). In some settings, it is assumed that the relationship to an
outcome Y differs considerably between observations of different categories. That implies that the relationship
between the two covariates X1, X2 and the outcome Y is not monotonic. In this situation, it can be
helpful to compare the correlation of X1 and X2 within those categories, or to be more precise within
category D, as both the Pearson correlation and the Spearman correlation are not defined if one or both of
the covariates are always zero, which is the case for categories A, B and C. Considering the correlation
between the outcome Y and a covariate X with a SAZ, it can also be reasonable to determine the correlation
only for observations with X > 0: the overall relationship including observations with value zero might not
be monotonic, whereas in this subset the assumption might still hold.

2.4.2 Odds ratio

For binary data, the association between two covariates or measurements can be summarized as an odds
ratio. Covariates with a SAZ are characterized by a large number of zero values. Thus,

a binary distinction between zero and non-zero observations seems reasonable. Therefore, odds ratios

can also be used for the description of the relationship between two covariates with SAZ. The basis is a

contingency table displaying frequencies (f) of observations in the respective categories.

                 Y = 0    Y > 0    Total
X = 0            f00      f01      f0.
X > 0            f10      f11      f1.
Total            f.0      f.1      n



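Using the frequencies displayed in the table, the odds ratio can be calculated as follows:

$$\mathrm{OR} = \frac{P(Y = 0 \mid X = 0)\,/\,P(Y > 0 \mid X = 0)}{P(Y = 0 \mid X > 0)\,/\,P(Y > 0 \mid X > 0)} = \frac{f_{00}/f_{01}}{f_{10}/f_{11}} = \frac{f_{00} f_{11}}{f_{10} f_{01}}.$$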
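As a minimal sketch, the contingency table and the odds ratio can be obtained from two covariates by dichotomizing each at zero. The function names `zero_table` and `odds_ratio` and the data are hypothetical, and the covariates are assumed to be non-negative.

```python
def zero_table(x, y):
    """2 x 2 frequencies f[a][b] with a = 1 if x > 0 (else 0) and b = 1 if y > 0 (else 0)."""
    f = [[0, 0], [0, 0]]
    for xi, yi in zip(x, y):
        f[int(xi > 0)][int(yi > 0)] += 1
    return f


def odds_ratio(f):
    """Cross-product ratio f00 * f11 / (f10 * f01); undefined if f10 or f01 is zero."""
    return (f[0][0] * f[1][1]) / (f[1][0] * f[0][1])


# Hypothetical observations of two covariates with a spike at zero
x = [0, 0, 0, 0, 2.5, 1.0, 3.0, 0, 4.0, 0.5]
y = [0, 0, 0, 1.2, 0.7, 2.0, 0, 0, 1.0, 3.3]
table = zero_table(x, y)
ratio = odds_ratio(table)
```

In this example most observations are either zero in both covariates or positive in both, which results in an odds ratio well above 1.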

2.4.3 Log Linear Models for independence and interaction in three-way tables

Log linear models are a technique to analyze multiway contingency tables. They are a special case of

GLMs for Poisson-distributed data. In this type of analysis, all variables are treated as response variables

i.e. no distinction between dependent and independent variables is made. A common use is modeling cell

counts in contingency tables. Thus, loglinear models only describe associations between variables,
similarly to the correlation coefficients and the odds ratio for two covariates described above. Their purpose
is the analysis of association and interaction patterns. If there is only one single binary response variable

it is easier to use the logistic model described in the previous section. The principle will be explained using

a 2 x 2 x 2 multiway contingency table as this setting will be dealt with later in this thesis. Definitions

are taken from Agresti (2002) [1], chapter 8.

The basic setting is “an I × J × K contingency table that cross-classifies a multinomial sample of n
subjects on three categorical response variables X, Y and Z. The cell probabilities are $\{\pi_{ijk}\}$ with
$\sum_i \sum_j \sum_k \pi_{ijk} = 1$ and the expected frequencies are $\mu_{ijk} = n\pi_{ijk}$” (Agresti (2002) [1], p.314). The
observed cell counts are denoted by $n_{ijk}$. The three-way I × J × K cross-classification of response variables
has several potential types of independence. These definitions are again taken from Agresti (2002) [1], p.31.
The three variables are said to be mutually independent if

$$\pi_{ijk} = \pi_{i++}\pi_{+j+}\pi_{++k} \quad \forall i, j, k.$$

For mutually independent variables, the expected frequencies $\mu_{ijk}$ have the loglinear form

$$\log \mu_{ijk} = \lambda + \lambda_i^X + \lambda_j^Y + \lambda_k^Z.$$

Variable Y is jointly independent of X and Z when

$$\pi_{ijk} = \pi_{i+k}\pi_{+j+} \quad \forall i, j, k.$$

This is ordinary two-way independence between Y and a variable composed of the IK combinations of levels
of X and Z. The loglinear model is

$$\log \mu_{ijk} = \lambda + \lambda_i^X + \lambda_j^Y + \lambda_k^Z + \lambda_{ik}^{XZ}.$$

Symbol          Loglinear model
(X, Y, Z)       log µ = λ + λ^X + λ^Y + λ^Z
(X, YZ)         log µ = λ + λ^X + λ^Y + λ^Z + λ^YZ
(XY, YZ)        log µ = λ + λ^X + λ^Y + λ^Z + λ^XY + λ^YZ
(XY, YZ, XZ)    log µ = λ + λ^X + λ^Y + λ^Z + λ^XY + λ^YZ + λ^XZ
(XYZ)           log µ = λ + λ^X + λ^Y + λ^Z + λ^XY + λ^YZ + λ^XZ + λ^XYZ

Table 2.1: Loglinear Models for Three-Dimensional Tables

Analogously, X can be jointly independent of Y and Z, or Z jointly independent of X
and Y. Mutual independence implies joint independence of any one variable from the others.

X and Y are conditionally independent given Z when independence holds for each partial table within
which Z is fixed. That means that if $\pi_{ij|k} = P(X = i, Y = j \mid Z = k)$, then

$$\pi_{ij|k} = \pi_{i+|k}\pi_{+j|k} \quad \forall i, j, k.$$

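Conditional independence of X and Y given Z is the loglinear model

$$\log \mu_{ijk} = \lambda + \lambda_i^X + \lambda_j^Y + \lambda_k^Z + \lambda_{ik}^{XZ} + \lambda_{jk}^{YZ}.$$

This is a weaker condition than mutual or joint independence. A model that permits all three pairs to
be conditionally dependent is

$$\log \mu_{ijk} = \lambda + \lambda_i^X + \lambda_j^Y + \lambda_k^Z + \lambda_{ij}^{XY} + \lambda_{ik}^{XZ} + \lambda_{jk}^{YZ}.$$

This model is also called the loglinear model of homogeneous association or of no three-factor interaction.
The general loglinear model for a three-way table is

$$\log \mu_{ijk} = \lambda + \lambda_i^X + \lambda_j^Y + \lambda_k^Z + \lambda_{ij}^{XY} + \lambda_{ik}^{XZ} + \lambda_{jk}^{YZ} + \lambda_{ijk}^{XYZ}.$$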

This model has as many parameters as observations and is saturated. It describes all possible positive

µijk. Each pair of variables may be conditionally dependent, and an odds ratio for any pair may vary

across categories of the third variable. Table 2.1 lists some of the possible models, for which certain

parameters are equal to zero.
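For the mutual independence model (X, Y, Z), the fitted expected frequencies have a closed form: the product of the three one-way marginal totals divided by n². A minimal sketch with a hypothetical 2 x 2 x 2 table follows; the counts and the function name `mutual_independence_fit` are invented for illustration.

```python
def mutual_independence_fit(counts):
    """Expected frequencies n_i++ * n_+j+ * n_++k / n**2 for a three-way table."""
    I, J, K = len(counts), len(counts[0]), len(counts[0][0])
    n = sum(counts[i][j][k] for i in range(I) for j in range(J) for k in range(K))
    # One-way marginal totals of X, Y and Z
    nx = [sum(counts[i][j][k] for j in range(J) for k in range(K)) for i in range(I)]
    ny = [sum(counts[i][j][k] for i in range(I) for k in range(K)) for j in range(J)]
    nz = [sum(counts[i][j][k] for i in range(I) for j in range(J)) for k in range(K)]
    return [[[nx[i] * ny[j] * nz[k] / n ** 2 for k in range(K)]
             for j in range(J)] for i in range(I)]


# Hypothetical observed counts n_ijk, indexed as counts[i][j][k]
counts = [[[10, 20], [30, 40]],
          [[5, 15], [25, 55]]]
mu = mutual_independence_fit(counts)
total_mu = sum(mu[i][j][k] for i in range(2) for j in range(2) for k in range(2))
```

Comparing these fitted values with the observed counts indicates how far the table deviates from mutual independence; the models with interaction terms in Table 2.1 generally require iterative fitting, which is not shown here.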

Interpretation of loglinear model parameters always concerns the highest-order terms included in the
model. Details can be found in Agresti (2002) [1], p.321. An application of loglinear models for covariates
with a SAZ and an illustration in a data example is given in section 6.

2.5 Model validation - measures of comparison for normal error models

In order to evaluate the methods which will be introduced and proposed in the following chapters, measures

for evaluation are needed. Depending on the purpose of a model, evaluation can be very different. The

two main aims were introduced at the beginning of the chapter. The question “What is a good model?”

is not easy to answer. Several criteria have been proposed mostly for the evaluation of very specific

characteristics of statistical models, e.g. the predictive performance. David Hand states that “the objective

of a classification rule is straightforward: one wants to classify correctly as many future objects as possible”

(Hand (1997) [28] p.97). This is clearly the objective of a model built for classification or prediction.
Hosmer and Lemeshow, in contrast, ask “how effectively the model we have describes the outcome
variable. [They refer to this] as its goodness-of-fit” (Hosmer and Lemeshow [30], p.143). Their aim is
that the values predicted by the model are very close to the original data. In the literature, one will find
many more strategies of evaluation. This thesis will focus on goodness-of-fit criteria, which will
be described in the following.

In addition to the general fit of the model to the data and its predictive performance, every method has
assumptions which can be violated and thus influence the fit. All of the methods for analyzing variables
with a spike at zero proposed in the literature make certain assumptions about the distribution of the data

and about the relationship between the covariates and the outcome. “When the assumptions of a model

are grossly violated or when a model is used unwisely for a given patient sample, the performance of a

model may be poor. [...] Only with appropriate model validation can an apparently accurate model be

shown to be inaccurate” (Harrell et al 1996 [29]). Thus, it is absolutely necessary to check carefully how

the proposed modeling techniques behave with different distributional data settings and how easily they

detect the correct or true underlying relationship in the data. Some assumptions about covariates with a spike
at zero and their relationship to an outcome cannot be easily verified or falsified. Therefore, to evaluate
the fit of the estimated models, several different aspects have to be taken into account. The inclusion of
subject-matter knowledge also seems necessary.

In this section, some criteria for model evaluation will be presented. Evaluating the special situation
with covariates with a spike at zero requires careful thought about the aim of this specific statistical
model and its evaluation. Is it sufficient to improve the overall fit or the overall predictive performance
of the model? Or are there other criteria, like the interpretation of the estimates, which cannot be
measured as easily as global measures of performance but which are needed if the aim of the statistical
model is the explanation of a real-world relationship? Depending on the aim of the statistical model, as
stated in the last section, evaluation criteria can vary. Breiman (2001) [10] addressed this issue and stated
that the standard goodness-of-fit criteria are often not sufficient (cf. [10] p.203). Several
further criteria could be considered. However, with this small selection the main characteristics of interest
can be evaluated. Most of the proposed criteria will be used for the evaluation of linear
regression models, as the main part of this thesis is the evaluation of methods for the analysis of covariates
with a SAZ with continuous outcome. Some of the proposed evaluation criteria can also be calculated for
logistic and Cox models with small adaptations.

2.5.1 Accuracy of the data fit: Mean squared error

The mean squared error is a criterion which quantifies the difference between the observed and the
estimated outcome values. It is defined as follows:

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2.$$

It measures the mean squared difference between the observed and the estimated values of y ($y_i$ observed
values, $\hat{y}_i$ estimated values, n total number of observations). One possible extension, which allows a
separate evaluation of the model for observations with value zero and for the non-zero observations, is to
split the MSE into two parts with

$$\mathrm{MSE}_0 = \frac{1}{n_0}\sum_{i:\,x_i = 0} (y_i - \hat{y}_i)^2$$

and

$$\mathrm{MSE}_{cont} = \frac{1}{n - n_0}\sum_{i:\,x_i \neq 0} (y_i - \hat{y}_i)^2,$$

where n is the total number of observations and $n_0$ is the number of observations with $x_i = 0$. This
results in a vector of MSEs, $\overrightarrow{\mathrm{MSE}} = (\mathrm{MSE}_0, \mathrm{MSE}_{cont})$: one MSE is calculated only for
observations with x-value zero and a second one for the remaining observations. The
standard MSE can be calculated as a weighted sum of these two MSEs:

$$\mathrm{MSE} = \frac{n_0}{n}\mathrm{MSE}_0 + \frac{n - n_0}{n}\mathrm{MSE}_{cont}.$$
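The split of the MSE and its recombination as a weighted sum can be checked numerically. The following sketch uses hypothetical observed and fitted values; the function names `mse` and `split_mse` are assumptions made for illustration.

```python
def mse(pairs):
    """Mean squared difference over (observed, estimated) pairs."""
    return sum((y - y_hat) ** 2 for y, y_hat in pairs) / len(pairs)


def split_mse(x, y, y_hat):
    """Return (MSE_0, MSE_cont, n_0) for observations with x == 0 and x != 0."""
    zero = [(yi, hi) for xi, yi, hi in zip(x, y, y_hat) if xi == 0]
    cont = [(yi, hi) for xi, yi, hi in zip(x, y, y_hat) if xi != 0]
    return mse(zero), mse(cont), len(zero)


# Hypothetical covariate with a spike at zero, observed and fitted outcomes
x = [0, 0, 0, 1.5, 2.0]
y = [1.0, 2.0, 3.0, 4.0, 6.0]
y_hat = [2.0, 2.0, 2.0, 5.0, 5.0]

mse0, mse_cont, n0 = split_mse(x, y, y_hat)
n = len(x)
overall = mse(list(zip(y, y_hat)))
# Weighted sum of the two parts reproduces the standard MSE
recombined = n0 / n * mse0 + (n - n0) / n * mse_cont
```

Both parts must contain at least one observation; otherwise the corresponding mean is undefined.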

As both a univariate and a bivariate situation will be discussed in this thesis, for the latter case
the MSE can be separated further into four categories of observations defined by the following subsets:
$A = \{(x_{1i}, x_{2i}, y_i) : x_{1i} = 0, x_{2i} = 0\}$, $B = \{(x_{1i}, x_{2i}, y_i) : x_{1i} = 0, x_{2i} > 0\}$,
$C = \{(x_{1i}, x_{2i}, y_i) : x_{1i} > 0, x_{2i} = 0\}$, and $D = \{(x_{1i}, x_{2i}, y_i) : x_{1i} > 0, x_{2i} > 0\}$. The mean squared
difference of observed and estimated values of y is calculated separately in each category. In that way, one
can see directly which model has strengths and weaknesses for which types of observations. The MSEs
are defined as follows:

$$\mathrm{MSE}_A = \frac{1}{|A|}\sum_{i:\,(x_{1i}, x_{2i}, y_i) \in A} (y_i - \hat{y}_i)^2 \tag{2.1}$$

$$\mathrm{MSE}_B = \frac{1}{|B|}\sum_{i:\,(x_{1i}, x_{2i}, y_i) \in B} (y_i - \hat{y}_i)^2 \tag{2.2}$$

$$\mathrm{MSE}_C = \frac{1}{|C|}\sum_{i:\,(x_{1i}, x_{2i}, y_i) \in C} (y_i - \hat{y}_i)^2 \tag{2.3}$$

$$\mathrm{MSE}_D = \frac{1}{|D|}\sum_{i:\,(x_{1i}, x_{2i}, y_i) \in D} (y_i - \hat{y}_i)^2 \tag{2.4}$$

where $|\ast|$ represents the number of observations in the set $\ast$. It is important to remember that the
sample sizes of these sets are not equal and therefore not every mean has the same weight in the
standard overall MSE. The standard MSE can also be calculated as a weighted sum of these four MSEs:

$$\mathrm{MSE} = \frac{|A|}{n}\mathrm{MSE}_A + \frac{|B|}{n}\mathrm{MSE}_B + \frac{|C|}{n}\mathrm{MSE}_C + \frac{|D|}{n}\mathrm{MSE}_D.$$

All of these measures can be defined with the absolute difference instead of the squared difference as
well. With squared measures, extreme differences and errors are weighted more heavily than smaller ones.
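The category-wise evaluation for two covariates can be sketched as follows. The function names `category` and `mse_by_category` and the data are hypothetical, and both covariates are assumed to be non-negative, so that a non-zero value is a positive value.

```python
def category(x1, x2):
    """Assign an observation to one of the four categories A, B, C, D."""
    if x1 == 0 and x2 == 0:
        return "A"
    if x1 == 0:
        return "B"   # x1 = 0, x2 > 0
    if x2 == 0:
        return "C"   # x1 > 0, x2 = 0
    return "D"       # both positive


def mse_by_category(x1, x2, y, y_hat):
    """Return ({category: MSE}, {category: number of observations})."""
    sums, sizes = {}, {}
    for a, b, yi, hi in zip(x1, x2, y, y_hat):
        c = category(a, b)
        sums[c] = sums.get(c, 0.0) + (yi - hi) ** 2
        sizes[c] = sizes.get(c, 0) + 1
    return {c: sums[c] / sizes[c] for c in sums}, sizes


# Hypothetical data: two covariates with a spike at zero, observed and fitted outcomes
x1 = [0, 0, 0, 2, 3, 1]
x2 = [0, 0, 4, 0, 5, 2]
y = [1.0, 3.0, 2.0, 5.0, 4.0, 6.0]
y_hat = [2.0, 2.0, 2.0, 4.0, 6.0, 6.0]

mses, sizes = mse_by_category(x1, x2, y, y_hat)
n = len(y)
# Weighted sum over the categories reproduces the standard MSE
weighted = sum(sizes[c] / n * mses[c] for c in mses)
```

Categories without observations do not appear in the result, which avoids a division by zero for empty sets.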

2.5.2 Predictive performance: PMSE

Instead of evaluating the estimated model using the data it was fitted with, it is also possible to use a
new dataset with the same type of distribution, or to split the original dataset D into a training dataset
$D_1$ and a test dataset $D_2$. If D has a total number of N observations, $D_1$ contains the first m observations
($D_1 = \{(x_i, y_i) : i \le m\}$) and $D_2$ contains all further observations ($D_2 = \{(x_i, y_i) : m < i \le N\}$). The model is
fitted using the observations in $D_1$. The predictive mean squared error in case the original dataset was
split into two datasets is calculated using the N − m observations of $D_2$. It is defined as

$$\mathrm{PMSE} = \frac{1}{N - m}\sum_{t=m+1}^{N} \left(f(x_t) - \hat{f}(x_t)\right)^2, \tag{2.5}$$

where f is the observed functional relationship (values of the test dataset) and $\hat{f}$ the estimated functional
relationship (estimated on the training dataset $D_1$).
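A minimal sketch of this split-sample evaluation is given below. It assumes a simple linear model with one covariate as the fitted model and averages the squared prediction errors over the N − m test observations; the function names `fit_line` and `pmse` and the data are hypothetical.

```python
def fit_line(x, y):
    """Least squares fit of y = b0 + b1 * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b1 = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    return my - b1 * mx, b1


def pmse(x, y, m):
    """Fit on the first m observations (D1), evaluate on the remaining N - m (D2)."""
    b0, b1 = fit_line(x[:m], y[:m])
    test = list(zip(x[m:], y[m:]))
    return sum((yt - (b0 + b1 * xt)) ** 2 for xt, yt in test) / len(test)


# Hypothetical data: the first four points lie on y = 1 + 2x, the last two deviate
x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.0, 3.0, 5.0, 7.0, 10.0, 10.0]
value = pmse(x, y, m=4)
```

In this example each of the two test observations is missed by one unit, so the PMSE equals 1 although the fit on the training data is perfect.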

2.5.3 R2

A typical measure to evaluate linear regression models and their calibration is $R^2$, defined as

$$R^2 = \frac{\sum_i (\hat{y}_i - \bar{y})^2}{\sum_i (y_i - \bar{y})^2}.$$

Here, i is the index of the observations, $y_i$ are the observed values, $\hat{y}_i$ are the estimated values,
and $\bar{y}$ is the mean of the observed values. Thus, it is the proportion of variation in the outcome variable
that is accounted for (or explained) by the regression model. If the predictions are close to the observed
values, $R^2$ tends to 1. In the same way as proposed for the mean squared error, $R^2$ can be calculated
separately in the different categories (e.g. the four categories A to D) to assess the fit in the different areas.
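The definition can be illustrated with a short sketch. The data are hypothetical, and the fitted values are assumed to come from a least squares fit with intercept; only in that case does this ratio coincide with one minus the ratio of residual to total sum of squares.

```python
def r_squared(y, y_hat):
    """Explained sum of squares divided by total sum of squares."""
    mean_y = sum(y) / len(y)
    explained = sum((h - mean_y) ** 2 for h in y_hat)
    total = sum((v - mean_y) ** 2 for v in y)
    return explained / total


# Hypothetical observed values and least squares fitted values (fit: 1.3 + 0.8 * x, x = 0..3)
y = [1.0, 3.0, 2.0, 4.0]
y_hat = [1.3, 2.1, 2.9, 3.7]
r2 = r_squared(y, y_hat)

# Equivalent form for least squares fits with intercept
rss = sum((v - h) ** 2 for v, h in zip(y, y_hat))
tss = sum((v - sum(y) / len(y)) ** 2 for v in y)
r2_alt = 1.0 - rss / tss
```

A category-wise version would apply the same function to the observations of one category only.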

2.5.4 Information criteria: AIC or BIC

The AIC measures the relative quality of a statistical model for a given set of data. For different
non-nested models, the AIC takes both goodness of fit and complexity into account. It is a suitable global
measure to compare different modeling techniques with different complexities. It is defined as

AIC = 2k − 2 log L,

where k is the number of parameters in the model and L is the maximized value of the likelihood
function for the model. The Bayesian information criterion (BIC) is similar. It furthermore takes the
number of observations n into account:

BIC = k log(n) − 2 log L.

AIC and BIC can be used both for the univariate and the bivariate models. As the penalty in BIC is
larger than in AIC (for n ≥ 8, where log(n) > 2), evaluation and model decision using BIC lead to sparser
models.
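For a normal-errors model, both criteria can be computed from the residuals, because the maximized log-likelihood depends on the data only through the residual sum of squares. The sketch below uses hypothetical residuals; whether the error variance is counted among the k parameters is a convention and is left to the caller here.

```python
import math


def gaussian_loglik(residuals):
    """Maximized log-likelihood of a normal-errors model, with sigma^2 estimated as RSS / n."""
    n = len(residuals)
    sigma2 = sum(r ** 2 for r in residuals) / n
    return -0.5 * n * (math.log(2.0 * math.pi * sigma2) + 1.0)


def aic(loglik, k):
    return 2 * k - 2 * loglik


def bic(loglik, k, n):
    return k * math.log(n) - 2 * loglik


# Hypothetical residuals of a fitted model with k = 3 parameters
residuals = [0.5, -0.5, 1.0, -1.0, 0.5, -0.5, 1.0, -1.0, 0.0, 0.0]
n = len(residuals)
loglik = gaussian_loglik(residuals)
aic_value = aic(loglik, k=3)
bic_value = bic(loglik, k=3, n=n)
```

The difference between the two criteria is k(log(n) − 2), so with n = 10 observations the BIC already penalizes each parameter more strongly than the AIC.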

2.5.5 Bias and Variance

The criteria presented so far are global measurements which can be easily calculated for each model.

Bias and Variance are not always easily measurable and in many cases not easy to separate. However,

they are always present in model building. Shuemli (2010) [60] used these two definitions to explain the

different focus of model building with the aim of prediction or explanation. He states that “in explanatory

modeling, validation consists of two parts: model validation validates that [f (estimated model] adequately

represents [f (true model)], and model fit validates that f fits the data [{xi, yi, i = 1, ..., n}]. In contrast,

validation in predictive modeling is focused on generalization, which is the ability of f to predict new data

{Xnew, Ynew}” ([60], p. 11/12).

The expected prediction error can be defined as follows:

EPE = E[Y − f(x)]2 (2.6)

= E[Y − f(x)]2 + (E[f(x)]− f(x))2 + E [(f)− E [(f)(x)]]2 (2.7)

= V ar(Y ) +Bias2 + V ar(f(x)) (2.8)

Bias is the result of a misspecified statistical model f. Estimation variance (the third term) is the result
of using a sample to estimate f. The first term is the error that remains even if the model is correctly
specified and accurately estimated. This decomposition shows one source of the difference between
explanatory and predictive modeling. Explanatory modeling focuses on minimizing bias to obtain the most
accurate representation of the underlying theory, whereas predictive modeling aims to minimize the
combination of bias and estimation variance (cf. Shmueli (2010) [60], p. 6). There are even cases where
incorrectly specified models lead to better prediction results (cf. [60], p. 19/20).
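As a rough numerical illustration of the decomposition in (2.6) to (2.8), the following Python sketch simulates prediction at a single fixed point x0. All settings (the true value f(x0) = 2, the noise standard deviation, the training size and the deliberately biased estimator that shrinks the sample mean by a factor of 0.8) are assumptions chosen for illustration and are not taken from [60].

```python
import random
import statistics

random.seed(1)

F_X0 = 2.0     # assumed true value f(x0)
SIGMA = 1.0    # assumed standard deviation of the irreducible noise in Y
M = 5          # assumed number of training observations per replication
SHRINK = 0.8   # deliberately biased estimator: 0.8 * sample mean
REPS = 20000   # Monte Carlo replications

squared_errors = []
estimates = []
for _ in range(REPS):
    # Draw a training sample at x0 and estimate f(x0).
    train = [F_X0 + random.gauss(0.0, SIGMA) for _ in range(M)]
    f_hat = SHRINK * statistics.fmean(train)
    # Draw one new observation and record the squared prediction error.
    y_new = F_X0 + random.gauss(0.0, SIGMA)
    squared_errors.append((y_new - f_hat) ** 2)
    estimates.append(f_hat)

# Monte Carlo estimates of EPE, squared bias and estimation variance.
epe_mc = statistics.fmean(squared_errors)
bias_sq_mc = (statistics.fmean(estimates) - F_X0) ** 2
var_mc = statistics.pvariance(estimates)

# Theoretical values under the assumptions above.
bias_sq = ((SHRINK - 1.0) * F_X0) ** 2      # (E[f_hat] - f)^2
var_hat = SHRINK ** 2 * SIGMA ** 2 / M      # Var(f_hat)
epe_theory = SIGMA ** 2 + bias_sq + var_hat  # Var(Y) + Bias^2 + Var(f_hat)

print(f"EPE (simulated)   = {epe_mc:.3f}")
print(f"EPE (theoretical) = {epe_theory:.3f}")
print(f"bias^2 = {bias_sq_mc:.3f}, estimation variance = {var_mc:.3f}")
```

The simulated expected prediction error should be close to the sum of the three theoretical components; the agreement is only approximate because of Monte Carlo error.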

2.5.6 Comparison and interpretation of regression coefficients

Comparison and interpretation of the fitted regression coefficients is an essential part of an explanatory

model. “What do the estimated coefficients in the model tell us about the research questions that motivated

the study?” (Hosmer and Lemeshow (2000) [30], p. 47). If the objective is to study a relationship

between an influential covariate and an outcome, the size of the effect is of interest. Depending on the

type of model the interpretation of the coefficients is different. This is also an important aspect in the

analysis of covariates with a spike at zero, as the proposed models will contain additional coefficients for

the zero values. A direct and simple comparison of the regression coefficients of traditional linear models

and the extension for spike at zero seems unfeasible in non-linear regression modeling and in non-nested

models. Qualitative comparisons in terms of the interpretation of the chosen regression model will be

performed.

2.5.7 Graphical illustration

A first impression of the data to be modeled can be obtained from different visualizations. Plotting the
estimated functions together with the observed data, and additionally the residuals with a smoothing
function (e.g. the lowess smoother), allows a first visual evaluation of the fitted model. Especially in
non-linear regression models, where the coefficients cannot be compared directly, a graphical illustration
offers a first possibility for a descriptive comparison of two models. Residuals differ depending on the type
of regression model (linear, logistic, Cox). For the logistic regression model, categorized odds ratios will
be used to give an impression of the original data. They are included as bubbles whose size represents the
number of observations in the respective category. Plots can be used both in the univariate and in the
bivariate setting. For bivariate models one can either use 3-D plots or two-dimensional plots for fixed
values of the second covariate. Box plots will be used to compare the distribution of the MSE and of the
MSE in categories across several simulation scenarios.

3 Areas of application of covariates with a spike at zero and current approaches

The phenomenon of variables with a large number of zero values is often found in the literature. Due to
the variety of problems being addressed, differing terminologies are used. Aitchison [3] used the description
“discrete probability mass at zero”. Olsen & Schafer [47] called the variables “semi-continuous”.
Schisterman et al. [58] and Robertson et al. [50] called them “variables with a spike at zero (SAZ)”. As
terminologies are not used consistently, a systematic literature search may not find all relevant
publications. The general issue has often been addressed in the context of outcome variables, e.g. in the
linear model. Predictor covariates with a SAZ have, to our knowledge, only seldom been considered, although
predictor covariates with such a non-standard distribution are common.

A systematic literature search was performed on 1 November 2016 in Web of Science using the search
terms “zero-inflated”, “spike at zero”, “mass at zero”, “excess zeros” and “clump at zero”. The results
were refined using the Web of Science categories “mathematics”, “mathematics applied”, “mathematical
computational biology”, “mathematics interdisciplinary applications” and “statistics probability”.
Publications from 1980 to date were included. This led to 336 results. Details of the search strategy can
be found in Table 3.1.

These 336 records were screened. Records whose abstracts clearly stated that the paper dealt with
zero-inflated outcome variables were excluded. Only a small selection of those excluded is presented in this
section for reasons of completeness. After screening, 17 citations remained which addressed regression
modelling with predictor covariates with SAZ. A literature search had already been performed in advance
for the DFG project application; therefore, the timeframe from 1980 onwards was selected. Results of this
earlier search were also included. Many additional articles were found unsystematically using the
bibliographies of related papers.

The literature search showed that the terms "spike at zero", "zero inflated" etc., are not only used for

one specific phenomenon. Most of the found records are publications on the analysis of zero inflated

Set  Results  Strategy
13   336      12 AND 1
12   2969     11 OR 10 OR 8 OR 7 OR 5 OR 3
11   28       TOPIC: ("spike at zero")
10   2457     TOPIC: (semicontinuous OR semi-continuous OR "semi continuous") Refined by: WEB OF SCIENCE CATEGORIES: (MATHEMATICS APPLIED OR MATHEMATICS OR MATHEMATICAL COMPUTATIONAL BIOLOGY OR MATHEMATICS INTERDISCIPLINARY APPLICATIONS)
9    7628     TOPIC: (semicontinuous OR semi-continuous OR "semi continuous")
8    1        TOPIC: ("clump at zero")
7    7        TOPIC: ("excess zero") Refined by: WEB OF SCIENCE CATEGORIES: (STATISTICS PROBABILITY OR MATHEMATICAL COMPUTATIONAL BIOLOGY OR MATHEMATICS OR MATHEMATICS INTERDISCIPLINARY APPLICATIONS OR MATHEMATICS APPLIED)
6    31       TOPIC: ("excess zero")
5    72       TOPIC: ("mass at zero") Refined by: WEB OF SCIENCE CATEGORIES: (STATISTICS PROBABILITY OR MATHEMATICAL COMPUTATIONAL BIOLOGY OR MATHEMATICS APPLIED OR MATHEMATICS)
4    119      TOPIC: ("mass at zero")
3    431      TOPIC: ("zero inflated") Refined by: WEB OF SCIENCE CATEGORIES: (STATISTICS PROBABILITY OR MATHEMATICAL COMPUTATIONAL BIOLOGY OR MATHEMATICS INTERDISCIPLINARY APPLICATIONS OR MATHEMATICS APPLIED OR MATHEMATICS)
2    1526     TOPIC: ("zero inflated")
1    702798   TOPIC: (regression)

Table 3.1: Details of the search strategy of literature search performed in Web of Science on 1 Nov 2016.

outcome variables. Therefore, this chapter is divided into two sections. First, an overview on literature

on predictor covariates with SAZ will be given. For reasons of completeness and as methods are closely

related to the covariate analysis, the second section will present research on outcome variables with SAZ.

Recently a special issue on models for continuous data with SAZ was published in the Biometrical Journal

(Boehning and Alfo (2016) [8]). All of the articles addressed outcome variables with SAZ.

The literature described in the following does not claim to be complete. It comprises publications that
are closely related to the methodology investigated in this thesis and that were identified with the combined
systematic and unsystematic search strategy described above.

3.1 Predictor covariates with SAZ

A simple approach to covariates with SAZ which can be found in several analyses is probably categorization

of the respective predictor covariate into separate (two or more) categories. Taking the example of smoking,

one can distinguish between smokers and non-smokers. This simplification leads to a loss of all additional

information measured such as the number of cigarettes smoked per day. Altman and Royston (2006) [4]

already addressed the cost of the dichotomization of continuous variables. In many cases, the information

of the whole covariate could also be of interest. Therefore, a next step could be to separate the positive

observations into different categories. Royston et al. (2006) [52] and Sauerbrei and Royston (2010) [57]

and others found, however, that “loss of information is an important drawback of categorizing continuous

variables, key issue being the number of cutpoints and where to place them” (Sauerbrei and Royston

(2010) [57], p. 2).

There are several methods available that use the full information of a covariate with SAZ. As early as
1955, Aitchison [3] discussed the problem in an econometric setting. Examples of covariates with such a
distribution were found in the analysis of household budgets: expenditure on items such as children's
clothing was zero for a number of households in the sample. Aitchison stated that “the correct procedure
in any analysis is to recognize explicitly this dichotomy of the population into the categories, spender
and non-spender” (Aitchison (1955) [3], p. 901). He proposed to split the distribution of a random variable
x and investigated the mean and standard deviation of the respective distributions. “There is a non-zero
probability θ that x is zero and hence a probability 1 − θ, that x is non-zero; further the distribution
of x conditional on x ≠ 0 is some well-known distribution of a positive variable, either continuous or
discrete.” (Aitchison (1955) [3], p. 902). That means

\[ P(\{X = 0\}) = \theta \]

\[ P(\{X > 0\}) = 1 - \theta \]

define the probabilities of being zero or positive. Additionally, the distribution of the positive
observations (excluding the zeros) is defined by

\[ P(\{X \in (x, x + dx)\} \mid X > 0) = g(x)\,dx, \]

where g(x) is the conditional density function, and thus

\[ P(\{X \in (x, x + dx)\}) = (1 - \theta)\, g(x)\,dx, \quad x > 0. \]

Let $\mu_{>0}$ and $\sigma^2_{>0}$ denote the mean and variance of the distribution g(x) of the positive
observations. The mean $\mu$ and variance $\sigma^2$ of the original covariate X can then be expressed as

\[ \mu = (1 - \theta)\,\mu_{>0} \]

and

\[ \sigma^2 = (1 - \theta)\,\sigma^2_{>0} + \theta (1 - \theta)\,\mu^2_{>0}. \]
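The relation between the moments of the positive part and those of the full covariate can be checked with a small worked example. The Python sketch below uses a hypothetical discrete distribution for the positive values (for instance cigarettes per day among smokers) and an assumed proportion of zeros θ = 0.4; it compares the mixture formulas for the mean and the variance with a direct calculation on the full distribution, using exact fractions.

```python
from fractions import Fraction as F

theta = F(2, 5)                               # assumed P(X = 0)
g = {5: F(1, 2), 10: F(3, 10), 20: F(1, 5)}   # hypothetical distribution of X given X > 0

# Mean and variance of the positive part.
mu_pos = sum(x * p for x, p in g.items())
var_pos = sum((x - mu_pos) ** 2 * p for x, p in g.items())

# Mixture formulas for the full covariate X.
mu_formula = (1 - theta) * mu_pos
var_formula = (1 - theta) * var_pos + theta * (1 - theta) * mu_pos ** 2

# Direct calculation from the full distribution (spike at zero plus positive part).
full = {0: theta}
full.update({x: (1 - theta) * p for x, p in g.items()})
mu_direct = sum(x * p for x, p in full.items())
var_direct = sum((x - mu_direct) ** 2 * p for x, p in full.items())

print("mean:    ", mu_formula, "=", mu_direct)
print("variance:", var_formula, "=", var_direct)
```

Both routes give the same mean and variance, so the second moment relation holds for the variance (not the standard deviation) of X.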

Aitchison [3] proposed efficient estimates for the mean and variance of X. In the econometric setting
used as an example (expenditure on children's clothing etc.), the distinction between zero and non-zero
values is clear and easily observable. His paper focused purely on the description of the distribution; no
regression setting was considered.

The clear distinction between zero and non-zero values is not always as easy in medical settings.
Robertson et al. [50] investigated covariates with SAZ in a case-control study. When analyzing risk factors
for illnesses such as pancreatic cancer, different types of covariates are present. For several exposure
covariates, such as alcohol, cigarette, tea or coffee consumption, observations fall into two categories:
those who are unexposed (non-smokers, non-drinkers etc.) and those who are exposed, for whom several
different doses are observed. For these examples, the zero/non-zero distinction is more difficult.
Considering smoking, are true zeros only observations of never-smokers? Considering alcohol, are true zeros
only people who never drank alcohol? These covariates might also be considered as having some “limit of
detection”. In this case, it is not easy to decide whether a clear distinction into two categories is
reasonable. In the following chapters and in the data analyses performed in this thesis, it will be assumed
that there is a considerable difference between smokers and non-smokers, and between drinkers and
non-drinkers, respectively. Robertson et al. [50] argued in the same way and highlighted the importance of
treating unexposed subjects separately from those who were exposed in the estimation of dose-response
relationships. They proposed to include a binary indicator variable v to separate the effects:

\[ \log(\mathrm{odds}) = \alpha + \beta_1 v + \beta_2 v x, \]

where “β1 represents the effect of the exposure, and the second parameter, β2, represents the effect of
the levels of exposure among those who are exposed” (Robertson et al. (1994) [50], p. 167). Here v is a
dummy variable indicating whether the exposure is zero (v = 0) or not (v = 1). They came to the conclusion
that unexposed subjects “[...] cannot be treated simply as if they are continuous measurements. [...]
Failure to do so will lead to problems in the interpretation of the effect of the exposure variable on the
relative risk of having the disease” (Robertson et al. (1994) [50], p. 169).
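To make the role of the two coefficients concrete, the following Python sketch evaluates the model of Robertson et al. for a few exposure levels. The coefficient values and the function names are hypothetical and chosen only for illustration; they are not estimates from [50].

```python
import math


def log_odds(x: float, alpha: float, beta1: float, beta2: float) -> float:
    """log(odds) = alpha + beta1 * v + beta2 * v * x with v = 1 for exposed subjects."""
    v = 1 if x > 0 else 0  # exposure indicator: 0 = unexposed, 1 = exposed
    return alpha + beta1 * v + beta2 * v * x


def odds_ratio_vs_unexposed(x: float, beta1: float, beta2: float) -> float:
    """Odds ratio of a subject with dose x compared with an unexposed subject."""
    # The intercept cancels in the difference of the two log-odds.
    return math.exp(log_odds(x, 0.0, beta1, beta2) - log_odds(0.0, 0.0, beta1, beta2))


# Hypothetical coefficients (assumptions for illustration only).
ALPHA, BETA1, BETA2 = -2.0, 0.5, 0.03

for dose in (0, 1, 10, 20):
    print(dose, round(log_odds(dose, ALPHA, BETA1, BETA2), 3),
          round(odds_ratio_vs_unexposed(dose, BETA1, BETA2), 3))
```

Under these assumed values, even a very small positive dose carries the jump β1 relative to the unexposed, while β2 describes the additional change per unit of dose among the exposed.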

In a commentary by Sander Greenland and Charles Poole [26], however, it was stated that “deleting
unexposed subjects from the dose-response analysis is not always superior to including them.” (p. 326).
They argue that this “[...] does not necessarily yield more accurate estimates than an analysis that
includes them. When [...] subject matter considerations and statistics do not clearly indicate which
analysis is more accurate, both should be conducted and compared to see whether basic conclusions are
sensitive to inclusion of the unexposed” (Greenland and Poole (1995) [26], p. 328). They state that this
procedure is similar to the analysis of confounder covariates. As already explained for the covariates
smoking and alcohol intake, the decision on the correct way of modeling covariates with a spike at zero
depends strongly on subject matter knowledge.

Jedrychowski et al. [32] considered smoking as a covariate with a spike at zero and included a binary
indicator separating smokers and non-smokers in a logistic regression model analyzing the influence of
smoking on the risk of lung cancer. This was an ad hoc analysis without any theoretical justification.
Besides including the indicator, they worked on a measure combining the lifetime duration of smoking and
the number of cigarettes smoked. This is one possible way of using further information on smoking
history. Both measures can have a spike at zero. The number of zero values in this combined covariate will
differ from that of a covariate measuring only the current dose (number of cigarettes). So far, all
proposed strategies assumed the relationship between the positive part of the data and the outcome to be
linear or constant.

When modeling continuous covariates in regression models, the assumption that the relationship between
the covariate and the outcome is linear might not always be correct. In 2008, Royston and Sauerbrei
[54] extended the univariate fractional polynomial (FP) procedure, a regression technique allowing
non-linear functional relationships (described in section 2.3), to covariates with a spike at zero
(FP-Spike). The extension was based on the approach by Robertson et al. (1994) [50] and consisted of the
inclusion of a binary indicator variable combined with non-linear regression modeling using fractional
polynomials for the observations with positive values only. “The procedure comprises two stages: first, to
determine the best FP function when [v] is included in the model; second, to assess whether [v] or the FP
component can be eliminated without harming the model fit” (Royston et al. (2010) [55], p. 1220). FP-Spike
is illustrated with several examples in Royston et al. (2010) [55]. Furthermore, an extension for
multivariable model building with additional confounder covariates is proposed there. The procedure was
modified in 2012 by Becher et al. [5]. FP estimation can only be performed if the covariate values of all
observations are positive. Therefore, the data are shifted before analysis (e.g. by adding a constant).
Becher et al. [5] avoided this shift of the original data by using only observations with positive values
for the estimation of the functional relationship. This concept is the basis of the methodology developed
in this thesis. Details on its development and changes can be found in chapter 4.

Different terminologies for more or less the same phenomenon might arise from different understandings
and concepts behind covariates with a spike at zero. While Robertson et al. assume that unexposed
subjects are substantially different from exposed subjects, Greenland and Poole state that not considering
unexposed subjects might lead to biased estimates. Confusion might stem from the type of covariates
chosen as examples, as already explained for the dose-response setting. The treatment of zeros due to a
limit of detection might not be comparable to the treatment of covariates with true zero values. A common
strategy for measurements below a limit of detection (LOD) is to replace them with the value zero.
Replacement might be reasonable in some applications and with some further considerations, e.g. the
additional inclusion of a binary indicator. When measurements below the LOD are treated as zeros, one is
often faced with problems similar to those in other spike at zero settings. The main difference, however,
is that these values are not necessarily true zeros. The effect for these observations might therefore not
be substantially different from that for positive observations, and the two types of observations might
not be independent. Another field which is easily confused with covariates with a spike at zero is that of
zero-inflated outcome variables. The types of variables are more or less the same, but the way of analysis
differs considerably. A lot of research has been done on this topic, and some of the problems are similar
to those with predictor covariates with a spike at zero. The next section will give a brief overview.

3.2 Zero-inflated outcome variables

This section describes a different situation in the analysis of variables with a spike at zero. All methods
presented so far dealt with predictor covariates and certain further assumptions. A literature search on
the key words “spike at zero”, “zero inflation”, “mass at zero”, etc., however, mostly leads to results
for zero-inflated outcome variables. Methodologies are different, and research questions are not
comparable to the situation of predictor covariates with a spike at zero. For outcome variables, different
types can be distinguished. Zero-inflated count data can be handled by an extension of the Poisson model.
Continuous outcome variables with a SAZ are of the same type as in the predictor covariate case, i.e.
continuous variables with a large number of zero values arising for different reasons; however, the
research question is different. For these, selected strategies are presented here for reasons of
completeness and to highlight some differences.

Olsen and Schafer (2001) [47] called this type of variable “semi-continuous”, as it is a mixture of 0’s

and continuously distributed positive values. They argue that this type of variable is “different from one

that has been left censored or truncated, because the 0’s are valid self-representing data values, not proxies

for negative or missing responses.”(Olsen and Schafer (2001) [47], p. 730) and that this is the result of

two processes, “one determining whether the response is zero and the other determining the actual level

if it is non-0”(Olsen and Schafer (2001) [47], p. 730). This again is a qualitative assumption based on

subject matter knowledge. They propose to model these types of variables by a pair of regression models

and to split a semi-continuous response into two variables,

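\[ U_{ij} = \begin{cases} 1 & \text{if } Y_{ij} = 0 \\ 0 & \text{if } Y_{ij} > 0 \end{cases} \]

and

\[ V_{ij} = \begin{cases} g(Y_{ij}) & \text{if } Y_{ij} > 0 \\ \text{-- (see } U_{ij}\text{)} & \text{if } Y_{ij} = 0 \end{cases} \]

where g is a monotone increasing function that will make Vij approximately normally distributed.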
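The split of a semi-continuous response into the two variables can be sketched in a few lines of Python. The sketch follows the coding of U and V as written above; the choice g = log, the function name `split_semicontinuous` and the use of `None` for the undefined entries of V are assumptions made for illustration.

```python
import math


def split_semicontinuous(y, g=math.log):
    """Split responses y into a zero indicator u and a transformed positive part v.

    u[i] = 1 if y[i] == 0 and 0 otherwise (coding as in the text above);
    v[i] = g(y[i]) for positive responses and None (undefined) for zeros.
    """
    u = [1 if yi == 0 else 0 for yi in y]
    v = [g(yi) if yi > 0 else None for yi in y]
    return u, v


# Hypothetical responses of one subject over four occasions.
y = [0.0, 1.0, 0.0, math.e]
u, v = split_semicontinuous(y)
print(u)
print(v)
```

The indicator u would then be analyzed with a binary regression model and the defined entries of v with a model for continuous data, which is the pair of regression models mentioned above.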

Kipnis et al. [36] called the phenomenon excess zeros. The term excess implies something additional.

This is due to the fact that their data contains dietary data which is based on self-report, which normally

“fail to measure true usual intake precisely”(Kipnis et al. (2009) [36] , p.2). Their aim was to develop

a method of predicting an individual’s intake of an episodically consumed food and its relationship to a

health outcome. When investigating an instrument documenting nutritional intake over 24 hours, a lot

of zero values are obtained as not every food is consumed every day. This effect, the occurrence of extra

zeros, is rather due to a measurement problem than to the fact that the specific person does not consume

this type of nutrition. “Probably the most important factor that determines the impact of food frequency

questionnaires is the overall probability to consume the food on a given day”(Kipnis et al. (2009) [36],

p.7). They proposed a two-part model as an extension of the Olsen and Schafer model in which “the

first part specifies the probability of the point mass at zero, and the second part conditionally models

the continuous variable given that it is positive” (Kipnis et al. (2009) [36], p.7). Zhang et al. (2011) [66]

also dealt with data of a 24-hour dietary recall with multivariate data. They proposed a survey-weighted

MCMC computation strategy to fit such a model, which is a generalization of the model introduced by

Kipnis et al. (2009) [36].

Lachenbruch (2001) [38] also used the terminology excess zeros. He studied data on hospitalization
costs in a health insurance plan. Most insurance members will have no hospitalization costs in a given
year; therefore the analysis will be driven by zero versus non-zero costs. He also proposed two-part
models. “Two part models are a mixture of a discrete point-mass variable (with all mass at zero) and a
continuous random variable” (Lachenbruch (2001) [38], p. 297). Farcomeni (2014) [21] examined the situation
for multivariate ecological data with a spike at zero. He extended the univariate approach proposed by
Lachenbruch (2001) [38] to multivariate outcomes with several point masses at zero. Hallstrom (2010) [27]
developed a modified Wilcoxon test for non-negative distributions with a clump of zeros as a one-part
model and compared it to Lachenbruch's approach.

There are also some investigations in the field of molecular data. A special characteristic of data from

molecular biology is the frequent occurrence of zero intensity values which can arise either by true absence

of a compound or by a signal that is below a technical limit of detection. In this setting, a lot of further

challenges are present when modeling data. In most cases, a very high number of covariates is present

(p >> n). Standard regression techniques are often not applicable.

Nia and Ghannad-Rezaie (2005) [48] took a Bayesian perspective on metabolic data with spike at zero.

Here measurements also “consist of several continuous measurements of subjects or tissues over multiple

attributes or metabolites [... This] requires grouping subjects and attributes [...] for data modeling”

([48],p.1). Zero values occur due to a detection limit. They propose a spike-and-slab model with tractable

marginals for metabolic data with the aim of clustering which can then be visualized in e.g. a dendrogram

tree.

Another possible research question is the detection of differential expression in omics experiments. For
two or more groups of observations, the log fold change (LFC, the logarithmic fold change in expression
between groups) is compared. The methods described above can be divided into “two-part test
[(Lachenbruch (2001) [38])] [to] compare mixture distributions between groups and one-part tests
[(Zhang et al. (2009) [65], Hallstrom (2010) [27])] [which] treat the zero-inflated distributions as
left-censored” (Gleiss et al. (2015) [23], p. 2310). A new model, the left-inflated mixture model proposed
by Gleiss et al. (2015) [23], combines these two approaches. They also compared the types of models just
mentioned and came to the conclusion that it is important to choose the “right” test statistic. The
difference in the results is, as they state, a “direct consequence of their respective construction”
(Gleiss et al. (2015) [23], p. 2315). Similar to the situation of predictor covariates, observations with
value zero are the result of different phenomena. In their setting, they describe them as either technical
(LOD) or biological (true zero values).

In one-part tests, e.g. Wilcoxon’s rank-sum test or an adapted t-test, the null hypothesis is stated for
the full-distribution LFC (including zero values). Often zero values are replaced or imputed using
strategies for covariates with limits of detection. Two-part tests test a composite null hypothesis on the
subdistribution LFC (positive values) and the difference in zero proportions. An example is to use a
continuity-corrected version of Pearson’s chi-square test to compare the zero proportions and a Wilcoxon
test or t-test to compare the continuous parts. Gleiss et al. combined both approaches assuming a
(left-censored) normal distribution and proposed a left-inflated mixture likelihood ratio test of the null
hypothesis of equal means in the continuous distribution (µx = µy, where x and y are two groups of
observations) and equal zero proportions (px = py, the proportion of observations with value zero) in both
groups. The same detection limit λ is assumed in both groups.

In their simulation results, they found that one-part tests give acceptable results if the proportion of
zeros is equal in both groups. Two-part tests generally showed higher true positive rates, and the bias of
the estimates was relatively small, depending on the proportion of zero values. They state that in many
cases it might be difficult to judge whether the observed zero values are a result of technical or
biological reasons. The interpretation of the types of models described above differs. One-part models
assume left-censoring, and differences between groups are calculated on the full distribution. Two-part
models give two separate results: they compare the proportion of zero values and then the difference
between the groups for positive values only. Thus, it is not possible to differentiate between the
different “types” of zeros. The combined test procedure by Gleiss et al. (2015) [23] can treat technical
and biological zeros differently.

The special issue on models for continuous data with SAZ of the Biometrical Journal appeared in 2016.

Boehning and Alfo (2016) [8] give a brief overview of the contents in their editorial. None of the articles

addressed predictor covariates with SAZ. However, one can see, that the fields in which spike at zero

situations occur are even more diverse than described so far. Techniques e.g. for clustering and other

research questions are presented. All these, however, will not be addressed in this thesis.

Next steps - scope of this thesis This chapter gave a cross sectional overview of methods proposed

for both predictor and outcome variables with a SAZ. The thesis will focus on predictor covariates with a

SAZ under the assumption of a substantial difference between observations with value zero and positive

observations.

The next chapter will describe the basis for all proposed methodologies. The approach for modeling

continuous covariates with a spike at zero by Royston and Sauerbrei (2008) [54], further developed by

Royston et al. (2010) [55] and Becher et al. (2012)[5] will be presented and extended.

4 One covariate with a spike at zero

4.1 Initial problem

When building linear regression models, many assumptions are made about the characteristics of the data
used for the estimation of the model parameters. One strong assumption is that the relationship between
the covariate and the outcome variable of interest is linear, which might be correct in many situations.
The probability distribution of the predictor variable x has a major influence on the precision of the
estimates. If a covariate has a large number of observations with value zero, the effect estimate of this
continuous covariate can be biased, as already described in chapter 3. This specific distribution of the
covariate can contain more information than the linear regression model is able to capture. Observations
with value zero might be substantially different from those with a positive value, e.g. because they lack
a certain characteristic completely, whereas for the positive observations this characteristic can be
quantified further. In this case, it might be reasonable to separate the effect estimates of the two types
of observations: the zeros and the positive ones. Building two separate regression models, however,
reduces the sample size in the respective models and might lead to a loss of power.

In addition, the assumption of a linear functional relationship might be violated. Multivariable
fractional polynomials (MFP), presented in section 2, are able to model nonlinear functional relationships.
This section will demonstrate the effects of the inclusion of a binary indicator combined with nonlinear
regression modeling using fractional polynomials in situations with one covariate with SAZ.

4.1.1 One SAZ covariate - theoretical derivation of the influence of the binary indicator

As already described in chapter 3, Robertson et al. [50] illustrated the problem of covariates with SAZ in
a case-control setting. For a covariate x with SAZ, they included a binary indicator variable v separating
zero and positive observations and considered odds ratios for exposed vs. unexposed subjects. Here, the
coding v = 1 if x = 0 and v = 0 otherwise is used. The binary indicator can also be seen as an auxiliary
variable that allows non-continuous functional relationships to be modeled (or semi-continuous ones, as
the only possible “jump” is between zero and the positive values). In this section, a theoretical
justification and explanation of this phenomenon for a continuous response in normal-errors models is
given in the following theorem.

Theorem. Including a binary variable v, indicating whether an observation is zero or not, separates the
effect of the observations with value zero from that of the continuous (positive) observations by
splitting the regression into two regression coefficient estimators. In

\[ E[y \mid (v, x)] = \beta_1 v + \beta_2 x, \]

β1 is estimated using the observations with x = 0, and β2 is estimated using the observations with x > 0.

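Proof. Preliminaries and conditions: the design matrix is

\[ X = \begin{pmatrix} v_1 & x_1 \\ \vdots & \vdots \\ v_n & x_n \end{pmatrix}, \qquad v_i \in \{0, 1\}, \quad x_i \in \mathbb{R}_0^+, \]

with

\[ v_i = \begin{cases} 0 & \text{if } x_i \neq 0 \\ 1 & \text{if } x_i = 0, \end{cases} \]

and the vectors $v = (v_1, \dots, v_n)^T$ and $x = (x_1, \dots, x_n)^T$ are orthogonal, i.e. their
cross-product (inner product) equals 0:

\[ v^T x = \sum_{i=1}^n v_i x_i = 0. \]

In linear regression, $\beta = (X^T X)^{-1} X^T Y$. For X:

\[ X^T X = \begin{pmatrix} \sum_{i=1}^n v_i^2 & \sum_{i=1}^n v_i x_i \\ \sum_{i=1}^n v_i x_i & \sum_{i=1}^n x_i^2 \end{pmatrix}
= \begin{pmatrix} \sum_{i=1}^n v_i^2 & 0 \\ 0 & \sum_{i=1}^n x_i^2 \end{pmatrix} = C. \]

For diagonal matrices it holds that

\[ C^{-1} = \begin{pmatrix} \frac{1}{\sum_{i=1}^n v_i^2} & 0 \\ 0 & \frac{1}{\sum_{i=1}^n x_i^2} \end{pmatrix}. \]

Then

\[ \begin{pmatrix} \beta_1 \\ \beta_2 \end{pmatrix}
= C^{-1} \begin{pmatrix} v_1 & \cdots & v_n \\ x_1 & \cdots & x_n \end{pmatrix} \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}
= \begin{pmatrix} \frac{v_1}{\sum_{i=1}^n v_i^2} & \cdots & \frac{v_n}{\sum_{i=1}^n v_i^2} \\ \frac{x_1}{\sum_{i=1}^n x_i^2} & \cdots & \frac{x_n}{\sum_{i=1}^n x_i^2} \end{pmatrix} \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}
= \begin{pmatrix} \sum_{k=1}^n \frac{v_k}{\sum_{i=1}^n v_i^2}\, y_k \\ \sum_{k=1}^n \frac{x_k}{\sum_{i=1}^n x_i^2}\, y_k \end{pmatrix}. \]

For every k with $x_k = 0$, the outcome $y_k$ does not influence the estimate of β2, as its weight
$x_k / \sum_{i=1}^n x_i^2$ equals 0. The outcome values of these observations enter only the estimate of
β1, because the weight $v_k / \sum_{i=1}^n v_i^2 \neq 0$ for all k with $x_k = 0$, whereas $v_k = 0$ for
all k with $x_k > 0$. Since $v_i^2 = v_i$, the estimate of β1 is the mean outcome of the observations
with value zero.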
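The statement of the theorem can be checked numerically. The Python sketch below solves the 2×2 normal equations of the model without intercept for a small hypothetical data set (the data values and the function name `ols_two_covariates` are assumptions) and then changes the outcomes of the zero observations only, to see which coefficient reacts.

```python
def ols_two_covariates(v, x, y):
    """Least squares estimates (b1, b2) for y = b1*v + b2*x without intercept,
    obtained by solving the 2x2 normal equations directly."""
    svv = sum(a * a for a in v)
    sxx = sum(a * a for a in x)
    svx = sum(a * b for a, b in zip(v, x))
    svy = sum(a * b for a, b in zip(v, y))
    sxy = sum(a * b for a, b in zip(x, y))
    det = svv * sxx - svx * svx
    b1 = (sxx * svy - svx * sxy) / det
    b2 = (svv * sxy - svx * svy) / det
    return b1, b2


# Hypothetical data: three observations with value zero, three positive observations.
x = [0, 0, 0, 2, 4, 5]
y = [3.0, 5.0, 4.0, 1.0, 2.0, 6.0]
v = [1 if xi == 0 else 0 for xi in x]  # indicator: 1 if x == 0

b1, b2 = ols_two_covariates(v, x, y)

# Change the outcomes of the zero observations only and refit.
y_changed = [10.0, 20.0, 30.0] + y[3:]
b1_changed, b2_changed = ols_two_covariates(v, x, y_changed)

print("original:", b1, b2)
print("changed: ", b1_changed, b2_changed)
```

In this example the estimate for the indicator equals the mean outcome of the zero observations, and changing those outcomes moves only that estimate, while the slope for x stays the same.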

4.1.2 Interpretation of the coefficient of the binary indicator

Inclusion of a binary indicator variable leads to a separation of the effect of zero and non-zero values of

the covariate x. The estimated regression coefficient of the covariate x is then only dependent on the

outcome values (yi) of the positive observations. This can be seen in the calculated regression coefficient.

The regression coefficient of the binary indicator is calculated using the outcome values of observations

with value zero in the covariate x only.

Robertson et al. showed this separation for odds ratios in logistic regression. The proof above is a first “theoretical” justification for coefficients in linear regression. This simple proof does not yet consider an intercept. In standard linear regression it is assumed that

y = βx + α

where α defines the intercept and thus y = α if x = 0. If the effect of the zero values is, as assumed, different from the linear relationship defined for all other observations and the true underlying relationship is in fact not continuous, inclusion of a binary indicator solves this problem as shown above.

$$\begin{aligned} y &= \beta_1 v + \beta_2 x + \alpha && (4.1) \\ &= \beta_2 x + \alpha + \beta_1 \mathbf{1}_{\{x = 0\}} && (4.2) \end{aligned}$$

In that case y = α + β1 if x = 0. The inclusion of the indicator allows non-continuous relationships to be modeled. One has to keep in mind that strong distributional and interpretative assumptions were made for these calculations. These have to be checked before using these strategies for analyses. The next sections will describe the procedure proposed by Royston and Sauerbrei (2008) [54], and different interpretative situations will be assessed to get an impression of the situations in which the inclusion of a binary indicator might be reasonable.
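The same separation can be illustrated for the model with intercept. The sketch below, again on hypothetical toy values, builds the coefficients from two separate pieces (a simple linear regression on the positive observations, and the mean outcome in the spike) and then checks that they satisfy the normal equations of the full model y = α + β1 v + β2 x, i.e. that the residuals are orthogonal to the constant, to v and to x. It is meant as a plausibility check under these assumptions, not as a general proof.

```python
# Hypothetical toy data with a spike at zero.
x = [0.0, 0.0, 0.0, 1.0, 2.0, 4.0, 5.0]
y = [6.0, 8.0, 7.0, 3.0, 4.5, 9.0, 10.5]
v = [1 if xi == 0 else 0 for xi in x]

# Simple linear regression (with intercept) on the positive observations only.
pos = [(xi, yi) for xi, yi in zip(x, y) if xi > 0]
mx = sum(xi for xi, _ in pos) / len(pos)
my = sum(yi for _, yi in pos) / len(pos)
beta2 = (sum((xi - mx) * (yi - my) for xi, yi in pos)
         / sum((xi - mx) ** 2 for xi, _ in pos))
alpha = my - beta2 * mx

# Indicator coefficient: mean outcome in the spike minus the intercept,
# so that the fitted value at x = 0 is alpha + beta1.
zeros = [yi for xi, yi in zip(x, y) if xi == 0]
beta1 = sum(zeros) / len(zeros) - alpha

# Residuals of the full model and the three normal equations.
res = [yi - (alpha + beta1 * vi + beta2 * xi) for xi, yi, vi in zip(x, y, v)]
normal_eqs = (
    sum(res),                                  # orthogonal to the constant
    sum(r * vi for r, vi in zip(res, v)),      # orthogonal to v
    sum(r * xi for r, xi in zip(res, x)),      # orthogonal to x
)
print("alpha, beta1, beta2 =", alpha, beta1, beta2)
print("normal equations (all approximately 0):", normal_eqs)
```

Since all three sums vanish, the least squares fit of the full model has the slope and intercept of the regression on the positive values, and α + β1 equals the mean outcome of the zero observations.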

4.2 Modeling one covariate with a spike at zero

An extension of the FP-procedure, described in section 2.3, to handle one SAZ covariate was proposed in

Royston and Sauerbrei (2008) [54]. They proposed a two stage procedure, in which in the first stage the

usual FP procedure (as described in section 2.3) with an additional binary indicator is performed. In a

second stage, the indicator and the continuous covariate are tested for removal. In Royston et al (2010)

[55], this method was further explained and illustrated with examples. Becher et al (2012) [5] modified

the procedure to avoid the small shift of the original data in order to obtain only positive values, and

instead used the power transformation done by the FPs only on positive observations and not on zero

values. All three papers describe the development of the procedure for one SAZ covariate. The method

proposed in the latter paper will be used as a basis for the extensions which will be described in chapter

5. The procedure for one covariate with a SAZ will be described in detail in the following.



Compared to the usual FP procedure, it adds a binary indicator V to all models of the FP class. In

the first stage it estimates the effect of the binary indicator and determines the best FP function for the

positive values of X. In the second stage, both the continuous part X and the binary indicator V are

tested for removal. Both will remain in the model only if their elimination would significantly decrease

the fit, otherwise the one with the larger p-value will be eliminated. In the context of a generalized linear

model the full model is

$$\varphi(x, \beta) = \beta_0 v + (\beta_1 f(x) + \beta_{cons})(1 - v)$$

where ϕ represents the link function of the respective type of regression. The FP procedure then determines the function f(x). The notation is as follows: for each non-negative variable x with a spike, a binary indicator v is generated indicating whether x is zero or positive. The detailed procedure below extends the description of the standard procedure in Royston and Sauerbrei (2008) [54], with the respective changes due to the inclusion of a binary indicator.
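The full model above can be sketched in a few lines. The example is restricted to a first-degree FP for brevity and assumes the conventional FP power set {-2, -1, -0.5, 0, 0.5, 1, 2, 3}, where power 0 denotes the natural logarithm. The function names and the coefficient values are hypothetical and serve only to show that the transformation is applied to positive values while zeros are handled by the indicator.

```python
import math

# Conventional FP power set (assumed here); power 0 stands for log(x).
FP_POWERS = (-2, -1, -0.5, 0, 0.5, 1, 2, 3)


def fp1_spike(x, p):
    """First-degree FP transformation applied to positive values only.

    Zero values are not transformed; they are returned as 0 so that the
    continuous part contributes nothing for observations in the spike.
    """
    if x < 0:
        raise ValueError("x must be non-negative")
    if x == 0:
        return 0.0
    return math.log(x) if p == 0 else x ** p


def linear_predictor(x, p, b0, b1, bcons):
    """b0*v + (b1*f(x) + bcons)*(1 - v), with v = 1 if x == 0 and 0 otherwise."""
    v = 1 if x == 0 else 0
    return b0 * v + (b1 * fp1_spike(x, p) + bcons) * (1 - v)


# Hypothetical coefficients, chosen only for illustration.
print(linear_predictor(0.0, -0.5, b0=0.7, b1=1.1, bcons=0.2))  # spike: b0 only
print(linear_predictor(4.0, -0.5, b0=0.7, b1=1.1, bcons=0.2))  # 1.1 * 4**-0.5 + 0.2
```

For x = 0 only β0 contributes, which is the point estimate for the spike; for x > 0 the predictor follows the FP function of the positive part.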

0. The maximum permitted complexity for the continuous part of the class of FP functions is chosen, e.g. FP2 or FP1. The suggested default is FP2. The procedure is now explained for FP2. Corresponding versions for higher (FP3 etc.) or lower FP classes are obvious.

1. All FP transformations are only applied to positive values. The transformation will be called FPiSpike, where i is the degree of the FP transformation.

2. To select the ‘best’ function, a nominal p-value α is chosen. A typical value is α = 0.05. Taking α = 1 selects the function with the lowest deviance from the class of the most complex permitted FP functions (no function selection procedure).

3. Function selection procedure for variables with a spike (FSP-Spike)

3.1 All transformations for (x, v) are fitted. The best (FP2Spike(x) + v) =: FP2* and (FP1Spike(x) + v) =: FP1* are chosen according to the deviance criterion. Correspondingly, (x + v) =: Lin*.

3.2 In the first stage, FP2* is compared to the null model on five d.f. If the likelihood ratio test is not significant, the covariate is considered to have no influence and the algorithm stops. Otherwise, FP2* is compared to Lin* (3 d.f.). If the test is not significant, the algorithm stops, choosing Lin*. Otherwise, FP2* is tested against FP1*. If the test is not significant, the algorithm stops, choosing FP1*. Otherwise FP2* will be the final function (FP*) in the first stage.

3.3 In the second stage, the two components of FP* are each tested for removal from the model. If both parts are significant, then the final model includes both; if one or both parts are non-significant, then the one with the larger p-value is removed. In the latter case, the final model comprises either the binary dummy variable v or the selected FP function (FP2Spike(x), FP1Spike(x) or x). If only an FP function is selected, then the spike at zero plays no specific part.

At the end of the first stage of the procedure, the full model including both v and x has been fitted. Both covariates are forced into the model; no variable selection method is applied. For x, the FP function selection procedure has selected an appropriate functional form. The second stage, described in 3.3, allows the model to be simplified, as both covariates v and x are tested for whether their inclusion leads to a significant improvement of the model.

4.3 When is a binary indicator required?

To illustrate the theoretical results of section 4.1.1 and to compare analyses using standard linear regression and the proposed FP-Spike procedure, example datasets were generated.

| | Values |
|---|---|
| True functions | y1a = 1.5 ∗ v1 + 2 ∗ x1 |
| | y1b = 20 ∗ v1 + 2 ∗ x1 |
| | y1c = 50 ∗ v1 + 0.5 ∗ x1² − 10 |
| No. of observations | 100 |
| Proportion of zero values | 30% |
| Distribution | normal |
| Mean µ | 7 |
| Standard deviation | 4.5 |

Table 4.1: Characteristics of simulated data (datasets 1). y1a has a small effect for the indicator, y1b a large effect. y1c has a nonlinear functional relationship.

Description of datasets Datasets were generated in STATA 13 using rnormal, a function drawing normally distributed data with mean µ and standard deviation σ. Details can be found in table 4.1. Covariate x ∼ N(7, 4.5) was simulated. The proportion of observations with x = 0 is 30%. In dataset 1a, there is a linear relationship between the covariate x1 and the continuous outcome y1a and a small effect of the indicator v1. In dataset 1b, there is the same linear relationship between x1 and the continuous outcome y1b and a large effect of the indicator v1. In dataset 1c, the effect of the indicator v1 is even larger. Furthermore, the relationship between the continuous part of the covariate x1 and the outcome y1c is nonlinear.



Aim and methods The aim of these analyses is the comparison of model selection in the two datasets 1a and 1b using standard linear regression, a linear regression model which is estimated only on the positive observations of the dataset (Linear>0), and the method proposed by Robertson et al [50] (Linear+indicator). Dataset 1c is additionally analyzed using standard FP and FP-Spike. Thus, the datasets are analyzed with all methods. As measures for comparison, R2, MSE and MSE in categories are calculated for each method for every dataset. Details about the measures can be found in section 2.5.

| Method | 1a: R2 | 1a: MSE | 1a: MSE_0 | 1a: MSE_(>0) | 1b: R2 | 1b: MSE | 1b: MSE_0 | 1b: MSE_(>0) |
|---|---|---|---|---|---|---|---|---|
| Linear regression | 0.80 | 14.92 | 15.69 | 14.59 | 0.06 | 58.54 | 69.16 | 53.99 |
| Linear>0¹ | 0.78 | 14.23 | | 14.23 | 0.86 | 9.43 | | 9.43 |
| Linear+indicator | 0.80 | 14.53 | 15.23 | 14.23 | 0.84 | 10.24 | 12.13 | 9.43 |

Table 4.2: Comparison of univariate methods for true functional relationship y1a = 1.5 ∗ v1 + 2 ∗ x1, y1b = 20 ∗ v1 + 2 ∗ x1
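A comparison of this kind can be sketched as follows. The sketch is not a reproduction of the STATA analysis: the way the zeros are generated (30 fixed zeros plus 70 normal draws that are redrawn until positive), the error term (additive N(0, 4) noise), the seed, and the definition of the MSE in categories as the plain mean of squared residuals within x = 0 and x > 0 are all assumptions made for illustration. The helper names are hypothetical.

```python
import random

rng = random.Random(1)  # arbitrary seed; results will not match table 4.2


def positive_normal(mu=7.0, sd=4.5):
    """Assumption: draws from N(mu, sd) are repeated until a positive value occurs."""
    while True:
        d = rng.gauss(mu, sd)
        if d > 0:
            return d


x = [0.0] * 30 + [positive_normal() for _ in range(70)]   # 30% zeros (assumed design)
v = [1.0 if xi == 0 else 0.0 for xi in x]
# True function of y1b plus assumed N(0, 4) noise.
y = [20 * vi + 2 * xi + rng.gauss(0, 4) for xi, vi in zip(x, v)]


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def ols(columns, y):
    """Least squares via the normal equations and Gauss-Jordan elimination."""
    k = len(columns)
    a = [[dot(columns[i], columns[j]) for j in range(k)] + [dot(columns[i], y)]
         for i in range(k)]
    for i in range(k):
        piv = max(range(i, k), key=lambda r: abs(a[r][i]))
        a[i], a[piv] = a[piv], a[i]
        pivot = a[i][i]
        a[i] = [e / pivot for e in a[i]]
        for r in range(k):
            if r != i:
                factor = a[r][i]
                a[r] = [er - factor * ei for er, ei in zip(a[r], a[i])]
    return [row[k] for row in a]


def mse(pred, y, keep):
    """Mean squared error over the observations flagged in `keep`."""
    sq = [(p - yi) ** 2 for p, yi, k in zip(pred, y, keep) if k]
    return sum(sq) / len(sq)


ones = [1.0] * len(x)
is_zero = [xi == 0 for xi in x]
is_pos = [xi > 0 for xi in x]

# Standard linear regression on all observations.
a_lin, b_lin = ols([ones, x], y)
pred_lin = [a_lin + b_lin * xi for xi in x]

# Linear>0: fitted on the positive observations only.
xp = [xi for xi in x if xi > 0]
yp = [yi for xi, yi in zip(x, y) if xi > 0]
a_pos, b_pos = ols([[1.0] * len(xp), xp], yp)
pred_pos = [a_pos + b_pos * xi for xi in x]

# Linear+indicator: intercept, indicator v and x.
a_ind, b_v, b_x = ols([ones, v, x], y)
pred_ind = [a_ind + b_v * vi + b_x * xi for xi, vi in zip(x, v)]

for name, pred in [("Linear regression", pred_lin), ("Linear+indicator", pred_ind)]:
    print(name, "MSE_0:", round(mse(pred, y, is_zero), 2),
          "MSE_(>0):", round(mse(pred, y, is_pos), 2))
print("Linear>0 MSE_(>0):", round(mse(pred_pos, y, is_pos), 2))
```

Under these assumptions, the MSE for x > 0 coincides for Linear>0 and Linear+indicator, which mirrors the identical MSE_(>0) entries of the two methods in table 4.2.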

Results Figure 4.1 shows a scatter plot for datasets 1a and 1b. The cardinality of the set of observations P = {(xi, yi)} ∀i with xi > 0 is the same in both datasets. Furthermore, observations in P1a and P1b follow the same distribution. The cardinality of the set of observations V = {(xi, yi)} ∀i with xi = 0 is also the same. Only the values of yi for observations with xi = 0 differ between the two datasets, as the true effect of v1 on y is much larger in dataset 1b.

The left graph in figure 4.1 a shows a scatter plot of data for which the effect of the binary indicator is small and the linearity assumption might still be reasonable. The dashed line shows the result of standard linear regression. For the estimation of the dotted line, only the positive values of x were modeled. The true effect of the binary indicator β0 is 1.5 and the true β1 is 2. The true intercept is zero. The graph on the right shows data with the same distribution for x > 0, but the effect of the binary indicator is much higher. The positive observations can still be described with a linear relationship to the outcome. However, for the whole range of observations (including the observations with value zero), this assumption is certainly incorrect and the model fit is very bad, both for X = 0 and for X > 0. The dashed line, which is the result of a standard linear regression, estimates a single coefficient for x over the whole range of observations and thus fails to detect that the relationship between the covariate and the outcome variable is not continuous in this case. Including the binary indicator variable leads to a separation of the two types of observations. The estimated coefficient of the positive values is now only dependent on their respective values in the outcome. In standard linear regression, the estimated coefficients for covariates are calculated using information of the outcome variable of all observed data.

[Figure 4.1, “Toy Example: Visualization of problem”. Panel a: y_1a (left) and y_1b (right) plotted against x, with the fitted linear model and the linear model for x>0. Panels b and c: y plotted against x, with the fitted linear model, the linear model for x>0 and the fractional polynomial.]

Figure 4.1: a. Example of the influence on regression of the distribution of the outcome y for observations with x-value 0. On the left, data of y1a (defined in table 4.2), for which a general linearity assumption might still hold, is plotted with the respective regression function for all values of x (dashed) and for x > 0 (dotted). On the right (y1b), in the true underlying relationship the effect of the zero observations is much higher. b. Analyses of y1c with different methods. c. Enlargement of the area close to zero of graph b.


That means that observations with x-value zero are considered and included in an overall effect estimate for the covariate.

The impact on the model fit, given as R2 and the mean squared error (MSE), can be seen in table 4.2. If there is a large effect of the binary indicator, ignoring it leads to a very poor fit. R2 for linear regression in dataset 1b is only 0.06. The linear regression model tries to find an overall average estimate and thus fails to fit a suitable model. The inclusion of a binary variable can solve this problem. This can also be seen in the residual plots in figure 4.2. Here, the predicted residuals of the fitted models are plotted. Additionally, a lowess (locally weighted scatter plot smoothing) smoother is included. For the dataset with a small effect of the binary indicator, the residuals are randomly distributed in the plot. With a large effect of the binary indicator variable, as in dataset 1b, a strong pattern can be found in the residuals, which gives a hint that the model fit might not be appropriate. This changes if only the observations greater than zero are modeled. Here, in both datasets 1a and 1b, the residual distribution seems random. The same is found when including a binary indicator. That means that if in reality the effect of the observations with value zero differs substantially from that of the continuous observations, inclusion of a binary indicator will separate the two effect estimates and lead to more reasonable results.

In dataset 1c, the true functional relationship between x1 and the outcome y1c was quadratic. Figure 4.1b shows the fitted functions and the sampled data for linear regression, linear regression > 0 (which is comparable to linear with binary indicator) and a standard FP. The observations close to zero are plotted in a separate graph in figure 4.1c to highlight the differences in this area. Table 4.3 gives details about the R2 and MSE values of the plotted functions and furthermore of FP-Spike. If there is a large effect of the binary indicator, standard linear regression is not able to model the relationship appropriately. Inclusion of a binary indicator leads to a considerable improvement of the fit. This example shows, however, that standard FPs without a spike at zero already produce very good results in this specific situation. Adding a small constant solves the problem of not being able to handle zero values. The main difference between standard FP and FP-Spike is in that case an interpretative one. FP-Spike provides a point estimate for the zero values. Standard FPs try to model the effect with a function with a very steep slope close to zero. The overall error of the two methods is therefore similar. For the zero values it is even identical in this example.

This theoretical investigation used a highly specified data setting in order to be able to illustrate the method. One could see that inclusion of binary indicators leads to improved model fit only for large effects of the binary indicator. In real datasets, especially in medical applications, this difference in outcome between the zero and positive observations might not be so extreme. Two data examples, one with a continuous and one with a survival endpoint, will now be used to illustrate the procedure.

[Figure 4.2: three rows of residual-versus-fitted plots (residuals against fitted values of y1a and y1b): linear regression; linear regression if x>0; linear regression with binary indicator.]

Figure 4.2: Residual-versus-fitted plots for y1a (small effect of dummy) and y1b (large effect of dummy) with linear regression, linear regression of only the positive values, and linear regression with a binary indicator

| Method | R2 | MSE | MSE_0 | MSE_(>0) |
|---|---|---|---|---|
| Linear regression | 0.05 | 571.47 | 612.06 | 554.07 |
| Linear>0 | 0.79 | 136.56 | - | 136.56 |
| Linear+indicator | 0.80 | 118.93 | 77.79 | 136.56 |
| FP | 0.86 | 83.79 | 77.79 | 86.35 |
| FP-Spike | 0.86 | 83.78 | 77.79 | 86.34 |

Table 4.3: Comparison of univariate methods for true functional relationship y1c = 50 ∗ v1 + 0.5 ∗ x1² − 10 (seed 20202)

4.4 Case Study I - Ozone data

4.4.1 Data and methods

In a prospective cross-sectional study, the influence of ozone on lung function in school children was assessed. From spring 1996 to fall 1999, a cohort of initially 1101 primary school children from six cities in southern Germany (Baden-Württemberg) was recruited. The overall dropout rate was relatively high at 16.9%. The primary analysis investigated the longitudinal effect of ozone exposure on lung function. Details on the study design and primary and secondary objectives can be found in Kühr et al (2001) [37] and Ulmer (1997) [64].

One of the measured covariates in this study was “time spent outside 48 hours before measurement of lung function”. This covariate had a spike at zero. Due to the high dropout rate, the first visit is selected for a cross-sectional analysis of the influence of this covariate on the children’s forced vital capacity (FVC), a measure of lung function. In order to illustrate the method described before, univariate analyses will be performed using linear regression, standard fractional polynomials (FP), the method proposed by Robertson et al [50] and FP-Spike. The height and/or the weight of the children are predictive factors for the forced vital capacity. Thus, analyses should be adjusted for these covariates. For reasons of easier comparison this is not done here. The overall behaviour, however, stays the same in the adjusted analysis. Small differences in the estimated coefficients can be observed.

Figure 4.3 a. shows the distribution of the time spent outside 48 hours before the lung function measurement at the first measurement time point of the study. One can see that a large proportion of children did not spend time outside in this time frame (30.6%). Comparing the mean FVC in children who did not spend time outside (M: 1.87, SD: 0.31) to that of all other children (M: 1.96, SD: 0.32), one can observe that it is slightly lower for children who did not spend time outside.


First stage¹

| Model | Deviance | Dev.Diff. | d.f. | P | Power |
|---|---|---|---|---|---|
| FP2+v | 599.9 | | | | -2, -2 |
| Null | | 9.4 | | 0.05 | |
| Linear+v | | 0.2 | 3 | 1.0 | 1 |

Second stage²

| Model | Deviance | Dev.Diff. | P |
|---|---|---|---|
| Linear + v | 599.9 | | |
| Dropping v | | 9.4 | 0.002 |
| Dropping Linear | | 0.2 | 0.608 |

Table 4.4: Ozone Study: Results and details of model selection with univariate FP-Spike in the analysis of the influence of “time spent outside” on the forced vital capacity.
¹ In the first stage of the analysis, the full model including both the indicator v and the covariate x is estimated. The selected model in this first stage is then taken to the second stage.
² In the second stage both v and x are tested for removal.

4.4.2 Results

Complete case analyses were performed in the population of individuals with FVC and “time spent outside” measurements at the first visit (n=1071). The detailed steps of function selection using FP-Spike are found in table 4.4. In the first stage, the best FP function including the indicator v is selected. At a significance level of α = 0.05, FP2+v is significantly better than the null model. However, compared to a linear function with indicator v, FP2+v is not significantly better. Therefore, the algorithm stops and in the second stage both parts of the selected model are tested for removal. Dropping the linear part of the model does not significantly worsen the fit of the model (p=0.608), thus it will not be kept in the model. The final model selected by FP-Spike only contains the indicator v. The estimated coefficient for the covariate “time spent outside” was close to zero. Therefore this seems reasonable.
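The second-stage decision described above can be written down as a small rule. The following sketch takes the two p-values for dropping v and dropping the FP part as given (here those reported in table 4.4) and returns the components that remain. The function name and the return format are hypothetical, and the handling of exactly equal p-values is an arbitrary choice of this sketch, since that case is not specified by the procedure.

```python
def second_stage(p_drop_v, p_drop_fp, alpha=0.05):
    """Return the components kept after testing v and the FP part for removal.

    Both parts stay if both removal tests are significant; otherwise the
    part with the larger p-value is removed and the other one is kept.
    """
    if p_drop_v < alpha and p_drop_fp < alpha:
        return ("v", "fp")
    # At least one part is non-significant: remove the one with the larger p-value.
    # Assumption of this sketch: on a tie, the FP part is removed.
    return ("fp",) if p_drop_v > p_drop_fp else ("v",)


# Ozone data, p-values taken from table 4.4.
print(second_stage(p_drop_v=0.002, p_drop_fp=0.608))
```

With the p-values of table 4.4, only the indicator v is kept, in line with the final model reported for the ozone data.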

Comparing FP-Spike to other techniques, one can observe some differences in the estimated models. Figures 4.3 b and c visualize the results of standard linear regression, standard fractional polynomials, the method proposed by Robertson et al [50] and FP-Spike. The selected functions can be found in table 4.5. Similar to what could be observed in the simulated examples before, standard linear regression fits a model on the whole range of observations. The estimated regression line is steeper than for the other three methods. The fractional polynomial approach selects a nonlinear functional form which is almost parallel to the x-axis except for the area close to zero, where an almost sharp bend can be observed. The method proposed by Robertson et al. results in almost the same coefficient estimate for v. However, as there is no variable selection, the linear functional form has a slightly steeper angle. As only the binary indicator is kept in the model for FP-Spike, the function is constant for x > 0.

The R2 is considerably lower for standard linear regression than for the other three methods, as displayed in table 4.5. The respective deviance is slightly higher for linear regression. FP-Spike selected the simplest model in terms of degrees of freedom needed for the final model, as only the binary indicator is kept. In addition, interpretation of the selected model is straightforward. R2 and deviance are very similar for FP, Robertson and FP-Spike. The R2 values are generally very low. This is due to the fact that the selected covariate does not have a major influence on the chosen outcome, and important predictor covariates such as weight and height were not included in the model. For illustrative purposes, however, the example shows that there are situations in which the effect on an outcome of observations with value zero is considerably different from that of all other observations. Unfortunately, only one covariate with SAZ is present in this dataset. Therefore it cannot be used later in the bivariate setting.

The situation of covariates with SAZ was analysed here for a continuous endpoint. However, covariates with SAZ can also occur in other research settings with other hypotheses and regression models.

[Figure 4.3. Panel a: density of time spent outside (min) 48h before measurement. Panels b and c: forced vital capacity with the fitted functions of Linear Regression, Standard FP, Robertson et al. and FP-Spike.]

Figure 4.3: Ozone study: a. Distribution of the covariate “time spent outside”. b+c. Visual comparison of the functional forms selected by linear regression, FP, the method proposed by Robertson et al [50] and FP-Spike. b. The four methods are plotted with the observed data. c. An enlarged version of the fitted functions is given in order to highlight differences between the methods.

| Method | predictor | R2 | Dev. |
|---|---|---|---|
| Linear regression | y = 0.0004x + 1.91 | 0.008 | 609.3 |
| FP | y = −0.01x^(−0.5) + 1.96 | 0.016 | 599.5 |
| Robertson et al | y = −0.081v + 0.0009x + 1.96 | 0.016 | 599.9 |
| FP-Spike | y = −0.086v + 1.96 | 0.016 | 600.2 |

Table 4.5: Ozone study: Comparison of linear regression, FP, the method proposed by Robertson et al [50] and FP-Spike.

4.5 Case Study II - German Breast Cancer Study Group

As a second example, the German breast cancer study will be analyzed. Compared to the ozone data with a continuous endpoint, the GBSG study has a time-to-event endpoint (recurrence free survival). Some interpretational advantages of the procedure will be highlighted.

4.5.1 Data and methods

The German Breast Cancer Study Group (GBSG) performed a multi-center randomized clinical trial to compare the effectiveness of three versus six cycles of chemotherapy with or without tamoxifen, an additional hormonal treatment. The aim was to compare recurrence free and overall survival between the different treatment procedures. Data of the comprehensive cohort study with 686 patients with primary node positive breast cancer will be used to illustrate the procedure described in section 4.2. There were 299 events for recurrence free survival (RFS), and the recruitment took place from 1984 to 1989. Besides hormonal treatment with tamoxifen, there are seven factors which were considered in several analyses on prognostic factors. Here, the focus will be on the effect of the estrogen (er) and progesterone receptor level (pr) on RFS because they both have a spike at zero. In the data, 11% of the patients have value zero for the estrogen receptor and 13% for the progesterone receptor. In 9% of the patients both receptor levels have value zero. In real data, the covariates of interest are often correlated. The Spearman correlation between both covariates in the GBSG study was 0.51 for those who were both estrogen and progesterone receptor positive. The effect of hormonal treatment with tamoxifen was investigated in this old trial. It has been generally known for some years that there is a strong interaction between the estrogen receptor and tamoxifen. Thus, it will also be considered whether the effect of estrogen is different in the two (yes or no) tamoxifen subgroups. Analysis was performed using a multivariate Cox regression model. For more details on the study see [59]. The data are available on the web (http://www.imbi.uni-freiburg.de/biom/Royston-Sauerbrei-book/). A standard Cox model, FP and FP-Spike are compared as three different model selection procedures to estimate the influence of the estrogen receptor on recurrence free survival (RFS) in breast cancer patients, and are used to illustrate the univariate FP-Spike procedure. Assuming proportional hazards, these survival data will be modeled with the Cox model.

4.5.2 Results

Analyses are performed in the entire population and in two subgroups. The step-by-step results for FP-Spike in the overall population can be found in table 4.6 a. In the first stage, the best FP model including the binary indicator v is selected using FSP-Spike, the function selection procedure described in section 4.2. At the significance level α = 0.05, FP2+v is significantly better than the null model and the Linear+v model, but not significantly better than FP1+v. So the FP1 function (FP(-0.5)) with the binary indicator v is selected in the first stage of model selection. In the second stage, both v and the continuous FP1 part are tested for removal. Neither part can be removed without obtaining a significantly worse model fit and, thus, both are kept in the final model. The selected functional form with the corresponding estimates can be found in table 4.7.
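The first-stage sequence of likelihood ratio tests can be sketched from the deviance differences alone. The sketch below assumes that each deviance difference is compared with a chi-squared distribution with 5, 3 and 2 degrees of freedom for the comparisons against the null, linear and FP1 models, as listed in table 4.6. The helper `chi2_sf` is a hypothetical pure-Python implementation of the chi-squared upper tail for integer degrees of freedom, and `first_stage` is a hypothetical name for the decision rule.

```python
import math


def chi2_sf(x, df):
    """Upper tail probability of a chi-squared distribution with integer df.

    Uses Q(a + 1, z) = Q(a, z) + z**a * exp(-z) / Gamma(a + 1) with z = x / 2,
    starting from Q(1, z) = exp(-z) (even df) or Q(1/2, z) = erfc(sqrt(z)) (odd df).
    """
    z = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-z)
    else:
        a, q = 0.5, math.erfc(math.sqrt(z))
    while a < df / 2.0:
        q += z ** a * math.exp(-z) / math.gamma(a + 1.0)
        a += 1.0
    return q


def first_stage(dev_diff_null, dev_diff_lin, dev_diff_fp1, alpha=0.05):
    """Test FP2* against Null (5 d.f.), Lin* (3 d.f.) and FP1* (2 d.f.) in turn."""
    if chi2_sf(dev_diff_null, 5) >= alpha:
        return "Null"
    if chi2_sf(dev_diff_lin, 3) >= alpha:
        return "Lin*"
    if chi2_sf(dev_diff_fp1, 2) >= alpha:
        return "FP1*"
    return "FP2*"


# Deviance differences for all patients, taken from table 4.6 a.
print(first_stage(23.8, 14.1, 1.9))
for diff, df in [(23.8, 5), (14.1, 3), (1.9, 2)]:
    print(diff, df, round(chi2_sf(diff, df), 4))
```

With the deviance differences of table 4.6 a, the rule stops at FP1*, matching the selection described above. The p-values computed this way may differ slightly from the tabulated ones because the published deviance differences are rounded.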

Figure 4.4 a. shows the functional forms of both the usual FP and FP-Spike. To be able to use the standard FP procedure, the data are pre-transformed (x = er+1), as one requirement for the use of FPs is positive values. Furthermore, the data were centered to the mean. As the baseline hazard is not estimated for the analysis, the results of the FP model (without indicator variable) can be compared to the results of the FP-Spike model. One can see that the usual FP tends to infinity close to zero, which does not lead to a satisfying interpretation. The FP-Spike model produces a point estimate for all zero values and is, therefore, easier to interpret at zero. However, for values close to zero the continuous part still tends to infinity. In this example, it might be the case that observations which are very close to zero are relatively similar to the true zeros. Table 4.7 displays the selected functional forms and the deviances. It can be seen that compared to the standard Cox model (Dev: 3571.6), the deviance for FP (Dev: 3557.2) decreases considerably. FP-Spike again leads to a slightly decreased deviance (Dev: 3554.4).

[Figure 4.4: three panels (All patients, TAM Subgroup, No-TAM Subgroup) showing the log hazard ratio against estrogen receptors for FP-Spike, FP and the standard Cox model.]

(a) In the overall population, both procedures select an FP1 model, but with different power terms. The estimator of the log hazard ratio for v from FP-Spike is displayed as a black scatter.

(b) In the TAM subgroup, the final model does not include the continuous part. Only the binary indicator is kept in the final model. Thus, the effect for observations > 0 is constantly zero.

(c) In the No-TAM subgroup, FP selects a more complex model than FP-Spike.

Figure 4.4: Breast Cancer Study: FP-Spike vs. FP for estrogen receptor, univariate.

The interpretative advantage of the results of the FP-Spike model becomes even clearer if a subgroup of the study population is considered, namely the patients that received a hormonal treatment with tamoxifen. The status “estrogen receptor positive or not” has an influence on survival; however, the actual value of the estrogen receptor does not seem to provide further information in this subgroup. Table 4.6 b shows that in the first stage of the FP-Spike function selection procedure, a linear function including a binary indicator is selected. In the second stage, however, one can see that dropping the linear part of the model does not significantly worsen the fit and, thus, the final model will only consist of v, the binary indicator. The coefficient estimate for the linear function was close to zero, similar to the coefficient estimated by the standard Cox model. This can also be seen in figure 4.4b. Here, the usual FP is plotted together with the function selected using FP-Spike and the standard Cox model. It can be seen that the log-HR is constant for x > 0 (level of er). In the final model of the FP-Spike procedure, the linear part is not included. Compared to the FP1 function selected by FP, FP-Spike leads to a simpler function (with respect to degrees of freedom), which is easier to interpret in this situation. The deviance is the same for both FP (Dev: 920.6) and FP-Spike (Dev: 920.6). For the standard Cox model (Dev: 935.5), the deviance is considerably higher.

In the subgroup of patients that were not treated with tamoxifen, FP again selects a more complex model with respect to degrees of freedom (FP(-2, -1)). In figure 4.4c, one can see that there is a maximum at around x = 3, and for values close to zero the function tends to negative infinity. As a biological relationship this seems rather complicated to explain. FP-Spike selects an FP1 (FP(-0.5)), which is the same functional form that was already selected in the overall population. The binary indicator v is dropped (p=0.054).

In conclusion, one can state that in some situations FP-Spike leads to simpler models. They can be interpreted more easily than the standard FP models, which in this data setting select strongly nonlinear functional relationships tending to infinity or negative infinity close to zero. The overall model fit measured by the deviance is improved compared to the standard Cox model. The deviances of FP and FP-Spike are comparable, with no preference for one of the methods in this data setting.

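a) ALL PATIENTS

First stage¹

| Model | Deviance | Dev.Diff. | d.f. | P | Power |
|---|---|---|---|---|---|
| FP2+v | 3552.5 | | | | -2, -1 |
| Null | | 23.8 | 5 | <0.001 | |
| Linear+v | | 14.1 | 3 | 0.003 | 1 |
| FP1+v | | 1.9 | 2 | 0.391 | -0.5 |

Second stage²

| Model | Deviance | Dev.Diff. | P |
|---|---|---|---|
| FP1 + v | 3554.4 | | |
| Dropping v | | 14.4 | <0.001 |
| Dropping FP1 | | 14.5 | <0.001 |

b) TAM SUBGROUP

First stage¹

| Model | Deviance | Dev.Diff. | d.f. | P | Power |
|---|---|---|---|---|---|
| FP2+v | 915.2 | | | | -2, -2 |
| Null | | 20.8 | 5 | <0.001 | |
| Linear+v | | 5.3 | 3 | 0.148 | 1 |
| FP1+v | | 1.9 | 2 | 0.101 | 3 |

Second stage²

| Model | Deviance | Dev.Diff. | P |
|---|---|---|---|
| Linear + v | 920.6 | | |
| Dropping v | | 14.9 | <0.001 |
| Dropping Linear | | <0.1 | 0.87 |

c) NO-TAM SUBGROUP

First stage¹

| Model | Deviance | Dev.Diff. | d.f. | P | Power |
|---|---|---|---|---|---|
| FP2+v | 2244.0 | | | | -0.5, 3 |
| Null | | 18.13 | 5 | 0.003 | |
| Linear+v | | 14.90 | 3 | 0.002 | 1 |
| FP1+v | | 0.89 | 2 | 0.640 | -0.5 |

Second stage²

| Model | Deviance | Dev.Diff. | P |
|---|---|---|---|
| FP1 + v | 2244.9 | | |
| Dropping v | | 3.45 | 0.054 |
| Dropping FP1 | | 17.03 | <0.001 |

Table 4.6: Breast Cancer Study: Details to derive the FP-Spike models for estrogen receptor. a) all patients. b) subgroup with hormonal treatment. c) subgroup without hormonal treatment.
¹ In the first stage of the analysis, the full model including both the indicator v and the covariate x is estimated. The selected model in this first stage is then taken to the second stage.
² In the second stage both v and x are tested for removal.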

A. ALL PATIENTS

| Method | selected FP | predictor | Deviance |
|---|---|---|---|
| Standard Cox | - | −0.001x | 3571.6 |
| FP | FP(0) | −0.137(ln(x) − 4.577) | 3557.2 |
| FP-Spike | FP(-0.5) | 0.706 ∗ v + 1.096 ∗ (x^(−0.5) − 0.0956) | 3554.4 |

B. TAM SUBGROUP

| Method | selected FP | predictor | Deviance |
|---|---|---|---|
| Standard Cox | - | −0.0005x | 935.5 |
| FP | FP(-1) | 1.27(x^(−1) − 0.01) | 920.6 |
| FP-Spike | v | 3.48v | 920.6 |

C. NO-TAM SUBGROUP

| Method | selected FP | predictor | Deviance |
|---|---|---|---|
| Standard Cox | - | −0.001x | 2258.9 |
| FP | FP(-2 -1) | −4.19x^(−2) + 4.52(x^(−1) − 0.1) | 2245.0 |
| FP-Spike | FP(-0.5) | 3.44(x^(−0.5) − 0.1) | 2248.3 |

Table 4.7: Breast Cancer Study: selected FPs, explicit functional forms for the predictor of the Cox model for the log hazard ratio and deviances for the final model

The GBSG study data set contains a second covariate with SAZ, “progesterone receptor level”. The results of a univariate analysis of progesterone receptor level are similar to the results for estrogen receptor level described above. In an analysis including both covariates, the effect of the estrogen receptor level completely “disappears” due to a very strong correlation of the two covariates. This strong correlation is the reason why the data will not be considered in the bivariate setting later, as there are other datasets that illustrate this situation in a better way.

50

5 Two covariates with a spike at zero

5.1 Initial problem

The situation with two covariates with a SAZ is similar to the one described in chapter 4, but it poses some additional challenges. With two covariates with a spike at zero, the observations fall into four subsets: A = {(x1i, x2i, yi) : x1i = 0, x2i = 0}, B = {(x1i, x2i, yi) : x1i = 0, x2i > 0}, C = {(x1i, x2i, yi) : x1i > 0, x2i = 0}, and D = {(x1i, x2i, yi) : x1i > 0, x2i > 0} (already defined in section 2.5.1). Depending on how the observations are distributed over the four subsets, different modeling strategies might be applicable.

5.2 Modeling two covariates with SAZ: proposed methods

If there are two covariates X1 and X2 with a SAZ, the situation becomes more complex, since the correlation between the two covariates may differ between their continuous parts and their binary (zero/non-zero) parts. For the further definition of relation and dependence of covariates with a SAZ see section 2.4. The situation has been analyzed theoretically in Lorenz et al. (2015) [41] for the case of two normally distributed variables with spikes, and the analysis revealed a rather complex pattern. In reality, however, distributions are arbitrary and different relationships need to be modeled. There are several possible ways to handle them. Four strategies are proposed, which will be described in detail in the following sections. A summary can be found in table 5.3.

Method selection may depend on the specific situation (subject matter knowledge, aim of the study), and especially on the distributions of the zero and non-zero values of both covariates as indicated in table 5.1. In table 5.1 the four subsets of observations are defined again. Furthermore, three dummy indicator variables are defined to distinguish those four categories. Depending on the cardinality of the four subsets, different methods might be suitable. Each method described in the following is proposed for a very specific distributional assumption. The methods are described and compared in the following subsections. If the assumptions underlying the selected method are not met, the model fit cannot be expected to improve.

                        Dummy
x1    x2    Category    z1   z2   z3   v1   v2
0     0     A           1    0    0    1    1
0     > 0   B           0    1    0    1    0
> 0   0     C           0    0    1    0    1
> 0   > 0   D           0    0    0    0    0

Table 5.1: The observations of two covariates with a SAZ can be separated into four categories. These four categories can be coded using 3 dummy variables.
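The coding of table 5.1 can be sketched in a few lines of Python. This is only an illustration: the function names `spike_category` and `spike_dummies` are hypothetical, and the covariates are assumed to be non-negative.

```python
def spike_category(x1, x2):
    """Return the category (A, B, C or D) of one observation as in table 5.1."""
    if x1 == 0 and x2 == 0:
        return "A"
    if x1 == 0:
        return "B"  # x1 = 0, x2 > 0
    if x2 == 0:
        return "C"  # x1 > 0, x2 = 0
    return "D"      # x1 > 0, x2 > 0


def spike_dummies(x1, x2):
    """Dummy coding of table 5.1: z1 to z3 code the categories, v1 and v2 the zeros."""
    cat = spike_category(x1, x2)
    return {
        "category": cat,
        "z1": int(cat == "A"),
        "z2": int(cat == "B"),
        "z3": int(cat == "C"),
        "v1": int(x1 == 0),  # v1 = 1 if x1 is zero
        "v2": int(x2 == 0),  # v2 = 1 if x2 is zero
    }
```

Category D is the reference category: all three dummies z1 to z3 are zero there.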

5.2.1 Bi-sep: distinct consideration of covariates with SAZ

The distinct analysis of the two spike variables represents the simplest case. The FP-spike procedure is applied to each spike variable, with additional rules for the visiting order. For each variable with a SAZ, a binary indicator is included, the best model is fitted, and finally it is checked whether both the indicator and the continuous part are needed in the model. The Bi-Sep approach derives a model under the assumption that there is no interaction between the two spike variables, so it does not require conceptual extensions of the approach for one variable. If this assumption is unrealistic or contradicts the data, an extension is needed; it is proposed to handle such more complex situations with one of the other three procedures. In practice, the Bi-Sep approach uses the usual MFP procedure for variables without a spike and the FP-spike procedure individually for each variable with a spike, that is, for x1 and x2 separately, with binary indicator variables v1 and v2 defined accordingly. The full model is defined as

E[y|(x1, x2)] = β01v1 + f(x1) + β02v2 + f(x2) + βcons,

where f is the selected functional form, which is estimated using the function selection procedure (FSP) described below. In the univariate FP-spike procedure, a modified FSP is applied [5]. Figure 5.1 visualizes a possible result of the Bi-Sep procedure. The two constant coefficient estimates of v1 and v2 are represented by the red lines. The colored regression plane/shape represents the continuous functional relationship between the non-zero parts of both covariates and the outcome and is only defined for x1, x2 > 0. As in the case with only one covariate with a spike at zero, this procedure models non-continuous (or semi-continuous) relationships. The detailed algorithm is as follows:

The Bi-sep procedure

a./b. Cf. FP-Spike. Choose values for each variable separately.

c. Nominal P-values α1 and α2 for variable and function selection are chosen for each spike variable. Typical values are α1 = α2 = 0.05. Values may differ among variables. Taking α1 = 1 for a given variable forces it into the model (no variable selection). Taking α2 = 1 for a continuous variable forces the most complex permitted FP function to be fitted for it (no function selection).

d. For the spike variables x1 and x2, binary indicators v1 and v2 are generated.

e. Function selection procedure (FSP-Bi-Sep)

e.1 A model including (v1, x1) and (v2, x2) is fitted. The visiting order of the predictors is determined by the P-value for omitting either (v1, x1) or (v2, x2). The most significant predictor (with the smallest P-value) is visited first. Assume that (v1, x1) and (v2, x2) have been arranged in this order, which is retained in all cycles of the procedure.

e.2 Let c = 0, to initialize the cycle counter.

e.3 1st cycle: FSP-Spike is applied to (v1, x1) at the α = α1 level, adjusted for (v2, x2). (x1*)_c1 is the function selected for (v1, x1) in the first cycle, which is indicated by c1. As FSP-Spike contains two stages (for details see 4.2), the selected function may consist of only the binary indicator, only a continuous function, or both, or (v1, x1) may be dropped if neither part is significant.

e.4 FSP-Spike is applied to (v2, x2), adjusted for (x1*)_c1. (x2*)_c1 is the function selected for (v2, x2).

e.5 2nd cycle: FSP-Spike is again applied to (v1, x1), adjusted for (x2*)_c1. The selected function is (x1*)_c2. Then FSP-Spike is applied to (v2, x2), adjusted for (x1*)_c2, leading to (x2*)_c2.

e.6 The algorithm stops if (x1*)_ci = (x1*)_c(i+1) and (x2*)_ci = (x2*)_c(i+1), i.e. if the selected functions no longer change from one cycle to the next.

In the univariate procedure, two stages were described. This two-stage procedure is embedded in the Bi-Sep procedure, as it is applied to each covariate separately during function selection, e.g. in step e.4. There, the best functional form for (v2, x2) is selected, adjusted for the functional form already selected for (v1, x1). Thus, the best FP for x2 including v2 is selected, and in the same step both x2 and v2 are tested for removal.
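The cyclic structure of steps e.2 to e.6 can be illustrated with a strongly simplified Python sketch. It is not the FSP-Spike procedure itself: it assumes a continuous outcome fitted by least squares, considers only FP1 functions, picks the power with the smallest residual sum of squares instead of using the closed test procedure with α1 and α2, always keeps both indicators, and uses a fixed visiting order. All function names (`fp_term`, `rss`, `select_power`, `bi_sep_cycles`) are hypothetical.

```python
import numpy as np

FP1_POWERS = (-2, -1, -0.5, 0, 0.5, 1, 2, 3)  # usual FP power set, 0 means ln(x)


def fp_term(x, power):
    """FP1 transformation of the positive part of x; observations at zero stay 0."""
    x = np.asarray(x, dtype=float)
    out = np.zeros_like(x)
    pos = x > 0
    out[pos] = np.log(x[pos]) if power == 0 else x[pos] ** power
    return out


def rss(y, columns):
    """Residual sum of squares of a least-squares fit with intercept."""
    X = np.column_stack([np.ones(len(y))] + columns)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)


def select_power(y, x, v, adjust_cols):
    """Simplified stand-in for FSP-Spike: best FP1 power for (v, x) given adjust_cols."""
    return min(FP1_POWERS, key=lambda p: rss(y, [v, fp_term(x, p)] + adjust_cols))


def bi_sep_cycles(y, x1, x2, max_cycles=10):
    """Alternate between (v1, x1) and (v2, x2) until the selected powers are stable."""
    y, x1, x2 = (np.asarray(a, dtype=float) for a in (y, x1, x2))
    v1 = (x1 == 0).astype(float)
    v2 = (x2 == 0).astype(float)
    p1, p2 = 1, 1  # start with untransformed (linear) covariates
    for cycle in range(1, max_cycles + 1):
        new_p1 = select_power(y, x1, v1, [v2, fp_term(x2, p2)])
        new_p2 = select_power(y, x2, v2, [v1, fp_term(x1, new_p1)])
        if (new_p1, new_p2) == (p1, p2):  # stopping rule of step e.6
            return p1, p2, cycle
        p1, p2 = new_p1, new_p2
    return p1, p2, max_cycles
```

The function returns the two selected powers and the number of cycles that were run. A full implementation would replace `select_power` by the two-stage FSP-Spike, which can also drop the indicator or the continuous part.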

(3D plot; axes: x1, x2, y)

Figure 5.1: Visualization of an example result of the Bi-Sep procedure. The two constant coefficient estimates of v1 and v2 are represented by the red lines. The colored regression plane/shape represents the continuous functional relationship between the non-zero parts of both covariates and the outcome.

5.2.2 Bi-D3: combination of dummy variables

If the two covariates are assumed to be related, a combination of dummy variables can be applied. These indicator variables distinguish the 4 categories of observations indicated in table 5.1: both covariates take value zero (category A), one covariate equals zero and the other takes a positive value (categories B and C), or both covariates take a non-zero value (category D). The full model is defined as

E[y|(x1, x2)] = β01z1 + β02z2 + β03z3 + f(x1) + f(x2) + βcons.

The indicators z1, z2 and z3 are defined in table 5.1. If the two covariates have a joint effect, more precisely, if a different effect is to be expected depending on the zero/non-zero categories of the two variables, this procedure is able to estimate these effects. The detailed procedure is as follows:

The Bi-D3 procedure

a.-c. As in Bi-Sep.

d. 3 dummy variables are generated, indicating whether both x1 and x2 are zero (z1 = 1 if x1 = x2 = 0), only x1 is zero (z2 = 1 if x1 = 0 and x2 > 0) or only x2 is zero (z3 = 1 if x1 > 0 and x2 = 0).

e. Function selection procedure (FSP-Bi-D3)

e.1 A model is fitted which includes z1, z2, z3 and the untransformed x1 and x2. The visiting order of the predictors is determined by the P-value for omitting either x1 or x2. The most significant predictor is visited first and the least significant last. Assume that the variables x1 and x2 have been arranged in this order, which is retained in all cycles of the procedure.

e.2 Let c = 0, to initialize the cycle counter.

e.3 The normal FSP (cf. MFP, Royston et al.) is applied to x1 at the α = α1 level, adjusted for x2 (untransformed) and the three dummies. (x1*)_c1 is the function selected for x1.

e.4 FSP is applied to x2, adjusted for (x1*)_c1 and z1, z2, z3. (x2*)_c1 is the function selected for x2.

e.5 In the 2nd cycle, FSP is again applied to x1, adjusted for (x2*)_c1 and the three dummies. The selected function is (x1*)_c2. Then FSP is applied to x2, adjusted for (x1*)_c2 and the three dummies, leading to (x2*)_c2.

e.6 The algorithm stops if (x1*)_ci = (x1*)_c(i+1) and (x2*)_ci = (x2*)_c(i+1).

The three indicator variables z1, z2 and z3 are always kept in the model.

5.2.3 Bi-D1: dummy indicating if both variables are zero or not

Assuming that in most cases either both of the covariates are simultaneously zero or non-zero, it might be

sufficient to create one dummy variable which indicates whether both variables are zero (z1 = 1 if x1 = 0

and x2 = 0, z1 = 0 otherwise). The full model is defined as

E[y|(x1, x2)] = β01z1 + f(x1) + f(x2) + βcons.

This method uses only one of the three dummy variables used in the previous approach. Because categories B, C and D are combined and only distinguished from category A, the interpretation of the model depends strongly on this pooling. This means that strong distributional assumptions are made in this approach. In categories B and C, a SAZ may still be present, but it cannot be handled by this procedure. The detailed procedure is similar to the procedures already described, with only slight changes:

The Bi-D1 procedure

a.-c. As in Bi-Sep.

d. A new variable z1 is generated indicating that both spike variables take value zero.

e. Function selection procedure (FSP-Bi-D1)

e.1 Cf. FSP-Bi-D3, but replace the 3 dummies with only z1.

The indicator variable z1 is always kept in the model.

5.2.4 Bi-sub: submodels in each category

Another way of handling two variables with a SAZ is to consider submodels in the four categories. An example of a data situation for this method is the study by Fehringer et al. (2017) [22], which was already mentioned in the introduction. In this study, the effect of alcohol on lung cancer is investigated in never-smokers in order to answer the question whether alcohol consumption is associated with lung cancer. As smoking strongly confounds this relationship, earlier studies were difficult to interpret. For the method proposed in this section, a submodel is built in each category, and different types of relationships (linear or non-linear) are allowed for different types of observations (e.g. smokers and non-smokers). The four categories contain the following observations: A = {x1 = 0} ∩ {x2 = 0}, B = {x1 = 0} ∩ {x2 > 0}, C = {x1 > 0} ∩ {x2 = 0} and D = {x1 > 0} ∩ {x2 > 0}. With 1A, 1B, 1C and 1D denoting the indicator functions of the four categories, the model is

f(x) = β0,0 ∗ 1A
     + (β0,1 + FPB(x2)) ∗ 1B
     + (β1,0 + FPC(x1)) ∗ 1C
     + FPD(x1, x2) ∗ 1D

In other words, the variables z1 to z7, which are defined in table 5.2, are included in the model. The continuous observations of x1 are split into two covariates. The first, z5, only contains the positive observations of x1 for which x2 = 0; all other observations are set to zero in this covariate. The second, z6, contains the remaining positive observations of x1; observations included in z5 are set to zero in z6. The same is applied to x2, which is split into z4 and z7. The full model is defined as

E[y|(x1, x2)] = β01z1 + β02z2 + β03z3 + f(z4) + f(z6) + f(z5) + f(z7) + βcons.

The FSP is as follows: a model including all seven variables is fitted. Among the continuous variables (z4 to z7), the most significant one is visited first. In this way, the number of observations in each of the categories can be accounted for. The three indicator variables z1, z2 and z3 are always kept in the model. In terms of model building, it may be reasonable to restrict the choice of the FP models for the two variables derived from the same covariate (e.g. z5 and z6 for x1 in the notation of table 5.2) in order to force the same functional form, and therefore the same choice of powers, for the FP models. Different coefficients, however, are allowed in the respective subgroups. If the functional form is not estimated separately, fewer degrees of freedom are needed. This might be an advantage if some of the subgroups are relatively small. If the groups are big enough, it is also possible to select the FPs independently in the different groups. Figure 5.2 visualizes a possible result. The functional relationships for z4 and z5 are represented as red lines. The colored shape visualizes the relationship between the positive parts of both covariates and the outcome.

(3D plot; axes: x1, x2, y)

Figure 5.2: Visualization of an example model of the Bi-Sub approach. The functional relationships for z4 and z5 are represented as red lines. The colored shape visualizes the relationship between the positive parts of both covariates and the outcome.

           x1 = 0                                     x1 > 0
x2 = 0     z1 = 1 if x1 = 0, x2 = 0; 0 otherwise      z3 = 1 if x1 > 0, x2 = 0; 0 otherwise
                                                      z5 = x1 if x1 > 0, x2 = 0; 0 otherwise
x2 > 0     z2 = 1 if x1 = 0, x2 > 0; 0 otherwise      z6 = x1 if x1 > 0, x2 > 0; 0 otherwise
           z4 = x2 if x1 = 0, x2 > 0; 0 otherwise     z7 = x2 if x1 > 0, x2 > 0; 0 otherwise

Table 5.2: Definition of the 7 variables used in Bi-Sub. In addition to the three dummy variables which were already included in the previous model, different continuous functional relationships for observations for which only one variable is positive (categories B and C) are allowed.
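The seven variables of table 5.2 can be generated as in the following Python sketch. The function name `bi_sub_variables` is hypothetical, and the covariates are assumed to be non-negative.

```python
def bi_sub_variables(x1, x2):
    """Return z1 to z7 of table 5.2 for one observation (x1, x2)."""
    in_a = x1 == 0 and x2 == 0
    in_b = x1 == 0 and x2 > 0
    in_c = x1 > 0 and x2 == 0
    in_d = x1 > 0 and x2 > 0
    return {
        "z1": int(in_a),           # dummy for category A
        "z2": int(in_b),           # dummy for category B
        "z3": int(in_c),           # dummy for category C
        "z4": x2 if in_b else 0.0,  # x2 where only x2 is positive
        "z5": x1 if in_c else 0.0,  # x1 where only x1 is positive
        "z6": x1 if in_d else 0.0,  # x1 where both covariates are positive
        "z7": x2 if in_d else 0.0,  # x2 where both covariates are positive
    }
```

For an observation in category B, for example, only z2 and z4 are different from zero.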

Nb. of spike vars | Multiv. | Method | Explanation
1 | no | FP-spike | Univariate analysis for a variable with spike.
1 | yes | FP-spike adjusted | The FP-spike procedure is used adjusting for a given variable or index in every step.
1 | yes | MFP-spike | FP-spike variable is one of the candidates to derive a multivariable model. For this variable FSP-spike is used instead of FSP in the MFP procedure.
2 | yes | Bi-Sep | Correlation structure of both the binary indicators and the continuous part is not considered. FP-Spike is used separately for both variables.
2 | yes | Bi-D3 | Three dummy variables distinguish between 4 categories of observations and thus replace the binary indicator in the univariate case. All 3 dummies are kept in the model.
2 | yes | Bi-D1 | Only the first dummy indicating that both variables are zero is kept in the model.
2 | yes | Bi-Sub | For three of the 4 categories a separate functional relationship is estimated. Category A is represented by an indicator as it does not contain continuous values. 3 dummies are included plus additional continuous variables which take a positive value only if the other variable is zero. Restrictions, such as same power terms in all categories, are possible. Unsuitable if the sample size in one or more categories is much too small to estimate a function.


5.3 First ideas for comparison


Four different strategies to deal with two variables with a spike at zero were proposed. The first method (Bi-Sep) considers the variables with a spike separately: an indicator variable is created for each of them, and the two pairs (v1, x1) and (v2, x2) are considered separately. The second method (Bi-D3) uses a combination of dummy variables, which separate the four categories of observations (both x1 and x2 zero; x1 zero and x2 positive; x2 zero and x1 positive; both positive). The third method (Bi-D1) uses one dummy variable indicating that both x1 and x2 are zero. The fourth method (Bi-Sub) is the most complex one. It includes the 3 indicator variables of the second method, but furthermore allows different functional relationships for observations for which only one of the two variables is positive.
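When all functional forms are restricted to be linear, the four strategies differ only in the predictors that enter the model. The following Python sketch lists these predictors for one observation; the function names and the method labels used as keys are illustrative choices. Counting the predictors plus the intercept gives 3, 5, 6, 4 and 8 parameters, which agrees with the df column of table 5.5.

```python
def design_row(method, x1, x2):
    """Predictor values (without intercept) of one observation for a linear model."""
    v1, v2 = int(x1 == 0), int(x2 == 0)
    z1 = int(x1 == 0 and x2 == 0)  # category A
    z2 = int(x1 == 0 and x2 > 0)   # category B
    z3 = int(x1 > 0 and x2 == 0)   # category C
    both_pos = int(x1 > 0 and x2 > 0)  # category D
    if method == "LR":
        return {"x1": x1, "x2": x2}
    if method == "LR+Sep":
        return {"v1": v1, "x1": x1, "v2": v2, "x2": x2}
    if method == "LR+D3":
        return {"z1": z1, "z2": z2, "z3": z3, "x1": x1, "x2": x2}
    if method == "LR+D1":
        return {"z1": z1, "x1": x1, "x2": x2}
    if method == "LR+Sub":
        return {"z1": z1, "z2": z2, "z3": z3,
                "z4": x2 * z2, "z5": x1 * z3,
                "z6": x1 * both_pos, "z7": x2 * both_pos}
    raise ValueError(f"unknown method: {method}")


def n_parameters(method):
    """Number of regression parameters including the intercept."""
    return len(design_row(method, 1.0, 1.0)) + 1
```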

Aims and Methods The aim of this section is to perform some first comparisons of the methods. The standard linear regression model was chosen as the reference for comparison. Four example datasets with specific characteristics were simulated in order to visualize and compare the methods. These first comparisons will be used as a basis for the design of a simulation study that compares the methods systematically. The example data sets chosen for the bivariate case are straightforward extensions of those in the univariate case in section 4.3 (table 4.1). Four different true outcome functions will be investigated. As stated in section 2, the aim of a statistical model is not easily defined. Thus, the selection of adequate evaluation measures is not straightforward. In this first setting, R2, MSE, MSE in categories, PMSE, AIC and degrees of freedom will be calculated for the selected models. Definitions of the measures are found in section 2.5.
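As an illustration, three of these measures can be computed as in the following Python sketch. It uses the standard textbook definitions and assumes that the MSE divides by the number of observations; the definitions in section 2.5 may differ in such details. PMSE and AIC are left out, and the function names are hypothetical.

```python
def r_squared(y, fitted):
    """R2 = 1 - residual sum of squares / total sum of squares."""
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


def mse(y, fitted):
    """Mean squared error over all observations (assumed divisor: n)."""
    return sum((yi - fi) ** 2 for yi, fi in zip(y, fitted)) / len(y)


def mse_by_category(y, fitted, x1, x2):
    """MSE computed separately in the categories A, B, C and D of table 5.1."""
    squared = {"A": [], "B": [], "C": [], "D": []}
    for yi, fi, a, b in zip(y, fitted, x1, x2):
        if a == 0 and b == 0:
            cat = "A"
        elif a == 0:
            cat = "B"
        elif b == 0:
            cat = "C"
        else:
            cat = "D"
        squared[cat].append((yi - fi) ** 2)
    # None marks a category without observations
    return {cat: (sum(sq) / len(sq) if sq else None) for cat, sq in squared.items()}
```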

Description of Datasets Table 5.4 shows the specifications of the four simulated example data sets which are used for illustration. All true functional relationships are linear. The size of the effects of v1 and v2 is varied in order to get a first impression of the behavior of the methods. To simplify this setting, the proposed bivariate methods are used with a restriction on the selected functional form: only a linear functional form is allowed, combined with the proposed extensions of the indicators and, in LR+Sub, with the possibility of a different coefficient for observations for which only one of the covariates is positive. The details can be found in section 5.2. The datasets with two variables with a spike at zero were generated in Stata 13 using drawnorm, a command which draws a sample from a multivariate normal distribution and for which the desired means µi, i ∈ {1, 2}, and the covariance matrix Σ have to be specified. Furthermore, one can specify a correlation matrix ρ. For the different scenarios, different means, covariance and correlation matrices can be chosen. Situations 2a-2c use uncorrelated variables. In situation 2d, x1 and x2 have a Pearson correlation of 0.7.

Characteristic                            Values
True functions                            y2a = 1.5∗v1 + 2∗x1 + 1.5∗v2 + 2∗x2 + ε   (corr(x1, x2): 0)
                                          y2b = 20∗v1 + 2∗x1 + 20∗v2 + 2∗x2 + ε     (corr(x1, x2): 0)
                                          y2c = 20∗v1 + 2∗x1 + 1.5∗v2 + 2∗x2 + ε    (corr(x1, x2): 0)
                                          y2d = 20∗v1 + 2∗x1 + 20∗v2 + 2∗x2 + ε     (corr(x1, x2): 0.7)
Number of obs.                            100
Proportion of zero values (qA, qB, qC)    (0.2, 0.3, 0.2)
Distribution of xi                        normal
Means µ1, µ2                              (7, 7)
Standard deviations σ1, σ2                (5, 5)

Table 5.4: Characteristics of bivariate simulated data (datasets 2). y2a has a small effect for both dummies, y2b two large effects and y2c one dummy with a small and one dummy with a large effect. y2d is similar to y2b, but it furthermore includes a correlation of 0.7 between x1 and x2.

After generating the two variables, the proportion of zero values is set according to the chosen values qA, qB and qC, which specify the proportion of observations in categories A, B and C defined in table 5.1. To ensure that the observations set to zero are chosen randomly, a uniformly distributed variable “equidist” was generated; x1 was replaced with zero if equidist ≤ qA + qB, and x2 was replaced with zero if equidist ≤ qA or if qA + qB < equidist ≤ qA + qB + qC. Negative values which might have been generated are set to zero as well.
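The data generation described above was done in Stata; the following Python sketch only mirrors the described steps. The function name, the argument names, the seed and in particular the standard deviation of the error term ε (`error_sd`) are assumptions, since the error distribution is not specified in this section.

```python
import math
import random


def simulate_spike_data(n=100, beta_v=(1.5, 1.5), rho=0.0, q=(0.2, 0.3, 0.2),
                        mean=7.0, sd=5.0, error_sd=1.0, seed=1):
    """Simulate (x1, x2, y) with spikes at zero, following the steps in the text."""
    rng = random.Random(seed)
    q_a, q_b, q_c = q
    rows = []
    for _ in range(n):
        # bivariate normal draw with correlation rho
        e1, e2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x1 = mean + sd * e1
        x2 = mean + sd * (rho * e1 + math.sqrt(1 - rho ** 2) * e2)
        u = rng.random()  # plays the role of the variable "equidist"
        if u <= q_a + q_b:
            x1 = 0.0  # categories A and B
        if u <= q_a or q_a + q_b < u <= q_a + q_b + q_c:
            x2 = 0.0  # categories A and C
        # negative values are set to zero as well
        x1, x2 = max(x1, 0.0), max(x2, 0.0)
        v1, v2 = int(x1 == 0), int(x2 == 0)
        y = (beta_v[0] * v1 + 2 * x1 + beta_v[1] * v2 + 2 * x2
             + rng.gauss(0, error_sd))  # error_sd is an assumption
        rows.append({"u": u, "x1": x1, "x2": x2, "v1": v1, "v2": v2, "y": y})
    return rows
```

Under these assumptions, a dataset resembling 2d would be obtained with `simulate_spike_data(beta_v=(20, 20), rho=0.7)`. Because negative draws are also set to zero, the realized proportions of zeros can be somewhat larger than qA, qB and qC.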

Results For dataset 2a, the results of a standard multiple linear regression model are visualized in figure 5.3a. The estimated plane of the usual linear regression fits the data well: the distances from the observations to the plane, represented by the red and blue dotted lines, are in general relatively small. R2 for this model is 0.83. This can also be seen in the measures of fit in table 5.5. R2 and MSE are similar for all five strategies, and the MSE in categories shows only very small differences between the groups.

In the second example, with outcome y2b, a larger effect of the binary indicators was assumed. The standard linear model is strongly influenced by the huge effect of the zero values. The relationship and the distribution of the positive values in the set P are the same as in example 2a. However, the plane is not as steep as in 2a (see figure 5.3b). The distances from the observations to the plane, represented by the red and blue dotted lines, are larger than in figure 5.3a and seem very large. The model tries to average, which can be seen in the residual plot: most of the observations have relatively large distances to the rather flat plane. R2 of the model is only 0.06. Thus, standard linear regression is not able to capture the effects that are present in this situation and fails to describe the underlying true functional relationship. The inclusion of binary indicators considerably improves the fit (see table 5.5). The MSE in separate categories (highlighted in bold in section b of table 5.5) shows that, due to the averaging when trying to find a model which fits the whole range of observations, the error in all four categories of observations is inflated. Due to the design of our example, LR+Sep leads to a good fit. This is also the case for LR+D3 and LR+Sub, as this special case (same effect of both indicators and same continuous relationship in categories B and D, and in C and D) can be handled easily by all of them. LR+D1 is not able to model the effect of observations for which only one covariate is not equal to zero. This can be seen in increased errors in categories B, C and D (highlighted in bold in section b of table 5.5).

(3D residual plot, dataset 2a; axes: x1, x2, y2a)

(a) 3D residual plot: covariates x1, x2 and the continuous outcome y2a are visualized. In the true model for y2a only a small effect for v is present. The blue and red lines display the distance from the observations to the fitted value on the regression plane. Points with red lines lie above, points with blue lines lie below the estimated plane. The regression plane fits the data well.

(3D residual plot, dataset 2b; axes: x1, x2, y2b)

(b) 3D residual plot: in the true model for y2b a large effect for v is present. This can be seen on the y-axis. The other observations have the same distribution as in the figure above. The blue and red lines display the distance from the observations to the fitted value on the linear regression plane. Points with red lines lie above, points with blue lines lie below the estimated plane. The plane is almost horizontal and does not seem to fit the data well. The distances of the points to the plane are very large.

Figure 5.3: Datasets 2a and 2b: Visualization

Dataset 2c has one large and one small effect for the binary indicators. The results are similar to example 2b. LR is a little better than in 2b, as the effect of v2 is relatively small and the linear regression model thus fits somewhat better. LR+D1 also has problems modeling the effect. The proportion of observations in categories B and C is relatively high (30% and 20%). The effect for observations in category B is large and cannot be modeled with Bi-D1. This can be seen in table 5.5 c.

In dataset 2d, a correlation between the two covariates x1 and x2 was added. At first glance, no considerable difference from the results of setting 2b can be seen. The difference in R2 between LR and the other methods is a little larger than in setting 2b. The same can be observed for the additional measures. For further conclusions, a more detailed investigation of the methods is necessary.

As a first conclusion from these four settings, one can say that for small effects of v1 and v2, standard linear regression seems to fit well. For larger effects, however, the overall model fit of LR is considerably worse than that of the four other methods. This can be seen in all measures of comparison. The most detailed information on the subsets in which the model does not fit well is given by the MSE in categories. The “trend” of the overall performance can be seen in all measures. For the simulation study, R2, MSE and MSE in categories will be selected as comparison criteria. Furthermore, the complexity with respect to degrees of freedom will be compared. For a comparison of the four proposed methods, these examples are not sufficient. However, they gave valuable insights into the behavior of the procedures. The data situations presented here were very specific, as the aim was to demonstrate the methods. In the next section, these methods will be compared in a real data situation.


a. Results for y2a - both effects of binary indicator small (small, small)
Method        R2      MSE     MSEA    MSEB   MSEC   MSED    PMSE    AIC     df
LR (1)        0.81    30.3    23.5    23.1   23.0   43.7    33.9    627.7   3
LR+Sep (2)    0.83    28.1    21.7    24.5   21.1   7.0     31.4    623.5   5
LR+D3 (3)     0.82    29.2    21.7    24.5   21.1   37.1    31.5    625.5   6
LR+D1 (4)     0.81    30.3    21.7    23.4   23.8   41.7    32.6    627.2   4
LR+Sub (5)    0.83    28.1    21.7    23.1   20.7   35.7    33.3    625.9   8

b. Results for y2b - both effects of binary indicator large (large, large)
Method        R2      MSE     MSEA    MSEB   MSEC   MSED    PMSE    AIC     df
LR (1)        0.06    125.4   140.6   58.8   80.9   201.9   119.6   770.6   3
LR+Sep (2)    0.785   23.0    14.5    26.4   18.3   25.6    30.6    603.5   5
LR+D3 (3)     0.785   23.0    13.8    26.4   18.3   25.3    32.3    603.5   6
LR+D1 (4)     0.364   77.4    13.8    54.5   80.1   128.8   92.3    721.9   4
LR+Sub (5)    0.792   23.0    13.8    26.0   15.8   24.6    33.8    604.7   8

c. Results for y2c - one small and one large effect (small, large)
Method        R2      MSE     MSEA    MSEB   MSEC   MSED    PMSE    AIC     df
LR (1)        0.2     56.3    32.5    29.6   48.3   99.2    57.0    689.8   3
LR+Sep (2)    0.78    16.0    15.6    13.4   19.2   12.9    26.25   563.7   5
LR+D3 (3)     0.79    16.0    15.4    13.3   19.0   12.9    26.1    564.7   6
LR+D1 (4)     0.30    50.4    15.4    44.6   33.0   83.3    48.8    679.0   4
LR+Sub (5)    0.80    16.0    15.4    13.2   17.0   12.0    26.5    564.0   8

d. Results for y2d - same function as 2b, additional strong correlation between x1 and x2
Method        R2      MSE     MSEA    MSEB   MSEC   MSED    PMSE    AIC     df
LR (1)        0.05    158.7   176.5   44.6   40.9   322.8   140.6   793.3   3
LR+Sep (2)    0.85    25.0    19.6    21.9   20.9   31.4    23.7    611.9   5
LR+D3 (3)     0.85    26.1    19.6    21.9   20.8   31.4    23.8    613.9   6
LR+D1 (4)     0.41    100.0   19.6    56.7   81.3   197.4   87.5    759.1   4
LR+Sub (5)    0.86    25.0    19.6    21.1   17.3   36.1    26.1    616.2   8

Table 5.5: Comparison of bivariate linear methods for y2a to y2d: mean squared errors (MSE) for linear regression (LR) compared to the bivariate SAZ extensions of the linear regression model. Interesting results are highlighted in bold and are explained in this section.
(1) Standard linear regression
(2) Bi-Sep: inclusion of v1 and v2, the selection of the functional form of x1 and x2 is restricted to a linear functional form
(3) Bi-D3: inclusion of z1, z2 and z3, the selection of the functional form of x1 and x2 is restricted to a linear functional form
(4) Bi-D1: inclusion of z1, the selection of the functional form of x1 and x2 is restricted to a linear functional form
(5) Bi-Sub: inclusion of z1, z2 and z3, generation and inclusion of z4, z5, z6 and z7, the selection of the functional form of x1 and x2 is restricted to a linear functional form


Control (0)
                  Alcohol > 0     Alcohol = 0
Pack years > 0    540 (70.2%)     26 (3.4%)
Pack years = 0    190 (24.7%)     13 (1.7%)

Case (1)
                  Alcohol > 0     Alcohol = 0
Pack years > 0    231 (89.9%)     17 (6.6%)
Pack years = 0    8 (3.1%)        1 (0.4%)

Table 5.6: Laryngeal Cancer Study: distribution of observations in the 4 categories in cases and controls for the covariates alcohol and pack years.

5.4 Case study III - Study on Laryngeal Cancer

After these short simulated investigations, the methods described in the previous section will now be applied to real data. The covariates of interest are alcohol and cigarette consumption and their influence on the risk of laryngeal cancer.

5.4.1 Data and Methods

As a case study, data from a population-based case-control study on laryngeal cancer with 257 cases (236 males, 21 females) and 769 controls (702 males, 67 females) will be considered. The sample was 1:3 frequency matched by age, location and sex. “It was performed in the Rhein-Neckar region, Germany to evaluate occupational risk factors for the development of laryngeal cancer (Rhein-Neckar-Larynx-Study).” (Dietz (2004), p.907 [15]). Data on occupational exposures were obtained in face-to-face interviews using a standardized questionnaire. Certain occupational factors like asbestos, mineral oil, wood dust etc. were found to be associated with the risk of developing laryngeal cancer. The main aim of the Rhein-Neckar-Larynx-Study was to “investigate and narrow down the supposed cement-associated risk for laryngeal cancer after adjustment for main confounding factors” (Dietz (2004), p.907 [15]). Ramroth et al. (2004) [49] investigated the risk effects of smoking and alcohol on laryngeal cancer, which were recorded as adjustment factors in the study. Both variables have a spike at zero.

Figure 5.4 (left) shows the distribution of cigarette consumption, the first covariate with a SAZ, which was measured in pack-years, where one pack-year (py) is equivalent to smoking one pack per day for one year. The second variable with a SAZ, alcohol consumption, was measured in grams per day. Its distribution can be seen in figure 5.4 (right). In total, 21% of the observations in the study are non-smokers, 6% do not consume alcohol and 1.4% neither smoke nor consume alcohol.

Table 5.6 gives details on the distribution in cases and controls. Ramroth et al. (2004) [49] included smoking as a log-transformed continuous variable (log(py+1)) in a conditional logistic regression model using FPs, conditioned on the age × sex classification (five-year age groups). For further details on the study data and the analysis see [49].

(Two histograms, y-axis: density; left x-axis: pack-years, right x-axis: alcohol in grams per day)

Figure 5.4: Laryngeal cancer study: continuous distribution of pack-years and alcohol consumption in the complete sample; values greater than 100 for pack-years were truncated (set to 100) and collected in one category.

Variable                                Category    OR      95% CI
Smoking (pack-years)                    0           1       -
                                        >0-10       4.1     (1.6, 10.7)
                                        >10-20      10.2    (4.1, 24.9)
                                        >20-40      21.6    (9.8, 47.4)
                                        >40-80      28.3    (12.9, 62.2)
                                        >80         59.8    (21.3, 167.3)
Alcohol consumption (g ethanol / day)   <= 25       1       -
                                        >25-50      1.3     (0.8, 2.1)
                                        >50-75      1.6     (1, 2.7)
                                        >75-100     1.6     (0.9, 2.9)
                                        >100-150    2.2     (1.1, 4.3)
                                        >150        3.0     (1.6, 5.9)

Odds ratios, stratified by age and sex; smoking, alcohol and education simultaneously in one model
95% CI: 95% confidence interval

Table 5.7: Risk estimation for smoking and alcohol consumption from Ramroth et al. (2004) [49]. The model additionally contained “time since quitting to smoke (years)” and “Educational level (years of school education)”.
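The proportions of non-smokers, non-drinkers and persons with neither exposure quoted in the text (21%, 6% and 1.4%) can be re-derived from the counts in table 5.6. The following Python lines are only a worked check; the variable names are illustrative.

```python
# Counts from table 5.6, keyed by (smokes, drinks)
table_5_6 = {
    "control": {(True, True): 540, (True, False): 26,
                (False, True): 190, (False, False): 13},
    "case": {(True, True): 231, (True, False): 17,
             (False, True): 8, (False, False): 1},
}

total = sum(n for group in table_5_6.values() for n in group.values())


def share(condition):
    """Percentage of all observations whose (smokes, drinks) status meets condition."""
    hits = sum(n for group in table_5_6.values()
               for status, n in group.items() if condition(*status))
    return 100.0 * hits / total


non_smokers = share(lambda smokes, drinks: not smokes)
non_drinkers = share(lambda smokes, drinks: not drinks)
neither = share(lambda smokes, drinks: not smokes and not drinks)
```

With 1026 observations in total, this gives about 20.7% non-smokers, 5.6% non-drinkers and 1.4% with neither exposure, in line with the rounded values in the text.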

Like the analysis by Ramroth et al. (2004) [49], the analysis in this thesis focuses on the two adjustment variables smoking and alcohol. The results of Ramroth et al. (2004) [49] can be found in table 5.7 and will be used for comparison. Jenkner et al. (2016) [33] used this data set to illustrate the four strategies for the analysis of two covariates with a SAZ. This section is based on that paper and aims to illustrate the methods proposed in section 5.2. Differences in results and interpretative advantages will be highlighted.


5.4.2 Results

The data described above are analyzed separately with each of the four strategies. The analysis was performed in the total population using multivariable fractional polynomials with conditional logistic regression and the respective specific function selection procedure for the methods Bi-Sep, Bi-D3, Bi-D1 and Bi-Sub. The results from the analysis by Ramroth et al. (2004) [49] are used as a comparison. Results of the different methods can be found in the respective subsections.

Consider covariates with SAZ separately (Bi-Sep)

Using Bi-Sep, two binary indicators v1 and v2 are added to the full regression model. Then, the function

selection procedure described in 5.2.1 is applied to select a functional form for the relationship between

pack years and alcohol and the outcome “laryngeal cancer”. The selected functions for the positive parts of

both covariates (FP(0) for pack-years, linear for alcohol) are kept in the final model. The indicators did

not improve the fit of the overall model and are, therefore, removed from the final model. In table 5.8,

the details of the estimated model are given. In figure 5.5a, the functional relationships are plotted. In

the graph, the ORs for smoking and alcohol as categorized variables are also included as green bubbles.

These categories are defined in steps of three pack years or 30g/day respectively. The bubbles for the

categories represent the observed data and allow a graphical assessment of the fit of the function. The size

of the bubbles is set according to the numbers of observations in the category. The estimated functional

relationship is very close to the ORs of the categorized data. The large bubble for non-smokers or non-drinkers lies at a log OR of almost zero, which might explain why the binary indicator is dropped: its estimate is not significantly different from the continuation of the functional estimate of the non-zero observations. The results of Bi-Sep are similar to those of the original analysis performed by Ramroth et al. [49]. Their result was, however, presented for categories. Both non-smokers and non-drinkers had a log odds ratio of zero. Details can be found in table 5.7. In figure 5.5a, the log odds ratio of these results is also included and labeled as “original analysis”.

Combination of dummy variables (Bi-D3) and Dummy 1 (Bi-D1)

In the Bi-D3 analysis, the selected model and the estimates for the continuous part are similar to the

model derived in Bi-Sep. The same functional forms are chosen for smoking and alcohol intake, and the

coefficients are nearly identical. It should be noted that non-significant dummies are not eliminated. The

distribution of zero and non-zero values in the example can be seen in table 5.6. Due to the small sample

[Figure 5.5, panels (a) to (c): each panel plots the log odds ratio against packyears (0 to 100) and against alcohol in g/day (0 to 300). Legend entries: odds ratios for categories with unit 3 packyears or 30 g alcohol per day, the fitted functions of the respective method, the binary indicators with CI, and the original analysis.]

(a) Selected functions of Bi-Sep for pack years (right) and alcohol (left). The green bubbles represent the odds ratios for successive categories of an interval size of 3 pack-years. The size of the bubbles is set according to the numbers of observations in the respective category. The red function is taken from the original analysis by Ramroth et al. [49].

(b) Selected functions of Bi-D1 for pack years (right) and alcohol (left). The diamond represents the coefficient of the binary variable, thus, the point estimate for non-smokers, with the respective CI. The bubbles represent the odds ratios for successive categories of interval size of 3 packyears or 30g/day of alcohol. The size of the bubbles is set according to the numbers of observations in the respective category. The red function is taken from the original analysis by Ramroth et al. [49].

(c) Selected functions for z4-z7 of Bi-Sub. There are two continuous functions for each covariate: one for the exposed and one for the unexposed of the respective other covariate. The diamond represents the coefficient of the binary indicators z1, z2 and z3, with the respective CI. The bubbles represent the odds ratios for successive categories of interval size of 3 packyears or 30g/day of alcohol. The size of the bubbles is set according to the numbers of observations in the respective category. The red function is taken from the original analysis by Ramroth et al. [49].

Figure 5.5: Laryngeal cancer study: Comparison of methods


                     Coef.      Std.Err.   P
1. Bi-Sep
   vpack             dropped (p=0.51)
   Packyears (0)     0.809      0.079      <0.001
   valc              dropped (p=0.07)
   Alcohol (1)       0.005      0.001      <0.001
   Deviance          851.3
2. Bi-D3
   z1                0.709      1.117      0.525
   z2                0.620      0.349      0.076
   z3                0.303      0.519      0.560
   Packyears (0)     0.842      0.107      <0.001
   Alcohol (1)       0.006      0.001      <0.001
   Deviance          847.7
3. Bi-D1
   z1                0.410      1.080      0.704
   Packyears (0.5)   0.495      0.500      <0.001
   Packyears (2)     -0.0001    0.324      0.001
   Alcohol (1)       0.005      0.001      <0.001
   Deviance          847.1
4. Bi-Sub
   z1                0.775      1.121      0.489
   z2                0.341      0.582      0.559
   z3                2.754      0.771      <0.001
   z4 (1)            0.020      0.018      0.253
   z5 (1)            0.007      0.005      0.176
   z6 (0)            0.861      0.111      <0.001
   z7 (1)            0.006      0.001      <0.001
   Deviance          849.1

Table 5.8: Laryngeal cancer study: Bivariate Spike Model - Results of Bi-Sep, Bi-D3, Bi-D1 and Bi-Sub. The selected FP power values are given in brackets after the variable name.


sizes in categories with discrepant zero/positive values, it is likely that stronger effects are present, but

cannot be detected and are therefore non-significant. For the positive observations, a common functional

relationship is built (cf. Bi-Sub which allows different functional relationships for observations in B and

D or C and D). The analysis of Bi-D1 is illustrated in figure 5.5b. The functional relationships of smoking

and alcohol intake can be seen for a model with only one additional dummy. Again, the odds ratios

for smoking and alcohol as categorized variables with 3 pack-years or 30g/day intake as a reference are

included. The estimator of the binary indicator is also included in the graph with its respective confidence

interval (CI). As the CI includes 0, the indicator is not significant. This is reasonable as the log OR of the

categorized values is also zero for both non-smokers and non-drinkers. The detailed model descriptions with coefficient values can be found in table 5.8. Visually, the functional relationship is very similar to the one in the Bi-Sep approach, as the dummy z1 is not significant. The selected functional form (FP(0.5, 2)) is, however, more complex than the functions selected by Bi-Sep and Bi-D3. The number of observations which take zero for both alcohol and smoking is very low (only 1.7% in controls and 0.7% in cases). The approach might therefore not be suited to this distributional situation. It might be preferable if there is a strong relationship between both or at least one of the binary indicators and the outcome, and if the number of observations in category A is relatively high.

Consider submodels (Bi-sub)

In the laryngeal cancer study, x1 = smoking and x2 = alcohol intake were chosen. For the analysis, the

variables as defined in table 5.2 in section 5.2.4 are generated. Then, the function selection procedure

described in section 5.2.4 is applied. In Bi-Sub, a functional form can be selected for every category of

observations with non-zero values using conditional logistic regression (table 5.8). This conditional logistic

regression model describes every category with a submodel. It is not necessary to collapse categories in

order to model the continuous observations. This is of course not the case for Bi-D3 and Bi-D1, which

demonstrates the superior flexibility of this approach. However, if there are only few observations in the

subcategories, the chosen models might not be very stable. One solution would be the selection of a

functional relationship for the biggest category. This will, in most cases, be the category in which both

variables are positive. After the function selection for the covariate of this category, the FP form of the

covariate for the other categories could be restricted, specifically to those power terms selected before.

Only new beta coefficients would be estimated. In figure 5.5c, it can be seen that for the variable pack

years, a different functional relationship is selected for non-drinkers (if alcohol is zero), compared to the

individuals who drink alcohol, for whom both covariates are positive. A linear function is selected for individuals with alcohol zero (non-drinkers), and a logarithmic function (FP(0)) for the part in which both variables are positive. For alcohol, the functional form is linear in both categories. The estimates for z5 and z7 are also very similar. This could be interpreted as indicating that alcohol increases the risk of laryngeal cancer even if no cigarettes are consumed. As there are very few observations (3.4% in cases, 6.6% in controls) for z5, the point-wise confidence interval is very wide.
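The construction of the Bi-Sub variables can be sketched as follows. This is only an illustrative sketch: table 5.2 is not reproduced here, so the definitions of z1 to z7 are inferred from the legend of figure 5.5c and are an assumption, and the function name bi_sub_variables is hypothetical. Here x1 stands for pack years and x2 for alcohol intake.

```python
import numpy as np

def bi_sub_variables(x1, x2):
    """Build indicator and continuous sub-variables for two SAZ covariates.

    Assumed definitions (inferred from the legend of figure 5.5c):
      z1: indicator, x1 = 0 and x2 = 0
      z2: indicator, x1 > 0 and x2 = 0
      z3: indicator, x1 = 0 and x2 > 0
      z4: x1 for observations with x2 = 0 (else 0)
      z5: x2 for observations with x1 = 0 (else 0)
      z6: x1 for observations with both covariates positive (else 0)
      z7: x2 for observations with both covariates positive (else 0)
    """
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    both_zero = (x1 == 0) & (x2 == 0)
    only_x1 = (x1 > 0) & (x2 == 0)
    only_x2 = (x1 == 0) & (x2 > 0)
    both_pos = (x1 > 0) & (x2 > 0)
    return {
        "z1": both_zero.astype(int),
        "z2": only_x1.astype(int),
        "z3": only_x2.astype(int),
        "z4": np.where(only_x1, x1, 0.0),
        "z5": np.where(only_x2, x2, 0.0),
        "z6": np.where(both_pos, x1, 0.0),
        "z7": np.where(both_pos, x2, 0.0),
    }

# Four example individuals: unexposed, smoker only, drinker only, both.
z = bi_sub_variables([0, 12, 0, 30], [0, 0, 40, 50])
```

Each observation contributes to exactly one indicator or one pair of continuous variables, which is why a separate functional form can be selected per category.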

Summary

The risk factors smoking and alcohol on laryngeal cancer were analyzed using the four bivariate methods

for covariates with SAZ proposed in this thesis. Bi-D1 leads with 847.1 to the lowest deviance. However,

the most complex functional form for pack years (FP 0.5 2) is selected by this method. Bi-Sub leads to

the overall most complex model with respect to degrees of freedom, as in total 7 variables are included

in the full model, and for 4 of them a functional form can be selected which might be up to an FP2

function. This flexibility can be helpful in some situations. The study quoted in the introduction by

Fehringer et al. (2017) [22] used a subset of non-smokers to estimate the risk of alcohol consumption on lung cancer. Using Bi-Sub, a separate functional form is estimated for this group of individuals in a complete model which also contains smokers. Bi-D3 and Bi-Sep lead to similar results. Both selected the same functional form for pack years (FP(0)) and alcohol (linear). The estimated coefficients β11 and β12 are almost identical. The deviance is slightly higher for Bi-Sep (851.3) than for Bi-D3 (847.7). However, as Bi-Sep allows variable selection whereas in Bi-D3 all three indicators z1, z2 and z3 are kept in the model, Bi-Sep leads to the sparsest model. In this situation, Bi-Sep might therefore be the best choice.

Again, it can be seen that an important influencing factor for the choice of the appropriate method seems to be the relationship between observations with value zero and the outcome or, in this setting, the difference in risk for laryngeal cancer between observations with zero and observations with a relatively small value for pack years or alcohol. Including a binary indicator as e.g. in Bi-Sep leads to risk estimation for positive values

excluding the unexposed individuals. That means that the reference for risk estimation are observations

with the lowest dose. This separation is reasonable if it can be assumed that there is a substantial

difference between exposed and unexposed observations as it implies a non-continuous functional form

with a jump at zero. If the outcome (risk) for unexposed can be assumed as the continuation of low doses

this separation does not improve the fit considerably. For a continuous endpoint, this behavior can be

more easily observed as e.g. in section 5.3. Therefore, the systematic comparison of the proposed methods

in the next section will be performed in the linear regression model with a continuous endpoint.


5.5 Simulation Study - Assessment of the proposed procedures


This section describes in detail a simulation study conducted to evaluate and compare the four different

strategies to analyze data situations with two variables with a spike at zero. It is structured according to

the guidelines on how to report a simulation study given in Burton et al. (2006) [11].

5.5.1 Basic Specifications and Objectives of the Simulation Study

Questions and Aims

The aim of this simulation study is to compare the four bivariate strategies proposed in section 5.2

and published in Jenkner et al (2016) [33]. They were constructed using different assumptions about

the distribution of the covariates and their true relationship to a continuous dependent variable. Thus,

there are already some obvious advantages and disadvantages due to the design of the methods. The

“goodness” of the model fit is, as stated before, dependent on the aim of model building. In explanatory

model building finding the true relationship is one of the essential aims. This cannot be easily verified in

real data situation. In simulations, however, the true relationship is known. Therefore, this simulation

study will investigate the properties of the given strategies within the linear model (continuous outcome)

varying different influential components in the data. The linear model is helpful for illustration as results

can be easily plotted and interpreted. Continuous endpoints are often found in medical research e.g. in

the evaluation of quality of life, or in indication specific questionnaires. The four proposed methods for

analysis will be compared to two references, first, the approach proposed by Robertson et al (1994) [50]

assuming linear relationships for the positive non-zero observations, and to standard MFP (Royston and

Altman (1994) [51]) without inclusion of a binary indicator. The following research questions will be of

interest:

1. Influence of the effect size on the need of including binary indicators - linear functional relationship. The first question is how the effect size of the binary indicator influences model choice. In

the artificial examples in section 5.3 it could be seen that for datasets with a true underlying large effect

of one of the indicators, the model fit (regarding MSE) using FP-Spike significantly improves compared

to the standard analysis using linear regression. The following scenarios will investigate this size further

and compare several different effect sizes. Using different combinations of effect sizes and thus different

relationships between the covariate and outcome its influence on model fit will be assessed. The true

functional relationship between positive observations and the outcome will in the first settings be linear.



Details and further specifications will be found in section 5.5.2. Is there an influence of the true effect

size of the binary indicator on the overall quality of the model measured by R2, MSE and MSE in four

categories? And furthermore, is there an influence of the effect size on the functional form selected by

the MFP function selection procedure? The frequencies of the selected function will be compared. Using

a very simple functional relationship for the non-zero observations makes it possible to focus on the sole

effect of the binary indicators.

2. Influence of the effect size on the need of including binary indicators - non-linear functional relationship. Using linear functional relationships for the non-zero parts, as in investigation 1, is the easiest

situation. In reality, however, this assumption might not always be correct. It will be investigated how

different non-linear relationships combined with varying effect sizes of the binary indicators influence the

model fit using R2, MSE and MSE in four categories. It is especially of interest how model fit of the

proposed strategies varies compared to standard MFP. The complexity of the fitted function and the

chosen FPs will be compared. Details will be found in section 5.5.2.

3. Influence of the number of observations in the zero subsets Varying the amount of observations

in the subsets specified in table 5.9 may influence the model fit and the performance of the different

analysis methods. Model fit will be compared using two true functional relationships and data with

varying percentages of zero values. The goodness of model fit will be evaluated using R2, MSE and MSE

in categories.

Simulation procedure

The simulation study is conducted using Stata. Stata uses the KISS generator introduced by George

Marsaglia for producing random numbers [43]. For the different scenarios the same seed (201112) was

used. Thus, the datasets of the scenarios are not independent. An overview of the simulation procedure

can be found in figure 5.6.

Procedure for generating datasets

The datasets were simulated in Stata using drawnorm, a command which draws a sample from a multivariate normal distribution and for which the desired means µi, i ∈ {1, 2}, and standard deviations Σ have to be specified. Furthermore, one can specify a correlation matrix ρ. For the different scenarios, different means, standard deviations and correlation matrices can be chosen.
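The drawing step can be illustrated outside Stata. The following is a minimal sketch in Python with NumPy, not the code used for the thesis. The function name draw_bivariate_normal is hypothetical, the parameter values are taken from the specifications of scenario S1, and NumPy's default generator differs from Stata's, so the seed 201112 will not reproduce the original datasets.

```python
import numpy as np

def draw_bivariate_normal(n, means, sds, corr, seed=201112):
    """Draw n observations of (x1, x2) from a bivariate normal distribution.

    means, sds: length-2 sequences; corr: 2x2 correlation matrix.
    """
    rng = np.random.default_rng(seed)
    sds = np.asarray(sds, dtype=float)
    corr = np.asarray(corr, dtype=float)
    # Covariance matrix from standard deviations and correlations:
    # cov[i, j] = sd_i * sd_j * corr[i, j]
    cov = np.outer(sds, sds) * corr
    return rng.multivariate_normal(means, cov, size=n)

# Example with the values of scenario S1: means (25, 30), SDs (5, 7),
# uncorrelated covariates, 1000 observations.
x = draw_bivariate_normal(1000, means=(25, 30), sds=(5, 7),
                          corr=[[1, 0], [0, 1]])
```

Passing standard deviations and a correlation matrix separately mirrors the way the inputs are described above; the covariance matrix is only built internally.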



[Flow chart; the following steps are repeated for r runs:]

1. Generate bivariate normally distributed datasets
2. Insert proportion of observations with value zero
3. Transform range of data values to mimic real covariate distributions
4. Generate outcome variable with the true functional relationship
5. Analyze dataset with each of the procedures
6. Save estimates, standard deviation, confidence intervals, p-values, R2, mean square error

Figure 5.6: Simulation Study - Comparison of methods for two covariates with SAZ: flow chart of the simulation procedure.

                   x1
                   0      > 0
x2       0         A      C
         > 0       B      D

Table 5.9: Distribution of covariates x1 and x2 distinguishing observations with value zero and positive observations.



After generating two normally distributed variables, the proportion of zero values is set according to the

chosen values q1, q2 and q3 which specify the percentage of observations with value zero in categories A, B

and C defined in table 5.9. To ensure that the observations set to zero are chosen randomly, a uniformly distributed variable “equidist” was generated, and x1 was replaced with zero if equidist ≤ q1 + q2, and x2 with zero if equidist ≤ q1 or if q1 + q2 < equidist ≤ q1 + q2 + q3. Possible negative values which might have been generated are set to zero as well. The data are then transformed to mimic the values of pack-years and alcohol intake in the laryngeal cancer study, as those two are prominent examples of SAZ variables. Depending on the scenario and, thus, on the questions and aims, a true functional relationship for the outcome variable is chosen.
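The zero-insertion rule described above can be sketched as follows. This is an illustrative Python version under the assumption that a single uniform draw per observation decides its category; the function name insert_spike_at_zero is hypothetical, and the transformation to the range of pack-years and alcohol intake is not included because its details are not given here.

```python
import numpy as np

def insert_spike_at_zero(x1, x2, q1, q2, q3, rng):
    """Set a random share of observations to zero.

    q1, q2, q3 are the proportions for categories A (both zero),
    B (only x1 zero) and C (only x2 zero) of table 5.9.
    """
    x1 = np.array(x1, dtype=float)  # copies, inputs stay untouched
    x2 = np.array(x2, dtype=float)
    u = rng.uniform(size=len(x1))   # plays the role of "equidist"
    in_a = u <= q1
    in_b = (u > q1) & (u <= q1 + q2)
    in_c = (u > q1 + q2) & (u <= q1 + q2 + q3)
    x1[in_a | in_b] = 0.0           # x1 = 0 if equidist <= q1 + q2
    x2[in_a | in_c] = 0.0           # x2 = 0 in categories A and C
    # Negative draws from the normal distribution are set to zero as well.
    x1[x1 < 0] = 0.0
    x2[x2 < 0] = 0.0
    return x1, x2

# Example with the proportions of scenario S1 on constant positive inputs.
rng = np.random.default_rng(1)
n = 100000
a, b = insert_spike_at_zero(np.full(n, 10.0), np.full(n, 20.0),
                            0.25, 0.05, 0.2, rng)
```

Because the three intervals of the uniform variable are disjoint, the categories A, B and C do not overlap and their expected shares are q1, q2 and q3.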

The number of simulation runs is set to a default of 500. All analyses are descriptive: potential differences in the comparison criteria R2, MSE and MSE in categories are as yet unknown, and no formal inference is planned, so the analyses can be regarded as exploratory. Therefore, the number of runs was not derived from a formal calculation but chosen arbitrarily, large enough to also observe small changes in the comparison criteria.

In each run, the estimates of the chosen model are stored with the respective standard error. P-values

and confidence intervals can, thus, be calculated with the respective test statistic. For each variable the

chosen FP power terms are saved. Furthermore, for each model Deviance, R2, MSE and MSE in four

categories are stored.
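The comparison criteria can be illustrated with a short Python sketch. It assumes that MSE denotes the mean squared difference between observed and fitted outcome and that “MSE in four categories” is the same quantity computed within categories A to D of table 5.9; the exact definitions used in the thesis may differ, and the function name fit_criteria is hypothetical.

```python
import numpy as np

def fit_criteria(y, y_hat, x1, x2):
    """Return R2, overall MSE and MSE within categories A-D (assumed definitions)."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    sq_err = (y - y_hat) ** 2
    r2 = 1.0 - sq_err.sum() / ((y - y.mean()) ** 2).sum()
    categories = {
        "A": (x1 == 0) & (x2 == 0),
        "B": (x1 == 0) & (x2 > 0),
        "C": (x1 > 0) & (x2 == 0),
        "D": (x1 > 0) & (x2 > 0),
    }
    out = {"R2": r2, "MSE": sq_err.mean()}
    for name, mask in categories.items():
        # An empty category has no defined MSE.
        out["MSE_" + name] = sq_err[mask].mean() if mask.any() else float("nan")
    return out

# Tiny example with one observation per category.
crit = fit_criteria(y=[1, 2, 3, 4], y_hat=[1, 2, 3, 6],
                    x1=[0, 0, 1, 1], x2=[0, 1, 0, 1])
```

Splitting the MSE by category shows whether a method fits the zero observations worse than the positive ones, which an overall MSE can hide.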

5.5.2 Results

To answer the questions specified in 5.5.1, different distributional settings will be used to compare the

four methods to two reference strategies. The basic settings which will be the same in all further scenarios

are specified in table 5.10. The sample size is fixed to 1000. The basis for the covariate distribution is the

normal distribution which is then extended to a distribution with a spike at zero. Further specifications

and the factors varied for investigation are described in the following sections as they are dependent on

the specific research question.

                            Values
Number of observations      1000
Distribution of xi          normal
Simulation runs             500

Table 5.10: Simulation Study - Comparison of methods for two covariates with SAZ: Standard settings for all investigations if not specified differently in the respective scenario.



Scenarios            Short description

Investigation 1a
S1-S5                - symmetrically varied effect sizes of the binary indicator variables v1 and v2: (30,30), (25,25), (20,20), (15,15), (10,10)
                     - linear functional relationship for x1 and x2

Investigation 1b
S6-S10               - asymmetrically varied effect sizes of the binary indicator variables v1 and v2: (10,30), (10,25), (10,20), (10,15), (10,5)
                     - linear functional relationship for x1 and x2

Investigation 1c
S11-S14              - asymmetrically varied effect sizes of the binary indicator variables z1, z2 and z3: (10,50,10), (50,10,50), (30,50,30), (20,40,50)
                     - linear functional relationship for x1 and x2

Investigation 2
S15-S19              - symmetrically varied effect sizes of the binary indicator variables v1 and v2: (30,30), (25,25), (20,20), (15,15), (10,10)
                     - nonlinear functional relationship for x1 and x2

Investigation 3a
S20-S24              - varying numbers of observations in categories A, B, C: (0.1, 0.05, 0.15), (0.2, 0.1, 0.1), (0.1, 0.2, 0.2), (0.4, 0.05, 0.05), (0.3, 0.2, 0.2)
                     - linear functional relationship for x1 and x2
                     - constant effect of the binary indicator variables z1, z2 and z3: (20,40,50)

Investigation 3b
S25-S29              - varying numbers of observations in categories A, B, C: (0.1, 0.05, 0.15), (0.2, 0.1, 0.1), (0.1, 0.2, 0.2), (0.4, 0.05, 0.05), (0.3, 0.2, 0.2)
                     - nonlinear functional relationship for x1 and x2
                     - constant effect of the binary indicator variables v1 and v2: (25,25)

Table 5.11: Overview of all scenarios of the simulation study. The basic specifications, like the number of observations per dataset, the number of simulation runs, and the distribution of x1 and x2 for linear and non-linear outcomes, are the same in all scenarios. Detailed specifications are given in the respective section.

These are e.g. the true functional relationship between the outcome and the two covariates with a

SAZ including the effect sizes and the proportion of zero values q1, q2 and q3. The three values for the

proportion of zero values given in the tables with further specifications of the scenarios correspond to the

proportion of zero values in the categories A, B and C in table 5.9.

Investigation 1: influence of the effect size on the relevance of including binary indicators

Fourteen different scenarios will be used to investigate the influence of the effect size of the indicators on the model fit and on the functions selected by the MFP function selection procedure (cf. table 5.11). S1 - S5 will investigate two symmetrically varied effect sizes for v1 and v2 (Investigation 1a). S6 - S10 will investigate two asymmetrically varied effect sizes for v1 and v2 (Investigation 1b) and S11 - S14 (Investigation 1c)

three varied effect sizes for z1, z2 and z3. In all scenarios the distribution of the amount of zero values



SPECIFICATIONS 1a                 S1        S2        S3        S4        S5
Means x1, x2: µ := (µ1, µ2)       (25,30)
Standard deviation                (5,7)
(qA, qB, qC)                      (0.25, 0.05, 0.2)
Covariance ρ                      (1 0 / 0 1)
True functional relationship      y = β01 v1 + f(x1) + β02 v2 + f(x2)
f(x1)                             2x1
f(x2)                             2x2
(β01, β02)                        (30,30)   (25,25)   (20,20)   (15,15)   (10,10)

Table 5.12: Simulation Study - Comparison of methods for two covariates with SAZ: specifications of scenarios S1 - S5 (Investigation 1a). Effect size is varied symmetrically, continuous functional relationship is linear.

and the covariance structure is kept fixed. Detailed specifications and assumptions can be found in tables

5.12, 5.15, and 5.18. The true functional relationship is set as a linear functional relationship in both

covariates in all scenarios of this investigation in order to conclude that the observed effects are due to

the effect size of the binary indicators. The results will not be described in detail for each scenario. In

every investigational setting, one interesting scenario is chosen for which the results will be presented in

detail. The results of the comparison of the different methods of analysis in the different scenarios of one research question are similar, as the main difference is the increasing effect size of the binary indicators. The overall behavior of the methods does not change. Furthermore, the differences in the results of the connected scenarios of one investigation will be described.

1a. Effect size varied symmetrically using two dummy indicators First, the influence of size of

the effect of the binary indicator on model fit of the different techniques presented and of standard

techniques like standard linear regression and fractional polynomial models without any adjustments will

be compared. In five scenarios the size of the effect of the binary indicator for both x1 and x2 is decreased

in steps of 5 from 30 to 10. The cardinality of the subsets of observations is kept fixed, i.e. qA, qB , and

qC remain the same in each scenario. The functional relationship between the positive observations of the

covariate and the continuous outcome is linear with β11 = β21 = 2. All detailed specifications can be found

in table 5.12.
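The true relationship of Investigation 1a, y = β01 v1 + 2 x1 + β02 v2 + 2 x2, can be sketched in Python as follows. Two points are assumptions and not taken from the specifications: the indicators v1 and v2 are coded as 1 for observations with value zero (suggested by the later remark that the effect for observations with x1 = 0 and x2 = 0 is the sum of both indicator effects), and the residual standard deviation sigma is a hypothetical parameter, as the error distribution is not stated in this section. The function name simulate_outcome is hypothetical as well.

```python
import numpy as np

def simulate_outcome(x1, x2, beta01, beta02, sigma, rng):
    """Generate y = beta01*v1 + 2*x1 + beta02*v2 + 2*x2 + error.

    v1, v2: assumed to indicate x1 = 0 and x2 = 0, respectively.
    sigma: hypothetical residual standard deviation.
    """
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    v1 = (x1 == 0).astype(float)
    v2 = (x2 == 0).astype(float)
    error = rng.normal(0.0, sigma, size=x1.shape)
    return beta01 * v1 + 2.0 * x1 + beta02 * v2 + 2.0 * x2 + error

# Noise-free example for scenario S1 (beta01 = beta02 = 30) to show the jump at zero.
rng = np.random.default_rng(0)
y_true = simulate_outcome([0, 10, 10], [0, 0, 5], 30, 30, sigma=0.0, rng=rng)
```

With sigma set to zero the function returns the true relationship itself, which makes the size of the jump between unexposed and exposed observations directly visible.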

Scenario 1 was chosen for a detailed look and comparison of the six methods in this setting. A first

impression of the generated datasets can be seen in figure 5.7. Several observations take value zero. The

overall linear relationship in both covariates can be guessed. The empirical Pearson correlation for x1 and

the outcome y-S1 in the complete data (with all 500 datasets) is 0.66. For x2 and the outcome y-S1, the

Pearson correlation is 0.78. The covariate x1 and x2 have a Pearson correlation of 0.45. For each of the

generated datasets six regression models were fit using the techniques described above. All four different

[Figure 5.7: Scenario S1, scatter plots of example datasets. For each of the runs 1 to 3, the outcome y - S1 (0 to 200) is plotted against x1 (0 to 40) and against x2 (0 to 50).]

Figure 5.7: Simulation Study - Comparison of methods for two covariates with SAZ: Scatter plots of data of scenario S1 of the first three runs.

strategies for two covariates with SAZ (Bi-Sep, Bi-D1, Bi-D3 and Bi-Sub) were used to build a regression

model. Furthermore, standard linear regression with binary indicator (Rob) and MFP (Ref) were applied

as well.

To assess the influence on the function selection procedure, some first impression of the fitted functions

can be gained in figure 5.8. The fitted functions for the first 20 runs are given separately for covariates

x1 and x2. In each functional plot a mean value of the respective other covariate is added (as the actual

functional relationship would be two dimensional). In this scenario the effect of both binary indicators

was high (β01 = β02 = 30). As the functional relationship is linear, one can see that most of the approaches perform relatively well. The approach by Robertson et al., Bi-D3 and Bi-Sep lead to very similar results. Bi-D1 chooses an FP1+z (4.6%) or, in most cases, an FP2+z (95.4%) function for x1. It leads to similar results for x2. As Stata makes it impossible to omit data, the functional plots start at x = 6 in order to see the

[Figure 5.8: Selected functions - S1. One row of panels per method (Robertson, MFP, Bi-D1, Bi-D3, Bi-Sep, Bi-Sub); each row plots y (-100 to 300) against x1 (0 to 50) and against x2 (0 to 80).]

Figure 5.8: Simulation Study - Comparison of methods for two covariates with SAZ: The first twenty fitted functions for each method of data of scenario S1 are plotted. For each of the covariates a mean value of the respective other covariate transformed with the fitted functional form is added. In the plot of Bi-D1 the coefficients of z1 are given in black with their respective 95% confidence interval. Coefficients for z2 and z3 are given in red and blue. The same colors are used for z1, z2 and z3 in the Bi-Sub plots. The functional relationships for z4 and z5 are given in green.

differences more clearly. Otherwise this one function dominates the choice of the axis. Bi-D1 has some

difficulties modeling the effect of the zero values, as it is only able to model observations which are zero

for both x1 and x2 with the coefficient for the covariate z1. For all remaining observations with zero in

only one of the covariates the continuous functional relationship has to model values correctly. Again,

functional relationships are chosen which tend to infinity close to zero. The amount of observations in the

common category A is relatively high compared to those in category B and C. Furthermore, the effect of

both indicators is the same. Thus, Bi-D1 can nevertheless lead to relatively acceptable results.

Even if most of the results look similar visually, a detailed look at the functions actually selected shows

that there is high variation between the methods. The method proposed by Robertson does obviously

lead to the correct model as a linear functional form is always estimated. The estimates for v1 (M:30.1,

SD: 2.5) and v2 (M:29.9, SD: 2.3) are very close to the correct model. The second reference method,

MFP, selects various different types of functional relationships but never the true linear one. In 99.4 %



Selected Functional Forms in Scenario S1

X1
             Rob      Ref      Bi-D1    Bi-D3    Bi-Sep   Bi-Sub Z4   Bi-Sub Z6
Lin          -        0        -        -        0        -           -
FP1          -        0.6      -        -        0        -           -
FP2          -        99.4     -        -        0        -           -
Lin + z      100      -        0        98.6     98.4     97.8        97.8
FP1 + z      -        -        4.6      1.0      1.0      1.6         1.2
FP2 + z      -        -        95.4     0.4      0.6      0.6         1.0
z            -        -        -        -        0        -           -

X2
             Rob      Ref      Bi-D1    Bi-D3    Bi-Sep   Bi-Sub Z5   Bi-Sub Z7
Lin          -        0        -        -        0        -           -
FP1          -        2.4      -        -        0.4      -           -
FP2          -        97.6     -        -        0        -           -
Lin + z      100.0    -        0        98.2     98.0     99.0        98.6
FP1 + z      -        -        2.2      0.8      0.6      0.8         0.6
FP2 + z      -        -        97.8     1.0      1.0      0.2         0.8
z            -        -        -        -        0        -           -

Table 5.13: Simulation Study - Comparison of methods for two covariates with SAZ: Comparison of selected functional forms in scenario S1

of the datasets FP2 functions are selected for x1 (cf. table 5.13), mostly FP2(0.5, 0.5) (108x), FP2(0.5, 1) (95x) and FP2(0, 0.5) (80x). Several further types were selected, which gives the impression that the method is very unstable. Similar results are found for x2: in 97.6% of the datasets an FP2 function was selected, highly

varying in the actual chosen functional form. Bi-D1 also chooses complex functional forms. The true

linear functional form was never selected. Varying FP2 functions such as FP2(0.5 0.5) (209x), FP2(0.5 1)

(109x) or FP2(0.5 1) (100x) were chosen for x1 which leads to the conclusion that estimation is again very

unstable. The mean estimate for the indicator z1 is 0.1 (SD: 2.3). Bi-Sep selects the correct functional

form in 98.4% of the datasets (492x). The estimated functional form for x2 was also correct in 98.0% of

the datasets. The estimates for v1 (M:29.9, SD:12.1) and v2 (M:20.0, SD:6.7) are very close to the ones of

the correct function. Thus, except for some single datasets, function selection is very stable with Bi-Sep.

Similar results are found for Bi-D3. The correct functional form was chosen in 98.6% of the datasets for

x1 and in 98.2% of the datasets for x2. The estimates z1 (M:60.0, SD:9.6), z2 (M:30.3, SD:7.4) and z3

(M:29.7, SD:6.4) are also very precise compared to the true values. As z1 indicates that both x1 and x2

are zero, its estimate is the sum of the effects for x1 = 0 and x2 = 0, which gives 60. Bi-Sub leads again

to similar results as Bi-Sep and Bi-D3. The correct functional form is selected in 97.8% of the datasets for

both z4 and z6, in 99.0% of the datasets for z5 and in 98.6% for z7. The estimates z1 (M:60.2, SD: 15.0),

z2 (M:30.8, SD:19.0) and z3 (M:32.1, SD:21.4) are similar to those of Bi-D3 and Bi-Sep, however, the

standard deviations are considerably higher for the estimates in Bi-Sub. Due to the different functional

forms which were selected by most of the methods, a direct comparison of the β coefficients for x1 and x2

is not reasonable.
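The power pairs quoted above, such as FP2 with 0.5, 0.5, follow fractional polynomial notation. As a minimal sketch, assuming the standard conventions (power 0 denotes log x, and a repeated power p adds the term x^p log x), an FP1 or FP2 function for positive x could be evaluated as follows. The names `fp_terms` and `fp_value` and the example coefficients are illustrative only and are not taken from the thesis.

```python
import math

def fp_terms(x, powers):
    """Return the fractional polynomial basis terms of a positive x.

    Assumed conventions (standard FP notation, not restated in this section):
    power 0 means log(x); a repeated power multiplies the previous term by log(x).
    """
    if x <= 0:
        raise ValueError("fractional polynomials are defined for x > 0 only")
    terms = []
    previous_power = None
    for p in powers:
        if p == previous_power:
            # Repeated power: previous term times log(x).
            term = terms[-1] * math.log(x)
        else:
            term = math.log(x) if p == 0 else x ** p
        terms.append(term)
        previous_power = p
    return terms

def fp_value(x, powers, coefs, intercept=0.0):
    """Evaluate intercept + sum(coef_j * term_j) for hypothetical coefficients."""
    return intercept + sum(c * t for c, t in zip(coefs, fp_terms(x, powers)))

# FP2(0.5, 0.5) uses sqrt(x) and sqrt(x) * log(x); FP1(1) is the linear form.
print(fp_terms(4.0, (0.5, 0.5)))
print(fp_value(4.0, (1,), (2.0,)))  # the true linear function 2x at x = 4
```

Because log x is undefined at zero, such a function cannot be evaluated for the zero observations themselves, which is why the positive part and the spike have to be handled separately.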

Beside these differences in the actual chosen functional relationship and its complexity one can see that

the overall performance of the different methods of analysis is rather similar. Table 5.14 contains R2,

MSE and MSE in categories for this scenario. R2 values are identical for all procedures. The possibility

of choosing non-linear functional relationships compensates for the missing indicator variables for MFP.

For Bi-D1, one can see a slightly increased MSE. These differences result mainly from observations in



Comparison overall performance - Scenario S1

Method    R2            MSE             MSE_A           MSE_B           MSE_C           MSE_D
          Mean   Sd     Mean    Sd      Mean    Sd      Mean    Sd      Mean    Sd      Mean    Sd
Ref       0.80   0.01   398.01  19.38   396.98  36.06   394.68  75.75   396.91  38.96   399.30  26.44
Rob       0.80   0.01   397.18  19.16   396.79  36.03   393.67  75.33   396.17  38.96   398.13  26.29
Bi-Sep    0.80   0.01   397.81  19.57   396.84  36.03   394.48  76.12   396.63  39.10   399.10  27.14
Bi-D1     0.80   0.01   403.57  23.96   396.60  36.02   402.27  77.29   403.30  42.96   407.29  32.55
Bi-D3     0.80   0.01   396.67  19.20   396.60  36.02   388.60  75.08   395.79  38.95   397.87  26.35
Bi-Sub    0.80   0.01   395.78  19.18   396.60  36.02   380.06  74.68   394.25  38.96   397.56  26.35

Table 5.14: Simulation Study - Comparison of methods for two covariates with SAZ: Comparison of R2, MSE and MSE in categories in scenario S1

categories B, C and D, which seems reasonable, as Bi-D1 has to model the zero values in B and C in the

continuous functional relationship. Assuming a symmetrical effect for all observations with value zero

might be an advantage as the estimate for the common group of zero is identical. Furthermore, category

B, with 5% of the total number of observations, is relatively small.
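The split of the MSE into categories can be sketched as follows. This is a minimal illustration, assuming that the MSE is the mean squared difference between observed and fitted outcome and that the categories are coded as A (both covariates zero), B (only x1 zero), C (only x2 zero) and D (both positive); the function names are hypothetical.

```python
def category(x1, x2):
    """Assumed labels: A both zero, B only x1 zero, C only x2 zero, D both positive."""
    if x1 == 0 and x2 == 0:
        return "A"
    if x1 == 0:
        return "B"
    if x2 == 0:
        return "C"
    return "D"

def mse_by_category(x1, x2, y, y_hat):
    """Return the overall MSE and the MSE within each observed category."""
    sums, counts = {}, {}
    for a, b, obs, pred in zip(x1, x2, y, y_hat):
        for key in ("overall", category(a, b)):
            sums[key] = sums.get(key, 0.0) + (obs - pred) ** 2
            counts[key] = counts.get(key, 0) + 1
    return {key: sums[key] / counts[key] for key in sums}

# Toy data: one observation per category.
result = mse_by_category(
    x1=[0, 0, 3, 5], x2=[0, 4, 0, 6],
    y=[1.0, 2.0, 3.0, 4.0], y_hat=[0.0, 2.0, 5.0, 4.0],
)
print(result)
```

A method that models the zeros in B and C only through the continuous function, as Bi-D1 does, would show up here with larger values for the keys "B" and "C".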

In general, one can assume that, in this setting, choosing a more complex functional relationship is

sufficient, with regard to overall measures of fit, to model the underlying true relationship with two strong

effects of binary indicator variables. This is probably due to the fact that the true functional relationship

is relatively simple, as it is linear for both covariates and the effects of the binary indicators are symmetric.

All strategies perform similarly well with respect to MSE. Even Bi-D1, for which the underlying assumption

of only one common effect for all zero values is violated, and MFP, which is not capable of modeling any

non-continuous effect for zero values, do not perform worse than the other strategies regarding overall

measures of performance. However, the selected functions vary highly and the true functional form

was chosen neither by MFP nor by Bi-D1. Interpretation of the results might therefore not be

straightforward. Robertson, Bi-D3, Bi-Sep and Bi-Sub select simpler models with overall fewer degrees of

freedom for the functional forms, as all of them choose a linear functional relationship in most cases. As

Bi-Sep only includes two binary indicators, it is the sparsest model in this setting and might for this type

of data be preferable over Bi-D3 and Bi-Sub.
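The comparison of model complexity can be made concrete with a small count. The sketch below assumes the usual MFP convention of 1 degree of freedom for a linear term, 2 for FP1, 4 for FP2 and 1 for each binary indicator; how the thesis itself counts degrees of freedom is not restated in this section, so the numbers are illustrative. The listed models are the forms each method selected most often in scenario S1.

```python
# Degrees of freedom per term under the usual MFP convention (an assumption here).
DF = {"Lin": 1, "FP1": 2, "FP2": 4, "z": 1}

def model_df(forms, n_indicators):
    """forms: labels of the continuous parts; n_indicators: number of binary indicators."""
    return sum(DF[f] for f in forms) + n_indicators * DF["z"]

typical_s1 = {
    "MFP": model_df(["FP2", "FP2"], 0),        # no indicator available
    "Bi-D1": model_df(["FP2", "FP2"], 1),      # one common indicator z1
    "Robertson": model_df(["Lin", "Lin"], 2),  # v1, v2
    "Bi-Sep": model_df(["Lin", "Lin"], 2),     # v1, v2
    "Bi-D3": model_df(["Lin", "Lin"], 3),      # z1, z2, z3
    "Bi-Sub": model_df(["Lin"] * 4, 3),        # z1-z3 plus functions for z4-z7
}
print(typical_s1)
```

Under these assumptions Robertson and Bi-Sep need 4 degrees of freedom, Bi-D3 needs 5, and MFP and Bi-D1 need 8 and 9.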

Comparing this first scenario to scenarios 2 to 5, one can see that there are no great differences between

the behaviour of the analysis strategies. The higher the effect of the binary indicator, the higher the MSE.

Figure 5.9b shows the overall MSE in scenarios S1-S5 and illustrates how the error decreases with the effect size. Figure 5.9a

shows the MSE in the four categories of observations. Not many differences can be seen in the different

categories of observations in these examples. Bi-D1 has a slightly higher MSE in categories B and D;

the difference decreases, however, with decreasing effect of the indicator variables. All methods except

for Bi-D1 are capable of modeling the underlying true functional relationship as already described for



SPECIFICATIONS 1b                  S6       S7       S8       S9       S10
Means x1, x2: µ := (µ1, µ2)        (25,30)
Standard deviation                 (5,7)
(qA, qB, qC)                       (0.25, 0.05, 0.2)
Covariance ρ                       (1 0 / 0 1)
True functional relationship       y = β01v1 + f(x1) + β02v2 + f(x2)
f(x1)                              2x1
f(x2)                              2x2
(β01, β02)                         (10,30)  (10,25)  (10,20)  (10,15)  (10,5)

Table 5.15: Simulation Study - Comparison of methods for two covariates with SAZ: specifications of scenarios S6-S10 (Investigation 1b). Effect size is varied asymmetrically, continuous functional relationship is linear

scenario S1. It is surprising, however, that Ref performs well in all five scenarios independent of the size

of the effect of the binary indicators even though it is not capable of modeling a non-continuous binary

effect. The smaller the effect of the binary indicator, the more often MFP also selects less complex models

and even the correct functional form. In Scenario 5, for example, MFP selects the correct functional form for x1

in 50.2% of the datasets and for x2 in 52.4% of the datasets. The other half of the estimated models are

still mostly FP2 models. The interpretation of this strong non-linearity, however, is rather complex. If

the aim of model building is the explanation of a true underlying relationship, these MFP models might

be more difficult to interpret. The same behaviour is found for Bi-D1. In Scenario 5, the correct linear

functional form was selected in 65.8% of the datasets for x1 and in 55.8% for x2. For very small effects

of the binary indicator, Bi-Sep tends to drop the indicator completely as it is the only method capable

of doing so. In Scenario 5, where the effect β01 and β02 is only 10, this can be observed in 29.4% of the

datasets for the selected forms for x1 and in 27.6% for x2. In Scenario 4, with an effect of 15 for β01 and

β02, it is dropped in 1.4% of the datasets for v1 and in 0.8% for v2. The behaviour of Bi-D3 and Bi-Sub

does not differ considerably in the different scenarios.

1b. Effect size varied asymmetrically using two dummy indicators

As a second question, it will be

investigated how the methods behave if the effects of the two indicators are different. In Scenario S6 -

S10, the difference of the two effects decreases in steps of five starting with effects (β01, β02) = (10, 30).

As in investigation 1a, the cardinalities of the subsets of observations qA, qB and qC are kept fixed. They

are, however, not equal. In scenarios S1 - S5, this aspect might not have influenced model fit. In the next

5 scenarios, it is important to remember that the binary indicator of the category with the respectively

larger effect contains only 5% of the observations. Increasing this number might lead to different results.

The functional relationship of the positive observations is again chosen to be linear with β11 = β12 = 2 in order

to make these scenarios comparable to scenarios S1-S5. Further specifications can be found in table 5.15.
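To make the setup of scenarios S6-S10 more tangible, the following sketch generates one dataset from the true model of table 5.15. Several details are assumptions rather than facts from this section: category B is taken to mean "only x1 is zero" and C "only x2 is zero", positive values are drawn as independent normals (redrawn if not positive), and the sample size `n` and the error standard deviation `noise_sd` are hypothetical placeholders.

```python
import random

def simulate_dataset(n, beta01, beta02, q=(0.25, 0.05, 0.20),
                     means=(25.0, 30.0), sds=(5.0, 7.0),
                     slope=2.0, noise_sd=20.0, seed=1):
    """One dataset with y = beta01*v1 + slope*x1 + beta02*v2 + slope*x2 + error.

    noise_sd and n are hypothetical; the category coding is an assumption.
    """
    rng = random.Random(seed)
    n_a, n_b, n_c = (round(n * share) for share in q)
    # Fixed cardinalities of the categories, as described in the text.
    labels = ["A"] * n_a + ["B"] * n_b + ["C"] * n_c + ["D"] * (n - n_a - n_b - n_c)

    def positive_draw(mean, sd):
        value = rng.gauss(mean, sd)
        while value <= 0:  # keep the continuous part strictly positive
            value = rng.gauss(mean, sd)
        return value

    rows = []
    for label in labels:
        x1 = 0.0 if label in ("A", "B") else positive_draw(means[0], sds[0])
        x2 = 0.0 if label in ("A", "C") else positive_draw(means[1], sds[1])
        v1, v2 = int(x1 == 0), int(x2 == 0)  # spike-at-zero indicators
        y = (beta01 * v1 + slope * x1 + beta02 * v2 + slope * x2
             + rng.gauss(0.0, noise_sd))
        rows.append({"cat": label, "x1": x1, "x2": x2, "v1": v1, "v2": v2, "y": y})
    return rows

data = simulate_dataset(n=1000, beta01=10, beta02=25)  # effects as in scenario S7
print({c: sum(r["cat"] == c for r in data) for c in "ABCD"})
```

With these shares, only 5% of the observations carry information on category B, which is the point the text makes about the small category.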



[Figure 5.9 (plots not reproduced): MSE of Rob, Ref, Bi-D1, Bi-D3, Bi-Sep and Bi-Sub in scenarios S1-S5, outside values excluded. (a) MSE in categories, with panels Category A, Category B, Category C and Category D; (b) Overall MSE.]

Figure 5.9: Simulation Study - Comparison of methods for two covariates with SAZ: MSE comparison of scenarios S1-S5



[Figure 5.10 (plots not reproduced): fitted y against x1 and against x2, one pair of panels each for Robertson, MFP, Bi-D1, Bi-D3, Bi-Sep and Bi-Sub.]

Figure 5.10: Simulation Study - Comparison of methods for two covariates with SAZ: The first twenty fitted functions for each method of data of scenario S7 are plotted. For each of the covariates a mean value of the respective other covariate transformed with the fitted functional form is added. In the plot of Bi-D1 the coefficients of z1 are given in black with their respective 95% confidence interval. Coefficients for z2 and z3 are given in red and blue. The same colors are used for z1, z2 and z3 in the Bi-Sub plots. The functional relationships for z4 and z5 are given in green.

Scenario 7 ((β01, β02) = (10, 25)) was chosen for a detailed description of the results. The empirical

Pearson correlation in the observations of all 500 datasets between the outcome y − S7 and x1 is 0.75.

Between y − S7 and x2 the Pearson correlation is 0.76 and between the two covariates x1 and x2, it is

0.45. As the distribution of x1 and x2 and their relationship to the outcome y − S7 are very similar to

those in Investigation 1a except for the difference in the effect size of the binary indicators, no scatter plot

is given. Scatter plots for this scenario are similar to those in figure 5.7.

In figure 5.10, the first twenty fitted functions of scenario S7 are plotted. The results seem similar to

those of scenario S1, especially for Bi-D3, Bi-Sep and Robertson. The functions selected by Bi-D1 seem to

vary more. This can also be observed for MFP. As it is the only technique without any indicator variable,

the intercept is varying considerably. Some “outlier” functions can also be seen for Bi-Sub.

Despite the visual similarities, there are again huge differences in the actually selected models for the six

different methods. The models selected by Robertson are very close to the correct model. The estimates



for v1 (M:9.9, SD:3.3) and v2 (M:24.8, SD:3.3) are very precise. In addition the estimates for x1 (M:2.0,

SD:0.1) and x2 (M:2.0, SD:0.1) are also almost as in the underlying true model. The second reference

MFP selects varying functional forms. In 197 datasets, a correct linear functional form for x1 is selected.

For x2, in 90.2% of the datasets varying FP2 functional forms were selected. The effect of the observations

with value zero is modeled with the help of the intercept with a mean estimate of 11.9 and a relatively

high SD of 33.9. These varying intercepts can also be observed in figure 5.10.

The four methods proposed in this thesis also show some differences. Bi-D1 chooses a linear functional

relationship for x1 in 53.8% of the datasets. This might be a result of the relatively small β01 (=10) and

a low number of observations in category B (5%). The non-zero observations dominate function selection.

For x2 Bi-D1 selects an FP2 function in 90.2% of the datasets with varying FP2 functional forms. The

correct functional form was never selected. This might be due to the fact that the true effect of v2 is

relatively high (compared to the one of v1) and the number of observations in category C, at 20%, is not

negligible. That the small effect of v1 influences model estimation can also be seen in Bi-Sep. In 17.4%

of the datasets a linear functional relationship was selected and the indicator was dropped. In 80.0%

Bi-Sep selected the correct functional form for x1. For x2 the situation is slightly different. The effect for

v2 in the true model is higher, thus, in 97.2% of the datasets, a linear functional form is selected and the

binary indicator is kept in the model. The estimates for v1 (M:10.4, SD:8.7) and v2 (M:25.8, SD:47.4) are

also very close to the ones of the true underlying model. Bi-D3 selects the correct linear functional form

for x1 in 98.6% of the datasets (493x). For x2, the correct form is selected in 97.4% of the datasets. The

estimates of the three indicators z1 (M:34.3, SD:48.1), z2 (M:9.7, SD:9.8) and z3 (M:24.6, SD:47.1) are

relatively close to the correct ones, but the SDs are higher than those of the estimates of Bi-Sep. Bi-Sub

selects the correct linear model for z4 and z6 in around 98% of the datasets and for z5 and z7 in 99.4%

and 96.6% of the datasets. The estimates for the indicators z1 (M:33.9, SD:48.9), z2 (M:9.1, SD:50.6) and

z3 (M:25.8, SD:51.0) are similar to those of Bi-D3, however with even larger SDs. Estimates for x1 and

x2 can again not be compared for methods that allow the selection of different functional forms. Further

details on frequencies of the selected functional forms for this scenario can be found in table 5.16.
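The "correct" values against which the three indicator estimates of Bi-D3 and Bi-Sub are judged follow directly from the two true effects. A minimal sketch, assuming the coding z1 = both covariates zero, z2 = only x1 zero and z3 = only x2 zero (the function name is hypothetical):

```python
def implied_three_indicator_effects(beta01, beta02):
    """Map the true effects of v1 (x1 = 0) and v2 (x2 = 0) to z1, z2 and z3.

    Assumed coding: z1 = both zero, z2 = only x1 zero, z3 = only x2 zero.
    """
    return {"z1": beta01 + beta02, "z2": beta01, "z3": beta02}

print(implied_three_indicator_effects(10, 25))  # scenario S7
print(implied_three_indicator_effects(30, 30))  # scenario S1
```

For scenario S7 this gives 35, 10 and 25, which the reported means of Bi-D3 (34.3, 9.7, 24.6) and Bi-Sub (33.9, 9.1, 25.8) approximate; for scenario S1 it gives the value 60 for z1 mentioned earlier.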

The overall model fit considering R2 and MSE is very similar for all methods. R2 values are again identical

for all methods. The possibility of choosing non-linear functional relationships seems to compensate

for the missing indicator variables. This can be seen in MFP and Bi-D1. However, the MSE for Bi-D1 is

slightly increased. This is due to the inability to model observations in categories B and C adequately.

This can also be seen in the separated MSE in table 5.17. The increased MSE values are highlighted in

bold.




X1
           Rob     Ref     Bi-D1   Bi-D3   Bi-Sep  Bi-Sub Z4  Bi-Sub Z6
Lin        -       0       -       -       17.4    -          -
FP1        -       39.4    -       -       0.2     -          -
FP2        -       60.6    -       -       0.6     -          -
Lin + z    100     -       53.8    98.6    80.0    98.0       98.8
FP1 + z    -       -       14.2    0.6     0.4     1.2        0.4
FP2 + z    -       -       31.8    0.8     0.6     0.8        0.8
z          -       -       -       -       0       -          -

X2
           Rob     Ref     Bi-D1   Bi-D3   Bi-Sep  Bi-Sub Z5  Bi-Sub Z7
Lin        -       0       -       -       0       -          -
FP1        -       9.8     -       -       0.8     -          -
FP2        -       90.2    -       -       0       -          -
Lin + z    100.0   -       0       97.4    97.2    99.4       96.6
FP1 + z    -       -       8.8     1.2     0.6     0.6        1.4
FP2 + z    -       -       91.2    1.4     1.4     0          2.0
z          -       -       -       -       0       -          -




Method    R2            MSE             MSE_A           MSE_B           MSE_C           MSE_D
          Mean   Sd     Mean    Sd      Mean    Sd      Mean    Sd      Mean    Sd      Mean    Sd
MFP       0.80   0.01   283.27  14.76   284.12  27.68   282.25  58.58   283.84  28.21   282.71  20.11
Rob       0.80   0.01   283.30  14.74   284.03  27.67   282.10  58.56   283.59  28.08   282.94  20.03
Bi-Sep    0.80   0.01   283.38  14.82   284.06  27.68   282.26  58.68   283.84  28.32   282.97  20.06
Bi-D1     0.80   0.01   288.94  19.58   283.89  27.67   288.14  59.20   291.09  34.24   290.69  26.26
Bi-D3     0.80   0.01   282.90  14.73   283.89  27.67   278.52  58.01   283.30  28.03   282.68  20.02
Bi-Sub    0.80   0.01   282.25  14.74   283.89  27.67   272.75  57.45   282.14  27.99   282.42  20.02


To sum up the results of scenario 7, one can say that Bi-D1 is obviously the method which has disadvantages

in this setting but nevertheless leads to an acceptable overall performance. However, the chosen

models are a lot more complex and a higher number of degrees of freedom is needed. MFP leads to very

unstable selection of the functional form. If the effect of the indicator is low, the performance is a bit

more stable. Overall, however, function selection is not satisfactory. Taking the selected complexity of the

model (in terms of degrees of freedom) and the stability with which the same model is selected also into

consideration, Robertson and Bi-Sep need the fewest degrees of freedom and lead to a good, sparse model

fit. Furthermore, they are both easily interpretable in this distributional situation.

Comparing the results of Scenarios S6-S10, one can see in figure 5.11b that decreasing the difference in

the effect size of the two indicators leads to a higher overall MSE. However, it does not lead to greater

differences in the performance of the different modeling strategies. In the categorized MSE, there are also

no great differences visible. A possible reason might be the cardinality of observations in the categories.

Category B, e.g., contains only 5% of the observations. Investigation 3 will give details on the effect of

varying cardinalities of the categories.

With respect to the actually selected functional forms, one can say that the smaller the two effects for

v1 and v2, the more often simple linear functional forms are selected with MFP. Bi-Sep drops the binary

indicator v1 in around 20% of the datasets (the true effect is 10 in all scenarios). Overall, the results and

their explanation are very similar to what has already been stated in investigation 1a. Varying the effect



SPECIFICATIONS 1c                  S11         S12         S13         S14
Means x1, x2: µ := (µ1, µ2)        (25,30)
Standard deviation                 (5,7)
(qA, qB, qC)                       (0.25, 0.05, 0.2)
Covariance ρ                       (1 0 / 0 1)
True functional relationship       y = β01z1 + β02z2 + β03z3 + f(x1) + f(x2)
(β01, β02, β03)                    (10,50,10)  (50,10,50)  (30,50,30)  (20,40,50)
f(x1)                              2x1
f(x2)                              2x2

Table 5.18: Simulation Study - Comparison of methods for two covariates with SAZ: specifications of scenarios S11-S14 (Investigation 1c). Effect size is varied asymmetrically using three dummy indicators, continuous functional relationship is linear

asymmetrically, but keeping two separate effects (v1 and v2) in the correct model, leads to similar results.

Robertson and Bi-Sep are to be preferred in this setting.

1c. Effect size varied asymmetrically using three dummy indicators

In this third setting, the true

functional relationship contains three different effects for the indicators of the categories of observations.

Different combinations of the three effect sizes were investigated. In four scenarios, the effect sizes β01,

β02 and β03 vary from 10 to 50. Again, the functional relationship between the non-zero observations

of the covariates and the outcome is linear, to make the results comparable to investigations 1a and 1b.

Detailed specifications can be found in table 5.18. The interesting change in these scenarios is that the

observed outcome for an observation with covariate value zero depends on the second covariate, i.e.

f(0, x2 > 0) ≠ f(0, 0).
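Written out per category, the true model of table 5.18 can be sketched as below. The display assumes the coding z1 = both covariates zero (category A), z2 = only x1 zero (category B) and z3 = only x2 zero (category C), an error term with mean zero, and no further intercept; these are readings of the surrounding text rather than statements taken from it.

```latex
\begin{align*}
E[y \mid x_1 = 0,\; x_2 = 0] &= \beta_{01}, \\
E[y \mid x_1 = 0,\; x_2 > 0] &= \beta_{02} + 2x_2, \\
E[y \mid x_1 > 0,\; x_2 = 0] &= \beta_{03} + 2x_1, \\
E[y \mid x_1 > 0,\; x_2 > 0] &= 2x_1 + 2x_2 .
\end{align*}
```

A model with two separate indicators v1 and v2 forces the effect in category A to equal the sum of the effects in categories B and C. In scenario S14 the true values are 20, 40 and 50, so this restriction is clearly violated.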

Scenario 14 was chosen for a detailed description of the results. The Pearson correlation between x1

and the outcome y − S14 is relatively high at 0.78; the correlation between x2 and y − S14 is 0.63.

The covariates x1 and x2 have a Pearson correlation of 0.44. The overall distribution of the covariates

x1 and x2 is the same as in investigation 1a and 1b. Figure 5.12 displays the first 20 fitted functions in

this scenario. Again Ref and Bi-D1 use non-linear functional relationships. However, the quality of their

results differs. Ref can only estimate one single effect for the area close to zero whereas Bi-D1 has an

additional effect estimate for zero observations in category A. The results for Bi-D3, Bi-Sep and Bi-Sub

seem similar at first glance. However, one can see a clear separation of the effect estimates β01 (black)

and β02 (red) and β03 (blue) due to the differences in the true relationship.

Table 5.19 gives further insight into the selected functions. The method proposed by Robertson obviously

“selects” the correct linear functional relationship for both x1 and x2. The estimates for β11(M:2.0, SD:0.2)

and β12(M:2.0, SD:0.1) are close to the true underlying function. The estimates for v1 (M:-9.6, SD:4.4)

and v2 (M:34.1, SD:4.7) differ considerably from the true effects. As the true model contains three



[Figure 5.11 (plots not reproduced): MSE of the methods in scenarios S6-S10. Panels: Category A, Category B, Category C, Category D; (b) Overall MSE.]

Figure 5.11: Simulation Study - Comparison of methods for two covariates with SAZ: MSE comparison in scenarios S6-S10



[Figure 5.12 (plots not reproduced): fitted y against x1 and against x2, one pair of panels each for Robertson, MFP, Bi-D1, Bi-D3, Bi-Sep and Bi-Sub.]






X1
           Rob     Ref     Bi-D1   Bi-D3   Bi-Sep  Bi-Sub Z4  Bi-Sub Z6
Lin        -       78.0    -       -       47.6    -          -
FP1        -       1.8     -       -       0.4     -          -
FP2        -       20.2    -       -       0       -          -
Lin + z    100     -       0       98.2    51.2    99.0       98.0
FP1 + z    -       -       3.8     0.8     0.4     0.8        1.4
FP2 + z    -       -       96.2    1.0     0.4     0.2        0.6
z          -       -       -       -       0       0          0

X2
           Rob     Ref     Bi-D1   Bi-D3   Bi-Sep  Bi-Sub Z5  Bi-Sub Z7
Lin        -       0       -       -       0       -          -
FP1        -       34.0    -       -       0       -          -
FP2        -       66.0    -       -       0.2     -          -
Lin + z    100.0   -       0       97.4    97.4    98.4       98.0
FP1 + z    -       -       0       1.2     0.8     1.0        0.8
FP2 + z    -       -       100.0   1.4     1.8     0.6        1.2
z          -       -       -       -       0       -          -


indicator effects, Robertson is not able to adequately model the effects for observations with value zero.

MFP selects a linear model in 78% of the datasets for x1. This is surprising as the true β01 and β02 are

20 and 40. In the remaining datasets, however, very different functional forms are chosen. For x2, MFP

never selects the correct model and the types of FP2 vary highly.
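One possible way to see why Robertson's two indicator estimates land far from any of the three true effects is to compute the best additive approximation a + b1*v1 + b2*v2 to the true category effects of scenario S14, weighted by the category shares. The sketch below is an illustration under stated assumptions (category B means only x1 is zero, C means only x2 is zero, and the positive values are distributed identically across categories); it is not the estimation procedure used in the thesis.

```python
def solve_linear_system(matrix, rhs):
    """Gauss-Jordan elimination with partial pivoting for a small square system."""
    n = len(rhs)
    aug = [list(row) + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def additive_approximation(cells):
    """Weighted least squares fit of a + b1*v1 + b2*v2 to the cell effects.

    cells: list of (weight, v1, v2, effect).
    """
    design = [(1.0, v1, v2) for _, v1, v2, _ in cells]
    xtwx = [[sum(w * d[i] * d[j] for (w, _, _, _), d in zip(cells, design))
             for j in range(3)] for i in range(3)]
    xtwy = [sum(w * d[i] * m for (w, _, _, m), d in zip(cells, design))
            for i in range(3)]
    return solve_linear_system(xtwx, xtwy)

# Scenario S14: shares (qA, qB, qC, qD) and true effects (beta01, beta02, beta03, 0).
s14_cells = [
    (0.25, 1, 1, 20.0),  # A: both zero
    (0.05, 1, 0, 40.0),  # B: only x1 zero (assumed coding)
    (0.20, 0, 1, 50.0),  # C: only x2 zero (assumed coding)
    (0.50, 0, 0, 0.0),   # D: both positive
]
intercept, b_v1, b_v2 = additive_approximation(s14_cells)
print(round(b_v1, 1), round(b_v2, 1))
```

Under these assumptions the approximation yields roughly -9.7 for v1 and 34.2 for v2, in line with the mean estimates of -9.6 and 34.1 reported for Robertson.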

Bi-D1 never selects the correct functional form for x1. In 96.2% of the datasets an FP2 functional form

is selected, with varying types. For x2, Bi-D1 always selects an FP2 function, again with varying functional forms

(FP2(0.5, 0.5) 193x, FP2(-1, 2) 85x, FP2(-0.5, 1) 81x, etc.). The estimate for the indicator z1 (M:-69.0,

SD:3.8) is very different to the true effect. In addition, the intercept (M:89.9, SD:3.4) is estimated with

a relatively high value. Bi-Sep drops the binary indicator and selects a linear function in 47.6% of the

datasets for x1. A possible reason is that, because the different effects in the categories cause a high

variance in the outcome, a common average estimate fits poorly, adds little

information to the model and is therefore dropped in the second stage of Bi-Sep. In 51.2% of the datasets

a linear functional form with the additional indicator is selected. The estimates for v1 (M:-14.5, SD:17.4)

and v2 (M:33.6, SD:17.0) are not able to capture the true effect of this scenario.

Both strategies with three indicator variables, Bi-D3 and Bi-Sub, perform well in this setting. The

estimates for the indicators z1(M:17.4, SD:30.9), z2(M:38.4, SD:26.3) and z3(M:48.9, SD:17.1) estimated

by Bi-D3 are very close to the true values. The selected functional form for x1 is linear in 98.2% of the datasets.

For x2, it is linear in 97.4% of the datasets. The selected functions for Bi-Sub are very similar. A linear

functional form is selected in 99% of the datasets for z4, and in 98% for z6. For the respective covariates

built from x2, z5 and z7, a linear function is selected in 98.4% and 98.0% of the datasets. The estimates

for the indicators in Bi-Sub z1(M:17.7, SD:20.7), z2(M:40.7, SD:38.5) and z3(M:48.2, SD:22.5) are again

very close to the true underlying effects.

The above described behaviors can also be found in the MSE and the MSE in categories displayed in

table 5.20. Robertson, MFP and Bi-Sep have a mean overall MSE of about 557. The mean overall MSE of

Bi-D1, Bi-D3 and Bi-Sub, at about 398, is considerably lower than for the other three methods. The





Method    R2            MSE             MSE_A           MSE_B             MSE_C           MSE_D
          Mean   Sd     Mean    Sd      Mean    Sd      Mean      Sd      Mean    Sd      Mean    Sd
MFP       0.72   0.01   558.36  24.49   488.59  37.77   2,320.14  222.05  544.45  45.59   422.64  27.00
Rob       0.72   0.01   556.95  24.21   482.40  37.04   2,424.48  214.13  525.53  42.91   420.03  26.26
Bi-Sep    0.72   0.01   557.25  24.31   485.81  37.55   2,388.37  217.76  531.41  43.45   420.19  26.33
Bi-D1     0.80   0.01   397.97  19.28   400.89  36.75   385.95    81.10   397.60  40.76   397.86  26.09
Bi-D3     0.80   0.01   398.62  19.25   400.89  36.75   386.33    80.75   397.81  40.81   399.04  26.07
Bi-Sub    0.80   0.01   397.78  19.18   400.89  36.75   377.79    79.53   396.58  40.83   398.70  26.11


number of observations in category B is relatively small and still Robertson, MFP and Bi-Sep lead to large

errors, especially in this category. None of the three methods is able to capture the true effect. Surprisingly,

Bi-D1 still leads to very good results in the MSE, although only an effect for one single indicator is

estimated. However, more complex functions (using more degrees of freedom) are selected. Bi-D3 and

Bi-Sub lead to similar results with respect to the selected functional forms for x1 and x2 and the estimates

for the indicators z1, z2 and z3. As Bi-Sub on average needs more degrees of freedom in the estimation

than Bi-D3 (due to the additional estimation of functions in the subgroups), Bi-D3 seems preferable in

this setting.
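How much the small category B contributes to the overall error can be checked with a short calculation. The sketch assumes that the overall MSE is the share-weighted average of the MSE in categories A to D, with shares 0.25, 0.05, 0.20 and 0.50 (the last being 1 - qA - qB - qC); the numbers are the mean values reported for MFP and Bi-D3 in table 5.20.

```python
SHARES = {"A": 0.25, "B": 0.05, "C": 0.20, "D": 0.50}  # qD = 1 - qA - qB - qC

def overall_from_categories(category_mse, shares=SHARES):
    """Share-weighted average of the category-specific MSE values."""
    return sum(shares[c] * category_mse[c] for c in shares)

mfp_s14 = {"A": 488.59, "B": 2320.14, "C": 544.45, "D": 422.64}    # table 5.20, MFP
bi_d3_s14 = {"A": 400.89, "B": 386.33, "C": 397.81, "D": 399.04}   # table 5.20, Bi-D3

print(round(overall_from_categories(mfp_s14), 2))
print(round(overall_from_categories(bi_d3_s14), 2))
print(round(SHARES["B"] * mfp_s14["B"], 1))  # contribution of category B for MFP
```

The weighted averages reproduce the reported overall means (558.36 and 398.62). For MFP, category B alone contributes about 116 of the 558 although it contains only 5% of the observations.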

Comparing the results of scenario 14 to the other scenarios of this investigation, one can observe that

the four scenarios differ more than the ones in the first two investigational settings. Details can be found

in figure 5.13a. Several further combinations of effects could have been investigated. S11 differs from

the others in the single large effect β02 for a small category of observations (5%). It leads to the overall

highest MSEs. In this setting the true effects β01 and β03 were relatively low (10). For the observations

of the small category B, however, the effect was very large (50). High error rates are found in category B

for Robertson, MFP and Bi-Sep. With respect to the selected functional forms, S11 is comparable to S14

described above.

For scenario S12, surprisingly, results are very similar for all methods. A possible reason could be that

the effect in category B is relatively small and additionally the number of observations in this category is

small (5%). Furthermore β01 and β03 take the same true value. With these assumptions, it is comparable

to settings described in Investigation 1a, with a symmetrical effect of two indicators v1 and v2. With

respect to the selected functional forms, the methods behave similarly to what was described for S1.

In scenarios S13 and S14, as described before, Bi-D3 and Bi-Sub perform well as expected. But also Bi-

D1 leads to very good results with respect to overall MSE even with only a single indicator for category A

and very different effects in the three categories. Again non-linear functional forms are used to compensate,

90


which leads to very complex models that need a high number of degrees of freedom in the estimation.

Bi-Sep seems to have difficulties due to its two-stage procedure: the functional form is chosen in the first

step. In the second step, it is tested whether both the indicator and the continuous covariate are needed.

If the binary indicator is dropped, no new functional relationship is estimated. That seems to be the

reason that Bi-Sep chooses a linear function relatively often in these two scenarios completely ignoring

the effect of the zero values.
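The consequence of the two-step logic described above can be sketched as a small decision function. The two tests of the second step are not reproduced here; their outcomes are passed in as booleans, and the function name and the label "dropped" are hypothetical.

```python
def bi_sep_final_model(stage1_form, keep_indicator, keep_function):
    """Return the table row label of the final model for one covariate.

    stage1_form: form chosen in the first step ("Lin", "FP1" or "FP2").
    keep_indicator / keep_function: outcomes of the second-step tests,
    passed in as booleans because the tests themselves are not sketched here.
    """
    if keep_indicator and keep_function:
        return stage1_form + " + z"
    if keep_function:
        # Indicator dropped: the first-step form is kept as is, not re-selected.
        return stage1_form
    if keep_indicator:
        return "z"
    return "dropped"

print(bi_sep_final_model("Lin", keep_indicator=True, keep_function=True))
print(bi_sep_final_model("Lin", keep_indicator=False, keep_function=True))
```

The second call corresponds to the row "Lin" without indicator: a linear form that was chosen with the indicator in the model is kept even though the indicator is no longer there to absorb the zero values.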

Summary - Investigation 1

The aim of these three investigational settings was to investigate the influence

of the effect size of the binary indicators on model selection. All scenarios used linear functional

relationships for covariates x1 and x2. Only the effects of the binary indicators were varied. All other

components were kept fixed to make the scenarios comparable. Major results will be given for the three

settings with respect to the selected functional forms and the overall goodness-of-fit measured by the

MSE.

Varying the effect size symmetrically in true models with two binary effects for v1 and v2, one could

observe that the smaller the effects, the more similarly the methods perform. With respect

to MSE, the methods behave rather similarly for all effect sizes. Bi-D1 leads to slightly increased MSEs.

However, the smaller the effect of the binary indicators, the smaller the differences between the methods,

including Bi-D1. With respect to the actually selected functional forms, more differences can be observed.

The higher the effect for v1 and v2, the more complex (with regard to degrees of freedom) are the models

selected by MFP and Bi-D1. For small effects, Bi-Sep tends to drop the indicators completely. Bi-D3 and

Bi-Sub perform similarly in this setting.

Varying the effect size asymmetrically in true models with two binary effects for v1 and v2, one could

observe that the smaller the differences between the two effects of the indicators the higher the overall

MSE. All methods select simpler models if the effects for v1 and v2 are smaller. Similar to the results

already found in the symmetric variation of the effect sizes, Robertson and Bi-Sep are preferred in this

setting as they result in stable estimations of the functional forms, good overall performance with respect

to MSE and, compared to the other strategies, need fewer degrees of freedom.

Varying the effect size asymmetrically in true models with three binary effects for z1, z2 and z3, one

could observe that results vary highly across the six methods. Obviously, methods without three estimates

for the zero categories have to compensate for this in their model selection by using non-linear functional

forms. However, in most settings Robertson, MFP and Bi-Sep perform considerably worse than the other

methods. Furthermore, one could observe that not only the effect of the binary indicators seems



[Figure 5.13 (plots not reproduced): MSE of the methods in scenarios S11-S14, outside values excluded. Panels: Category A, Category B, Category C, Category D; (b) Overall MSE.]

Figure 5.13: Simulation Study - Comparison of methods for two covariates with SAZ: MSE comparison of scenarios S11 - S14



to have an influence on model fit and function selection but also the number of observations in categories

A, B and C. This could be the reason why Bi-D1 despite the lack of three indicators is able to result in an

acceptable overall model fit. However, much more complex functional forms are fitted. This phenomenon

will be further investigated in Investigation 3.

All three settings showed that the effect size does influence the selected models. Overall, one can say

that the higher the true underlying effect for the observations with value zero, or to be more explicit, the

greater the difference between the value for the continuation of the functional form at zero and the true

outcome at zero, the more important it is to use a strategy including a binary indicator. For simple

linear functional relationships, the approach proposed by Robertson [50] seems sufficient. As the true

functional form and also the true distribution of the covariates is often unknown in real data applications,

recommendations for prespecified analyses are difficult to give. In the Bi-Sep procedure, either the binary

indicator or the continuous functional form can be dropped from the final model. Therefore, it might

be a good choice for standard situations. Investigation 1c showed, however, that in very specific data

situations with a complicated true relationship between the two covariates and the outcome, Bi-Sep can

lead to wrong results. Therefore, subject matter knowledge should be included in the choice of the model

selection procedure.

So far, the relationship of the positive observations and the outcome was always assumed to be linear.

The next step will be to evaluate the performance of the four proposed methods in settings in which the

true underlying model contains non-linear functional relationships for the positive observations.



SPECIFICATIONS 2                   S15      S16      S17      S18      S19
Means x1, x2: µ := (µ1, µ2)        (7,5)
Standard deviation                 (2,2)
(qA, qB, qC)                       (0.25, 0.05, 0.2)
Covariance ρ                       (1 0 / 0 1)
True functional relationship       y = β01v1 + f(x1) + β02v2 + f(x2)
f(x1)                              0.5 x1^2
f(x2)                              0.2 x2^3
(β01, β02)                         (10,10)  (15,15)  (20,20)  (25,25)  (30,30)

Table 5.21: Simulation Study - Comparison of methods for two covariates with SAZ: specifications of scenarios S15-S19 (Investigation 2). Effect size is varied symmetrically, continuous functional relationship is non-linear

Investigation 2: Influence of the effect size on the relevance of including binary indicators -

non-linear functional relationship

In this setting, in addition to the effect of the binary indicator, the true functional relationship contains

non-linear continuous functional forms (x1^2, x2^3). As in investigation 1a, the effects of the binary indicators

are varied symmetrically in steps of 5. The distribution of x1 and x2 is changed (x1 ∼ N(7, 2), x2 ∼ N(5, 2)).

The number of observations in categories A, B and C remains the same as in investigation 1. More detailed

specifications of all chosen parameters can be found in table 5.21.
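The data-generating mechanism of this investigation can be illustrated with a minimal Python sketch for scenario S19. The category probabilities, means, standard deviations, functional forms and effects are taken from table 5.21. Everything else is an assumption made for illustration only: the coding of the categories (A: both covariates zero, B: only x1 zero, C: only x2 zero, D: both positive), the coding v_j = 1 if x_j = 0, the redrawing of non-positive values, the normally distributed error term with standard deviation `noise_sd`, the sample size, and the names `positive_normal` and `simulate_s19`.

```python
import random


def positive_normal(rng, mean, sd):
    """Draw from N(mean, sd) and redraw until the value is positive (assumption)."""
    value = rng.gauss(mean, sd)
    while value <= 0:
        value = rng.gauss(mean, sd)
    return value


def simulate_s19(n=1000, beta01=30.0, beta02=30.0, noise_sd=10.0, seed=1):
    """Draw one illustrative dataset for scenario S19 (table 5.21)."""
    rng = random.Random(seed)
    q_a, q_b, q_c = 0.25, 0.05, 0.20  # category probabilities from table 5.21
    data = []
    for _ in range(n):
        u = rng.random()
        # Assumed coding: A = both zero, B = only x1 zero,
        # C = only x2 zero, D = both positive.
        if u < q_a:
            x1_zero, x2_zero = True, True
        elif u < q_a + q_b:
            x1_zero, x2_zero = True, False
        elif u < q_a + q_b + q_c:
            x1_zero, x2_zero = False, True
        else:
            x1_zero, x2_zero = False, False
        # Positive parts are drawn independently: x1 ~ N(7, 2), x2 ~ N(5, 2).
        x1 = 0.0 if x1_zero else positive_normal(rng, 7.0, 2.0)
        x2 = 0.0 if x2_zero else positive_normal(rng, 5.0, 2.0)
        v1 = 1 if x1_zero else 0  # assumed indicator coding
        v2 = 1 if x2_zero else 0
        # y = beta01*v1 + f(x1) + beta02*v2 + f(x2) + error (error term assumed)
        y = (beta01 * v1 + 0.5 * x1 ** 2
             + beta02 * v2 + 0.2 * x2 ** 3
             + rng.gauss(0.0, noise_sd))
        data.append((x1, x2, v1, v2, y))
    return data
```

Calling `simulate_s19(n=500)` returns a list of tuples `(x1, x2, v1, v2, y)`; changing `beta01` and `beta02` to 10, 15, 20 or 25 gives the analogous sketch for scenarios S15 to S18.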

Scenario 19 was chosen for a detailed look and comparison of the six methods in this setting. Figure

5.14 gives a first impression of the simulated datasets. The relation between x2 and the outcome y

seems stronger than between x1 and y. Because the relationship is known to be non-linear, Spearman correlation

coefficients were calculated to evaluate this impression. The overall empirical Spearman correlation (in

a pooled dataset including all 500 simulated datasets) between x1 and the outcome is 0.17. Between x2

and y-S19 it is 0.26. The covariates x1 and x2 have a Spearman correlation of 0.36. The quadratic

functional relationship for x1 can hardly be guessed from the scatter plot. This might also be due to the influence of

x2 increasing the outcome value y. For x2, the non-linear functional relationship to the outcome is already

visible in the scatter plot. Figure 5.15 displays the selected functions in the first 20 simulated datasets.

By construction, the approach proposed by Robertson et al. always leads to linear models. Comparing the

other approaches, one can get the impression that all five non-linear modeling techniques lead to very

similar results. However, even if the fitted shapes seem comparable, the chosen models are very different.

Table 5.22 displays the frequencies of the chosen model complexities. The reference model,

MFP, selected an FP2 model in all 500 datasets. A closer look at the actually selected models

shows that very different functional forms were chosen. For x1, the selected forms were an FP2 with

powers (-2, 2) (180x), an FP2 with (0, 2) (115x), an FP2 with (0.5, 1) (81x) and several further ones. The

[Figure 5.14: Simulation Study - Comparison of methods for two covariates with SAZ: Scatter plots of data of scenario S19 of the first three runs (y-S19 against x1 and against x2 in each run).]

[Figure 5.15: Selected functions in the first 20 simulated datasets of scenario S19, y against x1 and against x2. Panels: Robertson, MFP, Bi-D1, Bi-D3, Bi-Sep, Bi-Sub.]



Covariate x1
         Rob     MFP    Bi-D1   Bi-D3   Bi-Sep   Bi-Sub Z4   Bi-Sub Z6
Lin      -       0      0       0       0        0           0
FP1      -       0      0       0       0        0           0
FP2      -       100    0       0       1.6      0           0
Lin+z    100     -      0       1.2     1.2      49.6        5.8
FP1+z    -       -      0       92.8    92.4     48.0        89.0
FP2+z    -       -      100     6.0     4.8      2.4         5.2
z        -       -      0       0       10.2     0           0

Covariate x2
         Rob     MFP    Bi-D1   Bi-D3   Bi-Sep   Bi-Sub Z5   Bi-Sub Z7
Lin      -       0      0       0       0        0           0
FP1      -       0      0       0       0        0           0
FP2      -       100    0       0       0        0           0
Lin+z    100.0   -      0       0       0        0.8         0
FP1+z    -       -      0       92.4    92.6     96.0        93.4
FP2+z    -       -      100     7.6     7.4      3.2         6.6
z        -       -      0       0       0        0           0


estimation seems very unstable. For x2, most of the FP2 combinations contain a cubic term in combination

with a second term (-2 was chosen 218x, -0.5 was chosen 137x). The true functional form was never chosen,

neither for x1 nor for x2.
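To make the notation for the selected forms concrete, the following minimal Python sketch evaluates the basis terms of a fractional polynomial for a positive value x. It uses the conventional FP power set {-2, -1, -0.5, 0, 0.5, 1, 2, 3}, in which power 0 denotes log(x) and a repeated power p in an FP2 yields the terms x^p and x^p log(x). The function names `fp_term` and `fp2_basis` are hypothetical and chosen for illustration only; the sketch does not reproduce the function selection procedure itself.

```python
import math

# Conventional set of candidate powers for fractional polynomials.
FP_POWERS = (-2, -1, -0.5, 0, 0.5, 1, 2, 3)


def fp_term(x, p):
    """Single FP term x**p, with power 0 read as log(x). Requires x > 0."""
    return math.log(x) if p == 0 else x ** p


def fp2_basis(x, p1, p2):
    """Return the two basis terms of an FP2 with powers (p1, p2).

    For a repeated power the second term is the first one times log(x).
    """
    first = fp_term(x, p1)
    if p1 == p2:
        return first, first * math.log(x)
    return first, fp_term(x, p2)
```

For example, `fp2_basis(x, -2, 2)` gives the pair (x^-2, x^2), while `fp2_basis(x, 1, 1)` gives (x, x log(x)). The true forms in this investigation are FP1 functions with power 2 for x1 and power 3 for x2.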

Bi-D1 also never selected the true functional relationship, neither for x1 nor for x2. The most complex FP2

model was chosen in all cases. The selected functional forms, however, also varied highly over the 500

datasets. The mean estimate for the binary indicator z1 was 1.1 (SD 2.9). The amount of observations

in categories B and C in this scenario is 25% in total. Inclusion of the single indicator does not seem

to improve the stability of the chosen function. The procedure selected an FP2 with (0.5, 1) (189x), an

FP2 with (1, 1) (174x), and an FP2 with (-1, 3) (112x). The overall behavior is, thus, similar to the reference

method MFP, but the selected functions are again different. Estimated coefficients for x1 and x2 cannot

be easily compared due to the broad range of selected functional forms. For x1, Bi-D3 selected an FP1 in 92.8% of the

datasets. In 453 datasets the correct power (2) was selected, in 11 datasets a cubic transformation

was selected. For x2, in 462 datasets, the correct transformation was selected. The estimates for z1 (M:

59.1, SD: 14.5), z2 (M: 29.1, SD: 14.6) and z3 (M: 30.0, SD: 2.1) are very close to the true model. Compared

to MFP and Bi-D1, the estimation of the functional form is overall very close to the true underlying model.

Due to the construction of the simulated data scenario, Bi-Sep seems to be at an advantage. The results

support this assumption. In 451 datasets, the correct power for x1 was selected. For x2, in 463 of the

datasets the correct functional form was selected. The estimates for v1 (M: 29.5, SD: 14.0) and v2 (M:

29.9, SD: 2.0) are very close to the true model.

As Bi-Sub allows the selection of a different functional form for two subsets of the observations, the

models are very flexible. However, estimation in smaller subsets has less power. Therefore, the FP function

selection procedure tends to select less complex models. This can be seen in the selected forms for z4 ({x1

if x2=0}). In 248 datasets, a linear functional form was selected in this subset and only in 158 datasets

the correct transformation was chosen. For z6 in 430 datasets the transformation was selected correctly.

For z5 and z7, the results are more stable. In 449 (z5) and 467 (z7) datasets, the correct transformation

Method   R2 (Mean, Sd)   MSE (Mean, Sd)   MSE_A (Mean, Sd)   MSE_B (Mean, Sd)   MSE_C (Mean, Sd)   MSE_D (Mean, Sd)

Table 5.23: Simulation Study - Comparison of methods for two covariates with SAZ: Comparison of R2, MSE and MSE in categories in scenario S19.

was selected. The estimates for z1 (M:59.1, SD:17.6), z2 (M:28.3, SD:19.0) and z3 (M:18.1, SD:33.6) are

again acceptable. The estimate for z3 is, however, considerably lower than the true value (30) and the mean

estimates in Bi-D3 or Bi-Sep. Overall, in this data scenario a lot of further degrees of freedom are needed

for this approach, without considerable gain in precision of the function which is to be estimated.
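The additional degrees of freedom stem from the larger number of derived variables. As a rough illustration, the following Python sketch builds the indicators and subset covariates referred to in this section from a pair (x1, x2). The definitions of z4 (x1 if x2 = 0) and z6 (x1 if x2 is positive) follow the text. The remaining codings are assumptions of this sketch: v_j = 1 if x_j = 0, z1 for both covariates zero, z2 for only x1 zero, z3 for only x2 zero, and z5 and z7 defined for x2 by analogy with z4 and z6. The function name `saz_variables` is hypothetical.

```python
def saz_variables(x1, x2):
    """Derive indicators and subset covariates for two covariates with SAZ."""
    both_zero = x1 == 0 and x2 == 0
    only_x1_zero = x1 == 0 and x2 > 0   # assumed to correspond to z2
    only_x2_zero = x1 > 0 and x2 == 0   # assumed to correspond to z3
    return {
        # Two separate indicators, one per covariate.
        "v1": int(x1 == 0),
        "v2": int(x2 == 0),
        # Three indicators, one per zero category.
        "z1": int(both_zero),
        "z2": int(only_x1_zero),
        "z3": int(only_x2_zero),
        # Subset covariates: each covariate split by the status of the other.
        "z4": x1 if x2 == 0 else 0.0,   # x1 if x2 = 0
        "z6": x1 if x2 > 0 else 0.0,    # x1 if x2 is positive
        "z5": x2 if x1 == 0 else 0.0,   # assumed analogue of z4 for x2
        "z7": x2 if x1 > 0 else 0.0,    # assumed analogue of z6 for x2
    }
```

Under these codings, z4 + z6 always equals x1 and z5 + z7 always equals x2, so a separate functional form is estimated for each covariate in each subset.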

Comparing the six methods with respect to the selected functions, one can say that despite the visual

correspondence of the fitted functions, several differences can be found in the actual selected forms. The

first reference, Robertson, is by construction not able to estimate non-linear functional relationships. Both MFP and

Bi-D1 lead to a very unstable selection of functional forms. Bi-D3, Bi-Sep and Bi-Sub lead to comparable

results. If one furthermore considers the complexity of the selected model, Bi-Sep could be preferred as

on average the fewest degrees of freedom are needed for estimation (compared to Bi-D3 and Bi-Sub).

After this evaluation of the estimated functional forms, it seems surprising that so few of these differences

can be seen in the MSE. Table 5.23 shows that, as expected, the MSE of models selected

using the method of Robertson et al. is considerably higher than for all other methods. Especially in

category D, the error is almost twice as high as for the other methods. The MSE for all other methods

is relatively similar, except for Bi-D1 which has a slightly increased MSE especially in categories B and

D. This is in line with the unstable selection of the functional form. The MSE for the reference MFP, however, is

very close to the MSE of Bi-Sep, Bi-D3 and Bi-Sub, although very different functional forms were selected

by MFP.
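The overall MSE and the MSE within the four categories can be computed as in the following minimal Python sketch. The assignment of categories B and C (only x1 zero and only x2 zero, respectively) is an assumption of this sketch, and the function names `category` and `mse_by_category` are hypothetical.

```python
def category(x1, x2):
    """Assign an observation to category A, B, C or D."""
    if x1 == 0 and x2 == 0:
        return "A"          # both covariates zero
    if x1 == 0:
        return "B"          # assumed: only x1 zero
    if x2 == 0:
        return "C"          # assumed: only x2 zero
    return "D"              # both covariates positive


def mse_by_category(x1, x2, y, y_hat):
    """Mean squared error overall ("all") and within each category."""
    sums, counts = {}, {}
    for a, b, observed, predicted in zip(x1, x2, y, y_hat):
        squared_error = (observed - predicted) ** 2
        for key in ("all", category(a, b)):
            sums[key] = sums.get(key, 0.0) + squared_error
            counts[key] = counts.get(key, 0) + 1
    return {key: sums[key] / counts[key] for key in sums}
```

Because the overall MSE is a weighted average of the category-specific values, a method can have a similar overall MSE while fitting a small category poorly.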

In summary, the evaluation of scenario 19 has shown that in this situation Bi-Sep leads to reasonable

results. Bi-D3 and Bi-Sub have similar results for the selection of the functional form and the MSE;

however, they are more complex in terms of degrees of freedom needed for the estimation of the full

model. MFP leads to acceptable results in the MSE; however, the selected functional forms vary highly

and the true underlying model was not selected once. Bi-D1 leads to models with slightly increased

MSEs. Similar to MFP, function selection is very unstable. The classical linear model with the extension



proposed by Robertson is unsuitable in this situation.

In this investigational setting, five different effect sizes (β01, β02) were analyzed. Comparing the five

scenarios one can see that increasing the effect of the binary indicator does not lead to very different

results. The overall trend in the behavior of the methods stays the same. The overall MSE in figure 5.16b

shows that the method proposed by Robertson et al. [50] performs worse than the methods which are

capable of modeling non-linear functional relationships, which is, as described before, not surprising. The

smaller the effect of β01, the more Bi-Sub tends to choose linear functional forms for z4 and also z6. The

difference in MSE for Bi-D1 and the other methods is smaller if the true effect of (β01, β02) is smaller.

Figure 5.16a displays the MSE in the four categories. Again the results of the five scenarios are very

similar. As expected, Robertson has high errors in category D (non-zero observations), as non-linear functional

relationships cannot be estimated. The overall behavior is similar in all scenarios and was illustrated

by way of example with the results of scenario 19.

[Figure 5.16: Scenarios S15 - S19, comparison of methods. (a) MSE in Category A, Category B, Category C and Category D; (b) Overall MSE.]


SPECIFICATIONS 3a                 S20    S21    S22    S23    S24
Means x1, x2: µ := (µ1, µ2)       (25, 30)
Standard deviation                (5, 7)
qA                                0.1    0.2    0.1    0.4    0.3
qB                                0.05   0.1    0.2    0.05   0.2
qC                                0.15   0.1    0.2    0.05   0.2
Covariance ρ                      (1 0 / 0 1)
True functional relationship      y = β01 z1 + β02 z2 + β03 z3 + f(x1) + f(x2)
f(x1)                             2 x1
f(x2)                             2 x2
(β01, β02, β03)                   (20, 40, 50)

Table 5.24: Simulation Study - Comparison of methods for two covariates with SAZ: specifications of scenarios S20-S24 (Investigation 3a). Cardinality of zero categories is varied, continuous functional relationship is linear.

Investigation 3: Influence of the number of observations in the zero subsets

So far, the number of observations in the 4 categories was constant. However, in some settings one

could already guess that this factor might also influence the estimation of model parameters and model

selection. In ten different scenarios the influence of the number of observations in the zero subsets will be

investigated. The choice of appropriate settings is not easy as several different combinations are possible.

Therefore, five settings will be investigated in which x1 and x2 have a linear functional relationship with the outcome,

so that a comparison with the scenarios in Investigation 1 is possible, and five further scenarios with a non-linear

functional form for x1 and x2. To make the results comparable to Investigation 2,

the same functional forms were chosen.

Investigation 3a: Influence of number of observations in the zero subsets - linear functional relationship

In Investigation 3a, five different scenarios with different numbers of observations are analyzed.

The total proportion of observations with value zero varies widely across the 5 scenarios, from 30% to 70%

of all observations in a dataset. With 30%, scenario 20 has the smallest proportion of observations with zero.

In category B, there are only 5% of the observations of the whole dataset. The effect β02 corresponding

to observations in category B is relatively high at 40. In scenario 21, the total proportion of observations

for which at least one of the covariates x1 and x2 is zero is 40%, which is almost half of the observations.

In scenario 22 and scenario 23, 50% of the observations are zero for at least one of the covariates. In

scenario 22, categories B and C together contain a large share of the observations (40%). In scenario 23, these

two categories contain only 10% of the observations.
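The percentages quoted above follow directly from the category probabilities in table 5.24. A minimal Python sketch of this arithmetic, with the probabilities written as integer percentages to avoid rounding noise, is given below; the variable and function names are hypothetical.

```python
# (qA, qB, qC) in percent, copied from table 5.24.
scenarios = {
    "S20": (10, 5, 15),
    "S21": (20, 10, 10),
    "S22": (10, 20, 20),
    "S23": (40, 5, 5),
    "S24": (30, 20, 20),
}


def zero_share(q):
    """Percentage of observations with at least one covariate equal to zero."""
    return sum(q)


zero_shares = {name: zero_share(q) for name, q in scenarios.items()}
# Category D (both covariates positive) holds the remaining observations.
positive_shares = {name: 100 - share for name, share in zero_shares.items()}
```

The resulting zero shares are 30, 40, 50, 50 and 70 percent for S20 to S24, so category D contains between 70 and 30 percent of the observations.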

Scenario 21 was chosen for a detailed description of the results. The Pearson correlation between x1

and the outcome y-S21 is 0.69. Between x2 and y-S21, it is 0.72. The two covariates x1 and x2 have



a Pearson correlation of 0.45. Twenty percent of all observations are zero for both x1 and x2. In addition, each

covariate has another 10% of observations with zero where the respective other covariate is positive.

Again, the four proposed bivariate methods Bi-Sep, Bi-D1, Bi-D3, Bi-Sub and the two references MFP

and Robertson were used to analyze the generated datasets.

Figure 5.17 displays the fitted models for the first twenty datasets. The linear functional form seems to

be similar in Robertson, Bi-Sep, Bi-D3 and Bi-Sub. MFP and Bi-D1 select varying non-linear functions.

The estimates for the indicators, however, differ between the methods. Bi-D1 even estimates a negative

effect for z1. This can also be observed when taking a closer look at the actually selected functions. Robertson

by construction “selects” the correct linear functional form for x1 and x2 in all datasets. The corresponding effect

estimates β11 (M:2.0, SD:0.2) and β21 (M:2.0, SD: 0.1) are very close to the true coefficients. Estimates

for v1 (M:9.3, SD:4.4) and v2 (M:19.2, SD:3.7), however, fail to capture the underlying true effect in the

datasets. MFP selects a linear functional form for x1 in 76% of the datasets; in the other 24%, highly

varying FP2 functional forms were selected. For x2, MFP selects a linear functional form in only 1% of

the datasets; in 97.6% of the datasets an FP2 is selected with highly varying functional forms.

Bi-D1 never selected the correct linear functional form for x1. In 99.4% varying FP2 forms were selected.

This can also be observed for the functions selected for x2. In all 500 datasets FP2 functional forms were

selected. The estimated coefficient for z1 (M:-69.9, SD:3.1) is very different from the true value. The estimate

for the intercept (M:89.9, SD: 2.8), in contrast, is very high and appears to “correct” the estimation for observations

with x1 and/or x2 zero. Although visually the results of Bi-Sep seemed very similar to those of the other

three proposed methods in this thesis, its behavior and the corresponding results differ considerably. For

x1, in 98.6% of the datasets a linear functional form is selected. In 49.2% of the datasets, however, the

indicator v1 was dropped as the estimated effect did not improve model fit significantly. For x2, a linear

functional form was selected in 98.2% of the datasets. The indicator v2 was kept in all of the 493 datasets.

The estimates for v1 (M:12.6, SD:3.6) and v2 (M:21.2, SD:35.0) differ from the true effects. The true effects are not

additive, and thus v1 and v2 are not able to represent them. Bi-D3 selects a linear functional form

for x1 in 98.2% of the datasets. For x2, a linear functional form is selected in 98.4% of the datasets. The

estimates for z1 (M:18.2, SD:26.7), z2 (M:38.5, SD:25.8) and z3 (M:49.6, SD:7.6) are close to the true

coefficients. However, they seem to vary highly in the different datasets, as the SDs are relatively high.

All three estimates are lower than the true coefficients. Bi-Sub selects the correct linear functional form in

98.8% of the datasets for z4 and in 97.8% for z6. For the other two covariates z5 and z7, a linear functional

form is selected in 98.2% and in 99.0% of the datasets. The estimates for z1 (M:17.6, SD:17.8), z2 (M:38.1,

SD:23.7) and z3 (M:48.8, SD:26.3) are similar to those estimated by Bi-D3; however, they are overall lower

[Figure 5.17: Fitted models for the first twenty datasets of scenario S21, y against x1 and against x2. Panels: Robertson, MFP, Bi-D1, Bi-D3, Bi-Sep, Bi-Sub.]



Covariate x1
         Rob     MFP    Bi-D1   Bi-D3   Bi-Sep   Bi-Sub Z4   Bi-Sub Z6
Lin      -       76.0   0       0       49.2     0           0
FP1      -       24.0   0       0       0.6      0           0
FP2      -       0      0       0       0.4      0           0
Lin+z    100     -      0       98.2    49.4     98.8        97.8
FP1+z    -       -      0.6     0.8     0.4      0.8         1.8
FP2+z    -       -      99.4    1.0     0        0.4         0.4
z        -       -      0       0       0        0           0

Covariate x2
         Rob     MFP    Bi-D1   Bi-D3   Bi-Sep   Bi-Sub Z5   Bi-Sub Z7
Lin      -       1.0    0       0       0        0           0
FP1      -       1.4    0       0       0        0           0
FP2      -       97.6   0       0       0        0           0
Lin+z    100.0   -      0       98.4    98.6     98.2        99.0
FP1+z    -       -      0       1.0     0.4      1.4         0.2
FP2+z    -       -      100     0.6     1.0      0.4         0.8
z        -       -      0       0       0        0           0



Method   R2            MSE             MSE_A           MSE_B              MSE_C              MSE_D
         Mean   Sd     Mean    Sd      Mean    Sd      Mean      Sd       Mean      Sd       Mean    Sd
MFP      0.70   0.01   544.10  23.49   525.40  40.79   1,083.64  88.34    1,013.27  82.56    382.22  21.28
Rob      0.70   0.01   543.59  23.45   533.26  40.34   1,043.90  77.15    1,045.17  79.76    380.05  21.13
Bi-Sep   0.70   0.01   544.18  23.65   529.23  40.38   1,064.14  82.83    1,029.90  79.87    381.56  21.76
Bi-D1    0.80   0.01   359.30  17.29   361.38  37.50   355.74    51.16    357.08    51.63    359.57  21.06
Bi-D3    0.80   0.01   359.92  17.33   361.38  37.50   356.26    51.18    357.21    51.80    360.50  21.07
Bi-Sub   0.80   0.01   359.14  17.29   361.38  37.50   352.54    51.02    354.16    51.38    360.32  21.09


than the true coefficients and the mean estimates of Bi-D3.

Beside these differences in the selected functional forms, different results in the overall fit can be

observed. In this setting, Robertson, MFP and Bi-Sep lead to a lower R2 (0.7 compared to 0.8 for the

other methods). This can also be observed in the MSE which is considerably higher for Robertson, MFP

and Bi-Sep (around 540 compared to 360 for the other three methods). If a closer look at the MSE is

taken, it can be observed that errors are increased mainly in categories B and C for the three

aforementioned methods. The proportion of observations in each of the two categories is moderate at 10%.

However, the effects β02 and β03 are relatively high at 40 and 50. As category A contains 20% of the

observations, their impact on model estimation is higher. Robertson, MFP and Bi-Sep fail to model the

true underlying relationship correctly. Details can be found in table 5.26.

To sum up, one could observe that Robertson, MFP and Bi-Sep performed worse than the other three

methods in this setting. The reason might be the non-additive effects for z1, z2 and z3. Therefore, the

three categories could not be represented by only two indicators v1 and v2. The amount of observations in

the four categories seemed to be less influential than the relationship to the outcome.
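The role of non-additivity can be made explicit with a small Python sketch. It assumes that v1 and v2 indicate x1 = 0 and x2 = 0, so that a model with only these two indicators implies an effect in category A (both covariates zero) equal to the sum of the effects in the two categories with exactly one zero. The function name `additivity_gap` is hypothetical.

```python
def additivity_gap(effect_a, effect_b, effect_c):
    """Effect in category A minus the sum of the effects in categories B and C.

    A gap of zero means that two additive indicators can reproduce all
    three category effects; any other value cannot be represented.
    """
    return effect_a - (effect_b + effect_c)


# Investigation 3a: (beta01, beta02, beta03) = (20, 40, 50) from table 5.24.
gap_3a = additivity_gap(20, 40, 50)

# Scenario S19: two symmetric effects of 30, which add up to 60 in category A.
gap_s19 = additivity_gap(60, 30, 30)
```

For Investigation 3a the gap is -70, whereas for scenario S19 it is 0. Under the stated assumption, this is in line with the observation that Bi-Sep performed well in scenario S19 but not in scenario S21.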

If the results of scenario 21 are compared to the other settings in this investigation, one can observe

that, with respect to the overall goodness of fit which is displayed in figure 5.18b, the relation to the

outcome (effects β01, β02 and β03) seems to be more important than the amount of observations in the



different categories. However, considerable differences can be observed in the MSE in categories for the

different scenarios. Scenario 20 leads to considerable differences in category B, although the amount of

observations is very low. Due to this small number of observations, their weight in model selection is

relatively low. Robertson, MFP and Bi-Sep, none of which is capable of modeling non-additive effects

of v1 and v2, lead to higher MSEs in this category. Thus, the selected models do not seem to fit well.

This again supports the conclusion that the relation to the outcome of the

observations with zero is more important than the actual number of observations.

Scenario 22 leads to overall similar results as scenario 20. However, the behavior of the MSE in category

A is considerably different to the other four scenarios. Robertson, MFP and Bi-Sep lead to very high

errors in this category. In this scenario, the amount of observations in category A is smaller than in the

other two categories. Thus, model selection will be influenced more by observations in categories B and

C both of which have a relationship with a higher effect to the outcome.

In scenario 23, the overall differences between the six methods are smaller than in the other four

scenarios. This can be seen in figure 5.18b. Here, categories B and C have the same amount of observations;

both contain only 5%. The effect for category A, β01, is the smallest. The MSEs in categories A and D are

very similar for all six methods in this scenario. Even with only a small amount of observations in the zero

categories, the overall MSE is still affected, and Robertson, MFP and Bi-Sep lead to higher MSEs. This

is due to the true relation to the outcome for observations of categories B and C as β02 and β03 are larger

than β01.

Scenario 24 leads to the highest differences in the overall MSE when comparing Robertson, MFP and

Bi-Sep to the other 3 techniques. It is the only setting in which the MSE in category D also differs

considerably between the 6 methods. The simulated data had the highest amount of observations for

which at least one of the covariates x1 and x2 is zero, 70% in total. Thus, with a very high amount

of observations with zero the overall functional relationship is heavily influenced by the observations with

zero, especially in MFP, but also for Robertson and Bi-Sep which cannot model three distinct outcomes

for three different types of observations with value zero.

Overall, one could observe that the smaller the number of observations in a category, the larger the differences

between Robertson, MFP, Bi-Sep and the other three approaches in MSE in this category. However, the

effect sizes β01, β02 and β03 seem to have a considerable influence on model selection. As in all five

scenarios three different values for β01, β02 and β03 were chosen which were not additive, Bi-D3 and

Bi-Sub obviously had an advantage. In the next section, a non-linear functional relationship between x1

and x2 and the outcome will be considered in addition.

[Figure 5.18: Scenarios S20 - S24, comparison of methods. Panels: MSE in Category A, Category B, Category C and Category D; (b) Overall MSE.]


SPECIFICATIONS 3b                 S25    S26    S27    S28    S29
Means x1, x2: µ := (µ1, µ2)       (7, 5)
Standard deviation                (2, 2)
qA                                0.1    0.2    0.1    0.4    0.3
qB                                0.05   0.1    0.2    0.05   0.2
qC                                0.15   0.1    0.2    0.05   0.2
Covariance ρ                      (1 0 / 0 1)
True functional relationship      y = β01 v1 + f(x1) + β02 v2 + f(x2)
f(x1)                             0.5 x1^2
f(x2)                             0.2 x2^3
(β01, β02)                        (25, 25)

Table 5.27: Specifications of scenarios S25-S29. Numbers of observations in categories A, B and C are varied, effect size of v1 and v2 is symmetrical, continuous functional relationship is non-linear.

Investigation 3b: Influence of number of observations in the zero subsets - non-linear functional relationship

In this last investigation, in addition to varying the number of observations in the

categories A, B and C, the true relationship between x1 and x2 and the outcome is non-linear, with a quadratic form for

x1 and a cubic polynomial for x2. The same functional forms as in Investigation 2 were chosen to make

the results comparable to that setting. Furthermore, the true functional relationship contains two symmetric

effects for the indicators v1 and v2 ((β01, β02) = (25, 25)). As the results in investigation 3a were heavily

influenced by the selected effects for z1, z2 and z3, a simpler setting is selected here so that differences in

the results can be traced back to the number of observations in the categories. Again, five scenarios with

different proportions of observations with zero, ranging from 30% to 70% of the observations in the

datasets, will be analyzed. The settings are the same as in Investigation 3a. Details can be found in table

5.27.

Scenario 26 was chosen for a detailed description of the results as it has the same numbers of observations

in categories A, B and C as scenario 21 which was described in the previous investigation. The distribution

of x1 and x2 is the same as in Investigation 2 and, thus, different to Investigation 3a. No scatter plot is

given here. The empirical Pearson correlation in the observations of all 500 datasets between the outcome

y-S26 and x1 is 0.22. Between y-S26 and x2, it is 0.55. The two covariates x1 and x2 have a Pearson

correlation of 0.38. Figure 5.19 displays the fitted functional forms of the first 20 datasets. One can

already observe that Bi-D3 and Bi-Sep select linear functional forms for x1 in some datasets. The visual

functional shape of the functions selected for x2 seems similar for all methods except for Robertson.

If a detailed look at the actually selected functional forms is taken, however, several differences can be

observed. Robertson always fits a linear model. The coefficients for x1 (M:7.0, SD:0.4) and x2 (M:17.1,

SD: 0.8) cannot be compared to the true coefficients as the true functional form is non-linear. Estimates

for the indicators v1 (M:47.3, SD:3.4) and v2 (M:74.2, SD:3.9) are very different from the true coefficients.

[Figure 5.19: Fitted functional forms of the first 20 datasets of scenario S26, y against x1 and against x2. Panels: Robertson, MFP, Bi-D1, Bi-D3, Bi-Sep, Bi-Sub.]


The mean intercept (M:-71.5, SD:4.6) is negative. The selected model is overall very different to the

true functional relationship. The second reference, MFP, never selects the correct model for x1. In all

500 datasets an FP2 with highly varying functions is selected. This can also be observed for x2. In all

datasets an FP2 is selected. However, in 499 of the fitted FP2 functions a cubic term is included. The

overall selected functional forms vary highly.

Bi-D1 selects an FP2 functional form for x1 in all 500 datasets with highly varying forms (217x FP(1, 1),

117x FP(0.5, 1), 86x FP(-0.5, 3)). This is also the case for x2. In 499 of the datasets, however, one of

the terms is cubic, like the true functional relationship. The mean estimate for z1 (M:2.3, SD:3.2) is

relatively low. Selected functional forms are similar to those selected by MFP. Bi-Sep selects the correct

quadratic functional form in 445 of the datasets (89.0%) for x1. In 15 datasets an FP(3) is selected. In

around 5% of the datasets, a linear functional form is selected for x1. In 1 dataset, the indicator v1

was dropped while an FP2 functional form was selected for x1. For x2 in 94.0% of the datasets, the

correct cubic functional form was selected. In 6% of the datasets a more complex FP2 functional form

was selected. The indicator v2 was never dropped. The mean estimates for v1 (M:25.4, SD: 9.2) and v2

(M:25.1, SD: 2.3) are very close to the true coefficients. Bi-D3 selects the correct quadratic functional

form in 443 of the datasets (88.6%) for x1. In 16 datasets (3.2%) a cubic FP1 is selected. In around 5% of

the datasets, a linear functional form is selected for x1. For x2, in 469 datasets (93.8%) the correct cubic

functional form was selected. In 6.2% of the datasets, Bi-D3 selects varying FP2 functions for x2. The

mean estimates for z1 (M:50.4, SD:9.5), z2 (M:25.5, SD:9.4) and z3 (M:25.1, SD:2.6) are close to the true

coefficients. As the true functional relationship contains only two effects (for v1 and v2), the coefficient

for z1 can be estimated as the sum of both effects. Thus, the performance of the selected functions for

Bi-D3 is relatively good. Bi-Sub selects for z4 (x1 if x2 is zero) in 82.8% of the datasets a linear functional

relationship. In only 19 datasets, the correct quadratic functional form is selected. In 61 datasets, a

cubic functional relationship was selected. For z6 (x1 if x2 is positive), in 418 of the datasets the correct

quadratic functional form is selected. In 7.8% of the datasets, a linear functional form is selected. For z5

and z7, the corresponding covariates for x2, the correct cubic functional form is selected in 92.6% and 93.2% of the datasets.

For z5, an FP(2) is selected in 15 datasets, and in 4.8% more complex FP2 functional

forms were selected. For z7, whenever Bi-Sub selected an FP1, it was always cubic. In 6.8% of the datasets,

however, more complex FP2 functional forms were selected. The mean estimates for the indicators z1

(M:50.4, SD:11.8), z2 (M:25.7, SD:13.6) and z3 (M:9.6, SD:32.4) are not as close to the true coefficients

as were the results of Bi-D3 and Bi-Sep, as the mean coefficient for z3 differs considerably from the true

coefficient. Further details on the selected complexities of the functional forms of all methods can be




Covariate x1
         Rob     MFP    Bi-D1   Bi-D3   Bi-Sep   Bi-Sub Z4   Bi-Sub Z6
Lin      -       0      0       0       0        0           0
FP1      -       0      0       0       0        0           0
FP2      -       100    0       0       0.2      0           0
Lin+z    100     -      0       4.3     4.6      82.8        7.8
FP1+z    -       -      0       91.8    92.0     16.0        87.6
FP2+z    -       -      100     3.4     3.2      1.2         4.6
z        -       -      0       0       0        0           0

Covariate x2
         Rob     MFP    Bi-D1   Bi-D3   Bi-Sep   Bi-Sub Z5   Bi-Sub Z7
Lin      -       0      0       0       0        0           0
FP1      -       0      0       0       0        0           0
FP2      -       100    0       0       0        0           0
Lin+z    100.0   -      0       0       0        0           0
FP1+z    -       -      0       93.8    94.0     95.2        93.2
FP2+z    -       -      100     6.2     6.0      4.8         6.8
z        -       -      0       0       0        0           0





found in table 5.28.

Besides the differences in the selected functional forms, the results for the overall goodness of fit with respect

to R2 and MSE differ. The mean R2 for Robertson is 0.68; for all other methods it is 0.8. Differences

between the methods can also be observed in the MSE displayed in table 5.29. Robertson leads to an

increased overall MSE (447.5). Especially for observations in categories B, C and D errors are very high.

A similar behaviour, but with smaller overall MSE (285.9) can be observed for Bi-D1. Again, errors in

categories B, C and D are higher than for the other four methods including MFP. MFP, Bi-Sep, Bi-D3

and Bi-Sub lead to very similar results with respect to overall performance measured in MSE.
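For reference, a minimal Python sketch of the coefficient of determination is given below. It uses the common definition R2 = 1 - SS_res / SS_tot; whether an adjusted variant was used for the reported values is not stated here, so this definition is an assumption of the sketch, as is the function name `r_squared`.

```python
def r_squared(y, y_hat):
    """Coefficient of determination, 1 - SS_res / SS_tot."""
    mean_y = sum(y) / len(y)
    # Residual sum of squares: deviation of observations from predictions.
    ss_res = sum((obs - pred) ** 2 for obs, pred in zip(y, y_hat))
    # Total sum of squares: deviation of observations from their mean.
    ss_tot = sum((obs - mean_y) ** 2 for obs in y)
    return 1.0 - ss_res / ss_tot
```

Since SS_res equals the MSE times the number of observations, a method with a higher overall MSE on the same data necessarily has a lower R2 under this definition.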

To sum up scenario 26, one can say that if the true underlying functional relationship is non-linear,

the method proposed by Robertson by construction leads to incorrectly selected functional forms and overall

worse performance than the other methods. The functional forms actually selected by the other methods differ.

Functions selected by Bi-Sep are the most stable and closest to the true functional relationship, including the

estimates for the indicators. Bi-D3 leads to results very similar to Bi-Sep, with a comparable performance

with respect to the selected functional forms and the estimates for the indicators. However, more degrees

of freedom are needed for model estimation. This can also be observed for Bi-Sub; however, function

selection is far less stable than for Bi-Sep and Bi-D3. Estimates for the coefficients of the indicators are

less precise than those of Bi-Sep and Bi-D3. Therefore, in this setting Bi-Sep leads to the overall best



results.

Comparing scenario 26 to the other four settings in this investigation, one can observe rather similar

results. Interestingly, one can observe that in case of a symmetric effect of two indicators (v1 and v2) and a

true non-linear functional relationship between x1 and x2 and the outcome, the number of observations in

the categories A, B, C, and D does not considerably influence the overall performance of the methods with respect to MSE,

which can be seen in figure 5.20b. Regarding the error in the four categories, the largest

differences can be found in all 5 settings in categories B and D, where Robertson leads to higher errors.

Furthermore, one can also observe that the smaller the amount of observations in category D (especially

in scenario 29 with only 30%), the more often a linear functional form is selected for x1 in Bi-Sep, Bi-D3

and Bi-Sub. In all five scenarios, MFP and Bi-D1 select varying FP2 functional forms for both x1 and x2

in all 500 datasets.

5.5.3 Summary of simulation results

Several different questions for the investigation of methods for the analysis of two covariates with a SAZ

and a continuous outcome were addressed in the 29 scenarios presented in this thesis. Model selection for

covariates with SAZ is highly dependent on several influencing factors. The different scenarios showed

that even if only one single characteristic was varied, the setting of all other parameters heavily influenced

what types of model were selected and which of the methods performed best. A short summary of the

four bivariate SAZ procedures and the two references will be given.

In all settings, MFP selects varying, complex functional forms. Function selection is very unstable. The

overall performance with respect to R2 and MSE is in many settings surprisingly acceptable. However,

MFP leads to high errors if there are three different non-additive binary effects for z1, z2 and z3. Robertson

can only fit linear functional relationships with two indicator effects. In easy settings, this method performs

well and needs a low number of degrees of freedom for estimation. If the true underlying functional

relationship is more complex or if there are three different non-additive binary effects for z1, z2 and z3,

the estimated models have high errors and are less precise than those fitted by the other methods.

The four bivariate SAZ methods also reveal some differences under certain conditions. In many

situations Bi-D1 leads to slightly increased errors if there are two symmetric or asymmetric effects of the

indicators. Surprisingly, the overall performance with respect to errors is similar if the true functional

relationship contains effects of three indicators. In this situation errors are comparable to Bi-D3 and

Bi-Sub. Function selection is, however, very unstable and very complex functional forms are selected even

in “easy” settings.

[Figure 5.20: panels Category A, Category B, Category C, Category D and (b) Overall MSE; horizontal axis: scenarios S25, S26, S27, S28, S29.]

Figure 5.20: Simulation Study - Comparison of methods for two covariates with SAZ: MSE comparison in scenarios S25 - S29

Bi-Sep performs well in situations in which the true functional relationship contains effects of two

indicators. Selected functional forms are very stable in these settings. The option of dropping parts of

the final selected model leads to sparse models. If there are three non-additive indicator effects, however,

Bi-Sep leads to higher errors and function selection is less stable. Bi-D3 performs well in all settings.

Function selection is very stable. With three estimated effects for three types of observations with “value

zero”, this procedure is very flexible with regard to the true underlying functional relationship. Both

additive and non-additive effects are modeled. In other procedures, it could be observed that a lack of this

flexibility often led to worse model fits. The results for Bi-Sub are very similar to those of Bi-D3. The

selected models are, however, more complex due to the additional flexibility of modeling the continuous

relationship separately in subgroups in which the second covariate is zero. As in none of the scenarios

the true functional relationship contained a different functional form in one or both of these subgroups,

models estimated by Bi-D3 already lead to acceptable fits. Models selected by Bi-Sub need more

degrees of freedom overall in the estimation. In the settings investigated so far, one could observe that Bi-D3 on

average leads to the best results considering the stability of the selected functional forms, the complexity of the

selected functional forms and errors.

In all of the scenarios, one could observe that the most relevant criterion for the decision which method

to use is the true relationship between the two covariates and the outcome, especially, the true effects

in the zero categories. Investigation 3a showed that the asymmetric choice of the effect of the three

indicators dominated model selection more than varying the sample size in the four categories. Finally,

what conclusion can be drawn from the results of this simulation study for real data settings? The current

example by Fehringer et al. (2017) [22] on “alcohol and lung cancer risk among never smokers” was based

on the assumption that the dose-response relationship in never-smokers (or in more general terms, if a

second covariate is zero) is different to the dose-response relationship of alcohol and lung cancer risk in

smokers. This is e.g. the assumption that is made in Bi-Sub. Coming back to the initial definition of a

statistical model and its different aims, general recommendations which strategy to use in which setting

are highly dependent on these aims. All of the bivariate techniques have advantages in specific settings

when it comes to the explanation of functional relationships. Therefore, subject matter knowledge on the

true underlying functional relationship, or at least some assumptions on it, should be used in the selection

of an adequate method. The prerequisites for the use of some of the methods are very strong and should,

therefore, be checked in advance.

6 More than two covariates with a spike at zero

Four strategies were proposed for the bivariate SAZ situation. If one is faced with the situation that

there are more than two covariates with SAZ, some additional points have to be considered. Bi-Sep can

be extended straightforwardly by including an indicator for the third covariate. For the extension of Bi-D3,

further indicator variables are needed. Eight different categories of observations are present and, thus,

seven indicator variables are necessary to separate these categories. Bi-D1 could be extended easily. Only

one indicator, which is one if all three covariates are zero, would be included in the model. Bi-Sub would

become very complex, as seven indicator variables would have to be included, and furthermore, there

are six different combinations for which one of the covariates is zero and at least one covariate is positive

and, thus, several different functional forms are possible in theory. Thus, especially extensions of Bi-D3

and Bi-Sub do not seem feasible. A key issue could be the dependence structure of the covariates and

their relation to the outcome. To gain insight, it is proposed to investigate this structure for the binary

variables (0, > 0) for all covariates with SAZ. This can be done by building a log-linear model to test

for dependence between the covariates. Loglinear models are a method for the evaluation of the dependence

structure in multi-way tables. They model the logarithm of the expected cell frequencies. Using this information,

one can try to choose a suitable strategy among the already proposed methods. Some first thoughts on possible

extension of the so far proposed bivariate methods are presented in this chapter.
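The counting argument above can be made concrete with a small sketch. It is an illustration only and not the procedure implemented for this thesis: all function and variable names are hypothetical, and the choice of the pattern with three positive covariates as reference category is an assumption. For three covariates with SAZ, the sketch enumerates the eight zero/positive patterns, builds the seven indicator variables needed to separate them, and lists the six mixed patterns in which at least one covariate is zero and at least one is positive.

```python
from itertools import product

def zero_pattern(x1, x2, x3):
    """Zero pattern of one observation: 1 if the covariate is zero, 0 if positive."""
    return tuple(int(x == 0) for x in (x1, x2, x3))

# All 2^3 = 8 possible zero/positive patterns for three SAZ covariates.
ALL_PATTERNS = list(product((0, 1), repeat=3))

# Assumption: the pattern with all three covariates positive is the reference.
REFERENCE = (0, 0, 0)
INDICATOR_PATTERNS = [p for p in ALL_PATTERNS if p != REFERENCE]

# Patterns with at least one zero and at least one positive covariate.
MIXED_PATTERNS = [p for p in ALL_PATTERNS if 0 < sum(p) < 3]

def indicator_row(x1, x2, x3):
    """Seven 0/1 indicators, one per non-reference pattern."""
    pattern = zero_pattern(x1, x2, x3)
    return [int(pattern == p) for p in INDICATOR_PATTERNS]

def all_zero_indicator(x1, x2, x3):
    """Single indicator of a Bi-D1-type extension: 1 if all covariates are zero."""
    return int(zero_pattern(x1, x2, x3) == (1, 1, 1))
```

Each observation activates at most one of the seven indicators, which is why a model of this kind needs many parameters as soon as a third covariate is added.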

6.1 Using Log-linear models for the analysis of three or more covariates with SAZ

The general principle of log-linear models is presented in chapter 2.4.3. For three or more spike variables,

a preliminary step is proposed to investigate the dependence structure of the dichotomized (0,>0) SAZ

covariates in a log-linear model. In the situation of three covariates with SAZ, a step-up procedure is

proposed to derive a suitable log-linear model. Chi-squared goodness-of-fit tests are performed with the

null hypothesis that the expected frequencies estimated by the model follow the same distribution as the

observed frequencies. That is, the smaller the p-value, the less likely it is that the estimated

frequencies follow the same distribution as the observed frequencies. Thus, models with small

p-values in the Chi-squared goodness-of-fit test fit worse than models with higher p-values. First, a loglinear model

under the assumption of mutual independence is fitted (for definitions see section 2.4.3). Then single

two-way dependences between any two of the covariates are added to the model. In the next step, the

combinations of two two-way dependences are added to the model. The next model contains all three

two-way dependences. In the last step, the saturated model, which additionally contains a three-factor

interaction, is fitted. The best loglinear model, defined as the model with the highest p-value in the

Chi-squared goodness-of-fit test, will then be used for the selection of a strategy for model selection in the

original research question, modelling the influence of three covariates with SAZ on an outcome.
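The step-up sequence and the selection rule can be sketched in a few lines. The sketch below is a hedged illustration with hypothetical names, not the software used for this thesis. It lists the eight hierarchical models for three binary factors A, S and R, computes the degrees of freedom as the number of cells minus the number of model parameters, and picks the model with the highest p-value. The Chi-squared statistics are the rounded values reported in table 6.2; the tail probabilities use closed forms that hold only for 1 to 4 degrees of freedom. Excluding the saturated model from the comparison is an assumption of this sketch, made because no test is possible with zero degrees of freedom.

```python
import math

# Eight hierarchical models of the step-up procedure, given by their
# interaction terms (intercept and main effects are always included).
# Chi-squared statistics: rounded values taken from table 6.2.
MODELS = [
    ("(A,S,R)",    [],                        386.5),
    ("(R,AS)",     ["AS"],                    303.5),
    ("(A,SR)",     ["SR"],                    280.8),
    ("(S,AR)",     ["AR"],                     91.9),
    ("(AR,AS)",    ["AR", "AS"],               45.4),
    ("(AR,SR)",    ["AR", "SR"],               18.9),
    ("(AR,SR,AS)", ["AR", "SR", "AS"],          0.05),
    ("(ASR)",      ["AR", "SR", "AS", "ASR"],   0.0),
]

def degrees_of_freedom(interactions, n_cells=8):
    """Cells minus parameters: intercept, three main effects, and one
    parameter per interaction term (all factors are binary)."""
    return n_cells - (1 + 3 + len(interactions))

def chi2_upper_tail(x, df):
    """P(X >= x) for a chi-squared variable, closed forms for df = 1..4."""
    if df == 1:
        return math.erfc(math.sqrt(x / 2))
    if df == 2:
        return math.exp(-x / 2)
    if df == 3:
        return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)
    if df == 4:
        return math.exp(-x / 2) * (1 + x / 2)
    raise ValueError("closed form implemented for df = 1..4 only")

def select_best(models):
    """Return (name, p_value, df) of the non-saturated model with the highest p-value."""
    best = None
    for name, interactions, chi2 in models:
        df = degrees_of_freedom(interactions)
        if df == 0:
            continue  # saturated model: perfect fit, no test possible
        p_value = chi2_upper_tail(chi2, df)
        if best is None or p_value > best[1]:
            best = (name, p_value, df)
    return best
```

Because the input statistics are rounded, recomputed p-values can differ slightly in the third decimal from those printed in table 6.2.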

Mutual independence: If the best loglinear model only contains the three main effects, it is proposed

to use an extended version of Bi-Sep. Here, for each SAZ covariate a binary indicator vi is added to the

model.

Conditional (in)dependence: If the best loglinear model contains one or two conditional dependencies,

it is proposed that the strongest dependency between two covariates, determined by the Chi-squared

goodness-of-fit test (smallest p-value when left out of the model), is modelled using one of the four

bivariate strategies proposed before. For the third covariate, a binary indicator v is added without a further

procedure.

Three pairs of conditional dependencies or three-factor interaction: This situation is more complex

than the first two, and no general recommendation can be given here. As already stated, Bi-D3 and

Bi-Sub could be extended by defining further indicators. However, the fitted models become very complex.

For reasons of simplicity, the extended version of Bi-Sep might also be a good option in this setting, as

in the bivariate case it could already be seen that model selection depends mainly on the relationship

between the SAZ covariates and the outcome. The relationship between the covariates themselves had only minor

influence. It is again proposed to consider the strongest of the conditional dependencies and to use one of the

bivariate strategies for those two covariates; for the third covariate, only the indicator is to be included.

A data example with three covariates with SAZ will be used to illustrate the procedure described above.

6.2 Case Study IV - Study on Lung Cancer

In this hospital-based case-control study with 1004 cases and 1004 controls the effect of occupational

factors on lung cancer in patients was assessed. Data was collected between 1988 and 1993 in hospitals in

Bremen and Frankfurt. Controls, which were randomly drawn from the mandatory residence registry, were

matched for region, sex and age. Exposure assessment was based on 33 supplementary questionnaires.

For further details on the study design see Ahrens et al. [2], Jöckel et al. (1998a) [34] and Jöckel et

al. (1998b)[35]. Three covariates with SAZ, smoking, working life days of asbestos exposure and years

in high-risk occupations (so-called ’list A jobs’, with likely exposure to carcinogenic substances) will be

considered. This situation brings up further challenges in the analysis. The proportion of observations with

value zero is relatively high in all three covariates. Most of the subjects (82 %) did not work in a high-risk

occupation. There are 21 % non-smokers in the study and 68 % who had no asbestos exposure. The full

analysis set contains all observations for which measurements for smoking, working life days of asbestos

exposure and years in high-risk occupations were present. For smoking measured in packyears, there were

44 missing observations. The full analysis set thus contains 1964 observations.

Figure 6.1 shows the distribution of the three covariates smoking, years in high risk occupation and

life time working days in asbestos exposure separated for cases and controls. One can observe that all

three covariates have a high amount of observations with value zero. In the histogram for pack years,

one can observe that the amount of non-smokers in controls seems considerably higher than in cases. For

years in high risk occupation and life time working days in asbestos exposure, the amount of observations

with value zero seems rather similar in cases and controls. Relationships and dependence

between the covariates will be assessed in the following section.

6.2.1 Univariate and Bivariate Analyses

Dr. Eva Lorenz (cooperation partner of the DFG project) performed univariate and bivariate analyses of

the lung cancer datasets in her PhD thesis. With the univariate procedure proposed by Royston (2010)

[55] and presented in chapter 4 in this thesis, she analysed the effect of smoking and lifetime years of

occupational exposure to list A jobs on lung cancer. Furthermore, she investigated the effects of smoking

of cigarettes and pipes and duration of working in list A jobs on lung cancer using the four bivariate

strategies proposed in this thesis and in Jenkner et al (2016) [33]. For more details see Lorenz (2015)[40].

[Figure 6.1: density histograms, each shown by case/control status (0=control, 1=case): (a) Smoking in Packyears; (b) Years in high risk occupation (list A job); (c) Working life days with asbestos exposure.]

Figure 6.1: Lung Cancer Study: Distribution of smoking, years in high risk occupation and life time working days in asbestos exposure separated for cases and controls.


                         Loglinear model⁵
S¹  R²  A³  O-freq⁴   (A,S,R)  (R,AS)  (A,SR)  (S,AR)  (AR,AS)  (AR,SR)  (AR,SR,AS)  (ASR)
0   0   0   396       211.4    232     232.9   350.3   384.5    386.0    395.4       396
0   0   1   337       453.8    433.2   500.1   314.9   300.5    347.0    337.6       337
0   1   0   184       317.1    348     295.5   178.1   195.5    166.0    184.6       184
0   1   1   746       680.7    649.8   634.5   819.7   782.4    764.0    745.4       746
1   0   0   19        39.0     18.4    17.5    64.7    30.5     29.0     19.6        19
1   0   1   36        83.8     104.4   37.5    58.1    72.4     26.0     35.4        36
1   1   0   27        58.      27.6    80.1    32.9    15.5     45.0     26.4        27
1   1   1   225       125.7    156.6   171.9   151.3   188.6    207.0    225.6       225

Table 6.1: Lung Cancer Study: Fitted values for loglinear models applied to lung cancer data
¹ smoking in pack-years 0 (S=1), or > 0 (S=0)
² days in high risk occupation 0 (R=1), or > 0 (R=0)
³ asbestos exposure 0 (A=1), or > 0 (A=0)
⁴ observed frequencies in the lung cancer data set
⁵ detailed definitions of the models can be found in table 6.2

6.2.2 Analysis: more than two SAZ covariates

So far only two covariates with SAZ were considered in the analyses of Lorenz (2015) [40]. The lung

cancer data set will be used to illustrate the analysis strategy for three covariates with SAZ described

above. As a first step, dependence and relationships between the SAZ covariates as described in section

6.1 will be assessed.

Pre-analysis of relationship of SAZ covariates: For the three dichotomized variables, eight loglinear

models were built. The step-up procedure described above was used to derive a suitable log-linear model.

The expected frequencies estimated with the models can be found in table 6.1. General notation and

goodness of fit of the models can be found in table 6.2. For the model with main effects only, the numbers

of expected cases in the eight categories deviate severely from the number of observed cases in the eight

cells, clearly illustrating that correlations between the three binary variables exist. Adding mutual

dependencies between any two covariates to the model improves the expected frequencies. However, they

still deviate highly from the observed frequencies. Adding a dependence between asbestos exposure and

years in high risk occupation leads to the lowest χ2 value. Therefore, this dependence is kept while

others are added according to the step up procedure. This relationship is reasonable as one aspect of

the definition of high risk occupation is asbestos exposure. The model that leads to expected frequencies

which are closest to the observed ones is the one including all two-way interactions.
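The fitted values in table 6.1 can be checked with a short sketch. It uses iterative proportional fitting, a standard algorithm for hierarchical log-linear models; whether the original analysis used this algorithm or another fitting routine is not stated here, so this choice is an assumption, as are all function names. The observed counts are taken from table 6.1, with cells keyed as (S, R, A), and the Pearson statistic is used for the Chi-squared value.

```python
# Observed counts from table 6.1, keyed by (S, R, A).
OBSERVED = {
    (0, 0, 0): 396, (0, 0, 1): 337, (0, 1, 0): 184, (0, 1, 1): 746,
    (1, 0, 0): 19,  (1, 0, 1): 36,  (1, 1, 0): 27,  (1, 1, 1): 225,
}
AXIS = {"S": 0, "R": 1, "A": 2}  # position of each factor in the cell key

def ipf(observed, margins, n_iter=200):
    """Fit a hierarchical log-linear model by iterative proportional fitting.

    margins lists the margins the fitted table must reproduce,
    e.g. ["S", "AR"] for the model (S, AR)."""
    fitted = {cell: 1.0 for cell in observed}
    for _ in range(n_iter):
        for margin in margins:
            idx = [AXIS[letter] for letter in margin]
            obs_sum, fit_sum = {}, {}
            for cell in observed:
                key = tuple(cell[i] for i in idx)
                obs_sum[key] = obs_sum.get(key, 0.0) + observed[cell]
                fit_sum[key] = fit_sum.get(key, 0.0) + fitted[cell]
            # rescale every cell so that this margin matches the observed one
            for cell in fitted:
                key = tuple(cell[i] for i in idx)
                fitted[cell] *= obs_sum[key] / fit_sum[key]
    return fitted

def pearson_chi2(observed, fitted):
    """Pearson goodness-of-fit statistic: sum of (O - E)^2 / E over all cells."""
    return sum((observed[c] - fitted[c]) ** 2 / fitted[c] for c in observed)

independence = ipf(OBSERVED, ["A", "S", "R"])        # model (A,S,R)
s_ar = ipf(OBSERVED, ["S", "AR"])                    # model (S,AR)
all_two_way = ipf(OBSERVED, ["AR", "SR", "AS"])      # model (AR,SR,AS)
```

Under these assumptions, the fitted value of the first cell and the Pearson statistic of the mutual independence model agree with the entries 211.4 and 386.5 in tables 6.1 and 6.2 up to rounding.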

Now, it has to be decided how to incorporate this information in the analysis of the endpoint “lung

cancer”. It is proposed to use any of the bivariate strategies for the asbestos exposure and years in high

risk occupation, and include smoking with an indicator separately.

Symbol       Loglinear model                                        df   χ2      p-value¹
(A,S,R)      log µ = λ + λA + λS + λR                               4    386.5   <0.001
(R, AS)      log µ = λ + λA + λS + λR + λAS                         3    303.5   <0.001
(A, SR)      log µ = λ + λA + λS + λR + λSR                         3    280.8   <0.001
(S, AR)      log µ = λ + λA + λS + λR + λAR                         3    91.9    <0.001
(AR,AS)      log µ = λ + λA + λS + λR + λAS + λAR                   2    45.4    <0.001
(AR,SR)      log µ = λ + λA + λS + λR + λAR + λSR                   2    18.9    <0.001
(AR,SR,AS)   log µ = λ + λA + λS + λR + λAR + λAS + λSR             1    0.05    0.827
(ASR)        log µ = λ + λA + λS + λR + λAR + λAS + λSR + λASR      0    0.0     .

Table 6.2: Lung Cancer Study: Labeling of loglinear models and results for the whole dataset and separately for cases and controls, star indicates the selected model in the respective step. The df equals the number of cell counts minus the number of model parameters.
¹ p-value for χ2 statistic

Analysis of the endpoint risk of lung cancer: A short summary of the analysis of the actual endpoint

is given here. The statistical methods are similar to those of the bivariate analyses, which were e.g. used in the

analysis of the laryngeal cancer study. The only extension is that in each model, an additional adjustment

covariate with binary indicator is added. Table 6.3 contains the models estimated by the four techniques

and the models fitted with standard logistic regression and logistic regression with the extension proposed

by Robertson et al (1994) [50].

Both standard logistic regression and logistic regression with additional binary indicators can only

estimate linear functional relationships. The selected functional forms for all four bivariate approaches are

the same. For asbestos exposure, all methods select a linear functional relationship. This is the same

for “years worked in high risk occupation (risk)” for which also a linear functional form is selected. For

smoking, each of the four bivariate methods selects an FP(1,1) with a linear and a logarithmic component. The

estimated coefficients are also very similar for all three covariates. Bi-Sep drops the indicator variables

for asbestos exposure and for smoking. For years in high risk occupation only the binary indicator is kept

in the final model. With regard to deviance, all four non-linear techniques lead to lower deviances than

the two references (around 50 and 30 points lower). Bi-D3 and Bi-Sub lead to the lowest deviances.

Overall, it can be observed that smoking has the strongest influence of the three covariates on lung cancer

in terms of effect sizes. However, it can be observed that if years in high risk occupation are included with

binary indicator as in Bi-Sep, the effect estimate is higher than in the reference models. This can also be

observed for z2 which indicates if asbestos exposure is zero and years in high risk occupation is positive in

both Bi-D3 and Bi-Sub. This could be interpreted to mean that, even without asbestos exposure, working in high

risk occupations influences the risk of lung cancer.
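The FP(1,1) form selected for smoking can be written out explicitly. In the usual fractional polynomial convention, a repeated power p yields the two terms x^p and x^p ln(x), so that FP(1,1) reads β1 x + β2 x ln(x); under this reading, the two terms correspond to the rows "Smoking (1)" and "Smoking (log)" in table 6.3. The sketch below evaluates this function with the Bi-Sep coefficients from table 6.3. It assumes that x enters in unscaled pack-years, which is not stated in this section, so any preliminary scaling in the original analysis would change the numbers. The function name is hypothetical.

```python
import math

def fp2_repeated(x, power, beta1, beta2):
    """Second-degree fractional polynomial with a repeated power p:
    beta1 * x^p + beta2 * x^p * ln(x), defined for x > 0 only.
    By convention, power 0 stands for ln(x)."""
    if x <= 0:
        raise ValueError("fractional polynomials require x > 0")
    x_p = math.log(x) if power == 0 else x ** power
    return beta1 * x_p + beta2 * x_p * math.log(x)

# Bi-Sep coefficients for smoking from table 6.3 (assumed unscaled pack-years).
BETA_LINEAR = 0.150
BETA_LOG = -0.028

def smoking_term(pack_years):
    """Contribution of the positive part of smoking to the linear predictor."""
    return fp2_repeated(pack_years, 1, BETA_LINEAR, BETA_LOG)
```

The function is only defined for positive values; observations with value zero are handled by the binary indicator, or contribute nothing if the indicator was dropped.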

                                 Coef.     Std.Err.   P
Ref I: Logistic regression
  Asbestos                       0.00006   0.00004    0.129
  Risk                           0.013     0.005      0.014
  Smoking                        0.032     0.003      <0.001
  Deviance                       2500.5
Ref II: Logistic regression with indicators
  v asbestos                     -0.146    0.127      0.250
  Asbestos                       0.00007   0.00005    0.102
  v risk                         0.179     0.125      0.154
  Risk                           0.007     0.006      0.229
  v smoking                      0.585     0.165      <0.001
  Smoking                        0.027     0.003      <0.001
  Deviance                       2484.2
1. Bi-Sep (Asbestos, Risk) + Smoking
  v asbestos                     dropped (p=0.12)
  Asbestos (1)                   0.00005   0.00005    0.236
  v risk                         0.216     0.102      0.034
  Risk (1)                       dropped (p=0.21)
  v smoking                      dropped (p=0.28)
  Smoking (1)                    0.150     0.017      <0.001
  Smoking (log)                  -0.028    0.004      <0.001
  Deviance                       2452.5
2. Bi-D3 (Asbestos, Risk) + Smoking
  z1                             0.050     0.163      0.758
  z2                             -0.347    0.174      0.046
  z3                             0.079     0.145      0.585
  Asbestos (1)                   0.00007   0.00005    0.122
  Risk (1)                       0.007     0.006      0.258
  v smoking                      -0.223    0.217      0.305
  Smoking (1)                    0.168     0.023      <0.001
  Smoking (log)                  -0.032    0.005      <0.001
  Deviance                       2445.8
3. Bi-D1 (Asbestos, Risk) + Smoking
  z1                             0.094     0.143      0.511
  Asbestos (1)                   0.00004   0.00004    0.402
  Risk (1)                       0.010     0.006      0.074
  v smoking                      -0.221    0.217      0.309
  Smoking (1)                    0.167     0.023      <0.001
  Smoking (log)                  -0.031    0.005      0.001
  Deviance                       2450.7
4. Bi-Sub (Asbestos, Risk) + Smoking
  z1                             0.105     0.181      0.562
  z2                             -0.335    0.193      0.083
  z3                             0.019     0.160      0.907
  z4 (1)                         0.00002   0.0001     0.846
  z5 (1)                         -0.011    0.013      0.366
  z6 (1)                         0.00006   0.00008    0.438
  z7 (1)                         0.013     0.009      0.156
  v (Smoking)                    -0.221    0.217      0.309
  Smoking (1)                    0.168     0.231      <0.001
  Smoking (log)                  -0.031    0.005      <0.001
  Deviance                       2445.0

Table 6.3: Lung Cancer study: Bivariate Spike Methods with additional SAZ covariate - Results of Bi-Sep, Bi-D3, Bi-D1 and Bi-Sub. The selected FP Power values are given in brackets after the risk factor name.

The main focus of this chapter was to introduce a possible concept for the pre-analysis if more than two

covariates with SAZ are present. However, this is only a sketch of one possibility and further extensions

need more detailed analyses and research. A short outlook will be given on this topic at the end of the

next chapter.

7 Discussion and further research

The main focus of this thesis was to investigate methods for the analysis of covariates with a SAZ. As

already stated in the introduction and in chapter 3, the question of predictor covariates with SAZ has

seldom been addressed. Most of the current research is on outcome covariates with SAZ as can be seen

in a recently published special issue on spike at zero (Editorial by Boehning and Alfo [8]) and tutorial

papers by Neelon et al. (2016) ([45], [46]). The relevance and influence of a binary indicator in modeling

independent covariates with SAZ in the normal error regression model were investigated. In this thesis, a

theoretical justification and explanation of the effect of the inclusion of one or several indicators was

given. Bivariate extensions of already existing methods for one SAZ covariate [55] for which a two stage

procedure was used and in which the continuous part was modeled in the class of FPs ([5]) were proposed

and compared. The rather common situation of two SAZ covariates complicates the analysis considerably.

Four different approaches were presented, and arguments were given for each of them. For illustration and

comparison of results, data from several epidemiological studies were used. In a simulation study, different

properties of the proposed methods for two covariates with SAZ were compared to each other and to two

reference methods. In addition, a possible extension for three or more SAZ variables was outlined. Major

findings will be discussed and evaluated in the following sections. As this thesis was written as a part

of a DFG research project, and some results are closely related to additional research questions, major

findings of other members will be sketched to draw a bigger picture. At the end, further questions which

still remain unsolved are presented to give an outlook on further research.

7.1 Methods for the analysis of covariates with SAZ

Recently published studies such as a study on the influence of alcohol on lung cancer in never smokers

by Fehringer et al (2017) [22] show that there are research questions in which the assumption of

different relationships to an outcome in observations with zero and non-zero values in certain covariates

is still present. Thus, methods are needed that can handle this type of data situation.

In univariate regression analysis, Robertson et al (1994) [50] and Royston et al (2010) [55] proposed the

inclusion of a binary indicator in the analysis of covariates with SAZ. In this thesis, it could be shown that

the inclusion of a binary indicator leads to a separation of the effects of the zero and non-zero observations

in the linear regression model which can be an interpretational advantage. In general, including a binary

indicator seems important if the true underlying effect of the observations with x-value zero is substantially

different to the effect of the observations with non-zero values. This could be observed in small simulated

examples. As the FP-Spike procedure drops the binary indicator or the respective positive part if no

additional information is provided, this seems a reasonable strategy. It leads to sparse models. Two

medical research questions were addressed: first, the influence of the time spent outside on lung

function in school children and, second, the influence of the estrogen receptor level on breast cancer.

In some settings, FP-Spike leads to simpler models because only the binary indicator was kept in the final

model. This binary distinction is easier to interpret than possible non-linear functional forms which were

selected by standard FP or an “averaged” model over the whole range of observations ignoring possible

differences in the effect size estimated with standard linear or Cox regression. However, comparing both

ways of model selection (with and without binary indicator) with additional regression diagnostics seems

reasonable and necessary for the decision of a final model. In several univariate settings, the necessity

and improvement of the extended model could be shown.
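The separation of zero and non-zero observations in the linear model can be illustrated with a minimal sketch. It assumes the model y = b0 + bz·z + b1·x with the coding z = 1 if x > 0 and z = 0 if x = 0; the coding and all names are assumptions made for this illustration. With this parametrisation the least squares problem splits into two parts: b0 is the mean outcome of the zero observations, and (b0 + bz, b1) is the ordinary regression line through the positive observations only.

```python
def ols_line(xs, ys):
    """Intercept and slope of a simple least squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def fit_spike_model(xs, ys):
    """Least squares fit of y = b0 + bz*z + b1*x with z = 1 if x > 0.

    Returns (b0, bz, b1). Needs at least one zero observation and
    at least two distinct positive x values."""
    zero_ys = [y for x, y in zip(xs, ys) if x == 0]
    pos_xs = [x for x in xs if x > 0]
    pos_ys = [y for x, y in zip(xs, ys) if x > 0]
    b0 = sum(zero_ys) / len(zero_ys)          # level of the zero group
    intercept_pos, b1 = ols_line(pos_xs, pos_ys)
    return b0, intercept_pos - b0, b1         # bz is the jump at zero

# Toy data (invented): outcome 5 at x = 0, and y = x for positive x.
xs = [0, 0, 0, 1, 2, 3, 4]
ys = [5, 5, 5, 1, 2, 3, 4]
b0, bz, b1 = fit_spike_model(xs, ys)
naive_intercept, naive_slope = ols_line(xs, ys)  # model without indicator
```

In this toy example the model without indicator averages over both groups and even estimates a negative slope, although the outcome increases with x among the positive observations.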

For bivariate regression analysis with two covariates with SAZ, four different analysis strategies were

proposed in this thesis. In the framework of the logistic regression model and when the positive parts of

the SAZ covariates follow a normal or log-normal distribution, theoretical ORs were derived in Lorenz et

al. (2015) [41]. These were the initial motivation for the construction of the different methods. Real data

distributions are often non-standard and the distribution of the zero values in the two variables might

not always be independent. Bi-Sep is the simplest of the four bivariate approaches. It uses the univariate

FP-Spike procedure separately for the two SAZ covariates. In Bi-D3, Bi-D1 and Bi-Sub, proportions of

zeros in both covariates are considered simultaneously in a combination of binary indicators. Therefore,

these strategies can account for different relationships between the two covariates and to the outcome

variable. In Bi-D3, three indicator variables differentiate between the four types of observations. If the

amount of observations in categories B and C is relatively low, or if the relationship between the outcome and

observations with value zero in both covariates is stronger than for observations in B and C, it might be reasonable to only

include one single indicator indicating that both variables take the value zero (Bi-D1). The procedures

were described in detail in section 5.2.

By construction, the methods already have some advantages and disadvantages. Compared to

categorization, all proposed approaches use the full information from the semi-continuous data. Bi-Sep does

not assume any relationship between the two covariates. It, furthermore, has the option to remove one

or both binary indicators or even the positive parts of the covariates from the final model if they do not

add additional information as in the laryngeal cancer data example. This makes model selection very

flexible and will mostly result in a sparse model. Bi-D3 includes three indicator variables and can, thus,

estimate three different effect sizes in three different types of observations. In some settings, this can be

necessary, or highlight certain properties in the relationship to the outcome like e.g. in the lung cancer

dataset. Bi-Sub is closely related to Bi-D3 but has extra flexibility in the estimation of functional forms in

subgroups. It can be useful in settings as addressed by Fehringer et al (2017) [22] when it can be assumed

that in a subset of patients the relationship to the outcome differs. Bi-D1 uses only one indicator, which

can be an advantage. However, in many settings this might not be flexible enough.

In a setting with more than two covariates with SAZ, it was proposed to perform a pre-analysis investigating

the relationship between the covariates with SAZ using loglinear models. This preliminary step can be

used to then apply the bivariate methods to specific pairs of covariates. This strategy is, however, only

sketched and needs further investigation.

7.2 Evaluation of methods proposed in this thesis

The main aim was the development and evaluation of methods for the analysis of two covariates with

SAZ. Some datasets were simulated in a similar fashion as in the univariate SAZ situation to gain

a first insight into the behaviour of the proposed methods. The results of the exploratory analyses led

to the design of the simulation study for the evaluation of the non-linear bivariate methods proposed in

Jenkner et al (2016) [33]. These examples provided some first guidance on the choice of effect sizes and

further parameters. Thus, they were the first step towards the design: they showed how to generate the data and

which aspects have to be considered (e.g. how to introduce the amount of zero values, how to deal with negative

values), and they led to the first choice of interesting scenarios for the simulation study. Different criteria

of evaluation were compared in artificial data sets to show which aspects of the estimated model were

actually influenced by the new analysis methods.

One could observe that the overall fit of the models only improves if the true effect of the binary indicator

variable is relatively high. To gain a better insight, separate measures of fit in subsets of observations

were proposed to get an impression of how the methods behave in these subsets. It

was proposed to calculate the standard mean squared error in the different categories of observations in

addition to overall measures of fit. In a simulation study several properties of the methods were assessed.

In 29 scenarios, different research questions were asked. It could be observed that the true effect size

of the binary indicator strongly influences the selection of the functional forms. The larger the difference

between the effect on the outcome at zero and close to zero (“jump”), the more important is the inclusion of

a binary indicator. The reference methods linear regression and MFP lead to averaged models. MFP

selects non-linear functional forms, mostly, with strong tendencies to positive or negative infinity close to

zero. In all simulation settings, highly varying functional forms were selected. As in Bi-Sep dropping of

either the binary indicator or the continuous functional form is possible, this method seems favourable if

the difference in effect at and close to zero is relatively low. Three indicators as in Bi-D3 and Bi-Sub are

helpful in very specific functional settings. The actual number of observations in the four categories was

less influential for the selected functional forms than the effect sizes. Details of the results were presented

in section 5.5.2.
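The category-wise error measure mentioned above, the mean squared error computed within each category of observations next to the overall value, can be sketched as follows. The labelling of the categories (A: both covariates zero, B: only x1 zero, C: only x2 zero, D: both positive) is an assumption of this sketch, as is the use of reference values, for example the true simulated mean, against which the fitted values are compared. All names are hypothetical.

```python
def category(x1, x2):
    """Assumed labelling: A both zero, B only x1 zero, C only x2 zero, D both positive."""
    if x1 == 0 and x2 == 0:
        return "A"
    if x1 == 0:
        return "B"
    if x2 == 0:
        return "C"
    return "D"

def mse_by_category(x1, x2, reference, fitted):
    """Mean squared error per category and overall.

    reference: values the fit is compared with (e.g. the true simulated mean)
    fitted:    values predicted by the estimated model"""
    sums = {}
    for a, b, ref, fit in zip(x1, x2, reference, fitted):
        cat = category(a, b)
        total, n = sums.get(cat, (0.0, 0))
        sums[cat] = (total + (fit - ref) ** 2, n + 1)
    result = {cat: total / n for cat, (total, n) in sums.items()}
    n_all = sum(n for _, n in sums.values())
    result["overall"] = sum(total for total, _ in sums.values()) / n_all
    return result

# Toy example with invented numbers.
errors = mse_by_category(
    x1=[0, 0, 0, 2, 3, 1],
    x2=[0, 0, 4, 0, 5, 2],
    reference=[1, 1, 2, 3, 4, 5],
    fitted=[1, 3, 2, 4, 4, 7],
)
```

Reporting the categories separately shows where a method fails even if the overall value looks acceptable, because a large error in a small category is diluted in the overall average.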

A general recommendation as to which of the methods can be used in which situation does not seem possible.

There is also no overall mathematical algorithm selecting the correct strategy. Depending on the aim of

model building and thinking again about the initial definition of a statistical model, different models are

preferable in different research settings. Again, it is proposed to use subject matter knowledge in the

process of selecting an adequate model.

7.3 Limitations

Several differences in the use and goodness of fit of methods for the analysis of covariates with SAZ were

described in this thesis. It could be observed that the performance of the methods depends on the true

data distribution and the true functional relationship of the covariate to the outcome. Thus, some general

limitations resulting from these causes are given here.

Only in special distribution cases, the four proposed methods for the bivariate SAZ situation lead to

substantially improved models. This is due to the fact that the underlying assumptions for the strategies

are very strong and restrictive. Even for a single SAZ covariate, the inclusion of a binary indicator leads

to a strong assumption about the relationship between the outcome and the covariate. A non-continuous

or semi-continuous relationship separating observations at value zero from the rest is implied.

This assumption might not always hold, in which case an additional binary indicator is not needed. One

can conclude that using standard techniques for nonlinear modeling already leads in some situations to

acceptable results. However, in some cases, using SAZ methods is absolutely necessary. Similar to the

commentary by Greenland and Poole [26] and as already stated, it seems reasonable to compare models

including indicator variables to models without indicators not just using statistical measures for goodness

of fit but also considering subject matter knowledge on how the relationship between covariate(s) and

outcome is to be assumed biologically. As could be observed in the earlier chapters, the assumptions made

when including one or several indicators, and in that way separating the estimates for two or four groups

of observations, are very strong because they imply that these different types of observations have

substantially different effects on the outcome.

Another issue in all that was presented so far is that several decisions for modeling one or more covariates

with SAZ are data dependent. The uncertainty of these processes is not reflected in the final estimates of

the model. Firstly, the distribution in the four categories may be an important argument for the decision

of choosing the most suitable of the four bivariate SAZ approaches. Secondly, based on the data the

procedure chooses a specific FP function and in a further step, at least in some of the procedures, it

decides whether a binary indicator or the FP function can be dropped. However, the estimates of the

finally derived model ignore all these data-dependent model building steps and inference is done as if the

model had been pre-specified (sometimes called naive estimates). It is well known that such estimated

standard errors and confidence intervals tend to be too small. This general weakness of all strategies that

derive a model in a data-dependent way has been known for a long time. More than two decades ago, Breiman

(1992) [9] called this a “quiet scandal”. To tackle this issue, one could improve interval estimates using the

model uncertainty concept or approaches based on bootstrap ([16], [19]). For the proposed approaches,

it can be assumed that underestimation of standard errors and the lengths of confidence intervals can be

relevant for large values but that it is not a critical issue in the situations investigated so far. In several

examples, naive estimates of the confidence interval for an FP function have been compared to estimates

based on bootstrap replications ([54]).

7.4 Conclusion

To sum up, it is very important to check the assumptions before using one of

the proposed strategies. Subject matter knowledge should guide the analysis and should give hints on

the best strategy of analysis. There is no overall mathematical algorithm for the choice of the right

method. In many medical settings, it will be almost impossible to check the assumptions. Here, using the

proposed methods in addition to standard techniques for sensitivity analyses can be an option for getting

further insights into the relationship between a covariate and an outcome. However, these results should be

interpreted with care.

Two important aspects should be considered in model building: first, interpretability and second,

transportability (Royston and Sauerbrei (2008) [54], p.66). A monotonic function might in general be easier

to interpret than a function with a "jump". This jump in a function, however, might be the result of

the SAZ analysis as the estimated models are not necessarily continuous functions. If it is helpful for a

deeper understanding and if subject matter knowledge leads to this analysis, using specific methods for

covariates with SAZ is an advantageous way of analysis. One further important characteristic of good

models is that they are understood by others. Therefore, the philosophy of the proposed approaches is

to keep the models as simple as possible. In several data applications, it was observed that FP-Spike

and the bivariate FP methods lead to simpler models than the standard FP approach. Depending on

the aim of model building as stated in chapter 2, different approaches are to be favoured. So far, none

of the techniques can be seen as overall favourable. Advantages and disadvantages in specific situations

for applications were described above. However, many further questions could be asked and should be

answered, to draw an even bigger picture. Some outlook will be given in section 7.6.

7.5 Summary of the findings of the DFG project

This thesis was written in the context of the DFG research project “BE 2056/10-1; SA 580/7-1”. This

section will highlight the main findings by Dr. Eva Lorenz supervised by Prof. Dr. Heiko Becher. In

table 7.1 an overview of the work packages is given. Work packages which were predominantly done in

Freiburg under the supervision of Prof. Dr. Willi Sauerbrei were already described in this thesis in detail.

In order to give a broader picture of the results and to present the findings of this thesis in the context of

the whole project, a brief overview of the other work packages will be given here.

In work package 2, “Theoretical investigations of dose-response relationships for two covariates with

SAZ”, Eva Lorenz investigated the question “Given the distribution of a covariable in diseased and

non-diseased individuals, which is the resulting odds ratio function?” For different distributions (normal,

log-normal, gamma, Weibull, Pareto and inverse Gaussian (univariate case); normal, log-normal (bivariate

case)), theoretical OR functions were derived for exposure variables with and without SAZ in logistic

regression. These theoretical functions lead to the conclusion that the inclusion of binary indicator

variables could improve model building. Detailed results can be found in Lorenz et al. (2015) [41]. The FP-Spike

procedure proposed by Royston et al. (2010) [55] contained a pre-transformation step of the data adding

a small constant. Becher et al. (2012) [5] found that this initial shift leads to biased estimates.
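
To make the question of work package 2 concrete, the following sketch computes the log odds ratio function implied by given exposure distributions in diseased and non-diseased individuals. It is an illustration under stated assumptions and not the derivation of Lorenz et al. (2015) [41]: the positive part of the exposure is assumed to be log-normal in both groups, the probability of a zero value is `p1` among the diseased and `p0` among the non-diseased, and all function names and parameter values are hypothetical. Under such a mixture of a point mass and a density, Bayes' theorem gives the odds ratio of an exposure x > 0 versus x = 0 as [(1 - p1) f1(x) / p1] / [(1 - p0) f0(x) / p0].

```python
import math


def lognormal_logpdf(x, mu, sigma):
    """Log density of a log-normal distribution at x > 0."""
    z = (math.log(x) - mu) / sigma
    return -math.log(x * sigma * math.sqrt(2.0 * math.pi)) - 0.5 * z * z


def log_or_vs_zero(x, p1, mu1, sigma1, p0, mu0, sigma0):
    """Log odds ratio of exposure x > 0 versus x = 0.

    Diseased:     P(X = 0) = p1, positive part log-normal(mu1, sigma1).
    Non-diseased: P(X = 0) = p0, positive part log-normal(mu0, sigma0).
    """
    # Contribution of the spike: being exposed at all versus unexposed.
    spike_term = math.log((1.0 - p1) / p1) - math.log((1.0 - p0) / p0)
    # Contribution of the continuous part: log density ratio at x.
    shape_term = lognormal_logpdf(x, mu1, sigma1) - lognormal_logpdf(x, mu0, sigma0)
    return spike_term + shape_term


# Hypothetical parameter values, chosen only for illustration.
params = dict(p1=0.2, mu1=1.5, sigma1=0.8, p0=0.4, mu0=1.0, sigma0=0.8)
for x in (0.5, 1.0, 5.0, 20.0):
    print(x, round(log_or_vs_zero(x, **params), 3))
```

With equal `sigma` in both groups, the log odds ratio is linear in log(x) with slope (mu1 - mu0) / sigma², so in this assumed setting the dose-response function for positive values is a log transformation plus a constant that depends on the spike probabilities.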

WP  | Topic | Location
----|-------|---------
1   | Update of literature | Freiburg, Heidelberg
2   | Theoretical investigation on dose-response given distribution classes | Heidelberg
3   | Simulation studies for one spike variable | Heidelberg
4   | Simulation study to investigate procedures for two spike-at-zero variables | Freiburg & Heidelberg
5   | Multivariable extension for spike-at-zero variables | Freiburg
6   | Interaction of a continuous variable with treatment – spike extension of MFPI | not done
7   | Investigations of different dose metrics for smoking and other variables in epidemiological and in clinical data | not done
8   | Application to data from several large epidemiological and clinical studies | Heidelberg, Freiburg
8.1 | GBSG data (univariate) | Freiburg
8.2 | Ozone data (univariate) | Freiburg
8.3 | Postmenopausal breast cancer study (univariate and bivariate) | Heidelberg
8.4 | Laryngeal cancer (bivariate situation) | Freiburg
8.5 | Lung cancer (bivariate) | Heidelberg
8.6 | Lung cancer (more than two) | Freiburg

Table 7.1: Overview of work packages that were addressed in DFG research project “BE 2056/10; SA 580/7”

In work package 3, simulation studies for one covariate with SAZ in logistic and Cox regression were

defined. The performance of the standard FP, the originally proposed FP-Spike (Royston et al. (2010)

[55]) and the modified FP-Spike (Becher et al. (2012) [5]) was compared in a case-control setting. The

situation for one covariate with SAZ was also investigated and compared in simulations in a Cox regression

setting. Data were simulated for six different functional relationships, a sample size of 1000, a varying

proportion of zero values, a varying proportion of censoring, and two distributions of the exposure variable

X (standard normal and log-normal). It could be observed that with a higher proportion of observations

with value zero, standard FP selected more complex functional relationships than FP-Spike. As observed

in many situations in this thesis, nonlinear functional forms were selected to estimate the (different) effect

for observations with value zero.

In addition to the scenarios investigated in the simulation study for two covariates with SAZ (work

package 4) presented in section 5.5 of this thesis, Eva Lorenz investigated the influence of Spearman

correlation between the two covariates on model fit. Correlation varied from 0.1 to 0.4 in steps of 0.1.

It was observed that changing the correlation only moderately influenced model fit and did not lead to

different results for the four methods. Work package 5 was done in Freiburg. The results are one of the

main parts of this thesis and are presented in chapter 5.

In work package 7, the construction of different dose metrics for smoking was assessed. Smoking is

an exposure covariate with diverse properties. Considering the intensity, the duration and many further

factors, a measurement which describes the true exposure and not only an average exposure is difficult to

construct. Several different measures can be found in the literature. Eva Lorenz investigated some measures

proposed by Leffondre et al. (2006) [39] and Thurston et al. (2005) [63]. However, results did not differ

substantially from those obtained with the originally measured covariates. In an invited commentary, Thomas (2014) [62]

states “Indeed, I am not aware of any novel associations that were only discovered with the use of a more

complex exposure metric.” Therefore, we decided to stop this project early.

The univariate and bivariate methods were applied to several epidemiological datasets (work package 8)

listed in table 7.1. All datasets analysed in Freiburg are described in this thesis. The additional analyses

performed by Eva Lorenz can be found in Lorenz (2015) [40]. Some of the applications are published in

Lorenz et al. (2017) [42].

Overall, it was observed that the proposed methods are important in the assessment of functional

relationships between covariates and different types of outcomes if the covariates have specific distributions.

7.6 Further research

At many points in this thesis, further points for investigations were reached which could not be dealt

with in this thesis. A selection of ideas for further research is presented in this section. Especially in

the design of the simulation study, a lot of further settings could be of interest. Research questions of

different scenarios can be combined. Some of these additional questions which need to be investigated

are asked and described. The properties investigated so far do not seem sufficient to decide whether one of the

proposed bivariate strategies is an overall favorable method. Each of them showed advantages and

disadvantages in specific situations. In this section, an outlook on what could be investigated on the way

to an even sounder evaluation of the methods is given.

What happens if assumptions are violated (no binary effect in reality)? The error (which can and

has to be defined in several different ways) of using the four proposed bivariate SAZ methods if in reality

there is no additional effect of the observations with value zero can be investigated. Harrell et al. (1996)

stated that “when the assumptions of a model are grossly violated or when a model is used unwisely for

a given patient sample, the performance of the model may be poor” (p.363 [29]). So far, all investigated

true functional relationships contained binary effects. All of the true settings favored at least one of the

constructed methods. One could now furthermore investigate settings with no effect for the zero values at

all, or with effects in only one covariate. Several of the data applications only showed small effects, thus,

this situation might be very common in real data settings. Pre-testing if the methods are appropriate

leads to data-dependent model selection decisions. All concerns that were raised regarding data-dependent

modelling and model uncertainty have to be considered in this situation.

How would allowing variable selection affect our models (Does that depend on the sample size

of the categories)? It was found that in a lot of situations if the effect of the binary indicator is

not substantially different to the value of the continuation of the non-zero functional relationship, the

additional indicator does not improve the fit. The proposed strategies Bi-D3, Bi-D1 and Bi-Sub do not

allow variable selection in their model selection strategy and thus all covariates are kept in the final model.

It might be reasonable to think about a potential strategy to allow variable selection and to eliminate

covariates that do not add information to the model. In some scenarios one could observe that

selection was not always favorable, especially when the functional form was not re-estimated after dropping the

binary indicator. This could become a relevant problem if the real functional form of the relationship is

unknown, which is usually the case. If dummy variables are used, variable selection is not straightforward.

Forward selection might for example “lead to misleading conclusions” (Cohen (1991), p.226 [12]), as “[...]

most stepwise procedures are not programmed to test at one step all the categories of the same categorical

variable” (Cohen (1991), p.228 [12]).

How robust are the different models (e.g. if we have only small sample sizes in some of the

categories)? In some scenarios, one could already observe that the selected functions, and also their

complexity, varied highly, especially if the assumptions of the models were

violated, which was often the case for the reference MFP and Bi-D1. Both selected highly complex FP1

and FP2 functions in many data situations, leading to good results with regard to R² and MSE. However,

function selection was very unstable. In order to select one favorable strategy, it would be interesting to

systematically compare the stability of the procedures. It would also be interesting to see whether, besides the size of the

effects, further factors play an important role. Royston and Sauerbrei state that “[s]tability of

functional form has hardly been addressed in the literature. One possible approach is to use bootstrap

resampling to study the different types of function which are found. The selected functions may depend

on which other factors are chosen, and for continuous factors, on their functional form” (Royston and

Sauerbrei (2003) [53]). They extended the bootstrap approach to handle transformations of continuous

predictors.
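
The idea of studying stability by bootstrap resampling can be sketched in a strongly simplified setting. The sketch below is not the FP function selection procedure and not the approach of Royston and Sauerbrei (2003) [53]: as a stand-in, it only decides by AIC whether a binary indicator for zero values is added to a straight line, and it repeats this decision in bootstrap resamples of simulated toy data. All function names, the data-generating model and the parameter values are assumptions made for illustration.

```python
import math
import random


def rss_linear(xs, ys):
    """Residual sum of squares of a least-squares line through (xs, ys)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx > 0 else 0.0
    intercept = my - slope * mx
    return sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys))


def indicator_selected(xs, ys):
    """Compare 'line only' with 'line plus zero indicator' by AIC.

    With an indicator for x == 0, the zero group gets its own mean and
    the line is determined by the positive observations only.
    """
    n = len(xs)
    zeros = [y for x, y in zip(xs, ys) if x == 0]
    pos = [(x, y) for x, y in zip(xs, ys) if x > 0]
    if len(zeros) < 1 or len(pos) < 3:
        return False
    mean0 = sum(zeros) / len(zeros)
    rss_ind = sum((y - mean0) ** 2 for y in zeros) + rss_linear(
        [x for x, _ in pos], [y for _, y in pos]
    )
    aic_lin = n * math.log(rss_linear(xs, ys) / n) + 2 * 2  # intercept, slope
    aic_ind = n * math.log(rss_ind / n) + 2 * 3  # plus indicator
    return aic_ind < aic_lin


def simulate(n, jump, rng):
    """Toy SAZ data: about 40% zeros, linear effect for x > 0, extra 'jump' at zero."""
    xs, ys = [], []
    for _ in range(n):
        x = 0.0 if rng.random() < 0.4 else rng.uniform(0.1, 5.0)
        mean = 1.0 + 0.5 * x + (jump if x == 0 else 0.0)
        xs.append(x)
        ys.append(mean + rng.gauss(0.0, 1.0))
    return xs, ys


def inclusion_frequency(xs, ys, n_boot, rng):
    """Share of bootstrap resamples in which the indicator is selected."""
    n = len(xs)
    hits = 0
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        hits += indicator_selected([xs[i] for i in idx], [ys[i] for i in idx])
    return hits / n_boot


rng = random.Random(2018)
xs, ys = simulate(200, jump=2.0, rng=rng)
freq = inclusion_frequency(xs, ys, 200, rng)
print("bootstrap inclusion frequency of the indicator:", freq)
```

An inclusion frequency close to 0 or 1 indicates a stable decision, whereas values in between indicate that the selected model depends strongly on the particular sample. The same scheme could in principle record the selected FP powers instead of a single yes/no decision.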

Where could further fields of application be found? As the results of this modeling technique depend on the

true underlying distribution of the data, and the medical and epidemiological examples chosen so far might

not fulfill these assumptions, further fields of application could be investigated, such as econometric

research questions. The rather artificial assumption that the outcome for zero values is completely

different from that for the rest of the values of the covariate might be correct in the investigation of the influence

of the costs of child clothing on the total costs of living. People without children will take value zero here

but their general costs of living might be very different from those of families with children. Aitchison (1955)

[3] investigated this situation. Applications in this field might be found more often.

Are there further alternatives to the proposed methods? The field of non-linear modeling is broad.

Therefore, extensions for covariates with a SAZ could be considered for different techniques. This thesis

investigated only extensions for FP models. However, further research with additional modeling techniques

such as splines is also possible. To determine the functional relationship of a continuous covariate in a

univariate regression model, splines are an older and still more popular alternative to the class of FP

functions (de Boor (2001) [14]). Spline-based approaches could replace the class of FP functions in the

most recent procedure for one SAZ variable and also in the bivariate approaches which were proposed

here. However, there are several spline based techniques available (restricted regression splines, penalized

regression splines, smoothing splines and many more). Greenland (1995) [25] stated in his description of

alternatives to categorical dose-response analysis first ideas for the analysis of unexposed subjects in the

analysis with splines. “An advantage of highly flexible models (with more than a few exposure terms) over

simpler models is that the overall curve will usually be less influenced by the unexposed than in simpler

models, and hence the decision to retain or delete the unexposed will be less momentous. In nonparametric

regression with ample data, smoothing neighborhoods can be made small, in which case the unexposed will

exert little or no influence on the curve beyond their immediate low-exposure neighborhood” (Greenland

(1995) p.362 [25]). However, other issues have to be considered when using non-parametric methods. For

further research, it could be interesting to include a spline technique as a reference in order to check how a

spline analysis is able to handle covariates with a spike at zero. However, it is already not a straightforward

decision which of the existing spline options to choose.

Additional questions. In addition, for the situation with three or more SAZ covariates, a possible strategy

was only outlined. Whether or not three- or higher-dimensional interactions are present was sketched

as a potential aid in the selection of a modeling strategy. In this case, the bivariate approaches need to

be adapted, but they can be used in principle. If three or higher-dimensional dependencies are present,

more substantial extensions are required.

Furthermore, it is necessary to specifically address different types of outcomes. In logistic regression,

the definition of the binary indicator with z = 1 if x > 0 and z = 0 otherwise might be more reasonable,

as the interpretation is different. The coefficient or effect estimate of the indicator variable will, however,

not represent the sole effect of the observations with x-value zero. Lorenz (2015) [40] investigated first

settings in the Cox model for which the interpretation is again different. These interpretational differences

have to be investigated further.
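
The two possible codings of the binary indicator can be illustrated with a small sketch. It assumes a model consisting of the indicator plus a first-degree FP term for positive values, with the FP term set to zero for zero observations; this layout, the function names and the argument `indicator_for_positive` are hypothetical choices for illustration. The power set shown is the conventional FP power set, in which power 0 denotes the logarithm.

```python
import math

# Conventional FP power set; power 0 denotes log(x).
FP_POWERS = (-2, -1, -0.5, 0, 0.5, 1, 2, 3)


def fp1_term(x, power):
    """First-degree FP transformation of a positive value."""
    return math.log(x) if power == 0 else x ** power


def spike_design_row(x, power, indicator_for_positive=True):
    """Return (z, fp) for one observation of a SAZ covariate.

    z codes the spike: 1 if x > 0 (default), or 1 if x == 0 otherwise.
    fp is the FP1 term for x > 0 and 0 for x == 0, so the FP term
    contributes nothing to the linear predictor of zero observations.
    """
    positive = x > 0
    z = int(positive) if indicator_for_positive else int(not positive)
    fp = fp1_term(x, power) if positive else 0.0
    return z, fp


for x in (0.0, 0.5, 4.0):
    print(
        x,
        spike_design_row(x, power=0),
        spike_design_row(x, power=0, indicator_for_positive=False),
    )
```

With a linear predictor b0 + b1 z + b2 fp, the coding z = 1 for x > 0 makes b0 the predictor at zero, whereas the coding z = 1 for x = 0 makes b0 + b1 the predictor at zero. The fitted values are the same under both codings, but the coefficient of the indicator has a different meaning, which is the interpretational difference referred to above.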

Many further questions could be asked as this topic has only been sparsely investigated so far.

Research mainly focuses on outcome variables with a spike at zero. However, questions concerning predictor

covariates can also be of great interest, especially in the field of explanatory model building and

non- or semi-continuous regression model estimation.

8 Bibliography

[1] A. Agresti. Categorical Data Analysis. John Wiley & Sons, New Jersey, 2002.

[2] W. Ahrens, K.H. Joeckel, P. Brochard, U. Bolm-Audorff, K. Grossgarten, Y. Iwatsubo, E. Orlowsky,

H. Pohlabeln, and F. Berrino. Retrospective assessment of asbestos exposure – I. Case-control analysis

in a study of lung cancer: Efficiency of job-specific questionnaires and job exposure matrices.

International Journal of Epidemiology, 22(2):83–95, 1993.

[3] J. Aitchison. On the Distribution of a Positive Random Variable Having a Discrete Probability Mass

at the Origin. Journal of the American Statistical Association, 50:901–908, 1955.

[4] D. G. Altman and P. Royston. The cost of dichotomising continuous variables. British Medical

Journal, 332(7549):1080, 2006.

[5] H. Becher, E. Lorenz, P. Royston, and W. Sauerbrei. Analysing covariates with spike at zero: a

modified FP procedure and conceptual issues. Biometrical Journal, 54(5):686–700, 2012.

[6] H. Binder, W. Sauerbrei, and P. Royston. Comparison between splines and fractional polynomials

for multivariable model building with continuous covariates: A simulation study with continuous

response. Statistics in Medicine, 32(13):2262–2277, 2013.

[7] M. Bland. An introduction to medical statistics. Oxford University Press, Oxford, 2015.

[8] D. Boehning and M. Alfo. Editorial: Special issue on models for continuous data with a spike at

zero. Biometrical Journal, 58(2):255–258, 2016.

[9] L. Breiman. The little bootstrap and other methods for dimensionality selection in regression: X-fixed

prediction error. Journal of the American Statistical Association, 87:738–754, 1992.

[10] L. Breiman. Statistical modeling: The two cultures. Statistical Science, 16(3):199–215, 2001.

[11] A. Burton, D. G. Altman, P. Royston, and R. L. Holder. The design of simulation studies in medical

statistics. Statistics in medicine, 25(24):4279–4292, 2006.

[12] A. Cohen. Dummy variables in stepwise regression. The American Statistician, 45(3):226–228, 1991.

[13] D. R. Cox. Regression models and life-tables. Journal of the Royal Statistical Society. Series B,

34(2):187–220, 1972.

[14] C. de Boor. A practical guide to splines. Springer, New York, revised edition, 2001.

[15] A. Dietz, H. Ramroth, T. Urban, W. Ahrens, and H. Becher. Exposure to cement dust, related

occupational groups and laryngeal cancer risk: results of a population based case-control study.

International Journal of Cancer, 108:907–911, 2004.

[16] D. Draper. Assessment and propagation of model selection uncertainty (with discussion). Journal of

the Royal Statistical Society. Series B, 57(1):45–97, 1995.

[17] D. Dunkler, M. Plischke, K. Leffondré, and G. Heinze. Augmented Backward Elimination: A Pragmatic

and Purposeful Way to Develop Statistical Models. PLoS ONE, 9(11):e113677, 2014.

[18] P. H. C. Eilers and B. D. Marx. Flexible smoothing with B-splines and penalties. Statistical Science,

11:89–121, 1996.

[19] C. Faes, M. Aerts, H. Geys, and G. Molenberghs. Model averaging using fractional polynomials to

estimate a safe level of exposure. Risk Analysis, 27:111–123, 2007.

[20] L. Fahrmeir, T. Kneib, and S. Lang. Regression - Modelle, Methoden und Anwendungen. Springer,

Berlin, 2007.

[21] A. Farcomeni. A MANOVA test for multivariate lognormal observations with a spike at zero, with

application to ecological niches of South Africa. Biometrical Journal, 58(2):320–330, 2016.

[22] G. Fehringer, D. R. Brenner, Z.-F. Zhang, Y.C. A. Lee, K. Matsuo, H. Ito, Q. Lan, P. Vineis,

M. Johansson, K. Overvad, E. Riboli, A. Trichopoulou, C. Sacerdote, I. Stucker, P. Boffetta, P. Brennan,

D. C. Christiani, Y.-C. Hong, M. T. Landi, H. Morgenstern, A. G. Schwartz, A. S. Wenzlaff,

G. Rennert, J. R. McLaughlin, C. C. Harris, S. Olivo-Marston, I. Orlow, B. J. Park, M. Zauderer,

J. M. Barros Dios, A. Ruano Ravina, J. Siemiatycki, A. Koushik, P. Lazarus, A. Fernandez-Somoano,

A. Tardon, L. Le Marchand, H. Brenner, K.U. Saum, E. J. Duell, A. S. Andrew, N. Szeszenia-Dabrowska,

J. Lissowska, D. Zaridze, P. Rudnai, E. Fabianova, D. Mates, L. Foretova, V. Janout,

V. Bencko, I. Holcatova, A. C. Pesatori, D. Consonni, A. Olsson, K. Straif, and R. J. Hung. Alcohol

and lung cancer risk among never smokers: A pooled analysis from the international lung cancer

consortium and the synergy study. International Journal of Cancer, 2017.

[23] A. Gleiss, M. Dakna, H. Mischak, and G. Heinze. Two-group comparisons of zero-inflated intensity

values: the choice of test statistic matters. Bioinformatics, 31(14):2310–2317, 2015.

[24] P. J. Green and B. W. Silverman. Nonparametric Regression And Generalized Linear Models: A

Roughness Penalty Approach. Chapman and Hall, London, 1994.

[25] S. Greenland. Dose-Response and Trend Analysis in Epidemiology: Alternatives to Categorical Analysis.

Epidemiology, 6(4):356–365, 1995.

[26] S. Greenland and C. Poole. Interpretation and Analysis of Differential Exposure Variability and

Zero-Exposure Categories for Continuous Exposures. Epidemiology, 6(3):326–328, 1995.

[27] A. P. Hallstrom. A modified Wilcoxon test for non-negative distributions with a clump of zeros.

Statistics in Medicine, 29:391–400, 2010.

[28] D. Hand. Construction and Assessment of Classification Rules. John Wiley & Sons Ltd., Chichester,

1997.

[29] F. E. Harrell, K. L. Lee, and D. B. Mark. Multivariable prognostic models: issues in developing

models, evaluating assumptions and adequacy, and measuring and reducing errors. Statistics in

medicine, 15(4):361–387, 1996.

[30] D. W. Hosmer and S. Lemeshow. Applied logistic regression. Wiley, New York, 2 edition, 2000.

[31] L. W. Huson. Performance of some correlation coefficients when applied to zero-clustered data.

Journal of Modern Applied Statistical Methods, 6(2):Article 17, 2007.

[32] W. Jedrychowski, H. Becher, J. Wahrendorf, Z. Basa-Cierpialek, and K. Gomola. Effect of tobacco

smoking on various histological types of lung cancer. Journal of cancer research and clinical oncology,

118(4):276–282, 1992.

[33] C. Jenkner, E. Lorenz, H. Becher, and W. Sauerbrei. Modeling continuous covariates with a "spike"

at zero: Bivariate approaches. Biometrical Journal, 58(4):783–796, 2016.

[34] K.-H. Jöckel, W. Ahrens, I. Jahn, H. Pohlabeln, and U. Bolm-Audorff. Occupational risk factors for

lung cancer: a case-control study in West Germany. International Journal of Epidemiology, 27(4):549–560,

1998.

[35] K.H. Jöckel, H. Pohlabeln, W. Ahrens, and M. Krauss. Environmental tobacco smoke and lung

cancer. Epidemiology, 9:672–675, 1998.

[36] V. Kipnis, D. Midthune, D. W. Buckman, K. W. Dodd, P. M. Guenther, S. M. Krebs-Smith, A. F.

Subar, J. A. Tooze, R. J. Carroll, and L. S. Freedman. Modeling Data with Excess Zeros and

Measurement Error: Application to Evaluating Relationships between Episodically Consumed Foods

and Health Outcomes. Biometrics, 65(4):1003–10, 2009.

[37] J. Kühr, J. Forster, G. Bär, W. Bohnet, G. Ihorst, J. Mattes, C. Schneider, H. Schulz, and E. Strauch.

Prospektive Längsschnitt-Studie zur Erforschung der Ozon-Immission in ihrer Bedeutung für das

Lungenwachstum von Schulkindern. Forschungsbericht FZKA-BWPLUS, Förderkennzeichen PUG L

98001, 2001.

[38] P. A. Lachenbruch. Power and sample size requirements for two-part models. Statistics in Medicine,

20:1235–8, 2001.

[39] K. Leffondre, M. Abrahamowicz, Y. Xiao, and J. Siemiatycki. Modelling smoking history using a

comprehensive smoking index: application to lung cancer. Statistics in Medicine, 25:4132–4146, 2006.

[40] E. Lorenz. Dose-response modelling for semicontinuous variables in epidemiology and clinical research.

Dissertation, Ruprecht-Karls-Universität Heidelberg, 2015.

[41] E. Lorenz, C. Jenkner, W. Sauerbrei, and H. Becher. Dose-response modelling for bivariate covariates

with and without a spike at zero: theory and application to binary outcomes. Statistica Neerlandica,

69(4):374–398, 2015.

[42] E. Lorenz, C. Jenkner, W. Sauerbrei, and H. Becher. Modeling variables with a spike at zero:

Examples and practical recommendations. American Journal of Epidemiology, 185(8):650–660, 2017.

[43] G. Marsaglia. Random Number Generators. Journal of Modern Applied Statistical Methods, 2(1):Article 2,

2003.

[44] P. McCullagh. What is a statistical model? The Annals of Statistics, 30(5):1225–1310, 2002.

[45] B. Neelon, A.J. O’Malley, and V.A. Smith. Modeling zero-modified count and semicontinuous data in

health services research part 1: background and overview. Statistics in Medicine, 35(27):5070–5093,

2016.

[46] B. Neelon, A.J. O’Malley, and V.A. Smith. Modeling zero-modified count and semicontinuous data

in health services research part 2: case studies. Statistics in Medicine, 35(27):5094–5112, 2016.

[47] M. K. Olsen and J. L. Schafer. A Two-Part Random Effects Model for Semicontinuous Longitudinal

Data. Journal of the American Statistical Association, 96:730–745, 2001.

[48] V. Partovi Nia and M. Ghannad-Rezaie. Agglomerative joint clustering of metabolic data with spike

at zero: A bayesian perspective. Biometrical Journal, pages n/a–n/a, 2015.

[49] H. Ramroth, A. Dietz, and H. Becher. Interaction Effects and Population-attributable Risks for

Smoking and Alcohol on Laryngeal Cancer and its Subsites. Methods of Information in Medicine,

43:499–504, 2004.

[50] C. Robertson, P. Boyle, C. C. Hsieh, G. J. MacFarlane, and P. Maisonneuve. Some statistical

considerations in the analysis of case-control studies when the exposure variables are continuous

measurements. Epidemiology, 5(2):164–170, 1994.

[51] P. Royston and D. G. Altman. Regression Using Fractional Polynomials of Continuous Covariates:

Parsimonious Parametric Modelling. Journal of the Royal Statistical Society. Series C (Applied

Statistics), 43(3):429–467, 1994.

[52] P. Royston, D. G. Altman, and W. Sauerbrei. Dichotomizing continuous predictors in multiple

regression: a bad idea. Statistics in Medicine, 25(1):127–41, 2006.

[53] P. Royston and W. Sauerbrei. Stability of multivariable fractional polynomial models with selection

of variables and transformations: a bootstrap investigation. Statistics in Medicine, 22(4):639–659,

2003.

[54] P. Royston and W. Sauerbrei. Multivariable Model-building. Wiley, New York, 2008.

[55] P. Royston, W. Sauerbrei, and H. Becher. Modelling continuous exposures with a ‘spike’ at zero: A

new procedure based on fractional polynomials. Statistics in Medicine, 29(11):1219–1227, 2010.

[56] W. Sauerbrei, P. Royston, and H. Binder. Selection of important variables and determination of

functional form for continuous predictors in multivariable model building. Statistics in Medicine,

26(30):5512–18, 2007.

[57] W. Sauerbrei and P. Royston. Continuous variables: to categorize or to model? In C. Reading,

editor, Data and context in statistics education: Towards an evidence-based society. Proceedings of

the Eighth International Conference on Teaching Statistics. ICOTS8, Ljubljana, Slovenia, 2010.

[58] E. F. Schisterman, D. Faraggi, B. Reiser, and J. Hu. Youden Index and the optimal threshold for

markers with mass at zero. Statistics in Medicine, 27(2):297–315, 2008.

[59] M. Schumacher, G. Bastert, H. Bojar, K. Hübner, M. Olschewski, W. Sauerbrei, C. Schmoor, C. Beyerle,

R. L. A. Neumann, and H. F. Rauschecker. Randomized 2x2 trial evaluating hormonal treatment and

the duration of chemotherapy in node-positive breast cancer patients. Journal of Clinical Oncology,

12(10):2086–2093, 1994.

[60] G. Shmueli. To explain or to predict? Statistical Science, 25(3):289–310, 2010.

[61] A. Strasak, N. Umlauf, R. Pfeiffer, and S. Lang. Comparing Penalized Splines and Fractional Polynomials

for Flexible Modelling of the Effects of Continuous Predictor Variables. Computational Statistics

& Data Analysis, 55(4):1540–1551, 2011.

[62] D. C. Thomas. Invited commentary: Is it time to retire the "pack-years" variable? Maybe not!

American Journal of Epidemiology, 179:299–302, 2014.

[63] S. W. Thurston, G. Liu, D. P. Miller, and D. C. Christiani. Modeling lung cancer risk in case-control

studies using a new dose metric of smoking. Cancer Epidemiology, Biomarkers & Prevention,

14:2296–2302, 2005.

[64] Ch. Ulmer, M. Kopp, G. Ihorst, T. Frischer, J. Forster, and J. Kuehr. Effects of ambient ozone

exposures during the spring and summer of 1994 on pulmonary function of schoolchildren. Pediatric

Pulmonology, 23(5):344–353, 1997.

[65] D. Zhang, C. Fan, J. Zhang, and C. H. Zhang. Nonparametric methods for measurements below

detection limit. Statistics in Medicine, 28:700–715, 2009.

[66] S. Zhang, D. Midthune, P. Guenther, S. Krebs-Smith, V. Kipnis, K. Dodd, D. Buckman, J. Tooze,

L. S. Freedman, and R. J. Carroll. A new multivariate measurement error model with zero-inflated

dietary data, and its application to dietary assessment. Annals of Applied Statistics, 5(2B):1456–1487,

2011.

APPENDIX

Sections of this thesis already published

Results by Eva Lorenz described in chapter 7 were already published in:

Becher H, Lorenz E, Royston P, Sauerbrei W. Analysing covariates with spike at zero: a modified FP

procedure and conceptual issues, Biom J. 54(5):686-700, 2012.

Lorenz E, Jenkner C, Sauerbrei W, Becher H. Dose-response modelling for bivariate covariates with and

without a spike at zero: Theory and application to binary outcomes, Statistica Neerlandica, 69: 374-398,

2015.

Lorenz E, Jenkner C, Sauerbrei W, Becher H. Modeling variables with a spike at zero. Examples and

practical recommendations. Am J Epidemiol. 22:1-11, 2017. doi: 10.1093/aje/kww122.

A selected version of the results presented in chapter 5 was already published in:

Jenkner C, Lorenz E, Becher H, Sauerbrei W. Modeling Continuous Predictors With a “Spike” at Zero:

Bivariate Approaches. Biometrical Journal, 58(4):783-796, 2016.
