Exploratory quantile regression with many covariates: An application to adverse birth outcomes

Lane F. Burgette¹, Marie Lynn Miranda², and Jerome P. Reiter¹

¹ Department of Statistical Science, Duke University
² Nicholas School of the Environment and Department of Pediatrics, Duke University

April 21, 2011

Abstract

Covariates may affect continuous responses differently at various points of the response distribution. For example, some exposure might have minimal impact on conditional means, whereas it might lower conditional 10th percentiles sharply. Such differential effects can be important to detect. In studies of the determinants of birth weight, for instance, it is critical to identify exposures like the one above, since low birth weight is a risk factor for later health problems. Effects of covariates on the tails of distributions can be obscured by models that estimate conditional means, like linear regression; however, they can be detected via quantile regression. We present two approaches for exploring high-dimensional predictor spaces to identify important predictors for quantile regression. These are based on the lasso and elastic net penalties. We apply the approaches to a prospective cohort study of adverse birth outcomes that includes a wide array of demographic, medical, psychosocial, and environmental variables. While tobacco exposure is known to be associated with lower birth weights, the analysis suggests an interesting interaction effect not previously reported: tobacco exposure may act more strongly on the 20th and 30th percentiles of birth weight when mothers have high levels of lead in their blood compared to when mothers have low blood lead levels.


1 Introduction

In large epidemiological cohort studies, researchers often record data on dozens or even hundreds of variables that are potentially relevant for regression models. When interaction effects are suspected, the number of potentially relevant predictors can even exceed the sample size. With such high-dimensional covariate spaces, specifying the predictors to include in models can be daunting. Often, analysts use automated model selection procedures to aid in finding potentially important predictors, such as stepwise regression and its variants [1], regression trees [2], and penalized regression approaches [3,4].

Most of these techniques are designed for estimating conditional means of the response variable. Often, however, the effects of covariates on percentiles of the response distribution are relevant. For example, if some exposure decreases the birth weights of babies who otherwise would have average to high birth weights by some modest amount, but it does not decrease the birth weights of babies who otherwise would have low birth weights, it might not be considered deleterious. On the other hand, an exposure would be troubling if it lowers the birth weights of babies who already would have low birth weights, even if it does not have a large impact on babies who otherwise would have average to high birth weights.

Figure 1 illustrates this scenario graphically using artificial data. In each panel, the mean response decreases by 300 units as the covariate moves from zero to one. But, at the fifth response percentile, the effect of the covariate could be nearly absent (panel A) or exaggerated (panel B). For birth weights, the impact of the covariate in panel B would be more problematic than that of panel A.

Analysts can estimate such differential effects with quantile regression [5]. Unlike least squares regression, quantile regression lets covariates affect more than the mean structure. However, analysts still must decide which among many possible covariates and interactions to include in the model. We present two approaches for identifying potentially important predictors in quantile regression that are particularly useful for high-dimensional covariate spaces. First, we estimate quantile regression coefficients subject to the lasso penalty [4]. This penalty forces many coefficients to equal zero, leaving the most important ones to be nonzero. Second, we implement a quantile regression version of the elastic net [6]. Like the lasso, the elastic net encourages coefficients to equal zero, but it more effectively identifies groups of important predictors that are highly correlated with each other.

[Figure 1 (image not reproduced): two scatterplot panels, A and B, of Response (2500 to 3500) against Covariate (0.0 to 1.0).]

Figure 1: Artificial examples with the same true mean structure. The least squares fits are similar (solid lines). Quantile regressions of the 5th percentile reveal very different tail behavior (dashed lines).

Our exploratory framework complements existing methodology for penalized quantile regression [7,8,9,10,11]. Specifically, we adapt Zhao and Yu’s [12] boosting technique to provide computationally expedient and theoretically justified algorithms for penalized quantile regression. Using both simulated data and data from a prospective study on the determinants of birth weights, we demonstrate how epidemiologists can use these algorithms to guide model specification for quantile regression in high-dimensional settings.

2 Methods

Standard linear regressions are of the form

$$y_i = x_i^\top \beta + \varepsilon_i \quad \text{for } i = 1, \ldots, n, \tag{1}$$

where the $\varepsilon_i$ are independent and identically distributed errors with mean zero, $\beta$ is a vector of regression coefficients with length $p$, and $x_i^\top$ is a row vector of covariates for the $i$th individual. The linear regression model implies that the distribution of $y$ shifts in its mean but not in its shape for different $x_i$.

Quantile regression allows both the mean and the shape of the distribution of $y$ to change with $x$, without assuming a particular error distribution. Hence, it is more flexible than linear regression. Quantile regression uses the same basic model as (1), but assumes that the $\varepsilon_i$ are independent errors whose $\tau$th quantile is equal to zero, i.e., $\Pr(\varepsilon_i \le 0) = \tau$. Thus, $x_i^\top \beta$ is interpreted as the conditional $\tau$th quantile of $y$ given $x_i$.

Standard quantile regression does not require a formal probability model [5]. Instead, coefficients are estimated by minimizing the empirical loss function

$$L(\beta) = \sum_{i=1}^{n} \rho_\tau(y_i - x_i^\top \beta), \tag{2}$$

where

$$\rho_\tau(a) = \begin{cases} \tau |a| & \text{if } a \ge 0, \\ (1 - \tau) |a| & \text{if } a < 0. \end{cases} \tag{3}$$
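As a concrete illustration of the loss in (2) and (3), the following Python sketch evaluates the check loss and confirms that, for an intercept-only model, the loss is minimized at a sample quantile. The function names `check_loss` and `empirical_loss` and the toy data are illustrative assumptions, not part of the original analysis.

```python
def check_loss(a, tau):
    """Check loss rho_tau(a) from equation (3)."""
    return tau * abs(a) if a >= 0 else (1.0 - tau) * abs(a)


def empirical_loss(y, fitted, tau):
    """Empirical loss L from equation (2), given fitted values x_i' beta."""
    return sum(check_loss(yi - fi, tau) for yi, fi in zip(y, fitted))


# Toy data (hypothetical): with an intercept-only model the fitted value is a
# single constant c, and the loss is minimized at a sample tau-th quantile.
y = [1, 2, 3, 4, 5, 6, 7, 8, 9]
tau = 0.25
losses = {c: empirical_loss(y, [c] * len(y), tau) for c in y}
best_c = min(losses, key=losses.get)
print(best_c, losses[best_c])
```

Because residuals below the fit are weighted by 1 − τ and those above by τ, choosing τ = 0.25 pulls the minimizer toward the lower part of the data rather than the center.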

Techniques exist for estimating standard errors of the estimated coefficients, testing hypotheses, and constructing interval estimates. Koenker and Hallock [22] provide an accessible introduction to the topic with several case studies, including an application to birth weight data. However, they do not provide guidance on navigating the high-dimensional covariate spaces that motivate this article.

Quantile regression has advantages for modeling birth outcomes over other regression approaches. Categorizing birth outcomes into low and high values [23] or a small number of ordered categories [24] is common in this field. But using a logistic regression on an indicator for low birth weight (less than 2500g) treats a birth at 2499g as fundamentally different from a birth at 2501g; it also treats a birth at 2499g like one at 2000g. Neither treatment is scientifically or clinically defensible.

Relatedly, least squares regression assumes that changing covariates merely shifts the conditional distribution. Although the marginal distribution of birth weights is nearly Gaussian, its conditional distribution is known to do more than shift with covariates. For example, baby boys tend to weigh more than girls, but this difference attenuates in the lower quantiles: small baby boys and girls are much closer in weight than a linear regression would suggest [22]. Quantile regression can illuminate such changing effects.

We now present two approaches for exploring high-dimensional quantile regression covariate spaces. We emphasize that these are exploratory techniques for identifying useful models and do not provide formal inference.


2.1 Boosted lasso for quantile regression

Model selection techniques operate by penalizing large models. For example, in linear regression, one may select the minimum Akaike Information Criterion (AIC) [25] or Bayesian Information Criterion (BIC) [26] model. Adding a predictor penalizes these log likelihood-based criteria by two and $\log(n)$, respectively. Thus, the penalty increases with the number of included predictors $p$.

The lasso method uses a penalty related to the sum of the absolute values of the regression coefficients [4]. Using simplified notation, lasso solutions for $\beta$ minimize the function

$$\Gamma(\beta; \lambda_1) = L(\beta) + \lambda_1 \sum_{j=1}^{p} |\beta_j|, \tag{4}$$

for an empirical loss function $L(\beta)$. For large values of $\lambda_1$, many components of $\beta$ are estimated to equal zero. As $\lambda_1$ shrinks toward zero, the estimates of $\beta$ move toward an unpenalized estimate. Equivalently, one can solve (4) by minimizing $L(\beta)$ subject to the restriction $\sum_j |\beta_j| \le t_1$ for a value of $t_1$ that corresponds to $\lambda_1$ [4]. We follow the convention of displaying lasso solutions using this representation.

Thus, lasso minimizes a loss function subject to a constraint on the length of the estimated coefficient vector. AIC and BIC minimize the loss function subject to a constraint on the number of nonzero coefficients, and then choose an appropriate number of nonzero components. Although these methods are similar in spirit, enumerative minimization of AIC or BIC is impractical in high-dimensional applications. For example, our motivating analysis involves more than 600 possible predictors. Even restricting a search to models with 10 or fewer predictors results in more than $10^{21}$ possible models. In contrast, there exist efficient methods to find lasso solutions, even in high dimensions [27].
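The size of that search space can be checked with a short calculation. The sketch below counts every subset of at most 10 predictors drawn from 600; using exactly 600 is an assumption that gives a lower bound, since the text says "more than 600".

```python
from math import comb

n_predictors = 600   # "more than 600" in the text, so this is a lower bound
max_size = 10

# Count every subset of the predictors with at most max_size members.
n_models = sum(comb(n_predictors, k) for k in range(max_size + 1))
print(f"{n_models:.3e}")
```

The total is dominated by the subsets of size exactly 10, which is why even a modest cap on model size does not make enumeration feasible.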

Before fitting a lasso or elastic net model, we center and scale covariates so that the penalties exert their force equally on all components of $\beta$. Further, we work with a centered version of the response to penalize the intercept minimally. We also add any potentially relevant interactions among the covariates to $x_i$.
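These preprocessing steps can be sketched in a few lines of Python. The helper names, the toy data, and the choice to form interactions from the raw columns before standardizing every column are assumptions made for illustration; the text does not specify the order of operations.

```python
from itertools import combinations
from statistics import mean, pstdev


def add_interactions(rows):
    """Append every two-way product of the original columns to each row."""
    p = len(rows[0])
    return [list(r) + [r[j] * r[k] for j, k in combinations(range(p), 2)]
            for r in rows]


def standardize(rows):
    """Center each column at zero and scale it to unit standard deviation."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) for c in cols]   # assumes no column is constant
    return [[(v - m) / s for v, m, s in zip(r, mus, sds)] for r in rows]


# Hypothetical raw covariates: 4 subjects, 3 columns.
raw = [[1.0, 10.0, 0.0], [2.0, 14.0, 1.0], [3.0, 11.0, 1.0], [4.0, 17.0, 0.0]]
X = standardize(add_interactions(raw))

# Hypothetical responses, centered so the intercept needs little penalty.
y = [3100.0, 2900.0, 3300.0, 2700.0]
y_centered = [v - mean(y) for v in y]
```

Putting every column on a common scale matters because the penalties act on the raw size of each coefficient, so an unscaled covariate measured in small units would be penalized more heavily than one measured in large units.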

To adapt the lasso to exploratory quantile regression, we set $L(\beta)$ to be the loss function in (2). Rather than use one specific $t_1$, we investigate the solutions of (4) for a wide range of $t_1$, beginning with zero and ending at a large value. We find the lasso solutions via a boosting algorithm [12]; see the online supplement.
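The boosting algorithm itself is given in the online supplement and is not reproduced here. As a rough illustration of how a stagewise procedure can trace out a solution path for the quantile loss, the following pure-Python sketch takes small forward steps only. It is a simplification and not the authors' algorithm: Zhao and Yu's method also takes backward steps, which are omitted, so the resulting path only approximates the lasso path. All names, the step size, and the toy data are hypothetical.

```python
def pinball(r, tau):
    """Check loss of a single residual r at quantile level tau."""
    return tau * r if r >= 0 else (tau - 1.0) * r


def total_loss(X, y, beta, tau):
    """Empirical quantile loss (2) for coefficient vector beta."""
    return sum(pinball(yi - sum(b * x for b, x in zip(beta, xi)), tau)
               for xi, yi in zip(X, y))


def forward_stagewise_path(X, y, tau, step=0.01, max_steps=1000):
    """Greedy path: each iteration moves one coefficient by +/- step."""
    p = len(X[0])
    beta = [0.0] * p
    current = total_loss(X, y, beta, tau)
    path = [list(beta)]
    for _ in range(max_steps):
        # Try a small move of every coefficient in both directions.
        candidates = []
        for j in range(p):
            for s in (step, -step):
                trial = list(beta)
                trial[j] += s
                candidates.append((total_loss(X, y, trial, tau), j, s))
        loss, j, s = min(candidates)
        if loss >= current - 1e-12:   # no move lowers the loss: stop
            break
        beta[j] += s
        current = loss
        path.append(list(beta))
    return path


# Hypothetical toy data: the response depends only on the first covariate.
x1 = [-2.0, -1.0, 1.0, 2.0, -2.0, -1.0, 1.0, 2.0]
x2 = [1.0, -1.0, 1.0, -1.0, -1.0, 1.0, -1.0, 1.0]
X = [[a, b] for a, b in zip(x1, x2)]
y = [3.0 * a for a in x1]
path = forward_stagewise_path(X, y, tau=0.25)
print(path[1], path[-1])
```

Each row of `path` corresponds to one point along the horizontal axis of a solution-path plot, with $t_1$ given by the sum of the absolute values in that row.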

The solution paths, i.e., plots of estimated $\beta$ versus $t_1$, can be used to identify important variables in the quantile regression. In particular, we search for variables whose estimates take on nonzero values early in the solution path. To illustrate with simulated data, consider the first panel of Figure 2, which shows the lasso solution paths for a 25th percentile regression with ten variables, five of which have nonzero true coefficients. The estimates of four parameters quickly take on nonzero values as we move from left to right, suggesting that these are important variables. The other estimated coefficients take much longer to rise above zero, suggesting they are less important. Hence, if seeking to fit a parsimonious quantile regression, the first four variables would be a starting point for model building.
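Reading the order of entry off a fitted path can be automated. The sketch below assumes the path is stored as a list of coefficient vectors ordered by increasing $t_1$; the function name `entry_order`, the tolerance, and the toy path are hypothetical.

```python
def entry_order(path, tol=1e-8):
    """Return covariate indices in the order their coefficients leave zero.

    `path` is a list of coefficient vectors ordered by increasing t1.
    """
    order = []
    for beta in path:
        for j, b in enumerate(beta):
            if abs(b) > tol and j not in order:
                order.append(j)
    return order


# Hypothetical path with three covariates: index 2 enters first, then 0.
toy_path = [
    [0.00, 0.0, 0.00],
    [0.00, 0.0, 0.05],
    [0.02, 0.0, 0.10],
    [0.06, 0.0, 0.12],
]
print(entry_order(toy_path))
```

Covariates that never leave zero, such as index 1 in the toy path, are simply absent from the returned list.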

Although this illustrative example has a modest number of predictors, lasso works in settings with more predictors than observations; see the online supplement for examples. We simply follow the solution path until the penalty is weak enough that the penalized and unpenalized estimates are equal, at which point the algorithm stops. With numerous predictors, we typically are interested in models with significantly fewer nonzero $\beta$ elements than observations, so we can stop the algorithm early to save computational time.

It is possible to select one value of $t_1$, e.g., via cross validation, and use the fitted model at that point in the solution path as a final model. When analysts’ primary goal is making out-of-sample predictions, or when there is scant domain knowledge to guide model selection, this approach can provide useful models. However, as an exploratory tool, we prefer to use penalized quantile regression as a process for ranking the importance of groups of covariates. We then combine domain knowledge with the information gleaned from the exploratory fits in the formal model building process. This process is described further in Section 2.3 and in the online supplement.

[Figure 2 (image not reproduced): six panels (Lasso, and elastic net with $\lambda_2^*$ = 1, 10, 100, 500, 1000) plotting $\beta$ (0.00 to 0.20) against $t_1$ (0.0 to 0.6).]

Figure 2: Start of lasso and elastic net solution paths for a quantile regression of the 25th percentile of simulated data. Three of the covariates are very strongly correlated and have an associated true regression parameter of 0.2 (solid lines). The other covariates are less strongly correlated with each other. Two of these have a true $\beta$ value of 0.3 (dashed lines) and the rest have a true value of zero (dotted lines). Lasso is equivalent to a $\lambda_2^* = 0$ elastic net fit. The abscissa variable is $t_1 = \sum_{j=1}^{p} |\beta_j|$.

2.2 Boosted elastic net for quantile regression

As an exploratory tool, lasso can be sub-optimal when there are highly correlated groups of predictors, as it will tend to grant only one of them a nonzero $\beta$ coefficient. Here, the elastic net [6] may better identify groups of important predictors. It penalizes the empirical loss function by adding a term of $\lambda_2 \sum_j \beta_j^2$ to the lasso criterion. Thus, we seek the value of $\beta$ that minimizes

$$\Gamma_{EN}(\beta; \lambda_1, \lambda_2) = L(\beta) + \lambda_1 \sum_{j=1}^{p} |\beta_j| + \lambda_2 \sum_{j=1}^{p} \beta_j^2 \tag{5}$$

for given $\lambda_1$ and $\lambda_2$. As with lasso, larger values of $\lambda_1$ encourage more zeros in the solution for $\beta$. With a large value of $\lambda_2$, components of $\beta$ are encouraged to take on small but nonzero values, since the penalty disfavors large coefficients.

To see why the elastic net penalty encourages groups of correlated predictors to enter the model simultaneously relative to lasso, consider covariates $X_1$ and $X_2$ that are nearly identical for all subjects. The lasso penalty cannot easily distinguish between a solution with $(\beta_1, \beta_2) = (0, \phi)$ and $(\beta_1, \beta_2) = (\phi/2, \phi/2)$ for some $\phi \ne 0$, so that the solution may present only one nonzero predictor. The elastic net penalty with $\lambda_2 > 0$ prefers $(\phi/2, \phi/2)$, so that both predictors take on nonzero values at similar points in the solution path.
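The arithmetic behind this argument is easy to verify. In the sketch below, the value of `phi`, the penalty weights, and the function names are hypothetical; the point is only that the two coefficient vectors have the same absolute-value penalty but different squared penalties.

```python
def l1_penalty(beta):
    """Sum of absolute coefficients, the lasso part of (5)."""
    return sum(abs(b) for b in beta)


def l2_penalty(beta):
    """Sum of squared coefficients, the extra elastic net part of (5)."""
    return sum(b * b for b in beta)


def elastic_net_penalty(beta, lam1, lam2):
    """Total penalty in (5) for given lambda_1 and lambda_2."""
    return lam1 * l1_penalty(beta) + lam2 * l2_penalty(beta)


phi = 0.4                       # hypothetical effect size
concentrated = [0.0, phi]       # all weight on one of two near-identical covariates
split = [phi / 2, phi / 2]      # weight shared equally

# If X1 and X2 are (nearly) identical, both vectors give (nearly) the same
# fitted values and hence the same loss, so only the penalty separates them.
print(l1_penalty(concentrated), l1_penalty(split))
print(l2_penalty(concentrated), l2_penalty(split))
```

The squared penalty is $\phi^2$ for the concentrated vector and $\phi^2/2$ for the split vector, so any $\lambda_2 > 0$ tips the balance toward sharing the coefficient across both covariates.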

To adapt the elastic net to exploratory quantile regression, we let $L(\beta)$ be the loss function in (2). Rather than select one value of $(\lambda_1, \lambda_2)$, we investigate the solutions for a range of values to identify important predictors. To facilitate the exploration, we replace $\lambda_2$ with $\lambda_1 \lambda_2^*$, where $\lambda_2^*$ is a positive constant. For a fixed $\lambda_2^*$, we fit the solution path as $\lambda_1$ shrinks toward zero. As with lasso, we look for groups of coefficients that move away from zero early and fast in the solution path. We again use boosting to fit the solution paths given $\lambda_2^*$; see the online supplement.

We examine the solution paths for several values of $\lambda_2^*$, searching for a value of $\lambda_2^*$ large enough to allow more variables to enter the model early in the solution path (compared to the lasso fit), but small enough that some parsimony is maintained early in the solution path. Appropriate values of $\lambda_2^*$ depend on the data. We recommend starting with $\lambda_2^* = 1$ and moving up or down by factors of five or ten until these goals are achieved. It may be helpful to inspect intermediate values of $\lambda_2^*$, but we have experienced only moderate sensitivity to $\lambda_2^*$ settings.

To illustrate, consider Figure 2, which shows the fits for $\lambda_2^*$ between zero and 1000. The fit with $\lambda_2^* = 1$ is nearly unchanged from the lasso ($\lambda_2^* = 0$) fit. When $\lambda_2^*$ is 10 or 100, a fifth variable enters the model early, thus appearing as an important predictor. When we increase $\lambda_2^*$ to 500 or 1000, it becomes difficult to discern exactly which variables are likely to be most important since most of them take on nonzero values early in the solution path; we argue that these fits are not useful for an exploratory analysis. Thus, in this simulated dataset, fits with $\lambda_2^*$ equal to 10 or 100 satisfy our goals.

2.3 A framework for penalized exploratory quantile regression

The lasso and elastic net quantile regressions can be combined to improve exploratory analyses for quantile regression. We recommend the following steps.

First, we fit the lasso regression solution path at the quantile of interest. We isolate variables with coefficients that take on nonzero values early in the solution path. These variables are the starting point for formal model building. We can rank the importance of the remaining variables by the order in which their corresponding coefficients take on nonzero values. We can choose to include some of these or additional variables in the final model based on scientific considerations and formal hypothesis testing.

Second, as a check on the lasso results, we fit the elastic net at several values of $\lambda_2^*$, as outlined in Section 2.2. Here, we look for variables that appear early in the elastic net fit but are not prominent in the lasso. These variables can be considered for inclusion based on scientific merits and formal hypothesis tests. The elastic net is especially useful in settings where interaction effects are suspected, since interaction variables are often collinear with main effects and thus may be excluded by the lasso.

This is an exploratory framework designed to highlight prominent groups of predictors. Like all exploratory model building techniques, it should supplement, not replace, scientific rationales for model building. Analysts should add or subtract variables accordingly. Nonetheless, in high-dimensional settings with many predictors, exploratory quantile regression tools can help analysts find important predictors that might otherwise be difficult to uncover, as we now illustrate on genuine data.

3 Application

We apply the exploratory framework to the Healthy Pregnancy, Healthy Baby Study (HPHBS), an observational prospective cohort study in Durham, NC, focused on the etiology of adverse birth outcomes. These adverse outcomes, including low birth weight and preterm birth, have been linked to many problems [13] including blindness [14], deafness [15], and behavioral problems [16]. Unfortunately, the causes of adverse birth outcomes are poorly understood, although predictors cited as important include smoking [17], lead exposure [18], environmental exposures more generally [13], and psychological stress [19,20,21].

We focus on non-Hispanic black mothers, resulting in a sample size of 881 births. We seek models for birth weight and gestational age, with particular emphasis on low quantiles of these distributions, e.g., the 10th, 20th and 30th percentiles. The data comprise 35 covariates, including maternal demographics like age, education and income; maternal medical history variables like previous birth outcomes; maternal environmental exposures to cadmium, nicotine, cotinine, mercury, and lead as measured in blood; and maternal psychosocial factors like scores on the Interpersonal Support Evaluation List, the CES-D depression scale, the NEO Five Factor Personality Inventory, perceived racism, and availability of social support. We believe that interactions among these variables may be important predictors of birth outcomes. Therefore, we include in $x$ the main effects of all predictors (using indicator variables for multi-category covariates), linear and squared terms for continuous predictors, and all two-way interactions, resulting in over 600 covariates for consideration. Summary statistics for the covariates are in Table 1, and a histogram of birth weights is provided in the online supplement. Missing data in $x$ (birth outcomes are complete) were multiply imputed using regression trees [28].
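A back-of-the-envelope count shows how 35 covariates can grow to more than 600 candidate columns. The calculation below makes the simplifying assumption that each covariate contributes a single column; in the actual design, multi-category covariates contribute several indicator columns and continuous covariates add squared terms, so the exact total differs.

```python
from math import comb

n_covariates = 35                      # reported in the text
n_pairs = comb(n_covariates, 2)        # two-way interactions between distinct covariates
n_columns = n_covariates + n_pairs     # main effects plus interactions
print(n_pairs, n_columns)
```

Even under this simplification the pairwise interactions account for nearly all of the candidate columns, which is why the number of predictors approaches the sample size of 881.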

With lasso quantile regressions, measures of tobacco smoke exposure take on primary importance, particularly for birth weight. When modeling the 10th percentile of birth weight, four of the first nine covariates to enter the model are interactions that include a tobacco measure (see Figure 3); at the 20th percentile, four of the first seven are tobacco-related; at the 30th percentile, five of the first six are interactions involving tobacco. These results are consistent with the literature that extensively documents the impact of smoking on birth weight. Across the quantiles, a lead/tobacco interaction is consistently flagged.

Other important variables from the lasso 10th percentile regression for birth weight include the mother’s age, selected as a squared term and in an interaction with environmental tobacco smoke exposure. The former suggests that the association between age and birth weight follows an inverted U-shaped curve, which is again consistent with the literature [29]. In addition, the ISEL appraisal score, which is a measure of the availability of someone with whom to discuss problems, was selected in an interaction with a variable indicating perceived social standing and an interaction with having a “visiting” relationship with the father of the child. At the 20th percentile, prominent variables (other than those involving tobacco) include an interaction between negative paternal support and NEO-conscientiousness (a measure of organization and persistence) and an interaction between negative paternal support and the perceived self-efficacy score, which measures a woman’s belief that she can steer her own life’s trajectory [30]. Qualitatively, the results are similar at the 30th percentile of birth weight. Smoking measures are prominent, and environmental measures other than a tobacco/lead interaction are slow to enter the model.

In the presented results, we focus on birth weight as the response because lasso fits of gestational age reveal fewer useful explanatory variables, with less consistency across the response quantiles. This may be evidence that the covariates poorly capture the determinants of gestational age.

[Figure 3 (image not reproduced): two panels, lasso and elastic net, plotting $\beta$ (−20 to 40) against $t_1$ (0 to 100).]

Figure 3: Lasso and elastic net solution paths for a quantile regression of the 10th percentile of birth weight. Tobacco-related variables and interactions including tobacco-related variables are solid lines; all others are dashed. The elastic net fit uses $\lambda_2^* = 0.05$. Many components are overplotted on the $\beta = 0$ line. The abscissa variable is $t_1 = \sum_{j=1}^{p} |\beta_j|$, and estimates are scaled in grams.

With the elastic net, the fits are essentially identical to the lasso fit when $\lambda_2^* = 0.01$. When $\lambda_2^* = 0.1$, dozens of variables take on nonzero values very early in the fitting process, which makes identifying the important variables difficult. Hence, we focus on the $\lambda_2^* = 0.05$ elastic net. This fit generally selects the same variables as lasso, but it also identifies other potentially important covariates: interactions between blood lead levels and mother’s self-reported tobacco use at the start of the pregnancy, and between blood lead levels and mother’s tobacco use during the pregnancy, are among the early variables with nonzero coefficients.

Building on the exploratory analyses, we estimate regressions for the 10th, 20th and 30th percentiles of birth weight. Although not selected in the exploratory stage, we include standard controls known to be predictors of birth weight, including the baby’s gender and an indicator for first birth [29,30]. Since the mother’s age appeared important as a squared term, we include age and $(\text{age})^2$. We also consider the tobacco use at prenatal intake/lead interaction that was the first variable to appear in our 10th percentile lasso and elastic net fits, and add in the main effects to comply with the hierarchy principle. We also fit models including parental relationship status, ISEL appraisal scores, and their interaction; however, statistically insignificant F-tests of these effects suggested that they could be removed from the models without much sacrifice. We similarly remove the maternal age/tobacco interaction after an F-test.

The results are summarized in Table 2, along with a median regression for completeness. While age does not approach significance in any of the presented quantiles, the coefficient on (centered) age squared is consistently negative. This suggests an inverted U-shaped effect of maternal age on birth weight; i.e., women at the bottom and top ends of the age distribution tend to have babies with lower birth weights. It is not until the 30th percentile that infant sex nearly reaches significance, and parity (first or higher order birth) never reaches significance in the quantiles presented. Since infant sex and parity are well-established predictors of birth weight in traditional mean regression analyses, we might reasonably conclude that at lower quantiles, other processes overwhelm the effects that dominate at the mean.


The coefficients on the environmental variables (the continuous blood lead measurement, tobacco use at prenatal care intake, and their interaction) are especially interesting. The combined main and interaction effects of lead and tobacco essentially cancel at the 10th percentile. At the 20th percentile, women who both use tobacco and have lead exposure one standard deviation above the mean have a net decrease in birth weight of 174g. However, at the 10th and 20th percentiles, these coefficient estimates are not significant. At the 30th percentile, the combined main and interaction effects are associated with a 168g decrement in birth weight. Here, the interaction effect is significant, and the two main effects approach significance.

The meta-analysis of Navas et al. [32] found that “evidence is sufficient to infer a causal relationship between lead exposure and high blood pressure” (p. 478). Concurrently, hypertension is associated with poorer birth outcomes [29]. There is also the anomalous but persistent finding that smoking during pregnancy reduces the risk for preeclampsia [33,34,35] (although among preeclamptics, smoking increases the risk for perinatal mortality and fetal growth restriction). Both preeclampsia and maternal hypertension are associated with lower birth weights. Because hypertension can be either the cause or the result of problems in pregnancy (especially if it progresses to preeclampsia), this exploratory analysis suggests that a nuanced treatment of the lead-hypertension-smoking nexus may be critical to understanding adverse birth outcomes. Such modeling is part of our ongoing research agenda.

A recent report [23] noted that while the relationships between environmental exposures and preterm birth are in general poorly understood, “possible exceptions are lead and environmental tobacco smoke, for which the weight of evidence suggests that maternal exposure to these pollutants increases the risk for preterm birth” (p. 229). These variables are also important contributors to low birth weights [13]. The potential lead/smoking interaction, while apparently unreported in the birth outcomes literature, has been highlighted in other contexts. For example, the interaction has been found to be important for predicting mortality due to cancer [36] and attention deficit/hyperactivity disorder [37].

4 Summary

Quantile regression facilitates identifying covariates that differentially affect the tails of continuous response distributions, which often are of clinical and scientific relevance. We presented an exploratory framework for identifying potentially important covariates in quantile regression for high-dimensional predictor spaces. The underlying algorithms, based on the lasso and elastic net penalties, are computationally fast and simple to use, thereby offering analysts methods for investigating the importance of hundreds of potential regressors on noncentral features of the response distribution. We applied the framework to a study of adverse birth outcomes and identified an interaction between maternal lead exposure and smoking that was not previously discussed in the literature.
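As an illustration of the kind of objective that lasso and elastic net quantile regression minimize, the following Python sketch writes out the check loss of Koenker and Bassett plus a penalty on the coefficients. It is not the algorithm used in this paper: the function names, the `alpha` mixing parameterization of the penalty, and the toy birth weights are assumptions made for illustration, and no path-following or tuning-parameter selection is attempted.

```python
def check_loss(residual, tau):
    """Check (pinball) loss for one residual at quantile level tau in (0, 1)."""
    return residual * (tau - (1.0 if residual < 0 else 0.0))


def penalized_objective(beta0, beta, X, y, tau, lam, alpha=1.0):
    """Summed check loss plus an elastic-net style penalty on the slopes.

    alpha = 1.0 gives a pure lasso (L1) penalty; smaller alpha mixes in a
    squared L2 term. This parameterization is an illustrative assumption.
    The intercept beta0 is left unpenalized.
    """
    fit = 0.0
    for row, target in zip(X, y):
        prediction = beta0 + sum(b * x for b, x in zip(beta, row))
        fit += check_loss(target - prediction, tau)
    l1 = sum(abs(b) for b in beta)
    l2 = sum(b * b for b in beta)
    return fit + lam * (alpha * l1 + (1.0 - alpha) * l2)


def best_constant(y, tau):
    """Intercept-only fit: the observed value that minimizes the check loss."""
    return min(sorted(y), key=lambda c: sum(check_loss(v - c, tau) for v in y))


# Hypothetical birth weights in grams (not data from the study).
weights = [2400, 2900, 3100, 3300, 3500, 3600, 3800, 4000, 4100]
low_tail = best_constant(weights, 0.3)   # targets the 30th percentile
center = sum(weights) / len(weights)     # what a mean model targets instead
```

With these nine toy weights, `best_constant(weights, 0.3)` returns the third-smallest value, 3100, whereas the mean is roughly 3411. Setting `lam` above zero in `penalized_objective` adds a cost for nonzero slopes, which is what pushes many coefficients to exactly zero under the L1 term.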

The methods presented here apply well beyond studies of adverse birth outcomes. For example, quantile regression has been used to study effects of tobacco in relation to sleep patterns [38], degradation of mental acuity in multiple sclerosis patients [39], and determinants of obesity [40]. In each case, quantile regression provides a fuller picture of covariate effects than standard linear regression could. As quantile regression gains popularity and datasets become richer, we believe that our exploratory techniques will be a useful addition to the epidemiologist’s toolbox.

5 Acknowledgements

The authors wish to thank Geeta Swamy, two anonymous referees, and Editors Wacholder and Wilcox for comments that improved the manuscript.

References

1. RB Bendel, AA Afifi. Comparison of stopping rules in forward “stepwise” regression. J Am Stat Assoc. 1977;72:46–53.

2. L Breiman, JH Friedman, RA Olshen, CJ Stone. Classification and Regression Trees. Boca Raton: Chapman & Hall/CRC; 1984.

3. AE Hoerl, RW Kennard. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics. 2000;42:80–86.

4. R Tibshirani. Regression shrinkage and selection via the lasso. J R Stat Soc Series B Stat Methodol. 1996;58:267–288.

5. R Koenker, G Bassett Jr. Regression quantiles. Econometrica. 1978;46:33–50.

6. H Zou, T Hastie. Regularization and variable selection via the elastic net. J R Stat Soc Series B Stat Methodol. 2005;67:301–320.

7. N Fenske, T Kneib, T Hothorn. Identifying risk factors for severe childhood malnutrition by boosting additive quantile regression. Technical Report, University of Munich Department of Statistics, 2009.

8. Y Wu, Y Liu. Variable selection in quantile regression. Stat Sin. 2009;19:801–817.

9. R Koenker. Quantile regression for longitudinal data. J Multivar Anal. 2004;91:74–89.

10. Y Li, J Zhu. L1-norm quantile regression. J Comput Graph Stat. 2008;17:163–185.

11. Q Li, R Xi, N Lin. Bayesian regularized quantile regression. Bayesian Anal. 2010;5:533–556.

12. P Zhao, B Yu. Boosted lasso. In: H Liu, R Stine, L Auslender, chairs. Workshop on Feature Selection for Data Mining. Newport Beach, CA: SIAM; 2005:35–44.

13. ML Miranda, D Kim, JP Reiter, MA Overstreet Galeano, P Maxson. Environmental contributors to the achievement gap. Neurotoxicology. 2009;30:1019–1024.

14. BJ Crofts, R King, A Johnson. The contribution of low birth weight to severe vision loss in a geographically defined population. Br J Ophthalmol. 1998;82:9–13.

15. JM Lorenz, DE Wooliever, JR Jetton, N Paneth. A quantitative review of mortality and developmental disability in extremely premature newborns. Arch Pediatr Adolesc Med. 1998;152:425–435.

16. MC McCormick, SL Gortmaker, AM Sobol. Very low birth weight children: Behavior problems and school difficulty in a national sample. J Pediatr. 1990;117:687–693.

17. DA Savitz, N Dole, JW Terry Jr, H Zhou, JM Thorp Jr. Smoking and pregnancy outcome among African-American and white women in central North Carolina. Epidemiology. 2001;12:636–642.

18. T Gonzalez-Cossio, KE Peterson, LH Sanin, et al. Decrease in birth weight in relation to maternal bone-lead burden. Pediatrics. 1997;100:856–862.

19. RW Newton, LP Hunt. Psychosocial stress in pregnancy and its relation to low birth weight. BMJ. 1984;288:1191–1194.

20. ST Orr, JP Reiter, DG Blazer, SA James. Maternal prenatal pregnancy-related anxiety and spontaneous preterm birth in Baltimore, Maryland. Psychosom Med. 2007;69:566–570.

21. ST Orr, DG Blazer, SA James, JP Reiter. Depressive symptoms and indicators of maternal health status during pregnancy. J Womens Health. 2007;16:535–542.

22. R Koenker, KF Hallock. Quantile regression. J Econ Perspect. 2001;15:143–156.

23. RE Behrman, AS Butler. Preterm Birth: Causes, Consequences and Prevention. Washington, DC: National Academy Press; 1997.

24. GC Windham, B Hopkins, L Fenster, SH Swan. Prenatal active or passive tobacco smoke exposure and the risk of preterm delivery or low birth weight. Epidemiology. 2000;11:427–433.

25. H Akaike. Information theory and an extension of the maximum likelihood principle. In: BN Petrov, F Csaki, eds. Second International Symposium on Information Theory. Budapest: Akademiai Kiado; 1973:267–281.

26. G Schwarz. Estimating the dimension of a model. Ann Stat. 1978;6:461–464.

27. B Efron, T Hastie, I Johnstone, R Tibshirani. Least angle regression. Ann Stat. 2004;32:407–499.

28. LF Burgette, JP Reiter. Multiple imputation for missing data via sequential regression trees. Am J Epidemiol. 2010;172:1070–1076.

29. GK Swamy, S Edwards, A Gelfand, SA James, ML Miranda. Maternal age, birth order and race: Differential effects on birthweight. J Epidemiol Community Health. Forthcoming;1–23.

30. A Bandura. Self-efficacy. In: Corsini Encyclopedia of Psychology. Wiley Online Library, 2010.

31. P Magnus, K Berg, T Bjerkedal. The association of parity and birth weight: Testing the sensitization hypothesis. Early Hum Dev. 1985;12:49–54.

32. A Navas-Acien, E Guallar, EK Silbergeld, SJ Rothenberg. Lead exposure and cardiovascular disease: A systematic review. Environ Health Perspect. 2007;115:472–482.

33. S Cnattingius, JL Mills, J Yuen, O Eriksson, H Salonen. The paradoxical effect of smoking in preeclamptic pregnancies: Smoking reduces the incidence but increases the rates of perinatal mortality, abruptio placentae, and intrauterine growth restriction. Am J Obstet Gynecol. 1997;177:156–161.

34. A Castles, EK Adams, CL Melvin, C Kelsch, ML Boulton. Effects of smoking during pregnancy: Five meta-analyses. Am J Prev Med. 1999;16:208–215.

35. KY Lain, RW Powers, MA Krohn, RB Ness, WR Crombleholme, JM Roberts. Urinary cotinine concentration confirms the reduced risk of preeclampsia with tobacco exposure. Am J Obstet Gynecol. 1999;181:1192–1196.

36. M Lustberg, E Silbergeld. Blood lead levels and mortality. Arch Intern Med. 2002;162:2443–2449.

37. TE Froehlich, BP Lanphear, P Auinger, et al. Association of tobacco and lead exposures with attention deficit/hyperactivity disorder. Pediatrics. 2009;124:e1050–e1063.

38. L Zhang, J Samet, B Caffo, NM Punjabi. Cigarette smoking and nocturnal sleep architecture. Am J Epidemiol. 2006;164:529–537.

39. RA Marrie, NV Dawson, A Garland. Quantile regression and restricted cubic splines are useful for exploring relationships between continuous variables. J Clin Epidemiol. 2009;62:511–517.

40. K Kan, WD Tsai. Obesity and risk knowledge. J Health Econ. 2004;23:907–934.

Variable                     Description                                 Mean               SD
Income                       Ordered categorical reported income         < $5000∗
Maternal education           Ordered categorical                         Some high school∗
Maternal age                 In years                                    25                 6
Private insurance            Private health insurance                    13%
Parity                       Is this the mother’s first child?           38%
Cotinine                     Blood cotinine measurement (ng/mL)          21                 58
Nicotine                     Blood nicotine measurement (ng/mL)          0.6                2
Cadmium                      Blood cadmium measurement (ng/mL)           0.4                0.6
Mercury                      Blood mercury measurement (ng/mL)           0.5                1
Manganese                    Blood manganese measurement (ng/mL)         1                  1
Lead                         Blood lead measurement (µg/dL)              0.4                0.8
ISEL Belong                  ISEL belonging mean                         9                  2
ISEL Self                    ISEL self-esteem score                      9                  2
ISEL Appr                    ISEL appraisal score                        10                 3
ISEL Tang                    ISEL tangible score                         9                  3
ISEL Total                   ISEL total                                  38                 8
Paternal visit               Visiting relationship with father           22%
Not romantically involved    Not romantically involved with father       21%
PS Score                     Mean of paternal support                    2.7                0.3
Pos support                  Score of positive paternal support          2.5                0.5
Neg support                  Score of negative paternal support          1.2                0.3
NEO-N                        NEO Neuroticism score                       20                 7
NEO-A                        NEO Agreeableness score                     31                 6
NEO-E                        NEO Extroversion score                      29                 6
NEO-O                        NEO Openness score                          25                 5
NEO-C                        NEO Conscientiousness score                 35                 6
Perceived racism             Sum of measures of experienced racism       0.8                1.3
JHAC score                   John Henryism                               52                 6
Ladder                       Position relative to peers                  7                  2
CESD score                   Depression score                            16                 10.6
PSS score                    Perceived stress score                      2.6                0.7
SE score                     Self-efficacy                               3.3                0.5
Environmental tobacco smoke  Self-report environmental tobacco exposure  68%
TobUse                       Tobacco use at delivery                     20%
Tobacco                      Tobacco use at prenatal care intake         19%

Table 1: Variables (along with interactions and squared terms) included in exploratory analyses with means and standard deviations. Starred entries are modal responses.

                    10th percentile     20th percentile     30th percentile     Median
                    Estimate (Interval) Estimate (Interval) Estimate (Interval) Estimate (Interval)
Intercept           2271 (1977, 2564)   2644 (2497, 2791)   2802 (2715, 2890)   3038 (2928, 3147)
Age (centered)      -5 (-31, 20)        -1 (-16, 14)        1 (-11, 14)         -5 (-16, 6)
Age² (centered)     -3 (-6, 1)          -3 (-5, -1)         -2 (-4, -0)         -1 (-2, 1)
Male                -21 (-261, 219)     37 (-92, 166)       98 (-4, 201)        80 (-10, 170)
2nd baby or later   -19 (-266, 228)     54 (-85, 192)       0 (-113, 115)       60 (-43, 163)
Blood lead          206 (48, 365)       109 (-11, 230)      81 (-7, 170)        88 (21, 156)
Tobacco use         106 (-187, 399)     -86 (-240, 68)      -91 (-230, 47)      -102 (-212, 9)
Lead/Tobacco        -303 (-716, 110)    -197 (-368, 29)     -158 (-280, -37)    -180 (-305, -56)

Table 2: Quantile regression of 10th, 20th, 30th, and 50th percentiles of birth weight, using the lead/tobacco interaction suggested by exploratory analyses, with parameter estimates and 95% confidence bounds. Estimates are in grams.
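To read the interaction row of Table 2, it can help to combine the tobacco and lead/tobacco point estimates into a single fitted difference. The Python sketch below does this under two assumptions that the table itself does not state: that tobacco use enters the model as a 0/1 indicator alongside its product with blood lead, and that the `lead` argument is on whatever scale (raw, centered, or standardized) was used in fitting. The function name is hypothetical, and the calculation ignores the uncertainty expressed by the intervals.

```python
# Point estimates in grams, copied from Table 2; keys are percentiles.
TOBACCO = {10: 106, 20: -86, 30: -91, 50: -102}
LEAD_BY_TOBACCO = {10: -303, 20: -197, 30: -158, 50: -180}


def tobacco_difference(percentile, lead):
    """Fitted difference in a conditional percentile of birth weight
    (tobacco users minus non-users) at a given value of the lead covariate.

    Assumes a linear predictor with lead, a 0/1 tobacco indicator, and their
    product; `lead` must be on the scale used when the model was fit.
    """
    return TOBACCO[percentile] + LEAD_BY_TOBACCO[percentile] * lead


# Difference at the 30th percentile when the lead covariate equals 0 and 1.
at_zero = tobacco_difference(30, 0.0)
at_one = tobacco_difference(30, 1.0)
```

Under these assumptions, the fitted tobacco difference at the 30th percentile moves from -91 grams when the lead covariate is 0 to -249 grams when it is 1, which is the sense in which tobacco exposure appears to act more strongly at higher lead levels.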
