
Special Feature

Statistical Methods for Cohort Studies of CKD: Prediction Modeling

Jason Roy,*† Haochang Shou,*† Dawei Xie,*† Jesse Y. Hsu,*† Wei Yang,*† Amanda H. Anderson,*† J. Richard Landis,*† Christopher Jepson,*† Jiang He,‡ Kathleen D. Liu,§ Chi-yuan Hsu,§| and Harold I. Feldman*† on behalf of the Chronic Renal Insufficiency Cohort (CRIC) Study Investigators

Abstract
Prediction models are often developed in and applied to CKD populations. These models can be used to inform patients and clinicians about the potential risks of disease development or progression. With increasing availability of large datasets from CKD cohorts, there is opportunity to develop better prediction models that will lead to more informed treatment decisions. It is important that prediction modeling be done using appropriate statistical methods to achieve the highest accuracy, while avoiding overfitting and poor calibration. In this paper, we review prediction modeling methods in general, from model building to assessing model performance, as well as the application to new patient populations. Throughout, the methods are illustrated using data from the Chronic Renal Insufficiency Cohort Study.

Clin J Am Soc Nephrol ▪: ccc–ccc, 2016. doi: 10.2215/CJN.06210616

Introduction

Predictive models and risk assessment tools are intended to influence clinical practice and have been a topic of scientific research for decades. A PubMed search of "prediction model" yields over 40,000 papers. In CKD, research has focused on predicting CKD progression (1,2), cardiovascular events (3), and mortality (4–6), among many other outcomes (7). Interest in developing prediction models will continue to grow with the emerging focus on personalized medicine and the availability of large electronic databases of clinical information. Researchers carrying out prediction modeling studies need to think carefully about design, development, validation, interpretation, and the reporting of results. This methodologic review article will discuss these key aspects of prediction modeling. We illustrate the concepts using an example from the Chronic Renal Insufficiency Cohort (CRIC) Study (8,9), as described below.

Motivating Example: Prediction of CKD Progression

The motivating example focuses on the development of prediction models for CKD progression. In addition to the general goal of finding a good prediction model, we explore whether a novel biomarker improves prediction of CKD progression over established predictors. In this case, urine neutrophil gelatinase–associated lipocalin (NGAL) was identified as a potential risk factor for CKD progression on the basis of a growing literature that showed elevated levels in humans and animals with CKD or kidney injury (2). The question of interest was whether baseline urine NGAL would provide additional predictive information beyond the information captured by established predictors.

The CRIC Study is a multicenter cohort study of adults with moderate to advanced CKD. The design and characteristics of the CRIC Study have been described previously (8,9). In total, 3386 CRIC Study participants had valid urine NGAL test data and were included in the prediction modeling. Details of the procedures for obtaining urine NGAL are provided elsewhere (2,10). Established predictors include sociodemographic characteristics (age, sex, race/ethnicity, and education), eGFR (in milliliters per minute per 1.73 m²), proteinuria (in grams per day), systolic BP, body mass index, history of cardiovascular disease, diabetes, and use of angiotensin–converting enzyme inhibitors/angiotensin II receptor blockers. In this example, all were measured at baseline.

The outcome was progressive CKD, which was defined as a composite end point of incident ESRD or halving of eGFR from baseline using the Modification of Diet in Renal Disease Study equation (11). ESRD was considered to have occurred when a patient underwent kidney transplantation or began chronic dialysis. For the purposes of this paper, in lieu of the broader examination of NGAL that was part of the original reports (2,10), we will focus on occurrence of progressive CKD within 2 years from baseline (a yes/no variable).

Among the 3386 participants, 10% had progressive CKD within 2 years. The median value of urine NGAL was 17.2 ng/ml, with an interquartile range of 8.1–39.2 ng/ml. Detailed characteristics of the study population are given in the work by Liu et al. (2).

We excluded patients with missing predictors (n=119), leaving a total of 3033 with a valid NGAL measurement, no missing predictors, and an observed outcome. We made this decision because the percentage of missing data was low, the characteristics of those with observed and missing data were similar (data not shown), and we wanted to focus on prediction and not on missing data issues. In practice, however, multiple imputation is generally recommended for handling missing predictors (12,13).

*Department of Biostatistics and Epidemiology and †Center for Clinical Epidemiology and Biostatistics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania; ‡Department of Epidemiology, Tulane University, New Orleans, Louisiana; §Department of Medicine, University of California, San Francisco, California; and |Division of Research, Kaiser Permanente Northern California, Oakland, California

Correspondence: Dr. Jason Roy, Department of Biostatistics and Epidemiology, Perelman School of Medicine, University of Pennsylvania, 632 Blockley Hall, 423 Guardian Drive, Philadelphia, PA 19104. Email: jaroy@upenn.edu





Prediction Models

It is important to distinguish between two major types of modeling found in medical research: associative modeling and prediction modeling. In associative modeling, the goal is typically to identify population-level relationships between independent variables (e.g., exposures) and dependent variables (e.g., clinical outcomes). Although associative modeling does not necessarily establish causal relationships, it is often used in an effort to improve our understanding of the mechanisms through which outcomes occur. In prediction modeling, by contrast, the goal is typically to generate the best possible estimate of the value of the outcome variable for each individual. These models are often developed for use in clinical settings to help inform treatment decisions.

Prediction models use data from current patients, for whom both outcomes and predictors are available, to learn about the relationship between the predictors and outcomes. The models can then be applied to new patients, for whom only predictors are available, to make educated guesses about what their future outcome will be. Prediction modeling as a field involves both the development of the models and the evaluation of their performance.

In the era of big data, there has been increased interest in prediction modeling. In fact, an entire field, machine learning, is now devoted to developing better algorithms for prediction. As a result of so much research activity focused on prediction, many new algorithms have been developed (14). For continuous outcomes, options include linear regression, generalized additive models (15), Gaussian process regression (16), regression trees (17), and k-nearest neighbors (18), among others. For binary outcomes, prediction is also known as classification, because the goal is to classify an individual into one of two categories on the basis of their set of predictors (features, in machine learning terminology). Popular options for binary outcome prediction include logistic regression, classification trees (17), support vector machines (19), and k-nearest neighbors (20). Different machine learning algorithms have various strengths and weaknesses, discussion of which is beyond the scope of this paper. In the CRIC Study example, we use a standard logistic regression model of the binary outcome (occurrence of progressive CKD within 2 years).

Variable Selection

Regardless of which type of prediction model is used, a variable selection strategy will need to be chosen. If we are interested in the incremental improvement in prediction due to a novel biomarker (like urine NGAL), then it is reasonable to start with a set of established predictors and assess what improvement, if any, the biomarker adds to the model. Variable selection is, therefore, knowledge driven. Alternatively, if the goal is simply to use all available data to find the best prediction model, then a data-driven approach can be applied. Data-driven methods are typically automated: the researcher provides a large set of possible predictors, and the method selects from that a shorter list of predictors to include in a final model. Data-driven methods include criterion methods, such as optimizing the Bayesian information criterion (21); regularization methods, such as the Lasso (22); and dimension reduction methods, such as principal components analysis (23), when there is a large set of predictors. Which type of variable selection approach to use depends on the purpose of the prediction model and its envisioned use in clinical practice. For example, if portability is a goal, then restricting to commonly collected variables might be important.
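As a concrete illustration of a data-driven approach, the following is a minimal sketch of Lasso-type variable selection via L1-penalized logistic regression, where predictors with nonzero coefficients are retained. It uses simulated data rather than CRIC Study data, and all names and tuning values are illustrative only.

```python
# Minimal sketch of Lasso-type variable selection: an L1-penalized logistic
# regression zeroes out the coefficients of uninformative predictors.
# Simulated data; the penalty strength C is illustrative, not tuned.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))            # 20 candidate predictors
lin = 1.5 * X[:, 0] - 1.0 * X[:, 1]       # only the first two matter
y = rng.binomial(1, 1 / (1 + np.exp(-lin)))

Xs = StandardScaler().fit_transform(X)    # standardize so the penalty is fair
lasso = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(Xs, y)

selected = np.flatnonzero(lasso.coef_.ravel() != 0)
print("selected predictor indices:", selected)
```

In practice, the penalty strength would typically be chosen by cross-validation rather than fixed a priori.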

Performance Evaluation

After a model is developed, it is important to quantify how well it performs. In this section, we describe several methods for assessing performance. Our emphasis is on performance metrics for binary outcomes (classification problems); some of the metrics can be used for continuous outcomes as well.

Typically, a prediction model for a binary outcome produces a risk score for each individual, denoting their predicted risk of experiencing the outcome given their observed values on the predictor variables. For example, logistic regression yields a risk score, which is the log odds (logit) of the predicted probability of an individual experiencing the outcome of interest. A good prediction model for a binary outcome should lead to good discrimination (i.e., good separation in risk scores between individuals who will, in fact, develop the outcome and those who will not). Consider the CRIC Study example. We fitted three logistic regression models of progressive CKD (within 2 years from baseline). The first included only age, sex, and race as predictors. The second model also included eGFR and proteinuria. Finally, the third model also included the other established predictors: angiotensin–converting enzyme inhibitor/angiotensin II receptor blocker use, an indicator for any history of cardiovascular disease, diabetes, educational level, systolic BP, and body mass index. In Figure 1, the risk scores from each of the logistic regression models are plotted against the observed outcome. The plots show that, as more predictors were added to the model, the separation in risk scores between participants who did or did not experience the progressive CKD outcome increased. In model 1, for example, the distribution of the risk score was very similar for both groups. However, in model 3, those not experiencing progressive CKD tended to have much lower risk scores than those who did.
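To make the risk score construction concrete, here is a minimal sketch, assuming simulated data in place of the CRIC Study variables: a logistic regression is fit, and the risk score is the log odds (logit) of each individual's predicted probability.

```python
# Sketch of a logistic regression risk score: the log odds (logit) of the
# fitted model's predicted probability. Simulated "age" and "egfr" stand in
# for real predictors; coefficient values are arbitrary.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 1000
age = rng.normal(60, 10, n)
egfr = rng.normal(40, 12, n)
lin = -2.0 + 0.02 * (age - 60) - 0.08 * (egfr - 40)
y = rng.binomial(1, 1 / (1 + np.exp(-lin)))      # progressive CKD (1=yes, 0=no)

X = np.column_stack([age, egfr])
model = LogisticRegression().fit(X, y)

p_hat = model.predict_proba(X)[:, 1]             # predicted probability
risk_score = np.log(p_hat / (1 - p_hat))         # risk score: log odds (logit)
```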

Sensitivity and Specificity

On the basis of the risk score, we can classify patients as high or low risk by choosing some threshold, values above which are considered high risk. A good prediction model tends to classify those who will, in fact, develop the outcome as high risk and those who will not as low risk. Thus, we can describe the performance of a test using sensitivity, the probability that a patient who will develop the outcome is classified as high risk, and specificity, the probability that a patient who will not develop the outcome is classified as low risk (24).

In the CRIC Study example, we focus on the model that included all of the established predictors.



To obtain sensitivity and specificity, we need to pick a threshold risk score, above which patients are classified as high risk. Figure 2 illustrates the idea using two risk score thresholds: a low value that leads to high sensitivity (96%) but moderate specificity (50%) and a high value that leads to lower sensitivity (43%) but a high specificity (98%). Thus, from the same model, one could have a classification that is highly sensitive (patients who will go on to CKD progression almost always screen positive) and moderately specific (only one half of patients who will not go on to CKD progression will screen negative) or one that is highly specific but only moderately sensitive.
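The sensitivity and specificity calculations behind Figure 2 amount to counting the cells of a 2×2 table at a chosen cutoff. A minimal sketch, with toy outcomes and risk scores rather than the CRIC Study data:

```python
# Sensitivity and specificity at a risk score cutoff, computed from the
# 2x2 cross-classification of true outcome by high/low risk (toy data).
import numpy as np

y = np.array([1, 1, 1, 0, 0, 0, 0, 0])                        # toy outcomes
risk_score = np.array([1.2, -0.5, -4.5, -6.0, -3.0, -5.0, 0.3, -7.0])

def sens_spec(y, risk_score, cutoff):
    high = risk_score > cutoff                  # classified as high risk
    tp = np.sum((y == 1) & high)                # true positives
    fn = np.sum((y == 1) & ~high)               # false negatives
    tn = np.sum((y == 0) & ~high)               # true negatives
    fp = np.sum((y == 0) & high)                # false positives
    return tp / (tp + fn), tn / (tn + fp)       # (sensitivity, specificity)

print(sens_spec(y, risk_score, cutoff=-4.0))    # low cutoff: higher sensitivity
print(sens_spec(y, risk_score, cutoff=0.0))     # high cutoff: higher specificity
```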

Receiver Operating Characteristic Curves and c Statistic

By changing the classification threshold, one can choose to increase sensitivity at the cost of decreased specificity or vice versa. For a given prediction model, there is no way to increase both simultaneously. However, both potentially can be increased if the prediction model itself is improved (e.g., by adding an important new variable to the model). Thus, sensitivity and specificity can be useful for comparing models. However, one model might seem better than the other at one classification threshold and worse than the other at a different threshold. We would like to compare prediction models in a way that is not dependent on the choice of risk threshold.

Receiver operating characteristic (ROC) curves display sensitivity and specificity over the entire range of possible classification thresholds. Consider again Figure 2. By using two different thresholds, we had two different pairs of values of sensitivity and specificity. We could choose dozens or more additional thresholds and record equally many pairs of sensitivity and specificity data points. These data points can be used to construct ROC curves.

Figure 1. | Plot of risk score against CKD progression within 2 years in the Chronic Renal Insufficiency Cohort (CRIC) Study data. The plots are for three different models, with increasing numbers of established predictors included from left to right. The risk score is the log odds (logit) of the predicted probability. The outcome, progressive CKD (1 = yes, 0 = no), is jittered in the plots to make the points easier to see. ACE/ARB, angiotensin-converting enzyme/angiotensin II receptor blocker; BMI, body mass index; CVD, cardiovascular disease; diab., diabetes; educ., education; SBP, systolic BP.

Figure 2. | Plots of progressive CKD within 2 years by risk score from the model with all established predictors. The plot in the left panel is on the basis of a risk score cutoff of −4, and the plot in the right panel uses a cutoff of zero. The vertical line separates low- and high-risk patients. The horizontal line separates the groups with and without progressive CKD (1 = yes, 0 = no). By counting the number of subjects in each of the quadrants, we obtain a standard 2×2 table. For example, using the cutoff of −4, there were 12 patients who were classified as low risk and ended up with CKD progression within 2 years. The sensitivity of this classification approach can be estimated by taking the number of true positives (294) and dividing by the total number of patients who developed progressive CKD (294+12=306). Thus, sensitivity is 96%. Similar calculations can be used to find the specificity of 50%. Given the same prediction model, a different classification cutpoint could be chosen. In the plot in the right panel, in which a risk score of zero is chosen as the cutpoint, sensitivity decreases to 43%, whereas specificity increases to 98%.



In particular, ROC curves are plots of the true positive rate (sensitivity) on the vertical axis against the false positive rate (1 − specificity) on the horizontal axis. Theoretically, the perfect prediction model would simply be represented by a horizontal line at 100% (perfect true positive rate, regardless of threshold). A 45° line represents a prediction model equivalent to random guessing.

One way to summarize the information in an ROC curve is with the area under the curve (AUC). This is also known as the c statistic (25). A perfect model would have a c statistic of one, which is the upper bound of the c statistic, whereas the random guessing model would have a c statistic of 0.5. Thus, one way to compare prediction models is with the c statistic, larger values being better. The c statistic also has another interpretation. Given a randomly selected case and a randomly selected control (in the CRIC Study example, a CKD progressor and a nonprogressor), the probability that the risk score is higher for the case than for the control is equal to the value of the c statistic (26).
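Computationally, the ROC curve and c statistic are readily obtained from predicted probabilities. A minimal sketch using scikit-learn, with simulated outcomes and scores in place of the CRIC Study models:

```python
# ROC curve and c statistic (AUC) from predicted probabilities.
# Simulated data: events are given somewhat higher scores, as a real
# model's predictions would be.
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(2)
y_true = rng.binomial(1, 0.1, size=2000)                 # ~10% event rate
p_hat = np.clip(0.10 + 0.20 * y_true + rng.normal(0, 0.08, 2000), 0.001, 0.999)

fpr, tpr, thresholds = roc_curve(y_true, p_hat)          # (1 - spec., sens.)
c_stat = roc_auc_score(y_true, p_hat)                    # area under the curve
print(f"c statistic = {c_stat:.2f}")
```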

In Figure 3A, the ROC curve is displayed for a prediction model that includes urine NGAL as the only predictor variable. The c statistic for this model is 0.8. If our goal were simply to determine whether urine NGAL has prognostic value for CKD progression, the answer would be yes. If, however, we were interested in the incremental value of the biomarker beyond established predictors, we would need to take additional steps. In Figure 3B, we compare ROC curves derived from two prediction models: one including only demographics and the other including demographics plus urine NGAL. From Figure 3B, we can see that the ROC curve for the model with NGAL (red curve in Figure 3B) (AUC=0.82) dominates (is above) the one without NGAL (blue curve in Figure 3B) (AUC=0.69). The c statistic for the model with urine NGAL is larger by 0.13. Thus, there seems to be incremental improvement over a model with demographic variables alone. However, the primary research question in the work by Liu et al. (2) was whether NGAL had prediction value beyond that of established predictors, which include additional factors beyond demographics. The blue curve in Figure 3C is the ROC curve for the model that included all of the established predictors. That model had a c statistic of 0.9, a very large value for a prediction model. When NGAL is added to this logistic regression model, the resulting ROC curve is the red curve in Figure 3C. These two curves are nearly indistinguishable and have the same c statistic (to two decimal places). Therefore, on the basis of this metric, urine NGAL does not add prediction value beyond established predictors. It is worth noting that urine NGAL was a statistically significant predictor in the full model (P<0.01), which illustrates the point that statistical significance does not necessarily imply added prediction value.

It is also important to consider uncertainty in the estimate of the ROC curve and c statistic. Confidence bands for the ROC curve and confidence intervals for the c statistic can be obtained from available software. For comparing two models, a confidence interval for the difference in c statistics could be obtained via, for example, nonparametric bootstrap resampling (27). However, this will generally be less powerful than the standard chi-squared test for comparing two models (28).

A limitation of comparing models on the basis of the c statistic is that it tends to be insensitive to improvements in prediction that occur when a new predictor (such as a novel biomarker) is added to a model that already has a high c statistic (29–31). It also should not be used without additionally assessing calibration, which we describe next.

Calibration

A well calibrated model is one for which predicted probabilities closely match the observed rates of the outcome over the range of predicted values. A poorly calibrated model might perform well overall on the basis of measures like the c statistic but would perform poorly for some subpopulations.

Calibration can be checked in a variety of ways.

Figure 3. | Receiver operating characteristic (ROC) curves for three different prediction models. A is the ROC curve when urine neutrophil gelatinase–associated lipocalin (NGAL) is the only predictor variable included in the model. The c statistic is 0.8. B and C each include two ROC curves. In B, the blue curve (area under the curve [AUC]=0.69) is from the model that includes only demographic variables. The red curve (AUC=0.82) additionally includes urine NGAL. In C, the blue curve is for the model that includes all established predictors. The red curve includes all established predictors plus urine NGAL. The c statistics in C are both 0.9. These plots show that urine NGAL is a relatively good predictor variable alone and has incremental improvement over the demographics-only model but does not help to improve prediction beyond established predictors.



A standard method is the Hosmer–Lemeshow test, where predicted and observed counts within percentiles of predicted probabilities are compared (32). In the CRIC Study example with the full model (including NGAL), the Hosmer–Lemeshow test on the basis of deciles of the predicted probabilities has a P value of 0.14. Rejection of the test (P<0.05) would suggest a poor fit (poor calibration), but that is not the case here. Another useful method is calibration plots. In this method, rather than obtaining observed and expected rates within percentiles, observed and expected rates are estimated using smoothing methods. Figure 4 displays a calibration plot for the CRIC Study example, with established predictors and urine NGAL included in the model. The plot suggests that the model is well calibrated, because the observed and predicted values tend to fall near the 45° line.
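A minimal sketch of the decile-based comparison underlying the Hosmer–Lemeshow test, with simulated predictions: within each decile of predicted probability, the mean prediction is compared with the observed event rate.

```python
# Decile-based calibration check: compare mean predicted probability with
# the observed event rate within deciles of the predictions (toy data).
import numpy as np

rng = np.random.default_rng(3)
p_hat = rng.uniform(0.01, 0.40, size=3000)   # predicted probabilities
y = rng.binomial(1, p_hat)                   # outcomes consistent with p_hat

cuts = np.quantile(p_hat, np.linspace(0, 1, 11))
group = np.digitize(p_hat, cuts[1:-1])       # decile index 0..9

for g in range(10):
    mask = group == g
    print(f"decile {g}: predicted {p_hat[mask].mean():.3f}, "
          f"observed {y[mask].mean():.3f}")
```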

Brier Score

A measure that takes into account both calibration and discrimination is the Brier score (33,34). We can estimate the Brier score for a model by simply taking the average squared difference between the actual binary outcome and the predicted probability of that outcome for each individual. A low value of this metric indicates a model that performs well across the range of risk scores. The perfect model would have a Brier score of zero. The difference in this score between two models can be used to compare the models. In the CRIC Study example, the Brier scores were 0.087, 0.075, 0.057, and 0.056 for the demographics-only, demographics plus NGAL, established predictors, and established predictors plus NGAL models, respectively. Thus, there was improvement in adding NGAL to the demographics model (Brier score difference of 0.012, a 14% improvement). There was also improvement in moving from the demographics-only model to the established predictors model (Brier score difference of 0.030). However, adding NGAL to the established predictors model only decreased the Brier score by 0.001 (a 2% improvement). Although how big a change constitutes a meaningful improvement is subjective, an improvement in Brier score of at least 10% would be difficult to dismiss, whereas a 2% improvement does not seem as convincing.
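Because the Brier score is simply a mean squared difference, it is straightforward to compute. A minimal sketch with toy values (not the CRIC Study estimates):

```python
# Brier score: mean squared difference between the binary outcome and its
# predicted probability; smaller is better, zero is perfect (toy data).
import numpy as np

def brier_score(y, p_hat):
    return np.mean((np.asarray(y) - np.asarray(p_hat)) ** 2)

y = np.array([0, 1, 0, 0, 1])
p_old = np.array([0.10, 0.40, 0.20, 0.05, 0.50])   # e.g., established model
p_new = np.array([0.08, 0.55, 0.15, 0.05, 0.60])   # e.g., model plus biomarker
print(brier_score(y, p_old), brier_score(y, p_new))
```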

Net Reclassification Improvement and Integrated Discrimination Improvement

Pencina et al. (35,36) developed several new methods for assessing the incremental improvement in prediction due to a new biomarker. These include net reclassification improvement (NRI), both categorical and category free, and integrated discrimination improvement. These methods are reviewed in detail in a previous Clinical Journal of the American Society of Nephrology paper (37), and therefore, here we briefly illustrate the main idea of just one of the methods (categorical NRI).

The categorical NRI approach assesses the change in predictive performance that results from adding one or more predictors to a model by comparing how the two models classify participants into risk categories. In this approach, therefore, we begin by defining risk categories as we did when calculating specificity and sensitivity; that is, we choose cutpoints of risk for the outcome variable that define categories of lower and higher risks. NRI is calculated separately for the group experiencing the outcome event (those with progressive CKD within 2 years in the CRIC Study example) and the group not experiencing the event. To calculate NRI, study participants are assigned a score of zero if they were not reclassified (i.e., if both models place the participant in the same risk category), a score of one if they were reclassified in the right direction, and a score of −1 if they were reclassified in the wrong direction. An example of the right direction is a participant who experienced an event who was classified as low risk under the old model but classified as high risk under the new model. Within each of the two groups, NRI is calculated as the sum of scores across all patients in the group divided by the total number of patients in the group. These NRI scores are bounded between −1 and one, with a score of zero implying no difference in classification accuracy between the models, a negative score indicating worse performance by the new model, and a positive score showing improved performance. The larger the score, the better the new model classified participants compared with the old model on the basis of this criterion.

We will now consider a three-category classification model, where participants were classified as low risk if their predicted probability was <0.05, medium risk if it was between 0.05 and 0.10, and high risk if it was above 0.10. These cutpoints were chosen a priori on the basis of what the investigators considered to be clinically meaningful differences in the event rate (2). The results for the comparison of the demographics-only model with the demographics and urine NGAL model are given in Figure 5, left panel. The NRIs for events and nonevents were 3.6% (95% confidence interval, −3.3% to 9.2%) and 33% (95% confidence interval, 30.8% to 35.8%), respectively. Thus, there was large improvement in risk prediction for nonevents when going from the demographics-only model to the demographics and NGAL model. Next, we compared the model with all established predictors with the same model with urine NGAL as an additional predictor. The reclassification data are given in Figure 5, right panel.

Figure 4. | Calibration plot for the model that includes established predictors and urine neutrophil gelatinase–associated lipocalin. The straight line is where a perfectly calibrated model would fall. The curve is the estimated relationship between the predicted and observed values. The shaded region is ±2 SEMs.



Overall, there was little reclassification for both events and nonevents, indicating no discernible improvement in prediction when NGAL was added to the established predictors model, which is consistent with the c statistic and Brier score findings described above.

The categorical NRI depends on classification cutpoints. There is a category-free version of NRI that avoids actual classification when assessing the models. Although NRI has practical interpretations, there is concern that it can be biased when the null is true (i.e., when a new biomarker is not actually predictive) (38,39). In particular, for poorly fitting models (especially poorly calibrated models), NRI tends to be inflated, making the new biomarker seem to add more predictive value than it actually does. As a result, bias-corrected methods have been proposed (40). In general, however, it is not recommended to quantify incremental improvement by NRI alone.
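The categorical NRI calculation described above can be sketched in a few lines. This assumes the three risk categories from the CRIC Study example (cutpoints 0.05 and 0.10), with toy predicted probabilities rather than the actual study data:

```python
# Categorical net reclassification improvement (NRI), computed separately
# for events and nonevents. Cutpoints follow the example; data are toy values.
import numpy as np

def risk_category(p, cuts=(0.05, 0.10)):
    return np.digitize(p, cuts)                  # 0=low, 1=medium, 2=high

def categorical_nri(y, p_old, p_new):
    old_cat, new_cat = risk_category(p_old), risk_category(p_new)
    up = new_cat > old_cat                       # moved to a higher category
    down = new_cat < old_cat                     # moved to a lower category
    events, nonevents = y == 1, y == 0
    # events should move up; nonevents should move down
    nri_events = (up[events].sum() - down[events].sum()) / events.sum()
    nri_nonevents = (down[nonevents].sum() - up[nonevents].sum()) / nonevents.sum()
    return nri_events, nri_nonevents

y = np.array([1, 1, 0, 0, 0, 0])
p_old = np.array([0.04, 0.12, 0.06, 0.11, 0.03, 0.09])
p_new = np.array([0.09, 0.15, 0.04, 0.08, 0.02, 0.11])
print(categorical_nri(y, p_old, p_new))
```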

Decision Analyses

The methods discussed above tend to treat sensitivity and specificity as equally important. However, in clinical practice, the relative cost of each will vary. An approach that allows one to compare models after assigning weights to these tradeoffs is decision analysis (41,42). That is, given the relative weight of a false positive versus a false negative, the net benefit of one model over another can be calculated. A decision curve can be used to allow different individuals who have different preferences to make informed decisions.
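As a sketch of the underlying calculation in the decision curve analysis framework of reference 42, the net benefit of a model at a threshold probability pt weights false positives by pt/(1 − pt); the data below are toy values, not CRIC Study results:

```python
# Net benefit at a threshold probability pt: TP/n - (FP/n) * pt / (1 - pt).
# Sweeping pt over a range of plausible preferences traces a decision curve.
import numpy as np

def net_benefit(y, p_hat, pt):
    treat = p_hat >= pt                     # "treat" those above the threshold
    n = len(y)
    tp = np.sum((y == 1) & treat)           # treated patients with the event
    fp = np.sum((y == 0) & treat)           # treated patients without the event
    return tp / n - (fp / n) * pt / (1 - pt)

y = np.array([1, 0, 0, 1, 0, 0, 0, 1])
p_hat = np.array([0.30, 0.05, 0.20, 0.45, 0.10, 0.02, 0.25, 0.15])
for pt in (0.05, 0.10, 0.20):
    print(pt, round(net_benefit(y, p_hat, pt), 3))
```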

Continuous or Survival Outcomes

We have focused our CRIC Study example primarily on a binary outcome thus far (a classification problem). However, most of the principles described above apply to continuous or survival data. Indeed, some of the same model assessment tools can also be applied. For example, for censored survival data, methods have been developed to estimate the c statistic (29,43–45). For continuous outcomes, plots of predicted versus observed outcomes and measures such as R² can be useful. Calibration methods for survival outcomes have been described in detail elsewhere but are generally straightforward to implement (46).

Validation

Internal
A prediction model has good internal validity or reproducibility if it performs as expected on the population from which it was derived. In general, we expect that the performance of a prediction model will be overestimated if it is assessed on the same sample from which it was developed. That is, overfitting is a major concern. A method that assesses internal validation or one that avoids overfitting should be used. Split samples (training and test data) can be used to avoid overestimation of performance. Alternatively, methods such as bootstrapping and cross-validation can be used at the model development phase to avoid overfitting (47,48).
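A minimal sketch of internal validation by 10-fold cross-validation, with simulated data: the c statistic is estimated on held-out folds, which limits the optimism of in-sample assessment.

```python
# 10-fold cross-validated c statistic: each fold's AUC is computed on data
# not used to fit the model, guarding against overfitting-driven optimism.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(4)
X = rng.normal(size=(600, 5))
y = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))

aucs = cross_val_score(LogisticRegression(), X, y, cv=10, scoring="roc_auc")
print(f"cross-validated c statistic: {aucs.mean():.2f}")
```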

External
External validity or transportability is the ability to translate a model to different populations.

Figure 5. | Classification of patients in the event and nonevent groups in models with and without urine neutrophil gelatinase–associated lipocalin (NGAL). The counts in red are from patients who were reclassified in the wrong direction. The left panel shows the reclassification that occurs when urine NGAL is added to a model that includes only demographics. For events, 3+8+37=48 were reclassified in the right direction, whereas 16+16+5=37 were reclassified in the wrong direction. The net reclassification improvement (NRI) for events was (48/306) − (37/306) = 3.6%. The NRI for nonevents was (1153/2727) − (243/2727) = 33%. Thus, there was large improvement in risk prediction for nonevents when going from the demographics-only model to the demographics and NGAL model. In the right panel, urine NGAL is added to a model that includes all established predictors. Overall, there was little reclassification for both events and nonevents. The NRI for events is simply (6/306) − (8/306) = −0.65%. The NRI for nonevents is (94/2727) − (79/2727) = 0.55%.



We expect the performance of a model to degrade when it is evaluated in new populations (49–51). Poor transportability of a model can occur because of underfitting. This would occur, for example, when important predictors are either unknown or not included in the original model. Even if the associations between the predictors and outcome stay the same, there is still a possibility that the baseline risk may be different in the new populations. It is, therefore, important to check external validity (via the performance metrics described above) and possibly recalibrate the model when applying it to a new population. Suppose, for example, that our prediction model is a logistic regression, and we have adequately captured the relationship between predictors and the outcome. However, the baseline risk in the new population might be higher or lower. Baseline risk is represented by the intercept term. Therefore, a simple recalibration method is to re-estimate the intercept term. Similarly, for survival data, re-estimation of the baseline hazard function can be used for recalibration.
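A minimal sketch of the intercept re-estimation described above, using statsmodels and simulated data: the original model's linear predictor enters a new logistic regression as a fixed offset, so only the intercept is refit to the new population. All coefficient values are illustrative.

```python
# Recalibration of baseline risk: hold the original coefficients fixed
# (as an offset) and re-estimate only the intercept in the new population.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
n = 800
x = rng.normal(size=n)
old_lin_pred = 0.8 * x                        # original model, intercept aside
# new population: same slope, but a different baseline risk (intercept -1.5)
y_new = rng.binomial(1, 1 / (1 + np.exp(-(-1.5 + 0.8 * x))))

recal = sm.GLM(y_new, np.ones((n, 1)),        # intercept-only design matrix
               family=sm.families.Binomial(),
               offset=old_lin_pred).fit()
print("re-estimated intercept:", recal.params[0])
```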

Dynamic Prediction

In this paper, we have focused on situations where a prediction model is developed using available data and then potentially applied unchanged to the assessment of new patients. However, a growing area of research is dynamic prediction models, where models are frequently updated over time as new data become available (52,53). The major points of this article still hold for dynamic prediction, but these models have the potential to be more transportable in that they can rapidly adapt to new populations.

Reporting

After the prediction model has been developed, evaluated, and possibly validated, the next step is to report the findings. A set of recommended guidelines for the reporting of prediction modeling studies, the TRIPOD statement, was recently published and includes a 22-item checklist (54,55). For each item in the checklist, there are detailed examples and explanations. We highly recommend reviewing this document before publishing any results. The document places a strong emphasis on being very specific about the aims and design of the study, the definition of the outcome, the study population, the statistical methods used, and discussion of limitations.

An important part of reporting is a discussion of the clinical implications. Typically, adoption of the prediction model should not be recommended if it has not been validated. After the prediction model has been validated, it could be used or further studied in a variety of ways. For example, a model that accurately predicts CKD progression across a wide range of populations would be helpful to provide clinicians and patients with prognostic information. Such models could inform clinical decision making. For example, high-risk patients might need to be followed more intensively (e.g., clinician visits every 3 months rather than every 12 months) and have evidence-based interventions to slow CKD rigorously applied (e.g., BP <130/80 mmHg for those with proteinuria). Another example is the decision to have surgery for arteriovenous fistula creation if ESRD is imminent. If such models are used for decision making, the relative costs of false positives and false negatives need to be assessed. Along the same lines, whether knowing the risk score improves patient care would need to be studied, possibly in a randomized trial.

Concluding Remarks

Prediction modeling is likely to become increasingly important in CKD research and clinical practice, with richer data becoming available at a rapid rate. This paper described many of the key methods in the development, assessment, and application of prediction models. Other papers in this CKD methods series will focus on complex modeling of CKD data, such as longitudinal data, competing risks, time-dependent confounding, and recurrent event analysis.

Acknowledgments

Funding for the Chronic Renal Insufficiency Cohort Study was obtained under a cooperative agreement from the National Institute of Diabetes and Digestive and Kidney Diseases (grants U01DK060990, U01DK060984, U01DK061022, U01DK061021, U01DK061028, U01DK60980, U01DK060963, and U01DK060902). Additional funding was provided by grants K01DK092353 and U01DK85649 (CKD Biocon).

Disclosures

None.

References

1. Lennartz CS, Pickering JW, Seiler-Mußler S, Bauer L, Untersteller K, Emrich IE, Zawada AM, Radermacher J, Tangri N, Fliser D, Heine GH: External validation of the kidney failure risk equation and re-calibration with addition of ultrasound parameters. Clin J Am Soc Nephrol 11: 609–615, 2016
2. Liu KD, Yang W, Anderson AH, Feldman HI, Demirjian S, Hamano T, He J, Lash J, Lustigova E, Rosas SE, Simonson MS, Tao K, Hsu CY; Chronic Renal Insufficiency Cohort (CRIC) study investigators: Urine neutrophil gelatinase-associated lipocalin levels do not improve risk prediction of progressive chronic kidney disease. Kidney Int 83: 909–914, 2013
3. Solak Y, Yilmaz MI, Siriopol D, Saglam M, Unal HU, Yaman H, Gok M, Cetinkaya H, Gaipov A, Eyileten T, Sari S, Yildirim AO, Tonbul HZ, Turk S, Covic A, Kanbay M: Serum neutrophil gelatinase-associated lipocalin is associated with cardiovascular events in patients with chronic kidney disease. Int Urol Nephrol 47: 1993–2001, 2015
4. Deo R, Shou H, Soliman EZ, Yang W, Arkin JM, Zhang X, Townsend RR, Go AS, Shlipak MG, Feldman HI: Electrocardiographic measures and prediction of cardiovascular and noncardiovascular death in CKD. J Am Soc Nephrol 27: 559–569, 2016
5. Weiss JW, Platt RW, Thorp ML, Yang X, Smith DH, Petrik A, Eckstrom E, Morris C, O'Hare AM, Johnson ES: Predicting mortality in older adults with kidney disease: A pragmatic prediction model. J Am Geriatr Soc 63: 508–515, 2015
6. Bansal N, Katz R, De Boer IH, Peralta CA, Fried LF, Siscovick DS, Rifkin DE, Hirsch C, Cummings SR, Harris TB, Kritchevsky SB, Sarnak MJ, Shlipak MG, Ix JH: Development and validation of a model to predict 5-year risk of death without ESRD among older adults with CKD. Clin J Am Soc Nephrol 10: 363–371, 2015
7. Tangri N, Kitsios GD, Inker LA, Griffith J, Naimark DM, Walker S, Rigatto C, Uhlig K, Kent DM, Levey AS: Risk prediction models for patients with chronic kidney disease: A systematic review. Ann Intern Med 158: 596–603, 2013
8. Feldman HI, Appel LJ, Chertow GM, Cifelli D, Cizman B, Daugirdas J, Fink JC, Franklin-Becker ED, Go AS, Hamm LL, He J, Hostetter T, Hsu CY, Jamerson K, Joffe M, Kusek JW, Landis JR, Lash JP, Miller ER, Mohler ER 3rd, Muntner P, Ojo AO, Rahman M, Townsend RR, Wright JT; Chronic Renal Insufficiency Cohort (CRIC) Study Investigators: The Chronic Renal Insufficiency Cohort (CRIC) Study: Design and methods. J Am Soc Nephrol 14 [Suppl 2]: S148–S153, 2003
9. Lash JP, Go AS, Appel LJ, He J, Ojo A, Rahman M, Townsend RR, Xie D, Cifelli D, Cohan J, Fink JC, Fischer MJ, Gadegbeku C, Hamm LL, Kusek JW, Landis JR, Narva A, Robinson N, Teal V, Feldman HI; Chronic Renal Insufficiency Cohort (CRIC) Study Group: Chronic Renal Insufficiency Cohort (CRIC) Study: Baseline characteristics and associations with kidney function. Clin J Am Soc Nephrol 4: 1302–1311, 2009
10. Liu KD, Yang W, Go AS, Anderson AH, Feldman HI, Fischer MJ, He J, Kallem RR, Kusek JW, Master SR, Miller ER 3rd, Rosas SE, Steigerwalt S, Tao K, Weir MR, Hsu CY; CRIC Study Investigators: Urine neutrophil gelatinase-associated lipocalin and risk of cardiovascular disease and death in CKD: Results from the Chronic Renal Insufficiency Cohort (CRIC) Study. Am J Kidney Dis 65: 267–274, 2015
11. Levey AS, Coresh J, Greene T, Marsh J, Stevens LA, Kusek JW, Van Lente F; Chronic Kidney Disease Epidemiology Collaboration: Expressing the Modification of Diet in Renal Disease Study equation for estimating glomerular filtration rate with standardized serum creatinine values. Clin Chem 53: 766–772, 2007
12. Rubin DB: Multiple Imputation for Nonresponse in Surveys, New York, John Wiley & Sons, 2004
13. Carpenter J, Kenward M: Multiple Imputation and Its Application, 1st Ed., Chichester, United Kingdom, Wiley, 2013
14. James G, Witten D, Hastie T, Tibshirani R: An Introduction to Statistical Learning, Vol. 103, New York, Springer, 2013
15. Hastie T, Tibshirani R: Generalized additive models. Stat Sci 1: 297–310, 1986
16. Rasmussen CE: Gaussian Processes for Machine Learning, Cambridge, MIT Press, 2006
17. Breiman L, Friedman J, Stone CJ, Olshen RA: Classification and Regression Trees, Belmont, Wadsworth, CRC Press, 1984
18. Altman NS: An introduction to kernel and nearest-neighbor nonparametric regression. Am Stat 46: 175–185, 1992
19. Vapnik V: The support vector method of function estimation. In: Nonlinear Modeling, edited by Suykens JAK, Vandewalle J, Boston, Springer, 1998, pp 55–85
20. Hastie T, Tibshirani R, Friedman J: The Elements of Statistical Learning, New York, Springer, 2009
21. Schwarz G: Estimating the dimension of a model. Ann Stat 6: 461–464, 1978
22. Tibshirani R: Regression shrinkage and selection via the lasso. J R Stat Soc Series B Stat Methodol 58: 267–288, 1994
23. Bair E, Hastie T, Paul D, Tibshirani R: Prediction by supervised principal components. J Am Stat Assoc 101: 119–137, 2006
24. Pepe MS: The Statistical Evaluation of Medical Tests for Classification and Prediction, Oxford, Oxford University Press, 2004
25. Bamber D: The area above the ordinal dominance graph and the area below the receiver operating characteristic graph. J Math Psychol 12: 387–415, 1975
26. Hanley JA, McNeil BJ: The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 143: 29–36, 1982
27. Carpenter J, Bithell J: Bootstrap confidence intervals: When, which, what? A practical guide for medical statisticians. Stat Med 19: 1141–1164, 2000
28. Vickers AJ, Cronin AM, Begg CB: One statistical test is sufficient for assessing new predictive markers. BMC Med Res Methodol 11: 13, 2011
29. Harrell FE Jr., Lee KL, Mark DB: Multivariable prognostic models: Issues in developing models, evaluating assumptions and adequacy, and measuring and reducing errors. Stat Med 15: 361–387, 1996
30. Cook NR: Statistical evaluation of prognostic versus diagnostic models: Beyond the ROC curve. Clin Chem 54: 17–23, 2008
31. Pepe MS, Feng Z, Huang Y, Longton G, Prentice R, Thompson IM, Zheng Y: Integrating the predictiveness of a marker with its performance as a classifier. Am J Epidemiol 167: 362–368, 2008
32. Hosmer D, Lemeshow S, Sturdivant R: Applied Logistic Regression, 3rd Ed., New York, Wiley, 2013
33. Brier GW: Verification of forecasts expressed in terms of probability. Mon Wea Rev 78: 1–3, 1950
34. Gerds TA, Cai T, Schumacher M: The performance of risk prediction models. Biom J 50: 457–479, 2008
35. Pencina MJ, D'Agostino RB Sr., Steyerberg EW: Extensions of net reclassification improvement calculations to measure usefulness of new biomarkers. Stat Med 30: 11–21, 2011
36. Pencina MJ, D'Agostino RB Sr., D'Agostino RB Jr., Vasan RS: Evaluating the added predictive ability of a new marker: From area under the ROC curve to reclassification and beyond. Stat Med 27: 157–172, 2008
37. Kerr KF, Meisner A, Thiessen-Philbrook H, Coca SG, Parikh CR: Developing risk prediction models for kidney injury and assessing incremental value for novel biomarkers. Clin J Am Soc Nephrol 9: 1488–1496, 2014
38. Pepe MS, Fan J, Feng Z, Gerds T, Hilden J: The net reclassification index (NRI): A misleading measure of prediction improvement even with independent test data sets. Stat Biosci 7: 282–295, 2015
39. Hilden J, Gerds TA: A note on the evaluation of novel biomarkers: Do not rely on integrated discrimination improvement and net reclassification index. Stat Med 33: 3405–3414, 2014
40. Paynter NP, Cook NR: A bias-corrected net reclassification improvement for clinical subgroups. Med Decis Making 33: 154–162, 2013
41. Hunink MM, Glasziou PP, Siegel J, Weeks JC, Pliskin JS, Elstein AS, Weinstein MC: Decision Making in Health and Medicine: Integrating Evidence and Values, Cambridge, Cambridge University Press, 2001
42. Vickers AJ, Elkin EB: Decision curve analysis: A novel method for evaluating prediction models. Med Decis Making 26: 565–574, 2006
43. Heagerty PJ, Zheng Y: Survival model predictive accuracy and ROC curves. Biometrics 61: 92–105, 2005
44. Uno H, Cai T, Pencina MJ, D'Agostino RB, Wei LJ: On the C-statistics for evaluating overall adequacy of risk prediction procedures with censored survival data. Stat Med 30: 1105–1117, 2011
45. Pencina MJ, D'Agostino RB: Overall C as a measure of discrimination in survival analysis: Model specific population value and confidence interval estimation. Stat Med 23: 2109–2123, 2004
46. D'Agostino RB, Nam B-H: Evaluation of the performance of survival analysis models: Discrimination and calibration measures. In: Handbook of Statistics, edited by Balakrishnan N, Rao CR, Amsterdam, Elsevier, 2004, pp 1–25
47. Borra S, Di Ciaccio A: Measuring the prediction error. A comparison of cross-validation, bootstrap and covariance penalty methods. Comput Stat Data Anal 54: 2976–2989, 2010
48. Steyerberg EW, Harrell FE Jr., Borsboom GJ, Eijkemans MJ, Vergouwe Y, Habbema JD: Internal validation of predictive models: Efficiency of some procedures for logistic regression analysis. J Clin Epidemiol 54: 774–781, 2001
49. König IR, Malley JD, Weimar C, Diener H-C, Ziegler A; German Stroke Study Collaboration: Practical experiences on the necessity of external validation. Stat Med 26: 5499–5511, 2007
50. Altman DG, Vergouwe Y, Royston P, Moons KGM: Prognosis and prognostic research: Validating a prognostic model. BMJ 338: b605, 2009
51. Moons KGM, Kengne AP, Grobbee DE, Royston P, Vergouwe Y, Altman DG, Woodward M: Risk prediction models: II. External validation, model updating, and impact assessment. Heart 98: 691–698, 2012
52. Finkelman BS, French B, Kimmel SE: The prediction accuracy of dynamic mixed-effects models in clustered data. BioData Min 9: 5, 2016
53. McCormick TH, Raftery AE, Madigan D, Burd RS: Dynamic logistic regression and dynamic model averaging for binary classification. Biometrics 68: 23–30, 2012
54. Collins GS, Reitsma JB, Altman DG, Moons KGM: Transparent Reporting of a multivariable prediction model for Individual Prognosis or Diagnosis (TRIPOD): The TRIPOD statement. Ann Intern Med 162: 55–63, 2015
55. Moons KGM, Altman DG, Reitsma JB, Ioannidis JPA, Macaskill P, Steyerberg EW, Vickers AJ, Ransohoff DF, Collins GS: Transparent Reporting of a multivariable prediction model for Individual Prognosis or Diagnosis (TRIPOD): Explanation and elaboration. Ann Intern Med 162: W1–W73, 2015

Published online ahead of print. Publication date available at www.cjasn.org.
