BioSHaRE: Analysis of mixed effects models using federated data analysis approach - Edwin van den...

INDIVIDUAL PARTICIPANT DATA ANALYSIS: A FEDERATED APPROACH

EDWIN VAN DEN HEUVEL
SACHA LA BASTIDE – VAN GEMERT

CONTENT

Introduction
Federated Data Analysis
  Linear Regression
  Mixed models
EM-Algorithm
Validation results
  Test data
  BioSHARE data
Concluding remarks

INTRODUCTION
Meta-Analysis

Combining data from different sources most likely started with Carl Friedrich Gauss (1777-1855).
  He used data from astronomers to calculate planet orbits
  He developed least squares and the classical reliability theory: the true parameter is observed with noise


Combining data was a true problem at the beginning of the 20th century
  Potency estimates from bioassays showed tremendous heterogeneity
  Least squares was unsatisfactory
  The landmark paper of Cochran in 1954 discussed various weighted means
  This field implicitly used a random effects model

[Figure: bioassay response against concentration for a reference and an unknown preparation]


Gene Glass introduced the term (aggregate data) meta-analysis in 1976 as the analysis of analyses
  This paper did not refer to the bioassay field at all
  It pools estimates from published papers

A meta-analysis assumes the existence of
  The estimate $b_i$ of the association at study $i$
  A standard error $s_i$ of the estimate $b_i$
  The number of degrees of freedom $d_i$ for standard error $s_i$

Different statistical approaches are available to pool the estimates


Fixed effects meta-analysis model
  $b_i = \beta + e_i$, $e_i \sim N(0, \sigma_i^2)$
  The standard error $s_i$ is an estimate of $\sigma_i$

Random effects meta-analysis model
  $b_i = \beta + U_i + e_i$, $e_i \sim N(0, \sigma_i^2)$, $U_i \sim N(0, \tau^2)$
  The standard error $s_i$ is an estimate of $\sigma_i$
  $\tau^2$ represents heterogeneity in the estimates

If heterogeneity is present ($\tau \neq 0$), the fixed effects analysis underestimates the standard error of the pooled estimate
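The two pooling models can be illustrated with a minimal Python sketch. The slides do not say which estimator of the between-study variance is used; the sketch assumes the DerSimonian-Laird moment estimator, one common choice. The function names and the study estimates are hypothetical and serve only as illustration.

```python
import math

def pool_fixed(b, s):
    """Inverse-variance weighted mean of the estimates b and its standard error."""
    w = [1.0 / se ** 2 for se in s]
    est = sum(wi * bi for wi, bi in zip(w, b)) / sum(w)
    return est, math.sqrt(1.0 / sum(w))

def tau2_dersimonian_laird(b, s):
    """Moment estimate of the between-study variance tau^2, truncated at 0.

    Assumption: the slides do not name an estimator; this is one common choice.
    """
    w = [1.0 / se ** 2 for se in s]
    est, _ = pool_fixed(b, s)
    q = sum(wi * (bi - est) ** 2 for wi, bi in zip(w, b))
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    return max(0.0, (q - (len(b) - 1)) / c)

def pool_random(b, s):
    """Random effects pooling: weights 1 / (s_i^2 + tau^2)."""
    tau2 = tau2_dersimonian_laird(b, s)
    w = [1.0 / (se ** 2 + tau2) for se in s]
    est = sum(wi * bi for wi, bi in zip(w, b)) / sum(w)
    return est, math.sqrt(1.0 / sum(w)), tau2

# Hypothetical study estimates b_i and standard errors s_i (illustration only).
b = [0.10, 0.35, 0.60, 0.20]
s = [0.05, 0.08, 0.10, 0.06]
fixed_est, fixed_se = pool_fixed(b, s)
random_est, random_se, tau2 = pool_random(b, s)
print("fixed: ", fixed_est, fixed_se)
print("random:", random_est, random_se, "tau^2:", tau2)
```

Whenever the estimated $\tau^2$ is positive, the random effects weights are smaller than the fixed effects weights, so the random effects standard error is the larger of the two.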


Not all researchers are in favor of meta-analysis
Van Houwelingen (1997) wrote:

“…popular practice of analysing summary measures from selected publications is a poor man’s solution. … I hope that we will have full multi-center multi-study databases that can be analysed by appropriate random effects models considering both random variation within and between studies and/or centres.”

Thus there is a strong need for individual participant data (IPD) analysis


IPD meta-analysis can be performed in two ways:
  One-stage analysis: all individual data are analyzed simultaneously (possibly with sophisticated statistical models)
  Two-stage or coordinated meta-analysis: each study is analyzed separately and the model parameters are pooled with the original meta-analysis tools

Two-stage analysis seems easier to implement, since it does not require that the data are pooled at one location
One-stage IPD meta-analysis that does not pool data at one location is called federated data analysis

FEDERATED DATA ANALYSIS
Linear Regression

Consider the following setting
  $Y_{ij}$ is the response of subject $j$ in study $i$
  $X_{ij}$ is the exposure of subject $j$ in study $i$
  $Z_{ij}$ is a confounder of subject $j$ in study $i$

The simplest linear regression model is
  M1: $Y_{ij} = \beta_0 + \beta_Z Z_{ij} + \beta_X X_{ij} + e_{ij}$

Model M1 assumes:
  The populations are homogeneous (common intercept)
  Associations are homogeneous
  Residual variances are homogeneous
  The ratio of sample size to population size is homogeneous


Federated data analysis for estimation of $\beta_0$, $\beta_Z$, $\beta_X$, and $\sigma^2$ requires the following summary statistics from each study:
  Number of observations
  Sum of the confounders
  Sum of the exposures
  Sum of the squared confounders
  Sum of the squared exposures
  Sum of the confounder-exposure products
  Sum of the responses
  Sum of the response-confounder products
  Sum of the response-exposure products
  Sum of the squared responses (for the SE)
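How these ten summary statistics suffice for model M1 can be sketched in Python. Each study computes its sums locally, the sums are added over studies, and the normal equations are solved centrally. This is a minimal sketch, not the authors' program: the function names, the records, and the divisor $n - 3$ for the residual variance are assumptions made for illustration.

```python
def study_summaries(y, z, x):
    """The ten summary statistics a study shares; no individual record leaves the site."""
    return {
        "n": len(y),
        "sum_z": sum(z), "sum_x": sum(x),
        "sum_zz": sum(v * v for v in z), "sum_xx": sum(v * v for v in x),
        "sum_zx": sum(a * b for a, b in zip(z, x)),
        "sum_y": sum(y),
        "sum_yz": sum(a * b for a, b in zip(y, z)),
        "sum_yx": sum(a * b for a, b in zip(y, x)),
        "sum_yy": sum(v * v for v in y),
    }

def solve(a, b):
    """Solve the small linear system a @ beta = b (Gauss-Jordan, partial pivoting)."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [vr - f * vc for vr, vc in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def fit_m1(summaries):
    """Add the summaries over studies and solve the normal equations of M1."""
    t = {key: sum(s[key] for s in summaries) for key in summaries[0]}
    xtx = [[t["n"], t["sum_z"], t["sum_x"]],
           [t["sum_z"], t["sum_zz"], t["sum_zx"]],
           [t["sum_x"], t["sum_zx"], t["sum_xx"]]]
    xty = [t["sum_y"], t["sum_yz"], t["sum_yx"]]
    beta = solve(xtx, xty)  # [beta_0, beta_Z, beta_X]
    # Residual sum of squares y'y - beta'X'y, valid at the least-squares solution.
    rss = t["sum_yy"] - sum(bk * vk for bk, vk in zip(beta, xty))
    sigma2 = rss / (t["n"] - 3)  # assumption: unbiased divisor n - 3
    return beta, sigma2

# Hypothetical records held at two sites (illustration only): y, z, x.
site_a = study_summaries([5.9, 6.3, 5.5, 6.1], [60, 65, 70, 75], [0, 1, 0, 1])
site_b = study_summaries([6.4, 5.6, 6.2, 5.3], [58, 66, 72, 80], [1, 0, 1, 0])
beta, sigma2 = fit_m1([site_a, site_b])
print(beta, sigma2)
```

Because every statistic is a plain sum, adding the per-study values gives the same normal equations as stacking all records at one location, up to rounding.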


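Heterogeneous populations
  M2: $Y_{ij} = \beta_{0,i} + \beta_Z Z_{ij} + \beta_X X_{ij} + e_{ij}$

Heterogeneous associations
  M3a: $Y_{ij} = \beta_{0,i} + \beta_{Z,i} Z_{ij} + \beta_X X_{ij} + e_{ij}$
  M3b: $Y_{ij} = \beta_{0,i} + \beta_Z Z_{ij} + \beta_{X,i} X_{ij} + e_{ij}$
  M3c: $Y_{ij} = \beta_{0,i} + \beta_{Z,i} Z_{ij} + \beta_{X,i} X_{ij} + e_{ij}$

Heterogeneous residual variances
  M4: $Y_{ij} = \beta_{0,i} + \beta_{Z,i} Z_{ij} + \beta_{X,i} X_{ij} + e_{ij,i}$
  The standard deviation $\sigma_i$ of $e_{ij,i}$ depends on study $i$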


Models M2 and M3a
  Have a homogeneous association for the exposure
  Require a federated data analysis
  Involve the same summary statistics as the federated data analysis of model M1

Models M3b, M3c, and M4
  Can be estimated with the same summary statistics used in the federated data analysis
  Require an aggregate data meta-analysis to pool the estimates $b_{X,i}$ from the different studies


Simulation studies show that an aggregate data meta-analysis for model M3a produces strong heterogeneity in the estimates $b_{X,i}$ even though the association is homogeneous
Treating the regression parameters in models M2, M3a, M3b, M3c, and M4 as fixed effects will underestimate the pooled association $\beta_X$

Thus models M2, M3a, M3b, M3c, and M4 need to assume that the parameters are random: mixed effects models, like the random effects meta-analysis

FEDERATED DATA ANALYSIS
Mixed Effects

Model M2 becomes
  M2: $Y_{ij} = \beta_0 + \beta_Z Z_{ij} + \beta_X X_{ij} + U_i + e_{ij}$

  The associations are still assumed homogeneous
  The residual variance is homogeneous
  The intercept is heterogeneous, $U_i \sim N(0, \tau^2)$: random intercept model

Federated data analysis for mixed models is less straightforward: the random term complicates the method
The Expectation-Maximization (EM) algorithm can be used to estimate the model parameters in a federated approach

EM-ALGORITHM
Mixed Effects Models

Step 0: Choose starting values $\beta_0^{(0)}$, $\beta_Z^{(0)}$, $\beta_X^{(0)}$, $\tau^{(0)}$, and $\sigma^{(0)}$ for $\beta_0$, $\beta_Z$, $\beta_X$, $\tau$, and $\sigma$

Step 1:
  E-step: using the estimates from the previous step, estimate $U_i$
  M-step: using the result from the E-step, determine $\beta_0^{(1)}$, $\beta_Z^{(1)}$, $\beta_X^{(1)}$, and the updated variance parameters

Evaluate how much the estimates have changed
  If the changes are small enough → convergence
  If the changes are still too large → repeat step 1 using the last available estimates

EM uses the same summary statistics
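The steps above can be sketched in Python for the random intercept model, using only per-study sums. This is a minimal sketch under stated assumptions, not the authors' R program. The update formulas are the standard EM updates for a random intercept with normal errors. The names `summarize` and `em_random_intercept` and the data are hypothetical. Unlike the slides, which start every parameter at 1, the sketch starts the regression coefficients at the pooled least-squares solution to keep the number of iterations small.

```python
def solve(a, b):
    """Solve the small linear system a @ beta = b (Gauss-Jordan, partial pivoting)."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [vr - f * vc for vr, vc in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def summarize(y, z, x):
    """Per-study summaries for the design (1, z, x): n, X'X, X'y and y'y."""
    rows = [(1.0, zi, xi) for zi, xi in zip(z, x)]
    xtx = [[sum(r[a] * r[b] for r in rows) for b in range(3)] for a in range(3)]
    xty = [sum(r[a] * yi for r, yi in zip(rows, y)) for a in range(3)]
    return {"n": len(y), "xtx": xtx, "xty": xty, "yty": sum(v * v for v in y)}

def em_random_intercept(studies, tau2=1.0, sigma2=1.0, beta=None, tol=1e-8, max_iter=100000):
    """EM for Y = X beta + U_i + e from study summaries; returns (beta, tau2, sigma2, iterations)."""
    p, k = len(studies[0]["xty"]), len(studies)
    n_total = sum(s["n"] for s in studies)
    xtx = [[sum(s["xtx"][r][c] for s in studies) for c in range(p)] for r in range(p)]
    if beta is None:  # assumption: least-squares start instead of all ones
        beta = solve(xtx, [sum(s["xty"][j] for s in studies) for j in range(p)])

    def resid_sum(s, coef):
        # Sum over subjects of (y - x'coef); row 0 of X'X holds the column sums of X.
        return s["xty"][0] - sum(s["xtx"][0][j] * coef[j] for j in range(p))

    for it in range(1, max_iter + 1):
        # E-step: conditional mean u_i and variance v_i of U_i given the study data.
        u = [tau2 * resid_sum(s, beta) / (sigma2 + s["n"] * tau2) for s in studies]
        v = [tau2 * sigma2 / (sigma2 + s["n"] * tau2) for s in studies]
        # M-step: regression coefficients first, then the two variance components.
        rhs = [sum(s["xty"][j] - ui * s["xtx"][0][j] for s, ui in zip(studies, u)) for j in range(p)]
        new_beta = solve(xtx, rhs)
        new_tau2 = sum(ui * ui + vi for ui, vi in zip(u, v)) / k
        rss = 0.0
        for s, ui, vi in zip(studies, u, v):
            b_xty = sum(new_beta[j] * s["xty"][j] for j in range(p))
            b_xtx_b = sum(new_beta[r] * s["xtx"][r][c] * new_beta[c] for r in range(p) for c in range(p))
            rss += s["yty"] - 2 * b_xty + b_xtx_b - 2 * ui * resid_sum(s, new_beta) + s["n"] * (ui * ui + vi)
        new_sigma2 = rss / n_total
        change = max([abs(a - b) for a, b in zip(new_beta, beta)]
                     + [abs(new_tau2 - tau2), abs(new_sigma2 - sigma2)])
        beta, tau2, sigma2 = new_beta, new_tau2, new_sigma2
        if change < tol:
            break
    return beta, tau2, sigma2, it

# Hypothetical data for three studies (illustration only): y, z, x per study.
studies = [
    summarize([4.1, 4.0, 3.7, 3.8], [60, 65, 70, 75], [0, 1, 0, 1]),
    summarize([6.24, 5.88, 6.16, 5.2], [58, 66, 72, 80], [1, 0, 1, 0]),
    summarize([7.96, 7.92, 8.22, 7.34], [62, 64, 74, 78], [0, 1, 1, 0]),
]
fit_pos = em_random_intercept(studies, tau2=1.0)   # positive starting value
fit_zero = em_random_intercept(studies, tau2=0.0)  # tau^2 can never leave 0
print(fit_pos)
print(fit_zero)
```

With a starting value of 0 for $\tau^2$, every E-step returns $U_i = 0$, so the algorithm stays at $\tau^2 = 0$ and returns the least-squares coefficients of model M1. A positive starting value lets the intercept variance be estimated.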


VALIDATION RESULTS
TEST DATA: MULTICENTER TRIAL

Data from a multicenter trial were used
  Two responses:
    Hemoglobin in blood (g/dl)
    Blood loss during surgery (mL)
  Exposure: Treatment (control; new)
  Covariate: Age (years)
  Three centers (1, 2, 3); different centers were selected for the two responses

The EM-algorithm was used with the summary statistics needed to estimate M1
A random intercept model with maximum likelihood was applied to the full data set


Description of the validation data

Hemoglobin
                 Center 1 (n=200)   Center 2 (n=20)   Center 3 (n=30)   P-value
Hb (Std)         6.50 (0.890)       6.81 (1.065)      6.80 (0.859)      0.179
Age (Std)        66.4 (9.78)        67.3 (9.97)       66.1 (8.21)       0.864
Treatment (%)    100 (50)           8 (40)            15 (50)           0.692

Blood loss
                 Center 1 (n=200)   Center 2 (n=48)   Center 3 (n=39)   P-value
BL (Std)         641 (701)          763 (527)         748 (428)         <0.001
Age (Std)        66.4 (9.78)        64.6 (9.64)       61.7 (9.73)       0.007
Treatment (%)    100 (50)           21 (44)           21 (54)           0.622


Hemoglobin

EM-nr indicates the number of iterations used in EM
The convergence criterion for all parameters was set at $10^{-8}$

A start value of 0 for $\tau$ leads to incorrect results
Convergence is relatively fast and close to the truth for positive starting values

             $\beta_0$   $\beta_Z$   $\beta_X$   $\tau^2$   $\sigma^2$
EM-0         1           1           1           1          1
EM-190206    7.6086      -0.01548    0.04021     0.006659   0.7887
EM-0         1           1           1           0          1
EM-3         7.5683      -0.01556    0.03838     0          0.7933
SAS          7.6086      -0.01548    0.04021     0.006657   0.7887


Blood loss

EM-nr indicates the number of iterations used in EM
The convergence criterion for all parameters was set at $10^{-8}$

A start value of 1 for $\tau$ leads to incorrect results
Convergence is very fast and close to the truth when $\tau = 0$ is used as starting value

             $\beta_0$   $\beta_Z$   $\beta_X$   $\tau^2$   $\sigma^2$
EM-0         1           1           1           1          1
EM-1139723   608.23      1.7755      -97.9651    0.01140    411547
EM-0         1           1           1           0          1
EM-3         608.23      1.7755      -97.9651    0          411547
SAS          608.23      1.7755      -97.9651    0          411547


Both sets of starting values are needed to make appropriate inference
  When the two sets provide identical estimates of the fixed parameters, the set with $\tau = 0$ provides the answer
  When the two sets provide different estimates of the fixed parameters, the set with $\tau \neq 0$ provides the answer

The standard errors can also be determined
  This has not yet been incorporated in the R program
  Manual calculations demonstrate that the results coincide with the SAS output when the appropriate estimates are taken into account

VALIDATION RESULTS
BIOSHARE DATA

Response: Systolic blood pressure
Exposure: Noise
Confounders:
  Age
  Sex
  PM10
Two cohorts: HUNT and LifeLines
EM-algorithm applied to fit a random intercept
Comparison with model M1
  Using the standard algorithm
  Using DataSHIELD glm

Systolic blood pressure

The EM-algorithm seems to demonstrate heterogeneity in the intercepts
The analyses of model M1 and EM are identical when the starting value of $\tau$ is 0 → they both use the same summary statistics
DataSHIELD glm seems to deviate somewhat, but this did not happen on the test data

             $\beta_0$   AGE      SEX       PM10       NOISE      $\tau^2$   $\sigma^2$
EM-0         1           1        1         1          1          1          1
EM-Final     111.59      0.4141   -7.255    0.04627    -0.01351   1.8992     217.43
EM-0         1           1        1         1          1          0          1
EM-Final     114.68      0.4143   -7.2473   -0.16617   -0.00300   0          217.45
Model M1     114.68      0.4143   -7.2473   -0.16617   -0.00300   NA         217.45
DataSHIELD   114.95      0.4149   -7.2449   -0.18129   -0.00230   NA         217.41

CONCLUDING REMARKS

Follow-up steps for BioSHARE (in August):
1. Complete the existing algorithm for linear random intercept models, including standard errors
2. Implement this algorithm in DataSHIELD
3. Finalize the statistics paper on algorithms for federated data analysis for mixed models

Extensions for DataSHIELD after BioSHARE:
1. To handle missing data sets as well
2. To linear random coefficient models
3. To generalized random coefficient models

Acknowledgement

The research leading to these results has received funding from the European Union Seventh Framework Programme (FP7/2007-2013) under grant agreement n° 261433 (Biobank Standardisation and Harmonisation for Research Excellence in the European Union - BioSHaRE-EU)