Lexis diagrams and analysis of register data
Practical use of Lexis diagrams in the analysis and routine reporting from population registers
Bendix Carstensen
Steno Diabetes Center & Department of Biostatistics, University of Copenhagen
www.biostat.ku.dk/~bxc
Statistics for Health Registers and Linked Databases, Milton Keynes, UK, May 2009

Outline
Health registers
Lexis diagrams
Tabulation of follow-up
Models and likelihood
Diabetes incidence
Diabetes mortality
Cox modelling
Advantage of parametric hazards
Summary

Health events
- Diagnoses (hospitals, clinics)
- Procedures (treatments, measurements)
- Purchases (prescriptions)
Linkage by person ID ⇒ (partial) health history of persons.

Population registers
- All persons in the population included.
- Universal linkage of persons.
- The Nordic countries (DK, FI, IS, NO, SE):
  - Person ID for all citizens
  - Used for health care purposes
  - Used for all taxation and social records as well
  It is therefore possible to compile not only health histories, but also health histories linked to social, economic and educational status.
- Censuses replaced by register tabulations.

Mortality rates from registers
- Entry time (e.g. date of diagnosis).
- Exit time (e.g. date of death).
Implicit here is the current state: Alive with a diagnosis.

Incidence rates from registers
- Entry time (e.g. date of birth).
- Exit time (e.g. date of diagnosis).
Implicit here is the current state: Alive without a diagnosis.
Usually not compiled directly, but based on:
- Cases from a register
- Follow-up derived from censuses
Note: the follow-up is derived from a register of all at risk; in this case the entire population.

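General health history
- Current state
- Entry point in state
- Exit point from state
- Next state
Multistate models:
[Figure: multistate model with states Well, DM and Dead, and transition rates $\lambda$ (Well → DM), $\mu_W$ (Well → Dead) and $\mu_D$ (DM → Dead).]
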
Wilhelm Lexis
Wilhelm Lexis (1837–1914), German statistician and economist.

Lexis diagram
- Shows the follow-up of a person from entry to exit as a function of date and age.
- In general: follow-up shown on two timescales.
[Figure: Lexis diagram for a 1‰ random sample of the Danish National Diabetes Register; x-axis: Date (1994–2006), y-axis: Age (0–100).]

[Figure: Lexis diagram; x-axis: Date (1990–2010), y-axis: Age (45–65).]

Tabulation by age, period and cohort
Two extra complications:
1. Population risk time must be split in triangles. This can be done from population figures, which are available for entire nations in the Human Mortality Database.
2. Midpoints of age intervals are no longer the average age in the classes; the same holds for period and cohort. Midpoints should be offset by 1/3 to give the average age at follow-up.
Both complications are treated in “Age-Period-Cohort models for the Lexis diagram”, Statistics in Medicine, 2007 [1, 2, 3].

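The remark about averages in triangles can be checked numerically. The sketch below is not from the slides: it assumes follow-up is spread uniformly over a 1-year by 1-year Lexis square, and the names `grid`, `lower` and `upper` are chosen for illustration. It suggests that the average age in a triangle sits about 1/3 or 2/3 of the way into the age class rather than at the midpoint 1/2.

```r
# Midpoint grid over a unit Lexis square (uniform follow-up is an assumption):
# u = time since the start of the period, v = age since the start of the age class.
n    <- 300
mids <- (seq_len(n) - 0.5) / n
grid <- expand.grid(u = mids, v = mids)

# The diagonal v = u (a life line) splits the square into two triangles.
lower <- grid$v < grid$u   # persons from the later birth cohort
upper <- grid$v > grid$u   # persons from the earlier birth cohort

mean.lower <- c(age = mean(grid$v[lower]), date = mean(grid$u[lower]))
mean.upper <- c(age = mean(grid$v[upper]), date = mean(grid$u[upper]))

# Averages relative to the lower-left corner of the square
round(rbind(lower = mean.lower, upper = mean.upper), 3)
```

The same argument applies to the period axis, and hence to cohort (date minus age), which is why all three sets of midpoints need adjusting when tabulating by triangles.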
Construction of rates
- Rate = Events / Risk time = D/Y
- Mortality = red blobs / length of lines
- Incidence rates = green blobs / population
What subsets of the Lexis diagram should this be done for?
This is essentially asking: “What timescales are we interested in?”

Data manipulations
- From: individual records (entry, exit, status)
- To: tables of (events, risk time) = (D, Y) by timescales (age, date, duration, ...)
Each individual contributes to many cells of the table, so it is not a tabulation of the individuals; it is a tabulation of the follow-up:
- Split follow-up in small pieces; each one with (d, y) recorded.
- Tabulate (d, y) by timescales (and other covariates of interest, such as sex and date of birth).

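The two steps (split, then tabulate) can be sketched in base R. This is an illustration only, not the `%Lexis` macro or the Epi functions: the data frame `persons`, the helper `split_one` and all values are made up, age is the only timescale, and the breaks are assumed to be sorted and to cover all follow-up.

```r
# Hypothetical records: age at entry, age at exit, status at exit
# (1 = event, 0 = censored). All values are invented for illustration.
persons <- data.frame(id     = 1:3,
                      entry  = c(52.3, 47.0, 58.6),
                      exit   = c(57.8, 49.5, 61.2),
                      status = c(1, 0, 1))

# Split one person's follow-up at the age breaks: one row per piece, with
# risk time y and event indicator d (the event belongs to the last piece).
split_one <- function(id, entry, exit, status, breaks) {
  inner <- breaks[breaks > entry & breaks < exit]
  cuts  <- c(entry, inner, exit)
  left  <- head(cuts, -1)
  right <- tail(cuts, -1)
  d     <- rep(0, length(left))
  d[length(d)] <- status
  data.frame(id  = id,
             age = breaks[findInterval(left, breaks)],  # lower limit of age class
             y   = right - left,
             d   = d)
}

pieces <- do.call(rbind, lapply(seq_len(nrow(persons)), function(i)
  with(persons[i, ], split_one(id, entry, exit, status, breaks = 45:65))))

# Tabulate the follow-up (not the persons): events and risk time by age class
tab <- aggregate(cbind(d, y) ~ age, data = pieces, FUN = sum)
tab$rate <- tab$d / tab$y
tab
```

Splitting on a second timescale (date or duration) would repeat the same operation on each piece, which is the bookkeeping that dedicated tools automate.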
Keeping track
Functions for splitting time, going from one record per person to one record per period of follow-up, while keeping track of timescales, risk time and events:
- SAS: Macro %Lexis, available from www.biostat.ku.dk/~bxc/Lexis.
- Stata: Commands stset and stsplit.
- R: Functions Lexis, splitLexis and cutLexis, available in the Epi package.

Keeping track in practice
The Epi package has a Lexis machinery (designed by Martyn Plummer, IARC, Lyon).
- Keeps track of multiple states and multiple timescales
- Provides tools for summarizing and tabulating follow-up
- Lexis diagrams shown here are made by plot.Lexis
- An overview of the Lexis machinery is included in the package as a PDF document.

Tabulation
Split records (i.e. (d, y)) are then tabulated by:
1. Fixed covariates (sex, genotype, date of birth, ...)
2. Timescales (age, calendar time, duration, ...)
Rates can now be computed by any of the variables in the tabulation.
Analysis proceeds as if observations were independent Poisson observations.

Analysis of rates
Rate = Events / Risk time = D/Y
This is based on the log-likelihood for observation of $D$ events during $Y$ risk time with a constant rate $\lambda$:
$$\ell(\lambda) = D\log(\lambda) - \lambda Y$$
Apart from a term $D\log(Y)$, this is the log-likelihood for a Poisson observation $D$ with mean $\mu = \lambda Y$, so that $\log(\mu) = \log(\lambda) + \log(Y)$.
The empirical rate is the ML-estimator in the constant rate model.

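Setting the derivative of the log-likelihood to zero, $D/\lambda - Y = 0$, gives $\hat\lambda = D/Y$. A small numerical sketch of this, with invented counts (the data frame `cells` and its values are assumptions for illustration), compares the empirical rate with a direct maximisation and with a Poisson model that uses $\log(Y)$ as offset:

```r
# Hypothetical tabulated follow-up: events D and person-years Y in two cells.
cells <- data.frame(D = c(10, 17), Y = c(600, 940))
D <- sum(cells$D)
Y <- sum(cells$Y)

# Log-likelihood for a constant rate lambda
loglik <- function(lambda) D * log(lambda) - lambda * Y

# Direct numerical maximisation
rate.opt <- optimize(loglik, interval = c(1e-6, 1), maximum = TRUE,
                     tol = 1e-10)$maximum

# Intercept-only Poisson model with log risk time as offset
fit      <- glm(D ~ 1 + offset(log(Y)), family = poisson, data = cells)
rate.glm <- unname(exp(coef(fit)))

c(empirical = D / Y, optimize = rate.opt, glm = rate.glm)
```

The offset is what turns a model for counts into a model for rates: the linear predictor then describes $\log(\lambda)$ rather than $\log(\mu)$.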
Likelihood for one person
The likelihood from several intervals from one individual is a product of conditional probabilities:
$$
\begin{aligned}
P\{\text{event at } t_4 \mid \text{alive at } t_0\} = {} & P\{\text{event at } t_4 \mid \text{alive at } t_4\} \\
& \times P\{\text{survive } (t_3, t_4) \mid \text{alive at } t_3\} \\
& \times P\{\text{survive } (t_2, t_3) \mid \text{alive at } t_2\} \\
& \times P\{\text{survive } (t_1, t_2) \mid \text{alive at } t_1\} \\
& \times P\{\text{survive } (t_0, t_1) \mid \text{alive at } t_0\}
\end{aligned}
$$
This can computationally be treated as the likelihood of 4 independent Poisson observations, (1, 0, 0, 0), with possibly different means.

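A minimal numerical sketch of this equivalence, assuming a constant rate within each of the four intervals (the time points and rates are invented). The intervals are listed in chronological order here, so the event indicator is (0, 0, 0, 1) rather than the (1, 0, 0, 0) of the product above.

```r
# Hypothetical interval end points t0..t4 and a constant rate per interval
t      <- c(0, 1, 2, 3, 4)
lambda <- c(0.02, 0.03, 0.05, 0.08)
y      <- diff(t)          # risk time in each interval
d      <- c(0, 0, 0, 1)    # event at the end of the last interval

# Log of the product of conditional probabilities: survive each interval,
# then the event occurs with intensity lambda[4] at t4.
surv      <- exp(-lambda * y)
direct.ll <- log(lambda[4]) + sum(log(surv))

# Sum of Poisson-type log-likelihood terms d*log(lambda) - lambda*y
poisson.ll <- sum(d * log(lambda) - lambda * y)

all.equal(direct.ll, poisson.ll)
```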
Likelihood for varying rates
- If we assume rates are constant in each interval, the log-likelihood from one individual is a sum of Poisson terms.
- Each term refers to one interval of follow-up from the same person, so the terms are not independent, but the likelihood is still a product.
- The purpose of splitting the follow-up is to allow the rates to vary within the follow-up of each person.
- Intervals should be so small that rates can be assumed constant within each. (5-year age intervals are usually not.)

Analysis of split records
- Splitting the records allows rates to vary across follow-up.
- The split records are analysed as independent Poisson observations.
- Tabulation makes the analysis technically more convenient, but is formally superfluous.
- NOTE: A separate parameter for each tabulation interval is not necessary.
- Use interval midpoints as a continuous covariate, and model the effect by splines, fractional polynomials, ...

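As a sketch of the last point, the example below fits a natural spline in the interval midpoints to simulated tabulated data. The table `tab`, the assumed true rate and the choice of 4 degrees of freedom are all illustrative assumptions, not taken from the register analysis.

```r
library(splines)

# Simulated table: 1-year age classes with midpoints 40.5, ..., 89.5,
# 1000 person-years in each, and a made-up log-linear true rate.
set.seed(1)
tab   <- data.frame(mid = seq(40.5, 89.5, by = 1), Y = 1000)
tab$D <- rpois(nrow(tab), tab$Y * exp(-9 + 0.09 * tab$mid))

# Five parameters (intercept + 4 spline terms) instead of one per age class
fit <- glm(D ~ ns(mid, df = 4) + offset(log(Y)),
           family = poisson, data = tab)

# Smoothed rates per 1000 person-years at any age, not only at midpoints
new <- data.frame(mid = c(50, 70), Y = 1000)
rate.per.1000 <- predict(fit, newdata = new, type = "response")
rate.per.1000
```

Because the covariate is quantitative, the fitted curve can be evaluated at arbitrary ages, which is what makes it possible to report rates as smooth curves.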
Register data analysis
- The nature of the data (individual records of event dates) allows an arbitrarily fine split of follow-up.
- The amount of data poses technical problems that can be solved by tabulation.
- Analysis should report smoothed versions of rates, possibly on multiple timescales.

Tabulation of incident DM cases
- Follow-up split by age and date (1-year classes).
- Cases (green dots) by age, date, sex and date of birth.
- Population figures, with risk time among DM patients subtracted to give risk time among the non-DM population.
[Figure: Lexis diagram; x-axis: Date (1990–2010), y-axis: Age (45–65).]

Model for DM incidence rates
Model (for each sex):
$$\lambda(a, p) = f(a) \times g(p), \qquad g(2004) = 1$$
- $a$: current age
- $p$: current date (period)
$f(a)$ and $g(p)$ are modelled by natural splines (restricted cubic splines).
Reported in detail in [4]: The Danish National Diabetes Register: Trends in incidence, prevalence and mortality. Diabetologia, 2008.

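One way to read $f$ and $g$ off a fitted model is sketched below on simulated data; this is an illustration under stated assumptions, not the analysis reported in [4]. The table `tab`, the true rates and the spline degrees of freedom are invented. With the constraint $g(2004) = 1$, $f(a)$ is the age-specific rate at the reference date and $g(p)$ is the rate ratio relative to 2004.

```r
library(splines)

# Simulated age-by-period table (1-year classes, midpoints as covariates)
set.seed(2)
tab   <- expand.grid(a = seq(40.5, 79.5, by = 1),
                     p = seq(1995.5, 2008.5, by = 1))
tab$Y <- 50000
true  <- exp(-8 + 0.07 * (tab$a - 40)) * 1.03^(tab$p - 2004)  # made-up rates
tab$D <- rpois(nrow(tab), true * tab$Y)

# Multiplicative model: log-rate is a sum of an age term and a period term
fit <- glm(D ~ ns(a, df = 3) + ns(p, df = 3) + offset(log(Y)),
           family = poisson, data = tab)

# g(p): rate ratio relative to 2004; the fixed age cancels in the difference
p.pts <- c(1996, 2000, 2004, 2008)
eta   <- predict(fit, newdata = data.frame(a = 60, p = p.pts, Y = 1))
g     <- exp(eta - eta[p.pts == 2004])

# f(a): rates per 1000 person-years at the reference date 2004
a.pts <- c(45, 55, 65, 75)
f     <- predict(fit, newdata = data.frame(a = a.pts, p = 2004, Y = 1000),
                 type = "response")

round(g, 3)
round(f, 2)
```

Confidence intervals for $g$ would need the covariance of the difference between the two predictions, which this sketch does not compute.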
[Figure: estimated DM incidence. Left panel: incidence rate per 1000 pyrs by age (0–100), log scale from 0.1 to 10. Right panel: rate ratio by date of inclusion (1996–2008), log scale from 0.1 to 10.]

Tabulation of DM deaths
- Follow-up split by age and date (1-year classes).
- Cases (red dots) and risk time (gray lines) by age, date, duration, sex and date of birth.
[Figure: Lexis diagram; x-axis: Date (1990–2010), y-axis: Age (45–65).]

Models for DM mortality rates
Model using current age ($a$), date at diagnosis ($p - d$) and duration ($d$) (two timescales):
$$\lambda(a, p) = f(a) \times g(p - d) \times h(d), \qquad g(2004) = 1, \quad h(0) = 1$$
Model using age at diagnosis ($a - d$), date at diagnosis and duration (one timescale):
$$\lambda(a, p) = f(a - d) \times g(p - d) \times h(d), \qquad g(2004) = 1, \quad h(0) = 1$$
$f$, $g$ and $h$ are modelled by natural splines (restricted cubic splines).

DM mortality, two timescales
[Figure: three panels. Mortality rate per 1000 pyrs by age (30–90), log scale from 2 to 200; rate ratio by inclusion date; rate ratio by time since inclusion (0–12), log scale from 0.2 to 20.]

DM mortality, one timescale
[Figure: three panels. Mortality rate per 1000 pyrs by age at inclusion (30–90), log scale from 2 to 200; rate ratio by inclusion date; rate ratio by time since inclusion (0–12), log scale from 0.2 to 20.]

Why not a Cox model?
Need to choose an underlying timescale:
- Age (i.e. current age)
- Duration (i.e. time since diagnosis)
The other timescale is accommodated in a Cox model by splitting the follow-up on it and including it as a covariate.
So you can accommodate more than one timescale in a Cox model, but the hassle with time-splitting is the same.
The Poisson approach is easier because rates are directly estimated using smoothers.

Age at entry?
If duration is taken as timescale and age at entry ($e = a - d$) as covariate:
$$
\begin{aligned}
\lambda(a, d, x) &= \lambda_0(d) \exp\bigl(\alpha(a - d) + x\beta\bigr) \\
&= \lambda_0(d)\, e^{-\alpha d} \exp(\alpha a + x\beta)
\end{aligned}
$$
The effect of current age is taken to be linear on the log scale, i.e. an exponential effect of age.
This is in many cases not too far from reality, but unless you are prepared to split data you cannot check the feasibility of the model.

The real advantage
[Figure: multistate model with states Well, DM, Dead (no DM) and Dead (DM); transition rates $\lambda(a)$ (Well → DM), $\mu_W(a)$ (Well → Dead (no DM)) and $\mu_D(a, d)$ (DM → Dead (DM)).]

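The relationships between the rates and the probabilities are:
$$
\begin{aligned}
P\{\text{Well at } a\} &= \exp\left(-\int_0^a \bigl(\lambda(s) + \mu_W(s)\bigr)\, ds\right) \\
P\{\text{Dead (well) at } a\} &= \int_0^a \mu_W(s) \exp\left(-\int_0^s \bigl(\lambda(u) + \mu_W(u)\bigr)\, du\right) ds \\
P\{\text{DM at } a\} &= \int_0^a P\{\text{DM diagnosis at } s\} \times P\{\text{survive with DM from } s \text{ to } a\}\, ds \\
&= \int_0^a \lambda(s) \exp\left(-\int_0^s \bigl(\lambda(u) + \mu_W(u)\bigr)\, du\right) \times \exp\left(-\int_s^a \mu_D(u, u - s)\, du\right) ds \\
P\{\text{Dead (DM) at } a\} &= 1 - P\{\text{Well at } a\} - P\{\text{Dead (well) at } a\} - P\{\text{DM at } a\}
\end{aligned}
$$
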
The real advantage
Poisson models give parametric expressions for the rates, so calculation of the integrals is simple; they are just sums:
    # Inputs (not defined on the slide): inc and m.w are vectors of length A
    # holding rate x interval length by age; m.d is a matrix indexed by
    # [age, duration] holding the same for mortality among DM patients.
    # Evaluate the cumulative rates at the *end* of the intervals
    Inc <- cumsum( inc )
    M.w <- cumsum( m.w )
    # Probability of being in the "Well" state
    P.w <- exp( -(Inc+M.w) )
    # Probability of being dead without disease, i.e. in the "Dead well" state
    P.mw <- cumsum( m.w * exp( -(Inc+M.w) ) )
    # Probability of being alive with disease
    P.wd <- x <- numeric( A )
    for( a in 1:A )
       {
       for( d in 1:a ) # here d plays the role of age at diagnosis
          x[d] <- inc[d] * exp( -(Inc[d]+M.w[d]) ) *
                  exp( -sum( m.d[cbind(d:a,1:(a-d+1))] ) )
       P.wd[a] <- sum( x[1:a] )
       }
    # Columns: Well, DM, Dead (DM), Dead (well)
    res <- cbind( P.w, P.wd, 1-P.w-P.mw-P.wd, P.mw )

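To see how well sums over 1-year intervals approximate the integrals, the calculation can be run with made-up constant rates, for which the probability of being alive with DM has a closed form. Everything numeric in this sketch (`lambda`, `mu.w`, `mu.d`, `A`) is a hypothetical input, not an estimate from the register.

```r
# Hypothetical constant rates per person-year (illustration only)
A      <- 90       # one-year age intervals
lambda <- 0.010    # DM incidence
mu.w   <- 0.010    # mortality without DM
mu.d   <- 0.030    # mortality with DM

inc <- rep(lambda, A)        # rate x interval length (1 year)
m.w <- rep(mu.w, A)
m.d <- matrix(mu.d, A, A)    # indexed by [age, duration]

Inc  <- cumsum(inc)
M.w  <- cumsum(m.w)
P.w  <- exp(-(Inc + M.w))
P.mw <- cumsum(m.w * exp(-(Inc + M.w)))
P.wd <- numeric(A)
for (a in 1:A) {
  x <- numeric(a)
  for (d in 1:a)   # d plays the role of age at diagnosis
    x[d] <- inc[d] * exp(-(Inc[d] + M.w[d])) *
            exp(-sum(m.d[cbind(d:a, 1:(a - d + 1))]))
  P.wd[a] <- sum(x)
}
res <- cbind(P.w, P.wd, P.md = 1 - P.w - P.mw - P.wd, P.mw)

# Closed form of P{DM at a} for constant rates, evaluated at ages 1..A
age   <- 1:A
exact <- lambda / (lambda + mu.w - mu.d) *
         (exp(-mu.d * age) - exp(-(lambda + mu.w) * age))

max(abs(P.wd - exact))   # discretisation error of the 1-year sums
```

With these rates the discrepancy stays well below one percentage point; finer intervals would reduce it further, at the cost of a larger `m.d` matrix.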
[Figure: probabilities (0–1) by age (0–100) of the four states, labelled P(+, well), P(+, DM), P(Alive, DM) and P(Alive, well).]

[Figure: P(DM before age a) in % (0–30) by age (0–100).]

Summary
- Registers provide follow-up for health events.
- For the initial event, population risk time is needed to compute rates.
- Tables of events and risk time should use narrow time intervals.
- Effects of timescales are modelled by using the interval midpoints as quantitative variables.
- Show data in Lexis diagrams.
- Show rates as interpretable curves.
- Usually many timescales are available; an informed choice is needed.

References
[1] B Carstensen. Age-Period-Cohort models for the Lexis diagram. Statistics in Medicine, 26(15):3018–3045, July 2007.
[2] J Rosenbauer and K Strassburger. Comments on: “Age-Period-Cohort models for the Lexis diagram”. Statistics in Medicine, 27:1557–1561, 2008.
[3] B Carstensen. Age-Period-Cohort models for the Lexis diagram (author's reply). Statistics in Medicine, 27:1561–1564, 2008.
[4] B Carstensen, JK Kristensen, P Ottosen, and K Borch-Johnsen. The Danish National Diabetes Register: Trends in incidence, prevalence and mortality. Diabetologia, 51:2187–2196, 2008.
The presentation is available on my homepage: www.biostat.ku.dk/~bxc/