Statistical techniques for local cluster detection

Alexandru Amărioarei

National Institute of Research and Development for Biological Sciences

2nd BIS Workshop: bioinformatic and statistical tools for data analysis

Friday 10 June, 2016

Research Platform for Biology and Systemic Ecology, Conference Hall, Spl. Independentei no. 91-95, Bucharest, Romania

A. Amărioarei (INCDSB), Detection of local clusters, BIS Workshop 2016

Outline

1 Who am I?

Education

Research Interests

2 What I do?

A first example

Detecting Crohn's disease clusters

3 References

Education

Alexandru Amărioarei

University of Science and Technologies, Lille, France 2014

Ph.D. Thesis: Approximations for Multidimensional Discrete Scan Statistics

Advisor: Prof. Cristian Preda

Member of the MODAL team (Models for Data Analysis and Learning), INRIA Lille

University of Bucharest, Bucharest, Romania 2008–2010

Master Thesis: Markov chains with applications in biology (in Romanian)

Advisor: Acad. Ioan Cuculescu

University of Bucharest, Bucharest, Romania 2004–2008

Bachelor Degree (Mathematics)

Research Interests

Research Topics

Scan statistics: methods and applications

Distribution of runs and patterns

Simulation techniques based on Monte Carlo methods

Scientific computing

Spatial data analysis

Concentration inequalities

Languages

Matlab

R

SAS

Mathematica

Maple

A first example

Example from Epidemiology

[Figure: timeline of case occurrences by month, January 1991 to December 1995, with a 1-year window marked]

Observation of disease cases over time: N = 19 cases over a period of T = 5 years

Observation

The epidemiologist notes a one-year period (April 1993 through April 1994) with 8 cases: 42% of the total!

Question

Given 19 cases over 5 years, how unusual is it to have a one-year period containing as many as 8 cases?

The answer: a first approach

X = the number of cases falling in [April 93, April 94]

X ∼ Bin(19, 0.2)

P(X ≥ 8) = 0.023

Conclusion: an atypical situation!

But this does not answer our question: the one-year period was not fixed in advance but identified after scanning the data!
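The first-approach figure can be checked directly. The sketch below assumes that the success probability 0.2 comes from a fixed one-year window covering 1/5 of the 5-year period; the function name `binomial_tail` is illustrative.

```python
from math import comb

def binomial_tail(n: int, p: float, k: int) -> float:
    """Return P(X >= k) for X ~ Bin(n, p)."""
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k, n + 1))

# Assumption: a window fixed in advance covers 1 of the 5 years, so p = 1/5.
p_fixed_window = binomial_tail(n=19, p=1 / 5, k=8)
print(f"P(X >= 8) = {p_fixed_window:.3f}")  # prints P(X >= 8) = 0.023
```

This reproduces the value on the slide, but only for a window chosen before looking at the data.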

The answer: Correct approach

The scan statistic:

S = the maximum number of cases over any continuous one-year period in [0, T]

Thus,

P(S ≥ 8) = 0.379

answers the epidemiologist's question.

Conclusion: no unusual situation!
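A Monte Carlo sketch of the scan statistic for this example. It assumes that, under the null hypothesis, the 19 case times are independent and uniform on [0, 5]; the slide does not say how 0.379 was obtained, so the estimate should only land in its neighbourhood. Function names, the number of simulations and the seed are illustrative choices.

```python
import random

def scan_statistic(times, window):
    """Largest number of points falling in any interval of length `window`."""
    pts = sorted(times)
    best, left = 0, 0
    for right, t in enumerate(pts):
        # Advance the left end until pts[left]..t fits in one window.
        while t - pts[left] > window:
            left += 1
        best = max(best, right - left + 1)
    return best

def scan_tail_probability(n_cases, horizon, window, threshold,
                          n_sims=20000, seed=0):
    """Monte Carlo estimate of P(S >= threshold) under uniform case times."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        times = [rng.uniform(0.0, horizon) for _ in range(n_cases)]
        if scan_statistic(times, window) >= threshold:
            hits += 1
    return hits / n_sims

p_scan = scan_tail_probability(n_cases=19, horizon=5.0, window=1.0, threshold=8)
print(f"Estimated P(S >= 8) = {p_scan:.3f}")
```

The maximum is always attained by a window whose left edge sits on a case, which is why sliding over the sorted case times is enough.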

Animation for two-dimensional scan statistics

Detection of Crohn's disease clusters using spatial scan statistics

Problem and Data

Crohn's disease (CD): an inflammatory disease of the intestines with no known pharmaceutical or surgical cure

genetic factors

environmental risk factors

Goal

Detect and highlight significant atypical clusters of CD in terms of incidence

Data

Study region: North of France

Population: 5 790 526

Period: 1990–2006

Sub-division (cantons): 273 (small French administrative areas with populations between 1 500 and 212 000)

Per canton: population stratified by gender and age group

Cases of Crohn's disease: 6 472

[Map of the study region: the departments Nord, Pas-de-Calais, Somme and Seine-Maritime, shown relative to Brussels, Paris and London]

Methods

Standardized Incidence Ratios (SIR) [Declercq et al., 2010, Besag and York, 1991]

detect global spatial heterogeneity

unable to detect unusual local clusters of CD

unable to test their significance

cannot take into account the time component

Spatial and space-time scan statistics [Kulldorff, 1997, Kulldorff, 2006]

detect local clusters without pre-selection bias

detection of time-constant clusters

detection of time-varying clusters

able to test their significance

Results

Time-constant clusters

14 significant clusters detected

5 clusters with high incidence (total: 726 cases)

9 clusters with low incidence (total: 521 cases)

Time-varying clusters

4 significant clusters detected

3 clusters with high incidence (779 cases over periods of 9 to 12 years)

1 cluster with low incidence (4 cases over a 7-year period)

What else? Other examples

A mathematical model for matching in two aligned sequences

Matching in two aligned sequences

Let $\{Y_1, Y_2, \dots, Y_{T_1}\}$ and $\{Z_1, Z_2, \dots, Z_{T_1}\}$ be two i.i.d. sequences of r.v.'s over the four-letter alphabet $\mathcal{A} = \{A, C, G, T\}$. Define, for $1 \le i \le T_1$, the score r.v.'s

$$X_i = \begin{cases} 1, & \text{if } Y_i = Z_i \\ 0, & \text{otherwise} \end{cases}, \qquad X_i \sim B(p), \quad p = P(Y_i = Z_i)$$

Let $V_c$ denote the length of the longest matching subsequence allowing at most $c$ mismatches.

Example ($T_1 = 26$, $p = 0.25$, $c = 1$)

Y : A A A C C G G G C A C T A C T T T G A G A C G T G A

Z : A A T C C C C C G T G C C C T T A G C G G C G T G G

X : 1 1 0 1 1 0 0 0 0 0 0 0 0 1 1 1 0 1 0 1 0 1 1 1 1 0

$c = 0$: length of the longest success run $L_{T_1}$ ([Bateman, 1948])

$c \in \{1, 2\}$: almost perfect run ([Han and Hirano, 2003], [Bersimis et al., 2012])
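The example can be recomputed with a short sketch. It assumes that V_c is read as the longest contiguous stretch of X containing at most c zeros, with mismatches at either end of the stretch allowed; the slides do not spell out this boundary convention. Function names are illustrative.

```python
def match_indicators(y: str, z: str) -> list[int]:
    """X_i = 1 where the aligned letters agree, 0 otherwise."""
    if len(y) != len(z):
        raise ValueError("sequences must have the same length")
    return [1 if a == b else 0 for a, b in zip(y, z)]

def longest_run_with_mismatches(x: list[int], c: int) -> int:
    """Longest contiguous stretch of x containing at most c zeros."""
    best = left = zeros = 0
    for right, value in enumerate(x):
        zeros += 1 - value
        # Shrink from the left while the stretch holds too many mismatches.
        while zeros > c:
            zeros -= 1 - x[left]
            left += 1
        best = max(best, right - left + 1)
    return best

Y = "AAACCGGGCACTACTTTGAGACGTGA"
Z = "AATCCCCCGTGCCCTTAGCGGCGTGG"
X = match_indicators(Y, Z)
print(X)
print(longest_run_with_mismatches(X, c=0))  # longest success run: 4
print(longest_run_with_mismatches(X, c=1))  # at most one mismatch: 6
```

Under this reading, the longest success run covers positions 22 to 25, and allowing one mismatch extends it to positions 20 to 25.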

References

Bateman, G. (1948). On the power function of the longest run as a test for randomness in a sequence of alternatives. Biometrika, 35:97–112.

Bersimis, S., Koutras, M. V., and Papadopoulos, G. (2012). Waiting time for an almost perfect run and applications in statistical process control. Methodol Comput Appl Probab.

Besag, J. and York, J. (1991). Bayesian image restoration with two applications in spatial statistics. Annals of the Institute of Statistical Mathematics, 43:1–21.

Breslow, N. and Day, N. (1980). Statistical methods in cancer research. The analysis of case-control studies: Distributed for IARC by WHO.

Declercq, C., Gower-Rousseau, C., Vernier-Massouille, G., Salleron, J., Balde, M., Poirier, G., Lerebours, E., Dupas, J. L., Merle, V., Marti, R., Duhamel, A., Cortot, A., Salomez, J. L., and Colombel, J. F. (2010). Mapping of inflammatory bowel disease in northern France: spatial variations and relation to affluence. Inflamm Bowel Dis, 16(5):807–12.

Han, Q. and Hirano, K. (2003). Waiting time problem for an almost perfect match. Stat. and Prob. Letters, 65:39–49.

Kulldorff, M. (1997). A spatial scan statistic. Communications in Statistics - Theory and Methods, 26(6):1481–1496.

Kulldorff, M. (2006). Tests of spatial randomness adjusted for an inhomogeneity. Journal of the American Statistical Association, 101(475):1289–1305.

Samuels, S., Beaumont, J., and Breslow, N. (1991). Power and detectable risk of seven tests for standardized mortality ratios. American Journal of Epidemiology.

Appendix

Standardized Incidence Ratio

SIR: defined as the ratio between $O_i$, the number of observed cases in region (canton) $i$ over the studied period, and $E_i$, the expected number of cases under the incidence rates of the reference population, adjusted by sex and age group.

$n_{ijk}$: population of the $i$-th region (canton) in age class $j$ and sex $k$

$\lambda_{jk}$: incidence rate for age class $j$ and sex $k$

$$E_i = \sum_j \sum_k \lambda_{jk} \, n_{ijk}$$

The standardized incidence ratio relative to region $i$:

$$\mathrm{SIR}_i = \frac{O_i}{E_i}$$

with $E[\mathrm{SIR}_i] = \theta_i$ and $V[\mathrm{SIR}_i] = \theta_i / E_i$, estimated by $O_i / E_i^2$.
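A minimal sketch of the SIR computation for a single canton. The stratum labels, rates, populations and observed count are hypothetical values chosen for illustration, not figures from the study; the function names are likewise illustrative.

```python
def expected_cases(rates, population):
    """E_i = sum over strata (j, k) of lambda_jk * n_ijk for one canton."""
    return sum(rates[stratum] * population[stratum] for stratum in population)

def sir_with_variance(observed, expected):
    """Return (SIR_i, estimated variance O_i / E_i^2)."""
    return observed / expected, observed / expected**2

# Hypothetical strata keyed by (age class, sex); not data from the study.
rates = {("15-39", "F"): 0.002, ("15-39", "M"): 0.001}
population = {("15-39", "F"): 5000, ("15-39", "M"): 4000}

E_i = expected_cases(rates, population)        # 10 + 4 = 14 expected cases
sir, var = sir_with_variance(observed=21, expected=E_i)
print(f"E_i = {E_i:.1f}, SIR_i = {sir:.2f}, variance estimate = {var:.3f}")
```

With these made-up inputs the canton shows 1.5 times the expected number of cases.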

Standardized Incidence Ratio: Interpretation

$\mathrm{SIR}_i = 1$: the incidence in region (canton) $i$ does not differ from the one expected in the reference population (no excess risk)

$\mathrm{SIR}_i > 1$: the incidence in region (canton) $i$ is higher than the one expected in the reference population

$\mathrm{SIR}_i < 1$: the incidence in region (canton) $i$ is lower than the one expected in the reference population

Statistical test

$H_0$: $\mathrm{SIR} = 1$

$H_1$: $\mathrm{SIR} \neq 1$

Test statistics: see [Breslow and Day, 1980] and [Samuels et al., 1991].
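One simple way to test H0 is an exact Poisson tail probability, sketched below. This is an assumption made for illustration: the cited references discuss several test statistics and the slides do not say which one was used. The sketch treats the observed count as Poisson with mean E_i under H0 and reports the one-sided upper tail; the counts are hypothetical.

```python
from math import exp, factorial

def poisson_pmf(k, mean):
    """P(O = k) for O ~ Poisson(mean)."""
    return exp(-mean) * mean**k / factorial(k)

def poisson_upper_tail(observed, expected):
    """P(O >= observed) when O ~ Poisson(expected), i.e. under H0: SIR = 1."""
    return 1.0 - sum(poisson_pmf(k, expected) for k in range(observed))

# Hypothetical canton: 21 observed cases against 14 expected.
p_upper = poisson_upper_tail(observed=21, expected=14.0)
print(f"One-sided p-value = {p_upper:.4f}")
```

A two-sided version would also need the lower tail. Testing all 273 cantons this way raises the multiple-comparison issue that the scan statistic is designed to avoid.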

SIR: Crohn's disease example

Spatial scan statistics

Assumptions

The number of CD cases in each canton is Poisson distributed

Null hypothesis: the risk of being affected by CD is constant across all cantons

Alternative hypothesis: there is at least one zone for which the underlying risk is higher inside the zone than outside

Description

circular window of flexible size (radius varying from 0 up to a maximum chosen so that the window never contains more than 50% of the population at risk)

the window is centered, in turn, on the centroid of each canton

for each circle, the likelihood of observing the number of CD cases inside and outside it is computed; the circle that maximizes the likelihood is defined as the most likely cluster (MLC)
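The window construction described above can be sketched as follows. This is a simplified reading: a zone is the set of cantons whose centroids fall inside a circle centered on one centroid, so growing the radius amounts to adding cantons in order of distance. Ties in distance are not handled specially, and the coordinates and populations are hypothetical.

```python
from math import dist

def candidate_zones(centroids, population, max_fraction=0.5):
    """Enumerate circular zones as sets of site indices.

    Each zone is centered on one site and grown by adding sites in order
    of distance, stopping before it would exceed max_fraction of the
    total population at risk.
    """
    total = sum(population)
    zones = set()
    for center in centroids:
        order = sorted(range(len(centroids)),
                       key=lambda i: dist(center, centroids[i]))
        members, pop_inside = [], 0
        for i in order:
            if pop_inside + population[i] > max_fraction * total:
                break
            members.append(i)
            pop_inside += population[i]
            zones.add(frozenset(members))
    return zones

# Hypothetical layout: four sites on a line with equal populations.
centroids = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0), (6.0, 0.0)]
population = [10, 10, 10, 10]
zones = candidate_zones(centroids, population)
print(sorted(sorted(z) for z in zones))
```

Storing zones as frozensets removes duplicates, since circles with different centers can cover the same set of cantons.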

Spatial scan statistics: likelihood

Under a Poisson model, the likelihood of a zone $Z$ is given by:

$$L(Z) = \frac{e^{-n_G}}{n_G!} \left(\frac{n_Z}{\mu(Z)}\right)^{n_Z} \left(\frac{n_G - n_Z}{\mu(G) - \mu(Z)}\right)^{n_G - n_Z} \prod_{i=1}^{n} \mu(d_i)$$

where $d_1, d_2, \dots, d_n$ are the site locations (centroids), $\mu(d_i)$ is the population at risk at location $d_i$, and $n_Z$, $\mu(Z)$ (respectively $n_G$, $\mu(G)$) are the number of CD cases and the population at risk inside the circular zone $Z$ (respectively in the whole region $G$). The test statistic used is

$$\nu = \max_{Z} \frac{L(Z)}{L_0}$$

where the likelihood under the null hypothesis is

$$L_0 = \frac{e^{-n_G}}{n_G!} \left(\frac{n_G}{\mu(G)}\right)^{n_G} \prod_{i=1}^{n} \mu(d_i)$$

The p-value, $P(\nu > \nu_{\mathrm{obs}})$, associated with the MLC is obtained from Monte Carlo random replications under the null hypothesis.
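A sketch of the test statistic and its Monte Carlo p-value, working with log(L(Z)/L0) so that the common factors cancel. Several points are assumptions not stated on the slide: null replicates are drawn by allocating the n_G cases to sites with probability proportional to population, the p-value uses the usual rank formula with a non-strict comparison, and the ratio is evaluated exactly as written, without restricting to zones with elevated risk. The data and zone list are hypothetical.

```python
import random
from math import log

def xlogy(x, y):
    """x * log(y) with the convention 0 * log(0) = 0."""
    return 0.0 if x == 0 else x * log(y)

def log_likelihood_ratio(n_z, mu_z, n_g, mu_g):
    """log(L(Z) / L0) for a zone with n_z cases and population mu_z."""
    return (xlogy(n_z, n_z / mu_z)
            + xlogy(n_g - n_z, (n_g - n_z) / (mu_g - mu_z))
            - xlogy(n_g, n_g / mu_g))

def max_llr(cases, population, zones):
    """log(nu): the largest log likelihood ratio over all candidate zones."""
    n_g, mu_g = sum(cases), sum(population)
    return max(
        log_likelihood_ratio(sum(cases[i] for i in z),
                             sum(population[i] for i in z), n_g, mu_g)
        for z in zones
    )

def monte_carlo_p_value(cases, population, zones, n_sims=999, seed=0):
    """Rank of the observed statistic among null replicates."""
    rng = random.Random(seed)
    observed = max_llr(cases, population, zones)
    sites = range(len(population))
    exceed = 0
    for _ in range(n_sims):
        sim = [0] * len(population)
        # Assumption: condition on the total and allocate cases by population.
        for site in rng.choices(sites, weights=population, k=sum(cases)):
            sim[site] += 1
        if max_llr(sim, population, zones) >= observed:
            exceed += 1
    return (exceed + 1) / (n_sims + 1)

# Hypothetical data: four sites, cases concentrated in sites 0 and 1.
population = [100, 100, 100, 100]
cases = [9, 8, 1, 2]
zones = [{0}, {1}, {2}, {3}, {0, 1}, {2, 3}]
p_value = monte_carlo_p_value(cases, population, zones)
print(f"log(nu) = {max_llr(cases, population, zones):.3f}, p-value = {p_value:.3f}")
```

Because every replicate is scanned over the same set of zones, the p-value accounts for the search over windows, which the fixed-window binomial calculation does not.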

Spatial scan statistics: illustration

CHAPTER 2. SPATIAL SCAN STATISTIC

However, the measure µ can take other forms depending on the application. For example, with health data, epidemiologists are interested in detecting disease clusters. It is then important to account for the heterogeneity of the underlying population in order to determine whether the zone under study shows a genuinely high number of cases. Thus, for all A ⊆ D, µ(A) is the underlying population associated with A. For lattice data, the measure is attached to each vertex of the graph.

The method breaks down into two distinct parts: the detection phase, which identifies a potentially atypical cluster of events, and the statistical inference phase for the detected cluster.

2.1 Detection phase

Let Z be a circular window that moves over the whole spatial set D. The window size is variable and its radius ranges, in theory, from 0 to +∞. In practice, the radius is treated as discrete and its maximum value is defined as the largest distance separating two elements of D. In addition, the method is restricted to zones Z containing at most 50% of the events. Indeed, detecting a cluster comprising more than 50% of the events would amount to highlighting an unusually low number of events outside it rather than a high number inside it. The spatial set D is scanned by centering the circular window on an element of D and increasing the radius up to the limits defined above. This procedure is repeated for each element of D and yields the finite discrete set 𝒵, defined as the set of all possible zones Z with different centers and sizes (Figure 2.4).

[Figure 2.4: Set 𝒵 of zones, drawn as circular windows around the sites d1, d2, d3 and d4 in D]

When the radius of the scanning window is fixed, the scan statistic is defined as the window containing the maximum number of events among all possible windows (cf. Section 6).

However, if the size of the scanning window is variable, the test statistic can no longer be a fixed-window scan statistic, and a likelihood ratio is used instead. To be improved using the article by Nagarwalla [Nagarwalla, 1996].

Thus, a likelihood ratio is associated with each scanning window Z ∈ 𝒵. Let L(Z, p, q) be the likelihood function of a window Z in which the probability of occurrence of an event
