Some Statistical Methods For Detecting Clustering In Biological Sequences

Date posted: 19-Dec-2015

Transcript (42 pages)

Page 1: Some Statistical Methods For Detecting Clustering In Biological Sequences

John L. Spouge (spouge@nih.gov)

National Center for Biotechnology Information

Bldg. 45, Rm. 6AS 47J, NCBI, NLM, NIH

Bethesda MD 20894

Page 2: Overview

Clustering in bacterial genomes
  • Minimum distance statistic (with Fisher omnibus test)
  • Kolmogorov-Smirnov tests
  • Scan tests
  • Local run (BLAST-like) tests using Poisson process (PP)

Clustering of intergenic conservation
  • Hypergeometric test

Clustering of PSSM motifs
  • Chi-square for 1/0 “motifs”
  • Compound Poisson process (CPP) models for PSSM motifs
  • Local run (BLAST-like) tests for PSSM motifs using CPP

Page 3: Clustering in Bacterial Genomes

Given a small gene family in several bacterial genomes, do its genes tend to cluster?

IK Jordan et al. (2001) Genome Res 11:555-565

Page 4: Fisher Omnibus Test

The Fisher omnibus test combines several weak one-sided continuous p-values $p_1, p_2, \ldots, p_n$ to test the aggregate for significance.

Under the null hypothesis, with independent p-values, $-2\sum_{i=1}^{n} \ln p_i$ is chi-square distributed with $2n$ degrees of freedom.

Page 5: Fisher Omnibus Test

For any one-sided continuous p-value $p$,

$$\mathbb{P}\{-\ln p > x\} = \mathbb{P}\{p < e^{-x}\} = e^{-x},$$

so $X = -\ln p$ is exponential (1) distributed.

$-\sum_{i=1}^{n} \ln p_i$ is gamma (1, n) distributed.

$-2\sum_{i=1}^{n} \ln p_i$ is chi-square with $2n$ degrees of freedom.
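The Fisher omnibus calculation can be sketched in a few lines of Python. This is a minimal illustration, not the speaker's code: the function name `fisher_omnibus` is hypothetical, and the p-values are assumed to be independent. It uses the gamma (1, n) form of the statistic, whose tail has the closed form $e^{-s}\sum_{k=0}^{n-1} s^k/k!$, so no statistics library is needed.

```python
import math

def fisher_omnibus(p_values):
    """Combine independent one-sided continuous p-values (Fisher's method).

    Returns (chi_square_statistic, combined_p_value).
    The function name and interface are illustrative assumptions.
    """
    if not p_values or any(not (0.0 < p <= 1.0) for p in p_values):
        raise ValueError("p-values must lie in (0, 1]")
    n = len(p_values)
    # s = -sum(ln p_i) is gamma(1, n) distributed under the null hypothesis.
    s = -sum(math.log(p) for p in p_values)
    # Gamma(1, n) upper tail: exp(-s) * sum_{k=0}^{n-1} s^k / k!
    term, total = 1.0, 1.0
    for k in range(1, n):
        term *= s / k
        total += term
    # 2s is the chi-square statistic with 2n degrees of freedom.
    return 2.0 * s, math.exp(-s) * total

statistic, p_combined = fisher_omnibus([0.1, 0.1])
print(statistic, p_combined)
```

With a single p-value the combined p-value equals the input, which is a quick sanity check on the exponential (1) argument above.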

Page 6: Minimum Distance

[Figure: unit interval with $U_0^* = 0$, ordered points $U_1^*, U_2^*, U_3^*, \ldots$, and $U_{n+1}^* = 1$; the gaps are bounded below by $x_0, x_1, x_2, x_3, \ldots$]

$$\mathbb{P}\{x_i < U_{i+1}^* - U_i^*,\ i = 0, 1, \ldots, n\} = \Bigl(1 - \sum_{i=0}^{n} x_i\Bigr)_+^{n}$$

S Karlin & HM Taylor (1981) A Second Course in Stochastic Processes, p. 132

Page 7: De Finetti’s Formula

[Figure: unit interval with $U_0 = 0$, points $U_1, U_2, U_3, \ldots$, and $U_{n+1} = 1$; the gaps are bounded below by $x_0, x_1, x_2, x_3, \ldots$]

For independent uniform points $U_1, \ldots, U_n$ taken in their original (unsorted) order,

$$\mathbb{P}\{x_i < U_{i+1} - U_i,\ i = 0, 1, \ldots, n\} = \frac{(1 - x_0 - x_1 - \cdots - x_n)_+^{n}}{n!}$$

B de Finetti (1964) Giornale Istituto Italiano degli Attuari 27:151

W Feller (1971) An Introduction to Probability Theory…, Vol. 2, p. 42

Page 8: Discrete Version

[Figure: integer positions $X_0 = 0 < X_1 < X_2 < X_3 < \cdots < X_{n+1} = t + 1$; the gaps are bounded below by $x_0, x_1, x_2, x_3, \ldots$]

$$\#_t\{x_i < X_{i+1} - X_i,\ i = 0, 1, \ldots, n\} = \binom{t - x_0 - x_1 - \cdots - x_n}{n} = \frac{(t - x_0 - x_1 - \cdots - x_n)_{(n)}}{n!}$$

where $(a)_{(n)} = a(a-1)\cdots(a-n+1)$.

Subtracting the cumulative lower bounds gives $X_1 - x_0$, $X_2 - x_0 - x_1$, $X_3 - x_0 - x_1 - x_2$, ..., with $X_{n+1} - x_0 - x_1 - \cdots - x_n = t + 1 - x_0 - x_1 - \cdots - x_n$. The shifted values are an unconstrained choice of $n$ distinct numbers from $\{1, 2, \ldots, t - x_0 - x_1 - \cdots - x_n\}$.

Page 9: Special Cases

With $X_0 = 0$ and $X_{n+1} = t + 1$,

$$\#_t\{x_i < X_{i+1} - X_i,\ i = 0, 1, \ldots, n\} = \binom{t - x_0 - x_1 - \cdots - x_n}{n}$$

$\{x_i\}_{i=0}^{n} = \{0, 0, \ldots, 0\}$ gives $\binom{t}{n}$.

$\{x_i\}_{i=0}^{n} = \{0, 1, 1, \ldots, 1, 0\}$ gives $\binom{t - n + 1}{n}$.

Page 10: Minimum Distance

Choose $n$ distinct numbers from $\{1, 2, \ldots, t\}$ such that the minimum distance between consecutive order statistics exceeds $x_0$.

With $X_0 = 0$ and $X_{n+1} = t + 1$,

$$\#_t\{x_i < X_{i+1} - X_i,\ i = 0, 1, \ldots, n\} = \binom{t - x_0 - x_1 - \cdots - x_n}{n}$$

$\{x_i\}_{i=0}^{n} = \{0, x_0, x_0, \ldots, x_0, 0\}$ gives

$$\binom{t - (n-1)x_0}{n} = \frac{(t - (n-1)x_0)_{(n)}}{n!}$$

S Karlin & HM Taylor (1981) A Second Course in Stochastic Processes
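The discrete count can be checked by brute force for small cases. The sketch below is illustrative only: the function names are hypothetical, and it assumes the convention $X_0 = 0$, $X_{n+1} = t + 1$ with strict inequalities $X_{i+1} - X_i > x_i$. It compares the binomial closed form with direct enumeration and turns the count into the probability that the minimum interior gap exceeds a threshold.

```python
from itertools import combinations
from math import comb

def count_with_gaps(t, gaps):
    """Closed form C(t - sum(gaps), n) for gaps = [x_0, ..., x_n]."""
    n = len(gaps) - 1
    remaining = t - sum(gaps)
    return comb(remaining, n) if remaining >= 0 else 0

def count_brute_force(t, gaps):
    """Enumerate n-subsets of {1..t} with X[i+1] - X[i] > gaps[i]."""
    n = len(gaps) - 1
    total = 0
    for subset in combinations(range(1, t + 1), n):
        points = (0,) + subset + (t + 1,)   # X_0 = 0 and X_{n+1} = t + 1
        if all(points[i + 1] - points[i] > gaps[i] for i in range(n + 1)):
            total += 1
    return total

def prob_min_gap_exceeds(t, n, x0):
    """P{minimum gap between consecutive chosen numbers > x0}."""
    gaps = [0] + [x0] * (n - 1) + [0]       # no constraint at either end
    return count_with_gaps(t, gaps) / comb(t, n)

print(count_with_gaps(10, [0, 1, 1, 0]), count_brute_force(10, [0, 1, 1, 0]))
print(prob_min_gap_exceeds(10, 3, 1))
```

For an observed minimum gap d, a clustering p-value under this null model would be 1 - prob_min_gap_exceeds(t, n, d), since small minimum gaps indicate clustering.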

Page 11: Threading Configurations

$m = \#\{\text{threading configurations}\}$

$\mathbf{X} = \{X_i\}_{i=1}^{n}$, with $\{\underline{x}_i \le X_{i+1} - (X_i + l_i) \le \overline{x}_i\}_{i=0}^{n}$

[Figure: threading diagram; labels A1, A2, A3 and sequence positions 4, 11, 19, 28]

Page 12: Clustering in Bacterial Genomes

Given a large gene family in one bacterial genome, do its genes tend to cluster?

IK Jordan et al. (2001) Genome Res 11:555-565

Page 13: Kolmogorov-Smirnov Tests

The Kolmogorov-Smirnov test examines whether $\hat{X}_1, \hat{X}_2, \ldots, \hat{X}_n$ come from distribution function $F(x)$.

$$\mathbb{P}\{F(X) \le x\} = \mathbb{P}\{X \le F^*(x)\} = F(F^*(x)) = x,$$

where $F^*$ denotes the inverse of $F$, so $F(\hat{X}_1), F(\hat{X}_2), \ldots, F(\hat{X}_n)$ are uniformly distributed.

Are $\hat{U}_1, \hat{U}_2, \ldots, \hat{U}_n$ uniformly distributed?

M Kendall & A Stuart (1979) The Advanced Theory of Statistics, Vol. 2, p. 476

Page 14: Kolmogorov-Smirnov Tests

Are $\hat{U}_1, \hat{U}_2, \ldots, \hat{U}_n$ uniformly distributed?

$$D_n = \sqrt{n}\,\max_{k=1,2,\ldots,n}\left|U_k^* - \frac{k}{n}\right|$$

where $U_1^* \le U_2^* \le \cdots \le U_n^*$ are the ordered values.

[Figure: plot of the points $(U_k^*, k/n)$; horizontal axis "value", vertical axis "cumulative dist", both from 0 to 1]

Page 15: Kolmogorov-Smirnov Tests

[Figure: interval divided into consecutive gaps $E_1, E_2, E_3, \ldots, E_n, E_{n+1}$]

$$U_k^* = \frac{Z_k}{Z_{n+1}}, \quad \text{where } Z_k = \sum_{i=1}^{k} E_i$$

and each $E_i$ is exponential (1) distributed. With $S_k = Z_k - k = \sum_{i=1}^{k} (E_i - 1)$,

$$\sqrt{n}\left(U_k^* - \frac{k}{n}\right) = \sqrt{n}\left(\frac{Z_k}{Z_{n+1}} - \frac{k}{n}\right) \approx \frac{S_k}{\sqrt{n}} - \frac{k}{n}\,\frac{S_n}{\sqrt{n}} \approx B\!\left(\frac{k}{n}\right) - \frac{k}{n}\,B(1)$$

L Breiman (1992) Probability

Page 16: Kolmogorov-Smirnov Tests

Are $\hat{U}_1, \hat{U}_2, \ldots, \hat{U}_n$ uniformly distributed?

$$D_n = \sqrt{n}\,\max_{k=1,2,\ldots,n}\left|U_k^* - \frac{k}{n}\right| \approx \max_{0 \le t \le 1}\left|B(t) - tB(1)\right|$$

Page 17: Clustering in Bacterial Genomes

Given a large gene family in one linear genome, do its genes tend to cluster?

[Figure: linear genome scaled to the interval from 0 to 1, with gene positions $U_1^*, U_2^*, U_3^*, \ldots, U_n^*$]

$$\mathbb{P}\left\{D_n = \sqrt{n}\,\max_{k=1,2,\ldots,n}\left|U_k^* - \frac{k}{n}\right| > y\right\} \approx \mathbb{P}\left\{\max_{0 \le t \le 1}\left|B(t) - tB(1)\right| > y\right\}$$
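The linear-genome test can be sketched as follows. This is an illustration under stated assumptions, not the speaker's implementation: the function names are hypothetical, the statistic uses only the distances $|U_k^* - k/n|$ as written on the slide (library Kolmogorov-Smirnov routines also compare against $(k-1)/n$), and the tail of the Brownian-bridge maximum is evaluated with Kolmogorov's classical series $2\sum_{j \ge 1} (-1)^{j-1} e^{-2j^2y^2}$.

```python
import math

def ks_statistic(u_values):
    """sqrt(n) * max_k |U*_k - k/n| for values assumed to lie in [0, 1]."""
    u_sorted = sorted(u_values)
    n = len(u_sorted)
    return math.sqrt(n) * max(
        abs(u - k / n) for k, u in enumerate(u_sorted, start=1)
    )

def brownian_bridge_tail(y, terms=100):
    """Approximate P{max_{0<=t<=1} |B(t) - t B(1)| > y} (Kolmogorov series).

    The alternating series converges slowly for very small y, so the
    result is clamped to [0, 1]; it is accurate for y of about 0.3 or more.
    """
    if y <= 0.0:
        return 1.0
    total = sum(
        (-1) ** (j - 1) * math.exp(-2.0 * j * j * y * y)
        for j in range(1, terms + 1)
    )
    return min(1.0, max(0.0, 2.0 * total))

# Hypothetical gene positions, already scaled by genome length to [0, 1].
positions = [0.02, 0.05, 0.07, 0.11, 0.48, 0.52, 0.55, 0.93]
d_n = ks_statistic(positions)
print(d_n, brownian_bridge_tail(d_n))
```

The asymptotic tail is a large-sample approximation, so for a small gene family the p-value it returns should be read as rough guidance only.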

Page 18: Clustering in Bacterial Genomes

Given a large gene family in one circular genome, do its genes tend to cluster?

[Figure: genome scaled to the interval from 0 to 1, with points $U_1^*, U_2^*, U_3^*, \ldots, U_n^*, U_{n+1}^*$]

$$U_k^* = \frac{Z_k}{Z_{n+1}}, \quad \text{where } Z_k = \sum_{i=1}^{k} E_i$$

and each $E_i$ is exponential (1) distributed.

IK Jordan et al. (2001) Genome Res 11:555-565

Page 19: Clustering in Bacterial Genomes

[Figure: genome scaled to the interval from 0 to 1, with points $U_1^*, U_2^*, U_3^*, \ldots, U_n^*, U_{n+1}^*$]

$$U_k^* = \frac{Z_k}{Z_{n+1}}, \quad \text{where } Z_k = \sum_{i=1}^{k} E_i$$

and each $E_i$ is exponential (1) distributed.

$$U_k^* - U_{k-1}^* = \frac{E_k}{Z_{n+1}} \approx n^{-1} E_k$$

The spacings $U_k^* - U_{k-1}^*$ are approximately exponential ($n$) distributed.

Page 20: Clustering in Bacterial Genomes

Given a large gene family in one circular genome, do its genes tend to cluster?

[Figure: genome scaled to the interval from 0 to 1, with points $U_1^*, U_2^*, U_3^*, \ldots, U_n^*, U_{n+1}^*$]

$$\mathbb{P}\left\{D_n = \sqrt{n}\,\max_{k=1,2,\ldots,n}\left|1 - \exp\left[-n\left(U_k^* - U_{k-1}^*\right)\right] - \frac{k}{n}\right| > y\right\} \approx \mathbb{P}\left\{\max_{0 \le t \le 1}\left|B(t) - tB(1)\right| > y\right\}$$

with the transformed spacings $1 - \exp[-n(U_k^* - U_{k-1}^*)]$ taken in increasing order.

IK Jordan et al. (2001) Genome Res 11:555-565
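A possible rendering of the circular-genome test is sketched below. The function names are hypothetical, and several choices are assumptions rather than details from the slides: the spacings include the wrap-around gap, the exponential rate is taken to be the number of spacings, and the transformed spacings are sorted before comparison with $k/m$. At least two gene positions are assumed.

```python
import math

def circular_spacings(positions, genome_length):
    """Gaps between consecutive positions on a circle, as fractions of its length."""
    pts = sorted(p % genome_length for p in positions)
    m = len(pts)
    # The last gap wraps around from the final point back to the first.
    gaps = [(pts[(i + 1) % m] - pts[i]) % genome_length for i in range(m)]
    return [g / genome_length for g in gaps]

def spacing_ks_statistic(positions, genome_length):
    """KS-type statistic on spacings transformed by the exponential(m) cdf."""
    gaps = circular_spacings(positions, genome_length)
    m = len(gaps)
    # 1 - exp(-m * gap) is approximately uniform when the genes are unclustered.
    transformed = sorted(1.0 - math.exp(-m * g) for g in gaps)
    return math.sqrt(m) * max(
        abs(v - k / m) for k, v in enumerate(transformed, start=1)
    )

# Hypothetical gene coordinates on a 100 kb circular genome.
print(spacing_ks_statistic([5_000, 6_500, 7_200, 40_000, 41_000, 90_000], 100_000))
```

The resulting statistic can be compared against the same Brownian-bridge tail as in the linear case; the exponential approximation to the spacings is only asymptotic, so small families deserve caution.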

Page 21: Clustering in Bacterial Genomes

Given a set of restriction sites in a genome, do the sites tend to cluster?

[Figure: genome scaled to the interval from 0 to 1, with points $U_1^*, U_2^*, U_3^*, \ldots, U_n^*, U_{n+1}^*$]

An $r$-scan is a sum of $r$ consecutive spacings, $X_i + X_{i+1} + \cdots + X_{i+r-1}$. For $r = 3$: $X_1 + X_2 + X_3$, $X_2 + X_3 + X_4$, ...

$m_k^{(r)}$: $k$th minimum in an $r$-scan

$M_k^{(r)}$: $k$th maximum in an $r$-scan

S Karlin & C Macken (1991) J Amer Stat Assoc 86:27-35

Page 22: Clustering in Bacterial Genomes

$$\mathbb{P}\left\{M_k^{(r)} < \frac{1}{n}\left[\ln n + (r-1)\ln\ln n + x\right]\right\} \approx \sum_{i=0}^{k-1} \frac{\mu^i}{i!}\exp(-\mu), \quad \mu = \frac{e^{-x}}{(r-1)!}$$

$$\mathbb{P}\left\{m_k^{(r)} > \frac{x}{n^{1+1/r}}\right\} \approx \sum_{i=0}^{k-1} \frac{\lambda^i}{i!}\exp(-\lambda), \quad \lambda = \frac{x^r}{r!}$$

A Dembo & S Karlin (1992) Ann Appl Prob 2:329-357

C Chen & S Karlin (2000) J Appl Prob 37:865-880
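The r-scan approximations can be sketched in Python as below. This is a hedged illustration: the function names are hypothetical, the points are assumed to be scaled to the unit interval, and the Poisson-type sums follow the standard Karlin-Macken asymptotics for $n$ uniform points (the maximum formula needs $n \ge 3$ so that $\ln\ln n$ is defined and sensible).

```python
import math

def r_scans(points, r):
    """Sums of r consecutive spacings between sorted points."""
    pts = sorted(points)
    return [pts[i + r] - pts[i] for i in range(len(pts) - r)]

def poisson_lower_sum(k, mean):
    """sum_{i=0}^{k-1} mean^i * exp(-mean) / i!"""
    term, total = math.exp(-mean), 0.0
    for i in range(k):
        total += term
        term *= mean / (i + 1)
    return total

def min_rscan_p_value(m_obs, n, r, k=1):
    """Approximate P{k-th minimum r-scan <= m_obs}; small values suggest clustering."""
    x = m_obs * n ** (1.0 + 1.0 / r)
    return 1.0 - poisson_lower_sum(k, x ** r / math.factorial(r))

def max_rscan_p_value(M_obs, n, r, k=1):
    """Approximate P{k-th maximum r-scan >= M_obs}; small values suggest large gaps."""
    x = n * M_obs - math.log(n) - (r - 1) * math.log(math.log(n))
    mu = math.exp(-x) / math.factorial(r - 1)
    return 1.0 - poisson_lower_sum(k, mu)

# Hypothetical restriction-site positions scaled to [0, 1].
sites = [0.03, 0.10, 0.11, 0.12, 0.35, 0.60, 0.61, 0.88]
scans = r_scans(sites, 2)
print(min(scans), min_rscan_p_value(min(scans), len(sites), 2))
```

These are asymptotic approximations in $n$, so with only a handful of sites the p-values are indicative at best.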

Page 23: Clustering of Conservation

[Figure: sequence positions 1 to L, each marked as a conserved or non-conserved nucleotide]

After accounting for edge effects, could uniformly random conserved and non-conserved nucleotides be as clustered as the data from intergenic regions?

Alternative with some very long conserved clusters: a scan or local run test is powerful against this alternative.

Alternative with many short conserved clusters: the hypergeometric test offers more power against this alternative.

Page 24: Clustering of Conservation

Given $m$ conserved positions and $n$ non-conserved positions, calculate the probability that exactly $k$ of the conserved positions are followed by a non-conserved position.

Extreme cases:
  • $k$ = 0 or 1 corresponds to complete separation of conserved and non-conserved positions.
  • $k = \min\{m, n\}$ corresponds to complete mixing.

Page 25: Clustering of Conservation

Given $m$ conserved positions and $n$ non-conserved positions, calculate the probability that exactly $k$ of the conserved positions are followed by a non-conserved position.

Hypergeometric distribution:

$$p(m, n; k) = \frac{\binom{m}{k}\binom{n}{k}}{\binom{m+n}{n}}$$

Page 26: Clustering of Conservation

Count the number of ways of placing $m$ conserved positions (1) and $n$ non-conserved positions (0) so that exactly $k$ of the conserved positions are followed by a non-conserved position (10).

Example: 0110001001110011, with the 10 pairs marked: 0 1 [10] 0 0 [10] 0 1 1 [10] 0 1 1

Equivalently, count the number of ways of placing $k$ 10’s, $n-k$ 0’s, and $m-k$ 1’s so that none of the 1’s is followed by a 0.

Page 27: Clustering of Conservation

Count the number of ways of placing $k$ 10’s, $n-k$ 0’s, and $m-k$ 1’s so that no 0 follows a 1.

Place the $k$ 10’s and $m-k$ 1’s in arbitrary order: $\binom{m}{k}$ ways.

Example: 1 10 10 1 1 10 1 1

Place the $n-k$ 0’s in $k+1$ bins: $\binom{n}{k}$ ways.

Example: 0 . 00 . 0 . 0
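The counting argument can be verified numerically. The sketch below is illustrative (the function names are hypothetical): it counts 10 transitions in a 0/1 string, evaluates the hypergeometric-type probability $\binom{m}{k}\binom{n}{k}/\binom{m+n}{n}$, and sums the lower tail, on the assumption that few 10 transitions indicate clustering. Edge effects, which the slides mention, are not modeled here.

```python
from itertools import combinations
from math import comb

def count_10(bits):
    """Number of conserved positions (1) followed by a non-conserved one (0)."""
    return sum(1 for a, b in zip(bits, bits[1:]) if a == 1 and b == 0)

def transition_probability(m, n, k):
    """P{exactly k of the m ones are followed by a zero}, all orderings equally likely."""
    return comb(m, k) * comb(n, k) / comb(m + n, n)

def clustering_p_value(bits):
    """Lower-tail probability of seeing at most the observed number of 10 pairs."""
    m = sum(bits)
    n = len(bits) - m
    k_obs = count_10(bits)
    return sum(transition_probability(m, n, k) for k in range(k_obs + 1))

example = [int(c) for c in "0110001001110011"]
print(count_10(example), clustering_p_value(example))

# Brute-force check of the formula for m = n = 3.
tally = {}
for ones in combinations(range(6), 3):
    bits = [1 if i in ones else 0 for i in range(6)]
    tally[count_10(bits)] = tally.get(count_10(bits), 0) + 1
print(tally)
```

The example string is the one shown on the slide; it has eight conserved and eight non-conserved positions and three 10 pairs.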

Page 28: PSSM Motif Clustering

Given PSSM signals in pieces of DNA, does any piece have an unusual number of signals?

For 1/0 signals, a $\chi^2$ test suffices.

Page 29: PSSM Motif Clustering

Given PSSM signals in pieces of DNA, does any piece have an unusual number of signals?

Consider the strength of the signal.

M Frith & Zhiping Weng (2001)

Page 30: PSSM Motif Clustering

Given PSSM signals in pieces of DNA, does any piece have an unusual number of signals?

(1) Assume an independent, identically distributed DNA base composition.

(2) Assume the PSSM signals, appropriately truncated, follow a compound Poisson process with parameters $(\lambda, Z)$, $Z \ge 0$.

S Schbath et al. (1998) J Comp Biol 5:223-253

Page 31: PSSM Motif Clustering

[Figure: compound Poisson process $(\lambda, Z)$ along a “time” axis from 0 to $L$]

The tail probability $\mathbb{P}\{N_Z(L) \ge t\}$ can be calculated by small sample asymptotics.

$$\mathbb{E}\exp[\theta N_Z(L)] = \exp\{\lambda L(\mathbb{E}e^{\theta Z} - 1)\}$$

The logarithm of this expression is the cumulant generating function of the sum of signals.
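The generating function of a compound Poisson sum follows by conditioning on the number of signals. The derivation below is a sketch under notational assumptions that are not confirmed by the transcript: $N_Z(L) = \sum_{j=1}^{J} Z_j$, where $J$ is Poisson with mean $\lambda L$ and the signal strengths $Z_j$ are independent copies of $Z$, independent of $J$; the symbols $J$, $\lambda$, and $\theta$ are chosen here for illustration.

```latex
\begin{aligned}
\mathbb{E}\exp[\theta N_Z(L)]
  &= \sum_{j=0}^{\infty} \mathbb{P}\{J = j\}\,\bigl(\mathbb{E}e^{\theta Z}\bigr)^{j} \\
  &= \sum_{j=0}^{\infty} e^{-\lambda L}\,\frac{(\lambda L)^{j}}{j!}\,\bigl(\mathbb{E}e^{\theta Z}\bigr)^{j} \\
  &= \exp\bigl\{\lambda L\,(\mathbb{E}e^{\theta Z} - 1)\bigr\},
\end{aligned}
\qquad\text{so}\qquad
\ln \mathbb{E}\exp[\theta N_Z(L)] = \lambda L\,(\mathbb{E}e^{\theta Z} - 1).
```

Small sample (saddlepoint) approximations to the tail probability work from this cumulant generating function and its derivatives in $\theta$.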

Page 32: PSSM Motif Clustering

Given different PSSM signals in a piece of DNA, are any of the signals unusually concentrated?

(1) Assume an independent, identically distributed DNA base composition.

(2) Assume the PSSM signals, appropriately truncated, follow a compound Poisson process with parameters $(\lambda, Z)$, $Z \ge 0$.

Page 33: PSSM Motif Clustering

$$\hat{M}(L) = \sup_{0 \le u \le v \le L}\,[S(v) - S(u)]$$

$$S(t) = N_Z(t) - at$$

Page 34: Alignment Matrices

Page 35: Alignment Score Renewal

[Figure: local alignment score on a single diagonal]

renewal: local score below 0

$K$: random renewal length

Page 36: Alignment Score Success

[Figure: local alignment score on a single diagonal]

success: local score above $y$

$\mathbb{P}(E_y)$: probability of success

Page 37: HSP Poisson Distribution

[Figure: alignment matrix with sides $m$ and $n$]

$$0 < \lambda = \lim_{y \to \infty} \frac{mn\,\mathbb{P}(E_y)}{\mathbb{E}K}$$

$$\lim_{y \to \infty} \mathbb{P}(E_y) = 0$$

$$\lim_{y \to \infty} \mathbb{P}\{N(E_y) = j\} = e^{-\lambda}\,\frac{\lambda^j}{j!}$$

S Karlin & A Dembo (1992) Adv Appl Prob 24:113

Page 38: Finite-Size Effect

[Figure: local alignment score on a single diagonal]

success: local score above $y$

$\mathbb{E}[T \mid E_y]$: expected time to success

Page 39: Finite-Size Effect

[Figure: alignment matrix with sides $m$ and $n$]

$$\hat{\lambda}_{AG} = \lim_{y \to \infty} \frac{(m - \mathbb{E}[T \mid E_y])\,(n - \mathbb{E}[T \mid E_y])\,\mathbb{P}(E_y)}{\mathbb{E}K}$$

$$\lim_{y \to \infty} \mathbb{P}(E_y) = 0$$

$$\lim_{y \to \infty} \mathbb{P}\{N(E_y) = j\} = e^{-\lambda}\,\frac{\lambda^j}{j!}$$

S Altschul & W Gish (1996) Methods Enzymology 266

Page 40: PSSM Motif Clustering

$$\hat{M}(L) = \sup_{0 \le u \le v \le L}\,[S(v) - S(u)]$$

[Equation: approximation to the tail probability $\mathbb{P}\{\hat{M}(L) \ge y\}$, involving $e^{-\theta^* y}$, $L$, $a$, and expectations of $Z e^{\theta^* Z}$; garbled in the transcript]

$$\lambda^{-1} a\,\theta^* = \mathbb{E}\exp(\theta^* Z) - 1$$
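A decay exponent of this kind can be found numerically. The sketch below rests on an assumption that the transcript does not confirm: that $\theta^*$ is the positive root of the Cramér-Lundberg type condition $\lambda(\mathbb{E}e^{\theta Z} - 1) = a\theta$ for the drifted process $S(t) = N_Z(t) - at$, with negative mean drift ($\lambda\,\mathbb{E}Z < a$). The function names, the bisection bracket, and the exponential choice for $Z$ are all illustrative.

```python
import math

def lundberg_root(rate, drift, mgf, upper, iterations=200):
    """Positive root of rate * (mgf(theta) - 1) = drift * theta, by bisection.

    mgf(theta) must return E exp(theta * Z) and be finite on (0, upper].
    """
    def g(theta):
        return rate * (mgf(theta) - 1.0) - drift * theta

    lo, hi = upper * 1e-6, upper
    # g is convex with g(0) = 0; negative drift makes g negative near 0.
    if not (g(lo) < 0.0 < g(hi)):
        raise ValueError("bracket must satisfy g(lo) < 0 < g(upper)")
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if g(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def exponential_mgf(mu):
    """E exp(theta * Z) for Z exponential with rate mu (valid for theta < mu)."""
    return lambda theta: mu / (mu - theta)

# Illustrative values: rate 1, drift a = 1, Z exponential with rate 2.
# The closed-form root in this case is mu - rate / drift = 1.
print(lundberg_root(1.0, 1.0, exponential_mgf(2.0), upper=1.9))
```

For exponential signal strengths the root has the closed form used in the comment, which makes this case a convenient check before substituting an empirical PSSM score distribution.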

Page 41: Summary of Techniques

Clustering in bacterial genomes
  • Minimum distance statistic (with Fisher omnibus test)
  • Kolmogorov-Smirnov tests
  • Scan tests
  • Local run (BLAST-like) tests using Poisson process (PP)

Clustering of intergenic conservation
  • Hypergeometric test

Clustering of PSSM motifs
  • Chi-square for 1/0 “motifs”
  • Small sample asymptotic methods for PSSM motifs
  • Local run (BLAST-like) tests for PSSM motifs using compound PP

Page 42: Some Statistical Methods For Detecting Clustering In Biological Sequences