Yaroslav Ryabov Lognormal Pattern of Exon size distributions in Eukaryotic genomes.
-
Upload
kayley-chesmore -
Category
Documents
-
view
215 -
download
2
Transcript of Yaroslav Ryabov Lognormal Pattern of Exon size distributions in Eukaryotic genomes.
Yaroslav Ryabov
Lognormal Pattern
of Exon size distributions
in Eukaryotic genomes
Outline• Global and local approaches in analysis of genetic
information
• Vocabulary of contemporary genetics: Exons and Introns
• What we can learn from exon size distributions? Lognormal pattern and two clases of exons
• How we can model exon size distributions of real genomes?
• What could be the biological reason for observed pattern of exons size distributions?
• Conclusions
Analyzing Genetic Information
Local approach
analyzing details of nucleotide sequence Basic Local
Alignment Search Tool1990
Watson and Crick1953Marshall Nirenberg1968
Global approach
analyzing properties of entire genome
Georg Johann Mendel1866
Recombination of inherited properties
Frequency of mutations
Size of genome etc.
Analyzing Genetic Information in Post-Genomic Era
More than 60 animal genomes with complete annotations
Analyzing Genetic Information in Post-Genomic Era
Back to Global Approach ?
More than 60 animal genomes with complete annotations
Prokaryote and Eukaryote
pro + karyon
before + nucleos
eu + karyon
good + nucleos
Gene expression
DNA
Transcriptionwith RNA polymerase
mRNATranslationwith ribosome
Protein
DNA
Transcriptionwith RNA polymerase
mRNA
Exons
Introns
Splicing
Splicing in Eukaryotes
Prokaryote vs Eukaryote
Most of DNA code is used to produce some cellular product
Substantial fraction “silent” DNA regions
Exons Introns
Short genomes: ~ 1 000 exons
Long genomes: ~ 100 000 exons
Long exons: ~ 1 000 base pairs
Short exons: ~ 100 base pairs
0.0 0.5 1.0 1.5 2.00
10
20
30
40
50
60
70
80
Num
ber o
f cou
nts
in th
ousa
nds
Exon length in 1000 base pairs
0.0 0.5 1.0 1.5 2.00.00
0.05
0.10
0.15
0.20
Num
ber
of c
ount
s in
tho
usan
ds
Exon length in 1000 base pairs
Prokaryote vs Eukaryote Exon size distributions
Homo SapiensFlavobacteria Bacterium
0.0 0.5 1.0 1.5 2.00
10
20
30
40
50
60
70
80
Num
ber o
f cou
nts
in th
ousa
nds
Exon length in 1000 base pairs
0.0 0.5 1.0 1.5 2.00.00
0.05
0.10
0.15
0.20
Num
ber
of c
ount
s in
tho
usan
ds
Exon length in 1000 base pairs
Homo SapiensFlavobacteria Bacterium
2 592 exons 1 075 b.p. mean exon size
299 298 exons 267 b.p. mean exon size
What we can learn from exonsize distributions ?
Assumption:
Probability to split DNA at any location and any time is Constant
Consequence:
The lengths of intervals between splitting points obey
Exponential distribution
Poisson Process
0.0 0.5 1.0 1.5 2.00
10
20
30
40
50
60
70
80
Num
ber
of c
ount
s in
tho
usan
ds
Exon length, 1000 b.p.
= Const frequency of splitting
0 1 2 3 40
10
20
30
40
Num
ber
of c
ount
s in
thou
sand
s
Log10
(Exon Length)
Assumption:
Probability to split DNA at any location and any time is Constant
or in logarithm scale
Consequence:
The lengths of intervals between splitting points obey
Exponential distribution
Poisson Process
= Const frequency of splitting
Poisson Process Kolmogoroff Process
Lognormal distributionExponential distribution
0 1 2 3 40
10
20
30
40
Num
ber
of c
ount
s in
tho
usan
ds
Log10
(Exon Length)
0 1 2 3 40
10
20
30
40
Num
ber
of c
ount
s in
thou
sand
s
Log10
(Exon Length)
Lognormal distribution is a consequence of Central Limit Theorem
Sumof random variables
Normal distribution
Lognormal distribution
Productof random variables
Kolmogoroff process (1941)
Kolmogoroff process (1941)
Kolmogoroff process (1941)
Assumption:
Probability to split any exon is Independent of exon size
Kolmogoroff process (1941)
Assumption:
Probability to split any exon is Independent of exon size
Kolmogoroff process (1941)
Assumption:
Probability to split any exon is Independent of exon size
Kolmogoroff process (1941)
Assumption:
Probability to split any exon is Independent of exon size
Kolmogoroff process (1941)
Log (Exon Size)
Num
ber
of C
ount
s
Assumption:
Probability to split any exon is Independent of exon size
Consequence:
The lengths of exons obey
Lognormal distributionM
2s
Real Genomes: Two Lognormal Peaks
1 2 3 4 5 6 7 8 9 10
5
10
15
20
25
30
Ln(Exon Size)
peak A M = 4.80 ± 0.01 = 0.85 ± 0.01
peak B M = 5.44 ± 0.09 = 3.10 ± 0.20
Homo Sapiens
Data: SatDatHS_BModel: TwoGauss Chi^2/DoF = 182733.65479R^2 = 0.99722 y0 0 ±0A 28378.88917 ±679.3328w 0.85204 ±0.01438xc 4.80305 ±0.00554A1 15303.67377 ±838.0008w1 3.09794 ±0.15891xc1 5.43988 ±0.0879
Numb
er of
Cou
nts (t
hous
ands
)
1 2 3 4 5 6 7 8 9 10
1
2
3
4
5
6peak A M = 5.07 ± 0.01 = 0.53 ± 0.02
peak B M = 5.78 ± 0.01 = 2.10 ± 0.02
Drosophila MelanogasterData: StaDR_BModel: TwoGauss Chi^2/DoF = 6036.76567R^2 = 0.99806 y0 0 ±0A 10674.50108 ±120.44604w 2.09918 ±0.02005xc 5.77986 ±0.01405A1 1461.13835 ±83.49117w1 0.53408 ±0.02349xc1 5.07437 ±0.00955
Numb
er of
Cou
nts (t
hous
ands
)
Ln(Exon Size)
2 4 6 8 100
5
10
15
20
25
30
35
peak A M = 4.81 ± 0.01 = 0.82 ± 0.02
peak B M = 5.08 ± 0.07 = 3.13 ± 0.18
Num
ber
of C
oun
ts (
tho
usan
ds)
Ln(Exon Size)
Pan troglodytes (chimpanzee)
2 4 6 8 100
1
2
3
4
5
6
7
peak A M = 5.07 ± 0.01 = 0.55 ± 0.03
peak B M = 5.53 ± 0.01 = 1.89 ± 0.02
Nu
mb
er
of
Co
un
ts (
tho
usa
nd
s)
Ln(Exon Size)
Anopheles gambiae (mosquito)
2 4 6 8 100
5
10
15
20
25
30
35
peak A M = 4.59 ± 0.05 = 2.06 ± 0.12
peak B M = 4.83 ± 0.01 = 0.77 ± 0.03
Nu
mb
er
of
Co
un
ts (
tho
usa
nd
s)
Ln(Exon Size)
Bos taurus (cow)
2 4 6 8 100
2
4
6
8
10
12
14
16
peak A M = 4.75 ± 0.02 = 0.71 ± 0.05
peak B M = 5.22 ± 0.03 = 1.46 ± 0.03
Nu
mb
er
of
Co
un
ts (
tho
usa
nd
s)
Ln(Exon Size)
Caenorhabditis elegans (worm)
2 4 6 8 100
5
10
15
20
25
30
35
peak A M = 4.62 ± 0.05 = 2.10 ± 0.12
peak B M = 4.84 ± 0.01 = 0.77 ± 0.02
Nu
mb
er
of
Co
un
ts (
tho
usa
nd
s)
Ln(Exon Size)
Canis familiaris (dog)
2 4 6 8 100
5
10
15
20
25
30
35
40
peak A M = 4.50 ± 0.05 = 2.49 ± 0.11
peak B M = 4.83 ± 0.01 = 0.77 ± 0.02
Nu
mb
er
of
Co
un
ts (
tho
usa
nd
s)
Ln(Exon Size)
Danio rerio (zebrafish)
2 4 6 8 100
5
10
15
20
25
30
peak A M = 4.56 ± 0.06 = 2.21 ± 0.13
peak B M = 4.83 ± 0.01 = 0.76 ± 0.03
Nu
mb
er
of
Co
un
ts (
tho
usa
nd
s)
Ln(Exon Size)
Gallus gallus (chiken)
2 4 6 8 100
5
10
15
20
25
30
35
peak A M = 4.81 ± 0.01 = 0.82 ± 0.02
peak B M = 4.86 ± 0.05 = 2.73 ± 0.16
Nu
mb
er
of
Co
un
ts (
tho
usa
nd
s)
Ln(Exon Size)
Macaca mulatta (rhesus macaque)
2 4 6 8 100
5
10
15
20
25
30
35
peak A M = 4.82 ± 0.01 = 0.84 ± 0.02
peak B M = 5.38 ± 0.10 = 3.09 ± 0.20
Nu
mb
er
of
Co
un
ts (
tho
usa
nd
s)
Ln(Exon Size)
Mus musculus (house mouse)
2 4 6 8 100
5
10
15
20
25
30
35
peak A M = 4.85 ± 0.01 = 0.76 ± 0.02
peak B M = 5.00 ± 0.03 = 1.95 ± 0.10
Nu
mb
er
of
Co
un
ts (
tho
usa
nd
s)
Ln(Exon Size)
Tetraodon nigroviridis (spotted green pufferfish)
Parameters of Two Lognormal peaks model: M location of peak maximum
Ryabov & Gribskov, Nucleic Acids Research, 36, 2756-2763 (2008)
Parameters of Two Lognormal peaks model: s peak width
Ryabov & Gribskov, Nucleic Acids Research, 36, 2756-2763 (2008)
Ryabov & Gribskov, Nucleic Acids Research, 36, 2756-2763 (2008)
Parameters of Two Lognormal peaks model: peak area Narrow and Wide, %
Summary of Observed Facts
• Exon lengths distributions of studied eukaryotic genomes can be fitted by Two Lognormal Distributions
• Parameters of those two peaks follow two distinctive patterns: changes of peak width and relative peak area correlate with complexity of species.
• This may indicate presence of two different classes of exons with different evolutionary histories
How we can model exon size distributions of real genomes ?
Growth of total genome length
Duplications in genome code
Merging exons together (intron loss)
Exon loss
any exon of length has probability to be modified during time interval
If
Parameters of Kolmogoroff Processas model parameters for elementary “exon modification” event
average number of exons with sizes
and
with
peak position peak width
Then the process converges to a lognormal distribution with
distribution
Exons splitting
(decreasing Exon length)
Increasing Exon length
Parameters of Q(k) function
0 1 2
d Q(k)
k
0 1 2
d Q(k)
k
0 1 2
d Q(k)
k
Exons duplicating
Initial exon size distribution A mockup of a bacterial genome with 500 exons
Having 1000 bp mean exons length
Modeling exon size distributions in real genomes
10000
1x103
2x103
Dis
trib
utio
n de
nsity
Log (Exon Length)
2 4 6 8 100
1x104
2x104
3x104
4x104
Dis
trib
utio
n de
nsity
Log (Exon Length)
0.0 0.5 1.0 1.5 2.0
dQ (k) Wide peakQ()=1.200
k
15 000 time steps for p=0.001
Modeling exon size distributions in real genomes
2 4 6 8 100
1x104
2x104
3x104
4x104
Dis
trib
utio
n de
nsity
Log (Exon Length)
0.0 0.5 1.0 1.5 2.0
dQ (k) Wide peakQ()=1.200
k
0.0 0.5 1.0 1.5 2.0
dQ (k) Narrow peakQ()=1.263
k
15 000 time steps for p=0.001
Modeling exon size distributions in real genomes
2 4 6 8 100
1x104
2x104
3x104
4x104
Dis
trib
utio
n de
nsity
Log (Exon Length)0.0 0.5 1.0 1.5 2.0
dQ (k) Narrow peakQ()=1.263
k
0.0 0.5 1.0 1.5 2.0
dQ (k) Wide peakQ()=1.200
k
15 000 time steps for p=0.001
Modeling exon size distributions in real genomes
50 000 time steps for p=0.001
0.0 0.5 1.0 1.5 2.0
dQ (k) Wide peakQ()=1.049
k
2 4 6 8 100
1x103
2x103
3x103
4x103
5x103
6x103
Dis
trib
utio
n de
nsity
Log (Exon Length)
Modeling exon size distributions in real genomes
2 4 6 8 100
1x103
2x103
3x103
4x103
5x103
6x103
Dis
trib
utio
n de
nsity
Log (Exon Length)
50 000 time steps for p=0.001
0.0 0.5 1.0 1.5 2.0
dQ (k) Wide peakQ()=1.049
k
0.0 0.5 1.0 1.5 2.0
dQ (k) Narrow peakQ()=1.010
k
Modeling exon size distributions in real genomes
2 4 6 8 100
1x103
2x103
3x103
4x103
5x103
6x103
Dis
trib
utio
n de
nsity
Log (Exon Length)
50 000 time steps for p=0.001
0.0 0.5 1.0 1.5 2.0
dQ (k) Wide peakQ()=1.049
k
0.0 0.5 1.0 1.5 2.0
dQ (k) Narrow peakQ()=1.010
k
Modeling exon size distributions in real genomes
What cloud be the biological reason for two exons peaks?
Narrow peak
Wide peak
Num
ber
of C
ount
s
Log (Exon Size)
Narrow peak
Holds approximately Constant position
Has approximately Constant width
Has greater relative occupation for Complex organisms
Wide peak
Changes position
Width increases with increasing complexity of and organism
Has greater relative occupation for Simple organisms
What could be the biological reason for two exons peaks?
DNA
mRNA
Protein
Alterative SplicingUntranslated
regions of mRNA
Introns
Two ways of Gene Expression Regulation
DNA
mRNA
Protein
Untranslated
regions of mRNA
Introns
Two ways of Gene Expression Regulation
Conclusions• Analysis of global properties of eukaryotic genomes reveals two
distinct peaks in statistical distribution of exon sizes
• The observed peak could be fitted by a sum of two lognormal distributions which may imply that they originated by two different exon splitting pathways described in the general frameworks of Kolmogoroff splitting process
• Two observed peaks of exons could be correlated with the phenomenon of alternative splicing and with exons contributing into untranslated regions of mRNA. This suggests that the observed separation of exons in two different classes may be originated from two different ways of protein expression regulation.
Acknowledgments
Michael Gribskov
Alexander Berezhkovskii