A mathematical model of the genetic code: structure and applications
Antonino Sciarrino
Università di Napoli “Federico II”, INFN, Sezione di Napoli
TAG 2006, Annecy-le-Vieux, 9 November 2006
Mathematical Model of the Genetic Code
Work in collaboration with Luc FRAPPAT, Paul SORBA, Diego COCURULLO
SUMMARY
Introduction
Description of the model
Applications:
  Codon usage frequencies
  DNA dimers free energy
Work in progress
It is amazing that the complex biochemical relations between DNA and proteins were very quickly reduced to a mathematical model. Just a few months after the WATSON-CRICK discovery, G. GAMOW proposed the “diamond code”.
Gamow “diamond code”
Gamow, Nature (1954)
Nucleotides are denoted by the numbers 1, 2, 3, 4
Amino acids fit the rhomb-shaped “holes” formed by the 4 nucleotides: 20 a.a.!
Since 1954 many mathematical models of the genetic code have been proposed (based on information, thermodynamic, symmetry, topology… arguments).
Weak point of the models: often poor explanatory and/or predictive power.
The genetic code
Crystal basis model of the genetic code
The 4 bases C, U/T (pyrimidines) and G, A (purines) are identified by a couple of “spin” labels ($+ \equiv +1/2$, $- \equiv -1/2$)
L. Frappat, A. Sciarrino, P. Sorba: Phys. Lett. A (1998)
Mathematically: C, U/T, G, A transform as the 4 basis vectors of the irrep. $(1/2, 1/2)$ of $U_{q \to 0}(sl(2)_H \oplus sl(2)_V)$
Crystal basis model of the genetic code
Dinucleotides are composite states (16 basis vectors of $(1/2, 1/2)^{\otimes 2}$) belonging to “sets” identified by two integer numbers $J_H$, $J_V$
In each “set” the dinucleotide is identified by two labels: $-J_H \le J_{H,3} \le J_H$, $-J_V \le J_{V,3} \le J_V$
Ex. CU = $(+,+) \otimes (-,+)$ ($J_H = 0$, $J_{H,3} = 0$; $J_V = 1$, $J_{V,3} = 1$)
Follows from a property of $U_{q \to 0}(sl(2))$
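As a dimension check on the 16 dinucleotide states, the tensor square can be decomposed with the usual spin-addition rule applied separately to the H and V labels. This is the standard decomposition, written here as a sketch; the slide's own table of which dinucleotide sits in which set is not reproduced.

```latex
\left(\tfrac{1}{2},\tfrac{1}{2}\right)^{\otimes 2}
  = (1,1) \oplus (0,1) \oplus (1,0) \oplus (0,0),
\qquad
9 + 3 + 3 + 1 = 16 .
```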
DINUCLEOTIDE Representation Content
Crystal basis model of the genetic code
Codons are composite states (64 basis vectors of $(1/2, 1/2)^{\otimes 3}$) belonging to “sets” identified by half-integer $J_H$, $J_V$ (“set” = irreducible representation = irrep.)
Ex. CUA = $(+,+) \otimes (-,+) \otimes (-,-)$ ($J_H = 1/2$, $J_{H,3} = -1/2$; $J_V = 1/2$, $J_{V,3} = 1/2$)
Follows from a property of $U_{q \to 0}(sl(2))$
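The labelling of codons can be sketched in a few lines of Python. The spin assignments C = (+,+), U = (-,+), G = (+,-), A = (-,-) are read off the CUA example; the pairing rule used below (a "+" followed later by a "-" combines into a singlet, and the unpaired signs give the total spin) is an assumption about how the $q \to 0$ tensor product acts, chosen because it reproduces the codon assignments listed in these slides. The function names are hypothetical.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# (H, V) spin signs per nucleotide, read off the CUA example.
SPIN = {"C": (+1, +1), "U": (-1, +1), "G": (+1, -1), "A": (-1, -1)}


def sl2_labels(signs):
    """Return (J, J3) for a word of +1/-1 signs.

    Assumed rule: each '-' cancels the latest still-unpaired '+' to its
    left; the unpaired signs that remain add up to the total spin.
    """
    plus = minus = 0  # unpaired '+' and unpaired '-' seen so far
    for s in signs:
        if s > 0:
            plus += 1
        elif plus > 0:
            plus -= 1
        else:
            minus += 1
    return Fraction(plus + minus, 2), Fraction(plus - minus, 2)


def crystal_labels(seq):
    """Return (J_H, J_H3, J_V, J_V3) for a nucleotide word."""
    seq = seq.upper().replace("T", "U")
    j_h, j_h3 = sl2_labels([SPIN[n][0] for n in seq])
    j_v, j_v3 = sl2_labels([SPIN[n][1] for n in seq])
    return j_h, j_h3, j_v, j_v3


# Count how many of the 64 codons carry each pair (J_H, J_V).
content = Counter()
for codon in map("".join, product("CUGA", repeat=3)):
    j_h, _, j_v, _ = crystal_labels(codon)
    content[(j_h, j_v)] += 1

print(crystal_labels("CUA"))
print(content)
```

Under this rule each of the four pairs (J_H, J_V) collects 16 codons, which matches one copy of (3/2, 3/2), two copies each of (3/2, 1/2) and (1/2, 3/2), and four copies of (1/2, 1/2).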
Codons in the crystal basis
Codon usage frequency
Synonymous codons are not used uniformly (codon bias)
Codon bias (not fully understood) is ascribed to evolutive-selective effects
Codon bias depends on:
  Biological species (b.sp.)
  Sequence analysed
  Amino acid (a.a.) encoded
  Structure of the considered multiplet
  Nature of codon XYZ
  …
Codon usage in Homo sap.
Our analysis deals with global codon usage, i.e. computed over all the coding sequences (exonic region) for the b.sp. of the considered specimen.
The aim is to bring out possible general features of the standard eukaryotic genetic code ascribable to its organisation and its evolution.
Let us define the codon usage probability for the codon XZN ($X, Z, N \in \{A, C, G, U\}$; T in DNA):
$P(XZN) = \lim_{n \to \infty} n_{XZN} / N_{tot}$
$n_{XZN}$: number of times the codon XZN is used in the processes
$N_{tot}$: total number of codons in the same processes
Normalization, for fixed XZ: $\sum_N P(XZN) = 1$
Note: sextets are considered as quartets + doublets, giving 8 quartets
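A finite-sample estimate of these probabilities can be sketched as follows. The function name `quartet_usage` and the toy sequences are hypothetical; the code replaces the limit by observed counts and normalises within each quartet, so that the four probabilities for fixed XZ sum to 1.

```python
from collections import Counter

# First two nucleotides of the 8 quartets of the standard code
# (the sextets Leu, Ser, Arg contribute only their quartet part).
QUARTETS = {"CC": "Pro", "AC": "Thr", "GC": "Ala", "UC": "Ser",
            "GU": "Val", "CU": "Leu", "CG": "Arg", "GG": "Gly"}


def quartet_usage(cds_list):
    """Estimate P(XZN) for each quartet XZ from in-frame coding sequences."""
    counts = Counter()
    for cds in cds_list:
        cds = cds.upper().replace("T", "U")
        for i in range(0, len(cds) - 2, 3):  # step through complete codons
            codon = cds[i:i + 3]
            if codon[:2] in QUARTETS and codon[2] in "ACGU":
                counts[codon] += 1
    usage = {}
    for xz in QUARTETS:
        total = sum(counts[xz + n] for n in "ACGU")
        if total:  # skip quartets that never occur in the sample
            usage[xz] = {n: counts[xz + n] / total for n in "ACGU"}
    return usage


# Made-up sequences, for illustration only.
toy = ["CCCCCACCGCCC", "GGTGGCGGAGGG"]
for xz, probs in quartet_usage(toy).items():
    print(QUARTETS[xz], xz, probs)
```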
Def.: correlation coefficient $r_{XY}$ for the two variables $P(..X)$ and $P(..Y)$
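The formula on this slide is not preserved in the transcript. A minimal sketch, assuming the usual Pearson (linear) correlation coefficient is meant, with one data point per biological species; the sample values below are made up.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length >= 2")
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Made-up values of P(..C) and P(..A) for four species (illustration only).
p_c = [0.30, 0.35, 0.25, 0.40]
p_a = [0.28, 0.22, 0.33, 0.18]
print(pearson_r(p_c, p_a))
```

A value close to -1 means that a species using one codon more often tends to use the other correspondingly less.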
Specimen (GenBank Release 149.0, 09/2005; $N_{codons}$ > 100.000)
26 VERTEBRATES
28 INVERTEBRATES
38 PLANTS
TOTAL: 92 biological species
Correlation coefficient VERTEBRATES
r_XY      r_CA   r_UG   r_UC   r_AG   r_UA   r_CG
P        -0.89  -0.69  -0.75  -0.55  -0.76  -0.21
T        -0.92  -0.71  -0.89  -0.68  -0.91  -0.40
A        -0.88  -0.53  -0.89  -0.60  -0.76  -0.30
S        -0.92  -0.77  -0.87  -0.60  -0.75  -0.51
V        -0.84  -0.93  -0.69  -0.74  -0.68  -0.53
L        -0.83  -0.93  -0.87  -0.91  -0.87  -0.69
R        -0.90  -0.93  -0.39  -0.27  -0.41  -0.11
G        -0.94  -0.89  -0.75  -0.74  -0.77  -0.56
<r>a.a.  -0.89  -0.80  -0.76  -0.64  -0.74  -0.41
Correlation coefficient PLANTS
r_XY      r_CA   r_UG   r_UC   r_AG   r_UA   r_CG
P        -0.91  -0.81  -0.54  -0.61  -0.41  -0.48
T        -0.94  -0.87  -0.79  -0.59  -0.75  -0.48
A        -0.94  -0.93  -0.72  -0.57  -0.63  -0.55
S        -0.87  -0.86  -0.75  -0.78  -0.71  -0.56
V        -0.66  -0.72  -0.75  -0.65  -0.71  -0.15
L        -0.72  -0.85  -0.57  -0.52  -0.54  -0.17
R        -0.76  -0.66  -0.67  -0.50  -0.16  -0.49
G        -0.83  -0.48  -0.73  -0.14  -0.36  -0.07
<r>a.a.  -0.83  -0.77  -0.69  -0.55  -0.53  -0.37
Correlation coefficient INVERTEBRATES
r_XY      r_CA   r_UG   r_UC   r_AG   r_UA   r_CG
P        -0.78  -0.63  -0.50  -0.74  -0.20  -0.52
T        -0.85  -0.87  -0.74  -0.76  -0.62  -0.60
A        -0.82  -0.79  -0.75  -0.68  -0.51  -0.53
S        -0.91  -0.83  -0.71  -0.86  -0.55  -0.79
V        -0.78  -0.92  -0.66  -0.78  -0.72  -0.46
L        -0.49  -0.92  -0.48  -0.66  -0.50  -0.25
R        -0.55  -0.76  -0.76  -0.27  -0.01  -0.53
G        -0.73  -0.48  -0.57  -0.14  -0.02  -0.08
<r>a.a.  -0.74  -0.78  -0.65  -0.61  -0.38  -0.47
[Figures: averaged value of P(..N)]
[Figure: averaged value of the sum of two correlated P(..N)]
Ratios of $\sigma^2_{obs}(X+Y)$ and $\sigma^2_{th}(X+Y) = \sigma^2_{obs}(X) + \sigma^2_{obs}(Y)$, averaged over the 8 a.a., for the sum of two codon probabilities
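The ratio compares the observed variance of a sum with the value expected for two uncorrelated variables. A short sketch, assuming population variances over the species sample (the slide does not say which estimator was used); `variance_ratio` is a hypothetical name and the data are made up.

```python
from statistics import pvariance


def variance_ratio(xs, ys):
    """Observed variance of X+Y divided by var(X) + var(Y).

    The ratio is 1 for uncorrelated samples, below 1 for anticorrelated
    ones and above 1 for positively correlated ones.
    """
    sums = [x + y for x, y in zip(xs, ys)]
    return pvariance(sums) / (pvariance(xs) + pvariance(ys))


# Made-up, anticorrelated codon probabilities for four species.
p_c = [0.30, 0.35, 0.25, 0.40]
p_a = [0.28, 0.22, 0.33, 0.18]
print(variance_ratio(p_c, p_a))
```

A ratio well below 1 for P(..C) + P(..A) is the same signal as a strongly negative $r_{CA}$: the sum fluctuates much less across species than either term.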
Indication of a correlation between the codon usage probabilities P(A) and P(C) (and between P(U) and P(G)) for quartets.
Correlation between codon probabilities for different a.a.
Correlation coefficients between the 28 couples $P_{XZN}$, $P_{X'Z'N}$, where XZ (X'Z') specify the 8 quartets. The following pattern comes out for the whole eukaryote specimen (n = 92):
Eucar.    r_XZA-X'Z'A  r_XZC-X'Z'C  r_XZG-X'Z'G  r_XZU-X'Z'U
Ser-Thr       0.88         0.94         0.90         0.80
Ser-Pro       0.93         0.90         0.87         0.91
Ser-Ala       0.86         0.93         0.82         0.81
Thr-Pro       0.83         0.91         0.93         0.74
Thr-Ala       0.91         0.93         0.93         0.94
Pro-Ala       0.86         0.93         0.93         0.94
Leu-Val       0.85         0.82        -0.70         0.96
The set of 8 quartets splits into 3 subsets:
  4 a.a. with correlated codon usage (Ser, Pro, Ala, Thr)
  2 a.a. with correlated codon usage (Leu, Val)
  2 a.a. with generally uncorrelated codon usage (Arg, Gly)
Statistical analysis
Correlation for P(XZA)-P(XZC), XZ quartets
Correlation for P(N) among {Ser, Pro, Thr, Ala} and among {Leu, Val}
The observed correlations fit well in the mathematical scheme of the crystal basis model of the genetic code
In the crystal basis model, P(XYZ) can be written as a function of [expression not preserved]
ASSUMPTION
SUM RULES
K independent of the b.sp.
XZ quartets
SUM RULES
“Theoretical” correlation matrix, XZ = NC, CG, GG, CU, GU
[Figure: observed averaged value of the correlation matrix; the theoretical value is shown in red]
Irrep. (J_H, J_V)   Codons
(3/2, 3/2)          Pro CCC, Ser UCC, Ala GCC, Thr ACC
(1/2, 3/2)^1        Pro CCU, Ser UCU, Ala GCU, Thr ACU
(3/2, 1/2)^1        Pro CCG, Ser UCG, Ala GCG, Thr ACG
(1/2, 1/2)^1        Pro CCA, Ser UCA, Ala GCA, Thr ACA
(1/2, 3/2)^2        Leu CUC, Leu CUU, Val GUC, Val GUU
(1/2, 1/2)^2        Leu CUG, Leu CUA, Val GUG, Val GUA
Shannon Entropy
Let us define the Shannon entropy for the amino acid specified by the first two nucleotides XZ (8 quartets)
Using the previous expression for P(XZN) we get
N (XZN), $H_{bs,N} \equiv H_{bs}(XZN)$, $P_N \equiv P(XZN)$
$S_{XZ}$ largely independent of the b.sp.
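The entropy formula itself is not preserved in the transcript. A minimal sketch, assuming the standard Shannon form $S_{XZ} = -\sum_N P(XZN) \log_2 P(XZN)$ over the four codons of a quartet; the base of the logarithm is an assumption and is exposed as a parameter.

```python
import math


def shannon_entropy(probs, base=2.0):
    """Shannon entropy of a discrete distribution.

    The input is renormalised to sum to 1, and zero-probability terms
    are skipped (they contribute nothing in the limit p -> 0).
    """
    total = sum(probs)
    if total <= 0:
        raise ValueError("probabilities must have a positive sum")
    return -sum((p / total) * math.log(p / total, base)
                for p in probs if p > 0)


# Uniform usage of the four codons of a quartet gives the maximum, 2 bits.
print(shannon_entropy([0.25, 0.25, 0.25, 0.25]))
# A biased, made-up quartet has lower entropy.
print(shannon_entropy([0.40, 0.30, 0.20, 0.10]))
```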
DNA dinucleotide free energy
Free energy for a pair of nucleotides, e.g. GC, lying on one strand of DNA, coupled with the complementary pair, CG, on the other strand.
CG from 5' → 3' correlated with GC from 3' → 5'
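The pairing of a dimer with the dimer facing it on the other strand can be made explicit with a small sketch. The helper names are hypothetical; the only input is Watson-Crick complementarity. Reading the facing dimer in the 3' → 5' direction gives the position-by-position complement, while reading it 5' → 3' gives the reverse complement, and counting each duplex once leaves 10 distinct dimer duplexes out of the 16 dimers.

```python
from itertools import product

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def facing_dimer(dimer):
    """Dimer on the opposite strand, read 3'->5' (position-wise complement)."""
    return "".join(COMPLEMENT[b] for b in dimer.upper())


def reverse_complement(dimer):
    """The same facing dimer, read 5'->3' on its own strand."""
    return facing_dimer(dimer)[::-1]


# Identify each dimer with its reverse complement: one entry per duplex.
duplexes = {frozenset((d, reverse_complement(d)))
            for d in map("".join, product("ACGT", repeat=2))}

print(facing_dimer("GC"), reverse_complement("GC"))
print(len(duplexes))
```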
DINUCLEOTIDE Representation Content
SUM RULES for FREE ENERGY
Comparison with exp. data
$\Delta G$ in kcal/mol
DINUCLEOTIDE Distribution
Comparison with experimental data
Work in progress and future perspectives
From the correspondence {C, U/T, G, A} ↔ I.R. $(1/2, 1/2)$ of $U_{q \to 0}(sl(2)_H \oplus sl(2)_V)$:
any ordered N-nucleotide sequence ↔ vector of I.R. $(1/2, 1/2)^{\otimes N}$ of $U_{q \to 0}(sl(2)_H \oplus sl(2)_V)$
New parametrization of nucleotide sequences: “spin” parametrisation
Algorithm for the “spin” parametrisation of an ordered n-nucleotide sequence
From this parametrisation:
Alternative construction of a mutation model, where the mutation intensity does not depend on the Hamming distance between the sequences, but on the change of “labels” of the “sets”.
C. Minichini, A.S., Biosystems (2006)
Characterization of particular sequences (exons, introns, promoter, 5' or 3' UTR sequences, …)
L. Frappat, P. Sorba, A.S., L. Vuillon, in progress
For each gene of Homo sap. (total ~28.000 genes):
  Consider the N-nucleotide coding sequence (CDS)
  Compute the “labels” $J_H$, $J_{3H}$; $J_V$, $J_{3V}$ for any n-nucleotide subsequence ($1 \le n \le N$)
  Plot “labels” versus n
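The procedure in this list can be sketched as a running computation along the sequence. Two assumptions are made here and should be read as such: the "n-nucleotide subsequence" is taken to be the first n nucleotides of the CDS, and the labels are obtained with a pairing rule in which each "-" cancels the latest unpaired "+" to its left, with spin signs C = (+,+), U/T = (-,+), G = (+,-), A = (-,-). The name `label_profile` is hypothetical, and the plotting step is left out.

```python
from fractions import Fraction

# (H, V) spin signs per nucleotide; T is treated as U.
SPIN = {"C": (+1, +1), "U": (-1, +1), "G": (+1, -1), "A": (-1, -1)}


def label_profile(seq):
    """Yield (n, J_H, J_3H, J_V, J_3V) for every prefix of length n."""
    seq = seq.upper().replace("T", "U")
    plus = [0, 0]   # unpaired '+' signs for H and V
    minus = [0, 0]  # unpaired '-' signs for H and V
    for n, nuc in enumerate(seq, start=1):
        for k in (0, 1):
            if SPIN[nuc][k] > 0:
                plus[k] += 1
            elif plus[k] > 0:
                plus[k] -= 1  # this '-' pairs with an earlier '+'
            else:
                minus[k] += 1
        yield (n,
               Fraction(plus[0] + minus[0], 2), Fraction(plus[0] - minus[0], 2),
               Fraction(plus[1] + minus[1], 2), Fraction(plus[1] - minus[1], 2))


# Short made-up sequence; a real CDS would be thousands of nucleotides long.
for row in label_profile("ATGGCCCTGA"):
    print(*row)
```

Each row gives one point of the four curves (labels versus n); because the update is incremental, the whole profile costs a single pass over the sequence.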
[Figures: “labels” versus n. Red: $J_H$; green: $J_{3H}$; blue: $J_V$; black: $J_{3V}$]
Numerical estimator
Define for any sequence of length N [definitions of Diff and Sum not preserved]
Plot the number of CDS with the same value of Diff (Sum) versus Diff (Sum)
Compute Diff (Sum) for 28.000 random sequences (300 < N < 4300) with uniform probability for each nucleotide
Comparison of the number of CDS with the random sequences
Conclusions
Correlations in codon usage frequencies computed over the whole exonic region fit well in the mathematical scheme of the crystal basis model of the genetic code
Missing: explanation for the correlations
The formalism of the crystal basis model is useful to parametrize the free energy of DNA dimers
More generally, use of the $U_{q \to 0}(sl(2)_H \oplus sl(2)_V)$ mathematical structure may be useful to describe sequences of nucleotides.