A mathematical model of the genetic code: structure and applications

Antonino Sciarrino
Università di Napoli “Federico II”
INFN, Sezione di Napoli

TAG 2006, Annecy-le-Vieux, 9 November 2006


Mathematical Model of the Genetic Code

Work in collaboration with

Luc FRAPPAT, Paul SORBA, Diego COCURULLO

SUMMARY

Introduction
Description of the model
Applications:
    Codon usage frequencies
    DNA dimers free energy
Work in progress

It is amazing that the complex biochemical relations between DNA and proteins were very quickly reduced to a mathematical model. Just a few months after the WATSON-CRICK discovery, G. GAMOW proposed the “diamond code”.

Gamow “diamond code”

Gamow, Nature (1954)

Nucleotides are denoted by the numbers 1, 2, 3, 4

Amino acids FIT the rhomb-shaped “holes” formed by the 4 nucleotides: 20 a.a.!

Since 1954 many mathematical models of the genetic code have been proposed (based on information, thermodynamic, symmetry, topology… arguments)

Weak point of the models: often poor explanatory and/or predictive power

The genetic code

Crystal basis model of the genetic code

The 4 bases C, U/T (pyrimidines) and G, A (purines) are identified by a pair of “spin” labels (+ ≡ +1/2, − ≡ −1/2):

C ≡ (+,+), U/T ≡ (−,+), G ≡ (+,−), A ≡ (−,−)

L. Frappat, A. Sciarrino, P. Sorba: Phys. Lett. A (1998)

Mathematically, C, U/T, G, A transform as the 4 basis vectors of the irrep. (1/2, 1/2) of $U_{q \to 0}(sl(2)_H \oplus sl(2)_V)$


Dinucleotides are composite states (16 basis vectors of $(1/2, 1/2)^{\otimes 2}$) belonging to “sets” identified by two integer numbers $J_H$, $J_V$

In each “set” the dinucleotide is identified by two labels:

$-J_H \le J_{H,3} \le J_H$
$-J_V \le J_{V,3} \le J_V$

Ex. CU = (+,+) ⊗ (−,+)   ($J_H = 0$, $J_{H,3} = 0$; $J_V = 1$, $J_{V,3} = 1$)

This follows from a property of $U_{q \to 0}(sl(2))$

DINUCLEOTIDE Representation Content
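The representation content can be regenerated with a short script. What follows is a minimal sketch, not the authors' code. It assumes the spin assignment C = (+,+), U/T = (−,+), G = (+,−), A = (−,−) implied by the examples in these slides, and it assumes the usual bracket-cancellation (signature) rule for sl(2) crystals: reading left to right, every "+" that is followed by a "−" cancels with it, and the surviving signs fix J (half their number) and J_3 (half their sum). The names `SPIN`, `sl2_labels`, `labels` and `dinucleotide_content` are illustrative only.

```python
from collections import defaultdict
from fractions import Fraction

# Assumed (H, V) spin signs per nucleotide, read off the examples in the slides.
SPIN = {"C": (+1, +1), "U": (-1, +1), "T": (-1, +1), "G": (+1, -1), "A": (-1, -1)}


def sl2_labels(signs):
    """Return (J, J3) for a sequence of +1/-1 signs using bracket cancellation."""
    plus = 0   # '+' signs not yet cancelled
    minus = 0  # '-' signs that found no '+' on their left to cancel with
    for s in signs:
        if s > 0:
            plus += 1
        elif plus > 0:
            plus -= 1  # a '+' followed by a '-' drops out
        else:
            minus += 1
    return Fraction(plus + minus, 2), Fraction(plus - minus, 2)


def labels(seq):
    """Return (J_H, J_H3, J_V, J_V3) for a nucleotide string."""
    j_h, j_h3 = sl2_labels([SPIN[b][0] for b in seq])
    j_v, j_v3 = sl2_labels([SPIN[b][1] for b in seq])
    return j_h, j_h3, j_v, j_v3


def dinucleotide_content():
    """Group the 16 dinucleotides by their (J_H, J_V) set."""
    content = defaultdict(list)
    for x in "CUGA":
        for y in "CUGA":
            j_h, _, j_v, _ = labels(x + y)
            content[(j_h, j_v)].append(x + y)
    return dict(content)


if __name__ == "__main__":
    for (j_h, j_v), members in sorted(dinucleotide_content().items(), reverse=True):
        print(f"({j_h}, {j_v}): {' '.join(members)}")
```

Under these assumptions the 16 dinucleotides split as 9 + 3 + 3 + 1 over the sets (1,1), (1,0), (0,1) and (0,0), which matches the dimension count of $(1/2, 1/2)^{\otimes 2}$.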


Codons are composite states (64 basis vectors of $(1/2, 1/2)^{\otimes 3}$) belonging to “sets” identified by half-integer $J_H$, $J_V$

(“set” ≡ irreducible representation = irrep.)

Ex. CUA = (+,+) ⊗ (−,+) ⊗ (−,−)   ($J_H = 1/2$, $J_{H,3} = -1/2$; $J_V = 1/2$, $J_{V,3} = 1/2$)

This follows from a property of $U_{q \to 0}(sl(2))$

Codons in the crystal basis

Codon usage frequency

Synonymous codons are not used uniformly (codon bias)
Codon bias (not fully understood) is ascribed to evolutive-selective effects
Codon bias depends on:
    Biological species (b.sp.)
    Sequence analysed
    Amino acid (a.a.) encoded
    Structure of the considered multiplet
    Nature of codon XYZ
    …

Codon usage in Homo sap.

Our analysis deals with global codon usage, i.e. computed over all the coding sequences (exonic region) for the b.sp. of the considered specimen,

in order to bring out possible general features of the standard eukaryotic genetic code ascribable to its organisation and its evolution

Let us define the codon usage probability for the codon XZN (X, Z, N ∈ {A, C, G, U}; U is T in DNA):

$P(XZN) = \lim_{n \to \infty} n_{XZN} / N_{tot}$

$n_{XZN}$: number of times codon XZN is used in the processes
$N_{tot}$: total number of codons in the same processes

Normalization, for fixed XZ: $\sum_N P(XZN) = 1$
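As an illustration of the definition, here is a minimal sketch that estimates P(XZN) from finite codon counts. It assumes that $N_{tot}$ is counted over the four codons of the quartet XZ, which is what the normalization at fixed XZ requires. The function name `quartet_usage` and the counts in the example are made up for illustration and are not data from the talk.

```python
def quartet_usage(counts, xz, third="ACGU"):
    """Estimate P(XZN) for the quartet XZ from a dict of codon counts.

    Assumption: probabilities are normalized within the quartet, so that
    the sum over the third nucleotide N equals 1.
    """
    codons = [xz + n for n in third]
    total = sum(counts.get(c, 0) for c in codons)
    if total == 0:
        raise ValueError(f"no codons observed for quartet {xz}")
    return {c: counts.get(c, 0) / total for c in codons}


if __name__ == "__main__":
    toy_counts = {"CCA": 10, "CCC": 30, "CCG": 20, "CCU": 40}  # made-up numbers
    for codon, p in quartet_usage(toy_counts, "CC").items():
        print(codon, round(p, 3))
```

With real data the counts would come from all coding sequences of one biological species, and the finite-sample ratio only approximates the limit in the definition.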

Note - Sextets are considered as quartets + doublets: 8 quartets

Def. - Correlation coefficient $r_{XY}$ for two variables X ≡ P(..X), Y ≡ P(..Y)
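The formula on this slide did not survive extraction. The sketch below assumes it is the standard Pearson correlation coefficient, evaluated over the biological species of a specimen, with X and Y the usage probabilities of two codons of the same quartet. The function name `pearson_r` is illustrative.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)
```

Here `xs` would hold, for example, P(..C) for each species and `ys` the matching P(..A); a value close to −1 means that species using one codon more often use the other less often.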

Specimen (GenBank Release 149.0, 09/2005 - $N_{codons}$ > 100,000)

26 VERTEBRATES
28 INVERTEBRATES
38 PLANTS
TOTAL - 92 biological species

Correlation coefficient VERTEBRATES

r_XY        r_CA    r_UG    r_UC    r_AG    r_UA    r_CG
P          -0.89   -0.69   -0.75   -0.55   -0.76   -0.21
T          -0.92   -0.71   -0.89   -0.68   -0.91   -0.40
A          -0.88   -0.53   -0.89   -0.60   -0.76   -0.30
S          -0.92   -0.77   -0.87   -0.60   -0.75   -0.51
V          -0.84   -0.93   -0.69   -0.74   -0.68   -0.53
L          -0.83   -0.93   -0.87   -0.91   -0.87   -0.69
R          -0.90   -0.93   -0.39   -0.27   -0.41   -0.11
G          -0.94   -0.89   -0.75   -0.74   -0.77   -0.56
<r>_a.a.   -0.89   -0.80   -0.76   -0.64   -0.74   -0.41

Correlation coefficient PLANTS

r_XY        r_CA    r_UG    r_UC    r_AG    r_UA    r_CG
P          -0.91   -0.81   -0.54   -0.61   -0.41   -0.48
T          -0.94   -0.87   -0.79   -0.59   -0.75   -0.48
A          -0.94   -0.93   -0.72   -0.57   -0.63   -0.55
S          -0.87   -0.86   -0.75   -0.78   -0.71   -0.56
V          -0.66   -0.72   -0.75   -0.65   -0.71   -0.15
L          -0.72   -0.85   -0.57   -0.52   -0.54   -0.17
R          -0.76   -0.66   -0.67   -0.50   -0.16   -0.49
G          -0.83   -0.48   -0.73   -0.14   -0.36   -0.07
<r>_a.a.   -0.83   -0.77   -0.69   -0.55   -0.53   -0.37

Correlation coefficient INVERTEBRATES

r_XY        r_CA    r_UG    r_UC    r_AG    r_UA    r_CG
P          -0.78   -0.63   -0.50   -0.74   -0.20   -0.52
T          -0.85   -0.87   -0.74   -0.76   -0.62   -0.60
A          -0.82   -0.79   -0.75   -0.68   -0.51   -0.53
S          -0.91   -0.83   -0.71   -0.86   -0.55   -0.79
V          -0.78   -0.92   -0.66   -0.78   -0.72   -0.46
L          -0.49   -0.92   -0.48   -0.66   -0.50   -0.25
R          -0.55   -0.76   -0.76   -0.27   -0.01   -0.53
G          -0.73   -0.48   -0.57   -0.14   -0.02   -0.08
<r>_a.a.   -0.74   -0.78   -0.65   -0.61   -0.38   -0.47

Averaged value of P(..N)

Averaged value of sum of two correlated P(N)

Ratios of $\sigma^2_{obs}(X+Y)$ and $\sigma^2_{th}(X+Y) = \sigma^2_{obs}(X) + \sigma^2_{obs}(Y)$, averaged over the 8 a.a., for the sum of two codon probabilities
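A short worked identity shows why this ratio probes correlation. It assumes that the quantity compared on the slide is the variance over the biological species (the symbol is lost in the extracted text) and that the "theoretical" value is the one expected for uncorrelated variables:

```latex
\begin{align}
\sigma^2(X+Y) &= \sigma^2(X) + \sigma^2(Y) + 2\, r_{XY}\, \sigma(X)\, \sigma(Y), \\
\frac{\sigma^2_{obs}(X+Y)}{\sigma^2_{th}(X+Y)}
  &= 1 + \frac{2\, r_{XY}\, \sigma_{obs}(X)\, \sigma_{obs}(Y)}
              {\sigma^2_{obs}(X) + \sigma^2_{obs}(Y)} .
\end{align}
```

A ratio below 1 therefore signals a negative $r_{XY}$, and it tends to 0 when the two probabilities are fully anticorrelated with equal spread.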

Indication of correlation between the codon usage probabilities P(..A) and P(..C) (and between P(..U) and P(..G)) for quartets.

Correlation between codon probabilities for different a.a.

Correlation coefficients between the 28 couples P(XZN)-P(X’Z’N), where XZ (X’Z’) specify the 8 quartets. The following pattern comes out for the whole eukaryote specimen (n = 92)

Eucar.     r_XZA-X’Z’A   r_XZC-X’Z’C   r_XZG-X’Z’G   r_XZU-X’Z’U
Ser-Thr        0.88          0.94          0.90          0.80
Ser-Pro        0.93          0.90          0.87          0.91
Ser-Ala        0.86          0.93          0.82          0.81
Thr-Pro        0.83          0.91          0.93          0.74
Thr-Ala        0.91          0.93          0.93          0.94
Pro-Ala        0.86          0.93          0.93          0.94
Leu-Val        0.85          0.82         -0.70          0.96

The set of 8 quartets splits into 3 subsets

4 a.a. with correlated codon usage (Ser, Pro, Ala, Thr)

2 a.a. with correlated codon usage (Leu, Val)

2 a.a. with generally uncorrelated codon usage (Arg, Gly)

Statistical analysis

Correlation for P(XZA)-P(XZC), XZ quartets

Correlation for P(..N) within {Ser, Pro, Thr, Ala} and within {Leu, Val}

The observed correlations fit well in the mathematical scheme of the crystal basis model of the genetic code


In the crystal basis model P(XYZ) can be written as function of

ASSUMPTION

SUM RULES

K INDEPENDENT OF THE b.sp.

XZ QUARTETS

SUM RULES

“Theoretical” correlation matrix

XZ = NC, CG, GG, CU, GU

Observed averaged value of the correlation matrix, in red the theoretical value

Irrep. ($J_H$, $J_V$)   Codons
(3/2, 3/2)          Pro CCC, Ser UCC, Ala GCC, Thr ACC
(1/2, 3/2)^1        Pro CCU, Ser UCU, Ala GCU, Thr ACU
(3/2, 1/2)^1        Pro CCG, Ser UCG, Ala GCG, Thr ACG
(1/2, 1/2)^1        Pro CCA, Ser UCA, Ala GCA, Thr ACA
(1/2, 3/2)^2        Leu CUC, Leu CUU, Val GUC, Val GUU
(1/2, 1/2)^2        Leu CUG, Leu CUA, Val GUG, Val GUA

Shannon Entropy

Let us define the Shannon entropy for the amino acid specified by the first two nucleotides XZ (8 quartets)
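The defining formula is missing from the extracted slide. The sketch below assumes the usual form $S_{XZ} = -\sum_N P(XZN) \log_2 P(XZN)$, with the sum over the four codons of the quartet. The logarithm base (2, giving bits) and the function name `shannon_entropy` are assumptions made here, not taken from the talk.

```python
import math


def shannon_entropy(probs):
    """Shannon entropy in bits of a discrete probability distribution.

    Terms with zero probability contribute nothing (limit of p*log p as p -> 0).
    """
    if any(p < 0 for p in probs) or abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must be non-negative and sum to 1")
    return -sum(p * math.log2(p) for p in probs if p > 0)
```

For a quartet the value ranges from 0, when a single codon is always used, to 2 bits, when the four codons are used uniformly.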

Shannon Entropy

Using the previous expression for P(XZN) we get

Notation: …_N ≡ …(XZN), $H_{bs,N} \equiv H_{bs}(XZN)$, $P_N \equiv P(XZN)$

$S_{XZ}$ largely independent of the b.sp.

Shannon Entropy

DNA dinucleotide free energy

Free energy for a pair of nucleotides, e.g. GC, lying on one strand of DNA, coupled with the complementary pair, CG, on the other strand.

CG from 5’ → 3’ correlated with GC from 3’ → 5’

DINUCLEOTIDE Representation Content

SUM RULES for FREE ENERGY

Comparison with exp. data

ΔG in kcal/mol

DINUCLEOTIDE Distribution

Comparison with experimental data

Work in progress and future perspectives

From the correspondence {C, U/T, G, A} ↔ I.R. (1/2, 1/2) of $U_{q \to 0}(sl(2)_H \oplus sl(2)_V)$

Any ordered N-nucleotide sequence ↔ Vector of I.R. $(1/2, 1/2)^{\otimes N}$ of $U_{q \to 0}(sl(2)_H \oplus sl(2)_V)$

New parametrisation of nucleotide sequences

“Spin” parametrisation

Algorithm for the “spin” parametrisation of an ordered n-nucleotide sequence

From this parametrisation:

Alternative construction of a mutation model, where the mutation intensity does not depend on the Hamming distance between the sequences, but on the change of “labels” of the “sets”.
C. Minichini, A.S., Biosystems (2006)

Characterization of particular sequences (exons, introns, promoter, 5’ or 3’ UTR sequences, …)
L. Frappat, P. Sorba, A.S., L. Vuillon, in progress

For each gene of Homo sap. (total ~28,000 genes):
    Consider the N-nucleotide coding sequence (CDS)
    Compute the “labels” $J_H$, $J_{H,3}$; $J_V$, $J_{V,3}$ for any n-nucleotide subsequence (1 ≤ n ≤ N)
    Plot “labels” versus n

Red $J_H$ - Green $J_{H,3}$
Blue $J_V$ - Black $J_{V,3}$
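The label-versus-n curves can be produced in a single pass over a sequence. The following is a minimal sketch under stated assumptions, not the algorithm shown in the talk: "n-nucleotide subsequence" is taken to mean the first n nucleotides of the CDS, the spin assignment is C = (+,+), U/T = (−,+), G = (+,−), A = (−,−), and the labels follow from the bracket-cancellation rule for sl(2) crystals (a "+" followed by a "−" drops out). The names `SPIN` and `label_track` are illustrative.

```python
from fractions import Fraction

# Assumed (H, V) spin signs per nucleotide.
SPIN = {"C": (+1, +1), "U": (-1, +1), "T": (-1, +1), "G": (+1, -1), "A": (-1, -1)}


def label_track(seq):
    """Return [(n, J_H, J_H3, J_V, J_V3), ...] for every prefix length n = 1..N."""
    plus = [0, 0]   # uncancelled '+' signs for the H and V components
    minus = [0, 0]  # uncancelled '-' signs for the H and V components
    track = []
    for n, base in enumerate(seq.upper(), start=1):
        for k, sign in enumerate(SPIN[base]):
            if sign > 0:
                plus[k] += 1
            elif plus[k] > 0:
                plus[k] -= 1  # this '-' cancels an earlier '+'
            else:
                minus[k] += 1
        row = [n]
        for k in range(2):
            row.append(Fraction(plus[k] + minus[k], 2))  # J
            row.append(Fraction(plus[k] - minus[k], 2))  # J_3
        track.append(tuple(row))
    return track


if __name__ == "__main__":
    for row in label_track("CUA"):
        print(*row)
```

Because only two counters per component are updated at each step, the whole track for a CDS of length N costs O(N), which makes it practical to scan tens of thousands of genes.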

Numerical estimator

Define for any sequence of length N

Plot number of CDS with the same value of Diff (Sum) versus Diff (Sum)
Compute Diff (Sum) for 28,000 random sequences (300 < N < 4300) with uniform probability for each nucleotide
Comparison number of CDS - random sequences

Conclusions

Correlations in codon usage frequencies computed over the whole exonic region fit well in the mathematical scheme of the crystal basis model of the genetic code

Missing explanation for the correlations

Formalism of crystal basis model useful to parametrize free energy for DNA dimers

More generally, use of the $U_{q \to 0}(sl(2)_H \oplus sl(2)_V)$ mathematical structure may be useful to describe sequences of nucleotides.
