A mathematical model of the genetic code: structure and applications


Transcript of A mathematical model of the genetic code: structure and applications A mathematical model of the...

Page 1: A mathematical model of the genetic code: structure and applications A mathematical model of the genetic code: structure and applications Antonino Sciarrino.

A mathematical model of the genetic code: structure and applications

Antonino Sciarrino
Università di Napoli “Federico II”
INFN, Sezione di Napoli

TAG 2006, Annecy-le-Vieux, 9 November 2006

Page 2: A mathematical model of the genetic code: structure and applications A mathematical model of the genetic code: structure and applications Antonino Sciarrino.

Mathematical Model of the Genetic Code

Work in collaboration with

Luc FRAPPAT
Paul SORBA
Diego COCURULLO

Page 3: A mathematical model of the genetic code: structure and applications A mathematical model of the genetic code: structure and applications Antonino Sciarrino.

SUMMARY

Introduction
Description of the model
Applications:
    Codon usage frequencies
    DNA dimers free energy
Work in progress

Page 4: A mathematical model of the genetic code: structure and applications A mathematical model of the genetic code: structure and applications Antonino Sciarrino.

It is amazing that the complex biochemical relations between DNA and proteins were very quickly reduced to a mathematical model. Just a few months after the WATSON-CRICK discovery, G. GAMOW proposed the “diamond code”.

Page 5: A mathematical model of the genetic code: structure and applications A mathematical model of the genetic code: structure and applications Antonino Sciarrino.

Gamow “diamond code”

Gamow, Nature (1954)

Nucleotides are denoted by the numbers 1, 2, 3, 4

Amino acids FIT the rhomb-shaped “holes” formed by the 4 nucleotides: 20 a.a.!

Page 6: A mathematical model of the genetic code: structure and applications A mathematical model of the genetic code: structure and applications Antonino Sciarrino.

Since 1954 many mathematical models of the genetic code have been proposed (based on information, thermodynamic, symmetry, topology… arguments).

Weak point of the models: often poor explanatory and/or predictive power.

Page 7: A mathematical model of the genetic code: structure and applications A mathematical model of the genetic code: structure and applications Antonino Sciarrino.

The genetic code

Page 8: A mathematical model of the genetic code: structure and applications A mathematical model of the genetic code: structure and applications Antonino Sciarrino.

Crystal basis model of the genetic code

The 4 bases C, U/T (pyrimidines) and G, A (purines) are identified by a couple of “spin” labels $(+ \equiv 1/2,\ - \equiv -1/2)$

L. Frappat, A. Sciarrino, P. Sorba: Phys. Lett. A (1998)

Mathematically, C, U/T, G, A transform as the 4 basis vectors of the irrep. $(1/2, 1/2)$ of $U_{q \to 0}(sl(2)_H \oplus sl(2)_V)$

Page 9: A mathematical model of the genetic code: structure and applications A mathematical model of the genetic code: structure and applications Antonino Sciarrino.

Crystal basis model of the genetic code

Dinucleotides are composite states (16 basis vectors of $(1/2, 1/2)^{\otimes 2}$) belonging to “sets” identified by two integer numbers $J_H$, $J_V$

In each “set” the dinucleotide is identified by two labels:

$-J_H \le J_{H,3} \le J_H$, $-J_V \le J_{V,3} \le J_V$

Ex. CU = $(+,+) \otimes (-,+)$

($J_H = 0$, $J_{H,3} = 0$; $J_V = 1$, $J_{V,3} = 1$)

Follows from property of $U_{q \to 0}(sl(2))$

Page 10: A mathematical model of the genetic code: structure and applications A mathematical model of the genetic code: structure and applications Antonino Sciarrino.

DINUCLEOTIDE Representation Content

Page 11: A mathematical model of the genetic code: structure and applications A mathematical model of the genetic code: structure and applications Antonino Sciarrino.

Crystal basis model of the genetic code

Codons are composite states (64 basis vectors of $(1/2, 1/2)^{\otimes 3}$) belonging to “sets” identified by half-integer $J_H$, $J_V$

(“set” $\equiv$ irreducible representation = irrep.)

Ex. CUA = $(+,+) \otimes (-,+) \otimes (-,-)$

($J_H = 1/2$, $J_{H,3} = -1/2$; $J_V = 1/2$, $J_{V,3} = 1/2$)

Follows from property of $U_{q \to 0}(sl(2))$
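The labels quoted in the examples can be reproduced with a short script. This is only a sketch under stated assumptions: the base assignment C = (+,+), U = (−,+), G = (+,−), A = (−,−) is read off the CUA example, and the pairing rule used here (each "+" followed by a "−" is removed as a singlet, the unpaired signs give $2J$) is one common way to state the $q \to 0$ tensor product rule, chosen because it reproduces the sets listed in these slides. The function names are hypothetical.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Assumed "spin" labels (H, V) of the four bases, read off the CUA example.
SPIN = {"C": (+1, +1), "U": (-1, +1), "G": (+1, -1), "A": (-1, -1)}


def sl2_labels(signs):
    """Return (J, J3) for a list of +1/-1 signs.

    Assumed rule: a '+' followed by a '-' (after earlier cancellations)
    pairs off into a singlet; the signs left unpaired give 2J, and the
    sum of all signs gives 2*J3.
    """
    unpaired = []
    for s in signs:
        if s == -1 and unpaired and unpaired[-1] == +1:
            unpaired.pop()        # a '+-' pair cancels
        else:
            unpaired.append(s)
    return Fraction(len(unpaired), 2), Fraction(sum(signs), 2)


def crystal_labels(seq):
    """Return (J_H, J_H3, J_V, J_V3) for a nucleotide sequence."""
    seq = seq.upper().replace("T", "U")
    j_h, j_h3 = sl2_labels([SPIN[b][0] for b in seq])
    j_v, j_v3 = sl2_labels([SPIN[b][1] for b in seq])
    return j_h, j_h3, j_v, j_v3


def representation_content(length):
    """Count how many sequences of the given length fall in each (J_H, J_V)."""
    return Counter(
        crystal_labels("".join(p))[0::2] for p in product("CUGA", repeat=length)
    )


print(crystal_labels("CU"))    # dinucleotide example
print(crystal_labels("CUA"))   # codon example
print(representation_content(2))
print(representation_content(3))
```

Counting states per $(J_H, J_V)$ gives a quick consistency check: the 16 dinucleotides and the 64 codons must be fully distributed among the sets.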

Page 12: A mathematical model of the genetic code: structure and applications A mathematical model of the genetic code: structure and applications Antonino Sciarrino.

Codons in the crystal basis

Page 13: A mathematical model of the genetic code: structure and applications A mathematical model of the genetic code: structure and applications Antonino Sciarrino.

Codon usage frequency

Synonymous codons are not used uniformly (codon bias)

Codon bias (not fully understood) is ascribed to evolutive-selective effects

Codon bias depends on:
    Biological species (b.sp.)
    Sequence analysed
    Amino acid (a.a.) encoded
    Structure of the considered multiplet
    Nature of codon XYZ
    …

Page 14: A mathematical model of the genetic code: structure and applications A mathematical model of the genetic code: structure and applications Antonino Sciarrino.

Codon usage in Homo sap.

Page 15: A mathematical model of the genetic code: structure and applications A mathematical model of the genetic code: structure and applications Antonino Sciarrino.

Our analysis deals with global codon usage, i.e. computed over all the coding sequences (exonic region) for the b.sp. of the considered specimen.

Aim: to put into evidence possible general features of the standard eukaryotic genetic code ascribable to its organisation and its evolution.

Page 16: A mathematical model of the genetic code: structure and applications A mathematical model of the genetic code: structure and applications Antonino Sciarrino.

Let us define the codon usage probability for the codon XZN ($X, Z, N \in \{A, C, G, U\}$, with T in place of U in DNA):

$P(XZN) = \lim_{n \to \infty} \frac{n_{XZN}}{N_{tot}}$

$n_{XZN}$: number of times codon XZN is used in the processes
$N_{tot}$: total number of codons in the same processes

For fixed XZ, normalization: $\sum_N P(XZN) = 1$

Note: sextets are considered as quartets + doublets, giving 8 quartets
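As an illustration of the definition, the following sketch estimates $P(XZN)$ from finite codon counts. It assumes that $N_{tot}$ is the total count of the four codons of the quartet XZ, which is what the normalization at fixed XZ requires; the counts and the function name are hypothetical, not taken from GenBank.

```python
def quartet_probabilities(counts, xz):
    """Estimate P(XZN), N in {A, C, G, U}, from raw codon counts.

    counts: dict mapping codon -> number of occurrences.
    xz:     first two nucleotides identifying the quartet.
    The finite-sample ratio n_XZN / N_tot stands in for the limit.
    """
    codons = [xz + n for n in "ACGU"]
    total = sum(counts.get(c, 0) for c in codons)
    if total == 0:
        raise ValueError(f"no codons observed for quartet {xz}")
    return {c: counts.get(c, 0) / total for c in codons}


# Hypothetical counts for the Pro quartet (CCN).
counts = {"CCA": 30, "CCC": 40, "CCG": 10, "CCU": 20}
p = quartet_probabilities(counts, "CC")
print(p)
print(sum(p.values()))  # normalization check
```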

Page 17: A mathematical model of the genetic code: structure and applications A mathematical model of the genetic code: structure and applications Antonino Sciarrino.

Def. - Correlation coefficient $r_{XY}$ for two variables

$X \equiv P_{..X}$, $Y \equiv P_{..Y}$
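The formula on this slide did not survive extraction. A minimal sketch, assuming $r_{XY}$ is the usual Pearson correlation coefficient computed across biological species, is given below; the probabilities are hypothetical values for four species, not data from the talk.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)


# Hypothetical P(..C) and P(..A) for one amino acid in four species.
p_c = [0.40, 0.35, 0.30, 0.25]
p_a = [0.20, 0.24, 0.31, 0.33]
r_ca = pearson_r(p_c, p_a)
print(r_ca)
```

A value close to -1 means that, across species, a larger $P_{..C}$ goes with a smaller $P_{..A}$. The function does not guard against a sample with zero variance.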

Page 18: A mathematical model of the genetic code: structure and applications A mathematical model of the genetic code: structure and applications Antonino Sciarrino.

Specimen (GenBank Release 149.0, 09/2005 - $N_{codons}$ > 100,000)

26 VERTEBRATES
28 INVERTEBRATES
38 PLANTS
TOTAL - 92 biological species

Page 19: A mathematical model of the genetic code: structure and applications A mathematical model of the genetic code: structure and applications Antonino Sciarrino.

Correlation coefficient VERTEBRATES

r_XY      r_CA    r_UG    r_UC    r_AG    r_UA    r_CG

P        -0.89   -0.69   -0.75   -0.55   -0.76   -0.21
T        -0.92   -0.71   -0.89   -0.68   -0.91   -0.40
A        -0.88   -0.53   -0.89   -0.60   -0.76   -0.30
S        -0.92   -0.77   -0.87   -0.60   -0.75   -0.51
V        -0.84   -0.93   -0.69   -0.74   -0.68   -0.53
L        -0.83   -0.93   -0.87   -0.91   -0.87   -0.69
R        -0.90   -0.93   -0.39   -0.27   -0.41   -0.11
G        -0.94   -0.89   -0.75   -0.74   -0.77   -0.56

<r>a.a.  -0.89   -0.80   -0.76   -0.64   -0.74   -0.41
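The last row of the vertebrate table can be checked directly. The sketch below assumes that $\langle r \rangle_{a.a.}$ is the plain arithmetic mean of each column over the 8 amino acids; the numbers are copied from the table, and agreement is tested only up to the two-decimal rounding of the reported values.

```python
# Vertebrate correlation coefficients, one row per amino acid,
# columns in the order r_CA, r_UG, r_UC, r_AG, r_UA, r_CG.
COLUMNS = ["CA", "UG", "UC", "AG", "UA", "CG"]
ROWS = {
    "P": [-0.89, -0.69, -0.75, -0.55, -0.76, -0.21],
    "T": [-0.92, -0.71, -0.89, -0.68, -0.91, -0.40],
    "A": [-0.88, -0.53, -0.89, -0.60, -0.76, -0.30],
    "S": [-0.92, -0.77, -0.87, -0.60, -0.75, -0.51],
    "V": [-0.84, -0.93, -0.69, -0.74, -0.68, -0.53],
    "L": [-0.83, -0.93, -0.87, -0.91, -0.87, -0.69],
    "R": [-0.90, -0.93, -0.39, -0.27, -0.41, -0.11],
    "G": [-0.94, -0.89, -0.75, -0.74, -0.77, -0.56],
}
REPORTED = [-0.89, -0.80, -0.76, -0.64, -0.74, -0.41]

# Column-wise arithmetic mean over the 8 amino acids.
means = [
    sum(row[i] for row in ROWS.values()) / len(ROWS)
    for i in range(len(COLUMNS))
]

for name, mean, reported in zip(COLUMNS, means, REPORTED):
    print(f"r_{name}: mean {mean:.4f}, reported {reported:.2f}")
```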

Page 20: A mathematical model of the genetic code: structure and applications A mathematical model of the genetic code: structure and applications Antonino Sciarrino.

Correlation coefficient PLANTS

r_XY      r_CA    r_UG    r_UC    r_AG    r_UA    r_CG

P        -0.91   -0.81   -0.54   -0.61   -0.41   -0.48
T        -0.94   -0.87   -0.79   -0.59   -0.75   -0.48
A        -0.94   -0.93   -0.72   -0.57   -0.63   -0.55
S        -0.87   -0.86   -0.75   -0.78   -0.71   -0.56
V        -0.66   -0.72   -0.75   -0.65   -0.71   -0.15
L        -0.72   -0.85   -0.57   -0.52   -0.54   -0.17
R        -0.76   -0.66   -0.67   -0.50   -0.16   -0.49
G        -0.83   -0.48   -0.73   -0.14   -0.36   -0.07

<r>a.a.  -0.83   -0.77   -0.69   -0.55   -0.53   -0.37

Page 21: A mathematical model of the genetic code: structure and applications A mathematical model of the genetic code: structure and applications Antonino Sciarrino.

Correlation coefficient INVERTEBRATES

r_XY      r_CA    r_UG    r_UC    r_AG    r_UA    r_CG

P        -0.78   -0.63   -0.50   -0.74   -0.20   -0.52
T        -0.85   -0.87   -0.74   -0.76   -0.62   -0.60
A        -0.82   -0.79   -0.75   -0.68   -0.51   -0.53
S        -0.91   -0.83   -0.71   -0.86   -0.55   -0.79
V        -0.78   -0.92   -0.66   -0.78   -0.72   -0.46
L        -0.49   -0.92   -0.48   -0.66   -0.50   -0.25
R        -0.55   -0.76   -0.76   -0.27   -0.01   -0.53
G        -0.73   -0.48   -0.57   -0.14   -0.02   -0.08

<r>a.a.  -0.74   -0.78   -0.65   -0.61   -0.38   -0.47

Page 22: A mathematical model of the genetic code: structure and applications A mathematical model of the genetic code: structure and applications Antonino Sciarrino.

Averaged value of P(..N)

Page 23: A mathematical model of the genetic code: structure and applications A mathematical model of the genetic code: structure and applications Antonino Sciarrino.

Averaged value of P(..N)

Page 24: A mathematical model of the genetic code: structure and applications A mathematical model of the genetic code: structure and applications Antonino Sciarrino.

Averaged value of sum of two correlated P(N)

Page 25: A mathematical model of the genetic code: structure and applications A mathematical model of the genetic code: structure and applications Antonino Sciarrino.

Ratios of $\sigma^2_{obs}(X+Y)$ and $\sigma^2_{th}(X+Y) = \sigma^2_{obs}(X) + \sigma^2_{obs}(Y)$, averaged over the 8 a.a., for the sum of two codon probabilities
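Why this ratio probes correlation can be seen from the standard identity for the variance of a sum. The following is a general statistical fact, not a formula taken from the slide; $\sigma^2_{th}$ is read here as the variance expected if X and Y were uncorrelated, and $r_{XY}$ is assumed to be the Pearson coefficient.

```latex
\sigma^2(X+Y) = \sigma^2(X) + \sigma^2(Y) + 2\, r_{XY}\, \sigma(X)\, \sigma(Y)
\quad\Longrightarrow\quad
\frac{\sigma^2_{obs}(X+Y)}{\sigma^2_{th}(X+Y)}
  = 1 + \frac{2\, r_{XY}\, \sigma_{obs}(X)\, \sigma_{obs}(Y)}
             {\sigma^2_{obs}(X) + \sigma^2_{obs}(Y)}
```

A ratio below 1 therefore signals anticorrelated X and Y ($r_{XY} < 0$), and a ratio near 1 signals no linear correlation.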

Page 26: A mathematical model of the genetic code: structure and applications A mathematical model of the genetic code: structure and applications Antonino Sciarrino.

Indication for correlation for codon usage probabilities P(A) and P(C) (P(U) and P(G)) for quartets.

Page 27: A mathematical model of the genetic code: structure and applications A mathematical model of the genetic code: structure and applications Antonino Sciarrino.

Correlation between codon probabilities for different a.a.

Correlation coefficients between the 28 couples $P_{XZN-X'Z'N}$, where XZ (X'Z') specify the 8 quartets. The following pattern comes out for the whole eukaryote specimen (n = 92):

Eukar.     r_XZA-X'Z'A   r_XZC-X'Z'C   r_XZG-X'Z'G   r_XZU-X'Z'U

Ser-Thr        0.88          0.94          0.90          0.80
Ser-Pro        0.93          0.90          0.87          0.91
Ser-Ala        0.86          0.93          0.82          0.81
Thr-Pro        0.83          0.91          0.93          0.74
Thr-Ala        0.91          0.93          0.93          0.94
Pro-Ala        0.86          0.93          0.93          0.94
Leu-Val        0.85          0.82         -0.70          0.96

Page 28: A mathematical model of the genetic code: structure and applications A mathematical model of the genetic code: structure and applications Antonino Sciarrino.

The set of 8 quartets splits into 3 subsets:

4 a.a. with correlated codon usage (Ser, Pro, Ala, Thr)

2 a.a. with correlated codon usage (Leu, Val)

2 a.a. with generally uncorrelated codon usage (Arg, Gly)

Page 29: A mathematical model of the genetic code: structure and applications A mathematical model of the genetic code: structure and applications Antonino Sciarrino.

Statistical analysis

Correlation for P(XZA)-P(XZC), XZ quartets

Correlation for P(N) between {Ser, Pro, Thr, Ala} and {Leu, Val}

The observed correlations fit well in the mathematical scheme of the crystal basis model of the genetic code

Page 30: A mathematical model of the genetic code: structure and applications A mathematical model of the genetic code: structure and applications Antonino Sciarrino.

In the crystal basis model P(XYZ) can be written as function of

Page 31: A mathematical model of the genetic code: structure and applications A mathematical model of the genetic code: structure and applications Antonino Sciarrino.

ASSUMPTION

Page 32: A mathematical model of the genetic code: structure and applications A mathematical model of the genetic code: structure and applications Antonino Sciarrino.

SUM RULES

K INDEPENDENT OF THE b.sp.

XZ QUARTETS

Page 33: A mathematical model of the genetic code: structure and applications A mathematical model of the genetic code: structure and applications Antonino Sciarrino.

SUM RULES: “Theoretical” correlation matrix

XZ = NC, CG, GG, CU, GU

Page 34: A mathematical model of the genetic code: structure and applications A mathematical model of the genetic code: structure and applications Antonino Sciarrino.

Observed averaged value of the correlation matrix; in red, the theoretical value

Page 35: A mathematical model of the genetic code: structure and applications A mathematical model of the genetic code: structure and applications Antonino Sciarrino.

Irrep. (J_H, J_V)    Codons

(3/2, 3/2)           Pro CCC, Ser UCC, Ala GCC, Thr ACC
(1/2, 3/2)^1         Pro CCU, Ser UCU, Ala GCU, Thr ACU
(3/2, 1/2)^1         Pro CCG, Ser UCG, Ala GCG, Thr ACG
(1/2, 1/2)^1         Pro CCA, Ser UCA, Ala GCA, Thr ACA
(1/2, 3/2)^2         Leu CUC, Leu CUU, Val GUC, Val GUU
(1/2, 1/2)^2         Leu CUG, Leu CUA, Val GUG, Val GUA

Page 36: A mathematical model of the genetic code: structure and applications A mathematical model of the genetic code: structure and applications Antonino Sciarrino.

Shannon Entropy

Let us define the Shannon entropy for the amino acid specified by the first two nucleotides XZ (8 quartets)
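The defining formula is missing from the transcript. A minimal sketch, assuming the standard form $S_{XZ} = -\sum_N P(XZN) \log_2 P(XZN)$ with the sum over the four codons of the quartet, is shown below; the base-2 logarithm and the example probabilities are assumptions made for illustration.

```python
from math import log2


def shannon_entropy(probs):
    """Shannon entropy (in bits) of a normalized probability list.

    Terms with p = 0 contribute nothing, following the usual
    convention 0 * log(0) = 0.
    """
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return -sum(p * log2(p) for p in probs if p > 0)


# Hypothetical P(XZN) for N = A, C, G, U within one quartet.
biased = [0.4, 0.3, 0.2, 0.1]
uniform = [0.25, 0.25, 0.25, 0.25]

print(shannon_entropy(biased))
print(shannon_entropy(uniform))  # maximum for a quartet: 2 bits
```

With this convention the entropy of a quartet lies between 0 (a single codon always used) and 2 bits (uniform usage), so it gives one number summarising how strong the bias is.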

Page 37: A mathematical model of the genetic code: structure and applications A mathematical model of the genetic code: structure and applications Antonino Sciarrino.

Shannon Entropy

Using the previous expression for P(XZN) we get

$\rho_N \equiv \rho(XZN)$, $H_{bs,N} \equiv H_{bs}(XZN)$, $P_N \equiv P(XZN)$

SXZ largely independent of the b.sp.

Page 38: A mathematical model of the genetic code: structure and applications A mathematical model of the genetic code: structure and applications Antonino Sciarrino.

Shannon Entropy

Page 39: A mathematical model of the genetic code: structure and applications A mathematical model of the genetic code: structure and applications Antonino Sciarrino.

DNA dinucleotide free energy

Free energy for a pair of nucleotides, e.g. GC, lying on one strand of DNA, coupled with the complementary pair, CG, on the other strand.

CG from 5’ → 3’ correlated with GC from 3’ → 5’
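The pairing between a dimer and its partner on the opposite strand can be made explicit with a small helper. This is a sketch using only Watson-Crick complementarity; the function names are hypothetical, and the count of independent dimers it prints is a general property of double-stranded DNA, not a figure quoted from the slides.

```python
from itertools import product

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def partner_3to5(dimer):
    """Partner dimer on the other strand, read 3' -> 5' (base by base)."""
    return "".join(COMPLEMENT[b] for b in dimer)


def partner_5to3(dimer):
    """Same partner read 5' -> 3', i.e. the reverse complement."""
    return partner_3to5(dimer)[::-1]


# Each double-stranded dimer can be named from either strand; keep one
# representative per pair {dimer, reverse complement}.
dimers = ["".join(p) for p in product("ACGT", repeat=2)]
classes = {min(d, partner_5to3(d)) for d in dimers}

print(partner_3to5("GC"))   # partner of GC read 3' -> 5'
print(partner_5to3("GC"))   # the same partner read 5' -> 3'
print(len(classes), sorted(classes))
```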

Page 40: A mathematical model of the genetic code: structure and applications A mathematical model of the genetic code: structure and applications Antonino Sciarrino.

DINUCLEOTIDE Representation Content

Page 41: A mathematical model of the genetic code: structure and applications A mathematical model of the genetic code: structure and applications Antonino Sciarrino.
Page 42: A mathematical model of the genetic code: structure and applications A mathematical model of the genetic code: structure and applications Antonino Sciarrino.

SUM RULES for FREE ENERGY

Page 43: A mathematical model of the genetic code: structure and applications A mathematical model of the genetic code: structure and applications Antonino Sciarrino.

Comparison with exp. data

$\Delta G$ in kcal/mol

Page 44: A mathematical model of the genetic code: structure and applications A mathematical model of the genetic code: structure and applications Antonino Sciarrino.

DINUCLEOTIDE Distribution

Page 45: A mathematical model of the genetic code: structure and applications A mathematical model of the genetic code: structure and applications Antonino Sciarrino.
Page 46: A mathematical model of the genetic code: structure and applications A mathematical model of the genetic code: structure and applications Antonino Sciarrino.

Comparison with experimental data

Page 47: A mathematical model of the genetic code: structure and applications A mathematical model of the genetic code: structure and applications Antonino Sciarrino.

Work in progress and future perspectives

From the correspondence {C, U/T, G, A} ↔ I.R. $(1/2, 1/2)$ of $U_{q \to 0}(sl(2)_H \oplus sl(2)_V)$:

Any ordered N-nucleotide sequence ↔ vector of I.R. $(1/2, 1/2)^{\otimes N}$ of $U_{q \to 0}(sl(2)_H \oplus sl(2)_V)$

New parametrization of nucleotide sequences

Page 48: A mathematical model of the genetic code: structure and applications A mathematical model of the genetic code: structure and applications Antonino Sciarrino.

“Spin” parametrisation

Page 49: A mathematical model of the genetic code: structure and applications A mathematical model of the genetic code: structure and applications Antonino Sciarrino.

Algorithm for the “spin” parametrisation of ordered n-nucleotide sequence

Page 50: A mathematical model of the genetic code: structure and applications A mathematical model of the genetic code: structure and applications Antonino Sciarrino.

From this parametrisation:

Alternative construction of a mutation model, where the mutation intensity does not depend on the Hamming distance between the sequences, but on the change of “labels” of the “sets”.
C. Minichini, A.S., Biosystems (2006)

Characterization of particular sequences (exons, introns, promoter, 5’ or 3’ UTR sequences, …)
L. Frappat, P. Sorba, A.S., L. Vuillon, in progress

Page 51: A mathematical model of the genetic code: structure and applications A mathematical model of the genetic code: structure and applications Antonino Sciarrino.

For each gene of Homo sap. (total ~28,000 genes):

Consider the N-nucleotide coding sequence (CDS)

Compute the “labels” $J_H$, $J_{3H}$; $J_V$, $J_{3V}$ for any n-nucleotide subsequence ($1 \le n \le N$)

Plot “labels” versus n
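The procedure just listed can be sketched as follows. Several points are assumptions rather than the authors' implementation: the "n-nucleotide subsequence" is taken to be the first n nucleotides of the CDS, the base labels are C = (+,+), U/T = (−,+), G = (+,−), A = (−,−), and the labels are obtained with a pairing rule in which each "+" followed by a "−" cancels. The function name and the toy sequence are hypothetical.

```python
from fractions import Fraction

# Assumed "spin" labels (H, V) of the four bases.
SPIN = {"C": (+1, +1), "U": (-1, +1), "G": (+1, -1), "A": (-1, -1)}


def prefix_labels(cds):
    """Return [(J_H, J_3H, J_V, J_3V)] for every prefix of length n = 1..N.

    The unpaired signs are kept on a stack per sl(2) factor, so the
    whole scan costs O(N) instead of recomputing each prefix.
    """
    stacks = ([], [])   # unpaired signs for H and V
    totals = [0, 0]     # running sums of signs, i.e. 2 * J_3
    labels = []
    for base in cds.upper().replace("T", "U"):
        for k, sign in enumerate(SPIN[base]):
            totals[k] += sign
            if sign == -1 and stacks[k] and stacks[k][-1] == +1:
                stacks[k].pop()          # a '+-' pair cancels
            else:
                stacks[k].append(sign)
        labels.append((
            Fraction(len(stacks[0]), 2), Fraction(totals[0], 2),
            Fraction(len(stacks[1]), 2), Fraction(totals[1], 2),
        ))
    return labels


# Toy sequence, not a real gene.
for n, row in enumerate(prefix_labels("ATGCTAGCC"), start=1):
    print(n, *row)
```

The four columns printed per row are the quantities that would be plotted against n; any plotting library can take the list returned by `prefix_labels` directly.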

Page 52: A mathematical model of the genetic code: structure and applications A mathematical model of the genetic code: structure and applications Antonino Sciarrino.

Red: $J_H$ - Green: $J_{3H}$

Blue: $J_V$ - Black: $J_{3V}$

Page 53: A mathematical model of the genetic code: structure and applications A mathematical model of the genetic code: structure and applications Antonino Sciarrino.

Red: $J_H$ - Green: $J_{3H}$

Blue: $J_V$ - Black: $J_{3V}$

Page 54: A mathematical model of the genetic code: structure and applications A mathematical model of the genetic code: structure and applications Antonino Sciarrino.

Red: $J_H$ - Green: $J_{3H}$

Blue: $J_V$ - Black: $J_{3V}$

Page 55: A mathematical model of the genetic code: structure and applications A mathematical model of the genetic code: structure and applications Antonino Sciarrino.

Red: $J_H$ - Green: $J_{3H}$

Blue: $J_V$ - Black: $J_{3V}$

Page 56: A mathematical model of the genetic code: structure and applications A mathematical model of the genetic code: structure and applications Antonino Sciarrino.

Numerical estimator

Define for any sequence of length N

Plot number of CDS with the same value of Diff (Sum) versus Diff (Sum)

Compute Diff (Sum) for 28,000 random sequences (300 < N < 4300) with uniform probability for each nucleotide

Comparison: number of CDS versus random sequences

Page 57: A mathematical model of the genetic code: structure and applications A mathematical model of the genetic code: structure and applications Antonino Sciarrino.
Page 58: A mathematical model of the genetic code: structure and applications A mathematical model of the genetic code: structure and applications Antonino Sciarrino.
Page 59: A mathematical model of the genetic code: structure and applications A mathematical model of the genetic code: structure and applications Antonino Sciarrino.

Conclusions

Correlations in codon usage frequencies computed over the whole exonic region fit well in the mathematical scheme of the crystal basis model of the genetic code. Missing: an explanation for the correlations.

The formalism of the crystal basis model is useful to parametrize the free energy for DNA dimers.

More generally, use of the $U_{q \to 0}(sl(2)_H \oplus sl(2)_V)$ mathematical structure may be useful to describe sequences of nucleotides.