Comparative gene hunting Irmtraud Meyer The Sanger Institute now University of Oxford...

Comparative gene hunting

Irmtraud Meyer

The Sanger Institute

now University of Oxford


Making sense of the genome:

• What are the proteins and where are they encoded?

[Figure labels: experiments in lab; sequence database (proteins, ESTs); protein; DNA: cctgctgggtgcgagagccggcgtaccggtgaggcc]

Aim in ab initio gene prediction:

[Figure labels: protein; sequence database (proteins); experiments in lab; DNA]


3059 million bases: GCTGCCAACGC…

We will very soon have:

3286 million bases: ACTGCGGGCGC…

Rough comparative map:

reference: www.ensembl.org

Typical Situation:

. . . gagccgcctcctccccttccccacgctctaggagggggccgcgggggcctggctgcgtcggccaatcggagtgcacttccgcagctgacaaattcagtataaaagcttggggctggggccgagcactggggactttgagggtggccaggccagcgtaggaggccagcgtaggatcctgctgggagcggggaactgagggaagcgacgccgagaaagcaggcgtaccacggagggagagaaaagctccggaagcccagcagcgcctttacgcacagctgccaactggccgctgccgaccgtctccagctcccgaggacgcgcgaccggacaccgggtcctgccacagccgaggacagctcgccgctcgccgcagcgagcccggggcggcccttcagggggacctttcccagatcgCccaggccgcccggatgtgcacgaaaatggaacag. . .

. . . ggcgacgggggctcgggaagcctgacagggcttttgcgcacagctgccggctggtgctacccgcccgcgccagcccccgagaacgcgcgaccaggcacccagtccggtcaccgcagcggagagctcgccgctcgctgcagcgaggcccggagcggccccgcagggaccctccccagaccgcctgggccgcccggatgtgcactaaaatggaacagcccttctaccacgacgactcatacacagctacgggatacggccgggcccctggtggcctctctctacacgactacaaactcctgaaaccgagcctggcggtcaacctggccgacccctaccggagtctcaaagcgcctgGggctcgcggacccggcccagagggcggcggtggcggcagctacttttc . . .

? Human DNA

Mouse DNA

Similar problem:

[Figure labels: demotic, Greek, hieroglyphs]

Aim in comparative ab initio gene prediction:

• annotate DNA x and DNA y simultaneously

[Figure: Input: x and y; Output: ?; sequence shown: cctgctgggagcgaaagcaggcgtaccacggaggg]

Why is this a good idea?

IISPTHISJLKDAFKLJDFISDFLKJUEHIDDENWRWIERUOIYWERIUY

KISFTHISPLKDAPKOJGFISJYTKJUWHIDDENRUIEUNNKLZSBUEYQ

• advantages:
  • can detect new genes as there is no need to search in databases for proteins
  • fewer assumptions needed than in one-strand ab initio gene-prediction methods, i.e. can detect unusual genes
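The two letter strings on this slide can be compared position by position to see where they agree. The following is a minimal Python sketch, assuming equal-length strings compared without gaps; the function name `conserved_runs` and the minimum run length of 4 are illustrative choices, not part of the talk.

```python
def conserved_runs(a, b, min_len=4):
    """Return (start, text) for every run of identical letters found at
    the same positions in a and b that is at least min_len long."""
    runs, start = [], None
    for i, (ca, cb) in enumerate(zip(a, b)):
        if ca == cb:
            if start is None:
                start = i              # a new run of matches begins here
        else:
            if start is not None and i - start >= min_len:
                runs.append((start, a[start:i]))
            start = None
    # close a run that extends to the end of the shorter string
    end = min(len(a), len(b))
    if start is not None and end - start >= min_len:
        runs.append((start, a[start:end]))
    return runs


s1 = "IISPTHISJLKDAFKLJDFISDFLKJUEHIDDENWRWIERUOIYWERIUY"
s2 = "KISFTHISPLKDAPKOJGFISJYTKJUWHIDDENRUIEUNNKLZSBUEYQ"
for start, text in conserved_runs(s1, s2):
    print(start, text)
```

Among the runs reported are THIS and HIDDEN, the words that stay readable in both strings while the surrounding letters differ.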

Mouse – human comparison:

• 3059 million bases

• 3286 million bases

• about 30 000 (?) genes

Analysing mouse and human DNA:

• Training:
  • adjust parameters of Doublescan with set of known pairs of orthologous mouse and human genes

• Testing:
  • Test set: 80 pairs of known mouse and human genes
    • 55 %: same number of exons, different coding length
    • 42 %: same number of exons, same coding length
    • 3 %: different number of exons, different coding length

Results - Performance:

Gene:                            Sensitivity  Specificity  Overlapping  Missing  Wrong
Doublescan                       0.57         0.43         0.44         0.00     0.14
Doublescan with post-processing  0.57         0.50         0.46         0.01     0.04
Genscan                          0.47         0.46         0.53         0.00     0.01

[Figure: annotation versus prediction, illustrating the categories correct, overlapping, missing, wrong]
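The slides report gene-level figures without defining them. The sketch below shows one common way such figures are computed. It assumes that a predicted gene is "correct" only if all its exons match an annotated gene exactly, that sensitivity, overlapping and missing are fractions of the annotated genes, and that specificity and wrong are fractions of the predicted genes. These definitions, the gene representation and all names are assumptions, not taken from the talk. The reading is at least consistent with the reported numbers, where sensitivity, overlapping and missing add up to roughly 1.

```python
def spans_overlap(g, h):
    """True if the genomic spans of genes g and h share at least one base.
    A gene is a tuple of (start, end) exons, sorted, inclusive coordinates."""
    return g[0][0] <= h[-1][1] and h[0][0] <= g[-1][1]


def gene_level_stats(annotated, predicted):
    """Gene-level statistics under the assumed definitions.
    Both lists must be non-empty."""
    correct = sum(1 for g in annotated if g in predicted)
    overlapping = sum(
        1 for g in annotated
        if g not in predicted and any(spans_overlap(g, p) for p in predicted)
    )
    missing = len(annotated) - correct - overlapping
    wrong = sum(
        1 for p in predicted
        if not any(spans_overlap(p, g) for g in annotated)
    )
    return {
        "sensitivity": correct / len(annotated),
        "specificity": correct / len(predicted),
        "overlapping": overlapping / len(annotated),
        "missing": missing / len(annotated),
        "wrong": wrong / len(predicted),
    }


# Hypothetical toy data: three annotated and three predicted genes.
annotated = [((10, 50), (80, 120)), ((200, 260),), ((400, 450),)]
predicted = [((10, 50), (80, 120)), ((210, 260),), ((600, 650),)]
stats = gene_level_stats(annotated, predicted)
print(stats)
```

In the toy data one gene is predicted exactly, one is only overlapped, one is missed, and one prediction hits nothing.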

C. elegans – C. briggsae

• C. elegans
  • sequenced in 1998
  • 97 million bases
  • 5 autosomes, one X
  • about 20 000 genes

• C. briggsae
  • around 100 million bases
  • 5 autosomes, one X

Results - Performance:

Gene:       Sensitivity  Specificity  Overlapping  Missing  Wrong
Doublescan  0.80         0.71         0.20         0.00     0.07

[Figure: annotation versus prediction, illustrating the categories correct, overlapping, missing, wrong]

Summary:

• Doublescan:

• predicts the gene structures of both sequences at the same time as aligning the sequences

• capable of predicting partial, complete and multiple genes or no genes at all as well as more diverged pairs of genes which are related by events of exon-fusion or exon-splitting

• can be used to analyse long sequences using the Stepping Stone algorithm (same performance as Hirschberg algorithm)

• general concept: can be trained to analyse other pairs of related genomes

• performance on mouse - human DNA and C. elegans – C. briggsae DNA very promising

To do list:

• large scale mouse - human comparison

• large scale C. elegans – C. briggsae comparison

• search for regulatory regions

[Figure labels: x, y]

References:

• www.sanger.ac.uk/Software/analysis/doublescan

• I. M. Meyer and R. Durbin, Bioinformatics, 2002, 18(10), pp. 1309-

Acknowledgements:

• Richard Durbin

• Sequencing centres

• Trinity College, Cambridge

• Wellcome Trust

• The Sanger Centre

The method:

• What are pair hidden Markov models?

• How can they be used to find genes?

Pair HMMs:

• idea: annotate the two sequences by parsing them through connected states

[Figure: DNA x and DNA y]

• each state reads a fixed number of letters from one or two of the sequences

[Figure: DNA x and DNA y parsed through a start state and the states match exon, match intron and match intergenic, which reads 1 letter from each sequence at a time; labels mark a state and a transition]

Pair HMMs:

ACGTCGACATGGCCTATCCGCTGAGCT

ACGTCGGGCCTCTCCGCTAAGCT

Doublescan:

[State diagram: match intergenic x:y, emit x:-, emit -:y]

x: ACGTCGACATGGCCTATCCGCTGAGCT

y: ACGTCG----GGCCTCTCCGCTAAGCT
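To make the three intergenic states concrete, here is a minimal Python sketch of the Viterbi algorithm for a pair HMM with one match state (M, reads one letter from each sequence) and two emit states (X reads one letter from x only, Y reads one letter from y only). It is a sketch under stated assumptions, not Doublescan itself: the transition and emission probabilities (`delta`, `epsilon`, `p_same`, `p_diff`) are invented for illustration, and the real model has many more states.

```python
import math

NEG_INF = float("-inf")
# letters read from (x, y) by each state
READS = {"M": (1, 1), "X": (1, 0), "Y": (0, 1)}


def viterbi_pair(x, y, delta=0.05, epsilon=0.4, p_same=0.2, p_diff=0.2 / 12):
    """Most probable state path through a 3-state pair HMM, returned as a
    gapped alignment. All probabilities are illustrative assumptions."""
    trans = {  # log transition probabilities; X <-> Y is not allowed
        ("M", "M"): math.log(1 - 2 * delta),
        ("M", "X"): math.log(delta), ("M", "Y"): math.log(delta),
        ("X", "X"): math.log(epsilon), ("X", "M"): math.log(1 - epsilon),
        ("Y", "Y"): math.log(epsilon), ("Y", "M"): math.log(1 - epsilon),
    }
    n, m = len(x), len(y)
    # v[s][i][j]: best log probability of reading x[:i], y[:j], ending in s
    v = {s: [[NEG_INF] * (m + 1) for _ in range(n + 1)] for s in READS}
    back = {s: [[None] * (m + 1) for _ in range(n + 1)] for s in READS}
    v["M"][0][0] = 0.0  # the start is treated as M before any letter is read
    for i in range(n + 1):
        for j in range(m + 1):
            for s, (di, dj) in READS.items():
                pi, pj = i - di, j - dj
                if pi < 0 or pj < 0:
                    continue
                if s == "M":
                    emit = math.log(p_same if x[pi] == y[pj] else p_diff)
                else:
                    emit = math.log(0.25)  # uniform over A, C, G, T
                for r in READS:
                    if (r, s) in trans and v[r][pi][pj] > NEG_INF:
                        cand = v[r][pi][pj] + trans[(r, s)] + emit
                        if cand > v[s][i][j]:
                            v[s][i][j], back[s][i][j] = cand, r
    # trace back from the best final state to the start
    s = max(READS, key=lambda r: v[r][n][m])
    i, j, ax, ay = n, m, [], []
    while i > 0 or j > 0:
        prev = back[s][i][j]
        di, dj = READS[s]
        i, j = i - di, j - dj
        ax.append(x[i] if di else "-")
        ay.append(y[j] if dj else "-")
        s = prev
    return "".join(reversed(ax)), "".join(reversed(ay))


ax, ay = viterbi_pair("ACGTCGACATGGCCTATCCGCTGAGCT", "ACGTCGGGCCTCTCCGCTAAGCT")
print(ax)
print(ay)
```

The dynamic-programming table grows with the product of the two sequence lengths, which is why memory-saving strategies matter for long sequences.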

[State diagram: match intergenic x:y, emit x:-, emit -:y; match exon x1x2x3:y1y2y3, emit x1x2x3:-, emit -:y1y2y3]

x: CAAGCATGCGACAAAGGATACAGCGACCTC

y: CAAGCCTGCGGATACAGCGAACTC

x: CAAGCATGCGACAAAGGATACAGCGACCTC

y: CAAGCCTGC------GGATACAGCGAACTC

• same amino acid (alanine)

• insertion of two codons

• similar amino acids (aspartic acid, glutamic acid)
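The exon states read three letters at a time, so the alignment can be inspected codon by codon. The sketch below replays a hand-written state path over the two example sequences and translates each codon with the standard genetic code. The state names, the `replay_codon_path` helper and the state path itself are illustrative assumptions; in a real run the path would come from the Viterbi algorithm.

```python
BASES = "TCAG"
# standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

# letters read from (x, y) by each codon-level state (names are assumptions)
STATE_READS = {
    "match_exon": (3, 3),     # x1x2x3:y1y2y3
    "emit_x_codon": (3, 0),   # x1x2x3:-
    "emit_y_codon": (0, 3),   # -:y1y2y3
}


def replay_codon_path(x, y, path):
    """Walk a state path and return (codon_x, codon_y, aa_x, aa_y) rows."""
    i = j = 0
    rows = []
    for state in path:
        dx, dy = STATE_READS[state]
        cx = x[i:i + dx] if dx else "---"
        cy = y[j:j + dy] if dy else "---"
        aa_x = CODON_TABLE[cx] if dx else "-"
        aa_y = CODON_TABLE[cy] if dy else "-"
        rows.append((cx, cy, aa_x, aa_y))
        i, j = i + dx, j + dy
    if (i, j) != (len(x), len(y)):
        raise ValueError("state path does not consume both sequences")
    return rows


x = "CAAGCATGCGACAAAGGATACAGCGACCTC"
y = "CAAGCCTGCGGATACAGCGAACTC"
path = ["match_exon"] * 3 + ["emit_x_codon"] * 2 + ["match_exon"] * 5
rows = replay_codon_path(x, y, path)
for row in rows:
    print(*row)
```

The second row pairs GCA with GCC, both alanine, and the ninth pairs GAC (aspartic acid) with GAA (glutamic acid), matching the annotations on the slide.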

Doublescan:

[State diagram: match intergenic x:y, emit x:-, emit -:y; emit x1x2x3:-, emit -:y1y2y3; start:start (start codon), stop:stop (stop codon); sequences x and y]

Doublescan:

[State diagram: match intergenic x:y, emit x:-, emit -:y; start:start, stop:stop; emit x1x2x3:-, emit -:y1y2y3; GT:GT, AG:AG; match intron x:y, emit x:-, emit -:y]

GCATGCGTACAGTTG…GTCAGGAGAGCGAACTCGCA

GCCTGCGTACAGTTA…AGTACGAGAGCGAACTCGCA

[Sequence labels: exon, 5' splice site, intron, 3' splice site, exon]

Doublescan:

[State diagram: match intergenic x:y, emit x:-, emit -:y; emit x1x2x3:-, emit -:y1y2y3; GT:GT, AG:AG; match intron x:y, emit x:-, emit -:y; x1GT:y1GT, AGx2x3:AGy2y3 (…)]

GCATGCGTACAGTTG…GTCAGGAGAGCGAACTCGCA

GCCTGCGTACAGTTA…AGTACGAGAGCGAACTCGCA

GCATGCAGTACAGTTG…GTCAGGAGGCGAACTCGCA

GCCTGCAGTACAGTTA…AGTACGAGGCGAACTCGCA

[Sequence labels: exon, intron, exon]

Doublescan:

[State diagram: match intergenic x:y, emit x:-, emit -:y; emit x1x2x3:-, emit -:y1y2y3; GT:GT, AG:AG; match intron x:y, emit x:-, emit -:y; x1x2GT:y1y2GT, AGx3:AGy3 (…)]

GCATGCAGGTACAGTTG…GTCAGGAGCGAACTCGCA

GCCTGCAGGTACAGTTA…AGTACGAGCGAACTCGCA

[Sequence labels: exon, intron, exon]
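The three pairs of splice-site states differ in how many bases of a codon lie before the intron. A short Python sketch of that bookkeeping follows, assuming the exon passed in starts at a codon boundary and that both sequences are interrupted at the same position within the codon. The helper names and the dictionary layout are illustrative, while the state labels are the ones shown on the slides.

```python
# splice-site state pairs from the slides, keyed by the number of codon
# bases that precede the intron (the intron phase)
SPLICE_STATES = {
    0: ("GT:GT", "AG:AG"),
    1: ("x1GT:y1GT", "AGx2x3:AGy2y3"),
    2: ("x1x2GT:y1y2GT", "AGx3:AGy3"),
}


def intron_phase(coding_bases_before_intron):
    """Number of bases of an incomplete codon left before the intron."""
    return coding_bases_before_intron % 3


def splice_states(exon_x, exon_y):
    """Pick the (5' state, 3' state) pair for two in-frame exon prefixes."""
    phase_x = intron_phase(len(exon_x))
    phase_y = intron_phase(len(exon_y))
    if phase_x != phase_y:
        raise ValueError("the matched splice-site states need equal phases")
    return SPLICE_STATES[phase_x]


# exon prefixes taken from the three example slides
print(splice_states("GCATGC", "GCCTGC"))      # intron between two codons
print(splice_states("GCATGCA", "GCCTGCA"))    # one codon base before GT
print(splice_states("GCATGCAG", "GCCTGCAG"))  # two codon bases before GT
```

Keeping one state pair per phase lets the model resume reading codons in the correct frame after the intron.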

Doublescan:


[State diagram: match intergenic x:y, emit x:-, emit -:y; emit x1x2x3:-, emit -:y1y2y3; GT:GT, AG:AG; match intron x:y, emit x:-, emit -:y]

[Figure: two pairs of sequences x and y, illustrating exon fusion]

Doublescan:

[State diagram: match intergenic x:y, emit x:-, emit -:y; emit x1x2x3:-, emit -:y1y2y3; match intron x:y, emit x:-, emit -:y, with GT:GT and AG:AG; intron -:y, emit y, with -:GT, -:AG, -:y1GT, -:AGy2y3 (…), -:y1y2GT, -:AGy3 (…); intron x:-, emit x, with GT:-, AG:-, x1GT:-, AGx2x3:- (…), x1x2GT:-, AGx3:- (…)]

Doublescan:

• Start and End are connected to all other states

Refinements:

• Score all potential splice sites

• => distinguish between true and false splice sites by rescaling the nominal transition probabilities to the splice site states

[Figure: score tracks along sequences x and y, with the splice site states GT:GT, x1GT:y1GT and x1x2GT:y1y2GT]

x: cctgctgggtgcgagagccggcgtaccggtgaggcccctgctgggtgcgagagccggcgtaccggtg

y: cctgctggaggcggtagcgtgcttagtggtgaggcccctgttgggcgcgagagccggtaaaccgctg
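The refinement can be pictured as two steps: list every position where a splice-site motif occurs, then scale the nominal transition probability at each position by a site-specific score. The Python sketch below is a rough illustration only. The slides give neither the scoring model nor the exact rescaling rule, so the plain multiplication and the `toy_score` function are assumptions, and all names are hypothetical.

```python
def candidate_sites(seq, motif="gt"):
    """Positions of every occurrence of motif in seq (case-insensitive)."""
    seq, motif = seq.lower(), motif.lower()
    return [i for i in range(len(seq) - len(motif) + 1)
            if seq.startswith(motif, i)]


def rescaled_transitions(seq, nominal_prob, site_score, motif="gt"):
    """Map each candidate position to nominal_prob * site_score(seq, pos).
    Multiplying is an assumed rescaling rule, not the talk's formula."""
    return {pos: nominal_prob * site_score(seq, pos)
            for pos in candidate_sites(seq, motif)}


def toy_score(seq, pos):
    """Placeholder score with arbitrary values: favour sites preceded by g."""
    return 2.0 if pos > 0 and seq[pos - 1].lower() == "g" else 0.5


x = "cctgctgggtgcgagagccggcgtaccggtgaggcccctgctgggtgcgagagccggcgtaccggtg"
for pos, prob in rescaled_transitions(x, 0.01, toy_score).items():
    print(pos, prob)
```

The same scan applies to other motifs, for example `candidate_sites(seq, "atg")` for potential translation start sites.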

Refinements to Doublescan:

• Score all potential translation start sites

• => distinguish between true and false translation start sites by rescaling the nominal transition probabilities to the start:start state

[Figure: states match intergenic x:y, start:start and stop:stop, with the two sequences below labelled x and y]

cctgctggatgcggtagcgtgcttatgggtgaggcccctgttgggcatgagagccggtaaaccgctg

cgtgctggacgcatgagcgtgcttacgggtgatgcccctgtatggcaggagagccggtatggcgctg
