Comparative gene hunting Irmtraud Meyer The Sanger Institute now University of Oxford...
Comparative gene hunting
Irmtraud Meyer
The Sanger Institute, now University of Oxford
Making sense of the genome:
• What are the proteins and where are they encoded?
[Diagram labels: experiments in lab; sequence database (proteins, ESTs); protein; DNA: cctgctgggtgcgagagccggcgtaccggtgaggcc]
Aim in ab initio gene prediction:
[Diagram labels: protein; sequence database (proteins); experiments in lab; DNA: cctgctgggtgcgagagccggcgtaccggtgaggcc]
Typical Situation:
. . . gagccgcctcctccccttccccacgctctaggagggggccgcgggggcctggctgcgtcggccaatcggagtgcacttccgcagctgacaaattcagtataaaagcttggggctggggccgagcactggggactttgagggtggccaggccagcgtaggaggccagcgtaggatcctgctgggagcggggaactgagggaagcgacgccgagaaagcaggcgtaccacggagggagagaaaagctccggaagcccagcagcgcctttacgcacagctgccaactggccgctgccgaccgtctccagctcccgaggacgcgcgaccggacaccgggtcctgccacagccgaggacagctcgccgctcgccgcagcgagcccggggcggcccttcagggggacctttcccagatcgCccaggccgcccggatgtgcacgaaaatggaacag. . .
. . . ggcgacgggggctcgggaagcctgacagggcttttgcgcacagctgccggctggtgctacccgcccgcgccagcccccgagaacgcgcgaccaggcacccagtccggtcaccgcagcggagagctcgccgctcgctgcagcgaggcccggagcggccccgcagggaccctccccagaccgcctgggccgcccggatgtgcactaaaatggaacagcccttctaccacgacgactcatacacagctacgggatacggccgggcccctggtggcctctctctacacgactacaaactcctgaaaccgagcctggcggtcaacctggccgacccctaccggagtctcaaagcgcctgGggctcgcggacccggcccagagggcggcggtggcggcagctacttttc . . .
? Human DNA
Mouse DNA
Aim in comparative ab initio gene prediction:
• annotate DNA x and DNA y simultaneously
Input:
x: cctgctgggtgcgagagccggcgtaccggtgaggcc
y: cctgctgggagcgaaagcaggcgtaccacggaggg
Output: ?
Why is this a good idea?
IISPTHISJLKDAFKLJDFISDFLKJUEHIDDENWRWIERUOIYWERIUY
KISFTHISPLKDAPKOJGFISJYTKJUWHIDDENRUIEUNNKLZSBUEYQ
• advantages:
  • can detect new genes as there is no need to search in databases for proteins
  • fewer assumptions needed than in one-strand ab initio gene-prediction methods, i.e. can detect unusual genes
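The two letter strings on this slide make the point by analogy: read alone, each looks like noise, but comparing them position by position exposes the words they share. A minimal Python sketch of that comparison follows. The function name `conserved_runs` and the `min_len` cut-off are illustrative choices, not part of the talk.

```python
def conserved_runs(a: str, b: str, min_len: int = 2) -> list[str]:
    """Return maximal runs of letters that are identical at the same positions."""
    runs, current = [], []
    for letter_a, letter_b in zip(a, b):
        if letter_a == letter_b:
            current.append(letter_a)
        else:
            if len(current) >= min_len:
                runs.append("".join(current))
            current = []
    if len(current) >= min_len:  # flush a run that reaches the end
        runs.append("".join(current))
    return runs


first = "IISPTHISJLKDAFKLJDFISDFLKJUEHIDDENWRWIERUOIYWERIUY"
second = "KISFTHISPLKDAPKOJGFISJYTKJUWHIDDENRUIEUNNKLZSBUEYQ"

runs = conserved_runs(first, second)
print(runs)                 # all conserved runs of two or more letters
print(max(runs, key=len))   # the longest conserved run
```

Short conserved runs also appear by chance, which is why a comparative method needs a model of what conserved coding sequence looks like rather than a plain identity scan.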
Analysing mouse and human DNA:
• Training:
  • adjust parameters of Doublescan with set of known pairs of orthologous mouse and human genes
• Testing:
  • Test set: 80 pairs of known mouse and human genes
    • 55 %: same number of exons, different coding length
    • 42 %: same number of exons, same coding length
    • 3 %: different number of exons, different coding length
Results - Performance:
Gene level                       Sensitivity  Specificity  Overlapping  Missing  Wrong
Doublescan                       0.57         0.43         0.44         0.00     0.14
Doublescan with post-processing  0.57         0.50         0.46         0.01     0.04
Genscan                          0.47         0.46         0.53         0.00     0.01
[Figure legend: annotation vs. prediction; predicted genes are classed as correct, overlapping, missing or wrong]
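The slides report gene-level figures without spelling out how they are computed. The sketch below shows one common way to compute such numbers and rests on assumptions that are not taken from the talk: a predicted gene counts as correct only if all of its exons match an annotated gene exactly, sensitivity, overlapping and missing are fractions of the annotated genes, and specificity and wrong are fractions of the predicted genes. The `Gene` type and all function names are hypothetical.

```python
# A gene is a tuple of exons; each exon is (start, end) with end exclusive.
Gene = tuple[tuple[int, int], ...]


def overlaps(a: Gene, b: Gene) -> bool:
    """True if any exon of a shares at least one base with any exon of b."""
    return any(s1 < e2 and s2 < e1 for s1, e1 in a for s2, e2 in b)


def gene_level_metrics(annotated: list[Gene], predicted: list[Gene]) -> dict[str, float]:
    """Assumed definitions; the talk itself does not state them."""
    correct = sum(1 for g in annotated if g in predicted)
    overlapping = sum(
        1 for g in annotated
        if g not in predicted and any(overlaps(g, p) for p in predicted)
    )
    missing = len(annotated) - correct - overlapping
    wrong = sum(1 for p in predicted if not any(overlaps(p, g) for g in annotated))
    return {
        "sensitivity": correct / len(annotated),
        "specificity": correct / len(predicted),
        "overlapping": overlapping / len(annotated),
        "missing": missing / len(annotated),
        "wrong": wrong / len(predicted),
    }


# Toy data: one exact hit, one partial hit, one prediction with no annotation.
annotated = [((0, 100), (200, 300)), ((500, 600),)]
predicted = [((0, 100), (200, 300)), ((520, 600),), ((700, 750),)]
metrics = gene_level_metrics(annotated, predicted)
print(metrics)
```

Under these definitions sensitivity, overlapping and missing sum to one, which is roughly what the rows of the table show.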
C. elegans – C. briggsae
• C. elegans
  • sequenced in 1998
  • 97 million bases
  • 5 autosomes, one X
  • about 20 000 genes
• C. briggsae
  • around 100 million bases
  • 5 autosomes, one X
Results - Performance:
Gene level   Sensitivity  Specificity  Overlapping  Missing  Wrong
Doublescan   0.80         0.71         0.20         0.00     0.07
[Figure legend: annotation vs. prediction; predicted genes are classed as correct, overlapping, missing or wrong]
Summary:
• Doublescan:
  • predicts the gene structures of both sequences at the same time as aligning the sequences
  • capable of predicting partial, complete and multiple genes or no genes at all, as well as more diverged pairs of genes which are related by events of exon-fusion or exon-splitting
  • can be used to analyse long sequences using the Stepping Stone algorithm (same performance as Hirschberg algorithm)
  • general concept: can be trained to analyse other pairs of related genomes
  • performance on mouse - human DNA and C. elegans – C. briggsae DNA very promising
To do list:
• large scale mouse - human comparison
• large scale C. elegans – C. briggsae comparison
• search for regulatory regions
[Figure: sequences x and y]
References:
• www.sanger.ac.uk/Software/analysis/doublescan
• I.M. Meyer and R. Durbin, Bioinformatics, 2002, 18(10), pp. 1309-
Acknowledgements:
• Richard Durbin
• Sequencing centres
• Trinity College, Cambridge
• Wellcome Trust
• The Sanger Centre
Pair HMMs:
• idea: annotate the two sequences by parsing them through connected states
• each state reads a fixed number of letters from one or two of the sequences
[Diagram: DNA x and DNA y are parsed through a start state and the states match intergenic, match exon and match intron; arrows between states are transitions. Match intergenic reads 1 letter from each sequence at a time.]
x: ACGTCGACATGGCCTATCCGCTGAGCT
y: ACGTCGGGCCTCTCCGCTAAGCT
Doublescan:
[State diagram: match intergenic x:y, emit x:-, emit -:y]
x: ACGTCGACATGGCCTATCCGCTGAGCT
y: ACGTCG----GGCCTCTCCGCTAAGCT
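To make the three intergenic states concrete, the sketch below walks a state path and writes out the alignment it implies. Each state is reduced to the number of letters it reads from x and from y. The path is written by hand so that it reproduces the alignment on the slide; a real pair HMM would infer the path from its parameters, and the names `READS` and `align_from_path` are illustrative only.

```python
# (letters read from x, letters read from y); state names follow the slide.
READS = {
    "match intergenic x:y": (1, 1),
    "emit x:-": (1, 0),
    "emit -:y": (0, 1),
}


def align_from_path(x: str, y: str, path: list[str]) -> tuple[str, str]:
    """Turn a state path into two gapped rows of equal length."""
    i = j = 0
    row_x, row_y = [], []
    for state in path:
        dx, dy = READS[state]
        # A state that reads nothing from one sequence leaves a gap there.
        row_x.append(x[i:i + dx] if dx else "-" * dy)
        row_y.append(y[j:j + dy] if dy else "-" * dx)
        i += dx
        j += dy
    if (i, j) != (len(x), len(y)):
        raise ValueError("state path does not consume both sequences exactly")
    return "".join(row_x), "".join(row_y)


x = "ACGTCGACATGGCCTATCCGCTGAGCT"
y = "ACGTCGGGCCTCTCCGCTAAGCT"
# Hand-written path: 6 matches, 4 letters emitted from x only, 17 matches.
path = ["match intergenic x:y"] * 6 + ["emit x:-"] * 4 + ["match intergenic x:y"] * 17

aligned_x, aligned_y = align_from_path(x, y, path)
print(aligned_x)
print(aligned_y)
```

The length check at the end reflects the requirement that a valid path reads every letter of both sequences exactly once.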
[State diagram, adding to the previous slide: match exon x1x2x3:y1y2y3, emit x1x2x3:-, emit -:y1y2y3]
Unaligned:
x: CAAGCATGCGACAAAGGATACAGCGACCTC
y: CAAGCCTGCGGATACAGCGAACTC
Aligned codon by codon:
x: CAAGCATGCGACAAAGGATACAGCGACCTC
y: CAAGCCTGC------GGATACAGCGAACTC
• GCA vs GCC: same amino-acid (Alanine)
• GACAAA in x only: insertion of two codons
• GAC vs GAA: similar amino-acids (Aspartic, Glutamic acid)
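The codon-level remarks on this slide can be checked by translating the two aligned rows. The sketch below assumes the standard genetic code and a reading frame that starts at the first letter of each row, and it treats a gap of three dashes as one missing codon. `CODON_TABLE` and `translate_aligned` are illustrative names.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate_aligned(row: str) -> str:
    """Translate an aligned row codon by codon; '---' becomes '-'."""
    codons = [row[k:k + 3] for k in range(0, len(row), 3)]
    return "".join("-" if codon == "---" else CODON_TABLE[codon] for codon in codons)


x = "CAAGCATGCGACAAAGGATACAGCGACCTC"
y = "CAAGCCTGC------GGATACAGCGAACTC"

protein_x = translate_aligned(x)
protein_y = translate_aligned(y)
print(protein_x)
print(protein_y)

# List the codon columns where the DNA differs, with the encoded residues.
for k in range(0, len(x), 3):
    cx, cy = x[k:k + 3], y[k:k + 3]
    if cx != cy:
        print(cx, cy, protein_x[k // 3], protein_y[k // 3])
```

Working in whole codons is what lets a match exon state score GCA against GCC as a silent change rather than as a mismatch.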
Doublescan:
[State diagram, adding to the previous slide: start:start (start codon), stop:stop (stop codon)]
Doublescan:
[State diagram, adding to the previous slide: GT:GT, AG:AG, match intron x:y, emit x:-, emit -:y]
x: GCATGCGTACAGTTG…GTCAGGAGAGCGAACTCGCA
y: GCCTGCGTACAGTTA…AGTACGAGAGCGAACTCGCA
exon, intron, exon: the intron starts at the 5' splice site (GT) and ends at the 3' splice site (AG)
Doublescan:
[State diagram, adding to the previous slide: x1GT:y1GT (…) AGx2x3:AGy2y3]
Intron entered through GT:GT and left through AG:AG:
x: GCATGCGTACAGTTG…GTCAGGAGAGCGAACTCGCA
y: GCCTGCGTACAGTTA…AGTACGAGAGCGAACTCGCA
Intron entered through x1GT:y1GT and left through AGx2x3:AGy2y3:
x: GCATGCAGTACAGTTG…GTCAGGAGGCGAACTCGCA
y: GCCTGCAGTACAGTTA…AGTACGAGGCGAACTCGCA
exon, intron, exon
Doublescan:
[State diagram, adding to the previous slide: x1x2GT:y1y2GT (…) AGx3:AGy3]
x: GCATGCAGGTACAGTTG…GTCAGGAGCGAACTCGCA
y: GCCTGCAGGTACAGTTA…AGTACGAGCGAACTCGCA
exon, intron, exon
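The three pairs of splice site states differ in how many letters of a codon sit in front of the intron. The sketch below takes the exon parts of sequence x from the three intron examples, works out that count, and confirms that all three splice to the same coding sequence. It assumes the upstream exon starts at a codon boundary. The intron bodies are elided on the slides, so they are left out here. `SPLICE_STATES` and `intron_phase` are illustrative names.

```python
# Letters of a split codon before the intron -> (state entering, state leaving).
SPLICE_STATES = {
    0: ("GT:GT", "AG:AG"),
    1: ("x1GT:y1GT", "AGx2x3:AGy2y3"),
    2: ("x1x2GT:y1y2GT", "AGx3:AGy3"),
}


def intron_phase(upstream_exon: str) -> int:
    """Number of codon letters before the intron, assuming the exon starts in frame."""
    return len(upstream_exon) % 3


# (upstream exon, downstream exon) of sequence x in the three slide examples.
examples = [
    ("GCATGC", "AGCGAACTCGCA"),
    ("GCATGCA", "GCGAACTCGCA"),
    ("GCATGCAG", "CGAACTCGCA"),
]

phases = [intron_phase(upstream) for upstream, _ in examples]
spliced = [upstream + downstream for upstream, downstream in examples]

for phase, coding in zip(phases, spliced):
    entering, leaving = SPLICE_STATES[phase]
    print(phase, entering, leaving, coding)
```

Because the states carry the split codon across the intron, the reading frame of the downstream exon is preserved in all three cases.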
Doublescan:
[State diagram: same states as on the previous slide]
[Figure: exon fusion, shown for sequences x and y]
Doublescan:
[State diagram, adding to the previous slide the states that read an intron from one sequence only:
• from y only: -:GT, -:AG, emit y intron -:y, -:y1GT (…) -:AGy2y3, -:y1y2GT (…) -:AGy3
• from x only: GT:-, AG:-, emit x intron x:-, x1GT:- (…) AGx2x3:-, x1x2GT:- (…) AGx3:-]
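The one-sided intron states are what allow an exon in one sequence to correspond to two exons in the other. The sketch below counts how many letters a state path reads from each sequence. The read lengths are an assumption obtained by counting the letters in each state label, the names `intron -:y` and `intron x:-` abbreviate the slide's emit states, and the path itself is made up for illustration.

```python
# (letters read from x, letters read from y), counted from the state labels.
READS = {
    "start:start": (3, 3),
    "stop:stop": (3, 3),
    "match exon x1x2x3:y1y2y3": (3, 3),
    "-:GT": (0, 2),
    "intron -:y": (0, 1),
    "-:AG": (0, 2),
    "GT:-": (2, 0),
    "intron x:-": (1, 0),
    "AG:-": (2, 0),
}


def letters_consumed(path: list[str]) -> tuple[int, int]:
    """Total letters read from x and from y along a state path."""
    return (
        sum(READS[state][0] for state in path),
        sum(READS[state][1] for state in path),
    )


# Hypothetical path: y carries a 9-letter intron that x lacks (exon fusion in x).
path = (
    ["start:start"]
    + ["match exon x1x2x3:y1y2y3"] * 2
    + ["-:GT"] + ["intron -:y"] * 5 + ["-:AG"]
    + ["match exon x1x2x3:y1y2y3"]
    + ["stop:stop"]
)

used_x, used_y = letters_consumed(path)
print(used_x, used_y)
```

Here x reads one uninterrupted run of codons while y reads the same codons plus the intron, so the two totals differ by the intron length.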
Doublescan:
• the Start and End states are connected to all other states
Refinements:
• Score all potential splice sites
• => distinguish between true and false splice sites by rescaling the nominal transition probabilities to the splice site states
[Figure: scores along x and y; states shown: match exon x1x2x3:y1y2y3, x1GT:y1GT, x1x2GT:y1y2GT, GT:GT]
x: cctgctgggtgcgagagccggcgtaccggtgaggcccctgctgggtgcgagagccggcgtaccggtg
y: cctgctggaggcggtagcgtgcttagtggtgaggcccctgttgggcgcgagagccggtaaaccgctg
Refinements to Doublescan:
• Score all potential translation start sites
• => distinguish between true and false translation start sites by rescaling the nominal transition probabilities to the start:start state
[Figure: states shown: match intergenic x:y, start:start, stop:stop]
y: cctgctggatgcggtagcgtgcttatgggtgaggcccctgttgggcatgagagccggtaaaccgctg
x: cgtgctggacgcatgagcgtgcttacgggtgatgcccctgtatggcaggagagccggtatggcgctg
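Both refinements start from the full list of potential sites, most of which are false. The sketch below only enumerates candidates by their consensus letters (GT and AG for splice sites, ATG for translation starts). How Doublescan scores each candidate and rescales the transition probabilities is not described on the slides and is not modelled here. `candidate_sites` is an illustrative name.

```python
def candidate_sites(seq: str, motif: str) -> list[int]:
    """0-based start positions of every occurrence of motif, overlaps included."""
    seq, motif = seq.upper(), motif.upper()
    positions = []
    start = seq.find(motif)
    while start != -1:
        positions.append(start)
        start = seq.find(motif, start + 1)
    return positions


# Sequence labelled y on the translation start slide.
y = "cctgctggatgcggtagcgtgcttatgggtgaggcccctgttgggcatgagagccggtaaaccgctg"

for motif in ("ATG", "GT", "AG"):
    print(motif, candidate_sites(y, motif))
```

Even this short sequence contains several candidates per motif, which is why a score is needed to separate true sites from false ones.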