Gene Identification[1]
-
8/4/2019 Gene Identification[1]
Gene Identification - I
Shivani Chandra
Birla Institute of Scientific Research
-
Gene Identification
Goals :
Find genes
Map their position
Identify function(s)
-
Gene Identification
Approaches :
Classical
Computational
-
Classical Approaches
The three big Ms :
Mendel (1822-1884)
Morgan (1866-1945)
McClintock (1902-1992)
-
Mendel's Genetics
Dominant / Recessive
Genotype / Phenotype
Monohybrid / Dihybrid Crosses
Laws :
Segregation
Independent assortment
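The law of segregation can be illustrated with a monohybrid cross. The sketch below is only an illustration: the allele letters `A`/`a` and the helper names `cross` and `phenotype_ratio` are made up for this example, and uppercase is assumed to mark the dominant allele.

```python
from collections import Counter
from itertools import product

def cross(parent1: str, parent2: str) -> Counter:
    """Count offspring genotypes for a single-gene cross, e.g. 'Aa' x 'Aa'."""
    offspring = Counter()
    # Each parent passes on one of its two alleles (segregation).
    for a, b in product(parent1, parent2):
        # Sort so 'aA' and 'Aa' count as the same genotype
        # (uppercase sorts before lowercase).
        offspring["".join(sorted((a, b)))] += 1
    return offspring

def phenotype_ratio(genotypes: Counter) -> tuple:
    """Return (dominant, recessive) counts; uppercase = dominant allele."""
    dominant = sum(n for g, n in genotypes.items() if any(c.isupper() for c in g))
    recessive = sum(genotypes.values()) - dominant
    return dominant, recessive

f2 = cross("Aa", "Aa")
print(dict(f2))             # genotype counts for the F2 generation
print(phenotype_ratio(f2))  # dominant vs recessive phenotype counts
```

Crossing two heterozygotes gives the 1:2:1 genotype ratio and the 3:1 phenotype ratio Mendel observed in the F2 generation.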
-
Morgan's Genetics
Won the Nobel Prize in 1933 for his work on heredity,
which began with the white-eyed fruit-fly mutation.
Linkage and crossing-over (recombination)
Genetic and chromosome mapping.
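Genetic mapping from linkage data rests on a simple calculation: the fraction of recombinant offspring approximates the distance between two genes, with 1% recombination taken as one map unit. The offspring counts below are invented for illustration, and the approximation is only reasonable for genes that lie fairly close together.

```python
def recombination_frequency(parental: int, recombinant: int) -> float:
    """Fraction of offspring that carry a recombinant allele combination."""
    total = parental + recombinant
    if total == 0:
        raise ValueError("no offspring counted")
    return recombinant / total

# Hypothetical test-cross counts, not data from the slides.
rf = recombination_frequency(parental=830, recombinant=170)
print(f"{rf * 100:.1f} map units")
```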
-
McClintock's Genetics
Won the Nobel Prize in 1983 for the
discovery of transposons.
Ability of genes to change position on a
chromosome (genetic transposition).
Transposons :
Cause mutations.
Increase/decrease amount of DNA.
-
Classical Approaches
Study F1 and F2 to Fn generations.
Test cross, back cross.
Complementation tests.
Chromosome mapping.
Transgenics. Gene Knock-outs.
-
The Genomics Era
-
What is Computational Gene Identification?
Given an uncharacterized DNA sequence, find out:
Where does the gene start and end?
Which DNA strand is used to encode the gene?
Which reading frame is used in that strand?
Which region codes for a protein?
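The strand and reading-frame questions above can be made concrete with a naive six-frame scan for open reading frames. This is a teaching sketch, not how any of the tools discussed later work: it assumes the standard start codon ATG and stop codons TAA/TAG/TGA, and the function names are made up for this example.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOPS = {"TAA", "TAG", "TGA"}

def reverse_complement(seq: str) -> str:
    """Return the sequence of the opposite strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]

def find_orfs(seq: str, min_codons: int = 2):
    """Yield (strand, frame, start, end) for ATG...stop ORFs in six frames.

    start/end are 0-based, end-exclusive offsets on the scanned strand;
    min_codons counts the stop codon as well.
    """
    seq = seq.upper()
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i            # first in-frame start codon
                elif start is not None and codon in STOPS:
                    if (i + 3 - start) // 3 >= min_codons:
                        yield strand, frame + 1, start, i + 3
                    start = None         # look for the next ORF

for orf in find_orfs("CCATGAAATAGGG"):
    print(orf)
```

An ORF found this way is only a candidate; deciding whether it actually codes for a protein is what the statistical methods on the following slides address.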
-
Computational Approaches
Computational methods for identifying genes
have been an active field of research for
the past 15-20 years.
Fast and accurate.
-
Computational Gene Identification
Classes : Intrinsic and Extrinsic
Intrinsic, or ab initio, gene finders make no
explicit use of information about DNAs or
proteins outside the sequence being studied.
Extrinsic gene finders utilize sequence
similarity search methods to identify the
locations of protein-coding regions.
-
Software Tools
GeneMark
Glimmer
Grail
GenScan
Combined
-
GeneMark
First tool for finding prokaryotic genes (1993).
Assesses the protein-coding potential of a
DNA sequence by using Markov models of
coding and non-coding regions.
Relies on organism-specific recognition
parameters to separate coding and non-coding regions.
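The idea of scoring a sequence under a "coding" and a "non-coding" Markov model can be sketched with a log-odds score. The sketch makes several simplifying assumptions that are not taken from the slides: first-order chains, add-one pseudocounts, and two tiny made-up training sequences. Real gene finders train on large, organism-specific gene sets and use richer models.

```python
import math

def train_markov(seqs, alphabet="ACGT", pseudocount=1.0):
    """Estimate first-order transition probabilities P(next | prev)."""
    counts = {a: {b: pseudocount for b in alphabet} for a in alphabet}
    for s in seqs:
        for prev, nxt in zip(s, s[1:]):
            counts[prev][nxt] += 1
    return {
        a: {b: counts[a][b] / sum(counts[a].values()) for b in alphabet}
        for a in alphabet
    }

def log_odds(seq, coding, noncoding):
    """Sum of log(P_coding / P_noncoding) over transitions; > 0 favours coding."""
    return sum(
        math.log(coding[p][n] / noncoding[p][n]) for p, n in zip(seq, seq[1:])
    )

# Toy training data (hypothetical, uppercase A/C/G/T only).
coding_model = train_markov(["ATGGCGGCGGCGGCG"])
noncoding_model = train_markov(["ATATATATATATAT"])

print(log_odds("GCGGCG", coding_model, noncoding_model))  # positive: coding-like
print(log_odds("ATATAT", coding_model, noncoding_model))  # negative: non-coding-like
```

Because both models are estimated from known sequences of one organism, the score is only meaningful for that organism or a close relative, which is why the choice of training set matters.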
-
GeneMark
Exists in separate variants for gene
prediction in prokaryotic, eukaryotic, and
viral DNA sequences.
Requires a sufficiently large training set of
known genes.
-
GeneMark
Input File :
DNA sequence in FASTA format.
The file can contain multiple FASTA records.
Each FASTA record should be less than 5 Mbp.
The total sequence size should be in the
range of 10 Mbp to 50 Mbp.
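The input limits above can be checked before uploading. The following is a minimal sketch that assumes 1 Mbp = 1,000,000 bp and a plain multi-record FASTA file; the function names are made up for this example and are not part of GeneMark.

```python
import io

MAX_RECORD_BP = 5_000_000    # each record must be shorter than this
MIN_TOTAL_BP = 10_000_000    # total size must fall in this range
MAX_TOTAL_BP = 50_000_000

def read_fasta_lengths(handle):
    """Return {record_id: sequence_length} for a FASTA stream."""
    lengths = {}
    current = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()
            current = header[0] if header else f"record_{len(lengths) + 1}"
            lengths[current] = 0
        elif current is not None:
            lengths[current] += len(line)
    return lengths

def check_input(lengths, max_record=MAX_RECORD_BP,
                min_total=MIN_TOTAL_BP, max_total=MAX_TOTAL_BP):
    """Return a list of human-readable problems (empty if the input is fine)."""
    problems = [
        f"record {name} is {n} bp (limit: < {max_record})"
        for name, n in lengths.items() if n >= max_record
    ]
    total = sum(lengths.values())
    if not min_total <= total <= max_total:
        problems.append(f"total size {total} bp outside {min_total}..{max_total}")
    return problems

# Tiny in-memory example; a real file would be opened with open(path).
demo = io.StringIO(">a\nACGT\nAC\n>b second record\nGG\n")
print(check_input(read_fasta_lengths(demo)))
```

The demo input is far below the stated minimum, so the check reports the total size as a problem.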
-
GeneMark
Open the GeneMark server:
http://opal.biology.gatech.edu/GeneMark/genemark24.cgi
Hit Browse to select your input file.
Select the closest species of organism or host as
the model (M. tuberculosis for the
mycobacteriophages).
Under graphics export options, select everything
except "generate postscript" and "mark putative
exons".
-
GeneMark
In the second column, choose only "list open
reading frames" and "list regions of
interest".
Run GeneMark (Start).
You should see a text output of your
GeneMark results.
-
GeneMark Output
GeneMark can be instructed to generate
reports on open reading frames (ORFs),
regions of interest, and estimated exon
boundaries.
-
GeneMark Output
GENEMARK PREDICTIONS
Sequence file: cya
Sequence length: 2100 GC Content: 51.65%
Threshold value: 0.500
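The GC content figure in the report header is a simple base count. A minimal sketch follows, assuming that only unambiguous A/C/G/T bases are counted; how GeneMark itself treats ambiguous bases such as N is not stated in the slides.

```python
def gc_content(seq: str) -> float:
    """Percentage of G and C among the unambiguous A/C/G/T bases."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    total = gc + seq.count("A") + seq.count("T")
    if total == 0:
        raise ValueError("sequence contains no A/C/G/T bases")
    return 100.0 * gc / total

print(f"GC Content: {gc_content('ATGCGCGATN'):.2f}%")
```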
-
GeneMark Output
Open Reading Frames List
Left   Right  DNA         Coding  Avg   Start  RBS   RBS   RBS
end    end    Strand      Frame   Prob  Prob   Prob  Site  Seq
-----  -----  ----------  ------  ----  -----  ----  ----  ------
    3    308  direct      fr 3    0.82  ....   0        0  ....
  195    308  direct      fr 3    0.6   0.04   0.74   177  CCGCAG
  348    668  complement  fr 2    0.9   0.96   0.98   680  CAGGAT
 1368   2102  direct      fr 3    0.9   0.98   0.96  1359  TTGGAG
 1371   2102  direct      fr 3    0.91  0.96   0.96  1359  TTGGAG
 1386   2102  direct      fr 3    0.93  0.63   0.91  1367  AATGAT
 1410   2102  direct      fr 3    0.96  0.9    0.76  1401  AACGAT
 1509   2102  direct      fr 3    0.98  0.27   0.51  1490  AGGGTT
 1578   2102  direct      fr 3    0.97  0.11   0.73  1567  ATGGCA
 1620   2102  direct      fr 3    0.97  0.11   0.16  1601  GCGCTG
-
GeneMark Output
Regions of Interest
LEnd   REnd   Strand      Frame
    3    308  direct      fr 3
  348    686  complement  fr 2
 1092   1334  direct      fr 3
 1365   2102  direct      fr 3
-
GeneMark Output
Possible Frameshifts
Frame  Frame  At base          Strand
2      1      31152 +/-11bp    complement
2      1      63372 +/-11bp    direct
3      2      75528 +/-11bp    complement
-
GeneMark Output
Approx. Exon Location
Left   Right
End    End    Strand      Frame  Prob
   50    300  direct      fr 3   0.8566
   63    247                     0.9998
  365    666  complement  fr 2   0.9415
  378    657                     0.978
 1201   1277  direct      fr 3   0.8722
 1225   1254                     0.9986
 1377   2042  direct      fr 3   0.9085
 1434   2042                     0.978
-
GenScan
Analyzes DNA sequences using a probabilistic
model of gene structure that takes into account :
Transcriptional, translational, and
splicing signals.
Statistical properties of coding and
non-coding regions.
GC content.
-
GenScan
The model treats the most general case,
in which the sequence may contain no
genes, one gene, or multiple genes on
either or both DNA strands. Partial
genes as well as complete genes are
considered.
-
GenScan
Important Restrictions:
Only protein coding regions are considered
(no tRNA or rRNA genes).
Transcription units are assumed to be
non-overlapping.
-
GenScan
Input file :
The sequence file may be in either FASTA or
minimal GenBank format.
For minimal GenBank format, the LOCUS and
ORIGIN lines must be present.
-
Minimal GenBank Format
LOCUS       HUMRASH      6453 bp ds-DNA      PRI      15-MAR-1988
DEFINITION  Human c-Ha-ras1 proto-oncogene, complete coding sequence.
ACCESSION   J00277 J00206 J00276 K00954
FEATURES             Location/Qualifiers
     prim_transcript <1664..3744
                     /note="c-Ha-ras1 mRNA"
     CDS             join(1664..1774,2042..2220,2374..2533,3231..3350)
                     /note="c-Ha-ras1 p21 protein; NCBI gi: 190891."
                     /codon_start=1
                     /translation="MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQV
                     VIDGETCLLDILDTAGQEEYSAMRDQYMRTGEGFLCVFAINNTKSFEDIHQYREQIKR
                     VKDSDDVPMVLVGNKCDLAARTVESRQAQDLARSYGIPYIETSAKTRQGVEDAFYTLVREIRQHKLRKLNPPDESGPGCMSCKCVLS"
     source          1..6453
                     /organism="Homo sapiens"
BASE COUNT      946 a   2287 c   2113 g   1107 t
ORIGIN      1 bp upstream of BamHI site.
        1 ggatcccagc ctttccccag cccgtagccc cgggacctcc gcggtgggcg gcgccgcgct
       61 gccggcgcag ggagggcctc tggtgcaccg gcaccgctga gtcgggttct ctcgccggcc
      121 tgttcccggg agagcccggg gccctgctcg gagatgccgc cccgggcccc cagacaccgg
      ...
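The CDS location in the record above describes four exons that are spliced together. A short sketch of how such a `join(...)` location can be read follows; the helper names are made up for this example, and only simple `start..end` segments on the forward strand are handled (no `complement(...)` or partial `<`/`>` markers).

```python
import re

def parse_join(location: str):
    """Parse 'join(a..b,c..d,...)' into 1-based inclusive (start, end) pairs."""
    outer = re.fullmatch(r"\s*join\((.*)\)\s*", location)
    if outer is None:
        raise ValueError(f"not a join() location: {location!r}")
    segments = []
    for part in outer.group(1).split(","):
        m = re.fullmatch(r"\s*(\d+)\.\.(\d+)\s*", part)
        if m is None:
            raise ValueError(f"unsupported segment: {part!r}")
        segments.append((int(m.group(1)), int(m.group(2))))
    return segments

def spliced_length(segments):
    """Total number of bases after joining the segments."""
    return sum(end - start + 1 for start, end in segments)

cds = parse_join("join(1664..1774,2042..2220,2374..2533,3231..3350)")
length = spliced_length(cds)
print(cds)
print(length, "bp =", length // 3, "codons")
```

The four segments add up to 570 bp, i.e. 190 codons, which matches the 189-residue translation in the record plus one stop codon.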
-
GenScan Output
Predicted Genes/Exons
Gn.Ex  Type  S  Begin   End  Len  Fr  Ph  I/Ac  Do/T  CodRg  P.     Tscr..
 1.01  Intr  +    739   851  113   0   2    49    66     74  0.287    0.98
 1.02  Intr  +   1748  1860  113   2   2    53   110     80  0.866    7.23
 1.03  Intr  +   1976  2055   80   0   2    97    94     10  0.999    2.27
 1.04  Intr  +   2132  2194   63   1   0    84    80     87  0.99     6.91
 1.05  Intr  +   2434  2631  198   0   0    88    -9    263  0.895   16.67
 1.06  Intr  +   2749  2910  162   0   0   107   109     97  0.965   14.39
 1.07  Intr  +   3279  3416  138   2   0    52    77    126  0.812    9.07
 1.08  Intr  +   3576  3676  101   2   2    87   119    113  0.996   13.71
 1.09  Intr  +   3780  3846   67   0   1    63    77     46  0.998    0.4
 1.10  Term  +   4179  4340  162   2   0    75    47    276  0.979   20.45
 1.11  PlyA  +   4397  4402    6                                      1.05
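The exon table can be read programmatically by splitting each row on whitespace. The sketch below parses only the first six columns (Gn.Ex, Type, S, Begin, End, Len) and checks that Len agrees with the coordinates. The `Exon` class and function names are made up for this example; using `abs()` so that reverse-strand rows with Begin greater than End also pass is an assumption, since the slide shows only forward-strand rows.

```python
from dataclasses import dataclass

@dataclass
class Exon:
    gene_exon: str   # e.g. "1.02" = gene 1, exon 2
    kind: str        # Intr, Term, PlyA, ...
    strand: str      # "+" or "-"
    begin: int
    end: int
    length: int

def parse_exon_line(line: str) -> Exon:
    """Parse the leading columns of one row of the exon table."""
    f = line.split()
    return Exon(f[0], f[1], f[2], int(f[3]), int(f[4]), int(f[5]))

def length_consistent(exon: Exon) -> bool:
    """Len should equal the inclusive span between Begin and End."""
    return exon.length == abs(exon.end - exon.begin) + 1

rows = [
    "1.01 Intr + 739 851 113 0 2 49 66 74 0.287 0.98",
    "1.05 Intr + 2434 2631 198 0 0 88 -9 263 0.895 16.67",
    "1.11 PlyA + 4397 4402 6 1.05",
]
for row in rows:
    exon = parse_exon_line(row)
    print(exon.gene_exon, exon.kind, exon.begin, exon.end, length_consistent(exon))
```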
-
GenScan Output
Predicted peptide sequence(s)
HS307871|GENSCAN_predicted_peptide_1|398_aa
VQAIVWTWLDKTVGIIVGTCAKLRIPRLSDENKFLMSPPQGFPELKNDTFLRAAWGEETDYTPVWCMRQAGRYLPEFRETRAAQDFFSTCRSPEACCELTLQPLRRFLLDAAIIFSDILVVPQALGMEVTMVPGKGPSFPEPLREEQDLERLRDPEVVASELGYVFQAITLTRQRLAGRVPLIGFAGAPWTLM
TYMVEGGGSSTMAQAKRWLYQRPQASHQLLRILTDALVPYLVGQVVAGAQALQLFESHAGHLGPQLFNKFALPYIRDVAKQVKARLREAGLAPVPMIIFAKDGHFALEELAQAGYEVVGLDWTVAPKKARECVGKTVTLQGNLDPCALYASEEEIGQLVKQMLDDFGPHRYIANLGHGLYPDMDPEHVGAFVDAVHKHSRLLRQN
-
GRAIL
Gene Recognition and Assembly Internet
Link.
Identifies exons, polyA sites, promoters,
repeats and frameshift errors in a DNA
sequence by comparing it to a database of
known mouse and human sequence elements.
-
GRAIL
Incorporates BLAST searches and
Glimmer.
It supports the protocols and file formats
commonly found on the World-Wide Web,
such as HTTP, FTP, and HTML.
-
GRAIL
GrailExp is a software package developed
specifically for gene finding using pattern
recognition and expressed sequence tags.
Grail is an algorithm for inferring gene
structures from predicted exon candidates,
based on Expressed Sequence Tags (ESTs)
and biological intuition/rules.
-
To be Continued