BIOINFORMATICS AND GENE DISCOVERY
Iosif Vaisman
1998
UNIVERSITY OF NORTH CAROLINA AT CHAPEL HILL
Bioinformatics Tutorials
From genes to proteins
[Figure: DNA -> (transcription) -> RNA -> (splicing) -> mRNA -> (translation) -> protein. Labeled features: promoter elements, start codon, stop codon, splice sites.]
Comparative Sequence Sizes (base pairs)
• Yeast chromosome 3 350,000
• Escherichia coli (bacterium) genome 4,600,000
• Largest yeast chromosome now mapped 5,800,000
• Entire yeast genome 15,000,000
• Smallest human chromosome (Y) 50,000,000
• Largest human chromosome (1) 250,000,000
• Entire human genome 3,000,000,000
[Figure: Low-resolution physical map of chromosome 19]
[Figure: Chromosome 19 gene map]
Computational Gene Prediction
• Where are genes unlikely to be located?
• How do transcription factors know where to bind a region of DNA?
• Where are the transcription, splicing, and translation start and stop signals?
• What do coding regions do (that non-coding regions do not)?
•Can we learn from examples?
•Does this sequence look familiar?
Artificial Intelligence in Biosciences
Neural Networks (NN)
Genetic Algorithms (GA)
Hidden Markov Models (HMM)
Stochastic context-free grammars (SCFG)
Information Theory
[Figure: choosing between two equally likely symbols (0, 1) takes 1 bit; choosing among four symbols (00, 01, 10, 11) takes two binary decisions of 1 bit each]
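The bit counts on the Information Theory slides can be reproduced with a short calculation. The sketch below is an illustration, not part of the original tutorial; the function names are hypothetical. It assumes equally likely outcomes for `bits_needed` and uses the standard Shannon entropy for the general case.

```python
import math

def bits_needed(n_outcomes):
    """Bits required to single out one of n equally likely outcomes."""
    return math.log2(n_outcomes)

def shannon_entropy(probs):
    """Average information, in bits, of a discrete probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Two symbols (0, 1): one binary decision.
print(bits_needed(2))
# Four symbols (00, 01, 10, 11): two binary decisions.
print(bits_needed(4))
# Four equally likely nucleotides carry the same information as four symbols.
print(shannon_entropy([0.25, 0.25, 0.25, 0.25]))
```

A skewed distribution, such as `shannon_entropy([0.7, 0.1, 0.1, 0.1])`, gives less than 2 bits, which is why biased base composition carries a detectable signal.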
Scientific Models
Physical models -- Mathematical models
Mechanistic models: predictive power, elegance, consistency
Stochastic models: predictive power
Hidden Markov models: stochastic mechanism (between mechanism and black box)
Neural Networks
• interconnected assembly of simple processing elements (units or nodes)
• node functionality is similar to that of the animal neuron
• processing ability is stored in the inter-unit connection strengths (weights)
• weights are obtained by a process of adaptation to, or learning from, a set of training patterns
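To make the idea of weights learned from training patterns concrete, here is a minimal sketch of a single sigmoid unit trained with the delta rule. It is an illustration only: the training task (logical OR), the learning rate, the epoch count, and all names are assumptions, not taken from the tutorial.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def predict(weights, bias, inputs):
    """Output of one unit: squashed weighted sum of its inputs."""
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)

def train_unit(patterns, epochs=2000, rate=0.5):
    """Adapt the weights to a set of (inputs, target) training patterns."""
    weights = [0.0] * len(patterns[0][0])
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in patterns:
            out = predict(weights, bias, inputs)
            # Delta rule: error scaled by the slope of the sigmoid.
            delta = (target - out) * out * (1.0 - out)
            weights = [w + rate * delta * x for w, x in zip(weights, inputs)]
            bias += rate * delta
    return weights, bias

# Hypothetical training patterns: the logical OR of two inputs.
OR_PATTERNS = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
weights, bias = train_unit(OR_PATTERNS)
for inputs, target in OR_PATTERNS:
    print(inputs, target, round(predict(weights, bias, inputs), 2))
```

All of the "knowledge" ends up in `weights` and `bias`; a gene-finding network works the same way, only with many units and with sequence-derived features as inputs.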
Genetic Algorithms
Search or optimization methods using simulated evolution.
A population of potential solutions is subjected to natural selection, crossover, and mutation.

choose initial population
evaluate each individual's fitness
repeat
    select individuals to reproduce
    mate pairs at random
    apply crossover operator
    apply mutation operator
    evaluate each individual's fitness
until terminating condition
Crossover
[Figure: Parent A and Parent B are cut at a crossover point and their segments exchanged, producing Child AB and Child BA]
Mutation
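The genetic algorithm loop, with its crossover and mutation operators, can be sketched in a few lines. This is a toy example under stated assumptions: individuals are bit lists, fitness is the number of 1 bits, selection is by small tournaments, and the run stops after a fixed number of generations. None of these choices come from the tutorial, and all names are hypothetical.

```python
import random

def fitness(individual):
    """Toy fitness: the number of 1 bits."""
    return sum(individual)

def crossover(parent_a, parent_b, point):
    """Single-point crossover; returns (child AB, child BA)."""
    child_ab = parent_a[:point] + parent_b[point:]
    child_ba = parent_b[:point] + parent_a[point:]
    return child_ab, child_ba

def mutate(individual, rate, rng):
    """Flip each bit independently with probability `rate`."""
    return [1 - bit if rng.random() < rate else bit for bit in individual]

def select(population, rng, k=3):
    """Tournament selection: the fittest of k random individuals."""
    return max(rng.sample(population, k), key=fitness)

def run_ga(length=20, pop_size=30, generations=40, rate=0.02, seed=0):
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(length)]
                  for _ in range(pop_size)]
    best = max(population, key=fitness)
    for _ in range(generations):  # terminating condition: fixed generation count
        next_population = []
        while len(next_population) < pop_size:
            parent_a, parent_b = select(population, rng), select(population, rng)
            point = rng.randint(1, length - 1)
            for child in crossover(parent_a, parent_b, point):
                next_population.append(mutate(child, rate, rng))
        population = next_population[:pop_size]
        # Remember the best individual seen in any generation.
        best = max(population + [best], key=fitness)
    return best

best = run_ga()
print(fitness(best), best)
```

In a gene-prediction setting the bit list would be replaced by an encoding of a candidate solution (for example, a set of exon boundaries) and `fitness` by a score of how well that candidate explains the sequence.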
Markov Model (or Markov Chain)
Example sequence: ATCTAG
The probability of each character depends only on the few characters that precede it in the sequence.
Number of preceding characters = order of the Markov model
Probability of the sequence (second-order model):
P(s) = P(A) P(T|A) P(C|A,T) P(T|T,C) P(A|C,T) P(G|T,A)
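The product above can be computed directly once the conditional probabilities are known. The following sketch estimates them by counting in a tiny, made-up training set and then scores the sequence ATCTAG with a second-order model. The training sequences and function names are assumptions for illustration only.

```python
from collections import defaultdict

def train(sequences, order):
    """Estimate P(next character | up to `order` preceding characters) by counting."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for i, ch in enumerate(seq):
            context = seq[max(0, i - order):i]
            counts[context][ch] += 1
    return {
        context: {ch: n / sum(nxt.values()) for ch, n in nxt.items()}
        for context, nxt in counts.items()
    }

def sequence_probability(model, seq, order):
    """Multiply the conditional probabilities of each character given its context."""
    p = 1.0
    for i, ch in enumerate(seq):
        context = seq[max(0, i - order):i]
        p *= model.get(context, {}).get(ch, 0.0)  # unseen events get probability 0
    return p

# Hypothetical training data; a real model would use many known sequences.
model = train(["ATCTAG", "ATCGAT"], order=2)
print(sequence_probability(model, "ATCTAG", order=2))
```

The first two characters use shorter contexts (none, then one character), exactly as in the terms P(A) and P(T|A). A practical model would also add pseudocounts so that unseen contexts do not force the whole product to zero.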
Hidden Markov Models
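States -- well-defined conditions
Edges -- transitions between the states
[Figure: state diagram with nucleotide-emitting states (A, T, C, G) connected by edges; example sequence ATGACATTACACGACACTAC]
Each transition is assigned a probability.
Probability of the sequence:
• single path with the highest probability -- Viterbi path
• sum of the probabilities over all paths -- forward algorithm (the calculation used within the Baum-Welch method)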
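As an illustration of the Viterbi path, the sketch below finds the single most probable state path through a toy two-state HMM. The states ("AT_rich", "GC_rich") and every probability are invented for the example and are not taken from the tutorial or from any gene finder.

```python
import math

# Hypothetical model: two states that prefer different nucleotides.
STATES = ("AT_rich", "GC_rich")
START = {"AT_rich": 0.5, "GC_rich": 0.5}
TRANS = {
    "AT_rich": {"AT_rich": 0.9, "GC_rich": 0.1},
    "GC_rich": {"AT_rich": 0.1, "GC_rich": 0.9},
}
EMIT = {
    "AT_rich": {"A": 0.4, "T": 0.4, "C": 0.1, "G": 0.1},
    "GC_rich": {"A": 0.1, "T": 0.1, "C": 0.4, "G": 0.4},
}

def viterbi(seq):
    """Return (most probable state path, its log probability)."""
    # score[s]: log probability of the best path that ends in state s.
    score = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    back = []  # one dict of back-pointers per position after the first
    for ch in seq[1:]:
        prev, score, pointers = score, {}, {}
        for s in STATES:
            best_prev = max(STATES, key=lambda r: prev[r] + math.log(TRANS[r][s]))
            pointers[s] = best_prev
            score[s] = (prev[best_prev] + math.log(TRANS[best_prev][s])
                        + math.log(EMIT[s][ch]))
        back.append(pointers)
    # Trace the back-pointers from the best final state.
    last = max(STATES, key=lambda s: score[s])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, score[last]

path, log_p = viterbi("AAAAAAGGGGGG")
print(path)
```

Log probabilities are used so that long sequences do not underflow. A gene-finding HMM follows the same recursion with states such as exon, intron, and intergenic region.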
Hidden Markov Model of Biased Coin Tosses
• States (Si): Two Biased Coins {C1, C2}
• Outputs (Oj): Two Possible Outputs {H, T}
• p(Output Oij): p(C1, H), p(C1, T), p(C2, H), p(C2, T)
• Transitions: From State X to Y {A11, A22, A12, A21}
• p(Initial Si): p(I, C1), p(I, C2)
• p(End Si): p(C1, E), p(C2, E)
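The coin-toss model listed above can be written out and used to compute the probability of an observed toss sequence summed over all hidden paths (the forward algorithm). The numeric values below are made up for illustration; only the structure (two coins, two outputs, initial, transition, and end probabilities) follows the slide.

```python
STATES = ("C1", "C2")

# All probabilities are hypothetical.
INITIAL = {"C1": 0.5, "C2": 0.5}                    # p(I, C1), p(I, C2)
EMIT = {"C1": {"H": 0.9, "T": 0.1},                 # p(C1, H), p(C1, T)
        "C2": {"H": 0.2, "T": 0.8}}                 # p(C2, H), p(C2, T)
TRANS = {"C1": {"C1": 0.6, "C2": 0.3},              # A11, A12
         "C2": {"C1": 0.3, "C2": 0.6}}              # A21, A22
END = {"C1": 0.1, "C2": 0.1}                        # p(C1, E), p(C2, E)

def forward(outputs):
    """Probability of the output sequence, summed over every state path."""
    # alpha[s]: probability of the outputs so far, with the model now in state s.
    alpha = {s: INITIAL[s] * EMIT[s][outputs[0]] for s in STATES}
    for o in outputs[1:]:
        alpha = {
            s: sum(alpha[r] * TRANS[r][s] for r in STATES) * EMIT[s][o]
            for s in STATES
        }
    return sum(alpha[s] * END[s] for s in STATES)

print(forward("H"))
print(forward("HHTT"))
```

For each state, the outgoing transition probabilities plus the end probability sum to 1, so the model defines a proper distribution over toss sequences of every length. An observer sees only the H/T outputs; which coin produced each toss stays hidden.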
Hidden Markov Model for Exon and Stop Codon (VEIL Algorithm)
GRAIL gene identification program
POSSIBLE EXONS -> REFINED EXON POSITIONS -> FINAL EXON CANDIDATES
Suboptimal Solutions for the Human Growth Hormone Gene (GeneParser)
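Measures of Prediction Accuracy
Nucleotide Level
[Figure: REALITY and PREDICTION tracks aligned along a sequence, with each nucleotide scored as TP, FP, FN, or TN]

                    REALITY
                    c      nc
PREDICTION    c     TP     FP
              nc    FN     TN

(c = coding, nc = non-coding)
Sensitivity: Sn = TP / (TP + FN)
Specificity: Sp = TP / (TP + FP)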
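The nucleotide-level measures can be computed by comparing two annotation tracks position by position. In this sketch each track is a string with `c` for a coding nucleotide and `n` for a non-coding one; that encoding, the example tracks, and the function name are assumptions for illustration. Sp follows the definition given on the slide, TP / (TP + FP).

```python
def nucleotide_accuracy(reality, prediction):
    """Return (Sn, Sp) from two equal-length tracks of 'c' (coding) and 'n' (non-coding)."""
    if len(reality) != len(prediction):
        raise ValueError("tracks must have the same length")
    pairs = list(zip(reality, prediction))
    tp = sum(r == "c" and p == "c" for r, p in pairs)  # coding, predicted coding
    fn = sum(r == "c" and p == "n" for r, p in pairs)  # coding, missed
    fp = sum(r == "n" and p == "c" for r, p in pairs)  # non-coding, predicted coding
    sn = tp / (tp + fn) if tp + fn else float("nan")
    sp = tp / (tp + fp) if tp + fp else float("nan")
    return sn, sp

# Hypothetical tracks: the prediction is shifted one position to the right.
reality    = "nncccccnnn"
prediction = "nnncccccnn"
print(nucleotide_accuracy(reality, prediction))
```

Here 4 of the 5 coding nucleotides are found (Sn = 0.8) and 4 of the 5 predicted coding nucleotides are real (Sp = 0.8).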
Measures of Prediction Accuracy
Exon Level
[Figure: REALITY and PREDICTION tracks showing a wrong exon, a correct exon, and a missing exon]
Sensitivity: Sn = number of correct exons / number of actual exons
Specificity: Sp = number of correct exons / number of predicted exons
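The exon-level measures can be sketched the same way. The example below assumes that an exon is a `(start, end)` pair with inclusive coordinates, that a predicted exon is "correct" only when both boundaries match an actual exon exactly, and that "missing" and "wrong" exons are those with no overlap at all on the other track. These conventions and all names are assumptions, not definitions from the tutorial.

```python
def overlaps(a, b):
    """True if two inclusive (start, end) intervals share at least one position."""
    return a[0] <= b[1] and b[0] <= a[1]

def exon_accuracy(actual, predicted):
    """Return exon-level Sn and Sp plus the missing and wrong exons."""
    actual_set, predicted_set = set(actual), set(predicted)
    correct = actual_set & predicted_set  # exact boundary matches
    missing = sorted(e for e in actual_set
                     if not any(overlaps(e, p) for p in predicted_set))
    wrong = sorted(p for p in predicted_set
                   if not any(overlaps(p, e) for e in actual_set))
    return {
        "Sn": len(correct) / len(actual_set),
        "Sp": len(correct) / len(predicted_set),
        "missing": missing,
        "wrong": wrong,
    }

# Hypothetical coordinates.
actual = [(100, 200), (300, 400), (500, 600)]
predicted = [(100, 200), (310, 400), (700, 800)]
print(exon_accuracy(actual, predicted))
```

The exon predicted at (310, 400) overlaps a real exon but misses one boundary, so it counts as neither correct nor wrong under these conventions; it lowers both Sn and Sp.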
GeneMark Accuracy Evaluation
Gene Discovery Exercise
http://metalab.unc.edu/pharmacy/Bioinfo/Gene
Bibliography
http://linkage.rockefeller.edu/wli/gene/list.html
and
http://www-hto.usc.edu/software/procrustes/fans_ref/