3/14/08, UMKC1 TCGR: A Novel DNA/RNA Visualization Technique Margaret H. Dunham Donya Quick Southern...
-
Upload
merry-marsh -
Category
Documents
-
view
218 -
download
0
Transcript of 3/14/08, UMKC1 TCGR: A Novel DNA/RNA Visualization Technique Margaret H. Dunham Donya Quick Southern...
3/14/08, UMKC 1
TCGR: A Novel DNA/RNA TCGR: A Novel DNA/RNA Visualization TechniqueVisualization Technique
Margaret H. DunhamMargaret H. Dunham and Donya Quick Donya Quick Southern Methodist UniversitySouthern Methodist University
Dallas, Texas 75275Dallas, Texas 75275
Some slides presented at IEEE BIBE 2006
3/14/08, UMKC 2
Outline
Introduction TCGR EMM miRNA Prediction using TCGR/EMM Conclusion / Future Work
3/14/08, UMKC 3
Outline Introduction
Background CGR/FCGR Motivation Research Objective
TCGR EMM miRNA Prediction using TCGR/EMM Conclusion / Future Work
3/14/08, UMKC 4
DNA Deoxyribonucleic Acid Basic building blocks of organisms Located in nucleus of cells Composed of 4 nucleotides
Adenine (A) Cytosine (C) Guanine (G) Thymine (T)
Two strands bound together Contains genetic information Image source:
http://www.visionlearning.com/library/module_viewer.php?mid=63
3/14/08, UMKC 5
Nucleotide Bases
http://www.people.virginia.edu/~rjh9u/gif/bases.gif
3/14/08, UMKC 6
TranscriptionDuring transcription, DNA is converted in
mRNARNA is processed and noncoding regions
removedCoding regions are converted in proteinEnzyme (RNA Polymerase) that starts
transcription by binding to DNA code
3/14/08, UMKC 7
Transcription
http://ghs.gresham.k12.or.us/science/ps/sci/ibbio/chem/nucleic/chpt15/transcription.gif
3/14/08, UMKC 8
RNARibonucleic AcidContains A,C,G but U (Uracil) instead of TSingle Stranded May fold back on itselfNeeded to create proteinsMove around cells – can act like a messengermRNA – moves out of nucleus to other parts of
cell
3/14/08, UMKC 9
Translation
Synthesis of Proteins from mRNANucleotide sequence of mRNA converted
in amino acid sequence of proteinFour nucleotidesTwenty amino acids Codon – Group of 3 nucleotidesAmino acids have many codings
3/14/08, UMKC 10
Protein
RNA
DNA
transcription
translation
CCTGAGCCAACTATTGATGAA
PEPTIDE
CCUGAGCCAACUAUUGAUGAA
Central Dogma: DNA -> RNA -> Protein
www.bioalgorithms.info; chapter 6; Gene Prediction
3/14/08, UMKC 11http://www.time.com/time/magazine/article/0,9171,1541283,00.html
3/14/08, UMKC 12
Human Genome Scientists originally thought there would be about 100,000 genes Appear to be about 20,000 WHY?
Almost identical to that of Chimps. What makes the difference?
Answers appear to lie in the noncoding regions of the DNA (formerly thought to be junk)
3/14/08, UMKC 13
More Questions If each cell in an organism contains the same DNA –
How does each cell behave differently? Why do cells behave differently during childhood
development? What causes some cells to act differently – such as
during disease? DNA contains many genes, but only a few are being
transcribed – why? One answer - miRNA
3/14/08, UMKC 14
miRNA Short (20-25nt) sequence of noncoding RNA Single strand Previously assumed to be garbage Impact/Prevent translation of mRNA Bind to target areas in mRNA – Problem is that this
binding is not perfect (particularly in animals) mRNA may have multiple (nonoverlapping) binding
sites for one miRNA
3/14/08, UMKC 15
miRNA FunctionsCauses some cancersEmbryo DevelopmentCell DifferentiationCell DeathPrevents the production of a protein
that causes lung cancerControl brain development in zebra
fishAssociated with HIV
3/14/08, UMKC 16
Outline Introduction
Background CGR/FCGR Motivation Research Objective
TCGR EMM miRNA Prediction using TCGR/EMM Conclusion / Future Work
3/14/08, UMKC 17
Chaos Game Representation (CGR)
Scatter plot showing occurrence of patterns of nucleotides.
University of the Basque Country http://
insilico.ehu.es/genomics/my_words/
3/14/08, UMKC 19
Chaos Game Representation (CGR)
2D technique to visually see the distribution of subpatterns
Our technique is based on the following:
Generate totals for each subpattern
Scale totals to a [0,1] range. (Note scaling can be a problem)
Convert range to red/blue• 0-0.5: White to Blue• 0.5-1: Blue to Red
A CG U
A CG U
A CG U
A CG U
A CG U
A CG U
A CG U
A CG U
A CG U
A CG U
A CG U
A CG U
A CG U
A CG U
A CG U
A CG U
FCGR
3/14/08, UMKC 21
FCGR Example
Homo sapiens – all mature miRNA
Patterns of length 3
UUC
GUG
3/14/08, UMKC 22
Outline Introduction
Background CGR/FCGR Motivation Research Objective
TCGR EMM miRNA Prediction using TCGR/EMM Conclusion / Future Work
3/14/08, UMKC 23
Motivation
2000bp Flanking Upstream Region mir-258.2 in C elegans
a) All 2000 bp b) First 240 bp b) Last 240 bp
3/14/08, UMKC 24
Research Objectives Identify, develop, and implement algorithms which
can be used for identifying potential miRNA
functions.
Create an online tool which can be used by other
researchers to apply our algorithms to new data.
3/14/08, UMKC 25
Outline Introduction
CGR/FCGR miRNA Motivation Research Objective
TCGR EMM miRNA Prediction using TCGR/EMM Conclusion / Future Work
3/14/08, UMKC 26
Temporal CGR (TCGR)
Temporal version of Frequency CGR In our context temporal means the starting location of a window
2D Array Each Row represents counts for a particular window in sequence
• First row – first window• Last row – last window • We start successive windows at the next character location
Each Column represents the counts for the associated pattern in that window• Initially we have assumed order of patterns is alphabetic
Size of TCGR depends on sequence length and subpattern lengt As sequence lengths vary, we only examine complete windows We only count patterns completely contained in each window.
3/14/08, UMKC 27
TCGR Exampleacgtgcacgtaactgattccggaaccaaatgtgcccacgtcga
Moving Window
A C G T
Pos 0-8 2 3 3 1
Pos 1-9 1 3 3 2
…Pos 34-42 2 4 2 1
A C G T
Pos 0-8 0.4 0.6 0.6 0.2
Pos 1-9 0.2 0.6 0.6 0.4
…Pos 34-42 0.4 0.8 0.4 0.2
3/14/08, UMKC 28
TCGR Example (cont’d)
TCGRs for Sub-patterns of length 1, 2, and 3
3/14/08, UMKC 29
TCGR Example (cont’d)
Window 0: Pos 0-8Window 1: Pos 1-9
Window 17: Pos 17-25Window 18: Pos 18-26
Window 34: Pos 34-42
acgtgcacgcgtgcacgt
tccggaaccccggaacca
ccacgtcga
A C G T
3/14/08, UMKC 31
TCGR – Mature miRNA(Window=5; Pattern=3)
All Mature
Mus musculus
Homo sapiens
C. elegans
ACG CGC GCG UCG
3/14/08, UMKC 32
Outline Introduction
CGR/FCGR miRNA Motivation Research Objective
TCGR EMM miRNA Prediction using TCGR/EMM Conclusion / Future Work
3/14/08, UMKC 33
EMM Overview Time Varying Discrete First Order Markov
Model Nodes are clusters of real world states. Learning continues during prediction phase. Learning:
Transition probabilities between nodes Node labels (centroid of cluster) Nodes are added and removed as data
arrives
3/14/08, UMKC 34
EMM DefinitionExtensible Markov Model (EMM): at any time t, EMM
consists of an MC with designated current node, Nn, and algorithms to modify it, where algorithms include:
EMMCluster, which defines a technique for matching between input data at time t + 1 and existing states in the MC at time t.
EMMIncrement algorithm, which updates MC at time t + 1 given the MC at time t and clustering measure result at time t + 1.
EMMDecrement algorithm, which removes nodes from the EMM when needed.
3/14/08, UMKC 35
EMM Cluster
Find closest node to incoming event.
If none “close” create new node Labeling of cluster is centroid of
members in cluster O(n)
3/14/08, UMKC 36
EMM Increment
<18,10,3,3,1,0,0><18,10,3,3,1,0,0>
<17,10,2,3,1,0,0><17,10,2,3,1,0,0>
<16,9,2,3,1,0,0><16,9,2,3,1,0,0>
<14,8,2,3,1,0,0><14,8,2,3,1,0,0>
<14,8,2,3,0,0,0><14,8,2,3,0,0,0>
<18,10,3,3,1,1,0.><18,10,3,3,1,1,0.>
1/3
N1
N2
2/3
N3
1/11/3
N1
N2
2/3
1/1
N3
1/1
1/2
1/3
N1
N2
2/31/2
1/2
N3
1/1
2/3
1/3
N1
N2
N1
2/21/1
N1
1
3/14/08, UMKC 37
Outline Introduction
CGR/FCGR miRNA Motivation Research Objective
TCGR EMM miRNA Prediction using TCGR/EMM Conclusion / Future Work
3/14/08, UMKC 38
Research Approach
1. Represent potential miRNA sequence with TCGR sequence of count vectors
2. Create EMM using count vectors for known miRNA (miRNA stem loops, miRNA targets)
3. Predict unknown sequence to be miRNA (miRNA stem loop, miRNA target) based on normalized product of transition probabilities along clustering path in EMM
3/14/08, UMKC 39
Related Work 1
Predicted occurrence of pre-miRNA segments form a set of hairpin sequences
No assumptions about biological function or conservation across species.
Used SVMs to differentiate the structure of hiarpin segments that contained pre-miRNAs from those that did not.
Sensitivey of 93.3% Specificity of 88.1%
1 C. Xue, F. Li, T. He, G. Liu, Y. Li, nad X. Zhang, “Classification of Real and Pseudo MicroRNA Precursors using Local Structure-Sequence Features and Support Vector Machine,” BMC Bioinformatics, vol 6, no 310.
3/14/08, UMKC 40
Preliminary Test Data1
Positive Training: This dataset consists of 163 human pre-miRNAs with lengths of 62-119.
Negative Training: This dataset was obtained from protein coding regions of human RefSeq genes. As these are from coding regions it is likely that there are no true pre-miRNAs in this data. This dataset contains 168 sequences with lengths between 63 and 110 characters.
Positive Test: This dataset contains 30 pre-miRNAs. Negative Test: This dataset contains 1000 randomly
chosen sequences from coding regions.
1 C. Xue, F. Li, T. He, G. Liu, Y. Li, nad X. Zhang, “Classification of Real and Pseudo MicroRNA Precursors using Local Structure-Sequence Features and Support Vector Machine,” BMC Bioinformatics, vol 6, no 310.
3/14/08, UMKC 41
POSITIVENEGATIVE
TCGRs for Xue Training Data
3/14/08, UMKC 42
POSITIVE
NEGATIVE
TCGRs for Xue Test Data
3/14/08, UMKC 43
Predictive Probabilities with Xue’s Data
EMM Test Data Mean Std Dev Max Min
Negative Test-Neg 0 0 0 0
Test-Pos 0 0 0 0
Train-Neg 0.37963 0.050085 0.91256 0.2945
Train-Pos 0 0 0 0
Positive Test-Neg 0 0 0 0
Test-Pos 0.25894 0.18701 0.42075 0
Train-Neg 0 0 0 0
Train-Pos 0.38926 0.048439 0.91155 0.32209
3/14/08, UMKC 44
Preliminary Test Results
Positive EMM Cutoff Probability = 0.3 False Positive Rate = 0% True Positive Rate = 66%
Test results could be improved by meta classifiers combining multiple positive and negative classifiers together.
3/14/08, UMKC 45
Outline Introduction
CGR/FCGR miRNA Motivation Research Objective
TCGR EMM miRNA Prediction using TCGR/EMMConclusion / Future Work
3/14/08, UMKC 46
Future Research1. Obtain all known mature miRNA sequences for a species – initially the 119 C. elegans
miRNAs.
2. Create TCGR count vectors for each sequence and each sub-pattern length (1,2,3,4,5).
3. Train EMMs using this data for each sub-pattern length. Thus five EMMs will be created
4. Obtain negative data (much as Xue did in his research) from coding regions for C. elegans.
5. Train EMMs using this data for each sub-pattern length. Thus five EMMs will be created
6. Construct a meta-classifier based on the combined results of prediction from each of these ten EMMs.
7. Apply the EMM classifier to the existing ~75x106 base pairs of non-exonic sequence in the C. elegans genome to search for miRNAs. Note: all 119 validated C. elegans miRNAs are contained in the non-exonic part of the genome and thus the first pass of the algorithm will be tested for its ability to detect all 119 validated miRNAs.
8. Validate the prediction of novel miRNAs using molecular biology.