3/14/08, UMKC1 TCGR: A Novel DNA/RNA Visualization Technique Margaret H. Dunham Donya Quick Southern...

43
3/14/08, UMKC 1 TCGR: A Novel DNA/RNA TCGR: A Novel DNA/RNA Visualization Technique Visualization Technique Margaret H. Dunham Margaret H. Dunham and Donya Donya Quick Quick Southern Methodist Southern Methodist University University Dallas, Texas 75275 Dallas, Texas 75275 [email protected] Some slides presented at IEEE BIBE 2006

Transcript of 3/14/08, UMKC1 TCGR: A Novel DNA/RNA Visualization Technique Margaret H. Dunham Donya Quick Southern...

Page 1: 3/14/08, UMKC1 TCGR: A Novel DNA/RNA Visualization Technique Margaret H. Dunham Donya Quick Southern Methodist University Margaret H. Dunham and Donya.

3/14/08, UMKC 1

TCGR: A Novel DNA/RNA TCGR: A Novel DNA/RNA Visualization TechniqueVisualization Technique

Margaret H. DunhamMargaret H. Dunham and Donya Quick Donya Quick Southern Methodist UniversitySouthern Methodist University

Dallas, Texas 75275Dallas, Texas 75275

[email protected]

Some slides presented at IEEE BIBE 2006

Page 2: 3/14/08, UMKC1 TCGR: A Novel DNA/RNA Visualization Technique Margaret H. Dunham Donya Quick Southern Methodist University Margaret H. Dunham and Donya.

3/14/08, UMKC 2

Outline

Introduction TCGR EMM miRNA Prediction using TCGR/EMM Conclusion / Future Work

Page 3: 3/14/08, UMKC1 TCGR: A Novel DNA/RNA Visualization Technique Margaret H. Dunham Donya Quick Southern Methodist University Margaret H. Dunham and Donya.

3/14/08, UMKC 3

Outline Introduction

Background CGR/FCGR Motivation Research Objective

TCGR EMM miRNA Prediction using TCGR/EMM Conclusion / Future Work

Page 4: 3/14/08, UMKC1 TCGR: A Novel DNA/RNA Visualization Technique Margaret H. Dunham Donya Quick Southern Methodist University Margaret H. Dunham and Donya.

3/14/08, UMKC 4

DNA Deoxyribonucleic Acid Basic building blocks of organisms Located in nucleus of cells Composed of 4 nucleotides

Adenine (A) Cytosine (C) Guanine (G) Thymine (T)

Two strands bound together Contains genetic information Image source:

http://www.visionlearning.com/library/module_viewer.php?mid=63

Page 5: 3/14/08, UMKC1 TCGR: A Novel DNA/RNA Visualization Technique Margaret H. Dunham Donya Quick Southern Methodist University Margaret H. Dunham and Donya.

3/14/08, UMKC 5

Nucleotide Bases

http://www.people.virginia.edu/~rjh9u/gif/bases.gif

Page 6: 3/14/08, UMKC1 TCGR: A Novel DNA/RNA Visualization Technique Margaret H. Dunham Donya Quick Southern Methodist University Margaret H. Dunham and Donya.

3/14/08, UMKC 6

TranscriptionDuring transcription, DNA is converted in

mRNARNA is processed and noncoding regions

removedCoding regions are converted in proteinEnzyme (RNA Polymerase) that starts

transcription by binding to DNA code

Page 7: 3/14/08, UMKC1 TCGR: A Novel DNA/RNA Visualization Technique Margaret H. Dunham Donya Quick Southern Methodist University Margaret H. Dunham and Donya.

3/14/08, UMKC 7

Transcription

http://ghs.gresham.k12.or.us/science/ps/sci/ibbio/chem/nucleic/chpt15/transcription.gif

Page 8: 3/14/08, UMKC1 TCGR: A Novel DNA/RNA Visualization Technique Margaret H. Dunham Donya Quick Southern Methodist University Margaret H. Dunham and Donya.

3/14/08, UMKC 8

RNARibonucleic AcidContains A,C,G but U (Uracil) instead of TSingle Stranded May fold back on itselfNeeded to create proteinsMove around cells – can act like a messengermRNA – moves out of nucleus to other parts of

cell

Page 9: 3/14/08, UMKC1 TCGR: A Novel DNA/RNA Visualization Technique Margaret H. Dunham Donya Quick Southern Methodist University Margaret H. Dunham and Donya.

3/14/08, UMKC 9

Translation

Synthesis of Proteins from mRNANucleotide sequence of mRNA converted

in amino acid sequence of proteinFour nucleotidesTwenty amino acids Codon – Group of 3 nucleotidesAmino acids have many codings

Page 10: 3/14/08, UMKC1 TCGR: A Novel DNA/RNA Visualization Technique Margaret H. Dunham Donya Quick Southern Methodist University Margaret H. Dunham and Donya.

3/14/08, UMKC 10

Protein

RNA

DNA

transcription

translation

CCTGAGCCAACTATTGATGAA

PEPTIDE

CCUGAGCCAACUAUUGAUGAA

Central Dogma: DNA -> RNA -> Protein

www.bioalgorithms.info; chapter 6; Gene Prediction

Page 11: 3/14/08, UMKC1 TCGR: A Novel DNA/RNA Visualization Technique Margaret H. Dunham Donya Quick Southern Methodist University Margaret H. Dunham and Donya.

3/14/08, UMKC 11http://www.time.com/time/magazine/article/0,9171,1541283,00.html

Page 12: 3/14/08, UMKC1 TCGR: A Novel DNA/RNA Visualization Technique Margaret H. Dunham Donya Quick Southern Methodist University Margaret H. Dunham and Donya.

3/14/08, UMKC 12

Human Genome Scientists originally thought there would be about 100,000 genes Appear to be about 20,000 WHY?

Almost identical to that of Chimps. What makes the difference?

Answers appear to lie in the noncoding regions of the DNA (formerly thought to be junk)

Page 13: 3/14/08, UMKC1 TCGR: A Novel DNA/RNA Visualization Technique Margaret H. Dunham Donya Quick Southern Methodist University Margaret H. Dunham and Donya.

3/14/08, UMKC 13

More Questions If each cell in an organism contains the same DNA –

How does each cell behave differently? Why do cells behave differently during childhood

development? What causes some cells to act differently – such as

during disease? DNA contains many genes, but only a few are being

transcribed – why? One answer - miRNA

Page 14: 3/14/08, UMKC1 TCGR: A Novel DNA/RNA Visualization Technique Margaret H. Dunham Donya Quick Southern Methodist University Margaret H. Dunham and Donya.

3/14/08, UMKC 14

miRNA Short (20-25nt) sequence of noncoding RNA Single strand Previously assumed to be garbage Impact/Prevent translation of mRNA Bind to target areas in mRNA – Problem is that this

binding is not perfect (particularly in animals) mRNA may have multiple (nonoverlapping) binding

sites for one miRNA

Page 15: 3/14/08, UMKC1 TCGR: A Novel DNA/RNA Visualization Technique Margaret H. Dunham Donya Quick Southern Methodist University Margaret H. Dunham and Donya.

3/14/08, UMKC 15

miRNA FunctionsCauses some cancersEmbryo DevelopmentCell DifferentiationCell DeathPrevents the production of a protein

that causes lung cancerControl brain development in zebra

fishAssociated with HIV

Page 16: 3/14/08, UMKC1 TCGR: A Novel DNA/RNA Visualization Technique Margaret H. Dunham Donya Quick Southern Methodist University Margaret H. Dunham and Donya.

3/14/08, UMKC 16

Outline Introduction

Background CGR/FCGR Motivation Research Objective

TCGR EMM miRNA Prediction using TCGR/EMM Conclusion / Future Work

Page 17: 3/14/08, UMKC1 TCGR: A Novel DNA/RNA Visualization Technique Margaret H. Dunham Donya Quick Southern Methodist University Margaret H. Dunham and Donya.

3/14/08, UMKC 17

Chaos Game Representation (CGR)

Scatter plot showing occurrence of patterns of nucleotides.

University of the Basque Country http://

insilico.ehu.es/genomics/my_words/

Page 18: 3/14/08, UMKC1 TCGR: A Novel DNA/RNA Visualization Technique Margaret H. Dunham Donya Quick Southern Methodist University Margaret H. Dunham and Donya.

3/14/08, UMKC 19

Chaos Game Representation (CGR)

2D technique to visually see the distribution of subpatterns

Our technique is based on the following:

Generate totals for each subpattern

Scale totals to a [0,1] range. (Note scaling can be a problem)

Convert range to red/blue• 0-0.5: White to Blue• 0.5-1: Blue to Red

A CG U

A CG U

A CG U

A CG U

A CG U

A CG U

A CG U

A CG U

A CG U

A CG U

A CG U

A CG U

A CG U

A CG U

A CG U

A CG U

FCGR

Page 19: 3/14/08, UMKC1 TCGR: A Novel DNA/RNA Visualization Technique Margaret H. Dunham Donya Quick Southern Methodist University Margaret H. Dunham and Donya.

3/14/08, UMKC 21

FCGR Example

Homo sapiens – all mature miRNA

Patterns of length 3

UUC

GUG

Page 20: 3/14/08, UMKC1 TCGR: A Novel DNA/RNA Visualization Technique Margaret H. Dunham Donya Quick Southern Methodist University Margaret H. Dunham and Donya.

3/14/08, UMKC 22

Outline Introduction

Background CGR/FCGR Motivation Research Objective

TCGR EMM miRNA Prediction using TCGR/EMM Conclusion / Future Work

Page 21: 3/14/08, UMKC1 TCGR: A Novel DNA/RNA Visualization Technique Margaret H. Dunham Donya Quick Southern Methodist University Margaret H. Dunham and Donya.

3/14/08, UMKC 23

Motivation

2000bp Flanking Upstream Region mir-258.2 in C elegans

a) All 2000 bp b) First 240 bp b) Last 240 bp

Page 22: 3/14/08, UMKC1 TCGR: A Novel DNA/RNA Visualization Technique Margaret H. Dunham Donya Quick Southern Methodist University Margaret H. Dunham and Donya.

3/14/08, UMKC 24

Research Objectives Identify, develop, and implement algorithms which

can be used for identifying potential miRNA

functions.

Create an online tool which can be used by other

researchers to apply our algorithms to new data.

Page 23: 3/14/08, UMKC1 TCGR: A Novel DNA/RNA Visualization Technique Margaret H. Dunham Donya Quick Southern Methodist University Margaret H. Dunham and Donya.

3/14/08, UMKC 25

Outline Introduction

CGR/FCGR miRNA Motivation Research Objective

TCGR EMM miRNA Prediction using TCGR/EMM Conclusion / Future Work

Page 24: 3/14/08, UMKC1 TCGR: A Novel DNA/RNA Visualization Technique Margaret H. Dunham Donya Quick Southern Methodist University Margaret H. Dunham and Donya.

3/14/08, UMKC 26

Temporal CGR (TCGR)

Temporal version of Frequency CGR In our context temporal means the starting location of a window

2D Array Each Row represents counts for a particular window in sequence

• First row – first window• Last row – last window • We start successive windows at the next character location

Each Column represents the counts for the associated pattern in that window• Initially we have assumed order of patterns is alphabetic

Size of TCGR depends on sequence length and subpattern lengt As sequence lengths vary, we only examine complete windows We only count patterns completely contained in each window.

Page 25: 3/14/08, UMKC1 TCGR: A Novel DNA/RNA Visualization Technique Margaret H. Dunham Donya Quick Southern Methodist University Margaret H. Dunham and Donya.

3/14/08, UMKC 27

TCGR Exampleacgtgcacgtaactgattccggaaccaaatgtgcccacgtcga

Moving Window

A C G T

Pos 0-8 2 3 3 1

Pos 1-9 1 3 3 2

…Pos 34-42 2 4 2 1

A C G T

Pos 0-8 0.4 0.6 0.6 0.2

Pos 1-9 0.2 0.6 0.6 0.4

…Pos 34-42 0.4 0.8 0.4 0.2

Page 26: 3/14/08, UMKC1 TCGR: A Novel DNA/RNA Visualization Technique Margaret H. Dunham Donya Quick Southern Methodist University Margaret H. Dunham and Donya.

3/14/08, UMKC 28

TCGR Example (cont’d)

TCGRs for Sub-patterns of length 1, 2, and 3

Page 27: 3/14/08, UMKC1 TCGR: A Novel DNA/RNA Visualization Technique Margaret H. Dunham Donya Quick Southern Methodist University Margaret H. Dunham and Donya.

3/14/08, UMKC 29

TCGR Example (cont’d)

Window 0: Pos 0-8Window 1: Pos 1-9

Window 17: Pos 17-25Window 18: Pos 18-26

Window 34: Pos 34-42

acgtgcacgcgtgcacgt

tccggaaccccggaacca

ccacgtcga

A C G T

Page 28: 3/14/08, UMKC1 TCGR: A Novel DNA/RNA Visualization Technique Margaret H. Dunham Donya Quick Southern Methodist University Margaret H. Dunham and Donya.

3/14/08, UMKC 31

TCGR – Mature miRNA(Window=5; Pattern=3)

All Mature

Mus musculus

Homo sapiens

C. elegans

ACG CGC GCG UCG

Page 29: 3/14/08, UMKC1 TCGR: A Novel DNA/RNA Visualization Technique Margaret H. Dunham Donya Quick Southern Methodist University Margaret H. Dunham and Donya.

3/14/08, UMKC 32

Outline Introduction

CGR/FCGR miRNA Motivation Research Objective

TCGR EMM miRNA Prediction using TCGR/EMM Conclusion / Future Work

Page 30: 3/14/08, UMKC1 TCGR: A Novel DNA/RNA Visualization Technique Margaret H. Dunham Donya Quick Southern Methodist University Margaret H. Dunham and Donya.

3/14/08, UMKC 33

EMM Overview Time Varying Discrete First Order Markov

Model Nodes are clusters of real world states. Learning continues during prediction phase. Learning:

Transition probabilities between nodes Node labels (centroid of cluster) Nodes are added and removed as data

arrives

Page 31: 3/14/08, UMKC1 TCGR: A Novel DNA/RNA Visualization Technique Margaret H. Dunham Donya Quick Southern Methodist University Margaret H. Dunham and Donya.

3/14/08, UMKC 34

EMM DefinitionExtensible Markov Model (EMM): at any time t, EMM

consists of an MC with designated current node, Nn, and algorithms to modify it, where algorithms include:

EMMCluster, which defines a technique for matching between input data at time t + 1 and existing states in the MC at time t.

EMMIncrement algorithm, which updates MC at time t + 1 given the MC at time t and clustering measure result at time t + 1.

EMMDecrement algorithm, which removes nodes from the EMM when needed.

Page 32: 3/14/08, UMKC1 TCGR: A Novel DNA/RNA Visualization Technique Margaret H. Dunham Donya Quick Southern Methodist University Margaret H. Dunham and Donya.

3/14/08, UMKC 35

EMM Cluster

Find closest node to incoming event.

If none “close” create new node Labeling of cluster is centroid of

members in cluster O(n)

Page 33: 3/14/08, UMKC1 TCGR: A Novel DNA/RNA Visualization Technique Margaret H. Dunham Donya Quick Southern Methodist University Margaret H. Dunham and Donya.

3/14/08, UMKC 36

EMM Increment

<18,10,3,3,1,0,0><18,10,3,3,1,0,0>

<17,10,2,3,1,0,0><17,10,2,3,1,0,0>

<16,9,2,3,1,0,0><16,9,2,3,1,0,0>

<14,8,2,3,1,0,0><14,8,2,3,1,0,0>

<14,8,2,3,0,0,0><14,8,2,3,0,0,0>

<18,10,3,3,1,1,0.><18,10,3,3,1,1,0.>

1/3

N1

N2

2/3

N3

1/11/3

N1

N2

2/3

1/1

N3

1/1

1/2

1/3

N1

N2

2/31/2

1/2

N3

1/1

2/3

1/3

N1

N2

N1

2/21/1

N1

1

Page 34: 3/14/08, UMKC1 TCGR: A Novel DNA/RNA Visualization Technique Margaret H. Dunham Donya Quick Southern Methodist University Margaret H. Dunham and Donya.

3/14/08, UMKC 37

Outline Introduction

CGR/FCGR miRNA Motivation Research Objective

TCGR EMM miRNA Prediction using TCGR/EMM Conclusion / Future Work

Page 35: 3/14/08, UMKC1 TCGR: A Novel DNA/RNA Visualization Technique Margaret H. Dunham Donya Quick Southern Methodist University Margaret H. Dunham and Donya.

3/14/08, UMKC 38

Research Approach

1. Represent potential miRNA sequence with TCGR sequence of count vectors

2. Create EMM using count vectors for known miRNA (miRNA stem loops, miRNA targets)

3. Predict unknown sequence to be miRNA (miRNA stem loop, miRNA target) based on normalized product of transition probabilities along clustering path in EMM

Page 36: 3/14/08, UMKC1 TCGR: A Novel DNA/RNA Visualization Technique Margaret H. Dunham Donya Quick Southern Methodist University Margaret H. Dunham and Donya.

3/14/08, UMKC 39

Related Work 1

Predicted occurrence of pre-miRNA segments form a set of hairpin sequences

No assumptions about biological function or conservation across species.

Used SVMs to differentiate the structure of hiarpin segments that contained pre-miRNAs from those that did not.

Sensitivey of 93.3% Specificity of 88.1%

1 C. Xue, F. Li, T. He, G. Liu, Y. Li, nad X. Zhang, “Classification of Real and Pseudo MicroRNA Precursors using Local Structure-Sequence Features and Support Vector Machine,” BMC Bioinformatics, vol 6, no 310.

Page 37: 3/14/08, UMKC1 TCGR: A Novel DNA/RNA Visualization Technique Margaret H. Dunham Donya Quick Southern Methodist University Margaret H. Dunham and Donya.

3/14/08, UMKC 40

Preliminary Test Data1

Positive Training: This dataset consists of 163 human pre-miRNAs with lengths of 62-119.

Negative Training: This dataset was obtained from protein coding regions of human RefSeq genes. As these are from coding regions it is likely that there are no true pre-miRNAs in this data. This dataset contains 168 sequences with lengths between 63 and 110 characters.

Positive Test: This dataset contains 30 pre-miRNAs. Negative Test: This dataset contains 1000 randomly

chosen sequences from coding regions.

1 C. Xue, F. Li, T. He, G. Liu, Y. Li, nad X. Zhang, “Classification of Real and Pseudo MicroRNA Precursors using Local Structure-Sequence Features and Support Vector Machine,” BMC Bioinformatics, vol 6, no 310.

Page 38: 3/14/08, UMKC1 TCGR: A Novel DNA/RNA Visualization Technique Margaret H. Dunham Donya Quick Southern Methodist University Margaret H. Dunham and Donya.

3/14/08, UMKC 41

POSITIVENEGATIVE

TCGRs for Xue Training Data

Page 39: 3/14/08, UMKC1 TCGR: A Novel DNA/RNA Visualization Technique Margaret H. Dunham Donya Quick Southern Methodist University Margaret H. Dunham and Donya.

3/14/08, UMKC 42

POSITIVE

NEGATIVE

TCGRs for Xue Test Data

Page 40: 3/14/08, UMKC1 TCGR: A Novel DNA/RNA Visualization Technique Margaret H. Dunham Donya Quick Southern Methodist University Margaret H. Dunham and Donya.

3/14/08, UMKC 43

Predictive Probabilities with Xue’s Data

EMM Test Data Mean Std Dev Max Min

Negative Test-Neg 0 0 0 0

Test-Pos 0 0 0 0

Train-Neg 0.37963 0.050085 0.91256 0.2945

Train-Pos 0 0 0 0

Positive Test-Neg 0 0 0 0

Test-Pos 0.25894 0.18701 0.42075 0

Train-Neg 0 0 0 0

Train-Pos 0.38926 0.048439 0.91155 0.32209

Page 41: 3/14/08, UMKC1 TCGR: A Novel DNA/RNA Visualization Technique Margaret H. Dunham Donya Quick Southern Methodist University Margaret H. Dunham and Donya.

3/14/08, UMKC 44

Preliminary Test Results

Positive EMM Cutoff Probability = 0.3 False Positive Rate = 0% True Positive Rate = 66%

Test results could be improved by meta classifiers combining multiple positive and negative classifiers together.

Page 42: 3/14/08, UMKC1 TCGR: A Novel DNA/RNA Visualization Technique Margaret H. Dunham Donya Quick Southern Methodist University Margaret H. Dunham and Donya.

3/14/08, UMKC 45

Outline Introduction

CGR/FCGR miRNA Motivation Research Objective

TCGR EMM miRNA Prediction using TCGR/EMMConclusion / Future Work

Page 43: 3/14/08, UMKC1 TCGR: A Novel DNA/RNA Visualization Technique Margaret H. Dunham Donya Quick Southern Methodist University Margaret H. Dunham and Donya.

3/14/08, UMKC 46

Future Research1. Obtain all known mature miRNA sequences for a species – initially the 119 C. elegans

miRNAs.

2. Create TCGR count vectors for each sequence and each sub-pattern length (1,2,3,4,5).

3. Train EMMs using this data for each sub-pattern length. Thus five EMMs will be created

4. Obtain negative data (much as Xue did in his research) from coding regions for C. elegans.

5. Train EMMs using this data for each sub-pattern length. Thus five EMMs will be created

6. Construct a meta-classifier based on the combined results of prediction from each of these ten EMMs.

7. Apply the EMM classifier to the existing ~75x106 base pairs of non-exonic sequence in the C. elegans genome to search for miRNAs. Note: all 119 validated C. elegans miRNAs are contained in the non-exonic part of the genome and thus the first pass of the algorithm will be tested for its ability to detect all 119 validated miRNAs.

8. Validate the prediction of novel miRNAs using molecular biology.