DNA Classifications with Self-Organizing Maps (SOMs)

Thanakorn Naenna, Mark J. Embrechts, Robert A. Bress

May 2003 IEEE International Workshop on Soft Computing in Industrial Application

Presentation Outline

• Introduction to DNA Splice Junctions
• Data Collection
• Introduction to SOMs
• SOM for DNA Splice Junction Classification
• Results
• Conclusions

Human genome in a nutshell

• Human: 23 pairs of chromosomes
• Chromosomes: thousands of genes
• Gene: information in exons, "comments" in introns
   Splice junctions are like /* comment flags */ in C code
• Exons and introns: codons
• Codon: 3 bases

DNA Splice Junctions

• DNA: billions of nucleotides (A, C, G, T)
• Genes: coding sequences (exons) that are often interrupted by non-coding nucleotides (introns)
• Only a small fraction of human DNA (on the order of 1%) is made up of exons
• 99% of splice junctions have the same motif:
   – Exon to intron: GT
   – Intron to exon: AG

…GTGAAGGTTAA AGATGTAGAT GT ATTG…

(Figure labels, left to right: Intron, Splice Junction, Exon, Splice Junction, Intron)
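
The GT and AG motifs above can be located with a simple scan. The following is a minimal sketch, not the authors' code; the function name `candidate_sites` is hypothetical. It only lists candidate positions, since most GT and AG dinucleotides in a sequence are not real splice junctions.

```python
import re

def candidate_sites(seq):
    """Return (position, motif) for every GT or AG dinucleotide in seq."""
    seq = seq.upper()
    # A lookahead keeps overlapping hits, such as the AG and GT inside "AGT".
    return [(m.start(), m.group(1)) for m in re.finditer(r"(?=(GT|AG))", seq)]

# Slide sequence with the spaces removed.
example = "GTGAAGGTTAAAGATGTAGATGTATTG"
sites = candidate_sites(example)
donors = [p for p, motif in sites if motif == "GT"]     # possible exon-to-intron sites
acceptors = [p for p, motif in sites if motif == "AG"]  # possible intron-to-exon sites
print("GT candidates:", donors)
print("AG candidates:", acceptors)
```

Because the motif alone produces many false candidates, a classifier that looks at the surrounding nucleotides is needed to separate true junctions from chance matches.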

Data Collection: HTML Browser + Perl scripts

BioBrowser: Download HTML → ExtractLinks() → Download HTML data → ExtractData() → TranslateData()

DNA Splice Junction (Cont.)

• A complete gene is made up of different exons
• Splice junction identification aids in the discovery of new genes
• The dataset used for this study is made up of 1,424 sequences
• Data were created ab initio from GENBANK
• Each sequence is 32 nucleotides long: the 2-nucleotide splice junction plus regions comprising -15 to +15 nucleotides around it

…TGTAAGG AG ACGAGTT…
(Figure labels, left to right: Intron, Splice Junction, Exon)

Left Region   Splice Junction   Right Region   Class
Intron        AG                Exon           A
Exon          GT                Intron         B
Unknown       AG or GT          Unknown        C
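
A 32-nucleotide sample of this kind can be cut from a longer sequence and turned into numeric SOM inputs roughly as follows. This is a sketch under assumptions: the helper names `window` and `one_hot` are hypothetical, and the 4-values-per-base encoding is a common choice rather than the encoding used in the study, which the slides do not describe.

```python
def window(seq, junction_start, flank=15):
    """Return flank + 2 + flank nucleotides around a junction, or None near the ends."""
    lo = junction_start - flank
    hi = junction_start + 2 + flank
    if lo < 0 or hi > len(seq):
        return None
    return seq[lo:hi]

ALPHABET = "ACGT"

def one_hot(window_seq):
    """Flatten a sequence into 4 binary features per nucleotide (assumed encoding)."""
    features = []
    for base in window_seq:
        features.extend(1.0 if base == symbol else 0.0 for symbol in ALPHABET)
    return features

seq = "A" * 15 + "GT" + "C" * 15
sample = window(seq, 15)
print(len(sample), len(one_hot(sample)))
```

With four values per base a 32-nucleotide window gives 128 features. The demo slide reports 160 features, which would correspond to five values per position, so the encoding actually used was presumably different.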

Self-Organizing Maps (SOM) Network

• Unsupervised learning neural network

• Projects high-dimensional input data onto two-dimensional output map

• Preserves the topology of the input data

• Visualizes structures and clusters of the data

(Figure: SOM network. An input layer with Components 1 to 5 feeds a two-dimensional output layer; output neurons i and c carry weight vectors (w_i1, …, w_i5) and (w_c1, …, w_c5).)
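
How a SOM projects inputs onto a two-dimensional map can be illustrated with a small NumPy training loop. This is a minimal sketch of the standard online SOM update, not the software used in the study; the function name `train_som`, the rectangular grid, the Gaussian neighborhood, and the linear decay schedules are all assumptions made for illustration.

```python
import numpy as np

def train_som(data, rows=6, cols=6, n_iter=2000, alpha_max=0.9, alpha_min=0.05, seed=0):
    """Train a rectangular SOM; returns weights of shape (rows, cols, n_features)."""
    rng = np.random.default_rng(seed)
    weights = rng.random((rows, cols, data.shape[1]))
    # Grid coordinates of every output neuron, used for neighborhood distances.
    grid = np.stack(np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij"), axis=-1)
    sigma_max = max(rows, cols) / 2.0
    for t in range(n_iter):
        frac = t / n_iter
        alpha = alpha_max + (alpha_min - alpha_max) * frac  # learning rate decays linearly
        sigma = sigma_max * (1.0 - frac) + 0.5 * frac       # neighborhood radius shrinks
        x = data[rng.integers(len(data))]
        # Best-matching unit c: the neuron whose weight vector is closest to x.
        dists = np.linalg.norm(weights - x, axis=2)
        c = np.unravel_index(np.argmin(dists), dists.shape)
        # Gaussian neighborhood around c, measured on the output grid.
        grid_d2 = np.sum((grid - np.array(c)) ** 2, axis=2)
        h = np.exp(-grid_d2 / (2.0 * sigma ** 2))
        # Move every neuron toward x, weighted by its closeness to c.
        weights += alpha * h[..., None] * (x - weights)
    return weights

# Toy data: two clusters in three dimensions.
data = np.vstack([np.zeros((20, 3)), np.ones((20, 3))])
som = train_som(data, rows=4, cols=4, n_iter=500)
print(som.shape)
```

Because neighbors of the winning neuron are updated together, nearby neurons end up with similar weight vectors, which is what preserves the topology of the input data on the map.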

Use of SOM for DNA Splice Junction Classification Model

Training: DNA training set → SOM → SOM classification map (U-matrix map with regions A, B, C)

Testing: DNA test set → SOM classification map → Classification

Neuron identification methods

- Highest frequency class

- Closest neuron

Classification

- Class A: intron to exon

- Class B: exon to intron

- Class C: no transition
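
The slide names the two neuron identification methods without defining them. The sketch below shows one plausible reading, stated as an assumption: a neuron takes the most frequent class among the training samples it wins, and a sample that lands on an unlabeled neuron is given the class of the closest labeled neuron. All function names are hypothetical.

```python
from collections import Counter
import numpy as np

def bmu(weights, x):
    """Grid index (row, col) of the neuron whose weight vector is closest to x."""
    d = np.linalg.norm(weights - x, axis=2)
    return tuple(int(i) for i in np.unravel_index(np.argmin(d), d.shape))

def label_neurons(weights, train_x, train_y):
    """Highest frequency class: each neuron takes the majority class of its hits."""
    hits = {}
    for x, y in zip(train_x, train_y):
        hits.setdefault(bmu(weights, x), Counter())[y] += 1
    return {node: counts.most_common(1)[0][0] for node, counts in hits.items()}

def classify(weights, labels, x):
    """Use the winning neuron's label, else fall back to the closest labeled neuron."""
    node = bmu(weights, x)
    if node in labels:
        return labels[node]
    # Closest neuron (assumed reading): nearest labeled neuron in weight space.
    nearest = min(labels, key=lambda n: np.linalg.norm(weights[n] - x))
    return labels[nearest]

# Tiny hand-built 2x2 map with 2 features per neuron.
weights = np.array([[[0.0, 0.0], [1.0, 1.0]],
                    [[5.0, 5.0], [9.0, 9.0]]])
train_x = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [9.0, 9.0]])
train_y = ["A", "A", "B", "C"]
labels = label_neurons(weights, train_x, train_y)
print(classify(weights, labels, np.array([6.0, 6.0])))
```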

The U-matrix of the DNA Training Set

SOM Results for DNA Splice Junction Data

Confusion matrix of 424-DNA test set

                Classified to
DNA sequences   Class A     Class B    Class C     Total
Class A         102 (93%)   2 (2%)     6 (5%)      110
Class B         0 (0%)      90 (91%)   9 (9%)      99
Class C         4 (2%)      6 (3%)     205 (95%)   215
Total           106         98         220         424

(Figure: the U-matrix of the DNA training set, with regions labeled A, B, and C)
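
The percentages in the confusion matrix, and the overall accuracy that the slide does not state, can be recomputed from the counts. A short sketch, assuming rows are the true classes and columns the assigned classes:

```python
import numpy as np

# Rows: true classes A, B, C. Columns: classes assigned by the SOM.
confusion = np.array([[102, 2, 6],
                      [0, 90, 9],
                      [4, 6, 205]])

correct = int(np.trace(confusion))                  # 102 + 90 + 205
overall_accuracy = correct / confusion.sum()        # correct / 424
per_class_recall = np.diag(confusion) / confusion.sum(axis=1)

print(f"overall: {overall_accuracy:.1%}")
for name, recall in zip("ABC", per_class_recall):
    print(f"class {name}: {recall:.0%}")
```

The diagonal holds 397 of the 424 test sequences, which is an overall accuracy of about 93.6%.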

Conclusions

• SOM is effective in DNA splice junction classification
• SOM is a powerful visualization tool for high-dimensional data

Demo with Analyze Code

• 800 training data, 324 test data (160 features)
• 96% correct overall classification on test data

Confusion Matrix

        IE    FALSE   EI
IE      98    0       0
FALSE   5     111     3
EI      2     3       102

Parameter settings

9186    // K, L, max_neighborhood (values run together in the transcript)
20000   // num_its
50000   // num_fine_its
0.9     // alpha_max
0.05    // alpha_min
1       // LVQ_flag

GATCAATGAGGTGGACACCAGAGGCGGGGACTTGTAAATAACACTGGGCTGTAGGAGTGA

TGGGGTTCACCTCTAATTCTAAGATGGCTAGATAATGCATCTTTCAGGGTTGTGCTTCTA

TCTAGAAGGTAGAGCTGTGGTCGTTCAATAAAAGTCCTCAAGAGGTTGGTTAATACGCAT

GTTTAATAGTACAGTATGGTGACTATAGTCAACAATAATTTATTGTACATTTTTAAATAG

CTAGAAGAAAAGCATTGGGAAGTTTCCAACATGAAGAAAAGATAAATGGTCAAGGGAATG

GATATCCTAATTACCCTGATTTGATCATTATGCATTATATACATGAATCAAAATATCACA

CATACCTTCAAACTATGTACAAATATTATATACCAATAAAAAATCATCATCATCATCTCC

ATCATCACCACCCTCCTCCTCATCACCACCAGCATCACCACCATCATCACCACCACCATC

ATCACCACCACCACTGCCATCATCATCACCACCACTGTGCCATCATCATCACCACCACTG

TCATTATCACCACCACCATCATCACCAACACCACTGCCATCGTCATCACCACCACTGTCA

TTATCACCACCACCATCACCAACATCACCACCACCATTATCACCACCATCAACACCACCA

CCCCCATCATCATCATCACTACTACCATCATTACCAGCACCACCACCACTATCACCACCA

CCACCACAATCACCATCACCACTATCATCAACATCATCACTACCACCATCACCAACACCA

CCATCATTATCACCACCACCACCATCACCAACATCACCACCATCATCATCACCACCATCA

CCAAGACCATCATCATCACCATCACCACCAACATCACCACCATCACCAACACCACCATCA

CCACCACCACCACCATCATCACCACCACCACCATCATCATCACCACCACCGCCATCATCA

TCGCCACCACCATGACCACCACCATCACAACCATCACCACCATCACAACCACCATCATCA

CTATCGCTATCACCACCATCACCATTACCACCACCATTACTACAACCATGACCATCACCA

CCATCACCACCACCATCACAACGATCACCATCACAGCCACCATCATCACCACCACCACCA

CCACCATCACCATCAAACCATCGGCATTATTATTTTTTTAGAATTTTGTTGGGATTCAGT

ATCTGCCAAGATACCCATTCTTAAAACATGAAAAAGCAGCTGACCCTCCTGTGGCCCCCT

TTTTGGGCAGTCATTGCAGGACCTCATCCCCAAGCAGCAGCTCTGGTGGCATACAGGCAA

CCCACCACCAAGGTAGAGGGTAATTGAGCAGAAAAGCCACTTCCTCCAGCAGTTCCCTGT

THE END

AAAAGCATTGGGAA

GGTTC

CCGTTGAAC

GGTCAGGTTAGACTA

EXTRACTING KNOWLEDGE

NUCLEOTIDES

(Figure: base pairs A-T and C-G)

• DNA is double-stranded
• A & T are complements
• C & G are complements
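
Because DNA is double-stranded, a motif on one strand appears as its reverse complement on the other. A small sketch using the standard Watson-Crick pairing (A with T, C with G); the function name is hypothetical:

```python
# Translation table for the standard base pairing: A<->T, C<->G.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the opposite strand, read in its own 5' to 3' direction."""
    return seq.upper().translate(COMPLEMENT)[::-1]

print(reverse_complement("AAGT"))
```

A splice junction search that covers both strands would scan the sequence and its reverse complement.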

AMINO ACIDS

• Sequences of three nucleotides –“CODONS” – code for amino acids

• There are 20 different amino acids

• Amino acids are encoded by the part of DNA known as exons

• Each amino acid can be encoded by between 1 and 6 different codons
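
The 1-to-6 range can be checked by building the standard genetic code and counting codons per amino acid. A sketch using the usual compact layout of the standard table (bases ordered T, C, A, G; one-letter amino acid codes; `*` marks a stop codon):

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
# One letter per codon, in the order TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO)}

# Count how many codons encode each amino acid, leaving out stop codons.
codons_per_amino_acid = Counter(aa for aa in CODON_TABLE.values() if aa != "*")

print(len(codons_per_amino_acid), "amino acids")
print(min(codons_per_amino_acid.values()), "to",
      max(codons_per_amino_acid.values()), "codons each")
```

The 64 codons cover 20 amino acids plus three stop codons, so most amino acids have several synonymous codons.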

PROTEINS

• Proteins are made up of sequences of amino acids

• Generally responsible for some biological function

• May have complicated folding patterns that are difficult to predict

GENES

• An estimated 30,000 – 100,000 genes exist in the human genome

• Most genes have not yet been discovered

• Genes are made up of sequences of nucleotides that code for amino acids

• Genes are interrupted by non-coding regions of DNA called introns

CHROMOSOMES

READING FRAMES

…ACG TAGAT…

• Reading frames may be difficult to determine

• Reading frames may be shifted by splice junctions
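
The same nucleotides read as different codons depending on where reading starts. A brief sketch of the three forward reading frames, using the fragment from the slide with the space removed; the function name is hypothetical:

```python
def codons_in_frame(seq, frame):
    """Split seq into complete codons starting at offset frame (0, 1, or 2)."""
    return [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]

fragment = "ACGTAGAT"
for frame in range(3):
    print(frame, codons_in_frame(fragment, frame))
```

Removing an intron whose length is not a multiple of three changes which frame the downstream nucleotides fall into, which is one way a splice junction shifts the reading frame.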

GENE STRUCTURE

Start Codon (ATG)

Exon sequence (codes for the amino acid string)

Intron sequence (non-coding "junk" DNA)

Stop Codon (3 possible: TAA, TAG, TGA)
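
The start-to-stop structure can be sketched as a search for an open reading frame. This simplified example assumes an already spliced sequence, so it ignores introns entirely; the function name `first_orf` is hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def first_orf(seq):
    """Return the first ATG...stop stretch read in frame, or None if there is none."""
    seq = seq.upper()
    start = seq.find("ATG")
    while start != -1:
        # Step through codons in the frame set by this ATG.
        for i in range(start + 3, len(seq) - 2, 3):
            if seq[i:i + 3] in STOP_CODONS:
                return seq[start:i + 3]
        # No in-frame stop codon: try the next ATG.
        start = seq.find("ATG", start + 1)
    return None

print(first_orf("CCATGAAATAGGG"))
```

In genomic DNA the exons between the start and stop codons are separated by introns, which is why locating splice junctions is needed before a gene can be read this way.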

SPLICE JUNCTIONS

• Segments of DNA that join coding and non-coding regions