DNA Classifications with Self-Organizing Maps (SOMs), Thanakorn Naenna, Mark J. Embrechts, Robert A. Bress, May 2003 IEEE International Workshop on Soft Computing in Industrial Application


DNA Classifications with Self-Organizing Maps (SOMs)

Thanakorn Naenna, Mark J. Embrechts, Robert A. Bress

May 2003 IEEE International Workshop on Soft Computing in Industrial Application


Presentation Outline

• Introduction to DNA Splice Junctions
• Data Collection
• Introduction to SOMs
• SOM for DNA Splice Junction Classification
• Results
• Conclusions


Human genome in a nutshell

• Human: 23 pairs of chromosomes
• Chromosomes → thousands of genes
• Gene information: exons; "comments": introns
  Splice junctions are like /* comment flags */ in C code
• Exons and introns → codons
• Codons → bases


DNA Splice Junctions

• DNA: billions of nucleotides (A, C, G, T)
• Genes: sequences coding for amino acids (exons) that are often interrupted by non-coding nucleotides (introns)
• <0.1% of human DNA is made up of exons
• 99% of splice junctions have the same motif:
  – Exon to intron: GT
  – Intron to exon: AG

….GTGAAGGTTAA AGATGTAGAT GT ATTG…
Intron | splice junction (AG) | exon | splice junction (GT) | intron
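The GT and AG motifs mark where a junction can occur, but the same dinucleotides also appear by chance elsewhere in a sequence. A minimal sketch of a motif scan follows; the function name `find_candidate_sites` is hypothetical and is not taken from the presentation. It lists every GT or AG position in the example sequence from this slide, with the spaces removed.

```python
def find_candidate_sites(seq):
    """Return (position, motif) for every GT or AG dinucleotide in seq."""
    seq = seq.upper()
    sites = []
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in ("GT", "AG"):
            sites.append((i, pair))
    return sites


# The slide's example sequence, written without spaces.
example = "GTGAAGGTTAAAGATGTAGATGTATTG"
for position, motif in find_candidate_sites(example):
    print(position, motif)
```

The scan reports seven candidate positions in this short fragment, although the diagram marks only two true junctions. This is why a "no transition" class and a trained classifier are needed on top of the motif itself.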


Data Collection: HTML Browser + Perl scripts

BioBrowser: Download HTML → ExtractLinks() → Download HTML data → ExtractData() → TranslateData()


DNA Splice Junction (Cont.)

• A complete gene is made up of different exons
• Splice junction identification aids in the discovery of new genes
• The dataset used for this study is made up of 1,424 sequences
• Data were created ab initio from GenBank
• Each sequence is 32 nucleotides long, comprising the regions from -15 to +15 nucleotides around the splice junction

…TGTAAGG AG ACGAGTT…
Intron | Splice Junction | Exon

Left Regions   Splice Junction   Right Regions   Classes
Intron         AG                Exon            A
Exon           GT                Intron          B
Unknown        AG or GT          Unknown         C
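The 32-nucleotide layout (15 to the left, the 2-nucleotide junction, 15 to the right) can be sketched as a window extraction followed by a numeric encoding. The slides do not say how nucleotides were turned into features, so the one-hot scheme below (4 values per base, 128 features per window) is an assumption, as are the names `extract_window` and `one_hot`. The demo slide later reports 160 features, which this particular scheme would not reproduce.

```python
import numpy as np

FLANK = 15      # nucleotides kept on each side of the junction (from the slide)
BASES = "ACGT"  # assumed alphabet; other symbols are not handled here


def extract_window(seq, pos, flank=FLANK):
    """Return flank + 2 + flank nucleotides around the dinucleotide at pos.

    Returns None when the site is too close to either end of seq.
    """
    start, end = pos - flank, pos + 2 + flank
    if start < 0 or end > len(seq):
        return None
    return seq[start:end]


def one_hot(window):
    """Encode a window as a flat 0/1 vector, four values per nucleotide."""
    vec = np.zeros((len(window), len(BASES)))
    for i, base in enumerate(window):
        vec[i, BASES.index(base)] = 1.0  # raises ValueError on unknown symbols
    return vec.ravel()


sequence = "A" * 15 + "AG" + "C" * 15
window = extract_window(sequence, 15)
print(len(window), one_hot(window).shape)
```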


Self-Organizing Maps (SOM) Network

• Unsupervised learning neural network

• Projects high-dimensional input data onto two-dimensional output map

• Preserves the topology of the input data

• Visualizes structures and clusters of the data

Figure: an input layer with five components (Component 1 to Component 5) connected to the output layer; output neurons i and c are attached to the inputs through the weights w_i1 … w_i5 and w_c1 … w_c5.
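The properties listed above follow from how a SOM is trained: each input is matched to its closest neuron, and that neuron and its grid neighbors are pulled toward the input. The following is a minimal NumPy sketch of that generic procedure, not the authors' implementation. The parameter names `num_its`, `alpha_max`, `alpha_min`, and `max_neighborhood` echo labels that appear on the demo slide, but the linear decay and the Gaussian neighborhood used here are assumptions.

```python
import numpy as np


def train_som(data, rows, cols, num_its=1000, alpha_max=0.9, alpha_min=0.05,
              max_neighborhood=3.0, seed=0):
    """Train a rows x cols map on data (n_samples x dim); return the weights."""
    rng = np.random.default_rng(seed)
    weights = rng.random((rows, cols, data.shape[1]))
    # Grid coordinates of every neuron, shape (rows, cols, 2).
    grid = np.stack(np.meshgrid(np.arange(rows), np.arange(cols),
                                indexing="ij"), axis=-1)
    for t in range(num_its):
        frac = t / max(num_its - 1, 1)
        alpha = alpha_max + frac * (alpha_min - alpha_max)  # assumed linear decay
        radius = max(max_neighborhood * (1.0 - frac), 0.5)  # shrinking neighborhood
        x = data[rng.integers(len(data))]
        winner = best_matching_unit(weights, x)
        grid_d2 = np.sum((grid - np.array(winner)) ** 2, axis=2)
        h = np.exp(-grid_d2 / (2.0 * radius ** 2))          # Gaussian neighborhood
        weights += alpha * h[..., None] * (x - weights)
    return weights


def best_matching_unit(weights, x):
    """Return the (row, col) of the neuron whose weight vector is closest to x."""
    dists = np.linalg.norm(weights - x, axis=2)
    return tuple(int(i) for i in np.unravel_index(np.argmin(dists), dists.shape))


data = np.vstack([np.zeros((20, 3)), np.ones((20, 3))])
som = train_som(data, 4, 4, num_its=500)
print(best_matching_unit(som, np.zeros(3)), best_matching_unit(som, np.ones(3)))
```

Because neighbors on the grid are updated together, inputs that are close in the original space tend to land on nearby neurons, which is what the topology-preservation bullet refers to.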


Use of SOM for DNA Splice Junction Classification Model

Figure: the DNA training set is presented to the SOM, which yields a U-matrix map and a SOM classification map with regions A, B, and C; the DNA test set is then classified on that map.

Neuron identification methods
- Highest frequency class
- Closest neuron

Classification
- Class A: intron to exon
- Class B: exon to intron
- Class C: no transition
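The slide names two neuron identification methods without defining them. One plausible reading, sketched below as an assumption, is that "highest frequency class" labels each neuron with the majority class of the training samples it wins, and "closest neuron" handles neurons that won no training samples by borrowing the label of the nearest labeled neuron. The function names are hypothetical.

```python
from collections import Counter

import numpy as np


def bmu(weights, x):
    """Grid position of the neuron whose weight vector is closest to x."""
    d = np.linalg.norm(weights - x, axis=2)
    return tuple(int(i) for i in np.unravel_index(np.argmin(d), d.shape))


def label_neurons(weights, train_x, train_y):
    """Highest frequency class: majority label of the samples each neuron wins."""
    votes = {}
    for x, y in zip(train_x, train_y):
        votes.setdefault(bmu(weights, x), Counter())[y] += 1
    return {node: counts.most_common(1)[0][0] for node, counts in votes.items()}


def classify(weights, labels, x):
    """Label x by its winning neuron, falling back to the closest labeled one."""
    node = bmu(weights, x)
    if node in labels:
        return labels[node]
    # Closest neuron (assumed meaning): nearest labeled weight vector to x.
    nearest = min(labels, key=lambda n: np.linalg.norm(weights[n] - x))
    return labels[nearest]


# Tiny 1 x 3 map with hand-picked weights, for illustration only.
weights = np.array([[[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]])
train_x = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0]])
train_y = ["A", "A", "B"]
labels = label_neurons(weights, train_x, train_y)
print(labels, classify(weights, labels, np.array([5.0, 5.0])))
```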


The U-matrix of the DNA Training Set
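A U-matrix shows, for each neuron, how far its weight vector is from those of its grid neighbors; high values mark cluster boundaries. The sketch below assumes a rectangular grid with 4-connected neighbors and Euclidean distance, which may differ from the layout used to produce the figure. It expects a map with at least two neurons.

```python
import numpy as np


def u_matrix(weights):
    """Mean distance from each neuron's weights to its 4-connected neighbors."""
    rows, cols, _ = weights.shape
    u = np.zeros((rows, cols))
    for r in range(rows):
        for c in range(cols):
            dists = []
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols:
                    dists.append(np.linalg.norm(weights[r, c] - weights[nr, nc]))
            u[r, c] = np.mean(dists)
    return u


# A 1 x 4 map whose left half sits at 0 and right half at 1:
# the boundary between the halves shows up as the larger values.
weights = np.zeros((1, 4, 1))
weights[0, 2:, 0] = 1.0
print(u_matrix(weights))
```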


SOM Results for DNA Splice Junction Data

Figure: the U-matrix of the DNA training set, with map regions labeled A, B, and C.

Confusion matrix of 424-DNA test set (rows: DNA sequences by class; columns: classified to)

DNA sequences   Class A     Class B    Class C     Total
Class A         102 (93%)   2 (2%)     6 (5%)      110
Class B         0 (0%)      90 (91%)   9 (9%)      99
Class C         4 (2%)      6 (3%)     205 (95%)   215
Total           106         98         220         424
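The percentages in the confusion matrix for the 424-sequence test set can be reproduced directly from the counts. The sketch below takes rows as the true class and columns as the assigned class, which matches the row-wise percentages on the slide; the per-column precision and the overall accuracy are derived here and are not stated in the presentation.

```python
import numpy as np

# Rows: true class A, B, C. Columns: class assigned by the SOM.
cm = np.array([[102, 2, 6],
               [0, 90, 9],
               [4, 6, 205]])

recall = cm.diagonal() / cm.sum(axis=1)     # share of each true class recovered
precision = cm.diagonal() / cm.sum(axis=0)  # share of each assigned class that is right
accuracy = cm.trace() / cm.sum()            # overall share classified correctly

for name, r, p in zip("ABC", recall, precision):
    print(f"Class {name}: recall {r:.3f}, precision {p:.3f}")
print(f"Overall accuracy: {accuracy:.3f}")
```

With these counts, 397 of the 424 test sequences are classified correctly, which is about 93.6% overall.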


Conclusions

• SOM is effective in DNA splice junction classification
• SOM is a powerful visualization tool for high-dimensional data


Demo with Analyze Code

• 800 training data, 324 test data (160 features)
• 96% correct overall classification on test data

Confusion Matrix

         IE    FALSE   EI
IE       98    0       0
FALSE    5     111     3
EI       2     3       102
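The 96% figure can be checked against the demo confusion matrix by dividing the diagonal (correctly classified) counts by the 324 test sequences. This check does not depend on which axis holds the true class:

```latex
\[
\text{accuracy} = \frac{98 + 111 + 102}{324} = \frac{311}{324} \approx 0.960
\]
```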

9       // K
18      // L
6       // max_neighborhood
20000   // num_its
50000   // num_fine_its
0.9     // alpha_max
0.05    // alpha_min
1       // LVQ_flag


GATCAATGAGGTGGACACCAGAGGCGGGGACTTGTAAATAACACTGGGCTGTAGGAGTGA

TGGGGTTCACCTCTAATTCTAAGATGGCTAGATAATGCATCTTTCAGGGTTGTGCTTCTA

TCTAGAAGGTAGAGCTGTGGTCGTTCAATAAAAGTCCTCAAGAGGTTGGTTAATACGCAT

GTTTAATAGTACAGTATGGTGACTATAGTCAACAATAATTTATTGTACATTTTTAAATAG

CTAGAAGAAAAGCATTGGGAAGTTTCCAACATGAAGAAAAGATAAATGGTCAAGGGAATG

GATATCCTAATTACCCTGATTTGATCATTATGCATTATATACATGAATCAAAATATCACA

CATACCTTCAAACTATGTACAAATATTATATACCAATAAAAAATCATCATCATCATCTCC

ATCATCACCACCCTCCTCCTCATCACCACCAGCATCACCACCATCATCACCACCACCATC

ATCACCACCACCACTGCCATCATCATCACCACCACTGTGCCATCATCATCACCACCACTG

TCATTATCACCACCACCATCATCACCAACACCACTGCCATCGTCATCACCACCACTGTCA

TTATCACCACCACCATCACCAACATCACCACCACCATTATCACCACCATCAACACCACCA

CCCCCATCATCATCATCACTACTACCATCATTACCAGCACCACCACCACTATCACCACCA

CCACCACAATCACCATCACCACTATCATCAACATCATCACTACCACCATCACCAACACCA

CCATCATTATCACCACCACCACCATCACCAACATCACCACCATCATCATCACCACCATCA

CCAAGACCATCATCATCACCATCACCACCAACATCACCACCATCACCAACACCACCATCA

CCACCACCACCACCATCATCACCACCACCACCATCATCATCACCACCACCGCCATCATCA

TCGCCACCACCATGACCACCACCATCACAACCATCACCACCATCACAACCACCATCATCA

CTATCGCTATCACCACCATCACCATTACCACCACCATTACTACAACCATGACCATCACCA

CCATCACCACCACCATCACAACGATCACCATCACAGCCACCATCATCACCACCACCACCA

CCACCATCACCATCAAACCATCGGCATTATTATTTTTTTAGAATTTTGTTGGGATTCAGT

ATCTGCCAAGATACCCATTCTTAAAACATGAAAAAGCAGCTGACCCTCCTGTGGCCCCCT

TTTTGGGCAGTCATTGCAGGACCTCATCCCCAAGCAGCAGCTCTGGTGGCATACAGGCAA

CCCACCACCAAGGTAGAGGGTAATTGAGCAGAAAAGCCACTTCCTCCAGCAGTTCCCTGT


THE END


AAAAGCATTGGGAA

GGTTC

CCGTTGAAC

GGTCAGGTTAGACTA

EXTRACTING KNOWLEDGE


NUCLEOTIDES

Figure: the four nucleotides A, T, C, and G.

• DNA is double-stranded
• A & T are complements
• G & C are complements
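Under standard base pairing (A with T, C with G), the second strand of a double-stranded sequence can be derived from the first. A minimal sketch, with hypothetical function names, follows. The reverse complement is the form usually reported, because the two strands are read in opposite directions.

```python
# Standard base pairing: A <-> T, C <-> G.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def complement(strand):
    """Return the base-by-base complement of strand."""
    return strand.upper().translate(COMPLEMENT)


def reverse_complement(strand):
    """Return the complementary strand read in its own direction."""
    return complement(strand)[::-1]


print(complement("AACCGGTT"))
print(reverse_complement("GTAAG"))
```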


AMINO ACIDS

• Sequences of three nucleotides ("codons") code for amino acids

• There are 20 different amino acids

• Exons are the parts of DNA that code for amino acids

• Each amino acid is encoded by between 1 and 6 different codons
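The counts on this slide (20 amino acids, between 1 and 6 codons each) can be checked against the standard genetic code. The sketch below builds the 64-entry codon table from the conventional T, C, A, G ordering and uses one-letter amino acid codes, with `*` marking a stop codon. It is an illustration and not part of the authors' pipeline; `translate` is a hypothetical helper.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
# Standard genetic code in TCAG order; '*' marks the stop codons.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(seq):
    """Translate seq codon by codon, ignoring any trailing partial codon."""
    seq = seq.upper()
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, len(seq) - 2, 3))


degeneracy = Counter(CODON_TABLE.values())
amino_acids = {aa: n for aa, n in degeneracy.items() if aa != "*"}
print(len(amino_acids), min(amino_acids.values()), max(amino_acids.values()))
print(translate("ATGGCCTAA"))
```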


PROTEINS

• Proteins are made up of sequences of amino acids

• Generally responsible for some biological function

• May have complicated folding patterns that are difficult to predict


GENES

• 30,000 – 100,000 genes exist in the human genome

• Most genes have not yet been discovered

• Genes code for sequences of amino acids

• Genes are interrupted by non-coding regions of DNA ("introns")


CHROMOSOMES


READING FRAMES

…ACG TAGAT…

• Reading frames may be difficult to determine

• Reading frames may be shifted by splice junctions
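The ambiguity of reading frames can be seen by splitting the slide's fragment into codons at each of the three possible offsets. The helper name `reading_frames` is hypothetical, and only the forward strand is considered in this sketch.

```python
def reading_frames(seq):
    """Return the complete codons of seq for each of the three forward offsets."""
    frames = {}
    for offset in range(3):
        frames[offset] = [seq[i:i + 3] for i in range(offset, len(seq) - 2, 3)]
    return frames


# The fragment from the slide, written without the space.
for offset, codons in reading_frames("ACGTAGAT").items():
    print(offset, codons)
```

Only the frame at offset 0 reads TAG, a stop codon, so the choice of frame changes whether this fragment appears to end a coding region.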


GENE STRUCTURE

Start Codon (ATG)

Exon sequence (amino acid string)

Intron sequence (junk DNA)

Stop Codon (3 possible)
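The start and stop markers listed above can be sketched as a simple scan: find the first ATG, then step three bases at a time until one of the three stop codons (TAA, TAG, TGA) appears. This simplified illustration ignores introns entirely, so it only applies to sequence that has already been spliced. The function name `find_orf` is hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orf(seq):
    """Return the stretch from the first ATG through the first in-frame stop.

    Returns None if there is no start codon or no in-frame stop codon.
    """
    seq = seq.upper()
    start = seq.find("ATG")
    if start == -1:
        return None
    for i in range(start + 3, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:
            return seq[start:i + 3]
    return None


print(find_orf("CCATGAAATGGTAGCC"))
```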


SPLICE JUNCTIONS

• Segments of DNA that join coding and non-coding regions