DNA Classifications with Self-Organizing Maps (SOMs)
Date posted: 19-Jan-2016
Thanakorn Naenna, Mark J. Embrechts, Robert A. Bress
May 2003, IEEE International Workshop on Soft Computing in Industrial Applications
Presentation Outline
- Introduction to DNA Splice Junctions
- Data Collection
- Introduction to SOMs
- SOM for DNA Splice Junction Classification
- Results
- Conclusions
Human genome in a nutshell
- Human: 23 pairs of chromosomes
- Chromosomes: thousands of genes
- Gene: information in exons, "comments" in introns
- Splice junctions are like /* comment flags */ in C code
- Exons and introns: codons
- Codon: three bases
DNA Splice Junctions
- DNA: billions of nucleotides (A, C, G, T)
- Genes: coding sequences (exons), which specify amino acids, often interrupted by non-coding nucleotides (introns)
Data Collection: HTML Browser + Perl scripts
DNA Splice Junction (Cont.)
- A complete gene is made up of different exons
- Splice junction identification aids in the discovery of new genes
- The dataset used for this study is made up of 1,424 sequences
- Data were created ab initio from GENBANK
- Each sequence is 32 nucleotides long, comprising the splice junction and the regions from -15 to +15 nucleotides around it
Left Region    Splice Junction    Right Region    Class
Intron         AG                 Exon            A
Exon           GT                 Intron          B
Unknown        AG or GT           Unknown         C
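The 32-nucleotide layout described above (15 flanking nucleotides, the AG or GT dinucleotide, then 15 more) can be illustrated with a short sketch. The function name `junction_windows` and the scanning strategy are assumptions for illustration only; the slides say the data were collected with an HTML browser and Perl scripts and do not show that code.

```python
def junction_windows(dna, flank=15):
    """Yield (index, window) for every AG or GT dinucleotide that has
    `flank` nucleotides available on both sides.

    Hypothetical helper: it only finds candidate windows; deciding whether
    a candidate is a true splice junction (class A or B) or not (class C)
    requires annotation that this sketch does not have.
    """
    dna = dna.upper()
    # The last valid start index leaves room for the dinucleotide plus the right flank.
    for i in range(flank, len(dna) - flank - 1):
        if dna[i:i + 2] in ("AG", "GT"):
            yield i, dna[i - flank:i + 2 + flank]


example = "A" * 15 + "GT" + "C" * 15
windows = list(junction_windows(example))
```

With the default flank of 15, every window returned has length 15 + 2 + 15 = 32, matching the sequence length stated in the slides.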
Self-Organizing Maps (SOM) Network
- Unsupervised learning neural network
- Projects high-dimensional input data onto a two-dimensional output map
- Preserves the topology of the input data
- Visualizes structures and clusters of the data
Use of SOM for DNA Splice Junction Classification
- Data: DNA training set, DNA test set
- Model: SOM
- Outputs: SOM classification map, U-matrix map
- Classification:
  - Class A: intron to exon
  - Class B: exon to intron
  - Class C: no transition
- Neuron identification methods:
  - Highest frequency class
  - Closest neuron
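The "highest frequency class" method can be sketched as a majority vote: each neuron takes the class that occurs most often among the training samples mapped to it. The slides do not define either neuron identification method in detail, so the function names, the dictionary layout of `weights`, and the choice to classify using only labeled neurons are all assumptions made for this illustration.

```python
import math
from collections import Counter, defaultdict


def winning_neuron(weights, x):
    """Return the grid position whose weight vector is closest to x."""
    return min(weights, key=lambda pos: math.dist(weights[pos], x))


def label_neurons(weights, samples, labels):
    """Assign each neuron the most frequent class among the samples it wins."""
    hits = defaultdict(Counter)
    for x, y in zip(samples, labels):
        hits[winning_neuron(weights, x)][y] += 1
    return {pos: counts.most_common(1)[0][0] for pos, counts in hits.items()}


def classify(weights, neuron_labels, x):
    """Classify x by the closest neuron that received a label (assumed rule)."""
    pos = min(neuron_labels, key=lambda p: math.dist(weights[p], x))
    return neuron_labels[pos]


# Toy map with three neurons; the values are made up for the example.
weights = {(0, 0): [0.0, 0.0], (0, 1): [1.0, 1.0], (1, 0): [5.0, 5.0]}
samples = [[0.1, 0.0], [0.0, 0.1], [0.2, 0.2], [0.9, 1.0]]
labels = ["A", "A", "B", "B"]
neuron_labels = label_neurons(weights, samples, labels)
```

Neurons that win no training sample stay unlabeled in this sketch, which is why `classify` searches only the labeled ones.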
SOM Results for DNA Splice Junction Data
[Figure: The U-matrix of the DNA training set]
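For readers unfamiliar with the U-matrix, a common simplified definition assigns each neuron the average distance between its weight vector and those of its grid neighbors; large values mark cluster boundaries. The sketch below uses that definition with four-connected neighbors on a rectangular grid. It is an assumption about how such a map can be computed, not the code used to produce the figure.

```python
import math


def u_matrix(weights, rows, cols):
    """Average distance from each neuron to its up/down/left/right neighbors.

    `weights` maps (row, col) to a weight vector; this layout is assumed
    for the example.
    """
    um = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            dists = [
                math.dist(weights[(r, c)], weights[(nr, nc)])
                for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1))
                if (nr, nc) in weights
            ]
            um[r][c] = sum(dists) / len(dists) if dists else 0.0
    return um


# One-row toy map: two similar neurons and one distant neuron.
toy = {(0, 0): [0.0], (0, 1): [0.0], (0, 2): [10.0]}
um = u_matrix(toy, 1, 3)
```

In the toy map, the value rises toward the third neuron, which is where a boundary between two clusters would appear.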
Conclusions
- SOM is effective in DNA splice junction classification
- SOM is a powerful visualization tool for high-dimensional data
Demo with Analyze Code
- 800 training data, 324 test data (160 features)
- 96% correct overall classification on test data

Confusion Matrix
              IE    (unlabeled)    EI
IE            98         0          0
(unlabeled)    5       111          3
EI             2         3        102
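The 96% figure can be checked against the confusion matrix. The cell values below are one reading of the matrix as it appears in the slides (the middle class has no label there); this reading is an assumption, supported by the fact that the cells sum to the 324 test sequences.

```python
# Confusion matrix as read from the slide; the class order IE, unlabeled, EI is assumed.
confusion = [
    [98, 0, 0],
    [5, 111, 3],
    [2, 3, 102],
]

total = sum(sum(row) for row in confusion)
correct = sum(confusion[i][i] for i in range(len(confusion)))
accuracy = correct / total  # fraction of test sequences on the diagonal
```

The diagonal holds 311 of 324 sequences, which rounds to the 96% overall classification rate reported for the test data.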
SOM parameters
9        // K
18       // L
6        // max_neighborhood
20000    // num_its
50000    // num_fine_its
0.9      // alpha_max
0.05     // alpha_min
1        // LVQ_flag
EXTRACTING KNOWLEDGE
NUCLEOTIDES
DNA is double-stranded
A & T are complements
G & C are complements
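Because DNA is double-stranded, one strand determines the other through Watson-Crick base pairing (A with T, G with C). A minimal sketch, with a hypothetical helper name, of deriving the opposite strand read in its own 5' to 3' direction:

```python
# Watson-Crick pairing for DNA bases.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(seq):
    """Return the complementary strand, reversed so it reads 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


opposite = reverse_complement("GGTTC")
```

Applying the function twice returns the original sequence, a quick sanity check for any strand-handling code.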
AMINO ACIDS
Sequences of three nucleotides (codons) code for amino acids
There are 20 different amino acids
Exons are the parts of DNA that code for amino acids
Each amino acid is encoded by between 1 and 6 different codons
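The "1 to 6" range can be verified by counting codons per amino acid in the standard genetic code. The sketch below builds the 64-codon table from the usual compact encoding (bases ordered T, C, A, G; `*` marks stop codons); the variable names are illustrative.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
# Standard genetic code, one letter per codon in TTT, TTC, TTA, TTG, TCT, ... order.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}

# Number of codons per amino acid, ignoring the three stop codons.
degeneracy = Counter(aa for aa in CODON_TABLE.values() if aa != "*")
```

Methionine (M) and tryptophan (W) each have a single codon, while leucine (L), serine (S), and arginine (R) each have six.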
PROTEINS
Proteins are made up of sequences of amino acids
Generally responsible for some biological function
May have complicated folding patterns that are difficult to predict
GENES
An estimated 30,000 to 100,000 genes exist in the human genome
Most genes have not yet been discovered
Genes encode sequences of amino acids
Genes are interrupted by non-coding regions of DNA (introns)
CHROMOSOMES
READING FRAMES
Reading frames may be difficult to determine
Reading frames may be shifted by splice junctions
GENE STRUCTURE
SPLICE JUNCTIONS
Segments of DNA that join coding and non-coding regions
The data are split into:
- Training set: 1000 sequences
- Test set: 424 sequences
- SOM is an unsupervised learning neural network
- SOM is used to project high-dimensional input data onto a lower-dimensional output map
- SOM is used to visualize structures and clusters of the data
SOM Algorithm
1. Initialize the weight vectors of all output neurons
2. Present an input vector to the network
   2.1 Calculate the distance between the input vector and the weight vectors of all output neurons
   2.2 Find the winning neuron, which has the minimal Euclidean distance
   2.3 Update the weight vector of the winning neuron and the weight vectors of all neurons within the neighborhood of the winning neuron. As a result, these weight vectors move toward the input vector.
3. Present the next input vector to the network
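The steps above can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the square (rectangular) neighborhood, the linear decay of the learning rate from `alpha_max` to `alpha_min`, the linear shrinking of the neighborhood radius, and all function names are assumptions chosen for the example.

```python
import math
import random


def init_map(rows, cols, dim, seed=0):
    """Step 1: give every output neuron a random weight vector."""
    rng = random.Random(seed)
    return {(r, c): [rng.random() for _ in range(dim)]
            for r in range(rows) for c in range(cols)}


def winning_neuron(weights, x):
    """Steps 2.1-2.2: the neuron with minimal Euclidean distance to x."""
    return min(weights, key=lambda pos: math.dist(weights[pos], x))


def som_step(weights, x, alpha, radius):
    """Step 2.3: move the winner and its grid neighbors toward x (in place)."""
    win = winning_neuron(weights, x)
    for (r, c), w in weights.items():
        if abs(r - win[0]) <= radius and abs(c - win[1]) <= radius:
            for k in range(len(w)):
                w[k] += alpha * (x[k] - w[k])
    return win


def train_som(data, rows, cols, num_its,
              alpha_max=0.9, alpha_min=0.05, max_radius=2, seed=0):
    """Steps 2-3 repeated, with assumed linear decay schedules."""
    weights = init_map(rows, cols, len(data[0]), seed)
    for t in range(num_its):
        frac = t / max(1, num_its - 1)
        alpha = alpha_max + (alpha_min - alpha_max) * frac
        radius = round(max_radius * (1.0 - frac))
        som_step(weights, data[t % len(data)], alpha, radius)
    return weights


# Tiny usage example on made-up two-dimensional data.
trained = train_som([[0.0, 0.0], [1.0, 1.0]], rows=3, cols=3, num_its=50)
```

Each update is a convex combination of the old weight and the input, so a neuron moved with learning rate `alpha` ends up a fraction `alpha` of the way toward the input vector.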
- In the input layer, the number of input neurons corresponds to the dimensionality of the input vector
- In the output layer, the output neurons are arranged in a rectangular or hexagonal lattice
- Each output neuron is connected with the input neurons by its weight vector
- Gray circles show the rectangular neighborhood of a neuron C
The training data are mapped onto their winning output neurons
- Class A (0) dominates the bottom area of the map
- Class B (1) dominates the top right area of the map
- Class C (0.5) dominates the left area of the map