DNA Classifications with Self-Organizing Maps (SOMs)

Thanakorn Naenna, Mark J. Embrechts, Robert A. Bress. May 2003 IEEE International Workshop on Soft Computing in Industrial Application.

  • DNA Classifications with Self-Organizing Maps (SOMs)

    Thanakorn Naenna, Mark J. Embrechts, Robert A. Bress

    May 2003 IEEE International Workshop on Soft Computing in Industrial Application

  • Presentation Outline

    - Introduction to DNA Splice Junctions
    - Data Collection
    - Introduction to SOMs
    - SOM for DNA Splice Junction Classification
    - Results
    - Conclusions

  • Human genome in a nutshell

    - Human: 23 pairs of chromosomes
    - Chromosomes: thousands of genes
    - Gene: information (exons) interleaved with comments (introns)
    - Splice junctions are like /* comment flags */ in C code
    - Exons and introns: codons
    - Codon: three bases

  • DNA Splice Junctions

    - DNA: billions of nucleotides (A, C, G, T)
    - Genes: coding sequences (exons) that are often interrupted by non-coding nucleotides (introns)
  • Data Collection: HTML Browser + Perl scripts

  • DNA Splice Junction (Cont.)

    - A complete gene is made up of different exons
    - Splice junction identification aids in the discovery of new genes
    - The dataset used for this study is made up of 1,424 sequences
    - Data were created ab initio from GenBank
    - Each sequence is 32 nucleotides long: the 2-nucleotide splice junction plus the 15 nucleotides on either side (positions -15 to +15)

    Left region   Splice junction   Right region   Class
    Intron        AG                Exon           A
    Exon          GT                Intron         B
    Unknown       AG or GT          Unknown        C
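The 32-nucleotide window described on this slide can be sketched as follows. This is a minimal illustration, not the authors' Perl extraction code: the function name `splice_window`, its parameters, and the toy sequence are all assumptions introduced here.

```python
def splice_window(sequence, junction_start, flank=15):
    """Return the junction dinucleotide plus `flank` bases on each side.

    `junction_start` is the 0-based index of the first junction base
    (the A of AG, or the G of GT). Hypothetical helper for illustration.
    """
    start = junction_start - flank
    end = junction_start + 2 + flank
    if start < 0 or end > len(sequence):
        raise ValueError("not enough flanking sequence around the junction")
    return sequence[start:end]


# Toy input: 15 A's, the dinucleotide GT, then 15 C's (not real data).
toy = "A" * 15 + "GT" + "C" * 15
window = splice_window(toy, junction_start=15)
print(len(window), window[15:17])  # prints "32 GT"
```

With 15 flanking bases on each side of a 2-base junction, the window length comes out at 15 + 2 + 15 = 32, matching the sequence length stated on the slide.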

  • Self-Organizing Maps (SOM) Network

    - Unsupervised learning neural network
    - Projects high-dimensional input data onto a two-dimensional output map
    - Preserves the topology of the input data
    - Visualizes structures and clusters of the data

  • Use of SOM for DNA Splice Junction Classification

    Diagram labels: DNA training set, DNA test set, Model, SOM, SOM Classification Map, U-Matrix Map, Classification

    Classes:
    - Class A: intron to exon
    - Class B: exon to intron
    - Class C: no transition

    Neuron identification methods:
    - Highest frequency class
    - Closest neuron
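One plausible reading of the two neuron identification methods is sketched below. It assumes that a neuron that wins at least one training sample takes the most frequent class among its samples, and that a neuron that never wins takes the label of the closest labeled neuron in weight space. The slide does not spell out either rule, so the function names and the weight-space distance are assumptions.

```python
from collections import Counter


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def winner(weights, x):
    """Index of the neuron whose weight vector is closest to x."""
    return min(range(len(weights)), key=lambda i: squared_distance(weights[i], x))


def label_neurons(weights, samples, labels):
    hits = {}
    for x, y in zip(samples, labels):
        hits.setdefault(winner(weights, x), Counter())[y] += 1
    # Highest frequency class, for neurons that won at least one sample.
    neuron_label = {i: c.most_common(1)[0][0] for i, c in hits.items()}
    # Closest labeled neuron, for neurons that never won (assumed rule).
    labeled = list(neuron_label)
    for i in range(len(weights)):
        if i not in neuron_label:
            nearest = min(labeled, key=lambda j: squared_distance(weights[i], weights[j]))
            neuron_label[i] = neuron_label[nearest]
    return neuron_label


# Toy map with three neurons; the third never wins a training sample.
toy_weights = [[0.0, 0.0], [1.0, 1.0], [0.9, 0.9]]
toy_samples = [[0.1, 0.0], [0.0, 0.1], [1.0, 1.1]]
toy_labels = ["A", "A", "B"]
print(label_neurons(toy_weights, toy_samples, toy_labels))
```

A test sequence would then be classified by finding its winning neuron and reading off that neuron's label.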

  • The U-matrix of the DNA Training Set

  • SOM Results for DNA Splice Junction Data

    The U-matrix of the DNA training set
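For readers unfamiliar with the U-matrix, the sketch below computes one common variant: for each neuron, the average Euclidean distance between its weight vector and those of its directly adjacent neurons on a rectangular grid. The slides do not say which variant was used, so the 4-neighbor averaging and the function name `u_matrix` are assumptions.

```python
import math


def u_matrix(weights):
    """weights[r][c] is the weight vector of the neuron at row r, column c."""
    rows, cols = len(weights), len(weights[0])
    u = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            dists = []
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols:
                    dists.append(math.dist(weights[r][c], weights[nr][nc]))
            # A 1 x 1 map has no neighbors; report 0.0 in that case.
            u[r][c] = sum(dists) / len(dists) if dists else 0.0
    return u


# Toy 1 x 3 map with one-dimensional weights: two similar neurons, one distant.
toy_map = [[[0.0], [0.0], [5.0]]]
print(u_matrix(toy_map))  # prints [[0.0, 2.5, 5.0]]
```

Large values mark boundaries between clusters, and small values mark regions where neighboring neurons hold similar weight vectors.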

  • Conclusions

    - SOM is effective in DNA splice junction classification
    - SOM is a powerful visualization tool for high-dimensional data

  • Demo with Analyze Code

    - 800 training data, 324 test data (160 features)
    - 96% correct overall classification on test data

    Confusion Matrix

                   IE     (unlabeled)   EI
    IE             98     0             0
    (unlabeled)    5      111           3
    EI             2      3             102

    SOM

    9        // K
    18       // L
    6        // max_neighborhood
    20000    // num_its
    50000    // num_fine_its
    0.9      // alpha_max
    0.05     // alpha_min
    1        // LVQ_flag
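The 96% figure can be checked against the confusion matrix. The sketch below assumes the matrix digits read as the 3 x 3 table with rows 98/0/0, 5/111/3 and 2/3/102; that parse is an assumption, supported by the entries summing to the 324 test sequences.

```python
# Assumed parse of the confusion matrix on this slide (rows and columns
# in the order IE, unlabeled middle class, EI).
confusion = [
    [98, 0, 0],
    [5, 111, 3],
    [2, 3, 102],
]

total = sum(sum(row) for row in confusion)
correct = sum(confusion[i][i] for i in range(len(confusion)))
accuracy = correct / total
print(total, correct, round(100 * accuracy, 1))  # prints "324 311 96.0"
```

The diagonal holds 311 of 324 test sequences, which rounds to the 96% overall accuracy reported.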

  • THE END

  • GATCAATGAGGTGGACACCAGAGGCGGGGACTTGTAAATAACACTGGGCTGTAGGAGTGATGGGGTTCACCTCTAATTCTAAGATGGCTAGATAATGCATCTTTCAGGGTTGTGCTTCTATCTAGAAGGTAGAGCTGTGGTCGTTCAATAAAAGTCCTCAAGAGGTTGGTTAATACGCATGTTTAATAGTACAGTATGGTGACTATAGTCAACAATAATTTATTGTACATTTTTAAATAGCTAGAAGAAAAGCATTGGGAAGTTTCCAACATGAAGAAAAGATAAATGGTCAAGGGAATGGATATCCTAATTACCCTGATTTGATCATTATGCATTATATACATGAATCAAAATATCACACATACCTTCAAACTATGTACAAATATTATATACCAATAAAAAATCATCATCATCATCTCCATCATCACCACCCTCCTCCTCATCACCACCAGCATCACCACCATCATCACCACCACCATCATCACCACCACCACTGCCATCATCATCACCACCACTGTGCCATCATCATCACCACCACTGTCATTATCACCACCACCATCATCACCAACACCACTGCCATCGTCATCACCACCACTGTCATTATCACCACCACCATCACCAACATCACCACCACCATTATCACCACCATCAACACCACCACCCCCATCATCATCATCACTACTACCATCATTACCAGCACCACCACCACTATCACCACCACCACCACAATCACCATCACCACTATCATCAACATCATCACTACCACCATCACCAACACCACCATCATTATCACCACCACCACCATCACCAACATCACCACCATCATCATCACCACCATCACCAAGACCATCATCATCACCATCACCACCAACATCACCACCATCACCAACACCACCATCACCACCACCACCACCATCATCACCACCACCACCATCATCATCACCACCACCGCCATCATCATCGCCACCACCATGACCACCACCATCACAACCATCACCACCATCACAACCACCATCATCACTATCGCTATCACCACCATCACCATTACCACCACCATTACTACAACCATGACCATCACCACCATCACCACCACCATCACAACGATCACCATCACAGCCACCATCATCACCACCACCACCACCACCATCACCATCAAACCATCGGCATTATTATTTTTTTAGAATTTTGTTGGGATTCAGTATCTGCCAAGATACCCATTCTTAAAACATGAAAAAGCAGCTGACCCTCCTGTGGCCCCCTTTTTGGGCAGTCATTGCAGGACCTCATCCCCAAGCAGCAGCTCTGGTGGCATACAGGCAACCCACCACCAAGGTAGAGGGTAATTGAGCAGAAAAGCCACTTCCTCCAGCAGTTCCCTGT

    AAAAGCATTGGGAA

    GGTTC

    CCGTTGAAC

    GGTCAGGTTAGACTA

    EXTRACTING KNOWLEDGE

  • NUCLEOTIDES

    DNA is double-stranded

    A & T are complements; G & C are complements
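Because DNA is double-stranded, one strand determines the other. Under standard Watson-Crick pairing (A with T, G with C), the opposite strand can be derived as in this small sketch; the function name `reverse_complement` is chosen here for illustration.

```python
# Standard base pairing: A <-> T and C <-> G.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(strand):
    """Complement every base, then reverse to read the other strand 5' to 3'."""
    return strand.translate(COMPLEMENT)[::-1]


print(reverse_complement("GATCAATG"))  # prints "CATTGATC"
```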

  • AMINO ACIDS

    Sequences of three nucleotides (codons) code for amino acids

    There are 20 different amino acids

    Exons are the parts of DNA that code for amino acids

    Each amino acid is encoded by between 1 and 6 different codons
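The "1 to 6" range can be verified by counting codons per amino acid in the standard genetic code. The sketch below builds the 64-entry codon table from the usual compact TCAG-ordered string of one-letter amino acid codes, with `*` marking stop codons; the variable names are introduced here for illustration.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Count how many codons encode each amino acid, ignoring stop codons.
counts = Counter(aa for aa in CODON_TABLE.values() if aa != "*")
print(len(counts), min(counts.values()), max(counts.values()))  # prints "20 1 6"
```

Methionine (M) and tryptophan (W) each have a single codon, while leucine (L), serine (S) and arginine (R) each have six.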

  • PROTEINS

    Proteins are made up of sequences of amino acids

    Generally responsible for some biological function

    May have complicated folding patterns that are difficult to predict

  • GENES

    - 30,000 to 100,000 genes exist in the human genome
    - Most genes have not yet been discovered
    - Genes code for sequences of amino acids
    - Genes are interrupted by non-coding regions of DNA (introns)

  • CHROMOSOMES

  • READING FRAMES

    Reading frames may be difficult to determine

    Reading frames may be shifted by splice junctions
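To make the reading-frame ambiguity concrete, the sketch below splits one strand into codons at each of the three possible offsets. It covers the forward strand only, and the function name `reading_frames` is an assumption made for this example.

```python
def reading_frames(sequence):
    """Return the codon lists for offsets 0, 1 and 2 on the given strand."""
    frames = []
    for offset in range(3):
        # Stop before a trailing partial codon.
        codons = [sequence[i:i + 3] for i in range(offset, len(sequence) - 2, 3)]
        frames.append(codons)
    return frames


for offset, codons in enumerate(reading_frames("GATCAATGAGG")):
    print(offset, codons)
# 0 ['GAT', 'CAA', 'TGA']
# 1 ['ATC', 'AAT', 'GAG']
# 2 ['TCA', 'ATG', 'AGG']
```

The same 11 bases yield three unrelated codon sequences, which is why a splice junction that shifts the frame changes everything downstream of it.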

  • GENE STRUCTURE

  • SPLICE JUNCTIONS

    Segments of DNA that join coding and non-coding regions

    The data are split into:
    - Training set: 1,000 sequences
    - Test set: 424 sequences

    - SOM is an unsupervised learning neural network
    - SOM is used to project high-dimensional input data onto a lower-dimensional output map
    - SOM is used to visualize structures and clusters of the data

    SOM Algorithm

    1 Initialize the weight vectors of all output neurons
    2 Present an input vector to the network
    2.1 Calculate the distance between the input vector and the weight vectors of all output neurons
    2.2 Find the winning neuron, which has the minimal Euclidean distance
    2.3 Update the weight vector of the winning neuron and the weight vectors of all neurons within the neighborhood of the winning neuron. As a result, the updated weight vectors move toward the input vector.
    3 Present the next input vector to the network
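The steps above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the linear decay of the learning rate, the shrinking square neighborhood, the random initialization, and the function name `train_som` are all choices made here. The parameter names `alpha_max`, `alpha_min` and `max_neighborhood` echo the settings listed on the demo slide, but how that program uses them is not stated.

```python
import math
import random


def train_som(data, rows, cols, iterations, alpha_max=0.9, alpha_min=0.05,
              max_neighborhood=2, seed=0):
    rng = random.Random(seed)
    dim = len(data[0])
    # Step 1: initialize the weight vectors of all output neurons.
    weights = [[[rng.random() for _ in range(dim)] for _ in range(cols)]
               for _ in range(rows)]
    for t in range(iterations):
        frac = t / max(1, iterations - 1)
        alpha = alpha_max + (alpha_min - alpha_max) * frac  # linear decay (assumed)
        radius = round(max_neighborhood * (1 - frac))       # shrinking square (assumed)
        # Steps 2 and 3: present the input vectors one after another.
        x = data[t % len(data)]
        # Steps 2.1 and 2.2: the winner has the minimal Euclidean distance.
        wr, wc = min(((r, c) for r in range(rows) for c in range(cols)),
                     key=lambda rc: math.dist(weights[rc[0]][rc[1]], x))
        # Step 2.3: move the winner and its neighborhood toward the input.
        for r in range(max(0, wr - radius), min(rows, wr + radius + 1)):
            for c in range(max(0, wc - radius), min(cols, wc + radius + 1)):
                w = weights[r][c]
                for k in range(dim):
                    w[k] += alpha * (x[k] - w[k])
    return weights


# Toy run: two 2-dimensional inputs on a 3 x 3 map.
som = train_som([[0.0, 0.0], [1.0, 1.0]], rows=3, cols=3, iterations=200)
print(len(som), len(som[0]), len(som[0][0]))  # prints "3 3 2"
```

Each update is a convex combination of the old weight and the input, so the weights never leave the range spanned by the initial values and the data.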

    - In the input layer, the number of input neurons corresponds to the dimensionality of the input vector
    - In the output layer, the output neurons are arranged in a rectangular or hexagonal lattice
    - Each output neuron is connected to the input neurons by its weight vector
    - Gray circles show the rectangular neighborhood of a neuron C

    The training data are mapped onto their winning output neurons

    - Class A (0) dominates the bottom area of the map
    - Class B (1) dominates the top right area of the map
    - Class C (0.5) dominates the left area of the map