DNA Barcode Data Analysis Boosting Accuracy by Combining Simple Classification Methods


CSE 377 – Bioinformatics - Spring 2006

Sotirios Kentros Univ. of Connecticut

Bogdan Paşaniuc

Outline

Motivation
Problem Definition
The Methods
    Hamming Distance and Minimum Hamming Distance
    Amino Acid Similarity and Minimum Amino Acid Similarity
    Dinucleotide Distance
    Trinucleotide Distance
    Nucleotide Frequency Similarity
Combining the Methods
Results
    Species Classification
    New Species Recognition
Conclusion
Future Work

Motivation

“DNA barcoding” was proposed as a tool for differentiating biological species.

Goal: to make a “fingerprint” for species, using a short sequence of DNA.

Assumption: mitochondrial DNA evolves at a lower rate than regular DNA.

Mitochondrial DNA: high interspecies variability while retaining low intraspecies sequence variability.

The region chosen was the mitochondrial cytochrome c oxidase subunit 1 gene ("COI", 648 base pairs long).

Problem definition

The goal of our project was to explore whether combining simple classification methods can increase classification accuracy.

We address two problems:
    Classification of individuals given a training set of species.
    Identification of individuals that belong to new species.

All the sequences are aligned.

Problem definition

Species differentiation:

INPUT: a set S of aligned DNA sequences for which the species is known, and a new sequence x

OUTPUT: the species of x, given that S contains sequences of the same species as x

Problem definition

Species differentiation & New Species Determination:

INPUT: a set S of aligned DNA sequences for which the species is known, and a new sequence x

OUTPUT: the species of x, if there is at least one sequence in S of the same species; otherwise, report that x belongs to a new species.

Methods Used

Hamming Distance and Minimum Hamming Distance
Amino Acid Similarity and Minimum Amino Acid Similarity
Dinucleotide Distance
Trinucleotide Distance
Nucleotide Frequency Similarity

Methods

A new sequence x is compared against every species in the training set:

    Species S1: d(x,S1)
    Species S2: d(x,S2)
    …
    Species Sn: d(x,Sn)

1. d(x,Si) = Minimum{ d(x,y) | sequence y belongs to species Si }
    • Notation: Minimum “Method” Classifier

2. d(x,Si) = Average{ d(x,y) | sequence y belongs to species Si }
    • Notation: “Method” Classifier

Hamming Distance

Average:
    Given a new sequence x, find the species S such that the average Hamming distance from x to y (y in S) is minimized
    Assign S to x

Minimum:
    Given a new sequence x, find the sequence y such that the Hamming distance from x to y is minimized
    Assign species(y) to x
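The two Hamming rules can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names, the dictionary layout of the training set (species name mapped to a list of aligned sequences), and the toy sequences are all assumptions made for the example.

```python
from statistics import mean

def hamming(a, b):
    """Number of aligned positions at which two sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(ca != cb for ca, cb in zip(a, b))

def classify_average(x, training):
    """Species whose members are, on average, closest to x."""
    return min(training, key=lambda sp: mean(hamming(x, y) for y in training[sp]))

def classify_minimum(x, training):
    """Species of the single training sequence closest to x."""
    return min(training, key=lambda sp: min(hamming(x, y) for y in training[sp]))

# Toy training set (hypothetical data, for illustration only).
training = {
    "sp1": ["ACGTAC", "TTTTTT"],
    "sp2": ["ACGGTT", "ACGGTA"],
}
x = "ACGTAA"
```

On this toy data the two rules disagree: `classify_minimum(x, training)` picks `sp1` (one member is at distance 1), while `classify_average(x, training)` picks `sp2` (average distance 2.5 versus 3), which is why the two classifiers are treated as separate methods.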

Amino Acid Similarity

Genetic code:
    Rules that map DNA sequences to proteins
    Codon: tri-nucleotide unit that encodes one amino acid
    Divide the DNA sequence into codons and substitute each one by its corresponding amino acid

BLOSUM62 (BLOcks SUbstitution Matrix):
    20x20 matrix that gives a score for each pair of amino acids, derived from substitutions observed in aligned blocks of related proteins
    The higher the score, the more likely the substitution causes no functional change in the protein

Amino Acid Similarity

Distance(x,y)

DNA sequences x, y -> amino acid sequences x’, y’ (using the codon-to-amino-acid translation)

Using the BLOSUM amino acid substitution matrix, get the score of the alignment of x’ and y’

Average:
    Find the species with maximum average similarity

Minimum:
    Find the sequence with maximum similarity
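The translate-then-score step might look like the sketch below. To stay short it carries only a five-codon excerpt of the standard genetic code and the matching excerpt of BLOSUM62. A real implementation would load the full tables, and for COI it would use the appropriate mitochondrial genetic code. The names `CODON_TO_AA`, `BLOSUM62_EXCERPT`, `translate`, and `similarity` are hypothetical, and how gaps inside codons were handled is not stated in the slides, so gaps are not supported here.

```python
# Excerpt of the standard genetic code (assumption: illustration only).
CODON_TO_AA = {"ATG": "M", "CTT": "L", "ATT": "I", "GCT": "A", "TGG": "W"}

# Excerpt of BLOSUM62 for the five amino acids above, keyed by sorted pair.
BLOSUM62_EXCERPT = {
    ("A", "A"): 4, ("I", "I"): 4, ("L", "L"): 4, ("M", "M"): 5, ("W", "W"): 11,
    ("A", "I"): -1, ("A", "L"): -1, ("A", "M"): -1, ("A", "W"): -3,
    ("I", "L"): 2, ("I", "M"): 1, ("I", "W"): -3,
    ("L", "M"): 2, ("L", "W"): -2, ("M", "W"): -1,
}

def translate(dna):
    """Split a DNA sequence into codons and map each codon to its amino acid."""
    usable = len(dna) - len(dna) % 3  # drop a trailing partial codon
    return "".join(CODON_TO_AA[dna[i:i + 3]] for i in range(0, usable, 3))

def score(a, b):
    """Substitution score for one pair of amino acids (symmetric lookup)."""
    return BLOSUM62_EXCERPT[tuple(sorted((a, b)))]

def similarity(x, y):
    """Sum of substitution scores over the aligned amino acid sequences."""
    return sum(score(a, b) for a, b in zip(translate(x), translate(y)))
```

For example, `similarity("ATGCTTGCT", "ATGATTGCT")` compares MLA with MIA and gives 5 + 2 + 4 = 11, whereas replacing the middle codon with `TGG` (tryptophan) gives 5 - 2 + 4 = 7. Because this is a similarity, the classifiers maximize it instead of minimizing it.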

Dinucleotide Distance

For each species, find the frequency with which each dinucleotide appears.

Compute the frequency of each dinucleotide in the unclassified sequence.

Find the species with the minimum mean square distance to the new unclassified sequence.

For new species: after classifying the individual, find the average intraspecies mean square distance for the candidate species. If the individual is close enough, assign it to the species; otherwise it belongs to a new species.

In/dels are ignored.
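A possible reading of the frequency-based distance is sketched below for general k-mers (k=2 gives dinucleotides, k=3 trinucleotides). Several details are assumptions not stated in the slides: k-mers are counted in overlapping windows, gap characters are simply removed before counting, and a species profile pools the counts of all its training sequences. All function names are hypothetical.

```python
from collections import Counter
from itertools import product

def kmer_frequencies(sequences, k=2):
    """Relative frequency of every k-mer over ACGT, pooled over the sequences."""
    counts = Counter()
    for seq in sequences:
        s = seq.replace("-", "")  # in/dels are ignored
        for i in range(len(s) - k + 1):
            kmer = s[i:i + k]
            if set(kmer) <= set("ACGT"):  # skip ambiguous symbols such as N
                counts[kmer] += 1
    total = sum(counts.values())
    kmers = ("".join(p) for p in product("ACGT", repeat=k))
    return {kmer: (counts[kmer] / total if total else 0.0) for kmer in kmers}

def mean_square_distance(f, g):
    """Mean of squared differences between two frequency profiles."""
    return sum((f[key] - g[key]) ** 2 for key in f) / len(f)

def classify(x, training, k=2):
    """Species whose pooled k-mer profile is closest to the profile of x."""
    fx = kmer_frequencies([x], k)
    return min(
        training,
        key=lambda sp: mean_square_distance(fx, kmer_frequencies(training[sp], k)),
    )
```

Recomputing each species profile on every call keeps the sketch short. With many query sequences one would compute the profiles once and reuse them.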

Trinucleotide Distance

For each species, find the frequency with which each trinucleotide appears.

Compute the frequency of each trinucleotide in the unclassified sequence.

Find the species with the minimum mean square distance to the new unclassified sequence.

For new species: after classifying the individual, find the average intraspecies mean square distance for the candidate species. If the individual is close enough, assign it to the species; otherwise it belongs to a new species.

In/dels are ignored.
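The new-species rule used with the dinucleotide and trinucleotide distances could be sketched as follows. The slides do not say how "close enough" was defined, so the `slack` multiplier is purely a hypothetical parameter, and the intraspecies average is assumed to be taken over the distances from each member's own profile to the pooled species profile. Profiles are plain dictionaries mapping a k-mer to its frequency.

```python
from statistics import mean

def mean_square_distance(f, g):
    """Mean of squared differences over the union of keys of two profiles."""
    keys = f.keys() | g.keys()
    return sum((f.get(k, 0.0) - g.get(k, 0.0)) ** 2 for k in keys) / len(keys)

def belongs_to_candidate(x_profile, species_profile, member_profiles, slack=1.0):
    """True if x is at least as close to the candidate species as its members
    are on average (scaled by the hypothetical slack factor)."""
    intraspecies = mean(
        mean_square_distance(m, species_profile) for m in member_profiles
    )
    return mean_square_distance(x_profile, species_profile) <= slack * intraspecies
```

When the function returns False, the individual would be reported as belonging to a new species. Raising `slack` makes the classifier more willing to accept individuals into known species, which is one way the tradeoff between classification and new species detection can be tuned.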

Nucleotide Frequency Similarity

For each position in the DNA, find the frequency with which each nucleotide appears among the individuals of the species. The frequency of in/dels is included.

For an unclassified individual, compute the log of the probability that the individual belongs to each species, and assign it to the species for which this probability is maximum.

For new species, we compute the minimum such probability over the individuals belonging to the candidate species and compare it with the probability of the unclassified individual, in order to determine whether it belongs to the species or not.
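The position-wise model can be sketched as a per-column frequency profile. This is an illustration under stated assumptions: positions are treated as independent, the alphabet is restricted to `ACGT-`, and a pseudocount (not mentioned in the slides) keeps unseen symbols from producing log(0). All names are hypothetical.

```python
import math
from collections import Counter

ALPHABET = "ACGT-"  # the gap symbol is a state of its own (in/dels are counted)

def position_profile(sequences, pseudocount=1.0):
    """Per-position symbol frequencies for the aligned sequences of one species."""
    total = len(sequences) + pseudocount * len(ALPHABET)
    profile = []
    for i in range(len(sequences[0])):
        counts = Counter(seq[i] for seq in sequences)
        profile.append({c: (counts[c] + pseudocount) / total for c in ALPHABET})
    return profile

def log_probability(x, profile):
    """Log-probability of x under the profile, assuming independent positions."""
    return sum(math.log(column[c]) for c, column in zip(x, profile))

def classify(x, training):
    """Species under whose profile x is most probable."""
    return max(
        training, key=lambda sp: log_probability(x, position_profile(training[sp]))
    )

def is_known(x, members):
    """New-species check: x is accepted if it is at least as probable as the
    least probable training member of the candidate species."""
    profile = position_profile(members)
    floor = min(log_probability(m, profile) for m in members)
    return log_probability(x, profile) >= floor
```

Working in log space avoids numerical underflow when several hundred per-position probabilities are multiplied together.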

Combining the Methods

The species on which most classifiers agree is returned.

Simple Voting:
    Every classifier’s returned species has a weight of 1
    Output the species with the most votes

Weighted Voting:
    Every classifier has a different weight based on the accuracy of each independent method
    Output the species with the largest total

As expected, weighted voting yields higher accuracy, and thus in our results the combined method uses weighted voting.
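Both voting schemes fit in one small function, sketched here. The slides do not give the exact mapping from accuracy to weight, so the weights in the usage example are made-up numbers, and the classifier names are only labels.

```python
from collections import defaultdict

def vote(predictions, weights=None):
    """Combine per-classifier predictions.

    predictions: dict mapping classifier name -> predicted species.
    weights: dict mapping classifier name -> weight; None means simple voting,
             where every classifier counts as 1.
    """
    totals = defaultdict(float)
    for name, species in predictions.items():
        totals[species] += 1.0 if weights is None else weights[name]
    return max(totals, key=totals.get)

# Hypothetical predictions and weights, for illustration only.
predictions = {"hamming": "sp1", "dinucleotide": "sp2", "trinucleotide": "sp2"}
weights = {"hamming": 0.9, "dinucleotide": 0.4, "trinucleotide": 0.4}
```

Here simple voting returns `sp2` (two votes against one), while weighted voting returns `sp1` because the single, more heavily weighted classifier outweighs the other two (0.9 against 0.8).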

Datasets(1)

We used the dataset provided at http://dimacs.rutgers.edu/workshops/BarcodeResearchchallanges consisting of 1623 aligned sequences classified into 150 species, with each sequence consisting of 590 nucleotides on average.

We randomly deleted 10 to 50 percent of the sequences from each species:
    Deleted sequences -> test
    Remaining sequences -> train

We made sure that every species retains at least one sequence.
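The split described above could be produced along these lines. This is a sketch, not the authors' script: the function name, the rounding rule, and the seeded random generator are assumptions, and every species is assumed to start with at least one sequence.

```python
import random

def split_by_species(data, fraction, rng=None):
    """Move roughly `fraction` of each species' sequences to the test set,
    always leaving at least one sequence per species for training.

    data: dict mapping species -> list of sequences.
    Returns (train, test), where train has the same layout as data and
    test is a list of (sequence, true_species) pairs.
    """
    rng = rng or random.Random()
    train, test = {}, []
    for species, seqs in data.items():
        shuffled = list(seqs)
        rng.shuffle(shuffled)
        n_test = max(0, min(round(fraction * len(shuffled)), len(shuffled) - 1))
        test.extend((s, species) for s in shuffled[:n_test])
        train[species] = shuffled[n_test:]
    return train, test
```

Calling it with `fraction` set to 0.1, 0.2, and so on up to 0.5 would reproduce the five settings used in the results table.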

Methods                      Percent missing from each species (%)
                             10     20     30     40     50
Amino Acid Similarity        95.1   94.8   94.7   94.3   93
Min. Amino Acid Similarity   99.3   99.2   98.7   98.1   97.3
Hamming Dist.                97.9   97.4   96.7   96.5   96.5
Min. Hamming Dist.           98.8   98.2   97.5   97.1   96.4
Nucleotide Freq. Sim.        56.2   49.6   44.2   44.6   38.2
Dinucleotide Freq. Dist.     44.9   42.2   41.6   41.5   39.3
Trinucleotide Freq. Dist.    70.9   68.1   68     66.7   64.2
Combination                  99.2   99.2   98.8   98.3   97.7

Species Recovering Accuracy (in %) (no new species)

Datasets(2)

In order to test the accuracy of new species detection and classification, we devised a leave-one-out procedure:
    Delete a whole species
    Randomly delete 0 to 50 percent of the sequences from each remaining species
    Deleted sequences -> test
    Remaining sequences -> train

The following table gives accuracy results averaged over 150 x 6 different test cases.

Methods                      Percent missing from each remaining species (%)
                             0      10     20     30     40     50
Amino Acid Similarity        65.1   49.2   43.6   42.0   41.0   37.4
Min. Amino Acid Similarity   72.6   61.0   56.2   56.4   52.6   51.0
Hamming Dist.                55.0   91.4   90.2   90.4   88.0   88.6
Min. Hamming Dist.           73.1   85.4   79.6   78.6   75.0   74.4
Dinucleotide Freq. Dist.     51.0   50.4   48.2   48.2   45.2   43.4
Trinucleotide Freq. Dist.    56.5   63.6   61.8   63.0   59.2   57.4
Nucleotide Freq. Sim.        73.0   56.2   49.6   44.2   44.0   38.2
Combination                  80.5   93.2   91.6   91.6   88.4   88.6

Leave-one-out Accuracy (in %)

Conclusions(1)

Every method shows a tradeoff between new species detection and classification accuracy.

Hamming distance performs very well when no new species are present, but its accuracy is low for new species detection.

The combined method yields better accuracy on both new species detection and sequence classification.

The runtimes of all methods are within the same order of magnitude.

Conclusions(2)

By combining simple classification methods, we managed to boost the accuracy both for classifying individuals into known species and for detecting new species.

As expected, the results imply a tradeoff between classification and new species detection:
    the higher the classification accuracy, the lower the detection accuracy

Hamming Distance is a very good metric for the training dataset provided.

Future Work

New species clustering: determining the different new species present.

Further investigate threshold selection and weighting schemes.

Ignoring parts of the given sequences could possibly improve accuracy. Are there redundant or noisy regions?

Use independent weighting schemes for new species detection and for classification into known species.

Questions

Thank you.
