
DNA Barcode Data Analysis: Boosting Assignment Accuracy by Combining Distance- and Character-Based Classifiers

Bogdan Paşaniuc, Sotirios Kentros and Ion Mândoiu

Computer Science & Engineering Department, University of Connecticut

Outline

• Motivation & Problem Definition
• Methods used
  • Hamming Distance (MIN-HD and AVG-HD)
  • Aminoacid Similarity (MAX-AA-SIM and AVG-AA-SIM)
  • Convex-score similarity (MAX-CS-SIM)
  • Trinucleotide frequency (MIN-3FREQ)
  • Positional weight matrix (MAX-PWM)
  • Character-based pairwise species discrimination (k-BEST)
• Combining the Methods
• Results
  • Species Classification
  • New Species Recognition
• Future Work & Conclusions

Motivation

“DNA barcoding” was proposed as a tool for differentiating species

Goal: to make a “fingerprint” for species, using a short sequence of DNA

Assumption: mitochondrial DNA evolves at a higher rate than nuclear DNA

Mitochondrial DNA: high interspecies variability while retaining low intraspecies sequence variability

The chosen region is the mitochondrial cytochrome c oxidase subunit 1 gene ("COI", 648 base pairs long).

Problem definition

The goal of our project was to explore whether combining simple classification methods can increase classification accuracy.

We address two problems:
  • Classification of barcodes given a training set of species.
  • Identification of barcodes that belong to new species.

Assumption: all the barcode DNA sequences are aligned

Problem definition (1)

Species Differentiation:

INPUT: a set S of barcodes for which the species is known and x a new barcode

OUTPUT: the species of x, given that S contains barcodes of the same species as x

Problem definition (2)

Species Differentiation & New Species Detection:

INPUT: a set S of barcodes for which the species is known and x a new barcode

OUTPUT: the species of x if S contains at least one barcode of the same species; otherwise, report that x belongs to a new species.

Methods

Find a “distance” between barcodes that is “able to distinguish between species”:

1. Low intraspecies variability

2. High interspecies variability

• Hamming Distance
• Aminoacid Similarity
• Convex-score similarity
• Trinucleotide frequency
  • Closer barcodes tend to have similar trinucleotide frequencies
• Positional weight matrix
  • Compute the probability that barcode x belongs to a given species
• Character-based pairwise species discrimination
  • Find the k most informative characters that are able to distinguish between two species.

Methods

[Figure: the new barcode x and its distances d(x,S1), d(x,S2), …, d(x,Sn) to species S1, S2, …, Sn]

1. d(x,Si) = Minimum{ d(x,y) | sequence y belongs to species Si }
   • Minimum “Method” Classifier

2. d(x,Si) = Average{ d(x,y) | sequence y belongs to species Si }
   • Average “Method” Classifier
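The two aggregation rules can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the toy training set, and the placeholder distance `mismatches` are hypothetical and not taken from the slides; any of the pairwise distances described later could be passed in as `dist`.

```python
from statistics import mean


def classify(x, training, dist, aggregate=min):
    """Return the species whose aggregated distance to x is smallest.

    training maps a species name to its list of barcodes, dist is any
    pairwise distance, and aggregate is min (Minimum classifier) or
    statistics.mean (Average classifier).
    """
    scores = {
        species: aggregate(dist(x, y) for y in barcodes)
        for species, barcodes in training.items()
    }
    return min(scores, key=scores.get)


def mismatches(a, b):
    # Placeholder distance for the demo: number of differing positions.
    return sum(p != q for p, q in zip(a, b))


# Hypothetical toy data, not taken from the datasets in the slides.
training = {
    "S1": ["AAAC", "CCCC"],
    "S2": ["AACC", "AACC"],
}
x = "AAAC"
print(classify(x, training, mismatches, aggregate=min))   # Minimum classifier
print(classify(x, training, mismatches, aggregate=mean))  # Average classifier
```

On this toy input the two rules disagree: S1 contains an exact copy of x, while S2 is closer on average, which is the kind of difference between the MIN and AVG variants reported in the results tables.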

Hamming Distance

Percent of basepair divergences

Average:
  • Given barcode x, find the species S that minimizes the average Hamming distance from x to y (y in S)
  • species(x) = S

Minimum:
  • Given barcode x, find the barcode y that minimizes the Hamming distance from x to y
  • species(x) = species(y)
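A minimal sketch of the percent-divergence distance and the two resulting classifiers follows. The handling of gap and ambiguity symbols (`-` and `N` are skipped) is an assumption of this sketch; the slides do not say how such positions were treated, and the function names are hypothetical.

```python
from statistics import mean


def percent_divergence(x, y, skip="-N"):
    """Percentage of compared positions at which two aligned barcodes differ."""
    if len(x) != len(y):
        raise ValueError("barcodes must be aligned to the same length")
    # Assumption: positions with a gap or ambiguity code are not compared.
    compared = [(a, b) for a, b in zip(x, y) if a not in skip and b not in skip]
    if not compared:
        return 100.0  # assumption: nothing comparable counts as maximally distant
    return 100.0 * sum(a != b for a, b in compared) / len(compared)


def min_hd(x, training):
    """MIN-HD: species of the single closest training barcode."""
    pairs = [(s, y) for s, barcodes in training.items() for y in barcodes]
    species, _ = min(pairs, key=lambda p: percent_divergence(x, p[1]))
    return species


def avg_hd(x, training):
    """AVG-HD: species with the smallest average distance to x."""
    return min(
        training,
        key=lambda s: mean(percent_divergence(x, y) for y in training[s]),
    )


# Hypothetical toy data.
training = {"S1": ["ACGTAC", "ACGTAA"], "S2": ["TTGTAC", "TTGTCC"]}
print(min_hd("ACGTCC", training), avg_hd("ACGTCC", training))
```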

Aminoacid Similarity

Genetic code: rules that map DNA sequences to proteins

Codon: tri-nucleotide unit that encodes one aminoacid

Divide the DNA sequence into codons and substitute each one by its corresponding aminoacid

Blosum62 (BLOck SUbstitution Matrix)
  • 20x20 matrix that gives a score for each pair of aminoacids, based on substitution frequencies observed in aligned blocks of related proteins
  • The higher the score, the more likely the substitution causes no functional change in the protein

Aminoacid Similarity

Measures how similar the two aminoacid sequences encoded by the barcodes are

Distance(x,y)
  • barcodes x, y -> aminoacid sequences x’, y’ (using the genetic code)
  • Score of the aminoacid alignment using Blosum62

Average: find the species with maximum average similarity

Maximum: find the barcode with maximum similarity
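The translate-then-score idea can be sketched as below. Two simplifications are assumptions of this sketch and should be replaced in real use: the codon table is the standard genetic code (NCBI table 1), whereas COI is mitochondrial and a mitochondrial table would be the appropriate choice, and `identity_score` is only a stand-in for a Blosum62 lookup, which is too large to reproduce here. All function names are hypothetical.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code (NCBI translation table 1); codons ordered TTT, TTC, TTA, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(barcode, frame=0):
    """Split the barcode into codons and map each one to an aminoacid.

    Codons containing gaps or ambiguity codes are mapped to 'X'.
    """
    codons = (barcode[i:i + 3] for i in range(frame, len(barcode) - 2, 3))
    return "".join(CODON_TABLE.get(c, "X") for c in codons)


def aa_similarity(x, y, score):
    """Sum of substitution scores over the two aligned aminoacid sequences."""
    return sum(score(a, b) for a, b in zip(translate(x), translate(y)))


def identity_score(a, b):
    # Placeholder for a Blosum62 lookup: 1 for identical residues, 0 otherwise.
    return 1 if a == b else 0


print(translate("ATGGCCTAA"))
print(aa_similarity("ATGGCC", "ATGGCA", identity_score))
```

Note that GCC and GCA both encode alanine, so the second call sees no difference between the two barcodes even though their nucleotides differ; this is the sense in which the aminoacid view is more tolerant than the Hamming distance.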

Convex-score Similarity

“Long runs of consecutive basepair matches” indicate that the encoded aminoacid sequence plays an important role -> the two barcodes are “close” in evolutionary distance

The longer the run of basepair matches, the higher the score

The contribution of a run is convexly increasing with its length

The new sequence is assigned to the species containing the highest scoring sequence
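A minimal sketch of such a score is shown below. The slides only say that a run's contribution grows convexly with its length; the choice of `length ** 2` is an assumption made here for illustration, as are the function names and toy data.

```python
def convex_score(x, y, exponent=2):
    """Sum of run_length ** exponent over maximal runs of matching positions."""
    total, run = 0, 0
    for a, b in zip(x, y):
        if a == b:
            run += 1
        else:
            total += run ** exponent  # close the current run
            run = 0
    return total + run ** exponent    # account for a run ending at the last position


def max_cs_sim(x, training):
    """Assign x to the species containing the highest scoring barcode."""
    pairs = [(s, y) for s, barcodes in training.items() for y in barcodes]
    species, _ = max(pairs, key=lambda p: convex_score(x, p[1]))
    return species


# Hypothetical toy data: both training barcodes differ from x at one position.
training = {"S1": ["ACGTACGT"], "S2": ["ACGAACGA"]}
x = "ACGTACGA"
print(convex_score(x, "ACGTACGT"), convex_score(x, "ACGAACGA"))
print(max_cs_sim(x, training))
```

Both training barcodes are at Hamming distance 1 from x, yet the scores differ (49 versus 25) because the mismatch in S2 splits the alignment into two shorter runs.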

Trinucleotide Distance

For each species compute the vector of trinucleotide frequencies

For the new sequence x we compute the vector of trinucleotide frequencies

Find the closest species.

To measure the distance between two vectors of frequencies we use the mean square distance and pick the species at minimum distance
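The following sketch shows one way to compute the 64-dimensional frequency vectors and compare them. Several details are assumptions not stated in the slides: trinucleotides are counted in overlapping windows, the counts of all barcodes of a species are pooled, and windows containing gaps or ambiguity codes are ignored. The function names are hypothetical.

```python
from collections import Counter
from itertools import product

TRINUCLEOTIDES = ["".join(t) for t in product("ACGT", repeat=3)]


def trinucleotide_frequencies(sequences):
    """Pooled frequency vector (length 64) over overlapping 3-letter windows."""
    counts = Counter()
    for seq in sequences:
        counts.update(seq[i:i + 3] for i in range(len(seq) - 2))
    # Only windows made of A, C, G, T contribute to the total.
    total = sum(counts[t] for t in TRINUCLEOTIDES)
    return [counts[t] / total if total else 0.0 for t in TRINUCLEOTIDES]


def mean_square_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v)) / len(u)


def min_3freq(x, training):
    """Assign x to the species with the closest frequency vector."""
    fx = trinucleotide_frequencies([x])
    return min(
        training,
        key=lambda s: mean_square_distance(fx, trinucleotide_frequencies(training[s])),
    )


# Hypothetical toy data.
training = {"S1": ["AAAAAA"], "S2": ["ACGTACGT"]}
print(min_3freq("ACGTAC", training))
```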


Positional weight matrix

For each species we compute a positional weight matrix

For each locus the PWM gives the probability of seeing each nucleotide appear at that locus in that species

We assume independence of loci

For a barcode x we can compute the probability that x belongs to species S as the product of the probabilities of observing at every locus the respective nucleotide in x

Assign x to the species that gives the highest probability
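A minimal sketch of the positional weight matrix classifier is given below. Two implementation choices are assumptions of this sketch rather than details from the slides: a pseudocount is added to every nucleotide count so that an unseen nucleotide does not force the product to zero, and probabilities are multiplied in log space to avoid underflow over several hundred loci. Symbols outside A, C, G, T are skipped, and the function names are hypothetical.

```python
import math
from collections import Counter

ALPHABET = "ACGT"


def build_pwm(barcodes, pseudocount=1.0):
    """One probability distribution over A, C, G, T per locus."""
    pwm = []
    for i in range(len(barcodes[0])):
        counts = Counter(b[i] for b in barcodes)
        total = sum(counts[n] for n in ALPHABET) + pseudocount * len(ALPHABET)
        pwm.append({n: (counts[n] + pseudocount) / total for n in ALPHABET})
    return pwm


def log_probability(x, pwm):
    # Loci are assumed independent, so per-locus log-probabilities add up.
    return sum(math.log(col[n]) for n, col in zip(x, pwm) if n in col)


def max_pwm(x, training):
    """Assign x to the species whose matrix gives the highest probability."""
    pwms = {s: build_pwm(barcodes) for s, barcodes in training.items()}
    return max(pwms, key=lambda s: log_probability(x, pwms[s]))


# Hypothetical toy data.
training = {"S1": ["ACGT", "ACGT", "ACGA"], "S2": ["TTGT", "TTGA"]}
print(max_pwm("ACGA", training))
```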


Character-based pairwise species discrimination

Given species S1, S2 and new barcode x we find the k most discriminating characters

A locus -> character

Nucleotides -> possible values for character

Idea: If at a given locus, there is a nucleotide that appears in S1 and not in S2, then if x contains that nucleotide at that locus -> x is more likely to belong to S1 and not to S2


Character-based pairwise species discrimination

Finding the k most discriminative characters

The discriminative power of character i is given by

$$w(i) = \frac{\sum_{X \in \{A,C,T,G\}} \max\{\mathrm{Cnt}(i,X,S_1),\ \mathrm{Cnt}(i,X,S_2)\}}{\mathrm{Size}(S_1) + \mathrm{Size}(S_2)}$$

Cnt(i,X,S1) - the number of times we see nucleotide X at position i in species S1

Size(S1) - number of barcodes in species S1
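As a sanity check, the weight can be computed directly from the definitions of Cnt and Size, reading w(i) as the sum over the four nucleotides of the larger of the two per-species counts, divided by the total number of barcodes in both species. The sketch below uses hypothetical function names, and the split of the example columns into five barcodes per species is an assumption; with that split it reproduces the values w(i) = 1 and w(i) = 0.9 shown on the two example slides that follow.

```python
from collections import Counter


def discriminative_power(i, s1, s2, alphabet="ACTG"):
    """w(i): fraction of barcodes labelled correctly by character i alone."""
    c1 = Counter(b[i] for b in s1)
    c2 = Counter(b[i] for b in s2)
    return sum(max(c1[n], c2[n]) for n in alphabet) / (len(s1) + len(s2))


def k_best_characters(s1, s2, k):
    """Indices of the k characters with the highest discriminative power."""
    positions = range(len(s1[0]))
    ranked = sorted(
        positions,
        key=lambda i: discriminative_power(i, s1, s2),
        reverse=True,
    )
    return ranked[:k]


# One-column barcodes; the five/five split between species is assumed.
red = ["A", "A", "C", "C", "C"]
print(discriminative_power(0, red, ["T", "T", "T", "G", "G"]))  # 1.0
print(discriminative_power(0, red, ["A", "T", "T", "G", "G"]))  # 0.9
```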

Character-based pairwise species discrimination

Nucleotide at position i in the ten barcodes of the two species: A, A, C, C, C, T, T, T, G, G

w(i) = 1

The two species (red, blue) are discriminated by character i with 100% accuracy

The nucleotide present at position i in the new barcode x safely tells us to which species x is more likely to belong

i is a “pure” character

Character-based pairwise species discrimination

Nucleotide at position i in the ten barcodes of the two species: A, A, C, C, C, A, T, T, G, G

w(i) = 0.9

The two species (red, blue) are discriminated by character i with 90% accuracy

If the new barcode x has a C, T or G at i, we guess the species of x correctly

If the new barcode x has an A at i, then we choose the species of x as the species containing the highest number of A’s at i (the red species)


Character-based pairwise species discrimination

1. Given species S1, S2 and new barcode x we find the k most discriminating characters

2. We compute how many times species S1 is favored over S2 and output the most favored species

3. We repeat steps 1 and 2 for all pairs of species and the new barcode x

4. The species S that is favored the most in all these pairwise discriminations is assigned to barcode x
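The four steps can be sketched as a small pairwise tournament. This is one possible reading, not the authors' implementation: a character favors the species in which the nucleotide of x occurs more often at that position, a tied character casts no vote, a tied pair awards no win, and remaining ties are broken by the order in which species are listed. These tie rules and all names are assumptions.

```python
from collections import Counter
from itertools import combinations


def weight(i, s1, s2):
    """Discriminative power of character i for the species pair (s1, s2)."""
    c1, c2 = Counter(b[i] for b in s1), Counter(b[i] for b in s2)
    return sum(max(c1[n], c2[n]) for n in "ACGT") / (len(s1) + len(s2))


def favored(x, name1, s1, name2, s2, k):
    """Steps 1-2: vote with the k most discriminating characters."""
    best = sorted(range(len(x)), key=lambda i: weight(i, s1, s2), reverse=True)[:k]
    votes = Counter()
    for i in best:
        n1 = sum(b[i] == x[i] for b in s1)
        n2 = sum(b[i] == x[i] for b in s2)
        if n1 != n2:  # assumption: a tied character casts no vote
            votes[name1 if n1 > n2 else name2] += 1
    if votes[name1] == votes[name2]:
        return None  # assumption: a tied pair awards no win
    return name1 if votes[name1] > votes[name2] else name2


def k_best_classify(x, training, k=10):
    """Steps 3-4: run all pairwise contests, return the most favored species."""
    wins = Counter({name: 0 for name in training})
    for a, b in combinations(training, 2):
        winner = favored(x, a, training[a], b, training[b], k)
        if winner is not None:
            wins[winner] += 1
    return wins.most_common(1)[0][0]


# Hypothetical toy data.
training = {"S1": ["AAAA", "AAAA"], "S2": ["CCCC", "CCCA"], "S3": ["GGGG", "GGTG"]}
print(k_best_classify("AAAC", training, k=2))
```

The number of pairwise contests grows quadratically with the number of species, which is worth keeping in mind for datasets with 150 species.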

Combining the Methods

Every classifier outputs the species to which the new barcode is most likely to belong

Simple Voting:
  • Every classifier’s returned species has a weight of 1
  • Output the species with the most votes
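Simple voting amounts to a plurality count over the classifiers' outputs. In the sketch below the function name and the example predictions are hypothetical, and ties are resolved in favor of the species that was voted for first, which is an assumption since the slides do not specify a tie rule.

```python
from collections import Counter


def simple_vote(predictions):
    """Return the species named by the most classifiers (each vote has weight 1).

    Ties go to the species encountered first in `predictions`.
    """
    tally = Counter(predictions)
    return tally.most_common(1)[0][0]


# Hypothetical outputs of the individual classifiers for one barcode.
outputs = {
    "MIN-HD": "S1",
    "AVG-HD": "S2",
    "MAX-AA-SIM": "S1",
    "MAX-PWM": "S3",
}
print(simple_vote(list(outputs.values())))
```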

Datasets (1)

We used the dataset provided at http://dimacs.rutgers.edu/workshops/BarcodeResearchchallanges which consists of 1623 aligned sequences classified into 150 species, each sequence being 590 nucleotides long on average.

We randomly deleted from each species 10 to 50 percent of the sequences
  • Deleted seq -> test
  • Remaining seq -> train

We made sure that every species has at least one sequence
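The split can be sketched as follows. The sketch assumes that the "at least one sequence" guarantee refers to the training set, so that every species stays known to the classifiers; the function name, the rounding rule, and the toy data are likewise assumptions.

```python
import random


def split_by_species(dataset, fraction, rng):
    """Move about `fraction` of each species' barcodes to the test set.

    At least one barcode per species is always kept for training.
    """
    train, test = {}, {}
    for species, barcodes in dataset.items():
        shuffled = barcodes[:]
        rng.shuffle(shuffled)
        n_test = min(round(fraction * len(shuffled)), len(shuffled) - 1)
        test[species] = shuffled[:n_test]
        train[species] = shuffled[n_test:]
    return train, test


# Hypothetical toy dataset; a fixed seed keeps the split reproducible.
dataset = {"S1": [f"seq{i}" for i in range(10)], "S2": ["only-one"]}
train, test = split_by_species(dataset, 0.5, random.Random(0))
print({s: (len(train[s]), len(test[s])) for s in dataset})
```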

Species Recovery Accuracy (in %) (no new species, DAWG train dataset)

Percentage of barcodes removed from each species and used for testing

Classifier    10%   20%   30%   40%   50%
MIN-HD        98.8  98.0  97.8  97.2  96.0
AVG-HD        97.2  97.2  96.6  96.2  95.6
MAX-AA-SIM    99.0  99.0  99.2  98.4  96.8
AVG-AA-SIM    94.6  94.2  94.8  94.2  93.0
MAX-CS-SIM    98.2  98.2  98.6  97.6  97.4
MIN-3FREQ     94.6  93.8  94.2  92.0  92.4
MAX-PWM       98.0  98.6  97.8  95.4  94.6
10-BEST       98.6  97.0  97.6  96.2  96.2
COMBINED      99.4  99.4  99.6  98.6  98.0

Datasets (2)

We used the cowries dataset provided at xxx

We removed the species containing fewer than 4 barcodes

We randomly deleted from each species 10 to 50 percent of the sequences
  • Deleted seq -> test
  • Remaining seq -> train

We made sure that every species has at least one sequence

Species Recovery Accuracy (in %) (no new species)

Percentage of barcodes removed from each species and used for testing

Classifier    10%   20%   30%   40%   50%
MIN-HD        96.6  96.0  96.2  96.4  96.3
AVG-HD        95.0  95.4  94.4  95.2  94.8
MAX-AA-SIM    96.4  95.2  95.6  95.8  96.2
AVG-AA-SIM    93.8  94.0  92.6  92.8  92.8
MAX-CS-SIM    96.2  95.6  95.6  96.0  95.6
MIN-3FREQ     89.2  90.1  89.4  89.0  89.0
MAX-PWM       91.2  91.4  90.4  90.8  90.4
10-BEST       92.6  91.4  91.2  91.2  91.8
COMBINED      96.6  96.4  96.2  96.0  96.2

Datasets (3)

In order to test the accuracy of new species detection and classification we devised a regular leave-one-out procedure:
  • delete a whole species
  • randomly delete from each remaining species 0 to 50 percent of the sequences
  • Deleted seq -> test
  • Remaining seq -> train

The following table gives accuracy results on average for 150x6 different test cases
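One way to generate such test cases is sketched below: each species is held out in turn and combined with six deletion percentages, which gives 150x6 cases for a 150-species dataset. That reading of "150x6", the `NEW` label for held-out barcodes, the choice to keep at least one training barcode per remaining species, and all function names are assumptions of this sketch.

```python
import random

NEW_SPECIES = "NEW"  # hypothetical label for barcodes whose species is absent from training


def leave_one_species_out(dataset, held_out, fraction, rng):
    """Build one test case: drop `held_out` entirely, then sample the rest."""
    train = {}
    test = [(b, NEW_SPECIES) for b in dataset[held_out]]
    for species, barcodes in dataset.items():
        if species == held_out:
            continue
        shuffled = barcodes[:]
        rng.shuffle(shuffled)
        n_test = min(round(fraction * len(shuffled)), len(shuffled) - 1)
        test += [(b, species) for b in shuffled[:n_test]]
        train[species] = shuffled[n_test:]
    return train, test


def all_test_cases(dataset, fractions=(0.0, 0.1, 0.2, 0.3, 0.4, 0.5), seed=0):
    """Yield (held_out, fraction, (train, test)) for every species and fraction."""
    rng = random.Random(seed)
    for held_out in dataset:
        for fraction in fractions:
            yield held_out, fraction, leave_one_species_out(dataset, held_out, fraction, rng)


# Hypothetical toy dataset with three species.
dataset = {"S1": ["a1", "a2", "a3", "a4"], "S2": ["b1", "b2"], "S3": ["c1", "c2", "c3"]}
print(sum(1 for _ in all_test_cases(dataset)))
```

A classifier evaluated on such a case is scored as correct on a held-out barcode only if it reports a new species, and on the other test barcodes only if it names their true species.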

Leave-one-out Accuracy (in %), DAWG train dataset

Percentage of additional barcodes removed from each species and used for testing

Classifier    0%    10%   20%   30%   40%   50%
MIN-HD        80.9  91.7  92.8  91.6  90.3  88.4
AVG-HD        81.1  91.5  92.3  91.0  89.9  87.8
MAX-AA-SIM    83.4  82.7  82.9  80.2  78.4  74.8
AVG-AA-SIM    83.1  89.5  89.3  88.8  88.3  88.2
MAX-CS-SIM    94.3  94.4  94.0  92.9  91.7  89.7
MIN-3FREQ     82.9  70.3  69.6  67.8  65.8  63.0
MAX-PWM       91.2  91.7  91.6  89.8  88.0  85.4
10-BEST       93.3  94.7  93.8  92.6  91.6  89.6
COMBINED      93.7  97.6  97.8  97.8  97.4  97.0

Leave-one-out Accuracy (in %), Cowries dataset

Percentage of additional barcodes removed from each species and used for testing

Classifier    0%    10%   20%   30%   40%   50%
MIN-HD        79.7  90.8  90.9  89.8  88.7  86.4
AVG-HD        75.8  88.1  87.9  86.6  85.1  82.8
MAX-AA-SIM    82.6  83.2  81.9  80.3  78.9  76.5
AVG-AA-SIM    60.2  90.0  91.3  91.4  91.2  90.3
MAX-CS-SIM    70.7  93.5  94.7  94.8  95.1  94.4
MIN-3FREQ     86.4  68.1  65.7  65.2  64.6  63.5
MAX-PWM       86.1  78.9  77.1  76.4  75.4  73.4
10-BEST       62.3  88.6  89.2  89.5  89.8  88.1
COMBINED      92.7  82.3  81.8  82.3  82.3  81.8


Conclusions(1)

Every method shows a tradeoff between new species detection and classification accuracy

Hamming distance performs very well when no new species are present, but its accuracy is low for new species detection

The combined method yields better accuracy results both on new species detection and seq. classification.

The runtime of all methods is within the same order of magnitude


Future Work

New species clustering: determining the different new species present

Further investigate threshold selection and weighting schemes.

Possibly, ignoring parts of the given sequences could improve accuracy. Are there redundant/noisy regions?

Use independent weighting schemes for new species detection and classification into known species.