BioinfoGRID Symposium 2007
Mathematical methods for the analysis of the Barcode Of Life
P. Bertolazzi, G. Felici
Istituto di Analisi dei Sistemi e Informatica del Consiglio Nazionale delle Ricerche
Optimization Laboratory for Data Mining
DNA Barcoding 1/4
• DNA barcoding is a new technique that uses a short DNA sequence from a standardized and agreed-upon position in the genome as a molecular diagnostic for species-level identification.
• The chosen sequence for the barcode is a small portion of mitochondrial DNA (mt-DNA) that differs by several percent, even in closely related species, and carries enough information to identify the species of an individual.
• It is easy to isolate and analyze.
• Moreover, it summarizes many properties of the entire mt-DNA sequence.
DNA Barcoding 2/4
A typical animal cell. Within the cytoplasm, the major organelles and cellular structures include: (1) nucleolus (2) nucleus (3) ribosome (4) vesicle (5) rough endoplasmic reticulum (6) Golgi apparatus (7) cytoskeleton (8) smooth endoplasmic reticulum (9) mitochondria (10) vacuole (11) cytosol (12) lysosome (13) centriole.
DNA Barcoding 3/4
• The first studies on barcoding (2003) are due to Hebert (see [1] for the latest results and a complete bibliography)
• Two mt-DNA subsequences (genes) have been proposed as barcode:
– Cytochrome c Oxidase I (COI)
– Cytochrome b
• Since 2003 COI has been used by Hebert to study fishes, birds, and other species
• Hebert employs the Neighbor Joining (NJ) [2] method, originally proposed for building phylogenetic trees, and identifies each species as represented by a distinct, non-overlapping cluster of sequences in the NJ tree.
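As a minimal sketch of the clustering step Hebert relies on, one agglomeration step of Neighbor Joining (the Q-criterion of [2]) can be written as follows, assuming a symmetric distance matrix `d` given as a list of lists (the toy matrix is illustrative, not barcode data):

```python
def nj_pick_pair(d):
    """One Neighbor Joining step: return the pair (i, j) minimizing
    Q(i, j) = (n - 2) * d[i][j] - sum_k d[i][k] - sum_k d[j][k]."""
    n = len(d)
    r = [sum(row) for row in d]  # total distance of each taxon
    return min(((i, j) for i in range(n) for j in range(i + 1, n)),
               key=lambda p: (n - 2) * d[p[0]][p[1]] - r[p[0]] - r[p[1]])

# four taxa: a/b and c/d form two tight pairs
d = [[0, 2, 4, 4],
     [2, 0, 4, 4],
     [4, 4, 0, 2],
     [4, 4, 2, 0]]
```

Repeatedly joining the selected pair (and recomputing distances to the new internal node) yields the NJ tree in which each species should appear as a distinct cluster.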
DNA Barcoding 4/4
• Recent studies [1] show that even fragments of the COI sequence have the same expressive power as the entire sequence
• The Consortium of Barcode of Life (CBOL) is an international initiative devoted to developing DNA barcoding as a global standard for the identification of biological species http://www.barcoding.si.edu/
• Data Analysis Working Group (CBOL and DIMACS, Rutgers)
http://dimacs.rutgers.edu/Workshops/BarcodeResearchChallenges2007/
http://dimacs.rutgers.edu/Workshops/DNAInitiative/
Research challenges 1/2
• Specimen identification versus species discovery: the knowledge about species is not always complete
• Is a species similar to another or not?
• Optimizing sample size: barcodes are not easy to measure, large samples are very expensive
• Using character-based barcodes: an alternative to comparing specimens in terms of overall percent sequence similarity
• Shrinking the barcode
Research challenges 2/2
• Shrinking the barcode: we want to identify the portions of the barcode that are relevant to each species; this would make it easier to identify new data on species and to study it
A barcode fragment for 2 species:
Species 1: CTGGCATAGTAGGTACTGCCCTTAGCCTCCCCCCAGTCCTCTCCC
Species 2: CTGGCATAGTCGGAACCGCTCTCAGCCTACCCCCAGCCCTTTCTC
• 45 sites
• The bases differ on only 8 sites
• In this case a single site is sufficient to distinguish the 2 species
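The site-by-site comparison above is a simple Hamming-style scan; a quick sketch (sequences copied from this slide):

```python
seq1 = "CTGGCATAGTAGGTACTGCCCTTAGCCTCCCCCCAGTCCTCTCCC"
seq2 = "CTGGCATAGTCGGAACCGCTCTCAGCCTACCCCCAGCCCTTTCTC"

# 1-based positions where the two species differ
diff_sites = [i + 1 for i, (a, b) in enumerate(zip(seq1, seq2)) if a != b]
print(len(seq1), "sites,", len(diff_sites), "differing:", diff_sites)
```

Any one of the species-specific differing positions can then serve as the distinguishing site quoted on the slide.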
Main goal
• Given samples from different species, the objective is to identify those combinations of mutated nucleotides that have determined the differences among species along the evolutionary path
Our work 1/2
– We have addressed these challenges using supervised learning and some special methods that we have developed for the classification of large-scale logic data
– Such methods have already proved successful in other bio-computational problems (Tag SNP selection, microarray data analysis, genetic diagnostics) [3]:
• Feature selection methods based on Integer Programming: aim at finding a subset of bases that allows one to distinguish among species
• Logic classification methods based on logic programming: aim at finding rules or formulas that can distinguish between individuals of different species
Our Work 2/2
• Feature Selection is used to select a limited number of positions where the values of the bases differ consistently from species to species;
• Logic classification methods use the selected positions to find formulas with high semantic value that can provide a deep understanding of the analyzed data.
Logic classification is computationally expensive, so the two steps are chained:
Many features → Feature selection → Few features → Logic classification → Compact logic formulas
Supervised Learning: the standard setting
• Given:
– A finite number of objects described by a finite number of measures X = (x1, x2, …, xn)
– An interesting characteristic of these objects Y (TARGET)
• Determine the relation between TARGET and measures: Y = F(X)
– It is only required that, for each set of measures X, it produces an output Y
• Very important:
– The form of the relation is decided up front;
– The parameters of the relation are determined by the data;
– The results of the estimation need to be validated.
• Classification: Y ∈ {0, 1, 2, …}
• Estimation: Y ∈ ℝ
Supervised Learning: training and testing
• If the data contain good information and the choice of the model is correct, the model fits the training data well
• A good fit on the training data may provide information and knowledge
• The model should also adapt well to test data
• Only a good fit on test data strengthens the knowledge extracted and validates the forecasting or classification function of the model
[Flow: Data → Training Set / Testing Set; Choose model → Estimate model on the training set → Performance on the testing set? → ok]
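This train/test validation can be sketched with a toy hold-out split and a 1-nearest-neighbour model on short sequences (all data made up for illustration):

```python
import random

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def predict(train, s):
    """1-NN by Hamming distance: label of the closest training sequence."""
    return min(train, key=lambda t: hamming(t[0], s))[1]

# toy labelled sequences (illustrative only): two well-separated classes
data = [("AAAA", "sp1"), ("AAAT", "sp1"), ("AATA", "sp1"),
        ("CCCC", "sp2"), ("CCCG", "sp2"), ("CCGC", "sp2")]
random.Random(0).shuffle(data)
train, test = data[:4], data[4:]

accuracy = sum(predict(train, s) == y for s, y in test) / len(test)
print(accuracy)  # → 1.0 on this well-separated toy data
```

Only the accuracy on the held-out `test` portion validates the classifier, exactly as in the flow above.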
The Special case of Logic Data 1/2
• Frequently, objects are (or can be) described only by logic attributes (True/False)
• The classification function is expressed in terms of:
– logic variables p, q, r, with True/False values
– conjunction (p ∧ q), disjunction (p ∨ q), negation (¬p)
Logic variables (properties of the object), with True/False/? encoded as 1/-1/0:

    S1 S2 S3 S4        S1  S2  S3  S4
A    T  T  F  ?    A    1   1  -1   0
A    T  F  F  T    A    1  -1  -1   1
A    T  F  F  F    A    1  -1  -1  -1
I    T  T  T  ?    I    1   1   1   0
I    F  ?  F  T    I   -1   0  -1   1

S1 ∧ ¬S3 ↔ (1, 0, -1, 0)
"If the object shows both the presence of property 1 and the absence of property 3, then it is in class A, else it is in class I"
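The encoding and the check of the formula S1 ∧ ¬S3 can be sketched directly (rows copied from the example table; the helper names are illustrative):

```python
ENC = {"T": 1, "F": -1, "?": 0}  # True / False / missing

rows = [("A", "TTF?"), ("A", "TFFT"), ("A", "TFFF"),
        ("I", "TTT?"), ("I", "F?FT")]

def classify(sites):
    """Evaluate S1 AND NOT S3, i.e. the pattern (1, 0, -1, 0):
    class A iff property 1 is present and property 3 is absent."""
    v = [ENC[c] for c in sites]
    return "A" if v[0] == 1 and v[2] == -1 else "I"

predictions = [classify(s) for _, s in rows]  # matches all five labels
```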
The Special case of Logic Data 2/2
[Table: ~49 binary records over attributes X1–X6, split into a red class and a blue class.]

IF [ -X1 & -X2 ] OR [ -X2 & -X3 ] OR [ -X1 & -X3 ]: true for red, false for blue
IF [ X2 & X6 ] OR [ X1 & X3 ] OR [ X2 & X3 ] OR [ X1 & X2 ]: true for blue, false for red
Mining Logic Data
• The Lsquare Logic Miner [4,5,6]
• Builds logic separations in Disjunctive Normal Form (DNF)
• Iteratively identifies the clauses of the DNF that separate the largest part of the objects in one class from all the objects of the other class
• Clause identification is based on the solution of a Minimum Cost Satisfiability Problem (MINSAT), which is computationally hard
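A minimal greedy stand-in for this clause search (the real system solves MINSAT instances exactly; this sketch only mimics the iterative "cover many positives, exclude all negatives" behaviour) over binary records:

```python
def learn_dnf(pos, neg, n):
    """Greedily build a DNF: each clause is a conjunction of (index, value)
    literals that rejects every negative record, chosen to keep as many
    still-uncovered positive records as possible."""
    clauses, remaining = [], list(pos)
    while remaining:
        lits, covered, bad = [], list(remaining), list(neg)
        while bad and covered:
            j, v = max(((j, v) for j in range(n) for v in (0, 1)),
                       key=lambda lv: (sum(r[lv[0]] != lv[1] for r in bad),
                                       sum(r[lv[0]] == lv[1] for r in covered)))
            if all(r[j] == v for r in bad):
                break  # no literal rejects any remaining negative
            lits.append((j, v))
            covered = [r for r in covered if r[j] == v]
            bad = [r for r in bad if r[j] == v]
        if bad or not covered:
            break  # remaining positives cannot be separated
        clauses.append(lits)
        remaining = [r for r in remaining
                     if not all(r[j] == v for j, v in lits)]
    return clauses

def satisfies(dnf, r):
    return any(all(r[j] == v for j, v in clause) for clause in dnf)

# toy records where X1 AND X2 separates the classes
pos = [(1, 1, 0), (1, 1, 1)]
neg = [(0, 1, 0), (1, 0, 1), (0, 0, 0)]
dnf = learn_dnf(pos, neg, 3)
```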
Each site S_i is encoded by two logic variables: p_i (S_i = True) and q_i (S_i = False). The MINSAT instance requires the candidate conjunction to evaluate to True on every row of class A and to False on every row of class I, while minimizing the cost of the selected literals.

    S1 S2 S3 S4
A    T  T  F  ?
A    T  F  F  T
A    T  F  F  F
I    T  T  T  ?
I    F  ?  F  T

Satisfying solution: S1 ∧ ¬S3
Mining Logic Data
• At each iteration two MINSAT problems are solved: the first selects the largest separable subset of the objects still to be covered, the second identifies the clause with the desired support
• Lsquare finds 4 DNF formulas:
– 1: A from B, min support
– 2: A from B, max support
– 3: B from A, min support
– 4: B from A, max support
– ALL: majority vote, with undecided cases
• The behavior of the formulas strongly interacts with the quality of the data…
Feature Selection
Most methods for FS are based on a greedy construction of the feature set, and do not take into proper account the interactions among the selected features (correlation, collinearity).
An interesting method adopts Integer Programming (Set Covering) to select the smallest set of features that makes it possible to differentiate logic records belonging to different classes.
The Set Covering Approach for FS (Rutgers – LAD approach):
• A binary variable x_j is associated with each feature j (x_j = 1 iff feature j is selected)
• A linear constraint is associated with each pair of objects belonging to different classes
• The set covering model requires that each pair of objects belonging to different classes is differentiated by at least one of the selected features

min Σ_j x_j
s.t. Σ_j a_(ik)j x_j ≥ 1   for each pair i ∈ A, k ∈ I
     x_j ∈ {0, 1}

where a_(ik)j = 1 iff objects i and k belong to different classes and differ on feature j.
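A minimal greedy stand-in for this ILP on toy binary records (the authors solve the exact model; greedy selection is the classic set-covering approximation, shown here only to make the constraint structure concrete):

```python
from itertools import product

def greedy_feature_cover(class_a, class_b, n):
    """Pick features until every cross-class pair (a, b) differs on at
    least one chosen feature (greedy approximation of the covering ILP)."""
    pairs = [p for p in product(class_a, class_b) if p[0] != p[1]]
    uncovered = set(range(len(pairs)))
    chosen = []
    while uncovered:
        # feature differentiating the most still-uncovered pairs
        j = max(range(n), key=lambda j: sum(pairs[i][0][j] != pairs[i][1][j]
                                            for i in uncovered))
        hit = {i for i in uncovered if pairs[i][0][j] != pairs[i][1][j]}
        if not hit:
            break  # identical records in both classes: not separable
        chosen.append(j)
        uncovered -= hit
    return chosen

# toy data: feature 0 alone differentiates every cross-class pair
A = [(0, 0, 1), (0, 1, 1)]
B = [(1, 0, 0), (1, 1, 0)]
```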
Feature Selection: a simplified version
• To overcome the intractable dimensions of the quadratic model, we propose a simplified version of the set covering model (Linear SC), where the number of constraints is equal to the number of samples
• Let f_i be feature i, PA(i) the proportion of elements with f_i = 1 in class A, and PB(i) the proportion of elements with f_i = 1 in class B; x_i = 1 iff feature i is chosen in the solution
• If PA(i) > PB(i), the coefficient d_ij is:
– 1 if row j is in class A and f_i = 1
– 0 if row j is in class A and f_i = 0
– 1 if row j is in class B and f_i = 0
– 0 if row j is in class B and f_i = 1
(with the roles of A and B exchanged when PA(i) < PB(i))
• Coverage is given, the number of features is minimized:

min Σ_i x_i
s.t. Σ_i d_ij x_i ≥ α   for each row j
     x_i ∈ {0, 1}

• Coverage is maximized, given the bound β on the number of features:

max Σ_j Σ_i d_ij x_i
s.t. Σ_i x_i ≤ β
     x_i ∈ {0, 1}

• The number of rows is linear in the number of examples; redundancy can be controlled; heuristics may still be needed for a large number of features…
Application to Barcode Data
• Data set from the 2006 conference
• 1623 samples belonging to 150 different species
• Each sample is described by 690 nucleotides (columns)
• We search for a compact rule for each one of the 150 classes
• For each species k, we solve a 2-class learning problem:
– class A: all samples in class k
– class B: samples of all classes different from k
• We use Linear Set Covering for feature selection and logic mining to determine the formulas on a training subset (80-90%) of the available data, and then test their classification capabilities on the remaining data
• Training and testing samples are drawn at random, maintaining the same proportion in each class
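The stratified draw can be sketched as follows (function name and the toy fractions are illustrative; the 80-90% split of the slide corresponds to `test_frac` between 0.1 and 0.2):

```python
import random

def stratified_split(samples, labels, test_frac=0.2, seed=0):
    """Split into train/test keeping (approximately) the same class
    proportions in both parts."""
    rng = random.Random(seed)
    by_class = {}
    for s, y in zip(samples, labels):
        by_class.setdefault(y, []).append(s)
    train, test = [], []
    for y, rows in by_class.items():
        rng.shuffle(rows)
        k = max(1, round(len(rows) * test_frac))  # at least one test sample
        test += [(r, y) for r in rows[:k]]
        train += [(r, y) for r in rows[k:]]
    return train, test

samples, labels = list(range(15)), ["a"] * 10 + ["b"] * 5
train, test = stratified_split(samples, labels, test_frac=0.2)
```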
Application to Barcode Data
DATA SET → TRAINING SET + TEST SET
• FEATURE SELECTION: the Integer Programming model associated with the Linear Set Covering model is solved to optimality with the commercial solver ILOG CPLEX, using 10, 20 and 30 as values for β (the bound on the number of selected columns)
• FORMULA EXTRACTION: LSQUARE is used on the training set to separate each species from the other 149, and a compact formula explaining each species is obtained
• TEST SET: the formulas are used to predict the species of each element in the test set
Application to Barcode Data
• LSC construction:
– PAjk = proportion of samples in class k with nucleotide A in column j
– PCjk = proportion of samples in class k with nucleotide C in column j
– PGjk = proportion of samples in class k with nucleotide G in column j
– PTjk = proportion of samples in class k with nucleotide T in column j
– PNAjk = proportion of samples in classes ≠ k with nucleotide A in column j
– PNCjk = proportion of samples in classes ≠ k with nucleotide C in column j
– PNGjk = proportion of samples in classes ≠ k with nucleotide G in column j
– PNTjk = proportion of samples in classes ≠ k with nucleotide T in column j
• aij = 1 iff:
– sample i is in class k, its nucleotide in column j is A, and PAjk > 2·PNAjk
– sample i is in class k, its nucleotide in column j is C, and PCjk > 2·PNCjk
– sample i is in class k, its nucleotide in column j is G, and PGjk > 2·PNGjk
– sample i is in class k, its nucleotide in column j is T, and PTjk > 2·PNTjk
max Σ_i Σ_j a_ij x_j
s.t. Σ_j x_j ≤ β
     x_j ∈ {0, 1}

• Select the columns with the largest coverage
• Use these columns to formulate a separation problem for each class with Lsquare (1 vs all)
• Obtain a logic formula for each class
An example:

         1 2 3 4
CLASS 1  A C G T
CLASS 1  A A A T
CLASS 2  A A T A
CLASS 3  C G T C

CLASS 1   1   2   3   4         1    2    3    4
PA       1.0 0.5 0.5 0.0   PNA 0.50 0.50 0.00 0.50
PC       0.0 0.5 0.0 0.0   PNC 0.50 0.00 0.00 0.50
PG       0.0 0.0 0.5 0.0   PNG 0.00 0.50 0.00 0.00
PT       0.0 0.0 0.0 1.0   PNT 0.00 0.00 1.00 0.00

CLASS 2   1   2   3   4         1    2    3    4
PA       1.0 1.0 0.0 1.0   PNA 0.67 0.33 0.33 0.00
PC       0.0 0.0 0.0 0.0   PNC 0.33 0.33 0.00 0.33
PG       0.0 0.0 0.0 0.0   PNG 0.00 0.33 0.33 0.00
PT       0.0 0.0 1.0 0.0   PNT 0.00 0.00 0.33 0.67

CLASS 3   1   2   3   4         1    2    3    4
PA       0.0 0.0 0.0 0.0   PNA 1.00 0.67 0.33 0.33
PC       1.0 0.0 0.0 1.0   PNC 0.00 0.33 0.00 0.00
PG       0.0 1.0 0.0 0.0   PNG 0.00 0.00 0.33 0.00
PT       0.0 0.0 1.0 0.0   PNT 0.00 0.00 0.33 0.67

Matrix a (samples × columns):
    1 2 3 4
1   1 0 1 1
2   1 1 1 1
3   0 1 1 1
4   1 1 1 1

max (x1 + x3 + x4) + (x1 + x2 + x3 + x4) + (x2 + x3 + x4) + (x1 + x2 + x3 + x4)
s.t. x1 + x2 + x3 + x4 ≤ β
     x_j ∈ {0, 1}

With β = 1 (one column selected): x3 = 1, x1 = x2 = x4 = 0: non separable
                                  x4 = 1, x1 = x2 = x3 = 0: separable
With β = 2: x3 = x4 = 1, x1 = x2 = 0: separable

IF (X4 = T) THEN CLASS 1
IF (X4 = A) THEN CLASS 2
IF (X4 = C) THEN CLASS 3
Results

β   α   test %   training error   testing error
10  2   10       11.14%           11.38%
10  2   10        5.62%            8.98%
10  3   10        5.99%            7.19%
10  3   20        9.21%           10.17%
10  3   20        7.11%            9.75%
10  2   20        7.81%            8.05%
average (β = 10)   7.81%            9.25%
20  6   10        2.06%            2.40%
20  6   10        0.84%            1.20%
20  6   10        0.28%            0.60%
20  6   20        1.90%            2.12%
20  6   20        0.30%            1.27%
20  6   20        0.30%            1.27%
average (β = 20)   0.95%            1.48%
30  9   10        0.37%            0.60%
30  9   10        0.28%            0.60%
30  9   10        0.37%            0.60%
30  9   20        0.30%            1.69%
30  9   20        0.30%            0.85%
average (β = 30)   0.33%            0.87%
overall            3.03%            3.87%

For each row we solve 1 set covering problem and 150 logic classification problems; β ∈ {10, 20, 30}, test % ∈ {10, 20}, 3 random repetitions for each setting.
Results
Selected site → occurrences in formulas:
580: 54   490: 49   346: 41   469: 41   544: 41
637: 41   331: 39   445: 34   154: 30    85: 29
307: 29   379: 29    31: 25   211: 25    28: 22
 40: 22    10: 21    16: 15    34: 15     4: 14

[Chart: average error on test for different values of β (10, 20, 30); range 0.00%-10.00%.]
[Chart: mode of the optimal α for different values of β (10, 20, 30); range 1-10.]

Site 580 appears 54 times in the 150 formulas that discriminate each class from the rest (approx. 1/3).
Results
• SPECIES DIM CLAUSE(S)
• 0  1  * 274 A  499 T  580 C
• 1  1  * 19 T  172 T
• 2  1  * 340 G  343 A  445 C
• 3  1  * 445 T  499 T  580 G
• 4  1  * 172 C  445 T  493 G  499 C  580 A
• 5  1  * 58 G  430 T
• 6  1  * 268 C  289 A  334 C  430 T
• 7  1  * 136 A  277 C  445 T
• 8  2  * 58 A  121 T  172 C  * 277 T  499 G
• 9  1  * 163 T  274 A  334 G
• 10  1  * 19 T  331 C  652 T
• 11  1  * 4 C  10 T  274 C  289 A  445 A
• 12  1  * 340 G  445 G
• 13  1  * 277 A  340 A  430 T  445 C
• 14  1  * 121 C  331 T
• 15  1  * 58 A  283 C  331 T
Results
IF BASE IN POSITION 19 IS T AND BASE IN POSITION 172 IS T THEN SPECIES IS … 1
IF BASE IN POSITION 340 IS G AND BASE IN POSITION 343 IS A AND BASE IN POSITION 445 IS C THEN SPECIES IS … 2
IF (BASE IN POSITION 58 IS A AND BASE IN POSITION 121 IS T AND BASE IN POSITION 172 IS C) OR (BASE IN POSITION 277 IS T AND BASE IN POSITION 499 IS G) THEN SPECIES IS … 8
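Applying such rules to a new sequence is direct; a sketch with three of the formulas above encoded by hand (the `RULES` dict is an illustrative encoding, not the system's own format):

```python
# species → DNF formula; a clause is a list of (1-based position, base)
RULES = {
    1: [[(19, "T"), (172, "T")]],
    2: [[(340, "G"), (343, "A"), (445, "C")]],
    8: [[(58, "A"), (121, "T"), (172, "C")], [(277, "T"), (499, "G")]],
}

def classify(seq, rules=RULES):
    """Return the species whose formula the sequence satisfies, or None."""
    for species, clauses in rules.items():
        if any(all(seq[pos - 1] == base for pos, base in clause)
               for clause in clauses):
            return species
    return None
```

Because the clauses touch only a handful of the 690 positions, identifying a new specimen requires reading just those few sites.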
Related work
• Haplotype Inference by Parsimony: a new, very efficient heuristic
• Tag SNP and SNP reconstruction problem
• Phylogenetic tree in polyploid organisms
Conclusions and Future work
• The logic technique is very powerful for identifying small non-contiguous subsequences of the barcode
• We are testing the technique on a very large data set from Lepidoptera
• We will compare our technique with the NJ method [2]
• CBOL and the DAWG have asked us to implement our technique as a web service
• We are designing a software platform that implements algorithms for all the above problems
References
• [1] M. Hajibabaei, G. A. C. Singer, E. L. Clare, P. D. N. Hebert, Design and applicability of DNA arrays and DNA barcodes in biodiversity monitoring, BMC Biology, 2007
• [2] N. Saitou, M. Nei, The neighbor-joining method: a new method for reconstructing phylogenetic trees, Mol Biol Evol, 1987
• [3] P. Bertolazzi, G. Felici, P. Festa, G. Lancia, Logic Classification and Feature Selection for Biomedical Data, Computers & Mathematics with Applications, on-line version (2007)
• [4] G. Felici, K. Truemper, A MINSAT Approach for Learning in Logic Domains, INFORMS Journal on Computing, 2002
• [5] K. Truemper, Design of Logic-Based Intelligent Systems, Wiley-Interscience, 2004
• [6] G. Felici, K. Truemper, The Lsquare System for Mining Logic Data, Encyclopedia of Data Warehousing and Mining, 2005
• [7] P. Bertolazzi, G. Felici, Species Classification with Optimized Logic Formulas, poster, EMBO Conference, Rome, May 2007
• [8] P. Bertolazzi, G. Felici, Species Classification with Optimized Logic Formulas, invited talk, Second BOL Conference, Taipei, Sept. 2007