BioinfoGRID Symposium 2007
Mathematical methods for the analysis of the Barcode Of Life
P. Bertolazzi, G. Felici
Istituto di Analisi dei Sistemi e Informatica del Consiglio Nazionale delle Ricerche
Optimization Laboratory for Data Mining
DNA Barcoding 1/4
• DNA barcoding is a new technique that uses a short DNA sequence from a standardized and agreed-upon position in the genome as a molecular diagnostic for species-level identification.
• The chosen sequence for the barcode is a small portion of mitochondrial DNA (mt-DNA) that differs by several percent, even in closely related species, and carries enough information to identify the species of an individual.
• It is easy to isolate and analyze.
• Moreover, it summarizes many properties of the entire mt-DNA sequence.
DNA Barcoding 2/4
A typical animal cell. Within the cytoplasm, the major organelles and cellular structures include: (1) nucleolus (2) nucleus (3) ribosome (4) vesicle (5) rough endoplasmic reticulum (6) Golgi apparatus (7) cytoskeleton (8) smooth endoplasmic reticulum (9) mitochondria (10) vacuole (11) cytosol (12) lysosome (13) centriole.
DNA Barcoding 3/4
• The first studies on barcoding (2003) are due to Hebert (see [1] for the latest results and a complete bibliography)
• Two mt-DNA subsequences (genes) have been proposed as barcode:
– Cytochrome c Oxidase I (COI)
– Cytochrome b
• Since 2003 COI has been used by Hebert to study fishes, birds, and other species
• Hebert employs the Neighbor Joining (NJ) [2] method, originally proposed for building phylogenetic trees, and identifies each species as represented by a distinct, non-overlapping cluster of sequences in the NJ tree.
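As a minimal sketch of the clustering step Hebert relies on, one agglomeration step of Neighbor Joining (the Q-criterion of [2]) can be written as follows, assuming a symmetric distance matrix `d` given as a list of lists (the toy matrix is illustrative, not barcode data):

```python
def nj_pick_pair(d):
    """One Neighbor Joining step: return the pair (i, j) minimizing
    Q(i, j) = (n - 2) * d[i][j] - sum_k d[i][k] - sum_k d[j][k]."""
    n = len(d)
    r = [sum(row) for row in d]  # total distance of each taxon
    return min(((i, j) for i in range(n) for j in range(i + 1, n)),
               key=lambda p: (n - 2) * d[p[0]][p[1]] - r[p[0]] - r[p[1]])

# four taxa: a/b and c/d form two tight pairs
d = [[0, 2, 4, 4],
     [2, 0, 4, 4],
     [4, 4, 0, 2],
     [4, 4, 2, 0]]
```

Repeatedly joining the selected pair (and recomputing distances to the new internal node) yields the NJ tree in which each species should appear as a distinct cluster.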
DNA Barcoding 4/4
• Recent studies [1] show that even fragments of the COI sequence have the same expressive power as the entire sequence
• The Consortium of Barcode of Life (CBOL) is an international initiative devoted to developing DNA barcoding as a global standard for the identification of biological species http://www.barcoding.si.edu/
• Data Analysis Working Group (CBOL and DIMACS, Rutgers)
http://dimacs.rutgers.edu/Workshops/BarcodeResearchChallenges2007/
http://dimacs.rutgers.edu/Workshops/DNAInitiative/
Research challenges 1/2
• Specimen identification versus species discovery: the knowledge about species is not always complete
• Is a species similar to another or not?
• Optimizing sample size: barcodes are not easy to measure, large samples are very expensive
• Using character-based barcodes: an alternative to comparing specimens in terms of overall percent sequence similarity
• Shrinking the barcode
Research challenges 2/2
• Shrinking the barcode: we want to identify the portions of the barcode that are relevant to each species; this would make it easier to identify new data on species and to study it
A barcode fragment for 2 species:
Species 1: CTGGCATAGTAGGTACTGCCCTTAGCCTCCCCCCAGTCCTCTCCC
Species 2: CTGGCATAGTCGGAACCGCTCTCAGCCTACCCCCAGCCCTTTCTC
• 45 sites
• The bases differ on only 8 sites
• In this case a single site is sufficient to distinguish the 2 species
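The site-by-site comparison above is a simple Hamming-style scan; a quick sketch (sequences copied from this slide):

```python
seq1 = "CTGGCATAGTAGGTACTGCCCTTAGCCTCCCCCCAGTCCTCTCCC"
seq2 = "CTGGCATAGTCGGAACCGCTCTCAGCCTACCCCCAGCCCTTTCTC"

# 1-based positions where the two species differ
diff_sites = [i + 1 for i, (a, b) in enumerate(zip(seq1, seq2)) if a != b]
print(len(seq1), "sites,", len(diff_sites), "differing:", diff_sites)
```

Any one of the species-specific differing positions can then serve as the distinguishing site quoted on the slide.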
Main goal
• Given samples from different species, the objective is to identify those combinations of mutated nucleotides that have determined the differences among species along the evolutionary path
Our work 1/2
– We have addressed these challenges using supervised learning and some special methods that we have developed for the classification of large-scale logic data
– Such methods have already proved successful in other bio-computational problems (Tag SNP selection, microarray data analysis, genetic diagnostics) [3]:
• Feature selection methods based on Integer Programming: aim at finding a subset of bases that allows one to distinguish among species
• Logic classification methods based on logic programming: aim at finding rules or formulas that can distinguish between individuals of different species
Our Work 2/2
• Feature Selection is used to select a limited number of positions where the values of the bases differ consistently from species to species;
• Logic classification methods use the selected positions to find formulas with high semantic value that can provide a deep understanding of the analyzed data.
Logic classification is computationally expensive, so the two steps are chained:
Many features → Feature selection → Few features → Logic classification → Compact logic formulas
Supervised Learning: the standard setting
• Given:
– A finite number of objects described by a finite number of measures X = (x1, x2, …, xn)
– An interesting characteristic of these objects Y (TARGET)
• Determine the relation between TARGET and measures: Y = F(X)
– It is only required that, for each set of measures X, it produces an output Y
• Very important:
– The form of the relation is decided up front;
– The parameters of the relation are determined by the data;
– The results of the estimation need to be validated.
• Classification: Y ∈ {0, 1, 2, …}
• Estimation: Y ∈ ℝ
Supervised Learning: training and testing
• If the data contain good information and the choice of the model is correct, the model fits the training data well
• A good fit on the training data may provide information and knowledge
• The model should also adapt well to test data
• Only a good fit on test data strengthens the knowledge extracted and validates the forecasting or classification function of the model
[Flow: Data → Training Set / Testing Set; Choose model → Estimate model on the training set → Performance on the testing set? → ok]
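This train/test validation can be sketched with a toy hold-out split and a 1-nearest-neighbour model on short sequences (all data made up for illustration):

```python
import random

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def predict(train, s):
    """1-NN by Hamming distance: label of the closest training sequence."""
    return min(train, key=lambda t: hamming(t[0], s))[1]

# toy labelled sequences (illustrative only): two well-separated classes
data = [("AAAA", "sp1"), ("AAAT", "sp1"), ("AATA", "sp1"),
        ("CCCC", "sp2"), ("CCCG", "sp2"), ("CCGC", "sp2")]
random.Random(0).shuffle(data)
train, test = data[:4], data[4:]

accuracy = sum(predict(train, s) == y for s, y in test) / len(test)
print(accuracy)  # → 1.0 on this well-separated toy data
```

Only the accuracy on the held-out `test` portion validates the classifier, exactly as in the flow above.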
The Special case of Logic Data 1/2
• Frequently, objects are (or can be) described only by logic attributes (True/False)
• The classification function is expressed in terms of:
– logic variables p, q, r, with True/False values
– conjunction (p ∧ q), disjunction (p ∨ q), negation (¬p)
Logic variables (properties of the object), with True/False/? encoded as 1/-1/0:

    S1 S2 S3 S4        S1  S2  S3  S4
A    T  T  F  ?    A    1   1  -1   0
A    T  F  F  T    A    1  -1  -1   1
A    T  F  F  F    A    1  -1  -1  -1
I    T  T  T  ?    I    1   1   1   0
I    F  ?  F  T    I   -1   0  -1   1

S1 ∧ ¬S3 ↔ (1, 0, -1, 0)
"If the object shows both the presence of property 1 and the absence of property 3, then it is in class A, else it is in class I"
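The encoding and the check of the formula S1 ∧ ¬S3 can be sketched directly (rows copied from the example table; the helper names are illustrative):

```python
ENC = {"T": 1, "F": -1, "?": 0}  # True / False / missing

rows = [("A", "TTF?"), ("A", "TFFT"), ("A", "TFFF"),
        ("I", "TTT?"), ("I", "F?FT")]

def classify(sites):
    """Evaluate S1 AND NOT S3, i.e. the pattern (1, 0, -1, 0):
    class A iff property 1 is present and property 3 is absent."""
    v = [ENC[c] for c in sites]
    return "A" if v[0] == 1 and v[2] == -1 else "I"

predictions = [classify(s) for _, s in rows]  # matches all five labels
```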
The Special case of Logic Data 2/2
[Table: ~49 binary records over attributes X1–X6, split into a red class and a blue class.]

IF [ -X1 & -X2 ] OR [ -X2 & -X3 ] OR [ -X1 & -X3 ]: true for red, false for blue
IF [ X2 & X6 ] OR [ X1 & X3 ] OR [ X2 & X3 ] OR [ X1 & X2 ]: true for blue, false for red
Mining Logic Data
• The Lsquare Logic Miner [4,5,6]
• Builds logic separations in Disjunctive Normal Form (DNF)
• Iteratively identifies the clauses of the DNF that separate the largest part of the objects in one class from all the objects of the other class
• Clause identification is based on the solution of a Minimum Cost Satisfiability Problem (MINSAT), which is computationally hard
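A minimal greedy stand-in for this clause search (the real system solves MINSAT instances exactly; this sketch only mimics the iterative "cover many positives, exclude all negatives" behaviour) over binary records:

```python
def learn_dnf(pos, neg, n):
    """Greedily build a DNF: each clause is a conjunction of (index, value)
    literals that rejects every negative record, chosen to keep as many
    still-uncovered positive records as possible."""
    clauses, remaining = [], list(pos)
    while remaining:
        lits, covered, bad = [], list(remaining), list(neg)
        while bad and covered:
            j, v = max(((j, v) for j in range(n) for v in (0, 1)),
                       key=lambda lv: (sum(r[lv[0]] != lv[1] for r in bad),
                                       sum(r[lv[0]] == lv[1] for r in covered)))
            if all(r[j] == v for r in bad):
                break  # no literal rejects any remaining negative
            lits.append((j, v))
            covered = [r for r in covered if r[j] == v]
            bad = [r for r in bad if r[j] == v]
        if bad or not covered:
            break  # remaining positives cannot be separated
        clauses.append(lits)
        remaining = [r for r in remaining
                     if not all(r[j] == v for j, v in lits)]
    return clauses

def satisfies(dnf, r):
    return any(all(r[j] == v for j, v in clause) for clause in dnf)

# toy records where X1 AND X2 separates the classes
pos = [(1, 1, 0), (1, 1, 1)]
neg = [(0, 1, 0), (1, 0, 1), (0, 0, 0)]
dnf = learn_dnf(pos, neg, 3)
```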
Each site S_i is encoded by two logic variables: p_i (S_i = True) and q_i (S_i = False). The MINSAT instance requires the candidate conjunction to evaluate to True on every row of class A and to False on every row of class I, while minimizing the cost of the selected literals.

    S1 S2 S3 S4
A    T  T  F  ?
A    T  F  F  T
A    T  F  F  F
I    T  T  T  ?
I    F  ?  F  T

Satisfying solution: S1 ∧ ¬S3
Mining Logic Data
• At each iteration two MINSAT problems are solved: the first selects the largest separable subset of the objects still to be covered, the second identifies the clause with the desired support
• Lsquare finds 4 DNF formulas:
– 1: A from B, min support
– 2: A from B, max support
– 3: B from A, min support
– 4: B from A, max support
– ALL: majority vote, with undecided cases
• The behavior of the formulas strongly interacts with the quality of the data…
Feature Selection
Most methods for FS are based on a greedy construction of the feature set, and do not take into proper account the interactions among the selected features (correlation, collinearity).
An interesting method adopts Integer Programming (Set Covering) to select the smallest set of features that makes it possible to differentiate logic records belonging to different classes.
The Set Covering Approach for FS (Rutgers – LAD approach):
• A binary variable x_j is associated with each feature j (x_j = 1 iff feature j is selected)
• A linear constraint is associated with each pair of objects belonging to different classes
• The set covering model requires that each pair of objects belonging to different classes is differentiated by at least one of the selected features

min Σ_j x_j
s.t. Σ_j a_(ik)j x_j ≥ 1   for each pair i ∈ A, k ∈ I
     x_j ∈ {0, 1}

where a_(ik)j = 1 iff objects i and k belong to different classes and differ on feature j.
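A minimal greedy stand-in for this ILP on toy binary records (the authors solve the exact model; greedy selection is the classic set-covering approximation, shown here only to make the constraint structure concrete):

```python
from itertools import product

def greedy_feature_cover(class_a, class_b, n):
    """Pick features until every cross-class pair (a, b) differs on at
    least one chosen feature (greedy approximation of the covering ILP)."""
    pairs = [p for p in product(class_a, class_b) if p[0] != p[1]]
    uncovered = set(range(len(pairs)))
    chosen = []
    while uncovered:
        # feature differentiating the most still-uncovered pairs
        j = max(range(n), key=lambda j: sum(pairs[i][0][j] != pairs[i][1][j]
                                            for i in uncovered))
        hit = {i for i in uncovered if pairs[i][0][j] != pairs[i][1][j]}
        if not hit:
            break  # identical records in both classes: not separable
        chosen.append(j)
        uncovered -= hit
    return chosen

# toy data: feature 0 alone differentiates every cross-class pair
A = [(0, 0, 1), (0, 1, 1)]
B = [(1, 0, 0), (1, 1, 0)]
```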
Feature Selection: a simplified version
• To overcome the intractable dimensions of the quadratic model, we propose a simplified version of the set covering model (Linear SC), where the number of constraints is equal to the number of samples
• Let f_i be feature i, PA(i) the proportion of elements with f_i = 1 in class A, and PB(i) the proportion of elements with f_i = 1 in class B; x_i = 1 iff feature i is chosen in the solution
• If PA(i) > PB(i), the coefficient d_ij is:
– 1 if row j is in class A and f_i = 1
– 0 if row j is in class A and f_i = 0
– 1 if row j is in class B and f_i = 0
– 0 if row j is in class B and f_i = 1
(with the roles of A and B exchanged when PA(i) < PB(i))
• Coverage is given, the number of features is minimized:

min Σ_i x_i
s.t. Σ_i d_ij x_i ≥ α   for each row j
     x_i ∈ {0, 1}

• Coverage is maximized, given the bound β on the number of features:

max Σ_j Σ_i d_ij x_i
s.t. Σ_i x_i ≤ β
     x_i ∈ {0, 1}

• The number of rows is linear in the number of examples; redundancy can be controlled; heuristics may still be needed for a large number of features…
Application to Barcode Data
• Data set from the 2006 conference
• 1623 samples belonging to 150 different species
• Each sample is described by 690 nucleotides (columns)
• We search for a compact rule for each one of the 150 classes
• For each species k, we solve a 2-class learning problem:
– class A: all samples in class k
– class B: samples of all classes different from k
• We use Linear Set Covering for feature selection and logic mining to determine the formulas on a training subset (80-90%) of the available data, and then test their classification capabilities on the remaining data
• Training and testing samples are drawn at random, maintaining the same proportion in each class
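The stratified draw can be sketched as follows (function name and the toy fractions are illustrative; the 80-90% split of the slide corresponds to `test_frac` between 0.1 and 0.2):

```python
import random

def stratified_split(samples, labels, test_frac=0.2, seed=0):
    """Split into train/test keeping (approximately) the same class
    proportions in both parts."""
    rng = random.Random(seed)
    by_class = {}
    for s, y in zip(samples, labels):
        by_class.setdefault(y, []).append(s)
    train, test = [], []
    for y, rows in by_class.items():
        rng.shuffle(rows)
        k = max(1, round(len(rows) * test_frac))  # at least one test sample
        test += [(r, y) for r in rows[:k]]
        train += [(r, y) for r in rows[k:]]
    return train, test

samples, labels = list(range(15)), ["a"] * 10 + ["b"] * 5
train, test = stratified_split(samples, labels, test_frac=0.2)
```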
Application to Barcode Data
DATA SET → TRAINING SET + TEST SET
• FEATURE SELECTION: the Integer Programming model associated with the Linear Set Covering model is solved to optimality with the commercial solver ILOG CPLEX, using 10, 20 and 30 as values for β (the bound on the number of selected columns)
• FORMULA EXTRACTION: LSQUARE is used on the training set to separate each species from the other 149, and a compact formula explaining each species is obtained
• TEST SET: the formulas are used to predict the species of each element in the test set
Application to Barcode Data
• LSC construction:
– PAjk = proportion of samples in class k with nucleotide A in column j
– PCjk = proportion of samples in class k with nucleotide C in column j
– PGjk = proportion of samples in class k with nucleotide G in column j
– PTjk = proportion of samples in class k with nucleotide T in column j
– PNAjk = proportion of samples in classes ≠ k with nucleotide A in column j
– PNCjk = proportion of samples in classes ≠ k with nucleotide C in column j
– PNGjk = proportion of samples in classes ≠ k with nucleotide G in column j
– PNTjk = proportion of samples in classes ≠ k with nucleotide T in column j
• aij = 1 iff:
– sample i is in class k, its nucleotide in column j is A, and PAjk > 2·PNAjk
– sample i is in class k, its nucleotide in column j is C, and PCjk > 2·PNCjk
– sample i is in class k, its nucleotide in column j is G, and PGjk > 2·PNGjk
– sample i is in class k, its nucleotide in column j is T, and PTjk > 2·PNTjk
max Σ_i Σ_j a_ij x_j
s.t. Σ_j x_j ≤ β
     x_j ∈ {0, 1}

• Select the columns with the largest coverage
• Use these columns to formulate a separation problem for each class with Lsquare (1 vs all)
• Obtain a logic formula for each class
An example:

         1 2 3 4
CLASS 1  A C G T
CLASS 1  A A A T
CLASS 2  A A T A
CLASS 3  C G T C

CLASS 1   1   2   3   4         1    2    3    4
PA       1.0 0.5 0.5 0.0   PNA 0.50 0.50 0.00 0.50
PC       0.0 0.5 0.0 0.0   PNC 0.50 0.00 0.00 0.50
PG       0.0 0.0 0.5 0.0   PNG 0.00 0.50 0.00 0.00
PT       0.0 0.0 0.0 1.0   PNT 0.00 0.00 1.00 0.00

CLASS 2   1   2   3   4         1    2    3    4
PA       1.0 1.0 0.0 1.0   PNA 0.67 0.33 0.33 0.00
PC       0.0 0.0 0.0 0.0   PNC 0.33 0.33 0.00 0.33
PG       0.0 0.0 0.0 0.0   PNG 0.00 0.33 0.33 0.00
PT       0.0 0.0 1.0 0.0   PNT 0.00 0.00 0.33 0.67

CLASS 3   1   2   3   4         1    2    3    4
PA       0.0 0.0 0.0 0.0   PNA 1.00 0.67 0.33 0.33
PC       1.0 0.0 0.0 1.0   PNC 0.00 0.33 0.00 0.00
PG       0.0 1.0 0.0 0.0   PNG 0.00 0.00 0.33 0.00
PT       0.0 0.0 1.0 0.0   PNT 0.00 0.00 0.33 0.67

Matrix a (samples × columns):
    1 2 3 4
1   1 0 1 1
2   1 1 1 1
3   0 1 1 1
4   1 1 1 1

max (x1 + x3 + x4) + (x1 + x2 + x3 + x4) + (x2 + x3 + x4) + (x1 + x2 + x3 + x4)
s.t. x1 + x2 + x3 + x4 ≤ β
     x_j ∈ {0, 1}

With β = 1 (one column selected): x3 = 1, x1 = x2 = x4 = 0: non separable
                                  x4 = 1, x1 = x2 = x3 = 0: separable
With β = 2: x3 = x4 = 1, x1 = x2 = 0: separable

IF (X4 = T) THEN CLASS 1
IF (X4 = A) THEN CLASS 2
IF (X4 = C) THEN CLASS 3
Results

β   α   test %   training error   testing error
10  2   10       11.14%           11.38%
10  2   10        5.62%            8.98%
10  3   10        5.99%            7.19%
10  3   20        9.21%           10.17%
10  3   20        7.11%            9.75%
10  2   20        7.81%            8.05%
average (β = 10)   7.81%            9.25%
20  6   10        2.06%            2.40%
20  6   10        0.84%            1.20%
20  6   10        0.28%            0.60%
20  6   20        1.90%            2.12%
20  6   20        0.30%            1.27%
20  6   20        0.30%            1.27%
average (β = 20)   0.95%            1.48%
30  9   10        0.37%            0.60%
30  9   10        0.28%            0.60%
30  9   10        0.37%            0.60%
30  9   20        0.30%            1.69%
30  9   20        0.30%            0.85%
average (β = 30)   0.33%            0.87%
overall            3.03%            3.87%

For each row we solve 1 set covering problem and 150 logic classification problems; β ∈ {10, 20, 30}, test % ∈ {10, 20}, 3 random repetitions for each setting.
Results
Selected site → occurrences in formulas:
580: 54   490: 49   346: 41   469: 41   544: 41
637: 41   331: 39   445: 34   154: 30    85: 29
307: 29   379: 29    31: 25   211: 25    28: 22
 40: 22    10: 21    16: 15    34: 15     4: 14

[Chart: average error on test for different values of β (10, 20, 30); range 0.00%-10.00%.]
[Chart: mode of the optimal α for different values of β (10, 20, 30); range 1-10.]

Site 580 appears 54 times in the 150 formulas that discriminate each class from the rest (approx. 1/3).
Results
• SPECIES DIM CLAUSE(S)
• 0  1  * 274 A  499 T  580 C
• 1  1  * 19 T  172 T
• 2  1  * 340 G  343 A  445 C
• 3  1  * 445 T  499 T  580 G
• 4  1  * 172 C  445 T  493 G  499 C  580 A
• 5  1  * 58 G  430 T
• 6  1  * 268 C  289 A  334 C  430 T
• 7  1  * 136 A  277 C  445 T
• 8  2  * 58 A  121 T  172 C  * 277 T  499 G
• 9  1  * 163 T  274 A  334 G
• 10  1  * 19 T  331 C  652 T
• 11  1  * 4 C  10 T  274 C  289 A  445 A
• 12  1  * 340 G  445 G
• 13  1  * 277 A  340 A  430 T  445 C
• 14  1  * 121 C  331 T
• 15  1  * 58 A  283 C  331 T
Results
IF BASE IN POSITION 19 IS T AND BASE IN POSITION 172 IS T THEN SPECIES IS … 1
IF BASE IN POSITION 340 IS G AND BASE IN POSITION 343 IS A AND BASE IN POSITION 445 IS C THEN SPECIES IS … 2
IF (BASE IN POSITION 58 IS A AND BASE IN POSITION 121 IS T AND BASE IN POSITION 172 IS C) OR (BASE IN POSITION 277 IS T AND BASE IN POSITION 499 IS G) THEN SPECIES IS … 8
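Applying such rules to a new sequence is direct; a sketch with three of the formulas above encoded by hand (the `RULES` dict is an illustrative encoding, not the system's own format):

```python
# species → DNF formula; a clause is a list of (1-based position, base)
RULES = {
    1: [[(19, "T"), (172, "T")]],
    2: [[(340, "G"), (343, "A"), (445, "C")]],
    8: [[(58, "A"), (121, "T"), (172, "C")], [(277, "T"), (499, "G")]],
}

def classify(seq, rules=RULES):
    """Return the species whose formula the sequence satisfies, or None."""
    for species, clauses in rules.items():
        if any(all(seq[pos - 1] == base for pos, base in clause)
               for clause in clauses):
            return species
    return None
```

Because the clauses touch only a handful of the 690 positions, identifying a new specimen requires reading just those few sites.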
Related work
• Haplotype Inference by Parsimony: a new, very efficient heuristic
• Tag SNP and SNP reconstruction problem
• Phylogenetic tree in polyploid organisms
Conclusions and Future work
• The logic technique is very powerful for identifying small non-contiguous subsequences of the barcode
• We are testing the technique on a very large data set from Lepidoptera
• We will compare our technique with the NJ method [2]
• CBOL and the DAWG have asked us to implement our technique as a web service
• We are designing a software platform that implements algorithms for all the above problems
References
• [1] M. Hajibabaei, G. A. C. Singer, E. L. Clare, P. D. N. Hebert, Design and applicability of DNA arrays and DNA barcodes in biodiversity monitoring, BMC Biology, 2007
• [2] N. Saitou, M. Nei, The neighbor-joining method: a new method for reconstructing phylogenetic trees, Mol Biol Evol, 1987
• [3] P. Bertolazzi, G. Felici, P. Festa, G. Lancia, Logic Classification and Feature Selection for Biomedical Data, Computers & Mathematics with Applications, on-line version (2007)
• [4] G. Felici, K. Truemper, A MINSAT Approach for Learning in Logic Domains, INFORMS Journal on Computing, 2002
• [5] K. Truemper, Design of Logic-Based Intelligent Systems, Wiley-Interscience, 2004
• [6] G. Felici, K. Truemper, The Lsquare System for Mining Logic Data, Encyclopedia of Data Warehousing and Mining, 2005
• [7] P. Bertolazzi, G. Felici, Species Classification with Optimized Logic Formulas, poster, EMBO Conference, Rome, May 2007
• [8] P. Bertolazzi, G. Felici, Species Classification with Optimized Logic Formulas, invited talk, Second BOL Conference, Taipei, Sept. 2007