Transcript of AAAI05 Tutorial on Bioinformatics & Machine Learning
Page 1: AAAI05 Tutorial on Bioinformatics & Machine Learning Jinyan Li & Limsoon Wong Institute for Infocomm Research 21 Heng Mui Keng Terrace Singapore Copyright.

AAAI05 Tutorial on Bioinformatics & Machine Learning

Jinyan Li & Limsoon WongInstitute for Infocomm Research

21 Heng Mui Keng TerraceSingapore

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong

Page 2: AAAI05 Tutorial on Bioinformatics & Machine Learning Jinyan Li & Limsoon Wong Institute for Infocomm Research 21 Heng Mui Keng Terrace Singapore Copyright.

Tutorial Plan

1. Intro to Biology & Bioinformatics
2. TIS Prediction: feature generation & feature selection
3. Tx Optimization for Childhood ALL: emerging patterns & PCL
4. Analysis of Proteomics Data: decision tree ensembles & CS4
5. Prognosis based on Gene Expression: extreme sample selection & ERCOF
6. Subgraphs in Protein Interactions: closed patterns
7. Mining Errors from Bio Databases: association rules

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong

Page 3: AAAI05 Tutorial on Bioinformatics & Machine Learning Jinyan Li & Limsoon Wong Institute for Infocomm Research 21 Heng Mui Keng Terrace Singapore Copyright.

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong

Coverage

• Data: DNA sequences, gene expression profiles, proteomics mass spectra, protein sequences, protein structures, & biomedical text/annotations
• Feature selection: χ2, t-statistics, signal-to-noise, entropy, CFS
• Machine learning and data mining: decision trees, SVMs, emerging patterns, closed patterns, graph theories

Page 4: AAAI05 Tutorial on Bioinformatics & Machine Learning Jinyan Li & Limsoon Wong Institute for Infocomm Research 21 Heng Mui Keng Terrace Singapore Copyright.

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong

Body and Cell

• Our body consists of a number of organs
• Each organ is composed of a number of tissues
• Each tissue is composed of cells of the same type
• Cells perform two types of function:
  – Chemical reactions needed to maintain our life
  – Passing info for maintaining life to the next generation
• In particular:
  – Proteins perform chemical reactions
  – DNA stores & passes info
  – RNA is the intermediate between DNA & proteins

Page 5: AAAI05 Tutorial on Bioinformatics & Machine Learning Jinyan Li & Limsoon Wong Institute for Infocomm Research 21 Heng Mui Keng Terrace Singapore Copyright.

DNA

• Stores instructions needed by the cell to perform its daily life functions
• Consists of two strands interwoven to form a double helix
• Each strand is a chain of small molecules called nucleotides

Francis Crick shows James Watson the model of DNA in their room number 103 of the Austin Wing at the Cavendish Laboratories, Cambridge

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong

Page 6: AAAI05 Tutorial on Bioinformatics & Machine Learning Jinyan Li & Limsoon Wong Institute for Infocomm Research 21 Heng Mui Keng Terrace Singapore Copyright.

Classification of Nucleotides

• 5 different nucleotides: adenine (A), cytosine (C), guanine (G), thymine (T), & uracil (U)
• A, G are purines; they have a 2-ring structure
• C, T, U are pyrimidines; they have a 1-ring structure
• DNA only uses A, C, G, & T

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong

Page 7: AAAI05 Tutorial on Bioinformatics & Machine Learning Jinyan Li & Limsoon Wong Institute for Infocomm Research 21 Heng Mui Keng Terrace Singapore Copyright.

[Figure: A:T and G:C base pairs; each base pair spans about 10Å]

Watson-Crick Rule

• DNA is double stranded in a cell
• One strand is the reverse complement of the other
• Complementary bases:
  – A with T (two hydrogen bonds)
  – C with G (three hydrogen bonds)
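To make the reverse-complement relationship concrete, here is a minimal Python sketch (not from the slides; the example sequence is made up):

```python
# Minimal sketch: compute the reverse complement of a DNA strand.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def reverse_complement(strand: str) -> str:
    # Complement each base, then reverse the whole strand.
    return "".join(COMPLEMENT[base] for base in reversed(strand))

print(reverse_complement("ATGGCTGAACACTGA"))  # prints TCAGTGTTCAGCCAT
```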

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong

Page 8: AAAI05 Tutorial on Bioinformatics & Machine Learning Jinyan Li & Limsoon Wong Institute for Infocomm Research 21 Heng Mui Keng Terrace Singapore Copyright.

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong

Chromosome

• DNA is usually tightly wound around histone proteins to form a chromosome
• The total info stored in all chromosomes constitutes a genome
• In most multi-cell organisms, every cell contains the same complete set of chromosomes
  – May have some small differences due to mutation
• The human genome has about 3 billion bases, organized into 23 pairs of chromosomes

Page 9: AAAI05 Tutorial on Bioinformatics & Machine Learning Jinyan Li & Limsoon Wong Institute for Infocomm Research 21 Heng Mui Keng Terrace Singapore Copyright.

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong

Gene

• A gene is a sequence of DNA that encodes a protein or an RNA molecule
• About 30,000 – 35,000 (protein-coding) genes in the human genome
• For genes that encode proteins:
  – In a prokaryotic genome, one gene corresponds to one protein
  – In a eukaryotic genome, one gene may correspond to more than one protein because of the process of "alternative splicing"

Page 10: AAAI05 Tutorial on Bioinformatics & Machine Learning Jinyan Li & Limsoon Wong Institute for Infocomm Research 21 Heng Mui Keng Terrace Singapore Copyright.

Central Dogma

• Gene expression consists of two steps:
  – Transcription: DNA → mRNA
  – Translation: mRNA → Protein

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong

Page 11: AAAI05 Tutorial on Bioinformatics & Machine Learning Jinyan Li & Limsoon Wong Institute for Infocomm Research 21 Heng Mui Keng Terrace Singapore Copyright.

Genetic Code
• Start codon: ATG (codes for M)
• Stop codons: TAA, TAG, TGA
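As a quick illustration of how the start and stop codons delimit a coding region, here is a small Python sketch (not from the slides; the sequence is invented) that scans a reading frame for ATG and the first in-frame stop codon:

```python
# Minimal sketch: find an open reading frame delimited by ATG and an
# in-frame stop codon (TAA, TAG, or TGA). The sequence is invented.
STOP_CODONS = {"TAA", "TAG", "TGA"}

def first_orf(seq: str):
    start = seq.find("ATG")
    if start < 0:
        return None
    # Walk codon by codon from the start codon until an in-frame stop.
    for i in range(start, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS and i > start:
            return seq[start:i + 3]
    return None

print(first_orf("CCATGGCTAAACGGTGAAT"))  # ATGGCTAAACGGTGA
```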

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong

Page 12: AAAI05 Tutorial on Bioinformatics & Machine Learning Jinyan Li & Limsoon Wong Institute for Infocomm Research 21 Heng Mui Keng Terrace Singapore Copyright.

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong

Protein

• A sequence composed from an alphabet of 20 amino acids
  – Length is usually 20 to 5000 amino acids
  – Average around 350 amino acids
• Folds into a 3D shape, forming the building blocks & performing most of the chemical reactions within a cell

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong

Page 13: AAAI05 Tutorial on Bioinformatics & Machine Learning Jinyan Li & Limsoon Wong Institute for Infocomm Research 21 Heng Mui Keng Terrace Singapore Copyright.

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong

Classification of Amino Acids

• Amino acids can be classified into 4 types
• Positively charged (basic)
  – Arginine (Arg, R)
  – Histidine (His, H)
  – Lysine (Lys, K)
• Negatively charged (acidic)
  – Aspartic acid (Asp, D)
  – Glutamic acid (Glu, E)

Page 14: AAAI05 Tutorial on Bioinformatics & Machine Learning Jinyan Li & Limsoon Wong Institute for Infocomm Research 21 Heng Mui Keng Terrace Singapore Copyright.

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong

Classification of Amino Acids

• Polar (overall uncharged, but uneven charge distribution; can form hydrogen bonds with water; called hydrophilic)
  – Asparagine (Asn, N)
  – Cysteine (Cys, C)
  – Glutamine (Gln, Q)
  – Glycine (Gly, G)
  – Serine (Ser, S)
  – Threonine (Thr, T)
  – Tyrosine (Tyr, Y)
• Nonpolar (overall uncharged and uniform charge distribution; can't form hydrogen bonds with water; called hydrophobic)
  – Alanine (Ala, A)
  – Isoleucine (Ile, I)
  – Leucine (Leu, L)
  – Methionine (Met, M)
  – Phenylalanine (Phe, F)
  – Proline (Pro, P)
  – Tryptophan (Trp, W)
  – Valine (Val, V)

Page 15: AAAI05 Tutorial on Bioinformatics & Machine Learning Jinyan Li & Limsoon Wong Institute for Infocomm Research 21 Heng Mui Keng Terrace Singapore Copyright.

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong

Bioinformatics Applications

• Bio Data Searching
• Gene Finding
• Cis-regulatory DNA
• Gene/Protein Networks
• Protein/RNA Structure Prediction
• Evolutionary Tree Reconstruction
• Inferring Protein Function
• Disease Diagnosis, Prognosis, & Treatment Optimization, ...

Page 16: AAAI05 Tutorial on Bioinformatics & Machine Learning Jinyan Li & Limsoon Wong Institute for Infocomm Research 21 Heng Mui Keng Terrace Singapore Copyright.

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong

Part 2: Translation Initiation Site Recognition

Page 17: AAAI05 Tutorial on Bioinformatics & Machine Learning Jinyan Li & Limsoon Wong Institute for Infocomm Research 21 Heng Mui Keng Terrace Singapore Copyright.

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong

A Sample cDNA

299 HSU27655.1 CAT U27655 Homo sapiens
CGTGTGTGCAGCAGCCTGCAGCTGCCCCAAGCCATGGCTGAACACTGACTCCCAGCTGTG  80
CCCAGGGCTTCAAAGACTTCTCAGCTTCGAGCATGGCTTTTGGCTGTCAGGGCAGCTGTA 160
GGAGGCAGATGAGAAGAGGGAGATGGCCTTGGAGGAAGGGAAGGGGCCTGGTGCCGAGGA 240
CCTCTCCTGGCCAGGAGCTTCCTCCAGGACAAGACCTTCCACCCAACAAGGACTCCCCT
............................................................  80
................................iEEEEEEEEEEEEEEEEEEEEEEEEEEE 160
EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE 240
EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE

• What makes the second ATG the TIS?

Page 18: AAAI05 Tutorial on Bioinformatics & Machine Learning Jinyan Li & Limsoon Wong Institute for Infocomm Research 21 Heng Mui Keng Terrace Singapore Copyright.

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong

Approach

• Training data gathering
• Signal generation
  – k-grams, distance, domain know-how, ...
• Signal selection
  – Entropy, χ2, CFS, t-test, domain know-how, ...
• Signal integration
  – SVM, ANN, PCL, CART, C4.5, kNN, ...

Page 19: AAAI05 Tutorial on Bioinformatics & Machine Learning Jinyan Li & Limsoon Wong Institute for Infocomm Research 21 Heng Mui Keng Terrace Singapore Copyright.

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong

Training & Testing Data

• Vertebrate dataset of Pedersen & Nielsen [ISMB'97]
• 3312 sequences
• 13503 ATG sites
• 3312 (24.5%) are TIS
• 10191 (75.5%) are non-TIS
• Used for 3-fold cross-validation experiments

Page 20: AAAI05 Tutorial on Bioinformatics & Machine Learning Jinyan Li & Limsoon Wong Institute for Infocomm Research 21 Heng Mui Keng Terrace Singapore Copyright.

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong

Signal Generation

• k-grams (i.e., k consecutive letters)
  – k = 1, 2, 3, 4, 5, ...
  – Window size vs. fixed position
  – Upstream, downstream vs. anywhere in window
  – In-frame vs. any frame
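A minimal sketch of k-gram feature generation around a candidate ATG (my own illustration, not the authors' code; the window size and the upstream/downstream and in-frame conventions are assumptions):

```python
from collections import Counter

# Minimal sketch of k-gram feature generation around a candidate ATG.
# Window size and the in-frame convention (counted from the window start
# upstream, from the codon after ATG downstream) are simplifications.
def kgram_features(seq: str, atg_pos: int, k: int = 3, window: int = 99):
    up = seq[max(0, atg_pos - window):atg_pos]        # upstream region
    down = seq[atg_pos + 3:atg_pos + 3 + window]      # downstream region
    feats = Counter()
    for region, name in ((up, "up"), (down, "down")):
        for i in range(len(region) - k + 1):
            gram = region[i:i + k]
            feats[f"{name}-{gram}"] += 1              # any-frame count
            if i % 3 == 0:
                feats[f"{name}-inframe-{gram}"] += 1  # in-frame count
    return feats

seq = "CCCAGGGCTTCAAAGACTTCTCAGCTTCGAGCATGGCTTTTGGCTGTCAGGGCAGCTGTA"
print(kgram_features(seq, seq.find("ATG"), k=3).most_common(5))
```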

Page 21: AAAI05 Tutorial on Bioinformatics & Machine Learning Jinyan Li & Limsoon Wong Institute for Infocomm Research 21 Heng Mui Keng Terrace Singapore Copyright.

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong

Too Many Signals

• For each k, there are 4^k * 3 * 2 k-grams
• If we use k = 1, 2, 3, 4, 5, we have 24 + 96 + 384 + 1536 + 6144 = 8184 features!
• This is too many for most machine learning algorithms
• Need to do signal selection: t-statistics, χ2, CFS, signal-to-noise, entropy, gini index, info gain, info gain ratio, ...

Page 22: AAAI05 Tutorial on Bioinformatics & Machine Learning Jinyan Li & Limsoon Wong Institute for Infocomm Research 21 Heng Mui Keng Terrace Singapore Copyright.

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong

Signal Selection Basic Idea

• Choose a signal w/ low intra-class distance
• Choose a signal w/ high inter-class distance

Page 23: AAAI05 Tutorial on Bioinformatics & Machine Learning Jinyan Li & Limsoon Wong Institute for Infocomm Research 21 Heng Mui Keng Terrace Singapore Copyright.

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong

Signal Selection by T-Statistics
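The formula on this slide is not preserved in the transcript; the standard two-sample t-statistic used to rank a feature f between classes 1 and 2 (my reconstruction of common usage, not transcribed from the slide) is:

```latex
% Two-sample t-statistic for ranking feature f between classes 1 and 2
% (standard form; not transcribed from the slide).
t(f) \;=\; \frac{\mu_{1}(f) - \mu_{2}(f)}
             {\sqrt{\dfrac{\sigma_{1}^{2}(f)}{n_{1}} + \dfrac{\sigma_{2}^{2}(f)}{n_{2}}}}
```

where μ, σ², and n are the per-class mean, variance, and sample count; features with the largest |t(f)| are the most discriminative signals.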

Page 24: AAAI05 Tutorial on Bioinformatics & Machine Learning Jinyan Li & Limsoon Wong Institute for Infocomm Research 21 Heng Mui Keng Terrace Singapore Copyright.

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong

Signal Selection by χ2
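The slide's formula is likewise not in the transcript; the usual χ2 statistic over the feature-by-class contingency table (again a reconstruction of the standard definition) is:

```latex
% Chi-square statistic over the feature-by-class contingency table
% (standard definition; not transcribed from the slide).
\chi^{2}(f) \;=\; \sum_{i}\sum_{j} \frac{\left(O_{ij} - E_{ij}\right)^{2}}{E_{ij}},
\qquad
E_{ij} \;=\; \frac{(\text{row}_i\ \text{total})\,(\text{column}_j\ \text{total})}{\text{grand total}}
```

where O_ij is the observed count of samples in class j with the i-th value of feature f, and E_ij is the count expected under independence.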

Page 25: AAAI05 Tutorial on Bioinformatics & Machine Learning Jinyan Li & Limsoon Wong Institute for Infocomm Research 21 Heng Mui Keng Terrace Singapore Copyright.

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong

Sample K-Grams Selected by CFS

• Position –3 (Kozak consensus)
• In-frame upstream ATG (leaky scanning)
• In-frame downstream
  – TAA, TAG, TGA (stop codons)
  – CTG, GAC, GAG, and GCC (codon bias?)

Page 26: AAAI05 Tutorial on Bioinformatics & Machine Learning Jinyan Li & Limsoon Wong Institute for Infocomm Research 21 Heng Mui Keng Terrace Singapore Copyright.

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong

Signal Integration

• kNN
  – Given a test sample, find the k training samples most similar to it; let the majority class win
• SVM
  – Given a group of training samples from two classes, determine a separating plane that maximises the margin
• Naïve Bayes, ANN, C4.5, CS4, ...
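As an illustration of this pipeline (signal selection followed by a classifier), here is a small scikit-learn sketch; the data and parameter choices are placeholders, not the tutorial's actual settings:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

# X: one row per candidate ATG, columns are k-gram counts; y: 1 = TIS, 0 = non-TIS.
# Random data stands in for the real feature matrix.
rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(300, 200))
y = rng.integers(0, 2, size=300)

# Select the top signals by chi-square, then integrate them with a linear SVM.
model = make_pipeline(SelectKBest(chi2, k=50), SVC(kernel="linear"))
print(cross_val_score(model, X, y, cv=3).mean())   # 3-fold cross-validation
```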

Page 27: AAAI05 Tutorial on Bioinformatics & Machine Learning Jinyan Li & Limsoon Wong Institute for Infocomm Research 21 Heng Mui Keng Terrace Singapore Copyright.

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong

Results (3-fold cross-validation)

Classifier        Sensitivity    Specificity    Precision     Accuracy
                  TP/(TP+FN)     TN/(TN+FP)     TP/(TP+FP)
Naïve Bayes       84.3%          86.1%          66.3%         85.7%
SVM               73.9%          93.2%          77.9%         88.5%
Neural Network    77.6%          93.2%          78.8%         89.4%
Decision Tree     74.0%          94.4%          81.1%         89.4%

Page 28: AAAI05 Tutorial on Bioinformatics & Machine Learning Jinyan Li & Limsoon Wong Institute for Infocomm Research 21 Heng Mui Keng Terrace Singapore Copyright.

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong

Technique Comparisons

• Pedersen & Nielsen [ISMB'97]
  – Neural network
  – No explicit features
• Zien [Bioinformatics'00]
  – SVM + kernel engineering
  – No explicit features
• Hatzigeorgiou [Bioinformatics'02]
  – Multiple neural networks
  – Scanning rule
  – No explicit features
• This approach
  – Explicit feature generation
  – Explicit feature selection
  – Use any machine learning method w/o any form of complicated tuning
  – Scanning rule is optional

Page 29: AAAI05 Tutorial on Bioinformatics & Machine Learning Jinyan Li & Limsoon Wong Institute for Infocomm Research 21 Heng Mui Keng Terrace Singapore Copyright.

[Figure: the genetic code mapping mRNA codons to amino acids (F, L, I, M, V, S, P, T, A, Y, H, Q, N, K, D, E, C, W, R, G, stop)]

• How about using k-grams from the translation (mRNA → protein)?

Copyright © 2003, 2004, 2005 by Jinyan Li and Limsoon Wong

Page 30: AAAI05 Tutorial on Bioinformatics & Machine Learning Jinyan Li & Limsoon Wong Institute for Infocomm Research 21 Heng Mui Keng Terrace Singapore Copyright.

Amino-Acid Features

Copyright © 2003, 2004, 2005 by Jinyan Li and Limsoon Wong

Page 31: AAAI05 Tutorial on Bioinformatics & Machine Learning Jinyan Li & Limsoon Wong Institute for Infocomm Research 21 Heng Mui Keng Terrace Singapore Copyright.

Amino-Acid Features

Copyright © 2003, 2004, 2005 by Jinyan Li and Limsoon Wong

Page 32: AAAI05 Tutorial on Bioinformatics & Machine Learning Jinyan Li & Limsoon Wong Institute for Infocomm Research 21 Heng Mui Keng Terrace Singapore Copyright.

Amino Acid K-Grams Discovered (by entropy)

Copyright © 2003, 2004, 2005 by Jinyan Li and Limsoon Wong

Page 33: AAAI05 Tutorial on Bioinformatics & Machine Learning Jinyan Li & Limsoon Wong Institute for Infocomm Research 21 Heng Mui Keng Terrace Singapore Copyright.

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong

Validation Results (on Chr X and Chr 21)

• Comparison of ATGpr vs. our method
• Using the top 100 features selected by entropy, and training SVM on Pedersen & Nielsen's dataset

Page 34: AAAI05 Tutorial on Bioinformatics & Machine Learning Jinyan Li & Limsoon Wong Institute for Infocomm Research 21 Heng Mui Keng Terrace Singapore Copyright.

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong

Part 3: Treatment Optimization of Childhood Leukemia

Image credit: FEER

Page 35: AAAI05 Tutorial on Bioinformatics & Machine Learning Jinyan Li & Limsoon Wong Institute for Infocomm Research 21 Heng Mui Keng Terrace Singapore Copyright.

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong

Childhood ALL

• Major subtypes are: T-ALL, E2A-PBX, TEL-AML, MLL genome rearrangements, BCR-ABL, Hyperdiploid>50
• Different subtypes respond differently to the same Tx
  – Over-intensive Tx: development of secondary cancers; reduction of IQ
  – Under-intensive Tx: relapse
• The subtypes look similar
• Conventional diagnosis
  – Immunophenotyping
  – Cytogenetics
  – Molecular diagnostics
  – These are unavailable in most ASEAN countries

Page 36: AAAI05 Tutorial on Bioinformatics & Machine Learning Jinyan Li & Limsoon Wong Institute for Infocomm Research 21 Heng Mui Keng Terrace Singapore Copyright.

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong

Image credit: Affymetrix

Single-Test Platform of Microarray & Machine Learning

Page 37: AAAI05 Tutorial on Bioinformatics & Machine Learning Jinyan Li & Limsoon Wong Institute for Infocomm Research 21 Heng Mui Keng Terrace Singapore Copyright.

Overall Strategy

Diagnosis of subtype → subtype-dependent prognosis → risk-stratified treatment intensity

• For each subtype, select genes to develop a classification model for diagnosing that subtype
• For each subtype, select genes to develop a prediction model for prognosis of that subtype

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong

Page 38: AAAI05 Tutorial on Bioinformatics & Machine Learning Jinyan Li & Limsoon Wong Institute for Infocomm Research 21 Heng Mui Keng Terrace Singapore Copyright.

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong

Subtype Diagnosis by PCL

• Gene expression data collection
• Gene selection by χ2
• PCL classifier training by emerging patterns
• Classifier tuning (optional for some machine learning methods)
• Apply PCL for diagnosis of future cases

Page 39: AAAI05 Tutorial on Bioinformatics & Machine Learning Jinyan Li & Limsoon Wong Institute for Infocomm Research 21 Heng Mui Keng Terrace Singapore Copyright.

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong

Emerging Patterns

• An emerging pattern is a set of conditions
  – usually involving several features
  – that most members of one class satisfy
  – but none or few of the other class satisfy
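A tiny sketch of what this means in practice (toy data and a made-up expression threshold, not the tutorial's mining algorithm): count how often a candidate condition set is satisfied in each class and compare the two supports.

```python
# Toy illustration of an emerging pattern: a condition set with high support
# in one class and (near-)zero support in the other. Values and the >50
# threshold are invented; the gene id echoes the slide on T-ALL vs OTHERS.
samples = [
    ({"gene_38319_at": 90, "gene_X": 10}, "T-ALL"),
    ({"gene_38319_at": 85, "gene_X": 40}, "T-ALL"),
    ({"gene_38319_at": 12, "gene_X": 35}, "OTHERS"),
    ({"gene_38319_at": 8,  "gene_X": 50}, "OTHERS"),
]

def pattern(s): return s["gene_38319_at"] > 50     # candidate condition set

def support(label):
    group = [s for s, y in samples if y == label]
    return sum(pattern(s) for s in group) / len(group)

print("support in T-ALL :", support("T-ALL"))      # 1.0
print("support in OTHERS:", support("OTHERS"))     # 0.0 -> an emerging pattern
```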

Page 40: AAAI05 Tutorial on Bioinformatics & Machine Learning Jinyan Li & Limsoon Wong Institute for Infocomm Research 21 Heng Mui Keng Terrace Singapore Copyright.

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong

PCL: Prediction by Collective Likelihood
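The slide's details (the PCL scoring formula) are not in the transcript. One formulation of prediction by collective likelihood, written from the general PCL literature rather than from this slide, scores a test sample T against class C using its k highest-ranked emerging patterns:

```latex
% Collective-likelihood score of test sample T for class C, using the k
% top-ranked (by frequency) emerging patterns of C that T contains.
% Reconstructed from the PCL literature; not transcribed from the slide.
\mathrm{score}(T, C) \;=\; \sum_{j=1}^{k}
    \frac{\mathrm{freq}\!\left(EP^{C}_{i_j}\right)}{\mathrm{freq}\!\left(EP^{C}_{j}\right)}
```

where EP^C_1, ..., EP^C_k are the k most frequent EPs of class C, and EP^C_{i_1}, ..., EP^C_{i_k} are the k most frequent EPs of C that actually occur in T; the class with the largest score is predicted.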

Page 41: AAAI05 Tutorial on Bioinformatics & Machine Learning Jinyan Li & Limsoon Wong Institute for Infocomm Research 21 Heng Mui Keng Terrace Singapore Copyright.

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong

Childhood ALL Subtype Diagnosis Workflow

A tree-structured diagnostic workflow was recommended by our doctor collaborator

Page 42: AAAI05 Tutorial on Bioinformatics & Machine Learning Jinyan Li & Limsoon Wong Institute for Infocomm Research 21 Heng Mui Keng Terrace Singapore Copyright.

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong

Training and Testing Sets

Page 43: AAAI05 Tutorial on Bioinformatics & Machine Learning Jinyan Li & Limsoon Wong Institute for Infocomm Research 21 Heng Mui Keng Terrace Singapore Copyright.

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong

The classifiers are all applied to the 20 genes selected by χ2 at each level of the tree

Accuracy of Various Classifiers

Page 44: AAAI05 Tutorial on Bioinformatics & Machine Learning Jinyan Li & Limsoon Wong Institute for Infocomm Research 21 Heng Mui Keng Terrace Singapore Copyright.

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong

Understandability of EP & PCL

• E.g., for T-ALL vs. OTHERS, one ideally discriminatory gene 38319_at was found, inducing these 2 EPs

• These give us the diagnostic rule

Page 45: AAAI05 Tutorial on Bioinformatics & Machine Learning Jinyan Li & Limsoon Wong Institute for Infocomm Research 21 Heng Mui Keng Terrace Singapore Copyright.

Conclusions

Conventional Tx:
• intermediate intensity to everyone
• 10% suffer relapse
• 50% suffer side effects
• costs US$150m/yr

Our optimized Tx:
• high intensity to 10%, intermediate intensity to 40%, low intensity to 50%
• costs US$100m/yr
• high cure rate of 80%
• less relapse, fewer side effects
• saves US$51.6m/yr

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong

Page 46: AAAI05 Tutorial on Bioinformatics & Machine Learning Jinyan Li & Limsoon Wong Institute for Infocomm Research 21 Heng Mui Keng Terrace Singapore Copyright.

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong

Part 4: Topology of Protein Interaction Networks: Hubs, Cores, Bipartites

Page 47: AAAI05 Tutorial on Bioinformatics & Machine Learning Jinyan Li & Limsoon Wong Institute for Infocomm Research 21 Heng Mui Keng Terrace Singapore Copyright.

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong

Courtesy of JE Wampler

Protein Representations

• Four types of representations for proteins: primary, secondary, tertiary, and quaternary structure
• Protein interactions play an important role in intercellular communication, in signal transduction, and in the regulation of gene expression

Page 48: AAAI05 Tutorial on Bioinformatics & Machine Learning Jinyan Li & Limsoon Wong Institute for Infocomm Research 21 Heng Mui Keng Terrace Singapore Copyright.

Interaction Determination Approaches

• Experimental methods: phage display, mutagenesis, two-hybrid
• Computational methods: sequence based, structure based (docking), complex based, motif discovery, domain-domain interactions

Copyright © 2005 by Jinyan Li and Limsoon Wong

Page 49: AAAI05 Tutorial on Bioinformatics & Machine Learning Jinyan Li & Limsoon Wong Institute for Infocomm Research 21 Heng Mui Keng Terrace Singapore Copyright.

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong

Protein Interaction Databases

• PDB, the Protein Data Bank (http://www.rcsb.org/pdb/; reference: H.M. Berman et al., The Protein Data Bank, NAR 28:235-242, 2000): structure info, complex info
• BIND, the Biomolecular Interaction Network Database (http://binddb.org; Bader et al., NAR 29, 2001)
• The GRID, the General Repository for Interaction Datasets (http://biodata.mshri.on.ca/grid/servlet/Index)
• DIP, the Database of Interacting Proteins (http://dip.doe-mbi.ucla.edu/): XML format

Page 50: AAAI05 Tutorial on Bioinformatics & Machine Learning Jinyan Li & Limsoon Wong Institute for Infocomm Research 21 Heng Mui Keng Terrace Singapore Copyright.

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong

Graph Representation of Protein Interactions

• Maslov & Sneppen, Science, v296, 2002
• 329 nodes, 318 edges, in the nucleus of S. cerevisiae
• A power law: most neighbors of highly connected nodes have rather low connectivity; the highly connected nodes are called hubs

Page 51: AAAI05 Tutorial on Bioinformatics & Machine Learning Jinyan Li & Limsoon Wong Institute for Infocomm Research 21 Heng Mui Keng Terrace Singapore Copyright.

K-cores in Protein Interaction Graphs

• Yeast SH3 domain-domain interaction network: 206 nodes, 394 edges (Tong et al., Science, v295, 2002)
• 8 proteins containing SH3; 5 binding at least 6 of them

Copyright © 2005 by Jinyan Li and Limsoon Wong

Page 52: AAAI05 Tutorial on Bioinformatics & Machine Learning Jinyan Li & Limsoon Wong Institute for Infocomm Research 21 Heng Mui Keng Terrace Singapore Copyright.

Bipartite Subgraphs

Copyright © 2005 by Jinyan Li and Limsoon Wong

SH3 Proteins SH3-Binding Proteins

Page 53: AAAI05 Tutorial on Bioinformatics & Machine Learning Jinyan Li & Limsoon Wong Institute for Infocomm Research 21 Heng Mui Keng Terrace Singapore Copyright.

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong

Some Definitions

• A graph G = (VG, EG), where VG is a set of vertices and EG is a set of edges. No self loops, no parallel edges, undirected
• A graph H = (VH, EH) is a subgraph of G if VH is contained in VG and EH is contained in EG
• H = (V1 ∪ V2, EH) is a bipartite subgraph of G if
  – H is a subgraph of G, and
  – V1 ∩ V2 = {}, and
  – every edge in EH joins a vertex in V1 and a vertex in V2

Page 54: AAAI05 Tutorial on Bioinformatics & Machine Learning Jinyan Li & Limsoon Wong Institute for Infocomm Research 21 Heng Mui Keng Terrace Singapore Copyright.

V1 V2

Example Bipartite Subgraph of Protein Interaction

Network

Copyright © 2005 by Jinyan Li and Limsoon Wong

Page 55: AAAI05 Tutorial on Bioinformatics & Machine Learning Jinyan Li & Limsoon Wong Institute for Infocomm Research 21 Heng Mui Keng Terrace Singapore Copyright.

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong

Maximal Complete Bipartite Subgraphs

• A complete bipartite subgraph H = (V1 ∪ V2, EH) of G is said to be maximal if there is no other complete bipartite subgraph H' = (V'1 ∪ V'2, EH') of G such that V1 ⊆ V'1 and V2 ⊆ V'2

Page 56: AAAI05 Tutorial on Bioinformatics & Machine Learning Jinyan Li & Limsoon Wong Institute for Infocomm Research 21 Heng Mui Keng Terrace Singapore Copyright.

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong

The problem was previously studied by Eppstein (1994, Info Proc Lett), Makino & Uno (2004, SWAT), and Zaki & Ogihara (1998).

Study of Protein Interaction Network Topology

• List all maximal complete bipartite subgraphs from a protein interaction network (Step 1)
• Discover motifs from these protein groups (Step 2)
• Form binding motif pairs (Step 3)

Page 57: AAAI05 Tutorial on Bioinformatics & Machine Learning Jinyan Li & Limsoon Wong Institute for Infocomm Research 21 Heng Mui Keng Terrace Singapore Copyright.

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong

Efficient Mining of Maximal Complete Bipartite Subgraphs

• Equivalent to mining of closed patterns from the adjacency matrix of G
• Why:
  – the number of closed patterns is even,
  – it is precisely double the number of maximal complete bipartite subgraphs, and
  – there is a one-to-one correspondence between a maximal complete bipartite subgraph & a closed pattern pair
• Many state-of-the-art algorithms exist for mining closed patterns

Page 58: AAAI05 Tutorial on Bioinformatics & Machine Learning Jinyan Li & Limsoon Wong Institute for Infocomm Research 21 Heng Mui Keng Terrace Singapore Copyright.

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong

Adjacency Matrix: A Special Transactional Database (TDB)

• Let G be a graph with VG = {v1, v2, ..., vp}
• The adjacency matrix A of G is a p x p matrix defined by
  A[i,j] = 1 if (vi, vj) ∈ EG, and 0 otherwise
• A is symmetrical, and its main diagonal is 0

Page 59: AAAI05 Tutorial on Bioinformatics & Machine Learning Jinyan Li & Limsoon Wong Institute for Infocomm Research 21 Heng Mui Keng Terrace Singapore Copyright.

TDB, Freq Patterns, Equiv Classes, & Closed Patterns

• Support threshold = 2
• A closed pattern C satisfies C = f(g(C)), i.e., C equals its closure w.r.t. the TDB (Pasquier et al., ICDT 99; Grahne & Zhu, FIMI 03)

Copyright © 2005 by Jinyan Li and Limsoon Wong

Page 60: AAAI05 Tutorial on Bioinformatics & Machine Learning Jinyan Li & Limsoon Wong Institute for Infocomm Research 21 Heng Mui Keng Terrace Singapore Copyright.

Some Expt Results

• A protein interaction network taken from the GRID (Breitkreutz et al., Genome Biol. 2003)
• 4904 proteins, 17440 interactions, average neighborhood size 3.56
• Mining algorithm: FPclose* (Grahne & Zhu, 2003)

Copyright © 2005 by Jinyan Li and Limsoon Wong

Page 61: AAAI05 Tutorial on Bioinformatics & Machine Learning Jinyan Li & Limsoon Wong Institute for Infocomm Research 21 Heng Mui Keng Terrace Singapore Copyright.

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong

Fixed Point Idea

• The idea: f(x) = x
• Instantiations in life science: f(DNA) = DNA, f(proteinType) = proteinType (Meng et al., Bioinformatics, 2004)
• The variable x can be a pair of amino-acid segment patterns; the stable pattern after applying f is then a binding motif pair (Li and Li, Bioinformatics, 2005)

Page 62: AAAI05 Tutorial on Bioinformatics & Machine Learning Jinyan Li & Limsoon Wong Institute for Infocomm Research 21 Heng Mui Keng Terrace Singapore Copyright.

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong

Part 5: Treatment Prognosis for DLBC Lymphoma

Image credit: Rosenwald et al, 2002

Page 63: AAAI05 Tutorial on Bioinformatics & Machine Learning Jinyan Li & Limsoon Wong Institute for Infocomm Research 21 Heng Mui Keng Terrace Singapore Copyright.

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong

Diffuse Large B-Cell Lymphoma

• DLBC lymphoma is the most common type of lymphoma in adults
• Can be cured by anthracycline-based chemotherapy in 35 to 40 percent of patients
  – DLBC lymphoma comprises several diseases that differ in responsiveness to chemotherapy
• International Prognostic Index (IPI)
  – age, "Eastern Cooperative Oncology Group" performance status, tumor stage, lactate dehydrogenase level, sites of extranodal disease, ...
  – Not very good for stratifying DLBC lymphoma patients for therapeutic trials
• Use gene-expression profiles to predict outcome of chemotherapy?

Page 64: AAAI05 Tutorial on Bioinformatics & Machine Learning Jinyan Li & Limsoon Wong Institute for Infocomm Research 21 Heng Mui Keng Terrace Singapore Copyright.

Our Approach

• "Extreme" sample selection
• ERCOF

Page 65: AAAI05 Tutorial on Bioinformatics & Machine Learning Jinyan Li & Limsoon Wong Institute for Infocomm Research 21 Heng Mui Keng Terrace Singapore Copyright.

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong

Extreme Sample Selection: Short-term vs. Long-term Survivors

• T: sample; F(T): follow-up time; E(T): status (1: unfavorable; 0: favorable); c1 and c2: thresholds of survival time
• Short-term survivors: those who died within a short period, i.e., F(T) < c1 and E(T) = 1
• Long-term survivors: those who were alive after a long follow-up time, i.e., F(T) > c2
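A minimal sketch of this selection step (column names and data are placeholders; the only values taken from the tutorial are the < 1 yr / > 8 yrs cut-offs mentioned a few slides later):

```python
import pandas as pd

# Toy survival table; column names are placeholders. followup is in years,
# status is 1 for unfavorable (died) and 0 for favorable (censored/alive).
df = pd.DataFrame({
    "sample":   ["s1", "s2", "s3", "s4", "s5"],
    "followup": [0.4, 3.1, 9.2, 0.8, 12.0],
    "status":   [1,   1,   0,   1,   0],
})

c1, c2 = 1.0, 8.0   # thresholds used on the later slide: < 1 yr vs > 8 yrs

short_term = df[(df.followup < c1) & (df.status == 1)]   # died within a short period
long_term  = df[df.followup > c2]                        # alive after a long follow-up
extreme_training_set = pd.concat([short_term, long_term])
print(extreme_training_set)
```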

Page 66: AAAI05 Tutorial on Bioinformatics & Machine Learning Jinyan Li & Limsoon Wong Institute for Infocomm Research 21 Heng Mui Keng Terrace Singapore Copyright.

ERCOF: Entropy-Based Rank Sum Test & Correlation Filtering

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong

Page 67: AAAI05 Tutorial on Bioinformatics & Machine Learning Jinyan Li & Limsoon Wong Institute for Infocomm Research 21 Heng Mui Keng Terrace Singapore Copyright.

SVM Kernel Function

• Linear learning machine: llm(X) = sign( Σk αk Yk (Xk · X) + b )
• SVM: svm(X) = sign( Σk αk Yk K(Xk, X) + b ), where K(Xk, X) = (Xk · X) is the linear kernel
• SVM = LLM + Kernel

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong

Page 68: AAAI05 Tutorial on Bioinformatics & Machine Learning Jinyan Li & Limsoon Wong Institute for Infocomm Research 21 Heng Mui Keng Terrace Singapore Copyright.

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong

Risk Score Construction

• Linear kernel SVM regression function:
  G(T) = Σi ai yi K(T, x(i)) + b
  where T: test sample; x(i): support vector; yi: class label (1: short-term survivors; -1: long-term survivors)
• Transformation function (posterior probability):
  S(T) = 1 / (1 + e^(-G(T))),  with S(T) ∈ (0, 1)
• S(T): risk score of sample T
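A small sketch of how such a risk score can be computed with an off-the-shelf linear SVM (my own illustration; the data and variable names are placeholders):

```python
import numpy as np
from sklearn.svm import SVC

# Toy expression matrix: rows are patients, columns are (selected) genes.
# y = 1 for short-term survivors, -1 for long-term survivors.
rng = np.random.default_rng(1)
X_train = rng.normal(size=(40, 84))
y_train = rng.choice([1, -1], size=40)

svm = SVC(kernel="linear").fit(X_train, y_train)

def risk_score(sample):
    # G(T): signed distance from the separating hyperplane
    g = svm.decision_function(sample.reshape(1, -1))[0]
    # S(T): logistic transform into (0, 1); higher means higher risk
    return 1.0 / (1.0 + np.exp(-g))

s = risk_score(rng.normal(size=84))
label = "short-term" if s > 0.7 else "long-term" if s < 0.3 else "uncertain"
print(round(s, 3), label)
```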

Page 69: AAAI05 Tutorial on Bioinformatics & Machine Learning Jinyan Li & Limsoon Wong Institute for Infocomm Research 21 Heng Mui Keng Terrace Singapore Copyright.

Knowledge Discovery from Gene Expression of "Extreme" Samples

• 240 samples, 7399 genes (from Rosenwald et al., NEJM 2002)
• "Extreme" sample selection (< 1 yr vs > 8 yrs) gives 47 short-term survivors and 26 long-term survivors
• Knowledge discovery from gene expression selects 84 genes
• 80 samples are held out for validation
• T is long-term if S(T) < 0.3; T is short-term if S(T) > 0.7

Copyright © 2005 by Jinyan Li and Limsoon Wong

Page 70: AAAI05 Tutorial on Bioinformatics & Machine Learning Jinyan Li & Limsoon Wong Institute for Infocomm Research 21 Heng Mui Keng Terrace Singapore Copyright.

Kaplan-Meier Plots

• 80 test cases; p-value of log-rank test: < 0.0001; risk score thresholds: 0.7, 0.3
• Improvement over IPI: cases rated as IPI low can still be split (p-value = 0.0063)

Copyright © 2005 by Jinyan Li and Limsoon Wong

Page 71: AAAI05 Tutorial on Bioinformatics & Machine Learning Jinyan Li & Limsoon Wong Institute for Infocomm Research 21 Heng Mui Keng Terrace Singapore Copyright.

Merit of "Extreme" Samples

• W/o sample selection (p = 0.38): no clear difference in the overall survival of the 80 samples in the validation group of the DLBCL study if no training-sample selection is conducted
• W/ sample selection (p < 0.0001)

Copyright © 2005 by Jinyan Li and Limsoon Wong

Page 72: AAAI05 Tutorial on Bioinformatics & Machine Learning Jinyan Li & Limsoon Wong Institute for Infocomm Research 21 Heng Mui Keng Terrace Singapore Copyright.

Effectiveness of ERCOF

Copyright © 2005 by Jinyan Li and Limsoon Wong

[Table: total wins of each method; values 22, 38, 46, 47, 48, 51, 47, 67]

Page 73: AAAI05 Tutorial on Bioinformatics & Machine Learning Jinyan Li & Limsoon Wong Institute for Infocomm Research 21 Heng Mui Keng Terrace Singapore Copyright.

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong

Conclusions

• Selecting extreme cases as training samples is an effective way to improve patient outcome prediction based on gene expression profiles and clinical information
• ERCOF is a systematic feature selection method that is very useful
  – ERCOF is very suitable for SVM, 3-NN, CS4, and Random Forest, as it gives these learning algorithms the highest number of wins
  – ERCOF is also suitable for Bagging, as it gives this classifier the lowest number of errors

Page 74: AAAI05 Tutorial on Bioinformatics & Machine Learning Jinyan Li & Limsoon Wong Institute for Infocomm Research 21 Heng Mui Keng Terrace Singapore Copyright.

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong

Part 6: Decision Tree Based In-Silico Cancer Diagnosis

• Bagging
• Boosting
• Random Forest
• Randomization Trees
• CS4

Page 75: AAAI05 Tutorial on Bioinformatics & Machine Learning Jinyan Li & Limsoon Wong Institute for Infocomm Research 21 Heng Mui Keng Terrace Singapore Copyright.

Structure of Decision Trees

[Figure: a decision tree with a root node testing x1 > a1, internal nodes testing x2 > a2 etc., and leaf nodes labelled with classes A and B]

• If x1 > a1 & x2 > a2, then it's the A class
• C4.5 & CART are two of the most widely used
• Easy interpretation, but accuracy is generally unattractive

Copyright © 2005 by Jinyan Li and Limsoon Wong

Page 76: AAAI05 Tutorial on Bioinformatics & Machine Learning Jinyan Li & Limsoon Wong Institute for Infocomm Research 21 Heng Mui Keng Terrace Singapore Copyright.

Elegance of Decision Trees

• Diagnosis of central nervous system involvement in hematooncologic patients (Lossos et al., Cancer, 2000)
• Prediction of neurobehavioral outcome in head-injury survivors (Temkin et al., J. of Neurosurgery, 1995)
• Reconstruction of molecular networks from gene expression data (Soinov et al., Genome Biology, 2003)

Copyright © 2005 by Jinyan Li and Limsoon Wong

Page 77: AAAI05 Tutorial on Bioinformatics & Machine Learning Jinyan Li & Limsoon Wong Institute for Infocomm Research 21 Heng Mui Keng Terrace Singapore Copyright.

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong

Decision Tree Construction

(Decision trees have the single-coverage & fragmentation problem)

• Steps
  – Select the "best" feature as the root node of the whole tree
  – After partitioning by this feature, select the best feature (w.r.t. the subset of training data) as the root node of each sub-tree
  – Recurse until the partitions become pure or almost pure
• 3 measures to evaluate which feature is best (see the sketch below)
  – Gini index
  – Information gain
  – Information gain ratio
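A minimal sketch of one of these measures, information gain for a candidate binary split (toy labels and helper names of my own):

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(parent_labels, left_labels, right_labels):
    # Entropy of the parent minus the size-weighted entropy of the children.
    n = len(parent_labels)
    weighted = (len(left_labels) / n) * entropy(left_labels) \
             + (len(right_labels) / n) * entropy(right_labels)
    return entropy(parent_labels) - weighted

# Toy split: 10 samples, candidate test sends 6 left and 4 right.
parent = ["A"] * 5 + ["B"] * 5
left, right = ["A"] * 5 + ["B"], ["B"] * 4
print(round(information_gain(parent, left, right), 3))   # about 0.61
```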

Page 78: AAAI05 Tutorial on Bioinformatics & Machine Learning Jinyan Li & Limsoon Wong Institute for Infocomm Research 21 Heng Mui Keng Terrace Singapore Copyright.

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong

Motivating Example

• h1, h2, h3 are independent classifiers with accuracy = 60%
• C1, C2 are the only classes
• t is a test instance in C1
• h(t) = argmax_{C ∈ {C1, C2}} |{ hj ∈ {h1, h2, h3} | hj(t) = C }|   (majority vote)
• Then prob(h(t) = C1)
  = prob(h1(t)=C1 & h2(t)=C1 & h3(t)=C1)
  + prob(h1(t)=C1 & h2(t)=C1 & h3(t)=C2)
  + prob(h1(t)=C1 & h2(t)=C2 & h3(t)=C1)
  + prob(h1(t)=C2 & h2(t)=C1 & h3(t)=C1)
  = 60% * 60% * 60% + 60% * 60% * 40% + 60% * 40% * 60% + 40% * 60% * 60%
  = 64.8%

Page 79: AAAI05 Tutorial on Bioinformatics & Machine Learning Jinyan Li & Limsoon Wong Institute for Infocomm Research 21 Heng Mui Keng Terrace Singapore Copyright.

Bagging: Bootstrap Aggregating, Breiman 1996

• Original training set: 50 p + 50 n
• Draw 100 samples with replacement (injecting randomness), e.g. 48 p + 52 n, 49 p + 51 n, ..., 53 p + 47 n
• Train a base inducer such as C4.5 on each bootstrap sample
• A committee H of classifiers: h1, h2, ..., hk
• Equal voting

Copyright © 2005 by Jinyan Li and Limsoon Wong
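A scikit-learn sketch of the same idea (the library's default decision tree stands in for C4.5; data and parameters are placeholders):

```python
import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score

# Placeholder data: 100 samples (roughly 50 positive + 50 negative), 20 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
y = rng.integers(0, 2, size=100)

# Each of the 50 trees is trained on a bootstrap sample drawn with replacement
# (the default base learner is a decision tree); prediction is by equal voting.
bagging = BaggingClassifier(n_estimators=50, bootstrap=True, random_state=0)
print(cross_val_score(bagging, X, y, cv=3).mean())
```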

Page 80: AAAI05 Tutorial on Bioinformatics & Machine Learning Jinyan Li & Limsoon Wong Institute for Infocomm Research 21 Heng Mui Keng Terrace Singapore Copyright.

AdaBoost: Adaptive Boosting, Freund & Schapire 1995

• Start with 100 instances with equal weight; train a classifier h1 and measure its weighted error e1
• If the error is 0 or > 0.5, stop
• Otherwise re-weight the correctly classified instances by e1/(1 - e1) and renormalize the total weight to 1
• Train the next classifier h2 on the 100 instances with the new weights; samples misclassified by hi thus carry relatively higher weight
• Repeat to build the committee

Copyright © 2005 by Jinyan Li and Limsoon Wong
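A sketch of this re-weighting loop (AdaBoost.M1 style; placeholder data, and a shallow scikit-learn tree stands in for the base inducer):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Placeholder data for illustrating the AdaBoost.M1 re-weighting loop.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = (X[:, 0] + 0.5 * rng.normal(size=100) > 0).astype(int)

w = np.full(len(y), 1 / len(y))              # equal initial weights
committee = []
for _ in range(10):
    h = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
    wrong = h.predict(X) != y
    e = w[wrong].sum()                       # weighted error
    if e == 0 or e > 0.5:                    # stopping rule from the slide
        break
    beta = e / (1 - e)
    w[~wrong] *= beta                        # down-weight correctly classified instances
    w /= w.sum()                             # renormalize total weight to 1
    committee.append((h, np.log(1 / beta)))  # voting weight ln(1/beta), as in AdaBoost.M1

print(len(committee), "boosted classifiers")
```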

Page 81: AAAI05 Tutorial on Bioinformatics & Machine Learning Jinyan Li & Limsoon Wong Institute for Infocomm Research 21 Heng Mui Keng Terrace Singapore Copyright.

Random Forest, Breiman 2001

• Original training set: 50 p + 50 n
• Draw bootstrap samples, e.g. 48 p + 52 n, 49 p + 51 n, ..., 53 p + 47 n
• Base inducer is not standard C4.5 but a revised tree learner (next slide)
• A committee H of classifiers: h1, h2, ..., hk
• Equal voting

Copyright © 2005 by Jinyan Li and Limsoon Wong

Page 82: AAAI05 Tutorial on Bioinformatics & Machine Learning Jinyan Li & Limsoon Wong Institute for Infocomm Research 21 Heng Mui Keng Terrace Singapore Copyright.

A Revised C4.5 as Base Classifier

• At each node (including the root), the split feature is selected from mtry features randomly chosen out of the original n features
• Similar to Bagging, but the base inducer is not the standard C4.5
• Makes use of randomness twice: bootstrap sampling plus random feature subsets (see the sketch below)

Copyright © 2005 by Jinyan Li and Limsoon Wong
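The same two sources of randomness in a scikit-learn sketch (placeholder data; max_features plays the role of mtry):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Placeholder data standing in for a gene-expression matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 50))
y = rng.integers(0, 2, size=120)

# bootstrap=True draws each tree's training set with replacement;
# max_features="sqrt" limits each split to a random feature subset (mtry).
forest = RandomForestClassifier(n_estimators=100, bootstrap=True,
                                max_features="sqrt", random_state=0)
print(cross_val_score(forest, X, y, cv=3).mean())
```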

Page 83: AAAI05 Tutorial on Bioinformatics & Machine Learning Jinyan Li & Limsoon Wong Institute for Infocomm Research 21 Heng Mui Keng Terrace Singapore Copyright.

Randomization Trees, Dietterich 2000

• At the root node (and each internal node), with the original n features available, select one split randomly from a set of candidate splits, e.g. {feature 1: choices 1, 2, 3; feature 2: choices 1, 2, ...; feature 8: choices 1, 2, 3}, a total of 20 candidates
• Makes use of randomness in the selection of the best split point
• Equal voting on the committee of such decision trees

Copyright © 2005 by Jinyan Li and Limsoon Wong

Page 84: AAAI05 Tutorial on Bioinformatics & Machine Learning Jinyan Li & Limsoon Wong Institute for Infocomm Research 21 Heng Mui Keng Terrace Singapore Copyright.

CS4: Cascading & Sharing for Decision Trees, Li et al. 2003

• Builds k trees in total; the root node of tree i is the i-th ranked feature, so the selection of root nodes is in a cascading manner!
• Doesn't make use of randomness
• Not equal voting

Copyright © 2005 by Jinyan Li and Limsoon Wong

Page 85: AAAI05 Tutorial on Bioinformatics & Machine Learning Jinyan Li & Limsoon Wong Institute for Infocomm Research 21 Heng Mui Keng Terrace Singapore Copyright.

Summary of Ensemble Classifiers

• Bagging, Random Forest, AdaBoost.M1, Randomization Trees: rules may not be correct when applied to the training data
• CS4: rules are correct on the training data

Copyright © 2005 by Jinyan Li and Limsoon Wong

Page 86: AAAI05 Tutorial on Bioinformatics & Machine Learning Jinyan Li & Limsoon Wong Institute for Infocomm Research 21 Heng Mui Keng Terrace Singapore Copyright.

Example of Decision Tree

• Yeoh et al., Cancer Cell 1:133-143, 2002; Differentiating MLL subtype from other subtypes of childhood leukemia

• Training data (14 MLL vs 201 others), Test data (6 MLL vs 106 others), Number of features: 12558

Given a test sample, at most 3 of the 4 genes’ expression values are needed to make a decision!

Copyright © 2005 by Jinyan Li and Limsoon Wong

Page 87: AAAI05 Tutorial on Bioinformatics & Machine Learning Jinyan Li & Limsoon Wong Institute for Infocomm Research 21 Heng Mui Keng Terrace Singapore Copyright.

Performance of CS4

• Whether top-ranked features have similar gain ratios

• Whether cascading trees have similar training performance

• Whether the trees have similar structure

• Whether the expanding tree committees can reduce the test errors gradually

• Gain ratios are: 0.39, 0.36, 0.35, 0.33, 0.33, 0.33, 0.33, 0.32, 0.31, 0.30; 0.30, 0.30, 0.30, 0.29, 0.29, 0.28, 0.28, 0.28, 0.28, 0.28.

Copyright © 2005 by Jinyan Li and Limsoon Wong

Page 88: AAAI05 Tutorial on Bioinformatics & Machine Learning Jinyan Li & Limsoon Wong Institute for Infocomm Research 21 Heng Mui Keng Terrace Singapore Copyright.

Discovery of Diagnostic Biomarkers for Ovarian Cancer

• Motivation: cure rate ~95% if correct diagnosis at early stage
• Proteomic profiling data obtained from patients' serum samples
• The first data set, by Petricoin et al., was published in Lancet, 2002
• Data set of June 2002: 253 samples (91 controls and 162 patients suffering from the disease); 15154 features (proteins, peptides; precisely, mass/charge identities)
• SVM: 0 errors; Naïve Bayes: 19 errors; k-NN: 15 errors

Copyright © 2005 by Jinyan Li and Limsoon Wong

Page 89: AAAI05 Tutorial on Bioinformatics & Machine Learning Jinyan Li & Limsoon Wong Institute for Infocomm Research 21 Heng Mui Keng Terrace Singapore Copyright.

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong

Part 7: Mining Errors from Bio Databases

Page 90: AAAI05 Tutorial on Bioinformatics & Machine Learning Jinyan Li & Limsoon Wong Institute for Infocomm Research 21 Heng Mui Keng Terrace Singapore Copyright.

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong

Data Cleansing, Koh et al., DBiDB 2005

• 11 types & 28 subtypes of data artifacts
  – Critical artifacts (vector contaminated sequences, duplicates, sequence structure violations)
  – Non-critical artifacts (misspellings, synonyms)
• > 20,000 sequence records in public databases contain artifacts
• Identification of these artifacts is important for accurate knowledge discovery
• Sources of artifacts
  – Diverse sources of data
    • Repeated submissions of sequences to databases
    • Cross-updating of databases
  – Data annotation
    • Databases have different ways of annotating data
    • Data entry errors can be introduced
    • Different interpretations
  – Lack of standardized nomenclature
    • Variations in naming
    • Synonyms, homonyms, & abbreviations
  – Inadequacy of data quality control mechanisms
    • Systematic approaches to data cleaning are lacking

Page 91: AAAI05 Tutorial on Bioinformatics & Machine Learning Jinyan Li & Limsoon Wong Institute for Infocomm Research 21 Heng Mui Keng Terrace Singapore Copyright.

A Classification of Errors

Copyright © 2005 by Limsoon Wong. Adapted w/permision from Judice Koh, Mong Li Lee, & Vladimir Brusic

Page 92: AAAI05 Tutorial on Bioinformatics & Machine Learning Jinyan Li & Limsoon Wong Institute for Infocomm Research 21 Heng Mui Keng Terrace Singapore Copyright.

[Diagram: classification of errors at the ATTRIBUTE, RECORD, SINGLE SOURCE DATABASE, and MULTIPLE SOURCE DATABASE levels; error types include invalid values, ambiguity, incompatible schema, uninformative sequences, undersized sequences, annotation errors, dubious sequences, sequence redundancy, data provenance flaws, cross-annotation errors, sequence structure violations, vector contaminated sequences, and erroneous data transformations]

• Among the 5,146,255 protein records retrieved by querying Entrez against the major protein or translated-nucleotide databases, 3,327 protein sequences are shorter than four residues (as of Sep 2004)
• By Nov 2004, the total number of undersized protein sequences had increased to 3,350
• Among 43,026,887 nucleotide records retrieved by querying Entrez against the major nucleotide databases, 1,448 records contain sequences shorter than six bases (as of Sep 2004)
• By Nov 2004, the total number of undersized nucleotide sequences had increased to 1,711

[Bar charts: "Undersized protein sequences in major databases" and "Undersized nucleotide sequences in major databases"; x-axis: sequence length, y-axis: number of records, with bars for DDBJ, EMBL, GenBank, PDB, SwissProt, and PIR]

Copyright © 2005 by Limsoon Wong. Adapted w/permision from Judice Koh, Mong Li Lee, & Vladimir Brusic

Example Meaningless Seqs

Page 93: AAAI05 Tutorial on Bioinformatics & Machine Learning Jinyan Li & Limsoon Wong Institute for Infocomm Research 21 Heng Mui Keng Terrace Singapore Copyright.

[Diagram: the error classification again, highlighting overlapping intron/exon among the sequence structure violations]

• Syn7 gene of putative polyketide synthase in NCBI TPA record BN000507 has overlapping intron 5 and exon 6.

• rpb7+ RNA polymerase II subunit in GENBANK record AF055916 has overlapping exon 1 and exon 2.

Example Overlapping Intron/Exon

Copyright © 2005 by Limsoon Wong. Adapted w/permision from Judice Koh, Mong Li Lee, & Vladimir Brusic

Page 94: AAAI05 Tutorial on Bioinformatics & Machine Learning Jinyan Li & Limsoon Wong Institute for Infocomm Research 21 Heng Mui Keng Terrace Singapore Copyright.

[Diagram: the error classification again, highlighting sequence redundancy]

• Sequence redundancy arises from replication of sequence information and different views:
  – Overlapping annotations of the same sequence
  – Submission of the same sequence to different databases
  – Repeated submission of the same sequence to the same database
  – Initially submitted by different groups
  – Protein sequences may be translated from duplicate nucleotide sequences
• Example: http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=protein&list_uids=11692005&dopt=GenPept

Example Seqs w/ Identical Info

Copyright © 2005 by Limsoon Wong. Adapted w/permision from Judice Koh, Mong Li Lee, & Vladimir Brusic

Page 95: AAAI05 Tutorial on Bioinformatics & Machine Learning Jinyan Li & Limsoon Wong Institute for Infocomm Research 21 Heng Mui Keng Terrace Singapore Copyright.

Association Rule Mining for De-duplication

1. Compute similarity scores from known duplicate pairs
2. Select matching criteria
3. Generate association rules
4. Detect duplicates using the rules

Copyright © 2005 by Jinyan Li and Limsoon Wong.
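A compact sketch of the rule-generation step (my own simplification: discretized match features per record pair and a direct confidence count, rather than a full Apriori implementation; the pairs below are invented, only the feature codes follow the later slide):

```python
from itertools import combinations

# Each known record pair is described by discretized match features, e.g.
# SQ1.0 = identical sequence, PD0 = neither record from PDB, DB0 = different
# source databases. Pairs here are invented for illustration.
pairs = [
    ({"SQ1.0", "LE1.0", "PD0", "SP1", "DB1"}, True),
    ({"SQ1.0", "LE1.0", "PD0", "SP1", "DB0"}, True),
    ({"SQ1.0", "LE1.0", "PD0", "SP1", "DB0"}, True),
    ({"SQ0.2", "LE1.0", "PD0", "SP1", "DB0"}, False),
    ({"SQ1.0", "LE0.5", "PD1", "SP0", "DB0"}, False),
]

# For every candidate feature combination, the rule "features => duplicate"
# has confidence = P(duplicate | all features present).
items = sorted(set().union(*(f for f, _ in pairs)))
for size in (2, 3):
    for combo in combinations(items, size):
        covered = [dup for feats, dup in pairs if set(combo) <= feats]
        if len(covered) >= 2:                       # minimum support
            conf = sum(covered) / len(covered)
            if conf >= 0.9:                         # minimum confidence
                print(" ^ ".join(combo), f"=> duplicate ({conf:.0%})")
```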

Page 96: AAAI05 Tutorial on Bioinformatics & Machine Learning Jinyan Li & Limsoon Wong Institute for Infocomm Research 21 Heng Mui Keng Terrace Singapore Copyright.

Features to Match

Copyright © 2005 by Limsoon Wong. Adapted w/permision from Judice Koh, Mong Li Lee, & Vladimir Brusic

Page 97: AAAI05 Tutorial on Bioinformatics & Machine Learning Jinyan Li & Limsoon Wong Institute for Infocomm Research 21 Heng Mui Keng Terrace Singapore Copyright.

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong

Association Rule Mining

Similarity scores of known duplicate pairs:

AAG39642 AAG39643  AC0.9 LE1.0 DE1.0 DB1 SP1 RF1.0 PD0 FT1.0 SQ1.0
AAG39642 Q9GNG8    AC0.1 LE1.0 DE0.4 DB0 SP1 RF1.0 PD0 FT0.1 SQ1.0
P00599   PSNJ1W    AC0.2 LE1.0 DE0.4 DB0 SP1 RF1.0 PD0 FT1.0 SQ1.0
P01486   NTSREB    AC0.0 LE1.0 DE0.3 DB0 SP1 RF1.0 PD0 FT1.0 SQ1.0
O57385   CAA11159  AC0.1 LE1.0 DE0.5 DB0 SP1 RF0.0 PD0 FT0.1 SQ1.0
S32792   P24663    AC0.0 LE1.0 DE0.4 DB0 SP1 RF0.5 PD0 FT1.0 SQ1.0
P45629   S53330    AC0.0 LE1.0 DE0.2 DB0 SP1 RF1.0 PD0 FT1.0 SQ1.0

Frequent item-sets with support, found by association rule mining:

LE1.0 PD0 SQ1.0 (99.7%)
SP1 PD0 SQ1.0 (97.1%)
SP1 LE1.0 PD0 SQ1.0 (96.8%)
DB0 PD0 SQ1.0 (93.1%)
DB0 LE1.0 PD0 SQ1.0 (92.8%)
DB0 SP1 PD0 SQ1.0 (90.4%)
DB0 SP1 LE1.0 PD0 SQ1.0 (90.1%)
RF1.0 SP1 LE1.0 PD0 SQ1.0 (47.6%)
RF1.0 DB0 LE1.0 PD0 SQ1.0 (44.0%)

Page 98: AAAI05 Tutorial on Bioinformatics & Machine Learning Jinyan Li & Limsoon Wong Institute for Infocomm Research 21 Heng Mui Keng Terrace Singapore Copyright.

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong

Dataset

• Scorpion venom dataset containing 520 records, retrieved from Entrez (GenBank, GenPept, SwissProt, DDBJ, PIR, PDB) with the query "scorpion AND (venom OR toxin)"; expert annotation identified 251 duplicate pairs
• Snake PLA2 venom dataset containing 780 records, retrieved with the query "serpentes AND venom AND PLA2"; expert annotation identified 444 duplicate pairs
• 695 duplicate pairs are collectively identified

Page 99: AAAI05 Tutorial on Bioinformatics & Machine Learning Jinyan Li & Limsoon Wong Institute for Infocomm Research 21 Heng Mui Keng Terrace Singapore Copyright.

Results

[Chart: FP% and FN% of duplicates detected by each association rule]

Rule 1: S(Seq)=1 ^ N(Seq Length)=1 ^ M(PDB)=0 (99.7%)
Rule 2: S(Seq)=1 ^ M(PDB)=0 ^ M(Species)=1 (97.1%)
Rule 3: S(Seq)=1 ^ N(Seq Length)=1 ^ M(Species)=1 ^ M(PDB)=0 (96.8%)
Rule 4: S(Seq)=1 ^ M(PDB)=0 ^ M(DB)=0 (93.1%)
Rule 5: S(Seq)=1 ^ M(Seq Length)=1 ^ M(PDB)=0 ^ M(DB)=0 (92.8%)
Rule 6: S(Seq)=1 ^ M(Species)=1 ^ M(PDB)=0 ^ M(DB)=0 (90.4%)
Rule 7: S(Seq)=1 ^ N(Seq Length)=1 ^ M(Species)=1 ^ M(PDB)=0 ^ M(DB)=0 (90.1%)

Adapted w/permision from Judice Koh, Mong Li Lee, & Vladimir Brusic

Rule 1. Identical sequences with the same sequence length and not originated from PDB are 99.7% likely to be duplicates.

Rule 2. Identical sequences with the same sequence length and of the same species are 97.1% likely to be duplicates.

Rule 3. Identical sequences with the same sequence length, of the same species and not originated from PDB are 96.8% likely to be duplicates.

Page 100: AAAI05 Tutorial on Bioinformatics & Machine Learning Jinyan Li & Limsoon Wong Institute for Infocomm Research 21 Heng Mui Keng Terrace Singapore Copyright.

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong

Acknowledgements

TIS Prediction

• Huiqing Liu, Roland Yap, Fanfan Zeng

Treatment Optimization for Childhood ALL

• James Downing, Huiqing Liu, Allen Yeoh

Prognosis based on Gene Expression Profiling

• Huiqing Liu

Analysis of Proteomics Data

• Huiqing Liu

Subgraphs in Protein Interactions

• Haiquan Li, Donny Soh, Chris Tan, See Kiong Ng

Mining Errors from Bio Databases

• Vladimir Brusic, Judice Koh, Mong Li Lee