Data Mining in Bioinformatics Day 9: String & Text Mining in … · Karsten Borgwardt: Data Mining...


Data Mining in Bioinformatics
Day 9: String & Text Mining in Bioinformatics

Karsten Borgwardt

March 1 to March 12, 2010

Machine Learning & Computational Biology Research Group
MPIs Tübingen

Why compare sequences?


Protein sequences
  Proteins are chains of amino acids.
  20 different types of amino acids can be found in protein sequences.
  Protein sequences change over time through mutations, deletions, and insertions.
  Different protein sequences may diverge from one common ancestor.
  Their sequences may differ slightly, yet their function is often conserved.

Why compare sequences?


Biological Question:
  Biologists are interested in the reverse direction:
  Given two protein sequences, is it likely that they originate from the same common ancestor?

Computational Challenge:
  How to measure similarity between two protein sequences, or equivalently:
  How to measure similarity between two strings

Kernel Challenge:
  How to measure similarity between two strings via a kernel function

In short: How to define a string kernel

History of sequence comparison


First phase
  Smith-Waterman
  BLAST

Second phase
  Profiles
  Hidden Markov Models

Third phase
  PSI-BLAST
  SAM-T98

Fourth phase
  Kernels

Sequence comparison: Phase 1


Idea
  Measure pairwise similarities between sequences with gaps

Methods
  Smith-Waterman
    dynamic programming
    high accuracy
    slow (O(n^2))
  BLAST
    faster heuristic alternative with sufficient accuracy
    searches for common substrings of fixed length
    extends these in both directions
    performs gapped alignment
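
The dynamic-programming idea behind Smith-Waterman can be sketched as follows. This is a minimal illustration rather than a reference implementation: the function name `smith_waterman_score` and the scoring values (match +2, mismatch -1, linear gap penalty -1) are assumptions made for this example, whereas practical tools typically use amino acid substitution matrices and affine gap penalties.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score of strings a and b (toy scoring scheme)."""
    # H[i][j] = best score of a local alignment ending at a[i-1] and b[j-1].
    H = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best
```

The two nested loops over both sequences are where the quadratic running time comes from.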



Sequence comparison: Phase 2

Idea
  Collect aggregate statistics from a family of sequences
  Compare these statistics to a single unlabeled protein

Methods
  Hidden Markov Models (HMMs)
    Markov process with hidden and observable parameters
    The forward algorithm computes the probability that a given sequence is the output of a particular HMM
  Profiles
    Profiles of sequence families are derived by multiple sequence alignment
    A given sequence is compared to this profile
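
The forward algorithm mentioned above can be sketched in a few lines. The two-state model below (`STATES`, `START`, `TRANS`, `EMIT`) is a made-up toy HMM over the symbols A and B, introduced only for illustration; a real protein-family HMM would have many more states and be trained on aligned sequences.

```python
def forward_probability(obs, states, start_p, trans_p, emit_p):
    """Probability that the HMM emits the observation sequence obs."""
    if not obs:
        return 1.0
    # alpha[s] = P(observations so far, current hidden state = s)
    alpha = {s: start_p[s] * emit_p[s][obs[0]] for s in states}
    for symbol in obs[1:]:
        alpha = {
            s: emit_p[s][symbol] * sum(alpha[r] * trans_p[r][s] for r in states)
            for s in states
        }
    return sum(alpha.values())


# Hypothetical toy model, not taken from the slides.
STATES = ("H", "L")
START = {"H": 0.5, "L": 0.5}
TRANS = {"H": {"H": 0.5, "L": 0.5}, "L": {"H": 0.5, "L": 0.5}}
EMIT = {"H": {"A": 0.8, "B": 0.2}, "L": {"A": 0.4, "B": 0.6}}

p = forward_probability("AB", STATES, START, TRANS, EMIT)
```

Summing over hidden states at every step avoids enumerating all hidden paths explicitly.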



Sequence comparison: Phase 3

Idea
  Create single models from database collections of homologous sequences

Methods
  PSI-BLAST
    Position-specific iterative BLAST
    Profile from highest-scoring hits in initial BLAST runs
    Position weighting according to degree of conservation
    Iteration of these steps
  SAM-T98, now SAM-T02
    database search with HMM from multiple sequence alignment

Phase 4: Kernels and SVMs


General idea
  Model differences between classes of sequences
  Use SVM classifier to distinguish classes
  Use kernel to measure similarity between strings

Kernels for Protein Sequences
  SVM-Fisher kernel
  Composition kernel
  Motif kernel
  String kernel

SVM-Fisher method


General idea
  Combine HMMs and SVMs for sequence classification
  Won best-paper award at ISMB 1999

Sequence representation
  fixed-length vector
  components are transition and emission probabilities
  transformation into Fisher score
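
For orientation, the Fisher score is usually written as the gradient of the log-likelihood with respect to the model parameters. The symbols $\theta$ (the HMM's transition and emission parameters), $U_X$ and $I$ (the Fisher information matrix) are notation introduced here, not taken from the slides, and the kernel shown is the generic Fisher kernel, which is one common choice rather than necessarily the exact form used in the SVM-Fisher method.

```latex
U_X = \nabla_\theta \log P(X \mid \theta),
\qquad
K(X, X') = U_X^\top I^{-1} U_{X'}
```

Since $\theta$ has a fixed number of components, $U_X$ has the same length for every sequence $X$, which is what makes it usable as a fixed-length representation.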

SVM-Fisher method


Algorithm
  Model protein family F as HMM
  Transform query protein X into fixed-length vector via HMM
  Compute kernel between X and positive and negative examples of the protein family

Advantages
  allows prior knowledge to be incorporated
  can deal with missing data
  is interpretable
  outperforms competing methods

Composition kernels


General idea
  Model sequence by amino acid content
  Bin amino acids w.r.t. physico-chemical properties

Sequence representation
  feature vector of amino acid frequencies
  physico-chemical properties include predicted secondary structure, hydrophobicity, normalized van der Waals volume, polarity, polarizability
  useful database: AAindex
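
A composition-based representation might look like the sketch below. The three-way hydrophobicity grouping in `HYDROPHOBICITY_BINS` is an illustrative assumption, not a binning given in the slides, and the function names are invented for this example; a linear kernel on the resulting vectors is just one simple choice.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Illustrative grouping by hydrophobicity (an assumption for this sketch).
HYDROPHOBICITY_BINS = {
    "polar": set("RKEDQN"),
    "neutral": set("GASTPHY"),
    "hydrophobic": set("CLVIMFW"),
}


def composition_features(seq):
    """Relative frequency of each of the 20 amino acids in seq."""
    counts = Counter(seq)
    n = max(len(seq), 1)  # avoid division by zero for empty input
    return [counts[a] / n for a in AMINO_ACIDS]


def binned_features(seq, bins=HYDROPHOBICITY_BINS):
    """Fraction of residues falling into each physico-chemical bin."""
    n = max(len(seq), 1)
    return [sum(ch in members for ch in seq) / n for members in bins.values()]


def linear_kernel(u, v):
    """Inner product of two feature vectors."""
    return sum(a * b for a, b in zip(u, v))
```

Both feature vectors ignore residue order entirely, which is the main limitation of composition-based representations.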

Motif kernels


General idea
  Conserved motifs in amino acid sequences indicate structural and functional relationships
  Model sequence s as a feature vector f representing motifs
  The i-th component of f is 1 if and only if s contains the i-th motif

Motif databases
  PROSITE
  eMOTIFs
  BLOCKS+ combines several databases

Generated by
  manual construction
  multiple sequence alignment
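
The binary motif feature vector described above can be sketched with regular expressions. The patterns in `MOTIFS` are illustrative stand-ins written in the style of database motifs; they are assumptions for this example and not entries copied from PROSITE, eMOTIFs or BLOCKS+.

```python
import re

# Hypothetical motif patterns, for illustration only.
MOTIFS = [
    re.compile(r"N[^P][ST][^P]"),
    re.compile(r"C..C"),
    re.compile(r"[RK]{2}.[ST]"),
]


def motif_features(seq, motifs=MOTIFS):
    """Binary vector f: f[i] is 1 if seq contains the i-th motif, else 0."""
    return [1 if m.search(seq) else 0 for m in motifs]


def motif_kernel(s, t, motifs=MOTIFS):
    """Number of motifs that occur in both sequences."""
    fs, ft = motif_features(s, motifs), motif_features(t, motifs)
    return sum(a * b for a, b in zip(fs, ft))
```

With binary features the inner product simply counts the motifs shared by the two sequences.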

Pairwise comparison kernels


General idea
  Employ empirical kernel map on Smith-Waterman/BLAST scores

Advantage
  Utilizes decades of practical experience with BLAST

Disadvantage
  High computational cost (O(m^3))

Alleviation
  Employ BLAST instead of Smith-Waterman
  Use vectorization set for empirical map only
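
The empirical kernel map can be sketched as follows: each sequence is represented by its vector of similarity scores against a fixed vectorization set, and the kernel is the inner product of two such vectors. The scoring function `longest_common_substring` is a toy stand-in chosen to keep the example self-contained; in the setting described above it would be replaced by Smith-Waterman or BLAST scores. All function names are assumptions of this sketch.

```python
def longest_common_substring(a, b):
    """Toy similarity score standing in for an alignment score."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ch in a:
        cur = [0] * (len(b) + 1)
        for j, other in enumerate(b, start=1):
            if ch == other:
                # Extend the common substring ending at the previous positions.
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best


def empirical_map(x, vectorization_set, score=longest_common_substring):
    """Represent x by its scores against every sequence in the set."""
    return [score(x, z) for z in vectorization_set]


def pairwise_kernel(x, y, vectorization_set, score=longest_common_substring):
    """Inner product of the empirical feature vectors of x and y."""
    fx = empirical_map(x, vectorization_set, score)
    fy = empirical_map(y, vectorization_set, score)
    return sum(a * b for a, b in zip(fx, fy))
```

Keeping the vectorization set small bounds the length of each feature vector, which is the point of the alleviation listed above.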

Phase 4: String Kernels


General idea
  Count common substrings in two strings
  A substring of length k is a k-mer

Variations
  Assign weights to k-mers
  Allow for mismatches
  Allow for gaps
  Include substitutions
  Include wildcards

Spectrum Kernel


General idea
  For each l-mer $\alpha \in \Sigma^l$, the coordinate indexed by $\alpha$ is the number of times $\alpha$ occurs in sequence $x$.
  The l-spectrum feature map is then

  $$\Phi^{\text{Spectrum}}_l(x) = (\phi_\alpha(x))_{\alpha \in \Sigma^l}$$

  Here $\phi_\alpha(x)$ is the number of occurrences of $\alpha$ in $x$.
  The spectrum kernel is the inner product in the feature space defined by this map:

  $$k^{\text{Spectrum}}_l(x, x') = \langle \Phi^{\text{Spectrum}}_l(x), \Phi^{\text{Spectrum}}_l(x') \rangle$$

  The more substrings two sequences have in common, the more similar they are deemed to be.
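
A direct, unoptimized sketch of the spectrum kernel is given below. The function names are assumptions of this example; only l-mers that actually occur are stored, so the full feature space over all of $\Sigma^l$ is never materialized.

```python
from collections import Counter


def spectrum(x, l):
    """Sparse l-spectrum of x: maps each l-mer to its number of occurrences."""
    return Counter(x[i:i + l] for i in range(len(x) - l + 1))


def spectrum_kernel(x, y, l):
    """Inner product of the l-spectra of x and y."""
    fx, fy = spectrum(x, l), spectrum(y, l)
    # A Counter returns 0 for l-mers that do not occur in y.
    return sum(count * fy[kmer] for kmer, count in fx.items())
```

For example, `spectrum("ABAB", 2)` counts AB twice and BA once, so against "ABA" (AB once, BA once) the kernel value is 2·1 + 1·1 = 3.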

Spectrum Kernel


Principle
  Spectrum kernel: count k-mers that match exactly

Mismatch Kernel


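General idea
  Do not enforce strictly exact matches
  Define the mismatch neighborhood of an l-mer $\alpha$ with up to $m$ mismatches:

  $$\Phi^{\text{Mismatch}}_{(l,m)}(\alpha) = (\phi_\beta(\alpha))_{\beta \in \Sigma^l}$$

  where $\phi_\beta(\alpha) = 1$ if $\beta$ differs from $\alpha$ in at most $m$ positions, and $\phi_\beta(\alpha) = 0$ otherwise.
  For a sequence $x$ of any length, the map is then extended as

  $$\Phi^{\text{Mismatch}}_{(l,m)}(x) = \sum_{l\text{-mers } \alpha \text{ in } x} \Phi^{\text{Mismatch}}_{(l,m)}(\alpha)$$

  The mismatch kernel is the inner product in the feature space defined by this map:

  $$k^{\text{Mismatch}}_{(l,m)}(x, x') = \langle \Phi^{\text{Mismatch}}_{(l,m)}(x), \Phi^{\text{Mismatch}}_{(l,m)}(x') \rangle$$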
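
The mismatch feature map can be illustrated by brute-force enumeration of each l-mer's neighborhood. This is a sketch for small alphabets and small m only, with function names invented for the example; it is not how efficient implementations compute the kernel.

```python
from collections import Counter
from itertools import combinations, product


def mismatch_neighborhood(alpha, m, alphabet):
    """All strings over alphabet within Hamming distance m of alpha."""
    neighbors = {alpha}
    for r in range(1, m + 1):
        for positions in combinations(range(len(alpha)), r):
            for letters in product(alphabet, repeat=r):
                beta = list(alpha)
                for pos, ch in zip(positions, letters):
                    beta[pos] = ch
                neighbors.add("".join(beta))
    return neighbors


def mismatch_features(x, l, m, alphabet):
    """Each l-mer of x adds 1 to every l-mer in its mismatch neighborhood."""
    feats = Counter()
    for i in range(len(x) - l + 1):
        feats.update(mismatch_neighborhood(x[i:i + l], m, alphabet))
    return feats


def mismatch_kernel(x, y, l, m, alphabet):
    """Inner product of the (l, m)-mismatch feature vectors of x and y."""
    fx = mismatch_features(x, l, m, alphabet)
    fy = mismatch_features(y, l, m, alphabet)
    return sum(count * fy[beta] for beta, count in fx.items())
```

With m = 0 every neighborhood contains only the l-mer itself, so the function reduces to the spectrum kernel.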

Mismatch Kernel


Principle
  Mismatch kernel: count common k-mers with at most m mismatches

Gappy Kernel


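General idea
  Allow for gaps in common substrings → "subsequences"
  A g-mer then contributes to all its l-mer subsequences:

  $$\Phi^{\text{Gap}}_{(g,l)}(\alpha) = (\phi_\beta(\alpha))_{\beta \in \Sigma^l}$$

  where $\alpha$ is a g-mer and $\phi_\beta(\alpha) = 1$ if $\beta$ occurs in $\alpha$ as a (not necessarily contiguous) subsequence, and $\phi_\beta(\alpha) = 0$ otherwise.
  For a sequence $x$ of any length:

  $$\Phi^{\text{Gap}}_{(g,l)}(x) = \sum_{g\text{-mers } \alpha \text{ in } x} \Phi^{\text{Gap}}_{(g,l)}(\alpha)$$

  The gappy kernel is the inner product in the feature space defined by this map:

  $$k^{\text{Gap}}_{(g,l)}(x, x') = \langle \Phi^{\text{Gap}}_{(g,l)}(x), \Phi^{\text{Gap}}_{(g,l)}(x') \rangle$$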
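
A brute-force sketch of the gappy feature map follows. It assumes that a g-mer contributes 1 to each distinct l-mer occurring in it as a subsequence, and the function names are invented for this example.

```python
from collections import Counter
from itertools import combinations


def gappy_features(x, g, l):
    """Each g-mer of x adds 1 to every distinct l-mer subsequence it contains."""
    feats = Counter()
    for i in range(len(x) - g + 1):
        window = x[i:i + g]
        # combinations() keeps characters in their original order,
        # so each tuple is a subsequence of the window.
        subsequences = {"".join(c) for c in combinations(window, l)}
        feats.update(subsequences)
    return feats


def gappy_kernel(x, y, g, l):
    """Inner product of the (g, l)-gappy feature vectors of x and y."""
    fx, fy = gappy_features(x, g, l), gappy_features(y, g, l)
    return sum(count * fy[beta] for beta, count in fx.items())
```

For example, with g = 3 and l = 2 the strings "ABC" and "AXC" share only the subsequence AC, so the kernel value is 1.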

Gappy Kernel


Principle
  Gappy kernel: count common l-subsequences of g-mers

Substitution Kernel


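General idea
  mismatch neighborhood → substitution neighborhood
  An l-mer $\alpha = a_1 a_2 \ldots a_l$ then contributes to all l-mers in its substitution neighborhood:

  $$M_{(l,\sigma)}(\alpha) = \Big\{ \beta = b_1 b_2 \ldots b_l \in \Sigma^l : -\sum_{i=1}^{l} \log P(a_i \mid b_i) < \sigma \Big\}$$

  For a sequence $x$ of any length:

  $$\Phi^{\text{Sub}}_{(l,\sigma)}(x) = \sum_{l\text{-mers } \alpha \text{ in } x} \Phi^{\text{Sub}}_{(l,\sigma)}(\alpha)$$

  The substitution kernel is now:

  $$k^{\text{Sub}}_{(l,\sigma)}(x, x') = \langle \Phi^{\text{Sub}}_{(l,\sigma)}(x), \Phi^{\text{Sub}}_{(l,\sigma)}(x') \rangle$$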
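
The substitution neighborhood can be sketched by enumerating all candidate l-mers and keeping those below the threshold σ. The table `P` holds made-up substitution probabilities P(a | b) on a two-letter toy alphabet; it is an assumption for illustration, whereas real applications would derive such probabilities from amino acid substitution statistics.

```python
import math
from collections import Counter
from itertools import product

# Hypothetical substitution probabilities: P[a][b] = P(a | b).
P = {
    "A": {"A": 0.8, "B": 0.1},
    "B": {"A": 0.2, "B": 0.9},
}


def substitution_neighborhood(alpha, sigma, probs=P):
    """All l-mers beta with -sum_i log P(a_i | b_i) < sigma."""
    alphabet = sorted(probs)
    return {
        "".join(beta)
        for beta in product(alphabet, repeat=len(alpha))
        if -sum(math.log(probs[a][b]) for a, b in zip(alpha, beta)) < sigma
    }


def substitution_features(x, l, sigma, probs=P):
    """Each l-mer of x adds 1 to every l-mer in its substitution neighborhood."""
    feats = Counter()
    for i in range(len(x) - l + 1):
        feats.update(substitution_neighborhood(x[i:i + l], sigma, probs))
    return feats


def substitution_kernel(x, y, l, sigma, probs=P):
    """Inner product of the (l, sigma)-substitution feature vectors."""
    fx = substitution_features(x, l, sigma, probs)
    fy = substitution_features(y, l, sigma, probs)
    return sum(count * fy[beta] for beta, count in fx.items())
```

Unlike the mismatch neighborhood, membership here depends on which letters are exchanged, not just on how many positions differ.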

Substitution Kernel


Principle
  Substitution kernel: count common l-mers in the substitution neighborhood

Wildcard Kernels


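General idea
  Augment the alphabet $\Sigma$ by a wildcard character $*$ → $\Sigma \cup \{*\}$
  Given $\alpha$ from $\Sigma^l$ and $\beta$ from $(\Sigma \cup \{*\})^l$ with at most $m$ occurrences of $*$
  l-mer $\alpha$ contributes to l-mer $\beta$ if their non-wildcard characters match
  For a sequence $x$ of any length, the map is then given by

  $$\Phi^{\text{Wildcard}}_{(l,m,\lambda)}(x) = \sum_{l\text{-mers } \alpha \text{ in } x} (\phi_\beta(\alpha))_{\beta \in W}$$

  where $W$ is the set of l-mers over $\Sigma \cup \{*\}$ with at most $m$ wildcards, $\phi_\beta(\alpha) = \lambda^j$ if $\alpha$ matches pattern $\beta$ containing $j$ wildcards, $\phi_\beta(\alpha) = 0$ if $\alpha$ does not match $\beta$, and $0 \le \lambda \le 1$.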
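
The wildcard feature map can be sketched by generating, for every l-mer, all patterns obtained by replacing up to m positions with `*`. The function names are assumptions of this example, and the kernel is taken to be the inner product of the feature vectors, as for the other string kernels above.

```python
from collections import defaultdict
from itertools import combinations


def wildcard_features(x, l, m, lam):
    """Each l-mer adds lam**j to every pattern it matches with j wildcards."""
    feats = defaultdict(float)
    for i in range(len(x) - l + 1):
        alpha = x[i:i + l]
        for j in range(m + 1):
            for positions in combinations(range(l), j):
                beta = "".join(
                    "*" if p in positions else ch for p, ch in enumerate(alpha)
                )
                feats[beta] += lam ** j
    return feats


def wildcard_kernel(x, y, l, m, lam):
    """Inner product of the (l, m, lam)-wildcard feature vectors."""
    fx = wildcard_features(x, l, m, lam)
    fy = wildcard_features(y, l, m, lam)
    return sum(value * fy.get(beta, 0.0) for beta, value in fx.items())
```

For "AB" and "AC" with l = 2, m = 1 and λ = 0.5, the only shared pattern is `A*`, each side contributes 0.5, and the kernel value is 0.25.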

Wildcard Kernel


Principle
  Wildcard kernel: count l-mers that match except for wildcards

