Karsten Borgwardt: Data Mining in Bioinformatics, Page 1
Data Mining in Bioinformatics
Day 9: String & Text Mining in Bioinformatics
Karsten Borgwardt
March 1 to March 12, 2010
Machine Learning & Computational Biology Research Group
MPIs Tübingen

Why compare sequences?

Protein sequences
- Proteins are chains of amino acids.
- 20 different types of amino acids can be found in protein sequences.
- Protein sequences change over time through mutations, deletions, and insertions.
- Different protein sequences may diverge from one common ancestor.
- Their sequences may differ slightly, yet their function is often conserved.

Why compare sequences?

Biological Question: Biologists are interested in the reverse direction. Given two protein sequences, is it likely that they originate from the same common ancestor?

Computational Challenge: How to measure similarity between two protein sequences, or equivalently, how to measure similarity between two strings.

Kernel Challenge: How to measure similarity between two strings via a kernel function.

In short: How to define a string kernel.

History of sequence comparison

First phase
- Smith-Waterman
- BLAST

Second phase
- Profiles
- Hidden Markov Models

Third phase
- PSI-BLAST
- SAM-T98

Fourth phase
- Kernels

Sequence comparison: Phase 1

Idea
- Measure pairwise similarities between sequences with gaps

Methods
- Smith-Waterman
  - dynamic programming
  - high accuracy
  - slow ($O(n^2)$)
- BLAST
  - faster heuristic alternative with sufficient accuracy
  - searches for common substrings of fixed length
  - extends these in both directions
  - performs gapped alignment
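
The quadratic cost of Smith-Waterman comes from filling a table with one cell per pair of positions. A minimal sketch of the scoring recurrence follows. The scoring values (`match`, `mismatch`) and the linear `gap` penalty are illustrative assumptions, not values from the slides; real protein alignment would typically use a substitution matrix and affine gaps.

```python
def smith_waterman_score(s, t, match=2, mismatch=-1, gap=-1):
    """Best local alignment score of s and t (linear gap penalty).

    The scoring parameters are illustrative assumptions.
    """
    prev = [0] * (len(t) + 1)   # previous row of the DP table
    best = 0
    for i in range(1, len(s) + 1):
        curr = [0] * (len(t) + 1)
        for j in range(1, len(t) + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            curr[j] = max(
                0,                    # start a new local alignment
                prev[j - 1] + sub,    # align s[i-1] with t[j-1]
                prev[j] + gap,        # gap in t
                curr[j - 1] + gap,    # gap in s
            )
            best = max(best, curr[j])
        prev = curr
    return best
```

Only two rows are kept in memory because the sketch returns the score alone; recovering the alignment itself would need the full table and a traceback.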

Sequence comparison: Phase 2

Idea
- Collect aggregate statistics from a family of sequences
- Compare these statistics to a single unlabeled protein

Methods
- Hidden Markov Models (HMMs)
  - Markov process with hidden and observable parameters
  - The forward algorithm determines the probability that a given sequence is the output of a particular HMM
- Profiles
  - Profiles of sequence families are derived by multiple sequence alignment
  - A given sequence is compared to this profile
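
As an illustration of the forward algorithm mentioned above, here is a minimal sketch for a discrete HMM. The two-state model, its state names, and all probabilities are made-up toy values, and the dictionary layout is an assumption chosen for readability. No rescaling is done, so long sequences would underflow.

```python
def forward_probability(obs, states, start_p, trans_p, emit_p):
    """P(obs | HMM), summed over all hidden state paths."""
    if not obs:
        return 1.0
    # alpha[s] = P(observations so far, current hidden state = s)
    alpha = {s: start_p[s] * emit_p[s][obs[0]] for s in states}
    for symbol in obs[1:]:
        alpha = {
            s: emit_p[s][symbol] * sum(alpha[r] * trans_p[r][s] for r in states)
            for s in states
        }
    return sum(alpha.values())


# Toy model (hypothetical values): two hidden states over the alphabet {A, B}.
STATES = ("M", "I")
START_P = {"M": 0.6, "I": 0.4}
TRANS_P = {"M": {"M": 0.7, "I": 0.3}, "I": {"M": 0.4, "I": 0.6}}
EMIT_P = {"M": {"A": 0.9, "B": 0.1}, "I": {"A": 0.2, "B": 0.8}}

p = forward_probability("AAB", STATES, START_P, TRANS_P, EMIT_P)
```

A protein-family HMM would use the 20 amino acids as emission alphabet and match, insert, and delete states; the recursion stays the same.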

Sequence comparison: Phase 3

Idea
- Create single models from database collections of homologous sequences

Methods
- PSI-BLAST
  - Position-specific iterative BLAST
  - Profile from highest-scoring hits in initial BLAST runs
  - Position weighting according to degree of conservation
  - Iteration of these steps
- SAM-T98, now SAM-T02
  - database search with HMM from multiple sequence alignment

Phase 4: Kernels and SVMs

General idea
- Model differences between classes of sequences
- Use SVM classifier to distinguish classes
- Use kernel to measure similarity between strings

Kernels for protein sequences
- SVM-Fisher kernel
- Composition kernel
- Motif kernel
- String kernel

SVM-Fisher method

General idea
- Combine HMMs and SVMs for sequence classification
- Won best-paper award at ISMB 1999

Sequence representation
- fixed-length vector
- components are transition and emission probabilities
- transformation into Fisher score
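
The slide does not spell out the Fisher score. As background, and not taken from the slides, the usual definition is the gradient of the log-likelihood of a sequence $X$ under the family HMM with parameters $\theta$. One common kernel on these score vectors weights the inner product by the inverse Fisher information matrix $I$:

$$U_X = \nabla_\theta \log P(X \mid \theta), \qquad k(X, X') = U_X^\top I^{-1} U_{X'}$$

The vector $U_X$ has one component per HMM parameter, which is why sequences of different lengths map to vectors of the same fixed length. The SVM-Fisher work may apply a different kernel to the score vectors (for example a Gaussian one), so the second formula should be read as one option rather than the method used.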

SVM-Fisher method

Algorithm
- Model protein family F as HMM
- Transform query protein X into fixed-length vector via HMM
- Compute kernel between X and positive and negative examples of the protein family

Advantages
- allows incorporating prior knowledge
- allows dealing with missing data
- is interpretable
- outperforms competing methods

Composition kernels

General idea
- Model sequence by amino acid content
- Bin amino acids w.r.t. physico-chemical properties

Sequence representation
- feature vector of amino acid frequencies
- physico-chemical properties include predicted secondary structure, hydrophobicity, normalized van der Waals volume, polarity, polarizability
- useful database: AAindex
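
A minimal sketch of the composition representation described above. The function names are hypothetical, and the tiny `TOY_BINS` mapping is an invented placeholder for a real physico-chemical grouping (which would come from a resource such as AAindex); the kernel is assumed to be a plain inner product of frequency vectors.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard one-letter codes


def composition_vector(seq):
    """Relative frequency of each amino acid in seq, in AMINO_ACIDS order."""
    counts = Counter(seq)
    total = sum(counts[a] for a in AMINO_ACIDS)
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [counts[a] / total for a in AMINO_ACIDS]


def binned_composition(seq, residue_to_bin, bins):
    """Frequency of each bin, given a residue -> bin mapping."""
    counts = Counter(residue_to_bin[r] for r in seq if r in residue_to_bin)
    total = sum(counts.values())
    return [counts[b] / total if total else 0.0 for b in bins]


def composition_kernel(x, y):
    """Inner product of the two amino acid frequency vectors."""
    return sum(a * b for a, b in zip(composition_vector(x), composition_vector(y)))


# Hypothetical toy binning, for illustration only (not taken from AAindex).
TOY_BINS = {"A": "small", "G": "small", "W": "large", "F": "large"}
```

For example, `binned_composition("AGW", TOY_BINS, ["small", "large"])` gives two thirds for the first bin and one third for the second.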

Motif kernels

General idea
- Conserved motifs in amino acid sequences indicate structural and functional relationships
- Model sequence s as a feature vector f representing motifs
- The i-th component of f is 1 ⇔ s contains the i-th motif

Motif databases
- PROSITE
- eMOTIFs
- BLOCKS+ combines several databases

Generated by
- manual construction
- multiple sequence alignment
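
The binary motif representation can be sketched as follows. Motifs are written here as Python regular expressions, which is an assumption about the encoding; the patterns in `TOY_MOTIFS` are invented for illustration and are not entries from PROSITE, eMOTIF, or BLOCKS+.

```python
import re


def motif_features(seq, motifs):
    """Binary vector: component i is 1 iff seq contains the i-th motif."""
    return [1 if re.search(pattern, seq) else 0 for pattern in motifs]


def motif_kernel(x, y, motifs):
    """Number of motifs that occur in both sequences."""
    fx = motif_features(x, motifs)
    fy = motif_features(y, motifs)
    return sum(a * b for a, b in zip(fx, fy))


# Hypothetical toy motifs ('.' matches any residue).
TOY_MOTIFS = ["C..C", "GK[ST]", "N[^P][ST]"]
```

With binary features the inner product simply counts shared motifs, so two sequences with no motif in common have kernel value 0.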

Pairwise comparison kernels

General idea
- Employ empirical kernel map on Smith-Waterman/BLAST scores

Advantage
- Utilizes decades of practical experience with BLAST

Disadvantage
- High computational cost ($O(m^3)$)

Alleviation
- Employ BLAST instead of Smith-Waterman
- Use vectorization set for empirical map only
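
A minimal sketch of the empirical kernel map: each sequence is represented by its vector of pairwise scores against a fixed vectorization set, and the kernel is the inner product of two such vectors. `toy_score` (length of the longest common substring) is only a stand-in so the example is self-contained; in the setting of the slide the score would come from Smith-Waterman or BLAST.

```python
def toy_score(s, t):
    """Stand-in similarity score: length of the longest common substring."""
    best = 0
    prev = [0] * (len(t) + 1)
    for a in s:
        curr = [0] * (len(t) + 1)
        for j, b in enumerate(t, start=1):
            if a == b:
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best


def empirical_feature_map(x, vectorization_set, score):
    """Represent x by its scores against every sequence in the set."""
    return [score(x, v) for v in vectorization_set]


def pairwise_kernel(x, y, vectorization_set, score):
    """Inner product of the two score vectors."""
    fx = empirical_feature_map(x, vectorization_set, score)
    fy = empirical_feature_map(y, vectorization_set, score)
    return sum(a * b for a, b in zip(fx, fy))
```

Because the kernel is an inner product of explicit feature vectors, it is positive semidefinite even when the underlying alignment score is not a valid kernel by itself.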

Phase 4: String Kernels

General idea
- Count common substrings in two strings
- A substring of length k is a k-mer

Variations
- Assign weights to k-mers
- Allow for mismatches
- Allow for gaps
- Include substitutions
- Include wildcards

Spectrum Kernel

General idea
- For each l-mer $\alpha \in \Sigma^l$, the coordinate indexed by $\alpha$ will be the number of times $\alpha$ occurs in sequence $x$.
- Then the l-spectrum feature map is

$$\Phi^{\text{Spectrum}}_l(x) = (\phi_\alpha(x))_{\alpha \in \Sigma^l}$$

- Here $\phi_\alpha(x)$ is the number of occurrences of $\alpha$ in $x$.
- The spectrum kernel is now the inner product in the feature space defined by this map:

$$k^{\text{Spectrum}}(x, x') = \langle \Phi^{\text{Spectrum}}_l(x), \Phi^{\text{Spectrum}}_l(x') \rangle$$

- Sequences are deemed the more similar, the more common substrings they contain.
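
A direct, unoptimized sketch of the spectrum kernel: count every l-mer in each sequence and take the inner product of the count vectors. Only l-mers that actually occur are stored, so the $|\Sigma|^l$-dimensional feature space is never built explicitly. The function names are illustrative.

```python
from collections import Counter


def spectrum_features(x, l):
    """Sparse l-spectrum of x: maps each l-mer to its number of occurrences."""
    return Counter(x[i:i + l] for i in range(len(x) - l + 1))


def spectrum_kernel(x, y, l):
    """Inner product of the l-spectra of x and y."""
    fx = spectrum_features(x, l)
    fy = spectrum_features(y, l)
    return sum(count * fy[alpha] for alpha, count in fx.items())
```

For `x = "ABAB"` and `y = "BABA"` with `l = 2`, the spectra are {AB: 2, BA: 1} and {BA: 2, AB: 1}, so the kernel value is 2·1 + 1·2 = 4.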

Spectrum Kernel

Principle
- Spectrum kernel: count the k-mers that both sequences share exactly

Mismatch Kernel

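General idea
- Do not enforce strictly exact matches
- Define the mismatch neighborhood of an l-mer $\alpha$ with up to $m$ mismatches:

$$\Phi^{\text{Mismatch}}_{(l,m)}(\alpha) = (\phi_\beta(\alpha))_{\beta \in \Sigma^l}$$

where $\phi_\beta(\alpha) = 1$ if $\beta$ differs from $\alpha$ in at most $m$ positions, and $\phi_\beta(\alpha) = 0$ otherwise.

- For a sequence $x$ of any length, the map is then extended as

$$\Phi^{\text{Mismatch}}_{(l,m)}(x) = \sum_{l\text{-mers } \alpha \text{ in } x} \Phi^{\text{Mismatch}}_{(l,m)}(\alpha)$$

- The mismatch kernel is now the inner product in the feature space defined by:

$$k^{\text{Mismatch}}_{(l,m)}(x, x') = \langle \Phi^{\text{Mismatch}}_{(l,m)}(x), \Phi^{\text{Mismatch}}_{(l,m)}(x') \rangle$$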
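
A brute-force sketch of the mismatch kernel, assuming each l-mer contributes 1 to every l-mer within Hamming distance $m$. It enumerates neighborhoods explicitly, which is only feasible for small `l`, `m`, and alphabets; published implementations use more efficient data structures. All names are illustrative.

```python
from collections import Counter
from itertools import combinations, product


def mismatch_neighborhood(alpha, m, alphabet):
    """All strings of len(alpha) that differ from alpha in at most m positions."""
    neighbors = {alpha}
    for k in range(1, m + 1):
        for positions in combinations(range(len(alpha)), k):
            for letters in product(alphabet, repeat=k):
                beta = list(alpha)
                for pos, letter in zip(positions, letters):
                    beta[pos] = letter
                neighbors.add("".join(beta))
    return neighbors


def mismatch_features(x, l, m, alphabet):
    """Each l-mer of x adds 1 to every l-mer in its mismatch neighborhood."""
    feats = Counter()
    for i in range(len(x) - l + 1):
        for beta in mismatch_neighborhood(x[i:i + l], m, alphabet):
            feats[beta] += 1
    return feats


def mismatch_kernel(x, y, l, m, alphabet):
    fx = mismatch_features(x, l, m, alphabet)
    fy = mismatch_features(y, l, m, alphabet)
    return sum(count * fy[beta] for beta, count in fx.items())
```

With `m = 0` the neighborhood of an l-mer is the l-mer itself, so the function reduces to the spectrum kernel.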

Mismatch Kernel

Principle
- Mismatch kernel: count common k-mers with at most m mismatches

Gappy Kernel

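General idea
- Allow for gaps in common substrings → "subsequences"
- A g-mer then contributes to all its l-mer subsequences:

$$\Phi^{\text{Gap}}_{(g,l)}(\alpha) = (\phi_\beta(\alpha))_{\beta \in \Sigma^l}$$

where $\phi_\beta(\alpha) = 1$ if the l-mer $\beta$ occurs as a (not necessarily contiguous) subsequence of the g-mer $\alpha$, and $\phi_\beta(\alpha) = 0$ otherwise.

- For a sequence $x$ of any length, the map is then extended as

$$\Phi^{\text{Gap}}_{(g,l)}(x) = \sum_{g\text{-mers } \alpha \text{ in } x} \Phi^{\text{Gap}}_{(g,l)}(\alpha)$$

- The gappy kernel is now the inner product in the feature space defined by:

$$k^{\text{Gap}}_{(g,l)}(x, x') = \langle \Phi^{\text{Gap}}_{(g,l)}(x), \Phi^{\text{Gap}}_{(g,l)}(x') \rangle$$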
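
A small sketch of the gappy kernel, assuming the unweighted variant in which a g-mer contributes 1 to each distinct l-mer that occurs in it as a subsequence. Subsequences are enumerated with `itertools.combinations`, which preserves character order. Function names are illustrative.

```python
from collections import Counter
from itertools import combinations


def gappy_features(x, g, l):
    """Each g-mer of x adds 1 to every distinct l-mer subsequence it contains."""
    feats = Counter()
    for i in range(len(x) - g + 1):
        alpha = x[i:i + g]
        # A set, so an l-mer occurring several times in one g-mer counts once.
        for beta in {"".join(chars) for chars in combinations(alpha, l)}:
            feats[beta] += 1
    return feats


def gappy_kernel(x, y, g, l):
    fx = gappy_features(x, g, l)
    fy = gappy_features(y, g, l)
    return sum(count * fy[beta] for beta, count in fx.items())
```

For `g = l` no gaps are possible and the result equals the spectrum kernel.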

Gappy Kernel

Principle
- Gappy kernel: count common l-subsequences of g-mers

Substitution Kernel

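General idea
- mismatch neighborhood → substitution neighborhood
- An l-mer $\alpha = a_1 a_2 \ldots a_l$ then contributes to all l-mers in its substitution neighborhood:

$$M_{(l,\sigma)}(\alpha) = \left\{ \beta = b_1 b_2 \ldots b_l \in \Sigma^l : -\sum_{i=1}^{l} \log P(a_i \mid b_i) < \sigma \right\}$$

- For a sequence $x$ of any length, the map is then extended as

$$\Phi^{\text{Sub}}_{(l,\sigma)}(x) = \sum_{l\text{-mers } \alpha \text{ in } x} \Phi^{\text{Sub}}_{(l,\sigma)}(\alpha)$$

- The substitution kernel is now:

$$k^{\text{Sub}}_{(l,\sigma)}(x, x') = \langle \Phi^{\text{Sub}}_{(l,\sigma)}(x), \Phi^{\text{Sub}}_{(l,\sigma)}(x') \rangle$$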
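
To make the substitution neighborhood concrete, here is a brute-force sketch over a two-letter toy alphabet. The table `TOY_PROB`, with `TOY_PROB[(a, b)]` standing for $P(a \mid b)$, contains invented values; for proteins these probabilities would be derived from a substitution matrix. The sketch assumes all probabilities are strictly positive and enumerates all of $\Sigma^l$, so it only works for tiny examples.

```python
import math
from collections import Counter
from itertools import product


def substitution_neighborhood(alpha, sigma, alphabet, cond_prob):
    """All beta with -sum_i log P(a_i | b_i) < sigma."""
    neighbors = set()
    for letters in product(alphabet, repeat=len(alpha)):
        cost = -sum(math.log(cond_prob[(a, b)]) for a, b in zip(alpha, letters))
        if cost < sigma:
            neighbors.add("".join(letters))
    return neighbors


def substitution_features(x, l, sigma, alphabet, cond_prob):
    """Each l-mer of x adds 1 to every l-mer in its substitution neighborhood."""
    feats = Counter()
    for i in range(len(x) - l + 1):
        for beta in substitution_neighborhood(x[i:i + l], sigma, alphabet, cond_prob):
            feats[beta] += 1
    return feats


def substitution_kernel(x, y, l, sigma, alphabet, cond_prob):
    fx = substitution_features(x, l, sigma, alphabet, cond_prob)
    fy = substitution_features(y, l, sigma, alphabet, cond_prob)
    return sum(count * fy[beta] for beta, count in fx.items())


# Hypothetical toy probabilities P(a | b) over the alphabet {A, B}.
TOY_PROB = {("A", "A"): 0.8, ("B", "A"): 0.2, ("A", "B"): 0.2, ("B", "B"): 0.8}
```

With these toy values and `sigma = 2.0`, the neighborhood of `"AA"` contains `"AA"`, `"AB"`, and `"BA"` but not `"BB"`, whose cost of about 3.22 exceeds the threshold.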

Substitution Kernel

Principle
- Substitution kernel: count common l-mers within the substitution neighborhood

Wildcard Kernels

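General idea
- Augment the alphabet $\Sigma$ by a wildcard character $*$ → $\Sigma \cup \{*\}$
- Given $\alpha$ from $\Sigma^l$ and $\beta$ from $(\Sigma \cup \{*\})^l$ with at most $m$ occurrences of $*$
- l-mer $\alpha$ contributes to l-mer $\beta$ if their non-wildcard characters match
- For a sequence $x$ of any length, the map is then given by

$$\Phi^{\text{Wildcard}}_{(l,m,\lambda)}(x) = \sum_{l\text{-mers } \alpha \text{ in } x} (\phi_\beta(\alpha))_{\beta \in W}$$

where $W$ is the set of such patterns $\beta$, $\phi_\beta(\alpha) = \lambda^j$ if $\alpha$ matches pattern $\beta$ containing $j$ wildcards, $\phi_\beta(\alpha) = 0$ if $\alpha$ does not match $\beta$, and $0 \le \lambda \le 1$.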
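
A minimal sketch of the wildcard feature map: every l-mer contributes $\lambda^j$ to each pattern obtained by replacing $j \le m$ of its positions with the wildcard. The sketch assumes the character `*` does not occur in the sequences themselves; function names are illustrative.

```python
from collections import defaultdict
from itertools import combinations


def wildcard_features(x, l, m, lam):
    """Each l-mer adds lam**j to every pattern with j <= m wildcards it matches."""
    feats = defaultdict(float)
    for i in range(len(x) - l + 1):
        alpha = x[i:i + l]
        for j in range(m + 1):
            for positions in combinations(range(l), j):
                beta = list(alpha)
                for pos in positions:
                    beta[pos] = "*"
                feats["".join(beta)] += lam ** j
    return feats


def wildcard_kernel(x, y, l, m, lam):
    fx = wildcard_features(x, l, m, lam)
    fy = wildcard_features(y, l, m, lam)
    return sum(value * fy[beta] for beta, value in fx.items() if beta in fy)
```

For `x = "AB"`, `y = "AC"`, `l = 2`, `m = 1`, and `lam = 0.5`, the only shared pattern is `A*` with weight 0.5 on each side, giving a kernel value of 0.25.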

Wildcard Kernel

Principle
- Wildcard kernel: count l-mers that match except for wildcards

References and further reading

[1] C. Leslie, E. Eskin, and W. S. Noble. The spectrum kernel: A string kernel for SVM protein classification. In PSB, pages 564–575, 2002.
[2] C. Leslie, E. Eskin, J. Weston, and W. S. Noble. Mismatch string kernels for SVM protein classification. In NIPS 2002. MIT Press.
[3] C. Leslie and R. Kuang. Fast kernels for inexact string matching. In COLT, 2003.
[4] B. Schölkopf, K. Tsuda, and J.-P. Vert. Kernel Methods in Computational Biology, Chapters 3 and 4. MIT Press, Cambridge, MA, 2004.