![Page 1: M ACHINE L EARNING FOR P ROTEIN C LASSIFICATION : K ERNEL M ETHODS CS 374 Rajesh Ranganath 4/10/2008.](https://reader037.fdocuments.in/reader037/viewer/2022110100/56649dca5503460f94ac07cf/html5/thumbnails/1.jpg)
MACHINE LEARNING FOR PROTEIN CLASSIFICATION: KERNEL METHODS
CS 374
Rajesh Ranganath
4/10/2008
![Page 2: M ACHINE L EARNING FOR P ROTEIN C LASSIFICATION : K ERNEL M ETHODS CS 374 Rajesh Ranganath 4/10/2008.](https://reader037.fdocuments.in/reader037/viewer/2022110100/56649dca5503460f94ac07cf/html5/thumbnails/2.jpg)
OUTLINE
Biological Motivation and Background
Algorithmic Concepts
Mismatch Kernels
Semi-supervised Methods
![Page 3: M ACHINE L EARNING FOR P ROTEIN C LASSIFICATION : K ERNEL M ETHODS CS 374 Rajesh Ranganath 4/10/2008.](https://reader037.fdocuments.in/reader037/viewer/2022110100/56649dca5503460f94ac07cf/html5/thumbnails/3.jpg)
PROTEINS
![Page 4: M ACHINE L EARNING FOR P ROTEIN C LASSIFICATION : K ERNEL M ETHODS CS 374 Rajesh Ranganath 4/10/2008.](https://reader037.fdocuments.in/reader037/viewer/2022110100/56649dca5503460f94ac07cf/html5/thumbnails/4.jpg)
THE PROTEIN PROBLEM
Primary structure can be easily determined
3D structure determines function
Grouping proteins into structural and evolutionary families is difficult
Use machine learning to group proteins
![Page 5: M ACHINE L EARNING FOR P ROTEIN C LASSIFICATION : K ERNEL M ETHODS CS 374 Rajesh Ranganath 4/10/2008.](https://reader037.fdocuments.in/reader037/viewer/2022110100/56649dca5503460f94ac07cf/html5/thumbnails/5.jpg)
HOW TO LOOK AT AMINO ACID CHAINS
Smith-Waterman Idea
Mismatch Idea
![Page 6: M ACHINE L EARNING FOR P ROTEIN C LASSIFICATION : K ERNEL M ETHODS CS 374 Rajesh Ranganath 4/10/2008.](https://reader037.fdocuments.in/reader037/viewer/2022110100/56649dca5503460f94ac07cf/html5/thumbnails/6.jpg)
FAMILIES
Proteins whose evolutionary relationship is readily recognizable from the sequence (>~25% sequence identity)
Families are further subdivided into Proteins
Proteins are divided into Species
The same protein may be found in several species
[Figure: protein hierarchy with levels Fold, Superfamily, Family, Proteins. Credit: Morten Nielsen, CBS, BioCentrum, DTU]
![Page 7: M ACHINE L EARNING FOR P ROTEIN C LASSIFICATION : K ERNEL M ETHODS CS 374 Rajesh Ranganath 4/10/2008.](https://reader037.fdocuments.in/reader037/viewer/2022110100/56649dca5503460f94ac07cf/html5/thumbnails/7.jpg)
SUPERFAMILIES
Proteins that are (remotely) evolutionarily related
Sequence similarity low
Share function
Share special structural features
Relationships between members of a superfamily may not be readily recognizable from the sequence alone
[Figure: protein hierarchy with levels Fold, Superfamily, Family, Proteins. Credit: Morten Nielsen, CBS, BioCentrum, DTU]
![Page 8: M ACHINE L EARNING FOR P ROTEIN C LASSIFICATION : K ERNEL M ETHODS CS 374 Rajesh Ranganath 4/10/2008.](https://reader037.fdocuments.in/reader037/viewer/2022110100/56649dca5503460f94ac07cf/html5/thumbnails/8.jpg)
FOLDS
Proteins that have >~50% of their secondary structure elements arranged in the same order in the protein chain and in three dimensions are classified as having the same fold
Proteins with the same fold need not be evolutionarily related
[Figure: protein hierarchy with levels Fold, Superfamily, Family, Proteins. Credit: Morten Nielsen, CBS, BioCentrum, DTU]
![Page 9: M ACHINE L EARNING FOR P ROTEIN C LASSIFICATION : K ERNEL M ETHODS CS 374 Rajesh Ranganath 4/10/2008.](https://reader037.fdocuments.in/reader037/viewer/2022110100/56649dca5503460f94ac07cf/html5/thumbnails/9.jpg)
PROTEIN CLASSIFICATION
Given a new protein, can we place it in its “correct” position within an existing protein hierarchy?
Methods
BLAST / PsiBLAST
Profile HMMs
Supervised Machine Learning methods
[Figure: protein hierarchy (Fold, Superfamily, Family, Proteins) with a new protein whose position is unknown]
![Page 10: M ACHINE L EARNING FOR P ROTEIN C LASSIFICATION : K ERNEL M ETHODS CS 374 Rajesh Ranganath 4/10/2008.](https://reader037.fdocuments.in/reader037/viewer/2022110100/56649dca5503460f94ac07cf/html5/thumbnails/10.jpg)
MACHINE LEARNING CONCEPTS
Supervised Methods
Discriminative vs. Generative Models
Transductive Learning
Support Vector Machines
Kernel Methods
Semi-supervised Methods
![Page 11: M ACHINE L EARNING FOR P ROTEIN C LASSIFICATION : K ERNEL M ETHODS CS 374 Rajesh Ranganath 4/10/2008.](https://reader037.fdocuments.in/reader037/viewer/2022110100/56649dca5503460f94ac07cf/html5/thumbnails/11.jpg)
DISCRIMINATIVE AND GENERATIVE MODELS
Discriminative Generative
![Page 12: M ACHINE L EARNING FOR P ROTEIN C LASSIFICATION : K ERNEL M ETHODS CS 374 Rajesh Ranganath 4/10/2008.](https://reader037.fdocuments.in/reader037/viewer/2022110100/56649dca5503460f94ac07cf/html5/thumbnails/12.jpg)
TRANSDUCTIVE LEARNING
Most learning is inductive
Given $(x_1, y_1), \ldots, (x_m, y_m)$, for any test input $x^*$ predict the label $y^*$
Transductive learning
Given $(x_1, y_1), \ldots, (x_m, y_m)$ and all the test inputs $\{x_1^*, \ldots, x_p^*\}$, predict the labels $\{y_1^*, \ldots, y_p^*\}$
![Page 13: M ACHINE L EARNING FOR P ROTEIN C LASSIFICATION : K ERNEL M ETHODS CS 374 Rajesh Ranganath 4/10/2008.](https://reader037.fdocuments.in/reader037/viewer/2022110100/56649dca5503460f94ac07cf/html5/thumbnails/13.jpg)
SUPPORT VECTOR MACHINES
Popular discriminative learning algorithm
Optimal geometric margin classifier
Can be solved efficiently using the Sequential Minimal Optimization (SMO) algorithm
If $x_1, \ldots, x_n$ are the training examples, $\operatorname{sign}\left(\sum_i \alpha_i x_i^T x\right)$ “decides” where $x$ falls
Train the $\alpha_i$ to achieve the best margin
![Page 14: M ACHINE L EARNING FOR P ROTEIN C LASSIFICATION : K ERNEL M ETHODS CS 374 Rajesh Ranganath 4/10/2008.](https://reader037.fdocuments.in/reader037/viewer/2022110100/56649dca5503460f94ac07cf/html5/thumbnails/14.jpg)
SUPPORT VECTOR MACHINES (2)
Kernelizable: the SVM solution can be written down entirely in terms of dot products of the inputs.
$\operatorname{sign}\left(\sum_i \alpha_i K(x_i, x)\right)$ determines the class of $x$
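As a minimal sketch of the kernelized decision rule above, the following Python takes the sign of a weighted sum of kernel values between the training examples and a new input. The training points, the coefficient values, and the names `linear_kernel` and `decision` are illustrative assumptions. In practice the coefficients would come from training (for example with SMO), which is not shown here.

```python
def linear_kernel(u, v):
    """Plain dot product; any valid kernel function could be substituted."""
    return sum(a * b for a, b in zip(u, v))


def decision(coeffs, train_x, x, kernel=linear_kernel):
    """Return +1 or -1 from the sign of sum_i coeffs[i] * K(train_x[i], x)."""
    score = sum(c * kernel(xi, x) for c, xi in zip(coeffs, train_x))
    return 1 if score >= 0 else -1


# Hypothetical, hand-picked values (not learned): signed weights on two points.
train_x = [(1.0, 1.0), (-1.0, -1.0)]
coeffs = [0.5, -0.5]

print(decision(coeffs, train_x, (2.0, 0.5)))    # falls on the positive side
print(decision(coeffs, train_x, (-0.3, -2.0)))  # falls on the negative side
```

Because the input only enters through `kernel`, swapping in a different kernel changes the classifier without touching the decision rule.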
![Page 15: M ACHINE L EARNING FOR P ROTEIN C LASSIFICATION : K ERNEL M ETHODS CS 374 Rajesh Ranganath 4/10/2008.](https://reader037.fdocuments.in/reader037/viewer/2022110100/56649dca5503460f94ac07cf/html5/thumbnails/15.jpg)
KERNEL METHODS
$K(x, z) = \phi(x)^T \phi(z)$
$\phi$ is the feature mapping
$x$ and $z$ are input vectors
High-dimensional features do not need to be explicitly calculated
Think of the kernel function as a similarity measure between $x$ and $z$
Example:
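One standard illustration (not necessarily the example shown on the original slide) is the quadratic kernel, the squared dot product of two vectors. Its feature mapping is the list of all pairwise products of the input coordinates, so the explicit feature space has d² dimensions for a d-dimensional input, while the kernel needs only a single dot product. The sketch below checks numerically that both routes agree; the function names are hypothetical.

```python
def quadratic_kernel(x, z):
    """Kernel value computed directly: (x . z) squared."""
    dot = sum(a * b for a, b in zip(x, z))
    return dot ** 2


def explicit_features(x):
    """Explicit feature mapping: all pairwise products x_i * x_j."""
    return [a * b for a in x for b in x]


def explicit_kernel(x, z):
    """Same kernel value, computed as a dot product in feature space."""
    fx, fz = explicit_features(x), explicit_features(z)
    return sum(p * q for p, q in zip(fx, fz))


x = [1.0, 2.0, 3.0]
z = [4.0, 5.0, 6.0]
print(quadratic_kernel(x, z), explicit_kernel(x, z))  # both routes give 1024.0
```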
![Page 16: M ACHINE L EARNING FOR P ROTEIN C LASSIFICATION : K ERNEL M ETHODS CS 374 Rajesh Ranganath 4/10/2008.](https://reader037.fdocuments.in/reader037/viewer/2022110100/56649dca5503460f94ac07cf/html5/thumbnails/16.jpg)
MISMATCH KERNEL
Regions of similar amino acid sequences yield a similar tertiary structure of proteins
Used as a kernel for an SVM to identify protein homologies
![Page 17: M ACHINE L EARNING FOR P ROTEIN C LASSIFICATION : K ERNEL M ETHODS CS 374 Rajesh Ranganath 4/10/2008.](https://reader037.fdocuments.in/reader037/viewer/2022110100/56649dca5503460f94ac07cf/html5/thumbnails/17.jpg)
K-MER BASED SVMS
For a given word size k and mismatch tolerance l, define
$K(X, Y)$ = number of distinct k-long word occurrences with ≤ l mismatches
Define the normalized mismatch kernel $K'(X, Y) = K(X, Y) / \sqrt{K(X, X)\,K(Y, Y)}$
An SVM can be learned by supplying this kernel function
Example: let k = 3, l = 1
X = A B A C A R D I
Y = A B R A D A B I
$K(X, Y) = 4$
$K'(X, Y) = 4/\sqrt{7 \cdot 7} = 4/7$
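The numbers in this example can be reproduced with a short Python sketch under one reading of the informal definition: count pairs of k-mer occurrences, one from each sequence, that differ in at most l positions. For a sequence compared with itself, the sketch counts each unordered pair once, which is the convention that yields the value 7 used above. This reading and the function names are assumptions. The mismatch kernel in the Leslie et al. reference is defined through a feature map over all possible k-mers rather than through this pairwise count.

```python
from math import sqrt


def kmers(seq, k):
    """All overlapping k-long words of seq, in order of occurrence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def hamming(a, b):
    """Number of positions at which two equal-length words differ."""
    return sum(c1 != c2 for c1, c2 in zip(a, b))


def pair_count(x, y, k, l):
    """Pairs (one k-mer occurrence from x, one from y) with <= l mismatches."""
    return sum(1 for a in kmers(x, k) for b in kmers(y, k) if hamming(a, b) <= l)


def self_count(x, k, l):
    """Unordered pairs of k-mer occurrences within x, self-pairs included."""
    w = kmers(x, k)
    return sum(1 for i in range(len(w)) for j in range(i, len(w))
               if hamming(w[i], w[j]) <= l)


def normalized(x, y, k, l):
    """Pair count scaled by the geometric mean of the two self-counts."""
    return pair_count(x, y, k, l) / sqrt(self_count(x, k, l) * self_count(y, k, l))


X, Y = "ABACARDI", "ABRADABI"
print(pair_count(X, Y, 3, 1))                    # 4
print(self_count(X, 3, 1), self_count(Y, 3, 1))  # 7 7
print(normalized(X, Y, 3, 1))                    # 4/7
```

The four matching pairs are ABA/ABR, ABA/ADA, ABA/ABI, and ACA/ADA.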
![Page 18: M ACHINE L EARNING FOR P ROTEIN C LASSIFICATION : K ERNEL M ETHODS CS 374 Rajesh Ranganath 4/10/2008.](https://reader037.fdocuments.in/reader037/viewer/2022110100/56649dca5503460f94ac07cf/html5/thumbnails/18.jpg)
DISADVANTAGES
Determining the 3D structure of proteins is practically impossible
Primary sequences are cheap to determine
How do we use all this unlabeled data?
Use semi-supervised learning based on the cluster assumption
![Page 19: M ACHINE L EARNING FOR P ROTEIN C LASSIFICATION : K ERNEL M ETHODS CS 374 Rajesh Ranganath 4/10/2008.](https://reader037.fdocuments.in/reader037/viewer/2022110100/56649dca5503460f94ac07cf/html5/thumbnails/19.jpg)
SEMI-SUPERVISED METHODS
• Some examples are labeled
• Assume labels vary smoothly among all examples
![Page 20: M ACHINE L EARNING FOR P ROTEIN C LASSIFICATION : K ERNEL M ETHODS CS 374 Rajesh Ranganath 4/10/2008.](https://reader037.fdocuments.in/reader037/viewer/2022110100/56649dca5503460f94ac07cf/html5/thumbnails/20.jpg)
SEMI-SUPERVISED METHODS
• SVMs and other discriminative methods may make significant mistakes due to lack of data
![Page 24: M ACHINE L EARNING FOR P ROTEIN C LASSIFICATION : K ERNEL M ETHODS CS 374 Rajesh Ranganath 4/10/2008.](https://reader037.fdocuments.in/reader037/viewer/2022110100/56649dca5503460f94ac07cf/html5/thumbnails/24.jpg)
SEMI-SUPERVISED METHODS
• Attempt to “contract” the distances within each cluster while keeping intercluster distances larger
![Page 26: M ACHINE L EARNING FOR P ROTEIN C LASSIFICATION : K ERNEL M ETHODS CS 374 Rajesh Ranganath 4/10/2008.](https://reader037.fdocuments.in/reader037/viewer/2022110100/56649dca5503460f94ac07cf/html5/thumbnails/26.jpg)
CLUSTER KERNELS
Semi-supervised methods
1. Neighborhood
   1. For each X, run PSI-BLAST to get similar sequences Nbd(X)
   2. Define $\Phi_{nbd}(X) = \frac{1}{|\mathrm{Nbd}(X)|} \sum_{X' \in \mathrm{Nbd}(X)} \Phi_{original}(X')$
      “Counts of all k-mers matching with at most 1 difference in all sequences that are similar to X”
   3. $K_{nbd}(X, Y) = \frac{1}{|\mathrm{Nbd}(X)| \cdot |\mathrm{Nbd}(Y)|} \sum_{X' \in \mathrm{Nbd}(X)} \sum_{Y' \in \mathrm{Nbd}(Y)} K(X', Y')$
2. Next: bagged mismatch
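The neighborhood kernel can be sketched in a few lines of Python: average a base kernel over every pair drawn from the two neighborhoods. Several things here are assumptions for illustration only. The `neighbors` dictionary is written by hand in place of real PSI-BLAST output, each neighborhood is assumed to include the sequence itself, and the base kernel is a simple exact-match 2-mer count standing in for the mismatch kernel.

```python
from collections import Counter


def spectrum_kernel(a, b, k=2):
    """Stand-in base kernel: dot product of exact k-mer count vectors."""
    ca = Counter(a[i:i + k] for i in range(len(a) - k + 1))
    cb = Counter(b[i:i + k] for i in range(len(b) - k + 1))
    return sum(ca[w] * cb[w] for w in ca)


def neighborhood_kernel(x, y, neighbors, base_kernel=spectrum_kernel):
    """Average of base_kernel over all pairs from Nbd(x) and Nbd(y)."""
    nx, ny = neighbors[x], neighbors[y]
    total = sum(base_kernel(a, b) for a in nx for b in ny)
    return total / (len(nx) * len(ny))


# Hypothetical neighborhoods; a real pipeline would obtain these from PSI-BLAST.
neighbors = {
    "ABAB": ["ABAB", "ABBA"],
    "BBBB": ["BBBB", "ABBA"],
}

print(spectrum_kernel("ABAB", "BBBB"))                 # 0: no shared 2-mers
print(neighborhood_kernel("ABAB", "BBBB", neighbors))  # 2.25
```

In this toy case the two sequences share no 2-mers directly, yet the neighborhood kernel is positive because both have a neighbor in common. That is the intended effect of bringing in unlabeled sequences.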
![Page 27: M ACHINE L EARNING FOR P ROTEIN C LASSIFICATION : K ERNEL M ETHODS CS 374 Rajesh Ranganath 4/10/2008.](https://reader037.fdocuments.in/reader037/viewer/2022110100/56649dca5503460f94ac07cf/html5/thumbnails/27.jpg)
BAGGED MISMATCHED KERNEL
Final method
1. Bagged mismatch
   1. Run k-means clustering n times, giving p = 1,…,n assignments $c_p(X)$
   2. For every X and Y, count up the fraction of times they are bagged together
      $K_{bag}(X, Y) = \frac{1}{n} \sum_{p=1}^{n} \mathbf{1}\left(c_p(X) = c_p(Y)\right)$
   3. Combine the “bag fraction” with the original comparison K(.,.)
      $K_{new}(X, Y) = K_{bag}(X, Y) \cdot K(X, Y)$
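The bagging steps can be sketched in Python as follows. To keep the example self-contained, the cluster assignments are written by hand rather than produced by actually running k-means, and the base kernel matrix holds made-up values. Both, along with the function names, are assumptions for illustration.

```python
def bagged_kernel(i, j, assignments):
    """Fraction of clustering runs in which items i and j share a cluster."""
    together = sum(1 for run in assignments if run[i] == run[j])
    return together / len(assignments)


def combined_kernel(i, j, assignments, base):
    """Bag fraction multiplied by the original kernel value."""
    return bagged_kernel(i, j, assignments) * base[i][j]


# Hypothetical output of n = 4 clustering runs over 3 sequences:
# assignments[p][i] is the cluster label of sequence i in run p.
assignments = [
    [0, 0, 1],
    [0, 1, 1],
    [1, 1, 0],
    [0, 0, 1],
]

# Hypothetical base kernel matrix (for example, normalized mismatch values).
base = [
    [1.0, 0.8, 0.2],
    [0.8, 1.0, 0.4],
    [0.2, 0.4, 1.0],
]

print(bagged_kernel(0, 1, assignments))  # 0.75: together in 3 of 4 runs
print(bagged_kernel(0, 2, assignments))  # 0.0: never clustered together
print(combined_kernel(0, 1, assignments, base))
```

Pairs that never share a cluster get a combined value of zero regardless of their base similarity, which is how the clustering of unlabeled data reshapes the original kernel.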
![Page 28: M ACHINE L EARNING FOR P ROTEIN C LASSIFICATION : K ERNEL M ETHODS CS 374 Rajesh Ranganath 4/10/2008.](https://reader037.fdocuments.in/reader037/viewer/2022110100/56649dca5503460f94ac07cf/html5/thumbnails/28.jpg)
O. Jangmin
![Page 29: M ACHINE L EARNING FOR P ROTEIN C LASSIFICATION : K ERNEL M ETHODS CS 374 Rajesh Ranganath 4/10/2008.](https://reader037.fdocuments.in/reader037/viewer/2022110100/56649dca5503460f94ac07cf/html5/thumbnails/29.jpg)
WHAT WORKS BEST?
Transductive Setting
![Page 30: M ACHINE L EARNING FOR P ROTEIN C LASSIFICATION : K ERNEL M ETHODS CS 374 Rajesh Ranganath 4/10/2008.](https://reader037.fdocuments.in/reader037/viewer/2022110100/56649dca5503460f94ac07cf/html5/thumbnails/30.jpg)
REFERENCES
C. Leslie et al. Mismatch string kernels for discriminative protein classification. Bioinformatics Advance Access. January 22, 2004.
J. Weston et al. Semi-supervised protein classification using cluster kernels. 2003.
Images pulled under wikiCommons