7/27/2019 whelanch_rpe
1/23
DESIGNING PEPTIDES WITH WEIGHTED FINITE-STATE TRANSDUCERS
CHRISTOPHER WHELAN
OREGON HEALTH & SCIENCE UNIVERSITY
Abstract. The discovery of novel peptides with useful capabilities or characteristics could lead to significant advances in fields such as materials science, nanotechnology, and medicine. However, the large size of the sequence search space, combined with the time required to experimentally test or simulate peptide behavior at the molecular level, makes statistical computational approaches attractive. We present a novel method for designing peptides based on sequence analysis, and apply it to two problems in peptide design: inorganic binding peptides and antimicrobial peptides. Peptides with the ability to bind to inorganic materials have many potential applications including medical devices, nanotechnology, and bone and tooth regeneration. Antimicrobial peptides have attracted attention as a potential source of therapeutic agents due to the rise of microbes resistant to traditional antibiotics. To design these peptides, we train a support vector machine classifier that discriminates between positive and negative sequences based on the counts of n-grams of amino acid chemical classes. Using the model learned by the classifier, we then build weighted finite-state transducers that we can sample or search for novel sequences sharing the characteristics of the positive training examples. We used this framework to produce a set of putative inorganic binding peptides, which we are testing experimentally. We also generated novel antimicrobial peptide sequences and used third-party prediction services to validate them, with strong initial results. We believe that our framework is flexible and generally applicable to many problems in peptide design.
1. INTRODUCTION
1.1. Peptide Design Applications. Designed peptides have potential uses
in a wide variety of applications, from medicine to materials science and nan-
otechnology. However, the design of new peptides is made difficult by the
large search space and incomplete chemical knowledge. Each position in a
peptide sequence can potentially hold any of the 20 standard amino acid
residues. Therefore, the search space grows exponentially as the length of
the sequence increases, making an exhaustive search impossible; for a sequence
of length 30, there are $20^{30} \approx 10^{39}$ possible sequences. This
makes it difficult to search for novel peptides with processes that involve ex-
perimental testing or computationally expensive methods such as molecular
dynamics simulations. In addition, it is still impossible to accurately predict
the complete structure of a peptide given its sequence. In applications in
which researchers do not have an experimentally verified three dimensional
structure of a known peptide to use as a template, statistical and machine
learning approaches could be very useful in the peptide design process. We
have built such a system for designing novel peptides and applied it to two
problems in peptide design: inorganic binding and antimicrobial peptides.
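
The size of the search space quoted above can be checked with a few lines of Python. This is only an illustrative calculation; the function name is ours and does not come from any library.

```python
import math


def search_space_size(length: int, alphabet_size: int = 20) -> int:
    """Number of distinct sequences of a given length over the alphabet."""
    return alphabet_size ** length


# For peptides of length 30 over the 20 standard amino acids:
size = search_space_size(30)
order_of_magnitude = math.log10(size)  # slightly above 39, i.e. about 10^39
```

Python integers have arbitrary precision, so the exact count is available even though it far exceeds 64-bit range.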
1.2. Inorganic Binding Peptides. Peptides that are capable of binding
inorganic materials such as metal or quartz have potential uses in a large
number of applications. Many organisms found in nature incorporate inorganic
materials into tissues such as bone, teeth, and shells. To build
these structures, they use proteins and enzymes that direct their assembly at
a nanoscale level [1]. Understanding and reverse engineering these processes
could lead the way to many advances in engineering and medicine. For ex-
ample, bone and tooth regeneration might be possible if we could recreate
the chemical machinery that directs their formation. Other applications
include the construction of nanoscale electronic and photonic devices. A
first step towards these goals is the discovery and understanding of peptides
capable of binding to inorganic substances. High throughput combinatorial
techniques such as phage display [2], in which a large number of bacterio-
phages are induced to express mutated peptides on their exteriors and then
screened for ability to bind to a surface, have begun to provide data sets that
can be used for statistical analysis of inorganic binding peptide sequences.
1.3. Antimicrobial Peptides. Antimicrobial peptides (AMPs) have attracted
considerable attention as a potential new source of therapeutic agents
effective against microorganisms that have developed resistance to
traditional drugs [3]. Researchers have recently published several databases
of AMPs, such as APD2 [4] and CAMP [5], allowing the creation of large data
sets for the computational analysis of AMPs. These data sets can be used to
learn patterns in AMP sequences, which researchers can then use to design
novel AMPs. For example, Wang et al. [4] used the most common motifs in the
APD2 database to hand-design a peptide that exhibited strong activity against
E. coli.
1.4. Overview of the Paper. In Section 2, I give an overview of the nec-
essary components of a computational peptide design system, summarize
previous approaches, and introduce our approach. In Section 3, I discuss
the details of our approach, and our application of it to both the inorganic
binding peptide and antimicrobial peptide design problems. Section 4 con-
tains preliminary results, including an assessment of the performance of the
classifier used in our system, a description of the planned experimental
verification of our generated inorganic binding peptides, and an assessment of
designed AMP sequences using third-party computational prediction servers.
I also examine the features identified by our models as most important for
both design problems. In Section 5, I summarize our contribution and dis-
cuss future improvements.
2. PEPTIDE DESIGN APPROACHES
2.1. Introduction to Peptide Design Solutions. Computational pep-
tide design approaches must have three components: a method for gener-
ating candidate sequences, a feature set to use to characterize sequences,
and a method for scoring candidate sequences. The first component must
produce a set of novel sequences, either randomly or using a probabilistic
or rule-based system based on a model of the desired class of peptides.
The system must then extract a set of features for each sequence for use in
the final component, which is a scoring method that has been trained on
known sequences. The complete system generates novel sequences, extracts
their features, and scores them. Researchers then select the highest scor-
ing candidates for experimental validation. Some approaches then use these
experimental results to iteratively refine either their scoring model or their
sequence generation model.
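
The three-component view described above can be sketched as a small pipeline. This is a minimal sketch, not any of the systems discussed in this paper: the uniform random generator, the composition features, the linear scorer, and all function names are placeholders we introduce for illustration.

```python
import random
from typing import Callable, Dict, Iterable, List, Tuple

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def generate_random(n: int, length: int, rng: random.Random) -> List[str]:
    """Component 1 (placeholder): uniformly random candidate sequences."""
    return ["".join(rng.choice(AMINO_ACIDS) for _ in range(length))
            for _ in range(n)]


def composition_features(seq: str) -> Dict[str, int]:
    """Component 2 (placeholder): amino acid composition counts."""
    counts: Dict[str, int] = {}
    for aa in seq:
        counts[aa] = counts.get(aa, 0) + 1
    return counts


def linear_score(features: Dict[str, int], weights: Dict[str, float]) -> float:
    """Component 3 (placeholder): weighted sum of feature counts."""
    return sum(weights.get(name, 0.0) * value
               for name, value in features.items())


def design(candidates: Iterable[str],
           featurize: Callable[[str], Dict[str, int]],
           score: Callable[[Dict[str, int]], float],
           top_k: int) -> List[Tuple[float, str]]:
    """Score every candidate and return the top_k (score, sequence) pairs."""
    scored = [(score(featurize(seq)), seq) for seq in candidates]
    scored.sort(reverse=True)
    return scored[:top_k]
```

Because each component is passed in as a function, a different generator or scorer can be swapped in without touching the others, which is the modularity argued for in this section.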
2.2. Past approaches. Oren et al. [6] used a computational method to
search for novel peptides that bind to inorganic materials. For candidate
generation, they randomly generated 1,000,000 peptide sequences. Using a
clustering approach, they scored candidates based on their distances from
sets of known strong and weak binders, as measured by global pairwise
alignments. They trained their scorer, in this case the substitution matrix
used in the alignments, using a greedy stochastic search. Because the score
was based on sequence alignment, the feature set used in this approach was
the raw amino acid sequence. The model successfully designed several strong
and weak binding peptides.
In another peptide design approach, a group of researchers at the Uni-
versity of British Columbia have built a scoring method for novel AMPs
using artificial neural networks [7, 8]. The authors created feature sets
using two-dimensional quantitative structure-activity relationship (QSAR)
descriptors. QSAR descriptors quantify the chemical structure of a peptide
based on the properties of amino acids in the sequence. To generate
novel sequences for scoring, the authors sampled probability distributions
that modeled the likelihood of a given amino acid appearing at a particular
location in the sequence.
Loose et al. [9] generated novel AMPs using a linguistically inspired approach.
To generate sequences, they built a set of regular grammars based on
a database of known AMPs, and then exhaustively enumerated the language
defined by those grammars. They then clustered candidate sequences and
gave representatives from each cluster a score based on the number of known
AMPs that shared a rule in the grammar with the candidate sequence.
Socolich et al. [10] built a scoring model for WW domains, which are
proteins with a particular type of fold, based on coupling constraints
between positions in sequences aligned using multiple sequence alignment
(MSA). For an example of a coupling constraint, consider an MSA in which
the amino acid in a given position is always polar when the acid in another
fixed position is non-polar. The authors used a simulated annealing pro-
cedure to find sequences which minimized the difference between coupling
constraints in the test set and the novel sequences, and experimentally ver-
ified several generated peptides. Thomas et al. [11] used the same feature
set for this problem, but designed a more complex generation method. The
authors trained a probabilistic graphical model (PGM) to learn the AA cou-
pling constraints, which they encoded as edges in the model. To generate
sequences from their PGM, the authors developed two sampling methods.
They scored their sampled sequences based on their log likelihoods in the
PGM.
The approaches described above are summarized in Table 1. This sum-
mary demonstrates that it is possible to characterize peptide design ap-
proaches based on their sequence generation method, scoring method, and
chosen feature sets. Analyzing approaches in this way may help to differen-
tiate the strengths and shortcomings of various approaches, and encourages
a modular view of the problem. It may be possible in the future to combine
candidate generation methods with scoring methods based on their results
Target                 Generation Method    Scoring Method       Feature Set
Inorganic binding [6]  Random               Alignment score      Sequence
AMPs [7]               AA distribution      ANN                  QSAR
AMPs [8]               AA distribution      ANN                  QSAR
AMPs [9]               Regular grammars     Grammar Matching     Sequence
WW Domains [10]        Simulated Annealing  Residue Correlation  MSA coupling
WW Domains [11]        Sampling of PGM      PGM Log likelihood   MSA coupling
Table 1. Selection of recent approaches for computational peptide design. Abbreviations are AMP: Antimicrobial peptide; AA: amino acid; ANN: Artificial neural networks; QSAR: Quantitative structure-activity relationship; MSA: Multiple Sequence Alignment; PGM: Probabilistic graphical model.
in similar peptide application domains. In the next section, we describe our
approach, which uses a feature set and scoring method designed to take
advantage of recent machine learning techniques for biological sequences,
and incorporates a novel generation method that we believe will be more
efficient at suggesting new peptide sequence candidates than the methods
described above.
2.3. Overview of Proposed Solution. We propose a method for de novo
peptide design based on learning from n-gram counts of classes of amino
acid residues, and then using weighted finite-state transducers to produce
sequences that include those features that are strongly associated with the
desired class of peptides. Feature mappings based on n-gram counts are
analogous to representations of sequences in the frequency domain [12], and
have been used successfully in tasks such as protein remote homology de-
tection [13], the problem of detecting similarities between proteins from
different organisms. In our application, we attempt to learn a set of weights
that describe how each n-gram feature is associated with the target peptide
class, and then use those weights to generate new sequences. The latter task
is made difficult, however, by the fact that most vectors of n-gram feature
counts do not represent valid peptide sequences. For example, a feature vec-
tor that has a positive count for a trigram feature must also have positive
counts for the bigrams and unigrams contained within the trigram; other-
wise, it cannot represent a valid sequence. If the weight associated with the
trigram feature is high, but the weight associated with the bigram features
is low or negative, one cannot simply increase the count of the trigram fea-
ture while decreasing the count of the bigram features. Weighted finite-state
transducers (WFSTs) are a potential solution for this problem in sequence
design. Commonly used in speech recognition and natural language process-
ing [14, 15], they can be used to build a weighted lattice of sequences that
can be searched or sampled using efficient algorithms to yield sequences rich
in a desired set of n-gram features, while still being valid sequences.
3. METHODS
3.1. Construction of Inorganic Binding Peptides Data Set. In our
initial training phase, we used the 39 training examples from Oren et al.
[6] to train our scoring model. All sequences were of length 12. Oren et
al. characterized these sequences according to their quartz binding affinity
as either strong (10 sequences), moderate (14 sequences), or weak binders
(15 sequences). We used the 10 strong binders as positive training examples
and the 29 moderate and weak binders as negative training examples. We
initially trained our model using these 39 examples, and then tested the
scoring model on the 10 novel peptides generated by Oren et al. Because
the data set was small, for the sequence generation phase of our approach
we retrained on the original training set of 39 plus the two strongest and two
weakest of the 10 novel sequences, for a total of 43 sequences: 12 positive
examples and 31 negative examples.
3.2. Construction of Antimicrobial Peptides Data Set. We down-
loaded all experimentally verified AMPs from the CAMP database as of
March 8, 2010, yielding a set of 1,187 peptide sequences. After removing all
Figure 1. An example of how we map peptide sequences to features. The substring DWP contributes to the count of the trigram feature $f_3$, which represents subsequences of classes Acidic, Aromatic, Cyclic.
sequences that contained the nonstandard amino acid letters B, X, and Z,
and then extracting representative sequences using the CD-HIT clustering
server [16] with a sequence identity parameter of 0.9, we were left with a data
set of 862 AMP sequences, with a mean length of 34 amino acid residues
and a median length of 30. It is difficult to create a negative training set
of experimentally verified non-AMPs, so we followed Thomas et al. [5] in
noting that AMPs are generally secreted from cells, and downloaded a set
of human protein sequences from the UniProt database that were between
twenty and fifty amino acids in length, not annotated as antimicrobial, and
not annotated as secreted. This gave us a set of 1,224 negative training
examples. We randomly split the data, putting 70% in a training set and
30% in a test set.
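
The filtering and splitting steps above can be sketched as follows. This is a minimal sketch under our own assumptions: the function names and the fixed random seed are hypothetical, the CD-HIT clustering step is not reproduced, and we do not know whether the original split was stratified by class.

```python
import random
from typing import List, Sequence, Tuple

NONSTANDARD = set("BXZ")  # ambiguity codes excluded from the data set


def drop_nonstandard(seqs: Sequence[str]) -> List[str]:
    """Keep only sequences that contain none of the letters B, X, or Z."""
    return [s for s in seqs if not (set(s) & NONSTANDARD)]


def train_test_split(items: Sequence, train_fraction: float = 0.7,
                     seed: int = 0) -> Tuple[list, list]:
    """Shuffle a copy of the items and cut it into train and test parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]
```

Applied to the 862 + 1,224 = 2,086 sequences described above, a 70%/30% cut of this kind leaves 626 items in the test part.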
3.3. Extracting Features and Training an SVM Classifier. To build
a feature space and classifier for our peptide sequences, we define a set
of 13 classes to represent the chemical properties of amino acid residues:
acidic, cyclic, aliphatic, aromatic, basic, buried, charged, hydrophobic, large,
medium, small, non-neutral, and polar. We define unigram, bigram, and
trigram features to be the ordered set of classes contained in a subsequence
of one, two, or three amino acids. For each peptide sequence, we count the
number of times each feature appears, as shown in Figure 1.
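
The feature counting described above can be sketched as follows. The residue-to-class table is a hypothetical, partial assignment chosen only to reproduce the DWP example of Figure 1; it is not the class table used in this work. We also assume that a residue may belong to several classes and that every combination of classes in a window counts as one feature occurrence.

```python
from collections import Counter
from itertools import product
from typing import Dict, Tuple

# Hypothetical, partial class table (an assumption for illustration only).
RESIDUE_CLASSES: Dict[str, Tuple[str, ...]] = {
    "D": ("acidic", "charged"),
    "K": ("basic", "charged"),
    "W": ("aromatic",),
    "P": ("cyclic",),
}


def ngram_class_counts(seq: str, max_n: int = 3) -> Counter:
    """Count unigram, bigram, and trigram features over residue classes.

    Each feature is an ordered tuple of class names, one per position
    in the window. Residues missing from RESIDUE_CLASSES raise KeyError.
    """
    counts: Counter = Counter()
    for n in range(1, max_n + 1):
        for start in range(len(seq) - n + 1):
            window = seq[start:start + n]
            options = [RESIDUE_CLASSES[aa] for aa in window]
            for combo in product(*options):
                counts[combo] += 1
    return counts


features = ngram_class_counts("DWP")
# features[("acidic", "aromatic", "cyclic")] is the trigram from Figure 1.
```

Storing the counts in a sparse `Counter` avoids materializing the full feature space, which holds 13 + 13^2 + 13^3 = 2,379 possible features for 13 classes.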
Given a set of positive and negative training examples, we compute vectors
of feature counts and train an SVM using the SVMLight package [17] V6.02.
[Figure 2 panels: (a) Feature Machine; (b) Scorer, showing state 0 and arcs labeled feature/weight: f1/-4.401e-05, f2/-0.00479, f3/0.00515, f4/-0.00492, f5/0.00492.]
Figure 2. Portions of the finite-state machines that generate new sequences. 2(a): A portion of the finite-state transducer that computes the list of features contained in a sequence, showing paths that can be taken from the state that represents a trigram history of CW. In one path the machine accepts the amino acid V as input, and emits the features $f_3$, $f_6$, and $f_8$, before proceeding to the state that indicates that the history is now WV. On the other path the machine accepts the amino acid R, and emits the features $f_5$, $f_7$, and $f_9$. The symbol $\epsilon$ represents the empty string. 2(b): A finite-state acceptor that assigns scores to features.
Training an SVM produces a linear classifier in the feature space, defined
by
$w^T \phi(x) + b$,
where $\phi(x)$ is the mapping of a sequence to the feature space and $w$ is a
weight vector that describes the decision boundary hyperplane in the feature
space. It has been shown that the distance from a point in the feature space
to the separating hyperplane can be mapped by a sigmoid function to an
estimate of the probability that the point will belong to the positive or
negative class [18]. Therefore, from our trained SVM we extract w, which
indicates the direction in the frequency feature space in which we hypothesize
that sequences are more likely to be positive examples.
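
For reference, a sigmoid mapping of this kind is commonly written in the following form (Platt scaling). Here $f(x)$ denotes the SVM decision value for a sequence $x$, and $A$ and $B$ are scalar parameters fit on labeled data. We give this standard form only as an illustration and do not claim it is the exact parametrization used in [18] or in this work.

$$P(y = 1 \mid x) = \frac{1}{1 + \exp(A f(x) + B)}$$

With $A < 0$, sequences that lie farther on the positive side of the hyperplane receive probabilities closer to 1.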
3.4. Building a WFST to Generate New Sequences. A weighted
finite-state transducer is an automaton in which each transition between
states is associated with an input symbol, an output symbol, and a weight.
Formally, they are defined as 8-tuples $(\Sigma, \Delta, Q, I, F, E, \lambda, \rho)$,
where $\Sigma$ is an input alphabet, $\Delta$ is an output alphabet, $Q$ is a
finite set of states, $I$ is the set of initial states, $F$ is the set of final
states, $E$ is the set of transitions between states, $\lambda$ is a weight
function for initial states, and $\rho$ is a weight function for final states.
The input and output alphabets are augmented with a symbol $\epsilon$ which
represents the empty string, allowing transitions to not input or output a
symbol. If no outputs are associated with the transitions then the machine is
referred to as a weighted finite-state acceptor (WFSA). Examples of WFSTs and
WFSAs can be seen in Figure 2. An important operation on WFSTs is composition,
denoted by $\circ$. In composition, the outputs of one WFST are fed to the
inputs of a second WFST
or WFSA. Efficient implementations of composition allow the construction
of complex models based on a set of simple machines [19].
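
To make the composition operation concrete, the following is a minimal sketch of composition for machines without empty-string transitions, with weights added along matched arcs. It is a simplification written for this illustration: the `WFST` class and `compose` function are our own names, and production toolkits additionally handle empty-string transitions, initial weights, and general semirings.

```python
from collections import deque
from typing import Dict, Hashable, List, Tuple

# An arc is (input symbol, output symbol, weight, next state).
Arc = Tuple[str, str, float, Hashable]


class WFST:
    def __init__(self, start: Hashable, finals: Dict[Hashable, float],
                 arcs: Dict[Hashable, List[Arc]]) -> None:
        self.start = start
        self.finals = finals  # final state -> final weight
        self.arcs = arcs      # state -> outgoing arcs


def compose(a: WFST, b: WFST) -> WFST:
    """Feed the outputs of `a` into the inputs of `b` (no epsilon arcs)."""
    start = (a.start, b.start)
    arcs: Dict[Hashable, List[Arc]] = {}
    finals: Dict[Hashable, float] = {}
    queue = deque([start])
    seen = {start}
    while queue:
        state = queue.popleft()
        qa, qb = state
        arcs[state] = []
        if qa in a.finals and qb in b.finals:
            finals[state] = a.finals[qa] + b.finals[qb]
        for sym_in, mid, w1, next_a in a.arcs.get(qa, []):
            for mid2, sym_out, w2, next_b in b.arcs.get(qb, []):
                if mid == mid2:  # output of a matches input of b
                    nxt = (next_a, next_b)
                    arcs[state].append((sym_in, sym_out, w1 + w2, nxt))
                    if nxt not in seen:
                        seen.add(nxt)
                        queue.append(nxt)
    return WFST(start, finals, arcs)


# Toy machines with made-up weights: residues -> class tokens -> scores.
to_class = WFST(0, {0: 0.0}, {0: [("D", "acidic", 0.0, 0),
                                  ("K", "basic", 0.0, 0)]})
scorer = WFST(0, {0: 0.0}, {0: [("acidic", "acidic", 0.5, 0),
                                ("basic", "basic", -0.2, 0)]})
combined = compose(to_class, scorer)
```

The composed machine reads residues directly and carries the scorer's weights, so the intermediate class tokens no longer need to be produced explicitly.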
We use WFSTs to generate novel peptide sequences that will score well ac-
cording to the weight vector learned by our classifier. To do so, we compose
together three finite-state machines to build a WFST capable of generating
our desired sequences. Our first machine, F, is an unweighted transducer
that maps from a sequence of amino acids to a sequence of tokens representing
features, as shown in Figure 2(a). Our second machine, S, is a WFSA
that provides a score for a given sequence of features. This machine is built
using the weight vector from the SVM classifier (Figure 2(b)). Each arc
accepts a single feature token $f_i$ with a weight equal to $\alpha w_i$, where
$w_i$ is the value of the classifier weight vector for that feature and
$\alpha$ is a parameter set when building the model. After applying a weight
pushing algorithm and normalizing, the weights within the machine are treated
as log probabilities; therefore, the parameter $\alpha$ can be used to vary the
peakedness of the probability distribution over generated sequences because
$\alpha w_i$ contributes to the log probability of a sequence. The third
machine, T, is a simple transducer
that accepts and outputs any sequence of length m; this machine constrains
the length of our generated sequences. We use a value for m of 12 for the
inorganic binding peptide problem, since 12 is a standard length for designed
inorganic binding peptides. For AMPs, we set m to 30, the median length
of the AMP positive training set.
We build our final machine by composing the three sub-machines:
$T \circ F \circ S$
We then determinize the paths through this transducer and normalize the
scores of each path. This produces a WFST that accepts amino acid sequences
of length m with a score given by summing up the individual weights
of every feature contained within the sequence times $\alpha$. We build our
finite-state machines using the open source OpenFST library, version 1.1 [20].
In addition to implementing the algorithms needed to produce the trans-
ducer as described above, OpenFST provides tools that can search for the
highest scoring sequences accepted by the machine, and can sample from
high-scoring sequences probabilistically, by treating the scores of each tran-
sition within the machine as a negative log probability. Random sampling
adds diversity to our results, which is desirable because the highest scoring
sequences generated by the model are often permutations of the same
motifs.
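
The effect of the peakedness parameter on sampling can be illustrated by brute force on a tiny alphabet. This sketch is not the OpenFST pipeline: it enumerates every sequence instead of building a transducer, it scores raw residue n-grams rather than class n-grams, and the weights and the symbol `alpha` for the peakedness parameter are assumptions made for the example.

```python
import math
import random
from itertools import product
from typing import Dict, List


def sequence_score(seq: str, weights: Dict[str, float], max_n: int = 3) -> float:
    """Sum the weights of every n-gram (n <= max_n) occurring in seq."""
    total = 0.0
    for n in range(1, max_n + 1):
        for i in range(len(seq) - n + 1):
            total += weights.get(seq[i:i + n], 0.0)
    return total


def sequence_distribution(alphabet: str, length: int,
                          weights: Dict[str, float],
                          alpha: float) -> Dict[str, float]:
    """Probability proportional to exp(alpha * score) over all sequences."""
    seqs = ["".join(p) for p in product(alphabet, repeat=length)]
    logits = [alpha * sequence_score(s, weights) for s in seqs]
    top = max(logits)  # subtract the maximum for numerical stability
    unnorm = [math.exp(v - top) for v in logits]
    z = sum(unnorm)
    return {s: u / z for s, u in zip(seqs, unnorm)}


def sample(dist: Dict[str, float], k: int, rng: random.Random) -> List[str]:
    """Draw k sequences according to the distribution."""
    seqs = list(dist)
    return rng.choices(seqs, weights=[dist[s] for s in seqs], k=k)


toy_weights = {"K": 1.0, "KR": 0.5, "D": -1.0}  # made-up feature weights
d1 = sequence_distribution("DKR", 3, toy_weights, alpha=1.0)
d2 = sequence_distribution("DKR", 3, toy_weights, alpha=2.0)
dneg = sequence_distribution("DKR", 3, toy_weights, alpha=-1.0)
```

Raising `alpha` concentrates probability on the best-scoring sequence and so reduces diversity, while a negative `alpha` inverts the ranking and favors the lowest-scoring sequences.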
3.5. Addition of a Language Model for Inorganic Binding Peptides.
Because of the small size of the inorganic binding peptide training set, we
wanted to ensure that our model did not overfit and could generalize to
produce novel chemically and biologically plausible sequences. To do so,
we build a machine L, which models the probabilities of peptide sequences
based on published sequences that are known to have some binding function.
This machine is a finite-state language model for amino acid sequences,
of the type commonly used in speech recognition and language processing
[15]. We create a training set for L by downloading all protein sequences
from the Gene Ontology database [21] that are annotated with the term
"binding". Our current language model training set was downloaded from
the 2010-01-10 seqdb release of the Gene Ontology database, which contains
187,269 sequences that are directly or indirectly associated with the term
"binding". By extracting uniquely named sequences from this release, we
produced a training set of 64,894 unique protein sequences that we used to
train our model. To build the language model we used a set of functions
built on the OpenFST toolkit. The language model was built using an n-
gram order of 4, so that the probability of an amino acid appearing in the
sequence is conditional on the previous three amino acids. We use Witten-
Bell smoothing to estimate the probabilities of sequences of amino acids that
do not appear in the training data. Finally, we rescale the language model
probabilities using an additional parameter $\beta$, by which we multiply each
transition weight in L. Much like the parameter $\alpha$ defined above, $\beta$
controls the peakedness of the probability distribution defined by the language
model; we use it to increase or decrease the effect of the language model on
the sequences generated by our system. In this case, we build our complete
machine by composing the four sub-machines:
$L \circ T \circ F \circ S$
This produces a weighted finite-state acceptor that accepts sequences of
12 amino acids with a score given by summing up the individual weights
of every feature contained within the sequence times $\alpha$. The probability of
generating a sequence is affected by both the score of the sequence in the
SVM classifier and the probability of the sequence according to the language
model.
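
A language model of the kind described in this section can be sketched in plain Python. This is a minimal sketch of an interpolated Witten-Bell estimator, not the OpenFST-based functions used in this work: the class name, the uniform base distribution over the 20 standard residues, and the treatment of sequence starts are our assumptions. The helper `scaled_cost` shows one reading of the rescaling step, in which a negative log probability is multiplied by a language model parameter here called `beta`.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


class WittenBellLM:
    """Interpolated Witten-Bell n-gram model over amino acid residues."""

    def __init__(self, order: int = 4) -> None:
        self.order = order
        # counts[history][next residue], for histories of length 0..order-1
        self.counts: Dict[str, Dict[str, int]] = defaultdict(
            lambda: defaultdict(int))

    def train(self, sequences: Iterable[str]) -> None:
        for seq in sequences:
            for i, aa in enumerate(seq):
                for k in range(min(self.order - 1, i) + 1):
                    self.counts[seq[i - k:i]][aa] += 1

    def prob(self, aa: str, history: str) -> float:
        keep = self.order - 1
        return self._prob(aa, history[-keep:] if keep > 0 else "")

    def _prob(self, aa: str, history: str) -> float:
        # Lower-order estimate: uniform once no history is left.
        if history == "":
            lower = 1.0 / len(AMINO_ACIDS)
        else:
            lower = self._prob(aa, history[1:])
        followers = self.counts.get(history)
        if not followers:
            return lower  # unseen history: back off entirely
        total = sum(followers.values())
        types = len(followers)  # distinct residues seen after history
        return (followers.get(aa, 0) + types * lower) / (total + types)


def scaled_cost(probability: float, beta: float) -> float:
    """Negative log probability multiplied by the scaling parameter."""
    return -beta * math.log(probability)
```

With an order of 4, `prob` conditions on at most the previous three residues, and residues never observed after a given history still receive nonzero probability through the lower-order estimates.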
3.6. Parametrization of the Model and Generation of New Se-
quences. Because of the different validation methods available for the prob-
lems of inorganic binding peptide and AMP design, we used different ap-
proaches to set parameters and generate sequences for testing. For inorganic
binding peptides, we had a very small training set and had to select a small
number of designed sequences for experimental testing; therefore, we first
chose the parameters for our model based on observations of the sequences
generated by a variety of parameter combinations. We then selected the
highest scoring candidate peptides from a model built with the best combi-
nation of parameters. The existence of third-party computational prediction
servers for AMPs, on the other hand, allowed us to generate and test large
batches of sequences simultaneously, albeit without the certainty of exper-
imental verification. Therefore, we generated large groups of sequences for
several models, as well as groups of control sequences for comparison.
3.6.1. Selection of Inorganic Binding Peptide Sequences. In addition to the
language model described in Section 3.5, we had two additional strategies for
helping our model to generalize even with the small training set of inorganic
binding peptides. First, we chose parameters which maximized the diversity
of sequences generated as well as their scores within the model, so that we
would be more likely to have a successful result among our selected set.
Additionally, we used a distance constraint to filter out sequences that were
too different from the known positive examples.
To choose parameters, we sampled sets of 2000 sequences for each value
of the model peakedness parameter in [1, 2, 3, 5, 6, 7, 8, 10, 11], and the
language model strength parameter in [0.75, 1, 1.5, 2, 3]. For each combi-
nation of parameters, we computed the average score for all 2000 sequences
in the SVM classifier, as well as the average score given by the QUARTZ2
matrix designed by Oren et al. [6]. Greater values of and produce se-
quences with higher scores in both scoring systems, as do smaller values of
. However, the diversity of the sequences created decreases in both cases.
Based on a subjective analysis of the scores and diversity of the sequences
created, we chose a value of 3 for the peakedness parameter and a value of 0.75 for the language model strength parameter; this set of
parameters produced consistently high scoring sequences while maintaining
population diversity in the generated set. POPDIV [22] is an algorithm for
computing the population diversity of a sample of peptide sequences. Our
final parameter choice yielded a set of sequences with high scores but sig-
nificant population diversity as measured by POPDIV; with larger values of
, population diversity drops quickly to insignificant levels.
To select our final candidates for testing, we added a final filtering step to
our process that ensured that our candidate sequences were similar to the
positive training examples. We first generated 100,000 random sequences.
For each sequence, we calculated the Euclidean distance between it and the
centroid of the set of positive training examples in the feature space. In
other words, we calculated the distance d for a candidate sequence x as:
$d(x) = \left\| \phi(x) - \frac{1}{|P|} \sum_{i=1}^{|P|} \phi(P_i) \right\|_2$,
where $P$ is the set of positive training examples and $\phi$ is the mapping from
a sequence to the n-gram feature space. By discarding sequences for which
$d(x) > \delta$ for some threshold $\delta$, we constrained our search to find sequences that lie
within a region centered on the average positive training example. Based on
comparisons of the scores and diversities of sequences at various distances,
we chose a value of 40 for the distance threshold $\delta$. We discarded all sequences with distances
greater than 40, and then chose the 50 remaining sequences with the highest
scores in our SVM classifier as candidates for experimental validation. To
further test the properties of our system, we also generated sequences with the
peakedness parameter set to -1. This inverts the weights and should produce peptide sequences
that have little to no inorganic binding affinity. To produce this set of
predicted weak binders, we chose the 20 lowest scoring sequences from a
sample of 100,000. We also included the lowest possible scoring sequence in
the model, found using a best path search of the WFST: AIRGIRGIRGIR.
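
The distance filter and final ranking described above can be sketched as follows. This is a minimal sketch under stated assumptions: the features here are raw residue n-gram counts rather than the class n-grams used in this work, and the function names, the `delta` threshold argument, and the scoring callback are hypothetical.

```python
import math
from collections import Counter
from typing import Callable, Dict, List, Sequence


def ngram_counts(seq: str, max_n: int = 3) -> Counter:
    """Sparse vector of n-gram counts (simplified: residues, not classes)."""
    counts: Counter = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(seq) - n + 1):
            counts[seq[i:i + n]] += 1
    return counts


def centroid(vectors: Sequence[Counter]) -> Dict[str, float]:
    """Component-wise mean of the sparse vectors."""
    total: Counter = Counter()
    for vec in vectors:
        total.update(vec)
    return {key: value / len(vectors) for key, value in total.items()}


def distance_to_centroid(vec: Counter, center: Dict[str, float]) -> float:
    """Euclidean distance between a sparse vector and the centroid."""
    keys = set(vec) | set(center)
    return math.sqrt(sum((vec.get(k, 0) - center.get(k, 0.0)) ** 2
                         for k in keys))


def select_candidates(candidates: Sequence[str], positives: Sequence[str],
                      score: Callable[[str], float],
                      delta: float, top_k: int) -> List[str]:
    """Drop candidates farther than delta, then keep the top_k by score."""
    center = centroid([ngram_counts(p) for p in positives])
    kept = [s for s in candidates
            if distance_to_centroid(ngram_counts(s), center) <= delta]
    return sorted(kept, key=score, reverse=True)[:top_k]
```

Filtering before ranking means that a sequence with a very high classifier score is still rejected if it lies far from the known positive examples in feature space.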
3.6.2. Creation of Sets of Antimicrobial Peptide Sequences. We created five
sets of novel AMP sequences for testing, each of which consisted of 2,000
novel peptide sequences of length 30. The first, RAND, was a control set,
consisting of amino acid residues chosen randomly at each position. As a sec-
ond control, we also created a set AADIST, which was generated by setting
each amino acid independently based on the distribution of amino acids in
the CAMP database. We then sampled from our WFST to build three sets
WFST1, WFST2, and WFSTNEG, with the peakedness parameter set to
1, 2, and -1, respectively. The group WFSTNEG exists to demonstrate the
ability of the method to solve the inverse design problem: creating sequences
which are unlikely to be AMPs. For each of our WFST generated groups,
we verified that no sequences shared more than 0.9 sequence identity using
the CD-HIT clustering web service [16].
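
The two control sets can be generated with a few lines of Python. This is a minimal sketch: the function names are ours, the reference sequences stand in for the CAMP database, and the WFST-sampled sets are not reproduced here.

```python
import random
from collections import Counter
from typing import List, Sequence

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def rand_set(n: int, length: int, rng: random.Random) -> List[str]:
    """RAND-style control: each position drawn uniformly at random."""
    return ["".join(rng.choice(AMINO_ACIDS) for _ in range(length))
            for _ in range(n)]


def aadist_set(n: int, length: int, reference: Sequence[str],
               rng: random.Random) -> List[str]:
    """AADIST-style control: positions drawn independently from the
    amino acid frequencies observed in the reference sequences."""
    counts = Counter(aa for seq in reference for aa in seq)
    letters = sorted(counts)
    weights = [counts[aa] for aa in letters]
    return ["".join(rng.choices(letters, weights=weights, k=length))
            for _ in range(n)]
```

Both controls ignore the order of residues, so any advantage of the WFST-generated sets over them reflects sequence patterns beyond overall composition.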
3.7. Experimental Validation of Inorganic Binding Peptides. From
our final set of candidate inorganic binding peptides, our collaborators in
the GEMSEC group at the University of Washington have selected several
peptides for synthesis and experimental validation. Their choices are based
on factors such as difficulty of synthesis and expected yield, properties of
sequences that could not be learned computationally from the data available.
3.8. Third-party Prediction Servers for AMP Sequences. To eval-
uate the novel AMP sequences produced by our system, we used the pre-
diction server provided by the CAMP database website [5]. Although it is
not a substitute for experimental validation, using a computational predic-
tion technique allows rapid testing of large sets of peptides, and is therefore
useful in validating our approach. The CAMP server classifies peptide se-
quences as AMPs using three methods: support vector machines, random
forests (RF), and discriminant analysis (DA). Thomas et al. showed that
these classifiers perform well on a test data set, with overall accuracies of
91.5%, 93.2%, and 87.5%, respectively. The CAMP predictors use a fea-
ture set composed of a variety of features including amino acid composition,
average hydrophobicity and hydrophilicity of the peptide, transition and
composition of groups based on reduced amino acid alphabets, and di- and
tri-peptide composition based on hydrophobicity. Therefore, there is partial
but not complete overlap with the feature set used in our SVM classifier
and WFSTs. Even though this overlap may make it easier for our method
to produce sequences that CAMP classifies as AMPs, we believe that the
construction of sequences that score highly in a third-party classification
system is a valuable demonstration of our approach.
4. RESULTS
4.1. Performance of SVM Classifiers. We began evaluating our system
by testing the SVM classifiers we trained for each peptide design problem
on a held out test set. Because we used the feature weights from the SVM
to build our WFSTs, our system depends on the SVM's ability to correctly
classify unseen sequences as positive or negative examples.
[Figure 3 plot: x-axis, Distance from hyperplane; y-axis, Dip position shift (nm); legend, Predicted Strong Binders and Predicted Weak Binders.]
Figure 3. Experimental binding scores of 10 test sequences from Oren et al. [6]. The sequences are arranged by their distance from the hyperplane score in the SVM classifier. The Y axis shows the Dip position shift after six minutes as measured by Oren et al., a measurement of binding ability. Sequences that Oren et al. predicted to be strong binders are light lines; predicted weak binders are dark.
4.1.1. Inorganic Binding Peptides. When trained on the 39 training
examples from Oren et al. [6], our classifier successfully distinguishes
strong from weak binders when evaluated against the 10 novel peptides
reported in that paper (Figure 3). In addition, the distance from the
hyperplane explains about half of the variance in the observed binding
ability (R^2 = 0.52, p = 0.02). Although this is a small data set, this
performance is encouraging.
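As a rough consistency check on the reported statistics, the sketch below
converts a coefficient of determination and a sample size into a t statistic
and a two-sided p-value for the slope of a simple linear regression. It
assumes that the reported p-value comes from such a test, which the text does
not state; the function names are illustrative only.

```python
import math


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)


def two_sided_p(t, df, steps=2000):
    """Two-sided p-value using Simpson's rule on the t density over [0, |t|].

    steps must be even for Simpson's rule.
    """
    t = abs(t)
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_mass = 2 * (h / 3) * total  # P(-t < T < t), by symmetry
    return 1 - central_mass


def slope_significance(r_squared, n):
    """t statistic and two-sided p-value for a regression slope (assumed test)."""
    df = n - 2
    t = math.sqrt(r_squared * df / (1 - r_squared))
    return t, two_sided_p(t, df)


# Values reported in the text: R^2 = 0.52 on 10 test peptides.
t_stat, p_value = slope_significance(r_squared=0.52, n=10)
print(f"t = {t_stat:.2f} with {10 - 2} degrees of freedom, p = {p_value:.3f}")
```

Under this assumption the p-value falls between 0.01 and 0.02, in line with
the rounded value reported above.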
[Plot omitted. Y axis: Number of Positive Predictions (0 to 2000).
X axis: RAND, AADIST, WFST1, WFST2, WFSTNEG. Legend (Classifier):
SVM, RF, DA.]
Figure 4. Number of sequences in a sample of 2000 predicted
to be AMPs for different sequence generation methods. A value
of 2000 indicates that all of the generated sequences were
predicted to be AMPs; a value of 0 indicates that none were predicted
to be AMPs. RAND: Sequences generated completely randomly.
AADIST: Sequences generated according to the distribution of
amino acids in the CAMP database. WFST1: Sequences generated
using a WFST with the peakedness parameter set to 1. WFST2:
Sequences generated using a WFST with the parameter set to 2.
WFSTNEG: Sequences generated using a WFST with the parameter set
to -1. Results are shown for all three prediction methods available
at the CAMP database server: SVM: support vector machine. RF:
Random Forest. DA: Discriminant Analysis.
4.1.2. Antimicrobial Peptides. We evaluated the SVM classifier from the
AMP model against our held-out test set of 259 positive and 367 negative
examples. Our classifier had an area under the ROC curve of 0.931.
This indicates that our feature set of n-gram counts of amino acid classes
provides sufficient information to train an SVM classifier with strong
performance in the AMP problem domain.
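For readers unfamiliar with the metric, the area under the ROC curve equals
the probability that a randomly chosen positive example receives a higher
classifier score than a randomly chosen negative example. The sketch below
computes it directly from that definition. The scores and labels are toy
values invented for illustration, not data from our experiments.

```python
def roc_auc(scores, labels):
    """AUC as P(score of random positive > score of random negative).

    Ties between a positive and a negative score count as one half.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical signed distances from the hyperplane (1 = AMP, 0 = non-AMP).
toy_scores = [1.2, 0.4, -0.3, 0.1, -0.9, -1.5]
toy_labels = [1, 1, 1, 0, 0, 0]
auc = roc_auc(toy_scores, toy_labels)
print(f"AUC = {auc:.3f}")
```

The pairwise loop is quadratic in the number of examples, which is acceptable
for a test set of a few hundred sequences; a rank-based formulation would be
preferable for much larger sets.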
4.2. Validation of Generated Sequences. Our inorganic binding se-
quences are being tested experimentally by the GEMSEC group at the Uni-
versity of Washington at the time of this writing. However, we have been
able to test our AMP sequences using computational prediction servers.
Feature                          Weight
Tryptophan                        0.027
Buried-Buried-Cyclic              0.022
Buried-Cyclic-Aromatic            0.019
Medium-Aromatic-Aliphatic         0.019
Aliphatic-Medium-Aromatic         0.018
Hydrophobic-Polar-Medium         -0.018
Polar-Small-Buried               -0.020
Hydrophobic-Medium-Aliphatic     -0.021
Table 2. Highest and lowest scoring features in the inorganic
binding model
4.2.1. Antimicrobial Peptides. We used the CAMP prediction servers to de-
termine the number of sequences predicted to be AMPs in our control and
test groups. The results are shown in Figure 4. Averaged across the three
prediction measures, only 3.9% of the random control group were predicted
to be AMPs, while 59.6% of the AADIST control set had positive predic-
tions. In the WFST1 group, an average of 84.9% were predicted to be
AMPs. In the group created with a more peaked probability distribution
over sequences, WFST2, an average of 99.9% of the generated sequences
were predicted to be AMPs. Finally, in the WFSTNEG group, in which the
sign of the weights was reversed to reward non-AMP features, only 0.48%
of the generated sequences were predicted to be AMPs.
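The trend across WFST1, WFST2, and WFSTNEG can be illustrated with a toy
distribution. The sketch below assumes that a peakedness parameter scales the
weights inside an exponential before normalization; this is an illustrative
assumption and not necessarily the construction used in our WFSTs. The item
names and weights are hypothetical.

```python
import math


def sharpen(weights, peakedness):
    """Distribution with p_i proportional to exp(peakedness * w_i) (assumed form)."""
    scaled = {k: peakedness * w for k, w in weights.items()}
    m = max(scaled.values())  # subtract the max for numerical stability
    exps = {k: math.exp(s - m) for k, s in scaled.items()}
    z = sum(exps.values())
    return {k: e / z for k, e in exps.items()}


def entropy_bits(dist):
    """Shannon entropy in bits; lower values mean a more peaked distribution."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


# Hypothetical weights for four candidate items.
toy_weights = {"A": 1.0, "B": 0.5, "C": -0.5, "D": -1.0}

for k in (1, 2, -1):
    dist = sharpen(toy_weights, k)
    favored = max(dist, key=dist.get)
    print(f"peakedness {k:+d}: most likely item {favored}, "
          f"entropy {entropy_bits(dist):.2f} bits")
```

In this toy setting, raising the parameter from 1 to 2 concentrates
probability on the highest-weighted item and lowers the entropy, while a
parameter of -1 makes the lowest-weighted item the most likely one.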
4.3. Examining Highly Weighted Features in the Models. To better
understand the predictions made by our model, we extracted the features
with the highest and lowest weights for both peptide design domains.
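Extracting these features amounts to sorting the learned weight vector. A
minimal sketch follows, assuming the model weights are available as a mapping
from feature name to weight; the function name is hypothetical. For
illustration the mapping holds only the eight values listed in Table 2,
whereas a full model would contain many more features.

```python
def extreme_features(weights, n_top=5, n_bottom=3):
    """Return the n_top highest and n_bottom lowest weighted features."""
    ranked = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:n_top], ranked[-n_bottom:]


# Weights copied from Table 2 (inorganic binding model).
table2_weights = {
    "Tryptophan": 0.027,
    "Buried-Buried-Cyclic": 0.022,
    "Buried-Cyclic-Aromatic": 0.019,
    "Medium-Aromatic-Aliphatic": 0.019,
    "Aliphatic-Medium-Aromatic": 0.018,
    "Hydrophobic-Polar-Medium": -0.018,
    "Polar-Small-Buried": -0.020,
    "Hydrophobic-Medium-Aliphatic": -0.021,
}

top, bottom = extreme_features(table2_weights)
for name, weight in top + bottom:
    print(f"{name:<30} {weight:+.3f}")
```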
4.3.1. Inorganic Binding Peptides. Table 2 shows the five highest and three
lowest weighted features in the inorganic binding peptide model. The impor-
tance of tryptophan in our model agrees with the results of Oren et al. [6].
An analysis of the trigrams with strong weights in the model may help us to
better understand the chemical structures necessary for inorganic binding.
Feature                          Weight
Cysteine                          0.428
Isoleucine                        0.279
Lysine                            0.246
Aromatic-Buried-Small             0.169
Medium-Medium-Aromatic            0.159
Threonine                        -0.323
Leucine                          -0.356
Serine                           -0.380
Table 3. Highest and lowest scoring features in the AMP model
4.3.2. Antimicrobial Peptides. The five highest and three lowest weighted
features for the AMP model are shown in Table 3. As with the inor-
ganic binding peptides, our model identified unigram features of amino acid
residues that are overrepresented in AMP sequences; for example, the im-
portance of cysteine residues in AMPs has been widely studied [23, 24].
Analysis of the trigram and bigram features identified by the model may
yield insights into AMP structure and design.
5. DISCUSSION
5.1. Conclusions. We have shown that by using the n-gram features of
chemical classes of amino acid residues and a trained SVM classifier, we can
produce WFSTs that are capable of generating novel sequences which share
the same features as the training set. We have applied our solution to two
problems in peptide design. The SVM classifier component of our system
performs well on known data sets of positive and negative examples in both
problem domains. We have produced a set of inorganic binding peptide se-
quences that we are testing experimentally. We have also generated sets of
novel AMP sequences, and a third-party classification server predicts that
a large proportion of these novel sequences will have antimicrobial proper-
ties. By varying the parameters used to construct our machines, we can
exchange diversity of the generated sequences for a higher likelihood of gen-
erating novel peptides in a desired class. We believe that this framework is
a promising approach for novel peptide design. By applying our system to
both inorganic binding and AMP sequence design, we have shown that it
may be generalizable to many problems in peptide design.
5.2. Future Work. Although we believe that the results from third-party
computational prediction servers for AMPs show that this is a promising
approach, any attempt to computationally design novel AMPs must eventually
be validated by synthesizing and testing actual peptides. We plan to do
so in future work.
We believe that this framework is generally applicable to many problems
in peptide design. We originally designed the system for the inorganic bind-
ing peptide design problem, but found that we could easily adapt it to the
design of AMPs with minimal changes. Therefore, we are optimistic that it
could be easily generalized to other applications. One such problem is the
design of peptides capable of binding to larger proteins, such as G
protein-coupled receptors, an application with many implications for drug
design.
We would also like to study how different feature mappings for sequences
affect the performance of the system. Our feature mapping
is closely related to string kernel methods such as the spectrum kernel [13]
and the mismatch kernel [25], and could easily be extended to incorporate
features of other kernels, such as the gappy and wildcard kernels [26]. In
fact, because our approach relates string kernels and weighted finite-state
transducers, it should be possible to incorporate any rational kernel [27],
a class of kernels defined in terms of WFSTs that includes most string
kernels currently in use in computational biology.
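To make the relationship to the spectrum kernel concrete, the sketch below
maps a peptide to counts of n-grams over a reduced alphabet and computes a
kernel value as the inner product of two such count vectors. The residue
classes are a hypothetical single-class-per-residue grouping chosen for
illustration and are not the chemical classes used in our models. Summing
over n = 1 to 3 gives a sum of spectrum kernels rather than a single
fixed-length spectrum kernel.

```python
from collections import Counter

# Hypothetical reduced alphabet: each residue maps to exactly one class.
RESIDUE_CLASS = {
    **dict.fromkeys("AVLIM", "h"),    # aliphatic / hydrophobic
    **dict.fromkeys("FWY", "a"),      # aromatic
    **dict.fromkeys("STNQCGP", "p"),  # polar or small
    **dict.fromkeys("KRH", "+"),      # positively charged
    **dict.fromkeys("DE", "-"),       # negatively charged
}


def class_ngrams(peptide, max_n=3):
    """Count all class n-grams of length 1..max_n in a peptide sequence."""
    classes = [RESIDUE_CLASS[r] for r in peptide]
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(classes) - n + 1):
            counts[tuple(classes[i:i + n])] += 1
    return counts


def ngram_kernel(x, y, max_n=3):
    """Inner product of the two n-gram count vectors."""
    fx, fy = class_ngrams(x, max_n), class_ngrams(y, max_n)
    # Counter returns 0 for n-grams that do not occur in y.
    return sum(count * fy[gram] for gram, count in fx.items())


# Toy peptides, invented for illustration.
print(ngram_kernel("KWLK", "AKW"))
```

Because each feature is a substring count, a linear SVM trained on these
vectors is equivalent to a kernel SVM using this inner product, which is what
allows the learned weights to be attached to individual n-grams.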
6. ACKNOWLEDGEMENTS
Kemal Sonmez and Brian Roark provided the initial ideas for this project,
and contributed invaluable advice during the course of its implementation.
Kemal Sonmez provided the initial implementation of the peptide
sequence to feature mapping code. Brian Roark provided tools for building
the UniProt language model. I am thankful for the willingness of Mehmet
Sarikaya, Candan Tamerler, and the GEMSEC group at the University of
Washington to test the novel inorganic binding sequences. I also thank the
NSF for supporting part of this work through Grant #IIS-0811745.
References
[1] C. Tamerler and M. Sarikaya, "Molecular biomimetics: nanotechnology and bionanotechnology using genetically engineered peptides," Philos Transact A Math Phys Eng Sci, vol. 367, pp. 1705-26, May 2009.
[2] R. H. Hoess, "Protein design and phage display," Chem Rev, vol. 101, pp. 3205-18, Oct 2001.
[3] H. Jenssen, P. Hamill, and R. E. W. Hancock, "Peptide antimicrobial agents," Clin Microbiol Rev, vol. 19, pp. 491-511, Jul 2006.
[4] G. Wang, X. Li, and Z. Wang, "APD2: the updated antimicrobial peptide database and its application in peptide design," Nucleic Acids Research, vol. 37, no. Database issue, p. D933, 2009.
[5] S. Thomas, S. Karnik, R. S. Barai, V. K. Jayaraman, and S. Idicula-Thomas, "CAMP: a useful resource for research on antimicrobial peptides," Nucleic Acids Research, vol. 38, pp. D774-80, Jan 2010.
[6] E. E. Oren, C. Tamerler, D. Sahin, M. Hnilova, U. O. S. Seker, M. Sarikaya, and R. Samudrala, "A novel knowledge-based approach to design inorganic-binding peptides," Bioinformatics, vol. 23, pp. 2816-22, Nov 2007.
[7] A. Cherkasov, K. Hilpert, H. Jenssen, C. Fjell, M. Waldbrook, S. Mullaly, R. Volkmer, and R. Hancock, "Use of artificial intelligence in the design of small peptide antibiotics effective against a broad spectrum of highly antibiotic-resistant superbugs," ACS Chem Biol, vol. 4, no. 1, pp. 65-74, 2008.
[8] H. Jenssen, C. D. Fjell, A. Cherkasov, and R. E. W. Hancock, "QSAR modeling and computer-aided design of antimicrobial peptides," J Pept Sci, vol. 14, pp. 110-4, Jan 2008.
[9] C. Loose, K. Jensen, I. Rigoutsos, and G. Stephanopoulos, "A linguistic model for the rational design of antimicrobial peptides," Nature, vol. 443, pp. 867-9, Oct 2006.
[10] M. Socolich, S. Lockless, W. Russ, and H. Lee, "Evolutionary information for specifying a protein fold," Nature, vol. 437, pp. 512-8, Jan 2005.
[11] J. Thomas, N. Ramakrishnan, and C. Bailey-Kellogg, "Protein design by sampling an undirected graphical model of residue constraints," IEEE/ACM Trans Comput Biol Bioinform, vol. 6, pp. 506-16, Jan 2009.
[12] D. Anastassiou, "Frequency-domain analysis of biomolecular sequences," Bioinformatics, vol. 16, pp. 1073-81, Dec 2000.
[13] C. Leslie, E. Eskin, and W. S. Noble, "The spectrum kernel: a string kernel for SVM protein classification," Pac Symp Biocomput, pp. 564-75, Jan 2002.
[14] B. Roark and R. Sproat, Computational Approaches to Morphology and Syntax. Oxford Surveys in Syntax & Morphology, Oxford: Oxford University Press, 2007.
[15] M. Mohri, F. Pereira, and M. Riley, "Weighted finite-state transducers in speech recognition," Comput Speech Lang, vol. 16, pp. 69-88, Jan 2002.
[16] Y. Huang, B. Niu, Y. Gao, L. Fu, and W. Li, "CD-HIT suite: a web server for clustering and comparing biological sequences," Bioinformatics, vol. 26, pp. 680-2, Mar 2010.
[17] T. Joachims, "Making large-scale SVM learning practical," in Advances in Kernel Methods - Support Vector Learning (B. Schölkopf, C. Burges, and A. Smola, eds.), ch. 11, Cambridge, MA: MIT Press, 1999.
[18] J. C. Platt, "Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods," in Advances in Large Margin Classifiers, pp. 61-74, MIT Press, 1999.
[19] F. C. N. Pereira and M. Riley, Finite-State Devices for Natural Language Processing. Cambridge, Massachusetts: MIT Press, 1997.
[20] C. Allauzen, M. Riley, J. Schalkwyk, W. Skut, and M. Mohri, "OpenFst: A general and efficient weighted finite-state transducer library," in Proceedings of the Ninth International Conference on Implementation and Application of Automata, (CIAA 2007), vol. 4783 of Lecture Notes in Computer Science, pp. 11-23, Springer, 2007. http://www.openfst.org.
[21] M. Ashburner, C. A. Ball, J. A. Blake, D. Botstein, H. Butler, J. M. Cherry, A. P. Davis, K. Dolinski, S. S. Dwight, J. T. Eppig, M. A. Harris, D. P. Hill, L. Issel-Tarver, A. Kasarskis, S. Lewis, J. C. Matese, J. E. Richardson, M. Ringwald, G. M. Rubin, and G. Sherlock, "Gene ontology: tool for the unification of biology. The Gene Ontology Consortium," Nat Genet, vol. 25, pp. 25-9, May 2000.
[22] L. Makowski and A. Soares, "Estimating the diversity of peptide populations from limited sequence data," Bioinformatics, vol. 19, pp. 483-9, Mar 2003.
[23] J. Dimarcq, P. Bulet, C. Hetru, and J. Hoffmann, "Cysteine-rich antimicrobial peptides in invertebrates," Peptide Science, vol. 47, no. 6, pp. 465-477, 1999.
[24] T. Ganz, "Defensins: antimicrobial peptides of innate immunity," Nat Rev Immunol, vol. 3, pp. 710-20, Sep 2003.
[25] C. Leslie, E. Eskin, J. Weston, and W. Noble, "Mismatch string kernels for SVM protein classification," Advances in Neural Information Processing Systems, pp. 1441-1448, 2003.
[26] C. Leslie and R. Kuang, "Fast string kernels using inexact matching for protein sequences," The Journal of Machine Learning Research, vol. 5, Dec 2004.
[27] C. Cortes, P. Haffner, and M. Mohri, "Rational kernels: Theory and algorithms," The Journal of Machine Learning Research, vol. 5, pp. 1035-1062, Aug 2004.